To count edges from a PGR file collected from BGP paths we need to determine which observations are positive and negative evidence for the existence of an edge. We apply an assumption that many BGP paths are the shortest path between AS pairs. Of course, this assumption does not hold in general. Despite this, our inference algorithm can account for this error in the false positive rate of the inferred parameters. We provide an algorithm for counting how often a would be edge would create a shortened path and we consider that evidence that that edge does not exist, and record a negative observation.
The edge counting algorithm is computationally intensive. As such, it is recommended that for large datasets (anything more than a few time periods and every route collector will be quite large) the counting algorithm be executed on a compute cluster with at least 20 cpu cores. Any number of cores can be used on a multinode system if the nodes can communicate with infiniband and you can launch a job with slurm or some such job allocation technology. With few cores, expect this computation to take on the order of a day.
The counting algorithm program can be built from the soure code. A possibly working binary is included as well, but is unlikely to function due to differing UPC++ build environments. It is highly recommended to build the counting algorithm program yourself from the source. Once built, the program can be run on the PGR binary file as follows on a single node with 20 UPC++ ranks:
upcxx-run -N 1 -n 20 ./count.out --input_file 2022_01_01_pgrs.bin --output_directory 2022_01_01_classes/ --max_periods 1
This will output 20 separate class files. Each class file is a csv file containing the number of positive and negative observations of each route collector overall time periods. In order to run the EM algorithm to infer the edge probabilities and other model parameters the files must be combined with the Merge Class Counts program.
python merge_class_counts.py --input_dir 2022_01_01_classes --output_file 2022_01_01_em_input.bin --num_nodes N
N here is the number of unique ASes seen in the PGR files. You can calculate this by iterating throught the pickled PGR files from step 1.
The counting algorithm requires UPC++ in order to build and launch. The installation of UPC++ is available for different systems at UPC++.
Making the counting executable will also require an upto date boost library implementation. You may need to point the compilation commands in the makefile to your boost installation if it is not in the default location.