Optimizing PraxisMapper 1: Reading and Parsing .PBF files

What Is This For?

PraxisMapper used OsmSharp to read data from OpenStreetMap extract files with the .osm.pbf extension. This was very fast but used a huge amount of RAM when generating complete entries: a 170MB PBF file required nearly 8GB of RAM to run through processing in one shot. It read through the file in 2 passes: the first to find which elements matched the criteria given, and the second to fill in data for all the elements that depended on other elements (a Relation can have multiple Ways, and a Way is a set of at least 2 Nodes). This is theoretically the fastest way to handle this in the general case. In practice, it puts a surprisingly low cap on which files can be processed in a single run, determined by how much RAM the machine has.

After struggling to build a complete set of a continent's data from the smallest extracts available, and still hitting issues in some areas, I decided my best course of action was to make my own derivative of the FeatureInterpreter used by OsmSharp. I shortly discovered that I also needed my own PbfReader to fully control the logic for these optimizations. Despite the name, my PbfReader class isn't derived from the one in OsmSharp, but my FeatureInterpreter is based on the library's DefaultFeatureInterpreter and maintains the original MIT license from OsmSharp as a derivative work. I also had to import the protobuf classes for the OSM PBF file format to make all of my changes.

Before starting these changes, my 170MB benchmark file took over 45 minutes to run at its slowest, and this time includes the operations to save each entry as a JSON element to a text file. I did this work several months ago, so my notes on precisely how big each improvement was have been lost; the figures presented here are from memory.

Improvements

Remove Unneeded Work

OsmSharp, as a generic library for reading and writing data, uses several parts of the OSM PBF spec. My first change was to mark the parts that aren't needed by a project that only reads data from extracts as ProtoIgnore, so the generated objects don't attempt to populate those fields at all. Most of these ignored fields are entries that don't show up in extracts at all, like user names or history elements. The fields need to remain declared for the files to be read correctly, but the parser knows to skip them and move on to the next field. This provided a small boost to read times.
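To illustrate the idea (in Python rather than the C# the project uses), here's a minimal sketch of skipping fields at the protobuf wire level: fields that aren't wanted are walked past without being decoded into objects. The field numbers and message layout below are made up for the example, not the actual OSM PBF schema.

```python
def read_varint(buf, pos):
    """Decode a protobuf varint starting at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

def scan_fields(buf, wanted):
    """Walk a protobuf message, decoding only the field numbers in `wanted`.
    Everything else is skipped in place, without allocating anything."""
    pos, found = 0, {}
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field, wire = key >> 3, key & 0x7
        if wire == 0:            # varint field
            val, pos = read_varint(buf, pos)
            if field in wanted:
                found[field] = val
        elif wire == 2:          # length-delimited (strings, sub-messages)
            length, pos = read_varint(buf, pos)
            if field in wanted:
                found[field] = buf[pos:pos + length]
            pos += length        # skip past the payload either way
        else:
            raise ValueError("wire type not handled in this sketch")
    return found
```

The real change was declarative (attributes on the generated classes) rather than a hand-written scanner, but the effect is the same: unwanted fields cost a skip instead of an allocation.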

Indexing The File

I kept the idea of doing 2 passes on the file, but changed what my passes do. The first pass is an 'index' pass, which reads the header of each block to find out which elements are stored within. I save the location and size of each block to an index, and save the IDs of the elements in each block to a Dictionary, so I can find which block holds a piece of data when something references it, without keeping everything in RAM all the time. The second pass then works from the biggest elements (Relations) down to the smallest (Nodes), checking which ones meet the current requirements. It takes about 5 seconds to create an index for the benchmark file. This dramatically cut down on RAM use, since the app now reads data from disk on demand and drops any unneeded data after processing a block. It removed a few minutes of runtime as well, though not as many as I initially expected.

Indexing helped RAM use dramatically, but wasn't as big a speed improvement as I expected. As it turns out, keeping every element as an entry in a Dictionary cost a significant amount of time simply because of how many entries were being stored. The first fix was to store only the first and last ID in each block instead of every ID, which meant switching to a List of Tuples, since I now had to check whether a requested ID fell within a block's range. This alone wasn't faster: iterating a list from start to end still loses to a Dictionary lookup in most cases, even against a Dictionary 8,000 times larger. So I changed my search logic to treat the list as an array and search it like a B-Tree. Now, searching through a list of 1,000,000 blocks takes 20 calls or fewer, which uses far less CPU in total, beats the 8,000x larger Dictionary's lookup time, and obviously uses much less RAM. This was a significant improvement in runtime, cutting off several minutes and reducing CPU usage by about half thanks to less time spent iterating over lists.
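As a sketch of that search (in Python, with hypothetical IDs and block offsets), the sorted list of (first ID, last ID) ranges can be binary-searched; log2 of 1,000,000 is just under 20, which is where the "20 calls or fewer" figure comes from:

```python
def find_block(index, element_id):
    """Binary-search a sorted list of (first_id, last_id, offset) tuples
    for the block whose ID range contains element_id.
    Returns the block's file offset, or None if no block holds that ID."""
    lo, hi = 0, len(index) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        first, last, offset = index[mid]
        if element_id < first:
            hi = mid - 1          # target is in an earlier block
        elif element_id > last:
            lo = mid + 1          # target is in a later block
        else:
            return offset         # ID falls inside this block's range
    return None                   # ID not present in any block
```

The tuple layout and offset values here are illustrative; the point is that a range check per probe replaces a per-element Dictionary entry.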

Avoid Recursion

The DefaultFeatureInterpreter class is highly recursive: each Way in a Relation results in a call to FindRings(), which attempts to attach the Way to an existing set of Ways that will form a closed shape, or to create a new set for other Ways to connect to. The issue is that some Relations are simply gigantic (11,000 members is the smallest size where I saw this error occur occasionally, and much larger Relations exist), and they hit the stack limit while processing. That limit is effectively fixed in .NET, so the only way to let the most detailed Relations process was to replace the recursive logic with a non-recursive function. This was a bugfix more than an optimization: the code does effectively the same processing in a typical loop instead of a recursive call, and doesn't seem to improve performance on most elements.
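Here's a simplified, non-recursive sketch of ring assembly in Python. It is not the FindRings() implementation, just the shape of the idea: a plain loop attaches each Way to an open ring or starts a new one, so there is no call stack to overflow no matter how many members a Relation has. This sketch assumes Ways arrive in a connectable order, which the real code cannot assume.

```python
def assemble_rings(ways):
    """Stitch ways (each a list of node IDs) into closed rings iteratively.
    A ring is closed when its first and last node IDs match."""
    open_rings, closed = [], []
    for way in ways:
        for ring in open_rings:
            if ring[-1] == way[0]:
                ring.extend(way[1:])              # way continues this ring
            elif ring[-1] == way[-1]:
                ring.extend(reversed(way[:-1]))   # way fits reversed
            else:
                continue                          # try the next open ring
            if ring[0] == ring[-1]:               # ring just closed
                open_rings.remove(ring)
                closed.append(ring)
            break
        else:
            open_rings.append(list(way))          # no match: start a new ring
    return closed
```

Depth of nesting in the input no longer matters: memory use is bounded by the rings themselves, not by a chain of stack frames.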

Smart Caching

OsmSharp keeps everything loaded in RAM for a search while processing; PraxisMapper was reading everything from blocks on disk on demand. To remove some of that disk access, I hold a requested block in RAM for the currently processing block and the next block to process. If a block wasn't referenced while processing the previous block, it's dropped from RAM until the next time it's requested. This logic is quite effective at reducing disk access, since related elements tend to have similar OSM IDs (if you make a Way with 8 Nodes, all 8 of those Nodes are likely to be in order, and likely to be in the same block). The cached result can also be passed into the B-Tree search logic as a 'hint' to skip multiple reads when odds are high that we already know the answer. Testing showed that most blocks loaded to look up referenced data were used only once, while a small number were referenced repeatedly by other elements.

Multithreading

OsmSharp runs in one thread. I wrote my parser to split off a Task for each element in a block. This doesn't make an individual element faster, but multiple elements can be processed at once; in the ideal case, each block takes only as long as its largest element takes to process. That ideal rarely occurs, but this is still the second biggest single improvement made to the parsing process, and combining it with the B-Tree logic yields a greater improvement than either alone. The only locking needed to keep this running smoothly is on the FileStream for the .osm.pbf file, so loading blocks from disk is a blocking action, but once a block is in RAM any number of threads can read its data without interfering with each other.
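In Python terms (a thread pool standing in for .NET Tasks), the structure looks roughly like this; load_block and process_element are stand-ins for the real disk read and geometry-building work:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

file_lock = threading.Lock()   # only the file read needs serializing

def load_block(block_id):
    """Stand-in for the locked FileStream read: only one thread may
    seek/read the .osm.pbf file at a time."""
    with file_lock:
        return [block_id * 10 + i for i in range(3)]  # fake elements

def process_element(element):
    """Stand-in for per-element processing; safe to run concurrently
    once the block's data is in RAM."""
    return element * element

def process_block(block_id):
    elements = load_block(block_id)        # blocking, serialized
    with ThreadPoolExecutor() as pool:     # one task per element
        return list(pool.map(process_element, elements))
```

The lock covers only the disk read, so the expensive per-element work runs fully in parallel.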

FileStream for Results

Earlier logic used the single-line File.WriteLine() API call to save each entry to the output file. This results in an open/close and lock/unlock cycle per entry, which adds up over time. Switching to a FileStream object that stays open for the duration of the operation removes a measurable amount of overhead per block.

Treat Nodes Differently

When running this process, Nodes are so small that the overhead of multithreading actually hurts performance. When a block of Nodes is being processed, instead of firing off a Task per entry, PraxisMapper simply loops over them and processes the results inline in a single thread.
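A sketch of that dispatch decision, again in Python with a thread pool standing in for Tasks (the function and parameter names are mine, not PraxisMapper's):

```python
from concurrent.futures import ThreadPoolExecutor

def dispatch_block(elements, kind, handle):
    """Nodes are cheap enough that task overhead outweighs the win,
    so they run inline; Ways and Relations fan out to parallel tasks."""
    if kind == "node":
        return [handle(e) for e in elements]      # single thread, no overhead
    with ThreadPoolExecutor() as pool:
        return list(pool.map(handle, elements))   # one task per element
```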

Minor Changes

Some small changes were made to the FeatureInterpreter to make it clearer which things did not need to be done per call. Some of these were possibly done automatically by the compiler, but they're now written explicitly in a more efficient manner. The list of tags that determine whether an element is an area was made constant and static instead of being created on every call, some functions now return early on an unusable entry, and there are a few others I'm sure I've forgotten. These don't save much on a small file, but on large files they may cut off a measurable amount of time.
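The constant-hoisting change can be sketched like this in Python; the tag names below are illustrative, not the actual list the FeatureInterpreter uses:

```python
# Before: the lookup collection is rebuilt on every single call.
def is_area_slow(tags):
    area_tags = {"building", "landuse", "leisure", "natural"}  # built each call
    return any(t in area_tags for t in tags)

# After: hoisted to a module-level constant, built exactly once.
AREA_TAGS = frozenset({"building", "landuse", "leisure", "natural"})

def is_area_fast(tags):
    return any(t in AREA_TAGS for t in tags)
```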

Results

Once all of these changes were in place, my benchmark file can be processed in about 6 minutes, almost 8x faster than the original logic, with about a quarter of the peak RAM required: 2GB instead of 8GB. The tradeoff is much more disk access, which adds variance to the total runtime: the original logic's processing times were very consistent, while the current logic's total running time can vary by about 10% in either direction for a given file.

Now, a block of Relations takes a varying amount of time to process (it always did and always will, because Relations can range from 3 to 54,000 members to look up and process), a block of Ways typically takes under 1 second (exceptions apply, but Ways are capped at 2,000 Nodes per the spec), and 10-50 blocks of Nodes can be processed per second.