Completely store simple lines and box-aligned polygons in the sweeper event value #12
With 327d8f4, geometry IDs for points are stored, together with their geometry, directly in the sweeper event value if the geometry ID fits into 56 bits (we require 1 byte to store the string length, the remaining 7 bytes are payload). This was possible because (a) the point geometry was already stored within the sweeper event value as a low-hanging optimization with very little overhead and (b) the sweeper event value also has a 64 bit integer holding the position of the geometry on disk. For points, the only remaining information stored on disk was the geometry ID (a free string), so it was natural to store this ID directly in the 64 bit integer if it fits. This completely avoids the geometry cache (and the necessary disk lookups) for points with short IDs. Note that integer IDs (and also all IDs used by osm2rdf to identify OSM objects) are not stored as strings, but as base256 encoded integers, which means that all integer IDs up to a value of 2^56 can be stored this way.
This PR introduces the same technique for simple lines. Simple lines are lines that consist of only two anchor points (line segments), and for which all heuristic precomputations can be done cheaply on the fly. So far, the only information stored on disk for simple lines was the two anchor points and the string ID. Now, the line segment is stored within the sweeper event value by writing the left anchor point to the point value described above (which was previously unused for simple lines), and by reconstructing the second anchor point from other information in the sweeper event value which describes the line segment's bounding box. The ID is then also stored directly in the 64 bit disk offset value. With this change, simple lines with geometry IDs of fewer than 8 characters also require no disk lookup.
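The two ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of this PR: the function names, the big-endian byte order, the zero padding, and the rule that the left anchor is the endpoint with the smaller x coordinate are all hypothetical. The sketch also leaves out how an inline ID is told apart from a real disk offset.

```python
from typing import Optional, Tuple

PAYLOAD_BYTES = 7  # 8 bytes in total, 1 of them holds the ID length

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]  # min_x, min_y, max_x, max_y


def pack_short_id(geom_id: bytes) -> Optional[int]:
    """Pack an ID into a 64 bit integer, or return None if it is too long.

    Assumed layout: [length byte][up to 7 payload bytes, zero padded].
    """
    if len(geom_id) > PAYLOAD_BYTES:
        return None  # caller would fall back to the on-disk geometry
    padded = geom_id.ljust(PAYLOAD_BYTES, b"\x00")
    return int.from_bytes(bytes([len(geom_id)]) + padded, "big")


def unpack_short_id(value: int) -> bytes:
    """Inverse of pack_short_id."""
    raw = value.to_bytes(8, "big")
    return raw[1:1 + raw[0]]


def encode_int_id(n: int) -> bytes:
    """Base256 encoding of a non-negative integer ID (at least one byte)."""
    return n.to_bytes(max(1, (n.bit_length() + 7) // 8), "big")


def second_anchor(left: Point, box: Box) -> Point:
    """Rebuild the right anchor of a segment from its left anchor and bbox.

    Assumes `left` has the smaller x coordinate, so the other endpoint has
    x == max_x and lies on the opposite y extreme of the box.
    """
    min_x, min_y, max_x, max_y = box
    y = max_y if left[1] == min_y else min_y
    return (max_x, y)
```

In this sketch, `pack_short_id(encode_int_id(n))` succeeds for every integer `n` below 2^56 and returns `None` from 2^56 onwards, because such a value needs 8 payload bytes.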
As a second change, we introduce the novel geometry storage type of aligned box polygons. These are polygons that are exactly equivalent to their axis-aligned bounding box. As an axis-aligned bounding box can be described by two points, we store these aligned box polygons in exactly the same fashion as simple lines.
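As an illustration of what "exactly equivalent to their axis-aligned bounding box" could mean in practice, here is a minimal Python sketch of such a check. It is an assumption about one possible test, not the check used in this PR; the function name and the ring representation (a closed list of integer coordinate pairs) are hypothetical. The sketch is deliberately conservative: a box with extra collinear vertices on its edges is rejected, although it covers the same area.

```python
from typing import Sequence, Tuple

Point = Tuple[int, int]


def is_aligned_box(outer: Sequence[Point],
                   holes: Sequence[Sequence[Point]] = ()) -> bool:
    """Return True if the polygon is exactly its axis-aligned bounding box."""
    # A box has no holes and a closed outer ring with 4 distinct vertices.
    if holes or len(outer) != 5 or outer[0] != outer[-1]:
        return False
    xs = [p[0] for p in outer]
    ys = [p[1] for p in outer]
    min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)
    if min_x == max_x or min_y == max_y:
        return False  # degenerate: no area
    corners = {(min_x, min_y), (min_x, max_y), (max_x, min_y), (max_x, max_y)}
    if set(outer[:-1]) != corners:
        return False
    # Every edge must be axis-parallel; this rules out a "bowtie" ordering
    # that visits the four corners diagonally.
    return all(a[0] == b[0] or a[1] == b[1] for a, b in zip(outer, outer[1:]))
```

A polygon that passes this check is fully described by two opposite corners, which is why it can share the two-point storage used for simple lines.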
For a testing dataset with all named OSM geometries in Germany, the full geometric join was around 15% faster with this PR (166 seconds vs 195 seconds). This was mainly driven by the change for simple lines, as there were only around 2,000 (0.02% of all geometries) aligned box polygons in this dataset.