@patrickbr patrickbr commented Oct 28, 2025

With 327d8f4, geometry IDs for points are stored, together with their geometry, directly in the sweeper event value if the geometry ID fits into 56 bits (we require 1 byte to store the string length; the remaining 7 bytes are payload). This was possible because (a) the point geometry was already stored within the sweeper event value as a low-hanging optimization with very little overhead, and (b) the sweeper event value also has a 64-bit integer holding the position of the geometry on disk. For points, the only remaining information stored on disk was the geometry ID (a free string), so it was natural to store it directly in the 64-bit integer if it fits. This completely avoids the geometry cache (and the necessary disk lookups) for points with short IDs. Note that integer IDs (including all IDs used by osm2rdf to identify OSM objects) are not stored as strings, but as base-256 encoded integers, which means that all integer IDs below 2^56 can be stored this way.
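To make the 1 + 7 byte layout concrete, here is a minimal sketch in Python (not the project's code). It assumes the length sits in the highest byte and the payload is right-padded with zero bytes; the actual bit layout and the names `pack_id`, `unpack_id` and `int_to_base256` are hypothetical.

```python
from typing import Optional

PAYLOAD_BYTES = 7  # 8 bytes in total, minus 1 byte for the length


def pack_id(geom_id: bytes) -> Optional[int]:
    """Pack an ID of at most 7 bytes into a 64-bit integer, else return None."""
    if len(geom_id) > PAYLOAD_BYTES:
        return None  # too long: such an ID would have to stay on disk
    # Assumed layout: length in the highest byte, payload padded with zeros.
    padded = geom_id.ljust(PAYLOAD_BYTES, b"\x00")
    return (len(geom_id) << 56) | int.from_bytes(padded, "big")


def unpack_id(value: int) -> bytes:
    """Recover the ID bytes from a value produced by pack_id."""
    length = value >> 56
    payload = (value & ((1 << 56) - 1)).to_bytes(PAYLOAD_BYTES, "big")
    return payload[:length]


def int_to_base256(n: int) -> bytes:
    """Encode a non-negative integer with one base-256 digit per byte."""
    return n.to_bytes(max(1, (n.bit_length() + 7) // 8), "big")
```

With this encoding, `int_to_base256(2**56 - 1)` is 7 bytes long and still fits, while `int_to_base256(2**56)` needs 8 bytes and does not.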

This PR applies the same technique to simple lines. Simple lines are lines that consist of only two anchor points (line segments), and for which all heuristic precomputations can be done cheaply on the fly. So far, the only information stored on disk for simple lines was the two anchor points and the string ID. Now, the line segment is stored within the sweeper event value by setting the point value described above (which was previously unused for simple lines) to the left anchor point, and by reconstructing the second anchor point from other information in the sweeper event value which describes the line segment's bounding box. The ID is then also stored directly in the 64-bit disk offset value. With this change, simple lines with geometry IDs of fewer than 8 characters also require no disk lookup.
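The reconstruction of the second anchor point can be sketched as follows. This is an illustration in Python under stated assumptions, not the project's code: it assumes integer coordinates, that the full bounding box is available, and that the "left" anchor is the lexicographically smaller endpoint. The names `left_anchor`, `segment_bbox` and `second_anchor` are hypothetical.

```python
from typing import Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]  # (min_x, min_y, max_x, max_y)


def left_anchor(a: Point, b: Point) -> Point:
    """Assumed convention: smaller x first, ties broken by smaller y."""
    return min(a, b)


def segment_bbox(a: Point, b: Point) -> Box:
    """Axis-aligned bounding box of the segment from a to b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[0], b[0]), max(a[1], b[1]))


def second_anchor(left: Point, box: Box) -> Point:
    """Rebuild the other endpoint from the left anchor and the bounding box."""
    min_x, min_y, max_x, max_y = box
    # The two endpoints sit in opposite corners of the box: if the left
    # anchor is at the bottom, the other endpoint is at the top, and vice versa.
    y = max_y if left[1] == min_y else min_y
    return (max_x, y)
```

For a falling segment from `(0, 5)` to `(10, 1)`, the left anchor is `(0, 5)`, the box is `(0, 1, 10, 5)`, and `second_anchor` returns `(10, 1)`. Horizontal and vertical segments are covered by the same rule under the tie-breaking convention above.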

As a second change, we introduce a new geometry storage type: aligned box polygons. These are polygons that are identical to their axis-aligned bounding box. As an axis-aligned bounding box can be described by two points, we store these aligned box polygons in exactly the same fashion as simple lines.
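One possible way to recognize such polygons is sketched below in Python. This is a simplified illustration, not the check used in the PR: it assumes a single outer ring without holes, and it only accepts rings with exactly four distinct vertices, so a box with additional collinear vertices on its edges is not detected. The name `as_aligned_box` is hypothetical.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[int, int]


def as_aligned_box(ring: Sequence[Point]) -> Optional[Tuple[Point, Point]]:
    """Return (lower_left, upper_right) if the ring equals its bounding box."""
    pts = list(ring)
    if len(pts) >= 2 and pts[0] == pts[-1]:
        pts = pts[:-1]  # drop the closing vertex of a closed ring
    if len(pts) != 4:
        return None
    min_x, max_x = min(p[0] for p in pts), max(p[0] for p in pts)
    min_y, max_y = min(p[1] for p in pts), max(p[1] for p in pts)
    corners = {(min_x, min_y), (min_x, max_y), (max_x, min_y), (max_x, max_y)}
    # The vertices must be exactly the four distinct corners of the box.
    if len(corners) != 4 or set(pts) != corners:
        return None
    # Consecutive vertices must share an x or a y, which rules out a
    # self-intersecting "bowtie" that visits the corners diagonally.
    for i in range(4):
        a, b = pts[i], pts[(i + 1) % 4]
        if a[0] != b[0] and a[1] != b[1]:
            return None
    return ((min_x, min_y), (max_x, max_y))
```

The two returned points are all that is needed to describe the polygon, which is why it can share the storage scheme of a simple line.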

For a testing dataset with all named OSM geometries in Germany, the full geometric join was around 15% faster with this PR (166 seconds vs 195 seconds). This was mainly driven by the change for simple lines, as there were only around 2,000 (0.02% of all geometries) aligned box polygons in this dataset.
