-
Notifications
You must be signed in to change notification settings - Fork 41
Description
Description
I'm experiencing inconsistent query results when running the same spatial query with DuckDB on EC2. When executing the identical query multiple times in succession, I get different row counts:
First run: 35,817 records
Second run: 291,614 records
Third run: 9,555 records
The correct number of records - which I get from running locally with 1.1.3 is 1,632,012
This appears to be a non-deterministic behavior introduced in DuckDB 1.2.0.
Environment
Local: DuckDB 1.1.3 (gives consistent results)
EC2: DuckDB 1.2.0
R package: duckdb
Extensions: spatial, aws, httpfs, icu
I'm not able to installed 1.1.3 into the EC2, closest I get is v1.1.3-dev165 but then the spatial extensions aren't working
Error: IO Error: Failed to install 'spatial'
The file was built for DuckDB version 'v1.1.3', but we can only load extensions built for DuckDB version '19864453f7'.
SQL I'm running
WITH events AS (SELECT *
, ST_Point(longitude, latitude) geom
, timezone('Pacific/Auckland', event_timestamp::TIMESTAMP WITH TIME ZONE) AS event_timestamp_nz
, CAST(event_timestamp AS DATE) dom
, strftime(date_trunc('hour', event_timestamp_nz), '%H:%M') timegroup
, extract(dow FROM event_timestamp_nz) dow
FROM read_parquet('MYLOCATION')
)
, filtered_route_ways AS (SELECT *
, st_buffer(ST_Envelope(geom), 0.001) bbox_geom
FROM route_ways)
, ranked_matches AS (SELECT
e.*
, r.way_id
, r.highway
, r.maxspeed
, r.osm_tuple_length_m
, r.osm_way_length_m
, r.osm_tuple_direction
, ROW_NUMBER() OVER (
PARTITION BY e.machine_gid, e.event_timestamp
ORDER BY ST_Distance(e.geom, r.geom)
) AS rank
FROM events e
JOIN filtered_route_ways r
ON ST_Intersects(e.geom, r.bbox_geom))
select * from ranked_matches;
SETUP before running
con <- dbConnect(duckdb::duckdb())
dbExecute(con, "INSTALL spatial; LOAD spatial;")
dbExecute(con, "INSTALL aws; LOAD aws;")
dbExecute(con, "INSTALL httpfs; LOAD httpfs;")
dbExecute(con, "INSTALL icu; LOAD icu;")
dbExecute(con, "SET TimeZone = 'Pacific/Auckland';")
dbWriteTable(con, "route_ways", route_ways)
dbExecute(con, "ALTER TABLE route_ways ADD COLUMN geom GEOMETRY;")
dbExecute(con, "UPDATE route_ways SET geom = ST_GeomFromText(linestring);")
dbExecute(con, "CREATE INDEX idx_geom ON route_ways USING RTREE (geom);")
dbExecute(con, "CALL load_aws_credentials('robinsight-data-hub');")
The 'events' source is exactly the same location in S3 that the local and EC2 versions are querying. the 'route_ways' dataframe is exactly the same, and I've confirmed both the EC2 and local queries return the correct results for the count from the 'events' CTE and 'filtered_route_ways' CTE.
UPDATE:
I replaced the ST_Intersects function with the following SQL and I'm getting consistent results between local and EC2 versions
WITH events AS (
SELECT *
, longitude
, latitude
, ST_Point(longitude, latitude) AS geom
, timezone('Pacific/Auckland', event_timestamp::TIMESTAMP WITH TIME ZONE) AS event_timestamp_nz
, CAST(event_timestamp AS DATE) AS dom
, strftime(date_trunc('hour', event_timestamp_nz), '%H:%M') AS timegroup
, extract(dow FROM event_timestamp_nz) AS dow
FROM read_parquet('MYLOCATION')
),
route_ways_bounds AS (
SELECT *
, geom
, ST_Buffer(ST_Envelope(geom), 0.001) AS bbox_geom
, ST_XMin(ST_Buffer(ST_Envelope(geom), 0.001)) AS min_lon
, ST_XMax(ST_Buffer(ST_Envelope(geom), 0.001)) AS max_lon
, ST_YMin(ST_Buffer(ST_Envelope(geom), 0.001)) AS min_lat
, ST_YMax(ST_Buffer(ST_Envelope(geom), 0.001)) AS max_lat
FROM route_ways
),
filtered_events AS (
SELECT e.*, r.way_id, r.highway, r.maxspeed, r.osm_tuple_length_m, r.osm_way_length_m, r.osm_tuple_direction, r.geom AS route_geom
FROM events e
JOIN route_ways_bounds r
ON e.longitude BETWEEN r.min_lon AND r.max_lon
AND e.latitude BETWEEN r.min_lat AND r.max_lat
),
ranked_matches AS (
SELECT
f.*,
ROW_NUMBER() OVER (
PARTITION BY f.machine_gid, f.event_timestamp
ORDER BY ST_Distance(f.geom, f.route_geom)
) AS rank
FROM filtered_events f
)