-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEATURE] Query or filter before download #292
Comments
Thank you very much for creating this issue and sharing your ideas! Tbh, I'm not sure whether I will ever be be able to implement what you describe, but I think I've found something related. The GDAL website says
I've just finished running some tests using 1) the current/regular approach (i.e. download + convert to GPKG + read) and 2) this new vsicurl approach which, I guess, merges the download and conversion processes into a single step. I got surprising results, see below! Regular approach library(tictoc)
library(sf)
#> Linking to GEOS 3.11.2, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE
# Bounding box
xmin <- 12.5; xmax <- 12.8; ymin <- 5.6; ymax <- 5.8
# Regular approach:
out_gpkg <- tempfile(".gpkg")
tic()
gdal_utils(
util = "vectortranslate",
source = paste0(
"http://download.openstreetmap.fr/extracts/africa/cameroon.osm.pbf"
),
options = c("lines", "-spat", xmin, ymin, xmax, ymax),
destination = out_gpkg,
quiet = FALSE
)
#> 0...10...20...30...40...50...60...70...80...90...100 - done.
st_read(out_gpkg, layer = "lines", quiet = FALSE)
#> Reading layer `lines' from data source
#> `C:\Users\user\AppData\Local\Temp\RtmpmaN2iG\.gpkgba78a9879f0'
#> using driver `ESRI Shapefile'
#> Simple feature collection with 28 features and 10 fields
#> Geometry type: LINESTRING
#> Dimension: XY
#> Bounding box: xmin: 12.38803 ymin: 5.549327 xmax: 12.9215 ymax: 5.862042
#> Geodetic CRS: WGS 84
toc()
#> 209.67 sec elapsed Created on 2024-10-07 with reprex v2.0.2 With vsicurl library(tictoc)
library(sf)
#> Linking to GEOS 3.11.2, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE
# Bounding box
xmin <- 12.5; xmax <- 12.8; ymin <- 5.6; ymax <- 5.8
# With vsicurl:
out_gpkg <- tempfile(".gpkg")
tic()
gdal_utils(
util = "vectortranslate",
source = paste0(
"/vsicurl/",
"http://download.openstreetmap.fr/extracts/africa/cameroon.osm.pbf"
),
options = c("lines", "-spat", xmin, ymin, xmax, ymax),
destination = out_gpkg,
quiet = FALSE
)
#> 0...10...20...30...40...50...60...70...80...90...100 - done.
st_read(out_gpkg, layer = "lines", quiet = FALSE)
#> Reading layer `lines' from data source
#> `C:\Users\user\AppData\Local\Temp\Rtmps9OBHW\.gpkg4c707402581a'
#> using driver `ESRI Shapefile'
#> Simple feature collection with 28 features and 10 fields
#> Geometry type: LINESTRING
#> Dimension: XY
#> Bounding box: xmin: 12.38803 ymin: 5.549327 xmax: 12.9215 ymax: 5.862042
#> Geodetic CRS: WGS 84
toc()
#> 39.25 sec elapsed Created on 2024-10-07 with reprex v2.0.2 @StephanLo @Robinlovelace (and any other person interested in this problem) could you please try to replicate these results, possibly with other OSM extracts? No rush! I cannot focus on EDIT: There is also another file handler named |
Very interesting, seems visicurl is working, at least for this provider. I also get a substantial (~4x) speed-up, with the query done before downloads it seems: library(tictoc)
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
xmin <- 12.5; xmax <- 12.8; ymin <- 5.6; ymax <- 5.8
out_gpkg <- tempfile(".gpkg")
tic()
gdal_utils(
util = "vectortranslate",
source = paste0(
"http://download.openstreetmap.fr/extracts/africa/cameroon.osm.pbf"
),
options = c("lines", "-spat", xmin, ymin, xmax, ymax),
destination = out_gpkg,
quiet = FALSE
)
#> 0...10...20...30...40...50...60...70...80...90...100 - done.
st_read(out_gpkg, layer = "lines", quiet = FALSE)
#> Reading layer `lines' from data source `/tmp/RtmpkLFyPq/.gpkg312f948e93fc8' using driver `ESRI Shapefile'
#> Simple feature collection with 28 features and 10 fields
#> Geometry type: LINESTRING
#> Dimension: XY
#> Bounding box: xmin: 12.38803 ymin: 5.549327 xmax: 12.9215 ymax: 5.862042
#> Geodetic CRS: WGS 84
toc()
#> 135.783 sec elapsed
tic()
gdal_utils(
util = "vectortranslate",
source = paste0(
"/vsicurl/",
"http://download.openstreetmap.fr/extracts/africa/cameroon.osm.pbf"
),
options = c("lines", "-spat", xmin, ymin, xmax, ymax),
destination = out_gpkg,
quiet = FALSE
)
#> 0...10...20...30...40...50...60...70...80..
#> Warning in CPL_gdalvectortranslate(source, destination, options, oo, doo, : GDAL Error 1: Layer lines already exists, and -append not specified.
#> Consider using -append, or -overwrite.
#> Error in gdal_utils(util = "vectortranslate", source = paste0("/vsicurl/", : gdal_utils vectortranslate: an error occured
st_read(out_gpkg, layer = "lines", quiet = FALSE)
#> Reading layer `lines' from data source `/tmp/RtmpkLFyPq/.gpkg312f948e93fc8' using driver `ESRI Shapefile'
#> Simple feature collection with 28 features and 10 fields
#> Geometry type: LINESTRING
#> Dimension: XY
#> Bounding box: xmin: 12.38803 ymin: 5.549327 xmax: 12.9215 ymax: 5.862042
#> Geodetic CRS: WGS 84
toc()
#> 36.247 sec elapsed Created on 2024-10-08 with reprex v2.1.0 |
Thank you for following up on this query! When I have time I will also try to replicate for some queries and give feedback. |
@Robinlovelace could you please replicate that code in two separate R sessions? The reprex that you posted says that the second |
Oh yeah, will do when I get a chance! |
Without visicurl: library(tictoc)
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
xmin <- 12.5; xmax <- 12.8; ymin <- 5.6; ymax <- 5.8
out_gpkg <- tempfile(".gpkg")
tic()
gdal_utils(
util = "vectortranslate",
source = paste0(
"http://download.openstreetmap.fr/extracts/africa/cameroon.osm.pbf"
),
options = c("lines", "-spat", xmin, ymin, xmax, ymax),
destination = out_gpkg,
quiet = FALSE
)
#> 0...10...20...30...40...50...60...70...80...90...100 - done.
fs::file_size(out_gpkg)
#> 4K
st_read(out_gpkg, layer = "lines", quiet = FALSE)
#> Reading layer `lines' from data source `/tmp/RtmpAogZLc/.gpkg817b3768d1e5c' using driver `ESRI Shapefile'
#> Simple feature collection with 28 features and 10 fields
#> Geometry type: LINESTRING
#> Dimension: XY
#> Bounding box: xmin: 12.38803 ymin: 5.549327 xmax: 12.9215 ymax: 5.862042
#> Geodetic CRS: WGS 84
toc()
#> 105.191 sec elapsed Created on 2024-10-08 with reprex v2.1.0 |
With visicurl: library(tictoc)
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
xmin <- 12.5; xmax <- 12.8; ymin <- 5.6; ymax <- 5.8
out_gpkg <- tempfile(".gpkg")
tic()
gdal_utils(
util = "vectortranslate",
source = paste0(
"/vsicurl/",
"http://download.openstreetmap.fr/extracts/africa/cameroon.osm.pbf"
),
options = c("lines", "-spat", xmin, ymin, xmax, ymax),
destination = out_gpkg,
quiet = FALSE
)
#> 0...10...20...30...40...50...60...70...80...90...100 - done.
fs::file_size(out_gpkg)
#> 4K
st_read(out_gpkg, layer = "lines", quiet = FALSE)
#> Reading layer `lines' from data source `/tmp/RtmpUUdkfn/.gpkg8306d2c30b177' using driver `ESRI Shapefile'
#> Simple feature collection with 28 features and 10 fields
#> Geometry type: LINESTRING
#> Dimension: XY
#> Bounding box: xmin: 12.38803 ymin: 5.549327 xmax: 12.9215 ymax: 5.862042
#> Geodetic CRS: WGS 84
toc()
#> 31.305 sec elapsed Created on 2024-10-08 with reprex v2.1.0 |
As far as I understand this could be activated by default, wonder if there's a check to tell us if the GDAL version installed has this capability (yes in most cases I guess)? |
Does the same thing work for attributate queries? |
I think so! We just need to add a
Not sure, I'll run more tests as soon as possible. I also need to check the benefits with 1) larger extracts; 2) larger (o no) spatial filters. |
Good considerations. As this would be new and at this stage experimental, could be optional with the default being FALSE, and after some time for testing and requests for feedback after it's on CRAN we could turn it ON, hoping for not too many breaks to workflows. Interesting points about saving the .pbf. Given that we need an intermediary .gpkg file anyway (right?), that will be unique for each query, we could use the file name of the intermediary file to cache intermediate results and avoid repeat downloads for repeat queries. That could involve representing the filename in the query through something like
or a hash. I think the former would be better, simpler... Great to be thinking about this. Will be a cool new feature if it gets implemented and I'm keen to help out! Will mark as "Help wanted" also and look to do a call out for any GDAL pros out there who want to pitch in. |
Is your feature request related to a problem? Please describe.
Sometimes the spatial match returned by oe_match is much larger than the actual area of interest. This leads to unnecesary large downloads.
For example,
Once in
oe_get
I specifiedplace = "Cape Town"
(one city in South Africa), but the spatial match returned by the OSM provider was"South Africa (including Lesotho)"
. The data download is much larger than necessary. Another time I specified the sf polygon for one level 1 administrative boundary in a European country, but, the OSM provider matched "Europe", resulting in a large download.We may be interested in specific feature types only, but for large areas. At the moment we need to download the entire OSM dataset for the study area before we can query the target features.
Describe the solution you'd like
It would be great if we can either specify query on the provider side, so to filter out the target information before the .pbf file is downloaded. Something similar to the
query
andwkt_filter
parameters inoe_get
but, that get applied before downloading.Describe alternatives you've considered
osmdata
instead ofosmextract
then specify a query for the target features, but subdivide the area of interest into smaller components to avoid a query timeout. (This works, but I feel it would be more elegant if we can do a large query in osmextract for specific features)Additional context
Thank you for the nice package!
The text was updated successfully, but these errors were encountered: