Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Query or filter before download #292

Open
StephanLo opened this issue Apr 30, 2024 · 11 comments
Open

[FEATURE] Query or filter before download #292

StephanLo opened this issue Apr 30, 2024 · 11 comments

Comments

@StephanLo
Copy link

StephanLo commented Apr 30, 2024

Is your feature request related to a problem? Please describe.

  1. Sometimes the spatial match returned by oe_match is much larger than the actual area of interest. This leads to unnecesary large downloads.
    For example,
    Once in oe_get I specified place = "Cape Town" (one city in South Africa), but the spatial match returned by the OSM provider was "South Africa (including Lesotho)". The data download is much larger than necessary. Another time I specified the sf polygon for one level 1 administrative boundary in a European country, but, the OSM provider matched "Europe", resulting in a large download.

  2. We may be interested in specific feature types only, but for large areas. At the moment we need to download the entire OSM dataset for the study area before we can query the target features.

Describe the solution you'd like
It would be great if we can either specify query on the provider side, so to filter out the target information before the .pbf file is downloaded. Something similar to the query and wkt_filter parameters in oe_get but, that get applied before downloading.

Describe alternatives you've considered

  1. Using osmdata instead of osmextract then specify a query for the target features, but subdivide the area of interest into smaller components to avoid a query timeout. (This works, but I feel it would be more elegant if we can do a large query in osmextract for specific features)
  2. Not tried, but one possibility: https://docs.overturemaps.org/getting-data/more-queries/

Additional context
Thank you for the nice package!

@agila5
Copy link
Collaborator

agila5 commented Oct 7, 2024

Thank you very much for creating this issue and sharing your ideas! Tbh, I'm not sure whether I will ever be be able to implement what you describe, but I think I've found something related. The GDAL website says

/vsicurl/ is a file system handler that allows on-the-fly random reading of files available through HTTP/FTP web protocols, without prior download of the entire file. It requires GDAL to be built against libcurl.

I've just finished running some tests using 1) the current/regular approach (i.e. download + convert to GPKG + read) and 2) this new vsicurl approach which, I guess, merges the download and conversion processes into a single step. I got surprising results, see below!

Regular approach

library(tictoc)
library(sf)
#> Linking to GEOS 3.11.2, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE

# Bounding box
xmin <- 12.5; xmax <- 12.8; ymin <- 5.6; ymax <- 5.8

# Regular approach: 
out_gpkg <- tempfile(".gpkg")
tic()
gdal_utils(
  util = "vectortranslate", 
  source = paste0(
    "http://download.openstreetmap.fr/extracts/africa/cameroon.osm.pbf"
  ), 
  options = c("lines", "-spat", xmin, ymin, xmax, ymax), 
  destination = out_gpkg, 
  quiet = FALSE
)
#> 0...10...20...30...40...50...60...70...80...90...100 - done.
st_read(out_gpkg, layer = "lines", quiet = FALSE)
#> Reading layer `lines' from data source 
#>   `C:\Users\user\AppData\Local\Temp\RtmpmaN2iG\.gpkgba78a9879f0' 
#>   using driver `ESRI Shapefile'
#> Simple feature collection with 28 features and 10 fields
#> Geometry type: LINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: 12.38803 ymin: 5.549327 xmax: 12.9215 ymax: 5.862042
#> Geodetic CRS:  WGS 84
toc()
#> 209.67 sec elapsed

Created on 2024-10-07 with reprex v2.0.2

With vsicurl

library(tictoc)
library(sf)
#> Linking to GEOS 3.11.2, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE

# Bounding box
xmin <- 12.5; xmax <- 12.8; ymin <- 5.6; ymax <- 5.8

# With vsicurl: 
out_gpkg <- tempfile(".gpkg")
tic()
gdal_utils(
  util = "vectortranslate", 
  source = paste0(
    "/vsicurl/", 
    "http://download.openstreetmap.fr/extracts/africa/cameroon.osm.pbf"
  ), 
  options = c("lines", "-spat", xmin, ymin, xmax, ymax), 
  destination = out_gpkg, 
  quiet = FALSE
)
#> 0...10...20...30...40...50...60...70...80...90...100 - done.
st_read(out_gpkg, layer = "lines", quiet = FALSE)
#> Reading layer `lines' from data source 
#>   `C:\Users\user\AppData\Local\Temp\Rtmps9OBHW\.gpkg4c707402581a' 
#>   using driver `ESRI Shapefile'
#> Simple feature collection with 28 features and 10 fields
#> Geometry type: LINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: 12.38803 ymin: 5.549327 xmax: 12.9215 ymax: 5.862042
#> Geodetic CRS:  WGS 84
toc()
#> 39.25 sec elapsed

Created on 2024-10-07 with reprex v2.0.2

@StephanLo @Robinlovelace (and any other person interested in this problem) could you please try to replicate these results, possibly with other OSM extracts? No rush! I cannot focus on osmextract for the foreseeable future. Nevertheless, if confirmed, I think this would imply we might rewrite several parts of the package and completely skip the download process!!!!

EDIT: There is also another file handler named vsicurl_streaming but, based on some tests I've just run, that does not outperform "regular" vsicurl. Anyway, this definetely requires more explorations.

@Robinlovelace
Copy link
Member

Robinlovelace commented Oct 8, 2024

Very interesting, seems visicurl is working, at least for this provider. I also get a substantial (~4x) speed-up, with the query done before downloads it seems:

library(tictoc)
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
xmin <- 12.5; xmax <- 12.8; ymin <- 5.6; ymax <- 5.8
out_gpkg <- tempfile(".gpkg")
tic()
gdal_utils(
  util = "vectortranslate", 
  source = paste0(
    "http://download.openstreetmap.fr/extracts/africa/cameroon.osm.pbf"
  ), 
  options = c("lines", "-spat", xmin, ymin, xmax, ymax), 
  destination = out_gpkg, 
  quiet = FALSE
)
#> 0...10...20...30...40...50...60...70...80...90...100 - done.
st_read(out_gpkg, layer = "lines", quiet = FALSE)
#> Reading layer `lines' from data source `/tmp/RtmpkLFyPq/.gpkg312f948e93fc8' using driver `ESRI Shapefile'
#> Simple feature collection with 28 features and 10 fields
#> Geometry type: LINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: 12.38803 ymin: 5.549327 xmax: 12.9215 ymax: 5.862042
#> Geodetic CRS:  WGS 84
toc()
#> 135.783 sec elapsed

tic()
gdal_utils(
  util = "vectortranslate", 
  source = paste0(
    "/vsicurl/", 
    "http://download.openstreetmap.fr/extracts/africa/cameroon.osm.pbf"
  ), 
  options = c("lines", "-spat", xmin, ymin, xmax, ymax), 
  destination = out_gpkg, 
  quiet = FALSE
)
#> 0...10...20...30...40...50...60...70...80..
#> Warning in CPL_gdalvectortranslate(source, destination, options, oo, doo, : GDAL Error 1: Layer lines already exists, and -append not specified.
#>         Consider using -append, or -overwrite.
#> Error in gdal_utils(util = "vectortranslate", source = paste0("/vsicurl/", : gdal_utils vectortranslate: an error occured
st_read(out_gpkg, layer = "lines", quiet = FALSE)
#> Reading layer `lines' from data source `/tmp/RtmpkLFyPq/.gpkg312f948e93fc8' using driver `ESRI Shapefile'
#> Simple feature collection with 28 features and 10 fields
#> Geometry type: LINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: 12.38803 ymin: 5.549327 xmax: 12.9215 ymax: 5.862042
#> Geodetic CRS:  WGS 84
toc()
#> 36.247 sec elapsed

Created on 2024-10-08 with reprex v2.1.0

@StephanLo
Copy link
Author

Thank you for following up on this query!

When I have time I will also try to replicate for some queries and give feedback.

@agila5
Copy link
Collaborator

agila5 commented Oct 8, 2024

@Robinlovelace could you please replicate that code in two separate R sessions? The reprex that you posted says that the second ogr2ogr operations was not successful (because the output file already existed due to the first ogr2ogr) and, if possible, I would like to isolate the two methods. Thanks! No time pressure :)

@Robinlovelace
Copy link
Member

Oh yeah, will do when I get a chance!

@Robinlovelace
Copy link
Member

Without visicurl:

library(tictoc)
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
xmin <- 12.5; xmax <- 12.8; ymin <- 5.6; ymax <- 5.8
out_gpkg <- tempfile(".gpkg")
tic()
gdal_utils(
  util = "vectortranslate", 
  source = paste0(
    "http://download.openstreetmap.fr/extracts/africa/cameroon.osm.pbf"
  ), 
  options = c("lines", "-spat", xmin, ymin, xmax, ymax), 
  destination = out_gpkg, 
  quiet = FALSE
)
#> 0...10...20...30...40...50...60...70...80...90...100 - done.
fs::file_size(out_gpkg)
#> 4K
st_read(out_gpkg, layer = "lines", quiet = FALSE)
#> Reading layer `lines' from data source `/tmp/RtmpAogZLc/.gpkg817b3768d1e5c' using driver `ESRI Shapefile'
#> Simple feature collection with 28 features and 10 fields
#> Geometry type: LINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: 12.38803 ymin: 5.549327 xmax: 12.9215 ymax: 5.862042
#> Geodetic CRS:  WGS 84
toc()
#> 105.191 sec elapsed

Created on 2024-10-08 with reprex v2.1.0

@Robinlovelace
Copy link
Member

With visicurl:

library(tictoc)
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
xmin <- 12.5; xmax <- 12.8; ymin <- 5.6; ymax <- 5.8
out_gpkg <- tempfile(".gpkg")
tic()
gdal_utils(
  util = "vectortranslate", 
  source = paste0(
    "/vsicurl/", 
    "http://download.openstreetmap.fr/extracts/africa/cameroon.osm.pbf"
  ), 
  options = c("lines", "-spat", xmin, ymin, xmax, ymax), 
  destination = out_gpkg, 
  quiet = FALSE
)
#> 0...10...20...30...40...50...60...70...80...90...100 - done.
fs::file_size(out_gpkg)
#> 4K
st_read(out_gpkg, layer = "lines", quiet = FALSE)
#> Reading layer `lines' from data source `/tmp/RtmpUUdkfn/.gpkg8306d2c30b177' using driver `ESRI Shapefile'
#> Simple feature collection with 28 features and 10 fields
#> Geometry type: LINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: 12.38803 ymin: 5.549327 xmax: 12.9215 ymax: 5.862042
#> Geodetic CRS:  WGS 84
toc()
#> 31.305 sec elapsed

Created on 2024-10-08 with reprex v2.1.0

@Robinlovelace
Copy link
Member

As far as I understand this could be activated by default, wonder if there's a check to tell us if the GDAL version installed has this capability (yes in most cases I guess)?

@Robinlovelace
Copy link
Member

Does the same thing work for attributate queries?

@agila5
Copy link
Collaborator

agila5 commented Oct 8, 2024

As far as I understand this could be activated by default, wonder if there's a check to tell us if the GDAL version installed has this capability (yes in most cases I guess)?

I think so! We just need to add a /vsicul/ to the URL and replace download.file with gdal_utils(...). However, I think that kinda requires several modifications inside the package, since we are currently assuming there is a "download directory" with the ".osm.pbf" files. I guess the download-and-save .pbf files could be completely skipped (unless a user already has its own OSM extract) and this new procedure might represent a good new approach for osmextract v1.0!

Does the same thing work for attributate queries?

Not sure, I'll run more tests as soon as possible. I also need to check the benefits with 1) larger extracts; 2) larger (o no) spatial filters.

@Robinlovelace
Copy link
Member

Good considerations. As this would be new and at this stage experimental, could be optional with the default being FALSE, and after some time for testing and requests for feedback after it's on CRAN we could turn it ON, hoping for not too many breaks to workflows.

Interesting points about saving the .pbf. Given that we need an intermediary .gpkg file anyway (right?), that will be unique for each query, we could use the file name of the intermediary file to cache intermediate results and avoid repeat downloads for repeat queries. That could involve representing the filename in the query through something like

snakecase::to_snake_case("Some 
+ 
+ Random
+   Query")
[1] "some_random_query"

or a hash. I think the former would be better, simpler...

Great to be thinking about this. Will be a cool new feature if it gets implemented and I'm keen to help out! Will mark as "Help wanted" also and look to do a call out for any GDAL pros out there who want to pitch in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants