HTTP Range requests are awesome (for data tooling and projects) because ... #388
Replies: 2 comments
-
Sharing some examples of HTTP Range Requests in the wild.

DuckDB Parquet Reading
When you run a query against a remote Parquet file, DuckDB scans the Parquet metadata first, then reads only the row groups and columns relevant to the query. You can test it out in their WASM Shell with the following query:
select distance from 'https://uwdata.github.io/mosaic-datasets/data/flights-10m.parquet' limit 10;

Protomaps
Protomaps relies on PMTiles, an open specification for single-file tile pyramids built on compressed Hilbert ordering and queryable via HTTP Range Requests.

Range Requests on Torrents
The distribyted gateway is able to pull data from a torrented SQLite database via range requests.
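To make the "metadata first" step in the DuckDB example a bit more concrete, here is a minimal sketch (not DuckDB's actual code). It relies on the Parquet layout, where a file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. A client can ask for just those last 8 bytes with a suffix range (`Range: bytes=-8`), learn the total file size from the `Content-Range` response header, and then compute which byte range holds the footer metadata. The function name `footer_range` is made up for illustration.

```python
import struct


def footer_range(tail: bytes, file_size: int) -> tuple:
    """Given the last 8 bytes of a Parquet file and the total file size,
    return the inclusive (start, end) byte range of the footer metadata."""
    if len(tail) != 8 or tail[4:] != b"PAR1":
        raise ValueError("tail does not look like the end of a Parquet file")
    # First 4 bytes of the tail: footer metadata length, little-endian uint32.
    (meta_len,) = struct.unpack("<I", tail[:4])
    end = file_size - 8 - 1        # last metadata byte sits just before the tail
    start = end - meta_len + 1
    return start, end
```

A second request with `Range: bytes=<start>-<end>` would then fetch only the metadata, which is what tells a reader where the individual row groups and columns live.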
-
Discovered another great one, Static Wiki! It uses a remote SQLite database and range requests to serve Wikipedia statically (repo).
-
Range requests are so cool - I feel one should just start a discussion thread just about that. I've had long thoughts in the past (before parquet really, though they still feel relevant) about adding a small metadata file adjacent to a CSV with some kind of indexing, e.g. row 1000 starts at byte X, row 5000 starts at byte Y, row 10k at byte Z, row 100k at byte XX. That way you could quickly retrieve a subset of rows. Or even, gasp, turn CSV into a columnar model by transposing and then doing this indexing ...
(Qu: why not just use parquet? Because CSV is everywhere and you don't have to be a data scientist to use it. The index file is non-invasive and you can even create it for files you don't control but which are unchanging ...)
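A rough sketch of the sidecar-index idea, under some loud assumptions: no quoted field contains a newline (so one line is one row), line 0 is the header, and the helper names `build_row_index`, `read_lines` and `http_fetch` are invented for this example. The index records the byte offset of every Nth line, and a lookup turns a row range into a single byte range to request.

```python
import io
import urllib.request


def build_row_index(data: bytes, every: int = 1000) -> dict:
    """Map line number -> byte offset of that line, sampled every `every` lines."""
    index = {}
    offset = 0
    for line_number, line in enumerate(io.BytesIO(data)):  # splits on b"\n" only
        if line_number % every == 0:
            index[line_number] = offset
        offset += len(line)
    return index


def read_lines(fetch, index, first, last, total_size):
    """Return lines first..last (inclusive, 0-based) using one ranged fetch."""
    checkpoints = sorted(index)
    # Nearest indexed line at or before the first wanted line.
    start_line = max(n for n in checkpoints if n <= first)
    # Stop just before the next indexed line after `last`, or at end of file.
    later = [n for n in checkpoints if n > last]
    end_byte = index[later[0]] - 1 if later else total_size - 1
    chunk = fetch(index[start_line], end_byte)
    lines = [l.rstrip("\r") for l in chunk.decode("utf-8").split("\n")]
    return lines[first - start_line : last - start_line + 1]


def http_fetch(url):
    """Build a fetch(start, end) function that issues HTTP Range requests."""
    def fetch(start, end):
        req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp:
            if resp.status != 206:  # server ignored the Range header
                raise RuntimeError("server did not return partial content")
            return resp.read()
    return fetch
```

For a local dry run you can pass `lambda s, e: data[s:e + 1]` as `fetch`; against a real server you would pass `http_fetch(url)` and store the index (e.g. as JSON) next to the CSV. The coarser the index, the more extra bytes each lookup pulls in, so `every` is a trade-off between index size and over-fetching.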
TODO: what are range requests
TODO: examples of range requests in the wild
I always thought this was really cool: https://phiresky.github.io/blog/2021/hosting-sqlite-databases-on-github-pages/
tl;dr: he puts a 670 MB SQLite database on GitHub Pages and queries it from the browser. How does he avoid loading the whole db into memory (which sql.js does ...)? He creates a virtual filesystem API that uses range requests to load only parts of the db at a time (using indexes).
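The core trick can be sketched in a few lines. This is not the code from that blog post (which is a JavaScript virtual filesystem for sql.js), just a minimal Python illustration of the same idea: expose a `read(offset, length)` call, and behind it fetch fixed-size pages with range requests and cache them. The class name `PagedReader` and the 4096-byte page size are assumptions made for this example.

```python
class PagedReader:
    """Serve arbitrary reads from a remote file by fetching whole pages on demand."""

    def __init__(self, fetch, total_size, page_size=4096):
        self.fetch = fetch            # fetch(start, end) -> bytes, inclusive range
        self.total_size = total_size
        self.page_size = page_size
        self.pages = {}               # page number -> cached bytes
        self.requests = 0             # how many range requests were issued

    def _page(self, n):
        if n not in self.pages:
            start = n * self.page_size
            end = min(start + self.page_size, self.total_size) - 1
            self.pages[n] = self.fetch(start, end)
            self.requests += 1
        return self.pages[n]

    def read(self, offset, length):
        end = min(offset + length, self.total_size)
        if offset >= end:
            return b""
        first = offset // self.page_size
        last = (end - 1) // self.page_size
        data = b"".join(self._page(n) for n in range(first, last + 1))
        skip = offset - first * self.page_size
        return data[skip : skip + (end - offset)]
```

Here `fetch` would be any function that issues a request with a `Range: bytes=start-end` header. A database engine sitting on top only ever asks for the pages its indexes point to, so a lookup touches a handful of pages instead of the whole file.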