HTTP Range requests are awesome (for data tooling and projects) because ... #388

rufuspollock · 2023-11-08T10:24:36Z

rufuspollock
Nov 8, 2023
Maintainer

Range requests are so cool that I feel someone should start a discussion thread just about them. I've had long thoughts in the past (before Parquet really, though they still feel relevant) about adding a small metadata file adjacent to a CSV with some kind of index, e.g. row 1000 starts at byte X, row 5000 starts at byte Y, row 10k at byte Z, row 100k at byte XX. That way you could quickly retrieve a subset of rows. Or even, gasp, turn CSV into a columnar model by transposing and then doing this indexing ...

(Qu: why not just use Parquet? Because CSV is everywhere and you don't have to be a data scientist to use it. The index file is non-invasive, and you can even create it for files you don't control but which are unchanging ...)
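To make the sidecar-index idea concrete, here is a rough sketch. It assumes one header line and no newlines inside quoted fields; the function names and the `every` parameter are made up for illustration and are not from any existing tool.

```python
import io


def build_row_index(csv_bytes: bytes, every: int = 1000) -> dict[int, int]:
    """Map every `every`-th data row (0-based) to the byte offset where it starts.

    Assumption: one header line, and no newlines inside quoted fields.
    """
    index: dict[int, int] = {}
    offset = 0
    for line_no, line in enumerate(io.BytesIO(csv_bytes)):
        row = line_no - 1  # line 0 is the header
        if row >= 0 and row % every == 0:
            index[row] = offset
        offset += len(line)
    return index


def byte_range_for_rows(index, first_row, last_row, every, file_size):
    """Inclusive (start, end) byte offsets covering rows first_row..last_row."""
    start = index[(first_row // every) * every]
    next_mark = (last_row // every + 1) * every
    # If there is no later index entry, read to the end of the file.
    end = index.get(next_mark, file_size) - 1
    return start, end


def range_header(start: int, end: int) -> dict[str, str]:
    """HTTP Range header for an inclusive byte range."""
    return {"Range": f"bytes={start}-{end}"}


csv_bytes = b"id,name\n" + b"".join(f"{i},row{i}\n".encode() for i in range(10))
index = build_row_index(csv_bytes, every=4)
start, end = byte_range_for_rows(index, 5, 6, every=4, file_size=len(csv_bytes))
print(range_header(start, end))
```

The range is rounded out to the nearest indexed rows, so the client still has to discard a few rows on either side; a denser index trades a bigger sidecar file for less over-fetching.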

TODO: what are range requests

TODO: examples of range requests in the wild

I always thought this was really cool: https://phiresky.github.io/blog/2021/hosting-sqlite-databases-on-github-pages/

tl;dr: he puts a 670MB SQLite database on GitHub Pages and queries it from the browser. How does he avoid loading the whole db into memory (which is what sql.js does ...)? He creates a virtual filesystem API that uses range requests to load only parts of the db at a time (using indexes).

So how do you use a database on a static file hoster? Firstly, SQLite (written in C) is compiled to WebAssembly. SQLite can be compiled with emscripten without any modifications, and the sql.js library is a thin JS wrapper around the wasm code.

sql.js only allows you to create and read from databases that are fully in memory though - so I implemented a virtual file system that fetches chunks of the database with HTTP Range requests when SQLite tries to read from the filesystem: sql.js-httpvfs. From SQLite’s perspective, it just looks like it’s living on a normal computer with an empty filesystem except for a file called /wdi.sqlite3 that it can read from. Of course it can’t write to this file, but a read-only database is still very useful.
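The chunk-fetching idea can be sketched in a few lines. This is not how sql.js-httpvfs is implemented (that is JavaScript with its own details); it is only a minimal illustration of a read-only "file" backed by range requests. `RangeFile`, `http_fetch_range` and the chunk size are hypothetical names and values.

```python
import urllib.request
from typing import Callable


def http_fetch_range(url: str) -> Callable[[int, int], bytes]:
    """Return a function that fetches an inclusive byte range of `url`."""
    def fetch(start: int, end: int) -> bytes:
        req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req) as resp:
            # 206 Partial Content means the server honoured the Range header.
            if resp.status != 206:
                raise RuntimeError(f"range not honoured, got HTTP {resp.status}")
            return resp.read()
    return fetch


class RangeFile:
    """Read-only view of a remote file that loads fixed-size chunks on demand."""

    def __init__(self, fetch_range: Callable[[int, int], bytes], size: int,
                 chunk_size: int = 4096):
        self.fetch_range = fetch_range
        self.size = size
        self.chunk_size = chunk_size
        self.cache: dict[int, bytes] = {}
        self.requests = 0  # how many range requests were actually made

    def _chunk(self, n: int) -> bytes:
        if n not in self.cache:
            start = n * self.chunk_size
            end = min(start + self.chunk_size, self.size) - 1
            self.cache[n] = self.fetch_range(start, end)
            self.requests += 1
        return self.cache[n]

    def read_at(self, offset: int, length: int) -> bytes:
        end = min(offset + length, self.size)
        if offset >= end:
            return b""
        first = offset // self.chunk_size
        last = (end - 1) // self.chunk_size
        data = b"".join(self._chunk(n) for n in range(first, last + 1))
        skip = offset - first * self.chunk_size
        return data[skip:skip + (end - offset)]


# Demo against in-memory bytes instead of a real server.
blob = bytes(range(256)) * 64
f = RangeFile(lambda s, e: blob[s:e + 1], size=len(blob))
piece = f.read_at(4090, 10)  # straddles two chunks, so two "requests"
```

Swapping the lambda for `http_fetch_range(some_url)` would turn each missing chunk into one real range request, provided the server supports them.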

davidgasquez · 2023-11-08T10:56:24Z

davidgasquez
Nov 8, 2023
Collaborator

Sharing some examples of HTTP Range Requests in the wild.

DuckDB Parquet Reading

When you run a query against a remote Parquet file, DuckDB will scan the Parquet metadata first, then only the row groups / columns relevant to the query. You can test it out on their WASM Shell with the following query. The flights-10m.parquet file is around 70 MB, and the query will scan less than 2 MB.

select distance from 'https://uwdata.github.io/mosaic-datasets/data/flights-10m.parquet' limit 10;
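The "metadata first" step works because a Parquet file ends with its metadata, followed by a 4-byte little-endian metadata length and the magic bytes `PAR1`. A reader can fetch the last 8 bytes with one range request and learn where the metadata lives. The sketch below shows only that step; it is not DuckDB's reader, and `read_range` is a hypothetical callback that takes inclusive `(start, end)` offsets.

```python
import struct


def parquet_footer_range(read_range, file_size: int) -> tuple[int, int]:
    """Return the inclusive byte range of the Parquet file metadata (footer)."""
    tail = read_range(file_size - 8, file_size - 1)
    if tail[4:] != b"PAR1":
        raise ValueError("not a Parquet file (missing trailing PAR1 magic)")
    (footer_len,) = struct.unpack("<I", tail[:4])
    start = file_size - 8 - footer_len
    return start, file_size - 9


# Fake file: magic, 100 bytes of "data", 20 bytes of "metadata", length, magic.
fake = b"PAR1" + b"x" * 100 + b"M" * 20 + struct.pack("<I", 20) + b"PAR1"
footer_start, footer_end = parquet_footer_range(lambda s, e: fake[s:e + 1], len(fake))
```

A second range request for `footer_start..footer_end` returns the metadata, which lists the column chunk offsets the reader needs for the query.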

Protomaps

Protomaps relies on PMTiles, an open specification for single-file tile pyramids built on compressed Hilbert ordering and queryable via HTTP Range Requests.
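To illustrate why Hilbert ordering pairs well with range requests, here is the textbook Hilbert curve index for an n×n grid (n a power of two). It is a generic sketch, not the exact tile ID scheme from the PMTiles spec. The point is that cells next to each other on the curve are neighbours in the grid, so tiles for one map area tend to sit in nearby byte ranges.

```python
def hilbert_index(n: int, x: int, y: int) -> int:
    """Position of cell (x, y) along a Hilbert curve over an n x n grid.

    n must be a power of two. Generic textbook algorithm, not PMTiles-specific.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Order the 16 cells of a 4x4 grid along the curve.
cells = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda c: hilbert_index(4, *c))
```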

Range Requests on Torrents

The distribyted gateway is able to pull data from a torrented SQLite database via range requests.


davidgasquez · 2023-11-20T09:04:34Z

davidgasquez
Nov 20, 2023
Collaborator

Discovered another great one: Static Wiki! It uses a remote SQLite database and range requests to serve Wikipedia statically (repo).

