Open
Description
Superset does not support loading data from sqlite so we want to use duckdb instead! Duckdb is well suited for our data because it's designed to handle local data warehouses. It's also a cheaper option for superset because something like BQ we'd have to pay for query compute costs.
Success Criteria
- all parquet files except CEMS and long-named tables are in a
.duckdb
file -
.duckdb
file is generated & distributed to S3/GCS in nightly builds
### Tasks
- [x] Move script from devtools to package and add as entry point. Add to tests
- [x] Run `parquet_to_duckdb.py` script in `gcp_pudl_etl.sh`. Add lines to copy pudl.duckdb to `s3://pudl.catalyst.coop`, `gs://pudl.catalyst.coop` and `gs://superset.catalyst.coop`
- [x] Create a duckdb schema from our metadata classes
- [x] Load all the parquet files into the .duckdb file
- [ ] Figure out why foreign keys slow freeze the loading step. Is it related to the enum issue?
- [ ] Agree on solution to handling enums
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3812
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3811
- [ ] Bonus: Use SQL Expression Language instead of raw text to specify constraints
Metadata
Metadata
Assignees
Type
Projects
Status
Backlog