
Load parquet files to a duckdb file #3739

Open

Description

@bendnorman

Superset does not support loading data from SQLite, so we want to use DuckDB instead! DuckDB is well suited to our data because it's designed for local data warehouses. It's also a cheaper option for Superset than something like BigQuery, where we'd have to pay for query compute costs.

### Success Criteria

  • all parquet files except CEMS and long-named tables are in a .duckdb file
  • .duckdb file is generated & distributed to S3/GCS in nightly builds
### Tasks
- [x] Move script from devtools to package and add as entry point. Add to tests
- [x] Run `parquet_to_duckdb.py` script in `gcp_pudl_etl.sh`. Add lines to copy pudl.duckdb to `s3://pudl.catalyst.coop`, `gs://pudl.catalyst.coop` and `gs://superset.catalyst.coop`
- [x] Create a duckdb schema from our metadata classes
- [x] Load all the parquet files into the .duckdb file (see the loading sketch after this list)
- [ ] Figure out why foreign keys slow down or freeze the loading step. Is it related to the enum issue?
- [ ] Agree on a solution for handling enums (see the enum sketch below)
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3812
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3811
- [ ] Bonus: Use the SQL Expression Language instead of raw text to specify constraints (see the SQLAlchemy sketch below)
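
For reference, a minimal sketch of the loading step, assuming a local directory of parquet files whose stems match the desired table names (the paths and naming convention here are hypothetical, not the actual `parquet_to_duckdb.py` implementation):

```python
from pathlib import Path

import duckdb

# Hypothetical input directory and output file; the real script would
# take these as arguments.
parquet_dir = Path("./parquet")
con = duckdb.connect("pudl.duckdb")

for path in sorted(parquet_dir.glob("*.parquet")):
    table = path.stem  # assumes file stem == table name
    # read_parquet() lets DuckDB infer the schema directly from the file.
    con.execute(
        f"CREATE TABLE {table} AS SELECT * FROM read_parquet('{path}')"
    )

con.close()
```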
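On the enum question: DuckDB does support native ENUM types, so one option would look roughly like this (the type, table, and column names are made up for illustration):

```python
import duckdb

con = duckdb.connect("pudl.duckdb")

# Define a native DuckDB ENUM type, then reference it from a column.
con.execute("CREATE TYPE fuel_type AS ENUM ('coal', 'gas', 'oil')")
con.execute(
    """
    CREATE TABLE plants_hypothetical (
        plant_id INTEGER PRIMARY KEY,
        fuel fuel_type
    )
    """
)
con.close()
```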
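And for the bonus item, a sketch of what constraints look like in SQLAlchemy's SQL Expression Language instead of raw DDL text. This only emits generic DDL strings, so it doesn't assume a DuckDB dialect is installed, and the tables are hypothetical stand-ins for what our metadata classes would generate:

```python
import sqlalchemy as sa
from sqlalchemy.schema import CreateTable

metadata = sa.MetaData()

# Hypothetical tables; in PUDL these would be generated from the
# metadata classes rather than written by hand.
utilities = sa.Table(
    "utilities",
    metadata,
    sa.Column("utility_id", sa.Integer, primary_key=True),
)
plants = sa.Table(
    "plants",
    metadata,
    sa.Column("plant_id", sa.Integer, primary_key=True),
    sa.Column(
        "utility_id",
        sa.Integer,
        sa.ForeignKey("utilities.utility_id"),
    ),
    sa.Column("fuel", sa.Enum("coal", "gas", "oil", name="fuel_type")),
)

# Compile each table definition to a DDL string, constraints included,
# without hand-writing any SQL text.
for table in metadata.sorted_tables:
    print(CreateTable(table))
```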
