Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Load parquet files to a duckdb file #3739

Open
4 of 9 tasks
Tracked by #3703
bendnorman opened this issue Jul 25, 2024 · 10 comments · May be fixed by #3741
Open
4 of 9 tasks
Tracked by #3703

Load parquet files to a duckdb file #3739

bendnorman opened this issue Jul 25, 2024 · 10 comments · May be fixed by #3741
Assignees
Labels
duckdb Issues referring to duckdb, the embedded OLAP database superset

Comments

@bendnorman
Copy link
Member

bendnorman commented Jul 25, 2024

Superset does not support loading data from sqlite so we want to use duckdb instead! Duckdb is well suited for our data because it's designed to handle local data warehouses. It's also a cheaper option for superset because something like BQ we'd have to pay for query compute costs.

Success Criteria

  • all parquet files except CEMS and long-named tables are in a .duckdb file
  • .duckdb file is generated & distributed to S3/GCS in nightly builds

Tasks

Preview Give feedback
  1. 0 of 1
  2. 0 of 1
@bendnorman bendnorman converted this from a draft issue Jul 25, 2024
@bendnorman bendnorman added duckdb Issues referring to duckdb, the embedded OLAP database superset labels Jul 25, 2024
@bendnorman bendnorman self-assigned this Jul 25, 2024
@bendnorman
Copy link
Member Author

Duckdb table names have a character limit of 63. We have four tables that exceed 63 characters:

core_eiaaeo__yearly_projected_fuel_cost_in_electric_sector_by_type: 66
core_eiaaeo__yearly_projected_generation_in_electric_sector_by_technology: 73
core_eiaaeo__yearly_projected_generation_in_end_use_sectors_by_fuel_type: 72
out_eia923__yearly_generation_fuel_by_generator_energy_source_owner: 67

We should rename these resources, enforce the resource name length constraint earlier in the code and update our documentation.

@cmgosnell
Copy link
Member

hhhhmmm long aeo names. @jdangerx made an aeo schema pr before migrating a lot of the AEO tables. I think the trouble here is that there are sooooo many AEO tables and so many of them contain the same pieces of information just broken down by different attributes.

@zaneselvans
Copy link
Member

On the topic of making a DuckDB schema with our metadata classes, I'd been thinking we either want to have to_sqlite() and to_duckdb() methods in place of the generic to_sql() we have now (which only creates SQLite schemas) or maybe add dialect="duckdb" and dialect="sqlite" arguments to to_sql() that do the right thing, and have it default to dialect="sqlite" since that's the legacy behavior.

@bendnorman
Copy link
Member Author

Agreed! That's what I'm working on right now. I've added a dialect="duckdb" option to to_sql(). For now, I'm just going to add some if statements to handle the different dialects, but there is probably a cleaner way to store the metadata to SQL logic of multiple dialects. I might make a MetadataSQLConverter class or type.

@zaneselvans
Copy link
Member

It might also be possible to use SQLAlchemy for this -- if the checks, constraints, etc can be stated using their generic API, and then output to the appropriate dialect. IIRC there was at least one SQLite specific thing that we had to code manually though.

@zaneselvans zaneselvans linked a pull request Jul 28, 2024 that will close this issue
@jdangerx jdangerx linked a pull request Aug 6, 2024 that will close this issue
@jdangerx
Copy link
Member

jdangerx commented Aug 7, 2024

In our inframundo meeting we decided that we can skip the "hard" ones for now and get back to them before we actually release to the public:

  • tables that have long names
  • CEMS, which is too big to load all at once

@jdangerx jdangerx assigned zaneselvans and unassigned bendnorman Aug 21, 2024
@zaneselvans
Copy link
Member

Something weird is going on with how big the DuckDB file is. Parquet with snappy compression is expected to be about the same size as the compressed DuckDB file. In Parquet, PUDL only takes up like 1-2GB (minus CEMS), and the DuckDB file is like 13GB, which just seems totally wacked.

@bendnorman
Copy link
Member Author

I think Duckdb uses a different compression algorithm so duckdb files aren't expected to be as small as parquet files: duckdb/duckdb#8162 (comment)

@zaneselvans
Copy link
Member

A factor of 10 feels suspicious though. I searched around for comparisons of the DuckDB and Parquet compression ratios and even a couple of years ago it looked like DuckDB should be less than 2x as big as Parquet.

@bendnorman
Copy link
Member Author

bendnorman commented Sep 11, 2024

Hmm I thought it could be that we're not specifying varchar lengths but the docs say that shouldn't matter.

It looks like many blocks in our out_eia__monthly_generators table are uncompressed:

D select compression, count(*) as count from pragma_storage_info('out_eia__monthly_generators') group by compression order by count desc;
┌──────────────┬───────┐
│ compression  │ count │
│   varchar    │ int64 │
├──────────────┼───────┤
│ Uncompressed │  4205 │
│ RLE          │  2714 │
│ Constant     │  1719 │
│ Dictionary   │  1281 │
│ FSST         │   722 │
│ ALPRD        │   182 │
│ BitPacking   │    51 │
│ ALP          │    35 │
└──────────────┴───────┘

Not sure why this is or if it's expected.

Another idea: Maybe our indexes are taking up a lot of space?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
duckdb Issues referring to duckdb, the embedded OLAP database superset
Projects
Status: Backlog
Development

Successfully merging a pull request may close this issue.

4 participants