Load parquet files to a duckdb file #3739
Comments
Duckdb table names have a character limit of 63. We have four tables that exceed 63 characters:
We should rename these resources, enforce the resource name length constraint earlier in the code, and update our documentation.
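A minimal sketch of what an earlier length check could look like, assuming the 63-character limit stated above. The function names and the constant are hypothetical and are not part of the existing PUDL code:

```python
# Hypothetical helper names; the limit is the one quoted in this issue.
MAX_TABLE_NAME_LENGTH = 63


def find_overlong_names(names, max_length=MAX_TABLE_NAME_LENGTH):
    """Return the names longer than max_length, sorted alphabetically."""
    return sorted(name for name in names if len(name) > max_length)


def validate_resource_names(names, max_length=MAX_TABLE_NAME_LENGTH):
    """Raise ValueError listing every resource name that exceeds the limit."""
    too_long = find_overlong_names(names, max_length)
    if too_long:
        details = ", ".join(f"{name} ({len(name)} chars)" for name in too_long)
        raise ValueError(
            f"Resource names longer than {max_length} characters: {details}"
        )
```

Running a check like this when the resource metadata is defined would surface an overlong name before any DuckDB table is created.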
Hmm, long AEO names. @jdangerx made an AEO schema PR before migrating a lot of the AEO tables. I think the trouble here is that there are so many AEO tables, and so many of them contain the same pieces of information just broken down by different attributes.
On the topic of making a DuckDB schema with our metadata classes, I'd been thinking we either want to have
Agreed! That's what I'm working on right now. I've added a
It might also be possible to use SQLAlchemy for this, if the checks, constraints, etc. can be stated using their generic API and then output to the appropriate dialect. IIRC there was at least one SQLite-specific thing that we had to code manually though.
In our inframundo meeting we decided that we can skip the "hard" ones for now and get back to them before we actually release to the public:
Something weird is going on with how big the DuckDB file is. Parquet with snappy compression is expected to be about the same size as the compressed DuckDB file. In Parquet, PUDL only takes up about 1-2 GB (minus CEMS), while the DuckDB file is about 13 GB, which seems way off.
I think DuckDB uses a different compression algorithm, so DuckDB files aren't expected to be as small as Parquet files: duckdb/duckdb#8162 (comment)
A factor of 10 feels suspicious though. I searched around for comparisons of the DuckDB and Parquet compression ratios, and even a couple of years ago it looked like DuckDB should be less than 2x as big as Parquet.
Hmm I thought it could be that we're not specifying varchar lengths but the docs say that shouldn't matter. It looks like many blocks in our
Not sure why this is or if it's expected. Another idea: maybe our indexes are taking up a lot of space?
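One way to dig into the block question is DuckDB's storage_info pragma, which reports the compression scheme and block of each column segment. The following is a rough sketch, assuming the duckdb Python package is installed. The helper names, the file name and the table name are placeholders rather than PUDL code, and this query does not account for space used by indexes:

```python
def storage_summary_sql(table_name: str) -> str:
    """Build a query counting segments and blocks per column and compression."""
    # The name is interpolated into a SQL string literal, so escape quotes.
    safe_name = table_name.replace("'", "''")
    return (
        "SELECT column_name, compression, "
        "count(*) AS segments, count(DISTINCT block_id) AS blocks "
        f"FROM pragma_storage_info('{safe_name}') "
        "GROUP BY column_name, compression "
        "ORDER BY segments DESC"
    )


def summarize_storage(db_path: str, table_name: str):
    """Run the summary query against a DuckDB file opened read-only."""
    import duckdb  # third-party; imported lazily so the SQL builder stands alone

    con = duckdb.connect(db_path, read_only=True)
    try:
        return con.execute(storage_summary_sql(table_name)).fetchall()
    finally:
        con.close()


# Example usage (placeholder path and table name):
# for row in summarize_storage("pudl.duckdb", "some_table"):
#     print(row)
```

Columns that show up mostly as uncompressed segments would be a reasonable place to start looking for the extra size.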
Superset does not support loading data from SQLite, so we want to use DuckDB instead! DuckDB is well suited for our data because it's designed to handle local data warehouses. It's also a cheaper option for Superset because with something like BQ we'd have to pay for query compute costs.
Success Criteria
.duckdb file
.duckdb file is generated & distributed to S3/GCS in nightly builds
Tasks