Avoid post-concatenation memory spike in CSV / Excel extractors #3571

@zaneselvans

Description

For many of our CSV- and Excel-based datasets, we read many partitions of each "page", concatenate the page fragments together, and produce a dictionary of dataframes as an interim asset that gets pickled and written to disk.

Typically, we then have a downstream asset corresponding to each table in that pickled dictionary, so each of those downstream assets reads in all of the dataframes even though it only needs to access one of them. These downstream assets also generally kick off at the same time, resulting in a big spike in memory usage. This was particularly significant for the EIA-930, which has 3 tables of hourly data, all of which are pretty big: memory usage was spiking to ~12 GB of RAM.
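
For context, a minimal sketch of the current pattern (asset and table names here are just illustrative, not the real PUDL ones): one asset returns the whole dict of raw dataframes, and each per-table asset takes that dict as an input, so unpickling the upstream asset loads every table.

```python
import pandas as pd
from dagster import asset


@asset
def raw_eia930__all_dfs() -> dict[str, pd.DataFrame]:
    """Concatenate all extracted pages into one dict, which gets pickled as a single asset."""
    return {
        "balance": pd.DataFrame(),      # placeholders for the real page extraction
        "demand": pd.DataFrame(),
        "interchange": pd.DataFrame(),
    }


@asset
def raw_eia930__demand(raw_eia930__all_dfs: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Needs only one table, but depending on the dict asset loads all of them into memory."""
    return raw_eia930__all_dfs["demand"]
```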

It seems like it should be possible to modify the CSV / Excel Extractor class so that when it produces multiple raw dataframes, each one is output separately, probably via a (hopefully subsettable) @multi_asset, such that each downstream asset reads in only the one dataframe it depends on.
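
Something along these lines, using Dagster's subsettable @multi_asset (a minimal sketch; the table list and asset naming are assumptions for illustration, not the actual PUDL extractor code):

```python
import pandas as pd
from dagster import AssetExecutionContext, AssetOut, Output, multi_asset

# Hypothetical page/table names for a dataset like EIA-930.
RAW_TABLES = ("balance", "demand", "interchange")


@multi_asset(
    # One output per raw table, instead of one pickled dict of dataframes.
    outs={f"raw_eia930__{table}": AssetOut(is_required=False) for table in RAW_TABLES},
    can_subset=True,
)
def raw_eia930(context: AssetExecutionContext):
    """Emit each extracted page as its own asset output.

    Because the multi-asset is subsettable, only the outputs selected for this
    run are produced, and downstream assets only ever load the single dataframe
    they depend on.
    """
    for table in RAW_TABLES:
        asset_name = f"raw_eia930__{table}"
        if asset_name in context.op_execution_context.selected_output_names:
            df = pd.DataFrame()  # placeholder for the real extraction/concatenation of this page
            yield Output(df, output_name=asset_name)
```

A downstream asset would then just declare a dependency on, e.g., raw_eia930__demand rather than on the whole dictionary.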

I think the code that reads in the pickled dictionary of dataframes is in pudl.extract.extractor.partition_extractor_factory, and the thing that produces the dictionary of dataframes is pudl.extract.extractor.raw_df_factory, but maybe @e-belfer or @jdangerx know more.


Labels: csv, dagster, excel, performance
