Avoid post-concatenation memory spike in CSV / Excel extractors #3571

@zaneselvans

Description

For many of our CSV- and Excel-based datasets, we read many partitions of each "page", concatenate the page fragments together, and produce a dictionary of dataframes as an interim asset that gets pickled and written to disk.

Typically, we then have a downstream asset corresponding to each table in that pickled dictionary, so each of those downstream assets reads in all of the dataframes even though it only needs to access one of them. These downstream assets also generally kick off at the same time, resulting in a big spike in memory usage. This was particularly significant for the EIA-930, which has 3 tables of hourly data, all of which are pretty big: memory usage was spiking to ~12 GB of RAM.
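
For context, a minimal sketch of the current pattern (asset and table names here are just illustrative, not the real PUDL ones): one asset returns the whole dict of raw dataframes, and each per-table asset takes that dict as an input, so unpickling the upstream asset loads every table.

```python
import pandas as pd
from dagster import asset


@asset
def raw_eia930__all_dfs() -> dict[str, pd.DataFrame]:
    """Concatenate all extracted pages into one dict, which gets pickled as a single asset."""
    return {
        "balance": pd.DataFrame(),      # placeholders for the real page extraction
        "demand": pd.DataFrame(),
        "interchange": pd.DataFrame(),
    }


@asset
def raw_eia930__demand(raw_eia930__all_dfs: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Needs only one table, but depending on the dict asset loads all of them into memory."""
    return raw_eia930__all_dfs["demand"]
```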

It seems like it should be possible to modify the CSV / Excel Extractor class so that when it produces multiple raw dataframes, each one is output separately, probably via a (hopefully subsettable) @multi_asset, such that each downstream asset reads in only the one dataframe it depends on.
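
Something along these lines, using Dagster's subsettable @multi_asset (a minimal sketch; the table list and asset naming are assumptions for illustration, not the actual PUDL extractor code):

```python
import pandas as pd
from dagster import AssetExecutionContext, AssetOut, Output, multi_asset

# Hypothetical page/table names for a dataset like EIA-930.
RAW_TABLES = ("balance", "demand", "interchange")


@multi_asset(
    # One output per raw table, instead of one pickled dict of dataframes.
    outs={f"raw_eia930__{table}": AssetOut(is_required=False) for table in RAW_TABLES},
    can_subset=True,
)
def raw_eia930(context: AssetExecutionContext):
    """Emit each extracted page as its own asset output.

    Because the multi-asset is subsettable, only the outputs selected for this
    run are produced, and downstream assets only ever load the single dataframe
    they depend on.
    """
    for table in RAW_TABLES:
        asset_name = f"raw_eia930__{table}"
        if asset_name in context.op_execution_context.selected_output_names:
            df = pd.DataFrame()  # placeholder for the real extraction/concatenation of this page
            yield Output(df, output_name=asset_name)
```

A downstream asset would then just declare a dependency on, e.g., raw_eia930__demand rather than on the whole dictionary.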

I think the code that reads in the pickled dictionary of dataframes is in pudl.extract.extractor.partition_extractor_factory, and the thing that produces the dictionary of dataframes is pudl.extract.extractor.raw_df_factory, but maybe @e-belfer or @jdangerx know more.


Labels: csv, dagster, excel, performance
