Skip to content

Change EPA CEMS data types to reduce memory usage #1049

@TrentonBush

Description

@TrentonBush

EPA CEMS is bigger than laptop memory, there is no getting around that. But after loading, fully 50% of memory is taken by one column, 'unitid'. This column is a string dtype, but could probably be changed to categorical, saving 2GB memory per year of data and 50GB memory for the full dataset.

This can be done by users after loading the data, but I think would be better to change the dtype in the ETL pipeline before it is written to parquet, rather than after loading. Changing dtypes after loading requires reading the whole string column into memory first, which can exceed machine memory, crash the process, and prevent the dataset from loading at all.

Alternatively, the function to load epacems could read data in chunks, change the dtype of each, and concatenate them.

import pudl
import pandas as pd
from pathlib import Path

pudl_settings = pudl.workspace.setup.get_defaults()
cems_path = Path(pudl_settings["parquet_dir"]) / "epacems"

# read 1 year, all columns, all states of cems
cems_1year = pd.read_parquet(
        cems_path,
        use_nullable_dtypes=True,
        columns=None, # all cols
        filters=pudl.output.epacems.year_state_filter(
            states=pudl.constants.us_states.keys(), # all states
            years=2019,
        ),
)

print(cems.memory_usage(deep=True).sum() / 2**20) # megabytes total
print(cems.memory_usage(deep=True) / 2**20) # megabytes per column
print(cems.unitid.astype('category').memory_usage(deep=True) / 2**20) # megabytes unitid as categorical
print(cems.unitid.unique().shape[0]) # cardinality of unitid

Which produces the following:

4179.72

Index                           0.000122
plant_id_eia                  172.931328
unitid                       2069.946968
operating_datetime_utc        276.690125
operating_time_hours          138.345062
gross_load_mw                 138.345062
steam_load_1000_lbs           138.345062
so2_mass_lbs                  138.345062
so2_mass_measurement_code      34.587056
nox_rate_lbs_mmbtu            138.345062
nox_rate_measurement_code      34.587120
nox_mass_lbs                  138.345062
nox_mass_measurement_code      34.587120
co2_mass_tons                 138.345062
co2_mass_measurement_code      34.587056
heat_content_mmbtu            138.345062
facility_id                   172.931328
unit_id_epa                   172.931328
year                           34.586987
state                          34.590045
dtype: float64

69.29

1472

This shows that 'unitid' takes 2069 out of 4179 total MB, but when cast to categorical, only takes 69MB, a savings of 2GB. The cardinality is well within range of the categorical dtype, with only 1472 unique categories. There will doubtlessly be more categories with additional years, but only slightly more.

Metadata

Metadata

Assignees

No one assigned

    Labels

    data-typesDtype conversions, standardization and implications of data typesepacemsIntegration and analysis of the EPA CEMS dataset.performanceMake PUDL run faster!

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions