Change EPA CEMS data types to reduce memory usage

EPA CEMS is bigger than laptop memory, there is no getting around that. But after loading, fully 50% of memory is taken by one column, 'unitid'. This column is a string dtype, but could probably be changed to categorical, saving 2GB memory per year of data and 50GB memory for the full dataset.

This can be done by users after loading the data, but I think would be better to change the dtype in the ETL pipeline before it is written to parquet, rather than after loading. Changing dtypes after loading requires reading the whole string column into memory first, which can exceed machine memory, crash the process, and prevent the dataset from loading at all.

Alternatively, the function to load epacems could read data in chunks, change the dtype of each, and concatenate them.


```
import pudl
import pandas as pd
from pathlib import Path

pudl_settings = pudl.workspace.setup.get_defaults()
cems_path = Path(pudl_settings["parquet_dir"]) / "epacems"

# read 1 year, all columns, all states of cems
cems_1year = pd.read_parquet(
        cems_path,
        use_nullable_dtypes=True,
        columns=None, # all cols
        filters=pudl.output.epacems.year_state_filter(
            states=pudl.constants.us_states.keys(), # all states
            years=2019,
        ),
)

print(cems.memory_usage(deep=True).sum() / 2**20) # megabytes total
print(cems.memory_usage(deep=True) / 2**20) # megabytes per column
print(cems.unitid.astype('category').memory_usage(deep=True) / 2**20) # megabytes unitid as categorical
print(cems.unitid.unique().shape[0]) # cardinality of unitid
```

Which produces the following:
```
4179.72

Index                           0.000122
plant_id_eia                  172.931328
unitid                       2069.946968
operating_datetime_utc        276.690125
operating_time_hours          138.345062
gross_load_mw                 138.345062
steam_load_1000_lbs           138.345062
so2_mass_lbs                  138.345062
so2_mass_measurement_code      34.587056
nox_rate_lbs_mmbtu            138.345062
nox_rate_measurement_code      34.587120
nox_mass_lbs                  138.345062
nox_mass_measurement_code      34.587120
co2_mass_tons                 138.345062
co2_mass_measurement_code      34.587056
heat_content_mmbtu            138.345062
facility_id                   172.931328
unit_id_epa                   172.931328
year                           34.586987
state                          34.590045
dtype: float64

69.29

1472

```

This shows that 'unitid' takes 2069 out of 4179 total MB, but when cast to categorical, only takes 69MB, a savings of 2GB. The cardinality is well within range of the categorical dtype, with only 1472 unique categories. There will doubtlessly be more categories with additional years, but only slightly more.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Change EPA CEMS data types to reduce memory usage #1049

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Change EPA CEMS data types to reduce memory usage #1049

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions