-
-
Notifications
You must be signed in to change notification settings - Fork 128
Description
EPA CEMS is bigger than laptop memory, there is no getting around that. But after loading, fully 50% of memory is taken by one column, 'unitid'. This column is a string dtype, but could probably be changed to categorical, saving 2GB memory per year of data and 50GB memory for the full dataset.
This can be done by users after loading the data, but I think would be better to change the dtype in the ETL pipeline before it is written to parquet, rather than after loading. Changing dtypes after loading requires reading the whole string column into memory first, which can exceed machine memory, crash the process, and prevent the dataset from loading at all.
Alternatively, the function to load epacems could read data in chunks, change the dtype of each, and concatenate them.
import pudl
import pandas as pd
from pathlib import Path
pudl_settings = pudl.workspace.setup.get_defaults()
cems_path = Path(pudl_settings["parquet_dir"]) / "epacems"
# read 1 year, all columns, all states of cems
cems_1year = pd.read_parquet(
cems_path,
use_nullable_dtypes=True,
columns=None, # all cols
filters=pudl.output.epacems.year_state_filter(
states=pudl.constants.us_states.keys(), # all states
years=2019,
),
)
print(cems.memory_usage(deep=True).sum() / 2**20) # megabytes total
print(cems.memory_usage(deep=True) / 2**20) # megabytes per column
print(cems.unitid.astype('category').memory_usage(deep=True) / 2**20) # megabytes unitid as categorical
print(cems.unitid.unique().shape[0]) # cardinality of unitid
Which produces the following:
4179.72
Index 0.000122
plant_id_eia 172.931328
unitid 2069.946968
operating_datetime_utc 276.690125
operating_time_hours 138.345062
gross_load_mw 138.345062
steam_load_1000_lbs 138.345062
so2_mass_lbs 138.345062
so2_mass_measurement_code 34.587056
nox_rate_lbs_mmbtu 138.345062
nox_rate_measurement_code 34.587120
nox_mass_lbs 138.345062
nox_mass_measurement_code 34.587120
co2_mass_tons 138.345062
co2_mass_measurement_code 34.587056
heat_content_mmbtu 138.345062
facility_id 172.931328
unit_id_epa 172.931328
year 34.586987
state 34.590045
dtype: float64
69.29
1472
This shows that 'unitid' takes 2069 out of 4179 total MB, but when cast to categorical, only takes 69MB, a savings of 2GB. The cardinality is well within range of the categorical dtype, with only 1472 unique categories. There will doubtlessly be more categories with additional years, but only slightly more.