
Reintegrate 88888 99999 #4291


Open · wants to merge 35 commits into main

Conversation

@aesharpe aesharpe (Member) commented May 22, 2025

Overview

Closes #808

What problem does this address?

Adds rows where utility_id_eia is 99999 or 88888 back into eia923 and eia861 (see issue linked above for more detail).

What did you change?

  • Remove documentation that describes dropping those values
  • Remove the lines of code that drop these values from the 923 and 861 tables
  • Write a function that combines certain 88888 rows when necessary (see the sketch after this list)
  • Write a unit test that checks the combination of 88888 rows
  • Update the data source documentation page
  • Update the row count validation tests
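
As a rough illustration of the combining logic (a sketch only, not the actual implementation: the function and variable names here are placeholders, and the real _combine_88888_values may differ in its details):

import pandas as pd


def combine_88888_rows(df: pd.DataFrame, idx_cols: list[str]) -> pd.DataFrame:
    """Sum the numeric values of 88888 rows that share a primary key.

    Groups whose non-numeric columns disagree (e.g. one Y and one N) can't
    be reconciled, so they are dropped, which loses a small amount of data.
    """
    mask = df["utility_id_eia"].eq(88888).fillna(False)
    other, todo = df[~mask], df[mask]
    num_cols = [c for c in todo.select_dtypes("number").columns if c not in idx_cols]
    non_num_cols = [c for c in todo.columns if c not in num_cols and c not in idx_cols]

    # Keep only groups whose non-numeric values agree; conflicting groups
    # are excluded entirely.
    grouped = todo.groupby(idx_cols, dropna=False)
    reconcilable = grouped.filter(
        lambda g: (g[non_num_cols].nunique(dropna=True) <= 1).all()
    )
    regrouped = reconcilable.groupby(idx_cols, dropna=False)
    combined = pd.concat(
        [regrouped[non_num_cols].first(), regrouped[num_cols].sum(min_count=1)],
        axis=1,
    ).reset_index()
    return pd.concat([other, combined], ignore_index=True)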

Possibilities

  • Generic function: we talked about making a more generic helper for combining rows like the 88888 ones, but I didn't want to spend a lot of time making something overly generic or handling every edge case. For the time being it only applies to 88888/99999 rows, but we could adapt it in the future if needed.
  • Investigate 88888/99999 plant_id_eia: some of these could maybe be added back in, but that could also be more complicated and out of scope here. We should at least mention them in the docs.

Documentation

Make sure to update relevant aspects of the documentation:

  • Update the release notes: reference the PR and related issues.
  • Update relevant Data Source jinja templates (see docs/data_sources/templates).
  • Update relevant table or source description metadata (see src/metadata).
  • Review and update any other aspects of the documentation that might be affected by this PR.

Testing

How did you make sure this worked? How can a reviewer verify this?

  • Materialize 861 and 923 assets in dagster
  • Run pytest test/unit/transform/eia861.py

To-do list

Final Run Through

  • If updating analyses or data processing functions: make sure to update row count expectations in dbt tests.
  • Run make pytest-coverage locally to ensure that the merge queue will accept your PR.
  • Review the PR yourself and call out any questions or issues you have.
  • For minor ETL changes or data additions, once make pytest-coverage passes, make sure you have a fresh full PUDL DB downloaded locally, materialize new/changed assets and all their downstream assets and run relevant data validation tests using pytest and --live-dbs.
  • For bigger ETL or data changes run the full ETL locally and then run the data validations using make pytest-validate.
  • Alternatively, run the build-deploy-pudl GitHub Action manually.

@aesharpe aesharpe requested a review from krivard May 22, 2025 20:12
@aesharpe aesharpe self-assigned this May 22, 2025
@aesharpe aesharpe added the eia923, data-cleaning, eia861, enhancement, data-loss, and missing-info labels May 22, 2025
@aesharpe aesharpe moved this from New to In review in Catalyst Megaproject May 22, 2025
@krivard krivard (Contributor) left a comment

The hard parts look right! Tagged a couple of potential optimizations, an order-of-operations question, and a couple of potential things to document.

@@ -2176,7 +2220,6 @@ def core_operational_data_eia861(raw_eia861__operational_data: pd.DataFrame):

Transformations include:

* Remove rows with utility ids 88888.
Contributor:
We might want to note here that we're removing 88888 rows with conflicting/irreconcilable non-numeric entries.

Member:
Agree, worth putting it here as well as in the pre_process docstring.

return pd.concat([non_num_group, num_group], axis=1).reset_index()
# Exclude rows with 88888 utility_id_eia that can't be combined due to different values in
# non-numeric columns.
return None
Contributor:
Is there a rough idea of how many rows we have to exclude? Would that be important for users to know?

@aesharpe aesharpe (Member, Author) May 22, 2025

That's a good point. I might add a check that there aren't more than a certain number of rows getting dropped/combined.

# Make sure that the number of rows altered stays somewhat small. We don't expect
# there to be a lot of dropped or combined 88888 rows.
if (len(df) - len(recombined_df)) > 15:
    ForkedPdb().set_trace()
Member (author):
For testing, will remove later
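
ForkedPdb is the usual trick for dropping into pdb from inside a forked worker process (as in a dagster run), where a plain pdb.set_trace() can't attach to the terminal. A common recipe, assuming PUDL's version looks similar:

import pdb
import sys


class ForkedPdb(pdb.Pdb):
    """A pdb that works in forked child processes by reattaching stdin."""

    def interaction(self, *args, **kwargs):
        _stdin = sys.stdin
        try:
            # On Unix, /dev/stdin points back at the controlling terminal.
            sys.stdin = open("/dev/stdin")
            super().interaction(*args, **kwargs)
        finally:
            sys.stdin = _stdin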

Comment on lines 1071 to 1079
if (len(df) - len(recombined_df)) > 15:
    ForkedPdb().set_trace()
    raise AssertionError(
        f"Number of 88888 rows has changed by more than expected: {len(df) - len(recombined_df)}!"
    )
# Check to make sure that the idx_cols are actually valid primary key cols:
return recombined_df

return df
@aesharpe aesharpe (Member, Author) May 23, 2025

Fails right now -- investigating what this threshold should be. For now, making sure all the dropped rows are being dropped for the right reason.

@aesharpe aesharpe marked this pull request as ready for review June 10, 2025 19:27
@aesharpe aesharpe requested a review from jdangerx June 10, 2025 19:27
@jdangerx jdangerx (Member) left a comment

Looks pretty good - the docs updates help a lot with clarity about what the hell we're doing with 88888. Just a few blocking questions but nothing structural - let me know if anything is confusing!

This function also checks for duplicate primary key values in the reshaped data, and
consolidates them by summing the data values. This is necessary because the EIA-861
data is not always clean, and sometimes contains duplicate records that are
identical except for the values in the class columns.
Member:

clarification for me: what are the 'class columns'?
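
For reference, the consolidation described in that docstring boils down to a groupby-sum over the primary key columns. A minimal runnable sketch (the column names and values here are made up):

import pandas as pd

idx_cols = ["report_date", "utility_id_eia", "customer_class"]
reshaped = pd.DataFrame(
    {
        "report_date": ["2019-01-01", "2019-01-01"],
        "utility_id_eia": [88888, 88888],
        "customer_class": ["residential", "residential"],
        "sales_mwh": [100.0, 300.0],
    }
)
# Duplicate primary keys are consolidated by summing their data values;
# min_count=1 keeps all-null groups null instead of coercing them to zero.
deduped = reshaped.groupby(idx_cols, as_index=False).sum(min_count=1)
# deduped is a single row with sales_mwh == 400.0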

# Split raw df into primary keys plus nerc region and other value cols
nerc_df = df[idx_cols].copy()
other_df = df.drop(columns="nerc_region").set_index(idx_no_nerc)

Member:

I don't see any change in the intended behavior of the function... was this code just hanging out doing nothing useful before?

This function sums rows with a utility_id_eia of 88888 into a single row by
primary key. It drops rows with a utility_id_eia of 88888 if there are non-numeric
columns with different values that are impossible to combine. E.g.: boolean
columns where one value is Y and the other is N. This results in a small loss of
Member:

non-blocking: you mention a few times that the loss of data is small - might be good to quantify that in the docs so people know what to expect.

rto_operation=lambda x: (
    x.rto_operation.fillna(False).replace({"N": False, "Y": True})
),
rto_operation=lambda x: (_make_yn_bool(x.rto_operation.fillna(False))),
Member:

blocking: Do we no longer need to clean the "Y" and "N" values?
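
For context on the question above: the helper isn't shown in this diff, but _make_yn_bool is presumably a small Y/N-to-boolean mapper along these lines (a guess, not the actual PUDL source):

import pandas as pd


def _make_yn_bool(ser: pd.Series) -> pd.Series:
    """Map "Y"/"N" flags to booleans, passing other values through unchanged."""
    return ser.replace({"Y": True, "N": False})

If so, the new one-liner is equivalent to the old fillna-then-replace chain, just with the replace mapping factored out.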

def test__combine_88888_values(actual, expected):
    """Test that combine_88888 correctly combines data from multiple sources."""
    idx_cols = ["report_date", "utility_id_eia", "state"]
    actual_test = eia861._combine_88888_values(actual, idx_cols)
Member:

nit: the typical vocabulary word here is "observed" vs. "expected" value. And "actual" would probably be the "input" or "raw" data 🤷

@github-project-automation github-project-automation bot moved this from In review to In progress in Catalyst Megaproject Jun 10, 2025
@zaneselvans zaneselvans (Member) commented

I'm testing out the new "update the row counts for me" functionality by running a deployment on this branch. It should take about 2 hours to run, and the build should fail (due to the discrepancy in row count expectations) but should also produce a new etl_full_row_counts.csv that can be downloaded to your local repo with...

gcloud storage cp gs://builds.catalyst.coop/2025-06-23-2324-17a492713-reintegrate-88888-99999/etl_full_row_counts.csv dbt/seeds/

At which point git diff will show you what changes it found.
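
For example, after copying the file into dbt/seeds/ as above:

git diff dbt/seeds/etl_full_row_counts.csv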

"""report_date,utility_id_eia,state,ba_code,value
2019-01-01,88888,TX,ERCOT,100
2019-01-01,88888,TX,ERCOT,300
2019-01-01,88888,pd.NA,ERCOT,800
Member:

Sorry for the drive-by comment but is the literal string "pd.NA" what we want here? Or should it just be a blank entry, which will be turned into an actual pandas null value by virtue of the apply_pudl_dtypes?

Status: In progress · 4 participants

Successfully merging this pull request may close: Re-integrate 88888 and 99999 data to eia861