Reintegrate 88888 99999 #4291
Conversation
The hard parts look right! Tagged a couple of potential optimizations, an order-of-operations question, and a couple of potential things to document.
```diff
@@ -2176,7 +2220,6 @@ def core_operational_data_eia861(raw_eia861__operational_data: pd.DataFrame):

    Transformations include:

    * Remove rows with utility ids 88888.
```
We might want to note here that we're removing 88888 rows with conflicting/irreconcilable non-numeric entries
Agree, worth putting here as well as in the pre_process docstring.
src/pudl/transform/eia861.py (Outdated)

```python
return pd.concat([non_num_group, num_group], axis=1).reset_index()
# Exclude rows with 88888 utility_id_eia that can't be combined due to different
# values in non-numeric columns.
return None
```
Is there a rough idea of how many rows we have to exclude? And would that be important for users to know?
That's a good point. I might add a check that there aren't more than a certain number of rows getting dropped/combined.
…ly_dynamic_pricing table
… breakpoint for testing; will remove later
src/pudl/transform/eia861.py (Outdated)

```python
# Make sure that the number of rows altered stays somewhat small. We don't expect
# there to be a lot of dropped or combined 88888 rows.
if (len(df) - len(recombined_df)) > 15:
    ForkedPdb().set_trace()
```
For testing, will remove later
src/pudl/transform/eia861.py (Outdated)

```python
if (len(df) - len(recombined_df)) > 15:
    ForkedPdb().set_trace()
    raise AssertionError(
        f"Number of 88888 rows has changed by more than expected: {len(df) - len(recombined_df)}!"
    )
# Check to make sure that the idx_cols are actually valid primary key cols:
return recombined_df

return df
```
Fails right now -- investigating what this threshold should be. For now, making sure all the dropped rows are being dropped for the right reason
…88888 rows when merging back together
Looks pretty good - the docs updates help a lot with clarity about what the hell we're doing with 88888. Just a few blocking questions but nothing structural - let me know if anything is confusing!
src/pudl/transform/eia861.py (Outdated)

```
This function also checks for duplicate primary key values in the reshaped data, and
consolidates them by summing the data values. This is necessary because the EIA-861
data is not always clean, and sometimes contains duplicate records that are
identical except for the values in the class columns.
```
clarification for me: what are the 'class columns'?
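For illustration only (the column names here are made up, not the actual class columns the docstring refers to): the consolidation the docstring describes amounts to a groupby-sum over the primary-key columns, so duplicate keys collapse to one record:

```python
import pandas as pd

# Two records share the same primary key after reshaping:
df = pd.DataFrame({
    "report_date": ["2019-01-01", "2019-01-01"],
    "utility_id_eia": [88888, 88888],
    "state": ["TX", "TX"],
    "sales_mwh": [100.0, 300.0],
})
idx_cols = ["report_date", "utility_id_eia", "state"]
# Duplicate primary keys are consolidated by summing the data values:
deduped = df.groupby(idx_cols, as_index=False).sum()
print(len(deduped), deduped.loc[0, "sales_mwh"])  # 1 400.0
```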
```python
# Split raw df into primary keys plus nerc region and other value cols
nerc_df = df[idx_cols].copy()
other_df = df.drop(columns="nerc_region").set_index(idx_no_nerc)
```
I don't see any change in the intended behavior of the function... was this code just hanging out doing nothing useful before?
```
This function sums rows with a utility_id_eia of 88888 into a single row by
primary key. It drops rows with a utility_id_eia of 88888 if there are non-numeric
columns with different values that are impossible to combine. E.g.: boolean
columns where one value is Y and the other is N. This results in a small loss of
```
non-blocking: you mention a few times that the loss of data is small - might be good to quantify that in the docs so people know what to expect.
```diff
-rto_operation=lambda x: (
-    x.rto_operation.fillna(False).replace({"N": False, "Y": True})
-),
+rto_operation=lambda x: (_make_yn_bool(x.rto_operation.fillna(False))),
```
blocking: Do we no longer need to clean the "Y" and "N" values?
test/unit/transform/eia861.py (Outdated)

```python
def test__combine_88888_values(actual, expected):
    """Test that combine_88888 correctly combines data from multiple sources."""
    idx_cols = ["report_date", "utility_id_eia", "state"]
    actual_test = eia861._combine_88888_values(actual, idx_cols)
```
nit: the typical vocabulary word here is "observed" vs. "expected" value. And "actual" would probably be the "input" or "raw" data 🤷
Co-authored-by: Dazhong Xia <[email protected]>
…emove pd.to_pickle line and improve variable names
I'm testing out the new "update the row counts for me" functionality by running a deployment on this branch. It should take about 2 hours to run, and the build should fail (due to the discrepancy in row count expectations) but should also produce a new `etl_full_row_counts.csv`:

```shell
gcloud storage cp gs://builds.catalyst.coop/2025-06-23-2324-17a492713-reintegrate-88888-99999/etl_full_row_counts.csv dbt/seeds/
```

At which point …
test/unit/transform/eia861.py (Outdated)

```python
"""report_date,utility_id_eia,state,ba_code,value
2019-01-01,88888,TX,ERCOT,100
2019-01-01,88888,TX,ERCOT,300
2019-01-01,88888,pd.NA,ERCOT,800
```
Sorry for the drive-by comment, but is the literal string "pd.NA" what we want here? Or should it just be a blank entry, which will be turned into an actual pandas null value by virtue of `apply_pudl_dtypes`?
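The concern is easy to check directly: `pandas.read_csv` keeps the literal text `pd.NA` as a plain string, while an empty field parses as a real null (a quick demonstration, not project code):

```python
import io
import pandas as pd

csv = io.StringIO(
    "report_date,utility_id_eia,state,ba_code,value\n"
    "2019-01-01,88888,pd.NA,ERCOT,800\n"
    "2019-01-01,88888,,ERCOT,800\n"
)
df = pd.read_csv(csv)
print(repr(df.loc[0, "state"]))    # 'pd.NA' -- a literal string, not a null
print(pd.isna(df.loc[1, "state"]))  # True -- only the blank field is null
```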
Overview

Closes #808

What problem does this address?

Adds rows where `utility_id_eia` is `99999` or `88888` back into eia923 and eia861 (see issue linked above for more detail).

What did you change?

Possibilities

88888/99999 `plant_id_eia`: There are some of these that could maybe be added back in? But that could also be more complicated and OOS for this. We should at least mention them in the docs.

Documentation
Make sure to update relevant aspects of the documentation:

- `docs/data_sources/templates`
- `src/metadata`

Testing
How did you make sure this worked? How can a reviewer verify this?

`pytest test/unit/transform/eia861.py`
To-do list

Final Run Through

- Update the `dbt` tests.
- Run `make pytest-coverage` locally to ensure that the merge queue will accept your PR.
- If `make pytest-coverage` passes, make sure you have a fresh full PUDL DB downloaded locally, materialize new/changed assets and all their downstream assets, and run relevant data validation tests using `pytest` and `--live-dbs`.
- Run `make pytest-validate`.
- Run the `build-deploy-pudl` GitHub Action manually.