
Reintegrate 88888 99999 #4291


Open · wants to merge 35 commits into main

Conversation

@aesharpe aesharpe (Member) commented May 22, 2025

Overview

Closes #808

What problem does this address?

Adds rows where utility_id_eia is 99999 or 88888 back into eia923 and eia861 (see issue linked above for more detail).

What did you change?

  • Remove documentation that describes dropping those values
  • Remove the lines of code that drop these values from the 923 and 861 tables
  • Write a function that combines certain 88888 rows when necessary (see the sketch after this list)
  • Write a unit test that checks the combination of 88888 rows
  • Update the data source documentation page
  • Update the row count validation tests
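
As a rough illustration of the combining logic (a sketch only, not the actual implementation: the function and variable names here are placeholders, and the real _combine_88888_values may differ in its details):

import pandas as pd


def combine_88888_rows(df: pd.DataFrame, idx_cols: list[str]) -> pd.DataFrame:
    """Sum the numeric values of 88888 rows that share a primary key.

    Groups whose non-numeric columns disagree (e.g. one Y and one N) can't
    be reconciled, so they are dropped, which loses a small amount of data.
    """
    mask = df["utility_id_eia"].eq(88888).fillna(False)
    other, todo = df[~mask], df[mask]
    num_cols = [c for c in todo.select_dtypes("number").columns if c not in idx_cols]
    non_num_cols = [c for c in todo.columns if c not in num_cols and c not in idx_cols]

    # Keep only groups whose non-numeric values agree; conflicting groups
    # are excluded entirely.
    grouped = todo.groupby(idx_cols, dropna=False)
    reconcilable = grouped.filter(
        lambda g: (g[non_num_cols].nunique(dropna=True) <= 1).all()
    )
    regrouped = reconcilable.groupby(idx_cols, dropna=False)
    combined = pd.concat(
        [regrouped[non_num_cols].first(), regrouped[num_cols].sum(min_count=1)],
        axis=1,
    ).reset_index()
    return pd.concat([other, combined], ignore_index=True)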

Possibilities

  • Generic function: we talked about making a more generic helper for combining rows like the 88888 ones, but I didn't want to spend a lot of time making something overly generic or handling every edge case. For the time being it only applies to 88888/99999 rows, but we could adapt it in the future if needed.
  • Investigate 88888/99999 plant_id_eia: some of these could maybe be added back in, but that could also be more complicated and out of scope here. We should at least mention them in the docs.

Documentation

Make sure to update relevant aspects of the documentation:

  • Update the release notes: reference the PR and related issues.
  • Update relevant Data Source jinja templates (see docs/data_sources/templates).
  • Update relevant table or source description metadata (see src/metadata).
  • Review and update any other aspects of the documentation that might be affected by this PR.

Testing

How did you make sure this worked? How can a reviewer verify this?

  • Materialize 861 and 923 assets in dagster
  • Run pytest test/unit/transform/eia861.py

To-do list

Final Run Through

  • If updating analyses or data processing functions: make sure to update row count expectations in dbt tests.
  • Run make pytest-coverage locally to ensure that the merge queue will accept your PR.
  • Review the PR yourself and call out any questions or issues you have.
  • For minor ETL changes or data additions, once make pytest-coverage passes, make sure you have a fresh full PUDL DB downloaded locally, materialize new/changed assets and all their downstream assets and run relevant data validation tests using pytest and --live-dbs.
  • For bigger ETL or data changes run the full ETL locally and then run the data validations using make pytest-validate.
  • Alternatively, run the build-deploy-pudl GitHub Action manually.

@aesharpe aesharpe requested a review from krivard May 22, 2025 20:12
@aesharpe aesharpe self-assigned this May 22, 2025
@aesharpe aesharpe added the eia923, data-cleaning, eia861, enhancement, data-loss, and missing-info labels May 22, 2025
@aesharpe aesharpe moved this from New to In review in Catalyst Megaproject May 22, 2025
@krivard krivard (Contributor) left a comment

The hard parts look right! Tagged a couple of potential optimizations, an order-of-operations question, and a couple of potential things to document.

@@ -2176,7 +2220,6 @@ def core_operational_data_eia861(raw_eia861__operational_data: pd.DataFrame):

Transformations include:

* Remove rows with utility ids 88888.
Contributor:
We might want to note here that we're removing 88888 rows with conflicting/irreconcilable non-numeric entries.

Member:
Agree, worth putting it here as well as in the pre_process docstring.

return pd.concat([non_num_group, num_group], axis=1).reset_index()
# Exclude rows with 88888 utility_id_eia that can't be combined due to different values in
# non-numeric columns.
return None
Contributor:
Is there a rough idea of how many rows we have to exclude? Would that be important for users to know?

@aesharpe aesharpe (Member, Author) May 22, 2025

That's a good point. I might add a check that there aren't more than a certain number of rows getting dropped/combined.

# Make sure that the number of rows altered stays somewhat small. We don't expect
# there to be a lot of dropped or combined 88888 rows.
if (len(df) - len(recombined_df)) > 15:
    ForkedPdb().set_trace()
Member (author):
For testing, will remove later
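
ForkedPdb is the usual trick for dropping into pdb from inside a forked worker process (as in a dagster run), where a plain pdb.set_trace() can't attach to the terminal. A common recipe, assuming PUDL's version looks similar:

import pdb
import sys


class ForkedPdb(pdb.Pdb):
    """A pdb that works in forked child processes by reattaching stdin."""

    def interaction(self, *args, **kwargs):
        _stdin = sys.stdin
        try:
            # On Unix, /dev/stdin points back at the controlling terminal.
            sys.stdin = open("/dev/stdin")
            super().interaction(*args, **kwargs)
        finally:
            sys.stdin = _stdin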

Comment on lines 1071 to 1079
if (len(df) - len(recombined_df)) > 15:
    ForkedPdb().set_trace()
    raise AssertionError(
        f"Number of 88888 rows has changed by more than expected: {len(df) - len(recombined_df)}!"
    )
# Check to make sure that the idx_cols are actually valid primary key cols:
return recombined_df

return df
@aesharpe aesharpe (Member, Author) May 23, 2025

Fails right now -- investigating what this threshold should be. For now, making sure all the dropped rows are being dropped for the right reason.

@aesharpe aesharpe marked this pull request as ready for review June 10, 2025 19:27
@aesharpe aesharpe requested a review from jdangerx June 10, 2025 19:27
@jdangerx jdangerx (Member) left a comment

Looks pretty good - the docs updates help a lot with clarity about what the hell we're doing with 88888. Just a few blocking questions but nothing structural - let me know if anything is confusing!

This function also checks for duplicate primary key values in the reshaped data, and
consolidates them by summing the data values. This is necessary because the EIA-861
data is not always clean, and sometimes contains duplicate records that are
identical except for the values in the class columns.
Member:

clarification for me: what are the 'class columns'?
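
For reference, the consolidation described in that docstring boils down to a groupby-sum over the primary key columns. A minimal runnable sketch (the column names and values here are made up):

import pandas as pd

idx_cols = ["report_date", "utility_id_eia", "customer_class"]
reshaped = pd.DataFrame(
    {
        "report_date": ["2019-01-01", "2019-01-01"],
        "utility_id_eia": [88888, 88888],
        "customer_class": ["residential", "residential"],
        "sales_mwh": [100.0, 300.0],
    }
)
# Duplicate primary keys are consolidated by summing their data values;
# min_count=1 keeps all-null groups null instead of coercing them to zero.
deduped = reshaped.groupby(idx_cols, as_index=False).sum(min_count=1)
# deduped is a single row with sales_mwh == 400.0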

# Split raw df into primary keys plus nerc region and other value cols
nerc_df = df[idx_cols].copy()
other_df = df.drop(columns="nerc_region").set_index(idx_no_nerc)

Member:

I don't see any change in the intended behavior of the function... was this code just hanging out doing nothing useful before?

This function sums rows with a utility_id_eia of 88888 into a single row by
primary key. It drops rows with a utility_id_eia of 88888 if there are non-numeric
columns with different values that are impossible to combine. E.g.: boolean
columns where one value is Y and the other is N. This results in a small loss of
Member:

non-blocking: you mention a few times that the loss of data is small - might be good to quantify that in the docs so people know what to expect.

rto_operation=lambda x: (
    x.rto_operation.fillna(False).replace({"N": False, "Y": True})
),
rto_operation=lambda x: (_make_yn_bool(x.rto_operation.fillna(False))),
Member:

blocking: Do we no longer need to clean the "Y" and "N" values?
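
For context on the question above: the helper isn't shown in this diff, but _make_yn_bool is presumably a small Y/N-to-boolean mapper along these lines (a guess, not the actual PUDL source):

import pandas as pd


def _make_yn_bool(ser: pd.Series) -> pd.Series:
    """Map "Y"/"N" flags to booleans, passing other values through unchanged."""
    return ser.replace({"Y": True, "N": False})

If so, the new one-liner is equivalent to the old fillna-then-replace chain, just with the replace mapping factored out.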

def test__combine_88888_values(actual, expected):
    """Test that combine_88888 correctly combines data from multiple sources."""
    idx_cols = ["report_date", "utility_id_eia", "state"]
    actual_test = eia861._combine_88888_values(actual, idx_cols)
Member:

nit: the typical vocabulary word here is "observed" vs. "expected" value. And "actual" would probably be the "input" or "raw" data 🤷

@github-project-automation github-project-automation bot moved this from In review to In progress in Catalyst Megaproject Jun 10, 2025
@zaneselvans zaneselvans (Member) commented

I'm testing out the new "update the row counts for me" functionality by running a deployment on this branch. It should take about 2 hours to run, and the build should fail (due to the discrepancy in row count expectations) but should also produce a new etl_full_row_counts.csv that can be downloaded to your local repo with...

gcloud storage cp gs://builds.catalyst.coop/2025-06-23-2324-17a492713-reintegrate-88888-99999/etl_full_row_counts.csv dbt/seeds/

At which point git diff will show you what changes it found.
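
For example, after copying the file into dbt/seeds/ as above:

git diff dbt/seeds/etl_full_row_counts.csv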

"""report_date,utility_id_eia,state,ba_code,value
2019-01-01,88888,TX,ERCOT,100
2019-01-01,88888,TX,ERCOT,300
2019-01-01,88888,pd.NA,ERCOT,800
Member:

Sorry for the drive-by comment but is the literal string "pd.NA" what we want here? Or should it just be a blank entry, which will be turned into an actual pandas null value by virtue of the apply_pudl_dtypes?

Status: In progress · 4 participants

Successfully merging this pull request may close: Re-integrate 88888 and 99999 data to eia861