Transform PHMSA company data #4005

e-belfer · 2025-01-07T18:29:44Z

Overview

Addresses part of #3770. Taking over @seeess1 's #3929 PR.

What problem does this address?

Goal of the PR is to complete the first transformation of raw PHMSA data into a core asset.

What did you change?

Added a new transform script for PHMSA data.
Made one helper method (fix_eia_na) more generic since it can be applied across data sets (and updated references accordingly).
Added a few other helper methods.
Updated column map files with 2023 values (pulled from whatever was in 2022).
Specifically changed the ordering of columns in /Users/sam/Documents/pudl/src/pudl/package_data/phmsagas/column_maps/yearly_distribution.csv since we had fax columns mapped to emails and vice versa.

Remaining tasks

Questions:

We have an unaccounted for gas %age of 5% in the full ETL and 15% in the fast. Is there a good way to parameterize this in light of the ongoing DBT transition? Dbt setup #4011

Out of scope:

Update to latest DOI and debug if needed
Handle state encoding - see Standardize encoding of state/province columns throughout PUDL #4109

Documentation

Make sure to update relevant aspects of the documentation.

- [x] Update the [release notes](https://catalystcoop-pudl.readthedocs.io/en/nightly/release_notes.html): reference the PR and related issues.
- [ ] Update relevant Data Source jinja templates (see `docs/data_sources/templates`).
- [x] Update relevant table or source description metadata (see `src/metadata`).
- [ ] Review and update any other aspects of the documentation that might be affected by this PR.

Testing

How did you make sure this worked? How can a reviewer verify this?

# To-do list
- [x] If updating analyses or data processing functions: make sure to update or write data validation tests (e.g. `test_minmax_rows()`).
- [ ] Run `make pytest-coverage` locally to ensure that the merge queue will accept your PR.
- [x] Review the PR yourself and call out any questions or issues you have.
- [ ] For minor ETL changes or data additions, once `make pytest-coverage` passes, make sure you have a fresh full PUDL DB downloaded locally, materialize new/changed assets and all their downstream assets and [run relevant data validation tests](https://catalystcoop-pudl.readthedocs.io/en/nightly/dev/testing.html#data-validation) using `pytest` and `--live-dbs`.
- [ ] For bigger ETL or data changes run the full ETL locally and then [run the data validations](https://catalystcoop-pudl.readthedocs.io/en/nightly/dev/testing.html#data-validation) using `make pytest-validate`.
- [ ] Alternatively, run the `build-deploy-pudl` GitHub Action manually.

…m-phmsagas-data

For more information, see https://pre-commit.ci

zaneselvans

Sorry this took so long. Lots of apparent data issues, and some questions about the normalization. Let me know if you want to find a time to chat synchronously early morning here / afternoon there.

src/pudl/helpers.py

src/pudl/metadata/enums.py

zaneselvans · 2025-03-06T23:22:14Z

src/pudl/metadata/resources/phmsagas.py

+                "services_shutoff_valve_in_system",
+                "services_shutoff_valve_installed",
+                "federal_land_leaks_repaired_or_scheduled",
+                "percent_unaccounted_for_gas",


There are some crazy values in the percentage of unaccounted for gas. It ranges from -99 to 563. Most values are between 0 and 20, so it seems like folks are actually reporting percentages. Only 3 above 100, but hundreds of values that are negative, mostly between 0 and 20 -- mirroring the positive distribution, but with much smaller numbers. This makes me think that these are sign errors since there's no way you can accidentally end up with 10% more gas in the pipeline is there?

In general we convert percentages to fractions so they can be used in calculations directly.

I think we should:

rename this unaccounted_for_gas_fraction

take the absolute value

divide by 100.

Potentially drop the values that are > 1.0 after all that? (there are only 3)

2017 PHMSA instructions (p. 4) say that negative calculations are possible but that they suggest reverting to 0 in this case:

If a company calculates a negative value for LAUF gas, it has likely either overcompensated for measurement conditions or overestimated known losses. Therefore, PHMSA instructs companies that calculate a negative value for percent of LAUF gas to report zero

And then on page 5:

PHMSA intends to propose changes to the instructions for PHMSA Form 7100.1-1, Gas
Distribution System Annual Report, related to calculating the percent of LAUF gas and negative
percent values. PHMSA intends to propose calculating percent LAUF gas by dividing the LAUF
volume by the gas disposition volume. PHMSA also intends to propose allowing a negative
value to be reported for percent LAUF gas. These changes would harmonize the PHMSA ·and
EIA methodologies for calculating percent LAUF gas.

This suggests to me we shouldn't be taking the absolute values of these numbers at all!

There's some interesting analysis here about the impact of these negative reports and how they're likely to be due to operator error, not sign problems. https://www.markey.senate.gov/imo/media/doc/documents/markey_lost_gas_report.pdf

So I think the options here are:

Leave negative values as is and add a note in the field description and in "Known inconsistencies"

Convert negative values to 0 and add some kind of a flag to flag these bad values

Add a second column where the negative values are converted to zero

If we look at the years where negative values are present, it seems like indeed we see enforcement of the no negative rule from 2013-2017 and then a return to allowing negative values in 2018 (and a large increase in the number of negative values reported!)

test = df[df.unaccounted_for_gas_fraction <0].report_year.value_counts() test.sort_index() report_year 1990 1 1991 1 1992 2 1993 10 1994 18 1995 19 1996 14 1997 38 1998 46 2004 28 2005 29 2006 23 2007 27 2008 24 2009 4 2012 1 2018 92 2019 106 2020 134 2021 137 2022 214 2023 244

Of the negative values, the vast majority are >-10%.

pd.cut(df[df.unaccounted_for_gas_fraction <0].unaccounted_for_gas_fraction, bins = 10).value_counts() unaccounted_for_gas_fraction (-0.0998, -1.8e-05] 1148 (-0.2, -0.0998] 40 (-0.299, -0.2] 11 (-0.399, -0.299] 4 (-0.898, -0.798] 2 (-0.999, -0.898] 2 (-0.798, -0.698] 2 (-0.499, -0.399] 1 (-0.698, -0.599] 1 (-0.599, -0.499] 1

Wow, thanks for digging into the background. Given that there's already an extensive explanation of it in I agree we should not take the absolute value, and make sure we document that there are known negative values in the column description and provide a reference to a fuller explanation in the data source docs page. And probably pull the PHMSA instructions PDF into the data source docs directory like we have for the EIA and FERC forms -- especially since they don't always preserve access to older versions of those documents.

zaneselvans · 2025-03-06T23:26:49Z

src/pudl/metadata/resources/phmsagas.py

+                "federal_land_leaks_repaired_or_scheduled",
+                "percent_unaccounted_for_gas",
+                "additional_information",
+                "preparer_email",


As with the address fields, we've got a number of fields that describe the person filing the form and they overlap a lot. It would be nice to either standardize them, or make the data source / agency / form specific, but not here or now.

I don't see comparable fields for email or fax anywhere in our field descriptions - am I missing something? Generally though I do agree, I'm writing this up in a separate issue.

I thought I had seen them for some of the other data sources that include contact information for whoever it is that filled out the forms but maybe I'm misremembering and we just have names and phone numbers.

zaneselvans · 2025-03-06T23:39:24Z

src/pudl/metadata/resources/phmsagas.py

+        "description": (
+            "This table contains operator-level natural gas distribution"
+            "data, corresponding to Parts A and D-I of the 2023 PHMSA gas "
+            "distribution system annual report. That includes data on the "


Some of the data mentioned in this table-level description does not appear to be in the table, including:

type of operator

type of gas being transported

The part about each report corresponding to one operator in one state is a little confusing. Looking at e.g. operator_id_phmsa==20211 they often submit many reports in the same year, all of which are labeled Initial and there's no indication that they pertain to different states. Is there a state column missing that's about what state's data they're reporting, rather than what state the office or HQ is in?

Looking at the raw table that feeds into this one, if this table really is intended to have one record per operator-state-year, maybe operating_state is the column that got dropped but should have been kept to differentiate between those types of records?

I also notice that it seems like each operator either has an Initial or a Supplemental filing in a given year. Is that expected? Are Supplemental filings meant to replce Initial filings completely? Or add to them?

Given that the raw PHMSA table that feeds into this one has 300+ columns, this is a pretty substantial downselection. It would be helpful to offer some explanation as to what criteria was used to group these columns together into a single table.

Good catch, I have now added in the fields that got missed in the original downselection, including operator type and gas type and state. Parts A and D-I of the form are the ones that have a natural primary key of report ID - the other parts of the field contain many categoricals in the columns that will want to be collapsed, forming different PKs for these different parts of the dataset. Thus the logic of starting with these parts of the form, with the idea that the other sections will form separate tables with distinct PKs. I can update the description to make this clearer.

Re: Initial vs Supplemental, there are no duplicates of operator ID + report year + state + commodity, meaning that supplemental reports must overwrite the initial reports.

Great, this makes the table make much more sense. And it seems like (operator_id_phmsa, report_year, state, commodity) is probably the natural primary key for the table. report_number is unique too, but it's more of a pseudo-key that doesn't really convey structural information about the table. It feels more like a column that should be included with an FK relationship to a filings table that contains all of the filing-specific information that's unique to a given filing, and would probably pertain to all of the records that come from that filing in all of the normalized tables that are constructed from the information in this big wide table.

Looking at the columns that are gathered in this yearly_distribution_operators it feels like there's a mix of information describing not only the operators and the filings, but also a summary of the kinds of incidents associated with the operator. Does it make conceptual sense to keep them all in the same table? It's pretty wide and they feel like different blocks of information.

When I use those four columns as the index I end up ~1000 duplicate records. operating_state contains a few hundred missing values and commodity is mostly null while operator_id_phmsa and report_year are completely non-null. Leaving commodity out of the index increases the number of duplicates a bit. The duplicates I always seem to have different supplemental_report_number and often include a mix of initial & supplemental report types, and occasionally some differences in the operator_name_phmsa but they don't usually seem material.

Not totally sure what the right thing to do here is.

src/pudl/transform/phmsagas.py

zaneselvans · 2025-03-07T17:13:58Z

src/pudl/transform/phmsagas.py

+    # Identify non-unique groups based on our PKs
+    non_unique_groups = df[


I don't understand the intention or the actual impacts of what's happening between here and the end of the transform function. What kind of duplication are we attempting to identify and remove? What does that have to do with whether the operator name contains the city name? Would it still make sense if we had per operating_state records in here?

There are ~10 records that have an address of "test" and a duplicated report ID. This method basically finds these duplicate IDs and keeps the one where there is no "test" in the name and where the city is contained in the operator name - the second condition captured a few additional edge cases. I did realize that the asset check for this is working on the already-cleaned data which doesn't help us one bit, so I moved this to an assertion in the transform

Co-authored-by: Zane Selvans <[email protected]>

For more information, see https://pre-commit.ci

…-cooperative/pudl into phmsa-company-transform

zaneselvans

I left some individual comments in the threads that were already going but I think the outstanding issues here are:

We need to clearly document the different meanings of the various date columns and make sure that they have the same meanings as columns with the same names elsewhere in PUDL. The one that will cause the most trouble right now is report_date since it is ubiquitous in PUDL and has a different meaning. I think it should be renamed, and report_year should be turned into a first-day-of-year date column and renamed report_date since that's consistent with the rest of PUDL.
From your research on the missing gas, it seems like the best option is to not touch the values (convert them to fractions, but don't take the absolute values or set them to zero) and refer users to the background material. I think we should get that background material into our docs folder like we have for EIA and FERC datasets (historical instructions / blank forms).
I think there are actually ~3 different tables hiding in the collection of columns being pulled in here, and they should probably be turned into 3 different assets. I think the 3 conceptual tables are:
- information about filings (with PK report_number)
- information about distribution operators (conceptually reported on the basis of report year, state, and operator ID)
- summary information about incidents (also conceptually reported on the basis of report year, state, and operator ID)
All those conceptual tables should retain the report_number column with a FK relationship that points back to the filings table so we can tell what filing the original information came from, but report_number doesn't seem like a real natural PK for these tables.
Unfortunately, what appear to be the natural primary keys don't quite result in unique keys, so it seems like there's some more investigation required to understand what's going on.

sam and others added 30 commits September 21, 2024 13:53

Update phmsagas DOI and start transformation

d41717d

Merge remote-tracking branch 'upstream/main' into issue-3770-transfor…

ba27b7e

…m-phmsagas-data

Starting data transformation

b40071e

Update notebook and change etl_fast phmsagas years

90fd277

Add 2023 package data columns for new phmsagas run

c5406b1

Added documentation

9fc3f45

Add troubleshooting to index

c3e2c67

Update troubleshooting

1bfb71d

Add helpers

7760e9a

Temp add change

e23bd61

Update column mappings

b5c7acd

Merge remote-tracking branch 'upstream/main' into issue-3770-transfor…

035712e

…m-phmsagas-data

Merge remote-tracking branch 'upstream/main' into issue-3770-transfor…

5d7d00f

…m-phmsagas-data

Update notebook and add draft transform script

9518e83

Merge remote-tracking branch 'upstream/main' into issue-3770-transfor…

418cd55

…m-phmsagas-data

Remove old files and cleanup helpers

66b63b2

[pre-commit.ci] auto fixes from pre-commit.com hooks

cfc62d4

For more information, see https://pre-commit.ci

Resolved merge conflicts

3225680

[pre-commit.ci] auto fixes from pre-commit.com hooks

f276746

For more information, see https://pre-commit.ci

Resolve merge conflicts

ac67b39

[pre-commit.ci] auto fixes from pre-commit.com hooks

57209e9

For more information, see https://pre-commit.ci

Remove list of columns

6b2747e

Remove '.0' logic

d6bb6ea

Updates in response to comments

1769487

[pre-commit.ci] auto fixes from pre-commit.com hooks

405cc1c

For more information, see https://pre-commit.ci

Clean up documentation and logic

78cf151

[pre-commit.ci] auto fixes from pre-commit.com hooks

0c69a84

For more information, see https://pre-commit.ci

Use bulk series str ops

044dcd5

[pre-commit.ci] auto fixes from pre-commit.com hooks

d8be474

For more information, see https://pre-commit.ci

Reorder transformations

c5aa116

e-belfer marked this pull request as ready for review March 4, 2025 14:14

e-belfer mentioned this pull request Mar 4, 2025

Standardize encoding of state/province columns throughout PUDL #4109

Open

3 tasks

e-belfer moved this from In progress to In review in Catalyst Megaproject Mar 4, 2025

e-belfer requested a review from zaneselvans March 4, 2025 14:25

zaneselvans added 5 commits March 4, 2025 15:21

Merge branch 'main' into phmsa-company-transform

eb8a62b

Merge branch 'main' into phmsa-company-transform

2bdaa54

Add dbt models and row counts for phmsa gas distributon operators

e9c27fe

Add tests for phone number standardization function

7a2e787

Add richer field descriptions and some minor transform tweaks.

eda9524

zaneselvans requested changes Mar 7, 2025

View reviewed changes

github-project-automation bot moved this from In review to In progress in Catalyst Megaproject Mar 7, 2025

Merge branch 'main' into phmsa-company-transform

5b61a81

cmgosnell moved this from In progress to In review in Catalyst Megaproject Mar 17, 2025

aesharpe mentioned this pull request Mar 24, 2025

Integrate PHMSA distribution data into PUDL #3770

Open

e-belfer and others added 5 commits April 16, 2025 10:29

Merge branch 'main' into phmsa-company-transform

ab1f229

Update src/pudl/helpers.py

c6b1705

Co-authored-by: Zane Selvans <[email protected]>

Update src/pudl/metadata/fields.py

1038f36

Co-authored-by: Zane Selvans <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

395144e

For more information, see https://pre-commit.ci

Address comments on helper functions, update migrations

3bbf294

e-belfer moved this from In review to In progress in Catalyst Megaproject Apr 16, 2025

e-belfer and others added 8 commits April 16, 2025 10:42

Remove unnecessary state codes enums

6b86663

Merge branch 'phmsa-company-transform' of https://github.com/catalyst…

baa6821

…-cooperative/pudl into phmsa-company-transform

Standardize address fields and add zipcode constraint

a0db74b

fix overzealous replace-all

1ccc331

Update PK to be just report_number

2ef4f96

Update migrations and fix report year bug

f28bd25

WIP remove table from DB and add missing columns

2cc7376

Merge branch 'main' into phmsa-company-transform

38c69a8

zaneselvans self-requested a review April 19, 2025 21:11

zaneselvans requested changes Apr 19, 2025

View reviewed changes

		# Identify non-unique groups based on our PKs
		non_unique_groups = df[

Uh oh!

Transform PHMSA company data #4005

Are you sure you want to change the base?

Transform PHMSA company data #4005

Uh oh!

Conversation

e-belfer commented Jan 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

What problem does this address?

What did you change?

Remaining tasks

Documentation

Testing

How did you make sure this worked? How can a reviewer verify this?

Uh oh!

zaneselvans left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

e-belfer Apr 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zaneselvans Apr 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zaneselvans left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

e-belfer commented Jan 7, 2025 •

edited

Loading

e-belfer Apr 16, 2025 •

edited

Loading

zaneselvans Apr 19, 2025 •

edited

Loading