Skip to content

Transform PHMSA company data #4005

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 83 commits into
base: main
Choose a base branch
from
Open

Transform PHMSA company data #4005

wants to merge 83 commits into from

Conversation

e-belfer
Copy link
Member

@e-belfer e-belfer commented Jan 7, 2025

Overview

Addresses part of #3770. Taking over @seeess1 's #3929 PR.

What problem does this address?

Goal of the PR is to complete the first transformation of raw PHMSA data into a core asset.

What did you change?

  • Added a new transform script for PHMSA data.
  • Made one helper method (fix_eia_na) more generic since it can be applied across data sets (and updated references accordingly).
  • Added a few other helper methods.
  • Updated column map files with 2023 values (pulled from whatever was in 2022).
  • Specifically changed the ordering of columns in /Users/sam/Documents/pudl/src/pudl/package_data/phmsagas/column_maps/yearly_distribution.csv since we had fax columns mapped to emails and vice versa.

Remaining tasks

  • Update data DOI
  • Write to alembic
  • Fix test failures
  • Get validation asset checks running
  • Fix new test failures (migration, docs)
  • Why is report_date nulled during the transform?
  • Respond to PR comments
  • Add and transform any missing data columns
  • Prior to 2004 reports have 1+ states - how do we want to handle this!
  • Split report ID into report year and log? or just remove log column?
  • Check that report_submission_type collapses boolean cols
  • Migrate remaining asset checks to dbt
  • Add dbt test on report_year
  • Update DOI and update dbt checks

Questions:

  • We have an unaccounted for gas %age of 5% in the full ETL and 15% in the fast. Is there a good way to parameterize this in light of the ongoing DBT transition? Dbt setup #4011

Out of scope:

Documentation

Make sure to update relevant aspects of the documentation.

- [x] Update the [release notes](https://catalystcoop-pudl.readthedocs.io/en/nightly/release_notes.html): reference the PR and related issues.
- [ ] Update relevant Data Source jinja templates (see `docs/data_sources/templates`).
- [x] Update relevant table or source description metadata (see `src/metadata`).
- [ ] Review and update any other aspects of the documentation that might be affected by this PR.

Testing

How did you make sure this worked? How can a reviewer verify this?

# To-do list
- [x] If updating analyses or data processing functions: make sure to update or write data validation tests (e.g. `test_minmax_rows()`).
- [ ] Run `make pytest-coverage` locally to ensure that the merge queue will accept your PR.
- [x] Review the PR yourself and call out any questions or issues you have.
- [ ] For minor ETL changes or data additions, once `make pytest-coverage` passes, make sure you have a fresh full PUDL DB downloaded locally, materialize new/changed assets and all their downstream assets and [run relevant data validation tests](https://catalystcoop-pudl.readthedocs.io/en/nightly/dev/testing.html#data-validation) using `pytest` and `--live-dbs`.
- [ ] For bigger ETL or data changes run the full ETL locally and then [run the data validations](https://catalystcoop-pudl.readthedocs.io/en/nightly/dev/testing.html#data-validation) using `make pytest-validate`.
- [ ] Alternatively, run the `build-deploy-pudl` GitHub Action manually.

sam and others added 30 commits September 21, 2024 13:53
@e-belfer e-belfer marked this pull request as ready for review March 4, 2025 14:14
@e-belfer e-belfer moved this from In progress to In review in Catalyst Megaproject Mar 4, 2025
@e-belfer e-belfer requested a review from zaneselvans March 4, 2025 14:25
Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this took so long. Lots of apparent data issues, and some questions about the normalization. Let me know if you want to find a time to chat synchronously early morning here / afternoon there.

"services_shutoff_valve_in_system",
"services_shutoff_valve_installed",
"federal_land_leaks_repaired_or_scheduled",
"percent_unaccounted_for_gas",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some crazy values in the percentage of unaccounted for gas. It ranges from -99 to 563. Most values are between 0 and 20, so it seems like folks are actually reporting percentages. Only 3 above 100, but hundreds of values that are negative, mostly between 0 and 20 -- mirroring the positive distribution, but with much smaller numbers. This makes me think that these are sign errors since there's no way you can accidentally end up with 10% more gas in the pipeline is there?

In general we convert percentages to fractions so they can be used in calculations directly.

I think we should:

  • rename this unaccounted_for_gas_fraction
  • take the absolute value
  • divide by 100.
  • Potentially drop the values that are > 1.0 after all that? (there are only 3)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2017 PHMSA instructions (p. 4) say that negative calculations are possible but that they suggest reverting to 0 in this case:

If a company calculates a negative value for LAUF gas, it has likely either overcompensated for measurement conditions or overestimated known losses. Therefore, PHMSA instructs companies that calculate a negative value for percent of LAUF gas to report zero

And then on page 5:

PHMSA intends to propose changes to the instructions for PHMSA Form 7100.1-1, Gas
Distribution System Annual Report, related to calculating the percent of LAUF gas and negative
percent values. PHMSA intends to propose calculating percent LAUF gas by dividing the LAUF
volume by the gas disposition volume. PHMSA also intends to propose allowing a negative
value to be reported for percent LAUF gas. These changes would harmonize the PHMSA ·and
EIA methodologies for calculating percent LAUF gas.

This suggests to me we shouldn't be taking the absolute values of these numbers at all!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some interesting analysis here about the impact of these negative reports and how they're likely to be due to operator error, not sign problems. https://www.markey.senate.gov/imo/media/doc/documents/markey_lost_gas_report.pdf

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think the options here are:

  • Leave negative values as is and add a note in the field description and in "Known inconsistencies"
  • Convert negative values to 0 and add some kind of a flag to flag these bad values
  • Add a second column where the negative values are converted to zero

Copy link
Member Author

@e-belfer e-belfer Apr 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we look at the years where negative values are present, it seems like indeed we see enforcement of the no negative rule from 2013-2017 and then a return to allowing negative values in 2018 (and a large increase in the number of negative values reported!)

test = df[df.unaccounted_for_gas_fraction <0].report_year.value_counts()
test.sort_index()

report_year
1990      1
1991      1
1992      2
1993     10
1994     18
1995     19
1996     14
1997     38
1998     46
2004     28
2005     29
2006     23
2007     27
2008     24
2009      4
2012      1
2018     92
2019    106
2020    134
2021    137
2022    214
2023    244

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of the negative values, the vast majority are >-10%.

pd.cut(df[df.unaccounted_for_gas_fraction <0].unaccounted_for_gas_fraction, bins = 10).value_counts()

unaccounted_for_gas_fraction
(-0.0998, -1.8e-05]    1148
(-0.2, -0.0998]          40
(-0.299, -0.2]           11
(-0.399, -0.299]          4
(-0.898, -0.798]          2
(-0.999, -0.898]          2
(-0.798, -0.698]          2
(-0.499, -0.399]          1
(-0.698, -0.599]          1
(-0.599, -0.499]          1

Copy link
Member

@zaneselvans zaneselvans Apr 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, thanks for digging into the background. Given that there's already an extensive explanation of it in I agree we should not take the absolute value, and make sure we document that there are known negative values in the column description and provide a reference to a fuller explanation in the data source docs page. And probably pull the PHMSA instructions PDF into the data source docs directory like we have for the EIA and FERC forms -- especially since they don't always preserve access to older versions of those documents.

"federal_land_leaks_repaired_or_scheduled",
"percent_unaccounted_for_gas",
"additional_information",
"preparer_email",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As with the address fields, we've got a number of fields that describe the person filing the form and they overlap a lot. It would be nice to either standardize them, or make the data source / agency / form specific, but not here or now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see comparable fields for email or fax anywhere in our field descriptions - am I missing something? Generally though I do agree, I'm writing this up in a separate issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I had seen them for some of the other data sources that include contact information for whoever it is that filled out the forms but maybe I'm misremembering and we just have names and phone numbers.

"description": (
"This table contains operator-level natural gas distribution"
"data, corresponding to Parts A and D-I of the 2023 PHMSA gas "
"distribution system annual report. That includes data on the "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the data mentioned in this table-level description does not appear to be in the table, including:

  • type of operator
  • type of gas being transported

The part about each report corresponding to one operator in one state is a little confusing. Looking at e.g. operator_id_phmsa==20211 they often submit many reports in the same year, all of which are labeled Initial and there's no indication that they pertain to different states. Is there a state column missing that's about what state's data they're reporting, rather than what state the office or HQ is in?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the raw table that feeds into this one, if this table really is intended to have one record per operator-state-year, maybe operating_state is the column that got dropped but should have been kept to differentiate between those types of records?

I also notice that it seems like each operator either has an Initial or a Supplemental filing in a given year. Is that expected? Are Supplemental filings meant to replce Initial filings completely? Or add to them?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that the raw PHMSA table that feeds into this one has 300+ columns, this is a pretty substantial downselection. It would be helpful to offer some explanation as to what criteria was used to group these columns together into a single table.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I have now added in the fields that got missed in the original downselection, including operator type and gas type and state. Parts A and D-I of the form are the ones that have a natural primary key of report ID - the other parts of the field contain many categoricals in the columns that will want to be collapsed, forming different PKs for these different parts of the dataset. Thus the logic of starting with these parts of the form, with the idea that the other sections will form separate tables with distinct PKs. I can update the description to make this clearer.

Re: Initial vs Supplemental, there are no duplicates of operator ID + report year + state + commodity, meaning that supplemental reports must overwrite the initial reports.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, this makes the table make much more sense. And it seems like (operator_id_phmsa, report_year, state, commodity) is probably the natural primary key for the table. report_number is unique too, but it's more of a pseudo-key that doesn't really convey structural information about the table. It feels more like a column that should be included with an FK relationship to a filings table that contains all of the filing-specific information that's unique to a given filing, and would probably pertain to all of the records that come from that filing in all of the normalized tables that are constructed from the information in this big wide table.

Looking at the columns that are gathered in this yearly_distribution_operators it feels like there's a mix of information describing not only the operators and the filings, but also a summary of the kinds of incidents associated with the operator. Does it make conceptual sense to keep them all in the same table? It's pretty wide and they feel like different blocks of information.

When I use those four columns as the index I end up ~1000 duplicate records. operating_state contains a few hundred missing values and commodity is mostly null while operator_id_phmsa and report_year are completely non-null. Leaving commodity out of the index increases the number of duplicates a bit. The duplicates I always seem to have different supplemental_report_number and often include a mix of initial & supplemental report types, and occasionally some differences in the operator_name_phmsa but they don't usually seem material.

Not totally sure what the right thing to do here is.

Comment on lines 187 to 188
# Identify non-unique groups based on our PKs
non_unique_groups = df[
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the intention or the actual impacts of what's happening between here and the end of the transform function. What kind of duplication are we attempting to identify and remove? What does that have to do with whether the operator name contains the city name? Would it still make sense if we had per operating_state records in here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are ~10 records that have an address of "test" and a duplicated report ID. This method basically finds these duplicate IDs and keeps the one where there is no "test" in the name and where the city is contained in the operator name - the second condition captured a few additional edge cases. I did realize that the asset check for this is working on the already-cleaned data which doesn't help us one bit, so I moved this to an assertion in the transform

@github-project-automation github-project-automation bot moved this from In review to In progress in Catalyst Megaproject Mar 7, 2025
@cmgosnell cmgosnell moved this from In progress to In review in Catalyst Megaproject Mar 17, 2025
@e-belfer e-belfer moved this from In review to In progress in Catalyst Megaproject Apr 16, 2025
@zaneselvans zaneselvans self-requested a review April 19, 2025 21:11
Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some individual comments in the threads that were already going but I think the outstanding issues here are:

  • We need to clearly document the different meanings of the various date columns and make sure that they have the same meanings as columns with the same names elsewhere in PUDL. The one that will cause the most trouble right now is report_date since it is ubiquitous in PUDL and has a different meaning. I think it should be renamed, and report_year should be turned into a first-day-of-year date column and renamed report_date since that's consistent with the rest of PUDL.
  • From your research on the missing gas, it seems like the best option is to not touch the values (convert them to fractions, but don't take the absolute values or set them to zero) and refer users to the background material. I think we should get that background material into our docs folder like we have for EIA and FERC datasets (historical instructions / blank forms).
  • I think there are actually ~3 different tables hiding in the collection of columns being pulled in here, and they should probably be turned into 3 different assets. I think the 3 conceptual tables are:
    • information about filings (with PK report_number)
    • information about distribution operators (conceptually reported on the basis of report year, state, and operator ID)
    • summary information about incidents (also conceptually reported on the basis of report year, state, and operator ID)
  • All those conceptual tables should retain the report_number column with a FK relationship that points back to the filings table so we can tell what filing the original information came from, but report_number doesn't seem like a real natural PK for these tables.
  • Unfortunately, what appear to be the natural primary keys don't quite result in unique keys, so it seems like there's some more investigation required to understand what's going on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community new-data Requests for integration of new data. phmsa Data from the Pipeline and Hazardous Material Safety Administration
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

3 participants