-
-
Notifications
You must be signed in to change notification settings - Fork 124
Transform PHMSA company data #4005
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
For more information, see https://pre-commit.ci
For more information, see https://pre-commit.ci
For more information, see https://pre-commit.ci
For more information, see https://pre-commit.ci
For more information, see https://pre-commit.ci
For more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry this took so long. Lots of apparent data issues, and some questions about the normalization. Let me know if you want to find a time to chat synchronously early morning here / afternoon there.
"services_shutoff_valve_in_system", | ||
"services_shutoff_valve_installed", | ||
"federal_land_leaks_repaired_or_scheduled", | ||
"percent_unaccounted_for_gas", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some crazy values in the percentage of unaccounted for gas. It ranges from -99 to 563. Most values are between 0 and 20, so it seems like folks are actually reporting percentages. Only 3 above 100, but hundreds of values that are negative, mostly between 0 and 20 -- mirroring the positive distribution, but with much smaller numbers. This makes me think that these are sign errors since there's no way you can accidentally end up with 10% more gas in the pipeline is there?
In general we convert percentages to fractions so they can be used in calculations directly.
I think we should:
- rename this
unaccounted_for_gas_fraction
- take the absolute value
- divide by 100.
- Potentially drop the values that are > 1.0 after all that? (there are only 3)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2017 PHMSA instructions (p. 4) say that negative calculations are possible but that they suggest reverting to 0 in this case:
If a company calculates a negative value for LAUF gas, it has likely either overcompensated for measurement conditions or overestimated known losses. Therefore, PHMSA instructs companies that calculate a negative value for percent of LAUF gas to report zero
And then on page 5:
PHMSA intends to propose changes to the instructions for PHMSA Form 7100.1-1, Gas
Distribution System Annual Report, related to calculating the percent of LAUF gas and negative
percent values. PHMSA intends to propose calculating percent LAUF gas by dividing the LAUF
volume by the gas disposition volume. PHMSA also intends to propose allowing a negative
value to be reported for percent LAUF gas. These changes would harmonize the PHMSA ·and
EIA methodologies for calculating percent LAUF gas.
This suggests to me we shouldn't be taking the absolute values of these numbers at all!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's some interesting analysis here about the impact of these negative reports and how they're likely to be due to operator error, not sign problems. https://www.markey.senate.gov/imo/media/doc/documents/markey_lost_gas_report.pdf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I think the options here are:
- Leave negative values as is and add a note in the field description and in "Known inconsistencies"
- Convert negative values to 0 and add some kind of a flag to flag these bad values
- Add a second column where the negative values are converted to zero
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we look at the years where negative values are present, it seems like indeed we see enforcement of the no negative rule from 2013-2017 and then a return to allowing negative values in 2018 (and a large increase in the number of negative values reported!)
test = df[df.unaccounted_for_gas_fraction <0].report_year.value_counts()
test.sort_index()
report_year
1990 1
1991 1
1992 2
1993 10
1994 18
1995 19
1996 14
1997 38
1998 46
2004 28
2005 29
2006 23
2007 27
2008 24
2009 4
2012 1
2018 92
2019 106
2020 134
2021 137
2022 214
2023 244
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of the negative values, the vast majority are >-10%.
pd.cut(df[df.unaccounted_for_gas_fraction <0].unaccounted_for_gas_fraction, bins = 10).value_counts()
unaccounted_for_gas_fraction
(-0.0998, -1.8e-05] 1148
(-0.2, -0.0998] 40
(-0.299, -0.2] 11
(-0.399, -0.299] 4
(-0.898, -0.798] 2
(-0.999, -0.898] 2
(-0.798, -0.698] 2
(-0.499, -0.399] 1
(-0.698, -0.599] 1
(-0.599, -0.499] 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, thanks for digging into the background. Given that there's already an extensive explanation of it in I agree we should not take the absolute value, and make sure we document that there are known negative values in the column description and provide a reference to a fuller explanation in the data source docs page. And probably pull the PHMSA instructions PDF into the data source docs directory like we have for the EIA and FERC forms -- especially since they don't always preserve access to older versions of those documents.
"federal_land_leaks_repaired_or_scheduled", | ||
"percent_unaccounted_for_gas", | ||
"additional_information", | ||
"preparer_email", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As with the address fields, we've got a number of fields that describe the person filing the form and they overlap a lot. It would be nice to either standardize them, or make the data source / agency / form specific, but not here or now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see comparable fields for email or fax anywhere in our field descriptions - am I missing something? Generally though I do agree, I'm writing this up in a separate issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought I had seen them for some of the other data sources that include contact information for whoever it is that filled out the forms but maybe I'm misremembering and we just have names and phone numbers.
"description": ( | ||
"This table contains operator-level natural gas distribution" | ||
"data, corresponding to Parts A and D-I of the 2023 PHMSA gas " | ||
"distribution system annual report. That includes data on the " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the data mentioned in this table-level description does not appear to be in the table, including:
- type of operator
- type of gas being transported
The part about each report corresponding to one operator in one state is a little confusing. Looking at e.g. operator_id_phmsa==20211
they often submit many reports in the same year, all of which are labeled Initial
and there's no indication that they pertain to different states. Is there a state column missing that's about what state's data they're reporting, rather than what state the office or HQ is in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the raw table that feeds into this one, if this table really is intended to have one record per operator-state-year, maybe operating_state
is the column that got dropped but should have been kept to differentiate between those types of records?
I also notice that it seems like each operator either has an Initial or a Supplemental filing in a given year. Is that expected? Are Supplemental filings meant to replce Initial filings completely? Or add to them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that the raw PHMSA table that feeds into this one has 300+ columns, this is a pretty substantial downselection. It would be helpful to offer some explanation as to what criteria was used to group these columns together into a single table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I have now added in the fields that got missed in the original downselection, including operator type and gas type and state. Parts A and D-I of the form are the ones that have a natural primary key of report ID - the other parts of the field contain many categoricals in the columns that will want to be collapsed, forming different PKs for these different parts of the dataset. Thus the logic of starting with these parts of the form, with the idea that the other sections will form separate tables with distinct PKs. I can update the description to make this clearer.
Re: Initial vs Supplemental, there are no duplicates of operator ID + report year + state + commodity, meaning that supplemental reports must overwrite the initial reports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, this makes the table make much more sense. And it seems like (operator_id_phmsa, report_year, state, commodity)
is probably the natural primary key for the table. report_number
is unique too, but it's more of a pseudo-key that doesn't really convey structural information about the table. It feels more like a column that should be included with an FK relationship to a filings table that contains all of the filing-specific information that's unique to a given filing, and would probably pertain to all of the records that come from that filing in all of the normalized tables that are constructed from the information in this big wide table.
Looking at the columns that are gathered in this yearly_distribution_operators
it feels like there's a mix of information describing not only the operators and the filings, but also a summary of the kinds of incidents associated with the operator. Does it make conceptual sense to keep them all in the same table? It's pretty wide and they feel like different blocks of information.
When I use those four columns as the index I end up ~1000 duplicate records. operating_state
contains a few hundred missing values and commodity
is mostly null while operator_id_phmsa
and report_year
are completely non-null. Leaving commodity
out of the index increases the number of duplicates a bit. The duplicates I always seem to have different supplemental_report_number
and often include a mix of initial & supplemental report types, and occasionally some differences in the operator_name_phmsa
but they don't usually seem material.
Not totally sure what the right thing to do here is.
src/pudl/transform/phmsagas.py
Outdated
# Identify non-unique groups based on our PKs | ||
non_unique_groups = df[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the intention or the actual impacts of what's happening between here and the end of the transform function. What kind of duplication are we attempting to identify and remove? What does that have to do with whether the operator name contains the city name? Would it still make sense if we had per operating_state
records in here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are ~10 records that have an address of "test" and a duplicated report ID. This method basically finds these duplicate IDs and keeps the one where there is no "test" in the name and where the city is contained in the operator name - the second condition captured a few additional edge cases. I did realize that the asset check for this is working on the already-cleaned data which doesn't help us one bit, so I moved this to an assertion in the transform
Co-authored-by: Zane Selvans <[email protected]>
Co-authored-by: Zane Selvans <[email protected]>
For more information, see https://pre-commit.ci
…-cooperative/pudl into phmsa-company-transform
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some individual comments in the threads that were already going but I think the outstanding issues here are:
- We need to clearly document the different meanings of the various date columns and make sure that they have the same meanings as columns with the same names elsewhere in PUDL. The one that will cause the most trouble right now is
report_date
since it is ubiquitous in PUDL and has a different meaning. I think it should be renamed, andreport_year
should be turned into a first-day-of-year date column and renamedreport_date
since that's consistent with the rest of PUDL. - From your research on the missing gas, it seems like the best option is to not touch the values (convert them to fractions, but don't take the absolute values or set them to zero) and refer users to the background material. I think we should get that background material into our docs folder like we have for EIA and FERC datasets (historical instructions / blank forms).
- I think there are actually ~3 different tables hiding in the collection of columns being pulled in here, and they should probably be turned into 3 different assets. I think the 3 conceptual tables are:
- information about filings (with PK
report_number
) - information about distribution operators (conceptually reported on the basis of report year, state, and operator ID)
- summary information about incidents (also conceptually reported on the basis of report year, state, and operator ID)
- information about filings (with PK
- All those conceptual tables should retain the
report_number
column with a FK relationship that points back to the filings table so we can tell what filing the original information came from, butreport_number
doesn't seem like a real natural PK for these tables. - Unfortunately, what appear to be the natural primary keys don't quite result in unique keys, so it seems like there's some more investigation required to understand what's going on.
Overview
Addresses part of #3770. Taking over @seeess1 's #3929 PR.
What problem does this address?
Goal of the PR is to complete the first transformation of raw PHMSA data into a core asset.
What did you change?
fix_eia_na
) more generic since it can be applied across data sets (and updated references accordingly)./Users/sam/Documents/pudl/src/pudl/package_data/phmsagas/column_maps/yearly_distribution.csv
since we had fax columns mapped to emails and vice versa.Remaining tasks
Questions:
Out of scope:
Documentation
Make sure to update relevant aspects of the documentation.
Testing
How did you make sure this worked? How can a reviewer verify this?