e-belfer opened this issue on Sep 11, 2023 · 1 comment · Fixed by #2932 or #3235
Labels
epic: Any issue whose primary purpose is to organize other issues into a group.
excel: Issues involving data in Microsoft Excel spreadsheets.
new-data: Requests for integration of new data.
phmsa: Data from the Pipeline and Hazardous Materials Safety Administration.
Scope of PR:
Produce PHMSA assets in Dagster from the transmission and distribution tables, mirroring the format of the most recent year's form.
Design Notes
Transmission Data (1970-present)
Table design
All form parts are based on the 2021 form. Form part letters change over time, so this will require pairing older fields from .xlsx files to their correct tables.
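One way to pair older fields with their 2021-form tables is a per-era rename map. The sketch below is only an illustration: the raw column names, the year ranges, and the `standardize_columns` helper are all hypothetical, and the real names would come from the .xlsx files themselves.

```python
import pandas as pd

# Hypothetical mapping: (first_year, last_year) -> {raw column: standardized column}.
# Every raw column name here is invented for illustration.
COLUMN_MAPS = {
    (1970, 2009): {"OPID": "operator_id", "YR": "report_year", "TOTMILES": "total_miles"},
    (2010, 2021): {"OPERATOR_ID": "operator_id", "REPORT_YEAR": "report_year", "TOTAL_MILES": "total_miles"},
}


def standardize_columns(raw: pd.DataFrame, year: int) -> pd.DataFrame:
    """Rename one year's raw columns to the shared names and drop everything else."""
    for (first, last), mapping in COLUMN_MAPS.items():
        if first <= year <= last:
            return raw.rename(columns=mapping)[list(mapping.values())]
    raise ValueError(f"No column map defined for year {year}")


old = pd.DataFrame({"OPID": [1], "YR": [1995], "TOTMILES": [10.0]})
new = pd.DataFrame({"OPERATOR_ID": [1], "REPORT_YEAR": [2015], "TOTAL_MILES": [12.0]})

# Once both eras share column names they can be stacked into one table.
combined = pd.concat(
    [standardize_columns(old, 1995), standardize_columns(new, 2015)],
    ignore_index=True,
)
```

Keeping the maps as data rather than code would make it easier to review which older field lands in which table.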
Parts A-D: core_phmsa__yearly_transmission_summary_by_commodity
The summary table: one row per commodity, listing summarized miles of pipe, volume, and onshore/offshore miles of pipe by category.
Parts B and D are summaries of other fields and could be compared directly against them as an additional validation step.
The form instructs operators to complete Part C only once across all of their forms. Its values could be filled in, by operator ID and year, on every report they cover.
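A minimal sketch of filling Part C across an operator's reports, assuming it is reported on exactly one report per operator and year. The column names (`operator_id`, `report_year`, `report_state`, `part_c_volume`) are placeholders, not the actual PHMSA field names.

```python
import pandas as pd

reports = pd.DataFrame(
    {
        "operator_id": [1, 1, 1, 2],
        "report_year": [2021, 2021, 2021, 2021],
        "report_state": ["TX", "OK", "LA", "NY"],
        # Operator 1 filled in Part C on only one of its three reports.
        "part_c_volume": [None, 500.0, None, 75.0],
    }
)

# GroupBy "first" returns the first non-null value in each group, so
# transform broadcasts the single reported value to every row in the group.
reports["part_c_volume"] = reports.groupby(["operator_id", "report_year"])[
    "part_c_volume"
].transform("first")
```

If an operator ever reports Part C more than once in a year with different values, this silently keeps the first one, so a uniqueness check before filling would be worth adding.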
Parts F-G: core_phmsa__yearly_inspections_and_assessments
Each row should have either information on interstate inspections or information on inspections in a particular state (as determined by the INTER_INTRA column).
These fields don't exist prior to 2010.
Part H: core_phmsa__yearly_miles_of_transmission_pipe_by_nps
Each row should correspond to interstate pipelines within one state, or intrastate pipelines within one state.
This table has a bunch of different categoricals that will want to get reorganized.
Misc.
Parts N and O are the preparer's signature, email, telephone, etc. (N) and the senior executive officers who signed off on the form (O). Not sure if we want to include these in our DB at all; alternatively, they could be appended to Parts A-E as part of the summary information on the report.
Some fields are not in the form but are added to the dataset. They're generally noted at the bottom of the PDF or in a separate file for each folder.
Distribution Data (1970-present)
Table design
All form parts are based on the 2021 form. Form part letters change over time, so this will require pairing older fields from .xlsx files to their correct tables. The original files (CSV and Excel) are not split by form part.
Each report corresponds to one state and one commodity group.
Part A: Operator Information
Essentially just the operator ID, name, address, and info on the state and commodity pertinent to the report.
core_phmsagas__yearly_distribution_main_and_services
Part B: System Description
Miles of main and services by decade of installation.
Each part is a table that has a bunch of different categoricals (e.g. pipe size) that will want to get tidied, changing the PK of the table.
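To illustrate how tidying a categorical changes the primary key, here is a sketch using plain `pandas.DataFrame.melt` rather than any existing PUDL helper. The column names and pipe size labels are made up for the example.

```python
import pandas as pd

# Wide layout: one row per report, one column per pipe size category.
wide = pd.DataFrame(
    {
        "report_id": [101, 102],
        "report_year": [2021, 2021],
        "main_miles_2in_or_less": [10.0, 4.0],
        "main_miles_over_2in_to_4in": [5.0, 1.5],
    }
)

id_cols = ["report_id", "report_year"]

# Tidy layout: one row per report and pipe size.
tidy = wide.melt(id_vars=id_cols, var_name="pipe_size", value_name="main_miles")
tidy["pipe_size"] = (
    tidy["pipe_size"].str.replace("main_miles_", "", regex=False).astype("category")
)
tidy = tidy.sort_values(id_cols + ["pipe_size"]).reset_index(drop=True)

# The primary key grows from (report_id, report_year) to include pipe_size.
primary_key = id_cols + ["pipe_size"]
assert not tidy.duplicated(subset=primary_key).any()
```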
core_phmsa__yearly_distribution_leaks_and_repairs
Part C: Leaks and Repairs
Table is organized by cause of leak and type of pipe. There are two additional fields at the bottom for "known leaks scheduled for repairs" and "leaks involving mechanical joint failure".
Part D-I
Each field occurs once per report, so these should probably be kept in one table with Part A.
Part D, Excavation Damage: 6 fields of summary stats, should probably be collapsed into another table.
Part E, EFV and Service Valve Data: 4 fields of summary stats, should probably be collapsed into another table.
Part F: Total number of leaks on federal lands scheduled for repair or repaired.
Part G: Percentage of unaccounted gas.
Part H: Additional Info: a Notes field, discusses corrections, changes in calculations, and other ambiguities.
Part I: Preparer's info: Email, initial or supplemental report, contact info for preparer.
After raw assets are extracted, we will have to define a core set of transformations for the PHMSA data. For each table, this could include: defining all columns with a datatype, transforming columns into categoricals to reduce the width of the table, defining primary keys, standardizing NAs.
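A rough sketch of what such a core transformation could look like for a single table. The NA sentinel strings, the dtype choices, the column names, and the `apply_schema` helper are all assumptions made for illustration, not an agreed schema.

```python
import pandas as pd

# Assumed NA sentinels; the real set would come from inspecting the raw data.
NA_STRINGS = ["", "N/A", "NA", "UNKNOWN"]
NUMERIC_DTYPES = {"operator_id": "Int64", "report_year": "Int64", "total_miles": "Float64"}
CATEGORICAL_COLS = ["commodity"]
PRIMARY_KEY = ["operator_id", "report_year"]


def apply_schema(raw: pd.DataFrame) -> pd.DataFrame:
    """Standardize NAs, assign dtypes, and check the primary key."""
    # Blank out every cell that holds one of the sentinel strings.
    df = raw.mask(raw.isin(NA_STRINGS))
    for col, dtype in NUMERIC_DTYPES.items():
        # Parse strings to numbers first, then move to a nullable dtype.
        df[col] = pd.to_numeric(df[col]).astype(dtype)
    for col in CATEGORICAL_COLS:
        df[col] = df[col].astype("category")
    if df.duplicated(subset=PRIMARY_KEY).any():
        raise ValueError(f"Duplicate primary key values in {PRIMARY_KEY}")
    return df


raw = pd.DataFrame(
    {
        "operator_id": ["123", "456", "789"],
        "report_year": ["2020", "2021", "2021"],
        "commodity": ["Natural Gas", "N/A", "Hydrogen Gas"],
        "total_miles": ["12.5", "", "3"],
    }
)
clean = apply_schema(raw)
```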
Known cleaning steps:
Convert 2-digit report_year into 4-digit years (pre-2000).
Standardize report_state to use either shorthand or full state name.
Ideally, adapt existing wide_to_tidy infrastructure to drastically collapse tables using categoricals (e.g., a column for "location" that includes onshore, offshore, total rather than 3x the columns).
Handle different aggregations of reporting over time for each form section (e.g., one form per state, one form per system).
Where granularity increases over time (e.g. onshore becomes onshore types A, B, C), aggregate these increasingly disaggregated columns back to have comparable totals over time.
Deal with extremely varying telephone formats
Standardize use of office vs. HQ addresses over time, and do general address cleaning
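A few of the steps above can be sketched as small helpers. These are illustrative only: the function names are hypothetical, the state lookup is deliberately truncated, the choice of two-letter abbreviations is one of the two options listed, and the phone logic assumes 10-digit US numbers.

```python
import re
from typing import Optional


def fix_two_digit_year(year: int) -> int:
    """Assume any value below 100 is a pre-2000 year written with two digits."""
    return year + 1900 if year < 100 else year


# Truncated lookup for illustration; a real one would cover every state.
STATE_ABBR = {"texas": "TX", "new york": "NY"}


def standardize_state(value: str) -> str:
    """Map full state names to abbreviations and upper-case everything else."""
    cleaned = value.strip()
    return STATE_ABBR.get(cleaned.lower(), cleaned.upper())


def standardize_phone(value: str) -> Optional[str]:
    """Reduce a phone number to NNN-NNN-NNNN, or None if it cannot be parsed."""
    digits = re.sub(r"\D", "", value)
    # Drop a leading US country code.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    # Extensions and partial numbers do not have 10 digits and are rejected.
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
```

For example, `fix_two_digit_year(85)` gives `1985` and `standardize_phone("(555) 123-4567")` gives `"555-123-4567"`. Numbers with extensions come back as `None` here, so they would need separate handling if the extension should be kept.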
Transmission Data (1970-present)
Table design
Parts A-D: core_phmsa__yearly_transmission_summary_by_commodity
Parts F-G: core_phmsa__yearly_inspections_and_assessments
Part H: core_phmsa__yearly_miles_of_transmission_pipe_by_nps
Part I: core_phmsa__yearly_miles_of_gathering_pipe_by_nps
Part J: core_phmsa__yearly_miles_of_pipe_by_decade_installed
Part K: core_phmsa__yearly_miles_of_transmission_pipe_by_specified_minimum_yield_strength
Part L: core_phmsa__yearly_miles_of_pipe_by_class_location
Part M: core_phmsa__yearly_failures_leaks_repairs
Part P: core_phmsa__yearly_miles_of_pipe_by_material
Part Q: core_phmsa__yearly_gas_transmission_miles_by_maop_determination_method
Part R: core_phmsa__yearly_gas_transmission_miles_by_pt_range_and_internal_inspection
Part S: core_phmsa__yearly_transmission_materials_verification
Part T: core_phmsa__yearly_transmission_hca_miles_by_determination_method_and_risk_model
Misc.
Distribution Data (1970-present)
Table design
Part A: Operator Information
core_phmsagas__yearly_distribution_main_and_services
Part B: System Description
core_phmsa__yearly_distribution_leaks_and_repairs
Part C: Leaks and Repairs
Part D-I
Adapt infrastructure to handle PHMSA partitions
[...] year instead of years (pudl-archiver#252)
[...] _present in file name, make partitions actually reflect years of data available correctly (pudl-archiver#253)
Extraction into raw assets
First round of cleaning into core assets