Skip to content

Integrate the Census-based FIPS codes to replace addfips #4019

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 32 commits into from
Apr 10, 2025

Conversation

cmgosnell
Copy link
Member

@cmgosnell cmgosnell commented Jan 17, 2025

Overview

Closes #3884.

What problem does this address?

What did you change?

Questions

  • should we publish a transformed table as a _core or core table?? if core we may want to break it out into multiple tables bc really its many geographies all squished into one table
  • rn i've called this table "geocodes" because that's what census calls it. sound okay?
  • should i add in a lil manually compiled non-us state codes that addfips was including?

Documentation

Make sure to update relevant aspects of the documentation.

- [ ] Update the [release notes](https://catalystcoop-pudl.readthedocs.io/en/nightly/release_notes.html): reference the PR and related issues.
- [ ] Update relevant Data Source jinja templates (see `docs/data_sources/templates`).
- [ ] Update relevant table or source description metadata (see `src/metadata`).
- [ ] Review and update any other aspects of the documentation that might be affected by this PR.

Testing

How did you make sure this worked? How can a reviewer verify this?

# To-do list
- [ ] If updating analyses or data processing functions: make sure to update or write data validation tests (e.g. `test_minmax_rows()`).
- [ ] Run `make pytest-coverage` locally to ensure that the merge queue will accept your PR.
- [ ] Review the PR yourself and call out any questions or issues you have.
- [ ] For minor ETL changes or data additions, once `make pytest-coverage` passes, make sure you have a fresh full PUDL DB downloaded locally, materialize new/changed assets and all their downstream assets and [run relevant data validation tests](https://catalystcoop-pudl.readthedocs.io/en/nightly/dev/testing.html#data-validation) using `pytest` and `--live-dbs`.
- [ ] For bigger ETL or data changes run the full ETL locally and then [run the data validations](https://catalystcoop-pudl.readthedocs.io/en/nightly/dev/testing.html#data-validation) using `make pytest-validate`.
- [ ] Alternatively, run the `build-deploy-pudl` GitHub Action manually.

@cmgosnell cmgosnell changed the base branch from main to fips-fix January 17, 2025 13:34
@cmgosnell cmgosnell self-assigned this Jan 17, 2025
Base automatically changed from fips-fix to main January 17, 2025 15:30
@cmgosnell cmgosnell requested a review from krivard January 29, 2025 16:35
Copy link
Contributor

@krivard krivard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re:Q1 - are any of the areas both a subdivision and a place/place and consolidated city/subdivision and consolidated city, or can we file them as an "aux code" "aux type" pair? if they're mutually exclusive then we could do either:

  • one table, with a variable-length concatenated FIPS (3 chars for country, 5 for state, 8 for county, 13 for aux)
  • 3 tables (country could just be hard coded): state, county, aux

Re:Q2 - 👍 "geocodes"

@cmgosnell
Copy link
Member Author

cmgosnell commented Mar 10, 2025

hm @krivard I've been assuming we want to update the vintage of the fips codes (the addfips package's most recent data is from 2015). But I've already found one problem with this update. CT! all the county names have been.. idk boring-afied

image

image

the data table i've been testing with (core_eia861__yearly_service_territory) only has the old names in the data so all of the CT counties are not being labeled right now... which makes sense bc apparently these new regional counties are not fully implemented until 2024 according to wikipedia.

@cmgosnell
Copy link
Member Author

okay since this revelation about different years having different info in there i decided to incorporate 2015 and 2023. I'm purposefully rn not merging the code using report_date. because even in this example the report_date from census don't line up. but now there are both options in the data. this seems to be working as intended.

@cmgosnell
Copy link
Member Author

okay i also added in 2009 bc there were a small handful of counties that were mapped w/ addfips that weren't being mapped without an older year.

rn the county codes that don't match are only two and they are bc of name changes:
image

there are also 13 non-us state records that are not mapped
image

@cmgosnell cmgosnell requested a review from krivard March 11, 2025 20:42
@cmgosnell cmgosnell enabled auto-merge March 18, 2025 20:01
@cmgosnell cmgosnell added this pull request to the merge queue Mar 18, 2025
@zaneselvans zaneselvans added the geospatial Spatial data and transformations. Anything related to mapping. label Mar 18, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 18, 2025
@cmgosnell
Copy link
Member Author

oh nice okay i ran the validation tests before actually fixing the Bedford City vs County problem.... and then the row counts changed! because previously Bedford City was getting dropped because addfips wasn't labeling Bedford City and we were dropna-ing on the fips codes. so i am going to update the row counts!

@cmgosnell cmgosnell requested a review from krivard April 1, 2025 13:09
Comment on lines -1057 to -1064
# The Virgin Islands and Guam aren't covered by addfips but they have FIPS:
st_croix = (df.state == "VI") & (df.county.isin(["St. Croix", "Saint Croix"]))
df.loc[st_croix, "county_id_fips"] = "78010"
st_john = (df.state == "VI") & (df.county.isin(["St. John", "Saint John"]))
df.loc[st_john, "county_id_fips"] = "78020"
st_thomas = (df.state == "VI") & (df.county.isin(["St. Thomas", "Saint Thomas"]))
df.loc[st_thomas, "county_id_fips"] = "78030"
df.loc[df.state == "GU", "county_id_fips"] = "66010"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i didn't realize this was here until after i added these guys into the CSV. that feels like a better place for them anyway so i delete this guy.

@cmgosnell
Copy link
Member Author

what has changed (i think) since last review:

  • removed addfips and updated conda lock files
  • make the fips codes to be strings in the non-census data tables (i was dev-ing on an already typed table before... silly me)
  • .... long pause .... figure out how dbt works ty for your help
  • update the row counts (see the comment above for why - tldr tiny addition of new fips == smol # of more rows)
  • remove now duplicate territory codes

Copy link
Contributor

@krivard krivard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming I read that right, good to go!

out_ferc714__respondents_with_fips,2020,8995
out_ferc714__respondents_with_fips,2021,8943
out_ferc714__respondents_with_fips,2022,8953
out_ferc714__respondents_with_fips,2023,8964
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✔️ ✔️ no 2006es 🏆

Comment on lines +295 to +296
df.astype({county_col: pd.StringDtype()})
.assign(county_tmp=lambda x: _clean_area_name_col(x[county_col], {}))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking understanding: this change is for two reasons:

  1. we needed county_col as a string, and sometimes it wasn't
  2. _clean_area_name_col couldn't handle a Series so we have to run it one row at a time

if yes, all good, if no, help?

Copy link
Member

@zaneselvans zaneselvans Apr 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought assign with a lambda function / any callable that takes a series still does its work vectorized. Is that not true? (df.apply() is row-wise and notoriously slow)

@zaneselvans zaneselvans enabled auto-merge April 10, 2025 01:15
@zaneselvans zaneselvans added this pull request to the merge queue Apr 10, 2025
Merged via the queue into main with commit 392222b Apr 10, 2025
19 checks passed
@zaneselvans zaneselvans deleted the fips-integrate branch April 10, 2025 02:53
@github-project-automation github-project-automation bot moved this from In review to Done in Catalyst Megaproject Apr 10, 2025
@krivard krivard mentioned this pull request May 7, 2025
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
geospatial Spatial data and transformations. Anything related to mapping.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Fix FIPS coding in PUDL
4 participants