Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make it possible to add non-PUDL data sources and metadata to the archiver #506

Merged
merged 6 commits into from
Jan 16, 2025

Conversation

e-belfer
Copy link
Member

@e-belfer e-belfer commented Jan 7, 2025

Overview

A lightweight temporary solution to #499.

What problem does this address?
In concert with catalyst-cooperative/pudl#4003, makes it possible to add non-PUDL metadata sources to the archiver directly.

What did you change in this PR?

  • Added a new metadata folder that replicates the one in the PUDL repository, but only contains information about data sources that aren't integrated into PUDL (currently only MECS).
  • Added a new from_non_pudl_metadata() method in the DataPackage class which calls the NON_PUDL_SOURCES dictionary (should work after the PR in the PUDL repository merges to make this possible).
  • Updated the README
  • Renamed mecs to eiamecs to make it more consistent with our other data sources
  • Added some documentation on how to add additional contributors to sources.py in preparation for the hackathon, while I was at it!

Testing

How did you make sure this worked? How can a reviewer verify this?
pudl_archiver --dataset eiamecs --sandbox --initialize

To-do list

Tasks

Preview Give feedback

@e-belfer e-belfer marked this pull request as ready for review January 8, 2025 16:49
"""Create a datapackage for sources that won't end up in PUDL."""
# TODO: This is a slightly ugly workaround to avoid having to add sources into
# the from_id method - should I just fix this at the source?
data_source = DataSource(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you change this to call from_id instead of dict_from_id, that should fix the broken tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, I see now that from_id doesn't take the sources dict right now. It's a bit of a pain to have to do another PUDL PR, but I think the most consistent API would be to add sources to the from_id method, since you're basically just copying what from_id does here.

Copy link
Member

@zschira zschira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I think updating from_id to take sources as a parameter would be my preference to fix the test issues.

@e-belfer e-belfer requested a review from zschira January 15, 2025 19:03
Copy link
Member

@zschira zschira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, got sidetracked earlier, but looks good now!

@e-belfer e-belfer merged commit 9530606 into main Jan 16, 2025
3 checks passed
@e-belfer e-belfer deleted the add-non-pudl-metadata branch January 16, 2025 13:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
metadata Managing data about our data new-data
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants