Re-design pudl-archiver to better separate dependencies from main pudl repository #499

Open
e-belfer opened this issue Dec 17, 2024 · 1 comment
Labels
metadata Managing data about our data

Comments

e-belfer commented Dec 17, 2024

Our pudl-archiver and pudl repositories are coupled because we store the metadata for datasets ingested into PUDL in the pudl repository. Right now, adding an archiver involves the following workflow:

  1. Make a PR in the pudl repo to add the dataset's metadata to pudl.metadata.sources
  2. Once that PR merges, recreate the pudl-archiver environment locally to reinstall PUDL and write an archiver.

The problem
When we archive datasets that we don't expect to end up in PUDL in the short term, we need to do one of two things:

  1. Add them to our PUDL metadata repository anyway.
  2. Develop a hacky workaround, like the one in #462 (Add archiver for EIA MECS data).

It's also impossible to use the pudl-archiver repository without installing PUDL. To use our lovely archiving infrastructure more broadly for other datasets, and to make it easier for external contributors to write archivers, we should separate these two repositories more clearly.

Existing interconnections
The pudl-archiver repository currently has the following dependencies on pudl functions, classes and dictionaries:

In pudl_archiver.frictionless.py:

from pudl.metadata.classes import Contributor, DataSource, License
from pudl.metadata.constants import CONTRIBUTORS

See in particular the from_pudl_metadata method in the DataPackage class.

The desired solution

  • Metadata for every dataset should only be defined in one place - metadata should not be duplicated in the pudl and pudl-archiver repositories.
  • There should be a structured way to add archivers to the pudl-archiver repository without needing to open a PR in the main PUDL repo when the archives are not clearly intended to land in PUDL.
  • The pudl repository can be installed as an optional dependency for the pudl-archiver repository.
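
As a minimal sketch of how the second and third goals could fit together, the snippet below imports PUDL's metadata only on request and otherwise reads from a registry kept in pudl-archiver. `LocalSource`, `LOCAL_SOURCES`, `pudl_sources`, `get_source_metadata` and the `eiamecs` entry are hypothetical names used for illustration; only the `pudl.metadata.sources` module and its `SOURCES` dictionary come from this issue.

```python
import importlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalSource:
    """Hypothetical archiver-side metadata record (not a PUDL class)."""

    name: str
    title: str
    description: str


# Hypothetical registry for datasets we archive but do not integrate into PUDL.
LOCAL_SOURCES = {
    "eiamecs": LocalSource(
        name="eiamecs",
        title="EIA MECS",
        description="Placeholder description, for illustration only.",
    ),
}


def pudl_sources():
    """Return pudl's SOURCES dictionary, or None if pudl is not installed."""
    try:
        module = importlib.import_module("pudl.metadata.sources")
    except ImportError:
        return None
    return getattr(module, "SOURCES", None)


def get_source_metadata(name, use_pudl=False):
    """Look up metadata locally, or in pudl when use_pudl is set."""
    if not use_pudl:
        return LOCAL_SOURCES[name]
    sources = pudl_sources()
    if sources is None:
        raise RuntimeError(
            f"{name!r} needs PUDL metadata, but pudl is not installed."
        )
    return sources[name]
```

With this shape, an archiver for a non-PUDL dataset never triggers the pudl import, so the package stays usable without PUDL installed.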

Questions to resolve

  • If a dataset is ingested into PUDL, where should the source of truth for the metadata live?
  • How can we test archivers written by external contributors, who don't have our Zenodo credentials? Should we handle this in CI, or test them manually ourselves?
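
One possible shape for the CI side of the second question, offered only as a sketch: check whether Zenodo credentials are present and fall back to a dry run when they are not. The environment variable names and the `"upload"` / `"dry-run"` modes are assumptions made for illustration, not the names pudl-archiver actually uses.

```python
import os

# Hypothetical variable names; the real ones used by pudl-archiver may differ.
REQUIRED_ENV_VARS = (
    "ZENODO_SANDBOX_TOKEN_UPLOAD",
    "ZENODO_SANDBOX_TOKEN_PUBLISH",
)


def missing_zenodo_credentials(env=None):
    """List the required credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


def choose_test_mode(env=None):
    """Return "upload" when all credentials are present, else "dry-run"."""
    return "dry-run" if missing_zenodo_credentials(env) else "upload"
```

A run triggered from an external fork, where secrets are unavailable, would then exercise only the download and validation steps, leaving the upload to be tested by a maintainer.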

Some potential options to explore

  1. Use GHA to sync the source metadata files from pudl into the pudl-archiver, and add a second sources file for non-PUDL metadata. Doesn't resolve CONTRIBUTORS import.
    Basically kicks the can down the road and creates a dependency in GH, not ideal.

  2. Add a flag for archivers dependent on PUDL metadata, but make it possible for archives to rely on metadata provided in the pudl-archiver repository.

  3. Move all metadata into the pudl-archiver repo, and have PUDL depend on the pudl-archiver repo.
    This is the most complex, but we would basically invert all the imports. This would touch all of our settings classes and Resource creation, as well as the creation of the default run settings.

  4. Come up with an interim solution:
    • add flag for "get metadata from PUDL"
    • when flag is tripped, call function that imports SOURCES from pudl
    • move towards using generic frictionless package to reduce custom imports from PUDL

  5. Just handle the addition of metadata for new archivers in the PUDL repo.
    Will this cause problems for our existing docs and workflow? Every dataset needs a jinja template to be added, and we could note them in the 'other datasets' documentation as datasets we've archived but aren't integrating. Make sure these aren't named "pudl raw whatever" if we aren't integrating them, though.
    • make a complementary method to from_pudl_metadata that doesn't append "PUDL Raw" to the title for datasets that aren't in a list. Possibly grab this dynamically from the PUDL settings.
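
The title handling described in the last bullet could look roughly like the sketch below. It assumes the "PUDL Raw" label is placed before the source title, and uses a hard-coded set where the real list might come from the PUDL settings; `PUDL_INTEGRATED` and `archive_title` are hypothetical names, and the dataset names are examples only.

```python
# Hypothetical list of datasets integrated into PUDL; in practice this might
# be read dynamically from the PUDL settings instead of being hard-coded.
PUDL_INTEGRATED = frozenset({"eia860", "ferc1"})


def archive_title(name, source_title, integrated=PUDL_INTEGRATED):
    """Add the "PUDL Raw" label only for datasets integrated into PUDL."""
    if name in integrated:
        return f"PUDL Raw {source_title}"
    return source_title
```

For example, `archive_title("eiamecs", "EIA MECS")` leaves the title unchanged because `eiamecs` is not in the set.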
@e-belfer e-belfer added the metadata Managing data about our data label Dec 17, 2024
@e-belfer e-belfer moved this from New to Backlog in Catalyst Megaproject Dec 17, 2024

e-belfer commented

After talking with @zschira, we're going with option 5. Rather than implementing a rapid-fire hacky solution, this lets us move forward on the project without getting mired in a half-refactor of frictionless packages or creating more janky code that we'd have to undo later this year.

Projects
Status: Icebox