Re-design `pudl-archiver` to better separate dependencies from main `pudl` repository #499

e-belfer · 2024-12-17T16:51:21Z

Our pudl-archiver and pudl repositories are coupled because the metadata for datasets ingested into PUDL is stored in the pudl repository. Right now, adding an archiver requires the following workflow:

Make a PR in the pudl repo to add the dataset's metadata to pudl.metadata.sources
Once that PR merges, recreate the pudl-archiver environment locally to reinstall PUDL and write an archiver.

The problem
When we archive datasets that we don't expect to integrate into PUDL in the short term, we have to do one of two things:

Add them to our metadata in the PUDL repository anyway.
Develop a hacky workaround, like in Add archiver for EIA MECS data #462.

It's also impossible to use the pudl-archiver repository without installing PUDL. To use our lovely archiving infrastructure more broadly, whether to archive other datasets or to make it easier for external contributors to write archivers, we should separate these two repositories more clearly.

Existing interconnections
The pudl-archiver repository currently has the following dependencies on pudl functions, classes and dictionaries:

In pudl_archiver.frictionless.py:

from pudl.metadata.classes import Contributor, DataSource, License
from pudl.metadata.constants import CONTRIBUTORS

See in particular the from_pudl_metadata method in the DataPackage class.

The desired solution

Metadata for every dataset should only be defined in one place - metadata should not be duplicated in the pudl and pudl-archiver repositories.
There should be a structured way to add archivers to the pudl-archiver repository without needing to open a PR in the main PUDL repo, if the archives are not clearly intended to land in PUDL.
The pudl repository can be installed as an optional dependency for the pudl-archiver repository.
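One way to check the "defined in only one place" goal, if metadata ends up split between a PUDL sources dictionary and an archiver-only one, is to refuse to merge them when a dataset appears in both. This is only a rough sketch: `merge_sources` and the dataset keys in the usage note are hypothetical and are not taken from either repository.

```python
def merge_sources(pudl_sources: dict, archiver_sources: dict) -> dict:
    """Combine two source-metadata dicts, refusing duplicated dataset names.

    Hypothetical helper: a dataset's metadata may live in either dict,
    but never in both.
    """
    # Dataset names that appear in both dictionaries.
    duplicated = sorted(pudl_sources.keys() & archiver_sources.keys())
    if duplicated:
        raise ValueError(f"Metadata defined in both repositories for: {duplicated}")
    return {**pudl_sources, **archiver_sources}
```

For example, `merge_sources({"eia860": {}}, {"eiamecs": {}})` returns a dictionary with both keys, while passing `"eia860"` in both arguments raises a `ValueError`.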

Questions to resolve

If a dataset is ingested into PUDL, where should the source of truth for the metadata live?
How can we test archivers written by external users who don't have our Zenodo credentials? Should we handle this in CI, or test them manually ourselves?

Some potential options to explore

Use GHA to sync the source metadata files from pudl into the pudl-archiver, and add a second sources file for non-PUDL metadata. This doesn't resolve the CONTRIBUTORS import.
This basically kicks the can down the road and creates a dependency in GH, which is not ideal.
Add a flag for archivers dependent on PUDL metadata, but make it possible for archives to rely on metadata provided in the pudl-archiver repository.
Move all metadata into the pudl-archiver repo, and have PUDL depend on the pudl-archiver repo.
This is the most complex, but we would basically invert all the imports. This would touch all of our settings classes and Resource creation, as well as the creation of the default run settings.
Come up with an interim solution:

add flag for "get metadata from PUDL"
when flag is tripped, call function that imports SOURCES from pudl
move towards using generic frictionless package to reduce custom imports from PUDL
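The interim flag could look roughly like the sketch below. It assumes that `SOURCES` can be imported from `pudl.metadata.sources` (the module named in the workflow above). `ARCHIVER_SOURCES`, `get_source_metadata`, and the `eiamecs` key are hypothetical names used only for illustration.

```python
from typing import Any

# Hypothetical metadata that lives in the pudl-archiver repository itself.
ARCHIVER_SOURCES: dict[str, dict[str, Any]] = {
    "eiamecs": {"title": "EIA Manufacturing Energy Consumption Survey"},
}


def get_source_metadata(name: str, from_pudl: bool = False) -> dict[str, Any]:
    """Look up a dataset's metadata, consulting PUDL only when asked to."""
    if from_pudl:
        # Imported lazily so that PUDL only has to be installed for
        # archivers that set the flag. Assumption: SOURCES is importable
        # from pudl.metadata.sources.
        from pudl.metadata.sources import SOURCES

        return SOURCES[name]
    return ARCHIVER_SOURCES[name]
```

With the import inside the function, `get_source_metadata("eiamecs")` works without PUDL installed, and only archivers that pass `from_pudl=True` need the optional dependency.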

Just handle the addition of metadata for new archivers in the PUDL repo.
Will this cause problems for our existing docs and workflow? Every dataset needs a jinja template to be added, and we could note them in the 'other datasets' documentation as datasets we've archived but aren't integrating. Make sure these aren't named "pudl raw whatever" if we aren't integrating them, though.

Make a complementary method to from_pudl_metadata that doesn't append "PUDL Raw" to the title for datasets that aren't in a list of PUDL-integrated datasets. Possibly grab this list dynamically from the PUDL settings.
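The title logic for such a complementary method might be sketched as follows. `archive_title` and the `pudl_datasets` argument are hypothetical, and the exact position and format of the "PUDL Raw" label is an assumption rather than a copy of what from_pudl_metadata does.

```python
def archive_title(source_title: str, dataset: str, pudl_datasets: set[str]) -> str:
    """Build an archive title, labelling only PUDL-integrated datasets.

    Hypothetical helper: `pudl_datasets` stands in for whatever list of
    integrated datasets ends up being used (for example, one read from
    the PUDL settings).
    """
    if dataset in pudl_datasets:
        # Assumed label format for datasets that PUDL ingests.
        return f"PUDL Raw {source_title}"
    # Archived-only datasets keep their plain title.
    return source_title
```

Passing the set of integrated datasets as an argument keeps the helper independent of PUDL itself, so the list can come from the PUDL settings when PUDL is installed or from a static list when it is not.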


e-belfer · 2024-12-18T16:09:27Z

After talking with @zschira, we're going with option #5 - rather than implementing a rapid-fire hacky solution, let's just make it possible to move forward on this project without getting mired in a half-refactor of frictionless packages or creating more janky code we'll have to undo later this year.

e-belfer added the metadata Managing data about our data label Dec 17, 2024

e-belfer added this to Catalyst Megaproject Dec 17, 2024

github-project-automation bot moved this to New in Catalyst Megaproject Dec 17, 2024

e-belfer moved this from New to Backlog in Catalyst Megaproject Dec 17, 2024

e-belfer moved this from Backlog to Icebox in Catalyst Megaproject Dec 18, 2024

This was referenced Jan 7, 2025

Make it possible to pass another sources dict to DataSource catalyst-cooperative/pudl#4003

Merged

Make it possible to add non-PUDL data sources and metadata to the archiver #506

Merged
