Re-design pudl-archiver to better separate dependencies from main pudl repository
#499
Labels: metadata (Managing data about our data)
Our pudl-archiver and pudl repositories are linked together because we store metadata for datasets ingested in PUDL in the pudl repository. Right now, adding an archiver follows this workflow:
- [...] pudl repo to add the dataset's metadata to pudl.metadata.sources
- [...] pudl-archiver environment locally to reinstall PUDL and write an archiver.
The problem
When we ingest datasets we don't envision ending up in PUDL in the short-term, we need to do one of two things:
It's also impossible to use the pudl-archiver repository without installing PUDL. In order to be able to more broadly use our lovely archiving infrastructure to archive other datasets or to make it easier for external contributors to write archivers, we should more clearly separate these two repositories.
Existing interconnections
The pudl-archiver repository currently has the following dependencies on pudl functions, classes and dictionaries:
In pudl_archiver.frictionless.py:
See in particular the from_pudl_metadata method in the DataPackage class.
The desired solution
- [...] pudl and pudl-archiver repositories.
- [...] pudl-archiver repository without needing to open a PR in the main PUDL repo, if the archives are not clearly intended to land in PUDL.
- [...] pudl repository can be installed as an optional dependency for the pudl-archiver repository.
Questions to resolve
Some potential options to explore
- Use GHA to sync the source metadata files from pudl into pudl-archiver, and add a second sources file for non-PUDL metadata. This doesn't resolve the CONTRIBUTORS import. It basically kicks the can down the road and creates a dependency in GH, which is not ideal.
- Add a flag for archivers dependent on PUDL metadata, but make it possible for archives to rely on metadata provided in the pudl-archiver repository.
- Move all metadata into the pudl-archiver repo, and have PUDL depend on the pudl-archiver repo. This is the most complex, but we would basically invert all the imports. This would touch all of our settings classes and Resource creation, as well as the creation of the default run settings.
- Come up with an interim solution:
  - [...] pudl
    Will this cause problems for our existing docs and workflow? Every dataset needs a jinja template to be added, and we could note them in the 'other datasets' documentation as datasets we've archived but aren't integrating. Make sure these aren't named "pudl raw whatever" if we aren't integrating them, though.
  - [...] from_pudl_metadata that doesn't append "PUDL Raw" to the title for datasets that aren't in a list. Possibly grab this dynamically from the PUDL settings.
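To make the "flag" and "interim" options more concrete, here is a minimal sketch of how an archiver might look up metadata when PUDL is only an optional dependency. It is illustrative only and is not the existing from_pudl_metadata implementation. The names LOCAL_SOURCES, load_pudl_sources, resolve_metadata and archive_title are hypothetical, the attribute name SOURCES inside pudl.metadata.sources is an assumption, and the exact "PUDL Raw {title}" format is assumed from the description above.

```python
import importlib

# Hypothetical metadata for datasets archived without being integrated into PUDL.
LOCAL_SOURCES = {
    "example_dataset": {"title": "Example Dataset"},
}


def load_pudl_sources():
    """Return PUDL's source metadata, or {} when PUDL is not installed."""
    try:
        module = importlib.import_module("pudl.metadata.sources")
    except ImportError:
        return {}
    # Assumption: the metadata dictionary is exposed under the name SOURCES.
    return getattr(module, "SOURCES", {})


def resolve_metadata(name, pudl_sources=None, local_sources=None):
    """Return (metadata, in_pudl), preferring PUDL metadata when present."""
    if pudl_sources is None:
        pudl_sources = load_pudl_sources()
    if local_sources is None:
        local_sources = LOCAL_SOURCES
    if name in pudl_sources:
        return dict(pudl_sources[name]), True
    if name in local_sources:
        return dict(local_sources[name]), False
    raise KeyError(f"No metadata found for dataset {name!r}")


def archive_title(name, pudl_sources=None, local_sources=None):
    """Prefix the title with "PUDL Raw" only for datasets known to PUDL."""
    metadata, in_pudl = resolve_metadata(name, pudl_sources, local_sources)
    title = metadata["title"]
    return f"PUDL Raw {title}" if in_pudl else title
```

Passing pudl_sources explicitly, for example archive_title("example_dataset", pudl_sources={}), exercises the non-PUDL branch without PUDL installed. Leaving it as None attempts the import and falls back to the local metadata if PUDL is absent. Whether the "is this dataset in PUDL" check should instead come from the PUDL settings, as suggested above, is left open here.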