Description
This Epic tracks updates to the data archiving and access processes. The previous process for creating new archives involved first running the scraper to download new data locally, then using the archiver to upload that data to Zenodo and create a new archive version. This manual process makes updating archives cumbersome and requires someone to be aware of upstream updates, which often leads to stale data. Combining the archiver and scrapers will not only simplify this process, but also make automation much easier.
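To make the combined workflow concrete, here is a minimal sketch of what a single scrape-and-archive run could look like. The function names, URLs, and token handling are illustrative assumptions rather than the actual pudl-archiver interface, and the Zenodo calls are a simplified version of its documented REST deposit workflow (no versioning, metadata, or retry handling).

```python
"""Hypothetical sketch of a combined scrape-and-archive run (not the pudl-archiver API)."""
import requests

ZENODO_API = "https://zenodo.org/api"


def scrape(urls: list[str]) -> dict[str, bytes]:
    """Download raw files from the upstream source (the old scraper step)."""
    return {url.rsplit("/", 1)[-1]: requests.get(url, timeout=60).content for url in urls}


def archive(files: dict[str, bytes], token: str) -> int:
    """Upload scraped files to a new Zenodo deposition (the old archiver step)."""
    # Create an empty deposition and grab its file bucket URL.
    resp = requests.post(
        f"{ZENODO_API}/deposit/depositions",
        params={"access_token": token},
        json={},
        timeout=60,
    )
    resp.raise_for_status()
    deposition = resp.json()
    bucket = deposition["links"]["bucket"]

    # Stream each scraped file into the deposition's bucket.
    for name, data in files.items():
        requests.put(
            f"{bucket}/{name}", data=data, params={"access_token": token}, timeout=300
        ).raise_for_status()
    return deposition["id"]


if __name__ == "__main__":
    # Both steps run in one process, so a single scheduled job can keep
    # archives fresh without anyone running the scraper by hand.
    files = scrape(["https://www.example.com/upstream/data.zip"])  # placeholder URL
    archive(files, token="ZENODO_TOKEN")  # placeholder token
```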
Once new data archives are created, there is still no easy way to access these raw archives outside of PUDL, because the Datastore that PUDL uses to access them is embedded within PUDL. Making the Datastore a standalone software package would allow client projects and users to access these archives directly.
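For illustration, library access from a client project might look something like the sketch below. The package name `pudl_datastore`, the constructor arguments, and the `get_unique_resource()` call are assumptions about the eventual standalone API, not a committed interface.

```python
"""Illustrative use of a hypothetical standalone Datastore package from a client project."""
from pathlib import Path

from pudl_datastore import Datastore  # hypothetical standalone package name

# Point the datastore at a local cache; archives are fetched from Zenodo
# on first use and served from the cache afterwards.
ds = Datastore(local_cache_path=Path("~/.cache/pudl").expanduser())

# Retrieve one raw archive partition without installing or importing PUDL.
raw_zipfile: bytes = ds.get_unique_resource("eia860", year=2022)
Path("eia860-2022.zip").write_bytes(raw_zipfile)
```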
Scope
- How do we know when we are done? This epic is done when dataset archives are updated automatically.
- What is out of scope? Integrating specific datasets.
Tasks
Archiver
- Refactor archiving process to simplify the workflow, and make it easier to add new datasets pudl-zenodo-storage#35
- Combine archiver/scraper repos into a single repo pudl-archiver#4
- Develop high level script(s) for managing scraping/archiving pudl-archiver#5
- Validate zipfiles and re-download if corrupted pudl-archiver#3 (see the sketch after this list)
- Create GitHub Action for running the scraping/archiving process at desired frequencies pudl-archiver#2
- Develop unit tests for new archiver pudl-archiver#10
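The validate-and-re-download task above could be handled with the standard-library zipfile module, roughly as sketched here. The download URL and retry policy are placeholders; only the CRC validation via `ZipFile.testzip()` is standard behavior.

```python
"""Sketch of zipfile validation with re-download on corruption."""
import zipfile
from pathlib import Path

import requests


def is_valid_zip(path: Path) -> bool:
    """Return True if the file is a readable zipfile with no corrupt members."""
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() returns the first bad member name, or None if all CRCs pass.
            return zf.testzip() is None
    except zipfile.BadZipFile:
        return False


def download_with_validation(url: str, dest: Path, max_attempts: int = 3) -> Path:
    """Download a zipfile, re-downloading up to max_attempts times if it is corrupted."""
    for attempt in range(1, max_attempts + 1):
        resp = requests.get(url, timeout=300)
        resp.raise_for_status()
        dest.write_bytes(resp.content)
        if is_valid_zip(dest):
            return dest
        print(f"Attempt {attempt}: {dest} failed validation, retrying...")
    raise RuntimeError(f"Could not download a valid zipfile from {url}")
```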
PUDL Integration
- Notify PUDL when a new archive is created
- Kick off nightly build to detect problems stemming from new data (one possible mechanism is sketched below)
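The Epic does not specify how PUDL gets notified, but one possible mechanism is GitHub's repository_dispatch API: after publishing a new archive, the archiver pings the PUDL repo so a workflow listening for that event can start a nightly build. The target repo, event name, and token handling below are assumptions.

```python
"""Sketch of notifying PUDL via a GitHub repository_dispatch event (assumed mechanism)."""
import os

import requests


def notify_pudl(dataset: str, doi: str) -> None:
    """Send a repository_dispatch event to the PUDL repo after a new archive is published."""
    resp = requests.post(
        "https://api.github.com/repos/catalyst-cooperative/pudl/dispatches",  # assumed target repo
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        },
        json={
            # A workflow in PUDL listening for this event type can start a nightly build.
            "event_type": "new-archive-published",
            "client_payload": {"dataset": dataset, "doi": doi},
        },
        timeout=30,
    )
    resp.raise_for_status()
```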
Create standalone Datastore
- Move Datastore source code to a new repo so it can be used as a library
- Pull over tests from PUDL and set up CI
- Implement basic CLI for accessing data (see the sketch after this list)
- Package the Datastore on PyPI and conda-forge
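A basic CLI for the standalone Datastore might look like this minimal argparse sketch, assuming the same hypothetical `pudl_datastore` package sketched above. The command name, arguments, and Datastore methods are placeholders, not a committed interface.

```python
"""Minimal sketch of a CLI for fetching raw archives via a standalone Datastore."""
import argparse
from pathlib import Path

from pudl_datastore import Datastore  # hypothetical standalone package


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Fetch a raw archive partition from Zenodo via the standalone Datastore."
    )
    parser.add_argument("dataset", help="Dataset name, e.g. eia860")
    parser.add_argument("--year", type=int, help="Partition year to fetch")
    parser.add_argument(
        "--out", type=Path, default=Path("."), help="Directory to write the archive into"
    )
    args = parser.parse_args()

    ds = Datastore()
    filters = {"year": args.year} if args.year else {}
    data = ds.get_unique_resource(args.dataset, **filters)

    out_path = args.out / f"{args.dataset}.zip"
    out_path.write_bytes(data)
    print(f"Wrote {out_path}")


if __name__ == "__main__":
    main()
```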