
Improve and Automate raw data archiving/access #1418

@bendnorman

Description

This Epic tracks updates to the raw data archiving and access processes. The previous process for creating new archives involved first running the scrapers to download new data locally; the archiver could then be used to upload the new data to Zenodo and create a new archive version. This manual process makes updating archives cumbersome and requires someone to be aware of upstream updates, which often leads to stale data. Combining the archiver and scrapers will not only simplify this process but also make automation much easier.

Once new data archives are created, there is still no easy way to access them outside of PUDL, because the Datastore that PUDL uses to access these archives is embedded within PUDL. Making the Datastore a standalone software package would allow these archives to be accessed from client projects and by users directly.
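For illustration only, here is a minimal sketch of how a client project might use a standalone Datastore. The package name `pudl_datastore`, the `Datastore` constructor arguments, and the `get_unique_resource` call are assumptions modeled on the Datastore currently embedded in PUDL, not a settled interface:

```python
# Hypothetical usage sketch -- the package name and API below are
# assumptions mirroring the Datastore embedded in PUDL today.
from pathlib import Path

from pudl_datastore import Datastore  # assumed package/module name

# Point the datastore at a local cache directory; archives are fetched
# from Zenodo on first access and reused from the cache afterwards.
ds = Datastore(local_cache_path=Path("~/.cache/pudl").expanduser())

# Retrieve one raw archive resource, e.g. a single year of EIA-860 data.
zipfile_bytes = ds.get_unique_resource("eia860", year=2020)
Path("eia860-2020.zip").write_bytes(zipfile_bytes)
```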

Scope

- How do we know when we are done? This epic is done when dataset archives are updated automatically.
- What is out of scope? Integrating specific datasets.

Tasks

Archiver

PUDL Integration

  • Notify PUDL when a new archive is created (a sketch follows this list)
  • Kick off a nightly build to detect problems stemming from new data
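As a concrete but hypothetical example of the notification task, the archiver could fire a `repository_dispatch` event against the PUDL repo, which a GitHub Actions workflow there could use to kick off the nightly build. The event name, payload fields, and `GITHUB_TOKEN` environment variable are assumptions; the `/dispatches` endpoint itself is part of GitHub's REST API:

```python
# Sketch: notify PUDL that a new archive version exists by triggering
# a repository_dispatch event in the PUDL repository.
import os

import requests


def notify_pudl(dataset: str, doi: str) -> None:
    """Fire a repository_dispatch event in catalyst-cooperative/pudl."""
    resp = requests.post(
        "https://api.github.com/repos/catalyst-cooperative/pudl/dispatches",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
        },
        json={
            "event_type": "new-archive-version",  # assumed event name
            "client_payload": {"dataset": dataset, "doi": doi},
        },
        timeout=30,
    )
    resp.raise_for_status()  # GitHub returns 204 No Content on success
```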

Create standalone Datastore

  • Move the datastore source code to a new repo so it can be used as a library
  • Pull over tests from PUDL and set up CI
  • Implement a basic CLI for accessing data (a rough sketch follows this list)
  • Package the Datastore on PyPI and conda-forge
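The CLI task is open-ended, but a rough sketch might look like the following: a command that takes a Zenodo record ID and a filename and downloads that file from the archive. The command name and flags are assumptions; the `records` endpoint and the `files`/`key`/`links.self` response shape come from the public Zenodo REST API:

```python
# Sketch of a minimal datastore CLI: download one file from a Zenodo record.
import argparse

import requests


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Fetch a file from a Zenodo archive."
    )
    parser.add_argument("record_id", help="Zenodo record ID of the archive")
    parser.add_argument("filename", help="Name of the file within the record")
    args = parser.parse_args()

    record = requests.get(
        f"https://zenodo.org/api/records/{args.record_id}", timeout=30
    )
    record.raise_for_status()

    # Each entry in "files" has a "key" (filename) and a download link.
    for entry in record.json()["files"]:
        if entry["key"] == args.filename:
            data = requests.get(entry["links"]["self"], timeout=300)
            data.raise_for_status()
            with open(args.filename, "wb") as f:
                f.write(data.content)
            print(f"Wrote {args.filename} ({len(data.content)} bytes)")
            return
    raise SystemExit(f"{args.filename} not found in record {args.record_id}")


if __name__ == "__main__":
    main()
```

Invoked as, e.g., `python datastore_cli.py <record_id> <filename>`; a real CLI would add dataset-aware lookups and caching on top of this.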
