Replies: 2 comments 1 reply
Quick & Dirty Near Term
Longer Term
I'm just going to ramble a little here.
General principles
Whatever solution we come up with, I think we want to do our best to ensure:
Separate EL DAGs / Repos
I agree it seems like we're going to keep having data sources that need more involved pre-processing, with both the FERC DBF/XBRL and the SEC 10-K PDFs being good examples. Having a clear boundary between the tabular data processing that happens in PUDL and the unstructured, more resource intensive data processing that happens less frequently and relies on other kinds of tools / dependencies outside of PUDL's scope makes sense.
I don't think spinning all of those separate EL pre-processing steps out into a single repo makes sense, since they're going to have very different dependencies / infra. I also don't think making these into very infrequent, manually run processes is a great idea. That's likely to result in problems accumulating between runs that we don't notice until later, and could easily result in much less reproducible outputs -- having the nightly builds has given us great visibility into when there are problems and where, often with some advance notice. As has having the monthly archiver runs.
I'm worried that if we spin things out separately, the non-PUDL repos will each end up having their own janky way of doing things that only one person understands, which will be bad for maintainability and reproducibility. Even if there are parts of our overall data processing system that sit outside the main PUDL repo, I think they should also be transparent, reproducible, and maintainable.
Part of why we run e.g. the FERC extractions infrequently is because we don't want to deal with all of the downstream impacts frequently. But if that process were running independently of the rest of PUDL, it could run more frequently and produce its own lineage of outputs on a regular basis, and whenever PUDL is ready to grab a new version of (say) the FERC-714 outputs, we could just update the Zenodo DOI to pull the most recent outputs, even if not every upstream output ended up in a PUDL release.
Could we have separate code locations / sub-DAGs that were part of the main PUDL repo and only run conditionally, with a sensor, when the code that controls e.g. the FERC extraction has changed or the input data has changed? Like factor out…
Dagster Deployment
A lot of this is once again dancing around the functionality that a "real" Dagster deployment is designed to provide -- not everything running all the time from scratch, just those things that have been touched to update the outputs that need refreshing. Is there some way we could have a deployment that writes Parquet & DuckDB outputs directly to cloud storage on a nightly basis if…
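To make the sensor idea a bit more concrete, here is a rough sketch of what a conditional FERC extraction trigger could look like in Dagster. The ferc_extract_job, the raw_ferc_xbrl asset group, and the get_latest_ferc_archive_version() helper are all made up for illustration; this is not describing anything that exists in PUDL's Dagster deployment today.

```python
from dagster import AssetSelection, RunRequest, SkipReason, define_asset_job, sensor


def get_latest_ferc_archive_version() -> str:
    """Hypothetical change signal, e.g. the DOI or hash of the newest raw FERC archive."""
    return "10.5281/zenodo.0000000"  # placeholder


# Hypothetical job covering only the expensive FERC extraction assets.
ferc_extract_job = define_asset_job(
    "ferc_extract_job", selection=AssetSelection.groups("raw_ferc_xbrl")
)


@sensor(job=ferc_extract_job, minimum_interval_seconds=3600)
def ferc_inputs_updated(context):
    """Only kick off the FERC extraction when its inputs (or code) have changed."""
    latest = get_latest_ferc_archive_version()
    if latest != context.cursor:
        context.update_cursor(latest)
        yield RunRequest(run_key=latest)
    else:
        yield SkipReason("No new FERC inputs since the last extraction run.")
```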
Background
One clear pain point in integrating SEC10k data has been the handoff of data from the upstream repo to PUDL. I tried to start some discussion last year to come up with a solution to handle this, but the setup we ended up with is janky, unnecessarily complex, and has created a bad development experience when working with 10k data in PUDL.
In coming up with a better solution to handle this data, I think it's worth considering whether that solution should be generalized to other cases that follow a similar pattern. One such case is the extraction of FERC XBRL data. While this doesn't require a dedicated ML model like the 10k extraction, it is similar in that the extraction is relatively time/resource consuming and generally gets run less often than the normal PUDL ETL during development. This also creates reproducibility issues, since developers don't regularly update their raw FERC SQLite DBs, leading to minor differences between the data used locally during development and the data used in the CI/nightly builds.
The core similarity in both of these cases is that before the data makes its way into PUDL, it passes through an Extract and Load (EL) step to get it into a more usable format, and neither the raw data nor the EL implementations change frequently, so they don't need to be run as often as the rest of the ETL.
Proposed Solution
To more cleanly handle these cases, I believe we should create a dedicated EL pipeline that produces the raw versions of these datasets that PUDL interacts with. This EL will only be run when the data or the implementation changes, and after each run the outputs will be archived along with the rest of our raw archives. PUDL will then be able to depend on these versioned archives as normal, ensuring developers, the CI, and the nightly builds are all using the exact same versions of these datasets without adding any new complexity to PUDL.
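To illustrate what "depending on these versioned archives as normal" would mean in practice: each externally produced raw input would be pinned to a specific Zenodo DOI somewhere in PUDL's configuration, so picking up a new EL run is just a DOI bump. The mapping name and DOI values below are purely hypothetical placeholders, not PUDL's actual datastore settings.

```python
# Hypothetical illustration only: pin each externally produced raw input to a
# versioned Zenodo deposition, so dev, CI, and nightly builds all resolve the
# exact same data. Bumping a DOI is the whole "upgrade" story.
RAW_INPUT_DOIS = {
    "ferc_xbrl_extracted": "10.5281/zenodo.0000000",  # placeholder DOI
    "sec10k_extracted": "10.5281/zenodo.0000001",  # placeholder DOI
}
```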
Detailed design
Where does this EL live?
For a while we considered turning the mozilla_sec_eia repo into a PUDL ML repo, but now I think it would make more sense to make that repo the new home of this EL. This would mean migrating the XBRL extraction out of its existing repo and into this new home, which would allow us to remove the dependency on the current repo from PUDL. This would require some upfront work to get the code migrated, but ultimately it embraces the idea of "distribute data, not code", and I believe it would significantly reduce friction in managing the XBRL extraction.
This repo will only perform the absolute minimal amount of transformation required to get the data into a standard, accessible format, and the outputs will be written to GCS. From here we will have archivers in the pudl-archiver repo, which pull these outputs and archive them on Zenodo. I believe it is better for the EL to write to GCS instead of Zenodo directly, because interacting with GCS is much simpler than using the Zenodo API, and I think it is best to keep this complexity isolated in the pudl-archiver repo.
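For a sense of how simple the archiver side could stay, here is a sketch of mirroring one dataset's EL outputs out of GCS before handing them to the existing Zenodo upload machinery. The bucket name, prefix layout, and mirror_el_outputs() function are assumptions for illustration, not current pudl-archiver code.

```python
from pathlib import Path

from google.cloud import storage  # pip install google-cloud-storage


def mirror_el_outputs(bucket_name="pudl-el-outputs", prefix="production/raw_ferc_xbrl/"):
    """Download one dataset's EL outputs from GCS into a local working directory,
    ready to be handed off to the existing Zenodo upload code."""
    client = storage.Client()
    workdir = Path("archiver_workdir") / prefix
    workdir.mkdir(parents=True, exist_ok=True)
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if blob.name.endswith("/"):  # skip "directory" placeholder objects
            continue
        blob.download_to_filename(str(workdir / Path(blob.name).name))
    return workdir
```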
How does the EL interface with the archiver?
The EL will write outputs to one DuckDB file per dataset, which will have an attached frictionless DataPackage to communicate all necessary metadata. The archiver can then simply mirror each of these DuckDB/DataPackage combos into a Zenodo deposition. We could also mirror these files to S3, which would allow easy access for users interested in the raw data, as well as usage within the PUDL viewer.
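As a rough illustration of this handoff format, here is what writing one of these DuckDB + DataPackage pairs might look like. The dataset and table names are made up, and a real implementation would presumably use the frictionless library to build and validate the descriptor rather than hand-rolling JSON as done here.

```python
import json

import duckdb


def write_raw_dataset(df, dataset="raw_sec10k", table="raw_sec10k__filings"):
    """Write a pandas dataframe to a one-file-per-dataset DuckDB database plus a
    frictionless-style datapackage.json describing its contents."""
    con = duckdb.connect(f"{dataset}.duckdb")
    con.register("df_view", df)
    con.execute(f"CREATE OR REPLACE TABLE {table} AS SELECT * FROM df_view")
    con.close()

    # Minimal Data Package descriptor; a real version would include column types,
    # descriptions, primary keys, provenance, etc.
    descriptor = {
        "name": dataset,
        "resources": [
            {
                "name": table,
                "path": f"{dataset}.duckdb",
                "format": "duckdb",
                "schema": {"fields": [{"name": col} for col in df.columns]},
            }
        ],
    }
    with open(f"{dataset}.datapackage.json", "w") as f:
        json.dump(descriptor, f, indent=2)
```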
Running the EL
In the initial phase, running the EL will be a somewhat manual process, but we should consider how we can make it as streamlined and reproducible as possible. To do this we will provide a script which launches a pre-configured cloud VM with the necessary resources, like sufficient memory and a GPU for SEC10k extraction. The script will also deploy a dockerized version of the EL pipeline on the VM, which the developer can control. While running the EL, outputs will be written to a 'staging' area within GCS. After a complete EL run, the developer can then run a script to 'publish' the outputs they've just created in the 'staging' area. The publication process will be non-destructive: each time it is run, a new subdirectory will be created within a 'production' area in GCS, where the new outputs will be written. Once a new version shows up in this 'production' area, the archivers will automatically discover the new data, making it immediately available in PUDL with a simple update of the DOI for that dataset.
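Here is a minimal sketch of what the non-destructive 'publish' step could look like, assuming the EL writes to a single GCS bucket with staging/ and production/ prefixes. The bucket name, prefix layout, versioning scheme, and publish_staging_run() function are all placeholders, not an existing convention.

```python
from datetime import datetime, timezone

from google.cloud import storage  # pip install google-cloud-storage


def publish_staging_run(bucket_name="pudl-el-outputs", dataset="raw_sec10k"):
    """Non-destructively promote a finished staging run: copy it into a new
    timestamped subdirectory under production/, leaving prior versions untouched."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    version = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    staging_prefix = f"staging/{dataset}/"
    production_prefix = f"production/{dataset}/{version}/"
    for blob in client.list_blobs(bucket_name, prefix=staging_prefix):
        new_name = blob.name.replace(staging_prefix, production_prefix, 1)
        bucket.copy_blob(blob, bucket, new_name)  # copy only; never overwrite or delete
    return production_prefix
```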
How could this be funded?
We've allocated 80 hours this quarter for improving CI speed + 20 hours for improving external CI access, and I believe this project could fit in either of those buckets. Completely pulling FERC extraction out of CI would improve CI speed, and it would also make the CI more consistent with local development by reducing the possibility of different versions of raw FERC data being used in different environments.
We've also allocated time for getting the PUDL viewer up to feature parity with Datasette, and streamlining how FERC and raw 10k data get into a format the PUDL viewer can access could fit into that bucket as well.
How do we get 10k data out now?
We probably don't want to further delay getting good versions of 10k data out in order to complete this project, so we should come up with an adequate intermediate solution to get that data out. I propose we create a static archive on Zenodo in the easiest way possible, even if it is somewhat janky or manual for the moment. We will make PUDL depend on this archive until we get to a point where we can adopt this new framework.