Replies: 2 comments 1 reply
Quick & Dirty Near Term
Longer Term
I'm just going to ramble a little here.
General principles
Whatever solution we come up with, I think we want to do our best to ensure:
Separate EL DAGs / Repos
I agree it seems like we're going to keep having data sources that need more involved pre-processing, with both the FERC DBF/XBRL and the SEC 10-K PDFs being good examples. Having a clear boundary between the tabular data processing that happens in PUDL and the unstructured, more resource intensive data processing that happens less frequently and relies on other kinds of tools / dependencies outside of PUDL's scope makes sense.
I don't think spinning all of those separate EL pre-processing steps out into a single repo makes sense, since they're going to have very different dependencies / infra. I also don't think making these into very infrequent, manually run processes is a great idea. That's likely to result in problems accumulating between runs that we don't notice until later, and could easily result in much less reproducible outputs -- having the nightly builds has given us great visibility into when there are problems and where, often with some advance notice. As has having the monthly archiver runs.
I'm worried that if we spin things out separately, the non-PUDL repos will each end up having their own janky way of doing things that only one person understands, which will be bad for maintainability and reproducibility. Even if there are parts of our overall data processing system that sit outside the main PUDL repo, I think they should also be transparent, reproducible, and maintainable.
Part of why we run e.g. the FERC extractions infrequently is because we don't want to deal with all of the downstream impacts frequently. But if that process were running independently of the rest of PUDL, it could run more frequently and produce its own lineage of outputs on a regular basis, and whenever PUDL is ready to grab a new version of (say) the FERC-714 outputs, we could just update the Zenodo DOI to pull the most recent outputs, even if not every upstream output ended up in a PUDL release.
Could we have separate code locations / sub-DAGs that were part of the main PUDL repo and only run conditionally, with a sensor, when the code that controls e.g. the FERC extraction has changed or the input data has changed? Like factor out…
Dagster Deployment
A lot of this is once again dancing around the functionality that a "real" Dagster deployment is designed to provide -- not everything running all the time from scratch, just those things that have been touched to update the outputs that need refreshing. Is there some way we could have a deployment that writes Parquet & DuckDB outputs directly to cloud storage on a nightly basis if…
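To make the sensor idea a bit more concrete, here is a rough sketch of what a conditional FERC extraction trigger could look like in Dagster. The ferc_extract_job, the raw_ferc_xbrl asset group, and the get_latest_ferc_archive_version() helper are all made up for illustration; this is not describing anything that exists in PUDL's Dagster deployment today.

```python
from dagster import AssetSelection, RunRequest, SkipReason, define_asset_job, sensor


def get_latest_ferc_archive_version() -> str:
    """Hypothetical change signal, e.g. the DOI or hash of the newest raw FERC archive."""
    return "10.5281/zenodo.0000000"  # placeholder


# Hypothetical job covering only the expensive FERC extraction assets.
ferc_extract_job = define_asset_job(
    "ferc_extract_job", selection=AssetSelection.groups("raw_ferc_xbrl")
)


@sensor(job=ferc_extract_job, minimum_interval_seconds=3600)
def ferc_inputs_updated(context):
    """Only kick off the FERC extraction when its inputs (or code) have changed."""
    latest = get_latest_ferc_archive_version()
    if latest != context.cursor:
        context.update_cursor(latest)
        yield RunRequest(run_key=latest)
    else:
        yield SkipReason("No new FERC inputs since the last extraction run.")
```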
Background
One clear pain point in integrating SEC10k data has been the handoff of data from the upstream repo to PUDL. I tried to start some discussion last year to come up with a solution to handle this, but the setup we ended up with is janky, unnecessarily complex, and has created a bad development experience when working with 10k data in PUDL.
In coming up with a better solution to handle this data, I think it's worth considering whether that solution should be generalized to other cases that follow a similar pattern. One such case is the extraction of FERC XBRL data. While this doesn't require a dedicated ML model like the 10k extraction, it is similar in that the extraction is relatively time/resource consuming and generally gets run less often than the normal PUDL ETL during development. This also creates reproducibility issues, since developers don't regularly update their raw FERC SQLite DBs, leading to minor differences between the data used locally during development and the data used in the CI/nightly builds.
The core similarity in both of these cases is that before the data makes its way into PUDL, it passes through an Extract and Load (EL) step to get it into a more usable format, and neither the raw data nor the EL implementations change frequently, so they don't need to be run as often as the rest of the ETL.
Proposed Solution
To more cleanly handle these cases, I believe we should create a dedicated EL pipeline that produces the raw versions of these datasets that PUDL interacts with. This EL will only be run when the data or the implementation changes, and after each run the outputs will be archived along with the rest of our raw archives. PUDL will then be able to depend on these versioned archives as normal, ensuring developers, the CI, and the nightly builds are all using the exact same versions of these datasets without adding any new complexity to PUDL.
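To illustrate what "depending on these versioned archives as normal" would mean in practice: each externally produced raw input would be pinned to a specific Zenodo DOI somewhere in PUDL's configuration, so picking up a new EL run is just a DOI bump. The mapping name and DOI values below are purely hypothetical placeholders, not PUDL's actual datastore settings.

```python
# Hypothetical illustration only: pin each externally produced raw input to a
# versioned Zenodo deposition, so dev, CI, and nightly builds all resolve the
# exact same data. Bumping a DOI is the whole "upgrade" story.
RAW_INPUT_DOIS = {
    "ferc_xbrl_extracted": "10.5281/zenodo.0000000",  # placeholder DOI
    "sec10k_extracted": "10.5281/zenodo.0000001",  # placeholder DOI
}
```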
Detailed design
Where does this EL live?
For a while we considered turning the mozilla_sec_eia repo into a PUDL ML repo, but now I think it would make more sense to make that repo the new home of this EL. This would mean migrating the XBRL extraction out of its existing repo and into this new home, which would allow us to remove the dependency on the current repo from PUDL. This would require some upfront work to get the code migrated, but ultimately it embraces the idea of "distribute data, not code", and I believe it would significantly reduce friction in managing the XBRL extraction.
This repo will only perform the absolute minimal amount of transformation required to get the data into a standard, accessible format, and the outputs will be written to GCS. From here we will have archivers in the pudl-archiver repo, which pull these outputs and archive them on Zenodo. I believe it is better for the EL to write to GCS instead of Zenodo directly, because interacting with GCS is much simpler than using the Zenodo API, and I think it is best to keep this complexity isolated in the pudl-archiver repo.
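For a sense of how simple the archiver side could stay, here is a sketch of mirroring one dataset's EL outputs out of GCS before handing them to the existing Zenodo upload machinery. The bucket name, prefix layout, and mirror_el_outputs() function are assumptions for illustration, not current pudl-archiver code.

```python
from pathlib import Path

from google.cloud import storage  # pip install google-cloud-storage


def mirror_el_outputs(bucket_name="pudl-el-outputs", prefix="production/raw_ferc_xbrl/"):
    """Download one dataset's EL outputs from GCS into a local working directory,
    ready to be handed off to the existing Zenodo upload code."""
    client = storage.Client()
    workdir = Path("archiver_workdir") / prefix
    workdir.mkdir(parents=True, exist_ok=True)
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        if blob.name.endswith("/"):  # skip "directory" placeholder objects
            continue
        blob.download_to_filename(str(workdir / Path(blob.name).name))
    return workdir
```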
How does the EL interface with the archiver?
The EL will write outputs to one DuckDB file per dataset, which will have an attached frictionless DataPackage to communicate all necessary metadata. The archiver can then simply mirror each of these DuckDB/DataPackage combos into a Zenodo deposition. We could also mirror these files to S3, which would allow easy access for users interested in the raw data, as well as usage within the PUDL viewer.
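As a rough illustration of this handoff format, here is what writing one of these DuckDB + DataPackage pairs might look like. The dataset and table names are made up, and a real implementation would presumably use the frictionless library to build and validate the descriptor rather than hand-rolling JSON as done here.

```python
import json

import duckdb


def write_raw_dataset(df, dataset="raw_sec10k", table="raw_sec10k__filings"):
    """Write a pandas dataframe to a one-file-per-dataset DuckDB database plus a
    frictionless-style datapackage.json describing its contents."""
    con = duckdb.connect(f"{dataset}.duckdb")
    con.register("df_view", df)
    con.execute(f"CREATE OR REPLACE TABLE {table} AS SELECT * FROM df_view")
    con.close()

    # Minimal Data Package descriptor; a real version would include column types,
    # descriptions, primary keys, provenance, etc.
    descriptor = {
        "name": dataset,
        "resources": [
            {
                "name": table,
                "path": f"{dataset}.duckdb",
                "format": "duckdb",
                "schema": {"fields": [{"name": col} for col in df.columns]},
            }
        ],
    }
    with open(f"{dataset}.datapackage.json", "w") as f:
        json.dump(descriptor, f, indent=2)
```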
Running the EL
In the initial phase, running the EL will be a somewhat manual process, but we should consider how we can make it as streamlined and reproducible as possible. To do this we will provide a script which launches a pre-configured cloud VM with the necessary resources, like sufficient memory and a GPU for SEC10k extraction. The script will also deploy a dockerized version of the EL pipeline on the VM, which the developer can control. While running the EL, outputs will be written to a 'staging' area within GCS. After a complete EL run, the developer can then run a script to 'publish' the outputs they've just created in the 'staging' area. The publication process will be non-destructive: each time it is run, a new subdirectory will be created within a 'production' area in GCS, where the new outputs will be written. Once a new version shows up in this 'production' area, the archivers will automatically discover the new data, making it immediately available in PUDL with a simple update of the DOI for that dataset.
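Here is a minimal sketch of what the non-destructive 'publish' step could look like, assuming the EL writes to a single GCS bucket with staging/ and production/ prefixes. The bucket name, prefix layout, versioning scheme, and publish_staging_run() function are all placeholders, not an existing convention.

```python
from datetime import datetime, timezone

from google.cloud import storage  # pip install google-cloud-storage


def publish_staging_run(bucket_name="pudl-el-outputs", dataset="raw_sec10k"):
    """Non-destructively promote a finished staging run: copy it into a new
    timestamped subdirectory under production/, leaving prior versions untouched."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    version = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    staging_prefix = f"staging/{dataset}/"
    production_prefix = f"production/{dataset}/{version}/"
    for blob in client.list_blobs(bucket_name, prefix=staging_prefix):
        new_name = blob.name.replace(staging_prefix, production_prefix, 1)
        bucket.copy_blob(blob, bucket, new_name)  # copy only; never overwrite or delete
    return production_prefix
```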
How could this be funded?
We've allocated 80 hours this quarter for improving CI speed + 20 hours for improving external CI access, and I believe this project could fit in either of those buckets. Completely pulling FERC extraction out of CI would improve CI speed, and it would also make the CI more consistent with local development by reducing the possibility of different versions of raw FERC data being used in different environments.
We've also allocated time for getting the PUDL viewer up to feature parity with Datasette, and streamlining how FERC and raw 10k data get into a format the PUDL viewer can access could fit into that bucket as well.
How do we get 10k data out now?
We probably don't want to further delay getting good versions of 10k data out in order to complete this project, so we should come up with an adequate intermediate solution to get that data out. I propose we create a static archive on Zenodo in the easiest way possible, even if it is somewhat janky or manual for the moment. We will make PUDL depend on this archive until we get to a point where we can adopt this new framework.