How to handle storage/access to pre-trained model weights (FERC1-FERC1 plant match) #3020

@zschira

Background

The updated FERC-FERC inter-year plant matching model in #3007 uses PCA, which runs much faster if we pre-fit the model and save the weights somewhere. However, when we cache the model using sklearn's built-in tooling, the cache contains many files and occupies close to 1 GB of disk space, so we probably shouldn't commit it directly to PUDL.
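For reference, one way to end up with a single file rather than a directory of cache files is to serialize the fitted estimator with joblib. This is only a minimal sketch under assumptions, not the approach taken in #3007: the random feature matrix, the component count, and the file name `ferc_plant_pca.joblib` are all placeholders.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.decomposition import PCA

# Placeholder feature matrix; the real features would come from the FERC plant records.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 20))

# Fit once, offline.
pca = PCA(n_components=5, random_state=0).fit(features)

# Write the fitted estimator to one compressed file (hypothetical file name).
weights_path = Path(tempfile.mkdtemp()) / "ferc_plant_pca.joblib"
joblib.dump(pca, weights_path, compress=3)

# Later (e.g. inside the ETL), load the weights and transform without refitting.
loaded = joblib.load(weights_path)
reduced = loaded.transform(features)
size_mb = weights_path.stat().st_size / 1e6  # on-disk size, for comparison with the cache
```

Files written this way are pickle-based, so they should only be loaded from a trusted source, and they are not guaranteed to load under a different scikit-learn version. Whatever storage option we pick would need to record the version used for fitting.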

Some possible approaches I see for dealing with this:

Use git lfs

I don't have much experience with git lfs, so I don't have a great sense of the tradeoffs involved, but it seems like a viable option.

Use GCS and the Datastore

We could upload the weights to a cloud bucket and potentially use the Datastore for access. However, we would need to be able to upload weights, and we wouldn't be using Zenodo as the backend, so the Datastore might need to be reworked for this purpose. The model doesn't need to be updated frequently, so we could make the pre-fitting and uploading a manual process, but that doesn't feel ideal.
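To make the access side of this option concrete, here is a rough sketch of a download-once, checksum-verified local cache using only the standard library. It is independent of the Datastore and does not reflect its actual API: `fetch_weights`, `sha256sum`, and the cache layout are hypothetical names, and the demo uses a local `file://` URL as a stand-in for a bucket URL.

```python
import hashlib
import shutil
import tempfile
import urllib.request
from pathlib import Path


def sha256sum(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, reading in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_weights(url: str, expected_sha256: str, cache_dir: Path) -> Path:
    """Download the weights file once and reuse the verified local copy."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    local_path = cache_dir / expected_sha256  # cache entries are keyed by content hash
    if local_path.exists() and sha256sum(local_path) == expected_sha256:
        return local_path
    # Download to a temporary name so a partial file is never mistaken for a hit.
    tmp_path = local_path.with_suffix(".part")
    with urllib.request.urlopen(url) as response, tmp_path.open("wb") as out:
        shutil.copyfileobj(response, out)
    if sha256sum(tmp_path) != expected_sha256:
        tmp_path.unlink()
        raise ValueError(f"Checksum mismatch for {url}")
    tmp_path.replace(local_path)
    return local_path


# Demo: a local file:// URL stands in for the (hypothetical) bucket URL.
work_dir = Path(tempfile.mkdtemp())
source = work_dir / "weights.bin"
source.write_bytes(b"placeholder weights")
cached = fetch_weights(source.as_uri(), sha256sum(source), work_dir / "cache")
```

Pinning the expected checksum in the repository would tie each PUDL commit to a specific set of weights even if the upload itself stays a manual step.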

Use GCS with mlflow

mlflow has tooling for storing models and associating them with performance metrics. It has built-in integration with sklearn and other ML frameworks, and can use GCS as a storage backend. This tooling is nice, but might be overkill for storing the weights of a fairly simple model that doesn't need to change frequently. However, if we plan to tackle more of these record linkage problems and potentially integrate more complex models into PUDL, it might be smart to start moving in this direction.

Metadata

Assignees

No one assigned

    Labels

    cloud: Stuff that has to do with adapting PUDL to work in cloud computing context.
    dagster: Issues related to our use of the Dagster orchestrator
    ferc1: Anything having to do with FERC Form 1
    performance: Make PUDL run faster!

    Type

    No type

    Projects

    Status

    Icebox

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests