Background
The updated FERC-FERC inter-year plant matching model in #3007 uses PCA, which is much faster if we pre-fit the model and save the weights somewhere. However, when we cache the model using sklearn's built-in persistence tooling, the result contains many files and occupies close to 1GB of disk space, so we probably shouldn't commit it directly to the PUDL repo.
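For reference, a minimal sketch of the kind of caching involved, using sklearn's joblib-based persistence (the filename and the random feature matrix are hypothetical placeholders for the real record linkage features):

```python
import joblib
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the FERC plant record feature matrix; in
# practice this comes from the record linkage feature engineering step.
features = np.random.rand(1000, 50)

# Pre-fit the PCA model once...
pca = PCA(n_components=10).fit(features)

# ...and persist the fitted weights to disk. For a large model this is
# the cache that balloons to many files / ~1GB, which is what we don't
# want to commit to git.
joblib.dump(pca, "plant_matching_model.joblib")

# Later runs can skip fitting entirely:
pca = joblib.load("plant_matching_model.joblib")
```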
Some possible approaches I see for dealing with this:
Use git lfs
I don't have much experience with git lfs, so I don't have a great sense of the tradeoffs involved, but it seems like a viable option.
Use GCS and the Datastore
We could upload the weights to a cloud bucket and potentially use the Datastore for access. The Datastore would need to support uploading weights and would no longer be using Zenodo as its backend, so it might need some rework for this purpose. The model doesn't need to be updated frequently, so we could probably make the pre-fitting/uploading a manual process, but that doesn't feel ideal. A rough sketch of what that manual step might look like is below.
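As a sketch only: the manual upload/download steps could use the google-cloud-storage client directly (the bucket name and object path here are hypothetical, and credentials are assumed to come from the environment):

```python
from google.cloud import storage

# Hypothetical bucket and object path; authentication is assumed to be
# configured via the environment (e.g. GOOGLE_APPLICATION_CREDENTIALS).
BUCKET = "pudl-model-weights"
BLOB_PATH = "ferc-plant-matching/model.joblib"

client = storage.Client()
bucket = client.bucket(BUCKET)

# Manual step: upload the pre-fit weights after re-training.
bucket.blob(BLOB_PATH).upload_from_filename("plant_matching_model.joblib")

# At ETL time (e.g. via the Datastore), pull the weights back down.
bucket.blob(BLOB_PATH).download_to_filename("plant_matching_model.joblib")
```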
Use GCS with mlflow
mlflow has tooling for storing models and associating them with different performance metrics. It has built-in integration with sklearn and other ML frameworks, and can use GCS as a storage backend. This tooling is nice, but might be overkill just to store weights for a fairly simple model that doesn't need to change frequently. However, if we plan to tackle more of these record linkage problems, and potentially integrate more complex models into PUDL, then it might be smart to start moving in this direction.
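As a rough illustration of what this could look like with mlflow's sklearn integration (the tracking URI, experiment name, and metric are hypothetical; a gs:// artifact store would be configured on the tracking server):

```python
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical tracking server; its artifact store could point at a
# gs:// bucket so the weights land in GCS.
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("ferc-plant-matching")

# Stand-in for the pre-fit matching model.
pca = PCA(n_components=10).fit(np.random.rand(1000, 50))

with mlflow.start_run() as run:
    # Store the fitted model's weights with the run.
    mlflow.sklearn.log_model(pca, artifact_path="model")
    # Associate the stored model with a performance metric
    # (hypothetical metric name and value).
    mlflow.log_metric("match_accuracy", 0.95)

# Later, reload exactly those weights by run ID:
model = mlflow.sklearn.load_model(f"runs:/{run.info.run_id}/model")
```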