Background
The updated FERC-FERC inter-year plant matching model in #3007 uses PCA, which is much faster if we pre-fit the model and save the weights somewhere. However, when we cache the model using sklearn's built-in persistence tooling, the result contains many files and occupies close to 1GB of disk space, so we probably shouldn't commit it directly to the PUDL repo.
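For reference, a minimal sketch of the kind of caching involved, using sklearn's joblib-based persistence (the filename and the random feature matrix are hypothetical placeholders for the real record linkage features):

```python
import joblib
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the FERC plant record feature matrix; in
# practice this comes from the record linkage feature engineering step.
features = np.random.rand(1000, 50)

# Pre-fit the PCA model once...
pca = PCA(n_components=10).fit(features)

# ...and persist the fitted weights to disk. For a large model this is
# the cache that balloons to many files / ~1GB, which is what we don't
# want to commit to git.
joblib.dump(pca, "plant_matching_model.joblib")

# Later runs can skip fitting entirely:
pca = joblib.load("plant_matching_model.joblib")
```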
Some possible approaches I see for dealing with this:
Use git lfs
I don't have much experience with git lfs, so I don't have a great sense of the tradeoffs involved, but it seems like a viable option.
Use GCS and the Datastore
We could upload the weights to a cloud bucket and potentially use the Datastore for access. The Datastore would need to support uploading weights and would no longer be using Zenodo as its backend, so it might need some rework for this purpose. The model doesn't need to be updated frequently, so we could probably make the pre-fitting/uploading a manual process, but that doesn't feel ideal. A rough sketch of what that manual step might look like is below.
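As a sketch only: the manual upload/download steps could use the google-cloud-storage client directly (the bucket name and object path here are hypothetical, and credentials are assumed to come from the environment):

```python
from google.cloud import storage

# Hypothetical bucket and object path; authentication is assumed to be
# configured via the environment (e.g. GOOGLE_APPLICATION_CREDENTIALS).
BUCKET = "pudl-model-weights"
BLOB_PATH = "ferc-plant-matching/model.joblib"

client = storage.Client()
bucket = client.bucket(BUCKET)

# Manual step: upload the pre-fit weights after re-training.
bucket.blob(BLOB_PATH).upload_from_filename("plant_matching_model.joblib")

# At ETL time (e.g. via the Datastore), pull the weights back down.
bucket.blob(BLOB_PATH).download_to_filename("plant_matching_model.joblib")
```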
Use GCS with mlflow
mlflow has tooling for storing models and associating them with different performance metrics. It has built-in integration with sklearn and other ML frameworks, and can use GCS as a storage backend. This tooling is nice, but might be overkill just to store weights for a fairly simple model that doesn't need to change frequently. However, if we plan to tackle more of these record linkage problems, and potentially integrate more complex models into PUDL, then it might be smart to start moving in this direction.
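As a rough illustration of what this could look like with mlflow's sklearn integration (the tracking URI, experiment name, and metric are hypothetical; a gs:// artifact store would be configured on the tracking server):

```python
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical tracking server; its artifact store could point at a
# gs:// bucket so the weights land in GCS.
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("ferc-plant-matching")

# Stand-in for the pre-fit matching model.
pca = PCA(n_components=10).fit(np.random.rand(1000, 50))

with mlflow.start_run() as run:
    # Store the fitted model's weights with the run.
    mlflow.sklearn.log_model(pca, artifact_path="model")
    # Associate the stored model with a performance metric
    # (hypothetical metric name and value).
    mlflow.log_metric("match_accuracy", 0.95)

# Later, reload exactly those weights by run ID:
model = mlflow.sklearn.load_model(f"runs:/{run.info.run_id}/model")
```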