Preprocessing #390
Conversation
…loop which runs through all models and selects the best model. This outer loop goes through various preprocessing methods for the data. Plenty of work is still required: variable names should become consistent with the code base, and the input and output processes should be integrated. However, for now, the whole pipeline and plotting functionalities run with no errors.
… pass only 2 positional arguments instead of 3
…line Preprocessing pipeline
autoemulate/preprocess_target.py
Outdated
""" | ||
if model.transformer is not None and isinstance(transformer.base_model, TargetPCA): | ||
x_reconstructed_mean = model.transformer_.inverse_transform(x_latent_pred) | ||
x_reconstructed_std = transformer.base_model.inverse_transform_std(x_latent_std) # TODO: fix such this is a method of the transformer | ||
|
||
else: | ||
""" | ||
# TODO: implment also "delta method", in addition to "sampling method" for variance reconstruction | ||
if len(x_latent_pred.shape) == 1: | ||
x_latent_pred = x_latent_pred.reshape(-1, 1) | ||
x_latent_std = x_latent_std.reshape(-1, 1) |
Just to confirm - should the code on L450-L455 be inside the `"""` here?
I removed these comments and expanded the function description as:
```python
def inverse_transform_with_std(model, x_latent_pred, x_latent_std, n_samples=1000):
    """
    Transforms uncertainty (standard deviation) from latent space to original space
    using a sampling-based method.

    This approach draws samples from the latent Gaussian distribution
    (defined by the predicted mean and standard deviation), reconstructs
    each sample in the original space, and computes the resulting mean and
    standard deviation.

    Future improvements could include:
    - Analytical propagation for linear reductions (e.g., PCA)
    - Delta method for nonlinear reductions (e.g., VAE)
    """
```
I'll update issue #376 with more details on this.
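For reference, a minimal sketch of the sampling-based propagation that docstring describes might look like the following. The helper name is hypothetical, and it assumes the transformer exposes `inverse_transform`; this is not the PR's actual implementation.

```python
# Hypothetical sketch of the sampling-based method described above; not the
# PR's implementation. Assumes `transformer.inverse_transform` exists.
import numpy as np

def sampled_inverse_std(transformer, latent_mean, latent_std, n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    # Draw from the latent Gaussian: shape (n_samples, *latent_mean.shape)
    draws = rng.normal(latent_mean, latent_std, size=(n_samples, *latent_mean.shape))
    # Reconstruct each draw in the original output space
    recon = np.stack([transformer.inverse_transform(d) for d in draws])
    # Mean and std of the reconstructions approximate the propagated uncertainty
    return recon.mean(axis=0), recon.std(axis=0)
```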
Thanks again @marjanfamili and @ContiPaolo - these preprocessing implementations and reaction-diffusion examples are really great! I made some changes for the warnings mentioned (09cd26a, a273831) and I think this is ok now, as I didn't see the issue anymore in the tests. There are a couple of small comments above, plus this comment I wasn't sure about. One other small issue I noticed when running notebook … Otherwise looks good to me!
```
Name of the dimensionality reducer to use.
Options:
- 'PCA': Principal Component Analysis
- 'AE': Autoencoder
```
I might have missed it, but it looks like we don't have a non-variational autoencoder option? Btw, AE is also mentioned in the 06_reaction_diffusion_time notebook (in section 2).
Well done everyone, this is really great! Love the examples :) I only have some minor comments about the notebooks. I think it would be great to expand sections 2 and 3 a bit to say more about the dimensionality reduction methods - an extra line on what options you are searching over, and also which one is returned as part of the best model (e.g., "In this example we test PCA with a range of dimensionality values.... The best performing model uses PCA with 64 components."). It's in the code, but it would be good to make it very explicit in the text as well. In section 2 you also say in the text that the user can indicate which …
I'd also update the intro sections of the two notebooks to make it clear to the reader how they differ from the start (and how the second one builds on top of the first).
Small update to make - ensure that the new notebooks are added in …
Thank you @sgreenbury and @radka-j for all your comments! We have implemented all your suggestions regarding the code. I sorted out the issue Sam spotted in tutorial 3 (@marjanfamili note that …). Also, thanks for all the feedback on the tutorials. They definitely need more explanation. I will fix them and take all your comments into account.
📝 Description
This pull request introduces dimensionality reduction on the outputs within the Autoemulate pipeline. This is essential to guarantee efficient performance of Autoemulate on high-dimensional outputs (e.g., large-scale systems, spatially distributed data on a fine grid, reconstruction of image data).
When `autoemulate.compare` is called, each dimensionality reducer is first fitted on the entire dataset to reduce its dimensionality. The reduced data is then passed through cross-validation, and the best combination of reducer + model is returned. The pre-fitted reducer is passed as a non-trainable object into the pipeline so that, during cross-validation, only the models are trained (a minimal sketch of such a frozen wrapper follows).
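For illustration, here is one way a pre-fitted reducer can be made non-trainable inside a scikit-learn pipeline. The class name is an assumption for this sketch, not the PR's actual code.

```python
# Sketch of a "non-trainable" wrapper: the reducer is fitted once on the full
# dataset, then fit() becomes a no-op so CV only refits the downstream model.
# FrozenTransformer is a hypothetical name, not a class from this PR.
from sklearn.base import BaseEstimator, TransformerMixin

class FrozenTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, fitted_transformer):
        self.fitted_transformer = fitted_transformer

    def fit(self, X, y=None):
        return self  # already fitted; skip refitting during cross-validation

    def transform(self, X):
        return self.fitted_transformer.transform(X)

    def inverse_transform(self, X):
        return self.fitted_transformer.inverse_transform(X)
```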
👩💻 Usage
In `autoemulate.setup`, a dictionary of dimensionality reducers (`preprocessing_methods`) can be passed to integrate them into the pipeline. A hypothetical call is sketched below.
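The exact schema of `preprocessing_methods` isn't shown in this thread, so the keys and values below are assumptions; only the overall shape of the call follows the PR description.

```python
# Hypothetical usage; the keys inside preprocessing_methods are assumptions,
# not taken from the PR.
import numpy as np
from autoemulate.compare import AutoEmulate

X = np.random.rand(100, 5)      # inputs
y = np.random.rand(100, 1000)   # high-dimensional outputs

ae = AutoEmulate()
ae.setup(
    X, y,
    preprocessing_methods=[
        {"name": "PCA", "params": {"reduced_dim": 8}},
        {"name": "VAE", "params": {"reduced_dim": 8}},
    ],
)
best_model = ae.compare()  # best combination of reducer + model
```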
🆘 Help Needed
If the latent array is 1D `(n_samples,)` we get lots of test failures; if it is 2D `(n_samples, 1)` we get a lot of warnings, and we are not quite sure what the cause is. Any insights would be appreciated.
🔗 Related Issue
#378 #362 #363 #348
🛠 Type of Change
✅ Checklist
🖼️ Screenshots (if applicable)
💬 Additional Notes
A new class, `AutoEmulatePipeline`, replaces the `_process_models` method to create an `sklearn.Pipeline` that transforms both inputs and outputs. The previous implementation only allowed transformations of inputs. Now, two independent pipelines (one for inputs and one for outputs) are created and wrapped in the sklearn class `TransformedTargetRegressor`. All of this is handled within the `AutoEmulatePipeline` class; a sketch of the wrapping idea follows.
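To make the wrapping idea concrete, here is a minimal sketch using plain scikit-learn pieces, with `PCA` standing in for the PR's `TargetPCA`; this is not the actual `AutoEmulatePipeline` code.

```python
# Minimal sketch of wrapping a model with TransformedTargetRegressor so the
# outputs are reduced before fitting and reconstructed at predict time.
# Plain sklearn PCA stands in for the PR's TargetPCA.
from sklearn.compose import TransformedTargetRegressor
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor

model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(),
    transformer=PCA(n_components=8),  # fitted on y, not X
    check_inverse=False,              # PCA reconstruction is lossy
)
# model.fit(X, y) reduces y to 8 components and fits the regressor on them;
# model.predict(X) maps predictions back to the original output space.
```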
Dimensionality reduction methods
Dimensionality reduction methods are implemented in the new file `preprocess_target.py`:
- `TargetPCA`
- `TargetVAE`
Planned improvements:
- Refactor `preprocess_target.py` to create a registry of dimensionality reducers (similar to the existing model registry); see the sketch after this list. This can be easily implemented in the `AutoEmulatePipeline` class, which is responsible for both the input and output pipelines.
- The names of the dimensionality reduction methods will need to be updated to remove Target or Output.
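A registry along the lines of the existing model registry might look like the following sketch; all names here are hypothetical, not the planned API.

```python
# Hypothetical sketch of a reducer registry mirroring the model-registry
# pattern; names and structure are assumptions.
TARGET_REDUCERS = {}

def register_reducer(name):
    """Class decorator that registers a dimensionality reducer under `name`."""
    def decorator(cls):
        TARGET_REDUCERS[name] = cls
        return cls
    return decorator

def get_reducer(name, **params):
    """Instantiate a registered reducer, e.g. get_reducer("PCA", n_components=8)."""
    return TARGET_REDUCERS[name](**params)
```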
Tutorial:
A tutorial on a reaction-diffusion system with high-dimensional outputs has been added to demonstrate the use of dimensionality reduction in Autoemulate (Add tutorial for spatial distributed example on website #372).
Reviewers:
👀 Pay special attention to:
- The `preprocessing_method` handling in `preprocess_target.py` and `compare.py`
- How `predict` is overwritten in `InputOutputPipeline` in `preprocess_target.py` (a rough, assumed illustration is sketched below)
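For readers outside the PR, here is an assumed illustration of why overriding `predict` matters here: returning uncertainty in the original output space. The class body is a sketch, not the PR's `InputOutputPipeline`, and it assumes the underlying regressor supports `return_std` (e.g., a Gaussian process).

```python
# Assumed sketch of a predict override that propagates uncertainty back to
# the original output space; not the PR's actual InputOutputPipeline code.
import numpy as np
from sklearn.compose import TransformedTargetRegressor

class InputOutputPipelineSketch(TransformedTargetRegressor):
    def predict(self, X, return_std=False):
        if not return_std:
            return super().predict(X)
        # Predict mean and std in the reduced (latent) space
        latent_mean, latent_std = self.regressor_.predict(X, return_std=True)
        # Propagate the std by sampling, as in the docstring discussed above
        rng = np.random.default_rng(0)
        draws = rng.normal(latent_mean, latent_std,
                           size=(1000, *latent_mean.shape))
        recon = np.stack([self.transformer_.inverse_transform(d) for d in draws])
        return self.transformer_.inverse_transform(latent_mean), recon.std(axis=0)
```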