Preprocessing #390


Merged
76 commits merged into main on Apr 23, 2025

Conversation


@marjanfamili commented Apr 4, 2025

📝 Description

This pull request introduces dimensionality reduction on the outputs within the Autoemulate pipeline. This is essential to guarantee efficient performance of Autoemulate on high-dimensional outputs (e.g., large-scale systems, spatially distributed data on a fine grid, reconstruction of image data).

When autoemulate.compare is called, each dimensionality reducer is preliminarily trained on the entire dataset to reduce its dimensionality. The reduced data is then passed through cross-validation, and the best combination of reducer + model is returned.
The preliminarily trained reducer is passed as a non-trainable object into the pipeline so that, during cross-validation, only the models are trained.
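The effect of freezing the reducer can be illustrated with a minimal sketch (the FrozenReducer wrapper below is hypothetical, not the actual AutoEmulate implementation, which handles this inside AutoEmulatePipeline):

    from sklearn.base import BaseEstimator, TransformerMixin

    class FrozenReducer(BaseEstimator, TransformerMixin):
        """Wraps an already-fitted reducer so cross-validation never retrains it."""

        def __init__(self, fitted_reducer):
            self.fitted_reducer = fitted_reducer

        def fit(self, X, y=None):
            # Deliberate no-op: the reducer was trained once on the full dataset.
            return self

        def transform(self, X):
            return self.fitted_reducer.transform(X)

        def inverse_transform(self, X):
            return self.fitted_reducer.inverse_transform(X)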

👩‍💻 Usage

  • In autoemulate.setup, a dictionary of dimensionality reducers (preprocessing_methods) can be passed to integrate them into the pipeline.
preprocessing_methods = [{"name": "PCA", "params": {"reduced_dim": 8}},
                         {"name": "VAE", "params": {"reduced_dim": 8}}]
em.setup(X, Y, ..., preprocessing_methods=preprocessing_methods)
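
After setup, the usual compare call then cross-validates every reducer + model combination (a sketch; per the discussion below, the exact return value changed during review):

    # Cross-validates all (preprocessing method, model) combinations and
    # returns the best-performing models (self.best_models in this PR).
    best = em.compare()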

🆘 Help Needed

  • Currently we get a lot of warnings in the tests. It seems related to the dimensionality of y: if y is 1D, i.e. (n_samples,), many tests fail; if it is 2D, i.e. (n_samples, 1), we get a lot of warnings instead, and we are not quite sure of the cause. Any insights would be appreciated.

🔗 Related Issue

#378 #362 #363 #348

🛠 Type of Change

  • ✨ New feature (non-breaking change that adds functionality)
  • 📖 Documentation update
  • ♻️ Refactor/Code cleanup
  • 🧪 Test updates
  • 🎓 Tutorial added (new guides, walkthroughs, or examples)

✅ Checklist

  • My code follows the project's coding style 🎨
  • I have tested my changes locally 🖥️
  • I have added/updated unit tests (if applicable) 🧪
  • I have updated the documentation (README, comments, etc.) if needed 📚
  • My changes generate no new warnings or errors ⚠️

🖼️ Screenshots (if applicable)

(screenshot attached in the original PR)

💬 Additional Notes

  • A new class, AutoEmulatePipeline, replaces the _process_models method to create an sklearn.Pipeline that transforms both inputs and outputs. The previous implementation only allowed transformations of the inputs. Now, two independent pipelines (one for inputs and one for outputs) are created and wrapped in the sklearn class TransformedTargetRegressor. All of this is handled within the AutoEmulatePipeline class (a minimal sketch of the wrapping idea appears after this list).

  • Dimensionality reduction methods
    Dimensionality reduction methods are implemented in the new file preprocess_target.py:

    • Principal Component Analysis (PCA): class TargetPCA
    • Variational Autoencoder (VAE): class TargetVAE

    Planned improvements:

    • Move the reducers out of preprocess_target.py into a registry of dimensionality reducers (similar to the existing model registry).
    • Generalize to input transformations (for potential high-dimensional inputs). This can be easily implemented in the AutoEmulatePipeline class, which is responsible for both input and output pipelines; the names of the dimensionality reduction methods will need to be updated to remove Target or Output.
  • Tutorial:
    A tutorial on a reaction-diffusion system with high-dimensional outputs has been added to demonstrate the use of dimensionality reduction in Autoemulate (tracked in #372, "Add tutorial for spatial distributed example on website").
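
A minimal sketch of the wrapping idea described in the first note above, using scikit-learn directly (the estimator choices here are illustrative, not AutoEmulate defaults):

    from sklearn.compose import TransformedTargetRegressor
    from sklearn.decomposition import PCA
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Input pipeline: transforms X, then fits the model.
    input_pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("model", Ridge()),
    ])

    # TransformedTargetRegressor applies the transformer to y before fitting
    # and inverse-transforms predictions back to the original output space.
    wrapped = TransformedTargetRegressor(
        regressor=input_pipeline,
        transformer=PCA(n_components=8),
        check_inverse=False,  # PCA is lossy, so the round-trip check would warn
    )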


Reviewers:
👀 Pay special attention to:

  • the format of the new preprocessing_methods input
  • preprocess_target.py
  • changes to the Autoemulate pipeline
  • changes to how the results are stored in compare.py
  • how predict is overridden in InputOutputPipeline in preprocess_target.py (see the sketch after this list)
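
For the last point, a rough sketch of what overriding predict on a TransformedTargetRegressor subclass can look like (hypothetical names; the real InputOutputPipeline in preprocess_target.py differs in detail):

    from sklearn.compose import TransformedTargetRegressor

    class InputOutputPipelineSketch(TransformedTargetRegressor):
        """Illustrative predict override returning the reconstructed mean."""

        def predict(self, X, return_std=False, **predict_params):
            if not return_std:
                # Standard path: predict in latent space, inverse-transform back.
                return super().predict(X, **predict_params)
            # Assumes the wrapped regressor supports return_std (e.g., a GP).
            y_latent, y_latent_std = self.regressor_.predict(X, return_std=True)
            y_mean = self.transformer_.inverse_transform(y_latent)
            # The latent std still needs propagating to the original space,
            # e.g. via the sampling method discussed later in this thread.
            return y_mean, y_latent_std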

marjanfamili and others added 30 commits March 21, 2025 12:43
…loop which runs through all models and selects the best model. This outer loop goes through various preprocessing methods of the data. Plenty of work is still required: variable names should become consistent with the code base, and input and output processing should be integrated. However, for now, the whole pipeline and the plotting functionality run with no errors
… pass only 2 positional arguments instead of 3
Comment on lines 449 to 459
"""
if model.transformer is not None and isinstance(transformer.base_model, TargetPCA):
x_reconstructed_mean = model.transformer_.inverse_transform(x_latent_pred)
x_reconstructed_std = transformer.base_model.inverse_transform_std(x_latent_std) # TODO: fix such this is a method of the transformer

else:
"""
# TODO: implment also "delta method", in addition to "sampling method" for variance reconstruction
if len(x_latent_pred.shape) == 1:
x_latent_pred = x_latent_pred.reshape(-1, 1)
x_latent_std = x_latent_std.reshape(-1, 1)

Just to confirm - should the code on L450-L455 be in """ here?


I removed these comments and expanded the function description as:

def inverse_transform_with_std(model, x_latent_pred, x_latent_std, n_samples=1000):
    """
    Transforms uncertainty (standard deviation) from latent space to original space
    using a sampling-based method.

    This approach draws samples from the latent Gaussian distribution
    (defined by the predicted mean and standard deviation), reconstructs
    each sample in the original space, and computes the resulting mean and
    standard deviation.

    Future improvements could include:
    - Analytical propagation for linear reductions (e.g., PCA)
    - Delta method for nonlinear reductions (e.g., VAE)
    """

Update issue #376 with more details on this.
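
For reference, a minimal sketch of the sampling-based propagation the docstring describes (a hypothetical standalone version, not the exact PR implementation):

    import numpy as np

    def inverse_transform_with_std_sketch(transformer, x_latent_pred, x_latent_std,
                                          n_samples=1000, rng=None):
        """Propagate a latent mean/std to the original space by Monte Carlo sampling."""
        rng = np.random.default_rng() if rng is None else rng
        # Sample the latent Gaussian: shape (n_samples, n_points, latent_dim).
        noise = rng.standard_normal((n_samples,) + x_latent_pred.shape)
        latent_samples = x_latent_pred + noise * x_latent_std
        # Reconstruct each sample in the original output space.
        recon = np.stack([transformer.inverse_transform(s) for s in latent_samples])
        # Mean and std across samples give the reconstructed prediction and uncertainty.
        return recon.mean(axis=0), recon.std(axis=0)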

@sgreenbury commented Apr 15, 2025

Thanks again @marjanfamili and @ContiPaolo - these processing implementations and reaction diffusion examples are really great!

I made some changes for the warnings mentioned (09cd26a, a273831), and I think this is OK now, as I no longer see the issue in the tests.

There are a couple of small comments above, plus one comment I wasn't sure about.

One other small issue I noticed: when running notebook 03_emulation_sensitivity.ipynb, the call em.plot_eval(model=best_model) raised an error (AttributeError: 'dict' object has no attribute 'predict').

Otherwise looks good to me!

    Name of the dimensionality reducer to use.
    Options:
    - 'PCA': Principal Component Analysis
    - 'AE': Autoencoder

I might have missed it but it looks like we don't have a non-variational autoencoder option? Btw AE is also mentioned in the 06_reaction_diffusion_time notebook (in section 2)

Suggested change (remove this line):

    - 'AE': Autoencoder

@radka-j commented Apr 16, 2025

Well done everyone, this is really great! Love the examples :)

I only have some minor comments about the notebooks. It would be great to expand sections 2 and 3 a bit to say more about the dimensionality reduction methods: an extra line on which options you are searching over, and also which one is returned as part of the best model (e.g., "In this example we test PCA with a range of dimensionality values... The best performing model uses PCA with 64 components."). It's in the code, but it would be good to make it very explicit in the text as well.

In section 2 you also say in the text that the user can indicate which dim_reducer_output to use, but I think that arg has been renamed to preprocessing_methods.

@radka-j commented Apr 16, 2025

I'd also update the intro section to the two notebooks to make it clear to the reader how they differ from the start (and how the second one builds on top of the first).

@edwardchalstrey1

Small update to make: ensure that the new notebooks are added in _toc.yml, as I have done here.

@ContiPaolo

Thank you @sgreenbury and @radka-j for all your comments!

We have implemented all your suggestions regarding the code. I sorted out the issue Sam spotted in tutorial 3 (@marjanfamili note that the compare method now returns self.best_models instead of self.best_combination), so I believe we are ready to merge!

Also, thanks for all the feedback on the tutorials; they definitely need more explanation. I will take all your comments into account.

@sgreenbury mentioned this pull request Apr 23, 2025
@ContiPaolo merged commit 66e1a88 into main Apr 23, 2025
4 checks passed