Preprocessing #390


Merged
76 commits merged into main on Apr 23, 2025

Conversation


@marjanfamili commented Apr 4, 2025

📝 Description

This pull request introduces dimensionality reduction on the outputs within the Autoemulate pipeline. This is essential to guarantee efficient performance of Autoemulate on high-dimensional outputs (e.g., large-scale systems, spatially distributed data on a fine grid, reconstruction of image data).

When autoemulate.compare is called, each dimensionality reducer is preliminarily trained on the entire dataset to reduce its dimensionality. The reduced data is then passed through cross-validation, and the best combination of reducer + model is returned.
The preliminarily trained reducer is passed as a non-trainable object into the pipeline so that, during cross-validation, only the models are trained.
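The effect of freezing the reducer can be illustrated with a minimal sketch (the FrozenReducer wrapper below is hypothetical, not the actual AutoEmulate implementation, which handles this inside AutoEmulatePipeline):

    from sklearn.base import BaseEstimator, TransformerMixin

    class FrozenReducer(BaseEstimator, TransformerMixin):
        """Wraps an already-fitted reducer so cross-validation never retrains it."""

        def __init__(self, fitted_reducer):
            self.fitted_reducer = fitted_reducer

        def fit(self, X, y=None):
            # Deliberate no-op: the reducer was trained once on the full dataset.
            return self

        def transform(self, X):
            return self.fitted_reducer.transform(X)

        def inverse_transform(self, X):
            return self.fitted_reducer.inverse_transform(X)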

👩‍💻 Usage

  • In autoemulate.setup, a dictionary of dimensionality reducers (preprocessing_methods) can be passed to integrate them into the pipeline.
preprocessing_methods = [{"name": "PCA", "params": {"reduced_dim": 8}},
                         {"name": "VAE", "params": {"reduced_dim": 8}}]
em.setup(X, Y, ..., preprocessing_methods=preprocessing_methods)
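
After setup, the usual compare call then cross-validates every reducer + model combination (a sketch; per the discussion below, the exact return value changed during review):

    # Cross-validates all (preprocessing method, model) combinations and
    # returns the best-performing models (self.best_models in this PR).
    best = em.compare()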

🆘 Help Needed

  • Currently we get a lot of warnings in the tests. It seems related to the dimensionality of y: if y is 1D, i.e. (n_samples,), many tests fail; if it is 2D, i.e. (n_samples, 1), we get a lot of warnings instead, and we are not quite sure of the cause. Any insights would be appreciated.

🔗 Related Issue

#378 #362 #363 #348

🛠 Type of Change

  • ✨ New feature (non-breaking change that adds functionality)
  • 📖 Documentation update
  • ♻️ Refactor/Code cleanup
  • 🧪 Test updates
  • 🎓 Tutorial added (new guides, walkthroughs, or examples)

✅ Checklist

  • My code follows the project's coding style 🎨
  • I have tested my changes locally 🖥️
  • I have added/updated unit tests (if applicable) 🧪
  • I have updated the documentation (README, comments, etc.) if needed 📚
  • My changes generate no new warnings or errors ⚠️

🖼️ Screenshots (if applicable)

(screenshot attached in the original PR)

💬 Additional Notes

  • A new class, AutoEmulatePipeline, replaces the _process_models method to create an sklearn.Pipeline that transforms both inputs and outputs. The previous implementation only allowed transformations of the inputs. Now, two independent pipelines (one for inputs and one for outputs) are created and wrapped in the sklearn class TransformedTargetRegressor. All of this is handled within the AutoEmulatePipeline class (a minimal sketch of the wrapping idea appears after this list).

  • Dimensionality reduction methods
    Dimensionality reduction methods are implemented in the new file preprocess_target.py:

    • Principal Component Analysis (PCA): class TargetPCA
    • Variational Autoencoder (VAE): class TargetVAE

    Planned improvements:

    • Move the reducers out of preprocess_target.py into a registry of dimensionality reducers (similar to the existing model registry).
    • Generalize to input transformations (for potential high-dimensional inputs). This can be easily implemented in the AutoEmulatePipeline class, which is responsible for both input and output pipelines; the names of the dimensionality reduction methods will need to be updated to remove Target or Output.
  • Tutorial:
    A tutorial on a reaction-diffusion system with high-dimensional outputs has been added to demonstrate the use of dimensionality reduction in Autoemulate (tracked in #372, "Add tutorial for spatial distributed example on website").
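
A minimal sketch of the wrapping idea described in the first note above, using scikit-learn directly (the estimator choices here are illustrative, not AutoEmulate defaults):

    from sklearn.compose import TransformedTargetRegressor
    from sklearn.decomposition import PCA
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Input pipeline: transforms X, then fits the model.
    input_pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("model", Ridge()),
    ])

    # TransformedTargetRegressor applies the transformer to y before fitting
    # and inverse-transforms predictions back to the original output space.
    wrapped = TransformedTargetRegressor(
        regressor=input_pipeline,
        transformer=PCA(n_components=8),
        check_inverse=False,  # PCA is lossy, so the round-trip check would warn
    )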


Reviewers:
👀 Pay special attention to:

  • the format of the new preprocessing_methods input
  • preprocess_target.py
  • changes to the Autoemulate pipeline
  • changes to how the results are stored in compare.py
  • how predict is overridden in InputOutputPipeline in preprocess_target.py (see the sketch after this list)
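
For the last point, a rough sketch of what overriding predict on a TransformedTargetRegressor subclass can look like (hypothetical names; the real InputOutputPipeline in preprocess_target.py differs in detail):

    from sklearn.compose import TransformedTargetRegressor

    class InputOutputPipelineSketch(TransformedTargetRegressor):
        """Illustrative predict override returning the reconstructed mean."""

        def predict(self, X, return_std=False, **predict_params):
            if not return_std:
                # Standard path: predict in latent space, inverse-transform back.
                return super().predict(X, **predict_params)
            # Assumes the wrapped regressor supports return_std (e.g., a GP).
            y_latent, y_latent_std = self.regressor_.predict(X, return_std=True)
            y_mean = self.transformer_.inverse_transform(y_latent)
            # The latent std still needs propagating to the original space,
            # e.g. via the sampling method discussed later in this thread.
            return y_mean, y_latent_std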

marjanfamili and others added 30 commits March 21, 2025 12:43
…loop which runs through all models and selects the best model. This outer loop goes through various preprocessing methods of the data. Plenty of work is still required: variable names should become consistent with the code base, and input and output processing should be integrated. However, for now, the whole pipeline and the plotting functionality run with no errors
… pass only 2 positional arguments instead of 3
Comment on lines 449 to 459
"""
if model.transformer is not None and isinstance(transformer.base_model, TargetPCA):
x_reconstructed_mean = model.transformer_.inverse_transform(x_latent_pred)
x_reconstructed_std = transformer.base_model.inverse_transform_std(x_latent_std) # TODO: fix such this is a method of the transformer

else:
"""
# TODO: implment also "delta method", in addition to "sampling method" for variance reconstruction
if len(x_latent_pred.shape) == 1:
x_latent_pred = x_latent_pred.reshape(-1, 1)
x_latent_std = x_latent_std.reshape(-1, 1)

Just to confirm - should the code on L450-L455 be in """ here?


I removed these comments and expanded the function description as:

def inverse_transform_with_std(model, x_latent_pred, x_latent_std, n_samples=1000):
    """
    Transforms uncertainty (standard deviation) from latent space to original space
    using a sampling-based method.

    This approach draws samples from the latent Gaussian distribution
    (defined by the predicted mean and standard deviation), reconstructs
    each sample in the original space, and computes the resulting mean and
    standard deviation.

    Future improvements could include:
    - Analytical propagation for linear reductions (e.g., PCA)
    - Delta method for nonlinear reductions (e.g., VAE)
    """

Update issue #376 with more details on this.
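
For reference, a minimal sketch of the sampling-based propagation the docstring describes (a hypothetical standalone version, not the exact PR implementation):

    import numpy as np

    def inverse_transform_with_std_sketch(transformer, x_latent_pred, x_latent_std,
                                          n_samples=1000, rng=None):
        """Propagate a latent mean/std to the original space by Monte Carlo sampling."""
        rng = np.random.default_rng() if rng is None else rng
        # Sample the latent Gaussian: shape (n_samples, n_points, latent_dim).
        noise = rng.standard_normal((n_samples,) + x_latent_pred.shape)
        latent_samples = x_latent_pred + noise * x_latent_std
        # Reconstruct each sample in the original output space.
        recon = np.stack([transformer.inverse_transform(s) for s in latent_samples])
        # Mean and std across samples give the reconstructed prediction and uncertainty.
        return recon.mean(axis=0), recon.std(axis=0)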

@sgreenbury commented Apr 15, 2025

Thanks again @marjanfamili and @ContiPaolo - these processing implementations and reaction diffusion examples are really great!

I made some changes for the warnings mentioned (09cd26a, a273831), and I think this is OK now, as I no longer see the issue in the tests.

There are a couple of small comments above, plus one comment I wasn't sure about.

One other small issue I noticed: when running notebook 03_emulation_sensitivity.ipynb, the call em.plot_eval(model=best_model) raised an error (AttributeError: 'dict' object has no attribute 'predict').

Otherwise looks good to me!

    Name of the dimensionality reducer to use.
    Options:
    - 'PCA': Principal Component Analysis
    - 'AE': Autoencoder

I might have missed it but it looks like we don't have a non-variational autoencoder option? Btw AE is also mentioned in the 06_reaction_diffusion_time notebook (in section 2)

Suggested change (remove this line):

    - 'AE': Autoencoder

@radka-j commented Apr 16, 2025

Well done everyone, this is really great! Love the examples :)

I only have some minor comments about the notebooks. It would be great to expand sections 2 and 3 a bit to say more about the dimensionality reduction methods: an extra line on which options you are searching over, and also which one is returned as part of the best model (e.g., "In this example we test PCA with a range of dimensionality values... The best performing model uses PCA with 64 components."). It's in the code, but it would be good to make it very explicit in the text as well.

In section 2 you also say in the text that the user can indicate which dim_reducer_output to use, but I think that arg has been renamed to preprocessing_methods.

@radka-j commented Apr 16, 2025

I'd also update the intro section to the two notebooks to make it clear to the reader how they differ from the start (and how the second one builds on top of the first).

@edwardchalstrey1

Small update to make: ensure that the new notebooks are added in _toc.yml, as I have done here.

@ContiPaolo

Thank you @sgreenbury and @radka-j for all your comments!

We have implemented all your suggestions regarding the code. I sorted out the issue Sam spotted in tutorial 3 (@marjanfamili note that the compare method now returns self.best_models instead of self.best_combination), so I believe we are ready to merge!

Also, thanks for all the feedback on the tutorials; they definitely need more explanation. I will take all your comments into account.

@sgreenbury mentioned this pull request Apr 23, 2025
@ContiPaolo merged commit 66e1a88 into main Apr 23, 2025
4 checks passed