Machine learning analyses from "Data-driven Protease Engineering by DNA-Recording and Epistasis-aware Machine Learning"

This repository contains the code, data, figures, and results for the machine learning analyses from Huber et al. "Data-driven Protease Engineering by DNA-Recording and Epistasis-aware Machine Learning". It fully reproduces all results in the paper.

How to use this repository

The repository is structured as follows:

  • src contains the class definitions for all models and the data module.
  • experiments contains the run scripts that define the training and evaluation of models used to create the results and figures of the paper.
  • data contains scripts for processing the raw data of the protease screen. Both raw and processed data are provided with the release of this repository (see the 'Releases' panel on the GitHub page) and should be placed within this folder. To reproduce the data processing, run preprocess.py (see the example after this list).
  • config contains configuration files for the models and the data module. Here we set the hyperparameters.
  • The *.sh files are helper scripts for submitting jobs.
  • opt.py executes the hyperparameter optimization.
  • cli.py executes training and evaluation of models.
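
For example, to re-run the preprocessing after placing the released raw data in the data folder (the exact invocation is an assumption based on the folder layout above; preprocess.py may also be runnable as a module):

python data/preprocess.py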

We use the Lightning CLI (https://lightning.ai/pytorch-lightning) for our experimental setup. Generally, a model can be trained or evaluated by invoking cli.py with the desired config files:

python -m cli fit --config configs/trainer.yml --config configs/data.yml --config configs/MLP.yml

The fit subcommand can be replaced by test or predict, and the model can be switched by exchanging the last config file accordingly. See the class definitions in the src folder for an explanation of the configuration options.
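
For example, a hypothetical test run of the gradient-boosted trees model (assuming a configs/GBT.yml exists analogously to configs/MLP.yml):

python -m cli test --config configs/trainer.yml --config configs/data.yml --config configs/GBT.yml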

As the checkpoints and predictions are quite large, we do not provide them in this repository; they are available upon request.

Installation

Clone the repository and create a virtual environment with the following packages:

pip install torch jsonargparse[signatures] lightning kaleido logomaker plotly pyyaml tensorboard torchmetrics tqdm fair-esm scikit-learn rich biopandas optuna optuna-dashboard psycopg2

or run pip install -r requirements.txt.
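
For example, a minimal setup using the standard venv module (the environment name .venv is arbitrary):

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt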

Reproducing the figures

Every experiment folder has a plot.py file. To recreate the plots of the paper, run:

python -m experiments.EXPERIMENT_NAME.plot

Figures will be created in the figures subfolder of each experiment, in several image formats.
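
To regenerate the figures for all experiments at once, a simple shell loop can be used (this assumes every subfolder of experiments is a Python package containing a plot.py):

for d in experiments/*/; do python -m "experiments.$(basename "$d").plot"; done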

Reproducing the results

Note: While the MLP model itself is rather cheap to run, reproducing the results of the paper requires running the model many times with different configurations. It is hence necessary to run these computations on a suitable compute cluster. We provide the job scripts we used, written for a Slurm submission system. To run them on your machine, adjust the #SBATCH parameters in cpu_job.sh, gpu_job.sh, opt_cpu_job.sh, and opt_gpu_job.sh.
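
The directives to adjust typically look like the following; the partition and resource values here are placeholders, not the values shipped with the scripts:

#SBATCH --partition=YOUR_PARTITION
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G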

To re-train the models, just submit the job script in the experiment folder:

sh experiments/EXPERIMENT_NAME/job.sh

This will dispatch an array job covering the different models and configurations. The computations run in parallel; results are written to the runs folder of the experiment directory.

Reproducing the hyperparameter optimization

Note: We parallelize the optimization over multiple instances, so its outcome depends on hardware factors such as resource availability. The optimization is hence non-deterministic, and the results can vary compared to the paper.

Note: We use Optuna (https://optuna.org) with a local database to store the hyperparameters and to parallelize the optimization. You will need to set up a local database on your system and replace the storage address at the bottom of the opt.py file; please refer to the Optuna documentation.
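
For example, with a local PostgreSQL database (PostgreSQL is suggested by the psycopg2 dependency; user, password, and database name are placeholders), the storage address is a URL of the form:

postgresql://USER:PASSWORD@localhost:5432/optuna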

Note: It is easiest to inspect the optimization results with optuna-dashboard. Point it to the database address from above.
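
For example, reusing the placeholder address from above:

optuna-dashboard postgresql://USER:PASSWORD@localhost:5432/optuna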

To re-run the hyperparameter optimization, run the following command:

sh opt_cpu_job.sh MODEL_NAME

where MODEL_NAME is one of MLP, ESM, GBT, LR, SVC, KNN. For the deep learning models (MLP, ESM), use opt_gpu_job.sh. This will dispatch several (long-running) jobs to your submission system; the results are written to the database.
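
For example, to optimize the MLP model on GPU nodes:

sh opt_gpu_job.sh MLP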

License

Code in this repository is licensed under MIT; training data and model weights are licensed under CC-BY-4.0.
