Machine learning analyses from "Data-driven Protease Engineering by DNA-Recording and Epistasis-aware Machine Learning"
This repository contains the code, data, figures, and results for the machine learning analyses from Huber et al. "Data-driven Protease Engineering by DNA-Recording and Epistasis-aware Machine Learning". It fully reproduces all results in the paper.
The repository is structured as follows:
- `src` contains the class definitions for all models and the data module.
- `experiments` contains run scripts that define training and evaluation of models to create the results and figures of the paper.
- `data` contains scripts for processing the raw data of the protease screen. Both raw and processed data are provided with the release of this repository (see the 'Releases' panel on the GitHub page) and should be placed within this folder. To reproduce the data processing, run `preprocess.py`.
- `config` contains configuration files for models and the data module. Here we set hyperparameters.
- `*.sh` are helper scripts to submit jobs.
- `opt.py` executes the hyperparameter optimization.
- `cli.py` executes training and evaluation of models.
We use the Lightning CLI (https://lightning.ai/pytorch-lightning) for our experimental setup. Generally, a model can be trained or evaluated by invoking `cli.py` with the desired config files:
```sh
python -m cli fit --config configs/trainer.yml --config configs/data.yml --config configs/MLP.yml
```
The `fit` subcommand can be replaced by `test` or `predict`, and the model can be switched by correspondingly exchanging the last config file. See the class definitions in the `src` folder for an explanation of the configuration options.
As the number of checkpoints and predictions is quite large, we do not provide them here; they are available upon request.
Clone the repository and create a virtual environment with the following packages:
```sh
pip install torch jsonargparse[signatures] lightning kaleido logomaker plotly pyyaml tensorboard torchmetrics tqdm fair-esm scikit-learn rich biopandas optuna optuna-dashboard psycopg2
```
or run `pip install -r requirements.txt`.
Every experiment folder has a `plot.py` file. To recreate the plots of the paper, run:
```sh
python -m experiments.EXPERIMENT_NAME.plot
```
Figures will be created in the `figures` subfolder of the experiment in different image formats.
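As a rough illustration (not the actual `plot.py`), such a script can build a figure and export it in several formats with plotly and its kaleido backend, both of which are listed in the dependencies:

```python
# Hypothetical sketch of a per-experiment plot script; the real plot.py files
# in this repository differ. It shows exporting one figure in several formats
# with plotly, using the kaleido backend listed in the requirements.
from pathlib import Path

import plotly.graph_objects as go

def main() -> None:
    figures_dir = Path(__file__).parent / "figures"
    figures_dir.mkdir(exist_ok=True)
    # Placeholder data; a real script would load results from the runs folder.
    fig = go.Figure(go.Scatter(x=[0, 1, 2], y=[1.0, 3.0, 2.0]))
    for ext in ("png", "pdf", "svg"):
        fig.write_image(str(figures_dir / f"example.{ext}"))

if __name__ == "__main__":
    main()
```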
Note: While the MLP model itself is rather cheap to run, reproducing the results of the paper requires running the model many times with different configurations. It is hence necessary to run these computations on a suitable compute cluster. We provide our job scripts, which were written for a Slurm submission system. To run them on your machine, adjust the `#SBATCH` parameters in `cpu_job.sh`, `gpu_job.sh`, `opt_cpu_job.sh`, and `opt_gpu_job.sh`.
To re-train the models, just submit the job script in the experiment folder:
```sh
sh experiments/EXPERIMENT_NAME/job.sh
```
This will dispatch an array job with different models and configurations. The computations run in parallel, and results are written to the `runs` folder of the experiment directory.
Note: We parallelize the optimization over multiple instances, so its result depends on hardware factors such as resource availability. The optimization is hence non-deterministic, and the results can vary from those in the paper.
Note: We use Optuna (https://optuna.org) with a local database to store the hyperparameters and to be able to parallelize the optimization. You will need to set up a local database on your system and replace the storage address at the bottom of the `opt.py` file; please refer to the Optuna documentation.
Note: It is easiest to inspect the optimization results with optuna-dashboard. Provide the database address from above.
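For context, here is a minimal sketch of how a database-backed Optuna study enables this kind of parallel optimization; the study name, storage URL, and objective below are placeholders, not the repository's actual setup:

```python
# Minimal sketch of a database-backed Optuna study; study name, storage URL,
# and objective are placeholders, not the repository's actual setup.
import optuna

def objective(trial: optuna.Trial) -> float:
    # A real objective would train a model with the sampled hyperparameters
    # and return a validation metric.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    hidden_size = trial.suggest_int("hidden_size", 32, 512, log=True)
    return (lr - 1e-3) ** 2 + 1.0 / hidden_size  # dummy loss

if __name__ == "__main__":
    # Every worker that connects to the same storage shares one study, which
    # allows the optimization to run on several instances in parallel and to
    # be inspected with optuna-dashboard afterwards.
    study = optuna.create_study(
        study_name="MLP",                              # placeholder
        storage="postgresql://user@localhost/optuna",  # replace with your DB
        load_if_exists=True,
        direction="minimize",
    )
    study.optimize(objective, n_trials=100)
```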
To re-run the hyperparameter optimization, run the following command:
```sh
sh opt_cpu_job.sh MODEL_NAME
```
where `MODEL_NAME` is one of `MLP`, `ESM`, `GBT`, `LR`, `SVC`, `KNN`. For the deep learning models (MLP, ESM), use `opt_gpu_job.sh`. This will dispatch several (long-running) jobs to your submission system; the results are written to the database.
Code in this repository is licensed under MIT; training data and model weights are licensed under CC-BY-4.0.