Integrating low-throughput mutagenesis data into deep mutational scanning-based variant impact predictors
We extended a linear regression-based predictor to explore whether incorporating data from alanine scanning (AS), a widely-used low-throughput mutagenesis method, would improve prediction results.
The code stored here is used for data processing, variant impact predictor modelling, and result analysis.
- Create a virtual environment with Python 3.10.6.
- Install Jupyter Notebook and the other required packages listed in `requirements.txt`.
- Follow the code and instructions in the notebooks (`./jupyter_code/`).
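Before opening the notebooks, it can help to confirm that the active interpreter matches the version named above. The sketch below is only an illustration: the helper names `interpreter_matches` and `pip_install_args` are hypothetical and not part of this repository, and the check compares only the major and minor version (3.10), on the assumption that patch releases behave the same.

```python
import sys

# Major.minor taken from the setup step above (Python 3.10.6).
REQUIRED = (3, 10)


def interpreter_matches(version=None, required=REQUIRED):
    """Return True if the major.minor part of `version` equals `required`.

    `version` defaults to the running interpreter's version.
    """
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) == tuple(required)


def pip_install_args(requirements="requirements.txt", python=None):
    """Build the argument list for `python -m pip install -r <requirements>`.

    Nothing is executed here; pass the result to subprocess.run if desired.
    """
    return [python or sys.executable, "-m", "pip", "install", "-r", requirements]
```

Inside the activated environment, `subprocess.run(pip_install_args(), check=True)` would then install the packages, assuming `requirements.txt` is in the current directory.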

`P0_Data_processing`:
- Download DMS data from MaveDB
- Normalize DMS and alanine scanning data
- Add other protein features

`P1_Statistics_of_curated_data`: Overview of collected mutagenesis data
- Code and figures for: Fig 2, 3, 4 & S1, S14

`P2_Linear_integration_of_AS_data`: Building and evaluating linear variant impact predictors using alanine scanning data as an extra feature
- Code and figures for: Fig 5, 6, 7 & S4, S5, S9, S10, S15

`P3_Alternative_modelling_options`: Building and evaluating variant impact predictors in alternative ways
- Code and figures for: Fig S3, S6, S7, S8, S11, S12, S13
- Code and results for all statistical tests related to: Fig 5, S4, S5 & S6
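
To make the idea of "alanine scanning data as an extra feature" concrete, here is a minimal sketch of fitting a linear model with and without one additional feature column. It is not the code in the notebooks: the data are synthetic, the feature names are placeholders, and the target is constructed to depend on the extra feature, so the lower error of the second model holds by design and says nothing about real DMS data. The sketch uses only the standard library so that it runs without extra packages.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_linear(X, y):
    """Ordinary least squares via the normal equations; intercept comes first."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(XtX, Xty)


def predict(coef, X):
    """Apply fitted coefficients (intercept first) to feature rows."""
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in X]


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return (sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)) ** 0.5


# Synthetic stand-ins (assumptions, not real data): two generic features per
# variant plus one "AS score" for the residue the variant falls on.
rng = random.Random(0)
n = 200
base = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
as_score = [rng.gauss(0, 1) for _ in range(n)]
y = [0.5 * b[0] - 0.3 * b[1] + 0.8 * a + rng.gauss(0, 0.1)
     for b, a in zip(base, as_score)]
with_as = [b + [a] for b, a in zip(base, as_score)]

# Simple hold-out split: first 150 rows for fitting, the rest for evaluation.
train, test = slice(0, 150), slice(150, None)
coef_base = fit_linear(base[train], y[train])
coef_as = fit_linear(with_as[train], y[train])
rmse_base = rmse(y[test], predict(coef_base, base[test]))
rmse_as = rmse(y[test], predict(coef_as, with_as[test]))
```

Comparing `rmse_base` and `rmse_as` on held-out rows is the general shape of the evaluation; the notebooks should be consulted for the actual features, models, and validation scheme.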

`data_compatibility_221024.csv` is the assay compatibility data for each pair of DMS and alanine scanning datasets used in this analysis. The class of assay compatibility is manually curated according to the following decision tree:

- Folder `low-throughput_data` contains alanine scanning data collected from previously published papers.
- Folder `demask` contains protein features downloaded from the DeMaSk online toolkit.
- Folder `envision` contains protein features downloaded from the Envision online toolkit.
- Folder `reference` contains protein sequences in FASTA format downloaded from UniProt.
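
For readers who want to inspect the FASTA files without extra dependencies, the following is a minimal sketch of a FASTA reader, plus a check that a substitution such as `K2A` names the residue actually present at that position. The function names and the example record are made up for illustration; this is not the repository's own parsing code, and the single-letter variant notation is an assumption.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def wild_type_matches(sequence, variant):
    """Check a variant like 'K2A' against the sequence (1-based position)."""
    wild_type, position = variant[0], int(variant[1:-1])
    return 0 < position <= len(sequence) and sequence[position - 1] == wild_type


# Made-up record for illustration only; not a real UniProt entry.
example = """>example_protein made-up record
MKTAYIAKQR
QISFVKSHFS
"""
sequences = parse_fasta(example)
sequence = sequences["example_protein made-up record"]
```

A check like `wild_type_matches(sequence, "K2A")` can catch numbering offsets between a published mutagenesis table and the reference sequence before the two are merged.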