PapenfussLab/DMS_with_Alanine_scan

Integrate low-throughput mutagenesis data to deep mutational scanning based variant impact predictors

Abstract

We extended a linear regression-based predictor to explore whether incorporating data from alanine scanning (AS), a widely-used low-throughput mutagenesis method, would improve prediction results.

The code in this repository is used for data processing, variant impact predictor modelling, and result analysis.

Setup & usage

  1. Create a virtual environment with Python 3.10.6.
  2. Install Jupyter Notebook and the other required packages listed in requirements.txt.
  3. Follow the code and instructions in the notebooks (./jupyter_code/).

Notebook content

  • P0_Data_processing:
    • Download DMS data from MaveDB
    • Normalize DMS and alanine scanning data
    • Add other protein features
  • P1_Statistics_of_curated_data: Overview of collected mutagenesis data:
    • Code and figure for: Fig 2, 3, 4 & S1, S14
  • P2_Linear_integration_of_AS_data: Building and evaluating linear variant impact predictors using alanine scanning data as an extra feature
    • Code and figure for: Fig 5, 6, 7 & S4, S5, S9, S10, S15
  • P3_Alternative_modelling_options: Building and evaluating variant impact predictors in alternative ways
    • Code and figure for: Fig S3, S6, S7, S8, S11, S12, S13
    • Code and results for all statistical tests related to: Fig 5, S4, S5 & S6
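
As a rough illustration of the idea behind P0 and P2, the sketch below rescales raw scores and then fits a linear model in which an alanine scanning score is one extra input feature. It is a toy example under stated assumptions, not the code used in the notebooks: the function names, the normalization scheme (wild-type-like score mapped to 1, null-like score mapped to 0), the feature layout, and all numbers are hypothetical.

```python
def normalize_scores(scores, wt_like, null_like):
    """Rescale raw scores so wt_like maps to 1.0 and null_like maps to 0.0.

    Assumption: this is one common convention, not necessarily the
    normalization applied in the notebooks.
    """
    span = wt_like - null_like
    if span == 0:
        raise ValueError("wt_like and null_like must differ")
    return [(s - null_like) / span for s in scores]


def fit_least_squares(rows, targets):
    """Fit y ~ b0 + b1*x1 + ... by solving the normal equations."""
    X = [[1.0, *row] for row in rows]  # prepend an intercept column
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(n)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(coefs, row):
    """Apply fitted coefficients (intercept first) to one feature row."""
    return coefs[0] + sum(c * x for c, x in zip(coefs[1:], row))


# Hypothetical toy data: each row is (substitution_feature, as_score),
# where as_score is the normalized alanine scanning score at that residue.
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.2)]
# Toy targets generated from y = 0.2 + 0.5 * substitution_feature + 0.3 * as_score
targets = [0.2 + 0.5 * a + 0.3 * b for a, b in rows]

coefs = fit_least_squares(rows, targets)
normalized = normalize_scores([2.0, 0.5, -1.0], wt_like=2.0, null_like=-1.0)
```

Dropping the second column from `rows` gives the baseline model without alanine scanning data, so the two fits can be compared on held-out variants. In practice a library such as scikit-learn or NumPy would replace the hand-written solver.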

Data content

  • data_compatibility_221024.csv contains the assay compatibility data for each pair of DMS and alanine scanning datasets used in this analysis. The assay compatibility class is manually curated according to the following decision tree:
  • Folder low-throughput_data contains alanine scanning data collected from previously published papers.
  • Folder demask contains protein features downloaded from the DeMaSk online toolkit.
  • Folder envision contains protein features downloaded from the Envision online toolkit.
  • Folder reference contains protein sequences in FASTA format downloaded from UniProt.
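
For readers who want to inspect the reference sequences outside the notebooks, the following is a minimal sketch of a FASTA reader using only the Python standard library. The record shown is made up for illustration, and `parse_fasta` and `residue_at` are hypothetical helper names that do not come from this repository.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {h: "".join(parts) for h, parts in records.items()}


def residue_at(sequence, position):
    """Return the residue at a 1-based position, as used in variant names."""
    if not 1 <= position <= len(sequence):
        raise IndexError(f"position {position} is outside the sequence")
    return sequence[position - 1]


# Made-up record for illustration; real files live in the reference folder.
example = """>sp|P00000|EXAMPLE_HUMAN hypothetical example
MKTAYIAK
QRQISFVK
"""

sequences = parse_fasta(example)
sequence = sequences["sp|P00000|EXAMPLE_HUMAN hypothetical example"]
first_residue = residue_at(sequence, 1)
```

A lookup like `residue_at` is a quick way to confirm that the wild-type residue named in a mutagenesis record matches the reference sequence before merging datasets.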