DeNoPro - a de novo proteogenomics pipeline to identify clinically relevant novel variants from RNAseq and proteomics data.
DeNoPro provides a pipeline for the identification of novel peptides from matched RNAseq and MS/MS proteomics data.
The pipeline consists of de novo transcript assembly (Trinity), generation of a protein sequence database of 6-frame translated transcripts, and a combination of search engines (X! Tandem, MS-GF+, Tide) to query the custom database. Identified novel peptides and protein variants are then filtered by confidence and mapped to gene models using ACTG.
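To make the "6-frame translated transcripts" step concrete, here is a minimal pure-Python sketch of six-frame translation using the standard genetic code. It is an illustration only: DeNoPro delegates this step to the PGA R package, and the function names below (`six_frame_translations`, `translate`, `reverse_complement`) are hypothetical and not part of DeNoPro.

```python
# Illustrative sketch only; DeNoPro itself builds the database with PGA.
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate codon by codon; codons with unknown bases become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Translate three forward frames (+1..+3) and three reverse frames (-1..-3)."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

Stop codons are kept as `*` here; a real database builder would typically split on them and discard very short fragments.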
To install DeNoPro as a Python module, open a terminal in the directory containing `setup.py` and run `python setup.py install`.
DeNoPro can be made executable by running `chmod u+x denopro`.
DeNoPro has been tested with Python 3; Python 2 is not supported at this time. R version 4.0.0 or greater is required to run the PGA package.
We recommend using a conda environment to manage dependencies; an environment config file using Python 3.9.6 and R 4.0.5 is provided. To set up the conda environment, run `conda env create -f denopro-env.yml`, then activate it with `conda activate denopro-env`.
- Trinity version 2.8.5 - Used during `assemble` for de novo assembly of RNA transcripts
- PGA (R >= 4.0) - Used in `customdb` for creation of the 6-frame translated protein database
- PySimpleGUIQt - Used to run the GUI functionality
- SearchGUI version 3.3.17 - Uses the `X! Tandem`, `MS-GF+` and `Tide` search engines to search the mgf spectra files against the created custom database
- PeptideShaker version 1.16.42 - Used to select matching identifications among the three search engines to output a list of confident novel peptides and their corresponding proteins
- ACTG - Used to map identified confident novel peptides to their corresponding genomic locations
- Bamstats - Used to process expression levels of novel peptides
DeNoPro was designed to be modular, to account for long processing times. The modes are:
- `assemble`: de novo assembly of transcript sequences using Trinity
- `searchdb`: produces a custom peptide database from the assembled transcripts and maps it against proteomics data
- `identify`: maps potential novel peptides from searchdb to a reference transcriptome, outputting a list of confident novel peptides
- `novelorf`: finds novel ORFs in identified novel peptides
- `quantify`: evaluates expression levels of identified novel peptides in a sample
The standard workflow is `assemble` >> `searchdb` >> `identify` >> `novelorf` >> `quantify`.
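The workflow order above can be scripted. The following is a hedged sketch that builds the command line for each mode in sequence, using only the `denopro <mode>` form and the `-c` option described in this README. The helper names (`build_commands`, `run_workflow`) are hypothetical, and the sketch assumes `denopro` is on the `PATH`.

```python
import subprocess

# Modes in the order given by the standard workflow.
MODES = ["assemble", "searchdb", "identify", "novelorf", "quantify"]


def build_commands(config_file="./denopro.conf", start="assemble", stop="quantify"):
    """Return one argument list per mode, from `start` to `stop` inclusive."""
    first, last = MODES.index(start), MODES.index(stop)
    if first > last:
        raise ValueError(f"{start!r} comes after {stop!r} in the workflow")
    return [["denopro", mode, "-c", config_file] for mode in MODES[first:last + 1]]


def run_workflow(config_file="./denopro.conf", start="assemble", stop="quantify"):
    """Run each mode in order; stop at the first failing step."""
    for command in build_commands(config_file, start, stop):
        # check=True raises CalledProcessError if a step exits non-zero.
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for command in build_commands():
        print(" ".join(command))
```

Because the modes are independent, `start` lets a run resume from a later step (for example `start="identify"`) without repeating the Trinity assembly.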
De novo assembly of transcript sequences using Trinity.
`denopro assemble [options]`
- `-c/--config_file`: Path to the config file to use. Default is `./denopro.conf`
- `--cpu`: Maximum number of threads to be used by Trinity
- `--max_mem`: Maximum amount of RAM (in GB) that can be allocated
- `output_dir`: Directory to use as pipeline output
- `dependency_locations/trinity`: Full path to Trinity installation
- `directory_locations/fastq_for_trinity`: Directory containing FASTQ files
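Before starting a long assembly, it can help to check that the required config entries are filled in. The sketch below assumes the config file is INI-style, with `section/key` names mapping to `[section]` and `key`; this README does not state the file format, so treat that as an assumption. It also does not say which section holds `output_dir`, so the sketch accepts it in any section. The example paths and the `missing_keys` helper are hypothetical.

```python
import configparser

# Entries listed for `assemble` in this README.
ASSEMBLE_KEYS = [
    "output_dir",
    "dependency_locations/trinity",
    "directory_locations/fastq_for_trinity",
]


def missing_keys(config, required):
    """Return the required entries that are absent or blank in `config`."""
    missing = []
    for entry in required:
        section, _, key = entry.rpartition("/")
        if section:
            # fallback="" covers both a missing section and a missing key.
            value = config.get(section, key, fallback="")
        else:
            # Section unknown (assumption): accept the key in any section.
            value = next(
                (config[s][key] for s in config.sections() if key in config[s]),
                "",
            )
        if not value.strip():
            missing.append(entry)
    return missing


# Hypothetical config contents, for illustration only.
EXAMPLE = """
[directory_locations]
output_dir = /data/denopro_out
fastq_for_trinity = /data/fastq

[dependency_locations]
trinity =
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE)
print(missing_keys(config, ASSEMBLE_KEYS))
```

The same helper can be reused for the other modes by passing the entries listed in their sections.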
Produces a custom peptide database from the assembled transcripts and maps it against proteomics data.
`denopro searchdb [options]`
- `-c/--config_file`: Path to the config file to use. Default is `./denopro.conf`
- `output_dir`: Directory to use as pipeline output
- `dependency_locations/searchgui`: Full path to the `SearchGUI.jar` file
- `dependency_locations/peptideshaker`: Full path to the `PeptideShaker.jar` file
- `directory_locations/spectra_files`: Directory containing `.mgf` files for database searching
- `dependency_locations/hg19`: Full path to the reference transcriptome (FASTA) of protein-coding genes
Maps potential novel peptides from customdb to a reference transcriptome, outputting a list of confident novel peptides.
`denopro identify [options]`
- `-c/--config_file`: Path to the config file to use. Default is `./denopro.conf`
- `output_dir`: Directory to use as pipeline output
- `dependency_locations/actg`: Full path to the directory containing the `ACTG.jar` and `param.xml` files

Note: The transcriptome model and reference genome are only needed if a serialization file needs to be constructed. If a serialization file needs to be constructed, leave `serialization_file` blank.
- `actg_options/transcriptome_gtf`: Path to the transcriptome model to be used for mapping
- `actg_options/ref_genome`: Path to the directory containing the reference genome (each file name must be the same as the chromosome number written in the GTF files)
- `actg_options/mapping_method`: Mapping method to be used. Options are `PV` (Mapping [P]rotein database first, then [V]ariant splice graph), `PS` (Mapping [P]rotein database first, then [S]ix-frame translation), `VO` (Mapping [V]ariant splice graph [O]nly), `SO` (Mapping [S]ix-frame translation [O]nly)
- `protein_database`: If `mapping_method` is PV or PS, path to the directory containing the protein database
- `serialization_file`: Path to the serialization file of a variant splice graph
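The dependencies between the ACTG options above can be expressed as a small pre-flight check. This is a sketch of the rules exactly as stated in this section, not of ACTG's or DeNoPro's own validation. The function name `check_actg_options` and the plain-dict input are assumptions made for illustration.

```python
# Mapping methods listed in this README.
MAPPING_METHODS = {"PV", "PS", "VO", "SO"}
# Methods that map against the protein database first.
PROTEIN_FIRST_METHODS = {"PV", "PS"}


def check_actg_options(options):
    """Return a list of human-readable problems; an empty list means OK.

    `options` is a plain dict of option name -> string value (hypothetical
    input shape; how DeNoPro reads these values internally is not shown here).
    """
    def filled(key):
        return bool(options.get(key, "").strip())

    problems = []
    method = options.get("mapping_method", "").strip().upper()
    if method not in MAPPING_METHODS:
        problems.append("mapping_method must be one of PV, PS, VO, SO")
    if method in PROTEIN_FIRST_METHODS and not filled("protein_database"):
        problems.append("protein_database is required for PV and PS")
    # A blank serialization_file means one must be constructed, which needs
    # both the transcriptome model and the reference genome.
    if not filled("serialization_file"):
        for key in ("transcriptome_gtf", "ref_genome"):
            if not filled(key):
                problems.append(f"{key} is required to build a serialization file")
    return problems


if __name__ == "__main__":
    print(check_actg_options({"mapping_method": "PV", "serialization_file": ""}))
```

Returning a list rather than raising on the first problem lets all misconfigured options be reported at once.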
Finds novel ORFs in identified novel peptides.
`denopro novelorf [options]`
- `-c/--config_file`: Path to the config file to use. Default is `./denopro.conf`
- `output_dir`: Directory to use as pipeline output
Evaluates expression levels of identified novel peptides.
`denopro quantify [options]`
- `-c/--config_file`: Path to the config file to use. Default is `./denopro.conf`
- `output_dir`: Directory to use as pipeline output
- `quantification_options/bamstats`: Full path to the `bamstats.jar` file
- `quantification_options/bam_files`: Full path to the directory containing BAM files to be analysed
- `quantification_options/bed_file`: Full path to the BED file to be used. Will be created with data from previous steps if left blank
DeNoPro offers a graphical interface to run the pipeline and edit configuration files.
The GUI uses the Qt framework through PySimpleGUIQt, which can be installed with `conda install PySimpleGUIQt`.