DPREx

DPREx is a machine-learning based tool to discriminate potential genes intervals to be pathogenic in which repeats mutation happens.

How does the model work?

First, we use comprehensive epigenetic data which include chromatin accessibility, histone modification marks, the binding strength of TFs, distance to alternative splicing, non-B DNA structure, Rloop structure coverage fraction, free energy of pre-RNA, mean pairing probability of pre-RNA, and the binding strength of base excision repair (BER) pathway DNA glycosylases to annotate the intervals of input repeats.

Then, a pre-trained XGBoost based model is loaded to predict the annotated table. This model uses features dominantly come from repeat interval annotation, and doesn't care about the detail length of the interval , making its prediction mainly based on gene expression and regulation, instead of pattern of mutations.

How to use

prepare for the environment
- make sure the python's version is not less than 3.8.
- use virtualenv to create a new environment (optional but recommended).
- type in: pip3 install -r requirement.txt, make sure you are in the same path of this project.
Installation of external software and placement of data-files
- bedtools should be installed for sequence-relation options. Change the term of “bedtools” in fan_path2.json to the real installation path of bedtools.
- The RNAfold module of ViennaRNA (version 2.4.4) was used to calculate the free energy of 100bp upstream of the start and downstream of the end of the pre-RNA fragment corresponding to the submitted DNA fragment. For those without root privilege, to install ViennaRNA for RNAfold running, just download the install package of “ViennaRNA-2.4.4.tar.gz” to the path you plan to install softwares (<path for softwares>). Then run:
  - cd <path for softwares>
  - mkdir RNAfold_install
  - mkdir RNAfold_data
  - tar zxvf ViennaRNA-2.4.4.tar.gz1
  - cd ViennaRNA-2.4.4
  - ../configure –prefix=/RNAfold_install –datadir=/RNAfold_data
  - make
  - make install
- After above steps, the executive script of RNAfold is in the path of <path for softwares>/RNAfold_install/bin/RNAfold. Change the “run_Rfold” in fan_path2.json to the real installation path of the executive scripts of RNAfold.
- The TH-GRASS software was used to calculated the mean pairing probability of 100bp upstream of the start and downstream of the end of the pre-RNA fragment corresponding to the submitted DNA fragment. The code is in the github. We suggest that you install TH-GRASS via conda. For example:
  conda create tf-grass_env python=3.6
- Here note that python3.6 is necessary for a successful installation. Then go on with:
  - conda activate tf-grass_env
  - pip3 install -r requirments.txt
- After that, TF-GRASS is available for use. Change the term of “dir_GF” in fan_path2.json to the real installation path of TH-GRASS, which contain files like the train.py, test.py, and utils.py. Change the term of “conda_python3” in fan_path2.json to the real path of executive python3.6 of the conda environment of tf-grass.
- Download the hg19 reference gnome file in the format of fasta, as well as the corresponding .fa.fai files, put them into the same path, and then change the “hg19_in” in fan_path2.json to the real path that contain the fasta as well as the .fa.fai files.
- Download the file named “hg19.rloop.filtered.mergedRegion.bed” supplied by us, and then change the “dir_rloop_annot” in fan_path2.json to the real path that contain the file of “hg19.rloop.filtered.mergedRegion.bed”.
- Download the four files separately named “ACNEIL_SSINP_2_hg19_FE.bw”, “GSM2137770_POLB_hg19.bw”, “ogg1_hg19.bw” and “xrcc1_hg19.bw”. Then change the “dir_BER_annot” in fan_path2.json to the real path that contain the four .bw files.
set epigenetic data resources in manifest.txt.
- It should contain your sample file and epigenetic data file (in various formats can be found in the sample.).
- If you want to add new property, add new line following.
- To remove property, use '%' to comment out that line.
set configuration in config.json.
organise your input file in BED3 format.
invoke the command line API:
python main.py path-to-input/input.bed -o save_dir/

Calculation procedure

the coverage fraction of R_loop and DNA-repair factors binding region was calculated by comparing the location of submitted DNA fragments and the location of annotated regions of R_loop or DNA-repair factors binding region. The coverage fraction was defined as the bp length of overlapped region, divided by the bp length of the submitted fragment.

the pairing probability and free energy were calculated beginning from the primary RNA sequence of the submitted DNA fragment, which was gained with the help of bedtools and RNA fold module of Vienna RNA software. Related files were stored in the path of <dir_gen_files>/new_bed and <dir_gen_files>/get_seq, which would be deleted after pipline calculation finished. Note here by “<dir_gen_files>” we mean the parameter you filled in after the option of –dir_gen_files in the file of fan_path2.json.

begin from the RNA sequence, the pairing probability of each residue of the RNA fragment was calculated by the TH-GRASS software. Then the mean of the residues’ pairing probability was calculated as the “RNA_pair” feature. “st_pair” is the mean pairing probability of 100bp upstream of the start of submitted RNA fragment, while “ed_pair” is the mean pairing probability of 100bp downstream of the end of submitted RNA fragment. Related files were stored in the path of <dir_gen_files>/RNA_pair/, which would be deleted after pipline calculation finished.

begin from the RNA sequence, the free energy of the RNA fragment was calculated by the RNAfold module of viennaRNA software. “st_energy” is the free energy of 100bp upstream of the start of submitted RNA fragment, while “ed_energy” is the free energy of 100bp downstream of the end of submitted RNA fragment. Related files were stored in the path of <dir_gen_files>/RNA_free_energy/, which would be deleted after pipeline calculation finished.

Acknowledgment

The file named “hg19.rloop.filtered.mergedRegion.bed” for rloop annotation is from the publication of Piroon Jenjaroenpun et.al., “R-loopDB: a database for R-loop forming sequences (RLFS) and R-loops” (PMID: 27899586)

The DNA repair binding signal intensity files (.bw files) for BER annotation is from the website of https://de.cyverse.org/data/ds/iplant/home/abacolla/bigwig?selectedOrder=asc&selectedOrderBy=name&selectedPage=0&selectedRowsPerPage=100

parameter

input [positional] : the path of input BED6 file
--output [-o] : the path to store the result csv
--help <-h> : for help

input file

The input file should be organised following the BED6 format, the columns are:
chromosome start end
seperated with tab

Here are some examples:
chr1 12252260 12252264
chr3 187388896 187388910
chr4 1019055 1019075

result file

The result will be saved in a CSV file, with columns called repeats_id, probability_score and pathogenic.
For a record like "chr1 12252260 12252264", its repeats_id is "chr1@12252260@12252264".
The order of the records can be different from the input file, but it can be re-sorted with repeats_id.
The pathogenic column denotes whether a mutation is pathogenic if it happens in the repeat interval, the probability_score is positively related to that likelihood.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
ItvAnt		ItvAnt
models		models
README.md		README.md
config.json		config.json
main.py		main.py
manifest.txt		manifest.txt
requirement.txt		requirement.txt
workflow.png		workflow.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DPREx

How does the model work?

How to use

Calculation procedure

Acknowledgment

parameter

input file

result file

About

Releases

Packages

Languages

wykswr/DPREx

Folders and files

Latest commit

History

Repository files navigation

DPREx

How does the model work?

How to use

Calculation procedure

Acknowledgment

parameter

input file

result file

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages