‎ ‎ ‎CRAGR_v2

CRAGR_v2 (Cell fRee dnA fraGmentation in R) is a pipeline for CRAGR that incorporates Irreproducible Discovery Rate (IDR) to hotspot calling enable consistency between replicates or sets of samples.

Installation

(bash) Create and activate a conda environment with a Python 3.9.

conda create --name idr_cragr python=3.9
conda activate idr_cragr

(bash) Ensure that you have the proper dependencies (snakemake, tabix, bgzip, samtools, bedtools, idr, libiconv) installed.

# For Tabix, BGZIP, and SAMTools:
conda install bioconda::samtools

# For BEDTools:
conda install bioconda::bedtools

#For IDR:
conda install bioconda::idr 
# Open ~/.conda/envs/idr_cragr/lib/python3.9/site-packages/idr/idr.py
# Replace numpy.int in lines 299 and 300 with numpy.int_

# For Snakemake:
pip install snakemake==7.32.4
pip install pulp==2.7.0

# NOTE: For some HPC systems, libiconv.so may not be detected for the CRAGR installation process
# You may have to manually add libiconv to the path in this case, like so:
conda install conda-forge::libiconv
export $LD_LIBRARY_PATH=<PATH TO CONDA LIBRARY FOLDER>

(R) Install CRAGR.

install.packages("devtools")
devtools::install_github("epifluidlab/CRAGR_v2")

Note

CRAGR also requires BSgenome packages for the reference genome of interest, in order to perform GC correction. For hg19, install this. For hg38 install this.

Quick Start

(R) Identify the path to the CRAGR_v2 pipeline.

system.file("extdata/scripts/idr_pipeline", "idr_cragr.smk", package = "cragr")

Specify the relevant arguments for analysis in a params.yaml file. Check here for more information.
(bash) Run the pipeline.

snakemake -s {path_to_idr_crag_smk} --configfile params.yaml --cores {cores}

Example Usage

For your ease of use, we have prepared an lightweight example usage here. You may run the pipeline:

# Enter the example usage directory.
cd ~/inst/extdata/scripts/data/example

# Modify the r_script parameter in example_params.yaml.

# Run the pipeline.
snakemake -s {path_to_idr_crag_smk} --configfile example_params.yaml --cores {cores}

Note

In order to remain lightweight, the sample fragment files are relatively low coverage. As a result, the pipeline will identify no peaks and fail at the IDR step (47%).

Parameters

For a detailed understanding of the CRAG methods and analysis stages, we encourage you to read through our documentation here: CRAGR.

Our pipeline takes arguments in YAML format. A blank params.yaml is linked here, for your guidance.

Required Parameters

r_path: (REQUIRED) This is a path to your Rscript executable. You can usually find this with which Rscript.
cragr_script: (REQUIRED) This is a full path to the CRAGR.r script.
- You can find this by running system.file("extdata/scripts", "cragr.R", package = "cragr") and pasting the full output OR
- Locate the source code directory and place the path to inst/extdata/scripts/cragr.R.
samples: (REQUIRED) This is a path to a .txt file containing a list of paths (new-line separated) to the relevant sample fragment files.
- These files must have a consistent naming structure that starts with an ID separated by a .. For example {ID}.frag.bed.gz
- All of these fragment files must have an associated tabix index.
chroms: (REQUIRED) This is a path to a .txt file containing a list of chromosomes (new-line separated) that hotspots and IFS scores should be retrieved for.
- This file must be ordered, as this order determines the order of the sorting of the combined replicate fragment files.
genome: (REQUIRED, OPTIONS=['hg19', 'hg38']) This is the name of the genome build to use for the CRAG pipeline.
chrom_sizes: (REQUIRED) This is the path to a file containing the chromosome sizes for the relevant --genome.
- See inst/extdata/scripts/data/ for .chrom.sizes files for hg19 and hg38.
output_dir: This is the path to a directory that all intermediate and output files will be stored in.
- If this directory does not exist, it will be created.

Optional Parameters

split_method: (DEFAULT: fragment_count, OPTIONS:['fragment_count', 'by_sample']) This is determines how the fragments are split into the replicates.
- fragment_count: This will combine all of the fragment files and take --subsample fragments in each replicate with replacement. This inherently introduces overlaps to the replicates. To avoid very similar replicates, the total fragment count of all fragment files must be greater than 1.5*--subsample.
- samples: This will generate an equal split of the files into replicates. This ensures that the replicates have no data overlap.
seed: (DEFAULT: None) This is an integer that sets the seed of the random splitting of the replicates for reproducibility.
subsample: (DEFAULT: 200M) This argument sets the number of fragments to subsample from the combined fragments to create the two replicate fragment files.
hotspot_samples: (DEFAULT: None) This is a path to a .txt file containing a list of paths (new-line separated) to call hotspots over. If not specified, --samples will be used.
exclude_regions: (DEFAULT: None) This is the path to a blacklist BED file, highlighting problematic areas of the genome to ignore.
- You can learn more about the blacklist file purpose here.
- See inst/extdata/scripts/data/ for .exclude.regions files for hg19 and hg38.
gc_correct: (DEFAULT: True) This is a boolean parameter that determines whether or not to perform GC correction.
gc_correct_method: (DEFAULT: standard, OPTIONS=['standard', 'caret']) This determines the method used in the GC correction.
gc_correct_N: (DEFAULT: 1000000) This determines the maximal sample size for GC correction.
fdr_threshold: (DEFAULT: 0.2) This determines the FDR cutoff for filtering peaks after Stage 2 analysis and before IDR.
merge_gap: (DEFAULT: 200) If the distance between two hotspot intervals is less than this value, they will be merged into one larger hotspot.
idr_threshold: (DEFAULT: 0.05) This determines the cutoff for filtering peaks from the IDR value for each peak.
min_mapq: (DEFAULT: 30) This is the minimum MAPQ score of a fragment to include in the analysis.
min_fraglen: (DEFAULT: 0) This is the minimum fragment length to include in the analysis.
max_fraglen: (DEFAULT: 1000) This is the maximum fragment length to include in the analysis.
window_size: (DEFAULT: 200) This is the sliding window size to use for the CRAG IFS calculation.
step_size: (DEFAULT: 20) The step size to use for the CRAG IFS calculation.
total_fragment_min (DEFAULT: 1.5*--subsample) This determines the minimum total fragments across your fragment files to allow the pipeline to run without throwing an error. Modify this threshold at your own risk.
threads: (DEFAULT:1) This specifies the number of threads to use for the individual steps of the pipeline (i.e. CRAGR ifs, CRAGR peak, CRAGR signal)

Workflow Diagram

Citation

Zhou X, Zheng H, Fu H, McKillip KL, Pinney SM, Liu Y. (2022) CRAG: De novo characterization of cell-free DNA fragmentation hotspots in plasma whole-genome sequencing. Genome Medicine

Contact

Ravi Bandaru: [email protected]
Kundan Baliga: [email protected]
Yaping Liu: [email protected]

License

This project falls under an MIT license. See the included LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 57 Commits
R		R
inst/extdata/scripts		inst/extdata/scripts
man		man
src		src
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
NAMESPACE		NAMESPACE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

‎ ‎ ‎CRAGR_v2

Table of Contents

Installation

Quick Start

Example Usage

Parameters

Required Parameters

Optional Parameters

Workflow Diagram

Citation

Contact

License

About

Uh oh!

Contributors 2

Uh oh!

Languages

License

epifluidlab/CRAGR_v2

Folders and files

Latest commit

History

Repository files navigation

‎ ‎ ‎CRAGR_v2

Table of Contents

Installation

Quick Start

Example Usage

Parameters

Required Parameters

Optional Parameters

Workflow Diagram

Citation

Contact

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Contributors 2

Uh oh!

Languages