Best practices for WGS and multithreading
Polina Bevad edited this page Jun 3, 2020
·
3 revisions
Here are some tips on using VarDictJava in multithreaded mode for whole-genome sequencing (WGS).
- VarDict requires a BED file. Its memory use grows linearly with the size of each individual segment in the BED file, so using whole chromosomes as segments requires too much memory and time and can cause an OutOfMemoryError. For WGS, it is recommended to use 10-100 kb segments with a 150 bp overlap.
- The 150 bp overlap lets VarDict call indels that span two segments and are supported only by soft-clipped reads. For targeted sequencing or exome, use the manufacturer-supplied BED files. For exome, option "-x 100" is recommended if you want to call variants that are not in the BED file but might still be hybrid captured.
- It is recommended to run VarDict in multithreaded mode (option -th) for better performance. In this mode, VarDict processes one segment from the BED file per core, and one core is kept for the main thread. A core here means a logical thread, so with Hyper-threading there are twice as many threads as physical cores.
- Please try the --fisher option (in the master branch) to reduce the processing time of the R script. You must then use var2vcf_valid.pl or var2vcf_paired.pl right after the VarDictJava step.
You can generate a BED file for WGS calling with bedtools.
bedtools needs a genome file for this. For humans and mammals, it should be in the genomes/ folder of the bedtools distribution; for other organisms you can create it in a tab-delimited format like:
chrI 15072421
chrII 15279323
...
For a more detailed description, see https://bedtools.readthedocs.io/en/latest/content/general-usage.html#genome-file-format.
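If the reference FASTA already has a samtools faidx index (a .fai file), its first two tab-separated columns are the sequence name and length, which is exactly what the genome file needs. The sketch below assumes such an index exists; the function name `fai_to_genome` and the file names in the usage part are hypothetical.

```python
def fai_to_genome(fai_lines):
    """Convert lines of a samtools .fai index into bedtools genome-file lines.

    Only the first two columns (sequence name, sequence length) are kept.
    """
    genome_lines = []
    for line in fai_lines:
        if not line.strip():
            continue  # ignore blank lines
        name, length = line.rstrip("\n").split("\t")[:2]
        genome_lines.append(f"{name}\t{int(length)}")
    return genome_lines


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file names):
    #   python fai_to_genome.py reference.fa.fai > my.genome
    with open(sys.argv[1]) as fai:
        print("\n".join(fai_to_genome(fai)))
```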
The following command creates regions with a window size of 50150 bp and an overlap of 150 bp (the step size of 50000 bp is the window size minus the overlap):
bedtools makewindows -g human.hg38.genome -w 50150 -s 50000 > hg38.wgs.bed
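To see how the window and step sizes produce the 150 bp overlap, the arithmetic can be sketched in a few lines. This is an illustration of the sliding-window scheme as described above, not a replacement for bedtools; the function name `make_windows` is hypothetical, and the clipping of the final window at the chromosome end is an assumption about how the tail is handled.

```python
def make_windows(chrom_sizes, window=50150, step=50000):
    """Yield (chrom, start, end) windows of `window` bp every `step` bp.

    `chrom_sizes` is an iterable of (chromosome name, length) pairs, as in
    a genome file. Consecutive windows overlap by `window - step` bp.
    """
    for chrom, size in chrom_sizes:
        start = 0
        while start < size:
            # Clip the last window so it does not run past the chromosome end.
            yield chrom, start, min(start + window, size)
            start += step


if __name__ == "__main__":
    for chrom, start, end in make_windows([("chrI", 120000)]):
        print(f"{chrom}\t{start}\t{end}")
```

For a 120000 bp chromosome this yields the windows 0-50150, 50000-100150 and 100000-120000; each window starts 150 bp before the previous one ends.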