Skip to content

Commit fd6fa52

Browse files
authored
Merge pull request #1016 from nf-core/dsl2-genotyping
DSL2: genotyping
2 parents 43dce8f + d39a10a commit fd6fa52

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

52 files changed

+2710
-104
lines changed

.github/workflows/ci.yml

+4-4
Original file line numberDiff line numberDiff line change
@@ -28,11 +28,11 @@ jobs:
2828
- "latest-everything"
2929
PARAMS:
3030
- "-profile test,docker --preprocessing_tool fastp --preprocessing_adapterlist 'https://github.com/nf-core/test-datasets/raw/modules/data/delete_me/fastp/adapters.fasta'"
31-
- "-profile test,docker --preprocessing_tool adapterremoval --preprocessing_adapterlist 'https://github.com/nf-core/test-datasets/raw/modules/data/delete_me/adapterremoval/adapterremoval_adapterlist.txt' --sequencing_qc_tool falco"
32-
- "-profile test,docker --mapping_tool bwamem --run_mapdamage_rescaling --run_pmd_filtering --run_trim_bam"
33-
- "-profile test,docker --mapping_tool bowtie2 --damagecalculation_tool mapdamage --damagecalculation_mapdamage_downsample 100"
31+
- "-profile test,docker --preprocessing_tool adapterremoval --preprocessing_adapterlist 'https://github.com/nf-core/test-datasets/raw/modules/data/delete_me/adapterremoval/adapterremoval_adapterlist.txt' --sequencing_qc_tool falco --run_genotyping --genotyping_tool 'freebayes' --genotyping_source 'raw'"
32+
- "-profile test,docker --mapping_tool bwamem --run_mapdamage_rescaling --run_pmd_filtering --run_trim_bam --run_genotyping --genotyping_tool 'ug' --genotyping_source 'trimmed'"
33+
- "-profile test,docker --mapping_tool bowtie2 --damagecalculation_tool mapdamage --damagecalculation_mapdamage_downsample 100 --run_genotyping --genotyping_tool 'hc' --genotyping_source 'raw'"
3434
- "-profile test,docker --skip_preprocessing"
35-
- "-profile test_humanbam,docker --run_mtnucratio --run_contamination_estimation_angsd --snpcapture_bed 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Human/1240K.pos.list_hs37d5.0based.bed.gz'"
35+
- "-profile test_humanbam,docker --run_mtnucratio --run_contamination_estimation_angsd --snpcapture_bed 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Human/1240K.pos.list_hs37d5.0based.bed.gz' --run_genotyping --genotyping_tool 'pileupcaller' --genotyping_source 'raw'"
3636
- "-profile test_multiref,docker" ## TODO add damage manipulation here instead once it goes multiref
3737
steps:
3838
- name: Check out pipeline code

CITATIONS.md

+19-2
Original file line numberDiff line numberDiff line change
@@ -100,10 +100,27 @@
100100
101101
- [QualiMap](https://doi.org/10.1093/bioinformatics/btv566)
102102

103-
> QualiMap Okonechnikov, K., Conesa, A., & García-Alcalde, F. (2016). Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics , 32(2), 292–294. Download: http://qualimap.bioinfo.cipf.es/
103+
> QualiMap Okonechnikov, K., Conesa, A., & García-Alcalde, F. (2016). Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics , 32(2), 292–294. doi: [10.1093/bioinformatics/btv566](https://doi.org/10.1093/bioinformatics/btv566).
104104
105105
- [DamageProfiler](https://doi.org/10.1093/bioinformatics/btab190)
106-
> DamageProfiler Neukamm, J., Peltzer, A., & Nieselt, K. (2020). DamageProfiler: Fast damage pattern calculation for ancient DNA. In Bioinformatics (btab190). doi: [10.1093/bioinformatics/btab190](https://doi.org/10.1093/bioinformatics/btab190). Download: https://github.com/Integrative-Transcriptomics/DamageProfiler
106+
107+
> DamageProfiler Neukamm, J., Peltzer, A., & Nieselt, K. (2020). DamageProfiler: Fast damage pattern calculation for ancient DNA. In Bioinformatics (btab190). doi: [10.1093/bioinformatics/btab190](https://doi.org/10.1093/bioinformatics/btab190).
108+
109+
- [GATK 3.5](https://console.cloud.google.com/storage/browser/gatk)
110+
111+
> DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D, Daly M. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43(5), 491–498. doi: [10.1038/ng.806](https://doi.org/10.1038/ng.806).
112+
113+
- [GATK 4.X](https://github.com/broadinstitute/gatk/releases)
114+
115+
> Poplin R, Ruano-Rubio V, DePristo MA, Fennell TJ, Carneiro MO, Van der Auwera GA, Kling DE, Gauthier LD, Levy-Moonshine A, Roazen D, Shakir K, Thibault J, Chandran S, Whelan C, Lek M, Gabriel S, Daly MJ, Neale B, MacArthur DG, Banks E. (2017). Scaling accurate genetic variant discovery to tens of thousands of samples bioRxiv, 201178. doi: [10.1101/201178](https://doi.org/10.1101/201178).
116+
117+
- [FreeBayes](https://github.com/freebayes/freebayes)
118+
119+
> Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 \[q-bio.GN] 2012. doi: [10.48550/arXiv.1207.3907](https://doi.org/10.48550/arXiv.1207.3907).
120+
121+
- [BCFtools](https://github.com/samtools/bcftools)
122+
123+
> Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics (2011) 27(21) 2987-93.doi: [10.1093/bioinformatics/btr509](https://doi.org/10.1093/bioinformatics/btr509).
107124
108125
## Software packaging/containerisation tools
109126

bin/collect_genotypes.py

+102
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,102 @@
1+
#!/usr/bin/env python
2+
3+
# MIT License (c) Thiseas C. Lamnidis (@TCLamnidis)
4+
5+
import argparse
6+
import filecmp
7+
8+
def file_len(fname):
9+
with open(fname) as f:
10+
for i, l in enumerate(f):
11+
pass
12+
return i + 1
13+
14+
## A function to return the number of genotypes per line in a .geno file.
15+
def file_width(fname):
16+
with open(fname) as f:
17+
for i in f:
18+
return(len(i.strip()))
19+
break
20+
21+
## A function to check that there are no duplicate individual IDs across ind files.
22+
def check_for_duplicate_ids(indf1, indf2):
23+
with open(indf1) as f:
24+
inds1 = [x.strip().split()[0] for x in f.readlines()]
25+
with open(indf2) as f:
26+
inds2 = [x.strip().split()[0] for x in f.readlines()]
27+
intersection = set(inds1).intersection(inds2)
28+
if len(intersection) > 0:
29+
raise IOError("Input .ind files contain duplicate individual IDs. Duplicates: {}".format(intersection))
30+
31+
## Function to check that the snp files are identical
32+
def check_snp_files(snpf1, snpf2):
33+
if not filecmp.cmp(snpf1, snpf2):
34+
raise IOError("Input .snp files are not identical.")
35+
36+
## Function to check the consistency of an eigenstrat database
37+
def validate_eigenstrat(genof, snpf, indf):
38+
dimsGeno = [file_len(genof), file_width(genof)]
39+
linesSnp = file_len(snpf)
40+
linesInd = file_len(indf)
41+
42+
# print(dimsGeno,linesSnp,linesInd)
43+
##Check geno and snp compatibility
44+
if dimsGeno[0] != linesSnp:
45+
raise IOError("Input .snp and .geno files do not match.")
46+
47+
##Check geno and ind compatibility
48+
if dimsGeno[1] != linesInd:
49+
raise IOError("Input .ind and .geno files do not match.")
50+
51+
VERSION = "1.0.0"
52+
53+
parser = argparse.ArgumentParser(usage="%(prog)s (-i <Input file prefix>) (-c <input ind file> | -R | -E) [-L <SAMPLE LIST> | -S Ind [-S Ind2]] [-o <OUTPUT FILE PREFIX>]" , description="A tool to check two different EingenStrat databses for shared individuals, and extract or remove individuals from an EigenStrat database.")
54+
parser._optionals.title = "Available options"
55+
parser.add_argument("-g1", "--genoFn1", type = str, metavar = "<GENO FILE 1 NAME>", required = True, help = "The path to the input geno file of the first dataset.")
56+
parser.add_argument("-s1", "--snpFn1", type = str, metavar = "<SNP FILE 1 NAME>", required = True, help = "The path to the input snp file of the first dataset.")
57+
parser.add_argument("-i1", "--indFn1", type = str, metavar = "<IND FILE 1 NAME>", required = True, help = "The path to the input ind file of the first dataset.")
58+
parser.add_argument("-g2", "--genoFn2", type = str, metavar = "<GENO FILE 2 NAME>", required = True, help = "The path to the input geno file of the second dataset.")
59+
parser.add_argument("-s2", "--snpFn2", type = str, metavar = "<SNP FILE 2 NAME>", required = True, help = "The path to the input snp file of the second dataset.")
60+
parser.add_argument("-i2", "--indFn2", type = str, metavar = "<IND FILE 2 NAME>", required = True, help = "The path to the input ind file of the second dataset.")
61+
parser.add_argument("-o", "--output", type = str, metavar = "<OUTPUT FILES PREFIX>", required = True, help = "The desired output file prefix. Three output files are created, <OUTPUT FILES PREFIX>.geno , <OUTPUT FILES PREFIX>.snp and <OUTPUT FILES PREFIX>.ind .")
62+
parser.add_argument("-v", "--version", action='version', version="{}".format(VERSION), help="Print the version and exit.")
63+
args = parser.parse_args()
64+
65+
## Open input files
66+
GenoFile1 = open(args.genoFn1, "r")
67+
SnpFile1 = open(args.snpFn1, "r")
68+
IndFile1 = open(args.indFn1, "r")
69+
70+
GenoFile2 = open(args.genoFn2, "r")
71+
# SnpFile2 = open(args.snpFn2, "r") ## Never actually read in line by line
72+
IndFile2 = open(args.indFn2, "r")
73+
74+
## open output files
75+
GenoFileOut = open(args.output + ".geno", "w")
76+
SnpFileOut = open(args.output + ".snp", "w")
77+
IndFileOut = open(args.output + ".ind", "w")
78+
79+
## Perform basic validation on inputs
80+
validate_eigenstrat(args.genoFn1, args.snpFn1, args.indFn1)
81+
validate_eigenstrat(args.genoFn2, args.snpFn2, args.indFn2)
82+
check_for_duplicate_ids(args.indFn1, args.indFn2)
83+
check_snp_files(args.snpFn1, args.snpFn2)
84+
85+
## Now actually merge the data
86+
## Geno
87+
for line1, line2 in zip(GenoFile1, GenoFile2):
88+
geno_line="{}{}".format(line1.strip(),line2.strip())
89+
print(geno_line, file=GenoFileOut)
90+
91+
## Snp
92+
## Copying the file would be faster, but this way we do not rely on the os or external packages.
93+
## We already checked that the snp files are byte-identical, so we can just copy one of them.
94+
for line in SnpFile1:
95+
print(line.strip(), file=SnpFileOut)
96+
97+
## Ind
98+
## The indfiles are simply concatenated in the same order as the geno file.
99+
for line in IndFile1:
100+
print(line.strip(), file=IndFileOut)
101+
for line in IndFile2:
102+
print(line.strip(), file=IndFileOut)

conf/modules.config

+168
Original file line numberDiff line numberDiff line change
@@ -961,4 +961,172 @@ process {
961961
pattern: '*.flagstat'
962962
]
963963
}
964+
965+
//
966+
// GENOTYPING
967+
//
968+
969+
withName: SAMTOOLS_MPILEUP_PILEUPCALLER {
970+
tag = { "${meta.reference}|${meta.strandedness}" }
971+
ext.args = [
972+
"-B",
973+
"-q ${params.genotyping_pileupcaller_min_base_quality}",
974+
"-Q ${params.genotyping_pileupcaller_min_map_quality}",
975+
].join(' ').trim()
976+
ext.prefix = { "${meta.strandedness}_${meta.reference}" }
977+
publishDir = [
978+
enabled: false
979+
]
980+
}
981+
982+
withName: SEQUENCETOOLS_PILEUPCALLER {
983+
tag = { "${meta.reference}|${meta.strandedness}" }
984+
ext.args = {[
985+
"--${params.genotyping_pileupcaller_method}",
986+
params.genotyping_pileupcaller_transitions_mode == "SkipTransitions" ? "--skipTransitions" : params.genotyping_pileupcaller_transitions_mode == "TransitionsMissing" ? "--transitionsMissing" : "",
987+
"${meta.strandedness}" == 'single' ? "--singleStrandMode" : "" ,
988+
"--sampleNames", meta.sample_id.join(","),
989+
"-e pileupcaller.${meta.strandedness}.${meta.reference}"
990+
].join(' ').trim() }
991+
ext.prefix = { "${meta.strandedness}_${meta.reference}" }
992+
publishDir = [
993+
enabled: false // Not published because the output goes through COLLECT_GENOTYPES
994+
]
995+
}
996+
997+
withName: COLLECT_GENOTYPES {
998+
tag = { "${meta.reference}" }
999+
ext.prefix = { "pileupcaller_genotypes_${meta.reference}" }
1000+
publishDir = [
1001+
path: { "${params.outdir}/genotyping/" },
1002+
mode: params.publish_dir_mode,
1003+
enabled: true,
1004+
pattern: '*.{geno,snp,ind}'
1005+
]
1006+
}
1007+
1008+
withName: EIGENSTRATDATABASETOOLS_EIGENSTRATSNPCOVERAGE {
1009+
tag = { "${meta.reference}" }
1010+
ext.args = { "-j ${prefix}.json" }
1011+
ext.prefix = { "pileupcaller_genotypes_${meta.reference}_coverage" }
1012+
publishDir = [
1013+
path: { "${params.outdir}/genotyping/" },
1014+
mode: params.publish_dir_mode,
1015+
enabled: true,
1016+
pattern: '*.{tsv}'
1017+
]
1018+
}
1019+
1020+
withName: GATK_REALIGNERTARGETCREATOR {
1021+
tag = { "${meta.reference}|${meta.sample_id}" }
1022+
ext.args = [
1023+
params.genotyping_gatk_ug_defaultbasequalities > 0 ? "--defaultBaseQualities ${params.genotyping_gatk_ug_defaultbasequalities}" : "", // Empty string since GATK complains if its default of -1 is provided.
1024+
].join(' ').trim()
1025+
ext.prefix = { "${meta.sample_id}_${meta.reference}_realigntarget" }
1026+
publishDir = [
1027+
enabled: false
1028+
]
1029+
}
1030+
1031+
withName: GATK_INDELREALIGNER {
1032+
tag = { "${meta.reference}|${meta.sample_id}" }
1033+
ext.args = [
1034+
params.genotyping_gatk_ug_defaultbasequalities > 0 ? "--defaultBaseQualities ${params.genotyping_gatk_ug_defaultbasequalities}" : "", // Empty string since GATK complains if its default of -1 is provided.
1035+
].join(' ').trim()
1036+
ext.prefix = { "${meta.sample_id}_${meta.reference}_realigned" }
1037+
publishDir = [
1038+
path: { "${params.outdir}/genotyping/IndelRealigner" },
1039+
mode: params.publish_dir_mode,
1040+
enabled: params.genotyping_gatk_ug_keeprealignbam,
1041+
pattern: '*.{bam,bai}'
1042+
]
1043+
}
1044+
1045+
withName: GATK_UNIFIEDGENOTYPER {
1046+
tag = { "${meta.reference}|${meta.sample_id}" }
1047+
ext.args = {[
1048+
"--sample_ploidy ${meta2.ploidy}",
1049+
"-stand_call_conf ${params.genotyping_gatk_call_conf}",
1050+
"-dcov ${params.genotyping_gatk_ug_downsample}",
1051+
"--output_mode ${params.genotyping_gatk_ug_out_mode}",
1052+
"--genotype_likelihoods_model ${params.genotyping_gatk_ug_genotype_mode}",
1053+
params.genotyping_gatk_ug_defaultbasequalities > 0 ? "--defaultBaseQualities ${params.genotyping_gatk_ug_defaultbasequalities}" : "", // Empty string since GATK complains if its default of -1 is provided.
1054+
].join(' ').trim() }
1055+
ext.prefix = { "${meta.sample_id}_${meta.reference}" }
1056+
publishDir = [
1057+
path: { "${params.outdir}/genotyping/" },
1058+
mode: params.publish_dir_mode,
1059+
enabled: true,
1060+
pattern: '*.vcf.gz'
1061+
]
1062+
}
1063+
1064+
withName: BCFTOOLS_INDEX_UG {
1065+
tag = { "${meta.reference}|${meta.sample_id}" }
1066+
ext.args = "--tbi" //tbi indices for consistency with GATK HC
1067+
ext.prefix = { "${meta.sample_id}_${meta.reference}" }
1068+
publishDir = [
1069+
path: { "${params.outdir}/genotyping/" },
1070+
mode: params.publish_dir_mode,
1071+
enabled: true,
1072+
pattern: '*.vcf.gz.tbi'
1073+
]
1074+
}
1075+
1076+
withName: GATK4_HAPLOTYPECALLER {
1077+
tag = { "${meta.reference}|${meta.sample_id}" }
1078+
ext.args = {[
1079+
// Option names have changed from underscore_separated to hyphen-separated in GATK4
1080+
"--sample-ploidy ${meta2.ploidy}",
1081+
"-stand-call-conf ${params.genotyping_gatk_call_conf}",
1082+
"--output-mode ${params.genotyping_gatk_hc_out_mode}",
1083+
"--emit-ref-confidence ${params.genotyping_gatk_hc_emitrefconf}",
1084+
].join(' ').trim() }
1085+
ext.prefix = { "${meta.sample_id}_${meta.reference}" }
1086+
publishDir = [
1087+
path: { "${params.outdir}/genotyping/" },
1088+
mode: params.publish_dir_mode,
1089+
enabled: true,
1090+
pattern: '*.{vcf.gz,vcf.gz.tbi}'
1091+
]
1092+
}
1093+
1094+
withName: FREEBAYES {
1095+
tag = { "${meta.reference}|${meta.sample_id}" }
1096+
ext.args = {[
1097+
"-p ${ref_meta.ploidy}",
1098+
"-C ${params.genotyping_freebayes_min_alternate_count}",
1099+
params.genotyping_freebayes_skip_coverage != 0 ? "-g ${params.genotyping_freebayes_skip_coverage}" : "",
1100+
].join(' ').trim() }
1101+
ext.prefix = { "${meta.sample_id}_${meta.reference}" }
1102+
publishDir = [
1103+
path: { "${params.outdir}/genotyping/" },
1104+
mode: params.publish_dir_mode,
1105+
enabled: true,
1106+
pattern: '*.vcf.gz'
1107+
]
1108+
}
1109+
1110+
withName: BCFTOOLS_INDEX_FREEBAYES {
1111+
tag = { "${meta.reference}|${meta.sample_id}" }
1112+
ext.args = "--tbi" //tbi indices for consistency with GATK HC
1113+
ext.prefix = { "${meta.sample_id}_${meta.reference}" }
1114+
publishDir = [
1115+
path: { "${params.outdir}/genotyping/" },
1116+
mode: params.publish_dir_mode,
1117+
enabled: true,
1118+
pattern: '*.vcf.gz.tbi'
1119+
]
1120+
}
1121+
1122+
withName: BCFTOOLS_STATS_GENOTYPING {
1123+
tag = { "${meta.reference}|${meta.sample_id}" }
1124+
ext.prefix = { "${meta.sample_id}_${meta.reference}" }
1125+
publishDir = [
1126+
path: { "${params.outdir}/genotyping/" },
1127+
mode: params.publish_dir_mode,
1128+
enabled: true,
1129+
pattern: '*.txt'
1130+
]
1131+
}
9641132
}

conf/test_humanbam.config

+2-2
Original file line numberDiff line numberDiff line change
@@ -38,8 +38,8 @@ params {
3838
// //Sex Determination
3939
// sexdeterrmine_bedfile = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Human/1240K.pos.list_hs37d5.0based.bed.gz'
4040
// // Genotyping
41-
// pileupcaller_bedfile = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Human/1240K.pos.list_hs37d5.0based.bed.gz'
42-
// pileupcaller_snpfile = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Human/1240K_covered_in_JK2067_downsampled_s0.1.numeric_chromosomes.snp'
41+
genotyping_pileupcaller_bedfile = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Human/1240K.pos.list_hs37d5.0based.bed.gz'
42+
genotyping_pileupcaller_snpfile = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Human/1240K_covered_in_JK2067_downsampled_s0.1.numeric_chromosomes.snp'
4343

4444

4545
// BAM filtering

0 commit comments

Comments
 (0)