Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TRGT Special-purpose caller #412

Open
wants to merge 22 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 11 additions & 0 deletions docker/lr-trgt/Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
FROM us.gcr.io/broad-dsp-lrma/lr-pb:0.1.40
mdanzi marked this conversation as resolved.
Show resolved Hide resolved

RUN wget "http://github.com/PacificBiosciences/trgt/releases/download/v0.4.0/trgt-v0.4.0-linux_x86_64.gz" -O "trgt.gz" && \
gunzip trgt.gz && \
chmod 777 trgt && \
mv trgt /usr/bin/

RUN wget "http://github.com/PacificBiosciences/trgt/releases/download/v0.4.0/trvz-v0.4.0-linux_x86_64.gz" -O "trvz.gz" && \
gunzip trvz.gz && \
chmod 777 trvz && \
mv trvz /usr/bin/
15 changes: 15 additions & 0 deletions docker/lr-trgt/Makefile
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
VERSION = 0.4.0
mdanzi marked this conversation as resolved.
Show resolved Hide resolved
TAG1 = us.gcr.io/broad-dsp-lrma/lr-trgt:$(VERSION)
TAG2 = us.gcr.io/broad-dsp-lrma/lr-trgt:latest

all: build push

build:
docker build -t $(TAG1) -t $(TAG2) .

build_no_cache:
docker build --no-cache -t $(TAG1) -t $(TAG2) .

push:
docker push $(TAG1)
docker push $(TAG2)
21 changes: 21 additions & 0 deletions wdl/pipelines/PacBio/VariantCalling/PBCCSWholeGenome.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ import "../../../tasks/Utility/PBUtils.wdl" as PB
import "../../../tasks/Utility/Utils.wdl" as Utils
import "../../../tasks/VariantCalling/CallVariantsPBCCS.wdl" as VAR
import "../../../tasks/Utility/Finalize.wdl" as FF
import "../../../tasks/VariantCalling/TRGT.wdl" as TRGT
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a general comment about including trgt in this production pipeline.
It's a bit complicated, so I'll outline the main logic here:

The production pipeline is intended for running routine and stable tools on all production samples. Its primary responsibility is two-fold: alignments merging (and associated QC/metric) + variant calling (SV, SNV, indel). The tools incorporated here are well accepted by the community and undergo less frequent updates. This means we don't need to update & rerun the workflow constantly.

There's another set of callers that are not quite "validated" yet, i.e. they are more special-purpose focused and/or target a specific type of variants. They are under active development and evaluation so their release frequencies are much higher. IMO trgt falls under this category.

So here's my proposal:
Let's leave trgt out of the main variant calling pipeline, and promote it to a standalone workflow (i.e. place it under /wdl/pipelines/PacBio/VariantCalling/). We have other means to ensure that such "special-purpose" callers are also run routinely, in addition to the main ones. This also has the benefit that if we decide to update the version of trgt or how to run it (added annotations, etc), we only need to update its WDLs and rerun that workflow, i.e. separation of concerns .

How does this sound?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You make an excellent point. I definitely agree that TRGT is much more 'under active development' than may be ideal for the main pipeline.

I agree with your plan and have gone ahead and created a new pipeline under /wdl/pipelines/PacBio/VariantCalling/ which is called PBCCS_CallTRs.wdl. This pipeline runs the processWithTRGT task in the TRGT.wdl file (under /wdl/tasks/VariantCalling/).


import "../../../tasks/QC/SampleLevelAlignedMetrics.wdl" as COV

Expand Down Expand Up @@ -31,6 +32,9 @@ workflow PBCCSWholeGenome {
run_dv_pepper_analysis: "to turn on DV-Pepper analysis or not (non-trivial increase in cost and runtime)"
ref_scatter_interval_list_locator: "A file holding paths to interval_list files; needed only when running DV-Pepper"
ref_scatter_interval_list_ids: "A file that gives short IDs to the interval_list files; needed only when running DV-Pepper"

call_trs: "whether to call TRs"
trs_catalog: "optionally specify a non-default catalog to use when calling TRs, for use with TRGT"
}

input {
Expand Down Expand Up @@ -58,6 +62,9 @@ workflow PBCCSWholeGenome {
Int? dvp_memory = 128
File? ref_scatter_interval_list_locator
File? ref_scatter_interval_list_ids

Boolean call_trs = true
File? trs_catalog
}

Map[String, String] ref_map = read_map(ref_map_file)
Expand Down Expand Up @@ -167,6 +174,17 @@ workflow PBCCSWholeGenome {
}
}

if call_trs {
call TRGT.runTRGT {
input:
input_bam = bam,
input_bam_bai = bai,
output_gs_path = svdir,
ref_fasta = ref_map['fasta'],
ref_fasta_fai = ref_map['fai'],
ref_dict = ref_map['dict']
}

output {
File aligned_bam = FinalizeBam.gcs_path
File aligned_bai = FinalizeBai.gcs_path
Expand Down Expand Up @@ -205,5 +223,8 @@ workflow PBCCSWholeGenome {
File? dvp_g_tbi = FinalizeDVPepperGTbi.gcs_path
File? dvp_phased_vcf = FinalizeDVPEPPERPhasedVcf.gcs_path
File? dvp_phased_tbi = FinalizeDVPEPPERPhasedTbi.gcs_path

File? trgt_vcf = runTRGT.trgt_output_vcf
File? trgt_bam = runTRGT.trgt_output_bam
}
}
90 changes: 90 additions & 0 deletions wdl/tasks/VariantCalling/TRGT.wdl
Original file line number Diff line number Diff line change
@@ -0,0 +1,90 @@
version 1.0

workflow runTRGT {

meta {
description: "Uses TRGT to size TRs in a bam file."
}

input {
File input_bam
File input_bam_bai
String basename = basename(input_bam, ".bam")
String output_gs_path
File ref_dict
File ref_fasta
File ref_fasta_index
File repeatCatalog = "https://zuchnerlab.s3.amazonaws.com/RepeatExpansions/TRGT/adotto_TRregions_TRGTFormatWithFlankingSeq_v1.0.bed"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never tried running workflows that try to localize HTTP files. So I worry this default value will fail the workflow's executions.

This points to a larger issue: where should these annotations be stored?

I don't mean we should resolve this here, but just getting it out there for us to think about it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct. Apparently you can import WDLs from https, but can't localize files that way. But yes, this is obviously just because I didn't have a good place to store the annotation file. It is about 275MB, so not really appropriate for Github. Ideally, we would put it somewhere on gs. But I don't have an account where I can make a file publicly available there.

For now, I went ahead and moved this to a wget command within the Dockerfile so the annotation is built in to the image (and then referenced locally). But if we can make a place for me to host the annotation on gs, that would obviously be a much better solution as the annotation will certainly change over time and re-building the docker image for that is clearly not optimal.


#Optional runtime arguments
RuntimeAttr? runtime_attr_override
}

call processWithTRGT {
input:
input_bam = input_bam,
input_bam_bai = input_bam_bai,
basename = basename,
ref_fasta = ref_fasta,
ref_fasta_index = ref_fasta_index,
ref_dict = ref_dict,
repeatCatalog = repeatCatalog,
runtime_attr_override = runtime_attr_override
}

output {
File trgt_output_vcf = processWithTRGT.trgt_output_vcf
File trgt_output_bam = processWithTRGT.trgt_output_bam
}
}

task processWithTRGT {
input {
File input_bam
File input_bam_bai
String basename
File ref_fasta
File ref_fasta_index
File ref_dict
File repeatCatalog

RuntimeAttr? runtime_attr_override

}
#########################
RuntimeAttr default_attr = object {
cpu_cores: 4,
mem_gb: 16,
disk_gb: 500,
boot_disk_gb: 10,
preemptible_tries: 3,
max_retries: 1,
docker: "public.ecr.aws/s5z5a3q9/lr-trgt:0.4.0"
}
RuntimeAttr runtime_attr = select_first([runtime_attr_override, default_attr])

meta {
description: "Uses TRGT to size TRs in a bam file."
}

command <<<
set -euo pipefail
trgt --genome ~{ref_fasta} --repeats ~{repeatCatalog} --reads ~{input_bam} --threads ~{runtime_attr.cpu_cores} --output-prefix ~{basename}_trgt

>>>

output {
File trgt_output_vcf = "~{basename}_trgt.vcf.gz"
File trgt_output_bam = "~{basename}_trgt.spanning.bam"
}

runtime {
cpu: select_first([runtime_attr.cpu_cores, default_attr.cpu_cores])
memory: select_first([runtime_attr.mem_gb, default_attr.mem_gb]) + " GiB"
disks: "local-disk " + select_first([runtime_attr.disk_gb, default_attr.disk_gb]) + " HDD"
bootDiskSizeGb: select_first([runtime_attr.boot_disk_gb, default_attr.boot_disk_gb])
preemptible: select_first([runtime_attr.preemptible_tries, default_attr.preemptible_tries])
maxRetries: select_first([runtime_attr.max_retries, default_attr.max_retries])
docker: select_first([runtime_attr.docker, default_attr.docker])
}
}