-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Follow up to PR #418 #430
base: main
Are you sure you want to change the base?
Follow up to PR #418 #430
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few suggestions. Might want to omit "/cromwell_root" wherever it appears, as I suggested in the last PR.
@@ -1,17 +1,11 @@ | |||
version: 1.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't reviewed this. Going to assume it's OK.
@@ -0,0 +1,92 @@ | |||
# incorporate samtools and bcftools, plus some additional but most common genomics tools |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not reviewed here -- appears to be the same as the version in sh_ingest_high_pass_inputs, which I've already reviewed.
docker/lr-gcloud-samtools/Makefile
Outdated
@@ -0,0 +1,12 @@ | |||
VERSION = 0.1.2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not reviewed here -- appears to be the same as the version in sh_ingest_high_pass_inputs, which I've already reviewed.
@@ -0,0 +1,6 @@ | |||
#!/bin/bash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not reviewed here -- appears to be the same as the version in sh_ingest_high_pass_inputs, which I've already reviewed.
wdl/tasks/QC/AlignedMetrics.wdl
Outdated
time gcloud storage cp ~{bam} ~{local_bam} | ||
|
||
mv ~{bai} "~{local_bam}.bai" | ||
mosdepth -x -n -Q1 ~{prefix} ~{local_bam} || echo "mosdepth failed somehow" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it really your intention to swallow mosdepth failures with just a message to the log?
wdl/tasks/Utility/BAMutils.wdl
Outdated
samtools view --no-PG -H ~{local_bam} > header.txt | ||
grep -v "^@SQ" header.txt | ||
|
||
# fix SM in the RG lines | ||
grep "^@RG" header.txt > rg_lines.txt | ||
if ! grep -qF "SM:" rg_lines.txt; then | ||
sed -i "s/$/SM:tbd/" rg_lines.txt | ||
fi | ||
awk -v sm="~{sample_name}" -F '\t' 'BEGIN {OFS="\t"} { for (i=1; i<=NF; ++i) { if ($i ~ "SM:") $i="SM:"sm } print}' \ | ||
rg_lines.txt \ | ||
> fixed_rg_lines.txt | ||
cat fixed_rg_lines.txt | ||
|
||
# paste things back | ||
grep -v "^@RG" header.txt > otherlines.txt | ||
cat otherlines.txt fixed_rg_lines.txt > fixed_header.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All this could be replaced with a 1-liner:
samtools view --no-PG -H "~{local_bam}" | \
sed -E "/^@RG/s/( SM:[^ ]*|$)/ SM:~{sample_name}/" > fixed_header.txt
#these are tabs --- ^ ^
wdl/tasks/Utility/BAMutils.wdl
Outdated
@@ -43,3 +52,64 @@ task GetReadGroupInfo { | |||
docker: "us.gcr.io/broad-dsp-lrma/lr-basic:0.1.1" | |||
} | |||
} | |||
|
|||
task ResetSamplename { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like an awfully dangerous thing to do (arbitrary replacement of sample name). The task doesn't appear to be in use, currently. I hope we'll use it judiciously.
################################################################################### | ||
# generate PBI | ||
call PBUtils.PBIndex as Index {input: bam = uBAM} | ||
call BU.GetReadGroupInfo as RG {input: uBAM = uBAM, keys = ['SM', 'LB', 'PU']} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't look like we actually need the SM value
* update Hifiasm to version 0.19.5 * update how Hifiasm outputs are compressed (bgz replacing gz), also * monitor hifiasm resources usage
* update docker used in PBSV tasks to the version coming with official SMRTLink releases (2.9.0) * change how the 2-step PBSV process is done (following the recommended way now)
* to version 2.0.7 * using TRF bed * conditionally phase sv (requires phased bam) * generates its own vcf.gz and tbi
Overhaul how small variants are called in the WG pipelines * default to use DV to call small variants, Clair3 analysis needs to be requested explicitly * retire the Pepper toolchain completely from the CCS pipeline, using DV directly * for R10.4+ ONT data, also use DV directly * older ONT data would still use the PEPPER-DV-Margin pipeline * offers GPU version (though based on, it's not worth it yet) * update how bam haplotagging is done Cleanup structural variants calling * experiment with SNF2 phasing SV calls (implicitly depends on small variants calling now) * tune PBSV calling - discover now supports --hifi - output vcf.gz and tbi - less verbose logging by default Misc.: * optimizations to BAM merging and metrics workflow * updates coverage collection step * new R script to visualize log from vm_monitoring_script.sh
* organize dockstore.yml file a bit * make WDL validation shell script more usable * update pbmm2 and pbindex to versions in SMRTLink * update GeneralUtils.wdl - two bash-like new tasks [CoerceMapToArrayOfPairs, CoerceArrayOfPairsToMap] - cleanup task CollapseArrayOfStrings * update resource allocations to tasks - NanoplotFromBam (also changes docker) - MosDepthWGS
* incorporates gcloud cli (not just gsutil) * integrate libdeflate for more speedups
incorporate new tasks and optimize them * [CountMethylCallReads, GatherReadsWithoutMethylCalls] from sh_beans * [GetPileup, BamToRelevantPileup] from sh_more_atomic_qc * [GetReadGroupLines, GetSortOrder, SplitNameSortedUbam] from sh_ont_fc * [SamtoolsFlagStats, ParseFlagStatsJson] from sh_trvial_stats * [FilterBamByLen, InferSampleName] from sh_seqkit * [CountAlignmentRecords, StreamingBamErrored, CountAlignmentRecordsByFlag] from sh_maha_aln_metrics * [ResetSamplename] from sh_ingest_singlerg * [MergeBamsWithSamtools] from sh_ont_fc.Utils.wdl * [BamToFastq] from sh_more_bam_qcs and optimize it with sh_ingest_singlerg.Utils.wdl delete * GetSortOrder as that's now implemented in GatherBamMetadata * Drop2304Alignments as that's no longer used update dockers to the latest
c905a5a
to
a9ae6c3
Compare
CHERRY-PICK FROM VARIOUS QC/METRICS BRANCHES: * collect information about ML/MM tags in a long-read BAM (sh_beans) * a heuristic way to find peaks in a distribution (using dyst) (sh_dyst_peaker) * filter reads by length in a BAM * collect some read quality stats from (length-filtered) FASTQ/BAM (sh_seq_kit) * VerifyBamID2 (for contamination estimation) * naive sex-concordance check (sh_more_atomic_qc) * check fingerprint of a single BAM file (sh_sample_fp) * collect SAM flag stats (sh_trivial_stats)
* make BeanCounter finalization optional (wdl/pipelines/TechAgnostic/Utility/CountTheBeans.wdl) * custom struct for sub-workflow config using a JSON (wdl/pipelines/TechAgnostic/Utility/LongReadsContaminationEstimation.wdl) * make fingerprint checking subworkflow control size filtering (wdl/tasks/QC/FPCheckAoU.wdl) (wdl/pipelines/TechAgnostic/Utility/VerifyBamFingerprint.wdl) * fix a warning by IDE/miniwdl complaining WDL stdlib function length only applies to Array (wdl/tasks/Utility/BAMutils.wdl) * various updates to Finalize (wdl/tasks/Utility/Finalize.wdl) New tasks in (wdl/tasks/Utility/GeneralUtils.wdl) to * correctly convert Map to TSV * concatenate files
a9ae6c3
to
57dcfaf
Compare
* AlignAndCheckFingerprintCCS.wdl * CollectPacBioAlignedMetrics.wdl * CollectSMRTCellUnalignedMetrics.wdl
57dcfaf
to
be67190
Compare
(CHRRY-PICK & follow up to PR 406)
* SampleLevelAlignedMetrics.wdl * PBCLRWholeGenome.wdl
* new struct in AlignedBamQCandMetrics.wdl to facilicate as-sub-workflow calling * change parameters name for fingerprint workflows
be67190
to
295be09
Compare
* make saving of reads without methylation SAM tags optional * better parameter naming
295be09
to
74388b3
Compare
ea29e7c
to
29c570b
Compare
29c570b
to
50d7235
Compare
34d754d
to
c4b670f
Compare
(affects contamination estimation)
handle (ref-indepedent, alignment) a single piece of on-instrument demultiplexed CCS bam
c4b670f
to
1e26de7
Compare
Following up #418 with more optimizations, and two new pipelines for
ingesting single demultiplexed PacBio Hifi (u)BAM (aka a readgroup uBAM)