gatk4/markduplicates - pipe uncompressed output to speed up CRAM writing #7497
@pontushojer Are you still on this? Looks like an interesting concept to me :)
@famosab I have unfortunately not had any time to explore this further. You are most welcome to pick this up if you feel so inclined :)
samtools index ${prefix}
fi
# Create index for BAM/CRAM
samtools index ${prefix}
This could be done inside the samtools view call as well; there is a flag for that.
Yes, --write-index could be used instead of this, similar to e.g. #7481.
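For illustration, a single-pass write-and-index step might look roughly like the sketch below. The helper name write_cram_with_index and its arguments are hypothetical and not taken from this PR; only standard samtools view options are used, and the upstream command producing the stream is left out.

```bash
# Hypothetical helper (not part of this module): read a SAM/BAM stream on
# stdin, write CRAM, and let samtools build the index in the same pass
# instead of running a separate `samtools index` afterwards.
write_cram_with_index() {
    local out_cram="$1"      # output path, e.g. sample.cram
    local reference="$2"     # reference FASTA used for CRAM encoding
    local threads="${3:-1}"  # additional compression threads

    samtools view \
        --write-index \
        --output-fmt cram \
        --reference "$reference" \
        --threads "$threads" \
        -o "$out_cram" \
        -
}

# Example use (assumes the upstream command emits a clean BAM stream on stdout):
#   some_upstream_command | write_cram_with_index sample.cram genome.fasta 4
```

This only helps if the upstream tool writes nothing but alignment data to stdout.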
The tests are currently failing. It seems that the output read by samtools is causing an error.
I think I have found the reason: gatk prints the Picard exit result to stdout, which corrupts the stream to samtools. See my PR to fix this in GATK, broadinstitute/gatk#9176. This means that this PR is not viable until the fix is merged and included in a new GATK release. An alternative solution would be to use Picard independently rather than the version bundled with GATK, as it does not have this limitation. But then the changes here should be transferred to the
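The failure mode can be reproduced without GATK. The sketch below uses two made-up stand-in functions, noisy_tool and quiet_tool, and an invented status message; it is not GATK's actual output. It only shows that a status line on stdout ends up in the piped data, while the same line on stderr does not.

```bash
# Stand-in for a tool that writes its payload to stdout and then prints a
# status line to stdout as well (message text is invented for illustration).
noisy_tool() {
    printf 'PAYLOAD'
    echo "exit status: 0"        # lands in the pipe together with the payload
}

# Same stand-in, but the status line goes to stderr, so the pipe stays clean.
quiet_tool() {
    printf 'PAYLOAD'
    echo "exit status: 0" >&2
}

# Count the bytes a downstream reader would receive in each case.
polluted_bytes=$(noisy_tool | wc -c | tr -d ' ')
clean_bytes=$(quiet_tool 2>/dev/null | wc -c | tr -d ' ')

echo "polluted stream: ${polluted_bytes} bytes, clean stream: ${clean_bytes} bytes"
```

A binary format reader such as samtools would see the extra bytes as a malformed stream, which would be consistent with the test failures described above.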
WIP! This is my first PR to nf-core/modules. I have not gotten this to work locally yet but figured I would share anyway.
I was running nf-core/sarek, which uses this module for duplicate marking. When using CRAM output, this module currently writes a BAM and then converts it to CRAM. This is quite wasteful: on one of my samples MarkDuplicates finished after 153 minutes while the module duration was 224 minutes, so the CRAM conversion added 71 minutes.
In this PR the output is instead piped as uncompressed BAM to samtools, which writes output to the desired format. I have not done any benchmarks yet, but judging from this article it should be comparable to just running MarkDuplicates.
PR checklist
Closes #XXX
nf-core modules test <MODULE> --profile docker
nf-core modules test <MODULE> --profile singularity
nf-core modules test <MODULE> --profile conda
nf-core subworkflows test <SUBWORKFLOW> --profile docker
nf-core subworkflows test <SUBWORKFLOW> --profile singularity
nf-core subworkflows test <SUBWORKFLOW> --profile conda