pontushojer
Contributor

WIP! This is my first PR to nf-core/modules. I have not gotten this to work locally yet but figured I would share anyway.

I was running nf-core/sarek, which uses this module for duplicate marking. When CRAM output is requested, the module currently writes a BAM and then converts it to CRAM. This is quite wasteful: on one of my samples MarkDuplicates finished after 153 minutes while the module took 224 minutes in total, so the CRAM conversion added 71 minutes.

In this PR the output is instead piped as uncompressed BAM to samtools, which writes it in the desired format. I have not run any benchmarks yet, but judging from this article the runtime should be comparable to running MarkDuplicates alone.
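A minimal sketch of the piping idea, not the actual module script. The function names `samtools_fmt_for` and `markdup_stream` and their positional arguments are made up for illustration, and the sketch assumes that MarkDuplicates writes nothing but BAM data to stdout.

```shell
# Map an output file name to a samtools output format name.
samtools_fmt_for() {
    case "$1" in
        *.cram) echo cram ;;
        *.sam)  echo sam ;;
        *)      echo bam ;;
    esac
}

# Hypothetical wrapper: mark duplicates and write the result directly
# in the target format, without an intermediate BAM on disk.
# $1 = input BAM, $2 = output file (.bam or .cram),
# $3 = reference FASTA, $4 = metrics file
# The caller should enable pipefail so that a gatk failure is not masked
# by a successful samtools exit status.
markdup_stream() {
    gatk MarkDuplicates \
        --INPUT "$1" \
        --OUTPUT /dev/stdout \
        --METRICS_FILE "$4" \
        --COMPRESSION_LEVEL 0 \
    | samtools view \
        --output-fmt "$(samtools_fmt_for "$2")" \
        --reference "$3" \
        -o "$2" \
        -
}
```

A call might look like `markdup_stream in.bam out.cram ref.fasta out.metrics`. Compression level 0 keeps MarkDuplicates from spending time compressing data that samtools immediately decompresses again.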

PR checklist

Closes #XXX

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Emit the versions.yml file.
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows test <SUBWORKFLOW> --profile conda

@famosab
Contributor

famosab commented Mar 11, 2025

@pontushojer Are you still on this? Looks like an interesting concept to me :)

@pontushojer
Contributor Author

@famosab I have unfortunately not had any time to explore this further. You are most welcome to pick this up if you feel so inclined :)

@famosab famosab moved this from Bumped to Needs help in 2025 spring cleaning - modules (PRs) Mar 11, 2025
samtools index ${prefix}
fi
# Create index for BAM/CRAM
samtools index ${prefix}
Contributor

This could be done inside the samtools view call as well; there is a flag for that.

Contributor Author

Yes, --write-index could be used instead, similar to e.g. #7481.
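A rough sketch of what that could look like, assuming a samtools version that supports `--write-index` and the `##idx##` output notation. The helper names `index_spec` and `write_cram_with_index` are hypothetical and not part of the module.

```shell
# Build an htslib output spec that names the index file explicitly,
# e.g. sample.cram -> sample.cram##idx##sample.cram.crai
index_spec() {
    case "$1" in
        *.cram) echo "${1}##idx##${1}.crai" ;;
        *)      echo "${1}##idx##${1}.bai" ;;
    esac
}

# Hypothetical helper: read BAM on stdin, write CRAM and its index in one pass.
# $1 = output CRAM, $2 = reference FASTA
write_cram_with_index() {
    samtools view \
        --output-fmt cram \
        --reference "$2" \
        --write-index \
        -o "$(index_spec "$1")" \
        -
}
```

Writing the index while the file is being written makes the separate `samtools index` step unnecessary.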

@pontushojer
Contributor Author

pontushojer commented May 15, 2025

The tests are currently failing. It seems that the output read by samtools is causing an error:

    >   Runtime.totalMemory()=119537664
    >   [E::bgzf_read_block] Failed to read BGZF header at offset 68760
    >   [E::bgzf_read] Read block operation failed with error 2 after 0 of 4 bytes
    >   samtools view: error reading file "-"
    >   samtools view: error closing "-": -1

I think I have found the reason: gatk prints the Picard exit result to stdout, which corrupts the stream to samtools. See my PR to fix this in GATK, broadinstitute/gatk#9176.
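The effect can be illustrated with a toy example that uses no GATK or samtools at all. Everything here is a stand-in: `emit_payload` plays the role of the BAM stream and the `status: 0` line is a made-up placeholder for whatever gatk prints.

```shell
# Stand-in for the binary data a tool writes to stdout.
emit_payload() {
    printf 'BAMDATA'
}

# Tool that reports its status on stdout: the status text ends up
# in the same stream as the data and reaches the downstream reader.
noisy_tool() {
    emit_payload
    echo "status: 0"
}

# Tool that reports its status on stderr: stdout carries only the data.
clean_tool() {
    emit_payload
    echo "status: 0" >&2
}

# Compare what a downstream consumer would receive on stdin.
noisy_tool | wc -c
clean_tool 2>/dev/null | wc -c
```

The first count is larger than the second because the status text travels down the pipe together with the data. A reader that expects a BGZF block header at that position would fail in the way the log above shows.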

This means that this PR is not viable until the fix is merged and included in a new GATK release.

An alternative would be to use standalone Picard instead of the copy bundled with GATK, as it does not have this limitation. But then the changes here should be moved to the picard/markduplicates module instead.
