Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MarkDuplicates with single end data does not write the full report #1944

Open
1 of 2 tasks
mattsoup opened this issue Feb 20, 2024 · 3 comments
Open
1 of 2 tasks

MarkDuplicates with single end data does not write the full report #1944

mattsoup opened this issue Feb 20, 2024 · 3 comments
Assignees

Comments

@mattsoup
Copy link

This issue tracker is for bug reports only. Before opening a new issue here, please make a post on our support forum so that our support team can look at your issue and determine whether it needs to be escalated into a bug report.


Instructions

  • Use a concise yet descriptive title;
  • Determine whether your issue is a bug report, a feature request, or a documentation request;
  • Choose the corresponding template block below and fill it in, replacing or deleting text in italics (surrounded by _) as appropriate;
  • Delete the other template blocks and this header.

Bug Report

Affected tool(s)

MarkDuplicates

Affected version(s)

  • Latest public release version 3.1.1
  • Latest development/master branch as of [date of test?]

Description

MarkDuplicates with paired end data writes the expected report file. However, with the same data but with all read 2s filtered out (so, 'single end'), the output report has no value for ESTIMATED_LIBRARY_SIZE, and no histogram data is generated.

Steps to reproduce

picard MarkDuplicates INPUT='Filter_on_data_16' OUTPUT='/mnt/galaxy/data6/user-data/010/975/dataset_10975886.dat' METRICS_FILE='/mnt/galaxy/data6/user-data/010/975/dataset_10975885.dat' REMOVE_DUPLICATES='false' ASSUME_SORTED='true' DUPLICATE_SCORING_STRATEGY='SUM_OF_BASE_QUALITIES' OPTICAL_DUPLICATE_PIXEL_DISTANCE='100' VALIDATION_STRINGENCY='LENIENT' TAGGING_POLICY=All QUIET=true

Expected behavior

The full MarkDuplicates report should be written regardless of whether the input is paired or single end.

Actual behavior

Single end reports do not have estimated library size or the histogram data written.

@cmnbroad
Copy link
Contributor

Did you reset the samflags on the remaining single reads, or are they still marked as paired ?

@mattsoup
Copy link
Author

For this particular test I didn't reset the flags, but we found the problem with actual single end data that was never marked as paired.

@kockan
Copy link
Contributor

kockan commented Jun 17, 2024

Looks like ESTIMATED_LIBRARY_SIZE is only calculated for paired-end reads:

this.ESTIMATED_LIBRARY_SIZE = estimateLibrarySize(this.READ_PAIRS_EXAMINED - this.READ_PAIR_OPTICAL_DUPLICATES,
this.READ_PAIRS_EXAMINED - this.READ_PAIR_DUPLICATES);

public static Long estimateLibrarySize(final long readPairs, final long uniqueReadPairs) {

One could potentially check if the input contains all single-end or paired-end reads (what about paired-end reads with an unmapped mate being counted towards UNPAIRED_READS_EXAMINED?) and use the corresponding values as inputs to the function, but I'm not sure if this behavior was ever intended. @yfarjoun @lbergelson Any opinions?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants