MarkDuplicates with single end data does not write the full report #1944

mattsoup · 2024-02-20T19:45:48Z

This issue tracker is for bug reports only. Before opening a new issue here, please make a post on our support forum so that our support team can look at your issue and determine whether it needs to be escalated into a bug report.

Instructions

Use a concise yet descriptive title;
Determine whether your issue is a bug report, a feature request, or a documentation request;
Choose the corresponding template block below and fill it in, replacing or deleting text in italics (surrounded by _) as appropriate;
Delete the other template blocks and this header.

Bug Report

Affected tool(s)

MarkDuplicates

Affected version(s)

Latest public release version 3.1.1
Latest development/master branch as of [date of test?]

Description

MarkDuplicates with paired end data writes the expected report file. However, with the same data but with all read 2s filtered out (so, 'single end'), the output report has no value for ESTIMATED_LIBRARY_SIZE, and no histogram data is generated.

Steps to reproduce

picard MarkDuplicates INPUT='Filter_on_data_16' OUTPUT='/mnt/galaxy/data6/user-data/010/975/dataset_10975886.dat' METRICS_FILE='/mnt/galaxy/data6/user-data/010/975/dataset_10975885.dat' REMOVE_DUPLICATES='false' ASSUME_SORTED='true' DUPLICATE_SCORING_STRATEGY='SUM_OF_BASE_QUALITIES' OPTICAL_DUPLICATE_PIXEL_DISTANCE='100' VALIDATION_STRINGENCY='LENIENT' TAGGING_POLICY=All QUIET=true

Expected behavior

The full MarkDuplicates report should be written regardless of whether the input is paired or single end.

Actual behavior

Single end reports do not have estimated library size or the histogram data written.

The text was updated successfully, but these errors were encountered:

cmnbroad · 2024-03-12T17:40:15Z

Did you reset the samflags on the remaining single reads, or are they still marked as paired ?

mattsoup · 2024-03-14T17:05:56Z

For this particular test I didn't reset the flags, but we found the problem with actual single end data that was never marked as paired.

kockan · 2024-06-17T14:54:26Z

Looks like ESTIMATED_LIBRARY_SIZE is only calculated for paired-end reads:

picard/src/main/java/picard/sam/DuplicationMetrics.java

Lines 111 to 112 in c8b2c06

    
           this.ESTIMATED_LIBRARY_SIZE = estimateLibrarySize(this.READ_PAIRS_EXAMINED - this.READ_PAIR_OPTICAL_DUPLICATES, 
        
                   this.READ_PAIRS_EXAMINED - this.READ_PAIR_DUPLICATES);

picard/src/main/java/picard/sam/DuplicationMetrics.java

Line 143 in c8b2c06

    
           public static Long estimateLibrarySize(final long readPairs, final long uniqueReadPairs) {

One could potentially check if the input contains all single-end or paired-end reads (what about paired-end reads with an unmapped mate being counted towards UNPAIRED_READS_EXAMINED?) and use the corresponding values as inputs to the function, but I'm not sure if this behavior was ever intended. @yfarjoun @lbergelson Any opinions?

kachulis assigned kockan Apr 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MarkDuplicates with single end data does not write the full report #1944

MarkDuplicates with single end data does not write the full report #1944

mattsoup commented Feb 20, 2024

cmnbroad commented Mar 12, 2024

mattsoup commented Mar 14, 2024

kockan commented Jun 17, 2024

MarkDuplicates with single end data does not write the full report #1944

MarkDuplicates with single end data does not write the full report #1944

Comments

mattsoup commented Feb 20, 2024

This issue tracker is for bug reports only. Before opening a new issue here, please make a post on our support forum so that our support team can look at your issue and determine whether it needs to be escalated into a bug report.

Instructions

Bug Report

Affected tool(s)

Affected version(s)

Description

Steps to reproduce

Expected behavior

Actual behavior

cmnbroad commented Mar 12, 2024

mattsoup commented Mar 14, 2024

kockan commented Jun 17, 2024