No optical clusters detected with single end reads #2004

Open
@caterinar

Dear all,

I am testing MarkDuplicates (Picard version 3.3.0).
With the provided test data it works fine.
However, when I create a single-end test file (attached; I use a .sam extension), no optical duplicate cluster is detected. The output is as follows:

`11:12:55.818 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/usrname/miniconda3/envs/latest_picard_env/share/picard-3.3.0-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so

[Thu Mar 27 11:12:55 CET 2025] MarkDuplicates TAGGING_POLICY=All INPUT=[/path/to/bam/optical_dupes.sam] OUTPUT=/path/to/output/marked_duplicates.bam METRICS_FILE=/path/to/output/marked_dup_metrics.txt ASSUME_SORT_ORDER=coordinate READ_NAME_REGEX=(?:.:)?([0-9]+)[^:]:(\d*):([0-9]+):.* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false CLEAR_DT=true DUPLEX_UMI=false FLOW_MODE=false FLOW_DUP_STRATEGY=FLOW_QUALITY_SUM_STRATEGY FLOW_USE_END_IN_UNPAIRED_READS=false FLOW_USE_UNPAIRED_CLIPPED_END=false FLOW_UNPAIRED_END_UNCERTAINTY=0 FLOW_UNPAIRED_START_UNCERTAINTY=0 FLOW_SKIP_FIRST_N_FLOWS=0 FLOW_Q_IS_KNOWN_END=false FLOW_EFFECTIVE_QUALITY_THRESHOLD=15 ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false USE_JDK_DEFLATER=false USE_JDK_INFLATER=false

[Thu Mar 27 11:12:55 CET 2025] Executing as usr@testsrv on Linux 5.15.0-84-generic amd64; OpenJDK 64-Bit Server VM 21.0.6+9-b895.97; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: 3.3.0
INFO 2025-03-27 11:12:56 MarkDuplicates Start of doWork freeMemory: 324136856; totalMemory: 335544320; maxMemory: 31675383808
INFO 2025-03-27 11:12:56 MarkDuplicates Reading input file and constructing read end information.
INFO 2025-03-27 11:12:56 MarkDuplicates Will retain up to 114765883 data points before spilling to disk.
INFO 2025-03-27 11:12:56 MarkDuplicates Read 4 records. 0 pairs never matched.
INFO 2025-03-27 11:12:56 MarkDuplicates After buildSortedReadEndLists freeMemory: 844555976; totalMemory: 1795162112; maxMemory: 31675383808
INFO 2025-03-27 11:12:58 MarkDuplicates Will retain up to 494927872 duplicate indices before spilling to disk.
INFO 2025-03-27 11:13:01 MarkDuplicates Traversing read pair information and detecting duplicates.
INFO 2025-03-27 11:13:01 MarkDuplicates Traversing fragment information and detecting duplicates.
INFO 2025-03-27 11:13:01 MarkDuplicates Sorting list of duplicate records.
INFO 2025-03-27 11:13:01 MarkDuplicates After generateDuplicateIndexes freeMemory: 5525398280; totalMemory: 13488881664; maxMemory: 31675383808
INFO 2025-03-27 11:13:01 MarkDuplicates Marking 3 records as duplicates.
INFO 2025-03-27 11:13:01 MarkDuplicates Found 0 optical duplicate clusters.
INFO 2025-03-27 11:13:01 MarkDuplicates Reads are assumed to be ordered by: coordinate
INFO 2025-03-27 11:13:01 MarkDuplicates Writing complete. Closing input iterator.
INFO 2025-03-27 11:13:01 MarkDuplicates Duplicate Index cleanup.
INFO 2025-03-27 11:13:01 MarkDuplicates Getting Memory Stats.
INFO 2025-03-27 11:13:01 MarkDuplicates Before output close freeMemory: 9494083328; totalMemory: 13488881664; maxMemory: 31675383808
INFO 2025-03-27 11:13:01 MarkDuplicates Closed outputs. Getting more Memory Stats.
INFO 2025-03-27 11:13:01 MarkDuplicates After output close freeMemory: 9494084392; totalMemory: 13488881664; maxMemory: 31675383808

[Thu Mar 27 11:13:01 CET 2025] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.09 minutes.
Runtime.totalMemory()=13488881664`

  • Do you have an explanation for why the reads in my example are not detected as an optical cluster, even though they are reasonably close?
  • Is there a parameter that I am overlooking to specify that the reads are single-end (SE)?
  • Does the code behave differently for paired-end (PE) and SE reads?
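
To make "reasonably close" concrete, here is a minimal sketch of the proximity check I have in mind. It assumes Illumina-style read names whose last three ':'-separated fields are tile, x and y, and it uses the OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 value from the run above. The read names and the helper functions (`parse_location`, `within_pixel_distance`, `close_pairs`) are hypothetical and for illustration only. This is not Picard's implementation, and it ignores everything else MarkDuplicates may require, such as the reads being duplicates of each other in the same read group.

```python
import re
from itertools import combinations

# Assumption: the read name ends in tile:x:y (three numeric fields).
TAIL_RE = re.compile(r"(\d+):(\d+):(\d+)$")


def parse_location(read_name):
    """Return (tile, x, y) parsed from the end of a read name, or None."""
    match = TAIL_RE.search(read_name)
    if match is None:
        return None
    tile, x, y = (int(group) for group in match.groups())
    return tile, x, y


def within_pixel_distance(name_a, name_b, pixel_distance=100):
    """True if both names parse, share a tile, and differ by at most
    pixel_distance in both x and y."""
    loc_a = parse_location(name_a)
    loc_b = parse_location(name_b)
    if loc_a is None or loc_b is None:
        return False
    same_tile = loc_a[0] == loc_b[0]
    close_x = abs(loc_a[1] - loc_b[1]) <= pixel_distance
    close_y = abs(loc_a[2] - loc_b[2]) <= pixel_distance
    return same_tile and close_x and close_y


def close_pairs(read_names, pixel_distance=100):
    """Return every pair of read names that passes the proximity check."""
    return [
        (a, b)
        for a, b in combinations(read_names, 2)
        if within_pixel_distance(a, b, pixel_distance)
    ]


if __name__ == "__main__":
    # Hypothetical read names, not taken from the attached file.
    names = [
        "M01:1:FC1:1:1101:1000:2000",
        "M01:1:FC1:1:1101:1050:2040",
        "M01:1:FC1:1:1101:5000:2000",
        "M01:1:FC1:1:1102:1000:2000",
    ]
    for a, b in close_pairs(names):
        print(a, b)
```

Running the same check on the QNAME column of the attached file would show which reads are within 100 pixels of each other on the same tile, and whether the names parse into coordinates at all.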

Any help would be appreciated.
Thank you in advance.

optical_dupes.txt
