
Conversation

@tomkinsc tomkinsc commented Aug 22, 2025

This PR extends illumina_demux to handle the 3-barcode sample sheets used by the pooled cDNA-tagseq protocol, in which reads are demultiplexed in two stages: first a standard round of demultiplexing on a pair of outer barcodes, followed by a custom round that separates reads by their inline barcodes. The first round is implemented with Picard, as before, and the second with a newly integrated tool, splitcode.

lauraluebbert and others added 30 commits January 20, 2025 19:48
add a wrapper for AddCommentsToBam to picard.py, `AddCommentsToBamTool()`, used to add comment lines (`@CO`) to a BAM file
add `collapse_dup_strs_to_str_or_md5()` to `util/misc.py`, which collapses a list of strings into a single string: the one unique value (in the event all input strings were duplicates), a delimiter-joined version of the input strings, or, if the joined string would be too long, a truncated md5 hash of the joined inputs (without delimiters). A suffix can optionally be appended (e.g. `suffix="_muxed"`)

add collapsing of duplicate barcodes or barcode pairs to `illumina.py::main_illumina_demux()` via the `--collapse_duplicated_barcodes` flag. Deduplication keys on (`barcode_1`) for single-end Illumina indexing, or on (`barcode_1`,`barcode_2`) for paired-end Illumina indexing, with each group of duplicated indexes collapsing to a single row prior to demultiplexing via Picard IlluminaBasecallsToSam. The deduplication happens within the `SampleSheet()` class after the sample sheet is read in (with future format flexibility in mind). Within each deduplicated group, the `barcode_[1-2]` columns are passed through verbatim. Every other column is either reported verbatim, if only one distinct value exists within the group, or set to a single string representing all of the group's values: the delimited join of the member values, optionally followed by a supplied suffix (`_muxed` by default), or, if that string would exceed a specified length limit, a truncated md5 hash of the member values joined without delimiters, followed by the optional suffix. This collapsing is implemented in `util/misc.py::collapse_dup_strs_to_str_or_md5()`. The intent is for each group's new column values to reflect the pre-deduplication values, so that upstream changes in sample sheet construction manifest in a way that allows (some) comparison between deduplicated sample sheets. In a mixed pool where some barcodes or barcode pairs are duplicated and others are distinct, the distinct rows pass through unmodified.
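For reference, a minimal sketch of the collapsing logic described above (a simplification, not the exact code in `util/misc.py`; the `max_len` parameter is a stand-in for the configurable length limit):

import hashlib

def collapse_dup_strs_to_str_or_md5(vals, delimiter="-", suffix="_muxed", max_len=48):
    # sketch: collapse a list of strings to one representative string
    unique_vals = sorted(set(v for v in vals if v))
    if len(unique_vals) == 1:
        # all inputs were duplicates: pass the single value through verbatim
        return unique_vals[0]
    joined = delimiter.join(unique_vals) + suffix
    if len(joined) <= max_len:
        return joined
    # too long: truncated md5 of the values joined without delimiters
    return hashlib.md5("".join(unique_vals).encode()).hexdigest()[-8:] + suffix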

WIP adapting splitcode_demux; added an `inner_demux_mapper()` convenience function to the `SampleSheet()` class to more easily map from original sample IDs to collapsed (and other) values. Refactoring is anticipated for the new functions added on this branch.
… "run" and "muxed_run" inputs and values in inner_demux_mapper()

WIP splitcode demux working (again, but generalized); stats aggregation for plotting; refactoring in various places.
Additional cleanup needed: FASTQ-to-BAM conversion on the output side, and rendering of metrics in Picard style.
the SampleSheet class now has a function, `rev_comp_barcode_values(barcode_columns_to_revcomp=None, inplace=True)`, which reverse-complements the sequences in specified columns of a loaded sample sheet. Barcodes can also be rev-comped when a SampleSheet object is initialized by passing column names to `barcode_columns_to_revcomp`. The most common case is rev-comping `barcode_2`, so the sample sheet init function accepts `rev_comp_barcode_2=True` as a shortcut that avoids passing the column name explicitly. The `illumina_demux` and `splitcode_demux` functions now expose a CLI parameter, `--rev_comp_barcodes_before_demux`: if the flag is passed without values, the default `barcode_2` column is rev-comped; if values are passed, the specified columns are rev-comped; if the flag is omitted, no columns are rev-comped. Related unit tests are included in `test/unit/test_illumina.py`. An efficient `reverse_complement()` function has been added to `util/misc.py`. `util/cmd.py` has been extended with a custom argparse action, `storeMultiArgsOrFallBackToConst`, which implements the flag behavior above (`None` if the flag is omitted, the const default if the flag is given bare, one or more values if flag and values are passed).
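For illustration, that flag behavior can be implemented with an argparse action along these lines (a sketch, not the exact code in `util/cmd.py`):

import argparse

class storeMultiArgsOrFallBackToConst(argparse.Action):
    # None if flag omitted (default), const if flag given bare, values otherwise
    def __call__(self, parser, namespace, values, option_string=None):
        setattr(namespace, self.dest, values if values else self.const)

parser = argparse.ArgumentParser()
parser.add_argument("--rev_comp_barcodes_before_demux", nargs="*",
                    action=storeMultiArgsOrFallBackToConst,
                    const=["barcode_2"], default=None)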

add `convert_splitcode_demux_metrics_to_picard_style` to emit a Picard-style metrics file; re-enable the splitcode code path I had bypassed during development (when re-using old splitcode results)
…r; and for picard fq -> ubam, one fq pair per worker
add comment in description tag of read group header for RGs originating from splitcode demux; minor alignment changes for readability
https://samtools.github.io/hts-specs/SAMv1.pdf#page=4
…ret the value as an int if the sample sheet only has numbers in the library_id_per_column
tomkinsc added 3 commits July 15, 2025 21:12
…ode demux

refactor tabular reporting of read counts following inner-barcode demux; move CSV-creation logic from plot_sorted_curve() to a new function, write_barcode_metrics_for_pools()
@dpark01 dpark01 requested a review from Copilot August 22, 2025 15:00
Copilot AI left a comment

Pull Request Overview

This WIP pull request integrates SwiftSeq demux functionality and makes various code quality improvements. The changes include adding new tools for barcode demultiplexing, updating dependencies, improving concurrency patterns, and enhancing utility functions.

Key changes:

  • Adds SplitCode tool integration for barcode demultiplexing
  • Updates various dependencies and Python version requirements
  • Improves error handling and switches from ThreadPoolExecutor to ProcessPoolExecutor
  • Adds utility functions for string manipulation and reverse complement operations

Reviewed Changes

Copilot reviewed 19 out of 21 changed files in this pull request and generated 6 comments.

Summary per file:

  • util/misc.py: Adds string collapsing, MD5 hashing, and reverse complement utility functions
  • util/file.py: Updates SQLite configuration and improves error handling
  • util/cmd.py: Adds custom argparse action for multiple arguments with fallback
  • tools/splitcode.py: New tool wrapper for splitcode barcode demultiplexing
  • tools/picard.py: Adds new Picard tools and improves parameter formatting
  • tools/bwa.py: Changes ThreadPoolExecutor to ProcessPoolExecutor
  • test files: Adds comprehensive tests for new barcode handling functionality
  • requirements files: Updates dependency versions and Python requirements
  • Docker/CI files: Updates build configurations and documentation


joined_values_delimited = f"{delimiter}".join(sorted([s for s in unique_vals if len(s)>0]) if sort_plural_vals else unique_vals) + (suffix if append_suffix_to_delimited_str else "")


if len(joined_values_delimited) <= int(hash_if_longer_than):

Copilot AI Aug 22, 2025

The comparison logic is incorrect. When hash_if_longer_than is -1 (the default), this condition will never be true since string length cannot be <= -1. This means the function will always return the MD5 hash instead of the delimited string when using default parameters.

Suggested change
if len(joined_values_delimited) <= int(hash_if_longer_than):
if hash_if_longer_than < 0 or len(joined_values_delimited) <= int(hash_if_longer_than):

https://docs.python.org/3.12/library/tarfile.html#extraction-filters
'''
log.debug("Setting extraction filter for tarfile module to tar_filter")
tarfile.TarFile.extraction_filter = staticmethod(tarfile.tar_filter) if hasattr(tarfile, 'data_filter') else (lambda member, path: member)

Copilot AI Aug 22, 2025

The condition checks for hasattr(tarfile, 'data_filter') but should check for hasattr(tarfile, 'tar_filter') since that's the function being used. This could result in using the fallback lambda when tar_filter is actually available.

Suggested change
tarfile.TarFile.extraction_filter = staticmethod(tarfile.tar_filter) if hasattr(tarfile, 'data_filter') else (lambda member, path: member)
tarfile.TarFile.extraction_filter = staticmethod(tarfile.tar_filter) if hasattr(tarfile, 'tar_filter') else (lambda member, path: member)

# Otherwise, return joined values
# or if that would be too long, return (MD5 of joined values)[-8:]+(optional suffix)
# the input values are optionally sorted after removing empty strings
joined_values_delimited = f"{delimiter}".join(sorted([s for s in unique_vals if len(s)>0]) if sort_plural_vals else unique_vals) + (suffix if append_suffix_to_delimited_str else "")

Copilot AI Aug 22, 2025

When sort_plural_vals is True, empty strings are filtered out, but when False, unique_vals is used directly which may contain empty strings. This inconsistency could cause different behavior in the joined string depending on the sort flag.

Suggested change
joined_values_delimited = f"{delimiter}".join(sorted([s for s in unique_vals if len(s)>0]) if sort_plural_vals else unique_vals) + (suffix if append_suffix_to_delimited_str else "")
filtered_vals = [s for s in unique_vals if len(s) > 0]
joined_values_delimited = f"{delimiter}".join(sorted(filtered_vals) if sort_plural_vals else filtered_vals) + (suffix if append_suffix_to_delimited_str else "")

super(SplitCodeTool, self).__init__(install_methods=install_methods)

def _get_tool_version(self):
self.tool_version = subprocess.check_output([self.install_and_get_path(), '--version']).decode('UTF-8').strip()

Copilot AI Aug 22, 2025

This line could raise subprocess.CalledProcessError if splitcode returns a non-zero exit code for --version. Consider adding error handling to provide a more informative error message.

Suggested change
self.tool_version = subprocess.check_output([self.install_and_get_path(), '--version']).decode('UTF-8').strip()
try:
    self.tool_version = subprocess.check_output([self.install_and_get_path(), '--version']).decode('UTF-8').strip()
except subprocess.CalledProcessError as e:
    log.error(f"Failed to get {TOOL_NAME} version. Command output: {e.output.decode('UTF-8') if e.output else 'No output'}. Return code: {e.returncode}")
    raise RuntimeError(f"Could not determine {TOOL_NAME} version. Please check that {TOOL_NAME} is installed and working.") from e

else:
# If the joined string is too long, compute MD5 of joined values
#joined_values = f"".join(sorted(unique_vals)) + (suffix if calculate_md5_including_suffix else "")
joined_values = f"".join(unique_vals) + (suffix if calculate_md5_including_suffix else "")

Copilot AI Aug 22, 2025

This line joins all values including potentially empty strings, which differs from the filtered approach used in line 102. Consider using consistent filtering logic for both the delimited string and MD5 calculation.

Suggested change
joined_values = f"".join(unique_vals) + (suffix if calculate_md5_including_suffix else "")
joined_values = f"".join([s for s in unique_vals if len(s) > 0]) + (suffix if calculate_md5_including_suffix else "")

Returns the reverse complement using string.maketrans
"""
table = bytearray.maketrans(b"ACTGN",b"TGACN")
return bytearray(seq.encode("UTF8")).translate(table)[::-1].decode("UTF8")

Copilot AI Aug 22, 2025

The translation table is recreated on every function call. Consider moving this to module level as a constant for better performance when the function is called frequently.

Suggested change
return bytearray(seq.encode("UTF8")).translate(table)[::-1].decode("UTF8")
# Translation table for reverse complementing DNA sequences
REVCOMP_TRANSLATION_TABLE = bytearray.maketrans(b"ACTGN", b"TGACN")
def reverse_complement(seq):
    """
    Returns the reverse complement using string.maketrans
    """
    return bytearray(seq.encode("UTF8")).translate(REVCOMP_TRANSLATION_TABLE)[::-1].decode("UTF8")

dpark01 and others added 4 commits October 7, 2025 08:21
Handle the case where splitcode processes 0 reads for a pool, which
results in an empty tag_qc array in the summary JSON. Previously this
would crash with KeyError: 'tag' when trying to create a DataFrame
from the empty array and access non-existent columns.

The fix adds defensive handling in create_lut():
- Check if tag_qc is empty before processing
- When empty, create output with correct schema and 0 read counts
- When not empty, process normally as before
- Log a warning for pools with 0 reads

This allows the 3-barcode demultiplexing workflow to complete
successfully even when some pools have no matching reads.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
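A sketch of the defensive pattern described above, assuming the tag/count/distance schema documented later in this PR (simplified relative to create_lut()):

import logging
import pandas as pd

def tag_counts_from_summary(summary_json, pool_id):
    tag_qc = summary_json.get("tag_qc", [])
    if not tag_qc:
        # 0 reads for this pool: emit the expected schema with no rows
        logging.warning("pool %s had 0 reads; emitting empty tag counts", pool_id)
        return pd.DataFrame(columns=["tag", "count", "distance"])
    # non-empty: process normally
    return pd.DataFrame(tag_qc)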
Fix three functions to handle cases where barcodes have zero reads:

1. plot_read_counts(): Check for positive values before setting log
   scale to avoid matplotlib warnings when all data is zero

2. plot_sorted_curve(): Handle division by zero when calculating
   fractions for pools with 0 total reads

3. write_barcode_metrics_for_pools(): Use ternary expressions to
   safely handle division by zero when calculating percentages

All functions now produce valid outputs and log warnings when
encountering zero counts, allowing the 3-barcode demultiplexing
workflow to complete successfully even when some barcodes/pools
have no matching reads.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
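The ternary guard in point 3 amounts to (illustrative):

def safe_percentage(numerator, denominator):
    # return 0.0 instead of raising ZeroDivisionError for pools with 0 reads
    return (100.0 * numerator / denominator) if denominator else 0.0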
@dpark01 dpark01 changed the title WIP swiftseq demux integration Add demultiplexing of 3-barcode sheets with splitcode Nov 6, 2025
dpark01 and others added 19 commits November 6, 2025 21:56

This commit adds extensive unit test coverage for the create_splitcode_lookup_table
function (formerly create_lut) and fixes several bugs discovered during testing.

Test Coverage Added:
- New TestSplitcodeLookupTable class with 4 comprehensive test cases
- test_basic_single_pool: Validates full workflow with 3-sample pool
- test_zero_reads_pool: Tests edge case handling (empty splitcode output)
- test_multi_pool: Tests multiple pools in one run
- test_append_run_id: Tests flowcell ID suffix parameter
- All tests passing (4/4) in docker environment

Test Input Files:
- Created test/input/TestSplitcodeLookupTable/ with sample sheets and JSON files
- 3 sample sheets covering single pool, zero reads, and multi-pool scenarios
- 4 splitcode summary JSON files with realistic test data

Bugs Fixed:
1. Fixed append_run_id parameter causing JSON file lookup failures
   - Pool names include append_run_id suffix but JSON filenames don't
   - Added logic to strip suffix when looking up JSON files
   - Discovered by test_append_run_id

2. Fixed splitcode DataFrame type handling
   - Previously converted all columns to string, breaking numeric operations
   - Now only convert tag column to string, keep count/distance numeric
   - Prevents NaN errors in downstream arithmetic

Code Quality Improvements:
- Function renamed: create_lut → create_splitcode_lookup_table
- Added comprehensive NumPy-style docstring (60+ lines)
- Removed trivial read_json() wrapper function
- Fixed all pandas type coercion issues (dtype=str on all DataFrames)
- Removed defensive str() casting that was treating symptoms

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Problem:
- Picard's FastqToSam crashes on empty FASTQ inputs because it cannot
  determine quality score encoding with zero reads
- Both fastq_to_bam (read_utils.py) and run_picard_fastq_to_ubam
  (illumina.py) would fail when given empty input files
- Error: Cannot determine candidate qualities: no qualities found.

Solution:
1. Added FastqToSamTool.isFastqEmpty() static method:
   - Checks if FASTQ file is semantically empty (no reads)
   - Optimized for performance: files >1KB assumed non-empty without
     reading contents
   - Small files checked for whitespace-only content

2. Modified FastqToSamTool.execute() to handle empty inputs:
   - If FASTQ2 is empty but FASTQ1 is not, treat as single-read input
   - If FASTQ1 is empty, create valid empty BAM file using pysam:
     * Extracts header metadata from picardOptions
     * Creates proper BAM header with read group info
     * Writes header-only BAM (zero reads)
     * Bypasses Picard completely

3. Added test coverage:
   - test_fastq_to_bam_empty_inputs() verifies empty FASTQ handling
   - Tests that output BAM has valid header, zero reads, and is
     readable by samtools

Benefits:
- Both fastq_to_bam and run_picard_fastq_to_ubam now safe for empty inputs
- Produces semantically correct empty BAM files
- SamtoolsTool.isEmpty() returns True as expected
- No Picard crash, proper error handling

Files modified:
- tools/picard.py: Add isFastqEmpty() and defensive code in execute()
- test/unit/test_read_utils.py: Add test_fastq_to_bam_empty_inputs()

All tests passing (2/2 in TestFastqBam).
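A sketch of the size-then-content check described above (a simplification of the isFastqEmpty() method; the 1KB cutoff is from the commit message):

import os

def is_fastq_empty(path, size_cutoff=1024):
    # large files are assumed non-empty without reading their contents
    if os.path.getsize(path) > size_cutoff:
        return False
    # small files: semantically empty if they contain only whitespace
    with open(path) as f:
        return f.read().strip() == ''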

Problem:
- rmdup_cdhit_bam() was calling cd-hit-dup with background=True, which returns
  immediately without waiting for the process to finish
- FastqToSamTool().execute() was called immediately after, trying to read
  cd-hit-dup's output files before they were fully written
- This created a race condition where FastqToSamTool.isFastqEmpty() would
  see empty/incomplete output files and create an empty BAM instead of waiting
- Additionally, input FASTQ files were being deleted (os.unlink) while cd-hit-dup
  was still running in background, causing "File openning failed" errors
  (cd-hit-dup reads input files twice)

Solution:
- Changed cd-hit-dup execution from background=True to background=False
- This ensures cd-hit-dup completes and output files are fully written
  before FastqToSamTool tries to read them
- Input files are only deleted after cd-hit-dup is completely done

Impact:
- Fixes test_cdhit_canned_input which was failing with 0 reads instead of 1772
- No functional change to workflow, just proper synchronization
- Slightly slower (blocking) but correct behavior

The previous background=True approach was unsafe and only worked by accident
when Picard would retry/wait for files to exist. The defensive empty-file
handling in FastqToSamTool exposed this pre-existing race condition.
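Schematically, the corrected ordering looks like this (illustrative paths and flags; the real call goes through the tool wrapper):

import os
import subprocess

in_fq1, in_fq2 = "reads_1.fastq", "reads_2.fastq"
out_fq1, out_fq2 = "dedup_1.fastq", "dedup_2.fastq"
# blocking call: returns only after cd-hit-dup finishes writing its outputs
subprocess.check_call(["cd-hit-dup", "-i", in_fq1, "-i2", in_fq2,
                       "-o", out_fq1, "-o2", out_fq2])
# safe to delete inputs only now; cd-hit-dup reads its inputs twice
os.unlink(in_fq1)
os.unlink(in_fq2)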
Created test_illumina_splitcode.py with focused tests for run_splitcode_on_pool()
and supporting functions. These tests validate our assumptions about splitcode's
behavior rather than testing splitcode itself (which is an external tool).

Tests cover:
1. Basic splitcode execution on BAM input with inline barcodes
2. Summary JSON location and format ({pool_id}_summary.json)
3. JSON structure (tag_qc array with tag/count/distance fields)
4. Empty barcode outputs (barcodes with zero matching reads)
5. Output file locations and naming conventions
6. Barcode trimming behavior (left=1 removes barcode from R1)
7. append_run_id parameter and JSON file lookup logic

These tests focus on the integration points between our code and splitcode:
- Config file format we generate
- Keep file format (simplified - complex cases deferred)
- JSON parsing assumptions in create_splitcode_lookup_table
- File path conventions we rely on in splitcode_demux workflow

Status: Tests created but not yet fully passing - keep file format needs
investigation. The core JSON format tests are the most critical and should
work once keep file format is corrected.
- Created test_illumina_splitcode.py with 6 test cases validating splitcode behavior
- Tests focus on validating our assumptions about splitcode (not testing the external tool itself)
- Test coverage includes:
  * Basic splitcode execution and JSON output format (test_run_splitcode_on_pool_basic)
  * Handling of empty barcode outputs (test_run_splitcode_on_pool_empty_output)
  * JSON structure validation (test_splitcode_json_output_format)
  * Output file location conventions (test_splitcode_output_file_locations)
  * Barcode trimming behavior (test_splitcode_barcode_trimming)
  * append_run_id parameter handling (test_splitcode_with_append_run_id)

Key validations:
- Summary JSON location: {out_demux_dir_path}/{pool_id}_summary.json
- JSON structure: has 'tag_qc' array with tag/count/distance fields
- tag_qc has multiple entries per tag (one for each distance level 0-3)
- Output FASTQ naming: {output_prefix}_R1.fastq, {output_prefix}_R2.fastq
- Splitcode config format: FILE_NUMBER:START_BP:END_BP (e.g., 0:0:8 for 8bp barcode in R1)
- ID convention matches illumina.py: f"{sample_library_id}_R1"

All tests pass.
Added detailed inline documentation explaining splitcode behavior and file formats:

1. run_splitcode_on_pool() function:
   - Added comprehensive docstring describing all parameters, return values, and outputs
   - Documented splitcode config file format (FILE_NUMBER:START_BP:END_BP)
   - Documented keep file format and ID matching requirements
   - Explained tag_qc JSON structure (multiple entries per tag, one per distance level)
   - Referenced test_illumina_splitcode.py for detailed examples

2. Splitcode config file generation:
   - Added detailed comment block explaining each config column
   - Documented that locations format is FILE:START:END (e.g., 0:0:8 for 8bp barcode)
   - Clarified that endpoint is exclusive (Python-like slicing)
   - Explained left/right trimming options

3. Splitcode keep file generation:
   - Added comment block with format specification (TSV, no header)
   - Provided example showing barcode_id must match config file ID
   - Emphasized the critical requirement that IDs must match between files
   - Documented output filename pattern

4. Summary JSON parsing in create_splitcode_lookup_table():
   - Added critical comment explaining tag_qc structure
   - Provided example JSON showing multiple distance levels per tag
   - Documented that we filter to distance=0 (perfect) and distance=1 (1-mismatch)
   - Explained this is important for any code reading these JSON files

These comments document key learnings from test development and should help
future developers understand splitcode's behavior and file format requirements.
Added detailed module-level documentation serving as a quick reference for
splitcode file formats and behavior:

- Explains the purpose of these tests (validating assumptions vs testing the tool)
- Documents config file format with all columns explained
- Documents keep file format and ID matching requirement
- Provides concrete example of summary JSON tag_qc structure
- Emphasizes critical gotcha: tag_qc has multiple entries per tag (one per distance level)
- Lists all expected output files and their naming conventions

This docstring serves as a quick reference guide for developers working with
splitcode integration, consolidating key learnings from test development.
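As a concrete illustration of the formats described above (made-up values; column set simplified relative to what splitcode accepts):

# config file (TSV with header): locations 0:0:8 means file 0 (R1),
# bases 0 through 8 with an exclusive endpoint, i.e. an 8bp inline barcode
tags        ids       locations
ACGTACGT    Sample1   0:0:8
TTGGCCAA    Sample2   0:0:8

# keep file (TSV, no header): barcode ID (must match the config file ID),
# then output prefix; with --keep-r1-r2, splitcode appends _R1/_R2 itself
Sample1     outdir/Sample1
Sample2     outdir/Sample2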
- Created generate_splitcode_config_and_keep_files() function to handle
  config and keep file generation for a single pool
- This separates pure text file transformation logic from the complex
  splitcode_demux orchestration function
- Added 9 comprehensive unit tests covering:
  * Basic single/multi-pool scenarios
  * Variable barcode lengths (4bp, 8bp, 10bp)
  * Hamming distance parameter (0, 1, 2)
  * R1 trimming with/without extra bp
  * Pool filtering (multi-pool sample sheets)
  * Config/keep ID matching validation
  * Output path construction
  * Empty pool error handling
  * Complex realistic scenarios with full IDs
- Refactored splitcode_demux() to use new function (137 lines -> 40 lines)
- All existing tests continue to pass

This makes the config/keep generation logic much easier to test and debug,
addressing a major gap in test coverage for the splitcode demux code path.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
When splitcode summary JSON files are missing or malformed, we now provide
comprehensive debugging information to help diagnose the issue faster.

Changes to illumina.py (create_splitcode_lookup_table):
- Check if JSON file exists before attempting to open
- If missing: Log directory contents (JSON files first, then others)
- If malformed: Log JSON parse error + first 10 lines of file
- If multiple matches: Warn about ambiguous files
- Always re-raise exception with full context

Added 2 test cases:
- test_missing_json_file_provides_debugging_info: Validates directory listing
- test_malformed_json_provides_file_preview: Validates JSON parse error handling

This addresses a common pain point where splitcode runs but the summary JSON
is not found in the expected location, making it hard to debug what went wrong.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
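A sketch of the lookup-and-debug flow this commit adds (simplified; log messages are illustrative):

import json
import logging
import os

def load_summary_json(path):
    if not os.path.isfile(path):
        # list directory contents, JSON files first, to aid debugging
        siblings = sorted(os.listdir(os.path.dirname(path) or "."),
                          key=lambda f: (not f.endswith(".json"), f))
        logging.error("missing %s; directory contains: %s", path, siblings)
        raise FileNotFoundError(path)
    try:
        with open(path) as f:
            return json.load(f)
    except json.JSONDecodeError:
        with open(path) as f:
            preview = [next(f, "") for _ in range(10)]
        logging.error("malformed JSON in %s; first lines: %r", path, preview)
        raise  # re-raise with full context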
- Move all splitcode test classes from test_illumina_splitcode.py to test_illumina.py
- Add missing imports (shutil, json, tools.picard) to test_illumina.py
- Delete test_illumina_splitcode.py to follow project convention
- All 36 splitcode-related tests now pass in consolidated file

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Fixes FileNotFoundError when looking up splitcode summary JSON files in
workflows with append_run_id (e.g., flowcell IDs).

Root cause: Code was stripping run suffix from pool IDs before file lookup,
but splitcode creates files with the FULL pool ID including the suffix.

Changes:
- illumina.py: Remove incorrect suffix-stripping logic in create_splitcode_lookup_table()
- test_illumina.py: Update test_append_run_id to use correct filename with suffix

Before: Looked for AGCGTGTT-CTTCGCAA.l1_summary.json
After:  Looks for AGCGTGTT-CTTCGCAA.l1.AAGCG75M5.1_summary.json

This matches what splitcode actually creates and fixes all failing three-barcode
demux workflows in Terra workspace gcid-viral-y6/Swift-RNA-Seq.

All 36 splitcode tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Integration test:
- Add TestSplitcodeDemuxIntegration class with end-to-end test
- Create minimal test data (RunInfo.xml, SampleSheet.tsv)
- Test verifies splitcode_demux runs without errors
- Test creates synthetic pool BAM with inline barcodes
- Verifies output BAM files are created for each sample

Bug fixes (discovered during testing):
- Fix dtype bugs in plot_read_counts(): Convert numeric columns after reading CSV
- Fix dtype bugs in plot_sorted_curve(): Convert numeric columns after reading CSV
- Fix dtype bugs in write_barcode_metrics_for_pools(): Convert numeric columns

Root cause: CSV files were read with dtype=str, but plotting/metrics functions
performed arithmetic operations (sum, division) on string columns, causing
TypeError. Solution: Explicitly convert num_reads_* columns to numeric after
reading CSVs while keeping other columns as strings.

All splitcode tests pass (37 total including new integration test).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
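The dtype fix amounts to re-typing the count columns after a dtype=str read, along these lines (illustrative path; the num_reads_* column prefix is from the commit message):

import pandas as pd

df = pd.read_csv("barcode_metrics.csv", dtype=str)
# keep identifier columns as strings, but make read counts numeric
# so downstream sums and divisions don't raise TypeError
for col in df.columns:
    if col.startswith("num_reads"):
        df[col] = pd.to_numeric(df[col])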
This commit fixes critical issues in the splitcode demultiplexing workflow
and adds comprehensive integration testing.

Key fixes:
- Remove hardcoded --no-output flag from splitcode wrapper that prevented
  --keep files from being written (tools/splitcode.py)
- Fix config/keep file format to work correctly with --keep-r1-r2 mode
  (removed _R1 suffix from barcode IDs in illumina.py)
- Change splitcode temp directory to subdirectory of outDir to ensure
  proper file persistence through BAM conversion (illumina.py)
- Add dummy --output files to satisfy splitcode requirements while using
  --keep mode for sample-specific outputs (illumina.py)

Test enhancements:
- Add comprehensive integration test for splitcode_demux command with
  read count verification and BAM content validation (test_illumina.py)
- Create helper methods to generate test input BAMs with inline barcodes
  and expected output BAMs after demux (test_illumina.py)
- Enhance assert_equal_bam_reads to show first 5 reads from each BAM
  when comparison fails for easier debugging (test/__init__.py)
- Add test data with simplified barcode sequences (TestSplitcodeDemuxIntegration/)

The test now successfully validates:
- Correct demultiplexing based on inline barcodes
- Proper read count preservation (100 reads for Sample1, 50 for Sample2)
- Correct barcode trimming (8bp inline barcode removed from R1)
- Accurate read group metadata in output BAMs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Updated tests to expect barcode IDs without _R1 suffix, which is the
correct format when using splitcode's --keep-r1-r2 mode. The --keep-r1-r2
flag instructs splitcode to automatically add _R1/_R2 suffixes to output
filenames, so the config and keep files should use IDs without the suffix.

Tests fixed:
- test_basic_single_pool
- test_config_keep_id_matching
- test_complex_realistic_scenario

This change aligns with the fix in generate_splitcode_config_and_keep_files
that was necessary for the splitcode demux integration to work correctly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Splitcode 0.31.4 no longer strips carriage returns from IDs in config/keep files,
causing output files to be created with \r in filenames (e.g., Sample1\r_R1.fastq
instead of Sample1_R1.fastq). This prevented the integration test from finding
the output files.

The root cause was Python's csv.writer using \r\n as the default line terminator.
Fixed by explicitly setting lineterminator='\n' when creating the TSV writers for
both config and keep files.

Changes:
- Set lineterminator='\n' in csv.writer calls for config and keep files
- This ensures Unix-style line endings compatible with splitcode 0.31.4

Fixes integration test: test_splitcode_demux_basic now passes with splitcode 0.31.4

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>
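The fix itself is a one-liner per writer (illustrative path and row):

import csv

with open("pool.config.tsv", "w") as f:
    # csv.writer defaults to '\r\n'; splitcode 0.31.4 keeps the '\r' in IDs,
    # so force Unix-style line endings explicitly
    writer = csv.writer(f, delimiter="\t", lineterminator="\n")
    writer.writerow(["ACGTACGT", "Sample1", "0:0:8"])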
Fixes Terra workflow failures where splitcode summary JSON files could not be
found due to mismatched pool identifiers.

Root cause: create_splitcode_lookup_table() was re-reading the sample sheet
from disk and independently reconstructing pool IDs (muxed_run), which could
differ from the pool IDs computed by SampleSheet.inner_demux_mapper(). This
caused FileNotFoundError when searching for splitcode summary JSON files.

Example:
- Expected: AGCGTGTT-CTTCGCAA.l1_summary.json
- Actual:   AGCGTGTT-CTTCGCAA.l1.AAGCG75M5.1_summary.json

Changes:
1. Modified create_splitcode_lookup_table() to accept either a file path
   (legacy behavior) or DataFrame (new behavior)
2. Updated splitcode_demux() to pass inner_demux_barcode_map_df DataFrame
   directly instead of re-reading the sample sheet
3. Added library_id_per_sample to inner_demux_mapper() output columns

This ensures a single source of truth for pool IDs throughout the workflow,
eliminating the pool ID mismatch issue. All 91 unit tests pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
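The path-or-DataFrame dispatch in change 1 can be as simple as this sketch (simplified signature; the real function takes more parameters):

import pandas as pd

def create_splitcode_lookup_table(samplesheet):
    # accept either a file path (legacy) or an in-memory DataFrame (new),
    # so pool IDs come from a single upstream source of truth
    if isinstance(samplesheet, pd.DataFrame):
        df = samplesheet
    else:
        df = pd.read_csv(samplesheet, sep="\t", dtype=str)
    ...  # downstream lookup-table construction elided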
@dpark01 dpark01 marked this pull request as ready for review November 8, 2025 00:18