Add demultiplexing of 3-barcode sheets with splitcode #119
base: master
Conversation
add a wrapper for Picard AddCommentsToBam to picard.py, `AddCommentsToBamTool()`, used to add comment lines (`@CO`) to a BAM file
add `collapse_dup_strs_to_str_or_md5()` to `util/misc.py`, which collapses a list of strings into a single string: either the one unique value (when all inputs are duplicates), a delimiter-joined version of the inputs, or, if the joined string would be too long, a truncated MD5 hash of the joined inputs (without delimiters). A suffix can optionally be appended (e.g. `suffix="_muxed"`).
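A simplified sketch of the collapsing behavior described here (parameter names follow the review comments further down; the defaults and exact signature are illustrative assumptions, not the actual `util/misc.py` code, and the length guard incorporates the reviewer's suggested fix):

```python
import hashlib

def collapse_dup_strs_to_str_or_md5(strs, delimiter="-", suffix="_muxed",
                                    hash_if_longer_than=32):
    """Collapse a list of strings to one representative string (sketch)."""
    unique_vals = list(dict.fromkeys(strs))  # drop duplicates, keep order
    if len(unique_vals) == 1:
        return unique_vals[0]  # all inputs were duplicates of one value
    # delimited join of the sorted non-empty values, plus the optional suffix
    joined_delimited = delimiter.join(sorted(s for s in unique_vals if s)) + suffix
    if hash_if_longer_than < 0 or len(joined_delimited) <= hash_if_longer_than:
        return joined_delimited
    # too long: truncated MD5 of the values joined without delimiters
    digest = hashlib.md5("".join(unique_vals).encode()).hexdigest()
    return digest[-8:] + suffix
```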
…llumina.py::illumina_demux() via --collapse_duplicated_barcodes
Add collapsing of barcode or barcode-pair duplicates as part of `illumina.py::main_illumina_demux()` via the `--collapse_duplicated_barcodes` flag. Deduplication is keyed on (`barcode_1`) for single-end Illumina indexing, or (`barcode_1`, `barcode_2`) for paired-end Illumina indexing, with each group of duplicate indexes collapsing to a single row prior to demultiplexing via Picard IlluminaBasecallsToSam. The deduplication happens within the `SampleSheet()` class after the sample sheet is read in (with future format flexibility in mind).
Within each deduplicated group, the `barcode_[1-2]` columns are passed through verbatim. Other columns in the sample sheet are either reported verbatim, when only one distinct value exists within the group, or set to a single string representing all values within the group. That single string is the delimited join of the member values, optionally followed by a supplied suffix (`_muxed` by default); if such a string would exceed a specified length limit, it is instead a truncated MD5 hash of the joined member values (without delimiting characters), followed by the optional suffix. This collapsing is implemented in `util/misc.py::collapse_dup_strs_to_str_or_md5()`. The intent is for the new column values of each deduplicated group to reflect the pre-deduplication values, so that changes in upstream sample sheet construction manifest in a way that allows (some) comparison between deduplicated sample sheets. In a mixed pool, where some barcodes or barcode pairs are duplicated but others are distinct, the distinct rows are passed through without modification. (A sketch of the per-group collapse follows below.)
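The per-group collapse, sketched with pandas and the helper sketched above. Whether the real `SampleSheet()` code uses pandas is an assumption; the function name and the group-handling details here are illustrative:

```python
import pandas as pd

def collapse_duplicated_barcodes(df, paired=True):
    """Collapse sample sheet rows that share a barcode (pair) into one row."""
    keys = ["barcode_1", "barcode_2"] if paired else ["barcode_1"]
    collapsed_rows = []
    for _, group in df.groupby(keys, sort=False):
        if len(group) == 1:
            collapsed_rows.append(group.iloc[0])  # distinct row: pass through unmodified
            continue
        row = group.iloc[0].copy()
        for col in df.columns:
            if col not in keys:  # barcode columns pass through verbatim
                row[col] = collapse_dup_strs_to_str_or_md5(group[col].tolist())
        collapsed_rows.append(row)
    return pd.DataFrame(collapsed_rows).reset_index(drop=True)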
…function to SampleSheet() class
WIP adapting splitcode_demux; added an `inner_demux_mapper()` convenience function to the `SampleSheet()` class to more easily map from original sample IDs to collapsed (and other) values. Refactoring is anticipated for the new functions added on this branch.
… "run" and "muxed_run" inputs and values in inner_demux_mapper()
…on for plotting; refactoring in various places
WIP: splitcode demux working (again, but generalized); stats aggregation for plotting; refactoring in various places. Additional cleanup needed: conversion from FASTQ to BAM needed on the output side; rendering of metrics in Picard style needed.
The SampleSheet class now has a function, `rev_comp_barcode_values(barcode_columns_to_revcomp=None, inplace=True)`, which manipulates stored rows of the supplied sample sheet to reverse-complement sequences in the specified columns. This function can be called once a sample sheet has been loaded. Barcodes can also be rev-comped when a SampleSheet object is initialized, by passing column names to `barcode_columns_to_revcomp`. The most common case is to rev-comp `barcode_2`, and the sample sheet init function allows `rev_comp_barcode_2=True` to do this without passing the column name explicitly to `barcode_columns_to_revcomp`. The `illumina_demux` and `splitcode_demux` functions now expose a CLI parameter, `--rev_comp_barcodes_before_demux`: if this flag is passed without values, the default `barcode_2` column is rev-comped; if values are passed, the specified columns are rev-comped; if the flag is not passed, no columns are rev-comped. Related unit tests are included in `test/unit/test_illumina.py`. An efficient `reverse_complement()` function has been added to `util/misc.py`. The file `util/cmd.py` has been extended to offer a custom argparse action, `storeMultiArgsOrFallBackToConst`, which enables the aforementioned behavior of `--rev_comp_barcodes_before_demux` (`None` if the flag is omitted, the const default value if only the flag is passed, one or more values if the flag and values are passed). A sketch of such an action follows below.
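A minimal sketch of an argparse action with the three-way behavior described above. The class name and the flag semantics come from this PR; the body is an assumption, not the actual `util/cmd.py` implementation:

```python
import argparse

class storeMultiArgsOrFallBackToConst(argparse.Action):
    """Store supplied values, or fall back to const when the flag is bare."""
    def __call__(self, parser, namespace, values, option_string=None):
        # nargs="*" yields [] when the flag is given with no values
        setattr(namespace, self.dest, values if values else self.const)

parser = argparse.ArgumentParser()
parser.add_argument("--rev_comp_barcodes_before_demux",
                    action=storeMultiArgsOrFallBackToConst,
                    nargs="*", const=["barcode_2"], default=None)

# flag omitted -> None; bare flag -> const; flag with values -> those values
assert parser.parse_args([]).rev_comp_barcodes_before_demux is None
assert parser.parse_args(
    ["--rev_comp_barcodes_before_demux"]).rev_comp_barcodes_before_demux == ["barcode_2"]
assert parser.parse_args(
    ["--rev_comp_barcodes_before_demux", "barcode_1"]).rev_comp_barcodes_before_demux == ["barcode_1"]
```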
…yle metrics file; re-enable splitcode code path
Add `convert_splitcode_demux_metrics_to_picard_style` to emit a Picard-style metrics file; re-enable the splitcode code path I had bypassed during development (re-using old splitcode results).
…r; and for picard fq -> ubam, one fq pair per worker
…ng from splitcode demux; minor alignment changes for readability
Add a comment in the description tag of the read group header for RGs originating from splitcode demux; see https://samtools.github.io/hts-specs/SAMv1.pdf#page=4
…ret the value as an int if the sample sheet only has numbers in the library_id_per_sample column
… Dockerfile for creating environment
…ode demux
Refactor tabular reporting output of read counts following inner-barcode demux; move CSV creation logic from `plot_sorted_curve()` to a new function, `write_barcode_metrics_for_pools()`.
Pull Request Overview
This WIP pull request integrates SwiftSeq demux functionality and makes various code quality improvements. The changes include adding new tools for barcode demultiplexing, updating dependencies, improving concurrency patterns, and enhancing utility functions.
Key changes:
- Adds SplitCode tool integration for barcode demultiplexing
- Updates various dependencies and Python version requirements
- Improves error handling and switches from ThreadPoolExecutor to ProcessPoolExecutor
- Adds utility functions for string manipulation and reverse complement operations
Reviewed Changes
Copilot reviewed 19 out of 21 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| util/misc.py | Adds string collapsing, MD5 hashing, and reverse complement utility functions |
| util/file.py | Updates SQLite configuration and improves error handling |
| util/cmd.py | Adds custom argparse action for multiple arguments with fallback |
| tools/splitcode.py | New tool wrapper for splitcode barcode demultiplexing |
| tools/picard.py | Adds new Picard tools and improves parameter formatting |
| tools/bwa.py | Changes ThreadPoolExecutor to ProcessPoolExecutor |
| test files | Adds comprehensive tests for new barcode handling functionality |
| requirements files | Updates dependency versions and Python requirements |
| Docker/CI files | Updates build configurations and documentation |
```python
joined_values_delimited = f"{delimiter}".join(sorted([s for s in unique_vals if len(s)>0]) if sort_plural_vals else unique_vals) + (suffix if append_suffix_to_delimited_str else "")

if len(joined_values_delimited) <= int(hash_if_longer_than):
```
Copilot AI (Aug 22, 2025): The comparison logic is incorrect. When hash_if_longer_than is -1 (the default), this condition will never be true since string length cannot be <= -1. This means the function will always return the MD5 hash instead of the delimited string when using default parameters.
Suggested change:
```python
if hash_if_longer_than < 0 or len(joined_values_delimited) <= int(hash_if_longer_than):
```
```python
    https://docs.python.org/3.12/library/tarfile.html#extraction-filters
    '''
    log.debug("Setting extraction filter for tarfile module to tar_filter")
    tarfile.TarFile.extraction_filter = staticmethod(tarfile.tar_filter) if hasattr(tarfile, 'data_filter') else (lambda member, path: member)
```
Copilot AI (Aug 22, 2025): The condition checks for hasattr(tarfile, 'data_filter') but should check for hasattr(tarfile, 'tar_filter') since that's the function being used. This could result in using the fallback lambda when tar_filter is actually available.
Suggested change:
```python
    tarfile.TarFile.extraction_filter = staticmethod(tarfile.tar_filter) if hasattr(tarfile, 'tar_filter') else (lambda member, path: member)
```
```python
# Otherwise, return joined values
# or if that would be too long, return (MD5 of joined values)[-8:]+(optional suffix)
# the input values are optionally sorted after removing empty strings
joined_values_delimited = f"{delimiter}".join(sorted([s for s in unique_vals if len(s)>0]) if sort_plural_vals else unique_vals) + (suffix if append_suffix_to_delimited_str else "")
```
Copilot AI (Aug 22, 2025): When sort_plural_vals is True, empty strings are filtered out, but when False, unique_vals is used directly which may contain empty strings. This inconsistency could cause different behavior in the joined string depending on the sort flag.
Suggested change:
```python
filtered_vals = [s for s in unique_vals if len(s) > 0]
joined_values_delimited = f"{delimiter}".join(sorted(filtered_vals) if sort_plural_vals else filtered_vals) + (suffix if append_suffix_to_delimited_str else "")
```
```python
    super(SplitCodeTool, self).__init__(install_methods=install_methods)

    def _get_tool_version(self):
        self.tool_version = subprocess.check_output([self.install_and_get_path(), '--version']).decode('UTF-8').strip()
```
Copilot AI (Aug 22, 2025): This line could raise subprocess.CalledProcessError if splitcode returns a non-zero exit code for --version. Consider adding error handling to provide a more informative error message.
Suggested change:
```python
    def _get_tool_version(self):
        try:
            self.tool_version = subprocess.check_output([self.install_and_get_path(), '--version']).decode('UTF-8').strip()
        except subprocess.CalledProcessError as e:
            log.error(f"Failed to get {TOOL_NAME} version. Command output: {e.output.decode('UTF-8') if e.output else 'No output'}. Return code: {e.returncode}")
            raise RuntimeError(f"Could not determine {TOOL_NAME} version. Please check that {TOOL_NAME} is installed and working.") from e
```
```python
else:
    # If the joined string is too long, compute MD5 of joined values
    #joined_values = f"".join(sorted(unique_vals)) + (suffix if calculate_md5_including_suffix else "")
    joined_values = f"".join(unique_vals) + (suffix if calculate_md5_including_suffix else "")
```
Copilot AI (Aug 22, 2025): This line joins all values including potentially empty strings, which differs from the filtered approach used in line 102. Consider using consistent filtering logic for both the delimited string and MD5 calculation.
Suggested change:
```python
    joined_values = f"".join([s for s in unique_vals if len(s) > 0]) + (suffix if calculate_md5_including_suffix else "")
```
```python
    Returns the reverse complement using string.maketrans
    """
    table = bytearray.maketrans(b"ACTGN",b"TGACN")
    return bytearray(seq.encode("UTF8")).translate(table)[::-1].decode("UTF8")
```
Copilot AI (Aug 22, 2025): The translation table is recreated on every function call. Consider moving this to module level as a constant for better performance when the function is called frequently.
Suggested change:
```python
# Translation table for reverse complementing DNA sequences
REVCOMP_TRANSLATION_TABLE = bytearray.maketrans(b"ACTGN", b"TGACN")

def reverse_complement(seq):
    """
    Returns the reverse complement using string.maketrans
    """
    return bytearray(seq.encode("UTF8")).translate(REVCOMP_TRANSLATION_TABLE)[::-1].decode("UTF8")
```
Handle the case where splitcode processes 0 reads for a pool, which results in an empty tag_qc array in the summary JSON. Previously this would crash with KeyError: 'tag' when trying to create a DataFrame from the empty array and access non-existent columns.
The fix adds defensive handling in create_lut():
- Check if tag_qc is empty before processing
- When empty, create output with correct schema and 0 read counts
- When not empty, process normally as before
- Log a warning for pools with 0 reads
This allows the 3-barcode demultiplexing workflow to complete successfully even when some pools have no matching reads.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
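A minimal sketch of this defensive handling, assuming tag_qc entries carry the tag/count/distance fields mentioned later in this PR (the function name is illustrative; the real logic lives in create_lut, later renamed create_splitcode_lookup_table):

```python
import logging
import pandas as pd

def tag_qc_to_counts(tag_qc, pool_id):
    """Build a counts table from a splitcode summary's tag_qc array."""
    if not tag_qc:  # splitcode saw 0 reads for this pool
        logging.warning("pool %s had 0 reads; emitting empty table with correct schema", pool_id)
        return pd.DataFrame(columns=["tag", "count", "distance"])
    return pd.DataFrame(tag_qc)  # normal case: one row per tag/distance entry
```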
Fix three functions to handle cases where barcodes have zero reads:
1. plot_read_counts(): Check for positive values before setting log scale to avoid matplotlib warnings when all data is zero
2. plot_sorted_curve(): Handle division by zero when calculating fractions for pools with 0 total reads
3. write_barcode_metrics_for_pools(): Use ternary expressions to safely handle division by zero when calculating percentages
All functions now produce valid outputs and log warnings when encountering zero counts, allowing the 3-barcode demultiplexing workflow to complete successfully even when some barcodes/pools have no matching reads.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
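The guard in item 3 amounts to a one-line pattern like this (variable names are illustrative):

```python
# report 0% rather than dividing by zero for pools with no reads
pct_of_pool = (100.0 * num_reads / total_reads) if total_reads > 0 else 0.0
```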
… fix bugs
This commit adds extensive unit test coverage for the create_splitcode_lookup_table function (formerly create_lut) and fixes several bugs discovered during testing.
Test Coverage Added:
- New TestSplitcodeLookupTable class with 4 comprehensive test cases
- test_basic_single_pool: Validates full workflow with 3-sample pool
- test_zero_reads_pool: Tests edge case handling (empty splitcode output)
- test_multi_pool: Tests multiple pools in one run
- test_append_run_id: Tests flowcell ID suffix parameter
- All tests passing (4/4) in docker environment
Test Input Files:
- Created test/input/TestSplitcodeLookupTable/ with sample sheets and JSON files
- 3 sample sheets covering single pool, zero reads, and multi-pool scenarios
- 4 splitcode summary JSON files with realistic test data
Bugs Fixed:
1. Fixed append_run_id parameter causing JSON file lookup failures: pool names include the append_run_id suffix but JSON filenames don't; added logic to strip the suffix when looking up JSON files. Discovered by test_append_run_id.
2. Fixed splitcode DataFrame type handling: previously all columns were converted to string, breaking numeric operations; now only the tag column is converted to string, keeping count/distance numeric. Prevents NaN errors in downstream arithmetic.
Code Quality Improvements:
- Function renamed: create_lut → create_splitcode_lookup_table
- Added comprehensive NumPy-style docstring (60+ lines)
- Removed trivial read_json() wrapper function
- Fixed all pandas type coercion issues (dtype=str on all DataFrames)
- Removed defensive str() casting that was treating symptoms
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Problem:
- Picard's FastqToSam crashes on empty FASTQ inputs because it cannot
determine quality score encoding with zero reads
- Both fastq_to_bam (read_utils.py) and run_picard_fastq_to_ubam
(illumina.py) would fail when given empty input files
- Error: Cannot determine candidate qualities: no qualities found.
Solution:
1. Added FastqToSamTool.isFastqEmpty() static method:
- Checks if FASTQ file is semantically empty (no reads)
- Optimized for performance: files >1KB assumed non-empty without
reading contents
- Small files checked for whitespace-only content
2. Modified FastqToSamTool.execute() to handle empty inputs:
- If FASTQ2 is empty but FASTQ1 is not, treat as single-read input
- If FASTQ1 is empty, create valid empty BAM file using pysam:
* Extracts header metadata from picardOptions
* Creates proper BAM header with read group info
* Writes header-only BAM (zero reads)
* Bypasses Picard completely
3. Added test coverage:
- test_fastq_to_bam_empty_inputs() verifies empty FASTQ handling
- Tests that output BAM has valid header, zero reads, and is
readable by samtools
Benefits:
- Both fastq_to_bam and run_picard_fastq_to_ubam now safe for empty inputs
- Produces semantically correct empty BAM files
- SamtoolsTool.isEmpty() returns True as expected
- No Picard crash, proper error handling
Files modified:
- tools/picard.py: Add isFastqEmpty() and defensive code in execute()
- test/unit/test_read_utils.py: Add test_fastq_to_bam_empty_inputs()
All tests passing (2/2 in TestFastqBam).
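A rough sketch of the size-shortcut emptiness check described above; the actual static method on FastqToSamTool may differ in detail (e.g. in how compressed inputs are handled):

```python
import os

def is_fastq_empty(fastq_path, size_cutoff=1024):
    """True if the FASTQ is semantically empty (contains no reads)."""
    # performance shortcut: files larger than ~1KB are assumed to hold reads
    if os.path.getsize(fastq_path) > size_cutoff:
        return False
    # small file: treat as empty only if its content is whitespace-only
    with open(fastq_path) as f:
        return f.read().strip() == ""
```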
…output
Problem:
- rmdup_cdhit_bam() was calling cd-hit-dup with background=True, which returns immediately without waiting for the process to finish
- FastqToSamTool().execute() was called immediately after, trying to read cd-hit-dup's output files before they were fully written
- This created a race condition where FastqToSamTool.isFastqEmpty() would see empty/incomplete output files and create an empty BAM instead of waiting
- Additionally, input FASTQ files were being deleted (os.unlink) while cd-hit-dup was still running in the background, causing "File openning failed" errors (cd-hit-dup reads input files twice)
Solution:
- Changed cd-hit-dup execution from background=True to background=False
- This ensures cd-hit-dup completes and output files are fully written before FastqToSamTool tries to read them
- Input files are only deleted after cd-hit-dup is completely done
Impact:
- Fixes test_cdhit_canned_input which was failing with 0 reads instead of 1772
- No functional change to the workflow, just proper synchronization
- Slightly slower (blocking) but correct behavior
The previous background=True approach was unsafe and only worked by accident when Picard would retry/wait for files to exist. The defensive empty-file handling in FastqToSamTool exposed this pre-existing race condition.
Created test_illumina_splitcode.py with focused tests for run_splitcode_on_pool()
and supporting functions. These tests validate our assumptions about splitcode's
behavior rather than testing splitcode itself (which is an external tool).
Tests cover:
1. Basic splitcode execution on BAM input with inline barcodes
2. Summary JSON location and format ({pool_id}_summary.json)
3. JSON structure (tag_qc array with tag/count/distance fields)
4. Empty barcode outputs (barcodes with zero matching reads)
5. Output file locations and naming conventions
6. Barcode trimming behavior (left=1 removes barcode from R1)
7. append_run_id parameter and JSON file lookup logic
These tests focus on the integration points between our code and splitcode:
- Config file format we generate
- Keep file format (simplified - complex cases deferred)
- JSON parsing assumptions in create_splitcode_lookup_table
- File path conventions we rely on in splitcode_demux workflow
Status: Tests created but not yet fully passing - keep file format needs
investigation. The core JSON format tests are the most critical and should
work once keep file format is corrected.
- Created test_illumina_splitcode.py with 6 test cases validating splitcode behavior
- Tests focus on validating our assumptions about splitcode (not testing the external tool itself)
- Test coverage includes:
* Basic splitcode execution and JSON output format (test_run_splitcode_on_pool_basic)
* Handling of empty barcode outputs (test_run_splitcode_on_pool_empty_output)
* JSON structure validation (test_splitcode_json_output_format)
* Output file location conventions (test_splitcode_output_file_locations)
* Barcode trimming behavior (test_splitcode_barcode_trimming)
* append_run_id parameter handling (test_splitcode_with_append_run_id)
Key validations:
- Summary JSON location: {out_demux_dir_path}/{pool_id}_summary.json
- JSON structure: has 'tag_qc' array with tag/count/distance fields
- tag_qc has multiple entries per tag (one for each distance level 0-3)
- Output FASTQ naming: {output_prefix}_R1.fastq, {output_prefix}_R2.fastq
- Splitcode config format: FILE_NUMBER:START_BP:END_BP (e.g., 0:0:8 for 8bp barcode in R1)
- ID convention matches illumina.py: f"{sample_library_id}_R1"
All tests pass.
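For concreteness, a made-up summary JSON fragment consistent with the tag_qc structure these tests validate (one entry per tag per distance level; tag names and counts are invented):

```json
{
  "tag_qc": [
    {"tag": "Sample1", "count": 95, "distance": 0},
    {"tag": "Sample1", "count": 4,  "distance": 1},
    {"tag": "Sample1", "count": 1,  "distance": 2},
    {"tag": "Sample2", "count": 50, "distance": 0}
  ]
}
```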
Added detailed inline documentation explaining splitcode behavior and file formats:
1. run_splitcode_on_pool() function:
- Added comprehensive docstring describing all parameters, return values, and outputs
- Documented splitcode config file format (FILE_NUMBER:START_BP:END_BP)
- Documented keep file format and ID matching requirements
- Explained tag_qc JSON structure (multiple entries per tag, one per distance level)
- Referenced test_illumina_splitcode.py for detailed examples
2. Splitcode config file generation:
- Added detailed comment block explaining each config column
- Documented that locations format is FILE:START:END (e.g., 0:0:8 for 8bp barcode)
- Clarified that the endpoint is exclusive (Python-like slicing)
- Explained left/right trimming options
3. Splitcode keep file generation:
- Added comment block with format specification (TSV, no header)
- Provided example showing barcode_id must match config file ID
- Emphasized the critical requirement that IDs must match between files
- Documented output filename pattern
4. Summary JSON parsing in create_splitcode_lookup_table():
- Added critical comment explaining tag_qc structure
- Provided example JSON showing multiple distance levels per tag
- Documented that we filter to distance=0 (perfect) and distance=1 (1-mismatch)
- Explained this is important for any code reading these JSON files
These comments document key learnings from test development and should help future developers understand splitcode's behavior and file format requirements.
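As a purely illustrative config/keep pair consistent with the conventions described in this PR (locations are FILE:START:END with an exclusive endpoint, so 0:0:8 is the first 8bp of R1; left=1 trims the barcode from R1; keep-file IDs must match config IDs, without the _R1 suffix when --keep-r1-r2 is used). The column subset, sample names, and sequences here are invented:

```
# hypothetical splitcode config (TSV with header; illustrative column subset)
tags        ids       distances   locations   left
AAGGTTCC    Sample1   1           0:0:8       1
GGCCAATT    Sample2   1           0:0:8       1

# hypothetical keep file (TSV, no header): config ID -> output prefix;
# with --keep-r1-r2, splitcode itself appends _R1/_R2 to output filenames
Sample1     out_demux/Sample1
Sample2     out_demux/Sample2
```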
Added detailed module-level documentation serving as a quick reference for splitcode file formats and behavior:
- Explains the purpose of these tests (validating assumptions vs testing the tool)
- Documents config file format with all columns explained
- Documents keep file format and ID matching requirement
- Provides a concrete example of the summary JSON tag_qc structure
- Emphasizes the critical gotcha: tag_qc has multiple entries per tag (one per distance level)
- Lists all expected output files and their naming conventions
This docstring serves as a quick reference guide for developers working with the splitcode integration, consolidating key learnings from test development.
- Created generate_splitcode_config_and_keep_files() function to handle config and keep file generation for a single pool
- This separates pure text file transformation logic from the complex splitcode_demux orchestration function
- Added 9 comprehensive unit tests covering:
  * Basic single/multi-pool scenarios
  * Variable barcode lengths (4bp, 8bp, 10bp)
  * Hamming distance parameter (0, 1, 2)
  * R1 trimming with/without extra bp
  * Pool filtering (multi-pool sample sheets)
  * Config/keep ID matching validation
  * Output path construction
  * Empty pool error handling
  * Complex realistic scenarios with full IDs
- Refactored splitcode_demux() to use the new function (137 lines -> 40 lines)
- All existing tests continue to pass
This makes the config/keep generation logic much easier to test and debug, addressing a major gap in test coverage for the splitcode demux code path.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
When splitcode summary JSON files are missing or malformed, we now provide comprehensive debugging information to help diagnose the issue faster.
Changes to illumina.py (create_splitcode_lookup_table):
- Check if the JSON file exists before attempting to open
- If missing: log directory contents (JSON files first, then others)
- If malformed: log the JSON parse error plus the first 10 lines of the file
- If multiple matches: warn about ambiguous files
- Always re-raise the exception with full context
Added 2 test cases:
- test_missing_json_file_provides_debugging_info: Validates directory listing
- test_malformed_json_provides_file_preview: Validates JSON parse error handling
This addresses a common pain point where splitcode runs but the summary JSON is not found in the expected location, making it hard to debug what went wrong.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
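A sketch of this debugging aid under the behavior listed above (function name illustrative; the real code is inside create_splitcode_lookup_table in illumina.py):

```python
import json
import logging
import os

def load_summary_json(path):
    """Load a splitcode summary JSON, logging debugging context on failure."""
    if not os.path.exists(path):
        d = os.path.dirname(path) or "."
        # list directory contents with JSON files first, then everything else
        listing = sorted(os.listdir(d), key=lambda f: not f.endswith(".json"))
        logging.error("missing %s; directory contains: %s", path, listing)
        raise FileNotFoundError(path)
    try:
        with open(path) as f:
            return json.load(f)
    except json.JSONDecodeError:
        with open(path) as f:
            head = [next(f, "") for _ in range(10)]  # preview first 10 lines
        logging.error("malformed JSON in %s; first lines: %s", path, head)
        raise
```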
- Move all splitcode test classes from test_illumina_splitcode.py to test_illumina.py
- Add missing imports (shutil, json, tools.picard) to test_illumina.py
- Delete test_illumina_splitcode.py to follow project convention
- All 36 splitcode-related tests now pass in the consolidated file
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Fixes FileNotFoundError when looking up splitcode summary JSON files in workflows with append_run_id (e.g., flowcell IDs).
Root cause: the code was stripping the run suffix from pool IDs before file lookup, but splitcode creates files with the FULL pool ID including the suffix.
Changes:
- illumina.py: Remove incorrect suffix-stripping logic in create_splitcode_lookup_table()
- test_illumina.py: Update test_append_run_id to use the correct filename with suffix
Before: looked for AGCGTGTT-CTTCGCAA.l1_summary.json
After: looks for AGCGTGTT-CTTCGCAA.l1.AAGCG75M5.1_summary.json
This matches what splitcode actually creates and fixes all failing three-barcode demux workflows in Terra workspace gcid-viral-y6/Swift-RNA-Seq.
All 36 splitcode tests pass.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
…nctions
Integration test:
- Add TestSplitcodeDemuxIntegration class with end-to-end test
- Create minimal test data (RunInfo.xml, SampleSheet.tsv)
- Test verifies splitcode_demux runs without errors
- Test creates a synthetic pool BAM with inline barcodes
- Verifies output BAM files are created for each sample
Bug fixes (discovered during testing):
- Fix dtype bugs in plot_read_counts(): Convert numeric columns after reading CSV
- Fix dtype bugs in plot_sorted_curve(): Convert numeric columns after reading CSV
- Fix dtype bugs in write_barcode_metrics_for_pools(): Convert numeric columns
Root cause: CSV files were read with dtype=str, but plotting/metrics functions performed arithmetic operations (sum, division) on string columns, causing TypeError.
Solution: Explicitly convert num_reads_* columns to numeric after reading CSVs while keeping other columns as strings.
All splitcode tests pass (37 total including the new integration test).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
This commit fixes critical issues in the splitcode demultiplexing workflow and adds comprehensive integration testing.
Key fixes:
- Remove hardcoded --no-output flag from the splitcode wrapper that prevented --keep files from being written (tools/splitcode.py)
- Fix config/keep file format to work correctly with --keep-r1-r2 mode (removed _R1 suffix from barcode IDs in illumina.py)
- Change splitcode temp directory to a subdirectory of outDir to ensure proper file persistence through BAM conversion (illumina.py)
- Add dummy --output files to satisfy splitcode requirements while using --keep mode for sample-specific outputs (illumina.py)
Test enhancements:
- Add comprehensive integration test for the splitcode_demux command with read count verification and BAM content validation (test_illumina.py)
- Create helper methods to generate test input BAMs with inline barcodes and expected output BAMs after demux (test_illumina.py)
- Enhance assert_equal_bam_reads to show the first 5 reads from each BAM when comparison fails, for easier debugging (test/__init__.py)
- Add test data with simplified barcode sequences (TestSplitcodeDemuxIntegration/)
The test now successfully validates:
- Correct demultiplexing based on inline barcodes
- Proper read count preservation (100 reads for Sample1, 50 for Sample2)
- Correct barcode trimming (8bp inline barcode removed from R1)
- Accurate read group metadata in output BAMs
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Updated tests to expect barcode IDs without the _R1 suffix, which is the correct format when using splitcode's --keep-r1-r2 mode. The --keep-r1-r2 flag instructs splitcode to automatically add _R1/_R2 suffixes to output filenames, so the config and keep files should use IDs without the suffix.
Tests fixed:
- test_basic_single_pool
- test_config_keep_id_matching
- test_complex_realistic_scenario
This change aligns with the fix in generate_splitcode_config_and_keep_files that was necessary for the splitcode demux integration to work correctly.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
…eep files
Splitcode 0.31.4 no longer strips carriage returns from IDs in config/keep files, causing output files to be created with \r in filenames (e.g., Sample1\r_R1.fastq instead of Sample1_R1.fastq). This prevented the integration test from finding the output files.
The root cause was Python's csv.writer using \r\n as the default line terminator. Fixed by explicitly setting lineterminator='\n' when creating the TSV writers for both config and keep files.
Changes:
- Set lineterminator='\n' in csv.writer calls for config and keep files
- This ensures Unix-style line endings compatible with splitcode 0.31.4
Fixes integration test: test_splitcode_demux_basic now passes with splitcode 0.31.4
🤖 Generated with Claude Code
Co-Authored-By: Claude <[email protected]>
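The fix boils down to one keyword argument (the filename and row content here are hypothetical):

```python
import csv

# csv.writer defaults to '\r\n'; splitcode 0.31.4 keeps the '\r' as part of
# the ID, so force Unix-style line endings for the config/keep TSVs
with open("pool1_config.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t", lineterminator="\n")
    writer.writerow(["AAGGTTCC", "Sample1"])
```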
Fixes Terra workflow failures where splitcode summary JSON files could not be found due to mismatched pool identifiers.
Root cause: create_splitcode_lookup_table() was re-reading the sample sheet from disk and independently reconstructing pool IDs (muxed_run), which could differ from the pool IDs computed by SampleSheet.inner_demux_mapper(). This caused FileNotFoundError when searching for splitcode summary JSON files.
Example:
- Expected: AGCGTGTT-CTTCGCAA.l1_summary.json
- Actual: AGCGTGTT-CTTCGCAA.l1.AAGCG75M5.1_summary.json
Changes:
1. Modified create_splitcode_lookup_table() to accept either a file path (legacy behavior) or a DataFrame (new behavior)
2. Updated splitcode_demux() to pass the inner_demux_barcode_map_df DataFrame directly instead of re-reading the sample sheet
3. Added library_id_per_sample to inner_demux_mapper() output columns
This ensures a single source of truth for pool IDs throughout the workflow, eliminating the pool ID mismatch issue.
All 91 unit tests pass.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
This PR adds functionality to illumina_demux that enables handling of 3-barcode sample sheets used by the pooled cDNA-tagseq protocol, in which reads are demultiplexed in two stages: first a standard round of demultiplexing with a pair of outer barcodes, followed by a custom round that separates out reads based on inline barcodes. The first round is implemented with Picard, as before, and the second with a new tool called splitcode.