feat: calculate MD and NM tags #229

emmcauley · 2025-03-25T19:59:49Z

Closes #84. Adds an md attribute to the ReadEditInfo class and an n_as_match boolean flag to calculate_edit_info(). calculate_edit_info() is now a little more complex than we might like (11 > 10). I tried not to change any pre-existing code more than was necessary.

Some of the reads for testing were inherited from htsjdk.

⚠️ Once we converge on the style and the functionality in this PR, I can open another PR to address this part of the Github issue: Bonus points for adding another method that takes a iterator over Templates (and/or pysam.AlignedSegments) and a pysam.FastaFile and updates the records.

codecov · 2025-03-25T20:02:39Z

Codecov Report

Attention: Patch coverage is 89.09091% with 6 lines in your changes missing coverage. Please review.

Project coverage is 91.07%. Comparing base (0dd300f) to head (ef46fe7).
Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
fgpyo/sam/__init__.py	89.09%	3 Missing and 3 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #229      +/-   ##
==========================================
- Coverage   91.19%   91.07%   -0.12%     
==========================================
  Files          18       18              
  Lines        2351     2387      +36     
  Branches      356      365       +9     
==========================================
+ Hits         2144     2174      +30     
- Misses        135      138       +3     
- Partials       72       75       +3

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

tests/fgpyo/sam/test_sam.py

fgpyo/sam/__init__.py

…t for clarity, true up assertion style to be consistent

tests/fgpyo/sam/test_sam.py

… for n_as_match, expand docstring, defensive programming against invalid CIGAR ops

coderabbitai · 2025-04-11T20:19:14Z

Walkthrough

The changes introduce constants NO_QUERY_BASES and OK_BASES for query base handling and add an md attribute to ReadEditInfo for the MD tag string. The calculate_edit_info function gains a match_htsjdk boolean parameter to toggle base matching rules, returns None for unmapped or missing query bases, and is refactored to process CIGAR operations with boundary checks while constructing the MD tag. The test suite is expanded with new, renamed, and parameterized tests covering base matching logic, offsets, edge cases, and MD tag correctness.

Note

⚡️ AI Code Reviews for VS Code, Cursor, Windsurf

CodeRabbit now has a plugin for VS Code, Cursor and Windsurf. This brings AI code reviews directly in the code editor. Each commit is reviewed immediately, finding bugs before the PR is raised. Seamless context handoff to your AI code agent ensures that you can easily incorporate review feedback.
Learn more here.

Note

⚡️ Faster reviews with caching

CodeRabbit now supports caching for code and dependencies, helping speed up reviews. This means quicker feedback, reduced wait times, and a smoother review experience overall. Cached data is encrypted and stored securely. This feature will be automatically enabled for all accounts on May 16th. To opt out, configure Review - Disable Cache at either the organization or repository level. If you prefer to disable all data retention across your organization, simply turn off the Data Retention setting under your Organization Settings.
Enjoy the performance boost—your workflow just got faster.

✨ Finishing Touches

📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

fgpyo/sam/__init__.py (1)

1046-1069: Function signature and docstring updates
The signature now includes n_as_match and reference_offset, broadening functionality. The docstring is thorough, referencing SAM standards and htsjdk logic. Consider simplifying if the function grows more complex.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 44b8cea and 6dc63aa.

📒 Files selected for processing (2)

fgpyo/sam/__init__.py (4 hunks)
tests/fgpyo/sam/test_sam.py (4 hunks)

🧰 Additional context used

🪛 GitHub Check: codecov/patch

fgpyo/sam/__init__.py

[warning] 1130-1130: fgpyo/sam/init.py#L1130
Added line #L1130 was not covered by tests

🔇 Additional comments (19)

fgpyo/sam/__init__.py (6)

204-205: New constant for missing query bases
These lines clearly define a placeholder for records without query bases. Looks good and straightforward.

1033-1044: Added md attribute in ReadEditInfo
Documenting and implementing md in ReadEditInfo aligns well with SAM standards, ensuring we can store the MD tag. No issues spotted.

1082-1083: Return None for unmapped or missing query bases
Clean conditional check here, properly documented in the docstring. Good approach to handle incomplete data.

1085-1129: Core loop logic for alignments
This block meticulously accounts for each CIGAR operation type (N, I, S, D, M, X, EQ). The code is consistent with the docstring, handling offsets and boundaries carefully.

1130-1131: Uncovered error path
Static analysis indicates line 1130 is not tested. Consider adding a test that triggers this “Invalid CIGAR operation” condition to maintain coverage and ensure the error path is verified.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 1130-1130: fgpyo/sam/init.py#L1130
Added line #L1130 was not covered by tests

1132-1142: Finalizing MD edits
Appending the final match count and constructing ReadEditInfo looks correct. The md="".join(md_edits) call aligns with standard MD tag assembly.

tests/fgpyo/sam/test_sam.py (13)

8-8: Importing Optional
Straightforward addition for handling None usage. No issues noted.

655-659: Testing MD with no edits
Calling calculate_edit_info() with n_as_match=False and checking info.md == "30" ensures no mismatches. Good coverage for the simple case.

661-671: Test for zero offset
Verifies that the function properly handles reference offsets with no edits. Clear test name and correct assertion.

673-692: Test for non-zero offset edits
The ASCII art is helpful. Confirms mismatches are counted correctly when reference alignment is shifted. Thorough approach.

701-709: Mismatch plus insertion scenario
The MD tag ignoring insertions is correct per SAM spec. Tests confirm logic for mismatches and inserted bases. Nicely done.

718-725: Test with clipping and deletions
Ensures that soft-clips are ignored in MD calculations yet fully checks the deletion. The results align with expectations.

728-753: Ns alignment test
Parameterizing for n_as_match covers both ignoring or counting N→N as mismatches. This thoroughly tests ambiguous bases.

756-773: Consecutive mismatches
Test confirms consecutive fails from htsjdk data. The comparison to the original MD tag is a robust approach.

776-788: Soft clip at end
Verifies the correctness of the MD tag amid clips. Checking that n_as_match does not matter is a nice detail.

790-802: Skipped region
Validates how 2N is treated in the CIGAR. Confirming the MD tag matches the read’s tag is thorough.

804-818: Deletion out of bounds
Ensures the function doesn’t overreach when the deletion extends beyond reference. Good boundary check.

820-837: Matches out of bounds
Coverage for partial reference consumption. The MD string and NM tally look correct.

839-848: Missing query bases
Confirming that None is returned if the query bases are absent guarantees consistency with the function doc.

fgpyo/sam/__init__.py

tests/fgpyo/sam/test_sam.py

nh13 · 2025-04-12T03:19:51Z

fgpyo/sam/__init__.py

+        elif op not in (CigarOp.H, CigarOp.P):  # if we come across an unexpected CIGAR operation
+            raise ValueError(f"Invalid CIGAR operation: {op}")


Issue: I don't think these are errors.

Both hard clipping and padding operators consume neither reference bases nor read bases (see the table on page 8 in the spec), so they can be silently ignored. This is what the htsjdk implementation does, which we're following.

Can you add a test cases with cigars that have hard clips and padding operators (separately)?

Suggested change

elif op not in (CigarOp.H, CigarOp.P): # if we come across an unexpected CIGAR operation

raise ValueError(f"Invalid CIGAR operation: {op}")

I am so sorry, I read this as elif op in (CigarOp.H, CigarOp.P). Consider assert_never. Sorry again

I could not get assert_never() to work here. I reviewed it with Clint and it sounded like a potential rabbit hole?

elif op not in (CigarOp.H, CigarOp.P): # pragma: no cover assert_never(op) error: Argument 1 to "assert_never" has incompatible type "Literal[CigarOp.M, CigarOp.I, CigarOp.S, CigarOp.H, CigarOp.P, CigarOp.EQ, CigarOp.X, CigarOp.code, CigarOp.character, CigarOp.consumes_query, CigarOp.consumes_reference]"; expected "Never" [arg-type]

fgpyo/sam/__init__.py

tests/fgpyo/sam/test_sam.py

fgpyo/sam/__init__.py

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6dc63aa and 3aeb035.

📒 Files selected for processing (2)

fgpyo/sam/__init__.py (4 hunks)
tests/fgpyo/sam/test_sam.py (4 hunks)

🧰 Additional context used

🪛 GitHub Actions: test

fgpyo/sam/__init__.py

[error] 1051-1051: mypy error: X | Y syntax for unions requires Python 3.10

🔇 Additional comments (20)

fgpyo/sam/__init__.py (11)

204-205: LGTM: Added constant for missing query bases.

Clear constant to represent missing query bases in SAM.

1043-1043: LGTM: Added MD tag field to ReadEditInfo.

Simple attribute addition aligns with PR objectives.

1061-1064: LGTM: Clear explanation of NM tag calculation.

Docstring clearly references SAM spec and explains ambiguity handling.

1078-1078: Default for n_as_match should be False for SAM spec compliance.

The default correctly aligns with the SAM specification now.

1085-1086: LGTM: Early return for unmapped or missing query reads.

Appropriate early exit for cases where MD/NM calculation isn't possible.

1093-1096: LGTM: Added MD tag calculation variables.

Proper initialization of variables needed for MD tag computation.

1099-1116: LGTM: Match/mismatch handling with proper bounds checking.

Logic correctly handles matches/mismatches while tracking MD tag and preventing out-of-bounds access.

1106-1108: LGTM: Special handling for '=' character.

The implementation properly treats '=' as a match always, regardless of reference base.

1118-1131: LGTM: Deletion handling with bounds checking.

Deletion handling with proper MD tag formatting and bounds checking.

1138-1139: LGTM: Handling for ignored CIGAR operations.

Properly skips hard clips and padding operations without raising errors.

1151-1151: LGTM: Added MD tag to return value.

Correctly passes the calculated MD tag string to the ReadEditInfo constructor.

tests/fgpyo/sam/test_sam.py (9)

655-658: LGTM: Updated existing test with n_as_match parameter.

Test properly includes new parameter and MD tag assertion.

673-692: LGTM: Good test for nonzero offset with clear ASCII visualization.

Test properly validates offsets with detailed ASCII visualization of alignment.

728-753: LGTM: Parameterized test for ambiguous N bases.

Test validates both behaviors of n_as_match parameter with ambiguous bases.

756-773: LGTM: Test for consecutive mismatches.

Test verifies complex case with consecutive mismatches without matches at start.

790-803: LGTM: Tests for ignored CIGAR operations.

Tests verify that skipped, padded, and hard-clipped regions are handled correctly.

805-817: LGTM: Test for equals sign in query sequence.

Test confirms that '=' in query is always treated as match regardless of reference.

835-849: LGTM: Test for deletion extending beyond reference.

Test verifies proper handling of deletions that go out of bounds.

851-868: LGTM: Test for matches extending beyond reference.

Test confirms correct behavior when matches extend beyond reference bounds.

870-878: LGTM: Test for missing query bases.

Test confirms function returns None for reads with missing query bases.

fgpyo/sam/__init__.py

tests/fgpyo/sam/test_sam.py

emmcauley · 2025-05-13T20:57:12Z

fgpyo/sam/__init__.py

+        elif op not in (CigarOp.H, CigarOp.P):  # if we come across an unexpected CIGAR operation
+            raise ValueError(f"Invalid CIGAR operation: {op}")


I could not get assert_never() to work here. I reviewed it with Clint and it sounded like a potential rabbit hole?

elif op not in (CigarOp.H, CigarOp.P): # pragma: no cover assert_never(op) error: Argument 1 to "assert_never" has incompatible type "Literal[CigarOp.M, CigarOp.I, CigarOp.S, CigarOp.H, CigarOp.P, CigarOp.EQ, CigarOp.X, CigarOp.code, CigarOp.character, CigarOp.consumes_query, CigarOp.consumes_reference]"; expected "Never" [arg-type]

fgpyo/sam/__init__.py

tests/fgpyo/sam/test_sam.py

* set n_as_match default to False * expand docstring * re-order CIGAR op checks * add testing coverage for =, op.N, op.H, op.P

coderabbitai

Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 1

🛑 Comments failed to post (1)

fgpyo/sam/__init__.py (1)
1138-1140: 💡 Verification agent
🧩 Analysis chain

Verify error handling coverage.

The error handling for unexpected CIGAR operations isn't covered by tests.

🏁 Script executed:
#!/bin/bash
# Check if there are tests that verify error handling for invalid CIGAR operations

echo "Searching for tests that could trigger the ValueError for invalid CIGAR ops..."
rg "test_calc_edit_info.*invalid.*cigar" tests/

echo "Checking for any CIGAR operations not in the standard set..."
rg "CigarOp\.\w+" --no-filename | sort | uniq
Length of output: 1868
Add tests for invalid CIGAR operations

The ValueError branch for unexpected CIGAR ops in fgpyo/sam/__init__.py (lines 1138–1140) has no test coverage. Please add unit tests that:

Instantiate or mock a CIGAR string containing a non-standard operation (neither H nor P).

Call the function that processes CIGAR operations (e.g. calc_edit_info) and assert that it raises ValueError(f"Invalid CIGAR operation: {op}").

This will ensure error handling for invalid CIGAR codes is properly tested.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 1139-1139: fgpyo/sam/init.py#L1139
Added line #L1139 was not covered by tests
🤖 Prompt for AI Agents (early access)
In fgpyo/sam/__init__.py around lines 1138 to 1140, the ValueError raised for
invalid CIGAR operations lacks test coverage. Add unit tests that create or mock
a CIGAR string containing an invalid operation (not H or P), then call the
function processing CIGAR ops (such as calc_edit_info) and assert that it raises
the expected ValueError with the message "Invalid CIGAR operation: {op}". This
will verify the error handling branch is properly tested.

emmcauley · 2025-05-13T21:04:25Z

@nh13 can you please confirm this is the behavior you want? I changed the results of this test and I just want to be careful that this is the expected behavior:

def test_calc_edit_info_with_matches_out_of_bounds() -> None:
    """Assert expected behavior for matches that extends out of bounds of the read."""
    chrom = "AGTCCGTTA"
    builder = SamBuilder(r1_len=4)
    rec = builder.add_single(bases="TTAG", chrom="chr1", start=0, cigar="6D4M")

    info = sam.calculate_edit_info(rec=rec, reference_sequence=chrom, n_as_match=False)
    # Offset Ref: AGTCCGTTA
    #             xxxxxx|||.
    #      Query: ------TTAG
    assert info.mismatches == 0
    assert info.insertions == 0
    assert info.inserted_bases == 0
    assert info.deletions == 0
    assert info.deleted_bases == 0
    assert info.nm == 0
    assert info.md == "0^AGTCCG0"  # report AGTCCG deletion, 3 matches are out of bounds

fgpyo/sam/__init__.py

coderabbitai

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09fdd38 and ef46fe7.

📒 Files selected for processing (2)

fgpyo/sam/__init__.py (4 hunks)
tests/fgpyo/sam/test_sam.py (4 hunks)

🧰 Additional context used

🪛 GitHub Check: codecov/patch

fgpyo/sam/__init__.py

[warning] 1118-1118: fgpyo/sam/init.py#L1118
Added line #L1118 was not covered by tests

[warning] 1140-1140: fgpyo/sam/init.py#L1140
Added line #L1140 was not covered by tests

[warning] 1156-1156: fgpyo/sam/init.py#L1156
Added line #L1156 was not covered by tests

🔇 Additional comments (16)

fgpyo/sam/__init__.py (9)

204-206: Good addition of NO_QUERY_BASES constant.

Clearly defines the behavior when query bases are missing in a SAM record.

210-212: Appropriate definition of valid bases.

The OK_BASES includes "=" to properly handle SAM spec where equals is always considered a match regardless of reference base.

1049-1058: Helper function improves code clarity.

Good encapsulation of match logic into a separate function. The differentiation between SAM spec and htsjdk implementation is clear.

1061-1067: Function signature properly updated.

The renamed parameter match_htsjdk (replacing n_as_match) makes the intention clearer and the return type correctly handles the None case.

1070-1087: Improved docstring with detailed explanation.

Documentation effectively explains the difference between SAM spec and htsjdk implementations, and clarifies when None will be returned.

1102-1103: Early return handles edge cases.

Correctly returns None for unmapped reads or missing query bases, preventing downstream issues.

1135-1146: Deletion handling builds MD tag correctly.

The code appends the match count, adds the caret symbol, then appends the deleted reference bases. Note that line 1140-1141 (out-of-bounds handling) isn't tested.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 1140-1140: fgpyo/sam/init.py#L1140
Added line #L1140 was not covered by tests

1158-1159: MD tag finalization is correct.

Appending the final match count at the end of the function ensures complete MD tag generation.

1167-1169: Return value includes MD tag.

The ReadEditInfo object correctly includes the new MD tag.

tests/fgpyo/sam/test_sam.py (7)

651-678: Comprehensive test for base matching logic.

Thorough test case covering all important combinations of query and reference bases with both match_htsjdk settings.

785-803: Good test for consecutive mismatches.

Verifies MD tag construction for the challenging case of consecutive mismatches with no intervening matches.

820-833: Tests for ignored CIGAR operations.

Effectively tests handling of skipped regions, hard clips, and padding operations in CIGAR strings.

835-847: Test for '=' in query sequence.

Confirms that '=' in query sequence is always treated as a match regardless of the reference base.

865-879: Tests deletion extending out of bounds.

Good test for a deletion that goes beyond the reference boundaries.

881-898: Tests matches extending beyond reference boundaries.

This test is important for validating edge case handling where alignments extend past available reference sequence.

900-908: Tests for missing query bases.

Properly verifies both None and "*" cases for missing query sequences.

coderabbitai · 2025-05-15T20:33:36Z

fgpyo/sam/__init__.py

+        elif op == CigarOp.N:  # skipped region from ref, consumes ref
+            target_offset += elem.length
+        elif op not in (CigarOp.H, CigarOp.P):  # pragma: no cover
+            raise ValueError(f"Invalid CIGAR operation: {op}")


💡 Verification agent

🧩 Analysis chain

This error path needs test coverage.

The code path for invalid CIGAR operations isn't tested. While marked with pragma: no cover, consider adding a test to ensure proper error handling.

🏁 Script executed:

#!/bin/bash # Check if there are tests for invalid CIGAR operations echo "Searching for tests with invalid CIGAR operations..." rg -A 2 "Invalid CIGAR" tests/

Length of output: 147

🏁 Script executed:

#!/bin/bash # Show the context around the raise to confirm the pragma and function name sed -n '1140,1170p' fgpyo/sam/__init__.py

Length of output: 1161

Add test for invalid CIGAR branch

The ValueError at fgpyo/sam/init.py lines 1155–1156 (the branch marked # pragma: no cover) is untested. Please add a unit test that feeds an out-of-spec CIGAR operation and asserts that ValueError("Invalid CIGAR operation: …") is raised.

Location: fgpyo/sam/init.py, in the loop handling op not in (CigarOp.H, CigarOp.P)

Suggestion: in your test suite, create a dummy CIGAR element with an invalid op and use pytest.raises(ValueError) to verify the error message.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 1156-1156: fgpyo/sam/init.py#L1156
Added line #L1156 was not covered by tests

🤖 Prompt for AI Agents

In fgpyo/sam/__init__.py at line 1156, the ValueError raised for invalid CIGAR operations is not covered by tests. Add a unit test in the test suite that constructs a CIGAR element with an invalid operation code outside the allowed set, then use pytest.raises(ValueError) to assert that the error with the message "Invalid CIGAR operation: ..." is properly raised, ensuring this error path is tested.

coderabbitai · 2025-05-15T20:33:36Z

fgpyo/sam/__init__.py

+                    break  # out of bounds
+                query_base: str = rec.query_sequence[query_offset + in_block_offset].upper()


💡 Verification agent

🧩 Analysis chain

Out-of-bounds handling could use test coverage.

The code correctly breaks when matches extend beyond reference boundaries, but this branch is not covered by tests.

🏁 Script executed:

#!/bin/bash # Check if this code path is actually tested anywhere echo "Searching for tests that might cover the out-of-bounds match condition..." rg -A 2 "break # out of bounds" tests/

Length of output: 198

🏁 Script executed:

#!/bin/bash # Identify the function containing the out-of-bounds break and search tests for it FUNCTION=$(awk '/def /{func=$0} /break # out of bounds/{print func; exit}' fgpyo/sam/__init__.py | sed -E 's/def\s+([^(]+).*/\1/') echo "Function: $FUNCTION" echo "Searching for tests referencing $FUNCTION..." rg -n "$FUNCTION" -n tests/

Length of output: 68238

🏁 Script executed:

#!/bin/bash # Show context around the out-of-bounds break to identify the enclosing function rg -n "break # out of bounds" -C10 fgpyo/sam/__init__.py

Length of output: 2306

Add tests for out-of-bounds CIGAR handling

No existing tests hit the break # out of bounds branches in fgpyo/sam/init.py. Please add unit tests that:

Exercise a match block (e.g. CIGAR “NM” longer than reference tail) to hit line 1118

Exercise a deletion block longer than remaining reference to hit line 1139

Place them under tests/fgpyo/sam/, verifying that your MD/NM tag logic correctly stops at the reference end.

🧰 Tools

🪛 GitHub Check: codecov/patch

[warning] 1118-1118: fgpyo/sam/init.py#L1118
Added line #L1118 was not covered by tests

🤖 Prompt for AI Agents

In fgpyo/sam/__init__.py around lines 1118 to 1119, the code breaks out of a loop when matches extend beyond reference boundaries, but this branch lacks test coverage. Add unit tests under tests/fgpyo/sam/ that create scenarios with a match block (e.g., a CIGAR "NM" longer than the reference tail) to trigger the break at line 1118, and a deletion block longer than the remaining reference to trigger the break at line 1139. These tests should verify that the MD/NM tag logic correctly stops processing at the reference end.

nh13 · 2025-05-29T17:12:13Z

fgpyo/sam/__init__.py

 NO_QUERY_QUALITIES: array = qualitystring_to_array(STRING_PLACEHOLDER)
 """The quality array corresponding to an unavailable query quality string ("*")."""

+OK_BASES: set[str] = {"A", "C", "G", "T", "="}


suggestion: shall we make this private for now since it's only used here and not intended to be use by outside projects

Suggested change

OK_BASES: set[str] = {"A", "C", "G", "T", "="}

_OK_BASES: set[str] = {"A", "C", "G", "T", "="}

nh13 · 2025-05-29T17:34:51Z

fgpyo/sam/__init__.py

+            for in_block_offset in range(0, elem.length):
+                if (target_offset + in_block_offset) >= len(reference_sequence):
+                    break  # out of bounds
+            md_edits.append(f"{reference_sequence[target_offset:target_offset + elem.length]}")


chore: (1) I think we need to upper case the reference and (2) I don't think we need an f-string since reference sequence is already a string

Suggested change

md_edits.append(f"{reference_sequence[target_offset:target_offset + elem.length]}")

md_edits.append(reference_sequence[target_offset:target_offset + elem.length].upper())

nh13 · 2025-05-29T17:37:32Z

fgpyo/sam/__init__.py

+    target_offset: int = reference_offset if reference_offset is not None else rec.reference_start
+    cigar: Cigar = Cigar.from_cigartuples(rec.cigartuples)

    matches, mms, insertions, ins_bases, deletions, del_bases = 0, 0, 0, 0, 0, 0


chore: while we're here, can we update this variable to a longer form?

Suggested change

matches, mms, insertions, ins_bases, deletions, del_bases = 0, 0, 0, 0, 0, 0

matches, mismatches, insertions, ins_bases, deletions, del_bases = 0, 0, 0, 0, 0, 0

nh13 · 2025-05-29T17:40:56Z

fgpyo/sam/__init__.py

+            if target_offset < elem.length:
+                break


chore: I found another difference between htsjdk and samtools:

htsjdk will break before updating the NM value

samtools will break after updating the NM value

What do you think about updating deletions and del_bases if match_htsjdk is false?

If you agree, remember to update the docstring.

nh13 · 2025-05-29T17:42:31Z

fgpyo/sam/__init__.py

+        elif op not in (CigarOp.H, CigarOp.P):  # pragma: no cover
+            raise ValueError(f"Invalid CIGAR operation: {op}")
+
+    md_edits.append(str(current_match_count))


nit: f-strings are just a little faster

Suggested change

md_edits.append(str(current_match_count))

md_edits.append(f"{current_match_count}")

emmcauley force-pushed the em_calculate_md_and_nm_tags branch from 87136be to 62e8228 Compare March 25, 2025 20:00

clintval assigned clintval and znorgaard and unassigned clintval Mar 25, 2025

emmcauley force-pushed the em_calculate_md_and_nm_tags branch from f035577 to d625ecf Compare March 27, 2025 16:44

feat: add MD tag calculation and n_as_match bool flag

d4fc881

emmcauley force-pushed the em_calculate_md_and_nm_tags branch from d625ecf to 1e52794 Compare March 28, 2025 14:09

tests: add testing coverage for MD calculation with n_as_match flag

019c788

emmcauley force-pushed the em_calculate_md_and_nm_tags branch from 1e52794 to 019c788 Compare March 28, 2025 16:30

znorgaard requested changes Mar 28, 2025

View reviewed changes

tests/fgpyo/sam/test_sam.py Outdated Show resolved Hide resolved

tests/fgpyo/sam/test_sam.py Outdated Show resolved Hide resolved

fgpyo/sam/__init__.py Outdated Show resolved Hide resolved

tests: incorporate reviewer comments -- include ASCII art, rename tes…

2e8f3d0

…t for clarity, true up assertion style to be consistent

emmcauley requested a review from znorgaard March 28, 2025 19:23

coderabbitai bot approved these changes Mar 28, 2025

View reviewed changes

znorgaard reviewed Mar 28, 2025

View reviewed changes

tests/fgpyo/sam/test_sam.py Show resolved Hide resolved

fix: the match should be out of bounds

6b45969

znorgaard self-requested a review April 10, 2025 21:09

feat: finish out implementation of calculate_edit_info -- add default…

7470176

… for n_as_match, expand docstring, defensive programming against invalid CIGAR ops

emmcauley force-pushed the em_calculate_md_and_nm_tags branch from 6b29c10 to c303154 Compare April 11, 2025 19:44

tests: increase testing coverage for a read that has missing query bases

6dc63aa

emmcauley force-pushed the em_calculate_md_and_nm_tags branch from c303154 to 6dc63aa Compare April 11, 2025 19:49

emmcauley marked this pull request as ready for review April 11, 2025 20:16

emmcauley requested review from clintval, nh13 and tfenne as code owners April 11, 2025 20:16

emmcauley assigned nh13 and tfenne and unassigned znorgaard Apr 11, 2025

coderabbitai bot reviewed Apr 11, 2025

View reviewed changes

nh13 requested changes Apr 12, 2025

View reviewed changes

nh13 reviewed Apr 12, 2025

View reviewed changes

tests/fgpyo/sam/test_sam.py Show resolved Hide resolved

znorgaard reviewed Apr 21, 2025

View reviewed changes

fgpyo/sam/__init__.py Outdated Show resolved Hide resolved

emmcauley force-pushed the em_calculate_md_and_nm_tags branch from 0c702f5 to 3aeb035 Compare May 13, 2025 20:44

coderabbitai bot requested changes May 13, 2025

View reviewed changes

fgpyo/sam/__init__.py Outdated Show resolved Hide resolved

emmcauley force-pushed the em_calculate_md_and_nm_tags branch from 3aeb035 to 09fdd38 Compare May 13, 2025 20:59

emmcauley commented May 13, 2025

View reviewed changes

fix: incorporate reviewer comments

5b02a5d

* set n_as_match default to False * expand docstring * re-order CIGAR op checks * add testing coverage for =, op.N, op.H, op.P

emmcauley force-pushed the em_calculate_md_and_nm_tags branch from 09fdd38 to 5b02a5d Compare May 13, 2025 21:02

coderabbitai bot reviewed May 13, 2025

View reviewed changes

coderabbitai bot approved these changes May 13, 2025

View reviewed changes

emmcauley requested review from nh13 and znorgaard May 13, 2025 21:04

tfenne reviewed May 14, 2025

View reviewed changes

fgpyo/sam/__init__.py Outdated Show resolved Hide resolved

fix: add missing logic in _is_match_base(); cover with tiny unit-test

ef46fe7

emmcauley requested a review from tfenne May 15, 2025 20:31

coderabbitai bot requested changes May 15, 2025

View reviewed changes

nh13 requested changes May 29, 2025

View reviewed changes

nh13 assigned emmcauley and unassigned nh13 and tfenne Jun 3, 2025

		elif op not in (CigarOp.H, CigarOp.P): # if we come across an unexpected CIGAR operation
		raise ValueError(f"Invalid CIGAR operation: {op}")

		break # out of bounds
		query_base: str = rec.query_sequence[query_offset + in_block_offset].upper()

	OK_BASES: set[str] = {"A", "C", "G", "T", "="}
	_OK_BASES: set[str] = {"A", "C", "G", "T", "="}

	md_edits.append(f"{reference_sequence[target_offset:target_offset + elem.length]}")
	md_edits.append(reference_sequence[target_offset:target_offset + elem.length].upper())

	matches, mms, insertions, ins_bases, deletions, del_bases = 0, 0, 0, 0, 0, 0
	matches, mismatches, insertions, ins_bases, deletions, del_bases = 0, 0, 0, 0, 0, 0

	md_edits.append(str(current_match_count))
	md_edits.append(f"{current_match_count}")

feat: calculate MD and NM tags #229

Are you sure you want to change the base?

feat: calculate MD and NM tags #229

Uh oh!

Conversation

emmcauley commented Mar 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Mar 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot commented Apr 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

emmcauley commented May 13, 2025

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot May 15, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot May 15, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

emmcauley commented Mar 25, 2025 •

edited

Loading

codecov bot commented Mar 25, 2025 •

edited

Loading

coderabbitai bot commented Apr 11, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)