Skip to content

Conversation

@emmcauley
Copy link
Contributor

@emmcauley emmcauley commented Mar 25, 2025

Closes #84. Adds an md attribute to the ReadEditInfo class and an n_as_match boolean flag to calculate_edit_info(). calculate_edit_info() is now a little more complex than we might like (11 > 10). I tried not to change any pre-existing code more than was necessary.

Some of the reads for testing were inherited from htsjdk.

⚠️ Once we converge on the style and the functionality in this PR, I can open another PR to address this part of the Github issue: Bonus points for adding another method that takes a iterator over Templates (and/or pysam.AlignedSegments) and a pysam.FastaFile and updates the records.

@emmcauley emmcauley force-pushed the em_calculate_md_and_nm_tags branch from 87136be to 62e8228 Compare March 25, 2025 20:00
@codecov
Copy link

codecov bot commented Mar 25, 2025

Codecov Report

Attention: Patch coverage is 89.09091% with 6 lines in your changes missing coverage. Please review.

Project coverage is 91.07%. Comparing base (0dd300f) to head (ef46fe7).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
fgpyo/sam/__init__.py 89.09% 3 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #229      +/-   ##
==========================================
- Coverage   91.19%   91.07%   -0.12%     
==========================================
  Files          18       18              
  Lines        2351     2387      +36     
  Branches      356      365       +9     
==========================================
+ Hits         2144     2174      +30     
- Misses        135      138       +3     
- Partials       72       75       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@clintval clintval assigned clintval and znorgaard and unassigned clintval Mar 25, 2025
@emmcauley emmcauley force-pushed the em_calculate_md_and_nm_tags branch from f035577 to d625ecf Compare March 27, 2025 16:44
@emmcauley emmcauley force-pushed the em_calculate_md_and_nm_tags branch from d625ecf to 1e52794 Compare March 28, 2025 14:09
@emmcauley emmcauley force-pushed the em_calculate_md_and_nm_tags branch from 1e52794 to 019c788 Compare March 28, 2025 16:30
…t for clarity, true up assertion style to be consistent
@emmcauley emmcauley requested a review from znorgaard March 28, 2025 19:23
@znorgaard znorgaard self-requested a review April 10, 2025 21:09
… for n_as_match, expand docstring, defensive programming against invalid CIGAR ops
@emmcauley emmcauley force-pushed the em_calculate_md_and_nm_tags branch from 6b29c10 to c303154 Compare April 11, 2025 19:44
@emmcauley emmcauley force-pushed the em_calculate_md_and_nm_tags branch from c303154 to 6dc63aa Compare April 11, 2025 19:49
@emmcauley emmcauley marked this pull request as ready for review April 11, 2025 20:16
@emmcauley emmcauley assigned nh13 and tfenne and unassigned znorgaard Apr 11, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Apr 11, 2025

Walkthrough

The changes introduce constants NO_QUERY_BASES and OK_BASES for query base handling and add an md attribute to ReadEditInfo for the MD tag string. The calculate_edit_info function gains a match_htsjdk boolean parameter to toggle base matching rules, returns None for unmapped or missing query bases, and is refactored to process CIGAR operations with boundary checks while constructing the MD tag. The test suite is expanded with new, renamed, and parameterized tests covering base matching logic, offsets, edge cases, and MD tag correctness.

Note

⚡️ AI Code Reviews for VS Code, Cursor, Windsurf

CodeRabbit now has a plugin for VS Code, Cursor and Windsurf. This brings AI code reviews directly in the code editor. Each commit is reviewed immediately, finding bugs before the PR is raised. Seamless context handoff to your AI code agent ensures that you can easily incorporate review feedback.
Learn more here.


Note

⚡️ Faster reviews with caching

CodeRabbit now supports caching for code and dependencies, helping speed up reviews. This means quicker feedback, reduced wait times, and a smoother review experience overall. Cached data is encrypted and stored securely. This feature will be automatically enabled for all accounts on May 16th. To opt out, configure Review - Disable Cache at either the organization or repository level. If you prefer to disable all data retention across your organization, simply turn off the Data Retention setting under your Organization Settings.
Enjoy the performance boost—your workflow just got faster.

✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
fgpyo/sam/__init__.py (1)

1046-1069: Function signature and docstring updates
The signature now includes n_as_match and reference_offset, broadening functionality. The docstring is thorough, referencing SAM standards and htsjdk logic. Consider simplifying if the function grows more complex.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 44b8cea and 6dc63aa.

📒 Files selected for processing (2)
  • fgpyo/sam/__init__.py (4 hunks)
  • tests/fgpyo/sam/test_sam.py (4 hunks)
🧰 Additional context used
🪛 GitHub Check: codecov/patch
fgpyo/sam/__init__.py

[warning] 1130-1130: fgpyo/sam/init.py#L1130
Added line #L1130 was not covered by tests

🔇 Additional comments (19)
fgpyo/sam/__init__.py (6)

204-205: New constant for missing query bases
These lines clearly define a placeholder for records without query bases. Looks good and straightforward.


1033-1044: Added md attribute in ReadEditInfo
Documenting and implementing md in ReadEditInfo aligns well with SAM standards, ensuring we can store the MD tag. No issues spotted.


1082-1083: Return None for unmapped or missing query bases
Clean conditional check here, properly documented in the docstring. Good approach to handle incomplete data.


1085-1129: Core loop logic for alignments
This block meticulously accounts for each CIGAR operation type (N, I, S, D, M, X, EQ). The code is consistent with the docstring, handling offsets and boundaries carefully.


1130-1131: Uncovered error path
Static analysis indicates line 1130 is not tested. Consider adding a test that triggers this “Invalid CIGAR operation” condition to maintain coverage and ensure the error path is verified.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 1130-1130: fgpyo/sam/init.py#L1130
Added line #L1130 was not covered by tests


1132-1142: Finalizing MD edits
Appending the final match count and constructing ReadEditInfo looks correct. The md="".join(md_edits) call aligns with standard MD tag assembly.

tests/fgpyo/sam/test_sam.py (13)

8-8: Importing Optional
Straightforward addition for handling None usage. No issues noted.


655-659: Testing MD with no edits
Calling calculate_edit_info() with n_as_match=False and checking info.md == "30" ensures no mismatches. Good coverage for the simple case.


661-671: Test for zero offset
Verifies that the function properly handles reference offsets with no edits. Clear test name and correct assertion.


673-692: Test for non-zero offset edits
The ASCII art is helpful. Confirms mismatches are counted correctly when reference alignment is shifted. Thorough approach.


701-709: Mismatch plus insertion scenario
The MD tag ignoring insertions is correct per SAM spec. Tests confirm logic for mismatches and inserted bases. Nicely done.


718-725: Test with clipping and deletions
Ensures that soft-clips are ignored in MD calculations yet fully checks the deletion. The results align with expectations.


728-753: Ns alignment test
Parameterizing for n_as_match covers both ignoring or counting N→N as mismatches. This thoroughly tests ambiguous bases.


756-773: Consecutive mismatches
Test confirms consecutive fails from htsjdk data. The comparison to the original MD tag is a robust approach.


776-788: Soft clip at end
Verifies the correctness of the MD tag amid clips. Checking that n_as_match does not matter is a nice detail.


790-802: Skipped region
Validates how 2N is treated in the CIGAR. Confirming the MD tag matches the read’s tag is thorough.


804-818: Deletion out of bounds
Ensures the function doesn’t overreach when the deletion extends beyond reference. Good boundary check.


820-837: Matches out of bounds
Coverage for partial reference consumption. The MD string and NM tally look correct.


839-848: Missing query bases
Confirming that None is returned if the query bases are absent guarantees consistency with the function doc.

Comment on lines 1129 to 1130
elif op not in (CigarOp.H, CigarOp.P): # if we come across an unexpected CIGAR operation
raise ValueError(f"Invalid CIGAR operation: {op}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: I don't think these are errors.

Both hard clipping and padding operators consume neither reference bases nor read bases (see the table on page 8 in the spec), so they can be silently ignored. This is what the htsjdk implementation does, which we're following.

Can you add a test cases with cigars that have hard clips and padding operators (separately)?

Suggested change
elif op not in (CigarOp.H, CigarOp.P): # if we come across an unexpected CIGAR operation
raise ValueError(f"Invalid CIGAR operation: {op}")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am so sorry, I read this as elif op in (CigarOp.H, CigarOp.P). Consider assert_never. Sorry again

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not get assert_never() to work here. I reviewed it with Clint and it sounded like a potential rabbit hole?

elif op not in (CigarOp.H, CigarOp.P):  # pragma: no cover
  assert_never(op)

error: Argument 1 to "assert_never" has incompatible type "Literal[CigarOp.M, CigarOp.I, CigarOp.S, CigarOp.H, CigarOp.P, CigarOp.EQ, CigarOp.X, CigarOp.code, CigarOp.character, CigarOp.consumes_query, CigarOp.consumes_reference]"; expected "Never"  [arg-type]

@emmcauley emmcauley force-pushed the em_calculate_md_and_nm_tags branch from 0c702f5 to 3aeb035 Compare May 13, 2025 20:44
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6dc63aa and 3aeb035.

📒 Files selected for processing (2)
  • fgpyo/sam/__init__.py (4 hunks)
  • tests/fgpyo/sam/test_sam.py (4 hunks)
🧰 Additional context used
🪛 GitHub Actions: test
fgpyo/sam/__init__.py

[error] 1051-1051: mypy error: X | Y syntax for unions requires Python 3.10

🔇 Additional comments (20)
fgpyo/sam/__init__.py (11)

204-205: LGTM: Added constant for missing query bases.

Clear constant to represent missing query bases in SAM.


1043-1043: LGTM: Added MD tag field to ReadEditInfo.

Simple attribute addition aligns with PR objectives.


1061-1064: LGTM: Clear explanation of NM tag calculation.

Docstring clearly references SAM spec and explains ambiguity handling.


1078-1078: Default for n_as_match should be False for SAM spec compliance.

The default correctly aligns with the SAM specification now.


1085-1086: LGTM: Early return for unmapped or missing query reads.

Appropriate early exit for cases where MD/NM calculation isn't possible.


1093-1096: LGTM: Added MD tag calculation variables.

Proper initialization of variables needed for MD tag computation.


1099-1116: LGTM: Match/mismatch handling with proper bounds checking.

Logic correctly handles matches/mismatches while tracking MD tag and preventing out-of-bounds access.


1106-1108: LGTM: Special handling for '=' character.

The implementation properly treats '=' as a match always, regardless of reference base.


1118-1131: LGTM: Deletion handling with bounds checking.

Deletion handling with proper MD tag formatting and bounds checking.


1138-1139: LGTM: Handling for ignored CIGAR operations.

Properly skips hard clips and padding operations without raising errors.


1151-1151: LGTM: Added MD tag to return value.

Correctly passes the calculated MD tag string to the ReadEditInfo constructor.

tests/fgpyo/sam/test_sam.py (9)

655-658: LGTM: Updated existing test with n_as_match parameter.

Test properly includes new parameter and MD tag assertion.


673-692: LGTM: Good test for nonzero offset with clear ASCII visualization.

Test properly validates offsets with detailed ASCII visualization of alignment.


728-753: LGTM: Parameterized test for ambiguous N bases.

Test validates both behaviors of n_as_match parameter with ambiguous bases.


756-773: LGTM: Test for consecutive mismatches.

Test verifies complex case with consecutive mismatches without matches at start.


790-803: LGTM: Tests for ignored CIGAR operations.

Tests verify that skipped, padded, and hard-clipped regions are handled correctly.


805-817: LGTM: Test for equals sign in query sequence.

Test confirms that '=' in query is always treated as match regardless of reference.


835-849: LGTM: Test for deletion extending beyond reference.

Test verifies proper handling of deletions that go out of bounds.


851-868: LGTM: Test for matches extending beyond reference.

Test confirms correct behavior when matches extend beyond reference bounds.


870-878: LGTM: Test for missing query bases.

Test confirms function returns None for reads with missing query bases.

@emmcauley emmcauley force-pushed the em_calculate_md_and_nm_tags branch from 3aeb035 to 09fdd38 Compare May 13, 2025 20:59
Comment on lines 1129 to 1130
elif op not in (CigarOp.H, CigarOp.P): # if we come across an unexpected CIGAR operation
raise ValueError(f"Invalid CIGAR operation: {op}")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not get assert_never() to work here. I reviewed it with Clint and it sounded like a potential rabbit hole?

elif op not in (CigarOp.H, CigarOp.P):  # pragma: no cover
  assert_never(op)

error: Argument 1 to "assert_never" has incompatible type "Literal[CigarOp.M, CigarOp.I, CigarOp.S, CigarOp.H, CigarOp.P, CigarOp.EQ, CigarOp.X, CigarOp.code, CigarOp.character, CigarOp.consumes_query, CigarOp.consumes_reference]"; expected "Never"  [arg-type]

  * set n_as_match default to False
  * expand docstring
  * re-order CIGAR op checks
  * add testing coverage for =, op.N, op.H, op.P
@emmcauley emmcauley force-pushed the em_calculate_md_and_nm_tags branch from 09fdd38 to 5b02a5d Compare May 13, 2025 21:02
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 1

🛑 Comments failed to post (1)
fgpyo/sam/__init__.py (1)

1138-1140: 💡 Verification agent

🧩 Analysis chain

Verify error handling coverage.

The error handling for unexpected CIGAR operations isn't covered by tests.


🏁 Script executed:

#!/bin/bash
# Check if there are tests that verify error handling for invalid CIGAR operations

echo "Searching for tests that could trigger the ValueError for invalid CIGAR ops..."
rg "test_calc_edit_info.*invalid.*cigar" tests/

echo "Checking for any CIGAR operations not in the standard set..."
rg "CigarOp\.\w+" --no-filename | sort | uniq

Length of output: 1868


Add tests for invalid CIGAR operations

The ValueError branch for unexpected CIGAR ops in fgpyo/sam/__init__.py (lines 1138–1140) has no test coverage. Please add unit tests that:

  • Instantiate or mock a CIGAR string containing a non-standard operation (neither H nor P).
  • Call the function that processes CIGAR operations (e.g. calc_edit_info) and assert that it raises ValueError(f"Invalid CIGAR operation: {op}").

This will ensure error handling for invalid CIGAR codes is properly tested.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 1139-1139: fgpyo/sam/init.py#L1139
Added line #L1139 was not covered by tests

🤖 Prompt for AI Agents (early access)
In fgpyo/sam/__init__.py around lines 1138 to 1140, the ValueError raised for
invalid CIGAR operations lacks test coverage. Add unit tests that create or mock
a CIGAR string containing an invalid operation (not H or P), then call the
function processing CIGAR ops (such as calc_edit_info) and assert that it raises
the expected ValueError with the message "Invalid CIGAR operation: {op}". This
will verify the error handling branch is properly tested.

@emmcauley
Copy link
Contributor Author

@nh13 can you please confirm this is the behavior you want? I changed the results of this test and I just want to be careful that this is the expected behavior:

def test_calc_edit_info_with_matches_out_of_bounds() -> None:
    """Assert expected behavior for matches that extends out of bounds of the read."""
    chrom = "AGTCCGTTA"
    builder = SamBuilder(r1_len=4)
    rec = builder.add_single(bases="TTAG", chrom="chr1", start=0, cigar="6D4M")

    info = sam.calculate_edit_info(rec=rec, reference_sequence=chrom, n_as_match=False)
    # Offset Ref: AGTCCGTTA
    #             xxxxxx|||.
    #      Query: ------TTAG
    assert info.mismatches == 0
    assert info.insertions == 0
    assert info.inserted_bases == 0
    assert info.deletions == 0
    assert info.deleted_bases == 0
    assert info.nm == 0
    assert info.md == "0^AGTCCG0"  # report AGTCCG deletion, 3 matches are out of bounds

@emmcauley emmcauley requested review from nh13 and znorgaard May 13, 2025 21:04
@emmcauley emmcauley requested a review from tfenne May 15, 2025 20:31
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09fdd38 and ef46fe7.

📒 Files selected for processing (2)
  • fgpyo/sam/__init__.py (4 hunks)
  • tests/fgpyo/sam/test_sam.py (4 hunks)
🧰 Additional context used
🪛 GitHub Check: codecov/patch
fgpyo/sam/__init__.py

[warning] 1118-1118: fgpyo/sam/init.py#L1118
Added line #L1118 was not covered by tests


[warning] 1140-1140: fgpyo/sam/init.py#L1140
Added line #L1140 was not covered by tests


[warning] 1156-1156: fgpyo/sam/init.py#L1156
Added line #L1156 was not covered by tests

🔇 Additional comments (16)
fgpyo/sam/__init__.py (9)

204-206: Good addition of NO_QUERY_BASES constant.

Clearly defines the behavior when query bases are missing in a SAM record.


210-212: Appropriate definition of valid bases.

The OK_BASES includes "=" to properly handle SAM spec where equals is always considered a match regardless of reference base.


1049-1058: Helper function improves code clarity.

Good encapsulation of match logic into a separate function. The differentiation between SAM spec and htsjdk implementation is clear.


1061-1067: Function signature properly updated.

The renamed parameter match_htsjdk (replacing n_as_match) makes the intention clearer and the return type correctly handles the None case.


1070-1087: Improved docstring with detailed explanation.

Documentation effectively explains the difference between SAM spec and htsjdk implementations, and clarifies when None will be returned.


1102-1103: Early return handles edge cases.

Correctly returns None for unmapped reads or missing query bases, preventing downstream issues.


1135-1146: Deletion handling builds MD tag correctly.

The code appends the match count, adds the caret symbol, then appends the deleted reference bases. Note that line 1140-1141 (out-of-bounds handling) isn't tested.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 1140-1140: fgpyo/sam/init.py#L1140
Added line #L1140 was not covered by tests


1158-1159: MD tag finalization is correct.

Appending the final match count at the end of the function ensures complete MD tag generation.


1167-1169: Return value includes MD tag.

The ReadEditInfo object correctly includes the new MD tag.

tests/fgpyo/sam/test_sam.py (7)

651-678: Comprehensive test for base matching logic.

Thorough test case covering all important combinations of query and reference bases with both match_htsjdk settings.


785-803: Good test for consecutive mismatches.

Verifies MD tag construction for the challenging case of consecutive mismatches with no intervening matches.


820-833: Tests for ignored CIGAR operations.

Effectively tests handling of skipped regions, hard clips, and padding operations in CIGAR strings.


835-847: Test for '=' in query sequence.

Confirms that '=' in query sequence is always treated as a match regardless of the reference base.


865-879: Tests deletion extending out of bounds.

Good test for a deletion that goes beyond the reference boundaries.


881-898: Tests matches extending beyond reference boundaries.

This test is important for validating edge case handling where alignments extend past available reference sequence.


900-908: Tests for missing query bases.

Properly verifies both None and "*" cases for missing query sequences.

elif op == CigarOp.N: # skipped region from ref, consumes ref
target_offset += elem.length
elif op not in (CigarOp.H, CigarOp.P): # pragma: no cover
raise ValueError(f"Invalid CIGAR operation: {op}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

This error path needs test coverage.

The code path for invalid CIGAR operations isn't tested. While marked with pragma: no cover, consider adding a test to ensure proper error handling.


🏁 Script executed:

#!/bin/bash
# Check if there are tests for invalid CIGAR operations

echo "Searching for tests with invalid CIGAR operations..."
rg -A 2 "Invalid CIGAR" tests/

Length of output: 147


🏁 Script executed:

#!/bin/bash
# Show the context around the raise to confirm the pragma and function name
sed -n '1140,1170p' fgpyo/sam/__init__.py

Length of output: 1161


Add test for invalid CIGAR branch

The ValueError at fgpyo/sam/init.py lines 1155–1156 (the branch marked # pragma: no cover) is untested. Please add a unit test that feeds an out-of-spec CIGAR operation and asserts that ValueError("Invalid CIGAR operation: …") is raised.

  • Location: fgpyo/sam/init.py, in the loop handling op not in (CigarOp.H, CigarOp.P)
  • Suggestion: in your test suite, create a dummy CIGAR element with an invalid op and use pytest.raises(ValueError) to verify the error message.
🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 1156-1156: fgpyo/sam/init.py#L1156
Added line #L1156 was not covered by tests

🤖 Prompt for AI Agents
In fgpyo/sam/__init__.py at line 1156, the ValueError raised for invalid CIGAR
operations is not covered by tests. Add a unit test in the test suite that
constructs a CIGAR element with an invalid operation code outside the allowed
set, then use pytest.raises(ValueError) to assert that the error with the
message "Invalid CIGAR operation: ..." is properly raised, ensuring this error
path is tested.

Comment on lines +1118 to +1119
break # out of bounds
query_base: str = rec.query_sequence[query_offset + in_block_offset].upper()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Out-of-bounds handling could use test coverage.

The code correctly breaks when matches extend beyond reference boundaries, but this branch is not covered by tests.


🏁 Script executed:

#!/bin/bash
# Check if this code path is actually tested anywhere

echo "Searching for tests that might cover the out-of-bounds match condition..."
rg -A 2 "break  # out of bounds" tests/

Length of output: 198


🏁 Script executed:

#!/bin/bash
# Identify the function containing the out-of-bounds break and search tests for it
FUNCTION=$(awk '/def /{func=$0} /break  # out of bounds/{print func; exit}' fgpyo/sam/__init__.py | sed -E 's/def\s+([^(]+).*/\1/')
echo "Function: $FUNCTION"
echo "Searching for tests referencing $FUNCTION..."
rg -n "$FUNCTION" -n tests/

Length of output: 68238


🏁 Script executed:

#!/bin/bash
# Show context around the out-of-bounds break to identify the enclosing function
rg -n "break  # out of bounds" -C10 fgpyo/sam/__init__.py

Length of output: 2306


Add tests for out-of-bounds CIGAR handling

No existing tests hit the break # out of bounds branches in fgpyo/sam/init.py. Please add unit tests that:

  • Exercise a match block (e.g. CIGAR “NM” longer than reference tail) to hit line 1118
  • Exercise a deletion block longer than remaining reference to hit line 1139

Place them under tests/fgpyo/sam/, verifying that your MD/NM tag logic correctly stops at the reference end.

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 1118-1118: fgpyo/sam/init.py#L1118
Added line #L1118 was not covered by tests

🤖 Prompt for AI Agents
In fgpyo/sam/__init__.py around lines 1118 to 1119, the code breaks out of a
loop when matches extend beyond reference boundaries, but this branch lacks test
coverage. Add unit tests under tests/fgpyo/sam/ that create scenarios with a
match block (e.g., a CIGAR "NM" longer than the reference tail) to trigger the
break at line 1118, and a deletion block longer than the remaining reference to
trigger the break at line 1139. These tests should verify that the MD/NM tag
logic correctly stops processing at the reference end.

NO_QUERY_QUALITIES: array = qualitystring_to_array(STRING_PLACEHOLDER)
"""The quality array corresponding to an unavailable query quality string ("*")."""

OK_BASES: set[str] = {"A", "C", "G", "T", "="}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: shall we make this private for now since it's only used here and not intended to be use by outside projects

Suggested change
OK_BASES: set[str] = {"A", "C", "G", "T", "="}
_OK_BASES: set[str] = {"A", "C", "G", "T", "="}

for in_block_offset in range(0, elem.length):
if (target_offset + in_block_offset) >= len(reference_sequence):
break # out of bounds
md_edits.append(f"{reference_sequence[target_offset:target_offset + elem.length]}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chore: (1) I think we need to upper case the reference and (2) I don't think we need an f-string since reference sequence is already a string

Suggested change
md_edits.append(f"{reference_sequence[target_offset:target_offset + elem.length]}")
md_edits.append(reference_sequence[target_offset:target_offset + elem.length].upper())

target_offset: int = reference_offset if reference_offset is not None else rec.reference_start
cigar: Cigar = Cigar.from_cigartuples(rec.cigartuples)

matches, mms, insertions, ins_bases, deletions, del_bases = 0, 0, 0, 0, 0, 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chore: while we're here, can we update this variable to a longer form?

Suggested change
matches, mms, insertions, ins_bases, deletions, del_bases = 0, 0, 0, 0, 0, 0
matches, mismatches, insertions, ins_bases, deletions, del_bases = 0, 0, 0, 0, 0, 0

Comment on lines +1143 to +1144
if target_offset < elem.length:
break
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chore: I found another difference between htsjdk and samtools:

  1. htsjdk will break before updating the NM value
  2. samtools will break after updating the NM value

What do you think about updating deletions and del_bases if match_htsjdk is false?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you agree, remember to update the docstring.

elif op not in (CigarOp.H, CigarOp.P): # pragma: no cover
raise ValueError(f"Invalid CIGAR operation: {op}")

md_edits.append(str(current_match_count))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: f-strings are just a little faster

Suggested change
md_edits.append(str(current_match_count))
md_edits.append(f"{current_match_count}")

@nh13 nh13 assigned emmcauley and unassigned nh13 and tfenne Jun 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add a method to calculate MD and NM tags.

6 participants