feat: Add GroundCheck for source grounding verification #1968
Open
Ruthvik-Bandari wants to merge 5 commits into 567-labs:main from Ruthvik-Bandari:feature/groundcheck-verification (+2,297 −0)
Conversation
- Add GroundCheck class to verify LLM extractions against source text
- Implement exact, fuzzy, and numeric matching strategies
- Add grounding_validator for Pydantic integration
- Add verify_extraction convenience function
- Include comprehensive test suite (11 tests)

Helps detect hallucinations in structured data extraction by verifying that extracted values actually exist in the source text.
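A minimal usage sketch of what this commit adds. Beyond the GroundCheck name and the fuzzy_threshold default of 0.85 mentioned in the review below, the constructor arguments and method name are assumptions, not confirmed API:

```python
# Sketch only: GroundCheck's method name and result shape are
# assumptions inferred from this commit message, not the actual code.
from pydantic import BaseModel

from instructor import GroundCheck


class Invoice(BaseModel):
    vendor: str
    total: float


source = "Invoice from Acme Corp. Total due: $1,250.00."
extracted = Invoice(vendor="Acme Corp", total=1250.00)

checker = GroundCheck(fuzzy_threshold=0.85)  # default noted in the review
report = checker.verify(extracted, source)   # assumed method name
# Exact matching catches "Acme Corp"; numeric matching should normalize
# "$1,250.00" against 1250.0; fuzzy matching covers near-miss strings.
```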
Important
Looks good to me! 👍
Reviewed everything up to b5cc820 in 3 minutes and 3 seconds.
- Reviewed 448 lines of code in 2 files
- Skipped 0 files when reviewing
- Skipped posting 5 draft comments; view those below
- Modify your settings and rules to customize what types of comments Ellipsis leaves, and react with 👍 or 👎 to teach Ellipsis
1. instructor/groundcheck.py:147
- Draft comment: Avoid duplicate fuzzy matching calls. Consider storing the result of _fuzzy_match (already computed on line 147) and reusing it on line 154 to improve performance.
- Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 85% vs. threshold = 85%). The comment is technically correct: there is a potential duplicate call to _fuzzy_match. But it only occurs when the fuzzy score falls below self.fuzzy_threshold (default 0.85); the code then falls through to semantic matching and calls _fuzzy_match again on line 154 to provide best-effort evidence for the NOT_FOUND case. The duplication is therefore confined to the failure path and may be an intentional separation of "good match" logic from "best effort evidence" logic. Even so, storing the result from line 147 and reusing it at line 154 is a valid DRY and performance improvement that would not change behavior, so the suggestion stands on its merits.
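A hypothetical sketch of the fix this draft comment suggests; the helper names, result shape, and control flow are inferred from the review text, not taken from the PR:

```python
# Sketch only: _fuzzy_match / _semantic_match and their result objects
# are stand-ins inferred from the review, not the PR's implementation.
def _verify_field(self, value: str, source: str):
    # Compute the fuzzy match once and keep the result around.
    fuzzy_result = self._fuzzy_match(value, source)
    if fuzzy_result.score >= self.fuzzy_threshold:
        return fuzzy_result  # success path: good fuzzy match

    semantic_result = self._semantic_match(value, source)
    if semantic_result is not None:
        return semantic_result

    # Failure path: reuse the stored result as best-effort evidence
    # for NOT_FOUND instead of calling _fuzzy_match a second time.
    return fuzzy_result
```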
2. instructor/groundcheck.py:50
- Draft comment: The __post_init__ method flags any field with confidence < 0.5, which may override custom threshold logic in _verify_field. Consider parameterizing or documenting this behavior.
- Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 75% vs. threshold = 85%). The __post_init__ method applies a hardcoded 0.5 threshold that can conflict with the custom threshold passed to _verify_field: if a user sets a field threshold of 0.3 and the confidence comes back as 0.4, _verify_field sets flagged=False (since 0.4 >= 0.3), but __post_init__ then overrides it to flagged=True (since 0.4 < 0.5). The check only runs "if not self.flagged", so it never un-flags a field, but it can flag one that was intentionally left unflagged, creating an implicit minimum threshold of 0.5 that users cannot configure. That contradicts the design that allows custom thresholds, and parameterizing or documenting the behavior are both reasonable fixes.
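A small, self-contained reconstruction of the inconsistency described above, assuming FieldResult is a dataclass; the field names are illustrative:

```python
# Hypothetical reconstruction of the issue the draft comment describes;
# FieldResult and its fields are assumptions based on the review text.
from dataclasses import dataclass


@dataclass
class FieldResult:
    field: str
    confidence: float
    flagged: bool = False

    def __post_init__(self) -> None:
        # Hardcoded floor: flags any result under 0.5 even when the
        # caller's custom threshold already accepted it.
        if not self.flagged and self.confidence < 0.5:
            self.flagged = True


# With a custom threshold of 0.3, _verify_field would set flagged=False
# for confidence 0.4 -- but __post_init__ flips it back to True.
result = FieldResult(field="age", confidence=0.4, flagged=False)
assert result.flagged  # implicit 0.5 minimum the user cannot configure
```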
3. instructor/groundcheck.py:196
- Draft comment: The nested loop in _fuzzy_match may be costly for longer texts. Consider optimizing this logic or adding caching if performance becomes an issue.
- Reason this comment was not posted: Confidence changes required: 80% <= threshold 85%.
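One possible shape of the caching idea, using rapidfuzz (the PR's optional fuzzy-matching dependency); the function below is a stand-in for illustration, not the PR's _fuzzy_match:

```python
# Hypothetical caching sketch: memoize repeated value/window comparisons
# so re-checking the same strings across fields skips the fuzzy scoring.
from functools import lru_cache

from rapidfuzz import fuzz


@lru_cache(maxsize=1024)
def fuzzy_score(value: str, window: str) -> float:
    """Cached similarity in [0, 1] between a value and a source window."""
    return fuzz.ratio(value, window) / 100.0
```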
4. instructor/groundcheck.py:226
- Draft comment: When computing semantic similarity, ensure division by zero is safely handled, even though SentenceTransformer embeddings are unlikely to be zero.
- Reason this comment was not posted: Confidence changes required: 80% <= threshold 85%.
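A hedged sketch of the guard being asked for; the surrounding semantic-match code is not reproduced here:

```python
# Hypothetical guard for the division the draft comment worries about.
import numpy as np


def safe_cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity that returns 0.0 for zero-norm embeddings."""
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    if denom == 0.0:
        return 0.0  # degenerate embedding: report no similarity
    return float(np.dot(a, b) / denom)
```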
5. instructor/groundcheck.py:245
- Draft comment: In _verify_complex_field, the method returned is hardcoded as FUZZY_MATCH. Consider reflecting the actual matching techniques used on nested fields or using a more generic method label.
- Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 75% vs. threshold = 85%). The method is indeed hardcoded as FUZZY_MATCH for complex fields even when nested fields were verified with EXACT_MATCH, NUMERIC_MATCH, or SEMANTIC_MATCH, which is misleading. The reason field does add context (a list/dict with averaged confidence), and the "Consider" phrasing is soft, so this could be read as minor. But the method field is part of the public API and should be accurate: reporting FUZZY_MATCH when no fuzzy matching was performed is an inaccuracy, and the suggestion (reflect the actual methods used or adopt a more generic label) is clear and actionable.
- Remove duplicate fuzzy matching calls (store result for reuse)
- Add AGGREGATE verification method for complex fields
- Add division by zero check in semantic matching
- Remove hardcoded 0.5 threshold override in FieldResult
Core Enhancements:
- Add HallucinationError to core/exceptions.py
- Export all GroundCheck components from main package
- Add with_grounding() decorator for automatic verification
- Add GroundedExtractor class for seamless Instructor integration
- Add to_dict() method for serialization
- Fix all ruff linting issues

Testing:
- Expand test suite from 11 to 24 tests
- Add tests for decorator, extractor, edge cases
- Add real-world scenario tests (medical, contact extraction)

Documentation:
- Add comprehensive docs/concepts/groundcheck.md
- Add to mkdocs.yml navigation
- Include mermaid diagrams and API reference

Examples:
- Add examples/groundcheck/basic_usage.py with 4 working examples

Users can now:
- from instructor import GroundCheck, verify_extraction, HallucinationError
- Use @with_grounding decorator for automatic verification
- Use GroundedExtractor for integrated extraction + verification
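A hedged sketch of the decorator style this commit describes; with_grounding's arguments and raise behavior are assumptions, since the commit message names the decorator but not its signature:

```python
# Sketch only: the source_arg parameter and the decorator's contract
# are hypothetical; the commit confirms only the names with_grounding
# and HallucinationError.
from pydantic import BaseModel

from instructor import with_grounding


class Contact(BaseModel):
    name: str
    email: str


@with_grounding(source_arg="source")  # hypothetical parameter
def extract_contact(source: str) -> Contact:
    # Stand-in for an LLM-backed extraction call.
    return Contact(name="Jane Doe", email="jane.doe@example.com")


contact = extract_contact(
    source="Reach Jane Doe at jane.doe@example.com for details."
)
# Per this commit, the decorator verifies the returned model against
# the source text and raises HallucinationError on ungrounded fields.
```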
Adds confidence scoring for LLM extractions using token log probabilities. Zero extra API calls - just parses existing response data.

Features:
- ConfidenceScorer class with configurable thresholds
- Field-level and overall confidence scores
- ConfidenceLevel enum (HIGH/MEDIUM/LOW/VERY_LOW)
- LowConfidenceError exception for threshold enforcement
- score_confidence() convenience function
- enable_logprobs() helper

Performance:
- Zero additional API calls
- < 1 ms processing time
- Zero new dependencies

Testing:
- 14 tests covering all functionality
- Mock LLM response testing
- Edge case coverage

Documentation:
- Full docs at docs/concepts/confidence.md
- Working examples at examples/confidence/

Exports added to main package:
- from instructor import ConfidenceScorer, score_confidence, ConfidenceLevel
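To make the logprobs idea concrete, here is a self-contained sketch of the core computation; the thresholds and the mapping to levels are illustrative, not the PR's values:

```python
# Minimal sketch: average token log probabilities into a 0-1 confidence
# score (geometric mean of token probabilities), then bucket it.
# The cutoffs below are assumptions, not the PR's configured defaults.
import math
from enum import Enum


class ConfidenceLevel(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    VERY_LOW = "very_low"


def confidence_from_logprobs(logprobs: list[float]) -> float:
    """Geometric-mean token probability as an overall confidence."""
    if not logprobs:
        return 0.0
    return math.exp(sum(logprobs) / len(logprobs))


def to_level(score: float) -> ConfidenceLevel:
    if score >= 0.9:
        return ConfidenceLevel.HIGH
    if score >= 0.7:
        return ConfidenceLevel.MEDIUM
    if score >= 0.5:
        return ConfidenceLevel.LOW
    return ConfidenceLevel.VERY_LOW


print(to_level(confidence_from_logprobs([-0.05, -0.1, -0.02])))  # HIGH
```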
Summary
Adds GroundCheck - a source grounding verification feature that detects hallucinations in LLM extractions.
Problem
LLMs can return perfectly valid JSON that passes Pydantic validation but contains hallucinated values not present in the source text. Currently, Instructor validates structure but not truth.
Solution
GroundCheck verifies each extracted field against the source text using multiple strategies: exact, fuzzy, numeric, and semantic matching.
Example
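A hedged reconstruction of the intended flow, assuming verify_extraction takes the extracted model and the source text:

```python
# Sketch only: verify_extraction's exact signature is an assumption.
from pydantic import BaseModel

from instructor import verify_extraction


class Patient(BaseModel):
    name: str
    age: int


source = "John Smith, a 45-year-old male, presented with chest pain."

grounded = Patient(name="John Smith", age=45)
hallucinated = Patient(name="John Smith", age=52)

verify_extraction(grounded, source)      # every field exists in the source
verify_extraction(hallucinated, source)  # age=52 is not in the source,
                                         # so this should be flagged
```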
Features
- GroundCheck class with configurable thresholds
- verify_extraction() convenience function
- grounding_validator() for Pydantic integration
Tests
Cover exact match, fuzzy match, numeric match, hallucination detection, and edge cases.
Dependencies
- rapidfuzz (optional, for fuzzy matching)
- sentence-transformers (optional, for semantic matching)
Checklist
Important
Introduces GroundCheck for verifying LLM-extracted data against source text using multiple matching strategies, with Pydantic integration and comprehensive tests.
- GroundCheck class in groundcheck.py for verifying LLM-extracted data against source text using exact, fuzzy, numeric, and semantic matching.
- verify_extraction() function for convenience and grounding_validator() for Pydantic integration.
- test_groundcheck.py covering exact match, fuzzy match, numeric match, hallucination detection, and edge cases.
- rapidfuzz for fuzzy matching and sentence-transformers for semantic matching.

This description was created by Ellipsis for b5cc820.