Ruthvik-Bandari commented Dec 20, 2025

Summary

Adds GroundCheck - a source grounding verification feature that detects hallucinations in LLM extractions.

Problem

LLMs can return perfectly valid JSON that passes Pydantic validation but contains hallucinated values not present in the source text. Currently, Instructor validates structure but not truth.

Solution

GroundCheck verifies each extracted field against the source text using multiple strategies:

  • Exact matching - Field value appears verbatim in source
  • Fuzzy matching - Handles typos/OCR errors (using rapidfuzz)
  • Numeric matching - Handles formatting differences ($1,234.56 vs 1234.56; see the note after the example)
  • Semantic matching - Optional embeddings-based similarity

Example

from instructor.groundcheck import verify_extraction

result = verify_extraction(
    source_text="Invoice #12345 from Acme Corp. Total: $500",
    extracted_data={"invoice": "12345", "vendor": "Acme Corp", "currency": "USD"}
)
print(result.flagged_fields)  # ["currency"] - "USD" never appears in the source, so it is flagged as hallucinated

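A note on the numeric strategy listed above: formatting differences such as $1,234.56 vs 1234.56 are presumably reconciled by normalizing both sides before comparison. The snippet below is a minimal sketch of that idea, not the library's actual implementation:

import re

def normalize_number(text: str) -> float | None:
    """Strip currency symbols and thousands separators, then parse as a float."""
    cleaned = re.sub(r"[^\d.\-]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        return None

# "$1,234.56" and "1234.56" normalize to the same value, so they compare equal
assert normalize_number("$1,234.56") == normalize_number("1234.56") == 1234.56
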
Features

  • GroundCheck class with configurable thresholds
  • verify_extraction() convenience function
  • grounding_validator() for Pydantic integration (see the sketch after this list)
  • Field-level confidence scores and evidence
  • Works with nested objects and lists

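The PR does not show grounding_validator()'s signature, so the sketch below wires verify_extraction() (whose usage is shown above) into a Pydantic model by hand; treat the validator shape and the source_text context key as assumptions, not the API shipped in this PR.

from pydantic import BaseModel, ValidationInfo, model_validator

from instructor.groundcheck import verify_extraction

class Invoice(BaseModel):
    invoice: str
    vendor: str

    @model_validator(mode="after")
    def check_grounding(self, info: ValidationInfo) -> "Invoice":
        # The source document is expected to arrive via the validation context.
        source = (info.context or {}).get("source_text", "")
        result = verify_extraction(source_text=source, extracted_data=self.model_dump())
        if result.flagged_fields:
            raise ValueError(f"Possibly hallucinated fields: {result.flagged_fields}")
        return self

# Pass the source text through the validation context when validating.
invoice = Invoice.model_validate(
    {"invoice": "12345", "vendor": "Acme Corp"},
    context={"source_text": "Invoice #12345 from Acme Corp. Total: $500"},
)
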
Tests

  • 11 tests added, all passing
  • Covers exact match, fuzzy match, numeric match, hallucination detection, edge cases

Dependencies

  • rapidfuzz (optional, for fuzzy matching)
  • sentence-transformers (optional, for semantic matching)

Checklist

  • Tests added and passing
  • Code follows project style
  • No breaking changes
  • Optional dependencies only

Important

Introduces GroundCheck for verifying LLM-extracted data against source text using multiple matching strategies, with Pydantic integration and comprehensive tests.

  • Feature:
    • Adds GroundCheck class in groundcheck.py for verifying LLM-extracted data against source text using exact, fuzzy, numeric, and semantic matching.
    • Introduces verify_extraction() function for convenience and grounding_validator() for Pydantic integration.
    • Supports field-level confidence scores, evidence, and handles nested objects and lists.
  • Tests:
    • Adds 11 tests in test_groundcheck.py covering exact match, fuzzy match, numeric match, hallucination detection, and edge cases.
  • Dependencies:
    • Optional dependencies on rapidfuzz for fuzzy matching and sentence-transformers for semantic matching.

This description was created by Ellipsis for b5cc820. You can customize this summary. It will automatically update as commits are pushed.

- Add GroundCheck class to verify LLM extractions against source text
- Implement exact, fuzzy, and numeric matching strategies
- Add grounding_validator for Pydantic integration
- Add verify_extraction convenience function
- Include comprehensive test suite (11 tests)

Helps detect hallucinations in structured data extraction by verifying
that extracted values actually exist in the source text.
ellipsis-dev bot left a comment

Important

Looks good to me! 👍

Reviewed everything up to b5cc820 in 3 minutes and 3 seconds.
  • Reviewed 448 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 5 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. instructor/groundcheck.py:147
  • Draft comment:
    Avoid duplicate fuzzy matching calls. Consider storing the result of _fuzzy_match (already computed on line 147) and reusing it on line 154 to improve performance.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 85% vs. threshold = 85% The comment is technically correct - there IS a potential duplicate call to _fuzzy_match. However, this only happens when the fuzzy match score is below the threshold (self.fuzzy_threshold, default 0.85). In that case, the code falls through to semantic matching, and then at line 154 calls fuzzy_match again to get a "best effort" result for the NOT_FOUND case. This is actually intentional behavior - the second call is meant to provide some evidence even when no good match is found. The duplication only occurs in the failure path, not the success path. That said, storing the result from line 147 and reusing it would be a valid optimization. This is a legitimate code quality suggestion about DRY principles and performance. The duplicate call only happens in the fallback/failure case where fuzzy matching didn't meet the threshold. This might be intentional design to separate "good match" logic from "best effort evidence" logic. The performance impact is likely minimal since this only affects cases where matching fails. The comment doesn't acknowledge that this is a fallback path, not the main path. While the duplicate call is in a fallback path, it's still a valid DRY violation and performance optimization opportunity. The fuzzy_result from line 147 could be stored and reused at line 154, avoiding an expensive computation. This is a clear, actionable suggestion that improves code quality without changing behavior. This is a valid code quality comment that identifies a DRY violation and performance optimization opportunity. The suggestion is clear, actionable, and would improve the code. Even though it's in a fallback path, avoiding the duplicate expensive fuzzy match computation is worthwhile.
2. instructor/groundcheck.py:50
  • Draft comment:
    The __post_init__ method flags any field with confidence < 0.5, which may override custom threshold logic in _verify_field. Consider parameterizing or documenting this behavior. (A simplified illustration appears after this list.)
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 75% vs. threshold = 85% This appears to be a legitimate code quality issue. The __post_init__ method creates a hardcoded 0.5 threshold that could override or conflict with the custom threshold logic passed to _verify_field. For example, if a user sets a threshold of 0.3 for a specific field, and the confidence comes back as 0.4, _verify_field would set flagged=False (since 0.4 >= 0.3), but then __post_init__ would override it to flagged=True (since 0.4 < 0.5). This is a real inconsistency in the code logic that could cause unexpected behavior. The comment suggests either parameterizing this behavior or documenting it, which are both reasonable solutions. However, I need to check if this is actually a problem in practice. Looking more carefully at the _verify_field calls, they all explicitly set flagged based on the threshold comparison. The __post_init__ check only applies "if not self.flagged", so it won't override an already-flagged field. But it could still flag a field that was intentionally not flagged. This is a subtle bug that could manifest in edge cases. The critique is valid - this is indeed a subtle issue. When _verify_field explicitly sets flagged=False (because confidence meets the custom threshold), the __post_init__ could still override it to True if confidence < 0.5. This creates an implicit minimum threshold of 0.5 that users cannot configure, which contradicts the design that allows custom thresholds. This is a real code quality issue that should be addressed. This comment identifies a legitimate design inconsistency where a hardcoded 0.5 threshold in __post_init__ can override custom threshold logic. This is a valid code quality concern that requires a code change to fix. The comment should be kept.
3. instructor/groundcheck.py:196
  • Draft comment:
    The nested loop in _fuzzy_match may be costly for longer texts. Consider optimizing this logic or adding caching if performance becomes an issue.
  • Reason this comment was not posted:
    Confidence changes required: 80% <= threshold 85% None
4. instructor/groundcheck.py:226
  • Draft comment:
    When computing semantic similarity, ensure division by zero is safely handled, even though SentenceTransformer embeddings are unlikely to be zero.
  • Reason this comment was not posted:
    Confidence changes required: 80% <= threshold 85% None
5. instructor/groundcheck.py:245
  • Draft comment:
    In _verify_complex_field, the method returned is hardcoded as FUZZY_MATCH. Consider reflecting the actual matching techniques used on nested fields or using a more generic method label.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 75% vs. threshold = 85% This is a legitimate code quality observation. The method is indeed hardcoded as FUZZY_MATCH for complex fields, even though the nested fields might have been verified using EXACT_MATCH, NUMERIC_MATCH, or SEMANTIC_MATCH. This is somewhat misleading. However, I need to consider: 1) Is this a clear code change that's needed? 2) Is it actionable? 3) Does it violate the rules about being "obvious or unimportant"? The comment suggests using a "more generic method label" which is actionable. This seems like a reasonable code quality suggestion that would improve accuracy of the API. It's not speculative - the issue definitely exists in the code. This could be considered a minor issue since the reason field does provide context about it being a list/dict with averaged confidence. The method field might not be critical for complex types. Also, the comment uses "Consider" which is somewhat soft language - is this really a must-fix or just a nice-to-have? While the reason field provides some context, the method field is part of the public API and should be accurate. Using FUZZY_MATCH when no fuzzy matching was actually performed is misleading. The suggestion is clear and actionable: either use a more generic label or reflect the actual methods used. This is a legitimate code quality issue about API accuracy. This is a valid code quality comment that points out an inaccuracy in the API where the verification method is hardcoded incorrectly for complex fields. The suggestion is actionable and would improve code quality. I should keep this comment.

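To make draft comment 2 concrete, here is a simplified stand-in (not the repository's actual FieldResult) showing how a hardcoded 0.5 floor in __post_init__ can override a caller's lower threshold:

from dataclasses import dataclass

@dataclass
class FieldResult:
    confidence: float
    flagged: bool = False

    def __post_init__(self):
        # Hardcoded floor: anything under 0.5 is flagged, even if a custom
        # threshold (say 0.3) already decided the field was acceptable.
        if not self.flagged and self.confidence < 0.5:
            self.flagged = True

# With a 0.3 threshold, confidence 0.4 should pass, but the floor re-flags it.
result = FieldResult(confidence=0.4, flagged=False)
print(result.flagged)  # True
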
Workflow ID: wflow_kTuNWOPDXqOy4kRw

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

Ruthvik-Bandari and others added 4 commits December 20, 2025 11:00
- Remove duplicate fuzzy matching calls (store result for reuse)
- Add AGGREGATE verification method for complex fields
- Add division by zero check in semantic matching (see the sketch below)
- Remove hardcoded 0.5 threshold override in FieldResult
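
For illustration, the division-by-zero guard mentioned above might look something like this (a sketch of cosine similarity with a zero-norm check, not the repository's actual code):

import numpy as np

def safe_cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Return cosine similarity, or 0.0 when either embedding has zero norm."""
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)
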
Core Enhancements:
- Add HallucinationError to core/exceptions.py
- Export all GroundCheck components from main package
- Add with_grounding() decorator for automatic verification
- Add GroundedExtractor class for seamless Instructor integration
- Add to_dict() method for serialization
- Fix all ruff linting issues

Testing:
- Expand test suite from 11 to 24 tests
- Add tests for decorator, extractor, edge cases
- Add real-world scenario tests (medical, contact extraction)

Documentation:
- Add comprehensive docs/concepts/groundcheck.md
- Add to mkdocs.yml navigation
- Include mermaid diagrams and API reference

Examples:
- Add examples/groundcheck/basic_usage.py with 4 working examples

Users can now (see the usage sketch after this list):
- from instructor import GroundCheck, verify_extraction, HallucinationError
- Use @with_grounding decorator for automatic verification
- Use GroundedExtractor for integrated extraction + verification
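
The commit message does not reproduce the with_grounding() or GroundedExtractor signatures, so the sketch below sticks to the pieces documented earlier in this PR (verify_extraction() and the newly exported HallucinationError); the keyword arguments mirror the example in the description and the exception message is illustrative.

from instructor import HallucinationError, verify_extraction

source = "Invoice #12345 from Acme Corp. Total: $500"

result = verify_extraction(
    source_text=source,
    extracted_data={"invoice": "12345", "vendor": "Acme Corp", "currency": "USD"},
)

# Callers that prefer to fail hard can raise the exported exception.
if result.flagged_fields:
    raise HallucinationError(f"Ungrounded fields: {result.flagged_fields}")
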
Adds confidence scoring for LLM extractions using token log probabilities.
Zero extra API calls - just parses existing response data.

Features:
- ConfidenceScorer class with configurable thresholds
- Field-level and overall confidence scores
- ConfidenceLevel enum (HIGH/MEDIUM/LOW/VERY_LOW)
- LowConfidenceError exception for threshold enforcement
- score_confidence() convenience function
- enable_logprobs() helper

Performance:
- Zero additional API calls
- < 1ms processing time
- Zero new dependencies

Testing:
- 14 tests covering all functionality
- Mock LLM response testing
- Edge case coverage

Documentation:
- Full docs at docs/concepts/confidence.md
- Working examples at examples/confidence/

Exports added to main package:
- from instructor import ConfidenceScorer, score_confidence, ConfidenceLevel
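
The call signatures for ConfidenceScorer and score_confidence() are not shown here, so rather than guess them, the sketch below illustrates the underlying idea the commit describes: turning token log probabilities from an existing response into a confidence score (sample values are made up):

import math

# Token logprobs as they might accompany a completion (illustrative values only).
token_logprobs = [-0.05, -0.20, -1.60, -0.01]

# One simple aggregation: exponentiate the mean logprob to get a
# geometric-mean token probability in [0, 1].
confidence = math.exp(sum(token_logprobs) / len(token_logprobs))
print(f"{confidence:.2f}")  # ~0.63 for the sample values above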