Ruthvik-Bandari commented Dec 20, 2025

Summary

Adds GroundCheck - a source grounding verification feature that detects hallucinations in LLM extractions.

Problem

LLMs can return perfectly valid JSON that passes Pydantic validation but contains hallucinated values not present in the source text. Currently, Instructor validates structure but not truth.

Solution

GroundCheck verifies each extracted field against the source text using multiple strategies:

  • Exact matching - Field value appears verbatim in source
  • Fuzzy matching - Handles typos/OCR errors (using rapidfuzz)
  • Numeric matching - Handles formatting differences ($1,234.56 vs 1234.56; see the note after the example)
  • Semantic matching - Optional embeddings-based similarity

Example

from instructor.groundcheck import verify_extraction

result = verify_extraction(
    source_text="Invoice #12345 from Acme Corp. Total: $500",
    extracted_data={"invoice": "12345", "vendor": "Acme Corp", "currency": "USD"}
)
print(result.flagged_fields)  # ["currency"] - "USD" never appears in the source, so it is flagged as hallucinated

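A note on the numeric strategy listed above: formatting differences such as $1,234.56 vs 1234.56 are presumably reconciled by normalizing both sides before comparison. The snippet below is a minimal sketch of that idea, not the library's actual implementation:

import re

def normalize_number(text: str) -> float | None:
    """Strip currency symbols and thousands separators, then parse as a float."""
    cleaned = re.sub(r"[^\d.\-]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        return None

# "$1,234.56" and "1234.56" normalize to the same value, so they compare equal
assert normalize_number("$1,234.56") == normalize_number("1234.56") == 1234.56
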
Features

  • GroundCheck class with configurable thresholds
  • verify_extraction() convenience function
  • grounding_validator() for Pydantic integration (see the sketch after this list)
  • Field-level confidence scores and evidence
  • Works with nested objects and lists

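The PR does not show grounding_validator()'s signature, so the sketch below wires verify_extraction() (whose usage is shown above) into a Pydantic model by hand; treat the validator shape and the source_text context key as assumptions, not the API shipped in this PR.

from pydantic import BaseModel, ValidationInfo, model_validator

from instructor.groundcheck import verify_extraction

class Invoice(BaseModel):
    invoice: str
    vendor: str

    @model_validator(mode="after")
    def check_grounding(self, info: ValidationInfo) -> "Invoice":
        # The source document is expected to arrive via the validation context.
        source = (info.context or {}).get("source_text", "")
        result = verify_extraction(source_text=source, extracted_data=self.model_dump())
        if result.flagged_fields:
            raise ValueError(f"Possibly hallucinated fields: {result.flagged_fields}")
        return self

# Pass the source text through the validation context when validating.
invoice = Invoice.model_validate(
    {"invoice": "12345", "vendor": "Acme Corp"},
    context={"source_text": "Invoice #12345 from Acme Corp. Total: $500"},
)
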
Tests

  • 11 tests added, all passing
  • Covers exact match, fuzzy match, numeric match, hallucination detection, edge cases

Dependencies

  • rapidfuzz (optional, for fuzzy matching)
  • sentence-transformers (optional, for semantic matching)

Checklist

  • Tests added and passing
  • Code follows project style
  • No breaking changes
  • Optional dependencies only

Important

Introduces GroundCheck for verifying LLM-extracted data against source text using multiple matching strategies, with Pydantic integration and comprehensive tests.

  • Feature:
    • Adds GroundCheck class in groundcheck.py for verifying LLM-extracted data against source text using exact, fuzzy, numeric, and semantic matching.
    • Introduces verify_extraction() function for convenience and grounding_validator() for Pydantic integration.
    • Supports field-level confidence scores, evidence, and handles nested objects and lists.
  • Tests:
    • Adds 11 tests in test_groundcheck.py covering exact match, fuzzy match, numeric match, hallucination detection, and edge cases.
  • Dependencies:
    • Optional dependencies on rapidfuzz for fuzzy matching and sentence-transformers for semantic matching.

This description was created by Ellipsis for b5cc820. You can customize this summary. It will automatically update as commits are pushed.

- Add GroundCheck class to verify LLM extractions against source text
- Implement exact, fuzzy, and numeric matching strategies
- Add grounding_validator for Pydantic integration
- Add verify_extraction convenience function
- Include comprehensive test suite (11 tests)

Helps detect hallucinations in structured data extraction by verifying
that extracted values actually exist in the source text.
ellipsis-dev bot left a comment

Important

Looks good to me! 👍

Reviewed everything up to b5cc820 in 3 minutes and 3 seconds.
  • Reviewed 448 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 5 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. instructor/groundcheck.py:147
  • Draft comment:
    Avoid duplicate fuzzy matching calls. Consider storing the result of _fuzzy_match (already computed on line 147) and reusing it on line 154 to improve performance.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 85% vs. threshold = 85% The comment is technically correct - there IS a potential duplicate call to _fuzzy_match. However, this only happens when the fuzzy match score is below the threshold (self.fuzzy_threshold, default 0.85). In that case, the code falls through to semantic matching, and then at line 154 calls fuzzy_match again to get a "best effort" result for the NOT_FOUND case. This is actually intentional behavior - the second call is meant to provide some evidence even when no good match is found. The duplication only occurs in the failure path, not the success path. That said, storing the result from line 147 and reusing it would be a valid optimization. This is a legitimate code quality suggestion about DRY principles and performance. The duplicate call only happens in the fallback/failure case where fuzzy matching didn't meet the threshold. This might be intentional design to separate "good match" logic from "best effort evidence" logic. The performance impact is likely minimal since this only affects cases where matching fails. The comment doesn't acknowledge that this is a fallback path, not the main path. While the duplicate call is in a fallback path, it's still a valid DRY violation and performance optimization opportunity. The fuzzy_result from line 147 could be stored and reused at line 154, avoiding an expensive computation. This is a clear, actionable suggestion that improves code quality without changing behavior. This is a valid code quality comment that identifies a DRY violation and performance optimization opportunity. The suggestion is clear, actionable, and would improve the code. Even though it's in a fallback path, avoiding the duplicate expensive fuzzy match computation is worthwhile.
2. instructor/groundcheck.py:50
  • Draft comment:
    The __post_init__ method flags any field with confidence < 0.5, which may override custom threshold logic in _verify_field. Consider parameterizing or documenting this behavior. (A simplified illustration appears after this list.)
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 75% vs. threshold = 85% This appears to be a legitimate code quality issue. The __post_init__ method creates a hardcoded 0.5 threshold that could override or conflict with the custom threshold logic passed to _verify_field. For example, if a user sets a threshold of 0.3 for a specific field, and the confidence comes back as 0.4, _verify_field would set flagged=False (since 0.4 >= 0.3), but then __post_init__ would override it to flagged=True (since 0.4 < 0.5). This is a real inconsistency in the code logic that could cause unexpected behavior. The comment suggests either parameterizing this behavior or documenting it, which are both reasonable solutions. However, I need to check if this is actually a problem in practice. Looking more carefully at the _verify_field calls, they all explicitly set flagged based on the threshold comparison. The __post_init__ check only applies "if not self.flagged", so it won't override an already-flagged field. But it could still flag a field that was intentionally not flagged. This is a subtle bug that could manifest in edge cases. The critique is valid - this is indeed a subtle issue. When _verify_field explicitly sets flagged=False (because confidence meets the custom threshold), the __post_init__ could still override it to True if confidence < 0.5. This creates an implicit minimum threshold of 0.5 that users cannot configure, which contradicts the design that allows custom thresholds. This is a real code quality issue that should be addressed. This comment identifies a legitimate design inconsistency where a hardcoded 0.5 threshold in __post_init__ can override custom threshold logic. This is a valid code quality concern that requires a code change to fix. The comment should be kept.
3. instructor/groundcheck.py:196
  • Draft comment:
    The nested loop in _fuzzy_match may be costly for longer texts. Consider optimizing this logic or adding caching if performance becomes an issue.
  • Reason this comment was not posted:
    Confidence changes required: 80% <= threshold 85% None
4. instructor/groundcheck.py:226
  • Draft comment:
    When computing semantic similarity, ensure division by zero is safely handled, even though SentenceTransformer embeddings are unlikely to be zero.
  • Reason this comment was not posted:
    Confidence changes required: 80% <= threshold 85% None
5. instructor/groundcheck.py:245
  • Draft comment:
    In _verify_complex_field, the method returned is hardcoded as FUZZY_MATCH. Consider reflecting the actual matching techniques used on nested fields or using a more generic method label.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 75% vs. threshold = 85% This is a legitimate code quality observation. The method is indeed hardcoded as FUZZY_MATCH for complex fields, even though the nested fields might have been verified using EXACT_MATCH, NUMERIC_MATCH, or SEMANTIC_MATCH. This is somewhat misleading. However, I need to consider: 1) Is this a clear code change that's needed? 2) Is it actionable? 3) Does it violate the rules about being "obvious or unimportant"? The comment suggests using a "more generic method label" which is actionable. This seems like a reasonable code quality suggestion that would improve accuracy of the API. It's not speculative - the issue definitely exists in the code. This could be considered a minor issue since the reason field does provide context about it being a list/dict with averaged confidence. The method field might not be critical for complex types. Also, the comment uses "Consider" which is somewhat soft language - is this really a must-fix or just a nice-to-have? While the reason field provides some context, the method field is part of the public API and should be accurate. Using FUZZY_MATCH when no fuzzy matching was actually performed is misleading. The suggestion is clear and actionable: either use a more generic label or reflect the actual methods used. This is a legitimate code quality issue about API accuracy. This is a valid code quality comment that points out an inaccuracy in the API where the verification method is hardcoded incorrectly for complex fields. The suggestion is actionable and would improve code quality. I should keep this comment.

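To make draft comment 2 concrete, here is a simplified stand-in (not the repository's actual FieldResult) showing how a hardcoded 0.5 floor in __post_init__ can override a caller's lower threshold:

from dataclasses import dataclass

@dataclass
class FieldResult:
    confidence: float
    flagged: bool = False

    def __post_init__(self):
        # Hardcoded floor: anything under 0.5 is flagged, even if a custom
        # threshold (say 0.3) already decided the field was acceptable.
        if not self.flagged and self.confidence < 0.5:
            self.flagged = True

# With a 0.3 threshold, confidence 0.4 should pass, but the floor re-flags it.
result = FieldResult(confidence=0.4, flagged=False)
print(result.flagged)  # True
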
Workflow ID: wflow_kTuNWOPDXqOy4kRw

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

Ruthvik-Bandari and others added 4 commits December 20, 2025 11:00
- Remove duplicate fuzzy matching calls (store result for reuse)
- Add AGGREGATE verification method for complex fields
- Add division by zero check in semantic matching (see the sketch below)
- Remove hardcoded 0.5 threshold override in FieldResult
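
For illustration, the division-by-zero guard mentioned above might look something like this (a sketch of cosine similarity with a zero-norm check, not the repository's actual code):

import numpy as np

def safe_cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Return cosine similarity, or 0.0 when either embedding has zero norm."""
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)
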
Core Enhancements:
- Add HallucinationError to core/exceptions.py
- Export all GroundCheck components from main package
- Add with_grounding() decorator for automatic verification
- Add GroundedExtractor class for seamless Instructor integration
- Add to_dict() method for serialization
- Fix all ruff linting issues

Testing:
- Expand test suite from 11 to 24 tests
- Add tests for decorator, extractor, edge cases
- Add real-world scenario tests (medical, contact extraction)

Documentation:
- Add comprehensive docs/concepts/groundcheck.md
- Add to mkdocs.yml navigation
- Include mermaid diagrams and API reference

Examples:
- Add examples/groundcheck/basic_usage.py with 4 working examples

Users can now (see the usage sketch after this list):
- from instructor import GroundCheck, verify_extraction, HallucinationError
- Use @with_grounding decorator for automatic verification
- Use GroundedExtractor for integrated extraction + verification
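
The commit message does not reproduce the with_grounding() or GroundedExtractor signatures, so the sketch below sticks to the pieces documented earlier in this PR (verify_extraction() and the newly exported HallucinationError); the keyword arguments mirror the example in the description and the exception message is illustrative.

from instructor import HallucinationError, verify_extraction

source = "Invoice #12345 from Acme Corp. Total: $500"

result = verify_extraction(
    source_text=source,
    extracted_data={"invoice": "12345", "vendor": "Acme Corp", "currency": "USD"},
)

# Callers that prefer to fail hard can raise the exported exception.
if result.flagged_fields:
    raise HallucinationError(f"Ungrounded fields: {result.flagged_fields}")
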
Adds confidence scoring for LLM extractions using token log probabilities.
Zero extra API calls - just parses existing response data.

Features:
- ConfidenceScorer class with configurable thresholds
- Field-level and overall confidence scores
- ConfidenceLevel enum (HIGH/MEDIUM/LOW/VERY_LOW)
- LowConfidenceError exception for threshold enforcement
- score_confidence() convenience function
- enable_logprobs() helper

Performance:
- Zero additional API calls
- < 1ms processing time
- Zero new dependencies

Testing:
- 14 tests covering all functionality
- Mock LLM response testing
- Edge case coverage

Documentation:
- Full docs at docs/concepts/confidence.md
- Working examples at examples/confidence/

Exports added to main package:
- from instructor import ConfidenceScorer, score_confidence, ConfidenceLevel
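
The call signatures for ConfidenceScorer and score_confidence() are not shown here, so rather than guess them, the sketch below illustrates the underlying idea the commit describes: turning token log probabilities from an existing response into a confidence score (sample values are made up):

import math

# Token logprobs as they might accompany a completion (illustrative values only).
token_logprobs = [-0.05, -0.20, -1.60, -0.01]

# One simple aggregation: exponentiate the mean logprob to get a
# geometric-mean token probability in [0, 1].
confidence = math.exp(sum(token_logprobs) / len(token_logprobs))
print(f"{confidence:.2f}")  # ~0.63 for the sample values above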