@JeliHacker
Contributor

Fixes #63

Added a new TitleElementMerger processing step that identifies and merges adjacent TitleElement instances that:

  • Are on the same line: they share the same parent HTML element
  • Have the same level: merging preserves hierarchical consistency
  • Are adjacent: they appear consecutively in the element list
  • Should be merged: two separate, complete titles are left unmerged (edge case)
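
The four conditions above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Title` dataclass, its `parent_id` field, and the "PART"/"ITEM" prefix heuristic in `are_separate_complete_titles` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Title:
    """Simplified stand-in for a TitleElement (hypothetical fields)."""
    text: str
    level: int
    parent_id: int  # identifies the parent HTML element


def starts_like_heading(text: str) -> bool:
    # Assumed heuristic: a complete title begins with "PART " or "ITEM ".
    return text.strip().upper().startswith(("PART ", "ITEM "))


def are_separate_complete_titles(first: Title, second: Title) -> bool:
    # Edge case: both fragments look like headings in their own right.
    return starts_like_heading(first.text) and starts_like_heading(second.text)


def can_merge(first: Title, second: Title) -> bool:
    return (
        first.parent_id == second.parent_id  # same line / same parent
        and first.level == second.level      # same level
        and not are_separate_complete_titles(first, second)
    )


def merge_adjacent_titles(elements: List[object]) -> List[object]:
    merged: List[object] = []
    for element in elements:
        previous = merged[-1] if merged else None
        # Adjacency: only the immediately preceding element is considered.
        if (
            isinstance(element, Title)
            and isinstance(previous, Title)
            and can_merge(previous, element)
        ):
            merged[-1] = Title(
                previous.text + element.text, previous.level, previous.parent_id
            )
        else:
            merged.append(element)
    return merged


split = [Title("PART I. FINANCI", 0, 1), Title("AL INFORMATION", 0, 1)]
print(merge_adjacent_titles(split)[0].text)  # PART I. FINANCIAL INFORMATION
```

Because each merged title replaces the last entry in the output list, a title split into three or more fragments is folded together one fragment at a time.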

Tests pass, but the change affects the accuracy for the MSFT 10-Q in the snapshot-verify task.

- Add TitleElementMerger processing step to both 10-Q and 10-K parsers
- Export TitleElementMerger in processing_steps __init__.py
- Include the TitleElementMerger implementation

This addresses issues where section titles are split across multiple HTML elements
and need to be merged back together (e.g., 'PART I. FINANCI' + 'AL INFORMATION').

- Refactor _can_merge_with_batch to reduce return statements from 7 to 6
- Extract _are_separate_complete_titles helper method for better organization
- Fix mypy type errors with proper casting and bool() conversion
- All pre-commit checks now pass: tests, linting, and type checking

Development

Successfully merging this pull request may close these issues.

Fix the TopSectionTitle being split in MSFT filing
