[SPARKNLP-1161] Adding features to PDF Reader #14596

Draft: wants to merge 4 commits into base: master


@danilojsl (Contributor) commented Jun 4, 2025

Description

This PR introduces two new configurable parameters and one new output column to the PdfToText transformer and PDF Reader to enrich PDF parsing:

  • extractCoordinates: When enabled, outputs spatial metadata (text position and dimensions) per character in the PDF. Outputs are stored in a new column as a positions array containing structured page coordinate mappings.

  • normalizeLigatures: When extractCoordinates is enabled, this option normalizes ligature characters (e.g., ﬁ, ﬂ, œ) to their decomposed forms (fi, fl, oe), so that these typographic ligatures are not treated as distinct characters in downstream text analysis.

  • exception: A new output column for fault tolerance. It captures and logs any processing error encountered while handling an individual PDF document.
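
To illustrate what per-character coordinates make possible, here is a minimal Python sketch that merges character boxes into word-level bounding boxes. The `CharBox` record and its field names are assumptions made for this example; they are not the schema of the positions column produced by this PR.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass(frozen=True)
class CharBox:
    # Hypothetical per-character record; field names are assumptions,
    # not the actual output schema of extractCoordinates.
    char: str
    x: float
    y: float
    width: float
    height: float


def word_boxes(chars: List[CharBox]) -> List[Tuple[str, BBox]]:
    """Group consecutive non-space characters into words with a bounding box."""
    words: List[Tuple[str, BBox]] = []
    current: List[CharBox] = []

    def flush() -> None:
        # Emit the word collected so far, if any, and reset the buffer.
        if current:
            text = "".join(b.char for b in current)
            box = (
                min(b.x for b in current),
                min(b.y for b in current),
                max(b.x + b.width for b in current),
                max(b.y + b.height for b in current),
            )
            words.append((text, box))
            current.clear()

    for c in chars:
        if c.char.isspace():
            flush()
        else:
            current.append(c)
    flush()  # emit the last word
    return words


sample = [
    CharBox("H", 10, 100, 6, 10),
    CharBox("i", 16, 100, 3, 10),
    CharBox(" ", 19, 100, 3, 10),
    CharBox("y", 22, 100, 6, 10),
    CharBox("o", 28, 100, 6, 10),
]
print(word_boxes(sample))
```

This grouping splits only on whitespace characters. Real layouts may also need line and column detection based on the coordinates themselves.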

These changes provide the following benefits:

  • Fine-grained coordinate mapping for each character enables spatial reasoning and layout-aware models.
  • Ligature normalization improves text consistency and downstream linguistic accuracy, aligning extracted data with model expectations and training datasets.
  • Batch jobs are not interrupted by a single corrupt or malformed PDF.
  • Detailed error messages are recorded per document, supporting granular debugging and post-analysis.
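
The fault-tolerance behavior described above can be sketched in plain Python. This is only an illustration of the per-document error-capture pattern, not the PR's implementation: `toy_parse`, `parse_batch`, and the row layout (including a `None` text value on failure) are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Row = Dict[str, Optional[str]]


def toy_parse(content: bytes) -> str:
    # Hypothetical stand-in for a real PDF parser: it only checks the header.
    if not content.startswith(b"%PDF-"):
        raise ValueError("missing %PDF- header")
    return content.decode("latin-1")


def parse_batch(
    docs: Iterable[Tuple[str, bytes]],
    parse: Callable[[bytes], str] = toy_parse,
) -> List[Row]:
    """Parse every document; record failures per row instead of aborting."""
    rows: List[Row] = []
    for path, content in docs:
        try:
            rows.append({"path": path, "text": parse(content), "exception": None})
        except Exception as err:  # one bad file must not stop the batch
            rows.append(
                {
                    "path": path,
                    "text": None,
                    "exception": f"{type(err).__name__}: {err}",
                }
            )
    return rows


rows = parse_batch([("good.pdf", b"%PDF-1.7 hello"), ("bad.pdf", b"garbage")])
failed = [r["path"] for r in rows if r["exception"] is not None]
print(failed)
```

Filtering on the exception field afterwards separates the failed documents for debugging without re-running the whole batch.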

Motivation and Context

Many downstream NLP tasks, such as entity recognition, layout analysis, and table extraction, require the precise positional context of text elements in PDFs. Previously, the PdfToText transformer and PDF Reader provided only linear text extraction and discarded this spatial metadata.

Additionally, typographic ligatures (like ﬁ, ﬂ, or œ) can lead to inconsistent tokenization and entity boundary errors when not normalized. These characters often break string matching and skew model predictions in document processing pipelines.
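
As a small illustration of the string-matching problem, the following Python sketch shows a ligature defeating a substring search until it is decomposed. The mapping table and the `normalize_ligatures` helper are assumptions made for this example and cover only the three ligatures named above; they are not the PR's implementation.

```python
# Explicit mapping for the ligatures mentioned in this PR description.
# This table is illustrative, not the list used by normalizeLigatures.
LIGATURES = str.maketrans(
    {
        "\uFB01": "fi",  # LATIN SMALL LIGATURE FI
        "\uFB02": "fl",  # LATIN SMALL LIGATURE FL
        "\u0153": "oe",  # LATIN SMALL LIGATURE OE
    }
)


def normalize_ligatures(text: str) -> str:
    """Replace each known ligature with its decomposed letter sequence."""
    return text.translate(LIGATURES)


raw = "The \uFB01nance team reviewed the cash \uFB02ow."

# Before normalization the ligature is a single code point,
# so a plain substring search for "finance" fails.
print("finance" in raw)

# After normalization the same search succeeds.
print("finance" in normalize_ligatures(raw))
```

An explicit table is used here because Unicode NFKC normalization decomposes ﬁ and ﬂ but leaves œ unchanged.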

How Has This Been Tested?

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@danilojsl self-assigned this Jun 4, 2025
@DevinTDHa changed the base branch from master to release/603-release-candidate June 6, 2025 15:21
@DevinTDHa marked this pull request as draft June 10, 2025 10:56
@danilojsl changed the base branch from release/603-release-candidate to master June 11, 2025 22:56

2 participants