[SPARKNLP-1161] Adding features to PDF Reader #14596
Draft
Description
This PR introduces two new configurable parameters to the PdfToText transformer and PDF Reader to enrich PDF parsing, plus a new output column:
extractCoordinates: When enabled, outputs spatial metadata (text position and dimensions) for each character in the PDF. The output is stored in a new column as a positions array containing structured page coordinate mappings.
normalizeLigatures: When extractCoordinates is enabled, this option normalizes ligature characters (e.g., ﬁ, ﬂ, œ) to their decomposed forms (fi, fl, oe). This prevents typographic ligatures from being interpreted as distinct characters in downstream text analysis.
exception: New output column for fault tolerance. A new exception column captures and logs any processing errors encountered when handling individual PDF documents.
This enhancement ensures:
Motivation and Context
Many downstream NLP tasks, such as entity recognition, layout analysis, and table extraction, require precise positional context of text elements in PDFs. Previously, these components provided only linear text extraction, losing valuable spatial metadata.
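As an illustration of why per-character positions matter downstream, the following minimal sketch computes the bounding box of a text span (for example, an entity mention) from character boxes. The `CharBox` record and its field names are hypothetical and do not reflect the actual schema of the positions column; the sketch only assumes each character carries an x/y origin plus a width and height, with (x, y) being the corner with the smallest coordinates.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CharBox:
    # Hypothetical per-character record; the real positions column
    # may use different field names and structure.
    char: str
    x: float
    y: float
    width: float
    height: float


def span_bounding_box(boxes: List[CharBox], start: int, end: int) -> Tuple[float, float, float, float]:
    """Return (x, y, width, height) enclosing boxes[start:end]."""
    span = boxes[start:end]
    if not span:
        raise ValueError("empty span")
    left = min(b.x for b in span)
    top = min(b.y for b in span)
    right = max(b.x + b.width for b in span)
    bottom = max(b.y + b.height for b in span)
    return (left, top, right - left, bottom - top)


# Three characters laid out left to right on one line.
boxes = [
    CharBox("P", 10.0, 20.0, 6.0, 12.0),
    CharBox("D", 16.0, 20.0, 7.0, 12.0),
    CharBox("F", 23.0, 20.0, 6.0, 12.0),
]
print(span_bounding_box(boxes, 0, 3))
```

With linear text extraction alone, a computation like this is not possible, because character offsets carry no page geometry.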
Additionally, typographic ligatures (like ﬁ, ﬂ, or œ) can lead to inconsistent tokenization and entity boundary errors when not normalized. These characters often distort string matching and model predictions in document processing pipelines.
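The effect of ligature normalization can be sketched in a few lines of standalone Python. This is not the implementation used in this PR; it is a minimal illustration that assumes the Latin presentation-form ligatures (U+FB00 to U+FB06) are expanded through Unicode compatibility decomposition, and that œ, which has no Unicode decomposition, is handled by an explicit, hypothetical mapping table.

```python
import unicodedata

# Hypothetical table for ligatures that Unicode does not decompose.
EXTRA_LIGATURES = {"\u0153": "oe", "\u0152": "OE"}  # œ, Œ


def normalize_ligatures(text: str) -> str:
    """Expand typographic ligatures into their component letters."""
    out = []
    for ch in text:
        if ch in EXTRA_LIGATURES:
            out.append(EXTRA_LIGATURES[ch])
        elif 0xFB00 <= ord(ch) <= 0xFB06:
            # Latin presentation-form ligatures such as U+FB01 (fi) and U+FB02 (fl)
            out.append(unicodedata.normalize("NFKD", ch))
        else:
            out.append(ch)
    return "".join(out)


raw = "\ufb01le \ufb02ow c\u0153ur"  # "file flow coeur" written with ligature glyphs
print(normalize_ligatures(raw))
print("file" in raw, "file" in normalize_ligatures(raw))
```

The last line shows the string-matching problem described above: a plain search for "file" fails on the raw text and succeeds after normalization. Restricting the decomposition to the ligature range, rather than applying NFKD to the whole string, leaves accented and other composed characters untouched.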
How Has This Been Tested?
Screenshots (if appropriate):
Types of changes
Checklist: