[SPARKNLP-1161] Adding features to PDF Reader #14596

danilojsl · 2025-06-04T21:47:45Z

Description

This PR introduces two new configurable parameters and one new output column to the PdfToText transformer and PDF Reader to enrich PDF parsing:

extractCoordinates: When enabled, outputs spatial metadata (text position and dimensions) for each character in the PDF. The output is stored in a new column as a positions array containing structured page coordinate mappings.
normalizeLigatures: When extractCoordinates is enabled, this option normalizes ligature characters (e.g., ﬁ, ﬂ, œ) to their decomposed forms (fi, fl, oe). This prevents typographic ligatures from being interpreted as distinct characters in downstream text analysis.
exception: A new output column for fault tolerance. It captures and logs any processing errors encountered when handling individual PDF documents.
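The fault-tolerance idea behind the exception column can be sketched in plain Python. This is only an illustration of the pattern, not the code in this PR: `parse_batch`, `toy_parse`, and the row keys `path` and `text` are hypothetical names, and the toy parser stands in for a real PDF parser.

```python
from typing import Callable, Dict, List, Optional


def parse_batch(
    documents: Dict[str, bytes],
    parse: Callable[[bytes], str],
) -> List[Dict[str, Optional[str]]]:
    """Parse every document, recording failures per row instead of raising."""
    rows: List[Dict[str, Optional[str]]] = []
    for path, content in documents.items():
        try:
            rows.append({"path": path, "text": parse(content), "exception": None})
        except Exception as err:  # one bad file must not abort the whole batch
            message = f"{type(err).__name__}: {err}"
            rows.append({"path": path, "text": None, "exception": message})
    return rows


def toy_parse(content: bytes) -> str:
    """Hypothetical stand-in for a real PDF parser."""
    if not content.startswith(b"%PDF-"):
        raise ValueError("missing PDF header")
    return content.decode("latin-1")


rows = parse_batch(
    {"good.pdf": b"%PDF-1.7 hello", "bad.pdf": b"not a pdf"},
    toy_parse,
)
for row in rows:
    print(row["path"], "->", row["exception"])
```

Rows that parsed cleanly carry an empty exception value, so failed documents can be filtered out afterwards without rerunning the batch.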

These enhancements ensure that:

Fine-grained coordinate mapping for each character enables spatial reasoning and layout-aware models.
Ligature normalization improves text consistency and downstream linguistic accuracy, aligning extracted data with model expectations and training datasets.
Batch jobs are not interrupted by a single corrupt or malformed PDF.
Detailed error messages are recorded per document, supporting granular debugging and post-analysis.
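As a rough illustration of what per-character coordinates make possible, the sketch below merges character boxes into word-level bounding boxes. The `CharBox` fields (`char`, `x`, `y`, `width`, `height`) are assumptions made for this example; the actual schema of the positions array may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class CharBox:
    # Hypothetical field names, chosen for illustration only.
    char: str
    x: float
    y: float
    width: float
    height: float


def word_boxes(chars: List[CharBox]) -> List[Tuple[str, BBox]]:
    """Group consecutive non-space characters into (word, bounding box) pairs."""
    words: List[Tuple[str, BBox]] = []
    current: List[CharBox] = []
    # A trailing whitespace sentinel flushes the last word.
    for box in chars + [CharBox(" ", 0.0, 0.0, 0.0, 0.0)]:
        if not box.char.isspace():
            current.append(box)
            continue
        if current:
            text = "".join(c.char for c in current)
            x0 = min(c.x for c in current)
            y0 = min(c.y for c in current)
            x1 = max(c.x + c.width for c in current)
            y1 = max(c.y + c.height for c in current)
            words.append((text, (x0, y0, x1, y1)))
            current = []
    return words


chars = [
    CharBox("H", 10.0, 20.0, 6.0, 10.0),
    CharBox("i", 16.0, 20.0, 3.0, 10.0),
    CharBox(" ", 19.0, 20.0, 3.0, 10.0),
    CharBox("!", 22.0, 20.0, 3.0, 10.0),
]
boxes = word_boxes(chars)
print(boxes)
```

The same grouping step could be extended to lines or table cells by comparing the vertical coordinates of neighbouring boxes.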

Motivation and Context

Many downstream NLP tasks, such as entity recognition, layout analysis, and table extraction, require the precise positional context of text elements in PDFs. Previously, PdfToText and the PDF Reader provided only linear text extraction, discarding this spatial metadata.

Additionally, typographic ligatures (like ﬁ, ﬂ, or œ) can lead to inconsistent tokenization and entity boundary errors when not normalized. These characters often distort string matching and model predictions in document processing pipelines.
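A minimal sketch of ligature normalization, assuming a simple lookup table; it does not reproduce the implementation in CustomStripper.java. The table is illustrative and not exhaustive. An explicit mapping is used because Unicode NFKC normalization folds ﬁ and ﬂ but leaves œ unchanged.

```python
# Illustrative, non-exhaustive mapping from ligature code points
# to their decomposed letter sequences.
LIGATURES = {
    "\ufb00": "ff",   # ﬀ
    "\ufb01": "fi",   # ﬁ
    "\ufb02": "fl",   # ﬂ
    "\ufb03": "ffi",  # ﬃ
    "\ufb04": "ffl",  # ﬄ
    "\u0153": "oe",   # œ
    "\u0152": "OE",   # Œ
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans(LIGATURES)


def normalize_ligatures(text: str) -> str:
    """Replace each known ligature with its decomposed form."""
    return text.translate(_TABLE)


print(normalize_ligatures("\ufb01nal \ufb02ow c\u0153ur"))  # final flow coeur
```

After normalization, a plain string match such as `"final" in text` succeeds even when the PDF encoded the word with the ﬁ ligature.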

How Has This Been Tested?

Screenshots (if appropriate):

Types of changes

Bug fix (non-breaking change which fixes an issue)
Code improvements with no or little impact
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING page.
I have added tests to cover my changes.
All new and existing tests passed.

[SPARKNLP-1161] Adding extractCoordinates and normalizeLigatures to PDF reader

a5f8a10

danilojsl self-assigned this Jun 4, 2025

danilojsl requested review from maziyarpanahi and DevinTDHa June 4, 2025 21:48

[SPARKNLP-1161] Updating PDF reader Demo notebook [skip test]

064e174

DevinTDHa reviewed Jun 6, 2025

examples/python/reader/SparkNLP_PDF_Reader_Demo.ipynb

src/main/scala/com/johnsnowlabs/reader/util/pdf/CustomStripper.java

DevinTDHa changed the base branch from master to release/603-release-candidate June 6, 2025 15:21

[SPARKNLP-1161] Fix typos in PDF reader Demo notebook [skip test]

940268b

DevinTDHa marked this pull request as draft June 10, 2025 10:56

[SPARKNLP-1162] Adding exceptions log column

460e33a

danilojsl changed the base branch from release/603-release-candidate to master June 11, 2025 22:56

danilojsl added the enhancement label Jun 11, 2025
