Name objects after inline images are considered binary data #3172

yaseerapure · 2025-03-11T10:11:18Z

I'm trying to read a PDF using pypdf, but it raises the error below, although my PDF file is not corrupted. When I downgrade pypdf from 5.3.0 to 5.1.0, the error goes away.
PdfReadError: Unexpected end of stream

Environment

Ubuntu 20.0

Code + PDF

This is a minimal, complete example that shows the issue:

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

DATA_PATH = 'data/'

def load_pdf_files(data):
    # Load every PDF in the directory through PyPDFLoader (which reads it with pypdf)
    loader = DirectoryLoader(data, glob='*.pdf', loader_cls=PyPDFLoader)
    documents = loader.load()
    return documents

documents = load_pdf_files(data=DATA_PATH)
print("length of documents", len(documents))

This is the pdf file I'm using
https://www.academia.edu/32752835/The_GALE_ENCYCLOPEDIA_of_MEDICINE_SECOND_EDITION

stefan6419846 · 2025-03-11T10:42:15Z

Thanks for the report. This error is caused by our refined handling of inline images. We currently detect

BI
/CS/G
/W 57
/H 55
/BPC 8
/I true
/F/Fl
/DP<</Predictor 15
/Columns 57>>
....
EI Q
/R10 gs
/R12 cs

(part of the content stream of page 2) as an inline image followed by binary data, because the name /R10 has more than three characters.

I did some quick testing and seem to have found a fix. A PR will follow in the next few days, once I have found time to test it properly.
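
To illustrate the failure mode described above, here is a minimal sketch of a length-based check on the tokens after EI, next to a variant that also accepts names and numbers. This is not pypdf's actual implementation: the function names, the three-token look-ahead, and the exact rules are assumptions made for illustration only.

```python
import re

# Assumption: content stream operators are at most three characters long
# (for example "Q", "gs", "BDC"), so longer tokens look suspicious.
MAX_OPERATOR_LENGTH = 3
LOOKAHEAD_TOKENS = 3  # hypothetical look-ahead window

NUMBER = re.compile(rb"[+-]?(\d+\.?\d*|\.\d+)")


def is_real_ei_naive(following: bytes) -> bool:
    """Accept EI only if the next few tokens are all short enough to be operators."""
    tokens = following.split()[:LOOKAHEAD_TOKENS]
    return all(len(token) <= MAX_OPERATOR_LENGTH for token in tokens)


def is_real_ei_refined(following: bytes) -> bool:
    """Like the naive check, but also accept name objects and numbers of any length."""
    for token in following.split()[:LOOKAHEAD_TOKENS]:
        # Bytes outside printable ASCII suggest we are still inside image data.
        if not all(0x21 <= byte <= 0x7E for byte in token):
            return False
        if token.startswith(b"/"):  # name object such as /R10
            continue
        if NUMBER.fullmatch(token):  # operand such as 12.5
            continue
        if len(token) > MAX_OPERATOR_LENGTH:
            return False
    return True


# The bytes following EI in the snippet above.
after_ei = b"Q\n/R10 gs\n/R12 cs\n"
print(is_real_ei_naive(after_ei))    # /R10 is four characters long, so EI is rejected
print(is_real_ei_refined(after_ei))  # /R10 is recognised as a name, so EI is accepted
```

With the naive check, the EI marker is rejected and the parser keeps scanning for another one, which matches the "Unexpected end of stream" symptom when none is found.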


stefan6419846 changed the title ~~PdfReadError: Unexpected end of stream~~ Name objects after inline images are considered binary data Mar 11, 2025

stefan6419846 added the is-bug label (From a user's perspective, this is a bug: a violation of the expected behavior with a compliant PDF) Mar 11, 2025

stefan6419846 added a commit to stefan6419846/pypdf that referenced this issue Mar 12, 2025

BUG: Fix detection of inline images followed by names or numbers

791b121

Closes py-pdf#3172.

stefan6419846 linked a pull request Mar 12, 2025 that will close this issue

BUG: Fix detection of inline images followed by names or numbers #3173

Open
