Name objects after inline images are considered binary data #3172

Open
yaseerapure opened this issue Mar 11, 2025 · 1 comment · May be fixed by #3173
Labels
is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF

Comments

@yaseerapure

I'm trying to read a PDF with pypdf, but it raises the error below even though the file is not corrupted. Downgrading from version 5.3.0 to 5.1.0 resolves the error.
PdfReadError: Unexpected end of stream

Environment

Ubuntu 20.0

Code + PDF

This is a minimal, complete example that shows the issue:

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

DATA_PATH = 'data/'

def load_pdf_files(data):
    # Load every PDF in the directory; PyPDFLoader reads them with pypdf.
    loader = DirectoryLoader(data, glob='*.pdf', loader_cls=PyPDFLoader)
    documents = loader.load()
    return documents

documents = load_pdf_files(data=DATA_PATH)
print("length of documents", len(documents))

This is the PDF file I'm using:
https://www.academia.edu/32752835/The_GALE_ENCYCLOPEDIA_of_MEDICINE_SECOND_EDITION

@stefan6419846
Collaborator

Thanks for the report. This error is caused by our refined handling of inline images. We currently detect

BI
/CS/G
/W 57
/H 55
/BPC 8
/I true
/F/Fl
/DP<</Predictor 15
/Columns 57>>
....
EI Q
/R10 gs
/R12 cs

(part of the content stream of page 2) as an inline image followed by binary data, because the name /R10 has more than three characters.
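To illustrate the kind of misdetection described above, here is a toy sketch in Python. It is not pypdf's actual code: the function names are hypothetical, and both the "at most three characters" rule and the name-aware variant are assumptions modelled only on the description in this comment. It compares how the two checks treat the tokens that follow the EI marker in the excerpt.

```python
# Toy model only: NOT pypdf's actual implementation.
# All function names here are hypothetical.

def naive_is_content_token(token: bytes) -> bool:
    """Assumed naive rule: content stream operators have at most three characters."""
    return len(token) <= 3


def name_aware_is_content_token(token: bytes) -> bool:
    """Assumed refinement: also accept name objects, which start with '/'."""
    return token.startswith(b"/") or len(token) <= 3


# Tail of the excerpt above, starting at the EI marker.
tail = b"EI Q\n/R10 gs\n/R12 cs\n"
following = tail.split()[1:3]  # the two tokens right after EI

naive_ok = all(naive_is_content_token(t) for t in following)
name_aware_ok = all(name_aware_is_content_token(t) for t in following)

print("naive check accepts the tail as content stream:", naive_ok)
print("name-aware check accepts the tail as content stream:", name_aware_ok)
```

In this sketch the naive check rejects the tail because /R10 is four characters long, while the name-aware variant accepts it. The actual fix in the linked PR may work differently.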

I did some quick testing and seem to have found a fix. A PR will follow in the next few days, once I have had time to test it properly.

@stefan6419846 changed the title from "PdfReadError: Unexpected end of stream" to "Name objects after inline images are considered binary data" on Mar 11, 2025
@stefan6419846 added the is-bug label on Mar 11, 2025
stefan6419846 added a commit to stefan6419846/pypdf that referenced this issue Mar 12, 2025