I'm trying to read a PDF using pypdf, but it raises the error below even though my PDF file is not corrupted. When I downgrade from version 5.3.0 to 5.1.0, the error goes away.
PdfReadError: Unexpected end of stream
Environment
Ubuntu 20.0
Code + PDF
This is a minimal, complete example that shows the issue:
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

DATA_PATH = 'data/'

def load_pdf_files(data):
    # Load every PDF in the directory, one PyPDFLoader per file
    loader = DirectoryLoader(data, glob='*.pdf', loader_cls=PyPDFLoader)
    documents = loader.load()
    return documents

documents = load_pdf_files(data=DATA_PATH)
print("length of documents", len(documents))
(part of the content stream of page 2) as the inline image being followed by binary data due to the name /R10 having more than three characters.
I did some quick testing, and it seems I have found a fix for this. A PR will follow in the next few days, once I have found more time to test it properly.
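To illustrate the diagnosis above, here is a toy sketch of how a length-based check on the token after an inline image's `EI` marker could misclassify a name such as `/R10`. This is not pypdf's actual code: the function names are hypothetical, and the "at most three characters" rule is an assumption based on the fact that PDF content-stream operators are never longer than three characters.

```python
def next_token(following: bytes) -> bytes:
    """Return the first whitespace-delimited token after the EI marker."""
    parts = following.split(None, 1)
    return parts[0] if parts else b""


def looks_like_binary_naive(following: bytes) -> bool:
    """Hypothetical heuristic: anything longer than an operator is image data."""
    # Content-stream operators have at most three characters (e.g. Q, gs, BDC),
    # so a longer token is assumed to be a continuation of the binary data.
    return len(next_token(following)) > 3


def looks_like_binary_name_aware(following: bytes) -> bool:
    """Same heuristic, but a name object (leading slash) counts as valid syntax."""
    token = next_token(following)
    if token.startswith(b"/"):
        # A name such as /R10 is an operand of the next operator, not image data.
        return False
    return len(token) > 3


after_ei = b" /R10 gs\n"
print(looks_like_binary_naive(after_ei))       # True: /R10 is mistaken for binary data
print(looks_like_binary_name_aware(after_ei))  # False: /R10 is recognized as a name
```

With the naive variant, the real `EI` is skipped and the parser keeps searching for another end marker until the stream runs out, which would match the reported "Unexpected end of stream" error.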
stefan6419846 changed the title from "PdfReadError: Unexpected end of stream" to "Name objects after inline images are considered binary data" on Mar 11, 2025
stefan6419846 added the is-bug label (From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF) on Mar 11, 2025
This is the PDF file I'm using:
https://www.academia.edu/32752835/The_GALE_ENCYCLOPEDIA_of_MEDICINE_SECOND_EDITION