not indexing any PDF files (PDFStreamEngine stream of error) #1591

Open
jafooool opened this issue Oct 8, 2024 · 1 comment
jafooool commented Oct 8, 2024

Describe the bug
I added a trove of PDF files and launched indexing, but all I get is:

Error writing:
org.apache.tika.sax.TaggedSAXException: Error writing:
org.xml.sax.SAXException: Error writing:
java.io.IOException: Read end dead
2024-10-08 10:35:10,354 [Apache Tika: XXXXX.pdf] WARN PDFStreamEngine - org.apache.tika.sax.TaggedSAXException: Error writing:
org.apache.tika.sax.TaggedSAXException: Error writing:
org.xml.sax.SAXException: Error writing:
java.io.IOException: Read end dead

And so on for every file. No PDF files get indexed.

Desktop (please complete the following information):

  • OS: Windows 10 with 32 GB, on an Intel chip
  • Browser: Firefox 131.0

Datashare version: the latest available.

bamthomas (Collaborator) commented

Which datashare version?

I tried with 18.3.0 (the latest, released yesterday) and it works fine with PDFs (with and without OCR).

I've already seen this kind of error when there is a low-level issue with file access or with badly encoded PDF files.

Are you sure that your PDF files are not corrupted?
Or that the access to the filesystem is OK?
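
One rough way to check both points at once is a small script like the sketch below. It is not part of Datashare or Tika; the function names `check_pdf` and `scan` are made up for illustration. It only tests that each file can be opened and read, that a `%PDF-` marker appears near the start, and that a `%%EOF` marker appears near the end. A file that passes can still be damaged internally, so treat this as a first filter, not a validator.

```python
import os
from pathlib import Path


def check_pdf(path):
    """Return 'ok', or a short reason why the file looks unreadable or truncated.

    Heuristic only: looks for the %PDF- marker in the first 1024 bytes
    and the %%EOF marker in the last 1024 bytes.
    """
    try:
        with open(path, "rb") as f:
            head = f.read(1024)
            f.seek(0, os.SEEK_END)
            size = f.tell()
            f.seek(max(0, size - 1024))
            tail = f.read()
    except OSError as e:
        # Covers missing files, permission errors and other filesystem problems.
        return f"unreadable: {e.__class__.__name__}"
    if size == 0:
        return "empty file"
    if b"%PDF-" not in head:
        return "missing %PDF- header"
    if b"%%EOF" not in tail:
        return "missing %%EOF trailer"
    return "ok"


def scan(root):
    """Check every *.pdf file (case-insensitive) under root, recursively."""
    results = {}
    for p in sorted(Path(root).rglob("*")):
        if p.is_file() and p.suffix.lower() == ".pdf":
            results[str(p)] = check_pdf(p)
    return results
```

Calling `scan()` on the folder you are indexing and printing every entry whose status is not `"ok"` should show quickly whether the problem is file access (`unreadable: ...`) or file content (missing header or trailer).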
