Handle per-object file identifiers for encryption #42

Open
kyakuno opened this issue Oct 3, 2023 · 9 comments
Labels: enhancement (New feature or request)

Comments

@kyakuno

kyakuno commented Oct 3, 2023

Describe the bug
inflate failed with the error "Unable to decompress stream data: Data error."
The return code of inflate is Z_DATA_ERROR.
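
For context, Z_DATA_ERROR is the status zlib reports when its input is not a valid deflate stream. The sketch below uses Python's zlib module rather than pdfio's C code, and `can_inflate` is a hypothetical helper. It only illustrates that scrambled bytes (for example, data that is still encrypted or was decrypted with the wrong key) fail at the inflate step.

```python
import zlib

def can_inflate(data: bytes) -> bool:
    """Return True if data is a complete, valid zlib stream."""
    try:
        zlib.decompress(data)
        return True
    except zlib.error:
        # Python raises zlib.error for zlib failures such as Z_DATA_ERROR.
        return False

good = zlib.compress(b"BT /F1 12 Tf (Hello) Tj ET" * 4)
# Stand-in for "still encrypted" bytes: XOR every byte with a constant.
scrambled = bytes(b ^ 0x5A for b in good)

print(can_inflate(good))       # True
print(can_inflate(scrambled))  # False
```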

To Reproduce
Steps to reproduce the behavior:

  1. Download the pdf (https://www.axell.co.jp/business/pdf/AX51903_DS06P_hpdl202110xx.pdf)
  2. Run pdfiototext:
     ./pdfiototext AX51903_DS06P_hpdl202110xx.pdf

Expected behavior
The text is extracted successfully.

System Information:

  • OS: macOS Sonoma

Additional context

st->predictor is 12 = _PDFIO_PREDICTOR_PNG_UP.
The error seems to occur with PDFs that contain images.

stream st->filter 6 st->predictor 12
stream st->filter 6 st->predictor 1
stream st->filter 6 st->predictor 1
stream st->filter 6 st->predictor 1
stream st->filter 6 st->predictor 1
stream st->filter 6 st->predictor 1
stream st->filter 6 st->predictor 1
stream st->filter 6 st->predictor 12
AX51903_DS06P_hpdl202110xx.pdf: Unable to decompress stream data: Data error.
AX51903_DS06P_hpdl202110xx.pdf: Unable to find pages object.
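
As background on the predictor value in the log above: predictor 12 selects the PNG "Up" row filter, which is undone after the data has been inflated, so an inflate failure happens before this step runs. The sketch below is a minimal illustration under assumptions, not pdfio's implementation. `undo_png_predictor` and `row_bytes` are hypothetical names, and only filter tags 0 (None) and 2 (Up) are handled.

```python
def undo_png_predictor(data: bytes, row_bytes: int) -> bytes:
    """Reverse PNG row filters 0 (None) and 2 (Up) on already-inflated data."""
    stride = row_bytes + 1  # each row starts with a one-byte filter tag
    if len(data) % stride != 0:
        raise ValueError("data is not a whole number of rows")
    prior = bytes(row_bytes)  # the row above the first row is all zeros
    out = bytearray()
    for start in range(0, len(data), stride):
        tag = data[start]
        row = data[start + 1:start + stride]
        if tag == 0:
            decoded = bytes(row)
        elif tag == 2:
            # Up: each byte was stored as the difference from the byte above it.
            decoded = bytes((cur + up) & 0xFF for cur, up in zip(row, prior))
        else:
            raise ValueError(f"filter tag {tag} not handled in this sketch")
        out += decoded
        prior = decoded
    return bytes(out)

encoded = bytes([2, 10, 20, 30, 2, 1, 2, 3])  # two rows of three bytes each
print(list(undo_png_predictor(encoded, row_bytes=3)))  # [10, 20, 30, 11, 22, 33]
```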
@kyakuno
Author

kyakuno commented Oct 3, 2023

The issue occurred on both HEAD (87ca4db 2023/10/02 18:27) and v1.1.1.

@michaelrsweet
Owner

OK, so this is an encrypted PDF generated by what looks like an old MacOS 9 version of Acrobat. The object that isn't loading is a secondary xref stream, which is odd because the primary stream loaded just fine...

Investigating...

@michaelrsweet michaelrsweet added bug Something isn't working priority-medium and removed investigating labels Oct 6, 2023
@michaelrsweet michaelrsweet added this to the Stable milestone Oct 6, 2023
@kyakuno
Author

kyakuno commented Oct 8, 2023

Thank you very much for the investigation. I would be very happy if this file could be read.

@michaelrsweet
Owner

It looks like there is a broken object reference. I need to do a little digging, but I might need to tolerate this when opening the file and only throw an error when you actually try to load the broken reference.
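
The deferred-error idea can be pictured roughly as follows. This is a hypothetical sketch, not pdfio's API: `LazyObjects`, its fields, and the offsets are invented for illustration. The table accepts whatever the xref data says at open time and raises only when a broken entry is requested.

```python
class LazyObjects:
    """Hypothetical object table that tolerates broken references at open time."""

    def __init__(self, offsets: dict, file_size: int):
        self.offsets = offsets      # object number -> file offset (unchecked)
        self.file_size = file_size

    def load(self, number: int) -> int:
        """Return the object's file offset, raising only when it is requested."""
        offset = self.offsets.get(number)
        if offset is None or not 0 <= offset < self.file_size:
            raise LookupError(f"object {number} has a broken reference")
        return offset

# Made-up offsets: object 451 points past the end of a 4096-byte file.
objs = LazyObjects({1: 15, 451: 9_999_999}, file_size=4096)
print(objs.load(1))  # 15
```

Calling `objs.load(451)` raises `LookupError`, while every other object stays usable.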

@michaelrsweet
Owner

Looking back, the first error is the "Unable to decompress stream data" error, which is caused by a bogus xref stream in object 451.

@michaelrsweet
Owner

and this object has a different file key than the rest of the file...

@michaelrsweet
Owner

Deferring this to "future" since it will require a re-implementation of the crypto handler and I have never seen a PDF file containing two different file IDs.
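
For reference, in the PDF standard security handler the file ID feeds into the file encryption key, and each object's key is then derived from that file key plus the object and generation numbers. The sketch below covers only that per-object step for RC4 and AESV2, as described in the PDF 1.7 specification. It is not pdfio's code, and `object_key` is a hypothetical name.

```python
import hashlib

def object_key(file_key: bytes, number: int, generation: int, aes: bool = False) -> bytes:
    """Derive a per-object key from the file key (RC4 or AESV2)."""
    data = file_key
    data += (number & 0xFFFFFF).to_bytes(3, "little")    # low 3 bytes of object number
    data += (generation & 0xFFFF).to_bytes(2, "little")  # low 2 bytes of generation
    if aes:
        data += b"sAlT"  # extra salt appended for AES
    # The key is the first min(n + 5, 16) bytes of the MD5 digest.
    return hashlib.md5(data).digest()[:min(len(file_key) + 5, 16)]

file_key = bytes(range(16))  # placeholder 128-bit file key, not from the real PDF
print(object_key(file_key, 451, 0).hex())
```

Because the file key is an input here, an object encrypted under a different file key cannot be decrypted with keys derived this way from the main one.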

@michaelrsweet michaelrsweet added enhancement New feature or request and removed bug Something isn't working priority-medium labels Nov 15, 2023
@michaelrsweet michaelrsweet modified the milestones: Stable, Future Nov 15, 2023
@michaelrsweet michaelrsweet changed the title Unable to decompress stream data: Data error. Handle per-object file identifiers for encryption Dec 12, 2023
@michaelrsweet
Owner

The current code has an issue because it tries to decrypt the object dictionary while the dictionary is still being loaded; I need to split the code that decrypts string values out from the code that loads the object dictionary.
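
One way to picture that split: load the dictionary exactly as stored, then decrypt its string values in a separate pass. This is a hypothetical sketch, not the pdfio implementation. It assumes parsed strings are held as `bytes` and every other value type is left alone, and `decrypt_strings` and `xor_cipher` are invented names, with XOR standing in for the real cipher.

```python
def decrypt_strings(value, decrypt):
    """Return a copy of a parsed value with every string (bytes) decrypted."""
    if isinstance(value, bytes):
        return decrypt(value)
    if isinstance(value, dict):
        return {key: decrypt_strings(item, decrypt) for key, item in value.items()}
    if isinstance(value, list):
        return [decrypt_strings(item, decrypt) for item in value]
    return value  # names, numbers, and booleans pass through unchanged

def xor_cipher(data: bytes) -> bytes:
    """Toy stand-in for the real cipher; applying it twice restores the input."""
    return bytes(b ^ 0x20 for b in data)

# Pass 1 (loading) is assumed to have produced this dictionary untouched.
loaded = {"Type": "/Annot", "Rect": [0, 0, 10, 10],
          "Contents": xor_cipher(b"Hello"), "Extra": [xor_cipher(b"note")]}

# Pass 2: decrypt string values only after loading has finished.
print(decrypt_strings(loaded, xor_cipher)["Contents"])  # b'Hello'
```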

@michaelrsweet
Owner

OK, so for this file it actually looks like the per-object ID is the same as the main file ID, but the object itself is actually damaged. Xpdf doesn't ever try to load it so maybe it is an object that doesn't need to be loaded to use the file? Will be looking at that tomorrow...


2 participants