Handle per-object file identifiers for encryption #42

kyakuno · 2023-10-03T08:36:38Z

Describe the bug
I got the error "Unable to decompress stream data: Data error." from inflate.
The return code of inflate is Z_DATA_ERROR.

To Reproduce
Steps to reproduce the behavior:

Download the pdf (https://www.axell.co.jp/business/pdf/AX51903_DS06P_hpdl202110xx.pdf)
Run pdfiototext

./pdfiototext AX51903_DS06P_hpdl202110xx.pdf

Expected behavior
The text is extracted successfully.

System Information:

OS: macOS Sonoma

Additional context

st->predictor is 12 = _PDFIO_PREDICTOR_PNG_UP.
The error seems to occur with PDFs that contain images.

stream st->filter 6 st->predictor 12
stream st->filter 6 st->predictor 1
stream st->filter 6 st->predictor 1
stream st->filter 6 st->predictor 1
stream st->filter 6 st->predictor 1
stream st->filter 6 st->predictor 1
stream st->filter 6 st->predictor 1
stream st->filter 6 st->predictor 12
AX51903_DS06P_hpdl202110xx.pdf: Unable to decompress stream data: Data error.
AX51903_DS06P_hpdl202110xx.pdf: Unable to find pages object.
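
For readers unfamiliar with predictor 12: after inflate succeeds, each row of the decompressed data starts with a PNG filter tag byte, and tag 2 ("Up") means every byte is stored as a difference from the byte directly above it. The sketch below undoes that filter for tags 0 and 2 only. It is an illustration, not pdfio's code; the function name `png_rows_decode` and its parameters are made up for this example. It only covers the step that would run after a successful inflate, so it does not by itself explain the Z_DATA_ERROR above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: undo PNG row filtering for inflated stream data.
 *
 * `in` holds `in_len` bytes made of rows of (1 tag byte + `row_bytes` data
 * bytes).  Decoded rows are written to `out`, which must have room for
 * (in_len / (row_bytes + 1)) * row_bytes bytes.
 *
 * Only tag 0 (None) and tag 2 (Up) are handled; any other tag returns -1.
 * Returns the number of rows decoded on success.
 */
static int
png_rows_decode(const unsigned char *in, size_t in_len, size_t row_bytes,
                unsigned char *out)
{
  size_t stride = row_bytes + 1;	/* Tag byte plus row data */
  size_t rows, r, i;

  assert(in != NULL && out != NULL);

  if (row_bytes == 0 || (in_len % stride) != 0)
    return (-1);

  rows = in_len / stride;

  for (r = 0; r < rows; r ++)
  {
    const unsigned char *src  = in + r * stride;
    unsigned char       *dst  = out + r * row_bytes;
    const unsigned char *prev = r > 0 ? dst - row_bytes : NULL;

    switch (src[0])
    {
      case 0 : /* None: copy the row as-is */
          memcpy(dst, src + 1, row_bytes);
          break;

      case 2 : /* Up: add the decoded byte above (0 for the first row) */
          for (i = 0; i < row_bytes; i ++)
            dst[i] = (unsigned char)(src[i + 1] + (prev ? prev[i] : 0));
          break;

      default : /* Sub, Average, and Paeth are not covered by this sketch */
          return (-1);
    }
  }

  return ((int)rows);
}
```

With two 3-byte rows that are both tagged "Up", the input `2 1 2 3 2 1 1 1` decodes to `1 2 3 2 3 4`.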


kyakuno · 2023-10-03T08:41:29Z

The issue occurred on both HEAD (87ca4db 2023/10/02 18:27) and v1.1.1.

michaelrsweet · 2023-10-06T16:10:51Z

OK, so this is an encrypted PDF generated by what looks like an old MacOS 9 version of Acrobat. The object that isn't loading is a secondary xref stream, which is odd because the primary stream loaded just fine...

Investigating...

kyakuno · 2023-10-08T12:01:56Z

Thank you very much for the investigation. I would be very happy if this file could be read.

michaelrsweet · 2023-11-15T01:01:43Z

It looks like there is a broken object reference. Need to do a little digging but I might need to allow for this and throw an error when you try to actually load the broken reference.

michaelrsweet · 2023-11-15T14:01:40Z

Looking back, the first error is the "Unable to decompress" error, which is caused by a bogus xref stream in object 451.

michaelrsweet · 2023-11-15T14:09:51Z

and this object has a different file key than the rest of the file...

michaelrsweet · 2023-11-15T14:18:27Z

Deferring this to "future" since it will require a re-implementation of the crypto handler and I have never seen a PDF file containing two different file IDs.

michaelrsweet · 2023-12-14T20:05:02Z

The current code has an issue because it tries to decrypt the object dictionary while the dictionary is still being loaded; I need to split the code that decrypts string values out from the code that loads the object dictionary.
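
As a rough illustration of that split (not the actual pdfio code), the sketch below loads values verbatim in one phase and decrypts string values in a separate pass that runs once loading is complete. Every name here (`value_t`, `object_t`, `object_add_value`, `object_decrypt`, `toy_xor`) is hypothetical, and the XOR callback is only a stand-in for a real cipher.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_VALUES 4
#define MAX_BYTES  32

/* Hypothetical in-memory value: raw bytes plus a flag marking strings */
typedef struct
{
  unsigned char bytes[MAX_BYTES];
  size_t        len;
  int           is_string;
} value_t;

/* Hypothetical object: a flat list of values and a "decrypted" flag */
typedef struct
{
  value_t values[MAX_VALUES];
  size_t  num_values;
  int     decrypted;
} object_t;

typedef void (*decrypt_cb_t)(unsigned char *bytes, size_t len, void *ctx);

/* Toy stand-in for a real cipher: XOR every byte with a one-byte key */
static void
toy_xor(unsigned char *bytes, size_t len, void *ctx)
{
  unsigned char key = *(unsigned char *)ctx;
  size_t        i;

  for (i = 0; i < len; i ++)
    bytes[i] ^= key;
}

/* Phase 1: store a value exactly as read; nothing is decrypted here */
static int
object_add_value(object_t *obj, const void *bytes, size_t len, int is_string)
{
  value_t *v;

  assert(obj != NULL && bytes != NULL);

  if (obj->num_values >= MAX_VALUES || len > MAX_BYTES)
    return (-1);

  v = obj->values + obj->num_values;
  memcpy(v->bytes, bytes, len);
  v->len       = len;
  v->is_string = is_string;
  obj->num_values ++;

  return (0);
}

/* Phase 2: decrypt string values in place, at most once per object */
static void
object_decrypt(object_t *obj, decrypt_cb_t cb, void *ctx)
{
  size_t i;

  if (obj->decrypted)
    return;

  for (i = 0; i < obj->num_values; i ++)
  {
    if (obj->values[i].is_string)
      (*cb)(obj->values[i].bytes, obj->values[i].len, ctx);
  }

  obj->decrypted = 1;
}
```

Keeping the second phase separate means the loader never needs a key while parsing, and the `decrypted` flag guards against decrypting the same object twice.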

michaelrsweet · 2023-12-20T01:14:49Z

OK, so for this file it actually looks like the per-object ID is the same as the main file ID, but the object itself is actually damaged. Xpdf doesn't ever try to load it so maybe it is an object that doesn't need to be loaded to use the file? Will be looking at that tomorrow...

michaelrsweet self-assigned this Oct 3, 2023

michaelrsweet added the investigating label Oct 3, 2023

michaelrsweet added the bug and priority-medium labels and removed the investigating label Oct 6, 2023

michaelrsweet added this to the Stable milestone Oct 6, 2023

michaelrsweet added the enhancement label and removed the bug and priority-medium labels Nov 15, 2023

michaelrsweet modified the milestones: Stable, Future Nov 15, 2023

michaelrsweet changed the title ~~Unable to decompress stream data: Data error.~~ Handle per-object file identifiers for encryption Dec 12, 2023

michaelrsweet added a commit that referenced this issue Dec 13, 2023

Support per-object file IDs (Issue #42)

2b92044

michaelrsweet added a commit that referenced this issue Dec 14, 2023

Defer object/value decryption to after the object is loaded (Issue #42)

7330cc3

michaelrsweet mentioned this issue Feb 9, 2024

Unable to read files password-protected by Adobe Protect PDF online tool #62

Closed
