AssertionError trying to read content stream starting with \x00\xfe\x00\xff #3470

@AlbertP3

Description

The PDF can be read, but applying a transformation such as scaling or padding raises an AssertionError at assert org is not None in TextStringObject. I've traced this to a sequence in the content stream that starts with UTF-16 BOM-like bytes: '\x00\xfe\x00\xff\x00\xf8\x01\n'. Adobe Acrobat displays the same file correctly.
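A minimal sketch in plain Python (no pypdf involved) of what the reported bytes look like after the utf-16be decode that appears in the traceback. The final startswith check is an assumption about what guards the failing assert, not a quote from pypdf's source:

```python
# Bytes of the string operand from the report.
raw = b"\x00\xfe\x00\xff\x00\xf8\x01\n"

# The traceback shows these bytes being decoded as UTF-16BE.
decoded = raw.decode("utf-16be")

# Code points of the decoded text.
code_points = [hex(ord(ch)) for ch in decoded]
print(code_points)  # ['0xfe', '0xff', '0xf8', '0x10a']

# ASSUMPTION: a character-wise BOM check like this one runs on the decoded
# str. The text starts with U+00FE U+00FF, so it matches, although the
# value is already a str and no original bytes are attached to it.
looks_like_bom = decoded.startswith(("\xfe\xff", "\xff\xfe"))
print(looks_like_bom)  # True
```

If that assumption holds, the decoded text is mistaken for undecoded UTF-16 data, which would explain why org is None when the assert is reached.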

The simplest workaround is to replace assert org is not None, "mypy" in _base.py::TextStringObject::__new__ with:

if org is None: 
    org = value.encode("utf-16")

However, I suspect this could cause serious problems down the road, so a more robust approach is probably needed.
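One concrete reason for caution, sketched in plain Python: the "utf-16" codec prepends its own BOM in the platform's byte order, so the bytes produced by the workaround are not the bytes that were in the content stream. Whether pypdf later writes these stored bytes back out is an assumption here, not something verified:

```python
import codecs

# Bytes of the string operand from the report, and the text pypdf
# derives from them according to the traceback.
raw = b"\x00\xfe\x00\xff\x00\xf8\x01\n"
value = raw.decode("utf-16be")

# What the workaround would store as the "original" bytes.
org = value.encode("utf-16")

# The codec adds a BOM (little- or big-endian, depending on the platform).
has_bom = org[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Decoding the stored bytes gives the same text back...
round_trips = org.decode("utf-16") == value

# ...but they differ from the bytes actually read from the stream.
matches_stream = org == raw

print(has_bom, round_trips, matches_stream)  # True True False
```

So the workaround keeps the decoded text stable, but the stored bytes no longer reflect the source file.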

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.15.9-arch1-1-x86_64-with-glibc2.41

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==6.0.0, crypt_provider=('cryptography', '43.0.3'), PIL=11.3.0

Code + PDF

This is a minimal, complete example that shows the issue:

import io
from pypdf import PdfReader

def create_minimal_pdf_string(text: str):
    return f"""%PDF-1.0
1 0 obj
<</Type /Catalog /Pages 2 0 R>>
endobj
2 0 obj
<</Type /Pages /Kids [3 0 R] /Count 1>>
endobj
3 0 obj
<</Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Contents 4 0 R>>
endobj
4 0 obj
<< /Length {len(text.encode('charmap')) + 30} >>
stream
BT
/Helv 12 Tf
50 700 Td
({text}) Tj
ET
endstream
endobj
xref
0 5
0000000000 65535 f
0000000010 00000 n
0000000053 00000 n
0000000102 00000 n
0000000157 00000 n
trailer
<</Size 5 /Root 1 0 R>>
startxref
{157 + (len(text.encode('charmap')) + 30) + 10}
%%EOF
"""

troublesome_str = '\x00\xfe\x00\xff\x00\xf8\x01\n'
pdf = create_minimal_pdf_string(troublesome_str)
reader = PdfReader(io.BytesIO(pdf.encode("charmap")))
page = reader.get_page(0)
page.scale_to(595, page.mediabox.height * (595 / page.mediabox.height))

I'm unable to publish the original file due to copyright issues. The snippet above is my best effort to reproduce the issue.

Traceback

This is the complete traceback I see:

  File "test.py", line 270, in <module>
    page.scale_to(595, page.mediabox.height * (595 / page.mediabox.height))
  File "..../.venv/lib64/python3.9/site-packages/pypdf/_page.py", line 1661, in scale_to
    self.scale(sx, sy)
  File "..../.venv/lib64/python3.9/site-packages/pypdf/_page.py", line 1597, in scale
    self.add_transformation((sx, 0, 0, sy, 0, 0))
  File "..../.venv/lib64/python3.9/site-packages/pypdf/_page.py", line 1552, in add_transformation
    content = PageObject._add_transformation_matrix(content, self.pdf, ctm)
  File "..../.venv/lib64/python3.9/site-packages/pypdf/_page.py", line 1011, in _add_transformation_matrix
    contents.operations.insert(
  File "..../.venv/lib64/python3.9/site-packages/pypdf/generic/_data_structures.py", line 1432, in operations
    self._parse_content_stream(BytesIO(self._data))
  File "..../.venv/lib64/python3.9/site-packages/pypdf/generic/_data_structures.py", line 1325, in _parse_content_stream
    operands.append(read_object(stream, None, self.forced_encoding))
  File "..../.venv/lib64/python3.9/site-packages/pypdf/generic/_data_structures.py", line 1479, in read_object
    return read_string_from_stream(stream, forced_encoding)
  File "..../.venv/lib64/python3.9/site-packages/pypdf/generic/_utils.py", line 121, in read_string_from_stream
    return create_string_object(bytes(txt), forced_encoding)
  File "..../.venv/lib64/python3.9/site-packages/pypdf/generic/_utils.py", line 170, in create_string_object
    retval = TextStringObject(string.decode("utf-16be"))
  File "..../.venv/lib64/python3.9/site-packages/pypdf/generic/_base.py", line 658, in __new__
    assert org is not None
AssertionError

    Labels

    generic: The generic submodule is affected
    is-robustness-issue: From a user's perspective, this is about robustness
    needs-pdf: The issue needs a PDF file to show the problem
