Description
The PDF can be read without errors, but applying a transformation such as scaling or padding raises an AssertionError at assert org is not None in TextStringObject. I've traced the failure to a string in the content stream that begins with a UTF-16 BOM-like sequence: '\x00\xfe\x00\xff\x00\xf8\x01\n'. Adobe Acrobat displays the same file correctly.
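To illustrate why this sequence is awkward, here is a small check in plain Python that does not touch pypdf. It assumes, based on the string.decode("utf-16be") call visible in the traceback, that a byte string starting with \x00 gets decoded as UTF-16BE. How TextStringObject reacts afterwards is my reading of the failure, not something I have confirmed in the pypdf source.

```python
import codecs

# The operand of the Tj operator, exactly as it appears in the content stream.
raw = b"\x00\xfe\x00\xff\x00\xf8\x01\n"

# The bytes do not start with a real UTF-16 BOM (FE FF or FF FE).
has_bom = raw.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))

# Decoding as UTF-16BE pairs the bytes up: 00FE 00FF 00F8 010A.
decoded = raw.decode("utf-16be")

# The *decoded text* now begins with U+00FE U+00FF, which looks like a
# BOM that was read byte-by-byte. Assumption: a check of this kind is what
# sends TextStringObject down the branch containing the assert.
looks_like_bom = decoded.startswith(("\xfe\xff", "\xff\xfe"))

print(has_bom)         # False
print(looks_like_bom)  # True
```

In other words, the raw bytes carry no BOM, but the text obtained from them starts with the two characters that a BOM would produce if it were decoded one byte at a time.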
The simplest workaround is to replace assert org is not None, "mypy"
in TextStringObject.__new__ (_base.py) with:
if org is None:
    org = value.encode("utf-16")
However, I suspect this could cause serious issues down the road, and the problem probably calls for a more robust fix.
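One concrete reason for caution can be shown with the standard library alone: Python's "utf-16" codec prepends a BOM when encoding, so the bytes produced by the workaround differ from the bytes in the file. The sketch below only compares the byte strings. Whether an explicit "utf-16-be" encoding would be the right fix inside pypdf is an assumption on my side.

```python
# The operand as stored in the content stream, and the text obtained from it.
raw = b"\x00\xfe\x00\xff\x00\xf8\x01\n"
decoded = raw.decode("utf-16be")

# What the workaround stores: the "utf-16" codec adds a BOM and uses the
# platform's native byte order, so two extra bytes appear at the front.
workaround_bytes = decoded.encode("utf-16")

# Encoding with an explicit byte order adds no BOM and round-trips exactly.
explicit_bytes = decoded.encode("utf-16-be")

print(workaround_bytes == raw)            # False
print(explicit_bytes == raw)              # True
print(len(workaround_bytes) - len(raw))   # 2
```

So with the workaround the object no longer carries the original bytes, which could matter if they are written back out unchanged.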
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-6.15.9-arch1-1-x86_64-with-glibc2.41
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==6.0.0, crypt_provider=('cryptography', '43.0.3'), PIL=11.3.0
Code + PDF
This is a minimal, complete example that shows the issue:
import io

from pypdf import PdfReader


def create_minimal_pdf_string(text: str) -> str:
    # Build a one-page PDF whose content stream shows `text` with the Tj operator.
    return f"""%PDF-1.0
1 0 obj
<</Type /Catalog /Pages 2 0 R>>
endobj
2 0 obj
<</Type /Pages /Kids [3 0 R] /Count 1>>
endobj
3 0 obj
<</Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Contents 4 0 R>>
endobj
4 0 obj
<< /Length {len(text.encode('charmap')) + 30} >>
stream
BT
/Helv 12 Tf
50 700 Td
({text}) Tj
ET
endstream
endobj
xref
0 5
0000000000 65535 f
0000000010 00000 n
0000000053 00000 n
0000000102 00000 n
0000000157 00000 n
trailer
<</Size 5 /Root 1 0 R>>
startxref
{157 + (len(text.encode('charmap')) + 30) + 10}
%%EOF
"""
troublesome_str = '\x00\xfe\x00\xff\x00\xf8\x01\n'
pdf = create_minimal_pdf_string(troublesome_str)
reader = PdfReader(io.BytesIO(pdf.encode("charmap")))
page = reader.get_page(0)
page.scale_to(595, page.mediabox.height * (595 / page.mediabox.height))
I'm unable to publish the original file due to copyright issues. The snippet above is my best effort at reproducing the issue.
Traceback
This is the complete traceback I see:
  File "test.py", line 270, in <module>
    page.scale_to(595, page.mediabox.height * (595 / page.mediabox.height))
  File "..../.venv/lib64/python3.9/site-packages/pypdf/_page.py", line 1661, in scale_to
    self.scale(sx, sy)
  File "..../.venv/lib64/python3.9/site-packages/pypdf/_page.py", line 1597, in scale
    self.add_transformation((sx, 0, 0, sy, 0, 0))
  File "..../.venv/lib64/python3.9/site-packages/pypdf/_page.py", line 1552, in add_transformation
    content = PageObject._add_transformation_matrix(content, self.pdf, ctm)
  File "..../.venv/lib64/python3.9/site-packages/pypdf/_page.py", line 1011, in _add_transformation_matrix
    contents.operations.insert(
  File "..../.venv/lib64/python3.9/site-packages/pypdf/generic/_data_structures.py", line 1432, in operations
    self._parse_content_stream(BytesIO(self._data))
  File "..../.venv/lib64/python3.9/site-packages/pypdf/generic/_data_structures.py", line 1325, in _parse_content_stream
    operands.append(read_object(stream, None, self.forced_encoding))
  File "..../.venv/lib64/python3.9/site-packages/pypdf/generic/_data_structures.py", line 1479, in read_object
    return read_string_from_stream(stream, forced_encoding)
  File "..../.venv/lib64/python3.9/site-packages/pypdf/generic/_utils.py", line 121, in read_string_from_stream
    return create_string_object(bytes(txt), forced_encoding)
  File "..../.venv/lib64/python3.9/site-packages/pypdf/generic/_utils.py", line 170, in create_string_object
    retval = TextStringObject(string.decode("utf-16be"))
  File "..../.venv/lib64/python3.9/site-packages/pypdf/generic/_base.py", line 658, in __new__
    assert org is not None
AssertionError