AssertionError trying to read content stream starting with \x00\xfe\x00\xff

The PDF can be read without errors, but applying a transformation such as scaling or padding raises an AssertionError at `assert org is not None` in `TextStringObject`. I've pinpointed the cause to a string in the content stream that starts with a UTF-16 BOM-like sequence: '\x00\xfe\x00\xff\x00\xf8\x01\n'. The same file is displayed correctly in Adobe Acrobat.
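
To illustrate why I call the sequence "BOM-like", here is a stdlib-only sketch that does not touch pypdf at all. It decodes the bytes the same way the `string.decode("utf-16be")` call in the traceback does and inspects the result. The names `raw`, `decoded` and `starts_like_bom` are only for illustration, and the assumption that `TextStringObject.__new__` checks whether the decoded text starts with the characters `"\xfe\xff"` or `"\xff\xfe"` is my reading of the code, not something I have confirmed with the maintainers.

```python
# Bytes of the string operand as they appear in the content stream.
raw = b"\x00\xfe\x00\xff\x00\xf8\x01\n"

# Same decode call as in the traceback (create_string_object).
decoded = raw.decode("utf-16be")

# Each pair of bytes becomes one code point: U+00FE, U+00FF, U+00F8, U+010A.
print([hex(ord(ch)) for ch in decoded])

# Assumption: the BOM check looks at the first two *characters* of the text.
# The decoded text begins with U+00FE U+00FF, which resembles a UTF-16BE BOM.
starts_like_bom = decoded.startswith(("\xfe\xff", "\xff\xfe"))
print(starts_like_bom)
```

If that assumption holds, the decoded `str` looks like it carries a BOM, but no original bytes were passed along with it, which would explain why `org` is `None` when the assertion runs.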

The simplest workaround is to replace `assert org is not None, "mypy"` in `_base.py::TextStringObject::__new__` with:
```python
if org is None:
    org = value.encode("utf-16")
```
However, I suspect this could cause serious issues down the road, and a more robust approach is probably needed.
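
One concrete reason for my doubt, as a stdlib-only sketch: Python's `"utf-16"` codec writes a BOM in the platform's native byte order when encoding, so the bytes produced by the workaround are not the bytes that were in the file. The variable names here are illustrative only, and this does not show how pypdf would later use those bytes.

```python
import codecs

# Bytes as found in the content stream, and the text pypdf decodes from them.
original = b"\x00\xfe\x00\xff\x00\xf8\x01\n"
value = original.decode("utf-16be")

# What the workaround would store as the "original" bytes.
reencoded = value.encode("utf-16")

# The codec prepends a BOM (little-endian on most machines) ...
has_bom = reencoded.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
print(has_bom)

# ... so the stored bytes no longer match the source bytes,
print(reencoded == original)

# even though decoding them gives back the same text.
print(reencoded.decode("utf-16") == value)
```

The text survives the round trip, but anything that relies on the stored bytes being identical to the file contents (for example, writing the string back out unchanged) may behave differently.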

## Environment

Which environment were you using when you encountered the problem?

```bash
$ python -m platform
Linux-6.15.9-arch1-1-x86_64-with-glibc2.41

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==6.0.0, crypt_provider=('cryptography', '43.0.3'), PIL=11.3.0
```

## Code + PDF

This is a minimal, complete example that shows the issue:

```python
import io
from pypdf import PdfReader

def create_minimal_pdf_string(text: str):
    return f"""%PDF-1.0
1 0 obj
<</Type /Catalog /Pages 2 0 R>>
endobj
2 0 obj
<</Type /Pages /Kids [3 0 R] /Count 1>>
endobj
3 0 obj
<</Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Contents 4 0 R>>
endobj
4 0 obj
<< /Length {len(text.encode('charmap')) + 30} >>
stream
BT
/Helv 12 Tf
50 700 Td
({text}) Tj
ET
endstream
endobj
xref
0 5
0000000000 65535 f
0000000010 00000 n
0000000053 00000 n
0000000102 00000 n
0000000157 00000 n
trailer
<</Size 5 /Root 1 0 R>>
startxref
{157 + (len(text.encode('charmap')) + 30) + 10}
%%EOF
"""

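# String operand that decodes (as UTF-16BE) to text starting with a BOM-like prefix.
troublesome_str = '\x00\xfe\x00\xff\x00\xf8\x01\n'
pdf = create_minimal_pdf_string(troublesome_str)
# "charmap" maps each code point below 256 to the byte with the same value.
reader = PdfReader(io.BytesIO(pdf.encode("charmap")))
page = reader.get_page(0)
# Scaling parses the content stream, which is where the AssertionError is raised.
page.scale_to(595, page.mediabox.height * (595 / page.mediabox.height))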
```

I'm unable to publish the original file due to copyright issues. The snippet above is my best effort to reproduce the issue.

## Traceback

This is the complete traceback I see:

```
  File "test.py", line 270, in <module>
    page.scale_to(595, page.mediabox.height * (595 / page.mediabox.height))
  File "..../.venv/lib64/python3.9/site-packages/pypdf/_page.py", line 1661, in scale_to
    self.scale(sx, sy)
  File "..../.venv/lib64/python3.9/site-packages/pypdf/_page.py", line 1597, in scale
    self.add_transformation((sx, 0, 0, sy, 0, 0))
  File "..../.venv/lib64/python3.9/site-packages/pypdf/_page.py", line 1552, in add_transformation
    content = PageObject._add_transformation_matrix(content, self.pdf, ctm)
  File "..../.venv/lib64/python3.9/site-packages/pypdf/_page.py", line 1011, in _add_transformation_matrix
    contents.operations.insert(
  File "..../.venv/lib64/python3.9/site-packages/pypdf/generic/_data_structures.py", line 1432, in operations
    self._parse_content_stream(BytesIO(self._data))
  File "..../.venv/lib64/python3.9/site-packages/pypdf/generic/_data_structures.py", line 1325, in _parse_content_stream
    operands.append(read_object(stream, None, self.forced_encoding))
  File "..../.venv/lib64/python3.9/site-packages/pypdf/generic/_data_structures.py", line 1479, in read_object
    return read_string_from_stream(stream, forced_encoding)
  File "..../.venv/lib64/python3.9/site-packages/pypdf/generic/_utils.py", line 121, in read_string_from_stream
    return create_string_object(bytes(txt), forced_encoding)
  File "..../.venv/lib64/python3.9/site-packages/pypdf/generic/_utils.py", line 170, in create_string_object
    retval = TextStringObject(string.decode("utf-16be"))
  File "..../.venv/lib64/python3.9/site-packages/pypdf/generic/_base.py", line 658, in __new__
    assert org is not None
AssertionError
```


AssertionError trying to read content stream starting with \x00\xfe\x00\xff #3470
