Hyphens and word structure lost when copying multi-line string from PDF #999

@AlexSafonova

Summary

When rendering a long hyphenated string using OpenHTMLtoPDF, the output visually looks correct — the string is broken into multiple lines. However, when copying the text from the resulting PDF (e.g. in Acrobat Reader), hyphens are missing and words are joined or separated incorrectly.

This makes it impossible to extract exact string values from the PDF if they contain dashes or dot-separated parts.

Steps to Reproduce

Use the following HTML as source:

<td style="width: 100px; font-family: Arial; font-size: 10pt;">
  APD.XX.XX.XX-p0-kafka-client-test-name
</td>

Render it using PdfRendererBuilder, for example:

new PdfRendererBuilder()
    .useFastMode()
    .withW3cDocument(html, baseUri)
    .toStream(outputStream)
    .run();

Open the generated PDF in a reader.
Copy the string and paste it into any text editor.

Expected
The copied result should be:

APD.XX.XX.XX-p0-kafka-client-test-name

Even if the string is visually broken into several lines in the PDF, the structure should be preserved.

Actual
The copied result is incorrect. Example:

APD. XX. XX. XXp0- kafkaclienttestname

Hyphens (-) are missing in multiple places.
Words are broken or joined arbitrarily.
Dot segments are also misinterpreted due to how the text is split into separate PDF drawing blocks.
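
The difference between the expected and copied strings can be quantified with a small check. The following is a minimal sketch using only the two strings quoted above; the class and method names (`CopiedTextCheck`, `countOf`, `missingCount`, `stripWhitespace`) are hypothetical helpers for illustration and are not part of OpenHTMLtoPDF.

```java
public class CopiedTextCheck {

    /** Counts occurrences of a single character in the text. */
    static int countOf(String text, char ch) {
        return (int) text.chars().filter(c -> c == ch).count();
    }

    /** How many occurrences of ch are present in expected but absent from copied. */
    static int missingCount(String expected, String copied, char ch) {
        return countOf(expected, ch) - countOf(copied, ch);
    }

    /** Removes all whitespace so spaces inserted at wrap points can be ignored. */
    static String stripWhitespace(String text) {
        return text.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        // Both strings are taken verbatim from the Expected and Actual sections.
        String expected = "APD.XX.XX.XX-p0-kafka-client-test-name";
        String copied = "APD. XX. XX. XXp0- kafkaclienttestname";

        System.out.println("missing hyphens: " + missingCount(expected, copied, '-'));
        System.out.println("missing dots:    " + missingCount(expected, copied, '.'));
        System.out.println("match ignoring whitespace: "
                + expected.equals(stripWhitespace(copied)));
    }
}
```

For the example above, this reports four missing hyphens and no missing dots: the dots survive but gain a trailing space, while most hyphens are dropped entirely, so the strings still differ after whitespace is removed.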

Root Cause (Assumed)
It seems that when long strings are broken into lines (either due to narrow table cells or explicit <br> tags), the renderer treats each visual segment as a separate text object in the PDF. Upon copying, these fragments are joined without proper logic, and some characters (especially - and .) are lost or shifted.

Suggested Solution

Could the library offer an optional mode where long text remains a single logical text block in the PDF, even when wrapped visually? For example:

Provide a builder flag like .preserveLogicalTextFlow(true)
Or allow opt-in to smarter text extraction (ToUnicode mapping or merged text spans)

This would help preserve the exact text structure, which is critical when exporting structured identifiers, document codes, or technical strings.

Notes
This issue is especially critical when rendering technical identifiers or configuration keys that include dashes and dots, such as:

com.example-service-name-v1
These are often copied from PDFs into tools/scripts, and loss of structure causes real errors.

Issue example.pdf
