Description
Summary
When rendering a long hyphenated string using OpenHTMLtoPDF, the output visually looks correct — the string is broken into multiple lines. However, when copying the text from the resulting PDF (e.g. in Acrobat Reader), hyphens are missing and words are joined or separated incorrectly.
This makes it impossible to extract exact string values from the PDF if they contain dashes or dot-separated parts.
Steps to Reproduce
Use the following HTML as source:
<td style="width: 100px; font-family: Arial; font-size: 10pt;">
APD.XX.XX.XX-p0-kafka-client-test-name
</td>
Render it using PdfRendererBuilder, for example:
new PdfRendererBuilder()
    .useFastMode()
    .withW3cDocument(html, baseUri)
    .toStream(outputStream)
    .run();
Open the generated PDF in a reader.
Copy the string and paste it into any text editor.
Expected
The copied result should be:
APD.XX.XX.XX-p0-kafka-client-test-name
Even if the string is visually broken into several lines in the PDF, the structure should be preserved.
Actual
The copied result is incorrect. Example:
APD. XX. XX. XXp0- kafkaclienttestname
Hyphens (-) are missing in multiple places.
Words are broken or joined arbitrarily.
Dot segments are also misinterpreted due to how the text is split into separate PDF drawing blocks.
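To show why this cannot simply be cleaned up after copying, here is a small standalone Java check. The class and method names are illustrative only and are not part of OpenHTMLtoPDF. It strips all whitespace from the pasted text, which is the obvious workaround for wrapped lines, and compares the result with the expected identifier. The extra spaces go away, but the missing hyphens do not come back.

```java
final class CopiedTextCheck {

    /** Removes all whitespace, including spaces and line breaks added at wrap points. */
    static String stripWhitespace(String text) {
        return text.replaceAll("\\s+", "");
    }

    /** Counts how many times the character c occurs in text. */
    static long count(String text, char c) {
        return text.chars().filter(ch -> ch == c).count();
    }

    /** True when the pasted text equals the expected value once whitespace is ignored. */
    static boolean matchesIgnoringWhitespace(String expected, String copied) {
        return stripWhitespace(expected).equals(stripWhitespace(copied));
    }

    public static void main(String[] args) {
        // Both strings are taken from the Expected and Actual sections of this report.
        String expected = "APD.XX.XX.XX-p0-kafka-client-test-name";
        String copied = "APD. XX. XX. XXp0- kafkaclienttestname";

        System.out.println("normalized: " + stripWhitespace(copied));
        System.out.println("matches:    " + matchesIgnoringWhitespace(expected, copied));
        System.out.println("hyphens missing: "
                + (count(expected, '-') - count(copied, '-')));
    }
}
```

With the strings above, the normalized text is still not equal to the expected identifier, so whitespace normalization on the consumer side is not a viable workaround; the hyphens have to be present in the PDF text layer itself.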
Root Cause (Assumed)
It seems that when long strings are broken into lines (either due to narrow table cells or explicit <br> tags), the renderer treats each visual segment as a separate text object in the PDF. Upon copying, these fragments are joined without proper logic, and some characters (especially - and .) are lost or shifted.
Suggested Solution
Could the library offer an optional mode where long text remains a single logical text block in the PDF, even when wrapped visually? For example:
Provide a builder flag like .preserveLogicalTextFlow(true)
Or allow opt-in to smarter text extraction (ToUnicode mapping or merged text spans)
This would help preserve exact text structure — critical when exporting structured identifiers, document codes, or technical strings.
Notes
This issue is especially critical when rendering technical identifiers or configuration keys that include dashes and dots, such as:
com.example-service-name-v1
These are often copied from PDFs into tools/scripts, and loss of structure causes real errors.