PDFBOX-5487: Remove all space characters if contained within the adjacent letters #155
noureldin-eg wants to merge 2 commits into apache:trunk
@noureldin-eg Are there any known side effects of this commit?
I never got any feedback on PDFBOX-5487. What I need to know is whether the contents of PDFBOX-5487-arabic.pdf-sorted-diff.txt are better in the lines marked "new".
Hi @kaismh and @THausherr, thank you for reminding me about this. I'll try to provide updates as soon as possible.
@THausherr The output is better for all the Arabic cases I tried, but I'm not sure whether it might break other situations or other languages. It might be better to offer it as an option.
I'm reluctant to add a new option... it doesn't seem to be needed. I have around 100 local test files and only two were changed (only in the sorted output).
No known side effects for Arabic (and English) text extraction. I can't confirm its impact on other languages, but if you'd like, I could modify the implementation to apply this fix only when the characters fall within the Arabic Unicode blocks (as in PR #156).
Yes, the extracted contents are better after my commit. Specifically, the two key changes highlighted in the screenshots above and explained in the Jira issue have been addressed.
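For illustration only, the Arabic-only restriction offered above could be checked with the JDK's `Character.UnicodeBlock`. This is a minimal sketch, not the code from this PR or PR #156; the class name `ArabicBlocks` and both method names are hypothetical, and the choice of which blocks count as "Arabic" is an assumption.

```java
import java.lang.Character.UnicodeBlock;

// Hypothetical helper, not part of PDFBox.
class ArabicBlocks {

    // Returns true if the code point belongs to one of the Arabic blocks
    // listed here. The list is an assumption made for this sketch.
    static boolean isArabic(int codePoint) {
        UnicodeBlock block = UnicodeBlock.of(codePoint);
        return block == UnicodeBlock.ARABIC
                || block == UnicodeBlock.ARABIC_SUPPLEMENT
                || block == UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
                || block == UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
    }

    // Returns true if any code point of the text is Arabic.
    static boolean containsArabic(String text) {
        return text.codePoints().anyMatch(ArabicBlocks::isArabic);
    }
}
```

A caller could then apply the space removal only when both neighbours of a space pass `containsArabic`, leaving text in other scripts untouched.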
…cent letters, by Mohamed M NourElDin; closes #155 git-svn-id: https://svn.apache.org/repos/asf/pdfbox/branches/2.0@1922512 13f79535-47bb-0310-9956-ffa450edef68
…cent letters, by Mohamed M NourElDin; closes #155 git-svn-id: https://svn.apache.org/repos/asf/pdfbox/branches/3.0@1922513 13f79535-47bb-0310-9956-ffa450edef68
Please see PDFBOX-5487 and the comments below.
In the PDF attached to the Jira issue, there are 2 space characters that overlap with the adjacent letters of 2 Arabic words. When sorting is enabled, each of these spaces gets shifted into the middle of a word.
This commit removes such spaces right after sorting.
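
The idea can be sketched as follows. This is a simplified illustration, not the code in this PR: `Glyph` is a hypothetical stand-in for a positioned character (it is not a PDFBox class), only horizontal extents are compared, and "contained within" is assumed to mean that the space's extent lies fully inside the extent of the previous or next glyph.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a positioned glyph; not a PDFBox class.
final class Glyph {
    final String text;
    final float x;      // left edge
    final float width;  // horizontal extent

    Glyph(String text, float x, float width) {
        this.text = text;
        this.x = x;
        this.width = width;
    }

    float right() {
        return x + width;
    }
}

class OverlappingSpaceFilter {

    // True if inner's horizontal extent lies entirely inside outer's.
    static boolean containedIn(Glyph inner, Glyph outer) {
        return inner.x >= outer.x && inner.right() <= outer.right();
    }

    // Given glyphs that are already sorted by position, drops every space
    // whose extent is contained in the glyph directly before or after it.
    static List<Glyph> removeContainedSpaces(List<Glyph> sorted) {
        List<Glyph> result = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            Glyph g = sorted.get(i);
            if (" ".equals(g.text)) {
                boolean inPrev = i > 0 && containedIn(g, sorted.get(i - 1));
                boolean inNext = i + 1 < sorted.size()
                        && containedIn(g, sorted.get(i + 1));
                if (inPrev || inNext) {
                    continue; // the space overlaps a letter, so skip it
                }
            }
            result.add(g);
        }
        return result;
    }
}
```

A space that sits between two letters without being covered by either is kept, so ordinary word separation is preserved under these assumptions.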