
PDFBOX-5487: Remove all space characters if contained within the adjacent letters #155

Closed
noureldin-eg wants to merge 2 commits into apache:trunk from noureldin-eg:PDFBOX-5487


@noureldin-eg

Please see PDFBOX-5487 and the comments below.

In the PDF attached to the Jira issue, two space characters overlap the adjacent letters of two Arabic words. When sorting is enabled, each of these spaces gets shifted into the middle of a word.

This commit will remove such spaces just after sorting.
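
The idea described above can be illustrated with a minimal sketch. This is not the code from the PR and does not use PDFBox classes: the `Glyph` type, the `removeContainedSpaces` method, and the containment rule (a space is dropped when its horizontal extent lies entirely inside the previous or the next glyph) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a positioned character; not a PDFBox type.
class ContainedSpaceFilter {
    static final class Glyph {
        final String text;
        final float x;      // left edge
        final float width;  // horizontal advance

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }

        float end() {
            return x + width;
        }

        // True if 'other' lies entirely within this glyph's horizontal extent.
        boolean contains(Glyph other) {
            return x <= other.x && other.end() <= end();
        }
    }

    // Assumes 'sorted' is already ordered by position.
    // Drops every space that is fully contained in its previous or next neighbour.
    static List<Glyph> removeContainedSpaces(List<Glyph> sorted) {
        List<Glyph> result = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            Glyph g = sorted.get(i);
            if (" ".equals(g.text)) {
                Glyph prev = i > 0 ? sorted.get(i - 1) : null;
                Glyph next = i + 1 < sorted.size() ? sorted.get(i + 1) : null;
                boolean inPrev = prev != null && prev.contains(g);
                boolean inNext = next != null && next.contains(g);
                if (inPrev || inNext) {
                    continue; // overlapping space: sorting would place it inside a word
                }
            }
            result.add(g);
        }
        return result;
    }
}
```

A space that sits between two glyphs without being contained in either is kept, so ordinary word separators are not affected by this rule.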

PDFBOX-5487_ اعلامية

PDFBOX-5487_ وفضلا

@kaismh

kaismh commented Dec 7, 2024

@noureldin-eg Any known side effects for this commit?

@THausherr
Contributor

I never got any feedback in PDFBOX-5487. What I need to know is whether the contents in PDFBOX-5487-arabic.pdf-sorted-diff.txt are better in the lines with "new".

@noureldin-eg
Author

Hi @kaismh and @THausherr,
It has been quite some time since I created this PR, and I had thought it was already merged. I’m glad to revisit this and contribute to the library again. However, I will need a bit of time to set up the project and review the code.

Thank you for reminding me about this, and I’ll try to provide updates asap.

@kaismh

kaismh commented Dec 11, 2024

@THausherr The output is better for all Arabic cases I tried, but I'm not sure whether it might break other situations or other languages. It might be better to make this an option.

@THausherr
Contributor

I'm reluctant to add a new option... it doesn't seem to be needed. I have around 100 local test files and only two were changed (only in the sorted output).
I'd still like to get some feedback from @noureldin-eg.

@noureldin-eg
Author

noureldin-eg commented Dec 14, 2024

Any known side effects for this commit?

No known side effects for Arabic (and English) text extraction. I can't confirm the impact on other languages, but if you'd like, I could modify the implementation to apply this fix only when the Unicode code points fall within the Arabic blocks (as in PR #156).
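
For illustration, a restriction to Arabic text could look roughly like the sketch below. It is not taken from PR #156 or from PDFBox; the class and method names are hypothetical, and the set of blocks checked is an assumption (other Arabic-related blocks exist and could be added). Only the standard `Character.UnicodeBlock` API is used.

```java
// Hypothetical helper; not part of PDFBox or of PR #156.
class ArabicBlockCheck {
    // True if the code point belongs to one of the Arabic Unicode blocks listed here.
    static boolean isArabic(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == Character.UnicodeBlock.ARABIC
                || block == Character.UnicodeBlock.ARABIC_SUPPLEMENT
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
    }

    // True if any code point in the text is Arabic according to isArabic.
    static boolean containsArabic(String text) {
        return text.codePoints().anyMatch(ArabicBlockCheck::isArabic);
    }
}
```

With such a check, the space removal could be applied only when the glyphs next to the space are Arabic, leaving other scripts untouched.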

whether the contents in PDFBOX-5487-arabic.pdf-sorted-diff.txt are better in the lines with "new"

Yes, the extracted contents are better after my commit. Specifically, the two key changes highlighted in the screenshots above and explained in the Jira issue have been addressed.

asfgit pushed a commit that referenced this pull request Dec 15, 2024
…cent letters, by Mohamed M NourElDin; closes #155

git-svn-id: https://svn.apache.org/repos/asf/pdfbox/branches/2.0@1922512 13f79535-47bb-0310-9956-ffa450edef68
asfgit pushed a commit that referenced this pull request Dec 15, 2024
…cent letters, by Mohamed M NourElDin; closes #155

git-svn-id: https://svn.apache.org/repos/asf/pdfbox/branches/3.0@1922513 13f79535-47bb-0310-9956-ffa450edef68
@asfgit asfgit closed this in 374972f Dec 15, 2024