PDFBOX-5487: Remove all space characters if contained within the adjacent letters by noureldin-eg · Pull Request #155 · apache/pdfbox

noureldin-eg · 2023-02-19T10:05:41Z

Please see PDFBOX-5487 and the comments below.

In the PDF attached to the Jira issue, there are 2 space characters that overlap the adjacent letters of 2 Arabic words. When sorting is enabled, these spaces get shifted into the middle of a word.

This commit removes such spaces right after sorting.
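
For readers who want a concrete picture of the idea, here is a minimal sketch. It is not the code from this PR and does not use PDFBox's own classes: `Glyph` and `ContainedSpaceFilter` are hypothetical stand-ins, and "contained" is assumed to mean that the space's horizontal extent lies entirely inside the previous or the next non-space glyph in the sorted list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for a positioned glyph (not PDFBox's TextPosition). */
final class Glyph {
    final String text;
    final float x;     // left edge of the glyph
    final float width; // horizontal advance

    Glyph(String text, float x, float width) {
        this.text = text;
        this.x = x;
        this.width = width;
    }

    float right() {
        return x + width;
    }

    boolean isSpace() {
        return " ".equals(text);
    }

    /** True if this glyph's horizontal extent lies entirely inside the other glyph's extent. */
    boolean isInside(Glyph other) {
        return x >= other.x && right() <= other.right();
    }
}

final class ContainedSpaceFilter {
    private ContainedSpaceFilter() {
    }

    /** Returns a copy of the already sorted list without spaces that lie inside a neighbouring letter. */
    static List<Glyph> removeContainedSpaces(List<Glyph> sorted) {
        List<Glyph> result = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            Glyph glyph = sorted.get(i);
            if (glyph.isSpace() && isInsideNeighbour(sorted, i)) {
                continue; // drop the overlapped space
            }
            result.add(glyph);
        }
        return result;
    }

    /** Assumption: only the directly preceding and following glyphs are checked. */
    private static boolean isInsideNeighbour(List<Glyph> glyphs, int i) {
        Glyph space = glyphs.get(i);
        Glyph prev = i > 0 ? glyphs.get(i - 1) : null;
        Glyph next = i + 1 < glyphs.size() ? glyphs.get(i + 1) : null;
        return (prev != null && !prev.isSpace() && space.isInside(prev))
                || (next != null && !next.isSpace() && space.isInside(next));
    }

    /** Concatenates the glyph texts in list order. */
    static String toText(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph glyph : glyphs) {
            sb.append(glyph.text);
        }
        return sb.toString();
    }
}
```

In this sketch a space that sits in a real gap between two letters is kept, because it is not fully inside either neighbour. Only spaces swallowed by an adjacent glyph are dropped.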

…cent letters https://issues.apache.org/jira/browse/PDFBOX-5487

kaismh · 2024-12-07T07:31:52Z

@noureldin-eg Any known side effects for this commit?

THausherr · 2024-12-07T09:30:01Z

I never got any feedback in PDFBOX-5487. What I need to know is whether the contents in PDFBOX-5487-arabic.pdf-sorted-diff.txt are better in the lines with "new".

noureldin-eg · 2024-12-07T20:00:34Z

Hi @kaismh and @THausherr,
It has been quite some time since I created this PR, and I had thought it was already merged. I’m glad to revisit this and contribute to the library again. However, I will need a bit of time to set up the project and review the code.

Thank you for reminding me about this, and I’ll try to provide updates asap.

kaismh · 2024-12-11T04:51:17Z

@THausherr The output is better for all Arabic cases I tried, but I'm not sure whether it might break some situations or other languages. It might be better to have this as an option.

THausherr · 2024-12-11T11:55:29Z

I'm reluctant to add a new option... it doesn't seem to be needed. I have around 100 local test files and only two were changed (only in the sorted output).
I'd still like to get some feedback from @noureldin-eg.

noureldin-eg · 2024-12-14T22:22:12Z

> Any known side effects for this commit?

No known side effects for Arabic (and English) text extraction. I can't confirm its impact on other languages, but if you'd like, I could modify the implementation to apply this fix only when the Unicode code points fall within the Arabic ranges (as in PR #156).
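
To illustrate what such a restriction could look like, here is a minimal sketch. It is an assumption, not the code from this PR or from PR #156: `ArabicCheck` is a hypothetical helper that relies on the JDK's `Character.UnicodeScript` to decide whether a code point belongs to the Arabic script.

```java
/** Hypothetical helper; not part of PDFBox. */
final class ArabicCheck {
    private ArabicCheck() {
    }

    /** True if the code point is assigned to the Arabic script by the JDK's Unicode data. */
    static boolean isArabic(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.ARABIC;
    }

    /** True if at least one code point of the text is Arabic. */
    static boolean containsArabic(String text) {
        return text.codePoints().anyMatch(ArabicCheck::isArabic);
    }
}
```

A script check also covers the Arabic presentation forms that often appear in PDF text, but characters shared across scripts (for example some punctuation and combining marks in the Arabic block) are not reported as Arabic, so a block-based check might be preferable depending on the goal.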

> whether the contents in PDFBOX-5487-arabic.pdf-sorted-diff.txt are better in the lines with "new"

Yes, the extracted contents are better after my commit. Specifically, the two key changes highlighted in the screenshots above and explained in the Jira issue have been addressed.

…cent letters, by Mohamed M NourElDin; closes #155 git-svn-id: https://svn.apache.org/repos/asf/pdfbox/branches/2.0@1922512 13f79535-47bb-0310-9956-ffa450edef68

…cent letters, by Mohamed M NourElDin; closes #155 git-svn-id: https://svn.apache.org/repos/asf/pdfbox/branches/3.0@1922513 13f79535-47bb-0310-9956-ffa450edef68

PDFBOX-5487: Remove all space characters if contained within the adja…

f95a5be

…cent letters https://issues.apache.org/jira/browse/PDFBOX-5487

Merge branch 'apache:trunk' into PDFBOX-5487

c5db965

asfgit closed this in 374972f Dec 15, 2024

noureldin-eg wants to merge 2 commits into apache:trunk from noureldin-eg:PDFBOX-5487