Improve pdfiototext text extraction #49

Open
kleuter opened this issue Oct 10, 2023 · 1 comment

Comments

@kleuter
Contributor

kleuter commented Oct 10, 2023

I'm trying to understand how pdfiototext.c works. The code seems to output too many unnecessary spaces for this PDF.

image

weird_spaces.pdf

@michaelrsweet
Owner

Keep in mind that the pdfiototext program is just a proof of concept/example. If you need text output there are far better options available...

In this case, the spaces are placed between fragments in the content stream. This particular file does a lot of kerning and pdfiototext is not smart enough to merge adjacent fragments.
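
To illustrate what "merging adjacent fragments" could look like, here is a minimal, self-contained sketch. It is not the pdfiototext code and does not use the PDFio API. It assumes the text arrives as a `TJ`-style array of string fragments interleaved with numeric adjustments, expressed in thousandths of a text space unit, where a negative value widens the gap between fragments. The names `tj_item_t`, `merge_fragments`, and `WORD_GAP_THRESHOLD` are hypothetical, and the threshold of -200 is an assumed heuristic that would likely need tuning per font.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

// Hypothetical item of a TJ-style array: either a text fragment or a
// numeric position adjustment (text == NULL).
typedef struct
{
  const char *text;   // Fragment text, or NULL for an adjustment
  double      adjust; // Adjustment in thousandths of a text space unit
} tj_item_t;

// Assumed heuristic: adjustments at or below this value are treated as a
// word gap; anything smaller in magnitude is treated as kerning.
#define WORD_GAP_THRESHOLD -200.0

// Concatenate fragments into "out", inserting a single space only where
// an adjustment looks like a word gap. Returns the output length.
static size_t
merge_fragments(const tj_item_t *items, size_t num_items,
                char *out, size_t outsize)
{
  size_t i, n, len = 0;
  int    pending_space = 0;

  assert(out != NULL && outsize > 0);
  out[0] = '\0';

  for (i = 0; i < num_items; i ++)
  {
    if (!items[i].text)
    {
      // Numeric adjustment: remember a word gap, ignore kerning
      if (items[i].adjust <= WORD_GAP_THRESHOLD)
        pending_space = 1;
      continue;
    }

    // Emit at most one space between words, never at the start
    if (pending_space && len > 0 && len + 1 < outsize)
      out[len ++] = ' ';
    pending_space = 0;

    // Append the fragment, truncating if the buffer is full
    n = strlen(items[i].text);
    if (n > outsize - 1 - len)
      n = outsize - 1 - len;
    memcpy(out + len, items[i].text, n);
    len += n;
    out[len] = '\0';
  }

  return (len);
}
```

With the fragments `Hel`, `lo`, `W`, `orld` separated by adjustments of -20, -300, and 15, this sketch produces `Hello World`: only the -300 adjustment is treated as a word gap, while the two small adjustments are treated as kerning. A real implementation would also have to account for font size, the text matrix, and positioning operators such as `Td`, which this sketch ignores.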

@michaelrsweet michaelrsweet self-assigned this Oct 10, 2023
@michaelrsweet michaelrsweet added enhancement New feature or request priority-low labels Oct 10, 2023
@michaelrsweet michaelrsweet added this to the Future milestone Oct 10, 2023
@michaelrsweet michaelrsweet changed the title pdfiototext.c output is weird (extra spaces) Improve pdfiototext text extraction Oct 10, 2023