
Handle additional broken PDF files in the Common Crawl set #1108


Merged
merged 1 commit into from
Jul 26, 2025

Conversation

EliotJones (Member)

  • a file contained two indices pointing to '.notdef' for the character name, so we now just take the first index rather than requiring a single one
  • a file contained '/' (the empty name) as the subtype declaration, so we fall back to trying Type 1 and TrueType parsing in this situation
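The two fallbacks above can be sketched roughly as follows. This is a minimal illustration, not PdfPig's actual C# code; all function and parameter names here are hypothetical, and the stub parsers stand in for the real Type 1 and TrueType font parsers.

```python
def resolve_character_index(name_to_indices, name):
    """Return the glyph index for a character name.

    Some broken files map '.notdef' to more than one index; rather than
    failing on the duplicate, take the first index found.
    """
    indices = name_to_indices.get(name, [])
    if not indices:
        raise KeyError(f"no index for character name {name!r}")
    return indices[0]  # take the first rather than requiring exactly one


def parse_type1(data):
    # Hypothetical stand-in: a real parser would validate the full
    # Type 1 font structure, not just this prefix check.
    if not data.startswith(b"%!"):
        raise ValueError("not a Type 1 font")
    return "type1-font"


def parse_truetype(data):
    # Hypothetical stand-in for a real TrueType (sfnt) parser.
    return "truetype-font"


def parse_font(subtype, data):
    """Parse font data according to its declared subtype.

    A broken file declared '/' (the empty name) as the subtype; in that
    case we try Type 1 parsing first, then fall back to TrueType.
    """
    if subtype == "Type1":
        return parse_type1(data)
    if subtype == "TrueType":
        return parse_truetype(data)
    if subtype == "":  # empty name '/': no usable subtype declared
        try:
            return parse_type1(data)
        except ValueError:
            return parse_truetype(data)
    raise ValueError(f"unsupported subtype {subtype!r}")
```

With this sketch, a duplicate '.notdef' mapping resolves to its first index, and an empty subtype still yields a parsed font via the fallback chain.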

With these changes we can now parse the 6,000 files from 0000 to 0005 in the corpus, with the exception of corrupt files and files with corrupt xrefs, which we can't currently recover from but which some other readers can parse:

  • 0005634.pdf
  • 0002973.pdf

https://digitalcorpora.s3.amazonaws.com/s3_browser.html#corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/zipfiles/0000-0999/

@EliotJones EliotJones requested a review from BobLd July 26, 2025 17:34
@BobLd BobLd merged commit 27df4af into master Jul 26, 2025
2 checks passed