
Handle additional broken PDF files in the Common Crawl set #1108


Merged
merged 1 commit into from
Jul 26, 2025

Conversation

EliotJones (Member)

  • a file contained two indices pointing to '.notdef' for the character name, so we now just take the first index rather than requiring a single one
  • a file contained '/' (the empty name) as the subtype declaration, so we fall back to trying Type 1 and TrueType parsing in this situation
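The two fallbacks above can be sketched roughly as follows. This is a minimal illustration, not PdfPig's actual C# code; all function and parameter names here are hypothetical, and the stub parsers stand in for the real Type 1 and TrueType font parsers.

```python
def resolve_character_index(name_to_indices, name):
    """Return the glyph index for a character name.

    Some broken files map '.notdef' to more than one index; rather than
    failing on the duplicate, take the first index found.
    """
    indices = name_to_indices.get(name, [])
    if not indices:
        raise KeyError(f"no index for character name {name!r}")
    return indices[0]  # take the first rather than requiring exactly one


def parse_type1(data):
    # Hypothetical stand-in: a real parser would validate the full
    # Type 1 font structure, not just this prefix check.
    if not data.startswith(b"%!"):
        raise ValueError("not a Type 1 font")
    return "type1-font"


def parse_truetype(data):
    # Hypothetical stand-in for a real TrueType (sfnt) parser.
    return "truetype-font"


def parse_font(subtype, data):
    """Parse font data according to its declared subtype.

    A broken file declared '/' (the empty name) as the subtype; in that
    case we try Type 1 parsing first, then fall back to TrueType.
    """
    if subtype == "Type1":
        return parse_type1(data)
    if subtype == "TrueType":
        return parse_truetype(data)
    if subtype == "":  # empty name '/': no usable subtype declared
        try:
            return parse_type1(data)
        except ValueError:
            return parse_truetype(data)
    raise ValueError(f"unsupported subtype {subtype!r}")
```

With this sketch, a duplicate '.notdef' mapping resolves to its first index, and an empty subtype still yields a parsed font via the fallback chain.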

With these changes we can now parse the 6,000 files from 0000 to 0005 in the corpus, with the exception of corrupt files and files with corrupt xrefs, which we can't currently recover from but which some other readers can parse:

  • 0005634.pdf
  • 0002973.pdf

https://digitalcorpora.s3.amazonaws.com/s3_browser.html#corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/zipfiles/0000-0999/

@EliotJones EliotJones requested a review from BobLd July 26, 2025 17:34
@BobLd BobLd merged commit 27df4af into master Jul 26, 2025
2 checks passed