rework numeric tokenizer hot path #1104

EliotJones · 2025-07-25T02:21:38Z

The existing numeric tokenizer involved allocations and string parsing. Since the number formats in PDF files are fairly predictable, we can improve on this substantially.

This ends up being roughly 2.3 to 2.4 times faster in my benchmarks on a subset of numeric data from PDF files:

| Method | Mean      | Error     | StdDev    |
|------- |----------:|----------:|----------:|
| PigOld | 13.684 us | 0.2333 us | 0.2182 us |
| PigNew |  5.963 us | 0.0990 us | 0.0877 us |

| Method | Mean      | Error     | StdDev    |
|------- |----------:|----------:|----------:|
| PigOld | 14.806 us | 0.2098 us | 0.1962 us |
| PigNew |  6.230 us | 0.1205 us | 0.1127 us |

Based on tracing while opening ~350 documents, the tokenize method here was called approximately 16.2 million times, so it is definitely a hot path. Real-world performance impact may differ from the benchmark figures.
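For illustration only, the general idea of parsing a number directly from its bytes, rather than building a string and handing it to a general-purpose parser, can be sketched as below. This is not the code from this PR: it is a minimal Python sketch, `parse_pdf_number` is a hypothetical name, and it assumes the token has already been isolated and consists only of an optional sign, digits, and at most one decimal point (PDF numeric objects do not use exponent notation).

```python
from typing import Optional


def parse_pdf_number(token: bytes) -> Optional[float]:
    """Parse a PDF-style numeric token in a single pass over its bytes.

    Hypothetical helper for illustration; returns None for malformed input.
    """
    n = len(token)
    if n == 0:
        return None

    i = 0
    negative = False
    # Optional leading sign: '+' (0x2B) or '-' (0x2D).
    if token[0] in (0x2B, 0x2D):
        negative = token[0] == 0x2D
        i = 1

    mantissa = 0        # all digits seen so far, as one integer
    frac_digits = 0     # how many of those digits came after the point
    seen_digit = False
    seen_point = False

    while i < n:
        b = token[i]
        if 0x30 <= b <= 0x39:            # ASCII '0'..'9'
            mantissa = mantissa * 10 + (b - 0x30)
            seen_digit = True
            if seen_point:
                frac_digits += 1
        elif b == 0x2E and not seen_point:  # first '.'
            seen_point = True
        else:
            return None                  # second '.', or a non-numeric byte
        i += 1

    if not seen_digit:
        return None                      # covers "", "-", "." and "+."

    value = mantissa / (10 ** frac_digits)
    return -value if negative else value
```

Inputs such as `b"12.5"`, `b"-0.25"`, `b".5"` and `b"4."` are accepted; anything containing other characters is rejected so that a caller could fall back to a slower, more tolerant path.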


EliotJones · 2025-07-25T02:24:56Z

@BobLd I'm planning a 0.1.11 full release as follows:

1. Merge "make link copying more tolerant when adding page" #1103 to improve link building.
2. Release the full version after #1103, with various parsing improvements and bugfixes.
3. Merge this and "WIP: move file parsing to single-pass static methods" #1102, since these are much larger, riskier changes and will have time to bed in as I work through the Common Crawl corpus and people trial pre-release versions.
4. Work on the builder redesign detailed in "make link copying more tolerant when adding page" #1103 for 0.1.12.

BobLd · 2025-07-25T17:12:28Z

@EliotJones thanks a lot for this change, looks great!

Regarding 0.1.11, we are on the same page - agreed with your plan.

rework numeric tokenizer hot path

45cf974


EliotJones requested a review from BobLd July 25, 2025 02:21

BobLd approved these changes Jul 25, 2025


BobLd merged commit 85fc63d into master Jul 25, 2025
4 checks passed
