Skip to content

rework numeric tokenizer hot path #1104

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 25, 2025

Conversation

EliotJones
Copy link
Member

the existing numeric tokenizer involved allocations and string parsing. since the number formats in pdf files are fairly predictable we can improve this substantially

this ends up being somewhere between 2-3 times faster in my benchmarks on a subset of numeric data from PDF files:

| Method | Mean      | Error     | StdDev    |
|------- |----------:|----------:|----------:|
| PigOld | 13.684 us | 0.2333 us | 0.2182 us |
| PigNew |  5.963 us | 0.0990 us | 0.0877 us |

| Method | Mean      | Error     | StdDev    |
|------- |----------:|----------:|----------:|
| PigOld | 14.806 us | 0.2098 us | 0.1962 us |
| PigNew |  6.230 us | 0.1205 us | 0.1127 us |

Based on tracing for opening ~350 documents the tokenize method here was called approximately 16.2 million times so is definitely a hot path. Real world performance impact may be different.

the existing numeric tokenizer involved allocations and string parsing. since
the number formats in pdf files are fairly predictable we can improve this
substantially
@EliotJones EliotJones requested a review from BobLd July 25, 2025 02:21
@EliotJones
Copy link
Member Author

EliotJones commented Jul 25, 2025

@BobLd I'm planning a 0.1.11 full release as follows:

@BobLd
Copy link
Collaborator

BobLd commented Jul 25, 2025

@EliotJones thanks a lot for this change, looks great!

Regarding 0.1.11, we are on the same page - agreed with your plan.

@BobLd BobLd merged commit 85fc63d into master Jul 25, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants