Improve checksum calculations #13989
base: main
Conversation
Take advantage of the existing buffer in BufferedChecksum to speed up reads for Longs, Ints, Shorts, and Long arrays by avoiding byte-by-byte reads. Use the faster readLongs() method to decode live docs and bloom filters.
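For context, a minimal sketch of the buffering idea, assuming a CRC32-backed checksum and illustrative names (the real BufferedChecksum differs in details such as buffer size and endianness): a long is encoded into the existing buffer in one shot instead of feeding the wrapped Checksum eight single bytes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Illustrative sketch only, not the Lucene source.
class BufferedChecksumSketch {
  private final Checksum in = new CRC32();
  private final byte[] buffer = new byte[256]; // size is an assumption
  private int upto;

  void updateLong(long v) {
    if (upto + Long.BYTES > buffer.length) {
      flush(); // flush only when the 8 bytes no longer fit
    }
    // endianness must match the on-disk format; little-endian is assumed here
    ByteBuffer.wrap(buffer, upto, Long.BYTES).order(ByteOrder.LITTLE_ENDIAN).putLong(v);
    upto += Long.BYTES;
  }

  void flush() {
    in.update(buffer, 0, upto); // one bulk update to the wrapped checksum
    upto = 0;
  }

  long getValue() {
    flush();
    return in.getValue();
  }
}
```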
This is actually slower; we only want to call …
Thank you for your feedback. Perhaps I misunderstood your point, but the implementation I propose only calls …
For large arrays, there can be an improvement of up to 23x when reading long arrays and 6x when reading single long values. Transitioning from reading single long values to long arrays for live documents and Bloom filters (bitsets are commonly large in both scenarios) results in even greater performance gains. The benchmark shows the single-long approach performs better on small arrays, which is likely due to the fixed cost of wrapping the …
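To illustrate that wrapping cost, a hedged sketch (names are illustrative, not the Lucene internals): the bulk path pays a fixed price to create a LongBuffer view before copying, which is only amortized over large arrays, while the scalar path avoids the view entirely.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ReadLongsSketch {
  // bulk path: fixed view-creation overhead, then one bulk copy
  static void readLongs(byte[] buffer, int pos, long[] dst, int offset, int len) {
    ByteBuffer.wrap(buffer, pos, len * Long.BYTES)
        .order(ByteOrder.LITTLE_ENDIAN)
        .asLongBuffer()
        .get(dst, offset, len);
  }

  // scalar path: no view creation, cheaper for one or two values
  static long readLong(byte[] buffer, int pos) {
    return ByteBuffer.wrap(buffer, pos, Long.BYTES)
        .order(ByteOrder.LITTLE_ENDIAN)
        .getLong();
  }
}
```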
The change makes sense to me and looks like it could speed up loading live docs (and thus speed up opening readers).
Something like this would make sense to me; let's make it look as similar as possible to …
OK, I see @jfboeuf, thank you for the explanation. My only concern with the optimization is testing. If there is a bug here, the user will get a CorruptIndexException. Could we add some low-level tests somehow? Especially readLongs() seems like potential trouble, as I am not sure anything tests that it works correctly with a non-zero …
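A sketch of the kind of low-level check being asked for, under the assumption that the property to verify is byte-for-byte equivalence: bulk encoding of a long[] slice must feed the checksum exactly the same bytes as per-value encoding, across random lengths and non-zero offsets. A plain main() is used here; a real Lucene test would extend LuceneTestCase.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Random;
import java.util.zip.CRC32;

public class UpdateLongsEquivalenceCheck {
  public static void main(String[] args) {
    Random random = new Random(42);
    for (int iter = 0; iter < 1_000; iter++) {
      long[] vals = new long[random.nextInt(300)];
      for (int i = 0; i < vals.length; i++) {
        vals[i] = random.nextLong();
      }
      // exercise non-trivial slices, including non-zero offsets
      int offset = vals.length == 0 ? 0 : random.nextInt(vals.length);
      int len = random.nextInt(vals.length - offset + 1);

      // reference: encode each long, update the checksum byte by byte
      CRC32 expected = new CRC32();
      ByteBuffer scratch = ByteBuffer.allocate(Long.BYTES).order(ByteOrder.LITTLE_ENDIAN);
      for (int i = offset; i < offset + len; i++) {
        scratch.clear();
        scratch.putLong(vals[i]);
        for (byte b : scratch.array()) {
          expected.update(b);
        }
      }

      // bulk path: encode the whole slice at once, one bulk checksum update
      CRC32 actual = new CRC32();
      ByteBuffer bulk = ByteBuffer.allocate(len * Long.BYTES).order(ByteOrder.LITTLE_ENDIAN);
      bulk.asLongBuffer().put(vals, offset, len);
      actual.update(bulk.array(), 0, len * Long.BYTES);

      if (expected.getValue() != actual.getValue()) {
        throw new AssertionError("checksum mismatch at iter " + iter);
      }
    }
    System.out.println("ok");
  }
}
```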
@jpountz After testing with different sizes of …
@rmuir I'll add unit tests to thoroughly check the method behaves properly.
}

void updateLongs(long[] vals, int offset, int len) {
  flush();
All other updateXXX() methods only flush if there is no room left or if the data to update is bigger than the buffer size. I'd like to retain this property here (even though your benchmarks suggest that it doesn't hurt performance much).
I pushed the modification to limit the flush. Not only is the buffer no longer systematically flushed when entering updateLongs(long[], int, int), but the last chunk is also left in the buffer, which is beneficial in the likely case that there is remaining data (the codec footer) after the long[] that could fit in the buffer.
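A hedged sketch of the resulting behavior, reusing the illustrative buffer, upto, and flush() members from the sketch earlier in this thread rather than quoting the actual patch: flush only when the next longs do not fit, and leave the final partial chunk buffered so trailing data such as the codec footer can share the same bulk update.

```java
// assumes the buffer/upto/flush() members sketched earlier; illustrative only
void updateLongs(long[] vals, int offset, int len) {
  while (len > 0) {
    if (upto + Long.BYTES > buffer.length) {
      flush(); // flush only when there is no room for the next long
    }
    // copy as many longs as fit into the remaining buffer space
    int n = Math.min(len, (buffer.length - upto) / Long.BYTES);
    ByteBuffer.wrap(buffer, upto, n * Long.BYTES)
        .order(ByteOrder.LITTLE_ENDIAN)
        .asLongBuffer()
        .put(vals, offset, n);
    upto += n * Long.BYTES;
    offset += n;
    len -= n;
    // no flush here: the last chunk stays buffered so a small trailing
    // write (e.g. the codec footer) can coalesce with it
  }
}
```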