
Improve checksum calculations #13989

Open · wants to merge 3 commits into main
Conversation

@jfboeuf (Contributor) commented Nov 12, 2024

Take advantage of the existing buffer in BufferedChecksum to speed up reads for Longs, Ints, Shorts, and Long arrays by avoiding byte-by-byte reads.
Use the faster readLongs() method to decode live docs and bloom filters.
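The idea can be sketched as follows; this is a simplified, hypothetical stand-in (the class name BufferingChecksum is illustrative, not Lucene's actual BufferedChecksum): stage encoded values in a byte[] and hand whole chunks to Checksum.update(byte[], int, int), rather than calling update(int) once per byte.

```java
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Simplified, hypothetical stand-in for the idea behind Lucene's BufferedChecksum:
// stage encoded values in a byte[] and feed whole chunks to update(byte[], int, int),
// which lets CRC32 take its fast bulk path instead of the slow byte-at-a-time path.
class BufferingChecksum {
    private final Checksum in = new CRC32();
    private final byte[] buf = new byte[256];
    private int upto; // number of staged bytes

    void updateLong(long v) {
        if (upto + Long.BYTES > buf.length) {
            flush();
        }
        // little-endian encoding of the long into the staging buffer
        for (int i = 0; i < Long.BYTES; i++) {
            buf[upto++] = (byte) (v >>> (8 * i));
        }
    }

    long getValue() {
        flush();
        return in.getValue();
    }

    private void flush() {
        if (upto > 0) {
            in.update(buf, 0, upto); // one bulk update over the staged bytes
            upto = 0;
        }
    }
}
```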

@rmuir (Member) commented Nov 12, 2024

This is actually slower: we only want to call updateBytes(byte[]), or the checksum calculation is very slow (not vectorized).

@jfboeuf (Contributor, Author) commented Nov 12, 2024

Thank you for your feedback. Perhaps I misunderstood your point, but the implementation I propose only calls Checksum.update(byte[]). The change is in how the buffer is fed, to avoid byte-by-byte reading from the underlying IndexInput and byte-by-byte copying into the buffer. I created a JMH benchmark (pushed to a separate branch) comparing the current implementation to the one I propose. The results show a noticeable improvement:

Benchmark                                                  (size)   Mode  Cnt     Score     Error   Units
BufferedChecksumIndexInputBenchmark.decodeLongArray            10  thrpt   15  7521.843 ± 167.756  ops/ms
BufferedChecksumIndexInputBenchmark.decodeLongArray          1000  thrpt   15  1004.213 ±   3.587  ops/ms
BufferedChecksumIndexInputBenchmark.decodeLongArray        100000  thrpt   15    10.993 ±   0.102  ops/ms
BufferedChecksumIndexInputBenchmark.decodeLongArrayOrig        10  thrpt   15  3865.018 ±  38.850  ops/ms
BufferedChecksumIndexInputBenchmark.decodeLongArrayOrig      1000  thrpt   15    46.381 ±   0.293  ops/ms
BufferedChecksumIndexInputBenchmark.decodeLongArrayOrig    100000  thrpt   15     0.475 ±   0.022  ops/ms
BufferedChecksumIndexInputBenchmark.decodeSingleLongs          10  thrpt   15  8355.638 ±  52.745  ops/ms
BufferedChecksumIndexInputBenchmark.decodeSingleLongs        1000  thrpt   15   212.133 ±   4.296  ops/ms
BufferedChecksumIndexInputBenchmark.decodeSingleLongs      100000  thrpt   15     2.744 ±   0.021  ops/ms
BufferedChecksumIndexInputBenchmark.decodeSingleLongsOrig      10  thrpt   15  3938.857 ±  42.751  ops/ms
BufferedChecksumIndexInputBenchmark.decodeSingleLongsOrig    1000  thrpt   15    47.246 ±   0.444  ops/ms
BufferedChecksumIndexInputBenchmark.decodeSingleLongsOrig  100000  thrpt   15     0.460 ±   0.020  ops/ms

For large arrays, there can be an improvement of up to 23x when reading long arrays and 6x when reading single long values. Transitioning from reading single long values to long arrays for live documents and Bloom filters (bitsets are commonly large in both scenarios) yields even greater gains.

The benchmark shows the single-long approach performs better on small arrays. This is likely due to the cost of wrapping the byte[] into a ByteBuffer and LongBuffer, a cost that is not recouped when copying only a few bytes. It could be improved by making updateLongs(long[], int, int) fall back to a loop over updateLong(long) when the length to checksum fits in the buffer. What do you think?
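For illustration, the wrapping cost mentioned above comes from a path like the following sketch (checksumLongs is a hypothetical free-standing helper written for this example, not the patch itself): the long[] is copied into a byte[] through a ByteBuffer/LongBuffer view, then checksummed in a single bulk update() call.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.CRC32;

// Hypothetical sketch of the bulk path discussed above: copy a long[] into a
// byte[] through a ByteBuffer/LongBuffer view, then checksum the bytes in one
// update() call. The view setup has a fixed cost, which is why a per-long loop
// can win for tiny arrays.
class LongChecksums {
    static long checksumLongs(long[] vals, int offset, int len) {
        byte[] bytes = new byte[len * Long.BYTES];
        ByteBuffer bb = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        bb.asLongBuffer().put(vals, offset, len); // bulk copy of the longs
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);       // single vectorizable update
        return crc.getValue();
    }
}
```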

@jpountz (Contributor) commented Nov 12, 2024

The change makes sense to me and looks like it could speed up loading live docs (and thus speed up opening readers).

The benchmark shows the single-long approach performs better on small arrays. [...] It can be improved by making updateLongs(long[], int, int) switch to a loop over updateLong(long) when the length to checksum fits in the buffer.

Something like this would make sense to me; let's make it look as similar as possible to update(byte[], int, int)?

@rmuir (Member) commented Nov 12, 2024

OK, I see @jfboeuf, thank you for the explanation. My only concern with the optimization is testing. If there is a bug here, the user will get a CorruptIndexException.

Could we add some low-level tests somehow?

Especially readLongs() seems like potential trouble, as I am not sure anything tests that it works correctly with a non-zero offset.
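One way to exercise this at a low level might look like the following sketch (a hedged illustration, not Lucene's actual test harness; ReadLongsCheck and its method names are hypothetical): decode longs into a destination array at a non-zero offset and verify each value against a byte-at-a-time decode of the same little-endian input.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hedged sketch of the kind of low-level check requested here: bulk-decode longs
// into a destination array at a non-zero offset and compare each decoded value
// against a slow byte-at-a-time decode of the same little-endian bytes.
class ReadLongsCheck {
    static void readLongs(byte[] src, long[] dst, int dstOffset, int count) {
        ByteBuffer.wrap(src).order(ByteOrder.LITTLE_ENDIAN)
                  .asLongBuffer().get(dst, dstOffset, count);
    }

    static boolean offsetsAgree(byte[] src, int count) {
        long[] bulk = new long[count + 3];
        readLongs(src, bulk, 3, count); // non-zero destination offset
        for (int i = 0; i < count; i++) {
            long slow = 0;
            for (int b = 0; b < Long.BYTES; b++) {
                slow |= (src[i * Long.BYTES + b] & 0xFFL) << (8 * b);
            }
            if (bulk[3 + i] != slow) {
                return false;
            }
        }
        return true;
    }
}
```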

@jfboeuf (Contributor, Author) commented Nov 12, 2024

@jpountz
I modified the benchmark to make it more realistic by adding a header to the IndexOutput, so that:

  • updateLongs(long[], int, int) triggers a real flush before processing (the benchmark was biased because the buffer was empty, which is unlikely in real operations, where at least the header and the table size precede the data).
  • write accesses to the buffer by updateLong(long) are not necessarily aligned (they always are with the new updateLongs(long[], int, int)).

After testing with different sizes of long[], it seems the updateLong(long) loop is only worthwhile for very small arrays, and this specific case probably doesn't justify the additional complexity/branch in updateLongs(long[], int, int).

Benchmark                                              (size)   Mode  Cnt     Score     Error   Units
BufferedChecksumIndexInputBenchmark.decodeLongArray         8  thrpt   15  5992.000 ± 123.429  ops/ms
BufferedChecksumIndexInputBenchmark.decodeLongArray        16  thrpt   15  5729.068 ± 123.423  ops/ms
BufferedChecksumIndexInputBenchmark.decodeLongArray        32  thrpt   15  5307.703 ± 149.885  ops/ms
BufferedChecksumIndexInputBenchmark.decodeLongArray        64  thrpt   15  4960.178 ± 124.810  ops/ms
BufferedChecksumIndexInputBenchmark.decodeSingleLongs       8  thrpt   15  6182.986 ± 253.181  ops/ms
BufferedChecksumIndexInputBenchmark.decodeSingleLongs      16  thrpt   15  5472.158 ± 190.560  ops/ms
BufferedChecksumIndexInputBenchmark.decodeSingleLongs      32  thrpt   15  4739.634 ± 404.963  ops/ms
BufferedChecksumIndexInputBenchmark.decodeSingleLongs      64  thrpt   15  3858.685 ±  81.880  ops/ms

@rmuir I'll add unit tests to thoroughly check the method behaves properly.

The following inline review comments refer to this hunk of the diff:

    }

    void updateLongs(long[] vals, int offset, int len) {
      flush();
A reviewer (Contributor) commented:
All other updateXXX() methods only flush if there is no room left or if the data to update is bigger than the buffer size. I'd like to retain this property here (even though your benchmarks suggest that it doesn't hurt performance much).

@jfboeuf (Author) replied:

I pushed the modification to limit the flush. Not only is the buffer no longer systematically flushed on entering updateLongs(long[], int, int), but the last chunk is also not flushed, which is beneficial in the likely case that remaining data (the codec footer) follows the long[] and can fit in the buffer.
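The lazy-flush policy described above can be sketched as follows (a hedged illustration with hypothetical names, not the actual patch): flush the staging buffer only when the next value cannot fit, and leave the final partial chunk buffered so that trailing data can share the next flush.

```java
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Sketch of the lazy-flush policy described above (class and field names are
// illustrative): flush the staging buffer only when the next value cannot fit,
// and leave the final partial chunk buffered so that trailing data (e.g. the
// codec footer) can share the next flush instead of forcing a small update().
class LazyFlushChecksum {
    private final Checksum in = new CRC32();
    private final byte[] buf = new byte[64]; // room for 8 longs
    private int upto;

    void updateLongs(long[] vals, int offset, int len) {
        for (int i = 0; i < len; i++) {
            if (upto + Long.BYTES > buf.length) {
                flush(); // only when the next long does not fit
            }
            long v = vals[offset + i];
            for (int b = 0; b < Long.BYTES; b++) {
                buf[upto++] = (byte) (v >>> (8 * b));
            }
        }
        // no trailing flush: the last chunk stays buffered for later data
    }

    long getValue() {
        flush();
        return in.getValue();
    }

    private void flush() {
        if (upto > 0) {
            in.update(buf, 0, upto);
            upto = 0;
        }
    }
}
```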
