Introduce raw fwd index version V5 containing implicit num doc length, improving space efficiency #14105

jackluo923 · 2024-09-27T21:15:35Z

The data layout of the multi-value fixed byte raw forward index can be optimized to enhance storage efficiency.

Consider the following multi-value document as an example: [int(1), int(2), int(3)]. The current binary data layout in MVFixedBytesRawFwdIndex is as follows: 0x00000010 0x00000003 0x00000001 0x00000002 0x00000003.

The first 4 bytes (0x00000010) are an integer representing the total payload length of the byte array containing the multi-value document content, which in this case is 16 bytes.
The next 4 bytes (0x00000003) are an integer explicitly representing the number of elements in the multi-value document (i.e., 3).
The remaining 12 bytes (0x00000001 0x00000002 0x00000003) are 3 integers representing the 3 values of the multi-value document: 1, 2, and 3.
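
To make the layout above concrete, here is a minimal sketch that reproduces the five 4-byte words for [1, 2, 3]. The class and method names (V4LayoutSketch, encodeV4, toHexWords) are hypothetical and not part of Pinot. The sketch only models the per-document bytes described here, not the chunk metadata the real writer also emits.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; not the Pinot writer.
public class V4LayoutSketch {

  // Encodes one multi-value int document as described above:
  // [payload length][explicit element count][values...], big-endian.
  static byte[] encodeV4(int[] values) {
    // The payload covers the explicit count plus the values themselves.
    int payloadLength = Integer.BYTES + values.length * Integer.BYTES;
    ByteBuffer buffer = ByteBuffer.allocate(Integer.BYTES + payloadLength);
    buffer.putInt(payloadLength);
    buffer.putInt(values.length);
    for (int value : values) {
      buffer.putInt(value);
    }
    return buffer.array();
  }

  // Renders the bytes as space-separated 4-byte hex words.
  static String toHexWords(byte[] bytes) {
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    StringBuilder sb = new StringBuilder();
    while (buffer.remaining() >= Integer.BYTES) {
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(String.format("0x%08X", buffer.getInt()));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Prints: 0x00000010 0x00000003 0x00000001 0x00000002 0x00000003
    System.out.println(toHexWords(encodeV4(new int[] {1, 2, 3})));
  }
}
```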

In Pinot, the fixed byte raw forward index can only contain one specific fixed-length data type: int, long, float, or double. Rather than explicitly specifying the number of elements for each document using an integer, this value can be omitted and instead inferred implicitly using the following calculation:

number of elements = buffer payload length / size of data type
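
The calculation above can be sketched as a round trip: write only the payload length and the values, then recover the element count by division when reading. The names below (ImplicitCountSketch, encodeImplicit, decodeImplicit) are assumptions for illustration. The actual V5 writer and reader operate on chunked, possibly compressed buffers, which this sketch leaves out.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of the implicit element count; not the Pinot V5 classes.
public class ImplicitCountSketch {

  // Writes [payload length][values...] with no explicit element count.
  static byte[] encodeImplicit(int[] values) {
    int payloadLength = values.length * Integer.BYTES;
    ByteBuffer buffer = ByteBuffer.allocate(Integer.BYTES + payloadLength);
    buffer.putInt(payloadLength);
    for (int value : values) {
      buffer.putInt(value);
    }
    return buffer.array();
  }

  // Reads the document back, inferring the count from the payload length.
  static int[] decodeImplicit(byte[] bytes) {
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    int payloadLength = buffer.getInt();
    // number of elements = buffer payload length / size of data type
    int numElements = payloadLength / Integer.BYTES;
    int[] values = new int[numElements];
    for (int i = 0; i < numElements; i++) {
      values[i] = buffer.getInt();
    }
    return values;
  }

  public static void main(String[] args) {
    byte[] encoded = encodeImplicit(new int[] {1, 2, 3});
    System.out.println(encoded.length + " bytes, " + decodeImplicit(encoded).length + " elements");
  }
}
```

For the example document, this takes 16 bytes instead of the 20 bytes used by the layout with an explicit count.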

If the forward index uses the passthrough chunk compression type (i.e., no compression), we can save 4 bytes per document by omitting the explicit element count. This results in the following savings:

For documents with 0 elements, we save 50%.
For documents with 1 element, we save 33%.
For documents with 2 elements, we save 25%.
As the number of elements increases, the percentage of space saved decreases accordingly.
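
The percentages above follow from dividing the 4 omitted bytes by the per-document size with an explicit count. A small sketch of that arithmetic, assuming 4-byte int values and counting only the length prefix, the count, and the values (the name SavingsSketch is hypothetical):

```java
// Hypothetical illustration of the per-document savings arithmetic.
public class SavingsSketch {
  static final int LENGTH_PREFIX_BYTES = Integer.BYTES;
  static final int EXPLICIT_COUNT_BYTES = Integer.BYTES;

  // Fraction of per-document bytes saved by dropping the explicit count.
  static double savedFraction(int numElements, int elementSizeBytes) {
    int valueBytes = numElements * elementSizeBytes;
    int sizeWithCount = LENGTH_PREFIX_BYTES + EXPLICIT_COUNT_BYTES + valueBytes;
    return (double) EXPLICIT_COUNT_BYTES / sizeWithCount;
  }

  public static void main(String[] args) {
    for (int n = 0; n <= 4; n++) {
      System.out.printf("%d element(s): %.0f%% saved%n", n, 100 * savedFraction(n, Integer.BYTES));
    }
  }
}
```

With 8-byte types (long or double), the same 4 bytes are saved per document, so the relative saving is smaller for any non-empty document.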

For forward indexes that use compression to reduce data size, the savings can be even more significant in some scenarios. This PR includes a unit test, VarByteChunkV5Test#validateCompressionRatioIncrease, which demonstrates this. In the test, we used ZStandard as the chunk compressor and inserted 1 million short multi-value (MV) documents whose lengths follow a Gaussian distribution. The integer values in the MV documents were also somewhat repetitive. Under these conditions, we observed a 50%+ reduction in on-disk file size compared to the V4 forward index writer.

This PR introduces the implicit length optimization via VarByteChunkForwardIndexWriterV5 on the write path, alongside VarByteChunkForwardIndexReaderV5 on the read path. MultiValueFixedByteRawIndexCreator and ForwardIndexReaderFactory are also modified accordingly to use the new index writer and reader when the index version is set to 5 or greater. After this PR is merged, other composite forward indexes such as the CLPForwardIndexCreatorV1 forward index can leverage these new classes to significantly improve the overall compression ratio.

codecov-commenter · 2024-09-27T21:51:40Z

Codecov Report

Attention: Patch coverage is 60.00000% with 18 lines in your changes missing coverage. Please review.

Project coverage is 63.76%. Comparing base (59551e4) to head (6fe4517).
Report is 1195 commits behind head on master.

Files with missing lines	Patch %	Lines
...ders/forward/VarByteChunkForwardIndexReaderV5.java	27.27%	8 Missing ⚠️
.../writer/impl/VarByteChunkForwardIndexWriterV5.java	45.45%	6 Missing ⚠️
...gment/index/forward/ForwardIndexReaderFactory.java	33.33%	1 Missing and 1 partial ⚠️
.../writer/impl/VarByteChunkForwardIndexWriterV4.java	75.00%	1 Missing ⚠️
...ders/forward/VarByteChunkForwardIndexReaderV4.java	80.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #14105      +/-   ##
============================================
+ Coverage     61.75%   63.76%   +2.01%     
- Complexity      207     1535    +1328     
============================================
  Files          2436     2626     +190     
  Lines        133233   144646   +11413     
  Branches      20636    22136    +1500     
============================================
+ Hits          82274    92239    +9965     
- Misses        44911    45596     +685     
- Partials       6048     6811     +763

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (+99.99%)`	⬆️
integration	`100.00% <ø> (+99.99%)`	⬆️
integration1	`100.00% <ø> (+99.99%)`	⬆️
integration2	`0.00% <ø> (ø)`
java-11	`55.38% <20.00%> (-6.33%)`	⬇️
java-21	`63.66% <60.00%> (+2.03%)`	⬆️
skip-bytebuffers-false	`63.75% <60.00%> (+2.00%)`	⬆️
skip-bytebuffers-true	`63.64% <60.00%> (+35.91%)`	⬆️
temurin	`63.76% <60.00%> (+2.01%)`	⬆️
unittests	`63.76% <60.00%> (+2.01%)`	⬆️
unittests1	`55.42% <20.00%> (+8.53%)`	⬆️
unittests2	`34.33% <60.00%> (+6.60%)`	⬆️


Jackie-Jiang

Let's introduce a new forward index version v5 for this new format

Jackie-Jiang · 2024-10-08T01:06:17Z

...pache/pinot/segment/local/segment/creator/impl/fwd/MultiValueFixedByteRawIndexCreatorV2.java

+/**
+ Same as MultiValueFixedByteRawIndexCreator, but without storing the number of elements for each row.
+ */
+public class MultiValueFixedByteRawIndexCreatorV2 extends MultiValueFixedByteRawIndexCreator {


We don't want to add a new creator, because a creator is used to handle the creation of different versions of the forward index. Instead, we want to add a new raw index version, v5, for this new format.

Jackie-Jiang · 2024-10-08T01:07:03Z

.../pinot/segment/local/segment/index/readers/forward/FixedByteChunkMVForwardIndexReaderV2.java

+/**
+ Same as FixedByteChunkMVForwardIndexReader, but the number of elements for each row is inferred
+ */
+public final class FixedByteChunkMVForwardIndexReaderV2 extends FixedByteChunkMVForwardIndexReader {


Let's call it V5 to be consistent with index version

Jackie-Jiang

LGTM

Jackie-Jiang · 2024-10-15T18:42:22Z

...ain/java/org/apache/pinot/segment/local/io/writer/impl/VarByteChunkForwardIndexWriterV5.java

+
+
+/**
+ * Forward index writer that extends {@link VarByteChunkForwardIndexWriterV4} with the only difference being the


This is not the only difference. Let's also document the value format difference

Jackie-Jiang · 2024-10-16T21:24:08Z

...ache/pinot/segment/local/segment/index/readers/forward/VarByteChunkForwardIndexReaderV4.java

@@ -81,6 +80,12 @@ public VarByteChunkForwardIndexReaderV4(PinotDataBuffer dataBuffer, FieldSpec.Da
    _isSingleValue = isSingleValue;
  }

+  public void validateIndexVersion(PinotDataBuffer dataBuffer) {


I meant we can add a getVersion() method into this class, and override it in v5 reader
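
A minimal sketch of the suggestion, assuming the base reader validates the version found in the index header against whatever getVersion() returns for the concrete class. The class names and the single-int constructor are hypothetical simplifications. The real readers take a PinotDataBuffer and other arguments.

```java
// Hypothetical illustration of a version check driven by an overridable getVersion().
public class VersionedReaderSketch {

  static class ReaderV4 {
    ReaderV4(int versionInHeader) {
      // getVersion() dispatches to the subclass override. This is safe here only
      // because the override returns a constant and reads no subclass state.
      if (versionInHeader != getVersion()) {
        throw new IllegalStateException(
            "Illegal index version: " + versionInHeader + ", expected: " + getVersion());
      }
    }

    public int getVersion() {
      return 4;
    }
  }

  static class ReaderV5 extends ReaderV4 {
    ReaderV5(int versionInHeader) {
      super(versionInHeader);
    }

    @Override
    public int getVersion() {
      return 5;
    }
  }

  public static void main(String[] args) {
    System.out.println(new ReaderV4(4).getVersion());
    System.out.println(new ReaderV5(5).getVersion());
  }
}
```

With this shape, the V5 reader reuses the V4 validation logic and only overrides the expected version.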

Jackie-Jiang

Only a few minor comments

Jackie-Jiang · 2024-10-16T22:00:00Z

...ain/java/org/apache/pinot/segment/local/io/writer/impl/VarByteChunkForwardIndexWriterV4.java

@@ -76,11 +76,13 @@
 public class VarByteChunkForwardIndexWriterV4 implements VarByteChunkWriter {
  public static final int VERSION = 4;

-  private static final Logger LOGGER = LoggerFactory.getLogger(VarByteChunkForwardIndexWriterV4.class);
+  // Use the run-time concrete class to retrieve the logger
+  protected final Logger _logger = LoggerFactory.getLogger(this.getClass());


Suggested change

-  protected final Logger _logger = LoggerFactory.getLogger(this.getClass());
+  protected final Logger _logger = LoggerFactory.getLogger(getClass());
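
The reason for an instance-level logger keyed on getClass() is that it resolves to the run-time class, so a subclass logs under its own name. A small sketch of that behavior, using a plain string in place of an SLF4J logger so it stays self-contained (all names here are hypothetical):

```java
// Hypothetical illustration: getClass() in a base-class field initializer
// resolves to the concrete subclass at run time.
public class LoggerNameSketch {

  static class WriterV4 {
    // Stand-in for LoggerFactory.getLogger(getClass()).
    protected final String _loggerName = getClass().getSimpleName();

    public String loggerName() {
      return _loggerName;
    }
  }

  static class WriterV5 extends WriterV4 {
  }

  public static void main(String[] args) {
    System.out.println(new WriterV4().loggerName()); // WriterV4
    System.out.println(new WriterV5().loggerName()); // WriterV5
  }
}
```

Note that this.getClass() and getClass() are equivalent, which is why the suggestion is purely stylistic.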

Initial implementation of toggling explicit MV entry size for MVFixed…

84987ed

…ByteRawFwdIndex

jackluo923 added 8 commits October 2, 2024 05:24

Fixed uncovered code paths exposed via unit test

d654fd9

Fix style issue

3d4b99b

Refactored code to use new class versions.

8c967b5

Fixed style.

c2359ec

Refactored MultiValueFixedByteRawIndexCreatorTest.java

dd3410f

Fix style.

0c0df84

Modified existing unit test and extended it for MultiValueFixedByteRa…

e7e091b

…wIndexCreatorV2Test.java

Improved unit test for MultiValueFixedByteRawIndexCreatorTest and Mul…

153be16

…tiValueFixedByteRawIndexCreatorV2Test

jackluo923 changed the title ~~WIP: Enable toggling of explicit MV entry size for MVFixedBytesRawFwdIndex~~ Introduce MultiValueFixedByteRawIndexCreatorV2 and corresponding forward index reader path, improving space efficiency over MultiValueFixedByteRawIndexCreator Oct 2, 2024

jackluo923 added 2 commits October 2, 2024 23:17

Remove redundant blank line

0233905

Adjusted comments content

69defe1

Jackie-Jiang added the enhancement, feature, release-notes, and Configuration labels Oct 2, 2024

Removed redundant constructor missed during refactoring.

e1173c0

Jackie-Jiang reviewed Oct 8, 2024

jackluo923 added 2 commits October 9, 2024 05:51

Upgrade MVFixedByteRawIndex reader and writer from V4 to V5, retain f…

34ac786

…orward index creator version.

Minor changes in MultiValueFixedByteRawIndexCreator

b090676

jackluo923 added 2 commits October 9, 2024 06:08

Fix minor style issue.

2ff1914

Refactored FixByteChunkMVForwardIndexReader

54b2709

jackluo923 changed the title ~~Introduce raw fwd index version V5 containing implicit num doc length, improving space efficiency~~ WIP: Introduce raw fwd index version V5 containing implicit num doc length, improving space efficiency Oct 8, 2024

jackluo923 added 5 commits October 9, 2024 16:35

Deleted FixByteChunkMVForwardIndexReaderV2

318b826

Deleted FixByteChunkMVForwardIndexReaderV2Test

a9170b7

Add VarByteChunkV5Test unit test

d699c2a

Add license to VarByteChunkV5Test unit test

7137792

Improved unit test

6452c79

Rebase to latest

27328bf

Jackie-Jiang approved these changes Oct 15, 2024

jackluo923 added 20 commits October 16, 2024 02:51

Refactored code to use new class versions.

79c6f66

Fixed style.

e9778d3

Refactored MultiValueFixedByteRawIndexCreatorTest.java

b43f676

Fix style.

c6033b9

Modified existing unit test and extended it for MultiValueFixedByteRa…

3f654e4

…wIndexCreatorV2Test.java

Improved unit test for MultiValueFixedByteRawIndexCreatorTest and Mul…

12af1ce

…tiValueFixedByteRawIndexCreatorV2Test

Adjusted comments content

063c5b4

Upgrade MVFixedByteRawIndex reader and writer from V4 to V5, retain f…

b8794f2

…orward index creator version.

Deleted FixByteChunkMVForwardIndexReaderV2

9812e3e

Deleted FixByteChunkMVForwardIndexReaderV2Test

085aed6

Improved unit test

7e1d10c

Refactored unit test

d06dda4

Add blank line

e9835c5

Remove blank line

06c0b95

Add blank line

07d6f75

Remove blank line

680fc24

Removed redundant RuntimeException from method signature

1b22234

Merge remote-tracking branch 'origin/master-improved-MV-fixed-byte-in…

cfbc9ee

…dex' into master-improved-MV-fixed-byte-index

Updated javadoc for VarByteChunkForwardIndexWriterV5

0da0ca7

Addressed code review comments to use getVersion() in forward index…

89ec8af

… reader to fetch version number.

Jackie-Jiang reviewed Oct 16, 2024

Addressed final minor code review suggestion.

592967a

Jackie-Jiang approved these changes Oct 16, 2024

jackluo923 added 2 commits October 17, 2024 12:22

Change getConcreteClassVersion back to getVersion

44b0df8

Adjusted member variable scope in VarByteChunkForwardIndexWriterV4

6fe4517

deemoliu approved these changes Oct 17, 2024

Jackie-Jiang merged commit ad37bd8 into apache:master Oct 17, 2024
20 of 21 checks passed

jackluo923 mentioned this pull request Oct 21, 2024

Add missing precondition check for V5 writer version in BaseChunkForwardIndexWriter #14265

Merged
