-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce raw fwd index version V5 containing implicit num doc length, improving space efficiency #14105
Merged
Jackie-Jiang
merged 104 commits into
apache:master
from
y-scope:master-improved-MV-fixed-byte-index
Oct 17, 2024
Merged
Introduce raw fwd index version V5 containing implicit num doc length, improving space efficiency #14105
Changes from 101 commits
Commits
Show all changes
104 commits
Select commit
Hold shift + click to select a range
84987ed
Initial implementation of toggling explicit MV entry size for MVFixed…
jackluo923 d654fd9
Fixed uncovered code paths exposed via unit test
jackluo923 3d4b99b
Fix style issue
jackluo923 8c967b5
Refactored code to use new class versions.
jackluo923 c2359ec
Fixed style.
jackluo923 dd3410f
Refactored MultiValueFixedByteRawIndexCreatorTest.java
jackluo923 0c0df84
Fix style.
jackluo923 e7e091b
Modified existing unit test and extended it for MultiValueFixedByteRa…
jackluo923 153be16
Improved unit test for MultiValueFixedByteRawIndexCreatorTest and Mul…
jackluo923 0233905
Remove redundant blank line
jackluo923 69defe1
Adjusted comments content
jackluo923 e1173c0
Removed redundant constructor missed during refactoring.
jackluo923 34ac786
Upgrade MVFixedByteRawIndex reader and writer from V4 to V5, retain f…
jackluo923 b090676
Minor changes in MultiValueFixedByteRawIndexCreator
jackluo923 2ff1914
Fix minor style issue.
jackluo923 54b2709
Refactored FixByteChunkMVForwardIndexReader
jackluo923 318b826
Deleted FixByteChunkMVForwardIndexReaderV2
jackluo923 a9170b7
Deleted FixByteChunkMVForwardIndexReaderV2Test
jackluo923 d699c2a
Add VarByteChunkV5Test unit test
jackluo923 7137792
Add license to VarByteChunkV5Test unit test
jackluo923 6452c79
Improved unit test
jackluo923 9bfbd22
Refactored unit test
jackluo923 fac46c5
Add blank line
jackluo923 29a9fdb
Remove blank line
jackluo923 8abf7fe
Add blank line
jackluo923 1c877c6
Remove blank line
jackluo923 ce6870b
Rebase with lastest master
jackluo923 0d71f91
Fixed uncovered code paths exposed via unit test
jackluo923 736f23f
Fix style issue
jackluo923 f2faece
Refactored code to use new class versions.
jackluo923 01cbf56
Fixed style.
jackluo923 f8c2f24
Refactored MultiValueFixedByteRawIndexCreatorTest.java
jackluo923 a6ca351
Fix style.
jackluo923 61fee18
Modified existing unit test and extended it for MultiValueFixedByteRa…
jackluo923 500dae6
Improved unit test for MultiValueFixedByteRawIndexCreatorTest and Mul…
jackluo923 427bdb5
Remove redundant blank line
jackluo923 0cb705d
Adjusted comments content
jackluo923 1617ebd
Removed redundant constructor missed during refactoring.
jackluo923 813d360
Upgrade MVFixedByteRawIndex reader and writer from V4 to V5, retain f…
jackluo923 7a565e4
Fix minor style issue.
jackluo923 296959d
Deleted FixByteChunkMVForwardIndexReaderV2
jackluo923 0d7073b
Deleted FixByteChunkMVForwardIndexReaderV2Test
jackluo923 427af58
Add VarByteChunkV5Test unit test
jackluo923 6e8c3ae
Add license to VarByteChunkV5Test unit test
jackluo923 52976a7
Improved unit test
jackluo923 9bb453a
Refactored unit test
jackluo923 93d3100
Add blank line
jackluo923 79e91e9
Remove blank line
jackluo923 9a7676f
Add blank line
jackluo923 aa9eb74
Remove blank line
jackluo923 4ce5280
Refactored code to utilize changes from Extract common MV ser/de logi…
jackluo923 0d88f2f
Merge remote-tracking branch 'origin/master-improved-MV-fixed-byte-in…
jackluo923 ec1b628
Removed redundant RuntimeException from method signature
jackluo923 87dd327
Merge branch 'apache:master' into master-improved-MV-fixed-byte-index
jackluo923 e7f645c
Rebase with lastest master
jackluo923 5cb4575
Fixed uncovered code paths exposed via unit test
jackluo923 1ece331
Fix style issue
jackluo923 4c35683
Refactored code to use new class versions.
jackluo923 bd1da13
Fixed style.
jackluo923 2637e2d
Refactored MultiValueFixedByteRawIndexCreatorTest.java
jackluo923 731906e
Fix style.
jackluo923 171aaf4
Modified existing unit test and extended it for MultiValueFixedByteRa…
jackluo923 bca5eda
Improved unit test for MultiValueFixedByteRawIndexCreatorTest and Mul…
jackluo923 ce5eb7b
Remove redundant blank line
jackluo923 7274f4c
Adjusted comments content
jackluo923 ea29a13
Removed redundant constructor missed during refactoring.
jackluo923 256d774
Upgrade MVFixedByteRawIndex reader and writer from V4 to V5, retain f…
jackluo923 acfe864
Fix minor style issue.
jackluo923 a4751b6
Deleted FixByteChunkMVForwardIndexReaderV2
jackluo923 32062a1
Deleted FixByteChunkMVForwardIndexReaderV2Test
jackluo923 ef3f663
Add VarByteChunkV5Test unit test
jackluo923 38a8cb6
Add license to VarByteChunkV5Test unit test
jackluo923 b8dfacd
Improved unit test
jackluo923 bd9bdee
Refactored unit test
jackluo923 5b6e29e
Add blank line
jackluo923 597762f
Remove blank line
jackluo923 ff21345
Add blank line
jackluo923 cde2a6d
Remove blank line
jackluo923 d430d61
Refactored code to utilize changes from Extract common MV ser/de logi…
jackluo923 3f68f75
Rebased to latest
jackluo923 27328bf
Rebase to latest
jackluo923 79c6f66
Refactored code to use new class versions.
jackluo923 e9778d3
Fixed style.
jackluo923 b43f676
Refactored MultiValueFixedByteRawIndexCreatorTest.java
jackluo923 c6033b9
Fix style.
jackluo923 3f654e4
Modified existing unit test and extended it for MultiValueFixedByteRa…
jackluo923 12af1ce
Improved unit test for MultiValueFixedByteRawIndexCreatorTest and Mul…
jackluo923 063c5b4
Adjusted comments content
jackluo923 b8794f2
Upgrade MVFixedByteRawIndex reader and writer from V4 to V5, retain f…
jackluo923 9812e3e
Deleted FixByteChunkMVForwardIndexReaderV2
jackluo923 085aed6
Deleted FixByteChunkMVForwardIndexReaderV2Test
jackluo923 7e1d10c
Improved unit test
jackluo923 d06dda4
Refactored unit test
jackluo923 e9835c5
Add blank line
jackluo923 06c0b95
Remove blank line
jackluo923 07d6f75
Add blank line
jackluo923 680fc24
Remove blank line
jackluo923 1b22234
Removed redundant RuntimeException from method signature
jackluo923 cfbc9ee
Merge remote-tracking branch 'origin/master-improved-MV-fixed-byte-in…
jackluo923 0da0ca7
Updated javadoc for VarByteChunkForwardIndexWriterV5
jackluo923 89ec8af
Addressed code review comments to use `getVersion()` in forward index…
jackluo923 592967a
Addressed final minor code review suggestion.
jackluo923 44b0df8
Change getConcreteClassVersion back to getVersion
jackluo923 6fe4517
Adjusted member variable scope in VarByteChunkForwardIndexWriterV4
jackluo923 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
122 changes: 122 additions & 0 deletions
122
.../java/org/apache/pinot/segment/local/io/writer/impl/VarByteChunkForwardIndexWriterV5.java
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,122 @@ | ||
/** | ||
* Licensed to the Apache Software Foundation (ASF) under one | ||
* or more contributor license agreements. See the NOTICE file | ||
* distributed with this work for additional information | ||
* regarding copyright ownership. The ASF licenses this file | ||
* to you under the Apache License, Version 2.0 (the | ||
* "License"); you may not use this file except in compliance | ||
* with the License. You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, | ||
* software distributed under the License is distributed on an | ||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
* KIND, either express or implied. See the License for the | ||
* specific language governing permissions and limitations | ||
* under the License. | ||
*/ | ||
package org.apache.pinot.segment.local.io.writer.impl; | ||
|
||
import java.io.File; | ||
import java.io.IOException; | ||
import javax.annotation.concurrent.NotThreadSafe; | ||
import org.apache.pinot.segment.local.utils.ArraySerDeUtils; | ||
import org.apache.pinot.segment.spi.compression.ChunkCompressionType; | ||
|
||
|
||
/** | ||
* Forward index writer that extends {@link VarByteChunkForwardIndexWriterV4} and overrides the data layout for | ||
* multi-value fixed byte operations to improve space efficiency. | ||
* | ||
* <p>Consider the following multi-value document as an example: {@code [int(1), int(2), int(3)]}. | ||
* The current binary data layout in {@code VarByteChunkForwardIndexWriterV4} is as follows:</p> | ||
* <pre> | ||
* 0x00000010 0x00000003 0x00000001 0x00000002 0x00000003 | ||
* </pre> | ||
* | ||
* <ol> | ||
* <li>The first 4 bytes ({@code 0x00000010}) represent the total payload length of the byte array | ||
* containing the multi-value document content, which in this case is 16 bytes.</li> | ||
* | ||
* <li>The next 4 bytes ({@code 0x00000003}) represent the number of elements in the multi-value document (i.e., 3) | ||
* .</li> | ||
* | ||
* <li>The remaining 12 bytes ({@code 0x00000001 0x00000002 0x00000003}) represent the 3 integer values of the | ||
* multi-value document: 1, 2, and 3.</li> | ||
* </ol> | ||
* | ||
* <p>In Pinot, the fixed byte raw forward index can only store one specific fixed-length data type: | ||
* {@code int}, {@code long}, {@code float}, or {@code double}. Instead of explicitly storing the number of elements | ||
* for each document for multi-value document, this value can be inferred by:</p> | ||
* <pre> | ||
* number of elements = buffer payload length / size of data type | ||
* </pre> | ||
* | ||
* <p>If the forward index uses the passthrough chunk compression type (i.e., no compression), we can save | ||
* 4 bytes per document by omitting the explicit element count. This leads to the following space savings:</p> | ||
* | ||
* <ul> | ||
* <li>For documents with 0 elements, we save 50%.</li> | ||
* <li>For documents with 1 element, we save 33%.</li> | ||
* <li>For documents with 2 elements, we save 25%.</li> | ||
* <li>As the number of elements increases, the percentage of space saved decreases.</li> | ||
* </ul> | ||
* | ||
* <p>For forward indexes that use compression to reduce data size, the savings can be even more significant | ||
* in certain cases. This is demonstrated in the unit test {@link VarByteChunkV5Test#validateCompressionRatioIncrease}, | ||
* where ZStandard was used as the chunk compressor. In the test, 1 million short multi-value (MV) documents | ||
* were inserted, following a Gaussian distribution for document lengths. Additionally, the values of each integer | ||
* in the MV documents were somewhat repetitive. Under these conditions, we observed a 50%+ reduction in on-disk | ||
* file size compared to the V4 forward index writer version.</p> | ||
* | ||
* <p>Note that the {@code VERSION} tag is a {@code static final} class variable set to {@code 5}. Since static | ||
* variables are shadowed in the child class thus associated with the class that defines them, care must be taken to | ||
* ensure that the parent class can correctly observe the child class's {@code VERSION} value at runtime. To handle | ||
* this cleanly and correctly, the {@code getConcreteClassVersion()} method is overridden to return the concrete | ||
* subclass's {@code VERSION} value, ensuring that the correct version number is returned even when using a reference | ||
* to the parent class.</p> | ||
* | ||
* @see VarByteChunkForwardIndexWriterV4 | ||
* @see VarByteChunkForwardIndexWriterV5#getConcreteClassVersion() | ||
*/ | ||
@NotThreadSafe | ||
public class VarByteChunkForwardIndexWriterV5 extends VarByteChunkForwardIndexWriterV4 { | ||
public static final int VERSION = 5; | ||
|
||
public VarByteChunkForwardIndexWriterV5(File file, ChunkCompressionType compressionType, int chunkSize) | ||
throws IOException { | ||
super(file, compressionType, chunkSize); | ||
} | ||
|
||
// Hide the parent class getVersion() | ||
public static int getVersion() { | ||
return VERSION; | ||
} | ||
|
||
// Override the parent class getConcreteClassVersion(); | ||
@Override | ||
public int getConcreteClassVersion() { | ||
return VERSION; | ||
} | ||
|
||
@Override | ||
public void putIntMV(int[] values) { | ||
putBytes(ArraySerDeUtils.serializeIntArrayWithoutLength(values)); | ||
} | ||
|
||
@Override | ||
public void putLongMV(long[] values) { | ||
putBytes(ArraySerDeUtils.serializeLongArrayWithoutLength(values)); | ||
} | ||
|
||
@Override | ||
public void putFloatMV(float[] values) { | ||
putBytes(ArraySerDeUtils.serializeFloatArrayWithoutLength(values)); | ||
} | ||
|
||
@Override | ||
public void putDoubleMV(double[] values) { | ||
putBytes(ArraySerDeUtils.serializeDoubleArrayWithoutLength(values)); | ||
} | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -67,8 +67,7 @@ public class VarByteChunkForwardIndexReaderV4 | |
|
||
public VarByteChunkForwardIndexReaderV4(PinotDataBuffer dataBuffer, FieldSpec.DataType storedType, | ||
boolean isSingleValue) { | ||
int version = dataBuffer.getInt(0); | ||
Preconditions.checkState(version == VarByteChunkForwardIndexWriterV4.VERSION, "Illegal index version: %s", version); | ||
validateIndexVersion(dataBuffer); | ||
jackluo923 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
_storedType = storedType; | ||
_targetDecompressedChunkSize = dataBuffer.getInt(4); | ||
_chunkCompressionType = ChunkCompressionType.valueOf(dataBuffer.getInt(8)); | ||
|
@@ -81,6 +80,12 @@ public VarByteChunkForwardIndexReaderV4(PinotDataBuffer dataBuffer, FieldSpec.Da | |
_isSingleValue = isSingleValue; | ||
} | ||
|
||
public void validateIndexVersion(PinotDataBuffer dataBuffer) { | ||
jackluo923 marked this conversation as resolved.
Show resolved
Hide resolved
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I meant we can add a |
||
int version = dataBuffer.getInt(0); | ||
Preconditions.checkState(version == VarByteChunkForwardIndexWriterV4.getVersion(), "Illegal index version: %s", | ||
version); | ||
} | ||
|
||
@Override | ||
public boolean isDictionaryEncoded() { | ||
return false; | ||
|
86 changes: 86 additions & 0 deletions
86
...e/pinot/segment/local/segment/index/readers/forward/VarByteChunkForwardIndexReaderV5.java
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,86 @@ | ||
/** | ||
* Licensed to the Apache Software Foundation (ASF) under one | ||
* or more contributor license agreements. See the NOTICE file | ||
* distributed with this work for additional information | ||
* regarding copyright ownership. The ASF licenses this file | ||
* to you under the Apache License, Version 2.0 (the | ||
* "License"); you may not use this file except in compliance | ||
* with the License. You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, | ||
* software distributed under the License is distributed on an | ||
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
* KIND, either express or implied. See the License for the | ||
* specific language governing permissions and limitations | ||
* under the License. | ||
*/ | ||
package org.apache.pinot.segment.local.segment.index.readers.forward; | ||
|
||
import com.google.common.base.Preconditions; | ||
import org.apache.pinot.segment.local.io.writer.impl.VarByteChunkForwardIndexWriterV4; | ||
import org.apache.pinot.segment.local.io.writer.impl.VarByteChunkForwardIndexWriterV5; | ||
import org.apache.pinot.segment.local.utils.ArraySerDeUtils; | ||
import org.apache.pinot.segment.spi.memory.PinotDataBuffer; | ||
import org.apache.pinot.spi.data.FieldSpec; | ||
|
||
|
||
/** | ||
* Chunk-based raw (non-dictionary-encoded) forward index reader for values of SV variable length data types | ||
* (BIG_DECIMAL, STRING, BYTES), MV fixed length and MV variable length data types. | ||
* <p>For data layout, please refer to the documentation for {@link VarByteChunkForwardIndexWriterV4} | ||
*/ | ||
public class VarByteChunkForwardIndexReaderV5 extends VarByteChunkForwardIndexReaderV4 { | ||
public VarByteChunkForwardIndexReaderV5(PinotDataBuffer dataBuffer, FieldSpec.DataType storedType, | ||
boolean isSingleValue) { | ||
super(dataBuffer, storedType, isSingleValue); | ||
} | ||
|
||
@Override | ||
public void validateIndexVersion(PinotDataBuffer dataBuffer) { | ||
int version = dataBuffer.getInt(0); | ||
Preconditions.checkState(version == VarByteChunkForwardIndexWriterV5.getVersion(), "Illegal index version: %s", | ||
version); | ||
} | ||
|
||
@Override | ||
public int getIntMV(int docId, int[] valueBuffer, VarByteChunkForwardIndexReaderV4.ReaderContext context) { | ||
return ArraySerDeUtils.deserializeIntArrayWithoutLength(context.getValue(docId), valueBuffer); | ||
} | ||
|
||
@Override | ||
public int[] getIntMV(int docId, VarByteChunkForwardIndexReaderV4.ReaderContext context) { | ||
return ArraySerDeUtils.deserializeIntArrayWithoutLength(context.getValue(docId)); | ||
} | ||
|
||
@Override | ||
public int getLongMV(int docId, long[] valueBuffer, VarByteChunkForwardIndexReaderV4.ReaderContext context) { | ||
return ArraySerDeUtils.deserializeLongArrayWithoutLength(context.getValue(docId), valueBuffer); | ||
} | ||
|
||
@Override | ||
public long[] getLongMV(int docId, VarByteChunkForwardIndexReaderV4.ReaderContext context) { | ||
return ArraySerDeUtils.deserializeLongArrayWithoutLength(context.getValue(docId)); | ||
} | ||
|
||
@Override | ||
public int getFloatMV(int docId, float[] valueBuffer, VarByteChunkForwardIndexReaderV4.ReaderContext context) { | ||
return ArraySerDeUtils.deserializeFloatArrayWithoutLength(context.getValue(docId), valueBuffer); | ||
} | ||
|
||
@Override | ||
public float[] getFloatMV(int docId, VarByteChunkForwardIndexReaderV4.ReaderContext context) { | ||
return ArraySerDeUtils.deserializeFloatArrayWithoutLength(context.getValue(docId)); | ||
} | ||
|
||
@Override | ||
public int getDoubleMV(int docId, double[] valueBuffer, VarByteChunkForwardIndexReaderV4.ReaderContext context) { | ||
return ArraySerDeUtils.deserializeDoubleArrayWithoutLength(context.getValue(docId), valueBuffer); | ||
} | ||
|
||
@Override | ||
public double[] getDoubleMV(int docId, VarByteChunkForwardIndexReaderV4.ReaderContext context) { | ||
return ArraySerDeUtils.deserializeDoubleArrayWithoutLength(context.getValue(docId)); | ||
} | ||
} |
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.