-
Notifications
You must be signed in to change notification settings - Fork 21
CNDB-14275 Port commits from main #1780
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ild heap (#1693) ### What is the issue Fixes: riptano/cndb#13689 ### What does this PR fix and why was it fixed This PR utilizes the NodeQueue::pushMany method to decrease the time complexity required to build the NodeQueue from `O(n log(n))` to `O(n)`. This is likely only significant for sufficiently large hybrid queries. For example, we have seen cases of the search producing 400k rows, which means that we do 400k insertions into these NodeQueue objects.
Fixes riptano/cndb#13822 CNDB test pr riptano/cndb#13826 SAI’s DiskANN/JVector engine currently **always searches with pruning enabled**, trading recall for latency. That is fine for most workloads, but: * **Threshold / bounded ANN queries** can lose matches when pruning exits early. * Performance‑testing users need an easy way to turn pruning off to measure the recall/latency curve. This patch introduces a per‑query **`use_pruning`** option so users and operators can choose the trade‑off that suits them. --- ```cql WITH ann_options = {'use_pruning': true|false} ``` *When omitted we fall back to the node level default (see below).* * Cluster‑wide default is controlled by the JVM system property: ``` -Dcassandra.sai.jvector.use_pruning_default=<true|false> ``` exposed as `V3OnDiskFormat.JVECTOR_USE_PRUNING_DEFAULT`. * The property defaults to `true`, preserving existing behaviour. * Value must be the literal `true` or `false` (case‑insensitive). * Unknown ANN option keys continue to raise `InvalidRequestException`. * `Orderer` computes the value of the `usePruning` option by using the `use_pruning` value if it is not null or the jvm default if it is null and passes it down to **all** `graph.search()` calls. * Threshold / bounded ANN queries always pass `use_pruning = false` because correctness > latency for those paths (this is a net new change, but it's very minor and might not have any impact on those queries depending on the jvector implementation) * We added one flag bit to `ANNOptions` serialization; older nodes ignore unknown bits, so mixed‑version clusters are fine (though they do throw an exception for unknown settings) * Parsing, validation and transport round‑trips (`ANNOptionsTest`). * Distributed smoke (`ANNOptionsDistributedTest`). * Recall regression for pruning vs no‑pruning (`VectorSiftSmallTest.ensureDisablingPruningIncreasesRecall`).
…1705) ### What is the issue Fixes: riptano/cndb#13625 ### What does this PR fix and why was it fixed When `rerank_k` is non-positive, we will skip reranking where possible. A minor note: this change is not strictly backwards compatible because of the original implementation. However, it will fail quickly if the coordinator accepts the value but the writer does not, and since this is only a query-time parameter, I propose that we move forward and accept the net-new value that was previously disallowed.
…ation (#1709) ### What is the issue Fixes: riptano/cndb#13847 ### What does this PR fix and why was it fixed We already make use of the `Token#getLongValue` method in SAI, which only works for the Murmur partitioner, so I propose that we add a symmetric method that allows us to create tokens without converting a long to a byte buffer back to a long.
The number of rows stored in SSTable's SAI index was incorrect for analyzed indexes and for indexes on collections as it was counting posting lists for each term. In non-analyzed or non-collection indexes there is a term per row, while in analyzed index there can be many terms in a row. This led to incorrect count of rows. Fixes counting number of rows stored in SSTable SAI index. Minor fixes of code warnings in the affected files.
Checklist before you submit for review
|
|
❌ Build ds-cassandra-pr-gate/PR-1780 rejected by Butler16 new test failure(s) in 1 builds Found 16 new test failuresShowing only first 15 new test failures
No known test failures found |
|
Reviewed the test failures that butler reported and all but one are already known (listed in the 5.0 Test Failures spreadsheet) The new failure is |



What is the issue
CNDB-14275 Port commits from main to main-5.0
What does this PR fix and why was it fixed
This PR is porting the following commits from main branch to main-5.0 branch.
CNDB-13739 count numRows correctly in SSTable SAI (#1695) cb3ee3d Ruslan Fomkin [email protected] Apr 25, 2025 at 2:06 AM
SSTableIndexWriterdue to original usage ofTypeUtil.encodebeing nowTypeUtil.asIndexBytesin 5.0CNDB-13847: Add Token::fromLongValue method to simplify SAI token creation (#1709) 7d52230 Michael Marshall [email protected] Apr 24, 2025 at 7:41 AM
CNDB-13848: avoid serializing AlwaysPresentFilter when BF memory limit is reached (#1710) 1458fed Zhao Yang [email protected] Apr 24, 2025 at 1:13 AM
filter instanceOf BloomFiltercollapsed into theFilterComponent.savemethod because 5.0 has done a lot of refactoring here. This covers the original commit changes toSSTableReader.saveBloomFilter,BigTableWriter.flushBf,TrieIndexSSTableReader.recreateBloomFilter, andTrieIndexSSTableWriter.flushBf.AlwaysPresentFilter.toStringgoes instead toFilterFactory.AlwaysPresentFilter.toString.CNDB-13625: Implement rerankless vector search through ANN_OPTIONS (#1705) 053fc12 Michael Marshall [email protected] Apr 23, 2025 at 9:54 AM
CNDB-13833: Fix QueryTimeoutTest on JDK22 6c2cd8d Ekaterina Dimitrova [email protected] Apr 23, 2025 at 8:11 AM
CNDB-13822: Add ANN_OPTION use_pruning (#1704) 9cd80ce Michael Marshall [email protected] Apr 22, 2025 at 9:39 PM
SAI_VECTOR_USE_PRUNING_DEFAULT("cassandra.sai.jvector.use_pruning_default", "1000"),CNDB-13689: use NodeQueue::pushMany to decrease time complexity to build heap (#1693) df81d9a Michael Marshall [email protected] Apr 22, 2025 at 9:17 AM