CNDB-11666: Batch clusterings into single SAI partition post-filtering reads #1883

michaeljmarshall · 2025-07-17T05:32:41Z

What is the issue

Fixes: https://github.com/riptano/cndb/issues/11666
Ports: https://issues.apache.org/jira/browse/CASSANDRA-19497

May fix: https://github.com/riptano/cndb/issues/14822

What does this PR fix and why was it fixed

Here is a draft port of the upstream fix. Initial validation shows improved performance that gets much closer to aa performance for low-selectivity queries.

Test results from several versions show that this patch takes us from ~39 qps to ~418 qps (the lc rows tagged ec in the listing below), roughly a 10x increase in throughput.

$ latte list --tag ondisk
File ─────────────────────────────────────────────────────────────────────────────────────────────   Workload   Function   Timestamp ─────────   Tags ──────────────────────────   Params   Rate   Thrpt. [req/s]   P50 [ms]   P99 [ms]
./wide.Test_Cluster.4.0.11.0-SNAPSHOT.ondisk.p128.t1.c1.20250716.124318.json                         wide.rn    hc         2025-07-16 12:42:17   ondisk, ec, cc                                              7526       15.5       36.6
./wide.Test_Cluster.5.0.5-SNAPSHOT.ondisk.5.0.p128.t1.c1.20250716.125436.json                        wide.rn    hc         2025-07-16 12:53:35   ondisk, 5.0                                                 4634       23.7       95.3
./wide.Test_Cluster.4.0.11.0-SNAPSHOT.ondisk.cc.main.aa.p128.t1.c1.20250716.131838.json              wide.rn    hc         2025-07-16 13:17:37   ondisk, cc, main, aa                                        1182      106.1      191.8
./wide.Test_Cluster.4.0.11.0-SNAPSHOT.ondisk.cc.main-with-11666.aa.p128.t1.c1.20250717.002057.json   wide.rn    hc         2025-07-17 00:19:57   ondisk, cc, main-with-11666, aa                             1105      116.2      187.0
./wide.Test_Cluster.4.0.11.0-SNAPSHOT.ondisk.cc.main-with-11666.ec.p128.t1.c1.20250717.002614.json   wide.rn    hc         2025-07-17 00:25:13   ondisk, cc, main-with-11666, ec                             6041       20.1       33.6
./wide.Test_Cluster.4.0.11.0-SNAPSHOT.ondisk.ec.cc.p128.t1.c1.20250716.124531.json                   wide.rn    lc         2025-07-16 12:44:26   ondisk, ec, cc                                                39     3147.0     3432.9
./wide.Test_Cluster.5.0.5-SNAPSHOT.ondisk.5.0.p128.t1.c1.20250716.125619.json                        wide.rn    lc         2025-07-16 12:55:18   ondisk, 5.0                                                  289      439.0      550.2
./wide.Test_Cluster.4.0.11.0-SNAPSHOT.ondisk.cc.main.aa.p128.t1.c1.20250716.131640.json              wide.rn    lc         2025-07-16 13:15:39   ondisk, cc, main, aa                                         509      243.7      372.8
./wide.Test_Cluster.4.0.11.0-SNAPSHOT.ondisk.cc.main-with-11666.aa.p128.t1.c1.20250717.001941.json   wide.rn    lc         2025-07-17 00:18:40   ondisk, cc, main-with-11666, aa                              511      241.6      360.3
./wide.Test_Cluster.4.0.11.0-SNAPSHOT.ondisk.cc.main-with-11666.ec.p128.t1.c1.20250717.002441.json   wide.rn    lc         2025-07-17 00:23:40   ondisk, cc, main-with-11666, ec                              418      299.8      388.4
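
The batching named in the PR title can be pictured with a minimal sketch. This is not the code in this PR: `ClusteringBatcher`, `Key`, and `Batch` are hypothetical stand-ins, and the sketch assumes the matched keys arrive ordered by partition. The idea is only that all matched clusterings of a partition end up in one post-filtering read instead of one read per row.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; these are NOT the SAI PrimaryKey or read-command classes.
final class ClusteringBatcher
{
    /** A simplified primary key: a partition key plus one clustering value. */
    record Key(String partition, int clustering) {}

    /** One post-filtering read: a partition and every matched clustering in it. */
    record Batch(String partition, List<Integer> clusterings) {}

    /**
     * Groups consecutive keys that share a partition into a single batch.
     * Assumes the input is already ordered by partition.
     */
    static List<Batch> batchByPartition(List<Key> sortedKeys)
    {
        List<Batch> batches = new ArrayList<>();
        String currentPartition = null;
        List<Integer> currentClusterings = null;
        for (Key key : sortedKeys)
        {
            // A new partition starts a new batch; otherwise keep filling the current one.
            if (!key.partition().equals(currentPartition))
            {
                currentPartition = key.partition();
                currentClusterings = new ArrayList<>();
                batches.add(new Batch(currentPartition, currentClusterings));
            }
            currentClusterings.add(key.clustering());
        }
        return batches;
    }
}
```

With three matched rows in two partitions, the sketch issues two reads rather than three; the saving grows with the number of matched rows per partition, which is consistent with the gain showing up on wide partitions.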

github-actions · 2025-07-17T05:32:55Z

eolivelli

Overall, the patch looks good.

We have to add unit tests (or identify existing ones) that cover the new "feature" and the code we touched.

src/java/org/apache/cassandra/utils/InsertionOrderedNavigableSet.java

src/java/org/apache/cassandra/index/sai/plan/StorageAttachedIndexSearcher.java

src/java/org/apache/cassandra/index/sai/plan/QueryController.java

src/java/org/apache/cassandra/index/sai/plan/StorageAttachedIndexSearcher.java

adelapena · 2025-07-17T11:08:32Z

src/java/org/apache/cassandra/index/sai/plan/StorageAttachedIndexSearcher.java

+//            Preconditions.checkNotNull(key.partitionKey(), "Partition key must not be null");
+//            if (lastKey != null && key.partitionKey().equals(lastKey.partitionKey()) && key.clustering().equals(lastKey.clustering()))
+//                return null;
+//            lastKey = key;


I guess this needs an update.

I left this in an ambiguous state due to concerns about correctness and deduplication. Looks like we do that in fillNextSelectedKeysInPartition, so I'll remove these lines.

Actually, I need to follow up on this comment to make sure we're in the clear:

// Key reads are lazy, delayed all the way to this point.
// We don't want key.equals(lastKey) because some PrimaryKey implementations consider more than just
// partition key and clustering for equality. This can break lastKey skipping, which is necessary for
// correctness when PrimaryKey doesn't have a clustering (as otherwise, the same partition may get
// filtered and considered as a result multiple times).
// we need a non-null partitionKey here, as we want to construct a SinglePartitionReadCommand
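
The concern in that code comment can be illustrated with a small sketch. Everything here is an assumption for illustration: `SketchKey` and its `sourceId` field are hypothetical and stand in for a PrimaryKey implementation whose `equals` looks at more than partition key and clustering. It is not the actual StorageAttachedIndexSearcher code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

final class LastKeySkipping
{
    /**
     * Hypothetical key: record equality includes sourceId, mimicking a PrimaryKey
     * implementation whose equals() considers more than partition key and clustering.
     * An empty clustering string models a key without a clustering.
     */
    record SketchKey(String partitionKey, String clustering, long sourceId) {}

    /** Compares only the two components that identify a row. */
    static boolean sameRow(SketchKey a, SketchKey b)
    {
        return a != null && b != null
               && a.partitionKey().equals(b.partitionKey())
               && a.clustering().equals(b.clustering());
    }

    /** Drops a key when it addresses the same row as the previously kept key. */
    static List<SketchKey> skipRepeats(List<SketchKey> keys)
    {
        List<SketchKey> result = new ArrayList<>();
        SketchKey lastKey = null;
        for (SketchKey key : keys)
        {
            Objects.requireNonNull(key.partitionKey(), "Partition key must not be null");
            if (sameRow(key, lastKey))
                continue; // same partition and clustering as the last key: skip it
            result.add(key);
            lastKey = key;
        }
        return result;
    }
}
```

Two keys for the same partition that differ only in `sourceId` are not `equals`, so a skip based on `key.equals(lastKey)` would keep both and the partition would be considered twice; comparing partition key and clustering directly keeps only one.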

I am satisfied that the logic is correct as is. The cases where we had issues comparing lastKey and the nextKey are no longer relevant, because the PrimaryKey objects we get from the iterator are either fully qualified or static (with empty clustering keys). In the static case, we have an iterator over the whole partition, which is what we would have been doing previously.

There is possibly an opportunity to optimize the logic for static primary keys, but as far as I can tell, the current "error" is to read additional rows from disk, which is an acceptable error. I'm not certain, but it seems possible that upstream has a similar problem, if one exists (it might not, though, because their PrimaryKey objects are slightly different).

I created this ticket as a follow-up: https://github.com/riptano/cndb/issues/14861

src/java/org/apache/cassandra/index/sai/plan/QueryController.java

michaeljmarshall · 2025-07-18T16:30:35Z

Marking as ready for review to run CI. That will help me figure out whether #1883 (comment) is a problem, since the Apache code definitely uses key.equals(lastKey). (Note that Apache doesn't have the partition-only indexing we get with aa, which is why we added that comment as part of PR #1096, the one that added aa support back in.)

JeremiahDJordan · 2025-07-18T19:14:59Z

That will help me figure out if #1883 (comment) is a problem ... the one that added back in aa support

Can we tell if we are in the aa case or not, and not do this new logic if we are? This batching stuff doesn't really make sense for aa files?

michaeljmarshall · 2025-07-18T20:36:30Z

That will help me figure out if #1883 (comment) is a problem ... the one that added back in aa support

Can we tell if we are in the aa case or not, and not do this new logic if we are? This batching stuff doesn't really make sense for aa files?

I think we're good to go here. You're right that the aa logic isn't a problem.

michaeljmarshall · 2025-07-18T22:00:49Z

✔️ Build ds-cassandra-pr-gate/PR-1883 approved by Butler

Approved by Butler See build details here

Looks like the tests are passing here, but the GitHub Actions checks don't seem quite right.

eolivelli

LGTM

Waiting for @adelapena's final review

eolivelli

Thanks for adding more tests

…g reads Port of CASSANDRA-19497. Co-authored-by: Caleb Rackliffe <[email protected]> Co-authored-by: Michael Marshall <[email protected]> Co-authored-by: Andrés de la Peña <[email protected]>

sonarqubecloud · 2025-07-22T10:23:08Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
92.3% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

cassci-bot · 2025-07-22T10:26:26Z

✔️ Build ds-cassandra-pr-gate/PR-1883 approved by Butler

Approved by Butler
See build details here

…g reads (#1883) Port of CASSANDRA-19497 Co-authored-by: Caleb Rackliffe <[email protected]> Co-authored-by: Michael Marshall <[email protected]> Co-authored-by: Andrés de la Peña <[email protected]>

michaeljmarshall requested review from pkolaczk and eolivelli July 17, 2025 05:32

michaeljmarshall self-assigned this Jul 17, 2025

eolivelli reviewed Jul 17, 2025

src/java/org/apache/cassandra/utils/InsertionOrderedNavigableSet.java

adelapena reviewed Jul 17, 2025

src/java/org/apache/cassandra/index/sai/plan/StorageAttachedIndexSearcher.java (outdated)

adelapena reviewed Jul 17, 2025

src/java/org/apache/cassandra/index/sai/plan/QueryController.java

adelapena reviewed Jul 17, 2025

src/java/org/apache/cassandra/index/sai/plan/StorageAttachedIndexSearcher.java

adelapena reviewed Jul 17, 2025

src/java/org/apache/cassandra/index/sai/plan/QueryController.java

michaeljmarshall marked this pull request as ready for review July 18, 2025 16:30

eolivelli approved these changes Jul 21, 2025

adelapena approved these changes Jul 21, 2025

eolivelli approved these changes Jul 21, 2025

eolivelli mentioned this pull request Jul 21, 2025

CNDB-11666: Batch clusterings into single SAI partition post-filterin… #1884

Merged

adelapena mentioned this pull request Jul 21, 2025

CNDB-11666: Batch clusterings into single SAI partition post-filtering reads #1897

Merged

JeremiahDJordan approved these changes Jul 21, 2025

CNDB-11666: Batch clusterings into single SAI partition post-filterin…

52c294f

…g reads Port of CASSANDRA-19497. Co-authored-by: Caleb Rackliffe <[email protected]> Co-authored-by: Michael Marshall <[email protected]> Co-authored-by: Andrés de la Peña <[email protected]>

adelapena force-pushed the cndb-11666 branch from 734ba5b to 52c294f July 22, 2025 09:40

adelapena merged commit 81f2cf8 into main Jul 22, 2025
488 checks passed

adelapena deleted the cndb-11666 branch July 22, 2025 15:38
