
Conversation


@blambov blambov commented Sep 17, 2025

What is the issue

https://github.com/riptano/cndb/issues/10302

What does this PR fix and why was it fixed

Implements the necessary trie machinery to work with trie sets, range and deletion-aware tries, and a memtable that uses it to store deletions in separate per-partition branches of the memtable trie.

Implements a method of skipping over tombstones when converting UnfilteredRowIterator to the filtered RowIterator, which has the effect of ignoring all tombstones when looking for data and speeds up next-live lookups dramatically. Adds a test to demonstrate this effect with the new memtable.

The changes are described in further detail with each commit.

@blambov blambov force-pushed the CNDB-10302 branch 3 times, most recently from e358a60 to 6914cba on September 29, 2025 13:00
blambov added 16 commits October 3, 2025 18:57
This also changes the behaviour of subtries to always
include boundaries, their prefixes and their descendant
branches.

This is necessary for well-defined reverse walks and helps
present metadata on the path of queried ranges, and is not
a real limitation for the prefix-free keys that we use.
Range tries are tries made of ranges of coverage, which
track applicable ranges and are mainly to be used to store
deletions and deletion ranges.
Deletion-aware tries combine data and deletion tries. The cursor
of a deletion-aware trie walks the data part of the trie, but
also provides a `deletionBranchCursor` that can return a deletion/
tombstone branch covering the current position and the branch below
it as a range trie. Such a branch can be given only once for any
path in the trie (i.e. there cannot be a deletion branch covering
another deletion branch).

Deletion-aware merges and updates to in-memory tries take deletion
branches into account when merging data so that deleted data is
not produced in the resulting merge.
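The effect of a deletion-aware merge can be illustrated with a toy model (all names here are invented for illustration; the actual PR works with trie cursors, not flat maps): data entries whose write timestamp is at or below a covering deletion's timestamp are suppressed from the merged output.

```java
import java.util.*;

// Toy sketch of deletion-aware merging: entries covered by a deletion range
// with an equal-or-newer timestamp do not appear in the result.
public class DeletionAwareMergeSketch {
    record Range(String start, String end, long delTime) {      // inclusive bounds
        boolean covers(String key) {
            return start.compareTo(key) <= 0 && key.compareTo(end) <= 0;
        }
    }

    static SortedMap<String, Long> merge(SortedMap<String, Long> data, List<Range> deletions) {
        SortedMap<String, Long> result = new TreeMap<>();
        for (var e : data.entrySet()) {
            boolean deleted = deletions.stream()
                .anyMatch(r -> r.covers(e.getKey()) && e.getValue() <= r.delTime());
            if (!deleted)
                result.put(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        SortedMap<String, Long> data = new TreeMap<>(Map.of("a", 1L, "b", 5L, "c", 2L));
        // A deletion covering [a, b] at time 3 removes "a" (ts 1) but not "b" (ts 5).
        System.out.println(merge(data, List.of(new Range("a", "b", 3L))));  // {b=5, c=2}
    }
}
```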
Implements a row-level trie memtable that uses deletion-aware
tries to store deletions separately from live data, together
with the associated TrieBackedPartition and TriePartitionUpdate.

Every deletion is first converted to its range version (e.g. a
deleted row is now represented as a range WHERE ck >= x AND ck <= x,
and a deleted partition as a deletion covering from LT_EXCLUDED
to GT_NEXT_COMPONENT, so as to include the static and all normal rows)
and then stored in the deletion path of the trie.
To make tests work, all such ranges are converted back to row
and partition deletion times on conversion to UnfilteredPartitionIterator.
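The mapping described above can be sketched with toy types (the bound markers and all names below are stand-ins for illustration, not the PR's actual classes):

```java
// Sketch of mapping deletions to coverage ranges: a deleted row with clustering
// x becomes the closed range [x, x]; a partition deletion becomes a range with
// sentinel bounds covering the static row and all normal rows.
public class DeletionToRangeSketch {
    record Bound(String value, boolean inclusive) {}
    record CoverageRange(Bound start, Bound end) {}

    // DELETE ... WHERE ck = x  ->  ck >= x AND ck <= x
    static CoverageRange rowDeletion(String x) {
        return new CoverageRange(new Bound(x, true), new Bound(x, true));
    }

    // A partition deletion covers everything under the partition, modeled here
    // with sentinel bound values standing in for LT_EXCLUDED / GT_NEXT_COMPONENT.
    static CoverageRange partitionDeletion() {
        return new CoverageRange(new Bound("LT_EXCLUDED", true), new Bound("GT_NEXT_COMPONENT", true));
    }

    public static void main(String[] args) {
        System.out.println(rowDeletion("x"));
        System.out.println(partitionDeletion());
    }
}
```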
Adds a new method to UnfilteredRowIterator that is implemented
by the new trie-backed partitions to ask them to stop issuing
tombstones. This is done on filtering (i.e. conversion from
UnfilteredRowIterator to RowIterator) where tombstones have already
done their job and are no longer needed.

Adds JMH tests of tombstones that demonstrate the improvement.
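The idea of the new method can be sketched as follows (names like Unfiltered, SourceIterator and stopIssuingTombstones are illustrative, not the PR's API): once the filtered view signals that tombstones are no longer needed, the source skips them instead of producing them only to have them discarded.

```java
import java.util.*;

// Toy sketch of "stop issuing tombstones": after the hook is called, the
// source iterator silently skips tombstone entries.
public class TombstoneSkipSketch {
    record Unfiltered(String key, boolean isTombstone) {}

    static class SourceIterator implements Iterator<Unfiltered> {
        private final Iterator<Unfiltered> raw;
        private boolean skipTombstones = false;
        private Unfiltered next;

        SourceIterator(List<Unfiltered> items) { this.raw = items.iterator(); }

        // The hook the filtered view calls once tombstones have done their job.
        void stopIssuingTombstones() { skipTombstones = true; }

        public boolean hasNext() {
            while (next == null && raw.hasNext()) {
                Unfiltered u = raw.next();
                if (!(skipTombstones && u.isTombstone()))
                    next = u;
            }
            return next != null;
        }

        public Unfiltered next() {
            if (!hasNext()) throw new NoSuchElementException();
            Unfiltered u = next;
            next = null;
            return u;
        }
    }

    public static void main(String[] args) {
        var src = new SourceIterator(List.of(
            new Unfiltered("a", true), new Unfiltered("b", false), new Unfiltered("c", true)));
        src.stopIssuingTombstones();  // filtering: only live rows wanted from here on
        while (src.hasNext())
            System.out.println(src.next().key());  // prints only "b"
    }
}
```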
In the initial implementation row deletions were mapped to range tombstones,
which works but isn't compatible with the multitude of tests, which require
deletions to be returned in the form they were made.

This commit changes the representation of deleted rows to use point tombstones.
In addition to making the tests pass, this improves the memory usage of memtables
with row deletions.

Although they only add complexity at this stage, point tombstones (expanded to
apply to the covered branch) will be needed in the next stage of development.

sonarqubecloud bot commented Oct 6, 2025


blambov commented Oct 6, 2025

Some benchmark results demonstrating the effect:

Benchmark                                (BATCH)  (count)  (deletionPattern)    (deletionSpec)  (deletionsRatio)  (flush)     (memtableClass)  (partitions)  (threadCount)  (useNet)  Mode  Cnt    Score    Error  Units
ReadTestWidePartitions.readGreaterMatch     1000  1000000         FROM_START  RANGE_FROM_START             0.997    INMEM        TrieMemtable           999              1     false  avgt   10    8.924 ±  0.117  ms/op
ReadTestWidePartitions.readGreaterMatch     1000  1000000         FROM_START  RANGE_FROM_START             0.997    INMEM  TrieMemtableStage2           999              1     false  avgt   10  185.442 ±  6.448  ms/op
ReadTestWidePartitions.readGreaterMatch     1000  1000000         FROM_START  RANGE_FROM_START             0.997    INMEM  TrieMemtableStage1           999              1     false  avgt   10   48.197 ±  3.383  ms/op
ReadTestWidePartitions.readGreaterMatch     1000  1000000         FROM_START   SINGLETON_RANGE             0.997    INMEM        TrieMemtable           999              1     false  avgt   10   11.465 ±  0.225  ms/op
ReadTestWidePartitions.readGreaterMatch     1000  1000000         FROM_START   SINGLETON_RANGE             0.997    INMEM  TrieMemtableStage2           999              1     false  avgt   10  436.228 ± 14.452  ms/op
ReadTestWidePartitions.readGreaterMatch     1000  1000000         FROM_START   SINGLETON_RANGE             0.997    INMEM  TrieMemtableStage1           999              1     false  avgt   10  261.936 ±  8.704  ms/op
ReadTestWidePartitions.readGreaterMatch     1000  1000000         FROM_START             EQUAL             0.997    INMEM        TrieMemtable           999              1     false  avgt   10   11.073 ±  0.206  ms/op
ReadTestWidePartitions.readGreaterMatch     1000  1000000         FROM_START             EQUAL             0.997    INMEM  TrieMemtableStage2           999              1     false  avgt   10  190.903 ±  4.218  ms/op
ReadTestWidePartitions.readGreaterMatch     1000  1000000         FROM_START             EQUAL             0.997    INMEM  TrieMemtableStage1           999              1     false  avgt   10   81.501 ±  1.221  ms/op

In the table above, the first 99.7% of all rows in each partition are deleted with one of the following operations:

RANGE_FROM_START:  DELETE ... WHERE userid = ? AND picid <= ?
 SINGLETON_RANGE:  DELETE ... WHERE userid = ? AND picid >= ? AND picid <= ?
           EQUAL:  DELETE ... WHERE userid = ? AND picid = ?

and then a SELECT ... WHERE userid = ? AND picid >= ? was issued.

An example with hundreds of thousands of tombstones per read partition:

Benchmark                                (BATCH)  (count)  (deletionPattern)  (deletionSpec)  (deletionsRatio)  (flush)  (memtableClass)  (partitions)  (threadCount)  (useNet)  Mode  Cnt   Score   Error  Units
ReadTestWidePartitions.readGreaterMatch     1000  1000000         FROM_START           EQUAL             0.997    INMEM     TrieMemtable             3              1     false  avgt   10  10.280 ± 0.117  ms/op

(this throws TombstoneOverwhelmingException with all other memtable types and runs in the tens of seconds per query when the guardrail is disabled)

Here is the table of the build time and memory usage:

memtableClass        count partitions  deletionsRatio      deletionSpec  build time  on-heap memory  off-heap memory
TrieMemtable       1000000        999           0.997  RANGE_FROM_START     20.910s       23.639MiB        28.149MiB
TrieMemtableStage2 1000000        999           0.997  RANGE_FROM_START     12.965s       88.500MiB        75.671MiB
TrieMemtableStage1 1000000        999           0.997  RANGE_FROM_START     13.512s      116.381MiB        32.013MiB
TrieMemtable       1000000        999           0.997   SINGLETON_RANGE     17.828s       61.638MiB       106.289MiB
TrieMemtableStage2 1000000        999           0.997   SINGLETON_RANGE    106.892s      336.825MiB        75.671MiB
TrieMemtableStage1 1000000        999           0.997   SINGLETON_RANGE     17.354s      598.587MiB        42.013MiB
TrieMemtable       1000000        999           0.997             EQUAL     16.648s       42.617MiB        75.865MiB
TrieMemtableStage2 1000000        999           0.997             EQUAL     13.937s       65.407MiB        75.671MiB
TrieMemtableStage1 1000000        999           0.997             EQUAL     15.645s       93.272MiB        42.013MiB 

Unlike the previous memtables, the new implementation deletes data from the trie when it receives a range tombstone, which in some cases results in a longer build time but lower memory usage.

Full benchmark run to be posted soon.


cbornet commented Oct 27, 2025

Impressive work @blambov !
I added a few comments, mostly cosmetic.
Otherwise LGTM!

and use it to avoid a couple of intermediate objects in set union
Fix Cursor.skipToWhenAhead for reverse iteration
Add Cursor.dumpBranch for debugging
Fix various methods to return Preencoded byte-comparables
Fix deletion-aware collection merge cursor's reporting of deletion branch at tail points
Fix deletion-aware collection merge cursor's failure on one deletion branch

@cassci-bot

✔️ Build ds-cassandra-pr-gate/PR-2005 approved by Butler



@lesnik2u lesnik2u self-requested a review November 15, 2025 10:05
@pkolaczk pkolaczk self-requested a review December 2, 2025 11:16
@pkolaczk pkolaczk left a comment


Not a full review yet.

So far I have two general remarks (I will not flag them individually in all places):

  • License header should be DataStax license, not Apache License
  • Some more complex asserts would benefit from messages including the actual values that failed the assertion - that would make it easier to debug

return true;
}

/// Returns a tail trie, i.e. a trie whose root is the current position. Walking a tail trie will list all

This method now returns a cursor for the trie, not a trie itself. Please update the javadoc.

/// Returns a tail trie, i.e. a trie whose root is the current position. Walking a tail trie will list all
/// descendants of the current position with depth adjusted by the current depth.
///
/// It is an error to call `tailTrie` on an exhausted cursor.

tailCursor

/// `(0, -1), (1, t), (2, r), (3, e), (4, e)*, (3, i), (4, e)*, (4, p)*, (1, w), (2, i), (3, n)*`
///
/// Because we exhaust transitions on bigger depths before we go the next transition on the smaller ones, when
/// cursors are advanced together their positions can be easily compared using only the [#depth] and

I think it needs highlighting that cursors "advanced together" means that if they are not at the same position, we always advance the one that is lagging behind until it catches up or jumps over. Otherwise, if we let the higher cursor advance more times, or skipTo arbitrary points, this comparison logic would not work.

Anyway, I'm impressed by how smart this algorithm is!
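The comparison rule being discussed can be sketched in a few lines (names invented for illustration): for cursors advanced together, the one at greater depth is earlier in iteration order, and at equal depth the smaller incoming transition comes first.

```java
// Sketch of comparing positions of two cursors that are advanced together.
public class CursorOrderSketch {
    // True if position (d1, t1) precedes (d2, t2) in depth-first iteration order.
    static boolean precedes(int d1, int t1, int d2, int t2) {
        return d1 > d2 || (d1 == d2 && t1 < t2);
    }

    public static void main(String[] args) {
        // From a shared position "tr", one cursor advances into "tre" (3, 'e')
        // and the other to "w" (1, 'w'); "tre" precedes "w" lexicographically,
        // and indeed it has the greater depth.
        System.out.println(precedes(3, 'e', 1, 'w'));  // true
        // At equal depth, the smaller transition is first: (3, 'e') before (3, 'i').
        System.out.println(precedes(3, 'e', 3, 'i'));  // true
    }
}
```

Note the rule only holds under the "advanced together" discipline: only the lagging cursor is advanced until it catches up or jumps past the other.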

int incomingTransition();

/// @return the content associated with the current node. This may be non-null for any presented node, including
/// the root.

but it also can be null, right?
Can we add @Nullable if this is the case?

/// child.
///
/// It is an error to call this after the trie has already been exhausted (i.e. when `depth() == -1`);
/// for performance reasons we won't always check this.

What happens if we don't check this and some code calls it in that state?
Can we at least narrow down the list of very bad things that could happen or explicitly state it's undefined? Noop, exception, returning duplicate last entry, ... ?

It does look to me like a good candidate for an assert depth >= 0 - cheap enough that it won't make much difference when assertions are enabled and could increase the likelihood that we fail fast in the tests, but also zero cost in prod where we run with assertions turned off
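The suggested guard could look like this (a toy stand-in, not the PR's actual cursor class): an assert in advance() that fails fast on an exhausted cursor when assertions are enabled with -ea, and costs nothing otherwise.

```java
// Sketch of an exhausted-cursor guard via a Java assertion.
public class ExhaustedCursorGuardSketch {
    private int depth = 0;

    int advance() {
        assert depth >= 0 : "advance() called on an exhausted cursor (depth " + depth + ")";
        // Toy behaviour: exhaust immediately and signal it with depth -1.
        return depth = -1;
    }

    public static void main(String[] args) {
        var c = new ExhaustedCursorGuardSketch();
        c.advance();          // fine: depth was 0
        try {
            c.advance();      // with -ea this trips the assertion
        } catch (AssertionError e) {
            System.out.println(e.getMessage());
        }
    }
}
```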

Comment on lines +1 to +17
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

We should use DataStax license for new files.

Comment on lines +1 to +18
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/


Please replace with DataStax license

Comment on lines +50 to +51
assert c1.depth() == c2.depth();
assert c1.incomingTransition() == c2.incomingTransition();

Nit: a message with the actual values might be helpful when something goes wrong

{
return new TrieEntriesIterator.AsEntriesFilteredByType<>(cursor(direction), clazz);
return dir -> new SingletonCursor<>(dir, b.asComparableBytes(byteComparableVersion), byteComparableVersion, v);

This is quite cool to use a SAM here to implement a Trie by providing the implementation of makeCursor, but I am slightly on the fence about the readability aspect of that. I mean, I like that it's terse and looks nice once understood, but because Trie has sooo many methods it wasn't immediately apparent to me how it actually works ;) It requires noticing the one unimplemented method among plenty of defaults.

I would slightly prefer to be a bit more verbose and explicit here and implement the Trie directly by using an anonymous class. The previous version of this code was more readable to me. Or, if we decide to keep this short version, because it is a prevalent abstraction everywhere, maybe let's have at least some hint in the comment / javadoc on the Trie explaining this pattern.

// Edit: the more I look at it, the more the short version makes sense, because it seems that you deliberately want Tries to be defined in terms of the traversal (which is very smart as we don't need a physical node structure). So indeed maybe just add a paragraph on that in the top-level description.
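The pattern under discussion can be reduced to a toy example (ToyTrie, makeIterator and singleton are illustrative names, and a plain Iterator stands in for the PR's Cursor): an interface with a single abstract method producing the traversal, everything else defaulted on top of it, so a lambda is a complete implementation.

```java
import java.util.*;

// Sketch of the single-abstract-method Trie pattern: the trie is defined purely
// by how it is traversed, with no physical node structure required.
interface ToyTrie<T> {
    Iterator<T> makeIterator();            // the one abstract method, lambda-friendly

    default List<T> toList() {             // defaults build on the traversal
        List<T> out = new ArrayList<>();
        makeIterator().forEachRemaining(out::add);
        return out;
    }

    static <T> ToyTrie<T> singleton(T value) {
        return () -> List.of(value).iterator();   // a lambda is a full ToyTrie
    }
}

public class SamTrieSketch {
    public static void main(String[] args) {
        ToyTrie<String> t = ToyTrie.singleton("tree");
        System.out.println(t.toList());    // [tree]
    }
}
```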

{
return direction;
}
Cursor<T> makeCursor(Direction direction);

Can we shift that up to the beginning of the class?
And also add a Javadoc as this seems to be the central way of building Tries.
This is too important to be hiding here.

BTW: Trie is public and Cursor is package private. There is a compiler warning that Cursor is exposed outside of its visibility scope. I don't think it's a problem, maybe we should suppress that warning. I guess your intention was that we cannot implement Tries outside of this package?
