Skip to content

CNDB-13696 fix empty iterator access in BM25 search on partial SSTable #1691

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Apr 15, 2025

Conversation

k-rus
Copy link

@k-rus k-rus commented Apr 10, 2025

What is the issue

Executing a query with BM25 search and a condition on partial SSTable results in empty iterator access error. And there was no test with storing data in segments.

What does this PR fix and why was it fixed

The PR implements BM25 search tests with splitting data into two tables. This reproduced this bug, CNDB-13696, and demonstrates current confusion on the BM25 ordering result to be fixed by CNDB-13667.

This PR adds a check for empty iterator created for a PK belonging to another segment. This fixes the bug of trying to get the first element of an empty iterator.

Fixes https://github.com/riptano/cndb/issues/13696, fixes https://github.com/riptano/cndb/issues/13667
It's based on #1688, which fixes https://github.com/riptano/cndb/issues/13671

It's planned to be two commits on top of the commit in #1688. (it's okay to squash as the test reproduces the issue)

Copy link

github-actions bot commented Apr 10, 2025

Checklist before you submit for review

  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

@k-rus k-rus requested a review from a team April 10, 2025 22:32
@k-rus k-rus force-pushed the rf-13667-reproduce-different-bm25-order branch from 3d31438 to 04e3862 Compare April 11, 2025 07:54
@pkolaczk pkolaczk self-requested a review April 15, 2025 10:30
@k-rus k-rus changed the title CNDB-13696 fix empty iterator access in BM25 search on partial SSTable CNDB-13696 fix empty iterator access in BM25 search on partial SSTable index Apr 15, 2025
@k-rus k-rus changed the title CNDB-13696 fix empty iterator access in BM25 search on partial SSTable index CNDB-13696 fix empty iterator access in BM25 search on partial SSTable Apr 15, 2025
k-rus added 2 commits April 15, 2025 15:17
Adds tests to split data into two segments and demonstrate that the
BM25 ordering is different than single table.

Currently the test is broken due to a bug in BM25 implementation and
this commit reproduces it.
Fixes the bug that if SSTable contains part of the data, it results
in reading from an empty iterator.
@k-rus k-rus force-pushed the rf-13667-reproduce-different-bm25-order branch from 04e3862 to 3c3b69c Compare April 15, 2025 13:17
@cassci-bot
Copy link

✔️ Build ds-cassandra-pr-gate/PR-1691 approved by Butler


Approved by Butler
See build details here

@k-rus k-rus merged commit 167a98c into main Apr 15, 2025
478 checks passed
@k-rus k-rus deleted the rf-13667-reproduce-different-bm25-order branch April 15, 2025 14:53
@@ -168,6 +168,9 @@ private Cell<?> readColumn(SSTableReader sstable, PrimaryKey primaryKey)
var slices = Slices.with(indexContext.comparator(), Slice.make(primaryKey.clustering()));
try (var rowIterator = sstable.iterator(dk, slices, columnFilter, false, SSTableReadsListener.NOOP_LISTENER))
{
// primaryKey might not belong to this sstable, thus the iterator will be empty
if (rowIterator.isEmpty())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix. Adding a semi-related observation. If we end up seeing a lot of object allocations associated with the Slices which are O(n) for results matched in the WHERE clause, it might be worth finding a way to use sstable.couldContain(dk), which checks the bloom filter, to reduce allocations for keys not in the sstable. If you look in the implementation of the iterator method, you'll see how it already does that for us, but it's after we've already created our slice object.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I will not be surprised if there are more places for such improvements.

michaeljmarshall pushed a commit that referenced this pull request Apr 18, 2025
#1691)

Executing a query with BM25 search and a condition on partial SSTable
results in empty iterator access error. And there was no test with
storing data in segments.

The PR implements BM25 search tests with splitting data into two tables.
This reproduced this bug, CNDB-13696, and demonstrates current confusion
on the BM25 ordering result to be fixed by CNDB-13553.

This PR adds a check for empty iterator created for a PK belonging to
another segment. This fixes the bug of trying to get the first element
of an empty iterator.

(cherry picked from commit 167a98c)
djatnieks pushed a commit that referenced this pull request Apr 22, 2025
#1691)

Executing a query with BM25 search and a condition on partial SSTable
results in empty iterator access error. And there was no test with
storing data in segments.

The PR implements BM25 search tests with splitting data into two tables.
This reproduced this bug, CNDB-13696, and demonstrates current confusion
on the BM25 ordering result to be fixed by CNDB-13553.

This PR adds a check for empty iterator created for a PK belonging to
another segment. This fixes the bug of trying to get the first element
of an empty iterator.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants