Skip to content

CNDB-14361 use node's total document and term counts (#1791) #1877

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 17, 2025

Conversation

driftx
Copy link

@driftx driftx commented Jul 16, 2025

What is the issue

BM25 score needs to use document average length aggregated on entire node.
Also a bug was discovered, which was introduced by #1789: term frequencies don't
include documents filtered by other predicates, which makes inconsistent result
between query plans.

What does this PR fix and why was it fixed

Fixes https://github.com/riptano/cndb/issues/14361

Changes that BM25 score uses document count and term count aggregated on entire node instead of per segment. Fixes a bug in calculating term frequencies, so
they count all documents.

As the work added more code duplication all duplicated code is refactored in getTopKRows methods.

In addition removes unnecessary public modifier in afffected interface.

### What is the issue

BM25 score needs to use document average length aggregated on entire
node.
Also a bug was discovered, which was introduced by #1789: term
frequencies don't
include documents filtered by other predicates, which makes inconsistent
result
between query plans.

### What does this PR fix and why was it fixed

Fixes riptano/cndb#14361

Changes that BM25 score uses document count and term count aggregated
on entire node instead of per segment. Fixes a bug in calculating term
frequencies, so
they count all documents.

As the work added more code duplication all duplicated code is
refactored in getTopKRows methods.

In addition removes unnecessary public modifier in afffected interface.
Copy link

Checklist before you submit for review

  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

@driftx
Copy link
Author

driftx commented Jul 16, 2025

Clean merge.

@driftx driftx requested a review from djatnieks July 16, 2025 20:16
@driftx driftx force-pushed the CNDB-14812 branch 2 times, most recently from ad3337b to 6104635 Compare July 17, 2025 13:46
@driftx driftx merged commit 6f1076a into main-5.0 Jul 17, 2025
6 of 234 checks passed
@driftx driftx deleted the CNDB-14812 branch July 17, 2025 13:52
Copy link

@cassci-bot
Copy link

❌ Build ds-cassandra-pr-gate/PR-1877 rejected by Butler


33 new test failure(s) in 2 builds
See build details here


Found 33 new test failures

Showing only first 15 new test failures

Test Explanation Branch history Upstream history
...lidation.operations.AlterTest-compression_jdk11 regression 🔴🔴
t.TestCqlshUnicode.test_unicode_identifier regression 🔴🔵
...nQueryShouldNotTimeoutWhenItExceedesReadTimeout regression 🔴🔴
...nglePageReadIsFastButAggregationExceedesTimeout regression 🔴🔴
...estInterruptedExceptionCachedCounterLockManager regression 🔴🔵
...adCommitLogAndSSTablesWithDroppedColumnTestCC50 regression 🔴🔴
...oadCommitLogAndSSTablesWithDroppedColumnTestDSE regression 🔴🔴
...thRestartTest.testReadingValuesOfDroppedColumns regression 🔴🔴
o.a.c.d.t.s.f.FeaturesVersionSupportDBTest.testANN regression 🔴🔴
o.a.c.d.t.s.f.FeaturesVersionSupportDCTest.testANN regression 🔴🔴
o.a.c.d.t.s.f.FeaturesVersionSupportEBTest.testANN regression 🔴🔴
...c.FeaturesVersionSupportTest.testANNSupport[eb] regression 🔴🔴
....FeaturesVersionSupportTest.testGeoDistance[aa] regression 🔴🔴
....FeaturesVersionSupportTest.testGeoDistance[ba] regression 🔴🔴
...cySSTableTest.testVerifyOldDroppedTupleSSTables regression 🔴🔴

Found 1 known test failures

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants