CNDB-14361 use node's total document and term counts (#1791) #1877

driftx · 2025-07-16T16:30:22Z

What is the issue

The BM25 score needs to use the average document length aggregated across the
entire node. In addition, a bug introduced by #1789 was discovered: term
frequencies don't include documents filtered out by other predicates, which
produces inconsistent results between query plans.

What does this PR fix and why was it fixed

Fixes https://github.com/riptano/cndb/issues/14361

Changes the BM25 score to use the document count and term count aggregated across the entire node instead of per segment. Fixes a bug in calculating term frequencies so that
they count all documents.

Because this work added more code duplication, all duplicated code is refactored in the getTopKRows methods.

In addition, removes an unnecessary public modifier in the affected interface.
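
For illustration only, the node-wide aggregation described above can be sketched as follows. This is a minimal sketch using the textbook BM25 formula, not the code changed by this PR: the class `NodeBm25Sketch`, the `Stats` holder, and the `k1`/`b` values are hypothetical names and defaults chosen for the example. It sums per-segment document and term counts into node totals, derives the average document length from those totals, and scores one term with them.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names come from the actual patch.
final class NodeBm25Sketch {

    // Document count and total term count, for one segment or for the whole node.
    static final class Stats {
        final long docCount;
        final long termCount;

        Stats(long docCount, long termCount) {
            this.docCount = docCount;
            this.termCount = termCount;
        }
    }

    // Sum the per-segment counts into node-wide totals.
    static Stats aggregate(List<Stats> segments) {
        long docs = 0;
        long terms = 0;
        for (Stats s : segments) {
            docs += s.docCount;
            terms += s.termCount;
        }
        return new Stats(docs, terms);
    }

    // Average document length = total terms / total documents.
    static double avgDocLength(Stats stats) {
        return stats.docCount == 0 ? 0.0 : (double) stats.termCount / stats.docCount;
    }

    // Textbook BM25 contribution of a single query term for a single document.
    // docFreq is the number of documents on the node that contain the term.
    static double score(long tf, long docLength, long docFreq, Stats node, double k1, double b) {
        double avgdl = avgDocLength(node);
        double idf = Math.log(1.0 + (node.docCount - docFreq + 0.5) / (docFreq + 0.5));
        double norm = avgdl == 0.0 ? 1.0 : 1.0 - b + b * docLength / avgdl;
        return idf * tf * (k1 + 1.0) / (tf + k1 * norm);
    }

    public static void main(String[] args) {
        List<Stats> segments = Arrays.asList(new Stats(100, 1_000), new Stats(300, 9_000));
        Stats node = aggregate(segments);
        System.out.println("node avg doc length: " + avgDocLength(node));
        System.out.println("score: " + score(2, 25, 40, node, 1.2, 0.75));
    }
}
```

In this example the two segments on their own would report average lengths of 10 and 30 terms, so the same document would be length-normalized differently depending on which segment held it. Using the node totals (400 documents, 10,000 terms) gives every segment the same average of 25.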

github-actions · 2025-07-16T16:30:37Z

Checklist before you submit for review

driftx · 2025-07-16T16:34:51Z

Clean merge.

sonarqubecloud · 2025-07-17T14:39:08Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
93.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

cassci-bot · 2025-07-17T14:42:36Z

❌ Build ds-cassandra-pr-gate/PR-1877 rejected by Butler

33 new test failure(s) in 2 builds
See build details here

Found 33 new test failures

Showing only first 15 new test failures

Test	Explanation	Branch history
...lidation.operations.AlterTest-compression_jdk11	regression	🔴🔴
t.TestCqlshUnicode.test_unicode_identifier	regression	🔴🔵
...nQueryShouldNotTimeoutWhenItExceedesReadTimeout	regression	🔴🔴
...nglePageReadIsFastButAggregationExceedesTimeout	regression	🔴🔴
...estInterruptedExceptionCachedCounterLockManager	regression	🔴🔵
...adCommitLogAndSSTablesWithDroppedColumnTestCC50	regression	🔴🔴
...oadCommitLogAndSSTablesWithDroppedColumnTestDSE	regression	🔴🔴
...thRestartTest.testReadingValuesOfDroppedColumns	regression	🔴🔴
o.a.c.d.t.s.f.FeaturesVersionSupportDBTest.testANN	regression	🔴🔴
o.a.c.d.t.s.f.FeaturesVersionSupportDCTest.testANN	regression	🔴🔴
o.a.c.d.t.s.f.FeaturesVersionSupportEBTest.testANN	regression	🔴🔴
...c.FeaturesVersionSupportTest.testANNSupport[eb]	regression	🔴🔴
....FeaturesVersionSupportTest.testGeoDistance[aa]	regression	🔴🔴
....FeaturesVersionSupportTest.testGeoDistance[ba]	regression	🔴🔴
...cySSTableTest.testVerifyOldDroppedTupleSSTables	regression	🔴🔴

Found 1 known test failures

driftx requested a review from djatnieks July 16, 2025 20:16

djatnieks approved these changes Jul 16, 2025

driftx force-pushed the CNDB-14812 branch 2 times, most recently from ad3337b to 6104635 Compare July 17, 2025 13:46

driftx merged commit 6f1076a into main-5.0 Jul 17, 2025
6 of 234 checks passed

driftx deleted the CNDB-14812 branch July 17, 2025 13:52
