Skip to content

CNDB-12651: Fix flaky VectorDistributedTest.rangeRestrictedTest #1834

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jul 24, 2025

Conversation

adelapena
Copy link

Fix VectorDistributedTest.rangeRestrictedTest to don't require an exact match in ANN query results.

That test is the dtest version of the VectorLocalTest.rangeRestrictedTest utest. The utest was relaxed by this commit to only check for results within a 5% of the expected, given the approximate nature of ANN. However, the dtest was never updated in the same way. This PR applies the same change to the dtest.

@adelapena adelapena self-assigned this Jun 26, 2025
Copy link

Checklist before you submit for review

  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

@adelapena adelapena requested a review from a team June 26, 2025 10:54
@adelapena adelapena changed the title Cndb 12651 main CNDB-12651: Fix flaky VectorDistributedTest.rangeRestrictedTest Jun 26, 2025
@ekaterinadimitrova2 ekaterinadimitrova2 requested review from ekaterinadimitrova2 and removed request for a team June 26, 2025 13:53
@ekaterinadimitrova2
Copy link

ekaterinadimitrova2 commented Jun 26, 2025

Can we have a repeatable run in CI, like running it in a loop in the multiplexer a few hundred times?
It seems (on a very quick look), VectorLocalTest.rangeRestrictedTest failed (not related, but it is a similar issue), and it was because it was slightly bigger than 5% difference. I guess the multiplexer can help us polish the threshold. WDYT?
Otherwise, the changes LGTM. I left only one question for my understanding.

@adelapena
Copy link
Author

Thanks for the review. I have increased to 6% and tried to run it in the multiplexer. However, it seems it's not running the tests: https://jenkins-stargazer.aws.dsinternal.org/view/cc-builds/job/ds-cassandra-build/1120/parameters/

Maybe I'm missing some parameter. I found this Slack thread about how to use the multiplexer, did you manage to do it?

@ekaterinadimitrova2
Copy link

it seems it's not running the tests: https://jenkins-stargazer.aws.dsinternal.org/view/cc-builds/job/ds-cassandra-build/1120/parameters/

Maybe I'm missing some parameter. I found this Slack thread about how to use the multiplexer, did you manage to do it?

Yes, I managed as per @jacek-lewandowski 's explanation in that thread.
https://datastax.slack.com/archives/C04SYTHAXL3/p1743641721970889?thread_ts=1743617278.814499&cid=C04SYTHAXL3

I just took a look at what you submitted, and there was an unneeded space after the first .*. (I also missed it initially)
I started a new build here—https://jenkins-stargazer.aws.dsinternal.org/view/cc-builds/job/ds-cassandra-build/1122/—but it seems that some tests in that class are still failing sometimes.
Ideally, we want the multiplexer to run the distributed test in a loop, too.

@adelapena
Copy link
Author

Aah, an extra white space! Good catch! I have run the multiplexer for the dtest without failures for rangeRestrictedTest:
https://jenkins-stargazer.aws.dsinternal.org/view/cc-builds/job/ds-cassandra-build/1124/

Perhaps we should increase that percentage to something as high as 20%, to match the recall? I guess that in the worst-case scenario the number of absent rows can match the ones missing in recall calculations.

@adelapena
Copy link
Author

I have just increased the percentage to 20% to match recall, in case all the unmatched rows are all missing. Repeated runs are:

I have also inlined searchWithRange since the previous refactor using beforeAndAfterFlush makes it called from a single point, and I find it easier to read if percentage and recall values are close. I'll revert if you don't agree.

There are similar errors across both test classes, but I'd prefer to deal with them separately, since they seem to work slightly differently. For example, some expect a perfect match if we are expecting less than ten rows, and 5% if we expect more.

@ekaterinadimitrova2
Copy link

I have also inlined searchWithRange since the previous refactor using beforeAndAfterFlush makes it called from a single point, and I find it easier to read if percentage and recall values are close. I'll revert if you don't agree.

I agree, no need to revert anything, thanks.

There are similar errors across both test classes, but I'd prefer to deal with them separately, since they seem to work slightly differently. For example, some expect a perfect match if we are expecting less than ten rows, and 5% if we expect more.

+1

Perhaps we should increase that percentage to something as high as 20%, to match the recall?

I have just increased the percentage to 20% to match recall, in case all the unmatched rows are all missing.

Makes sense, considering the recall is 0.8 and we assert on it.

Don't expect an exact, fixed number of results from ANN. This is analogous to the fix
applied to  the not-distributed version of the test, VectorLocalTest.rangeRestrictedTest,
by ae96e0f
…geRestrictedTest to match expected recall

Also, inline searchWithRange since the previous refactor using beforeAndAfterFlush makes it single-called
Copy link

@cassci-bot
Copy link

✔️ Build ds-cassandra-pr-gate/PR-1834 approved by Butler


Approved by Butler
See build details here

@adelapena adelapena merged commit e897bad into main Jul 24, 2025
492 checks passed
@adelapena adelapena deleted the CNDB-12651-main branch July 24, 2025 12:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants