CNDB-14460: Fix Nodes test flakiness resulting from unsafe interleaving of async operations in test scenarios (#1812) #1881

Merged
driftx merged 1 commit into main-5.0 from CNDB-14828
Jul 17, 2025

@driftx driftx commented Jul 16, 2025

What is the issue

The singleton Nodes instance sequences operations that must not overlap by running them on a single-threaded executor. Some of these operations run synchronously, with the caller waiting on the future; others run asynchronously, by queuing them on the executor. Some tests shut down the singleton Nodes instance and replace it in an unsafe manner to simulate a node restart. This shutdown neither terminates the executor nor waits on it, because the asynchronous tasks can safely be recovered on node restart. In the tests, however, asynchronous operations still queued on the old instance can interleave with the newly created Nodes instance, so the operations lose their expected isolation and the tests fail.
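As a rough illustration of the sequencing described above, the sketch below runs both a blocking and a queued operation on one single-threaded executor. It is not the actual Nodes code: the class `SequencedOperationsSketch` and its methods are hypothetical names, and a counter stands in for the real state. It uses only `java.util.concurrent` calls.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not the real Nodes class.
public class SequencedOperationsSketch
{
    // A single worker thread runs tasks one at a time, in submission order.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final AtomicInteger counter = new AtomicInteger();

    // Synchronous style: submit the task and block on its future.
    int incrementBlocking() throws ExecutionException, InterruptedException
    {
        Callable<Integer> task = () -> counter.incrementAndGet();
        Future<Integer> future = executor.submit(task);
        return future.get();
    }

    // Asynchronous style: queue the task and return without waiting.
    void incrementAsync()
    {
        executor.execute(() -> counter.incrementAndGet());
    }

    // Queues one async increment, then performs one blocking increment.
    static int demo() throws ExecutionException, InterruptedException
    {
        SequencedOperationsSketch sketch = new SequencedOperationsSketch();
        sketch.incrementAsync();
        // The blocking call is queued behind the async one, so it sees both increments.
        int result = sketch.incrementBlocking();
        sketch.executor.shutdown();
        return result;
    }

    public static void main(String[] args) throws Exception
    {
        System.out.println(demo()); // prints 2
    }
}
```

The point of the sketch is that ordering is only guaranteed among tasks on the same executor. A replacement instance with its own executor has no such ordering relative to tasks still queued on the old one.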

Async operations can also interleave with JUnit deleting the temporary directories that back a Nodes instance.

What does this PR fix and why was it fixed

When unsafely replacing the singleton Nodes instance in tests, trigger a shutdown on the executors and await in-flight tasks.

When shutting down at the end of a test, await in-flight tasks.
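The general pattern behind this fix can be sketched with plain `java.util.concurrent` primitives. This is an assumption-laden illustration, not the code changed in this PR: `AwaitInflightSketch`, `Store`, `updateAsync`, and `shutdownAndAwait` are hypothetical names, and a shared list stands in for the on-disk state.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the code changed in this PR.
public class AwaitInflightSketch
{
    // Stand-in for a component that owns a single-threaded executor.
    static final class Store
    {
        private final ExecutorService executor = Executors.newSingleThreadExecutor();
        private final List<String> log;

        Store(List<String> log)
        {
            this.log = log;
        }

        // Asynchronous path: queue the update and return immediately.
        void updateAsync(String entry)
        {
            executor.execute(() -> log.add(entry));
        }

        // Stop accepting new tasks, then block until queued tasks have finished.
        boolean shutdownAndAwait(long timeout, TimeUnit unit) throws InterruptedException
        {
            executor.shutdown();
            return executor.awaitTermination(timeout, unit);
        }
    }

    // Replaces the store as a restart test might, draining the old instance first.
    static List<String> restartScenario() throws InterruptedException
    {
        List<String> log = Collections.synchronizedList(new ArrayList<>());

        Store first = new Store(log);
        first.updateAsync("first-1");
        first.updateAsync("first-2");
        if (!first.shutdownAndAwait(10, TimeUnit.SECONDS))
            throw new IllegalStateException("in-flight tasks did not finish in time");

        // Only now is it safe to create the replacement and, later, delete test directories.
        Store second = new Store(log);
        second.updateAsync("second-1");
        if (!second.shutdownAndAwait(10, TimeUnit.SECONDS))
            throw new IllegalStateException("in-flight tasks did not finish in time");

        return log;
    }

    public static void main(String[] args) throws InterruptedException
    {
        System.out.println(restartScenario()); // prints [first-1, first-2, second-1]
    }
}
```

Without the first `shutdownAndAwait` call, the tasks queued on `first` could still be running when `second` starts writing, which is the kind of interleaving the description attributes the flakiness to.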

Checklist before you submit for review

  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit starting with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

driftx commented Jul 16, 2025

NodesTest in 5.0 is mocked and LegacySystemKeyspaceToNodesTest doesn't exist there, so this PR includes no test changes.

@driftx driftx requested a review from djatnieks July 16, 2025 20:18
…ng of async operations in test scenarios (#1812)

@driftx driftx merged commit 230b620 into main-5.0 Jul 17, 2025
11 of 244 checks passed
@driftx driftx deleted the CNDB-14828 branch July 17, 2025 15:24
Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

@cassci-bot
❌ Build ds-cassandra-pr-gate/PR-1881 rejected by Butler


38 new test failure(s) in 2 builds
See build details here


Found 38 new test failures

Showing only first 15 new test failures

Test Explanation Branch history Upstream history
...lidation.operations.AlterTest-compression_jdk11 regression 🔴🔴
...nQueryShouldNotTimeoutWhenItExceedesReadTimeout regression 🔴🔴
...nglePageReadIsFastButAggregationExceedesTimeout regression 🔴🔴
...adCommitLogAndSSTablesWithDroppedColumnTestCC40 regression 🔴🔴
...adCommitLogAndSSTablesWithDroppedColumnTestCC50 regression 🔴🔴
...oadCommitLogAndSSTablesWithDroppedColumnTestDSE regression 🔴🔴
...thRestartTest.testReadingValuesOfDroppedColumns regression 🔴🔴
o.a.c.d.t.s.f.FeaturesVersionSupportDBTest.testANN regression 🔴🔴
o.a.c.d.t.s.f.FeaturesVersionSupportDCTest.testANN regression 🔴🔴
o.a.c.d.t.s.f.FeaturesVersionSupportEBTest.testANN regression 🔴🔴
...c.FeaturesVersionSupportTest.testANNSupport[eb] regression 🔴🔴
....FeaturesVersionSupportTest.testGeoDistance[aa] regression 🔴🔴
....FeaturesVersionSupportTest.testGeoDistance[ba] regression 🔴🔴
....s.f.SnapshotTest.shouldTakeAndRestoreSnapshots regression 🔴🔵
...cySSTableTest.testVerifyOldDroppedTupleSSTables regression 🔴🔴

Found 2 known test failures
