CNDB-15967: Add a reachability fence to Ref#release to prevent races between the state release and the reference reaper #2125

jkni · 2025-11-13T23:05:21Z

What is the issue

When releasing a Ref, the release occasionally fails as a "bad release", indicating that the ref has already been released. At apparently the same instant (to the resolution of our logs), the Reference-Reaper reports a leak of the same reference and releases it. This happens most frequently in GlobalTidyConcurrencyTest, which intentionally stresses these subsystems. For the issue to occur, the only strong reference to the ref must be the one on the stack frame of the method calling ref.release, and that method must not use ref again after the call.

The Ref ref has a state field, which refers to a Ref$State that is also a PhantomReference to ref. When ref.release() is called, it calls state.release(). During the execution of state.release(), but before releasedUpdater is used to update the released field, the JIT may determine that ref will not be used again and clear its reference on the stack. The GC may then determine that ref is only phantom reachable, clear state as a PhantomReference, and enqueue it into referenceQueue. The Reference-Reaper may then take state from referenceQueue and release it as a leak. This leak release updates the released field and reports a leak. When the thread executing state.release() resumes and attempts to update the released field, the update fails and it reports a bad release. Because this is a race between a correct release and a Reference-Reaper release, both errors are spurious and the system continues to operate correctly.
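The shape of the race can be sketched with a simplified model. This is not the project's code: RefModel and Outcome are hypothetical names, and the release(boolean leak) signature is an assumption made for illustration. Only state, released, releasedUpdater and referenceQueue are taken from the description above. The sketch shows that whichever releaser wins the compare-and-set succeeds, and the other one observes a bad release.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Simplified, hypothetical model of the Ref / Ref$State relationship.
final class RefModel {
    // Queue that cleared phantom references are enqueued into.
    static final ReferenceQueue<Object> referenceQueue = new ReferenceQueue<>();

    enum Outcome { RELEASED, LEAK_REPORTED, BAD_RELEASE }

    // The state is itself a PhantomReference to the owning ref.
    static final class State extends PhantomReference<Object> {
        private static final AtomicIntegerFieldUpdater<State> releasedUpdater =
            AtomicIntegerFieldUpdater.newUpdater(State.class, "released");

        volatile int released;

        State(Object referent) {
            super(referent, referenceQueue);
        }

        // Only the first caller wins the CAS; any later caller sees a bad release.
        Outcome release(boolean leak) {
            if (!releasedUpdater.compareAndSet(this, 0, 1))
                return Outcome.BAD_RELEASE;
            return leak ? Outcome.LEAK_REPORTED : Outcome.RELEASED;
        }
    }

    static final class Ref {
        final State state = new State(this);

        // No fence here: once state.release() is entered, nothing in this
        // method uses `this` again, so the JIT may treat it as unreachable.
        Outcome release() {
            return state.release(false);
        }
    }
}
```

Replaying the problematic interleaving by hand, a reaper-style call `ref.state.release(true)` followed by `ref.release()` yields a leak report and then a bad release, which matches the pair of log messages described above.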

To prevent this race, we must ensure that ref remains strongly reachable throughout the execution of state.release(). A reachability fence on the ref itself prevents the JIT from clearing the reference on the stack, keeping it strongly reachable.
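A minimal sketch of the fence, assuming the standard java.lang.ref.Reference.reachabilityFence method (Java 9 and later). FencedRef and releaseState are hypothetical stand-ins rather than the project's classes, and the try/finally form is the idiom suggested by the JDK documentation, not necessarily the exact shape used in this PR.

```java
import java.lang.ref.Reference;

// Hypothetical stand-in for a ref whose release must not race with the reaper.
final class FencedRef {
    private boolean released; // stand-in for the real state object

    // Stand-in for state.release(): fails if already released.
    private void releaseState() {
        if (released)
            throw new IllegalStateException("bad release");
        released = true;
    }

    void release() {
        try {
            releaseState();
        } finally {
            // `this` is guaranteed to stay strongly reachable until this point,
            // so the GC cannot enqueue the phantom reference while
            // releaseState() is still running.
            Reference.reachabilityFence(this);
        }
    }

    boolean isReleased() {
        return released;
    }
}
```

The fence has no runtime effect other than constraining reachability, so placing it after the state release call extends the lifetime of the ref to cover the whole call.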

What does this PR fix and why was it fixed

This PR adds a reachability fence to ref.release() after the state release call. It also adds a Byteman rule to GlobalTidyConcurrencyTest that occasionally induces GCs capable of finding ref to be only phantom reachable. With this rule in place, the test fails reliably without the reachability fence.
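The rule itself is not reproduced in this conversation. As a rough illustration of the idea only, an injected action could call a helper like the one below, which requests a GC on a random fraction of calls. GcInducer and maybeGc are hypothetical names, and the choice of probability is an assumption.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper; not part of the PR.
final class GcInducer {
    private GcInducer() {}

    /** Requests a GC on roughly one call in `oneIn`; returns whether a request was made. */
    static boolean maybeGc(int oneIn) {
        if (oneIn <= 0)
            throw new IllegalArgumentException("oneIn must be positive");
        if (ThreadLocalRandom.current().nextInt(oneIn) != 0)
            return false;
        System.gc(); // a hint only; the JVM is free to ignore it
        return true;
    }
}
```

Invoking such a helper inside the window between entering the state release and the update of the released field is what would give the GC a chance to observe the ref as phantom reachable.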

github-actions · 2025-11-13T23:05:37Z

JeremiahDJordan

wow. nice detective work!

michaeljmarshall

Nice find and explanation!


eolivelli

Great catch !

jkni · 2025-11-14T18:55:07Z

CI good in CNDB PR as well: https://github.com/riptano/cndb/pull/16007

sonarqubecloud · 2025-11-14T19:49:19Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
100.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

cassci-bot · 2025-11-14T19:58:08Z

❌ Build ds-cassandra-pr-gate/PR-2125 rejected by Butler

1 regressions found
See build details here

Found 1 new test failures

Test	Explanation	Runs	Upstream
o.a.c.db.ColumnFamilyStoreTest.testDiscardSSTables	REGRESSION	🔴⚪	2 / 17

Found 1 known test failures

jkni · 2025-11-14T20:16:27Z

o.a.c.db.ColumnFamilyStoreTest.testDiscardSSTables -- this is flaky for the same reason on nightly. Reported and merging.


jkni self-assigned this Nov 13, 2025

JeremiahDJordan approved these changes Nov 13, 2025


michaeljmarshall approved these changes Nov 13, 2025


CNDB-15967: Add a reachability fence to Ref#release to prevent races …

ad74edd

…between the state release and the reference reaper

jkni force-pushed the CNDB-15967 branch from 0d474f7 to ad74edd Compare November 14, 2025 16:14

eolivelli reviewed Nov 14, 2025


jkni merged commit f5af71f into main Nov 14, 2025
491 of 497 checks passed

jkni deleted the CNDB-15967 branch November 14, 2025 20:16
