@jkni jkni commented Nov 13, 2025

What is the issue

When releasing a Ref, we occasionally see this release fail as a "bad release", indicating that the ref has already been released. At apparently the exact same time (to the resolution of our logs), we see the reference-reaper report a leak of the same reference and release the reference. This happens most frequently in GlobalTidyConcurrencyTest, which intentionally stresses these subsystems. In particular, for this issue to occur, there must be no strong references to the ref outside the reference on the stack frame of the method calling ref.release and no further usages of ref within that method.

The Ref ref has a state field, which refers to a Ref$State that is also a PhantomReference to ref. When ref.release() is called, it calls state.release(). During the execution of state.release(), but before releasedUpdater is used to update the released field, the JIT may determine that ref will not be used again and clear its reference on the stack. The GC may then determine that ref is only phantom reachable, clear state as a PhantomReference, and enqueue it into referenceQueue. The Reference-Reaper may then take state from the referenceQueue and release it as a leak. This leak release can win the update of the released field, so the reaper reports a leak. When the thread executing state.release() resumes and attempts to update the released field, the update fails and it reports a bad release. Since this is a race between a correct release and a Reference-Reaper release, both errors are spurious and the system continues to operate correctly.
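
As a rough illustration of the interleaving described above, here is a minimal, self-contained sketch. It is not the actual Cassandra code: the class RefRaceSketch, the method reapAsLeak(), the returned strings, and the int encoding of released are all assumptions made for the example. It replays the two competing updates in a fixed order instead of relying on real JIT and GC timing.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Hypothetical stand-ins for Ref and Ref$State; names and return values are illustrative only.
class RefRaceSketch {
    static final ReferenceQueue<Object> referenceQueue = new ReferenceQueue<>();

    /** Holds the released flag and is also a PhantomReference to its owning Ref. */
    static final class State extends PhantomReference<Object> {
        private static final AtomicIntegerFieldUpdater<State> releasedUpdater =
            AtomicIntegerFieldUpdater.newUpdater(State.class, "released");
        private volatile int released; // 0 = not released, 1 = released

        State(Object owner) {
            super(owner, referenceQueue);
        }

        /** Normal release path: fails if someone else already flipped the flag. */
        String release() {
            return releasedUpdater.compareAndSet(this, 0, 1) ? "released" : "bad release";
        }

        /** Path taken by the reaper thread (assumed name) once the state has been enqueued. */
        String reapAsLeak() {
            return releasedUpdater.compareAndSet(this, 0, 1) ? "leak detected" : "already released";
        }
    }

    static final class Ref {
        final State state = new State(this);

        String release() {
            return state.release();
        }
    }

    /** Replays the problematic order: the reaper's update lands before the releasing thread's. */
    static String[] replayRace() {
        Ref ref = new Ref();
        State state = ref.state;
        String reaperResult = state.reapAsLeak(); // reaper wins the update
        String releaseResult = state.release();   // the legitimate release now fails
        return new String[] { reaperResult, releaseResult };
    }

    public static void main(String[] args) {
        String[] results = replayRace();
        System.out.println(results[0] + " / " + results[1]);
    }
}
```

Both paths compete for the same one-way transition of released, so whichever runs second reports an error even though the caller released the ref exactly once.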

To prevent this race, we must ensure that ref remains strongly reachable throughout the execution of state.release(). A reachability fence on ref itself prevents the JIT from clearing the reference on the stack before the fence, keeping ref strongly reachable.
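
A minimal sketch of such a fence, assuming the standard java.lang.ref.Reference.reachabilityFence method (Java 9 and later) is what is meant. The classes below are simplified, hypothetical stand-ins, and the try/finally placement follows the idiom from the JDK documentation rather than the exact shape of the change in this PR.

```java
import java.lang.ref.Reference;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical, simplified stand-ins; not the actual Ref or Ref$State classes.
class FencedRefSketch {
    static final class State {
        private final AtomicBoolean released = new AtomicBoolean(false);

        /** Returns true only for the first release. */
        boolean release() {
            return released.compareAndSet(false, true);
        }
    }

    static final class Ref {
        private final State state = new State();

        boolean release() {
            try {
                return state.release();
            } finally {
                // Keeps `this` strongly reachable until state.release() has finished,
                // so the GC cannot treat this Ref as unreachable while the call is in flight.
                Reference.reachabilityFence(this);
            }
        }
    }

    /** Releases the same Ref twice and reports both outcomes. */
    static boolean[] releaseTwice() {
        Ref ref = new Ref();
        return new boolean[] { ref.release(), ref.release() };
    }

    public static void main(String[] args) {
        boolean[] results = releaseTwice();
        System.out.println("first=" + results[0] + ", second=" + results[1]);
    }
}
```

The fence does not change the result of release(); it only constrains when the object may become unreachable.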

What does this PR fix and why was it fixed

This PR adds a reachability fence to ref.release() after the state release call. It also adds a byteman rule to occasionally induce GCs that can determine ref to be phantom reachable in GlobalTidyConcurrencyTest. This byteman rule causes the test to reliably fail without the reachability fence.

@github-actions

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit starting with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

@jkni jkni self-assigned this Nov 13, 2025

@JeremiahDJordan JeremiahDJordan left a comment

wow. nice detective work!

@michaeljmarshall michaeljmarshall left a comment

Nice find and explanation!

…between the state release and the reference reaper

@eolivelli eolivelli left a comment

Great catch !

jkni commented Nov 14, 2025

CI good in CNDB PR as well: https://github.com/riptano/cndb/pull/16007

@cassci-bot

❌ Build ds-cassandra-pr-gate/PR-2125 rejected by Butler


1 regressions found
See build details here


Found 1 new test failures

Test Explanation Runs Upstream
o.a.c.db.ColumnFamilyStoreTest.testDiscardSSTables REGRESSION 🔴 2 / 17

Found 1 known test failures

jkni commented Nov 14, 2025

o.a.c.db.ColumnFamilyStoreTest.testDiscardSSTables -- this is flaky for the same reason on nightly. Reported and merging.

@jkni jkni merged commit f5af71f into main Nov 14, 2025
491 of 497 checks passed
@jkni jkni deleted the CNDB-15967 branch November 14, 2025 20:16
michaelsembwever pushed a commit that referenced this pull request Nov 20, 2025
…event races between the state release and the reference reaper (#2125)

When releasing a `Ref`, we occasionally see this release fail as a "bad
release", indicating that the ref has already been released. At
apparently the exact same time (to the resolution of our logs), we see
the reference-reaper report a leak of the same reference and release the
reference. This happens most frequently in GlobalTidyConcurrencyTest,
which intentionally stresses these subsystems. In particular, for this
issue to occur, there must be no strong references to the ref outside
the reference on the stack frame of the method calling `ref.release`
_and_ no further usages of `ref` within that method.

The Ref `ref` has a `state` field, which refers to a `Ref$State` which
is also a `PhantomReference` to `ref`. When `ref.release()` is called,
it calls `state.release()`. During the execution of `state.release()`
but prior to the usage of `releasedUpdater` to update the `released`
field, the JIT may determine that `ref` will not be used and clear its
reference on the stack. Then, the GC may determine that `ref` is only
phantom reachable, clearing `state` as a `PhantomReference` and
enqueuing it into `referenceQueue`. The `Reference-Reaper` may then take
`state` from the `referenceQueue` and release it as a leak. This leak
release can update the `released` field, causing it to report a leak.
The thread executing `state.release()` may then resume, and when it
attempts to update the `released` field it will fail and report a bad
release. Ultimately, since it's a race between a correct release and a
Reference-Reaper release, these are both spurious errors and the system
continues to operate correctly.

To prevent this race, we must ensure that `ref` remains strongly
reachable throughout the execution of `state.release()`. A reachability
fence to itself will prevent the JIT from clearing the reference on the
stack, keeping it strongly reachable.

This PR adds a reachability fence to `ref.release()` after the state
release call. It also adds a byteman rule to occasionally induce GCs
that can determine `ref` to be phantom reachable in
`GlobalTidyConcurrencyTest`. This byteman rule causes the test to
reliably fail without the reachability fence.
michaelsembwever pushed a commit that referenced this pull request Dec 9, 2025
…event races between the state release and the reference reaper (#2125)
