-
Notifications
You must be signed in to change notification settings - Fork 21
CNDB-15967: Add a reachability fence to Ref#release to prevent races between the state release and the reference reaper #2125
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Checklist before you submit for review
|
JeremiahDJordan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow. nice detective work!
michaeljmarshall
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice find and explanation!
…between the state release and the reference reaper
eolivelli
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch !
|
CI good in CNDB PR as well: https://github.com/riptano/cndb/pull/16007 |
|
❌ Build ds-cassandra-pr-gate/PR-2125 rejected by Butler1 regressions found Found 1 new test failures
Found 1 known test failures |
|
o.a.c.db.ColumnFamilyStoreTest.testDiscardSSTables -- this is flaky for the same reason on nightly. Reported and merging. |
…event races between the state release and the reference reaper (#2125) When releasing a `Ref`, we occasionally see this release fail as a "bad release", indicating that the ref has already been released. At apparently the exact same time (to the resolution of our logs), we see the reference-reaper report a leak of the same reference and release the reference. This happens most frequently in GlobalTidyConcurrencyTest, which intentionally stresses these subsystems. In particular, for this issue to occur, there must be no strong references to the ref outside the reference on the stack frame of the method calling `ref.release` _and_ no further usages of `ref` within that method. The Ref `ref` has a `state` field, which refers to a `Ref$State` which is also a `PhantomReference` to `ref`. When `ref.release()` is called, it calls `state.release()`. During the execution of `state.release()` but prior to the usage of `releasedUpdater` to update the `released` field, the JIT may determine that `ref` will not be used and clear its reference on the stack. Then, the GC may determine that `ref` is only phantom reachable, clearing `state` as a `PhantomReference` and enqueuing it into `referenceQueue`. The `Reference-Reaper` may then take `state` from the `referenceQueue` and release it as a leak. This leak release can update the `released` field, causing it to report a leak. The thread executing `state.release()` may then resume, and when it attempts to update the `released` field it will fail and report a bad release. Ultimately, since it's a race between a correct release and a Reference-Reaper release, these are both spurious errors and the system continues to operate correctly. To prevent this race, we must ensure that `ref` remains strongly reachable throughout the execution of `state.release()`. A reachability fence to itself will prevent the JIT from clearing the reference on the stack, keeping it strongly reachable. This PR adds a reachability fence to `ref.release()` after the state release call. It also adds a byteman rule to occasionally induce GCs that can determine `ref` to be phantom reachable in `GlobalTidyConcurrencyTest`. This byteman rule causes the test to reliably fail without the reachability fence.
…event races between the state release and the reference reaper (#2125) When releasing a `Ref`, we occasionally see this release fail as a "bad release", indicating that the ref has already been released. At apparently the exact same time (to the resolution of our logs), we see the reference-reaper report a leak of the same reference and release the reference. This happens most frequently in GlobalTidyConcurrencyTest, which intentionally stresses these subsystems. In particular, for this issue to occur, there must be no strong references to the ref outside the reference on the stack frame of the method calling `ref.release` _and_ no further usages of `ref` within that method. The Ref `ref` has a `state` field, which refers to a `Ref$State` which is also a `PhantomReference` to `ref`. When `ref.release()` is called, it calls `state.release()`. During the execution of `state.release()` but prior to the usage of `releasedUpdater` to update the `released` field, the JIT may determine that `ref` will not be used and clear its reference on the stack. Then, the GC may determine that `ref` is only phantom reachable, clearing `state` as a `PhantomReference` and enqueuing it into `referenceQueue`. The `Reference-Reaper` may then take `state` from the `referenceQueue` and release it as a leak. This leak release can update the `released` field, causing it to report a leak. The thread executing `state.release()` may then resume, and when it attempts to update the `released` field it will fail and report a bad release. Ultimately, since it's a race between a correct release and a Reference-Reaper release, these are both spurious errors and the system continues to operate correctly. To prevent this race, we must ensure that `ref` remains strongly reachable throughout the execution of `state.release()`. A reachability fence to itself will prevent the JIT from clearing the reference on the stack, keeping it strongly reachable. This PR adds a reachability fence to `ref.release()` after the state release call. It also adds a byteman rule to occasionally induce GCs that can determine `ref` to be phantom reachable in `GlobalTidyConcurrencyTest`. This byteman rule causes the test to reliably fail without the reachability fence.



What is the issue
When releasing a
Ref, we occasionally see this release fail as a "bad release", indicating that the ref has already been released. At apparently the exact same time (to the resolution of our logs), we see the reference-reaper report a leak of the same reference and release the reference. This happens most frequently in GlobalTidyConcurrencyTest, which intentionally stresses these subsystems. In particular, for this issue to occur, there must be no strong references to the ref outside the reference on the stack frame of the method callingref.releaseand no further usages ofrefwithin that method.The Ref
refhas astatefield, which refers to aRef$Statewhich is also aPhantomReferencetoref. Whenref.release()is called, it callsstate.release(). During the execution ofstate.release()but prior to the usage ofreleasedUpdaterto update thereleasedfield, the JIT may determine thatrefwill not be used and clear its reference on the stack. Then, the GC may determine thatrefis only phantom reachable, clearingstateas aPhantomReferenceand enqueuing it intoreferenceQueue. TheReference-Reapermay then takestatefrom thereferenceQueueand release it as a leak. This leak release can update thereleasedfield, causing it to report a leak. The thread executingstate.release()may then resume, and when it attempts to update thereleasedfield it will fail and report a bad release. Ultimately, since it's a race between a correct release and a Reference-Reaper release, these are both spurious errors and the system continues to operate correctly.To prevent this race, we must ensure that
refremains strongly reachable throughout the execution ofstate.release(). A reachability fence to itself will prevent the JIT from clearing the reference on the stack, keeping it strongly reachable.What does this PR fix and why was it fixed
This PR adds a reachability fence to
ref.release()after the state release call. It also adds a byteman rule to occasionally induce GCs that can determinerefto be phantom reachable inGlobalTidyConcurrencyTest. This byteman rule causes the test to reliably fail without the reachability fence.