hcd-130 diag patch #1743
base: hcd-1.1.1-rel
Conversation
@@ -1581,6 +1582,69 @@ public void testPending()
    assertEquals(8, aggregate.getPending().size());
}

@Test
public void applyMaxParallelism()
Ignore me
❌ Build ds-cassandra-pr-gate/PR-1743 rejected by Butler: 10 new test failure(s) in 1 build. Found 10 new test failures; no known test failures found.
The only sstable CI failures also repro on the base branch.
This is not ready for review, but otherwise CI won't trigger.
@@ -2792,7 +2795,8 @@ public <V> V runWithCompactionsDisabled(Callable<V> callable,
{
    // synchronize so that concurrent invocations don't re-enable compactions partway through unexpectedly,
    // and so we only run one major compaction at a time
-   synchronized (this)
+   rwcdLock.lock();
@blambov let me pick your brain here. We're getting deadlocks when syncing on the CFS and later on the strategy when pausing or resuming compactions, while a UCS task locks the strategy first and later the CFS. Trying to reorder the locking produces all sorts of new deadlocks elsewhere in the code.
Given that `runWithCompactionsDisabled` syncs on the CFS to avoid concurrent calls (if I understand correctly), do you see any problems with this approach?
- Use an auxiliary lock to avoid concurrent calls.
- Reorder the sync against the CFS to come later, avoiding nested and badly ordered syncs.
I am just wary that the original sync against the CFS was guarding against something else I might be missing.
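The auxiliary-lock idea can be sketched roughly as follows. This is a minimal illustration, not the actual Cassandra code: the `Cfs` class is a stand-in for `ColumnFamilyStore`, and only the `rwcdLock` name comes from the diff above.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch of the auxiliary-lock approach. `Cfs` is a hypothetical
// stand-in for ColumnFamilyStore; only `rwcdLock` appears in the patch.
class Cfs
{
    // Serializes runWithCompactionsDisabled calls without holding the
    // monitor of `this`, so it cannot participate in the
    // CFS-then-strategy vs strategy-then-CFS lock-order cycle.
    private final ReentrantLock rwcdLock = new ReentrantLock();

    public <V> V runWithCompactionsDisabled(Callable<V> callable) throws Exception
    {
        rwcdLock.lock();
        try
        {
            // pause compactions, run the callable, resume compactions
            return callable.call();
        }
        finally
        {
            rwcdLock.unlock();
        }
    }
}
```

The key property is that the lock is private to this one operation, so no other code path can acquire it in a different order relative to the strategy lock.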
This approach does sound sensible to me. This is a very long process to be locking the CFS.
I took a look at the usages of synchronization in CFS and they fall into four categories:
- locking `data` for memtable changes (flush, drop, etc.),
- locking `this` for recalculating state (local data ranges),
- locking `this` to apply long-running tasks that forbid compaction,
- locking the static class object to create new CFS instances.

The second and third categories are incompatible and cause this deadlock. One of the categories needs to use a different lock. Maybe call it `longRunningSerializedOperationsLock` to also include any further uses of the kind.
Or maybe use a separate single-threaded executor? Now that this comes to mind, what thread executes these operations, and can we cause a deadlock by holding that thread while waiting for something to complete on it? This could also happen even if the executor isn't single-threaded, if we manage to issue enough requests to park all of its threads... It sounds like it is safest to use a dedicated executor to solve the serialization problem for the compactions-disabled tasks and anything similar we may need in the future.
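A minimal sketch of that dedicated-executor idea; the `LongRunningOps` class and `submit` method are hypothetical names, not from the patch:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: a dedicated single-threaded executor serializes the
// compactions-disabled tasks. One thread means tasks run strictly one after
// another in submission order, with no monitor held on the CFS.
class LongRunningOps
{
    private static final ExecutorService EXECUTOR = Executors.newSingleThreadExecutor();

    static <V> Future<V> submit(Callable<V> task)
    {
        // Caveat raised in the discussion: a task submitted from this
        // executor's own thread that then blocks on its Future would
        // deadlock, since the single worker thread is already occupied.
        return EXECUTOR.submit(task);
    }
}
```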
I'm not sure I followed the single-threaded-executor approach you mention. I pushed a version where the actual CFS sync and the callable all happen in their own single thread. That should isolate that thread from heavy parking of the previous executor pool, whichever that was. Is that what you meant? I feel like I'm missing something here, because that has nothing to do with the serialization.
I meant the whole thing. I now realize this is a synchronous method that returns a value, so we can't really gain anything by putting it in a thread, as we still have to block to get the value.
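To illustrate the point, a throwaway sketch (not code from the patch; `SyncWrapper` is a hypothetical name): wrapping the synchronous call in its own executor buys no concurrency, because the caller still blocks on `Future.get()`.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: a synchronous method that returns a value gains nothing from being
// pushed onto its own thread, since the caller must block for the result.
class SyncWrapper
{
    static int viaExecutor() throws Exception
    {
        ExecutorService ex = Executors.newSingleThreadExecutor();
        try
        {
            Future<Integer> f = ex.submit(() -> 1 + 1);
            return f.get(); // blocks here, exactly as a direct call would
        }
        finally
        {
            ex.shutdown();
        }
    }
}
```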
// doublecheck that we finished, instead of timing out
for (ColumnFamilyStore cfs : toInterruptFor)
    return Executors.newSingleThreadExecutor().submit(() -> {
        synchronized (this)
Why do we need this? And if we do need to synchronize here, should it not be on `data`?
Good question. Given that many nodetool commands call into here, shouldn't we block on the whole CFS instead of only `data`? We don't know the nature of all commands, customer extensions or future commands. Syncing on the whole CFS seems best. What do you think?
I don't think `synchronized (this)` stops compactions initiating work (which is what this seems to want to achieve), as it does not prevent changes to the sstable set (`synchronized (data)` does the latter). And it does leave us open to the same kind of deadlock. It may be even worse if we lock `data`, unless we are certain none of the callables can perform work in another thread.
I personally think it would be safest not to do any locking here. If a callable needs it, it should lock `data` to stop concurrent sstable set changes. If necessary, we can provide a method that also wraps the callable in a `synchronized (data)` and use it for calls we know are just selecting sstables (e.g. not for `truncateBlocking`, which does a lot of work inside that call).
Sorry, even `synchronized (data)` does not prevent changes to the sstable set. Let's get rid of this lock completely.
Apologies, I was pulled into a meeting.
Removing `synchronized (this)` can make calls from nodetool commands race with invalidateLocalRangesAndDiskBoundaries(), importNewSSTables() and getLocalRanges().
But if I understand correctly, you're saying we don't need to lock at this level, because the callable will just be consuming the current CFS API, which is already thread-safe. If any callable were to break that, it is its responsibility to sync when needed. The serialization `synchronized (this)` was trying to provide is already provided now by `longRunningSerializedOperationsLock`.
Is this what you're saying? I am still wary of nodetool commands needing to sync on `this`. But we can forward your proposal to the customer and, if all goes well, port it to OSS where more eyes will see it, and then merge it.
Yes, we started with synchronization on the whole method, done (according to the comment) to prevent other operations of this kind racing, and we replaced that with a different lock. The extra synchronization on `this` is thus not necessary, as other operations such as nodetool commands cannot run in parallel because of the new lock.
The thing `synchronized (this)` actually blocks is recalculation of token ranges by other threads, which is not something we need to block, and which could still lead to a deadlock.
CI looks good. Let's forward this latest diag to the customer and hope it's the last bug we hit!
What is the issue
...
What does this PR fix and why was it fixed
...