Skip to content

Conversation

DomGarguilo
Copy link
Member

Fixes #5535

Swap out the Timer‑based check for a double‑checked‑locking pattern on the lock, and only short‑circuit when the cached tablet actually has a valid location.

Also made the failing IT case more concurrently strenuous to better exercise this new logic.

@DomGarguilo DomGarguilo added this to the 4.0.0 milestone May 6, 2025
@DomGarguilo DomGarguilo self-assigned this May 6, 2025
@dlmarion dlmarion requested a review from keith-turner May 7, 2025 11:49

CachedTablet now = findTabletInCache(row);
if (now != null) {
if (now.getTserverLocation().isPresent() && lcSession.checkLock(now) != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In support of ondemand tablets, tablets w/o a location can be cached in 4.0 (was not the case in 2.1). So these checks that look for the presence of a location do not seem correct.

Copy link
Member Author

@DomGarguilo DomGarguilo Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in 5268f73

try (var unused = lookupLocks.lock(ptl.getExtent())) {
// See if entry was added to cache by another thread while we were waiting on the lock
var cached = findTabletInCache(row);
if (cached != null && cached.getCreationTimer().startedAfter(timer)) {
Copy link
Contributor

@keith-turner keith-turner May 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What made you suspect the timer here? I looked into the specifics of the timer and found how its implemented here https://github.com/openjdk/jdk/blob/b6b5ac1ef9042ed62a8358aa6943b8dc87dcf0ab/src/hotspot/os/posix/os_posix.cpp#L1474 . Looking at the docs for that system call it does mention that it will not go backwards but also there is no gaurantee it wil go forward between two calls. So its possible two threads got the exact same nanoTime causing the entry to look stale, that is probably ok. We could make the time comparison >= instead of > if that is the case.

Copy link
Contributor

@keith-turner keith-turner May 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the > to a >= is iffy and probably not good so please ignore that suggestion. One way we could avoid comparing the time or anything else is to just look for change in the object reference. So maybe could do the following?

   CachedTablet before = findTabletInCache(row);
  try (var unused = lookupLocks.lock(ptl.getExtent())) {
       CachedTablet after = findTabletInCache(row);
       if(after != before && after != null){
           // another thread probably did the lookup while we were waiting on the lock, so let use that
           return;
       }
   

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the object reference check in 5268f73


var executor = Executors.newCachedThreadPool();
List<Future<CachedTablet>> futures = new ArrayList<>();
final int lookupCount = 128;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the fix is the test more likely to fail w/ these changes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not been able to intentionally get the test to fail, with or without these changes. I have only ever seen the test fail when the build is run for something else.

@DomGarguilo DomGarguilo merged commit 7e84361 into apache:main Jun 3, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Broken or Flaky test: ClientTabletCacheImplTest.testMultithreadedLookups

2 participants