Skip to content

Conversation

@LZD-PratyushBhatt
Copy link

@LZD-PratyushBhatt LZD-PratyushBhatt commented Jun 28, 2025

Issues

Description

  • Here are some details about my PR, including screenshots of any UI changes:
    Refer to _controllerOpt getting shared across resources #3050 for Problem description.
    Basically, all DistClusterControllerStateModel instances were synchronizing on the same global lock object due to Optional.empty() returning a singleton instance.
    Replaced synchronization on _controllerOpt (the shared singleton) with a dedicated per-instance lock object _controllerLock = new Object(). This eliminates cross-instance contention while preserving within-instance synchronization guarantees.
    (Write a concise description including what, why, how)

Tests

  • The following tests are written for this issue:
    Added a new test testNoSharedLockAcrossInstances
    (List the names of added unit/integration tests)

  • The following is the result of the "mvn test" command on the appropriate module:

(If CI test fails due to known issue, please specify the issue and test PR locally. Then copy & paste the result of "mvn test" to here.)

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

(Consider including all behavior changes for public methods or API. Also include these changes in merge description so that other developers are aware of these changes. This allows them to make relevant code changes in feature branches accounting for the new method/API behavior.)

Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

@LZD-PratyushBhatt LZD-PratyushBhatt changed the title Fix global lock contention in DistClusterControllerStateModel caused … Fix global lock contention in DistClusterControllerStateModel caused by Optional.empty() singleton Jun 28, 2025
@LZD-PratyushBhatt
Copy link
Author

@junkaixue @GrantPSpencer Can you take a look at this? Thanks!

Copy link
Contributor

@GrantPSpencer GrantPSpencer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Great catch on this, will greatly reduce topstate handoff when controller is processing multiple STs simultaneously

Just have 1 nit comment on test assertion

Copy link
Contributor

@GrantPSpencer GrantPSpencer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
will still need committer approval, cc @junkaixue @xyuanlu

logger.info(controllerName + " becoming leader from standby for " + clusterName);

synchronized (_controllerOpt) {
synchronized (_controllerLock) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If Optional.empty() is the root cause of locking problem, why not just make the function synchronized without creating new lock object?

Copy link
Author

@LZD-PratyushBhatt LZD-PratyushBhatt Jul 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate the suggestion @junkaixue, that’s a fair point. I agree that synchronized at method level on this would technically provide mutual exclusion in this case.

I opted for a dedicated lock object (_controllerLock) to keep the synchronization scope narrowly focused on controller lifecycle transitions, and also I didnt want to change the already implemented dedicated lock logic. This makes the locking intent more explicit and avoids potential contention if other unrelated synchronized methods are ever added to this class (either now or later, or lets say by mistake).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know what you think, if we are sure this class will be small in size in future and not much complex, then yeah synchronized methods make more sense then.

Copy link

@laxman-ch laxman-ch Jul 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@junkaixue : a new lock based implementation is close to the existing implementation while fixing the problem on the lock contention issue found as mentioned in the issue (due to JDK implementation of Optional.empty as singleton).

imho, making the methods synchronized may be of larger scope as we need to relook at the thread safety of other instance variables in current class and its base classes. So, if you suggest to fix this, should this be addressed in a different issue and PR.

@LZD-PratyushBhatt
Copy link
Author

Hey @junkaixue can you take a look at this again if you get time, Thanks a lot!

@LZD-PratyushBhatt
Copy link
Author

Hey @junkaixue , @xyuanlu reminder on this one, Thanks for your time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

_controllerOpt getting shared across resources

4 participants