Add Node Weight to GetDesiredBalance #131025

Open: joshua-adams-1 wants to merge 30 commits into main from get-desired-balance-node-weights

Conversation

@joshua-adams-1 (Contributor) commented Jul 10, 2025

Extends the _internal/desired_balance API to return the node weights. These are found within the DesiredBalanceResponse.ClusterBalanceStats.NodeBalanceStats object. If no node weights have been calculated, then this value defaults to 0.

GitHub issue: #126579
Jira Ticket: ES-11546

@joshua-adams-1 joshua-adams-1 force-pushed the get-desired-balance-node-weights branch from 78aa356 to aeb59b2 Compare July 11, 2025 09:30
@joshua-adams-1 joshua-adams-1 force-pushed the get-desired-balance-node-weights branch from 3ed1d0d to 3ac4def Compare July 11, 2025 10:26
@joshua-adams-1 joshua-adams-1 requested a review from ywangd July 11, 2025 15:03
@joshua-adams-1 joshua-adams-1 added >non-issue :Distributed Coordination/Distributed A catch all label for anything in the Distributed Coordination area. Please avoid if you can. labels Jul 11, 2025
@joshua-adams-1 joshua-adams-1 marked this pull request as ready for review July 11, 2025 15:04
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Jul 11, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@ywangd (Member) left a comment

Thanks for working on this. Looks promising. I left some comments.

Comment on lines 9 to 15
  - requires:
      capabilities:
        - method: GET
          path: _internal/desired_balance
          capabilities: [ cluster_balance-node_balance_stats-node_weights_returned ]
      test_runner_features: [ capabilities ]
      reason: "Node weights returned in the node balance stats was added in version 9.2.0"

Member

I think we should not add this requires section to existing tests, since (1) they do not test the new field and (2) it reduces the test coverage for old versions. Instead, we should add a new test that has this requires section and checks the new field.

Contributor Author

I had issues with the CI tests failing without this. I will wait for the CI tests to run on my most recent commit and debug from there.

Member

I am surprised that the existing tests would fail without it. If you share a build scan of the failure, I can also help look into it.

@pxsalehi pxsalehi added :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) and removed :Distributed Coordination/Distributed A catch all label for anything in the Distributed Coordination area. Please avoid if you can. labels Jul 15, 2025
@ywangd (Member) left a comment

Looking great. I only had minor comments.

@joshua-adams-1 joshua-adams-1 force-pushed the get-desired-balance-node-weights branch from 129ab38 to d255680 Compare July 16, 2025 10:59
@ywangd (Member) left a comment

Some more minor comments.

@joshua-adams-1 joshua-adams-1 force-pushed the get-desired-balance-node-weights branch from 9b99150 to 08d6cc8 Compare July 17, 2025 14:00
@joshua-adams-1 joshua-adams-1 force-pushed the get-desired-balance-node-weights branch from 594fc20 to cd5e8f4 Compare July 18, 2025 08:36
@joshua-adams-1 joshua-adams-1 requested a review from ywangd July 18, 2025 11:12
@ywangd (Member) left a comment

Looking great. Some minor comments over testing code.

    public void testSerializationWithTransportVersionV_8_7_0() throws IOException {
        ClusterBalanceStats.NodeBalanceStats instance = createTestInstance();
        // Serialization changes based on this version
        TransportVersion oldVersion = TransportVersions.V_8_7_0;

Member

It's common to randomize from a broader version range for the oldVersion, i.e.:

Suggested change
-        TransportVersion oldVersion = TransportVersions.V_8_7_0;
+        final var oldVersion = TransportVersionUtils.randomVersionBetween(
+            random(),
+            TransportVersions.MINIMUM_COMPATIBLE,
+            TransportVersionUtils.getPreviousVersion(TransportVersions.V_8_8_0)
+        );

Contributor Author

Fixed in 4a7ec19

Contributor Author

Hey, so MINIMUM_COMPATIBLE refers to Elastic version 8.19. Therefore any transport version less than this can be safely removed, as per the comment on line 51 of this file, and the Jira ticket https://elasticco.atlassian.net/browse/ES-10337.

As a follow-up to this PR, I will remove all references to TransportVersions.V_8_8_0 and TransportVersions.V_8_12_0 in the NodeBalanceStats.readFrom() method (and then delete these tests, since they will no longer be necessary).

Until then, I have just randomly generated a TransportVersion between V_8_0_0 and V_8_8_0.

Member

Ah you are right. 👍
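
For readers unfamiliar with the pattern discussed in this thread, here is a minimal, self-contained sketch of testing version-gated serialization against a randomly chosen old version. It does not use Elasticsearch classes: `Stats`, `VersionedStatsCodec`, the integer version ids, and the fallback value of 0.0 are all hypothetical stand-ins chosen for illustration, not the real `TransportVersion` API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Random;

// Hypothetical stats object: shardCount has always been on the wire, nodeWeight was added later.
record Stats(int shardCount, double nodeWeight) {}

final class VersionedStatsCodec {
    // Hypothetical wire version ids; these are NOT real TransportVersion values.
    static final int MINIMUM_SUPPORTED = 100;
    static final int NODE_WEIGHT_ADDED = 200;

    private VersionedStatsCodec() {}

    // Writes the newer field only when the receiving version understands it.
    static byte[] write(Stats stats, int version) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(stats.shardCount());
            if (version >= NODE_WEIGHT_ADDED) {
                out.writeDouble(stats.nodeWeight());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the newer field only when the sending version wrote it; otherwise falls back to 0.0.
    static Stats read(byte[] data, int version) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int shardCount = in.readInt();
            double nodeWeight = version >= NODE_WEIGHT_ADDED ? in.readDouble() : 0.0;
            return new Stats(shardCount, nodeWeight);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Picks any version in [MINIMUM_SUPPORTED, NODE_WEIGHT_ADDED - 1], i.e. before the field existed.
    static int randomVersionBeforeNodeWeight(Random random) {
        return MINIMUM_SUPPORTED + random.nextInt(NODE_WEIGHT_ADDED - MINIMUM_SUPPORTED);
    }
}
```

Drawing the old version from the whole pre-change range, rather than pinning a single constant, lets repeated test runs exercise every version on the old side of the gate.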

@joshua-adams-1 joshua-adams-1 force-pushed the get-desired-balance-node-weights branch from 3955ca7 to 72635ca Compare July 21, 2025 15:03
cla-checker-service bot commented Jul 21, 2025

❌ Author of the following commits did not sign a Contributor Agreement:
f7cb702

Please, read and sign the above mentioned agreement if you want to contribute to this project

@joshua-adams-1 joshua-adams-1 force-pushed the get-desired-balance-node-weights branch from dca823a to dfe3506 Compare July 21, 2025 17:15
@ywangd (Member) left a comment

LGTM

One comment, about handling a not-yet-included node, is somewhat important; please address it. Thanks for the iterations!

@@ -300,18 +372,45 @@ private static ClusterState createClusterState(List<DiscoveryNode> nodes, List<T
             .build();
     }

-    private static DesiredBalance createDesiredBalance(ClusterState state) {
+    private static DesiredBalance createDesiredBalanceWithEmptyNodeWeights(ClusterState state) {
+        return createDesiredBalance(state, randomDoubleBetween(-1, 1, true), true);

Member

Nit: it seems wasteful to randomize given it's unused.

Suggested change
-        return createDesiredBalance(state, randomDoubleBetween(-1, 1, true), true);
+        return createDesiredBalance(state, 0, true);

Comment on lines +245 to +247
Double nodeWeight = desiredBalance.weightsPerNode().isEmpty()
? null
: desiredBalance.weightsPerNode().get(routingNode.node()).nodeWeight();

Member

Two suggestions:

  1. We can add assert desiredBalance != null before this line to make clear that it is not expected to be null. Otherwise, if it throws an NPE in some future test, people would pause and wonder whether null handling is needed here.
  2. Desired balance is computed asynchronously. It is possible that a new node from the latest cluster state is not yet included in it, so we need to account for that as well, e.g. something like:
Double nodeWeight = Optional.ofNullable(desiredBalance.weightsPerNode().get(routingNode.node()))
    .map(NodeWeightStats::nodeWeight)
    .orElse(null);
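
A minimal, self-contained sketch of the two suggestions above, outside of Elasticsearch. The `NodeWeightStats` record here is a hypothetical stand-in with a single field, the map is keyed by a plain node id string rather than a `DiscoveryNode`, and `NodeWeightLookup` is an illustrative helper name that is not part of the PR.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the per-node stats held by the desired balance.
record NodeWeightStats(double nodeWeight) {}

final class NodeWeightLookup {
    private NodeWeightLookup() {}

    // Returns the node's weight, or null when the desired balance has no entry for the node yet
    // (for example, a node that joined after the balance was last computed).
    static Double nodeWeightOrNull(Map<String, NodeWeightStats> weightsPerNode, String nodeId) {
        // States the intent that the map itself is never expected to be null.
        assert weightsPerNode != null : "weightsPerNode must not be null";
        // ofNullable (not of) is required because Map.get returns null for a missing key.
        return Optional.ofNullable(weightsPerNode.get(nodeId))
            .map(NodeWeightStats::nodeWeight)
            .orElse(null);
    }
}
```

Because a missing key and an empty map both yield null through the same path, this form also covers the `isEmpty()` case that the original ternary handled separately.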

Labels: :Distributed Coordination/Allocation, >non-issue, Team:Distributed Coordination, v9.2.0
4 participants