
qq619618919

Description of PR

JIRA: YARN-11878. AsyncDispatcher event queue backlog with millions of STATUS_UPDATE events

Avoid costly ContainerStatusPBImpl.getCapability() calls in STATUS_UPDATE when Opportunistic containers are disabled

Background

This behavior was introduced by YARN-11003 to support the Opportunistic containers optimization in the ResourceManager.

To implement that optimization, StatusUpdateWhenHealthyTransition calls ContainerStatusPBImpl.getCapability() during every STATUS_UPDATE event.
This ensures container resource capability info is always available for scheduling decisions
when opportunistic containers are enabled.

However, in clusters where opportunistic containers are disabled, retrieving the capability on every STATUS_UPDATE is unnecessary, since the capability value is not used in most workflows.

Currently

  • NodeManager heartbeats: frequent STATUS_UPDATE events are sent to the ResourceManager.
  • Each STATUS_UPDATE: processing triggers ContainerStatusPBImpl.getCapability().

Problem: Even when the opportunistic container feature is off, the same costly protobuf parsing and ResourcePBImpl object construction still happen for each event (an illustrative sketch of this cost follows the list below). This leads to:

  1. High CPU usage in the AsyncDispatcher event processing thread
  2. Millions of repeated, unused protobuf parses in large clusters
  3. Increased event queue latency and slower scheduling decisions
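For context, here is an illustrative sketch of the PBImpl pattern that makes repeated getCapability() calls expensive. It is a simplified stand-in, not the actual ContainerStatusPBImpl source: each call builds a fresh ResourcePBImpl wrapper from the protobuf message, so the cost is paid per container, per STATUS_UPDATE, per heartbeat.

```java
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl;
import org.apache.hadoop.yarn.proto.YarnProtos.ContainerStatusProto;

// Illustrative sketch only -- simplified from the PBImpl pattern, not the
// exact ContainerStatusPBImpl implementation in Hadoop.
final class ContainerStatusCapabilitySketch {

  private final ContainerStatusProto proto; // backing protobuf message

  ContainerStatusCapabilitySketch(ContainerStatusProto proto) {
    this.proto = proto;
  }

  Resource getCapability() {
    if (!proto.hasCapability()) {
      return null;
    }
    // A new ResourcePBImpl is constructed from the proto on every call;
    // this repeated parsing/allocation is the overhead the PR avoids.
    return new ResourcePBImpl(proto.getCapability());
  }
}
```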

Impact

In clusters with thousands of nodes, STATUS_UPDATE events can account for >90% of the AsyncDispatcher queue.
Profiling shows that getCapability() calls consume >90% of CPU time in StatusUpdateWhenHealthyTransition.transition() when opportunistic containers are disabled.
The overhead is pure waste under these conditions and can be entirely skipped.

Proposed Changes

  1. Skip capability retrieval logic when opportunisticContainersEnabled is false.
  2. Cache the remoteContainer.getCapability() result in a local variable to avoid parsing the protobuf multiple times within the same STATUS_UPDATE handling (a sketch of both changes follows below).
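As a hedged sketch of the two changes (the opportunisticContainersEnabled flag, method name, and surrounding loop are simplified assumptions, not the exact patch to StatusUpdateWhenHealthyTransition), the capability handling would look roughly like this:

```java
import java.util.List;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.api.records.Resource;

// Simplified, illustrative sketch of the capability handling inside the
// STATUS_UPDATE transition -- not the literal patch.
final class StatusUpdateSketch {

  static void handleStatuses(List<ContainerStatus> containerStatuses,
      boolean opportunisticContainersEnabled) {
    for (ContainerStatus remoteContainer : containerStatuses) {
      Resource capability = null;
      if (opportunisticContainersEnabled) {
        // Change 2: parse the capability once and cache it in a local
        // variable instead of calling getCapability() repeatedly.
        capability = remoteContainer.getCapability();
      }
      // Change 1: when opportunistic containers are disabled, capability
      // stays null and the costly protobuf parse is skipped entirely.
      if (capability != null) {
        // ... feed 'capability' into opportunistic-container bookkeeping ...
      }
    }
  }

  private StatusUpdateSketch() {
  }
}
```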

How was this patch tested?

CI

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'YARN-11878. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@qq619618919 (Author)

Performance Verification in Production

We tested this patch in a production YARN cluster and used Arthas to monitor RM node event handling performance via:

monitor -c 5 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher handle
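(Arthas' monitor command samples the handle(...) method and reports call counts and average response time once per statistics cycle; -c 5 sets that cycle to 5 seconds.)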

Result:

Before patch (with original YARN-11003 behavior): average NM heartbeat handling time ≈ 1.10 ms
After patch (skip/caching getCapability() when Opportunistic containers disabled): average NM heartbeat handling time ≈ 0.09 ms
This is more than a 12× improvement in heartbeat event processing latency, significantly reducing RM AsyncDispatcher thread load and improving scheduling responsiveness in large clusters.

Conclusion:

The patch removes unnecessary getCapability() calls when the Opportunistic container feature is disabled, reducing CPU overhead and improving event queue turnover rate.
This optimization has already proven effective in production with substantial gains in RM performance.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 58s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 37m 23s trunk passed
+1 💚 compile 1m 4s trunk passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04
+1 💚 compile 1m 14s trunk passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04
+1 💚 checkstyle 1m 1s trunk passed
+1 💚 mvnsite 1m 11s trunk passed
+1 💚 javadoc 0m 56s trunk passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 54s trunk passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04
-1 ❌ spotbugs 1m 15s /branch-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt hadoop-yarn-server-resourcemanager in trunk failed.
+1 💚 shadedclient 33m 36s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 0s the patch passed
+1 💚 compile 0m 57s the patch passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04
+1 💚 javac 0m 57s the patch passed
+1 💚 compile 0m 56s the patch passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04
+1 💚 javac 0m 56s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 35s the patch passed
+1 💚 mvnsite 1m 18s the patch passed
+1 💚 javadoc 0m 45s the patch passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 45s the patch passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04
-1 ❌ spotbugs 1m 2s /patch-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt hadoop-yarn-server-resourcemanager in the patch failed.
+1 💚 shadedclient 33m 45s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 111m 44s /patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt hadoop-yarn-server-resourcemanager in the patch passed.
+1 💚 asflicense 0m 35s The patch does not generate ASF License warnings.
224m 27s
Reason Tests
Failed junit tests hadoop.yarn.server.resourcemanager.scheduler.TestAbstractYarnScheduler
hadoop.yarn.server.resourcemanager.resourcetracker.TestNMExpiry
hadoop.yarn.server.resourcemanager.resourcetracker.TestNMReconnect
hadoop.yarn.server.resourcemanager.TestResourceTrackerService
hadoop.yarn.server.resourcemanager.resourcetracker.TestRMNMRPCResponseId
hadoop.yarn.server.resourcemanager.TestRMNodeTransitions
hadoop.yarn.server.resourcemanager.logaggregationstatus.TestRMAppLogAggregationStatus
Subsystem Report/Notes
Docker ClientAPI=1.51 ServerAPI=1.51 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8026/1/artifact/out/Dockerfile
GITHUB PR #8026
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux bc2c114e91d1 5.15.0-156-generic #166-Ubuntu SMP Sat Aug 9 00:02:46 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / c5764a9
Default Java Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04
Multi-JDK versions /usr/lib/jvm/java-17-openjdk-amd64:Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-21-openjdk-amd64:Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8026/1/testReport/
Max. process+thread count 930 (vs. ulimit of 5500)
modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8026/1/console
versions git=2.25.1 maven=3.9.11
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.
