YARN-11878. AsyncDispatcher event queue backlog with millions of STAT… #8026

qq619618919 · 2025-10-12T09:08:52Z

Description of PR

JIRA: YARN-11878. AsyncDispatcher event queue backlog with millions of STATUS_UPDATE events

Avoid costly ContainerStatusPBImpl.getCapability() calls in STATUS_UPDATE when Opportunistic containers are disabled

Background

This behavior was introduced by YARN-11003 to support an optimization for opportunistic containers in the ResourceManager.

To implement that optimization, StatusUpdateWhenHealthyTransition calls ContainerStatusPBImpl.getCapability() during every STATUS_UPDATE event.
This ensures container resource capability info is always available for scheduling decisions
when opportunistic containers are enabled.

However, in clusters where opportunistic containers are disabled,
retrieving capability in every STATUS_UPDATE becomes unnecessary,
since the capability value is not used in most workflows.

Currently

NodeManager heartbeat: frequent STATUS_UPDATE events sent to the ResourceManager
Each STATUS_UPDATE processing: triggers ContainerStatusPBImpl.getCapability()
Problem: Even when the opportunistic container feature is off, the same costly protobuf parsing and ResourcePBImpl object construction still happen for each event. This leads to:

High CPU usage in the AsyncDispatcher event processing thread
Millions of repeated, unused protobuf parses in large clusters
Increased event queue latency and slower scheduling decisions

Impact

In clusters with thousands of nodes, STATUS_UPDATE events can account for >90% of the AsyncDispatcher queue.
Profiling shows that getCapability() calls consume >90% of CPU time in StatusUpdateWhenHealthyTransition.transition() when opportunistic containers are disabled.
The overhead is pure waste under these conditions and can be entirely skipped.

Proposed Changes

Skip capability retrieval logic when opportunisticContainersEnabled is false.
Cache remoteContainer.getCapability() result in a local variable to prevent multiple protobuf parsing calls within the same STATUS_UPDATE handling.
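
The two changes above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual patch: FakeResource, FakeContainerStatus, and handleStatusUpdate are hypothetical stand-ins for the YARN classes, and the getter simply counts calls to simulate the cost of a protobuf parse. Only the flag name opportunisticContainersEnabled is taken from the description.

```java
import java.util.Arrays;
import java.util.List;

public class StatusUpdateSketch {

  /** Hypothetical stand-in for a parsed resource (not the YARN Resource class). */
  static final class FakeResource {
    final long memoryMb;
    final int vcores;

    FakeResource(long memoryMb, int vcores) {
      this.memoryMb = memoryMb;
      this.vcores = vcores;
    }
  }

  /** Hypothetical stand-in for a protobuf-backed container status. */
  static final class FakeContainerStatus {
    private final long memoryMb;
    private final int vcores;
    private int parseCount = 0;

    FakeContainerStatus(long memoryMb, int vcores) {
      this.memoryMb = memoryMb;
      this.vcores = vcores;
    }

    /** Simulates the expensive getter: a new object is built on every call. */
    FakeResource getCapability() {
      parseCount++;
      return new FakeResource(memoryMb, vcores);
    }

    int getParseCount() {
      return parseCount;
    }
  }

  /**
   * Handles one simulated status update. Returns the summed memory of the
   * reported containers, or 0 when the feature is disabled and the
   * capability is never read.
   */
  static long handleStatusUpdate(List<FakeContainerStatus> statuses,
      boolean opportunisticContainersEnabled) {
    if (!opportunisticContainersEnabled) {
      // Change 1: skip capability retrieval entirely when the feature is off.
      return 0L;
    }
    long totalMemoryMb = 0L;
    for (FakeContainerStatus status : statuses) {
      // Change 2: call the getter once and reuse the local variable.
      FakeResource capability = status.getCapability();
      if (capability.memoryMb > 0 && capability.vcores > 0) {
        totalMemoryMb += capability.memoryMb;
      }
    }
    return totalMemoryMb;
  }

  public static void main(String[] args) {
    FakeContainerStatus a = new FakeContainerStatus(1024, 1);
    FakeContainerStatus b = new FakeContainerStatus(2048, 2);
    List<FakeContainerStatus> statuses = Arrays.asList(a, b);

    long disabledTotal = handleStatusUpdate(statuses, false);
    System.out.println("disabled: total=" + disabledTotal
        + " parses=" + a.getParseCount());

    long enabledTotal = handleStatusUpdate(statuses, true);
    System.out.println("enabled: total=" + enabledTotal
        + " parses=" + a.getParseCount());
  }
}
```

In this sketch the disabled path never touches the getter, and the enabled path touches it exactly once per container even though the result is read twice.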

How was this patch tested?

CI

For code changes:

Does the title of this PR start with the corresponding JIRA issue id (e.g. 'YARN-11878. Your PR title ...')?
Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?


qq619618919 · 2025-10-12T09:26:21Z

Performance Verification in Production

We tested this patch in a production YARN cluster and used Arthas to monitor RM node event handling performance via:

monitor -c 5 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher handle

Result:

Before patch (with original YARN-11003 behavior): average NM heartbeat handling time ≈ 1.10 ms
After patch (skipping or caching getCapability() when opportunistic containers are disabled): average NM heartbeat handling time ≈ 0.09 ms
This is a more than 12× improvement in heartbeat event processing latency, which significantly reduces load on the RM AsyncDispatcher thread and improves scheduling responsiveness in large clusters.
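
For reference, the ratio can be worked out from the two averages reported above. Both inputs are rounded measurements, so the results are approximate:

```latex
\text{speedup} = \frac{1.10\ \text{ms}}{0.09\ \text{ms}} \approx 12.2
\qquad
\text{relative reduction} = \frac{1.10 - 0.09}{1.10} \approx 0.918 \approx 92\%
```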

Conclusion:

The patch removes unnecessary getCapability() calls when the Opportunistic container feature is disabled, reducing CPU overhead and improving event queue turnover rate.
This optimization has already proven effective in production with substantial gains in RM performance.

hadoop-yetus · 2025-10-12T12:54:33Z

💔 -1 overall

Vote	Subsystem	Runtime	Logfile	Comment
+0 🆗	reexec	0m 58s		Docker mode activated.
			_ Prechecks _
+1 💚	dupname	0m 0s		No case conflicting files found.
+0 🆗	codespell	0m 1s		codespell was not available.
+0 🆗	detsecrets	0m 1s		detect-secrets was not available.
+1 💚	@author	0m 0s		The patch does not contain any @author tags.
-1 ❌	test4tests	0m 0s		The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
			_ trunk Compile Tests _
+1 💚	mvninstall	37m 23s		trunk passed
+1 💚	compile	1m 4s		trunk passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04
+1 💚	compile	1m 14s		trunk passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04
+1 💚	checkstyle	1m 1s		trunk passed
+1 💚	mvnsite	1m 11s		trunk passed
+1 💚	javadoc	0m 56s		trunk passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04
+1 💚	javadoc	0m 54s		trunk passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04
-1 ❌	spotbugs	1m 15s	/branch-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt	hadoop-yarn-server-resourcemanager in trunk failed.
+1 💚	shadedclient	33m 36s		branch has no errors when building and testing our client artifacts.
			_ Patch Compile Tests _
+1 💚	mvninstall	1m 0s		the patch passed
+1 💚	compile	0m 57s		the patch passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04
+1 💚	javac	0m 57s		the patch passed
+1 💚	compile	0m 56s		the patch passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04
+1 💚	javac	0m 56s		the patch passed
+1 💚	blanks	0m 0s		The patch has no blanks issues.
+1 💚	checkstyle	0m 35s		the patch passed
+1 💚	mvnsite	1m 18s		the patch passed
+1 💚	javadoc	0m 45s		the patch passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04
+1 💚	javadoc	0m 45s		the patch passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04
-1 ❌	spotbugs	1m 2s	/patch-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt	hadoop-yarn-server-resourcemanager in the patch failed.
+1 💚	shadedclient	33m 45s		patch has no errors when building and testing our client artifacts.
			_ Other Tests _
-1 ❌	unit	111m 44s	/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt	hadoop-yarn-server-resourcemanager in the patch passed.
+1 💚	asflicense	0m 35s		The patch does not generate ASF License warnings.
		224m 27s

Reason	Tests
Failed junit tests	hadoop.yarn.server.resourcemanager.scheduler.TestAbstractYarnScheduler
	hadoop.yarn.server.resourcemanager.resourcetracker.TestNMExpiry
	hadoop.yarn.server.resourcemanager.resourcetracker.TestNMReconnect
	hadoop.yarn.server.resourcemanager.TestResourceTrackerService
	hadoop.yarn.server.resourcemanager.resourcetracker.TestRMNMRPCResponseId
	hadoop.yarn.server.resourcemanager.TestRMNodeTransitions
	hadoop.yarn.server.resourcemanager.logaggregationstatus.TestRMAppLogAggregationStatus

Subsystem	Report/Notes
Docker	ClientAPI=1.51 ServerAPI=1.51 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8026/1/artifact/out/Dockerfile
GITHUB PR	#8026
Optional Tests	dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname	Linux bc2c114e91d1 5.15.0-156-generic #166-Ubuntu SMP Sat Aug 9 00:02:46 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool	maven
Personality	dev-support/bin/hadoop.sh
git revision	trunk / `c5764a9`
Default Java	Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04
Multi-JDK versions	/usr/lib/jvm/java-17-openjdk-amd64:Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-21-openjdk-amd64:Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04
Test Results	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8026/1/testReport/
Max. process+thread count	930 (vs. ulimit of 5500)
modules	C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
Console output	https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8026/1/console
versions	git=2.25.1 maven=3.9.11
Powered by	Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

YARN-11878. AsyncDispatcher event queue backlog with millions of STAT…

c5764a9

…US_UPDATE events

github-actions bot added YARN trunk labels Oct 12, 2025
