Description
Is it platform specific
generic
Importance or Severity
Critical
Description of the bug
Docker containers are flapping on a SONiC fanout switch because docker-wait-any-rs behaves differently from docker-wait-any when a container does not exist.
After investigating, we found that the cause is a behavior change in an exception path of docker-wait-any between the Python and Rust versions (the SONiC community re-implemented some components in Rust).
In short, the teamd container is removed on the SONiC fanout, but the swss service still watches it through docker-wait-any. The Rust version exits when the container does not exist, which stops the swss service and makes the containers flap.
Detailed analysis
What systemd is doing
swss.service runs ExecStart=/usr/local/bin/swss.sh wait. If that command returns, systemd treats the service process as having exited. With RestartSec=30, it will be restarted, but the stop is still visible.
In swss.sh, the wait function ends by calling the helper:
swss.sh, lines 492-496
if [[ ! -z $DEV ]]; then
    /usr/bin/docker-wait-any-rs -s ${SERVICE}$DEV -d `printf "%s$DEV " ${PEER}` ${ALL_DEPS}
else
    /usr/bin/docker-wait-any-rs -s ${SERVICE} -d ${PEER} ${ALL_DEPS}
fi
If docker-wait-any-rs exits quickly, swss.sh wait exits, so systemd considers swss stopped.
What changed in the Rust helper
In the Rust version, if a container doesn’t exist, wait_container returns an error (e.g., 404), and the code treats that as “the container exited” and triggers termination:
lib.rs, lines 74-97
if let Err(e) = wait_result {
    if exit_event.load(Ordering::Acquire) {
        info!("Container {} wait thread get exception: {}", container_name, e);
        return Ok(());
    }
    // If a container is killed, `wait_container` may return an error.
    // Treat this as the container having exited.
    info!("Container {} exited with a status that resulted in an error from the Docker API: {}", container_name, e);
}
info!("No longer waiting on container '{}'", container_name);
if dependent_services.contains(&container_name) {
    let warm_restart = device_info::is_warm_restart_enabled(&container_name)?;
    let fast_reboot = device_info::is_fast_reboot_enabled()?;
    if warm_restart || fast_reboot {
        continue;
    }
}
exit_event.store(true, Ordering::Release);
return Ok(());
When teamd doesn’t exist, wait_result is Err(...). The code logs it, falls through, and sets the global exit flag (unless warm/fast reboot is enabled for that container), causing that task to return Ok(()).
The main loop in the Rust helper then exits as soon as any one watcher returns successfully:
lib.rs, lines 130-147
while let Some(result) = tasks.join_next().await {
    match result {
        Ok(Ok(())) => {
            break;
        }
        Ok(Err(e)) => {
            error!("Container watcher error: {}", e);
            return Err(e);
        }
        Err(e) => {
            error!("Task join error: {}", e);
            return Err(Error::Join(e));
        }
    }
}
Ok(0)
So a “non-existent container” is treated as “container exited,” which ends the helper, which ends swss.sh wait, which stops the swss systemd service.
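The chain above can be modelled in a few lines of Python. This is only a sketch under stated assumptions: `NotFound`, `fake_wait`, `rust_style_watch`, and `run_rust_style_helper` are hypothetical names invented for illustration, no Docker API is called, and the warm/fast reboot check is left out. It mimics the two properties described above: any wait error is treated as "container exited", and the helper returns as soon as one watcher finishes.

```python
import threading


class NotFound(Exception):
    """Hypothetical stand-in for the Docker API's 404 'no such container' error."""


def fake_wait(name, existing):
    # Hypothetical replacement for the real wait call: a missing container
    # fails immediately, an existing one is modelled as never exiting.
    if name not in existing:
        raise NotFound(f"No such container: {name}")
    threading.Event().wait()  # this event is never set, so the call blocks


def rust_style_watch(name, existing, exit_event, finished):
    try:
        fake_wait(name, existing)
    except Exception as err:
        # Modelled Rust behavior: log the error and fall through,
        # treating it as the container having exited.
        print(f"Container {name} wait returned an error: {err}")
    finished.append(name)
    exit_event.set()  # signal the helper to exit


def run_rust_style_helper(containers, existing, timeout=1.0):
    exit_event = threading.Event()
    finished = []
    for name in containers:
        threading.Thread(
            target=rust_style_watch,
            args=(name, existing, exit_event, finished),
            daemon=True,
        ).start()
    # The helper returns as soon as any watcher has signalled.
    return exit_event.wait(timeout), finished
```

With teamd missing, `run_rust_style_helper(["swss", "syncd", "teamd"], {"swss", "syncd"})` reports that the helper exited and that the teamd watcher was the one that finished, which is the situation that makes swss.sh wait return.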
What the Python helper did before
In the Python version, only TypeError is caught. Errors like “container not found” (e.g., docker.errors.NotFound/APIError) are not caught and therefore terminate the thread with an uncaught exception. The main thread keeps waiting on the global event and never exits:
docker-wait-any, lines 51-72
while True:
    try:
        docker_client.wait(container_name)
    except TypeError as e:
        if g_thread_exit_event.is_set():
            # When other thread exist, main thread will exit and docker_client will be destoryed
            log.log_info("Container {} wait thread get exception: {}".format(container_name, e))
            return
        else:
            raise e
    log.log_info("No longer waiting on container '{}'".format(container_name))
    # If this is a dependent service and warm restart is enabled for the system/container,
    # OR if the system is going through a fast-reboot, DON'T signal main thread to exit
    if (container_name in g_dep_services and
            (device_info.is_warm_restart_enabled(container_name) or device_info.is_fast_reboot_enabled())):
        continue
    # Signal the main thread to exit
    g_thread_exit_event.set()
    return
Result: when teamd didn’t exist, the teamd thread crashed, but the main thread blocked forever, so docker-wait-any never exited, and swss.sh wait stayed running. That’s why swss didn’t stop.
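The same situation can be modelled for the Python helper. Again this is a sketch, not the real code: `NotFound`, `fake_wait`, `python_style_watch`, and `run_python_style_helper` are hypothetical names, no Docker client is involved, and the main thread uses a timeout only so that the example terminates (the real helper, as described above, waits without one).

```python
import threading


class NotFound(Exception):
    """Hypothetical stand-in for docker.errors.NotFound."""


def fake_wait(name, existing):
    # Hypothetical replacement for docker_client.wait(): a missing container
    # fails immediately, an existing one is modelled as never exiting.
    if name not in existing:
        raise NotFound(f"No such container: {name}")
    threading.Event().wait()  # this event is never set, so the call blocks


def python_style_watch(name, existing, exit_event):
    try:
        fake_wait(name, existing)
    except TypeError:
        # Only TypeError is handled, as in the excerpt above.
        if exit_event.is_set():
            return
        raise
    # Reached only when the wait returns normally (container exited).
    exit_event.set()


def run_python_style_helper(containers, existing, timeout=0.5):
    crashed = []
    previous_hook = threading.excepthook
    # Record uncaught thread exceptions instead of printing tracebacks.
    threading.excepthook = lambda args: crashed.append(type(args.exc_value).__name__)
    try:
        exit_event = threading.Event()
        for name in containers:
            threading.Thread(
                target=python_style_watch,
                args=(name, existing, exit_event),
                daemon=True,
            ).start()
        # Timeout added for the sketch only, so the example can finish.
        signalled = exit_event.wait(timeout)
    finally:
        threading.excepthook = previous_hook
    return signalled, crashed
```

With teamd missing, `run_python_style_helper(["swss", "syncd", "teamd"], {"swss", "syncd"})` shows the teamd watcher dying with `NotFound` while the exit event is never set, so the main thread would keep waiting.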
Why the behavior changed
- The Rust helper proactively treats any Docker wait error (including “not found”) as an exit signal unless warm/fast reboot is enabled for that dependent. This causes the helper to exit immediately when teamd isn’t present.
- The Python helper inadvertently ignored most non-TypeError exceptions by letting the thread die without signaling the main thread, causing the helper to block indefinitely and thus keeping swss running.
Steps to Reproduce
Deploy a SONiC fanout switch.
Actual Behavior and Expected Behavior
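Actual: Docker containers are flapping on the fanout switch after deployment.
Expected: the containers stay up, as they did with the Python docker-wait-any.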
Relevant log output
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
df35ead48d50 urm.nvidia.com/sw-nbu-sws-sonic-docker/sonic-wjh:9.9.9-master-153 "/usr/local/bin/supe…" About an hour ago Up 10 seconds what-just-happened
1b5ebb42817c docker-snmp:latest "/usr/bin/docker-snm…" About an hour ago Up 11 seconds snmp
33ac5c827a26 docker-sonic-gnmi:latest "/usr/local/bin/supe…" About an hour ago Up About an hour gnmi
0066688ebc5f docker-eventd:latest "/usr/local/bin/supe…" About an hour ago Up About an hour eventd
68f208856a8e docker-platform-monitor:latest "/usr/bin/docker_ini…" About an hour ago Up 23 seconds pmon
4d7195876951 docker-router-advertiser:latest "/usr/bin/docker-ini…" About an hour ago Exited (0) About an hour ago radv
ba792f1a18cd docker-syncd-mlnx:latest "/usr/local/bin/supe…" About an hour ago Up 23 seconds syncd
109d6c3c14ae docker-sysmgr:latest "/usr/local/bin/supe…" About an hour ago Up About an hour sysmgr
0b0ee5d5d8b8 docker-orchagent:latest "/usr/bin/docker-ini…" About an hour ago Up 26 seconds swss
95323d902200 docker-sonic-restapi:latest "/usr/local/bin/supe…" About an hour ago Exited (0) About an hour ago restapi
47298e5b709c docker-sonic-mgmt-framework:latest "/usr/local/bin/supe…" 3 hours ago Exited (0) About an hour ago mgmt-framework
18a0d2e03367 docker-lldp:latest "/usr/bin/docker-lld…" 3 hours ago Exited (0) About an hour ago lldp
806b9faf518f docker-fpm-frr:latest "/usr/bin/docker_ini…" 3 hours ago Exited (0) 3 hours ago bgp
5f2002a2e497 docker-database:latest "/usr/local/bin/dock…" 3 hours ago Up 3 hours database
Output of show version, show techsupport
Attach files (if any)
No response