Bug: dockers flapping on SONiC fan out switch due to behavior change in docker-wait-any-rs compared to docker-wait-any when the docker does not exist #24730

@stephenxs

Is it platform specific

generic

Importance or Severity

Critical

Description of the bug

Docker containers flap on a SONiC fanout switch because docker-wait-any-rs behaves differently from docker-wait-any when a container does not exist.

Investigation showed that the cause is a behavior change in an exception path of docker-wait-any between the Python version and the Rust version (the SONiC community re-implemented some components in Rust).
In short, the teamd container is removed on the SONiC fanout switch, but the swss service still waits on teamd through docker-wait-any. The Rust version exits when a container does not exist, so the swss service stops and the containers flap.

Detailed analysis

What systemd is doing
swss.service runs ExecStart=/usr/local/bin/swss.sh wait. If that command returns, systemd treats the service process as having exited. With RestartSec=30, it will be restarted, but the stop is still visible.

In swss.sh, the wait function ends by calling the helper:

swss.sh, lines 492-496
    if [[ ! -z $DEV ]]; then
        /usr/bin/docker-wait-any-rs -s ${SERVICE}$DEV -d `printf "%s$DEV " ${PEER}` ${ALL_DEPS}
    else
        /usr/bin/docker-wait-any-rs -s ${SERVICE} -d ${PEER} ${ALL_DEPS}
    fi

If docker-wait-any-rs exits quickly, swss.sh wait exits, so systemd considers swss stopped.

What changed in the Rust helper

In the Rust version, if a container doesn’t exist, wait_container returns an error (e.g., 404), and the code treats that as “the container exited” and triggers termination:

lib.rs, lines 74-97
        if let Err(e) = wait_result {
            if exit_event.load(Ordering::Acquire) {
                info!("Container {} wait thread get exception: {}", container_name, e);
                return Ok(());
            }
            // If a container is killed, `wait_container` may return an error.
            // Treat this as the container having exited.
            info!("Container {} exited with a status that resulted in an error from the Docker API: {}", container_name, e);
        }
        info!("No longer waiting on container '{}'", container_name);
        if dependent_services.contains(&container_name) {
            let warm_restart = device_info::is_warm_restart_enabled(&container_name)?;
            let fast_reboot = device_info::is_fast_reboot_enabled()?;
            if warm_restart || fast_reboot {
                continue;
            }
        }
        exit_event.store(true, Ordering::Release);
        return Ok(());

When teamd doesn’t exist, wait_result is Err(...). The code logs it, falls through, and sets the global exit flag (unless warm/fast reboot is enabled for that container), causing that task to return Ok(()).
The main loop in the Rust helper then exits as soon as any one watcher returns successfully:

lib.rs, lines 130-147
    while let Some(result) = tasks.join_next().await {
        match result {
            Ok(Ok(())) => {
                break;
            }
            Ok(Err(e)) => {
                error!("Container watcher error: {}", e);
                return Err(e);
            }
            Err(e) => {
                error!("Task join error: {}", e);
                return Err(Error::Join(e));
            }
        }
    }
    Ok(0)

So a “non-existent container” is treated as “container exited,” which ends the helper, which ends swss.sh wait, which stops the swss systemd service.
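The fall-through described above can be modelled in a few lines of Python. This is only a sketch of the control flow in the quoted Rust excerpt, not the helper itself: `WaitError`, `wait_missing`, `watch_container`, and the `reboot_in_progress` flag are hypothetical names introduced for illustration, and the flag stands in for the warm restart / fast reboot checks.

```python
import threading


class WaitError(Exception):
    """Hypothetical stand-in for any error returned by the Docker wait call."""


def wait_missing(name):
    # Simulates waiting on a container that does not exist (e.g. a 404).
    raise WaitError("no such container: " + name)


def watch_container(name, wait_fn, exit_event, dependent_services, reboot_in_progress=False):
    """Follow the same branches as the quoted Rust watcher for one container."""
    while True:
        try:
            wait_fn(name)
        except WaitError:
            if exit_event.is_set():
                # Another watcher already signalled exit; return quietly.
                return "exit already signalled"
            # Otherwise fall through: the error is treated as "container exited".
        if name in dependent_services and reboot_in_progress:
            continue  # warm restart / fast reboot: keep waiting
        exit_event.set()
        return "signalled exit"


exit_event = threading.Event()
outcome = watch_container("teamd", wait_missing, exit_event, dependent_services={"teamd"})
print(outcome, exit_event.is_set())
```

In this model a missing teamd container sets the exit flag on the first iteration, which is the step that lets the helper return.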

What the Python helper did before

In the Python version, only TypeError is caught. Errors like “container not found” (e.g., docker.errors.NotFound/APIError) are not caught and therefore terminate the thread with an uncaught exception. The main thread keeps waiting on the global event and never exits:

docker-wait-any, lines 51-72
    while True:
        try:
            docker_client.wait(container_name)
        except TypeError as e:
            if g_thread_exit_event.is_set():
                # When another thread exits, the main thread will exit and docker_client will be destroyed
                log.log_info("Container {} wait thread get exception: {}".format(container_name, e))
                return
            else:
                raise e
        
        log.log_info("No longer waiting on container '{}'".format(container_name))
        # If this is a dependent service and warm restart is enabled for the system/container,
        # OR if the system is going through a fast-reboot, DON'T signal main thread to exit
        if (container_name in g_dep_services and
                (device_info.is_warm_restart_enabled(container_name) or device_info.is_fast_reboot_enabled())):
            continue
        # Signal the main thread to exit
        g_thread_exit_event.set()
        return

Result: when teamd didn’t exist, the teamd thread crashed, but the main thread blocked forever, so docker-wait-any never exited, and swss.sh wait stayed running. That’s why swss didn’t stop.
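The blocking behavior can be reproduced with the standard library alone. The sketch below is a simplified model of the quoted Python thread, assuming a hypothetical `NotFound` exception in place of the Docker client error and a short timeout in place of the real indefinite wait; it does not call Docker.

```python
import threading


class NotFound(Exception):
    """Hypothetical stand-in for the Docker client's "not found" error."""


exit_event = threading.Event()
uncaught = []

# Record uncaught thread exceptions instead of printing a traceback.
threading.excepthook = lambda args: uncaught.append(args.exc_type.__name__)


def wait_thread(container_name):
    try:
        # Simulates docker_client.wait() on a container that does not exist.
        raise NotFound("no such container: " + container_name)
    except TypeError:
        # Only TypeError is handled, so NotFound propagates out of the thread.
        return
    exit_event.set()  # never reached when the wait call raises NotFound


worker = threading.Thread(target=wait_thread, args=("teamd",))
worker.start()
worker.join()

# The real helper waits without a timeout; a timeout is used here so the sketch ends.
signalled = exit_event.wait(timeout=0.2)
print(uncaught, signalled)
```

The worker thread dies with the uncaught exception, the event is never set, and the main thread's wait only returns here because of the timeout.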

Why the behavior differs

  • The Rust helper proactively treats any Docker wait error (including “not found”) as an exit signal unless warm/fast reboot is enabled for that dependent. This causes the helper to exit immediately when teamd isn’t present.
  • The Python helper inadvertently ignored most non-TypeError exceptions by letting the thread die without signaling the main thread, causing the helper to block indefinitely and thus keeping swss running.

Steps to Reproduce

Deploy a SONiC fanout switch (a deployment in which the teamd container is removed).

Actual Behavior and Expected Behavior

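Actual: containers flap on the fanout switch after deployment. In the listing below, swss, syncd, pmon, snmp and what-just-happened were created about an hour ago but have been up for less than 30 seconds.
Expected: containers stay up; a missing teamd container should not cause swss to stop.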

Relevant log output

CONTAINER ID   IMAGE                                                               COMMAND                  CREATED             STATUS                         PORTS     NAMES
df35ead48d50   urm.nvidia.com/sw-nbu-sws-sonic-docker/sonic-wjh:9.9.9-master-153   "/usr/local/bin/supe…"   About an hour ago   Up 10 seconds                            what-just-happened
1b5ebb42817c   docker-snmp:latest                                                  "/usr/bin/docker-snm…"   About an hour ago   Up 11 seconds                            snmp
33ac5c827a26   docker-sonic-gnmi:latest                                            "/usr/local/bin/supe…"   About an hour ago   Up About an hour                         gnmi
0066688ebc5f   docker-eventd:latest                                                "/usr/local/bin/supe…"   About an hour ago   Up About an hour                         eventd
68f208856a8e   docker-platform-monitor:latest                                      "/usr/bin/docker_ini…"   About an hour ago   Up 23 seconds                            pmon
4d7195876951   docker-router-advertiser:latest                                     "/usr/bin/docker-ini…"   About an hour ago   Exited (0) About an hour ago             radv
ba792f1a18cd   docker-syncd-mlnx:latest                                            "/usr/local/bin/supe…"   About an hour ago   Up 23 seconds                            syncd
109d6c3c14ae   docker-sysmgr:latest                                                "/usr/local/bin/supe…"   About an hour ago   Up About an hour                         sysmgr
0b0ee5d5d8b8   docker-orchagent:latest                                             "/usr/bin/docker-ini…"   About an hour ago   Up 26 seconds                            swss
95323d902200   docker-sonic-restapi:latest                                         "/usr/local/bin/supe…"   About an hour ago   Exited (0) About an hour ago             restapi
47298e5b709c   docker-sonic-mgmt-framework:latest                                  "/usr/local/bin/supe…"   3 hours ago         Exited (0) About an hour ago             mgmt-framework
18a0d2e03367   docker-lldp:latest                                                  "/usr/bin/docker-lld…"   3 hours ago         Exited (0) About an hour ago             lldp
806b9faf518f   docker-fpm-frr:latest                                               "/usr/bin/docker_ini…"   3 hours ago         Exited (0) 3 hours ago                   bgp
5f2002a2e497   docker-database:latest                                              "/usr/local/bin/dock…"   3 hours ago         Up 3 hours                               database

Output of show version, show techsupport

Attach files (if any)

No response
