orchagent crashes with SWSS_LOG_THROW when loading a saved configuration with conflicting speeds on breakout sub-ports. #3981

@univac490

Severity: Critical (Causes unrecoverable boot loop)

Component: orchagent, portmgrd, syncd

Description:
When a physical port is broken out (e.g., a 400G port into 2x200G sub-ports), the system allows a user to configure conflicting speeds on those sub-ports (e.g., Ethernet0 at 40G, while Ethernet4 remains at its 200G default). This is a physically impossible state for the hardware's Port Macro/VCO to support.

While the system may not crash immediately, saving this invalid configuration is fatal on the next reboot.

On boot, orchagent reads this conflicting configuration and tries to apply it using the PortsOrch::addPortBulk function. The hardware (SAI) correctly rejects this impossible state, which syncd reports as SAI_STATUS_FAILURE (rv:-1), citing a VCO Validation Failed error.

The orchagent code in addPortBulk then intentionally calls SWSS_LOG_THROW in response to this error. orchagent aborts, is restarted, and hits the same error again, so the switch stays in a crash loop until the saved configuration is repaired by hand.

Steps to Reproduce:

  1. On a stable switch (Dell Z9332F, master branch), break out a 400G port:
    sudo config interface breakout Ethernet0 '2x200G[100G,40G]'
  2. Set a speed on the first sub-port, which conflicts with the 200G default on the second sub-port:
    sudo config interface speed Ethernet0 40000
  3. Save this physically impossible configuration:
    sudo config save -y
  4. Reboot the switch:
    sudo reboot

Expected Behavior:
The system should prevent this failure. The config interface speed command must be aware of hardware port-macro constraints.

When a speed is set on any sub-port of a breakout, the system should automatically apply that same speed to all other sub-ports on the same physical Port Macro, with a confirmation message (e.g., "This will also set Ethernet4 to 40G. Continue? [y/N]").
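The check described above could be sketched roughly as follows. This is a minimal illustration, not SONiC code: the `PortEntry` struct, the `portMacro` field, and both function names are hypothetical, and the rule "all sub-ports on one macro share one speed" is the simplification proposed in this issue (real VCO constraints may allow some mixed-speed combinations).

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical view of one PORT table entry; not an actual SONiC type.
struct PortEntry
{
    std::string name;    // e.g. "Ethernet0"
    int portMacro;       // hypothetical ID of the port macro this sub-port sits on
    uint32_t speedMbps;  // configured speed in Mb/s, e.g. 40000
};

// Returns the sibling sub-ports (same port macro as `target`) whose speed
// differs from `newSpeedMbps`. These are the ports a confirmation prompt
// would have to list before applying the change.
std::vector<std::string> siblingsNeedingUpdate(const std::vector<PortEntry> &ports,
                                               const std::string &target,
                                               uint32_t newSpeedMbps)
{
    std::vector<std::string> siblings;
    auto it = std::find_if(ports.begin(), ports.end(),
                           [&](const PortEntry &p) { return p.name == target; });
    if (it == ports.end())
        return siblings;  // unknown port: nothing to propagate

    for (const auto &p : ports)
    {
        if (p.portMacro == it->portMacro && p.name != target && p.speedMbps != newSpeedMbps)
            siblings.push_back(p.name);
    }
    return siblings;
}

// Returns the IDs of port macros whose sub-ports carry more than one distinct
// speed. A save-time check could refuse to write the config if this is non-empty.
std::vector<int> findConflictingMacros(const std::vector<PortEntry> &ports)
{
    std::map<int, std::set<uint32_t>> speedsByMacro;
    for (const auto &p : ports)
        speedsByMacro[p.portMacro].insert(p.speedMbps);

    std::vector<int> conflicts;
    for (const auto &kv : speedsByMacro)
    {
        if (kv.second.size() > 1)
            conflicts.push_back(kv.first);
    }
    return conflicts;
}
```

With Ethernet0 and Ethernet4 on the same (hypothetical) macro and both at 200000, calling `siblingsNeedingUpdate(ports, "Ethernet0", 40000)` would return Ethernet4, which is the port the prompt would name.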

At the absolute minimum, if an invalid config is saved, orchagent must not crash on boot. It should log the error and continue, leaving the conflicting ports down.

Actual Behavior:
The switch enters an unrecoverable crash loop.

  1. orchagent starts and reads the conflicting 40G/200G config.
  2. syncd fails with a VCO Validation Failed error and returns SAI_STATUS_FAILURE (rv:-1).
  3. PortsOrch::addPortBulk calls SWSS_LOG_THROW, crashing orchagent.
  4. systemd restarts orchagent, and the loop repeats.

Workaround (Recovery from Crash Loop):
The only software-level recovery from this crash loop is to:

  1. Log in to the switch console (while it's in the crash loop).
  2. Manually edit /etc/sonic/config_db.json.
  3. Find the PORT table entries for the conflicting breakout ports (e.g., Ethernet0 and Ethernet4).
  4. Set their speed values to be identical (e.g., set both to "40000").
  5. Save the file.
  6. Restart the swss service (sudo systemctl restart swss). orchagent will now load the valid configuration and boot successfully.

Evidence / Logs:

syncd Log (Hardware Rejection):

ERR syncd#syncd: [none] SAI_API_PORT:brcm_sai_create_port:11176 Adding port failed failed with error -5.
INFO syncd#supervisord: syncd 0:_soc_esw_portctrl_pm_flex_vco_validation:   VCO Validation Failed for Port Macro: 4
ERR syncd#syncd: :- sendApiResponse: api SAI_COMMON_API_BULK_CREATE failed in syncd mode: SAI_STATUS_FAILURE

orchagent Log (The Crash):

ERR swss#orchagent: :- addPortBulk: Failed to create ports with bulk operation, rv:-1
ERR swss#orchagent: :- handleSaiFailure: Encountered failure in create operation, SAI API: SAI_API_PORT, status: SAI_STATUS_FAILURE
ERR swss#orchagent: :- addPortBulk: PortsOrch bulk create failure
INFO swss#supervisord: orchagent terminate called after throwing an instance of 'std::runtime_error'
INFO swss#supervisord: orchagent   what():  :- addPortBulk: PortsOrch bulk create failure
WARN exited: orchagent (terminated by SIGABRT (core dumped); not expected)

GDB Backtrace:

#6  0x00007f770414b0d8 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f7704a12c56 in swss::Logger::wthrow (this=<optimized out>, prio=prio@entry=swss::Logger::SWSS_ERROR, fmt=fmt@entry=0x564d55224588 ":- %s: PortsOrch bulk create failure") at common/logger.cpp:335
#8  0x0000564d54d97ffa in PortsOrch::addPortBulk (this=this@entry=0x7f7703254000, portList=...) at ./orchagent/portsorch.cpp:1361
#9  0x0000564d54dc02c7 in PortsOrch::doPortTask (this=this@entry=0x7f7703254000, consumer=...) at ./orchagent/portsorch.cpp:4423
#10 0x0000564d54dc34f6 in PortsOrch::doTask (this=0x7f7703254000, consumer=...) at ./orchagent/portsorch.cpp:6140

Root Cause (Code):
In orchagent/portsorch.cpp, inside PortsOrch::addPortBulk:

    auto handle_status = handleSaiCreateStatus(SAI_API_PORT, status);
    if (handle_status != task_process_status::task_success)
    {
        SWSS_LOG_THROW("PortsOrch bulk create failure"); // <-- THIS IS THE BUG
    }

Suggested Fix:
This bug should be fixed at two levels:

  1. Config Validation (Ideal Fix): The portmgrd or config logic must be made aware of hardware port-macro constraints. It should not allow a user to set or save a configuration with conflicting speeds on the same macro.
  2. Error Handling (Critical Fix): The error handling in PortsOrch::addPortBulk should be improved. A task_failed status resulting from a hardware port creation failure (like this VCO conflict) should be treated as a non-fatal error. The function should log the failure with SWSS_LOG_ERROR and return false instead of calling SWSS_LOG_THROW, so that a recoverable configuration error does not crash orchagent and take the whole system down with it.
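A rough, self-contained sketch of the second fix is shown here. It does not use the real swss or SAI headers: `TaskStatus`, `PortConfig`, `bulkCreatePorts`, and `addPortBulkNoThrow` are stand-ins invented for illustration, and `std::cerr` takes the place of SWSS_LOG_ERROR. It only shows the control flow (log and return false rather than throw). The caller (doPortTask in the backtrace above) would presumably also need to handle the false return, which is not shown.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Stand-in for task_process_status; defined here so the sketch compiles alone.
enum class TaskStatus { Success, Failed };

// Stand-in for the per-port configuration passed to the bulk create.
struct PortConfig
{
    std::string alias;  // e.g. "Ethernet0"
};

// Hypothetical stub for the SAI bulk-create path. `simulateHwReject` mimics
// the hardware rejecting the request (such as the VCO validation failure).
TaskStatus bulkCreatePorts(const std::vector<PortConfig> &portList, bool simulateHwReject)
{
    (void)portList;
    return simulateHwReject ? TaskStatus::Failed : TaskStatus::Success;
}

// Non-throwing variant: a hardware rejection is logged and reported to the
// caller through the return value, leaving the affected ports uncreated.
bool addPortBulkNoThrow(const std::vector<PortConfig> &portList, bool simulateHwReject)
{
    if (portList.empty())
        return true;  // nothing to create

    if (bulkCreatePorts(portList, simulateHwReject) != TaskStatus::Success)
    {
        // In orchagent this would be SWSS_LOG_ERROR rather than std::cerr.
        std::cerr << "addPortBulk: bulk create failed, leaving "
                  << portList.size() << " port(s) down" << std::endl;
        return false;
    }
    return true;
}
```

The design choice is that a rejected port set becomes an ordinary failure the caller can react to (for example by keeping those ports down), rather than an exception that terminates the process.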
