-
Notifications
You must be signed in to change notification settings - Fork 195
Module graceful shutdown support #667
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Module graceful shutdown support #667
Conversation
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
sonic-chassisd/scripts/chassisd
Outdated
| module.set_module_state_transition(v2, module_name, "startup") | ||
| except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When is the module state transition from here cleared?
sonic-chassisd/scripts/chassisd
Outdated
| wants_up = (admin_state != 'down') | ||
| not_online = (str(operational_state).lower() | ||
| != str(ModuleBase.MODULE_STATUS_ONLINE).lower()) | ||
| if wants_up and not_online and v2: | ||
| try: | ||
| module.set_module_state_transition(v2, module_name, "startup") | ||
| self.log_info(f"Marked startup transition for {module_name} at boot") | ||
| except Exception as e: | ||
| self.log_error(f"Failed to set startup transition for {module_name}: {e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this section, this function was only present to handle the case were the CONFIG_DB entry is not present, in which case dark mode is implied and we power off the DPUs
sonic-chassisd/scripts/chassisd
Outdated
| ModuleTransitionFlagHelper().set_transition_flag(module_name) | ||
| try_get(self.module_updater.chassis.get_module(module_index).set_admin_state, admin_state, default=False) | ||
| # Only run pre-shutdown on DOWN path | ||
| try_get(module.module_pre_shutdown, default=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
module_pre_shutdown is supposed to be called if we need to power off the DPU, or if we need to stay in dark mode, as the exisitng code was present in the same format, please align
sonic-chassisd/scripts/chassisd
Outdated
| # Clear transition flag in STATE_DB via ModuleBase centralized API | ||
| try: | ||
| module_obj = self.chassis.get_module(module_index) | ||
| module_obj.clear_module_state_transition(self._state_v2, key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rameshraghupathy Clearing on operational state change is a problem, this can be intentional or unintentional, moreover, there is a transition from online to offline to online when we do a reboot, this will clear the transition during that state change, which is not the expected behavior
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
dfad0fc to
5b2c0f6
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
eedae8c to
ddf58cc
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
… format, swsscommon for DB connection etc
…hy/sonic-platform-daemons into graceful-shutdown # Please enter a commit message to explain why this merge is necessary, # especially if it merges an updated upstream into a topic branch. # # Lines starting with '#' will be ignored, and an empty message aborts # the commit.
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| self.log_warning(f"Failed to set shutdown transition for {key}") | ||
| except Exception as e: | ||
| self.log_error(f"Failed to set shutdown transition for {key}: {e}") | ||
| result = try_get(module.set_admin_state, admin_state, default=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we setting state transition again at line 267 and shouldn't we invoke set_admin_state_using_graceful_handler here?
If we are setting state transition in this code path where caller takes care of, we should be avoiding the set/clear in set_admin_state_using_graceful_handler(). also, we need to set the state before pre-shutdown to handle failure scenario
| self.log_warning(f"Failed to set startup transition for {key}") | ||
| except Exception as e: | ||
| self.log_error(f"Failed to set startup transition for {key}: {e}") | ||
| result = try_get(module.set_admin_state, admin_state, default=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above.
Provide support for SmartSwitch DPU module graceful shutdown.
Description
Single source of truth for transitions
All components now use
sonic_platform_base.module_base.ModuleBasehelpers:set_module_state_transition(db, name, transition_type)clear_module_state_transition(db, name)get_module_state_transition(db, name) -> dictis_module_state_transition_timed_out(db, name, timeout_secs) -> boolEliminates duplicated logic and race-prone direct Redis writes.
Correct table everywhere
CHASSIS_MODULE_TABLE(replacesCHASSIS_MODULE_INFO_TABLE).Ownership & lifecycle
The initiator of an operation (
startup/shutdown/reboot) sets:state_transition_in_progress=Truetransition_type=<op>transition_start_time=<utc-iso8601>The platform (
set_admin_state()) is responsible for clearing:state_transition_in_progress=Falsetransition_end_time=<epoch>(or similar end stamp).CLI pre-clears only when a prior transition is timed out.
Timeouts & policy
Platform JSON path only:
/usr/share/sonic/device/{plat}/platform.json; else constants.Typical production values used:
startup: 180s,shutdown: 180s(≈graceful_wait 60s + power 120s),reboot: 120s.Graceful wait (e.g., waiting for “Graceful shutdown complete”) is a platform policy and implemented inside platform
set_admin_state()—not in ModuleBase.Boot behavior
chassisdon start:set_initial_dpu_admin_state()which marks transitions via ModuleBase before calling platformset_admin_state().gNOI shutdown daemon
Listens on
CHASSIS_MODULE_TABLEand triggers only when:state_transition_in_progress=Trueandtransition_type=shutdown.Never clears the flag (ownership stays with the platform).
Bounded RPC timeouts and robust Redis access (swsssdk/swsscommon).
CLI (
config chassis modules …)is_module_state_transition_timed_out()→ auto-clear then proceed.startup/shutdown; platform clears on completion.Redis robustness
hset(mapping=...)usage.Race reduction & consistency
transition_start_time; clears may add an end stamp.Change scope
Please refer to the HLD and related PRs:
HLD: # 1991 sonic-net/SONiC#1991
sonic-host-services: sonic-net/sonic-host-services#255
sonic-platform-common: sonic-net/sonic-platform-common#567
Module graceful shutdown support #4031 sonic-net/sonic-utilities#4031
How Has This Been Tested?
Issue the "config chassis modules shutdown DPUx" command
Verify the DPU module is gracefully shut by checking the logs in /var/log/syslog on both NPU and DPU