Module graceful shutdown support #567
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
sonic_platform_base/module_base.py
Outdated
    # gnoi reboot pipe related
    GNOI_REBOOT_PIPE_PATH = "/host/gnoi_reboot.pipe"
    GNOI_REBOOT_RESPONSE_PIPE_PATH = "/host/gnoi_reboot_response.pipe"
    GNOI_PORT = 50052
The default port is 8080; please read it from Redis, similar to https://github.com/sonic-net/sonic-utilities/blob/c78e0f73fece3fb1c6fb07718a64eddd337dae23/scripts/reboot_smartswitch_helper#L41C1-L45C2
Not required anymore. Cleaned it.
sonic_platform_base/module_base.py
Outdated
    GNOI_REBOOT_PIPE_PATH = "/host/gnoi_reboot.pipe"
    GNOI_REBOOT_RESPONSE_PIPE_PATH = "/host/gnoi_reboot_response.pipe"
    GNOI_PORT = 50052
    GNOI_RESPONSE_TIMEOUT = 60 # seconds
Please read the timeout from platform.json, similar to https://github.com/sonic-net/sonic-utilities/blob/c78e0f73fece3fb1c6fb07718a64eddd337dae23/scripts/reboot_smartswitch_helper#L109C7-L109C52
Done
sonic_platform_base/module_base.py
Outdated
    This method performs the following steps:
    1. Sends a JSON-formatted reboot request to the gNOI reboot daemon via a named pipe.
    2. Waits for a response on a designated response pipe, with a timeout of 60 seconds.
Update the comment to reflect the timeout read from platform.json.
sonic_platform_base/module_base.py
Outdated
        """
        raise NotImplementedError

    def pre_shutdown_hook(self):
Where is this invoked?
Refactored; this is no longer valid.
sonic_platform_base/module_base.py
Outdated
    subtype = device_info.get_device_subtype()
    if subtype == "SmartSwitch" and not is_dpu():
        self.graceful_shutdown_handler()
    # Proceed to set the admin state using the platform-specific implementation
How is this supposed to work? super here will call set_admin_state of the base class, not of the derived one.
In the refactored implementation, the platform will call graceful_shutdown_handler().
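The arrangement described in this reply can be illustrated with a minimal sketch. The class names `ModuleBaseSketch` and `VendorModule` and the `calls` list are hypothetical stand-ins, not the real `ModuleBase` or any vendor code; the only point is that the derived class's `set_admin_state()` invokes the inherited handler itself, so nothing relies on `super()` from inside the base class.

```python
class ModuleBaseSketch:
    """Stand-in for ModuleBase (hypothetical, for illustration only)."""

    def __init__(self):
        self.calls = []  # records the order of operations for the demo

    def graceful_shutdown_handler(self):
        # Shared helper provided by the base class.
        self.calls.append("graceful_shutdown")

    def set_admin_state(self, up):
        # The base class leaves the real work to the platform.
        raise NotImplementedError


class VendorModule(ModuleBaseSketch):
    """Hypothetical platform implementation."""

    def set_admin_state(self, up):
        if not up:
            # The platform override calls the inherited helper directly.
            self.graceful_shutdown_handler()
        self.calls.append("power_on" if up else "power_off")
        return True


module = VendorModule()
module.set_admin_state(False)
print(module.calls)  # ['graceful_shutdown', 'power_off']
```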
Pull Request Overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
tests/module_base_test.py
Outdated
    mb.ModuleBase._TRANSITION_TIMEOUTS_CACHE = None
    with patch("os.path.exists", return_value=False):
        d = Dummy()
        assert d._load_transition_timeouts()["reboot"] == 240
Copilot AI, Oct 23, 2025
The magic number 240 should be replaced with the constant ModuleBase._TRANSITION_TIMEOUT_DEFAULTS["reboot"], so that the test stays in sync with the actual default values and the same value is not hardcoded in multiple places.
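A sketch of the suggested test shape follows. `ModuleBaseSketch` is a hypothetical stand-in with a simplified loader, the default values are placeholders, and the file path is made up; only the attribute names `_TRANSITION_TIMEOUT_DEFAULTS` and `_TRANSITION_TIMEOUTS_CACHE` and the method name `_load_transition_timeouts()` come from the thread above.

```python
import os
from unittest.mock import patch


class ModuleBaseSketch:
    # Placeholder defaults for the sketch; the real values live in ModuleBase.
    _TRANSITION_TIMEOUT_DEFAULTS = {"startup": 180, "shutdown": 180, "reboot": 240}
    _TRANSITION_TIMEOUTS_CACHE = None

    def _load_transition_timeouts(self):
        cls = type(self)
        if cls._TRANSITION_TIMEOUTS_CACHE is None:
            timeouts = dict(cls._TRANSITION_TIMEOUT_DEFAULTS)
            if os.path.exists("/hypothetical/platform.json"):
                pass  # a real loader would merge overrides from the file here
            cls._TRANSITION_TIMEOUTS_CACHE = timeouts
        return cls._TRANSITION_TIMEOUTS_CACHE


def test_reboot_timeout_falls_back_to_default():
    ModuleBaseSketch._TRANSITION_TIMEOUTS_CACHE = None
    with patch("os.path.exists", return_value=False):
        timeouts = ModuleBaseSketch()._load_transition_timeouts()
    # Compare against the constant rather than a literal number.
    expected = ModuleBaseSketch._TRANSITION_TIMEOUT_DEFAULTS["reboot"]
    assert timeouts["reboot"] == expected


test_reboot_timeout_falls_back_to_default()
```

If the default is later changed, a test written this way keeps passing without edits, which is the point of the review comment.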
            return admin_state_success

        # Admin DOWN: Perform graceful shutdown first
        module_name = self.get_name()
For uniformity, invoke set_module_state_transition here, before calling the graceful shutdown handler.
    PCIE_DETACH_INFO_TABLE_KEY = PCIE_DETACH_INFO_TABLE+"|"+pcie_string
    if not self.state_db_connector:
        self.state_db_connector = swsscommon.swsscommon.DBConnector("STATE_DB", 0)
    db = self._state_db_connector
Just delete lines 345 and 346 and replace state_db_connector with _state_db_connector, or better, keep the prior name state_db_connector() so that unnecessary code changes are avoided.
    self.state_db_connector.hset(PCIE_DETACH_INFO_TABLE_KEY, "bus_info", pcie_string)
    self.state_db_connector.hset(PCIE_DETACH_INFO_TABLE_KEY, "dpu_state", operation)
    # Set the PCI detach info for detaching operation
    db.set(db.STATE_DB, PCIE_DETACH_INFO_TABLE_KEY, {
hset is the right way to set keys in STATE_DB; please avoid unrelated changes.
    self.state_db_connector.delete(PCIE_DETACH_INFO_TABLE_KEY)
    # Delete the entire entry for attaching operation
    if hasattr(db, 'delete'):
        db.delete(db.STATE_DB, PCIE_DETACH_INFO_TABLE_KEY, "bus_info")
Line 351 is correct, as the delete removes the entire PCIE_DETACH_INFO_TABLE_KEY entry.
    # Atomically set transition state (handles race conditions with locking)
    # Note: This is safe to call even if caller already set transition state,
    # as the function is idempotent and will not overwrite existing valid transitions
    self.set_module_state_transition(db, module_name, "shutdown")
As mentioned above, this should be set before calling the graceful handler, since the flag is cleared outside the handler.
    try:
        oper = self.get_oper_status()
        if oper and str(oper).lower() == "offline":
            if not self.clear_module_state_transition(db, module_name):
This seems to be duplicate code, as the caller clears the state in both the False and the True case. Could you please re-check?
    # This handles cases where multiple agents might be waiting
    if self.is_module_state_transition_timed_out(db, module_name, shutdown_timeout):
        # Clear only if we can confirm it's actually timed out
        if not self.clear_module_state_transition(db, module_name):
Same here: the caller also clears the state upon returning, so this seems to be a duplicate step.

    # Final timeout check before clearing - use recorded start time, not our local wait time
    if self.is_module_state_transition_timed_out(db, module_name, shutdown_timeout):
        if not self.clear_module_state_transition(db, module_name):
Same here; this seems to be a duplicate.
    if up:
        # Admin UP: Set transition state to 'startup' before admin state change
        module_name = self.get_name()
        self.set_module_state_transition(self._state_db_connector, module_name, "startup")
Should we abort the operation if the set fails, as described in "Scenario 1" of the HLD (https://github.com/sonic-net/SONiC/blob/master/doc/smart-switch/graceful-shutdown/graceful-shutdown.md)?
Also, this transition state needs to be set before pre-shutdown, because the operation must be aborted if setting the state fails. Should we leave setting and clearing transitions to the caller, so that it can run the pre-shutdown and post-startup sequence correctly?
    if t0.tzinfo is None:
        t0 = t0.replace(tzinfo=timezone.utc)

    age = (datetime.now(timezone.utc) - t0).total_seconds()
@rameshraghupathy could you address the comment from Qi?
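For readers following the snippet in this thread, here is a self-contained sketch of the same age computation. The function name `transition_age_seconds` and the injectable `now` argument are assumptions added for illustration and testability; the approach (treat a naive timestamp as UTC, then subtract from an aware "now") mirrors the quoted lines.

```python
from datetime import datetime, timezone


def transition_age_seconds(start_time_iso, now=None):
    """Return the age in seconds of an ISO 8601 start timestamp."""
    t0 = datetime.fromisoformat(start_time_iso)
    if t0.tzinfo is None:
        # Naive timestamps are assumed to be UTC, as in the quoted snippet.
        t0 = t0.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - t0).total_seconds()


fixed_now = datetime(2025, 1, 1, 0, 3, 0, tzinfo=timezone.utc)
print(transition_age_seconds("2025-01-01T00:00:00", now=fixed_now))        # 180.0
print(transition_age_seconds("2025-01-01T00:00:00+00:00", now=fixed_now))  # 180.0
```

Normalizing to an aware datetime first avoids the `TypeError` Python raises when a naive and an aware datetime are subtracted.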
Provide support for SmartSwitch DPU module graceful shutdown.
# Description:
* **Single source of truth for transitions**
* All components now use `sonic_platform_base.module_base.ModuleBase` helpers:
* `set_module_state_transition(db, name, transition_type)`
* `clear_module_state_transition(db, name)`
* `get_module_state_transition(db, name) -> dict`
* `is_module_state_transition_timed_out(db, name, timeout_secs) -> bool`
* Eliminates duplicated logic and race-prone direct Redis writes.
* **Correct table everywhere**
* Standardized on **`CHASSIS_MODULE_TABLE`** (replaces `CHASSIS_MODULE_INFO_TABLE`).
* HLD mismatch addressed in code (HLD fix tracked separately).
* **Ownership & lifecycle**
* The **initiator** of an operation (`startup`/`shutdown`/`reboot`) sets:
* `state_transition_in_progress=True`
* `transition_type=<op>`
* `transition_start_time=<utc-iso8601>`
* The **platform** (`set_admin_state()`) is responsible for clearing:
* `state_transition_in_progress=False`
* optionally `transition_end_time=<epoch>` (or similar end stamp).
* CLI pre-clears only when a prior transition is **timed out**.
* **Timeouts & policy**
* Platform JSON path only: `/usr/share/sonic/device/{plat}/platform.json`; else **constants**.
* Typical production values used:
* `startup: 180s`, `shutdown: 180s` (≈ `graceful_wait 60s + power 120s`), `reboot: 120s`.
* **Graceful wait** (e.g., waiting for “Graceful shutdown complete”) is a **platform policy** and implemented inside platform `set_admin_state()`—not in ModuleBase.
* **Boot behavior**
* `chassisd` on start:
1. **Clears stale flags once** (centralized sweep).
2. Runs `set_initial_dpu_admin_state()` which **marks transitions** via ModuleBase before calling platform `set_admin_state()`.
3. Leaves clearing to the platform or to well-defined status transitions (ONLINE/OFFLINE) where appropriate.
* **gNOI shutdown daemon**
* Listens on **`CHASSIS_MODULE_TABLE`** and triggers only when:
* `state_transition_in_progress=True` **and** `transition_type=shutdown`.
* Never clears the flag (ownership stays with the platform).
* Bounded RPC timeouts and robust Redis access (swsssdk/swsscommon).
* **CLI (`config chassis modules …`)**
* Uses ModuleBase APIs for all set/get/timeout checks.
* If a previous transition is stuck, `is_module_state_transition_timed_out()` → auto-clear then proceed.
* Sets transition at the start of `startup`/`shutdown`; platform clears on completion.
* Fabric card flow retained; edits are surgical.
* **Redis robustness**
* Helpers handle both stacks (swsssdk/swsscommon); no `hset(mapping=...)` usage.
* Consistent HGETALL/HSET paths; resilient to connector differences.
* **Race reduction & consistency**
* Centralized writes prevent multi-writer races.
* All transition writes include `transition_start_time`; clears may add an end stamp.
* Existing PCI/file-lock logic left intact; unrelated behavior unchanged.
* **Change scope**
* Minimal, targeted diffs.
* No background tasks added, no broad refactors beyond transition handling.
* Behavior changes are limited to making transition semantics correct and uniform across repos.
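The ownership rules in the list above can be sketched end to end with an in-memory stand-in for STATE_DB. Everything here is illustrative: `FakeStateDb`, the `CHASSIS_MODULE_TABLE|<name>` key layout, the optional `now` argument, and the `gnoi_should_trigger()` and `cli_begin()` helpers are assumptions made for the sketch. Only the four helper names and the field names are taken from the description, and the real helpers are methods on `ModuleBase` rather than free functions.

```python
from datetime import datetime, timedelta, timezone

TABLE = "CHASSIS_MODULE_TABLE"


class FakeStateDb:
    """In-memory stand-in for a Redis hash store (hypothetical)."""

    def __init__(self):
        self.rows = {}

    def hset(self, key, field, value):
        self.rows.setdefault(key, {})[field] = value

    def hgetall(self, key):
        return dict(self.rows.get(key, {}))


def _key(name):
    return f"{TABLE}|{name}"  # key layout is an assumption


def set_module_state_transition(db, name, transition_type, now=None):
    now = now or datetime.now(timezone.utc)
    db.hset(_key(name), "state_transition_in_progress", "True")
    db.hset(_key(name), "transition_type", transition_type)
    db.hset(_key(name), "transition_start_time", now.isoformat())


def clear_module_state_transition(db, name):
    db.hset(_key(name), "state_transition_in_progress", "False")


def get_module_state_transition(db, name):
    return db.hgetall(_key(name))


def is_module_state_transition_timed_out(db, name, timeout_secs, now=None):
    entry = get_module_state_transition(db, name)
    if entry.get("state_transition_in_progress") != "True":
        return False  # nothing in progress, so nothing can be timed out
    start = datetime.fromisoformat(entry["transition_start_time"])
    now = now or datetime.now(timezone.utc)
    return (now - start).total_seconds() > timeout_secs


def gnoi_should_trigger(entry):
    """Daemon-side predicate: act only on an in-progress shutdown."""
    return (entry.get("state_transition_in_progress") == "True"
            and entry.get("transition_type") == "shutdown")


def cli_begin(db, name, op, timeout_secs, now=None):
    """CLI-side start: pre-clear only a prior transition that timed out."""
    entry = get_module_state_transition(db, name)
    if entry.get("state_transition_in_progress") == "True":
        if not is_module_state_transition_timed_out(db, name, timeout_secs, now):
            return False  # a live transition is still running
        clear_module_state_transition(db, name)
    set_module_state_transition(db, name, op, now)
    return True


db = FakeStateDb()
t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(cli_begin(db, "DPU0", "shutdown", 180, now=t0))                          # True
print(gnoi_should_trigger(get_module_state_transition(db, "DPU0")))            # True
print(cli_begin(db, "DPU0", "startup", 180, now=t0 + timedelta(seconds=10)))   # False
print(cli_begin(db, "DPU0", "startup", 180, now=t0 + timedelta(seconds=200)))  # True
```

In this sketch the daemon predicate never writes to the table, which matches the stated rule that clearing the flag belongs to the platform.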
HLD: sonic-net/SONiC#1991
sonic-platform-common: sonic-net/sonic-platform-common#567
sonic-utilities: sonic-net/sonic-utilities#4031
sonic-platform-daemons: sonic-net/sonic-platform-daemons#667
# How to verify it
Issue the `config chassis modules shutdown DPUx` command.
Verify that the DPU module shut down gracefully by checking the logs in `/var/log/syslog` on both the NPU and the DPU.
…ransition handling
#### Description
HLD: https://github.com/sonic-net/SONiC/blob/master/doc/smart-switch/graceful-shutdown/graceful-shutdown.md
These changes build upon enhancements in [`sonic-platform-common#567`](sonic-net#567)
This change introduces enhancements to the `ModuleBase` class to support graceful shutdown and startup operations for DPU and other module types.
It adds new methods and transition handling logic to ensure platform modules follow an ordered and coordinated shutdown/startup procedure, minimizing hardware inconsistencies and transient errors during reboot or DPU detachment.
Key changes include:
Added transition management APIs:
```
set_module_state_transition()
get_module_state_transition()
clear_module_state_transition()
```
Introduced graceful lifecycle handling:
- `_graceful_shutdown_handler()`, which waits for the external transition to complete by watching the `gnoi_halt_in_progress` field, with timeout handling
- Database-backed transition tracking (CHASSIS_MODULE_TABLE)
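The wait described for `_graceful_shutdown_handler()` can be sketched as a bounded polling loop. This is not the actual handler: the function name `wait_for_halt_completion`, the `read_field` callback, the convention that the string `"True"` means the halt is still in progress, and the injectable `clock` and `sleep` parameters are all assumptions for illustration.

```python
import time


def wait_for_halt_completion(read_field, timeout_secs, poll_secs=1.0,
                             clock=time.monotonic, sleep=time.sleep):
    """Poll until the in-progress field clears or the timeout expires.

    read_field: callable returning the current value of the
    gnoi_halt_in_progress field as a string (assumed "True" while pending).
    Returns True on completion, False on timeout.
    """
    deadline = clock() + timeout_secs
    while True:
        if read_field() != "True":
            return True   # the external side reported completion
        if clock() >= deadline:
            return False  # give up; the caller decides how to proceed
        sleep(poll_secs)


# Demo with canned field values and no real sleeping.
values = iter(["True", "True", "False"])
print(wait_for_halt_completion(lambda: next(values), timeout_secs=60,
                               sleep=lambda _: None))  # True
```

Using a monotonic clock for the deadline keeps the wait unaffected by wall-clock adjustments during the shutdown.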
Added helper functions and safeguards:
- File-based operation locks to ensure concurrency safety during transitions
- Caching of the transition timeout configuration read from platform.json
- Robust error handling and logging to prevent partial updates in Redis DB
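As an illustration of the timeout caching mentioned in the list above, the sketch below reads overrides from a JSON file once and falls back to constants otherwise. The section name `transition_timeouts`, the function name, and the module-level cache are hypothetical; the fallback numbers are the typical values quoted in the PR description, not confirmed defaults.

```python
import json

# Fallback constants for the sketch (typical values from the description).
DEFAULT_TIMEOUTS = {"startup": 180, "shutdown": 180, "reboot": 120}
_cache = {}


def load_transition_timeouts(platform_json_path, section="transition_timeouts"):
    """Return per-operation timeouts, reading the file at most once per path."""
    if platform_json_path in _cache:
        return _cache[platform_json_path]
    timeouts = dict(DEFAULT_TIMEOUTS)
    try:
        with open(platform_json_path) as f:
            overrides = json.load(f).get(section, {})
        merged = dict(timeouts)
        for op in DEFAULT_TIMEOUTS:
            if op in overrides:
                merged[op] = int(overrides[op])
        timeouts = merged
    except (OSError, ValueError, AttributeError, TypeError):
        # Missing or malformed file: keep the constants untouched.
        pass
    _cache[platform_json_path] = timeouts
    return timeouts


print(load_transition_timeouts("/nonexistent/platform.json"))
# {'startup': 180, 'shutdown': 180, 'reboot': 120}
```

Merging into a copy before assigning means a bad value in the file cannot leave the result half overridden.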
#### Motivation and Context
This enhancement is part of the SmartSwitch / DPU graceful shutdown/reboot and state management effort.
Currently, `ModuleBase` lacks lifecycle orchestration methods for safe shutdown or startup of DPUs and peripheral modules.
By adding transition-aware handling, the system can:
- Avoid race conditions between platform daemons during reboot/shutdown
- Ensure state transitions are reflected in Redis (CHASSIS_MODULE_TABLE)
- Support controlled detach/reattach of PCIe devices and sensor configuration reloads
- Enable PMON daemons to coordinate module-level transitions consistently
This work aligns with SONiC’s graceful reboot framework and the upcoming DPU lifecycle enhancements tracked internally.
#### How Has This Been Tested?
Testing performed on both SmartSwitch (DPU-enabled) and non-DPU platforms:
- ✅ Unit tests added under tests/test_module_base.py covering:
  - Transition management (set/get/clear)
  - Timeout behavior and concurrency lock handling
  - PCIe detach/reattach and sensor config updates
  - Graceful shutdown/startup flows (set_admin_state_gracefully)
- ✅ Verified Redis DB updates for transition keys under CHASSIS_MODULE_TABLE
- ✅ Simulated shutdown and startup sequences:
  - module_pre_shutdown() → safely detaches PCIe and updates state
  - module_post_startup() → rescans PCIe and restores sensor configuration
- ✅ Regression-tested existing platform daemons to ensure backward compatibility
Provide support for SmartSwitch DPU module graceful shutdown.
HLD: # 1991 sonic-net/SONiC#1991
sonic-host-services: #255 sonic-net/sonic-host-services#255
sonic-platform-daemons: sonic-net/sonic-platform-daemons#667
sonic-utilities: sonic-net/sonic-utilities#4031
How Has This Been Tested?
Issue the "config chassis modules shutdown DPUx" command.
Verify that the DPU module shut down gracefully by checking the logs in /var/log/syslog on both the NPU and the DPU.