-
Notifications
You must be signed in to change notification settings - Fork 656
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SmartSwitch] Extend reboot script for rebooting SmartSwitch #3566
base: master
Are you sure you want to change the base?
Conversation
460146c
to
c72fbc0
Compare
8746356
to
d6fc624
Compare
debug "User requested rebooting device ${DPU_NAME} ..." | ||
|
||
# Retrieve DPU IP and GNMI port | ||
dpu_ip=$(get_dpu_ip "${DPU_NAME}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both get_dpu_ip
and get_gnmi_port
do not return anything
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be because the code might not be present for this in sonic image. I will check on the possible ETA for the changes in that case. Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gpunathilell gnmi port changes are expected to be available by 11/15.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what I meant was the function call is not returning anything as we have no echo statements from the function(even if the gnmi ports are added manually the function doesn't return anything, and the value in dpu_ip
is being overwritten with empty string)
# Load the platform chassis if not already loaded | ||
load_platform_chassis() | ||
|
||
if not is_smartswitch(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vvolam rename to reboot_dpu() since you don't intend to support any other module?
return False | ||
|
||
|
||
def main(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vvolam main() function in helper?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor This is to write unit tests for main function as well for maximum code coverage, else I couldn't call in unit tests.
return True | ||
|
||
|
||
def is_dpu(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vvolam can we add this
sonic-utilities/counterpoll/main.py
Line 20 in 7cbcfda
def is_dpu(db): |
echo "Error: Failed to send reboot status command to DPU ${DPU_NAME}" | ||
exit ${EXIT_ERROR} | ||
fi | ||
debug "$reboot_status" | ||
} | ||
|
||
# Function to retrieve DPU bus info from platform JSON | ||
function get_dpu_bus_info() { | ||
local DPU_NAME=$1 | ||
DPU_BUS_INFO=$(jq -r --arg DPU_NAME "${DPU_NAME}" '.DPUS[] | select(has($DPU_NAME)) | .[$DPU_NAME].bus_info' "$PLATFORM_JSON_PATH") | ||
if [ -z "$DPU_BUS_INFO" ]; then | ||
echo "Error: bus_info not found for DPU ${DPU_NAME}" | ||
exit ${EXIT_ERROR} | ||
fi | ||
debug "$DPU_NAME : $DPU_BUS_INFO" | ||
} | ||
|
||
#Function to detach PCI module | ||
function pci_detach_module() { | ||
local DPU_NAME=$1 | ||
local DPU_BUS_INFO=$2 | ||
status = $(python3 -c "import reboot_helper; reboot_helper.pci_detach_module('${DPU_NAME}')") | ||
if [ -z "$status" ] || [ "$status" = "false" ]; then | ||
echo 1 > /sys/bus/pci/devices/${DPU_BUS_INFO}/remove | ||
fi | ||
} | ||
|
||
# Function to reboot the platform module | ||
function reboot_platform_module() { | ||
local DPU_NAME=$1 | ||
reboot_status=$(python3 -c "import reboot_helper; reboot_helper.reboot_module('${DPU_NAME}')") | ||
if [ -z "$reboot_status" ] || [ "$reboot_status" = "false" ]; then | ||
echo "Error: Failed to reboot the platform" | ||
exit ${EXIT_ERROR} | ||
fi | ||
} | ||
|
||
function reboot_dpu_module() | ||
{ | ||
local DPU_NAME=$1 | ||
local DPU_INDEX=${DPU_NAME//[!0-9]/} | ||
|
||
debug "User requested rebooting device ${DPU_NAME} ..." | ||
|
||
# Retrieve DPU IP and GNMI port | ||
dpu_ip=$(get_dpu_ip "${DPU_NAME}") | ||
port=$(get_gnmi_port "${DPU_NAME}") | ||
|
||
if [ -z "$dpu_ip" ] || [ -z "$port" ]; then | ||
echo "Error: Failed to retrieve DPU IP or GNMI port for ${DPU_NAME}" | ||
exit ${EXIT_ERROR} | ||
fi | ||
|
||
# Issue GNOI client command to reboot the DPU | ||
docker exec -i gnmi gnoi_client -target ${dpu_ip}:${port} -logtostderr -insecure -rpc Reboot -jsonin '{"method":3}' | ||
if [ $? -ne 0 ]; then | ||
echo "Error: Failed to send reboot command to DPU ${DPU_NAME}" | ||
exit ${EXIT_ERROR} | ||
fi | ||
|
||
# Retrieve dpu_halt_services_timeout value using jq | ||
dpu_halt_services_timeout=$(jq -r '.dpu_halt_services_timeout' "$PLATFORM_JSON_PATH" 2>/dev/null) | ||
if [ $? -ne 0 ]; then | ||
echo "Error: Failed to retrieve dpu_halt_services_timeout from ${PLATFORM_JSON_PATH}" | ||
exit ${EXIT_ERROR} | ||
fi | ||
|
||
# Poll on reboot status response with a timeout mechanism | ||
poll_interval=5 | ||
waited_time=0 | ||
while true; do | ||
reboot_status=$(get_reboot_status "${dpu_ip}" "${port}") | ||
debug "GNOI RebootStatus response ${reboot_status}" | ||
is_reboot_active=$(echo "$reboot_status" | grep "active" | awk '{print $2}') | ||
if [ "$is_reboot_active" == "false" ]; then | ||
break | ||
fi | ||
|
||
sleep "$poll_interval" | ||
waited_time=$((waited_time + poll_interval)) | ||
if [ $waited_time -ge $dpu_halt_services_timeout ]; then | ||
echo "Error: Timeout waiting for DPU ${DPU_NAME} to finish rebooting" | ||
exit ${EXIT_ERROR} | ||
fi | ||
done | ||
|
||
# Check if DPU exists and retrieve bus info | ||
DPU_BUS_INFO=$(get_dpu_bus_info "${DPU_NAME}") | ||
|
||
# Update STATE_DB and handle PCIe removal and rescan | ||
sonic-db-cli state_db set "PCIE_DETACH_INFO|${DPU_NAME}" '{"dpu_id": "'${DPU_INDEX}'", "dpu_state": "detaching", "bus_info": "'${DPU_BUS_INFO}'"}' | ||
|
||
pci_detach_module "${DPU_NAME}" "${DPU_BUS_INFO}" | ||
reboot_platform_module "${DPU_NAME}" | ||
echo 1 > /sys/bus/pci/rescan | ||
|
||
sonic-db-cli state_db del "PCIE_DETACH_INFO|${DPU_NAME}" | ||
} | ||
|
||
function parse_options() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vvolam can we move these helper methods/functions to another file and import/include them?
poll_interval=5 | ||
waited_time=0 | ||
while true; do | ||
reboot_status=$(get_reboot_status "${dpu_ip}" "${port}") | ||
debug "GNOI RebootStatus response ${reboot_status}" | ||
is_reboot_active=$(echo "$reboot_status" | grep "active" | awk '{print $2}') | ||
if [ "$is_reboot_active" == "false" ]; then | ||
break | ||
fi | ||
|
||
sleep "$poll_interval" | ||
waited_time=$((waited_time + poll_interval)) | ||
if [ $waited_time -ge $dpu_halt_services_timeout ]; then | ||
echo "Error: Timeout waiting for DPU ${DPU_NAME} to finish rebooting" | ||
exit ${EXIT_ERROR} | ||
fi | ||
done | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vvolam may be put into a function wait_for_dpu_reboot_status()
?
try: | ||
platform_chassis.pci_detach(module_name) | ||
log.log_info("Detach command sent for module {}".format(module_name)) | ||
return True | ||
except NotImplementedError: | ||
raise NotImplementedError("PCI detach not implemented on this platform: {type(e).__name__}") | ||
except Exception as e: | ||
log.log_error("Unexpected error occurred while detaching module {}: {}".format(module_name, e)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vvolam we agreed that pci detach/attach need not be platform specific?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor yes, I am not using attach method but we planned to support detach method in future. But Gagan from NVIDIA mentioned they are doing something special in this PCI detach method and provided comment to modify accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vvolam, @prgeor in the HLD, we defined an API for attaching and detaching the DPU PCI, which was added to the base class in PR #504 of the sonic-platform-common repository. Why aren’t we using it in the implementation? Our agreement was that if the vendor implemented the API, we would use it; otherwise, we would fall back on the default attach/detach implementation.
fi | ||
fi | ||
|
||
is_dpu=$(python3 -c "import reboot_helper; reboot_helper.is_dpu()") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vvolam as mentioned can we move this to a common lib inside sonic-utitilies for future reuse?
if config_data is None: | ||
return False | ||
|
||
return any(re.search(r'\.DPU$', key) for key in config_data.keys()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return 'DPU' in config_data
will do the same, but will be more efficient.
return True | ||
|
||
|
||
def is_dpu(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we add this to the device_info.py file (in the same way as is_smartswitch(): https://github.com/sonic-net/sonic-buildimage/blob/master/src/sonic-py-common/sonic_py_common/device_info.py#L579) to prevent code duplication. It will allow to reuse it in many places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I will do this change first and then refactor the code. Thanks!
What I did
Extended reboot script for SmartSwitch cases to reboot entire SmartSwitch or a specific DPU
How I did it
Implemented changes according to https://github.com/sonic-net/SONiC/blob/605c3a56ac2717dbbb638433e7bb13054fc05a31/doc/smart-switch/reboot/reboot-hld.md
How to verify it
--Pending testing on SmartSwitch--
Previous command output (if the output of a command-line utility has changed)
New command output (if the output of a command-line utility has changed)