Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SmartSwitch] Extend reboot script for rebooting SmartSwitch #3566

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

vvolam
Copy link

@vvolam vvolam commented Oct 3, 2024

What I did

Extended reboot script for SmartSwitch cases to reboot entire SmartSwitch or a specific DPU

How I did it

Implemented changes according to https://github.com/sonic-net/SONiC/blob/605c3a56ac2717dbbb638433e7bb13054fc05a31/doc/smart-switch/reboot/reboot-hld.md

How to verify it

--Pending testing on SmartSwitch--

  • Verified the script on non-smart switch and didn't find any regressions. Also script throughs errors if any new smart switch related parameters are given by user.

Previous command output (if the output of a command-line utility has changed)

New command output (if the output of a command-line utility has changed)

scripts/reboot Outdated Show resolved Hide resolved
scripts/reboot Outdated Show resolved Hide resolved
scripts/reboot Outdated Show resolved Hide resolved
scripts/reboot Outdated Show resolved Hide resolved
debug "User requested rebooting device ${DPU_NAME} ..."

# Retrieve DPU IP and GNMI port
dpu_ip=$(get_dpu_ip "${DPU_NAME}")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both get_dpu_ip and get_gnmi_port do not return anything

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be because the code might not be present for this in sonic image. I will check on the possible ETA for the changes in that case. Thank you!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gpunathilell gnmi port changes are expected to be available by 11/15.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what I meant was the function call is not returning anything as we have no echo statements from the function(even if the gnmi ports are added manually the function doesn't return anything, and the value in dpu_ip is being overwritten with empty string)

scripts/reboot Outdated Show resolved Hide resolved
scripts/reboot Outdated Show resolved Hide resolved
@vvolam vvolam changed the title Extend reboot script for rebooting SmartSwitch [SmartSwitch] Extend reboot script for rebooting SmartSwitch Nov 4, 2024
@vvolam vvolam marked this pull request as ready for review November 4, 2024 19:39
# Load the platform chassis if not already loaded
load_platform_chassis()

if not is_smartswitch():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vvolam rename to reboot_dpu() since you don't intend to support any other module?

return False


def main():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vvolam main() function in helper?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor This is to write unit tests for main function as well for maximum code coverage, else I couldn't call in unit tests.

return True


def is_dpu():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vvolam can we add this

def is_dpu(db):
to some common library in sonic-utilities and use that?

Comment on lines +188 to 317
echo "Error: Failed to send reboot status command to DPU ${DPU_NAME}"
exit ${EXIT_ERROR}
fi
debug "$reboot_status"
}

# Function to retrieve DPU bus info from platform JSON
function get_dpu_bus_info() {
local DPU_NAME=$1
DPU_BUS_INFO=$(jq -r --arg DPU_NAME "${DPU_NAME}" '.DPUS[] | select(has($DPU_NAME)) | .[$DPU_NAME].bus_info' "$PLATFORM_JSON_PATH")
if [ -z "$DPU_BUS_INFO" ]; then
echo "Error: bus_info not found for DPU ${DPU_NAME}"
exit ${EXIT_ERROR}
fi
debug "$DPU_NAME : $DPU_BUS_INFO"
}

#Function to detach PCI module
function pci_detach_module() {
local DPU_NAME=$1
local DPU_BUS_INFO=$2
status = $(python3 -c "import reboot_helper; reboot_helper.pci_detach_module('${DPU_NAME}')")
if [ -z "$status" ] || [ "$status" = "false" ]; then
echo 1 > /sys/bus/pci/devices/${DPU_BUS_INFO}/remove
fi
}

# Function to reboot the platform module
function reboot_platform_module() {
local DPU_NAME=$1
reboot_status=$(python3 -c "import reboot_helper; reboot_helper.reboot_module('${DPU_NAME}')")
if [ -z "$reboot_status" ] || [ "$reboot_status" = "false" ]; then
echo "Error: Failed to reboot the platform"
exit ${EXIT_ERROR}
fi
}

function reboot_dpu_module()
{
local DPU_NAME=$1
local DPU_INDEX=${DPU_NAME//[!0-9]/}

debug "User requested rebooting device ${DPU_NAME} ..."

# Retrieve DPU IP and GNMI port
dpu_ip=$(get_dpu_ip "${DPU_NAME}")
port=$(get_gnmi_port "${DPU_NAME}")

if [ -z "$dpu_ip" ] || [ -z "$port" ]; then
echo "Error: Failed to retrieve DPU IP or GNMI port for ${DPU_NAME}"
exit ${EXIT_ERROR}
fi

# Issue GNOI client command to reboot the DPU
docker exec -i gnmi gnoi_client -target ${dpu_ip}:${port} -logtostderr -insecure -rpc Reboot -jsonin '{"method":3}'
if [ $? -ne 0 ]; then
echo "Error: Failed to send reboot command to DPU ${DPU_NAME}"
exit ${EXIT_ERROR}
fi

# Retrieve dpu_halt_services_timeout value using jq
dpu_halt_services_timeout=$(jq -r '.dpu_halt_services_timeout' "$PLATFORM_JSON_PATH" 2>/dev/null)
if [ $? -ne 0 ]; then
echo "Error: Failed to retrieve dpu_halt_services_timeout from ${PLATFORM_JSON_PATH}"
exit ${EXIT_ERROR}
fi

# Poll on reboot status response with a timeout mechanism
poll_interval=5
waited_time=0
while true; do
reboot_status=$(get_reboot_status "${dpu_ip}" "${port}")
debug "GNOI RebootStatus response ${reboot_status}"
is_reboot_active=$(echo "$reboot_status" | grep "active" | awk '{print $2}')
if [ "$is_reboot_active" == "false" ]; then
break
fi

sleep "$poll_interval"
waited_time=$((waited_time + poll_interval))
if [ $waited_time -ge $dpu_halt_services_timeout ]; then
echo "Error: Timeout waiting for DPU ${DPU_NAME} to finish rebooting"
exit ${EXIT_ERROR}
fi
done

# Check if DPU exists and retrieve bus info
DPU_BUS_INFO=$(get_dpu_bus_info "${DPU_NAME}")

# Update STATE_DB and handle PCIe removal and rescan
sonic-db-cli state_db set "PCIE_DETACH_INFO|${DPU_NAME}" '{"dpu_id": "'${DPU_INDEX}'", "dpu_state": "detaching", "bus_info": "'${DPU_BUS_INFO}'"}'

pci_detach_module "${DPU_NAME}" "${DPU_BUS_INFO}"
reboot_platform_module "${DPU_NAME}"
echo 1 > /sys/bus/pci/rescan

sonic-db-cli state_db del "PCIE_DETACH_INFO|${DPU_NAME}"
}

function parse_options()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vvolam can we move these helper methods/functions to another file and import/include them?

Comment on lines +286 to +303
poll_interval=5
waited_time=0
while true; do
reboot_status=$(get_reboot_status "${dpu_ip}" "${port}")
debug "GNOI RebootStatus response ${reboot_status}"
is_reboot_active=$(echo "$reboot_status" | grep "active" | awk '{print $2}')
if [ "$is_reboot_active" == "false" ]; then
break
fi

sleep "$poll_interval"
waited_time=$((waited_time + poll_interval))
if [ $waited_time -ge $dpu_halt_services_timeout ]; then
echo "Error: Timeout waiting for DPU ${DPU_NAME} to finish rebooting"
exit ${EXIT_ERROR}
fi
done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vvolam may be put into a function wait_for_dpu_reboot_status() ?

Comment on lines +181 to +188
try:
platform_chassis.pci_detach(module_name)
log.log_info("Detach command sent for module {}".format(module_name))
return True
except NotImplementedError:
raise NotImplementedError("PCI detach not implemented on this platform: {type(e).__name__}")
except Exception as e:
log.log_error("Unexpected error occurred while detaching module {}: {}".format(module_name, e))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vvolam we agreed that pci detach/attach need not be platform specific?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor yes, I am not using attach method but we planned to support detach method in future. But Gagan from NVIDIA mentioned they are doing something special in this PCI detach method and provided comment to modify accordingly.

@gpunathilell @oleksandrivantsiv FYI.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vvolam, @prgeor in the HLD, we defined an API for attaching and detaching the DPU PCI, which was added to the base class in PR #504 of the sonic-platform-common repository. Why aren’t we using it in the implementation? Our agreement was that if the vendor implemented the API, we would use it; otherwise, we would fall back on the default attach/detach implementation.

fi
fi

is_dpu=$(python3 -c "import reboot_helper; reboot_helper.is_dpu()")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vvolam as mentioned can we move this to a common lib inside sonic-utitilies for future reuse?

if config_data is None:
return False

return any(re.search(r'\.DPU$', key) for key in config_data.keys())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return 'DPU' in config_data will do the same, but will be more efficient.

return True


def is_dpu():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we add this to the device_info.py file (in the same way as is_smartswitch(): https://github.com/sonic-net/sonic-buildimage/blob/master/src/sonic-py-common/sonic_py_common/device_info.py#L579) to prevent code duplication. It will allow to reuse it in many places.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I will do this change first and then refactor the code. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants