@prgeor
Collaborator

@prgeor prgeor commented May 13, 2025

Description

Delay DOM polling until all ports are initialized

Motivation and Context

Platforms with a large radix, such as 512 ports, can take longer to initialize and complete the CMIS datapath state machine. DOM polling is I/O-bound and expensive, so it can contend with the CMIS manager task while that task is still initializing the ports.

How Has This Been Tested?

Tested on a platform with 512 100G ports with 800G DR8 optics; overall link-up time dropped by around 4 minutes.

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@prgeor prgeor requested a review from Junchao-Mellanox May 13, 2025 05:25

@Junchao-Mellanox
Collaborator

@moshemos @dgsudharsan for awareness

@mihirpat1 mihirpat1 requested a review from Copilot May 13, 2025 18:07

Copilot AI left a comment

Pull Request Overview

This PR delays DOM polling until all ports finish initialization, ensuring that the CMIS datapath state machine is complete before expensive polling begins.

  • Introduces a new function wait_port_initialization that loops until all logical ports are either initialized or removed from consideration.
  • Calls the new waiting function in the task_worker() method to delay DOM monitoring until all ports reach a terminal CMIS state.
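The "initialized or removed from consideration" check in the first bullet can be sketched roughly as below. This is a minimal illustration, not the PR's code: `TERMINAL_STATES`, `prune_initialized`, `get_state`, and `is_present` are hypothetical names, and the state strings are assumed rather than taken from xcvrd.

```python
# Assumed state names for illustration; the daemon's real constants may differ.
TERMINAL_STATES = {"READY", "FAILED", "REMOVED"}


def prune_initialized(pending, get_state, is_present):
    """Return the ports from `pending` that still need to be waited on.

    get_state(port)  -> current CMIS state string (hypothetical callback)
    is_present(port) -> True if a transceiver is plugged in (hypothetical callback)
    """
    still_pending = set()
    for port in pending:
        if not is_present(port):
            # No transceiver: nothing will ever initialize, so stop waiting.
            continue
        if get_state(port) in TERMINAL_STATES:
            # The state machine has finished for this port, successfully or not.
            continue
        still_pending.add(port)
    return still_pending
```

A caller would invoke this repeatedly on the remaining set until it comes back empty.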

dom_info_update_periodic_secs = self.DOM_INFO_UPDATE_PERIOD_SECS

# Wait for all PORTs to be initialized
self.wait_port_initialization(dom_info_update_periodic_secs)

Copilot AI May 13, 2025

[nitpick] Consider implementing a timeout mechanism in wait_port_initialization() to prevent the possibility of an infinite wait if a port never reaches a terminal state.

Suggested change
- self.wait_port_initialization(dom_info_update_periodic_secs)
+ self.wait_port_initialization(dom_info_update_periodic_secs, timeout=300)  # Timeout set to 5 minutes
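One way such a timeout could look is sketched here. It is an assumption-laden illustration rather than the PR's implementation: `wait_until_empty`, `refresh`, `poll_secs`, and `timeout_secs` are hypothetical names, and `time.monotonic()` is used instead of wall-clock time so that clock adjustments cannot stretch or shrink the deadline.

```python
import time


def wait_until_empty(pending, refresh, poll_secs, timeout_secs):
    """Poll until `pending` is empty or `timeout_secs` has elapsed.

    refresh(pending) -> set of ports that are still not initialized
    (hypothetical callback). Returns whatever is left when the wait ends,
    so an empty set means every port finished in time.
    """
    deadline = time.monotonic() + timeout_secs
    pending = set(pending)
    while pending:
        pending = refresh(pending)
        if not pending or time.monotonic() >= deadline:
            break
        time.sleep(poll_secs)
    # The return sits outside the loop so every exit path reports the remainder.
    return pending
```

Returning the leftover set, rather than a boolean, lets the caller log exactly which ports did not finish.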

Contributor

@mihirpat1 mihirpat1 left a comment

@prgeor Can you please help in fixing the build failure?

continue

physical_port = physical_port_list[0]
if not xcvrd._wrapper_get_presence(physical_port):
Contributor

@prgeor Can we instead call xcvrd_utils.get_transceiver_presence()?
Also, can we check if self.task_stopping_event.is_set()?

Wondering if you want to use _validate_and_get_physical_port instead?
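To illustrate the stop-event check requested above, here is a small sketch. It assumes `task_stopping_event` behaves like a standard `threading.Event`; `wait_or_stop` and `refresh` are hypothetical names. Using `Event.wait()` as the sleep makes the delay interruptible, so a shutdown request does not have to wait out a full polling interval.

```python
import threading


def wait_or_stop(stop_event, pending, refresh, poll_secs):
    """Poll until `pending` is empty or `stop_event` is set.

    refresh(pending) -> set of ports still initializing (hypothetical callback).
    """
    pending = set(pending)
    while pending and not stop_event.is_set():
        pending = refresh(pending)
        # Event.wait() returns True as soon as the event is set, which
        # turns the polling delay into an interruptible sleep.
        if pending and stop_event.wait(poll_secs):
            break
    return pending
```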


# Adding dom_info_update_periodic_secs to allow xcvrd to initialize ports
# before starting the periodic update
next_periodic_db_update_time = datetime.datetime.now() + datetime.timedelta(seconds=dom_info_update_periodic_secs)
Contributor

@prgeor Wondering if we should set is_periodic_db_update_needed to True since all the ports would have been in CMIS terminal state by this point?
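The two options in this question differ only in when the first periodic update fires. A rough sketch of that scheduling choice follows; `first_update_time` and `periodic_update_due` are hypothetical helpers written for illustration and are not part of xcvrd.

```python
import datetime


def first_update_time(now, period_secs, update_immediately):
    """Pick the time of the first periodic DB update after the wait ends."""
    if update_immediately:
        # Ports are already in a terminal state, so update right away.
        return now
    # Otherwise keep one full period of delay before the first update.
    return now + datetime.timedelta(seconds=period_secs)


def periodic_update_due(now, next_update_time, period_secs):
    """Return (is_due, next_update_time) for one pass of the monitoring loop."""
    if now >= next_update_time:
        return True, now + datetime.timedelta(seconds=period_secs)
    return False, next_update_time
```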

def wait_port_initialization(self, delay):
    logical_port_set = set(self.port_mapping.logical_port_list)

    while logical_port_set:
Contributor

@prgeor Can we add an upper-bound timeout here to ensure that we don't end up in an infinite loop (ideally, this should not occur)?

@mihirpat1 mihirpat1 requested a review from Copilot May 29, 2025 20:34

Copilot AI left a comment

Pull Request Overview

This PR delays DOM polling until all ports are initialized, aiming to reduce contention during port initialization.

  • Introduces a waiting function to poll for port initialization using defined timeout and polling constants.
  • Updates task_worker to log an error if ports remain uninitialized past the timeout, or to start DOM monitoring once all ports are ready.
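The caller-side branch described in the second bullet might look something like this. It is a sketch under stated assumptions: `report_wait_result` is a hypothetical helper, the error text is invented for illustration, and the standard `logging` module stands in for the daemon's own `log_error`/`log_notice` helpers (stdlib `logging` has no NOTICE level, so INFO is used).

```python
import logging

logger = logging.getLogger("dom_wait_sketch")


def report_wait_result(remaining_ports):
    """Log the outcome of the initialization wait and return the message."""
    if remaining_ports:
        # Some ports never reached a terminal state before the timeout.
        msg = "Ports not in CMIS terminal state after timeout: {}".format(
            sorted(remaining_ports))
        logger.error(msg)
    else:
        msg = "All ports are in CMIS terminal state, start DOM monitoring"
        logger.info(msg)
    return msg
```

In either case DOM monitoring proceeds afterwards; the log line only records whether the wait ended cleanly.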

if datetime.datetime.now() > dom_wait_time_end:
    break

return logical_port_set

Copilot AI May 29, 2025

The return statement is placed inside the while loop, causing an early exit after the first iteration. Consider moving the return statement outside the loop to allow full polling until either a timeout occurs or all ports are initialized.

Suggested change
-         return logical_port_set
+     return logical_port_set
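The effect of the indentation difference can be reproduced with a small stand-alone sketch. The names `drain_buggy`, `drain_fixed`, and `refresh` are hypothetical, and a poll counter replaces the wall-clock deadline so the behavior is deterministic.

```python
def drain_buggy(pending, refresh, max_polls):
    """Return placed inside the loop: exits after the first poll."""
    pending = set(pending)
    polls = 0
    while pending:
        pending = refresh(pending)
        polls += 1
        if polls >= max_polls:
            break
        return pending  # still inside the while body
    # Falls off the end here, so a break (or an empty input) yields None.


def drain_fixed(pending, refresh, max_polls):
    """Return placed after the loop: polls until empty or out of attempts."""
    pending = set(pending)
    polls = 0
    while pending:
        pending = refresh(pending)
        polls += 1
        if polls >= max_polls:
            break
    return pending
```

With a `refresh` that clears one port per call, the first version hands back a non-empty set after a single poll, while the second keeps polling until the set is empty or the attempt limit is reached.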

Contributor

@prgeor Can you please address this?

Contributor

@mihirpat1 mihirpat1 left a comment

@prgeor Can you please fix the build failure?

self.log_notice("Stop event generated during DOM monitoring loop")
break

if not is_periodic_db_update_needed:
Contributor

@prgeor Why is this removed?


# Start loop to update dom info in DB periodically and handle port change events
while not self.task_stopping_event.is_set():
    # Check if periodic db update is needed
Contributor

@prgeor Why is this removed?

else:
    self.log_notice("All ports are in CMIS terminal state, start DOM monitoring")

# Start loop to update dom info in DB periodically and handle port change events
Contributor

@prgeor Please add is_periodic_db_update_needed = True to ensure that DOM data is updated periodically
