Increase Orch CPU utilization timeout before link flap #16187

Status: Open. 3 commits to be merged into base branch master.

arista-hpandya
Contributor

@arista-hpandya arista-hpandya commented Dec 20, 2024

This change is needed because, on a modular chassis with multi-ASIC LCs, the link flap test may run on the uplink LC and then on the downlink LC. Because the uplink has many neighbors, the downlink CPU is still busy re-routing paths after the uplink interfaces are flapped. In that scenario the downlink LC is still hot (Orchagent CPU utilization above 10%) when we are about to flap its interfaces, so the pre-flap CPU check needs a longer timeout.

We tested with a timeout of 500 and it failed, so we are increasing it to 600, which has been passing on our local T2 testbeds.
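
To illustrate the kind of check this timeout applies to, here is a minimal, self-contained sketch of polling a CPU reading until it drops below a threshold or a timeout expires. It is not the sonic-mgmt implementation: `wait_for_cpu_below`, `sample_cpu`, and the injectable `sleep`/`clock` parameters are hypothetical names chosen for this example.

```python
import time
from typing import Callable


def wait_for_cpu_below(
    sample_cpu: Callable[[], float],
    threshold: float,
    timeout: float,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll sample_cpu() until it reads below threshold or timeout elapses.

    All names here are hypothetical; sleep and clock are injectable so the
    helper can be exercised without real waiting.
    """
    deadline = clock() + timeout
    while True:
        # Success as soon as a single reading is under the threshold.
        if sample_cpu() < threshold:
            return True
        # Give up once the deadline has been reached.
        if clock() >= deadline:
            return False
        sleep(interval)
```

With a helper of this shape, a larger timeout only lengthens the worst case: a device whose CPU settles quickly still returns on the first reading below the threshold.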

Description of PR

Summary:
Fixes #16186

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405

Approach

What is the motivation for this PR?

To make sure that the timeout for the Orchagent CPU utilization check is large enough for the test to pass.

How did you do it?

Increased the timeout from 100 to 600 seconds.

How did you verify/test it?

Ran the test on a T2 testbed with a timeout of 600 (passed) and 500 (failed).

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@mssonicbld
Collaborator

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@wenyiz2021
Contributor

@arista-hpandya could you redefine the timeout in continuous link flap for T2?
Basically, leave the timeout at 100 sec for T0 and T1; we don't want to increase it for T0/T1.
For T2 we can increase it to 500 sec.

@mssonicbld
Collaborator

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@arista-hpandya
Contributor Author

> @arista-hpandya could you redefine the timeout in continuous link flap for T2? basically leave the timeout as 100sec for T0 and T1, we don't want to increase for t0/t1. for T2 we can increase to 500sec

Hi @wenyiz2021! Thanks for reviewing this. I have made the changes so that the timeout is increased only for T2 devices. Also, on a side note, happy new year!
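
A topology-dependent timeout of the kind discussed above might look roughly like the sketch below. This is only an illustration, not the code in this PR: the constant names, the `orch_cpu_timeout` function, and the assumption that the topology type is available as a string such as "t0", "t1", or "t2" are all hypothetical. The T2 value of 600 is taken from the PR description, where 500 was reported to fail.

```python
# Hypothetical constants: 100 s is the existing timeout kept for T0/T1,
# 600 s is the value the PR description reports as passing on T2.
DEFAULT_ORCH_CPU_TIMEOUT = 100
T2_ORCH_CPU_TIMEOUT = 600


def orch_cpu_timeout(topo_type: str) -> int:
    """Return the Orchagent CPU check timeout (seconds) for a topology type.

    topo_type is assumed to be a short string such as "t0", "t1", or "t2".
    Only T2 gets the longer timeout; everything else keeps the default.
    """
    if topo_type.strip().lower() == "t2":
        return T2_ORCH_CPU_TIMEOUT
    return DEFAULT_ORCH_CPU_TIMEOUT
```

Keeping the default for every non-T2 topology means T0 and T1 runs do not wait longer than before when the CPU never settles.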

@rlhui rlhui requested a review from liamkearney-msft January 3, 2025 04:56
Contributor

@liamkearney-msft liamkearney-msft left a comment

small comment, otherwise lgtm

Review comment on tests/platform_tests/link_flap/test_cont_link_flap.py (outdated, resolved)
@arlakshm
Contributor

arlakshm commented Jan 4, 2025

/Azp run Azure.sonic-mgmt

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@arista-hpandya
Contributor Author

/azpw run Azure.sonic-mgmt

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-mgmt

Azure Pipelines successfully started running 1 pipeline(s).

@arlakshm
Contributor

/AzurePipelines run Azure.sonic-mgmt

Azure Pipelines successfully started running 1 pipeline(s).

Development

Successfully merging this pull request may close these issues.

[Bug][202405]: Failed: Orch CPU utilization > orch cpu threshold 10 before link flap
5 participants