Skip to content

Fix rare timing issue in test_CmisManagerTask_task_worker_decommission #650

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

longhuan-cisco
Copy link
Contributor

@longhuan-cisco longhuan-cisco commented Jul 26, 2025

Description

If the CPU of build server is very fast, testcase test_CmisManagerTask_task_worker_decommission might have a chance to fail at

assert get_cmis_state_from_state_db('Ethernet0', task.xcvr_table_helper.get_status_sw_tbl(task.get_asic_id('Ethernet0'))) == CMIS_STATE_FAILED

The reason is:

  • Test code incorrectly assumes update_cmis_state_expiration_time(lport, duration_seconds=0.0) will set cmis_expired as current_time (i.e. current_time + 0.0, meaning expired immediately), however there's always a CMIS_EXPIRATION_BUFFER_MS (2ms) getting added on top of it.
    def update_cmis_state_expiration_time(self, lport, duration_seconds):
    """
    Set the CMIS expiration time for the given logical port
    in the port dictionary.
    Args:
    lport: Logical port name
    duration_seconds: Duration in seconds for the expiration
    """
    self.port_dict[lport]['cmis_expired'] = datetime.datetime.now() + \
    datetime.timedelta(seconds=duration_seconds) + \
    datetime.timedelta(milliseconds=self.CMIS_EXPIRATION_BUFFER_MS)

e.g. if current_time is 01:36:25.528, calling self.update_cmis_state_expiration_time(lport, 0.0), will set cmis_expired to 01:36:25.530, i.e. 2ms ahead of current time.

When testcase is testing timeout for 'ConfigSuccess' scenario:

elif state == CMIS_STATE_DP_INIT:
if not self.check_config_error(api, host_lanes_mask, ['ConfigSuccess']):
if self.is_timer_expired(expired):
self.log_notice("{}: timeout for 'ConfigSuccess', current ConfigStatus: "
"{}".format(lport, list(api.get_config_datapath_hostlane_status().values())))
self.force_cmis_reinit(lport, retries + 1)
continue

  • On my build server, this testcase always pass. It constantly takes more than 2ms to run through one while/for loop iteration, so after self.update_cmis_state_expiration_time(lport, 0.0), it costs only two while/loop to become is_timer_expired()==True reaching timeout for 'ConfigSuccess', in total it constantly costs only 17 while/for loop iterations to reach CMIS_STATE_FAILED after retries, in which case task_stopping_event.is_set = MagicMock(side_effect=[False]*50 + [True]*2) is more than enough.
  • But if the CPU of build server is very fast, it has a chance to be able to run through multiple while/for loop iterations within 2ms, costing more while/for loop iterations to reach timeout for 'ConfigSuccess', which might end up running out of task_stopping_event.is_set = MagicMock(side_effect=[False]*50 + [True]*2) too soon , not being able to even reach CMIS_STATE_FAILED, causing testcase assertion failure.

In below example, CPU is fast, within 2ms, it's able to run through 5 while/for loop iterations. In other words, it costs 5 while/for loop iterations to reach timeout for 'ConfigSuccess' after self.update_cmis_state_expiration_time(lport, 0.0):

2025-07-26 07:45:08.244422 CMIS: Ethernet0: DpDeinit duration 0.0 secs, modulePwrUp duration 0.0 secs
2025-07-26 07:45:08.244469 CMIS: update_cmis_state_expiration_time(0.0) set expired_time to 2025-07-26 07:45:08.246441, current_time 2025-07-26 07:45:08.244441
2025-07-26 07:45:08.244838 CMIS: Ethernet0: 100G, lanemask=0xff, CMIS state=AP_CONFIGURED(decommission*), ... retries=2
2025-07-26 07:45:08.245373 CMIS: Ethernet0: 100G, lanemask=0xff, CMIS state=DP_INIT(decommission*), ... retries=2
2025-07-26 07:45:08.245446 CMIS: is_timer_expired() = False
2025-07-26 07:45:08.245809 CMIS: Ethernet0: 100G, lanemask=0xff, CMIS state=DP_INIT(decommission*), ... retries=2
2025-07-26 07:45:08.245885 CMIS: is_timer_expired() = False
2025-07-26 07:45:08.246266 CMIS: Ethernet0: 100G, lanemask=0xff, CMIS state=DP_INIT(decommission*), ... retries=2
2025-07-26 07:45:08.246345 CMIS: is_timer_expired() = False
2025-07-26 07:45:08.246715 CMIS: Ethernet0: 100G, lanemask=0xff, CMIS state=DP_INIT(decommission*), ... retries=2
2025-07-26 07:45:08.246795 CMIS: is_timer_expired() = True
2025-07-26 07:45:08.246857 CMIS: Ethernet0: timeout for 'ConfigSuccess', current ConfigStatus: ['ConfigRejected', 'ConfigRejected', 'ConfigRejected', 'ConfigRejected', 'ConfigRejected', 'ConfigRejected', 'ConfigRejected', 'ConfigRejected']
2025-07-26 07:45:08.247248 CMIS: Ethernet0: 100G, lanemask=0xff, CMIS state=INSERTED(decommission*), ... retries=3

Motivation and Context

How Has This Been Tested?

Testcase passed

Additional Information (Optional)

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mihirpat1 mihirpat1 requested a review from Copilot July 27, 2025 00:18
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a rare timing issue in the test_CmisManagerTask_task_worker_decommission test that could cause failures on fast CPUs. The issue occurred because the test incorrectly assumed that setting a duration of 0.0 seconds would immediately expire a timer, but there's always a 2ms buffer added to expiration times.

  • Replaces duration-based timer mocking with direct timer expiration mocking
  • Removes three mock configurations for duration methods that are no longer needed
  • Adds a direct mock for is_timer_expired to return True immediately
Comments suppressed due to low confidence (1)

sonic-xcvrd/tests/test_xcvrd.py:3429

  • The mock always returns True, which may not adequately test the timer logic. Consider using a side_effect to simulate timer progression or test both expired and non-expired states.
        task.is_timer_expired = MagicMock(return_value=True)

@longhuan-cisco
Copy link
Contributor Author

longhuan-cisco commented Jul 27, 2025

Thanks @mihirpat1 for the quick review/approval.

Hi @prgeor, could you please review
Current test_CmisManagerTask_task_worker_decommission() has a low chance to fail if the CPU of build/CICD server is very fast. And it's a simple fix in test code, nothing wrong with xcvrd.py code.

cc @AnoopKamath

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants