Skip to content

Set DOM polling interval based on the time at the start of the loop instead of the end#757

Open
aditya-nexthop wants to merge 1 commit intosonic-net:masterfrom
nexthop-ai:aditya-polling-interval-fix
Open

Set DOM polling interval based on the time at the start of the loop instead of the end#757
aditya-nexthop wants to merge 1 commit intosonic-net:masterfrom
nexthop-ai:aditya-polling-interval-fix

Conversation

@aditya-nexthop
Copy link
Contributor

Description

During profiling we found that there was a significant period of time when transceiver information was not being updated.
Upon study, it was seen that for DOM_INFO_UPDATE_PERIOD_SECS after the last transceiver in the loop was polled, there were no updates.

Motivation and Context

fixes #756
This improves DOM polling performance of transceivers in xcvrd and correctly ensures that DOM_INFO_UPDATE_PERIOD_SECS time elapses between updating a specific transceiver.

How Has This Been Tested?

We did a CPU profiling without and with the change over a 10 min window just after xcvrd starts
Before:
image
After:
image

The CPU is more uniformly utilized after the change instead of staying idle when more than DOM_INFO_UPDATE_PERIOD_SECS has already passed since a specific transceiver was polled.

Additional Information (Optional)

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@aditya-nexthop aditya-nexthop force-pushed the aditya-polling-interval-fix branch from b20422c to 1f21007 Compare February 20, 2026 02:03
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@aditya-nexthop aditya-nexthop force-pushed the aditya-polling-interval-fix branch from 1f21007 to c1e6d5f Compare February 20, 2026 02:04
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adjusts the xcvrd DOM monitoring loop scheduling so the next DOM polling window is calculated from the timestamp at the start of the loop iteration, instead of from the time after all ports have been processed. This aligns the effective polling period with DOM_INFO_UPDATE_PERIOD_SECS per the expectation in issue #756 and reduces idle gaps after a full polling pass.

Changes:

  • Capture now at the beginning of each DOM monitoring loop iteration.
  • Use that loop-start timestamp to determine whether an update is due and to compute next_periodic_db_update_time.

Comment on lines +275 to +276
now = datetime.datetime.now()
if next_periodic_db_update_time <= now:
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new scheduling logic depends on capturing now at the start of the loop and using it to compute next_periodic_db_update_time. There isn’t currently a unit test that asserts this behavior for a non-zero DOM_INFO_UPDATE_PERIOD_SECS (e.g., by mocking datetime.datetime.now() to simulate a long per-iteration processing time and verifying the next update time is based on the loop-start timestamp, not the loop-end timestamp). Adding such a test would help prevent regressions to the original timing bug.

Copilot generated this review using guidance from repository custom instructions.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aditya-nexthop do you want to add this test?

@prgeor prgeor requested a review from mihirpat1 March 10, 2026 22:29
Copy link
Contributor

@mihirpat1 mihirpat1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aditya-nexthop @prgeor This change looks correct for fixing the interval anchoring. However, a broader question: if each port takes ~1s to update DOM data and there are 64 ports, the loop takes ~64s per iteration. With DOM_INFO_UPDATE_PERIOD_SECS potentially shorter than that, are we okay with the loop running back-to-back with no sleep/yield? Could there be a concern about starving other tasks or excessive CPU usage in that scenario?

@aditya-nexthop
Copy link
Contributor Author

@aditya-nexthop @prgeor This change looks correct for fixing the interval anchoring. However, a broader question: if each port takes ~1s to update DOM data and there are 64 ports, the loop takes ~64s per iteration. With DOM_INFO_UPDATE_PERIOD_SECS potentially shorter than that, are we okay with the loop running back-to-back with no sleep/yield? Could there be a concern about starving other tasks or excessive CPU usage in that scenario?

Hi @mihirpat1, as per #758 the wait is primarily due to the time waiting for select to timeout.

We don't need to merge this PR if we merge #758 as #758 includes these changes.
I thought it useful to review the effect of this change as well as #758 independently to understand the impact.

I will close this PR once #758 merges.

@prgeor
Copy link
Collaborator

prgeor commented Mar 11, 2026

@aditya-nexthop @prgeor This change looks correct for fixing the interval anchoring. However, a broader question: if each port takes ~1s to update DOM data and there are 64 ports, the loop takes ~64s per iteration. With DOM_INFO_UPDATE_PERIOD_SECS potentially shorter than that, are we okay with the loop running back-to-back with no sleep/yield? Could there be a concern about starving other tasks or excessive CPU usage in that scenario?

@mihirpat1 in that case we may have to define DOM_INFO_UPDATE_PERIOD_SECS per platform basis? As the polling period of default 60 seconds does NOT work universally?

@mihirpat1
Copy link
Contributor

@aditya-nexthop @prgeor This change looks correct for fixing the interval anchoring. However, a broader question: if each port takes ~1s to update DOM data and there are 64 ports, the loop takes ~64s per iteration. With DOM_INFO_UPDATE_PERIOD_SECS potentially shorter than that, are we okay with the loop running back-to-back with no sleep/yield? Could there be a concern about starving other tasks or excessive CPU usage in that scenario?

@mihirpat1 in that case we may have to define DOM_INFO_UPDATE_PERIOD_SECS per platform basis? As the polling period of default 60 seconds does NOT work universally?

@prgeor That's correct - defining it per platform can help in addressing this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DomInfoUpdateTask polling period DOM_INFO_UPDATE_PERIOD_SECS improperly delays polling

5 participants