@saiarcot895
Contributor

What I did

teamd can start up, create the port channel interface in the kernel, and then exit later, for example because it hit a failure condition or because initialization took too long and the parent process killed it. When this happens, teamsyncd receives a notification from the kernel saying that the port channel interface has been created and starts adding it to STATE_DB. If this happens before teamd goes down (and the port channel interface is removed from the kernel), anything that depends on that STATE_DB entry begins its processing without realizing that the interface is about to be removed. Dependent applications then believe that everything has been processed and that configs have been applied when they have not. (In the case of LAGs, this is intfmgrd adding the IP address to the interface.)

This is functionally a race condition between teamd creating and then deleting the interface (due to a failure condition), teamsyncd acting too fast, and dependent applications assuming all setup is complete. This race condition is more visible on weaker systems.

Therefore, to try to prevent it, before adding an entry to STATE_DB, teamsyncd now makes sure that it can get information from the kernel about the port channel interface, and then connects directly to teamd to make sure that teamd is processing requests. If both of these succeed, it can be assumed that all setup is done and that teamd won't be exiting soon.
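The two-step check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `LagProbes`, `checkLagReadiness`, and `publishIfReady` are hypothetical names, and the kernel lookup and the teamd request are injected as callables so that the sketch runs without libteam or libteamdctl.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical probes (not part of teamsyncd). In the real daemon these
// would wrap a kernel query for the interface and a request sent to teamd.
struct LagProbes {
    std::function<bool(const std::string&)> kernelHasInterface;
    std::function<bool(const std::string&)> teamdAnswers;
};

enum class LagReadiness { Ready, NoKernelInterface, TeamdUnresponsive };

// Run both checks in order: the kernel must know the interface, and
// teamd must answer a request for it.
LagReadiness checkLagReadiness(const LagProbes& probes, const std::string& lagName) {
    if (!probes.kernelHasInterface(lagName)) {
        return LagReadiness::NoKernelInterface;
    }
    if (!probes.teamdAnswers(lagName)) {
        return LagReadiness::TeamdUnresponsive;
    }
    return LagReadiness::Ready;
}

// Write the STATE_DB entry only when both checks pass.
// Returns true if the entry was written.
bool publishIfReady(const LagProbes& probes,
                    const std::string& lagName,
                    const std::function<void(const std::string&)>& writeStateDb) {
    if (checkLagReadiness(probes, lagName) != LagReadiness::Ready) {
        return false;  // leave STATE_DB untouched; dependents keep waiting
    }
    writeStateDb(lagName);
    return true;
}
```

With probes that report a live kernel interface but an unresponsive teamd, `publishIfReady` returns false and the writer is never called, which is the case where dependents such as intfmgrd should not start processing.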

Why I did it

How I verified it

Perform several config reloads on weaker hardware to make sure there are no cases where the LAG entry is added to STATE_DB and teamd exits afterwards.

Details if related

Copilot AI review requested due to automatic review settings November 10, 2025 22:47
Copilot finished reviewing on behalf of saiarcot895 November 10, 2025 22:48
@mssonicbld
Collaborator

/azp run


@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Copilot AI left a comment

Pull Request Overview

This PR addresses a race condition where teamd might create a port channel interface and then exit due to failure before fully initializing, causing dependent applications to think configuration is complete when it's not. The fix adds a verification step to connect to teamd and verify it's running before adding the LAG entry to STATE_DB.

Key changes:

  • Added teamdctl connection verification before STATE_DB entry creation
  • Included libteamdctl library dependency for daemon communication
  • Reorganized error logging to occur before retry check

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File | Description
teamsyncd/teamsync.h | Added teamdctl.h header for teamd daemon communication
teamsyncd/teamsync.cpp | Added teamdctl connection and verification logic to ensure teamd is responsive before STATE_DB updates
teamsyncd/Makefile.am | Added -lteamdctl library dependency to link against libteamdctl

Comment on lines +15 to +16
#include <team.h>
#include <teamdctl.h>
Copilot AI Nov 10, 2025

Maintainability: These includes are already present in teamsync.h (lines 12-13), which is included on line 13. The duplicate includes are redundant and should be removed.

Suggested change
#include <team.h>
#include <teamdctl.h>

Contributor Author

This is a stylistic choice to explicitly include what I'm using in this file, so that if the header files change, this file doesn't require changing.

}

char *response;
err = teamdctl_config_get_raw_direct(m_teamdctl, &response);
Copilot AI Nov 10, 2025

Bug: Memory leak - the response pointer returned by teamdctl_config_get_raw_direct is not freed. According to libteamdctl documentation, the response needs to be freed using teamdctl_config_get_raw_direct_free(response) or free(response) to prevent memory leaks.

Contributor Author

Signed-off-by: Saikrishna Arcot <[email protected]>
…5/sonic-swss into fix-teamdsyncd-race-condition
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
