Skip to content

add support to handle ha notifications #3659

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

zjswhhh
Copy link
Contributor

@zjswhhh zjswhhh commented May 27, 2025

What I did
Handle ha_set and ha_scope sai notifications.

HLD:
https://github.com/sonic-net/SONiC/blob/master/doc/smart-switch/high-availability/smart-switch-ha-detailed-design.md
https://github.com/sonic-net/SONiC/blob/master/doc/smart-switch/high-availability/smart-switch-ha-dpu-scope-dpu-driven-setup.md#72-ha-role-activation

Per offline discussion with @r12f, haorch will directly write DPU_STATE_DB instead of using zmq for

  1. straightforward to implement
  2. it's not a bottle neck of performance (one table per dpu, only 8 tables for now)

sign-off: Jing Zhang [email protected]

Why I did it

How I verified it
UTs.

Details if related

@zjswhhh zjswhhh requested a review from prsunny as a code owner May 27, 2025 02:53
@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@zjswhhh zjswhhh changed the title add support to handle ha events add support to handle ha notifications May 27, 2025
@zjswhhh zjswhhh requested review from prabhataravind, theasianpianist, r12f, yue-fred-gao and jimmyzhai and removed request for prsunny May 27, 2025 02:53
@@ -52,6 +52,18 @@ void on_twamp_session_event(uint32_t count, sai_twamp_session_event_notification
// which causes concurrency access to the DB
}

void on_ha_set_event(uint32_t count, sai_ha_set_event_data_t *data)
{
// don't use this event handler, because it runs by libsairedis in a separate thread
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work since DPU uses ZMQ. Please check this PR #3547 on what has to be done

Copy link
Contributor Author

@zjswhhh zjswhhh May 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @vivekrnv , thanks for bringing it up. Can you share the doc? Per HA detailed design, there is no change mentioned for communication mode between syncd to orchagent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a change done by @prabhataravind to move the communication channel to ZMQ. PFA: sonic-net/sonic-buildimage#21940

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

SWSS_LOG_NOTICE("DPU is pending on role activation for %s", key.c_str());
}

fvs.push_back({"ha_state", sai_ha_state_name.at(ha_scope_event[i].ha_state)});

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see ha_state in hld. Is it a new addition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Will remove it.

@mssonicbld
Copy link
Collaborator

/azp run

@zjswhhh zjswhhh requested review from vivekrnv and yue-fred-gao May 30, 2025 18:50
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

@prabhataravind prabhataravind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notification changes lgtm

@zjswhhh
Copy link
Contributor Author

zjswhhh commented May 31, 2025

Hi @vivekrnv - appreciate your comments! Can you help review again?

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@prsunny
Copy link
Collaborator

prsunny commented Jun 3, 2025

@zjswhhh , most PR checker issues are fixed as of today. Do you know if the test issues are caused by the PR? Please investigate

@zjswhhh
Copy link
Contributor Author

zjswhhh commented Jun 3, 2025

/azpw

@zjswhhh
Copy link
Contributor Author

zjswhhh commented Jun 5, 2025

/azp run Azure.sonic-swss

Copy link

Commenter does not have sufficient privileges for PR 3659 in repo sonic-net/sonic-swss

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

std::string data;
std::vector<swss::FieldValueTuple> values;

consumer.pop(op, data, values);
Copy link
Contributor

@vivekrnv vivekrnv Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use pops api and process multiple notifications at once.

Also, Make sure doTask(NotificationConsumer &consumer) is called during the empty doTask() Eg:

o->doTask();

DashHaOrch::doTask()
{
    # Finish pending tasks
   doTask(*m_haSetNotificationConsumer);
   doTask(*m_haScopeNotificationConsumer);
}

if not there is a possibility we might miss some notifications

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use pops api and process multiple notifications at once.

Updated.

Also, Make sure doTask(NotificationConsumer &consumer) is called during the empty doTask()

That's a great tip! I added the executors to m_consumerMap in DashHaOrch constructor

Orch::addExecutor(haSetNotificatier);
Orch::addExecutor(haScopeNotificatier);

So it should have been taken care of:

void Orch::doTask()
{
for (auto &it : m_consumerMap)
{
it.second->drain();
}
}

Don't think I need to override the method in DashHaOrch?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, i see. We should be good then.

@zjswhhh zjswhhh force-pushed the state_activated_rq branch from 6ff9ebe to eec055a Compare June 8, 2025 17:59
@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@@ -25,6 +57,125 @@ DashHaOrch::DashHaOrch(DBConnector *db, const vector<string> &tables, DashOrch *

dash_ha_set_result_table_ = make_unique<Table>(app_state_db, APP_DASH_HA_SET_TABLE_NAME);
dash_ha_scope_result_table_ = make_unique<Table>(app_state_db, APP_DASH_HA_SCOPE_TABLE_NAME);

m_dpuStateDbConnector = make_unique<DBConnector>("DPU_STATE_DB", 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VS tests are failing, looks like we will need swsscommon changes: https://github.com/sonic-net/sonic-swss-common/blob/master/common/database_config.json

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on it now.

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants