Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "Do not enter vendor SAI critical section for counter polling/clearing operations (#1450)" #1498

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

stephenxs
Copy link
Contributor

Revert "Do not enter vendor SAI critical section for counter polling/clearing operations (#1450)"

This reverts commit 0317b16.

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@kcudnik
Copy link
Collaborator

kcudnik commented Jan 13, 2025

Please give detailed explanation why this is reverted, since previously we established that this indeed helped with stat performance.

@stephenxs
Copy link
Contributor Author

Please give detailed explanation why this is reverted, since previously we established that this indeed helped with stat performance.

yes, will do

@stephenxs
Copy link
Contributor Author

Please give detailed explanation why this is reverted, since previously we established that this indeed helped with stat performance.

There is a timeout between SAI and SDK when counters are polled in bulk mode. the timeout is estimated based on the fact that all SAI API calls from syncd's main thread or counter-polling threads wouldn't execute parallelly.
now, with the mutex on counter polling API calls removed, a bulk counter polling API call can execute parallelly with other SAI API calls from 1. syncd's main thread or 2. other counter polling threads
if there are a heavy workload on SAI API calls from syncd's main thread, the bulk counter polling can get timeout
we need to extend the time out but we need to estimate how long the timeout should be in such scenario, which need more effort.

@kcudnik
Copy link
Collaborator

kcudnik commented Jan 14, 2025

Which timeout to extend ? Syncd heavy use is only on boot scenario, when you see timeouts why pooling countes?

@stephenxs
Copy link
Contributor Author

/azpw run

@mssonicbld
Copy link
Collaborator

/AzurePipelines run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@stephenxs
Copy link
Contributor Author

/azpw run

@mssonicbld
Copy link
Collaborator

/AzurePipelines run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@kcudnik
Copy link
Collaborator

kcudnik commented Jan 15, 2025

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@kcudnik
Copy link
Collaborator

kcudnik commented Jan 16, 2025

since you touched previously not covered code, you will need to add unittests for this code to satisfy code coverage

@stephenxs
Copy link
Contributor Author

since you touched previously not covered code, you will need to add unittests for this code to satisfy code coverage

Yes. I did. actually all lines were covered but somehow coverage reported they were uncovered

@kcudnik
Copy link
Collaborator

kcudnik commented Jan 16, 2025

doyou have them written in unittests/syncd/TestVendorSai.cpp ? and actually calling those stats ?

@stephenxs
Copy link
Contributor Author

stephenxs commented Jan 20, 2025

doyou have them written in unittests/syncd/TestVendorSai.cpp ? and actually calling those stats ?

not really but I've reviewed existing test cases and understood the APIs have been covered

Thread 1 "tests" hit Breakpoint 4, operator() (counter_ids=0x55555590cb30, number_of_counters=1, object_id=5910974510923777, object_type=SAI_OBJECT_TYPE_QUEUE, __closure=0x555555901fb0) at TestFlexCounter.cpp:265
265             clearCalled = true;
(gdb) bt
#0  operator() (counter_ids=0x55555590cb30, number_of_counters=1, object_id=5910974510923777, object_type=SAI_OBJECT_TYPE_QUEUE, __closure=0x555555901fb0) at TestFlexCounter.cpp:265
#1  std::__invoke_impl<long int, FlexCounter_addRemoveCounter_Test::TestBody()::<lambda(sai_object_type_t, sai_object_id_t, uint32_t, const sai_stat_id_t*)>&, _sai_object_type_t, long unsigned int, unsigned int, unsigned int const*> (__f=...)
    at /usr/include/c++/12/bits/invoke.h:61
#2  std::__invoke_r<int, FlexCounter_addRemoveCounter_Test::TestBody()::<lambda(sai_object_type_t, sai_object_id_t, uint32_t, const sai_stat_id_t*)>&, _sai_object_type_t, long unsigned int, unsigned int, unsigned int const*> (__fn=...)
    at /usr/include/c++/12/bits/invoke.h:142
#3  std::_Function_handler<int(_sai_object_type_t, long unsigned int, unsigned int, unsigned int const*), FlexCounter_addRemoveCounter_Test::TestBody()::<lambda(sai_object_type_t, sai_object_id_t, uint32_t, const sai_stat_id_t*)> >::_M_invoke(const std::_Any_data &, _sai_object_type_t &&, unsigned long &&, unsigned int &&, const unsigned int *&&) (__functor=..., __args#0=@0x7fffffffd3e8: SAI_OBJECT_TYPE_QUEUE, __args#1=@0x7fffffffd3f0: 5910974510923777, __args#2=@0x7fffffffd3ec: 1, __args#3=@0x7fffffffd3f8: 0x55555590cb30)
    at /usr/include/c++/12/bits/std_function.h:290
#4  0x00005555555c4285 in std::function<int (_sai_object_type_t, unsigned long, unsigned int, unsigned int const*)>::operator()(_sai_object_type_t, unsigned long, unsigned int, unsigned int const*) const (__args#3=<optimized out>, __args#2=<optimized out>, 
    __args#1=<optimized out>, __args#0=<optimized out>, this=0x555555901fb0) at /usr/include/c++/12/bits/std_function.h:587
#5  MockableSaiInterface::clearStats (this=0x555555901d50, object_type=SAI_OBJECT_TYPE_QUEUE, object_id=5910974510923777, number_of_counters=1, counter_ids=0x55555590cb30) at MockableSaiInterface.cpp:212
#6  0x0000555555693277 in CounterContext<_sai_queue_stat_t>::collectData (this=this@entry=0x55555590d1b0, rid=rid@entry=5910974510923777, counter_ids=std::vector of length 1, capacity 1 = {...}, stats_mode=stats_mode@entry=SAI_STATS_MODE_READ_AND_CLEAR, 
    log_err=log_err@entry=false, stats=std::vector of length 1, capacity 1 = {...}) at FlexCounter.cpp:1115
#7  0x00005555556b2c6e in CounterContext<_sai_queue_stat_t>::getSupportedCounters (this=this@entry=0x55555590d1b0, rid=rid@entry=5910974510923777, counter_ids=std::vector of length 2, capacity 2 = {...}, stats_mode=stats_mode@entry=SAI_STATS_MODE_READ_AND_CLEAR)
    at /usr/include/c++/12/bits/new_allocator.h:90
#8  0x00005555556d5965 in CounterContext<_sai_queue_stat_t>::updateSupportedCounters (stats_mode=SAI_STATS_MODE_READ_AND_CLEAR, counter_ids=std::vector of length 2, capacity 2 = {...}, rid=5910974510923777, this=0x55555590d1b0) at FlexCounter.cpp:1389
#9  CounterContext<_sai_queue_stat_t>::addObject (this=this@entry=0x55555590d1b0, vid=<optimized out>, vid@entry=5910974510923777, rid=rid@entry=5910974510923777, idStrings=std::vector of length 2, capacity 2 = {...}, per_object_stats_mode="") at FlexCounter.cpp:496
#10 0x0000555555673382 in syncd::FlexCounter::addCounter (this=this@entry=0x7fffffffdc40, vid=5910974510923777, rid=5910974510923777, values=std::vector of length 1, capacity 4 = {...}) at FlexCounter.cpp:2609
#11 0x00005555555d2884 in testAddRemoveCounter(unsigned int, _sai_object_type_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::function<void (swss::Table&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&)>, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool) (numOid=<optimized out>, numOid@entry=1, object_type=object_type@entry=SAI_OBJECT_TYPE_QUEUE, counterIdFieldName="QUEUE_COUNTER_ID_LIST", 
    counterIdNames=std::vector of length 2, capacity 2 = {...}, expectedValues=std::vector of length 2, capacity 2 = {...}, verifyFunc=..., autoRemoveDbEntry=false, statsMode="STATS_MODE_READ_AND_CLEAR", bulkAdd=false, bulkChunkSize="", bulkChunkSizePerCounter="", 
    bulkChunkSizeAfterPort=true, pluginName="", immediatelyRemoveBulkChunkSizePerCounter=false) at TestFlexCounter.cpp:136
#12 0x00005555555d44b7 in FlexCounter_addRemoveCounter_Test::TestBody (this=<optimized out>) at TestFlexCounter.cpp:269
#13 0x00005555556522e7 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
#14 0x00005555556439de in testing::Test::Run() ()
#15 0x0000555555643b95 in testing::TestInfo::Run() ()
#16 0x0000555555644139 in testing::TestSuite::Run() ()
#17 0x00005555556494df in testing::internal::UnitTestImpl::RunAllTests() ()
#18 0x0000555555652857 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) ()
#19 0x0000555555643c51 in testing::UnitTest::Run() ()
#20 0x00005555555bca4b in RUN_ALL_TESTS () at /usr/include/gtest/gtest.h:2293
#21 main (argc=<optimized out>, argv=<optimized out>) at main.cpp:13

Looking at stack frame # 6, clearStats is called. It does not show off in the backtrace maybe because of optimization.

#6  0x0000555555693277 in CounterContext<_sai_queue_stat_t>::collectData (this=this@entry=0x55555590d1b0, rid=rid@entry=5910974510923777, counter_ids=std::vector of length 1, capacity 1 = {...}, stats_mode=stats_mode@entry=SAI_STATS_MODE_READ_AND_CLEAR, 
    log_err=log_err@entry=false, stats=std::vector of length 1, capacity 1 = {...}) at FlexCounter.cpp:1115

 1104             if (status != SAI_STATUS_SUCCESS)
 1105             {
 1106                 if (log_err)
 1107                     SWSS_LOG_ERROR("Failed to get stats of %s 0x%" PRIx64 ": %d", m_name.c_str(), rid, status);
 1108                 else
 1109                     SWSS_LOG_INFO("Failed to get stats of %s 0x%" PRIx64 ": %d", m_name.c_str(), rid, status);
 1110                 return false;
 1111             }
 1112 
 1113             if (stats_mode == SAI_STATS_MODE_READ_AND_CLEAR)
 1114             {
 1115                 status = m_vendorSai->clearStats(
 1116                         m_objectType,
 1117                         rid,
 1118                         static_cast<uint32_t>(counter_ids.size()),
 1119                         reinterpret_cast<const sai_stat_id_t *>(counter_ids.data()));
 1120                 if (status != SAI_STATUS_SUCCESS)
 1121                 {

@stephenxs
Copy link
Contributor Author

For getStatsExt

Thread 1 "tests" hit Breakpoint 5, operator() (counters=0x55555590d410, number_of_counters=<optimized out>, __closure=0x555555901f90) at TestFlexCounter.cpp:624
624                 counters[i] = (i + 1) * 10;
(gdb) bt
#0  operator() (counters=0x55555590d410, number_of_counters=<optimized out>, __closure=0x555555901f90) at TestFlexCounter.cpp:624
#1  std::__invoke_impl<long int, FlexCounter_bulkCounter_Test::TestBody()::<lambda(sai_object_type_t, sai_object_id_t, uint32_t, const sai_stat_id_t*, sai_stats_mode_t, uint64_t*)>&, _sai_object_type_t, long unsigned int, unsigned int, unsigned int const*, _sai_stats_mode_t, long unsigned int*> (__f=...) at /usr/include/c++/12/bits/invoke.h:61
#2  std::__invoke_r<int, FlexCounter_bulkCounter_Test::TestBody()::<lambda(sai_object_type_t, sai_object_id_t, uint32_t, const sai_stat_id_t*, sai_stats_mode_t, uint64_t*)>&, _sai_object_type_t, long unsigned int, unsigned int, unsigned int const*, _sai_stats_mode_t, long unsigned int*> (__fn=...) at /usr/include/c++/12/bits/invoke.h:142
#3  std::_Function_handler<int(_sai_object_type_t, long unsigned int, unsigned int, unsigned int const*, _sai_stats_mode_t, long unsigned int*), FlexCounter_bulkCounter_Test::TestBody()::<lambda(sai_object_type_t, sai_object_id_t, uint32_t, const sai_stat_id_t*, sai_stats_mode_t, uint64_t*)> >::_M_invoke(const std::_Any_data &, _sai_object_type_t &&, unsigned long &&, unsigned int &&, const unsigned int *&&, _sai_stats_mode_t &&, unsigned long *&&) (__functor=..., __args#0=<optimized out>, __args#1=<optimized out>, 
    __args#2=<optimized out>, __args#3=@0x7fffffffd3e0: 0x55555590dc30, __args#4=@0x7fffffffd3d4: SAI_STATS_MODE_READ, __args#5=@0x7fffffffd3e8: 0x55555590d410) at /usr/include/c++/12/bits/std_function.h:290
#4  0x00005555555c41d4 in std::function<int (_sai_object_type_t, unsigned long, unsigned int, unsigned int const*, _sai_stats_mode_t, unsigned long*)>::operator()(_sai_object_type_t, unsigned long, unsigned int, unsigned int const*, _sai_stats_mode_t, unsigned long*) const (__args#5=<optimized out>, __args#4=<optimized out>, __args#3=<optimized out>, __args#2=<optimized out>, __args#1=<optimized out>, __args#0=<optimized out>, this=0x555555901f90) at /usr/include/c++/12/bits/std_function.h:587
#5  MockableSaiInterface::getStatsExt (this=0x555555901d50, object_type=SAI_OBJECT_TYPE_COUNTER, object_id=23643898043695105, number_of_counters=1, counter_ids=0x55555590dc30, mode=SAI_STATS_MODE_READ, counters=0x55555590d410) at MockableSaiInterface.cpp:197
#6  0x000055555568c14c in CounterContext<_sai_counter_stat_t>::collectData (this=this@entry=0x55555590d1b0, rid=rid@entry=23643898043695105, counter_ids=std::vector of length 1, capacity 1 = {...}, stats_mode=stats_mode@entry=SAI_STATS_MODE_READ, 
    log_err=log_err@entry=false, stats=std::vector of length 1, capacity 1 = {...}) at FlexCounter.cpp:1142
#7  0x00005555556afbae in CounterContext<_sai_counter_stat_t>::getSupportedCounters (this=this@entry=0x55555590d1b0, rid=rid@entry=23643898043695105, counter_ids=std::vector of length 2, capacity 2 = {...}, stats_mode=stats_mode@entry=SAI_STATS_MODE_READ)
    at /usr/include/c++/12/bits/new_allocator.h:90
#8  0x00005555556c4ea5 in CounterContext<_sai_counter_stat_t>::updateSupportedCounters (stats_mode=SAI_STATS_MODE_READ, counter_ids=std::vector of length 2, capacity 2 = {...}, rid=23643898043695105, this=0x55555590d1b0) at FlexCounter.cpp:1389
#9  CounterContext<_sai_counter_stat_t>::addObject (this=this@entry=0x55555590d1b0, vid=<optimized out>, vid@entry=23643898043695105, rid=rid@entry=23643898043695105, idStrings=std::vector of length 2, capacity 2 = {...}, per_object_stats_mode="")
    at FlexCounter.cpp:496
#10 0x0000555555673552 in syncd::FlexCounter::addCounter (this=this@entry=0x7fffffffdc40, vid=23643898043695105, rid=23643898043695105, values=std::vector of length 1, capacity 4 = {...}) at FlexCounter.cpp:2690
#11 0x00005555555d2884 in testAddRemoveCounter(unsigned int, _sai_object_type_t, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::function<void (swss::Table&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&)>, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool) (numOid=<optimized out>, numOid@entry=2, object_type=object_type@entry=SAI_OBJECT_TYPE_COUNTER, counterIdFieldName="FLOW_COUNTER_ID_LIST", 
    counterIdNames=std::vector of length 2, capacity 2 = {...}, expectedValues=std::vector of length 2, capacity 2 = {...}, verifyFunc=..., autoRemoveDbEntry=true, statsMode="STATS_MODE_READ", bulkAdd=false, bulkChunkSize="", bulkChunkSizePerCounter="", 
    bulkChunkSizeAfterPort=true, pluginName="", immediatelyRemoveBulkChunkSizePerCounter=false) at TestFlexCounter.cpp:136
#12 0x00005555555d6ae5 in FlexCounter_bulkCounter_Test::TestBody (this=<optimized out>) at TestFlexCounter.cpp:677
#13 0x00005555556522e7 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
#14 0x00005555556439de in testing::Test::Run() ()
#15 0x0000555555643b95 in testing::TestInfo::Run() ()
#16 0x0000555555644139 in testing::TestSuite::Run() ()
#17 0x00005555556494df in testing::internal::UnitTestImpl::RunAllTests() ()
#18 0x0000555555652857 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) ()
#19 0x0000555555643c51 in testing::UnitTest::Run() ()
#20 0x00005555555bca4b in RUN_ALL_TESTS () at /usr/include/gtest/gtest.h:2293
#21 main (argc=<optimized out>, argv=<optimized out>) at main.cpp:13

Stack frame # 6

#6  0x000055555568c14c in CounterContext<_sai_counter_stat_t>::collectData (this=this@entry=0x55555590d1b0, rid=rid@entry=23643898043695105, counter_ids=std::vector of length 1, capacity 1 = {...}, stats_mode=stats_mode@entry=SAI_STATS_MODE_READ, 
    log_err=log_err@entry=false, stats=std::vector of length 1, capacity 1 = {...}) at FlexCounter.cpp:1142

 1135                     }
 1136                     return false;
 1137                 }
 1138             }
 1139         }
 1140         else
 1141         {
 1142             status = m_vendorSai->getStatsExt(
 1143                     m_objectType,
 1144                     rid,
 1145                     static_cast<uint32_t>(counter_ids.size()),
 1146                     reinterpret_cast<const sai_stat_id_t *>(counter_ids.data()),
 1147                     stats_mode,
 1148                     stats.data());
 1149             if (status != SAI_STATUS_SUCCESS)
 1150             {
 1151                 if (log_err)
 1152                 {
 1153                     SWSS_LOG_ERROR("Failed to get stats of %s 0x%" PRIx64 ": %d", m_name.c_str(), rid, status);
 1154                 }
 1155                 else
 1156                 {
 1157                     SWSS_LOG_INFO("Failed to get stats of %s 0x%" PRIx64 ": %d", m_name.c_str(), rid, status);
 1158                 }
 1159                 return false;

@mssonicbld
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@kcudnik
Copy link
Collaborator

kcudnik commented Jan 20, 2025

Interesting, but codecovrrage not sure. It's if its optimized, maybe you found a bug in codecoverage

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants