-
Notifications
You must be signed in to change notification settings - Fork 31
Description
Hello,
I am currently developing a communication kernel and would like to use Thread Trace to debug and analyze its performance at the ISA (assembly) instruction level.I tried running the following command: rocprofv3 --att -d <output_dir> -- <application_path>However, I encountered an error during execution (see below).
Details
Unidirectional LOAD bandwidth GiB/s [Remote read]
SRC\DST GPU 00 GPU 01
GPU 00 -> 986.54 42.27
GPU 01 -> 42.07 984.74
W20251030 09:30:51.570256 139906822775424 generateRocpd.cpp:580] writing SQL database for process 142107 on node 3638630046
E20251030 09:30:51.571632 139906822775424 generateRocpd.cpp:603] Opened result file: /home/tmp/iris/pass_2/normandy-mlflow-debug-j6iqur-041086-2xwbd/142107_results.db (UUID=00049be3-1c39-7c39-8ea0-168663cde420)
W20251030 09:30:51.617263 139793821794944 generateRocpd.cpp:580] writing SQL database for process 142108 on node 3638630046
E20251030 09:30:51.618204 139793821794944 generateRocpd.cpp:603] Opened result file: /home/tmp/iris/pass_2/normandy-mlflow-debug-j6iqur-041086-2xwbd/142108_results.db (UUID=00049be3-1c3a-7c3a-8f87-75d03bc0f13d)
W20251030 09:30:51.679299 139906822775424 simple_timer.cpp:55] SQLite3 generation :: rocpd_string :: 0.045479 sec
W20251030 09:30:51.679628 139906822775424 simple_timer.cpp:55] SQLite3 generation :: rocpd_info_node :: 0.000309 sec
F20251030 09:30:51.680204 139906822775424 common.cpp:117] SQLite3 error 1: unrecognized token: "'/opt/venv/bin/python -c frommultiprocessing.spawnimportspawn_main;"
SQLite3 error: common.cpp:830
SQL Statement: INSERT INTO rocpd_info_process_00049be3_1c39_7c39_8ea0_168663cde420 (id, nid, ppid, pid, init, fini, start, end, command, environment, extdata) VALUES (142107, 3638630046, 142014, 142107, 19795221561709861, 19795233094016172, 19795221561709861, 19795233094016172, '/opt/venv/bin/python -c frommultiprocessing.spawnimportspawn_main;
W20251030 09:30:51.689028 139793821794944 simple_timer.cpp:55] SQLite3 generation :: rocpd_string :: 0.045302 sec
W20251030 09:30:51.689389 139793821794944 simple_timer.cpp:55] SQLite3 generation :: rocpd_info_node :: 0.000341 sec
F20251030 09:30:51.689972 139793821794944 common.cpp:117] SQLite3 error 1: unrecognized token: "'/opt/venv/bin/python -c frommultiprocessing.spawnimportspawn_main;"
SQLite3 error: common.cpp:830
SQL Statement: INSERT INTO rocpd_info_process_00049be3_1c3a_7c3a_8f87_75d03bc0f13d (id, nid, ppid, pid, init, fini, start, end, command, environment, extdata) VALUES (142108, 3638630046, 142014, 142108, 19795221562653435, 19795233140494068, 19795221562653435, 19795233140494068, '/opt/venv/bin/python -c frommultiprocessing.spawnimportspawn_main;
@ 0x7f3ea41dad40 google::LogMessage::Fail()
@ 0x7f3ea41dbe98 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f3ea3f783bc rocprofiler::tool::sql::execute_raw_sql_statements_impl(sqlite3*, std::basic_string_view<char, std::char_traits >, int ()(void, int, char**, char**), void*, int)
@ 0x7f3ea3f2be8f rocprofiler::tool::write_rocpd(rocprofiler::tool::output_config const&, rocprofiler::tool::metadata const&, std::vector<rocprofiler::tool::agent_info, std::allocatorrocprofiler::tool::agent_info > const&, rocprofiler::tool::generator<rocprofiler::tool::t
@ 0x7f3ea3f3dde8 rocprofiler::tool::write_rocpd(rocprofiler::tool::output_config const&, rocprofiler::tool::metadata const&, std::vector<rocprofiler::tool::agent_info, std::allocatorrocprofiler::tool::agent_info > const&, rocprofiler::tool::generator<rocprofiler::tool::t
@ 0x7f3ea3e57ec6 tool_fini(void*)
@ 0x7f2454bb4d40 google::LogMessage::Fail()
@ 0x7f3ea357e333 rocprofiler::registration::(anonymous namespace)::invoke_client_finalizer(rocprofiler_client_id_t)
@ 0x7f3ea357e571 std::once_flag::_Prepare_execution::_Prepare_execution<std::call_oncerocprofiler::registration::finalize()::{lambda()#1}(std::once_flag&, rocprofiler::registration::finalize()::{lambda()#1}&&)::{lambda()#1}>(rocprofiler::registration::finalize()::{lambda
@ 0x7f2454bb5e98 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f3ea3099ee8 (unknown)
@ 0x7f3ea3579737 rocprofiler::registration::finalize()
@ 0x7f3ea3e3f461 finalize_rocprofv3(std::basic_string_view<char, std::char_traits >)
@ 0x7f24549523bc rocprofiler::tool::sql::execute_raw_sql_statements_impl(sqlite3*, std::basic_string_view<char, std::char_traits >, int ()(void, int, char**, char**), void*, int)
@ 0x7f3ea3045a56 __cxa_finalize
@ 0x7f2454905e8f rocprofiler::tool::write_rocpd(rocprofiler::tool::output_config const&, rocprofiler::tool::metadata const&, std::vector<rocprofiler::tool::agent_info, std::allocatorrocprofiler::tool::agent_info > const&, rocprofiler::tool::generator<rocprofiler::tool::t
@ 0x7f3ea3e1d997 (unknown)
W20251030 09:30:51.805797 139906822775424 tool.cpp:2776] [PPID=142014][PID=142107][TID=142107][rocprofv3_error_signal_handler] rocprofv3 caught signal 6...
W20251030 09:30:51.805820 139906822775424 tool.cpp:2799] [PPID=142014][PID=142107][TID=142107][rocprofv3_error_signal_handler] rocprofv3 will wait for 0 children to exit
W20251030 09:30:51.805826 139906822775424 tool.cpp:2814] [PPID=142014][PID=142107][TID=142107][rocprofv3_error_signal_handler] rocprofv3 finalizing after signal 6...
W20251030 09:30:51.805837 139906822775424 tool.cpp:2855] [PPID=142014][PID=142107][TID=142107][rocprofv3_error_signal_handler] rocprofv3 found chained signal handler for 6... executing chained sigaction (SIGINFO)
*** Aborted at 1761816651 (unix time) try "date -d @1761816651" if you are using GNU date ***
PC: @ 0x7f3ea30969fc pthread_kill
*** SIGABRT (@0x22b1b) received by PID 142107 (TID 0x7f3e98792280) from PID 142107; stack trace: ***
@ 0x7f3ea3099ee8 (unknown)
@ 0x7f3ea41e55f0 google::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*)
@ 0x7f3ea3e44795 rocprofv3_error_signal_handler::{lambda()#1}::operator()() const
@ 0x7f2454917de8 rocprofiler::tool::write_rocpd(rocprofiler::tool::output_config const&, rocprofiler::tool::metadata const&, std::vector<rocprofiler::tool::agent_info, std::allocatorrocprofiler::tool::agent_info > const&, rocprofiler::tool::generator<rocprofiler::tool::t
@ 0x7f3ea3099ee8 (unknown)
@ 0x7f2454831ec6 tool_fini(void*)
@ 0x7f3ea3e3b59d rocprofv3_error_signal_handler
@ 0x7f2453f58333 rocprofiler::registration::(anonymous namespace)::invoke_client_finalizer(rocprofiler_client_id_t)
@ 0x7f3ea3042520 (unknown)
@ 0x7f2453f58571 std::once_flag::_Prepare_execution::_Prepare_execution<std::call_oncerocprofiler::registration::finalize()::{lambda()#1}(std::once_flag&, rocprofiler::registration::finalize()::{lambda()#1}&&)::{lambda()#1}>(rocprofiler::registration::finalize()::{lambda
@ 0x7f3ea30969fc pthread_kill
@ 0x7f2453a99ee8 (unknown)
@ 0x7f3ea3042476 raise
@ 0x7f2453f53737 rocprofiler::registration::finalize()
@ 0x7f3ea30287f3 abort
@ 0x7f3ea3e1a93d google::DumpStackTraceAndExit() [clone .cold]
@ 0x7f2454819461 finalize_rocprofv3(std::basic_string_view<char, std::char_traits >)
@ 0x7f2453a45a56 __cxa_finalize
@ 0x7f3ea41dad40 google::LogMessage::Fail()
@ 0x7f3ea41dbe98 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f24547f7997 (unknown)
W20251030 09:30:51.816354 139793821794944 tool.cpp:2776] [PPID=142014][PID=142108][TID=142108][rocprofv3_error_signal_handler] rocprofv3 caught signal 6...
W20251030 09:30:51.816383 139793821794944 tool.cpp:2799] [PPID=142014][PID=142108][TID=142108][rocprofv3_error_signal_handler] rocprofv3 will wait for 0 children to exit
W20251030 09:30:51.816389 139793821794944 tool.cpp:2814] [PPID=142014][PID=142108][TID=142108][rocprofv3_error_signal_handler] rocprofv3 finalizing after signal 6...
W20251030 09:30:51.816398 139793821794944 tool.cpp:2855] [PPID=142014][PID=142108][TID=142108][rocprofv3_error_signal_handler] rocprofv3 found chained signal handler for 6... executing chained sigaction (SIGINFO)
*** Aborted at 1761816651 (unix time) try "date -d @1761816651" if you are using GNU date ***
@ 0x7f3ea3f783bc rocprofiler::tool::sql::execute_raw_sql_statements_impl(sqlite3*, std::basic_string_view<char, std::char_traits >, int ()(void, int, char**, char**), void*, int)
PC: @ 0x7f2453a969fc pthread_kill
*** SIGABRT (@0x22b1c) received by PID 142108 (TID 0x7f2449170280) from PID 142108; stack trace: ***
@ 0x7f3ea3f2be8f rocprofiler::tool::write_rocpd(rocprofiler::tool::output_config const&, rocprofiler::tool::metadata const&, std::vector<rocprofiler::tool::agent_info, std::allocatorrocprofiler::tool::agent_info > const&, rocprofiler::tool::generator<rocprofiler::tool::t
@ 0x7f2453a99ee8 (unknown)
@ 0x7f2454bbf5f0 google::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*)
@ 0x7f3ea3f3dde8 rocprofiler::tool::write_rocpd(rocprofiler::tool::output_config const&, rocprofiler::tool::metadata const&, std::vector<rocprofiler::tool::agent_info, std::allocatorrocprofiler::tool::agent_info > const&, rocprofiler::tool::generator<rocprofiler::tool::t
@ 0x7f245481e795 rocprofv3_error_signal_handler::{lambda()#1}::operator()() const
@ 0x7f3ea3e57ec6 tool_fini(void*)
@ 0x7f2453a99ee8 (unknown)
@ 0x7f3ea357e333 rocprofiler::registration::(anonymous namespace)::invoke_client_finalizer(rocprofiler_client_id_t)
@ 0x7f245481559d rocprofv3_error_signal_handler
@ 0x7f3ea357e571 std::once_flag::_Prepare_execution::_Prepare_execution<std::call_oncerocprofiler::registration::finalize()::{lambda()#1}(std::once_flag&, rocprofiler::registration::finalize()::{lambda()#1}&&)::{lambda()#1}>(rocprofiler::registration::finalize()::{lambda
@ 0x7f2453a42520 (unknown)
@ 0x7f3ea3099ee8 (unknown)
@ 0x7f2453a969fc pthread_kill
@ 0x7f3ea3579737 rocprofiler::registration::finalize()
@ 0x7f2453a42476 raise
@ 0x7f3ea3e3f461 finalize_rocprofv3(std::basic_string_view<char, std::char_traits >)
@ 0x7f2453a287f3 abort
@ 0x7f3ea3045a56 __cxa_finalize
@ 0x7f24547f493d google::DumpStackTraceAndExit() [clone .cold]
@ 0x7f3ea3e1d997 (unknown)
@ 0x7f2454bb4d40 google::LogMessage::Fail()
@ 0x7f2454bb5e98 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f24549523bc rocprofiler::tool::sql::execute_raw_sql_statements_impl(sqlite3*, std::basic_string_view<char, std::char_traits >, int ()(void, int, char**, char**), void*, int)
@ 0x7f2454905e8f rocprofiler::tool::write_rocpd(rocprofiler::tool::output_config const&, rocprofiler::tool::metadata const&, std::vector<rocprofiler::tool::agent_info, std::allocatorrocprofiler::tool::agent_info > const&, rocprofiler::tool::generator<rocprofiler::tool::t
@ 0x7f2454917de8 rocprofiler::tool::write_rocpd(rocprofiler::tool::output_config const&, rocprofiler::tool::metadata const&, std::vector<rocprofiler::tool::agent_info, std::allocatorrocprofiler::tool::agent_info > const&, rocprofiler::tool::generator<rocprofiler::tool::t
@ 0x7f2454831ec6 tool_fini(void*)
@ 0x7f2453f58333 rocprofiler::registration::(anonymous namespace)::invoke_client_finalizer(rocprofiler_client_id_t)
@ 0x7f2453f58571 std::once_flag::_Prepare_execution::_Prepare_execution<std::call_oncerocprofiler::registration::finalize()::{lambda()#1}(std::once_flag&, rocprofiler::registration::finalize()::{lambda()#1}&&)::{lambda()#1}>(rocprofiler::registration::finalize()::{lambda
@ 0x7f2453a99ee8 (unknown)
@ 0x7f2453f53737 rocprofiler::registration::finalize()
@ 0x7f2454819461 finalize_rocprofv3(std::basic_string_view<char, std::char_traits >)
@ 0x7f2453a45a56 __cxa_finalize
@ 0x7f24547f7997 (unknown)
W1030 09:31:16.230000 142014 torch/multiprocessing/spawn.py:160] Terminating process 142107 via signal SIGTERM
Traceback (most recent call last):
File "/home/tmp/iris/examples/00_load/load_bench.py", line 296, in
main()
File "/home/tmp/iris/examples/00_load/load_bench.py", line 287, in main
mp.spawn(
File "/opt/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/opt/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
while not context.join():
File "/opt/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 184, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGABRT
W20251030 09:31:16.359694 139813742502528 simple_timer.cpp:55] [rocprofv3] 'python examples/00_load/load_bench.py --num_ranks 2' :: 37.664389 sec
W20251030 09:31:16.361439 139813742502528 simple_timer.cpp:55] [rocprofv3] tool finalization :: 0.001590 sec
Fatal error: Command '['python', 'examples/00_load/load_bench.py', '--num_ranks', '2']' returned non-zero exit status 1.
W20251030 09:31:16.570020 139751244759680 simple_timer.cpp:55] [rocprofv3] '/opt/venv/bin/python -c from multiprocessing.resource_tracker import main;main(3)' :: 36.195495 sec
W20251030 09:31:16.571160 139751244759680 simple_timer.cpp:55] [rocprofv3] tool finalization :: 0.001032 sec
Questions
Does Thread Trace support usage in multi-process (multi-rank) communication kernels?
If yes, how should it be correctly configured or enabled?
If not, what are the recommended methods to debug or analyze the performance of multi-process communication kernels?