Description
Describe the bug
Hi,
@yogi81 profiled vllm using rocprof-compute with the filter --kernel fused_moe_kernel, a kernel that was identified beforehand with a higher-level PyTorch profiler.
Because rocprof-compute has to run the workload multiple times to collect its metrics (I assume different performance counters are recorded in different runs so that they are more accurate?), the user has to start vllm through vllm serve, send requests, and kill it by hand on every pass, since the server otherwise runs indefinitely. Eventually this works, and rocprofiler-compute profile finishes successfully.
The collected trace is available at: https://github.com/yogi81/rocmprofiler.
Calling rocprofiler-compute analyze on this data, we get:
Traceback (most recent call last):
File "/usr/bin/rocprofiler-compute", line 156, in <module>
main()
File "/usr/bin/rocprofiler-compute", line 148, in main
rocprof_compute.run_analysis()
File "/opt/rocm-6.4.1/libexec/rocprofiler-compute/utils/utils.py", line 53, in wrap_function
result = function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/rocm-6.4.1/libexec/rocprofiler-compute/rocprof_compute_base.py", line 355, in run_analysis
analyzer.run_analysis()
File "/opt/rocm-6.4.1/libexec/rocprofiler-compute/utils/utils.py", line 53, in wrap_function
result = function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/rocm-6.4.1/libexec/rocprofiler-compute/rocprof_compute_analyze/analysis_cli.py", line 90, in run_analysis
tty.show_all(
File "/opt/rocm-6.4.1/libexec/rocprofiler-compute/utils/tty.py", line 114, in show_all
adjusted_name = base_df["Kernel_Name"].apply(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/pandas/core/series.py", line 4924, in apply
).apply()
^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/pandas/core/apply.py", line 1427, in apply
return self.apply_standard()
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/pandas/core/apply.py", line 1507, in apply_standard
mapped = obj._map_values(
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/pandas/core/base.py", line 921, in _map_values
return algorithms.map_array(arr, mapper, na_action=na_action, convert=convert)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/pandas/core/algorithms.py", line 1743, in map_array
return lib.map_infer(values, mapper, convert=convert)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib.pyx", line 2972, in pandas._libs.lib.map_infer
File "/opt/rocm-6.4.1/libexec/rocprofiler-compute/utils/tty.py", line 115, in <lambda>
lambda x: string_multiple_lines(x, 80, 4)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/rocm-6.4.1/libexec/rocprofiler-compute/utils/tty.py", line 47, in string_multiple_lines
while idx < len(source) and len(lines) < max_rows:
^^^^^^^^^^^
TypeError: object of type 'float' has no len()
It appears that the Dispatch List table contains rows whose Kernel_Name is nan, which later triggers this error when string_multiple_lines calls len() on the value.
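To illustrate the failure mode, here is a minimal sketch in Python that reproduces the same TypeError on a small Series and shows one possible guard. Note that wrap_name and wrap_name_safe are hypothetical stand-ins written for this example; they are not the actual string_multiple_lines implementation from tty.py, only a loop with the same `len(source)` check seen in the traceback.

```python
import pandas as pd


def wrap_name(source, width, max_rows):
    """Hypothetical stand-in for string_multiple_lines: split a string into
    at most max_rows chunks of at most width characters each."""
    lines = []
    idx = 0
    # Same loop condition as the line shown in the traceback.
    while idx < len(source) and len(lines) < max_rows:
        lines.append(source[idx:idx + width])
        idx += width
    return "\n".join(lines)


def wrap_name_safe(source, width, max_rows):
    """Same as wrap_name, but passes missing values (NaN/None) through."""
    if pd.isna(source):
        return source
    return wrap_name(str(source), width, max_rows)


# One valid kernel name and one missing value, as in the Dispatch List table.
names = pd.Series(["fused_moe_kernel.kd", float("nan")], name="Kernel_Name")

try:
    names.apply(lambda x: wrap_name(x, 80, 4))
    error_message = None
except TypeError as exc:
    # NaN is a float, so len() fails exactly as in the report.
    error_message = str(exc)

# The guarded variant leaves the NaN entry untouched instead of raising.
safe = names.apply(lambda x: wrap_name_safe(x, 80, 4))
```

A guard like this would only hide the symptom in the display code; it does not explain why the rows are empty in the first place.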
The relevant code is in rocprofiler-compute/src/utils/tty.py, lines 134 to 140 and lines 37 to 43 in 61a9381.
I assume these are kernels that did not match the --kernel filter? See:
type, table_config raw_csv_table {'id': 2, 'title': 'Dispatch List', 'source': 'pmc_dispatch_info.csv'}
table_config[id] 2
base_df Dispatch_ID Kernel_Name GPU_ID
0 612.0 fused_moe_kernel.kd 1.0
1 615.0 fused_moe_kernel.kd 1.0
2 636.0 fused_moe_kernel.kd 1.0
3 639.0 fused_moe_kernel.kd 1.0
4 831.0 fused_moe_kernel.kd 1.0
... ... ... ...
6467 NaN NaN NaN
6468 NaN NaN NaN
6469 NaN NaN NaN
6470 NaN NaN NaN
6471 NaN NaN NaN
Is this a known issue? Why do we have data with NaN? Should they just be filtered out, or is it a deeper issue? cc @coleramos425
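As a possible way to check the "just filter them out" option, the sketch below builds a small frame shaped like the base_df printed above, counts the rows in which every column is NaN, and drops them. The sample values are copied from the output above; the variable names empty_rows, n_empty, and cleaned are made up for this example, and whether dropping the rows is the right fix (rather than a workaround for a deeper problem) is exactly the open question.

```python
import numpy as np
import pandas as pd

# Small frame mimicking the Dispatch List data: two valid rows, two empty rows.
base_df = pd.DataFrame(
    {
        "Dispatch_ID": [612.0, 615.0, np.nan, np.nan],
        "Kernel_Name": ["fused_moe_kernel.kd", "fused_moe_kernel.kd", np.nan, np.nan],
        "GPU_ID": [1.0, 1.0, np.nan, np.nan],
    }
)

# Rows in which every column is missing.
empty_rows = base_df.isna().all(axis=1)
n_empty = int(empty_rows.sum())

# Drop the fully empty rows, then restore integer dtypes: pandas stores an
# integer column as float64 once it contains NaN, hence values like 612.0.
cleaned = base_df.dropna(how="all").astype(
    {"Dispatch_ID": "int64", "GPU_ID": "int64"}
)

print(f"{n_empty} fully empty rows dropped, {len(cleaned)} rows kept")
```

Using how="all" keeps any row that has at least one populated column, so a partially filled row would still show up and could be inspected separately.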
Thank you.
cc @yogi81
Linux Distribution
22.04.5 LTS (Jammy Jellyfish)
ROCm Compute Profiler Version
3.1.0 (release) / f29f1341
GPU
MI300X
ROCm Version
rocm-6.4.1
Cluster name (if applicable)
No response
Reproducer
Can share if needed later.
Expected behavior
No response
Relevant log output
Screenshots
No response
Additional Context
No response