[Bug]: CLI analyze results for Dispatch List include data with NaN only, resulting in errors in rocprofiler-compute analyze #791

Open
@fxmarty-amd

Description

Describe the bug

Hi,

@yogi81 profiled vLLM using rocprof-compute with the filter --kernel fused_moe_kernel. That kernel had been identified beforehand with a higher-level PyTorch profiler.

rocprof-compute needs to run the workload multiple times to collect all metrics (I assume because different performance counters are recorded in different runs, to make them more accurate?). Since vllm serve runs as a server that never exits on its own, the user has to restart it manually for each run, send requests, and then kill it manually. Eventually this works, and rocprofiler-compute profile finishes successfully.

The collected trace is available at: https://github.com/yogi81/rocmprofiler.

Calling rocprofiler-compute analyze on this data, we get:

Traceback (most recent call last):
  File "/usr/bin/rocprofiler-compute", line 156, in <module>
    main()
  File "/usr/bin/rocprofiler-compute", line 148, in main
    rocprof_compute.run_analysis()
  File "/opt/rocm-6.4.1/libexec/rocprofiler-compute/utils/utils.py", line 53, in wrap_function
    result = function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm-6.4.1/libexec/rocprofiler-compute/rocprof_compute_base.py", line 355, in run_analysis
    analyzer.run_analysis()
  File "/opt/rocm-6.4.1/libexec/rocprofiler-compute/utils/utils.py", line 53, in wrap_function
    result = function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm-6.4.1/libexec/rocprofiler-compute/rocprof_compute_analyze/analysis_cli.py", line 90, in run_analysis
    tty.show_all(
  File "/opt/rocm-6.4.1/libexec/rocprofiler-compute/utils/tty.py", line 114, in show_all
    adjusted_name = base_df["Kernel_Name"].apply(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pandas/core/series.py", line 4924, in apply
    ).apply()
      ^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pandas/core/apply.py", line 1427, in apply
    return self.apply_standard()
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pandas/core/apply.py", line 1507, in apply_standard
    mapped = obj._map_values(
             ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pandas/core/base.py", line 921, in _map_values
    return algorithms.map_array(arr, mapper, na_action=na_action, convert=convert)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pandas/core/algorithms.py", line 1743, in map_array
    return lib.map_infer(values, mapper, convert=convert)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib.pyx", line 2972, in pandas._libs.lib.map_infer
  File "/opt/rocm-6.4.1/libexec/rocprofiler-compute/utils/tty.py", line 115, in <lambda>
    lambda x: string_multiple_lines(x, 80, 4)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm-6.4.1/libexec/rocprofiler-compute/utils/tty.py", line 47, in string_multiple_lines
    while idx < len(source) and len(lines) < max_rows:
                ^^^^^^^^^^^
TypeError: object of type 'float' has no len()

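It appears that the Dispatch List data contains rows where Kernel_Name is NaN. pandas stores NaN as a float, so string_multiple_lines ends up calling len() on a float, which raises the TypeError above. The relevant code in utils/tty.py: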
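To illustrate the failure mode in isolation, here is a minimal sketch that does not use rocprofiler-compute at all. The two-row frame and the `name_length` helper are made up for illustration; only the column names are taken from the output in this issue.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Dispatch List frame:
# one valid row and one row that is NaN in every column.
base_df = pd.DataFrame(
    {
        "Dispatch_ID": [612.0, np.nan],
        "Kernel_Name": ["fused_moe_kernel.kd", np.nan],
        "GPU_ID": [1.0, np.nan],
    }
)


def name_length(value):
    # Stand-in for any helper that assumes its input is a string.
    return len(value)


try:
    base_df["Kernel_Name"].apply(name_length)
    error_message = None
except TypeError as exc:
    # The NaN entry is a Python float, so len() fails on it.
    error_message = str(exc)

print(error_message)  # object of type 'float' has no len()
```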

if table_config["source"] == "pmc_kernel_top.csv":
    adjusted_name = base_df["Kernel_Name"].apply(
        lambda x: string_multiple_lines(x, 40, 3)
    )
else:
    adjusted_name = base_df["Kernel_Name"].apply(
        lambda x: string_multiple_lines(x, 80, 4)
    )

def string_multiple_lines(source, width, max_rows):
    """
    Adjust string with multiple lines by inserting '\n'
    """
    idx = 0
    lines = []
    while idx < len(source) and len(lines) < max_rows:
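One possible workaround would be to make the wrapping helper tolerate non-string input. The sketch below is not the project's implementation: `wrap_kernel_name` is a hypothetical stand-in that uses the standard-library `textwrap` module instead of the loop above, and returning an empty string for missing names is an assumption about the desired behavior.

```python
import textwrap


def wrap_kernel_name(source, width, max_rows):
    """Wrap a kernel name onto at most max_rows lines of at most width characters.

    Non-string input (for example a float NaN from pandas) yields an empty
    string instead of raising a TypeError. This is an assumed policy.
    """
    if not isinstance(source, str):
        return ""
    # break_long_words=True splits names that contain no whitespace.
    chunks = textwrap.wrap(source, width=width, break_long_words=True)
    return "\n".join(chunks[:max_rows])


short_name = wrap_kernel_name("fused_moe_kernel.kd", 80, 4)
missing_name = wrap_kernel_name(float("nan"), 80, 4)
long_name = wrap_kernel_name("a" * 100, 40, 2)
```

This only hides the symptom. Whether NaN rows should reach the display code at all is a separate question.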

I assume these NaN rows correspond to kernels that did not match the --kernel filter? See:

type, table_config raw_csv_table {'id': 2, 'title': 'Dispatch List', 'source': 'pmc_dispatch_info.csv'}
table_config[id] 2
base_df       Dispatch_ID          Kernel_Name  GPU_ID
0           612.0  fused_moe_kernel.kd     1.0
1           615.0  fused_moe_kernel.kd     1.0
2           636.0  fused_moe_kernel.kd     1.0
3           639.0  fused_moe_kernel.kd     1.0
4           831.0  fused_moe_kernel.kd     1.0
...           ...                  ...     ...
6467          NaN                  NaN     NaN
6468          NaN                  NaN     NaN
6469          NaN                  NaN     NaN
6470          NaN                  NaN     NaN
6471          NaN                  NaN     NaN

Is this a known issue? Why do we have data with NaN? Should they just be filtered out, or is it a deeper issue? cc @coleramos425
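If filtering turns out to be acceptable, the rows could be counted and dropped before display. This is a sketch on a made-up frame with the same column names as above. Where such a step would belong in rocprofiler-compute, and whether dropping the rows hides a deeper problem, is exactly what I am unsure about.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the frame loaded from pmc_dispatch_info.csv.
base_df = pd.DataFrame(
    {
        "Dispatch_ID": [612.0, 615.0, np.nan, np.nan],
        "Kernel_Name": ["fused_moe_kernel.kd", "fused_moe_kernel.kd", np.nan, np.nan],
        "GPU_ID": [1.0, 1.0, np.nan, np.nan],
    }
)

# Rows in which every column is missing carry no dispatch information.
empty_rows = base_df.isna().all(axis=1)
print(f"{int(empty_rows.sum())} of {len(base_df)} rows are entirely NaN")

# Drop only the fully empty rows; partially filled rows are kept for inspection.
cleaned = base_df.dropna(how="all").reset_index(drop=True)
```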

Thank you.

cc @yogi81

Linux Distribution

22.04.5 LTS (Jammy Jellyfish)

ROCm Compute Profiler Version

3.1.0 (release) / f29f1341

GPU

MI300X

ROCm Version

rocm-6.4.1

Cluster name (if applicable)

No response

Reproducer

Can share if needed later.

Expected behavior

No response

Relevant log output

Screenshots

No response

Additional Context

No response

Metadata

Assignees

Labels

bug, triage
