Skip to content

Conversation

pabloantoniom
Copy link
Contributor

@pabloantoniom pabloantoniom commented Sep 12, 2025

Motivation

This PR improves the robustness of the script by doing the following:

  1. Report an error if an entry in tuningDb is invalid. Check in tuningDb are valid before trying to run them, so that the perfConfig.py script is more robust. Currently, read_tuning_db reads line by line and populates the map with the contents of each line without checking if what is being read is valid or not. If it's not, the test will simply return NaN, giving the user no feedback about what is wrong.
  2. Check errros in getNanoSeconds, mainly check if file exists before attempting to read it (which might be related to rocprof not generating the file in the expected path) and check that the format is correct.

Technical Details

Check arch, config and perfConfig and make sure they are all valid. Check errors in getNanoSeconds

Test Plan

Testing with invalid entries.

Test Result

The error of each invalid entry is reported properly.

Submission Checklist

def validate_tuning_db_entry(arch, config, perfConfig):
# Validate arch
if " " in arch:
raise ValueError(f"invalid db entry: '{arch} {config} {perfConfig}' with arch='{arch}'")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to Found invalid db entry. Arch is incorrect or arch is empty

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a regex now as suggested by Daniel, so finer grain errors are not easy to have

return result_average

def valid_perf_config(perfConfig):
if not (perfConfig.startswith('v1') or perfConfig.startswith('v2') or perfConfig.startswith('v3')):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Print better message like perf_config doesn't start with v1/v2/v3 ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a TODO to use this function: parse_perf_config() (of this PR: https://github.com/ROCm/rocMLIR/pull/1918/files) once it's merged. Or copy-paste it to a common utility file and use it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using parse_perf_config (from Justin's PR) now to check perfConfig as suggested by Daniel, I will make sure that Justin adds this checks to have an informative error message

return result_average

def valid_perf_config(perfConfig):
if not (perfConfig.startswith('v1') or perfConfig.startswith('v2') or perfConfig.startswith('v3')):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a TODO to use this function: parse_perf_config() (of this PR: https://github.com/ROCm/rocMLIR/pull/1918/files) once it's merged. Or copy-paste it to a common utility file and use it here.

@pabloantoniom pabloantoniom changed the title [perfRunner] Report an error if an entry in tuningDb is invalid [perfRunner] Improve robustness Sep 22, 2025
@pabloantoniom pabloantoniom marked this pull request as ready for review September 22, 2025 14:28
@pabloantoniom pabloantoniom changed the title [perfRunner] Improve robustness Improve robustness of perfRunner by adding error checking in several places Sep 22, 2025
with open(fileName, mode='r', newline='') as csv_file:
reader = csv.DictReader(csv_file)
return sum(int(float(row['AverageNs'])) for row in reader if 'AverageNs' in row)
except KeyError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will we have a KeyError exeption? We check "if 'AverageNs' in row"

# Validates that the db entry (composed by the arch, config, and perfConfig) is well-formed and consistent
def validate_tuning_db_entry(arch, num_cu, config, perf_config):
# 1. Check perf_config with parse_perf_config helper function.
if num_cu == "unknown":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use None as a sentinel for unknown?

# 1. Check perf_config with parse_perf_config helper function.
if num_cu == "unknown":
# parse_perf_config requires num_cu, so just pass 1 to make it happy.
parsed_params = parse_perf_config(perf_config, "1", arch)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we passing 1? I think we should always know the num_cus

raise ValueError(f"invalid db entry: '{arch} {config} {perf_config}' with arch='{arch}': perf_config is invalid")

# 2. Validate arch
arch_pattern = r"^gfx([0-9]+)(:[^:\s]+)*$"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

already exists in perfRunner: GFX_CHIP_RE

raise ValueError(f"invalid db entry: '{arch} {config} {perf_config}' with arch='{arch}': arch is invalid")

# 3. Validate config
re_ty = r"(f32|f16|i16|i8)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use existing variables for types in perfRunner: DATA_TYPES_GEMM, DATA_TYPES_GEMM_GEMM, ...

re_tf = r"(true|false)"
gemm_config_pattern = rf"^-t {re_ty} -out_datatype {re_ty} -transA {re_tf} -transB {re_tf} -g [0-9]+ -m [0-9]+ -n [0-9]+ -k [0-9]+$"

re_conv = r"(conv|convint8|convfp16)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use DATA_TYPES constant

# 3. Validate config
re_ty = r"(f32|f16|i16|i8)"
re_tf = r"(true|false)"
gemm_config_pattern = rf"^-t {re_ty} -out_datatype {re_ty} -transA {re_tf} -transB {re_tf} -g [0-9]+ -m [0-9]+ -n [0-9]+ -k [0-9]+$"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not needed. We already check (or should check) in fromCommandLine() of GemmConfiguration, AttentionConfiguration, etc...

if parsed_params is None:
raise ValueError(f"invalid db entry: '{arch} {config} {perf_config}' with arch='{arch}': perf_config is invalid")

# 2. Validate arch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have the checks in fromCommandLine() instead of here.

attn_ty = r"(f32|f16)"
attn_config_pattern = rf"^-t {attn_ty} -transQ {re_tf} -transK {re_tf} -transV {re_tf} -transO {re_tf} -g [0-9]+ -seq_len_q [0-9]+ -seq_len_k [0-9]+ -head_dim_qk [0-9]+ -head_dim_v [0-9]+$"

if not re.match(gemm_config_pattern, config) and not re.match(conv_config_pattern, config) and not re.match(attn_config_pattern, config):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing gemm+gemm and conv+gemm

if len(entries) == 3:
arch, config, perfConfig = entries
ret[arch, config] = perfConfig
numCu = "unknown"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than using unknown, let's get the default value for each arch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants