Improve robustness of perfRunner by adding error checking in several places #1983


Open

pabloantoniom wants to merge 4 commits into develop from pablo-1985

Contributor

pabloantoniom commented Sep 12, 2025

Motivation

This PR improves the robustness of the script by doing the following:

Report an error if an entry in tuningDb is invalid. Entries in tuningDb are now checked for validity before they are run, so that the perfConfig.py script is more robust. Currently, read_tuning_db reads line by line and populates the map with the contents of each line without checking whether what is being read is valid. If it is not, the test simply returns NaN, giving the user no feedback about what is wrong.
Check errors in getNanoSeconds: mainly, check that the file exists before attempting to read it (a missing file might be caused by rocprof not generating the file in the expected path) and check that the format is correct.
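
As a minimal sketch of the second item above, the following shows one way to check that the results file exists and has the expected column before summing. The function name `get_nanoseconds_checked` and the error messages are assumptions for illustration, not the code in this PR; only the `AverageNs` column name comes from the diff under review.

```python
import csv
import os


def get_nanoseconds_checked(file_name):
    """Sum the AverageNs column of a CSV file, failing loudly on bad input.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    # A missing file may mean the profiler did not write to the expected path.
    if not os.path.isfile(file_name):
        raise FileNotFoundError(f"results file not found: {file_name}")
    with open(file_name, mode="r", newline="") as csv_file:
        reader = csv.DictReader(csv_file)
        # fieldnames is None for an empty file.
        if reader.fieldnames is None or "AverageNs" not in reader.fieldnames:
            raise ValueError(f"{file_name}: missing 'AverageNs' column")
        total = 0
        # Data rows start on line 2, after the header.
        for line_no, row in enumerate(reader, start=2):
            try:
                total += int(float(row["AverageNs"]))
            except (TypeError, ValueError) as exc:
                raise ValueError(
                    f"{file_name}:{line_no}: bad AverageNs value {row['AverageNs']!r}"
                ) from exc
        return total
```

Raising distinct exceptions for a missing file and a malformed file lets the caller report which of the two problems occurred instead of returning NaN.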

Technical Details

Check that arch, config and perfConfig are all valid. Check errors in getNanoSeconds.

Test Plan

Testing with invalid entries.

Test Result

The error of each invalid entry is reported properly.

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.


          First idea

umangyadav reviewed


mlir/utils/performance/perfRunner.py Outdated

    
              def validate_tuning_db_entry(arch, config, perfConfig):

                  # Validate arch

                  if " " in arch:

                      raise ValueError(f"invalid db entry: '{arch} {config} {perfConfig}' with arch='{arch}'")

Member

umangyadav Sep 12, 2025

I think it would be better to say "Found invalid db entry. Arch is incorrect or arch is empty".

Contributor Author

pabloantoniom Sep 22, 2025

Using a regex now, as suggested by Daniel, so finer-grained errors are not easy to produce.

mlir/utils/performance/perfRunner.py Outdated

    
                      return result_average

              def valid_perf_config(perfConfig):

                  if not (perfConfig.startswith('v1') or perfConfig.startswith('v2') or perfConfig.startswith('v3')):

Member

umangyadav Sep 12, 2025

Print a better message, e.g. "perf_config doesn't start with v1/v2/v3"?

Contributor

dhernandez0 Sep 18, 2025

add a TODO to use this function: parse_perf_config() (of this PR: https://github.com/ROCm/rocMLIR/pull/1918/files) once it's merged. Or copy-paste it to a common utility file and use it here.

Contributor Author

pabloantoniom Sep 22, 2025

I'm now using parse_perf_config (from Justin's PR) to check perfConfig, as suggested by Daniel. I will make sure that Justin adds these checks so that we get an informative error message.



          Check errors in getNanoSeconds

3cb48ba

pabloantoniom changed the title ~~[perfRunner] Report an error if an entry in tuningDb is invalid~~ [perfRunner] Improve robustness

pabloantoniom added 2 commits

September 22, 2025 14:22


          Add code from Justin PR

33c5009


          Validate all 3 values using regex and Justins function (as suggested by Daniel)

0ed68b3

pabloantoniom marked this pull request as ready for review

September 22, 2025 14:28

pabloantoniom requested a review from causten as a code owner

September 22, 2025 14:28

pabloantoniom changed the title ~~[perfRunner] Improve robustness~~ Improve robustness of perfRunner by adding error checking in several places

pabloantoniom mentioned this pull request

Add parsing script for tier1-tuning configs #1918

Open

dhernandez0 reviewed


mlir/utils/performance/perfRunner.py

    
                      with open(fileName, mode='r', newline='') as csv_file:

                          reader = csv.DictReader(csv_file)

                          return sum(int(float(row['AverageNs'])) for row in reader if 'AverageNs' in row)

                  except KeyError:

Contributor

dhernandez0 Oct 1, 2025

Will we have a KeyError exception? We check "if 'AverageNs' in row".
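
A small sketch of the behavior in question, using only the standard `csv` module: because of the `if 'AverageNs' in row` filter, a file without that column yields a sum of 0 rather than a `KeyError`. The helper name `sum_average_ns` and the sample CSV contents are made up for illustration.

```python
import csv
import io


def sum_average_ns(csv_text):
    # Same expression as in the diff, applied to an in-memory CSV
    # (hypothetical wrapper, for illustration only).
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    return sum(int(float(row["AverageNs"])) for row in reader if "AverageNs" in row)


# The column is absent: every row is filtered out, so no KeyError is raised.
no_column = sum_average_ns("Name,DurationNs\nkernel,100\n")

# The column is present: values are summed as usual.
with_column = sum_average_ns("Name,AverageNs\nkernel,100\n")
```

With this expression, the failure that can still occur is a `ValueError` from `float()` when a cell is not numeric, so that may be the more useful exception to handle.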

mlir/utils/performance/perfRunner.py

    
              # Validates that the db entry (composed by the arch, config, and perfConfig) is well-formed and consistent

              def validate_tuning_db_entry(arch, num_cu, config, perf_config):

                  # 1. Check perf_config with parse_perf_config helper function.

                  if num_cu == "unknown":

Contributor

dhernandez0 Oct 1, 2025

why not use None as a sentinel for unknown?
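
A minimal sketch of the `None` sentinel idea, assuming a database line has already been split into fields. The helper name `split_db_entry` and the four-field order (arch, num_cu, config, perf_config) are assumptions for illustration; only the three-field form appears in the diff.

```python
from typing import List, Optional, Tuple


def split_db_entry(entries: List[str]) -> Tuple[str, Optional[int], str, str]:
    """Return (arch, num_cu, config, perf_config); num_cu is None when absent.

    Hypothetical helper. The four-field order is an assumption.
    """
    if len(entries) == 3:
        arch, config, perf_config = entries
        # None marks "not provided", so no magic string is needed.
        return arch, None, config, perf_config
    if len(entries) == 4:
        arch, num_cu, config, perf_config = entries
        return arch, int(num_cu), config, perf_config
    raise ValueError(f"expected 3 or 4 fields, got {len(entries)}: {entries!r}")
```

Callers can then test `num_cu is None` instead of comparing against the string "unknown", and the value keeps a single numeric type whenever it is present.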

mlir/utils/performance/perfRunner.py

    
                  # 1. Check perf_config with parse_perf_config helper function.

                  if num_cu == "unknown":

                      # parse_perf_config requires num_cu, so just pass 1 to make it happy.

                      parsed_params = parse_perf_config(perf_config, "1", arch)

Contributor

dhernandez0 Oct 1, 2025

why are we passing 1? I think we should always know the num_cus

mlir/utils/performance/perfRunner.py

    
                      raise ValueError(f"invalid db entry: '{arch} {config} {perf_config}' with arch='{arch}': perf_config is invalid")

                  # 2. Validate arch

                  arch_pattern = r"^gfx([0-9]+)(:[^:\s]+)*$"

Contributor

dhernandez0 Oct 1, 2025

already exists in perfRunner: GFX_CHIP_RE
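
For reference, the `arch_pattern` quoted in the diff can be exercised in isolation. This sketch uses only that pattern; the helper name `arch_is_valid` and the sample arch strings are illustrative, and `GFX_CHIP_RE` is not reproduced here because its definition does not appear in this thread.

```python
import re

# Pattern copied from the diff under review.
ARCH_PATTERN = re.compile(r"^gfx([0-9]+)(:[^:\s]+)*$")


def arch_is_valid(arch):
    # True when the whole string matches the pattern.
    return ARCH_PATTERN.match(arch) is not None


samples = ["gfx942", "gfx942:sramecc+:xnack-", "gfx 942", "", "gfx90a"]
results = {s: arch_is_valid(s) for s in samples}
```

Because the pattern allows only digits directly after `gfx`, a name with a letter suffix such as `gfx90a` is rejected, which may be worth checking against the architectures present in the tuning DB.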

mlir/utils/performance/perfRunner.py

    
                      raise ValueError(f"invalid db entry: '{arch} {config} {perf_config}' with arch='{arch}': arch is invalid")

                  # 3. Validate config

                  re_ty = r"(f32|f16|i16|i8)"

Contributor

dhernandez0 Oct 1, 2025

use existing variables for types in perfRunner: DATA_TYPES_GEMM, DATA_TYPES_GEMM_GEMM, ...

mlir/utils/performance/perfRunner.py

    
                  re_tf = r"(true|false)"

                  gemm_config_pattern = rf"^-t {re_ty} -out_datatype {re_ty} -transA {re_tf} -transB {re_tf} -g [0-9]+ -m [0-9]+ -n [0-9]+ -k [0-9]+$"

                  re_conv = r"(conv|convint8|convfp16)"

Contributor

dhernandez0 Oct 1, 2025

use DATA_TYPES constant

mlir/utils/performance/perfRunner.py

    
                  # 3. Validate config

                  re_ty = r"(f32|f16|i16|i8)"

                  re_tf = r"(true|false)"

                  gemm_config_pattern = rf"^-t {re_ty} -out_datatype {re_ty} -transA {re_tf} -transB {re_tf} -g [0-9]+ -m [0-9]+ -n [0-9]+ -k [0-9]+$"

Contributor

dhernandez0 Oct 1, 2025

this is not needed. We already check (or should check) in fromCommandLine() of GemmConfiguration, AttentionConfiguration, etc...

mlir/utils/performance/perfRunner.py

    
                  if parsed_params is None:

                      raise ValueError(f"invalid db entry: '{arch} {config} {perf_config}' with arch='{arch}': perf_config is invalid")

                  # 2. Validate arch

Contributor

dhernandez0 Oct 1, 2025

I think we should have the checks in fromCommandLine() instead of here.

mlir/utils/performance/perfRunner.py

    
                  attn_ty = r"(f32|f16)"

                  attn_config_pattern = rf"^-t {attn_ty} -transQ {re_tf} -transK {re_tf} -transV {re_tf} -transO {re_tf} -g [0-9]+ -seq_len_q [0-9]+ -seq_len_k [0-9]+ -head_dim_qk [0-9]+ -head_dim_v [0-9]+$"

                  if not re.match(gemm_config_pattern, config) and not re.match(conv_config_pattern, config) and not re.match(attn_config_pattern, config):

Contributor

dhernandez0 Oct 1, 2025

missing gemm+gemm and conv+gemm

mlir/utils/performance/perfRunner.py

    
                              if len(entries) == 3:

                                  arch, config, perfConfig = entries

                                  ret[arch, config] = perfConfig

                                  numCu = "unknown"

Contributor

dhernandez0 Oct 1, 2025

rather than using unknown, let's get the default value for each arch.
