-
Notifications
You must be signed in to change notification settings - Fork 49
Improve robustness of perfRunner by adding error checking in several places #1983
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
mlir/utils/performance/perfRunner.py
Outdated
def validate_tuning_db_entry(arch, config, perfConfig): | ||
# Validate arch | ||
if " " in arch: | ||
raise ValueError(f"invalid db entry: '{arch} {config} {perfConfig}' with arch='{arch}'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to Found invalid db entry. Arch is incorrect or arch is empty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a regex now as suggested by Daniel, so finer grain errors are not easy to have
mlir/utils/performance/perfRunner.py
Outdated
return result_average | ||
|
||
def valid_perf_config(perfConfig): | ||
if not (perfConfig.startswith('v1') or perfConfig.startswith('v2') or perfConfig.startswith('v3')): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Print better message like perf_config
doesn't start with v1/v2/v3
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a TODO to use this function: parse_perf_config() (of this PR: https://github.com/ROCm/rocMLIR/pull/1918/files) once it's merged. Or copy-paste it to a common utility file and use it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm using parse_perf_config (from Justin's PR) now to check perfConfig as suggested by Daniel, I will make sure that Justin adds this checks to have an informative error message
mlir/utils/performance/perfRunner.py
Outdated
return result_average | ||
|
||
def valid_perf_config(perfConfig): | ||
if not (perfConfig.startswith('v1') or perfConfig.startswith('v2') or perfConfig.startswith('v3')): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a TODO to use this function: parse_perf_config() (of this PR: https://github.com/ROCm/rocMLIR/pull/1918/files) once it's merged. Or copy-paste it to a common utility file and use it here.
with open(fileName, mode='r', newline='') as csv_file: | ||
reader = csv.DictReader(csv_file) | ||
return sum(int(float(row['AverageNs'])) for row in reader if 'AverageNs' in row) | ||
except KeyError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will we have a KeyError exeption? We check "if 'AverageNs' in row"
# Validates that the db entry (composed by the arch, config, and perfConfig) is well-formed and consistent | ||
def validate_tuning_db_entry(arch, num_cu, config, perf_config): | ||
# 1. Check perf_config with parse_perf_config helper function. | ||
if num_cu == "unknown": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use None as a sentinel for unknown?
# 1. Check perf_config with parse_perf_config helper function. | ||
if num_cu == "unknown": | ||
# parse_perf_config requires num_cu, so just pass 1 to make it happy. | ||
parsed_params = parse_perf_config(perf_config, "1", arch) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we passing 1? I think we should always know the num_cus
raise ValueError(f"invalid db entry: '{arch} {config} {perf_config}' with arch='{arch}': perf_config is invalid") | ||
|
||
# 2. Validate arch | ||
arch_pattern = r"^gfx([0-9]+)(:[^:\s]+)*$" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
already exists in perfRunner: GFX_CHIP_RE
raise ValueError(f"invalid db entry: '{arch} {config} {perf_config}' with arch='{arch}': arch is invalid") | ||
|
||
# 3. Validate config | ||
re_ty = r"(f32|f16|i16|i8)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use existing variables for types in perfRunner: DATA_TYPES_GEMM, DATA_TYPES_GEMM_GEMM, ...
re_tf = r"(true|false)" | ||
gemm_config_pattern = rf"^-t {re_ty} -out_datatype {re_ty} -transA {re_tf} -transB {re_tf} -g [0-9]+ -m [0-9]+ -n [0-9]+ -k [0-9]+$" | ||
|
||
re_conv = r"(conv|convint8|convfp16)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use DATA_TYPES constant
# 3. Validate config | ||
re_ty = r"(f32|f16|i16|i8)" | ||
re_tf = r"(true|false)" | ||
gemm_config_pattern = rf"^-t {re_ty} -out_datatype {re_ty} -transA {re_tf} -transB {re_tf} -g [0-9]+ -m [0-9]+ -n [0-9]+ -k [0-9]+$" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not needed. We already check (or should check) in fromCommandLine() of GemmConfiguration, AttentionConfiguration, etc...
if parsed_params is None: | ||
raise ValueError(f"invalid db entry: '{arch} {config} {perf_config}' with arch='{arch}': perf_config is invalid") | ||
|
||
# 2. Validate arch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should have the checks in fromCommandLine() instead of here.
attn_ty = r"(f32|f16)" | ||
attn_config_pattern = rf"^-t {attn_ty} -transQ {re_tf} -transK {re_tf} -transV {re_tf} -transO {re_tf} -g [0-9]+ -seq_len_q [0-9]+ -seq_len_k [0-9]+ -head_dim_qk [0-9]+ -head_dim_v [0-9]+$" | ||
|
||
if not re.match(gemm_config_pattern, config) and not re.match(conv_config_pattern, config) and not re.match(attn_config_pattern, config): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing gemm+gemm and conv+gemm
if len(entries) == 3: | ||
arch, config, perfConfig = entries | ||
ret[arch, config] = perfConfig | ||
numCu = "unknown" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than using unknown, let's get the default value for each arch.
Motivation
This PR improves the robustness of the script by doing the following:
getNanoSeconds
, mainly check if file exists before attempting to read it (which might be related to rocprof not generating the file in the expected path) and check that the format is correct.Technical Details
Check arch, config and perfConfig and make sure they are all valid. Check errors in
getNanoSeconds
Test Plan
Testing with invalid entries.
Test Result
The error of each invalid entry is reported properly.
Submission Checklist