Update benchmark_step.jl for CUDA benchmarking with useful kernel names #4055
perf/benchmark_step.jl
Outdated
end

# If we're running on CUDA, use CUDA's profiler
if ENV["CLIMACOMMS_DEVICE"] == "CUDA" && device isa ClimaComms.CUDADevice
@imreddyTeja any idea why this would fail here on CPU with:
ERROR: LoadError: UndefVarError: `CUDA` not defined in `Main`
Did you see this while using the .buildkite project?
It's happening in the Buildkite pipeline, which should be using that project.
I think this is because CUDA.@profile is a macro, which gets expanded before any part of the if-else statement runs, so the name CUDA has to be defined even when the CPU branch is the one taken.
Moving import CUDA outside of the conditional should fix it, but then CUDA is always imported, regardless of whether we are running on CPU or GPU.
Another option would be to wrap CA.benchmark_step!(integrator, Y₀, n_steps) in a function and explicitly call CUDA.profile_internally or CUDA.profile_externally.
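The difference between a macro and a function call in a branch that never runs can be reproduced without CUDA. The following is a minimal sketch in plain Julia; `NotLoadedModule`, `@some_macro`, and `some_function` are hypothetical names standing in for a package that was never imported, not anything from CUDA.jl or ClimaAtmos.

```julia
# Hypothetical names: `NotLoadedModule`, `@some_macro`, and `some_function`
# stand in for a package that was never imported (like CUDA on a CPU run).

# A macro call inside a branch that is never taken. Quoting keeps the macro
# unexpanded until the expression is evaluated.
with_macro = quote
    if false
        NotLoadedModule.@some_macro 1 + 1
    else
        "cpu path"
    end
end

# Evaluating it throws: the macro has to be resolved before the `if` can run.
macro_error = try
    Core.eval(@__MODULE__, with_macro)
    nothing
catch err
    err
end

# The same shape with an ordinary function call in the untaken branch.
with_function = quote
    if false
        NotLoadedModule.some_function(() -> 1 + 1)
    else
        "cpu path"
    end
end

# This evaluates without error: the name is only looked up if the branch runs.
function_result = Core.eval(@__MODULE__, with_function)
```

Here `macro_error` holds the exception raised by the first evaluation, while `function_result` is the value of the `else` branch. This is why the function-based option avoids the unconditional import.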
- group: "Reproducibility infrastructure"
  steps:
These changes were made by a YAML auto-formatter in VS Code. Is there a style guide I might be breaking here?
I'm not sure... this is something I have been wondering as well. I considered following this example, which is used in Buildkite's docs.
else
    @info "Using internal CUDA profiler"
    CUDA.@profile external = false begin
        e = CUDA.@elapsed begin
What does the e = CUDA.@elapsed... inside the CUDA.@profile do?
The intent is to return the elapsed time of the call for the info statement below. I still need to figure out how to get the tabular output for the kernels that took the longest time. Let me know if you know how to do that. It looks like the ClimaAtmos.benchmark_step function doesn't return the same data that ClimaTimeSteppers.benchmark_step does. The latter is used in benchmark.jl and the former in benchmark_step.jl, though both scripts are calling their respective benchmark_step method 🤔
I think the tabular output you are referring to only works with the internal profiler. Also, with
p = CUDA.@profile external=false $any_expression$
p will be the profiling results, of type CUDA.ProfileResults. With the external profiler:
e = CUDA.@profile external=true $any_expression$
e is whatever any_expression returns.
    end
else
    @info "Using internal CUDA profiler"
    CUDA.@profile external = false begin
The profiler output needs to be printed for the tabular output to show:
### shows tabular output because the profiled results are the returned value
julia> if true
           CUDA.@profile external=false CUDA.@elapsed some_cuarray .+= 2
       end
Profiler ran for 219.11 µs, captu....
### no results shown because the info statement's return is the return value
julia> if true
           CUDA.@profile external=false CUDA.@elapsed some_cuarray .+= 2
           @info "some info"
       end
[ Info: some info
### profiler results explicitly shown
julia> if true
           p = CUDA.@profile external=false CUDA.@elapsed some_cuarray .+= 2
           println(p) # or @show p
           @info "some info"
       end
Profiler ran for 178.1 µs, capturing 78 events.
Host-side activity: calling CUDA APIs took 95.61 µs (53.68% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
...
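The three cases above follow from a general Julia rule: a block evaluates to its last expression, and the REPL only displays that value. A minimal sketch without CUDA, where `results` is a hypothetical stand-in for the profiler's return value:

```julia
# `results` is a hypothetical stand-in for whatever the profiled call returns.
results = (events = 78,)

# The block's value is the stand-in, so a REPL would display it.
shown = if true
    results
end

# `@info` returns `nothing`, so the block's value is `nothing` and the
# stand-in is silently dropped.
dropped = if true
    results
    @info "some info"
end

# Binding the value first keeps it available for printing or returning.
kept = if true
    p = results
    @info "some info"
    p
end
```

In the benchmark script the same applies: the profiler results would need to be bound to a variable and printed before the trailing `@info` call.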
This uses CliMA/ClimaCore.jl#2376 to provide more useful CUDA kernel names in benchmarks.
TODO