petebachant
Member

@petebachant petebachant commented Oct 14, 2025

This uses CliMA/ClimaCore.jl#2376 to provide more useful CUDA kernel names in benchmarks.

TODO

  • Update Buildkite pipeline to use this feature
  • Switch to non-dev ClimaCore

@petebachant petebachant marked this pull request as draft October 14, 2025 16:58
end

# If we're running on CUDA, use CUDA's profiler
if ENV["CLIMACOMMS_DEVICE"] == "CUDA" && device isa ClimaComms.CUDADevice
Member Author

@imreddyTeja any idea why this would fail here on CPU with:

ERROR: LoadError: UndefVarError: `CUDA` not defined in `Main`

Member

Did you see this while using the .buildkite project?

Member Author

It's happening in the Buildkite pipeline, which should be using that project.

Member

I think this is because CUDA.@profile is a macro, and macros are expanded before the if-else statement is evaluated, so CUDA has to be defined even when the CPU branch is taken.

Moving import CUDA outside of the conditional should fix it, but then CUDA will always be imported, regardless of whether we are running on CPU or GPU.

Another option would be to wrap CA.benchmark_step!(integrator, Y₀, n_steps) in a function and explicitly call CUDA.profile_internally or CUDA.profile_externally, since a function call (unlike a macro) is only resolved when it actually runs.
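The expansion-order issue can be reproduced without a GPU. The following is a minimal sketch, not the benchmark script itself: `NotLoaded`, `@some_macro`, and `some_function` are hypothetical names standing in for a module that was never imported (the role CUDA plays on a CPU-only run).

```julia
# Stand-in for a script that uses a macro from a module that was never loaded.
# `NotLoaded` is a hypothetical module name; nothing defines it.
script = """
if false
    NotLoaded.@some_macro 1 + 1
end
"""

# Parsing succeeds: at this stage the macro call is only syntax.
ex = Meta.parse(script)

# Evaluation fails even though the condition is false, because the macro in
# the dead branch has to be expanded before the `if` can be evaluated.
failed = try
    Core.eval(@__MODULE__, ex)
    false
catch err
    println("Failed before the branch ran: ", typeof(err))
    true
end

# A plain function call in the same position is only looked up when it runs,
# so the untaken branch causes no error.
ok = if false
    NotLoaded.some_function(() -> 1 + 1)
else
    :skipped
end
```

This is why calling a function form of the profiler inside the branch, rather than the macro, would sidestep the error on CPU, assuming such a function is available in the installed CUDA.jl version.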


- group: "Reproducibility infrastructure"
steps:

Member Author

These changes were made by a YAML auto-formatter in VS Code. Is there a style guide I might be breaking here?

Member

I'm not sure... this is something I have been wondering as well. I considered following this example, which is used in Buildkite's docs.

else
    @info "Using internal CUDA profiler"
    CUDA.@profile external = false begin
        e = CUDA.@elapsed begin
Member

What does the e = CUDA.@elapsed... inside the CUDA.@profile do?

Member Author

The intent is to return the elapsed time of the call for the info statement below. I still need to figure out how to get the tabular output for the kernels that took the longest time. Let me know if you know how to do that. It looks like the ClimaAtmos.benchmark_step function doesn't return the same data that ClimaTimeSteppers.benchmark_step does. The latter is used in benchmark.jl and the former in benchmark_step.jl, though both scripts are calling their respective benchmark_step method 🤔

Member

I think the tabular output you are referring to only works with the internal profiler. Also, with

p = CUDA.@profile external=false $any_expression$

p will hold the profiling results, as a CUDA.ProfileResults object. With the external profiler:

e = CUDA.@profile external=true $any_expression$

e is whatever any_expression returns.

end
else
    @info "Using internal CUDA profiler"
    CUDA.@profile external = false begin
Member

The profiler output needs to be printed for the tabular output to show:

### shows tabular output because the profiled results are the returned value
julia> if true
       CUDA.@profile external=false CUDA.@elapsed some_cuarray .+= 2
       end

Profiler ran for 219.11 µs, captu....

### no results shown because the value of the @info statement, not the profile, is the block's return value
julia> if true
       CUDA.@profile external=false CUDA.@elapsed some_cuarray .+= 2
       @info "some info"
       end
[ Info: some info

### profiler results explicitly shown
julia> if true
       p = CUDA.@profile external=false CUDA.@elapsed some_cuarray .+= 2
       println(p) # or @show p
       @info "some info"
       end
Profiler ran for 178.1 µs, capturing 78 events.

Host-side activity: calling CUDA APIs took 95.61 µs (53.68% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name               │
...
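
The same behaviour can be sketched without CUDA. Note that the REPL only auto-displays the last value of a block, and a script run non-interactively (as in a CI pipeline) displays nothing unless it is printed explicitly. In the sketch below, `FakeResults` and `run_profiled` are hypothetical stand-ins for the profiler's result object and the profiled call, not part of CUDA.jl.

```julia
# Hypothetical stand-in for a profiler result with a custom text display.
struct FakeResults
    events::Int
end
Base.show(io::IO, r::FakeResults) = print(io, "Profiler captured ", r.events, " events")

# Hypothetical stand-in for a profiled call that returns its results.
run_profiled() = FakeResults(78)

# The result is discarded: only the info line is written.
function without_print(io)
    run_profiled()
    println(io, "some info")
end

# The result is bound to a name and printed explicitly.
function with_print(io)
    p = run_profiled()
    println(io, p)
    println(io, "some info")
end

# `sprint` captures everything each function writes to `io` as a String.
silent = sprint(without_print)
shown = sprint(with_print)
print(shown)
```

Binding the result to a variable and printing it, as in `with_print`, is the pattern that makes the table appear regardless of what statement follows it.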
