Skip to content

Fix L2 cache bandwidth metrics for MI350#843

Merged
vedithal-amd merged 2 commits intoROCm:developfrom
vedithal-amd:add-mi350-metrics
Aug 1, 2025
Merged

Fix L2 cache bandwidth metrics for MI350#843
vedithal-amd merged 2 commits intoROCm:developfrom
vedithal-amd:add-mi350-metrics

Conversation

@vedithal-amd
Copy link
Contributor

@vedithal-amd vedithal-amd commented Jul 30, 2025

rocprofiler-compute Pull Request

Related Issue

  • Closes #

What type of PR is this? (check all that apply)

  • Bug Fix
  • Cherry Pick
  • Continuous Integration
  • Documentation Update
  • Feature
  • Optimization
  • Refactor
  • Other (please specify)

Technical Details

Fix L2 cache bandwidth metrics for L2 cache for MI350

Have you added or updated tests to validate functionality?

  • Yes
  • No - does not apply to this PR

Added / Updated documentation?

  • Yes
  • No - does not apply to this PR

Have you updated CHANGELOG?

  • Yes
  • No - does not apply to this PR

Copy link
Contributor

@feizheng10 feizheng10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@xuchen-amd xuchen-amd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, just left some non-blocking questions.
And minor format fixings needed.

per normalization unit. This is expected to be the sum of global/generic and
spill/stack atomics in the address processor.
Write Ack Instructions: The total number of write acknowledgements submitted by
data-return unit to SQ not SP, summed over all compute units on the accelerator,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the mention of not SP just to match rocprof-sdk's description?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is right, see the description for TD_WRITE_ACKT_WAVEFRONT_sum in https://raw.githubusercontent.com/ROCm/rocprofiler-sdk/refs/heads/amd-mainline/source/share/rocprofiler-sdk/counter_defs.yaml

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@feizheng10 , is it better to say SQ instead of SQ not SP ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me just saying SQ

max: MAX((MAX((TCC_EA0_RDREQ_sum - TCC_EA0_RDREQ_DRAM_sum), 0) / $denom))
unit: (Req + $normUnit)
Read Bandwidth - PCIe:
avg: AVG(TCC_EA0_RDREQ_IO_32B_sum * 32/ $denom)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@feizheng10, I recall we briefly talked about bandwidth calculations should be over (End_Timestamp - Start_Timestamp) to have the unit GB/s. Should it be the case here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users can set --normal-unit per_second to get Bytes per second.
Default is Bytes per kernel where denominator = 1

Currently following bandwidth metrics use normalization unit as denominator:

  • L1I-L2 Bandwidth (1302)
  • sL1D-L2 BW (1403)
  • Cache BW (1603)
  • L1-L2 BW (1603)
  • Bandwidth (1703)
  • Read / Write / Atomic Bandwidth (1703)
  • Read / Write / Atomic Bandwidth - Infinity Fabric / HBM / PCIe (1706)

@feizheng10 , are you saying that all of the above Bandwidth metrics should be Gbps instead of Bytes per Kernel (default)? And this is for all gfx archs?

To be honest, I agree that all of these should be Gbps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for Xuan raising the question.
Let's look into the details. As always, we don't like it but there is always special case:)
So technically, the unit for all BW should be *B/s. And for all of them, I'd say the valid normalization unit can be applied should be: per_kernel and per_wave. per_cycle and per_second don't make sense for me. Because we are setting per_kernel as default, the only another choice for user would be per_wave, which is not that common for the public users.
Basically, we are adding another "per" customized layer on top of GB per second.
Literally, keeping *B/s makes more sense for the end user.
We need to find a proper place/way to remind them, hi, this "GB/s (per_wave)". May be later.

Thought?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, all of the above listed metrics for all gfx archs should have Unit = Gbps per normalization unit instead of Bytes per normalization unit

Denominator for these metrics should be ((Start_Timestamp - End_Timestamp) / $denom)

Description of the metrics should also be updated.

I think users after reading the Unit and Description will not use --normal-unit per-second because Gbps per second does not make any sense!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed that all bandwidth metrics should have the unit Gbps.
I will make this change in a separate PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #854

@vedithal-amd vedithal-amd merged commit b349e40 into ROCm:develop Aug 1, 2025
4 of 10 checks passed
@vedithal-amd vedithal-amd deleted the add-mi350-metrics branch August 1, 2025 18:00
ammallya pushed a commit that referenced this pull request Oct 28, 2025
* Fix L2 cache bandwidth metrics for MI350

* Address review comments

[ROCm/rocprofiler-compute commit: b349e40]
ammallya pushed a commit that referenced this pull request Oct 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants