Releases: ROCm/device-metrics-exporter
device-metrics-exporter-charts-v1.4.2
v1.4.2
-
New Label
KFD_PROCESS_IDlabel will now report the process ID using the
respective GPU. This enables baremetal debian deployments to have job
information where no scheduler is used.DEPLOYMENT_MODElabel to specify the GPU operating environment
-
New Field
-
GPU_AFID_ERRORSfield added to report RAS events associated AMD Field Identifier (AFID) list, More details can be found here- More Info on AMD Field ID and next steps are https://docs.amd.com/r/en-US/AMD_Field_ID_70122_v1.0/AFID-Event-List
-
Violation Metrics 1.8 version fields
- Current percentag and Per Compute Core violation metrics available for
GPU_VIOLATION_PROCESSOR_HOT_RESIDENCY_PERCENTAGEGPU_VIOLATION_PPT_RESIDENCY_PERCENTAGEGPU_VIOLATION_SOCKET_THERMAL_RESIDENCY_PERCENTAGEGPU_VIOLATION_VR_THERMAL_RESIDENCY_PERCENTAGEGPU_VIOLATION_HBM_THERMAL_RESIDENCY_PERCENTAGEGPU_VIOLATION_GFX_CLOCK_BELOW_HOST_LIMIT_POWER_ACCUMULATEDGPU_VIOLATION_GFX_CLOCK_BELOW_HOST_LIMIT_THERMAL_ACCUMULATEDGPU_VIOLATION_LOW_UTILIZATION_ACCUMULATEDGPU_VIOLATION_GFX_CLOCK_BELOW_HOST_LIMIT_TOTAL_ACCUMULATEDGPU_VIOLATION_GFX_CLOCK_BELOW_HOST_LIMIT_POWER_PERCENTAGEGPU_VIOLATION_GFX_CLOCK_BELOW_HOST_LIMIT_THERMAL_PERCENTAGEGPU_VIOLATION_LOW_UTILIZATION_PERCENTAGEGPU_VIOLATION_GFX_CLOCK_BELOW_HOST_LIMIT_TOTAL_PERCENTAGE
- Current percentag and Per Compute Core violation metrics available for
-
Clock Metrics Added
GPU_MIN_CLOCK,GPU_MAX_CLOCK
-
-
Label Value Change
- Clock type was not normalized in previous releases, now clock type label values
are normalized without prefixes ofGPU_CLOCK_TYPE_. More details can be found
here
- Clock type was not normalized in previous releases, now clock type label values
Platform Support
ROCm 7.1.1 MI2xx, MI3xx
release v1.4.0.1
What's Changed
- release v1.4.0.1 by @spraveenio in #387
- update gpuagent from release-v1.4.0 (#1083) by @spraveenio in #392
Full Changelog: v1.4.0...v1.4.0.1
v1.4.1.2
What's Changed
- gpuagent localhost bind (#1050) by @spraveenio in #367
- [CP 1064] gpuop-530 update documents for gpuagent port config and restart for debian deployments by @ci-penbot-01 in #377
- [CP 1067] fix typo doc by @ci-penbot-01 in #378
- release v1.4.1.2 by @spraveenio in #386
Full Changelog: v1.4.1...v1.4.1.2
device-metrics-exporter-charts-v1.4.1
v1.4.1
-
Configurable Resource Limits
- Exporter Pod resource limits can now be configured through the Helm chart
- Default limits are set if not specified in the Helm chart
- Debian systemd service file is now set with default resource limits
- Beta: DRA (Dynamic Resource Allocation) device support — exporter can detect and report DRA-allocated GPU devices from the device plugin/pod resources.
-
Profiler Failure Handling
- Profiler metrics is hardware sensitive for failures. To better handle potential system
disruption from coredumps or profiler failures, the exporter will automatically disable
profiler metrics to maintain critical exporter functionality and server stability:- Coredump: Profiler metrics will be disabled immediately upon detection
- Non-crash errors: If the profiler metrics read encounters 3 consecutive non-crash errors, it will be automatically disabled
- Re-enabling: Profiler metrics can be re-enabled by restarting/deleting the exporter pod on that node/server
- Profiler metrics is hardware sensitive for failures. To better handle potential system
-
Note Profiler metrics are prefixed by
gpu_prof_in metrics list
Platform Support
ROCm 7.1.x MI2xx, MI3xx; Kubernetes 1.34+ (required only for DRA beta feature)
device-metrics-exporter-charts-v1.4.0
v1.4.0
-
MI35x Platfform Support
- Exporter now supports MI35x platform with parity with latest supported
fields.
- Exporter now supports MI35x platform with parity with latest supported
-
Mask Unsupported Fields
- Platform-specific unsupported fields (amd-smi marked as N/A) will not be exported.
Boot logs will indicate which fields are supported by the platform (logged once during startup).
- Platform-specific unsupported fields (amd-smi marked as N/A) will not be exported.
-
New Profiler Fields
- New fields are added for better understanding of the application
-
Depricated Fields Notice
-
Following fields are depricated from 6.14.14 driver onwards
- GPU_MMA_ACTIVITY
- GPU_JPEG_ACTIVITY
- GPU_VCN_ACTIVITY
-
These fields are replaced by following fields
- GPU_VCN_BUSY_INSTANTANEOUS
- GPU_JPEG_BUSY_INSTANTANEOUS
-
Platform Support
ROCm 7.0 MI2xx, MI3xx
Issues Fixed
-
fixed metric naming discrepancies between config field and exported field. The
following prometheus fields name are being changed. -
one config field is being renamed which would require updating the
config.json from the released branch forpcie_nac_received_count->
pcie_nack_received_countS.No Old Field Name New Field Name 1 xgmi_neighbor_0_nop_tx gpu_xgmi_nbr_0_nop_tx 2 xgmi_neighbor_1_nop_tx gpu_xgmi_nbr_1_nop_tx 3 xgmi_neighbor_0_request_tx gpu_xgmi_nbr_0_req_tx 4 xgmi_neighbor_0_response_tx gpu_xgmi_nbr_0_resp_tx 5 xgmi_neighbor_1_response_tx gpu_xgmi_nbr_1_resp_tx 6 xgmi_neighbor_0_beats_tx gpu_xgmi_nbr_0_beats_tx 7 xgmi_neighbor_1_beats_tx gpu_xgmi_nbr_1_beats_tx 8 xgmi_neighbor_0_tx_throughput gpu_xgmi_nbr_0_tx_thrput 9 xgmi_neighbor_1_tx_throughput gpu_xgmi_nbr_1_tx_thrput 10 xgmi_neighbor_2_tx_throughput gpu_xgmi_nbr_2_tx_thrput 11 xgmi_neighbor_3_tx_throughput gpu_xgmi_nbr_3_tx_thrput 12 xgmi_neighbor_4_tx_throughput gpu_xgmi_nbr_4_tx_thrput 13 xgmi_neighbor_5_tx_throughput gpu_xgmi_nbr_5_tx_thrput 14 gpu_violation_vr_thermal_tracking_accumulated gpu_violation_vr_thermal_residency_accumulated 15 pcie_nac_received_count pcie_nack_received_count 16 gpu_violation_proc_hot_residency_accumulated gpu_violation_processor_hot_residency_accumulated 17 gpu_violation_soc_thermal_residency_accumulated gpu_violation_socket_thermal_residency_accumulated
device_metrics_exporter_helm_chart_v1.3.1
v1.3.1
Platform Support
OpenShift 4.19 platform support has been added in this release as part of GPU
Operator v1.3.1 Release.
Release Highlights
-
New Metric Fields
- GPU_GFX_BUSY_INSTANTANEOUS, GPU_VC_BUSY_INSTANTANEOUS,
GPU_JPEG_BUSY_INSTANTANEOUS are added to represent partition activities at
more granuler level. - GPU_GFX_ACTIVITY is only applicable for unpartitioned systems, user must
rely on the new BUSY_INSTANTANEOUS fields on partitioned systems.
- GPU_GFX_BUSY_INSTANTANEOUS, GPU_VC_BUSY_INSTANTANEOUS,
-
Health Service Config
- Health services can be disabled through configmap
-
Profiler Metrics Default Config Change
- The previous release of exporter i.e. v1.3.0's ConfigMap present under
example directory had Profiler Metrics enabled by default. Now, this is
set to be disabled by default from v1.3.1 onwards, because profiling is
generally needed only by application developers. If needed, please enable
it through the ConfigMap and make sure that there is no other Exporter
instance or another tool running ROCm profiler at the same time.
- The previous release of exporter i.e. v1.3.0's ConfigMap present under
-
Notice: Exporter Handling of Unsupported Platform Fields (Upcoming Major Release)
- Current Behavior: The exporter sets unsupported platform-specific field metrics to 0.
- Upcoming Change: In the next major release, the exporter will omit unsupported fields
(e.g., those marked as N/A in amd-smi) instead of exporting them as 0. - Logging: Detailed logs will indicate which fields are unsupported, allowing users to verify platform compatibility.
device_metrics_exporter_helm_chart_v1.3.0
v1.3.0
Release Highlights
- K8s Extra Pod Labels
- Allows to tag the exported metrics with any user-defined Pod labels through
ExtraPodLabelsfield in the configmap
- Allows to tag the exported metrics with any user-defined Pod labels through
- Support for Singularity Installation
- Singularity can be used to run device-metrics-exporter in a cluster environment through SLURM. For more details please look at https://github.com/ROCm/device-metrics-exporter/blob/staging/docs/installation/singularity.md
- Performance Metrics
- Integrated with ROCm Profiler to export profiler related metrics on supported platforms, with toggle functionality through
ProfilerMetricsfield in the configmap
- Integrated with ROCm Profiler to export profiler related metrics on supported platforms, with toggle functionality through
- Custom Prefix for Exporter
- Allows to add a custom prefix to the names of all exported metrics through
CommonConfigfield in the configmap. For example, by adding a "amd_" prefix, this can be used to identify the metrics exported for AMD GPUs
- Allows to add a custom prefix to the names of all exported metrics through
Platform Support
ROCm 6.4.x MI3xx
device_metrics_exporter_helm_chart_v1.2.1
A helm chart for deploying AMD Device Metrics Exporter
Documentation: https://instinct.docs.amd.com/projects/gpu-operator/en/latest/metrics/exporter.html
device_metrics_exporter_helm_chart_v1.2.0
A helm chart for deploying AMD Device Metrics Exporter
Documentation: https://instinct.docs.amd.com/projects/gpu-operator/en/release-v1.2.0/metrics/exporter.html
device_metrics_exporter_helm_chart_v1.1.0
A helm chart for deploying AMD Device Metrics Exporter