
Commit d222ad9

spraveeniosajmera-pensando authored and committed
update docs (#869)
(cherry picked from commit abd18e2a150bfb2c4e6dac25d43c62def684b883)
1 parent 55fee60 commit d222ad9

5 files changed (+42, -30 lines)

docs/configuration/configmap.md (8 additions, 8 deletions)

@@ -6,11 +6,11 @@ When deploying AMD Device Metrics Exporter on Kubernetes, a `ConfigMap` is deployed
 - `ServerPort`: this field is ignored when Device Metrics Exporter is deployed by the [GPU Operator](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/) to avoid conflicts with the service node port config.
 - `GPUConfig`:
-  - Fields: An array of strings specifying what metrics field to be exported.
-  - Labels: `SERIAL_NUMBER`, `GPU_ID`, `POD`, `NAMESPACE`, `CONTAINER`, `JOB_ID`, `JOB_USER`, `JOB_PARTITION`, `CARD_MODEL`, `HOSTNAME`, `GPU_PARTITION_ID`, `GPU_COMPUTE_PARTITION_TYPE`, and `GPU_MEMORY_PARTITION_TYPE` are always set and cannot be removed. Labels supported are available in the provided example `configmap.yml`.
-  - CustomLabels: A map of user-defined labels and their values. Users can set up to 10 custom labels. From the `GPUMetricLabel` list, only `CLUSTER_NAME` is allowed to be set in `CustomLabels`. Any other labels from this list cannot be set. Users can define other custom labels outside of this restriction. These labels will be exported with every metric, ensuring consistent metadata across all metrics.
-  - ExtraPodLabels: This defines a map that links Prometheus label names to Kubernetes pod labels. Each key is the Prometheus label that will be exposed in metrics, and the value is the pod label to pull the data from. This lets you expose pod metadata as Prometheus labels for easier filtering and querying.<br>(e.g. Considering an entry like `"WORKLOAD_ID" : "amd-workload-id"`, where `WORKLOAD_ID` is a label visible in metrics and its value is the pod label value of a pod label key set as `amd-workload-id`).
-  - ProfilerMetrics: A map of toggle to enable Profiler Metrics either for `all` nodes or a specific hostname with desired state. Key with specific hostname `$HOSTNAME` takes precedense over a `all` key.
+  - `Fields`: An array of strings specifying which metric fields to export.
+  - `Labels`: `SERIAL_NUMBER`, `GPU_ID`, `POD`, `NAMESPACE`, `CONTAINER`, `JOB_ID`, `JOB_USER`, `JOB_PARTITION`, `CARD_MODEL`, `HOSTNAME`, `GPU_PARTITION_ID`, `GPU_COMPUTE_PARTITION_TYPE`, and `GPU_MEMORY_PARTITION_TYPE` are always set and cannot be removed. The supported labels are listed in the provided example `configmap.yml`.
+  - `CustomLabels`: A map of user-defined labels and their values. Users can set up to 10 custom labels. From the `GPUMetricLabel` list, only `CLUSTER_NAME` may be set in `CustomLabels`; no other label from that list can be set. Users can define other custom labels outside of this restriction. These labels are exported with every metric, ensuring consistent metadata across all metrics.
+  - `ExtraPodLabels`: A map that links Prometheus label names to Kubernetes pod labels. Each key is the Prometheus label that will be exposed in metrics, and the value is the pod label to pull the data from. This lets you expose pod metadata as Prometheus labels for easier filtering and querying.<br>(e.g. for an entry like `"WORKLOAD_ID" : "amd-workload-id"`, `WORKLOAD_ID` is the label visible in metrics and its value comes from the pod label keyed `amd-workload-id`).
+  - `ProfilerMetrics`: A map of toggles that enables Profiler Metrics either for `all` nodes or for a specific hostname with the desired state. A key with a specific hostname `$HOSTNAME` takes precedence over an `all` key. This only controls the Profiler Metrics, which carry the `GPU_PROF_` prefix in the metrics list.
 - `CommonConfig`:
   - `MetricsFieldPrefix`: Adds a prefix string to all exported fields. A [Prometheus metric-name formatted](https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels) prefix is accepted; any invalid prefix defaults to an empty prefix so the fields can still be exported.
   - `HealthService`: Health Service configuration for the exporter.

@@ -20,12 +20,12 @@ When deploying AMD Device Metrics Exporter on Kubernetes, a `ConfigMap` is deployed
 To use a custom configuration when deploying the Metrics Exporter:

-1. Create a `ConfigMap` based on the provided example [configmap.yml](https://github.com/ROCm/device-metrics-exporter/blob/main/example/configmap.yaml)
+1. Create a `ConfigMap` based on the provided example [configmap.yml](../examples/configmap.yml) file.
 2. Change the `configMap` property in `values.yaml` to `configmap.yml`
 3. Run `helm install`:

 ```bash
-helm install exporter https://github.com/ROCm/device-metrics-exporter/releases/download/v1.3.1/device-metrics-exporter-charts-v1.3.1.tgz -n metrics-exporter -f values.yaml --create-namespace
+helm install exporter https://github.com/ROCm/device-metrics-exporter/releases/download/v1.4.0/device-metrics-exporter-charts-v1.4.0.tgz -n metrics-exporter -f values.yaml --create-namespace
 ```

 Device Metrics Exporter polls for configuration changes every minute, so updates take effect without container restarts.
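Putting the `GPUConfig` fields above together, a custom `ConfigMap` might look like the following minimal sketch. The key names follow the description above; the `config.json` data key, the field name, and all values are illustrative assumptions, not copied from the shipped example:

```yaml
# Hypothetical custom ConfigMap for the Metrics Exporter.
# Key names follow the GPUConfig description above; all values are illustrative.
apiVersion: v1
kind: ConfigMap
metadata:
  name: metrics-exporter-config
data:
  config.json: |
    {
      "GPUConfig": {
        "Fields": ["GPU_GFX_ACTIVITY"],
        "Labels": ["SERIAL_NUMBER", "GPU_ID", "HOSTNAME"],
        "CustomLabels": { "CLUSTER_NAME": "dev-cluster" },
        "ExtraPodLabels": { "WORKLOAD_ID": "amd-workload-id" },
        "ProfilerMetrics": { "all": true }
      }
    }
```

Consult the example `configmap.yml` in the repository for the exact data key and the full set of supported fields before deploying.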

docs/configuration/docker.md (2 additions, 2 deletions)

@@ -2,7 +2,7 @@
 To use a custom configuration with the AMD Device Metrics Exporter container:

-1. Create a config file based on the provided example [config.json](https://raw.githubusercontent.com/ROCm/device-metrics-exporter/refs/heads/main/example/config.json)
+1. Create a config file based on the provided example [config.json](../../example/config.json)
 2. Save `config.json` in the `config/` folder
 3. Mount the `config/` folder when starting the container:

@@ -13,7 +13,7 @@ docker run -d \
   -p 5000:5000 \
   -v ./config:/etc/metrics \
   --name device-metrics-exporter \
-  rocm/device-metrics-exporter:v1.3.1
+  rocm/device-metrics-exporter:v1.4.0
 ```

 The exporter polls for configuration changes every minute, so updates take effect without container restarts.
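The minute-interval reload described above can be sketched as an mtime-based polling loop. This is not the exporter's actual implementation; `maybe_reload` and its modification-time check are assumptions for illustration only:

```python
import json
import os


def maybe_reload(path, last_mtime, apply_config):
    """Re-read a JSON config file only when its modification time has changed.

    Call this on each poll tick (e.g. once a minute); it invokes
    `apply_config` with the parsed JSON only when the file changed,
    and returns the current mtime to pass back on the next poll.
    """
    mtime = os.stat(path).st_mtime
    if mtime != last_mtime:
        with open(path) as f:
            apply_config(json.load(f))
    return mtime
```

Because only the mtime is compared, an unchanged file costs a single `stat` per poll, which is why configuration updates can take effect without restarting the container.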

docs/installation/deb-package.rst (3 additions, 4 deletions)

@@ -53,13 +53,12 @@ Step 2: Install AMDGPU Driver
 .. note::
    For the most up-to-date information on installing dkms drivers, please see the `ROCm Install Quick Start <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html>`_ page. The instructions below are current as of ROCm 7.0.rc1.

-1. Download the driver from the Radeon repository (`repo.radeon.com <https://repo.radeon.com/amdgpu-install>`_) for your operating system. For example, if you want the latest ROCm 7.0.rc1 drivers for Ubuntu 22.04, you would run the following command:
+1. Download the driver from the Radeon repository (`repo.radeon.com <https://repo.radeon.com/amdgpu-install>`_) for your operating system. For example, if you want the latest ROCm 7.0.0 drivers for Ubuntu 22.04, you would run the following command:

 .. code-block:: bash

-   sudo mkdir --parents --mode=0755 /etc/apt/keyrings
-   wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null
-   echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/30.10_rc1/ubuntu jammy main" | sudo tee /etc/apt/sources.list.d/amdgpu.list
+   wget https://repo.radeon.com/amdgpu-install/7.0/ubuntu/jammy/amdgpu-install_7.0.70000-1_all.deb
+   sudo apt install ./amdgpu-install_7.0.70000-1_all.deb
    sudo apt update

 Please note that the above URL will differ depending on the driver version you are installing and the operating system you are using.

docs/installation/kubernetes-helm.md (4 additions, 0 deletions)

@@ -19,10 +19,14 @@ For Kubernetes environments, a Helm chart is provided for easy deployment.
 ```yaml
 platform: k8s
 nodeSelector: {} # Optional: Add custom nodeSelector
+tolerations: [] # Optional: Add custom tolerations
+kubelet:
+  podResourceAPISocketPath: /var/lib/kubelet/pod-resources
 image:
   repository: docker.io/rocm/device-metrics-exporter
   tag: v1.4.0
   pullPolicy: Always
+configMap: "" # Optional: Add custom configuration
 service:
   type: ClusterIP # or NodePort
 ClusterIP:
docs/releasenotes.md (25 additions, 16 deletions)

@@ -1,6 +1,6 @@
 # Release Notes

-## v1.4.0 (active development)
+## v1.4.0

 - **MI35x Platform Support**
   - Exporter now supports the MI35x platform with parity with the latest supported

@@ -29,21 +29,30 @@ ROCm 7.0 MI2xx, MI3xx
 ### Issues Fixed
-- fixed metric naming discrepancies between config field and exported field. The
-  following prometheues fields are updated:
-  - xgmi_neighbor_0_nop_tx -> gpu_xgmi_nbr_0_nop_tx
-  - xgmi_neighbor_1_nop_tx -> gpu_xgmi_nbr_1_nop_tx
-  - xgmi_neighbor_0_request_tx -> gpu_xgmi_nbr_0_req_tx
-  - xgmi_neighbor_0_response_tx -> gpu_xgmi_nbr_0_resp_tx
-  - xgmi_neighbor_1_response_tx -> gpu_xgmi_nbr_0_resp_tx
-  - xgmi_neighbor_1_response_tx -> gpu_xgmi_nbr_1_resp_tx
-  - xgmi_neighbor_0_beats_tx -> gpu_xgmi_nbr_0_beats_tx
-  - xgmi_neighbor_1_beats_tx -> gpu_xgmi_nbr_1_beats_tx
-  - xgmi_neighbor_0_tx_throughput -> gpu_xgmi_nbr_0_tx_thrput
-  - xgmi_neighbor_1_tx_throughput -> gpu_xgmi_nbr_1_tx_thrput
-  - xgmi_neighbor_2_tx_throughput -> gpu_xgmi_nbr_2_tx_thrput
-  - xgmi_neighbor_3_tx_throughput -> gpu_xgmi_nbr_3_tx_thrput
-  - xgmi_neighbor_4_tx_throughput -> gpu_xgmi_nbr_4_tx_thrput
-  - xgmi_neighbor_5_tx_throughput -> gpu_xgmi_nbr_5_tx_thrput
+- Fixed metric naming discrepancies between config fields and exported fields; the
+  following Prometheus field names have changed.
+  - One config field is renamed, which requires updating `config.json` from the
+    released branch: `pcie_nac_received_count` -> `pcie_nack_received_count`
+
+  | S.No | Old Field Name | New Field Name |
+  |------|----------------|----------------|
+  | 1 | xgmi_neighbor_0_nop_tx | gpu_xgmi_nbr_0_nop_tx |
+  | 2 | xgmi_neighbor_1_nop_tx | gpu_xgmi_nbr_1_nop_tx |
+  | 3 | xgmi_neighbor_0_request_tx | gpu_xgmi_nbr_0_req_tx |
+  | 4 | xgmi_neighbor_0_response_tx | gpu_xgmi_nbr_0_resp_tx |
+  | 5 | xgmi_neighbor_1_response_tx | gpu_xgmi_nbr_1_resp_tx |
+  | 6 | xgmi_neighbor_0_beats_tx | gpu_xgmi_nbr_0_beats_tx |
+  | 7 | xgmi_neighbor_1_beats_tx | gpu_xgmi_nbr_1_beats_tx |
+  | 8 | xgmi_neighbor_0_tx_throughput | gpu_xgmi_nbr_0_tx_thrput |
+  | 9 | xgmi_neighbor_1_tx_throughput | gpu_xgmi_nbr_1_tx_thrput |
+  | 10 | xgmi_neighbor_2_tx_throughput | gpu_xgmi_nbr_2_tx_thrput |
+  | 11 | xgmi_neighbor_3_tx_throughput | gpu_xgmi_nbr_3_tx_thrput |
+  | 12 | xgmi_neighbor_4_tx_throughput | gpu_xgmi_nbr_4_tx_thrput |
+  | 13 | xgmi_neighbor_5_tx_throughput | gpu_xgmi_nbr_5_tx_thrput |
+  | 14 | gpu_violation_vr_thermal_tracking_accumulated | gpu_violation_vr_thermal_residency_accumulated |
+  | 15 | pcie_nac_received_count | pcie_nack_received_count |
+  | 16 | gpu_violation_proc_hot_residency_accumulated | gpu_violation_processor_hot_residency_accumulated |
+  | 17 | gpu_violation_soc_thermal_residency_accumulated | gpu_violation_socket_thermal_residency_accumulated |
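For anyone updating Grafana dashboards or alert rules against these renames, the table can be applied mechanically. A minimal sketch (the mapping is transcribed from the table above; `migrate_field` is a hypothetical helper, not part of the exporter):

```python
# Old -> new Prometheus field names, transcribed from the v1.4.0 release notes.
RENAMED_FIELDS = {
    "xgmi_neighbor_0_nop_tx": "gpu_xgmi_nbr_0_nop_tx",
    "xgmi_neighbor_1_nop_tx": "gpu_xgmi_nbr_1_nop_tx",
    "xgmi_neighbor_0_request_tx": "gpu_xgmi_nbr_0_req_tx",
    "xgmi_neighbor_0_response_tx": "gpu_xgmi_nbr_0_resp_tx",
    "xgmi_neighbor_1_response_tx": "gpu_xgmi_nbr_1_resp_tx",
    "xgmi_neighbor_0_beats_tx": "gpu_xgmi_nbr_0_beats_tx",
    "xgmi_neighbor_1_beats_tx": "gpu_xgmi_nbr_1_beats_tx",
    "xgmi_neighbor_0_tx_throughput": "gpu_xgmi_nbr_0_tx_thrput",
    "xgmi_neighbor_1_tx_throughput": "gpu_xgmi_nbr_1_tx_thrput",
    "xgmi_neighbor_2_tx_throughput": "gpu_xgmi_nbr_2_tx_thrput",
    "xgmi_neighbor_3_tx_throughput": "gpu_xgmi_nbr_3_tx_thrput",
    "xgmi_neighbor_4_tx_throughput": "gpu_xgmi_nbr_4_tx_thrput",
    "xgmi_neighbor_5_tx_throughput": "gpu_xgmi_nbr_5_tx_thrput",
    "gpu_violation_vr_thermal_tracking_accumulated": "gpu_violation_vr_thermal_residency_accumulated",
    "pcie_nac_received_count": "pcie_nack_received_count",
    "gpu_violation_proc_hot_residency_accumulated": "gpu_violation_processor_hot_residency_accumulated",
    "gpu_violation_soc_thermal_residency_accumulated": "gpu_violation_socket_thermal_residency_accumulated",
}


def migrate_field(name: str) -> str:
    """Map a pre-v1.4.0 field name to its new name; unchanged names pass through."""
    return RENAMED_FIELDS.get(name, name)
```

Running every field name referenced in a dashboard through `migrate_field` leaves already-current names untouched, so the migration is idempotent.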
## v1.3.1
