
Commit d222ad9

spraveeniosajmera-pensando authored and committed
update docs (#869)
(cherry picked from commit abd18e2a150bfb2c4e6dac25d43c62def684b883)
1 parent 55fee60 commit d222ad9

5 files changed (+42, -30 lines)

docs/configuration/configmap.md (8 additions, 8 deletions)

@@ -6,11 +6,11 @@ When deploying AMD Device Metrics Exporter on Kubernetes, a `ConfigMap` is deployed
 - `ServerPort`: this field is ignored when Device Metrics Exporter is deployed by the [GPU Operator](https://instinct.docs.amd.com/projects/gpu-operator/en/latest/) to avoid conflicts with the service node port config.
 - `GPUConfig`:
-  - Fields: An array of strings specifying what metrics field to be exported.
-  - Labels: `SERIAL_NUMBER`, `GPU_ID`, `POD`, `NAMESPACE`, `CONTAINER`, `JOB_ID`, `JOB_USER`, `JOB_PARTITION`, `CARD_MODEL`, `HOSTNAME`, `GPU_PARTITION_ID`, `GPU_COMPUTE_PARTITION_TYPE`, and `GPU_MEMORY_PARTITION_TYPE` are always set and cannot be removed. Labels supported are available in the provided example `configmap.yml`.
-  - CustomLabels: A map of user-defined labels and their values. Users can set up to 10 custom labels. From the `GPUMetricLabel` list, only `CLUSTER_NAME` is allowed to be set in `CustomLabels`. Any other labels from this list cannot be set. Users can define other custom labels outside of this restriction. These labels will be exported with every metric, ensuring consistent metadata across all metrics.
-  - ExtraPodLabels: This defines a map that links Prometheus label names to Kubernetes pod labels. Each key is the Prometheus label that will be exposed in metrics, and the value is the pod label to pull the data from. This lets you expose pod metadata as Prometheus labels for easier filtering and querying.<br>(e.g. Considering an entry like `"WORKLOAD_ID" : "amd-workload-id"`, where `WORKLOAD_ID` is a label visible in metrics and its value is the pod label value of a pod label key set as `amd-workload-id`).
-  - ProfilerMetrics: A map of toggle to enable Profiler Metrics either for `all` nodes or a specific hostname with desired state. Key with specific hostname `$HOSTNAME` takes precedense over a `all` key.
+  - `Fields`: An array of strings specifying which metric fields to export.
+  - `Labels`: `SERIAL_NUMBER`, `GPU_ID`, `POD`, `NAMESPACE`, `CONTAINER`, `JOB_ID`, `JOB_USER`, `JOB_PARTITION`, `CARD_MODEL`, `HOSTNAME`, `GPU_PARTITION_ID`, `GPU_COMPUTE_PARTITION_TYPE`, and `GPU_MEMORY_PARTITION_TYPE` are always set and cannot be removed. The supported labels are listed in the provided example `configmap.yml`.
+  - `CustomLabels`: A map of user-defined labels and their values. Users can set up to 10 custom labels. From the `GPUMetricLabel` list, only `CLUSTER_NAME` may be set in `CustomLabels`; no other label from that list can be set. Users can define other custom labels outside of this restriction. These labels are exported with every metric, ensuring consistent metadata across all metrics.
+  - `ExtraPodLabels`: A map that links Prometheus label names to Kubernetes pod labels. Each key is the Prometheus label that will be exposed in metrics, and the value is the pod label to pull the data from. This lets you expose pod metadata as Prometheus labels for easier filtering and querying.<br>(e.g. for an entry like `"WORKLOAD_ID" : "amd-workload-id"`, `WORKLOAD_ID` is the label visible in metrics and its value comes from the pod label keyed `amd-workload-id`).
+  - `ProfilerMetrics`: A map of toggles that enables Profiler Metrics either for `all` nodes or for a specific hostname with the desired state. A key with a specific hostname `$HOSTNAME` takes precedence over an `all` key. This only controls the Profiler Metrics, which carry the `GPU_PROF_` prefix in the metrics list.
 - `CommonConfig`:
   - `MetricsFieldPrefix`: Adds a prefix string to all exported fields. A [Prometheus metric-name formatted](https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels) prefix is accepted; any invalid prefix defaults to an empty prefix so the fields can still be exported.
   - `HealthService`: Health Service configuration for the exporter.

@@ -20,12 +20,12 @@ When deploying AMD Device Metrics Exporter on Kubernetes, a `ConfigMap` is deployed
 To use a custom configuration when deploying the Metrics Exporter:

-1. Create a `ConfigMap` based on the provided example [configmap.yml](https://github.com/ROCm/device-metrics-exporter/blob/main/example/configmap.yaml)
+1. Create a `ConfigMap` based on the provided example [configmap.yml](../examples/configmap.yml) file.
 2. Change the `configMap` property in `values.yaml` to `configmap.yml`
 3. Run `helm install`:

 ```bash
-helm install exporter https://github.com/ROCm/device-metrics-exporter/releases/download/v1.3.1/device-metrics-exporter-charts-v1.3.1.tgz -n metrics-exporter -f values.yaml --create-namespace
+helm install exporter https://github.com/ROCm/device-metrics-exporter/releases/download/v1.4.0/device-metrics-exporter-charts-v1.4.0.tgz -n metrics-exporter -f values.yaml --create-namespace
 ```

 Device Metrics Exporter polls for configuration changes every minute, so updates take effect without container restarts.
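Putting the `GPUConfig` fields above together, a custom `ConfigMap` might look like the following minimal sketch. The key names follow the description above; the `config.json` data key, the field name, and all values are illustrative assumptions, not copied from the shipped example:

```yaml
# Hypothetical custom ConfigMap for the Metrics Exporter.
# Key names follow the GPUConfig description above; all values are illustrative.
apiVersion: v1
kind: ConfigMap
metadata:
  name: metrics-exporter-config
data:
  config.json: |
    {
      "GPUConfig": {
        "Fields": ["GPU_GFX_ACTIVITY"],
        "Labels": ["SERIAL_NUMBER", "GPU_ID", "HOSTNAME"],
        "CustomLabels": { "CLUSTER_NAME": "dev-cluster" },
        "ExtraPodLabels": { "WORKLOAD_ID": "amd-workload-id" },
        "ProfilerMetrics": { "all": true }
      }
    }
```

Consult the example `configmap.yml` in the repository for the exact data key and the full set of supported fields before deploying.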

docs/configuration/docker.md (2 additions, 2 deletions)

@@ -2,7 +2,7 @@
 To use a custom configuration with the AMD Device Metrics Exporter container:

-1. Create a config file based on the provided example [config.json](https://raw.githubusercontent.com/ROCm/device-metrics-exporter/refs/heads/main/example/config.json)
+1. Create a config file based on the provided example [config.json](../../example/config.json)
 2. Save `config.json` in the `config/` folder
 3. Mount the `config/` folder when starting the container:

@@ -13,7 +13,7 @@ docker run -d \
   -p 5000:5000 \
   -v ./config:/etc/metrics \
   --name device-metrics-exporter \
-  rocm/device-metrics-exporter:v1.3.1
+  rocm/device-metrics-exporter:v1.4.0
 ```

 The exporter polls for configuration changes every minute, so updates take effect without container restarts.
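The minute-interval reload described above can be sketched as an mtime-based polling loop. This is not the exporter's actual implementation; `maybe_reload` and its modification-time check are assumptions for illustration only:

```python
import json
import os


def maybe_reload(path, last_mtime, apply_config):
    """Re-read a JSON config file only when its modification time has changed.

    Call this on each poll tick (e.g. once a minute); it invokes
    `apply_config` with the parsed JSON only when the file changed,
    and returns the current mtime to pass back on the next poll.
    """
    mtime = os.stat(path).st_mtime
    if mtime != last_mtime:
        with open(path) as f:
            apply_config(json.load(f))
    return mtime
```

Because only the mtime is compared, an unchanged file costs a single `stat` per poll, which is why configuration updates can take effect without restarting the container.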

docs/installation/deb-package.rst (3 additions, 4 deletions)

@@ -53,13 +53,12 @@ Step 2: Install AMDGPU Driver
 .. note::
    For the most up-to-date information on installing dkms drivers, please see the `ROCm Install Quick Start <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html>`_ page. The instructions below are current as of ROCm 7.0.rc1.

-1. Download the driver from the Radeon repository (`repo.radeon.com <https://repo.radeon.com/amdgpu-install>`_) for your operating system. For example, if you want the latest ROCm 7.0.rc1 drivers for Ubuntu 22.04, you would run the following command:
+1. Download the driver from the Radeon repository (`repo.radeon.com <https://repo.radeon.com/amdgpu-install>`_) for your operating system. For example, if you want the latest ROCm 7.0.0 drivers for Ubuntu 22.04, you would run the following command:

 .. code-block:: bash

-   sudo mkdir --parents --mode=0755 /etc/apt/keyrings
-   wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null
-   echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/30.10_rc1/ubuntu jammy main" | sudo tee /etc/apt/sources.list.d/amdgpu.list
+   wget https://repo.radeon.com/amdgpu-install/7.0/ubuntu/jammy/amdgpu-install_7.0.70000-1_all.deb
+   sudo apt install ./amdgpu-install_7.0.70000-1_all.deb
    sudo apt update

 Please note that the above URL will differ depending on the driver version you are installing and the operating system you are using.

docs/installation/kubernetes-helm.md (4 additions, 0 deletions)

@@ -19,10 +19,14 @@ For Kubernetes environments, a Helm chart is provided for easy deployment.
 ```yaml
 platform: k8s
 nodeSelector: {} # Optional: Add custom nodeSelector
+tolerations: [] # Optional: Add custom tolerations
+kubelet:
+  podResourceAPISocketPath: /var/lib/kubelet/pod-resources
 image:
   repository: docker.io/rocm/device-metrics-exporter
   tag: v1.4.0
   pullPolicy: Always
+configMap: "" # Optional: Add custom configuration
 service:
   type: ClusterIP # or NodePort
 ClusterIP:
docs/releasenotes.md (25 additions, 16 deletions)

@@ -1,6 +1,6 @@
 # Release Notes

-## v1.4.0 (active development)
+## v1.4.0

 - **MI35x Platform Support**
   - Exporter now supports the MI35x platform with parity with the latest supported

@@ -29,21 +29,30 @@ ROCm 7.0 MI2xx, MI3xx
 ### Issues Fixed
-- fixed metric naming discrepancies between config field and exported field. The
-  following prometheues fields are updated:
-  - xgmi_neighbor_0_nop_tx -> gpu_xgmi_nbr_0_nop_tx
-  - xgmi_neighbor_1_nop_tx -> gpu_xgmi_nbr_1_nop_tx
-  - xgmi_neighbor_0_request_tx -> gpu_xgmi_nbr_0_req_tx
-  - xgmi_neighbor_0_response_tx -> gpu_xgmi_nbr_0_resp_tx
-  - xgmi_neighbor_1_response_tx -> gpu_xgmi_nbr_0_resp_tx
-  - xgmi_neighbor_1_response_tx -> gpu_xgmi_nbr_1_resp_tx
-  - xgmi_neighbor_0_beats_tx -> gpu_xgmi_nbr_0_beats_tx
-  - xgmi_neighbor_1_beats_tx -> gpu_xgmi_nbr_1_beats_tx
-  - xgmi_neighbor_0_tx_throughput -> gpu_xgmi_nbr_0_tx_thrput
-  - xgmi_neighbor_1_tx_throughput -> gpu_xgmi_nbr_1_tx_thrput
-  - xgmi_neighbor_2_tx_throughput -> gpu_xgmi_nbr_2_tx_thrput
-  - xgmi_neighbor_3_tx_throughput -> gpu_xgmi_nbr_3_tx_thrput
-  - xgmi_neighbor_4_tx_throughput -> gpu_xgmi_nbr_4_tx_thrput
-  - xgmi_neighbor_5_tx_throughput -> gpu_xgmi_nbr_5_tx_thrput
+- Fixed metric naming discrepancies between config fields and exported fields; the
+  following Prometheus field names have changed.
+  - One config field is renamed, which requires updating `config.json` from the
+    released branch: `pcie_nac_received_count` -> `pcie_nack_received_count`
+
+  | S.No | Old Field Name | New Field Name |
+  |------|----------------|----------------|
+  | 1 | xgmi_neighbor_0_nop_tx | gpu_xgmi_nbr_0_nop_tx |
+  | 2 | xgmi_neighbor_1_nop_tx | gpu_xgmi_nbr_1_nop_tx |
+  | 3 | xgmi_neighbor_0_request_tx | gpu_xgmi_nbr_0_req_tx |
+  | 4 | xgmi_neighbor_0_response_tx | gpu_xgmi_nbr_0_resp_tx |
+  | 5 | xgmi_neighbor_1_response_tx | gpu_xgmi_nbr_1_resp_tx |
+  | 6 | xgmi_neighbor_0_beats_tx | gpu_xgmi_nbr_0_beats_tx |
+  | 7 | xgmi_neighbor_1_beats_tx | gpu_xgmi_nbr_1_beats_tx |
+  | 8 | xgmi_neighbor_0_tx_throughput | gpu_xgmi_nbr_0_tx_thrput |
+  | 9 | xgmi_neighbor_1_tx_throughput | gpu_xgmi_nbr_1_tx_thrput |
+  | 10 | xgmi_neighbor_2_tx_throughput | gpu_xgmi_nbr_2_tx_thrput |
+  | 11 | xgmi_neighbor_3_tx_throughput | gpu_xgmi_nbr_3_tx_thrput |
+  | 12 | xgmi_neighbor_4_tx_throughput | gpu_xgmi_nbr_4_tx_thrput |
+  | 13 | xgmi_neighbor_5_tx_throughput | gpu_xgmi_nbr_5_tx_thrput |
+  | 14 | gpu_violation_vr_thermal_tracking_accumulated | gpu_violation_vr_thermal_residency_accumulated |
+  | 15 | pcie_nac_received_count | pcie_nack_received_count |
+  | 16 | gpu_violation_proc_hot_residency_accumulated | gpu_violation_processor_hot_residency_accumulated |
+  | 17 | gpu_violation_soc_thermal_residency_accumulated | gpu_violation_socket_thermal_residency_accumulated |
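For anyone updating Grafana dashboards or alert rules against these renames, the table can be applied mechanically. A minimal sketch (the mapping is transcribed from the table above; `migrate_field` is a hypothetical helper, not part of the exporter):

```python
# Old -> new Prometheus field names, transcribed from the v1.4.0 release notes.
RENAMED_FIELDS = {
    "xgmi_neighbor_0_nop_tx": "gpu_xgmi_nbr_0_nop_tx",
    "xgmi_neighbor_1_nop_tx": "gpu_xgmi_nbr_1_nop_tx",
    "xgmi_neighbor_0_request_tx": "gpu_xgmi_nbr_0_req_tx",
    "xgmi_neighbor_0_response_tx": "gpu_xgmi_nbr_0_resp_tx",
    "xgmi_neighbor_1_response_tx": "gpu_xgmi_nbr_1_resp_tx",
    "xgmi_neighbor_0_beats_tx": "gpu_xgmi_nbr_0_beats_tx",
    "xgmi_neighbor_1_beats_tx": "gpu_xgmi_nbr_1_beats_tx",
    "xgmi_neighbor_0_tx_throughput": "gpu_xgmi_nbr_0_tx_thrput",
    "xgmi_neighbor_1_tx_throughput": "gpu_xgmi_nbr_1_tx_thrput",
    "xgmi_neighbor_2_tx_throughput": "gpu_xgmi_nbr_2_tx_thrput",
    "xgmi_neighbor_3_tx_throughput": "gpu_xgmi_nbr_3_tx_thrput",
    "xgmi_neighbor_4_tx_throughput": "gpu_xgmi_nbr_4_tx_thrput",
    "xgmi_neighbor_5_tx_throughput": "gpu_xgmi_nbr_5_tx_thrput",
    "gpu_violation_vr_thermal_tracking_accumulated": "gpu_violation_vr_thermal_residency_accumulated",
    "pcie_nac_received_count": "pcie_nack_received_count",
    "gpu_violation_proc_hot_residency_accumulated": "gpu_violation_processor_hot_residency_accumulated",
    "gpu_violation_soc_thermal_residency_accumulated": "gpu_violation_socket_thermal_residency_accumulated",
}


def migrate_field(name: str) -> str:
    """Map a pre-v1.4.0 field name to its new name; unchanged names pass through."""
    return RENAMED_FIELDS.get(name, name)
```

Running every field name referenced in a dashboard through `migrate_field` leaves already-current names untouched, so the migration is idempotent.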
## v1.3.1
