
Releases: ROCm/gpu-operator

gpu_operator_helm_chart_v1.3.0

10 Jun 00:11

GPU Operator v1.3.0 Release Notes

The AMD GPU Operator v1.3.0 release introduces new features, the most notable of which is support for GPU partitioning on MI300-series GPUs via the new Device Config Manager component.

Release Highlights

  • Support for configuring, scheduling & monitoring GPU compute & memory partitions

    A new component called Device-Config-Manager enables configuration of GPU partitions.

  • XGMI & PCIE Topology Aware GPU workload scheduling

    Local topology-aware scheduling has been implemented to prioritize scheduling GPUs and GPU partitions from the same node or same GPU if possible.

  • Device-Metrics-Exporter enhancements

    a. New exporter metrics generated by the ROCm profiler
    b. Ability to add user-defined Pod labels to the exported metrics
    c. Ability to prefix all metric names with a user-defined prefix

  • DeviceConfig creation via Helm install

    Installing the GPU Operator via Helm will now also install a default DeviceConfig custom resource. Use the --set crds.defaultCR.install=false flag to skip this during the Helm install if desired (see the example below).
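
    For example, a minimal sketch of a Helm install that skips the default DeviceConfig; the repository URL, chart name, release name, and namespace below are illustrative assumptions, so substitute the values from the install documentation:

      # Add the Helm repository (URL is an assumption; use the one from the install docs)
      helm repo add rocm https://rocm.github.io/gpu-operator
      helm repo update

      # Install the operator while skipping creation of the default DeviceConfig CR
      helm install amd-gpu-operator rocm/gpu-operator-charts \
        --namespace kube-amd-gpu --create-namespace \
        --set crds.defaultCR.install=false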

Platform Support

  • No new platform support has been added in this release. While the GPU Operator supports Red Hat OpenShift, the newly introduced features in this release (GPU partitioning via the Device Config Manager, topology-aware GPU scheduling, and the Device Metrics Exporter enhancements) are currently only available for vanilla Kubernetes deployments. These features are not yet supported on OpenShift, and OpenShift support will be introduced in the next minor release.

Documentation Updates

  • Updated Release notes detailing new features in v1.3.0.
  • New section added describing the new Device Config Manager component responsible for configuring GPU partitions on GPU worker nodes.
  • Updated GPU Operator install instructions to cover the default DeviceConfig custom resource that is now created during install and how to skip installing it if desired.

Known Limitations

  1. The Device Config Manager is currently only supported on Kubernetes and requires running a Docker container if you wish to run it in standalone mode (without Kubernetes).

    • Impact: Users wishing to use a standalone version of the Device Config Manager will need to run a standalone Docker image and configure the partitions using a config.json file.
    • Root Cause: DCM does not currently support standalone installation via a Debian package like other standalone components of the GPU Operator. A Debian package to support standalone bare-metal installations will be added in the next release of DCM.
    • Recommendation: Those wishing to use GPU partitioning in a bare-metal environment should instead use the standalone Docker image for DCM. Alternatively, users can use amd-smi to change partitioning modes; see the amdgpu-docs documentation for details.
  2. The GPU Operator will report an error when the specified ROCm driver install version doesn't match the version string in the Radeon repository.

    • Impact: The DeviceConfig will report an error if you specify "6.4.0" or "6.3.0" for spec.driver.version.
    • Root Cause: The version specified in the CR must match the version string published in the Radeon repository, which uses "6.4" and "6.3" rather than the full patch version.
    • Recommendation: Although this will be fixed in a future version of the GPU Operator, for the time being you will instead need to specify "6.4" or "6.3" when installing those versions of the ROCm amdgpu driver (see the example below).
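
    For example, a hedged sketch of setting the driver version on an existing DeviceConfig; the resource name gpu-operator and namespace kube-amd-gpu are taken from the example later in these notes and may differ in your cluster:

      # Point spec.driver.version at the Radeon repo version string ("6.4", not "6.4.0")
      kubectl patch deviceconfig gpu-operator -n kube-amd-gpu --type='merge' \
        -p '{"spec":{"driver":{"version":"6.4"}}}'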

Note: All current and historical limitations for the GPU Operator, including their latest statuses and any associated workarounds or fixes, are tracked in the following documentation page: Known Issues and Limitations.
Please refer to this page regularly for the most up-to-date information.

gpu_operator_helm_chart_v1.2.2

09 Jun 23:51

GPU Operator v1.2.2 Release Notes

The AMD GPU Operator v1.2.2 release introduces new features to support the Device Metrics Exporter's integration with the Prometheus Operator via the ServiceMonitor custom resource, along with several bug fixes.

Release Highlights

  • Enhanced Metrics Integration with Prometheus Operator

    This release introduces a streamlined method for integrating the metrics endpoint of the metrics exporter with the Prometheus Operator.

    Users can now leverage the DeviceConfig custom resource to specify the necessary configuration for metrics collection. The GPU Operator will automatically read the relevant DeviceConfig and manage the creation and lifecycle of a corresponding ServiceMonitor custom resource.

    This automation simplifies the process of exposing metrics to the Prometheus Operator, allowing for easier scraping and monitoring of GPU-related metrics within your Kubernetes environment.
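
    As a quick sanity check after configuring the DeviceConfig, you can confirm that the operator created the ServiceMonitor; the kube-amd-gpu namespace below is an assumption, so adjust it to wherever the operator is installed (the Prometheus Operator CRDs must already be present for this resource type to exist):

      # List the ServiceMonitor objects managed by the operator
      kubectl get servicemonitors.monitoring.coreos.com -n kube-amd-gpu
      # Inspect the generated scrape configuration
      kubectl describe servicemonitors.monitoring.coreos.com -n kube-amd-gpu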

Documentation Updates

Known Limitations

Note: All current and historical limitations for the GPU Operator, including their latest statuses and any associated workarounds or fixes, are tracked in the following documentation page: Known Issues and Limitations.
Please refer to this page regularly for the most up-to-date information.

Fixes

  1. Node labeller failed to report node labels when using a DeviceConfig with spec.driver.enable=false and a customized node selector in spec.selector [#183]

    • Issue: When using the inbox driver, users set spec.driver.enable=false within the DeviceConfig spec. If they also use a customized node selector in spec.selector, the node labeller's GPU property labels did not show up among the Node resource labels once the node labeller was brought up.
    • Root Cause: When spec.driver.enable=false is used together with a customized, non-default spec.selector, the operator controller manager was using the wrong selector to clean up the node labeller's labels on non-GPU nodes.
    • Resolution: This issue has been fixed in v1.2.2. Users can upgrade to v1.2.2, and the GPU property node labels will show up once the node labeller is brought up again.
  2. User-defined node labels under the amd.com domain are unexpectedly removed [#151]

    • Issue: When users created node labels under the amd.com domain (e.g. amd.com/gpu: "true") for their own use, those labels were unexpectedly removed during bootstrapping.
    • Root Cause:
      • When the node labeller pod launched, it removed all node labels within amd.com and beta.amd.com from the current node before posting the labels it manages.
      • When the operator executed its reconcile function, the removal of the DevicePlugin or node labeller would remove all node labels under the amd.com or beta.amd.com domain, even those not managed by the node labeller.
    • Resolution: This issue has been fixed in v1.2.2 on both the operator and node labeller side. Users can upgrade to the v1.2.2 operator Helm chart and use the latest node labeller image; then only node-labeller-managed labels will be auto-removed. Other user-defined labels under amd.com or beta.amd.com won't be auto-removed by the operator or node labeller.
  3. During automatic driver upgrade, nodes can get stuck in reboot-in-progress

    • Issue: When users upgrade the driver version using the DeviceConfig automatic upgrade feature with spec.driver.upgradePolicy.enable=true and spec.driver.upgradePolicy.rebootRequired=true, some nodes may get stuck in the reboot-in-progress state.
    • Root Cause:
      • The upgrade manager was checking the generation ID of the DeviceConfig to make sure any spec change during an upgrade would not interfere with the upgrade in progress. However, if the CR changed even in parts of the DeviceConfig spec unrelated to the upgrade, this check became a problem, as a new driver upgrade would not start after such unrelated CR changes.
      • When a node rebooted during the driver upgrade, the controller manager pod could also be affected and rescheduled to another node. When it came back, it checked for reboot-in-progress during its init phase and attempted to delete the reboot pod, but the reboot pod may have already terminated by then.
    • Resolution: The controller manager's upgrade manager module has been patched to fix this issue in release v1.2.2; upgrading to the new controller manager image resolves it.



gpu_operator_helm_chart_v1.2.1

20 Apr 18:43

GPU Operator v1.2.1 Release Notes

The GPU Operator v1.2.1 release adds support for Red Hat OpenShift versions 4.16, 4.17 and 4.18. The AMD GPU Operator v1.2.1 has gone through a rigorous validation process and is certified for use on OpenShift. It can be deployed via the Red Hat Catalog.

Release Highlights

  • The AMD GPU Operator has now been certified for use with Red Hat OpenShift v4.16, v4.17 and v4.18

New Platform Support

  • Red Hat OpenShift 4.16-4.18
    • Supported features:
      • GPU Health Monitoring
      • Automatic Driver Upgrades
      • Test Runner for GPU Diagnostics
    • Requirements: Red Hat OpenShift version 4.16, 4.17 or 4.18

gpu_operator_helm_chart_v1.2.0

22 Feb 18:01

GPU Operator v1.2.0 Release Notes

The GPU Operator v1.2.0 release introduces significant new features, including GPU health monitoring, automated component and driver upgrades, and a test runner for enhanced validation and troubleshooting. These improvements aim to increase reliability, streamline upgrades, and provide enhanced visibility into GPU health.

Release Highlights

  • GPU Health Monitoring

    • Real-time health checks via metrics exporter
    • Integration with the Kubernetes Device Plugin for automatic removal of unhealthy GPUs from a compute node's schedulable resources (see the check after this list)
    • Customizable health thresholds via K8s ConfigMaps
  • GPU Operator and Automated Driver Upgrades

    • Automatic and manual upgrades of the device plugin, node labeller, test runner and metrics exporter via configurable upgrade policies
    • Automatic driver upgrades are now supported, with node cordon, drain, version tracking, and optional node reboot
  • Test Runner for GPU Diagnostics

    • Automated testing of unhealthy GPUs
    • Pre-start job tests embedded in workload pods
    • Manual and scheduled GPU tests with event logging and result tracking
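
As a rough check on the GPU health monitoring highlighted above, you can compare a node's reported GPU capacity and allocatable counts; an unhealthy GPU removed from the schedulable resources typically shows up as a lower allocatable value (the node name is a placeholder):

  # Total amd.com/gpu devices registered on the node
  kubectl get node <node-name> -o jsonpath='{.status.capacity.amd\.com/gpu}{"\n"}'
  # Devices currently considered healthy and schedulable
  kubectl get node <node-name> -o jsonpath='{.status.allocatable.amd\.com/gpu}{"\n"}'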

Platform Support

  • No new platform support has been added in this release. While the GPU Operator now supports OpenShift 4.17, the newly introduced features in this release (GPU Health Monitoring, Automatic Driver & Component Upgrade, and Test Runner) are currently only available for vanilla Kubernetes deployments. These features are not yet supported on OpenShift, and OpenShift support will be introduced in the next minor release.

Documentation Updates

Known Limitations

  1. Incomplete Cleanup on Manual Module Removal

    • Impact: When AMD GPU drivers are manually removed (instead of using the operator for uninstallation), not all GPU modules are cleaned up completely.
    • Recommendation: Always use the GPU Operator for installing and uninstalling drivers to ensure complete cleanup.
  2. Inconsistent Node Detection on Reboot

    • Impact: In some reboot sequences, Kubernetes fails to detect that a worker node has reloaded, which prevents the operator from installing the updated driver. This happens when the node reboots and comes back online too quickly, before the default node status check interval of 50s has elapsed.
    • Recommendation: Consider tuning the kubelet and controller-manager flags (such as --node-status-update-frequency=10s, --node-monitor-grace-period=40s, and --node-monitor-period=5s) to improve node status detection. Refer to Kubernetes documentation for more details.
  3. Inconsistent Metrics Fetch Using NodePort

    • Impact: When accessing metrics via a NodePort service (NodeIP:NodePort) from within a cluster, Kubernetes' built-in load balancing may sometimes route requests to different pods, leading to occasional inconsistencies in the returned metrics. This behavior is inherent to Kubernetes networking and is not a defect in the GPU Operator.
    • Recommendation: Only use the internal Pod IP and configured pod port (default: 5000) when retrieving metrics from within the cluster instead of the NodePort (see the sketch after this list). Refer to the Metrics Exporter documentation section for more details.
  4. Helm Install Fails if GPU Operator Image is Unavailable

    • Impact: If the image provided via --set controllerManager.manager.image.repository and --set controllerManager.manager.image.tag does not exist, the controller manager pod may enter a CrashLoopBackOff state and hinder uninstallation unless --no-hooks is used.
    • Recommendation: Ensure that the correct GPU Operator controller image is available in your registry before installation. To uninstall the operator after seeing an ErrImagePull error, use --no-hooks to bypass all pre-uninstall Helm hooks.
  5. Driver Reboot Requirement with ROCm 6.3.x

    • Impact: While using ROCm 6.3+ drivers, the operator may not complete the driver upgrade properly unless a node reboot is performed.
    • Recommendation: Manually reboot the affected nodes after the upgrade to complete driver installation. Alternatively, we recommend setting rebootRequired to true in the upgrade policy for driver upgrades. This ensures that a reboot is triggered after the driver upgrade, guaranteeing that the new driver is fully loaded and applied. This workaround should be used until the underlying issue is resolved in a future release.
  6. Driver Upgrade Timing Issue

    • Impact: During an upgrade, if a node's ready status fluctuates (e.g., from Ready to NotReady to Ready) before the driver version label is updated by the operator, the old driver might remain installed. The node might continue running the previous driver version even after an upgrade has been initiated.
    • Recommendation: Ensure nodes are fully stable before triggering an upgrade, and if necessary, manually update node labels to enforce the new driver version. Refer to driver upgrade documentation for more details.
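
As a sketch of the Pod-IP-based metrics fetch recommended in item 3 above, something like the following should work from inside the cluster; the kube-amd-gpu namespace is an assumption, and 5000 is the default pod port mentioned above:

  # Find the metrics exporter pod and its Pod IP (check the actual pod name and labels in your cluster)
  kubectl get pods -n kube-amd-gpu -o wide
  # From a pod or node inside the cluster, scrape the metrics endpoint directly
  curl http://<pod-ip>:5000/metrics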

Fixes

  1. Driver Upgrade Failure with Exporter Enabled
    • Previously, enabling the exporter alongside the operator caused driver upgrades to fail.
    • Status: This issue has been fixed in v1.2.0.



gpu_operator_helm_chart_v1.1.0

04 Feb 01:57

GPU Operator v1.1.0 Release Notes

The GPU Operator v1.1.0 release adds support for Red Hat OpenShift versions 4.16 and 4.17. The AMD GPU Operator has gone through a rigorous validation process and is now certified for use on OpenShift. It can now be deployed via the Red Hat Catalog.

The latest AMD GPU Operator OLM Bundle for OpenShift is tagged with version v1.1.1 as the operator image has been updated to include a minor driver fix.

Release Highlights

  • The AMD GPU Operator has now been certified for use with Red Hat OpenShift v4.16 and v4.17
  • Updated documentation with installation and configuration steps for Red Hat OpenShift

Platform Support

New Platform Support

  • Red Hat OpenShift 4.16-4.17
    • Supported features:
      • Driver management
      • Workload scheduling
      • Metrics monitoring
    • Requirements: Red Hat OpenShift version 4.16 or 4.17

Known Limitations

  1. Due to an issue with KMM 2.2, deletion of the DeviceConfig Custom Resource gets stuck in Red Hat OpenShift
    • Impact: The DeviceConfig Custom Resource cannot be deleted if the node reboots during uninstall.
    • Affected Configurations: This issue only affects Red Hat OpenShift
    • Workaround: This issue will be fixed in the next release of KMM. For the time being, you can use a version of KMM other than 2.2, or manually remove the status from the NMC:
    1. List all the NMC resources and pick the correct NMC (there is one NMC per node, named the same as the node it relates to).

      oc get nmc -A
    2. Edit the NMC.

      oc edit nmc <nmc name>
    3. Remove all the data related to your module from the NMC status and save. That should allow the module to finally be deleted (a hedged sketch follows below).
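
      If editing the status directly proves awkward, a hedged alternative is a JSON patch against the NMC status subresource; the path below assumes your module's entry is the first item under status.modules, so verify the exact location first:

        # Inspect the NMC status to locate your module's entry
        oc get nmc <nmc name> -o yaml
        # Remove the module's entry from the status (index 0 is an assumption; adjust as needed)
        oc patch nmc <nmc name> --subresource=status --type=json \
          -p='[{"op":"remove","path":"/status/modules/0"}]'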



gpu_operator_helm_chart_v1.0.0

15 Nov 20:35

AMD GPU Operator v1.0.0 Release Notes

This release is the first major release of AMD GPU Operator. The AMD GPU Operator simplifies the deployment and management of AMD Instinct™ GPU accelerators within Kubernetes clusters. This project enables seamless configuration and operation of GPU-accelerated workloads, including machine learning, Generative AI, and other GPU-intensive applications.

Release Highlights

  • Manage AMD GPU drivers with desired versions on Kubernetes cluster nodes
  • Customized scheduling of AMD GPU workloads within Kubernetes cluster
  • Metrics and statistics monitoring solution for AMD GPU hardware and workloads
  • Support for specialized networking environments such as HTTP proxy or air-gapped networks

Hardware Support

New Hardware Support

  • AMD Instinct™ MI300

    • Required driver version: ROCm 6.2+
  • AMD Instinct™ MI250

    • Required driver version: ROCm 6.2+
  • AMD Instinct™ MI210

    • Required driver version: ROCm 6.2+

Platform Support

New Platform Support

  • Kubernetes 1.29+
    • Supported features:
      • Driver management
      • Workload scheduling
      • Metrics monitoring
    • Requirements: Kubernetes version 1.29+

Breaking Changes

Not Applicable as this is the initial release.

New Features


  • Driver management

    • Managed Driver Installations: Users can install the ROCm 6.2+ DKMS driver on Kubernetes worker nodes, or optionally choose to use an inbox or pre-installed driver on the worker nodes
    • DeviceConfig Custom Resource: Users can configure a new DeviceConfig CRD (Custom Resource Definition) to define the driver management behavior of the GPU Operator
  • GPU Workload Scheduling

    • Custom Resource Allocation "amd.com/gpu": After deployment of the GPU Operator, a new allocatable resource, amd.com/gpu, will be present on each GPU node, listing the allocatable GPU resources against which GPU workloads can be scheduled
    • Assign Multiple GPUs: Users can easily specify the number of AMD GPUs required by each workload in the deployment/pod spec, and the Kubernetes scheduler will automatically take care of assigning the correct GPU resources (see the example after this list)
  • Metrics Monitoring for GPUs and Workloads:

    • Out-of-box Metrics: Users can optionally enable the AMD Device Metrics Exporter when installing the AMD GPU Operator, providing a robust out-of-box monitoring solution for Prometheus to consume
    • Custom Metrics Configurations: Users can utilize a configmap to customize the configuration and behavior of Device Metrics Exporter
  • Specialized Network Setups:

    • Air-gapped Installation: Users can install the GPU Operator in a secure air-gapped environment where the Kubernetes cluster has no external network connectivity
    • HTTP Proxy Support: The AMD GPU Operator supports usage within a Kubernetes cluster that is behind an HTTP proxy. HTTPS proxy support will be added in a future release.
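
To illustrate the amd.com/gpu resource and multi-GPU assignment described above, here is a minimal sketch of a pod requesting one AMD GPU; the pod name, image, and command are illustrative assumptions, and any ROCm-capable image should work:

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: rocm-smi-test                  # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: rocm
    image: rocm/rocm-terminal:latest   # assumed ROCm-capable image
    command: ["rocm-smi"]              # prints the GPUs visible to the container
    resources:
      limits:
        amd.com/gpu: 1                 # request one AMD GPU; increase the count to assign multiple GPUs
EOF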

Known Limitations

  1. GPU Operator driver installs only the DKMS package

    • Impact: Applications which require ROCm packages will need to install the respective packages themselves.
    • Affected Configurations: All configurations
    • Workaround: None, as this is the intended behavior; other ROCm software packages should be managed inside the containers/workloads running on the cluster
  2. When using the Operator to install amdgpu 6.1.3/6.2, a reboot is required to complete the install

    • Impact: The node requires a reboot when the upgrade is initiated, due to a ROCm bug. Driver install failures may be seen in dmesg
    • Affected configurations: Nodes with driver version >= ROCm 6.2.x
    • Workaround: Manually reboot the upgraded nodes to finish the driver install. This has been fixed in ROCm 6.3+
  3. GPU Operator is unable to install the amdgpu driver if an existing driver is already installed

    • Impact: Driver install will fail if the amdgpu in-box driver is present/already installed
    • Affected Configurations: All configurations
    • Workaround: When installing the amdgpu drivers using the GPU Operator, worker nodes should have amdgpu blacklisted, or amdgpu drivers should not be pre-installed on the node. Blacklist the in-box driver so that it is not loaded, or remove the pre-installed driver
  4. When the GPU Operator is used in skip-driver-install mode, if the amdgpu module is removed while the device plugin is installed, the active GPUs available on the server will not be reflected

    • Impact: Workload scheduling is impacted, as workloads may be scheduled on nodes which do not have an active GPU.
    • Affected Configurations: All configurations
    • Workaround: Restart the deployed device plugin pod.
  5. Worker nodes where the kernel needs to be upgraded must be taken out of the cluster and re-added with the Operator installed

    • Impact: The node upgrade will not proceed automatically and requires manual intervention
    • Affected Configurations: All configurations
    • Workaround: Manually mark the node as unschedulable, preventing new pods from being scheduled on it, by cordoning it off:
      kubectl cordon <node-name>
      
  6. When the GPU Operator is installed with the Metrics Exporter enabled, driver upgrade is blocked because the exporter is actively using the amdgpu module

    • Impact: Driver upgrade is blocked

    • Affected Configurations: All configurations

    • Workaround: Disable the Metrics Exporter to allow the driver upgrade by updating the gpu-operator DeviceConfig in the kube-amd-gpu namespace, setting the metrics exporter's enable field to false:

      kubectl patch deviceconfig gpu-operator -n kube-amd-gpu --type='json' -p='[{"op": "replace", "path": "/spec/metricsExporter/enable", "value": false}]'
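
      Once the driver upgrade has completed, the exporter can be re-enabled with the inverse patch:

      kubectl patch deviceconfig gpu-operator -n kube-amd-gpu --type='json' -p='[{"op": "replace", "path": "/spec/metricsExporter/enable", "value": true}]'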