gpu_operator_helm_chart_v1.3.0

GPU Operator v1.3.0 Release Notes

The AMD GPU Operator v1.3.0 release introduces several new features, the most notable of which is support for GPU partitioning on MI300-series GPUs via the new Device Config Manager component.

Release Highlights

  • Support for configuring, scheduling & monitoring GPU compute & memory partitions

    A new component called the Device Config Manager (DCM) enables configuration of GPU partitions; a configuration sketch follows this list.

  • XGMI & PCIe Topology-Aware GPU workload scheduling

    Local topology-aware scheduling has been implemented to prioritize scheduling GPUs and GPU partitions from the same node, or from the same physical GPU, where possible.

  • Device-Metrics-Exporter enhancements

    a. New exporter metrics generated by the ROCm profiler
    b. Ability to add user-defined Pod labels to the exported metrics
    c. Ability to prefix all metric names with a user-defined prefix (see the example output after this list)

  • DeviceConfig creation via Helm install

    Installing the GPU Operator via Helm will now also install a default DeviceConfig custom resource. Pass the --set crds.defaultCR.install=false flag during the Helm install to skip this if desired; a command sketch follows this list.
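
For the partitioning highlight above, a minimal sketch of a DCM partition profile is shown below. These notes do not spell out the schema, so the field names and values here (compute partition "CPX", memory partition "NPS4") are assumptions based on the config.json mechanism mentioned under Known Limitations; check the Device Config Manager documentation for the authoritative format.

```json
{
  "gpu-config-profiles": {
    "cpx-nps4": {
      "profiles": [
        {
          "computePartition": "CPX",
          "memoryPartition": "NPS4",
          "numGPUsAssigned": 8
        }
      ]
    }
  }
}
```

On Kubernetes, a profile like this would typically be delivered to DCM via a ConfigMap and selected per node.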
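
To make the Device-Metrics-Exporter enhancements concrete, the excerpt below shows what scraped output could look like with a user-defined prefix and an added Pod label. The metric names, prefix, and label values are illustrative assumptions, not captured exporter output.

```
# Hypothetical /metrics excerpt: names prefixed with "myorg_", user-defined Pod label "team" added
myorg_gpu_gfx_activity{gpu_id="0",pod="trainer-0",team="ml-infra"} 87
myorg_gpu_memory_used_bytes{gpu_id="0",pod="trainer-0",team="ml-infra"} 6.4e+10
```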
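
A minimal sketch of a Helm install that skips the default DeviceConfig follows. The --set crds.defaultCR.install=false flag comes from these notes; the release name, chart reference, and namespace are illustrative assumptions.

```bash
# Install the GPU Operator without the default DeviceConfig custom resource
helm install amd-gpu-operator rocm/gpu-operator-charts \
  --namespace kube-amd-gpu --create-namespace \
  --set crds.defaultCR.install=false
```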

Platform Support

  • No new platform support has been added in this release. While the GPU Operator now supports OpenShift 4.17, the newly introduced features in this release (GPU Health Monitoring, Automatic Driver & Component Upgrade, and Test Runner) are currently only available for vanilla Kubernetes deployments. These features are not yet supported on OpenShift, and OpenShift support will be introduced in the next minor release.

Documentation Updates

  • Updated release notes detailing the new features in v1.3.0.
  • New section describing the Device Config Manager component, which is responsible for configuring GPU partitions on GPU worker nodes.
  • Updated GPU Operator install instructions to cover the default DeviceConfig custom resource that gets created and how to skip installing it if desired.

Known Limitations

  • The Device Config Manager (DCM) is currently only supported on Kubernetes. For the time being:
  1. The Device Config Manager requires running a Docker container if you wish to run it in standalone mode (without Kubernetes).

    • Impact: Users wishing to run a standalone version of the Device Config Manager will need to run a standalone Docker image and configure the partitions using a config.json file; see the sketch following this list.
    • Root Cause: DCM does not currently support standalone installation via a Debian package like other standalone components of the GPU Operator. A Debian package to support standalone bare metal installations will be added in the next release of DCM.
    • Recommendation: Those wishing to use GPU partitioning in a bare metal environment should use the standalone Docker image for DCM. Alternatively, users can use amd-smi to change partitioning modes; see the amdgpu-docs documentation for how to do this.
  2. The GPU Operator will report an error when the installed ROCm driver version doesn't match the version string in the Radeon repository.

    • Impact: The DeviceConfig will report an error if you specify "6.4.0" or "6.3.0" for spec.driver.version.
    • Root Cause: The version specified in the CR must exactly match the version string in the Radeon repository, which uses two-part version strings.
    • Recommendation: Although this will be fixed in a future version of the GPU Operator, for the time being you will need to specify "6.4" or "6.3" when installing those versions of the ROCm amdgpu driver; see the DeviceConfig sketch following this list.
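
For limitation 1, a hypothetical standalone invocation is sketched below. These notes only state that standalone mode uses a Docker image configured via a config.json file; the image name, tag, and paths are assumptions, so consult the DCM documentation for the real values.

```bash
# Hypothetical standalone DCM run; image name, tag, and config path are assumptions
docker run -d --privileged \
  -v /opt/dcm/config.json:/etc/device-config-manager/config.json \
  rocm/device-config-manager:latest
```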
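
For limitation 2, the sketch below shows the version workaround in context. Only spec.driver.version and the two-part version value come from these notes; the apiVersion, kind, and remaining fields are illustrative assumptions.

```yaml
apiVersion: amd.com/v1alpha1    # assumed API group/version
kind: DeviceConfig
metadata:
  name: amd-gpu-deviceconfig    # assumed name
  namespace: kube-amd-gpu       # assumed namespace
spec:
  driver:
    enable: true
    version: "6.4"              # use "6.4", not "6.4.0", until the version matching is fixed
```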

Note: All current and historical limitations for the GPU Operator, including their latest statuses and any associated workarounds or fixes, are tracked in the following documentation page: Known Issues and Limitations.
Please refer to this page regularly for the most up-to-date information.