diff --git a/soperator/README.md b/soperator/README.md index 30260f64..d3ec0857 100644 --- a/soperator/README.md +++ b/soperator/README.md @@ -1,33 +1,32 @@ -# Terraform recipe to create Slurm cluster on K8s with [Soperator](https://github.com/nebius/soperator) in Nebius +# Slurm cluster on K8s with [Soperator](https://github.com/nebius/soperator) ## Overview -This solution allows you to create a Slurm cluster in Kubernetes with a single terraform apply. +This Terraform recipe allows you to create a Slurm cluster in Kubernetes with a single `terraform apply`. -Running Slurm in Kubernetes using this operator brings several features and possibilities. +Running Slurm on Kubernetes using the Soperator brings several features and possibilities. -### Easy scaling +### 📈 Easy scaling You can scale the Slurm cluster up or down without the need to bootstrap new nodes from scratch. -### High availability +### 🩹 High availability -K8s provides some self-healing out of the box: Slurm nodes represented as K8S pods are automatically restarted in case -of problems. +K8s provides some self-healing out of the box: Slurm nodes represented as K8S pods are automatically restarted in case of problems. -### Shared root filesystem +### 🔄 Shared root filesystem When users interact with Slurm, they see a single shared persistent storage as the root directory on each Slurm node. -This frees users from the Slurm requirement that is very difficult to achieve: all nodes must be identical. Because of -the storage, users don't need to manually synchronise all software versions and Linux UIDs & GIDs among the nodes. +This frees users from the Slurm requirement that is very difficult to achieve: all nodes must be identical. +Because of the storage, users don't need to manually synchronise all software versions and Linux UIDs & GIDs among the nodes. -### Protection against accidental Slurm breakage +### 🪖 Protection against accidental Slurm breakage -Users connect to login nodes and execute jobs on worker nodes not on the system where Slurm daemons are running, but in -a special isolated environment from which it's almost impossible to accidentally break Slurm. +Users connect to the login nodes and execute jobs on worker nodes not on the system where Slurm daemons are running, +but in a special isolated environment from which it's almost impossible to accidentally break Slurm. In addition, GPU drivers and libraries are mounted from K8s nodes so users can't irreversibly break them. -### Periodic GPU health checks +### 🩺 Periodic GPU health checks NCCL tests are periodically launched on all Slurm workers, and nodes that show unsatisfactory results are drained. These checks are implemented as usual Slurm jobs - they stay in the same queue with users' workload and don't interfere it. @@ -43,253 +42,374 @@ These checks are implemented as usual Slurm jobs - they stay in the same queue w ## Prerequisites -### Get your own copy +Make sure you have the following programs installed on your machine. -In order to not mess with example recipe, make your own copy of [example directory](installations/example): -```bash -mkdir installations/ +- [Terraform CLI](https://developer.hashicorp.com/terraform/install) -cd installations/ + > [!IMPORTANT] + > The minimum version of Terraform needed for this recipe is `1.8.0`. -cp -r ../examples/ ./ -``` + ```console + $ terraform version + Terraform v1.9.8 + on darwin_arm64 + ... + ``` -> [!NOTE] -> Following steps will be described as you work in terminal within that new directory. 
+- [Nebius CLI](https://docs.nebius.ai/cli/install) -### JQ + ```console + $ nebius version + 0.11.2 + ``` -Install [jq](https://jqlang.github.io/jq/download/). + [Authorize it](https://docs.nebius.com/cli/configure/#authorize-with-a-user-account) with a user account. -### Nebius CLI +- [kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) -Install and initialize [Nebius CLI](https://docs.nebius.ai/cli/install). + ```console + $ kubectl version + Client Version: v1.31.1 + ... + ``` -### Keeping state in remote Storage +- [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) -In order to store Terraform state remotely in Nebius Object Storage, Terraform must be able to connect to it. -We'll use [service account](https://docs.nebius.ai/iam/service-accounts/manage/) for that purpose. + ```console + $ aws --version + aws-cli/2.17.20 Python/3.11.9 Darwin/23.6.0 exe/x86_64 + ``` -Let's start with exporting your tenant and project IDs for a further use. +- [jq](https://jqlang.github.io/jq/download/) -> [!TIP] -> We suggest you to replace checks for `NEBIUS_TENANT_ID` and `NEBIUS_PROJECT_ID` in provided [`.envrc`](installations/example/.envrc) file -> with the following: -> -> ```bash -> # -------------------- -> # Automatic retrieving -> # -------------------- -> NEBIUS_TENANT_ID=$(nebius iam tenant list \ -> --format json \ -> | jq -r ".items.[0].metadata.id") -> export NEBIUS_TENANT_ID -> -> NEBIUS_PROJECT_ID=$(nebius iam project list \ -> --parent-id "${NEBIUS_TENANT_ID}" \ -> --format json \ -> | jq -r ".items.[0].metadata.id") -> export NEBIUS_PROJECT_ID -> -> # --------------- -> # OR specific IDs -> # --------------- -> export NEBIUS_TENANT_ID='' -> export NEBIUS_PROJECT_ID='' -> ``` + ```console + $ jq --version + jq-1.7.1 + ``` -#### Service account - -1. Create service account - - ```bash - NEBIUS_SA_TERRAFORM_ID=$(nebius iam service-account create \ - --parent-id "${NEBIUS_PROJECT_ID}" \ - --name 'slurm-terraform-sa' \ - --format json | jq -r '.metadata.id') - - export NEBIUS_SA_TERRAFORM_ID - ``` - -2. Add this account to the `editors` group - - ```bash - # Getting ID of the 'editors' group - NEBIUS_GROUP_EDITORS_ID=$(nebius iam group get-by-name \ - --parent-id "${NEBIUS_TENANT_ID}" \ - --name 'editors' \ - --format json | jq -r '.metadata.id') - - export NEBIUS_GROUP_EDITORS_ID - - # Adding SA to the 'editors' group - nebius iam group-membership create \ - --parent-id "${NEBIUS_GROUP_EDITORS_ID}" \ - --member-id "${NEBIUS_SA_TERRAFORM_ID}" - ``` - -3. Create a key pair for giving AWS CLI a way to access Storage with the service account - - ```bash - NEBIUS_SA_ACCESS_KEY_ID=$(nebius iam access-key create \ - --parent-id "${NEBIUS_PROJECT_ID}" \ - --name 'slurm-terraform-sa-access-key' \ - --account-service-account-id "${NEBIUS_SA_TERRAFORM_ID}" \ - --description 'AWS CLI key' \ - --format json | jq -r '.resource_id') - - export NEBIUS_SA_ACCESS_KEY_ID - ``` - -#### AWS CLI - -1. [Install](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) `aws` -2. 
Add the key, the Nebius AI region ID and the Object Storage endpoint URL to the AWS CLI configuration - - ```bash - aws configure set aws_access_key_id "${NEBIUS_SA_ACCESS_KEY_AWS_ID}" - - aws configure set aws_secret_access_key "${NEBIUS_SA_SECRET_ACCESS_KEY}" - - aws configure set region 'eu-north1' - - aws configure set endpoint_url 'https://storage.eu-north1.nebius.cloud:443' - ``` - -#### Bucket +- `md5sum` -```bash -NEBIUS_BUCKET_NAME="tfstate-slurm-k8s-$(echo -n "${NEBIUS_TENANT_ID}-${NEBIUS_PROJECT_ID}" | md5sum | awk '$0=$1')" + We use `md5sum` utility to generate unique S3 bucket IDs. + + `md5sum` is often pre-installed on most of Unix-like OSs. Ensure that you have it installed on your machine. + + ```shell + which md5sum + ``` + + > [!TIP] + > To install `md5sum` on macOS, you have to install GNU coreutils that includes it. + > ```shell + > brew install coreutils + > ``` -nebius storage bucket create --parent-id "${NEBIUS_PROJECT_ID}" --versioning-policy 'enabled' --name "${NEBIUS_BUCKET_NAME}" -``` +- [direnv](https://direnv.net/#basic-installation) -> [!NOTE] -> `NEBIUS_BUCKET_NAME` contains unique bucket name dedicated to the project inside your tenant. + `direnv` is a tool for automatic loading of directory-scoped environment variables. + It can find and load variables from e.g. `.envrc` file. -> [!NOTE] -> `--versioning-policy 'enabled'` allows you to keep track of versions made by Terraform. -> It gives you a possibility to roll back to specified version of TF state in case your installation is broken. +## Step-by-step guide -#### md5sum +Let's start from opening this directory in terminal. -We use `md5sum` utility to generate unique S3 bucket IDs. +### Get your own copy -`md5sum` is often pre-installed on most of Unix-like OSs. Ensure that you have it installed on your machine. +In order to not mess with example recipe, make your own copy of [example directory](installations/example): -```bash -which md5sum +```shell +mkdir installations/ +``` +```shell +cd installations/ +``` +```shell +cp -r ../example/ ./ ``` -> [!TIP] -> To install `md5sum` on macOS, you have to install GNU coreutils that includes it. -> ```bash -> brew install coreutils +> [!NOTE] +> At first, you will get an error like: +> +> ```text +> direnv: error /nebius-solution-library/soperator/installations//.envrc is blocked. Run `direnv allow` to approve its content > ``` +> +> We can ignore it for a moment, until we have our setup configured. -### Kubectl +> [!IMPORTANT] +> Following steps will be described as you work in terminal within that new directory. -Install [kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) and verify it's working: +### Export your project info -```bash -kubectl cluster-info +Let's export your tenant and project IDs for a further use. + +```shell +export NEBIUS_TENANT_ID='' ``` +```shell +export NEBIUS_PROJECT_ID='' +``` + +### Give Terraform rights to access Object Storage + +In order to store Terraform state remotely in Nebius Object Storage, Terraform must be able to connect to it. +We'll use [service account](https://docs.nebius.ai/iam/service-accounts/manage/) for that purpose. -### Environment +1. 
Create service account: + + ```shell + NEBIUS_SA_TERRAFORM_ID=$(nebius iam service-account create \ + --parent-id "${NEBIUS_PROJECT_ID}" \ + --name 'slurm-terraform-sa' \ + --format json \ + | jq -r '.metadata.id') + ``` + ```shell + export NEBIUS_SA_TERRAFORM_ID + ``` + + Make sure it has valid value: + + ```console + $ echo ${NEBIUS_SA_TERRAFORM_ID} + serviceaccount- + ``` + +2. Add this account to the `editors` group: + + Get ID of the `editors` group: + + ```shell + NEBIUS_GROUP_EDITORS_ID=$(nebius iam group get-by-name \ + --parent-id "${NEBIUS_TENANT_ID}" \ + --name 'editors' \ + --format json \ + | jq -r '.metadata.id') + ``` + ```shell + export NEBIUS_GROUP_EDITORS_ID + ``` + + Make sure it has valid value: + + ```console + $ echo ${NEBIUS_GROUP_EDITORS_ID} + group- + ``` + + Add service account to the `editors` group: + + ```shell + nebius iam group-membership create \ + --parent-id "${NEBIUS_GROUP_EDITORS_ID}" \ + --member-id "${NEBIUS_SA_TERRAFORM_ID}" + ``` + +3. Create a key pair for giving AWS CLI a way to access Object Storage with the service account: + + ```shell + NEBIUS_SA_ACCESS_KEY_ID=$(nebius iam access-key create \ + --parent-id "${NEBIUS_PROJECT_ID}" \ + --name 'slurm-terraform-sa-access-key' \ + --account-service-account-id "${NEBIUS_SA_TERRAFORM_ID}" \ + --description 'AWS CLI key' \ + --format json \ + | jq -r '.resource_id') + ``` + ```shell + export NEBIUS_SA_ACCESS_KEY_ID + ``` + + Make sure it has valid value: + + ```console + $ echo ${NEBIUS_SA_ACCESS_KEY_ID} + accesskey- + ``` + +### Create a bucket in Object Storage + +Let's create a S3 bucket in Object Storage, which will be used by Terraform to store its state remotely. + +1. Generate a name for a bucket: + + ```shell + NEBIUS_PROJECT_HASH=$(echo -n "${NEBIUS_TENANT_ID}-${NEBIUS_PROJECT_ID}" \ + | md5sum \ + | awk '$0=$1') + NEBIUS_BUCKET_NAME="tfstate-slurm-k8s-${NEBIUS_PROJECT_HASH}" + ``` + + Make sure it has valid value: + + ```console + $ echo ${NEBIUS_BUCKET_NAME} + tfstate-slurm-k8s- + ``` + + > [!NOTE] + > `NEBIUS_BUCKET_NAME` contains unique bucket name dedicated to the project inside your tenant. + +2. Create a bucket: + + ```shell + nebius storage bucket create \ + --name "${NEBIUS_BUCKET_NAME}" \ + --parent-id "${NEBIUS_PROJECT_ID}" \ + --versioning-policy 'enabled' + ``` + + > [!NOTE] + > `--versioning-policy 'enabled'` allows you to keep track of versions made by Terraform. + > It gives you a possibility to roll back to specified version of TF state in case your installation is broken. + +3. Add the key, the Nebius AI region ID and the Object Storage endpoint URL to the AWS CLI configuration + + ```bash + aws configure set aws_access_key_id "${NEBIUS_SA_ACCESS_KEY_AWS_ID}" + ``` + ```bash + aws configure set aws_secret_access_key "${NEBIUS_SA_SECRET_ACCESS_KEY}" + ``` + ```bash + aws configure set region 'eu-north1' + ``` + ```bash + aws configure set endpoint_url 'https://storage.eu-north1.nebius.cloud:443' + ``` + +### Set environment variables You have to have IAM token for auth with **Nebius CLI** and **Nebius Terraform provider**. -In order to do that, we provide `.envrc` file that gets access token from Nebius IAM. +In order to do that, we provide [`.envrc`](installations/example/.envrc) file that gets access token from Nebius IAM. It exposes following environment variables: -- `NEBIUS_IAM_TOKEN` for `nebius` tool; -- `TF_VAR_iam_token` for being used in Terraform. -Setting `TF_VAR_iam_token` env var to some value is a way to pass this variable to Terraform from environment. 
-You can also set it within `terraform.tfvars`, but it's not secure, and we do not recommend to do that. +- for Nebius CLI: + - `NEBIUS_IAM_TOKEN` +- For Terraform: + - `TF_VAR_iam_token` + - `TF_VAR_iam_tenant_id` + - `TF_VAR_iam_project_id` + - `TF_VAR_vpc_subnet_id` +- For AWS CLI + - `AWS_ACCESS_KEY_ID` + - `AWS_SECRET_ACCESS_KEY` -To load variables from `.envrc` file, you can use `direnv` or you can simply call +It also generates `terraform_backend_override.tf` file used by Terraform for configuring S3 backend. -```bash -source .envrc -``` +> [!NOTE] +> Setting environment variables like `TF_VAR_` is a way to pass the value for this variable to Terraform from environment. +> You can also set e.g. `iam_token` within [`terraform.tfvars`](installations/example/terraform.tfvars), but it's not secure, and we do not recommend to do that. > [!TIP] -> If you have your access token expired, you can simply re-source `.envrc` -> ```bash -> source .envrc +> You can replace checks for `NEBIUS_TENANT_ID` and `NEBIUS_PROJECT_ID` in provided [`.envrc`](installations/example/.envrc) file with the following: +> +> #### Automatic retrieving +> +> ```shell +> # NEBIUS_TENANT_ID="${NEBIUS_TENANT_ID:?NEBIUS_TENANT_ID not set}" +> # Becomes +> NEBIUS_TENANT_ID=$(nebius iam tenant list \ +> --format json \ +> | jq -r ".items.[0].metadata.id") +> ``` +> ```shell +> # NEBIUS_PROJECT_ID="${NEBIUS_PROJECT_ID:?NEBIUS_PROJECT_ID not set}" +> # Becomes +> NEBIUS_PROJECT_ID=$(nebius iam project list \ +> --parent-id "${NEBIUS_TENANT_ID}" \ +> --format json \ +> | jq -r ".items.[0].metadata.id") +> ``` +> +> #### Specific IDs +> +> ```shell +> # NEBIUS_TENANT_ID="${NEBIUS_TENANT_ID:?NEBIUS_TENANT_ID not set}" +> # Becomes +> NEBIUS_TENANT_ID='' +> ``` +> ```shell +> # NEBIUS_PROJECT_ID="${NEBIUS_PROJECT_ID:?NEBIUS_PROJECT_ID not set}" +> # Becomes +> NEBIUS_PROJECT_ID='' > ``` -#### `direnv` +Since we exported some variables in previous steps, +let's create a new terminal session and load variables from `.envrc` file to the clean environment. -`direnv` is a tool for automatic loading of directory-scoped environment variables. -It can find and load variables from e.g. `.envrc` file. +To allow `direnv` access for `.envrc` file, run: -1. [Install](https://direnv.net/#basic-installation) `direnv` -2. Run +```shell +direnv allow . +``` - ```bash - direnv allow . - ``` +And check if it works: - To allow `direnv` access for `.envrc` file. -3. Check if it works +```console +$ token_present() { test ${NEBIUS_IAM_TOKEN} && echo 'IAM token is present' || echo 'There is no IAM token'; } +$ pushd .. > /dev/null ; echo ; token_present ; echo ; popd > /dev/null ; echo ; token_present +direnv: unloading - ```bash - token_present() { test ${NEBIUS_IAM_TOKEN} && echo 'IAM token is present' || echo 'There is no IAM token'; } - - pushd .. 
> /dev/null ; echo ; token_present ; echo ; popd > /dev/null ; echo ; token_present
-   ```
+There is no IAM token

-   You'll get something like:
+direnv: loading /installations//.envrc
+token from NEBIUS_IAM_TOKEN env is used
+token from NEBIUS_IAM_TOKEN env is used
+direnv: export +AWS_ACCESS_KEY_ID +AWS_SECRET_ACCESS_KEY +NEBIUS_IAM_TOKEN +NEBIUS_PROJECT_ID +NEBIUS_TENANT_ID +NEBIUS_VPC_SUBNET_ID +TF_VAR_iam_project_id +TF_VAR_iam_tenant_id +TF_VAR_iam_token +TF_VAR_vpc_subnet_id

-   ```
-   direnv: unloading
-
-   There is no IAM token
-
-   direnv: loading /terraform/.envrc
-   direnv: export +NEBIUS_IAM_TOKEN
-
-   IAM token is present
-   ```
+IAM token is present
+```

> [!TIP]
-> If you have your access token expired, you can switch directories back and forth to trigger unloading/loading of
-> `.envrc` file, or just simply call `direnv reload`
-> ```bash
+> If your access token has expired,
+> you can switch directories back and forth to trigger unloading/loading of the `.envrc` file:
+> ```shell
> pushd .. && popd
-> # or
+> ```
+> Or simply call:
+> ```shell
> direnv reload
> ```

-### Terraform CLI
+> [!NOTE]
+> Instead of using `direnv`, you can simply use the `source` command to load variables from `.envrc`.
+> ```shell
+> source .envrc
+> ```
+> However, the variables won't be unloaded when you leave your installation directory.
+> In case of access token expiration, call the `source` command again:
+> ```shell
+> source .envrc
+> ```

-Install [Terraform CLI](https://developer.hashicorp.com/terraform/install).
+Once you have loaded the `.envrc` file into your environment, you'll get `.aws_secret_access_key` and
+`terraform_backend_override.tf` files created in your installation directory.

> [!IMPORTANT]
-> The minimum version of Terraform needed for this recipe is `1.8.0`.
-
-## Create your cluster
+> Make sure that:
+> - the `.aws_secret_access_key` file is not empty
+> - the `terraform_backend_override.tf` file contains a valid bucket name

-### Initialization
+### Initialize Terraform

-Execute:
+To initialize the Terraform project and download all referenced providers and modules, execute:

```shell
terraform init
```

-This command will download all referenced providers and modules.
+Now you have your project set up, with the Terraform state stored in Object Storage.

### Fill out terraform variables

-We provide default variables in [`terraform.tfvars`](installations/example/terraform.tfvars) file that you can use as a
-reference for your cluster configuration.
+We provide default variables in the [`terraform.tfvars`](installations/example/terraform.tfvars) file,
+which you can use as a reference for your cluster configuration.
All variables there are comprehensively commented, and you'll probably leave most of them with pre-set values.

-### Creating resources
+### Create resources

-1. Run `terraform plan` to make sure if provided values create resources as you want.
+1. Run `terraform plan` to make sure the provided values create resources the way you want.
2. Run `terraform apply` to create resources based on provided values. You will be prompted to check if resources
correspond to your needs. Type `yes` if the configuration is correct and watch the process.
@@ -302,18 +422,16 @@ correspond to your needs. Type `yes` if the configuration is correct and watch t

Our Terraform recipe waits for `slurm.nebius.ai/SlurmCluster` CustomResource having `Available` `.status.phase`.

-Once it's ready, we create `login.sh` script to connect to Slurm. It automatically gets public IP address of:
+Once it's ready, a `login.sh` script will be created to connect to Slurm. 
It automatically gets public IP address of: - K8s node (in case of use of `NodePort` Service type); - Slurm Login Service (in case of use of `LoadBalancer` Service type). You can use this script to easily connect to your newly created cluster. It accepts following arguments: -- _Optional_ `-u ` (by default, `root`); -- `-k `. +- _Optional_ `-u ` (by default, `root`); +- `-k `. -```bash -./login.sh -k ~/.ssh/id_rsa -``` -```text +```console +$ ./login.sh -k ~/.ssh/id_rsa ... root@login-0:~# ``` diff --git a/soperator/VERSION b/soperator/VERSION index 9beda55f..f350950c 100644 --- a/soperator/VERSION +++ b/soperator/VERSION @@ -1 +1 @@ -1.14.10 +1.14.11 diff --git a/soperator/installations/example/main.tf b/soperator/installations/example/main.tf index 70ad385c..85520627 100644 --- a/soperator/installations/example/main.tf +++ b/soperator/installations/example/main.tf @@ -1,5 +1,7 @@ locals { create_nlb = var.slurm_login_service_type == "NodePort" + + worker_resources = module.resources.this[var.k8s_cluster_node_group_gpu.resource.platform][var.k8s_cluster_node_group_gpu.resource.preset] } module "filestore" { @@ -115,21 +117,45 @@ module "k8s" { } } -module "nvidia_operators" { +module "nvidia_operator_network" { + count = local.worker_resources.gpus > 0 ? 1 : 0 + depends_on = [ module.k8s ] - source = "../../modules/nvidia_operators" + source = "../../../modules/network-operator" + + cluster_id = module.k8s.cluster_id + parent_id = data.nebius_iam_v1_project.this.id providers = { - helm = helm + nebius = nebius + } +} + +module "nvidia_operator_gpu" { + count = local.worker_resources.gpus > 0 ? 1 : 0 + + depends_on = [ + module.nvidia_operator_network + ] + + source = "../../../modules/gpu-operator" + + cluster_id = module.k8s.cluster_id + parent_id = data.nebius_iam_v1_project.this.id + + enable_dcgm_service_monitor = var.telemetry_enabled + + providers = { + nebius = nebius } } module "slurm" { depends_on = [ - module.k8s + module.k8s, ] source = "../../modules/slurm" @@ -139,20 +165,12 @@ module "slurm" { node_count = var.slurm_node_count - worker_resources = tomap({ - "8gpu-128vcpu-1600gb" = { - cpu_cores = 128 - 48 - memory_gibibytes = 1600 - 400 - ephemeral_storage_gibibytes = ceil(var.k8s_cluster_node_group_gpu.boot_disk.size_gibibytes / 2) - gpus = 8 - } - "1gpu-20vcpu-200gb" = { - cpu_cores = 20 - 4 - memory_gibibytes = 200 - 50 - ephemeral_storage_gibibytes = ceil(var.k8s_cluster_node_group_gpu.boot_disk.size_gibibytes / 2) - gpus = 1 - } - })[var.k8s_cluster_node_group_gpu.resource.preset] + worker_resources = { + cpu_cores = local.worker_resources.cpu_cores + memory_gibibytes = local.worker_resources.memory_gibibytes + ephemeral_storage_gibibytes = ceil(var.k8s_cluster_node_group_gpu.boot_disk.size_gibibytes / 2) + gpus = local.worker_resources.gpus + } login_service_type = var.slurm_login_service_type login_node_port = var.slurm_login_node_port @@ -164,7 +182,6 @@ module "slurm" { slurmdbd_config = var.slurmdbd_config slurm_accounting_config = var.slurm_accounting_config - # TODO: MSP-2817 - use computed values of filestore sizes filestores = { controller_spool = { size_gibibytes = module.filestore.controller_spool.size_gibibytes diff --git a/soperator/installations/example/terraform.tf b/soperator/installations/example/terraform.tf index ae983717..014e6175 100644 --- a/soperator/installations/example/terraform.tf +++ b/soperator/installations/example/terraform.tf @@ -4,7 +4,7 @@ terraform { required_providers { nebius = { source = 
"terraform-provider-nebius.storage.ai.nebius.cloud/nebius/nebius" - version = "0.3.22" + version = "0.4.4" } units = { @@ -41,3 +41,7 @@ provider "helm" { token = var.iam_token } } + +module "resources" { + source = "../../modules/available_resources" +} diff --git a/soperator/installations/example/terraform.tfvars b/soperator/installations/example/terraform.tfvars index f929275d..83c3444d 100644 --- a/soperator/installations/example/terraform.tfvars +++ b/soperator/installations/example/terraform.tfvars @@ -199,7 +199,7 @@ slurm_cluster_name = "my-amazing-slurm" # Version of soperator. # --- -slurm_operator_version = "1.14.10" +slurm_operator_version = "1.14.11" #----------------------------------------------------------------------------------------------------------------------# # # diff --git a/soperator/modules/available_resources/main.tf b/soperator/modules/available_resources/main.tf new file mode 100644 index 00000000..35b5537c --- /dev/null +++ b/soperator/modules/available_resources/main.tf @@ -0,0 +1,92 @@ +locals { + # TODO: Get to know exact amount of allocatable resources + resources = tomap({ + "cpu-e2" = tomap({ + # Insufficient resource presets + # 2vcpu-8gb + # 4vcpu-16gb + "8vcpu-32gb" = { + cpu_cores = 8 - 2 + memory_gibibytes = 32 - 10 + gpus = 0 + gpu_cluster_compatible = false + } + "16vcpu-64gb" = { + cpu_cores = 16 - 2 + memory_gibibytes = 64 - 10 + gpus = 0 + gpu_cluster_compatible = false + } + "32vcpu-128gb" = { + cpu_cores = 32 - 2 + memory_gibibytes = 128 - 10 + gpus = 0 + gpu_cluster_compatible = false + } + "48vcpu-192gb" = { + cpu_cores = 48 - 2 + memory_gibibytes = 192 - 10 + gpus = 0 + gpu_cluster_compatible = false + } + "64vcpu-256gb" = { + cpu_cores = 64 - 2 + memory_gibibytes = 256 - 10 + gpus = 0 + gpu_cluster_compatible = false + } + "80vcpu-320gb" = { + cpu_cores = 80 - 2 + memory_gibibytes = 320 - 10 + gpus = 0 + gpu_cluster_compatible = false + } + }) + "gpu-h100-sxm" = tomap({ + "1gpu-16vcpu-200gb" = { + cpu_cores = 16 - 2 + memory_gibibytes = 200 - 15 + gpus = 1 + gpu_cluster_compatible = false + } + "8gpu-128vcpu-1600gb" = { + cpu_cores = 128 - 2 + memory_gibibytes = 1600 - 350 + gpus = 8 + gpu_cluster_compatible = true + } + }) + "gpu-l40s-a" = tomap({ + "1gpu-8vcpu-32gb" = { + cpu_cores = 8 - 2 + memory_gibibytes = 32 - 10 + gpus = 1 + gpu_cluster_compatible = false + } + "1gpu-16vcpu-64gb" = { + cpu_cores = 16 - 2 + memory_gibibytes = 64 - 10 + gpus = 1 + gpu_cluster_compatible = false + } + "1gpu-24vcpu-96gb" = { + cpu_cores = 24 - 2 + memory_gibibytes = 96 - 10 + gpus = 1 + gpu_cluster_compatible = false + } + "1gpu-32vcpu-128gb" = { + cpu_cores = 32 - 2 + memory_gibibytes = 128 - 10 + gpus = 1 + gpu_cluster_compatible = false + } + "1gpu-40vcpu-160gb" = { + cpu_cores = 40 - 2 + memory_gibibytes = 160 - 10 + gpus = 1 + gpu_cluster_compatible = false + } + }) + }) +} diff --git a/soperator/modules/available_resources/outputs.tf b/soperator/modules/available_resources/outputs.tf new file mode 100644 index 00000000..59cd6cf6 --- /dev/null +++ b/soperator/modules/available_resources/outputs.tf @@ -0,0 +1,4 @@ +output "this" { + description = "Map of available node resources grouped by platform -> preset." 
+ value = local.resources +} diff --git a/soperator/modules/k8s/k8s_ng_gpu.tf b/soperator/modules/k8s/k8s_ng_gpu.tf index 76b31599..4e4bd4ac 100644 --- a/soperator/modules/k8s/k8s_ng_gpu.tf +++ b/soperator/modules/k8s/k8s_ng_gpu.tf @@ -1,11 +1,7 @@ locals { gpu = { cluster = { - create = tomap({ - "8gpu-128vcpu-1600gb" = true - "1gpu-20vcpu-200gb" = false - })[var.node_group_gpu.resource.preset] - + create = module.resources.this[var.node_group_gpu.resource.platform][var.node_group_gpu.resource.preset].gpu_cluster_compatible name = join("-", [ trimsuffix( substr( @@ -18,11 +14,6 @@ locals { var.node_group_gpu.gpu_cluster.infiniband_fabric ]) } - - count = tomap({ - "8gpu-128vcpu-1600gb" = 8 - "1gpu-20vcpu-200gb" = 1 - })[var.node_group_gpu.resource.preset] } } @@ -62,11 +53,11 @@ resource "nebius_mk8s_v1_node_group" "gpu" { metadata = { labels = module.labels.label_group_name_gpu } - taints = [{ + taints = module.resources.this[var.node_group_gpu.resource.platform][var.node_group_gpu.resource.preset].gpus > 0 ? [{ key = "nvidia.com/gpu", - value = local.gpu.count + value = module.resources.this[var.node_group_gpu.resource.platform][var.node_group_gpu.resource.preset].gpus effect = "NO_SCHEDULE" - }] + }] : null resources = { platform = var.node_group_gpu.resource.platform @@ -105,5 +96,13 @@ resource "nebius_mk8s_v1_node_group" "gpu" { ignore_changes = [ labels, ] + + precondition { + condition = (var.node_group_gpu.resource.platform == "cpu-e2" + ? !contains(["2vcpu-8gb", "4vcpu-16gb"], var.node_group_gpu.resource.preset) + : true + ) + error_message = "Worker resource preset '${var.node_group_gpu.resource.preset}' is insufficient." + } } } diff --git a/soperator/modules/k8s/outputs.tf b/soperator/modules/k8s/outputs.tf index 48cf5e0f..15b9d2be 100644 --- a/soperator/modules/k8s/outputs.tf +++ b/soperator/modules/k8s/outputs.tf @@ -6,6 +6,11 @@ output "control_plane" { } } +output "cluster_id" { + description = "K8s cluster ID." + value = nebius_mk8s_v1_cluster.this.id +} + output "allocation_id" { description = "ID of the VPC allocation used for SSH connection into Slurm cluster." value = local.allocation_id diff --git a/soperator/modules/k8s/terraform.tf b/soperator/modules/k8s/terraform.tf index 5fccb54c..80e6c447 100644 --- a/soperator/modules/k8s/terraform.tf +++ b/soperator/modules/k8s/terraform.tf @@ -13,3 +13,7 @@ terraform { module "labels" { source = "../labels" } + +module "resources" { + source = "../available_resources" +} diff --git a/soperator/modules/login/main.tf b/soperator/modules/login/main.tf index d55f43ce..86e8a641 100644 --- a/soperator/modules/login/main.tf +++ b/soperator/modules/login/main.tf @@ -21,7 +21,7 @@ resource "local_file" "this" { terraform_data.connection_ip, ] - filename = "${path.root}/login.sh" + filename = "${path.root}/${var.script_name}.sh" file_permission = "0774" content = templatefile("${path.module}/templates/login.sh.tftpl", { address = terraform_data.connection_ip.output diff --git a/soperator/modules/login/variables.tf b/soperator/modules/login/variables.tf index 47a5556d..0e6e6257 100644 --- a/soperator/modules/login/variables.tf +++ b/soperator/modules/login/variables.tf @@ -15,3 +15,9 @@ variable "slurm_cluster_name" { type = string nullable = false } + +variable "script_name" { + description = "Name of the script file." 
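+  # Consumed in login/main.tf as "${path.root}/${var.script_name}.sh";
+  # the default below keeps the original login.sh file name.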
+ type = string + default = "login" +} diff --git a/soperator/modules/nvidia_operators/locals.tf b/soperator/modules/nvidia_operators/locals.tf deleted file mode 100644 index 128769da..00000000 --- a/soperator/modules/nvidia_operators/locals.tf +++ /dev/null @@ -1,17 +0,0 @@ -locals { - helm = { - repository = "cr.nemax.nebius.cloud/yc-marketplace/nebius" - chart = { - operator = { - network = { - name = "network-operator" - version = "24.4.0" - } - gpu = { - name = "gpu-operator" - version = "v24.3.0" - } - } - } - } -} diff --git a/soperator/modules/nvidia_operators/main.tf b/soperator/modules/nvidia_operators/main.tf deleted file mode 100644 index 247f2723..00000000 --- a/soperator/modules/nvidia_operators/main.tf +++ /dev/null @@ -1,60 +0,0 @@ -resource "helm_release" "network_operator" { - name = local.helm.chart.operator.network.name - repository = "oci://${local.helm.repository}/nvidia-${local.helm.chart.operator.network.name}/chart" - chart = local.helm.chart.operator.network.name - version = local.helm.chart.operator.network.version - atomic = true - timeout = 600 - - create_namespace = true - namespace = local.helm.chart.operator.network.name - - wait = true - wait_for_jobs = true -} - -resource "helm_release" "gpu_operator" { - depends_on = [ - helm_release.network_operator - ] - - name = local.helm.chart.operator.gpu.name - repository = "oci://${local.helm.repository}/nvidia-${local.helm.chart.operator.gpu.name}/chart" - chart = local.helm.chart.operator.gpu.name - version = local.helm.chart.operator.gpu.version - atomic = true - timeout = 600 - - create_namespace = true - namespace = local.helm.chart.operator.gpu.name - - values = [templatefile("${path.module}/templates/helm_values/gpu_operator.yaml.tftpl", { - repository = local.helm.repository - image_prefix = "nvidia-${local.helm.chart.operator.gpu.name}/image" - operator_version = "v24.3.0" - - enable = { - cc_manager = false - dcgm = true - dcgm_exporter = true - dcgm_exporter_service_monitor = true - device_plugin = true - driver = true - driver_rdma = true - driver_rdma_host_mofed = false - gfd = true - kata_manager = false - mig_manager = true - nfd = true - node_status_exporter = false - sandbox_device_plugin = true - toolkit = true - vfio_manager = true - vgpu_device_manager = true - vgpu_manager = false - } - })] - - wait = true - wait_for_jobs = true -} diff --git a/soperator/modules/nvidia_operators/templates/helm_values/gpu_operator.yaml.tftpl b/soperator/modules/nvidia_operators/templates/helm_values/gpu_operator.yaml.tftpl deleted file mode 100644 index f0b4aeca..00000000 --- a/soperator/modules/nvidia_operators/templates/helm_values/gpu_operator.yaml.tftpl +++ /dev/null @@ -1,113 +0,0 @@ -ccManager: - enabled: ${enable.cc_manager} - repository: "${repository}" - image: "${image_prefix}/k8s-cc-manager" - -dcgm: - enabled: ${enable.dcgm} - repository: "${repository}" - image: "${image_prefix}/dcgm" - -dcgmExporter: - enabled: ${enable.dcgm_exporter} - repository: "${repository}" - image: "${image_prefix}/dcgm-exporter" - serviceMonitor: - enabled: ${enable.dcgm_exporter_service_monitor} - interval: 30s - -devicePlugin: - enabled: ${enable.device_plugin} - repository: "${repository}" - image: "${image_prefix}/k8s-device-plugin" - -driver: - enabled: ${enable.driver} - version: "535.161.08" - repository: "${repository}" - image: "${image_prefix}/driver" - full_alternative_image: "${repository}/${image_prefix}/driver:535.161.08-ubuntu22.04" - full_new_image: 
"${repository}/${image_prefix}/driver:550.54.15-ubuntu20.04" - full_new_alternative_image: "${repository}/${image_prefix}/driver:550.54.15-ubuntu22.04" - - manager: - repository: "${repository}" - image: "${image_prefix}/k8s-driver-manager" - - rdma: - enabled: ${enable.driver_rdma} - useHostMofed: ${enable.driver_rdma_host_mofed} - -gfd: - enabled: ${enable.gfd} - repository: "${repository}" - image: "${image_prefix}/k8s-device-plugin" - -kataManager: - enabled: ${enable.kata_manager} - repository: "${repository}" - image: "${image_prefix}/k8s-kata-manager" - -migManager: - enabled: ${enable.mig_manager} - repository: "${repository}" - image: "${image_prefix}/k8s-mig-manager" - -nfd: - enabled: ${enable.nfd} -node-feature-discovery: - image: - repository: "${repository}/${image_prefix}/node-feature-discovery" - tag: "v0.15.4" - -nodeStatusExporter: - enabled: ${enable.node_status_exporter} - repository: "${repository}" - image: "${image_prefix}/gpu-operator-validator" - version: "${operator_version}" - -operator: - repository: "${repository}" - image: "${image_prefix}/gpu-operator" - version: "${operator_version}" - - initContainer: - repository: "${repository}" - image: "${image_prefix}/cuda" - -sandboxDevicePlugin: - enabled: ${enable.sandbox_device_plugin} - repository: "${repository}" - image: "${image_prefix}/kubevirt-gpu-device-plugin" - -toolkit: - enabled: ${enable.toolkit} - repository: "${repository}" - image: "${image_prefix}/container-toolkit" - -unified_agent_installer: - image: "${repository}/${image_prefix}/busybox:latest_pinned" - -validator: - repository: "${repository}" - image: "${image_prefix}/gpu-operator-validator" - version: "${operator_version}" - -vfioManager: - enabled: ${enable.vfio_manager} - repository: "${repository}" - image: "${image_prefix}/cuda" - - driverManager: - repository: "${repository}" - image: "${image_prefix}/k8s-driver-manager" - -vgpuDeviceManager: - enabled: ${enable.vgpu_device_manager} - repository: "${repository}" - image: "${image_prefix}/vgpu-device-manager" - -vgpuManager: - enabled: ${enable.vgpu_manager} - repository: "${repository}" - image: "${image_prefix}/k8s-driver-manager" diff --git a/soperator/modules/slurm/templates/helm_values/slurm_cluster.yaml.tftpl b/soperator/modules/slurm/templates/helm_values/slurm_cluster.yaml.tftpl index ef169f9a..4e610088 100644 --- a/soperator/modules/slurm/templates/helm_values/slurm_cluster.yaml.tftpl +++ b/soperator/modules/slurm/templates/helm_values/slurm_cluster.yaml.tftpl @@ -1,4 +1,5 @@ clusterName: ${name} +clusterType: ${ nodes.worker.resources.gpus > 0 ? "gpu" : "cpu" } k8sNodeFilters: - name: ${k8s_node_filters.non_gpu.name} @@ -22,10 +23,12 @@ k8sNodeFilters: operator: In values: - ${k8s_node_filters.gpu.affinity.value} + %{~ if nodes.worker.resources.gpus > 0 ~} tolerations: - key: nvidia.com/gpu operator: Exists effect: NoSchedule + %{~ endif ~} volumeSources: - name: jail