Merge pull request #40 from nebius/dev/soperator

Soperator stable release 1.14.7

asteny authored Oct 15, 2024
2 parents 06cde89 + 5a316f8 commit dd87d8d
Showing 9 changed files with 54 additions and 59 deletions.
3 changes: 2 additions & 1 deletion .github/workflows/soperator.yml
@@ -6,6 +6,7 @@ on:
- main
paths:
- "soperator/**"
- ".github/workflows/soperator.yml"

permissions:
contents: read
@@ -23,7 +24,7 @@ jobs:
egress-policy: audit

- name: Checkout repository
uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332 # v4.1.7
uses: actions/checkout@eef61447b9ff4aafe5dcd4e0bbf5d482be7e7871 # v4.2.1

- name: Build release
run: |
1 change: 1 addition & 0 deletions .github/workflows/terraform.yml
@@ -4,6 +4,7 @@ on:
pull_request:
paths-ignore:
- 'soperator/**'
- '.github/workflows/soperator.yml'
# schedule:
# - cron: '30 * * * *'

24 changes: 22 additions & 2 deletions soperator/README.md
@@ -45,8 +45,17 @@ These checks are implemented as usual Slurm jobs - they stay in the same queue w

## Get your own copy

In order to not mess with example recipe, make your own copy of [example directory](installations/example).
Following steps will be described as you work in terminal within that new directory.
In order not to mess with the example recipe, make your own copy of the [example directory](installations/example):
```bash
mkdir installations/<your-installation-name>
cd installations/<your-installation-name>

# Copy the example contents (including dotfiles such as .envrc) into the new directory
cp -r ../example/. ./
```

> [!NOTE]
> The following steps will be described as you work in the terminal within that new directory.

### Nebius CLI

@@ -94,6 +103,7 @@ Let's start with exporting your tenant and project IDs for further use.
--parent-id "${NEBIUS_PROJECT_ID}" \
--name 'slurm-terraform-sa' \
--format json | jq -r '.metadata.id')
export NEBIUS_SA_TERRAFORM_ID
```
@@ -105,6 +115,7 @@ Let's start with exporting your tenant and project IDs for further use.
--parent-id "${NEBIUS_TENANT_ID}" \
--name 'editors' \
--format json | jq -r '.metadata.id')
export NEBIUS_GROUP_EDITORS_ID
# Adding SA to the 'editors' group
@@ -122,6 +133,7 @@ Let's start with exporting your tenant and project IDs for further use.
--account-service-account-id "${NEBIUS_SA_TERRAFORM_ID}" \
--description 'AWS CLI key' \
--format json | jq -r '.resource_id')
export NEBIUS_SA_ACCESS_KEY_ID
```
@@ -132,15 +144,19 @@ Let's start with exporting your tenant and project IDs for further use.
```bash
aws configure set aws_access_key_id "${NEBIUS_SA_ACCESS_KEY_AWS_ID}"
aws configure set aws_secret_access_key "${NEBIUS_SA_SECRET_ACCESS_KEY}"
aws configure set region 'eu-north1'
aws configure set endpoint_url 'https://storage.eu-north1.nebius.cloud:443'
```
#### Bucket
```bash
NEBIUS_BUCKET_NAME="tfstate-slurm-k8s-$(echo -n "${NEBIUS_TENANT_ID}-${NEBIUS_PROJECT_ID}" | md5sum | awk '$0=$1')"
nebius storage bucket create --parent-id "${NEBIUS_PROJECT_ID}" --versioning-policy 'enabled' --name "${NEBIUS_BUCKET_NAME}"
```
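To sanity-check the result, one option (a sketch — it assumes your AWS CLI honours the `endpoint_url` setting from the previous step; older versions may need `--endpoint-url` passed explicitly) is to list the freshly created bucket:
```bash
# An empty listing with a zero exit code means the access key and bucket are usable
aws s3 ls "s3://${NEBIUS_BUCKET_NAME}"
```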
@@ -208,6 +224,7 @@ It can find and load variables from e.g. `.envrc` file.
```bash
token_present() { test ${NEBIUS_IAM_TOKEN} && echo 'IAM token is present' || echo 'There is no IAM token'; }
pushd .. > /dev/null ; echo ; token_present ; echo ; popd > /dev/null ; echo ; token_present
```
@@ -275,8 +292,11 @@ When it finishes, connect to the K8S cluster and wait until the `slurm.nebius.ai
Once it's available, you will be able to connect to the Slurm login node via SSH as the `root` user, using the provided public key.
[//]: # (TODO: Add instructions on how to find this SLURM_IP)
```shell
SLURM_IP='<NLB node / allocated IP address>'
ssh -i '<Path to private key for provided public key>' [-p <Node port>] root@${SLURM_IP}
```
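Once logged in, a quick sanity check could look like this (a sketch — it only assumes the standard Slurm client tools on the login node):
```bash
# Show partitions and node states; worker nodes should eventually report 'idle'
sinfo

# Run a trivial job on a single node to confirm the controller and workers respond
srun --nodes=1 hostname
```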
2 changes: 1 addition & 1 deletion soperator/VERSION
@@ -1 +1 @@
1.14.4
1.14.7
2 changes: 1 addition & 1 deletion soperator/installations/example/.envrc
@@ -1,6 +1,6 @@
# region IAM token

NEBIUS_IAM_TOKEN=$(nebius iam get-access-token)
NEBIUS_IAM_TOKEN=$(NEBIUS_IAM_TOKEN='' nebius iam get-access-token) # run with the variable cleared so an already-exported (possibly stale) token is not reused
export NEBIUS_IAM_TOKEN
export TF_VAR_iam_token=${NEBIUS_IAM_TOKEN}

2 changes: 1 addition & 1 deletion soperator/installations/example/terraform.tf
@@ -19,7 +19,7 @@ terraform {
}

provider "nebius" {
domain = "api.eu-north1.nebius.cloud:443"
domain = "api.eu.nebius.cloud:443"
}

provider "units" {}
71 changes: 22 additions & 49 deletions soperator/installations/example/terraform.tfvars
@@ -85,14 +85,14 @@ filestore_jail = {

# Shared filesystems to be mounted inside jail.
# ---
filestore_jail_submounts = [{
name = "mlperf-sd"
mount_path = "/mlperf-sd"
spec = {
size_gibibytes = 2048
block_size_kibibytes = 4
}
}]
# filestore_jail_submounts = [{
# name = "mlperf-sd"
# mount_path = "/mlperf-sd"
# spec = {
# size_gibibytes = 2048
# block_size_kibibytes = 4
# }
# }]
# Or use existing filestores.
# ---
# filestore_jail_submounts = [{
@@ -107,12 +107,12 @@ filestore_jail_submounts = [{
# By default, null.
# Required if accounting_enabled is true.
# ---
# filestore_accounting = {
# spec = {
# size_gibibytes = 512
# block_size_kibibytes = 4
# }
# }
filestore_accounting = {
spec = {
size_gibibytes = 512
block_size_kibibytes = 4
}
}
# Or use existing filestore.
# ---
# filestore_accounting = {
Expand Down Expand Up @@ -173,10 +173,10 @@ k8s_cluster_node_group_gpu = {
# By default, empty list.
# ---
# k8s_cluster_node_ssh_access_users = [{
# name = "user1"
# name = "<USER1>"
# public_keys = [
# "user1 key1",
# "user1 key2",
# "<ENCRYPTION-METHOD HASH1 USER1>",,
# "<ENCRYPTION-METHOD HASH1 USER1>",,
# ]
# }]

@@ -199,7 +199,7 @@ slurm_cluster_name = "my-amazing-slurm"

# Version of soperator.
# ---
slurm_operator_version = "1.14.4"
slurm_operator_version = "1.14.7"

#----------------------------------------------------------------------------------------------------------------------#
# #
@@ -233,7 +233,7 @@ slurm_login_service_type = "NodePort"
# Authorized keys accepted for connecting to Slurm login nodes via SSH as 'root' user.
# ---
slurm_login_ssh_root_public_keys = [
"ENCRYPTION-METHOD HASH USER",
"<ENCRYPTION-METHOD HASH USER>",
]

# endregion Login
@@ -262,7 +262,7 @@ slurm_login_ssh_root_public_keys = [
# Shared memory size for Slurm controller and worker nodes in GiB.
# By default, 64.
# ---
# slurm_shared_memory_size_gibibytes = 64
slurm_shared_memory_size_gibibytes = 256

# endregion Config

@@ -306,7 +306,7 @@ slurm_login_ssh_root_public_keys = [
# Password of `admin` user of Grafana.
# Set it to your desired password.
# ---
# telemetry_grafana_admin_password = ""
telemetry_grafana_admin_password = "<YOUR-PASSWORD-FOR-GRAFANA>"

# endregion Telemetry

@@ -320,34 +320,7 @@ slurm_login_ssh_root_public_keys = [
# Whether to enable Accounting.
# By default, false.
# ---
# accounting_enabled = false

# Slurmdbd.conf configuration. See https://slurm.schedmd.com/slurmdbd.conf.html. Not all options are supported.
# slurmdbd_config = {
# archiveEvents = "yes"
# archiveJobs = "yes"
# archiveSteps = "yes"
# archiveSuspend = "yes"
# archiveResv = "yes"
# archiveUsage = "yes"
# archiveTXN = "yes"
# debugLevel = "info"
# tcpTimeout = "120"
# purgeEventAfter = "1month"
# purgeJobAfter = "1month"
# purgeStepAfter = "1month"
# purgeSuspendAfter = "12month"
# purgeResvAfter = "1month"
# }

# Slurm.conf accounting configuration. See https://slurm.schedmd.com/slurm.conf.html. Not all options are supported.
# slurm_accounting_config = {
# accountingStorageTRES = "gres/gpu,license/iop1"
# accountingStoreFlags = "job_comment,job_env,job_extra,job_script,no_stdio"
# acctGatherInterconnectType = "acct_gather_interconnect/ofed"
# jobAcctGatherType = "jobacct_gather/cgroup"
# jobAcctGatherFrequency = 30
# }
accounting_enabled = true

# endregion Accounting

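With `terraform.tfvars` updated for the new release, rolling it out would typically amount to the usual cycle (a sketch — it assumes the state backend and `.envrc` described in the README are already configured):
```bash
terraform init -upgrade
terraform plan -out terraform.plan
terraform apply terraform.plan
```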
@@ -91,8 +91,8 @@ sbatch \
--nodes="${NUM_NODES}" \
--gpus-per-node=${GPUS_PER_NODE} \
--ntasks-per-node="${GPUS_PER_NODE}" \
--cpus-per-task=16 \
--mem-per-cpu="8G" \
--cpus-per-task=8 \
--mem-per-cpu="4G" \
--time="${WALLTIME}" \
--output="${LOG_DIR}/%A_${JOB_NAME}.out" \
./scripts/slurm/srun.sh \
@@ -52,8 +52,8 @@ srun \
--container-mounts="${MOUNTS}" \
--container-workdir="${WORKDIR}" \
--ntasks-per-node="${GPUS_PER_NODE}" \
--cpus-per-task=16 \
--mem-per-cpu="8G" \
--cpus-per-task=8 \
--mem-per-cpu="4G" \
--nodes="${NUM_NODES}" \
bash -c "./run_and_time.sh \
--export NCCL_TOPO_FILE,HF_HUB_OFFLINE,NCCL_DEBUG \
