Description
What happened?
My training-operator pod's /metrics endpoint is not reporting the Prometheus metrics mentioned here properly. To be precise, training_operator_jobs_created_total is being incremented as expected; the issue (thus far) has been with training_operator_jobs_successful_total and training_operator_jobs_deleted_total.
I started a PyTorch training job using this code (closely based on the guide here):
def train_func():
    import os

    import torch
    import torch.distributed as dist
    import torch.nn.functional as F
    from torch.utils.data import DistributedSampler
    from torchvision import datasets, transforms

    # [1] Setup PyTorch DDP. Distributed environment will be set automatically by Training Operator.
    dist.init_process_group(backend="gloo")
    Distributor = torch.nn.parallel.DistributedDataParallel
    local_rank = int(os.getenv("LOCAL_RANK", 0))
    print(
        "Distributed Training for WORLD_SIZE: {}, RANK: {}, LOCAL_RANK: {}".format(
            dist.get_world_size(),
            dist.get_rank(),
            local_rank,
        )
    )

    # [2] Create PyTorch CNN Model.
    class Net(torch.nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = torch.nn.Conv2d(1, 20, 5, 1)
            self.conv2 = torch.nn.Conv2d(20, 50, 5, 1)
            self.fc1 = torch.nn.Linear(4 * 4 * 50, 500)
            self.fc2 = torch.nn.Linear(500, 10)

        def forward(self, x):
            x = F.relu(self.conv1(x))
            x = F.max_pool2d(x, 2, 2)
            x = F.relu(self.conv2(x))
            x = F.max_pool2d(x, 2, 2)
            x = x.view(-1, 4 * 4 * 50)
            x = F.relu(self.fc1(x))
            x = self.fc2(x)
            return F.log_softmax(x, dim=1)

    # [3] Attach model to the correct GPU device and distributor.
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
    model = Net().to(device)
    model = Distributor(model)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

    # [4] Setup FashionMNIST dataloader and distribute data across PyTorchJob workers.
    dataset = datasets.FashionMNIST(
        "./data",
        download=True,
        train=True,
        transform=transforms.Compose([transforms.ToTensor()]),
    )
    train_loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=128,
        sampler=DistributedSampler(dataset),
    )

    # [5] Start model Training.
    for epoch in range(3):
        for batch_idx, (data, target) in enumerate(train_loader):
            # Attach Tensors to the device.
            data = data.to(device)
            target = target.to(device)

            optimizer.zero_grad()
            output = model(data)
            loss = F.nll_loss(output, target)
            loss.backward()
            optimizer.step()
            if batch_idx % 10 == 0 and dist.get_rank() == 0:
                print(
                    "Train Epoch: {} [{}/{} ({:.0f}%)]\tloss={:.4f}".format(
                        epoch,
                        batch_idx * len(data),
                        len(train_loader.dataset),
                        100.0 * batch_idx / len(train_loader),
                        loss.item(),
                    )
                )


from kubeflow.training import TrainingClient

training_client = TrainingClient()
job_name = "pytorch-ddp"

# Start PyTorchJob with 3 Workers and 1 GPU per Worker (e.g. multi-node, multi-worker job).
training_client.create_job(
    name=job_name,
    train_func=train_func,
    num_procs_per_worker="auto",
    num_workers=3,
    base_image="***.azurecr.io/pytorch/pytorch:2.1.2-cuda11.8-cudnn8-runtime",
    # resources_per_worker={"gpu": "1"},
)
This job completed and was successful:
Code:
training_client.is_job_succeeded(name=job_name)
Output:
True
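For reference, the job's conditions can also be inspected directly with the same client. This is just a sketch: it assumes the TrainingClient's default namespace, and tying the Succeeded condition to the successful-jobs counter is my assumption, not something confirmed by the docs.

# Sketch: print the conditions reported on the PyTorchJob object itself.
# Assumes the TrainingClient default namespace; linking the Succeeded condition
# to training_operator_jobs_successful_total is an assumption, not confirmed.
for condition in training_client.get_job_conditions(name=job_name):
    print(condition.type, condition.status, condition.reason)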
However, when visiting the /metrics endpoint, I could only see that training_operator_jobs_created_total was at 1; neither training_operator_jobs_successful_total nor training_operator_jobs_failed_total had been incremented (neither one was present on the page).
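For reference, this is roughly how the counters can be read outside the browser. It is only a sketch; the namespace, deployment name, and metrics port (8080) are assumptions based on a default install and may differ in other deployments.

# Sketch: read the operator's Prometheus counters after a port-forward such as
#   kubectl -n kubeflow port-forward deploy/training-operator 8080:8080
# (namespace, deployment name, and port are assumptions about a default install).
import requests

metrics_text = requests.get("http://localhost:8080/metrics", timeout=10).text
for line in metrics_text.splitlines():
    if line.startswith("training_operator_jobs_"):
        print(line)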
So, I deleted this job to try again:
training_client.delete_job(job_name)
Still, on the /metrics endpoint, training_operator_jobs_deleted_total was neither incremented nor visible.
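For reference, a minimal sketch (using the same client and its default namespace) to confirm the PyTorchJob object is actually gone after delete_job, e.g. to rule out the resource lingering behind a finalizer:

# Sketch: list the remaining jobs and confirm the deleted one is no longer present.
remaining = [job.metadata.name for job in training_client.list_jobs()]
print(job_name in remaining)  # expected: False once deletion has completed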
I created a job again with the same code and repeated the process. From this point on, training_operator_jobs_successful_total was incremented as expected. However, training_operator_jobs_deleted_total is still failing to update.
I have not had any failed or restarted jobs, so I cannot comment on the behavior of those metrics.
What did you expect to happen?
I expected training_operator_jobs_successful_total to be incremented if a job is deemed successful by the Python client.
I expected training_operator_jobs_deleted_total to be incremented if a job's resources are deleted successfully and there are no error logs related to job deletion in the training-operator pod logs.
Environment
Kubernetes version:
$ kubectl version
Client Version: v1.30.8
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.7
Training Operator version:
$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"
***.azurecr.io/kubeflow/training-operator:v1-5a5f92d
Training Operator Python SDK version:
$ pip show kubeflow-training
Name: kubeflow-training
Version: 1.9.0rc0
Summary: Training Operator Python SDK
Home-page: https://github.com/kubeflow/training-operator/tree/master/sdk/python
Author: Kubeflow Authors
Author-email: [email protected]
License: Apache License Version 2.0
Location: /opt/conda/lib/python3.11/site-packages
Requires: certifi, kubernetes, retrying, setuptools, six, urllib3
Required-by:
Impacted by this bug?
Give it a 👍. We prioritize the issues with the most 👍.