Some Prometheus metrics not being reported properly #2408

@ishaan-mehta

Description

What happened?

My training-operator pod's /metrics endpoint is not reporting the Prometheus metrics mentioned here properly.

To be precise, training_operator_jobs_created_total is being incremented as expected; the issue (thus far) has been with training_operator_jobs_successful_total and training_operator_jobs_deleted_total.

I started a PyTorch training job using this code (closely based on the guide here):

def train_func():
    import torch
    import torch.nn.functional as F
    from torch.utils.data import DistributedSampler
    from torchvision import datasets, transforms
    import torch.distributed as dist
    import os

    # [1] Setup PyTorch DDP. Distributed environment will be set automatically by Training Operator.
    dist.init_process_group(backend="gloo")
    Distributor = torch.nn.parallel.DistributedDataParallel
    local_rank = int(os.getenv("LOCAL_RANK", 0))
    print(
        "Distributed Training for WORLD_SIZE: {}, RANK: {}, LOCAL_RANK: {}".format(
            dist.get_world_size(),
            dist.get_rank(),
            local_rank,
        )
    )

    # [2] Create PyTorch CNN Model.
    class Net(torch.nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = torch.nn.Conv2d(1, 20, 5, 1)
            self.conv2 = torch.nn.Conv2d(20, 50, 5, 1)
            self.fc1 = torch.nn.Linear(4 * 4 * 50, 500)
            self.fc2 = torch.nn.Linear(500, 10)

        def forward(self, x):
            x = F.relu(self.conv1(x))
            x = F.max_pool2d(x, 2, 2)
            x = F.relu(self.conv2(x))
            x = F.max_pool2d(x, 2, 2)
            x = x.view(-1, 4 * 4 * 50)
            x = F.relu(self.fc1(x))
            x = self.fc2(x)
            return F.log_softmax(x, dim=1)

    # [3] Attach model to the correct GPU device and distributor.
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
    model = Net().to(device)
    model = Distributor(model)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

    # [4] Setup FashionMNIST dataloader and distribute data across PyTorchJob workers.
    dataset = datasets.FashionMNIST(
        "./data",
        download=True,
        train=True,
        transform=transforms.Compose([transforms.ToTensor()]),
    )
    train_loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=128,
        sampler=DistributedSampler(dataset),
    )

    # [5] Start model Training.
    for epoch in range(3):
        for batch_idx, (data, target) in enumerate(train_loader):
            # Attach Tensors to the device.
            data = data.to(device)
            target = target.to(device)

            optimizer.zero_grad()
            output = model(data)
            loss = F.nll_loss(output, target)
            loss.backward()
            optimizer.step()
            if batch_idx % 10 == 0 and dist.get_rank() == 0:
                print(
                    "Train Epoch: {} [{}/{} ({:.0f}%)]\tloss={:.4f}".format(
                        epoch,
                        batch_idx * len(data),
                        len(train_loader.dataset),
                        100.0 * batch_idx / len(train_loader),
                        loss.item(),
                    )
                )


from kubeflow.training import TrainingClient

training_client = TrainingClient()
job_name = "pytorch-ddp"

# Start PyTorchJob with 3 Workers and 1 GPU per Worker (e.g. multi-node, multi-worker job).
training_client.create_job(
    name=job_name,
    train_func=train_func,
    num_procs_per_worker="auto",
    num_workers=3,
    base_image="***.azurecr.io/pytorch/pytorch:2.1.2-cuda11.8-cudnn8-runtime",
    # resources_per_worker={"gpu": "1"},
)

This job completed and was successful:
Code:

training_client.is_job_succeeded(name=job_name)

Output:

True
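
For completeness, I also looked at the job's status conditions, since (as far as I understand) training_operator_jobs_successful_total should track jobs that reach the Succeeded condition. This is only a minimal sketch against my SDK version (1.9.0rc0); the attribute names on the returned condition objects may differ in other versions:

# Reuses the training_client and job_name defined above.
# Print each reported condition; a successful job should show a
# Succeeded condition with status "True".
for condition in training_client.get_job_conditions(name=job_name):
    print(condition.type, condition.status, condition.reason)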

However, when visiting the /metrics endpoint, I could only see that training_operator_jobs_created_total was at 1 — neither training_operator_jobs_successful_total nor training_operator_jobs_failed_total had been incremented (neither one was present on the page).
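
For reference, this is roughly how I check the counters rather than eyeballing the page. It is only a sketch: it assumes the operator's metrics port has been forwarded to localhost:8080 (the port and resource name may differ in other deployments) and uses the prometheus_client text parser:

# Assumes something like:
#   kubectl -n kubeflow port-forward deploy/training-operator 8080:8080
# (port 8080 is an assumption based on my deployment).
import requests
from prometheus_client.parser import text_string_to_metric_families

metrics_text = requests.get("http://localhost:8080/metrics").text
for family in text_string_to_metric_families(metrics_text):
    if family.name.startswith("training_operator_jobs"):
        for sample in family.samples:
            print(sample.name, sample.labels, sample.value)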

So, I deleted this job to try again:

training_client.delete_job(job_name)

Still, on the /metrics endpoint, training_operator_jobs_deleted_total was not incremented/visible.
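
To rule out the deletion itself having silently failed, I also confirmed the PyTorchJob resource was actually gone. A rough sketch using TrainingClient.list_jobs (the attributes on the returned objects may vary between SDK versions):

# Reuses training_client and job_name from above.
# If the PyTorchJob was really deleted, it should no longer be listed.
remaining = [job.metadata.name for job in training_client.list_jobs()]
assert job_name not in remaining, f"{job_name} still exists: {remaining}"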

I created a job again with the same code and repeated the process. From this point on, training_operator_jobs_successful_total was incremented as expected. However, training_operator_jobs_deleted_total is still failing to update.

I have not had any failed or restarted jobs, so I cannot speak to the behavior of the failure- and restart-related metrics.

What did you expect to happen?

I expected training_operator_jobs_successful_total to be incremented if a job is deemed successful by the Python client.

I expected training_operator_jobs_deleted_total to be incremented if a job's resources are deleted successfully and there are no error logs related to job deletion in the training-operator pod logs.

Environment

Kubernetes version:

$ kubectl version
Client Version: v1.30.8
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.7

Training Operator version:

$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"
***.azurecr.io/kubeflow/training-operator:v1-5a5f92d

Training Operator Python SDK version:

$ pip show kubeflow-training
Name: kubeflow-training
Version: 1.9.0rc0
Summary: Training Operator Python SDK
Home-page: https://github.com/kubeflow/training-operator/tree/master/sdk/python
Author: Kubeflow Authors
Author-email: [email protected]
License: Apache License Version 2.0
Location: /opt/conda/lib/python3.11/site-packages
Requires: certifi, kubernetes, retrying, setuptools, six, urllib3
Required-by: 

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.
