
[bug] log_metric() fails when logging NaN values, causing component failures #12227

@wassimbensalem

Description

Problem Description

When using metrics_out.log_metric() in Kubeflow Pipelines components, passing NaN values causes the entire component to fail with serialization errors. This is a common issue in ML pipelines where models may produce NaN metrics (e.g., when predictions are invalid or data is missing).

Expected Behavior

log_metric() should handle NaN values gracefully, either by:

  • Converting them to a default value (e.g., 0.0)
  • Skipping NaN metrics with a warning
  • Providing a clear error message

Current Behavior

The component fails during output serialization when a NaN value is passed to log_metric(); see the error output below.

Minimal Reproduction Example

from kfp.dsl import component, Output, Metrics

@component(packages_to_install=["numpy"])
def evaluate_model(metrics_out: Output[Metrics]):
    import numpy as np

    # Simulate metrics that might contain NaN values
    metrics = {
        'accuracy': 0.85,
        'precision': np.nan,      # This causes the failure
        'recall': 0.92,
        'f1_score': float('nan')  # This also causes the failure
    }

    # The component fails on the first NaN value logged here
    for metric, value in metrics.items():
        metrics_out.log_metric(metric, value)  # Fails on NaN values
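For completeness, the component can be exercised without a full cluster via KFP's local execution support (kfp.local, available in recent 2.x SDKs). Treat this as a convenience sketch rather than part of the original report; whether the exact serialization error surfaces locally may depend on the runner:

from kfp import local

# Execute the component in a subprocess-backed local environment.
# SubprocessRunner installs packages_to_install (numpy) into a fresh venv by default.
local.init(runner=local.SubprocessRunner())

evaluate_model()  # the NaN metrics are logged and serialized as component outputs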

Error Output

ValueError: invalid literal for int() with base 10: 'nan'

Workaround

Currently, users must manually check for NaN values before logging:

import math
import numbers

for metric, value in metrics.items():
    # math.isnan() accepts Python ints/floats as well as NumPy floating scalars
    if isinstance(value, numbers.Real) and math.isnan(value):
        value = 0.0  # or skip logging
    metrics_out.log_metric(metric, value)

Proposed Solution

The KFP SDK should handle NaN values internally in the log_metric() method (a sketch follows the list below), either by:

  1. Converting NaN to a configurable default value
  2. Skipping NaN metrics with a warning log
  3. Raising a more descriptive error message
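As a rough illustration of options 1 and 2, here is a hedged sketch of the kind of guard the SDK (or, today, user code) could apply. The helper name log_metric_safe and its nan_default parameter are hypothetical, not existing KFP API; it simply filters values before delegating to the current Metrics.log_metric():

import math
import numbers
import warnings

def log_metric_safe(metrics_artifact, metric: str, value, nan_default=None) -> None:
    # Hypothetical helper: replace NaN with nan_default, or skip it with a
    # warning when no default is given, then delegate to the existing API.
    if isinstance(value, numbers.Real) and math.isnan(value):
        if nan_default is None:
            warnings.warn(f"Metric '{metric}' is NaN; skipping.", RuntimeWarning)
            return
        value = nan_default
    metrics_artifact.log_metric(metric, value)

In the reproduction component above, the logging loop would then call log_metric_safe(metrics_out, metric, value) instead of metrics_out.log_metric(metric, value).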

Environment

  • Kubeflow Pipelines version: 2.13.0
  • Python version: 3.9.1

Additional Context

This issue affects ML practitioners who work with models that can produce NaN metrics, which is common in scenarios with:

  • Invalid predictions
  • Missing data
  • Division by zero in metric calculations (see the example after this list)
  • Edge cases in model evaluation
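For instance, a precision computed as 0/0 when a model makes no positive predictions silently becomes NaN under NumPy arithmetic (a minimal illustration, not taken from the report):

import numpy as np

# Precision = TP / (TP + FP); with no positive predictions this is 0/0 -> nan
tp, fp = 0, 0
with np.errstate(invalid="ignore"):  # silence the "invalid value" RuntimeWarning
    precision = np.float64(tp) / np.float64(tp + fp)

print(precision)  # nan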

A fix would improve the robustness of KFP components and reduce the need for manual NaN handling in every pipeline.

Should I start implementing this?


Impacted by this bug? Give it a 👍.
