Problem Description

When using `metrics_out.log_metric()` in Kubeflow Pipelines components, passing NaN values causes the entire component to fail with a serialization error. This is a common issue in ML pipelines, where models may produce NaN metrics (e.g., when predictions are invalid or data is missing).
Expected Behavior

`log_metric()` should handle NaN values gracefully, for example by:
- Converting them to a default value (e.g., 0.0)
- Skipping NaN metrics with a warning
- Providing a clear error message
Current Behavior

The component fails with a serialization error when a NaN value is passed to `log_metric()`.
Minimal Reproduction Example

```python
from kfp.dsl import component, Output, Metrics


@component
def evaluate_model(metrics_out: Output[Metrics]):
    import numpy as np

    # Simulate metrics that might contain NaN values
    metrics = {
        'accuracy': 0.85,
        'precision': np.nan,       # This causes the failure
        'recall': 0.92,
        'f1_score': float('nan'),  # This also causes the failure
    }

    # Logging fails as soon as a NaN value is reached
    for metric, value in metrics.items():
        metrics_out.log_metric(metric, value)
```
Error Output

```
ValueError: invalid literal for int() with base 10: 'nan'
```
Workaround

Currently, users must manually check for NaN values before logging:

```python
import math

import numpy as np

for metric, value in metrics.items():
    # math.isnan() accepts Python floats and NumPy floating types alike
    if isinstance(value, (float, np.floating)) and math.isnan(value):
        value = 0.0  # or skip logging instead
    metrics_out.log_metric(metric, value)
```
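
To avoid repeating this check in every component, the logic can be factored into a small helper. The sketch below is illustrative only; `sanitize_metrics` is a hypothetical name, not part of the KFP SDK:

```python
import math
import warnings
from typing import Dict, Optional


def sanitize_metrics(metrics: Dict[str, float],
                     default: Optional[float] = None) -> Dict[str, float]:
    """Drop NaN metrics, or replace them with `default` when one is given."""
    clean = {}
    for name, value in metrics.items():
        try:
            is_nan = math.isnan(value)  # also handles NumPy floating types
        except TypeError:
            is_nan = False  # non-numeric values pass through unchanged
        if is_nan:
            if default is None:
                warnings.warn(f"Skipping NaN metric '{name}'")
                continue
            value = default
        clean[name] = value
    return clean


# Usage inside a component:
#     for metric, value in sanitize_metrics(metrics, default=0.0).items():
#         metrics_out.log_metric(metric, value)
```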
Proposed Solution

The KFP SDK should handle NaN values internally in the `log_metric()` method, for example by:
- Converting NaN to a configurable default value
- Skipping NaN metrics with a warning log (sketched below)
- Raising a more descriptive error message
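
For illustration, the warn-and-skip option could look roughly like this inside the SDK. This is only a sketch: the real `log_metric()` implementation in `kfp.dsl` may differ, and the metadata assignment shown is an assumption about how the artifact stores metrics:

```python
import logging
import math

logger = logging.getLogger(__name__)


def log_metric(self, metric: str, value: float) -> None:
    """Hypothetical NaN-aware variant of Metrics.log_metric()."""
    if isinstance(value, float) and math.isnan(value):
        logger.warning(
            "Metric '%s' is NaN and will be skipped; log a finite value "
            "to record it.", metric)
        return
    self.metadata[metric] = value  # assumed storage location; illustrative
```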
Environment
- Kubeflow Pipelines version: 2.13.0
- Python version: 3.9.1
Additional Context
This issue affects ML practitioners who work with models that can produce NaN metrics, which is common in scenarios with:
- Invalid predictions
- Missing data
- Division by zero in metric calculations (illustrated below)
- Edge cases in model evaluation
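
For example, precision is undefined when a model predicts no positives, and NumPy-based metric code yields NaN in that case:

```python
import numpy as np

tp, fp = 0, 0  # the model predicted no positives at all
precision = np.float64(tp) / np.float64(tp + fp)  # 0/0 -> nan (RuntimeWarning)
print(precision)  # nan
```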
A fix would improve the robustness of KFP components and reduce the need for manual NaN handling in every pipeline.
Should I start implementing this?
Impacted by this bug? Give it a 👍.