Commit ae458ca

talmo and claude authored
Fix CSV logger not capturing learning_rate (#423)
## Summary

Fixes a regression introduced in PR #417 where the `learning_rate` column in `training_log.csv` was always empty. Also adds model-specific loss columns to the CSV for better parity with wandb logging.

Fixes #422

## Root Cause

PR #417 made several changes to metrics logging:

1. Removed the `LearningRateMonitor` callback (which logged as `lr-Adam`)
2. Added manual learning rate logging as `train/lr`

However, the `CSVLoggerCallback` was only looking for:

- `learning_rate` (direct key - never logged)
- `lr-*` pattern (LearningRateMonitor format - no longer used)

The new `train/lr` key was never checked, resulting in empty `learning_rate` values.

## Changes

### 1. Fix learning rate lookup (`sleap_nn/training/callbacks.py`)

The `CSVLoggerCallback` now checks for the learning rate in this order:

1. `learning_rate` (direct key)
2. `train/lr` (current format from lightning modules) ← **NEW**
3. `lr-*` pattern (legacy LearningRateMonitor format)

### 2. Add model-specific CSV columns (`sleap_nn/training/model_trainer.py`)

Added loss breakdown columns for different model types to match what's logged to wandb:

| Model Type | New CSV Columns |
|------------|-----------------|
| `bottomup` | `train/confmaps_loss`, `train/paf_loss`, `val/confmaps_loss`, `val/paf_loss` |
| `multi_class_bottomup` | `train/confmaps_loss`, `train/classmap_loss`, `train/class_accuracy`, `val/confmaps_loss`, `val/classmap_loss`, `val/class_accuracy` |
| `multi_class_topdown` | `train/confmaps_loss`, `train/classvector_loss`, `train/class_accuracy`, `val/confmaps_loss`, `val/classvector_loss`, `val/class_accuracy` |

### 3. Add test (`tests/training/test_callbacks.py`)

Added `test_on_validation_epoch_end_logs_train_lr_format` to verify that the new `train/lr` key lookup works correctly.

## Example Output

**Before (broken):**

```csv
epoch,train/loss,val/loss,learning_rate,train/time,val/time
0,,0.006371453870087862,,,
1,0.0006624094676226377,0.0002221532049588859,,32.815,6.364
```

**After (fixed):**

```csv
epoch,train/loss,val/loss,learning_rate,train/time,val/time
0,,0.006371453870087862,,,
1,0.0006624094676226377,0.0002221532049588859,0.0001,32.815,6.364
```

## API Changes

### CSV Column Additions

The `training_log.csv` file will now include additional columns depending on the model type. This is a non-breaking change - existing code that reads the CSV will continue to work, and the new columns provide additional information.

**Note:** The CSV column name remains `learning_rate` (not `train/lr`) for backward compatibility with existing analysis scripts.

## Design Decisions

1. **Backward compatible column name**: We kept `learning_rate` as the CSV column name rather than changing to `train/lr` to avoid breaking existing analysis pipelines that expect the old name (see the verification sketch below).
2. **Fallback chain for LR lookup**: The callback checks multiple key formats in order, maintaining compatibility with:
   - Direct `learning_rate` logging (if someone uses it)
   - New `train/lr` format (current)
   - Legacy `lr-*` format (LearningRateMonitor)
3. **Model-specific columns**: Rather than logging all possible columns for all models (which would result in many empty columns), we only add columns relevant to each model type.
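As a quick check of the behavior shown in the example output above, an analysis script can keep reading the `learning_rate` column unchanged. A minimal sketch using only the standard library (the log path is illustrative; in practice the file lives under the run's checkpoint directory, and epoch 0 cells may be empty as in the example):

```python
import csv
from pathlib import Path

# Illustrative path to a run's CSV log.
log_path = Path("training_log.csv")

with open(log_path, newline="") as f:
    for row in csv.DictReader(f):
        # With this fix, learning_rate is populated (e.g. "0.0001") from the
        # first logged training epoch onward instead of staying empty.
        print(row["epoch"], row["train/loss"], row["val/loss"], row["learning_rate"])
```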
## Test Plan

- [x] `pytest tests/training/test_callbacks.py::TestCSVLoggerCallbackFileOps` - Unit tests for the CSV logger
- [x] `pytest tests/training/test_model_trainer.py::test_model_trainer_centered_instance` - Integration test verifying `learning_rate` is logged correctly

---

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.5 <[email protected]>
1 parent df72b51 commit ae458ca

File tree (3 files changed: +71 -2 lines changed)

- sleap_nn/training/callbacks.py
- sleap_nn/training/model_trainer.py
- tests/training/test_callbacks.py


sleap_nn/training/callbacks.py

Lines changed: 7 additions & 2 deletions
```diff
@@ -85,10 +85,15 @@ def on_validation_epoch_end(self, trainer, pl_module):
             if key == "epoch":
                 log_data["epoch"] = trainer.current_epoch
             elif key == "learning_rate":
-                # Handle both direct logging and LearningRateMonitor format (lr-*)
+                # Handle multiple formats:
+                # 1. Direct "learning_rate" key
+                # 2. "train/lr" key (current format from lightning modules)
+                # 3. "lr-*" keys from LearningRateMonitor (legacy)
                 value = metrics.get(key, None)
                 if value is None:
-                    # Look for lr-* keys from LearningRateMonitor
+                    value = metrics.get("train/lr", None)
+                if value is None:
+                    # Look for lr-* keys from LearningRateMonitor (legacy)
                     for metric_key in metrics.keys():
                         if metric_key.startswith("lr-"):
                             value = metrics[metric_key]
```

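Since the hunk above is truncated mid-loop, here is the lookup order it implements as a standalone sketch (not the callback's exact code: `metrics` stands for `trainer.callback_metrics`, and the helper name is made up for illustration):

```python
def resolve_learning_rate(metrics: dict):
    """Return the learning rate value using the fallback chain from this fix."""
    # 1. Direct "learning_rate" key, if something logs it explicitly.
    value = metrics.get("learning_rate")
    if value is None:
        # 2. "train/lr", the current key logged by the lightning modules.
        value = metrics.get("train/lr")
    if value is None:
        # 3. Legacy "lr-*" keys written by LearningRateMonitor (e.g. "lr-Adam").
        for key in metrics:
            if key.startswith("lr-"):
                value = metrics[key]
                break  # simplified; the truncated hunk does not show how the loop exits
    return value
```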
sleap_nn/training/model_trainer.py

Lines changed: 32 additions & 0 deletions
```diff
@@ -849,6 +849,7 @@ def _setup_loggers_callbacks(self, viz_train_dataset, viz_val_dataset):
             "train/time",
             "val/time",
         ]
+        # Add model-specific keys for wandb parity
         if self.model_type in [
             "single_instance",
             "centered_instance",
@@ -857,6 +858,37 @@ def _setup_loggers_callbacks(self, viz_train_dataset, viz_val_dataset):
             csv_log_keys.extend(
                 [f"train/confmaps/{name}" for name in self.skeletons[0].node_names]
             )
+        if self.model_type == "bottomup":
+            csv_log_keys.extend(
+                [
+                    "train/confmaps_loss",
+                    "train/paf_loss",
+                    "val/confmaps_loss",
+                    "val/paf_loss",
+                ]
+            )
+        if self.model_type == "multi_class_bottomup":
+            csv_log_keys.extend(
+                [
+                    "train/confmaps_loss",
+                    "train/classmap_loss",
+                    "train/class_accuracy",
+                    "val/confmaps_loss",
+                    "val/classmap_loss",
+                    "val/class_accuracy",
+                ]
+            )
+        if self.model_type == "multi_class_topdown":
+            csv_log_keys.extend(
+                [
+                    "train/confmaps_loss",
+                    "train/classvector_loss",
+                    "train/class_accuracy",
+                    "val/confmaps_loss",
+                    "val/classvector_loss",
+                    "val/class_accuracy",
+                ]
+            )
         csv_logger = CSVLoggerCallback(
             filepath=Path(self.config.trainer_config.ckpt_dir)
             / self.config.trainer_config.run_name
```

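Assuming the CSV header follows the order in which `csv_log_keys` is assembled (the base keys from the example output in the description, then the model-specific extension), a `bottomup` run's `training_log.csv` header would look roughly like the sketch below; the `multi_class_*` variants follow the same pattern with their respective loss and accuracy columns:

```csv
epoch,train/loss,val/loss,learning_rate,train/time,val/time,train/confmaps_loss,train/paf_loss,val/confmaps_loss,val/paf_loss
```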
tests/training/test_callbacks.py

Lines changed: 32 additions & 0 deletions
```diff
@@ -622,6 +622,38 @@ def test_on_validation_epoch_end_logs_metrics(self):
             assert len(lines) == 2  # Header + data row
             assert "5" in lines[1]  # Epoch
 
+    def test_on_validation_epoch_end_logs_train_lr_format(self):
+        """Logs learning rate from train/lr key (current format)."""
+        with tempfile.TemporaryDirectory() as tmpdir:
+            filepath = Path(tmpdir) / "metrics.csv"
+            callback = CSVLoggerCallback(filepath=filepath)
+
+            mock_trainer = MagicMock()
+            mock_trainer.is_global_zero = True
+            mock_trainer.current_epoch = 3
+            mock_trainer.callback_metrics = {
+                "train_loss": torch.tensor(0.4),
+                "val_loss": torch.tensor(0.2),
+                "train/lr": torch.tensor(
+                    0.0005
+                ),  # Current format from lightning modules
+            }
+            mock_pl_module = MagicMock()
+
+            with patch("sleap_nn.training.callbacks.RANK", 0):
+                callback.on_validation_epoch_end(mock_trainer, mock_pl_module)
+
+            assert filepath.exists()
+
+            # Read and verify contents
+            import csv
+
+            with open(filepath) as f:
+                reader = csv.DictReader(f)
+                row = next(reader)
+                assert row["epoch"] == "3"
+                assert row["learning_rate"].startswith("0.0005")
+
     def test_on_validation_epoch_end_skips_if_not_global_zero(self):
         """Skips logging if not global rank zero."""
         with tempfile.TemporaryDirectory() as tmpdir:
```

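For completeness, a hypothetical companion test for the legacy `lr-*` path could follow the same pattern as the test above (the test name, epoch, and values here are illustrative, and the repository may already cover this case elsewhere):

```python
    def test_on_validation_epoch_end_logs_legacy_lr_monitor_format(self):
        """Hypothetical: logs learning rate from legacy lr-* keys (LearningRateMonitor)."""
        with tempfile.TemporaryDirectory() as tmpdir:
            filepath = Path(tmpdir) / "metrics.csv"
            callback = CSVLoggerCallback(filepath=filepath)

            mock_trainer = MagicMock()
            mock_trainer.is_global_zero = True
            mock_trainer.current_epoch = 1
            mock_trainer.callback_metrics = {
                "train_loss": torch.tensor(0.4),
                "val_loss": torch.tensor(0.2),
                "lr-Adam": torch.tensor(0.001),  # Legacy LearningRateMonitor key
            }
            mock_pl_module = MagicMock()

            with patch("sleap_nn.training.callbacks.RANK", 0):
                callback.on_validation_epoch_end(mock_trainer, mock_pl_module)

            import csv

            with open(filepath) as f:
                row = next(csv.DictReader(f))
            assert row["learning_rate"].startswith("0.001")
```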