Commit 0f5c457

gitttt-1234claude and claude committed
Update default trainer configuration parameters
This commit updates the default values for the learning rate scheduler and optimizer configurations to improve training performance and stability:

**Configuration Changes:**

- `OptimizerConfig`:
  - Learning rate: `1e-3` → `1e-4`
- `ReduceLROnPlateauConfig`:
  - `threshold_mode`: `"rel"` → `"abs"`
  - `threshold`: `1e-4` → `1e-6`
  - `patience`: `10` → `5`
  - `factor`: `0.1` → `0.5`
  - `cooldown`: `0` → `3`
  - `min_lr`: `0.0` → `1e-8`
- `EarlyStoppingConfig`:
  - `min_delta`: `0.0` → `1e-8`
  - `patience`: `1` → `10`
- `LRSchedulerConfig`:
  - Now defaults to `ReduceLROnPlateauConfig` instead of `None`

**Files Updated:**

- Updated `sleap_nn/config/trainer_config.py` with new defaults and documentation
- Updated all sample config files in `docs/sample_configs/`
- Updated test config files in `tests/assets/model_ckpts/`
- Updated configuration documentation in `docs/config.md`
- Updated test assertions in `tests/config/test_trainer_config.py`

The new defaults provide:

- More conservative learning rate scheduling
- Better threshold sensitivity with absolute mode
- Improved early stopping behavior
- More stable training convergence

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
1 parent bdd853f commit 0f5c457

26 files changed: +69 −68 lines
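The headline change is the switch of `threshold_mode` from `"rel"` to `"abs"`. The difference matters at small loss values: in relative mode the required improvement shrinks with the loss itself, while in absolute mode it stays fixed. A minimal pure-Python sketch of the improvement test for a minimized metric (the function name is illustrative; the formulas mirror the min-mode semantics of PyTorch's `ReduceLROnPlateau` as described in `docs/config.md`):

```python
def is_improvement(current, best, threshold, threshold_mode):
    """Improvement test for a minimized metric (e.g. validation loss)."""
    if threshold_mode == "rel":
        # dynamic_threshold = best * (1 - threshold) in min mode
        return current < best * (1.0 - threshold)
    elif threshold_mode == "abs":
        # dynamic_threshold = best - threshold in min mode
        return current < best - threshold
    raise ValueError(f"unknown threshold_mode: {threshold_mode!r}")

# At a loss of ~1e-3, the old relative threshold (1e-4) only demanded a
# ~1e-7 drop, so almost any jitter counted as progress and the scheduler
# rarely detected a plateau. The new absolute threshold (1e-6) requires a
# fixed-size drop. A drop of 5e-7 illustrates the difference:
rel_says = is_improvement(0.0009995, 0.001, threshold=1e-4, threshold_mode="rel")
abs_says = is_improvement(0.0009995, 0.001, threshold=1e-6, threshold_mode="abs")
# rel_says is True; abs_says is False
```

This is why `threshold` also moves from `1e-4` to `1e-6`: a relative `1e-4` and an absolute `1e-6` are comparable in magnitude only near a loss of ~1e-2, and the absolute form behaves consistently below that.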

docs/config.md

Lines changed: 11 additions & 11 deletions
```diff
@@ -123,7 +123,7 @@ trainer_config:
     step_lr: null
     reduce_lr_on_plateau:
       threshold: 1.0e-06
-      threshold_mode: rel
+      threshold_mode: abs
       cooldown: 3
       patience: 5
       factor: 0.5
@@ -739,7 +739,7 @@ trainer_config:
 ### Optimizer Configuration
 - `optimizer_name`: (str) Optimizer to be used. One of ["Adam", "AdamW"]. **Default**: `"Adam"`
 - `optimizer`:
-  - `lr`: (float) Learning rate of type float. **Default**: `1e-3`
+  - `lr`: (float) Learning rate of type float. **Default**: `1e-4`
   - `amsgrad`: (bool) Enable AMSGrad with the optimizer. **Default**: `False`
 
 ### Learning Rate Schedulers
@@ -752,12 +752,12 @@ trainer_config:
 
 #### Reduce LR on Plateau
 - `lr_scheduler.reduce_lr_on_plateau`:
-  - `threshold`: (float) Threshold for measuring the new optimum, to only focus on significant changes. **Default**: `1e-4`
-  - `threshold_mode`: (str) One of "rel", "abs". In rel mode, dynamic_threshold = best * ( 1 + threshold ) in max mode or best * ( 1 - threshold ) in min mode. In abs mode, dynamic_threshold = best + threshold in max mode or best - threshold in min mode. **Default**: `"rel"`
-  - `cooldown`: (int) Number of epochs to wait before resuming normal operation after lr has been reduced. **Default**: `0`
-  - `patience`: (int) Number of epochs with no improvement after which learning rate will be reduced. For example, if patience = 2, then we will ignore the first 2 epochs with no improvement, and will only decrease the LR after the third epoch if the loss still hasn't improved then. **Default**: `10`
-  - `factor`: (float) Factor by which the learning rate will be reduced. new_lr = lr * factor. **Default**: `0.1`
-  - `min_lr`: (float or List[float]) A scalar or a list of scalars. A lower bound on the learning rate of all param groups or each group respectively. **Default**: `0.0`
+  - `threshold`: (float) Threshold for measuring the new optimum, to only focus on significant changes. **Default**: `1e-6`
+  - `threshold_mode`: (str) One of "rel", "abs". In rel mode, dynamic_threshold = best * ( 1 + threshold ) in max mode or best * ( 1 - threshold ) in min mode. In abs mode, dynamic_threshold = best + threshold in max mode or best - threshold in min mode. **Default**: `"abs"`
+  - `cooldown`: (int) Number of epochs to wait before resuming normal operation after lr has been reduced. **Default**: `3`
+  - `patience`: (int) Number of epochs with no improvement after which learning rate will be reduced. For example, if patience = 2, then we will ignore the first 2 epochs with no improvement, and will only decrease the LR after the third epoch if the loss still hasn't improved then. **Default**: `5`
+  - `factor`: (float) Factor by which the learning rate will be reduced. new_lr = lr * factor. **Default**: `0.5`
+  - `min_lr`: (float or List[float]) A scalar or a list of scalars. A lower bound on the learning rate of all param groups or each group respectively. **Default**: `1e-8`
 
 **Example Learning Rate Scheduler configurations:**
 
@@ -786,7 +786,7 @@ trainer_config:
     step_lr: null
     reduce_lr_on_plateau:
       threshold: 1e-6
-      threshold_mode: "rel"
+      threshold_mode: "abs"
       cooldown: 3
       patience: 5
       factor: 0.5
@@ -796,8 +796,8 @@ trainer_config:
 ### Early Stopping
 - `early_stopping`:
   - `stop_training_on_plateau`: (bool) True if early stopping should be enabled. **Default**: `False`
-  - `min_delta`: (float) Minimum change in the monitored quantity to qualify as an improvement, i.e. an absolute change of less than or equal to min_delta, will count as no improvement. **Default**: `0.0`
-  - `patience`: (int) Number of checks with no improvement after which training will be stopped. Under the default configuration, one check happens after every training epoch. **Default**: `1`
+  - `min_delta`: (float) Minimum change in the monitored quantity to qualify as an improvement, i.e. an absolute change of less than or equal to min_delta, will count as no improvement. **Default**: `1e-8`
+  - `patience`: (int) Number of checks with no improvement after which training will be stopped. Under the default configuration, one check happens after every training epoch. **Default**: `10`
 
 ### Online Hard Keypoint Mining (OHKM)
 - `online_hard_keypoint_mining`:
```
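The new early-stopping defaults (`min_delta: 1e-8`, `patience: 10`) are far more tolerant than the old ones (`0.0`, `1`), which could halt training after a single stagnant epoch. A hypothetical helper sketching the rule above (one check per training epoch; names are illustrative, not the sleap_nn API):

```python
def stopping_epoch(val_losses, min_delta=1e-8, patience=10):
    """Return the 0-indexed epoch at which early stopping would trigger,
    or None if training runs to completion. A change <= min_delta counts
    as no improvement, per the early_stopping docs above."""
    best = float("inf")
    checks_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if best - loss > min_delta:  # strict improvement beyond min_delta
            best = loss
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                return epoch
    return None

# Three improving epochs, then a flat plateau:
losses = [0.5, 0.4, 0.3] + [0.3] * 20
# New defaults: the 10th stagnant check lands at epoch 12.
# Old defaults (min_delta=0.0, patience=1): stops at epoch 3, the very
# first epoch without improvement.
```

Under the old defaults a single noisy validation epoch was enough to end the run; the new defaults require ten consecutive stagnant checks.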

docs/sample_configs/config_bottomup_convnext.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -122,7 +122,7 @@ trainer_config:
     step_lr: null
     reduce_lr_on_plateau:
       threshold: 1.0e-06
-      threshold_mode: rel
+      threshold_mode: abs
       cooldown: 3
       patience: 5
       factor: 0.5
```

docs/sample_configs/config_bottomup_unet_large_rf.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -133,7 +133,7 @@ trainer_config:
     step_lr: null
     reduce_lr_on_plateau:
       threshold: 1.0e-08
-      threshold_mode: rel
+      threshold_mode: abs
       cooldown: 3
       patience: 8
       factor: 0.5
```

docs/sample_configs/config_bottomup_unet_medium_rf.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -133,7 +133,7 @@ trainer_config:
     step_lr: null
     reduce_lr_on_plateau:
       threshold: 1.0e-08
-      threshold_mode: rel
+      threshold_mode: abs
       cooldown: 3
       patience: 8
       factor: 0.5
```

docs/sample_configs/config_centroid_swint.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -126,7 +126,7 @@ trainer_config:
     step_lr: null
     reduce_lr_on_plateau:
       threshold: 1.0e-06
-      threshold_mode: rel
+      threshold_mode: abs
       cooldown: 3
       patience: 5
       factor: 0.5
```

docs/sample_configs/config_centroid_unet.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -127,7 +127,7 @@ trainer_config:
     step_lr: null
     reduce_lr_on_plateau:
       threshold: 1.0e-08
-      threshold_mode: rel
+      threshold_mode: abs
       cooldown: 3
       patience: 5
       factor: 0.5
```

docs/sample_configs/config_multi_class_bottomup_unet.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -122,7 +122,7 @@ trainer_config:
     step_lr: null
     reduce_lr_on_plateau:
       threshold: 1.0e-06
-      threshold_mode: rel
+      threshold_mode: abs
       cooldown: 3
       patience: 5
       factor: 0.5
```

docs/sample_configs/config_single_instance_unet_large_rf.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -127,7 +127,7 @@ trainer_config:
     step_lr: null
     reduce_lr_on_plateau:
       threshold: 1.0e-05
-      threshold_mode: rel
+      threshold_mode: abs
       cooldown: 3
       patience: 5
       factor: 0.5
```

docs/sample_configs/config_single_instance_unet_medium_rf.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -127,7 +127,7 @@ trainer_config:
     step_lr: null
     reduce_lr_on_plateau:
       threshold: 1.0e-08
-      threshold_mode: rel
+      threshold_mode: abs
       cooldown: 3
       patience: 5
       factor: 0.5
```

docs/sample_configs/config_topdown_centered_instance_unet_large_rf.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -129,7 +129,7 @@ trainer_config:
     step_lr: null
    reduce_lr_on_plateau:
       threshold: 1.0e-08
-      threshold_mode: rel
+      threshold_mode: abs
       cooldown: 3
       patience: 5
       factor: 0.5
```
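Taken together, the new optimizer and scheduler defaults bound the learning-rate trajectory: training starts at `1e-4`, each detected plateau halves the LR (`factor: 0.5`), and `min_lr: 1e-8` clamps the floor. A quick sketch of how many reductions the schedule permits (ignoring the epoch spacing imposed by `patience` and `cooldown`):

```python
# New defaults: lr 1e-4, factor 0.5, min_lr 1e-8.
lr, min_lr, factor = 1e-4, 1e-8, 0.5
reductions = 0
while lr > min_lr:
    lr = max(lr * factor, min_lr)  # same clamping rule as new_lr = lr * factor, floored at min_lr
    reductions += 1
# 14 halvings fit between 1e-4 and the 1e-8 floor; the 14th is clamped.
```

With `patience: 5` and `cooldown: 3`, consecutive reductions are at least 8 epochs apart, so the floor is only reachable in long runs; under the old defaults (`factor: 0.1`, `min_lr: 0.0`) the LR collapsed by 10x per plateau with no floor at all.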
