Conversation
Pull request overview
This PR adds an optimizer preset system that allows users to quickly apply pre-configured hyperparameter sets for supported optimizers. The "speedrun" preset provides optimized settings for AdamW and Muon optimizers based on proven configurations.
- Adds a new `--optimizer_preset` CLI argument with support for "none" and "speedrun" presets
- Implements preset application logic that automatically configures hyperparameters for AdamW and Muon optimizers when the speedrun preset is selected
- Adds a new exploration configuration file to compare speedrun presets between optimizers
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| train_args.py | Adds the --optimizer_preset argument definition with choices for "none" and "speedrun" presets |
| train.py | Implements _apply_optimizer_presets() method that applies speedrun hyperparameters for AdamW and Muon optimizers |
| explorations/muon_vs_adamw.yaml | Adds a third parameter group for testing AdamW with default settings |
| explorations/muon_speedrun_preset.yaml | New configuration file comparing speedrun presets for Muon and AdamW against baseline AdamW |
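Based on the choices listed for `train_args.py`, the new argument might be defined roughly like this with argparse (a minimal sketch; the exact help text and surrounding parser setup are assumptions, not the PR's actual code):

```python
import argparse

parser = argparse.ArgumentParser()
# Choices mirror the PR description: "none" (the default) disables presets,
# "speedrun" applies the pre-configured hyperparameter set.
parser.add_argument(
    "--optimizer_preset",
    type=str,
    default="none",
    choices=["none", "speedrun"],
    help="Apply a pre-configured hyperparameter set for the chosen optimizer.",
)

args = parser.parse_args(["--optimizer_preset", "speedrun"])
print(args.optimizer_preset)  # speedrun
```

With `choices` set, argparse rejects any other value at parse time, so typos like `--optimizer_preset speedrn` fail fast instead of silently applying no preset.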
```python
elif self.args.optimizer == "muon":
    self.args.learning_rate = 2e-2
    self.args.muon_momentum = 0.95
    self.args.weight_decay = 0.0

if self.master_process:
    print(f"Applied {preset} preset for optimizer '{self.args.optimizer}'.")
```
When the speedrun preset is specified but the optimizer is neither "adamw" nor "muon", the function will print "Applied speedrun preset for optimizer 'X'" even though no preset values were actually applied. This could mislead users into thinking the preset was applied when it wasn't. Consider adding an else clause that either warns the user or skips the print statement when the optimizer doesn't have a preset implementation.
Suggested change:

```diff
+    if self.master_process:
+        print(f"Applied {preset} preset for optimizer 'adamw'.")
 elif self.args.optimizer == "muon":
     self.args.learning_rate = 2e-2
     self.args.muon_momentum = 0.95
     self.args.weight_decay = 0.0
-if self.master_process:
-    print(f"Applied {preset} preset for optimizer '{self.args.optimizer}'.")
+    if self.master_process:
+        print(f"Applied {preset} preset for optimizer 'muon'.")
+else:
+    if self.master_process:
+        print(f"Warning: No '{preset}' preset available for optimizer '{self.args.optimizer}'. No preset values were applied.")
```
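The branch structure the review asks for can be sketched as a standalone function. This is a hypothetical rewrite, not the PR's code: the function name, the `SPEEDRUN_PRESETS` table, and the boolean return value are all this sketch's inventions, and only the Muon values are taken from the diff above (the AdamW values are not shown in the quoted snippet, so they are omitted here):

```python
# Sketch of preset application with an explicit "no preset" path, assuming an
# argparse-style namespace like the one train.py uses. Hypothetical names.
from types import SimpleNamespace

SPEEDRUN_PRESETS = {
    # Muon values copied from the diff quoted in this review; other optimizers
    # (e.g. AdamW) would get their own entries.
    "muon": {"learning_rate": 2e-2, "muon_momentum": 0.95, "weight_decay": 0.0},
}

def apply_optimizer_preset(args, preset, master_process=True):
    """Apply preset hyperparameters to args; return True only if any were applied."""
    if preset == "none":
        return False
    # Only "speedrun" exists so far, so the table is keyed by optimizer alone.
    values = SPEEDRUN_PRESETS.get(args.optimizer)
    if values is None:
        if master_process:
            print(f"Warning: No '{preset}' preset available for optimizer "
                  f"'{args.optimizer}'. No preset values were applied.")
        return False
    for name, value in values.items():
        setattr(args, name, value)
    if master_process:
        print(f"Applied {preset} preset for optimizer '{args.optimizer}'.")
    return True
```

Because the success message is printed only after values are actually assigned, an unsupported optimizer can never produce the misleading "Applied … preset" log that the review flags.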
```yaml
    optimizer_preset: ["speedrun"]
  - optimizer: ["muon"]
    optimizer_preset: ["speedrun"]
    muon_momentum: [0.95]
```
The `muon_momentum` parameter is redundantly specified here, since line 512 in the `_apply_optimizer_presets()` method already sets it to 0.95 when the speedrun preset is used with the muon optimizer. This redundant specification could cause confusion about which value takes precedence.
Suggested change:

```diff
-    muon_momentum: [0.95]
```
This pull request introduces support for optimizer hyperparameter presets, specifically a "speedrun" preset for both the AdamW and Muon optimizers, and updates experiment configuration files to use this new feature. The main changes are adding the `--optimizer_preset` argument, implementing logic to apply preset hyperparameters, and updating YAML files to compare optimizers using the new preset.

Optimizer preset support:

- Adds `--optimizer_preset` (default: "none", options: "none", "speedrun") to allow selection of preset hyperparameters for supported optimizers (AdamW and Muon). (train_args.py)
- Implements the `_apply_optimizer_presets` method in the training script to assign preset hyperparameters for "speedrun" runs, affecting learning rate, betas, weight decay, and epsilon for both AdamW and Muon optimizers. This is now called during optimizer creation. (train.py)

Experiment configuration updates:

- Adds a new configuration file, `explorations/muon_speedrun_preset.yaml`, to compare Muon and AdamW optimizers using the "speedrun" preset on the minipile dataset.
- Updates `explorations/muon_vs_adamw.yaml` to include an AdamW baseline for comparison.
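Putting the quoted fragments together, the comparison described above might look roughly like the following exploration YAML. This is a guess at the file's shape: the `parameter_groups` key and the baseline group's exact fields are assumptions; only the `optimizer`, `optimizer_preset`, and value lists come from the snippets quoted in this review.

```yaml
# Hypothetical sketch of explorations/muon_speedrun_preset.yaml, reconstructed
# from fragments quoted in this review -- not the file's actual contents.
parameter_groups:
  - optimizer: ["adamw"]          # baseline AdamW with default settings
    optimizer_preset: ["none"]
  - optimizer: ["adamw"]          # AdamW with the speedrun preset applied
    optimizer_preset: ["speedrun"]
  - optimizer: ["muon"]           # Muon with the speedrun preset applied
    optimizer_preset: ["speedrun"]
```

Note that, per the review comment above, `muon_momentum` is deliberately left out of the Muon group here, since the speedrun preset already sets it.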