Add MLflow support and expose logging configuration in TrainingArgs #680
base: main
Conversation
📝 Walkthrough

Adds optional MLflow, Weights & Biases, and TensorBoard logging: new `TrainingArgs` fields, an `MLflowHandler` and MLflow wiring in the metric logger, an expanded `setup_metric_logger` signature and CLI flags, optional dependency files, and README install notes.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI as main_ds.py
    participant Setup as setup_metric_logger()
    participant Logger as MetricLogger
    participant MLflow as MLflow
    participant WandB as WandB
    participant TB as TensorBoard
    User->>CLI: start training with logging flags
    CLI->>Setup: setup_metric_logger(output_dir, mlflow_..., wandb_..., tensorboard_log_dir)
    Setup->>Logger: configure handlers (MLflowHandler, WandB, TensorBoard, Async)
    Logger->>MLflow: _setup() -> set_tracking_uri / set_experiment / start_run
    Logger->>WandB: init run (project, entity, run_name)
    Logger->>TB: create writer (log_dir)
    CLI->>Logger: emit metrics/hparams LogRecord
    Logger->>MLflow: emit -> log_metrics / log_params
    Logger->>WandB: emit -> log metrics/params
    Logger->>TB: emit -> write scalars
    CLI->>Logger: shutdown
    Logger->>MLflow: close -> end_run()
    Logger->>WandB: finish run
    Logger->>TB: close writer
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Exposes logging configuration (tensorboard, wandb, mlflow, jsonl) through
flat kwargs in sft(), osft(), and lora_sft() convenience functions.
## New Parameters
- `loggers`: List of loggers to enable (e.g., ["wandb", "mlflow", "jsonl"])
- `run_name`: Run name with placeholder support ({time}, {rank})
- `log_level`: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- `logging_steps`: How often to log metrics
- `wandb_project`, `wandb_entity`, `wandb_run_name`: W&B configuration
- `tensorboard_log_dir`: TensorBoard output directory
- `mlflow_tracking_uri`, `mlflow_experiment_name`: MLflow configuration
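A minimal sketch of how these flat kwargs might be passed to the `sft()` convenience function. The import path and the model/data parameter names are assumptions for illustration, not verbatim from this PR:

```python
# Illustrative sketch only: import path, paths, and model/data kwargs are assumptions.
from training_hub import sft  # assumed location of the convenience function

sft(
    model_path="ibm-granite/granite-7b-base",   # placeholder
    data_path="train.jsonl",                    # placeholder
    ckpt_output_dir="checkpoints",              # placeholder
    # logging configuration exposed by this change
    loggers=["mlflow", "jsonl"],
    run_name="sft-{time}-rank{rank}",
    log_level="INFO",
    logging_steps=10,
    mlflow_tracking_uri="http://localhost:5000",
    mlflow_experiment_name="my-experiment",
)
```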
## Backend Support
| Logger | SFT | OSFT | LoRA |
|-------------|-----|------|------|
| wandb | Yes | Yes | Yes |
| tensorboard | Yes | No | Yes |
| mlflow | Yes | No | Yes |
| jsonl | Yes | Yes | No |
OSFT emits warnings for unsupported loggers/params and continues.
Depends on: instructlab/training#680
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/instructlab/training/main_ds.py`:
- Around line 275-283: The call to setup_metric_logger in main() uses
unnecessary defensive getattr() for mlflow/wandb fields; replace getattr(args,
"mlflow_tracking_uri", None), getattr(args, "mlflow_experiment_name", None),
getattr(args, "wandb_project", None), and getattr(args, "wandb_entity", None)
with direct attribute access args.mlflow_tracking_uri,
args.mlflow_experiment_name, args.wandb_project, and args.wandb_entity
respectively so it matches the pattern used in run_training() and with
train_args.
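In other words, the requested change is roughly the following (a sketch of the call site with only the relevant kwargs shown, not the exact diff):

```python
# Before: defensive lookups that can silently mask a missing field
setup_metric_logger(
    output_dir=args.output_dir,
    mlflow_tracking_uri=getattr(args, "mlflow_tracking_uri", None),
    wandb_project=getattr(args, "wandb_project", None),
)

# After: direct attribute access, matching run_training() and train_args
setup_metric_logger(
    output_dir=args.output_dir,
    mlflow_tracking_uri=args.mlflow_tracking_uri,
    wandb_project=args.wandb_project,
)
```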
🧹 Nitpick comments (2)
src/instructlab/training/logger.py (2)
638-665: Unused `log_dir` parameter. The `log_dir` parameter is stored as `self.log_dir` but never used in `_setup()` or elsewhere. The docstring mentions it's "used as artifact location" but the implementation doesn't pass it to MLflow. Either use it to set the artifact location or remove it to avoid confusion.

♻️ Option 1: Use log_dir as artifact location

```diff
     def _setup(self):
         """Initialize the MLflow run with the configured settings."""
         if mlflow is None:
             msg = (
                 "Could not initialize MLflowHandler because package mlflow could not be imported.\n"
                 "Please ensure it is installed by running 'pip install mlflow'"
             )
             raise RuntimeError(msg)
         if self.tracking_uri:
             mlflow.set_tracking_uri(self.tracking_uri)
         if self.experiment_name:
-            mlflow.set_experiment(self.experiment_name)
+            mlflow.set_experiment(
+                self.experiment_name,
+                artifact_location=str(self.log_dir),
+            )
         self._mlflow_run = mlflow.start_run(
             run_name=self.run_name, **self.mlflow_init_kwargs
         )
```

♻️ Option 2: Remove unused parameter

```diff
     def __init__(
         self,
         level: int = logging.INFO,
         run_name: str | None = None,
-        log_dir: str | os.PathLike = "logs",
         tracking_uri: str | None = None,
         experiment_name: str | None = None,
         **mlflow_init_kwargs: Any,
     ):
         """Initialize the MLflow logger and check for required dependencies.

         Args:
             level: The logging level for this handler
             run_name: Name of the run, can contain placeholders
-            log_dir: Directory where MLflow artifacts should be stored (used as artifact location)
             tracking_uri: MLflow tracking server URI (e.g., "http://localhost:5000")
             experiment_name: Name of the MLflow experiment
             **mlflow_init_kwargs: Additional keyword arguments passed to mlflow.start_run()
         """
         super().__init__(level)
         self.run_name = _substitute_placeholders(run_name)
-        self.log_dir = Path(log_dir)
         self.tracking_uri = tracking_uri
         self.experiment_name = experiment_name
         self.mlflow_init_kwargs = mlflow_init_kwargs.copy()
         self._mlflow_run = None
```

Note: If removing `log_dir`, also update `setup_metric_logger` to not pass it to the MLflow handler config.
711-721: Consider adding a debug log for skipped non-numeric metrics. Non-numeric values are silently skipped. For consistency with `TensorBoardHandler` (which warns on type errors), consider adding a debug-level message to help users understand why certain values aren't appearing in MLflow metrics.

♻️ Proposed change

```diff
             # Filter to only numeric values for metrics
             metrics_dict = {}
             for k, v in flat_dict.items():
                 try:
                     metrics_dict[k] = float(v)
                 except (ValueError, TypeError):
                     # Skip non-numeric values for metrics
-                    pass
+                    warnings.warn(
+                        f"MLflowHandler skipping non-numeric metric '{k}' with value {type(v).__name__}",
+                        stacklevel=2,
+                    )
```
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/instructlab/training/logger.py (1)
862-979: ⚠️ Potential issue | 🟡 Minor: Enable MLflow when experiment/run name is provided (not only tracking URI).

Currently, MLflow logging won't activate if users only set `mlflow_experiment_name` or `mlflow_run_name`, even though MLflow can operate with local filesystem storage when no tracking URI is specified. The detection logic should include these fields and the `MLFLOW_EXPERIMENT_NAME` environment variable to align with the design pattern used for other backends (e.g., wandb is enabled when `wandb_project` is set).

Proposed fix

```diff
-    if mlflow_tracking_uri or os.environ.get("MLFLOW_TRACKING_URI"):
+    if (
+        mlflow_tracking_uri
+        or mlflow_experiment_name
+        or mlflow_run_name
+        or os.environ.get("MLFLOW_TRACKING_URI")
+        or os.environ.get("MLFLOW_EXPERIMENT_NAME")
+    ):
         detected_loggers.append("mlflow")
```
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/instructlab/training/logger.py (2)
1-46: ⚠️ Potential issue | 🟡 Minor: Update the module docstring example to match the new setup API.

The example still uses `loggers=[...]` and `run_name=...`, but `setup_metric_logger` now relies on auto-detection and backend-specific kwargs. This will mislead users.

📝 Suggested docstring fix

```diff
-    # Setup logging with TensorBoard and wandb
-    setup_metric_logger(
-        loggers=["tensorboard", "wandb"],
-        run_name="my_training_run",
-        output_dir="logs"
-    )
+    # Setup logging with TensorBoard and wandb (auto-detected)
+    setup_metric_logger(
+        output_dir="logs",
+        tensorboard_log_dir="logs/tensorboard",
+        wandb_project="my_project",
+        wandb_run_name="my_training_run",
+    )
```
862-985: ⚠️ Potential issue | 🟠 Major: Run name no longer wired to async/tensorboard (likely regression).

`setup_metric_logger` used to accept `run_name`, but it's now missing and both async/tensorboard handlers hardcode `run_name=None`. This means the new `TrainingArgs.run_name` won't affect those backends. If the intent is a global run name, please reintroduce it and apply it as a fallback for backend-specific names.

✅ Suggested wiring

```diff
 def setup_metric_logger(
     output_dir,
     *,
+    run_name: str | None = None,
     mlflow_tracking_uri: str | None = None,
     mlflow_experiment_name: str | None = None,
     mlflow_run_name: str | None = None,
     wandb_project: str | None = None,
     wandb_entity: str | None = None,
     wandb_run_name: str | None = None,
     tensorboard_log_dir: str | None = None,
 ):
@@
         "handlers": {
             "async": {
                 "()": AsyncStructuredHandler,
                 "log_dir": output_dir,
-                "run_name": None,  # Uses default template
+                "run_name": run_name,  # Uses default template if None
                 "filters": async_filters,
             },
             "tensorboard": {
                 "()": TensorBoardHandler,
                 "log_dir": tensorboard_log_dir or output_dir,
-                "run_name": None,  # Uses default template
+                "run_name": run_name,  # Uses default template if None
                 "filters": ["is_mapping", "is_rank0"],
             },
             "wandb": {
                 "()": WandbHandler,
                 "log_dir": output_dir,
-                "run_name": wandb_run_name,
+                "run_name": wandb_run_name or run_name,
                 "project": wandb_project,
                 "entity": wandb_entity,
                 "filters": ["is_mapping", "is_rank0"],
             },
             "mlflow": {
                 "()": MLflowHandler,
-                "run_name": mlflow_run_name,
+                "run_name": mlflow_run_name or run_name,
                 "tracking_uri": mlflow_tracking_uri or os.environ.get("MLFLOW_TRACKING_URI"),
                 "experiment_name": mlflow_experiment_name or os.environ.get("MLFLOW_EXPERIMENT_NAME"),
                 "filters": ["is_mapping", "is_rank0"],
             },
```
🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 664-681: The _setup method currently calls mlflow.start_run()
unconditionally which raises if an MLflow run is already active; update _setup
to check mlflow.active_run() first and handle both cases: if mlflow.active_run()
exists, set self._mlflow_run = mlflow.active_run() to reuse it, otherwise call
mlflow.start_run(run_name=self.run_name, **self.mlflow_init_kwargs); optionally,
if you prefer nesting instead of reusing, call mlflow.start_run(...,
nested=True) when an active run exists—make this change around the existing
self._mlflow_run assignment and mlflow.start_run call in _setup.
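A minimal sketch of the suggested `_setup()` change, assuming the handler attributes named in the prompt above:

```python
import mlflow


def _setup(self):
    # Reuse an already-active MLflow run if one exists; otherwise start our own.
    active = mlflow.active_run()
    if active is not None:
        self._mlflow_run = active
    else:
        self._mlflow_run = mlflow.start_run(
            run_name=self.run_name, **self.mlflow_init_kwargs
        )
```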
Pass tensorboard_log_dir through to TrainingArgs
Update SFT backend to pass tensorboard_log_dir to TrainingArgs now that
instructlab-training supports configurable TensorBoard log directories.
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Enable MLflow support for OSFT backend
- Add "mlflow" to SUPPORTED_LOGGERS
- Remove mlflow params from UNSUPPORTED_PARAMS
- Add mlflow_run_name parameter throughout
- Update docstrings to reflect MLflow is now supported
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Address PR review comments
- Replace hardcoded data path with required=True in sft_mlflow_example.py
- Add stacklevel=2 to warnings.warn calls in osft.py
- Rename ambiguous loop variable 'l' to 'lg' in sft.py
- Add warnings for unsupported logging params in sft.py backend
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add OSFT MLflow example script
Similar to sft_mlflow_example.py but for OSFT training with
unfreeze_rank_ratio and other OSFT-specific parameters.
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add logging and experiment tracking documentation
Create comprehensive documentation for the unified logging API:
- MLflow, Weights & Biases, TensorBoard, and JSONL logging
- Configuration precedence (kwarg > env var > defaults)
- Backend-specific notes and supported loggers
- Example usage for all algorithms (SFT, OSFT, LoRA)
- Run naming with placeholder support
Update sidebar to include the new logging guide.
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Remove duplicate warnings import in sft.py
The warnings module is already imported at module level (line 2).
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Address review nitpicks in docs and examples
- Add language specifier to fenced code block in logging.md
- Remove unused result variable in example scripts
- Remove unnecessary wandb_run_name from MLflow-focused OSFT example
- Add comment clarifying /dev/shm usage for Linux shared memory
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Auto-detect loggers from config parameters
Loggers are now automatically enabled based on their configuration:
- mlflow_tracking_uri or MLFLOW_TRACKING_URI env → enables MLflow
- wandb_project or WANDB_PROJECT env → enables W&B
- tensorboard_log_dir → enables TensorBoard (SFT only)
The 'loggers' parameter is deprecated and emits a DeprecationWarning
when used. Updated API documentation with new logging configuration
sections and removed outdated logging guide.
Co-Authored-By: Claude Opus 4.5 <[email protected]>
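A rough sketch of the auto-detection and deprecation behavior this commit describes; the function name and exact warning text are illustrative, not copied from the change:

```python
import os
import warnings


def _detect_loggers(loggers=None, mlflow_tracking_uri=None, wandb_project=None, tensorboard_log_dir=None):
    # Explicit 'loggers' is still honored but deprecated.
    if loggers is not None:
        warnings.warn(
            "'loggers' is deprecated; backends are auto-detected from their configuration",
            DeprecationWarning,
            stacklevel=2,
        )
        return list(loggers)
    detected = []
    if mlflow_tracking_uri or os.environ.get("MLFLOW_TRACKING_URI"):
        detected.append("mlflow")
    if wandb_project or os.environ.get("WANDB_PROJECT"):
        detected.append("wandb")
    if tensorboard_log_dir:  # SFT only, per the commit message
        detected.append("tensorboard")
    return detected
```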
Address PR review feedback
- Fix markdown table alignment in docs for markdownlint compliance
- Change osft_mlflow_example.py default checkpoint dir from /tmp to ./
- Document that run_name is not supported by mini-trainer backend
- Add stacklevel=2 to block_size warning in osft.py
- Update loggers docstring to indicate deprecated status
Co-Authored-By: Claude Opus 4.5 <[email protected]>
remove extraneous api kwargs
removes more extraneous api additions
reintroduce mlflow_run_name into lora
docstrings
drop formatting changes
read envs when present
review comments
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 732-737: The handler currently always calls mlflow.end_run() in
close(), which can end runs it didn't start; add a boolean flag (e.g.,
self._owns_mlflow_run) that is set to True only when this handler creates/starts
an MLflow run (where mlflow.start_run() is invoked or when self._mlflow_run is
assigned) and False when reusing an existing active run, then modify close() to
call mlflow.end_run() only if self._owns_mlflow_run is True and clear both
self._mlflow_run and self._owns_mlflow_run during cleanup; ensure the flag is
updated wherever the run is started or adopted so ownership is tracked
correctly.
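One way the ownership flag could look, sketched against the handler described above (attribute names follow the prompt, not the merged code):

```python
def _setup(self):
    active = mlflow.active_run()
    if active is not None:
        self._mlflow_run = active
        self._owns_mlflow_run = False  # adopted an existing run; don't end it later
    else:
        self._mlflow_run = mlflow.start_run(
            run_name=self.run_name, **self.mlflow_init_kwargs
        )
        self._owns_mlflow_run = True  # this handler started the run


def close(self):
    if self._owns_mlflow_run:
        mlflow.end_run()
    self._mlflow_run = None
    self._owns_mlflow_run = False
```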
🧹 Nitpick comments (1)
src/instructlab/training/logger.py (1)
942-946: Simplify the async logger inclusion logic. The current logic is a bit convoluted. When `detected_loggers` is empty, line 943 sets `loggers = ["async"]`, then lines 945-946 check if async is not in loggers (which it already is). When `detected_loggers` is non-empty, it duplicates the list reference then appends.

♻️ Proposed simplification

```diff
-    # Always include async logger for file logging
-    loggers = detected_loggers if detected_loggers else ["async"]
-    # Also include async logger alongside other loggers for file-based logging
-    if detected_loggers and "async" not in loggers:
-        loggers.append("async")
+    # Always include async logger for file-based logging
+    loggers = detected_loggers + ["async"] if detected_loggers else ["async"]
```

Or even simpler:

```diff
-    # Always include async logger for file logging
-    loggers = detected_loggers if detected_loggers else ["async"]
-    # Also include async logger alongside other loggers for file-based logging
-    if detected_loggers and "async" not in loggers:
-        loggers.append("async")
+    # Always include async logger for file-based logging alongside other loggers
+    loggers = [*detected_loggers, "async"]
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 873-953: setup_metric_logger currently auto-enables backends based
only on backend-specific args and unconditionally ensures "async" is present,
and it doesn't propagate a global run_name to all backends; update it to accept
and honor an explicit logger selection (e.g., logger_type or logger_list
argument) so callers can request "tensorboard" or "async" even when
backend-specific fields are empty, only append "async" when explicitly requested
(or present in the explicit list), and ensure a global run_name parameter (or
run_name fallback) is forwarded into backend-specific settings (mlflow_run_name,
wandb_run_name, tensorboard handler config and any async file logger) so every
backend receives the same run name; change logic around detected_loggers/loggers
and the wiring to backend initializers (functions that configure
mlflow/wandb/tensorboard and the async handler) accordingly and apply the same
fix to the analogous block referenced around 968-997.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@pyproject.toml`:
- Around line 51-53: The pyproject defines optional extras
optional-dependencies.mlflow, optional-dependencies.wandb, and
optional-dependencies.tensorboard but the referenced files
requirements-mlflow.txt, requirements-wandb.txt, and
requirements-tensorboard.txt are missing; add those three files to the repo and
populate each with the correct dependency pins (e.g., mlflow and its compatible
extras in requirements-mlflow.txt, wandb and its compatible version in
requirements-wandb.txt, and tensorboard or tensorboard-related packages in
requirements-tensorboard.txt) following the same pinning style as the existing
optional requirements files (cuda/rocm/hpu/deepspeed) so pip install
'instructlab-training[mlflow]' / '[wandb]' / '[tensorboard]' succeeds.
…logging

- Add tensorboard_log_dir field to TrainingArgs in config.py
- Update setup_metric_logger to use tensorboard_log_dir when provided
- Add CLI argument for tensorboard_log_dir
- Wire tensorboard_log_dir through run_training() to subprocess command

This allows users to specify a custom directory for TensorBoard logs, defaulting to output_dir if not specified.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
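A plausible shape for the CLI argument mentioned in this commit; the flag name and default are assumptions based on the description, not the actual wiring in main_ds.py:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--tensorboard_log_dir",
    type=str,
    default=None,
    help="Directory for TensorBoard logs; falls back to output_dir when unset.",
)
args = parser.parse_args([])  # empty list: use defaults
```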
- Replace defensive getattr() with direct attribute access in main_ds.py since args are guaranteed to exist from argparse defaults
- Remove unused log_dir parameter from MLflowHandler
- Add debug logging for non-numeric metrics skipped by MLflowHandler

Co-Authored-By: Claude Opus 4.5 <[email protected]>
037454f to 3f7bfc2
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 891-896: The doc note incorrectly states the async logger default
template; either update the docstring to state that AsyncStructuredHandler uses
"training_params_and_metrics_global{rank}" when no run name is provided, or
change the AsyncStructuredHandler default to match "{time}_rank{rank}". Locate
the AsyncStructuredHandler class and its default run-name logic (and the
surrounding note in logger.py) and make them consistent: modify the note text to
the correct template string or update the AsyncStructuredHandler default
run-name value so both the implementation and the comment match.
- Around line 674-689: The _setup() logic currently sets mlflow.set_tracking_uri
and mlflow.set_experiment before checking mlflow.active_run(), causing explicit
tracking_uri and experiment_name in the handler to be silently ignored when an
active run exists; modify _setup() to check mlflow.active_run() first, and if
active is not None and (self.tracking_uri or self.experiment_name) emit a clear
warning (e.g., self.logger.warning or warnings.warn) that those settings will be
ignored, and only call mlflow.set_tracking_uri and mlflow.set_experiment when
starting a new run (i.e., in the branch where you assign self._mlflow_run =
mlflow.start_run and set self._owns_mlflow_run).
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/instructlab/training/logger.py (1)
883-1012: ⚠️ Potential issue | 🟡 Minor: Add `WANDB_ENTITY` to the auto-detection condition for wandb logging.

The environment variable names are correct (`MLFLOW_TRACKING_URI`, `MLFLOW_EXPERIMENT_NAME`, `WANDB_PROJECT` all verified), but the auto-detection logic for wandb is incomplete. It checks `wandb_project` and `os.environ.get("WANDB_PROJECT")` but omits `os.environ.get("WANDB_ENTITY")`. Since the WandbHandler accepts the entity parameter and WANDB_ENTITY is an official W&B environment variable, the detection condition should also check for it:

```python
if wandb_project or os.environ.get("WANDB_PROJECT") or os.environ.get("WANDB_ENTITY"):
    detected_loggers.append("wandb")
```

This ensures wandb logging is auto-enabled when the WANDB_ENTITY environment variable is set, consistent with how MLflow auto-detection includes multiple environment variables.
Summary

- Adds MLflow support and exposes logging configuration in `TrainingArgs` for programmatic API usage
- Adds `wandb_project` and `wandb_entity` fields to `TrainingArgs` for consistency

Changes

New `TrainingArgs` fields

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| `logger_type` | `str` | `"async"` | Any of `tensorboard`, `wandb`, `mlflow`, `async` |
| `run_name` | `str \| None` | `None` | Supports placeholders (`{time}`, `{rank}`, etc.) |
| `mlflow_tracking_uri` | `str \| None` | `None` | |
| `mlflow_experiment_name` | `str \| None` | `None` | |
| `wandb_project` | `str \| None` | `None` | |
| `wandb_entity` | `str \| None` | `None` | |

New `MLflowHandler` class

Implements the same interface as `TensorBoardHandler` and `WandbHandler`:

- Metrics via `mlflow.log_metrics()`
- Parameters via `mlflow.log_params()`
- `tracking_uri` and `experiment_name` configuration
- `MLFLOW_TRACKING_URI` and `MLFLOW_EXPERIMENT_NAME` env vars

Updated `run_training()` API

Previously, `run_training()` hardcoded the logger to `"async"`. Now it reads from `TrainingArgs`:

Example Usage
Summary by CodeRabbit
New Features
Documentation
Chores