
Conversation

@RobotSail (Member) commented Jan 27, 2026

Summary

  • Adds MLflow as a new logging backend alongside TensorBoard, W&B, and async JSONL
  • Exposes logging configuration through TrainingArgs for programmatic API usage
  • Adds wandb_project and wandb_entity fields to TrainingArgs for consistency

Changes

New TrainingArgs fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `logger_type` | `str` | `"async"` | Comma-separated loggers: `tensorboard`, `wandb`, `mlflow`, `async` |
| `run_name` | `str \| None` | `None` | Run name with placeholder support (`{time}`, `{rank}`, etc.) |
| `mlflow_tracking_uri` | `str \| None` | `None` | MLflow tracking server URI |
| `mlflow_experiment_name` | `str \| None` | `None` | MLflow experiment name |
| `wandb_project` | `str \| None` | `None` | W&B project name |
| `wandb_entity` | `str \| None` | `None` | W&B team/entity name |
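
For illustration, a minimal sketch of how the `run_name` placeholders might be filled in; the helper name is hypothetical and the actual implementation in `logger.py` may differ:

```python
# Hypothetical sketch of run_name placeholder substitution ({time}, {rank});
# not the PR's exact implementation.
import datetime
import os


def substitute_placeholders(run_name: str | None) -> str | None:
    if run_name is None:
        return None
    return run_name.format(
        time=datetime.datetime.now().strftime("%Y%m%d_%H%M%S"),
        rank=os.environ.get("RANK", "0"),
    )


print(substitute_placeholders("experiment-{time}"))  # e.g. 'experiment-20260127_120000'
```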

New MLflowHandler class

Implements the same interface as TensorBoardHandler and WandbHandler:

  • Logs metrics via mlflow.log_metrics()
  • Logs hyperparameters via mlflow.log_params()
  • Supports tracking_uri and experiment_name configuration
  • Falls back to MLFLOW_TRACKING_URI and MLFLOW_EXPERIMENT_NAME env vars
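
As a rough sketch of that interface, a minimal MLflow-backed `logging.Handler` could look like the following; this is a simplification under the assumption that log records carry dict payloads, not the PR's actual class:

```python
# Simplified sketch of an MLflow-backed logging handler; the real MLflowHandler
# in this PR carries more configuration, filters, and error handling.
import logging
import os

try:
    import mlflow
except ImportError:  # safe-import guard, as described above
    mlflow = None


class MLflowHandlerSketch(logging.Handler):
    def __init__(self, tracking_uri=None, experiment_name=None, run_name=None):
        super().__init__()
        # Fall back to the standard MLflow environment variables
        self.tracking_uri = tracking_uri or os.environ.get("MLFLOW_TRACKING_URI")
        self.experiment_name = experiment_name or os.environ.get("MLFLOW_EXPERIMENT_NAME")
        self.run_name = run_name
        self._run = None

    def _setup(self):
        if mlflow is None:
            raise RuntimeError("mlflow is not installed; run 'pip install mlflow'")
        if self.tracking_uri:
            mlflow.set_tracking_uri(self.tracking_uri)
        if self.experiment_name:
            mlflow.set_experiment(self.experiment_name)
        self._run = mlflow.start_run(run_name=self.run_name)

    def emit(self, record):
        if self._run is None:
            self._setup()
        payload = record.msg if isinstance(record.msg, dict) else {}
        metrics = {k: float(v) for k, v in payload.items()
                   if isinstance(v, (int, float)) and k != "step"}
        if metrics:
            step = payload.get("step")
            mlflow.log_metrics(metrics, step=int(step) if step is not None else None)

    def close(self):
        if self._run is not None:
            mlflow.end_run()
        super().close()
```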

Updated run_training() API

Previously, run_training() hardcoded the logger to "async". Now it reads from TrainingArgs:

# Before
setup_metric_logger("async", None, train_args.ckpt_output_dir)

# After
setup_metric_logger(
    train_args.logger_type,
    train_args.run_name,
    train_args.ckpt_output_dir,
    mlflow_tracking_uri=train_args.mlflow_tracking_uri,
    mlflow_experiment_name=train_args.mlflow_experiment_name,
    wandb_project=train_args.wandb_project,
    wandb_entity=train_args.wandb_entity,
)

Example Usage

from instructlab.training import run_training, TrainingArgs, TorchrunArgs

train_args = TrainingArgs(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    data_path="./data.jsonl",
    ckpt_output_dir="./outputs",
    # ... other required fields ...
    
    # New logging configuration
    logger_type="wandb,mlflow",
    run_name="experiment-{time}",
    mlflow_tracking_uri="http://localhost:5000",
    mlflow_experiment_name="my-experiments",
    wandb_project="my-project",
)

torch_args = TorchrunArgs(
    nnodes=1,
    nproc_per_node=8,  # adjust to your hardware; other torchrun fields elided
)

run_training(torch_args, train_args)
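
The optional logging backends can be installed through the extras added in this PR, for example `pip install 'instructlab-training[mlflow]'` (and likewise for the `wandb` and `tensorboard` extras).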

Test plan

  • Verify MLflow handler logs metrics correctly to a local MLflow server
  • Verify W&B logging still works with new wandb_project/wandb_entity fields
  • Verify backward compatibility: existing code without logging params defaults to async
  • Verify comma-separated logger_type enables multiple backends simultaneously
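
As a rough illustration of the last check, a self-contained pytest-style sketch; the parsing helper here is hypothetical, standing in for whatever `logger.py` actually does:

```python
# Hypothetical test sketch for comma-separated logger_type handling; the helper
# below is illustrative, not the library's real parsing function.
def parse_logger_type(logger_type: str) -> list[str]:
    return [name.strip() for name in logger_type.split(",") if name.strip()]


def test_comma_separated_logger_type_enables_multiple_backends():
    assert parse_logger_type("wandb,mlflow") == ["wandb", "mlflow"]


def test_default_logger_type_is_async_only():
    assert parse_logger_type("async") == ["async"]
```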

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Multi-backend metric logging: MLflow, Weights & Biases (W&B), and TensorBoard support.
    • New CLI/training options to configure MLflow (tracking/experiment/run), W&B (project/entity/run), and TensorBoard log directory.
  • Documentation

    • README and examples updated to show multi-backend logging, auto-detection, and configuration.
  • Chores

    • Added optional install entries and dedicated requirements files for MLflow, W&B, and TensorBoard.

@coderabbitai bot commented Jan 27, 2026

📝 Walkthrough

Adds optional MLflow, Weights & Biases, and TensorBoard logging: new TrainingArgs fields, MLflowHandler and MLflow wiring in the metric logger, expanded setup_metric_logger signature and CLI flags, optional dependency files, and README install notes.

Changes

| Cohort / File(s) | Summary |
|------------------|---------|
| **Config: Training arguments**<br>`src/instructlab/training/config.py` | Added seven optional TrainingArgs fields: `mlflow_tracking_uri`, `mlflow_experiment_name`, `mlflow_run_name`, `wandb_project`, `wandb_entity`, `wandb_run_name`, `tensorboard_log_dir`. |
| **Logging backend & API**<br>`src/instructlab/training/logger.py` | Added `MLflowHandler` with safe-import guard; integrated MLflow into the metric logger lifecycle; expanded the `setup_metric_logger(...)` signature to accept MLflow/W&B/TensorBoard params and auto-wire handlers alongside the existing async/tensorboard/wandb backends. |
| **CLI / Entrypoint integration**<br>`src/instructlab/training/main_ds.py` | Added CLI flags for MLflow/W&B/TensorBoard; threaded the new options into the subprocess command; updated `setup_metric_logger()` calls to pass the new args from `train_args`. |
| **Docs & packaging**<br>`README.md`, `pyproject.toml`, `requirements-mlflow.txt`, `requirements-wandb.txt`, `requirements-tensorboard.txt` | README: new optional logging dependencies section and install notes. `pyproject.toml`: added optional-dependencies entries referencing the three requirements files. Added three requirements files listing `mlflow`, `wandb`, and `tensorboard`. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI as main_ds.py
    participant Setup as setup_metric_logger()
    participant Logger as MetricLogger
    participant MLflow as MLflow
    participant WandB as WandB
    participant TB as TensorBoard

    User->>CLI: start training with logging flags
    CLI->>Setup: setup_metric_logger(output_dir, mlflow_..., wandb_..., tensorboard_log_dir)
    Setup->>Logger: configure handlers (MLflowHandler, WandB, TensorBoard, Async)
    Logger->>MLflow: _setup() -> set_tracking_uri / set_experiment / start_run
    Logger->>WandB: init run (project, entity, run_name)
    Logger->>TB: create writer (log_dir)
    CLI->>Logger: emit metrics/hparams LogRecord
    Logger->>MLflow: emit -> log_metrics / log_params
    Logger->>WandB: emit -> log metrics/params
    Logger->>TB: emit -> write scalars
    CLI->>Logger: shutdown
    Logger->>MLflow: close -> end_run()
    Logger->>WandB: finish run
    Logger->>TB: close writer

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I nudged the args and traced each line,

MLflow, WandB, TensorBoard — all in kind,
Metrics hopped in, params snug and neat,
Handlers danced to every recorded beat,
A rabbit cheered the run, soft and spry.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely describes the main change: adding MLflow support and exposing logging configuration as new fields in TrainingArgs. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 90.91%, which is sufficient. The required threshold is 80.00%. |


@mergify mergify bot added the ci-failure label Jan 27, 2026
RobotSail added a commit to Red-Hat-AI-Innovation-Team/training_hub that referenced this pull request Jan 27, 2026
Exposes logging configuration (tensorboard, wandb, mlflow, jsonl) through
flat kwargs in sft(), osft(), and lora_sft() convenience functions.

## New Parameters

- `loggers`: List of loggers to enable (e.g., ["wandb", "mlflow", "jsonl"])
- `run_name`: Run name with placeholder support ({time}, {rank})
- `log_level`: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- `logging_steps`: How often to log metrics
- `wandb_project`, `wandb_entity`, `wandb_run_name`: W&B configuration
- `tensorboard_log_dir`: TensorBoard output directory
- `mlflow_tracking_uri`, `mlflow_experiment_name`: MLflow configuration

## Backend Support

| Logger      | SFT | OSFT | LoRA |
|-------------|-----|------|------|
| wandb       | Yes | Yes  | Yes  |
| tensorboard | Yes | No   | Yes  |
| mlflow      | Yes | No   | Yes  |
| jsonl       | Yes | Yes  | No   |

OSFT emits warnings for unsupported loggers/params and continues.

Depends on: instructlab/training#680

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/instructlab/training/main_ds.py`:
- Around line 275-283: The call to setup_metric_logger in main() uses
unnecessary defensive getattr() for mlflow/wandb fields; replace getattr(args,
"mlflow_tracking_uri", None), getattr(args, "mlflow_experiment_name", None),
getattr(args, "wandb_project", None), and getattr(args, "wandb_entity", None)
with direct attribute access args.mlflow_tracking_uri,
args.mlflow_experiment_name, args.wandb_project, and args.wandb_entity
respectively so it matches the pattern used in run_training() and with
train_args.
🧹 Nitpick comments (2)
src/instructlab/training/logger.py (2)

638-665: Unused log_dir parameter.

The log_dir parameter is stored as self.log_dir but never used in _setup() or elsewhere. The docstring mentions it's "used as artifact location" but the implementation doesn't pass it to MLflow. Either use it to set the artifact location or remove it to avoid confusion.

♻️ Option 1: Use log_dir as artifact location
     def _setup(self):
         """Initialize the MLflow run with the configured settings."""
         if mlflow is None:
             msg = (
                 "Could not initialize MLflowHandler because package mlflow could not be imported.\n"
                 "Please ensure it is installed by running 'pip install mlflow'"
             )
             raise RuntimeError(msg)

         if self.tracking_uri:
             mlflow.set_tracking_uri(self.tracking_uri)

         if self.experiment_name:
-            mlflow.set_experiment(self.experiment_name)
+            mlflow.set_experiment(
+                self.experiment_name,
+                artifact_location=str(self.log_dir),
+            )

         self._mlflow_run = mlflow.start_run(
             run_name=self.run_name, **self.mlflow_init_kwargs
         )
♻️ Option 2: Remove unused parameter
     def __init__(
         self,
         level: int = logging.INFO,
         run_name: str | None = None,
-        log_dir: str | os.PathLike = "logs",
         tracking_uri: str | None = None,
         experiment_name: str | None = None,
         **mlflow_init_kwargs: Any,
     ):
         """Initialize the MLflow logger and check for required dependencies.

         Args:
             level: The logging level for this handler
             run_name: Name of the run, can contain placeholders
-            log_dir: Directory where MLflow artifacts should be stored (used as artifact location)
             tracking_uri: MLflow tracking server URI (e.g., "http://localhost:5000")
             experiment_name: Name of the MLflow experiment
             **mlflow_init_kwargs: Additional keyword arguments passed to mlflow.start_run()
         """
         super().__init__(level)

         self.run_name = _substitute_placeholders(run_name)
-        self.log_dir = Path(log_dir)
         self.tracking_uri = tracking_uri
         self.experiment_name = experiment_name
         self.mlflow_init_kwargs = mlflow_init_kwargs.copy()

         self._mlflow_run = None

Note: If removing log_dir, also update setup_metric_logger to not pass it to the MLflow handler config.


711-721: Consider adding a debug log for skipped non-numeric metrics.

Non-numeric values are silently skipped. For consistency with TensorBoardHandler (which warns on type errors), consider adding a debug-level message to help users understand why certain values aren't appearing in MLflow metrics.

♻️ Proposed change
         # Filter to only numeric values for metrics
         metrics_dict = {}
         for k, v in flat_dict.items():
             try:
                 metrics_dict[k] = float(v)
             except (ValueError, TypeError):
                 # Skip non-numeric values for metrics
-                pass
+                warnings.warn(
+                    f"MLflowHandler skipping non-numeric metric '{k}' with value {type(v).__name__}",
+                    stacklevel=2,
+                )

@coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/instructlab/training/logger.py (1)

862-979: ⚠️ Potential issue | 🟡 Minor

Enable MLflow when experiment/run name is provided (not only tracking URI).

Currently, MLflow logging won't activate if users only set mlflow_experiment_name or mlflow_run_name, even though MLflow can operate with local filesystem storage when no tracking URI is specified. The detection logic should include these fields and the MLFLOW_EXPERIMENT_NAME environment variable to align with the design pattern used for other backends (e.g., wandb is enabled when wandb_project is set).

Proposed fix
-    if mlflow_tracking_uri or os.environ.get("MLFLOW_TRACKING_URI"):
+    if (
+        mlflow_tracking_uri
+        or mlflow_experiment_name
+        or mlflow_run_name
+        or os.environ.get("MLFLOW_TRACKING_URI")
+        or os.environ.get("MLFLOW_EXPERIMENT_NAME")
+    ):
         detected_loggers.append("mlflow")

@mergify mergify bot added ci-failure and removed ci-failure labels Feb 3, 2026
@coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/instructlab/training/logger.py (2)

1-46: ⚠️ Potential issue | 🟡 Minor

Update the module docstring example to match the new setup API.

The example still uses loggers=[...] and run_name=..., but setup_metric_logger now relies on auto-detection and backend-specific kwargs. This will mislead users.

📝 Suggested docstring fix
-    # Setup logging with TensorBoard and wandb
-    setup_metric_logger(
-        loggers=["tensorboard", "wandb"],
-        run_name="my_training_run",
-        output_dir="logs"
-    )
+    # Setup logging with TensorBoard and wandb (auto-detected)
+    setup_metric_logger(
+        output_dir="logs",
+        tensorboard_log_dir="logs/tensorboard",
+        wandb_project="my_project",
+        wandb_run_name="my_training_run",
+    )

862-985: ⚠️ Potential issue | 🟠 Major

Run name no longer wired to async/tensorboard (likely regression).

setup_metric_logger used to accept run_name, but it’s now missing and both async/tensorboard handlers hardcode run_name=None. This means the new TrainingArgs.run_name won’t affect those backends. If the intent is a global run name, please reintroduce it and apply as a fallback for backend-specific names.

✅ Suggested wiring
 def setup_metric_logger(
     output_dir,
     *,
+    run_name: str | None = None,
     mlflow_tracking_uri: str | None = None,
     mlflow_experiment_name: str | None = None,
     mlflow_run_name: str | None = None,
     wandb_project: str | None = None,
     wandb_entity: str | None = None,
     wandb_run_name: str | None = None,
     tensorboard_log_dir: str | None = None,
 ):
@@
         "handlers": {
             "async": {
                 "()": AsyncStructuredHandler,
                 "log_dir": output_dir,
-                "run_name": None,  # Uses default template
+                "run_name": run_name,  # Uses default template if None
                 "filters": async_filters,
             },
             "tensorboard": {
                 "()": TensorBoardHandler,
                 "log_dir": tensorboard_log_dir or output_dir,
-                "run_name": None,  # Uses default template
+                "run_name": run_name,  # Uses default template if None
                 "filters": ["is_mapping", "is_rank0"],
             },
             "wandb": {
                 "()": WandbHandler,
                 "log_dir": output_dir,
-                "run_name": wandb_run_name,
+                "run_name": wandb_run_name or run_name,
                 "project": wandb_project,
                 "entity": wandb_entity,
                 "filters": ["is_mapping", "is_rank0"],
             },
             "mlflow": {
                 "()": MLflowHandler,
-                "run_name": mlflow_run_name,
+                "run_name": mlflow_run_name or run_name,
                 "tracking_uri": mlflow_tracking_uri
                 or os.environ.get("MLFLOW_TRACKING_URI"),
                 "experiment_name": mlflow_experiment_name
                 or os.environ.get("MLFLOW_EXPERIMENT_NAME"),
                 "filters": ["is_mapping", "is_rank0"],
             },
🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 664-681: The _setup method currently calls mlflow.start_run()
unconditionally which raises if an MLflow run is already active; update _setup
to check mlflow.active_run() first and handle both cases: if mlflow.active_run()
exists, set self._mlflow_run = mlflow.active_run() to reuse it, otherwise call
mlflow.start_run(run_name=self.run_name, **self.mlflow_init_kwargs); optionally,
if you prefer nesting instead of reusing, call mlflow.start_run(...,
nested=True) when an active run exists—make this change around the existing
self._mlflow_run assignment and mlflow.start_run call in _setup.
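
For illustration, a self-contained sketch of the reuse-or-start pattern the bot describes; the function name `setup_run` and the ownership flag are illustrative, not the PR's final code:

```python
# Illustrative sketch: reuse an already-active MLflow run, otherwise start one
# and record that this handler owns it (so only owned runs get ended later).
import mlflow


def setup_run(run_name=None, **start_kwargs):
    active = mlflow.active_run()
    if active is not None:
        return active, False                      # adopt the caller's run
    return mlflow.start_run(run_name=run_name, **start_kwargs), True  # we own it
```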

RobotSail added a commit to Red-Hat-AI-Innovation-Team/training_hub that referenced this pull request Feb 3, 2026
Exposes logging configuration (tensorboard, wandb, mlflow, jsonl) through
flat kwargs in sft(), osft(), and lora_sft() convenience functions.

- `loggers`: List of loggers to enable (e.g., ["wandb", "mlflow", "jsonl"])
- `run_name`: Run name with placeholder support ({time}, {rank})
- `log_level`: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- `logging_steps`: How often to log metrics
- `wandb_project`, `wandb_entity`, `wandb_run_name`: W&B configuration
- `tensorboard_log_dir`: TensorBoard output directory
- `mlflow_tracking_uri`, `mlflow_experiment_name`: MLflow configuration

| Logger      | SFT | OSFT | LoRA |
|-------------|-----|------|------|
| wandb       | Yes | Yes  | Yes  |
| tensorboard | Yes | No   | Yes  |
| mlflow      | Yes | No   | Yes  |
| jsonl       | Yes | Yes  | No   |

OSFT emits warnings for unsupported loggers/params and continues.

Depends on: instructlab/training#680

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Pass tensorboard_log_dir through to TrainingArgs

Update SFT backend to pass tensorboard_log_dir to TrainingArgs now that
instructlab-training supports configurable TensorBoard log directories.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Enable MLflow support for OSFT backend

- Add "mlflow" to SUPPORTED_LOGGERS
- Remove mlflow params from UNSUPPORTED_PARAMS
- Add mlflow_run_name parameter throughout
- Update docstrings to reflect MLflow is now supported

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Address PR review comments

- Replace hardcoded data path with required=True in sft_mlflow_example.py
- Add stacklevel=2 to warnings.warn calls in osft.py
- Rename ambiguous loop variable 'l' to 'lg' in sft.py
- Add warnings for unsupported logging params in sft.py backend

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Add OSFT MLflow example script

Similar to sft_mlflow_example.py but for OSFT training with
unfreeze_rank_ratio and other OSFT-specific parameters.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Add logging and experiment tracking documentation

Create comprehensive documentation for the unified logging API:
- MLflow, Weights & Biases, TensorBoard, and JSONL logging
- Configuration precedence (kwarg > env var > defaults)
- Backend-specific notes and supported loggers
- Example usage for all algorithms (SFT, OSFT, LoRA)
- Run naming with placeholder support

Update sidebar to include the new logging guide.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Remove duplicate warnings import in sft.py

The warnings module is already imported at module level (line 2).

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Address review nitpicks in docs and examples

- Add language specifier to fenced code block in logging.md
- Remove unused result variable in example scripts
- Remove unnecessary wandb_run_name from MLflow-focused OSFT example
- Add comment clarifying /dev/shm usage for Linux shared memory

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Auto-detect loggers from config parameters

Loggers are now automatically enabled based on their configuration:
- mlflow_tracking_uri or MLFLOW_TRACKING_URI env → enables MLflow
- wandb_project or WANDB_PROJECT env → enables W&B
- tensorboard_log_dir → enables TensorBoard (SFT only)

The 'loggers' parameter is deprecated and emits a DeprecationWarning
when used. Updated API documentation with new logging configuration
sections and removed outdated logging guide.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Address PR review feedback

- Fix markdown table alignment in docs for markdownlint compliance
- Change osft_mlflow_example.py default checkpoint dir from /tmp to ./
- Document that run_name is not supported by mini-trainer backend
- Add stacklevel=2 to block_size warning in osft.py
- Update loggers docstring to indicate deprecated status

Co-Authored-By: Claude Opus 4.5 <[email protected]>

remove extraneous api kwargs

removes more extraneous api additions

reintroduce mlflow_run_name into lora

docstrings

drop formatting changes

read envs when present

review comments
@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 732-737: The handler currently always calls mlflow.end_run() in
close(), which can end runs it didn't start; add a boolean flag (e.g.,
self._owns_mlflow_run) that is set to True only when this handler creates/starts
an MLflow run (where mlflow.start_run() is invoked or when self._mlflow_run is
assigned) and False when reusing an existing active run, then modify close() to
call mlflow.end_run() only if self._owns_mlflow_run is True and clear both
self._mlflow_run and self._owns_mlflow_run during cleanup; ensure the flag is
updated wherever the run is started or adopted so ownership is tracked
correctly.
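
A matching sketch of ownership-aware cleanup, again with illustrative names rather than the actual patch:

```python
# Illustrative sketch: only end the MLflow run if this handler started it.
import mlflow


def close_run(owns_run: bool) -> None:
    if owns_run and mlflow.active_run() is not None:
        mlflow.end_run()
```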
🧹 Nitpick comments (1)
src/instructlab/training/logger.py (1)

942-946: Simplify the async logger inclusion logic.

The current logic is a bit convoluted. When detected_loggers is empty, line 943 sets loggers = ["async"], then lines 945-946 check if async is not in loggers (which it already is). When detected_loggers is non-empty, it duplicates the list reference then appends.

♻️ Proposed simplification
-    # Always include async logger for file logging
-    loggers = detected_loggers if detected_loggers else ["async"]
-    # Also include async logger alongside other loggers for file-based logging
-    if detected_loggers and "async" not in loggers:
-        loggers.append("async")
+    # Always include async logger for file-based logging
+    loggers = detected_loggers + ["async"] if detected_loggers else ["async"]

Or even simpler:

-    # Always include async logger for file logging
-    loggers = detected_loggers if detected_loggers else ["async"]
-    # Also include async logger alongside other loggers for file-based logging
-    if detected_loggers and "async" not in loggers:
-        loggers.append("async")
+    # Always include async logger for file-based logging alongside other loggers
+    loggers = [*detected_loggers, "async"]

@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 873-953: setup_metric_logger currently auto-enables backends based
only on backend-specific args and unconditionally ensures "async" is present,
and it doesn't propagate a global run_name to all backends; update it to accept
and honor an explicit logger selection (e.g., logger_type or logger_list
argument) so callers can request "tensorboard" or "async" even when
backend-specific fields are empty, only append "async" when explicitly requested
(or present in the explicit list), and ensure a global run_name parameter (or
run_name fallback) is forwarded into backend-specific settings (mlflow_run_name,
wandb_run_name, tensorboard handler config and any async file logger) so every
backend receives the same run name; change logic around detected_loggers/loggers
and the wiring to backend initializers (functions that configure
mlflow/wandb/tensorboard and the async handler) accordingly and apply the same
fix to the analogous block referenced around 968-997.
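
A hypothetical sketch of honoring an explicit logger selection over auto-detection; the names (`resolve_loggers`, `detected`) are illustrative, not the module's real API:

```python
# Illustrative only: explicit logger_type wins; otherwise use auto-detected
# backends plus the async file logger.
def resolve_loggers(logger_type: str | None, detected: list[str]) -> list[str]:
    if logger_type:  # explicit request, e.g. "tensorboard" or "wandb,mlflow,async"
        return [name.strip() for name in logger_type.split(",") if name.strip()]
    return [*detected, "async"]


print(resolve_loggers("tensorboard", []))   # ['tensorboard']
print(resolve_loggers(None, ["mlflow"]))    # ['mlflow', 'async']
```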

@mergify mergify bot added the documentation Improvements or additions to documentation label Feb 3, 2026
@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@pyproject.toml`:
- Around line 51-53: The pyproject defines optional extras
optional-dependencies.mlflow, optional-dependencies.wandb, and
optional-dependencies.tensorboard but the referenced files
requirements-mlflow.txt, requirements-wandb.txt, and
requirements-tensorboard.txt are missing; add those three files to the repo and
populate each with the correct dependency pins (e.g., mlflow and its compatible
extras in requirements-mlflow.txt, wandb and its compatible version in
requirements-wandb.txt, and tensorboard or tensorboard-related packages in
requirements-tensorboard.txt) following the same pinning style as the existing
optional requirements files (cuda/rocm/hpu/deepspeed) so pip install
'instructlab-training[mlflow]' / '[wandb]' / '[tensorboard]' succeeds.

@mergify mergify bot added dependencies Pull requests that update a dependency file ci-failure and removed ci-failure labels Feb 3, 2026
RobotSail and others added 11 commits February 3, 2026 22:20
…logging

- Add tensorboard_log_dir field to TrainingArgs in config.py
- Update setup_metric_logger to use tensorboard_log_dir when provided
- Add CLI argument for tensorboard_log_dir
- Wire tensorboard_log_dir through run_training() to subprocess command

This allows users to specify a custom directory for TensorBoard logs,
defaulting to output_dir if not specified.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Replace defensive getattr() with direct attribute access in main_ds.py
  since args are guaranteed to exist from argparse defaults
- Remove unused log_dir parameter from MLflowHandler
- Add debug logging for non-numeric metrics skipped by MLflowHandler

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/instructlab/training/logger.py`:
- Around line 891-896: The doc note incorrectly states the async logger default
template; either update the docstring to state that AsyncStructuredHandler uses
"training_params_and_metrics_global{rank}" when no run name is provided, or
change the AsyncStructuredHandler default to match "{time}_rank{rank}". Locate
the AsyncStructuredHandler class and its default run-name logic (and the
surrounding note in logger.py) and make them consistent: modify the note text to
the correct template string or update the AsyncStructuredHandler default
run-name value so both the implementation and the comment match.
- Around line 674-689: The _setup() logic currently sets mlflow.set_tracking_uri
and mlflow.set_experiment before checking mlflow.active_run(), causing explicit
tracking_uri and experiment_name in the handler to be silently ignored when an
active run exists; modify _setup() to check mlflow.active_run() first, and if
active is not None and (self.tracking_uri or self.experiment_name) emit a clear
warning (e.g., self.logger.warning or warnings.warn) that those settings will be
ignored, and only call mlflow.set_tracking_uri and mlflow.set_experiment when
starting a new run (i.e., in the branch where you assign self._mlflow_run =
mlflow.start_run and set self._owns_mlflow_run).

@coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/instructlab/training/logger.py (1)

883-1012: ⚠️ Potential issue | 🟡 Minor

Add WANDB_ENTITY to the auto-detection condition for wandb logging.

The environment variable names are correct (MLFLOW_TRACKING_URI, MLFLOW_EXPERIMENT_NAME, WANDB_PROJECT all verified), but the auto-detection logic for wandb is incomplete. It checks wandb_project and os.environ.get("WANDB_PROJECT") but omits os.environ.get("WANDB_ENTITY"). Since the WandbHandler accepts the entity parameter and WANDB_ENTITY is an official W&B environment variable, the detection condition should also check for it:

if wandb_project or os.environ.get("WANDB_PROJECT") or os.environ.get("WANDB_ENTITY"):
    detected_loggers.append("wandb")

This ensures wandb logging is auto-enabled when the WANDB_ENTITY environment variable is set, consistent with how MLflow auto-detection includes multiple environment variables.
