non parallelized basic validator implementation [WIP] #1362

Open · wants to merge 5 commits into main
Conversation

wesleytruong (Contributor)

The purpose of this PR is to create a basic, non-parallelized validator implementation and to get feedback on code structure and cleanliness.
Changes:

  • Created a validation section in job_config
  • Created a builder function for the validator in train_spec
  • Created a builder function for the validation dataset in hf_dataset.py
  • Created a Validator class (see the minimal sketch after this list)
    • The Validator class initializes a build_validation_hf_loader but leaves this dataloader function unexposed to the train_spec
  • Integrated the validation call into the training loop
  • Created one simple integration test with no parallelization and NGPU=1
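
For orientation, here is a minimal sketch of the structure described above. All names and signatures are illustrative (the freq/steps config fields and the "validation/loss" key follow suggestions made later in this review), not the actual torchtitan implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the non-parallelized validator described in this PR.
# Names and signatures are illustrative, not the final torchtitan API.
class Validator:
    def __init__(self, job_config, dataloader, loss_fn, model: nn.Module):
        self.job_config = job_config
        self.validation_dataloader = dataloader
        self.loss_fn = loss_fn
        self.model = model

    def should_validate(self, step: int) -> bool:
        # run validation every `freq` training steps
        return step % self.job_config.validation.freq == 0

    @torch.no_grad()
    def validate(self) -> dict[str, float]:
        self.model.eval()
        total_loss, num_steps = 0.0, 0
        for input_dict, labels in self.validation_dataloader:
            preds = self.model(input_dict["input"])
            total_loss += self.loss_fn(preds, labels).item()
            num_steps += 1
            # steps == -1 means consume the whole validation dataset
            if 0 < self.job_config.validation.steps <= num_steps:
                break
        self.model.train()
        return {"validation/loss": total_loss / max(num_steps, 1)}
```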

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Jul 2, 2025
@tianyu-l (Contributor) left a comment

First pass looks really good!
I left many detailed comments; please see if they make sense.

):
self.job_config = job_config
self.loss_fn = loss_fn
self.model = model

I think we should pass model (model_parts) as an arg to validate, because it's changing
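
A small sketch of what this suggestion could look like; the signature is illustrative, not the final API:

```python
import torch.nn as nn

class Validator:
    # Sketch only: pass the (possibly sharded) model parts on each call, because
    # the weights change during training and should not be captured at
    # construction time.
    def validate(self, model_parts: list[nn.Module]) -> dict[str, float]:
        ...
```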

Comment on lines 42 to 47
job_config: JobConfig,
loss_fn: LossFunction,
model: nn.Module,
dp_world_size: int,
dp_rank: int,
tokenizer: Tokenizer,

Let's make the order as close as possible to how you use them below in build_hf_validation_dataloader.


seq_len: int = 2048
"""Sequence length for validation"""


Set up a steps config controlling how many iterations we run; default to -1, which means consuming all the data in the validation dataset.

Comment on lines 53 to 56
# path="tests/assets/c4_test",
# loader=lambda path: load_dataset(path, split="validation"),
# text_processor=_process_c4_text,
# ),

we should use path="allenai/c4", and loader=lambda path: load_dataset(path, name="en", split="validation"),

@@ -319,6 +321,23 @@ def __init__(self, job_config: JobConfig):
device_type,
)

# Build validator if validation is configured
self.validator = None
if (

if job_config.validation.enabled:
  assert self.train_spec.build_validator_fn is not None
  # build validator ...

@@ -319,6 +321,23 @@ def __init__(self, job_config: JobConfig):
device_type,
)

# Build validator if validation is configured
self.validator = None

I don't think you need this line, since it's already defined as an instance variable.

for k, v in input_dict.items():
if isinstance(v, torch.Tensor):
input_dict[k] = v.to(device_type)
if isinstance(labels, torch.Tensor):

why do we need this if?

Comment on lines 70 to 71
for batch_data, targets in self.validation_dataloader:
input_dict, labels = batch_data, targets

Suggested change
for batch_data, targets in self.validation_dataloader:
input_dict, labels = batch_data, targets
for input_dict, labels in self.validation_dataloader:

logger.warning("No validation batches processed")

# Set model back to train mode
self.model.train()

let's put this as the last line of this method

@wesleytruong (Contributor, Author)

I've cleaned up the code according to your comments and added support for the validation frequency and steps. I also left streaming=True in the c4_validation dataset since otherwise it downloads the entire training dataset too. @tianyu-l

seq_len: int = 2048
"""Sequence length for validation"""

val_freq: int = 1

no need to have the val_ prefix as it's not ambiguous under Validation

Suggested change
val_freq: int = 1
freq: int = 1


maybe default to 10

"""Frequency of validation"""

val_steps: int = -1
"""Number of validation steps, -1 means all steps"""

Suggested change
"""Number of validation steps, -1 means all steps"""
"""Number of validation steps, -1 means consuming all the data in the validation dataset"""

dp_rank: int,
tokenizer: Tokenizer,
job_config: JobConfig,
infinite: bool = True,

I think we can remove this arg -- I don't think anyone wants to do multiple loops over the validation dataset

seq_len=seq_len,
dp_rank=dp_rank,
dp_world_size=dp_world_size,
infinite=infinite,

So you can always set it to False here.

@@ -54,6 +54,7 @@ tensor_parallel_degree = 1
enable_async_tensor_parallel = false
pipeline_parallel_degree = 1
context_parallel_degree = 1
disable_loss_parallel = true

revert this change?

@@ -463,6 +477,12 @@ def train_step(
else:
global_avg_loss = global_max_loss = loss.detach().item()

# Run validation if validator is available

As this is not part of the training step, let's move it outside train_step and into train, before self.checkpointer.save(...).
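
A rough sketch of the suggested placement; attribute and method names are illustrative and mirror the trainer discussed in this PR, not torchtitan's actual train loop:

```python
# Illustrative only: validation runs between train_step and checkpointer.save,
# not inside train_step. should_validate/validate follow the names used in this PR.
class TrainerSketch:
    def train(self):
        while self.step < self.job_config.training.steps:
            self.step += 1
            self.train_step()

            if self.validator is not None and self.validator.should_validate(self.step):
                self.validator.validate(self.model_parts)

            self.checkpointer.save(self.step)
```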

"--validation.dataset c4_test",
],
],
"Validation test no parallelism",

Technically this is not without parallelism -- you are doing data parallelism for validation; however, you are not doing an all-reduce on the loss, so the loss you print out would be different on each DP rank. Let's do that in this PR, following the existing loss-handling code around the model forward:
https://github.com/pytorch/torchtitan/blob/main/torchtitan/train.py#L451-L464

For that you'll need to pass in parallel_dims, world_mesh, and ft_manager when constructing Validator.

I think the code will then support Tensor Parallel and Context Parallel, but not Pipeline Parallel yet, which we can do in a follow-up PR.
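
A minimal sketch of the loss all-reduce, using plain torch.distributed rather than torchtitan's own utilities (the linked train.py code is the authoritative reference; dp_group here is an assumed handle to the data-parallel process group):

```python
import torch
import torch.distributed as dist

def dist_mean_loss(local_loss: torch.Tensor, dp_group) -> float:
    """Average a per-rank validation loss across the data-parallel group.

    Sketch only: torchtitan uses its own dist utilities for this; here we just
    show the underlying all-reduce.
    """
    if dp_group is None or not dist.is_initialized():
        return local_loss.item()
    loss = local_loss.detach().clone()
    dist.all_reduce(loss, op=dist.ReduceOp.SUM, group=dp_group)
    return (loss / dist.get_world_size(dp_group)).item()
```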

model_parts: list[nn.Module],
) -> dict[str, float]:
# Set model to eval mode
model = model_parts[0]

Add a TODO here noting that we only support data parallelism for now.


Is there a reason to not support all parallelisms besides PP here?

num_val_steps = 0

with torch.no_grad():
try:

I believe you don't need this try/except, because StopIteration is handled automatically and safely by the for loop.

@runame (Contributor) left a comment


Thanks for implementing this; it will be very useful!

You can take a look at these changes for some inspiration for addressing some of my comments.

if self.job_config.validation.enabled and self.validator.should_validate(
self.step
):
validation_metrics = self.validator.validate(self.model_parts)

The validation metrics should be logged by self.metrics_processor.log() (to the terminal output and Tensorboard/wandb).

# Build validator if validation is configured
if job_config.validation.enabled:
assert self.train_spec.build_validator_fn is not None


Can you raise an error here if parallel_dims.pp_enabled?

@@ -49,6 +49,13 @@ class DatasetConfig:
loader=lambda path: load_dataset(path, split="train"),
text_processor=_process_c4_text,
),
"c4_validation": DatasetConfig(
path="allenai/c4",
loader=lambda path: load_dataset(

Nit: you can reuse _load_c4_dataset together with functools.partial here by adding split as an argument to _load_c4_dataset.
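
A sketch of that suggestion, assuming _load_c4_dataset currently hard-codes split="train"; exact signatures in hf_dataset.py may differ:

```python
from functools import partial
from datasets import load_dataset

# Sketch: add `split` as a parameter so the same loader serves both entries.
def _load_c4_dataset(dataset_path: str, split: str):
    return load_dataset(dataset_path, name="en", split=split, streaming=True)

# "c4":            DatasetConfig(..., loader=partial(_load_c4_dataset, split="train"))
# "c4_validation": DatasetConfig(..., loader=partial(_load_c4_dataset, split="validation"))
```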

@@ -193,3 +200,34 @@ def build_hf_dataloader(
dp_world_size=dp_world_size,
batch_size=batch_size,
)


def build_hf_validation_dataloader(

I don't think adding a new function for this is necessary; I would prefer replacing the job_config argument with dataset_name, dataset_path, batch_size, and seq_len. The reasoning is that for validation the function is also just returning a data loader based on an HF dataset; only the underlying dataset will be different.
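
A hypothetical version of that signature; parameter names follow the comment above, and the real build_hf_dataloader may take additional arguments:

```python
from typing import Any

# Hypothetical sketch: one dataloader builder that both training and validation
# can call, differing only in the dataset name/path and batch geometry.
def build_hf_dataloader(
    dataset_name: str,
    dataset_path: str | None,
    tokenizer: Any,
    batch_size: int,
    seq_len: int,
    dp_rank: int,
    dp_world_size: int,
    infinite: bool = True,
):
    ...
```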

@@ -657,6 +657,30 @@ class Experimental:
"""


@dataclass
class Validation:
enabled: bool = False

You could remove this field and modify val_freq to offer an option for disabling validation, e.g., val_freq: int | None = 10, where validation is disabled if val_freq=None.

# Compute average loss
if num_batches > 0:
average_loss = total_loss / num_batches
else:

I think this code path should never be used; you could guarantee this (ignoring the case of an empty dataloader) by adding a __post_init__ to the Validation dataclass that verifies that all values are valid, e.g., val_steps > 0.
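
A sketch combining this with the earlier freq: int | None suggestion; field names follow the review discussion and may not match the final config class:

```python
from dataclasses import dataclass

@dataclass
class Validation:
    freq: int | None = 10   # None disables validation entirely
    steps: int = -1         # -1 means consume the whole validation dataset

    def __post_init__(self):
        if self.freq is not None and self.freq <= 0:
            raise ValueError(f"validation.freq must be positive, got {self.freq}")
        if self.steps == 0 or self.steps < -1:
            raise ValueError(f"validation.steps must be -1 or positive, got {self.steps}")
```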

# Set model back to train mode
model.train()

return {"validation_loss": average_loss}

The average_loss is the local loss for each rank, but should still be all-reduced across ranks.

# Set model back to train mode
model.train()

return {"validation_loss": average_loss}

Could you change this to "validation/loss"? This is important for how wandb represents the metrics and allows you to add more metrics to the same section via "validation/<your-new-metric>" later on.

total_loss += loss.item()
num_batches += 1

num_val_steps += 1

Is there a reason you use separate counters for num_batches and num_val_steps? Also, you could use this instead:

for step, (input_dict, labels) in enumerate(self.validation_dataloader):

Here, step replaces num_batches and num_val_steps. You would also have to change num_val_steps >= self.job_config.validation.val_steps to step > self.job_config.validation.val_steps above.
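
Putting this together with the earlier note that no try/except is needed around the loop, a sketch of the resulting loop (names are illustrative):

```python
# Sketch: a single enumerate counter replaces num_batches/num_val_steps, and the
# for loop already terminates cleanly when the dataloader is exhausted.
def run_validation_loop(validation_dataloader, max_steps: int):
    total_loss, steps_run = 0.0, 0
    for step, (input_dict, labels) in enumerate(validation_dataloader, start=1):
        ...  # forward pass, loss computation, accumulation into total_loss
        steps_run = step
        if max_steps != -1 and step >= max_steps:
            break
    return total_loss / max(steps_run, 1)
```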

device_type = utils.device_type
num_val_steps = 0

with torch.no_grad():

Nit: you can also use this as a decorator instead, so you don't have to indent your code as much.

@torch.no_grad()
def validate(
