
Commit 9e1c56e

[Flux] Enable checkpointing (#1195)
## Context

1. Change flux-dev / flux-schnell model training to ~30,000 steps, based on current MAST training results.
2. Enable checkpointing. We enabled `reshard_after_forward` on `final_layer` to solve the issue described in #1167 (comment).

## Test

If we run the following 2 runs with `deterministic = True`, the training loss curves should be identical:

1. Without checkpoint save and load, total steps = 10.
2. Save a checkpoint at step 5, then load the checkpoint at step 5 and continue training.

Currently issue #1194 makes the training losses not strictly identical. To exclude the influence of #1194, we reset the seeds (by calling `set_deterministic()`) at the beginning of step 6; with that, the checkpoint save/load keeps the training loss identical.

![Training loss curves with and without checkpoint save/load](https://github.com/user-attachments/assets/22882b71-378c-44fa-bd48-8a8f238aa1b0)
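For reference, a minimal sketch of what such a seed reset could look like in plain PyTorch; the actual torchtitan `set_deterministic()` helper may differ, and the seed value here is only illustrative:

```python
import torch

def set_deterministic(seed: int = 0) -> None:
    # Hypothetical sketch, not the project's actual helper: reset RNG state so
    # that step 6 of the resumed run sees the same random numbers as step 6 of
    # the baseline run after the checkpoint load.
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
```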

File tree

6 files changed, +44 −5 lines changed


torchtitan/experiments/flux/README.md

Lines changed: 11 additions & 1 deletion
@@ -23,13 +23,23 @@ Run the following command to train the model on a single GPU:

 ```

+If you want to train with another model config, run the following command:
+```bash
+CONFIG_FILE="./torchtitan/experiments/flux/train_configs/flux_schnell_model.toml" ./torchtitan/experiments/flux/run_train.sh
+```
+
 ## Supported Features
 - Parallelism: The model supports FSDP, HSDP for training on multiple GPUs.
 - Activation checkpointing: The model uses activation checkpointing to reduce memory usage during training.
+- Distributed checkpointing and loading.
+- Notes on the current checkpointing implementation: Currently we need to enable `reshard_after_forward=True` before the eval
+process and set it back to `False` after the eval process. The reason is that the eval step only runs forward, not backward,
+so the FSDP `reshard_after_forward` plan would otherwise interfere with how the parameters are sharded for a subsequent checkpoint save.
+


 ## TODO
 - [ ] More parallelism support (Tensor Parallelism, Context Parallelism, etc)
-- [ ] Support for distributed checkpointing and loading
 - [ ] Implement the num_flops_per_token calculation in get_nparams_and_flops() function
 - [ ] Implement test cases in CI for FLUX model. Adding more unit tests for FLUX model (eg, unit test for preprocessor, etc)
+- [ ] Checkpointing followup: Merge resharding strategy in `flux/trainer.py` to `parallel_flux.py`

torchtitan/experiments/flux/dataset/flux_dataset.py

Lines changed: 1 addition & 0 deletions
@@ -58,6 +58,7 @@ def _process_cc12m_image(
     if resized_img.mode != "RGB":
         resized_img = resized_img.convert("RGB")

+    # Normalize the image to [-1, 1]
     np_img = np.array(resized_img).transpose((2, 0, 1))
     tensor_img = torch.tensor(np_img).float() / 255.0 * 2.0 - 1.0

torchtitan/experiments/flux/train.py

Lines changed: 4 additions & 0 deletions
@@ -182,7 +182,11 @@ def train_step(self, input_dict: dict[str, torch.Tensor], labels: torch.Tensor):
             or self.step == self.job_config.training.steps
         ):
             model.eval()
+            # We need to set reshard_after_forward before the last forward pass,
+            # so the model weights are sharded the same way for checkpoint saving.
+            model.final_layer.set_reshard_after_forward(True)
             self.eval_step()
+            model.final_layer.set_reshard_after_forward(False)
             model.train()

     def eval_step(self, prompt: str = "A photo of a cat"):
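For readability, the reshard toggle added above could also be expressed as a small helper. A hedged sketch follows; the context manager itself is an illustration and not code from this commit, while `set_reshard_after_forward` is the FSDP2 method already used in the diff:

```python
from contextlib import contextmanager

@contextmanager
def reshard_final_layer_for_eval(model):
    # Sketch only: make the eval-only forward pass reshard final_layer's
    # parameters afterwards, so the sharded state matches what a subsequent
    # checkpoint save expects, then restore the training-time setting.
    model.final_layer.set_reshard_after_forward(True)
    try:
        yield
    finally:
        model.final_layer.set_reshard_after_forward(False)

# Hypothetical usage inside train_step():
#     with reshard_final_layer_for_eval(model):
#         self.eval_step()
```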

torchtitan/experiments/flux/train_configs/debug_model.toml

Lines changed: 8 additions & 0 deletions
@@ -64,3 +64,11 @@ custom_args_module = "torchtitan.experiments.flux.flux_argparser"

 [activation_checkpoint]
 mode = "full"
+
+[checkpoint]
+enable_checkpoint = false
+folder = "checkpoint"
+interval = 5
+model_weights_only = false
+export_dtype = "float32"
+async_mode = "disabled" # ["disabled", "async", "async_with_pinned_mem"]
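The new `[checkpoint]` table wires the Flux trainer into torchtitan's distributed checkpointing. As a rough illustration of the underlying mechanism (PyTorch's `torch.distributed.checkpoint`, not torchtitan's own checkpoint manager; the toy model, checkpoint path, and flat state-dict layout below are assumptions for the example):

```python
import torch
import torch.distributed.checkpoint as dcp
from torch import nn

# Illustrative stand-ins for the FSDP-sharded Flux model and its optimizer.
model = nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters())

# Save: with sharded (DTensor) parameters, each rank writes only its own shards.
state = {"model": model.state_dict(), "optimizer": optimizer.state_dict()}
dcp.save(state, checkpoint_id="checkpoint/step-5")

# Resume: load in place into a matching state dict, then restore it.
dcp.load(state, checkpoint_id="checkpoint/step-5")
model.load_state_dict(state["model"])
optimizer.load_state_dict(state["optimizer"])
```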

torchtitan/experiments/flux/train_configs/flux_dev_model.toml

Lines changed: 10 additions & 2 deletions
@@ -28,14 +28,14 @@ lr = 1e-4
 eps = 1e-8

 [lr_scheduler]
-warmup_steps = 30_000 # lr scheduler warm up, normally 20% of the train steps
+warmup_steps = 3_000 # lr scheduler warm up, normally 20% of the train steps
 decay_ratio = 0.0 # no decay

 [training]
 batch_size = 4
 seq_len = 512
 max_norm = 1.0 # grad norm clipping
-steps = 300_000
+steps = 30_000
 compile = false
 dataset = "cc12m-wds"
 classifer_free_guidance_prob = 0.1
@@ -63,3 +63,11 @@ custom_args_module = "torchtitan.experiments.flux.flux_argparser"

 [activation_checkpoint]
 mode = "full"
+
+[checkpoint]
+enable_checkpoint = false
+folder = "checkpoint"
+interval = 1_000
+model_weights_only = false
+export_dtype = "float32"
+async_mode = "disabled" # ["disabled", "async", "async_with_pinned_mem"]

torchtitan/experiments/flux/train_configs/flux_schnell_model.toml

Lines changed: 10 additions & 2 deletions
@@ -28,14 +28,14 @@ lr = 1e-4
 eps = 1e-8

 [lr_scheduler]
-warmup_steps = 30_000 # lr scheduler warm up, normally 20% of the train steps
+warmup_steps = 3_000 # lr scheduler warm up, normally 20% of the train steps
 decay_ratio = 0.0 # no decay

 [training]
 batch_size = 4
 seq_len = 512
 max_norm = 1.0 # grad norm clipping
-steps = 300_000
+steps = 30_000
 compile = false
 dataset = "cc12m-wds"
 classifer_free_guidance_prob = 0.1
@@ -63,3 +63,11 @@ custom_args_module = "torchtitan.experiments.flux.flux_argparser"

 [activation_checkpoint]
 mode = "full"
+
+[checkpoint]
+enable_checkpoint = false
+folder = "checkpoint"
+interval = 1_000
+model_weights_only = false
+export_dtype = "float32"
+async_mode = "disabled" # ["disabled", "async", "async_with_pinned_mem"]
