
Commit 9600183

Authored by: shmh40, Sophie Xhonneux (sophie-xhonneux), claude, clessig
Enable per stream masking config override (#1951)
* Add collapse monitoring
* Fix bug
* Fix SVD computation failing
* Reduce variables logged
* Fix EMA beta value computation
* Refactor get_current_beta to ema.py
* Sensible default for ema in jepa
* Allow collapse monitoring for forecasting
* Fix no collapse monitoring for forecasting
* Try to fix forecasting
* Fix teacher rank collapse when rope_2D is enabled

  Two issues caused the EMA teacher's effective rank to drop to ~8-10 (multi-GPU) or ~40 (single-GPU) at training start when rope_2D=True, while the student appeared unaffected:

  1. pe_global zeroed with rope_2D: When rope_2D was enabled, pe_global was cleared to zero under the assumption that RoPE replaces it. However, RoPE only provides relative position in Q/K attention -- it does not affect V. pe_global is the sole source of per-cell token identity for masked cells (which have no content from local assimilation). Without it, all masked cells are identical, collapsing the teacher representation. The student metric was artificially inflated by dropout noise hiding the same underlying low-rank issue. Fix: always initialize pe_global -- it and RoPE serve complementary roles.

  2. EMA reset ignores DDP key prefix: EMAModel.reset() loads the student state_dict directly via load_state_dict, but DDP wrapping adds a module. prefix to all keys. With strict=False, every key silently fails to match, leaving the teacher with uninitialized weights from to_empty(). The update() method already handled this mismatch but reset() did not. Combined with q_cells being skipped in EMA updates, the teacher q_cells was permanently corrupted on multi-GPU runs. Fix: strip the module. prefix before loading.

  Co-Authored-By: Claude Opus 4.6 <[email protected]>
* Try adding 2d rope to Query engine
* Fix shape mismatch
* Run linter
* Add support for dropping of streams
* Enable healpix masking at the level of the data
* Enable per stream masking strategy config override
* Per stream masking override test
* Move per stream masking to masker
* Fix moving per stream config in masker
* Lint
* Tidy up
* Better naming and docs of per stream override; msds: rename stage cfg to stream cfg
* Addressed comments, but now broken: tokens_all.scatter_(0, scatter_idxs, torch.cat(x_embeds) + pe_embed[pe_idxs]) raises "expected non-empty list of tensors". More scaffolding is also needed to make this work for masking, since we build the targets first and the source is just the ~target_mask, and the existing code only dropped streams as sources, never as targets
* Revert "addressed comments but now broken with tokens_all.scatter_(0, scatter_idxs, torch.cat(x_embeds) + pe_embed[pe_idxs]) expected non-empty list of tensors. Also more scaffolding needed to make this work for masking, since we build the targets first and the source is just the ~target_mask, and there was stuff in the code not to drop target streams, only to drop as sources". This reverts commit 70ce173.
* Move per stream config overrides to masker; randomly drop now has its own scaffold when building source inputs
* Address reviewer comments on static method and consolidated config
* Update merge_masking_config docstring to reflect that randomly_drop is done independently per source sample
* Drop decision for a stream applies to all source strategies; dropping streams only applies during training
* Update the test

---------

Co-authored-by: Sophie Xhonneux <[email protected]>
Co-authored-by: sophiex <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>
Co-authored-by: Christian Lessig <[email protected]>
1 parent e8b1f8e commit 9600183
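The second EMA fix described above (reset() silently failing because DDP wrapping adds a `module.` prefix to every state-dict key) comes down to a small key-renaming step before `load_state_dict`. A minimal sketch of that step, assuming a helper name `strip_ddp_prefix` that is illustrative rather than the repository's actual function:

```python
from collections import OrderedDict


def strip_ddp_prefix(state_dict):
    """Remove the 'module.' prefix that DDP wrapping adds to every key.

    With load_state_dict(..., strict=False), unmatched keys fail silently,
    which is the failure mode described in the commit message: every key
    misses, and the teacher keeps its uninitialized to_empty() weights.
    """
    prefix = "module."
    return OrderedDict(
        (k[len(prefix):] if k.startswith(prefix) else k, v)
        for k, v in state_dict.items()
    )


# A DDP-wrapped student produces keys like these:
ddp_state = {"module.encoder.weight": 1.0, "module.q_cells": 2.0}
plain = strip_ddp_prefix(ddp_state)
assert set(plain) == {"encoder.weight", "q_cells"}
```

The teacher would then call `load_state_dict(strip_ddp_prefix(student_state))` so keys match whether or not the student is DDP-wrapped.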

File tree

4 files changed: +370 −19 lines changed


src/weathergen/datasets/masking.py

Lines changed: 103 additions & 12 deletions
@@ -10,6 +10,7 @@
 from weathergen.datasets.batch import SampleMetaData
 from weathergen.train.utils import Stage
+from weathergen.utils.distributed import is_root
 from weathergen.utils.utils import is_stream_diagnostic, is_stream_forcing
 
 logger = logging.getLogger(__name__)
@@ -111,7 +112,7 @@ class Masker:
     specific to the masking strategy. See above.
     """
 
-    def __init__(self, healpix_level: int, stage: Stage):
+    def __init__(self, healpix_level: int, stage: Stage, streams=None, mode_cfg=None):
         self.rng = None
 
         self.mask_value = 0.0
@@ -123,12 +124,89 @@ def __init__(self, healpix_level: int, stage: Stage):
 
         self.stage = stage
 
+        # Build and store per-stream effective masking configs
+        if streams is not None and mode_cfg is not None:
+            self._effective_masking_cfgs = self.build_effective_masking_cfgs(streams, mode_cfg)
+        else:
+            self._effective_masking_cfgs = {}
+
     def reset_rng(self, rng) -> None:
         """
         Reset rng after mini_epoch to ensure proper randomization
         """
         self.rng = rng
 
+    def merge_masking_config(self, mode_cfg, override):
+        """Merge a stream's masking override into the base mode config.
+
+        Only masking strategy fields are overridden. Structural keys like
+        ``num_samples`` and ``num_steps_input`` remain unchanged.
+
+        The override is flat per section (``model_input`` / ``target_input``),
+        not per named strategy. If a section has multiple strategies (e.g.
+        ``"input_physical"`` and ``"input_jepa"``), masking strategy fields are
+        broadcast to all of them. ``randomly_drop_as_source_rate`` is a
+        per-stream rate; the drop decision is made once per call to
+        ``build_samples_for_stream`` and applies to all source strategies
+        uniformly (training only).
+
+        Expected YAML in a stream config, e.g.:
+
+            STREAM_NAME:
+              type: ...
+              filenames: ...
+              ...
+              masking_override:
+                target_input:
+                  masking_strategy_config:
+                    hl_mask: 3
+                    ...
+
+        This overrides only ``hl_mask`` within ``masking_strategy_config`` for
+        every target strategy, inheriting rate, rate_sampling, etc. from the
+        global config. ``masking_strategy`` itself can also be replaced.
+        """
+        if override is None:
+            return mode_cfg
+
+        stream_cfg_masking = copy.deepcopy(mode_cfg)
+
+        # Copy top-level masking keys from override
+        if "randomly_drop_as_source_rate" in override:
+            stream_cfg_masking["randomly_drop_as_source_rate"] = override[
+                "randomly_drop_as_source_rate"
+            ]
+
+        for section_key in ("model_input", "target_input"):
+            override_values = override.get(section_key, None)
+            if override_values is None:
+                continue
+            section = stream_cfg_masking.get(section_key, None)
+            if section is None:
+                continue
+            for strategy_cfg in section.values():
+                if "masking_strategy" in override_values:
+                    strategy_cfg["masking_strategy"] = override_values["masking_strategy"]
+                if "masking_strategy_config" in override_values:
+                    strategy_cfg["masking_strategy_config"] = omegaconf.OmegaConf.merge(
+                        strategy_cfg.get("masking_strategy_config", omegaconf.OmegaConf.create({})),
+                        override_values["masking_strategy_config"],
+                    )
+
+        return stream_cfg_masking
+
+    def build_effective_masking_cfgs(self, streams, mode_cfg):
+        """Build effective masking configs for all streams."""
+        cfgs = {}
+        for stream_info in streams:
+            name = stream_info["name"]
+            override = stream_info.get("masking_override", None)
+            cfgs[name] = self.merge_masking_config(mode_cfg, override)
+            if override is not None and is_root():
+                logger.info(f"Stream '{name}' using masking override: {override}")
+
+        return cfgs
+
     def _get_sampling_rate(self, cfg):
         """
         Get the sampling, if requested by sampling it itself
@@ -257,25 +335,33 @@ def build_samples_for_stream(
         self,
         training_mode: str,
         num_cells: int,
-        stage_cfg: dict,
-        stream_cfg: dict,
+        stream_info: dict,
     ) -> tuple[np.typing.NDArray, list[np.typing.NDArray], list[SampleMetaData]]:
         """
         Construct teacher/student keep masks for a stream.
         SampleMetaData is currently just a dict with the masking params used.
         """
 
+        stream_masking_cfg = self._effective_masking_cfgs[stream_info["name"]]
+
         # target and source configs
-        target_cfgs = stage_cfg.get("target_input", [])
-        source_cfgs = stage_cfg.get("model_input", [])
+        target_cfgs = stream_masking_cfg.get("target_input", [])
+        source_cfgs = stream_masking_cfg.get("model_input", [])
 
         # target and source are assumed identical when target is not specified
         if len(target_cfgs) == 0:
             target_cfgs = copy.deepcopy(source_cfgs)
 
-        losses = stage_cfg.losses
+        losses = stream_masking_cfg.losses
         corr_dict = self.parse_src_target_correspondence(losses, target_cfgs, source_cfgs)
 
+        # randomly_drop_as_source_rate from consolidated masking config (training only)
+        randomly_drop_rate = (
+            stream_masking_cfg.get("randomly_drop_as_source_rate", 0.0)
+            if self.stage == "train"
+            else 0.0
+        )
+
         target_masks = MaskData()
 
         # iterate over all target samples
@@ -285,9 +371,10 @@ def build_samples_for_stream(
             # different samples/view per strategy
             for _ in range(target_cfg.get("num_samples", 1)):
                 # determine if forcing dataset => mask is empty
-                if is_stream_forcing(stream_cfg, self.stage):
+                if is_stream_forcing(stream_info, self.stage):
                     target_mask, mask_params = torch.zeros(num_cells, dtype=torch.bool), {}
                 else:
+                    # targets are never randomly dropped
                     target_mask, mask_params = self._get_mask(
                         num_cells=num_cells,
                         strategy=target_cfg.get("masking_strategy"),
@@ -312,6 +399,7 @@ def build_samples_for_stream(
         source_masks = MaskData()
         source_target_mapping = []
         target_num_samples = get_num_samples(target_cfgs)
+        is_stream_dropped = randomly_drop_rate > 0.0 and self.rng.uniform() < randomly_drop_rate
         i_source = 0
         for i_src_cfg, (_, source_cfg) in enumerate(source_cfgs.items()):
             # skip items that do not appear in loss
@@ -336,8 +424,8 @@ def build_samples_for_stream(
                 # target is specified)
                 target_idx += i_sample % target_num_samples[target_cfg_idx].item()
 
-                # determine if forcing dataset => mask is empty
-                if is_stream_diagnostic(stream_cfg, self.stage):
+                # determine if diagnostic dataset or randomly dropped => mask is empty
+                if is_stream_diagnostic(stream_info, self.stage) or is_stream_dropped:
                     source_mask, mask_params = torch.zeros(num_cells, dtype=torch.bool), {}
                 else:
                     source_mask, mask_params = self._get_mask(
@@ -427,7 +515,10 @@ def _get_mask(
         return (mask, params)
 
     def _generate_cell_mask(
-        self, num_cells: int, strategy: str, masking_strategy_config: dict
+        self,
+        num_cells: int,
+        strategy: str,
+        masking_strategy_config: dict,
     ) -> (np.typing.NDArray, dict):
         """Generate a boolean keep mask at data healpix level (True = keep cell).
 
@@ -692,8 +783,8 @@ def _prepare_healpix_based_masking(self, cfg, keep_rate):
 
         hl_data = self.healpix_level_data
         hl_mask = cfg.get("hl_mask")
-        assert hl_mask is not None and hl_mask < hl_data, (
-            "For healpix keep mask generation, cfg['hl_mask'] must be set and < data level."
+        assert hl_mask is not None and hl_mask <= hl_data, (
+            "For healpix keep mask generation, cfg['hl_mask'] must be set and <= data level."
         )
         num_parent_cells = 12 * (4**hl_mask)
         level_diff = hl_data - hl_mask
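The last hunk relaxes the assertion so hl_mask may equal the data level. The parent-to-child arithmetic it relies on — 12 · 4^hl_mask parent cells at the mask level, each covering 4^(hl_data − hl_mask) nested children at the data level — can be sketched as follows; this is a simplified stand-in for the masker's actual implementation, with illustrative names:

```python
import random


def healpix_keep_mask(hl_mask, hl_data, keep_rate, rng):
    """Draw a keep mask at level hl_mask and expand it to level hl_data.

    In HEALPix nested ordering, each parent cell at level hl_mask covers a
    contiguous block of 4**(hl_data - hl_mask) child cells at the data level,
    so a parent decision broadcasts to a contiguous run of children.
    """
    assert hl_mask is not None and hl_mask <= hl_data, "hl_mask must be set and <= data level"
    num_parent_cells = 12 * (4 ** hl_mask)
    children_per_parent = 4 ** (hl_data - hl_mask)
    parent_keep = [rng.random() < keep_rate for _ in range(num_parent_cells)]
    # Broadcast each parent decision to its nested children.
    return [keep for keep in parent_keep for _ in range(children_per_parent)]


mask = healpix_keep_mask(hl_mask=1, hl_data=2, keep_rate=0.5, rng=random.Random(0))
assert len(mask) == 12 * 4 ** 2  # 192 data-level cells
```

When hl_mask == hl_data (the newly allowed case), each parent has exactly one child and the mask is drawn independently per data-level cell.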

src/weathergen/datasets/multi_stream_data_sampler.py

Lines changed: 3 additions & 4 deletions
@@ -243,7 +243,8 @@ def __init__(
             else cf.data_loading.rng_seed * 97
         )
 
-        self.tokenizer = TokenizerMasking(cf.healpix_level, Masker(cf.healpix_level, stage))
+        self.masker = Masker(cf.healpix_level, stage, self.streams, self.mode_cfg)
+        self.tokenizer = TokenizerMasking(cf.healpix_level, self.masker)
 
         self.mini_epoch = 0
 
@@ -575,16 +576,14 @@ def _get_data_windows(self, base_idx, num_forecast_steps, num_steps_input_max, s
 
     def _get_source_target_masks(self, training_mode):
         """
-        Generate source and target masks for all streams
+        Generate source and target masks for all streams.
         """
-
         masks = {}
         for stream_info in self.streams:
             # Build source and target sample masks
             masks[stream_info["name"]] = self.tokenizer.build_samples_for_stream(
                 training_mode,
                 self.num_healpix_cells,
-                self.mode_cfg,
                 stream_info,
             )
         # identical for all streams

src/weathergen/datasets/tokenizer_masking.py

Lines changed: 2 additions & 3 deletions
@@ -81,13 +81,12 @@ def build_samples_for_stream(
         self,
         training_mode: str,
         num_cells: int,
-        stage_cfg: dict,
-        stream_cfg: dict,
+        stream_info: dict,
     ) -> tuple[np.typing.NDArray, list[np.typing.NDArray], list[SampleMetaData]]:
         """
         Create masks for samples
         """
-        return self.masker.build_samples_for_stream(training_mode, num_cells, stage_cfg, stream_cfg)
+        return self.masker.build_samples_for_stream(training_mode, num_cells, stream_info)
 
     def cell_to_token_mask(self, idxs_cells, idxs_cells_lens, mask):
         """ """
