Enable per-stream masking config override #1951
Conversation
Two issues caused the EMA teacher's effective rank to drop to ~8-10 (multi-GPU) or ~40 (single-GPU) at training start when rope_2D=True, while the student appeared unaffected:

1. pe_global zeroed with rope_2D: When rope_2D was enabled, pe_global was cleared to zero under the assumption that RoPE replaces it. However, RoPE only provides relative position in Q/K attention -- it does not affect V. pe_global is the sole source of per-cell token identity for masked cells (which have no content from local assimilation). Without it, all masked cells are identical, collapsing the teacher representation. The student metric was artificially inflated by dropout noise hiding the same underlying low-rank issue. Fix: always initialize pe_global -- it and RoPE serve complementary roles.

2. EMA reset ignores DDP key prefix: EMAModel.reset() loads the student state_dict directly via load_state_dict, but DDP wrapping adds a module. prefix to all keys. With strict=False, every key silently fails to match, leaving the teacher with uninitialized weights from to_empty(). The update() method already handled this mismatch, but reset() did not. Combined with q_cells being skipped in EMA updates, the teacher q_cells was permanently corrupted on multi-GPU runs. Fix: strip the module. prefix before loading.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
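The reset-path fix described in point 2 can be sketched as follows. This is a minimal illustration, not the actual EMAModel code; the helper name strip_ddp_prefix is hypothetical:

```python
def strip_ddp_prefix(state_dict: dict) -> dict:
    """Remove the 'module.' prefix that DistributedDataParallel wrapping
    adds to every key, so the unwrapped teacher's load_state_dict can
    actually match the student's parameters instead of silently skipping
    them under strict=False."""
    return {k.removeprefix("module."): v for k, v in state_dict.items()}

# A DDP-wrapped student produces keys like 'module.q_cells.weight';
# the teacher (built without DDP) expects plain 'q_cells.weight'.
```

With this applied before load_state_dict, every student key matches a teacher key and no weight is left in the uninitialized state produced by to_empty().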
self.perms = None
self.perms_num_forecast_steps = None

def _build_effective_masking_cfgs(self) -> dict[StreamName, Config]:
Shouldn't this be in Masker?
stay consistent across streams for batch assembly.
"""
cfgs: dict[StreamName, Config] = {}
for stream_info in self.streams:
I think the logic here should be:

for each stream:
    stream_merged_masking_config = config.merge(masking_config, stream_config.masking_override)
Generate source and target masks for all streams
Generate source and target masks for all streams.

Each stream uses its own effective masking config (which may include
I don't think this should be here (it would make sense if the masking_config were passed as an argument, which is a sensible possibility). It can go at line 631.
…eams_1947' into shmh40/dev/1950-per-stream-masking
Thanks for the comments @clessig. Agreed it should be handled in masking, and I have tried to do this with only minor changes to msds. Hopefully better now. We can think about whether we want it to be more general, but I am ok with this for now. I directly merged your PR in too. I haven't come up with a neater solution to the source/target distinction but can revisit it next week if helpful.
clessig
left a comment
Left some more comments. We need an example of what the overrides look like, both for documentation and to ensure things are handled correctly. I also made some more suggestions to further improve encapsulation.
src/weathergen/datasets/masking.py
Outdated
if override is None:
    return mode_cfg

effective = copy.deepcopy(mode_cfg)
Can we use a better variable name than effective? E.g. stream_cfg_masking seems appropriate.
src/weathergen/datasets/masking.py
Outdated
num_cells=num_cells,
strategy=target_cfg.get("masking_strategy"),
masking_strategy_config=target_cfg.get("masking_strategy_config", {}),
stream_cfg=stream_cfg_target,
The masking_strategy_config that is passed here should be the consolidated one. Then stream_cfg should not be needed here. See also the comment above.
Yes I think so too. Just checking -- this is done currently because we need to pass through the randomly_drop_as_source_rate? So we should put the randomly_drop in build_samples_from_stream, and then we can remove stream_cfg and stream_cfg_target deepcopy?
Where do we have build_samples_from_stream? randomly_drop_as_source_rate should be in the consolidated masking config.
src/weathergen/datasets/masking.py
Outdated
num_cells=num_cells,
strategy=source_cfg.get("masking_strategy"),
masking_strategy_config=masking_config,
stream_cfg=stream_cfg,
See above. Do we really need it?
for stream_info in self.streams:
    # Each stream uses its own effective masking config (which may include
    # per-stream ``masking_override`` merged on top of the global config).
    stage_cfg = self._effective_masking_cfgs[stream_info["name"]]
stage_cfg -> stream_cfg -- nothing related to stage here.
self.masker = Masker(cf.healpix_level, stage)
self.tokenizer = TokenizerMasking(cf.healpix_level, self.masker)

self._effective_masking_cfgs = self.masker.build_effective_masking_cfgs(
I am wondering whether _effective_masking_cfgs could be kept in the Masker and constructed in its constructor?
This is possible, I don't think it makes too much difference though?
It helps encapsulation. MultiStreamDataSampler is complex enough.
…cfg to stream cfg
…_idxs, torch.cat(x_embeds) + pe_embed[pe_idxs]) expected non-empty list of tensors. Also more scaffolding needed to make this work for masking, since we build the targets first and the source is just the ~target_mask, and there was stuff in the code not to drop target streams, only to drop as sources
… scatter_idxs, torch.cat(x_embeds) + pe_embed[pe_idxs]) expected non-empty list of tensors. Also more scaffolding needed to make this work for masking, since we build the targets first and the source is just the ~target_mask, and there was stuff in the code not to drop target streams, only to drop as sources" This reverts commit 70ce173.
… its own scaffold when building source inputs
clessig
left a comment
Minor comments, but please address them before merging and let me know if we should discuss anything.
src/weathergen/datasets/masking.py
Outdated
# determine if diagnostic dataset => mask is empty
if is_stream_diagnostic(stream_cfg, self.stage):
    source_mask, mask_params = torch.zeros(num_cells, dtype=torch.bool), {}
elif randomly_drop_rate > 0.0 and self.rng.uniform() < randomly_drop_rate:
The two branches are identical, so they should be merged. To make the code more readable, one can compute the condition first, e.g.:

is_stream_dropped = randomly_drop_rate > 0.0 and self.rng.uniform() < randomly_drop_rate
if is_stream_diagnostic(stream_cfg, self.stage) or is_stream_dropped:
    source_mask, mask_params = torch.zeros(num_cells, dtype=torch.bool), {}
def _generate_cell_mask(
self, num_cells: int, strategy: str, masking_strategy_config: dict
self,
This should fit again in one line (linters are stupid and they only add lines, never remove :))
Linter didn't want one line! :(
for stream_info in self.streams:
    # Each stream uses its own effective masking config (which may include
    # per-stream ``masking_override`` merged on top of the global config).
    stream_cfg = self.masker._effective_masking_cfgs[stream_info["name"]]
Why do we still need this here? Is it not directly available in the masker where it is needed?
Also, stream_cfg is a bit misleading; I would call it stream_masking_cfg.
src/weathergen/datasets/masking.py
Outdated
self.rng = rng

@staticmethod
def merge_masking_config(mode_cfg, override):
I don't think this should be static.
src/weathergen/datasets/masking.py
Outdated
return stream_cfg_masking

@staticmethod
I don't think this should be static.
To activate randomly_drop_as_source_rate, or to override the masking_strategy_config for a particular stream, an example snippet to be included in the stream config is shown below. Example usage:
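A hypothetical sketch of such a snippet. The field names masking_override, randomly_drop_as_source_rate, masking_strategy, and masking_strategy_config come from this thread; the stream name, the values, and the surrounding YAML structure are assumptions, not the actual weathergen schema:

```yaml
# Hypothetical per-stream override (exact schema may differ):
streams:
  - name: some_stream
    masking_override:
      randomly_drop_as_source_rate: 0.2
      masking_strategy: block
      masking_strategy_config:
        block_size: 8
```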
… dop independently per source sample
…treams only applies during training
clessig
left a comment
Much cleaner now. Thanks for cleaning up.
Description
Enables per-stream masking config override.
Issue Number
Closes #1950
Checklist before asking for review
- ./scripts/actions.sh lint
- ./scripts/actions.sh unit-test
- ./scripts/actions.sh integration-test
- launch-slurm.py --time 60