
Add support for masking by dropping of streams #1948

Closed
clessig wants to merge 19 commits into develop from
clessig/develop/feature_drop_streams_1947

Conversation

@clessig
Collaborator

@clessig clessig commented Feb 27, 2026

Description

Add support for masking by dropping of streams

Issue Number

Closes #1947

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings.
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and I have written in the comment the run_id(s): launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

Sophie Xhonneux and others added 19 commits February 4, 2026 19:53
Two issues caused the EMA teacher's effective rank to drop to ~8-10
(multi-GPU) or ~40 (single-GPU) at training start when rope_2D=True,
while the student appeared unaffected:

1. pe_global zeroed with rope_2D: When rope_2D was enabled,
   pe_global was cleared to zero under the assumption that RoPE
   replaces it. However, RoPE only provides relative position in
   Q/K attention -- it does not affect V. pe_global is the sole source
   of per-cell token identity for masked cells (which have no content
   from local assimilation). Without it, all masked cells are identical,
   collapsing the teacher representation. The student metric was
   artificially inflated by dropout noise hiding the same underlying
   low-rank issue. Fix: always initialize pe_global -- it and RoPE
   serve complementary roles.
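
The collapse described above can be sketched in plain Python. This is an illustrative toy, not the repo's actual model code; all names (token_inputs, pe_global) and the embedding size are assumptions. It only shows the core point: RoPE rotates Q/K inside attention and never touches the token inputs or V, so a masked cell with zero content is indistinguishable from every other masked cell unless an additive absolute embedding gives it an identity.

```python
# Toy sketch (pure Python, no torch) of the rank-collapse mechanism.
# All names and shapes here are illustrative, not the repo's actual API.

def token_inputs(masked_positions, pe_global=None):
    """Build per-cell input vectors for masked cells.

    Masked cells carry no content from local assimilation, so their
    content embedding is all zeros. Only an additive absolute position
    embedding (pe_global) can make them distinguishable; RoPE acts on
    Q/K inside attention and does not affect these inputs or V.
    """
    dim = 4
    content = [0.0] * dim  # masked: no content signal
    tokens = []
    for pos in masked_positions:
        if pe_global is not None:
            pe = pe_global[pos]
            tokens.append([c + p for c, p in zip(content, pe)])
        else:
            tokens.append(list(content))  # identical for every position
    return tokens

# Without pe_global, every masked cell is the same vector: the set of
# distinct token inputs collapses to one, mirroring the low effective rank.
no_pe = token_inputs([0, 1, 2])
assert no_pe[0] == no_pe[1] == no_pe[2]

# With pe_global, each masked cell acquires a distinct identity.
pe_global = {0: [0.1, 0, 0, 0], 1: [0, 0.1, 0, 0], 2: [0, 0, 0.1, 0]}
with_pe = token_inputs([0, 1, 2], pe_global)
assert with_pe[0] != with_pe[1] and with_pe[1] != with_pe[2]
```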

2. EMA reset ignores DDP key prefix: EMAModel.reset() loads the
   student state_dict directly via load_state_dict, but DDP wrapping
   adds a module. prefix to all keys. With strict=False, every key
   silently fails to match, leaving the teacher with uninitialized
   weights from to_empty(). The update() method already handled this
   mismatch but reset() did not. Combined with q_cells being skipped
   in EMA updates, the teacher q_cells was permanently corrupted on
   multi-GPU runs. Fix: strip the module. prefix before loading.
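
A minimal sketch of the prefix-stripping fix, assuming DDP's standard behavior of prepending `module.` to every state_dict key. The helper name and the surrounding dicts are hypothetical; only the key transformation reflects the fix described above.

```python
# Sketch of the EMA reset fix: strip the "module." prefix that DDP
# wrapping adds to state_dict keys before loading into the unwrapped
# teacher. With strict=False, unmatched keys fail silently, which is
# exactly how the teacher was left with uninitialized weights.

DDP_PREFIX = "module."

def strip_ddp_prefix(state_dict):
    """Return a copy of state_dict with any leading 'module.' removed."""
    return {
        (k[len(DDP_PREFIX):] if k.startswith(DDP_PREFIX) else k): v
        for k, v in state_dict.items()
    }

# Hypothetical example: a DDP-wrapped student emits prefixed keys,
# while the teacher module expects bare ones.
student_sd = {"module.encoder.weight": 1.0, "module.q_cells": 2.0}
teacher_keys = {"encoder.weight", "q_cells"}

fixed = strip_ddp_prefix(student_sd)
# Before the fix, zero keys matched; after stripping, all of them do.
assert set(fixed) == teacher_keys
```

The same normalization is what `update()` already did implicitly; applying it in `reset()` as well keeps the two paths consistent.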

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@github-actions github-actions bot added the model Related to model training or definition (not generic infra) label Feb 27, 2026
@clessig
Collaborator Author

clessig commented Mar 2, 2026

Will be merged with #1951


Labels

model Related to model training or definition (not generic infra)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Allow for dropping of streams in masking

2 participants