Conversation

@jehangirawan (Contributor) commented on Jan 28, 2026

Issue Number

closes #1674

Description

Implements a quantile-quantile (Q-Q) analysis metric for evaluating weather forecast model performance across the full data distribution, with emphasis on extreme values in the tails.

Key additions:

  • New qq_analysis metric in the evaluation pipeline
  • Two-panel visualization: Q-Q scatter plot + deviation plot with highlighted extreme regions
  • Stores quantile data and extreme tail MSE values in JSON for post-processing

Motivation: Traditional metrics (RMSE, MAE) focus on central tendencies and may miss distributional biases in extremes. Q-Q analysis directly compares predicted vs. observed quantiles, making it ideal for detecting systematic over/under-prediction of extreme weather events.
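To make the comparison concrete, here is a minimal, self-contained NumPy sketch of the idea (toy data and illustrative names, not this PR's implementation):

```python
import numpy as np

# Toy samples: a forecast whose spread is slightly too wide vs. the observations
rng = np.random.default_rng(0)
forecast = rng.normal(0.0, 1.1, size=10_000)
observed = rng.normal(0.0, 1.0, size=10_000)

quantile_levels = np.linspace(0.01, 0.99, 99)
p_quantiles = np.quantile(forecast, quantile_levels)   # predicted quantiles
gt_quantiles = np.quantile(observed, quantile_levels)  # observed quantiles

# Points off the y = x diagonal indicate distributional bias; deviations at
# levels below 0.05 or above 0.95 flag mis-predicted extremes.
qq_deviation = np.abs(p_quantiles - gt_quantiles)
```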

Usage:

In config/evaluate/eval_config.yml:

```yaml
evaluation:
  metrics: ["qq_analysis"]
```

```bash
uv run --offline evaluate --config config/evaluate/eval_config.yml
```

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and written the run_id(s) in a comment: launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@jehangirawan (Contributor, Author) commented:

Sample Q-Q Analysis Output:
The left panel shows the Q-Q (quantile-quantile) scatter plot, and the right panel displays the quantile deviation. The shaded regions indicate the extreme zones (5th and 95th percentiles).

[Figure: qq_analysis_qq_analysis_global_q5krziaj_ERA5_2t, the two-panel Q-Q output for ERA5 2t]
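Reusing the toy arrays from the NumPy sketch above, a hypothetical matplotlib rendering of this two-panel layout could look like the following (illustrative only, not the PR's plotting code):

```python
import matplotlib.pyplot as plt

fig, (ax_qq, ax_dev) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: Q-Q scatter with the identity line (perfect calibration)
ax_qq.scatter(gt_quantiles, p_quantiles, s=10)
lims = [gt_quantiles.min(), gt_quantiles.max()]
ax_qq.plot(lims, lims, "k--", lw=1)
ax_qq.set_xlabel("observed quantiles")
ax_qq.set_ylabel("predicted quantiles")

# Right panel: deviation per quantile level with shaded extreme zones
ax_dev.plot(quantile_levels, p_quantiles - gt_quantiles)
ax_dev.axvspan(0.00, 0.05, alpha=0.2)  # lower extreme zone (5th percentile)
ax_dev.axvspan(0.95, 1.00, alpha=0.2)  # upper extreme zone (95th percentile)
ax_dev.set_xlabel("quantile level")
ax_dev.set_ylabel("deviation")
plt.show()
```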

```python
quantile_levels = ds_ref["quantile_levels"].values

# Find extreme regions (typically below 5% and above 95%)
lower_extreme_idx = quantile_levels < 0.05
```
Collaborator (review comment):

can we make the thresholds as arguments of the plot instead of hardcoded?
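A hypothetical sketch of that change (the function and parameter names are illustrative, not the PR's API):

```python
def plot_qq(quantile_levels, p_quantiles, gt_quantiles,
            lower_extreme=0.05, upper_extreme=0.95):
    # Thresholds are now arguments; defaults reproduce the old hardcoded values
    lower_extreme_idx = quantile_levels < lower_extreme
    upper_extreme_idx = quantile_levels > upper_extreme
    ...
```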

```python
qq_deviation = np.abs(p_quantiles - gt_quantiles)

# Calculate normalized deviation (relative to interquartile range of ground truth)
gt_q25 = gt_flat.quantile(0.25, dim="_agg_points")
```
Collaborator (review comment):

can we have configurable thresholds instead of hardcoded here? like gt_q_low and gt_q_high
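A hypothetical sketch of that change, where gt_q_low / gt_q_high are the reviewer's suggested names and the defaults reproduce the current hardcoded interquartile range:

```python
import numpy as np

def normalized_qq_deviation(p_quantiles, gt_quantiles, gt_flat,
                            gt_q_low=0.25, gt_q_high=0.75):
    # Normalize by a configurable inter-quantile range of the ground truth
    gt_range = (gt_flat.quantile(gt_q_high, dim="_agg_points")
                - gt_flat.quantile(gt_q_low, dim="_agg_points"))
    return np.abs(p_quantiles - gt_quantiles) / gt_range
```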

```python
qq_data_coord_array = np.empty(overall_qq_score.shape, dtype=object)

# Iterate over all positions and create individual JSON strings
for idx in np.ndindex(overall_qq_score.shape):
```
Collaborator (review comment):

this is extremely redundant. We should already have the to_list definition somewhere, but here is an example of how I'd make it shorter:

```python
def _to_list(arr, idx):
    return (
        arr.values[(...,) + idx].tolist()
        if arr.ndim > 1
        else arr.values.tolist()
    )

def _to_float(arr, idx):
    return float(arr.values[idx]) if arr.ndim > 0 else float(arr.values)


qq_full_data = {
    "quantile_levels": quantile_levels.tolist(),
    "p_quantiles": _to_list(p_quantiles, idx),
    "gt_quantiles": _to_list(gt_quantiles, idx),
    "qq_deviation": _to_list(qq_deviation, idx),
    "qq_deviation_normalized": _to_list(qq_deviation_normalized, idx),
    "extreme_low_mse": _to_float(extreme_low_mse, idx),
    "extreme_high_mse": _to_float(extreme_high_mse, idx),
}

qq_data_coord_array[idx] = json.dumps(qq_full_data)
```

In general you can always ask Claude or Copilot (or chatGPT) to restructure your code in a more modular and compact syntax ;)

@jehangirawan (Author) replied:

Good catch! Definitely cleaner this way. I’ve refactored the logic to be more modular as suggested. Thanks for the tip!

```python
combined_metrics = xr.concat(
    valid_scores,
    dim="metric",
    coords="different",
)
```
Collaborator (review comment):

are you sure that this doesn't mess up the other scores?
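A quick self-contained illustration of what coords="different" does (toy arrays, not the project's data): coordinates that differ across the inputs are concatenated along the new dimension rather than dropped or forced onto the other metrics.

```python
import xarray as xr

a = xr.DataArray([1.0, 2.0], dims="x", coords={"x": [0, 1], "extra": 0.1})
b = xr.DataArray([3.0, 4.0], dims="x", coords={"x": [0, 1], "extra": 0.2})

# "extra" differs between the inputs, so it becomes a coord along "metric"
c = xr.concat([a, b], dim="metric", coords="different")
print(c.coords["extra"].values)  # [0.1 0.2]
```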

@jehangirawan (Author) replied:

Modified and tested successfully with multiple metrics (rmse, bias, mae, qq_analysis)
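For reference, a multi-metric run like that check can presumably be configured the same way as the Usage snippet above (assumed layout):

```yaml
evaluation:
  metrics: ["rmse", "bias", "mae", "qq_analysis"]
```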


```python
metric_stream.loc[criteria] = combined_metrics

# Preserve metric-specific coordinates that loc assignment may drop
```
Collaborator (review comment):

please avoid adding metric-specific ifs to this function. This should stay as general as possible.
I know that the problem here is having the additional dimension that you cannot stack with the others. Maybe we can have a meeting and brainstorm with @SavvasMel about how to do it properly, as in any case we will need it for the rank_histogram as well.

@jehangirawan (Author) replied:

Addressed:

- Rename methods, add helper functions, fix naming conventions
- Make percentile thresholds configurable
- Create QuantilePlots class, implement generic coordinate handling
- All reviewer comments addressed
@jehangirawan force-pushed the add_quantile_quantile_score branch from 09b6189 to aa97395 on January 29, 2026 at 22:08
@jehangirawan force-pushed the add_quantile_quantile_score branch from aa97395 to a48ddd7 on January 29, 2026 at 23:07
Comment on lines +219 to +221:

```python
# Restore metric-specific coordinates that were dropped by coords="minimal"
# (e.g., quantiles, extreme_percentiles for qq_analysis)
for coord_name in combined_metrics.coords:
```
@jehangirawan (Author) commented:

I’ve updated the logic to be fully generic by checking dimension compatibility dynamically. This now handles the extra dimensions for qq_analysis and will work for other metrics like rank_histogram as well without any hardcoding.
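A hypothetical sketch of such a dynamic compatibility check (names assumed from the snippets above): re-attach any coordinate from the combined result whose dimensions all exist on the target, with no per-metric branching.

```python
# Restore coords generically: keep a coordinate iff every one of its
# dimensions is present on metric_stream, whatever metric produced it
for coord_name, coord in combined_metrics.coords.items():
    if coord_name not in metric_stream.coords and set(coord.dims) <= set(metric_stream.dims):
        metric_stream = metric_stream.assign_coords({coord_name: coord})
```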

@jehangirawan (Author) commented:

All comments addressed. Tested successfully with multiple metrics (rmse, bias, mae, qq_analysis)
Ready for re-review! @iluise
