Conversation

@jehangirawan (Contributor) commented on Jan 28, 2026

Issue Number

closes #1674

Description

Implements a quantile-quantile (Q-Q) analysis metric for evaluating weather forecast model performance across the full data distribution, with emphasis on extreme values in the tails.

Key additions:

  • New qq_analysis metric in the evaluation pipeline
  • Two-panel visualization: Q-Q scatter plot + deviation plot with highlighted extreme regions
  • Stores quantile data and extreme tail MSE values in JSON for post-processing

Motivation: Traditional metrics (RMSE, MAE) focus on central tendencies and may miss distributional biases in extremes. Q-Q analysis directly compares predicted vs. observed quantiles, making it ideal for detecting systematic over/under-prediction of extreme weather events.
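To make the comparison concrete, here is a minimal, self-contained NumPy sketch of the idea (toy data and illustrative names, not this PR's implementation):

```python
import numpy as np

# Toy samples: a forecast whose spread is slightly too wide vs. the observations
rng = np.random.default_rng(0)
forecast = rng.normal(0.0, 1.1, size=10_000)
observed = rng.normal(0.0, 1.0, size=10_000)

quantile_levels = np.linspace(0.01, 0.99, 99)
p_quantiles = np.quantile(forecast, quantile_levels)   # predicted quantiles
gt_quantiles = np.quantile(observed, quantile_levels)  # observed quantiles

# Points off the y = x diagonal indicate distributional bias; deviations at
# levels below 0.05 or above 0.95 flag mis-predicted extremes.
qq_deviation = np.abs(p_quantiles - gt_quantiles)
```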

Usage:

In config/evaluate/eval_config.yml:

```yaml
evaluation:
  metrics: ["qq_analysis"]
```

```bash
uv run --offline evaluate --config config/evaluate/eval_config.yml
```

Checklist before asking for review

  • I have performed a self-review of my code
  • My changes comply with basic sanity checks:
    • I have fixed formatting issues with ./scripts/actions.sh lint
    • I have run unit tests with ./scripts/actions.sh unit-test
    • I have documented my code and I have updated the docstrings
    • I have added unit tests, if relevant
  • I have tried my changes with data and code:
    • I have run the integration tests with ./scripts/actions.sh integration-test
    • (bigger changes) I have run a full training and written the run_id(s) in a comment: launch-slurm.py --time 60
    • (bigger changes and experiments) I have shared a HedgeDoc in the GitHub issue with all the configurations and runs for these experiments
  • I have informed and aligned with people impacted by my change:
    • for config changes: the MatterMost channels and/or a design doc
    • for changes of dependencies: the MatterMost software development channel

@jehangirawan (Contributor, Author) commented:

Sample Q-Q Analysis Output:
The left panel shows the Q-Q (quantile-quantile) scatter plot, and the right panel displays the quantile deviation. The shaded regions indicate the extreme zones (5th and 95th percentiles).

[Figure: qq_analysis_qq_analysis_global_q5krziaj_ERA5_2t, the two-panel Q-Q output for ERA5 2t]
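Reusing the toy arrays from the NumPy sketch above, a hypothetical matplotlib rendering of this two-panel layout could look like the following (illustrative only, not the PR's plotting code):

```python
import matplotlib.pyplot as plt

fig, (ax_qq, ax_dev) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: Q-Q scatter with the identity line (perfect calibration)
ax_qq.scatter(gt_quantiles, p_quantiles, s=10)
lims = [gt_quantiles.min(), gt_quantiles.max()]
ax_qq.plot(lims, lims, "k--", lw=1)
ax_qq.set_xlabel("observed quantiles")
ax_qq.set_ylabel("predicted quantiles")

# Right panel: deviation per quantile level with shaded extreme zones
ax_dev.plot(quantile_levels, p_quantiles - gt_quantiles)
ax_dev.axvspan(0.00, 0.05, alpha=0.2)  # lower extreme zone (5th percentile)
ax_dev.axvspan(0.95, 1.00, alpha=0.2)  # upper extreme zone (95th percentile)
ax_dev.set_xlabel("quantile level")
ax_dev.set_ylabel("deviation")
plt.show()
```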

```python
quantile_levels = ds_ref["quantile_levels"].values

# Find extreme regions (typically below 5% and above 95%)
lower_extreme_idx = quantile_levels < 0.05
```
Collaborator (review comment):

can we make the thresholds as arguments of the plot instead of hardcoded?
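A hypothetical sketch of that change (the function and parameter names are illustrative, not the PR's API):

```python
def plot_qq(quantile_levels, p_quantiles, gt_quantiles,
            lower_extreme=0.05, upper_extreme=0.95):
    # Thresholds are now arguments; defaults reproduce the old hardcoded values
    lower_extreme_idx = quantile_levels < lower_extreme
    upper_extreme_idx = quantile_levels > upper_extreme
    ...
```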

```python
qq_deviation = np.abs(p_quantiles - gt_quantiles)

# Calculate normalized deviation (relative to interquartile range of ground truth)
gt_q25 = gt_flat.quantile(0.25, dim="_agg_points")
```
Collaborator (review comment):

can we have configurable thresholds instead of hardcoded here? like gt_q_low and gt_q_high
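A hypothetical sketch of that change, where gt_q_low / gt_q_high are the reviewer's suggested names and the defaults reproduce the current hardcoded interquartile range:

```python
import numpy as np

def normalized_qq_deviation(p_quantiles, gt_quantiles, gt_flat,
                            gt_q_low=0.25, gt_q_high=0.75):
    # Normalize by a configurable inter-quantile range of the ground truth
    gt_range = (gt_flat.quantile(gt_q_high, dim="_agg_points")
                - gt_flat.quantile(gt_q_low, dim="_agg_points"))
    return np.abs(p_quantiles - gt_quantiles) / gt_range
```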

```python
qq_data_coord_array = np.empty(overall_qq_score.shape, dtype=object)

# Iterate over all positions and create individual JSON strings
for idx in np.ndindex(overall_qq_score.shape):
```
Collaborator (review comment):

this is extremely redundant. We should already have the to_list definition somewhere, but here is an example of how I'd make it shorter:

```python
def _to_list(arr, idx):
    return (
        arr.values[(...,) + idx].tolist()
        if arr.ndim > 1
        else arr.values.tolist()
    )

def _to_float(arr, idx):
    return float(arr.values[idx]) if arr.ndim > 0 else float(arr.values)


qq_full_data = {
    "quantile_levels": quantile_levels.tolist(),
    "p_quantiles": _to_list(p_quantiles, idx),
    "gt_quantiles": _to_list(gt_quantiles, idx),
    "qq_deviation": _to_list(qq_deviation, idx),
    "qq_deviation_normalized": _to_list(qq_deviation_normalized, idx),
    "extreme_low_mse": _to_float(extreme_low_mse, idx),
    "extreme_high_mse": _to_float(extreme_high_mse, idx),
}

qq_data_coord_array[idx] = json.dumps(qq_full_data)
```

In general you can always ask Claude or Copilot (or chatGPT) to restructure your code in a more modular and compact syntax ;)

@jehangirawan (Author) replied:

Good catch! Definitely cleaner this way. I’ve refactored the logic to be more modular as suggested. Thanks for the tip!

```python
combined_metrics = xr.concat(
    valid_scores,
    dim="metric",
    coords="different",
)
```
Collaborator (review comment):

are you sure that this doesn't mess up the other scores?
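A quick self-contained illustration of what coords="different" does (toy arrays, not the project's data): coordinates that differ across the inputs are concatenated along the new dimension rather than dropped or forced onto the other metrics.

```python
import xarray as xr

a = xr.DataArray([1.0, 2.0], dims="x", coords={"x": [0, 1], "extra": 0.1})
b = xr.DataArray([3.0, 4.0], dims="x", coords={"x": [0, 1], "extra": 0.2})

# "extra" differs between the inputs, so it becomes a coord along "metric"
c = xr.concat([a, b], dim="metric", coords="different")
print(c.coords["extra"].values)  # [0.1 0.2]
```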

@jehangirawan (Author) replied:

Modified and tested successfully with multiple metrics (rmse, bias, mae, qq_analysis)
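For reference, a multi-metric run like that check can presumably be configured the same way as the Usage snippet above (assumed layout):

```yaml
evaluation:
  metrics: ["rmse", "bias", "mae", "qq_analysis"]
```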


```python
metric_stream.loc[criteria] = combined_metrics

# Preserve metric-specific coordinates that loc assignment may drop
```
Collaborator (review comment):

please avoid adding metric-specific ifs to this function. This should stay as general as possible.
I know that the problem here is having the additional dimension that you cannot stack with the others. Maybe we can have a meeting and brainstorm with @SavvasMel about how to do it properly, as in any case we will need it for the rank_histogram as well.

@jehangirawan (Author) replied:

Addressed:

- Rename methods, add helper functions, fix naming conventions
- Make percentile thresholds configurable
- Create QuantilePlots class, implement generic coordinate handling
- All reviewer comments addressed
@jehangirawan force-pushed the add_quantile_quantile_score branch from 09b6189 to aa97395 on January 29, 2026 at 22:08
@jehangirawan force-pushed the add_quantile_quantile_score branch from aa97395 to a48ddd7 on January 29, 2026 at 23:07
Comment on lines +219 to +221:

```python
# Restore metric-specific coordinates that were dropped by coords="minimal"
# (e.g., quantiles, extreme_percentiles for qq_analysis)
for coord_name in combined_metrics.coords:
```
@jehangirawan (Author) commented:

I’ve updated the logic to be fully generic by checking dimension compatibility dynamically. This now handles the extra dimensions for qq_analysis and will work for other metrics like rank_histogram as well without any hardcoding.
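A hypothetical sketch of such a dynamic compatibility check (names assumed from the snippets above): re-attach any coordinate from the combined result whose dimensions all exist on the target, with no per-metric branching.

```python
# Restore coords generically: keep a coordinate iff every one of its
# dimensions is present on metric_stream, whatever metric produced it
for coord_name, coord in combined_metrics.coords.items():
    if coord_name not in metric_stream.coords and set(coord.dims) <= set(metric_stream.dims):
        metric_stream = metric_stream.assign_coords({coord_name: coord})
```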

@jehangirawan (Author) commented:

All comments addressed. Tested successfully with multiple metrics (rmse, bias, mae, qq_analysis)
Ready for re-review! @iluise
