Add Q-Q analysis metric for extreme value evaluation #1734
Conversation
Force-pushed from a5f491d to 312301e
quantile_levels = ds_ref["quantile_levels"].values

# Find extreme regions (typically below 5% and above 95%)
lower_extreme_idx = quantile_levels < 0.05
Can we make the thresholds arguments of the plot instead of hardcoding them?
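For illustration, a minimal sketch of what that could look like (the helper and parameter names below are hypothetical, not the PR's actual API):

def find_extreme_regions(quantile_levels, lower_threshold=0.05, upper_threshold=0.95):
    # Hypothetical helper: the tail cut-offs become keyword arguments, with the
    # currently hardcoded values (0.05 / 0.95) kept as defaults.
    lower_extreme_idx = quantile_levels < lower_threshold
    upper_extreme_idx = quantile_levels > upper_threshold
    return lower_extreme_idx, upper_extreme_idx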
qq_deviation = np.abs(p_quantiles - gt_quantiles)

# Calculate normalized deviation (relative to interquartile range of ground truth)
gt_q25 = gt_flat.quantile(0.25, dim="_agg_points")
Can we have configurable thresholds here instead of hardcoded values, e.g. gt_q_low and gt_q_high?
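As a rough sketch of the suggestion (using the proposed gt_q_low / gt_q_high names; the surrounding metric code is assumed, not shown):

def normalized_qq_deviation(qq_deviation, gt_flat, gt_q_low=0.25, gt_q_high=0.75):
    # Hypothetical helper: the normalization quantiles are configurable instead
    # of hardcoding 0.25 / 0.75, defaulting to the interquartile range.
    gt_low = gt_flat.quantile(gt_q_low, dim="_agg_points")
    gt_high = gt_flat.quantile(gt_q_high, dim="_agg_points")
    return qq_deviation / (gt_high - gt_low)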
qq_data_coord_array = np.empty(overall_qq_score.shape, dtype=object)

# Iterate over all positions and create individual JSON strings
for idx in np.ndindex(overall_qq_score.shape):
This is extremely redundant. We should already have the to_list definition somewhere, but here is an example of how I'd make it shorter:
def _to_list(arr, idx):
    return (
        arr.values[(...,) + idx].tolist()
        if arr.ndim > 1
        else arr.values.tolist()
    )

def _to_float(arr, idx):
    return float(arr.values[idx]) if arr.ndim > 0 else float(arr.values)

qq_full_data = {
    "quantile_levels": quantile_levels.tolist(),
    "p_quantiles": _to_list(p_quantiles, idx),
    "gt_quantiles": _to_list(gt_quantiles, idx),
    "qq_deviation": _to_list(qq_deviation, idx),
    "qq_deviation_normalized": _to_list(qq_deviation_normalized, idx),
    "extreme_low_mse": _to_float(extreme_low_mse, idx),
    "extreme_high_mse": _to_float(extreme_high_mse, idx),
}
qq_data_coord_array[idx] = json.dumps(qq_full_data)
In general you can always ask Claude or Copilot (or ChatGPT) to restructure your code in a more modular and compact way ;)
Good catch! Definitely cleaner this way. I've refactored the logic to be more modular as suggested. Thanks for the tip!
combined_metrics = xr.concat(
    valid_scores,
    dim="metric",
    coords="different",
Are you sure this doesn't mess up the other scores?
Modified and tested successfully with multiple metrics (rmse, bias, mae, qq_analysis)
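For reference, one way to sanity-check that behaviour in isolation (a toy example with made-up DataArrays, not the real metric_stream):

import numpy as np
import xarray as xr

# Two toy per-metric score arrays sharing a "time" dimension.
rmse = xr.DataArray(np.random.rand(3), dims="time")
bias = xr.DataArray(np.random.rand(3), dims="time")

combined = xr.concat([rmse, bias], dim="metric", coords="different")
combined = combined.assign_coords(metric=["rmse", "bias"])

# Each score should come back unchanged after concatenation.
assert np.allclose(combined.sel(metric="rmse").values, rmse.values)
assert np.allclose(combined.sel(metric="bias").values, bias.values)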
metric_stream.loc[criteria] = combined_metrics

# Preserve metric-specific coordinates that loc assignment may drop
Please avoid adding metric-specific ifs to this function; it should stay as general as possible.
I know the problem here is the additional dimension that cannot be stacked with the others. Maybe we can have a meeting and brainstorm with @SavvasMel about how to do it properly, since we will need it for the rank_histogram as well.
Addressed:
- Rename methods, add helper functions, fix naming conventions
- Make percentile thresholds configurable
- Create QuantilePlots class, implement generic coordinate handling
- All reviewer comments addressed
Force-pushed from 09b6189 to aa97395
Force-pushed from aa97395 to a48ddd7
# Restore metric-specific coordinates that were dropped by coords="minimal"
# (e.g., quantiles, extreme_percentiles for qq_analysis)
for coord_name in combined_metrics.coords:
I’ve updated the logic to be fully generic by checking dimension compatibility dynamically. This now handles the extra dimensions for qq_analysis and will work for other metrics like rank_histogram as well without any hardcoding.
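Roughly the idea, as a sketch (variable names follow the snippets above; the actual implementation in the PR may differ):

# Re-attach coordinates from combined_metrics that the .loc assignment dropped,
# but only when all of their dimensions exist on the target array.
for coord_name, coord in combined_metrics.coords.items():
    if coord_name in metric_stream.coords:
        continue  # already present, nothing to restore
    if all(dim in metric_stream.dims for dim in coord.dims):
        metric_stream = metric_stream.assign_coords({coord_name: coord})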
All comments addressed. Tested successfully with multiple metrics (rmse, bias, mae, qq_analysis)
Ready for re-review! @iluise

Issue Number
closes #1674
Description
Implements a quantile-quantile (Q-Q) analysis metric for evaluating weather forecast model performance across the full data distribution, with emphasis on extreme values in the tails.
Key additions:
- qq_analysis metric in the evaluation pipeline

Motivation: Traditional metrics (RMSE, MAE) focus on central tendencies and may miss distributional biases in extremes. Q-Q analysis directly compares predicted vs. observed quantiles, making it ideal for detecting systematic over/under-prediction of extreme weather events.
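As an illustration of the idea (a minimal NumPy sketch of a Q-Q comparison, not the code added in this PR):

import numpy as np

quantile_levels = np.linspace(0.01, 0.99, 99)
pred = np.random.randn(10_000)   # stand-in for model output
obs = np.random.randn(10_000)    # stand-in for reference data

# Compare predicted vs. observed quantiles at the same probability levels.
p_quantiles = np.quantile(pred, quantile_levels)
gt_quantiles = np.quantile(obs, quantile_levels)
qq_deviation = np.abs(p_quantiles - gt_quantiles)

# The tails (e.g. below 5% and above 95%) highlight systematic
# over-/under-prediction of extremes.
extreme_low_mse = float(np.mean(qq_deviation[quantile_levels < 0.05] ** 2))
extreme_high_mse = float(np.mean(qq_deviation[quantile_levels > 0.95] ** 2))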
Usage (config/evaluate/eval_config.yml):

evaluation:
  metrics: ["qq_analysis"]
Checklist before asking for review
- ./scripts/actions.sh lint
- ./scripts/actions.sh unit-test
- [x] I have documented my code and I have updated the docstrings.
- ./scripts/actions.sh integration-test
- launch-slurm.py --time 60