
Conversation

@stop1one commented Sep 22, 2025

Description

This PR fixes the calculation of mAR@K in MeanAverageRecall to comply with the COCO evaluation protocol.
Previously, the implementation selected the top-K predictions globally across all images, rather than per image.
According to the COCO evaluation protocol, mAR@K should be calculated by considering the top-K highest-confidence detections for each image.
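
For reference, the COCO definition can be written as follows (a paraphrase of the protocol, not a formula taken from this PR):

$$\mathrm{mAR@}K = \frac{1}{|C|\,|T|} \sum_{c \in C} \sum_{t \in T} \mathrm{Recall}_{c,t}\left(\text{top-}K \text{ detections per image}\right)$$

where $C$ is the set of classes and $T = \{0.50, 0.55, \ldots, 0.95\}$ the IoU thresholds; the key point is that the top-$K$ cut is taken within each image, never across the dataset.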

This issue is tracked in #1966.

To resolve this, I modified the _compute and _compute_average_recall_for_classes functions to first filter each image's statistics by confidence score before concatenating them and computing the confusion matrix.
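
Purely as an illustration, here is a minimal NumPy sketch of the per-image top-K cut (the helper name and array layout are assumptions for the example, not the library's actual code):

```python
import numpy as np

def top_k_by_confidence(detections: np.ndarray, k: int) -> np.ndarray:
    """Keep the k highest-confidence rows of one image's detections.

    Assumes each row is (x_min, y_min, x_max, y_max, confidence, class_id),
    i.e. column 4 holds the confidence score.
    """
    if len(detections) <= k:
        return detections
    order = np.argsort(-detections[:, 4])  # sort descending by confidence
    return detections[order[:k]]

# The intent of the fix: cap each image at K detections *before*
# concatenating the dataset-level statistics.
rng = np.random.default_rng(0)
per_image_detections = [rng.random((7, 6)), rng.random((3, 6))]
filtered = [top_k_by_confidence(d, k=5) for d in per_image_detections]
all_detections = np.concatenate(filtered)  # 5 + 3 = 8 rows, not a global top 8
```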

No new dependencies are required for this change.

Type of change


  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How has this change been tested? Please provide a test case or example of how you tested the change.

I tested the change by running the metric on a dataset with varying numbers of predictions per image and verified that, for each image, only the top-K predictions (by confidence) were used in the mAR@K calculation.

Any specific deployment considerations

No special deployment considerations are required.

Docs

  • Docs updated? What were the changes: N/A

@stop1one stop1one requested a review from SkalskiP as a code owner September 22, 2025 03:23
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@galafis commented Sep 27, 2025

Congratulations on the fix! Changing mAR@K to be calculated per image is fully aligned with the COCO protocol and improves the accuracy of the metrics. This approach is recommended for robust benchmarks and makes fair comparisons between models easier. A great contribution to the whole community, thank you! Signed: Gabriel.

@galafis commented Sep 27, 2025

Excellent work on fixing the mAR@K calculation! This is a critical correction that addresses a fundamental issue in metric computation. The COCO evaluation protocol indeed requires per-image top-K filtering, and this fix ensures proper compliance.

Technical insights:

  1. Per-image vs Global Filtering: Your modification correctly implements the per-image top-K selection, which is essential for fair model comparison across different detection densities
  2. Metric Reliability: This fix significantly improves metric reliability, especially for datasets with varying object densities per image
  3. Benchmarking Impact: The correction will provide more accurate benchmarking results, aligning with standard COCO evaluation practices used by the research community

Implementation notes:

  • The confidence-based filtering before concatenation is the correct approach
  • This ensures each image contributes equally to the final mAR calculation regardless of detection count
  • Consider adding a unit test to verify the per-image K-filtering behavior with synthetic data

This contribution enhances the library's evaluation accuracy and research reproducibility. Well done!

Best regards,
Gabriel

@stop1one (Author) commented Oct 1, 2025

Thank you for the encouraging feedback and detailed notes, Gabriel 🙏
I’ve recently resumed work on this and will add the unit test you suggested (per-image K-filtering with synthetic data) very soon.

@stop1one (Author) commented Oct 2, 2025

I've added a simple unit test with synthetic data to validate the mAR@K calculation.

Test setup:

  • 15 images in total.
  • Each image has ≤ 5 bounding boxes/detections.
  • Therefore, mAR@10 and mAR@100 should be identical, since $K=10$ already exceeds the maximum number of detections per image (a NumPy sketch of this invariant follows below).
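
Purely as an illustration, the invariant this test checks can be sketched with plain NumPy (synthetic data only; this is not the PR's actual test code):

```python
import numpy as np

rng = np.random.default_rng(42)
# 15 synthetic "images", each with 1-5 detections; the single column
# stands in for the confidence score.
images = [rng.random((int(rng.integers(1, 6)), 1)) for _ in range(15)]

def per_image_top_k(images, k):
    # The fixed behaviour: cut to k detections inside every image.
    return [img[np.argsort(-img[:, 0])[:k]] for img in images]

def global_top_k(images, k):
    # The buggy behaviour: one cut across the concatenated dataset.
    stacked = np.concatenate(images)
    return stacked[np.argsort(-stacked[:, 0])[:k]]

# With at most 5 detections per image, K=10 and K=100 keep everything,
# so any metric built on the filtered detections must be identical.
kept_10 = sum(len(x) for x in per_image_top_k(images, 10))
kept_100 = sum(len(x) for x in per_image_top_k(images, 100))
assert kept_10 == kept_100

# The global cut, by contrast, drops detections at K=10 (15 images hold
# at least 15 boxes in total), which is what skewed mAR@10.
assert len(global_top_k(images, 10)) < kept_100
```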

Result with the original (buggy) implementation:

E           AssertionError: 
E           Arrays are not almost equal to 5 decimals
E           
E           Mismatched elements: 2 / 3 (66.7%)
E           Max absolute difference among violations: 0.23173375
E           Max relative difference among violations: 0.80613893
E            ACTUAL: array([0.05573, 0.52786, 0.63622])
E            DESIRED: array([0.28746, 0.63622, 0.63622])

As shown above, mAR@10 (0.52786) ≠ mAR@100 (0.63622), which is incorrect.
This demonstrates that the original code applied top-K filtering across the whole dataset rather than per image.
The fix in this PR corrects that behaviour.
