
Conversation

@stop1one commented Sep 22, 2025

Description

This PR fixes the calculation of mAR@K in MeanAverageRecall to comply with the COCO evaluation protocol.
Previously, the implementation selected the top-K predictions globally across all images, rather than per image.
According to the COCO evaluation protocol, mAR@K should be calculated by considering the top-K highest-confidence detections for each image.
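
For reference, the COCO definition can be written as follows (a paraphrase of the protocol, not a formula taken from this PR):

$$\mathrm{mAR@}K = \frac{1}{|C|\,|T|} \sum_{c \in C} \sum_{t \in T} \mathrm{Recall}_{c,t}\left(\text{top-}K \text{ detections per image}\right)$$

where $C$ is the set of classes and $T = \{0.50, 0.55, \ldots, 0.95\}$ the IoU thresholds; the key point is that the top-$K$ cut is taken within each image, never across the dataset.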

This issue is tracked in #1966.

To resolve this, I modified the _compute and _compute_average_recall_for_classes functions to first filter each image's statistics by confidence score before concatenating them and computing the confusion matrix.
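
Purely as an illustration, here is a minimal NumPy sketch of the per-image top-K cut (the helper name and array layout are assumptions for the example, not the library's actual code):

```python
import numpy as np

def top_k_by_confidence(detections: np.ndarray, k: int) -> np.ndarray:
    """Keep the k highest-confidence rows of one image's detections.

    Assumes each row is (x_min, y_min, x_max, y_max, confidence, class_id),
    i.e. column 4 holds the confidence score.
    """
    if len(detections) <= k:
        return detections
    order = np.argsort(-detections[:, 4])  # sort descending by confidence
    return detections[order[:k]]

# The intent of the fix: cap each image at K detections *before*
# concatenating the dataset-level statistics.
rng = np.random.default_rng(0)
per_image_detections = [rng.random((7, 6)), rng.random((3, 6))]
filtered = [top_k_by_confidence(d, k=5) for d in per_image_detections]
all_detections = np.concatenate(filtered)  # 5 + 3 = 8 rows, not a global top 8
```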

No new dependencies are required for this change.

Type of change


  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How has this change been tested? Please provide a test case or example of how you tested the change.

I tested the change by running the metric on a dataset with varying numbers of predictions per image and verified that, for each image, only the top-K predictions (by confidence) were used in the mAR@K calculation.

Any specific deployment considerations

No special deployment considerations are required.

Docs

  • Docs updated? What were the changes: N/A

@stop1one stop1one requested a review from SkalskiP as a code owner September 22, 2025 03:23
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@galafis commented Sep 27, 2025

Congratulations on the fix! Changing mAR@K to be calculated per image is fully aligned with the COCO protocol and improves the accuracy of the metrics. This approach is recommended for robust benchmarks and makes fair comparisons between models easier. A great contribution to the whole community, thank you! Signed: Gabriel.

@galafis commented Sep 27, 2025

Excellent work on fixing the mAR@K calculation! This is a critical correction that addresses a fundamental issue in metric computation. The COCO evaluation protocol indeed requires per-image top-K filtering, and this fix ensures proper compliance.

Technical insights:

  1. Per-image vs Global Filtering: Your modification correctly implements the per-image top-K selection, which is essential for fair model comparison across different detection densities
  2. Metric Reliability: This fix significantly improves metric reliability, especially for datasets with varying object densities per image
  3. Benchmarking Impact: The correction will provide more accurate benchmarking results, aligning with standard COCO evaluation practices used by the research community

Implementation notes:

  • The confidence-based filtering before concatenation is the correct approach
  • This ensures each image contributes equally to the final mAR calculation regardless of detection count
  • Consider adding a unit test to verify the per-image K-filtering behavior with synthetic data

This contribution enhances the library's evaluation accuracy and research reproducibility. Well done!

Best regards,
Gabriel

@stop1one (Author) commented Oct 1, 2025

Thank you for the encouraging feedback and detailed notes, Gabriel 🙏
I’ve recently resumed work on this and will add the unit test you suggested (per-image K-filtering with synthetic data) very soon.

@stop1one (Author) commented Oct 2, 2025

I've added a simple unit test with synthetic data to validate the mAR@K calculation.

Test setup:

  • 15 images in total.
  • Each image has ≤ 5 bounding boxes/detections.
  • Therefore, mAR@10 and mAR@100 should be identical, since $K=10$ already exceeds the maximum number of detections per image (a NumPy sketch of this invariant follows below).
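
Purely as an illustration, the invariant this test checks can be sketched with plain NumPy (synthetic data only; this is not the PR's actual test code):

```python
import numpy as np

rng = np.random.default_rng(42)
# 15 synthetic "images", each with 1-5 detections; the single column
# stands in for the confidence score.
images = [rng.random((int(rng.integers(1, 6)), 1)) for _ in range(15)]

def per_image_top_k(images, k):
    # The fixed behaviour: cut to k detections inside every image.
    return [img[np.argsort(-img[:, 0])[:k]] for img in images]

def global_top_k(images, k):
    # The buggy behaviour: one cut across the concatenated dataset.
    stacked = np.concatenate(images)
    return stacked[np.argsort(-stacked[:, 0])[:k]]

# With at most 5 detections per image, K=10 and K=100 keep everything,
# so any metric built on the filtered detections must be identical.
kept_10 = sum(len(x) for x in per_image_top_k(images, 10))
kept_100 = sum(len(x) for x in per_image_top_k(images, 100))
assert kept_10 == kept_100

# The global cut, by contrast, drops detections at K=10 (15 images hold
# at least 15 boxes in total), which is what skewed mAR@10.
assert len(global_top_k(images, 10)) < kept_100
```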

Result with the original (buggy) implementation:

E           AssertionError: 
E           Arrays are not almost equal to 5 decimals
E           
E           Mismatched elements: 2 / 3 (66.7%)
E           Max absolute difference among violations: 0.23173375
E           Max relative difference among violations: 0.80613893
E            ACTUAL: array([0.05573, 0.52786, 0.63622])
E            DESIRED: array([0.28746, 0.63622, 0.63622])

As shown above, mAR@10 (0.52786) ≠ mAR@100 (0.63622), which is incorrect.
This demonstrates that the original code applied top-K filtering across the whole dataset rather than per image.
The fix in this PR corrects that behaviour.
