DVGBench: Implicit-to-Explicit Visual Grounding Benchmark in UAV Imagery with Large Vision-Language Models
- [2026.01.04] 🚀🚀 We have released the evaluation code and benchmark (test set).
- [2026.01.02] 🎉🎉 Accepted by ISPRS JPRS 2026.
- [2026.01.01] Click for the latest trends in Remote Sensing Vision-Language Datasets and Models.
Remote sensing (RS) large vision–language models (LVLMs) have shown strong promise across visual grounding (VG) tasks. However, existing RS VG datasets rely predominantly on explicit referring expressions, such as relative position, relative size, and color cues, which constrains performance on implicit VG tasks that require scenario-specific domain knowledge. This article introduces DVGBench, a high-quality implicit VG benchmark for drones covering six major application scenarios: traffic, disaster, security, sport, social activity, and productive activity. Each object is annotated with both an explicit and an implicit query. Building on the dataset, we design DroneVG-R1, an LVLM that integrates a novel Implicit-to-Explicit Chain-of-Thought (I2E-CoT) within a reinforcement learning paradigm. This enables the model to leverage scene-specific expertise, converting implicit references into explicit ones and thereby reducing grounding difficulty. Finally, an evaluation of mainstream models on both explicit and implicit VG tasks reveals substantial limitations in their reasoning capabilities. These findings provide actionable insights for advancing the reasoning capacity of LVLMs for drone-based agents.
- Benchmark: DVGBench, a human-annotated VG benchmark designed for real-world UAV applications, is presented. It spans six diverse scenarios and provides both box- and mask-level annotations, along with explicit as well as implicit referring expressions.
- Model: Based on DVGBench, DroneVG-R1, an LVLM tailored for implicit VG in UAV contexts, is proposed. A segmentation model is incorporated to support reasoning segmentation.
- Method: An I2E-CoT strategy is introduced to enhance grounding accuracy by converting implicit references into explicit textual descriptions. To incentivize this conversion, a novel reasoning reward function based on explicit reference similarity is designed.
- Exploration: Extensive evaluations of existing models are performed, uncovering their limitations in implicit VG. Through comparative analysis of performance on explicit versus implicit queries, insights into the reasoning gaps and directions for improvement are provided.
Examples of six UAV application scenarios in DVGBench
Please download the dataset first, then use the code in the evaluation to run inference and compute the scores.
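The evaluation code in this repository defines the official scoring. Purely as an illustration of the standard box-level metric for VG, the sketch below computes [email protected], counting a prediction as correct when its IoU with the ground-truth box is at least 0.5; the function names are hypothetical and not taken from this repo.

```python
# Minimal sketch of a standard [email protected] grounding metric (not the repo's official script).
# Boxes are (x1, y1, x2, y2) in pixels; names are illustrative only.

def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def acc_at_05(predictions, ground_truths):
    """Fraction of queries whose predicted box reaches IoU >= 0.5 with the ground truth."""
    hits = sum(box_iou(p, g) >= 0.5 for p, g in zip(predictions, ground_truths))
    return hits / max(len(ground_truths), 1)
```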
Framework of DroneVG-R1, which comprises a reasoning model and a segmentation model. The reasoning model is an LVLM that generates reasoning chains and provides box-level results; the segmentation model then produces a pixel-wise mask from the box. In addition to the standard format and perception rewards, we design a reasoning reward that improves the quality of the model's implicit-to-explicit conversion using human-annotated explicit references.
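The released training code (built on ms-swift) defines the actual reward functions; the snippet below is only a minimal sketch of how the three reward terms named above could be combined. The function names, tag set, and weights are assumptions for illustration, and the perception term is simply the box IoU (e.g., box_iou from the evaluation sketch above), while a plain string-similarity ratio stands in for the explicit-reference similarity.

```python
# Hedged sketch of the three reward terms (hypothetical names, not the released training code).
from difflib import SequenceMatcher

def format_reward(output: str, required_tags=("<think>", "</think>", "<answer>", "</answer>")) -> float:
    """1.0 if the response contains the expected tag structure, else 0.0.
    The tag set here is an assumption for illustration."""
    return float(all(tag in output for tag in required_tags))

def reasoning_reward(pred_explicit: str, gold_explicit: str) -> float:
    """Similarity between the generated explicit reference and the human-annotated one.
    A character-level ratio is used purely as a placeholder for the paper's similarity measure."""
    return SequenceMatcher(None, pred_explicit.lower(), gold_explicit.lower()).ratio()

def total_reward(output: str, pred_explicit: str, gold_explicit: str, iou: float,
                 w_fmt: float = 0.5, w_iou: float = 1.0, w_reason: float = 1.0) -> float:
    """Weighted sum of format, perception (IoU of predicted vs. ground-truth box),
    and reasoning rewards. Weights are illustrative."""
    return (w_fmt * format_reward(output)
            + w_iou * iou
            + w_reason * reasoning_reward(pred_explicit, gold_explicit))
```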
Overview of the Implicit-to-Explicit mechanism. This diagram compares standard Group Relative Policy Optimization (GRPO) with our I2E-CoT approach. GRPO mislocates the left-turning vehicle because its visual attention is distracted during reasoning. In contrast, I2E-CoT uses the <explicit> token to generate an explicit reference for the object, correcting the initial localization and producing the correct answer. The attention graphs reveal that during the <explicit> phase, I2E-CoT identifies the explicit "green" cue, substantially increasing attention to the corresponding image tokens (blue line).
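As a small, hedged illustration of how the <explicit> phase could feed the reasoning reward sketched above, the snippet below extracts the explicit reference and the predicted box from a response with regular expressions; the exact tag layout of DroneVG-R1's outputs is an assumption here.

```python
import re

def parse_response(output: str):
    """Pull the explicit reference and the predicted box out of a response.
    The tag layout (<explicit>...</explicit>, <answer>[x1, y1, x2, y2]</answer>)
    is assumed for illustration and may not match the released model exactly."""
    explicit = re.search(r"<explicit>(.*?)</explicit>", output, re.DOTALL)
    answer = re.search(r"<answer>\s*\[([\d.,\s]+)\]\s*</answer>", output, re.DOTALL)
    pred_explicit = explicit.group(1).strip() if explicit else ""
    pred_box = [float(v) for v in answer.group(1).split(",")] if answer else None
    return pred_explicit, pred_box
```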
Two examples demonstrating the impact of the I2E-CoT mechanism on image attention. In the sports-scenario example on the right, we observe a clear shift in image attention. Before the phrase "white shirt" appears, the model's attention is dispersed or focused on the person on the left. Once "white shirt" appears, as shown in the attention-proportion curves and heatmaps, the model's image attention shifts markedly from the left person to the rider on the right wearing a white shirt. This single explicit cue refines the model's localization, locking its attention accurately onto the target rider.
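Attention-proportion curves like those in this figure can be approximated by measuring how much of each generated token's attention mass lands on the image tokens. The sketch below assumes a Hugging Face-style decoder whose generate(..., output_attentions=True, return_dict_in_generate=True) call returns per-step attentions; it is not the analysis code used for the paper's figures.

```python
import torch

def image_attention_proportion(step_attentions, image_token_mask):
    """For each generated token, the fraction of last-layer, head-averaged
    attention mass that falls on image token positions in the prompt.

    step_attentions: tuple over generation steps; each element is a tuple over
        layers of tensors shaped [batch, heads, query_len, key_len], as many
        HF decoder models return from generate(..., output_attentions=True).
    image_token_mask: 1-D bool tensor over prompt positions (True = image token).
    """
    proportions = []
    n_prompt = image_token_mask.shape[0]
    for layers in step_attentions:
        attn = layers[-1].mean(dim=1)[0, -1]   # last layer, head-averaged, last query row
        image_mass = attn[:n_prompt][image_token_mask].sum()
        proportions.append((image_mass / attn.sum()).item())
    return proportions
```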
@article{zhou2026dvgbench,
author={Zhou, Yue and Chen, Jue and Huang, Penghui and Ding, Ran and Zou, Zhentao and Gao, Pengfei and Li, Ke and Yang, Xue and Jiang, Xue and Yang, Hongxin and Li, Jonathan},
journal={ISPRS Journal of Photogrammetry and Remote Sensing},
title={DVGBench: Implicit-to-Explicit Visual Grounding Benchmark in UAV Imagery with Large Vision-Language Models},
year={2026},
volume={},
number={},
pages={}
}

This implementation is based on Qwen2.5-VL, ms-swift, and Look-Back. Thanks for their awesome work.
If you have any questions, please feel free to reach out at [email protected].










