
DVGBench: Implicit-to-Explicit Visual Grounding Benchmark in UAV Imagery with Large Vision-Language Models

Yue Zhou1  Jue Chen1  Zilun Zhang2  Penghui Huang3  Ran Ding3  Zhentao Zou3  Pengfei Gao4  Yuchen Wei4  Ke Li4  Xue Yang3  Jiang Xue3  Hongxin Yang1  Jonathan Li1
1East China Normal University  2Zhejiang University  3Shanghai Jiao Tong University  4Information Engineering University

Paper | Dataset



📢 Latest Updates

  • [2026.01.04] 🚀🚀 We have released the evaluation code and benchmark (test set).
  • [2026.01.02] 🎉🎉 Accepted by ISPRS JPRS 2026.
  • [2026.01.01] Click for the latest trends in Remote Sensing Vision-Language Datasets and Models.

Abstract

Remote sensing (RS) large vision–language models (LVLMs) have shown strong promise across visual grounding (VG) tasks. However, existing RS VG datasets predominantly rely on explicit referring expressions—such as relative position, relative size, and color cues—thereby constraining performance on implicit VG tasks that require scenario-specific domain knowledge. This article introduces DVGBench, a high-quality implicit VG benchmark for drones, covering six major application scenarios: traffic, disaster, security, sport, social activity, and productive activity. Each object provides both explicit and implicit queries. Based on the dataset, we design DroneVG-R1, an LVLM that integrates the novel Implicit-to-Explicit Chain-of-Thought (I2E-CoT) within a reinforcement learning paradigm. This enables the model to take advantage of scene-specific expertise, converting implicit references into explicit ones and thus reducing grounding difficulty. Finally, an evaluation of mainstream models on both explicit and implicit VG tasks reveals substantial limitations in their reasoning capabilities. These findings provide actionable insights for advancing the reasoning capacity of LVLMs for drone-based agents.

Infants can easily understand references composed of colors and relative positions, but cannot comprehend references that involve common sense or domain knowledge. We refer to the latter as Implicit Visual Grounding.

🏆 Contributions

  • Benchmark: DVGBench, a human-annotated VG benchmark designed for real-world UAV applications, is presented. It spans six diverse scenarios and provides both box- and mask-level annotations, along with explicit as well as implicit referring expressions.

  • Model: Based on DVGBench, DroneVG-R1, an LVLM tailored for implicit VG in UAV contexts, is proposed. A segmentation model is incorporated to support reasoning segmentation.

  • Method: An I2E-CoT strategy is introduced to enhance grounding accuracy by converting implicit references into explicit textual descriptions. To incentivize this conversion, a novel reasoning reward function based on explicit reference similarity is designed.

  • Exploration: Extensive evaluations of existing models are performed, uncovering their limitations in implicit VG. Through comparative analysis of performance on explicit versus implicit queries, insights into the reasoning gaps and directions for improvement are provided.


💬 Benchmark

Examples of six UAV application scenarios in DVGBench

Please download the dataset first, then use the code in the evaluation directory to run inference and compute the scores.
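
For orientation, below is a minimal, self-contained sketch of the standard box-level grounding metric (accuracy at IoU ≥ 0.5). The file names, the JSON schema ({"bbox": [x1, y1, x2, y2]} per sample), and the assumption that predictions and ground truth are stored in the same order are all illustrative; the evaluation code in this repository is the authoritative reference.

```python
# Hedged sketch of Acc@0.5 for box-level visual grounding.
# File names and JSON layout are assumptions, not the repo's actual format.
import json

def iou(box_a, box_b):
    """Intersection-over-Union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def acc_at_iou(pred_path, gt_path, thresh=0.5):
    """Fraction of samples whose predicted box matches ground truth at IoU >= thresh.
    Assumes both files list samples in the same order."""
    preds = json.load(open(pred_path))
    gts = json.load(open(gt_path))
    hits = sum(iou(p["bbox"], g["bbox"]) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)

if __name__ == "__main__":
    print(f"Acc@0.5: {acc_at_iou('predictions.json', 'test_annotations.json'):.3f}")
```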


🤖 DroneVG-R1

Framework of DroneVG-R1, which comprises a reasoning model and a segmentation model. The reasoning model is an LVLM that generates reasoning chains and provides box-level results; the segmentation model then produces a pixel-wise mask based on the box. In addition to the regular format and perceptual rewards, we design a reasoning reward that improves the quality of the model's implicit-to-explicit conversion using human-annotated explicit references.
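
To make the reward design concrete, here is a small sketch of the three terms. The decomposition into format, perceptual, and reasoning rewards follows the description above, but the tag layout matched by the format reward, the IoU-based perceptual term, the string-similarity stand-in for explicit-reference similarity, and the weights are all illustrative assumptions rather than the paper's exact formulation.

```python
# Hedged sketch of the DroneVG-R1-style reward decomposition.
import re
from difflib import SequenceMatcher

def _iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def format_reward(response: str) -> float:
    # 1.0 when the response follows the assumed tag layout, else 0.0.
    pattern = r"<think>.*?<explicit>.*?</explicit>.*?</think>.*?<answer>.*?</answer>"
    return 1.0 if re.search(pattern, response, re.DOTALL) else 0.0

def perceptual_reward(pred_box, gt_box) -> float:
    # Box-level IoU stands in for the perceptual (localization) reward.
    return _iou(pred_box, gt_box)

def reasoning_reward(generated_explicit: str, annotated_explicit: str) -> float:
    # Similarity between the model's implicit-to-explicit conversion and the
    # human-annotated explicit reference; a character-level ratio stands in
    # for whatever similarity measure the paper actually uses.
    return SequenceMatcher(None, generated_explicit.lower(),
                           annotated_explicit.lower()).ratio()

def total_reward(response, pred_box, gt_box, generated_explicit, annotated_explicit,
                 w_fmt=1.0, w_per=1.0, w_rea=1.0):
    # Weighted sum of the three terms; the weights are illustrative.
    return (w_fmt * format_reward(response)
            + w_per * perceptual_reward(pred_box, gt_box)
            + w_rea * reasoning_reward(generated_explicit, annotated_explicit))
```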


🔍 I2E-CoT

Overview of the Implicit-to-Explicit mechanism. This diagram compares standard Group Relative Policy Optimization (GRPO) with our I2E-CoT approach. GRPO mislocates the left-turning vehicle due to visual attention distraction during reasoning. In contrast, I2E-CoT employs the <explicit> token to generate an explicit reference for the object, correcting the initial localization and producing the correct answer. Attention graphs reveal that during the <explicit> phase, I2E-CoT identifies the explicit "green" cue, substantially increasing attention to the corresponding image tokens (blue line).
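
As a point of reference, the group-relative advantage at the core of GRPO can be sketched in a few lines: a group of responses is sampled per query, each response is scored with a reward such as the one above, and each response's advantage is its reward normalized by the group's mean and standard deviation. The example rewards and the epsilon constant below are illustrative values, not the paper's training configuration.

```python
# Minimal sketch of the GRPO group-relative advantage.
from statistics import mean, pstdev

def group_relative_advantages(group_rewards, eps=1e-6):
    """Normalize each reward by the group's mean and standard deviation."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Toy example: four sampled responses for one implicit query; the response whose
# reasoning produced a correct explicit reference and box earns the highest reward.
rewards = [0.35, 1.9, 0.6, 1.1]
print(group_relative_advantages(rewards))
```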


🚀 Exploration

Two examples demonstrating the impact of the I2E-CoT mechanism on image attention. In the sports-scenario example on the right, we observe a distinct shift in image attention. Before the phrase "white shirt" appears, the model's attention is somewhat dispersed or focused on the person on the left. After "white shirt" appears, as shown in the attention-proportion curves and heatmaps, the model's image attention shifts markedly from the left person to the rider on the right wearing a white shirt. This single explicit descriptive cue refines the model's localization, locking attention accurately onto the target rider.
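
The attention-proportion curves described above can, in principle, be computed as sketched below: for each generated token, sum the attention mass falling on image-token positions and divide by the total attention mass. The tensor layout (one [num_heads, seq_len] attention row per decoding step) and the way image-token positions are identified are assumptions; the repository's analysis code may differ.

```python
# Hedged sketch of per-token image-attention proportion.
import numpy as np

def image_attention_proportion(step_attentions, image_token_positions):
    """step_attentions: list of arrays, one per generated token, each of shape
    [num_heads, seq_len] holding that step's attention over the context.
    Returns one proportion per generated token, averaged over heads."""
    proportions = []
    for attn in step_attentions:
        attn = np.asarray(attn)
        image_mass = attn[:, image_token_positions].sum(axis=-1)  # per head
        total_mass = attn.sum(axis=-1)
        proportions.append(float((image_mass / total_mass).mean()))
    return proportions

# Toy example: 3 decoding steps, 2 heads, 6 context tokens, image tokens at 1-3.
toy = [np.random.dirichlet(np.ones(6), size=2) for _ in range(3)]
print(image_attention_proportion(toy, image_token_positions=[1, 2, 3]))
```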

🎬 Demo

📜 Citation

@article{zhou2026dvgbench,
  author={Zhou, Yue and Chen, Jue and Huang, Penghui and Ding, Ran and Zou, Zhentao and Gao, Pengfei and Li, Ke and Yang, Xue and Jiang, Xue and Yang, Hongxin and Li, Jonathan},
  journal={ISPRS Journal of Photogrammetry and Remote Sensing}, 
  title={DVGBench: Implicit-to-Explicit Visual Grounding Benchmark in UAV Imagery with Large Vision-Language Models}, 
  year={2026},
  volume={},
  number={},
  pages={}
}

Acknowledgement

This implementation is based on Qwen2.5-VL, ms-swift, and Look-Back. Thanks for their awesome work.

Contact

If you have any questions, please feel free to reach out at [email protected].
