Fengxiang Wang1, Mingshuo Chen2, Yueying Li1, Di Wang4,5, Haotian Wang1,
Zonghao Guo3, Zefan Wang3, Boqi Shan6, Long Lan1, Yulin Wang3 †,
Hongzhen Wang3 †, Wenjing Yang1 †, Bo Du4, Jing Zhang4 †
1 National University of Defense Technology, China
2 Beijing University of Posts and Telecommunications, China
3 Tsinghua University, China, 4 Wuhan University, China
5 Zhongguancun Academy, China, 6 Beihang University, China
- 📚 Contents
- 🔥 News
- 📜 Dataset
- 🔍 Key Insights and Method
- 🚀 Finetuning and Evaluation
- 🔗 Citation
- 🤝 Acknowledgement
- [2025.09.19] Selected as a Spotlight at NeurIPS 2025!
- [2025.09.19] GeoLLaVA-8K has been accepted to NeurIPS 2025.
- [Coming Soon] More details on motivation and ablation studies.
- [2025.05.28] Training code, model, and dataset released.
- [2025.05.27] The paper is available on arXiv.
We introduce two ultra-high-resolution (UHR) vision-language datasets for GeoLLaVA-8K:
- SuperRS-VQA (avg. 8,376×8,378) and HighRS-VQA (avg. 2,000×1,912), the highest-resolution RS VQA datasets to date.
- A total of 81,367 UHR image–text pairs (SuperRS-VQA + HighRS-VQA) are used for supervised fine-tuning of GeoLLaVA-8K.
- Construction pipeline:
  - Manual annotation of 12K UHR samples by experts and crowd-workers.
  - Semi-automated generation of 100K medium-to-high-resolution (2K×2K) pairs using GPT-4o, followed by influence-based selection via the LESS framework.
  - Deduplication against existing RS datasets to minimize overlap.
- Data Selection Pipeline for MHR Data

  To improve the relevance of our medium-to-high-resolution (MHR, 2K×2K) samples to UHR downstream tasks, and to ensure that models fine-tuned on them acquire the intended reasoning capabilities, we adopt an influence-based data selection pipeline; a minimal sketch of the idea is given below.
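The selection step follows the general LESS recipe: each candidate MHR sample and each target (UHR) example is summarized by a compact gradient feature, and candidates are ranked by their similarity to the target set. The snippet below is only an illustrative sketch of that idea, not our exact implementation; the function name `select_by_influence`, the feature dimensions, the mean aggregation over the target set, and the value of `k` are all placeholder assumptions.

```python
# Illustrative sketch of LESS-style influence scoring (assumed, simplified logic):
# gradient features for candidates and target examples are taken as precomputed.
import torch
import torch.nn.functional as F

def select_by_influence(train_feats: torch.Tensor,  # [N_train, D] gradient features of MHR candidates
                        val_feats: torch.Tensor,    # [N_val, D] gradient features of UHR target tasks
                        k: int) -> torch.Tensor:
    """Return indices of the k candidates with the highest influence score."""
    # Cosine similarity = dot product of L2-normalized gradient features.
    train_feats = F.normalize(train_feats, dim=-1)
    val_feats = F.normalize(val_feats, dim=-1)
    sims = train_feats @ val_feats.T                # [N_train, N_val]
    # Aggregate over the target set (mean here; other choices such as max are possible).
    scores = sims.mean(dim=-1)                      # [N_train]
    return torch.topk(scores, k).indices

# Example with random placeholder features:
# idx = select_by_influence(torch.randn(100_000, 8192), torch.randn(512, 8192), k=20_000)
```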
Low Semantic Density in Remote-sensing Imagery
- Overwhelming Background Tokens Hinder MLLM Fine-Tuning on RS Data

  Q1: "Do background tokens dominate UHR RS imagery?"
  Results show that background coverage in RS images reaches up to 73.14%.

  Q2: "What happens when you prune background tokens?"
  Our pilot studies reveal significant redundancy in RS images: the crucial information is concentrated in a small subset of object-centric tokens, while pruning background tokens (e.g., ocean or forest) can even improve performance.
- Scarce Object Tokens Drive MLLM Fine-Tuning on RS Data

  Q3: "Is the essential information concentrated in small targets and captured by the corresponding visual tokens?"
  Ablating the object tokens (26.5) causes a 34.9% drop in generative VQA and a 24.8% drop in discriminative VQA, whereas randomly removing the same number of tokens yields only 6.7% and 1.1% decreases, demonstrating that the essential information is indeed localized in a small set of target-aligned tokens. A minimal sketch of this token-ablation setup is given after this list.
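The two pilot studies above can be mimicked with a simple token-masking experiment. The sketch below assumes per-image visual token embeddings and a binary object mask (e.g., derived from segmentation labels); the function `prune_tokens`, its 50% background-drop ratio, and the random-drop strategy are illustrative placeholders, not the settings used in the paper.

```python
# Illustrative sketch of the pilot-study token ablation (assumed, simplified logic):
# either prune a fraction of background tokens, or ablate the object-centric tokens,
# then feed the surviving tokens to the LLM and compare VQA performance.
import torch

def prune_tokens(vis_tokens: torch.Tensor,   # [N, D] visual tokens from the image encoder
                 object_mask: torch.Tensor,  # [N] bool, True where the token overlaps an annotated object
                 drop_background: float = 0.5,
                 ablate_objects: bool = False) -> torch.Tensor:
    """Return the visual tokens kept under one ablation setting."""
    if ablate_objects:
        # Setting (b): remove every object-centric token, keep only background.
        keep = ~object_mask
    else:
        # Setting (a): randomly drop a fraction of the background tokens.
        bg_idx = torch.nonzero(~object_mask).squeeze(-1)
        n_drop = int(drop_background * bg_idx.numel())
        dropped = bg_idx[torch.randperm(bg_idx.numel())[:n_drop]]
        keep = torch.ones_like(object_mask)
        keep[dropped] = False
    return vis_tokens[keep]

# Example with placeholder inputs: 4096 tokens of dim 1024, ~27% marked as object tokens.
# kept = prune_tokens(torch.randn(4096, 1024), torch.rand(4096) < 0.27)
```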
Background Token Pruning and Anchored Token Selection
GeoLLaVA-8K is built upon LongVA and was tested on an A800-SXM4-80GB GPU with CUDA 12.1 and PyTorch 2.1.2.
We recommend using uv for environment setup.
```bash
uv venv -p 3.11  # or 3.10
uv pip install torch==2.1.2+cu121 torchvision==0.16.2+cu121 torchaudio==2.1.2+cu121 --index-url https://download.pytorch.org/whl/cu121
uv pip install flash-attn==2.7.3 --no-build-isolation --no-cache-dir  # or install the wheel manually
uv pip install -r requirements.txt --no-deps
cd longva && uv pip install -e . --no-deps
```

Please first download the dataset from Hugging Face, then run:
```bash
cd longva
bash scripts/ft3.sh
```

For evaluation, please use lmms-eval and refer to XLRS-Bench-lite:
```bash
CKPT_PATH=initiacms/GeoLLaVA-8K  # or a local path
accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
    --model "longva" \
    --model_args "pretrained=${CKPT_PATH},use_flash_attention_2=True" \
    --tasks xlrs-lite \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix longva_xlrs_lite \
    --output_path ./logs/
```

If you find our work helpful, please consider citing:
```bibtex
@article{wang2025geollava8kscalingremotesensingmultimodal,
  title   = {GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution},
  author  = {Fengxiang Wang and Mingshuo Chen and Yueying Li and Di Wang and Haotian Wang and Zonghao Guo and Zefan Wang and Boqi Shan and Long Lan and Yulin Wang and Hongzhen Wang and Wenjing Yang and Bo Du and Jing Zhang},
  journal = {arXiv preprint arXiv:2505.21375},
  year    = {2025},
}
```