
Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding [CVPR 2025 Highlight]

TL;DR

This repository provides a training-free, one-shot evaluation protocol for discovering and validating our primary contribution, "localization heads," in Large Vision-Language Models (LVLMs).

It contains tools for finding and analyzing localization heads in multimodal LLMs, i.e., for identifying which attention heads in a model are most responsible for localizing the objects described in a text query.

Paper: https://arxiv.org/abs/2503.06287

A simple, training-free pipeline to discover and use "localization heads" in LVLMs (e.g., LLaVA) for visual grounding. This overhaul removes the need to fork Transformers, uses standard Hugging Face APIs to collect attentions, and provides a clean Hydra-driven workflow for collection → analysis → visualization → bbox/mask outputs.

Highlights

  • No local Transformers fork: uses output_attentions=True to capture attention
  • Eager attention only: stable attention tensors from standard HF backends
  • Two-criteria head selection (see the sketch after this list):
    • Criterion-1: Value-based elbow (chord method) on head-wise image-attention sums
    • Criterion-2: Spatial Entropy (lower is better) with bottom-row focus filtering
  • Combine top-K heads → smoothing → binary mask → bbox (xyxy)
  • Hydra config groups: model, logic, data with minimal options
  • Single-file and batch processing with consistent outputs
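
A minimal sketch of the two selection criteria is shown below, assuming each head's attention over the visual tokens has been reshaped to a 2D patch grid (e.g., 24 x 24 for LLaVA-1.5 at 336 px). The function names, the SciPy dependency, and the exact entropy weighting are illustrative assumptions, not the repository's analyze.py code.

# Illustrative sketch of the two selection criteria (not the repository's analyze.py).
import numpy as np
from scipy.ndimage import label

def chord_elbow_keep(head_sums, min_keep=3):
    # Criterion-1: sort head-wise image-attention sums in descending order and keep the
    # heads above the elbow, i.e., the point farthest from the chord joining the first
    # and last points of the sorted curve.
    order = np.argsort(head_sums)[::-1]
    y = np.asarray(head_sums, dtype=float)[order]
    x = np.arange(len(y), dtype=float)
    p1, p2 = np.array([x[0], y[0]]), np.array([x[-1], y[-1]])
    d = (p2 - p1) / np.linalg.norm(p2 - p1)
    pts = np.stack([x, y], axis=1) - p1
    dist = np.abs(pts[:, 0] * d[1] - pts[:, 1] * d[0])   # perpendicular distance to the chord
    elbow = int(np.argmax(dist))
    return order[:max(elbow + 1, min_keep)]

def spatial_entropy(head_map_2d, binarize_threshold=0.0):
    # Criterion-2: binarize ReLU(mean-centered) attention, label connected components,
    # and compute the entropy of the attention mass per component (lower = more focused).
    m = np.maximum(head_map_2d - head_map_2d.mean(), 0.0)
    components, n = label(m > binarize_threshold)
    if n == 0:
        return float("inf")
    mass = np.array([m[components == i + 1].sum() for i in range(n)])
    p = mass / mass.sum()
    return float(-(p * np.log(p + 1e-12)).sum())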

Experiment Dataset

For our experiments, we prepared 1,000 samples from the RefCOCO training set. RefCOCO contains images paired with referring expressions that uniquely identify specific objects in those images. This subset allowed us to comprehensively evaluate the localization capability of each attention head in the model. For more details on dataset preparation and the experimental setup, please refer to our paper.

Requirements

  • Python 3.9+
  • GPU recommended (CUDA), but CPU is supported for smaller tests
  • Install Python packages:
pip install -r requirements.txt

Notes:

  • We keep transformers unpinned so the latest stable release is used. If you hit regressions, pin a known-good version (e.g., transformers==4.52.3).
  • Eager attention is required: Flash Attention and SDPA do not return the per-head attention weights we need.
  • hydra-colorlog is used to enable the hydra/job_logging: colorlog config.

Quick Start

# Demo or Debug: Single image + query (full pipeline: collect + analyze + viz + bbox/mask)
python pipeline.py \
  stage=pipeline \
  data.image_file=examples/images/bird.png \
  data.query="birds."

Optional: choose a cache directory for model downloads

python pipeline.py \
  stage=pipeline \
  data.image_file=examples/images/cat.png \
  data.query="a cat on the floor" \
  model.cache_dir=/your/hf/cache

Optional: capture generated text and use attentions from the first generated token (falls back to forward attentions if unavailable)

python pipeline.py \
  stage=pipeline \
  data.image_file=examples/images/cat.png \
  data.query="a cat on the floor" \
  model.use_generate=true

Configuration (Hydra)

Configs live under conf/ and are composed in conf/config.yaml.

  • conf/model/llava15_7b.yaml

    • name: Hugging Face repo id, e.g., liuhaotian/llava-v1.5-7b
    • cache_dir: optional HF cache dir (also exported to TRANSFORMERS_CACHE, HF_HOME, HF_HUB_CACHE)
    • device: auto | cpu | cuda:<id> (use device_id when auto)
    • device_id: GPU index used when device=auto
    • conv_mode: prompt template key (default: referseg)
    • max_new_tokens, do_sample, num_beams: used only if use_generate=true
    • use_generate: false by default (forward-only attentions)
    • use_flash_attn: false (keep eager)
  • conf/logic/selection_v1.yaml

    • top_k: number of heads to combine for visualization/mask
    • threshold.method: chord (value-based elbow)
    • threshold.min_keep: ensure at least N heads remain after criterion-1
    • entropy.binarize_threshold: threshold used to binarize the ReLU(mean-centered) map before extracting connected components
    • smoothing.sigma: Gaussian sigma before combining heads
    • mask.method: mean_relu (fixed in this version; see the post-processing sketch after this list)
  • conf/data/local_examples.yaml

    • image_file, query, attention_file
    • data_file: JSONL path for batch mode
    • process_all, start_index, end_index
    • output_dir: outputs root (default _overhaul/outputs/localization_heads)
    • visualize_batch: save figures for each batch item
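
The logic options above map onto a short post-processing chain: smooth each selected head map, average, apply mean_relu, binarize, and read off an xyxy box. Below is an illustrative sketch of that chain; the function name, the SciPy dependency, and the nearest-neighbour upsampling are assumptions rather than the repository's bbox.py.

# Illustrative sketch of the post-processing chain the logic group controls
# (not the repository's bbox.py). Assumes `head_maps` is a list of top_k 2D
# attention maps on the patch grid for the selected heads.
import numpy as np
from scipy.ndimage import gaussian_filter

def maps_to_bbox(head_maps, image_w, image_h, sigma=1.0, binarize_threshold=0.0):
    # Smooth each selected head map (smoothing.sigma), then combine by averaging.
    smoothed = [gaussian_filter(np.asarray(m, dtype=float), sigma=sigma) for m in head_maps]
    combined = np.mean(smoothed, axis=0)
    # mean_relu: mean-center, clamp negatives, then binarize into a pseudo-mask.
    combined = np.maximum(combined - combined.mean(), 0.0)
    mask_small = combined > binarize_threshold
    # Upsample the patch-grid mask to image resolution (nearest neighbour).
    gh, gw = mask_small.shape
    ys = np.arange(image_h) * gh // image_h
    xs = np.arange(image_w) * gw // image_w
    mask = mask_small[np.ix_(ys, xs)]
    if not mask.any():
        return mask, None
    rows, cols = np.where(mask)
    bbox_xyxy = [int(cols.min()), int(rows.min()), int(cols.max()) + 1, int(rows.max()) + 1]
    return mask, bbox_xyxy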

Commands

Batch experiments (we ran 10 trials over a set of 1,000 samples)

  • Batch mode (JSONL schema below)
python pipeline.py stage=batch \
  data.data_file=examples/localization_data.jsonl \
  data.process_all=true

Single image/query

  • Full pipeline
python pipeline.py stage=pipeline \
  data.image_file=examples/images/bird.png \
  data.query="the bird on the branch"
  • Collect only
python pipeline.py stage=collect \
  data.image_file=examples/images/dog.png \
  data.query="a small dog wearing a collar"
  • Analyze an existing attention file
python pipeline.py stage=analyze \
  data.attention_file=_overhaul/outputs/localization_heads/liuhaotian-llava-v1.5-7b/sample.pkl
  • Visualize an existing attention file (also writes bbox/mask)
python pipeline.py stage=visualize \
  data.attention_file=_overhaul/outputs/localization_heads/liuhaotian-llava-v1.5-7b/sample.pkl

Input JSONL Schema

Each line is a JSON object:

{"id": "example_1", "prompt": "a black cat on the sofa", "image_path": "examples/images/cat.png"}

Required keys: id, prompt, image_path.
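
If you need to assemble such a file yourself, a few lines of Python are enough; the second sample below is illustrative.

# Write the batch input file, one JSON object per line (keys as documented above).
import json

samples = [
    {"id": "example_1", "prompt": "a black cat on the sofa", "image_path": "examples/images/cat.png"},
    {"id": "example_2", "prompt": "the bird on the branch", "image_path": "examples/images/bird.png"},
]

with open("examples/localization_data.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")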

Outputs

Under data.output_dir/<model_name_sanitized>/ for single items (or per-id for batch):

  • <id>.pkl: attention dict with attn tensor [L, H, 1, V] and meta
  • <id>_analysis.pkl: ranked head list (top by spatial entropy)
  • <id>_topK.png: image + top-K attention maps
  • <id>_mask.png: binary pseudo-mask at image resolution
  • <id>_bbox.json: bbox (xyxy), image size, and selected head details

meta includes: image_file, query, model_name, image_size, vis_len, patch_size, num_layers, num_heads, and optionally generated_text when use_generate=true.
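
A short sketch for loading and inspecting these outputs follows; the pickle and JSON key names ("attn", "meta", "bbox") are assumptions inferred from the field lists above, so adjust them to what your run actually writes.

# Illustrative sketch for inspecting the outputs; key names are assumptions.
import json
import pickle

out_dir = "_overhaul/outputs/localization_heads/liuhaotian-llava-v1.5-7b"
item_id = "sample"

with open(f"{out_dir}/{item_id}.pkl", "rb") as f:
    record = pickle.load(f)
print(record["attn"].shape)                       # expected [L, H, 1, V]
print(record["meta"]["image_size"], record["meta"]["num_heads"])

with open(f"{out_dir}/{item_id}_bbox.json") as f:
    bbox_info = json.load(f)
print(bbox_info["bbox"])                          # xyxy in image coordinates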

How Attention Is Collected

  • Forward mode (default):
    • Call model(..., output_attentions=True)
    • Take the last token's attention over the visual token range: [L, H, 1, V] (see the forward-mode sketch after this list)
  • Generate mode (model.use_generate=true):
    • Call generate(..., return_dict_in_generate=True, output_attentions=True)
    • Use attentions from the first generated step (shape [L, H, 1, src_len]) sliced to visual tokens; falls back to forward if not present
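
For reference, a minimal sketch of the forward-mode capture through the standard Hugging Face API is shown below. The llava-hf checkpoint, prompt template, and visual-token offsets are illustrative assumptions; the repository's collector.py builds liuhaotian/llava-v1.5-7b through its own llava/ components.

# Illustrative sketch of forward-mode attention capture (not the repository's collector.py).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"   # assumption: an HF-format LLaVA-1.5 checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="eager",        # eager attention so output_attentions=True returns weights
    device_map="auto",
)

image = Image.open("examples/images/cat.png")
prompt = "USER: <image>\na cat on the floor ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
inputs["pixel_values"] = inputs["pixel_values"].to(torch.float16)

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions is a tuple of L tensors, each [batch, H, seq, seq]. Take the last
# token's row and slice the visual-token range; the offsets depend on the prompt template.
vis_start, vis_len = 5, 576             # assumption: 576 = 24 x 24 patches at 336 px
attn = torch.stack([a[0, :, -1, vis_start:vis_start + vis_len] for a in out.attentions])
print(attn.shape)                       # [L, H, V]; add a singleton dim for the [L, H, 1, V] layout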

Repository Layout

.
├─ pipeline.py                 # Hydra entrypoint
├─ collector.py                # Attention collection (forward/generate)
├─ analyze.py                  # Elbow (value) + spatial entropy head selection
├─ bbox.py                     # Combine heads → mask → bbox
├─ viz.py                      # Image + top-K attention plots
├─ requirements.txt            # Minimal runtime dependencies
├─ conf/
│  ├─ config.yaml              # Hydra root
│  ├─ model/llava15_7b.yaml    # Model + runtime options
│  ├─ logic/selection_v1.yaml  # Head selection + post-processing
│  └─ data/local_examples.yaml # IO + batch options
├─ llava/                      # Minimal LLaVA components (no Transformers fork)
│  ├─ model/ ...               # Builder + vision tower wiring
│  ├─ conversation.py          # Prompt templates
│  ├─ constants.py             # Special tokens, log dir
│  └─ mm_utils.py              # Tokenization + image utils
├─ lab/                        # Lightweight stations (token segmentation metadata)
│  └─ stations.py
└─ examples/                   # Example images + JSONL

Tips & Troubleshooting

  • Eager attention: keep use_flash_attn=false so attention tensors are returned
  • Cache directory: prefer model.cache_dir, which is passed to all from_pretrained(...) calls; env vars (TRANSFORMERS_CACHE, HF_HOME, HF_HUB_CACHE) are also set when provided
  • If Hydra reports that hydra/job_logging: colorlog cannot be found, install the plugin with pip install hydra-colorlog
  • If you see memory errors, reduce image size, lower top_k, or try CPU for small tests

License

This project is licensed under the MIT License - see the LICENSE file for details.
