- [2025/02/04] We uploaded our Synthdog data to HuggingFace: Dataset
- [2025/01/10] We released Centurio with model checkpoints and code for training & testing. Data will follow soon.
The model can be used directly through the transformers library with our custom code.
Check out the model cards of our checkpoints in the Centurio Collection on HuggingFace for more details.
Example Code
from transformers import AutoModelForCausalLM, AutoProcessor
import timm  # needed by the custom model code for the vision encoder
from PIL import Image    
import requests
url = "https://upload.wikimedia.org/wikipedia/commons/b/bd/Golden_Retriever_Dukedestiny01_drvd.jpg"
image = Image.open(requests.get(url, stream=True).raw)
model_name = "WueNLP/centurio_qwen"
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
## Images in the prompt are indicated with '<image_placeholder>'!
prompt = "<image_placeholder>\nBriefly describe the image in German."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},  # This is the system prompt used during our training.
    {"role": "user", "content": prompt}
]
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True
)
model_inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
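The snippet above loads the model in full precision on the CPU. For GPU inference you will usually want half precision; the following is a minimal sketch using the standard torch_dtype and device_map arguments of from_pretrained (see the model card for the settings we actually recommend, e.g. regarding flash attention):
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_name = "WueNLP/centurio_qwen"

# device_map="auto" requires the accelerate package; bfloat16 assumes a GPU with bf16 support,
# otherwise fall back to torch.float16.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)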
We use trident, a modular framework by Fabian Schmidt that combines PyTorch Lightning with Hydra configs.
pip install -r requirements.txt
pip install git+https://github.com/fdschmidt93/trident.git
Primer on trident: trident in 20 minutes
tl;dr: We compose a hierarchy of configs (/configs) into an experiment config (experiment) that specifies
(1) which datasets to use for training and testing, and for the latter which metrics to compute (dataspec and dataspecs),
(2) which model to use (module), and
(3) other settings such as the optimizer, logging, and checkpointing
for a PyTorch Lightning run using our code in /src.
Below is an example showing you how to use the experiment configs (mblipv2_train.yaml and mblipv2_pretrain.yaml).
Trident allows us to overwrite (nearly) all parameters specified in the configs, which we use to specify various parameters like the LLM, learning rate, etc.
For an example on how to structure the data json files, see the examples in /data.
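Before the full CLI example below: you can also resolve the composed experiment config programmatically with Hydra's compose API to inspect what a run will use before launching it. The sketch below is illustrative only; the config directory and the name of the primary config are assumptions, not something this README defines.
from hydra import compose, initialize
from omegaconf import OmegaConf

# Assumption: the root config directory is ./configs and the primary config is named "config";
# adjust both to the actual layout of /configs.
with initialize(version_base=None, config_path="configs"):
    cfg = compose(
        config_name="config",
        overrides=[
            "experiment=mblipv2_train",
            "++run.llm=microsoft/Phi-3.5-mini-instruct",  # same override syntax as on the CLI
            "module.optimizer.lr=0.0001",
        ],
    )
print(OmegaConf.to_yaml(cfg))  # prints the fully composed experiment config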
CLI Command
python -u -m trident.run experiment=mblipv2_train \
  run.train_data=/p/data/jsons \ # prefix path for all JSON files
  run.image_root=/p/data/images \
  run.train_file=multilingual/combination/mblipv2_instruct_base_en.json \
  hydra.run.dir=$output \  # the output folder
  ++run.llm=microsoft/Phi-3.5-mini-instruct \
  ++run.vit_model=vit_so400m_patch14_siglip_384 \
  ++run.train_padding_side="left" \
  module.model.adapter_type="mlp" \
  module.model.load_4bit=False \
  module.model.use_flash_attn=True \
  module.model.use_lora=True \
  module.optimizer.lr=0.0001 \
  module.optimizer.weight_decay=0.0 \
  module.model.lora_r=256 module.model.lora_alpha=512 \
  ++run.max_seq_len=1024 \
  run.test_batch_size=2 run.test_num_workers=2 \
  run.train_batch_size=2 run.train_num_workers=6 \
  trainer.devices=$NUM_GPUS \  # single- and multi-GPU both work out of the box
  trainer.accumulate_grad_batches=$ACCUM \
  ++run.seed=4242 \
  trainer.val_check_interval=0.25 \
  ++trainer.strategy="ddp_find_unused_parameters_true" \  # needed for Phi 3.5; other LLMs can drop this and use the default DeepSpeed Stage 2 config.
  '++logger.wandb.tags=[training,english_only]'
To use the image tiling approach used for Centurio, replace
module.model.adapter_type="mlp" \
with
  ++run.multi_scale=2 \
  module.model.adapter_type="multiscale-pool" \
  ++module.model.adapter_config.multi_scale=2 \
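Here, multi_scale=2 corresponds to the image tiling used for Centurio: besides the globally resized image, a 2x2 grid of tiles is encoded as well. The snippet below is only a conceptual illustration of such a tiling (assumed tile layout and resolution), not the implementation in /src:
from PIL import Image

def tile_views(image: Image.Image, grid: int = 2, size: int = 384) -> list[Image.Image]:
    """Conceptual sketch: one global view plus a grid x grid set of tiles,
    each resized to the ViT input resolution (384 for SigLIP SO400M/14)."""
    views = [image.resize((size, size))]  # global view
    w, h = image.size
    tw, th = w // grid, h // grid
    for row in range(grid):
        for col in range(grid):
            box = (col * tw, row * th, (col + 1) * tw, (row + 1) * th)
            views.append(image.crop(box).resize((size, size)))
    return views  # 1 + grid * grid views in total

views = tile_views(Image.open("example.jpg"), grid=2)  # any local image
print(len(views))  # 5 views: the resized image plus a 2x2 grid of tiles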
Below is an example for evaluating a model trained with the above training script on a downstream task (MAXM in this case) by loading the checkpoint from DeepSpeed (which contains the MLP weights) and the PEFT adapter checkpoint:
CLI Command
python -u -m trident.run experiment=mblipv2_test_maxm \
  run.train_data=/p/data/jsons \ # prefix path for all JSON files
  run.xm3600_image_root=/p/data/images/maxm \
  hydra.run.dir=$output \
  ++module.model.train_checkpoint=/checkpoints/12_08_2024_09_58_16/checkpoints/0-24250.ckpt/checkpoint/mp_rank_00_model_states.pt \
  ++module.model.lora_checkpoint=/checkpoints/12_08_2024_09_58_16/checkpoints/0-24250 \
  ++run.llm=meta-llama/Meta-Llama-3-8B-Instruct \
  ++run.vit_model=vit_so400m_patch14_siglip_384 \
  ++run.train_padding_side="left" \
  module.model.adapter_type="mlp" \
  module.model.load_4bit=False \
  module.model.use_flash_attn=True \
  module.model.use_lora=True \
  run.test_batch_size=2 run.test_num_workers=16 \
  trainer.devices=1 \ # multi-GPU is not supported
  '++logger.wandb.tags=[eval,maxm]'
@article{centurio2025,
  author       = {Gregor Geigle and
                  Florian Schneider and
                  Carolin Holtermann and
                  Chris Biemann and
                  Radu Timofte and
                  Anne Lauscher and
                  Goran Glava\v{s}},
  title        = {Centurio: On Drivers of Multilingual Ability of Large Vision-Language Models},
  journal      = {arXiv},
  volume       = {abs/2501.05122},
  year         = {2025},
  url          = {https://arxiv.org/abs/2501.05122},
  eprinttype   = {arXiv},
  eprint       = {2501.05122},
}