This repository integrates Google DeepMind's PaliGemma2 Mix models into the FiftyOne computer vision platform. PaliGemma2 Mix is a set of vision-language models fine-tuned on a mixture of tasks, so they work out of the box for a variety of computer vision applications.
PaliGemma2 Mix models can perform:
- Image captioning (multiple detail levels)
- Object detection
- Semantic segmentation (imperfect, but useful for initial exploration)
- Optical character recognition (OCR)
- Visual question answering
- Zero-shot classification

| Model | Size | Resolution | Source |
|---|---|---|---|
| paligemma2-3b-mix-224 | 3B | 224×224 | HuggingFace |
| paligemma2-10b-mix-224 | 10B | 224×224 | HuggingFace |
| paligemma2-28b-mix-224 | 28B | 224×224 | HuggingFace |
| paligemma2-3b-mix-448 | 3B | 448×448 | HuggingFace |
| paligemma2-10b-mix-448 | 10B | 448×448 | HuggingFace |
| paligemma2-28b-mix-448 | 28B | 448×448 | HuggingFace |
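
The model names passed to the FiftyOne zoo elsewhere in this README follow the pattern `google/paligemma2-<size>-mix-<resolution>`. As a minimal sketch, a name can be assembled from the two axes of the table above. `zoo_model_name` is a hypothetical convenience helper for illustration, not part of this repository or of FiftyOne.

```python
# Sizes and resolutions taken from the model table.
SIZES = ("3b", "10b", "28b")
RESOLUTIONS = (224, 448)


def zoo_model_name(size: str, resolution: int) -> str:
    """Build a model name such as "google/paligemma2-10b-mix-448"."""
    size = size.lower()
    if size not in SIZES:
        raise ValueError(f"size must be one of {SIZES}, got {size!r}")
    if resolution not in RESOLUTIONS:
        raise ValueError(
            f"resolution must be one of {RESOLUTIONS}, got {resolution!r}"
        )
    return f"google/paligemma2-{size}-mix-{resolution}"


print(zoo_model_name("10B", 448))
```

The returned string can then be passed as the model name when downloading or loading a model.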

Requirements:

- FiftyOne
- PyTorch
- Transformers (>=4.50)
- Hugging Face Hub
- JAX/Flax (for segmentation masks)
- NumPy
- Pillow (PIL)
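
To see which of these packages are already present in the current environment before installing anything, a quick check along the following lines may help. This is only a sketch: the import names are assumptions (an import name can differ from the package name, for example Pillow is imported as `PIL`), and `missing_modules` is a hypothetical helper that does not verify versions such as the Transformers minimum.

```python
from importlib.util import find_spec

# Assumed top-level import names for the packages listed above.
REQUIRED_MODULES = [
    "fiftyone",
    "torch",
    "transformers",
    "huggingface_hub",
    "jax",
    "flax",
    "numpy",
    "PIL",
]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


missing = missing_modules()
print("Missing:", ", ".join(missing) if missing else "none")
```
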

Installation:

- Install required packages:

  ```bash
  pip install fiftyone torch torchvision "transformers>=4.50" huggingface-hub jax flax
  ```

- Register the model repository:

  ```python
  import fiftyone.zoo as foz

  foz.register_zoo_model_source("https://github.com/harpreetsahota204/paligemma2")
  ```

- Download your chosen model:

  ```python
  foz.download_zoo_model(
      "https://github.com/harpreetsahota204/paligemma2",
      model_name="google/paligemma2-10b-mix-448",
  )
  ```

```python
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

# Load a sample dataset
dataset = load_from_hub(
    "voxel51/hand-keypoints",
    name="hands_subset",
    max_samples=10,
)
```

```python
import fiftyone.zoo as foz

model = foz.load_zoo_model(
    "google/paligemma2-10b-mix-448",
    # install_requirements=True,  # uncomment on first use to install the model's requirements
    # ensure_requirements=True,  # ensure any requirements are installed before loading the model
)
```

```python
# Set operation and detail level
model.operation = "caption"
model.detail_level = "coco-style"  # Options: "short", "coco-style", "detailed"

# Apply to dataset
dataset.apply_model(model, label_field="captions")
```
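
To compare the three detail levels side by side, the captioning step can be repeated once per level, writing each result to its own field. The sketch below assumes only the attributes and the `apply_model` call used in this README. `caption_all_levels` and the generated field names are hypothetical, chosen for illustration.

```python
# Detail levels listed in the captioning example.
DETAIL_LEVELS = ("short", "coco-style", "detailed")


def caption_all_levels(dataset, model, prefix="caption"):
    """Apply the captioning operation once per detail level.

    Returns a mapping from detail level to the label field it was written to.
    """
    fields = {}
    model.operation = "caption"
    for level in DETAIL_LEVELS:
        # Replace hyphens so each field name is a simple identifier.
        field = f"{prefix}_{level.replace('-', '_')}"
        model.detail_level = level
        dataset.apply_model(model, label_field=field)
        fields[level] = field
    return fields
```

Calling `caption_all_levels(dataset, model)` with the dataset and model loaded earlier would write to `caption_short`, `caption_coco_style`, and `caption_detailed`.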

```python
# Set operation and classes to detect
model.operation = "detection"
model.prompt = ["person", "hand", "face"]  # List of classes to detect
# Alternative format: model.prompt = "person; hand; face"

# Apply to dataset
dataset.apply_model(model, label_field="detections")
```
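
The list form and the semicolon-separated form of the prompt carry the same class names. When class names come from elsewhere (a config file, for instance), converting between the two might look like the sketch below. Both helpers are hypothetical and only illustrate the string format shown in this README; they make no claim about how the model wrapper parses prompts internally.

```python
def classes_to_prompt(classes):
    """Join class names into the semicolon-separated form, e.g. "person; hand"."""
    cleaned = [name.strip() for name in classes if name.strip()]
    return "; ".join(cleaned)


def prompt_to_classes(prompt):
    """Split a semicolon-separated prompt back into a list of class names."""
    return [part.strip() for part in prompt.split(";") if part.strip()]


print(classes_to_prompt(["person", "hand", "face"]))
print(prompt_to_classes("person; hand; face"))
```
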

```python
# Set operation and classes to segment
model.operation = "segment"
model.prompt = ["person", "hand"]  # List of classes to segment
# Alternative format: model.prompt = "person; hand"

# Apply to dataset
dataset.apply_model(model, label_field="segmentations")
```

```python
# Set operation for OCR
model.operation = "ocr"

# Apply to dataset
dataset.apply_model(model, label_field="text")
```

```python
# Set operation for classification
model.operation = "classify"
model.prompt = ["indoor", "outdoor", "close-up", "wide-angle"]  # Potential classes

# Apply to dataset
dataset.apply_model(model, label_field="classifications")
```

```python
# Set operation for answering questions
model.operation = "answer"
model.prompt = "How many people are in this image?"

# Apply to dataset
dataset.apply_model(model, label_field="answers")
```
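
Because every operation follows the same pattern (set `operation`, optionally set `prompt`, then apply the model), several of them can be driven from one small table. The following is a sketch under that assumption. `run_jobs` and `JOBS` are hypothetical names, and the loop relies only on the attributes and the `apply_model` call used in the examples in this README.

```python
# (operation, prompt, label_field) triples mirroring examples in this README.
JOBS = [
    ("detection", ["person", "hand", "face"], "detections"),
    ("ocr", None, "text"),
    ("answer", "How many people are in this image?", "answers"),
]


def run_jobs(dataset, model, jobs=JOBS):
    """Configure the model for each job and apply it to the dataset.

    Returns the label fields that were written, in order.
    """
    written = []
    for operation, prompt, label_field in jobs:
        model.operation = operation
        # Operations without a prompt (such as OCR) leave model.prompt untouched.
        if prompt is not None:
            model.prompt = prompt
        dataset.apply_model(model, label_field=label_field)
        written.append(label_field)
    return written
```

Keeping the jobs in a list makes it easy to add, remove, or reorder operations without repeating the apply step.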

```python
# Launch the FiftyOne App to visualize the results
session = fo.launch_app(dataset)
```

For higher-quality results (at the cost of speed), use a larger, higher-resolution model:

```python
# Smaller model, lower resolution: faster
small_model = foz.load_zoo_model("google/paligemma2-3b-mix-224")

# Larger model, higher resolution: better quality
large_model = foz.load_zoo_model("google/paligemma2-28b-mix-448")
```

PaliGemma2 models are subject to the Gemma license. Please review the license terms before using these models.

```bibtex
@article{
  title={PaliGemma 2: A Family of Versatile VLMs for Transfer},
  author={Andreas Steiner and André Susano Pinto and Michael Tschannen and Daniel Keysers and Xiao Wang and Yonatan Bitton and Alexey Gritsenko and Matthias Minderer and Anthony Sherbondy and Shangbang Long and Siyang Qin and Reeve Ingle and Emanuele Bugliarello and Sahar Kazemzadeh and Thomas Mesnard and Ibrahim Alabdulmohsin and Lucas Beyer and Xiaohua Zhai},
  year={2024},
  journal={arXiv preprint arXiv:2412.03555}
}
```