PaliGemma2 Mix for FiftyOne

This repository integrates Google DeepMind's PaliGemma2 Mix models into the FiftyOne computer vision platform. PaliGemma2 Mix is a family of vision-language models fine-tuned on a mixture of tasks so that they work out of the box for a variety of computer vision applications.

Features

PaliGemma2 Mix models can perform:

Image captioning (multiple detail levels)
Object detection
Semantic segmentation (imperfect, but useful for initial exploration)
Optical character recognition (OCR)
Visual question answering
Zero-shot classification

Available Models

Model	Size	Resolution	Source
`paligemma2-3b-mix-224`	3B	224×224	HuggingFace
`paligemma2-10b-mix-224`	10B	224×224	HuggingFace
`paligemma2-28b-mix-224`	28B	224×224	HuggingFace
`paligemma2-3b-mix-448`	3B	448×448	HuggingFace
`paligemma2-10b-mix-448`	10B	448×448	HuggingFace
`paligemma2-28b-mix-448`	28B	448×448	HuggingFace
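
The names in the table follow a regular pattern, so a name can be composed from a size and a resolution. The following is a minimal sketch: the helper `zoo_model_name` is hypothetical and not part of this repository, and the `google/` prefix is taken from the model names used in the loading examples of this README.

```python
# Sizes and resolutions taken from the Available Models table.
SIZES = ("3b", "10b", "28b")
RESOLUTIONS = (224, 448)


def zoo_model_name(size: str, resolution: int) -> str:
    """Return the zoo model name for a given size and input resolution.

    Hypothetical helper for illustration; not part of this repository.
    """
    size = size.lower()
    if size not in SIZES:
        raise ValueError(f"size must be one of {SIZES}, got {size!r}")
    if resolution not in RESOLUTIONS:
        raise ValueError(
            f"resolution must be one of {RESOLUTIONS}, got {resolution!r}"
        )
    # The "google/" prefix matches the names used when loading models.
    return f"google/paligemma2-{size}-mix-{resolution}"


print(zoo_model_name("10B", 448))
```

Validating against the two tuples catches typos such as `"7b"` or `512` before any download is attempted.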

Requirements

FiftyOne
PyTorch
Transformers (>=4.50)
Hugging Face Hub
JAX/FLAX (for segmentation masks)
NumPy
Pillow (PIL)
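
To confirm that an installed package satisfies a minimum version such as the Transformers requirement above, a check along these lines may help. It is a sketch using only the standard library; `version_tuple`, `meets_minimum`, and `check_package` are hypothetical helper names, and the comparison looks only at the leading numeric part of each version component, so pre-release suffixes are ignored.

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Convert a version such as '4.50.1.dev0' into a tuple of leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_package(name: str, minimum: str) -> str:
    """Describe whether a distribution is installed and recent enough."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return f"{name}: not installed"
    if meets_minimum(installed, minimum):
        return f"{name} {installed}: ok"
    return f"{name} {installed}: too old, need >= {minimum}"


print(check_package("transformers", "4.50"))
```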

Installation

Install required packages:

pip install fiftyone torch torchvision transformers huggingface-hub jax flax

Register the model repository:

import fiftyone.zoo as foz
foz.register_zoo_model_source("https://github.com/harpreetsahota204/paligemma2")

Download your chosen model:

foz.download_zoo_model(
    "https://github.com/harpreetsahota204/paligemma2",
    model_name="google/paligemma2-10b-mix-448", 
)

Usage Examples

Load a dataset

import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

# Load a sample dataset
dataset = load_from_hub(
    "voxel51/hand-keypoints",
    name="hands_subset",
    max_samples=10
)

Load the model

import fiftyone.zoo as foz

model = foz.load_zoo_model(
    "google/paligemma2-10b-mix-448",
    # install_requirements=True,  # install any missing requirements on first use
    # ensure_requirements=True,   # verify that requirements are installed before loading the model
)

Image Captioning

# Set operation and detail level
model.operation = "caption"
model.detail_level = "coco-style"  # Options: "short", "coco-style", "detailed"

# Apply to dataset
dataset.apply_model(model, label_field="captions")
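
For context, the upstream PaliGemma documentation describes three captioning task prefixes: `cap en` (short, raw captions), `caption en` (COCO-like captions), and `describe en` (longer descriptions). The sketch below shows how the three detail levels could correspond to those prefixes. The mapping is an assumption made for illustration, and `caption_prompt` is a hypothetical helper; this README does not state how the integration builds its prompts.

```python
# Task prefixes as described in the upstream PaliGemma documentation.
# ASSUMPTION: the mapping from this README's detail levels to these prefixes
# is a guess for illustration, not confirmed by the repository.
CAPTION_PREFIXES = {
    "short": "cap en",
    "coco-style": "caption en",
    "detailed": "describe en",
}


def caption_prompt(detail_level: str) -> str:
    """Return the assumed text prompt for a captioning detail level."""
    try:
        return CAPTION_PREFIXES[detail_level]
    except KeyError:
        options = ", ".join(sorted(CAPTION_PREFIXES))
        raise ValueError(
            f"unknown detail level {detail_level!r}; expected one of: {options}"
        ) from None


for level in ("short", "coco-style", "detailed"):
    print(level, "->", caption_prompt(level))
```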

Object Detection

# Set operation and classes to detect
model.operation = "detection"
model.prompt = ["person", "hand", "face"]  # List of classes to detect
# Alternative format: model.prompt = "person; hand; face"

# Apply to dataset
dataset.apply_model(model, label_field="detections")
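
As background on what happens to the model's raw output, PaliGemma models emit each detection as four location tokens followed by a label, for example `<loc0256><loc0128><loc0768><loc0640> hand`. According to the upstream PaliGemma documentation the tokens are ordered `y_min, x_min, y_max, x_max` and are scaled by 1024. FiftyOne stores boxes as `[x, y, width, height]` relative to the image size. The sketch below converts between the two; `parse_boxes` is a hypothetical helper and the sample string is made up, so this is not necessarily how the integration parses its output.

```python
import re

# Four <locNNNN> tokens followed by a label that runs up to the next ';' or '<'.
_BOX_PATTERN = re.compile(r"((?:<loc\d{4}>){4})\s*([^;<]+)")


def parse_boxes(text: str) -> list:
    """Return (label, [x, y, width, height]) pairs with coordinates in [0, 1].

    Assumes the token order y_min, x_min, y_max, x_max on a 0..1023 grid.
    """
    results = []
    for locs, label in _BOX_PATTERN.findall(text):
        y_min, x_min, y_max, x_max = (
            int(value) / 1024 for value in re.findall(r"\d{4}", locs)
        )
        box = [x_min, y_min, x_max - x_min, y_max - y_min]
        results.append((label.strip(), box))
    return results


# Made-up raw output containing two detections.
raw = "<loc0256><loc0128><loc0768><loc0640> hand ; <loc0000><loc0512><loc1023><loc1023> person"
for label, box in parse_boxes(raw):
    print(label, box)
```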

Semantic Segmentation

# Set operation and classes to segment
model.operation = "segment"
model.prompt = ["person", "hand"]  # List of classes to segment
# Alternative format: model.prompt = "person; hand"

# Apply to dataset
dataset.apply_model(model, label_field="segmentations")
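
The detection and segmentation prompts accept either a list of class names or a single semicolon-separated string. If class names arrive from mixed sources, normalizing them first keeps the two forms interchangeable. This is a small sketch with hypothetical helper names (`normalize_classes`, `to_prompt_string`); it does not reproduce the integration's own parsing.

```python
def normalize_classes(prompt) -> list:
    """Accept a list of class names or a ';'-separated string; return a clean list."""
    if isinstance(prompt, str):
        items = prompt.split(";")
    else:
        items = list(prompt)
    # Trim whitespace and drop empty entries such as a trailing ';'.
    return [item.strip() for item in items if item.strip()]


def to_prompt_string(prompt) -> str:
    """Return the semicolon-separated form of either input style."""
    return "; ".join(normalize_classes(prompt))


print(normalize_classes("person; hand; "))
print(to_prompt_string(["person", " hand "]))
```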

OCR (Optical Character Recognition)

# Set operation for OCR
model.operation = "ocr"

# Apply to dataset
dataset.apply_model(model, label_field="text")

Zero-Shot Classification

# Set operation for classification
model.operation = "classify"
model.prompt = ["indoor", "outdoor", "close-up", "wide-angle"]  # Potential classes

# Apply to dataset
dataset.apply_model(model, label_field="classifications")

Visual Question Answering

# Set operation for answering questions
model.operation = "answer"
model.prompt = "How many people are in this image?"

# Apply to dataset
dataset.apply_model(model, label_field="answers")

Visualize Results

# Launch the FiftyOne App to visualize the results
session = fo.launch_app(dataset)

Using Different Resolution Models

For higher-quality results (at the cost of speed and memory), use a larger model, a higher input resolution, or both:

# Smallest model at the lower resolution: fastest
small_model = foz.load_zoo_model("google/paligemma2-3b-mix-224")

# Largest model at the higher resolution: best quality, slowest
large_model = foz.load_zoo_model("google/paligemma2-28b-mix-448")
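
A back-of-the-envelope calculation gives a feel for the trade-off. It assumes the image encoder splits the input into 14×14-pixel patches (as described in the PaliGemma papers) and that weights are held at 2 bytes per parameter; the nominal sizes 3B, 10B, and 28B are used in place of exact parameter counts, and activations and caches are not counted. Treat the numbers as rough estimates rather than measured requirements.

```python
PATCH_SIZE = 14  # ASSUMPTION: patch size of the image encoder, per the PaliGemma papers


def image_tokens(resolution: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of image tokens for a square input of the given resolution."""
    patches_per_side = resolution // patch_size
    return patches_per_side ** 2


def weight_gib_16bit(billions_of_params: float) -> float:
    """Rough size of the weights alone at 2 bytes per parameter, in GiB."""
    return billions_of_params * 1e9 * 2 / 2**30


for resolution in (224, 448):
    print(f"{resolution}x{resolution}: {image_tokens(resolution)} image tokens")

for size in (3, 10, 28):
    print(f"{size}B: about {weight_gib_16bit(size):.1f} GiB of weights")
```

Under these assumptions, doubling the resolution quadruples the number of image tokens the language model has to process, which is one reason the 448×448 variants run slower than their 224×224 counterparts.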

License

PaliGemma2 models are subject to the Gemma license. Please review the license terms before using these models.

Citation

@article{steiner2024paligemma2,
    title={PaliGemma 2: A Family of Versatile VLMs for Transfer},
    author={Andreas Steiner and André Susano Pinto and Michael Tschannen and Daniel Keysers and Xiao Wang and Yonatan Bitton and Alexey Gritsenko and Matthias Minderer and Anthony Sherbondy and Shangbang Long and Siyang Qin and Reeve Ingle and Emanuele Bugliarello and Sahar Kazemzadeh and Thomas Mesnard and Ibrahim Alabdulmohsin and Lucas Beyer and Xiaohua Zhai},
    year={2024},
    journal={arXiv preprint arXiv:2412.03555}
}
