PaliGemma2 Mix for FiftyOne

This repository integrates Google DeepMind's PaliGemma2 Mix models into the FiftyOne computer vision platform. PaliGemma2 Mix is a set of vision-language models fine-tuned on a mixture of tasks and designed to work out of the box for a variety of computer vision applications.

Features

PaliGemma2 Mix models can perform:

  • Image captioning (multiple detail levels)
  • Object detection
  • Semantic segmentation (imperfect, but useful for initial exploration)
  • Optical character recognition (OCR)
  • Visual question answering
  • Zero-shot classification

Available Models

Model                    Size   Resolution   Source
paligemma2-3b-mix-224    3B     224×224      Hugging Face
paligemma2-10b-mix-224   10B    224×224      Hugging Face
paligemma2-28b-mix-224   28B    224×224      Hugging Face
paligemma2-3b-mix-448    3B     448×448      Hugging Face
paligemma2-10b-mix-448   10B    448×448      Hugging Face
paligemma2-28b-mix-448   28B    448×448      Hugging Face
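
The names in the table follow a regular pattern, so a name can be assembled from a size and a resolution. The sketch below is a small convenience and not part of this repository; the helper `model_name` is hypothetical, and the `google/` prefix is taken from the usage examples further down.

```python
# Sizes and resolutions listed in the table above.
SIZES = ("3b", "10b", "28b")
RESOLUTIONS = (224, 448)


def model_name(size: str, resolution: int) -> str:
    """Build a model name such as 'google/paligemma2-3b-mix-224'.

    Hypothetical helper: it only formats a string and validates the
    inputs against the table; it does not check that the model exists.
    """
    if size not in SIZES:
        raise ValueError(f"size must be one of {SIZES}, got {size!r}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {RESOLUTIONS}, got {resolution!r}")
    return f"google/paligemma2-{size}-mix-{resolution}"


# All six combinations from the table, grouped by resolution.
ALL_MODELS = [model_name(s, r) for r in RESOLUTIONS for s in SIZES]
print(ALL_MODELS)
```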

Requirements

  • FiftyOne
  • PyTorch
  • Transformers (>=4.50)
  • Hugging Face Hub
  • JAX/Flax (for segmentation masks)
  • NumPy
  • Pillow (PIL)

Installation

  1. Install required packages:
pip install fiftyone torch torchvision transformers huggingface-hub jax flax
  2. Register the model repository:
import fiftyone.zoo as foz
foz.register_zoo_model_source("https://github.com/harpreetsahota204/paligemma2")
  3. Download your chosen model:
foz.download_zoo_model(
    "https://github.com/harpreetsahota204/paligemma2",
    model_name="google/paligemma2-10b-mix-448",
)
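
Before loading a model, it can help to confirm that the packages from the Requirements list are importable. The following is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper, and the import names are assumed to be the usual ones for each package (for example `huggingface_hub` for `huggingface-hub` and `PIL` for Pillow).

```python
import importlib.util

# Import names assumed for the packages in the Requirements list.
REQUIRED_MODULES = [
    "fiftyone",
    "torch",
    "transformers",
    "huggingface_hub",
    "jax",
    "flax",
    "numpy",
    "PIL",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", ", ".join(missing) if missing else "none")
```

This only checks that each module can be found; it does not verify versions such as the Transformers minimum listed above.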

Usage Examples

Load a dataset

import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

# Load a sample dataset
dataset = load_from_hub(
    "voxel51/hand-keypoints",
    name="hands_subset",
    max_samples=10
)

Load the model

import fiftyone.zoo as foz

model = foz.load_zoo_model(
    "google/paligemma2-10b-mix-448",
    # install_requirements=True,  # install any missing requirements (useful on first use)
    # ensure_requirements=True,   # verify that requirements are installed before loading the model
)

Image Captioning

# Set operation and detail level
model.operation = "caption"
model.detail_level = "coco-style"  # Options: "short", "coco-style", "detailed"

# Apply to dataset
dataset.apply_model(model, label_field="captions")
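
PaliGemma models are steered with short task prefixes, and the three detail levels plausibly correspond to the documented captioning prefixes `cap`, `caption`, and `describe`. The sketch below illustrates that mapping; it is an assumption about how the levels translate, not a copy of this repository's code, and `caption_prompt` is a hypothetical helper.

```python
# Assumed mapping from detail level to PaliGemma captioning prefix.
CAPTION_PREFIXES = {
    "short": "cap",          # raw short caption
    "coco-style": "caption", # COCO-like short caption
    "detailed": "describe",  # longer, more descriptive caption
}


def caption_prompt(detail_level: str = "coco-style", lang: str = "en") -> str:
    """Return a prompt such as 'caption en' for the given detail level."""
    try:
        prefix = CAPTION_PREFIXES[detail_level]
    except KeyError:
        options = ", ".join(sorted(CAPTION_PREFIXES))
        raise ValueError(
            f"unknown detail level {detail_level!r}; expected one of: {options}"
        ) from None
    return f"{prefix} {lang}"


print(caption_prompt("detailed"))
```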

Object Detection

# Set operation and classes to detect
model.operation = "detection"
model.prompt = ["person", "hand", "face"]  # List of classes to detect
# Alternative format: model.prompt = "person; hand; face"

# Apply to dataset
dataset.apply_model(model, label_field="detections")
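
For detection, PaliGemma emits each box as four location tokens (`<locNNNN>`, in the order y_min, x_min, y_max, x_max on a 0 to 1023 grid) followed by the label. FiftyOne stores boxes as relative `[x, y, width, height]`. The sketch below shows one way that conversion could look; it is illustrative rather than this repository's implementation, `parse_detections` is a hypothetical helper, and dividing by 1024 is an assumed normalization convention.

```python
import re

# Four consecutive location tokens followed by a label.
LOC_PATTERN = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^<;]+)"
)


def parse_detections(text):
    """Convert raw detection text into label/bounding-box dicts.

    Boxes are returned as [x, y, width, height] in relative coordinates,
    the layout FiftyOne uses for detection bounding boxes.
    """
    results = []
    for match in LOC_PATTERN.finditer(text):
        # Token order is y_min, x_min, y_max, x_max.
        y1, x1, y2, x2 = (int(match.group(i)) / 1024 for i in range(1, 5))
        results.append(
            {
                "label": match.group(5).strip(),
                "bounding_box": [x1, y1, x2 - x1, y2 - y1],
            }
        )
    return results


raw = "<loc0256><loc0128><loc0768><loc0640> hand ; <loc0000><loc0512><loc0512><loc1023> person"
for det in parse_detections(raw):
    print(det)
```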

Semantic Segmentation

# Set operation and classes to segment
model.operation = "segment"
model.prompt = ["person", "hand"]  # List of classes to segment
# Alternative format: model.prompt = "person; hand"

# Apply to dataset
dataset.apply_model(model, label_field="segmentations")
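
The detection and segmentation examples accept classes either as a list or as a semicolon-separated string. A minimal sketch of how the two forms could be normalized and joined into a PaliGemma-style task prompt follows; `normalize_classes` and `to_task_prompt` are hypothetical helpers, and the exact prompt this repository builds internally is an assumption.

```python
def normalize_classes(prompt):
    """Accept a list of class names or a 'a; b; c' string and return a clean list."""
    parts = prompt.split(";") if isinstance(prompt, str) else list(prompt)
    # Trim whitespace and drop empty entries (e.g. from a trailing ';').
    return [part.strip() for part in parts if part.strip()]


def to_task_prompt(task, prompt):
    """Build a prompt such as 'detect person ; hand' from a task prefix and classes."""
    return f"{task} " + " ; ".join(normalize_classes(prompt))


print(to_task_prompt("segment", "person; hand"))
print(to_task_prompt("detect", ["person", "hand", "face"]))
```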

OCR (Optical Character Recognition)

# Set operation for OCR
model.operation = "ocr"

# Apply to dataset
dataset.apply_model(model, label_field="text")

Zero-Shot Classification

# Set operation for classification
model.operation = "classify"
model.prompt = ["indoor", "outdoor", "close-up", "wide-angle"]  # Potential classes

# Apply to dataset
dataset.apply_model(model, label_field="classifications")

Visual Question Answering

# Set operation for answering questions
model.operation = "answer"
model.prompt = "How many people are in this image?"

# Apply to dataset
dataset.apply_model(model, label_field="answers")
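
To ask several questions, the same pattern can be repeated with a different `label_field` per question so earlier answers are not overwritten. The sketch below wraps that loop; `ask_questions` and `field_for` are hypothetical helpers that rely only on the `operation`, `prompt`, and `apply_model` usage shown above. Field names are derived from the first three words of each question, so questions sharing those words would need distinct names chosen by hand.

```python
import re


def field_for(question: str) -> str:
    """Derive a field name such as 'answer_how_many_people' from a question."""
    words = re.findall(r"[a-z0-9]+", question.lower())
    return "answer_" + "_".join(words[:3])


def ask_questions(dataset, model, questions):
    """Apply the model once per question, storing each answer in its own field."""
    model.operation = "answer"
    fields = []
    for question in questions:
        model.prompt = question
        field = field_for(question)
        dataset.apply_model(model, label_field=field)
        fields.append(field)
    return fields
```

With the dataset and model loaded earlier, usage would look like `ask_questions(dataset, model, ["How many people are in this image?", "Is it daytime?"])`.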

Visualize Results

# Launch the FiftyOne App to visualize the results
session = fo.launch_app(dataset)

Using Different Resolution Models

For higher-quality results (at the cost of speed and memory), use a larger, higher-resolution model:

# Smaller model, lower resolution: faster
small_model = foz.load_zoo_model("google/paligemma2-3b-mix-224")

# Larger model, higher resolution: better quality
large_model = foz.load_zoo_model("google/paligemma2-28b-mix-448")

License

PaliGemma2 models are subject to the Gemma license. Please review the license terms before using these models.

Citation

@article{steiner2024paligemma2,
    title={PaliGemma 2: A Family of Versatile VLMs for Transfer},
    author={Andreas Steiner and André Susano Pinto and Michael Tschannen and Daniel Keysers and Xiao Wang and Yonatan Bitton and Alexey Gritsenko and Matthias Minderer and Anthony Sherbondy and Shangbang Long and Siyang Qin and Reeve Ingle and Emanuele Bugliarello and Sahar Kazemzadeh and Thomas Mesnard and Ibrahim Alabdulmohsin and Lucas Beyer and Xiaohua Zhai},
    year={2024},
    journal={arXiv preprint arXiv:2412.03555}
}
