
OpenAI-CLIP-Image-Text-Embeddings-vit-base-patch32

Overview

OpenAI's CLIP (Contrastive Language–Image Pre-training) model was designed to investigate the factors that contribute to the robustness of computer vision tasks. It can seamlessly adapt to a range of image classification tasks without requiring specific training for each, demonstrating efficiency, flexibility, and generality.

In terms of architecture, CLIP utilizes a ViT-B/32 Transformer for image encoding and a masked self-attention Transformer for text encoding. These encoders undergo training to improve the similarity of (image, text) pairs using a contrastive loss.
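
For local experimentation outside the hosted endpoints, the same 512-dimensional embeddings can be computed with the Hugging Face transformers library. The sketch below assumes the openai/clip-vit-base-patch32 checkpoint listed in the tags at the bottom of this page and uses an illustrative image URL; it is not part of the hosted inference flow.

```python
# Minimal sketch: computing CLIP image and text embeddings locally with the
# Hugging Face transformers library (checkpoint: openai/clip-vit-base-patch32).
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Illustrative, publicly accessible image URL.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_features = model.get_image_features(pixel_values=inputs["pixel_values"])    # shape (1, 512)
    text_features = model.get_text_features(input_ids=inputs["input_ids"],
                                            attention_mask=inputs["attention_mask"])  # shape (2, 512)

# Normalize and compare, mirroring the contrastive training objective.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
print(image_features @ text_features.T)  # cosine similarity of the image to each text
```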

For training purposes, CLIP leverages image-text pairs from the internet and engages in a proxy task: when presented with an image, predict the correct text snippet from a set of 32,768 randomly sampled options. This approach allows CLIP to comprehend visual concepts and establish associations with their textual counterparts, enhancing its performance across various visual classification tasks.

The design of CLIP effectively tackles notable challenges, including the dependence on expensive labeled datasets, the need for fine-tuning on new datasets to achieve optimal performance across diverse tasks, and the disparity between benchmark and real-world performance.

The primary intended users of these models are AI researchers working on tasks that require image and/or text embeddings, such as text and image retrieval.

For more details on the CLIP model, review the original-paper or the original-model-card.

Training Details

Training Data

The training of the CLIP model involved utilizing publicly accessible image-caption data obtained by crawling several websites and incorporating commonly-used existing image datasets like YFCC100M. Researchers curated a novel dataset comprising 400 million image-text pairs sourced from diverse publicly available internet outlets. This dataset, referred to as WIT (WebImageText), possesses a word count comparable to the WebText dataset employed in training GPT-2.

As a consequence, the data in WIT is reflective of individuals and societies predominantly linked to the internet, often leaning towards more developed nations and a demographic skewed towards younger, male users.

Training Procedure

The ViT-B/32 Vision Transformer was trained for 32 epochs using the Adam optimizer with decoupled weight decay regularization, and the learning rate was decayed with a cosine schedule. The learnable temperature parameter τ was initialized to 0.07. Training used a very large mini-batch size of 32,768, and mixed-precision techniques were employed to speed up training and conserve memory. The largest Vision Transformer was trained over a period of 12 days on 256 V100 GPUs. For a more in-depth understanding, refer to sections 2 and 3 of the original-paper.
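
As a rough illustration of the contrastive objective and temperature described above (a sketch, not the released training code), the symmetric contrastive loss pairs each image in a batch with its own caption and treats every other caption in the batch as a negative, scaling similarities by the learnable temperature:

```python
# Illustrative sketch of CLIP's symmetric contrastive loss; names and shapes
# are assumptions for this example, not the actual training implementation.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, logit_scale):
    """image_embeds, text_embeds: (batch, dim) unit-normalized embeddings;
    logit_scale: scalar equal to 1 / tau."""
    logits_per_image = logit_scale * image_embeds @ text_embeds.T  # (batch, batch) similarities
    logits_per_text = logits_per_image.T

    # The i-th image in the batch matches the i-th text; everything else is a negative.
    targets = torch.arange(image_embeds.size(0))
    loss_images = F.cross_entropy(logits_per_image, targets)
    loss_texts = F.cross_entropy(logits_per_text, targets)
    return (loss_images + loss_texts) / 2

# Temperature tau = 0.07, kept learnable via a log-parameterized scale.
log_logit_scale = torch.nn.Parameter(torch.log(torch.tensor(1.0 / 0.07)))
image_embeds = F.normalize(torch.randn(8, 512), dim=-1)
text_embeds = F.normalize(torch.randn(8, 512), dim=-1)
loss = clip_contrastive_loss(image_embeds, text_embeds, log_logit_scale.exp())
```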

Evaluation Results

The performance of CLIP has been evaluated on a wide range of benchmarks across a variety of computer vision datasets, ranging from OCR to texture recognition to fine-grained classification. Sections 3 and 4 of the paper describe model performance on multiple datasets.

Limitations and Biases

CLIP has difficulties with tasks such as fine-grained classification and object counting. Its performance also raises concerns regarding fairness and bias. Additionally, there is a notable limitation in the evaluation approach: there is evidence that the use of linear probes can underestimate CLIP's true performance.

CLIP's performance and inherent biases can vary depending on class design and category choices. Evaluations on the FairFace dataset revealed significant racial and gender disparities that were influenced by how the classes were constructed. On gender, race, and age classification with FairFace, gender accuracy exceeded 96% with variations across races, racial classification reached approximately 93%, and age classification reached around 63%. These assessments were intended to gauge model performance across demographics and pinpoint potential risks, not to endorse or promote such tasks. For more details, refer to sections 6 and 7 of the original-paper.

License

MIT License

Inference Samples

| Inference type | Python sample (Notebook) | CLI with YAML |
| --- | --- | --- |
| Real time | image-text-embeddings-online-endpoint.ipynb | image-text-embeddings-online-endpoint.sh |
| Batch | image-text-embeddings-batch-endpoint.ipynb | image-text-embeddings-batch-endpoint.sh |

Sample input and output for image embeddings

Sample input

{
   "input_data":{
      "columns":[
         "image", "text"
      ],
      "index":[0, 1],
      "data":[
         ["image1", ""],
         ["image2", ""]
      ]
   }
}

Note: "image1" and "image2" should be publicly accessible URLs or base64-encoded strings

Sample output

[
    {
        "image_features": [-0.92, -0.13, 0.02, ... , 0.13]
    },
    {
        "image_features": [0.54, -0.83, 0.13, ... , 0.26]
    }
]

Note: returned embeddings have dimension 512 and are not normalized
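
A minimal sketch of calling a deployed online endpoint with this payload format is shown below. The endpoint URI, key, and local image path are placeholders for values from your own deployment; the sample notebook linked above covers the full deployment and scoring flow.

```python
# Hypothetical sketch of scoring an online endpoint with the payload format above.
# ENDPOINT_URI, API_KEY, and the image path are placeholders, not real values.
import base64
import json

import requests

ENDPOINT_URI = "https://<your-endpoint>.<region>.inference.ml.azure.com/score"  # placeholder
API_KEY = "<your-endpoint-key>"  # placeholder

# Images may be passed as publicly accessible URLs or as base64-encoded strings.
with open("sample_image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "input_data": {
        "columns": ["image", "text"],
        "index": [0],
        "data": [[image_b64, ""]],  # empty text -> image embeddings only
    }
}

response = requests.post(
    ENDPOINT_URI,
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    data=json.dumps(payload),
)
embeddings = response.json()  # list of {"image_features": [...512 floats...]}
```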

Sample input and output for text embeddings

Sample input

{
   "input_data":{
      "columns":[
         "image", "text"
      ],
      "index":[0, 1],
      "data":[
         ["", "sample text 1"],
         ["", "sample text 2"]
      ]
   }
}

Sample output

[
    {
        "text_features": [0.42, -0.13, -0.92, ... , 0.63]
    },
    {
        "text_features": [-0.14, 0.93, -0.15, ... , 0.66]
    }
]

Note: returned embeddings have dimension 512 and are not normalized

Sample input and output for image and text embeddings

Sample input

{
   "input_data":{
      "columns":[
         "image", "text"
      ],
      "index":[0, 1],
      "data":[
         ["image1", "sample text 1"],
         ["image2", "sample text 2"]
      ]
   }
}

Note: "image1" and "image2" should be publicly accessible URLs or base64-encoded strings

Sample output

[
    {
        "image_features": [0.92, -0.13, 0.02, ... , -0.13],
        "text_features": [0.42, 0.13, -0.92, ... , -0.63]
    },
    {
        "image_features": [-0.54, -0.83, 0.13, ... , -0.26],
        "text_features": [-0.14, -0.93, 0.15, ... , 0.66]
    }
]

Note: returned embeddings have dimension 512 and are not normalized
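
Because the returned embeddings are not normalized, normalize them before comparing image and text features. A small sketch is shown below, assuming `result` holds the parsed response in the format above (the example values are truncated placeholders).

```python
# Sketch: cosine similarity between the unnormalized image and text embeddings
# returned by the endpoint. `result` stands in for the parsed response above.
import numpy as np

def cosine_similarity(a, b):
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

result = [
    {"image_features": [0.92, -0.13, 0.02], "text_features": [0.42, 0.13, -0.92]},  # truncated example values
]
score = cosine_similarity(result[0]["image_features"], result[0]["text_features"])
print(score)
```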

Version: 10

Tags

Preview
huggingface_model_id : openai/clip-vit-base-patch32
SharedComputeCapacityEnabled
author : OpenAI
license : mit
task : embeddings
hiddenlayerscanned
inference_compute_allow_list : ['Standard_DS2_v2', 'Standard_D2a_v4', 'Standard_D2as_v4', 'Standard_DS3_v2', 'Standard_D4a_v4', 'Standard_D4as_v4', 'Standard_DS4_v2', 'Standard_D8a_v4', 'Standard_D8as_v4', 'Standard_DS5_v2', 'Standard_D16a_v4', 'Standard_D16as_v4', 'Standard_D32a_v4', 'Standard_D32as_v4', 'Standard_D48a_v4', 'Standard_D48as_v4', 'Standard_D64a_v4', 'Standard_D64as_v4', 'Standard_D96a_v4', 'Standard_D96as_v4', 'Standard_F4s_v2', 'Standard_FX4mds', 'Standard_F8s_v2', 'Standard_FX12mds', 'Standard_F16s_v2', 'Standard_F32s_v2', 'Standard_F48s_v2', 'Standard_F64s_v2', 'Standard_F72s_v2', 'Standard_FX24mds', 'Standard_FX36mds', 'Standard_FX48mds', 'Standard_E2s_v3', 'Standard_E4s_v3', 'Standard_E8s_v3', 'Standard_E16s_v3', 'Standard_E32s_v3', 'Standard_E48s_v3', 'Standard_E64s_v3', 'Standard_NC4as_T4_v3', 'Standard_NC6s_v3', 'Standard_NC8as_T4_v3', 'Standard_NC12s_v3', 'Standard_NC16as_T4_v3', 'Standard_NC24s_v3', 'Standard_NC64as_T4_v3', 'Standard_NC24ads_A100_v4', 'Standard_NC48ads_A100_v4', 'Standard_NC96ads_A100_v4', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4', 'Standard_ND40rs_v2']

View in Studio: https://ml.azure.com/registries/azureml/models/OpenAI-CLIP-Image-Text-Embeddings-vit-base-patch32/version/10

License: mit

Properties

SharedComputeCapacityEnabled: True

inference-min-sku-spec: 2|0|7|14

inference-recommended-sku: Standard_DS2_v2, Standard_D2a_v4, Standard_D2as_v4, Standard_DS3_v2, Standard_D4a_v4, Standard_D4as_v4, Standard_DS4_v2, Standard_D8a_v4, Standard_D8as_v4, Standard_DS5_v2, Standard_D16a_v4, Standard_D16as_v4, Standard_D32a_v4, Standard_D32as_v4, Standard_D48a_v4, Standard_D48as_v4, Standard_D64a_v4, Standard_D64as_v4, Standard_D96a_v4, Standard_D96as_v4, Standard_F4s_v2, Standard_FX4mds, Standard_F8s_v2, Standard_FX12mds, Standard_F16s_v2, Standard_F32s_v2, Standard_F48s_v2, Standard_F64s_v2, Standard_F72s_v2, Standard_FX24mds, Standard_FX36mds, Standard_FX48mds, Standard_E2s_v3, Standard_E4s_v3, Standard_E8s_v3, Standard_E16s_v3, Standard_E32s_v3, Standard_E48s_v3, Standard_E64s_v3, Standard_NC4as_T4_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2
