
google-vit-base-patch16-224

Overview

The Vision Transformer (ViT) model, introduced in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Dosovitskiy et al., was pre-trained on ImageNet-21k at a resolution of 224x224 and subsequently fine-tuned on ImageNet 2012, which consists of 1 million images and 1,000 classes, also at a resolution of 224x224. The model was first released in this repository; the weights hosted here were converted from the timm repository by Ross Wightman, who had previously converted them from JAX to PyTorch.

An image is treated as a sequence of fixed-size patches and processed by a standard Transformer encoder, as used in NLP. The patches are linearly embedded, a [CLS] token is added at the beginning of the sequence for classification tasks, and absolute position embeddings are added before the sequence is fed to the Transformer encoder. Pre-training therefore creates an inner representation of images from which features useful for downstream tasks can be extracted. For instance, given a dataset of labeled images, a linear layer can be placed on top of the pre-trained encoder to train a standard classifier.
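
As a concrete illustration, the snippet below is a minimal sketch of running this checkpoint for image classification with the Hugging Face transformers library (assuming transformers, Pillow, and requests are installed; the image URL is only an example):

```python
from PIL import Image
import requests
from transformers import ViTImageProcessor, ViTForImageClassification

# Any RGB image works; this COCO URL is only an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The processor resizes to 224x224 and normalizes the pixel values; the model is
# the pre-trained encoder with a linear classification head over the [CLS] token.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits

# Predict one of the 1,000 ImageNet classes.
predicted_class = logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
```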

Training Details

Training Data

The ViT model is pre-trained on the ImageNet-21k dataset at a resolution of 224x224 and fine-tuned on ImageNet 2012, which consists of 1 million images and 1,000 classes.

Training Procedure

In the preprocessing step, images are resized to the same resolution, 224x224, and then normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
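
For illustration, an equivalent preprocessing pipeline could be sketched with torchvision (an assumption of this example; the published checkpoint also ships its own image processor that performs the same steps):

```python
from torchvision import transforms

# Equivalent of the preprocessing described above: resize to 224x224, convert to a
# tensor in [0, 1], then normalize each RGB channel with mean 0.5 and std 0.5
# (which maps pixel values to the range [-1, 1]).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

# pixel_values = preprocess(pil_image).unsqueeze(0)  # shape: (1, 3, 224, 224)
```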

The model was trained on TPUv3 hardware (8 cores). All models were trained using Adam with β1 = 0.9 and β2 = 0.999, a batch size of 4096, a high weight decay of 0.1, and a learning-rate warmup of 10k steps. The authors found it beneficial to additionally apply gradient clipping at global norm 1. The training resolution is 224. For more details on the hyperparameters, refer to table 3 of the original-paper.
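
The sketch below illustrates these hyperparameters in plain PyTorch. It is not the authors' training code: the stand-in model and base_lr are placeholders chosen for the example (see table 3 of the paper for per-model values), and the decay schedule that follows the warmup is omitted.

```python
import torch

# Illustrative sketch only. It mirrors the settings described above: Adam with
# beta1=0.9, beta2=0.999, weight decay 0.1, a 10k-step learning-rate warmup, and
# gradient clipping at global norm 1.
model = torch.nn.Linear(768, 1000)  # stand-in for the ViT classifier
base_lr, warmup_steps = 1e-3, 10_000  # placeholder values

optimizer = torch.optim.Adam(
    model.parameters(), lr=base_lr, betas=(0.9, 0.999), weight_decay=0.1
)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)  # linear warmup
)

def training_step(batch_inputs, batch_targets):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(batch_inputs), batch_targets)
    loss.backward()
    # Clip gradients at global norm 1, as described above.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    return loss.item()
```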

For more details on pre-training (ImageNet-21k) followed by supervised fine-tuning (ImageNet-1k), refer to sections 3 and 4 of the original-paper.

Evaluation Results

For ViT image classification benchmark results, refer to table 2 and table 5 of the original-paper.

License

apache-2.0

Inference Samples

| Inference type | Python sample (Notebook) | CLI with YAML |
|---|---|---|
| Real time | image-classification-online-endpoint.ipynb | image-classification-online-endpoint.sh |
| Batch | image-classification-batch-endpoint.ipynb | image-classification-batch-endpoint.sh |

Finetuning Samples

| Task | Use case | Dataset | Python sample (Notebook) | CLI with YAML |
|---|---|---|---|---|
| Image Multi-class classification | Image Multi-class classification | fridgeObjects | fridgeobjects-multiclass-classification.ipynb | fridgeobjects-multiclass-classification.sh |
| Image Multi-label classification | Image Multi-label classification | multilabel fridgeObjects | fridgeobjects-multilabel-classification.ipynb | fridgeobjects-multilabel-classification.sh |

Evaluation Samples

| Task | Use case | Dataset | Python sample (Notebook) |
|---|---|---|---|
| Image Multi-class classification | Image Multi-class classification | fridgeObjects | image-multiclass-classification.ipynb |
| Image Multi-label classification | Image Multi-label classification | multilabel fridgeObjects | image-multilabel-classification.ipynb |

Sample input and output

Sample input

```json
{
  "input_data": ["image1", "image2"]
}
```

Note: The "image1" and "image2" strings should be base64-encoded images or publicly accessible URLs.
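
As a sketch of how such a request might be assembled in Python: the file names below are placeholders, and the endpoint call in the final comment assumes the Azure ML Python SDK v2 (azure-ai-ml).

```python
import base64
import json

def to_b64(image_path: str) -> str:
    # Encode a local image file as a base64 string, as expected in "input_data".
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# "sample1.jpg" and "sample2.jpg" are placeholder file names; publicly accessible
# URLs can be passed as plain strings instead of base64 payloads.
request_body = {"input_data": [to_b64("sample1.jpg"), to_b64("sample2.jpg")]}

with open("request.json", "w") as f:
    json.dump(request_body, f)

# The resulting request.json can then be sent to a deployed endpoint, for example
# with the Azure ML Python SDK v2:
#   ml_client.online_endpoints.invoke(endpoint_name="<endpoint-name>",
#                                     request_file="request.json")
```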

Sample output

```json
[
  [
    {
      "label": "can",
      "score": 0.91
    },
    {
      "label": "carton",
      "score": 0.09
    }
  ],
  [
    {
      "label": "carton",
      "score": 0.9
    },
    {
      "label": "can",
      "score": 0.1
    }
  ]
]
```

Visualization of inference result for a sample image

[Image: multi-class classification visualization]

Version: 17

Tags

huggingface_model_id : google/vit-base-patch16-224
license : apache-2.0
model_specific_defaults : ordereddict({'apply_deepspeed': 'true', 'apply_ort': 'true'})
task : image-classification
hiddenlayerscanned
training_dataset : imagenet-1k, imagenet-21k
SharedComputeCapacityEnabled
author : Google
inference_compute_allow_list : ['Standard_DS3_v2', 'Standard_D4a_v4', 'Standard_D4as_v4', 'Standard_DS4_v2', 'Standard_D8a_v4', 'Standard_D8as_v4', 'Standard_DS5_v2', 'Standard_D16a_v4', 'Standard_D16as_v4', 'Standard_D32a_v4', 'Standard_D32as_v4', 'Standard_D48a_v4', 'Standard_D48as_v4', 'Standard_D64a_v4', 'Standard_D64as_v4', 'Standard_D96a_v4', 'Standard_D96as_v4', 'Standard_FX4mds', 'Standard_F8s_v2', 'Standard_FX12mds', 'Standard_F16s_v2', 'Standard_F32s_v2', 'Standard_F48s_v2', 'Standard_F64s_v2', 'Standard_F72s_v2', 'Standard_FX24mds', 'Standard_FX36mds', 'Standard_FX48mds', 'Standard_E2s_v3', 'Standard_E4s_v3', 'Standard_E8s_v3', 'Standard_E16s_v3', 'Standard_E32s_v3', 'Standard_E48s_v3', 'Standard_E64s_v3', 'Standard_NC4as_T4_v3', 'Standard_NC6s_v3', 'Standard_NC8as_T4_v3', 'Standard_NC12s_v3', 'Standard_NC16as_T4_v3', 'Standard_NC24s_v3', 'Standard_NC64as_T4_v3', 'Standard_NC24ads_A100_v4', 'Standard_NC48ads_A100_v4', 'Standard_NC96ads_A100_v4', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4', 'Standard_ND40rs_v2']
evaluation_compute_allow_list : ['Standard_NC4as_T4_v3', 'Standard_NC6s_v3', 'Standard_NC8as_T4_v3', 'Standard_NC12s_v3', 'Standard_NC16as_T4_v3', 'Standard_NC24s_v3', 'Standard_NC64as_T4_v3', 'Standard_NC96ads_A100_v4', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4', 'Standard_ND40rs_v2']
finetune_compute_allow_list : ['Standard_NC4as_T4_v3', 'Standard_NC6s_v3', 'Standard_NC8as_T4_v3', 'Standard_NC12s_v3', 'Standard_NC16as_T4_v3', 'Standard_NC24s_v3', 'Standard_NC64as_T4_v3', 'Standard_NC96ads_A100_v4', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4', 'Standard_ND40rs_v2']

View in Studio: https://ml.azure.com/registries/azureml/models/google-vit-base-patch16-224/version/17

License: apache-2.0

Properties

SharedComputeCapacityEnabled: True

SHA: 2ddc9d4e473d7ba52128f0df4723e478fa14fb80

finetuning-tasks: image-classification

finetune-min-sku-spec: 4|1|28|176 (vCPUs | GPUs | memory in GB | storage in GB)

finetune-recommended-sku: Standard_NC4as_T4_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2

evaluation-min-sku-spec: 4|1|28|176 (vCPUs | GPUs | memory in GB | storage in GB)

evaluation-recommended-sku: Standard_NC4as_T4_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2

inference-min-sku-spec: 2|0|14|28 (vCPUs | GPUs | memory in GB | storage in GB)

inference-recommended-sku: Standard_DS3_v2, Standard_D4a_v4, Standard_D4as_v4, Standard_DS4_v2, Standard_D8a_v4, Standard_D8as_v4, Standard_DS5_v2, Standard_D16a_v4, Standard_D16as_v4, Standard_D32a_v4, Standard_D32as_v4, Standard_D48a_v4, Standard_D48as_v4, Standard_D64a_v4, Standard_D64as_v4, Standard_D96a_v4, Standard_D96as_v4, Standard_FX4mds, Standard_F8s_v2, Standard_FX12mds, Standard_F16s_v2, Standard_F32s_v2, Standard_F48s_v2, Standard_F64s_v2, Standard_F72s_v2, Standard_FX24mds, Standard_FX36mds, Standard_FX48mds, Standard_E2s_v3, Standard_E4s_v3, Standard_E8s_v3, Standard_E16s_v3, Standard_E32s_v3, Standard_E48s_v3, Standard_E64s_v3, Standard_NC4as_T4_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2
