Skip to content

models facebook deit base patch16 224

github-actions[bot] edited this page Oct 8, 2024 · 27 revisions

facebook-deit-base-patch16-224

Overview

DeiT (Data-efficient image Transformers) is an image transformer that do not require very large amounts of data for training. This is achieved through a novel distillation procedure using teacher-student strategy, which results in high throughput and accuracy. DeiT is pre-trained and fine-tuned on ImageNet-1k (1 million images, 1,000 classes) at resolution 224x224. The model was first released in this repository, but the weights were converted to PyTorch from the timm repository by Ross Wightman.

An image is treated as a sequence of patches and it is processed by a standard Transformer encoder as used in NLP. These patches are linearly embedded, and a [CLS] token is added at the beginning of the sequence for classification tasks. The model also requires absolute position embeddings before feeding the sequence Transformer encoder. So the pre-training creates an inner representation of images that can be used to extract features that are useful for downstream tasks. For instance, if a dataset of labeled images is available, a linear layer can be placed on top of the pre-trained encoder, to train a standard classifier.

For more details on DeiT, Review the original-paper.

Training Details

Training Data

The DeiT model is pre-trained and fine-tuned on ImageNet 2012, consisting of 1 million images and 1,000 classes on a resolution of 224x224.

Training Procedure

In the preprocessing step, images are resized to the same resolution 224x224. Different augmentations like Rand-Augment, and random erasing are used. For more details on transformations during training/validation refer this-link. At inference time, images are rescaled to the same resolution 256x256, center-cropped at 224x224 and then normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).

The model was trained on a single 8-GPU node for 3 days. Training resolution is 224. For more details on hyperparameters refer to table 9 of the original-paper.

For more details on pre-training (ImageNet-1k) followed by supervised fine-tuning (ImageNet-1k) refer to the section 2 to 5 of the original-paper.

Evaluation Results

DeiT base model achieved top-1 accuracy of 81.8% and top-5 accuracy of 95.6% on ImageNet with 86M parameters with image size 224x224. For DeiT image classification benchmark results, refer to the table 5 of the original-paper.

It's important to note that during the fine-tuning process, superior performance is attained with a higher resolution, and enhancing the model size leads to improved performance.

License

apache-2.0

Inference Samples

Inference type Python sample (Notebook) CLI with YAML
Real time image-classification-online-endpoint.ipynb image-classification-online-endpoint.sh
Batch image-classification-batch-endpoint.ipynb image-classification-batch-endpoint.sh

Finetuning Samples

Task Use case Dataset Python sample (Notebook) CLI with YAML
Image Multi-class classification Image Multi-class classification fridgeObjects fridgeobjects-multiclass-classification.ipynb fridgeobjects-multiclass-classification.sh
Image Multi-label classification Image Multi-label classification multilabel fridgeObjects fridgeobjects-multilabel-classification.ipynb fridgeobjects-multilabel-classification.sh

Evaluation Samples

Task Use case Dataset Python sample (Notebook)
Image Multi-class classification Image Multi-class classification fridgeObjects image-multiclass-classification.ipynb
Image Multi-label classification Image Multi-label classification multilabel fridgeObjects image-multilabel-classification.ipynb

Sample input and output

Sample input

{
  "input_data": ["image1", "image2"]
}

Note: "image1" and "image2" string should be in base64 format or publicly accessible urls.

Sample output

[
  [
    {
      "label" : "can",
      "score" : 0.91
    },
    {
      "label" : "carton",
      "score" : 0.09
    },
  ],
  [
    {
      "label" : "carton",
      "score" : 0.9
    },
    {
      "label" : "can",
      "score" : 0.1
    },
  ]
]

Visualization of inference result for a sample image

mc visualization

Version: 19

Tags

huggingface_model_id : facebook/deit-base-patch16-224 license : apache-2.0 model_specific_defaults : ordereddict({'apply_deepspeed': 'true', 'apply_ort': 'true'}) task : image-classification hiddenlayerscanned training_dataset : imagenet-1k SharedComputeCapacityEnabled author : Meta inference_compute_allow_list : ['Standard_DS3_v2', 'Standard_D4a_v4', 'Standard_D4as_v4', 'Standard_DS4_v2', 'Standard_D8a_v4', 'Standard_D8as_v4', 'Standard_DS5_v2', 'Standard_D16a_v4', 'Standard_D16as_v4', 'Standard_D32a_v4', 'Standard_D32as_v4', 'Standard_D48a_v4', 'Standard_D48as_v4', 'Standard_D64a_v4', 'Standard_D64as_v4', 'Standard_D96a_v4', 'Standard_D96as_v4', 'Standard_FX4mds', 'Standard_F8s_v2', 'Standard_FX12mds', 'Standard_F16s_v2', 'Standard_F32s_v2', 'Standard_F48s_v2', 'Standard_F64s_v2', 'Standard_F72s_v2', 'Standard_FX24mds', 'Standard_FX36mds', 'Standard_FX48mds', 'Standard_E2s_v3', 'Standard_E4s_v3', 'Standard_E8s_v3', 'Standard_E16s_v3', 'Standard_E32s_v3', 'Standard_E48s_v3', 'Standard_E64s_v3', 'Standard_NC4as_T4_v3', 'Standard_NC6s_v3', 'Standard_NC8as_T4_v3', 'Standard_NC12s_v3', 'Standard_NC16as_T4_v3', 'Standard_NC24s_v3', 'Standard_NC64as_T4_v3', 'Standard_NC24ads_A100_v4', 'Standard_NC48ads_A100_v4', 'Standard_NC96ads_A100_v4', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4', 'Standard_ND40rs_v2'] evaluation_compute_allow_list : ['Standard_NC4as_T4_v3', 'Standard_NC6s_v3', 'Standard_NC8as_T4_v3', 'Standard_NC12s_v3', 'Standard_NC16as_T4_v3', 'Standard_NC24s_v3', 'Standard_NC64as_T4_v3', 'Standard_NC96ads_A100_v4', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4', 'Standard_ND40rs_v2'] finetune_compute_allow_list : ['Standard_NC4as_T4_v3', 'Standard_NC6s_v3', 'Standard_NC8as_T4_v3', 'Standard_NC12s_v3', 'Standard_NC16as_T4_v3', 'Standard_NC24s_v3', 'Standard_NC64as_T4_v3', 'Standard_NC96ads_A100_v4', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4', 'Standard_ND40rs_v2']

View in Studio: https://ml.azure.com/registries/azureml/models/facebook-deit-base-patch16-224/version/19

License: apache-2.0

Properties

SharedComputeCapacityEnabled: True

SHA: fb2c78a54a5637dec350432794f7b93e31f910c9

evaluation-min-sku-spec: 4|1|28|176

evaluation-recommended-sku: Standard_NC4as_T4_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2

finetune-min-sku-spec: 4|1|28|176

finetune-recommended-sku: Standard_NC4as_T4_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2

finetuning-tasks: image-classification

inference-min-sku-spec: 2|0|14|28

inference-recommended-sku: Standard_DS3_v2, Standard_D4a_v4, Standard_D4as_v4, Standard_DS4_v2, Standard_D8a_v4, Standard_D8as_v4, Standard_DS5_v2, Standard_D16a_v4, Standard_D16as_v4, Standard_D32a_v4, Standard_D32as_v4, Standard_D48a_v4, Standard_D48as_v4, Standard_D64a_v4, Standard_D64as_v4, Standard_D96a_v4, Standard_D96as_v4, Standard_FX4mds, Standard_F8s_v2, Standard_FX12mds, Standard_F16s_v2, Standard_F32s_v2, Standard_F48s_v2, Standard_F64s_v2, Standard_F72s_v2, Standard_FX24mds, Standard_FX36mds, Standard_FX48mds, Standard_E2s_v3, Standard_E4s_v3, Standard_E8s_v3, Standard_E16s_v3, Standard_E32s_v3, Standard_E48s_v3, Standard_E64s_v3, Standard_NC4as_T4_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2

Clone this wiki locally