📌 This is an official PyTorch implementation of [CVPR 2023] - EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention
EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention
Xinyu Liu, Houwen Peng, Ningxin Zheng, Yuqing Yang, Han Hu, Yixuan Yuan
The Chinese University of Hong Kong, Microsoft Research Asia
EfficientViT is a family of high-speed vision transformers. It is built upon a new memory-efficient building block with a sandwich layout, and an efficient cascaded group attention operation that mitigates attention computation redundancy.
[2023.5.11] 📰 Code and pre-trained models of EfficientViT are released.
⭐ The EfficientViT family offers a better speed-accuracy trade-off than prior efficient models.
- EfficientViT uses a sandwich layout block to reduce memory time consumption and cascaded group attention to mitigate attention computation redundancy (a minimal sketch of the attention follows this list).
- EfficientViT-M0 with 63.2% Top-1 accuracy achieves 27,644 images/s on a V100 GPU, 228.4 images/s on an Intel CPU, and 340.1 images/s when exported to ONNX.
- EfficientViT-M4 achieves 74.3% Top-1 accuracy on ImageNet-1k, with 15,914 images/s inference throughput at 224x224 resolution, measured on a V100 GPU.
- EfficientViT-M5 trained for 300 epochs (~30h on 8 V100 GPUs) achieves 77.1% Top-1 and 93.4% Top-5 accuracy with a throughput of 10,621 images/s on a V100 GPU.
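To make the cascaded group attention idea concrete, below is a minimal PyTorch sketch: the channels are split across heads, each head attends over its own split, and each head's output is added to the next head's input split before its Q/K/V projection. The module name, the per-head linear QKV projection, and the tensor shapes are illustrative assumptions for this sketch, not the repository's actual implementation (see the classification code for that).

```python
import torch
import torch.nn as nn

class CascadedGroupAttentionSketch(nn.Module):
    """Simplified sketch of cascaded group attention (illustrative only)."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        # one small QKV projection per head, operating on that head's channel split
        self.qkvs = nn.ModuleList(
            [nn.Linear(self.head_dim, self.head_dim * 3) for _ in range(num_heads)]
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, N, C) token features
        splits = x.chunk(self.num_heads, dim=-1)
        outs = []
        feat = splits[0]
        for i, qkv in enumerate(self.qkvs):
            if i > 0:
                # cascade: feed the previous head's output into the next split
                feat = splits[i] + outs[-1]
            q, k, v = qkv(feat).chunk(3, dim=-1)
            attn = (q @ k.transpose(-2, -1)) * self.scale
            outs.append(attn.softmax(dim=-1) @ v)
        return self.proj(torch.cat(outs, dim=-1))

x = torch.randn(1, 196, 128)                    # e.g. 14x14 tokens, 128 channels
print(CascadedGroupAttentionSketch(128)(x).shape)  # torch.Size([1, 196, 128])
```

Because each head works on a small channel split and later heads see progressively refined features, the cascade keeps the attention computation cheap while preserving feature diversity across heads.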
🔰 We provide a simple way to use the pre-trained EfficientViT models directly:
```python
import torch
from classification.model.build import EfficientViT_M4

model = EfficientViT_M4(pretrained='efficientvit_m4').eval()
image = torch.randn(1, 3, 224, 224)   # a preprocessed 224x224 input batch
out = model(image)                    # classification logits
```
🔨 Here we provide setup, evaluation, and training scripts for different tasks.
For image classification, please refer to Classification.
For downstream tasks, please refer to Downstream.
If you find our project helpful, please feel free to leave a star and cite our paper:
@InProceedings{liu2023efficientvit,
title = {EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention},
author = {Liu, Xinyu and Peng, Houwen and Zheng, Ningxin and Yang, Yuqing and Hu, Han and Yuan, Yixuan},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2023},
}
We sincerely appreciate Swin Transformer, LeViT, pytorch-image-models, and PyTorch for their awesome codebases.