The experiments were conducted on the CIFAR-10 dataset. We divided the training image set into training and validation subsets and applied the following data transformations:
- Training + validation sets: horizontal flip, resized crop, and normalization
- Test set: normalization
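A minimal torchvision sketch of this pipeline is shown below; the crop parameters and normalization statistics are assumptions, as the report does not state them:

```python
import torchvision.transforms as T

# Commonly used CIFAR-10 channel statistics (assumed; not stated in the report).
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2470, 0.2435, 0.2616)

# Training + validation: horizontal flip, resized crop, and normalization.
train_transform = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomResizedCrop(32),  # CIFAR-10 images are 32x32
    T.ToTensor(),
    T.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])

# Test: normalization only.
test_transform = T.Compose([
    T.ToTensor(),
    T.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])
```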
Vision Transformer (ViT)
The attention block in the ViT is modelled with an embedding size of 256, a hidden dimension of 512, and 8 heads in the multi-head attention block, with its output passed on to the rest of the Transformer encoder.
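A minimal PyTorch sketch of such a block, assuming a standard pre-norm Transformer encoder layout (the report does not specify the exact layer arrangement):

```python
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Pre-norm Transformer encoder block matching the sizes above (a sketch)."""
    def __init__(self, embed_dim=256, hidden_dim=512, num_heads=8, dropout=0.2):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):  # x: (batch, num_patches + 1, embed_dim)
        # Self-attention sub-layer with residual connection
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Feed-forward sub-layer with residual connection
        return x + self.mlp(self.norm2(x))
```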
MLP-Mixer
Each MLP-Mixer block contains two MLPs. The MLP head has a hidden dimension of 512, while the token-mixing and channel-mixing MLPs have dimensions of 256 and 1024, respectively.
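A PyTorch sketch of one such block, assuming the standard Mixer layout (LayerNorm, token-mixing MLP, then channel-mixing MLP, each with a residual connection):

```python
import torch.nn as nn

class MlpBlock(nn.Module):
    """Two-layer MLP used for both token and channel mixing."""
    def __init__(self, dim, hidden_dim, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class MixerBlock(nn.Module):
    """One MLP-Mixer block: token-mixing MLP followed by channel-mixing MLP."""
    def __init__(self, num_tokens, channels, token_dim=256, channel_dim=1024,
                 dropout=0.2):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.token_mlp = MlpBlock(num_tokens, token_dim, dropout)
        self.norm2 = nn.LayerNorm(channels)
        self.channel_mlp = MlpBlock(channels, channel_dim, dropout)

    def forward(self, x):                      # x: (batch, tokens, channels)
        # Token mixing: transpose so the MLP acts across the token axis
        y = self.norm1(x).transpose(1, 2)      # (batch, channels, tokens)
        x = x + self.token_mlp(y).transpose(1, 2)
        # Channel mixing: MLP acts across the channel axis
        return x + self.channel_mlp(self.norm2(x))
```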
We then trained both ViT and MLP-Mixer with three hyperparameters held uniform across the two models (as sketched below):
- learning_rate = 2e-4
- num_epochs = 30
- dropout = 0.2
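As an illustration, these shared settings plug into a training setup like the following; the optimizer and loss function are assumptions, since the report does not name them:

```python
import torch
import torch.nn as nn

LEARNING_RATE = 2e-4
NUM_EPOCHS = 30
DROPOUT = 0.2  # passed into the model definitions above

# `model` would be the ViT or MLP-Mixer sketched above; a placeholder is used here.
model = nn.Sequential(nn.Flatten(), nn.Dropout(DROPOUT), nn.Linear(3 * 32 * 32, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
criterion = nn.CrossEntropyLoss()
```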
Our best results for both models are given below:
Model Name | Training Parameters | Training Time (h:mm) | Learning Rate | No. of Epochs | Training Accuracy (%) | Validation Accuracy (%) | Test Accuracy (%)
---|---|---|---|---|---|---|---
ViT | 3,195,146 | 6:00 | 2e-4 | 30 | 61.85 | 59.36 | 58.71
MLP-Mixer | 1,116,490 | 1:48 | 2e-4 | 30 | 70.80 | 68.64 | 68.32
We can clearly observe that, with roughly one-third of the parameters (1,116,490 vs. 3,195,146) and less than one-third of the training time, the MLP-Mixer beats the ViT on training, validation, and test accuracy.
Note:
We ran our models with