Since our problem is to translate videos (sequences of frames) into sentences (sequences of words and characters), it is a seq2seq problem, so we use a state-of-the-art model such as the Transformer, which performs much better than an LSTM, CRNN, or RNN.
Sign Language Translation
You can find the data that I worked on here: Data
Inference is performed by starting with an SOS token and predicting one character at a time using the previous prediction.
Inference requires the encoder to encode the input frames, and that encoding is then used to predict the 1st character by feeding the encoding and the SOS (Start of Sentence) token to the decoder. Next, the encoding, the SOS token, and the 1st predicted token are used to predict the 2nd character. Inference thus requires 1 call to the encoder and multiple calls to the decoder. On average a phrase is 18 characters long, requiring 18 + 1 (SOS token) calls to the decoder.
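A minimal sketch of this greedy decoding loop, assuming separate `encoder` and `decoder` callables and placeholder token IDs (these names and values are illustrative, not the project's actual code):

```python
import tensorflow as tf

SOS_TOKEN, EOS_TOKEN = 60, 61   # placeholder IDs, not the real vocabulary mapping
MAX_PHRASE_LEN = 32             # 31 characters + EOS, per the configuration below

def greedy_decode(encoder, decoder, frames):
    """One call to the encoder, then up to MAX_PHRASE_LEN calls to the decoder."""
    encoding = encoder(frames)                              # single encoder pass
    tokens = [SOS_TOKEN]
    for _ in range(MAX_PHRASE_LEN):
        logits = decoder(encoding, tf.constant([tokens]))   # (1, len(tokens), vocab)
        next_token = int(tf.argmax(logits[0, -1]))          # prediction for the last step
        if next_token == EOS_TOKEN:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the SOS token
```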
Some inspiration is taken from the 1st place solution (training) of the previous Google - Isolated Sign Language Recognition competition.
Special thanks to all of these people; many, many thanks to them:
https://www.kaggle.com/competitions/asl-fingerspelling/discussion/434364
[1st place solution] Improved Squeezeformer + TransformerDecoder + Clever augmentations: https://www.kaggle.com/competitions/asl-fingerspelling/discussion/434485
[5th place solution] Vanilla Transformer, Data2vec Pretraining, CutMix, and KD: https://www.kaggle.com/competitions/asl-fingerspelling/discussion/434415
https://www.kaggle.com/code/gusthema/asl-fingerspelling-recognition-w-tensorflow
This person helped me a lot: https://www.kaggle.com/competitions/asl-fingerspelling/discussion/411060
You can find my Neptune Project here: Neptune.ai
There are two approaches that I tried to solve this Problem, Note: I used TensorFlow library for building these two architectures:
The processing is as follows (a rough sketch of the first steps is shown after the list):
- Select the dominant hand based on the most non-empty hand frames.
- Filter out all frames with missing dominant-hand coordinates.
- Resize each video to 256 frames.
- Exclude samples with a low frames-per-character ratio.
- Add the phrase type (Phone Number, URL, Address).
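A rough sketch of the first three steps, assuming the landmarks arrive as `(n_frames, n_landmarks, 2)` x/y arrays (the helper names and exact tensor layout are my own, not the competition's schema):

```python
import tensorflow as tf

N_TARGET_FRAMES = 256  # videos are resized to 256 frames, as listed above

def select_dominant_hand(left_hand, right_hand):
    """Pick the hand with more non-empty (non-NaN) frames (eager-mode comparison)."""
    def non_empty_frames(hand):
        present = tf.reduce_any(tf.math.logical_not(tf.math.is_nan(hand)), axis=[1, 2])
        return tf.reduce_sum(tf.cast(present, tf.int32))
    return left_hand if non_empty_frames(left_hand) > non_empty_frames(right_hand) else right_hand

def filter_and_resize(hand):
    """Drop frames where the dominant hand is entirely missing, then resize
    the sequence to N_TARGET_FRAMES along the time axis."""
    present = tf.reduce_any(tf.math.logical_not(tf.math.is_nan(hand)), axis=[1, 2])
    hand = tf.boolean_mask(hand, present)
    hand = tf.where(tf.math.is_nan(hand), tf.zeros_like(hand), hand)  # fill remaining NaNs
    # Treat (frames, landmarks, coords) as an image and resize the frame axis.
    return tf.image.resize(hand, (N_TARGET_FRAMES, hand.shape[1]))
```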
Transformer model, 4,887,936 (~4.9M) parameters:
- Architecture: Embedding + Landmark Embedding + Encoder (2 encoder blocks) + Decoder (2 decoder blocks), 4 attention heads in both encoder and decoder, causal attention masking (see the sketch below).
- No data augmentation.
- Landmarks: lips, right hand, left hand; X/Y only, without Z.
- Preprocessing: fill NaN with zeros, filter empty hand frames, pad with zeros, downsample (resize to 128).
- 100 epochs, batch size 64, learning rate 0.001, weight decay ratio 0.05.
- PAD/SOS/EOS tokens used; maximum phrase length 31 + 1 EOS token.
- 10% of the data split into val_dataset, the rest used for training.
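As a minimal sketch of the causal attention masking mentioned above (the helper name and shapes are illustrative; the original code may build the mask differently):

```python
import tensorflow as tf

def causal_attention_mask(seq_len: int) -> tf.Tensor:
    """Lower-triangular mask: position i may only attend to positions <= i."""
    # band_part with num_lower=-1, num_upper=0 keeps the lower triangle.
    mask = tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
    return tf.cast(mask, tf.bool)  # (seq_len, seq_len); True = may attend

# Example usage with Keras multi-head attention (4 heads, as in the config above).
mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=64)
x = tf.random.normal((1, 32, 256))                            # (batch, seq_len, dim)
out = mha(query=x, value=x,
          attention_mask=causal_attention_mask(32)[tf.newaxis, ...])
```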
The evaluation metrics used (a sketch of the loss and metrics is shown after the list):
- Levenshtein Distance (train and validation):
- Sparse Categorical Crossentropy with Label Smoothing:
- Top1Accuracy:
- Top5Accuracy:
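A hedged sketch of how these could be wired up in TensorFlow (the smoothing value and helper names are assumptions; Keras' SparseCategoricalCrossentropy does not expose label smoothing, so the targets are one-hot encoded first):

```python
import tensorflow as tf

NUM_CLASSES = 62  # characters + PAD/SOS/EOS, per the configuration described later

def smoothed_sparse_ce(y_true, y_pred, smoothing=0.1):
    """Sparse categorical crossentropy with label smoothing (smoothing value assumed)."""
    y_true_oh = tf.one_hot(tf.cast(y_true, tf.int32), depth=NUM_CLASSES)
    loss_fn = tf.keras.losses.CategoricalCrossentropy(
        label_smoothing=smoothing, from_logits=True)
    return loss_fn(y_true_oh, y_pred)

top1 = tf.keras.metrics.SparseCategoricalAccuracy(name="top1_accuracy")
top5 = tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5, name="top5_accuracy")

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance between two decoded phrases."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]
```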
You can find more about monitoring the results in this file: Monitoring_Project_Performance
ASL-32 is the best run, and we got it using this architecture; you can find more detailed statistics about the results here: Neptune.ai
Transformer Architecture
My Transformer architecture in the code
The model consists of a Landmark Embedding + Conformer: a 2-layer MLP landmark encoder + a 6-layer 384-dim Conformer + a 1-layer GRU. Total params: 15,892,142 (trainable: 15,868,334, non-trainable: 23,808). It took about 7 hours to train; the plan was 100 epochs, but I stopped early because the loss did not improve much. I tried using Kaggle TPUs but couldn't get them to work, so if you know how to use them, please share. Note: if Kaggle TPUs are used, the number of epochs increases to 500 and the batch size increases as well, using this piece of code:
```python
# Increase the number of epochs if 2 GPUs are available
if strategy.num_replicas_in_sync == 2:
    N_EPOCHS = 400
    print("Because 2 GPUs are available, the number of epochs changes from 100 to:", N_EPOCHS)

# Increase the number of epochs if a TPU is available
if TPU:
    N_EPOCHS = 500
    print("Because a TPU is available, the number of epochs would be:", N_EPOCHS)

# Increase the batch size if a TPU is available
if TPU:
    BATCH_SIZE = 25 * strategy.num_replicas_in_sync
    print(BATCH_SIZE)
```
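For context, `strategy`, `TPU`, `N_EPOCHS`, and `BATCH_SIZE` are defined earlier in the notebook; a typical way to set up the first two on Kaggle looks roughly like this (a sketch, not the original setup code):

```python
import tensorflow as tf

# Hedged sketch of TPU/GPU detection; the original notebook may differ.
try:
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver.connect()
    strategy = tf.distribute.TPUStrategy(resolver)
    TPU = True
except ValueError:
    strategy = tf.distribute.MirroredStrategy()  # falls back to the available GPUs
    TPU = False

print("Replicas in sync:", strategy.num_replicas_in_sync)
```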
Data augmentation and preprocessing were applied as follows (a sketch of the random rotation and scaling step is shown after the list):
- Padding (short sequences), Resizing (longer sequences).
- Mean Calculation ignoring NaN.
- Standard Deviation Calculation ignoring NaN.
- Normalization (Standardization).
- Global Normalization (Standardization of the pose keypoints).
- Splitting, rearranging, and resizing (lips, hands, nose, eyes, and pose).
- Interpolation (resizes the sequence to a target length, random interpolation).
- Random Spatial Rotation (finger keypoints, degrees (-10, 10)).
- Random Scaling (scales finger keypoints, scale (0.9, 1.1)).
- Rotation, Shear, Scaling (degrees = (-15, 15), shear = (-0.10, 0.10), scale = (0.75, 1.5)).
- Inner Flipping (around the mean of the coordinates).
- Left-Right Flipping (swaps right and left body parts, e.g. the left and right hands, for each left/right data aspect).
- Random Rotation and Scaling (each finger individually).
- Temporal Resampling (resamples the temporal length of the data sequence at a new rate).
- Subsequence Resampling (resamples a subsection of the data sequence).
- Masking (teaches the model to handle incomplete data).
- Spatial Masking (applies a spatial mask to a random part of the data).
- Temporal Masking (masks a random temporal segment of the data).
- Random Shifting (shift_range = 0.1).
- Partial Rotation (applies rotations to parts of the data or individual fingers, whole sequence or a subsection).
- Partial Shifting.
- Combined Masking (combines temporal and feature masking in one step).
- Composite Augmentation (applies a random combination of augmentation techniques).
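As an illustration of the random rotation and scaling entries above (angle and scale ranges taken from the list; the function itself is a simplified sketch, not the original implementation):

```python
import numpy as np

def random_rotate_scale(keypoints, max_degrees=10.0, scale_range=(0.9, 1.1)):
    """Rotate x/y keypoints around their mean and rescale them.
    keypoints: (n_frames, n_landmarks, 2) array of x/y coordinates."""
    angle = np.radians(np.random.uniform(-max_degrees, max_degrees))
    scale = np.random.uniform(*scale_range)
    rotation = np.array([[np.cos(angle), -np.sin(angle)],
                         [np.sin(angle),  np.cos(angle)]])
    center = np.nanmean(keypoints, axis=(0, 1), keepdims=True)  # ignore NaNs
    return (keypoints - center) @ rotation.T * scale + center
```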
Training configuration:
- Number of epochs = 100, BATCH_SIZE = 64.
- Number of unique characters to predict + PAD token + SOS token + EOS token = 62.
- Maximum learning rate = 1e-3, weight decay ratio for the learning rate = 0.05 (see the optimizer sketch below).
- Maximum phrase length 31 + 1 EOS token.
- Number of frames to resize each recording to is 384.
- Dropout ratio 0.1; causal masking is applied.
- Landmarks: Nose 4, Lips 41, Pose 17, Eyes 16 (R) + 16 (L) = 32, Hands 42; in total 42 + 76 + 33 = 151 (HAND_NUMS 42, FACE_NUMS 76, POSE_NUMS 33).
- X/Y/Z used, which means we add the depth in this approach.
- Data split into train size = 66208 and val size = 1000.
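One reading of the "weight decay ratio for the learning rate = 0.05" setting (an assumption on my part) is that the decay is expressed relative to the learning rate, e.g. with AdamW in TF >= 2.11:

```python
import tensorflow as tf

MAX_LR = 1e-3
WD_RATIO = 0.05  # "weight decay ratio for the learning rate" from the settings above

# Assumption: the decay is coupled to the learning rate (weight_decay = lr * ratio).
optimizer = tf.keras.optimizers.AdamW(
    learning_rate=MAX_LR,
    weight_decay=MAX_LR * WD_RATIO,
)
```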
The Conformer run was ASL-27 in Neptune.ai. It was not that good, because I didn't give myself enough time to tune the code and the model architecture (change the number of heads, Conformer blocks, and decoder blocks, change the landmark indices, remove Z, add or remove augmentation techniques). There are many possible reasons behind this, or maybe the whole idea of using a Conformer here is wrong.
Since there was no time to try these ideas, I think using the clever augmentations from the first-place solution together with a GNN, Flash Attention, or an STMC Transformer would give better performance. Try to solve this problem with simple solutions first and then go deeper with more complex ones. Thanks a lot if you have reached this part. Also try to use the supplementary dataset, since the top solutions in this competition used it and said it was useful.
- Using a Conformer, which I talked about above.
- In the Transformer, increasing the number of encoder blocks from 4 to 8, the number of decoder blocks from 2 to 4, the number of attention heads from 4 to 8, and the Multi-Layer Perceptron ratio from 2 to 4, which in total gives 14.7 million trainable parameters.
- Including Z (the depth).