Diacritics are marks for short vowels, which have a constant length and are pronounced but usually left out of written text. The same word in Arabic can have different meanings and pronunciations depending on how it is diacritized.
In this project, we implement a pipeline to predict the diacritic of each character in an Arabic text using Natural Language Processing techniques.
- Split the text into sentences at punctuation marks.
- Split long sentences into smaller ones of no more than `500` characters (not counting diacritics).
- Remove all non-Arabic characters.
- Remove diacritics.
- Start each sentence with `<s>` and end it with `</s>` (both are assigned the class "no diacritics", `''`).
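
The preprocessing steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the punctuation set, the choice to split long sentences on word boundaries, and the function names (`base_len`, `chunk`, `split_labels`, `preprocess`) are all assumptions made for the example. Diacritics are stripped from the input and kept as per-character labels, with the empty string standing for "no diacritics".

```python
import re

# Arabic diacritic marks (tanween, short vowels, shadda, sukun): U+064B..U+0652.
DIACRITICS = "".join(chr(c) for c in range(0x064B, 0x0653))
# Assumed sentence delimiters (Latin and Arabic punctuation, newlines).
SPLIT_RE = re.compile(r"[.!?:;,\u060C\u061B\u061F\n]+")
# Everything that is not an Arabic letter, an Arabic diacritic, or whitespace.
NON_ARABIC = re.compile(r"[^\u0621-\u063A\u0641-\u0652\s]")


def base_len(s):
    """Length of s without counting diacritics."""
    return sum(ch not in DIACRITICS for ch in s)


def chunk(sentence, max_len=500):
    """Greedily pack whole words into pieces of at most max_len base characters.

    Assumption: a single word longer than max_len is kept whole.
    """
    pieces, current = [], ""
    for word in sentence.split():
        candidate = current + " " + word if current else word
        if current and base_len(candidate) > max_len:
            pieces.append(current)
            current = word
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


def split_labels(sentence):
    """Separate base characters from their diacritics and add <s> / </s>."""
    chars, labels = ["<s>"], [""]
    for ch in sentence:
        if ch in DIACRITICS:
            if len(chars) > 1:       # ignore a stray diacritic at sentence start
                labels[-1] += ch     # e.g. shadda followed by fatha
        else:
            chars.append(ch)
            labels.append("")        # "" is the 'no diacritics' class
    chars.append("</s>")
    labels.append("")
    return chars, labels


def preprocess(text, max_len=500):
    """Return a list of (chars, labels) pairs, one per cleaned sentence piece."""
    examples = []
    for sentence in SPLIT_RE.split(text):
        for piece in chunk(sentence, max_len):
            cleaned = " ".join(NON_ARABIC.sub("", piece).split())
            if cleaned:
                examples.append(split_labels(cleaned))
    return examples
```

Each returned pair holds the undiacritized characters (the model input) and the diacritic attached to each character (the prediction target), so both lists always have the same length.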
- One-hot encoding (char level)
- Trainable embeddings (char level)
- Word2vec embeddings + one-hot (word and char level)
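
As an illustration of the simplest feature above, character-level one-hot encoding, here is a small sketch in plain Python. The helper names `build_vocab` and `one_hot`, and the reserved `<unk>` slot for characters unseen during training, are assumptions for this example rather than details taken from the project.

```python
def build_vocab(sequences):
    """Map each distinct symbol to an index; index 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for sym in sorted({s for seq in sequences for s in seq}):
        vocab[sym] = len(vocab)
    return vocab


def one_hot(seq, vocab):
    """Encode a sequence of symbols as a list of one-hot vectors."""
    vectors = []
    for sym in seq:
        row = [0] * len(vocab)
        row[vocab.get(sym, 0)] = 1   # unseen symbols fall back to <unk>
        vectors.append(row)
    return vectors
```

Every vector has one entry per vocabulary symbol and exactly one entry set to 1. A trainable embedding layer replaces these sparse vectors with dense ones looked up by the same indices.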
- BLSTM
- RNN
Diacritic Error Rate (DER) = 1 - Accuracy
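
Because DER is defined as 1 - Accuracy, it equals the fraction of characters whose predicted diacritic differs from the reference. A minimal sketch of that computation follows; the function name `diacritic_error_rate` is hypothetical, and whether the `<s>` and `</s>` positions are counted is left to the caller, since the text above does not specify it.

```python
def diacritic_error_rate(true_labels, pred_labels):
    """Fraction of positions where the predicted diacritic is wrong."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("cannot compute DER on empty input")
    errors = sum(t != p for t, p in zip(true_labels, pred_labels))
    return errors / len(true_labels)
```

Both arguments are flat lists of diacritic labels, one per character, with `''` for characters that carry no diacritic.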
Final model used for the test set submission on Kaggle: BLSTM model with char embedding layer
Team: The Powerpuff Girls
- Asmaa Adel
- Samaa Hazem
- Norhan reda
- HodaGamal