Tashkeela dataset as used by Rababa

Note: This is a copy of the "Tashkeela processed" dataset hosted on SourceForge and Kaggle. It is provided on GitHub for unencumbered access.

Purpose

This is the Tashkeela dataset used for training Rababa.

Original description by Hamza Abbad

A version of the Tashkeela Arabic diacritized text dataset with the non-Arabic content and the undiacritized text removed, then divided into training, development, and testing sets.

The cleaning process removes XML tags and stray symbols and fixes diacritic errors. Tokenization is then performed with a focus on extracting the Arabic words. The result is a file of space-separated tokens in which words and numbers are separated from each other, but sequences of punctuation are not split (for example, a closing parenthesis followed by a period). Sentences are segmented at the usual punctuation marks, such as periods, commas, and question and exclamation marks, as well as at line ends.
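The steps described above can be illustrated with a minimal sketch. This is not the script used to build the dataset: the Unicode ranges, the punctuation set, and the function names `tokenize` and `split_sentences` are assumptions chosen for illustration, and the diacritic-fixing step is omitted.

```python
import re

# Assumption: XML tags are anything between angle brackets.
XML_TAG = re.compile(r"<[^>]+>")

# Assumption: Arabic letters and common diacritics lie in U+0621..U+0652.
ARABIC_WORD = r"[\u0621-\u0652]+"
# Western and Arabic-Indic digits.
NUMBER = r"[0-9\u0660-\u0669]+"
# Runs of punctuation are kept together as one token (e.g. ")." stays whole).
PUNCT_RUN = r'[.,:;!?()\[\]"«»،؛؟-]+'

TOKEN = re.compile("|".join([ARABIC_WORD, NUMBER, PUNCT_RUN]))

# Assumption: a sentence ends at a token whose last character is one of these.
SENTENCE_END = set(".,!?،؟")


def tokenize(line):
    """Drop XML tags, then keep only Arabic words, numbers and punctuation runs."""
    return TOKEN.findall(XML_TAG.sub(" ", line))


def split_sentences(text):
    """Return space-separated sentences, split at punctuation and at line ends."""
    sentences = []
    for line in text.splitlines():
        current = []
        for tok in tokenize(line):
            current.append(tok)
            if tok[-1] in SENTENCE_END:
                sentences.append(" ".join(current))
                current = []
        if current:  # a line end also closes the sentence
            sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences("<p>قَالَ زَيْدٌ (نَعَمْ). ثُمَّ ذَهَبَ</p>"):
        print(sentence)
```

Because Latin letters match none of the three token patterns, they are silently dropped, which is one simple way to approximate the removal of non-Arabic content.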

The partitioning is done by shuffling groups of sentences, dividing each group into three parts (Train/Val/Test), and storing each part in its own file.
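A minimal sketch of such a partition follows. The function name `partition`, the fixed seed, and the way groups are represented (a list of lists of sentences) are assumptions for illustration; the ratios default to the 90/5/5 proportions listed under Features, and writing the parts to files is left out.

```python
import random


def partition(groups, ratios=(0.90, 0.05, 0.05), seed=0):
    """Shuffle groups of sentences, then cut each group into train/val/test."""
    rng = random.Random(seed)  # assumption: a fixed seed for reproducibility
    groups = list(groups)
    rng.shuffle(groups)

    train, val, test = [], [], []
    for group in groups:
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        # The remainder goes to the test part, so no sentence is lost.
        test.extend(group[n_train + n_val:])
    return train, val, test


if __name__ == "__main__":
    demo = [[f"sentence {g}-{i}" for i in range(100)] for g in range(3)]
    train, val, test = partition(demo)
    print(len(train), len(val), len(test))
```

Splitting inside each group, rather than assigning whole groups to one part, keeps the three parts similar in content when groups come from different sources.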

Features:

Raw, fully diacritized Arabic texts.
Over 3 million sentences of varying length.
Mostly Classical Arabic.
Space-separated tokens.
90% training, 5% validation, and 5% testing data.
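Because the text is fully diacritized and space-separated, one common way to prepare it for a diacritization model is to strip the diacritics from each line to obtain the input, keeping the original line as the target. The sketch below assumes the diacritics are the code points U+064B to U+0652; the helper names are hypothetical, and this is not necessarily how Rababa consumes the data.

```python
import re

# Assumption: the diacritics to strip are the marks U+064B..U+0652
# (tanween, fatha, damma, kasra, shadda, sukun).
DIACRITICS = re.compile(r"[\u064B-\u0652]")


def strip_diacritics(text):
    """Remove the diacritic marks, leaving the bare letters."""
    return DIACRITICS.sub("", text)


def make_pair(line):
    """Return (undiacritized input, diacritized target) for one dataset line."""
    target = " ".join(line.split())  # normalise whitespace between tokens
    return strip_diacritics(target), target


if __name__ == "__main__":
    source, target = make_pair("قَالَ زَيْدٌ")
    print(source)
    print(target)
```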

License

This dataset is offered under GPLv2, as per the original datasets.

Credits

Taha Zerouki is the author of the original Tashkeela dataset, credited for finding and collecting diacritized Arabic content from several websites and combining it into a single public dataset. The original dataset is available on SourceForge.
Hamza Abbad created the "Tashkeela processed" dataset, which cleans up the original dataset by removing the non-Arabic content and the undiacritized text, then divides it into training, development, and testing sets. This dataset is available on SourceForge and Kaggle.

Repository contents: tashkeela_test, tashkeela_train, tashkeela_val, README.adoc