That's a very interesting question. I quickly analyzed the pretraining corpus of the BERTurk model (the 35GB corpus). BERTurk was pretrained from scratch.
It has 299_245_100 training instances (a training instance is one line of the training corpus).
I used fastText language detection (this model) and counted the number of training instances where English has the highest probability: 618_394. So about 0.2% of the corpus (618_394 of 299_245_100 lines) are "real" English training instances.
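For anyone who wants to repeat this kind of check on their own corpus, here is a minimal sketch of the counting step. It is not the exact script I ran: the function names, the skipping of empty lines, and the model path you pass in are all assumptions for illustration. It only relies on the `fasttext` Python package's `load_model` and `predict` calls and on the `__label__xx` label format of the language identification models.

```python
from typing import Callable, Iterable, Tuple


def load_fasttext_predictor(model_path: str) -> Callable[[str], str]:
    """Wrap a fastText language-ID model as a function: line -> language code.

    `model_path` is whatever language identification model file you downloaded
    (hypothetical here; the path is not taken from the original analysis).
    """
    import fasttext  # imported lazily so the counting logic runs without it

    model = fasttext.load_model(model_path)

    def predict(line: str) -> str:
        # k=1 returns only the most probable label, e.g. "__label__en"
        labels, _probs = model.predict(line, k=1)
        return labels[0].replace("__label__", "")

    return predict


def count_language(
    lines: Iterable[str],
    predict: Callable[[str], str],
    lang: str = "en",
) -> Tuple[int, int]:
    """Return (lines whose top language is `lang`, total non-empty lines)."""
    hits = 0
    total = 0
    for raw in lines:
        # fastText's predict rejects text containing a newline, so strip it.
        line = raw.strip()
        if not line:
            continue  # assumption: empty lines are not training instances
        total += 1
        if predict(line) == lang:
            hits += 1
    return hits, total


def share_percent(hits: int, total: int) -> float:
    """Percentage of `hits` in `total`; 0.0 for an empty corpus."""
    return 100.0 * hits / total if total else 0.0
```

Usage would look like `count_language(open("corpus.txt", encoding="utf-8"), load_fasttext_predictor("path/to/model"))`, where both paths are placeholders. Plugging in the counts reported above, `share_percent(618_394, 299_245_100)` comes out at roughly 0.2.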
Hi Stefan,
When I use the Turkish model on an English dataset for classification, it works surprisingly well. So, I have two questions:
Thanks!