How is this model bilingual? #38

smtnkc · 2024-03-13T18:31:02Z

Hi Stefan,

When I use the Turkish model on an English dataset for classification, it works surprisingly well. So, I have two questions:

Does the training corpus contain English texts?
Is it trained from scratch or on the English model's weights?

Thanks!

stefan-it · 2024-03-19T14:55:05Z

Hi @smtnkc ,

that's a very interesting question. I quickly analyzed the pretraining corpus of the BERTurk model (trained on the 35GB corpus). BERTurk was pretrained from scratch.

It has 299_245_100 training instances (training instance is considered as line in training corpus).

I used fasttext language detection (this model) and counted the number of training instances where English has highest probabilty: 618_394. So 0,5% of the corpus are "real" English training instances.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How is this model bilingual? #38

How is this model bilingual? #38

smtnkc commented Mar 13, 2024

stefan-it commented Mar 19, 2024 •

edited

Loading

How is this model bilingual? #38

How is this model bilingual? #38

Comments

smtnkc commented Mar 13, 2024

stefan-it commented Mar 19, 2024 • edited Loading

stefan-it commented Mar 19, 2024 •

edited

Loading