Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How is this model bilingual? #38

Open
smtnkc opened this issue Mar 13, 2024 · 1 comment
Open

How is this model bilingual? #38

smtnkc opened this issue Mar 13, 2024 · 1 comment

Comments

@smtnkc
Copy link

smtnkc commented Mar 13, 2024

Hi Stefan,

When I use the Turkish model on an English dataset for classification, it works surprisingly well. So, I have two questions:

  1. Does the training corpus contain English texts?
  2. Is it trained from scratch or on the English model's weights?

Thanks!

@stefan-it
Copy link
Owner

stefan-it commented Mar 19, 2024

Hi @smtnkc ,

that's a very interesting question. I quickly analyzed the pretraining corpus of the BERTurk model (trained on the 35GB corpus). BERTurk was pretrained from scratch.

It has 299_245_100 training instances (training instance is considered as line in training corpus).

I used fasttext language detection (this model) and counted the number of training instances where English has highest probabilty: 618_394. So 0,5% of the corpus are "real" English training instances.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants