Description
Tokenizing text with the OpenAI tokenizer:
"Cvrči cvrči cvrčak na čvoru crne smrče" - [34, 37020, 46195, 72, 269, 37020, 46195, 72, 269, 37020, 46195, 461, 12385, 34754, 235, 20867, 84, 1067, 710, 895, 81, 46195, 68]
Tokenizing the same text with GPT2Tokenizer:
"Cvrči cvrči cvrčak na čvoru crne smrče" - [34, 37020, 8423, 8423, 72, 269, 37020, 8423, 8423, 72, 269, 37020, 8423, 8423, 461, 12385, 9242, 8423, 20867, 84, 1067, 710, 895, 81, 8423, 77, 377, 293]
When running the tests:
Expected: [34, 37020, 46195, 72, 269, 37020, 46195, 72, 269, 37020, 46195, 461, 12385, 34754, 235, 20867, 84, 1067, 710, 895, 81, 46195, 68]
Actual: [34, 37020, 8423, 8423, 72, 269, 37020, 8423, 8423, 72, 269, 37020, 8423, 8423, 461, 12385, 9242, 8423, 20867, 84, 1067, 710, 895, 81, 8423, 77, 377, 293]
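A hedged guess at the root cause, for whoever picks this up: "č" is U+010D, a two-byte UTF-8 character (0xC4 0x8D). GPT-2's byte-level BPE first maps every raw byte through a fixed byte-to-unicode table and only then applies the learned merges; if that mapping (or the merge step) mishandles non-ASCII bytes, each accented character falls apart into per-byte fallback tokens, which would explain the repeated 8423s above. A small sketch of the expected behaviour, using the byte-to-unicode table from OpenAI's original encoder.py:

```python
# "č" is two UTF-8 bytes; both must survive the byte-to-unicode
# mapping so that the learned BPE merge can fuse them into one token.
text = "č"
print(list(text.encode("utf-8")))  # [196, 141] -> bytes 0xC4 0x8D

def bytes_to_unicode():
    # GPT-2's table: printable Latin-1 bytes map to themselves,
    # every other byte is shifted into the U+0100+ range.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

mapping = bytes_to_unicode()
# ['Ä', 'į'] -> these two symbols should then merge into a single
# vocabulary entry (46195 in the reference output) instead of
# collapsing into two identical fallback IDs.
print([mapping[b] for b in text.encode("utf-8")])
```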
By the way, great work: English sentence tokenization works like a charm!