
Tokenizing Croatian sentences differs from OpenAI tokenizer #2

Open
@kdujmic

Description


Tokenizing text using the OpenAI tokenizer:
"Cvrči cvrči cvrčak na čvoru crne smrče" - [34, 37020, 46195, 72, 269, 37020, 46195, 72, 269, 37020, 46195, 461, 12385, 34754, 235, 20867, 84, 1067, 710, 895, 81, 46195, 68]

Tokenizing text using GPT2Tokenizer:
"Cvrči cvrči cvrčak na čvoru crne smrče" - [34, 37020, 8423, 8423, 72, 269, 37020, 8423, 8423, 72, 269, 37020, 8423, 8423, 461, 12385, 9242, 8423, 20867, 84, 1067, 710, 895, 81, 8423, 77, 377, 293]

When running the tests:
Expected: [34, 37020, 46195, 72, 269, 37020, 46195, 72, 269, 37020, 46195, 461, 12385, 34754, 235, 20867, 84, 1067, 710, 895, 81, 46195, 68]
Actual:   [34, 37020, 8423, 8423, 72, 269, 37020, 8423, 8423, 72, 269, 37020, 8423, 8423, 461, 12385, 9242, 8423, 20867, 84, 1067, 710, 895, 81, 8423, 77, 377, 293]
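The mismatch only appears on the non-ASCII characters (č, š), which is consistent with how GPT-2-style tokenizers handle multi-byte UTF-8 characters: the text is first split into raw bytes, each byte is remapped to a printable unicode symbol, and only then are BPE merges applied. A character like "č" occupies two UTF-8 bytes, so it ends up as a single token only if the merge covering both remapped bytes is applied; otherwise it stays as two byte-level tokens. The sketch below illustrates that byte-remapping step, using the `bytes_to_unicode` table from OpenAI's published gpt-2 `encoder.py` — this is an illustration of the mechanism, not a confirmed diagnosis of this particular bug:

```python
def bytes_to_unicode():
    """Byte-to-printable-unicode table as defined in OpenAI's gpt-2 encoder.py.

    Printable bytes map to themselves; the rest are shifted up past 255 so
    every byte has a visible, reversible stand-in character.
    """
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(2 ** 8):
        if b not in bs:
            bs.append(b)
            cs.append(2 ** 8 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_encoder = bytes_to_unicode()

# "č" is two bytes in UTF-8, so before any BPE merges it is two symbols.
raw = "č".encode("utf-8")
print(list(raw))                                # [196, 141]
mapped = "".join(byte_encoder[b] for b in raw)
print(mapped)                                   # Äį
```

If the GPT2Tokenizer under test never applies the merge joining these two symbols (e.g. the merges file is not loaded or matched correctly), each two-byte character would come out as two tokens instead of one, which would match the doubled tokens seen in the actual output above.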

Great work, by the way: English sentence tokenization works like a charm!
