@samirsalman

By default, the library does not use protected patterns such as WEB_PROTECTED_PATTERNS, which contains, for example, patterns for URLs and emails.

# Example
from sacremoses import MosesTokenizer
tokenizer = MosesTokenizer(lang="en")
tokenizer.tokenize("http://www.someurl.com")

# Expected output
["http://www.someurl.com"]

# sacremoses output
["http", ":", "/", "/", "www.someurl.com"]

I suggest using WEB_PROTECTED_PATTERNS and BASIC_PATTERNS by default when the user does not specify protected patterns. This lets users avoid URL tokenization issues when calling the tokenize function with default arguments. The user can still specify different protected patterns, or disable protection entirely by setting the protected_patterns parameter to an empty list:

tokenizer.tokenize("http://www.someurl.com",protected_patterns=[])
