@samirsalman

By default, the library does not use protected patterns such as WEB_PROTECTED_PATTERNS, which contains, for example, patterns for URLs and emails.

# Example
from sacremoses import MosesTokenizer
tokenizer = MosesTokenizer(lang="en")
tokenizer.tokenize("http://www.someurl.com")

# Expected output
["http://www.someurl.com"]

# sacremoses output
["http", ":", "/", "/", "www.someurl.com"]

I suggest using WEB_PROTECTED_PATTERNS and BASIC_PATTERNS by default when the user does not specify protected patterns. This lets users avoid URL tokenization issues when calling the tokenize function with default arguments. The user can still specify different protected patterns, or disable protection entirely by setting the protected_patterns parameter to an empty list:

tokenizer.tokenize("http://www.someurl.com",protected_patterns=[])
