Token.shape_ not keeping a consistent length #12593

The only limit is on the length of a single run of the same shape character, such as x.

It's hard to describe this succinctly, but here's the description from Lexeme.shape in the docs:

Transform of the word’s string, to show orthographic features. Alphabetic characters are replaced by x or X, and numeric characters are replaced by d, and sequences of the same character are truncated after length 4. For example, "Xxxx" or "dd".

The implementation is here:

def word_shape(text: str) -> str:
    if len(text) >= 100:
        return "LONG"
    shape = []
    last = ""
    shape_char = ""
    seq = 0
    for char in text:
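
The excerpt is cut off at the loop, so here is a minimal sketch of the behavior the docs describe, written to show why the shape length varies from token to token. It is an illustration under assumptions, not the verbatim spaCy source: the name sketch_word_shape and the max_run parameter are hypothetical, and the loop body is reconstructed from the documented rules (letters become x or X, digits become d, other characters are kept, and a run of the same shape character is cut off after 4).

```python
def sketch_word_shape(text: str, max_run: int = 4) -> str:
    """Illustrative shape function; hypothetical, not the spaCy source."""
    # Very long strings collapse to a fixed marker, as in the excerpt.
    if len(text) >= 100:
        return "LONG"
    shape = []
    last = ""  # shape character of the current run
    run = 0    # length of the current run so far
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            # Punctuation and other characters are kept as they are.
            shape_char = char
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        # Only the first max_run characters of each run are emitted.
        if run <= max_run:
            shape.append(shape_char)
    return "".join(shape)


print(sketch_word_shape("Hello"))                 # Xxxxx
print(sketch_word_shape("internationalization"))  # xxxx
print(sketch_word_shape("12345"))                 # dddd
print(sketch_word_shape("U.K."))                  # X.X.
print(sketch_word_shape("abc-123"))               # xxx-ddd
```

Each run is capped separately, so a token that alternates between character types can produce a shape longer than 4, while a long lowercase word collapses to xxxx. That is why the shape has no consistent length.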

Answer selected by adrianeboyd
Labels
feat / doc Feature: Doc, Span and Token objects