Token.shape_ not keeping a consistent length #12593
-
I've been using shapes in certain matcher rules and I came across this: The You (or me via pull request) may want to slice the Now off to look at what happens to "O'Connor" and friends.
Replies: 2 comments 1 reply
-
The only limit is on how long a single substring of the same repeated type, like x, can be. It's hard to describe this succinctly, but there is a description under Lexeme.shape in the docs.
The implementation is here: spaCy/spacy/lang/lex_attrs.py, lines 118 to 142 in 7369832. If you want, you can replace this with your own shape method by using a custom language.
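To make the truncation rule concrete, here is a minimal standalone sketch of the behavior described above. It is an approximation for illustration, not the code from lex_attrs.py: the name word_shape_sketch and the max_run parameter are hypothetical, and the sketch assumes letters map to x or X, digits map to d, other characters are kept as-is, and a run of the same shape character is cut off after 4. Any extra rules in the real implementation are left out.

```python
def word_shape_sketch(text: str, max_run: int = 4) -> str:
    """Approximate a spaCy-style word shape (illustrative, not the library code)."""
    shape = []
    last = ""  # shape character of the current run
    run = 0    # length of the current run so far
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept unchanged
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        # Keep only the first max_run characters of each run.
        if run <= max_run:
            shape.append(shape_char)
    return "".join(shape)


print(word_shape_sketch("O'Connor"))     # X'Xxxxx
print(word_shape_sketch("a123456789z"))  # xddddx
print(word_shape_sketch("spaCy"))        # xxxXx
```

Under these assumptions, the shape has the same length as the token only while no run exceeds four characters, which is why "O'Connor" (8 characters) gets a 7-character shape.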
-
Lexeme.shape... I did not find that, my bad. I assumed it would be in Usage or under Token.shape because that's what the shape was hanging off of. Lesson learned: look at ALL of the references in the search before posting. Actually, the text explanation is quite clear, except for the examples. Maybe add an example along the lines of: the shape of "a123456789z" is truncated to "xddddx". But that's obviously your call.
I'm just guessing here, but I suppose you can base this custom language off of one of the stock ones. I'll keep that in mind, but the reason for using spaCy is that I wanted to move away from doing it all myself via my own home-cooked (aka half-baked) rule generator. This mosaic of rules and statistical methods I'm writing is a stepping stone to using a true model. I am not a fan of writing rules, and the first thing I do on my base repo is warn people away from relying too much on rules. But here I am. Anyhoo... Thank you again for another patient, informative and, given that it was an RTFM, polite reply.