Anti-Pattern Matching support in Rule Base Matcher #7588

delzac · 2021-03-27T07:37:12Z

Some patterns are better defined by their exceptions (i.e. anti-patterns), which spaCy doesn't natively support. Native support for anti-pattern matching would make pattern construction more readable.

Take this example, where we are looking for potential accidental misspellings of "not" as "nut". We might accept "nut job", "nut case" and "french nut", but not "nut sure", "nut true" and "try nut to drop this".

Without using anti-patterns, we might write the following pattern to look for the misspelling:

pattern = [
    {"TEXT": {"NOT_IN": ["french"]}},
    {"TEXT": "nut"},
    {"TEXT": {"IN": ["sure", "true", "to"]}, "OP": "?"},
    {"TEXT": {"NOT_IN": ["job", "case"]}},
]

The above pattern is confusing to read.

It would be much more readable if we used anti-patterns, like so:

pattern = [
    {"TEXT": "nut"},
    {"TEXT": {"IN": ["sure", "true", "to"]}, "OP": "?"},
]

anti_patterns = [
    [{"TEXT": "french"}, {"TEXT": "nut"}],
    [{"TEXT": "nut"}, {"TEXT": "case"}],
    [{"TEXT": "nut"}, {"TEXT": "job"}],
]

Currently, users have to write a lot of boilerplate code to use anti-patterns, typically like so:

pattern_matcher = Matcher(...)
pattern_matcher.add(...)
anti_pattern_matcher = Matcher(...)
anti_pattern_matcher.add(...)

matches = pattern_matcher(doc)
anti_matches = anti_pattern_matcher(doc)
matches = remove_overlap(matches, anti_matches)

I would like to propose the following API for spaCy to natively support anti-patterns:

matcher = Matcher(...)
matcher.add("NUT-RULE", patterns, anti_patterns, callback=...)

matches = matcher(doc)

Would be happy to raise a PR for this feature if the maintainers are agreeable to it! :)

adrianeboyd · 2021-03-29T07:23:34Z

I don't have the final say, but I think that the potential behavior of "anti-patterns" is going to be too variable for this to be a good feature to include in the core library. How is overlap defined? What does remove_overlap do? There are so many options that I think it would be better to keep this out of the plain Matcher itself and let users implement custom components to handle the filtering.

You can also do this with one matcher and use the match IDs to filter the types of matches. I suspect this would be slightly faster, since the matcher only has to run over the document once.
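
A minimal sketch of that single-matcher approach, under a few assumptions: anti-patterns are registered under the rule's key plus a hypothetical `__ANTI` suffix, and the matches are converted to `(key, start, end)` tuples by looking up each match ID in `nlp.vocab.strings`. The suffix convention and the helpers `split_by_key`, `filter_matches` and `match_once` are illustrative names, not part of spaCy, and the sample matches at the bottom are written by hand rather than produced by a real run.

```python
ANTI_SUFFIX = "__ANTI"  # hypothetical naming convention, not a spaCy feature


def split_by_key(named_matches):
    """Partition (key, start, end) tuples into positive and anti matches."""
    positives, antis = [], []
    for key, start, end in named_matches:
        if key.endswith(ANTI_SUFFIX):
            # Store anti-matches under the base rule name for easy comparison.
            antis.append((key[: -len(ANTI_SUFFIX)], start, end))
        else:
            positives.append((key, start, end))
    return positives, antis


def filter_matches(named_matches):
    """Keep positive matches that share no token with an anti-match of the same rule."""
    positives, antis = split_by_key(named_matches)
    return [
        (key, start, end)
        for key, start, end in positives
        if not any(
            key == a_key and start < a_end and a_start < end
            for a_key, a_start, a_end in antis
        )
    ]


def match_once(nlp, matcher, text):
    """Run a spaCy Matcher once over the text, then filter by key (needs spaCy)."""
    doc = nlp(text)
    named = [(nlp.vocab.strings[m_id], s, e) for m_id, s, e in matcher(doc)]
    return filter_matches(named)


# Hand-written stand-in for matcher output on
# "he is a nut job and I am nut sure"
# tokens: he(0) is(1) a(2) nut(3) job(4) and(5) I(6) am(7) nut(8) sure(9)
named_matches = [
    ("NUT", 3, 4),        # "nut" inside "nut job"
    ("NUT__ANTI", 3, 5),  # "nut job"
    ("NUT", 8, 9),        # "nut"
    ("NUT", 8, 10),       # "nut sure"
]
filtered = filter_matches(named_matches)
print(filtered)
```

With spaCy v3, the two pattern sets would be registered on one matcher as `matcher.add("NUT", [pattern])` and `matcher.add("NUT__ANTI", anti_patterns)`, and `match_once` would then do the single pass. Whether "shares at least one token" is the right overlap test is exactly the kind of choice left to the user here.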

1 reply

delzac Mar 29, 2021
Author

My idea of anti-patterns would be to remove any matches (from pattern) that have any kind of overlap with matches from anti-patterns of the same rule.

So if a document consists of A B C D E F G,
and the pattern matches A B and C D,
but the anti-pattern matches D E and F G,
then the final output from the matcher should be A B only, as C D partially overlaps an anti-pattern match.
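
The rule above can be sketched in plain Python. This is only one possible reading of it: `remove_overlap` is the helper name from my original post, `spans_overlap` is a made-up name, and matches are assumed to be `(rule, start, end)` tuples with half-open token spans, which is the shape the Matcher returns once match IDs are mapped to rule names.

```python
def spans_overlap(a_start, a_end, b_start, b_end):
    """True if two half-open token spans [start, end) share at least one token."""
    return a_start < b_end and b_start < a_end


def remove_overlap(matches, anti_matches):
    """Drop every match that overlaps an anti-match of the same rule.

    Both arguments are lists of (rule, start, end) tuples.
    """
    kept = []
    for rule, start, end in matches:
        blocked = any(
            rule == anti_rule and spans_overlap(start, end, a_start, a_end)
            for anti_rule, a_start, a_end in anti_matches
        )
        if not blocked:
            kept.append((rule, start, end))
    return kept


# Document tokens: A B C D E F G (indices 0..6)
tokens = ["A", "B", "C", "D", "E", "F", "G"]
matches = [("RULE", 0, 2), ("RULE", 2, 4)]       # A B, C D
anti_matches = [("RULE", 3, 5), ("RULE", 5, 7)]  # D E, F G

result = remove_overlap(matches, anti_matches)
result_text = [" ".join(tokens[s:e]) for _, s, e in result]
print(result_text)  # ['A B']
```

C D is dropped because it shares the token D with the anti-match D E. A match from a different rule in the same position would be kept, since anti-patterns only apply within their own rule.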

I hear your concerns. My proposal stemmed from looking at how rule-based matching is done in other libraries, where the concept of anti-patterns is baked in. I found the explicit separation of positive and negative patterns helpful (I maintain a large rule-based matcher system using spaCy), so I thought I should bring it up here for consideration.

Thanks for taking the time to do so. :)

jordi-reinsma · 2022-01-25T01:46:37Z

If anyone else sees this discussion, here is an example of how to implement @delzac's idea (removing overlaps with the same rule ID): https://gist.github.com/jordi-reinsma/2de3ad79ced025772bf9517f93614e93

0 replies
