Describe the bug
Normalized lookups compare lowercased forms, which fails for characters whose case mapping is not one-to-one. For example, the German ß is unchanged by lowercasing, but its uppercase form is SS:
>>> 'ß'.lower()
'ß'
>>> 'ß'.upper()
'SS'
To Reproduce
>>> import wn
>>> de = wn.Wordnet("odenet:1.4")
>>> [w.lemma() for w in de.words("SCHWARZ")]
['schwarz']
>>> [w.lemma() for w in de.words("WEISS")]
[]
>>> [w.lemma() for w in de.words("WEIß")]
['Weiß', 'weiß']
Expected behavior
Python's str.casefold() method is the preferred way to do caseless matching of Unicode strings. It seems like a more appropriate solution, since it accounts for more kinds of case distinctions than lowercasing does.
>>> 'ß'.casefold()
'ss'
This does not account for ALL case distinctions, however. In particular, the Turkic dotless-i is not considered equivalent to the uppercase I even under casefolding:
>>> 'ı'.casefold() == "I".casefold()
False
Environment
$ python --version
Python 3.12.7
$ python -m wn --version
Wn 0.11.0
$ python -m wn lexicons
odenet 1.4 [de] Offenes Deutsches WordNet
oewn 2024 [en] Open Engish Wordnet
Additional context
CPython's re module keeps a table of special cases for characters like the dotless-i, so that they match under re.IGNORECASE: https://github.com/python/cpython/blob/main/Lib/re/_casefix.py
>>> import re
>>> re.match("ı", "I", flags=re.IGNORECASE)
<re.Match object; span=(0, 1), match='I'>
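To make the proposal concrete, here is a rough sketch of a casefold-based lookup key with an optional table of extra folds, loosely in the spirit of the special cases in re. It is only an illustration: normalize_form, EXTRA_FOLDS, and the toy in-memory index are hypothetical names and not part of Wn, and which extra pairs (if any) should be folded is an open question.

```python
# Hypothetical extra folds applied after str.casefold().
# Assumption: only the dotless i is handled here, mapped to plain "i".
EXTRA_FOLDS = str.maketrans({"ı": "i"})


def normalize_form(form: str) -> str:
    """Return a caseless lookup key for a word form (hypothetical helper)."""
    return form.casefold().translate(EXTRA_FOLDS)


# Toy stand-in for a lexicon: map each normalized key to the original lemmas.
lemmas = ["schwarz", "Weiß", "weiß"]
index: dict[str, list[str]] = {}
for lemma in lemmas:
    index.setdefault(normalize_form(lemma), []).append(lemma)


def lookup(query: str) -> list[str]:
    """Find lemmas whose normalized key equals that of the query."""
    return index.get(normalize_form(query), [])


print(lookup("SCHWARZ"))  # ['schwarz']
print(lookup("WEISS"))    # ['Weiß', 'weiß']
print(lookup("WEIß"))     # ['Weiß', 'weiß']
print(normalize_form("ı") == normalize_form("I"))  # True
```

The same key function has to be applied to both the stored forms and the query, so switching from lower() to casefold() would presumably mean regenerating any normalized forms that were already stored.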