Normalization should use casefold() instead of lower() #233

@goodmami

Describe the bug

Normalized lookups compare lowercased forms, which fails for non-trivial case mappings such as the German ß, whose uppercase form is SS:

>>> 'ß'.lower()
'ß'
>>> 'ß'.upper()
'SS'

To Reproduce

>>> import wn
>>> de = wn.Wordnet("odenet:1.4")
>>> [w.lemma() for w in de.words("SCHWARZ")]
['schwarz']
>>> [w.lemma() for w in de.words("WEISS")]
[]
>>> [w.lemma() for w in de.words("WEIß")]
['Weiß', 'weiß']

Expected behavior

Python's str.casefold() method is the preferred way to do caseless matching on Unicode strings. Using it for normalization seems like a more correct solution, since it handles case mappings that str.lower() misses.

>>> 'ß'.casefold()
'ss'
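
A minimal sketch of the difference, using a plain dictionary as a stand-in for the normalized lookup table. The names `build_index` and `lookup` are hypothetical and are not part of wn's API; the lemmas are the ones from the reproduction above.

```python
from collections import defaultdict


def build_index(lemmas, normalize):
    """Map each normalized form to the original lemmas that share it."""
    index = defaultdict(list)
    for lemma in lemmas:
        index[normalize(lemma)].append(lemma)
    return index


def lookup(index, query, normalize):
    """Normalize the query the same way as the index keys, then look it up."""
    return index.get(normalize(query), [])


LEMMAS = ["Weiß", "weiß", "schwarz"]
lower_index = build_index(LEMMAS, str.lower)
fold_index = build_index(LEMMAS, str.casefold)

print(lookup(lower_index, "WEISS", str.lower))    # []
print(lookup(fold_index, "WEISS", str.casefold))  # ['Weiß', 'weiß']
```

With casefolding, both "WEISS" and "WEIß" reduce to "weiss", so either query finds the same entries.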

This does not account for ALL case distinctions, however. In particular, the Turkic dotless-i is not considered equivalent to the uppercase I even under casefolding:

>>> 'ı'.casefold() == "I".casefold()
False
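
One possible way to cover this case is to apply the Turkic-specific foldings (the entries marked "T" in Unicode's CaseFolding.txt) before calling casefold(). This is only a sketch: `casefold_turkic` is a hypothetical helper, and it assumes the caller already knows the lexicon is in a Turkic language such as Turkish or Azerbaijani.

```python
# Turkic foldings from CaseFolding.txt (status "T"):
#   U+0049 I -> U+0131 ı   and   U+0130 İ -> U+0069 i
_TURKIC_FOLD = str.maketrans({"I": "ı", "İ": "i"})


def casefold_turkic(text: str) -> str:
    """Casefold with Turkic dotted/dotless i handling (hypothetical helper)."""
    return text.translate(_TURKIC_FOLD).casefold()


print("I".casefold() == "ı".casefold())              # False
print(casefold_turkic("I") == casefold_turkic("ı"))  # True
```

This mapping would be wrong for other languages, since it makes "I" stop matching "i", so it could not be applied unconditionally.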

Environment

$ python --version
Python 3.12.7
$ python -m wn --version
Wn 0.11.0
$ python -m wn lexicons
odenet  1.4     [de]    Offenes Deutsches WordNet
oewn    2024    [en]    Open Engish Wordnet

Additional context

CPython's re module handles cases like the dotless-i with a table of special cases: https://github.com/python/cpython/blob/main/Lib/re/_casefix.py

>>> import re
>>> re.match("ı", "I", flags=re.IGNORECASE)
<re.Match object; span=(0, 1), match='I'>
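
For comparison, here is a small sketch of whole-string caseless comparison built on re.IGNORECASE. The helper `caseless_equal` is hypothetical and only illustrates the trade-off: the regex approach matches character by character, so it handles the dotless-i but cannot equate ß with the two-character SS.

```python
import re


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings with re's case-insensitive matching (hypothetical helper)."""
    return re.fullmatch(re.escape(a), b, flags=re.IGNORECASE) is not None


print(caseless_equal("ı", "I"))   # True, as in the session above
print(caseless_equal("ß", "SS"))  # False: one character cannot match two
```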
