Normalization should use casefold() instead of lower() #233

**Describe the bug**

Normalized lookups compare lowercased forms, which misses non-trivial case mappings such as the German *ß*, whose uppercase form is *SS*:

```python
>>> 'ß'.lower()
'ß'
>>> 'ß'.upper()
'SS'
```

**To Reproduce**

```python
>>> import wn
>>> de = wn.Wordnet("odenet:1.4")
>>> [w.lemma() for w in de.words("SCHWARZ")]
['schwarz']
>>> [w.lemma() for w in de.words("WEISS")]
[]
>>> [w.lemma() for w in de.words("WEIß")]
['Weiß', 'weiß']
```

**Expected behavior**

Python has the [str.casefold()](https://docs.python.org/3/library/stdtypes.html#str.casefold) method, which is the preferred way to do [Unicode caseless matching](https://www.unicode.org/reports/tr21/tr21-5.html#Caseless_Matching). It seems like a more robust solution that would handle more kinds of case distinctions.

```python
>>> 'ß'.casefold()
'ss'
```
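As a rough sketch of the idea (the name `normalize_form` is hypothetical and is not Wn's actual normalizer), casefolding both the query and the candidate forms would make the spellings from the reproduction above collapse to a single lookup key:

```python
def normalize_form(form: str) -> str:
    """Hypothetical normalizer: fold case for caseless matching."""
    return form.casefold()


# The query "WEISS" and the lemmas 'Weiß' / 'weiß' all map to the same key
keys = {normalize_form(s) for s in ("WEISS", "WEIß", "Weiß", "weiß")}
print(keys)  # {'weiss'}
```

If normalized forms are stored rather than computed at query time, they would presumably need to be regenerated with the same function for the lookup to match.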

This does not account for ALL case distinctions, however. In particular, the Turkic dotless-i is not considered equivalent to the uppercase I even under casefolding:

```python
>>> 'ı'.casefold() == "I".casefold()
False
```

**Environment**

```console
$ python --version
Python 3.12.7
$ python -m wn --version
Wn 0.11.0
$ python -m wn lexicons
odenet  1.4     [de]    Offenes Deutsches WordNet
oewn    2024    [en]    Open Engish Wordnet
```

**Additional context**

CPython's `re` module uses a set of special cases for things like the dotless-i: https://github.com/python/cpython/blob/main/Lib/re/_casefix.py

```python
>>> import re
>>> re.match("ı", "I", flags=re.IGNORECASE)
<re.Match object; span=(0, 1), match='I'>
```
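Something similar could in principle be layered on top of `casefold()`. The following is only a sketch under my own assumptions: `EXTRA_FOLDS` and `fold_with_special_cases` are hypothetical names, and the single table entry is illustrative rather than a complete list of special cases.

```python
# Hypothetical extra equivalences applied before casefolding.
# Only one illustrative entry: U+0131 LATIN SMALL LETTER DOTLESS I -> 'i'
EXTRA_FOLDS = str.maketrans({"\u0131": "i"})


def fold_with_special_cases(form: str) -> str:
    """Casefold after mapping characters that casefold() leaves distinct."""
    return form.translate(EXTRA_FOLDS).casefold()


print(fold_with_special_cases("\u0131") == fold_with_special_cases("I"))  # True
```

Note that this conflates dotted and dotless i regardless of language, which is not correct for Turkish text, so it may only be appropriate as a lenient fallback.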

