wndb2lmf.py script does not handle pre-3.0 versions #38

Open
goodmami opened this issue Oct 1, 2024 · 15 comments · May be fixed by #42
Labels
bug Something isn't working
Milestone

Comments

@goodmami
Collaborator

goodmami commented Oct 1, 2024

For context: goodmami/wn#199

There are two main issues:

  • The CILI mapping format in https://github.com/globalwordnet/cili/ in older-wn-mappings/ has 3 columns instead of 2
  • WordNet versions prior to 2.1 do not include verb.Framestext, so we'd need to hard-code the frames into the script
@goodmami goodmami added the bug Something isn't working label Oct 1, 2024
@goodmami
Collaborator Author

goodmami commented Oct 1, 2024

A third and fourth issue for PWN 1.5:

  • The Unix database files are not available, only a Windows .zip and an old Mac .bin file. In the Windows archive, the filenames are all-caps and slightly different (e.g., DICT/NOUN.DAT instead of dict/data.noun), so we'd need to do some special file-loading. Luckily, there don't seem to be any strange encoding or line-ending problems.
  • The data files don't just have the license text and data lines, but in between these two blocks there is also a list of files used to create the database. The code for reading data files needs to be made aware of this.
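A rough sketch of how the two 1.5 quirks above might be handled. Only NOUN.DAT and ADV.DAT are named in this thread; the other Windows filenames, the `find_data_file` and `iter_data_lines` helpers, and the idea of recognizing data lines by their leading fields (rather than by knowing the header layout) are all assumptions for illustration.

```python
import re
from pathlib import Path

# Unix name -> Windows 1.5 name. Only the noun and adverb names are
# confirmed in this thread; the verb and adjective names are guesses.
WINDOWS_NAMES = {
    "data.noun": "NOUN.DAT",
    "data.adv": "ADV.DAT",
    "data.verb": "VERB.DAT",  # assumed
    "data.adj": "ADJ.DAT",    # assumed
}

# A data line starts with an 8-digit offset, a 2-digit lexicographer
# file number, and a synset type character.
DATA_LINE = re.compile(r"^\d{8} \d{2} [nvasr] ")


def find_data_file(root: Path, unix_name: str) -> Path:
    """Return the first existing candidate path for a data file."""
    candidates = [root / "dict" / unix_name]
    if unix_name in WINDOWS_NAMES:
        candidates.append(root / "DICT" / WINDOWS_NAMES[unix_name])
    for path in candidates:
        if path.is_file():
            return path
    raise FileNotFoundError(unix_name)


def iter_data_lines(lines):
    """Yield only data lines, skipping the license text and anything
    else (such as the 1.5 list of source files) that precedes them."""
    for line in lines:
        if DATA_LINE.match(line):
            yield line.rstrip("\r\n")
```

Matching on the shape of a data line means the reader does not need to know how long the license block or the file list is in any given version.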

@goodmami
Collaborator Author

goodmami commented Oct 1, 2024

The 3rd column of the older cili mappings is a confidence score. We should probably parse that and use it for thresholding out low-quality mappings.
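A minimal sketch of reading both mapping layouts and thresholding on the score. It assumes tab-separated columns in the order ILI id, synset id, optional confidence; the function name `parse_ili_map`, the `min_confidence` parameter, and the sample rows are made up for illustration and should be checked against the actual files.

```python
def parse_ili_map(lines, min_confidence=0.0):
    """Map synset ids to ILI ids from a 2- or 3-column mapping.

    Assumed layout (not verified against the cili repository):
    ILI id <TAB> synset id [<TAB> confidence]. Rows without a
    confidence column are treated as fully confident.
    """
    mapping = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        ili, synset = fields[0], fields[1]
        confidence = float(fields[2]) if len(fields) > 2 else 1.0
        if confidence >= min_confidence:
            mapping[synset] = ili
    return mapping


# Invented rows, for illustration only
rows = [
    "i1\t00001740-a\n",
    "i2\t00002098-a\t0.4\n",
    "i3\t00002312-a\t0.95\n",
]
kept = parse_ili_map(rows, min_confidence=0.5)
```

With the invented rows above, the second row falls below the threshold and is dropped, while the two-column row is kept.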

@goodmami
Collaborator Author

goodmami commented Oct 7, 2024

@jmccrae and @fcbond, a question for you at the end.

For WNDB to WN-LMF conversion, I use the index.sense file to get the sense keys (cntlist has sense keys, too, but only those with nonzero counts). The index.sense file also includes tag_cnt, so it obviates the need to look at cntlist for counts. At least, that's what I had assumed.

But since WordNet-1.5 does not include index.sense, I need to recreate the sense keys from scratch. It took me a few evenings, but I got a script that rebuilds index.sense with zero diffs when testing against WordNet-1.7, 1.7.1, 2.0, and 2.1.

The (hopefully) last remaining issue relates to the presence of the adjposition markers (a) or (p) (for some reason, I haven't seen (ip)) in the head word of a satellite adjective sense key. For example:

upward%5:00:00:ascending(a):00

It never seems to happen in the primary lemma of a sense key, only the head word for satellite adjectives. They can appear in the cntlist file:

$ grep -c '([ap])' WordNet-*/dict/cntlist
WordNet-1.5/DICT/CNTLIST:93
WordNet-1.6/dict/cntlist:131
WordNet-1.7.1/dict/cntlist:0
WordNet-1.7/dict/cntlist:0
WordNet-2.0/dict/cntlist:3
WordNet-2.1/dict/cntlist:0
WordNet-3.0/dict/cntlist:130
WordNet-3.1/dict/cntlist:130

... or the index.sense file (only for WN-1.6):

$ grep -c '([ap])' WordNet-*/dict/index.sense
WordNet-1.6/dict/index.sense:374
WordNet-1.7.1/dict/index.sense:0
WordNet-1.7/dict/index.sense:0
WordNet-2.0/dict/index.sense:0
WordNet-2.1/dict/index.sense:0
WordNet-3.0/dict/index.sense:0
WordNet-3.1/dict/index.sense:0

These differences seem to be the source of incorrect counts in the original index.sense files. The sense key only%5:00:00:single(a):00 has a tag_cnt of 118 in WN-1.6's cntlist file (first field), but 0 in index.sense (last field):

$ grep "only%5:00:00:single" WordNet-1.6/dict/{cntlist,index.sense}
WordNet-1.6/dict/cntlist:118 only%5:00:00:single(a):00 1
WordNet-1.6/dict/index.sense:only%5:00:00:single(a):00 02111616 1 0

WN-1.7 does not have the adjposition markers in the cntlist or index.sense file, and it gets the full count in both:

$ grep "only%5:00:00:single" WordNet-1.7/dict/{cntlist,index.sense}
WordNet-1.7/dict/cntlist:118 only%5:00:00:single:00 1
WordNet-1.7/dict/index.sense:only%5:00:00:single:00 02148121 1 118

WN-3.0 has the markers in cntlist but not index.sense, and again the counts are wrong:

$ grep "only%5:00:00:single" WordNet-3.0/dict/{cntlist,index.sense}
WordNet-3.0/dict/cntlist:118 only%5:00:00:single(a):00 1
WordNet-3.0/dict/index.sense:only%5:00:00:single:05 02214736 1 0

Since I assume we are faithfully converting the PWN versions to WN-LMF and not fixing bugs, I'm wondering what I should be including in the WN-LMF files. I'm thinking of taking cntlist as the source of truth for counts with a more robust lookup, but should I normalize the sense keys in older versions? That is, for WN 1.5 and 1.6, should we record the sense key as only%5:00:00:single(a):00 or only%5:00:00:single:00?
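To make the two candidate forms concrete, here is a small sketch that splits a sense key into its fields and removes an adjposition marker from the head word. The names `SenseKey`, `parse_sense_key`, and `strip_adjposition` are hypothetical and not taken from wndb2lmf.py.

```python
import re
from typing import NamedTuple


class SenseKey(NamedTuple):
    lemma: str
    ss_type: int
    lex_filenum: int
    lex_id: int
    head_word: str  # empty unless the sense is a satellite adjective
    head_id: str


# An adjposition marker at the end of the head word: (a), (p), or (ip)
ADJPOSITION = re.compile(r"\((?:a|p|ip)\)$")


def parse_sense_key(key: str) -> SenseKey:
    """Split lemma%ss_type:lex_filenum:lex_id:head_word:head_id."""
    lemma, _, lex_sense = key.rpartition("%")
    ss_type, lex_filenum, lex_id, head_word, head_id = lex_sense.split(":")
    return SenseKey(lemma, int(ss_type), int(lex_filenum), int(lex_id),
                    head_word, head_id)


def strip_adjposition(key: str) -> str:
    """Return the key with any adjposition marker removed from the head word."""
    sk = parse_sense_key(key)
    head = ADJPOSITION.sub("", sk.head_word)
    return (f"{sk.lemma}%{sk.ss_type}:{sk.lex_filenum:02d}:"
            f"{sk.lex_id:02d}:{head}:{sk.head_id}")
```

For instance, `strip_adjposition("only%5:00:00:single(a):00")` gives `only%5:00:00:single:00`, and keys without a head word, such as `effective%3:00:00::`, pass through unchanged. The sketch touches only the marker; it does nothing about the differing head_id (00 vs. 05) seen in WN-3.0.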

@goodmami goodmami added this to the Release 1.5 milestone Oct 18, 2024
@goodmami
Collaborator Author

@jmccrae and @fcbond, I didn't get any response to my question above, so here is my plan:

  1. Sense keys will be used as they are in index.sense (1.6 through 3.1)
  2. Sense keys for 1.5 will be generated like 1.6 (with adjposition markers)
  3. Counts will be used from cntlist with a flexible search

The following table illustrates what that looks like:

WordNet  Sense Key                  Count
1.5      only%5:00:00:single(a):00  102
1.6      only%5:00:00:single(a):00  118
1.7      only%5:00:00:single:00     118
1.7.1    only%5:00:00:single:00     0 [1]
2.0      only%5:00:00:single:00     0 [1]
2.1      only%5:00:00:single:05     0 [1]
3.0      only%5:00:00:single:05     118
3.1      only%5:00:00:single:05     118

The flexible search for counts means that the counts in my rebuilt index.sense file will differ from the existing index.sense files for WordNets 1.6, 3.0, and 3.1.

Footnotes

  1. This sense key, with or without adjposition markers, was missing from the cntlist file entirely.
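One way the flexible search for counts could look, as a sketch only: index cntlist by sense key with any adjposition marker removed, then look keys up the same way. The helper names are hypothetical, and the actual matching used for the lexicons may be looser than this.

```python
import re

# An adjposition marker sitting just before the final head_id field
_MARKER = re.compile(r"\((?:a|p|ip)\)(?=:\d*$)")


def normalize(key: str) -> str:
    """Drop an adjposition marker from the head word, if present."""
    return _MARKER.sub("", key)


def load_counts(lines):
    """Read cntlist lines of the form '<count> <sense_key> <sense_number>'."""
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        count, key, _sense_number = parts
        counts[normalize(key)] = int(count)
    return counts


def tag_count(counts, key: str) -> int:
    """Look up a count whether or not the key carries a marker."""
    return counts.get(normalize(key), 0)


counts = load_counts(["118 only%5:00:00:single(a):00 1"])
```

Here both `only%5:00:00:single(a):00` and `only%5:00:00:single:00` resolve to 118, and keys absent from cntlist fall back to 0. Marker stripping alone would not reconcile the WN-3.0 keys, where cntlist has head_id 00 and index.sense has 05, so a real implementation would need some additional matching rule for that case.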

@jmccrae

jmccrae commented Oct 24, 2024

I agree we should stick with index.sense primarily

In SemCor the sense is annotated as only%5:00:00:single(a):00 and this is aligned to WN1.6

Filename: brown2/tagfiles/br-h13.xml

<wf cmd="ignore" pos="DT">the</wf>
<wf cmd="done" pos="JJ" lemma="only" wnsn="0" lexsn="5:00:00:single(a):00">only</wf>
<wf cmd="done" pos="JJ" lemma="effective" wnsn="1" lexsn="3:00:00::">effective</wf>
<wf cmd="done" pos="NN" lemma="method" wnsn="1" lexsn="1:09:00::">method</wf>

@goodmami
Collaborator Author

@jmccrae thank you, that's the kind of confirmation I needed.

@goodmami
Collaborator Author

Trying to wrap this up. I'm now able to generate the WN-LMF for all versions, including WordNet-1.5, but some details are missing. My questions are:

  1. Do we need to create our own README files for pre-3.0 versions? @fcbond we did this for 3.0 and 3.1 where we listed changes from the original. Alternatively we could repackage the original READMEs, although they contain a lot of outdated info (like requesting copies of WordNet on diskette or magnetic tape).
  2. What citation do we use for the earlier versions? I see that WordNet 1.5 and 1.6 were copyright 1995 and 1997, respectively, and the Fellbaum citation is 1998, then WordNet 1.7 is copyright 2001. Maybe cite Miller for 1.5 and 1.6 and Fellbaum from 1.7 on?
  3. Should I use a potentially buggy original sense index for 1.5 or my generated one? (see below)

I found some extra files hosted here: https://wordnetcode.princeton.edu/1.5/. Unfortunately there is still no Unix version of the database, and no LICENSE file, but I did find a sense index that wasn't distributed with the database zip file. I compared it (SENSE.IDX below) to the one I produced (index.sense), using a sed command to drop the sense counts in the final column since 1.5 didn't include them, and I found only two differences:

$ diff etc/SENSE.IDX <( sed -E 's/ [0-9]+$//' etc/WordNet-1.5/DICT/index.sense )
36445a36446
> create%2:36:00:: 00926188 5
36450d36450
< create%2:36:16:: 00926188 5
91235a91236
> make%2:36:00:: 00952386 24
91245d91245
< make%2:36:16:: 00952386 24

For some reason the original sense index has the lex_id of these two as 16 instead of 00. Interestingly, the lex_id in the sense index is a two-digit decimal number while in the data files it is a one-digit hexadecimal number, so 16 is not even a valid lex_id (the highest, f, would be 15 in decimal). So this seems like a bug of sorts, maybe a kind of overflow error. I also note that only make has a lex_id of f in the data file, so it's possible the lexicographer files have > 16 senses for make, while the highest lex_id for create is just d.
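The arithmetic behind that observation, as a small sketch (the function name is hypothetical): a one-digit hexadecimal lex_id from a data file can only become a two-digit decimal between 00 and 15 in the sense index.

```python
def lex_id_for_sense_key(hex_digit: str) -> str:
    """Convert a one-digit hexadecimal lex_id (data file) to the
    two-digit decimal form used in sense keys."""
    if len(hex_digit) != 1:
        raise ValueError(f"expected one hex digit, got {hex_digit!r}")
    return f"{int(hex_digit, 16):02d}"


# The largest value any single hex digit can produce
highest = max(int(d, 16) for d in "0123456789abcdef")
```

So `f` maps to `15` and `d` maps to `13`, and since `highest` is 15, a value of 16 cannot come from the data file encoding.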

I also found database files for versions back to 1.2, but I won't bother with those as they aren't even listed on the Princeton WordNet's page about old versions.

@goodmami
Collaborator Author

While we're on the topic, the licenses we currently package with omw-en (here) and omw-en31 (here) appear to be the webpage text of https://wordnet.princeton.edu/license-and-commercial-use, which includes the license text plus some preamble. The license text there is also not formatted as nicely as the LICENSE file (https://wordnetcode.princeton.edu/3.0/LICENSE) distributed with the database. I think we should distribute the same LICENSE file that is included with the database.

@fcbond
Contributor

fcbond commented Dec 15, 2024 via email

goodmami added a commit that referenced this issue Dec 21, 2024
@goodmami
Collaborator Author

Ok, I've pushed a commit to #42. All the PWN lexicons seem to be building. I still have some chores before it's ready, which I've added as items in the description of the pull request.

@goodmami
Collaborator Author

goodmami commented Jan 8, 2025

> I would use your index file, and document the difference somewhere.

Unless you have strong opinions on this, I'm now thinking that we should just use the original sense index in case someone ever annotated data with it. That means the generated sense indexes won't be used at all, but the exercise of recreating them was useful to discover bugs in the originals (the lex-id here, the (a) and (p) being on the head word and incorrect counts in others).

@fcbond
Contributor

fcbond commented Jan 8, 2025 via email

@goodmami
Collaborator Author

goodmami commented Jan 9, 2025

> Do you think it is worth documenting the differences in our READMEs?

Yes, that's a good idea.

> I think Eric Kafe has already dealt with a lot of the issues in https://github.com/ekaf/ski

Thanks, that's the first I've seen of that project.

@goodmami
Collaborator Author

goodmami commented Jan 18, 2025

@fcbond @jmccrae I'm almost finished, but I've come across some WN-LMF validation errors for WN 1.5, 1.6, and 1.7 and I could use your feedback.

There were issues with redundant senses (i.e., the same word appearing more than once on a synset in a WNDB data file), but I just suppressed the extras. There were also some redundant sense relations (also in WN 2.1, 3.0, and 3.1) which I also suppressed.

The remaining issue is with a single adverb entry's \ pointer ("derived from adjective", but we call it "pertainym", probably because that's what the pointer means on adjectives). Here's the line for WN 1.5 in data.adv (ADV.DAT in the Windows database):

00175161 02 r 01 animatedly 0 001 \ 00600880 a 0000 | "They talked animatedly"

The line is the same for 1.6 and 1.7 except for the offsets. The issue here is that the \ pointer's source/target is 0000, which indicates a synset relation, but \ is a sense relation. As I see it, the choice here is whether to discard the relation or patch it to its most likely intended target. Since the target synset has the words animated and alive, the intended target is probably animated.

What do you think?


edit: to be clear, I don't want to fix problems with the wordnets, but the pointer issue and redundant sense issue make the XML invalid against the WN-LMF 1.3 schema. The redundant sense relations don't cause validation problems, so I could keep them in.
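For reference, a sketch of how the pointer problem on the animatedly line can be detected mechanically. It assumes the usual WNDB data-line layout (hexadecimal w_cnt, decimal p_cnt, four fields per pointer); the names `Pointer`, `parse_pointers`, and `synset_level_sense_pointers` are hypothetical, and the set of sense-only symbols holds just the one pointer discussed here.

```python
from typing import NamedTuple

# Assumption: only the pointer discussed in this thread is listed
SENSE_ONLY_SYMBOLS = {"\\"}


class Pointer(NamedTuple):
    symbol: str
    offset: str
    pos: str
    source: int  # word number in this synset; 0 means the whole synset
    target: int  # word number in the target synset; 0 means the whole synset


def parse_pointers(data_line: str):
    """Extract the pointers from one WNDB data line."""
    fields = data_line.partition("|")[0].split()
    w_cnt = int(fields[3], 16)   # word count, two hex digits
    p_start = 4 + 2 * w_cnt      # skip the (word, lex_id) pairs
    p_cnt = int(fields[p_start])  # pointer count, three decimal digits
    pointers = []
    for i in range(p_cnt):
        first = p_start + 1 + 4 * i
        symbol, offset, pos, src_tgt = fields[first:first + 4]
        pointers.append(Pointer(symbol, offset, pos,
                                int(src_tgt[:2], 16), int(src_tgt[2:], 16)))
    return pointers


def synset_level_sense_pointers(pointers):
    """Pointers that should relate senses but are marked 0000."""
    return [p for p in pointers
            if p.symbol in SENSE_ONLY_SYMBOLS and p.source == 0 and p.target == 0]


line = r'00175161 02 r 01 animatedly 0 001 \ 00600880 a 0000 | "They talked animatedly"'
flagged = synset_level_sense_pointers(parse_pointers(line))
```

On the 1.5 line, `flagged` contains the single \ pointer to 00600880. Whether to drop it or repoint it at a specific word is the open question; the sketch only finds it.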

@goodmami goodmami linked a pull request Jan 18, 2025 that will close this issue
13 tasks
@jmccrae

jmccrae commented Jan 20, 2025

> @fcbond @jmccrae I'm almost finished, but I've come across some WN-LMF validation errors for WN 1.5, 1.6, and 1.7 and I could use your feedback.
>
> There were issues with redundant senses (i.e., the same word appearing more than once on a synset in a WNDB data file), but I just suppressed the extras. There were also some redundant sense relations (also in WN 2.1, 3.0, and 3.1) which I also suppressed.

Okay

> The remaining issue is with a single adverb entry's \ pointer ("derived from adjective", but we call it "pertainym", probably because that's what the pointer means on adjectives). Here's the line for WN 1.5 in data.adv (ADV.DAT in the Windows database):
>
> 00175161 02 r 01 animatedly 0 001 \ 00600880 a 0000 | "They talked animatedly"
>
> The line is the same for 1.6 and 1.7 except for the offsets. The issue here is the \ pointer's source/target is 0000, which indicates a synset relation, but \ is a sense relation. As I see it, the choice here is whether to discard the relation or patch it to its most likely intended target. Since the target synset has words animated and alive, the intended target is probably animated.

It has been fixed in later versions of PWN, so I would fix it here.

> What do you think?
>
> edit: to be clear, I don't want to fix problems with the wordnets, but the pointer issue and redundant sense issue make the XML invalid against the WN-LMF 1.3 schema. The redundant sense relations don't cause validation problems, so I could keep them in.
