wndb2lmf.py script does not handle pre-3.0 versions #38
Comments
A third and fourth issue for PWN 1.5:
|
The 3rd column of the older cili mappings is a confidence score. We should probably parse that and use it for thresholding out low-quality mappings. |
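A minimal sketch of what such thresholding could look like. It assumes the mapping files are tab-separated with the confidence score as a float in the third column; the column order (source ID, target ID, score), the function name `read_mappings`, and the 0.5 default cutoff are illustrative assumptions, not something checked against the actual files.

```python
import csv
import io


def read_mappings(lines, min_confidence=0.5):
    """Yield (source, target) pairs whose confidence meets the threshold.

    Assumes tab-separated rows. Rows with only two columns (no score)
    are always kept.
    """
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if len(row) >= 3 and float(row[2]) < min_confidence:
            continue  # drop low-confidence mappings
        yield row[0], row[1]


# Placeholder rows for illustration only (not real mapping data).
sample = io.StringIO("a\tx\t0.9\nb\ty\t0.2\nc\tz\n")
print(list(read_mappings(sample)))
```

Keeping two-column rows unconditionally means the same reader would work for the newer two-column mappings as well.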
@jmccrae and @fcbond, a question for you at the end. For WNDB to WN-LMF conversion, I use the But since WordNet-1.5 does not include The (hopefully) last remaining issue relates to the presence of the adjposition markers
It never seems to happen in the primary lemma of a sense key, only the head word for satellite adjectives. They can appear in the
... or the
These differences seem to be the source of incorrect counts in the original
WN-1.7 does not have the adjposition markers in the
WN-3.0 has the markers in
Since I assume we are faithfully converting the PWN versions to WN-LMF and not fixing bugs, I'm wondering what I should be including in the WN-LMF files. I'm thinking of taking |
@jmccrae and @fcbond, I didn't get any response to my question above, so here is my plan:
The following table illustrates what that looks like:
The flexible search for counts means that the counts in my rebuilt Footnotes |
I agree we should stick with In SemCor the sense is annotated as
<wf cmd="ignore" pos="DT">the</wf>
<wf cmd="done" pos="JJ" lemma="only" wnsn="0" lexsn="5:00:00:single(a):00">only</wf>
<wf cmd="done" pos="JJ" lemma="effective" wnsn="1" lexsn="3:00:00::">effective</wf>
<wf cmd="done" pos="NN" lemma="method" wnsn="1" lexsn="1:09:00::">method</wf> |
@jmccrae thank you, that's the kind of confirmation I needed. |
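For reference, a small sketch of how a sense key such as the one in the SemCor excerpt above (lemma plus `%` plus the `lexsn` value) could be split so that the adjposition marker on the head word is handled separately. The regular expressions and the function name `split_sense_key` are illustrative assumptions, not code from `wndb2lmf.py`.

```python
import re

# lemma%ss_type:lex_filenum:lex_id:head_word:head_id
# (head_word and head_id are empty unless the sense is a satellite adjective)
SENSE_KEY = re.compile(
    r"^(?P<lemma>.+)%(?P<ss_type>[1-5]):(?P<lex_filenum>\d{2}):(?P<lex_id>\d{2})"
    r":(?P<head_word>[^:]*):(?P<head_id>\d{2})?$"
)
# Adjective position markers: (a), (p), or (ip) at the end of a word
ADJPOSITION = re.compile(r"\((a|p|ip)\)$")


def split_sense_key(key):
    """Split a sense key into its fields, separating any adjposition
    marker from the head word."""
    match = SENSE_KEY.match(key)
    if match is None:
        raise ValueError(f"not a sense key: {key!r}")
    parts = match.groupdict()
    marker = ADJPOSITION.search(parts["head_word"])
    parts["head_marker"] = marker.group(1) if marker else None
    parts["head_word"] = ADJPOSITION.sub("", parts["head_word"])
    return parts


parts = split_sense_key("only%5:00:00:single(a):00")
print(parts["head_word"], parts["head_marker"])
```

Storing the marker separately would make it possible to emit the sense key either with or without it, depending on which convention is chosen.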
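Trying to wrap this up. I'm able to now generate the WN-LMF for all, including WordNet-1.5, but some details are missing. My questions are:
1. Do we need to create our own README files for pre-3.0 versions? @fcbond we did this for 3.0 (https://github.com/omwn/omw-data/blob/main/wns/en30/README.md) and 3.1 (https://github.com/omwn/omw-data/blob/main/wns/en31/README.md) where we listed changes from the original. Alternatively we could repackage the original READMEs, although they contain a lot of outdated info (like requesting copies of WordNet on diskette or magnetic tape).
2. What citation do we use for the earlier versions? I see that WordNet 1.5 and 1.6 were copyright 1995 and 1997, respectively, and the Fellbaum citation is 1998, then WordNet 1.7 is copyright 2001. Maybe cite Miller for 1.5 and 1.6 and Fellbaum from 1.7 on?
3. Should I use a potentially buggy original sense index for 1.5 or my generated one? (see below)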
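I found some extra files hosted here: https://wordnetcode.princeton.edu/1.5/. Unfortunately there is still no Unix version of the database, and no LICENSE file, but I did find a sense index that wasn't distributed with the database zip file. I compared it (SENSE.IDX below) to the one I produced (index.sense, ignoring the sense counts in the final column with a sed command as 1.5 didn't include them), and I find only two differences:
$ diff etc/SENSE.IDX <( sed -E 's/ [0-9]+$//' etc/WordNet-1.5/DICT/index.sense )
36445a36446
> create%2:36:00:: 00926188 5
36450d36450
< create%2:36:16:: 00926188 5
91235a91236
> make%2:36:00:: 00952386 24
91245d91245
< make%2:36:16:: 00952386 24
For some reason the original sense index has the lex_id of these two as 16 instead of 00. Interestingly, the lex_id in the sense index is a two-digit decimal number while in the data files it is a one-digit hexadecimal number, so 16 is not even a valid lex_id (the highest, f, would be 15 in decimal). So this seems like a bug of sorts, maybe a kind of overflow error. I also note that only make has a lex_id of f in the data file, so it's possible the lexicographer files have > 16 senses for make, while the highest lex_id for create is just d.
I also found database files for versions back to 1.2, but I won't bother with those as they aren't even listed on the Princeton WordNet's page about old versions. |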
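To make the lex_id discrepancy in the diff concrete, here is a small sketch. It relies on the observation made in this thread that the data files store lex_id as a one-digit hexadecimal number while the sense index uses a two-digit decimal number; the helper name `sense_key_lex_id` is hypothetical.

```python
def sense_key_lex_id(hex_digit):
    """Convert a one-digit hexadecimal lex_id (as in a data file) to the
    two-digit decimal form used in sense keys."""
    if len(hex_digit) != 1:
        raise ValueError(f"expected a single hex digit, got {hex_digit!r}")
    return f"{int(hex_digit, 16):02d}"


# Every lex_id value that a one-digit hex field can produce
VALID_LEX_IDS = {sense_key_lex_id(d) for d in "0123456789abcdef"}

print(sense_key_lex_id("f"))    # 15
print("16" in VALID_LEX_IDS)    # False
```

Under that assumption, a lex_id of 16 in the original sense index cannot correspond to any value in the data files, which is consistent with it being an overflow of some kind.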
While we're on the topic, the licenses we currently package with |
Hi,
On Sat, 14 Dec 2024 at 21:20, Michael Wayne Goodman < ***@***.***> wrote:
> Trying to wrap this up. I'm able to now generate the WN-LMF for all,
> including WordNet-1.5, but some details are missing. My questions are:
> 1. *Do we need to create our own README files for pre-3.0 versions?*
> @fcbond <https://github.com/fcbond> we did this for 3.0
> <https://github.com/omwn/omw-data/blob/main/wns/en30/README.md> and 3.1
> <https://github.com/omwn/omw-data/blob/main/wns/en31/README.md> where
> we listed changes from the original. Alternatively we could repackage the
> original READMEs, although they contain a lot of outdated info (like
> requesting copies of WordNet on diskette or magnetic tape).
I think we should create our own README files. Maybe also include the
original as README.original?
> 2. *What citation do we use for the earlier versions?* I see that
> WordNet 1.5 and 1.6 were copyright 1995 and 1997, respectively, and the
> Fellbaum citation is 1998, then WordNet 1.7 is copyright 2001. Maybe cite
> Miller for 1.5 and 1.6 and Fellbaum from 1.7 on?
That makes sense.
> 3. *Should I use a potentially buggy original sense index for 1.5 or
> my generated one?* (see below)
> I found some extra files hosted here:
> https://wordnetcode.princeton.edu/1.5/. Unfortunately there is still no
> Unix version of the database, and no LICENSE file, but I did find a sense
> index that wasn't distributed with the database zip file. I compared it (
> SENSE.IDX below) to the one I produced (index.sense, ignoring the sense
> counts in the final column with a sed command as 1.5 didn't include
> them), and I find only two differences:
> $ diff etc/SENSE.IDX <( sed -E 's/ [0-9]+$//' etc/WordNet-1.5/DICT/index.sense )
> 36445a36446
> > create%2:36:00:: 00926188 5
> 36450d36450
> < create%2:36:16:: 00926188 5
> 91235a91236
> > make%2:36:00:: 00952386 24
> 91245d91245
> < make%2:36:16:: 00952386 24
> For some reason the original sense index has the lex_id of these two as 16
> instead of 00. Interestingly, the lex_id in the sense index is a
> two-digit decimal number while in the data files it is a one-digit
> hexadecimal number, so 16 is not even a valid lex_id (the highest, f,
> would be 15 in decimal). So this seems like a bug of sorts, maybe a kind
> of overflow error. I also note that only *make* has a lex_id of f in the
> data file, so it's possible the lexicographer files have > 16 senses for
> *make*, while the highest lex_id for *create* is just d.
> I also found database files for versions back to 1.2, but I won't bother
> with those as they aren't even listed on the Princeton WordNet's page about
> old versions.
I would use your index file, and document the difference somewhere.
--
Francis Bond <https://fcbond.github.io/>
|
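As a sketch of how the differences between the two sense indexes could be collected for documentation, the following mirrors the `sed` and `diff` comparison quoted above in Python. Note that it uses a set comparison rather than a line-by-line diff, so it reports lines unique to each file but not ordering differences. The function names are hypothetical, and the assumption (taken from the thread) is that only the generated index carries the trailing count column.

```python
import re

# Same pattern as the sed command: a space and digits at the end of the line
TRAILING_COUNT = re.compile(r" [0-9]+$")


def strip_count(line):
    """Remove the final count column from a generated index line."""
    return TRAILING_COUNT.sub("", line.rstrip("\n"))


def index_differences(original_lines, generated_lines):
    """Return (only_in_original, only_in_generated) as sorted lists."""
    original = {line.rstrip("\n") for line in original_lines}
    generated = {strip_count(line) for line in generated_lines}
    return sorted(original - generated), sorted(generated - original)


# The "create" entries come from the diff in this thread; the "placeholder"
# entry and all trailing counts are made up for illustration.
original = ["create%2:36:16:: 00926188 5\n", "placeholder%1:00:00:: 00000000 1\n"]
generated = ["create%2:36:00:: 00926188 5 0\n", "placeholder%1:00:00:: 00000000 1 0\n"]
only_original, only_generated = index_differences(original, generated)
print(only_original)
print(only_generated)
```

With real files, the two lists could be written into a README section listing where the distributed index and the regenerated one disagree.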
Ok, I've pushed a commit to #42. All the PWN lexicons seem to be building. I still have some chores before it's ready, which I've added as items in the description of the pull request. |
Unless you have strong opinions on this, I'm now thinking that we should just use the original sense index in case someone ever annotated data with it. That means the generated sense indexes won't be used at all, but the exercise of recreating them was useful to discover bugs in the originals (the lex-id here, the (a) and (p) being on the head word and incorrect counts in others). |
Hi,
that makes sense. Do you think it is worth documenting the differences
in our READMEs? I think Eric Kafe has already dealt with a lot of the
issues in https://github.com/ekaf/ski
On Wed, 8 Jan 2025 at 02:33, Michael Wayne Goodman <***@***.***> wrote:
> > I would use your index file, and document the difference somewhere.
> Unless you have strong opinions on this, I'm now thinking that we should
> just use the original sense index in case someone ever annotated data with
> it. That means the generated sense indexes won't be used at all, but the
> exercise of recreating them was useful to discover bugs in the originals
> (the lex-id here, the (a) and (p) being on the head word and incorrect
> counts in others).
--
Francis Bond <https://fcbond.github.io/>
|
Yes, that's a good idea.
Thanks, that's the first I've seen of that project. |
@fcbond @jmccrae I'm almost finished, but I've come across some WN-LMF validation errors for WN 1.5, 1.6, and 1.7 and I could use your feedback. There were issues with redundant senses (i.e., the same word appearing more than once on a synset in a WNDB data file), but I just suppressed the extras. There were also some redundant sense relations (also in WN 2.1, 3.0, and 3.1) which I also suppressed. The remaining issue is with a single adverb entry's
The line is the same for 1.6 and 1.7 except for the offsets. The issue here is the What do you think? edit: to be clear, I don't want to fix problems with the wordnets, but the pointer issue and redundant sense issue make the XML invalid against the WN-LMF 1.3 schema. The redundant sense relations don't cause validation problems, so I could keep them in. |
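A minimal sketch of suppressing redundant entries while keeping a record of what was dropped, which could be useful if the suppressions are later documented. It assumes senses and sense relations can each be represented as hashable tuples; the tuple shapes and the function name `drop_redundant` are illustrative assumptions, not the data model of `wndb2lmf.py`.

```python
def drop_redundant(items):
    """Split items into (kept, dropped), keeping the first occurrence of
    each item in its original order."""
    seen = set()
    kept, dropped = [], []
    for item in items:
        if item in seen:
            dropped.append(item)
        else:
            seen.add(item)
            kept.append(item)
    return kept, dropped


# Made-up example: the same word listed twice on one synset
words = ["alpha", "beta", "alpha"]
kept, dropped = drop_redundant(words)
print(kept, dropped)

# Made-up example: sense relations as (source, relation type, target) tuples
relations = [("s1", "antonym", "s2"), ("s1", "antonym", "s2"), ("s2", "antonym", "s1")]
print(drop_redundant(relations)[0])
```

Returning the dropped items separately leaves open the option of keeping the redundant sense relations, since those do not cause validation problems.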
Okay
It has been fixed in future versions of PWN, so I would fix it here.
|
For context: goodmami/wn#199
There are two main issues:
older-wn-mappings/ has 3 columns instead of 2
verb.Framestext, so we'd need to hard-code the frames into the script