Update wndb2lmf to build Pre-3.0 WordNets #42

Open: goodmami wants to merge 21 commits into main

@goodmami goodmami commented Oct 28, 2024

See #38

Summary of changes:

  • Hard-code syntactic frames
  • Read ILI files with confidence scores; allow confidence threshold in conversion script
  • Split WNDB-reading functions into wndb.py module
  • Add build_senseidx.py to rebuild an index.sense file from WordNet data/index/cntlist files
  • Don't use case-insensitive matching for exceptional forms
  • Figure out why syntactic frames aren't showing up anymore
  • Update build.sh to build all versions of WordNet
    • Add custom code for downloading and renaming files for WordNet 1.5 so it can be converted
    • Build the index.sense files as appropriate for each WordNet (as discussed here)
    • Edit the READMEs for each omw-en* lexicon (summary of changes, include original README)
    • Validate the generated XML
    • Change WN-LMF to 1.3
    • Remove old directories wns/en30/, wns/en31/, and wns/pwn/
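
As a rough illustration of the confidence-threshold item above, here is a minimal sketch of filtering an ILI mapping by score. The file layout (tab-separated ILI ID, synset ID, optional confidence), the function name `read_ili_map`, and the sample rows are all assumptions made for illustration; they are not taken from the PR's code or data.

```python
import csv
import io


def read_ili_map(lines, threshold=0.0):
    """Return {synset_id: ili_id} for rows scoring at or above `threshold`.

    Assumed (hypothetical) layout: tab-separated ILI ID, synset ID, and an
    optional confidence column. Rows without a score are treated as 1.0.
    """
    mapping = {}
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        ili, synset = row[0], row[1]
        confidence = float(row[2]) if len(row) > 2 and row[2] else 1.0
        if confidence >= threshold:
            mapping[synset] = ili
    return mapping


# Made-up rows, for illustration only.
sample = "i1\t00000001-n\t1.0\ni2\t00000002-n\t0.6\ni3\t00000003-n\n"
print(read_ili_map(io.StringIO(sample), threshold=0.8))
```

Treating a missing score as full confidence keeps files without a confidence column usable, but that default is a design choice of this sketch, not something the PR states.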

Earlier versions of the Princeton WordNet did not include the
verb.Framestext file, and the frames themselves never changed across
versions, so it's easier to hard-code them than to load them from a
file. The only potential issue I see is that this content is copied
from the copyrighted WordNet documentation and there might not be
enough attribution. I do link back to the documentation, so hopefully
we're good there.

build_senseidx.py will create exact replicas of the index.sense files
for WordNet 1.7 and higher versions.

For WordNet 1.6, you can get close with the --use-adjposition option,
but the counts would need to be reset to 0 for any satellite adjective
sense key whose head word is followed by an adjposition marker, (a) or
(p).
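
For readers unfamiliar with the file being rebuilt, the following is a minimal sketch of how a sense key and an index.sense line are laid out, following the format described in the WordNet senseidx documentation (`lemma%ss_type:lex_filenum:lex_id:head_word:head_id`, then synset offset, sense number, and tag count). The function names are hypothetical and this is not the code in build_senseidx.py; the offset and counts in the example call are placeholders.

```python
# ss_type codes used in sense keys: noun, verb, adjective, adverb,
# adjective satellite.
SS_TYPES = {"n": 1, "v": 2, "a": 3, "r": 4, "s": 5}


def sense_key(lemma, pos, lex_filenum, lex_id, head_word="", head_id=None):
    """Format a sense key; head_word/head_id are only used for satellites."""
    head = f"{head_id:02d}" if head_id is not None else ""
    return (
        f"{lemma}%{SS_TYPES[pos]}:{lex_filenum:02d}:{lex_id:02d}:"
        f"{head_word}:{head}"
    )


def index_sense_line(key, synset_offset, sense_number, tag_cnt):
    """Format one index.sense line: key, 8-digit offset, sense number, count."""
    return f"{key} {synset_offset:08d} {sense_number} {tag_cnt}"


key = sense_key("dog", "n", 5, 0)
# Placeholder offset, sense number, and tag count.
print(index_sense_line(key, 1234567, 1, 0))
```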

WordNet 1.5 did not have an index.sense file distributed with it.

- word = "Original_word" as in the WNDB data files
- respaced = "Original word" with spaces instead of _
- lemma = "original_word" as in the WNDB index files
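
The three forms listed above can be related by a small sketch like the one below. It assumes that the index-file lemma is simply the lower-cased data-file form; the function name `word_forms` is hypothetical, and the real conversion code may handle further details (such as adjective markers) that are not shown here.

```python
def word_forms(word):
    """Return (word, respaced, lemma) for a word as written in a data file."""
    respaced = word.replace("_", " ")  # display form with spaces
    lemma = word.lower()               # assumed: index files use lower case
    return word, respaced, lemma


print(word_forms("Original_word"))
```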

The frames were being passed to Wn's LMF module in the 1.0 format and
weren't being written to the 'subcat' attribute on senses. This is now
fixed.
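
To make the two points about frames concrete, here is a minimal sketch of a hard-coded frame table and of writing frame references to a sense's `subcat` attribute, in the style of WN-LMF 1.1 and later (frames declared once per lexicon, senses referring to them by ID). Only a four-frame excerpt is shown, and the ID scheme and function names are hypothetical. The sketch uses the standard library's ElementTree rather than Wn's LMF module, so it is not the PR's implementation.

```python
import xml.etree.ElementTree as ET

# Excerpt only; the full frame inventory is in the WordNet documentation.
VERB_FRAMES = {
    1: "Something ----s",
    2: "Somebody ----s",
    8: "Somebody ----s something",
    9: "Somebody ----s somebody",
}


def frame_id(lexicon_id, num):
    """Build a frame ID (hypothetical naming scheme)."""
    return f"{lexicon_id}-frame-{num}"


def add_frames(lexicon, lexicon_id):
    """Declare each frame once as a child of the lexicon element."""
    for num, text in VERB_FRAMES.items():
        ET.SubElement(
            lexicon,
            "SyntacticBehaviour",
            id=frame_id(lexicon_id, num),
            subcategorizationFrame=text,
        )


def set_subcat(sense, lexicon_id, frame_nums):
    """Reference frames from a sense as a space-separated list of IDs."""
    ids = (frame_id(lexicon_id, n) for n in frame_nums)
    sense.set("subcat", " ".join(ids))


lexicon = ET.Element("Lexicon", id="omw-en15")
add_frames(lexicon, "omw-en15")
sense = ET.Element("Sense", id="example-sense")  # placeholder sense ID
set_subcat(sense, "omw-en15", [2, 8])
print(ET.tostring(sense, encoding="unicode"))
```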

Also try to make it more robust for WN1.5.
@goodmami goodmami marked this pull request as ready for review January 21, 2025 00:36
@goodmami goodmami requested a review from fcbond January 21, 2025 00:37
@goodmami
Collaborator Author

@fcbond Sorry for the large PR. All the lexicons (including the new ones) now pass validation:

$ ./validate.sh 1.5
build/omw-1.5/omw-arb/omw-arb.xml - valid
build/omw-1.5/omw-bg/omw-bg.xml - valid
build/omw-1.5/omw-ca/omw-ca.xml - valid
build/omw-1.5/omw-cmn/omw-cmn.xml - valid
build/omw-1.5/omw-da/omw-da.xml - valid
build/omw-1.5/omw-el/omw-el.xml - valid
build/omw-1.5/omw-en15/omw-en15.xml - valid
build/omw-1.5/omw-en16/omw-en16.xml - valid
build/omw-1.5/omw-en171/omw-en171.xml - valid
build/omw-1.5/omw-en17/omw-en17.xml - valid
build/omw-1.5/omw-en20/omw-en20.xml - valid
build/omw-1.5/omw-en21/omw-en21.xml - valid
build/omw-1.5/omw-en30/omw-en30.xml - valid
build/omw-1.5/omw-en31/omw-en31.xml - valid
build/omw-1.5/omw-es/omw-es.xml - valid
build/omw-1.5/omw-eu/omw-eu.xml - valid
build/omw-1.5/omw-fi/omw-fi.xml - valid
build/omw-1.5/omw-fr/omw-fr.xml - valid
build/omw-1.5/omw-gl/omw-gl.xml - valid
build/omw-1.5/omw-he/omw-he.xml - valid
build/omw-1.5/omw-hr/omw-hr.xml - valid
build/omw-1.5/omw-id/omw-id.xml - valid
build/omw-1.5/omw-is/omw-is.xml - valid
build/omw-1.5/omw-it/omw-it.xml - valid
build/omw-1.5/omw-iwn/omw-iwn.xml - valid
build/omw-1.5/omw-ja/omw-ja.xml - valid
build/omw-1.5/omw-lt/omw-lt.xml - valid
build/omw-1.5/omw-nb/omw-nb.xml - valid
build/omw-1.5/omw-nl/omw-nl.xml - valid
build/omw-1.5/omw-nn/omw-nn.xml - valid
build/omw-1.5/omw-pl/omw-pl.xml - valid
build/omw-1.5/omw-pt/omw-pt.xml - valid
build/omw-1.5/omw-ro/omw-ro.xml - valid
build/omw-1.5/omw-sk/omw-sk.xml - valid
build/omw-1.5/omw-sl/omw-sl.xml - valid
build/omw-1.5/omw-sq/omw-sq.xml - valid
build/omw-1.5/omw-sv/omw-sv.xml - valid
build/omw-1.5/omw-th/omw-th.xml - valid
build/omw-1.5/omw-zsm/omw-zsm.xml - valid

Feel free to review the whole thing if you have time, but otherwise please just pay attention to the changes to scripts/tsv2lmf.py since I believe you also have some changes to merge in this file.
