Update wndb2lmf to build Pre-3.0 WordNets #42

Merged (22 commits) on Jan 23, 2025

@goodmami (Collaborator) commented Oct 28, 2024

See #38

Summary of changes:

  • Hard-code syntactic frames
  • Read ILI files with confidence scores; allow confidence threshold in conversion script
  • Split WNDB-reading functions into wndb.py module
  • Add build_senseidx.py to rebuild an index.sense file from WordNet data/index/cntlist files
  • Don't use case-insensitive matching for exceptional forms
  • Figure out why syntactic frames aren't showing up anymore
  • Update build.sh to build all versions of WordNet
    • Add custom code for downloading and renaming files for WordNet 1.5 so it can be converted
    • Build the index.sense files as appropriate for each WordNet (as discussed here)
    • Edit the READMEs for each omw-en* lexicon (summary of changes, include original README)
    • Validate the generated XML
    • Change WN-LMF to 1.3
    • Remove old directories wns/en30/, wns/en31/, and wns/pwn/
    • Change the <Requires> element on non-English lexicons to point to omw-en30:1.5
    • Update index.toml

Earlier versions of the Princeton WordNet did not include
verb.Framestext, and the syntactic frames never changed across
versions, so it's easier to just hard-code them than to load them from
a file. The only potential issue I see is that this is content copied
from the copyrighted WordNet documentation and there might not be
enough attribution. I do link back to the documentation, so hopefully
we're good there.

build_senseidx.py will create exact replicas of the index.sense files
for WordNet 1.7 and higher versions.

For WordNet 1.6, you can get close with the --use-adjposition option,
but the counts would need to be reset to 0 for any satellite adjective
sense key that has an adjposition, (a) or (p), after its head word.
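
The count reset described above could look roughly like the following sketch. It is not code from this PR: the function name is hypothetical, and it assumes that each index.sense line has the four fields `sense_key synset_offset sense_number tag_cnt`, that the sense key has the form `lemma%ss_type:lex_filenum:lex_id:head_word:head_id`, and that satellite adjectives use `ss_type` 5.

```python
import re

# Adjposition markers mentioned above: (a) or (p) at the end of the head word.
ADJPOSITION = re.compile(r"\((?:a|p)\)$")


def reset_adjposition_count(line: str) -> str:
    """Return an index.sense line with its count zeroed if it is a
    satellite adjective sense key whose head word carries an adjposition.

    Hypothetical helper; the field layout is an assumption.
    """
    sense_key, offset, sense_number, tag_cnt = line.split()
    _lemma, _, lex_sense = sense_key.partition("%")
    ss_type, _lex_filenum, _lex_id, head_word, _head_id = lex_sense.split(":")
    if ss_type == "5" and ADJPOSITION.search(head_word):
        tag_cnt = "0"
    return " ".join([sense_key, offset, sense_number, tag_cnt])
```

Lines without a marker on the head word, or with another `ss_type`, pass through unchanged.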

WordNet 1.5 was not distributed with an index.sense file.

Word forms are named as follows:
- word = "Original_word" as in the WNDB data files
- respaced = "Original word" with spaces instead of _
- lemma = "original_word" as in the WNDB index files
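
As an illustration of the three forms listed above, here is a minimal sketch that derives them from a word as it appears in a WNDB data file. The function name is hypothetical and is not taken from wndb.py; it also assumes that lowercasing is the only difference between the data-file word and the index-file lemma.

```python
def word_forms(word: str) -> dict[str, str]:
    """Derive the three spellings from a WNDB data-file word.

    Hypothetical helper for illustration only.
    """
    return {
        "word": word,                        # as in the WNDB data files
        "respaced": word.replace("_", " "),  # underscores become spaces
        "lemma": word.lower(),               # as in the WNDB index files
    }


forms = word_forms("Original_word")
print(forms["word"], "|", forms["respaced"], "|", forms["lemma"])
```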

The frames were being sent to Wn's LMF in the 1.0 format and weren't
being written in the 'subcat' attribute on senses. This is now fixed.

Also try to make the conversion more robust for WordNet 1.5.
@goodmami goodmami marked this pull request as ready for review January 21, 2025 00:36
@goodmami goodmami requested a review from fcbond January 21, 2025 00:37
@goodmami (Collaborator, Author) commented:

@fcbond Sorry for the large PR. All the lexicons (including the new ones) now pass validation:

$ ./validate.sh 1.5
build/omw-1.5/omw-arb/omw-arb.xml - valid
build/omw-1.5/omw-bg/omw-bg.xml - valid
build/omw-1.5/omw-ca/omw-ca.xml - valid
build/omw-1.5/omw-cmn/omw-cmn.xml - valid
build/omw-1.5/omw-da/omw-da.xml - valid
build/omw-1.5/omw-el/omw-el.xml - valid
build/omw-1.5/omw-en15/omw-en15.xml - valid
build/omw-1.5/omw-en16/omw-en16.xml - valid
build/omw-1.5/omw-en171/omw-en171.xml - valid
build/omw-1.5/omw-en17/omw-en17.xml - valid
build/omw-1.5/omw-en20/omw-en20.xml - valid
build/omw-1.5/omw-en21/omw-en21.xml - valid
build/omw-1.5/omw-en30/omw-en30.xml - valid
build/omw-1.5/omw-en31/omw-en31.xml - valid
build/omw-1.5/omw-es/omw-es.xml - valid
build/omw-1.5/omw-eu/omw-eu.xml - valid
build/omw-1.5/omw-fi/omw-fi.xml - valid
build/omw-1.5/omw-fr/omw-fr.xml - valid
build/omw-1.5/omw-gl/omw-gl.xml - valid
build/omw-1.5/omw-he/omw-he.xml - valid
build/omw-1.5/omw-hr/omw-hr.xml - valid
build/omw-1.5/omw-id/omw-id.xml - valid
build/omw-1.5/omw-is/omw-is.xml - valid
build/omw-1.5/omw-it/omw-it.xml - valid
build/omw-1.5/omw-iwn/omw-iwn.xml - valid
build/omw-1.5/omw-ja/omw-ja.xml - valid
build/omw-1.5/omw-lt/omw-lt.xml - valid
build/omw-1.5/omw-nb/omw-nb.xml - valid
build/omw-1.5/omw-nl/omw-nl.xml - valid
build/omw-1.5/omw-nn/omw-nn.xml - valid
build/omw-1.5/omw-pl/omw-pl.xml - valid
build/omw-1.5/omw-pt/omw-pt.xml - valid
build/omw-1.5/omw-ro/omw-ro.xml - valid
build/omw-1.5/omw-sk/omw-sk.xml - valid
build/omw-1.5/omw-sl/omw-sl.xml - valid
build/omw-1.5/omw-sq/omw-sq.xml - valid
build/omw-1.5/omw-sv/omw-sv.xml - valid
build/omw-1.5/omw-th/omw-th.xml - valid
build/omw-1.5/omw-zsm/omw-zsm.xml - valid

Feel free to review the whole thing if you have time, but otherwise please just pay attention to the changes to scripts/tsv2lmf.py since I believe you also have some changes to merge in this file.

@goodmami (Collaborator, Author) commented:

I forgot to have the non-English lexicons require omw-en30:1.5 instead of omw-en:1.4.

Also note that I now have the 30 in omw-en30 to be consistent with all the others.
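
For readers with their own lexicon files, the dependency change could be applied with something like the sketch below. It is not the script used in this PR: the function name is hypothetical, and it assumes the lexicon declares its dependency as `<Requires id="omw-en" version="1.4"/>`. Note that `xml.etree.ElementTree` drops the DOCTYPE declaration when parsing, so it would need to be written back separately.

```python
import xml.etree.ElementTree as ET


def retarget_requires(
    xml_text: str,
    old: tuple[str, str] = ("omw-en", "1.4"),
    new: tuple[str, str] = ("omw-en30", "1.5"),
) -> str:
    """Point matching <Requires> elements at a new lexicon id and version.

    Hypothetical helper; only elements matching `old` exactly are changed.
    """
    root = ET.fromstring(xml_text)
    for req in root.iter("Requires"):
        if (req.get("id"), req.get("version")) == old:
            req.set("id", new[0])
            req.set("version", new[1])
    return ET.tostring(root, encoding="unicode")
```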

@goodmami goodmami merged commit 01f1b44 into main Jan 23, 2025
@goodmami (Collaborator, Author) commented:

@fcbond merging so we can move ahead. If you have tsv2lmf.py changes please do see what has changed here (similarly if you have changes to wndb2lmf.py). Let me know if you need help with any conflicts.

@fcbond (Contributor) commented Jan 25, 2025 via email

@goodmami goodmami deleted the gh-38-older-pwn-versions branch March 2, 2025 20:39