Feature to modify wordnets in the database #17
Comments
I'm not adding support just yet, but I'm thinking of adding a I want to add this column to the next release because I want to group as many schema-related changes as possible into one release, in order to reduce the number of times people have to rebuild. |
My first thought was to set up some triggers so that any changes to the database, whether using Wn or not, would set the |
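A minimal sketch of the trigger idea, using Python's built-in sqlite3 module. The two-table schema, the `modified` column name, and the helper names are assumptions made for illustration; they are not Wn's actual schema.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory toy database (NOT Wn's real schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE lexicons (
            rowid INTEGER PRIMARY KEY,
            id TEXT NOT NULL,
            modified INTEGER NOT NULL DEFAULT 0  -- hypothetical flag column
        );
        CREATE TABLE synsets (
            rowid INTEGER PRIMARY KEY,
            lexicon_rowid INTEGER NOT NULL REFERENCES lexicons (rowid),
            id TEXT NOT NULL
        );

        -- Any change to a synset marks its lexicon as modified, no matter
        -- which program issued the statement.
        CREATE TRIGGER synsets_insert AFTER INSERT ON synsets
        BEGIN
            UPDATE lexicons SET modified = 1 WHERE rowid = NEW.lexicon_rowid;
        END;
        CREATE TRIGGER synsets_update AFTER UPDATE ON synsets
        BEGIN
            UPDATE lexicons SET modified = 1 WHERE rowid = NEW.lexicon_rowid;
        END;
        CREATE TRIGGER synsets_delete AFTER DELETE ON synsets
        BEGIN
            UPDATE lexicons SET modified = 1 WHERE rowid = OLD.lexicon_rowid;
        END;
        """
    )
    return conn


def is_modified(conn: sqlite3.Connection, lexicon_id: str) -> bool:
    """Return True if the toy lexicon has been flagged as modified."""
    row = conn.execute(
        "SELECT modified FROM lexicons WHERE id = ?", (lexicon_id,)
    ).fetchone()
    return bool(row[0])
```

In a setup like this, the initial load would also fire the insert trigger, so the loader would presumably reset the flag to 0 once loading is finished.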
Is there any progress on this? Thanks for your work btw :) |
@Hypercookie No, not yet. This feature won't be implemented until it has some higher priority, because new features -> more code to maintain, and, for the moment, I'm the only maintainer. I'm going to edit this issue to make more clear how to increase the priority (and feel free to respond accordingly). |
Thanks for your response @goodmami |
On Mon, 21 Mar 2022 at 16:05, Jannes Müller ***@***.***> wrote:
Thanks for your response @goodmami <https://github.com/goodmami>
I work for a Research Project in Natural Language Processing. We want to
enable our users to add their own relations between words/synsets and modify
existing ones. Since this feature has relatively high priority for us, I have
forked this project and will try to implement the needed features myself. I
may open a pull request once this is done and cleaned up.
That sounds very useful, thanks.
…--
Francis Bond <https://fcbond.github.io/>
|
Thanks, @Hypercookie, for explaining its importance, and for taking the initiative to implement it. I have some further thoughts regarding the implementation if this were to be merged into this repository:
|
Thanks for your input :) I have already started on this.
reset_all_wordnets()  # Resets all modified wordnets
w = wn.Wordnet("odenet")
t1 = w.synset("odenet-1-n")
t2 = w.synset("odenet-10-n")
# Deleting
print(t2.hypernyms())  # -> [Synset('odenet-4866-n')]
SynsetEditor(t2.hypernyms()[0]).delete()
print(t2.hypernyms())  # -> []
# Creating / Modifying
# Since there is no lexid and no Synset passed, this will create a new
# Synset in the Lexicon with id 'odenet'
# Calls to modifications can be chained to make the interface more fluent
e = SynsetEditor("odenet").set_hypernym_of(t1).set_meronym_of(t2)
print(t1.hypernyms())  # -> [Synset('odenet-5437-n'), Synset('odenet-362443-mod')]
# The last entry, Synset('odenet-362443-mod'), is the one we created
Note that this needs some more work obviously (For example the
e = SynsetEditor("odenet")
e["definition"] = "Fancy Definition"
e["hypernyms"] = [fancy_synset, fancy_synset_2]
But I think that wrapping the editor in a separate class ensures that no accidents occur either way.
I will try to make this as nice as possible, but time is also an issue for us (as it is for everybody) |
Hi, just a comment on the ids. I think the original odenet ids (and the same for most wordnets) are wordnet-offset-pos, where offset is the offset of the corresponding synset in Princeton WordNet 3.0, and pos is one of the small set of wordnet pos codes (originally n, v, a, r, but now extended to a few more; see below) from constants.py
If you keep to this convention and just generate new integers from, say, 20,000,000 upward (and, I guess, check for a clash if multiple people are adding things), it may make it easier for people to debug. PWN offsets are all below 20,000,000, ... |
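The suggestion above could be sketched roughly as follows. The function name, its parameters, and the unpadded offset format are assumptions for illustration; the clash check is simply a lookup against IDs that are already known.

```python
from typing import Iterable

# Per the comment above, PWN 3.0 offsets are all below this value, so new
# synsets numbered from here upward cannot collide with PWN-derived ones.
NEW_OFFSET_START = 20_000_000


def next_synset_id(
    prefix: str,
    pos: str,
    existing_ids: Iterable[str],
    start: int = NEW_OFFSET_START,
) -> str:
    """Return the first unused ID of the form prefix-offset-pos (hypothetical helper)."""
    taken = set(existing_ids)
    offset = start
    # Skip offsets that someone else has already used for this prefix and pos.
    while f"{prefix}-{offset}-{pos}" in taken:
        offset += 1
    return f"{prefix}-{offset}-{pos}"
```

A caller would pass the IDs already present in the lexicon, for example `next_synset_id("odenet", "n", known_ids)`, where `known_ids` is whatever collection of existing synset IDs is at hand.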
Thanks very much :) I will do that! |
Maybe some updates :
Maybe you guys have some more input on this :) |
(recreating what I think was the context)
The
Great, it's nice to hear you're making steady progress.
Actually I would rather not have that API, for some reasons:
Instead I'd prefer something like this:
>>> from wn.editor import LexiconEditor
>>> es_editor = LexiconEditor('omw-es:1.4')  # unlike wn.Wordnet, this works with exactly 1 lexicon at a time
Then I'm a bit less particular about how that object is used, but maybe:
>>> es_editor.add_synset(id=ssid, ili=..., ...)
>>> es_editor.add_word(id=wid, pos=..., ...)
>>> es_editor.add_sense(id=sid, synset=ssid, word=wid, ...)
>>> es_editor.add_sense_relation(id=sid, target=..., relation=...)
>>> es_editor.commit()  # commit transaction in DB
A set of methods like this instead of a
The shortcuts are a nice idea. You might allow a function similar to wn.util.synset_id_formatter() for auto-creating word and sense IDs with a reasonable default (e.g., see _make_entry_id() and escape_lemma() in the omw-data code). |
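To make the transaction-style interface proposed above more concrete, here is a rough sketch against a toy SQLite schema. Everything in it is an assumption for illustration: the table layout is not Wn's schema, the class takes an explicit connection (the proposal does not), and only three of the proposed methods are shown.

```python
import sqlite3


def create_toy_schema(conn: sqlite3.Connection) -> None:
    """Create a tiny illustrative schema (NOT Wn's real schema)."""
    conn.executescript(
        """
        CREATE TABLE synsets (lexicon TEXT, id TEXT);
        CREATE TABLE words (lexicon TEXT, id TEXT, pos TEXT);
        CREATE TABLE senses (lexicon TEXT, id TEXT, synset TEXT, word TEXT);
        """
    )


class LexiconEditor:
    """Hypothetical editor bound to exactly one lexicon."""

    def __init__(self, conn: sqlite3.Connection, lexicon: str) -> None:
        self.conn = conn
        self.lexicon = lexicon

    def add_synset(self, id: str) -> None:
        self.conn.execute(
            "INSERT INTO synsets (lexicon, id) VALUES (?, ?)",
            (self.lexicon, id),
        )

    def add_word(self, id: str, pos: str) -> None:
        self.conn.execute(
            "INSERT INTO words (lexicon, id, pos) VALUES (?, ?, ?)",
            (self.lexicon, id, pos),
        )

    def add_sense(self, id: str, synset: str, word: str) -> None:
        self.conn.execute(
            "INSERT INTO senses (lexicon, id, synset, word) VALUES (?, ?, ?, ?)",
            (self.lexicon, id, synset, word),
        )

    def commit(self) -> None:
        """Make all pending additions permanent."""
        self.conn.commit()

    def rollback(self) -> None:
        """Discard all additions made since the last commit."""
        self.conn.rollback()
```

With Python's default sqlite3 transaction handling, the inserts stay pending until `commit()` is called, so a batch of related additions (synset, word, sense) either lands together or can be discarded with `rollback()`.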
Hi ;)
I see now what you mean. That was also my first approach because it looked like an easy thing to do, but then I noticed that if we want to modify something that already exists, we will not (at least I believe so) get around modifying the SQLite db itself; the same goes for deletion. And if we are already modifying the db, it would actually mean more code to create entries via the lmf module than to just create the corresponding row and then edit the object with the methods that modify existing objects. (See below for some more info on what I mean.)
Agreed!
I also do! And that's why I tried this at first, and I quickly came to realize that if we want to modify existing synsets, this approach, where a single class manages everything, will become very complicated and totally messy. We would need a method for everything a user might want to do, for example modifying the pronunciation of a word. We would then not have a clear interface but a collection of methods. That's why I decided to let the user decide which granularity they want.
SynsetEditor('omw-es:1.4').definition('fancy synset')
# Since the passed string is a lexicon id, this will create a new synset and return its editor instance.
# But we can also do:
SynsetEditor(wn.synsets('Car')[0]).definition('this is not a car')
In fact a real world scenario (debatable) would look like this:
import wn
from wn.mod import *
reset_all_wordnets()  # Reset all wordnets
# Create a new Synset in the 'odenet' lexicon.
# Add the words "Audi" and "Mercedes" to it (without caring about Senses or Forms) and retrieve the Synset
syn: wn.Synset = SynsetEditor("odenet").add_word("Audi").add_word("Mercedes").synset  # This will become a clean method.
# Spawn a new Editor for the Auto synset (which already exists in the 'odenet' lexicon) and make it a hypernym of the
# previous synset.
SynsetEditor(wn.synsets("Auto")[0]).set_hypernym_of(syn)
print("words in synset:\n ")
for i in syn.words():
    print(i.lemma())
print("\n\nhypernyms: \n")
for i in syn.hypernyms():
    for word in i.words():
        print(word.lemma())
# Don't want the synset anymore?
SynsetEditor(syn).delete()
This reuses so much code that in fact I mostly only wrote code to modify existing entries. Edit: Maybe something about the formatter functions. This is probably the only point where this approach bothers me... If we want a user to create an entry, they will just pass the lexicon id to specify the lexicon. If we now want to actually create the entry in the db, we have no idea which position or which forms it will have ... so we have to either set a fixed id without much meaning (the position can be modified) or modify the id as soon as the forms are set and we have a lemma. I dislike both approaches. The third one would be to wait before actually creating an object and using something like |
https://hyper-wn.readthedocs.io/en/latest/api/wn.editor.html |
https://github.com/Hypercookie/wn/projects |
@Hypercookie thanks for sharing. It looks like you've put in a ton of work, and it's nice to see it shaping up. As it currently is, however, I don't think I'm prepared to accept a PR with a nearly 1800-line module, as that would greatly increase the maintenance burden in Wn for a feature that, as useful as it is, would still be used by a minority of users. So it might make more sense to distribute it as a separate package. To that end, I'm happy to find a way to expose certain internals in Wn's public API (such as |
@goodmami No problem! I absolutely see your point. I think distributing it as an 'extra' package would be the nicest way. I'm happy to maintain the editor in my own repository for as long as it is needed. Maybe we could link it in the documentation somewhere? But I have to be honest with you: I have no idea of the necessary steps to create such an 'extra' package or prepare my repository for that. (Will google this later.) As to the internal Wn APIs, I (as you saw) basically only need access to |
Great, I'm glad that makes sense to you, too. Regarding the internals, |
I only need |
Yes, as long as you are working with only one lexicon, there will only be one Word, Sense, and Synset with the same ID. That is what the |
The problem is basically that when a user adds stuff, they could mess with those ids (by adding one with the same id), so if there are no unique constraints on the synsets and senses, the editor could become unpredictable. That's why I used the rowids so much: since they are primary keys, it is a bit cleaner to use them to identify rows (in my opinion), but I can understand if you don't want to expose internal ids of wn. So I would try to write the constraints in code. It would also be possible to add those constraints in the database, but I don't know if you want to change the schema. |
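For reference, a database-level constraint of the kind mentioned above might look like the sketch below. The table layout, index name, and helper functions are assumptions for illustration and do not reflect Wn's actual schema.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a toy synsets table with a per-lexicon uniqueness constraint."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE synsets (
            rowid INTEGER PRIMARY KEY,
            lexicon TEXT NOT NULL,
            id TEXT NOT NULL
        );
        -- The same id may appear in different lexicons, but only once per lexicon.
        CREATE UNIQUE INDEX synset_id_per_lexicon ON synsets (lexicon, id);
        """
    )
    return conn


def try_add_synset(conn: sqlite3.Connection, lexicon: str, synset_id: str) -> bool:
    """Insert a synset; return False if the (lexicon, id) pair already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO synsets (lexicon, id) VALUES (?, ?)",
                (lexicon, synset_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With the index in place, the database itself rejects a duplicate, so an editor would not need to re-check uniqueness in code before every insert.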
I definitely do not expect or encourage multiple synsets or senses with the same id and from the same lexicon. I'll see if I can get those constraints added. And I don't have a problem with people using the rowids, especially when working directly with the database, but I don't suggest using them from the public API classes (e.g., |
All right, that sounds reasonable. I will adapt the constructors of the editors to take an id and a lexicon, transform them into a rowid, and then continue as normal. I will log a warning or something if multiple rowids are found (i.e., the id is duplicated in this lexicon) and then simply take the first. If you get to adding the constraints (no pressure here), this will not occur anymore, but better safe than sorry. I will probably need until Monday or Thursday for that. |
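The lookup described above could be sketched like this. The table layout and function names are assumptions for illustration (not Wn's schema or the editor's actual code); the toy table deliberately has no uniqueness constraint so that duplicates are possible.

```python
import logging
import sqlite3
from typing import Optional

logger = logging.getLogger(__name__)


def create_toy_table(conn: sqlite3.Connection) -> None:
    """Create a toy synsets table WITHOUT a uniqueness constraint."""
    conn.execute(
        "CREATE TABLE synsets ("
        "rowid INTEGER PRIMARY KEY, lexicon TEXT NOT NULL, id TEXT NOT NULL)"
    )


def find_synset_rowid(
    conn: sqlite3.Connection, lexicon: str, synset_id: str
) -> Optional[int]:
    """Resolve (lexicon, id) to a rowid; warn and take the first on duplicates."""
    rows = conn.execute(
        "SELECT rowid FROM synsets WHERE lexicon = ? AND id = ? ORDER BY rowid",
        (lexicon, synset_id),
    ).fetchall()
    if not rows:
        return None
    if len(rows) > 1:
        logger.warning(
            "%d synsets share id %r in lexicon %r; using the first",
            len(rows), synset_id, lexicon,
        )
    return rows[0][0]
```

Ordering by rowid makes "the first" deterministic, so repeated lookups of a duplicated id always resolve to the same row.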
@Hypercookie that sounds like a good plan. I also don't know when I'll find a few hours to try and code it up and test it, so don't hold your breath, but if your proposed change works then at least it will be more robust to unannounced changes to Wn's non-public API. |
Yes, that sounds good. I will just finish this up and release it as a pre-release version, and when you are finished I will make the required changes and release the v1.0.0 version. |
https://pypi.org/project/wn-editor/ whenever you are ready to include this as an extra package :) It is not very detailed yet (and there is no documentation). |
Thanks! I'm able to download the package, but the link to the project homepage on GitHub gives a 404, so I think it might be a private repo? |
My bad! Should be public now... |
Updated:
This issue is for tracking the feature for modifying wordnets in the database through Wn. Currently the feature has low priority and won't be implemented unless there's a need.
Anyone who wants this feature please read the following:
If you have a use case where the lack of modifiable wordnets in Wn is holding you back, please:
Original issue text: