Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experimental N-Best results #3 #67

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Conversation

nakagami
Copy link
Contributor

I read this article https://jetbead.hatenablog.com/entry/20160119/1453139047 , and implemented.

@mocobeta
Copy link
Owner

Sorry for my late response, I had no time to respond here.
Thanks for the nice patch! I have concerns about performance degradation and memory consumption when N is large. Maybe we should limit the maximum size of N to 2 or 3, but it's just a guess so some benchmark tests will be needed.

@mocobeta
Copy link
Owner

It seems TravisCI failed because of Java 8 installation failure.
The installation part has been disabled in the master branch, can you please merge the latest master?

@nakagami
Copy link
Contributor Author

merged

@mocobeta
Copy link
Owner

@nakagami
Sorry, seems there are some conflicts because the master branch drops python 2 support (#65).
Could you resolve conflicts if you are still interested in this issue?

@nakagami
Copy link
Contributor Author

merged

@mocobeta
Copy link
Owner

mocobeta commented Feb 23, 2020

Hi @nakagami,
again, sorry for the delay...... but I looked at the code closely.

The backward_aster() looks fine to me as an experimental feature - but I think it should not be merged into the Tokenizer class as-is, since this is incompatible with current Tokenier API.

  • Tokenizer#tokenize() method returns a nested list with the n_best parameter instead of a flat list or generator of tokens, this would be confusing for many users.
  • If the text length exceeds Tokenizer.MAX_CHUNK_SIZE, n_best parameter is silently ignored. I understand there is resource leak concern with current nBest implementation, but this implicit limitation is not intuitive.
  • Associated with the above point, n_best parameter doesn't work with stream parameter.

I'd propose we create a new class for tracking nBest path, say NBestTokenizer, for now. Would it be possible? Once we solve the conflicts and resource leak concerns I mentioned above, we will be able to add nBest to the first-class feature for the (default) Tokenizer class.
Let me know if you have any thoughts / proposals about it.

Additionally, I think it would be really nice if the nBest feature supports dotfile option so that users can visualize the multiple paths with graphviz, though this does not limit us to proceed this issue. 🙂

@nakagami
Copy link
Contributor Author

I hope someone's (including @mocobeta ) work.
I think I don't understand detailed implementation of janone, and I don't have enogh time to work.

And sorry, I don't have good idea that n-best ignore stream parameter ignored.

@mocobeta
Copy link
Owner

mocobeta commented Feb 24, 2020

@nakagami Sure thing, the internal of Tokenizer class is one of the most complicated stuff, since this has been evolved through the past releases. I will try to find the way to get this in to the master for future updates.
Anyway thank you for working on this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants