"Divide Input" option needs better documentation. #65
Thanks for checking in! That setting is actually for a slightly different problem — 4 would be too low a number.
Some of those optional settings are... underdocumented! Sorry. In retrospect, "n-word chunks" is a little bit deceptive too — I see the potential for confusion there.
The use case for that setting would be more like this: suppose you had a set of documents that were very different in length. 10 novels and 400 works of flash fiction, say. In theory LDA (the underlying algorithm) ought to be able to deal with that pretty well, but you might wonder what would happen if you split the novels into, say, 1000-word chunks, so that all the "documents" would be roughly the same length. If you wanted to perform that experiment, you'd ordinarily have to segment the texts using a script, and you'd have to do a bunch of work to keep the metadata aligned with the segmented texts. This setting saves you from having to do that work.
You'd definitely want to set this to a number above 100, and I'd say probably above 1000. The keywords would still be single words — it's just vanilla topic modeling on shorter documents.
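For anyone curious what the manual version of that work looks like, here is a minimal sketch in Python. It is not the tool's actual implementation: the function name `chunk_document`, the chunk ID format, and the `source_id` and `chunk_index` metadata fields are all assumptions made for illustration. It splits on whitespace and keeps a short final chunk as-is, which may differ from what the setting does.

```python
def chunk_document(doc_id, text, metadata, n=1000):
    """Split one document into consecutive n-word chunks.

    Hypothetical helper: each chunk gets a copy of the parent document's
    metadata so the two stay aligned after segmentation.
    """
    words = text.split()  # naive whitespace tokenization (an assumption)
    chunks = []
    for start in range(0, len(words), n):
        index = start // n
        chunk_meta = dict(metadata)          # copy, so chunks don't share state
        chunk_meta["source_id"] = doc_id     # which original document this came from
        chunk_meta["chunk_index"] = index    # position of the chunk within it
        chunk_text = " ".join(words[start:start + n])
        chunks.append((f"{doc_id}_{index}", chunk_text, chunk_meta))
    return chunks
```

Each chunk is then treated as its own "document" by the topic model, which is why the keywords remain single words: only the document boundaries change, not the vocabulary.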
MALLET does support bigram models (which I think is what you expected here), but that option isn't baked into the TMT yet, I'm afraid.
Ah! That actually solves another problem I'd been having! Bigrams - yes, that's what my student and I were expecting. Thank you for explaining that.
Whoops, maybe you didn't want this closed.
Ha! I didn't realize that openers had the power to close -- should have though. Yeah I'll leave this open for now. If you believe there's such a thing as a documentation bug -- and I do -- then this is definitely one. Of course, you could argue that this issue could serve as the documentation, making this a self-resolving issue. But once it's closed, it no longer shows up in the default search, and nobody can find it. So the moment you close it, the bug is back, and it has to be reopened. :)
Sorry about that! I went and tried it again, armed with my new knowledge of how it works. In the results, when I opened the metadata.csv, a number of my documents were no longer present; that is to say, no results were recorded for them. I had n set to 1000, so I thought perhaps the missing ones were smaller and somehow got folded into the previous 1000-word chunk, but no, the missing ones should have been split into three or four chunks at least. So I'm not sure what's going on there... I can't see what the dropped documents have in common.
Hm, that is more worrying. It's always easier to debug these things with the data that causes the problem -- is it something you can share with me privately or is it too sensitive? I have also run into bugs in which the output from one model interferes with the input to a new model. I thought I had caught all of them but it's possible I didn't. Could you try re-running the model with a new blank output directory and an input directory that you are 100% sure contains just the original, correct files? If the same results appear, I'll need some kind of minimal example to reproduce and correct the bug.
I'm also going to open this as a new issue, since it's distinct from the documentation problem.
I'll give these things a try; the data is not mine to share, unfortunately, but I'll see what I can do. Thanks!
@shawngraham any developments with this? I'd love to fix the problem if I possibly can.
Just a small question regarding the 'divide input into n-word chunks' option in the advanced settings. When I run that with n set to, say, 4 (a 4-gram), I understand what's going on from the point of view of input, but in terms of the output, are the topic keywords still individual words? A student was asking me this, expecting that the keywords would also be 4-grams, and so I figured, good question...
Thanks! Really appreciate all the work you've done with this tool.