Description
Hi,
I'm testing some semi-supervised models, each with 20 topics created from lists of roughly 15 anchor words per topic. The documents in my corpus vary widely in length (150 to 20,000+ words), so I've broken them into smaller batches to help control for document length, and I'm looking for the batch size and anchor strength that produce the best model.
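
For context, here's roughly how I'm fitting each model, using the `corextopic` package with a binary document-term matrix (toy data standing in for my corpus; in practice `anchor_lists` is 20 lists of ~15 words and `docs` is the chunked corpus):

```python
import scipy.sparse as ss
from sklearn.feature_extraction.text import CountVectorizer
from corextopic import corextopic as ct

# Toy stand-ins for my real corpus and anchor lists
docs = ["cats purr and nap in the sun", "dogs bark and fetch the ball"] * 10
anchor_lists = [["cats", "purr"], ["dogs", "bark"]]

vectorizer = CountVectorizer(binary=True)            # binary doc-word counts for CorEx
doc_word = ss.csr_matrix(vectorizer.fit_transform(docs))
words = list(vectorizer.get_feature_names_out())

topic_model = ct.Corex(n_hidden=len(anchor_lists), seed=1)
topic_model.fit(doc_word, words=words,
                anchors=anchor_lists, anchor_strength=3)

print(topic_model.tc)  # total correlation -- this is what keeps increasing with anchor_strength
```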
I know that total correlation (TC) is the measure CorEx maximizes when constructing the topic model, but in my experiments with anchor strength I've found that TC always increases linearly with anchor strength, even when it's set into the thousands. So far I've been evaluating my models by comparing the anchor words of each topic to the words returned by .get_topics(), and I was wondering whether there is a more quantitative way of selecting one model over another. I've looked into using other packages to measure the semantic similarity between the anchor words and the words retrieved by .get_topics(), but wanted to reach out to see if there are any other metrics for measuring model performance.
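
The comparison I've been doing amounts to something like the sketch below: for each topic, the fraction of its anchor words that appear among the top-N words from .get_topics(). It assumes the `topic_model` and `anchor_lists` from the snippet above, and only takes the first element of each returned tuple (since .get_topics() entries can be (word, mi) or (word, mi, sign) depending on version):

```python
def anchor_recall(topic_model, anchor_lists, n_words=15):
    """Fraction of each topic's anchor words found in its top-N model words."""
    scores = []
    for k, anchors in enumerate(anchor_lists):
        top_words = {t[0] for t in topic_model.get_topics(topic=k, n_words=n_words)}
        scores.append(len(set(anchors) & top_words) / len(anchors))
    return scores

per_topic = anchor_recall(topic_model, anchor_lists)
print(per_topic, sum(per_topic) / len(per_topic))  # per-topic and mean anchor recall
```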
Additionally, besides batch size and anchor strength, are there any other parameters I should be aware of when fitting a model? Any help would be greatly appreciated.