Description
Hi,
I'm testing some semi-supervised models, each with 20 topics created from lists of roughly 15 anchor words per topic. The documents in my corpus vary widely in length (150 to 20,000+ words), so I've broken them into smaller batches to help control for document length, and I'm looking for the batch size and anchor strength that produce the best model.
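
For context, here's roughly how I'm fitting each model, using the `corextopic` package with a binary document-term matrix (toy data standing in for my corpus; in practice `anchor_lists` is 20 lists of ~15 words and `docs` is the chunked corpus):

```python
import scipy.sparse as ss
from sklearn.feature_extraction.text import CountVectorizer
from corextopic import corextopic as ct

# Toy stand-ins for my real corpus and anchor lists
docs = ["cats purr and nap in the sun", "dogs bark and fetch the ball"] * 10
anchor_lists = [["cats", "purr"], ["dogs", "bark"]]

vectorizer = CountVectorizer(binary=True)            # binary doc-word counts for CorEx
doc_word = ss.csr_matrix(vectorizer.fit_transform(docs))
words = list(vectorizer.get_feature_names_out())

topic_model = ct.Corex(n_hidden=len(anchor_lists), seed=1)
topic_model.fit(doc_word, words=words,
                anchors=anchor_lists, anchor_strength=3)

print(topic_model.tc)  # total correlation -- this is what keeps increasing with anchor_strength
```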
I know that total correlation (TC) is the measure CorEx maximizes when constructing the topic model, but in my experiments with anchor strength I've found that TC always increases linearly with anchor strength, even when it's set into the thousands. So far I've been evaluating my models by comparing the anchor words of each topic to the words returned by .get_topics(), and I was wondering whether there is a more quantitative way of selecting one model over another. I've looked into using other packages to measure the semantic similarity between the anchor words and the words retrieved by .get_topics(), but wanted to reach out to see if there are any other metrics for measuring model performance.
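
The comparison I've been doing amounts to something like the sketch below: for each topic, the fraction of its anchor words that appear among the top-N words from .get_topics(). It assumes the `topic_model` and `anchor_lists` from the snippet above, and only takes the first element of each returned tuple (since .get_topics() entries can be (word, mi) or (word, mi, sign) depending on version):

```python
def anchor_recall(topic_model, anchor_lists, n_words=15):
    """Fraction of each topic's anchor words found in its top-N model words."""
    scores = []
    for k, anchors in enumerate(anchor_lists):
        top_words = {t[0] for t in topic_model.get_topics(topic=k, n_words=n_words)}
        scores.append(len(set(anchors) & top_words) / len(anchors))
    return scores

per_topic = anchor_recall(topic_model, anchor_lists)
print(per_topic, sum(per_topic) / len(per_topic))  # per-topic and mean anchor recall
```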
Additionally, besides batch size and anchor strength, are there any other parameters I should be aware of when fitting a model? Any help would be greatly appreciated.