Metrics for Model Selection  #51

Open
@mchabala

Description

Hi,

I'm testing some semi-supervised models, each with 20 topics created through lists of roughly 15 anchor words per topic. The documents in the corpus I'm working with vary widely in length (150 to 20,000+ words). I've broken the documents into smaller batches to help control for document length, and I'm looking for the batch size and anchor strength that produce the best model.

I know that total correlation (TC) is the measure CorEx maximizes when constructing the topic model, but in my experiments with anchor strength I've found that TC increases roughly linearly with anchor strength, even when it's set into the thousands. So far I've been evaluating my models by comparing the anchor words of each topic to the words returned from .get_topics(), and I was wondering if there is a more quantitative way of selecting one model over another. I've looked into using other packages to measure the semantic similarity between the anchor words and the words retrieved by .get_topics(), but wanted to reach out and see if there are any other metrics out there for measuring model performance.
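For what it's worth, here is a minimal sketch of one quantitative comparison I've been considering: the mean fraction of each topic's anchor words that show up among that topic's top words. This is not part of the corextopic API; `anchor_recovery`, `anchors`, and `topics` are illustrative names, and the top-word lists stand in for what you'd extract from `.get_topics()`.

```python
def anchor_recovery(anchors, topic_words):
    """Mean fraction of anchor words recovered in each topic's top words.

    anchors     -- list of anchor-word lists, one per topic
    topic_words -- list of top-word lists, one per topic (e.g. the word
                   element of each tuple returned by model.get_topics())
    """
    fractions = []
    for anchor_list, top_words in zip(anchors, topic_words):
        top_set = set(top_words)
        hits = sum(1 for w in anchor_list if w in top_set)
        fractions.append(hits / len(anchor_list))
    # Average over topics so the score is comparable across models
    return sum(fractions) / len(fractions)

# Toy example with two topics:
anchors = [["cell", "gene", "protein"], ["court", "law", "judge"]]
topics = [["gene", "protein", "dna", "rna"], ["law", "contract", "statute"]]
score = anchor_recovery(anchors, topics)  # (2/3 + 1/3) / 2 = 0.5
```

A higher score would suggest the model is actually honoring the anchors rather than just inflating TC, but it obviously says nothing about the quality of the non-anchor words, so it would only complement a semantic-similarity measure.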

Additionally, besides batch size and anchor strength, are there any other parameters I should be aware of when fitting a model? Any help would be greatly appreciated.
