Description
I have a dataset of 10,000 documents that definitely covers 16 topics, and I want to use anchor words to classify the documents into those 16 topics. I set anchor words for each topic (some topics have more words, some fewer, but about 50 words per topic on average).
The anchor words for each topic are kept in a separate list. I then check which anchor words are present in the texts and add those to a general list of lists, anchors.
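A minimal sketch of this preparation step, assuming simple lowercase whitespace tokenization. The function name filter_anchors and the sample documents and words are illustrative only, not the actual code or data:

```python
def filter_anchors(anchor_lists, documents):
    """Keep only the anchor words that occur in at least one document."""
    # Vocabulary: every lowercase, whitespace-separated token in the corpus.
    vocabulary = set()
    for doc in documents:
        vocabulary.update(doc.lower().split())
    # One list per topic, in the original order, so index i still means topic i.
    # A topic whose words are all missing keeps an empty list instead of being dropped.
    return [[w for w in words if w.lower() in vocabulary] for words in anchor_lists]


# Hypothetical toy data standing in for the real corpus and anchor lists.
documents = ["chocolate cake with cream", "red wine and whisky", "cake and wine"]
desserts = ["cake", "cream", "tiramisu"]
alcohol = ["wine", "whisky", "vodka"]

anchors = filter_anchors([desserts, alcohol], documents)
print(anchors)
```

Keeping one entry per topic, even when it ends up empty, means the position in anchors always matches the intended topic index.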
In the output, however, one topic always dominates my documents (90-95%), and it is always the topic whose words come first in the anchor words (I checked this by changing the order of the anchor words).
For example, I have a desserts topic and an alcoholic drinks topic. If I put the desserts anchor words first in the list of anchor words, desserts dominates the output. If I put the alcoholic drinks anchor words first, alcoholic drinks dominates.
By "dominates" I mean that 90% or more of the documents are labeled with the first topic in the anchor words. The other topics also appear in the output, but much less often, and their assignments are also wrong.
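The share of documents per topic can be measured roughly as follows. This is a sketch that assumes each document gets exactly one integer topic label; topic_shares and the sample labels are hypothetical and not tied to any particular library:

```python
from collections import Counter


def topic_shares(labels, n_topics):
    """Return the fraction of documents assigned to each topic index."""
    total = len(labels)
    if total == 0:
        return [0.0] * n_topics
    counts = Counter(labels)
    # Topics that never appear get a share of 0.0.
    return [counts.get(topic, 0) / total for topic in range(n_topics)]


# Hypothetical labels: 9 of 10 documents fall into topic 0 (the first anchor list).
labels = [0] * 9 + [1]
shares = topic_shares(labels, n_topics=2)
print(shares)
```

Running this once per ordering of the anchor lists makes it easy to see whether the dominant index follows the position in the list rather than the topic itself.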
Could you please tell me why this is happening and what I might be doing wrong?
Thank you in advance for your help and answer!