
Conversation

@x-tabdeveloping
Owner

I'm working on adding term importance from this paper.
It's honestly way smarter than tf-idf-based approaches, and it doesn't suffer from the smoothing issues of the Bayes-based method I developed earlier.
It also doesn't have the theoretical weaknesses of Top2Vec, and it produces topics of similar or better quality.
I'm considering making it the default in the library.
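The turftopic internals aren't shown in this thread, but the `fighting_words` import below suggests the method is the z-scored log-odds with an informative Dirichlet prior from Monroe et al.'s "Fightin' Words". A minimal numpy sketch of that statistic (my own illustration, not the library's implementation) might look like:

```python
import numpy as np

def fightin_words_zscores(counts_a, counts_b, prior_strength=0.01):
    """Z-scored log-odds-ratio with an informative Dirichlet prior.

    counts_a, counts_b: term-count vectors for the two document groups,
    aligned to the same vocabulary.
    """
    counts_a = np.asarray(counts_a, dtype=float)
    counts_b = np.asarray(counts_b, dtype=float)
    # Informative prior proportional to overall term frequencies.
    prior = prior_strength * (counts_a + counts_b)
    prior_total = prior.sum()
    n_a, n_b = counts_a.sum(), counts_b.sum()
    # Posterior log-odds of each term within each group.
    log_odds_a = np.log(counts_a + prior) - np.log(n_a + prior_total - counts_a - prior)
    log_odds_b = np.log(counts_b + prior) - np.log(n_b + prior_total - counts_b - prior)
    delta = log_odds_a - log_odds_b
    # Approximate variance of the log-odds difference.
    variance = 1.0 / (counts_a + prior) + 1.0 / (counts_b + prior)
    return delta / np.sqrt(variance)
```

Terms over-represented in group A get positive z-scores, terms over-represented in group B negative ones, and terms used equally score near zero.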

@x-tabdeveloping
Owner Author

Thanks for sharing the post @KennethEnevoldsen

@x-tabdeveloping
Owner Author

Also, the component values are much more interpretable, since they're essentially z-scores.
That means you can attach statistical significance to descriptive words, which is awesome.
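Since the component values behave like z-scores, turning one into a two-sided p-value takes only the standard library (no turftopic API involved; just the usual normal-approximation arithmetic):

```python
from math import erfc, sqrt

def two_sided_p(z):
    """Two-sided p-value for a standard-normal z-score."""
    return erfc(abs(z) / sqrt(2))

# A word whose component value is 2.58 clears the 1% significance level:
print(round(two_sided_p(2.58), 4))  # → 0.0099
```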

@KennethEnevoldsen
Collaborator

KennethEnevoldsen commented Dec 10, 2024

Oh this looks great! Glad to see that you are already tackling it

What are your thoughts on adapting it to an embedding use case?

@x-tabdeveloping
Owner Author

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups

from turftopic.supervised.semantic_lexical import SemanticLexicalAnalysis

# 20 Newsgroups with headers/footers/quotes stripped,
# so term importance reflects message content only.
ds = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
)
corpus = ds.data
labels = np.array(ds.target)

trf = SentenceTransformer("all-MiniLM-L6-v2")
# Load precomputed embeddings; alternatively encode from scratch:
# embeddings = trf.encode(corpus, show_progress_bar=True)
embeddings = np.load("_emb/20news_all-MiniLM.npy")

model = SemanticLexicalAnalysis(encoder=trf).fit(
    corpus, y=labels, embeddings=embeddings
)

model.plot_semantic_lexical_square(19)
model.plot_residuals(1)

[Plots: semantic-lexical square for topic 19; residuals for topic 1]

