-
Notifications
You must be signed in to change notification settings - Fork 380
add AbsTaskSpectralClustering #2430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Co-authored-by: Roman Solomatin <[email protected]>
I have implemented Spectral Clustering to reflect cosine similarity, as we previously discussed.
However, I noticed that the create_task_list() function collects task categories using a two-level iteration.
Could you review whether this approach is reasonable and applicable? |
logger = logging.getLogger(__name__) | ||
|
||
|
||
class SpectralClusteringEvaluator(Evaluator): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be moved to evaluators
labels, | ||
task_name: str | None = None, | ||
clustering_batch_size: int = 500, | ||
limit: int | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be removed in 2.0
limit: int | None = None, |
if limit is not None: | ||
sentences = sentences[:limit] | ||
labels = labels[:limit] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be removed in 2.0
if limit is not None: | |
sentences = sentences[:limit] | |
labels = labels[:limit] |
return {"v_measure": v_measure} | ||
|
||
|
||
class AbsTaskSpectralClustering(AbsTask): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you clustering is almost 1to1 to original clustering. Maybe it would be better to move evaluator to properties of task and your tasks will use
for cluster_set in tqdm.tqdm(dataset, desc="Clustering"):
evaluator = self.evaluator(
class Task(AbsClustering):
evaluator = SpectralClusteringEvaluator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Samoed
Thanks for your advice!
I will revise it again at 'evaluation' level (if it is deemed meaningful).
Additionally, I will apply try/except as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should build this on the fast clustering task, not AbsTaskClustering (it is much slower and gives less consistent estimates)
from collections import Counter | ||
from typing import Any | ||
|
||
import networkx as nx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to convert with try... ecept
I think this should be moved inside __call__
Code Quality
make lint
to maintain consistent style.Documentation
Testing
make test-with-coverage
.make test
ormake test-with-coverage
to ensure no existing functionality is broken.Adding datasets checklist
Reason for dataset addition: ...
mteb -m {model_name} -t {task_name}
command.sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
intfloat/multilingual-e5-small
self.stratified_subsampling() under dataset_transform()
make test
.make lint
.