add AbsTaskSpectralClustering #2430

OnAnd0n · 2025-03-25T09:14:41Z

Code Quality

Code Formatted: Format the code using make lint to maintain consistent style.

Documentation

Updated Documentation: Add or update documentation to reflect the changes introduced in this PR.

Testing

New Tests Added: Write tests to cover new functionality. Validate with make test-with-coverage.
Tests Passed: Run tests locally using make test or make test-with-coverage to ensure no existing functionality is broken.

Adding datasets checklist

Reason for dataset addition: ...

I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command.
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- intfloat/multilingual-e5-small
I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
I have filled out the metadata object in the dataset file (find documentation on it here).
Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.

Co-authored-by: Roman Solomatin <[email protected]>

mteb/abstasks/AbsTaskSpectralClustering.py

OnAnd0n · 2025-03-25T09:26:43Z

@Samoed @KennethEnevoldsen

I have implemented Spectral Clustering to reflect cosine similarity, as we previously discussed (see "question regarding clustering evaluation in MTEB").
Initially, you suggested the following format:

class AbsTaskSpectralClustering(AbsTaskClusteringFast):
    clustering_model = sklearn.cluster.SpectralClustering
    ...

# and

class AbsTaskCosineClustering(AbsTaskClusteringFast):
    clustering_model = ...
    ...

However, I noticed that create_task_list() collects tasks using only a two-level iteration: the direct subclasses of AbsTask, then their direct subclasses.
If a three-level hierarchy (AbsTaskSpectralClustering -> AbsTaskClusteringFast -> AbsTask) is introduced,
tasks that inherit from AbsTaskSpectralClustering might not be collected into the task list.

def create_task_list() -> list[type[AbsTask]]:
    tasks_categories_cls = list(AbsTask.__subclasses__())
    tasks = [
        cls
        for cat_cls in tasks_categories_cls
        for cls in cat_cls.__subclasses__()
        if cat_cls.__name__.startswith("AbsTask")
    ]
    return tasks

Could you review whether this approach is reasonable and applicable?
Thank you for your time.
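
To make the concern concrete, here is a minimal sketch with stand-in classes (none of these are the real mteb classes; SomeFastTask, SomeSpectralTask and the recursive helper are hypothetical names for illustration). It shows that a two-level walk over __subclasses__() returns the intermediate abstract class and misses the concrete task beneath it. A recursive walk is one possible alternative, not necessarily what the maintainers would choose.

```python
class AbsTask:
    """Stand-in for the real base class (illustration only)."""


class AbsTaskClusteringFast(AbsTask):
    """Second level: an abstract task category."""


class AbsTaskSpectralClustering(AbsTaskClusteringFast):
    """Third level: a more specific abstract category."""


class SomeFastTask(AbsTaskClusteringFast):
    """Hypothetical concrete task, two levels below AbsTask."""


class SomeSpectralTask(AbsTaskSpectralClustering):
    """Hypothetical concrete task, three levels below AbsTask."""


def create_task_list_two_level():
    # Same shape as the function quoted above: categories, then their direct children.
    return [
        cls
        for cat_cls in AbsTask.__subclasses__()
        for cls in cat_cls.__subclasses__()
        if cat_cls.__name__.startswith("AbsTask")
    ]


def all_subclasses(cls):
    # Depth-first walk over every descendant of cls.
    found = []
    for sub in cls.__subclasses__():
        found.append(sub)
        found.extend(all_subclasses(sub))
    return found


def create_task_list_recursive():
    # Assumption: abstract categories are recognisable by the "AbsTask" name prefix.
    return [c for c in all_subclasses(AbsTask) if not c.__name__.startswith("AbsTask")]


two_level = create_task_list_two_level()
recursive = create_task_list_recursive()
print([c.__name__ for c in two_level])
print([c.__name__ for c in recursive])
```

In this sketch the two-level version lists AbsTaskSpectralClustering as if it were a task and never reaches SomeSpectralTask, while the recursive version finds both concrete tasks.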

Samoed · 2025-03-25T10:10:53Z

mteb/abstasks/AbsTaskSpectralClustering.py

+logger = logging.getLogger(__name__)
+
+
+class SpectralClusteringEvaluator(Evaluator):


This should be moved to evaluators

Samoed · 2025-03-25T10:11:17Z

mteb/abstasks/AbsTaskSpectralClustering.py

+        labels,
+        task_name: str | None = None,
+        clustering_batch_size: int = 500,
+        limit: int | None = None,


This will be removed in 2.0

Suggested change (delete this line):

-        limit: int | None = None,

Samoed · 2025-03-25T10:11:36Z

mteb/abstasks/AbsTaskSpectralClustering.py

+        if limit is not None:
+            sentences = sentences[:limit]
+            labels = labels[:limit]


This will be removed in 2.0

Suggested change (delete these lines):

-        if limit is not None:
-            sentences = sentences[:limit]
-            labels = labels[:limit]

Samoed · 2025-03-25T10:20:21Z

mteb/abstasks/AbsTaskSpectralClustering.py

+        return {"v_measure": v_measure}
+
+
+class AbsTaskSpectralClustering(AbsTask):


I think your clustering is almost 1:1 with the original clustering. Maybe it would be better to make the evaluator a property of the task, so that the existing loop uses it:

for cluster_set in tqdm.tqdm(dataset, desc="Clustering"):
    evaluator = self.evaluator(...)

and your tasks only need to set:

class Task(AbsClustering):
    evaluator = SpectralClusteringEvaluator
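
A rough sketch of the "evaluator as a task property" idea follows. The class bodies, constructor arguments, and the placeholder score are assumptions made for illustration and do not reflect the actual mteb signatures. The point is only that the shared loop reads self.evaluator, so a subclass swaps the strategy by overriding one attribute.

```python
class ClusteringEvaluator:
    """Hypothetical default evaluator; returns a placeholder score."""

    def __init__(self, sentences, labels):
        self.sentences = sentences
        self.labels = labels

    def __call__(self, model):
        # A real evaluator would embed the sentences and cluster them here.
        return {"v_measure": 0.0, "evaluator": type(self).__name__}


class SpectralClusteringEvaluator(ClusteringEvaluator):
    """Hypothetical spectral variant; only the class identity differs in this sketch."""


class AbsTaskClustering:
    # Subclasses override this attribute to change the clustering strategy.
    evaluator = ClusteringEvaluator

    def evaluate(self, model, dataset):
        results = []
        for cluster_set in dataset:
            # The loop never names a concrete evaluator class.
            evaluator = self.evaluator(cluster_set["sentences"], cluster_set["labels"])
            results.append(evaluator(model))
        return results


class SpectralTask(AbsTaskClustering):
    evaluator = SpectralClusteringEvaluator


dataset = [{"sentences": ["a", "b"], "labels": [0, 1]}]
default_results = AbsTaskClustering().evaluate(model=None, dataset=dataset)
spectral_results = SpectralTask().evaluate(model=None, dataset=dataset)
print(default_results[0]["evaluator"], spectral_results[0]["evaluator"])
```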

OnAnd0n · 2025-03-25

@Samoed
Thanks for your advice!

I will revise it again at the 'evaluation' level (if it is deemed meaningful).
I will also apply the try/except change.

KennethEnevoldsen · 2025-03-27

We should build this on the fast clustering task, not AbsTaskClustering (it is much slower and gives less consistent estimates).

Samoed · 2025-03-25T10:37:29Z

mteb/abstasks/AbsTaskSpectralClustering.py

+from collections import Counter
+from typing import Any
+
+import networkx as nx


This import needs to be wrapped in try ... except. I think it should be moved inside __call__.
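
One way this could look, as a sketch only: the import is deferred until the evaluator is actually called, and a missing package produces a readable error. The helper name require_optional, the error wording, and the graph-building body are hypothetical and not taken from the PR.

```python
import importlib


def require_optional(module_name: str, feature: str):
    """Import an optional dependency at call time, with a clearer error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"{feature} requires the optional dependency '{module_name}'. "
            f"Install it with `pip install {module_name}`."
        ) from err


class SpectralClusteringEvaluator:
    """Hypothetical evaluator that only needs networkx when it is called."""

    def __call__(self, embeddings):
        # Deferred import: importing this module never fails just because
        # networkx is absent; only calling the evaluator does.
        nx = require_optional("networkx", "SpectralClusteringEvaluator")
        graph = nx.Graph()
        graph.add_nodes_from(range(len(embeddings)))
        return graph.number_of_nodes()
```

With this layout, users who never run the spectral tasks do not need networkx installed.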

OnAnd0n and others added 19 commits March 8, 2025 09:35

add PatentFnBClustering.py

0d25079

do make lint and revise

90952db

rollback Makefile

d3870d8

Update mteb/tasks/Clustering/kor/PatentFnBClustering.py

485a8c5

Co-authored-by: Roman Solomatin <[email protected]>

klue_mrc_domain

858635a

Merge branch 'main' of https://github.com/OnAnd0n/MTEB

088f046

make lint

3ae005a

klue_modified_clustering_dataset

8cf0713

Merge branch 'embeddings-benchmark:main' into main

363a7e9

clustering & kor folder add __init.py

abfc2f7

clustering & kor folder add __init__.py

6478e7c

task.py roll-back

1f25cf9

correct text_creation to sample_creation & delete form in MetaData

c215cf6

correct task_subtype in TaskMetaData

b4c1284

delete space

3b91cfa

edit metadata

a7c8180

edit task_subtypes

5bfac54

Merge branch 'embeddings-benchmark:main' into main

480b645

add AbaTaskSpectralClustering

4f6b75c

Samoed requested changes Mar 25, 2025

Samoed reviewed Mar 25, 2025

OnAnd0n Mar 25, 2025

KennethEnevoldsen Mar 27, 2025
