Genie splits dense clusters.

Hey there!

I've been using Genie for some time now, and I think it's a remarkable algorithm. Its performance is absolutely stellar, and it's a clustering algorithm that, for the most part, "just works". That said, I've recently encountered a somewhat counterintuitive behavior in the algorithm. Genie appears to cleave dense clusters apart when `gini_threshold` is set too low and cluster imbalance is very high.

Take the dataset generated with the following `make_blobs` setup and Genie clustering:
```
from genieclust import Genie
from sklearn.datasets import make_blobs
X, y = make_blobs(
    n_samples=[1000, 100, 100],
    cluster_std=1,
    random_state=42
)

cls = Genie(gini_threshold=0.3)
labels = cls.fit_predict(X)
```

We get the following:

![Image](https://github.com/user-attachments/assets/44dea1c7-b6ea-4526-a664-4bd2bb4fe267)

We can see that Genie splits the largest, densest cluster right through its most-dense region. To confirm that this cluster is in fact the densest of the three, we can plot the core distances at `k=15` neighbors.

![Image](https://github.com/user-attachments/assets/1e613978-f334-4537-9002-f00955acbf16)
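
For reference, here is a minimal sketch of one way to compute core distances like these. It assumes the core distance of a point is the distance to its `k`-th nearest neighbor (not counting the point itself) and uses scikit-learn's `NearestNeighbors`. The names `core_dist` and `nn` are only illustrative, and the plot above may have been produced differently.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Same data as in the snippet above.
X, y = make_blobs(n_samples=[1000, 100, 100], cluster_std=1, random_state=42)

k = 15
# Ask for k + 1 neighbors because each point is returned as its own
# nearest neighbor (at distance 0) when querying the training data.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nn.kneighbors(X)
core_dist = dist[:, -1]  # distance to the k-th "real" neighbor

# Smaller core distance means higher local density.
for label in np.unique(y):
    print(label, np.median(core_dist[y == label]))
```

A lower median core distance for the 1000-point blob is what one would expect if it is the densest of the three.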

To demonstrate the extent to which Genie ignores inter-cluster distances, we can set `cluster_std=0.1` in the `make_blobs` call above and try again:

![Image](https://github.com/user-attachments/assets/24e9c622-8514-4b46-9e6f-e4733161379f)

Is this expected behavior? Will Genie always ignore inter-cluster distances in order to achieve the desired cluster balance? If my understanding is correct, Genie is agglomerative and therefore performs merges in order from the smallest inter-cluster distance to the largest. If that is true, the largest cluster here should already have formed well before the scale at which either of the smaller clusters is merged. Genie must be doing something that separates the already-formed dense cluster, splitting it in two when it merges the smaller clusters.

Is there any way to treat the `gini_threshold` as a soft requirement instead of being strictly enforced in this way?

This is probably obvious, but this behavior goes away if one increases the `gini_threshold`. I was still somewhat surprised to see this behavior at lower thresholds, though. I can also "solve" the problem by setting `M` sufficiently high (20 or higher works well in the first example above), but for my use case I strongly prefer avoiding labeling points as noise.
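
To put a number on how imbalanced the intended partition is, here is a rough sketch that computes a Gini index of the cluster sizes. It assumes the normalized form (sum of pairwise absolute differences divided by `(n - 1)` times the total), which is my understanding of the inequality measure Genie uses. The helper `normalized_gini` is hypothetical and not part of any library.

```python
import numpy as np

def normalized_gini(sizes):
    """Normalized Gini index of cluster sizes: 0 = perfectly balanced, 1 = maximally imbalanced.

    Assumed formula: sum_{i<j} |x_i - x_j| / ((n - 1) * sum_i x_i).
    """
    x = np.asarray(sizes, dtype=float)
    n = x.size
    if n < 2:
        return 0.0
    # The full pairwise matrix counts every pair twice, so halve the sum.
    pair_sum = np.abs(x[:, None] - x[None, :]).sum() / 2.0
    return pair_sum / ((n - 1) * x.sum())

true_sizes = [1000, 100, 100]
print(normalized_gini(true_sizes))
print(normalized_gini([400, 400, 400]))
```

Under that assumption, the ground-truth sizes `[1000, 100, 100]` give an index far above `gini_threshold=0.3`, which would be consistent with the behavior disappearing at higher thresholds.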

Any explanation or advice would be appreciated. Thanks for the incredible work on this algorithm and repository!

