Genie splits dense clusters. #91

@kc-howe

Hey there!

I've been using Genie for some time now, and I think it's a remarkable algorithm. Its performance is absolutely stellar, and it's a clustering algorithm that, for the most part, "just works". That said, I've recently encountered a somewhat counterintuitive behavior in the algorithm. Genie appears to cleave dense clusters apart when gini_threshold is set too low and cluster imbalance is very high.

Take the dataset generated with the following make_blobs setup and Genie clustering:

from sklearn.datasets import make_blobs
from genieclust import Genie
X, y = make_blobs(
    n_samples=[1000, 100, 100],
    cluster_std=1,
    random_state=42
)

cls = Genie(gini_threshold=0.3)
labels = cls.fit_predict(X)

We get the following:

[Figure: Genie labels for the three blobs; the largest, densest cluster is split in two.]

We can see that Genie splits the largest, densest cluster right through its most-dense region. To confirm that this cluster is in fact the densest of the three, we can plot the core distances at k=15 neighbors.

[Figure: scatter plot colored by core distance at k=15, confirming the largest cluster is the densest of the three.]
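
For context, this is roughly how I computed the core distances (a sketch using scikit-learn's NearestNeighbors on the X generated above; "core distance" here meaning the distance to the k-th nearest neighbor):

import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 15
# Each point is returned as its own nearest neighbor (at distance 0)
# when querying the training set, so ask for k + 1 neighbors.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nn.kneighbors(X)
core_dist = dist[:, -1]  # distance to the k-th actual neighbor

plt.scatter(X[:, 0], X[:, 1], c=core_dist, s=5)
plt.colorbar(label="core distance (k=15)")
plt.show()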

To demonstrate the extent to which Genie ignores inter-cluster distances, we can set cluster_std=0.1 in the make_blobs call above and try again:

[Figure: Genie labels with cluster_std=0.1; the largest cluster is still split despite the well-separated blobs.]
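
For reference, this variant is just the first snippet with a smaller spread (variable names here are arbitrary):

X2, _ = make_blobs(
    n_samples=[1000, 100, 100],
    cluster_std=0.1,
    random_state=42
)
labels2 = Genie(gini_threshold=0.3).fit_predict(X2)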

Is this expected behavior? Will Genie always ignore inter-cluster distances in order to achieve the desired cluster balance? If my understanding is correct, Genie is agglomerative and therefore performs merges in order of increasing inter-cluster distance. If so, the largest cluster here should already have formed well before the scale at which either of the smaller clusters is merged. Genie must therefore be doing something that breaks the already-formed dense cluster in two at the point where the smaller clusters are merged.
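
To make the merge-order argument concrete, the reference point I have in mind is plain single linkage, whose merges are driven purely by inter-cluster (nearest-pair) distances. A sketch using scipy rather than anything Genie-specific:

from scipy.cluster.hierarchy import fcluster, linkage

# Plain single linkage merges strictly in order of increasing
# nearest-pair distance, with no cluster-size correction.
Z = linkage(X, method="single")
labels_single = fcluster(Z, t=3, criterion="maxclust")

# The last few merge heights show the scale at which the blobs
# themselves get joined together.
print(Z[-5:, 2])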

Is there any way to treat the gini_threshold as a soft requirement instead of being strictly enforced in this way?

This is probably obvious, but the behavior goes away if one increases the gini_threshold; I was still somewhat surprised to see it at lower thresholds, though. I can also "solve" the problem by setting M sufficiently high (20 or higher works well in the first example above), but for my use case I strongly prefer avoiding labeling points as noise. Both workarounds are written out below for reference.
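
from genieclust import Genie

# Workaround 1: a higher Gini threshold (0.5 here is just an example
# value) avoids the split in my experiments.
labels_a = Genie(gini_threshold=0.5).fit_predict(X)

# Workaround 2: keep the low threshold but raise the smoothing factor
# M (20 or higher worked in the first example), at the cost of some
# points being labeled as noise.
labels_b = Genie(gini_threshold=0.3, M=20).fit_predict(X)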

Any explanation or advice would be appreciated. Thanks for the incredible work on this algorithm and repository!
