Description
Hey there!
I've been using Genie for some time now, and I think it's a remarkable algorithm. Its performance is absolutely stellar, and it's a clustering algorithm that, for the most part, "just works". That said, I've recently encountered a somewhat counterintuitive behavior in the algorithm. Genie appears to cleave dense clusters apart when gini_threshold
is set too low and cluster imbalance is very high.
Take the dataset generated with the following make_blobs
setup and Genie clustering:
from sklearn.datasets import make_blobs
X, y = make_blobs(
n_samples=[1000, 100, 100],
cluster_std=1,
random_state=42
)
cls = Genie(gini_threshold=0.3)
labels = cls.fit_predict(X)
We get the following:
We can see that Genie splits the largest, densest cluster right through its most-dense region. To confirm that this cluster is in fact the densest of the three, we can plot the core distances at k=15
neighbors.
To demonstrate the extent to which Genie ignores inter-cluster distances, we can set cluster_std=0.1
in the make_blobs
call above and try again:
Is this expected behavior? Will Genie always ignore inter-cluster distances in order to achieve the desired cluster balance? If my understanding is correct, Genie is agglomerative and therefore performs merging operations in order of smallest inter-cluster distances to largest. If this is true, then the largest cluster here should already have formed well before the scale at which either of the smaller clusters are merged. Genie must be doing something to separate the already-formed dense cluster in a way that splits it in two when merging the smaller clusters.
Is there any way to treat the gini_threshold
as a soft requirement instead of being strictly enforced in this way?
This is probably obvious, but this behavior goes away if one increases the gini_threshold
. I was still somewhat surprised to see this behavior at lower thresholds, though. I can also "solve" the problem by setting M
sufficiently high (20 or higher works well in the first example above), but for my use case I strongly prefer avoiding labeling points as noise.
Any explanation or advice would be appreciated. Thanks for the incredible work on this algorithm and repository!