Skip to content

KMeans Optimization: Incorporating Weights into Re-Clustering Process #1001

@ChiaTrama

Description

@ChiaTrama

Hello,

As discussed in this topic on Dask's forum, my colleague and I compared in a distributed environment the dask-ml implementation of the KMeans class with our own implementation. During the comparison, we observed that the dask-ml initialization doesn't appear to use weights during the centroid re-clustering phase.

In the current dask-ml KMeans implementation, the standard KMeans algorithm is used for centroid re-clustering. In contrast, we incorporated weights into two areas:

  • KMeans++ initialization.
  • Weighted average during centroid re-clustering.

Although our implementation is less efficient than dask-ml in terms of execution time, we achieved better results when clustering a blob dataset, likely due to a reduction in the number of clustering iterations rather than direct code optimizations.

If you're interested, feel free to review our repository for further details on our approach:
GitHub Repository.

Thank you for considering this issue.

Best regards,
Chiara

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions