
[ENH] K-means: clusters can be inferred for new data #7010

Merged
merged 6 commits into from
Jun 11, 2025

Conversation

markotoplak
Member

@markotoplak markotoplak commented Jan 30, 2025

Issue

@borondics had such a large data set that k-means was too slow, so he tried running it on a sample and then modelling the clustering with a classifier. For k-means this should not be needed, because we could use the means/medoids to "predict" the cluster directly.
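
To illustrate the idea, here is a minimal sketch of assigning new rows to the nearest existing centroid. It is not the code from this PR: the function name `assign_to_nearest`, the Euclidean metric, and the example centroids are assumptions made only for this illustration.

```python
import math


def assign_to_nearest(centroids, rows):
    """Return, for each row, the index of the closest centroid.

    Hypothetical helper for illustration; it assumes Euclidean distance.
    """
    return [
        min(range(len(centroids)), key=lambda k: math.dist(row, centroids[k]))
        for row in rows
    ]


# Made-up centroids, as if k-means had already been fitted on a sample.
centroids = [(0.0, 0.0), (10.0, 10.0)]

# New rows that were not part of the sample.
new_rows = [(1.0, -1.0), (9.0, 12.0), (4.0, 4.0)]

labels = assign_to_nearest(centroids, new_rows)
print(labels)
```

No classifier has to be trained: the centroids themselves are enough to label data that was not in the sample.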

With one additional line we could make the Cluster useful with Apply domain.

image(3)

Why don't we already do it? All the machinery is already there, in ClusteringModel, which seems to be unused, though. Does

Includes
  • Code changes
  • Tests
  • Documentation

@markotoplak markotoplak added the needs discussion Core developers need to discuss the issue label Jan 30, 2025
@markotoplak markotoplak marked this pull request as draft January 30, 2025 15:36

codecov bot commented Jan 30, 2025

Codecov Report

Attention: Patch coverage is 95.00000% with 1 line in your changes missing coverage. Please review.

Please upload report for BASE (master@66928eb). Learn more about missing BASE report.
Report is 23 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##             master    #7010   +/-   ##
=========================================
  Coverage          ?   88.72%           
=========================================
  Files             ?      332           
  Lines             ?    73444           
  Branches          ?        0           
=========================================
  Hits              ?    65164           
  Misses            ?     8280           
  Partials          ?        0           

@janezd janezd self-assigned this Feb 7, 2025
@markotoplak markotoplak assigned markotoplak and unassigned janezd Feb 12, 2025
@markotoplak markotoplak removed the needs discussion Core developers need to discuss the issue label Mar 18, 2025
@markotoplak markotoplak force-pushed the same_clustering_for_new_data branch from 1dc93bf to 87bf4cc Compare March 28, 2025 11:02
@janezd janezd added this to the 3.39 milestone May 30, 2025
@markotoplak markotoplak force-pushed the same_clustering_for_new_data branch 2 times, most recently from f213cbe to 0fb6067 Compare May 30, 2025 11:53
@markotoplak markotoplak force-pushed the same_clustering_for_new_data branch from 0fb6067 to 79d9a62 Compare May 30, 2025 12:13
@markotoplak markotoplak marked this pull request as ready for review May 30, 2025 12:51
@markotoplak
Member Author

@janezd, I finally finished __eq__ and __hash__. To make it work, I needed to change some attributes into read-only properties (which also made me rewrite a test that relied on changing them).
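
For readers unfamiliar with the pattern, here is a rough sketch of why read-only properties go together with `__eq__` and `__hash__`. The class `CentroidModel` is a hypothetical stand-in, not the class changed in this PR; it only shows that value-based equality and hashing are safe when the compared state cannot be reassigned after construction.

```python
class CentroidModel:
    """Hypothetical stand-in for a clustering model (illustration only)."""

    def __init__(self, centroids):
        # Store an immutable copy so the hash cannot change later.
        self._centroids = tuple(tuple(float(x) for x in c) for c in centroids)

    @property
    def centroids(self):
        # Read-only: there is no setter, so assignment raises AttributeError.
        return self._centroids

    def __eq__(self, other):
        if type(other) is not type(self):
            return NotImplemented
        return self._centroids == other._centroids

    def __hash__(self):
        # Hash the same state that __eq__ compares.
        return hash((type(self), self._centroids))


a = CentroidModel([[0, 0], [1, 1]])
b = CentroidModel([[0, 0], [1, 1]])
c = CentroidModel([[0, 0], [2, 2]])
```

Two models built from the same centroids compare equal and hash the same, so they collapse to one entry in a set or dict. A test that used to reassign `centroids` would now fail with `AttributeError`, which is the kind of test rewrite mentioned above.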

@markotoplak markotoplak removed their assignment May 30, 2025
@markotoplak markotoplak requested a review from janezd May 30, 2025 12:52
@markotoplak markotoplak force-pushed the same_clustering_for_new_data branch from e52507e to 6ff8cff Compare June 3, 2025 07:21
@markotoplak markotoplak removed the request for review from janezd June 11, 2025 12:20
@markotoplak markotoplak merged commit 49fdb23 into biolab:master Jun 11, 2025
21 of 30 checks passed