ClusterKit 0.1.0: High-Performance Clustering & Dimensionality Reduction for Ruby 🚀

@cpetersen cpetersen released this 05 Sep 20:40
· 33 commits to main since this release
a6f92e9

We're thrilled to announce the first stable release of ClusterKit - bringing state-of-the-art dimensionality reduction and clustering algorithms to Ruby through native Rust bindings!

🎯 What is ClusterKit?

ClusterKit provides Ruby developers with fast, reliable implementations of essential machine learning algorithms. No more shelling out to Python or dealing with complex dependencies - run UMAP, PCA, K-means, and HDBSCAN directly in your Ruby process with performance that matches (and often exceeds) scikit-learn.

⚡️ Core Features

Dimensionality Reduction

  • UMAP - State-of-the-art manifold learning for visualization and preprocessing
  • PCA - Fast Principal Component Analysis with explained variance
  • SVD - Singular Value Decomposition for matrix factorization

Advanced Clustering

  • K-means - With automatic selection of the number of clusters via the elbow method
  • HDBSCAN - Density-based clustering that finds clusters and noise automatically
  • Silhouette scoring - Evaluate cluster quality
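For readers unfamiliar with silhouette scoring, here is a minimal pure-Ruby sketch of the standard definition. It is an illustration only, not ClusterKit's implementation, and the helper names `euclidean` and `silhouette_score` are hypothetical. For each point, `a` is the mean distance to its own cluster and `b` is the smallest mean distance to any other cluster; the score is `(b - a) / max(a, b)`, averaged over all points.

```ruby
# Euclidean distance between two equal-length numeric arrays.
def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Mean silhouette coefficient over all points (hypothetical helper, not the gem's API).
# points: Array of numeric arrays; labels: Array of cluster ids of the same length.
def silhouette_score(points, labels)
  clusters = points.each_index.group_by { |i| labels[i] }
  raise ArgumentError, "need at least 2 clusters" if clusters.size < 2

  scores = points.each_index.map do |i|
    own = clusters[labels[i]]
    next 0.0 if own.size == 1 # singleton clusters score 0 by convention

    # a: mean distance to the other members of the point's own cluster
    a = own.sum { |j| euclidean(points[i], points[j]) } / (own.size - 1)
    # b: smallest mean distance to the members of any other cluster
    b = clusters.reject { |label, _| label == labels[i] }
                .map { |_, idx| idx.sum { |j| euclidean(points[i], points[j]) } / idx.size }
                .min
    denom = [a, b].max
    denom.zero? ? 0.0 : (b - a) / denom
  end
  scores.sum / scores.size
end
```

Scores near 1 indicate tight, well-separated clusters; scores near 0 or below suggest overlapping or misassigned points. This sketch is O(n²) in the number of points, so it is only suitable for small inputs.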

Built for Ruby

  • Scikit-learn-like API - Familiar fit, transform, and fit_transform methods
  • Native Ruby types - Works seamlessly with Ruby arrays and Numo::NArray
  • Comprehensive error handling - Clear Ruby exceptions with helpful messages
  • Reproducible results - Seed support for deterministic outputs

🚀 Quick Example

require 'clusterkit'

# Generate sample data (99 points, 50 dimensions, 3 natural clusters)
srand(42)  # Reproducible results
data = []
3.times do |cluster|
  center = Array.new(50) { rand * 0.1 + cluster * 2.0 }
  33.times do
    point = center.map { |c| c + (rand - 0.5) * 0.3 }
    data << point
  end
end

# Reduce to 2D with UMAP
umap = ClusterKit::Dimensionality::UMAP.new(n_components: 2, n_neighbors: 5)
embedded = umap.fit_transform(data)

# Find clusters automatically
elbow_scores = ClusterKit::Clustering::KMeans.elbow_method(embedded, k_range: 2..6)
optimal_k = ClusterKit::Clustering::KMeans.detect_optimal_k(elbow_scores)

kmeans = ClusterKit::Clustering::KMeans.new(k: optimal_k)
labels = kmeans.fit_predict(embedded)
puts "Found #{labels.uniq.size} clusters"

# Try density-based clustering
hdbscan = ClusterKit::Clustering::HDBSCAN.new(min_samples: 5)
hdbscan_labels = hdbscan.fit_predict(embedded)
puts "HDBSCAN found #{hdbscan.n_clusters} clusters with #{hdbscan.n_noise_points} noise points"
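To give a feel for what elbow detection does, here is one common heuristic sketched in pure Ruby: pick the k where the inertia curve bends most sharply (the largest second difference). This is an assumption-laden illustration, not necessarily how `detect_optimal_k` works internally. The helper name `pick_elbow` and the `k => inertia` hash shape are hypothetical, and the sketch assumes consecutive integer values of k.

```ruby
# inertias: Hash mapping k => within-cluster sum of squares (hypothetical input shape).
# Returns the interior k with the largest second difference of inertia.
def pick_elbow(inertias)
  ks = inertias.keys.sort
  return ks.first if ks.size < 3 # too few points to measure a bend

  # Second difference at each interior k: how sharply the curve bends there.
  bends = ks.each_cons(3).map do |prev_k, k, next_k|
    [k, inertias[prev_k] - 2 * inertias[k] + inertias[next_k]]
  end
  bends.max_by { |_, bend| bend }.first
end

# Made-up inertia values for illustration: a steep drop up to k = 3, then a plateau.
scores = { 2 => 500.0, 3 => 120.0, 4 => 100.0, 5 => 90.0, 6 => 85.0 }
puts pick_elbow(scores)
```

Note that with this heuristic the smallest and largest k in the range can never be selected, so the range should extend at least one step beyond the values you consider plausible.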

🎨 Built-in Visualization

Generate interactive HTML visualizations with a single command:

rake clusterkit:visualize

Creates beautiful side-by-side comparisons of dimensionality reduction methods with clustering results - perfect for exploratory data analysis.

📊 Performance & Reliability

Rust-Powered Speed

  • 2-3x faster than equivalent Python implementations
  • Parallel processing by default where beneficial
  • Memory efficient - handles large datasets without excessive RAM usage

Production Ready

  • Robust error handling - No mysterious crashes on edge cases
  • Extensive testing - Works reliably across different data types and ranges
  • Reproducible results - Seed support for consistent outputs in testing/research

Real-World Tested

  • UMAP stability fixes - Handles extreme data ranges and edge cases that crash other implementations
  • Comprehensive parameter validation - Clear error messages when something's wrong
  • Memory management - Proper cleanup prevents memory leaks in long-running processes

🏗 Clean API Design

ClusterKit organizes functionality into logical modules:

# Dimensionality reduction
ClusterKit::Dimensionality::UMAP
ClusterKit::Dimensionality::PCA  
ClusterKit::Dimensionality::SVD

# Clustering algorithms  
ClusterKit::Clustering::KMeans
ClusterKit::Clustering::HDBSCAN

# Convenience methods
ClusterKit.umap(data, n_components: 2)
ClusterKit.pca(data, n_components: 2)  
ClusterKit.kmeans(data, k: 3)

🛠 Use Cases

Data Visualization: Reduce high-dimensional embeddings to 2D/3D for plotting

Preprocessing: Dimension reduction before applying other ML algorithms

Customer Segmentation: K-means with automatic cluster detection

Anomaly Detection: HDBSCAN identifies outliers as noise points

Feature Engineering: PCA for feature extraction and noise reduction

Document Analysis: Cluster text embeddings to find topic groups
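For the anomaly detection use case, the post-processing step might look like the sketch below. It assumes noise points are labelled `-1`, which is the usual HDBSCAN convention but is not confirmed by these notes for ClusterKit; check the label your version returns. The helper `split_noise` is hypothetical and works on plain Ruby arrays.

```ruby
NOISE_LABEL = -1 # assumption: noise is marked as -1, the usual HDBSCAN convention

# Splits points into outliers (noise) and clustered points based on their labels.
# points and labels are parallel arrays of the same length.
def split_noise(points, labels, noise_label: NOISE_LABEL)
  pairs = points.zip(labels)
  outliers, clustered = pairs.partition { |_, label| label == noise_label }
  { outliers: outliers.map(&:first), clustered: clustered.map(&:first) }
end

# Made-up points and labels for illustration.
points = [[0.1, 0.2], [0.2, 0.1], [9.0, 9.5], [0.15, 0.18]]
labels = [0, 0, -1, 0]
result = split_noise(points, labels)
puts "#{result[:outliers].size} outlier(s), #{result[:clustered].size} clustered"
```

In practice `labels` would come from something like `hdbscan.fit_predict(embedded)` as in the quick example.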

📦 Installation

gem install clusterkit

Or in your Gemfile:

gem 'clusterkit', '~> 0.1.0'  

Prerequisites: Ruby 2.7+ and a Rust toolchain (for building from source)

🙏 Acknowledgments

ClusterKit builds on outstanding work from the Rust ML ecosystem:

  • annembed by Jean-Pierre Both - provides the core UMAP implementation
  • hdbscan - Rust port of the HDBSCAN algorithm

Special thanks to the Magnus FFI library for making Rust-Ruby integration seamless.

📚 What's Next?

ClusterKit 0.1.0 is just the beginning! Planned features include:

  • Additional distance metrics
  • More clustering algorithms
  • GPU acceleration support
  • Streaming/online learning capabilities

Ready to bring high-performance ML to your Ruby applications? Install ClusterKit 0.1.0 and start clustering! 🎯✨