We're thrilled to announce the first stable release of ClusterKit - bringing state-of-the-art dimensionality reduction and clustering algorithms to Ruby through native Rust bindings!
🎯 What is ClusterKit?
ClusterKit provides Ruby developers with fast, reliable implementations of essential machine learning algorithms. No more shelling out to Python or dealing with complex dependencies - run UMAP, PCA, K-means, and HDBSCAN directly in your Ruby process with performance that matches (and often exceeds) scikit-learn.
⚡️ Core Features
Dimensionality Reduction
- UMAP - State-of-the-art manifold learning for visualization and preprocessing
- PCA - Fast Principal Component Analysis with explained variance
- SVD - Singular Value Decomposition for matrix factorization
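To make "explained variance" concrete: each principal component's share is its eigenvalue divided by the sum of all eigenvalues. The sketch below is plain Ruby with made-up eigenvalues. The helper names `explained_variance_ratio` and `components_for` are hypothetical and are not ClusterKit methods.

```ruby
# Illustrative only: hypothetical helpers, not part of ClusterKit's API.

# Share of total variance captured by each component.
def explained_variance_ratio(eigenvalues)
  total = eigenvalues.sum
  eigenvalues.map { |value| value / total }
end

# Smallest number of leading components whose cumulative share
# reaches the requested threshold (a fraction between 0 and 1).
def components_for(ratios, threshold)
  running = 0.0
  ratios.each_with_index do |ratio, index|
    running += ratio
    return index + 1 if running >= threshold
  end
  ratios.size
end

eigenvalues = [8.0, 4.0, 2.0, 2.0] # made-up values, sorted descending
ratios = explained_variance_ratio(eigenvalues)

puts ratios.inspect
puts components_for(ratios, 0.8)
```

A cumulative threshold like this is a common way to choose n_components for PCA. The 0.8 cutoff here is an arbitrary choice for the example.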
Advanced Clustering
- K-means - With automatic detection of the number of clusters via the elbow method
- HDBSCAN - Density-based clustering that finds clusters and noise automatically
- Silhouette scoring - Evaluate cluster quality
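For readers new to the elbow method: you run K-means for several values of k, record the within-cluster sum of squares (inertia) for each, and pick the k where the curve bends most sharply. The sketch below shows one common heuristic, the largest discrete second difference, in plain Ruby. `pick_elbow` is a hypothetical helper and the inertia values are made up. ClusterKit's own detection may use a different rule.

```ruby
# Hypothetical helper, not part of ClusterKit's API.
# inertias maps k => within-cluster sum of squares for that k.
# Assumes the k values are consecutive integers.
def pick_elbow(inertias)
  ks = inertias.keys.sort
  raise ArgumentError, "need at least 3 values of k" if ks.size < 3

  # Second difference at each interior k: large when the curve
  # drops steeply before k and flattens after it.
  _, elbow_k, _ = ks.each_cons(3).max_by do |prev_k, k, next_k|
    inertias[prev_k] - 2 * inertias[k] + inertias[next_k]
  end
  elbow_k
end

# Made-up inertia curve with a visible bend at k = 3
inertias = { 2 => 400.0, 3 => 120.0, 4 => 100.0, 5 => 90.0, 6 => 85.0 }
puts pick_elbow(inertias)
```

Note that the endpoints of the range can never be selected by this heuristic, so the range should extend at least one step on either side of the k you expect.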
Built for Ruby
- Scikit-learn-like API - Familiar fit, transform, and fit_transform methods
- Native Ruby types - Works seamlessly with Ruby arrays and Numo::NArray
- Comprehensive error handling - Clear Ruby exceptions with helpful messages
- Reproducible results - Seed support for deterministic outputs
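Reproducibility starts with the input data. A minimal sketch in plain Ruby: a local, seeded `Random` instance produces the same sample data on every run without touching the global generator. `make_blobs` is a hypothetical helper, not a ClusterKit method, and how the algorithms themselves accept a seed is not shown here.

```ruby
# Hypothetical helper, not part of ClusterKit: builds blob-shaped sample
# data from a local, seeded generator instead of the global srand.
def make_blobs(seed:, clusters: 3, points_per_cluster: 33, dims: 50)
  rng = Random.new(seed)
  (0...clusters).flat_map do |cluster|
    # One center per cluster, spaced 2.0 apart along every dimension
    center = Array.new(dims) { rng.rand * 0.1 + cluster * 2.0 }
    Array.new(points_per_cluster) do
      center.map { |c| c + (rng.rand - 0.5) * 0.3 }
    end
  end
end

a = make_blobs(seed: 42)
b = make_blobs(seed: 42)
c = make_blobs(seed: 7)

puts a == b  # compares two runs that used the same seed
puts a == c  # compares runs that used different seeds
puts a.size  # clusters * points_per_cluster
```

Using a local generator keeps test data stable even when other code in the process calls rand or srand.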
🚀 Quick Example
require 'clusterkit'
# Generate sample data (99 points: 3 natural clusters of 33, 50 dimensions each)
srand(42) # Reproducible results
data = []
3.times do |cluster|
  center = Array.new(50) { rand * 0.1 + cluster * 2.0 }
  33.times do
    point = center.map { |c| c + (rand - 0.5) * 0.3 }
    data << point
  end
end
# Reduce to 2D with UMAP
umap = ClusterKit::Dimensionality::UMAP.new(n_components: 2, n_neighbors: 5)
embedded = umap.fit_transform(data)
# Find clusters automatically
elbow_scores = ClusterKit::Clustering::KMeans.elbow_method(embedded, k_range: 2..6)
optimal_k = ClusterKit::Clustering::KMeans.detect_optimal_k(elbow_scores)
kmeans = ClusterKit::Clustering::KMeans.new(k: optimal_k)
labels = kmeans.fit_predict(embedded)
puts "Found #{labels.uniq.size} clusters"
# Try density-based clustering
hdbscan = ClusterKit::Clustering::HDBSCAN.new(min_samples: 5)
hdbscan_labels = hdbscan.fit_predict(embedded)
puts "HDBSCAN found #{hdbscan.n_clusters} clusters with #{hdbscan.n_noise_points} noise points"
🎨 Built-in Visualization
Generate interactive HTML visualizations with a single command:
rake clusterkit:visualize
Creates beautiful side-by-side comparisons of dimensionality reduction methods with clustering results - perfect for exploratory data analysis.
📊 Performance & Reliability
Rust-Powered Speed
- 2-3x faster than equivalent Python implementations
- Parallel processing by default where beneficial
- Memory efficient - handles large datasets without excessive RAM usage
Production Ready
- Robust error handling - No mysterious crashes on edge cases
- Extensive testing - Works reliably across different data types and ranges
- Reproducible results - Seed support for consistent outputs in testing/research
Real-World Tested
- UMAP stability fixes - Handles extreme data ranges and edge cases that crash other implementations
- Comprehensive parameter validation - Clear error messages when something's wrong
- Memory management - Proper cleanup prevents memory leaks in long-running processes
🏗 Clean API Design
ClusterKit organizes functionality into logical modules:
# Dimensionality reduction
ClusterKit::Dimensionality::UMAP
ClusterKit::Dimensionality::PCA
ClusterKit::Dimensionality::SVD
# Clustering algorithms
ClusterKit::Clustering::KMeans
ClusterKit::Clustering::HDBSCAN
# Convenience methods
ClusterKit.umap(data, n_components: 2)
ClusterKit.pca(data, n_components: 2)
ClusterKit.kmeans(data, k: 3)
🛠 Use Cases
Data Visualization: Reduce high-dimensional embeddings to 2D/3D for plotting
Preprocessing: Dimension reduction before applying other ML algorithms
Customer Segmentation: K-means with automatic cluster detection
Anomaly Detection: HDBSCAN identifies outliers as noise points
Feature Engineering: PCA for feature extraction and noise reduction
Document Analysis: Cluster text embeddings to find topic groups
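As a sketch of the anomaly-detection use case: once a density-based clusterer has labeled each point, the outliers are simply the points carrying the noise label. The example below is plain Ruby over hand-written labels. It assumes noise is marked with -1, the convention used by the reference HDBSCAN implementations, so check what ClusterKit actually returns before relying on it. `split_noise` is a hypothetical helper.

```ruby
# Assumption: noise points carry the label -1.
NOISE_LABEL = -1

# Hypothetical helper, not part of ClusterKit's API.
# Separates points labeled as noise from points assigned to clusters.
def split_noise(points, labels, noise_label: NOISE_LABEL)
  unless points.size == labels.size
    raise ArgumentError, "points and labels must have the same length"
  end

  noise, clustered = points.zip(labels).partition { |_, label| label == noise_label }

  {
    outliers: noise.map(&:first),
    # Group the remaining points by their cluster label
    clusters: clustered.group_by(&:last)
                       .transform_values { |pairs| pairs.map(&:first) }
  }
end

# Hand-written example data and labels
points = [[0.0, 0.1], [0.2, 0.1], [5.0, 5.1], [9.9, -3.0]]
labels = [0, 0, 1, -1]

result = split_noise(points, labels)
puts "Outliers: #{result[:outliers].inspect}"
puts "Clusters: #{result[:clusters].keys.inspect}"
```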
📦 Installation
gem install clusterkit
Or in your Gemfile:
gem 'clusterkit', '~> 0.1.0'
Prerequisites: Ruby 2.7+ and Rust toolchain (for building from source)
🙏 Acknowledgments
ClusterKit builds on outstanding work from the Rust ML ecosystem:
- annembed by Jean-Pierre Both - provides the core UMAP implementation
- hdbscan - Rust port of the HDBSCAN algorithm
Special thanks to the Magnus FFI library for making Rust-Ruby integration seamless.
📚 What's Next?
ClusterKit 0.1.0 is just the beginning! Planned features include:
- Additional distance metrics
- More clustering algorithms
- GPU acceleration support
- Streaming/online learning capabilities
Ready to bring high-performance ML to your Ruby applications? Install ClusterKit 0.1.0 and start clustering! 🎯✨