Release ClusterKit 0.1.0: High-Performance Clustering & Dimensionality Reduction for Ruby 🚀 · scientist-labs/clusterkit

We're thrilled to announce the first stable release of ClusterKit - bringing state-of-the-art dimensionality reduction and clustering algorithms to Ruby through native Rust bindings!

🎯 What is ClusterKit?

ClusterKit provides Ruby developers with fast, reliable implementations of essential machine learning algorithms. No more shelling out to Python or dealing with complex dependencies - run UMAP, PCA, K-means, and HDBSCAN directly in your Ruby process with performance that matches (and often exceeds) scikit-learn.

⚡️ Core Features

Dimensionality Reduction

UMAP - State-of-the-art manifold learning for visualization and preprocessing
PCA - Fast Principal Component Analysis with explained variance
SVD - Singular Value Decomposition for matrix factorization
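
To make "explained variance" concrete, here is a minimal pure-Ruby sketch that estimates how much of the total variance the first principal component captures, using power iteration on the covariance matrix. It is an illustration of the idea only, not ClusterKit's implementation, and every helper name in it (`center_columns`, `covariance`, `dominant_eigenpair`, `first_component_variance_ratio`) is hypothetical.

```ruby
# Illustrative only: these helpers are hypothetical and not part of ClusterKit.

# Subtract each column's mean so the data is centered at the origin.
def center_columns(rows)
  n = rows.size.to_f
  means = rows.transpose.map { |col| col.sum / n }
  rows.map { |row| row.each_with_index.map { |v, j| v - means[j] } }
end

# Sample covariance matrix (d x d) of rows that are already centered.
def covariance(centered)
  n = centered.size
  cols = centered.transpose
  cols.map do |a|
    cols.map { |b| a.zip(b).sum { |x, y| x * y } / (n - 1).to_f }
  end
end

# Power iteration: repeatedly multiply and normalize to approach the
# eigenvector with the largest eigenvalue. Assumes the start vector is
# not orthogonal to that eigenvector.
def dominant_eigenpair(matrix, iterations: 200)
  d = matrix.size
  v = Array.new(d) { 1.0 / Math.sqrt(d) }
  iterations.times do
    w = matrix.map { |row| row.zip(v).sum { |a, b| a * b } }
    norm = Math.sqrt(w.sum { |x| x * x })
    break if norm.zero?
    v = w.map { |x| x / norm }
  end
  # Rayleigh quotient gives the eigenvalue for the unit vector v.
  mv = matrix.map { |row| row.zip(v).sum { |a, b| a * b } }
  [v.zip(mv).sum { |a, b| a * b }, v]
end

# Largest eigenvalue divided by the trace (the total variance).
def first_component_variance_ratio(rows)
  cov = covariance(center_columns(rows))
  eigenvalue, _vector = dominant_eigenpair(cov)
  total = (0...cov.size).sum { |i| cov[i][i] }
  eigenvalue / total
end

points = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0], [5.0, 10.1]]
ratio = first_component_variance_ratio(points)
puts format("First component explains %.1f%% of the variance", ratio * 100)
```

The sample points lie almost on a straight line, so nearly all of the variance falls on the first component.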

Advanced Clustering

K-means - With automatic cluster detection via elbow method
HDBSCAN - Density-based clustering that finds clusters and noise automatically
Silhouette scoring - Evaluate cluster quality
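
For readers new to silhouette scoring, the following pure-Ruby sketch shows what the metric measures: for each point, how much closer it is to its own cluster than to the nearest other cluster. It is a from-scratch illustration under simple assumptions (Euclidean distance, singleton clusters scored as 0), not ClusterKit's implementation; `euclidean` and `silhouette_score` are hypothetical helper names.

```ruby
# Illustrative only: hypothetical helpers, not part of ClusterKit.

def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Mean silhouette coefficient over all points, in the range -1..1.
def silhouette_score(points, labels)
  clusters = points.each_index.group_by { |i| labels[i] }
  raise ArgumentError, "need at least 2 clusters" if clusters.size < 2

  scores = points.each_index.map do |i|
    own = clusters[labels[i]]
    next 0.0 if own.size == 1 # convention: a singleton cluster scores 0

    # a: mean distance to the other members of the point's own cluster
    a = own.reject { |j| j == i }
           .sum { |j| euclidean(points[i], points[j]) } / (own.size - 1)
    # b: smallest mean distance to the members of any other cluster
    b = clusters.reject { |label, _| label == labels[i] }
                .map { |_, members| members.sum { |j| euclidean(points[i], points[j]) } / members.size }
                .min
    denom = [a, b].max
    denom.zero? ? 0.0 : (b - a) / denom
  end
  scores.sum / scores.size
end

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
good = silhouette_score(points, [0, 0, 1, 1])
bad  = silhouette_score(points, [0, 1, 0, 1])
puts "well-separated labels: #{good.round(3)}, mixed labels: #{bad.round(3)}"
```

Scores near 1 indicate compact, well-separated clusters; scores near or below 0 suggest points assigned to the wrong cluster.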

Built for Ruby

Scikit-learn-like API - Familiar fit, transform, and fit_transform methods
Native Ruby types - Works seamlessly with Ruby arrays and Numo::NArray
Comprehensive error handling - Clear Ruby exceptions with helpful messages
Reproducible results - Seed support for deterministic outputs
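
If the fit / transform / fit_transform convention is unfamiliar, this small sketch shows the pattern on a toy preprocessing step. `StandardScaler` here is a hypothetical class written for illustration and is not part of ClusterKit; the seeded `Random` instance shows one way to keep test data deterministic on the Ruby side.

```ruby
# Hypothetical class for illustration; not part of ClusterKit.
class StandardScaler
  # Learn per-column mean and standard deviation from the training rows.
  def fit(rows)
    n = rows.size.to_f
    cols = rows.transpose
    @means = cols.map { |c| c.sum / n }
    @stds = cols.each_with_index.map do |c, j|
      std = Math.sqrt(c.sum { |v| (v - @means[j])**2 } / n)
      std.zero? ? 1.0 : std # avoid dividing by zero for constant columns
    end
    self
  end

  # Apply the learned statistics to any rows with the same column count.
  def transform(rows)
    raise "call fit before transform" unless @means
    rows.map { |row| row.each_with_index.map { |v, j| (v - @means[j]) / @stds[j] } }
  end

  # Convenience: learn from the rows and transform them in one call.
  def fit_transform(rows)
    fit(rows).transform(rows)
  end
end

rng = Random.new(42) # a fixed seed yields the same sample data on every run
train = Array.new(5) { Array.new(3) { rng.rand } }

scaler = StandardScaler.new
scaled = scaler.fit_transform(train)              # learn and apply in one step
unseen = scaler.transform([[0.5, 0.5, 0.5]])      # reuse the fitted state on new data
```

Separating `fit` from `transform` is what lets a model learned on one dataset be applied consistently to new data later.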

🚀 Quick Example

require 'clusterkit'

# Generate sample data (99 points, 50 dimensions, 3 natural clusters)
srand(42)  # Reproducible results
data = []
3.times do |cluster|
  center = Array.new(50) { rand * 0.1 + cluster * 2.0 }
  33.times do
    point = center.map { |c| c + (rand - 0.5) * 0.3 }
    data << point
  end
end

# Reduce to 2D with UMAP
umap = ClusterKit::Dimensionality::UMAP.new(n_components: 2, n_neighbors: 5)
embedded = umap.fit_transform(data)

# Find clusters automatically
elbow_scores = ClusterKit::Clustering::KMeans.elbow_method(embedded, k_range: 2..6)
optimal_k = ClusterKit::Clustering::KMeans.detect_optimal_k(elbow_scores)

kmeans = ClusterKit::Clustering::KMeans.new(k: optimal_k)
labels = kmeans.fit_predict(embedded)
puts "Found #{labels.uniq.size} clusters"

# Try density-based clustering
hdbscan = ClusterKit::Clustering::HDBSCAN.new(min_samples: 5)
hdbscan_labels = hdbscan.fit_predict(embedded)
puts "HDBSCAN found #{hdbscan.n_clusters} clusters with #{hdbscan.n_noise_points} noise points"
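
The elbow step in the example above picks the k after which adding more clusters stops paying off. The notes do not describe how `detect_optimal_k` decides, so the sketch below shows one common heuristic (largest second difference of the inertia curve) rather than ClusterKit's actual rule. `elbow_k` is a hypothetical helper, the input is assumed to be a Hash of k to inertia, and the numbers are made up for illustration, not measured.

```ruby
# Hypothetical helper; one common elbow heuristic, not ClusterKit's detection logic.
# inertias: Hash mapping k => within-cluster sum of squares for that k.
def elbow_k(inertias)
  ks = inertias.keys.sort
  raise ArgumentError, "need at least 3 values of k" if ks.size < 3

  # For each interior k, compare the drop gained by reaching k with the
  # drop gained by going one step further; the sharpest bend wins.
  ks[1..-2].max_by do |k|
    i = ks.index(k)
    (inertias[ks[i - 1]] - inertias[k]) - (inertias[k] - inertias[ks[i + 1]])
  end
end

# Made-up inertia values for illustration only.
example_scores = { 2 => 410.0, 3 => 120.0, 4 => 105.0, 5 => 96.0, 6 => 90.0 }
chosen_k = elbow_k(example_scores)
puts "Elbow at k = #{chosen_k}"
```

Note that the first and last k in the range can never be chosen by this heuristic, so the range should extend one step beyond the plausible values on both sides.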

🎨 Built-in Visualization

Generate interactive HTML visualizations with a single command:

rake clusterkit:visualize

Creates beautiful side-by-side comparisons of dimensionality reduction methods with clustering results - perfect for exploratory data analysis.

📊 Performance & Reliability

Rust-Powered Speed

2-3x faster than equivalent Python implementations
Parallel processing by default where beneficial
Memory efficient - handles large datasets without excessive RAM usage

Production Ready

Robust error handling - No mysterious crashes on edge cases
Extensive testing - Works reliably across different data types and ranges
Reproducible results - Seed support for consistent outputs in testing/research

Real-World Tested

UMAP stability fixes - Handles extreme data ranges and edge cases that crash other implementations
Comprehensive parameter validation - Clear error messages when something's wrong
Memory management - Proper cleanup prevents memory leaks in long-running processes

🏗 Clean API Design

ClusterKit organizes functionality into logical modules:

# Dimensionality reduction
ClusterKit::Dimensionality::UMAP
ClusterKit::Dimensionality::PCA
ClusterKit::Dimensionality::SVD

# Clustering algorithms
ClusterKit::Clustering::KMeans
ClusterKit::Clustering::HDBSCAN

# Convenience methods
ClusterKit.umap(data, n_components: 2)
ClusterKit.pca(data, n_components: 2)
ClusterKit.kmeans(data, k: 3)

🛠 Use Cases

Data Visualization: Reduce high-dimensional embeddings to 2D/3D for plotting

Preprocessing: Dimension reduction before applying other ML algorithms

Customer Segmentation: K-means with automatic cluster detection

Anomaly Detection: HDBSCAN identifies outliers as noise points

Feature Engineering: PCA for feature extraction and noise reduction

Document Analysis: Cluster text embeddings to find topic groups
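
For the anomaly detection use case, the post-processing step can be as simple as separating noise points from clustered ones. The sketch below works on plain arrays of points and labels. It assumes noise is marked with the label -1, which is the convention in the reference Python HDBSCAN implementations; the notes do not state which label ClusterKit uses, so check the returned labels before relying on it. `split_noise` is a hypothetical helper.

```ruby
# Assumption: noise points carry the label -1 (not confirmed for ClusterKit).
NOISE_LABEL = -1

# Hypothetical helper: split points into outliers and per-cluster groups.
def split_noise(points, labels, noise_label: NOISE_LABEL)
  pairs = points.zip(labels)
  noise, clustered = pairs.partition { |_, label| label == noise_label }
  {
    outliers: noise.map(&:first),
    clusters: clustered.group_by(&:last)
                       .transform_values { |members| members.map(&:first) }
  }
end

sample_points = [[0.0, 0.0], [0.0, 1.0], [9.0, 9.0], [5.0, 5.0]]
sample_labels = [0, 0, 1, -1]

result = split_noise(sample_points, sample_labels)
puts "#{result[:outliers].size} outlier(s), #{result[:clusters].size} cluster(s)"
```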

📦 Installation

gem install clusterkit

Or in your Gemfile:

gem 'clusterkit', '~> 0.1.0'

Prerequisites: Ruby 2.7+ and a Rust toolchain (for building from source)

🙏 Acknowledgments

ClusterKit builds on outstanding work from the Rust ML ecosystem:

annembed by Jean-Pierre Both - provides the core UMAP implementation
hdbscan - Rust port of the HDBSCAN algorithm

Special thanks to the Magnus FFI library for making Rust-Ruby integration seamless.

📚 What's Next?

ClusterKit 0.1.0 is just the beginning! Planned features include:

Additional distance metrics
More clustering algorithms
GPU acceleration support
Streaming/online learning capabilities

Ready to bring high-performance ML to your Ruby applications? Install ClusterKit 0.1.0 and start clustering! 🎯✨
