Releases: scientist-labs/clusterkit

ClusterKit 0.2.0: High-Speed Semantic Search with HNSW 🚀

05 Sep 20:43
f385465

We're excited to announce ClusterKit 0.2.0, bringing fast approximate nearest neighbor search to Ruby with full HNSW (Hierarchical Navigable Small World) integration!

🔍 HNSW: Millisecond Vector Search

ClusterKit now includes a complete HNSW implementation for lightning-fast similarity search. Perfect for recommendation systems, semantic search, and real-time ML applications.

Quick Example with red-candle

require 'clusterkit'
require 'candle'

# Load embedding model
embedding_model = Candle::EmbeddingModel.from_pretrained(
  'sentence-transformers/all-MiniLM-L6-v2',
  device: Candle::Device.best
)

# Create HNSW index
index = ClusterKit::HNSW.new(
  dim: 384,                    # Embedding size (all-MiniLM-L6-v2 outputs 384 dimensions)
  m: 16,                       # Graph connectivity
  ef_construction: 200,        # Build quality
  max_elements: 10000,         # Index capacity
  random_seed: 42              # Reproducible results
)

# Add documents with metadata
documents = [
  "Ruby is a programming language",
  "Python is popular for data science",
  "Machine learning models require data"
]

documents.each_with_index do |doc, i|
  embedding = embedding_model.embedding(doc).first.to_a
  index.add_item(embedding, 
    label: "doc_#{i}",
    metadata: {
      'text' => doc,
      'length' => doc.length,
      'word_count' => doc.split.size
    }
  )
end

# Semantic search
query = "programming languages"
query_embedding = embedding_model.embedding(query).first.to_a
results = index.search_with_metadata(query_embedding, k: 3)

results.each do |result|
  # 1.0 - distance reads as a similarity only for cosine-style distances
  puts "#{result[:metadata]['text']} (similarity: #{1.0 - result[:distance]})"
end
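
The example prints 1.0 - result[:distance] as a similarity score. That conversion only makes sense if the index reports cosine distance, which these notes do not state explicitly, so treat it as an assumption. A minimal plain-Ruby sketch of the relationship (the helper names are illustrative, not part of ClusterKit):

```ruby
# Cosine similarity between two equal-length numeric arrays.
def cosine_similarity(a, b)
  raise ArgumentError, "dimension mismatch" unless a.size == b.size

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  raise ArgumentError, "zero-length vector" if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Cosine distance is taken here as 1 - similarity, so
# similarity = 1 - distance, matching the conversion in the example.
def cosine_distance(a, b)
  1.0 - cosine_similarity(a, b)
end

same      = cosine_distance([1.0, 0.0], [2.0, 0.0])  # parallel vectors
unrelated = cosine_distance([1.0, 0.0], [0.0, 3.0])  # orthogonal vectors
puts "parallel: #{same}, orthogonal: #{unrelated}"
```

If the index uses a different metric (for example Euclidean), print the raw distance instead of converting it.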

🎯 Key Features

Fast & Scalable

  • Sub-millisecond search for most datasets
  • Configurable precision/speed tradeoff via ef parameter
  • Parallel index construction for large datasets
  • Memory-efficient storage with controllable footprint

Ruby-Native Experience

  • Metadata support - store rich data alongside vectors
  • Flexible input types - handles Ruby arrays seamlessly
  • Save/load functionality - persist indices to disk
  • Error handling - clear Ruby exceptions for debugging

Production Ready

  • Batch operations for efficient bulk loading
  • Search quality controls - adjust ef for speed vs accuracy
  • Memory estimates built-in for capacity planning
  • Thread-safe operations throughout
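
For a feel of what a memory estimate involves, here is a back-of-envelope sketch in plain Ruby. It is not ClusterKit's built-in estimator: the method name is hypothetical, and the 4-byte floats, 4-byte neighbour ids, and 2*m base-layer links are assumptions rather than figures from this release.

```ruby
# Rough HNSW memory estimate in bytes.
# Assumptions (not from the release notes): 4-byte floats, 4-byte neighbour
# ids, and up to 2*m links per element on the base layer. Upper layers and
# metadata are ignored, so treat the result as a rough floor.
def estimate_hnsw_bytes(elements:, dim:, m:)
  vector_bytes = dim * 4      # raw vector storage per element
  link_bytes   = 2 * m * 4    # base-layer neighbour list per element
  elements * (vector_bytes + link_bytes)
end

bytes = estimate_hnsw_bytes(elements: 10_000, dim: 384, m: 16)
puts format("~%.1f MiB", bytes / (1024.0 * 1024.0))
```

Raising m increases both recall and the per-element link cost, which is why the connectivity setting shows up in capacity planning.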

📊 Performance Configurations

dim = 384

# High recall (>95%) - Best quality, slower
index = ClusterKit::HNSW.new(
  dim: dim, m: 32, ef_construction: 400
).tap { |idx| idx.set_ef(100) }

# Balanced (>90%) - Good quality, fast
index = ClusterKit::HNSW.new(
  dim: dim, m: 16, ef_construction: 200  
).tap { |idx| idx.set_ef(50) }

# Speed optimized (>85%) - Fastest
index = ClusterKit::HNSW.new(
  dim: dim, m: 8, ef_construction: 100
).tap { |idx| idx.set_ef(20) }
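
The recall figures above depend on your data, so it is worth measuring them yourself. One way, sketched in plain Ruby, is to compare the ids an approximate search returns against brute-force exact neighbours. The helper names are hypothetical, the approx array stands in for ids returned by an index, and Euclidean distance is an assumption (use whichever metric your index uses).

```ruby
# Squared Euclidean distance between two equal-length arrays.
def squared_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Exact k nearest neighbours by brute force; returns indices into `vectors`.
def exact_neighbors(vectors, query, k)
  vectors.each_index.min_by(k) { |i| squared_distance(vectors[i], query) }
end

# recall@k = size of (approximate ids intersected with exact ids) / k
def recall_at_k(approx_ids, exact_ids)
  (approx_ids & exact_ids).size.to_f / exact_ids.size
end

vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
query   = [0.1, 0.0]

exact  = exact_neighbors(vectors, query, 2)
approx = [0, 2]  # hypothetical output of an approximate index

puts "recall@2 = #{recall_at_k(approx, exact)}"
```

Averaging this over a sample of real queries gives a recall number you can track while tuning m, ef_construction, and ef.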

🛠 Use Cases

Semantic Search: Build document search with embedding models like sentence-transformers

Recommendation Systems: Find similar items/users with sub-second response times

Duplicate Detection: Identify near-duplicate content efficiently

Real-time ML: Power recommendation APIs with millisecond latency

RAG Applications: Fast retrieval for retrieval-augmented generation

🔧 Technical Improvements

  • Enhanced metadata handling - supports mixed data types (strings, integers, floats)
  • Better error propagation - clearer debugging experience
  • Rust performance optimizations - leverages latest hnsw_rs improvements
  • Memory safety - proper lifetime management for index persistence

📦 Installation

gem install clusterkit

Or in your Gemfile:

gem 'clusterkit', '~> 0.2.0'
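
The ~> 0.2.0 constraint accepts any 0.2.x release but not 0.3.0. A quick check using RubyGems' own classes:

```ruby
require 'rubygems'

requirement = Gem::Requirement.new('~> 0.2.0')

# Print which versions the pessimistic constraint would accept.
%w[0.1.0 0.2.0 0.2.5 0.3.0].each do |v|
  ok = requirement.satisfied_by?(Gem::Version.new(v))
  puts "#{v}: #{ok ? 'allowed' : 'rejected'}"
end
```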

🔄 Compatibility

  • Full backward compatibility with 0.1.x clustering and dimensionality reduction
  • Ruby 2.7+ support
  • Works alongside existing UMAP, PCA, K-means, and HDBSCAN functionality

📚 Documentation

Complete examples and API documentation available in the README.

🙏 Acknowledgments

Built on the excellent hnsw_rs Rust library. Special thanks to the annembed and Magnus communities for their foundational work.


Ready to add lightning-fast search to your Ruby applications? Upgrade to ClusterKit 0.2.0 and start building! ⚡️🔍

ClusterKit 0.1.0: High-Performance Clustering & Dimensionality Reduction for Ruby 🚀

05 Sep 20:40
a6f92e9

We're thrilled to announce the first stable release of ClusterKit - bringing state-of-the-art dimensionality reduction and clustering algorithms to Ruby through native Rust bindings!

🎯 What is ClusterKit?

ClusterKit provides Ruby developers with fast, reliable implementations of essential machine learning algorithms. No more shelling out to Python or dealing with complex dependencies - run UMAP, PCA, K-means, and HDBSCAN directly in your Ruby process with performance that matches (and often exceeds) scikit-learn.

⚡️ Core Features

Dimensionality Reduction

  • UMAP - State-of-the-art manifold learning for visualization and preprocessing
  • PCA - Fast Principal Component Analysis with explained variance
  • SVD - Singular Value Decomposition for matrix factorization

Advanced Clustering

  • K-means - With automatic selection of the number of clusters (k) via the elbow method
  • HDBSCAN - Density-based clustering that finds clusters and noise automatically
  • Silhouette scoring - Evaluate cluster quality
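
To show what silhouette scoring measures, here is a small plain-Ruby sketch of the standard definition. It does not use ClusterKit's API, the helper names are hypothetical, and Euclidean distance is an assumption.

```ruby
def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Mean distance from point i to the points at the given indices.
def mean_distance(points, i, members)
  members.sum { |j| euclidean(points[i], points[j]) } / members.size
end

# Mean silhouette coefficient over all points.
# For each point: a = mean distance to its own cluster (excluding itself),
# b = smallest mean distance to any other cluster, s = (b - a) / max(a, b).
def silhouette_score(points, labels)
  clusters = points.each_index.group_by { |i| labels[i] }
  raise ArgumentError, "need at least 2 clusters" if clusters.size < 2

  scores = points.each_index.map do |i|
    own = clusters[labels[i]] - [i]
    next 0.0 if own.empty?  # singleton clusters score 0 by convention

    a = mean_distance(points, i, own)
    b = clusters.reject { |label, _| label == labels[i] }
                .values
                .map { |members| mean_distance(points, i, members) }
                .min
    denom = [a, b].max
    denom.zero? ? 0.0 : (b - a) / denom
  end
  scores.sum / scores.size
end

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
good = silhouette_score(points, [0, 0, 1, 1])  # labels match the two groups
bad  = silhouette_score(points, [0, 1, 0, 1])  # labels cut across the groups
puts "good labelling: #{good.round(3)}, bad labelling: #{bad.round(3)}"
```

Scores close to 1 indicate tight, well-separated clusters; scores near or below 0 suggest points sit closer to another cluster than to their own.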

Built for Ruby

  • Scikit-learn-like API - Familiar fit, transform, and fit_transform methods
  • Native Ruby types - Works seamlessly with Ruby arrays and Numo::NArray
  • Comprehensive error handling - Clear Ruby exceptions with helpful messages
  • Reproducible results - Seed support for deterministic outputs

🚀 Quick Example

require 'clusterkit'

# Generate sample data (99 points, 50 dimensions, 3 natural clusters)
srand(42)  # Reproducible results
data = []
3.times do |cluster|
  center = Array.new(50) { rand * 0.1 + cluster * 2.0 }
  33.times do
    point = center.map { |c| c + (rand - 0.5) * 0.3 }
    data << point
  end
end

# Reduce to 2D with UMAP
umap = ClusterKit::Dimensionality::UMAP.new(n_components: 2, n_neighbors: 5)
embedded = umap.fit_transform(data)

# Find clusters automatically
elbow_scores = ClusterKit::Clustering::KMeans.elbow_method(embedded, k_range: 2..6)
optimal_k = ClusterKit::Clustering::KMeans.detect_optimal_k(elbow_scores)

kmeans = ClusterKit::Clustering::KMeans.new(k: optimal_k)
labels = kmeans.fit_predict(embedded)
puts "Found #{labels.uniq.size} clusters"

# Try density-based clustering
hdbscan = ClusterKit::Clustering::HDBSCAN.new(min_samples: 5)
hdbscan_labels = hdbscan.fit_predict(embedded)
puts "HDBSCAN found #{hdbscan.n_clusters} clusters with #{hdbscan.n_noise_points} noise points"
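
These notes do not say how detect_optimal_k chooses k. One common elbow heuristic, sketched below in plain Ruby, picks the k where the drop in inertia slows down the most. The method name pick_elbow and the Hash input format are assumptions for illustration, not ClusterKit's implementation.

```ruby
# inertias: Hash of k => within-cluster sum of squares, for consecutive k.
# Returns the interior k with the largest second difference, i.e. the point
# where adding another cluster stops paying off as much.
def pick_elbow(inertias)
  ks = inertias.keys.sort
  raise ArgumentError, "need at least 3 values of k" if ks.size < 3

  triple = ks.each_cons(3).max_by do |prev_k, k, next_k|
    (inertias[prev_k] - inertias[k]) - (inertias[k] - inertias[next_k])
  end
  triple[1]
end

# Made-up inertia values for illustration only.
scores = { 2 => 900.0, 3 => 300.0, 4 => 260.0, 5 => 240.0, 6 => 230.0 }
puts "elbow at k=#{pick_elbow(scores)}"
```

The heuristic can only return an interior k, so the range passed to the elbow search should extend at least one step on either side of the values you consider plausible.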

🎨 Built-in Visualization

Generate interactive HTML visualizations with a single command:

rake clusterkit:visualize

Creates beautiful side-by-side comparisons of dimensionality reduction methods with clustering results - perfect for exploratory data analysis.

📊 Performance & Reliability

Rust-Powered Speed

  • 2-3x faster than equivalent Python implementations
  • Parallel processing by default where beneficial
  • Memory efficient - handles large datasets without excessive RAM usage

Production Ready

  • Robust error handling - No mysterious crashes on edge cases
  • Extensive testing - Works reliably across different data types and ranges
  • Reproducible results - Seed support for consistent outputs in testing/research

Real-World Tested

  • UMAP stability fixes - Handles extreme data ranges and edge cases that crash other implementations
  • Comprehensive parameter validation - Clear error messages when something's wrong
  • Memory management - Proper cleanup prevents memory leaks in long-running processes

🏗 Clean API Design

ClusterKit organizes functionality into logical modules:

# Dimensionality reduction
ClusterKit::Dimensionality::UMAP
ClusterKit::Dimensionality::PCA  
ClusterKit::Dimensionality::SVD

# Clustering algorithms  
ClusterKit::Clustering::KMeans
ClusterKit::Clustering::HDBSCAN

# Convenience methods
ClusterKit.umap(data, n_components: 2)
ClusterKit.pca(data, n_components: 2)  
ClusterKit.kmeans(data, k: 3)

🛠 Use Cases

Data Visualization: Reduce high-dimensional embeddings to 2D/3D for plotting

Preprocessing: Dimension reduction before applying other ML algorithms

Customer Segmentation: K-means with automatic selection of the number of clusters

Anomaly Detection: HDBSCAN identifies outliers as noise points

Feature Engineering: PCA for feature extraction and noise reduction

Document Analysis: Cluster text embeddings to find topic groups
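
For the anomaly detection case, the usual pattern is to separate noise points from clustered points once labels are available. The sketch below is plain Ruby and assumes noise is labelled -1, the common HDBSCAN convention; confirm the label ClusterKit emits before relying on it. The method name split_noise is hypothetical.

```ruby
# Split points into { label => [points] } clusters and a list of noise points.
# noise_label defaults to -1, which is an assumed convention.
def split_noise(points, labels, noise_label: -1)
  grouped = points.zip(labels)
                  .group_by { |_, label| label }
                  .transform_values { |pairs| pairs.map(&:first) }
  noise = grouped.delete(noise_label) || []
  [grouped, noise]
end

points = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0], [42.0, -7.0]]
labels = [0, 0, 1, 1, -1]  # illustrative labels, as if from fit_predict

clusters, outliers = split_noise(points, labels)
puts "#{clusters.size} clusters, #{outliers.size} outlier(s): #{outliers.inspect}"
```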

📦 Installation

gem install clusterkit

Or in your Gemfile:

gem 'clusterkit', '~> 0.1.0'  

Prerequisites: Ruby 2.7+ and Rust toolchain (for building from source)

🙏 Acknowledgments

ClusterKit builds on outstanding work from the Rust ML ecosystem:

  • annembed by Jean-Pierre Both - provides the core UMAP implementation
  • hdbscan - Rust port of the HDBSCAN algorithm

Special thanks to the Magnus FFI library for making Rust-Ruby integration seamless.

📚 What's Next?

ClusterKit 0.1.0 is just the beginning! Planned features include:

  • Additional distance metrics
  • More clustering algorithms
  • GPU acceleration support
  • Streaming/online learning capabilities

Ready to bring high-performance ML to your Ruby applications? Install ClusterKit 0.1.0 and start clustering! 🎯✨
