
ClusterKit 0.2.0: High-Speed Semantic Search with HNSW 🚀


@cpetersen cpetersen released this 05 Sep 20:43
· 14 commits to main since this release
f385465

We're excited to announce ClusterKit 0.2.0, bringing fast approximate nearest neighbor search to Ruby with full HNSW (Hierarchical Navigable Small World) integration!

🔍 HNSW: Millisecond Vector Search

ClusterKit now includes a complete HNSW implementation for fast approximate similarity search. It is well suited to recommendation systems, semantic search, and real-time ML applications.

Quick Example with red-candle

require 'clusterkit'
require 'candle'

# Load embedding model
embedding_model = Candle::EmbeddingModel.from_pretrained(
  'sentence-transformers/all-MiniLM-L6-v2',
  device: Candle::Device.best
)

# Create HNSW index
index = ClusterKit::HNSW.new(
  dim: 384,                    # Model dimensions
  m: 16,                       # Graph connectivity
  ef_construction: 200,        # Build quality
  max_elements: 10000,         # Index capacity
  random_seed: 42              # Reproducible results
)

# Add documents with metadata
documents = [
  "Ruby is a programming language",
  "Python is popular for data science",
  "Machine learning models require data"
]

documents.each_with_index do |doc, i|
  embedding = embedding_model.embedding(doc).first.to_a
  index.add_item(embedding,
    label: "doc_#{i}",
    metadata: {
      'text' => doc,
      'length' => doc.length,
      'word_count' => doc.split.size
    }
  )
end

# Semantic search
query = "programming languages"
query_embedding = embedding_model.embedding(query).first.to_a
results = index.search_with_metadata(query_embedding, k: 3)

# Note: 1.0 - distance is a similarity score only when the index uses cosine distance
results.each do |result|
  puts "#{result[:metadata]['text']} (similarity: #{1.0 - result[:distance]})"
end
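To make the `1.0 - distance` conversion above concrete, here is a minimal plain-Ruby sketch of an exact (brute-force) cosine search. It does not use ClusterKit at all. The helper names `cosine_similarity`, `cosine_distance`, and `exact_search` are hypothetical, and the sketch assumes the index reports cosine distance. A baseline like this can be handy for checking approximate results on a small sample.

```ruby
# Brute-force cosine search in plain Ruby, for sanity-checking small datasets.
# Assumption: distances are cosine distances, i.e. 1 - cosine similarity.

def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

def cosine_distance(a, b)
  1.0 - cosine_similarity(a, b)
end

# Returns the k nearest [label, distance] pairs, closest first.
def exact_search(vectors, query, k:)
  vectors
    .map { |label, vec| [label, cosine_distance(vec, query)] }
    .min_by(k) { |_, dist| dist }
end

# Toy 3-dimensional vectors standing in for real 384-dimensional embeddings.
vectors = {
  'doc_0' => [1.0, 0.0, 0.0],
  'doc_1' => [0.0, 1.0, 0.0],
  'doc_2' => [1.0, 1.0, 0.0]
}

exact_search(vectors, [1.0, 0.1, 0.0], k: 2).each do |label, dist|
  puts "#{label} (similarity: #{(1.0 - dist).round(3)})"
end
```

An exact scan costs one distance computation per stored vector, which is the cost an HNSW index avoids by walking a graph instead.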

🎯 Key Features

Fast & Scalable

  • Sub-millisecond search for most datasets (actual latency depends on dimensions, index size, and ef)
  • Configurable precision/speed tradeoff via the ef parameter
  • Parallel index construction for large datasets
  • Memory-efficient storage with controllable footprint

Ruby-Native Experience

  • Metadata support - store rich data alongside vectors
  • Flexible input types - handles Ruby arrays seamlessly
  • Save/load functionality - persist indices to disk
  • Error handling - clear Ruby exceptions for debugging

Production Ready

  • Batch operations for efficient bulk loading
  • Search quality controls - adjust ef for speed vs accuracy
  • Memory estimates built-in for capacity planning
  • Thread-safe operations throughout
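For capacity planning by hand, a back-of-the-envelope estimate can be sketched as follows. This is not ClusterKit's built-in estimator. It applies a common HNSW rule of thumb (roughly `dim * 4` bytes per vector plus about `m * 2 * 4` bytes of graph links per element), assumes 32-bit floats, and ignores metadata and upper-layer overhead. The method name `estimate_hnsw_bytes` is hypothetical.

```ruby
# Rough HNSW memory estimate. A rule of thumb, not ClusterKit's own calculation.
# Assumptions: float32 vectors, 4-byte neighbour ids, about 2 * m links per
# element on the base layer, and no metadata.

BYTES_PER_FLOAT32 = 4
BYTES_PER_LINK = 4

def estimate_hnsw_bytes(elements:, dim:, m:)
  vector_bytes = dim * BYTES_PER_FLOAT32
  link_bytes = m * 2 * BYTES_PER_LINK
  elements * (vector_bytes + link_bytes)
end

bytes = estimate_hnsw_bytes(elements: 10_000, dim: 384, m: 16)
puts format('~%.1f MiB', bytes / (1024.0 * 1024.0))
```

Under these assumptions the vectors dominate the footprint at typical embedding sizes, so doubling `m` costs far less memory than doubling `dim`.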

📊 Performance Configurations

dim = 384

# High recall (>95%) - Best quality, slower
index = ClusterKit::HNSW.new(
  dim: dim, m: 32, ef_construction: 400
).tap { |idx| idx.set_ef(100) }

# Balanced (>90%) - Good quality, fast
index = ClusterKit::HNSW.new(
  dim: dim, m: 16, ef_construction: 200
).tap { |idx| idx.set_ef(50) }

# Speed optimized (>85%) - Fastest
index = ClusterKit::HNSW.new(
  dim: dim, m: 8, ef_construction: 100
).tap { |idx| idx.set_ef(20) }
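The recall figures quoted for these configurations will vary with your data, so it can be worth measuring recall yourself. The sketch below is plain Ruby with hypothetical helper names (`recall_at_k`, `mean_recall`). It compares the labels returned by an approximate search against the labels from an exact search for the same queries. How you obtain either list is left open and the sample data is made up.

```ruby
require 'set'

# Fraction of the true top-k neighbours that the approximate search returned.
def recall_at_k(approx_labels, exact_labels)
  return 0.0 if exact_labels.empty?

  hits = (approx_labels.to_set & exact_labels.to_set).size
  hits.to_f / exact_labels.size
end

# Average recall over many queries; each argument is an array of label arrays,
# one inner array per query, in the same query order.
def mean_recall(approx_results, exact_results)
  pairs = approx_results.zip(exact_results)
  pairs.sum { |approx, exact| recall_at_k(approx, exact) } / pairs.size
end

# Made-up results for two queries with k = 4.
approx = [%w[a b c d], %w[e f g x]]
exact  = [%w[a b c d], %w[e f g h]]

puts format('mean recall@4: %.3f', mean_recall(approx, exact))
```

If measured recall falls short of your target, raising ef is usually the first knob to try, since it does not require rebuilding the index.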

🛠 Use Cases

Semantic Search: Build document search with embedding models like sentence-transformers

Recommendation Systems: Find similar items/users with sub-second response times

Duplicate Detection: Identify near-duplicate content efficiently

Real-time ML: Power recommendation APIs with millisecond latency

RAG Applications: Fast retrieval for retrieval-augmented generation

🔧 Technical Improvements

  • Enhanced metadata handling - supports mixed data types (strings, integers, floats)
  • Better error propagation - clearer debugging experience
  • Rust performance optimizations - leverages latest hnsw_rs improvements
  • Memory safety - proper lifetime management for index persistence

📦 Installation

gem install clusterkit

Or in your Gemfile:

gem 'clusterkit', '~> 0.2.0'

🔄 Compatibility

  • Full backward compatibility with 0.1.x clustering and dimensionality reduction
  • Ruby 2.7+ support
  • Works alongside existing UMAP, PCA, K-means, and HDBSCAN functionality

📚 Documentation

Complete examples and API documentation are available in the README.

🙏 Acknowledgments

Built on the excellent hnsw_rs Rust library. Special thanks to the annembed and Magnus communities for their foundational work.


Ready to add lightning-fast search to your Ruby applications? Upgrade to ClusterKit 0.2.0 and start building! ⚡️🔍