ClusterKit 0.2.0: High-Speed Semantic Search with HNSW 🚀
We're excited to announce ClusterKit 0.2.0, bringing fast approximate nearest neighbor search to Ruby with full HNSW (Hierarchical Navigable Small World) integration!
🔍 HNSW: Millisecond Vector Search
ClusterKit now includes a complete HNSW implementation for lightning-fast similarity search. Perfect for recommendation systems, semantic search, and real-time ML applications.
Quick Example with red-candle
require 'clusterkit'
require 'candle'
# Load embedding model
embedding_model = Candle::EmbeddingModel.from_pretrained(
  'sentence-transformers/all-MiniLM-L6-v2',
  device: Candle::Device.best
)
# Create HNSW index
index = ClusterKit::HNSW.new(
  dim: 384,             # Model dimensions
  m: 16,                # Graph connectivity
  ef_construction: 200, # Build quality
  max_elements: 10000,
  random_seed: 42       # Reproducible results
)
# Add documents with metadata
documents = [
  "Ruby is a programming language",
  "Python is popular for data science",
  "Machine learning models require data"
]
documents.each_with_index do |doc, i|
  embedding = embedding_model.embedding(doc).first.to_a
  index.add_item(embedding,
    label: "doc_#{i}",
    metadata: {
      'text' => doc,
      'length' => doc.length,
      'word_count' => doc.split.size
    }
  )
end
# Semantic search
query = "programming languages"
query_embedding = embedding_model.embedding(query).first.to_a
results = index.search_with_metadata(query_embedding, k: 3)
results.each do |result|
  puts "#{result[:metadata]['text']} (similarity: #{1.0 - result[:distance]})"
end
🎯 Key Features
Fast & Scalable
- Sub-millisecond search for most datasets
- Configurable precision/speed tradeoff via the ef parameter
- Parallel index construction for large datasets
- Memory-efficient storage with controllable footprint
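For capacity planning, a common back-of-envelope estimate for an HNSW index is the raw float32 vector storage plus roughly 2 × m neighbor links per element on the base layer. The sketch below applies that rule of thumb. The helper name `estimate_hnsw_bytes` and the 4-byte link size are assumptions for illustration, not ClusterKit's built-in estimator, and real usage will be higher once upper layers, labels, and metadata are counted.

```ruby
# Rough HNSW memory estimate (rule of thumb, NOT ClusterKit's own estimator).
# Assumes float32 vectors and 4-byte neighbor ids; ignores upper layers,
# labels, and metadata.
def estimate_hnsw_bytes(num_elements:, dim:, m:)
  vector_bytes = dim * 4      # one float32 per dimension
  link_bytes   = m * 2 * 4    # up to 2*m base-layer links, 4 bytes each
  num_elements * (vector_bytes + link_bytes)
end

bytes = estimate_hnsw_bytes(num_elements: 10_000, dim: 384, m: 16)
puts format('~%.1f MiB', bytes / (1024.0 * 1024.0)) # prints ~15.9 MiB
```

Under these assumptions the vectors dominate the footprint, so lowering dim (a smaller embedding model) usually saves more memory than lowering m.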
Ruby-Native Experience
- Metadata support - store rich data alongside vectors
- Flexible input types - handles Ruby arrays seamlessly
- Save/load functionality - persist indices to disk
- Error handling - clear Ruby exceptions for debugging
Production Ready
- Batch operations for efficient bulk loading
- Search quality controls - adjust ef for speed vs accuracy
- Memory estimates built-in for capacity planning
- Thread-safe operations throughout
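The speed-versus-accuracy tradeoff controlled by ef is usually quantified as recall@k: the fraction of the true k nearest neighbors that the approximate search returns. A minimal pure-Ruby sketch follows, assuming cosine distance and a dataset small enough for exact brute-force search. `exact_knn` and `recall_at_k` are hypothetical helpers, and the `approx` array stands in for the ids an index search would return.

```ruby
# Cosine distance between two equal-length numeric arrays.
def cosine_distance(a, b)
  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  1.0 - dot / (norm_a * norm_b)
end

# Exact k nearest neighbors by brute force; returns indices into `vectors`.
def exact_knn(vectors, query, k)
  vectors.each_index.min_by(k) { |i| cosine_distance(vectors[i], query) }
end

# Fraction of the true neighbors that appear in the approximate result.
def recall_at_k(approx_ids, exact_ids)
  (approx_ids & exact_ids).size.to_f / exact_ids.size
end

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
query   = [1.0, 0.05]
exact   = exact_knn(vectors, query, 2)
approx  = [0, 2] # stand-in for ids returned by an approximate search
puts "recall@2 = #{recall_at_k(approx, exact)}"
```

Averaging this value over a sample of real queries at several ef settings gives a direct picture of how much accuracy each speedup costs on your own data.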
📊 Performance Configurations
dim = 384
# High recall (>95%) - Best quality, slower
index = ClusterKit::HNSW.new(
  dim: dim, m: 32, ef_construction: 400
).tap { |idx| idx.set_ef(100) }
# Balanced (>90%) - Good quality, fast
index = ClusterKit::HNSW.new(
  dim: dim, m: 16, ef_construction: 200
).tap { |idx| idx.set_ef(50) }
# Speed optimized (>85%) - Fastest
index = ClusterKit::HNSW.new(
  dim: dim, m: 8, ef_construction: 100
).tap { |idx| idx.set_ef(20) }
🛠 Use Cases
Semantic Search: Build document search with embedding models like sentence-transformers
Recommendation Systems: Find similar items/users with sub-second response times
Duplicate Detection: Identify near-duplicate content efficiently
Real-time ML: Power recommendation APIs with millisecond latency
RAG Applications: Fast retrieval for retrieval-augmented generation
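For duplicate detection, one common pattern is to treat any pair whose cosine similarity meets a threshold as near-duplicates. The sketch below compares every pair exhaustively in plain Ruby to show the logic; with a real collection, the pairwise loop would be replaced by a nearest-neighbor query per item. The toy vectors, the `near_duplicates` helper, and the 0.95 threshold are illustrative assumptions.

```ruby
# Cosine similarity between two equal-length numeric arrays.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

# Returns [label_a, label_b] pairs whose similarity meets the threshold.
def near_duplicates(items, threshold: 0.95)
  items.keys.combination(2).select do |a, b|
    cosine_similarity(items[a], items[b]) >= threshold
  end
end

# Toy embeddings keyed by label (hypothetical data).
items = {
  'doc_0' => [0.9, 0.1, 0.0],
  'doc_1' => [0.88, 0.12, 0.01],
  'doc_2' => [0.0, 0.2, 0.9]
}
near_duplicates(items).each { |a, b| puts "#{a} ~ #{b}" }
```

The right threshold depends on the embedding model and the content, so it is worth tuning against a handful of known duplicate and non-duplicate pairs.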
🔧 Technical Improvements
- Enhanced metadata handling - supports mixed data types (strings, integers, floats)
- Better error propagation - clearer debugging experience
- Rust performance optimizations - leverages latest hnsw_rs improvements
- Memory safety - proper lifetime management for index persistence
📦 Installation
gem install clusterkit
Or in your Gemfile:
gem 'clusterkit', '~> 0.2.0'
🔄 Compatibility
- Full backward compatibility with 0.1.x clustering and dimensionality reduction
- Ruby 2.7+ support
- Works alongside existing UMAP, PCA, K-means, and HDBSCAN functionality
📚 Documentation
Complete examples and API documentation available in the README.
🙏 Acknowledgments
Built on the excellent hnsw_rs Rust library. Special thanks to the annembed and Magnus communities for their foundational work.
Ready to add lightning-fast search to your Ruby applications? Upgrade to ClusterKit 0.2.0 and start building! ⚡️🔍