We're excited to announce the first release of Topical, a lightweight Ruby gem that orchestrates topic extraction from document embeddings using proven clustering libraries!
What is Topical?
Topical provides a complete topic modeling pipeline for Ruby, integrating ClusterKit's clustering algorithms with c-TF-IDF term extraction and quality metrics. It handles the complex orchestration so you can focus on analyzing your topics, not building the pipeline.
Key Features
Complete Topic Modeling Pipeline
- Integrates ClusterKit - HDBSCAN and K-means clustering via proven Rust libraries
- c-TF-IDF term extraction - Identifies distinctive terms that characterize each topic
- Quality metrics - Coherence and diversity scoring for topic evaluation
- Outlier detection - Surfaces documents that don't fit discovered topic patterns
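To make the c-TF-IDF step concrete, here is a minimal pure-Ruby sketch of the weighting popularized by BERTopic: each cluster is treated as one large document, and a term's in-cluster count is scaled by `log(1 + A / tf_t)`, where `A` is the average number of words per cluster and `tf_t` is the term's count across all clusters. The tokenizer and the helper names (`tokenize`, `c_tf_idf`, `top_terms`) are assumptions made for illustration; Topical's internal implementation may differ.

```ruby
# Illustrative c-TF-IDF sketch; NOT Topical's internal code.
# Assumes a naive lowercase alphanumeric tokenizer.
def tokenize(text)
  text.downcase.scan(/[a-z0-9]+/)
end

# clusters: Hash of cluster id => Array of document strings
def c_tf_idf(clusters)
  # Term counts per cluster (each cluster is treated as one big document)
  counts = clusters.transform_values do |docs|
    docs.flat_map { |doc| tokenize(doc) }.tally
  end

  # Term counts across all clusters
  totals = Hash.new(0)
  counts.each_value { |tf| tf.each { |term, n| totals[term] += n } }

  # Average number of words per cluster
  avg_words = totals.values.sum.to_f / clusters.size

  counts.transform_values do |tf|
    tf.to_h { |term, n| [term, n * Math.log(1 + avg_words / totals[term])] }
  end
end

# Highest-weighted terms first; ties are broken alphabetically
def top_terms(weights, k = 3)
  weights.sort_by { |term, w| [-w, term] }.first(k).map(&:first)
end

clusters = {
  0 => ["ruby code with ruby gems", "ruby code and blocks"],
  1 => ["python code for data science", "python code and data pipelines"]
}

scores = c_tf_idf(clusters)
scores.each { |id, weights| puts "Topic #{id}: #{top_terms(weights, 2).join(', ')}" }
```

In this toy corpus "code" appears in both clusters, so it receives a lower weight than a cluster-specific term with the same raw count, such as "python".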
Fast Topic Labeling
- Term-based labeling - Fast, deterministic labels from distinctive terms
- No LLM required - Labels are generated locally, without calling an external model
- Clean, readable results - Combines top distinctive terms meaningfully
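As a rough illustration of term-based labeling, the sketch below builds a deterministic label by joining the top-ranked terms. The helper `label_from_terms`, its parameters, and the `" & "` separator are hypothetical choices for this example and are not Topical's actual labeling logic.

```ruby
# Hypothetical helper, not part of Topical's API:
# builds a deterministic label from already-ranked terms.
def label_from_terms(terms, max_terms: 3, separator: " & ")
  picked = terms.map(&:strip).reject(&:empty?).uniq.first(max_terms)
  return "Unlabeled" if picked.empty?

  picked.map(&:capitalize).join(separator)
end

puts label_from_terms(%w[billing invoice refund payment])
```

Because the label depends only on the ranked terms, the same input always produces the same output, which is the main appeal over LLM-generated labels.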
Production Ready
- Configurable logging - Use Ruby's standard Logger or your own
- Model persistence - Save and load fitted models as JSON
- Dimensionality reduction - Optional UMAP integration via ClusterKit
- Ruby-native - Clean API that feels natural to Ruby developers
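The general idea behind JSON persistence can be sketched with Ruby's standard library alone. `TinyTopicModel` below is a hypothetical stand-in for a fitted model, and its fields are assumptions; Topical's real on-disk format is not described here and may look quite different.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical stand-in for a fitted model; NOT Topical's actual format.
TinyTopicModel = Struct.new(:config, :topics, keyword_init: true) do
  # Write the model as human-readable JSON
  def save(path)
    File.write(path, JSON.pretty_generate({ config: config, topics: topics }))
  end

  # Rebuild a model from a JSON file, restoring symbol keys
  def self.load(path)
    data = JSON.parse(File.read(path), symbolize_names: true)
    new(config: data[:config], topics: data[:topics])
  end
end

model = TinyTopicModel.new(
  config: { clustering_method: "hdbscan", min_cluster_size: 3 },
  topics: [
    { id: 0, label: "Billing & Refund", terms: %w[billing refund], centroid: [0.12, -0.4] }
  ]
)

Dir.mktmpdir do |dir|
  path = File.join(dir, "model.json")
  model.save(path)
  reloaded = TinyTopicModel.load(path)
  puts reloaded.topics.first[:label]
end
```

One design point worth noting: JSON has no symbol type, so a value such as `:hdbscan` would come back as the string `"hdbscan"` after a round trip. The sketch stores the method name as a string for that reason.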
Quick Example
require 'topical'
require 'red-candle'
# Generate embeddings (using red-candle)
documents = [
  "Ruby is a dynamic programming language with elegant syntax",
  "Python excels at data science and machine learning tasks",
  "JavaScript powers modern web development and user interfaces",
  "Rust provides memory safety without garbage collection",
  "Go simplifies concurrent programming with goroutines"
]
embedding_model = Candle::EmbeddingModel.from_pretrained(
  'sentence-transformers/all-MiniLM-L6-v2'
)
embeddings = documents.map { |doc| embedding_model.embedding(doc).first.to_a }
# Extract topics with intelligent clustering
topics = Topical.extract(
  embeddings: embeddings,
  documents: documents,
  clustering_method: :hdbscan
)
# Explore discovered topics
topics.each do |topic|
  puts "\n#{topic.label}"
  puts "Terms: #{topic.terms.first(5).join(', ')}"
  puts "Documents: #{topic.documents.length}"
  topic.documents.each { |doc| puts " - #{doc[0..60]}..." }
end
Flexible Architecture
Engine-Based API
# Full control over the topic modeling pipeline
require 'logger' # Needed for Logger.new below
engine = Topical::Engine.new(
  clustering_method: :hdbscan,
  min_cluster_size: 3,
  reduce_dimensions: true,
  logger: Logger.new($stdout) # Configurable logging
)
topics = engine.fit(embeddings: embeddings, documents: documents)
# Transform new documents (new_embeddings should come from the same embedding model used for fitting)
new_assignments = engine.transform(embeddings: new_embeddings)
# Persist your model
engine.save("my_topic_model.json")
reloaded = Topical::Engine.load("my_topic_model.json")
Advanced Topic Summaries
# Use Topical for clustering + your choice of LLM for summaries
topics = engine.fit(embeddings: embeddings, documents: documents)
# Generate rich summaries with any LLM
topics.each do |topic|
  summary = my_llm.summarize(
    documents: topic.representative_docs(k: 5),
    terms: topic.terms.join(', '),
    context: "customer feedback analysis" # Domain-specific!
  )
  puts "#{topic.label}: #{summary}"
end
# See examples/topic_summaries_with_llm.rb for complete implementation
Technical Excellence
Comprehensive Test Coverage
- 79.46% line coverage with SimpleCov integration
- 64 test specifications covering core functionality, edge cases, and error handling
- Fast test suite - Runs in ~0.1 seconds for rapid development
Clean Architecture
- Orchestration layer - Coordinates ClusterKit clustering with term extraction and labeling
- Modular design - Clean interfaces between clustering, term extraction, and quality assessment
- Ruby integration - Smooth interop between Rust performance and Ruby usability
Ruby Ecosystem Integration
Works seamlessly with:
- red-candle - For embedding generation and advanced LLM summaries (see examples)
- ClusterKit - For HDBSCAN clustering and optional UMAP dimensionality reduction
- Standard Ruby Logger - For production logging needs
- Any LLM provider - Clean separation allows easy integration at application level
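The "clean separation" point can be sketched as a small duck-typed adapter at the application level: anything that responds to `generate` can act as the LLM. All names below (`TopicStub`, `EchoLLM`, `TopicSummarizer`) are hypothetical and exist only for this example; none of them are part of Topical or red-candle.

```ruby
# Hypothetical application-level adapter; none of these names come from Topical.
TopicStub = Struct.new(:label, :terms, :docs, keyword_init: true)

# Stand-in for a real LLM client; returns a canned response
class EchoLLM
  def generate(prompt)
    "(summary based on #{prompt.lines.count} prompt lines)"
  end
end

class TopicSummarizer
  def initialize(llm)
    raise ArgumentError, "llm must respond to #generate" unless llm.respond_to?(:generate)

    @llm = llm
  end

  # Assemble a plain-text prompt from a topic's terms and a few documents
  def build_prompt(topic, max_docs: 3)
    docs = topic.docs.first(max_docs).map { |doc| "- #{doc}" }.join("\n")
    "Key terms: #{topic.terms.join(', ')}\nDocuments:\n#{docs}\n" \
      "Summarize what connects these documents."
  end

  def summarize(topic)
    @llm.generate(build_prompt(topic))
  end
end

topic = TopicStub.new(
  label: "Billing & Refund",
  terms: %w[billing refund],
  docs: ["Refund took three weeks", "Charged twice for one order"]
)

summarizer = TopicSummarizer.new(EchoLLM.new)
puts "#{topic.label}: #{summarizer.summarize(topic)}"
```

Swapping providers then means passing a different object to `TopicSummarizer.new`, with no change to the clustering code.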
Installation
gem install topical
For embedding generation and advanced examples:
gem install red-candle # Optional, for generating embeddings and LLM summaries
gem install clusterkit # Optional, for dimensionality reduction
Use Cases
Content Analysis: Automatically categorize articles, posts, or documents
Customer Feedback: Discover themes in reviews, surveys, or support tickets
Research: Analyze academic papers to find research trends and topic clusters
Knowledge Management: Organize internal documents and wikis by topic
Social Media: Understand trending topics and community discussions
What Topical Brings
Pipeline Orchestration:
- Coordinates ClusterKit's Rust-powered clustering with Ruby-native term analysis
- Handles embedding validation, dimensionality reduction, and topic construction
- Provides quality metrics and persistence out of the box
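For intuition about the diversity metric, one common definition is the fraction of unique terms among the top-k terms of all topics, so 1.0 means no topic shares a top term with another. The sketch below implements that definition; the function name `topic_diversity` and its `top_k` parameter are assumptions, and Topical's own scoring may be computed differently.

```ruby
# Illustrative diversity score; Topical's own metric may differ.
# topics_terms: Array of Arrays, each holding one topic's ranked terms
def topic_diversity(topics_terms, top_k: 10)
  all_terms = topics_terms.flat_map { |terms| terms.first(top_k) }
  return 0.0 if all_terms.empty?

  all_terms.uniq.size.to_f / all_terms.size
end

topics_terms = [
  %w[ruby gem rails],
  %w[python pandas numpy],
  %w[rust cargo gem]
]

puts topic_diversity(topics_terms, top_k: 3).round(3)
```

Here "gem" appears in two topics, so 8 of the 9 top terms are unique and the score falls slightly below 1.0.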
Developer Experience:
- Ruby-native API - Feels natural to Ruby developers despite Rust underpinnings
- Rapid testing - 64 specs run in ~0.1 seconds
- Clear abstractions - Simple methods that hide clustering complexity
Advanced Topic Summaries
For detailed topic analysis, combine Topical's clustering with LLM summarization:
# Step 1: Use Topical for excellent clustering
topics = Topical.extract(embeddings: embeddings, documents: documents)
# Step 2: Use your preferred LLM for rich summaries
topics.each do |topic|
  summary = my_llm.generate(
    "Summarize what connects these documents: #{topic.representative_docs(k: 3).join(' | ')}"
  )
  puts "#{topic.label}: #{summary}"
end
See examples/topic_summaries_with_llm.rb for a complete working example with red-candle.
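One plausible way to choose representative documents for a prompt like the one above is to rank a topic's documents by cosine similarity to the topic centroid. This is only an assumption about how such a selection could work, not a description of how Topical's `representative_docs` is implemented; the standalone helper below just shares its name for readability.

```ruby
# Illustrative only: ranks documents by closeness to the cluster centroid.
# This is an assumed approach, not Topical's documented behavior.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Mean of the vectors, computed dimension by dimension
def centroid(vectors)
  vectors.transpose.map { |column| column.sum / column.size.to_f }
end

def representative_docs(documents, embeddings, k: 3)
  center = centroid(embeddings)
  documents.zip(embeddings)
           .sort_by { |_doc, emb| -cosine_similarity(emb, center) }
           .first(k)
           .map(&:first)
end

docs = ["refund delayed", "refund never arrived", "love the new logo"]
embs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]

puts representative_docs(docs, embs, k: 2).inspect
```

Sending only the few most central documents keeps prompts short while still giving the LLM the clearest examples of the topic.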
Documentation
Complete documentation and examples available in the GitHub repository.
Acknowledgments
Built on excellent Rust libraries via Magnus bindings:
- ClusterKit for fast HDBSCAN clustering and UMAP dimensionality reduction
- red-candle for embedding generation (examples only)
Topical is inspired by Python's BERTopic and the scikit-learn ecosystem.
The clean architecture makes it easy to integrate with any LLM provider at the application level for advanced features like topic summarization.
Ready to discover the hidden topics in your documents? Install Topical 0.1.0 and start clustering!