Releases: scientist-labs/topical

Topical 0.1.0: Fast, Focused Topic Modeling for Ruby 🚀

06 Sep 14:38
5074dd2

We're excited to announce the first release of Topical, a lightweight Ruby gem that orchestrates topic extraction from document embeddings using proven clustering libraries!

What is Topical?

Topical provides a complete topic modeling pipeline for Ruby, integrating ClusterKit's clustering algorithms with c-TF-IDF term extraction and quality metrics. It handles the complex orchestration so you can focus on analyzing your topics, not building the pipeline.

🎯 Key Features

Complete Topic Modeling Pipeline

  • Integrates ClusterKit - HDBSCAN and K-means clustering via proven Rust libraries
  • c-TF-IDF term extraction - Identifies distinctive terms that characterize each topic
  • Quality metrics - Coherence and diversity scoring for topic evaluation
  • Outlier detection - Surfaces documents that don't fit discovered topic patterns
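
To make the c-TF-IDF idea concrete, here is a minimal pure-Ruby sketch. It assumes the common formulation popularized by BERTopic, where each cluster is treated as one large document and a term's weight is its in-cluster frequency times log(1 + A / f_t), with A the average word count per cluster and f_t the term's count across all clusters. The helper names (`tokenize`, `c_tf_idf`, `top_terms`) are hypothetical and are not part of Topical's API; Topical's own implementation may differ in tokenization and weighting.

```ruby
# Hypothetical helpers for illustration only; not Topical's API.

# Naive tokenizer: lowercase ASCII words.
def tokenize(text)
  text.downcase.scan(/[a-z]+/)
end

# clusters: Hash of cluster id => Array of document strings.
# Returns Hash of cluster id => { term => c-TF-IDF score }.
def c_tf_idf(clusters)
  # Term counts per cluster (each cluster acts as one concatenated document).
  counts = clusters.transform_values do |docs|
    docs.flat_map { |doc| tokenize(doc) }.tally
  end

  # A: average number of words per cluster.
  total_words = counts.values.sum { |c| c.values.sum }
  avg_words = total_words.to_f / counts.size

  # f_t: how often each term occurs across all clusters.
  global = Hash.new(0)
  counts.each_value { |c| c.each { |term, n| global[term] += n } }

  counts.transform_values do |c|
    cluster_total = c.values.sum.to_f
    c.to_h do |term, n|
      [term, (n / cluster_total) * Math.log(1 + avg_words / global[term])]
    end
  end
end

# Highest-scoring terms first; ties broken alphabetically for determinism.
def top_terms(scores, k)
  scores.sort_by { |term, score| [-score, term] }.first(k).map(&:first)
end

clusters = {
  0 => ["ruby gems and ruby syntax", "ruby blocks are elegant"],
  1 => ["rust memory safety", "rust ownership and memory borrowing"]
}

scores = c_tf_idf(clusters)
puts top_terms(scores[0], 3).inspect
puts top_terms(scores[1], 3).inspect
```

Terms that appear in several clusters (such as "and" in this toy data) get a smaller log factor, so they rank below terms concentrated in one cluster.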

Fast Topic Labeling

  • Term-based labeling - Fast, deterministic labels from distinctive terms
  • No LLM required - Labels are generated locally from the extracted terms, with no external model calls
  • Clean, readable results - Combines the top distinctive terms into a short label
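
As a rough illustration of term-based labeling, the sketch below builds a label by joining the highest-scoring terms. The helper `label_from_terms`, the `max_terms` parameter, and the " & " separator are assumptions made for this example; the release notes do not specify how Topical formats its labels.

```ruby
# Hypothetical helper for illustration only; not Topical's API.
# term_scores: Hash of term => distinctiveness score (for example, c-TF-IDF).
def label_from_terms(term_scores, max_terms: 3)
  # Sort by descending score, then alphabetically, so the result is deterministic.
  ranked = term_scores.sort_by { |term, score| [-score, term] }
  ranked.first(max_terms).map { |term, _score| term.capitalize }.join(" & ")
end

scores = { "rust" => 0.41, "memory" => 0.38, "safety" => 0.28, "and" => 0.05 }
puts label_from_terms(scores)
puts label_from_terms(scores, max_terms: 2)
```

Because the label depends only on the scores and a fixed tie-breaking rule, the same input always yields the same label, which is the property the "deterministic" claim above refers to.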

Production Ready

  • Configurable logging - Use Ruby's standard Logger or your own
  • Model persistence - Save and load fitted models as JSON
  • Dimensionality reduction - Optional UMAP integration via ClusterKit
  • Ruby-native - Clean API that feels natural to Ruby developers
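
For the logging option, the sketch below uses only Ruby's standard `Logger` to show two typical setups: a quiet logger that drops anything below WARN, and a logger with a custom formatter. The message text and the "[topics]" prefix are made up for the example. Either object could then be passed as the `logger:` argument shown in the Engine example in these notes.

```ruby
require 'logger'
require 'stringio'

# A quiet logger: messages below WARN are discarded.
quiet_io = StringIO.new
quiet = Logger.new(quiet_io, level: Logger::WARN)
quiet.info("this message is dropped")

# A logger with a custom one-line format, captured in memory here
# so the output is easy to inspect; use $stdout or a file in practice.
buffer = StringIO.new
custom = Logger.new(buffer)
custom.formatter = proc do |severity, _time, _progname, msg|
  "[topics] #{severity}: #{msg}\n"
end
custom.info("fitting 5 documents")

puts buffer.string
```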

Quick Example

require 'topical'
require 'red-candle'

# Generate embeddings (using red-candle)
documents = [
  "Ruby is a dynamic programming language with elegant syntax",
  "Python excels at data science and machine learning tasks", 
  "JavaScript powers modern web development and user interfaces",
  "Rust provides memory safety without garbage collection",
  "Go simplifies concurrent programming with goroutines"
]

embedding_model = Candle::EmbeddingModel.from_pretrained(
  'sentence-transformers/all-MiniLM-L6-v2'
)
embeddings = documents.map { |doc| embedding_model.embedding(doc).first.to_a }

# Extract topics using HDBSCAN clustering
topics = Topical.extract(
  embeddings: embeddings,
  documents: documents,
  clustering_method: :hdbscan
)

# Explore discovered topics
topics.each do |topic|
  puts "\n#{topic.label}"
  puts "Terms: #{topic.terms.first(5).join(', ')}"
  puts "Documents: #{topic.documents.length}"
  topic.documents.each { |doc| puts "  - #{doc[0..60]}..." }
end

🔧 Flexible Architecture

Engine-Based API

require 'logger'

# Full control over the topic modeling pipeline
engine = Topical::Engine.new(
  clustering_method: :hdbscan,
  min_cluster_size: 3,
  reduce_dimensions: true,
  logger: Logger.new($stdout)   # Configurable logging
)

topics = engine.fit(embeddings: embeddings, documents: documents)

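# Transform new documents (new_embeddings: embeddings of unseen documents,
# generated with the same embedding model used for fitting)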
new_assignments = engine.transform(embeddings: new_embeddings)

# Persist your model
engine.save("my_topic_model.json")
reloaded = Topical::Engine.load("my_topic_model.json")

Advanced Topic Summaries

# Use Topical for clustering + your choice of LLM for summaries
topics = engine.fit(embeddings: embeddings, documents: documents)

# Generate rich summaries with any LLM (my_llm is a placeholder for your own client)
topics.each do |topic|
  summary = my_llm.summarize(
    documents: topic.representative_docs(k: 5),
    terms: topic.terms.join(', '),
    context: "customer feedback analysis"  # Domain-specific!
  )
  puts "#{topic.label}: #{summary}"
end

# See examples/topic_summaries_with_llm.rb for complete implementation

🏗 Technical Excellence

Comprehensive Test Coverage

  • 79.46% line coverage with SimpleCov integration
  • 64 test specifications covering core functionality, edge cases, and error handling
  • Fast test suite - Runs in ~0.1 seconds for rapid development

Clean Architecture

  • Orchestration layer - Coordinates ClusterKit clustering with term extraction and labeling
  • Modular design - Clean interfaces between clustering, term extraction, and quality assessment
  • Ruby integration - Smooth interop between Rust performance and Ruby usability

🔄 Ruby Ecosystem Integration

Works seamlessly with:

  • red-candle - For embedding generation and advanced LLM summaries (see examples)
  • ClusterKit - For HDBSCAN clustering and optional UMAP dimensionality reduction
  • Standard Ruby Logger - For production logging needs
  • Any LLM provider - Clean separation allows easy integration at application level

📦 Installation

gem install topical

For embedding generation and advanced examples:

gem install red-candle  # Optional, for generating embeddings and LLM summaries
gem install clusterkit  # Optional, for dimensionality reduction  

🎯 Use Cases

Content Analysis: Automatically categorize articles, posts, or documents

Customer Feedback: Discover themes in reviews, surveys, or support tickets

Research: Analyze academic papers to find research trends and topic clusters

Knowledge Management: Organize internal documents and wikis by topic

Social Media: Understand trending topics and community discussions

🚀 What Topical Brings

Pipeline Orchestration:

  • Coordinates ClusterKit's Rust-powered clustering with Ruby-native term analysis
  • Handles embedding validation, dimensionality reduction, and topic construction
  • Provides quality metrics and persistence out of the box
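
To show what a diversity-style quality metric can look like, here is a minimal sketch of one common definition: the fraction of unique terms among the top-k terms of all topics. The helper `topic_diversity` and its `top_k` parameter are hypothetical; the release notes do not state how Topical computes its diversity score, so treat this as one reasonable interpretation.

```ruby
# Hypothetical helper for illustration only; not Topical's API.
# topics_terms: Array of term lists, one list per topic, best terms first.
def topic_diversity(topics_terms, top_k: 10)
  top = topics_terms.flat_map { |terms| terms.first(top_k) }
  return 0.0 if top.empty?

  # 1.0 means no term is shared between topics; lower values mean more overlap.
  top.uniq.size.to_f / top.size
end

topics_terms = [
  %w[ruby gem syntax],
  %w[rust memory safety],
  %w[ruby rails web]
]

diversity = topic_diversity(topics_terms, top_k: 3)
puts diversity.round(3)
```

In this toy data "ruby" appears in two topics, so 8 of the 9 top terms are unique.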

Developer Experience:

  • Ruby-native API - Feels natural to Ruby developers despite Rust underpinnings
  • Rapid testing - 64 specs run in ~0.1 seconds
  • Clear abstractions - Simple methods that hide clustering complexity

🎯 Advanced Topic Summaries

For detailed topic analysis, combine Topical's clustering with LLM summarization:

# Step 1: Cluster the documents into topics with Topical
topics = Topical.extract(embeddings: embeddings, documents: documents)

# Step 2: Summarize each topic with your preferred LLM (my_llm is a placeholder)
topics.each do |topic|
  summary = my_llm.generate(
    "Summarize what connects these documents: #{topic.representative_docs(k: 3).join(' | ')}"
  )
  puts "#{topic.label}: #{summary}"
end

See examples/topic_summaries_with_llm.rb for a complete working example with red-candle.

📚 Documentation

Complete documentation and examples are available in the GitHub repository.

🙏 Acknowledgments

Built on excellent Rust libraries via Magnus bindings:

  • ClusterKit for fast HDBSCAN clustering and UMAP dimensionality reduction
  • red-candle for embedding generation (examples only)
  • Inspired by Python's BERTopic and scikit-learn ecosystem

The clean architecture makes it easy to integrate with any LLM provider at the application level for advanced features like topic summarization.


Ready to discover the hidden topics in your documents? Install Topical 0.1.0 and start clustering! 🔍📊