ParseKit 0.1.0 Release 🚀 · scientist-labs/parsekit

We're excited to announce the initial release of ParseKit, a Ruby document parsing toolkit that brings native performance to document text extraction with zero runtime dependencies!

🎯 What is ParseKit?

ParseKit is a native Ruby gem that extracts text from various document formats using high-performance Rust implementations. Unlike other Ruby document parsing solutions, ParseKit bundles all necessary libraries statically, making installation simple with no system dependencies required.

Key Features

📄 Multiple Format Support: PDF, DOCX, XLSX, XLS, PPTX, images (PNG, JPG, TIFF, BMP)
🔍 Built-in OCR: Bundled Tesseract for image text extraction
⚡ Native Performance: Rust-powered parsing with Ruby convenience
📦 Zero Dependencies: Everything bundled - just gem install and go
🛡️ Cross-Platform: Works on Linux, macOS, and Windows

📚 Supported Formats

Format	Extensions	Method	Features
PDF	.pdf	`parse_pdf`	Text extraction via MuPDF
Word	.docx	`parse_docx`	Office Open XML format
PowerPoint	.pptx	`parse_pptx`	Text from slides and notes
Excel	.xlsx, .xls	`parse_xlsx`	Both modern and legacy formats
Images	.png, .jpg, .jpeg, .tiff, .bmp	`ocr_image`	OCR via bundled Tesseract
JSON	.json	`parse_json`	Pretty-printed output
XML/HTML	.xml, .html	`parse_xml`	Text content extraction
Text	.txt, .csv, .md	`parse_text`	With encoding detection
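
To make the table concrete, here is a minimal sketch of how a file extension could be mapped to the method listed for it. The method names are taken from the table; the `EXTENSION_METHODS` constant and the `method_for` helper are hypothetical names used for illustration and are not part of ParseKit's documented API.

```ruby
# Hypothetical lookup built from the Supported Formats table.
# Only the method names (values) come from the release notes.
EXTENSION_METHODS = {
  ".pdf"  => :parse_pdf,
  ".docx" => :parse_docx,
  ".pptx" => :parse_pptx,
  ".xlsx" => :parse_xlsx,
  ".xls"  => :parse_xlsx,
  ".png"  => :ocr_image,
  ".jpg"  => :ocr_image,
  ".jpeg" => :ocr_image,
  ".tiff" => :ocr_image,
  ".bmp"  => :ocr_image,
  ".json" => :parse_json,
  ".xml"  => :parse_xml,
  ".html" => :parse_xml,
  ".txt"  => :parse_text,
  ".csv"  => :parse_text,
  ".md"   => :parse_text
}.freeze

# Returns the method name for a path, or nil when the extension is not listed.
def method_for(path)
  EXTENSION_METHODS[File.extname(path).downcase]
end
```

In practice `ParseKit.parse_file` already auto-detects the format, so a lookup like this is only useful when you want to know in advance which parser a file would be routed to.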

🚀 Quick Start

Installation

gem install parsekit

Or add to your Gemfile:

gem 'parsekit', '~> 0.1.0'

Basic Usage

require 'parsekit'

# Simple file parsing - format auto-detected
text = ParseKit.parse_file("document.pdf")
puts text

# Parse binary data directly  
file_data = File.binread("document.docx")
text = ParseKit.parse_bytes(file_data.bytes)
puts text

# Use parser instance for multiple files
parser = ParseKit::Parser.new
text = parser.parse_file("report.xlsx")
puts text

Advanced Usage

# Direct format-specific parsing
parser = ParseKit::Parser.new

# PDF text extraction
pdf_data = File.read('document.pdf', mode: 'rb').bytes
pdf_text = parser.parse_pdf(pdf_data)

# OCR on images
image_data = File.read('scan.png', mode: 'rb').bytes
ocr_text = parser.ocr_image(image_data)

# PowerPoint presentations  
pptx_data = File.read('slides.pptx', mode: 'rb').bytes
slide_text = parser.parse_pptx(pptx_data)

# Excel spreadsheets
xlsx_data = File.read('data.xlsx', mode: 'rb').bytes
sheet_text = parser.parse_xlsx(xlsx_data)

Configuration Options

# Create parser with options
parser = ParseKit::Parser.new(
  strict_mode: true,
  max_size: 50 * 1024 * 1024,  # 50MB limit
  encoding: 'UTF-8'
)

# Or use the strict convenience method
parser = ParseKit::Parser.strict

🔧 Technical Architecture

ParseKit uses a hybrid Ruby/Rust architecture:

Ruby Layer: Provides convenient API and format detection
Rust Layer: High-performance parsing using:
- MuPDF for PDF text extraction (statically linked)
- tesseract-rs for OCR (bundled Tesseract by default)
- docx-rs for Word document parsing
- calamine for Excel parsing
- zip + quick-xml for PowerPoint parsing
- Magnus for Ruby-Rust FFI bindings

🎨 Zero-Dependency Philosophy

Traditional Ruby document parsing often requires system-level dependencies:

Tesseract OCR installation
Poppler for PDF handling
ImageMagick for image processing
Platform-specific libraries

ParseKit eliminates all of this by bundling everything needed:

# Traditional approach
brew install tesseract poppler imagemagick  # macOS
sudo apt-get install tesseract-ocr poppler-utils imagemagick  # Ubuntu
gem install some-parsing-gem

# ParseKit approach  
gem install parsekit  # Done!

⚡ Performance Features

Native Rust Speed: Core parsing implemented in Rust for maximum performance
Statically Linked Libraries: MuPDF and Tesseract compiled with optimizations
Efficient Memory Usage: Streaming where possible, configurable size limits
Smart Format Detection: Magic number detection with filename fallback
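
As an illustration of "magic number detection with filename fallback", the sketch below checks a file's leading bytes against a few well-known signatures and falls back to the extension when the bytes are inconclusive. It is not ParseKit's actual detection code: `MAGIC_SIGNATURES` and `detect_format` are hypothetical names, and only a handful of formats are covered.

```ruby
# Illustrative only: well-known leading-byte signatures for a few formats.
MAGIC_SIGNATURES = {
  pdf:  "%PDF-".b,
  png:  "\x89PNG\r\n\x1A\n".b,
  jpeg: "\xFF\xD8\xFF".b,
  zip:  "PK\x03\x04".b   # DOCX, XLSX and PPTX are all ZIP containers
}.freeze

# data:     file contents as a String
# filename: optional name used when the bytes alone are not decisive
def detect_format(data, filename = nil)
  header = data.byteslice(0, 8).to_s.b
  kind, _signature = MAGIC_SIGNATURES.find { |_, sig| header.start_with?(sig) }

  # A non-ZIP signature identifies the format on its own.
  return kind if kind && kind != :zip

  # ZIP containers and unrecognized data fall back to the file extension.
  ext = filename ? File.extname(filename).downcase.delete_prefix(".") : ""
  return ext.to_sym unless ext.empty?

  kind || :unknown
end
```

The ZIP case shows why a fallback is needed: the signature alone cannot tell a Word document from a spreadsheet, so the extension (or a look inside the archive) has to settle it.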

🛠️ Advanced OCR Configuration

ParseKit includes two OCR modes for maximum flexibility:

Bundled Mode (Default)

# Zero setup - works out of the box
parser = ParseKit::Parser.new
text = parser.ocr_image(image_data)

System Mode (Advanced Users)

For developers who want faster gem installation and already have Tesseract:

# Install without bundled features
gem install parsekit -- --no-default-features

# For development
rake compile CARGO_FEATURES=""  # Disables bundled-tesseract

🧪 Real-World Examples

Batch Document Processing

require 'parsekit'

parser = ParseKit::Parser.new
documents_dir = "path/to/documents"

Dir.glob("#{documents_dir}/*.{pdf,docx,xlsx,pptx,png,jpg}").each do |file|
  begin
    text = parser.parse_file(file)
    
    # Process extracted text
    puts "#{file}: #{text.length} characters extracted"
    
    # Save to text file
    output_file = file.gsub(/\.[^.]+$/, '.txt')
    File.write(output_file, text)
  rescue => e
    puts "Error processing #{file}: #{e.message}"
  end
end
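
One caveat with the loop above: `report.pdf` and `report.docx` in the same directory would both be written to `report.txt`, and the second write would overwrite the first. A small sketch of one way around this is to keep the source extension in the output name. The helper name `text_output_path` is hypothetical.

```ruby
# Hypothetical helper: keep the original extension in the output name
# so files that differ only by extension get distinct .txt outputs.
def text_output_path(source_path)
  dir  = File.dirname(source_path)
  base = File.basename(source_path)   # e.g. "report.pdf"
  File.join(dir, "#{base}.txt")       # e.g. "docs/report.pdf.txt"
end
```

Inside the loop, `File.write(text_output_path(file), text)` would then replace the `gsub`-based output name.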

OCR Pipeline

require 'parsekit'

def extract_text_from_images(image_dir)
  parser = ParseKit::Parser.new
  results = {}
  
  Dir.glob("#{image_dir}/*.{png,jpg,jpeg,tiff,bmp}").each do |image_file|
    puts "Processing #{image_file}..."
    
    image_data = File.read(image_file, mode: 'rb').bytes
    text = parser.ocr_image(image_data)
    
    results[image_file] = {
      text: text,
      length: text.length,
      processed_at: Time.now
    }
  end
  
  results
end

# Process all images
results = extract_text_from_images("scanned_documents/")
results.each do |file, data|
  puts "#{file}: #{data[:length]} chars - #{data[:text][0..100]}..."
end

Document Classification

require 'fileutils'
require 'parsekit'

class DocumentClassifier
  def initialize
    @parser = ParseKit::Parser.new
  end
  
  def classify(file_path)
    text = @parser.parse_file(file_path)
    
    case text
    when /\b(invoice|bill|payment)\b/i
      :invoice
    when /\b(resume|curriculum vitae|cv)\b/i  
      :resume
    when /\b(contract|agreement|terms)\b/i
      :contract
    when /\b(report|analysis|summary)\b/i
      :report
    else
      :unknown
    end
  end
end

classifier = DocumentClassifier.new

Dir.glob("uploads/*.{pdf,docx}").each do |file|
  category = classifier.classify(file)
  puts "#{file} -> #{category}"
  
  # Move to appropriate directory
  FileUtils.mkdir_p("sorted/#{category}")
  FileUtils.mv(file, "sorted/#{category}/")
end
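
Note that `case` returns the first branch that matches, so a report that mentions an invoice once is always classified as `:invoice`. As a hedged alternative, the sketch below counts keyword hits per category and picks the category with the most matches. It works on already-extracted text, reuses the same patterns as the classifier above, and the names `CATEGORY_PATTERNS` and `classify_text` are hypothetical.

```ruby
# Same keyword patterns as the classifier above, keyed by category.
CATEGORY_PATTERNS = {
  invoice:  /\b(invoice|bill|payment)\b/i,
  resume:   /\b(resume|curriculum vitae|cv)\b/i,
  contract: /\b(contract|agreement|terms)\b/i,
  report:   /\b(report|analysis|summary)\b/i
}.freeze

# Returns the category with the most keyword matches, or :unknown if none match.
def classify_text(text)
  counts = CATEGORY_PATTERNS.transform_values { |pattern| text.scan(pattern).length }
  category, hits = counts.max_by { |_, count| count }
  hits.zero? ? :unknown : category
end
```

Keeping the scoring in a function that takes a string also makes it easy to test without any documents on disk; `classify_text(parser.parse_file(path))` would plug it into the pipeline above.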

🔄 Migration Guide

Coming from other Ruby document parsing gems? Here's how ParseKit compares:

From pdf-reader

# Before (pdf-reader)
require 'pdf-reader'
text = PDF::Reader.new('document.pdf').pages.map(&:text).join

# After (ParseKit)  
require 'parsekit'
text = ParseKit.parse_file('document.pdf')

From docx gem

# Before (docx)
require 'docx'
doc = Docx::Document.open('document.docx')
text = doc.paragraphs.map(&:text).join

# After (ParseKit)
require 'parsekit' 
text = ParseKit.parse_file('document.docx')

From RTesseract

# Before (RTesseract - requires system tesseract)
require 'rtesseract'  
text = RTesseract.new('image.png').to_s

# After (ParseKit - zero dependencies)
require 'parsekit'
text = ParseKit.parse_file('image.png')

📦 Installation Requirements

Ruby: >= 3.0.0
Rust: Automatically handled during gem installation
System Dependencies: None! Everything is bundled

🙏 Acknowledgments

ParseKit builds on excellent Rust crates:

mupdf for PDF parsing
tesseract-rs for OCR
docx-rs for Word documents
calamine for Excel files
quick-xml and zip for PowerPoint
magnus for Ruby-Rust integration

🚀 Ready to Parse?

Install ParseKit 0.1.0 today and start extracting text from any document format with zero hassle:

gem install parsekit

No system dependencies. No complex setup. Just install and parse! 🎯✨

For documentation, examples, and source code, visit: github.com/cpetersen/parsekit
