ParseKit 0.1.0 Release 🚀

@cpetersen released this 06 Sep 02:34 · commit e6b578f

We're excited to announce the initial release of ParseKit, a Ruby document parsing toolkit that brings native performance to document text extraction with zero runtime dependencies!

🎯 What is ParseKit?

ParseKit is a native Ruby gem that extracts text from a range of document formats using high-performance Rust implementations. Unlike many other Ruby document parsing solutions, ParseKit links all required libraries statically, so installation needs no system dependencies.

Key Features

  • 📄 Multiple Format Support: PDF, DOCX, XLSX, XLS, PPTX, images (PNG, JPG, TIFF, BMP), plus JSON, XML/HTML, and plain text
  • 🔍 Built-in OCR: Bundled Tesseract for image text extraction
  • ⚡ Native Performance: Rust-powered parsing with Ruby convenience
  • 📦 Zero Dependencies: Everything bundled - just gem install and go
  • 🛡️ Cross-Platform: Works on Linux, macOS, and Windows

📚 Supported Formats

| Format | Extensions | Method | Features |
|---|---|---|---|
| PDF | .pdf | parse_pdf | Text extraction via MuPDF |
| Word | .docx | parse_docx | Office Open XML format |
| PowerPoint | .pptx | parse_pptx | Text from slides and notes |
| Excel | .xlsx, .xls | parse_xlsx | Both modern and legacy formats |
| Images | .png, .jpg, .jpeg, .tiff, .bmp | ocr_image | OCR via bundled Tesseract |
| JSON | .json | parse_json | Pretty-printed output |
| XML/HTML | .xml, .html | parse_xml | Text content extraction |
| Text | .txt, .csv, .md | parse_text | With encoding detection |
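
The mapping from extensions to methods listed here can be written down as a small lookup table. This is only a sketch for callers who want to choose a format-specific method themselves; `EXTENSION_METHODS` and `method_for` are hypothetical names, not part of ParseKit, and the gem's own dispatch may work differently.

```ruby
# Hypothetical lookup built from the Supported Formats list.
# Illustration only; this is not ParseKit's internal dispatch code.
EXTENSION_METHODS = {
  ".pdf"  => :parse_pdf,
  ".docx" => :parse_docx,
  ".pptx" => :parse_pptx,
  ".xlsx" => :parse_xlsx, ".xls"  => :parse_xlsx,
  ".png"  => :ocr_image,  ".jpg"  => :ocr_image, ".jpeg" => :ocr_image,
  ".tiff" => :ocr_image,  ".bmp"  => :ocr_image,
  ".json" => :parse_json,
  ".xml"  => :parse_xml,  ".html" => :parse_xml,
  ".txt"  => :parse_text, ".csv"  => :parse_text, ".md" => :parse_text
}.freeze

# Returns the method name for a path, or nil for an unlisted extension.
def method_for(path)
  EXTENSION_METHODS[File.extname(path).downcase]
end
```

With a parser instance, the result could be used as `parser.public_send(method_for(path), bytes)` once a non-nil method name comes back.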

🚀 Quick Start

Installation

gem install parsekit

Or add to your Gemfile:

gem 'parsekit', '~> 0.1.0'

Basic Usage

require 'parsekit'

# Simple file parsing - format auto-detected
text = ParseKit.parse_file("document.pdf")
puts text

# Parse binary data directly  
file_data = File.binread("document.docx")
text = ParseKit.parse_bytes(file_data.bytes)
puts text

# Use parser instance for multiple files
parser = ParseKit::Parser.new
text = parser.parse_file("report.xlsx")
puts text
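
Note that the `parse_bytes` example passes `file_data.bytes`, an array of integers, rather than the binary string itself. The sketch below shows that conversion in plain Ruby without ParseKit; `to_byte_array` and `from_byte_array` are hypothetical helper names used only for illustration.

```ruby
# Convert a binary String into an Array of Integers (0..255),
# the shape the parse_bytes example passes in.
def to_byte_array(binary_string)
  binary_string.bytes
end

# Reverse direction: pack the integers back into a binary String.
def from_byte_array(bytes)
  bytes.pack("C*")
end

header = "%PDF-1.7\n".b                 # .b returns a binary (ASCII-8BIT) copy
bytes  = to_byte_array(header)
p bytes.first(4)                        # => [37, 80, 68, 70]
p from_byte_array(bytes) == header      # => true
```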

Advanced Usage

# Direct format-specific parsing
parser = ParseKit::Parser.new

# PDF text extraction
pdf_data = File.binread('document.pdf').bytes
pdf_text = parser.parse_pdf(pdf_data)

# OCR on images
image_data = File.binread('scan.png').bytes
ocr_text = parser.ocr_image(image_data)

# PowerPoint presentations
pptx_data = File.binread('slides.pptx').bytes
slide_text = parser.parse_pptx(pptx_data)

# Excel spreadsheets
xlsx_data = File.binread('data.xlsx').bytes
sheet_text = parser.parse_xlsx(xlsx_data)

Configuration Options

# Create parser with options
parser = ParseKit::Parser.new(
  strict_mode: true,
  max_size: 50 * 1024 * 1024,  # 50MB limit
  encoding: 'UTF-8'
)

# Or use the strict convenience method
parser = ParseKit::Parser.strict
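
The `max_size` value above works out to 52,428,800 bytes. If you also want to skip oversized files before reading them into memory, a caller-side check along these lines is one option. This is a sketch: `within_size_limit?` and `files_within_limit` are hypothetical helpers, and how ParseKit itself enforces `max_size` is not described here.

```ruby
# Hypothetical caller-side guard; ParseKit's own max_size handling may differ.
MAX_SIZE_BYTES = 50 * 1024 * 1024 # 52_428_800 bytes, same arithmetic as above

# True when the file on disk is no larger than the limit.
def within_size_limit?(path, limit = MAX_SIZE_BYTES)
  File.size(path) <= limit
end

# Splits paths into [acceptable, too_large] without reading any file contents.
def files_within_limit(paths, limit = MAX_SIZE_BYTES)
  paths.partition { |path| within_size_limit?(path, limit) }
end
```

For example, `ok, too_large = files_within_limit(Dir.glob("uploads/*"))` would leave only the acceptable paths in `ok`.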

🔧 Technical Architecture

ParseKit uses a hybrid Ruby/Rust architecture:

  • Ruby Layer: Provides convenient API and format detection
  • Rust Layer: High-performance parsing using:
    • MuPDF for PDF text extraction (statically linked)
    • tesseract-rs for OCR (bundled Tesseract by default)
    • docx-rs for Word document parsing
    • calamine for Excel parsing
    • zip + quick-xml for PowerPoint parsing
    • Magnus for Ruby-Rust FFI bindings

🎨 Zero-Dependency Philosophy

Traditional Ruby document parsing requires complex system dependencies:

  • Tesseract OCR installation
  • Poppler for PDF handling
  • ImageMagick for image processing
  • Platform-specific libraries

ParseKit eliminates all of this by bundling everything needed:

# Traditional approach
brew install tesseract poppler imagemagick  # macOS
sudo apt-get install tesseract-ocr poppler-utils imagemagick  # Ubuntu
gem install some-parsing-gem

# ParseKit approach  
gem install parsekit  # Done! 

⚡ Performance Features

  • Native Rust Speed: Core parsing implemented in Rust for maximum performance
  • Statically Linked Libraries: MuPDF and Tesseract compiled with optimizations
  • Efficient Memory Usage: Streaming where possible, configurable size limits
  • Smart Format Detection: Magic number detection with filename fallback
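
To make "magic number detection with filename fallback" concrete, here is a minimal Ruby sketch of the general technique. It is not ParseKit's detection code (that lives in the gem itself); `MAGIC_SIGNATURES`, `EXTENSION_FORMATS`, and `detect_format` are hypothetical names, and the signature list covers only a few well-known file headers.

```ruby
# Illustrative only: a generic magic-number check, not ParseKit's implementation.
MAGIC_SIGNATURES = [
  ["%PDF".b,                              :pdf],
  ["\x89PNG\r\n\x1A\n".b,                 :png],
  ["\xFF\xD8\xFF".b,                      :jpeg],
  ["II*\x00".b,                           :tiff], # little-endian TIFF
  ["MM\x00*".b,                           :tiff], # big-endian TIFF
  ["BM".b,                                :bmp],
  ["\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1".b,  :ole2], # legacy Office container (.xls)
  ["PK\x03\x04".b,                        :zip]   # DOCX/XLSX/PPTX are ZIP files
].freeze

# Used when the header is missing or only identifies a generic container.
EXTENSION_FORMATS = {
  ".docx" => :docx, ".xlsx" => :xlsx, ".pptx" => :pptx, ".xls" => :xls,
  ".txt"  => :text, ".csv"  => :text, ".md"   => :text,
  ".json" => :json, ".xml"  => :xml,  ".html" => :xml
}.freeze

def detect_format(data, filename = nil)
  head = data.b
  match = MAGIC_SIGNATURES.find { |signature, _| head.start_with?(signature) }
  format = match && match[1]

  # ZIP and OLE2 headers do not say which Office format is inside,
  # so the filename refines (or supplies) the answer.
  if format.nil? || %i[zip ole2].include?(format)
    ext = filename && File.extname(filename).downcase
    format = EXTENSION_FORMATS.fetch(ext, format)
  end
  format
end
```

The container case is the reason a filename fallback is useful at all: the first bytes of a DOCX, XLSX, and PPTX file are identical.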

🛠️ Advanced OCR Configuration

ParseKit includes two OCR modes for maximum flexibility:

Bundled Mode (Default)

# Zero setup - works out of the box
parser = ParseKit::Parser.new
text = parser.ocr_image(image_data)

System Mode (Advanced Users)

For developers who want faster gem installation and already have Tesseract:

# Install without bundled features
gem install parsekit -- --no-default-features

# For development
rake compile CARGO_FEATURES=""  # Disables bundled-tesseract
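
Since System Mode assumes Tesseract is already installed, it can help to confirm that a `tesseract` executable is on your `PATH` before installing without the bundled copy. The check below is a plain-Ruby sketch; `executable_on_path?` is a hypothetical helper, and finding the binary does not guarantee that the development headers a native build may need are present.

```ruby
# Hypothetical helper: true when an executable with this name is found on PATH.
def executable_on_path?(name)
  # PATHEXT lists executable suffixes on Windows; it is normally unset elsewhere.
  suffixes = ENV.fetch("PATHEXT", "").split(";")
  suffixes = [""] if suffixes.empty?

  ENV.fetch("PATH", "").split(File::PATH_SEPARATOR).any? do |dir|
    suffixes.any? do |suffix|
      candidate = File.join(dir, name + suffix)
      File.file?(candidate) && File.executable?(candidate)
    end
  end
end

if executable_on_path?("tesseract")
  puts "System Tesseract found"
else
  puts "No tesseract on PATH; the default bundled mode may be the simpler choice"
end
```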

🧪 Real-World Examples

Batch Document Processing

require 'parsekit'

parser = ParseKit::Parser.new
documents_dir = "path/to/documents"

Dir.glob("#{documents_dir}/*.{pdf,docx,xlsx,pptx,png,jpg}").each do |file|
  begin
    text = parser.parse_file(file)
    
    # Process extracted text
    puts "#{file}: #{text.length} characters extracted"
    
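    # Save next to the source file. Appending ".txt" keeps the original
    # extension in the name, so report.pdf and report.docx do not overwrite
    # each other's output.
    output_file = "#{file}.txt"
    File.write(output_file, text)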
  rescue => e
    puts "Error processing #{file}: #{e.message}"
  end
end

OCR Pipeline

require 'parsekit'

def extract_text_from_images(image_dir)
  parser = ParseKit::Parser.new
  results = {}
  
  Dir.glob("#{image_dir}/*.{png,jpg,jpeg,tiff,bmp}").each do |image_file|
    puts "Processing #{image_file}..."
    
    image_data = File.binread(image_file).bytes
    text = parser.ocr_image(image_data)
    
    results[image_file] = {
      text: text,
      length: text.length,
      processed_at: Time.now
    }
  end
  
  results
end

# Process all images
results = extract_text_from_images("scanned_documents/")
results.each do |file, data|
  puts "#{file}: #{data[:length]} chars - #{data[:text][0..100]}..."
end

Document Classification

require 'parsekit'
require 'fileutils'  # needed for FileUtils.mkdir_p and FileUtils.mv below

class DocumentClassifier
  def initialize
    @parser = ParseKit::Parser.new
  end
  
  def classify(file_path)
    text = @parser.parse_file(file_path)
    
    case text
    when /\b(invoice|bill|payment)\b/i
      :invoice
    when /\b(resume|curriculum vitae|cv)\b/i  
      :resume
    when /\b(contract|agreement|terms)\b/i
      :contract
    when /\b(report|analysis|summary)\b/i
      :report
    else
      :unknown
    end
  end
end

classifier = DocumentClassifier.new

Dir.glob("uploads/*.{pdf,docx}").each do |file|
  category = classifier.classify(file)
  puts "#{file} -> #{category}"
  
  # Move to appropriate directory
  FileUtils.mkdir_p("sorted/#{category}")
  FileUtils.mv(file, "sorted/#{category}/")
end
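
One property of the `case` statement in the classifier is worth spelling out: the first pattern that matches wins, so a document mentioning both "invoice" and "contract" is filed as an invoice. The sketch below restates the same patterns as an ordered table so the rule can be tried without ParseKit or any files; `CATEGORY_RULES` and `classify_text` are hypothetical names.

```ruby
# Same patterns as the classifier, kept in priority order.
# Ruby hashes preserve insertion order, so earlier rules win ties.
CATEGORY_RULES = {
  invoice:  /\b(invoice|bill|payment)\b/i,
  resume:   /\b(resume|curriculum vitae|cv)\b/i,
  contract: /\b(contract|agreement|terms)\b/i,
  report:   /\b(report|analysis|summary)\b/i
}.freeze

# Returns the first category whose pattern matches, or :unknown.
def classify_text(text)
  CATEGORY_RULES.each do |category, pattern|
    return category if text.match?(pattern)
  end
  :unknown
end

p classify_text("Contract terms for invoice #12")  # => :invoice
p classify_text("Quarterly summary")               # => :report
p classify_text("Hello world")                     # => :unknown
```

If that precedence is not what you want, reordering the entries is enough to change it.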

🔄 Migration Guide

Coming from other Ruby document parsing gems? Here's how ParseKit compares:

From pdf-reader

# Before (pdf-reader)
require 'pdf-reader'
text = PDF::Reader.new('document.pdf').pages.map(&:text).join

# After (ParseKit)  
require 'parsekit'
text = ParseKit.parse_file('document.pdf')

From docx gem

# Before (docx)
require 'docx'
doc = Docx::Document.open('document.docx')
text = doc.paragraphs.map(&:text).join

# After (ParseKit)
require 'parsekit' 
text = ParseKit.parse_file('document.docx')

From RTesseract

# Before (RTesseract - requires system tesseract)
require 'rtesseract'  
text = RTesseract.new('image.png').to_s

# After (ParseKit - zero dependencies)
require 'parsekit'
text = ParseKit.parse_file('image.png')

📦 Installation Requirements

  • Ruby: >= 3.0.0
  • Rust: Automatically handled during gem installation
  • System Dependencies: None! Everything is bundled

🙏 Acknowledgments

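ParseKit builds on excellent Rust crates, including those named under Technical Architecture above: tesseract-rs, docx-rs, calamine, zip, quick-xml, and Magnus.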

🚀 Ready to Parse?

Install ParseKit 0.1.0 today and start extracting text from any of the supported document formats with zero hassle:

gem install parsekit

No system dependencies. No complex setup. Just install and parse! 🎯✨


For documentation, examples, and source code, visit: github.com/cpetersen/parsekit