Releases: scientist-labs/parsekit
ParseKit 0.1.0 Release 🚀
We're excited to announce the initial release of ParseKit, a Ruby document parsing toolkit that brings native performance to document text extraction with zero runtime dependencies!
🎯 What is ParseKit?
ParseKit is a native Ruby gem that extracts text from various document formats using high-performance Rust implementations. Unlike other Ruby document parsing solutions, ParseKit bundles all necessary libraries statically, making installation simple with no system dependencies required.
Key Features
- 📄 Multiple Format Support: PDF, DOCX, XLSX, XLS, PPTX, images (PNG, JPG, TIFF, BMP)
- 🔍 Built-in OCR: Bundled Tesseract for image text extraction
- ⚡ Native Performance: Rust-powered parsing with Ruby convenience
- 📦 Zero Dependencies: Everything bundled - just `gem install` and go
- 🛡️ Cross-Platform: Works on Linux, macOS, and Windows
📚 Supported Formats
| Format | Extensions | Method | Features |
|---|---|---|---|
| PDF | .pdf | parse_pdf | Text extraction via MuPDF |
| Word | .docx | parse_docx | Office Open XML format |
| PowerPoint | .pptx | parse_pptx | Text from slides and notes |
| Excel | .xlsx, .xls | parse_xlsx | Both modern and legacy formats |
| Images | .png, .jpg, .jpeg, .tiff, .bmp | ocr_image | OCR via bundled Tesseract |
| JSON | .json | parse_json | Pretty-printed output |
| XML/HTML | .xml, .html | parse_xml | Text content extraction |
| Text | .txt, .csv, .md | parse_text | With encoding detection |
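The table above pairs each file extension with a parser method. As a rough illustration, that pairing can be written as a plain Ruby lookup. The method names come from the table, but `METHOD_FOR_EXTENSION` and `parser_method_for` are hypothetical names invented for this sketch and are not part of ParseKit's API; `ParseKit.parse_file` already auto-detects the format for you.

```ruby
# Hypothetical lookup built from the Supported Formats table.
# Only the method names (values) come from ParseKit; the rest is illustrative.
METHOD_FOR_EXTENSION = {
  '.pdf'  => :parse_pdf,
  '.docx' => :parse_docx,
  '.pptx' => :parse_pptx,
  '.xlsx' => :parse_xlsx,
  '.xls'  => :parse_xlsx,
  '.png'  => :ocr_image,
  '.jpg'  => :ocr_image,
  '.jpeg' => :ocr_image,
  '.tiff' => :ocr_image,
  '.bmp'  => :ocr_image,
  '.json' => :parse_json,
  '.xml'  => :parse_xml,
  '.html' => :parse_xml,
  '.txt'  => :parse_text,
  '.csv'  => :parse_text,
  '.md'   => :parse_text
}.freeze

# Returns the method name for a path, or nil when the extension is not listed.
def parser_method_for(path)
  METHOD_FOR_EXTENSION[File.extname(path).downcase]
end
```

The lookup is case-insensitive on the extension and returns `nil` for anything the table does not list, so callers can decide how to handle unsupported files.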
🚀 Quick Start
Installation
gem install parsekit
Or add to your Gemfile:
gem 'parsekit', '~> 0.1.0'
Basic Usage
require 'parsekit'
# Simple file parsing - format auto-detected
text = ParseKit.parse_file("document.pdf")
puts text
# Parse binary data directly
file_data = File.binread("document.docx")
text = ParseKit.parse_bytes(file_data.bytes)
puts text
# Use parser instance for multiple files
parser = ParseKit::Parser.new
text = parser.parse_file("report.xlsx")
puts text
Advanced Usage
# Direct format-specific parsing
parser = ParseKit::Parser.new
# PDF text extraction
pdf_data = File.read('document.pdf', mode: 'rb').bytes
pdf_text = parser.parse_pdf(pdf_data)
# OCR on images
image_data = File.read('scan.png', mode: 'rb').bytes
ocr_text = parser.ocr_image(image_data)
# PowerPoint presentations
pptx_data = File.read('slides.pptx', mode: 'rb').bytes
slide_text = parser.parse_pptx(pptx_data)
# Excel spreadsheets
xlsx_data = File.read('data.xlsx', mode: 'rb').bytes
sheet_text = parser.parse_xlsx(xlsx_data)
Configuration Options
# Create parser with options
parser = ParseKit::Parser.new(
strict_mode: true,
max_size: 50 * 1024 * 1024, # 50MB limit
encoding: 'UTF-8'
)
# Or use the strict convenience method
parser = ParseKit::Parser.strict🔧 Technical Architecture
ParseKit uses a hybrid Ruby/Rust architecture:
- Ruby Layer: Provides convenient API and format detection
- Rust Layer: High-performance parsing using:
  - MuPDF for PDF text extraction (statically linked)
  - tesseract-rs for OCR (bundled Tesseract by default)
  - docx-rs for Word document parsing
  - calamine for Excel parsing
  - zip + quick-xml for PowerPoint parsing
  - Magnus for Ruby-Rust FFI bindings
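To make the split between the two layers concrete, here is a minimal sketch of a Ruby-side helper that hands binary data to one of the format-specific methods. It assumes only what the Advanced Usage examples show, namely that methods such as `parse_pdf` accept an array of byte values. `parse_with` and `FakeParser` are hypothetical names for this illustration; `FakeParser` stands in for `ParseKit::Parser` so the sketch runs without the gem installed.

```ruby
# Hypothetical helper: forward binary data to a format-specific method
# (parse_pdf, parse_docx, ocr_image, ...) on any parser-like object.
def parse_with(parser, method_name, data)
  unless parser.respond_to?(method_name)
    raise ArgumentError, "parser does not support #{method_name}"
  end

  # The Advanced Usage examples pass an Array of byte values,
  # so a binary String is converted before the call.
  bytes = data.is_a?(String) ? data.bytes : data
  parser.public_send(method_name, bytes)
end

# Stand-in for ParseKit::Parser (assumption: the real parser exposes
# the same method name and accepts the same argument shape).
class FakeParser
  def parse_pdf(bytes)
    "pdf with #{bytes.length} bytes"
  end
end

puts parse_with(FakeParser.new, :parse_pdf, '%PDF')
```

With the real gem, the same helper would presumably be called as `parse_with(ParseKit::Parser.new, :parse_pdf, File.binread('document.pdf'))`.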
🎨 Zero-Dependency Philosophy
Traditional Ruby document parsing requires complex system dependencies:
- Tesseract OCR installation
- Poppler for PDF handling
- ImageMagick for image processing
- Platform-specific libraries
ParseKit eliminates all of this by bundling everything needed:
# Traditional approach
brew install tesseract poppler imagemagick # macOS
sudo apt-get install tesseract-ocr poppler-utils imagemagick # Ubuntu
gem install some-parsing-gem
# ParseKit approach
gem install parsekit # Done!
⚡ Performance Features
- Native Rust Speed: Core parsing implemented in Rust for maximum performance
- Statically Linked Libraries: MuPDF and Tesseract compiled with optimizations
- Efficient Memory Usage: Streaming where possible, configurable size limits
- Smart Format Detection: Magic number detection with filename fallback
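The idea behind magic number detection with a filename fallback can be sketched in plain Ruby. This is an illustration of the general technique, not ParseKit's actual detection code: the byte signatures are the well-known ones for each format, while `MAGIC_NUMBERS`, `EXTENSION_FORMATS`, `OFFICE_FORMATS`, and `detect_format` are hypothetical names. The helper takes a binary String (for example from `File.binread`).

```ruby
# Well-known leading byte signatures for a few formats.
MAGIC_NUMBERS = {
  pdf: '%PDF'.b,
  png: "\x89PNG\r\n\x1A\n".b,
  jpg: "\xFF\xD8\xFF".b,
  zip: "PK\x03\x04".b # DOCX, XLSX, and PPTX are all ZIP containers
}.freeze

# Filename fallback for formats without a distinctive signature.
EXTENSION_FORMATS = {
  '.docx' => :docx, '.xlsx' => :xlsx, '.pptx' => :pptx,
  '.pdf' => :pdf, '.png' => :png, '.jpg' => :jpg, '.jpeg' => :jpg,
  '.txt' => :text, '.csv' => :text, '.md' => :text
}.freeze

OFFICE_FORMATS = %i[docx xlsx pptx].freeze

# Returns a format symbol: content signature first, filename second.
def detect_format(data, filename = nil)
  header = data.byteslice(0, 8).to_s.b
  by_magic = MAGIC_NUMBERS.find { |_format, signature| header.start_with?(signature) }&.first
  by_name = filename && EXTENSION_FORMATS[File.extname(filename).downcase]

  # A ZIP signature alone cannot tell Office formats apart,
  # so the extension is used to disambiguate in that case only.
  if by_magic == :zip
    return OFFICE_FORMATS.include?(by_name) ? by_name : :zip
  end

  by_magic || by_name || :unknown
end
```

Checking content before the name means a PDF saved as `mislabeled.txt` is still detected as a PDF, while plain text, which has no signature, is resolved from the extension.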
🛠️ Advanced OCR Configuration
ParseKit includes two OCR modes for maximum flexibility:
Bundled Mode (Default)
# Zero setup - works out of the box
parser = ParseKit::Parser.new
text = parser.ocr_image(image_data)
System Mode (Advanced Users)
For developers who want faster gem installation and already have Tesseract:
# Install without bundled features
gem install parsekit -- --no-default-features
# For development
rake compile CARGO_FEATURES="" # Disables bundled-tesseract
🧪 Real-World Examples
Batch Document Processing
require 'parsekit'
parser = ParseKit::Parser.new
documents_dir = "path/to/documents"
Dir.glob("#{documents_dir}/*.{pdf,docx,xlsx,pptx,png,jpg}").each do |file|
  begin
    text = parser.parse_file(file)
    # Process extracted text
    puts "#{file}: #{text.length} characters extracted"
    # Save to text file
    output_file = file.gsub(/\.[^.]+$/, '.txt')
    File.write(output_file, text)
  rescue => e
    puts "Error processing #{file}: #{e.message}"
  end
end
OCR Pipeline
require 'parsekit'
def extract_text_from_images(image_dir)
  parser = ParseKit::Parser.new
  results = {}
  Dir.glob("#{image_dir}/*.{png,jpg,jpeg,tiff,bmp}").each do |image_file|
    puts "Processing #{image_file}..."
    image_data = File.read(image_file, mode: 'rb').bytes
    text = parser.ocr_image(image_data)
    results[image_file] = {
      text: text,
      length: text.length,
      processed_at: Time.now
    }
  end
  results
end
# Process all images
results = extract_text_from_images("scanned_documents/")
results.each do |file, data|
  puts "#{file}: #{data[:length]} chars - #{data[:text][0..100]}..."
end
Document Classification
require 'parsekit'
require 'fileutils' # needed for FileUtils.mkdir_p and FileUtils.mv below
class DocumentClassifier
  def initialize
    @parser = ParseKit::Parser.new
  end
  # Returns a category symbol based on the first keyword group that matches
  def classify(file_path)
    text = @parser.parse_file(file_path)
    case text
    when /\b(invoice|bill|payment)\b/i
      :invoice
    when /\b(resume|curriculum vitae|cv)\b/i
      :resume
    when /\b(contract|agreement|terms)\b/i
      :contract
    when /\b(report|analysis|summary)\b/i
      :report
    else
      :unknown
    end
  end
end
classifier = DocumentClassifier.new
Dir.glob("uploads/*.{pdf,docx}").each do |file|
  category = classifier.classify(file)
  puts "#{file} -> #{category}"
  # Move to appropriate directory
  FileUtils.mkdir_p("sorted/#{category}")
  FileUtils.mv(file, "sorted/#{category}/")
end
🔄 Migration Guide
Coming from other Ruby document parsing gems? Here's how ParseKit compares:
From pdf-reader
# Before (pdf-reader)
require 'pdf-reader'
text = PDF::Reader.new('document.pdf').pages.map(&:text).join
# After (ParseKit)
require 'parsekit'
text = ParseKit.parse_file('document.pdf')
From docx gem
# Before (docx)
require 'docx'
doc = Docx::Document.open('document.docx')
text = doc.paragraphs.map(&:text).join
# After (ParseKit)
require 'parsekit'
text = ParseKit.parse_file('document.docx')
From RTesseract
# Before (RTesseract - requires system tesseract)
require 'rtesseract'
text = RTesseract.new('image.png').to_s
# After (ParseKit - zero dependencies)
require 'parsekit'
text = ParseKit.parse_file('image.png')
📦 Installation Requirements
- Ruby: >= 3.0.0
- Rust: Automatically handled during gem installation
- System Dependencies: None! Everything is bundled
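If you want to confirm the Ruby requirement from a script before installing, a small check along these lines should work. The 3.0.0 minimum comes from the list above; `MINIMUM_RUBY` and `ruby_supported?` are hypothetical names for this sketch. `Gem::Version` is used because comparing version strings directly would sort "3.10.0" before "3.2.0".

```ruby
require 'rubygems'

# Minimum Ruby version taken from the requirements list above.
MINIMUM_RUBY = Gem::Version.new('3.0.0')

# True when the given version (default: the running interpreter) meets the minimum.
def ruby_supported?(version = RUBY_VERSION)
  Gem::Version.new(version) >= MINIMUM_RUBY
end

puts "Ruby #{RUBY_VERSION} supported: #{ruby_supported?}"
```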
🙏 Acknowledgments
ParseKit builds on excellent Rust crates:
- mupdf for PDF parsing
- tesseract-rs for OCR
- docx-rs for Word documents
- calamine for Excel files
- quick-xml and zip for PowerPoint
- magnus for Ruby-Rust integration
🚀 Ready to Parse?
Install ParseKit 0.1.0 today and start extracting text from any document format with zero hassle:
gem install parsekit
No system dependencies. No complex setup. Just install and parse! 🎯✨
For documentation, examples, and source code, visit: github.com/cpetersen/parsekit