Document and Code Processing Utilities for Linux and macOS
TextHarvest is a collection of Bash shell scripts for automated document and source code processing. Version 2.0.0 introduces a unified CLI, cross-platform support, and enhanced functionality while keeping the simplicity of the original design.
- Source Code Processing - Generate consolidated listings from project directories
- PDF Text Extraction - Extract text directly from text-based PDF files
- OCR Processing - Perform optical character recognition on scanned/image PDFs
- Cross-Platform - Works on Linux (Ubuntu/Debian, RHEL/CentOS/Fedora) and macOS
- Parallel Processing - Multi-threaded operations for improved performance
- Interactive Mode - User-friendly file and project selection
- Configuration Management - Hierarchical config system with environment overrides
- Dry Run Mode - Preview operations before execution

1. Clone the repository:

       git clone https://github.com/user/textharvest.git
       cd textharvest

2. Make the main script executable:

       chmod +x textharvest.sh

3. Install dependencies:

       ./textharvest.sh setup

4. Create input directories:

       mkdir source_code source_pdf

5. Add your files:
   - Place project folders in `source_code/`
   - Place PDF files in `source_pdf/`

6. Process your files:

       ./textharvest.sh code --interactive    # Process source code
       ./textharvest.sh pdf-text --parallel   # Extract PDF text
       ./textharvest.sh pdf-ocr -l eng+fra    # OCR with multiple languages

| Command | Description |
|---|---|
| `code` | Generate source code listings from project directories |
| `pdf-text` | Extract text directly from text-based PDF files |
| `pdf-ocr` | OCR and extract text from scanned/image PDF files |
| `setup` | Install required dependencies |
| `config` | Manage configuration settings |
| `version` | Show version information |
| `help` | Show help information |

| Option | Description |
|---|---|
| `-v`, `--verbose` | Verbose output |
| `-vv` | Very verbose output (debug) |
| `-q`, `--quiet` | Quiet mode |
| `--dry-run` | Preview operations without executing |
| `--help` | Show help for any command |
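
The global options above could be handled by a small parsing loop. The following is a minimal sketch, not TextHarvest's actual parser: the function name `parse_global_opts`, the variables `VERBOSE`, `DRY_RUN`, and `REST`, and the mapping of flags to verbosity levels are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical parser for the global options (names and levels are assumptions).
VERBOSE=1   # assumed default verbosity
DRY_RUN=0
REST=""     # everything that is not a global option (simplified to one string)

parse_global_opts() {
    VERBOSE=1
    DRY_RUN=0
    REST=""
    while [ "$#" -gt 0 ]; do
        case "$1" in
            -v|--verbose) VERBOSE=2 ;;
            -vv)          VERBOSE=3 ;;
            -q|--quiet)   VERBOSE=0 ;;
            --dry-run)    DRY_RUN=1 ;;
            *)            REST="${REST:+$REST }$1" ;;   # keep for the subcommand
        esac
        shift
    done
}

parse_global_opts code --dry-run -vv
echo "verbose=$VERBOSE dry_run=$DRY_RUN rest=$REST"
```

Collecting the leftover arguments in a plain string keeps the sketch short; a real script would more likely use an array so that arguments containing spaces survive intact.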

    # Get help for specific commands
    ./textharvest.sh code --help
    ./textharvest.sh pdf-ocr --help

    # Interactive project selection
    ./textharvest.sh code --interactive

    # Parallel processing with custom directories
    ./textharvest.sh pdf-text --parallel -i my_pdfs -o text_results

    # OCR with multiple languages
    ./textharvest.sh pdf-ocr -l eng+fra+deu --force-ocr

    # Dry run to preview operations
    ./textharvest.sh code --dry-run --verbose

    # Custom configuration
    ./textharvest.sh config --init
    ./textharvest.sh config --show

Input directories:
- `source_code/` - Project subdirectories containing source files
- `source_pdf/` - PDF files for text extraction or OCR processing

Output directories:
- `code_listings/` - Generated source code listings
- `text_output/` - PDF text extraction results
- `ocr_pdf_output/` - Intermediate OCR-processed PDFs
- `ocr_text_output/` - Final OCR text extraction results
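
To illustrate how a project folder under `source_code/` could become a single file under `code_listings/`, the sketch below concatenates every regular file of one project into one listing. It is not the logic of `process_code.sh`; the function name `make_listing`, the `=====` header format, and the output file name are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write one consolidated listing for a project directory.
# Usage: make_listing <project_dir> <output_file>
# Assumes the output file lives outside the project directory and that
# file names contain no newlines. Binary files are not filtered out.
make_listing() {
    local project_dir="$1"
    local output_file="$2"
    : > "$output_file"    # start with an empty listing
    find "$project_dir" -type f | sort | while IFS= read -r file; do
        printf '===== %s =====\n' "$file" >> "$output_file"
        cat "$file" >> "$output_file"
        printf '\n' >> "$output_file"
    done
}
```

A call such as `make_listing source_code/myproject code_listings/myproject.txt` (with `myproject` as a placeholder name) would then produce one text file with a header line before each source file.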
TextHarvest/
├── textharvest.sh          # Main CLI interface
├── textharvest.conf        # Configuration template
├── setup.sh                # Cross-platform installer
├── lib/
│   └── common.sh           # Shared utility functions
├── process_code.sh         # Source code processing
├── process_pdf_text.sh     # PDF text extraction
├── process_pdf_ocr.sh      # OCR processing
└── README.md               # This file
Configuration files are loaded in this order (later files override earlier ones):
1. `/etc/textharvest.conf` - System-wide settings
2. `~/.textharvest.conf` - User settings
3. `./textharvest.conf` - Local project settings
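
The "later files override earlier ones" rule can be sketched by sourcing each file in that order. This is a minimal illustration under two assumptions: that the config files contain plain shell assignments such as `TEXTHARVEST_MAX_JOBS=8`, and that a helper like the hypothetical `load_config` exists. It covers only the file ordering, not environment variable overrides.

```shell
#!/usr/bin/env bash
# Hypothetical loader: source each readable config file in precedence order.
# A value set in a later file replaces the value from an earlier one.
load_config() {
    local conf
    for conf in /etc/textharvest.conf "$HOME/.textharvest.conf" ./textharvest.conf; do
        if [ -r "$conf" ]; then
            # shellcheck disable=SC1090
            . "$conf"
        fi
    done
}
```

Because the local file is sourced last, a project can pin its own settings without touching the user or system files.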

| Variable | Description | Default |
|---|---|---|
| `TEXTHARVEST_CODE_DIR` | Source code input directory | `source_code` |
| `TEXTHARVEST_PDF_DIR` | PDF input directory | `source_pdf` |
| `TEXTHARVEST_VERBOSE_LEVEL` | Default verbosity (0-3) | `1` |
| `TEXTHARVEST_MAX_JOBS` | Parallel processing jobs | `4` |
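
One common way to combine such variables with their defaults is shell parameter expansion, which keeps a value that is already set and otherwise falls back to the default. The sketch below is an assumption about how this could look, not a copy of the project's code; only the variable names and default values come from the table, and `apply_defaults` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Assign the documented default only where no value is set yet.
apply_defaults() {
    : "${TEXTHARVEST_CODE_DIR:=source_code}"
    : "${TEXTHARVEST_PDF_DIR:=source_pdf}"
    : "${TEXTHARVEST_VERBOSE_LEVEL:=1}"
    : "${TEXTHARVEST_MAX_JOBS:=4}"
}

apply_defaults
echo "jobs=$TEXTHARVEST_MAX_JOBS pdf_dir=$TEXTHARVEST_PDF_DIR"
```

With this pattern, exporting `TEXTHARVEST_MAX_JOBS=8` before the defaults are applied leaves the value at 8, while unset variables receive the defaults from the table.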

    # Create configuration files
    ./textharvest.sh config --init             # Create local config
    ./textharvest.sh config --init --global    # Create user config

    # View and validate settings
    ./textharvest.sh config --show             # Show current settings
    ./textharvest.sh config --validate         # Verify configuration

| Platform | Package Managers | Status |
|---|---|---|
| Linux | apt (Ubuntu/Debian), yum/dnf (RHEL/CentOS/Fedora) | Full support |
| macOS | brew (Homebrew) | Full support |

1. Install Homebrew (if not already installed):

       /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

2. Add Homebrew to PATH:

       # For Apple Silicon Macs:
       export PATH="/opt/homebrew/bin:$PATH"
       # For Intel Macs:
       export PATH="/usr/local/bin:$PATH"

3. Run TextHarvest setup:

       ./textharvest.sh setup

- Automatic OS detection and package manager selection
- Cross-platform file operations with proper path handling
- Native dependency installation for each platform
- Consistent CLI behavior across all supported systems
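
Automatic OS detection and package manager selection can be approximated with `uname` and `command -v`. The following is a rough sketch rather than the contents of `setup.sh`; the function name `detect_pkg_manager` and the order in which Linux package managers are probed are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the package manager to use, or fail with status 1.
detect_pkg_manager() {
    local pm
    case "$(uname -s)" in
        Darwin)
            if command -v brew >/dev/null 2>&1; then
                echo brew
                return 0
            fi
            ;;
        Linux)
            # Probe order is an assumption: apt first, then dnf, then yum.
            for pm in apt dnf yum; do
                if command -v "$pm" >/dev/null 2>&1; then
                    echo "$pm"
                    return 0
                fi
            done
            ;;
    esac
    echo "unsupported platform or no known package manager" >&2
    return 1
}
```

A caller could branch on the printed name, for example choosing between `apt install` and `brew install` for the same logical dependency.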
TextHarvest automatically installs these dependencies via your system's package manager:

| Tool | Purpose | Linux Package | macOS Package |
|---|---|---|---|
| poppler | PDF text extraction | `poppler-utils` | `poppler` |
| tesseract | OCR engine | `tesseract-ocr` | `tesseract` |
| ocrmypdf | PDF OCR processing | `pip install ocrmypdf` | `ocrmypdf` |
| Language packs | OCR languages | `tesseract-ocr-eng` | `tesseract-lang` |
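
Before or after installation, it can be useful to check which of these tools are actually on `PATH`. The sketch below is an illustrative check, not part of TextHarvest; `missing_tools` is a hypothetical name. It looks for `pdftotext` (the poppler program that extracts text), `tesseract`, and `ocrmypdf`.

```shell
#!/usr/bin/env bash
# Print the name of every given tool that cannot be found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools pdftotext tesseract ocrmypdf)
if [ -n "$missing" ]; then
    echo "Missing tools:"
    echo "$missing"
else
    echo "All dependencies found"
fi
```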
If you prefer to install dependencies manually:

Linux (Ubuntu/Debian):

    sudo apt update
    sudo apt install poppler-utils tesseract-ocr tesseract-ocr-eng python3-pip
    pip3 install --user ocrmypdf

Linux (RHEL/CentOS/Fedora):

    sudo dnf install poppler-utils tesseract tesseract-langpack-eng python3-pip
    pip3 install --user ocrmypdf

macOS (Homebrew):

    brew install poppler tesseract tesseract-lang ocrmypdf

- Project archival - Create comprehensive text-based documentation
- Code review preparation - Generate consolidated listings for review
- Documentation generation - Extract code for technical documentation
- Analysis and auditing - Prepare code for external analysis tools
- Research workflows - Extract text from academic papers and reports
- Data extraction - Convert PDFs to searchable text for analysis
- Archive digitization - OCR scanned documents and legacy files
- Content migration - Extract text for content management systems
- Automated workflows - Process hundreds of files efficiently
- CI/CD integration - Generate documentation as part of build processes
- Content indexing - Prepare documents for search engines
- Format conversion - Convert document collections to text format

`./textharvest.sh code --interactive`

- Browse and select specific projects or files
- Preview operations before execution
- Step-by-step processing with user confirmation

`./textharvest.sh pdf-text --parallel --max-jobs 8`

- Multi-threaded processing for large file collections
- Configurable job limits for system optimization
- Progress tracking and ETA calculations
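
A job limit like `--max-jobs` can be approximated with `find` and `xargs -P`, which caps the number of processes running at once. This is a sketch of the general technique, not the project's implementation; `run_parallel` and its argument order are assumptions, and it does not cover progress tracking or ETA calculation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command once per matching file, with a job cap.
# Usage: run_parallel <max_jobs> <dir> <name_pattern> <command> [args...]
run_parallel() {
    local max_jobs="$1"
    local dir="$2"
    local pattern="$3"
    shift 3
    # -print0 and -0 keep file names with spaces intact;
    # -n 1 passes one file per invocation; -P caps concurrent processes.
    find "$dir" -type f -name "$pattern" -print0 |
        xargs -0 -n 1 -P "$max_jobs" "$@"
}
```

For example, `run_parallel 8 source_pdf '*.pdf' pdftotext -layout` would run up to eight `pdftotext` processes at a time. Called with only an input file, `pdftotext` writes the `.txt` file next to the PDF, so a real script would also need to direct output to a directory such as `text_output/`.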

`./textharvest.sh pdf-ocr -l eng+fra+deu --deskew --clean`

- Multiple language support
- Image preprocessing options
- Customizable OCR parameters
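
Given the listed dependencies, one plausible shape for the OCR step is to add a text layer with `ocrmypdf` and then extract it with `pdftotext`. The sketch below assumes that two-step flow and the output directory names `ocr_pdf_output/` and `ocr_text_output/`; the function name `ocr_one` is hypothetical, and `process_pdf_ocr.sh` may work differently.

```shell
#!/usr/bin/env bash
# Hypothetical two-step OCR: add a text layer, then extract the text.
# Usage: ocr_one <input.pdf> <languages> [extra ocrmypdf options...]
ocr_one() {
    local input="$1"
    local langs="$2"
    shift 2
    local base
    base=$(basename "$input" .pdf)
    mkdir -p ocr_pdf_output ocr_text_output
    # Extra options such as --deskew or --clean are passed through unchanged.
    ocrmypdf -l "$langs" "$@" "$input" "ocr_pdf_output/$base.pdf" &&
        pdftotext "ocr_pdf_output/$base.pdf" "ocr_text_output/$base.txt"
}
```

A call like `ocr_one source_pdf/scan.pdf eng+fra+deu --deskew --clean` (with `scan.pdf` as a placeholder) mirrors the command shown above. Note that ocrmypdf's `--clean` option depends on the separate `unpaper` tool being installed.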

`./textharvest.sh code --dry-run --verbose`

- Preview operations without making changes
- Validate input files and directories
- Test configuration and command syntax
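
A dry run mode is often built around a single wrapper through which every side-effecting command passes. The following minimal sketch shows that pattern; the names `run` and `DRY_RUN` are assumptions and may not match what TextHarvest uses internally.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: print the command in dry-run mode, execute it otherwise.
DRY_RUN=${DRY_RUN:-0}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "[dry-run] $*"
    else
        "$@"
    fi
}

DRY_RUN=1
run mkdir -p code_listings    # only prints the command, creates nothing
```

Routing all file operations through one function keeps the preview honest: anything that bypasses the wrapper would still run during a dry run, so the wrapper has to be used consistently.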
TextHarvest v2.0.0 features a modern, modular architecture:
- Unified CLI Interface - Single entry point (`textharvest.sh`) for all operations
- Shared Library - Common functions in `lib/common.sh` for consistency
- Configuration System - Hierarchical config with environment variable overrides
- Error Handling - Comprehensive error checking and user feedback
- Progress Tracking - Real-time progress indicators and timing information
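
A single entry point of this kind is typically a `case` statement that maps each subcommand to a handler. The sketch below illustrates the idea only: the `dispatch` function is hypothetical, the handlers merely echo what they would run, and the mapping of subcommands to script names is inferred from the file names in this README rather than confirmed.

```shell
#!/usr/bin/env bash
# Hypothetical dispatcher: map each subcommand to a handler.
# The handlers only echo; a real script would invoke the processing scripts.
dispatch() {
    local cmd="${1:-help}"
    if [ "$#" -gt 0 ]; then
        shift
    fi
    case "$cmd" in
        code)     echo "would run process_code.sh $*" ;;
        pdf-text) echo "would run process_pdf_text.sh $*" ;;
        pdf-ocr)  echo "would run process_pdf_ocr.sh $*" ;;
        setup)    echo "would run setup.sh $*" ;;
        help)     echo "usage: textharvest.sh <command> [options]" ;;
        *)        echo "unknown command: $cmd" >&2; return 1 ;;
    esac
}

dispatch pdf-text --parallel
```

Keeping the dispatch table in one place means a new subcommand needs only one extra `case` branch plus its handler.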
- Single command interface replaces multiple scripts
- Cross-platform support for Linux and macOS
- Interactive modes for better user experience
- Configuration management system
- Parallel processing capabilities
- Dry run mode for safe operation testing
- Enhanced error handling and logging
- Plugin architecture foundation for extensibility
TextHarvest is designed to be simple, reliable, and extensible. Contributions are welcome for:
- Additional file format support
- Platform compatibility improvements
- Performance optimizations
- Documentation enhancements
- Bug fixes and feature requests
This project is licensed under the MIT License - see the LICENSE file for details.
MIT License Summary:
- Commercial use allowed
- Modification and distribution permitted
- Private use permitted
- No warranty or liability
- Attribution required (keep copyright notice)
- GitHub Repository: https://github.com/matthewdeaves/textharvest
- Issue Tracker: Report bugs and request features
- Documentation: Additional guides and examples
TextHarvest v2.0.0 - Efficient document and code processing for the modern developer.