A complete pipeline for extracting text content from the Shamela (المكتبة الشاملة) digital Islamic library into structured JSON files.
This tool extracts content from Shamela's Lucene indices and SQLite databases, converting them into clean JSON files with complete metadata including:
- Full book text with page structure and proper UTF-8 Arabic text support
- Author information and metadata
- Category classifications
- Title hierarchies with optional table of contents generation
- Footnotes and references with improved separation logic
- Requires ~40 GB of disk space to extract a full Shamela installation
New Features:
- Test Mode: Process single books for faster development (`--test-single-book`)
- Table of Contents: Generate hierarchical TOC with accurate page links (`--generate-toc`)
- UTF-8 Support: Proper Arabic text display (no more '????' characters)
- Java 8+ (required for Lucene processing)
- Python 3.7+ with the following packages:
  - `pandas`
  - `sqlite3` (usually included with Python)
  - `pathlib` (usually included with Python)
- Shamela Installation - A complete Shamela4 directory with the following (included in a full Shamela installation):
  - `database/store/` (Lucene indices)
  - `database/master.db` (main metadata database)
  - `database/book/` (individual book databases)
- Java Dependencies - Already included in this repository:
  - `lib/lucene-core-9.10.0.jar`
  - `lib/lucene-backward-codecs-9.10.0.jar`
- Clone this repository: `git clone https://github.com/ammusto/shamela-extractor`, then `cd shamela-extractor`
- Install Python dependencies: `pip install pandas`
- Verify your Java installation: `java -version`
- Compile the Java extractor: `javac -cp "lib/*" java/ShamelaIndexExporter.java`
Your setup should look like this:
```
shamela-extractor/
├── lib/
│   ├── lucene-core-9.10.0.jar
│   └── lucene-backward-codecs-9.10.0.jar
├── java/
│   ├── ShamelaIndexExporter.java
│   ├── ShamelaIndexExporter.class (after compilation)
│   ├── ShamelaIndexExporter$TitleDoc.class (after compilation)
│   └── ShamelaIndexExporter$DocWithId.class (after compilation)
├── extract_indices.py
├── build_jsons.py
└── README.md
```
Default Shamela installation:

```
.../shamela4/
├── database/
│   ├── store/     (Lucene indices)
│   ├── master.db  (main database)
│   └── book/      (individual book databases)
└── ...
```
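Before running the extraction, it can help to confirm that your Shamela path actually contains this layout. The sketch below is an illustrative check only (it is not one of the repository's scripts; `extract_indices.py` performs its own prerequisite validation):

```python
from pathlib import Path
import shutil

def check_shamela_layout(shamela_path):
    """Illustrative check that a Shamela4 directory contains what the extractor needs."""
    root = Path(shamela_path)
    required = [
        root / "database" / "store",      # Lucene indices
        root / "database" / "master.db",  # main metadata database
        root / "database" / "book",       # individual book databases
    ]
    ok = True
    for path in required:
        if not path.exists():
            print(f"Missing: {path}")
            ok = False
    # Java must be on PATH for the Lucene export step
    if shutil.which("java") is None:
        print("Java is not installed or not in PATH")
        ok = False
    return ok

if __name__ == "__main__":
    print("Setup looks complete" if check_shamela_layout("C:/shamela4") else "Setup incomplete")
```

Once the layout checks out, run the extraction: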
```bash
# Full extraction (interactive mode)
python extract_indices.py

# Or with command line arguments
python extract_indices.py --shamela-path "C:/shamela4" --output-path "./output"

# Test mode - process only first book (faster for development)
python extract_indices.py --test-single-book
```

What it does:
- Extracts all Lucene indices to CSV files with proper UTF-8 encoding
- Creates individual CSV files for each book's content (~17GB for full Shamela installation)
- Exports author and book metadata
- Validates prerequisites automatically
Options:
- Interactive mode: You'll be prompted for Shamela path and output directory
- Command line mode: Use the `--shamela-path` and `--output-path` flags
- Test mode: Use `--test-single-book` for single-book processing (~2 minutes vs. 30+ minutes)
Output:
- `exported_indices/book_data/` - Individual book CSV files (`1000.csv`, `1001.csv`, etc.)
- `exported_indices/author.csv` - Author metadata
- `exported_indices/book.csv` - Book metadata
- `logs/lucene_extraction.log` - Processing log
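As a quick sanity check after extraction, the exported metadata can be opened with pandas. A small illustrative sketch, assuming the `exported_indices/` directory sits under the `--output-path` you chose (no column names are assumed; it only reports row and file counts):

```python
from pathlib import Path
import pandas as pd

export_dir = Path("./output/exported_indices")  # adjust to match your --output-path

# Metadata exported alongside the per-book content
authors = pd.read_csv(export_dir / "author.csv")
books = pd.read_csv(export_dir / "book.csv")
print(f"author.csv: {len(authors)} rows, {len(authors.columns)} columns")
print(f"book.csv:   {len(books)} rows, {len(books.columns)} columns")

# One CSV per book, named after Shamela's book IDs (1000.csv, 1001.csv, ...)
book_csvs = sorted((export_dir / "book_data").glob("*.csv"))
print(f"book_data/: {len(book_csvs)} book CSV files")
```

The next step turns these CSVs into per-book JSON files.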
```bash
# Basic JSON generation (interactive mode)
python build_jsons.py

# With table of contents generation
python build_jsons.py --generate-toc

# Test mode with TOC (single book only)
python build_jsons.py --test-single-book --generate-toc

# With command line arguments
python build_jsons.py --shamela-path "C:/shamela4" --extracted-data-path "./output" --generate-toc
```

What it does:
- Combines CSV data with SQLite metadata
- Creates structured JSON files for each book with proper UTF-8 Arabic text
- Processes text content and hierarchies with improved footnote separation (see the sketch after this list)
- Links authors, categories, and metadata
- Optionally generates table of contents with accurate page mapping
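How footnotes are actually separated depends on how Shamela stores page text, and the exact logic lives in `build_jsons.py`. The sketch below only illustrates the general idea, under the purely hypothetical assumption that footnotes follow the page body after a divider line of underscores:

```python
import re

# Hypothetical body/footnote divider: a line of underscores between body and notes.
# The real separator and logic live in build_jsons.py and may differ.
DIVIDER = re.compile(r"\n_{5,}\s*\n")

def split_body_and_footnote(page_text):
    """Return (body, footnote); footnote is empty when no divider is present."""
    parts = DIVIDER.split(page_text, maxsplit=1)
    body = parts[0].strip()
    footnote = parts[1].strip() if len(parts) > 1 else ""
    return body, footnote

body, footnote = split_body_and_footnote("نص الصفحة الأصلي\n__________\n(1) نص الحاشية")
print(body)      # نص الصفحة الأصلي
print(footnote)  # (1) نص الحاشية
```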
Options:
- `--generate-toc`: Creates hierarchical table of contents with proper page links
- `--test-single-book`: Process only the first book for faster development
- Interactive mode: Prompts for paths if not provided via command line
Output:
- `books_json/1000.json`, `books_json/1001.json`, etc. - Complete book JSON files with optional TOC
- `logs/json_building.log` - Processing log
```bash
python clean_files.py
```

What it does:
- Deletes the `book_data` folder and all book-level CSV files (~17 GB), which are no longer needed once the JSONs are built
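For reference, the cleanup amounts to removing the intermediate `book_data/` directory once the JSONs exist. A minimal sketch of that step (not the script itself), assuming the default `exported_indices/` location:

```python
import shutil
from pathlib import Path

# Location of the intermediate per-book CSVs (adjust to match your output path)
book_data = Path("./output/exported_indices/book_data")

if book_data.is_dir():
    shutil.rmtree(book_data)   # frees roughly 17 GB after the JSONs are built
    print(f"Deleted {book_data}")
else:
    print(f"Nothing to delete at {book_data}")
```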
Example book JSON (with an optional table of contents when using the `--generate-toc` flag):
```json
{
  "book_id": "1",
  "title": "الفواكه العذاب في الرد على من لم يحكم السنة والكتاب",
  "book_date": 1225,
  "category": {
    "id": 1,
    "name": "العقيدة"
  },
  "book_type": 1,
  "printed": 1,
  "pdf_links": {
    "files": [
      "/1/41557.pdf"
    ],
    "cover": 2,
    "size": 1445468
  },
  "meta_data": {
    "date": "08121431"
  },
  "authors": [
    {
      "id": "513",
      "name": "حمد بن ناصر آل معمر",
      "death_number": 1225,
      "death_text": "1225",
      "is_main_author": true
    }
  ],
  "book_meta": [
    "الكتاب: الفواكه العذاب في الرد على من لم يحكّم السنة والكتاب",
    "..."
  ],
  "author_meta": [
    "حمد بن ناصر آل معمر (٠٠٠ - ١٢٢٥ هـ = ٠٠٠ - ١٨٣٩ م)",
    "..."
  ],
  "parts": [
    {
      "part": "",
      "pages": [
        {
          "page_id": "1",
          "page_number": "3",
          "body": "...",
          "footnote": "..."
        },
        ...
      ]
    }
  ],
  "table_of_contents": [
    {
      "page": "1/5",
      "title": "مقدمة المحقق",
      "chapters": [
        {
          "page": "1/9",
          "title": "ترجمة المؤلف",
          "chapters": []
        }
      ]
    }
  ]
}
```
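A generated book file can be consumed directly with the standard library. A short usage sketch based on the structure above, assuming `books_json/1000.json` exists:

```python
import json
from pathlib import Path

book_path = Path("books_json/1000.json")  # any generated book file
with book_path.open(encoding="utf-8") as f:
    book = json.load(f)

print(book["title"])
print("Authors:", ", ".join(a["name"] for a in book["authors"]))

# Count pages across all parts
page_count = sum(len(part["pages"]) for part in book["parts"])
print("Pages:", page_count)

# Walk the top-level table of contents, if --generate-toc was used
for entry in book.get("table_of_contents", []):
    print(entry["page"], entry["title"])
```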
Log Files:
- Complete processing logs in the `logs/` directory
- Every book processing result logged
- Error details for troubleshooting
"Java is not installed or not in PATH"
- Install Java JDK/JRE 8 or higher
- Ensure the `java` command works in your terminal
"No book content CSV files found!"
- Run `extract_indices.py` first
- Check that `exported_indices/book_data/` contains CSV files
"Shamela Lucene store not found"
- Verify Shamela installation path
Books fail to process
- Check `logs/json_building.log` for specific errors
- Verify that the individual book databases in `database/book/` exist
Arabic text shows as '????' characters
- This should be fixed automatically with the new UTF-8 support
- If still occurring, ensure you're using the latest version
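If you post-process the output with your own scripts, the usual cause of '????' is writing Arabic text without an explicit UTF-8 encoding. The generated files are plain UTF-8 JSON, so reading and writing them along these lines keeps the Arabic intact (a general Python example, not code from this repository):

```python
import json

data = {"title": "الفواكه العذاب"}

# ensure_ascii=False keeps Arabic characters as-is instead of \uXXXX escapes,
# and an explicit UTF-8 encoding avoids platform-default codecs that yield '????'
with open("example.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

with open("example.json", encoding="utf-8") as f:
    print(json.load(f)["title"])
```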
Table of contents not generating
- Ensure you're using the `--generate-toc` flag with `build_jsons.py`
- Check that the individual book databases exist in `database/book/`
For a complete Shamela library:
- CSV files: ~17GB
- JSON files: ~18GB
- Processing time: 30+ minutes for extracting indices and building JSONs, and about 4 minutes 30 seconds for generating the table of contents (TOC)
- Test mode: ~2 minutes for a single book (ideal for development and testing)
- Input CSV files: `1000.csv`, `1001.csv`, etc. (based on Shamela's book IDs)
- Output JSON files: `1000.json`, `1001.json`, etc. (matching book IDs)
- Log files: Timestamped in the `logs/` directory
MIT License
For issues or questions:
- Check the log files in the `logs/` directory
- Verify your Shamela installation is complete
- Ensure all prerequisites are installed