Le Bibliothécaire is a Python library that downloads and cleans French literary texts from public-domain online sources such as Project Gutenberg and Wikisource. It provides a unified interface for bulk downloading and standardized text cleanup, which makes it suitable for building corpora for NLP, digital humanities, or literary analysis.
le_bibliothecaire/
├── cleaner/
│   ├── clean_up.py               # Cleans and normalizes downloaded text
│   └── __init__.py
├── downloaders/
│   ├── base_downloader.py        # Abstract downloader with retry & delay logic
│   ├── gutenberg_downloader.py
│   ├── wikisource_downloader.py
│   ├── combined_downloader.py    # Unified interface for all sources
│   ├── utils.py                  # Shared helper functions
│   └── __init__.py
├── __init__.py
├── setup.py
├── LICENSE.md
└── README.md
- ✅ Download texts by French authors from Project Gutenberg and French Wikisource.
- ✅ Unified interface for triggering downloads from all sources.
- ✅ Retry logic with exponential backoff and randomized delays to reduce server strain.
- ✅ File system-safe naming and automatic directory organization by author.
- ✅ Clean-up pipeline for:
  - Removing metadata, boilerplate headers, and footers.
  - Trimming prologues, epilogues, and chapter markers.
  - Removing non-literary markers like export notes.
- ✅ CLI support for batch cleaning text files in a directory.
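To make the header and footer removal step concrete, here is a minimal sketch of how Project Gutenberg boilerplate could be stripped. It is not the code in clean_up.py: the function name `strip_gutenberg_boilerplate` and both patterns are assumptions for illustration, based on the `*** START OF ...` and `*** END OF ...` marker lines that Project Gutenberg plain-text files usually carry.

```python
import re

# Marker lines found in Project Gutenberg plain-text files. The exact wording
# varies between editions, so both patterns are deliberately loose.
# (Hypothetical helper; not taken from clean_up.py.)
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Return only the text between the START and END marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_RE.search(text)
    if start:
        # Drop everything up to and including the START marker line.
        text = text[start.end():]
    end = END_RE.search(text)
    if end:
        # Drop the END marker line and everything after it.
        text = text[:end.start()]
    return text.strip()
```

Wikisource exports and chapter markers would need their own patterns; this sketch only covers the Gutenberg license header and footer.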
Clone the repository:
git clone https://github.com/yourusername/le_bibliothecaire.git
cd le_bibliothecaire
Install dependencies (use a virtualenv if needed):
pip install -r requirements.txt
Or install as a package:
pip install .
Use the CombinedDownloader to fetch works by an author from all supported sources:
from le_bibliothecaire import CombinedDownloader
downloader = CombinedDownloader(base_folder="downloads")
downloader.download_all("Victor Hugo")
You can also use individual downloaders if desired:
from le_bibliothecaire import GutenbergDownloader, WikisourceDownloader
gutenberg = GutenbergDownloader("downloads")
gutenberg.download("Jules Verne")
wikisource = WikisourceDownloader("downloads")
wikisource.download("Émile Zola")
To clean up downloaded files:
python le_bibliothecaire/cleaner/clean_up.py downloads cleaned_texts
This will:
- Recursively scan the downloads/ folder
- Clean all .txt files
- Write cleaned versions to cleaned_texts/, preserving folder structure
Alternatively, call from Python:
from le_bibliothecaire.cleaner.clean_up import process_directory
process_directory("downloads", "cleaned_texts")
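The three steps listed above (scan recursively, clean every .txt file, mirror the folder structure) can be sketched with `pathlib`. This illustrates the idea and is not the actual body of `process_directory`; the name `clean_tree` and its `clean` parameter are assumptions.

```python
from pathlib import Path
from typing import Callable


def clean_tree(src: str, dst: str, clean: Callable[[str], str] = str.strip) -> int:
    """Clean every .txt file under src and write it to the same relative
    path under dst. Returns the number of files written.

    `clean` stands in for the real clean-up pipeline (hypothetical hook).
    """
    src_root, dst_root = Path(src), Path(dst)
    written = 0
    # sorted() materializes the listing first, so files created under dst
    # are never picked up mid-scan even if dst lives inside src.
    for path in sorted(src_root.rglob("*.txt")):
        if not path.is_file():
            continue
        target = dst_root / path.relative_to(src_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(clean(path.read_text(encoding="utf-8")), encoding="utf-8")
        written += 1
    return written
```

Calling `clean_tree("downloads", "cleaned_texts")` would mirror each author folder under cleaned_texts/. The sketch assumes the downloaded files are UTF-8 encoded.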
After downloading and cleaning Victor Hugo:
downloads/
└── Victor Hugo/
    ├── Les Misérables.txt
    └── Notre-Dame de Paris.txt
cleaned_texts/
└── Victor Hugo/
    ├── Les Misérables.txt
    └── Notre-Dame de Paris.txt
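The layout above depends on author and title strings being turned into names that are valid on every file system. One possible approach is sketched below; `safe_name` and `target_path` are hypothetical helpers, not functions exported by the library, and the character set replaced here is an assumption.

```python
import re
import unicodedata
from pathlib import Path

# Characters that are invalid in file names on at least one major OS.
_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_name(name: str, max_len: int = 150) -> str:
    """Return a file-system-safe version of `name`, keeping accents."""
    # Use one canonical form so "é" is always stored the same way.
    name = unicodedata.normalize("NFC", name)
    name = _FORBIDDEN.sub("_", name)
    # Collapse whitespace runs; trailing dots and spaces break on Windows.
    name = re.sub(r"\s+", " ", name).strip(" .")
    return name[:max_len].rstrip(" .") or "untitled"


def target_path(base_folder: str, author: str, title: str) -> Path:
    """Build base_folder/<author>/<title>.txt from sanitized parts."""
    return Path(base_folder) / safe_name(author) / f"{safe_name(title)}.txt"
```

Accented characters are kept on purpose so that titles such as "Les Misérables" stay readable; only characters that a file system may reject are replaced.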
All downloaders support:
- Retries on failure (default: 3)
- Random delays between requests (default: 1–4s)
- Optional toggling of Gutenberg/Wikisource via flags
Example:
from le_bibliothecaire import CombinedDownloader

downloader = CombinedDownloader(
    base_folder="downloads",
    retries=5,
    delay_range=(2, 5),
    enable_delay=True,
    gutenberg_enabled=True,
    wikisource_enabled=False,
)
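For readers curious how `retries`, `delay_range`, and `enable_delay` might interact, the following is a rough sketch of retrying with exponential backoff and a randomized delay. It is not the code in base_downloader.py: `with_retries` is a hypothetical helper, and it assumes that `retries` counts total attempts and that the random delay doubles after each failure.

```python
import random
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    retries: int = 3,
    delay_range: Tuple[float, float] = (1.0, 4.0),
    enable_delay: bool = True,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` up to `retries` times, waiting longer after each failure."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return action()
        except Exception:
            # Broad catch for illustration; real code would narrow this to
            # network errors such as requests.RequestException.
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            if enable_delay:
                # Random base delay, doubled on every failed attempt.
                sleep(random.uniform(*delay_range) * (2 ** attempt))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists only so the waiting can be replaced in tests. With `delay_range=(2, 5)`, the first wait in this sketch falls between 2 and 5 seconds and the second between 4 and 10.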
- Python 3.8+
- requests
- beautifulsoup4
Install the Python packages with:
pip install requests beautifulsoup4
This project is licensed under the terms of the MIT License. See the LICENSE.md file for details.
- Project Gutenberg: https://www.gutenberg.org
- French Wikisource: https://fr.wikisource.org
- Inspired by open-source scrapers and literary data initiatives.
- Add async download mode for faster scraping
- Support other languages (EN, DE, etc.)
- Add automatic EPUB or PDF conversion
- Integrate with Hugging Face Datasets
📮 For questions or contributions, feel free to open an issue or pull request!