A scraper that downloads the whole repository of Project Gutenberg and puts it into a locally browsable directory and then in a ZIM file, a clean and user friendly format for storing content for offline usage.
It's recommended that you use virtualenv
. py2.7.x
and py3.6+
are supported.
sudo apt-get install python-pip python-dev libxml2-dev libxslt-dev advancecomp jpegoptim pngquant p7zip-full gifsicle curl zip
sudo pip install virtualenv
sudo easy_install pip
sudo pip install virtualenv
brew install advancecomp jpegoptim pngquant p7zip gifsicle
git clone [email protected]:kiwix/gutenberg.git
cd gutenberg
virtualenv gut-env (or any name you want)
./gut-env/bin/pip install -r requirements.pip
- Activate the environment:
source gut-env/bin/activate
- Quit the environment:
deactivate
After setting up the whole environment you can just run the main script gutenberg2zim
.
It will download, process and export the content.
./gutenberg2zim
You can also specify parameters to customize the content. Only want books with the Id 100-200? Books only in French? English? Or only those both? No problem! You can also include or exclude book formats. You can add bookshelves and the option to search books by title to enrich your user experince.
./gutenberg2zim -l en,fr -f pdf --books 100-200 --bookshelves --title-search
This will download English and French books that have the Id 100 to 200 in the html (default) and pdf format.
You can find the full arguments list below.
-h --help Display this help message
-y --wipe-db Do not wipe the DB during parse stage
-F --force Redo step even if target already exist
-l --languages=<list> Comma-separated list of lang codes to filter export to (preferably ISO 639-1, else ISO 639-3)
-f --formats=<list> Comma-separated list of formats to filter export to (epub, html, pdf, all)
-m --mirror=<url> Use URL as base for all downloads.
-r --rdf-folder=<folder> Don't download rdf-files.tar.bz2 and use extracted folder instead
-e --static-folder=<folder> Use-as/Write-to this folder static HTML
-z --zim-file=<file> Write ZIM into this file path
-t --zim-title=<title> Set ZIM title
-n --zim-desc=<description> Set ZIM description
-d --dl-folder=<folder> Folder to use/write-to downloaded ebooks
-u --rdf-url=<url> Alternative rdf-files.tar.bz2 URL
-b --books=<ids> Execute the processes for specific books, separated by commas, or dashes for intervals
-c --concurrency=<nb> Number of concurrent process for download and parsing tasks
--dlc=<nb> Number of concurrent *download* process for download (overwrites --concurrency). if server blocks high rate requests
--one-language-one-zim When more than 1 language, do one zim for each language (and one with all)
-x --zim-title=<title> Custom title for the ZIM file
-q --zim-desc=<desc> Custom description for the ZIM file
--no-index Do NOT create full-text index within ZIM file
-x --zim-title=<title> Custom title for the ZIM file
-q --zim-desc=<desc> Custom description for the ZIM file
--check Check dependencies
--prepare Download & extract rdf-files.tar.bz2
--parse Parse all RDF files and fill-up the DB
--download Download ebooks based on filters
--export Export downloaded content to zim-friendly static HTML
--dev Exports *just* Home+JS+CSS files (overwritten by --zim step)
--zim Create a ZIM file
--title-search Add title search parameter
--bookshelves Add bookshelves