The provided Shell/Python scripts use the Internet Archive Python-/CL-interface to transfer selected filetypes from Internet Archive collections to a Hadoop cluster (HDFS).
The transfer happens in two steps:
- A list of all files to transfer will be created (
create_filelist.py
) - While the files are being transfered, the script keeps track of the already transfered files to allow restarting the process anytime and continue where it stopped (
download_files.py
)
The downloaded files are being separated into different folders corresponding to their filetypes.
During transfer, a specified number of files will be downloaded into a local staging directory and copied to HDFS in a bunch (default 10, max_staging
in download_files.py
).
First of all, please install https://github.com/jjjake/internetarchive to have the required ia
command available.
Next, please modify download.sh
according to your needs to include your paths and required filetypes.
download.sh
calls the python scripts and should be used to start off the transfer process.
To be called with ./download.sh <COLLECTION_NAME>
.
E.g., ./download.sh ArchiveIt-Collection-1234
Called by download.sh
: ./create_filelist.py <COLLECTION_NAME> <GLOB_PATTERNS>
.
E.g., ./create_filelist.py ArchiveIt-Collection-1234 *.cdx *.cdx.gz *.warc.gz
Called by download.sh
: ./download_files.py <COLLECTION_NAME> <LOCAL_STAGING_PATH> <HDFS_PATH>
E.g., ./download_files.py ArchiveIt-Collection-1234 /mnt/ephemeral0/holzmann/tmp /user/holzmann/ia1