This role installs HDFS on Ubuntu/Debian Linux servers.
- The role requires Java and ZooKeeper to be installed, configured, and running.
```yaml
- hosts: hadoop_hosts
  become: True
  roles:
    - hdfs
```
For an example inventory, please check the inventory file.
If `hdfs_ssh_fence` is set to `true`, the playbook has to be run with the `-K` option of `ansible-playbook`!
This role supports two different modes of installation:
- Single Namenode with Secondary Namenode
- Two Namenodes in HA mode
The number of namenodes determines the mode: if two namenodes are specified, HDFS will be installed in an HA fashion.
For documentation details of HDFS please refer to the official Hadoop documentation.
This role makes use of groups to figure out which server needs which installation (see the inventory sketch after this list). The groups are listed below:
- namenodes
- datanodes
- secondarynamenode (Single NN setup only)
- zookeeper_hosts (High availability mode only)
- journalnodes (High availability mode only)
Alternatively, variables like `hdfs_namenodes` can be overwritten (see defaults/main.yml).
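A minimal inventory sketch for an HA setup might look like this; all hostnames are placeholders, and the repository's inventory file remains the authoritative example:

```yaml
# Hypothetical HA inventory in YAML format -- all hostnames are placeholders.
all:
  children:
    namenodes:
      hosts:
        nn1.example.com:
        nn2.example.com:
    datanodes:
      hosts:
        dn1.example.com:
        dn2.example.com:
        dn3.example.com:
    journalnodes:
      hosts:
        nn1.example.com:
        nn2.example.com:
        dn1.example.com:
    zookeeper_hosts:
      hosts:
        zk1.example.com:
        zk2.example.com:
        zk3.example.com:
```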
The following gives a list of important variables that have to be set for a specific deployment. Most variables can be set in group_vars or host_vars.
- `hdfs_cluster_name`: Name of your cluster
- `hdfs_parent_dir`: Where to install HDFS to
- `hdfs_version`: Hadoop version to use
- `hdfs_tmpdir`: Where to write HDFS tmp files
- `hdfs_namenode_dir_list`: Directories for namenode files
- `hdfs_datanode_dir_list`: Directories for datanode files
- `hdfs_namenode_checkpoint_dir_list`: Directories for secondary namenode files
- `hdfs_distribution_method`: How the tar.gz should be installed: 'downloaded', 'local_file', or 'compile'
- `hdfs_bootstrap`: Should the cluster be formatted? (Not recommended if you have an already existing installation)
- `hdfs_host_domain_name`: Only set this variable if your host entries are not FQDNs. E.g. value: "node.dns.example.com"
- `hdfs_upgrade`: Only set this variable to perform an upgrade (given that hdfs_version has changed)
- `hdfs_upgrade_force`: Only set this variable to force an upgrade (the playbook will run even if the version hasn't changed; useful when something went wrong and a node has already been upgraded)
For more configuration variables see the documentation in defaults/main.yml.
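For illustration, a group_vars sketch might look like the following; all values are placeholders, and defaults/main.yml remains the authoritative reference:

```yaml
# group_vars/hadoop_hosts.yml -- placeholder values, adapt to your deployment.
hdfs_cluster_name: mycluster
hdfs_version: "2.7.2"
hdfs_parent_dir: /opt
hdfs_namenode_dir_list:
  - /data/hdfs/namenode
hdfs_datanode_dir_list:
  - /data/hdfs/datanode
hdfs_namenode_checkpoint_dir_list:
  - /data/hdfs/checkpoint
hdfs_bootstrap: true
```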
If `hdfs_upgrade` is set to `true`, the playbook will assume an upgrade is taking place, and some input from the user might be required.
Additional configuration for hdfs-site.xml and core-site.xml can be added by overwriting the following variables:
- `hdfs_site_additional_properties`
- `core_site_additional_properties`
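As a sketch, such overrides might look like the following; the property names are standard Hadoop settings, but the exact structure the role expects should be verified against its templates:

```yaml
# Assumed simple name-to-value mapping -- verify against the role's templates.
hdfs_site_additional_properties:
  dfs.replication: 3
  dfs.blocksize: 134217728
core_site_additional_properties:
  io.file.buffer.size: 65536
```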
This section gives a brief description of what each playbook does.
CURRENTLY ONLY WORKS WITH Ubuntu 14.04 (16.04 ships a newer protobuf version and compilation fails).
This playbook will compile Hadoop on the server `hdfs_compile_node` to enable the Hadoop native libraries (compression codecs and HDFS Short-Circuit Local Reads). It will install the development tools necessary to compile Hadoop. Download and compilation may take a while (10-20 min), depending on your internet connection and server power.
To activate this playbook, set `hdfs_distribution_method` to `compile` (see the configuration sketch after the options list below).
Known issues:
- Sometimes the git download fails for the first time. Just run it again.
Options:
- `hdfs_compile_node`: Server to compile on
- `hdfs_compile_from_git`: True if it should download the latest version from github.com
- `hdfs_compile_version`: Version to download from GitHub (tags usable, e.g. 'tags/rel/release-2.7.2' or 'HEAD')
- `hdfs_fetch_folder`: Local folder to download the compiled tar to
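Put together, a compile-based install might be configured like this; the hostname and path are placeholders:

```yaml
# Placeholder values for a compile-based install.
hdfs_distribution_method: compile
hdfs_compile_node: build1.example.com
hdfs_compile_from_git: true
hdfs_compile_version: tags/rel/release-2.7.2
hdfs_fetch_folder: /tmp/hadoop_dist
```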
This playbook installs the hadoop binaries and creates links for easy usage.
This playbook writes the configuration files.
This playbook upgrades HDFS in a controlled way (applicable only to the HA mode). It follows a zero-downtime procedure that can be summarized as follows:
- Prepare rolling upgrade, wait for "Proceed with rolling upgrade"
  1. Perform upgrade of the active namenode (by means of failover to the standby)
  2. Failover to the newly upgraded namenode, upgrade the second namenode
- Perform upgrade of the datanodes in a rolling fashion
  1. Stop the running datanode (check if running)
  2. Install the new version
  3. Restart it with the new program version (check if running)
- Finalize the rolling upgrade
Be prepared to react to prompts from the playbook, especially when services are being started and stopped.
If anything goes wrong and some nodes were already upgraded, run the playbook again with `hdfs_upgrade_force` set to `True`. This process is idempotent.
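For example, an upgrade run might set the following (a sketch; the version number is a placeholder):

```yaml
# Placeholder version -- hdfs_version must differ from the installed version.
hdfs_version: "2.7.3"
hdfs_upgrade: true
```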
This playbook will create the user `hdfs_user`, generate an ssh-key for it, distribute the key, and register all servers in each other's known_hosts file.
This playbook sets up SSH access for the `hdfs_user` between the namenode servers. Use it if SSH fencing is the preferred fencing method. (See the HA documentation.)
This playbook writes configuration files needed only by the namenode, creates folders, and sets up services for the namenode and zkfc.
This playbook creates the folders specified in `hdfs_datanode_dir_list` and registers the hdfs-datanode service.
This playbook will install the journal node service.
This playbook will install and register the hdfs-secondarynamenode service.
This playbook bootstraps a cluster in HA mode.
This playbook bootstraps a cluster in SPOF mode. (One namenode and one secondary namenode)
The tests are run using molecule and a docker container.
- Docker
- molecule (pip module)
- docker-py (pip module)
From the root folder, run `molecule test`.
Apache 2.0
- Bertrand Bossy
- Florian Froese
- Laurent Hoss