This will help you set up a scalable, efficient hadoop cluster on the Amazon EC2 cloud. It uses Poolparty to create instances, and Chef to provision them once they start.
- Chef is declarative: you specify a final state for each node to reach, not a procedure to follow. Administration is more efficient, robust and maintainable.
- You get a nice central dashboard to manage clients
- You can easily roll out configuration changes across all your machines
- Chef is actively developed and has well-written recipes for a ton of different software packages.
- Poolparty makes creating Amazon cloud machines concise and easy: you can specify spot instances, EBS-backed volumes, disable-api-termination, and more.
- Hadoop
- NFS
- Persistent HDFS on EBS volumes
- Zookeeper (in progress)
- Cassandra (in progress)
- Install prerequisites
- Set up your local credentials and settings
- Launch chef server
- Try a single-machine cluster
- Launch slaves
- Launch a stand-alone cluster
- Set up EBS volumes
- Launch a cluster with EBS volumes
- Cassandra
- Zookeeper
- get the run_list out of the user-data
- Fix the device type identification on ebs vols.
We’re going to target these two plausible cluster setups:
A modest, no-fuss cluster to get started:
- Master node acts as chef server, nfs server, hadoop master (namenode, secondarynamenode and jobtracker), hadoop worker.
- 0-5 worker nodes: nfs client, hadoop worker.
- All nodes are EBS-backed instances, sized large enough to hold the HDFS.
- Use non-spot pricing, but manage costs by starting/stopping instances when not in use. (Set ‘ebs_delete_on_termination’ to false and ‘disable-api-termination’ to true)
A many-node cluster that can be spot priced (or frequently launched/terminated); uses persistent EBS volumes for the HDFS (much more efficient than S3).
- A standalone EBS-backed small instance acting as the chef server and nfs server. Can start/stop when not in use (set ‘ebs_delete_on_termination’ false and ‘disable-api-termination’ true) or use a reserved instance.
- Spot-priced master node (namenode, secondarynamenode and jobtracker) that is also a hadoop worker, nfs client, and cassandra node.
- 6-40 spot-priced worker nodes: hadoop worker, nfs client, cassandra node.
- All nodes are local-backed instances with EBS volumes attached at startup.
- You can shut down the cluster (or tolerate EC2 shutting it down if the spot price spikes) without harm to the HDFS. The NFS home dir lets you develop scripts on a small cluster and only spin up the big cluster for production jobs.
- For a larger cluster, you can turn off worker roles for the master node, and can specify the namenode and jobtracker to reside on different machines.
- You can specify any scale of instance depending on whether your job is IO-, CPU- or memory-intensive, and size master and worker nodes independently.
You should already be familiar with hadoop and with the Amazon cloud. These scripts are meant to efficiently orchestrate many dependent packages, and the bugs are still being straightened out.
- Choose a name for your cluster. In this example, we’ll use ‘zaius’ for the small cluster and ‘maximus’ for the big cluster.
- Visit the aws console and ensure you’re registered for EC2 and SimpleDB. (You may have to click through a license agreement and check your email)
- Choose your availability zone (spot pricing in the US-East-1 region seems to be the lowest). You must set all machines in the cluster to the same availability zone.
- Chef needs a good, durable domain name. Allocate an elastic IP; have your DNS server point both ‘chef.yourdomain.com’ and ‘zaius.yourdomain.com’ at that elastic IP.
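If you haven’t allocated the elastic IP yet, here’s a minimal sketch with the stock ec2-api-tools (the instance id and address are placeholders to substitute):
# allocate an elastic IP; this prints the new address
ec2-allocate-address
# once the chef server instance is running, bind the address to it
ec2-associate-address -i i-0704be6c 204.236.225.16
# then point A records for chef.yourdomain.com and zaius.yourdomain.com at that address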
From now on, I’m going to just use ‘chef.yourdomain.com’, ‘zaius’ and ‘maximus’ without apology, but substitute accordingly.
Install these gems:
- chef
- configliere
- amazon-ec2
- broham
NOTE: Please use the infochimps branch of poolparty for spot instance support and other tweaks
- infochimps-poolparty
At this point there are still a lot of moving parts. What I do is make one directory for all the poolparty, chef, and other config files, and then use symlinks to make everyone happy. (Note: if you’re already using the Cloudera hadoop-ec2 scripts, some of this is already in place.)
mkdir ~/.hadoop-ec2
mkdir ~/.hadoop-ec2/keypairs
ln -nfs ~/.hadoop-ec2 ~/.poolparty
ln -nfs ~/.hadoop-ec2 ~/.chef
From the chef server,
sudo cat /etc/chef/validation.pem
and copy/paste the contents into ~/.chef/keypairs/chef-validator.pem.
From this code repo dir, copy the template config files over.
cd PATH/TO/hadoop_cluster_chef
cp ./config/knife.rb ~/.hadoop-ec2/knife.rb
cp ./config/poolparty-example.yaml ~/.hadoop-ec2/poolparty.yaml
ln -nfs ~/.hadoop-ec2/poolparty.yaml ~/.hadoop-ec2/aws
# optional:
( cd ~/.hadoop-ec2 && git init && git add . && git commit -m "Initial commit" )
In ~/.chef/knife.rb, enter your chef_server_url
We need to stuff in the AWS credentials (use the aws console to get keys and so forth).
- Create keypairs named ‘chef’ and ‘zaius’. Save them as ~/.hadoop-ec2/keypairs/chef.pem and ~/.hadoop-ec2/keypairs/zaius.pem respectively, and fix their permissions: chmod 600 ~/.hadoop-ec2/keypairs/*.pem
- In the file ~/.hadoop-ec2/poolparty.yaml:
- Add your AWS access key and secret access key at the top and again under attributes[:aws]
- Put the domain name of your chef server in the top-level attributes[:chef] section, and again in the pools[:chef][:server][:attributes][:node_name].
- Put the elastic IPs you allocated in the right places (chef server, cluster masters)
- If needed, update the region, availability zone and ec2_url. If you change regions you’ll have to change the AMI ids too.
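To actually launch the chef server node, it’s the same cloud-start/cloud-ssh pattern used later in this doc (this assumes the ‘server’ pool defined in clouds/chef_clouds.rb):
cloud-start -n server -c clouds/chef_clouds.rb
# ... wait for it to bootstrap ...
cloud-ssh -n server -c clouds/chef_clouds.rb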
- sudo cat /etc/chef/server.rb for the initial web password.
- for foo in chef-client chef-server chef-server-webui chef-solr chef-solr-indexer ; do sudo service $foo restart ; done
- Click clients, create — make client called knife_user that is an admin
- paste the text of the private key into
~/.chef/knife_user.pem
- Copy the text of /etc/chef/validation.pem (from the server) into
~/.chef/chef-validator.pem
- chmod 600 ~/.chef/*.pem
knife cookbook upload --all
for foo in roles/* ; do knife role from file $foo ; done
(if you have a private cookbooks repo, run it from there, too)
You need to add an apt source (PPA) for the chef packages:
https://launchpad.net/~jtimberman/archive/opschef/packages
sudo add-apt-repository ppa:user/ppa-name
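Filling in the placeholder from the Launchpad URL above, the install would look something like this (the package names are my assumption, not from this doc):
sudo add-apt-repository ppa:jtimberman/opschef
sudo apt-get update
sudo apt-get install chef          # on client nodes
sudo apt-get install chef-server   # on the chef server (server, webui, solr)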
EITHER: use your chef server to create a client called ‘knife_user’, and save the .pem file it generates as ~/.hadoop-ec2/keypairs/knife_user.pem.
OR: On the server, run and answer as follows:
sudo knife configure -i
Your chef server URL? http://chef.infochimps.com:4000
Your client user name? knife_user
Your validation client user name? chef-validator
Path to a chef repository (or leave blank)?
WARN: Creating initial API user...
(Note that on the chef server this must be run with sudo so it can see the key files, and that the port is ‘4000’: the API server, not the webui.)
Now copy ~/.chef/knife_user.pem and /etc/chef/validation.pem from the server to (on your computer) ~/.hadoop-ec2/keypairs/knife_user.pem and ~/.hadoop-ec2/keypairs/chef-validator.pem respectively, and
chmod og-rwx ~/.hadoop-ec2/keypairs/*.pem
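A sketch of that copy, assuming you can ssh to the chef server as the ubuntu user with the ‘chef’ keypair from earlier (adjust the remote path if knife configure dropped the key somewhere else):
ssh -i ~/.hadoop-ec2/keypairs/chef.pem ubuntu@chef.yourdomain.com 'sudo cat ~/.chef/knife_user.pem'    > ~/.hadoop-ec2/keypairs/knife_user.pem
ssh -i ~/.hadoop-ec2/keypairs/chef.pem ubuntu@chef.yourdomain.com 'sudo cat /etc/chef/validation.pem' > ~/.hadoop-ec2/keypairs/chef-validator.pem
chmod og-rwx ~/.hadoop-ec2/keypairs/*.pem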
If you do knife client list
you should now see something like
[
"chef-validator",
"chef-webui",
"knife_user"
]
- Upload your roles and recipes:
cd PATH/TO/hadoop_cluster_chef
for foo in roles/*.rb ; do echo $foo ; knife role from file $foo ; done
knife cookbook upload --all
rake load_data_bags
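To sanity-check that the upload took, knife can list what the server now holds:
knife cookbook list
knife role list
knife data bag list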
cloud-start -n master -c clouds/hadoop_clouds.rb
If you’re using HDFS on the ephemeral drives, put the maintain_hadoop_cluster::reformat_scratch_disks, maintain_hadoop_cluster::format_namenode and maintain_hadoop_cluster::make_standard_hdfs_dirs recipes into the run list below hadoop and before hadoop_master. Nobody will question your courage if (rather than add them to the runlist) you just run the commands therein manually.
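If you go the manual route, the steps amount to the usual hadoop boilerplate; this is a sketch of the idea, not a transcript of those recipes (it assumes the hadoop user and stock paths used elsewhere in this doc):
# on the master, after the hadoop packages are installed but before the daemons start
sudo -u hadoop hadoop namenode -format
# once the namenode and datanodes are up, lay down the usual top-level dirs
sudo -u hadoop hadoop fs -mkdir /tmp /user
sudo -u hadoop hadoop fs -chmod 1777 /tmp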
To see how things are hangin’,
- On the master node
for foo in hadoop-0.20-{namenode,jobtracker,tasktracker,datanode,secondarynamenode} cassandra chef-client ; do sudo service $foo status ; done
- On a worker node
for foo in hadoop-0.20-{tasktracker,datanode} cassandra chef-client nfs-kernel-server ; do sudo service $foo status ; done
- From your local machine — bring over your credentials
scp -i ~/.hadoop-ec2/keypairs/gibbon.pem ~/.hadoop-ec2/{aws_private_setup.sh,certs/cert.pem,certs/pk.pem,keypairs/gibbon.pem} [email protected]:/tmp
- … and then on the target machine move them from /tmp to /mnt (which is ignored in bundling)
sudo mv /tmp/*.pem /tmp/aws_private_setup.sh /mnt
- (all following commands are also on the target machine)
- Shut down services:
for foo in hadoop-0.20-{namenode,jobtracker,tasktracker,datanode,secondarynamenode} cassandra chef-client nfs-kernel-server ; do sudo service $foo stop ; done
- Make the following services not restart on bootup:
for foo in hadoop-0.20-{tasktracker,datanode,namenode,jobtracker,secondarynamenode} cassandra ; do sudo update-rc.d -f $foo remove ; done
- Give the process list a how’s your father — nothing interesting should be running:
ps aux
- Unmount anything that’s mounted:
mount
sudo umount /home
- Give apt some last-minute lovin’
sudo apt-get -y update ;
sudo apt-get -y upgrade ;
sudo apt-get -f install ;
sudo apt-get clean ;
sudo updatedb ;
- Nuke files that would be inconvenient to persist: chef config and startup files; log files; and files that contain keys of some sort.
sudo rm /etc/hostname /etc/chef/{chef_config.json,client.pem,validation.pem} /var/lib/cloud/data/scripts/*
sudo rm /var/log/chef/* /etc/sv/chef-client/log/main/* /var/log/*.gz /var/log/hadoop/* /tmp/*
sudo rm -rf /{root,home/ubuntu}/{.cache,.gem} /etc/hadoop/conf/core-site.xml
sudo bash -c 'for foo in /var/log/{messages,debug,udev,lastlog,dpkg.log,bootstrap.log,user.log} ; do echo > $foo ; done'
- If you want to record the AMI version, something like
sudo rm /etc/motd ;
sudo bash -c 'echo "CHIMP CHIMP CHIMP CRUNCH CRUNCH CRUNCH (image burned at `date`)" > /etc/motd' ;
- If you want to, edit the /etc/ssh/sshd_config
Just use the console. MAKE SURE TO STOP, UNMOUNT AND DETACH ALL EBS VOLUMES first.
cd /mnt
. /mnt/aws_private_setup.sh
- Modify the following to suit. Bundle will complain about excludes that are missing, so adjust until it stops bitching.
AMI_EXCLUDES=/ebs1,/ebs2,/mnt,/data,/{root,home/ubuntu}/.ssh/authorized_keys,/etc/ssh/ssh_*key*
AWS_REGION=us-east-1
CLUSTER=gibbon
ami_bucket=s3amis.infinitemonkeys.info/${CLUSTER}-slave-ami-32bit-`date "+%Y%m%d"`
sudo mkdir -p /mnt/`dirname $ami_bucket`
EC2_URL=https://${AWS_REGION}.ec2.amazonaws.com
- This will take a long fucking time. 15 minutes on a small instance. It fucking sucks.
sudo mkdir -p /mnt/$ami_bucket ; time sudo ec2-bundle-vol --exclude=$AMI_EXCLUDES -r i386 -d /mnt/$ami_bucket -k /mnt/pk.pem -c /mnt/cert.pem -u $AWS_ACCOUNT_ID --ec2cert /etc/ec2/amitools/cert-ec2.pem
time ( PATH=/usr/bin:$PATH ; ec2-upload-bundle -b $ami_bucket -m /mnt/$ami_bucket/image.manifest.xml -a $AWS_ACCESS_KEY_ID -s $AWS_SECRET_ACCESS_KEY ) ;
time ec2-register -K /mnt/pk.pem -C /mnt/cert.pem --region $AWS_REGION -n $ami_bucket $ami_bucket/image.manifest.xml
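Once ec2-register hands back an AMI id, you can double-check it (and then point your poolparty.yaml at it):
ec2-describe-images -K /mnt/pk.pem -C /mnt/cert.pem --region $AWS_REGION -o self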
tail -f /var/log/dpkg.log /tmp/user_data-progress.log
- Start the hadoop master node:
cloud-start -n master -c clouds/hadoop_clouds.rb
# ... twiddle thumbs ...
cloud-ssh -n server -c clouds/chef_clouds.rb
- Once the master node starts, try a couple slaves
cloud-start -n slave -c clouds/hadoop_clouds.rb
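Once a couple of slaves have checked in, a quick way to confirm they’ve actually joined the HDFS (run on the master):
sudo -u hadoop hadoop dfsadmin -report   # live datanodes should match your slave count
hadoop fs -ls /                          # and the filesystem should answer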
- Once the cluster works with no EBS volumes, you should try defining EBS volumes for a persistent HDFS (see the EBS sections below).
If you installed from
tail -n200 -f /etc/sv/chef-client/log/main/current
- If you need to kickstart the chef-client, log into the machine as the ubuntu user and
sudo service chef-client stop # so that it doesn't try running while you're experimenting
cd /etc/chef
tail -f /etc/sv/chef-client/log/main/* &
sudo chef-client
# ...
sudo service chef-client start # once you're done
sudo service chef-client stop
tail -f /etc/sv/chef-client/log/main/current &
If the node is confused about its identity — gives you `error!': 401 "Unauthorized" (Net::HTTPServerException) — then you should remove /etc/chef/chef_config.json and /etc/chef/client.pem, then re-run sudo chef-client
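Putting that together, the reset dance looks something like this (paths as given above):
sudo service chef-client stop
sudo rm /etc/chef/chef_config.json /etc/chef/client.pem
sudo chef-client                  # re-registers against the validator
sudo service chef-client start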
tail -n200 -f /etc/sv/chef-*/log/main/current
If you’re having 401 Authentication errors,
- check that broham didn’t make you be some node you don’t want to be
- you can edit the node name in /etc/chef/chef_config.json file directly, and you can overwrite the /etc/chef/validation.pem file — the client script will let those settings override the userdata config.
- Once you’ve checked that, blow away the client.pem file and re-run chef-client. It should authenticate as the node name you set.
Running the following on the chef server can help debug authentication problems:
tail -n100 -f /etc/sv/chef-server*/log/main/current
If the webui doesn’t log you in immediately after launch, try doing sudo service chef-server-webui restart — it occasionally will fail to create the admin user for some reason.
tail -f /var/log/hadoop/hadoop-hadoop-namenode-chef.infochimps.com.log &
sudo service hadoop-0.20-datanode status
# ... and so on ...
sudo service hadoop-0.20-datanode restart
- Logs are in /var/log/hadoop/
To check that cassandra works as it should:
grep ListenAddress /etc/cassandra/storage-conf.xml
irb
# in irb
require 'rubygems' ; require 'cassandra' ; include Cassandra::Constants ;
# plug your ip address into the line below.
twitter = Cassandra.new('Twitter', '10.162.67.95:9160')
user = {'screen_name' => 'buttonscat'} ;
twitter.insert(:Users, '5', user)
twitter.get(:Users, '5')
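If irb can’t connect, first check from the shell that cassandra is up and that thrift is listening on port 9160 (the port used in the example above):
sudo service cassandra status
sudo netstat -tlnp | grep 9160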
If you’re in the us-west region, first run from the shell
export EC2_URL=https://us-west-1.ec2.amazonaws.com
Instance attributes: disable_api_termination and delete_on_termination
To set delete_on_termination to ‘true’ after the fact, run the following
ec2-modify-instance-attribute -v i-0704be6c --block-device-mapping /dev/sda1=vol-XX8d2c80::true
(You’ll have to modify the instance and volume to suit)
If you set disable_api_termination to true, in order to terminate the node run
ec2-modify-instance-attribute -v i-0704be6c --disable-api-termination false
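To double-check what an instance currently has set, something like (same placeholder instance id as above):
ec2-describe-instance-attribute i-0704be6c --disable-api-termination
ec2-describe-instance-attribute i-0704be6c --block-device-mapping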
Dumb shit I did, noted here so maybe someone else won’t have to.
- Make sure you scrub the userdata scripts from a bootstrapped image before preparing an AMI — it’s surprising behavior to see old config files reappear.
- If you’re seeing weird interactions between chef components, make sure you’re pointing at the chef server you think you are.
Tradeoffs of EBS-backed volumes
Be careful of the tradeoffs with EBS-backed volumes.
- good: You can start and stop instances — don’t pay for the compute from the end of that hour until you restart.
- good: It’s way easier to tune up an AMI. (Then again, chef makes much of that unnecessary)
- good: You can make the volume survive even if the node is terminated (spot price is exceeded, machine crashes, etc).
- good: You can make a persistent HDFS without having to fart around attaching EBS volumes at startup. There are performance tradeoffs, though.
- bad: The disk is noticeably slower. Make sure to point tempfiles and scratch space to the local drives; see the check after this list. (The scripts currently handle most but not all of this.)
- bad: The root volume counts against your quota for EBS volumes.
- bad: Starting more than six or so EBS-backed instances can cause AWS to shit a brick allocating all the volumes.
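To check where your scratch space actually points on a worker (property names are the stock hadoop 0.20 ones, config dir as used elsewhere in this doc):
grep -A1 -E 'hadoop.tmp.dir|mapred.local.dir|dfs.data.dir' /etc/hadoop/conf/*-site.xml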
Refer to the standard setups described above.
Information Sharing using SimpleDB
- Make sure you log into the aws console and check that you’re signed up as a SimpleDB user. (You have to click through a license agreement; it should approve you within minutes)
sudo bash -c 'export HOSTNAME=gibbon.infinitemonkeys.info ; PUBLIC_IP=204.236.225.16 ; echo $HOSTNAME > /etc/hostname ; hostname -F /etc/hostname ; sysctl -w kernel.hostname=$HOSTNAME ; sed -i "s/127.0.0.1 *localhost/127.0.0.1 $HOSTNAME `hostname -s` localhost/" /etc/hosts ; if grep -q $PUBLIC_IP /etc/hosts ; then true ; else echo $PUBLIC_IP $HOSTNAME `hostname -s` >> /etc/hosts ; fi'