A ready to go Big Data cluster (Hadoop + Hadoop Streaming + Spark + PySpark) with Docker and Docker Swarm!
- Why?
- Features
- Running toy cluster
- Running a real cluster in Docker Swarm
- Usage
- Going further
- Frequent problems
- Contributing
Although today you can find several repositories ready to deploy a Spark or Hadoop cluster, they all run into the same problem: they do not work when deployed on Docker Swarm due to several issues ranging from the definition of the worker nodes to connection problems with Docker network interfaces.
This repository seeks to solve the problem by offering a functional alternative, both a toy cluster to deploy on a single machine, as well as a real cluster that works on multiple nodes that conform a Docker Swarm cluster.
This repository is inspired by and uses several scripts taken from Rubenafo's repo and Sdesilva26's repo, however there are several changes introduced; the API is simpler, there is more documentation about usage and some extra features:
- ✅ Ready to deploy in a Docker Swarm cluster: all the networking and port configuration issues have been fixed so you can scale your cluster to as many worker nodes as you need.
- ⚡️ Hadoop, HDFS, Spark, Scala and PySpark ready to use: all the tools are available inside the container globally so you don't have to fight with environment variables and executable paths.
- 🌟 New technology: our image offers Hadoop 3.4.0, Spark 3.5.4 and Python 3.12.6
- ⚙️ Less configuration: we have removed some settings to keep the minimum possible configuration, this way you prevent errors, unexpected behaviors and get the freedom to set parameters via environment variables and have an agile development that does not require rebuilding the Docker image.
- 🐍 Python dependencies: we include the most used Python dependencies like Pandas, Numpy and Scipy to be able to work on datasets and perform mathematical operations (you can remove them if you don't need them!)
You have two ways to run a cluster on a single machine:
- Use
toy-cluster.sh
script... - Or use
docker-compose.yml
file
The script has the following commands:
- deploy: create a new Docker network, containers (a master and 3 workers) and start these last
- start: start the existing containers
- stop: stop the running containers
- remove: remove all the created containers
- info:: useful URLs
So, if you want to try your new cluster run ./toy-cluster.sh deploy
to create a network, containers and format namenode HDFS (note that this script will start the containers too). To stop, start again or remove just run the stop
, start
or remove
respectively.
Use ./toy-cluster.sh info
to see the URLs to check Hadoop and Spark clusters status.
The docker-compose.yml
file has the same structure than toy-cluster.sh
script except for the use of volumes to preserve HDFS data.
Only for the first time, you need to format the namenode information directory. Do not execute this command when you are in production with valid data stored as you will lose all your data stored in the HDFS:
docker container run --rm -v hdfs_master_data_swarm:/home/hadoop/data/nameNode jwaresolutions/big-data-cluster:<tag> /usr/local/hadoop/bin/hadoop namenode -format
Then you can manage your toy cluster with the following commands:
- To start the cluster run:
docker-compose up -d
- To stop the cluster run:
docker-compose down
Important: the use of ./toy-cluster.sh info
works with this! So you can get the useful cluster URLs.
Here is the important stuff, there are some minors steps to do to make it work: first of all you need a Docker Swarm cluster:
- Start the cluster in your master node:
docker swarm init
. - Generate a token for the workers to be added (official doc):
docker swarm join-token worker
. It will print on screen a token in a command that must be executed in all the workers to be added. - Run the command generated in the previous step in all workers node:
docker swarm join: --token <token generated> <HOST>:<PORT>
You have your Docker Swarm cluster! Now you have to label all the nodes to indicate which one will be the master and workers. On master node run:
- List all cluster nodes to get their ID:
docker node ls
- Label master node as master:
docker node update --label-add role=master <MASTER NODE ID>
- For every worker ID node run:
docker node update --label-add role=worker <WORKER NODE ID>
Create needed network and volumes:
docker network create -d overlay cluster_net_swarm
docker volume create --name=hdfs_master_data_swarm
docker volume create --name=hdfs_master_checkpoint_data_swarm
docker volume create --name=hdfs_worker_data_swarm
Now it is time to select a tag of the Docker image. The default is latest but it is not recommended to use it in production. After choose one, set it version on docker-compose_cluster.yml
and the command below.
Only for the first time, you need to format the namenode information directory in Master and Workers nodes. Do not execute this command when you are in production with valid data stored as you will lose all your data stored in the HDFS:
docker container run --rm -v hdfs_master_data_swarm:/home/hadoop/data/nameNode jwaresolutions/big-data-cluster:<tag> /usr/local/hadoop/bin/hadoop namenode -format
Now you are ready to deploy your production cluster!
docker stack deploy -c docker-compose_cluster.yml big-data-cluster
Finally you can use your cluster! Like the toy cluster, you have available some useful URLs:
- <MASTER IP>:8080 -> Spark panel
- <MASTER IP>:18080 -> Spark applications logs
- <MASTER IP>:9870 -> HDFS panel
Enter the master node:
docker container exec -it <MASTER CONTAINER ID> bash
You can store files in the Hadoop Distributed File System:
echo "test" > test.txt
hdfs dfs -copyFromLocal ./test.txt /test.txt
If you check in a worker node that the file is visible in the entire cluster:
hdfs dfs -ls /
-
You can initiate a PySpark console:
pyspark --master spark://master-node:7077
- Now, for example, read a file and count lines:
lines = sc.textFile('hdfs://master-node:9000/test.txt') lines_count = lines.count() print(f'Line count -> {lines_count}')
-
Or you can submit an script:
- Make the script:
from pyspark import SparkContext import random NUM_SAMPLES = 1000 sc = SparkContext("spark://master-node:7077", "Pi Estimation") def inside(p): x, y = random.random(), random.random() return x*x + y*y < 1 count = sc.parallelize(range(0, NUM_SAMPLES)) \ .filter(inside).count() print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
- Submit it:
spark-submit your-script.py
Adding workers to cluster is easy:
- Add a worker to your Swarm cluster as explained in Running a real cluster in Docker Swarm and label it with
role=worker
. - Increment the number of replicas in
docker-compose_cluster.yml
forworker
service. - Deploy the stack again with
docker stack deploy -c docker-compose_cluster.yml big-data-cluster
(restart is no required).
In both docker-compose.yml
(toy cluster) and docker-compose_cluster.yml
(real cluster) there is a commented line in volumes
section. Just uncomment it and set the the file/folder in host file and the destination inside master node in cluster! For more information read official documentation about volumes
setting in Docker Compose.
- Add the dependency to the
requirements.txt
file. - Build the image again.
To check Spark stderr
and stdout
files you can run bash
inside the Worker container and then run the following commands:
- stderr:
cat /sbin/spark-3.5.2-bin-without-hadoop/work/<app id>/<partition id>/stderr
- stdout:
cat /sbin/spark-3.5.2-bin-without-hadoop/work/<app id>/<partition id>/stdout
Sometimes it throws a Connection refused error when run a HDFS command or try to access to DFS from Hadoop/Spark. There is official documentation about this problem. The solution that worked for this repository was running the commands listed in this Stack Overflow answer. That is why you need to format the namenode directory the first time you are deploying the real cluster (see Running a real cluster in Docker Swarm).
This problem means that Namenode is now running in master node, is associated with Connection refused for HDFS problem and has the same solution. Once Namenode is running the port should be working correctly.
If there are nodes that are not listed as active in the HDFS panel you may also need to run the nanemode directory formatting command on the Workers nodes, not just the Driver. See Running a real cluster in Docker Swarm to get the command.
Any kind of help is welcome and appreciated! If you find a bug please submit an issue or make a PR:
- Fork this repo.
- Create a branch where you will develop some changes.
- Make a PR.
There are some TODOs to complete:
- Find a way to prevent Connection refused error to avoid format the namenode information directory
- Add examples for Hadoop
- Add examples for Hadoop Streaming
- Add examples for Spark Streaming