Lighting a Spark
Audience: Anyone who wants to set up a Spark cluster.
Prerequisites: Knowledge of Docker, docker-compose, and PySpark
Setting up a Spark Standalone Cluster
In order to set this up we need a few ingredients, namely Java, Spark, and Python for running PySpark applications. We will set up one master node and three workers, although you can scale up or down as you see fit. Both the master and the workers are built from the same image; only the entrypoint differs. Your Java installation needs to be 1.8, as that is what Spark 2.4.x runs on. I start my Dockerfile from a base Alpine Java 8 image. While this is fine for standalone mode, you might prefer Ubuntu, as I have run into some issues (none that cannot be fixed) when trying to set up a YARN cluster from the same image. So, in your project directory, create a Dockerfile and add the following lines to it:
FROM openjdk:8-alpine

USER root

# wget, tar, bash for getting spark and hadoop
RUN apk --update add wget tar bash
This first uses the Alpine image as our base, makes root the user, and installs the packages needed for what's coming next: downloading the Spark tarball and unpacking it into a directory. I am using /spark as my Spark folder. I then add this to my PATH variable and also declare SPARK_HOME as another environment variable that points to /spark. From what I have seen so far, the last step isn't strictly necessary, as Spark is smart enough to find the path even if the variable isn't declared, but it doesn't hurt.
# download the Spark 2.4.5 release built for Hadoop 2.7
# (the Apache archive URL is one option) and unpack it into /spark
RUN wget https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
RUN tar -xzf spark-2.4.5-bin-hadoop2.7.tgz && \
    mv spark-2.4.5-bin-hadoop2.7 /spark && \
    rm spark-2.4.5-bin-hadoop2.7.tgz

# add to PATH
ENV PATH $PATH:/spark/bin:/spark/sbin
ENV SPARK_HOME /spark
Now we are set up with both Java and Spark on our system. But in order to use PySpark we need to install Python, which the following piece of code accomplishes.
# Install components for Python
RUN apk add --no-cache --update \
    git \
    libffi-dev \
    openssl-dev \
    zlib-dev \
    bzip2-dev \
    readline-dev \
    sqlite-dev \
    musl \
    libc6-compat \
    linux-headers \
    build-base \
    procps

# Set Python version
ARG PYTHON_VERSION='3.7.6'

# Set pyenv home
ARG PYENV_HOME=/root/.pyenv

# Install pyenv, then install python version
RUN git clone --depth 1 https://github.com/pyenv/pyenv.git $PYENV_HOME && \
    rm -rfv $PYENV_HOME/.git
ENV PATH $PYENV_HOME/shims:$PYENV_HOME/bin:$PATH

RUN pyenv install $PYTHON_VERSION
RUN pyenv global $PYTHON_VERSION
RUN pip install --upgrade pip && pyenv rehash

# Clean
RUN rm -rf ~/.cache/pip
What we do in the lines above is first install all the packages required to build Python: git for cloning the pyenv repository, zlib-dev and bzip2-dev as compression libraries needed for the build, and so on. I landed on this list through some trial and error during installation.
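If you want to sanity-check things at this point, you can already build the image and confirm that both Spark and Python are wired up correctly (the spark-base tag here is just a placeholder name):

docker build -t spark-base .
docker run --rm spark-base spark-submit --version
docker run --rm spark-base python --version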
The last step in the Dockerfile is embedding the shell scripts that will act as entrypoints for the Docker containers. The entrypoint is just the command that starts the Spark master or worker node as a daemon. So let's first create another directory called scripts in our project directory and add two shell scripts under it, namely run_master.sh and run_worker.sh.
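As a rough sketch, the end of the Dockerfile can then copy these scripts into the image and make them executable (the /scripts destination is just the location I am assuming here):

# copy the entrypoint scripts into the image and make them executable
COPY scripts /scripts
RUN chmod +x /scripts/run_master.sh /scripts/run_worker.sh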
run_master.sh holds the command that starts the Spark master node.
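As a minimal sketch of what run_master.sh could look like, assuming the master daemon is started and its log is then tailed so the container has a foreground process and does not exit (the log path follows from SPARK_HOME=/spark):

#!/bin/bash
# start the Spark master daemon
/spark/sbin/start-master.sh
# tail the master log to keep the container running
tail -f /spark/logs/*.out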