Lighting a Spark

Audience: Anyone who wants to set up a Spark cluster.

Prerequisites: Working knowledge of Docker, docker-compose, and PySpark

Setting up a Spark Standalone Cluster

To set this up we need a few ingredients, namely Java, Spark, and Python for running PySpark applications. We will set up one master node and three workers, although you can scale up or down as you see fit. Both the master and the workers come from the same image; only the entrypoint differs. Your Java installation needs to be 1.8, as that is what Spark 2.4.x runs on. For my Dockerfile I start off with a base Alpine Java 8 image. While this works fine for standalone mode, you might prefer Ubuntu, as I have run into some issues (none that cannot be fixed) when trying to set up a YARN cluster from the same image. So, in your project directory, create a Dockerfile and add the following lines to it:

FROM openjdk:8-alpine

USER root

# wget, tar, bash for getting spark and hadoop
RUN apk --update add wget tar bash

This uses the Alpine image as our base, sets root as the user, and installs the packages needed for the next step: downloading the Spark tarball and unpacking it into a directory. I am using /spark as my Spark folder. I then add it to my PATH variable and also declare SPARK_HOME as an environment variable that points to /spark. The last step isn't strictly necessary from what I have seen so far, as Spark is smart enough to find the path even if the variable isn't declared, but it doesn't hurt.

# download Spark 2.4.5 (pre-built for Hadoop 2.7), unpack it to /spark and clean up
RUN wget https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz && \
    tar -xzf spark-2.4.5-bin-hadoop2.7.tgz && \
    mv spark-2.4.5-bin-hadoop2.7 /spark && \
    rm spark-2.4.5-bin-hadoop2.7.tgz

# add to PATH
ENV PATH $PATH:/spark/bin:/spark/sbin

ENV SPARK_HOME /spark
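
At this point the image already has a working Spark installation, and it doesn't hurt to sanity-check it before moving on. A quick way to do that (the spark-base tag here is just an illustrative name, pick whatever you like):

docker build -t spark-base .

# should print the Spark 2.4.5 version banner
docker run --rm spark-base spark-submit --version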

Now we are set up with both Java and Spark in our image. But in order to use PySpark we also need Python, which the following piece of code installs.

# Install components for Python
RUN apk add --no-cache --update \
    git \
    libffi-dev \
    openssl-dev \
    zlib-dev \
    bzip2-dev \
    readline-dev \
    sqlite-dev \
    musl \
    libc6-compat \
    linux-headers \
    build-base \
    procps 

# Set Python version
ARG PYTHON_VERSION='3.7.6'
# Set pyenv home
ARG PYENV_HOME=/root/.pyenv

# Install pyenv, then install python version
RUN git clone --depth 1 https://github.com/pyenv/pyenv.git $PYENV_HOME && \
    rm -rfv $PYENV_HOME/.git

ENV PATH $PYENV_HOME/shims:$PYENV_HOME/bin:$PATH

RUN pyenv install $PYTHON_VERSION
RUN pyenv global $PYTHON_VERSION
RUN pip install --upgrade pip && pyenv rehash

# Clean
RUN rm -rf ~/.cache/pip

What we do in the lines above is first install all the packages needed to build Python: git for cloning the pyenv repository, zlib-dev and bzip2-dev as compression libraries required for the build, and so on. I landed on this list through some trial and error during installation. With pyenv in place, we install Python 3.7.6, make it the global interpreter, and upgrade pip.
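
If you want to confirm that the pyenv-managed interpreter is the one on the PATH, you can rebuild and check from outside the container (again, spark-base is just an illustrative tag):

docker build -t spark-base .

# expect "Python 3.7.6"
docker run --rm spark-base python --version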

The last step in the Dockerfile is adding the shell scripts that will act as entrypoints for the containers. The entrypoint is just the command that starts the Spark master or worker node as a daemon. So let's first create another directory called scripts in our project directory and add two shell files under it, namely run_master.sh and run_worker.sh.

run_master.sh has the command to start the Spark master node.
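
As a reference point, here is a minimal sketch of what run_master.sh could look like, assuming the default Spark ports and the /spark install path from the Dockerfile; the exact flags you use may differ. The log tail at the end simply keeps the container in the foreground after the daemon starts.

#!/bin/bash

# start the Spark master as a daemon on the default ports,
# then tail its log so the container keeps running
/spark/sbin/start-master.sh --port 7077 --webui-port 8080 && \
    tail -f /spark/logs/spark-*org.apache.spark.deploy.master.Master*.out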