Apache Spark

What Is Apache Spark?

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

Credits

Our implementation of Spark on the Kennesaw State University HPC environment is based on code released by the University of Michigan's Advanced Research Computing (ARC) group in their spark-on-hpc repository (https://github.com/umich-arc/spark-on-hpc). We reworked the code to use Torque instead of SLURM.

Getting Started

These instructions provide a step-by-step walkthrough for setting up a standalone Spark instance on the KSU HPC environment. The accompanying code is aimed at researchers who are already comfortable sizing a standalone Spark instance and need to submit a batch job to it.

Creating Your Environment

Take a look at our Setting Up A Conda Environment page as well as our Using Jupyter Notebooks On The HPC page.
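
As a minimal sketch, creating and activating a conda environment for PySpark work might look something like the commands below; the environment name and Python version are just placeholders, so follow the pages above for the details:

[barney@hpcprdssh03 ~]$ module load Anaconda3/2023.07
[barney@hpcprdssh03 ~]$ conda create -n spark-env python=3.10
[barney@hpcprdssh03 ~]$ conda activate spark-env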

Batch Jobs

To launch a Spark instance on an HPC cluster, copy the example PBS job script batch-job.sh and customize it with your email address, the resources you require for your Spark job, and the location of your Spark code. Then run the PBS job script with qsub batch-job.sh. As the PBS job runs, it will call the spark-start script to launch a standalone Spark cluster and then submit your Spark batch job with spark-submit. When the Spark job finishes, the PBS job will terminate. Spark driver logs are written to the PBS job's output log.
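
For reference, the whole workflow might look something like the following; the path you copy batch-job.sh from is an assumption (based on where the example pi.py lives), so adjust it for your cluster:

[barney@hpcprdssh03 ~]$ cp /data/Apps/spark/spark-on-hpc/batch-job.sh .
[barney@hpcprdssh03 ~]$ nano batch-job.sh        # set your email, resources, and job path
[barney@hpcprdssh03 ~]$ qsub batch-job.sh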

batch-job.sh
#!/bin/bash
#PBS -N spark-cluster
#PBS -q batch
#PBS -l nodes=2:ppn=24,walltime=60:00
#PBS -m abe
#PBS -M barney@kennesaw.edu

# These modules are required. You may need to customize the module version
# depending on which cluster you are on.
module load Spark/3.5.0 Anaconda3/2023.07 OpenJDK

# Start the Spark instance.
spark-start

# Clean up PBS_JOBID to be a bit more useful
JOBID=${PBS_JOBID%.*}

# Source spark-env.sh to get useful env variables.
source ${HOME}/scratch/.spark-local/${JOBID}/spark/conf/spark-env.sh

# Customize the executor resources below to match the resources requested above
# (here, 2 nodes x 24 cores = 48 cores total), leaving a small allowance for
# Spark driver overhead. Also change the path to point to your own Spark job.
spark-submit --master ${SPARK_MASTER_URL} \
  --executor-cores 1 \
  --executor-memory 5G \
  --total-executor-cores 46 \
  /data/Apps/spark/spark-on-hpc/examples/pi.py

# vim:set tabstop=4 shiftwidth=4 expandtab autoindent:
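
After submitting, you can check on the job and follow the Spark driver output in the PBS output log. The job ID below is a placeholder, and the log file name assumes the default PBS naming of <job name>.o<job ID>:

[barney@hpcprdssh03 ~]$ qstat -u barney
[barney@hpcprdssh03 ~]$ tail -f spark-cluster.o12345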

Interactive Jobs

To launch an interactive Spark instance on an HPC cluster, copy the example PBS job script interactive-job.sh and customize it with your email address and the resources you require for your Spark job. Then run the PBS job script with qsub interactive-job.sh. As the PBS job runs, it will call the spark-start script, which launches a standalone Spark cluster; the job then stays alive, keeping the cluster running until you delete it.

At this point, you'll need to load whatever modules are required for your tasks. At a minimum, you'll need to do the following to make sure that you can run spark-submit and/or pyspark:

[barney@hpcprdssh03 ~]$ module load Spark/3.5.0 Anaconda3/2023.07 OpenJDK

You can then use spark-submit to submit jobs interactively to the cluster. The host and port you'll need to connect to the Spark cluster instance can be found in a file called spark-cluster-<jobid>.log in your working directory.
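
For example, after pulling the master URL out of that log file, an interactive session might look something like this; the job ID, hostname, and port below are placeholders, and the log will also contain startup output from spark-start:

[barney@hpcprdssh03 ~]$ grep "Submit jobs to" spark-cluster-12345.log
***** Spark cluster is running. Submit jobs to spark://hpcnode042:7077. *****
[barney@hpcprdssh03 ~]$ spark-submit --master spark://hpcnode042:7077 /data/Apps/spark/spark-on-hpc/examples/pi.py
[barney@hpcprdssh03 ~]$ pyspark --master spark://hpcnode042:7077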

When you are finished, you'll need to terminate your job using the qdel command.
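
For example, assuming your Spark cluster is running as PBS job 12345 (a placeholder), you would shut it down like so:

[barney@hpcprdssh03 ~]$ qdel 12345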

interactive-job.sh
#!/bin/bash
#PBS -N spark-cluster
#PBS -q batch
#PBS -l nodes=2:ppn=24,walltime=60:00
#PBS -m abe
#PBS -M barney@kennesaw.edu

# Clean up PBS_JOBID to be a bit more useful
JOBID=${PBS_JOBID%.*}

cd ${PBS_O_WORKDIR}
LOGFILE=${PBS_O_WORKDIR}/spark-cluster-${JOBID}.log
touch ${LOGFILE}

# These modules are required. You may need to customize the module version
# depending on which cluster you are on.
module load Spark/3.5.0 Anaconda3/2023.07 OpenJDK

# Start the Spark instance.
(
spark-start

# Source spark-env.sh to get useful env variables.
source ${HOME}/scratch/.spark-local/${JOBID}/spark/conf/spark-env.sh

echo "***** Spark cluster is running. Submit jobs to ${SPARK_MASTER_URL}. *****"
) | tee ${LOGFILE}
sleep infinity

# vim:set tabstop=4 shiftwidth=4 expandtab autoindent:

Using Spark In A Jupyter Notebook

The ksu-jupyter-notebook script we use to start Jupyter Notebook sessions now has an option to also start up a Spark cluster. You just need to add either -S or --spark to the command, so it looks something like this:

[barney@hpcprdssh03 ~]$ ksu-jupyter-notebook -l nodes=1:ppn=24,walltime=4:00:00 --spark