Apache Spark
What Is Apache Spark?
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.
Credits
Our implementation of Spark on the Kennesaw State University HPC environment is based on code released by the University of Michigan's Advanced Research Computing (ARC) group in their spark-on-hpc repository (https://github.com/umich-arc/spark-on-hpc). We reworked the code to use Torque instead of SLURM.
Getting Started
These instructions provide a step-by-step walkthrough for setting up a stand-alone Spark instance on the KSU HPC environment. The accompanying code is intended for researchers who are already comfortable sizing a stand-alone Spark instance and need to submit batch jobs to it.
Creating Your Environment
Take a look at our Setting Up A Conda Environment page as well as our Using Jupyter Notebooks On The HPC page.
Batch Jobs
To launch a Spark instance on an HPC cluster, you need to copy the example PBS job script batch-job.sh and customize it with your email address, the resources you require for your Spark job, and the location of your Spark code. Then, run the PBS job script with qsub batch-job.sh. As the PBS job runs, it will call the spark-start script, which launches a standalone Spark cluster and submits a Spark batch job with spark-submit. When the Spark job finishes, the PBS job will terminate. Spark driver logs are written to the PBS job's output log.
#!/bin/bash
#PBS -N spark-cluster
#PBS -q batch
#PBS -l nodes=2:ppn=24,walltime=60:00
#PBS -m abe
#PBS -M barney@kennesaw.edu
# These modules are required. You may need to customize the module version
# depending on which cluster you are on.
module load Spark/3.5.0 Anaconda3/2023.07 OpenJDK
# Start the Spark instance.
spark-start
# Clean up PBS_JOBID to be a bit more useful
JOBID=${PBS_JOBID%.*}
# Source spark-env.sh to get useful env variables.
source ${HOME}/scratch/.spark-local/${JOBID}/spark/conf/spark-env.sh
# Customize the executor resources below to match resources requested above
# with an allowance for spark driver overhead. Also change the path to your spark job.
spark-submit --master ${SPARK_MASTER_URL} \
--executor-cores 1 \
--executor-memory 5G \
--total-executor-cores 46 \
/data/Apps/spark/spark-on-hpc/examples/pi.py
# vim:set tabstop=4 shiftwidth=4 expandtab autoindent:
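The bundled pi.py example is not reproduced here, but any ordinary PySpark script can be submitted this way; spark-submit supplies the master URL, so the script itself never hard-codes one. The following is a minimal, hypothetical sketch of such a batch job (a Monte Carlo estimate of pi), not the bundled example:

from operator import add
from random import random

from pyspark.sql import SparkSession

if __name__ == "__main__":
    # spark-submit passes --master, so no master URL is hard-coded here.
    spark = SparkSession.builder.appName("monte-carlo-pi").getOrCreate()

    partitions = 100                 # number of parallel tasks
    samples = 100_000 * partitions   # total random points to draw

    def inside_unit_circle(_):
        # Draw a random point in the unit square and test whether it
        # lands inside the quarter circle of radius 1.
        x, y = random(), random()
        return 1 if x * x + y * y <= 1.0 else 0

    count = (
        spark.sparkContext.parallelize(range(samples), partitions)
        .map(inside_unit_circle)
        .reduce(add)
    )
    print(f"Pi is roughly {4.0 * count / samples}")

    spark.stop()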
Interactive Jobs
To launch a Spark instance on an HPC cluster, you need to copy the example PBS job script interactive-job.sh and customize it with your email address, the resources you require for your Spark job, and the location of your Spark code. Then, run the PBS job script with qsub interactive-job.sh. As the PBS job runs, it will call the spark-start script, which launches a standalone Spark cluster.
At this point, you'll need to load whatever modules are required for your tasks. At a minimum, you'll need to do the following to make sure that you can run spark-submit and/or pyspark:
[barney@hpcprdssh03 ~]$ module load Spark/3.5.0 Anaconda3/2023.07 OpenJDK
You can then use spark-submit to submit jobs interactively to the cluster. Information on the host and port you'll need to use to connect to the Spark cluster instance can be found in a file called spark-cluster-<jobid>.log in your working directory.
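For example, assuming the log file reports a master URL such as spark://nodename:7077 (a placeholder here), an interactive PySpark session can attach to the running cluster like this:

from pyspark.sql import SparkSession

# Replace the master URL below with the spark://host:port value reported
# in your spark-cluster-<jobid>.log file; this one is a placeholder.
spark = (
    SparkSession.builder
    .master("spark://nodename:7077")
    .appName("interactive-session")
    .getOrCreate()
)

# Quick sanity check that work is reaching the executors.
print(spark.sparkContext.parallelize(range(1_000)).sum())

spark.stop()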
When you are finished, you'll need to terminate your job using the qdel command.
#!/bin/bash
#PBS -N spark-cluster
#PBS -q batch
#PBS -l nodes=2:ppn=24,walltime=60:00
#PBS -m abe
#PBS -M barney@kennesaw.edu
# Clean up PBS_JOBID to be a bit more useful
JOBID=${PBS_JOBID%.*}
cd ${PBS_O_WORKDIR}
LOGFILE=${PBS_O_WORKDIR}/spark-cluster-${JOBID}.log
touch ${LOGFILE}
# These modules are required. You may need to customize the module version
# depending on which cluster you are on.
module load Spark/3.5.0 Anaconda3/2023.07 OpenJDK
# Start the Spark instance.
(
spark-start
# Source spark-env.sh to get useful env variables.
source ${HOME}/scratch/.spark-local/${JOBID}/spark/conf/spark-env.sh
echo "***** Spark cluster is running. Submit jobs to ${SPARK_MASTER_URL}. *****"
) | tee ${LOGFILE}
sleep infinity
# vim:set tabstop=4 shiftwidth=4 expandtab autoindent:
Using Spark In A Jupyter Notebook
We've added an option to the ksu-jupyter-notebook script we use to start Jupyter Notebook sessions that also starts a Spark cluster alongside the notebook. You just need to add either -S or --spark to the command, so it looks something like this:
[barney@hpcprdssh03 ~]$ ksu-jupyter-notebook -l nodes=1:ppn=24,walltime=4:00:00 --spark
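Inside the resulting notebook you can then attach to the co-launched cluster. The sketch below assumes the master URL is exposed to the kernel through the SPARK_MASTER_URL environment variable (as spark-env.sh sets it in the batch example above); if that assumption does not hold for your session, copy the URL from the session's log output instead:

import os

from pyspark.sql import SparkSession

# SPARK_MASTER_URL is assumed to be visible to the notebook kernel; fall
# back to local mode so the cell still runs if it is not set.
master_url = os.environ.get("SPARK_MASTER_URL", "local[*]")

spark = (
    SparkSession.builder
    .master(master_url)
    .appName("jupyter-spark")
    .getOrCreate()
)

# Small smoke test: build a DataFrame and show it.
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])
df.show()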