Apache Spark/en: Difference between revisions
No edit summary |
No edit summary Tags: Mobile edit Mobile web edit |
||
(14 intermediate revisions by 2 users not shown) | |||
Line 3: | Line 3: | ||
= Introduction = | = Introduction = | ||
Apache Spark | Apache Spark is an open source framework for distributed computation initially developed by the AMPLab at Berkeley University and is now a project sponsored by the Apache foundation. Unlike the MapReduce algorithm implemented by Hadoop that uses disk storage, Spark makes use of primitives which are stored in memory, thereby achieving up to 100x the performance of Hadoop in certain applications. Loading data in memory allows them to be queried frequently, making Spark a framework especially appropriate for automated learning and interactive data analysis. | ||
= Usage = | = Usage = | ||
Line 22: | Line 22: | ||
module load spark/2.3.0 | module load spark/2.3.0 | ||
module load python/ | module load python/3.7 | ||
# Recommended settings for calling Intel MKL routines from multi-threaded applications | # Recommended settings for calling Intel MKL routines from multi-threaded applications | ||
Line 29: | Line 29: | ||
export SPARK_IDENT_STRING=$SLURM_JOBID | export SPARK_IDENT_STRING=$SLURM_JOBID | ||
export SPARK_WORKER_DIR=$SLURM_TMPDIR | export SPARK_WORKER_DIR=$SLURM_TMPDIR | ||
export SLURM_SPARK_MEM=$(printf "%.0f" $((${SLURM_MEM_PER_NODE} * | export SLURM_SPARK_MEM=$(printf "%.0f" $((${SLURM_MEM_PER_NODE} *95/100))) | ||
Line 36: | Line 36: | ||
MASTER_URL=$(grep -Po '(?=spark://).*' $SPARK_LOG_DIR/spark-${SPARK_IDENT_STRING}-org.apache.spark.deploy.master*.out) | MASTER_URL=$(grep -Po '(?=spark://).*' $SPARK_LOG_DIR/spark-${SPARK_IDENT_STRING}-org.apache.spark.deploy.master*.out) | ||
NWORKERS=$(( | NWORKERS=$((SLURM_NTASKS_PER_NODE * SLURM_JOB_NUM_NODES - 1)) | ||
SPARK_NO_DAEMONIZE=1 srun -n ${NWORKERS} -N ${NWORKERS} --label --output=$SPARK_LOG_DIR/spark-%j-workers.out start-slave.sh -m ${SLURM_SPARK_MEM}M -c ${SLURM_CPUS_PER_TASK} ${MASTER_URL} & | SPARK_NO_DAEMONIZE=1 srun -n ${NWORKERS} -N ${NWORKERS} --label --output=$SPARK_LOG_DIR/spark-%j-workers.out start-slave.sh -m ${SLURM_SPARK_MEM}M -c ${SLURM_CPUS_PER_TASK} ${MASTER_URL} & | ||
slaves_pid=$! | slaves_pid=$! | ||
Line 46: | Line 46: | ||
stop-master.sh | stop-master.sh | ||
}} | }} | ||
== Java Jars == | == Java Jars == | ||
{{File | {{File | ||
Line 67: | Line 68: | ||
export SPARK_IDENT_STRING=$SLURM_JOBID | export SPARK_IDENT_STRING=$SLURM_JOBID | ||
export SPARK_WORKER_DIR=$SLURM_TMPDIR | export SPARK_WORKER_DIR=$SLURM_TMPDIR | ||
export SLURM_SPARK_MEM=$(printf "%.0f" $((${SLURM_MEM_PER_NODE} * | export SLURM_SPARK_MEM=$(printf "%.0f" $((${SLURM_MEM_PER_NODE} *95/100))) | ||
start-master.sh | start-master.sh | ||
Line 73: | Line 74: | ||
MASTER_URL=$(grep -Po '(?=spark://).*' $SPARK_LOG_DIR/spark-${SPARK_IDENT_STRING}-org.apache.spark.deploy.master*.out) | MASTER_URL=$(grep -Po '(?=spark://).*' $SPARK_LOG_DIR/spark-${SPARK_IDENT_STRING}-org.apache.spark.deploy.master*.out) | ||
NWORKERS=$(( | NWORKERS=$((SLURM_NTASKS_PER_NODE * SLURM_JOB_NUM_NODES - 1)) | ||
SPARK_NO_DAEMONIZE=1 srun -n ${NWORKERS} -N ${NWORKERS} --label --output=$SPARK_LOG_DIR/spark-%j-workers.out start-slave.sh -m ${SLURM_SPARK_MEM}M -c ${SLURM_CPUS_PER_TASK} ${MASTER_URL} & | SPARK_NO_DAEMONIZE=1 srun -n ${NWORKERS} -N ${NWORKERS} --label --output=$SPARK_LOG_DIR/spark-%j-workers.out start-slave.sh -m ${SLURM_SPARK_MEM}M -c ${SLURM_CPUS_PER_TASK} ${MASTER_URL} & | ||
slaves_pid=$! | slaves_pid=$! | ||
SLURM_SPARK_SUBMIT="srun -n 1 -N 1 spark-submit --master ${MASTER_URL} --executor-memory ${SLURM_SPARK_MEM}M" | SLURM_SPARK_SUBMIT="srun -n 1 -N 1 spark-submit --master ${MASTER_URL} --executor-memory ${SLURM_SPARK_MEM}M" | ||
$SLURM_SPARK_SUBMIT --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2. | $SLURM_SPARK_SUBMIT --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.3.0.jar 1000 | ||
$SLURM_SPARK_SUBMIT --class org.apache.spark.examples.SparkLR $SPARK_HOME/examples/jars/spark-examples_2.11-2. | $SLURM_SPARK_SUBMIT --class org.apache.spark.examples.SparkLR $SPARK_HOME/examples/jars/spark-examples_2.11-2.3.0.jar 1000 | ||
kill $slaves_pid | kill $slaves_pid | ||
stop-master.sh | stop-master.sh | ||
}} | }} | ||
= Monitoring = | = Monitoring = | ||
The event log for the Spark application which was executed can be saved for later examination by means of a Web application provided by Spark. The following instructions show how to activate the saving of the event logs and how to start the Web application. | |||
== Configuration == | == Configuration == | ||
First create a directory which will contain the event log of the application: | |||
{{Command| mkdir ~/.spark/<spark version>/eventlog}} | {{Command| mkdir ~/.spark/<spark version>/eventlog}} | ||
If it doesn't already exist, next create a directory which will contain the Spark configuration parameters: | |||
{{Command| mkdir ~/.spark/<spark version>/conf}} | {{Command| mkdir ~/.spark/<spark version>/conf}} | ||
Line 106: | Line 108: | ||
== Visualization == | == Visualization == | ||
Create a [[SSH_tunnelling|tunnel]] between your computer and the cluster. | |||
Load the Spark module: | |||
{{Command|module load spark/2.3.0}} | {{Command|module load spark/2.3.0}} | ||
Start the Web application for visualizing the event log: | |||
{{Command|SPARK_NO_DAEMONIZE{{=}}1 start-history-server.sh | {{Command|SPARK_NO_DAEMONIZE{{=}}1 start-history-server.sh | ||
|result= | |result= | ||
Line 134: | Line 136: | ||
}} | }} | ||
Copy the URL which is shown in the terminal and paste it in your Web browser. | |||
To stop the visualization application, type the combination Ctrl-C in the terminal used to launch the application. |
Latest revision as of 23:04, 29 August 2024
Introduction
Apache Spark is an open source framework for distributed computation initially developed by the AMPLab at Berkeley University and is now a project sponsored by the Apache foundation. Unlike the MapReduce algorithm implemented by Hadoop that uses disk storage, Spark makes use of primitives which are stored in memory, thereby achieving up to 100x the performance of Hadoop in certain applications. Loading data in memory allows them to be queried frequently, making Spark a framework especially appropriate for automated learning and interactive data analysis.
Usage
PySpark
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --time=00:01:00
#SBATCH --nodes=4
#SBATCH --mem=4G
#SBATCH --cpus-per-task=8
#SBATCH --ntasks-per-node=1
module load spark/2.3.0
module load python/3.7
# Recommended settings for calling Intel MKL routines from multi-threaded applications
# https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications
export MKL_NUM_THREADS=1
export SPARK_IDENT_STRING=$SLURM_JOBID
export SPARK_WORKER_DIR=$SLURM_TMPDIR
export SLURM_SPARK_MEM=$(printf "%.0f" $((${SLURM_MEM_PER_NODE} *95/100)))
start-master.sh
sleep 5
MASTER_URL=$(grep -Po '(?=spark://).*' $SPARK_LOG_DIR/spark-${SPARK_IDENT_STRING}-org.apache.spark.deploy.master*.out)
NWORKERS=$((SLURM_NTASKS_PER_NODE * SLURM_JOB_NUM_NODES - 1))
SPARK_NO_DAEMONIZE=1 srun -n ${NWORKERS} -N ${NWORKERS} --label --output=$SPARK_LOG_DIR/spark-%j-workers.out start-slave.sh -m ${SLURM_SPARK_MEM}M -c ${SLURM_CPUS_PER_TASK} ${MASTER_URL} &
slaves_pid=$!
srun -n 1 -N 1 spark-submit --master ${MASTER_URL} --executor-memory ${SLURM_SPARK_MEM}M $SPARK_HOME/examples/src/main/python/pi.py
kill $slaves_pid
stop-master.sh
Java Jars
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --time=00:01:00
#SBATCH --nodes=4
#SBATCH --mem=4G
#SBATCH --cpus-per-task=8
#SBATCH --ntasks-per-node=1
module load spark/2.3.0
# Recommended settings for calling Intel MKL routines from multi-threaded applications
# https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications
export MKL_NUM_THREADS=1
export SPARK_IDENT_STRING=$SLURM_JOBID
export SPARK_WORKER_DIR=$SLURM_TMPDIR
export SLURM_SPARK_MEM=$(printf "%.0f" $((${SLURM_MEM_PER_NODE} *95/100)))
start-master.sh
sleep 5
MASTER_URL=$(grep -Po '(?=spark://).*' $SPARK_LOG_DIR/spark-${SPARK_IDENT_STRING}-org.apache.spark.deploy.master*.out)
NWORKERS=$((SLURM_NTASKS_PER_NODE * SLURM_JOB_NUM_NODES - 1))
SPARK_NO_DAEMONIZE=1 srun -n ${NWORKERS} -N ${NWORKERS} --label --output=$SPARK_LOG_DIR/spark-%j-workers.out start-slave.sh -m ${SLURM_SPARK_MEM}M -c ${SLURM_CPUS_PER_TASK} ${MASTER_URL} &
slaves_pid=$!
SLURM_SPARK_SUBMIT="srun -n 1 -N 1 spark-submit --master ${MASTER_URL} --executor-memory ${SLURM_SPARK_MEM}M"
$SLURM_SPARK_SUBMIT --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.3.0.jar 1000
$SLURM_SPARK_SUBMIT --class org.apache.spark.examples.SparkLR $SPARK_HOME/examples/jars/spark-examples_2.11-2.3.0.jar 1000
kill $slaves_pid
stop-master.sh
Monitoring
The event log for the Spark application which was executed can be saved for later examination by means of a Web application provided by Spark. The following instructions show how to activate the saving of the event logs and how to start the Web application.
Configuration
First create a directory which will contain the event log of the application:
[name@server ~]$ mkdir ~/.spark/<spark version>/eventlog
If it doesn't already exist, next create a directory which will contain the Spark configuration parameters:
[name@server ~]$ mkdir ~/.spark/<spark version>/conf
In this directory, create the following file or add the content shown to the file spark-defaults.conf
if it already exists.
spark.eventLog.enabled true
spark.eventLog.dir /home/<userid>/.spark/<spark version>/eventlog
spark.history.fs.logDirectory /home/<userid>/.spark/<spark version>/eventlog
Visualization
Create a tunnel between your computer and the cluster.
Load the Spark module:
[name@server ~]$ module load spark/2.3.0
Start the Web application for visualizing the event log:
[name@server ~]$ SPARK_NO_DAEMONIZE=1 start-history-server.sh
starting org.apache.spark.deploy.history.HistoryServer, logging to /home/<userid>/.spark/<spark version>/log/spark-<userid>-org.apache.spark.deploy.history.HistoryServer-1-<server>.computecanada.ca.out
Spark Command: /cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/java/1.8.0_121/bin/java -cp /home/<userid>/.spark/<spark version>/conf/:/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/spark/2.2.0/jars/* -Xmx1g org.apache.spark.deploy.history.HistoryServer
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/10/13 04:28:56 INFO HistoryServer: Started daemon with process name: 71616@<server>.computecanada.ca
17/10/13 04:28:56 INFO SignalUtils: Registered signal handler for TERM
17/10/13 04:28:56 INFO SignalUtils: Registered signal handler for HUP
17/10/13 04:28:56 INFO SignalUtils: Registered signal handler for INT
17/10/13 04:28:56 INFO SecurityManager: Changing view acls to: <userid>
17/10/13 04:28:56 INFO SecurityManager: Changing modify acls to: <userid>
17/10/13 04:28:56 INFO SecurityManager: Changing view acls groups to:
17/10/13 04:28:56 INFO SecurityManager: Changing modify acls groups to:
17/10/13 04:28:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(<userid>); groups with view permissions: Set(); users with modify permissions: Set(<userid>); groups with modify permissions: Set()
17/10/13 04:28:56 INFO FsHistoryProvider: History server ui acls disabled; users with admin permissions: ; groups with admin permissions
17/10/13 04:29:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/10/13 04:29:02 INFO FsHistoryProvider: Replaying log path: file:/home/<userid>/.spark/<spark version>/eventlog/app-20171013040359-0000
17/10/13 04:29:02 INFO Utils: Successfully started service on port 18080.
17/10/13 04:29:02 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and started at http://<server ip address>:18080
Copy the URL which is shown in the terminal and paste it in your Web browser.
To stop the visualization application, type the combination Ctrl-C in the terminal used to launch the application.