Apache Spark: Difference between revisions

no edit summary
(added line numbers to file display)
No edit summary
Line 4: Line 4:


Apache Spark est une framework de calcul distribuée open source initialement développé par l'AMPLab de l'Université Berkeley, et maintenant un projet de la fondation Apache. Contrairement à l'algorithme MapReduce implémenté par Hadoop qui utilise le stockage sur disque, Spark utilise des primitives conservées en mémoire lui permettant d'atteindre des performances jusqu'à 100 fois plus rapide pour certaines applications. Le chargement des données en mémoire permet de les interroger fréquemment ce qui fait de Spark une framework particulièrement approprié pour l'apprentissage automatique et l'analyse de données interactive.
Apache Spark est une framework de calcul distribuée open source initialement développé par l'AMPLab de l'Université Berkeley, et maintenant un projet de la fondation Apache. Contrairement à l'algorithme MapReduce implémenté par Hadoop qui utilise le stockage sur disque, Spark utilise des primitives conservées en mémoire lui permettant d'atteindre des performances jusqu'à 100 fois plus rapide pour certaines applications. Le chargement des données en mémoire permet de les interroger fréquemment ce qui fait de Spark une framework particulièrement approprié pour l'apprentissage automatique et l'analyse de données interactive.
= Configuration =
<code>$HOME/.spark/<version>/conf</code>
<code>export MKL_NUM_THREADS=1</code>
[https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications Recommended settings for calling Intel MKL routines from multi-threaded applications]


= Utilisation =
= Utilisation =
Line 29: Line 21:
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-node=1


module load spark/2.2.0
module load spark/2.3.0
module load python/2.7.13
module load python/2.7.13


# Recommended settings for calling Intel MKL routines from multi-threaded applications
# https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications
export MKL_NUM_THREADS=1
export SPARK_IDENT_STRING=$SLURM_JOBID
export SPARK_IDENT_STRING=$SLURM_JOBID
export SPARK_WORKER_DIR=$SLURM_TMPDIR
export SPARK_WORKER_DIR=$SLURM_TMPDIR
export SLURM_SPARK_MEM=$(printf "%.0f" $((${SLURM_MEM_PER_NODE} *0.95)))


start-master.sh
start-master.sh
sleep 1
sleep 5
MASTER_URL=$(grep -Po '(?=spark://).*' $SPARK_LOG_DIR/spark-${SPARK_IDENT_STRING}-org.apache.spark.deploy.master*.out)
MASTER_URL=$(grep -Po '(?=spark://).*' $SPARK_LOG_DIR/spark-${SPARK_IDENT_STRING}-org.apache.spark.deploy.master*.out)


NWORKERS=$((SLURM_NTASKS - 1))
NWORKERS=$((SLURM_NTASKS - 1))
SPARK_NO_DAEMONIZE=1 srun -n ${NWORKERS} -N ${NWORKERS} --label --output=$SPARK_LOG_DIR/spark-%j-workers.out start-slave.sh -m ${SLURM_MEM_PER_NODE}M -c ${SLURM_CPUS_PER_TASK} ${MASTER_URL} &
SPARK_NO_DAEMONIZE=1 srun -n ${NWORKERS} -N ${NWORKERS} --label --output=$SPARK_LOG_DIR/spark-%j-workers.out start-slave.sh -m ${SLURM_SPARK_MEM}M -c ${SLURM_CPUS_PER_TASK} ${MASTER_URL} &
slaves_pid=$!
slaves_pid=$!


srun -n 1 -N 1 spark-submit --master ${MASTER_URL} --executor-memory ${SLURM_MEM_PER_NODE}M $SPARK_HOME/examples/src/main/python/pi.py
 
srun -n 1 -N 1 spark-submit --master ${MASTER_URL} --executor-memory ${SLURM_SPARK_MEM}M $SPARK_HOME/examples/src/main/python/pi.py


kill $slaves_pid
kill $slaves_pid
stop-master.sh
stop-master.sh
}}
}}
== Notes on pyspark_submit.sh ==
1. If you are encountering error <code><span style="color:#ff0000">Error: Cannot load main class from JAR file</span></code> try replacing the <code>--executor-memory ${SLURM_MEM_PER_NODE}M</code> (Line no. 23) with <code> --executor-memory=<size>M </code>
2. If you are encountering error <code><span style="color:#ff0000">Error: Master must either be yarn or start with spark, mesos, local</span></code> increase the sleep time. Replace <code>sleep 1</code> (Line no. 16) with <code>sleep 5</code>


== Java Jars  ==
== Java Jars  ==
Line 67: Line 61:
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-node=1


module load spark/2.2.0
module load spark/2.3.0


# Recommended settings for calling Intel MKL routines from multi-threaded applications
# https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications
export MKL_NUM_THREADS=1
export SPARK_IDENT_STRING=$SLURM_JOBID
export SPARK_IDENT_STRING=$SLURM_JOBID
export SPARK_WORKER_DIR=$SLURM_TMPDIR
export SPARK_WORKER_DIR=$SLURM_TMPDIR
export SLURM_SPARK_MEM=$(printf "%.0f" $((${SLURM_MEM_PER_NODE} *0.95)))


start-master.sh
start-master.sh
sleep 1
sleep 5
MASTER_URL=$(grep -Po '(?=spark://).*' $SPARK_LOG_DIR/spark-${SPARK_IDENT_STRING}-org.apache.spark.deploy.master*.out)
MASTER_URL=$(grep -Po '(?=spark://).*' $SPARK_LOG_DIR/spark-${SPARK_IDENT_STRING}-org.apache.spark.deploy.master*.out)


NWORKERS=$((SLURM_NTASKS - 1))
NWORKERS=$((SLURM_NTASKS - 1))
SPARK_NO_DAEMONIZE=1 srun -n ${NWORKERS} -N ${NWORKERS} --label --output=$SPARK_LOG_DIR/spark-%j-workers.out start-slave.sh -m ${SLURM_MEM_PER_NODE}M -c ${SLURM_CPUS_PER_TASK} ${MASTER_URL} &
SPARK_NO_DAEMONIZE=1 srun -n ${NWORKERS} -N ${NWORKERS} --label --output=$SPARK_LOG_DIR/spark-%j-workers.out start-slave.sh -m ${SLURM_SPARK_MEM}M -c ${SLURM_CPUS_PER_TASK} ${MASTER_URL} &
slaves_pid=$!
slaves_pid=$!


SLURM_SPARK_SUBMIT="srun -n 1 -N 1 spark-submit --master ${MASTER_URL} --executor-memory ${SLURM_MEM_PER_NODE}M"
SLURM_SPARK_SUBMIT="srun -n 1 -N 1 spark-submit --master ${MASTER_URL} --executor-memory ${SLURM_SPARK_MEM}M"
$SLURM_SPARK_SUBMIT --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.2.0.jar 1000
$SLURM_SPARK_SUBMIT --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.2.0.jar 1000
$SLURM_SPARK_SUBMIT --class org.apache.spark.examples.SparkLR $SPARK_HOME/examples/jars/spark-examples_2.11-2.2.0.jar 1000
$SLURM_SPARK_SUBMIT --class org.apache.spark.examples.SparkLR $SPARK_HOME/examples/jars/spark-examples_2.11-2.2.0.jar 1000
Line 111: Line 109:
== Visualisation ==
== Visualisation ==


Créer un [tunnel] entre votre ordinateur et la grappe de calcul.
Créer un [[SSH_tunnelling/fr|tunnel]] entre votre ordinateur et la grappe de calcul.


Charger le module Spark :
Charger le module Spark :
{{Command|module load spark/2.2.0}}
{{Command|module load spark/2.3.0}}


Lancer l'application web de visualisation des journaux :
Lancer l'application web de visualisation des journaux :
cc_staff
505

edits