Apache Spark

This article is a draft

This is not a complete article: it is a draft, a work in progress intended to be developed into a full article. It may or may not be ready for inclusion in the main wiki and should not necessarily be considered factual or authoritative.

Introduction

Apache Spark is an open-source distributed computing framework originally developed by the AMPLab at the University of California, Berkeley, and now a project of the Apache Software Foundation. Unlike the MapReduce model implemented by Hadoop, which writes intermediate results to disk, Spark works with primitives held in memory, which can make some applications up to 100 times faster. Because data can be loaded into memory and queried repeatedly, Spark is particularly well suited to machine learning and interactive data analysis.
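
For a quick interactive look at Spark, a minimal sketch (assuming an interactive Slurm allocation and the same module versions as the batch example below) is to request resources with salloc, load the modules, and start the PySpark shell, which runs in local mode and provides a ready-made SparkContext named sc:

salloc --account=def-someuser --time=01:00:00 --cpus-per-task=8 --mem=4G
module load spark/2.2.0 python/2.7.13
pyspark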

Configuration

Spark's per-user configuration files are kept in the following directory, where <version> is the version of the spark module in use:

$HOME/.spark/<version>/conf

To keep the Intel MKL library from starting its own threads inside each Spark task, set:

export MKL_NUM_THREADS=1
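
As a minimal sketch (assuming the spark module points SPARK_CONF_DIR at the directory above, from which Spark sources spark-env.sh), the MKL setting can be placed in a spark-env.sh file there so that every Spark daemon and driver picks it up:

mkdir -p $HOME/.spark/2.2.0/conf
echo 'export MKL_NUM_THREADS=1' >> $HOME/.spark/2.2.0/conf/spark-env.sh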

Usage

File : pyspark_submit.sh

#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --time=00:01:00
#SBATCH --nodes=4
#SBATCH --mem=4G
#SBATCH --cpus-per-task=8
#SBATCH --ntasks-per-node=1

# Load Spark and the Python version it will use
module load spark/2.2.0
module load python/2.7.13

# Name Spark's log and pid files after the Slurm job and keep worker scratch files on node-local storage
export SPARK_IDENT_STRING=$SLURM_JOBID
export SPARK_WORKER_DIR=$SLURM_TMPDIR

# Start the Spark master on this node, give it a moment to initialize, then read its spark:// URL back from its log
start-master.sh
sleep 1
MASTER_URL=$(grep -Po '(?=spark://).*' $SPARK_LOG_DIR/spark-${SPARK_IDENT_STRING}-org.apache.spark.deploy.master*.out)

# Launch the workers in the foreground (SPARK_NO_DAEMONIZE) as a backgrounded srun step, one task per worker,
# leaving one task slot free for the spark-submit step below
NWORKERS=$((SLURM_NTASKS - 1))
SPARK_NO_DAEMONIZE=1 srun -n ${NWORKERS} -N ${NWORKERS} --label --output=$SPARK_LOG_DIR/spark-%j-workers.out start-slave.sh -m ${SLURM_MEM_PER_NODE}M -c ${SLURM_CPUS_PER_TASK} ${MASTER_URL} &
slaves_pid=$!

# Run spark-submit as its own job step and submit two of the bundled Scala examples against the master
SLURM_SPARK_SUBMIT="srun -n 1 -N 1 spark-submit --master ${MASTER_URL} --executor-memory ${SLURM_MEM_PER_NODE}M"
$SLURM_SPARK_SUBMIT --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.2.0.jar 1000
$SLURM_SPARK_SUBMIT --class org.apache.spark.examples.SparkLR $SPARK_HOME/examples/jars/spark-examples_2.11-2.2.0.jar 1000

# Shut the cluster down: stop the workers (the backgrounded srun step), then the master
kill $slaves_pid
stop-master.sh
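
A minimal usage sketch: the script is submitted like any other batch job, and the output of the examples then appears in the usual Slurm output file (slurm-<jobid>.out by default):

sbatch pyspark_submit.sh

The same $SLURM_SPARK_SUBMIT wrapper also accepts a Python application; for example, the pi.py script shipped with Spark could be submitted with:

$SLURM_SPARK_SUBMIT $SPARK_HOME/examples/src/main/python/pi.py 1000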