R
Revision as of 00:31, 7 August 2018
R is a system for statistical computation and graphics. It consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files.
Even though R was not developed for high performance computing (HPC), its popularity with scientists from a variety of disciplines, including engineering, mathematics, statistics, bioinformatics, etc. makes it an essential tool on HPC installations dedicated to academic research. Features such as C extensions, byte-compiled code and parallelisation allow for reasonable performance in single-node jobs. Thanks to R’s modular nature, users can customize the R functions available to them by installing packages from the Comprehensive R Archive Network (CRAN) into their home directories.
The R interpreter
You need to begin by loading an R module; there will typically be several versions available and you can see a list of all of them using the command
[name@server ~]$ module spider r
You can load a particular R module using a command like
[name@server ~]$ module load r/3.3.3
You will also likely need to load gcc
[name@server ~]$ module load gcc/5.4.0
You might also have to load various other modules depending on the packages you need to install. For example, "rgdal" will require that you load a module called "gdal", which itself requires that you load nixpkgs and gcc. Nixpkgs should already be loaded by default. You can ensure that it is by running
[name@server ~]$ module list
If nixpkgs is not listed, you can load it by running
[name@server ~]$ module load nixpkgs/16.09
If any package fails to install, be sure to read the error message carefully, as it might give you some details concerning some additional modules you need to load. You can also find out if a module is dependent on any other module by running
[name@server ~]$ module spider gdal/2.2.1
With R and the appropriate modules now loaded in your environment, you can start the R interpreter and type R code inside that environment:
[name@server ~]$ R
R version 3.3.3 (2017-03-06) -- "Another Canoe"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> values <- c(3,5,7,9)
> values[1]
[1] 3
> q()
To execute R scripts, use the Rscript front-end with the file containing the R commands as an argument:
[name@server ~]$ Rscript computation.R
This front-end will automatically pass scripting-appropriate options --slave and --no-restore to the R interpreter. These also imply the --no-save option, preventing the creation of useless workspace files on exit.
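As a minimal sketch of what a script such as computation.R might contain (the contents below are hypothetical and not taken from this page), the following reads an optional command-line argument with commandArgs() and prints a sum:

```r
# Hypothetical contents for computation.R (illustration only).
# Arguments placed after the script name are returned by commandArgs(),
# e.g. `Rscript computation.R 100`.
args <- commandArgs(trailingOnly = TRUE)

# Fall back to 10 when no argument is supplied.
n <- if (length(args) >= 1) as.integer(args[1]) else 10L

values <- seq_len(n)
total <- sum(values)
cat("Sum of 1 to", n, "is", total, "\n")
```

Because Rscript does not save the workspace, anything the script should keep has to be written out explicitly, for example with cat(), write.csv() or saveRDS().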
Installing R packages
To install packages from CRAN, you can use the install.packages facility inside the R interpreter. Many R packages are developed using the GNU family of compilers, so we recommend that you load a gcc module before trying to install any R packages, and that you use the same version of the gcc module for each package:
[name@server ~]$ module load gcc/5.4.0
For example, to install the sp package that provides classes and methods for spatial data, use the following command on a login node:
[name@server ~]$ R
[...]
> install.packages("sp")
When asked, select an appropriate mirror for download. Ideally, it will be geographically close to the cluster you're working on.
Some packages require the environment variable TMPDIR to be defined before installation. Others expect the GNU C and/or C++ compilers rather than the Intel compiler family, which is another reason to load one of the gcc modules before starting R.
To install a package that you downloaded yourself (i.e. not from CRAN), proceed as follows. Assuming the package is named archive_package.tgz, run the following command in a shell:
[name@server ~]$ R CMD INSTALL -l 'path for your local (home) R library' archive_package.tgz
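For non-interactive installs it can be convenient to name the CRAN mirror and the target library explicitly instead of answering prompts. The helper below is a sketch under assumptions: the function name install_to_user_lib, the ~/local/R_libs directory and the cloud.r-project.org mirror are illustrative choices, not site requirements.

```r
# Hypothetical helper: install CRAN packages into a library under $HOME.
install_to_user_lib <- function(pkgs,
                                lib = "~/local/R_libs",
                                repos = "https://cloud.r-project.org") {
  lib <- path.expand(lib)
  # Create the library directory if it does not exist yet.
  dir.create(lib, recursive = TRUE, showWarnings = FALSE)
  # Put it first on the library search path so library() finds the new packages.
  .libPaths(c(lib, .libPaths()))
  install.packages(pkgs, lib = lib, repos = repos)
  invisible(lib)
}

# Example call (needs network access, so run it on a login node):
# install_to_user_lib("sp")
```

Passing lib and repos explicitly avoids both the mirror selection prompt and the question about creating a personal library.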
Exploiting Parallelism in R
Given the nature of Compute Canada clusters, they provide an ideal environment in which to exploit parallelism in your R codes. In this section we present two common methods of parallelizing an R code, both of which are supported on Compute Canada and which we encourage you to consider. Note that the processors on Compute Canada clusters are quite ordinary; what makes these supercomputers super is that you have access to thousands of CPU cores with a high-performance network, so running in parallel is the best way to truly take advantage of the hardware.
Rmpi
Installing
The following procedure installs Rmpi, an interface (wrapper) to MPI routines, which allows R to run in parallel.
1. See the available R modules by running:
module spider r
2. Select the version (here, for example, 3.4.0), and also load the required OpenMPI module:
module load r/3.4.0
module load openmpi/1.10.7
3. Download the latest Rmpi version; change the version number to whatever is desired.
wget https://cran.r-project.org/src/contrib/Rmpi_0.6-6.tar.gz
4. Specify the directory where you want to install the package files; you must have write permission for this directory. The directory name can be changed if desired.
mkdir -p ~/local/R_libs/
export R_LIBS=~/local/R_libs/
5. Run the install command.
R CMD INSTALL --configure-args="--with-Rmpi-include=$EBROOTOPENMPI/include --with-Rmpi-libpath=$EBROOTOPENMPI/lib --with-Rmpi-type='OPENMPI' " Rmpi_0.6-6.tar.gz
This uses OpenMPI version 1.10.7, which is needed to spawn processes correctly (the default OpenMPI module, version 2.1.1, has a problem with that at present). Again, carefully read any error message that appears when a package fails to install, and load the modules it calls for so that all your packages install successfully.
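One way to confirm that the package landed in the library directory chosen in step 4 is to look it up without loading it. This is only a sketch: installed_in is a hypothetical helper name, and find.package() is used so that no MPI code is initialised on the login node.

```r
# Hypothetical helper: TRUE if `pkg` is present in the library directory `lib_dir`.
installed_in <- function(pkg, lib_dir) {
  found <- find.package(pkg, lib.loc = path.expand(lib_dir), quiet = TRUE)
  length(found) > 0
}

# Check the directory used in the installation steps above.
installed_in("Rmpi", "~/local/R_libs")
```

If this returns FALSE, check that R_LIBS pointed to the same directory when R CMD INSTALL was run.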
Running
1. Place your R code in a script file; in this case the file is called test.R.
#Tell all slaves to return a message identifying themselves.
library("Rmpi")
sprintf("TEST mpi.universe.size() = %i", mpi.universe.size())
ns <- mpi.universe.size() - 1
sprintf("TEST attempt to spawn %i slaves", ns)
mpi.spawn.Rslaves(nslaves=ns)
mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
mpi.remote.exec(paste(mpi.comm.get.parent()))
#Send execution commands to the slaves
x<-5
#These would all be pretty correlated one would think
x<-mpi.remote.exec(rnorm,x)
length(x)
x
mpi.close.Rslaves()
mpi.quit()
2. Copy the following content into a job submission script called job.sh:
#!/bin/bash
#SBATCH --account=def-someacct # replace this with your own account
#SBATCH --ntasks=5 # number of MPI processes
#SBATCH --mem-per-cpu=2048M # memory; default unit is megabytes
#SBATCH --time=0-00:15 # time (DD-HH:MM)
module load r/3.4.0
module load openmpi/1.10.7
export R_LIBS=~/local/R_libs/
mpirun -np 1 R CMD BATCH test.R test.txt
3. Submit the job with:
sbatch job.sh
For more on submitting jobs, see the Running jobs page.
doParallel and foreach
Usage
Foreach can be considered as a unified interface for all backends (i.e. doMC, doMPI, doParallel, doRedis, etc.). It works on all platforms, assuming that the backend works. doParallel acts as an interface between foreach and the parallel package and can be loaded alone. There are some known efficiency issues when using foreach to run a very large number of very small tasks. Therefore, keep in mind that the following code is not the best example of an optimized use of the foreach() call but rather that the function chosen was kept at a minimum for demonstration purposes.
You must register the backend by feeding it the number of cores available. If the backend is not registered, foreach will assume that the number of cores is 1 and will proceed to go through the iterations serially.
The general method to use foreach is:
- to load both foreach and the backend package;
- to register the backend;
- to call foreach() by keeping it on the same line as the %do% (serial) or %dopar% operator.
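The three steps above can be sketched as follows, assuming the doParallel package is installed. The worker count of 2 is hard-coded only to keep the demonstration self-contained; in a real job it should come from the scheduler.

```r
library(doParallel)            # step 1: loading doParallel also attaches foreach

registerDoParallel(cores = 2)  # step 2: register the backend (2 workers, demo only)

# step 3: foreach() and %dopar% on the same line
squares <- foreach(i = 1:4, .combine = c) %dopar% { i^2 }
print(squares)

stopImplicitCluster()          # release the workers started by registerDoParallel()
```

Replacing %dopar% with %do% runs the same loop serially, which is a convenient way to debug the loop body before going parallel.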
Running
1. Place your R code in a script file; in this case the file is called test_foreach.R.
# library(foreach) # optional if using doParallel
library(doParallel) # also loads foreach
# a very simple function
test_func <- function(var1, var2) {
return(var1*var2)
}
# we will iterate over two sets of values, you can modify this to explore the mechanism of foreach
var1.v = c(1:8)
var2.v = seq(0.1, 1, length.out = 8)
# Use the environment variable SLURM_NTASKS to set the number of cores.
# This is for SLURM. Replace SLURM_NTASKS by the proper variable for your system.
# Avoid manually setting a number of cores.
ncores = as.integer(Sys.getenv("SLURM_NTASKS")) # Sys.getenv() returns a string, so convert it to an integer
registerDoParallel(cores=ncores) # register the backend with that many workers
print(ncores) # this is how many cores you have requested
getDoParWorkers() # shows the number of parallel workers actually registered; compare with ncores
# be careful! foreach() and %dopar% must be on the same line!
foreach(var1=var1.v, .combine=rbind) %:% foreach(var2=var2.v, .combine=rbind) %dopar% {test_func(var1=var1, var2=var2)}
2. Copy the following content into a job submission script called job_foreach.sh:
#!/bin/bash
#SBATCH --account=def-someacct # replace this with your own account
#SBATCH --ntasks=4 # number of processes
#SBATCH --mem-per-cpu=2048M # memory; default unit is megabytes
#SBATCH --time=0-00:15 # time (DD-HH:MM)
#SBATCH --mail-user=yourname@someplace.com # Send email updates to you or someone else
#SBATCH --mail-type=ALL # send an email in all cases (job started, job ended, job aborted)
module load r/3.4.3
export R_LIBS=~/local/R_libs/
R CMD BATCH --no-save --no-restore test_foreach.R
3. Submit the job with:
[name@server ~]$ sbatch job_foreach.sh
For more on submitting jobs, see the Running jobs page.
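Regarding the efficiency issue with many very small tasks mentioned under Usage, one common workaround is to hand each worker a batch of iterations instead of a single one. The sketch below assumes doParallel is installed. It reads SLURM_NTASKS with a fallback of 1 so that it also runs outside a job, and uses sqrt() as a stand-in for real work.

```r
library(doParallel)  # attaches foreach and parallel as well

# Worker count from the scheduler; fall back to 1 outside a SLURM job.
ncores <- suppressWarnings(as.integer(Sys.getenv("SLURM_NTASKS", unset = "1")))
if (is.na(ncores) || ncores < 1L) ncores <- 1L
registerDoParallel(cores = ncores)

n_tasks <- 1000
# One contiguous block of indices per worker instead of 1000 tiny tasks.
chunks <- parallel::splitIndices(n_tasks, ncores)

results <- foreach(idx = chunks, .combine = c) %dopar% {
  # Each task processes its whole block serially.
  vapply(idx, function(i) sqrt(i), numeric(1))
}

stopImplicitCluster()
length(results)
```

With this layout foreach dispatches only as many tasks as there are workers, so the per-task overhead is paid a handful of times rather than once per iteration.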
doParallel and makeCluster
Running
1. Place your R code in a script file; in this case the file is called test_makecluster.R.
library(doParallel)
# Create an array from the NODESLIST environment variable
nodeslist = unlist(strsplit(Sys.getenv("NODESLIST"), split=" "))
# Create the cluster with the nodes name. One process per count of node name.
# nodeslist = node1 node1 node2 node2, means we are starting 2 processes on node1, likewise on node2.
cl = makeCluster(nodeslist, type = "PSOCK")
registerDoParallel(cl)
# Compute (Source : https://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000
ptime <- system.time({
foreach(icount(trials), .combine=cbind) %dopar%
{
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})
ptime[3]
# Don't forget to release resources
stopCluster(cl)
2. Copy the following content into a job submission script called job_makecluster.sh:
#!/bin/bash
#SBATCH --account=def-someacct # replace this with your own account
#SBATCH --ntasks=4 # number of processes
#SBATCH --mem-per-cpu=512M # memory; default unit is megabytes
#SBATCH --time=00:05:00 # time (HH:MM:SS)
module load r/3.5.0
# Export the nodes names.
# If all processes are allocated on the same node, NODESLIST contains : node1 node1 node1 node1
export NODESLIST=$(echo $(srun hostname))
R -f test_makecluster.R
3. Submit the job with:
[name@server ~]$ sbatch job_makecluster.sh
For more on submitting jobs, see the Running jobs page.
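To see how the NODESLIST string turns into a worker layout, the sketch below splits it on whitespace and counts the entries per host. parse_nodeslist is a hypothetical helper name, and the node names are made-up sample values rather than output from a real job.

```r
# Hypothetical helper: turn "node1 node1 node2" into c("node1", "node1", "node2").
parse_nodeslist <- function(raw = Sys.getenv("NODESLIST")) {
  nodes <- strsplit(trimws(raw), "[[:space:]]+")[[1]]
  nodes[nzchar(nodes)]  # drop empty entries, e.g. when the variable is unset
}

nodes <- parse_nodeslist("node1 node1 node2 node2")
print(table(nodes))  # number of worker processes that would start on each host

# The vector can then be passed on as in the script above:
# cl <- parallel::makeCluster(nodes, type = "PSOCK")
```

Splitting on a whitespace pattern rather than a single space keeps the parsing robust if the variable ends up containing tabs or repeated spaces.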