R

From Alliance Doc
Revision as of 19:51, 21 December 2018 by Rdickson (talk | contribs) (Marked this version for translation)

R is a system for statistical computation and graphics. It consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files.

Even though R was not developed for high performance computing (HPC), its popularity with scientists from a variety of disciplines, including engineering, mathematics, statistics, and bioinformatics, makes it an essential tool on HPC installations dedicated to academic research. Features such as C extensions, byte-compiled code and parallelisation allow for reasonable performance in single-node jobs. Thanks to R’s modular nature, users can customize the R functions available to them by installing packages from the Comprehensive R Archive Network (CRAN) into their home directories.

The R interpreter

You need to begin by loading an R module; there will typically be several versions available and you can see a list of all of them using the command

[name@server ~]$ module spider r

You can load a particular R module using a command like

[name@server ~]$ module load r/3.3.3

For more on this see Using modules.

Now you can start the R interpreter and type R code inside that environment:

[name@server ~]$ R
R version 3.3.3 (2017-03-06) -- "Another Canoe"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> values <- c(3,5,7,9)
> values[1]
[1] 3
> q()

To execute an R script non-interactively, use Rscript with the file containing the R commands as an argument:

[name@server ~]$ Rscript computation.R

Rscript will automatically pass scripting-appropriate options --slave and --no-restore to the R interpreter. These also imply the --no-save option, preventing the creation of useless workspace files on exit.
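One way to check which options Rscript actually passes is to have R print its own command line. The following is a minimal sketch: it assumes Rscript is on the PATH (that is, an R module is loaded) and prints a message otherwise. The variable name RSCRIPT_ARGS is chosen for this example, and the flags reported depend on the R version in use.

```shell
# Print the command line that Rscript hands to the R interpreter.
# commandArgs() is a base R function returning the interpreter's arguments.
if command -v Rscript >/dev/null 2>&1; then
    RSCRIPT_ARGS=$(Rscript -e 'cat(commandArgs(), sep = "\n")')
else
    RSCRIPT_ARGS="Rscript not found; load an R module first"
fi
echo "$RSCRIPT_ARGS"
```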

Note that any calculations lasting more than two or three minutes should not be run on the login node. They should be run via the job scheduler. See Running jobs for more information.

Installing R packages

install.packages()

To install packages from CRAN, you can use install.packages in an interactive R session on the cluster's login node. Many R packages are developed using the GNU family of compilers, so we recommend that you load a gcc module before trying to install any R packages. Use the same version of gcc for all packages you install.

[name@server ~]$ module load gcc/5.4.0

For example, to install the sp package that provides classes and methods for spatial data, use the following command on a login node:

[name@server ~]$ R
[...]
> install.packages("sp")

When asked, select an appropriate mirror for download. Ideally, it will be geographically close to the cluster you're working on.

Some packages require defining the environment variable TMPDIR before installing.
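A minimal sketch of defining TMPDIR before starting R follows. The directory name tmp_R_build is a hypothetical choice made for this example, not something a package or the cluster prescribes; any directory you can write to should serve.

```shell
# Assumption: $HOME/tmp_R_build is a hypothetical directory chosen for this example.
export TMPDIR="$HOME/tmp_R_build"
mkdir -p "$TMPDIR"
echo "Temporary files will go to: $TMPDIR"
# R reads TMPDIR at startup, so an R session started from this shell
# (and any install.packages() call made in it) uses this directory.
```

Start R from the same shell session afterwards so that it inherits the variable.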

Dependencies

Some packages depend on external libraries which are already installed on our clusters. If the library you need is listed at Available software, then load the appropriate module before installing the package that requires it.

For example, the package rgdal requires a library called gdal. Running module spider gdal/2.2.1 shows that it requires nixpkgs and gcc modules. If you took the advice above to load gcc then both these should already be loaded. Verify this by running

[name@server ~]$ module list

If any package fails to install, be sure to read the error message carefully as it might give you details concerning additional modules you need to load. See Using modules for more on the module family of commands.

Downloaded packages

To install a package that you downloaded yourself (i.e. without using install.packages()), use R CMD INSTALL. Assuming the package is named archive_package.tgz, run the following command in a shell:

[name@server ~]$ R CMD INSTALL -l 'path for your local (home) R library' archive_package.tgz
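As a sketch, the placeholder path can be replaced by a directory you create beforehand. The location ~/local/R_libs is an assumption made for illustration (it is the same one used in the Rmpi section), and the guard simply skips the install when R or the archive is not present.

```shell
# Assumption: ~/local/R_libs is a hypothetical library location; use any directory you own.
R_LIBRARY="$HOME/local/R_libs"
mkdir -p "$R_LIBRARY"

if command -v R >/dev/null 2>&1 && [ -f archive_package.tgz ]; then
    # -l selects the library directory the package is installed into.
    R CMD INSTALL -l "$R_LIBRARY" archive_package.tgz
else
    echo "R or archive_package.tgz not found; nothing installed"
fi
```

R only searches this directory if it is listed in the R_LIBS environment variable or passed to library() through its lib.loc argument, so export R_LIBS before starting R.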

Exploiting Parallelism in R

The processors on Compute Canada clusters are quite ordinary. What makes these supercomputers super is that you have access to thousands of CPU cores with a high-performance network. In order to take advantage of this hardware you must run code "in parallel".

The CRAN Task View on High-Performance and Parallel Computing with R describes a bewildering collection of inter-related R packages for parallel computing. In the following subsections we present methods of parallelizing R code that are supported on Compute Canada clusters.

A note on terminology: In much Compute Canada documentation the term 'node' refers to an individual machine, also called a 'host', and a collection of such nodes makes up a 'cluster'. In much R documentation, the term 'node' refers to a worker process and a 'cluster' is a collection of such processes. For example: "Following snow, a pool of worker processes listening via sockets for commands from the master is called a `cluster' of nodes."[1]

Rmpi

Installing

This next procedure installs Rmpi, an interface (wrapper) to MPI routines, which allows R to run in parallel.

1. See the available R modules by running:

module spider r

2. Select the R version and load the required Open MPI module. This example uses Open MPI version 1.10.7 which is needed to spawn processes correctly; the default Open MPI module 2.1.1 has a problem with that at present.

module load r/3.4.0
module load openmpi/1.10.7

3. Download the latest Rmpi version; change the version number to whatever is desired.

wget https://cran.r-project.org/src/contrib/Rmpi_0.6-6.tar.gz

4. Specify the directory where you want to install the package files; you must have write permission for this directory. The directory name can be changed if desired.

mkdir -p ~/local/R_libs/
export R_LIBS=~/local/R_libs/

5. Run the install command.

R CMD INSTALL --configure-args="--with-Rmpi-include=$EBROOTOPENMPI/include   --with-Rmpi-libpath=$EBROOTOPENMPI/lib --with-Rmpi-type='OPENMPI' " Rmpi_0.6-6.tar.gz

Again, carefully read any error message that comes up when packages fail to install and load the required modules to ensure that all your packages are successfully installed.
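A quick way to confirm the result is to look for the package directory in the library chosen in step 4. This is a minimal sketch, assuming the directory name from step 4 was left unchanged; the variable RMPI_STATUS is introduced here only for illustration.

```shell
# Same library location as in step 4.
export R_LIBS="$HOME/local/R_libs/"

# An installed package appears as a directory named after the package.
if [ -d "${R_LIBS}Rmpi" ]; then
    RMPI_STATUS="installed"
else
    RMPI_STATUS="missing"
fi
echo "Rmpi is $RMPI_STATUS in $R_LIBS"
```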

Running

1. Place your R code in a script file; in this case the file is called test.R.


File : test.R

#Tell all slaves to return a message identifying themselves.
library("Rmpi")
sprintf("TEST mpi.universe.size() =  %i", mpi.universe.size())
ns <- mpi.universe.size() - 1
sprintf("TEST attempt to spawn %i slaves", ns)
mpi.spawn.Rslaves(nslaves=ns)
mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
mpi.remote.exec(paste(mpi.comm.get.parent()))
#Send execution commands to the slaves
x<-5
#These would all be pretty correlated one would think
x<-mpi.remote.exec(rnorm,x)
length(x)
x
mpi.close.Rslaves()
mpi.quit()


2. Copy the following content into a job submission script called job.sh:


File : job.sh

#!/bin/bash
#SBATCH --account=def-someacct   # replace this with your own account
#SBATCH --ntasks=5               # number of MPI processes
#SBATCH --mem-per-cpu=2048M      # memory; default unit is megabytes
#SBATCH --time=0-00:15           # time (DD-HH:MM)
module load r/3.4.0
module load openmpi/1.10.7
export R_LIBS=~/local/R_libs/
mpirun -np 1 R CMD BATCH test.R test.txt


3. Submit the job with:

sbatch job.sh

For more on submitting jobs, see the Running jobs page.
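The relationship between --ntasks in job.sh and the number of slaves spawned by test.R can be sketched as simple arithmetic. This assumes that mpi.universe.size() reports the number of Slurm tasks in the allocation, which is not stated above; the fallback value of 5 mirrors job.sh, and the variable names are chosen for this example.

```shell
# Inside a job, Slurm sets SLURM_NTASKS; fall back to 5 (the value in job.sh) elsewhere.
ntasks="${SLURM_NTASKS:-5}"

# "mpirun -np 1" starts a single master R process, which occupies one slot.
masters=1

# test.R spawns slaves into the remaining slots: mpi.universe.size() - 1.
nslaves=$((ntasks - masters))

echo "master processes: $masters, spawned slaves: $nslaves"
```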

doParallel and foreach

Usage

foreach can be considered a unified interface for all backends (doMC, doMPI, doParallel, doRedis, etc.). It works on all platforms, provided that the backend does. doParallel acts as an interface between foreach and the parallel package and can be loaded alone. There are known efficiency issues when using foreach to run a very large number of very small tasks. The example below is therefore not an optimized use of foreach(); its function was deliberately kept minimal for demonstration purposes.

You must register the backend by feeding it the number of cores available. If the backend is not registered, foreach will assume that the number of cores is 1 and will proceed to go through the iterations serially.

The general method to use foreach is:

  1. to load both foreach and the backend package;
  2. to register the backend;
  3. to call foreach(), keeping it on the same line as the %do% (serial) or %dopar% (parallel) operator.

Running

1. Place your R code in a script file; in this case the file is called test_foreach.R.


File : test_foreach.R

# library(foreach) # optional if using doParallel
library(doParallel) # also attaches foreach

# a very simple function
test_func <- function(var1, var2) {
    return(var1*var2)
}

# we will iterate over two sets of values, you can modify this to explore the mechanism of foreach
var1.v = c(1:8)
var2.v = seq(0.1, 1, length.out = 8)

# Use the environment variable SLURM_NTASKS to set the number of cores.
# This is for SLURM. Replace SLURM_NTASKS by the proper variable for your system.
# Avoid manually setting a number of cores.
# Sys.getenv() returns a character string, so convert it to an integer.
ncores = as.integer(Sys.getenv("SLURM_NTASKS"))

registerDoParallel(cores=ncores) # register the backend with that many cores
print(ncores) # the number of cores you requested from the scheduler
getDoParWorkers() # the number of workers actually registered, for comparison

# be careful! foreach() and %dopar% must be on the same line!
foreach(var1=var1.v, .combine=rbind) %:% foreach(var2=var2.v, .combine=rbind) %dopar% {test_func(var1=var1, var2=var2)}


2. Copy the following content into a job submission script called job_foreach.sh:


File : job_foreach.sh

#!/bin/bash
#SBATCH --account=def-someacct   # replace this with your own account
#SBATCH --ntasks=4               # number of processes
#SBATCH --mem-per-cpu=2048M      # memory; default unit is megabytes
#SBATCH --time=0-00:15           # time (DD-HH:MM)
#SBATCH --mail-user=yourname@someplace.com # Send email updates to you or someone else
#SBATCH --mail-type=ALL          # send an email in all cases (job started, job ended, job aborted)

module load r/3.4.3
export R_LIBS=~/local/R_libs/
R CMD BATCH --no-save --no-restore test_foreach.R


3. Submit the job with:

[name@server ~]$ sbatch job_foreach.sh

For more on submitting jobs, see the Running jobs page.
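When test_foreach.R is run outside a Slurm job, for instance in a short interactive test, SLURM_NTASKS is not set and Sys.getenv returns an empty string. The following is a minimal sketch of a wrapper that supplies a default in that case; the default of one core is an assumption chosen for this example, and the guard skips the run when R or the script is absent.

```shell
# Keep the scheduler's value if present, otherwise fall back to 1 core (assumed default).
export SLURM_NTASKS="${SLURM_NTASKS:-1}"
echo "Using $SLURM_NTASKS core(s)"

if command -v R >/dev/null 2>&1 && [ -f test_foreach.R ]; then
    R CMD BATCH --no-save --no-restore test_foreach.R
else
    echo "R or test_foreach.R not found; skipping the run"
fi
```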

doParallel and makeCluster

Usage

You must register the backend by passing it a list of host names in which each node name is repeated once per process desired on that node. For instance, with two nodes (node1 and node2) and two processes on each, we would create a cluster from the host list node1 node1 node2 node2. The PSOCK cluster type runs commands on the nodes through SSH connections.
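The host list for the two-node example can be built mechanically. This is only an illustrative sketch: the variables nodes and procs_per_node are hypothetical names introduced here, and in a real job the list would come from the scheduler rather than being typed in.

```shell
# Hypothetical inputs for this example: two nodes, two processes on each.
nodes="node1 node2"
procs_per_node=2

# Repeat each node name once per process to be started on it.
NODESLIST=""
for node in $nodes; do
    for i in $(seq "$procs_per_node"); do
        NODESLIST="$NODESLIST $node"
    done
done
NODESLIST="${NODESLIST# }"   # drop the leading space

echo "$NODESLIST"   # prints: node1 node1 node2 node2
```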

Running

1. Place your R code in a script file; in this case the file is called test_makecluster.R.

File : test_makecluster.R

library(doParallel)

# Create a vector from the NODESLIST environment variable
nodeslist = unlist(strsplit(Sys.getenv("NODESLIST"), split=" "))

# Create the cluster from the node names. One process is started per occurrence of a node name.
# nodeslist = node1 node1 node2 node2 means we are starting 2 processes on node1, likewise on node2.
cl = makeCluster(nodeslist, type = "PSOCK") 
registerDoParallel(cl)

# Compute (Source : https://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000
ptime <- system.time({
    foreach(icount(trials), .combine=cbind) %dopar%
    {
        ind <- sample(100, 100, replace=TRUE)
        result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
        coefficients(result1)
    }
})
ptime[3]

# Don't forget to release resources
stopCluster(cl)


2. Copy the following content into a job submission script called job_makecluster.sh:

File : job_makecluster.sh

#!/bin/bash
#SBATCH --account=def-someacct  # replace this with your own account
#SBATCH --ntasks=4              # number of processes
#SBATCH --mem-per-cpu=512M      # memory; default unit is megabytes
#SBATCH --time=00:05:00         # time (HH:MM:SS)

module load r/3.5.0

# Export the node names.
# If all processes are allocated on the same node, NODESLIST contains : node1 node1 node1 node1
# Cut the domain name and keep only the node name
export NODESLIST=$(echo $(srun hostname | cut -f 1 -d '.'))
R -f test_makecluster.R


3. Submit the job with:

[name@server ~]$ sbatch job_makecluster.sh

For more information on submitting jobs, see the Running jobs page.
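The NODESLIST line in job_makecluster.sh can be tried without a scheduler by substituting sample output for srun hostname. In this sketch the function fake_srun_hostname and the host names are made up for illustration; only the cut and echo pipeline is taken from the job script.

```shell
# Stand-in for "srun hostname": one fully qualified name per task.
# The host names are invented for this example.
fake_srun_hostname() {
    printf '%s\n' node1.cluster.example node1.cluster.example \
                  node2.cluster.example node2.cluster.example
}

# cut keeps the text before the first dot; the unquoted $(...) passed to echo
# collapses the newlines into single spaces.
NODESLIST=$(echo $(fake_srun_hostname | cut -f 1 -d '.'))

echo "$NODESLIST"   # prints: node1 node1 node2 node2
```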