R: Difference between revisions
No edit summary |
(Marked this version for translation) |
||
(132 intermediate revisions by 14 users not shown) | |||
Line 1: | Line 1: | ||
<languages/> | <languages/> | ||
[[Category:Software]] | |||
<translate> | |||
< | <!--T:1--> | ||
R is a system for statistical computation and graphics. It consists of a language plus a runtime environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files. | |||
<!--T:2--> | |||
Even though R was not developed for high-performance computing (HPC), its popularity with scientists from a variety of disciplines, including engineering, mathematics, statistics, bioinformatics, etc. makes it an essential tool on HPC installations dedicated to academic research. Features such as C extensions, byte-compiled code and parallelization allow for reasonable performance in single-node jobs. Thanks to R’s modular nature, users can customize the R functions available to them by installing packages from the Comprehensive R Archive Network ([https://cran.r-project.org/ CRAN]) into their home directories. | |||
<!--T:83--> | |||
User Julie Fortin has written a blog post, [https://medium.com/the-nature-of-food/how-to-run-your-r-script-with-compute-canada-c325c0ab2973 "How to run your R script with Compute Canada"] which you might find useful. | |||
== The R interpreter == | == The R interpreter == <!--T:3--> | ||
You need to begin by loading an R module; there will typically be several versions available and you can see a list of all of them using the command | You need to begin by loading an R module; there will typically be several versions available and you can see a list of all of them using the command | ||
{{Command | {{Command | ||
Line 14: | Line 19: | ||
You can load a particular R module using a command like | You can load a particular R module using a command like | ||
{{Command | {{Command | ||
|module load r/ | |module load gcc/9.3.0 r/4.0.2 | ||
}} | }} | ||
For more on this, see [[Utiliser des modules/en|Using modules.]] | |||
<!--T:10--> | |||
Now you can start the R interpreter and type R code inside that environment: | |||
{{Command | {{Command | ||
|lang="R" | |lang="R" | ||
|R | |R | ||
|result= | |result= | ||
R version | R version 4.0.2 (2020-06-22) -- "Taking Off Again" | ||
Copyright (C) | Copyright (C) 2020 The R Foundation for Statistical Computing | ||
Platform: x86_64-pc-linux-gnu (64-bit) | Platform: x86_64-pc-linux-gnu (64-bit) | ||
<!--T:11--> | |||
R is free software and comes with ABSOLUTELY NO WARRANTY. | R is free software and comes with ABSOLUTELY NO WARRANTY. | ||
You are welcome to redistribute it under certain conditions. | You are welcome to redistribute it under certain conditions. | ||
Type 'license()' or 'licence()' for distribution details. | Type 'license()' or 'licence()' for distribution details. | ||
Natural language support but running in an English locale | <!--T:12--> | ||
Natural language support but running in an English locale | |||
<!--T:13--> | |||
R is a collaborative project with many contributors. | R is a collaborative project with many contributors. | ||
Type 'contributors()' for more information and | Type 'contributors()' for more information and | ||
'citation()' on how to cite R or R packages in publications. | 'citation()' on how to cite R or R packages in publications. | ||
<!--T:14--> | |||
Type 'demo()' for some demos, 'help()' for on-line help, or | Type 'demo()' for some demos, 'help()' for on-line help, or | ||
'help.start()' for an HTML browser interface to help. | 'help.start()' for an HTML browser interface to help. | ||
Type 'q()' to quit R. | Type 'q()' to quit R. | ||
<!--T:15--> | |||
> values <- c(3,5,7,9) | > values <- c(3,5,7,9) | ||
> values[ | > values[1] | ||
[1] 3 | [1] 3 | ||
> q() | > q() | ||
}} | }} | ||
<!--T:66--> | |||
To execute an R script non-interactively, use <code>Rscript</code> with the file containing the R commands as an argument: | |||
<!--T:16--> | |||
{{Command|Rscript computation.R}} | {{Command|Rscript computation.R}} | ||
<!--T:17--> | |||
<code>Rscript</code> will automatically pass scripting-appropriate options <code>--slave</code> and <code>--no-restore</code> to the R interpreter. These also imply the <code>--no-save</code> option, preventing the creation of useless workspace files on exit. | |||
<!--T:67--> | |||
Note that '''any calculations lasting more than two or three minutes should not be run on the login node'''. | |||
They should be run via the job scheduler. | |||
<!--T:74--> | |||
A simple job script looks like this: | |||
</translate> | |||
{{File | |||
|name=job.sh | |||
|lang="bash" | |||
|contents= | |||
#!/bin/bash | |||
#SBATCH --account=def-someacct # replace this with your own account | |||
#SBATCH --mem-per-cpu=2000M # memory; default unit is megabytes | |||
#SBATCH --time=0-00:15 # time (DD-HH:MM) | |||
module load gcc/9.3.0 r/4.0.2 # Adjust version and add the gcc module used for installing packages. | |||
Rscript computation.R | |||
}} | |||
<translate> | |||
<!--T:75--> | |||
See [[Running jobs]] for more information. | |||
== Installing R packages == <!--T:68--> | |||
=== install.packages() === <!--T:69--> | |||
<!--T:91--> | |||
To install packages from [https://cran.r-project.org/ CRAN], you can use <code>install.packages</code> in an interactive R session on a cluster login node. Since the compute nodes on most clusters do not have access to the Internet, installing R packages in a batch or interactive job is not possible. Many R packages are developed using the GNU family of compilers so we recommend that you load a <code>gcc</code> [[Utiliser des modules/en|module]] before trying to install any R packages. Use the same version of the <code>gcc</code> for all packages you install. | |||
{{Command | |||
|module load gcc/9.3.0 r/4.0.2 | |||
}} | |||
== Installing R | ==== Installing for a specific R version ==== <!--T:18--> | ||
For example, to install the <code>sp</code> package that provides classes and methods for spatial data, use the following command on a login node: | |||
<!--T:19--> | |||
{{Command | {{Command | ||
|R | |R | ||
|result= | |result= | ||
[...] | [...] | ||
> install.packages( | > install.packages('sp', repos='https://cloud.r-project.org/') | ||
}} | }} | ||
<!--T:20--> | |||
If the argument <code>repos</code> is not specified, you will be asked to select an appropriate mirror for download. Ideally, it will be geographically close to the cluster you're working on. | |||
<!--T:21--> | |||
Some packages require defining the environment variable <code>TMPDIR</code> before installing. | |||
==== Installing for one or many R versions ==== <!--T:79--> | |||
Specify the local installation directory according to the R module that is currently loaded. | |||
{{Commands | |||
|mkdir -p ~/.local/R/$EBVERSIONR/ | |||
|export R_LIBS{{=}}~/.local/R/$EBVERSIONR/ | |||
}} | |||
Install the package. | |||
{{Command|R -e 'install.packages("sp", repos{{=}}"https://cloud.r-project.org/")'}} | |||
<!--T:80--> | |||
In your submission script, you then have to load the desired R module and set the local library directory with <code>export R_LIBS=~/.local/R/$EBVERSIONR/</code>. | |||
=== Dependencies === <!--T:5--> | |||
Some packages depend on external libraries which are already installed on our clusters. If the library you need is listed at [[Available software]], then load the appropriate [[Utiliser des modules/en|module]] before installing the package that requires it. | |||
<!--T:70--> | |||
For example, the package <code>rgdal</code> requires a library called <code>gdal</code>. Running <code>module spider gdal/2.2.1</code> shows that it requires <code>nixpkgs</code> and <code>gcc</code> modules. If you took the advice above to load <code>gcc</code> then both these should already be loaded. Verify this by running | |||
<!--T:6--> | |||
{{Command | {{Command | ||
|R CMD INSTALL archive_package.tgz | |module list | ||
}} | |||
<!--T:9--> | |||
If any package fails to install, be sure to read the error message carefully as it might give you details concerning additional modules you need to load. See [[Utiliser des modules/en|Using modules]] for more on the <code>module</code> family of commands. | |||
=== Downloaded packages === <!--T:22--> | |||
To install a package that you downloaded (i.e. not using <code>install.packages()</code>), you can install it as follows. Assuming the package is named <code>archive_package.tgz</code>, run the following command in a shell: | |||
{{Command | |||
|R CMD INSTALL -l 'path for your local (home) R library' archive_package.tgz | |||
}} | }} | ||
==Using system calls in R== <!--T:76--> | |||
<!--T:77--> | |||
Using the R command <code>system()</code> you can execute commands in the ambient environment from inside R. On our clusters, this can lead to problems because R will give an incorrect value to the environment variable <code>LD_LIBRARY_PATH</code>. You can avoid this problem by using the syntax <code>system("LD_LIBRARY_PATH=$RSNT_LD_LIBRARY_PATH <my system call>")</code> in your R system calls. | |||
== Passing arguments to R scripts == <!--T:84--> | |||
Sometimes it can be useful to pass parameters as arguments to R scripts, to avoid having to either change the R script for every job or having to manage multiple copies of otherwise identical scripts. This can be useful for specifying the names for input- or output files, as well as specifying numerical parameters. For example, instead of specifying the name of an input file and/or a numerical parameter like this | |||
{{File | |||
|name= | |||
|lang="R" | |||
|contents= | |||
filename = "input.csv" | |||
iterations = 5 | |||
}} | |||
and changing the code every time either of these changes, parameters can be passed to the R-script when starting it: | |||
{{Command | |||
|Rscript myscript.R input_1.csv 5 | |||
}} | |||
and the next | |||
{{Command | |||
|Rscript myscript.R input_2.csv 10 | |||
}} | |||
<!--T:85--> | |||
The following example expects exactly two arguments. The first one should be a string which will be used for the variable "name" and the second one should be an integer for the variable "number". | |||
{{File | |||
|name=arguments_test.R | |||
|lang="R" | |||
|contents= | |||
args = commandArgs(trailingOnly=TRUE) | |||
<!--T:86--> | |||
# test if there is at least two arguments: if not, return an error | |||
if (length(args)<2) { | |||
stop("At least two arguments must be supplied ('name' (text) and 'numer' (integer) )", call.=FALSE) | |||
} | |||
<!--T:87--> | |||
name <- args[1] # read first argument as string | |||
number <- as.integer( args[2] ) # read second argument as integer | |||
<!--T:88--> | |||
print(paste("Processing with name:'", name, "' and number:'", number,"'", sep = '')) | |||
}} | |||
<!--T:89--> | |||
This script can be used like this: | |||
{{Command | |||
|lang="R" | |||
| Rscript arguments_test.R Hello 42 | |||
|result= | |||
[1] "Processing with name:'Hello' and number:'42'" | |||
}} | |||
==Exploiting parallelism in R== <!--T:46--> | |||
<!--T:71--> | |||
The processors on our clusters are quite ordinary. | |||
What makes these supercomputers <i>super</i> is that you have access to thousands of CPU cores with a high-performance network. | |||
In order to take advantage of this hardware, you must run code "in parallel." However, note that prior to investing a lot of time and effort | |||
in parallelizing your R code, you should first ensure that your serial implementation is as efficient as possible. As an interpreted | |||
language, the use of loops in R, and especially nested loops, constitutes a significant performance bottleneck. Whenever possible you | |||
should try to use vectorized forms of R functions and more functional elements of the R programming language like the family of | |||
<code>apply</code> functions and the <code>ifelse</code> function. This will frequently offer you a far better performance gain by eliminating | |||
a loop altogether instead of simply parallelizing the (slow) execution of this loop across several CPU cores. | |||
<!--T:72--> | |||
The [https://cran.r-project.org/web/views/HighPerformanceComputing.html CRAN Task View on High-Performance and Parallel Computing with R] | |||
describes a bewildering collection of interrelated R packages for parallel computing. | |||
For an excellent overview and advice, see the October 2023 Compute Ontario colloquium | |||
[https://education.scinet.utoronto.ca/course/view.php?id=1333 "High-Performance R"] | |||
([https://education.scinet.utoronto.ca/mod/resource/view.php?id=2887 slides]). | |||
<!--T:90--> | |||
The following subsections contain some further notes and examples. | |||
<!--T:73--> | |||
<b>A note on terminology:</b> In most of our documentation the term 'node' refers | |||
to an individual machine, also called a 'host', and a collection of such nodes makes up a 'cluster'. | |||
In a lot of R documentation however, the term 'node' refers to a worker process and a 'cluster' is a | |||
collection of such processes. As an example, consider the following quote, "Following <b>snow</b>, a pool | |||
of worker processes listening ''via'' sockets for commands from the master is called a 'cluster' of | |||
nodes."<ref>Core package "parallel" vignette, https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf</ref>. | |||
=== doParallel and foreach === <!--T:39--> | |||
====Usage==== | |||
Foreach can be considered as a unified interface for all backends (i.e. doMC, doMPI, doParallel, doRedis, etc.). It works on all platforms, assuming that the backend works. doParallel acts as an interface between foreach and the parallel package and can be loaded alone. There are some [[Scalability|known efficiency issues]] when using foreach to run a very large number of very small tasks. Therefore, keep in mind that the following code is not the best example of an optimized use of the foreach() call but rather that the function chosen was kept at a minimum for demonstration purposes. | |||
<!--T:48--> | |||
You must register the backend by feeding it the number of cores available. If the backend is not registered, foreach will assume that the number of cores is 1 and will proceed to go through the iterations serially. | |||
<!--T:82--> | |||
The general method to use foreach is: | |||
# to load both foreach and the backend package; | |||
# to register the backend; | |||
# to call foreach() by keeping it on the same line as the %do% (serial) or %dopar% operator. | |||
====Running==== <!--T:40--> | |||
<!--T:49--> | |||
1. Place your R code in a script file, in this case the file is called ''test_foreach.R''. | |||
<!--T:41--> | |||
{{File | |||
|name=test_foreach.R | |||
|lang="r" | |||
|contents= | |||
# library(foreach) # optional if using doParallel | |||
library(doParallel) # | |||
<!--T:50--> | |||
# a very simple function | |||
test_func <- function(var1, var2) { | |||
# some heavy workload | |||
sum <- 0 | |||
for (i in c(1:3141593)) { | |||
sum <- sum + var1 * sin(var2 / i) | |||
} | |||
return(sqrt(sum)) | |||
} | |||
<!--T:51--> | |||
# we will iterate over two sets of values, you can modify this to explore the mechanism of foreach | |||
var1.v = c(1:8) | |||
var2.v = seq(0.1, 1, length.out = 8) | |||
<!--T:52--> | |||
# Use the environment variable SLURM_CPUS_PER_TASK to set the number of cores. | |||
# This is for SLURM. Replace SLURM_CPUS_PER_TASK by the proper variable for your system. | |||
# Avoid manually setting a number of cores. | |||
ncores = Sys.getenv("SLURM_CPUS_PER_TASK") | |||
<!--T:53--> | |||
registerDoParallel(cores=ncores)# Shows the number of Parallel Workers to be used | |||
print(ncores) # this how many cores are available, and how many you have requested. | |||
getDoParWorkers()# you can compare with the number of actual workers | |||
<!--T:54--> | |||
# be careful! foreach() and %dopar% must be on the same line! | |||
foreach(var1=var1.v, .combine=rbind) %:% foreach(var2=var2.v, .combine=rbind) %dopar% {test_func(var1=var1, var2=var2)} | |||
}} | |||
<!--T:42--> | |||
2. Copy the following content in a job submission script called ''job_foreach.sh'': | |||
</translate> | |||
{{File | |||
|name=job_foreach.sh | |||
|lang="bash" | |||
|contents= | |||
#!/bin/bash | |||
#SBATCH --account=def-someacct # replace this with your supervisors account | |||
#SBATCH --nodes=1 # number of node MUST be 1 | |||
#SBATCH --cpus-per-task=4 # number of processes | |||
#SBATCH --mem-per-cpu=2048M # memory; default unit is megabytes | |||
#SBATCH --time=0-00:15 # time (DD-HH:MM) | |||
#SBATCH --mail-user=yourname@someplace.com # Send email updates to you or someone else | |||
#SBATCH --mail-type=ALL # send an email in all cases (job started, job ended, job aborted) | |||
module load gcc/9.3.0 r/4.0.2 | |||
export R_LIBS=~/local/R_libs/ | |||
R CMD BATCH --no-save --no-restore test_foreach.R | |||
}} | |||
<translate> | |||
<!--T:44--> | |||
3. Submit the job with: | |||
<!--T:45--> | |||
{{Command|sbatch job_foreach.sh}} | |||
<!--T:47--> | |||
For more on submitting jobs, see [[Running jobs]]. | |||
=== doParallel and makeCluster === <!--T:56--> | |||
====Usage==== | |||
You must register the backend by feeding it the nodes name multiplied by the desired number of processes. For instance, with two nodes (node1 and node2) and two processes, we would create a cluster composed of : <code>node1 node1 node2 node2</code> hosts. The ''PSOCK'' cluster type will run commands through SSH connections into the nodes. | |||
====Running==== <!--T:57--> | |||
1. Place your R code in a script file, in this case the file is called <code>test_makecluster.R</code>. | |||
{{File | |||
|name=test_makecluster.R | |||
|lang="r" | |||
|contents= | |||
library(doParallel) | |||
<!--T:58--> | |||
# Create an array from the NODESLIST environnement variable | |||
nodeslist = unlist(strsplit(Sys.getenv("NODESLIST"), split=" ")) | |||
<!--T:59--> | |||
# Create the cluster with the nodes name. One process per count of node name. | |||
# nodeslist = node1 node1 node2 node2, means we are starting 2 processes on node1, likewise on node2. | |||
cl = makeCluster(nodeslist, type = "PSOCK") | |||
registerDoParallel(cl) | |||
<!--T:60--> | |||
# Compute (Source : https://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf) | |||
x <- iris[which(iris[,5] != "setosa"), c(1,5)] | |||
trials <- 10000 | |||
<!--T:81--> | |||
foreach(icount(trials), .combine=cbind) %dopar% | |||
{ | |||
ind <- sample(100, 100, replace=TRUE) | |||
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit)) | |||
coefficients(result1) | |||
} | |||
<!--T:61--> | |||
# Don't forget to release resources | |||
stopCluster(cl) | |||
}} | |||
<!--T:62--> | |||
2. Copy the following content in a job submission script called <code>job_makecluster.sh</code>: | |||
{{File | |||
|name=job_makecluster.sh | |||
|lang="bash" | |||
|escape|mode=text | |||
|contents= | |||
#!/bin/bash | |||
#SBATCH --account=def-someacct # replace this with your supervisors account | |||
#SBATCH --ntasks=4 # number of processes | |||
#SBATCH --mem-per-cpu=512M # memory; default unit is megabytes | |||
#SBATCH --time=00:05:00 # time (HH:MM:SS) | |||
<!--T:63--> | |||
module load gcc/9.3.0 r/4.0.2 | |||
<!--T:64--> | |||
# Export the nodes names. | |||
# If all processes are allocated on the same node, NODESLIST contains : node1 node1 node1 node1 | |||
# Cut the domain name and keep only the node name | |||
export NODESLIST=$(echo $(srun hostname {{!}} cut -f 1 -d '.')) | |||
R -f test_makecluster.R | |||
}} | |||
<!--T:92--> | |||
In the above example the scheduler might place all four processes on just one node. | |||
This is okay, but if you wish to prove that the same job works even if the processes happen | |||
to be placed on different nodes, then add the line <code>#SBATCH --ntasks-per-node=2</code> | |||
<!--T:65--> | |||
3. Submit the job with: | |||
{{Command|sbatch job_makecluster.sh}} | |||
For more information on submitting jobs, see [[Running jobs]]. | |||
=== Rmpi === <!--T:23--> | |||
Note that the following instructions do not work on [[Cedar]], so any use of Rmpi should be restricted to the other clusters. | |||
====Installing==== <!--T:24--> | |||
This next procedure installs [https://cran.r-project.org/web/packages/Rmpi/index.html Rmpi], an interface (wrapper) to MPI routines, which allow R to run in parallel. | |||
<!--T:25--> | |||
1. See the available R modules by running: | |||
<source lang="bash"> | |||
module spider r | |||
</source> | |||
<!--T:26--> | |||
2. Select the R version and load the required Open MPI module. This example uses Open MPI version 4.0.3 to spawn the processes correctly. | |||
<source lang="bash"> | |||
module load gcc/11.3.0 | |||
module load r/4.2.1 | |||
module load openmpi/4.1.4 | |||
</source> | |||
<!--T:27--> | |||
3. Download [https://cran.r-project.org/web/packages/Rmpi/index.html the latest Rmpi version]; change the version number to whatever is desired. | |||
<source lang="bash"> | |||
wget https://cran.r-project.org/src/contrib/Rmpi_0.6-9.2.tar.gz | |||
</source> | |||
<!--T:28--> | |||
4. Specify the directory where you want to install the package files; you must have write permission for this directory. The directory name can be changed if desired. | |||
<source lang="bash"> | |||
mkdir -p ~/local/R_libs/ | |||
export R_LIBS=~/local/R_libs/ | |||
</source> | |||
<!--T:29--> | |||
5. Run the install command. | |||
<source lang="bash"> | |||
R CMD INSTALL --configure-args="--with-Rmpi-include=$EBROOTOPENMPI/include --with-Rmpi-libpath=$EBROOTOPENMPI/lib --with-Rmpi-type='OPENMPI' " Rmpi_0.6-9.2.tar.gz | |||
</source> | |||
<!--T:30--> | |||
Again, carefully read any error message that comes up when packages fail to install and load the required modules to ensure that all your packages are successfully installed. | |||
====Running==== <!--T:31--> | |||
<!--T:32--> | |||
1. Place your R code in a script file, in this case the file is called ''test.R''. | |||
<!--T:33--> | |||
{{File | |||
|name=test.R | |||
|lang="r" | |||
|contents= | |||
#Tell all slaves to return a message identifying themselves. | |||
library("Rmpi") | |||
sprintf("TEST mpi.universe.size() = %i", mpi.universe.size()) | |||
ns <- mpi.universe.size() - 1 | |||
sprintf("TEST attempt to spawn %i slaves", ns) | |||
mpi.spawn.Rslaves(nslaves=ns) | |||
mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size())) | |||
mpi.remote.exec(paste(mpi.comm.get.parent())) | |||
#Send execution commands to the slaves | |||
x<-5 | |||
#These would all be pretty correlated one would think | |||
x<-mpi.remote.exec(rnorm,x) | |||
length(x) | |||
x | |||
mpi.close.Rslaves() | |||
mpi.quit() | |||
}} | |||
<!--T:34--> | |||
2. Copy the following content in a job submission script called ''job.sh'': | |||
<!--T:35--> | |||
{{File | |||
|name=job.sh | |||
|lang="bash" | |||
|contents= | |||
#!/bin/bash | |||
#SBATCH --account=def-someacct # replace this with your supervisors account | |||
#SBATCH --ntasks=5 # number of MPI processes | |||
#SBATCH --mem-per-cpu=2048M # memory; default unit is megabytes | |||
#SBATCH --time=0-00:15 # time (DD-HH:MM) | |||
<!--T:78--> | |||
module load gcc/11.3.0 | |||
module load r/4.2.1 | |||
module load openmpi/4.1.4 | |||
export R_LIBS=~/local/R_libs/ | |||
mpirun -np 1 R CMD BATCH test.R test.txt | |||
}} | |||
<!--T:36--> | |||
3. Submit the job with: | |||
<!--T:37--> | |||
<source lang="bash"> | |||
sbatch job.sh | |||
</source> | |||
<!--T:38--> | |||
For more on submitting jobs, see [[Running jobs]]. | |||
</translate> | </translate> |
Latest revision as of 17:42, 29 May 2024
R is a system for statistical computation and graphics. It consists of a language plus a runtime environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files.
Even though R was not developed for high-performance computing (HPC), its popularity with scientists from a variety of disciplines, including engineering, mathematics, statistics, bioinformatics, etc. makes it an essential tool on HPC installations dedicated to academic research. Features such as C extensions, byte-compiled code and parallelization allow for reasonable performance in single-node jobs. Thanks to R’s modular nature, users can customize the R functions available to them by installing packages from the Comprehensive R Archive Network (CRAN) into their home directories.
User Julie Fortin has written a blog post, "How to run your R script with Compute Canada" which you might find useful.
The R interpreter[edit]
You need to begin by loading an R module; there will typically be several versions available and you can see a list of all of them using the command
[name@server ~]$ module spider r
You can load a particular R module using a command like
[name@server ~]$ module load gcc/9.3.0 r/4.0.2
For more on this, see Using modules.
Now you can start the R interpreter and type R code inside that environment:
[name@server ~]$ R
R version 4.0.2 (2020-06-22) -- "Taking Off Again"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> values <- c(3,5,7,9)
> values[1]
[1] 3
> q()
To execute an R script non-interactively, use Rscript
with the file containing the R commands as an argument:
[name@server ~]$ Rscript computation.R
Rscript
will automatically pass scripting-appropriate options --slave
and --no-restore
to the R interpreter. These also imply the --no-save
option, preventing the creation of useless workspace files on exit.
Note that any calculations lasting more than two or three minutes should not be run on the login node. They should be run via the job scheduler.
A simple job script looks like this:
#!/bin/bash
#SBATCH --account=def-someacct # replace this with your own account
#SBATCH --mem-per-cpu=2000M # memory; default unit is megabytes
#SBATCH --time=0-00:15 # time (DD-HH:MM)
module load gcc/9.3.0 r/4.0.2 # Adjust version and add the gcc module used for installing packages.
Rscript computation.R
See Running jobs for more information.
Installing R packages[edit]
install.packages()[edit]
To install packages from CRAN, you can use install.packages
in an interactive R session on a cluster login node. Since the compute nodes on most clusters do not have access to the Internet, installing R packages in a batch or interactive job is not possible. Many R packages are developed using the GNU family of compilers so we recommend that you load a gcc
module before trying to install any R packages. Use the same version of the gcc
for all packages you install.
[name@server ~]$ module load gcc/9.3.0 r/4.0.2
Installing for a specific R version[edit]
For example, to install the sp
package that provides classes and methods for spatial data, use the following command on a login node:
[name@server ~]$ R
[...]
> install.packages('sp', repos='https://cloud.r-project.org/')
If the argument repos
is not specified, you will be asked to select an appropriate mirror for download. Ideally, it will be geographically close to the cluster you're working on.
Some packages require defining the environment variable TMPDIR
before installing.
Installing for one or many R versions[edit]
Specify the local installation directory according to the R module that is currently loaded.
[name@server ~]$ mkdir -p ~/.local/R/$EBVERSIONR/
[name@server ~]$ export R_LIBS=~/.local/R/$EBVERSIONR/
Install the package.
[name@server ~]$ R -e 'install.packages("sp", repos="https://cloud.r-project.org/")'
In your submission script, you then have to load the desired R module and set the local library directory with export R_LIBS=~/.local/R/$EBVERSIONR/
.
Dependencies[edit]
Some packages depend on external libraries which are already installed on our clusters. If the library you need is listed at Available software, then load the appropriate module before installing the package that requires it.
For example, the package rgdal
requires a library called gdal
. Running module spider gdal/2.2.1
shows that it requires nixpkgs
and gcc
modules. If you took the advice above to load gcc
then both these should already be loaded. Verify this by running
[name@server ~]$ module list
If any package fails to install, be sure to read the error message carefully as it might give you details concerning additional modules you need to load. See Using modules for more on the module
family of commands.
Downloaded packages[edit]
To install a package that you downloaded (i.e. not using install.packages()
), you can install it as follows. Assuming the package is named archive_package.tgz
, run the following command in a shell:
[name@server ~]$ R CMD INSTALL -l 'path for your local (home) R library' archive_package.tgz
Using system calls in R[edit]
Using the R command system()
you can execute commands in the ambient environment from inside R. On our clusters, this can lead to problems because R will give an incorrect value to the environment variable LD_LIBRARY_PATH
. You can avoid this problem by using the syntax system("LD_LIBRARY_PATH=$RSNT_LD_LIBRARY_PATH <my system call>")
in your R system calls.
Passing arguments to R scripts[edit]
Sometimes it can be useful to pass parameters as arguments to R scripts, to avoid having to either change the R script for every job or having to manage multiple copies of otherwise identical scripts. This can be useful for specifying the names for input- or output files, as well as specifying numerical parameters. For example, instead of specifying the name of an input file and/or a numerical parameter like this
filename = "input.csv"
iterations = 5
and changing the code every time either of these changes, parameters can be passed to the R-script when starting it:
[name@server ~]$ Rscript myscript.R input_1.csv 5
and the next
[name@server ~]$ Rscript myscript.R input_2.csv 10
The following example expects exactly two arguments. The first one should be a string which will be used for the variable "name" and the second one should be an integer for the variable "number".
args = commandArgs(trailingOnly=TRUE)
# test if there is at least two arguments: if not, return an error
if (length(args)<2) {
stop("At least two arguments must be supplied ('name' (text) and 'numer' (integer) )", call.=FALSE)
}
name <- args[1] # read first argument as string
number <- as.integer( args[2] ) # read second argument as integer
print(paste("Processing with name:'", name, "' and number:'", number,"'", sep = ''))
This script can be used like this:
[name@server ~]$ Rscript arguments_test.R Hello 42
[1] "Processing with name:'Hello' and number:'42'"
Exploiting parallelism in R[edit]
The processors on our clusters are quite ordinary.
What makes these supercomputers super is that you have access to thousands of CPU cores with a high-performance network.
In order to take advantage of this hardware, you must run code "in parallel." However, note that prior to investing a lot of time and effort
in parallelizing your R code, you should first ensure that your serial implementation is as efficient as possible. As an interpreted
language, the use of loops in R, and especially nested loops, constitutes a significant performance bottleneck. Whenever possible you
should try to use vectorized forms of R functions and more functional elements of the R programming language like the family of
apply
functions and the ifelse
function. This will frequently offer you a far better performance gain by eliminating
a loop altogether instead of simply parallelizing the (slow) execution of this loop across several CPU cores.
The CRAN Task View on High-Performance and Parallel Computing with R describes a bewildering collection of interrelated R packages for parallel computing. For an excellent overview and advice, see the October 2023 Compute Ontario colloquium "High-Performance R" (slides).
The following subsections contain some further notes and examples.
A note on terminology: In most of our documentation the term 'node' refers to an individual machine, also called a 'host', and a collection of such nodes makes up a 'cluster'. In a lot of R documentation however, the term 'node' refers to a worker process and a 'cluster' is a collection of such processes. As an example, consider the following quote, "Following snow, a pool of worker processes listening via sockets for commands from the master is called a 'cluster' of nodes."[1].
doParallel and foreach[edit]
Usage[edit]
Foreach can be considered as a unified interface for all backends (i.e. doMC, doMPI, doParallel, doRedis, etc.). It works on all platforms, assuming that the backend works. doParallel acts as an interface between foreach and the parallel package and can be loaded alone. There are some known efficiency issues when using foreach to run a very large number of very small tasks. Therefore, keep in mind that the following code is not the best example of an optimized use of the foreach() call but rather that the function chosen was kept at a minimum for demonstration purposes.
You must register the backend by feeding it the number of cores available. If the backend is not registered, foreach will assume that the number of cores is 1 and will proceed to go through the iterations serially.
The general method to use foreach is:
- to load both foreach and the backend package;
- to register the backend;
- to call foreach() by keeping it on the same line as the %do% (serial) or %dopar% operator.
Running[edit]
1. Place your R code in a script file, in this case the file is called test_foreach.R.
# library(foreach) # optional if using doParallel
library(doParallel) #
# a very simple function
test_func <- function(var1, var2) {
# some heavy workload
sum <- 0
for (i in c(1:3141593)) {
sum <- sum + var1 * sin(var2 / i)
}
return(sqrt(sum))
}
# we will iterate over two sets of values, you can modify this to explore the mechanism of foreach
var1.v = c(1:8)
var2.v = seq(0.1, 1, length.out = 8)
# Use the environment variable SLURM_CPUS_PER_TASK to set the number of cores.
# This is for SLURM. Replace SLURM_CPUS_PER_TASK by the proper variable for your system.
# Avoid manually setting a number of cores.
ncores = Sys.getenv("SLURM_CPUS_PER_TASK")
registerDoParallel(cores=ncores)# Shows the number of Parallel Workers to be used
print(ncores) # this how many cores are available, and how many you have requested.
getDoParWorkers()# you can compare with the number of actual workers
# be careful! foreach() and %dopar% must be on the same line!
foreach(var1=var1.v, .combine=rbind) %:% foreach(var2=var2.v, .combine=rbind) %dopar% {test_func(var1=var1, var2=var2)}
2. Copy the following content in a job submission script called job_foreach.sh:
#!/bin/bash
#SBATCH --account=def-someacct # replace this with your supervisors account
#SBATCH --nodes=1 # number of node MUST be 1
#SBATCH --cpus-per-task=4 # number of processes
#SBATCH --mem-per-cpu=2048M # memory; default unit is megabytes
#SBATCH --time=0-00:15 # time (DD-HH:MM)
#SBATCH --mail-user=yourname@someplace.com # Send email updates to you or someone else
#SBATCH --mail-type=ALL # send an email in all cases (job started, job ended, job aborted)
module load gcc/9.3.0 r/4.0.2
export R_LIBS=~/local/R_libs/
R CMD BATCH --no-save --no-restore test_foreach.R
3. Submit the job with:
[name@server ~]$ sbatch job_foreach.sh
For more on submitting jobs, see Running jobs.
doParallel and makeCluster[edit]
Usage[edit]
You must register the backend by feeding it the nodes name multiplied by the desired number of processes. For instance, with two nodes (node1 and node2) and two processes, we would create a cluster composed of : node1 node1 node2 node2
hosts. The PSOCK cluster type will run commands through SSH connections into the nodes.
Running[edit]
1. Place your R code in a script file, in this case the file is called test_makecluster.R
.
library(doParallel)
# Create an array from the NODESLIST environnement variable
nodeslist = unlist(strsplit(Sys.getenv("NODESLIST"), split=" "))
# Create the cluster with the nodes name. One process per count of node name.
# nodeslist = node1 node1 node2 node2, means we are starting 2 processes on node1, likewise on node2.
cl = makeCluster(nodeslist, type = "PSOCK")
registerDoParallel(cl)
# Compute (Source : https://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000
foreach(icount(trials), .combine=cbind) %dopar%
{
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
# Don't forget to release resources
stopCluster(cl)
2. Copy the following content in a job submission script called job_makecluster.sh
:
#!/bin/bash
#SBATCH --account=def-someacct # replace this with your supervisors account
#SBATCH --ntasks=4 # number of processes
#SBATCH --mem-per-cpu=512M # memory; default unit is megabytes
#SBATCH --time=00:05:00 # time (HH:MM:SS)
module load gcc/9.3.0 r/4.0.2
# Export the nodes names.
# If all processes are allocated on the same node, NODESLIST contains : node1 node1 node1 node1
# Cut the domain name and keep only the node name
export NODESLIST=$(echo $(srun hostname | cut -f 1 -d '.'))
R -f test_makecluster.R
In the above example the scheduler might place all four processes on just one node.
This is okay, but if you wish to prove that the same job works even if the processes happen
to be placed on different nodes, then add the line #SBATCH --ntasks-per-node=2
3. Submit the job with:
[name@server ~]$ sbatch job_makecluster.sh
For more information on submitting jobs, see Running jobs.
Rmpi[edit]
Note that the following instructions do not work on Cedar, so any use of Rmpi should be restricted to the other clusters.
Installing[edit]
This next procedure installs Rmpi, an interface (wrapper) to MPI routines, which allow R to run in parallel.
1. See the available R modules by running:
module spider r
2. Select the R version and load the required Open MPI module. This example uses Open MPI version 4.0.3 to spawn the processes correctly.
module load gcc/11.3.0
module load r/4.2.1
module load openmpi/4.1.4
3. Download the latest Rmpi version; change the version number to whatever is desired.
wget https://cran.r-project.org/src/contrib/Rmpi_0.6-9.2.tar.gz
4. Specify the directory where you want to install the package files; you must have write permission for this directory. The directory name can be changed if desired.
mkdir -p ~/local/R_libs/
export R_LIBS=~/local/R_libs/
5. Run the install command.
R CMD INSTALL --configure-args="--with-Rmpi-include=$EBROOTOPENMPI/include --with-Rmpi-libpath=$EBROOTOPENMPI/lib --with-Rmpi-type='OPENMPI' " Rmpi_0.6-9.2.tar.gz
Again, carefully read any error message that comes up when packages fail to install and load the required modules to ensure that all your packages are successfully installed.
Running[edit]
1. Place your R code in a script file, in this case the file is called test.R.
#Tell all slaves to return a message identifying themselves.
library("Rmpi")
sprintf("TEST mpi.universe.size() = %i", mpi.universe.size())
ns <- mpi.universe.size() - 1
sprintf("TEST attempt to spawn %i slaves", ns)
mpi.spawn.Rslaves(nslaves=ns)
mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
mpi.remote.exec(paste(mpi.comm.get.parent()))
#Send execution commands to the slaves
x<-5
#These would all be pretty correlated one would think
x<-mpi.remote.exec(rnorm,x)
length(x)
x
mpi.close.Rslaves()
mpi.quit()
2. Copy the following content in a job submission script called job.sh:
#!/bin/bash
#SBATCH --account=def-someacct # replace this with your supervisors account
#SBATCH --ntasks=5 # number of MPI processes
#SBATCH --mem-per-cpu=2048M # memory; default unit is megabytes
#SBATCH --time=0-00:15 # time (DD-HH:MM)
module load gcc/11.3.0
module load r/4.2.1
module load openmpi/4.1.4
export R_LIBS=~/local/R_libs/
mpirun -np 1 R CMD BATCH test.R test.txt
3. Submit the job with:
sbatch job.sh
For more on submitting jobs, see Running jobs.
- ↑ Core package "parallel" vignette, https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf