Difference between revisions of "HPC:R"

From HPC wiki

Revision as of 20:04, 15 April 2015

R (programming language)

There are currently several versions of the R programming language installed across all the HPC nodes.

Usage

Currently, the default version of R installed across all the HPC nodes is version 3.0.1.

[asrini@node062 ~]$ which R
/usr/bin/R

[asrini@node062 ~]$ R --version
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.

Note: This version (3.0.1) is likely to be removed as the default version and will be made available as a module.

Other R versions installed across the HPC nodes can be loaded as modules:

[asrini@node062 ~]$ module show R-3.1.1
-------------------------------------------------------------------
/usr/share/Modules/modulefiles/R-3.1.1:

module-whatis	 GNU R
prepend-path	 PATH /opt/software/R/3.1.1/bin
prepend-path	 MANPATH /opt/software/R/3.1.1/share/man
-------------------------------------------------------------------


[asrini@node062 ~]$ module load R-3.1.1

[asrini@node062 ~]$ R --version
R version 3.1.1 (2014-07-10) -- "Sock it to Me"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.

[asrini@node062 ~]$ module show R-3.1.2
-------------------------------------------------------------------
/usr/share/Modules/modulefiles/R-3.1.2:

module-whatis	 GNU R
prepend-path	 PATH /opt/software/R/3.1.2/bin
prepend-path	 MANPATH /opt/software/R/3.1.2/share/man
-------------------------------------------------------------------

[asrini@node062 ~]$ module load R-3.1.2

[asrini@node062 ~]$ R --version
R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.

R libraries/packages

Several difficult-to-install R packages/libraries have been installed across all the HPC nodes for every version of R that is installed on the cluster. These typically include various Bioconductor packages and some other packages that would otherwise need administrative privileges to install. Users are encouraged to install any R packages needed for their work if the desired package is not already installed. To see a full listing of installed R packages, run the following command in an interactive shell:

[asrini@node062 ~]$ echo 'library()' | R --slave
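
To check for one specific package instead of reading the full listing, the output can be filtered. The following is a minimal sketch, assuming the listing prints one package per line with the package name in the first column; the function name has_package is hypothetical and not part of R or the cluster software:

```shell
# has_package NAME: read `library()` output on stdin and succeed if NAME is
# the first field of an unindented line. The comparison is an exact string
# match, so a name such as "data.table" is not treated as a pattern.
has_package() {
    awk -v pkg="$1" '/^[^ \t]/ && $1 == pkg { found = 1 } END { exit !found }'
}

# Intended use on a compute node (assumes R is on PATH):
#   echo 'library()' | R --slave | has_package doParallel && echo "already installed"
```

If the function fails, the package is not in any library visible to the active R version and can be installed as described in the next section.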

Running R doParallel jobs

The doParallel package is a "parallel backend" for the foreach package. It provides a mechanism needed to execute foreach loops in parallel. This section specifically covers the use of the R doParallel library on the PMACS cluster nodes.

Install doParallel

The doParallel library is not installed by default, so you will have to install it yourself from an interactive session:

[asrini@consign ~]$ bsub -Is bash

[asrini@node063 ~]$ R

> install.packages("doParallel");

Note 1: The doParallel library will be installed under $HOME/R/x86_64-redhat-linux-gnu-library/3.0/ when the default version of R (3.0.1) is used.

Note 2: If an alternative version of R is preferred, make sure you load the appropriate module for that version before attempting the install. Also make sure you are using that specific version of R before running jobs.
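
Taken together, the two notes mean that the install location follows the active R version. The sketch below is illustrative only: it assumes R's usual per-user library layout of one directory per major.minor release, and it reuses the platform string shown in Note 1. The function names parse_r_version and r_user_lib_dir are hypothetical:

```shell
# Print the version number found on the first line of `R --version` output
# (read from stdin), for example "3.1.2".
parse_r_version() {
    sed -n '1s/^R version \([0-9][0-9.]*\).*/\1/p'
}

# Print the per-user library directory expected for a given R version.
# Assumption: one directory per major.minor release, so 3.1.1 and 3.1.2
# would share ".../3.1". The platform string is copied from Note 1.
r_user_lib_dir() {
    version="$1"
    echo "$HOME/R/x86_64-redhat-linux-gnu-library/${version%.*}"
}

# Intended use after loading a module (assumes R is on PATH):
#   r_user_lib_dir "$(R --version | parse_r_version)"
```

A package installed while one version is active will not be found by a version that uses a different directory, which is why the module should be loaded before both the install and the job.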

Using doParallel

The parallel package, which doParallel loads, can detect the number of CPU cores on a system through its "detectCores()" function, and a cluster with one worker process per detected core can then be created with "makeCluster()". However, we do not recommend the use of "detectCores()" in our environment because it reports every core on the node, not the number of cores allocated to the job by our job scheduler, IBM's Platform LSF. Sizing a cluster from "detectCores()" will therefore result in one user's jobs affecting other users' jobs running on the same node.

Instead, we recommend forcing doParallel to operate within the confines of the job scheduler. Below are some examples:

Example 1

The following example, taken from the doParallel manual, shows basic use of the doParallel library with the core count hard-coded in the R script:


[asrini@consign ~]$ bsub -n 2 -Is bash

[asrini@node063]$ cat doParallel_test.R
library(doParallel)

registerDoParallel(cores=2)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000
ptime <- system.time({
r <- foreach(icount(trials), .combine=cbind) %dopar% {
   ind <- sample(100, 100, replace=TRUE)
   result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
   coefficients(result1)
   }
})[3]
ptime

Note: In the above example, cores is set to 2 because the interactive session was launched on 2 cores (bsub -n 2 -Is bash). If a different number of cores is requested, the value in the doParallel script must be changed to match.

The script can then be run as:


[asrini@node063]$ Rscript doParallel_test.R
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
elapsed
 17.159
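
Because the core count in Example 1 is hard-coded, it can drift from what was requested with bsub -n. The following is a minimal sketch of a check to run before the script, assuming that LSF exports the number of allocated slots as LSB_DJOB_NUMPROC inside the job (worth confirming on the cluster with `env | grep LSB_`). The function names are hypothetical:

```shell
# Print the N from the first registerDoParallel(cores=N) call in an R script.
script_core_count() {
    sed -n 's/.*registerDoParallel(cores *= *\([0-9][0-9]*\)).*/\1/p' "$1" | head -n 1
}

# Succeed only if the script asks for no more cores than were allocated.
# Assumption: LSB_DJOB_NUMPROC holds the allocation; when it is not set,
# the allocation is treated as a single core.
check_core_count() {
    wanted="$(script_core_count "$1")"
    allocated="${LSB_DJOB_NUMPROC:-1}"
    if [ -z "$wanted" ]; then
        echo "no registerDoParallel(cores=N) call found in $1" >&2
        return 2
    fi
    if [ "$wanted" -gt "$allocated" ]; then
        echo "$1 wants $wanted cores but only $allocated are allocated" >&2
        return 1
    fi
}

# Intended use inside an LSF job (assumes Rscript is on PATH):
#   check_core_count doParallel_test.R && Rscript doParallel_test.R
```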


Example 2

The following example shows how to pass the number of cores as an argument to the R script.


[asrini@consign ~]$ bsub -n 2 -Is bash

[asrini@node063 ~]$ cat doParallel_test2.R

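library(doParallel)
# Number of worker processes, taken from the first command-line argument
core_count <- as.integer(commandArgs(TRUE)[1])
cl <- makeCluster(core_count)
registerDoParallel(cl)
print(cl)
# Shut the workers down once the parallel work is finished
stopCluster(cl)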


[asrini@node063 test_jobs]$ Rscript doParallel_test2.R 2
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
socket cluster with 2 nodes on host ‘localhost’

Note: In the above example, the value 2 is passed to the R script as a command-line argument, and a doParallel cluster of that size is created. If more cores are requested in the bsub submission, the argument passed to the script should be changed accordingly.

This can also be done as a batch (non-interactive) submission:


[asrini@consign ~]$ bsub -n 2 -R "span[hosts=1]" -e doParallel.e -o doParallel.o Rscript doParallel_test2.R 2

The error and output files from the above bsub submission contain the same information as shown previously.
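
In the batch submission above, the core count appears twice: once after -n and once as the script argument. One way to keep the two in step is to build the command from a single value. The following is a minimal sketch; the function name bsub_r_command is hypothetical, and the log file names are simply those used on this page:

```shell
# Print the bsub command line for running an R script on N cores of one host.
# The same N is used for -n and for the script argument, so the two values
# cannot drift apart.
bsub_r_command() {
    cores="$1"
    script="$2"
    case "$cores" in
        ''|*[!0-9]*)
            echo "core count must be a positive integer" >&2
            return 1
            ;;
    esac
    if [ "$cores" -lt 1 ]; then
        echo "core count must be a positive integer" >&2
        return 1
    fi
    echo "bsub -n $cores -R \"span[hosts=1]\" -e doParallel.e -o doParallel.o Rscript $script $cores"
}

# Intended use on the submission host (assumes bsub is on PATH and the
# script name contains no spaces):
#   eval "$(bsub_r_command 4 doParallel_test2.R)"
```

Printing the command before running it also makes it easy to inspect what will be submitted.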

Other Pages