Difference between revisions of "HPC:R"

From HPC wiki

Revision as of 16:27, 5 December 2019

R (programming language)

There are currently several versions of the R programming language installed across all the HPC compute nodes.

Running R programs on the PMACS HPC system

Various versions of R and other software packages are available as modules, but only on compute nodes (both interactive and non-interactive). Do not try to run these on the head node. To run R interactively, first launch an interactive session:

[asrini@consign ~]$ bsub -Is bash
Job <35804293> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on node060.hpc.local>>
[asrini@node062 ~]$ 

Please read the rest of this section before launching R jobs on the PMACS HPC system.

Available Versions

[asrini@node062 ~]$ module avail R

--------------------------------------------------------------------------------- /usr/share/Modules/modulefiles ----------------------------------------------------------------------------------
R-3.1.1 R-3.1.2 R-3.2.1 R-3.2.2

Usage

Currently, the default version of R installed across all the HPC nodes is 3.0.1.

[asrini@node062 ~]$ which R
/usr/bin/R

[asrini@node062 ~]$ R --version
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.

Note: This version (3.0.1) is likely to be removed as the default and made available as a module instead.

Other R versions installed across the HPC nodes can be loaded as modules:

[asrini@node062 ~]$ module show R-3.1.1
-------------------------------------------------------------------
/usr/share/Modules/modulefiles/R-3.1.1:

module-whatis	 GNU R
prepend-path	 PATH /opt/software/R/3.1.1/bin
prepend-path	 MANPATH /opt/software/R/3.1.1/share/man
-------------------------------------------------------------------


[asrini@node062 ~]$ module load R-3.1.1

[asrini@node062 ~]$ R --version
R version 3.1.1 (2014-07-10) -- "Sock it to Me"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.

[asrini@node062 ~]$ module show R-3.1.2
-------------------------------------------------------------------
/usr/share/Modules/modulefiles/R-3.1.2:

module-whatis	 GNU R
prepend-path	 PATH /opt/software/R/3.1.2/bin
prepend-path	 MANPATH /opt/software/R/3.1.2/share/man
-------------------------------------------------------------------

[asrini@node062 ~]$ module load R-3.1.2

[asrini@node062 ~]$ R --version
R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.

Pre-installed R libraries/packages

Several difficult-to-install R packages/libraries have been installed across all the HPC nodes for every version of R on the cluster. These typically include various Bioconductor packages and some other packages that would otherwise need administrative privileges to install. If a package you need is not already installed, you are encouraged to install it yourself. To see a full listing of installed R packages, run the following command in an interactive shell:

[asrini@node062 ~]$ echo 'library()' | R --slave

R package installation

Most R packages can be installed using the normal install.packages("<package_name>") approach.

However, certain R packages, while not difficult to install themselves, have dependencies that require special care during the install process. Below is a listing of some of those packages:

monocle3

Monocle 3 is an analysis toolkit for single-cell RNA-Seq experiments. While this package can be installed from Bioconductor, it has other dependencies, namely GDAL/R-GDAL.

NOTE: If you use Anaconda (a.k.a. "conda") to manage packages within your home or project directories, version conflicts may arise. You may have to disable your conda environments, using these instructions, for the Monocle3 installation to succeed on our HPC system.

Step 1: Install GDAL by following these instructions

r-gdal

Running R doParallel jobs

The doParallel package is a "parallel backend" for the foreach package. It provides a mechanism needed to execute foreach loops in parallel. This section specifically covers the use of the R doParallel library on the PMACS cluster nodes.

Details regarding the package and its manual can be found on CRAN



Install doParallel

The doParallel library is not installed by default, so you will have to install it.

[asrini@consign ~]$ bsub -Is bash

[asrini@node063 ~]$ R

> install.packages("doParallel");

Note 1: With the default R (3.0.1), the doParallel library will be installed under $HOME/R/x86_64-redhat-linux-gnu-library/3.0/

Note 2: If an alternative version of R is preferred, make sure you load the appropriate module for that version before attempting the install. Also make sure you are using that specific version of R before running jobs.
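Because each R series gets its own per-user library, it can help to check which series already has doParallel before running jobs. The following is a minimal sketch, not a site-provided tool: the function name has_doparallel is hypothetical, and it assumes that other R versions use the same directory layout as the path shown in Note 1, with only the trailing series number (3.0, 3.1, ...) changing.

```shell
#!/bin/bash
# Report whether doParallel is present in the per-user library of a given
# R series (e.g. 3.0 or 3.1).
# Assumption: the library path follows the layout shown in Note 1.
has_doparallel() {
    local series="$1"
    local lib="$HOME/R/x86_64-redhat-linux-gnu-library/$series"
    [ -d "$lib/doParallel" ]
}

# Example use before loading the R-3.1.1 module:
#   has_doparallel 3.1 || echo "doParallel is not installed for R 3.1.x"
```

If the check fails for the series you plan to use, load that module and repeat the install.packages("doParallel") step from within that version of R.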

Using doParallel

The detectCores() function, provided by the parallel package that doParallel loads, reports the number of CPU cores on a given system and can be used to start one worker on each detected core. However, we do not recommend the use of this function in our environment, because it does not operate within the confines of our job scheduler, IBM's Platform LSF: it counts the cores on the node, not the cores allocated to your job. Sizing a cluster with detectCores() and then calling makeCluster() will result in one user's jobs affecting other users' jobs running on the same node.

Instead, we recommend forcing doParallel to operate within the confines of the job scheduler. Below are some examples:

Example 1

The following example, taken from the doParallel manual, shows basic use of the doParallel library with the core count hard-coded in the R script:


[asrini@consign ~]$ bsub -n 2 -Is bash

[asrini@node063]$ cat doParallel_test.R
library(doParallel)

registerDoParallel(cores=2)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000
ptime <- system.time({
r <- foreach(icount(trials), .combine=cbind) %dopar% {
   ind <- sample(100, 100, replace=TRUE)
   result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
   coefficients(result1)
   }
})[3]
ptime

Note: In the above example, cores is set to 2 because the interactive session was launched on 2 cores (bsub -n 2 -Is bash). If more cores are requested, the value in the doParallel script should be changed to match.

The script can be run as:


[asrini@node063]$ Rscript doParallel_test.R
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
elapsed
 17.159


Example 2

The following example shows how to pass the number of cores as an argument to the R script.


[asrini@consign ~]$ bsub -n 2 -Is bash

[asrini@node063 ~]$ cat doParallel_test2.R

library(doParallel)
core_count <- as.numeric(commandArgs(TRUE)[1])
cl <- makeCluster(core_count)
registerDoParallel(cl)
print(cl)


[asrini@node063 test_jobs]$ Rscript doParallel_test2.R 2
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
socket cluster with 2 nodes on host ‘localhost’

Note: In the above example, the value 2 was passed to the R script as an argument, and a doParallel cluster of that size was created. If more cores are requested in the bsub submission, the argument passed to the doParallel script should be changed to match.
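The argument does not have to be typed by hand. A minimal sketch, assuming LSF exports the LSB_DJOB_NUMPROC environment variable inside the job (standard LSF behavior, but not confirmed on this page for the PMACS cluster): the hypothetical helper lsf_core_count prints the allocated core count and falls back to 1 when the variable is unset or not a positive integer.

```shell
#!/bin/bash
# Print the number of cores the scheduler allocated to the current job.
# Assumption: LSF sets LSB_DJOB_NUMPROC in the job environment.
# Falls back to 1 (e.g. when run outside of a bsub session).
lsf_core_count() {
    local n="${LSB_DJOB_NUMPROC:-1}"
    case "$n" in
        ''|*[!0-9]*) n=1 ;;   # non-numeric value: fall back to one core
    esac
    [ "$n" -ge 1 ] || n=1     # guard against a value of 0
    echo "$n"
}

# Intended use inside an interactive or batch job:
#   Rscript doParallel_test2.R "$(lsf_core_count)"
```

Used this way, the size of the doParallel cluster follows whatever was requested with bsub -n, so the two values cannot drift apart.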

Example 2 can also be run as a batch (non-interactive) submission:


[asrini@consign ~]$ bsub -n 2 -R "span[hosts=1]" -e doParallel.e -o doParallel.o Rscript doParallel_test2.R 2

The error and output files from the above bsub submission contain the same information as shown previously.
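The same submission can be kept in a job script so that the options are recorded alongside the command. The sketch below writes a hypothetical file, doParallel_job.sh, that restates the bsub options above as #BSUB directives; the file name is an assumption, and the commented-out module line is only needed for a non-default R version.

```shell
#!/bin/bash
# Write an LSF job script that mirrors the bsub command line shown above.
cat > doParallel_job.sh <<'EOF'
#!/bin/bash
# Request 2 cores, all on a single host.
#BSUB -n 2
#BSUB -R "span[hosts=1]"
# Error and output files.
#BSUB -e doParallel.e
#BSUB -o doParallel.o

# Uncomment to use a non-default R version
# (doParallel must be installed for that version):
# module load R-3.1.1

# The argument must match the core count requested with -n.
Rscript doParallel_test2.R 2
EOF

# Submit by redirecting the script into bsub so the #BSUB lines are read:
#   bsub < doParallel_job.sh
```

Redirecting the file into bsub, rather than passing it as an argument, is what makes LSF parse the embedded #BSUB lines.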

Other Pages