HPC:R
R (programming language)
There are currently several versions of the R programming language installed across all the HPC compute nodes.
Running R programs on the PMACS HPC system
R and other software packages are available as modules on compute nodes only (both interactive and non-interactive). Do not try to run them on the head node. To run R interactively, first launch an interactive session:
[asrini@consign ~]$ bsub -Is bash
Job <35804293> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on node060.hpc.local>>
[asrini@node062 ~]$
Please read the rest of this section before launching R jobs on the PMACS HPC system.
Available Versions
[asrini@node062 ~]$ module avail R
------------------------ /usr/share/Modules/modulefiles ------------------------
R-3.1.1 R-3.1.2 R-3.2.1 R-3.2.2
Usage
Currently, the default version of R installed across all the HPC nodes is version 3.0.1.
[asrini@node062 ~]$ which R
/usr/bin/R
[asrini@node062 ~]$ R --version
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.
Note: This version (3.0.1) is likely to be removed as the default and made available as a module instead.
Other R versions installed across the HPC nodes can be loaded as modules:
[asrini@node062 ~]$ module show R-3.1.1
-------------------------------------------------------------------
/usr/share/Modules/modulefiles/R-3.1.1:

module-whatis    GNU R
prepend-path     PATH /opt/software/R/3.1.1/bin
prepend-path     MANPATH /opt/software/R/3.1.1/share/man
-------------------------------------------------------------------
[asrini@node062 ~]$ module load R-3.1.1
[asrini@node062 ~]$ R --version
R version 3.1.1 (2014-07-10) -- "Sock it to Me"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.

[asrini@node062 ~]$ module show R-3.1.2
-------------------------------------------------------------------
/usr/share/Modules/modulefiles/R-3.1.2:

module-whatis    GNU R
prepend-path     PATH /opt/software/R/3.1.2/bin
prepend-path     MANPATH /opt/software/R/3.1.2/share/man
-------------------------------------------------------------------
[asrini@node062 ~]$ module load R-3.1.2
[asrini@node062 ~]$ R --version
R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.
Pre-installed R libraries/packages
Several difficult-to-install R packages/libraries have been installed across all the HPC nodes for every version of R that is installed on the cluster. These typically include various BioConductor packages and some other packages that would otherwise need administrative privileges to install. If a package needed for your work is not already installed, you are encouraged to install it yourself. To see a full listing of installed R packages, run the following command in an interactive shell:
[asrini@node062 ~]$ echo 'library()' | R --slave
R package installation
Most R packages can be installed using the normal install.packages("<package_name>") approach.
However, certain R packages, while not difficult to install, have dependencies that require special care during the install process. Some of those packages are listed below:
monocle3
Monocle 3 is an analysis toolkit for single-cell RNA-Seq experiments. While this package can be installed with Bioconductor, it has other dependencies, namely GDAL/R-GDAL.
NOTE: If you are using Anaconda (a.k.a. "conda") to manage packages within your home or project directories, version conflicts may arise. You may have to disable your conda environments, using these instructions, for the monocle3 installation to succeed on our HPC system.
Step 1: Install GDAL by following these instructions
Step 2: Install PROJ.4 by following these instructions
Step 3: Launch an interactive session (if you haven't already)
[asrini@consign ~]$ bsub -Is bash
Job <57475473> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on node061.hpc.local>>
[asrini@node061 ~]$
Step 4: Set up all the necessary environment variables
export PATH=$HOME/software/bin:$PATH
export CPATH=$HOME/software/include:$CPATH
export LD_LIBRARY_PATH=$HOME/software/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=$HOME/software/lib:$LIBRARY_PATH
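If one of these variables is unset, the exports above leave a trailing colon in the value (for example `CPATH=$HOME/software/include:`), and some tools treat the resulting empty entry as the current directory. A minimal sketch of a helper that avoids this is shown below. The function name `prepend_path` is hypothetical, and the `$HOME/software` prefix is the one assumed by the step above.

```shell
# Prepend a directory to a colon-separated search-path variable,
# adding the separator only when the variable already has a value.
# Usage: prepend_path VARIABLE_NAME DIRECTORY
prepend_path() {
    local var="$1" dir="$2" current
    # Read the current value of the variable whose name is stored in $var.
    eval "current=\${$var:-}"
    if [ -n "$current" ]; then
        export "$var=$dir:$current"
    else
        export "$var=$dir"
    fi
}

prepend_path PATH            "$HOME/software/bin"
prepend_path CPATH           "$HOME/software/include"
prepend_path LD_LIBRARY_PATH "$HOME/software/lib"
prepend_path LIBRARY_PATH    "$HOME/software/lib"
```

The four calls have the same effect as the plain exports whenever the variables are already set, so the helper can be used in their place.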
Step 5: Install monocle3. NOTE: The instructions below install monocle3 against R v3.5.1. To install it against another version, load the appropriate module instead.
module load R/3.5.1
R
install.packages("devtools")
devtools::install_github('cole-trapnell-lab/leidenbase')
devtools::install_github('cole-trapnell-lab/monocle3')
rgdal
The R-GDAL (rgdal) package provides bindings to the 'Geospatial' Data Abstraction Library ('GDAL') (>= 1.11.4) and access to projection/transformation operations from the 'PROJ.4' library.
NOTE: If you are using Anaconda (a.k.a. "conda") to manage packages within your home or project directories, version conflicts may arise. You may have to disable your conda environments, using these instructions, for the rgdal installation to succeed on our HPC system.
Step 1: Install GDAL by following these instructions
Step 2: Install PROJ.4 by following these instructions
Step 3: Launch an interactive session (if you haven't already)
[asrini@consign ~]$ bsub -Is bash
Job <57475473> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on node061.hpc.local>>
[asrini@node061 ~]$
Step 4: Set up all the necessary environment variables
export PATH=$HOME/software/bin:$PATH
export CPATH=$HOME/software/include:$CPATH
export LD_LIBRARY_PATH=$HOME/software/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=$HOME/software/lib:$LIBRARY_PATH
Step 5: Install rgdal. NOTE: The instructions below install rgdal against R v3.5.1. To install it against another version, load the appropriate module instead.
module load R/3.5.1
R
From within R:
> install.packages("rgdal");
Running R doParallel jobs
The doParallel package is a "parallel backend" for the foreach package. It provides a mechanism needed to execute foreach loops in parallel. This section specifically covers the use of the R doParallel library on the PMACS cluster nodes.
Details regarding the package and its manual can be found on CRAN
Install doParallel
The doParallel library is not installed by default, so you will have to install it:
[asrini@consign ~]$ bsub -Is bash
[asrini@node063 ~]$ R
> install.packages("doParallel");
Note 1: The doParallel library will be installed under $HOME/R/x86_64-redhat-linux-gnu-library/3.0/
Note 2: If an alternative version of R is preferred, make sure you load the appropriate module for that version before attempting the install. Also make sure you are using that specific version of R before running jobs.
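The check in Note 2 can be scripted. The sketch below extracts the version number from the first line of `R --version` output and derives the per-user library directory, assuming it follows the `$HOME/R/<platform>-library/<major>.<minor>` pattern seen in Note 1. The function names are hypothetical, and the platform string is the one shown in the `R --version` outputs above; a differently built R may use another one.

```shell
# Extract "3.1.2" from a banner line such as
# 'R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"'.
r_version_from_banner() {
    echo "$1" | sed -n 's/^R version \([0-9][0-9.]*\).*/\1/p'
}

# Derive the per-user library directory for a given R version.
# Assumption: only the major.minor part of the version appears in the path.
r_user_lib_dir() {
    local version="$1"
    local major_minor
    major_minor="$(echo "$version" | cut -d. -f1,2)"
    echo "$HOME/R/x86_64-redhat-linux-gnu-library/$major_minor"
}

# On a compute node, after loading a module:
#   banner="$(R --version | head -n 1)"
#   r_user_lib_dir "$(r_version_from_banner "$banner")"
```

If the printed directory does not match the R version you intend to run jobs with, load the right module first and repeat the install.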
Using doParallel
The doParallel library can detect the number of CPU cores on a system and spawn one worker on each detected core. The detection is done by the "detectCores()" function, which comes from the parallel package that doParallel loads. However, we do not recommend using this function in our environment, because it reports every core on the node rather than the cores allocated by our job scheduler, IBM's Platform LSF. Sizing a cluster with "detectCores()" and then calling "makeCluster()" will result in one user's jobs affecting other users' jobs running on the same node.
Instead, we recommend forcing doParallel to operate within the confines of the job scheduler. Below are some examples:
Example 1
The following example, taken from the doParallel manual, shows basic use of the doParallel library with the core count hard-coded in the R script:
[asrini@consign ~]$ bsub -n 2 -Is bash
[asrini@node063]$ cat doParallel_test.R
library(doParallel)
registerDoParallel(cores=2)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000
ptime <- system.time({
  r <- foreach(icount(trials), .combine=cbind) %dopar% {
    ind <- sample(100, 100, replace=TRUE)
    result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
    coefficients(result1)
  }
})[3]
ptime
Note : In the above example, cores is set to 2. This is because the interactive session was launched on 2 cores (bsub -n 2 -Is bash). If more cores were requested, the doParallel script can be changed accordingly.
The script can be run as:
[asrini@node063]$ Rscript doParallel_test.R
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
elapsed
 17.159
Example 2
The following example shows how to pass the number of cores as an argument to the R script.
[asrini@consign ~]$ bsub -n 2 -Is bash
[asrini@node063 ~]$ cat doParallel_test2.R
library(doParallel)
core_count <- as.numeric(commandArgs(TRUE)[1])
cl <- makeCluster(core_count)
registerDoParallel(cl)
print(cl)
[asrini@node063 test_jobs]$ Rscript doParallel_test2.R 2
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
socket cluster with 2 nodes on host ‘localhost’
Note : In the above example, the value 2 was passed to the R script as an argument, and the script created a doParallel cluster of that size. If more cores are requested in the bsub submission, the argument passed to the script can be changed accordingly.
This can also be done as a batch (non-interactive) submission:
[asrini@consign ~]$ bsub -n 2 -R "span[hosts=1]" -e doParallel.e -o doParallel.o Rscript doParallel_test2.R 2
The error and output files from the above bsub submission contain the same information as shown previously.
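In the examples above, the core count is typed by hand in two places: the `bsub -n` request and the script argument. A minimal sketch of a wrapper that reads the count from the scheduler instead is shown below. It assumes that LSF exports the `LSB_DJOB_NUMPROC` variable (the number of slots allocated to the job) inside the job environment; check this on the cluster with `env | grep LSB_` before relying on it. The function name `lsf_core_count` and the file name `run_doparallel.sh` are hypothetical.

```shell
#!/bin/bash
# run_doparallel.sh (hypothetical name): pass the LSF slot count to the R script.

# Print the number of slots LSF allocated to this job.
# Assumption: LSF sets LSB_DJOB_NUMPROC inside the job. Fall back to 1
# when the variable is unset, empty, zero, or not a plain integer.
lsf_core_count() {
    local n="${LSB_DJOB_NUMPROC:-1}"
    case "$n" in
        ''|*[!0-9]*|0) n=1 ;;
    esac
    echo "$n"
}

# Inside a job, the script from Example 2 could then be started as:
#   Rscript doParallel_test2.R "$(lsf_core_count)"
```

With the last line uncommented, the wrapper could be submitted with the same bsub options as the batch example above, and the cluster size would follow whatever value is given to `-n`.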