HPC:R

From HPC wiki
Revision as of 20:39, 14 December 2020 by Asrini (talk | contribs) (→‎Usage)

R (programming language)

There are currently several versions of the R programming language installed across all the HPC compute nodes.

Running R programs on the PMACS HPC system

Various versions of R and other software packages are available as modules, but only on compute nodes (both interactive and non-interactive). Do not try to run them on the head node. To run R interactively, first launch an interactive session:

[asrini@consign ~]$ bsub -Is bash
Job <35804293> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on node060.hpc.local>>
[asrini@node062 ~]$ 

Please read the rest of this section before launching R jobs on the PMACS HPC system.
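The head-node rule above can be enforced in a wrapper script. The following is a minimal sketch, not part of the cluster's tooling: it assumes that LSF sets the LSB_JOBID environment variable inside every job it dispatches (including interactive sessions started with bsub -Is bash), and the function names in_lsf_job and run_r_safely are hypothetical.

```shell
# in_lsf_job: succeed only when running inside an LSF job.
# Assumption: LSF exports LSB_JOBID in the environment of each job,
# so the variable is absent in a plain login shell on the head node.
in_lsf_job() {
    [ -n "${LSB_JOBID:-}" ]
}

# run_r_safely (hypothetical wrapper): refuse to start R outside a job.
run_r_safely() {
    if ! in_lsf_job; then
        echo "Not inside an LSF job; start one first with: bsub -Is bash" >&2
        return 1
    fi
    R "$@"
}
```

Calling run_r_safely --version on the head node would print the reminder and return a non-zero status instead of starting R.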

Available Versions

[asrini@node062 ~]$ module avail R
---------------------------------- /usr/share/Modules/modulefiles ----------------------------------
R-3.1.1 R-3.1.2 R-3.2.1 R-3.2.2

Usage

Currently, there is NO default version of R installed on the HPC nodes, and you must load a module for the version of your choice.

Each R version installed across the HPC nodes can be loaded as a module:

[asrini@node156 ~]$ module avail R/
------------------------------ /usr/share/Modules/modulefiles ------------------------------
R/3.1.1 R/3.1.2 R/3.2.1 R/3.2.2 R/3.3.0 R/3.3.1 R/3.4.2 R/3.4.3 R/3.5.1 R/3.6.3 R/4.0.2


Using one of the installed versions:

[asrini@node156 ~]$ module show R/4.0.2 
-------------------------------------------------------------------
/usr/share/Modules/modulefiles/R/4.0.2:

module-whatis	 GNU R 
prepend-path	 CPATH /opt/software/R/4.0.2/lib64/R/include 
prepend-path	 PATH /opt/software/R/4.0.2/bin 
prepend-path	 LD_LIBRARY_PATH /opt/software/R/4.0.2/lib64 
prepend-path	 LIBRARY_PATH /opt/software/R/4.0.2/lib64 
prepend-path	 MANPATH /opt/software/R/4.0.2/share/man 
-------------------------------------------------------------------

[asrini@node156 ~]$ module load R/4.0.2 

[asrini@node156 ~]$ R --version
R version 4.0.2 (2020-06-22) -- "Taking Off Again"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
https://www.gnu.org/licenses/.
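Because there is no default R, a script may want to confirm that the intended module was actually loaded before doing any work. The sketch below parses the first line of the R --version banner shown above; the helper names r_version_from_banner and require_r_version are hypothetical and only illustrate the idea.

```shell
# r_version_from_banner: read "R --version" output on stdin and print the
# version number found on its first line, e.g. "4.0.2".
r_version_from_banner() {
    sed -n '1s/^R version \([0-9][0-9.]*\).*/\1/p'
}

# require_r_version WANT (hypothetical helper): fail unless the R found on
# PATH reports exactly version WANT.
require_r_version() {
    rrv_want="$1"
    rrv_have="$(R --version 2>/dev/null | r_version_from_banner)"
    if [ "$rrv_have" != "$rrv_want" ]; then
        echo "Expected R $rrv_want but found '${rrv_have:-none}'; try: module load R/$rrv_want" >&2
        return 1
    fi
}
```

A job script could then run module load R/4.0.2 followed by require_r_version 4.0.2 || exit 1.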

Pre-installed R libraries/packages

Several difficult-to-install R packages/libraries have been installed across all the HPC nodes for every version of R on the cluster. These typically include various Bioconductor packages and some other packages that would otherwise need administrative privileges to install. If a package you need is not already installed, you are encouraged to install it yourself. To see a full listing of installed R packages, run the following command in an interactive shell:

[asrini@node062 ~]$ echo 'library()' | R --slave

R package installation

Most R packages can be installed using the normal install.packages("<package_name>") approach.

However, certain R packages, while not difficult to install, have dependencies that require special attention during the install process. Some of those packages are listed below:

monocle3

Monocle 3 is an analysis toolkit for single-cell RNA-Seq experiments. While this package can be installed via Bioconductor, it has other dependencies, namely GDAL/R-GDAL.

NOTE: If you use Anaconda (a.k.a. "conda") to manage packages within your home or project directories, version conflicts may arise. You may have to disable your conda environments, using these instructions, for the monocle3 installation to succeed on our HPC system.


To install monocle3 on our HPC, follow the steps below:


Step 1: Install GDAL by following these instructions

Step 2: Install PROJ.4 by following these instructions

Step 3: Launch an interactive session (if you haven't already)

[asrini@consign ~]$ bsub -Is bash
Job <57475473> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on node061.hpc.local>>

[asrini@node061 ~]$ 

Step 4: Set up all the necessary environment variables

export PATH=$HOME/software/bin:$PATH
export CPATH=$HOME/software/include:$CPATH
export LD_LIBRARY_PATH=$HOME/software/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=$HOME/software/lib:$LIBRARY_PATH

Step 5: Install monocle3. NOTE: The instructions below install monocle3 against R v3.5.1. To install it against another version, load the appropriate module instead.

module load R/3.5.1

R
install.packages("devtools")
devtools::install_github('cole-trapnell-lab/leidenbase')
devtools::install_github('cole-trapnell-lab/monocle3')

NOTE: To use monocle3 after it has been installed successfully, you must repeat Step 4 (above) in each new session to set all the environment variables correctly before loading the monocle3 R library.
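Since the Step 4 exports have to be repeated in every session, it can be convenient to keep them in a small shell function. The following is a sketch under stated assumptions: prepend_path and setup_local_software are hypothetical names, and $HOME/software is the install prefix used in Step 4. Unlike a plain export VAR=DIR:$VAR, this version leaves no trailing colon when the variable starts out empty (an empty entry is generally interpreted as the current directory) and does not add the same directory twice.

```shell
# prepend_path VAR DIR: put DIR at the front of the colon-separated list
# stored in the variable named VAR, then export it.
prepend_path() {
    pp_var="$1"
    pp_dir="$2"
    eval "pp_cur=\${$pp_var:-}"              # current value, or empty if unset
    case ":$pp_cur:" in
        *":$pp_dir:"*) ;;                    # DIR already listed: leave as-is
        *) pp_cur="$pp_dir${pp_cur:+:$pp_cur}" ;;
    esac
    eval "export $pp_var=\"\$pp_cur\""
}

# setup_local_software: the same four variables as Step 4.
setup_local_software() {
    prepend_path PATH            "$HOME/software/bin"
    prepend_path CPATH           "$HOME/software/include"
    prepend_path LD_LIBRARY_PATH "$HOME/software/lib"
    prepend_path LIBRARY_PATH    "$HOME/software/lib"
}
```

With these definitions sourced from a file of your choice, running setup_local_software before starting R has the same effect as typing the four export lines.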

rgdal

The R-GDAL (rgdal) package provides bindings to the 'Geospatial' Data Abstraction Library ('GDAL') (>= 1.11.4) and access to projection/transformation operations from the 'PROJ.4' library.

NOTE: If you use Anaconda (a.k.a. "conda") to manage packages within your home or project directories, version conflicts may arise. You may have to disable your conda environments, using these instructions, for the rgdal installation to succeed on our HPC system.


To install rgdal on our HPC, follow the steps below:


Step 1: Install GDAL by following these instructions

Step 2: Install PROJ.4 by following these instructions

Step 3: Launch an interactive session (if you haven't already)

[asrini@consign ~]$ bsub -Is bash
Job <57475473> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on node061.hpc.local>>

[asrini@node061 ~]$ 

Step 4: Set up all the necessary environment variables

export PATH=$HOME/software/bin:$PATH
export CPATH=$HOME/software/include:$CPATH
export LD_LIBRARY_PATH=$HOME/software/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=$HOME/software/lib:$LIBRARY_PATH

Step 5: Install rgdal. NOTE: The instructions below install rgdal against R v3.5.1. To install it against another version, load the appropriate module instead.

module load R/3.5.1

R

From within R:

> install.packages("rgdal");


NOTE: To use rgdal after it has been installed successfully, you must repeat Step 4 (above) in each new session to set all the environment variables correctly before loading the rgdal R library.
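Before running install.packages("rgdal"), it may be worth checking that the GDAL and PROJ.4 command-line tools are visible on PATH, since a missing Step 4 is an easy mistake to make. The sketch below assumes that your GDAL build installed gdal-config and your PROJ.4 build installed proj; the function names missing_tools and check_rgdal_prereqs are hypothetical.

```shell
# missing_tools NAME...: print each NAME that cannot be found on PATH,
# one per line. Prints nothing when every tool is available.
missing_tools() {
    for mt_name in "$@"; do
        command -v "$mt_name" >/dev/null 2>&1 || printf '%s\n' "$mt_name"
    done
}

# check_rgdal_prereqs (hypothetical helper).
# Assumption: "gdal-config" and "proj" were installed under
# $HOME/software/bin by Steps 1 and 2; adjust the names if your builds differ.
check_rgdal_prereqs() {
    mt_missing="$(missing_tools gdal-config proj)"
    if [ -n "$mt_missing" ]; then
        echo "Not on PATH: $mt_missing (did you run Step 4?)" >&2
        return 1
    fi
}
```

Running check_rgdal_prereqs && R starts R only when both tools are found.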

Running R doParallel jobs

The doParallel package is a "parallel backend" for the foreach package. It provides a mechanism needed to execute foreach loops in parallel. This section specifically covers the use of the R doParallel library on the PMACS cluster nodes.

Details regarding the package and its manual can be found on CRAN

The following sections describe how to install and run doParallel jobs on our HPC.


Install doParallel

The doParallel library is not installed by default, so you will have to install it yourself.

[asrini@consign ~]$ bsub -Is bash

[asrini@node063 ~]$ R

> install.packages("doParallel");

Note 1: The doParallel library will be installed under $HOME/R/x86_64-redhat-linux-gnu-library/3.0/

Note 2: If an alternative version of R is preferred, make sure you load the appropriate module for that version before attempting the install. Also make sure you are using that specific version of R before running jobs.

Using doParallel

The parallel package, which is loaded along with doParallel, can detect the number of CPU cores on a system through its "detectCores()" function, so that one worker can be spawned on each detected core. However, we do not recommend using this function in our environment, because it reports every core on the node rather than the cores allocated to your job by our job scheduler, IBM's Platform LSF. Sizing a cluster from "detectCores()" (for example, by passing its result to "makeCluster()") will result in one user's jobs affecting other users' jobs running on the same node.

Instead, we recommend forcing doParallel to operate within the confines of the job scheduler. Below are some examples:

Example 1

The following example, taken from the doParallel manual, shows basic use of the doParallel library with the core count hard-coded in the R script:


[asrini@consign ~]$ bsub -n 2 -Is bash

[asrini@node063]$ cat doParallel_test.R
library(doParallel)

registerDoParallel(cores=2)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000
ptime <- system.time({
r <- foreach(icount(trials), .combine=cbind) %dopar% {
   ind <- sample(100, 100, replace=TRUE)
   result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
   coefficients(result1)
   }
})[3]
ptime

Note: In the above example, cores is set to 2 because the interactive session was launched on 2 cores (bsub -n 2 -Is bash). If more cores are requested, change the value in the script accordingly.

The script can be run as:


[asrini@node063]$ Rscript doParallel_test.R
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
elapsed
 17.159


Example 2

The following example shows how to pass the number of cores as an argument to the R script.


[asrini@consign ~]$ bsub -n 2 -Is bash

[asrini@node063 ~]$ cat doParallel_test2.R

library(doParallel)
core_count <- as.numeric(commandArgs(TRUE)[1])
cl <- makeCluster(core_count)
registerDoParallel(cl)
print(cl)


[asrini@node063 test_jobs]$ Rscript doParallel_test2.R 2
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
socket cluster with 2 nodes on host ‘localhost’

Note: In the above example, the value 2 was passed to the R script as a command-line argument, and the script created a doParallel cluster of that size. If more cores are requested in the bsub submission, change the argument passed to the script accordingly.
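To avoid typing the core count twice (once for bsub -n and once for the script), the argument can be derived from the job's own environment. This is a sketch, assuming that LSF exports LSB_DJOB_NUMPROC with the number of slots allocated to the job; the function name lsf_core_count is hypothetical, and it falls back to 1 when the variable is missing or not a plain positive integer.

```shell
# lsf_core_count: print the number of cores LSF allocated to this job.
# Assumption: LSF sets LSB_DJOB_NUMPROC inside the job environment.
# Falls back to 1 if the variable is unset, empty, non-numeric,
# or starts with a zero.
lsf_core_count() {
    case "${LSB_DJOB_NUMPROC:-}" in
        ''|*[!0-9]*|0*) echo 1 ;;
        *) echo "$LSB_DJOB_NUMPROC" ;;
    esac
}
```

Inside an interactive session, the script from Example 2 could then be started as Rscript doParallel_test2.R "$(lsf_core_count)".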

This can also be done as a batch (non-interactive) submission:


[asrini@consign ~]$ bsub -n 2 -R "span[hosts=1]" -e doParallel.e -o doParallel.o Rscript doParallel_test2.R 2

The error and output files from the above bsub submission contain the same information as shown previously.
