HPC:GPU

This section describes how to use the GPU resources available via the HPC system. (This page is currently being modified.)

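Queue 'gpu': There are currently two GPU nodes available in this queue. Below are the hardware specifications:

2x GPU nodes; each configured with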
  - 2x 22-core Intel Xeon E5-2699 v4 2.20GHz CPUs (88 threads per node, with hyperthreading turned on)
  - 512GB RAM
  - 1x Nvidia Tesla P100 16GB GPU Card (3584 CUDA cores & 16GB RAM per card)
  - 100 Gb/s InfiniBand connection to the GPFS file system
  - 1.6TB dedicated scratch space provided by local NVMe

Queue 'gpu1': Two new GPU nodes are available in this queue, running Rocky Linux 9.2. Below are the hardware specifications:

2x GPU nodes; each configured with
  - 2x 36-core Intel(R) Xeon(R) Platinum 8452Y @ 3.2GHz CPUs (144 threads per node, with hyperthreading turned on)
  - 1TB RAM
  - 4x Nvidia GH100 (H100 SXM5 80GB) GPU cards (16896 CUDA cores & 80GB RAM per card)
  - 200 Gb/s InfiniBand connection to the GPFS file system
  - 1.5TB dedicated scratch space provided by local NVMe

Queue 'gpu2': There are currently four GPU nodes available in this queue. Below are the hardware specifications:

4x GPU nodes; each configured with
  - 2x 6-core Intel(R) Xeon(R) Gold 6348 @ 2.60GHz CPUs (12 threads per node, with hyperthreading turned on)
  - 192GB RAM
  - 1x Nvidia Tesla T4 (TU104GL) 16GB GPU Card (2560 CUDA cores & 16GB RAM per card)
  - 56 Gb/s InfiniBand connection to the GPFS file system
  - 1.6TB dedicated scratch space provided by local NVMe

Software

The GPU nodes have the same set of available software as the rest of the compute nodes; the full list of available software is documented separately.

In addition, two versions of CUDA are available as modules:

$ module avail CUDA

--------------------------------------------------------------------------------- /usr/share/Modules/modulefiles ----------------------------------------------------------------------------------
CUDA/10.1.243 CUDA/9.2.148

Using CUDA

To use one of the available versions of CUDA, load the appropriate module. Nodes in the gpu and gpu2 queues also have CUDA 11.7 installed by default, and nodes in the gpu1 queue have CUDA 12.3 installed by default.

NOTE: The following commands MUST be run from within a job, either interactive or non-interactive.

 
-bash-4.2$ module load CUDA/10.1.243 

-bash-4.2$ which nvcc
/opt/software/CUDA/10.1.243/bin/nvcc

-bash-4.2$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

Running jobs on GPU nodes

Both interactive and non-interactive jobs can be run on the GPU nodes. At present, the GPU nodes are available via three dedicated queues (gpu, gpu1 and gpu2):

 
[bsubram@consign ~]$ bqueues | grep gpu
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP 
gpu              30  Open:Active       -    -    -    -    22     0    22     0
gpu1             30  Open:Active       -    -    -    -     0     0     0     0
gpu2             30  Open:Active       -   48    -    -     0     0     0     0
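
To see at a glance which GPU queue has free capacity, the bqueues output above can be reduced to its pending and running counts. The following is a minimal sketch, assuming the default column layout shown above (PEND in column 9, RUN in column 10); the function name gpu_queue_load is hypothetical and not part of LSF.

```shell
# Summarise pending/running jobs for every queue whose name starts with "gpu".
# Assumes the default bqueues columns: ... NJOBS PEND RUN SUSP.
gpu_queue_load() {
    awk '$1 ~ /^gpu/ { printf "%s pending=%s running=%s\n", $1, $9, $10 }'
}

# Usage on the cluster (requires LSF):
#   bqueues | gpu_queue_load
```

If the site has customised the bqueues output format, the column numbers will need adjusting.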

Interactive jobs

To launch an interactive job on one of the GPU nodes, use the usual bsub command with the "-q gpu" option (or "-q gpu1" / "-q gpu2" for the other GPU queues):

[bsubram@consign ~]$ bsub -q gpu -Is bash
Job <63866682> is submitted to queue <gpu>.
<<Waiting for dispatch ...>>
<<Starting on gpunode02.hpc.local>>

[bsubram@gpunode02 ~]$ module avail CUDA

--------------------------------------------------------------------------------- /usr/share/Modules/modulefiles ----------------------------------------------------------------------------------
CUDA/10.1.243 CUDA/9.2.148

[bsubram@gpunode02 ~]$ module load CUDA/10.1.243 

[bsubram@gpunode02 ~]$ which nvcc
/opt/software/CUDA/10.1.243/bin/nvcc

[bsubram@gpunode02 ~]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
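
Once inside an interactive session, it can be useful to confirm that a GPU is actually visible before starting work. The sketch below assumes that nvidia-smi (shipped with the NVIDIA driver) is on the PATH of the GPU nodes; the function name check_gpu is hypothetical.

```shell
# Report the GPUs visible on the current host, or explain why none were found.
check_gpu() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found; this does not look like a GPU node" >&2
        return 1
    fi
    # One CSV row per GPU: model name, total memory, driver version.
    nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv
}

# Usage inside an interactive job:
#   check_gpu
```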

TensorFlow Example

The example below shows how to install TensorFlow on a GPU node in our HPC environment. First, launch an interactive session:

[bsubram@consign ~]$ bsub -q gpu -Is bash
Job <63866699> is submitted to queue <gpu>.
<<Waiting for dispatch ...>>
<<Starting on gpunode02.hpc.local>>

After launching an interactive session, load the CUDA and python modules and set up a new virtual environment in which to install TensorFlow:


[bsubram@gpunode02 ~]$ module load CUDA/10.1.243 

[bsubram@gpunode02 python_envs]$ module load python/3.6.3

[bsubram@gpunode02 python_envs]$ virtualenv my_tensorflow_testenv --system-site-packages

[bsubram@gpunode02 python_envs]$ source my_tensorflow_testenv/bin/activate

(my_tensorflow_testenv) [bsubram@gpunode02 python_envs]$ pip install tensorflow

# With GPU support:
(my_tensorflow_testenv) [bsubram@gpunode02 python_envs]$ pip install tensorflow_gpu

NOTE 1: The above commands install TensorFlow and its dependencies inside the newly created virtual environment, so the environment must be activated again whenever the installed packages are needed.

Verify that the package was installed correctly:

 
(my_tensorflow_testenv) [bsubram@gpunode02 ~]$ python -c "import tensorflow as tf; print(tf.__version__)"

NOTE 2: The above command will print out warning messages about a missing library 'libnvinfer.so.6'. This can be ignored.
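
Printing the version only shows that the package imports; it does not show that TensorFlow can see the GPU. The sketch below is one way to check this, assuming TensorFlow 2.1 or newer (where tf.config.list_physical_devices is available) and that the virtual environment is already activated. The function name require_tf_gpu is hypothetical.

```shell
# Exit non-zero unless TensorFlow reports at least one usable GPU.
# Assumes "python" is the interpreter of the activated virtual environment.
require_tf_gpu() {
    python - <<'EOF'
import sys
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print("GPUs visible to TensorFlow:", gpus)
sys.exit(0 if gpus else 1)
EOF
}

# Usage inside a job, after activating the environment:
#   require_tf_gpu || echo "TensorFlow cannot see a GPU"
```

An empty list usually means that the installed TensorFlow build has no GPU support or could not find matching CUDA libraries.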

Non-interactive jobs

Non-interactive jobs are submitted to the GPU queue in the same way as to any other queue, with the addition of the "-q gpu" option.

For example:

[bsubram@consign ~]$ bsub -q gpu -e my_gpujob.e -o my_gpujob.o sh mygpucode.sh 
Job <63867209> is submitted to queue <gpu>.

LSF Job script for GPU jobs

Below is a sample LSF job script that can be adapted for running GPU-bound jobs on our HPC system. GPU parameters can also be supplied via the "-gpu" option to request specific GPU resources, such as GPU memory.

[bsubram@consign ~]$ cat lsf_GPU_job.sh 
#!/bin/bash
#BSUB -J GPU_job1 
#BSUB -o GPU_job1.%J.out
#BSUB -e GPU_job1.%J.error
#BSUB -n 2 # Requesting 2 CPU CORES
#BSUB -M 10240 # Requesting 10GB RAM
#BSUB -R "span[hosts=1] rusage[mem=10240]"
#BSUB -gpu "num=1:mode=shared:gmem=8192"
#BSUB -q gpu

echo "GPU Job 1"
source my_tensorflow_testenv/bin/activate
echo "tensorflow version:" 
python -c "import tensorflow as tf; print(tf.__version__)"
echo "python version:"
python -V
sleep 10

NOTE 1: The above example assumes that a virtual environment has already been created and activates it for the job.
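
The same pattern can be adapted to the multi-GPU nodes in the gpu1 queue. The script below is a sketch rather than a tested site configuration: the job name, core count and GPU count are illustrative assumptions, and it relies on LSF exporting CUDA_VISIBLE_DEVICES for GPU jobs, which depends on how LSF is configured.

```shell
#!/bin/bash
#BSUB -J GPU_job2
#BSUB -o GPU_job2.%J.out
#BSUB -e GPU_job2.%J.error
# Request 4 CPU cores, all on the same node
#BSUB -n 4
#BSUB -R "span[hosts=1]"
# Request 2 GPUs that are not shared with other jobs
#BSUB -gpu "num=2:mode=exclusive_process"
#BSUB -q gpu1

# Record which GPUs were allocated (falls back to a message if LSF did not set it).
gpus="${CUDA_VISIBLE_DEVICES:-none reported}"
echo "Allocated GPUs: $gpus"
```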

To submit the above script:

[bsubram@consign ~]$ bsub < lsf_GPU_job.sh

NOTE 2: The "<" in the above command is required: it feeds the script to bsub on standard input so that the #BSUB directives are read.
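
After submission, the job ID printed by bsub is needed for monitoring. The sketch below extracts it from the "Job <ID> is submitted ..." line shown earlier; the function name job_id_from_bsub is hypothetical, while bjobs, bpeek and bkill are standard LSF commands.

```shell
# Extract the numeric job ID from bsub's "Job <ID> is submitted ..." line.
job_id_from_bsub() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Usage on the cluster (requires LSF):
#   jobid=$(bsub < lsf_GPU_job.sh | job_id_from_bsub)
#   bjobs -l "$jobid"    # detailed status, including the execution host
#   bpeek "$jobid"       # output produced by the job so far
#   bkill "$jobid"       # cancel the job if needed
```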