Difference between revisions of "HPC:GPU"
Revision as of 21:18, 9 April 2024
This section describes how to use the GPU resources available via the HPC system. (Currently being modified)
Hardware
Queue 'gpu': There are currently two GPU nodes available via the HPC. Below are the hardware specifications:
- 2x GPU nodes; each configured with
- 2x 22-core Intel Xeon E5-2699 v4 2.20GHz CPUs (88 threads per node, with hyperthreading turned on)
- 512GB RAM
- 1x Nvidia Tesla P100 16GB GPU Card (3584 CUDA cores & 16GB RAM per card)
- 100 Gb/s InfiniBand connection to the GPFS file system
- 1.6TB dedicated scratch space provided by local NVMe
Queue 'gpu1': Two new GPU nodes are available in this queue, running Rocky Linux 9.2. Below are the hardware specifications:
- 2x GPU nodes; each configured with
- 2x 36-core Intel(R) Xeon(R) Platinum 8452Y @ 3.2GHz CPUs (144 threads per node, with hyperthreading turned on)
- 1TB RAM
- 4x Nvidia GH100 [H100 SXM5 80GB] GPU cards (16896 CUDA cores & 80GB RAM per card)
- 200 Gb/s InfiniBand connection to the GPFS file system
- 1.5TB dedicated scratch space provided by local NVMe
Queue 'gpu2': There are currently four GPU nodes available in this Queue. Below are the hardware specifications:
- 4x GPU nodes; each configured with
- 2x 6-core Intel(R) Xeon(R) Gold 6348 @ 2.60GHz CPUs (12 threads per node, with hyperthreading turned on)
- 192GB RAM
- 1x Nvidia Tesla T4 (TU104GL) 16GB GPU Card (2560 CUDA cores & 16GB RAM per card)
- 56 Gb/s InfiniBand connection to the GPFS file system
- 1.6TB dedicated scratch space provided by local NVMe
Software
The GPU nodes have the same set of available software as the rest of the compute nodes; see the full list of available software.
In addition, two versions of CUDA are available as modules:
    $ module avail CUDA
    ------------------------- /usr/share/Modules/modulefiles -------------------------
    CUDA/10.1.243  CUDA/9.2.148
Using CUDA
To use one of the available versions of CUDA, load the appropriate module. Nodes in the gpu and gpu2 queues also come with CUDA 11.7 pre-installed, and nodes in the gpu1 queue come with CUDA 12.3 pre-installed.
NOTE: The following command MUST be run via either an interactive or non-interactive job
    -bash-4.2$ module load CUDA/10.1.243
    -bash-4.2$ which nvcc
    /opt/software/CUDA/10.1.243/bin/nvcc
    -bash-4.2$ nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2019 NVIDIA Corporation
    Built on Sun_Jul_28_19:07:16_PDT_2019
    Cuda compilation tools, release 10.1, V10.1.243
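Because the CUDA module only takes effect inside a job, a job script may want to stop early when nvcc is not on the PATH. The following is a minimal sketch rather than part of the site's tooling; the function name require_nvcc is hypothetical, and it relies on the release number being on the last line of the output, as it is for the CUDA 10.1 output shown above.

```shell
# Hypothetical helper: fail early if the CUDA compiler is not on PATH.
require_nvcc() {
    if ! command -v nvcc >/dev/null 2>&1; then
        echo "nvcc not found: load a CUDA module inside a job first" >&2
        return 1
    fi
    # For CUDA 10.1 the last line of `nvcc --version` carries the release.
    nvcc --version | tail -n 1
}
```

Calling require_nvcc directly after the module load line in a job script lets the job stop before any compilation is attempted.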
Running jobs on GPU nodes
Both interactive and non-interactive jobs can be run on the GPU nodes. At present, the GPU nodes are available via three dedicated queues: gpu, gpu1 and gpu2.
    [asrini@consign ~]$ bqueues | grep gpu
    QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
    gpu              30  Open:Active       -    -    -    -    22     0    22     0
    gpu1             30  Open:Active       -    -    -    -     0     0     0     0
    gpu2             30  Open:Active       -   48    -    -     0     0     0     0
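To see at a glance which GPU queue is least busy, the bqueues output can be reduced to the job counters. This is a sketch under the assumption that the column order matches the output shown above (NJOBS, PEND and RUN in columns 8 to 10); the function name gpu_queue_summary is hypothetical.

```shell
# Hypothetical helper: read `bqueues` output on stdin and print the
# job counters for every queue whose name starts with "gpu".
gpu_queue_summary() {
    awk '$1 ~ /^gpu/ { printf "%s njobs=%s pend=%s run=%s\n", $1, $8, $9, $10 }'
}
```

It is meant to be used as `bqueues | gpu_queue_summary` on a login node.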
Interactive jobs
To launch an interactive job on one of the GPU nodes, use the usual bsub command with the "-q gpu" option (or "-q gpu1" / "-q gpu2" for the other GPU queues):
    [asrini@consign ~]$ bsub -q gpu -Is bash
    Job <63866682> is submitted to queue <gpu>.
    <<Waiting for dispatch ...>>
    <<Starting on gpunode02.hpc.local>>
    [asrini@gpunode02 ~]$ module avail CUDA
    ------------------------- /usr/share/Modules/modulefiles -------------------------
    CUDA/10.1.243  CUDA/9.2.148
    [asrini@gpunode02 ~]$ module load CUDA/10.1.243
    [asrini@gpunode02 ~]$ which nvcc
    /opt/software/CUDA/10.1.243/bin/nvcc
    [asrini@gpunode02 ~]$ nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2019 NVIDIA Corporation
    Built on Sun_Jul_28_19:07:16_PDT_2019
    Cuda compilation tools, release 10.1, V10.1.243
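Once the interactive session has started, it can be useful to confirm that the node exposes the expected number of cards (one per node in the gpu and gpu2 queues, four per node in gpu1, according to the hardware list). The sketch below assumes that nvidia-smi is on the PATH of the GPU nodes, which this page does not state; the function name count_visible_gpus is hypothetical.

```shell
# Hypothetical helper: count the GPUs reported by the NVIDIA driver.
# `nvidia-smi -L` prints one line per GPU, each starting with "GPU <index>:".
count_visible_gpus() {
    nvidia-smi -L | grep -c '^GPU '
}
```

Running count_visible_gpus at the prompt of the GPU node prints a single number that can be compared against the hardware list.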
TensorFlow Example
The example below shows how to install TensorFlow on a GPU node in our HPC environment. First, launch an interactive session:
    [asrini@consign ~]$ bsub -q gpu -Is bash
    Job <63866699> is submitted to queue <gpu>.
    <<Waiting for dispatch ...>>
    <<Starting on gpunode02.hpc.local>>
After launching an interactive session, load the CUDA and python modules and set up a new virtual environment in which to install TensorFlow:
    [asrini@gpunode02 ~]$ module load CUDA/10.1.243
    [asrini@gpunode02 python_envs]$ module load python/3.6.3
    [asrini@gpunode02 python_envs]$ virtualenv my_tensorflow_testenv --system-site-packages
    [asrini@gpunode02 python_envs]$ source my_tensorflow_testenv/bin/activate
    (my_tensorflow_testenv) [asrini@gpunode02 python_envs]$ pip install tensorflow
    # With GPU support:
    (my_tensorflow_testenv) [asrini@gpunode02 python_envs]$ pip install tensorflow_gpu
NOTE 1: The above commands install TensorFlow and its dependencies within the newly created virtual environment, so this environment must be activated again whenever the installed packages are needed.
Verify that the package was installed correctly
    (my_tensorflow_testenv) [asrini@gpunode02 ~]$ python -c "import tensorflow as tf; print(tf.__version__)"
NOTE 2: The above command will print out warning messages about a missing library 'libnvinfer.so.6'. This can be ignored.
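Printing the version only shows that the package imports; it does not show that TensorFlow can see the GPU. One possible follow-up check is sketched below. It assumes a TensorFlow 2.1 or later install (where tf.config.list_physical_devices is available) and an activated virtual environment; the function name tf_sees_gpu is hypothetical.

```shell
# Hypothetical helper: succeed only if TensorFlow reports at least one GPU.
# Assumes TensorFlow >= 2.1 in the currently active virtual environment.
tf_sees_gpu() {
    python -c "
import sys
import tensorflow as tf
gpus = tf.config.list_physical_devices('GPU')
print('GPUs visible to TensorFlow:', len(gpus))
sys.exit(0 if gpus else 1)
"
}
```

In a job script, `tf_sees_gpu || exit 1` would abort the job when TensorFlow falls back to CPU only.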
Non-interactive jobs
Non-interactive jobs are submitted to a GPU queue like jobs in any other queue, with the addition of the "-q gpu" option (or "-q gpu1" / "-q gpu2").
For example:
    [asrini@consign ~]$ bsub -q gpu -e my_gpujob.e -o my_gpujob.o sh mygpucode.sh
    Job <63867209> is submitted to queue <gpu>.
LSF Job script for GPU jobs
Below is a sample LSF job script that can be adapted for running GPU-bound jobs on our HPC:
    [asrini@consign ~]$ cat lsf_GPU_job.sh
    #!/bin/bash
    #BSUB -J GPU_job1
    #BSUB -o GPU_job1.%J.out
    #BSUB -e GPU_job1.%J.error
    #BSUB -n 2       # Requesting 2 CPU cores
    #BSUB -M 10240   # Requesting 10GB RAM
    #BSUB -R "span[hosts=1] rusage[mem=10240]"
    #BSUB -q gpu

    echo "GPU Job 1"
    source my_tensorflow_testenv/bin/activate
    echo "tensorflow version:"
    python -c "import tensorflow as tf; print(tf.__version__)"
    echo "python version:"
    python -V
    sleep 10
NOTE 1: The above example assumes that a virtual environment has already been created and activates it for the job.
To submit the above script:
    [asrini@consign ~]$ bsub < lsf_GPU_job.sh
NOTE 2: The "<" in the above command is required.
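The sample script above is tied to the gpu queue. When the same job needs to run on gpu1 or gpu2, one option is to generate the script with the queue as a parameter. The sketch below reuses the #BSUB header from the sample script; the function name make_gpu_job is hypothetical, and the resource values are simply copied from the sample and may need adjusting per queue.

```shell
# Hypothetical helper: print an LSF job script for one of the GPU queues.
# Usage: make_gpu_job <gpu|gpu1|gpu2> <command> [args...]
make_gpu_job() {
    queue="$1"
    shift
    case "$queue" in
        gpu|gpu1|gpu2) ;;
        *) echo "unknown GPU queue: $queue" >&2; return 1 ;;
    esac
    # Header copied from the sample script; only the queue and command vary.
    cat <<EOF
#!/bin/bash
#BSUB -J GPU_job1
#BSUB -o GPU_job1.%J.out
#BSUB -e GPU_job1.%J.error
#BSUB -n 2
#BSUB -M 10240
#BSUB -R "span[hosts=1] rusage[mem=10240]"
#BSUB -q $queue
$*
EOF
}
```

For example, `make_gpu_job gpu2 sh mygpucode.sh > lsf_GPU_job.sh` writes a script that can then be submitted with `bsub < lsf_GPU_job.sh` as described above.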