Difference between revisions of "HPC:GPU"
Revision as of 21:33, 4 April 2024
This section describes how to use the GPU resources available via the HPC system. (Currently being modified)
Hardware
Queue 'gpu': There are currently two GPU nodes available via the HPC. Below are the hardware specifications:
- 2x GPU nodes; each configured with
  - 2x 22-core Intel Xeon E5-2699 v4 2.20GHz CPUs (88 threads per node, with hyperthreading turned on)
  - 512GB RAM
  - 1x Nvidia Tesla P100 16GB GPU Card (3584 CUDA cores & 16GB RAM per card)
  - 100 Gb/s InfiniBand connection to the GPFS file system
  - 1.6TB dedicated scratch space provided by local NVMe
Queue 'gpu1': Two new GPU nodes are available in this queue, running Rocky Linux 9.2. Below are the hardware specifications:
- 2x GPU nodes; each configured with
  - 2x 22-core Intel Xeon E5-2699 v4 2.20GHz CPUs (88 threads per node, with hyperthreading turned on)
  - 512GB RAM
  - 1x Nvidia Tesla P100 16GB GPU Card (3584 CUDA cores & 16GB RAM per card)
  - 100 Gb/s InfiniBand connection to the GPFS file system
  - 1.6TB dedicated scratch space provided by local NVMe
Queue 'gpu2': There are currently four GPU nodes available in this queue. Below are the hardware specifications:
- 4x GPU nodes; each configured with
  - 2x 22-core Intel Xeon E5-2699 v4 2.20GHz CPUs (88 threads per node, with hyperthreading turned on)
  - 512GB RAM
  - 1x Nvidia Tesla P100 16GB GPU Card (3584 CUDA cores & 16GB RAM per card)
  - 100 Gb/s InfiniBand connection to the GPFS file system
  - 1.6TB dedicated scratch space provided by local NVMe
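The 88-thread figure quoted for each node follows from the CPU specification: two sockets, 22 cores per socket, and two hardware threads per core with hyperthreading on. A minimal sketch of that arithmetic; the function name `threads_per_node` is hypothetical and exists only for illustration:

```shell
# Hypothetical helper: hardware threads = sockets * cores per socket * threads per core.
threads_per_node() {
    echo $(( $1 * $2 * $3 ))
}

# 2 sockets, 22 cores each, 2 threads per core (hyperthreading on)
threads_per_node 2 22 2
```

This prints 88. Inside a job, `nproc` may report a smaller number if the scheduler restricts the job to the cores it requested.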
Software
The GPU nodes have the same software available as the rest of the compute nodes (see the full list of available software).
In addition to this, there are two versions of CUDA that are readily available:
    $ module avail CUDA
    ------------------------ /usr/share/Modules/modulefiles ------------------------
    CUDA/10.1.243  CUDA/9.2.148
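When a script should pick up the newest installed CUDA version rather than a hard-coded one, the module list can be sorted by version. The sketch below is an illustration only: `latest_cuda` is a hypothetical name, and it relies on GNU `sort -V` being available.

```shell
# Hypothetical helper: read module names (one per line) on stdin and print
# the highest CUDA version. Relies on GNU sort's -V (version sort).
latest_cuda() {
    grep '^CUDA/' | sort -V | tail -n 1
}

# Example using the two versions listed above
printf 'CUDA/10.1.243\nCUDA/9.2.148\n' | latest_cuda
```

This prints `CUDA/10.1.243`. On the cluster, something like `module -t avail CUDA 2>&1 | latest_cuda` may work, assuming the installed Modules version supports the terse `-t` flag and writes its listing to standard error; neither assumption has been checked against this system.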
Using CUDA
To use one of the available versions of CUDA, load the appropriate module.
NOTE: The following commands MUST be run from within either an interactive or a non-interactive job.
    -bash-4.2$ module load CUDA/10.1.243
    -bash-4.2$ which nvcc
    /opt/software/CUDA/10.1.243/bin/nvcc
    -bash-4.2$ nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2019 NVIDIA Corporation
    Built on Sun_Jul_28_19:07:16_PDT_2019
    Cuda compilation tools, release 10.1, V10.1.243
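Because the CUDA module has to be loaded from within a job, build scripts can fail in confusing ways when `nvcc` is missing. A small guard, sketched here under the assumption that a missing `nvcc` on the `PATH` means the module was not loaded, makes the failure explicit. The name `require_nvcc` is hypothetical.

```shell
# Hypothetical guard: succeed only if nvcc can be found on the PATH.
require_nvcc() {
    if ! command -v nvcc >/dev/null 2>&1; then
        echo "nvcc not found: load a CUDA module from within a job first" >&2
        return 1
    fi
}
```

A build step could then be written as `module load CUDA/10.1.243 && require_nvcc && nvcc --version`.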
Running jobs on GPU nodes
Both interactive and non-interactive jobs can be run on the GPU nodes. The GPU nodes are available via the dedicated queues listed under Hardware; the examples below use the 'gpu' queue.
    [asrini@consign ~]$ bqueues gpu
    QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
    gpu              30  Open:Active       -   88    -    -    41     0    41     0
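To get a quick view of how busy the queue is before submitting, the PEND and RUN columns can be pulled out of the `bqueues` output. This is a sketch that assumes the column layout shown above (ending in NJOBS, PEND, RUN, SUSP); `queue_load` is a hypothetical name.

```shell
# Hypothetical helper: summarise `bqueues` output read from stdin.
# Fields are counted from the end of each row: ... NJOBS PEND RUN SUSP
queue_load() {
    awk 'NR > 1 { printf "%s pending=%s running=%s\n", $1, $(NF-2), $(NF-1) }'
}

# Sample taken from the bqueues output above
queue_load <<'EOF'
QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
gpu 30 Open:Active - 88 - - 41 0 41 0
EOF
```

For the sample above this prints `gpu pending=0 running=41`. On the cluster the same function can be fed directly, for example `bqueues gpu | queue_load`.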
Interactive jobs
To launch an interactive job on one of the GPU nodes use the usual bsub command with the "-q gpu" option:
    [asrini@consign ~]$ bsub -q gpu -Is bash
    Job <63866682> is submitted to queue <gpu>.
    <<Waiting for dispatch ...>>
    <<Starting on gpunode02.hpc.local>>
    [asrini@gpunode02 ~]$ module avail CUDA
    ------------------------ /usr/share/Modules/modulefiles ------------------------
    CUDA/10.1.243  CUDA/9.2.148
    [asrini@gpunode02 ~]$ module load CUDA/10.1.243
    [asrini@gpunode02 ~]$ which nvcc
    /opt/software/CUDA/10.1.243/bin/nvcc
    [asrini@gpunode02 ~]$ nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2019 NVIDIA Corporation
    Built on Sun_Jul_28_19:07:16_PDT_2019
    Cuda compilation tools, release 10.1, V10.1.243
TensorFlow Example
The example below shows how to install TensorFlow on a GPU node in our HPC environment
    [asrini@consign ~]$ bsub -q gpu -Is bash
    Job <63866699> is submitted to queue <gpu>.
    <<Waiting for dispatch ...>>
    <<Starting on gpunode02.hpc.local>>
After launching an interactive session, load the CUDA and Python modules and set up a new virtual environment in which to install TensorFlow:
    [asrini@gpunode02 ~]$ module load CUDA/10.1.243
    [asrini@gpunode02 python_envs]$ module load python/3.6.3
    [asrini@gpunode02 python_envs]$ virtualenv my_tensorflow_testenv --system-site-packages
    [asrini@gpunode02 python_envs]$ source my_tensorflow_testenv/bin/activate
    (my_tensorflow_testenv) [asrini@gpunode02 python_envs]$ pip install tensorflow
    # With GPU support:
    (my_tensorflow_testenv) [asrini@gpunode02 python_envs]$ pip install tensorflow_gpu
NOTE 1: The above commands install TensorFlow and its dependencies within the newly created virtual environment, so this environment must be activated again whenever the installed packages are needed.
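Since the environment has to be re-activated in every new session or job, a small wrapper can check that it exists before sourcing it. This is only a sketch: `activate_env` is a hypothetical name, and it assumes the standard `bin/activate` layout that `virtualenv` creates.

```shell
# Hypothetical helper: activate a virtual environment if it exists,
# otherwise report the problem and return a non-zero status.
activate_env() {
    if [ -f "$1/bin/activate" ]; then
        . "$1/bin/activate"
    else
        echo "no virtual environment found at $1" >&2
        return 1
    fi
}
```

For the environment created above, this would be called as `activate_env my_tensorflow_testenv` from the directory that contains it.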
Verify that the package was installed correctly
    (my_tensorflow_testenv) [asrini@gpunode02 ~]$ python -c "import tensorflow as tf; print(tf.__version__)"
NOTE 2: The above command will print out warning messages about a missing library 'libnvinfer.so.6'. This can be ignored.
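The version check confirms that the package imports, but not that TensorFlow can see the GPU. A hedged sketch of a follow-up check, assuming TensorFlow 2.1 or later (where `tf.config.list_physical_devices` is available); `tf_list_gpus` is a hypothetical wrapper name:

```shell
# Hypothetical helper: print the GPUs that TensorFlow can see.
# Assumes TensorFlow 2.1 or later, where tf.config.list_physical_devices exists.
tf_list_gpus() {
    python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
}
```

Run `tf_list_gpus` inside the activated environment on a GPU node. An empty list (`[]`) would mean TensorFlow did not detect a GPU.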
Non-interactive jobs
Non-interactive jobs can be submitted to the GPU queue like any other queue, by adding the "-q gpu" option.
For example:
    [asrini@consign ~]$ bsub -q gpu -e my_gpujob.e -o my_gpujob.o sh mygpucode.sh
    Job <63867209> is submitted to queue <gpu>.
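When a submission is scripted, the job ID in the confirmation line is usually needed for later status checks. The sketch below extracts it, assuming `bsub` prints the confirmation in the format shown above; `job_id_from_bsub` is a hypothetical name.

```shell
# Hypothetical helper: extract the numeric job ID from bsub's confirmation line.
job_id_from_bsub() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Example using the confirmation line shown above
echo 'Job <63867209> is submitted to queue <gpu>.' | job_id_from_bsub
```

This prints `63867209`. In a script, the ID could be captured with `jobid=$(bsub -q gpu sh mygpucode.sh | job_id_from_bsub)` and then passed to `bjobs "$jobid"` to check the job's state.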
LSF Job script for GPU jobs
Below is a sample LSF job script that can be adapted for running GPU-bound jobs on our HPC:
    [asrini@consign ~]$ cat lsf_GPU_job.sh
    #!/bin/bash
    #BSUB -J GPU_job1
    #BSUB -o GPU_job1.%J.out
    #BSUB -e GPU_job1.%J.error
    #BSUB -n 2      # Requesting 2 CPU CORES
    #BSUB -M 10240  # Requesting 10GB RAM
    #BSUB -R "span[hosts=1] rusage[mem=10240]"
    #BSUB -q gpu

    echo "GPU Job 1"
    source my_tensorflow_testenv/bin/activate
    echo "tensorflow version:"
    python -c "import tensorflow as tf; print(tf.__version__)"
    echo "python version:"
    python -V
    sleep 10
NOTE 1: The above example assumes that a virtual environment has already been created and activates it for the job.
To submit the above script
[asrini@consign ~]$ bsub < lsf_GPU_job.sh
NOTE 2: The "<" in the above command is required.
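The redirect matters because, with `bsub < script`, the script arrives on standard input and `bsub` reads its `#BSUB` lines as options. Without the redirect, LSF by default treats the script as an ordinary command, and the `#BSUB` lines are just shell comments. To preview which options a script will pass, the directives can be listed beforehand. A minimal sketch; `show_bsub_options` is a hypothetical name:

```shell
# Hypothetical helper: print the #BSUB options embedded in a job script,
# one per line, including any trailing comments on those lines.
show_bsub_options() {
    sed -n 's/^#BSUB[[:space:]]*//p' "$1"
}
```

For example, `show_bsub_options lsf_GPU_job.sh` lists the job name, output files, core count, memory request, and queue set in the sample script.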