HPC:User Guide
Contents
- 1 Other Pages
- 2 Guidelines
- 3 Setting up your profile (optional, use only if the LSF commands below don't work)
- 4 Overview of Common Commands (IBM Platform LSF)
- 5 Default Submission Constraints
- 6 Example usage of LSF commands
- 7 Parallel Environment
- 8 Environment Modules
- 9 Instructions for generating Public-Private keypairs
Other Pages
Guidelines
Do not run compute-intensive tasks on the cluster head node (consign.pmacs.upenn.edu). Use an interactive node (bsub -Is bash) instead. Please read the man page for 'bsub' and the documentation linked below.
Setting up your profile (optional, use only if the LSF commands below don't work)
The LSF commands (a.k.a. "b-commands") will only work if the LSF profile file has been sourced. We recommend adding the following to your .bash_profile if it is not already there:
if [ -f /usr/share/lsf/conf/profile.lsf ]; then
    source /usr/share/lsf/conf/profile.lsf
fi
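After adding the snippet, you can check that the profile was picked up by opening a new login shell (or sourcing .bash_profile by hand) and confirming that bsub is on your PATH. This is just a quick sanity check, not part of the official setup:

$ source ~/.bash_profile
$ which bsub

If which prints a path, the LSF commands are available; if it prints nothing, the profile file was not sourced.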
Overview of Common Commands (IBM Platform LSF)
Please also refer to the man pages for bsub, bjobs, bkill, bhist, bqueues, bhosts and bacct
- bsub <job script> : submit a job description file for execution
- bsub -Is bash : request a node for interactive use
- bjobs <jobid> : show the status of a job
- bjobs -l <jobid> : show more details about the job
- bkill <jobid> : cancel a job
- bjobs -u <user> : display jobs queued or running for a single user
- bjobs -u all : to see all jobs
- bacct : to get a summary of usage
- bacct -D start_time,end_time : to get a summary of usage for specific time period
Example to get usage for the month of January, 2014:
bacct -D 2014/01/01/00:00,2014/01/31/24:00
Default Submission Constraints
Please note that all submissions on the cluster, both batch jobs and interactive sessions, have a default memory limit of 3GB. To request more memory for a job, use one of the other queues that allow jobs to request and use more than 3GB. For example, to run a batch job that needs 5GB of RAM:
bsub -q plus <job_script>
Here -q tells the LSF scheduler that you wish to use the "plus" queue, which allows up to 6GB of RAM per slot.
To request more than 6GB of RAM, use the "max_mem64" queue (-q max_mem64).
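The syntax is the same as for the plus queue; for example, to submit the same batch job to the max_mem64 queue (the script name below is a placeholder, as in the earlier examples):

$ bsub -q max_mem64 <job_script>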
Example usage of LSF commands
Queues
To check the various queues run bqueues:
$ bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
normal           30  Open:Active       -    -    -    -   331     0   331     0
interactive      30  Open:Active       -    -    -    -     3     0     3     0
plus             30  Open:Active       -    -    -    -     0     0     0     0
max_mem30        30  Open:Active       -    -    -    -    66     0    66     0
max_mem64        30  Open:Active       -    -    -    -     0     0     0     0
denovo           30  Open:Active       -    -    -    -    31     0    31     0
To get detailed information about a certain queue, run:
$ bqueues -l normal

QUEUE: normal
  -- Queue for normal workload taking less than 3GBytes of memory. Jobs that
     allocate more than 4GBytes of memory will be killed in this queue. This is
     the default queue.

PARAMETERS/STATISTICS
PRIO NICE STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN SSUSP USUSP  RSV
 30   20  Open:Active       -    -    -    -   330     0   330     0     0    0
Interval for a host to accept two jobs is 0 seconds

 SWAPLIMIT
      4 G

SCHEDULING PARAMETERS
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -

SCHEDULING POLICIES:  NO_INTERACTIVE

USERS: all
HOSTS:  compute/
RES_REQ:  rusage[mem=3000]
Compute node information
To get information on the physical compute hosts that are a part of this cluster:
$ bhosts

Or, if you know the name of the node:

$ bhosts node001.hpc.local
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
node001.hpc.local  ok              -     32      0      0      0      0      0

The above output says there is a maximum of 32 available CPU slots on the node and no jobs currently running on it. The output of bhosts below shows 27 jobs assigned to and currently running on this node:

$ bhosts node048.hpc.local
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
node048.hpc.local  ok              -     32     27     27      0      0      0

The output below shows that the node is closed, since the number of jobs running on the node is equal to the maximum CPU slot allotment for the node:

$ bhosts node025.hpc.local
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
node025.hpc.local  closed          -     32     32     32      0      0      0
Batch (non-interactive) Job submission
To run a job in batch mode:
$ bsub <script_name>
Example:
$ bsub sh sleep.sh
Job <9990021> is submitted to default queue <normal>.
Note about error and output files
By default, error and output files are not generated; they must be explicitly requested by passing the -e and -o flags to bsub. The above example then becomes:
$ bsub -e sleep.e -o sleep.o sh sleep.sh
Alternative way to run a job in batch mode:
$ bsub < <script_name>
Sample job script:
$ cat job_script.sh
#!/bin/bash
#BSUB -J my_test_job                # LSF job name
#BSUB -o my_test_job.%J.out         # Name of the job output file
#BSUB -e my_test_job.%J.error       # Name of the job error file

echo "this is a test"
sleep 15
Example job with job script:
$ bsub < job_script.sh
Job <9990032> is submitted to default queue <normal>.
Interactive Job submission
$ bsub -Is bash
Job <9990022> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on node062.hpc.local>>
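The interactive flag can be combined with the queue options described under Default Submission Constraints above. For example, to request an interactive shell with the higher memory limit of the plus queue (a hedged example that simply combines flags already shown in this guide; note that not every queue accepts interactive jobs, e.g. the detailed bqueues output above shows the normal queue is NO_INTERACTIVE, so check bqueues -l for the queue you intend to use):

$ bsub -q plus -Is bash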
Interactive Job submission with X11
Normal interactive jobs are submitted with the "bsub -Is bash" submission command as described above. However, to run a graphical (GUI) application on an interactive node, a slightly different submission process needs to be followed.
Step 1: Check if you have generated a SSH keypair.
[asrini@consign ~]$ ls $HOME/.ssh
/bin/ls: cannot access /home/asrini/.ssh: No such file or directory
The above output shows that there are no ssh keys present. If the above command lists something similar to the output below, skip to Step 2b.
[asrini@consign ~]$ ls $HOME/.ssh
authorized_keys  id_rsa  id_rsa.pub
If no keypair exists, run the following commands on the PMACS cluster head node:
Step 2a: Generate the keypair:
[asrini@consign ~]$ ssh-keygen
The output of the above command should look similar to this if you accepted the defaults (pressed "Enter/Return" for all the options):
Generating public/private rsa key pair.
Enter file in which to save the key (/home/asrini/.ssh/id_rsa):
Created directory '/home/asrini/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/asrini/.ssh/id_rsa.
Your public key has been saved in /home/asrini/.ssh/id_rsa.pub.
The key fingerprint is:
2c:bc:1f:88:8a:54:83:6e:ab:g7:29:28:c5:08:a5:da asrini@consign.hpc.local
The key's randomart image is:
+--[ RSA 2048]----+
|...              |
| o o             |
|. . o            |
|.... . .         |
|..Eo o S         |
|...... +         |
| +. o o .        |
|o.o+ . .         |
|+o+. .           |
+-----------------+
Step 2b: Copy the Public key into the authorized_keys file:
[asrini@consign ~]$ cat .ssh/id_rsa.pub >> .ssh/authorized_keys
[asrini@consign ~]$ chmod 600 .ssh/authorized_keys
Step 3: You are now ready to start an interactive session with X11 forwarding enabled.
Step 3a: First, log out of the PMACS cluster head node.
Step 3b: Log back in from your local machine with X11 forwarding enabled and verify that the $DISPLAY variable is set:
GNU/Linux and Mac users:
$ ssh consign.pmacs.upenn.edu -X
[asrini@consign ~]$ echo $DISPLAY
localhost:14.0
Windows users will need to use a combination of PuTTY and Xming, or some other X server for Windows.
Step 3c: Start the interactive session and verify that the $DISPLAY variable is still set:
[asrini@consign ~]$ bsub -XF -Is bash
Job <868591> is submitted to default queue <interactive>.
<<ssh X11 forwarding job>>
<<Waiting for dispatch ...>>
<<Starting on node062.hpc.local>>
[asrini@node062 ~]$ echo $DISPLAY
localhost:10.0
Note: The DISPLAY value on the compute node need not match the one on the head node.
You are now ready to launch your X11 based application.
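To confirm that X11 forwarding works end-to-end before starting your real application, you can launch a small test client from the interactive node. xclock is used here only as an illustration and may not be installed on every compute node:

[asrini@node062 ~]$ xclock &

If a clock window appears on your local desktop, X11 forwarding is working.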
Checking job status
Running jobs:
$ bjobs -u <your_username>
Example:
$ bjobs -u asrini
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
9990022 asrini  RUN   interactiv consign.hpc node062.hpc bash       Jan 14 15:38
Checking status of finished jobs:
Example:
$ bjobs -d -u <your_username>
$ bjobs -d -u asrini
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
9990020 asrini  DONE  normal     consign.hpc node010.hpc sleep 2    Jan 14 15:34
9990021 asrini  DONE  normal     consign.hpc node010.hpc * sleep.sh Jan 14 15:35
9990022 asrini  DONE  interactiv consign.hpc node062.hpc bash       Jan 14 15:38
Detailed information about jobs that are currently running:
Example:
$ bjobs -l <job_id>
$ bjobs -l 9990022

Job <9990022>, User <asrini>, Project <default>, Status <RUN>, Queue <umem>,
                     Command <sh sleep.sh>
Tue Jan 15 10:22:21: Submitted from host <consign.hpc.local>, CWD
                     </home/asrini/hack_area/test_jobs/>, 2 Processors Requested;

 MEMLIMIT
  1024 M
Tue Jan 15 10:22:23: Started on 2 Hosts/Processors <node036.hpc.local>
                     <node036.hpc.local> <node036.hpc.local> <node036.hpc.local>,
                     Execution Home </home/asrini>, Execution CWD
                     </home/asrini/hack_area/test_jobs/>;
Tue Jan 15 10:22:23: Resource usage collected.
                     MEM: 2 Mbytes; SWAP: 50 Mbytes; NTHREAD: 1
                     PGID: 30614; PIDs: 30614

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -
Detailed listing of a job in the PEND state:
Example:
$ bjobs -l 9990024

Job <9990022>, User <asrini>, Project <default>, Status <PEND>, Queue <umem>,
                     Command <sh sleep.sh>
Mon Jan 14 19:46:02: Submitted from host <consign.hpc.local>, CWD
                     </home/asrini/hack_area/test_jobs/>, 4 Processors Requested,
                     Requested Resources <rusage[mem=33554432]>;

 MEMLIMIT
     32 G
 PENDING REASONS:
 Job requirements for reserving resource (mem) not satisfied: 5 hosts;
 Not specified in job submission: 57 hosts;
 Load information unavailable: 4 hosts;

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -
Notice the PENDING REASONS section in the above output. The job was placed in the pending (PEND) state because insufficient resources were available when it was submitted. When the requested resources become available, the job will move to the RUN state; however, if the requested resources exceed the computational capacity of the PMACS cluster, the job will remain pending indefinitely.
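If a job is pending only because the cluster is busy, the simplest course is to wait. If the request can never be satisfied (as with the 32G-per-slot memory reservation in the example above), you can cancel the job with the bkill command described earlier and resubmit it with a smaller resource request. The job ID below refers to the pending example above:

$ bkill 9990024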
Job History
Historical information about your jobs can be found by running:
$ bhist -d -u <your_username>
Example output:
$ bhist -d -u asrini
Summary of time in seconds spent in various states:
JOBID   USER    JOB_NAME  PEND   PSUSP  RUN    USUSP  SSUSP  UNKWN  TOTAL
9990019 asrini  bash      1      0      36     0      0      0      37
9990020 asrini  sleep 2   2      0      2      0      0      0      4
9990021 asrini  *leep.sh  2      0      25     0      0      0      27
9990022 asrini  bash      0      0      395    0      0      0      395
Parallel Environment
To run a parallel job, add the -n flag to the bsub commands shown above.
For example, to run an interactive job with 16 CPUs:
$ bsub -n 16 -Is bash
Job <9990023> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on node063.hpc.local>>

$ bjobs -u asrini
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
9990023 asrini  RUN   interactiv consign.hpc node063.hpc bash       Jan 14 15:50
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
Similarly, to run a batch job with 16 CPUs:
$ bsub -n 16 <my_parallel_job>
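The processor request can also be embedded in the job script itself with a #BSUB directive, in the same style as the sample script shown earlier. The sketch below is illustrative; the job name, file names, and the command being run are placeholders:

#!/bin/bash
#BSUB -J my_parallel_job                 # LSF job name
#BSUB -n 16                              # Number of slots requested
#BSUB -o my_parallel_job.%J.out          # Name of the job output file
#BSUB -e my_parallel_job.%J.error        # Name of the job error file

# Launch the parallel application (placeholder command)
echo "running on 16 slots"

Submit the script with "bsub < job_script.sh" as before.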
Environment Modules
User-loadable modules are available if the system default packages do not meet your requirements. To see what modules are available, run the "module avail" command from an interactive session:
[asrini@consign ~]$ bsub -Is bash
Job <9990024> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on node063.hpc.local>>

[asrini@node063 ~]$ module avail

------------------- /usr/share/Modules/modulefiles -------------------
NAMD-2.9-Linux-x86_64-multicore   dot                    module-info
picard-1.96                       rum-2.0.5_05           STAR-2.3.0e
java-sdk-1.6.0                    modules                pkg-config-path
samtools-0.1.19                   STAR-hg19              java-sdk-1.7.0
mpich2-x86_64                     python-2.7.5           use.own
STAR-mm9                          ld-library-path        null
r-libs-user                       bowtie2-2.1.0          manpath
openmpi-1.5.4-x86_64              ruby-1.8.7-p374        devtoolset-2
module-cvs                        perl5lib               ruby-1.9.3-p448
Example use of modules:
[asrini@node063 ~]$ python -V
Python 2.6.6
[asrini@node063 ~]$ which python
/usr/bin/python
[asrini@node063 ~]$ module load python-2.7.5
[asrini@node063 ~]$ python -V
Python 2.7.5
[asrini@node063 ~]$ which python
/opt/software/python/python-2.7.5/bin/python
[asrini@node063 ~]$ module unload python-2.7.5
[asrini@node063 ~]$ which python
/usr/bin/python
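Modules can also be loaded inside a batch job script, so that the right software version is in place when the job runs on a compute node. Below is a minimal sketch reusing the python-2.7.5 module and the #BSUB directives shown earlier; the job and file names are placeholders, and if the module command is not available in non-interactive shells on this cluster, the modules initialization file may need to be sourced first:

#!/bin/bash
#BSUB -J python_job                  # LSF job name
#BSUB -o python_job.%J.out           # Name of the job output file
#BSUB -e python_job.%J.error         # Name of the job error file

module load python-2.7.5             # make Python 2.7.5 the active interpreter
python -V                            # should report Python 2.7.5 in the output file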
Instructions for generating Public-Private keypairs
On Mac OS X and GNU/Linux systems, run the following command from within a terminal and follow the on-screen instructions:
$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key ($HOME/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in $HOME/.ssh/id_rsa.
Your public key has been saved in $HOME/.ssh/id_rsa.pub.
The key fingerprint is:
xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx asrini@
The key's randomart image is:
+--[ RSA 2048]----+
|       .         |
|     kjweo       |
|   x B E x       |
|   * B l +       |
|    S +aser     .|
|     + +         |
|   . weq         |
|    . x 12       |
|      45+        |
+-----------------+
On Windows machines you can generate and use public keys with PuTTY. Here is a link to a YouTube channel with video tutorials for generating and using public keys.
After generating a Public-Private keypair, copy the contents of the .ssh/id_rsa.pub file to a file named .ssh/authorized_keys in your home area on the PMACS cluster.
[$USER@consign ~]$ vim .ssh/authorized_keys

Add one SSH public key per line, then save and close the file.
Then change the permissions on the file:
[$USER@consign ~]$ chmod 600 .ssh/authorized_keys
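Alternatively, many GNU/Linux and Mac systems provide ssh-copy-id, which appends your public key to the remote authorized_keys file and sets sensible permissions in one step. This is a convenience suggestion rather than part of the official PMACS instructions, and the tool may not be present on every machine:

$ ssh-copy-id your_username@consign.pmacs.upenn.edu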