Contents
- 1 Guidelines
- 2 First Steps
- 3 Overview of Common Commands (IBM Platform LSF)
- 4 Default Submission Constraints
- 5 Example usage of LSF commands
- 6 PMACS HPC Cluster Queues
- 7 Submitting jobs to alternate queues
- 8 Compute node information
- 9 Batch (non-interactive) Job submission
- 10 Interactive Job submission
- 11 Interactive Job submission with X11
- 12 Checking job status
- 13 Monitor job output
- 14 Resubmitting jobs
- 15 Job History
- 16 Job accounting and summary statistics
- 17 Parallel Environment
- 18 Job Dependency
- 19 Job Checkpoints
- 20 Environment Modules
- 21 Additional LSF Documentation
- 22 Other Pages
- 23 Azure Archive
Guidelines
Do NOT run compute-intensive tasks on the cluster head node (consign.pmacs.upenn.edu). Use an interactive node (bsub -Is bash) instead. Please read the man page for 'bsub' and the documentation linked below.
First Steps
If you have not already done so, read the Connecting to the PMACS cluster section before you begin using the PMACS cluster.
Overview of Common Commands (IBM Platform LSF)
Please also refer to the man pages for bsub, bjobs, bkill, bhist, bqueues, bhosts and bacct.
bsub <job script>
: submit a job description file for execution
bsub -Is bash
: request a node for interactive use
bjobs <jobid>
: show the status of a job
bjobs -l <jobid>
: show more details about the job
bkill <jobid>
: cancel a job
bjobs -u <user>
: display jobs queued or running for a single user
bjobs -u all
: display jobs for all users
bhist
: show a history of recently run jobs
bqueues
: show the current status of the various queues
bhosts
: show the current status of all the hosts in the cluster
bacct
: show a summary of usage
bacct -D start_time,end_time
: show a summary of usage for a specific time period
Example to get usage for the month of January, 2014:
bacct -D 2014/01/01/00:00,2014/01/31/24:00
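The start_time,end_time argument can be generated rather than typed by hand. The following is a minimal sketch: the function names days_in_month and bacct_month_range are hypothetical helpers, not part of LSF, and the range ends at 23:59 on the last day of the month, which is an assumption about how you want the month bounded.

```shell
# Print the number of days in a month.
# $1 = four-digit year, $2 = month number without a leading zero (1-12)
days_in_month() {
  case $2 in
    1|3|5|7|8|10|12) echo 31 ;;
    4|6|9|11) echo 30 ;;
    2)
      # Leap year: divisible by 4 and not by 100, or divisible by 400
      if [ $(( $1 % 4 )) -eq 0 ] && [ $(( $1 % 100 )) -ne 0 ] || [ $(( $1 % 400 )) -eq 0 ]; then
        echo 29
      else
        echo 28
      fi ;;
    *) echo "invalid month: $2" >&2; return 1 ;;
  esac
}

# Print a "start,end" string in the year/month/day/hour:minute form used by bacct -D.
bacct_month_range() {
  last_day=$(days_in_month "$1" "$2") || return 1
  printf '%04d/%02d/01/00:00,%04d/%02d/%02d/23:59\n' "$1" "$2" "$1" "$2" "$last_day"
}
```

On the head node this could be used as: bacct -D "$(bacct_month_range 2014 1)" -u <your_username>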
Default Submission Constraints
Please note that all submissions, both batch and interactive sessions, on the cluster have a default memory limit of 6GB and a default CPU core allocation of 1 vCPU core. To request more memory for a given request, you can use the -M and -R options for bsub. For example, to run a batch job that needs 10GB of RAM you can run the following command:
bsub -e <error_file> -o <output_file> -M 10240 -R "rusage[mem=10240]" sh <job_script>
Similarly, to run a batch job with more than one CPU:
bsub -e <error_file> -o <output_file> -n 4 -R "span[hosts=1]" sh <my_parallel_job_script.sh>
For a job that requires additional CPUs and RAM, use -n, -M and -R options:
bsub -e <error_file> -o <output_file> -n 4 -M 10240 -R "rusage[mem=10240] span[hosts=1]" sh <my_multicore_large_mem_job_script.sh>
See section on Parallel Environment below for more details.
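Because -M and the rusage value must carry the same number, it can help to derive both from one input. The sketch below only prints a command line; it does not submit anything. The function name print_bsub_command is hypothetical, and it assumes memory values are given in MB, as in the examples above.

```shell
# Print (do not run) a bsub command for a multi-core, large-memory batch job.
# $1 = number of CPU cores, $2 = memory in GB, $3 = job script
print_bsub_command() {
  mem_mb=$(( $2 * 1024 ))   # convert GB to MB, the unit used in the examples above
  printf 'bsub -n %d -M %d -R "rusage[mem=%d] span[hosts=1]" sh %s\n' \
    "$1" "$mem_mb" "$mem_mb" "$3"
}

print_bsub_command 4 10 my_job.sh
```

Review the printed line, add -e and -o as needed, and then run it yourself.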
Example usage of LSF commands
Documented below are a few of the commonly used LSF commands. They are NOT intended for copy-paste purposes but are intended to provide some guidelines on how these can be used.
PMACS HPC Cluster Queues
The following queues are available for use on the PMACS HPC:
1. normal (default) : Intended for non-interactive jobs, the default reservations are 1 vCPU core and 6GB of RAM. Per user job limit: 1000
2. interactive : Intended for interactive jobs, the default reservations are 1 vCPU core and 6GB of RAM. Per user job limit: 10
3. denovo : Intended for big-memory jobs. The default reservations are 1 CPU core and 24 GB of RAM. Per user job limit: 32
$ bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
normal           30  Open:Active       - 1000    -    -     0     0     0     0
interactive      30  Open:Active       -    -    -    -     1     0     1     0
denovo           30  Open:Active       -   32    -    -     0     0     0     0
To get detailed information about a certain queue, run:
$ bqueues -l normal

QUEUE: normal
  -- Queue for any kind of workload taking. By default, jobs are allocated 6 GB of memory and 1 vCPU core. Request more CPU with "-n <num_cpus>" and more RAM with " -M <required_memory_in_MB> ". This is the default queue.

PARAMETERS/STATISTICS
PRIO NICE STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN SSUSP USUSP  RSV
 30   20  Open:Active       - 1000    -    - 25410 23448  1962     0     0    0
Interval for a host to accept two jobs is 0 seconds

DEFAULT LIMITS:
 MEMLIMIT
    6 G

MAXIMUM LIMITS:
 MEMLIMIT
  250 G

SCHEDULING PARAMETERS
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -

            nfsops uptime
 loadSched     -      -
 loadStop      -      -

SCHEDULING POLICIES:  NO_INTERACTIVE
USERS: all
HOSTS:  compute/
RES_REQ:  span[ptile='!'] same[model] affinity[thread(1)]
Submitting jobs to alternate queues
To submit jobs to an alternate queue, use the "-q <queue_name>" bsub option.
For example, to submit a job to the "denovo" queue:
$ bsub -q denovo sh my_large_memory_job.sh
Job <35683661> is submitted to queue <denovo>.
Compute node information
To get information on the physical compute hosts that are a part of this cluster:
$ bhosts
Or, if you know the name of the node:
$ bhosts node001.hpc.local
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
node001.hpc.local  ok              -     32      0      0      0      0      0
The above output shows that there is a maximum of 32 available CPU slots on the node and that no jobs are currently running on it.
The output of bhosts below shows 27 jobs assigned and currently running on this node.
$ bhosts node048.hpc.local
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
node048.hpc.local  ok              -     32     27     27      0      0      0
The output below shows that the node is closed, since the number of jobs running on the node is equal to the maximum CPU slot allotment for the node.
$ bhosts node025.hpc.local
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
node025.hpc.local  closed          -     32     32     32      0      0      0
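The number of free slots on a node is MAX minus NJOBS. As a rough sketch, the hypothetical helper below (free_slots is not an LSF command) reads bhosts output on standard input and prints that difference per host, assuming the column layout shown above.

```shell
# Read bhosts output on stdin and print "<host> <free slots>" for each host.
# Assumes the column order HOST_NAME STATUS JL/U MAX NJOBS ... shown above.
free_slots() {
  # Skip the header line and any host whose MAX or NJOBS is not a number (for example "-").
  awk 'NR > 1 && $4 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ { print $1, $4 - $5 }'
}
```

On the cluster this could be used as: bhosts | free_slots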
Batch (non-interactive) Job submission
There are two ways to run non-interactive jobs: as a regular shell script or as an LSF job script.
Running shell scripts
To run a job in batch mode:
$ bsub <script_name>
Example:
$ bsub sh sleep.sh
Job <9990021> is submitted to default queue <normal>.
Note about error and output files
By default, error and output files are not generated. They must be explicitly requested by passing the -e and -o flags to bsub. The above example then becomes:
$ bsub -e sleep.e -o sleep.o sh sleep.sh
Running LSF job scripts
An alternative way to run a job in batch mode:
$ bsub < <script_name>
Sample job script:
$ cat job_script.sh
#!/bin/bash
#BSUB -J my_test_job            # LSF job name
#BSUB -o my_test_job.%J.out     # Name of the job output file
#BSUB -e my_test_job.%J.error   # Name of the job error file

echo "this is a test"
sleep 15
Example job with job script:
$ bsub < job_script.sh
Job <9990032> is submitted to default queue <normal>.
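The resource options from the Default Submission Constraints section can also be written as #BSUB directives. The script below is a sketch under assumptions: the job name is made up, the memory values assume MB units as elsewhere on this page, and LSB_DJOB_NUMPROC is an environment variable that LSF is expected to set to the number of allocated slots (the script falls back to 1 when it is unset, for example when run outside LSF).

```shell
#!/bin/bash
#BSUB -J my_multicore_job                     # hypothetical job name
#BSUB -o my_multicore_job.%J.out              # output file; %J is replaced by the job ID
#BSUB -e my_multicore_job.%J.error            # error file
#BSUB -n 4                                    # request 4 CPU cores
#BSUB -M 10240                                # memory limit, assumed to be in MB
#BSUB -R "rusage[mem=10240] span[hosts=1]"    # reserve memory; keep all cores on one node

# Number of slots LSF allocated to this job; defaults to 1 outside LSF.
slots="${LSB_DJOB_NUMPROC:-1}"
echo "Running on $(uname -n) with ${slots} slot(s)"
```

Submit it with "bsub < script_name" so that LSF reads the #BSUB lines; when run directly with bash they are treated as ordinary comments.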
Interactive Job submission
$ bsub -Is bash
Job <9990022> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on node062.hpc.local>>
Interactive Job submission with X11
Normal interactive jobs are submitted with the "bsub -Is bash" submission command as described above. However, in order to get a graphical interface (GUI) based application running on an interactive node, a slightly different submission process needs to be followed.
Step 1: Check whether you have already generated an SSH key pair.
[asrini@consign ~]$ ls $HOME/.ssh
/bin/ls: cannot access /home/asrini/.ssh: No such file or directory
The above output shows that there are no ssh keys present. If the above command lists something similar to the output below, skip to Step 2b.
[asrini@consign ~]$ ls $HOME/.ssh
authorized_keys  id_rsa  id_rsa.pub
If no keypair exists, run the following commands on the PMACS cluster head node:
Step 2a: Generate the keypair:
[asrini@consign ~]$ ssh-keygen
The output of the above command should look similar to this if you accepted the defaults (pressed "Enter/Return" for all the options):
Generating public/private rsa key pair.
Enter file in which to save the key (/home/asrini/.ssh/id_rsa):
Created directory '/home/asrini/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/asrini/.ssh/id_rsa.
Your public key has been saved in /home/asrini/.ssh/id_rsa.pub.
The key fingerprint is:
2c:bc:1f:88:8a:54:83:6e:ab:g7:29:28:c5:08:a5:da asrini@consign.hpc.local
The key's randomart image is:
+--[ RSA 2048]----+
|... |
| o o |
|. . o |
|.... . . |
|..Eo o S |
|...... + |
| +. o o . |
|o.o+ . . |
|+o+. . |
+-----------------+
Step 2b: Copy the Public key into the authorized_keys file:
[asrini@consign ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[asrini@consign ~]$ chmod 600 ~/.ssh/authorized_keys
Step 3: You are now ready to start an interactive session with X11 forwarding enabled.
Step 3a.: But first log out of the PMACS cluster head node
Step 3b.: Log in from your local machine with X11 enabled and verify that the $DISPLAY variable is set:
GNU/Linux and Mac users:
$ ssh consign.pmacs.upenn.edu -X
[asrini@consign ~]$ echo $DISPLAY
localhost:14.0
Windows users will need to use a combination of PuTTY and Xming or some other X-Windows server for Windows.
Step 3c. Start the interactive session and verify that the $DISPLAY variable is still set:
[asrini@consign ~]$ bsub -XF -Is bash
Job <868591> is submitted to default queue <interactive>.
<<ssh X11 forwarding job>>
<<Waiting for dispatch ...>>
<<Starting on node062.hpc.local>>
[asrini@node062 ~]$ echo $DISPLAY
localhost:10.0
Note: The value of the DISPLAY variable need not be the same on the head node and on the compute node.
You are now ready to launch your X11 based application.
Checking job status
Running jobs:
$ bjobs -u <your_username>
Example:
$ bjobs -u asrini
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
9990022 asrini  RUN   interactiv consign.hpc node062.hpc bash       Jan 14 15:38
Checking status of finished jobs:
$ bjobs -d -u <your_username>
Example:
$ bjobs -d -u asrini
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
9990020 asrini  DONE  normal     consign.hpc node010.hpc sleep 2    Jan 14 15:34
9990021 asrini  DONE  normal     consign.hpc node010.hpc * sleep.sh Jan 14 15:35
9990022 asrini  DONE  interactiv consign.hpc node062.hpc bash       Jan 14 15:38
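When many jobs are listed, a per-state count is often easier to read than the full table. The helper below is a hypothetical sketch (count_jobs_by_state is not an LSF command); it assumes the default bjobs layout shown above, in which STAT is the third column.

```shell
# Read bjobs output on stdin and print "<STATE> <count>" for each job state.
count_jobs_by_state() {
  # NR > 1 skips the header; NF >= 3 skips continuation lines that list extra hosts.
  awk 'NR > 1 && NF >= 3 { count[$3]++ } END { for (s in count) print s, count[s] }' | sort
}
```

On the cluster this could be used as: bjobs -u <your_username> | count_jobs_by_state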
Detailed information about jobs that are currently running:
$ bjobs -l <job_id>
Example:
$ bjobs -l 9990022

Job <9990022>, User <asrini>, Project <default>, Status <RUN>, Queue <umem>, Com
                     mand <sh sleep.sh>
Tue Jan 14 10:22:21: Submitted from host <consign.hpc.local>, CWD </home/asrini
                     /hack_area/test_jobs/>, 2 Processors Requested;

 MEMLIMIT
   1024 M
Tue Jan 14 10:22:23: Started on 2 Hosts/Processors <node036.hpc.local> <node036
                     .hpc.local> <node036.hpc.local> <node036.hpc.local>, Execu
                     tion Home </home/asrini>, Execution CWD </home/asrini/hack
                     _area/test_jobs/>;
Tue Jan 14 10:22:23: Resource usage collected.
                     MEM: 2 Mbytes;  SWAP: 50 Mbytes;  NTHREAD: 1
                     PGID: 30614;  PIDs: 30614

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -
Detailed listing of Job in PEND state
Example:
$ bjobs -l 9990024

Job <9990024>, User <asrini>, Project <default>, Status <PEND>, Queue <umem>, Co
                     mmand <sh sleep.sh>
Tue Jan 14 19:46:02: Submitted from host <consign.hpc.local>, CWD </home/asrini
                     /hack_area/test_jobs/>, 4 Processors Requested, Requested
                     Resources <rusage[mem=33554432]>;

 MEMLIMIT
     32 G
 PENDING REASONS:
 Job requirements for reserving resource (mem) not satisfied: 5 hosts;
 Not specified in job submission: 57 hosts;
 Load information unavailable: 4 hosts;

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -
Notice the PENDING REASONS section in the above output. The job was placed in the pending (PEND) state because insufficient resources were available when it was submitted. When resources become available, the job will run (RUN state), unless the requested resources are significantly greater than the computational capacity of the PMACS cluster, in which case the job will remain pending.
Monitor job output
The bpeek command can be used to check on the output of running (non-interactive) jobs.
$ bsub -J test 'sleep 30; R --version'
$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1097411 asrini  RUN   normal     consign.hpc node116.hpc test       Aug 30 12:14
$ bpeek
<< output from stdout >>
The above bpeek command does not display any output because the job had not printed anything to STDOUT when the bpeek command was run. The -f flag for the bpeek command can be used to monitor the output continuously.
$ bpeek -f
<< output from stdout >>
R version 3.1.1 (2014-07-10) -- "Sock it to Me"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.
When you have multiple jobs running, it is recommended to pass the JOBID as input:
$ bpeek -f 1097411
<< output from stdout >>
R version 3.1.1 (2014-07-10) -- "Sock it to Me"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.
Resubmitting jobs
A job that has stalled can be resubmitted to the queue. Below is a sample (sleep) job:
$ bsub -J test 'sleep 300; R --version'
Job <1097439> is submitted to default queue <normal>.
$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1097439 asrini  RUN   normal     consign.hpc node087.hpc test       Aug 30 12:27
$ bjobs -l

Job <1097439>, Job Name <test>, User <asrini>, Project <default>, Status <RUN>,
                     Queue <normal>, Command <sleep 300; R --version>, Share g
                     roup charged </asrini>
Tue Aug 30 12:27:30: Submitted from host <consign.hpc.local>, CWD <$HOME>;
Tue Aug 30 12:27:30: Started 1 Task(s) on Host(s) <node087.hpc.local>, Allocate
                     d 1 Slot(s) on Host(s) <node087.hpc.local>, Execution Home
                     </home/asrini>, Execution CWD </home/asrini>;
Tue Aug 30 12:27:45: Resource usage collected.
                     MEM: 1 Mbytes;  SWAP: 0 Mbytes;  NTHREAD: 4
                     PGID: 22983;  PIDs: 22983 22988 22990

 MEMORY USAGE:
 MAX MEM: 1 Mbytes;  AVG MEM: 1 Mbytes

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -

            nfsops uptime
 loadSched     -      -
 loadStop      -      -

 RESOURCE REQUIREMENT DETAILS:
 Combined: select[type == local] order[r15s:pg] span[ptile='!',Intel_EM64T:32]
                     same[model] affinity[thread(1)*1]
 Effective: select[type == local] order[r15s:pg] span[ptile='!',Intel_EM64T:32]
                     same[model] affinity[thread(1)*1]
Resubmit the job using the "brequeue" command:
$ brequeue 1097439
Job <1097439> is being requeued
Job status check shows that the job was killed and restarted on a different node:
$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1097489 asrini  RUN   normal     consign.hpc node006.hpc test       Aug 30 12:35
Job History
Historical information about your jobs can be found by running:
$ bhist -d -u <your_username>
Note: By default, bhist only provides historical information about jobs run or completed in the past week. If you need historical or accounting information about jobs from more than a week ago, see the bacct usage information below.
Example output:
$ bhist -d -u asrini
Summary of time in seconds spent in various states:
JOBID   USER    JOB_NAME  PEND    PSUSP   RUN     USUSP   SSUSP   UNKWN   TOTAL
9990019 asrini  bash      1       0       36      0       0       0       37
9990020 asrini  sleep 2   2       0       2       0       0       0       4
9990021 asrini  *leep.sh  2       0       25      0       0       0       27
9990022 asrini  bash      0       0       395     0       0       0       395
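The summary table can be totalled with awk. Because JOB_NAME may contain spaces (for example "sleep 2"), the sketch below counts columns from the right-hand side: RUN is the fifth field from the end. The function name total_run_seconds is a hypothetical helper, not an LSF command.

```shell
# Read "bhist -d" summary output on stdin and print the total RUN time in seconds.
total_run_seconds() {
  # Only rows that start with a numeric job ID are counted.
  # Columns from the right: TOTAL, UNKWN, SSUSP, USUSP, RUN -> RUN is $(NF - 4).
  awk '$1 ~ /^[0-9]+$/ && NF >= 10 { total += $(NF - 4) } END { print total + 0 }'
}
```

On the cluster this could be used as: bhist -d -u <your_username> | total_run_seconds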
Detailed history of jobs that were completed
Example:
$ bhist -d -l <job_id>
$ bhist -d -l 9990022

Job <9990022>, User <asrini>, Project <default>, Command <sh sleep.sh>
Tue Jan 14 10:22:21: Submitted from host <consign.hpc.local>, to Queue <umem>,
                     CWD </home/asrini/hack_area/test_jobs/>, 4 Processors Requ
                     ested;

 MEMLIMIT
   1024 M
Tue Jan 14 10:22:23: Dispatched to 4 Hosts/Processors <node036.hpc.local> <node
                     036.hpc.local> <node036.hpc.local> <node036.hpc.local>;
Tue Jan 14 10:22:23: Starting (Pid 30614);
Tue Jan 14 10:22:23: Running with execution home </home/asrini>, Execution CWD
                     </home/asrini/hack_area/test_jobs/>, Execution Pid <30614>
                     ;
Tue Jan 14 10:22:33: Done successfully. The CPU time used is 0.0 seconds;
Tue Jan 14 10:22:33: Post job process done successfully;

Summary of time in seconds spent in various states by Tue Jan 14 10:22:33
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  2        0        10       0        0        0        12
Notice the "Done successfully" message in the above output
Detailed history of jobs that were killed or did not finish successfully:
$ bhist -d -l 9990024

Job <9990024>, User <asrini>, Project <default>, Command <sh sleep.sh>
Tue Jan 14 19:46:02: Submitted from host <consign.hpc.local>, to Queue <umem>,
                     CWD </home/asrini/hack_area/test_jobs/>, 4 Processors Requ
                     ested, Requested Resources <rusage[mem=33554432]>;

 MEMLIMIT
     32 G
Tue Jan 14 10:33:29: Signal <KILL> requested by user or administrator <asrini>;
Tue Jan 14 10:33:29: Exited;
Tue Jan 14 10:33:29: Completed <exit>; TERM_OWNER: job killed by owner;

Summary of time in seconds spent in various states by Tue Jan 14 10:33:29
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  53247    0        0        0        0        0        53247
Notice the "Signal <KILL>" message in the above output.
Job accounting and summary statistics
The bacct command displays a summary of accounting statistics for all finished jobs (with a DONE or EXIT status). The bacct command can only be run from the PMACS cluster head node: consign.pmacs.upenn.edu
Accounting information about a user (this will take some time depending on how long you have used the PMACS cluster and how many jobs you have submitted):
Note: bacct may return information about other users' jobs. Pay close attention to the output.
$ bacct -u <your_username>
Example output:
$ bacct -u asrini

Accounting information about jobs that are:
  - submitted by users asrini,
  - accounted on all projects.
  - completed normally or exited
  - executed on all hosts.
  - submitted to all queues.
  - accounted on all service classes.
------------------------------------------------------------------------------

SUMMARY:      ( time unit: second )
 Total number of done jobs:   89789      Total number of exited jobs: 10818
 Total CPU time consumed:  394894.5      Average CPU time consumed:     3.9
 Maximum CPU time of a job: 12342.2      Minimum CPU time of a job:     0.0
 Total wait time in queues: 16464999.0
 Average wait time in queue:  163.7
 Maximum wait time in queue:53247.0      Minimum wait time in queue:    0.0
 Average turnaround time:       213 (seconds/job)
 Maximum turnaround time:    513369      Minimum turnaround time:         1
 Average hog factor of a job:  0.01 ( cpu time / turnaround time )
 Maximum hog factor of a job:  1.24      Minimum hog factor of a job:  0.00
 Total throughput:            11.30 (jobs/hour)  during 8900.83 hours
 Beginning time:       Jan  9 13:43      Ending time:          Jan 15 10:33
By default, the PMACS cluster configuration will only provide a summary of the past 7 days. The -D option can be provided to expand this range. For example, to see a summary of jobs that were completed during a specific month:
$ bacct -D 2014/01/01/00:00,2014/02/01/23:59 -u asrini

Accounting information about jobs that are:
  - submitted by users asrini,
  - accounted on all projects.
  - completed normally or exited
  - dispatched between Wed Jan 1 00:00:00 2014
                  ,and Sat Feb 1 23:59:00 2014
  - executed on all hosts.
  - submitted to all queues.
  - accounted on all service classes.
------------------------------------------------------------------------------

SUMMARY:      ( time unit: second )
 Total number of done jobs:      63      Total number of exited jobs:    17
 Total CPU time consumed:      23.4      Average CPU time consumed:     0.3
 Maximum CPU time of a job:     7.6      Minimum CPU time of a job:     0.0
 Total wait time in queues:   117.0
 Average wait time in queue:    1.5
 Maximum wait time in queue:    3.0      Minimum wait time in queue:    0.0
 Average turnaround time:       234 (seconds/job)
 Maximum turnaround time:      8094      Minimum turnaround time:         2
 Average hog factor of a job:  0.01 ( cpu time / turnaround time )
 Maximum hog factor of a job:  0.04      Minimum hog factor of a job:  0.00
 Total throughput:             0.14 (jobs/hour)  during  554.63 hours
 Beginning time:       Jan  8 11:21      Ending time:          Jan 31 13:58
Detailed listing can be requested with the -l option (will give a very long listing!)
$ bacct -l -u <your_username>
Detailed accounting information about a specific job that completed successfully:
$ bacct -l 9990022

Accounting information about jobs that are:
  - submitted by all users.
  - accounted on all projects.
  - completed normally or exited
  - executed on all hosts.
  - submitted to all queues.
  - accounted on all service classes.
------------------------------------------------------------------------------

Job <9990022>, User <asrini>, Project <default>, Status <DONE>, Queue <umem>, Co
                     mmand <sh sleep.sh>
Tue Jan 14 10:22:21: Submitted from host <consign.hpc.local>, CWD </home/asrini
                     /hack_area/test_jobs/>;
Tue Jan 14 10:22:23: Dispatched to 4 Hosts/Processors <node036.hpc.local> <node
                     036.hpc.local> <node036.hpc.local> <node036.hpc.local>;
Tue Jan 14 10:22:33: Completed <done>.

Accounting information about this job:
     CPU_T     WAIT     TURNAROUND   STATUS     HOG_FACTOR    MEM    SWAP
      0.03        2             12     done         0.0026     2M     50M
------------------------------------------------------------------------------

SUMMARY:      ( time unit: second )
 Total number of done jobs:       1      Total number of exited jobs:     0
 Total CPU time consumed:       0.0      Average CPU time consumed:     0.0
 Maximum CPU time of a job:     0.0      Minimum CPU time of a job:     0.0
 Total wait time in queues:     2.0
 Average wait time in queue:    2.0
 Maximum wait time in queue:    2.0      Minimum wait time in queue:    2.0
 Average turnaround time:        12 (seconds/job)
 Maximum turnaround time:        12      Minimum turnaround time:        12
 Average hog factor of a job:  0.00 ( cpu time / turnaround time )
 Maximum hog factor of a job:  0.00      Minimum hog factor of a job:  0.00
Detailed accounting information about a specific job that was killed/did not finish successfully:
$ bacct -l 9990024

Accounting information about jobs that are:
  - submitted by all users.
  - accounted on all projects.
  - completed normally or exited
  - executed on all hosts.
  - submitted to all queues.
  - accounted on all service classes.
------------------------------------------------------------------------------

Job <9990024>, User <asrini>, Project <default>, Status <EXIT>, Queue <umem>, Co
                     mmand <sh sleep.sh>
Tue Jan 14 19:46:02: Submitted from host <consign.hpc.local>, CWD </home/asrini
                     /hack_area/test_jobs/>;
Tue Jan 14 10:33:29: Completed <exit>; TERM_OWNER: job killed by owner.

Accounting information about this job:
     CPU_T     WAIT     TURNAROUND   STATUS     HOG_FACTOR    MEM    SWAP
      0.00    53247          53247     exit         0.0000     0M      0M
------------------------------------------------------------------------------

SUMMARY:      ( time unit: second )
 Total number of done jobs:       0      Total number of exited jobs:     1
 Total CPU time consumed:       0.0      Average CPU time consumed:     0.0
 Maximum CPU time of a job:     0.0      Minimum CPU time of a job:     0.0
 Total wait time in queues: 53247.0
 Average wait time in queue:53247.0
 Maximum wait time in queue:53247.0      Minimum wait time in queue:53247.0
 Average turnaround time:     53247 (seconds/job)
 Maximum turnaround time:     53247      Minimum turnaround time:     53247
 Average hog factor of a job:  0.00 ( cpu time / turnaround time )
 Maximum hog factor of a job:  0.00      Minimum hog factor of a job:  0.00
Parallel Environment
To run a parallel job, include the -n flag in the bsub command. The -n option asks the scheduler to reserve more than 1 CPU core for the job. The reserved cores are not necessarily all assigned to the same physical compute node, so care must be taken when submitting such requests. See the examples below.
For example, to run an interactive job with 16 CPUs :
$ bsub -n 16 -Is bash
Job <9990023> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on node063.hpc.local>>
$ bjobs -u asrini
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
9990023 asrini  RUN   interactiv consign.hpc node063.hpc bash       Jan 14 15:50
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node060.hpc.local
                                             node060.hpc.local
                                             node060.hpc.local
                                             node060.hpc.local
                                             node060.hpc.local
                                             node060.hpc.local
                                             node060.hpc.local
                                             node060.hpc.local
Note: the example bsub command above did not force the scheduler to reserve the CPU cores on the same node. To do so, you must use the '-R "span[hosts=1]" ' option:
$ bsub -n 16 -R "span[hosts=1]" -Is bash
Similarly, to run a batch job with 16 CPUs:
$ bsub -n 16 -R "span[hosts=1]" sh <my_parallel_job_script.sh>
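From inside a running job you can check whether the allocated cores landed on one node or several. The sketch below assumes that LSF exports the LSB_HOSTS environment variable as a space-separated list with one host name per allocated slot; the function name slots_per_host is hypothetical.

```shell
# Print "<host> <slot count>" for each host in a space-separated host list.
# $1 = host list; when omitted, $LSB_HOSTS is used (assumed to be set by LSF inside a job).
slots_per_host() {
  printf '%s\n' "${1:-${LSB_HOSTS:-}}" |
    tr -s ' ' '\n' |
    awk 'NF { n[$1]++ } END { for (h in n) print h, n[h] }' |
    sort
}
```

If more than one line is printed, the job spans several nodes, and a program that cannot communicate across nodes will only be able to use the cores on the node it started on.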
Job Dependency
See the section on Job Dependency
Job Checkpoints
A checkpoint can be added to any LSF job so that the job can be restarted from that checkpoint if it fails or is terminated. LSF users can make jobs checkpointable by submitting them with bsub -k and specifying a checkpoint directory.
bsub -k "checkpoint_dir [checkpoint_period]"
Please note that adding a checkpoint to a job may delay its completion and may also create multiple temporary files, which can increase disk utilization during the job run.
Environment Modules
User-loadable modules are available if the system default packages don't meet your requirements. To see what modules are available, run the "module avail" command from an interactive session:
[asrini@consign ~]$ bsub -Is bash
Job <9990024> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on node063.hpc.local>>
[asrini@node063 ~]$ module avail

------------------------------- /usr/share/Modules/modulefiles -------------------------------
NAMD-2.9-Linux-x86_64-multicore dot             module-info          picard-1.96     rum-2.0.5_05
STAR-2.3.0e                     java-sdk-1.6.0  modules              pkg-config-path samtools-0.1.19
STAR-hg19                       java-sdk-1.7.0  mpich2-x86_64        python-2.7.5    use.own
STAR-mm9                        ld-library-path null                 r-libs-user
bowtie2-2.1.0                   manpath         openmpi-1.5.4-x86_64 ruby-1.8.7-p374
devtoolset-2                    module-cvs      perl5lib             ruby-1.9.3-p448
Example use of modules:
[asrini@node063 ~]$ python -V
Python 2.6.6
[asrini@node063 ~]$ which python
/usr/bin/python
[asrini@node063 ~]$ module load python-2.7.5
[asrini@node063 ~]$ python -V
Python 2.7.5
[asrini@node063 ~]$ which python
/opt/software/python/python-2.7.5/bin/python
[asrini@node063 ~]$ module unload python-2.7.5
[asrini@node063 ~]$ which python
/usr/bin/python
More information about Environment Modules can be found here
Additional LSF Documentation
Other Pages
Azure Archive
We have created a container in our Azure Storage account for your lab. Access is provided via the azcopy binary available on mercury. To access your container, you have been provided with a SAS token. Keep the token secure: it grants read, write, and delete authority on all files in your container.
Tape archive retirement
As of 2/1/23, we have begun uploading the contents of the tape archive to Azure. Each lab that has a directory in the tape archive now has a container in the hpcarchive storage account. Uploading the tape archive data to the Azure archive is expected to take weeks; we will keep you advised. If you have an urgent need to recover something from tape before the uploads have been completed, please let us know: open a systems ticket at helpdesk.pmacs.upenn.edu and we will restore the files into a subdirectory of the lab's project directory as soon as possible. The contents of the first level of subdirectories have been tarred to maintain file attributes. A .list file with the same name as the tar file is provided so that you can see the contents of the tar.
Please Note: TIERS
To keep storage costs at their lowest, all files in the archive should be set to the Archive tier as soon as they are uploaded. Note that files are automatically set to Cool upon upload, so please change the tier to Archive as soon as the upload finishes. We will be monitoring the status of files. Only Cool tier files are available for download. Therefore, before you can download something from the archive, you must change it from Archive to Cool. This can take time, depending on the size of the file. After downloading the file, set it back to Archive (or delete it) to ensure the lowest cost. Note that files in the Archive tier can be deleted directly and do not have to be "rehydrated" first.
Please Note: File Attributes
Azure blob storage will record the date and time attributes of files uploaded to the archive as the time at which they were uploaded. User and group ownership and permissions are also stripped from the file. If you are not concerned with losing this information, upload and download individual files to and from the Azure archive.
You can make a note of these attributes in a file of your own using find -ls > attibs.txt or ls -ltra files* > attibs.txt.
It may be better to make a tar file (tar -cf archive.tar files*) and store your files in the archive inside these tars. This ensures that all the attributes are available when you download the files. The downside is that if a tar contains too many files, you will have to wait longer to download it when you only need one file from it. Either way, you will want to keep a file listing the contents of each tar so you can be sure which file to download or delete.
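One way to keep such a listing is to write it at the moment the tar file is created. The helper below is a sketch; make_tar_with_list is a hypothetical name, and it writes a .list file with the same base name as the tar, mirroring the convention described above for the tape archive uploads.

```shell
# Create <name>.tar from the given files and write its verbose listing to <name>.list.
# $1 = archive base name, remaining arguments = files or directories to pack
make_tar_with_list() {
  name=$1
  shift
  # tar -tv lists permissions, owner/group, size, date and path for each member.
  tar -cf "${name}.tar" "$@" &&
    tar -tvf "${name}.tar" > "${name}.list"
}
```

For example, "make_tar_with_list results_2023 results/" produces results_2023.tar for upload and results_2023.list to keep locally.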
Please reach out to psom-pmacshpc@pennmedicine.upenn.edu should you have questions about how to deal with file attributes.
Examples
Here are some examples of using the azcopy command:
In these examples, the access token string is stored in a variable SAS (e.g. SAS=`cat lab_token`) and the name of the lab is in $account. The SAS token expires one year after its creation; you can see the expiration date in the string. Be careful when handling the token string: it contains many special characters, such as &, which the shell may attempt to interpret if the string is not quoted.
- List files in the container:
azcopy list "https://hpcarchive.blob.core.windows.net/${account}?${SAS}" --properties BlobAccessTier
- to upload a file ( filename ):
azcopy copy filename "https://hpcarchive.blob.core.windows.net/${account}?${SAS}" --preserve-posix-properties --preserve-permissions
- set tier of the file to Archive:
( note the --recursive option: the tier can be set on the container and everything in it, or on a folder and everything below that path )
to archive
azcopy set-properties "https://hpcarchive.blob.core.windows.net/${account}/${file}?${SAS}" --block-blob-tier=archive --recursive=true
to cool
azcopy set-properties "https://hpcarchive.blob.core.windows.net/${account}/${file}?${SAS}" --block-blob-tier=cool --recursive=true
- You cannot download a file until rehydration is complete; check its tier with the list command
- download a file in cool tier:
azcopy copy "https://hpcarchive.blob.core.windows.net/${account}/${file}?${SAS}" .
( or replace . with destination path )
- remove file:
azcopy remove "https://hpcarchive.blob.core.windows.net/${account}/${file}?${SAS}"
- sync a directory
( not recommended, but it can be used to recursively copy entire directories of files instead of tar files; you will lose the date and ownership attributes on these files ) Note: you will need to do a recursive set tier on these files after the sync.
azcopy sync dir "https://hpcarchive.blob.core.windows.net/${account}?${SAS}"
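Since every command above repeats the same URL pattern, a small helper can build it in one place and keep the quoting consistent. This is a sketch only: blob_url is a hypothetical function name, and the storage account host is taken from the examples above.

```shell
# Print the container or blob URL with the SAS token appended.
# $1 = lab/container name, $2 = SAS token, $3 = optional path of a blob inside the container
blob_url() {
  url="https://hpcarchive.blob.core.windows.net/$1"
  if [ -n "${3:-}" ]; then
    url="${url}/$3"
  fi
  # Strip a leading "?" so the token works whether or not it was saved with one.
  printf '%s?%s\n' "$url" "${2#\?}"
}
```

It could then be used as: azcopy list "$(blob_url "$account" "$SAS")" --properties BlobAccessTier. Keep the double quotes around the command substitution so that the shell does not interpret the & characters in the token.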