HPC:User Guide

From HPC wiki

Revision as of 15:18, 15 July 2014

Guidelines

Do not run compute-intensive tasks on the cluster head node (consign.pmacs.upenn.edu). Use an interactive node (bsub -Is bash) instead. Please read the man page for 'bsub' and the documentation linked below.

Setting up your profile (optional, use only if the LSF commands below don't work)

The LSF commands (a.k.a. "b-commands") will only work if the LSF profile file has been sourced. We recommend adding the following to your .bash_profile if it is not already there:

if [ -f /usr/share/lsf/conf/profile.lsf ]; then
        source /usr/share/lsf/conf/profile.lsf
fi
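
To confirm the profile has been sourced, check that bsub is on your PATH (a quick sanity check, not part of the official setup):

 $ which bsub

If this prints nothing, log out and back in, or source the profile file manually as shown above.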

Overview of Common Commands (IBM Platform LSF)

Please also refer to the man pages for bsub, bjobs, bkill, bhist, bqueues, bhosts and bacct

  • bsub <job script> : submit a job description file for execution
  • bsub -Is bash : request a node for interactive use
  • bjobs <jobid> : show the status of a job
  • bjobs -l <jobid> : show more details about a job
  • bkill <jobid> : cancel a job
  • bjobs -u <user> : display jobs queued or running for a single user
  • bjobs -u all : display jobs for all users
  • bacct : get a summary of usage
  • bacct -D start_time,end_time : get a summary of usage for a specific time period

For example, to get usage for the month of January 2014:

 bacct  -D 2014/01/01/00:00,2014/01/31/24:00 
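
To cancel a job you no longer need, pass its job ID to bkill (the ID below is a hypothetical example; use the ID reported by bsub or bjobs):

 $ bkill 9990021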

Default Submission Constraints

Please note that all submissions on the cluster, both batch jobs and interactive sessions, have a default memory limit of 3GB. To request more memory, submit to one of the other queues that allow jobs to request and use more than 3GB. For example, to run a batch job that needs 5GB of RAM:

   bsub -q plus <job_script>

The -q option tells the LSF scheduler that you wish to use the "plus" queue, which allows up to 6GB of RAM per slot.

To request more than 6GB of RAM, use the "max_mem64" queue (-q max_mem64).
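
For example (a sketch following the same pattern as above; <job_script> is a placeholder for your own script):

   bsub -q max_mem64 <job_script>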

Example usage of LSF commands

Queues

To check the various queues, run bqueues:

$ bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
normal           30  Open:Active       -    -    -    -   331     0   331     0
interactive      30  Open:Active       -    -    -    -     3     0     3     0
plus             30  Open:Active       -    -    -    -     0     0     0     0
max_mem30        30  Open:Active       -    -    -    -    66     0    66     0
max_mem64        30  Open:Active       -    -    -    -     0     0     0     0
denovo           30  Open:Active       -    -    -    -    31     0    31     0 

To get detailed information about a certain queue, run:

$ bqueues -l normal

QUEUE: normal
  -- Queue for normal workload taking less than 3GBytes of memory. Jobs that allocate more than 4GBytes of memory 
will be killed in this queue. This is the default queue.

PARAMETERS/STATISTICS
PRIO NICE STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN SSUSP USUSP  RSV
 30   20  Open:Active       -    -    -    -   330     0   330     0     0    0
Interval for a host to accept two jobs is 0 seconds

 SWAPLIMIT
      4 G

SCHEDULING PARAMETERS
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -

SCHEDULING POLICIES:  NO_INTERACTIVE

USERS: all
HOSTS:  compute/
RES_REQ:  rusage[mem=3000]


Compute node information

To get information on the physical compute hosts that are a part of this cluster:

$ bhosts

Or, if you know the name of the node:
$ bhosts node001.hpc.local
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
node001.hpc.local  ok              -     32      0      0      0      0      0

The above output shows that there is a maximum of 32 available CPU SLOTS on the node and no jobs currently running on it.

The output of bhosts below shows 27 jobs assigned and currently running on this node.
$ bhosts node048.hpc.local
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
node048.hpc.local  ok              -     32     27     27      0      0      0

The output below shows that the node is closed, since the number of jobs running on it equals the node's maximum CPU SLOTS allotment.
$ bhosts node025.hpc.local
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
node025.hpc.local  closed          -     32     32     32      0      0      0


Batch (non-interactive) Job submission

To run a job in batch mode:

$ bsub <script_name> 

Example:

$ bsub sh sleep.sh
Job <9990021> is submitted to default queue <normal>.

Note about error and output files

By default, error and output files are not generated; they must be explicitly requested by passing the -e and -o flags to bsub. The above example then becomes:

$ bsub -e sleep.e -o sleep.o sh sleep.sh

Alternative way to run a job in batch mode:

$ bsub < <script_name> 

Sample job script:

 $ cat job_script.sh
 #!/bin/bash
 #BSUB -J my_test_job            # LSF job name
 #BSUB -o my_test_job.%J.out     # Name of the job output file 
 #BSUB -e my_test_job.%J.error   # Name of the job error file

 echo "this is a test"
 sleep 15

Example job with job script:

$ bsub < job_script.sh
Job <9990032> is submitted to default queue <normal>.
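
Other bsub options can also be carried inside the script with the same #BSUB directive syntax. The sketch below is only an illustration (the queue and slot count are examples, not requirements), combining the -q and -n flags described elsewhere in this guide:

 $ cat job_script_plus.sh
 #!/bin/bash
 #BSUB -J my_big_job             # LSF job name
 #BSUB -q plus                   # use the "plus" queue (up to 6GB of RAM per slot)
 #BSUB -n 4                      # request 4 CPU slots
 #BSUB -o my_big_job.%J.out      # Name of the job output file
 #BSUB -e my_big_job.%J.error    # Name of the job error file

 echo "this is a bigger test"
 sleep 15

 $ bsub < job_script_plus.sh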


Interactive Job submission

$ bsub -Is bash

Job <9990022> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on node062.hpc.local>>

Interactive Job submission with X11

Normal interactive jobs are submitted with "bsub -Is bash" as described above. However, to run a graphical (GUI) application on an interactive node, a slightly different submission process is needed.


Step 1: Check whether you have already generated an SSH keypair.

[asrini@consign ~]$ ls $HOME/.ssh
/bin/ls: cannot access /home/asrini/.ssh: No such file or directory

The above output shows that there are no ssh keys present. If the above command lists something similar to the output below, skip to Step 2b.

[asrini@consign ~]$ ls $HOME/.ssh
authorized_keys  id_rsa  id_rsa.pub


If no keypair exists, run the following commands on the PMACS cluster head node:

Step 2a: Generate the keypair:

[asrini@consign ~]$ ssh-keygen

The output of the above command should look similar to this if you accepted the defaults (pressed "Enter/Return" for all the options):

Generating public/private rsa key pair.
Enter file in which to save the key (/home/asrini/.ssh/id_rsa):
Created directory '/home/asrini/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/asrini/.ssh/id_rsa.
Your public key has been saved in /home/asrini/.ssh/id_rsa.pub.
The key fingerprint is:
2c:bc:1f:88:8a:54:83:6e:ab:g7:29:28:c5:08:a5:da asrini@consign.hpc.local
The key's randomart image is:
+--[ RSA 2048]----+
|...              |
| o o             |
|. . o            |
|.... . .         |
|..Eo  o S        |
|...... +         |
| +. o o .        |
|o.o+   . .       |
|+o+.    .        |
+-----------------+

Step 2b: Copy the Public key into the authorized_keys file:

[asrini@consign ~]$ cat .ssh/id_rsa.pub >> .ssh/authorized_keys

[asrini@consign ~]$ chmod 600 .ssh/authorized_keys


Step 3: You are now ready to start an interactive session with X11 forwarding enabled.

Step 3a: First, log out of the PMACS cluster head node.

Step 3b: Log in again from your local machine with X11 forwarding enabled and verify that the $DISPLAY variable is set:

GNU/Linux and Mac users:

$  ssh consign.pmacs.upenn.edu -X

[asrini@consign ~]$ echo $DISPLAY
localhost:14.0

Windows users will need to use a combination of PuTTY and Xming, or some other X server for Windows.

Step 3c: Start the interactive session and verify that the $DISPLAY variable is still set:

[asrini@consign ~]$ bsub -XF -Is bash
Job <868591> is submitted to default queue <interactive>.
<<ssh X11 forwarding job>>
<<Waiting for dispatch ...>>
<<Starting on node062.hpc.local>>

[asrini@node062 ~]$ echo $DISPLAY
localhost:10.0

Note: The DISPLAY value on the compute node need not match the value on the head node.

You are now ready to launch your X11 based application.
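
As a quick test that forwarding works, you can launch a small X11 program first (assuming a basic X client such as xclock is installed on the compute nodes):

[asrini@node062 ~]$ xclock &

If a clock window appears on your local screen, X11 forwarding is working.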

Checking job status

Running jobs:

$ bjobs -u <your_username> 

Example:

$ bjobs -u asrini
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
9990022  asrini  RUN   interactiv consign.hpc node062.hpc bash       Jan 14 15:38

Checking status of finished jobs:

Example:

$ bjobs -d -u <your_username>
$ bjobs -d -u asrini
JOBID    USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
9990020  asrini  DONE  normal     consign.hpc node010.hpc sleep 2    Jan 14 15:34
9990021  asrini  DONE  normal     consign.hpc node010.hpc * sleep.sh Jan 14 15:35
9990022  asrini  DONE  interactiv consign.hpc node062.hpc bash       Jan 14 15:38


Detailed information about jobs that are currently running:

Example:

$ bjobs -l <job_id>
$ bjobs -l 9990022

Job <9990022>, User <asrini>, Project <default>, Status <RUN>, Queue <umem>, Com
                     mand <sh sleep.sh>
Tue Jan 15 10:22:21: Submitted from host <consign.hpc.local>, CWD </home/asrini
                     /hack_area/test_jobs/>, 2 Processors Requested;

 MEMLIMIT
   1024 M
Tue Jan 15 10:22:23: Started on 2 Hosts/Processors <node036.hpc.local> <node036
                     .hpc.local> <node036.hpc.local> <node036.hpc.local>, Execu
                     tion Home </home/asrini>, Execution CWD </home/asrini/hack
                     _area/test_jobs/>;
Tue Jan 15 10:22:23: Resource usage collected.
                     MEM: 2 Mbytes;  SWAP: 50 Mbytes;  NTHREAD: 1
                     PGID: 30614;  PIDs: 30614


 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -


Detailed listing of a job in the PEND state

Example:

$ bjobs -l 9990024

Job <9990022>, User <asrini>, Project <default>, Status <PEND>, Queue <umem>, Co
                     mmand <sh sleep.sh>
Mon Jan 14 19:46:02: Submitted from host <consign.hpc.local>, CWD </home/asrini
                     /hack_area/test_jobs/>, 4 Processors Requested, Requested
                     Resources <rusage[mem=33554432]>;

 MEMLIMIT
     32 G
 PENDING REASONS:
 Job requirements for reserving resource (mem) not satisfied: 5 hosts;
 Not specified in job submission: 57 hosts;
 Load information unavailable: 4 hosts;

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -

Notice the PENDING REASONS section in the above output. The job was put into the pending (PEND) state because insufficient resources were available when it was submitted. When resources become available, the job will run (RUN state), unless the requested resources are significantly greater than the computational capacity of the PMACS cluster, in which case the job will never be dispatched.
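
If a job is stuck in PEND because its resource request can never be satisfied, the usual remedy is to kill it and resubmit with a smaller request. A minimal sketch, using the pending job from the example above (the mem value is illustrative; its units depend on the cluster's LSF configuration):

 $ bkill 9990024
 $ bsub -R "rusage[mem=4096]" sh sleep.sh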

Job History

Historical information about your jobs can be found by running:

Note: By default, bhist only provides historical information about jobs run or completed in the past week. If historical or accounting information about jobs older than a week is needed, see the bacct usage information below.

 $ bhist -d -u <your_username> 

Example output:

$ bhist -d -u asrini
Summary of time in seconds spent in various states:
JOBID   USER    JOB_NAME  PEND    PSUSP   RUN     USUSP   SSUSP   UNKWN   TOTAL
9990019  asrini  bash      1       0       36      0       0       0       37
9990020  asrini  sleep 2   2       0       2       0       0       0       4
9990021  asrini  *leep.sh  2       0       25      0       0       0       27
9990022  asrini  bash      0       0       395     0       0       0       395


Detailed history of jobs that completed successfully:

Example:

 $ bhist -d -l <job_id> 

$ bhist -d -l 9990022

Job <9990022>, User <asrini>, Project <default>, Command <sh sleep.sh>
Tue Jan 14 10:22:21: Submitted from host <consign.hpc.local>, to Queue <umem>,
                     CWD </home/asrini/hack_area/test_jobs/>, 4 Processors Requ
                     ested;

 MEMLIMIT
   1024 M
Tue Jan 14 10:22:23: Dispatched to 4 Hosts/Processors <node036.hpc.local> <node
                     036.hpc.local> <node036.hpc.local> <node036.hpc.local>;
Tue Jan 14 10:22:23: Starting (Pid 30614);
Tue Jan 14 10:22:23: Running with execution home </home/asrini>, Execution CWD
                     </home/asrini/hack_area/test_jobs/>, Execution Pid <30614>
                     ;
Tue Jan 14 10:22:33: Done successfully. The CPU time used is 0.0 seconds;
Tue Jan 14 10:22:33: Post job process done successfully;

Summary of time in seconds spent in various states by  Tue Jan 14 10:22:33
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  2        0        10       0        0        0        12

Notice the "Done successfully" message in the above output

Detailed history of jobs that were killed or did not finish successfully:


$ bhist -d -l 9990024

Job <9990022>, User <asrini>, Project <default>, Command <sh sleep.sh>
Mon Jan 14 19:46:02: Submitted from host <consign.hpc.local>, to Queue <umem>,
                     CWD </home/asrini/hack_area/test_jobs/>, 4 Processors Requ
                     ested, Requested Resources <rusage[mem=33554432]>;

 MEMLIMIT
     32 G
Tue Jan 14 10:33:29: Signal <KILL> requested by user or administrator <asrini>;

Tue Jan 14 10:33:29: Exited;
Tue Jan 14 10:33:29: Completed <exit>; TERM_OWNER: job killed by owner;

Summary of time in seconds spent in various states by  Tue Jan 14 10:33:29
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  53247    0        0        0        0        0        53247

Notice the "Signal <KILL>" message in the above output.

Job accounting and summary statistics

The bacct command displays a summary of accounting statistics for all finished jobs (those with a DONE or EXIT status). It can only be run from the PMACS cluster head node: consign.pmacs.upenn.edu

Accounting information about a user (this will take some time depending on how long you have used the PMACS cluster and how many jobs you have submitted):

 $ bacct -u <your_username> 

Example output:

$ bacct -u asrini

Accounting information about jobs that are:
  - submitted by users asrini,
  - accounted on all projects.
  - completed normally or exited
  - executed on all hosts.
  - submitted to all queues.
  - accounted on all service classes.
------------------------------------------------------------------------------

SUMMARY:      ( time unit: second )
 Total number of done jobs:   89789      Total number of exited jobs: 10818
 Total CPU time consumed:   394894.5      Average CPU time consumed:     3.9
 Maximum CPU time of a job: 12342.2      Minimum CPU time of a job:     0.0
 Total wait time in queues: 16464999.0
 Average wait time in queue:  163.7
 Maximum wait time in queue:53247.0      Minimum wait time in queue:    0.0
 Average turnaround time:       213 (seconds/job)
 Maximum turnaround time:    513369      Minimum turnaround time:         1
 Average hog factor of a job:  0.01 ( cpu time / turnaround time )
 Maximum hog factor of a job:  1.24      Minimum hog factor of a job:  0.00
 Total throughput:            11.30 (jobs/hour)  during 8900.83 hours
 Beginning time:       Jan  9 13:43      Ending time:          Jan 15 10:33

By default, the PMACS cluster configuration only provides a summary of the past 7 days. The -D option can be used to expand this range. For example, to see a summary of jobs that were completed during a specific month:

$ bacct -D 2014/01/01/00:00,2014/02/01/23:59 -u asrini

Accounting information about jobs that are:
  - submitted by users asrini,
  - accounted on all projects.
  - completed normally or exited
  - dispatched between  Wed Jan  1 00:00:00 2014
                  ,and   Sat Feb  1 23:59:00 2014
  - executed on all hosts.
  - submitted to all queues.
  - accounted on all service classes.
------------------------------------------------------------------------------

SUMMARY:      ( time unit: second )
 Total number of done jobs:      63      Total number of exited jobs:    17
 Total CPU time consumed:      23.4      Average CPU time consumed:     0.3
 Maximum CPU time of a job:     7.6      Minimum CPU time of a job:     0.0
 Total wait time in queues:   117.0
 Average wait time in queue:    1.5
 Maximum wait time in queue:    3.0      Minimum wait time in queue:    0.0
 Average turnaround time:       234 (seconds/job)
 Maximum turnaround time:      8094      Minimum turnaround time:         2
 Average hog factor of a job:  0.01 ( cpu time / turnaround time )
 Maximum hog factor of a job:  0.04      Minimum hog factor of a job:  0.00
 Total throughput:             0.14 (jobs/hour)  during  554.63 hours
 Beginning time:       Jan  8 11:21      Ending time:          Jan 31 13:58


A detailed listing can be requested with the -l option (this will give a very long listing!):

 $ bacct -l -u <your_username> 


Detailed accounting information about a specific job that completed successfully:

$ bacct -l 9990022

Accounting information about jobs that are:
  - submitted by all users.
  - accounted on all projects.
  - completed normally or exited
  - executed on all hosts.
  - submitted to all queues.
  - accounted on all service classes.
------------------------------------------------------------------------------

Job <9990022>, User <asrini>, Project <default>, Status <DONE>, Queue <umem>, Co
                     mmand <sh sleep.sh>
Tue Jan 14 10:22:21: Submitted from host <consign.hpc.local>, CWD </home/asrini
                     /hack_area/test_jobs/>;
Tue Jan 14 10:22:23: Dispatched to 4 Hosts/Processors <node036.hpc.local> <node
                     036.hpc.local> <node036.hpc.local> <node036.hpc.local>;
Tue Jan 14 10:22:33: Completed <done>.

Accounting information about this job:
     CPU_T     WAIT     TURNAROUND   STATUS     HOG_FACTOR    MEM    SWAP
      0.03        2             12     done         0.0026     2M     50M
------------------------------------------------------------------------------

SUMMARY:      ( time unit: second )
 Total number of done jobs:       1      Total number of exited jobs:     0
 Total CPU time consumed:       0.0      Average CPU time consumed:     0.0
 Maximum CPU time of a job:     0.0      Minimum CPU time of a job:     0.0
 Total wait time in queues:     2.0
 Average wait time in queue:    2.0
 Maximum wait time in queue:    2.0      Minimum wait time in queue:    2.0
 Average turnaround time:        12 (seconds/job)
 Maximum turnaround time:        12      Minimum turnaround time:        12
 Average hog factor of a job:  0.00 ( cpu time / turnaround time )
 Maximum hog factor of a job:  0.00      Minimum hog factor of a job:  0.00


Detailed accounting information about a specific job that was killed/did not finish successfully:


$ bacct -l 9990024

Accounting information about jobs that are:
  - submitted by all users.
  - accounted on all projects.
  - completed normally or exited
  - executed on all hosts.
  - submitted to all queues.
  - accounted on all service classes.
------------------------------------------------------------------------------

Job <392124>, User <asrini>, Project <default>, Status <EXIT>, Queue <umem>, Co
                     mmand <sh sleep.sh>
Tue Jan 14 19:46:02: Submitted from host <consign.hpc.local>, CWD </home/asrini
                     /hack_area/test_jobs/>;
Tue Jan 14 10:33:29: Completed <exit>; TERM_OWNER: job killed by owner.

Accounting information about this job:
     CPU_T     WAIT     TURNAROUND   STATUS     HOG_FACTOR    MEM    SWAP
      0.00    53247          53247     exit         0.0000     0M      0M
------------------------------------------------------------------------------

SUMMARY:      ( time unit: second )
 Total number of done jobs:       0      Total number of exited jobs:     1
 Total CPU time consumed:       0.0      Average CPU time consumed:     0.0
 Maximum CPU time of a job:     0.0      Minimum CPU time of a job:     0.0
 Total wait time in queues: 53247.0
 Average wait time in queue:53247.0
 Maximum wait time in queue:53247.0      Minimum wait time in queue:53247.0
 Average turnaround time:     53247 (seconds/job)
 Maximum turnaround time:     53247      Minimum turnaround time:     53247
 Average hog factor of a job:  0.00 ( cpu time / turnaround time )
 Maximum hog factor of a job:  0.00      Minimum hog factor of a job:  0.00

Parallel Environment

To run a parallel job, include the -n flag with the bsub command described above.

For example, to run an interactive job with 16 CPUs:

$ bsub -n 16 -Is bash
Job <9990023> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on node063.hpc.local>>

$ bjobs -u asrini
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
9990023  asrini  RUN   interactiv consign.hpc node063.hpc bash       Jan 14 15:50
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local
                                             node063.hpc.local      

Similarly, to run a batch job with 16 CPUs:

$ bsub -n 16 <my_parallel_job> 

Environment Modules

User-loadable modules are available if the system default packages don't meet your requirements. To see what modules are available, run the "module avail" command from an interactive session:

[asrini@consign ~]$ bsub -Is bash
Job <9990024> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on node063.hpc.local>>
    
[asrini@node063 ~]$ module avail

------------------------------------------------------------------- /usr/share/Modules/modulefiles -------------------------------------------------------------------
NAMD-2.9-Linux-x86_64-multicore dot                             module-info                     picard-1.96                     rum-2.0.5_05
STAR-2.3.0e                     java-sdk-1.6.0                  modules                         pkg-config-path                 samtools-0.1.19
STAR-hg19                       java-sdk-1.7.0                  mpich2-x86_64                   python-2.7.5                    use.own
STAR-mm9                        ld-library-path                 null                            r-libs-user
bowtie2-2.1.0                   manpath                         openmpi-1.5.4-x86_64            ruby-1.8.7-p374
devtoolset-2                    module-cvs                      perl5lib                        ruby-1.9.3-p448 

Example use of modules:

[asrini@node063 ~]$ python -V
Python 2.6.6

[asrini@node063 ~]$ which python
/usr/bin/python

[asrini@node063 ~]$ module load python-2.7.5

[asrini@node063 ~]$ python -V
Python 2.7.5

[asrini@node063 ~]$ which python
/opt/software/python/python-2.7.5/bin/python

[asrini@node063 ~]$ module unload python-2.7.5

[asrini@node063 ~]$ which python
/usr/bin/python
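
Modules can also be loaded inside a batch job script, so the right software is on the PATH when the job runs on a compute node. A minimal sketch, assuming the python-2.7.5 module shown above is what you need and that the module command is available in batch shells (the file and job names are placeholders):

 $ cat py_job.sh
 #!/bin/bash
 #BSUB -J my_python_job           # LSF job name
 #BSUB -o my_python_job.%J.out    # Name of the job output file
 #BSUB -e my_python_job.%J.error  # Name of the job error file

 module load python-2.7.5         # put the newer python on the PATH
 python -V                        # should report Python 2.7.5

 $ bsub < py_job.sh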

Instructions for generating Public-Private keypairs

On Mac OS X and GNU/Linux systems, run the following command from within a terminal and follow the on-screen instructions:

$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key ($HOME/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in $HOME/.ssh/id_rsa.
Your public key has been saved in $HOME/.ssh/id_rsa.pub.
The key fingerprint is:
xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx asrini@
The key's randomart image is:
+--[ RSA 2048]----+
|          .      |
|       kjweo     |
|        x B E x  |
|         * B l + |
|        S +aser .|
|           + +   |
|          . weq  |
|           . x 12|
|            45+  |
+-----------------+

On Windows machines you can generate and use public keys with PuTTY; video tutorials for generating and using public keys with PuTTY are available online.

After generating a Public-Private keypair, copy the contents of the .ssh/id_rsa.pub file to a file named .ssh/authorized_keys in your home area on the PMACS cluster.

[$USER@consign ~]$ vim .ssh/authorized_keys

One SSH public key per line; save and close the file

Then change the permissions on the file:

[$USER@consign ~]$ chmod 600 .ssh/authorized_keys
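
As an alternative to editing the file by hand, you can append the key from your local machine in one step (a sketch, assuming an OpenSSH client locally; substitute your own username):

$ cat ~/.ssh/id_rsa.pub | ssh <your_username>@consign.pmacs.upenn.edu 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys'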