Difference between revisions of "HPC:FAQ"

From HPC wiki

Revision as of 18:06, 23 July 2018

This page has all the answers you are looking for ..... OK, maybe not! But you will find answers to some of the most common questions we get about the PMACS HPC System.

Other Pages

Administrivia

  • How much does it cost to use the PMACS HPC cluster?
 Usage costs are published here and here 

Requesting Accounts

  • How do I request an account on the PMACS HPC Cluster?
 - Step 0 : Get a UPENN PennKey
 - Step 1 : As part of our account creation process, we routinely collect several pieces of information listed here and here. Send us all this information in an email.
 - Step 2 : If you are not the PI, cc your PI/BA in your account request email so we can follow up with them directly for email authorization. If you are the PI/BA, you don't have to do anything else.

  • I requested an account per the instructions outlined above. How long does it take to create the account?
 Typically, less than 2 business days. Sometimes, emails do get missed, so feel free to nudge us again! 
  • OK, I got an email confirming my account. Now what?
 Use the cluster to do your research! 

General Questions

  • Before I begin using the PMACS HPC cluster, I'd like to know how much it would cost me to do my work?
 Unfortunately, there is no easy answer to this question. The cost of usage varies greatly depending on the kind of work you do, whether you already have a working pipeline or are only now building one, the tools you use, etc. 
  • I have a limited amount of funds available. Can my usage be capped once I hit a certain limit?
 No, we currently do not cap usage after compute/storage costs have reached a dollar amount.  


PMACS ERA Team Contact Info

  • What is the best way to ask questions about the PMACS HPC System?
  Send all PMACS HPC related questions/requests to our group's email: psom-pmacshpc@pennmedicine.upenn.edu

Grant related questions

  • I'm submitting a grant application and would like to include some information about the computational resources available.
 We have information here that you can copy-paste directly into your application.
  • Do I need to acknowledge the PMACS HPC system in my publication?
 Not necessarily, but a significant portion of our HPC and Archive systems was funded through an NIH grant - 1S10OD012312 NIH. So it would be great if you do acknowledge this grant (and us!).

Technical Questions

Job related Questions

  • How to check the status of my jobs?
You can use the "bjobs" command
 
Condensed output of bjobs 
 $ bjobs 27002288
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
27002288   asrini  RUN   interactiv consign.hpc node107.hpc bash       Feb 21 21:50
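
For scripting, the STAT column can be pulled out of the condensed listing. The following is a minimal sketch, assuming the column layout shown above (JOBID first, STAT third); the function name job_stat is hypothetical and not part of LSF.

```shell
# Print "JOBID STAT" for each job in condensed bjobs output read from stdin.
# Assumes the column layout shown above: JOBID is column 1, STAT is column 3.
job_stat() {
    # NR > 1 skips the header row; NF >= 3 skips blank or short lines.
    awk 'NR > 1 && NF >= 3 { print $1, $3 }'
}

# Usage on the cluster (not run here):
#   bjobs 27002288 | job_stat
```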
 


  • How to check if my job has stalled?
First, check the error and output files associated with the job.
If error/output files for the job don't provide the necessary information, check the long listing of bjobs a few times over the course of a few minutes, to verify if additional CPU time has accrued. 
To get the long listing for the bjobs command, run "bjobs -l <jobid>" (example below):
$ bjobs -l 27007909

Job <27007909>, User <asrini>, Project <default>, Status <RUN>, Queue <normal>,
                     Command <sleep 1000>, Share group charged </asrini>
Thu Feb 22 11:31:32: Submitted from host <consign.hpc.local>, CWD <$HOME>;
Thu Feb 22 11:31:32: Started 1 Task(s) on Host(s) <node041.hpc.local>, Allocate
                     d 1 Slot(s) on Host(s) <node041.hpc.local>, Execution Home
                      </home/asrini>, Execution CWD </home/asrini>;
Thu Feb 22 11:31:34: Resource usage collected.
                     MEM: 0 Mbytes;  SWAP: 0 Mbytes;  NTHREAD: 1
                     PGID: 19597;  PIDs: 19597 


 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  

           nfsops  uptime 
 loadSched     -       -  
 loadStop      -       -  

 RESOURCE REQUIREMENT DETAILS:
 Combined: select[type == local] order[r15s:pg] span[ptile='!',Intel_EM64T:32] 
                     same[model] affinity[thread(1)*1]
 Effective: select[type == local] order[r15s:pg] span[ptile='!',Intel_EM64T:32]
                      same[model] affinity[thread(1)*1] 
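
The repeated check for accrued CPU time can be scripted. This is only a sketch under one loud assumption: that the long listing of a running job contains a line of the form "The CPU time used is N seconds." That line does not appear in the sample above, so verify it against your own output first. The helper name cpu_seconds is hypothetical.

```shell
# Extract the accumulated CPU time (in seconds) from "bjobs -l" output on stdin.
# ASSUMPTION: the long listing contains a line of the form
#   "The CPU time used is <N> seconds."
# This line is not visible in the sample above; adjust the pattern if your
# output differs. Prints nothing when no such line is found.
cpu_seconds() {
    sed -n 's/.*The CPU time used is \([0-9.][0-9.]*\) seconds.*/\1/p' | tail -n 1
}

# Usage on the cluster (not run here): sample twice, a few minutes apart.
#   first=$(bjobs -l 27007909 | cpu_seconds)
#   sleep 300
#   second=$(bjobs -l 27007909 | cpu_seconds)
#   [ "$first" = "$second" ] && echo "no CPU time accrued; job may be stalled"
```

An unchanged value between two samples is only a hint: a job waiting on I/O can also show no new CPU time.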

 


  • How to terminate a job
 Use the "bkill <jobid>" command without any additional flags, first:
  $ bkill 27023685
  Job <27023685> is being terminated
 
  $ bjobs 27023685
  JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
  27023685   asrini  EXIT  normal     consign.hpc node037.hpc *eep 10100 Feb 23 12:33
  
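 If the above fails, try "bkill -s 7 <JOBID>". The "-s 7" option sends signal number 7 to the job, and the scheduler waits for confirmation that this took effect. Note that on Linux signal 7 is usually SIGBUS, not SIGTERM (15) or SIGKILL (9); run "kill -l" on the cluster to confirm the signal numbering.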
  $ bkill -s 7 27023688
  Job <27023688> is being signaled
  
  $ bjobs 27023688
  JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
  27023688   asrini  EXIT  normal     consign.hpc node012.hpc *eep 10100 Feb 23 12:39
  
 If the above fails, you can then try the "bkill -r <JOBID>" approach. This also kills the job, but the scheduler does not wait for confirmation and proceeds to remove the job from the queue.
  $ bkill -r 27023755
  Job <27023755> is being terminated
  
  $ bjobs 27023755
  JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
  27023755   asrini  EXIT  normal     consign.hpc node037.hpc *eep 10100 Feb 23 12:41
  

Note: Read the man page for bkill for a more detailed explanation


  • How to terminate all my jobs
 $ bkill 0
 Job <3797180923> is being terminated
 Job <3797180924> is being terminated
 Job <3797180925> is being terminated
 Job <3797180926> is being terminated


Note: Read the man page for bkill for a more detailed explanation


  • How do I make sure my job does not run too long?
 Set a runtime limit, in minutes, for the job (a.k.a. the wall-clock limit). The job is then terminated, using bkill, if it is still running when the configured runtime limit is reached. 
  $ bsub -W 1 sleep 300
  Job <27576257> is submitted to default queue <normal>.


  $ bjobs -l

Job <27576257>, User <asrini>, Project <default>, Status <RUN>, Queue <normal>,
                     Command <sleep 300>, Share group charged </asrini>
Fri Mar  2 12:12:04: Submitted from host <consign.hpc.local>, CWD <$HOME>;

 RUNLIMIT                
 1.0 min of node120.hpc.local
Fri Mar  2 12:12:08: Started 1 Task(s) on Host(s) <node120.hpc.local>, Allocate
                     d 1 Slot(s) on Host(s) <node120.hpc.local>, Execution Home
                      </home/asrini>, Execution CWD </home/asrini>;

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  

           nfsops  uptime 
 loadSched     -       -  
 loadStop      -       -  

 RESOURCE REQUIREMENT DETAILS:
 Combined: select[type == local] order[r15s:pg] span[ptile='!',Intel_EM64T:32] 
                     same[model] affinity[thread(1)*1]
 Effective: select[type == local] order[r15s:pg] span[ptile='!',Intel_EM64T:32]
                      same[model] affinity[thread(1)*1] 
  


  • How do I request more than the default 6GB RAM limit for my jobs?
 Use the -M <mem_in_MB> bsub option. For example, to request 10 GB of RAM:
  $ bsub -M 10240 sh test_r.sh
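
The value 10240 above is 10 x 1024, since -M is given in MB here. A small sketch of that arithmetic, using a hypothetical helper named gb_to_mb (not an LSF command):

```shell
# Convert a whole number of gigabytes to the megabyte value passed to -M above.
# gb_to_mb is a hypothetical helper name used only for illustration.
gb_to_mb() {
    echo $(( $1 * 1024 ))
}

# Usage on the cluster (not run here):
#   bsub -M "$(gb_to_mb 10)" sh test_r.sh
```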
  


  • Why is my job in pending (PEND) state?
 There can be many reasons for this. Always check the output of "bjobs -l"
   $ bjobs -l 35184277

   Job <35184277>, User <asrini>, Project <default>, Status <PEND>, Queue <normal>
                     , Command <sh myjob.sh>
   Tue May  8 11:48:11: Submitted from host <consign.hpc.local>, CWD <$HOME>, 16 T
                     ask(s);
   PENDING REASONS:
   Not specified in job submission: 68 hosts;
   Affinity resource requirement cannot be met because there are not enough processor units to satisfy the job affinity request: 8 hosts;
   Job slot limit reached: 11 hosts;
   Load information unavailable: 5 hosts;
   Not enough job slot(s): 26 hosts;
   Closed by LSF administrator: 9 hosts;
   Just started a job recently: 10 hosts;
   Not enough hosts to meet the job's spanning requirement: 2 hosts;

   SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
   loadSched   -     -     -     -       -     -    -     -     -      -      -  
   loadStop    -     -     -     -       -     -    -     -     -      -      -  

           nfsops  uptime 
   loadSched     -       -  
   loadStop      -       -  

   RESOURCE REQUIREMENT DETAILS:
   Combined: select[type == local] order[r15s:pg] span[ptile='!',Intel_EM64T:24] 
                     same[model] affinity[thread(1)*1]
   Effective: -

  
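 Notice above that every reason is listed under "PENDING REASONS", along with the number of hosts it applies to. For example, one of the reasons for this job to remain in pending state is "Affinity resource requirement cannot be met because there are not enough processor units to satisfy the job affinity request: 8 hosts;"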
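
When the long listing is lengthy, the reasons can be filtered out on their own. A minimal sketch, assuming the section headings "PENDING REASONS:" and "SCHEDULING PARAMETERS:" appear exactly as in the output above; the function name pending_reasons is hypothetical.

```shell
# Print only the lines between "PENDING REASONS:" and "SCHEDULING PARAMETERS:"
# from "bjobs -l" output read on stdin, with leading whitespace removed.
# Assumes the section headings appear exactly as in the sample above.
pending_reasons() {
    awk '
        /PENDING REASONS:/       { in_block = 1; next }
        /SCHEDULING PARAMETERS:/ { in_block = 0 }
        in_block && NF           { sub(/^[ \t]+/, ""); print }
    '
}

# Usage on the cluster (not run here):
#   bjobs -l 35184277 | pending_reasons
```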

Software Installation

  • Am I allowed to install software in my home directory?
  Yes. See the Manual Software Installation section of our wiki for some pointers.
  • What are the rules for software installation?
  Rule 0 : All software MUST be installed on a compute node. Use an interactive session (bsub -Is bash) for all software installation. Do NOT use the head node.
  Rule 1 : If you get a missing library or header file error during the software installation, make sure you've read and followed Rule 0.
  Rule 2 : Do NOT use "sudo" in your installation steps. This will result in a permissions error and will generate an alert to us, the Admins.
  Rule 3 : Read the rest of this section.
  Rule 4 : If the installation still fails, send us a note with details on the steps you followed and exactly which interactive session you tried to install the program on so we can investigate.


  • I get a "Permission denied" error when I try to install a program in my home directory by running "make install". How do I fix it?
  The reason for this error is that, by default, most software packages are written to be compiled and installed in system-level directories like /usr/bin, /usr/local/bin, etc.
  You have two options:
  
  Option 1: Compile the program with a "prefix" flag during the "configure" step:
   ./configure --prefix=$HOME
   
  Option 2: Install the program in a different destination after the compilation is done (using your home directory as the destination; change it if you want to):
   make install DESTDIR=$HOME

   OR

   make install DESTDIR=/home/<usr_name>

   Replace "<usr_name>" with your user name 
   
  • OK, I tried one of the above options to install the program in my home directory and it worked! How do I use it?
 If you followed Option 1 above, the program was likely installed in a "bin" directory in your $HOME directory. First verify that the program exists there. Using vcftools as an example: 
   $ ls $HOME/bin/vcftools
   /home/asrini/bin/vcftools
  
  If you followed Option 2 above, then the program was likely installed under $HOME/usr/local/bin or $HOME/usr/bin/. In addition to checking $HOME/bin, check these locations as well. Again, using vcftools as an example:
     $ ls $HOME/bin/vcftools

     $ ls $HOME/usr/bin/vcftools

     $ ls $HOME/usr/local/bin/vcftools
  
  Once you've determined the location of the installed binary, you can add its directory to your $PATH variable for easy use.
  Assuming the file is in $HOME/bin: 
   export PATH=$HOME/bin:$PATH
   
  The above line can be added to your .bashrc or .bash_profile files
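
  Because .bashrc can be sourced more than once per session, a plain export may add the same directory to $PATH repeatedly. The following is a minimal sketch of a guard against that; path_prepend is a hypothetical helper name, not a standard command.

```shell
# Prepend a directory to PATH only if it is not already listed, so that
# sourcing .bashrc repeatedly does not grow PATH with duplicates.
# path_prepend is a hypothetical helper name, not a standard command.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present: leave PATH unchanged
        *) PATH="$1:$PATH" ;;        # otherwise put the new directory first
    esac
    export PATH
}

path_prepend "$HOME/bin"
```

  Replace $HOME/bin with $HOME/usr/bin or $HOME/usr/local/bin if that is where the binary was found.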
  • OK, I tried both of the above options to install the program in my home directory and still get an error. Now what?
 Send us a note with details on the steps you followed and exactly which interactive session you tried to install the program on so we can investigate.

Troubleshooting Job Failures

More to come soon ...


Downloading Sequencing data from BaseSpace

  • Is Illumina's BaseMount software available on the PMACS HPC system so I can download sequencing data directly from BaseSpace?
 No
  • I have sequencing data stored in Illumina's BaseSpace. How do I download the data?
 You can set up the BaseSpace Sequence Hub CLI tool to download sequencing data. Information on how to do this is available here

Sharing Data

More to come soon ...