HPC:Job Dependency

From HPC wiki

This page discusses details about setting job dependency criteria for jobs run on the PMACS HPC system.

What is a job dependency?

Job dependency conditions are useful when a pipeline/workflow consists of a sequence of steps, each of which runs as its own LSF job.

A dependency condition can be set for any job on the PMACS HPC system with the -w flag of the bsub command. When a job is submitted with the "-w <condition>" option, it is not dispatched until the <condition> is met.

Until the dependency condition is met, the job remains in a pending (PEND) state.

If the condition can never be met (for instance, if an earlier step in the pipeline failed), the job remains pending indefinitely.
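Besides done(), LSF's -w flag accepts several other condition forms. The sketch below lists the common ones (job names are illustrative; the full grammar is in the bsub documentation):

```shell
# Common -w dependency conditions (a sketch; "myjob" is an illustrative name):
#   -w "done(myjob)"     satisfied when myjob finishes with DONE status
#   -w "ended(myjob)"    satisfied when myjob finishes at all (DONE or EXIT)
#   -w "exit(myjob)"     satisfied when myjob finishes with EXIT status
#   -w "started(myjob)"  satisfied once myjob has started running
# Conditions can reference a numeric job ID instead of a name, and can be
# combined with && , || and ! :
bsub -w "done(jobA) && done(jobB)" < step3.sh
```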

Examples

Simple Job dependency

Consider a two-step job: step 1 loads a Python virtual environment and prints the version of NumPy available in that environment, and step 2 loads a different Python virtual environment and prints the NumPy version within it.

Note that step 1 requests 2 CPU cores and 10GB RAM, while step 2 uses the default allocation of 1 CPU core and 6GB RAM:

Step 1:


[asrini@node061 job_scripts]$ cat j1.sh 
#!/bin/bash
#BSUB -J job1 
#BSUB -o job1.%J.out
#BSUB -e job1.%J.error
#BSUB -n 2 
#BSUB -M 10240
#BSUB -R "span[hosts=1] rusage [mem=10240]" 

echo "Job 1"
sleep 20
source $HOME/my_python-2.7.9/bin/activate
echo "numpy version:" 
python -c "import numpy; print numpy.__version__"
echo "python version:"
python -V

Step 2:

[asrini@node061 job_scripts]$ cat j2.sh 
#!/bin/bash
#BSUB -J job2 
#BSUB -o job2.%J.out
#BSUB -e job2.%J.error
#BSUB -w done(job1)

echo "Job 2"
sleep 20
source $HOME/my_python-2.7.5/bin/activate
echo "numpy version:" 
python -c "import numpy; print numpy.__version__"
echo "python version:"
python -V
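The same dependency can also be given on the bsub command line instead of in the script. There the condition should be quoted so the shell does not try to interpret the parentheses (a sketch, equivalent to the #BSUB -w directive above):

```shell
bsub -w "done(job1)" < j2.sh
```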

Both these jobs are submitted as any job script would be, using the < redirection operator:

[asrini@node061 job_scripts]$ bsub < j1.sh 
Job <49119037> is submitted to default queue <normal>.

[asrini@node061 job_scripts]$ bsub < j2.sh 
Job <49119038> is submitted to default queue <normal>.

[asrini@node061 job_scripts]$ bjobs
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
49119037   asrini  RUN   normal     node061.hpc 2*node133.h job1       Feb 13 10:49
49119038   asrini  PEND  normal     node061.hpc             job2       Feb 13 10:49

[asrini@node061 job_scripts]$ bjobs -l 49119037


Job <49119037>, Job Name <job1>, User <asrini>, Project <default>, Status <RUN>
                     , Queue <normal>, Command <#!/bin/bash;#BSUB -J job1 ;#BSU
                     B -o job1.%J.out;#BSUB -e job1.%J.error;#BSUB -n 2 ;#BSUB 
                     -M 10240;#BSUB -R "span[hosts=1] rusage [mem=10240]" ; ech
                     o "Job 1";sleep 20;source $HOME/my_python-2.7.9/bin/activa
                     te;echo "numpy version:" ;python -c "import numpy; print n
                     umpy.__version__";echo "python version:";python -V>, Share
                      group charged </asrini>
Wed Feb 13 10:49:23: Submitted from host <node061.hpc.local>, CWD <$HOME/hack_a
                     rea/job_scripts>, Output File <job1.49119037.out>, Error F
                     ile <job1.49119037.error>, 2 Task(s), Requested Resources 
                     <span[hosts=1] rusage [mem=10240]>;

 MEMLIMIT
     10 G 
Wed Feb 13 10:49:25: Started 2 Task(s) on Host(s) <2*node133.hpc.local>, Alloca
                     ted 2 Slot(s) on Host(s) <2*node133.hpc.local>, Execution 
                     Home </home/asrini>, Execution CWD </home/asrini/hack_area
                     /job_scripts>;
Wed Feb 13 10:49:34: Resource usage collected.
                     MEM: 2 Mbytes;  SWAP: 0 Mbytes;  NTHREAD: 5
                     PGID: 14342;  PIDs: 14342 14343 14347 14360 


 MEMORY USAGE:
 MAX MEM: 2 Mbytes;  AVG MEM: 2 Mbytes

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  

           nfsops  uptime 
 loadSched     -       -  
 loadStop      -       -  

 RESOURCE REQUIREMENT DETAILS:
 Combined: select[type == local] order[r15s:pg] rusage[mem=10240.00] span[hosts
                     =1] same[model] affinity[thread(1)*1]
 Effective: select[type == local] order[r15s:pg] rusage[mem=10240.00] span[host
                     s=1] same[model] affinity[thread(1)*1] 



[asrini@node061 job_scripts]$ bjobs -l 49119038

Job <49119038>, Job Name <job2>, User <asrini>, Project <default>, Status <PEND
                     >, Queue <normal>, Command <#!/bin/bash;#BSUB -J job2 ;#BS
                     UB -o job2.%J.out;#BSUB -e job2.%J.error;#BSUB -w done(job
                     1); echo "Job 2";sleep 20;source $HOME/my_python-2.7.5/bin
                     /activate;echo "numpy version:" ;python -c "import numpy; 
                     print numpy.__version__";echo "python version:";python -V>
Wed Feb 13 10:49:26: Submitted from host <node061.hpc.local>, CWD <$HOME/hack_a
                     rea/job_scripts>, Output File <job2.49119038.out>, Error F
                     ile <job2.49119038.error>, Dependency Condition <done(job1
                     )>;
 PENDING REASONS:
 Job dependency condition not satisfied;

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  

           nfsops  uptime 
 loadSched     -       -  
 loadStop      -       -  

 RESOURCE REQUIREMENT DETAILS:
 Combined: select[type == local] order[r15s:pg] span[ptile='!',Intel_EM64T:26] 
                     same[model] affinity[thread(1)*1]
 Effective: -



[asrini@node061 job_scripts]$ bjobs -d
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
49119037   asrini  DONE  normal     node061.hpc 2*node133.h job1       Feb 13 10:49
49119038   asrini  DONE  normal     node061.hpc node045.hpc job2       Feb 13 10:49
The logs for the jobs also show that both jobs finished successfully and that step 2 completed after step 1.


A more complex job dependency setup

Now, let us consider a three-step process in which step 3 requires both step 1 AND step 2 to complete. We use the same scripts as above, with the addition of a step 3 script:


[asrini@node061 job_scripts]$ cat j3.sh 
#!/bin/bash
#BSUB -J job3 
#BSUB -o job3.%J.out
#BSUB -e job3.%J.error
#BSUB -w "done(job1) && done(job2)"

echo "Job 3"
sleep 50
source $HOME/my_python-3.4.2/bin/activate
echo "numpy version:" 
python -c "import numpy; print(numpy.__version__)"
echo "python version:"
python -V

[asrini@node061 job_scripts]$ bsub < j1.sh 
Job <49119072> is submitted to default queue <normal>.

[asrini@node061 job_scripts]$ bsub < j2.sh 
Job <49119073> is submitted to default queue <normal>.

[asrini@node061 job_scripts]$ bsub < j3.sh 
Job <49119074> is submitted to default queue <normal>.

[asrini@node061 job_scripts]$ bjobs
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
49119009   asrini  RUN   interactiv consign.hpc node061.hpc bash       Feb 13 10:44
49119072   asrini  RUN   normal     node061.hpc 2*node043.h job1       Feb 13 10:58
49119073   asrini  PEND  normal     node061.hpc             job2       Feb 13 10:58
49119074   asrini  PEND  normal     node061.hpc             job3       Feb 13 10:58

Checking the status after all three jobs have finished shows that all three completed successfully:

[asrini@node061 job_scripts]$ bjobs -d
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
49119004   asrini  DONE  normal     node065.hpc 2*node105.h job1       Feb 13 10:42
49119037   asrini  DONE  normal     node061.hpc 2*node133.h job1       Feb 13 10:49
49119038   asrini  DONE  normal     node061.hpc node045.hpc job2       Feb 13 10:49
49119072   asrini  DONE  normal     node061.hpc 2*node043.h job1       Feb 13 10:58
49119073   asrini  DONE  normal     node061.hpc node048.hpc job2       Feb 13 10:58
49119074   asrini  DONE  normal     node061.hpc node128.hpc job3       Feb 13 10:58

NOTE: In the above listing, notice that the older jobs are also listed. When job2 and job3 were submitted, an older job1 already existed in the LSF logs with a status of "DONE".
This, however, did not impact the second batch of submissions: a dependency condition only applies to jobs that are currently in the queue.
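One way to avoid name ambiguity altogether, assuming standard shell tools, is to capture the numeric job ID that bsub prints and depend on the ID rather than the name. The sed pattern below is just one way to pull the number out:

```shell
# bsub reports the new job as: Job <49119037> is submitted to default queue <normal>.
msg='Job <49119037> is submitted to default queue <normal>.'
jobid=$(printf '%s\n' "$msg" | sed -n 's/.*Job <\([0-9][0-9]*\)>.*/\1/p')
echo "$jobid"    # 49119037
# In a real pipeline (assumes LSF's bsub is on PATH):
#   jobid=$(bsub < j1.sh | sed -n 's/.*Job <\([0-9][0-9]*\)>.*/\1/p')
#   bsub -w "done(${jobid})" < j2.sh
```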


Dependency condition not met

Below is an example of what happens when step 1 fails, i.e. terminates without the DONE status.

Step 1: the script below has a few typos and is expected to fail:
[asrini@node061 job_scripts]$ cat badj1.sh 
#!/bin/bash
#BSUB -J job1 
#BSUB -o job1.%J.out
#BSUB -e job1.%J.error
#BSUB -n 2 
#BSUB -M 10240
#BSUB -R "span[hosts=1] rusage [mem=10240]" 

cho "Job 1"
leep 20
ource $HOME/my_python-2.7.9/bin/activate
cho "numpy version:" 
ython -c "import nump; prin numpy.__version__"
cho "python version:"
ython -V
[asrini@node061 job_scripts]$ bsub < badj1.sh 
Job <49119110> is submitted to default queue <normal>.

[asrini@node061 job_scripts]$ bsub < j2.sh 
Job <49119111> is submitted to default queue <normal>.

[asrini@node061 job_scripts]$ bjobs
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
49119110   asrini  RUN   normal     node061.hpc 2*node120.h job1       Feb 13 11:11
49119111   asrini  PEND  normal     node061.hpc             job2       Feb 13 11:11

Notice that job1 this time completed with the EXIT status, and therefore job2 continues to wait in the PEND state because the dependency condition is never met:
[asrini@node061 job_scripts]$ bjobs 49119110
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
49119110   asrini  EXIT  normal     node061.hpc 2*node120.h job1       Feb 13 11:11
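LSF records DONE only when the job script exits with status 0; a non-zero exit, such as the "command not found" errors produced by the typos above, results in EXIT. The rule can be sketched outside LSF:

```shell
# A misspelled command fails with "command not found" and exit status 127;
# a job script that ends this way is recorded as EXIT rather than DONE.
bash -c 'cho "Job 1"' 2>&1 || status=$?
echo "exit status: $status"    # 127
```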

[asrini@node061 job_scripts]$ bjobs 49119111
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
49119111   asrini  PEND  normal     node061.hpc             job2       Feb 13 11:11

[asrini@node061 job_scripts]$ bjobs -l 49119111

Job <49119111>, Job Name <job2>, User <asrini>, Project <default>, Status <PEND
                     >, Queue <normal>, Command <#!/bin/bash;#BSUB -J job2 ;#BS
                     UB -o job2.%J.out;#BSUB -e job2.%J.error;#BSUB -w done(job
                     1); echo "Job 2";sleep 20;source $HOME/my_python-2.7.5/bin
                     /activate;echo "numpy version:" ;python -c "import numpy; 
                     print numpy.__version__";echo "python version:";python -V>
Wed Feb 13 11:11:42: Submitted from host <node061.hpc.local>, CWD <$HOME/hack_a
                     rea/job_scripts>, Output File <job2.49119111.out>, Error F
                     ile <job2.49119111.error>, Dependency Condition <done(job1
                     )>;
 PENDING REASONS:
 Dependency condition invalid or never satisfied;

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  

           nfsops  uptime 
 loadSched     -       -  
 loadStop      -       -  

 RESOURCE REQUIREMENT DETAILS:
 Combined: select[type == local] order[r15s:pg] span[ptile='!',Intel_EM64T:26] 
                     same[model] affinity[thread(1)*1]
 Effective: -
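A dependent job stuck in PEND because its condition can never be satisfied does not have to be left waiting: the dependency can be changed or removed after submission with bmod, or the job can be killed (a sketch; the job ID is the one from the example above, and the replacement condition is illustrative):

```shell
bmod -w "done(job3)" 49119111   # replace the dependency condition
bmod -wn 49119111               # or remove the dependency so the job can be scheduled
bkill 49119111                  # or simply kill the pending job
```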
