HPC:Job Dependency
This page describes how to set job dependency criteria for jobs run on the PMACS HPC system.
Job Dependency, what is it?
Job dependency criteria are useful when a sequence of steps must be executed as part of a pipeline/workflow and each step is its own LSF job.
A job dependency is set with the -w flag of the bsub command. When a job is submitted with the -w "<condition>" option, it is not dispatched until the condition is met.
Until the dependency condition is met, the job remains in a pending (PEND) state. If the condition is never met (for instance, because an earlier step in the pipeline failed), the job remains pending indefinitely.
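A dependency can also be supplied directly on the command line at submission time. A minimal sketch (the job ID 12345 and the script names are hypothetical):

# submit the first step; bsub prints the assigned job ID, e.g.
# "Job <12345> is submitted to default queue <normal>."
bsub < step1.sh

# submit the second step so that it is dispatched only after job 12345
# finishes successfully (DONE); the quotes keep the shell from
# interpreting the parentheses
bsub -w "done(12345)" < step2.sh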
Examples
Simple Job dependency
Consider a two-step job. Step 1 loads a Python virtual environment and prints the version of NumPy available in that environment; step 2 loads a different Python virtual environment and prints the NumPy version within it.
Note that step 1 requests 2 CPU cores and 10GB of RAM, while step 2 uses the default allocation of 1 vCPU and 6GB of RAM:
Step 1:
[asrini@node061 job_scripts]$ cat j1.sh
#!/bin/bash
#BSUB -J job1
#BSUB -o job1.%J.out
#BSUB -e job1.%J.error
#BSUB -n 2
#BSUB -M 10240
#BSUB -R "span[hosts=1] rusage [mem=10240]"

echo "Job 1"
sleep 20
source $HOME/my_python-2.7.9/bin/activate
echo "numpy version:"
python -c "import numpy; print numpy.__version__"
echo "python version:"
python -V
Step 2:
[asrini@node061 job_scripts]$ cat j2.sh
#!/bin/bash
#BSUB -J job2
#BSUB -o job2.%J.out
#BSUB -e job2.%J.error
#BSUB -w done(job1)

echo "Job 2"
sleep 20
source $HOME/my_python-2.7.5/bin/activate
echo "numpy version:"
python -c "import numpy; print numpy.__version__"
echo "python version:"
python -V
Both jobs are submitted the way any job script should be, by redirecting the script into bsub with the < operator:
[asrini@node061 job_scripts]$ bsub < j1.sh
Job <49119037> is submitted to default queue <normal>.
[asrini@node061 job_scripts]$ bsub < j2.sh
Job <49119038> is submitted to default queue <normal>.
[asrini@node061 job_scripts]$ bjobs
JOBID     USER    STAT  QUEUE   FROM_HOST    EXEC_HOST    JOB_NAME  SUBMIT_TIME
49119037  asrini  RUN   normal  node061.hpc  2*node133.h  job1      Feb 13 10:49
49119038  asrini  PEND  normal  node061.hpc               job2      Feb 13 10:49

[asrini@node061 job_scripts]$ bjobs -l 49119037

Job <49119037>, Job Name <job1>, User <asrini>, Project <default>, Status <RUN>,
Queue <normal>, Command <#!/bin/bash;#BSUB -J job1;#BSUB -o job1.%J.out;
#BSUB -e job1.%J.error;#BSUB -n 2;#BSUB -M 10240;#BSUB -R "span[hosts=1]
rusage [mem=10240]"; echo "Job 1";sleep 20;source $HOME/my_python-2.7.9/bin/activate;
echo "numpy version:";python -c "import numpy; print numpy.__version__";
echo "python version:";python -V>, Share group charged </asrini>
Wed Feb 13 10:49:23: Submitted from host <node061.hpc.local>, CWD
                     <$HOME/hack_area/job_scripts>, Output File <job1.49119037.out>,
                     Error File <job1.49119037.error>, 2 Task(s), Requested
                     Resources <span[hosts=1] rusage [mem=10240]>;

 MEMLIMIT
     10 G

Wed Feb 13 10:49:25: Started 2 Task(s) on Host(s) <2*node133.hpc.local>,
                     Allocated 2 Slot(s) on Host(s) <2*node133.hpc.local>,
                     Execution Home </home/asrini>, Execution CWD
                     </home/asrini/hack_area/job_scripts>;
Wed Feb 13 10:49:34: Resource usage collected.
                     MEM: 2 Mbytes; SWAP: 0 Mbytes; NTHREAD: 5
                     PGID: 14342; PIDs: 14342 14343 14347 14360

MEMORY USAGE:
MAX MEM: 2 Mbytes; AVG MEM: 2 Mbytes

SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut    pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -     -     -    -     -     -      -      -
 loadStop    -     -     -     -     -     -    -     -     -      -      -

           nfsops uptime
 loadSched    -      -
 loadStop     -      -

RESOURCE REQUIREMENT DETAILS:
Combined: select[type == local] order[r15s:pg] rusage[mem=10240.00] span[hosts=1]
          same[model] affinity[thread(1)*1]
Effective: select[type == local] order[r15s:pg] rusage[mem=10240.00] span[hosts=1]
           same[model] affinity[thread(1)*1]
[asrini@node061 job_scripts]$ bjobs -l 49119038

Job <49119038>, Job Name <job2>, User <asrini>, Project <default>, Status <PEND>,
Queue <normal>, Command <#!/bin/bash;#BSUB -J job2;#BSUB -o job2.%J.out;
#BSUB -e job2.%J.error;#BSUB -w done(job1); echo "Job 2";sleep 20;
source $HOME/my_python-2.7.5/bin/activate;echo "numpy version:";
python -c "import numpy; print numpy.__version__";echo "python version:";python -V>
Wed Feb 13 10:49:26: Submitted from host <node061.hpc.local>, CWD
                     <$HOME/hack_area/job_scripts>, Output File <job2.49119038.out>,
                     Error File <job2.49119038.error>, Dependency Condition
                     <done(job1)>;

PENDING REASONS:
Job dependency condition not satisfied;

SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut    pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -     -     -    -     -     -      -      -
 loadStop    -     -     -     -     -     -    -     -     -      -      -

           nfsops uptime
 loadSched    -      -
 loadStop     -      -

RESOURCE REQUIREMENT DETAILS:
Combined: select[type == local] order[r15s:pg] span[ptile='!',Intel_EM64T:26]
          same[model] affinity[thread(1)*1]
Effective: -
[asrini@node061 job_scripts]$ bjobs -d
JOBID     USER    STAT  QUEUE   FROM_HOST    EXEC_HOST    JOB_NAME  SUBMIT_TIME
49119037  asrini  DONE  normal  node061.hpc  2*node133.h  job1      Feb 13 10:49
49119038  asrini  DONE  normal  node061.hpc  node045.hpc  job2      Feb 13 10:49
The job logs also show that both jobs finished successfully and that step 2 completed after step 1.
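To double-check the ordering from LSF's own records, the bhist command reports the event timeline for each job; a sketch, using the job IDs from the run above:

# bhist -l lists submit, dispatch, start, and finish events per job;
# job2's start time should fall after job1's finish time
bhist -l 49119037 49119038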
A more complex job dependency setup
Now let us consider a three-step process in which step 3 requires both step 1 AND step 2 to complete. We use the same scripts as above, with the addition of a step 3 script. Note that both dependency conditions are given in a single -w expression, joined with &&:
[asrini@node061 job_scripts]$ cat j3.sh
#!/bin/bash
#BSUB -J job3
#BSUB -o job3.%J.out
#BSUB -e job3.%J.error
#BSUB -w "done(job1) && done(job2)"

echo "Job 3"
sleep 50
source $HOME/my_python-3.4.2/bin/activate
echo "numpy version:"
python -c "import numpy; print(numpy.__version__)"
echo "python version:"
python -V
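For reference, LSF dependency expressions can combine several conditions with the logical operators && (AND), || (OR), and ! (NOT), and group them with parentheses. A minimal sketch (the job names jobA/jobB/jobC and the script j4.sh are hypothetical):

# dispatch j4.sh once jobA has finished successfully and at least one of
# jobB or jobC has finished successfully
bsub -w "done(jobA) && (done(jobB) || done(jobC))" < j4.sh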
[asrini@node061 job_scripts]$ bsub < j1.sh
Job <49119072> is submitted to default queue <normal>.
[asrini@node061 job_scripts]$ bsub < j2.sh
Job <49119073> is submitted to default queue <normal>.
[asrini@node061 job_scripts]$ bsub < j3.sh
Job <49119074> is submitted to default queue <normal>.
[asrini@node061 job_scripts]$ bjobs
JOBID     USER    STAT  QUEUE       FROM_HOST    EXEC_HOST    JOB_NAME  SUBMIT_TIME
49119009  asrini  RUN   interactiv  consign.hpc  node061.hpc  bash      Feb 13 10:44
49119072  asrini  RUN   normal      node061.hpc  2*node043.h  job1      Feb 13 10:58
49119073  asrini  PEND  normal      node061.hpc               job2      Feb 13 10:58
49119074  asrini  PEND  normal      node061.hpc               job3      Feb 13 10:58
Checking the status after all three jobs have finished shows that each completed successfully:
[asrini@node061 job_scripts]$ bjobs -d
JOBID     USER    STAT  QUEUE   FROM_HOST    EXEC_HOST    JOB_NAME  SUBMIT_TIME
49119004  asrini  DONE  normal  node065.hpc  2*node105.h  job1      Feb 13 10:42
49119037  asrini  DONE  normal  node061.hpc  2*node133.h  job1      Feb 13 10:49
49119038  asrini  DONE  normal  node061.hpc  node045.hpc  job2      Feb 13 10:49
49119072  asrini  DONE  normal  node061.hpc  2*node043.h  job1      Feb 13 10:58
49119073  asrini  DONE  normal  node061.hpc  node048.hpc  job2      Feb 13 10:58
49119074  asrini  DONE  normal  node061.hpc  node128.hpc  job3      Feb 13 10:58
NOTE: In the listing above, notice that older jobs are also shown. When job2 and job3 were submitted, an older job1 already existed in the LSF logs with a status of "DONE". This did not affect the second batch of submissions: the dependency criteria apply only to jobs that are currently in the queue.
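Because job names can be reused, depending on a job ID instead of a job name avoids this ambiguity entirely. A minimal sketch, assuming the default "Job <ID> is submitted ..." message shown above (the sed parsing is just one illustrative way to extract the ID):

# capture the numeric job ID from bsub's submission message
jid1=$(bsub < j1.sh | sed -n 's/^Job <\([0-9]*\)>.*/\1/p')

# make job2 depend on that specific submission rather than on the name "job1"
bsub -w "done(${jid1})" < j2.sh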
Dependency condition not met
Below is an example in which step 1 failed, i.e., it terminated without reaching the DONE status.
Step 1: the script below has a few typos in its syntax and is expected to fail:
[asrini@node061 job_scripts]$ cat badj1.sh
#!/bin/bash
#BSUB -J job1
#BSUB -o job1.%J.out
#BSUB -e job1.%J.error
#BSUB -n 2
#BSUB -M 10240
#BSUB -R "span[hosts=1] rusage [mem=10240]"

cho "Job 1"
leep 20
ource $HOME/my_python-2.7.9/bin/activate
cho "numpy version:"
ython -c "import nump; prin numpy.__version__"
cho "python version:"
ython -V
[asrini@node061 job_scripts]$ bsub < badj1.sh
Job <49119110> is submitted to default queue <normal>.
[asrini@node061 job_scripts]$ bsub < j2.sh
Job <49119111> is submitted to default queue <normal>.
[asrini@node061 job_scripts]$ bjobs
JOBID     USER    STAT  QUEUE   FROM_HOST    EXEC_HOST    JOB_NAME  SUBMIT_TIME
49119110  asrini  RUN   normal  node061.hpc  2*node120.h  job1      Feb 13 11:11
49119111  asrini  PEND  normal  node061.hpc               job2      Feb 13 11:11
Notice that job1 this time finished with the EXIT status; job2 therefore continues to wait in the PEND state, because its dependency condition can never be met:
[asrini@node061 job_scripts]$ bjobs 49119110
JOBID     USER    STAT  QUEUE   FROM_HOST    EXEC_HOST    JOB_NAME  SUBMIT_TIME
49119110  asrini  EXIT  normal  node061.hpc  2*node120.h  job1      Feb 13 11:11
[asrini@node061 job_scripts]$ bjobs 49119111
JOBID     USER    STAT  QUEUE   FROM_HOST    EXEC_HOST    JOB_NAME  SUBMIT_TIME
49119111  asrini  PEND  normal  node061.hpc               job2      Feb 13 11:11
[asrini@node061 job_scripts]$ bjobs -l 49119111

Job <49119111>, Job Name <job2>, User <asrini>, Project <default>, Status <PEND>,
Queue <normal>, Command <#!/bin/bash;#BSUB -J job2;#BSUB -o job2.%J.out;
#BSUB -e job2.%J.error;#BSUB -w done(job1); echo "Job 2";sleep 20;
source $HOME/my_python-2.7.5/bin/activate;echo "numpy version:";
python -c "import numpy; print numpy.__version__";echo "python version:";python -V>
Wed Feb 13 11:11:42: Submitted from host <node061.hpc.local>, CWD
                     <$HOME/hack_area/job_scripts>, Output File <job2.49119111.out>,
                     Error File <job2.49119111.error>, Dependency Condition
                     <done(job1)>;

PENDING REASONS:
Dependency condition invalid or never satisfied;

SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut    pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -     -     -    -     -     -      -      -
 loadStop    -     -     -     -     -     -    -     -     -      -      -

           nfsops uptime
 loadSched    -      -
 loadStop     -      -

RESOURCE REQUIREMENT DETAILS:
Combined: select[type == local] order[r15s:pg] span[ptile='!',Intel_EM64T:26]
          same[model] affinity[thread(1)*1]
Effective: -
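A job stuck this way will pend indefinitely, so it must be dealt with explicitly. Two common options, sketched below (the job ID is from the run above; ended() is the LSF dependency condition that is satisfied when the named job finishes in either DONE or EXIT state):

# option 1: remove the permanently pending job
bkill 49119111

# option 2: at submission time, use ended() instead of done() when a step
# should run regardless of whether the previous step succeeded
# (for example, a cleanup step)
#BSUB -w ended(job1)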