5.16) Array Jobs


In some cases it is necessary to run a large number of independent but nearly identical tasks, e.g. a single program that needs to run on 1000 separate data sets, or a parameter sweep in which a program is run repeatedly while varying a single parameter. This is often described as an “embarrassingly parallel” problem, since it is easy to split into tasks that run independently in parallel.

A naive approach would be to write a script that generates a large number of separate, almost identical submission scripts and submits them individually. This is cumbersome for the user and can put an unnecessary load on the scheduler.

SGE provides an alternative to this approach in the form of array jobs. With an array job you only need to write a single script and manage a single job ID, and you avoid putting a strain on the system.

An array job consists of a large number of independent tasks, each running a separate, identical copy of the program. The range of task numbers is set via the -t flag to qsub, in the form first-last:step. Within each task, SGE sets the environment variable $SGE_TASK_ID to that task's number, so the script can use it to select its input. For example, the following submits 1000 tasks numbered 1 to 1000:

$ qsub -cwd -l h_rt=0:30:00 -t 1-1000:1 array_job.sh
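
As an illustration of how $SGE_TASK_ID might be used inside array_job.sh, here is a minimal sketch. The file naming scheme (data_N.txt), the helper function input_for_task, and the program name my_program are assumptions made for this example and do not come from any particular installation. The fallback value of 1 is there only so that the script can be tried outside SGE.

```shell
#!/bin/bash
# Sketch of an array job script: each task picks its own input file.

# Hypothetical helper: map a task number to an input file name.
# Assumes the inputs are named data_1.txt, data_2.txt, ...
input_for_task() {
    echo "data_$1.txt"
}

# SGE sets SGE_TASK_ID for each task in the array.
# The default of 1 is an assumption for testing outside the scheduler.
task_id="${SGE_TASK_ID:-1}"
input_file="$(input_for_task "${task_id}")"

echo "Task ${task_id} will process ${input_file}"

# Hypothetical program invocation, left commented out in this sketch:
# ./my_program "${input_file}" > "result_${task_id}.txt"
```

Submitted with the qsub command above, task 1 would select data_1.txt, task 2 would select data_2.txt, and so on up to task 1000, all from the same script.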