Grid Engine

Grid Engine is a job scheduling application used to control cluster resources. We use it extensively.

Submitting Jobs

We require all jobs to be submitted to the batch queue. Grid Engine provides a utility called qsub to add a job to the batch queue. Each cluster has different rules based on the needs of its groups. In general, your job will run as soon as a resource becomes available. The qsub command requires at least one argument: the path to the submit script.

Serial Jobs

To submit a serial job you can do the following:

qsub ssubmit

Where ssubmit looks like:

#$ -cwd
#$ -j y
#$ -S /bin/bash
module load benchmarks
path/to/binary < path/to/input

Job Chains using hold_jid

Create the first script and give it a job name with -N:

#$ -S /bin/sh
#$ -N doThing1

Create a second script that names the first job in -hold_jid:

#$ -S /bin/sh
#$ -N doThing2
#$ -hold_jid doThing1

doThing2 will not start until doThing1 has finished.

-hold_jid <job_name> can only be used to reference your own jobs

-hold_jid <job_id> can be used to reference any job
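The same dependency can also be set on the qsub command line, using the job id instead of the job name. The following is a minimal sketch, not a site-supported tool. It assumes your qsub supports the -terse option (which prints only the job id; older Grid Engine releases may not have it), and the function name chain_jobs and the script names are hypothetical.

```shell
#!/bin/bash
# Sketch: submit scripts so that each one waits for the previous one.
# Assumption: qsub supports -terse, which prints only the job id.

chain_jobs() {
    local prev_id="" job_id script
    for script in "$@"; do
        if [ -z "$prev_id" ]; then
            # First job in the chain: no dependency.
            job_id=$(qsub -terse "$script") || return 1
        else
            # Later jobs are held until the previous job id finishes.
            job_id=$(qsub -terse -hold_jid "$prev_id" "$script") || return 1
        fi
        echo "submitted $script as job $job_id"
        prev_id=$job_id
    done
}

# Example (hypothetical script names):
#   chain_jobs thing1.sh thing2.sh
```

Holding on the job id avoids surprises when two of your jobs share the same name.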

Parallel MPI Jobs

To submit a parallel job you can do the following:

qsub -pe mpi 2 psubmit

Where psubmit looks similar to:

#$ -cwd
#$ -j y
#$ -N relay
#$ -S /bin/bash
#$ -e sge.err
#$ -o sge.out

module load benchmarks
mpirun relay-gcc-4.3.1 1
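When debugging a parallel job it can help to print what Grid Engine actually granted. The sketch below could be added to a submit script such as psubmit. It relies on the NSLOTS and PE_HOSTFILE environment variables that Grid Engine sets for parallel environment jobs. The function name describe_allocation is hypothetical, and whether your mpirun needs these values explicitly depends on how MPI is integrated on your cluster.

```shell
#!/bin/bash
# Sketch: report the slots and hosts granted to a parallel job.
# NSLOTS and PE_HOSTFILE are set by Grid Engine for -pe jobs;
# outside the scheduler they are normally unset.

describe_allocation() {
    echo "slots granted: ${NSLOTS:-unknown}"
    if [ -n "$PE_HOSTFILE" ] && [ -r "$PE_HOSTFILE" ]; then
        # Each hostfile line starts with: hostname, slot count.
        while read -r host slots _rest; do
            echo "  $host: $slots slot(s)"
        done < "$PE_HOSTFILE"
    fi
}

describe_allocation
```

With -j y the report ends up in the job's output file alongside the program output.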

Array Jobs

Examples of array jobs (including one that uses R scripts) can be found here
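For orientation, a minimal array job submit script might look like the sketch below. The -t option and the SGE_TASK_ID variable are standard Grid Engine features. The job name arraydemo, the helper input_for_task, and the input file naming scheme are assumptions made for illustration only.

```shell
#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -N arraydemo
#$ -t 1-10

# Grid Engine sets SGE_TASK_ID to this task's index (1..10 here).
# Default to 1 so the script can also be tried outside the scheduler.
task_id=${SGE_TASK_ID:-1}

# Hypothetical naming scheme: one input file per task.
input_for_task() {
    printf 'input.%s.dat\n' "$1"
}

echo "task $task_id would process $(input_for_task "$task_id")"
```

Submit it with qsub as usual. Each task then shows its index in the ja-task-ID column of qstat.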

Monitoring Jobs

We provide a few different ways to monitor your jobs.


Grid Engine comes with a command line utility called qstat that allows users to monitor jobs.

The default is to just show the jobs that you own.

$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
  40426 0.52881  scott        r     08/06/2008 14:39:16 ib.q@icompute-0-4.local           16        

If you want to see a particular user's jobs you can use the [-u user,…] option.

$ id -un
$ qstat -u bill
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
  40417 0.52823  bill         r     08/05/2008 14:13:16 ib.q@icompute-0-12.local           8       

qstat state letter codes:

Category                                                 SGE Letter Code
Pending                                                  qw
Pending, user and system hold                            hqw
Pending, user and system hold, re-queue                  hRwq
Running                                                  r
Transferring                                             t
Running, re-submit                                       Rr
Transferring, re-submit                                  Rt
Suspended, job suspended                                 s, ts
Queue suspended                                          S, tS
Queue suspended by alarm                                 T, tT
All suspended with re-submit                             Rs, Rts, RS, RtS, RT, RtT
Error, all pending states with error                     Eqw, Ehqw, EhRqw
Deleted, all running/suspended states with deletion      dr, dt, dRr, dRt, ds, dS, dT, dRs, dRS, dRT
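If you script around qstat, a small lookup can turn a state code into its category. The sketch below covers only the codes listed in the table above. The function name state_category is hypothetical, and any code it does not recognize is reported as unknown rather than guessed.

```shell
#!/bin/bash
# Sketch: map a qstat state code to the category from the table above.

state_category() {
    case "$1" in
        E*)                 echo "error" ;;      # Eqw, Ehqw, EhRqw
        d*)                 echo "deleted" ;;    # dr, dt, dRr, ...
        qw|hqw|hRwq)        echo "pending" ;;
        r|t|Rr|Rt)          echo "running" ;;
        s|ts|S|tS|T|tT)     echo "suspended" ;;
        Rs|Rts|RS|RtS|RT|RtT) echo "suspended" ;; # suspended with re-submit
        *)                  echo "unknown" ;;
    esac
}

# Example:
#   state_category Eqw
```

Jobs in the error category usually need attention before they will run.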

Ganglia + Grid Engine

Many of our clusters use Rocks. We also use a tool called Ganglia to monitor systems and networks. Rocks has some magic that ties Grid Engine into Ganglia so you can view the status of the nodes that your job is running on. This is very useful for visualizing performance problems or bottlenecks.

To use Ganglia to monitor your job, go to the cluster's web site and append /ganglia to the address. For example:

The Job Queue page will run a script on the server that reports the status of all jobs on the cluster (it runs the equivalent of qstat -u \*). You can filter the results by user and sort by the various columns. Sorting can be useful if there are a lot of jobs running. Once you find your job click on the job id or status and you will go to the job detail page.

The Job Detail page will show you all nodes that are involved in the current job. You can select to view a number of metrics by using the drop down menu.

support/hpc/gridengine.txt · Last modified: 2012/05/23 15:35 by tlknight