SLURM is an open-source resource manager (batch queue) designed for Linux clusters of all sizes.
More in-depth information is available at http://slurm.schedmd.com/documentation.html
#!/bin/bash -l
# NOTE the -l flag!
# If you need any help, please email help@cse.ucdavis.edu

# Name of the job - You'll probably want to customize this.
#SBATCH -J bench

# Standard out and Standard Error output files with the job number in the name.
#SBATCH -o bench-%j.output
#SBATCH -e bench-%j.output

# no -n here, the user is expected to provide that on the command line.

# The useful part of your job goes below

# run one thread for each one the user asks the queue for
# hostname is just for debugging
hostname
export OMP_NUM_THREADS=$SLURM_NTASKS
module load benchmarks
stream
We will ask the batch queue for 1 node and 2, 4, 8, 16, and 32 CPUs to see how well this OpenMP code scales.
bill@gauss:~$ sbatch -N 1 -n 2 -t 5 test.sh
Submitted batch job 12
bill@gauss:~$ sbatch -N 1 -n 4 -t 5 test.sh
Submitted batch job 13
bill@gauss:~$ sbatch -N 1 -n 8 -t 5 test.sh
Submitted batch job 14
bill@gauss:~$ sbatch -N 1 -n 16 -t 5 test.sh
Submitted batch job 15
bill@gauss:~$ sbatch -N 1 -n 32 -t 5 test.sh
Submitted batch job 16
This benchmark uses OpenMP to measure memory bandwidth with a 915MB array. The results show memory bandwidth increasing as more CPUs are used, up to a maximum of about 60GB/sec when using an entire node. If you are curious about the benchmark, additional information is available at https://www.cs.virginia.edu/stream/.
bill@gauss:~$ ls
bench-12.output  bench-14.output  bench-16.output
bench-13.output  bench-15.output  test.sh
bill@gauss:~$ cat *.output | grep ":"
STREAM version $Revision: 5.9 $
Copy:       12517.8583       0.0513       0.0511       0.0516
Scale:      12340.2147       0.0521       0.0519       0.0524
Add:        12495.4439       0.0770       0.0768       0.0772
Triad:      12490.9087       0.0782       0.0769       0.0796
STREAM version $Revision: 5.9 $
Copy:       16879.8667       0.0381       0.0379       0.0384
Scale:      16807.9956       0.0384       0.0381       0.0388
Add:        16733.7084       0.0578       0.0574       0.0583
Triad:      16482.7247       0.0585       0.0582       0.0589
STREAM version $Revision: 5.9 $
Copy:       16098.1749       0.0399       0.0398       0.0400
Scale:      16018.8248       0.0402       0.0400       0.0405
Add:        15887.7032       0.0606       0.0604       0.0610
Triad:      15839.4543       0.0608       0.0606       0.0611
STREAM version $Revision: 5.9 $
Copy:       31428.3070       0.0205       0.0204       0.0206
Scale:      31221.0489       0.0206       0.0205       0.0207
Add:        31324.3960       0.0308       0.0306       0.0311
Triad:      31151.6049       0.0310       0.0308       0.0313
STREAM version $Revision: 5.9 $
Copy:       61085.8038       0.0106       0.0105       0.0108
Scale:      60474.7806       0.0108       0.0106       0.0110
Add:        61318.3663       0.0159       0.0157       0.0162
Triad:      61049.6830       0.0159       0.0157       0.0160
bill@gauss:~$
Newer versions of SLURM support array jobs. For example:
$ cat test.sh
#!/bin/bash
hostname
echo $SLURM_ARRAY_TASK_ID
# Submit a job array with index values between 0 and 1000:
$ sbatch --array=0-1000 MyScript.sh
On the Farm cluster the maximum array size is 10001.
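As a sketch of how the array index might be used in practice (the input list inputs.txt and the file naming below are hypothetical), each array task can pick its own input with $SLURM_ARRAY_TASK_ID:

#!/bin/bash
#SBATCH -J myarray
#SBATCH -o myarray-%A_%a.output
#SBATCH -t 5
#SBATCH --array=0-99

# Hypothetical setup: inputs.txt holds one input file name per line;
# each array task processes the line matching its own index.
INPUT=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" inputs.txt)
echo "Task $SLURM_ARRAY_TASK_ID processing $INPUT on $(hostname)"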
More information is available at http://www.schedmd.com/slurmdocs/job_array.html
The following options can be given on the sbatch command line, or placed at the top of a job script where each one is prefixed with #SBATCH (a combined example follows the table):
Option | Description |
---|---|
# | A comment |
--job-name=myjob | Job name |
--output=myjob.out | Output sent to this file |
--output=myjob.%j.%N.out | Output file named with the job number and the node the job landed on |
--error=myjob.err | Errors written to this file |
--partition=med | Run in the med partition (known as a queue in SGE) |
--nodes=4 | Request four nodes |
--ntasks-per-node=8 | Request eight tasks per node. The number of tasks may not exceed the number of processor cores on the node |
--ntasks=10 | Request 10 tasks for your job |
--time=2-12:00:00 | The maximum amount of time SLURM will allow your job to run before it is killed (2 days and 12 hours in this example) |
--mail-type=type | Set type to BEGIN to be notified when your job starts, END for when it ends, FAIL if it fails, or ALL for all of the above |
--mail-user=email@ucdavis.edu | Email address to send the notifications to |
--mem-per-cpu=MB | Specify a memory limit for each process of your job |
--mem=MB | Specify a memory limit for each node of your job |
--exclusive | Specify that you need exclusive access to nodes for your job |
--share | Specify that your job may share nodes with other jobs |
--begin=2013-09-21T01:30:00 | Start the job after this time |
--begin=now+1hour | Use a relative time to start the job |
--dependency=afterany:100:101 | Wait for jobs 100 and 101 to complete (in any state) before starting |
--dependency=afterok:100:101 | Wait for jobs 100 and 101 to finish without error |
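Putting several of these options together, a job script header might look like the following sketch (the job name, partition, resource limits, and email address are placeholders to adapt to your own job):

#!/bin/bash -l
#SBATCH --job-name=myjob
#SBATCH --output=myjob.%j.%N.out
#SBATCH --error=myjob.%j.%N.err
#SBATCH --partition=med
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=0-04:00:00
#SBATCH --mem=16000
#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@ucdavis.edu

hostname
export OMP_NUM_THREADS=$SLURM_NTASKS
# your program goes here

The --dependency options are usually given on the command line when chaining jobs, e.g. sbatch --dependency=afterok:100 next_step.sh, where 100 is the job ID returned by an earlier submission and next_step.sh is a placeholder script name.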
Show all available options
$ sbatch --help
Another useful command
$ sbatch --usage
You may also pass options directly to the srun command:
$ srun [option list] [executable] [args]
Some srun options
Option | Description |
---|---|
-c # | The number of CPUs used by each process |
-d | Specify debug level between 0 and 5 |
-i file | Redirect input from file |
-o file | Redirect output to file |
-n # | Number of processes for the job |
-N # | Number of nodes to run the job on |
-s | Print usage stats as the job exits |
-t | Time limit for the job; <minutes> or <hours>:<minutes> are commonly used |
-v, -vv, -vvv | Increasing levels of verbosity |
-x node-name | Don't run the job on node-name (and please report any problematic nodes to help@cse.ucdavis.edu) |
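For example, several of these options can be combined on one srun command line (the executable names and the excluded node name below are placeholders):

$ srun -N 2 -n 16 -t 30 -o myrun.out ./my_mpi_program
$ srun -N 1 -n 1 -c 8 -t 60 -x bad-node ./my_threaded_program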
To start an interactive session on a compute node (it takes 30 seconds or so):
$ srun -p partition-name -u --pty bash -il
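The usual resource options can be added to the interactive request as well; for example (the partition name, time limit, CPU count, and memory here are illustrative):

$ srun -p med -t 60 -c 4 --mem=8000 -u --pty bash -il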
Check on your jobs in the queue:

$ squeue -u $USER
CODE | STATE | DESCRIPTION |
---|---|---|
CA | CANCELLED | Job was cancelled by the user or system administrator |
CD | COMPLETED | Job completed |
CF | CONFIGURING | Job has been allocated resources, but is waiting for them to become ready |
CG | COMPLETING | Job is in the process of completing |
F | FAILED | Job terminated with non-zero exit code |
NF | NODE_FAIL | Job terminated due to failure of one or more allocated nodes |
PD | PENDING | Job is awaiting resource allocation |
R | RUNNING | Job currently has an allocation |
S | SUSPENDED | Job has an allocation, but execution has been suspended |
TO | TIMEOUT | Job terminated upon reaching its time limit |
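squeue's output can also be customized with the -o format option; for example, this (illustrative) format string shows the job ID, partition, name, state, elapsed time, and node list or pending reason:

$ squeue -u $USER -o "%.8i %.9P %.20j %.8T %.10M %R"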
Information about a job
root@gauss:~# squeue -l -j 93659
Thu Dec  6 16:51:37 2012
 JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
 93659     debug  aa_b[1]   isudal  RUNNING      33:49 UNLIMITED      1 c0-10
Uber detailed information about a job
$ scontrol show -d job <JOBID>
How to stop a job manually (i.e. abort it): first use squeue as above to find the job number, then run:

$ scancel -u $USER <JOBID>

If you leave off the JOBID, this will cancel all of your jobs.
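scancel can also select jobs by state, which is handy for clearing out queued work without touching running jobs; for example:

$ scancel -u $USER --state=PENDING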
Generally, a cluster has three main SLURM partitions (aka queues): low, med, and hi, plus special-purpose partitions such as bigmem and serial.
Partition | Description |
---|---|
low | Low priority means that your job might be killed at any time. Great for soaking up unused cycles with short jobs, and a particularly good fit for large array jobs when the individual tasks have short run times (see the example after this table) |
med | Medium priority means your job might be suspended, but it will resume when the higher-priority job finishes. *NOT* recommended for MPI jobs. Up to 100% of idle resources can be used. |
hi | High priority means your job will kill/suspend lower-priority jobs and will keep the allocated hardware until it finishes or there's a system or power failure. Limited to the number of CPUs your group contributed. Recommended for MPI jobs. |
bigmem | Large-memory nodes; jobs keep the allocated hardware until they finish or there's a system or power failure |
serial | Older serial nodes; jobs keep the allocated hardware until they finish or there's a system or power failure |
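For example, a large batch of short, independent tasks is a good fit for the low partition as an array job (the script name, time limit, and array range are illustrative):

$ sbatch -p low -t 10 --array=0-999 MyScript.sh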
#SBATCH -p bigmem
#SBATCH -n 16

srun="srun -N1 -n1 -c2"   ## where -c{INT} is the number of threads you're providing the program

for sample in $samp_ids
do
    FW_READS=$TMP_DIR/$sample"_R1_trimmed.fq"
    REV_READS=$TMP_DIR/$sample"_R2_trimmed.fq"
    ## NOTE: parallelizing with srun requires the & at the end of the command in order for wait to work
    bwaalncmd1="$srun time bwa aln -t 2 -I $REF $FW_READS > $OUT_RESULTS/$prefix.1.sai 2> $OUT_STATS/$prefix.1.log &"
    echo "[map-bwa.sh]: running $bwaalncmd1"
    eval $bwaalncmd1
    echoerr $bwaalncmd1
done
wait
This script would run the same 'bwa' command on multiple samples in parallel, then wait for all of the srun commands to finish before continuing to the next step in the pipeline.
A task is to be understood as a process. A multi-process program is made of several tasks. By contrast, a multithreaded program is composed of only one task, which uses several CPUs. Tasks are requested/created with the --ntasks option, while CPUs, for multithreaded programs, are requested with the --cpus-per-task option. Tasks cannot be split across several compute nodes, so requesting several CPUs with the --cpus-per-task option will ensure all CPUs are allocated on the same compute node. By contrast, requesting the same number of CPUs with the --ntasks option may lead to CPUs being allocated on several distinct compute nodes.
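As a sketch of the difference, both of the following request eight CPUs, but the first (multithreaded) form keeps them all on a single node, while the second (multi-process) form may spread them across nodes:

# Multithreaded (e.g. OpenMP) job: one task with eight CPUs, all on one node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# Multi-process (e.g. MPI) job: eight tasks, possibly spread across several nodes
#SBATCH --ntasks=8

For the multithreaded case, $SLURM_CPUS_PER_TASK can be used to set OMP_NUM_THREADS so the program starts one thread per allocated CPU.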
alias sq="squeue -u $(whoami)"        ## check on your own running jobs
alias sqb="squeue | grep bigmem"      ## check on the jobs in the bigmem partition
alias sqs="squeue | grep serial"      ## check on the jobs in the serial partition
alias sjob="scontrol show -d job"     ## check detailed information about a running job. USAGE: sjob 134158