SLURM is an open-source resource manager (batch queue) designed for Linux clusters of all sizes.
More in-depth information is available at http://slurm.schedmd.com/documentation.html
#!/bin/bash -l
# NOTE the -l flag!
# If you need any help, please email farm-hpc@ucdavis.edu

# Name of the job - You'll probably want to customize this.
#SBATCH --job-name=benchmark-test

# Use the med2 partition (or whichever you have access to)
# Run this to see what partitions you have access to:
# sacctmgr -s list user $USER format=partition
#SBATCH --partition=med2

# Standard out and Standard Error output files with the job number in the name.
#SBATCH --output=bench-%j.output
#SBATCH --error=bench-%j.output

# Request 4 CPUs and 8 GB of RAM from 1 node:
#SBATCH --nodes=1
#SBATCH --mem=8G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

# The useful part of your job goes below.
# Run one thread for each one the user asks the queue for.
# hostname is just for debugging.
hostname
export OMP_NUM_THREADS=$SLURM_NTASKS
module load benchmarks

# The main job executable to run: note the use of srun before it
srun stream
We will ask the batch queue for 1 node and 2, 4, 8, 16, and 32 CPUs to see how well this OpenMP code scales.
bill@gauss:~$ sbatch -N 1 -n 2 -t 5 test.sh
Submitted batch job 12
bill@gauss:~$ sbatch -N 1 -n 4 -t 5 test.sh
Submitted batch job 13
bill@gauss:~$ sbatch -N 1 -n 8 -t 5 test.sh
Submitted batch job 14
bill@gauss:~$ sbatch -N 1 -n 16 -t 5 test.sh
Submitted batch job 15
bill@gauss:~$ sbatch -N 1 -n 32 -t 5 test.sh
Submitted batch job 16
This benchmark uses OpenMP to measure memory bandwidth with a 915 MB array. The results show memory bandwidth increasing as you use more CPUs, up to a maximum of about 60 GB/sec when using an entire node. If you are curious about the benchmark, additional information is available at https://www.cs.virginia.edu/stream/.
bill@gauss:~$ ls
bench-12.output  bench-14.output  bench-16.output
bench-13.output  bench-15.output  test.sh
bill@gauss:~$ cat *.output | grep ":"
STREAM version $Revision: 5.9 $
Copy:      12517.8583    0.0513    0.0511    0.0516
Scale:     12340.2147    0.0521    0.0519    0.0524
Add:       12495.4439    0.0770    0.0768    0.0772
Triad:     12490.9087    0.0782    0.0769    0.0796
STREAM version $Revision: 5.9 $
Copy:      16879.8667    0.0381    0.0379    0.0384
Scale:     16807.9956    0.0384    0.0381    0.0388
Add:       16733.7084    0.0578    0.0574    0.0583
Triad:     16482.7247    0.0585    0.0582    0.0589
STREAM version $Revision: 5.9 $
Copy:      16098.1749    0.0399    0.0398    0.0400
Scale:     16018.8248    0.0402    0.0400    0.0405
Add:       15887.7032    0.0606    0.0604    0.0610
Triad:     15839.4543    0.0608    0.0606    0.0611
STREAM version $Revision: 5.9 $
Copy:      31428.3070    0.0205    0.0204    0.0206
Scale:     31221.0489    0.0206    0.0205    0.0207
Add:       31324.3960    0.0308    0.0306    0.0311
Triad:     31151.6049    0.0310    0.0308    0.0313
STREAM version $Revision: 5.9 $
Copy:      61085.8038    0.0106    0.0105    0.0108
Scale:     60474.7806    0.0108    0.0106    0.0110
Add:       61318.3663    0.0159    0.0157    0.0162
Triad:     61049.6830    0.0159    0.0157    0.0160
bill@gauss:~$
The newest version of SLURM supports array jobs. For example:
$ cat test-array.sh
#!/bin/bash
hostname
echo $SLURM_ARRAY_TASK_ID
# Submit a job array with index values between 0 and 10,000 on all free CPUs:
$ sbatch --array=0-10000 --partition=low test-array.sh
On the Farm cluster the maximum array size is 10001.
More information at http://www.schedmd.com/slurmdocs/job_array.html
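A common pattern is to use $SLURM_ARRAY_TASK_ID to pick which input each array element works on. Below is a minimal sketch, assuming a file samples.txt with one input filename per line; samples.txt and my_analysis are hypothetical placeholders for your own inputs and program:

#!/bin/bash
#SBATCH --job-name=array-example
#SBATCH --partition=low
#SBATCH --ntasks=1
#SBATCH --array=1-100

# samples.txt (hypothetical) holds one input filename per line;
# pick the line whose number matches this array element's index.
INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)

# my_analysis is a placeholder for your actual program.
srun my_analysis "$INPUT"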
These options are prefixed with #SBATCH when placed in a job script (they may also be given directly on the sbatch command line). An example script header using several of them follows the table.

Option | Description |
---|---|
# | A comment |
--job-name=myjob | Job name |
--output=myjob.out | Output sent to this file |
--output=myjob.%j.%N.out | Output file named with the job number and the node the job landed on |
--error=myjob.err | Errors written to this file |
--partition=med | Run in the med partition (known as a queue in SGE) |
--nodes=4 | Request four nodes |
--ntasks-per-node=8 | Request eight tasks per node. The number of tasks may not exceed the number of processor cores on the node |
--ntasks=10 | Request 10 tasks for your job |
--time=2-12:00:00 | The maximum amount of time SLURM will allow your job to run before it is killed (2 days and 12 hours in this example) |
--mail-type=type | Set type to BEGIN to be notified when your job starts, END for when it ends, FAIL if it fails, or ALL for all of the above |
--mail-user=email@ucdavis.edu | Email address to send the notifications to |
--mem-per-cpu=MB | Specify a memory limit for each process of your job |
--mem=MB | Specify a memory limit for each node of your job |
--exclusive | Specify that you need exclusive access to nodes for your job |
--share | Specify that your job may share nodes with other jobs |
--begin=2013-09-21T01:30:00 | Start the job after this time |
--begin=now+1hour | Use a relative time to start the job |
--dependency=afterany:100:101 | Wait for jobs 100 and 101 to complete before starting |
--dependency=afterok:100:101 | Wait for jobs 100 and 101 to finish without error before starting |
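For example, a job script header combining several of the options above might look like the following sketch; the job name, partition, time, memory, and email address are placeholders to adapt to your own job:

#!/bin/bash -l
#SBATCH --job-name=myjob
#SBATCH --partition=med
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=1-00:00:00
#SBATCH --output=myjob.%j.out
#SBATCH --error=myjob.%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@ucdavis.edu

# Your actual commands go below this header.
hostname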
Show all available options:
$ sbatch --help
Another useful command:
$ sbatch --usage
You may also use options to the srun command:
$ srun [option list] [executable] [args]
Some srun options (an example combining several of them follows the table):
Option | Description |
---|---|
-c # | The number of CPUs used by each process |
-d | Specify debug level between 0 and 5 |
-i file | Redirect input from file |
-o file | Redirect output to file |
-n # | Number of processes for the job |
-N # | Number of nodes to run the job on |
-s | Print usage stats as the job exits |
-t | Time limit for the job; <minutes> or <hours>:<minutes> are commonly used |
-v, -vv, -vvv | Increasing levels of verbosity |
-x node-name | Don't run the job on node-name (and please report any problematic nodes to farm-hpc@ucdavis.edu) |
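For example, the options above could be combined as in this sketch, which runs a hypothetical executable my_program as four processes on one node with a 30-minute time limit, writing output to myjob.out:

$ srun -N 1 -n 4 -t 30 -o myjob.out my_program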
To get an interactive shell session on a compute node, use srun with the --pty option (starting the session takes 30 seconds or so):
$ srun --partition=partition-name --time=1:00:00 --unbuffered --pty /bin/bash -il
When the time limit expires you will be forcibly logged out and anything left running will be killed.
$ squeue -u $USER
CODE | STATE | DESCRIPTION |
---|---|---|
CA | CANCELLED | Job was cancelled by the user or system administrator |
CD | COMPLETED | Job completed |
CF | CONFIGURING | Job has been allocated resources, but is waiting for them to become ready |
CG | COMPLETING | Job is in the process of completing |
F | FAILED | Job terminated with non-zero exit code |
NF | NODE_FAIL | Job terminated due to failure of one or more allocated nodes |
PD | PENDING | Job is awaiting resource allocation |
R | RUNNING | Job currently has an allocation |
S | SUSPENDED | Job has an allocation, but execution has been suspended |
TO | TIMEOUT | Job terminated upon reaching its time limit |
Information about a job
root@gauss:~# squeue -l -j 93659
Thu Dec  6 16:51:37 2012
  JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
  93659     debug  aa_b[1]   isudal  RUNNING      33:49 UNLIMITED      1 c0-10
Uber detailed information about a job
$ scontrol show -d job <JOBID>
How to stop a job manually (i.e. abort it): first use squeue as above to find the job number, then run:
$ scancel -u $USER <JOBID>
If you forget the JOBID it will cancel all your jobs.
The squeue command has some additional command flags that can be passed to better monitor your jobs, if necessary.
This section involves some Linux shell knowledge and an understanding of environment variables. If you are unsure, you can skip this section, or ask an administrator for help.
The default output fields of squeue are defined in the slurm module, but these can be overridden with the --format flag. The current Farm configuration is shown below.
An example of the standard output of squeue -u <username>:
         JOBID PARTITION     NAME     USER ST        TIME  NODES CPU MIN_ME NODELIST(REASON)
         12345       med    myjob username  R  1-22:20:42      1  22 24000M c10-67
These fields are defined by default using the following format codes:
%.14i %.9P %.8j %.8u %.2t %.11M %.6D %3C %6m %R
A full explanation of the formatting codes that may be used can be found in man squeue under the -o <output_format>, --format=<output_format> section.
To see the time and date that your jobs are scheduled to end, and how much time is remaining:
squeue --format="%.14i %9P %15j %.8u %.2t %.20e %.12L" -u <username>
Sample output:
         JOBID PARTITION NAME                USER ST             END_TIME    TIME_LEFT
          1234       med myjob           username  R  2019-06-10T01:12:28   5-21:50:53
For convenience, you can add an alias for this command to your ~/.bash_aliases file and it will be available the next time you log in. Here's an example of a helpful alias:
alias jobtimes="squeue --format=\"%.14i %9P %15j %.8u %.2t %.20e %.12L\" -u"
Next time you log in, the command “jobtimes <yourusername>” will be available and will display the information as above.
See the squeue man page for other fields that squeue can output.
The default squeue formatting is stored in the environment variable $SQUEUE_FORMAT, which can be altered using the same format codes as the --format option on the command line. PLEASE be cautious when altering environment variables. Use module show slurm to see the default setting for $SQUEUE_FORMAT.
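For example, if you wanted squeue to show end time and time left by default in your own sessions, you could override the variable in your shell; this is a sketch using the same format codes as the jobtimes alias above:

# Override the default squeue output fields for the current shell session.
export SQUEUE_FORMAT="%.14i %9P %15j %.8u %.2t %.20e %.12L"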
Generally, there are three SLURM partitions (aka queues) on a cluster. These partitions divide up pools of nodes based on job priority needs.
Partition | Description |
---|---|
low | Low priority means that your job might be killed at any time. Great for soaking up unused cycles with short jobs; a particularly good fit for large array jobs when individual jobs have short run times. |
med | Medium priority means your job might be suspended, but will resume when a high-priority job finishes. *NOT* recommended for MPI jobs. Up to 100% of idle resources can be used. |
hi | High priority means your job will kill/suspend lower-priority jobs and will keep the allocated hardware until it's done or there's a system or power failure. Limited to the number of CPUs your group contributed. Recommended for MPI jobs. |
Other types of partitions may exist as well.
Partition | Description |
---|---|
bigmem, bm | Large-memory nodes. Jobs will keep the allocated hardware until they're done or there's a system or power failure. (bigmem/bm may be further divided into l/m/h partitions, following the same priority rules as low/med/hi in the table above.) |
gpu | GPU nodes. Jobs will keep the allocated hardware until they're done or there's a system or power failure. |
serial | Older serial nodes. Jobs will keep the allocated hardware until they're done or there's a system or power failure. |
Nodes can be in more than one partition, and partitions with similar names generally have identical or near-identical hardware: low/med/high are typically one set of hardware, low2/med2/high2 are another, and so on.
There may be other partitions based on the hardware available on a particular cluster, and not all users have access to all partitions. Consult your account creation email, your PI, or the helpdesk if you are unsure which partitions you have access to or which to use.
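Two commands can help here: the sacctmgr command from the example job script near the top of this page shows which partitions your account can use, and sinfo (a sketch using output format codes for partition, availability, time limit, and node count) lists the partitions defined on the cluster:

# Which partitions does my account have access to?
$ sacctmgr -s list user $USER format=partition

# What partitions exist on this cluster, and what are their limits?
$ sinfo -o "%P %a %l %D"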
#SBATCH -p bigmem
#SBATCH -n 16

srun="srun -N1 -n1 -c2"   ## where -c{INT} is the number of threads you're providing the program

for sample in $samp_ids
do
    FW_READS=$TMP_DIR/$sample"_R1_trimmed.fq"
    REV_READS=$TMP_DIR/$sample"_R2_trimmed.fq"
    ## NOTE parallelizing with SRUN requires the & at the end of the command in order for WAIT to work
    bwaalncmd1="$srun time bwa aln -t 2 -I $REF $FW_READS > $OUT_RESULTS/$prefix.1.sai 2> $OUT_STATS/$prefix.1.log &"
    echo "[map-bwa.sh]: running $bwaalncmd1"
    eval $bwaalncmd1
    echoerr $bwaalncmd1
done
wait
This script runs the same 'bwa' command on multiple samples in parallel, then waits for those srun commands to finish before continuing to the next step in the pipeline.
A task is to be understood as a process. A multi-process program is made of several tasks. By contrast, a multithreaded program is composed of only one task, which uses several CPUs. Tasks are requested/created with the --ntasks option, while CPUs, for multithreaded programs, are requested with the --cpus-per-task option. Tasks cannot be split across several compute nodes, so requesting several CPUs with the --cpus-per-task option will ensure all CPUs are allocated on the same compute node. By contrast, requesting the same number of CPUs with the --ntasks option may lead to the CPUs being allocated on several, distinct compute nodes.
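To illustrate, the two job script headers below both ask for four CPUs, but in different ways; header A suits a multithreaded (e.g. OpenMP) program, header B a multi-process (e.g. MPI) program. This is a sketch with placeholder values:

# Header A: multithreaded program, one task with four CPUs,
# guaranteed to land on a single node.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

# Header B: multi-process program, four tasks whose CPUs
# may be spread across several nodes.
#SBATCH --ntasks=4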
alias sq="squeue -u $(whoami)"        ## to check on your own running jobs
alias sqb="squeue | grep bigmem"      ## to check on the jobs on the bigmem partition
alias sqs="squeue | grep serial"      ## to check on the jobs on the serial partition
alias sjob="scontrol show -d job"     ## to check detailed information about a running job. USAGE: sjob 134158