
SLURM: A Highly Scalable Resource Manager

SLURM is an open-source resource manager (batch queue) designed for Linux clusters of all sizes.

SLURM Quick Introduction

  • sinfo reports the state of partitions and nodes managed by SLURM. It has a wide variety of filtering, sorting, and formatting options.
  • smap reports state information for jobs, partitions, and nodes managed by SLURM, but graphically displays the information to reflect network topology.
  • sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.
  • squeue reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.
  • srun is used to submit a job for execution or to initiate job steps in real time. srun has a wide variety of options to specify resource requirements, including: minimum and maximum node count, processor count, specific nodes to use or avoid, and specific node characteristics (amount of memory, disk space, required features, etc.). A job can contain multiple job steps executing sequentially or in parallel on independent or shared nodes within the job's node allocation.
  • scancel is used to stop a job early, for example when you queued the wrong script or know it is going to fail because you forgot something. See Monitoring Jobs below for more.

More in-depth information is available at http://slurm.schedmd.com/documentation.html
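
For example, to get a quick look at the cluster and at your own jobs (the partition name and job ID here are only illustrative):

$ sinfo                  # state of all partitions and nodes
$ sinfo -p med2          # state of a single partition
$ squeue -u $USER        # your queued and running jobs
$ scancel 12345          # cancel job 12345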

Example Script:

#!/bin/bash -l
# NOTE the -l flag!

# If you need any help, please email farm-hpc@ucdavis.edu

# Name of the job - You'll probably want to customize this.
#SBATCH --job-name=benchmark-test

# Use the med2 partition (or whichever partition you have access to)
# Run this to see what partitions you have access to:
# sacctmgr -s list user $USER format=partition
#SBATCH --partition=med2

# Standard out and Standard Error output files with the job number in the name.
#SBATCH --output=bench-%j.output
#SBATCH --error=bench-%j.output

# Request 4 CPUs and 8 GB of RAM from 1 node:
#SBATCH --nodes=1
#SBATCH --mem=8G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4 

# The useful part of your job goes below

# hostname is just for debugging; it records which node the job ran on
hostname

# Run one OpenMP thread per task requested at submission time
# (the example run below overrides the task count with "sbatch -n")
export OMP_NUM_THREADS=$SLURM_NTASKS
module load benchmarks

# The main job executable to run: note the use of srun before it
srun stream

Example Run:

We will ask the batch queue for 1 node and 2, 4, 8, 16, and 32 CPUs to see how well this OpenMP code scales.

bill@gauss:~$ sbatch -N 1 -n 2 -t 5 test.sh 
Submitted batch job 12
bill@gauss:~$ sbatch -N 1 -n 4 -t 5 test.sh 
Submitted batch job 13
bill@gauss:~$ sbatch -N 1 -n 8 -t 5 test.sh 
Submitted batch job 14
bill@gauss:~$ sbatch -N 1 -n 16 -t 5 test.sh 
Submitted batch job 15
bill@gauss:~$ sbatch -N 1 -n 32 -t 5 test.sh 
Submitted batch job 16

Example Output:

This benchmark uses OpenMP to measure memory bandwidth with a 915 MB array. The output below shows memory bandwidth increasing as more CPUs are used, up to a maximum of about 60 GB/sec when using an entire node. If you are curious about the benchmark, additional information is available at https://www.cs.virginia.edu/stream/.

bill@gauss:~$ ls
bench-12.output  bench-14.output  bench-16.output
bench-13.output  bench-15.output  test.sh
bill@gauss:~$ cat *.output | grep ":"
STREAM version $Revision: 5.9 $
Copy:       12517.8583       0.0513       0.0511       0.0516
Scale:      12340.2147       0.0521       0.0519       0.0524
Add:        12495.4439       0.0770       0.0768       0.0772
Triad:      12490.9087       0.0782       0.0769       0.0796
STREAM version $Revision: 5.9 $
Copy:       16879.8667       0.0381       0.0379       0.0384
Scale:      16807.9956       0.0384       0.0381       0.0388
Add:        16733.7084       0.0578       0.0574       0.0583
Triad:      16482.7247       0.0585       0.0582       0.0589
STREAM version $Revision: 5.9 $
Copy:       16098.1749       0.0399       0.0398       0.0400
Scale:      16018.8248       0.0402       0.0400       0.0405
Add:        15887.7032       0.0606       0.0604       0.0610
Triad:      15839.4543       0.0608       0.0606       0.0611
STREAM version $Revision: 5.9 $
Copy:       31428.3070       0.0205       0.0204       0.0206
Scale:      31221.0489       0.0206       0.0205       0.0207
Add:        31324.3960       0.0308       0.0306       0.0311
Triad:      31151.6049       0.0310       0.0308       0.0313
STREAM version $Revision: 5.9 $
Copy:       61085.8038       0.0106       0.0105       0.0108
Scale:      60474.7806       0.0108       0.0106       0.0110
Add:        61318.3663       0.0159       0.0157       0.0162
Triad:      61049.6830       0.0159       0.0157       0.0160
bill@gauss:~$ 

Array Jobs:

SLURM supports array jobs, which let you submit many similar tasks with a single script. For example:

$ cat test-array.sh
#!/bin/bash
hostname
echo $SLURM_ARRAY_TASK_ID

# Submit a job array with index values between 0 and 10,000 on all free CPUs:
$ sbatch --array=0-10000 --partition=low test-array.sh

On the Farm cluster the maximum array size is 10001.
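
A common pattern is to use $SLURM_ARRAY_TASK_ID to select which input each array task works on. A minimal sketch, assuming input files named input_0.txt through input_99.txt exist in the submission directory (the file names and job name are placeholders):

#!/bin/bash
#SBATCH --job-name=array-example
#SBATCH --partition=low
#SBATCH --output=array-%A_%a.out
#SBATCH --time=00:10:00

# Each array task processes one file, chosen by its array index
INFILE="input_${SLURM_ARRAY_TASK_ID}.txt"
echo "Task $SLURM_ARRAY_TASK_ID on $(hostname) processing $INFILE"
wc -l "$INFILE"

Submit it with, for example, sbatch --array=0-99 array-example.sh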

More information is available at http://www.schedmd.com/slurmdocs/job_array.html

SLURM Script Quick Reference

In a job script, each of the following options is placed on a line prefixed with #SBATCH; a line beginning with # alone is an ordinary comment.

--job-name=myjob               Job name
--output=myjob.out             Output sent to this file
--output=myjob.%j.%N.out       Output file named with the job number and the node the job landed on
--error=myjob.err              Errors written to this file
--partition=med                Run in the med partition (known as a queue in SGE)
--nodes=4                      Request four nodes
--ntasks-per-node=8            Request eight tasks per node. The number of tasks may not exceed the number of processor cores on the node
--ntasks=10                    Request 10 tasks for your job
--time=2-12:00:00              The maximum amount of time SLURM will allow your job to run before it is killed (2 days and 12 hours in this example)
--mail-type=type               Set type to BEGIN to be notified when your job starts, END when it ends, FAIL if it fails, or ALL for all of the above
--mail-user=email@ucdavis.edu  Email address to send notifications to
--mem-per-cpu=MB               Specify a memory limit for each process of your job
--mem=MB                       Specify a memory limit for each node of your job
--exclusive                    Specify that you need exclusive access to nodes for your job
--share                        Specify that your job may share nodes with other jobs
--begin=2013-09-21T01:30:00    Start the job after this time
--begin=now+1hour              Start the job using a relative time
--dependency=afterany:100:101  Wait for jobs 100 and 101 to complete before starting
--dependency=afterok:100:101   Wait for jobs 100 and 101 to finish without error
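
A minimal sketch putting several of these options together (the job name, partition, email address, and executable are placeholders):

#!/bin/bash -l
#SBATCH --job-name=myjob
#SBATCH --partition=med
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2000
#SBATCH --time=1-00:00:00
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=email@ucdavis.edu
#SBATCH --output=myjob-%j.out
#SBATCH --error=myjob-%j.err

srun ./my_program    # replace with your actual executable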

Show all available options

$ sbatch --help

Another useful command

 $ sbatch --usage

SLURM Environment Variables

  • SLURM_NODELIST
  • SLURM_NODE_ALIASES
  • SLURM_NNODES
  • SLURM_JOBID
  • SLURM_TASKS_PER_NODE
  • SLURM_JOB_ID
  • SLURM_SUBMIT_DIR
  • SLURM_JOB_NODELIST
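
These variables are set inside the job environment and can be used directly in your script; a minimal sketch:

#!/bin/bash -l
#SBATCH --ntasks=2

echo "Job ID:          $SLURM_JOB_ID"
echo "Node list:       $SLURM_JOB_NODELIST"
echo "Number of nodes: $SLURM_NNODES"
echo "Tasks per node:  $SLURM_TASKS_PER_NODE"
echo "Submit dir:      $SLURM_SUBMIT_DIR"

# sbatch jobs normally start in the submission directory, but being explicit does no harm
cd "$SLURM_SUBMIT_DIR"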

SRUN Options

You may also use options to the srun command:

$ srun [option list] [executable] [args]

Some srun options

-c #          Number of CPUs used by each process
-d            Specify debug level between 0 and 5
-i file       Redirect input from file
-o file       Redirect output to file
-n #          Number of processes for the job
-N #          Number of nodes to run the job on
-s            Print usage stats as the job exits
-t time       Time limit for the job; <minutes> or <hours>:<minutes> are commonly used
-v -vv -vvv   Increasing levels of verbosity
-x node-name  Don't run the job on node-name (and please report any problematic nodes to farm-hpc@ucdavis.edu)
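
For example, a one-liner combining several of these options (the executable name is a placeholder):

$ srun -N 1 -n 4 -t 30 -o myrun.out ./my_program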

Interactive Sessions

To start an interactive session, run the command below. (It usually takes 30 seconds or so to start, depending on the job backlog.)

$ srun --partition=partition-name --time=1:00:00 --unbuffered --pty /bin/bash -il 

When the time limit expires you will be forcibly logged out and anything left running will be killed.
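
If you need more than the defaults for interactive work, the same resource options used with sbatch also work with srun; for example (the partition name is only illustrative):

$ srun --partition=med2 --ntasks=1 --cpus-per-task=4 --mem=8G --time=2:00:00 --pty /bin/bash -il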

Monitoring Jobs:

$ squeue -u $USER

The ST column shows the state of each job, using these codes:

Code  State        Description
CA    CANCELLED    Job was cancelled by the user or a system administrator
CD    COMPLETED    Job completed
CF    CONFIGURING  Job has been allocated resources, but is waiting for them to become ready
CG    COMPLETING   Job is in the process of completing
F     FAILED       Job terminated with a non-zero exit code
NF    NODE_FAIL    Job terminated due to failure of one or more allocated nodes
PD    PENDING      Job is awaiting resource allocation
R     RUNNING      Job currently has an allocation
S     SUSPENDED    Job has an allocation, but execution has been suspended
TO    TIMEOUT      Job terminated upon reaching its time limit

Information about a job

root@gauss:~# squeue -l -j  93659 
Thu Dec  6 16:51:37 2012
  JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
  93659     debug  aa_b[1]   isudal  RUNNING      33:49 UNLIMITED      1 c0-10

Uber detailed information about a job

$ scontrol show -d job <JOBID>

SLURM Rosetta

Cancelling

How to stop a job manually (i.e., abort it): first use squeue as above to find the job number, then:

$ scancel -u $USER <JOBID>

Be careful: if you leave off the JOBID, this will cancel all of your jobs.
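
A few other common scancel invocations (the job ID and job name are only illustrative):

$ scancel 12345                            # cancel a single job by ID
$ scancel -u $USER --state=PENDING         # cancel only your pending jobs
$ scancel -u $USER --name=benchmark-test   # cancel your jobs with a given name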

Advanced (Optional) Squeue Usage

The squeue command has some additional command flags that can be passed to better monitor your jobs, if necessary.

This section involves some Linux shell knowledge and an understanding of environment variables. If you are unsure, you can skip this section, or ask an administrator for help.

The default output fields of squeue are defined in the slurm module, but they can be overridden with the --format flag. Here is an example of the standard output of squeue -u <username> with the current Farm configuration:

JOBID PARTITION     NAME     USER  ST        TIME  NODES CPU MIN_ME NODELIST(REASON)
12345       med    myjob  username  R  1-22:20:42      1 22  24000M c10-67

These fields are defined by default using the following format codes:

%.14i %.9P %.8j %.8u %.2t %.11M %.6D %3C %6m %R

A full explanation of the formatting codes that may be used can be found in man squeue, under the -o <output_format>, --format=<output_format> section.

To see the time and date that your jobs are scheduled to end, and how much time is remaining:

squeue --format="%.14i %9P %15j %.8u %.2t %.20e %.12L" -u <username>

Sample output:

JOBID PARTITION NAME     USER     ST  END_TIME             TIME_LEFT
1234  med       myjob    username  R  2019-06-10T01:12:28  5-21:50:53

For convenience, you can add an alias for this command to your ~/.bash_aliases file, and it will be available the next time you log in. For example:

alias jobtimes="squeue --format=\"%.14i %9P %15j %.8u %.2t %.20e %.12L\" -u"

Next time you log in, the command “jobtimes <yourusername>” will be available and will display the information as above.

See the squeue man page for other fields that squeue can output.

The default squeue formatting is stored in the environment variable $SQUEUE_FORMAT, which can be altered using the same format codes as the --format option on the command line. PLEASE be cautious when altering environment variables. Use module show slurm to see the default setting for $SQUEUE_FORMAT.
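
For example, to make the END_TIME/TIME_LEFT layout shown above the default for your sessions, you could set the variable in your ~/.bashrc (this overrides the cluster default, so note the original value from module show slurm first):

export SQUEUE_FORMAT="%.14i %9P %15j %.8u %.2t %.20e %.12L"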

SLURM Partitions

Generally, there are three SLURM partitions (aka queues) on a cluster. These partitions divide up pools of nodes based on job priority needs.

low   Low priority: your job might be killed at any time. Great for soaking up unused cycles with short jobs; a particularly good fit for large array jobs whose individual tasks have short run times.
med   Medium priority: your job might be suspended, but will resume when the higher-priority job finishes. *NOT* recommended for MPI jobs. Up to 100% of idle resources can be used.
high  High priority: your job will kill or suspend lower-priority jobs and will keep its allocated hardware until it is done or there is a system or power failure. Limited to the number of CPUs your group contributed. Recommended for MPI jobs.

There are other types of partitions that may exist, as well.

bigmem, bm  Large-memory nodes. Jobs keep their allocated hardware until they are done or there is a system or power failure. (bigmem/bm may be further divided into l/m/h partitions, following the same priority rules as low/med/high in the table above.)
gpu         GPU nodes. Jobs keep their allocated hardware until they are done or there is a system or power failure.
serial      Older serial nodes. Jobs keep their allocated hardware until they are done or there is a system or power failure.

Nodes can be in more than one partition, and partitions with similar names generally have identical or near-identical hardware: low/med/high are typically one set of hardware, low2/med2/high2 are another, and so on.

There may be other partitions based on the hardware available on a particular cluster; not all users have access to all partitions. Consult with your account creation email, your PI, or the helpdesk if you are unsure what partitions you have access to or to use.
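
To check which partitions your account can submit to, and the current state of a partition's nodes, something like the following works (the partition name is only illustrative):

$ sacctmgr -s list user $USER format=partition    # partitions you have access to
$ sinfo -p med2                                   # node states within one partition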

SBATCH job running parallel programs with srun:

#!/bin/bash -l
#SBATCH -p bigmem
#SBATCH -n 16

# NOTE: $samp_ids, $TMP_DIR, $REF, $OUT_RESULTS, $OUT_STATS, $prefix and the
# echoerr function are assumed to be defined earlier in the full script.

srun="srun -N1 -n1 -c2"    # -c{INT} is the number of threads you're providing the program

for sample in $samp_ids
do
    FW_READS=$TMP_DIR/${sample}_R1_trimmed.fq
    REV_READS=$TMP_DIR/${sample}_R2_trimmed.fq

    # NOTE: parallelizing with srun requires the & at the end of the command,
    # so that each step runs in the background and "wait" below can work.
    bwaalncmd1="$srun time bwa aln -t 2 -I $REF $FW_READS > $OUT_RESULTS/$prefix.1.sai 2> $OUT_STATS/$prefix.1.log &"
    echo "[map-bwa.sh]: running $bwaalncmd1"
    eval $bwaalncmd1
    echoerr $bwaalncmd1
done

# Wait for all backgrounded srun steps to finish before moving on
wait

This script runs the same bwa command on multiple samples in parallel, then waits for all of the srun steps to finish before continuing to the next step in the pipeline.

About Tasks

A task should be understood as a process: a multi-process program is made up of several tasks, while a multithreaded program is composed of a single task that uses several CPUs. Tasks are requested with the --ntasks option, while CPUs for multithreaded programs are requested with the --cpus-per-task option. A task cannot be split across compute nodes, so requesting several CPUs with --cpus-per-task ensures that they are all allocated on the same node. By contrast, requesting the same number of CPUs with --ntasks may result in CPUs being allocated on several distinct compute nodes.
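
To make the distinction concrete, here are two job-script fragments for a job that needs 8 CPUs (the executable names are placeholders):

# Multithreaded (e.g. OpenMP) program: one task with 8 CPUs, guaranteed to sit on one node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./my_threaded_program

# Multi-process (e.g. MPI) program: 8 tasks, which SLURM may spread over several nodes
#SBATCH --ntasks=8
srun ./my_mpi_program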

Useful .bashrc or .bash_profile aliases for the cluster

alias sq="squeue -u $(whoami)"     ##to check on your own running jobs
alias sqb="squeue | grep bigmem"   ##to check on the jobs on bigmem partition
alias sqs="squeue | grep serial"   ##to check on the jobs on serial partition
alias sjob="scontrol show -d job"  ##to check detailed information about a running job. USAGE: sjob 134158

Useful Links
