SLURM: A Highly Scalable Resource Manager

SLURM is an open-source resource manager (batch queue) designed for Linux clusters of all sizes.

SLURM Quick Introduction

  • sinfo reports the state of partitions and nodes managed by SLURM. It has a wide variety of filtering, sorting, and formatting options.
  • smap reports state information for jobs, partitions, and nodes managed by SLURM, but graphically displays the information to reflect network topology.
  • sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.
  • squeue reports the state of jobs or job steps. It has a wide variety of filtering, sorting, and formatting options. By default, it reports the running jobs in priority order and then the pending jobs in priority order.
  • srun is used to submit a job for execution or to initiate job steps in real time. srun has a wide variety of options to specify resource requirements, including: minimum and maximum node count, processor count, specific nodes to use or not use, and specific node characteristics (memory, disk space, certain required features, etc.). A job can contain multiple job steps executing sequentially or in parallel on independent or shared nodes within the job's node allocation.
  • scancel is used to stop a job early, for example when you have queued the wrong script or you know it is going to fail because you forgot something. See more under Monitoring Jobs below.

More in-depth information is available at http://slurm.schedmd.com/documentation.html
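As a quick orientation, these commands are typically used like this (myscript.sh and the job ID are placeholders):

$ sinfo                    # show partitions and nodes and their state
$ squeue -u $USER          # show your running and pending jobs
$ sbatch myscript.sh       # submit a batch script
$ scancel <JOBID>          # cancel a job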

Example Script:

#!/bin/bash -l
# NOTE the -l flag!

# If you need any help, please email help@cse.ucdavis.edu

# Name of the job - You'll probably want to customize this.
#SBATCH -J bench

# Standard out and Standard Error output files with the job number in the name.
#SBATCH -o bench-%j.output
#SBATCH -e bench-%j.output

# no -n here, the user is expected to provide that on the command line.

# The useful part of your job goes below

# run one thread for each task the user asks the queue for
# hostname is just for debugging
hostname
export OMP_NUM_THREADS=$SLURM_NTASKS
module load benchmarks
stream

Example Run:

We will ask the batch queue for 1 node and 2, 4, 8, 16, and 32 CPUs to see how well this OpenMP code scales.

bill@gauss:~$ sbatch -N 1 -n 2 -t 5 test.sh 
Submitted batch job 12
bill@gauss:~$ sbatch -N 1 -n 4 -t 5 test.sh 
Submitted batch job 13
bill@gauss:~$ sbatch -N 1 -n 8 -t 5 test.sh 
Submitted batch job 14
bill@gauss:~$ sbatch -N 1 -n 16 -t 5 test.sh 
Submitted batch job 15
bill@gauss:~$ sbatch -N 1 -n 32 -t 5 test.sh 
Submitted batch job 16

Example Output:

This benchmark uses OpenMP to measure memory bandwidth with a 915 MB array. It shows memory bandwidth increasing as more CPUs are used, up to a maximum of about 60 GB/sec when using an entire node. If you are curious about the benchmark, additional information is available at https://www.cs.virginia.edu/stream/.

bill@gauss:~$ ls
bench-12.output  bench-14.output  bench-16.output
bench-13.output  bench-15.output  test.sh
bill@gauss:~$ cat *.output | grep ":"
STREAM version $Revision: 5.9 $
Copy:       12517.8583       0.0513       0.0511       0.0516
Scale:      12340.2147       0.0521       0.0519       0.0524
Add:        12495.4439       0.0770       0.0768       0.0772
Triad:      12490.9087       0.0782       0.0769       0.0796
STREAM version $Revision: 5.9 $
Copy:       16879.8667       0.0381       0.0379       0.0384
Scale:      16807.9956       0.0384       0.0381       0.0388
Add:        16733.7084       0.0578       0.0574       0.0583
Triad:      16482.7247       0.0585       0.0582       0.0589
STREAM version $Revision: 5.9 $
Copy:       16098.1749       0.0399       0.0398       0.0400
Scale:      16018.8248       0.0402       0.0400       0.0405
Add:        15887.7032       0.0606       0.0604       0.0610
Triad:      15839.4543       0.0608       0.0606       0.0611
STREAM version $Revision: 5.9 $
Copy:       31428.3070       0.0205       0.0204       0.0206
Scale:      31221.0489       0.0206       0.0205       0.0207
Add:        31324.3960       0.0308       0.0306       0.0311
Triad:      31151.6049       0.0310       0.0308       0.0313
STREAM version $Revision: 5.9 $
Copy:       61085.8038       0.0106       0.0105       0.0108
Scale:      60474.7806       0.0108       0.0106       0.0110
Add:        61318.3663       0.0159       0.0157       0.0162
Triad:      61049.6830       0.0159       0.0157       0.0160
bill@gauss:~$ 

Array Jobs:

Newer versions of SLURM support array jobs. For example:

$ cat test.sh
#!/bin/bash
hostname
echo $SLURM_ARRAY_TASK_ID
# Submit a job array with index values between 0 and 1000 on all free CPUs:
$ sbatch --array=0-1000 test.sh

On the Farm cluster the maximum array size is 10001.
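As a sketch of how the array index is typically used, each array task can select its own input (the job name, paths, and file names below are illustrative):

#!/bin/bash
#SBATCH -J array-demo
#SBATCH -o array-demo-%A_%a.output
#SBATCH -t 5

# Each array task processes one input file, selected by its array index.
INPUT=data/sample_${SLURM_ARRAY_TASK_ID}.txt
echo "Task $SLURM_ARRAY_TASK_ID on $(hostname) processing $INPUT"
wc -l "$INPUT"

Submit it with, for example, sbatch --array=0-99 array-demo.sh.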

More information at http://www.schedmd.com/slurmdocs/job_array.html

SLURM Script Quick Reference

All of these options are prefixed with #SBATCH in your job script:

#                              A comment
--job-name=myjob               Job name
--output=myjob.out             Output sent to this file
--output=myjob.%j.%N.out       Output file named with the job number and the node the job landed on
--error=myjob.err              Errors written to this file
--partition=med                Run in the med partition (known as a queue in SGE)
--nodes=4                      Request four nodes
--ntasks-per-node=8            Request eight tasks per node. The number of tasks may not exceed the number of processor cores on the node
--ntasks=10                    Request 10 tasks for your job
--time=2-12:00:00              The maximum time SLURM will allow your job to run before it is killed (2 days and 12 hours in this example)
--mail-type=type               Set type to BEGIN to be notified when your job starts, END when it ends, FAIL if it fails, or ALL for all of the above
--mail-user=email@ucdavis.edu  Address that notifications are sent to
--mem-per-cpu=MB               Specify a memory limit for each process (task) of your job
--mem=MB                       Specify a memory limit for each node of your job
--exclusive                    Specify that you need exclusive access to the nodes for your job
--share                        Specify that your job may share nodes with other jobs
--begin=2013-09-21T01:30:00    Start the job after this time
--begin=now+1hour              Use a relative time to start the job
--dependency=afterany:100:101  Wait for jobs 100 and 101 to complete before starting
--dependency=afterok:100:101   Wait for jobs 100 and 101 to finish without error
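Putting several of these options together, a job script header might look like the following sketch (the job name, partition, resource amounts, and email address are placeholders):

#!/bin/bash -l
#SBATCH --job-name=myjob
#SBATCH --partition=med
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=1-00:00:00
#SBATCH --mem-per-cpu=2000
#SBATCH --output=myjob.%j.%N.out
#SBATCH --error=myjob.%j.%N.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@ucdavis.edu

srun ./my_program    # my_program is a placeholder for your executable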

Show all available options

$ sbatch --help

Another useful command

 $ sbatch --usage

SLURM Environment Variables

  • SLURM_NODELIST
  • SLURM_NODE_ALIASES
  • SLURM_NNODES
  • SLURM_JOBID
  • SLURM_TASKS_PER_NODE
  • SLURM_JOB_ID
  • SLURM_SUBMIT_DIR
  • SLURM_JOB_NODELIST
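These are set inside the job's environment. A minimal sketch of how they are commonly used in a job script (the job name is a placeholder):

#!/bin/bash -l
#SBATCH -J env-demo

# Run from the directory the job was submitted from and record where we ran.
cd "$SLURM_SUBMIT_DIR"
echo "Job $SLURM_JOB_ID is running on $SLURM_NNODES node(s): $SLURM_JOB_NODELIST"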

SRUN Options

You may also use options to the srun command:

$ srun [option list] [executable] [args]

Some srun options

-c # The number of CPUs used by each process
-d Specify debug level between 0 and 5
-i file Redirect input to file
-o file Redirect output
-n # Number of processes for the job
-N # Numbers of nodes to run the job on
-s Print usage stats as job exits
-t time Time limit for the job; <minutes> or <hours>:<minutes> are commonly used formats
-v -vv -vvv Increasing levels of verbosity
-x node-name Don't run job on node-name (and please report any problematic nodes to help@cse.ucdavis.edu)
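For example, to run 8 tasks on 2 nodes with a 30-minute time limit (my_mpi_program is a placeholder for your executable):

$ srun -N 2 -n 8 -t 30 ./my_mpi_program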

Interactive Sessions

To start an interactive shell on a compute node (allocating one can take 30 seconds or so):

$ srun -p partition-name -u --pty bash -il 

Monitoring Jobs:

$ squeue -u $USER
The ST column shows the job state; the common state codes are:

CODE  STATE        DESCRIPTION
CA    CANCELLED    Job was cancelled by the user or a system administrator
CD    COMPLETED    Job completed
CF    CONFIGURING  Job has been allocated resources, but is waiting for them to become ready
CG    COMPLETING   Job is in the process of completing
F     FAILED       Job terminated with a non-zero exit code
NF    NODE_FAIL    Job terminated due to failure of one or more allocated nodes
PD    PENDING      Job is awaiting resource allocation
R     RUNNING      Job currently has an allocation
S     SUSPENDED    Job has an allocation, but execution has been suspended
TO    TIMEOUT      Job terminated upon reaching its time limit
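For example, output for one pending and one running job might look like this (job IDs, names, and nodes are illustrative):

$ squeue -u $USER
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    101       med    bench     bill PD       0:00      1 (Resources)
    100       med    bench     bill  R      12:34      1 c0-10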

Information about a job

root@gauss:~# squeue -l -j  93659 
Thu Dec  6 16:51:37 2012
  JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
  93659     debug  aa_b[1]   isudal  RUNNING      33:49 UNLIMITED      1 c0-10

Uber detailed information about a job

$ scontrol show -d job <JOBID>

SLURM Rosetta

SchedMD publishes a "Rosetta Stone of Workload Managers" that translates common commands and options between SLURM, SGE, PBS/Torque, and LSF; see https://slurm.schedmd.com/rosetta.pdf

Cancelling

How to stop a job manually (i.e. abort it): first use squeue as shown above to find the job number, then:

$ scancel -u $USER <JOBID>

If you omit the JOBID, this will cancel all of your jobs.
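For example (the job ID is illustrative):

$ scancel 12345        # cancel one specific job
$ scancel -u $USER     # cancel ALL of your jobs -- use with care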

SLURM Partitions

Generally, there are three priority-based SLURM partitions (aka queues) on a cluster, plus some hardware-specific partitions:

low     Low priority: your job might be killed at any time. Great for soaking up unused cycles with short jobs; a particularly good fit for large array jobs whose individual tasks have short run times.
med     Medium priority: your job might be suspended, but will resume when the high-priority job finishes. *NOT* recommended for MPI jobs. Up to 100% of idle resources can be used.
hi      High priority: your job will kill or suspend lower-priority jobs, and will keep the allocated hardware until it is done or there is a system or power failure. Limited to the number of CPUs your group contributed. Recommended for MPI jobs.
bigmem  Large-memory nodes; jobs keep the allocated hardware until they are done or there is a system or power failure.
serial  Older serial nodes; jobs keep the allocated hardware until they are done or there is a system or power failure.

SBATCH job with parallel programs running:

#!/bin/bash -l
#SBATCH -p bigmem
#SBATCH -n 16

# -c{INT} is the number of CPUs (threads) given to each program
srun="srun -N1 -n1 -c2"

# $samp_ids, $TMP_DIR, $REF, $prefix, $OUT_RESULTS, and $OUT_STATS are assumed
# to be defined earlier in the script; echoerr is a user-defined function that
# echoes its arguments to standard error.
for sample in $samp_ids
do
    FW_READS=$TMP_DIR/${sample}_R1_trimmed.fq
    REV_READS=$TMP_DIR/${sample}_R2_trimmed.fq

    # NOTE: parallelizing with srun requires the trailing & so that the
    # 'wait' below can wait for all of the background job steps.
    bwaalncmd1="$srun time bwa aln -t 2 -I $REF $FW_READS > $OUT_RESULTS/$prefix.1.sai 2> $OUT_STATS/$prefix.1.log &"
    echo "[map-bwa.sh]: running $bwaalncmd1"
    eval $bwaalncmd1
    echoerr $bwaalncmd1
done
wait

This script runs the same 'bwa' command on multiple samples in parallel, then waits for all of those srun job steps to finish before continuing to the next step in the pipeline.

About Tasks

A task is to be understood as a process: a multi-process program is made up of several tasks, while a multithreaded program is composed of a single task that uses several CPUs. Tasks are requested with the --ntasks option, and CPUs for multithreaded programs are requested with the --cpus-per-task option. Tasks cannot be split across compute nodes, so requesting several CPUs with the --cpus-per-task option ensures that all of the CPUs are allocated on the same compute node. By contrast, requesting the same number of CPUs with the --ntasks option may lead to CPUs being allocated on several distinct compute nodes.
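As a sketch of the difference, consider these two separate job script fragments (the program names are placeholders): an MPI program requests many tasks, while an OpenMP program requests one task with several CPUs:

# MPI: 16 processes, which SLURM may spread across several nodes
#SBATCH --ntasks=16
srun ./my_mpi_program

# OpenMP: one process with 16 threads, all on the same node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_openmp_program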

Useful .bashrc or .bash_profile aliases for the cluster

alias sq="squeue -u $(whoami)"     ##to check on your own running jobs
alias sqb="squeue | grep bigmem"   ##to check on the jobs on bigmem partition
alias sqs="squeue | grep serial"   ##to check on the jobs on serial partition
alias sjob="scontrol show -d job"  ##to check detailed information about a running job. USAGE: sjob 134158

Useful Links
