<code>
#!/bin/bash -l
# NOTE the -l flag!
  
# If you need any help, please email farm-hpc@ucdavis.edu
  
# Name of the job - You'll probably want to customize this.
#SBATCH --job-name=benchmark-test

# Use the med2 partition (or whichever partition you have access to).
# Run this to see what partitions you have access to:
# sacctmgr -s list user $USER format=partition
#SBATCH --partition=med2
  
# Standard out and Standard Error output files with the job number in the name.
#SBATCH --output=bench-%j.output
#SBATCH --error=bench-%j.output
  
# Request 4 CPUs and 8 GB of RAM from 1 node:
#SBATCH --nodes=1
#SBATCH --mem=8G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
  
# The useful part of your job goes below
module load benchmarks
  
# The main job executable to run: note the use of srun before it
srun stream
</code>
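
Assuming the script above is saved as benchmark.sh (a placeholder name; substitute whatever you called your script), it can be submitted with sbatch and its progress checked with squeue:
<code>
$ sbatch benchmark.sh
$ squeue -u $USER
</code>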
The newest version of SLURM supports array jobs.  For example:
<code>
$ cat test-array.sh
#!/bin/bash
hostname
</code>

<code>
# Submit a job array with index values between 0 and 10,000 on all free CPUs:
$ sbatch --array=0-10000 --partition=low test-array.sh
</code>
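
Each task in an array can find its own index in the SLURM_ARRAY_TASK_ID environment variable, which is the usual way to hand each task different work. A minimal sketch, assuming hypothetical per-task input files named input-0.dat, input-1.dat, and so on:
<code>
#!/bin/bash
# SLURM sets SLURM_ARRAY_TASK_ID to this task's index within the array.
# Hypothetical example: process one input file per array task.
./process-input input-${SLURM_ARRAY_TASK_ID}.dat
</code>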
  
| -t | time limit for job; <minutes> or <hours>:<minutes>:<seconds> are commonly used|
| -v -vv -vvv| Increasing levels of verbosity|
| -x node-name | Don't run job on node-name (and please report any problematic nodes to farm-hpc@ucdavis.edu) |
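
For example, to submit a job with a two-hour time limit that avoids a known-bad node (node-name and myscript.sh below are placeholders):
<code>
$ sbatch -t 2:00:00 -x node-name myscript.sh
</code>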
  
====== Interactive Sessions ======
  
(Starting a session usually takes 30 seconds or so, but this depends on the job backlog.)
  
<code>$ srun --partition=partition-name --time=1:00:00 --unbuffered --pty /bin/bash -il </code>
  
When the time limit expires, you will be forcibly logged out and anything left running will be killed.
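
The same resource flags used in batch scripts also work interactively. A sketch, assuming access to the med2 partition used in the example script above:
<code>
$ srun --partition=med2 --time=2:00:00 --ntasks=1 --cpus-per-task=4 --mem=8G --pty /bin/bash -il
</code>
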
======  Monitoring Jobs: ======
  
For convenience, you can put a long squeue command like this into an alias in your ~/.bash_aliases file, and it will be available the next time you log in. Here's an example of a helpful alias:
<code>
alias jobtimes="squeue --format=\"%.14i %9P %15j %.8u %.2t %.20e %.12L\" -u"
</code>
Next time you log in, the command "jobtimes <yourusername>" will be available and will display the information as above.
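
To use the alias in your current session without logging out and back in, source the file first:
<code>
$ source ~/.bash_aliases
$ jobtimes $USER
</code>
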
====== SLURM Partitions ======
  
Generally, there are three SLURM partitions (aka queues) on a cluster. These partitions divide up pools of nodes based on job priority needs.
  
|low| Low priority means that your job might be killed at any time. Great for soaking up unused cycles with short jobs; a particularly good fit for large array jobs when individual jobs have short run times.|
|med| Medium priority means your job might be suspended, but it will resume when a high priority job finishes.  *NOT* recommended for MPI jobs.  Up to 100% of idle resources can be used.|
|hi| Your job will kill/suspend lower priority jobs.  High priority means your job will keep the allocated hardware until it's done or there's a system or power failure.  Limited to the number of CPUs your group contributed.  Recommended for MPI jobs.|
  
Other types of partitions may exist as well:

|bigmem, bm| Large memory nodes. Jobs will keep the allocated hardware until the job is done or there's a system or power failure. (bigmem/bm nodes may be further divided into l/m/h partitions, following the same priority rules as low/med/high in the table above.)|
|gpu| GPU nodes. Jobs will keep the allocated hardware until the job is done or there's a system or power failure.|
|serial| Older serial nodes. Jobs will keep the allocated hardware until the job is done or there's a system or power failure.|

Nodes can be in more than one partition, and partitions with similar names generally have identical or near-identical hardware: low/med/high are typically one set of hardware, low2/med2/high2 are another, and so on.
  
There may be other partitions based on the hardware available on a particular cluster, and not all users have access to all partitions. Consult your account creation email, your PI, or the helpdesk if you are unsure which partitions you have access to or should use.
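
To see every partition on the cluster, with its availability, time limit, and node count, and to check which partitions your account can submit to:
<code>
# All partitions, their availability, time limits, and node counts:
$ sinfo -o "%P %a %l %D"

# Partitions your account can submit to (the same command used in the example script above):
$ sacctmgr -s list user $USER format=partition
</code>
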
======  SBATCH job with parallel programs running: ======
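
One common pattern is to launch several programs inside a single allocation as separate job steps, each started with srun in the background, and then wait for all of them to finish. A minimal sketch, assuming two hypothetical executables prog1 and prog2 in the submission directory:
<code>
#!/bin/bash -l
#SBATCH --job-name=parallel-example
#SBATCH --partition=med2
#SBATCH --nodes=1
#SBATCH --ntasks=2

# Launch each program as its own job step in the background:
srun --ntasks=1 ./prog1 &
srun --ntasks=1 ./prog2 &

# Wait for all background job steps to finish before the job exits.
wait
</code>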
  