An example sbatch script:
<code>
#!/bin/bash -l
# NOTE the -l flag!
# If you need any help, please email farm-hpc@ucdavis.edu

# Name of the job - You'll probably want to customize this.
#SBATCH --job-name=benchmark-test

# Use the med2 partition (or whichever you have access to)
# Run this to see what partitions you have access to:
#   sacctmgr -s list user $USER format=partition
#SBATCH --partition=med2

# Standard out and Standard Error output files with the job number in the name.
#SBATCH --output=bench-%j.output
#SBATCH --error=bench-%j.output

# Request 4 CPUs and 8 GB of RAM from 1 node:
#SBATCH --mem=8G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

# The useful part of your job goes below

# Use one OpenMP thread per CPU allocated to this task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

module load benchmarks

# The main job executable to run: note the use of srun before it
srun stream
</code>
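To submit the script above, save it to a file and pass it to ''sbatch'' (the filename ''bench.sh'' here is just an example):
<code>
# Submit the batch script; Slurm replies with the assigned job ID
$ sbatch bench.sh
Submitted batch job 12345

# Check the status of your queued and running jobs
$ squeue -u $USER
</code>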
The newest version of Slurm supports array jobs. For example:
<code>
$ cat test-array.sh
#!/bin/bash
hostname
</code>
<code>
# Submit a job array with index values between 0 and 10,000 on all free CPUs:
$ sbatch --array=0-10000 --partition=low test-array.sh
</code>
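Each task in the array sees its own index in the ''SLURM_ARRAY_TASK_ID'' environment variable, which is the usual way to give every task different work. A minimal sketch (the input and output file names are hypothetical, just for illustration):
<code>
#!/bin/bash
# Pick this task's input file based on its array index:
# input-0.txt, input-1.txt, ... are assumed to already exist.
INFILE="input-${SLURM_ARRAY_TASK_ID}.txt"
OUTFILE="output-${SLURM_ARRAY_TASK_ID}.txt"
wc -l "$INFILE" > "$OUTFILE"
</code>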
| -t | Time limit for the job; accepts formats like ''minutes'', ''hours:minutes:seconds'', or ''days-hours:minutes:seconds'' |
| -v -vv -vvv | Increasing levels of verbosity |
| -x node-name | Don't run the job on node-name (and please report any problematic nodes to farm-hpc@ucdavis.edu) |
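As a quick illustration of combining these flags (the node name ''c8-22'' is hypothetical; substitute a real node if you need to exclude one):
<code>
# Run with a 10-minute time limit, extra verbosity, and node c8-22 excluded
$ srun -t 10 -vv -x c8-22 hostname
</code>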
====== Interactive Sessions ======
Interactive sessions let you work on a compute node directly; allocating one usually takes 30 seconds or so.
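A minimal sketch using Slurm's standard ''srun --pty'' pattern (the partition, time, and memory values below are illustrative; adjust them to match your access):
<code>
# Request an interactive bash login shell: 1 CPU, 2 GB RAM, 1 hour, on med2
$ srun --partition=med2 --time=01:00:00 --mem=2G --cpus-per-task=1 --pty bash -il
</code>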
When the time limit expires you will be forcibly logged out, and anything left running will be killed.
===== Cancelling =====
Cancel a job with ''scancel'' and its job ID:
<code>
scancel -u $USER JOBID
</code>
If you forget the JOBID it will cancel all your jobs.
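''scancel'' can also select jobs by attribute rather than job ID. For example (the job name here matches the ''--job-name'' used in the script above):
<code>
# Cancel all of your jobs with a given name
$ scancel -u $USER --name=benchmark-test

# Cancel only your pending (not-yet-running) jobs
$ scancel -u $USER --state=PENDING
</code>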
+ | |||
+ | |||
===== Advanced (Optional) Squeue Usage =====
The ''squeue'' command has some additional flags that can be passed to better monitor your jobs, if necessary.

This section involves some Linux shell knowledge and an understanding of environment variables. If you are unsure, you can skip this section, or ask an administrator for help.

The default output fields of ''squeue'' are defined in the slurm module, but these can be overridden with the ''--format'' option. An example of the standard output of ''squeue'':
<code>
         JOBID PARTITION     NAME     USER ST        TIME  NODES CPUS MIN_ME NODELIST(REASON)
         12345       med    bench    auser  R     1:23:45      1    4     8G c8-22
</code>
These fields are defined by default using the following format codes:
<code>
%.14i %.9P %.8j %.8u %.2t %.11M %.6D %3C %6m %R
</code>
A full explanation of what formatting codes may be used can be found in ''man squeue''.

To see the time and date that your jobs are scheduled to end, and how much time is remaining, use the ''%e'' (end time) and ''%L'' (time left) format codes. For example:
<code>
squeue -u $USER --format="%.10i %.9P %.8j %.2t %.11M %e %L"
</code>
Sample output:
<code>
     JOBID PARTITION     NAME ST        TIME             END_TIME  TIME_LEFT
      1234       med    bench  R     1:23:45  2021-06-22T15:30:00    2:36:15
</code>
+ | |||
+ | For convenience, | ||
+ | < | ||
+ | alias jobtimes=" | ||
+ | </ | ||
+ | Next time you log in, the command " | ||
+ | |||
+ | See the squeue man page for other fields that squeue can output. | ||
+ | |||
+ | The default squeue formatting is stored in the environment variable '' | ||
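You can override it yourself; for example, to make the end-time fields shown above your default (a sketch; add it to your ''~/.bashrc'' to make it persistent):
<code>
# Override the default squeue output fields for this shell
export SQUEUE_FORMAT="%.10i %.9P %.8j %.2t %.11M %e %L"
</code>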
+ | |||
====== SLURM Partitions ======
Generally, there are three SLURM partitions (aka queues) on a cluster. These partitions divide up pools of nodes based on job priority needs.
|low| Low priority means that your job might be killed at any time. Great for soaking up unused cycles with short jobs; a particularly good fit for large array jobs when individual jobs have short run times. |
|med| Medium priority means your job might be suspended, but it will resume when the high priority job finishes. |
|hi| High priority means your job will kill/suspend lower priority jobs as it needs the resources. |
Other types of partitions may exist as well:
|bigmem, bm| Large memory nodes. Jobs will keep the allocated hardware until the job is done or there's a failure. |
|gpu| GPU nodes. Jobs will keep the allocated hardware until the job is done or there's a failure. |
|serial| Older serial nodes. Jobs will keep the allocated hardware until the job is done or there's a failure. |
+ | |||
+ | Nodes can be in more than one partition, and partitions with similar names generally have identical or near-identical hardware: low/ | ||
+ | There may be other partitions based on the hardware available on a particular cluster; not all users have access to all partitions. Consult with your account creation email, your PI, or the helpdesk if you are unsure what partitions you have access to or to use. | ||
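To check which partitions your account can submit to (the same command noted in the batch script example above):
<code>
$ sacctmgr -s list user $USER format=partition
</code>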