HPC Survival Guide: SLURM Commands You Must Know Before the Admin Kicks You Off the Cluster
High-performance computing (HPC) is where your bioinformatics dreams either fly or crash violently. It's the place where 96-core nodes roar like jet engines and terabytes of RAM sit waiting.
This guide isn’t about theory. It’s about practical survival. You’ll learn the SLURM commands that keep your jobs running, keep you out of trouble, and keep your admin from sending you that dreaded email: “Do not run intensive jobs on the head node.”
1. sinfo — Know Where You Are Before You Start a War
This command shows the available partitions (queues), nodes, and their states.
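Running it with no arguments gives a quick overview; the partitions and node names in this sample output are purely illustrative and will differ on your cluster:

```bash
sinfo

# Illustrative output:
# PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
# compute*     up 7-00:00:00     40   idle node[001-040]
# highmem      up 2-00:00:00      4  alloc node[041-044]
```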
Typical output includes:
- Partition names
- Node status (idle, alloc, down)
- Available CPUs and memory
Why it matters:
You can target the right partition and avoid submitting to the wrong queue or one that's already packed.
2. sbatch — The Command That Actually Runs Your Job
This is how you submit a job script to the cluster.
A typical script bundles your resource requests and the commands to run. Here's a minimal sketch; the partition, resources, and tool names are illustrative, so check your cluster's documentation:
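```bash
#!/bin/bash
#SBATCH --job-name=align_sample     # name shown in squeue
#SBATCH --partition=compute         # illustrative partition name
#SBATCH --cpus-per-task=8           # CPU cores for the job
#SBATCH --mem=32G                   # memory request
#SBATCH --time=12:00:00             # wall-clock limit (HH:MM:SS)
#SBATCH --output=align_%j.out       # stdout log (%j expands to the job ID)
#SBATCH --error=align_%j.err        # stderr log

module load bwa                     # tool and input files are illustrative
bwa mem ref.fa sample.fastq > sample.sam
```

Submit it with sbatch align.sh and SLURM takes it from there.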
Why it matters:
Without sbatch, you’re not using the cluster—you’re using the login node like a reckless amateur.
3. squeue — Check Your Job Without Losing Your Mind
This command shows your running and pending jobs.
You’ll see:
- Job ID
- State (R, PD, CG)
- Time used
- Queue
Pro tip: If your job is stuck in PD forever, check the Reason column. Usually it's memory, time, or partition limits.
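To see that Reason column yourself (the format string below just adds %r to the usual fields):

```bash
# Show only your own jobs
squeue -u $USER

# Include the pending Reason column (%r)
squeue -u $USER -o "%.10i %.9P %.20j %.2t %.10M %r"
```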
4. scancel — Learn This Before You Submit Anything Big
Kill a job before it kills your quota.
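The job ID below is illustrative; grab the real one from squeue:

```bash
scancel 123456
```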
Or kill everything you ever submitted:
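```bash
# Cancel every job you own
scancel -u $USER
```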
Why it matters:
You will submit something wrong eventually—like asking for 1 TB of RAM by accident. scancel is your parachute.
5. sacct — Your Job History, Exposed
Use this when you need to know why your job died.
Key fields:
- State — reason for failure
- MaxRSS — memory usage
- ExitCode — actual code of death
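A quick way to pull exactly those fields for a finished job (the job ID is illustrative):

```bash
sacct -j 123456 --format=JobID,JobName,State,ExitCode,MaxRSS,Elapsed
```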
Why it matters:
You can diagnose failures without bothering your admin every five minutes.
6. srun — For Interactive Jobs (Use Carefully!)
Use srun when you need an interactive session on a compute node, not on the login node.
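In its simplest form, it drops you into a shell on a compute node:

```bash
srun --pty bash
```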
Add resources:
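```bash
# Interactive shell with explicit CPU, memory, and time requests
srun --cpus-per-task=4 --mem=16G --time=02:00:00 --pty bash
```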
Why it matters:
You can test commands interactively without breaking the rules by running heavy processes on the login node.
7. module load — The Command That Solves 90% of “Command Not Found” Errors
Most HPC systems use environment modules to manage software. To load a tool:
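```bash
# Tool name and version are illustrative; see what your cluster actually provides
module load samtools/1.17
```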
Check available modules:
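```bash
module avail            # list everything installed as a module
module avail samtools   # search for a specific tool
```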
Why it matters:
You don’t install software on HPC; you load it.
8. scontrol show job — Deep Dive Into Job Details
When something is stuck, this command tells you everything.
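The output is verbose but exhaustive; the job ID below is illustrative:

```bash
scontrol show job 123456
```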
You’ll see:
- Node it's running on
- Allocated CPUs
- Memory
- Environment
Why it matters:
Perfect for diagnosing weird scheduling issues.
9. sstat — Real-Time Monitoring
Monitor resource usage while the job runs.
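For jobs submitted via sbatch, the script itself runs as the .batch step, so query that step explicitly (the job ID is illustrative):

```bash
sstat -j 123456.batch
```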
Or for detailed fields:
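```bash
# Same illustrative job ID; MaxRSS = peak memory, AveCPU = average CPU time
sstat -j 123456.batch --format=JobID,MaxRSS,AveCPU
```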
Why it matters:
You can catch a job creeping past its memory request before SLURM kills it without warning.
10. sreport — The Admin’s Favorite Command
This one shows usage statistics.
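For example, to break down utilization per user over a date range (the dates are illustrative):

```bash
sreport cluster AccountUtilizationByUser start=2024-01-01 end=2024-02-01
```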
Why it matters:
Useful to see how much of your fair-share quota you’ve burned.
✅ Bonus Survival Tips Your Admin Wishes You Knew
Never run heavy programs on the login node.
It’s like starting a fire in the kitchen of a crowded restaurant. People get angry.
Check partition limits before submitting.
Use:
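```bash
# Time limits, node counts, and defaults for every partition
scontrol show partition
```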
or read the documentation.
Log your errors.
Use the --output and --error flags in sbatch.
Use job arrays instead of submitting 50 separate jobs.
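A minimal sketch of an array job (the file names are illustrative); each task reads one line of a sample sheet:

```bash
#!/bin/bash
#SBATCH --array=1-50                # 50 tasks under one job ID
#SBATCH --output=task_%A_%a.out     # %A = array job ID, %a = task ID

# Each task picks its own input line from the sample sheet
sample=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
echo "Processing ${sample}"
```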
Respect quotas.
Disk usage is finite. Don’t hoard 3 TB of old FASTQ files.
Final Thoughts
SLURM isn’t your enemy—it’s the gatekeeper to all the computational firepower you need. Learn these commands and you’ll navigate HPC clusters with confidence, avoid common mistakes, and stay out of trouble with the admin who manages your computing fate.
Master these essentials, and HPC becomes less of a mystery and more of a reliable workhorse behind your research.