HPC Survival Guide: SLURM Commands You Must Know Before the Admin Kicks You Off the Cluster

High-performance computing (HPC) is where your bioinformatics dreams either fly or crash violently. It’s the place where 96-core nodes roar like jet engines and terabytes of RAM sit waiting for your pipelines.

This guide isn’t about theory. It’s about practical survival. You’ll learn the SLURM commands that keep your jobs running, keep you out of trouble, and keep your admin from sending you that dreaded email: “Do not run intensive jobs on the head node.”


1. sinfo — Know Where You Are Before You Start a War

This command shows the available partitions (queues), nodes, and their states.

sinfo

Typical output includes:

  • Partition names

  • Node status (idle, alloc, down)

  • Available CPUs, memory

Why it matters:
You can pick the right partition for your job and avoid submitting to one that’s down, restricted, or already packed.
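
If the default output feels cramped, a custom format string puts the useful columns side by side (these are standard sinfo format tokens, though minor details can vary between SLURM versions):

sinfo -o "%P %a %l %D %c %m %t"   # partition, availability, time limit, nodes, CPUs, memory, state
sinfo -N -l                        # one line per node, long format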


2. sbatch — The Command That Actually Runs Your Job

This is how you submit a job script to the cluster.

sbatch my_script.sh

A typical script includes:

#!/bin/bash
#SBATCH --job-name=myRNAseq
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=24:00:00
#SBATCH --output=log_%j.out

Why it matters:
Without sbatch, you’re not using the cluster—you’re using the login node like a reckless amateur.
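
Two flags worth knowing once your pipelines grow, sketched here with hypothetical script names (align.sh and count.sh): --parsable makes sbatch print only the job ID, and --dependency chains a second job to run only if the first finishes successfully.

JOBID=$(sbatch --parsable align.sh)           # capture the job ID of the first step
sbatch --dependency=afterok:$JOBID count.sh   # second step runs only if the first exits cleanly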


3. squeue — Check Your Job Without Losing Your Mind

This command shows your running and pending jobs.

squeue -u $USER

You’ll see:

  • Job ID

  • State (R = running, PD = pending, CG = completing)

  • Time used

  • Queue

Pro tip: If your job is stuck in PD forever, check the reason column (NODELIST(REASON) in the default output). Usually it’s memory, time, or partition limits.
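
A custom format string puts that reason in plain view; %R prints the pending reason (or the node list once the job is running), and --start asks the scheduler for an estimated start time where it can guess one:

squeue -u $USER -o "%.10i %.9P %.20j %.2t %.10M %R"
squeue -u $USER --start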


4. scancel — Learn This Before You Submit Anything Big

Kill a job before it kills your quota.

scancel 123456

Or kill everything you ever submitted:

scancel -u $USER

Why it matters:
You will submit something wrong eventually—like asking for 1 TB of RAM by accident. scancel is your parachute.
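
scancel can also target jobs by name or state, which is gentler than wiping out everything at once (the job name here matches the script above):

scancel --name=myRNAseq              # cancel every job with this name
scancel -u $USER --state=PENDING     # clear only the jobs still waiting in the queue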


5. sacct — Your Job History, Exposed

Use this when you need to know why your job died.

sacct -j 123456 -o JobID,State,Elapsed,MaxRSS,ExitCode

Key fields:

  • State — how the job ended (COMPLETED, FAILED, OUT_OF_MEMORY, TIMEOUT, and so on)

  • MaxRSS — memory usage

  • ExitCode — actual code of death

Why it matters:
You can diagnose failures without bothering your admin every five minutes.
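
To review more than one job at a time, sacct can filter by user and date; the date below is just a placeholder, and the fields available depend on how accounting is configured on your cluster:

sacct -u $USER --starttime=2025-01-01 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,ExitCode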


6. srun — For Interactive Jobs (Use Carefully!)

Use srun when you need an interactive session on a compute node, not on the login node.

srun --pty bash -i

Add resources:

srun --cpus-per-task=4 --mem=16G --time=02:00:00 --pty bash

Why it matters:
You can test commands interactively on a compute node instead of hammering the login node with heavy processes.
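
Many clusters keep a dedicated partition for interactive work; the partition name below is an assumption, so check sinfo for what yours is actually called:

srun --partition=interactive --cpus-per-task=4 --mem=16G --time=02:00:00 --pty bash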


7. module load — The Command That Solves 90% of “Command Not Found” Errors

Most HPC systems use environment modules. To load a tool:

module load samtools/1.17
module load gcc/12.2

Check available modules:

module avail

Why it matters:
You don’t install software on HPC; you load it.
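
A few companion commands worth knowing (module spider is specific to Lmod, so it may not exist on every cluster):

module list              # show what is currently loaded
module purge             # unload everything and start clean
module spider samtools   # search all available versions (Lmod only)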


8. scontrol show job — Deep Dive Into Job Details

When something is stuck, this command tells you everything.

scontrol show job 123456

You’ll see:

  • Node it’s running on

  • Allocated CPUs

  • Memory

  • Environment

Why it matters:
Perfect for diagnosing weird scheduling issues.
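
scontrol can also hold and release your own jobs, or inspect a single node; the node name below is just a placeholder:

scontrol hold 123456          # keep a pending job from starting
scontrol release 123456       # let it run again
scontrol show node node042    # CPUs, memory, and state of one node (hypothetical node name)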


9. sstat — Real-Time Monitoring

Monitor resource usage while the job runs.

sstat -j 123456

Or for detailed fields:

sstat -j 123456 --format=JobID,MaxRSS,MaxVMSize,AveCPU

Why it matters:
Lets you catch a job that’s creeping toward its memory limit before the scheduler kills it without warning.
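
One gotcha: sstat reports on job steps, so for a plain sbatch job you usually need to point it at the .batch step, or ask for all steps at once:

sstat -j 123456.batch --format=JobID,MaxRSS,AveCPU
sstat --allsteps -j 123456 --format=JobID,MaxRSS,MaxVMSize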


10. sreport — The Admin’s Favorite Command

This one shows usage statistics.

sreport cluster utilization

Why it matters:
Useful to see how much of your fair-share quota you’ve burned.
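
For a per-user breakdown over a specific window, something like the following works on most setups (the dates are placeholders, and -t hours just changes the reporting unit):

sreport -t hours cluster AccountUtilizationByUser user=$USER start=2025-01-01 end=2025-02-01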


✅ Bonus Survival Tips Your Admin Wishes You Knew

Never run heavy programs on the login node.

It’s like starting a fire in the kitchen of a crowded restaurant. People get angry.


Check partition limits before submitting.

Use:

sacctmgr show qos

or read the documentation.
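
scontrol also prints the per-partition limits directly, which is often faster than digging through the wiki:

scontrol show partition      # MaxTime, MaxNodes, default memory per CPU, and more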


Log your errors.

Use --output and --error flags in sbatch.
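
A minimal sketch: %x expands to the job name and %j to the job ID (note that SLURM will not create the logs/ directory for you, so make it before submitting):

#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err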


Use job arrays instead of submitting 50 separate jobs.

#SBATCH --array=1-50
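
A minimal array script, assuming a hypothetical samples.txt manifest with one input file per line (the fastqc step is just an illustration; swap in your own tool):

#!/bin/bash
#SBATCH --job-name=qc_array
#SBATCH --array=1-50
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --time=02:00:00
#SBATCH --output=logs/qc_%A_%a.out

# Each task reads its own line from the manifest and processes that sample
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
fastqc "$SAMPLE"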

Respect quotas.

Disk space is finite. Don’t hoard 3 TB of old FASTQ files.


Final Thoughts

SLURM isn’t your enemy—it’s the gatekeeper to all the computational firepower you need. Learn these commands and you’ll navigate HPC clusters with confidence, avoid common mistakes, and stay out of trouble with the admin who manages your computing fate.

Master these essentials, and HPC becomes less of a mystery and more of a reliable workhorse behind your research.


