Slurm Clusters
Deploy fully managed Slurm Clusters on Runpod with zero configuration
Slurm Clusters are currently in beta. If you’d like to provide feedback, please join our Discord.
Runpod Slurm Clusters provide a fully managed high-performance computing and scheduling solution that enables you to rapidly create and manage Slurm Clusters with minimal setup.
For more information on working with Slurm, refer to the Slurm documentation.
Key features
Slurm Clusters eliminate the traditional complexity of cluster orchestration by providing:
- Zero configuration setup: Slurm and munge are pre-installed and fully configured.
- Instant provisioning: Clusters deploy rapidly with minimal setup.
- Automatic role assignment: Runpod automatically designates controller and agent nodes.
- Built-in optimizations: Pre-configured for optimal NCCL performance.
- Full Slurm compatibility: All standard Slurm commands work out-of-the-box.
If you prefer to manually configure your Slurm deployment, see Deploy an Instant Cluster with Slurm (unmanaged) for a step-by-step guide.
Deploy a Slurm Cluster
- Open the Instant Clusters page on the Runpod console.
- Click Create Cluster.
- Select Slurm Cluster from the cluster type dropdown menu.
- Configure your cluster specifications:
- Cluster name: Enter a descriptive name for your cluster.
- Pod count: Choose the number of Pods in your cluster.
- GPU type: Select your preferred GPU type.
- Region: Choose your deployment region.
- Network volume (optional): Add a network volume for persistent/shared storage. If using a network volume, ensure the region matches your cluster region.
- Pod template: Select a Pod template or click Edit Template to customize start commands, environment variables, ports, or container/volume disk capacity.
- Click Deploy Cluster.
Connect to a Slurm Cluster
Once deployment completes, you can access your cluster from the Instant Clusters page.
From this page you can select a cluster to view its component nodes, with labels indicating the Slurm controller (primary node) and Slurm agents (secondary nodes). Expand a node to view details such as availability, GPU and storage utilization, and options for connection and management.
Connect to a node using the Connect button, or using any of the connection methods supported by Pods.
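The exact connection details for each node appear in its Connect dialog. For example, a direct SSH connection over the node's exposed TCP port looks like this (the IP address, port, and key path are placeholders; copy the real values from the console):

```sh
# Values come from the node's Connect dialog in the Runpod console
ssh root@<NODE_PUBLIC_IP> -p <SSH_PORT> -i ~/.ssh/id_ed25519
```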
Submit and manage jobs
All standard Slurm commands are available without configuration. For example, you can:
Check cluster status and available resources:
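For example (standard Slurm commands; the output depends on your cluster's node count and GPU type):

```sh
# List partitions, node states, and idle/allocated resources
sinfo

# Show per-node detail, including CPUs, memory, and state
sinfo -N -l
```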
Submit a job to the cluster from the Slurm controller node:
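For example, with a minimal batch script (the filename, job name, and resource requests below are illustrative; adjust them to your cluster):

```sh
#!/bin/bash
#SBATCH --job-name=example-job     # illustrative job name
#SBATCH --nodes=2                  # number of nodes to use
#SBATCH --gpus-per-node=1          # GPUs per node, if GPUs are exposed as generic resources
#SBATCH --output=%j.out            # write output to <job_id>.out

# Print the hostname of every allocated node
srun hostname
```

Submit it from the controller node:

```sh
sbatch example-job.sh
```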
Monitor job queue and status:
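For example:

```sh
# Show all pending and running jobs in the queue
squeue

# Show only jobs submitted by the current user
squeue -u $USER
```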
View detailed job information from the Slurm controller node:
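For example (replace <job_id> with the ID reported by sbatch or squeue):

```sh
# Full scheduler-side details for a single job
scontrol show job <job_id>
```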
Advanced configuration
While Runpod’s Slurm Clusters work out-of-the-box, you can customize your configuration by connecting to the Slurm controller node using the web terminal or SSH.
Access Slurm configuration files in their standard locations:
- /etc/slurm/slurm.conf - Main configuration file.
- /etc/slurm/topology.conf - Network topology configuration.
- /etc/slurm/gres.conf - Generic resource configuration.
Modify these files as needed for your specific requirements.
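After editing a configuration file, you typically need to make sure the same file is present on every node and ask the Slurm daemons to re-read it. A minimal sketch, assuming you edited /etc/slurm/slurm.conf on the controller:

```sh
# Ask slurmctld and the slurmd daemons to re-read their configuration
scontrol reconfigure

# Confirm the values the controller is actually running with
scontrol show config | less
```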
Troubleshooting
If you encounter issues with your Slurm Cluster, try the following:
- Jobs stuck in pending state: Check resource availability with sinfo and ensure the requested resources are available. If you need more resources, you can add more nodes to your cluster.
- Authentication errors: Munge is pre-configured, but if issues arise, verify that the munge service is running on all nodes (see the checks after this list).
- Performance issues: Review topology configuration and ensure jobs are using appropriate resource requests.
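To verify munge, you can check that the daemon is running and that credentials can be created and decoded on each node. A quick sketch (how the service is managed may differ inside your container image):

```sh
# Confirm the munge daemon process is running
pgrep -a munged

# Round-trip test: create a credential and decode it locally
munge -n | unmunge
```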
For additional support, contact Runpod support with your cluster ID and specific error messages.