The Oscar GPUs are in a separate partition from the regular compute nodes. The partition is called gpu. To see how many jobs are running and pending in the gpu partition, use
allq gpu
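You can also inspect the partition with standard Slurm commands (a general Slurm alternative, not part of the allq wrapper itself):
squeue -p gpu    # list running and pending jobs in the gpu partition
sinfo -p gpu     # show the state of the gpu partition's nodes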
Interactive use
To start a session on a GPU node, use the interact command and specify the gpu partition. You also need to specify the requested number of GPUs using the -g option:
interact -q gpu -g 1
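If the interact wrapper is unavailable, a roughly equivalent interactive session can be started directly with srun. This is a sketch using standard Slurm options, not the documented interact interface:
srun -p gpu --gres=gpu:1 -n 1 -t 01:00:00 --pty bash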
Batch jobs
Here is an example batch script for a CUDA job that uses 1 GPU and 1 CPU core for 5 minutes:
#!/bin/bash

# Request a GPU partition node and access to 1 GPU
#SBATCH -p gpu --gres=gpu:1

# Request 1 CPU core
#SBATCH -n 1

#SBATCH -t 00:05:00

# Load a CUDA module
module load cuda

# Run program
./my_cuda_program
To submit this script:
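sbatch my_cuda_job.sh

Here my_cuda_job.sh is just a placeholder; use whatever filename you saved the script under.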
GPU-HE Partition: MIG-ed and un-MIG-ed GPUs
The GPU-HE partition has a mix of B200 DGX nodes and RTX Pro 6000 Max-Q nodes. These GPUs can be subdivided into virtualized GPU "slices", a feature referred to as Multi-Instance GPU (MIG). For example, a B200 with 180 GB of VRAM can be split into two MIG instances with 90 GB of VRAM each. Currently, all of the Max-Q GPUs have been MIG-ed into two 48 GB slices, and the GPUs on two of the B200 nodes have been MIG-ed into 90 GB slices. A MIG-ed GPU is allocated first by default.
To request an un-MIG-ed GPU explicitly, users need to specify the nomig feature.
To request a MIG-ed GPU explicitly, users need to specify the mig feature.
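As a sketch, node features like these are typically requested through Slurm's constraint option in a batch script; the feature names (mig, nomig) are taken from the text above, and the exact request syntax may differ on Oscar:

#SBATCH -p gpu-he --gres=gpu:1
# Request only a full, un-MIG-ed GPU (assumes the feature is exposed as a Slurm constraint)
#SBATCH --constraint=nomig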
DGX GPU Nodes in the GPU-HE Partition
All of the nodes in the gpu-he partition have V100 GPUs. However, two of them are DGX nodes (gpu1404/1405), each with 8 GPUs. When a gpu-he job requests more than 4 GPUs, it is automatically allocated to the DGX nodes.
The non-DGX nodes actually have a better NVLink interconnect topology, as each GPU has a direct link to every other GPU. So the non-DGX nodes are the better choice for a gpu-he job that needs no more than 4 GPUs.
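For example, a gpu-he batch job that needs all 8 GPUs of a DGX node could use directives like the following. This is a sketch based on the directives shown earlier; adjust the CPU count and walltime for your workload:

# Request the gpu-he partition and 8 GPUs; such a job will land on a DGX node
#SBATCH -p gpu-he --gres=gpu:8
#SBATCH -n 8
#SBATCH -t 01:00:00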