Associations & Quality of Service (QOS)

Associations

Oscar uses associations to control job submissions from users. An association refers to a combination of four factors: Cluster, Account, User, and Partition. For a user to submit jobs to a partition, an association for the user and partition is required in Oscar.

To view a table of association data for a specific user (thegrouch in this example), enter the following command on Oscar:

(sacctmgr list assoc | head -2; sacctmgr list assoc | grep thegrouch) | cat
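
To list your own associations, the same pipeline works with your username in place of thegrouch (here via the $USER environment variable, which holds the login name of the current user):

(sacctmgr list assoc | head -2; sacctmgr list assoc | grep $USER) | cat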

If thegrouch has an exploratory account, you should see an output similar to this:

   Cluster    Account       User  Partition     Share GrpJobs       GrpTRES GrpSubmit     GrpWall   GrpTRESMins MaxJobs       MaxTRES MaxTRESPerNode MaxSubmit     MaxWall   MaxTRESMins                  QOS   Def QOS GrpTRESRunMin
---------- ---------- ---------- ---------- --------- ------- ------------- --------- ----------- ------------- ------- ------------- -------------- --------- ----------- ------------- -------------------- --------- -------------
     oscar    default  thegrouch  gpu-debug         1                                                                                                                                               gpu-debug gpu-debug
     oscar    default  thegrouch     bigmem         1                                                                                                                                             norm-bigmem norm-big+
     oscar    default  thegrouch        smp         1                                                                                                                                                norm-smp  norm-smp
     oscar    default  thegrouch        gpu         1                                                                                                                                                norm-gpu  norm-gpu cpu=34560,gr+
     oscar    default  thegrouch      batch         1                                                                                                                                                  normal    normal    cpu=110000
     oscar    default  thegrouch        vnc         1                                                                                                                                                     vnc       vnc
     oscar    default  thegrouch      debug         1                                                                                                                                                   debug     debug

Note that the first four columns correspond to the four factors that form an association. Each row of the table corresponds to a unique association (i.e., a unique combination of Cluster, Account, User, and Partition values). Each association is assigned a Quality of Service (see the QOS section below for more details).

Some associations have a value for GrpTRESRunMins. This value limits the total number of Trackable RESource (TRES) minutes that jobs running under the association may hold at any given time. The cpu=110000 for the association with the batch partition means that the jobs running under this association may collectively account for at most 110,000 core-minutes (roughly, cores allocated times remaining walltime, summed over running jobs). If this limit is reached, new jobs are held pending until running jobs complete and free up resources.
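
As a quick illustration (with hypothetical numbers), one running job using 20 cores for 80 hours accounts for 20 cores × 80 hours × 60 minutes/hour = 96,000 core-minutes, which fits under the 110,000 core-minute limit; a second identical job would raise the running total to 192,000 core-minutes, so it would wait in the queue until the first job finishes.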

Example of GrpTRESRunMins Limit

Here is an example file that incurs a significant core-minute cost:

#!/bin/bash
#SBATCH -n 30        # request 30 cores
#SBATCH --mem=32G    # request 32 GB of memory
#SBATCH -t 90:00:00  # request 90 hours of walltime

echo "Is this too much to ask? (Hint: What is the GrpTRESRunMins limit for batch?)"

If this file is named too_many_cpu_minutes.sh, a user with thegrouch's QOS might see something like this:

$ sbatch too_many_cpu_minutes.sh
Submitted batch job 12345678
$ myq
Jobs for user thegrouch

Running:
(none)

Pending:
ID        NAME                     PART.  QOS     CPU  WALLTIME    EST.START  REASON
12345678  too_many_cpu_minutes.sh  batch  normal  30   3-18:00:00  N/A        (AssocGrpCPURunMinutesLimit)

The REASON field will be (None) at first, but after a minute or so, running myq again should produce output resembling the above.

Note that the REASON the job is pending and not yet running is AssocGrpCPURunMinutesLimit. This is because the program requests 30 cores for 90 hours, which is more than the oscar/default/thegrouch/batch association allows (30 cores * 90 hours * 60 minutes/hour = 162,000 core-minutes > 110,000 core-minutes). In fact, this job could be pending indefinitely, so it would be a good idea for thegrouch to run scancel 12345678 and make a less demanding job request (or use an association that allows for that amount of resources).
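
One way to fit under the limit is to shorten the requested walltime. For example (a hypothetical revision that keeps the same 30 cores), requesting 60 hours gives 30 cores × 60 hours × 60 minutes/hour = 108,000 core-minutes, just under the 110,000 core-minute limit:

#!/bin/bash
#SBATCH -n 30        # same 30 cores as before
#SBATCH --mem=32G
#SBATCH -t 60:00:00  # 60 hours: 30 * 60 * 60 = 108,000 core-minutes

echo "This request fits under the GrpTRESRunMins limit for batch."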

Account Quality of Service (QoS) and Resources

Quality of Service (QoS) refers to the ability of a system to prioritize and manage resources to ensure a certain level of performance or service quality. An association's QOS is used for job scheduling when a user requests that a job be run. Every QOS is linked to a set of job limits that reflect the limits of the cluster/account/user/partition of the association(s) carrying that QOS. A QOS can also carry GrpTRESRunMins limits for its corresponding associations. For example, HPC Priority accounts have job limits of 1,198,080 core-minutes per job, which are associated with those accounts' QOS's. Whenever a job request is made (necessarily through a specific association), the job is queued only if it meets the requirements of the association's QOS. In some cases, a QOS can be defined with limits that differ from those of its corresponding association; in such cases, the limits of the QOS override the limits of the association. For more information, see the Slurm QOS documentation.
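
To inspect the limits attached to a particular QOS directly, you can query the QOS table with sacctmgr (a minimal sketch; the field names follow the column headers of sacctmgr's default QOS listing and may vary across Slurm versions):

sacctmgr show qos name=normal format=Name,MaxTRESPU,MaxSubmitPU,GrpTRESRunMin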

myaccount - To list the QoS & Resources

The myaccount command lists the resources and limits associated with your account, including parameters such as Max Resources Per User and Max Jobs Submit Per User.

[ccvdemo1@login010 ~]$ myaccount
My QoS                    Total Resources in this QoS              Max Resources Per User                   Max Jobs Submit Per User
------------------------- ------------------------------           ------------------------------           -----------         
debug                                                                                                       1200                
gpu-debug                                                          cpu=8,gres/gpu=4,mem=96G                 1200                
gpu                                                                node=1                                   1200                
normal                                                             cpu=32,mem=246G                          1000                
norm-bigmem                                                        cpu=32,gres/gpu=0,mem=770100M,node=2     1200                
norm-gpu                                                           cpu=12,gres/gpu=2,mem=192G               1200                
vnc                                                                                                         1                   
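
In this example output, a user running under the normal QOS (used by the batch partition) can use at most 32 cores and 246 GB of memory across running jobs and have up to 1,000 jobs submitted at once, while the vnc QOS allows only a single submitted job.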