Oscar uses associations to control job submissions from users. An association refers to a combination of four factors: Cluster, Account, User, and Partition. For a user to submit jobs to a partition, an association for the user and partition is required in Oscar.
To view a table of association data for a specific user (thegrouch in the example), enter the following command in Oscar:
If thegrouch has an exploratory account, you should see an output similar to this:
Note that the first four columns correspond to the four factors that form an association. Each row of the table corresponds to a unique association (i.e., a unique combination of Cluster, Account, User, and Partition values). Each association is assigned a Quality of Service (see QOS section below for more details).
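As a rough illustration of how such a table maps onto associations, the sketch below builds a Slurm `sacctmgr` query and parses its pipe-delimited output. It is not necessarily the command this page originally showed. The helper names (`build_query`, `parse_associations`), the sample rows, and the QOS names are assumptions made up for this example. Only the `oscar`/`default`/`thegrouch`/`batch` association and its `cpu=110000` value come from the text.

```python
from typing import NamedTuple

# Field names passed to sacctmgr's format= option.
FIELDS = ["Cluster", "Account", "User", "Partition", "QOS", "GrpTRESRunMins"]


class Association(NamedTuple):
    cluster: str
    account: str
    user: str
    partition: str
    qos: str
    grp_tres_run_mins: str


def build_query(user: str) -> list[str]:
    """Return an argv list requesting a pipe-delimited association listing."""
    return [
        "sacctmgr", "--parsable2", "show", "associations",
        f"user={user}", "format=" + ",".join(FIELDS),
    ]


def parse_associations(text: str) -> list[Association]:
    """Parse --parsable2 output: one header line, then one row per association."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return [Association(*ln.split("|")) for ln in lines[1:]]


# Hypothetical output, invented for illustration (QOS names are placeholders).
SAMPLE = """Cluster|Account|User|Partition|QOS|GrpTRESRunMins
oscar|default|thegrouch|batch|example-qos|cpu=110000
oscar|default|thegrouch|gpu|example-gpu-qos|
"""

associations = parse_associations(SAMPLE)
# Each association is a unique (Cluster, Account, User, Partition) combination.
keys = {(a.cluster, a.account, a.user, a.partition) for a in associations}
```

On the cluster, the argv list from `build_query("thegrouch")` could be passed to `subprocess.run(..., capture_output=True, text=True)` and the resulting `stdout` handed to `parse_associations` in place of `SAMPLE`.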
Some associations have a value for GrpTRESRunMins. This value indicates a limit on the total number of Trackable RESource (TRES) minutes that can be used by jobs running with this association at any given time. The cpu=110000 for the association with the batch partition indicates that all of the jobs running with this association can together account for at most 110,000 core-minutes. If this limit is reached, new jobs will be delayed until other jobs have completed and freed up resources.
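The accumulation described above can be sketched as a simple sum. This is a simplified model, not Slurm's actual accounting: it charges each running job its full requested time, and the job list and the `can_start` helper are hypothetical.

```python
CPU_RUN_MINS_LIMIT = 110_000  # cpu=110000 from the batch association


def core_minutes(cores: int, minutes: int) -> int:
    """Core-minute cost of one job: cores multiplied by requested minutes."""
    return cores * minutes


def can_start(running, new_job, limit=CPU_RUN_MINS_LIMIT) -> bool:
    """True if the new job fits under the limit alongside the running jobs."""
    used = sum(core_minutes(cores, mins) for cores, mins in running)
    return used + core_minutes(*new_job) <= limit


# Hypothetical running jobs as (cores, requested minutes).
running = [(16, 48 * 60), (8, 24 * 60)]  # 46,080 + 11,520 = 57,600 core-minutes

fits = can_start(running, (4, 120 * 60))      # adds 28,800 -> 86,400 total
too_much = can_start(running, (32, 30 * 60))  # adds 57,600 -> 115,200 total
```

In this model the second job would stay pending until enough running core-minutes are released.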
GrpTRESRunMins Limit
Here is an example file that incurs a significant core-minute cost:
If this file is named too_many_cpu_minutes.sh, a user with thegrouch's QOS might experience something like this:
The REASON field will be (None) at first, but after a minute or so, it should resemble the output above (after another myq command).
Note that the REASON the job is pending and not yet running is AssocGrpCPURunMinutesLimit. This is because the program requests 30 cores for 90 hours, which is more than the oscar/default/thegrouch/batch association allows (30 cores * 90 hours * 60 minutes/hour = 162,000 core-minutes > 110,000 core-minutes). In fact, this job could be pending indefinitely, so it would be a good idea for thegrouch to run scancel 12345678 and make a less demanding job request (or use an association that allows for that amount of resources).
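To size a less demanding request, the same arithmetic can be run in reverse. The sketch below is plain integer arithmetic on the numbers from the example. The helper names are made up for illustration, and it assumes no other jobs are running under the association.

```python
LIMIT = 110_000  # core-minutes allowed by the oscar/default/thegrouch/batch association


def max_walltime_minutes(cores: int, limit: int = LIMIT) -> int:
    """Longest time limit (in minutes) a job with this many cores can request."""
    return limit // cores


def max_cores(minutes: int, limit: int = LIMIT) -> int:
    """Largest core count a job with this time limit can request."""
    return limit // minutes


requested = 30 * 90 * 60                    # 162,000 core-minutes, over the limit
longest = max_walltime_minutes(30)          # 3,666 minutes for 30 cores
hours, minutes = divmod(longest, 60)        # 61 hours, 6 minutes
most_cores_for_90h = max_cores(90 * 60)     # 20 cores for 90 hours
```

Under these assumptions, thegrouch could keep 30 cores with a time limit of about 61 hours, or keep the 90-hour limit with at most 20 cores.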
Quality of Service (QOS) generally refers to a system's ability to prioritize and manage resources to ensure a certain level of performance or service quality. In Oscar, an association's QOS is used for job scheduling when a user requests that a job be run. Every QOS is linked to a set of job limits that reflect the limits of the cluster/account/user/partition of the associations that have that QOS. A QOS can also carry GrpTRESRunMins limits for its corresponding associations. For example, HPC Priority accounts have a job limit of 1,198,080 core-minutes per job, which is attached to those accounts' QOSes. Whenever a job request is made (necessarily through a specific association), the job will only be queued if it meets the requirements of the association's QOS. In some cases, a QOS can be defined with limits that differ from those of its corresponding association. In such cases, the limits of the QOS override the limits of the association. For more information, see the Slurm QOS documentation.
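The override rule and the per-job limit can be pictured with a small sketch. It only models the behavior described in the paragraph above. The function names are hypothetical, and `None` is an assumed stand-in for "no limit set".

```python
from typing import Optional

PRIORITY_PER_JOB_LIMIT = 1_198_080  # core-minutes per job for HPC Priority accounts


def effective_limit(qos_limit: Optional[int], assoc_limit: Optional[int]) -> Optional[int]:
    """A limit set on the QOS overrides the association's limit."""
    return qos_limit if qos_limit is not None else assoc_limit


def job_allowed(cores: int, minutes: int, limit: Optional[int]) -> bool:
    """True if the job's core-minute cost is within the limit (None = unlimited)."""
    return limit is None or cores * minutes <= limit


# 1,198,080 core-minutes corresponds to, for instance, 208 cores for 96 hours.
at_limit = job_allowed(208, 96 * 60, PRIORITY_PER_JOB_LIMIT)
over_limit = job_allowed(209, 96 * 60, PRIORITY_PER_JOB_LIMIT)
```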
myaccount - To list the QoS & Resources
The myaccount command shows the resources associated with your accounts, including limits such as Max Resources Per User and Max Jobs Submit Per User.