Managing Jobs

Cancelling a Job

To cancel a job, first find its jobID. You can use the vuw-myjobs (or squeue) command to see a list of your jobs, including their jobIDs. Once you have the jobID you can use the scancel command, e.g.

scancel 236789

To cancel all of your jobs, use the -u flag followed by your username:

scancel -u <username>
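
scancel can also filter by job state, which is handy when you have a mix of running and queued jobs. For example, the following sketch (using scancel's standard --state option) cancels only your pending jobs and leaves running ones alone:

scancel -u <username> --state=PENDING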

Note: Before cancelling a job, please let it run for at least 2 minutes, even if it was submitted in error.

Viewing Job Information

Job History

If you want a quick view of all the jobs completed within the last 5 days, you can use the vuw-job-history command, for example:

$ vuw-job-history

MY JOBS WITHIN LAST 5 days
       JobID      State    JobName  MaxVMSize    CPUTime
------------ ---------- ---------- ---------- ----------
2645          COMPLETED       bash              00:00:22
2645.extern   COMPLETED     extern      0.15G   00:00:22
2645.0        COMPLETED       bash      0.22G   00:00:20
2734          COMPLETED       bash              00:07:40
2734.extern   COMPLETED     extern      0.15G   00:07:40
2734.0        COMPLETED       bash      0.22G   00:07:40
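
vuw-job-history covers the last 5 days. If you need a different window, you can query Slurm's sacct command directly; the example below is only a sketch (the 14-day window and the field list are illustrative choices):

sacct --starttime=now-14days --units=G --format=JobID,JobName,State,Elapsed,MaxVMSize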

Job Reports

To view a report of a past job you can run vuw-job-report followed by the jobID:

$ vuw-job-report 162711

JOB REPORT FOR JOB 162711
     JobName  Nodes    ReqMem   UsedMem(GB)  ReqCPUs CPUTime    State    Completed
test-schro        1       64Gn                   24  00:02.513  COMPLETED 2019-05-28T16:17:10
     batch        1       64Gn      0.15G        24  00:00.210  COMPLETED 2019-05-28T16:17:10
    extern        1       64Gn      0.15G        24  00:00.002  COMPLETED 2019-05-28T16:17:10

NOTE: In this example you can see that I requested 64 GB of memory but used only 0.15 GB. This means that more than 63 GB of memory went unused, which was a waste of resources.
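
To avoid wasting memory like this, size the request in your submission script to what the job actually uses, plus a small safety margin. A minimal sketch of the relevant #SBATCH line, assuming the job above really needs well under 2 GB (the value shown is illustrative):

#SBATCH --mem=2G    # request 2 GB instead of 64 GB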

You can also get a report of your completed jobs using the sacct command. For example, if I wanted to see how much memory my job used, I could do the following:

sacct --units=G --format="MaxVMSize" -j 2156

  • MaxVMSize will report the maximum virtual memory (RAM plus swap space) used by my job in gigabytes ( --units=G )
  • -j 2156 shows the information for job ID 2156
  • type man sacct at a prompt on the cluster to see the documentation for the sacct command
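
You can also ask sacct for several fields at once. The selection below is only an illustration; MaxRSS reports the peak resident (physical) memory and Elapsed the wall-clock time:

sacct --units=G --format=JobID,JobName,MaxRSS,MaxVMSize,Elapsed,State -j 2156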

Viewing Jobs in the Queue

To view your running jobs you can type vuw-myjobs, e.g.:

$ vuw-myjobs
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
7921967 quicktest     bash username  R       0:12      1 c03n01

As you can see, I have a single job running on node c03n01 in the quicktest partition.

You can see all the jobs in the queues by running the vuw-alljobs command. This will produce a very long list of jobs if the cluster is busy.
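
If you prefer the standard Slurm command, squeue gives similar information. For example, to list only your own jobs, or only the jobs in the quicktest partition:

squeue -u <username>
squeue -p quicktest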

Job Queuing (aka Why isn't my job running?)

When a partition is busy, jobs will be placed in a queue. You can observe this with the vuw-myjobs and vuw-alljobs commands. The STATE of your job will be PENDING, which means it is waiting for resources or has been re-prioritized to allow other users to run their jobs (this is called fair-share queueing).
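
You can also ask Slurm directly for a pending job's reason and, where available, its estimated start time. For example (a sketch using standard Slurm commands):

squeue -u <username> --start
scontrol show job <jobid>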

The resource manager will list a reason why the job is pending; these reasons can include:

  • Priority - Your job's priority has been reduced to allow other users access to the cluster. If no other user with normal priority is also pending, your job will start once resources are available. Your priority may be lowered because of the number of jobs you have run in the past 24-48 hours, the duration of your jobs, or the amount of resources requested. The Slurm manager uses fair-share queueing to ensure the best use of the cluster; you can google fair-share queueing if you want to know more.
  • Resources - There are insufficient resources to start your job: some combination of CPU, memory, time or other specialised resources is unavailable. Once resources are freed up your job will begin to run.
    Time: If you request more time than the maximum run-time of a partition, your job will be queued indefinitely (in other words, it will never run). Your time request must be less than or equal to the partition's maximum run-time. Also, if a special reservation is placed on the cluster, for instance prior to scheduled maintenance, this too will reduce the time available to run your job. The maximum run-times for our partitions are described in this document. CAD or ITS staff will alert all users prior to any scheduled maintenance and advise them of the time restrictions.
  • QOSGrpCPULimit - This is a Quality of Service configuration that limits the number of CPUs per user. The QOSMax is the maximum that can be requested for any single job. If a user requests more CPUs than the QOSMax for a single job, the job will not run. If the user requests more than QOSMax across 2 or more jobs, the subsequent jobs will queue until the user's running jobs complete.
  • PartitionTimeLimit - This means you have requested more time than the maximum run-time of the partition. This document contains information about the different partitions, including their maximum run-times. Typing vuw-partitions will also show the maximum run-time for the partitions available to you.
  • ReqNodeNotAvail - 99% of the time you will receive this code because you have asked for too much time. This frequently occurs when the cluster is about to go into maintenance and a reservation has been placed on it, which reduces the maximum run-time of all jobs. For example, if maintenance on the cluster is 1 week away, the run-time of every job must be less than 1 week, regardless of whether the configured maximum run-time on a partition is greater than 1 week. To set your time request, use the --time parameter (see the example after this list). This reason can also appear if you request too much memory or a CPU configuration that does not exist on any node in the partition.
  • Required node not available (down, drained or reserved) - This is related to ReqNodeNotAvail, see above.
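
As noted for PartitionTimeLimit and ReqNodeNotAvail above, the usual fix is to keep your time request within the partition's limit. A minimal sketch of setting the limit in a submission script (the 2-day value is only an example; check vuw-partitions for the real limits):

#SBATCH --time=2-00:00:00    # days-hours:minutes:seconds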