The Degond Compute Cluster

Latest cluster news

What is the Degond Compute Cluster?

Replacing the three Degond stand-alone compute servers kato, lax & nash, the Degond Compute Cluster is a small HPC (High Performance Computing) cluster of Linux compute servers providing 56 processors on which anyone in the Degond section may run short or lengthy compute jobs, either singly or in parallel. It is a batch-mode cluster operating in exactly the same way as the main Maths NextGen cluster: you submit compute jobs to it, it places them in one of several queues available to you, and it runs them for you when the resources your job needs become available. It will also allow you to use it interactively, connecting to a compute node for a traditional terminal/shell session.

Just a quick note on some cluster terminology before we get deeper into this:

  • a job is simply a program that you want the cluster to run for you; it might be one you've written yourself in C or Python, or it could be a Matlab, R, Maple, Sage, Magma, etc program or script

  • a node is just another name for a single compute server within the cluster

  • resources are the computing resources the cluster needs to run your job, such as the amount of memory required and the number of CPU processor cores, CPUs or even nodes required to run your job

Once you have submitted a job to the cluster, you can walk away knowing the cluster will run your job for you, usually immediately. If the cluster is busy or you have asked for a lot of resources, such as 16 CPU cores or 64 gigabytes of memory, your job will be held in the queue until other users' jobs complete and free up those resources. Optionally, you can also ask the cluster to notify you by email when your job starts, finishes or is aborted for any reason.

Using a batch-mode cluster for your computing makes things a lot easier for you: you no longer have to choose which system to use, worry about whether there is enough memory available or whether too many users are already on that system, or stay logged into a remote computer while your job is running, which can often be difficult to do. And from the cluster's point of view, it means system resources across the whole cluster are allocated evenly and fairly between users, and it prevents individual servers/nodes from being overloaded. One of the nice things about the cluster is that it shields you from the hardware that actually runs your job or program, so you don't need to make any decisions - the queueing system does this for you, matching your job to the available resources to make the most efficient use of the available hardware. The details of this cluster's hardware are therefore largely irrelevant - as far as you, the user, are concerned, it's just a large pool of servers. In effect, it's a 'private cloud'.

Once you have got used to using an HPC cluster like Degond, you will probably never want to go back to the old way of doing things by running programs in interactive sessions! (If you really do need to do things the old-fashioned way - if you need to use a program interactively, such as the graphical interfaces of Matlab, Maple, etc, or if you want to test your program before running it on the cluster - you can use the submission node kato to do this.)

Finally, since the Degond compute cluster is for you, the user, we have always had a policy of openness and accessibility, and you will find the Degond HPC is very user-oriented; additional software can be installed upon request to support individual users' requirements, and users are not constrained to using particular compilers or packages, for example. Also, this cluster does not impose the CPU or 'wall time' (that is, 'real' chronological time) limits that other HPC clusters do; the default maximum compute time is 100 days but can be made infinite - contrast this with the 48 or 72 hour limits of other HPC clusters at Imperial College. If your research needs are unusual, complex or difficult to implement, we want to hear about it, as in all probability the answer is "Yes, now what's the question?"

We hope you will enjoy using the job queueing system - no more having to log in at weekends to see if a job has finished or getting up in the night to start off another compute job!

What does the cluster consist of?

All of the nodes in the cluster are 64-bit systems with dual Xeon CPUs containing between 8 and 12 processor cores in each CPU, between 96 and 768 GB of memory, plus up to 120 TB of networked storage. The nodes run Ubuntu Linux 20.04 LTS (Long Term Support) server edition, with a comprehensive range of mathematical applications such as Matlab, Maple, R, etc installed, as well as development tools, libraries, compilers, editors, debuggers and so on; the software is a mix of Ubuntu packages (with Python modules now sourced directly from PyPI rather than from Ubuntu packages), locally-built bespoke software and user-built programs. What you will not find on these nodes are web browsers, office suites, music and movie players and games! These are not desktop systems and are intended purely for 'real' computing.

The cluster comprises:

  • 2 compute nodes lax and nash

  • 1 combined submission/compute node & cluster management server kato

All jobs are submitted only through the submission node, kato; direct access to individual compute nodes within the cluster from the college network is not possible.

The cluster job scheduling software is Torque in conjunction with the Maui scheduler.

This cluster currently does not have a dedicated storage server of its own but has access to the Maths storage servers clustor and clustor2, providing a large amount of fast local data storage for users, with clustor storage being the default. Other servers accessible from this cluster include the silos (silo1 to silo4), the calculus server and the College central ICNFS storage.

How do I access it?

Before you start, three important points:

First: from this point onwards it is assumed you have a Maths UNIX account with access to the Degond cluster specifically enabled, and some basic knowledge of using Linux on the command line; if not, head over here for a quick introduction.

Second: if you intend to test a program you have written, or a Matlab/Maple/R script, possibly with a view to submitting it to the cluster later, please use the submission server kato.

Third: if you want to use the job queueing system (as you should if you want to make the most of the cluster or use more than 30 minutes of CPU time) please make sure you have set up your ssh client to connect without a password and run the update-ssh-known-hosts script BEFORE you submit any jobs to the cluster. (If you don't do this, your job(s) might run but the queueing system won't be able to write output or error messages into your home directory and you will receive emails from the system explaining why your job has produced no obvious output. Your output will actually be saved in an emergency storage area accessible to the cluster administrator and can, if required, be copied or moved to another location on the cluster, such as a scratch disk, from which you can retrieve or delete your output).

To access the compute cluster, you simply need to log into kato using ssh, the Secure SHell communications protocol for remote shell access, using your college username and password. ssh client software is available for almost all computers - Linux, UNIX and Mac computers running OS X will already have ssh bundled with the operating system, but you will have to install an ssh client on Windows PCs, tablets and phones. You'll find all you need to know about ssh here, along with a link to download a free ssh client for Windows; the rest of these notes assume you have a working ssh client and know how to use it.

If you are connecting from another Linux or UNIX computer in Mathematics and your username on that system is the same as your college username, to log into kato you can omit your username from the ssh command and just type:

ssh kato

in a terminal, shell or xterm to connect to kato and log in. This is because Maths Linux systems 'know' that they are within the same local DNS domain and can simply refer to other systems by hostname rather than by their full Internet host address. If you are connecting from a system outside the Department of Mathematics, you will need to append the domain to the node's name as shown:


or ssh

If your username on the node you are connecting from is different from your college username, then you'll have to tell your ssh client to use your college username instead to connect to a cluster node; in this example, your college username might be jbloggs so you would type:


or ssh

If you are logging into the compute cluster from Windows, you can use PuTTY or SecureWinShell to do this, but do remember that you will always have to give the full Internet address of the computer you wish to connect to; you will not be able to use the short host name even on Windows PCs within the department, since Windows computers are not normally set up with any knowledge of their default, local DNS domain name (although this can be done if desired).

Once you have logged into kato, you can use the node just as you would any other Linux system with a screen, keyboard and an open terminal. In nearly all cases you will want to run a compute job on one of the queues, but if you instead want to create a Matlab or Maple script, or write and compile a program in C, Fortran or Python, etc to be run on the cluster later, you can also do this on kato. You do these tasks in exactly the same way as you usually do on a 'local' Linux system, and then when you want to run the program to test it, you can either start it as a normal 'foreground' task and remain logged in for as long as you like, or start it as a 'background' job and then continue doing other things on the cluster system or log out altogether and leave your program running. Assuming your executable program is called myprogram, and that you are already in the directory where myprogram resides, typing:

./myprogram

will start the program running and will keep your login terminal session attached to the job you are running - you will have full interactive control over the program and will usually be able to terminate it by typing ctrl-C or ctrl-\ - but if you disconnect from the compute cluster by ending your ssh connection, your program will stop running. But if you append & to your program start command:

./myprogram &

you will detach your program from your controlling shell and put it into the background; your shell command prompt will return and you can either do other things - even start different programs and put them into background mode - or log out, leaving all of your program(s) still running.
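If you do plan to log out and leave a background job running, a more robust variant is to combine & with nohup and redirect the output to a log file. This is a minimal sketch, using sleep as a hypothetical stand-in for your own program:

```shell
# A sketch of detaching a long-running job: 'sleep 5' stands in for your own
# executable (./myprogram). nohup makes the job ignore the hangup signal sent
# when your ssh session ends, and its output is captured in a log file.
nohup sleep 5 > myjob.log 2>&1 &
echo "background job started with PID $!"
```

You can then log out straight away; when you log back in, myjob.log will contain anything the program printed.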

Where are my files?

Your files and folders are in your cluster home directory, which is the directory you are in when you log into the cluster. By default, your cluster home directory uses storage on the ICNFS service, the college's central Linux home directory server, which provides the Linux equivalent of your Windows H: drive; this is also where your program output will be stored. ICNFS storage is created when a Maths UNIX account is set up for you and is mounted on each of the Maths compute cluster nodes. If you don't already have a Maths UNIX account, please ask for one to be set up for you - without it, you will not have a Linux home directory, nor will you be able to log into or use the cluster. You also have additional storage on both clustor and clustor2, which are large fileservers in Maths, and you can opt for your cluster home directory to use either of these instead of ICNFS.

Your ICNFS home directory is subject to a disk usage quota, or simply 'the quota'. This is the amount of space you have been allocated on the ICT ICNFS server and varies from user to user, although most users have the default 400 MB quota. To find out how much quota you have been assigned, how much you have used and how much you have left, log into kato and type:

quota -Qs

You should see something like this:

andy@kato:~ $ quota -Qs
Disk quotas for user andy (uid 24050): 
     Filesystem   space   quota   limit   grace   files   quota   limit   grace
                  3980M   4883M   5500M           43523       0       0  

The column headed 'space' is the amount of disk space you have used in your ICNFS home directory while the columns 'quota' and 'limit' denote the 'soft' and 'hard' quotas respectively. The hard quota (the value under 'limit') is the absolute maximum amount of space you can use on ICNFS and cannot be exceeded. The soft quota is typically set to be 90% of the hard quota and on some systems (especially ICT-built Linux systems) you will receive a warning as soon as you log in if you have reached or exceeded your soft quota. However, as far as the Maths compute cluster is concerned, the soft quota has no meaning and doesn't actually do anything - only the hard quota is relevant.
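If you want to check your usage from a script, the relevant columns can be picked out with awk. This is a sketch based on the sample output shown above; the field positions are an assumption taken from that sample, and on the cluster you would pipe 'quota -Qs' itself into the awk command rather than the here-document used here:

```shell
# Pull the used space ($1) and hard limit ($3) out of the third line of the
# sample 'quota -Qs' output shown above
awk 'NR==3 { print "used " $1 ", hard limit " $3 }' <<'EOF'
Disk quotas for user andy (uid 24050):
     Filesystem   space   quota   limit   grace   files   quota   limit   grace
                  3980M   4883M   5500M           43523       0       0
EOF
```

This prints "used 3980M, hard limit 5500M" for the sample above.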

Although ICNFS-backed storage satisfies most users' needs, it's not particularly well-suited to HPC work owing to limited capacity and relatively slow read/write performance; if you plan on working with a lot of data or want your jobs to run quickly, consider using clustor storage, which is much faster and has no disk quotas imposed.

Introducing job queuing

Cluster job management used to be regarded with deep suspicion by users, in much the same way that workforces are often distrustful of management - what could be simpler than logging into a computer and running your program? True, but what if everyone logged into the biggest and fastest systems available and ran their jobs there? Those systems would slow to a crawl, memory and eventually disk swap space would be used up, and the system would crash with the loss of everyone's output, while slower systems remained up and running, but idle.

Job management avoids this by putting all the compute servers into a logical pool and automatically routing jobs to individual nodes that have the resources to run them. If the resources your job needs are not currently available on any node in the pool, the job is held in a queue - that is, it is deferred and run later when resources do become available, which may be minutes or hours later. Optionally, you can choose to receive an email from the system letting you know when your job actually starts and when it finishes, and even be notified if it is aborted owing to some problem.

Queueing jobs has other advantages - you do not have to submit jobs to the cluster one at a time and wait for each to finish before submitting the next; you can submit as many as you like at the same time (up to a maximum of 3500), or whenever you wish, even if you already have jobs running or queued. Think of the cluster as a corn mill - a giant hopper is filled with tons of grain, but a small, steady stream of grain runs out of the bottom onto the millstones. The same goes for the queueing system: you put a few hundred or even one or two thousand jobs into the queue and walk away, knowing that eventually all of these jobs will be executed. Some users take advantage of this to load up lots of jobs on a Friday afternoon, go away for the weekend and come back on Monday to find all their jobs completed; the Christmas vacation is another popular time for pouring jobs into the system - 269,000 jobs were put into the system by one user alone just before Christmas 2013 (although the queue is now limited to 3500 jobs).

Fairness of resource usage amongst users is maintained through user resource limits - most users are limited to running (currently) 30 simultaneous jobs at any one time but you can have up to 3500 jobs in the queue waiting to be run. A few users have higher simultaneous job limits by special arrangement but there are no other user-specific limits and all users of the Maths compute cluster are treated equally.

The simultaneous job limit is kept under review and does vary from time to time - if the cluster is not especially busy and if only about 50% of the available resources are used, this limit is raised to allow more jobs to be run. But if there are a lot of users using it, then the limit is lowered to 20 or even less in the interests of fairness to all users.

Introducing the queues

Currently just one queue - the 'standard' queue - is available on this cluster; the vast majority of jobs can be run on this queue, which is the default if you don't specify a queue when you use the system. Additional queues will be added once we have a good idea of how this cluster is being used compared with, say, the main Maths NextGen cluster.

From time to time, various additional queues are sometimes created for experimental purposes or for an individual user's use and these will show up if you use the 'qstat -q' command to see the queue status. However, only the standard queue described below is officially available and fully supported and use of any other queues you may stumble across is not encouraged!

The standard queue

The standard queue has the following resources:

number of nodes: 1
number of CPU cores: 1
memory available: 1930 MBytes
CPU time available: unlimited
wall time available: unlimited


  • a node means a single computer, such as lax or nash

  • a CPU core means one of the cores in each physical CPU (there are two physical CPUs with 4 cores each in most of the nodes in the cluster so there are a total of 8 cores)

  • the minimum amount of memory available in a cluster node is 96 GB which means each core has 2 GB (2048 MB) available but we need to allow for memory used by the operating system and queue management systems so 1930 MB has been found to be a safe allocation per core.

  • CPU time is the actual amount of time that the CPU core spends executing your job - the CPU core has a lot of other things to do besides running users' jobs, some of which are time-critical and cannot be delayed until later. So the CPU uses time slicing, where some of its time is allocated to various other tasks and the rest is available to your job.

  • wall time literally means 'real' time as would be displayed by an ordinary clock on the wall! Wall time is always greater than CPU time, as no CPU core will be able to devote all of its time (aka processor cycles) to running your job. For example, you might have a job that uses exactly 1 hour (3600 seconds) of CPU time, but because the CPU core has to spend, say, 2% of its time writing your data out to a file and doing a few other system tasks, the actual time taken as measured by a chronological timepiece (an ordinary clock) would be 3672 seconds, or 1 hour, 1 minute and 12 seconds.
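The arithmetic in that example can be checked directly in the shell: 3600 seconds of CPU time plus 2% overhead gives the wall time quoted above.

```shell
# Wall time for 3600 s of CPU time plus 2% overhead, using shell integer
# arithmetic, then broken down into hours, minutes and seconds
cpu=3600
wall=$(( cpu + cpu * 2 / 100 ))
printf '%d seconds = %dh %dm %ds\n' "$wall" $(( wall / 3600 )) $(( wall % 3600 / 60 )) $(( wall % 60 ))
# prints: 3672 seconds = 1h 1m 12s
```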

So in the standard queue, these resources mean that your job can run on only one computer in the cluster, using just 1 CPU core and no more than 2 GBytes of memory, but there is no limit on how long that job is allowed to run in terms of hours, days or weeks, and no limit on the amount of CPU core time either. These resource limits are both the default and the maximum - when we gain more experience of the system and get a better idea of how users are using it and the resources required, we will separate the default and maximum resource settings so that a user can submit a job without a resource specification and have it run using the default resources, or alternatively request more resources, up to but not exceeding the maximum limits, when submitting the job.

Special queues

From time to time special queues are added to the cluster to meet a particular user's requirements or to resolve an issue. These will appear in a 'qstat -q' listing but are not necessarily meant for you to use since without knowledge of how they are configured, the result of using unknown queues will be unpredictable.

Before you start using it...

Setting up SSH to connect without a password

One thing you need to do - if you haven't already done so - is to set up an SSH private/public key pair so that the queueing system can automatically copy your jobs from one node to another without having to prompt you for a password every time. This is because the queueing system decides which node is best to run your job on - which will not be the submission node you are currently logged into - and it uses scp (SecureCoPy) to transfer jobs to another node. Once you have created a key pair, you'll wonder why you never did this before, as being able to connect to large numbers of systems without having to type in passwords every time saves so much time and also makes possible a lot of completely automated and/or unattended tasks.

Creating an SSH key pair is easy using the ssh-keygen utility - it has quite a lot of options, but by default it will create keys suitable for most users (a 2048-bit RSA key pair for use with ssh protocol 2 connections). But do note that you must create a keypair with no passphrase; if you specify a passphrase, it defeats the whole object of the private/public keypair scheme, as you'll then be prompted for the passphrase instead of the password! So in the example ssh-keygen session shown below, simply hit return both times you are prompted for a passphrase and it will create a keypair that does not use or require a passphrase:

andy@kato:~ $ ssh-keygen 
Generating public/private rsa key pair.
Enter file in which to save the key (/home/ma/a/andy/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/ma/a/andy/.ssh/id_rsa.
Your public key has been saved in /home/ma/a/andy/.ssh/
The key fingerprint is:

This will create two keys in your .ssh folder:

andy@kato:~ $ ls -l .ssh
total 64
-rw------- 1 andy mastaff   883 Nov 10 14:50 id_rsa
-rw-r--r-- 1 andy mastaff   222 Nov 10 14:50

The id_rsa key is your private key, which you should look after and never give to anyone else; note that its permissions are such that only you can read or change it, and if you relax these permissions in any way the key becomes insecure. Most ssh clients (including the one on the compute cluster nodes) will warn you, or even prevent you from using the key, until the permissions have been set correctly. On the other hand, your public key can be read by anyone - this is intentional, as otherwise the remote system would not be able to use your public key.

The private key, id_rsa, needs to be in your .ssh folder on the computer you are making the ssh connection from while the public key must be in your .ssh folder on the computer you want to connect to. But you won't actually have to worry about copying public keys to other computers because for most compute cluster users whenever you log into another node in the cluster, you will be using exactly the same home directory containing the same .ssh folder and its keys.

Once you have created your key pair, copy your public key to a file called authorized_keys:

andy@kato:~ $ cd .ssh
andy@kato:~/.ssh $ cp id_rsa.pub authorized_keys

Now make an ssh connection to another node in the compute cluster - if you have not connected to that node before, you'll be asked whether you want to continue, as your ssh client will not yet know anything about that node. Once you have answered 'yes' to confirm the connection, you will be logged in without being asked for your password. Magic!
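For reference, the whole key setup can also be done non-interactively in one go. This is just a sketch of the same steps described above, safe to re-run because it skips key generation if a key already exists:

```shell
# Non-interactive keypair setup: -N '' gives an empty passphrase, -q suppresses
# the banner output; the public key is then appended to authorized_keys
mkdir -p ~/.ssh && chmod 700 ~/.ssh
[ -f ~/.ssh/id_rsa ] || ssh-keygen -q -t rsa -b 2048 -N '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```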

Your ssh client will store the ssh host key presented by the remote system, along with some other bits of information relating to the connection, in a file called 'known_hosts' in the .ssh folder in your home directory. The next time you connect to that system, you will not be prompted again as the remote system is now known to you and you will be logged in straight away.

If you haven't already done so, you'll need to make at least one connection to each of the other nodes before using the job queueing system otherwise you may receive reports that jobs cannot be transferred to another node before execution or that the queueing system cannot copy the output and/or error files to your home directory on another node when your job has completed. A script called update-ssh-known-hosts has been provided which will automatically log into every node for you and store that node's public host key in your ~/.ssh/known_hosts file. From a terminal (shell) prompt type:

update-ssh-known-hosts
How do I submit a job to the cluster?

Jobs are submitted by first writing a small 'wrapper' shell script known as the 'submission script' and then using the qsub command to submit it. This script contains:

  • queueing system control directives - these always start with the characters #PBS at the beginning of the line and tell the queueing system to do certain things

  • standard shell commands to change directories, run your own shell scripts or run programs

  • comments which are preceded by the usual hash (pound) characters used to denote a comment line in a shell script; qsub will ignore these comments

Here is a sample submission script:

andy@kato:~/pbs_tests $ cat queue.test4
#!/bin/bash
#PBS -N R_job
#PBS -m be
#PBS -q standard

cd ${HOME}/pbs_tests
/usr/bin/R --vanilla < test.r > test_r.out

This qsub wrapper starts with the usual shell interpreter definition in the first line (/bin/bash, /bin/csh or /bin/tcsh according to your preference - most users prefer the bash shell, which is the default on Maths systems).

The next line contains the optional qsub 'N' option which allows you to give your job a meaningful name

#PBS -N R_job
On line 3, the cryptic directive
#PBS -m be
enables a very useful feature of the queueing system - it tells the system to send you an email when the job execution is started (that's what the 'b' for 'begin' stands for) and another email when it ends (which is what the 'e' is for). A third optional parameter for the -m mail option is 'a', which means an email is sent to you if the job is aborted.

Finally on line 4,
#PBS -q standard
tells the queueing system to run your job on the standard queue.

Commands for running your job itself then follow the qsub PBS directives; in this example, we first change to the user's working directory, which in this example is the subdirectory of the home directory called 'pbs_tests', with cd $HOME/pbs_tests. It is a good idea to specify this explicitly in a qsub script, as the queueing system runs jobs on your behalf and itself has no knowledge of your home directory, nor does it know which directory you invoked the qsub submission from, or about paths relative to your home directory, for example.
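As an aside, Torque also passes the directory you invoked qsub from into the job's environment as the variable PBS_O_WORKDIR, so a submission script can avoid hard-coding the path. A sketch of the alternative opening line:

```shell
# Instead of 'cd ${HOME}/pbs_tests', change to wherever qsub was invoked from;
# Torque sets PBS_O_WORKDIR in the job's environment when the job runs
cd "${PBS_O_WORKDIR}"
```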

Finally, the program execution command, along with all of its arguments, comes at the end of the script - in this case to run an R program that takes its input from the script 'test.r' and stores the output in the file 'test_r.out'. Once you have written your wrapper script, you can run it with qsub like this:

qsub my-wrapper-script

Checking on your jobs

You can check on the status of your jobs on the cluster using the Torque and Maui commands. For example, the rather basic 'qstat' command:
andy@kato:~/pbs_tests $ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1.kato.degond  time-waster_job  andy            0        Q standard          
2.kato.degond  examplejob       andy            0        Q standard          
3.kato.degond  recursive-ls     andy            0        Q standard

This example shows 3 jobs queued to the standard queue from the node called kato - note that the column headed 'S' (the status column) shows all jobs as 'Q', meaning they are queued but not yet running.

Another important item in qstat's output is the Job ID - this shows the job as a number followed by a period and the node it has been submitted from. You will often need to know the Job ID when working with the queue system as it identifies your job(s); if you are working on just one node or if you only have one job with a given Job ID number, then you can omit the node identifier and simply refer to the job by number.

qstat will only produce output if jobs are waiting in the queue or are running; there is no output if the jobs have already terminated or if your submission to the queue was unsuccessful for some reason. When a job you have submitted is actually running, the status column in the qstat display changes to 'R', meaning the job is currently executing.
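Because the status letters sit in a fixed column, it is easy to summarise your jobs at a glance by tallying the 'S' column. A sketch that counts running and queued jobs, fed here with sample lines from this document so the counting is reproducible (on the cluster you would pipe qstat itself into the awk command):

```shell
# Tally the status column ($5) of a qstat listing: skip the two header lines,
# then count each status letter and report running (R) vs queued (Q)
awk 'NR > 2 && NF { count[$5]++ }
     END { printf "running: %d, queued: %d\n", count["R"], count["Q"] }' <<'EOF'
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
52.kato.degond   R_job            andy            00:00:23 R standard
1.kato.degond  time-waster_job  andy            0        Q standard
2.kato.degond  examplejob       andy            0        Q standard
EOF
```

For this sample input it prints "running: 1, queued: 2".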

andy@kato:~/pbs_tests $ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
52.kato.degond   R_job            andy            0        R standard

After a job has run for a while, the 'Time Use' column is updated:

andy@kato:~/pbs_tests $ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
52.kato.degond   R_job            andy            00:00:23 R standard

Invoking qstat with the '-f' option gives you a comprehensive report on the job along with some information about the standard queue and its default/maximum resource allocations:

andy@kato:~/pbs_tests $ qstat -f
Job Id:
    Job_Name = R_job
    Job_Owner = andy@kato
    job_state = R
    queue = standard
    server = ma-queue
    Checkpoint = u
    ctime = Fri Feb 12 12:01:35 2010
    Error_Path =
    exec_host = lax/0
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = be
    mtime = Fri Feb 12 12:01:36 2010
    Output_Path =
    Priority = 0
    qtime = Fri Feb 12 12:01:35 2010
    Rerunable = True
    Resource_List.mem = 2000mb
    Resource_List.ncpus = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    session_id = 1531
    Variable_List = PBS_O_HOME=/home/ma/a/andy,PBS_O_LANG=en_GB.UTF-8,
    etime = Fri Feb 12 12:01:35 2010
    submit_args = queue.test4
    start_time = Fri Feb 12 12:01:36 2010
    start_count = 1
    fault_tolerant = False

An alternative to the qstat command is showq, which is one of the Maui scheduler commands:

andy@kato:~/pbs_tests $ showq
ACTIVE JOBS--------------------

52                     andy    Running     1    00:00:00  Fri Feb 12 12:24:55

   1 Active Job        1 of  128 Processors Active (0.78%)
                       1 of  11  Nodes Active      (9.00%)

IDLE JOBS----------------------

0 Idle Jobs

BLOCKED JOBS----------------

Total Jobs: 1   Active Jobs: 1   Idle Jobs: 0   Blocked Jobs: 0

This display shows a global view of all the available nodes in the compute cluster as well as the 128 processor cores available, showing that Job ID 52 is using one processor on one node.

Both Torque and Maui provide a wealth of additional commands for querying, monitoring and analysing your jobs in the queue and a complete listing of these commands can be found here.

Controlling your jobs

Sometimes you might decide a job you have submitted to the cluster is not wanted after all or is unlikely to run properly owing to a coding error, etc. You can delete jobs that are queued and waiting to be run as well as jobs that are actually running as follows:

Deleting a job from the queue

You can remove a queued job from the queue if it has not yet been started with the qdel command, giving the Job ID as the argument:
andy@kato:~/pbs_tests $ qdel 52

This command executes silently with no confirmation or feedback unless you enter a non-existent Job ID. It goes without saying that you can only delete your own jobs, not those of other users!
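If you have many queued jobs to get rid of, the bare Job ID numbers can be extracted from a qstat listing and fed to qdel. A sketch using the sample listing from above; on the cluster, piping the awk output into 'xargs qdel' would delete every job listed, so try the extraction on its own first:

```shell
# Strip everything after the first '.' in the Job id column ($1) of a qstat
# listing, leaving just the ID numbers, one per line
awk 'NR > 2 && NF { sub(/\..*$/, "", $1); print $1 }' <<'EOF'
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1.kato.degond  time-waster_job  andy            0        Q standard
2.kato.degond  examplejob       andy            0        Q standard
3.kato.degond  recursive-ls     andy            0        Q standard
EOF
```

This prints 1, 2 and 3 on separate lines; `qstat | awk '...' | xargs qdel` would then remove them all.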

Cancelling a job

If you need to cancel a running job, you can do this with the canceljob utility, giving the Job ID as the argument:
andy@kato:~/pbs_tests $ canceljob 52

job '52' cancelled

How do I know when my job has started, completed or even been aborted?

If you have used the PBS -m option to tell the system that you want to be notified of start, abort and end events, you will receive emails from the system. If you have specified PBS -m be (with the begin and end parameters), you will receive an email like the following when the job starts:

Date: Fri, 12 Feb 2010 12:00:57 +0000
From: "" 
To: "Thomas, Andy D" 
Subject: PBS JOB 50.kato

PBS Job Id:
Job Name:   R_job
Exec host:
Begun execution

and when the job completes, another email is sent which contains some information about the resources used, which node(s) were utilised, the CPU time used and the wall time for the whole job, etc:

Date: Fri, 12 Feb 2010 12:01:08 +0000
From: "" 
To: "Thomas, Andy D" 
Subject: PBS JOB

PBS Job Id: 50.kato
Job Name:   R_job
Exec host:
Execution terminated

From this you can see that this very short job used 9 seconds of CPU time and, as expected, the 'wall time' was slightly longer.

The observant reader will note from these emails that although the job was submitted on kato, it was actually executed on the lax node. Notifications from the queuing system will always come from kato as this node is the queue management node on which the Torque/Maui queue server/scheduler pair are hosted; the other nodes in the cluster simply run the queue clients that communicate with the queue server & scheduler on kato.
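A wrapper script requesting these notifications might look like the following sketch; the job name, mail address and the R invocation are illustrative placeholders, not part of the cluster setup:

```shell
#!/bin/bash
# Illustrative submission script - the name, address and program
# below are examples only.
#PBS -N R_job
#PBS -m abe                 # mail on (a)bort, (b)egin and (e)nd
#PBS -M you@example.ac.uk   # where the notifications should be sent

cd "$PBS_O_WORKDIR"         # run from the directory the job was submitted in
R --vanilla < myscript.R
```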

Where is my output?

Even if you have specified a file in which to write the output (ie, the standard out) of your job as part of your submission script, the queueing system will always create two extra files in the directory from which you submitted the job. One is the 'o' (output) file, which captures your job's standard output, and the other, the 'e' (error) file, captures its standard error, including any messages written when your job fails to start properly. These two files will always be created for every job you run - if neither stream produces any output, the files will be zero length but they will still exist nonetheless, which you could use as an indicator (eg, for a monitoring script) that the job has completed.

The files will take the form: <Job Name>.o<Job ID> and <Job Name>.e<Job ID> respectively.

So if your job is called matlab with the full Job ID 54.kato, then remembering that the hostname part of the Job ID will be quietly dropped, when it completes you will find the file:

matlab.o54

in the directory you submitted the job from initially and it might contain the warning message:

Warning: No display specified.  You will not be able to display graphics on the screen.

                            < M A T L A B (R) >
                  Copyright 1984-2009 The MathWorks, Inc.
                Version (R2009a) 64-bit (glnxa64)
                             February 12, 2009

             To get started, type one of these: helpwin, helpdesk, or demo.
             For product information, visit

An example of an 'e' file would be R_job.e29 produced by an incorrectly specified invocation of R called, fittingly enough, 'R_job' with the Job ID of 29:

Fatal error: you must specify '--save', '--no-save' or '--vanilla'

So to recap: the 'e' file captures your job's standard error stream - in the example above, the R invocation is missing its '--vanilla' option, so R cannot start and its error message is written to the file 'R_job.e29'. The 'o' file captures your job's standard output, as in the 'matlab.o54' example (above), where Matlab warns that the '-nodisplay' option was missing when it was not started from an X terminal.
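Since the 'o' and 'e' files are guaranteed to exist once a job has finished, a simple completion check can be built on them; a sketch, in which the job name and numeric Job ID are illustrative:

```shell
# Completion check built on the guaranteed 'o' and 'e' files.
job_done() {
    # usage: job_done <job name> <numeric job id>
    # (the '.kato' host suffix is dropped from the filenames)
    [ -e "$1.o$2" ] && [ -e "$1.e$2" ]
}

# example: poll once a minute until job 29 of 'R_job' has finished
# while ! job_done R_job 29; do sleep 60; done
```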

What if my job goes wrong?

Unfortunately, quite a few things can go wrong so this section summarises some of the things that we have found out through trial & error as we gain more experience of the system.

I have submitted a job to qsub but nothing seems to happen

Sometimes a job will execute so quickly once you have submitted it using qsub that it completes before you have a chance to check on its status. If you have used the #PBS -m be option as suggested above, you will receive two emails in rapid succession. Another approach is to log into the kato server and use the Torque 'tracejob' command to backtrack through the various logs the system keeps of all users' jobs and present everything it knows about a given job:

andy@kato:~ $ tracejob -qalm 65
/var/spool/torque/server_priv/accounting/20100223: Permission denied
/var/spool/torque/mom_logs/20100223: No matching job records located
/var/spool/torque/sched_logs/20100223: No such file or directory

Job: 65.kato

02/23/2010 11:32:15  S    enqueuing into standard, state 1 hop 1
02/23/2010 11:32:15  S    Job Queued at request of
                , owner =
                , job name = matlab,
                          queue = standard
02/23/2010 11:32:16  S    Job Modified at request of root@kato
02/23/2010 11:32:16  S    Job Run at request of root@kato
02/23/2010 11:32:16  S    Job Modified at request of root@kato
02/23/2010 11:32:16  S    Exit_status=127 resources_used.cput=00:00:00
                          resources_used.mem=1480kb resources_used.vmem=97428kb
02/23/2010 11:32:25  S    Post job file processing error
02/23/2010 11:32:25  S    dequeuing from standard, state COMPLETE

The '-qalm' option tells tracejob to suppress all error messages and not to try looking into non-existent logs for information about a job. You need to run tracejob on kato since this is the queue management node on which the Torque/PBS server is hosted and the queueing system logs are kept on this system only.

By default tracejob will present information about your job if it was started today (ie, since midnight) as the Torque server logs roll over to a fresh log at midnight; if you want to look at previous jobs, use the '-n' option to specify how many days in the past you want to query, eg:
andy@kato:~ $ tracejob -n 4 37

will provide information on job ID 37 that you submitted to the queueing system any time within the past 4 days. If a job completes one or more days after the day it is started, the system will retrieve the logs from both the start day and the completion day and display the entire queued session even if it spans many days. tracejob defaults to '-n 1' if this option is not given so the smallest useful parameter is 2.

Exit status

Like almost all Linux tasks, every job executed by the Torque/Maui queueing system returns an exit status, an integer that indicates whether the job was successful or not - an exit status of 0 (zero) always means the task completed successfully and any other number indicates an error condition (which might not necessarily have been fatal). But to distinguish an exit status set by the queueing system itself from one set by the job you are running, all exit status values returned by the Torque queue server have a negative value while those returned by your program or job have the usual positive value.
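This convention can be seen in miniature in any shell, with nothing cluster-specific involved: the status a wrapper script hands back to Torque is simply that of the last command it runs, and positive values like these come from your own job:

```shell
# A job's exit status is that of the last command it runs; here we
# run a failing command and inspect its status.
sh -c 'exit 42' || status=$?   # the '|| ...' captures the failure safely
echo "previous command exited with status $status"
# prints: previous command exited with status 42
```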

If you have used a wrapper script containing a #PBS -m e directive to mail you when the job has completed, the mail message will contain the exit status as in this example:

Subject: PBS JOB 69.kato

PBS Job Id: 69.kato
Job Name:
Exec host:
Execution terminated

where you can see that the exit status is 134. As this is a positive integer, this is the exit status of your job and you will need to consult the documentation of your job's package (or your own code if you wrote it yourself) to find out what it means.

Alternatively, you can use the tracejob utility on the kato server to find out the exit status of a particular job as in the example above, where the exit_status is 127.

Is there any additional storage?

Yes! If you need more storage for your files, programs, experimental data, etc you can use the calculus storage server, one of the large data silos in Maths, where, just like the scratch disks on the main Maths NextGen cluster, you can create your own space. Your ICNFS home directory is also mounted on the Degond cluster, making it easy to transfer files & data to and from other systems in Maths and in the College.

More information

This documentation remains very much a work in progress, although the queueing system itself has now matured on the main Maths NextGen cluster, with well over 2 million jobs run since its launch in 2010, and more nodes and queues have since been added. The Torque/Maui queue management system is very complex, with a rich set of programs and utilities, many of which only come into play on very large cluster farms.

Extensive man documentation is installed on kato for the following Torque programs and utilities:

	chk_tree    printjob                pbsdsh      qhold   qselect
	hostn       printserverdb           pbsnodes    qmove   qsig
	nc-config   printtracking           qalter      qmsg    qstart
	ncdump      pbs_queue_attributes    qchkpt      qorder  qstat
	ncgen       pbs_resources           qdel        qrerun  qstop
	nqs2pbs     pbs_server              qdisable    qrls    qsub       
	pbs-config  pbs_track               qenable     qrun    tracejob
while the Maui scheduler provides a few commands intended for end users:

canceljob    checkjob     showbf     showq   showstart  showstats
You will find comprehensive information online at the Adaptive Computing site.

Andy Thomas

Research Computing Manager,
Department of Mathematics

last updated: 7.1.2022