The Next Generation Maths Compute Cluster


Overview

The next generation Maths compute cluster is now ready for use and those of you who are familiar with the existing Maths compute cluster will find using it much the same as before. The main differences are:

  • the cluster now uses a dedicated internal network to link all the nodes together, instead of each node having its own connection to the college network and sharing bandwidth with general college traffic

  • existing node hardware and node names (macomp01, mablad01, etc) have been retained internally but access to the cluster to submit jobs is now through a single head node called macomp001 - direct access to individual nodes from the college network is no longer possible

  • the operating system is the latest Ubuntu Linux 16.04 LTS Server release and, similarly, the maths applications, C and Fortran compilers, etc are the latest available. Cluster utilities now live under the standard /usr/local location, with the /usr/local_machine area used solely for user-contributed software

  • a new dedicated cluster control server called openpbs hosts the latest Torque cluster management software version 6.1.1.1, as opposed to the version 2.4.2 used by the existing cluster, which dates from 2009

What queues are available?

The most popular queue with users is the standard queue, with 1 processor core and 1930 MB of memory, and this is currently the only queue available. This is primarily because the macompXX and mabladXX nodes in the new cluster don't have the CPU and memory resources to support some of the fat queues offered by the newer servers in the old cluster, but also because it is possible that older user programs will not run on the new cluster unless they are modified, updated or recompiled; troubleshooting these potential problems (which may never actually arise!) is much easier on simple single-processor queues than on an 8 core/16 thread parallel queue using all the cores present in a node.
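
For reference, a minimal Torque job script requesting these single-core resources might look something like the sketch below. The queue name "standard", the job name and the program name are only illustrative assumptions based on the figures above, so adjust them to suit your own work (man qsub describes the directives):

#!/bin/bash
# example job script - the queue name and program name below are placeholders
#PBS -q standard
#PBS -l nodes=1:ppn=1
#PBS -l mem=1930mb
#PBS -N myjob

# run from the directory the job was submitted from
cd $PBS_O_WORKDIR
./my_program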

Once we are sure there are no issues with legacy user-compiled programs on the new cluster, the medium, jumbo and matlabpar4/6/8 queues will be added. The existing larger and specialist queues will migrate to the new cluster when the more powerful nodes from the existing cluster are upgraded and moved across.

How do I submit a compute job to the new cluster?

In exactly the same way as you do with the existing cluster, except that you simply log into macomp001 to submit jobs rather than assorted nodes such as macomp07 or mablad11, etc. Note: do remember to run the update-ssh-known-hosts script before you submit your first job to this cluster.
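
Putting that together, a typical first session on the new cluster might look something like the following (the job script name is just an example - use your own):

ssh macomp001
update-ssh-known-hosts
qsub myjob.pbs
qstat -u yourusername

qsub prints the job ID on submission and qstat shows the state of your queued and running jobs, exactly as on the existing cluster.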

Has anything else changed?

Apart from faster cluster performance owing to the dedicated network and the enhanced facilities of more up-to-date applications, about the only other difference you may notice is that emails from the cluster telling you about events such as jobs starting, ending or being aborted will now come from root@openpbs.ma.ic.ac.uk instead of root@torque.ma.ic.ac.uk.

Also, the Torque web interface is still in the process of being migrated to the new cluster, where it will be accessible at http://openpbs.ma.ic.ac.uk. Since this does not affect your use of the cluster for running jobs, it was decided to make the new cluster available as soon as possible, with the web enhancements to follow.

Can I still retrieve job output directly from an execute node if I run out of space on ICNFS?

Yes you can - just log into macomp001 and from there log into the node where your output is stored. For example, you may get an email from the cluster telling you that there was insufficient disk space on ICNFS to save the output of your compute job so it has been stored on, say, macomp07; to recover it, log into macomp001 and from there type:

ssh macomp07

to connect to macomp07 via the cluster's internal network. Now you can copy, compress or delete the stored output, or use scp, sftp, rsync, etc to transfer it off the cluster to another computer where you have more space. You'll find that you have access from the cluster's nodes to systems on the college network or even outside the college, but not the other way round.
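
As a rough illustration, once logged into the node holding your output you could copy it to another machine with scp or rsync along these lines (the directory name and destination host are placeholders, not real paths):

scp -r ~/job_output yourusername@somehost.ma.ic.ac.uk:/path/with/more/space/
rsync -av ~/job_output/ yourusername@somehost.ma.ic.ac.uk:/path/with/more/space/

rsync has the advantage that it can be re-run to resume an interrupted transfer without copying everything again.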

Longer term, when the remaining nodes from the existing cluster are migrated, use of ICNFS for cluster storage will eventually be discontinued in favour of our own larger and more reliable storage servers.

Why have we made these changes to the Maths compute facilities?

The existing cluster, which has been in operation since 2010, has its roots in a collection of desktop PCs (the German City Cluster, as it was then called) set up in about 2003 and locked in a room, where anyone could log into any of them remotely via ssh and use them for running compute jobs. This environment was very much a wild "dog eat dog" free-for-all, with PCs regularly crashing since users did not realise (or did not care) that resources were finite and that other users were using them at the same time.

Torque/Maui job management was eventually introduced to bring clustering and some limited control to these PCs and tame the worst user abuses but at the time, owing to the lack of general purpose compute systems in the department, the ability to log into a PC and just do ad hoc interactive computing as before had to be retained. The cluster became a hybrid solution that catered for this: interactive use was still allowed so long as no single job used more than 30 minutes of CPU time on a given PC. Even so, some users (chiefly new users) regularly abused this and, over time, quite a lot of software was added to police it, detecting offending jobs and killing them if they exceeded 30 minutes of run time.

Another issue with the existing cluster is limited network bandwidth. Each PC (now called a 'node', in line with accepted cluster terminology) has a single college network connection over which all of its data traffic travels. As the cluster grew and became more distributed, with nodes accessing data stored on other nodes, remote fileservers across campus, etc, and having to share the network with an increasingly busy Maths departmental network, network congestion began to slow down cluster performance. At the same time, owing to the continuing expansion of Maths compute facilities, we were forever ordering additional college network sockets at a cost of over £100 each, whereas installing a local dedicated network for the cluster ourselves is far cheaper, provides much better cluster performance and frees up college network connections for other systems.

Now, with around 200 servers in the department, and with departmental sections, groups and even individuals having their own clusters, servers or powerful workstations, the need to support traditional ad hoc interactive computing has largely gone. With the experience gained from running the existing Maths compute cluster for almost 8 years, we felt it was time to migrate it to a full HPC cluster in stages, hence the new cluster. The timing coincides with some major upheavals in the ICT data centre and with the extended summer vacation, when cluster demand tends to be low. Interruption or loss of cluster service during the migration has been avoided and, in the coming weeks, the remaining nodes in the existing cluster will be moved and upgraded to join the new next generation cluster.



Andy Thomas

Research Computing Manager,
Department of Mathematics

last updated: 04.10.2017