Long term planning: downsizing and relocation of the ICT data centre

August 22nd update:

The macompXX systems were moved as scheduled on August 10th. However, because a user's long-running compute job was still running on mablad09, we decided to delay moving the mabladXX blade systems until Monday 14th to give this job time to complete over the weekend.

Everything is now physically in place in the new rack in its new location, but we are still waiting for an IP address to be assigned to the new head node; once that is done, the software work originally planned for August 14th can be started.

Unfortunately, this delay is outside the Maths department's control, and we cannot say when the ICT-hosted nodes will finally be brought into service. It should not affect the cluster, however, since the Huxley-based part of the cluster is fully operational and will remain so until then.

August 2nd update:

Following a meeting with ICT in July, we have been offered space in a new, smaller server room in the City and Guilds (C&G) building, which will replace the present large data centre after most of the equipment is moved to the new Slough facility. The plan has therefore been revised: instead of moving 24 nodes from the C&G data centre to Huxley, we will move these nodes a short distance to the new facility and then, later in August, move 13 nodes plus the clustor home directory servers from Huxley to the C&G facility.

All the other details explained below will remain as before and the timetable is now as follows:

  • August 2nd: compute queues on macomp01 to macomp08 inclusive and all mabladXX nodes will be halted to prevent any new compute jobs being assigned to these nodes; existing jobs will be allowed to run to completion or until these nodes are shut down on August 9th

  • August 9th: nodes macomp01 to macomp08 inclusive and all mabladXX nodes will be taken offline at 10pm and any user jobs running on these nodes will be terminated

  • August 10th: overnight August 9/10th the operating systems on these nodes will be upgraded from Ubuntu 14.04 to 16.04 and they will be shut down around 6am Thursday prior to being moved later that day

  • August 14th: work begins on installing a dedicated cluster LAN, a new combined Torque/Maui queue management system and a local mirror server for Ubuntu software update and security repositories before upgrading the above nodes to Ubuntu 16.04

  • August 17th: from this date onwards it is expected the C&G part of the cluster will be operational. Users will be encouraged to migrate to using clustor storage only for their cluster home directories; access to ICNFS home directories will still be possible, but slower.

  • Later in August: the Huxley-based cluster nodes will be shut down, moved to the C&G server room and upgraded; more details to follow.

A reduced cluster service will be available throughout since the Huxley nodes will not be moved and upgraded until the C&G part of the cluster is fully operational.

June 21st update:

Even though we can't really have the good old-fashioned annual Factory Fortnight that used to be the norm in 20th century industrial Britain, when all the workers & management took their annual holidays at the same time in August, leaving factories clear for maintenance staff, fitters, machine tool engineers, painters, etc. to move in and carry out an annual re-fit, we are now seriously looking at August 2017 as the best time to enjoy(!) at least a partial shutdown. This is necessary so that we can close down the bulk of the compute cluster that is currently accommodated in the ICT data centre and move it to the Maths server room along with the few remaining private servers still in ICT, ahead of the data centre's eventual closure next year.

Demand for compute facilities always falls sharply at the end of the undergraduate term (29th June this year), with another abrupt fall in July as schools close for the summer. August also sees the start of the next financial year, which means more funds will be available to undertake major projects as well as purchase yet more equipment. The proposed timing of this move is therefore expected to have the least impact on users.

Almost half a tonne of kit needs to be moved across campus and reinstalled in Huxley 616 along with another new rack and a large UPS. During this relocation the compute cluster will be reduced to just 10 production nodes plus the 3 test & dev nodes. At the same time, the following changes are planned for the cluster:

  • an Ubuntu operating system upgrade from the present version 14.04 to 16.04

  • the transfer of all the production compute nodes (macomp01 and above, and all of the mabladXX nodes) from the general college network to a new dedicated cluster network as described below

The Ubuntu update is unlikely to affect most users except those who have compiled their own programs in C, C++, Fortran, Julia, etc. using bespoke libraries installed under /usr/local or /usr/local_machine. These programs may need to be recompiled to run under Ubuntu 16.04 owing to updated libraries and changed dependencies; if you are unsure whether you need to do this, you can already use macomp001 to test your programs under the Ubuntu 16.04 environment. Also, the cluster will not switch abruptly en masse from Ubuntu 14.04 to 16.04 - only the 24 nodes moved over from ICT will be upgraded while the 12 nodes already in the Maths server room will remain on 14.04 for a short period while any issues are resolved. Additional queues will be set up on the cluster so that you can choose whether to run your compute jobs under Ubuntu 14.04 or 16.04.
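One quick way to see whether an existing binary is likely to need recompiling is to run the standard Linux ldd tool on it on macomp001 and look for shared libraries it reports as "not found". The Python sketch below only illustrates that idea and is not an official procedure; the function names and the example path ./my_program are hypothetical.

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the shared libraries that ldd output marks as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved entries look like: "libfoo.so.2 => not found"
        if "=>" in line and "not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path):
    """Run ldd on a compiled program and list its unresolved libraries.

    'path' is whatever binary you built yourself; nothing here is
    specific to the cluster.
    """
    result = subprocess.run(
        ["ldd", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return missing_libraries(result.stdout)
```

For example, check_binary("./my_program") returning an empty list suggests every shared library the program was linked against can still be found; any names it does return point to libraries that have changed or gone, in which case a recompile is the likely fix. A clean result does not guarantee the program behaves identically, so a short test run is still worthwhile.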

On the other hand, moving production compute nodes to a dedicated internal network will affect all users. Direct access from the college network to individual production nodes such as macomp02 or mablad14 will no longer be possible; they will only be accessible through the test & dev nodes macomp00, macomp000 and macomp001, which will now also become 'head nodes' from which you can submit your jobs. The main reason for this change is explained below (see under Original Posting) but much improved cluster performance will also result.

Since your ICNFS and clustor home directories, the Maths silos and other storage servers are also accessible on the head nodes, there's no real need for you to access a specific compute node, except where a problem with your home directory has caused your job's output to be written into a temporary holding area on the node that executed your job. This is most likely to happen if you have exceeded your ICNFS disk usage quota, have not created your ssh public/private keypair or an up-to-date ~/.ssh/known_hosts file, have set incorrect permissions, or have lost or deleted files or folders. In these cases, access to individual nodes will still be possible by first logging into a head node and, from there, logging into a compute node via the new internal cluster network.
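The two-step login described above can be sketched in Python as follows. This is an illustration only: it assumes the short hostnames macomp001 and mablad14 resolve from wherever you run it, and the username abc123 and the function names are hypothetical. It simply asks ssh on your own machine to log into the head node and run a second ssh from there.

```python
import subprocess

# One of the head nodes named in this posting; adjust to the one you use.
HEAD_NODE = "macomp001"


def hop_command(user, compute_node, head_node=HEAD_NODE):
    """Build an ssh command that logs into head_node, then on to compute_node.

    '-t' requests a terminal on the head node so that the second,
    interactive ssh session works normally.
    """
    return ["ssh", "-t", "{}@{}".format(user, head_node), "ssh", compute_node]


def login_via_head_node(user, compute_node):
    """Open an interactive session on compute_node via the head node."""
    return subprocess.call(hop_command(user, compute_node))
```

Calling login_via_head_node("abc123", "mablad14") is equivalent to typing ssh -t abc123@macomp001 ssh mablad14 in a terminal. Running the second ssh on the head node, rather than tunnelling from your own machine, means the compute node's address only has to be reachable on the internal cluster network.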

This work along with other work planned for the server room is expected to take about a week.

If you have any comments on the proposed changes described above or on the proposed timing in August for these changes, or if you have any questions, please don't hesitate to contact me.

Original posting: 23rd March

Finally, we have started planning ahead for the eventual shutdown of the ICT data centre in the City & Guilds building over the next 18-24 months and its migration to a new facility in Slough. We have over 30 servers hosted in ICT - 24 of these are Maths compute cluster nodes which will probably move to Huxley 616 later this year along with a non-rackmount storage server, while the rest will move to Slough, with hardware upgrades where necessary to support full remote management including "bare metal" installs (the ability to install an operating system remotely from media in South Kensington without going anywhere near Slough, for example).

Some private rackmount systems currently in Huxley 616 may have to move to Slough, although this is not certain - if it's not possible to install an operating system remotely, or if the owner is unwilling to pay for upgrades to allow this to be done or to replace the server with one that has full remote media support, then that server will have to stay in the Maths server room. The main issue with accommodating more systems in the Maths server room is that it is not possible to install any more college network connections on Huxley level 6, since the racks in the network wiring cupboard opposite the south lifts are now full. So an influx of another 25 systems into Huxley 616 will potentially be a problem.

For this reason we will have to move nearly all Maths compute cluster nodes onto a private dedicated network, which means direct access to a particular node from the college network will no longer be possible, although it would still be accessible from a designated head node. This is the norm on HPC clusters and should not inconvenience users that much. It also has the real advantage of faster cluster performance, since cluster network traffic will not be mixed with non-compute traffic on the general college network.

Andy Thomas

Research Computing Manager,
Department of Mathematics

last updated: 22.8.2017