Maths compute cluster shutdown and relocation

Latest: November 16th: the cluster is now back up and running with all 35 nodes.

NextGen is now back up and running after being moved, rebuilt and upgraded. A failed node (mablad06) has been replaced, and various other repairs have been undertaken on the HP C7000 blade system.

November 6th: the cluster is now back up and running with a reduced set of nodes.

NextGen is now back up and running after being moved, rebuilt and upgraded, although the 16 mablad nodes are currently offline owing to various hardware and software issues. For the time being, the 19 macomp nodes will provide cluster facilities on a reduced scale. This is not expected to be a major restriction, since the older mablad nodes mostly supported the standard queue for jobs with modest resource needs.

The cluster was relocated on Friday November 1st and partially recabled on Saturday, and clustor2 storage was accessible again later that day. Work was suspended on Sunday pending a delivery of cables, which arrived on Monday, when the recabling was completed. Following operating system & package upgrades, the cluster is now back up and operational. Once a replacement server blade for the mablad node pool arrives, it will be fitted and the remaining issues with the other mablad blades resolved at the same time.

After the mablad nodes are back in service, the opportunity will be taken to replace the macomp001 head node with a newer, much faster server with 12 x 2.93 GHz processor cores and considerably more memory (48 GB). Although macomp001 was intended primarily as a job submission/head node, users tend to camp out on it for long periods to test resource-intensive code intended to be run on the cluster. The existing macomp001 has a minimal specification because it was never meant to be used in this way.

NextGen is shutting down on November 1st in readiness for relocation

The Maths NextGen compute cluster will be shut down on November 1st prior to being dismantled and moved to another part of ICT's data centre in the City & Guilds building.

This shutdown and move was announced back in March and was meant to take place before the end of September. However, owing to the usual slippages and delays in building works elsewhere in the building, the move must now be completed before November 7th, when the present site is due to be handed over to the building contractors.

Why is the cluster being moved?

Owing to a change of plans for space usage in the City & Guilds building (formerly known as the MechEng building) on the Exhibition Road side of the South Kensington campus, a further contraction of the ICT data centre is necessary. This follows a major upheaval in the summer of 2017, when half of the data centre was closed down and moved to Slough. The area where the cluster is presently situated is being given up by ICT and is due to be refurbished as laboratory accommodation.

As a result of the 2017 reorganisation, the parts of the Maths cluster hosted in the now-closed section of the data centre were moved to the cluster's present location, and the part that was in the Huxley server room then joined them, giving us an integrated cluster in one place. This consolidation was done for performance reasons. Now the cluster is being moved to a part of the building that ICT will continue to use for the foreseeable future.

When will the cluster be ready for use again?

We are hoping to complete the physical move on Friday November 1st, immediately after the cluster is shut down, and then plan to work through the weekend to rebuild it: installing new cabling and upgrading all of the compute, storage and cluster management servers along with the applications and packages. Following basic testing, the aim is to have at least some of the cluster back up & running ready for Monday 4th. I say 'some of the cluster' because, inevitably with systems that have been running non-stop for years, hardware failures can and do occur after systems are powered off for a while, moved and powered on again. A few failed nodes are therefore to be expected, but hopefully nothing show-stopping.

Since there is a lot of bespoke software installed on the cluster that will not be upgraded automatically along with the operating system, there is a possibility that some of these custom applications might not function properly after the main upgrade has been completed. These include some R applications & libraries and some software built locally from source code at users' request. Such issues will be dealt with once the cluster is back up & running with services restored for the large majority of cluster users.

Any questions?

As always, I'll be happy to answer any questions you may have.

Andy Thomas

Research Computing Manager,
Department of Mathematics

last updated: 16.11.2019