Current status of Maths computing facilities...


## STOP PRESS ## Owing to the high temperatures anticipated in the London area over the next 4 days, there is a risk that some or most of Maths research IT services may have to be withdrawn and shut down at little or no notice if the cooling systems cannot cope with the server room temperatures.

Webservers/websites, fileservers, back-up/mirror servers and all infrastructure will continue operating but compute systems, which create most of the heat, are liable to be shut down.


August 10: NextGen HPC upgrade: 8 new nodes now available, 10 more available soon; Stats fallas HPC now upgraded & online

NextGen HPC upgrade: this has already started, and the 8 new Dell R630 nodes installed last week are now in use. Before submitting any new jobs, please log into macomp001 and type:

update-ssh-known-hosts

to automatically update the ssh host keys cached in your ~/.ssh/known_hosts file. You need to do this only once for now, although you will need to do it again later this month after the new macomp20-macomp29 nodes are installed.
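For illustration only, here is a minimal sketch of what clearing stale cached host keys by hand involves, using the standard OpenSSH command ssh-keygen -R. It is not the implementation of update-ssh-known-hosts, which remains the recommended route. The function names new_nodes and forget_nodes are hypothetical, and the sketch assumes the nodes are recorded in known_hosts under their short names.

```shell
#!/bin/sh
# Hypothetical manual fallback; update-ssh-known-hosts is the supported tool.

# Print the node names macomp20 .. macomp29 (range taken from the notice above).
new_nodes() {
    i=20
    while [ "$i" -le 29 ]; do
        echo "macomp$i"
        i=$((i + 1))
    done
}

# Remove any cached key for each of those nodes from a known_hosts file.
# Assumption: nodes are recorded under their short names.
forget_nodes() {
    known_hosts="${1:-$HOME/.ssh/known_hosts}"
    for host in $(new_nodes); do
        # ssh-keygen -R deletes all keys belonging to the named host;
        # errors (e.g. host not present) are ignored here.
        ssh-keygen -R "$host" -f "$known_hosts" >/dev/null 2>&1 || true
    done
}
```

Running forget_nodes with no argument acts on ~/.ssh/known_hosts. The next ssh connection to each node would then prompt you to accept its new host key.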

The HP C7000 blade system will be removed soon and 10 more R630 servers will take its place. The temporarily reduced capacity of this cluster should not affect you since there are still 464 processors available.

Stats fallas cluster: since the Bazooka Hadoop cluster was shut down for a rebuild and modernisation, the opportunity afforded by the resulting lower server room temperatures has been taken to start up the full Stats Fallas HPC and update it; it is now fully operational. It is likely to remain in operation since there is demand for this facility, while there is no current demand for the main Hadoop cluster, which will not be required for teaching Big Data/AI courses until the spring of next year.

Last updated: 16:16


July 23rd, 2022: known issues

Bazooka Hadoop cluster: this was shut down just before 5pm on Monday July 11th leaving just the teaching user node, athena.ma, still in operation since the cluster as a whole was not in use and was contributing to high temperatures in the server room. It will now remain out of service while it is rebuilt with the latest Linux and Hadoop ecosystems, with 8 more ex-NextGen cluster nodes being repurposed and added to it.

July 11th, 2022: known issues

LFC-UK cluster: this was shut down at 5:45pm today leaving landau.ma in operation, since none of the 8 blade servers had been used for some time and all were contributing to high temperatures in the server room.

July 11th: Huxley 616 server room temperatures

Despite high afternoon/evening temperatures in parts of Huxley 616 reaching 38 degrees C, we are aiming to keep all services running normally owing to the MSc project season. However, conditions in 616 are being monitored closely and any dangerous rise in rack or room temperatures, or a cooling failure, will lead to at least a partial shutdown of research IT services.

May 3rd: known issues

Huxley 616 server room cooling: in recent weeks this doesn't appear to be working as well as it used to with elevated top-of-rack temperatures and a slightly higher than normal overall room temperature. Due to the College's partial closure over the past 2 years owing to Covid, the air con systems had not received any routine maintenance since late 2019; this has now been undertaken but performance does not appear to be as good as it used to be. In the meantime non-essential systems have been powered off to keep the thermal load down.

February 1st, 2022: known issues

The compute system cfm-ic2 used by the Maths Finance section has been going into suspend mode shortly after being rebooted, which appears to be a hardware fault on its motherboard. This system is currently little used by Maths Finance so repair is not urgent; its sister machine cfm-ic1 is still available for use.

March 20, 2020: legacy known issues

As always, hardware faults can occur when systems that have been powered on for years are first powered off and then back on again a short while later, and the following systems were casualties of the scheduled power-downs on March 3rd last year:

  • fira: this elderly HP workstation in the Stats Linux cluster is awaiting two replacement disks, but since the new madul, midal and model large compute servers which were introduced in February and March effectively supersede it, repair of this seldom-used system is not a high priority.


Andy Thomas

Research Computing Manager,
Department of Mathematics

last updated: 10.8.2022