Current status of Maths computing facilities...

July 22nd: voluntary reduction of computing workloads during heatwave, nvidia1 out of service, all other systems and services operational

Owing to the high temperatures currently being experienced in the UK, coupled with heavy computational workloads across many systems, temperatures in the server racks in the Huxley server room have risen. Heavy users have therefore been asked to voluntarily reduce their compute workloads to lower the power consumption, and hence the heat produced, by these servers. We apologise for having to do this, but the combination of a larger than usual MSc cohort this year, keen on parallelising their projects on 4 CPU/64 core and multi-GPU systems, and the unusually hot weather is pushing cooling capacity to its limits. Users are being asked to reduce their CPU demands until at least the weekend, when the weather is expected to return to more normal temperatures.

There are currently no other known problems with Maths systems and services affecting users, although a couple of servers need replacement RAID backup batteries, which are now on order.

July 20th: known issues

Current known problems are listed below:

Owing to a user persistently overloading nvidia1 with non-GPU computations, this system crashed on June 18th. Although it is now back up again, it can no longer communicate with the three GPUs it hosts because of a low-level failure between the server and the GPU card cage. This is likely to take some time to resolve, but it is not a high-priority failure: the system is fitted with 2nd generation GPU cards and is now seldom used, so the fault will be investigated later.

March 20th, 2020: legacy known issues

As always, hardware faults can occur when systems that have been powered on for years are powered off and then back on again a short while later, and the following systems were casualties of the scheduled power-downs on March 3rd last year:

  • fira: this elderly HP workstation in the Stats Linux cluster is awaiting two replacement disks, but since the new madul, midal and model large compute servers, which were introduced in February and March, effectively supersede it, repair of this seldom-used system is not a high priority.

Andy Thomas

Research Computing Manager,
Department of Mathematics

last updated: 22.7.2021