Current status of Maths computing facilities...

July 6: cfm-ic2 out of service, model O/S upgrade completed, all other systems and services operational despite higher than normal server room temperatures

June 23rd: model upgrade

The compute server model, part of the Stats section's M-series group of machines, was upgraded to Ubuntu 20.04 early today, primarily to meet a user's need for a later version of python3 on this system. The upgrade was carried out at very short notice, without the usual minimum of 1 week's notice; model was chosen because it was idle at the time. The R packages and the PyPI python packages have since been upgraded as well.

June 18th: Huxley 616 server room temperatures

Despite afternoon/evening temperatures in parts of Huxley 616 reaching 37 degrees C, we are aiming to keep all services running normally owing to the MSc project season. However, conditions in 616 are being monitored closely and any dangerous rise in rack or room temperatures, or a cooling failure, will lead to at least a partial shutdown of research IT services.

June 16th: nvidia3 upgrade

nvidia3 is being upgraded today to the same specification as nvidia4 following requests from this year's MSc students for more up-to-date software packages that cannot be installed under Ubuntu 18.04. This is being carried out at very short notice, without the usual minimum of 1 week's notice, since time is running out for MSc project completion. On completion, nvidia3 will be running Ubuntu 20.04, CUDA 11.4 and cuDNN 8.
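For users who want to confirm which Ubuntu release a server is running after an upgrade like this one, the following is a minimal sketch only. It assumes the standard /etc/os-release file is present; the function names are illustrative and not part of any departmental tooling, and it does not check the CUDA or cuDNN versions.

```python
def parse_os_release(text):
    """Turn the KEY=value lines of an os-release file into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Values may be wrapped in single or double quotes.
        info[key] = value.strip().strip("\"'")
    return info


def is_ubuntu_release(text, wanted="20.04"):
    """Return True if the os-release text describes Ubuntu <wanted>."""
    info = parse_os_release(text)
    return info.get("ID") == "ubuntu" and info.get("VERSION_ID") == wanted


def read_os_release(path="/etc/os-release"):
    """Read the os-release file (the path is an assumption; adjust if needed)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

Calling is_ubuntu_release(read_os_release()) on the server in question should return True once it is running Ubuntu 20.04.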

June 15th: known issues

nvidia1: problems with this triple GPU server, which originally surfaced almost a year ago, have been resolved today and this system is available for use again. However, the 2nd generation Tesla M2090 GPUs in this server have limited compute capabilities compared with, say, nVidia's current GPUs from the Tesla and GeForce ranges, so nvidia1 is only really suited to running legacy GPU applications, and cuDNN is not supported at all. For this reason fixing this system has been a low priority.

June 13th: known issues

nvidia4: there have been a number of incidents recently with individual GPUs disconnecting from the PCIe bus, and on Monday June 13th three GPUs went offline and did not reconnect even after power cycling the server. The server was taken out of the rack and opened up, all eight GPUs were reseated in their slots, and all power cables were removed & replaced to ensure good electrical contact between the GPUs and the mainboard/power supplies. This seems to have resolved the issue. It is believed that temperature cycling within servers that use a lot of power, with heavy currents flowing through relatively small contacts, causes GPU cards to expand & contract slightly, leading to physical movement in their PCIe slots.
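Users who suspect a GPU has dropped off the bus again can count the GPUs the driver currently reports. The following is a minimal sketch, assuming nvidia-smi is on the PATH; the function names are illustrative, and the expected count of eight is taken from the description of nvidia4 above. It compares counts rather than GPU indices, since the driver may renumber the remaining GPUs when one is missing.

```python
import subprocess


def parse_gpu_list(smi_output):
    """Parse 'index, name' CSV lines into a list of (index, name) tuples."""
    gpus = []
    for line in smi_output.strip().splitlines():
        if not line.strip():
            continue
        index, name = line.split(",", 1)
        gpus.append((int(index.strip()), name.strip()))
    return gpus


def gpu_shortfall(smi_output, expected_count):
    """Return how many GPUs fewer than expected are currently visible."""
    return max(0, expected_count - len(parse_gpu_list(smi_output)))


def query_gpus():
    """Ask nvidia-smi for one 'index, name' line per visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,  # raise if nvidia-smi itself reports an error
    )
    return result.stdout
```

On nvidia4, gpu_shortfall(query_gpus(), 8) returning anything other than 0 would suggest that one or more GPUs are not visible to the driver. Note that nvidia-smi may also exit with an error when a GPU is lost, in which case query_gpus() raises an exception instead.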

May 3rd: known issues

Huxley 616 server room cooling: in recent weeks this has not appeared to be working as well as it used to, with elevated top-of-rack temperatures and a slightly higher than normal overall room temperature. Owing to the College's partial closure over the past 2 years because of Covid, the air con systems have not received any routine maintenance since late 2019; this has now been requested from the Estates division. In the meantime, non-essential systems have been powered off to keep the thermal load down.

February 1st, 2022: known issues

The compute system cfm-ic2 used by the Maths Finance section has been going into suspend mode shortly after being rebooted, which appears to be a hardware fault on its motherboard. This system is currently little used by Maths Finance, so repair is not urgent; its sister machine cfm-ic1 is still available for use.

March 20, 2020: legacy known issues

As always, hardware faults can occur when systems that have been powered on for years are powered off and then back on again a short while later. The following systems were casualties of the scheduled power-downs on March 3rd last year:

  • fira: this elderly HP workstation in the Stats Linux cluster is awaiting two replacement disks, but since the new large compute servers madul, midal and model, which were introduced in February and March, effectively supersede it, repair of this seldom-used system is not a high priority.

Andy Thomas

Research Computing Manager,
Department of Mathematics

last updated: 6.7.2022