Current status of Maths computing facilities...
December 5: issues with nvidia4 GPU server, all other active systems and services operating normally.
Last updated at 8:54 on December 5th
- nvidia4: Two of the 8 GPU cards in this server went offline in April. A complete server reboot is needed to resolve this issue, which could involve extended downtime and may also require heavy lifting equipment to get the server out of the rack and back in again afterwards. Users of this GPU server are reminded that it is intended for GPU-related work only; conventional CPU-based work should be undertaken on the other compute facilities available in the department.
- GPU cards dropping off the server's internal PCIe bus are almost always the result of excessive loads on one or more GPU cards; heavy traffic between the system's CPUs and the GPU cards can overload the bus, causing GPUs to crash. Sometimes, however, a GPU card goes offline because its power feed fails: the high currents drawn by these cards heat the contacts within the plugs and sockets between the card and the system power supplies, and the contacts subsequently oxidise. The server then needs to be physically opened up for this to be sorted out.
- Resolving this issue has been postponed until all of the MSc student summer projects have finished.
- Cooling issues in the Maths server room: in a nutshell, the four cooling systems in the Huxley 616 server room are no longer able to cope with today's server heat loads during periods of heavy usage when external temperatures are high in the summer months. The recent initiative of installing blackout blinds to the server room windows to block sunlight coming in through them has significantly reduced afternoon and early evening heat gain from outside and has led to lower temperatures overall in Huxley 616 during 'normal' summer temperatures. But the heatwaves we have experienced in 2022 and again this year continue to be a major problem.
- To further reduce the amount of heat created in this room, some servers that support research IT services are being taken out of service on a temporary basis. The temperatures at various locations around the room are continually monitored and if it is safe to do so, services are being restored as and when temperatures fall to near normal levels. Apologies for the inconvenience.
January 10th, 2023: known issues
- zeus: this large compute server went down at about 5:15pm on Tuesday January 10th and appears to have a hardware fault that will require further investigation and at least one spare part. Its sister system hydra has replaced it and is now ready for use. The repair is currently not urgent, and the space needed for undertaking hardware repairs of large pieces of equipment is currently not available in the Maths server room.
February 1st, 2022: known issues
- The compute system cfm-ic2 used by the Maths Finance section has been going into suspend mode shortly after being rebooted, which appears to be caused by a hardware fault on its motherboard. This system is currently little used by Maths Finance and its sister machine cfm-ic1 is still available for use, so repair is not urgent.
March 20, 2020: legacy known issues
- As always, hardware faults can occur when systems that have been powered on for years are powered off and then back on again a short while later. The following systems were casualties of the scheduled power-downs on March 3rd last year:
- fira: this elderly HP workstation in the Stats Linux cluster is awaiting two replacement disks, but since the new madul, midal and model large compute servers, which were introduced in February and March, effectively supersede it, repair of this seldom-used system is not a high priority.
Andy Thomas
Research Computing Manager,
Department of Mathematics
last updated: 5.12.2023