Current status of Maths computing facilities...

June 15: more services curtailed owing to ongoing cooling problems in the Maths server room, NextGen cluster upgrade now ~97% completed, all other active systems and services operating normally.

Last updated at 10:42 on June 15th

June 5th: Maths server room overheating: temperatures in two areas of the server room rose to 40 degrees C yesterday, Tuesday June 4th, on a not particularly hot day but with all GPU servers running at high compute loads. Shutting down the Hadoop cluster and a few idle compute servers eventually reduced spot temperatures to no more than 31 degrees by late evening. For now, only those systems that are absolutely essential for summer projects, ongoing research and teaching will be kept available at all times.

nvidia4 & nvidia6 local user data storage: local on-server storage is provided on all of the GPU servers to store the output of currently running GPU compute jobs; being entirely contained within the servers, this storage is fast (typically 600-1000 megabytes per second write speed) and is able to cope with many users using all 8 GPUs simultaneously. In contrast, owing to the 1 gigabit/sec (1000 Mbit/sec) speed limitation of the College network, networked storage is limited to just under 100 megabytes/sec write speed, which is shared across all the users of any one GPU server. The idea behind this local storage is that once your GPU computations have completed, you should move your data to wherever you normally keep it; for most research users this will be your storage on clustor or clustor2, on a sectional cluster such as the Stats systems, on one of the 4 silos, etc., or even on your own computer.
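To put those figures in context, here is a rough back-of-envelope sketch in shell. The 50 GB output size is purely hypothetical; the 100 and 600 megabytes/sec rates are the figures quoted above, and 125 megabytes/sec is simply the raw 1000 Mbit/sec link speed divided by 8 bits per byte, before any protocol overhead or sharing with other users.

```shell
#!/bin/sh
# Seconds needed to write a given number of megabytes at a given rate
# in megabytes/sec, rounded up to the next whole second.
transfer_seconds() {
    # $1 = size in megabytes, $2 = write rate in megabytes/sec
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Raw ceiling of a 1000 Mbit/sec link, at 8 bits per byte.
echo "link ceiling:                $(( 1000 / 8 )) MB/s"

# Hypothetical 50 GB (50000 MB) of job output.
echo "networked storage, 100 MB/s: $(transfer_seconds 50000 100) s"
echo "local storage, 600 MB/s:     $(transfer_seconds 50000 600) s"
```

In practice the networked figure would be worse than this whenever several users of the same GPU server are writing at once, since they share that one link.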

Being optimised for accommodating GPU cards, these servers simply do not have the physical space inside them for lots of large hard disks, and the disk space that is available is often very full. Please review the data you have stored on these servers and move any historical data to one of the following:

  • clustor: /home/clustor/ma/<i>/<username>
  • clustor2: /home/clustor2/ma/<i>/<username>
  • silo1: /home/silo1/<username>
  • silo2: /home/silo2/<username>
  • silo3: /home/silo3/<username>
  • silo4: /home/silo4/<username>

where <i> is the first letter of your username and <username> is your username. Alternatively, you can use any of the usual methods (scp, sftp) to copy your data elsewhere and then delete it from the GPU servers afterwards.
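As an illustration of the path layout above, the following shell sketch builds a clustor or clustor2 destination from a username and prints the commands that would move a directory there. It assumes the clustor filesystems are mounted on the GPU server you are logged into; the username jbloggs, the source directory and the helper name clustor_dir are all hypothetical.

```shell
#!/bin/sh
# Build the clustor/clustor2 destination directory for a given username,
# following the /home/<server>/ma/<i>/<username> layout listed above.
clustor_dir() {
    # $1 = clustor or clustor2, $2 = username
    first_letter=$(printf '%s' "$2" | cut -c1)
    printf '/home/%s/ma/%s/%s\n' "$1" "$first_letter" "$2"
}

user="jbloggs"                      # hypothetical username
src="/path/to/old_results"          # hypothetical directory on the GPU server
dest="$(clustor_dir clustor2 "$user")/old_results"

# Print the commands rather than running them, so nothing is moved by accident.
echo "cp -a $src $dest"
echo "rm -r $src    # only after checking that the copy is complete"
```

Printing the commands first is a deliberate choice here: it lets you check the destination path before anything is copied or deleted.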

overloading of GPU servers: please do not run jobs on the GPU servers that use more than one GPU simultaneously, or run multiple computations that each use a GPU; the GPU servers nvidia4 and nvidia6 have only 8 GPUs installed in each and these must be shared with other users. Also, some computationally intensive jobs are known to crash the PCIe bus that links the GPUs to the rest of the server, disconnecting one or more GPUs from the server, causing lost compute jobs and inaccessible GPUs that can only be recovered by a complete system reboot.

Fact: each GPU can consume up to 80 amps of current - that's more than the street mains current supplied to houses in most areas (typically between 40 and 100 amps in the UK). This current, multiplied by all the other GPUs in a GPU server, causes heating and eventual oxidation of contact surfaces in the GPU power supply wiring, resulting in voltage drops and GPU crashes. Periodically, GPU servers need to be shut down, dismantled and problems with the cabling connectors sorted out.

Slow login issues resolved: Since Wednesday afternoon, April 11th, user logins into Maths systems that use central College authentication had been taking a very long time to complete - several minutes in the case of ssh logins. This was due to problems communicating with the central LDAP server, and an incident was raised with the ICT Service Desk that afternoon. The problem - whatever it was - appears to have been resolved at some stage this afternoon, Tuesday April 16th, and both general system logins and the NextGen cluster are once again working. Taking user authentication for the NextGen HPC cluster back in-house is actively being looked at to prevent widespread disruption to this facility in the future.

Issue with running Julia on NextGen cluster: a problem was identified on Wednesday March 5th, whereby some compute jobs that use the latest version (1.9) of the Julia programming language were failing and aborting. This has since been traced to library dependencies on some nodes, which replaced older nodes in the autumn of 2022, being incompatible with Julia. Julia 1.9 was added to the NextGen cluster in June last year at a user's request but, being unavailable for Ubuntu Linux at that time, it was built locally from source on macomp001 and then pushed out across the cluster. Since Julia is seldom used on the cluster, the problem wasn't noticed until now.

As a workaround to allow Julia to be used on the cluster while the problem is resolved, a new queue called simply julia has been added to the cluster today; it runs Julia jobs only on the compute nodes that can run Julia without problems. This queue can be selected for your compute jobs by using the #PBS -q julia directive in your cluster submission script.
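A minimal sketch of such a submission script is shown below, written out to a file from the shell. Only the #PBS -q julia line comes from this notice; the job name, the file names julia_job.pbs and my_script.jl, and the assumption that julia is on the PATH of the compute nodes are illustrative. Resource requests are left out because they depend on how the cluster is configured.

```shell
#!/bin/sh
# Write a minimal submission script to julia_job.pbs in the current directory.
cat > julia_job.pbs <<'EOF'
#!/bin/bash
#PBS -N julia-example
#PBS -q julia
# Start in the directory the job was submitted from.
cd "$PBS_O_WORKDIR"
julia my_script.jl
EOF

echo "Wrote julia_job.pbs; submit it with: qsub julia_job.pbs"
```

If you already have a working submission script, adding the single #PBS -q julia line near the top, alongside any other #PBS directives, should be all that is needed.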

Cooling issues in the Maths server room: in a nutshell, the four cooling systems in the Huxley 616 server room are no longer able to cope with today's server heat loads during periods of heavy usage when external temperatures are high in the summer months. The recent installation of blackout blinds on the server room windows to block incoming sunlight has significantly reduced afternoon and early evening heat gain from outside and has led to lower temperatures overall in Huxley 616 during 'normal' summer temperatures. However, heatwaves such as those we experienced in 2022 and again in 2023 continue to be a major problem.

To further reduce the amount of heat created in this room, some servers that support research IT services are being taken out of service on a temporary basis. The temperatures at various locations around the room are continually monitored and if it is safe to do so, services are being restored as and when temperatures fall to near normal levels. Apologies for the inconvenience.

January 10th, 2023: known issues

zeus: this large compute server went down at about 5:15pm on Tuesday January 10th and it now appears to have a hardware fault that will require further investigation and at least one spare part. It has been replaced by its sister system hydra, which is now ready for use. Repairing zeus is currently not urgent, and the space needed for undertaking hardware repairs of large pieces of equipment is currently not available in the Maths server room.

February 1st, 2022: known issues

The compute system cfm-ic2 used by the Maths Finance section has been going into suspend mode shortly after being rebooted, which appears to be caused by a hardware fault on its motherboard. This system is currently little-used by Maths Finance, so repair is not urgent, and its sister machine cfm-ic1 is still available for use.

March 20, 2020: legacy known issues

As always, hardware faults can occur when systems that have been powered on for years are first powered off and then back on again a short while later. The following systems were casualties of the scheduled power-downs on March 3rd last year:

  • fira: this elderly HP workstation in the Stats Linux cluster is awaiting two replacement disks, but since the new madul, midal and model large compute servers, which were introduced in February and March, effectively supersede it, repair of this seldom-used system is not a high priority.

Andy Thomas

Research Computing Manager,
Department of Mathematics

last updated: 15.6.2024