Current status of Maths computing facilities...


December 12: nvidia6 GPU server storage 96% full, clustor storage currently 93% full, aachen ssh gateways now back in service, all other active systems and services operating normally.

Last updated at 18:18 on December 12th

November 21st: external ssh access via hessian & aachen now restored: the College perimeter firewall issue that prevented external access to the aachen and hessian ssh gateways earlier this morning has now been resolved by ICT and both these gateways are now operating normally again.

October 23rd: hessian ssh: this ssh gateway is now repaired & back online following hardware failure on Sunday, October 20th.

October 20th: nvidia6 storage is now 95% full: thanks to those who have taken steps to reduce storage usage from 99% full a few weeks ago to the current 95%, but will all users of this GPU server please either delete old unwanted data, copy it to another system and then delete it, or simply move it to your clustor/clustor4 storage as described below. Thank you for your co-operation.

October 20th: clustor storage is now 95% full: can all users of this storage please delete unwanted data and/or move data to other storage such as the Maths silo servers or your own storage facilities. Thank you.
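If you are unsure how much space your own files are taking up, a quick check with du from a shell on the relevant server will show where the bulk of it is; the directory below is just a placeholder for wherever your data actually lives:

  # total size of everything under one of your storage directories
  # (replace /path/to/your/data with the directory you want to check)
  du -sh /path/to/your/data

  # break the total down by sub-directory to find the largest items
  du -h --max-depth=1 /path/to/your/data | sort -h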

September 27th: clustor2 storage is now back online: following repairs and hardware changes to the system's architecture, clustor2 is now back online and available for use again. Storage capacity and read/write performance have both been increased and one of the SSD (solid state disk) devices used internally for filesystem functions has been replaced by a very fast mirrored pair of devices for improved resilience, with other SSDs also upgraded to faster components.

nvidia4 & nvidia6 local user data storage: local on-server storage is provided on all of the GPU servers to store the output of currently running GPU compute jobs; being entirely contained within the servers, this storage is fast (typically 600-1000 megabytes per second write speed) and can cope with many users using all 8 GPUs simultaneously. In contrast, owing to the 1 gigabit/sec (1000 Mbit/sec) speed limit of the College network, networked storage is limited to just under 100 megabytes/sec write speed, which is shared across all the users of any one GPU server. The idea behind this local storage is that once your GPU computations have completed, you should move your data to wherever you normally keep it; for most research users this will be your storage on clustor or clustor2, on a sectional cluster such as the Stats systems, on one of the 4 silos, etc, or even your own computer.

Being optimised to accommodate large GPU cards, these servers simply do not have the physical space inside for many large hard disks, and the disk bays they do have are usually completely full. Please review the data you have stored on these servers and, if you are a Maths user, move any historical data to one of the following:

  • clustor: /home/clustor/ma/<i>/<username>
  • clustor4: /home/clustor4/ma/<i>/<username>
  • silo1: /home/silo1/<username>
  • silo2: /home/silo2/<username>
  • silo3: /home/silo3/<username>
  • silo4: /home/silo4/<username>

where <i> is the first letter of your username and <username> is your username.

If you are not a Maths user, you may find storage allocated to you in /home/clustor4/external/your_username where 'your_username' is the one you use to log into nvidia6.

Alternatively, you can use any of the usual methods (scp, sftp) to copy your data elsewhere and then delete it from the GPU servers afterwards.
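As a rough illustration only (the directory names, username and hostname below are placeholders), a Maths user whose username begins with 'a' might move a finished results directory off a GPU server along these lines:

  # on the GPU server: move a completed results directory to your clustor storage
  # (assuming your clustor space is visible from the GPU server, as suggested above)
  mv ~/results /home/clustor/ma/a/abc123/

  # or, from another machine: copy the data off the GPU server with scp,
  # check it arrived intact, then delete the original on the server
  # (replace nvidia6 with the hostname you normally use to reach the server)
  scp -r abc123@nvidia6:results .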

overloading of GPU servers: please do not run jobs on the GPU servers that use more than one GPU simultaneously, or run multiple computations that each use a GPU; the GPU servers nvidia4 and nvidia6 each have only 8 GPUs installed and these must be shared with other users. Also, some computationally intensive jobs are known to crash the PCIe bus that links the GPUs to the rest of the server, disconnecting one or more GPUs from the server and causing lost compute jobs and inaccessible GPUs that can only be recovered by a complete system reboot.
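For CUDA-based workloads, one common way to keep a job on a single GPU is to restrict which devices it can see with the standard CUDA_VISIBLE_DEVICES environment variable, which most frameworks (PyTorch, TensorFlow, etc) respect; the device number and script name below are purely illustrative:

  # expose only GPU 3 to this job; the other 7 GPUs remain free for other users
  export CUDA_VISIBLE_DEVICES=3

  # the job now sees a single GPU, which it addresses as device 0
  python my_training_script.py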

Fact: each GPU can consume up to 80 amps of current - more than the street mains current supplied to houses in most areas (typically between 40 and 100 amps in the UK). Currents of this magnitude, multiplied across all the GPUs in a server, cause heating and eventual oxidation of contact surfaces in the GPU power supply wiring, resulting in voltage drops and GPU crashes. GPU servers frequently need to be shut down, dismantled and the cabling connector problems sorted out.

Issue with running Julia on the NextGen cluster: a problem was identified on Wednesday, March 5th, whereby some compute jobs using the latest version 1.9 of the Julia programming language were failing and aborting. This has since been traced to library dependencies used by Julia on some nodes (ones that replaced older nodes in the autumn of 2022) being incompatible with Julia. Julia 1.9 was added to the NextGen cluster in June last year at a user's request but, as it was unavailable for Ubuntu Linux at that time, it was built locally from source on macomp001 and then pushed out across the cluster. Since Julia is seldom used on the cluster, the problem wasn't noticed until now.

As a workaround to allow a user to use the cluster while the problem is resolved, a new queue called simply julia has been added to the cluster today; it will run Julia jobs only on the compute nodes known to run them without problems. This queue can be selected for your compute jobs by using the #PBS -q julia directive in your cluster submission script.
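A minimal submission script using this queue might look like the sketch below; the job name, walltime and script name are only illustrative and should be adjusted to suit your own job:

  #!/bin/bash
  #PBS -q julia
  #PBS -N my_julia_job
  #PBS -l walltime=02:00:00

  # run from the directory the job was submitted from
  cd $PBS_O_WORKDIR

  julia my_script.jl

Submit it in the usual way with qsub.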

Perennial cooling issues in the Maths server room: in a nutshell, the four cooling systems in the Huxley 616 server room can no longer cope with today's server heat loads during periods of heavy usage when external temperatures are high in the summer months; much of this is due to the increasing popularity of GPU-based computing. The recent initiative of fitting blackout blinds to the server room windows to block direct sunlight has significantly reduced afternoon and early evening heat gain from outside and has led to lower temperatures overall in Huxley 616 during 'normal' summer weather. But the heatwaves we experienced in 2022, 2023 and now again in 2024 continue to be a major problem.

To further reduce the amount of heat generated in this room, some servers that support research IT services are being temporarily taken out of service. Temperatures at various locations around the room are continually monitored and, where it is safe to do so, services are restored as temperatures fall back to near-normal levels. Apologies for the inconvenience.

January 10th, 2023: known issues

zeus: this large compute server went down at about 5:15pm on Tuesday, January 10th and appears to have a hardware fault that will require further investigation and at least one spare part. Its sister system hydra has taken over its role and is ready for use. Repairing zeus is not currently urgent, and the space needed for hardware repairs on large pieces of equipment is not presently available in the Maths server room.


Andy Thomas

Research Computing Manager,
Department of Mathematics

last updated: 12.12.2024