Current status of Maths computing facilities...

July 5th: all systems and services in Maths operating normally

The new GPU server nvidia4 is back up after an additional CUDA module was loaded into the kernel, and it has not crashed since, even under sustained heavy load. Possible disk controller issues unrelated to the GPUs are still being investigated, although it is not clear that these are related to the crashes we experienced earlier.

As we begin the fourth month of remote working at the College, we're happy to report there have been very few problems with Maths research IT services & systems. Those problems we have had were mostly connected with the urgent need to temporarily shut down many of these systems overnight on May 31st after a partial cooling failure caused by a series of campus-wide power failures.

As "lockdown shock" is beginning to wear off a little, it is heartening to see compute loads slowly starting to rise on the various compute platforms although a few group clusters have been operating more or less at full load throughout.

Research IT services fully operational during the college shutdown

Maths research IT systems and services are currently operating as normal and will remain fully available along with the usual levels of support for users even though College is now closed to all from Wednesday March 18th onwards.

By its very nature, research IT is well-suited to remote working due to its client-server model; research users are familiar with the idea of working for years on end with systems they may never have seen (the server), interacting with them from their own PC, laptop, tablet or even a smartphone (the client). So a wholesale switch in the College's modus operandi to remote working will not come as a shock to these users, for whom it will be business as usual as far as computing is concerned.

The College - and in particular ICT - will do everything it can to support remote working. Maths users have a variety of methods they can use to work remotely:

  • if you wish to access a Windows computer remotely you can use the college's Remote Desktop Gateway - see the ICT documentation for more details

  • you can also use the college VPN service, which is a more flexible option for research users since it's not limited to accessing just Windows PCs. It uses the MS PPTP protocol, which is well-supported by remote systems running Windows, Linux and Android but is no longer supported by Mac OS X. It's also supported by UNIXes such as Solaris, FreeBSD & OpenBSD (if you should happen to want ready-to-use configuration files for accessing Imperial's VPN service from a system running FreeBSD, let me know).

  • use one of the Maths ssh gateways - this is probably the best option for most Maths Linux and UNIX research users since all of the generally accessible Maths home directory, file and data storage servers are mounted directly on the gateways. So you can log into one of these via ssh and access your data and better still, you can transfer files into & out of the various storage servers using scp or sftp without having to log into specific clusters or compute servers. For security reasons, these gateways use a non-standard access port so ask me for information about this if you don't already know what this is (it won't be stated here for obvious reasons!).

  • use the college's ssh gateway service - you'll need to sign up with the ICT Service Desk beforehand to use this service and more information can be found at the bottom of the ICT Remote Access page. To be frank, this service is of limited use to Maths research users since there is no access to Maths home directory, file and data storage servers, X-Windows forwarding is disabled (so GUI programs cannot be run remotely) and there is very limited local storage on this service. You can however make onward ssh connections to Maths systems from this gateway.
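For users who have not used ssh, scp or sftp with a non-standard port before, here is a minimal Python sketch that simply builds the three command lines. The host name gateway.example.org, the port 2222 and the user name jbloggs are placeholders invented for illustration; they are not the real Maths gateway details, which you should ask me for.

```python
import shlex

# Placeholder values only: the real gateway name and port are not published here.
GATEWAY_HOST = "gateway.example.org"
GATEWAY_PORT = 2222


def ssh_command(user, host=GATEWAY_HOST, port=GATEWAY_PORT):
    """Argument list for an interactive login; ssh takes the port as lowercase -p."""
    return ["ssh", "-p", str(port), f"{user}@{host}"]


def scp_upload_command(user, local_path, remote_path,
                       host=GATEWAY_HOST, port=GATEWAY_PORT):
    """Argument list for copying a local file to the gateway; scp takes the port as capital -P."""
    return ["scp", "-P", str(port), local_path, f"{user}@{host}:{remote_path}"]


def sftp_command(user, host=GATEWAY_HOST, port=GATEWAY_PORT):
    """Argument list for an interactive sftp session; sftp also takes the port as capital -P."""
    return ["sftp", "-P", str(port), f"{user}@{host}"]


if __name__ == "__main__":
    # Print the commands so they can be checked before being run by hand.
    for cmd in (
        ssh_command("jbloggs"),
        scp_upload_command("jbloggs", "results.csv", "data/results.csv"),
        sftp_command("jbloggs"),
    ):
        print(shlex.join(cmd))
```

The one detail that commonly trips people up is that ssh expects the port after a lowercase -p whereas scp and sftp expect it after a capital -P.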

Some bespoke remote access solutions are available for specialist applications - these are not publicly advertised for security reasons but if you require a specialist remote access solution that is not available for general use, please enquire.

July 1st: known issues

Current known problems are listed below:

  • nvidia4 GPU server: this system has been crashing under heavy load and although it is currently up and running under moderate load, the cause of the problems we experienced with this system in May and June is not clearly understood. Errors are occasionally being logged in Linux for the disk controller and, at a much lower level, in the server's system event log indicating a possible CPU problem. Decoding the CPU error codes is difficult owing to a lack of information on what these mean, and we are awaiting a response from the server motherboard manufacturer's support forum.

March 20th: legacy known issues

As always, hardware faults can occur when systems that have been powered on for years are first powered off and then back on again a short while later, and the following systems were casualties of the scheduled power-downs on March 3rd and 11th:

  • archimedes: this compute system is awaiting repair after two disks failed following the March 3rd power-down

  • fira: this elderly workstation in the Stats Linux cluster is also awaiting disk replacement but since the new madul, midal and model large compute servers, which were introduced last month, effectively supersede it, repair of this seldom-used system is not a high priority

Andy Thomas

Research Computing Manager,
Department of Mathematics

last updated: 5.7.2020