Current status of Maths computing facilities...
February 13th update: a warning about huge growing files, no other reported issues
- A problem we see from time to time is a user's large file, stored on one of the Maths fileservers, that keeps growing over a long period of time. This causes problems behind the scenes; here's why...
- To ensure your data is kept safe and is not lost in the event of a fileserver failure, all of the fileservers in Maths are mirrored by 'shadow' servers, each of which contains an identical copy of the data stored on the main server it mirrors. Every 24 hours, the shadow server compares each file and directory it has previously copied with the current state of the corresponding file/folder on the main server. Any change in size, permissions, user/group ownership, directory contents - or the location of the file/folder in the filesystem - will result in the shadow server updating its copy to match. So, naturally, huge files are also mirrored to these shadow servers.
- So far, all good - files many terabytes in size get copied to the shadow server. But a large file on the fileserver that is open and still being written to while it is being mirrored is a problem - when the mirror copy completes, its size still doesn't match the original's, since the original file has grown during the copy operation! This isn't a problem for smaller files, which are compared, mirrored and then compared again between the two servers in less time than it takes for the original file to change in size. But large files take so long to copy that the copy is already out of date by the time it completes, so the final file compare operation fails and the mirror has to start all over again.
- These huge files are often accidental and not really wanted - maybe someone decides to log the error output from a compute job to aid debugging some coding problem, fixes the original problem, forgets to disable the logging and then goes off & submits 30 of these jobs to the compute cluster, where they run for months on end, all writing their debugging data to one and the same file. Result: a 3 terabyte file full of unwanted junk that has been forgotten about, grows every few seconds and eventually starts causing problems for mirrored fileserver pairs. The server sysadmin is alerted through gradually worsening data access times on the server, high system loads on the shadow server, or mirror sessions stacking up and taking longer than 24 hours to complete, with two or more data mirroring operations fighting each other for control of the file copy.
- However, some users are purposely writing output data from computations into single large files which are held open for weeks on end and continuously written to. This is not good programming practice: apart from the problems it causes for fileserver mirroring, a program error or a network glitch could easily corrupt that big file, and weeks' worth of experimental data would have to be discarded. Storing program output in smaller files is a much more resilient solution - say, daily files, or files that are closed after 4 hours of computation with a new file opened to continue storing data.
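As an illustration of the "smaller files" approach suggested above, here is a minimal Python sketch of a writer that closes its current output file after a fixed period and carries on in a new one. The class name `RotatingWriter`, the file naming scheme and the parameter names are all made up for this example; the 4 hour default simply echoes the figure mentioned above. Treat it as a starting point to adapt, not a recommended implementation.

```python
import os
import time


class RotatingWriter:
    """Write records to a series of files, starting a new file every
    `max_age` seconds instead of holding one huge file open.
    (Hypothetical helper for illustration only.)"""

    def __init__(self, directory, prefix="output", max_age=4 * 3600,
                 clock=time.time):
        self.directory = directory
        self.prefix = prefix
        self.max_age = max_age      # seconds before a new file is started
        self.clock = clock          # injectable so rotation can be tested
        self._file = None
        self._opened_at = None
        self._index = 0
        os.makedirs(directory, exist_ok=True)

    def _rotate(self):
        # Close the current file (if any) and open the next one in sequence.
        self.close()
        name = f"{self.prefix}-{self._index:05d}.dat"
        # Append mode: a restarted program adds to an existing file
        # with the same name rather than overwriting it.
        self._file = open(os.path.join(self.directory, name), "a")
        self._opened_at = self.clock()
        self._index += 1

    def write(self, record):
        # Start a new file on first use or once the current one is too old.
        if (self._file is None
                or self.clock() - self._opened_at >= self.max_age):
            self._rotate()
        self._file.write(record + "\n")
        self._file.flush()

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None
```

A long-running job would create one `RotatingWriter`, call `write()` for each result and `close()` at the end. Each finished file then stops changing, so it can be mirrored once and left alone, and a crash or network glitch puts at most the current file's data at risk.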
Known network issues
- Although this is not a 'mission critical' problem, it is now known that some links within the internal networks in Huxley 616 are not operating at the speeds they should be: a few of the network connections to some systems are running at 100 Mbit/s instead of 1000 Mbit/s (gigabit). To users, this means file and data transfers to & from some systems can take as much as 10 times longer than they should do. Individually, the systems, inter-connecting cables and network switches are all functioning correctly but in some cases the end-to-end link speed is being negotiated at the slower 100 Mbit/s speed and not gigabit.
- This is believed to be caused by physical cable congestion in the increasingly confined space underneath some of the racks, resulting in cross-talk between bunched network cables and possibly also interference from adjacent power cables and metalwork. The data signalling within a standard network cable is effectively a weak radio-frequency signal running through 4 twisted pairs of wires in an unshielded cable, and is prone to interference. So a programme of relocating rack cabling will begin in January 2019 with more details to be announced shortly.
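For anyone curious whether a particular Linux system is affected, the sketch below reads the negotiated link speed that the kernel exposes under `/sys/class/net/<interface>/speed` and estimates an ideal transfer time at that speed. The function names and the interface name `eth0` are only examples, and the estimate ignores protocol overhead and disk speed, so it is a rough lower bound and not a measurement.

```python
from pathlib import Path


def link_speed_mbit(interface, sysfs_root="/sys/class/net"):
    """Return the negotiated link speed in Mbit/s, or None if it cannot
    be determined (no such interface, link down, virtual device...)."""
    try:
        text = (Path(sysfs_root) / interface / "speed").read_text().strip()
        speed = int(text)
    except (OSError, ValueError):
        return None
    # The kernel may report -1 when the speed is unknown.
    return speed if speed > 0 else None


def transfer_seconds(size_bytes, speed_mbit):
    """Ideal time to move size_bytes over a link of speed_mbit Mbit/s,
    ignoring protocol overhead and storage speed."""
    return size_bytes * 8 / (speed_mbit * 1e6)


if __name__ == "__main__":
    speed = link_speed_mbit("eth0")  # "eth0" is an assumed interface name
    if speed is None:
        print("link speed not available")
    else:
        print(f"eth0 negotiated at {speed} Mbit/s")
        print(f"1 GB would take at least {transfer_seconds(10**9, speed):.0f} s")
```

At these idealised rates a 1 GB file needs at least 80 seconds at 100 Mbit/s but only 8 seconds at gigabit speed, which is the factor of 10 described above.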
No other issues
- There are currently no reported problems with Maths systems and services although the SCAN is temporarily out of service owing to technical issues with migrating the service to operate with PC clusters that have been changed from legacy BIOS booting to UEFI booting.
Research Computing Manager,
Department of Mathematics
last updated: 13.2.2019