Latest news for the Maths compute cluster


August 19th: scratch disk clearout begins

Over the nine years the cluster has been running, users have dumped data onto the cluster's 26 scratch disks and, in many cases, have then left the college with the data still in place. It has now got to the point where we have to remove some of it, starting with data belonging to people who are no longer in the Maths department, followed by ancient data belonging to users who are still here; the latter will each be contacted individually before anything is done.

This purge will commence on Monday, August 19th.

NextGen is moving again!

Sometime between July and the end of September, the Maths NextGen compute cluster will have to be shut down and moved to another part of ICT's data centre in the City & Guilds building on the Exhibition Road side of the South Kensington campus. This is due to a change of plans for space usage in that building, resulting in a further contraction of the ICT data centre - it follows a major upheaval only two years ago, in summer 2017.

This is not good news: apart from the extra work involved and the downtime whilst the move is being done, long-running compute jobs that are already running today may have to be terminated early. The precise dates and details of the move are not known at this stage, but as soon as concrete information becomes available it will be posted here.

After the cluster move, advantage will be taken of the downtime to upgrade Ubuntu 16.04 to the latest long-term supported version (18.04), along with packages such as the R CRAN repository and PyPI Python packages. Now is also a good time to remind users that support for legacy Python 2.7 is being discontinued from January 1st 2020 and that users should now be using at least Python 3.5.
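
To see which Python versions you currently have available on the cluster, the usual interpreter commands (these are standard Python invocations, not anything specific to this cluster) will print the versions:

python --version
python3 --version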

matlab2018 queue revisited...

Almost all of the cluster has now been upgraded to Matlab R2018b, with just one node still running the old R2016a version owing to existing long-running jobs. A new queue called 'matlab2018' was set up in March so that those who wanted to use the latest Matlab features could submit jobs to it; this guarantees their job(s) will run under Matlab R2018b, while jobs submitted to the 'standard' queue would run on either R2016a or R2018b, whichever was available at the time.

Unfortunately, the optional Parallel Computing Toolbox was inadvertently omitted from this installation, so Matlab programs using parallel features submitted to the parallel4, parallel6 or parallel8 queues would run but using only one processor core. Adding the missing toolbox to a Matlab installation on headless cluster nodes after the fact is neither straightforward nor quick, so an all-new master installation was done on April 24th and rolled out to 17 nodes that were not running Matlab jobs. The opportunity was taken at the same time to include the Database Toolbox as well as the latest update 4 from Mathworks.

As of April 25th, jobs submitted to the parallel4, parallel6, parallel8 and matlab2018 queues execute on the nodes updated on April 24th, while jobs submitted to the other queues will execute on the earlier installation that lacks the Parallel Computing Toolbox - this won't affect those jobs since the other queues are single-processor only. As and when existing Matlab jobs complete on the remaining nodes, these will in turn be upgraded as well, although this is likely to take months owing to the number of long-running jobs.
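
If you are unsure which queues exist, or want to check on your own jobs, the standard PBS commands below list the configured queues and your queued/running jobs respectively (these are generic PBS/Torque commands offered as a suggestion, not something specific to this cluster; replace your_username with your own login name):

qstat -Q
qstat -u your_username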

To use the matlab2018 queue, in your qsub submission script just replace any existing PBS queue directive that starts with:

#PBS -q

with

#PBS -q matlab2018

If you don't have a #PBS -q directive in your submission script (perhaps because you are using the default standard queue), simply add this line to your script.
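
For anyone writing a submission script from scratch, a minimal sketch might look like the one below. The job name, script name and Matlab invocation are just placeholders - adapt them to your own job and to however you normally launch Matlab in batch mode:

#!/bin/bash
#PBS -N example_job
#PBS -q matlab2018

# change to the directory the job was submitted from
cd $PBS_O_WORKDIR

# run a Matlab script non-interactively (example_script.m is a placeholder)
matlab -nodisplay -nosplash -r "example_script; exit"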

The Matlab R2018 upgrade

The upgrade of Matlab from version R2016a to the latest R2018b commenced in December last year, but on a busy cluster with many Matlab jobs already running it had to be done piecemeal, as and when nodes were free of Matlab jobs.

Apologies if this long drawn-out upgrade has inconvenienced you, but unlike other HPC clusters we have a policy of allowing users to run jobs for weeks or months instead of limiting them to just 3 days. Unfortunately, this does make cluster management much more difficult: apart from software updates and upgrades, "in-flight" incidents such as disk and memory failures, corrupted filesystems and so on have to be contained and worked around if the affected node is running a user job, with the ailing node being nursed along until all remaining jobs have finished and it can finally be taken out of service and repaired.

Using the Linux du command on the cluster

Those of you who are using clustor2 for file storage may be mystified by the large discrepancy between the file and folder sizes reported by the 'ls -l' and 'du' commands. This is because clustor2 uses file compression internally to improve read/write performance, so the file and folder sizes seen by the 'du' utility are the compressed sizes as stored in clustor2's ZFS disk pool, not the sizes seen by the operating system or by you!

To cope with ZFS compression-based storage systems such as that on clustor2, the version of du shipped with Ubuntu Linux has the '--apparent-size' option, which reports the actual file size rather than the compressed size as stored on the clustor2 server. You can use this option in conjunction with the existing du options such as 'h', 'm', 'c', etc. Here is an example, using a file that is 3.2 GB in size:

ls -l reports the file size is 3.2 GB as expected:

andy@macomp001:~$ ls -l 3_2GB_test_file.tar 
-rw-r--r-- 1 andy root 3343513600 Feb  2  2018 3_2GB_test_file.tar

but du shows it as less than half this size:

andy@macomp001:~$ du 3_2GB_test_file.tar 
1434360 3_2GB_test_file.tar

using the '--apparent-size' option to du now reports the size you would expect to see:

andy@macomp001:~$ du --apparent-size 3_2GB_test_file.tar 
3265150 3_2GB_test_file.tar
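
The option combines with the usual du flags too; for example, to get a human-readable summary of an entire folder (the folder name below is just a placeholder), you could use:

du -sh --apparent-size ~/results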

Using du to find the sizes of files or folders on other servers attached to the compute cluster, for example silo2 or clustor, will show very similar sizes with or without the '--apparent-size' option, since they do not use compression in their underlying storage systems.

major R upgrade for Maths compute cluster completed

Last autumn the core R installation on the cluster was upgraded from version 3.4.3 to 3.4.4, which is a very minor upgrade, but at the same time the large collection of additional R packages - mostly from the CRAN repository - was also rebuilt from current sources, many for the first time since the cluster was introduced.
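
If you want to check which R version a node is running and how many packages are installed, the following standard commands (nothing cluster-specific) will report the interpreter version and count the installed packages:

R --version
Rscript -e 'nrow(installed.packages())'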

Any questions?

As always, I'll be happy to answer any questions you may have.



Andy Thomas

Research Computing Manager,
Department of Mathematics

last updated: 15.8.2019