Latest news for the Maths compute cluster


March 2025: repairs to cluster now completed

Following the data centre overheat incident on August 22nd, a few compute nodes suffered non-critical damage such as failed scratch disks, the failure of one disk in a mirrored pair, and memory bank outages. None of these failures was catastrophic and the affected nodes were able to continue operating, albeit with reduced local storage, resilience and memory respectively. These nodes have now been repaired and are fully operational again.

We apologise for the disruption caused by this incident, which was outside Maths' control, and will soon be introducing an upgrade to the storage servers to improve resilience to SSD (solid state disk) failures.

June 2024: introducing macomp31 and macomp32

On Tuesday June 25th, the new compute nodes macomp31 and macomp32 were enabled, adding 128 processor cores and 1 terabyte of memory to the cluster. These nodes join the existing macomp17 and macomp19 large compute nodes to provide more capacity for large parallel and/or large memory queues.

May 2024: cluster upgrade almost finished...

One feature of the NextGen cluster is that there is no enforced limit on how long a user's job may run. There is a default of 100 days, after which the cluster management system terminates the job, but this can be overridden by the user: a longer or shorter period can be specified when submitting a job. The downside of this policy is felt when planning upgrades and updates, since we sometimes have to wait months for a user's job to finish before we can take a node out of service to be worked on.
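
For example, a different limit can usually be requested at submission time via the standard Torque walltime resource; this is just a sketch, with myjob.sh and the 200-day figure as placeholders:

andy@macomp001:~$ qsub -l walltime=4800:00:00 myjob.sh    # 4800 hours = 200 days, instead of the 100-day default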

As a result, the updates started in March have taken a long time to complete. As of today, May 21st, every node in the cluster has been upgraded to the latest software standard except one, which will keep running a user's legacy jobs until mid-July.

April 2024: introducing macomp002 and more nodes!

It has been decided to keep the original macomp001 server 'as is' rather than replace it outright with new hardware containing more modern CPUs and memory, as originally planned, and instead to add the replacement system as a second submission node, macomp002. One reason for this change of plan is that a single head/submission node represents a single point of failure which would put the cluster out of reach of users should it fail. Secondly, the differences between macomp001 and macomp002 only come into play when code is compiled from source on these head nodes, which only a minority of users do. Code built on macomp002 may run faster when submitted as a cluster job since it can be optimised for the additional AVX and AVX2 instruction sets found on the processors in macomp002 and throughout the cluster, but not on macomp001.
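
If you do compile your own code, a quick way to see the difference is to check the CPU flags on each head node and build with the compiler's native architecture option - a sketch, with mycode.c standing in for your own source:

andy@macomp002:~$ grep -m1 -o avx2 /proc/cpuinfo              # present on macomp002 (and the compute nodes), absent on macomp001
avx2
andy@macomp002:~$ gcc -O2 -march=native -o mycode mycode.c    # lets the compiler use AVX/AVX2 where the CPU supports it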

Keeping macomp001 and adding macomp002 has other advantages - there is none of the disruption to the cluster service that physically replacing macomp001 would have caused, and we also gain extra compute resources which could be added to the main compute pool. Do give macomp002 a try!

Another development coming soon: two nodes, each containing 4 CPUs with a total of 64 processor cores and 512GB of memory per node, will be added to the cluster, providing more resources for those running large memory and/or parallel compute jobs.
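
As a rough illustration of how a job for one of these nodes might be submitted once they are in service, using the usual Torque resource options (the core count, memory figure and script name are only placeholders):

andy@macomp001:~$ qsub -l nodes=1:ppn=32,mem=256gb myjob.sh    # ask for 32 cores and 256 GB of memory on a single node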

March 2024: cluster update and upgrade

In the past, major cluster upgrades and/or updates have taken place every two years in the late summer/early autumn. Owing to the shorter lifecycles of software such as Python, and the consequent user demand for the very latest programming tools, the update cycle has now been shortened to around 18 months, while the cluster's stability is still ensured by running the penultimate Ubuntu LTS Linux release.

The 2024 cluster upgrade takes Linux from Ubuntu 20.04 to 22.04, which brings with it Python 3.10 and applications including R 4.3.3, Matlab R2023b and Magma 2_2.86-2. At the same time an additional compute node - macomp30 - has been added, and the submission/login node macomp001 is due to be replaced after the Easter break with a new server with AVX/AVX2-capable CPUs to match those of the compute nodes. This work is being done in the background with no outages anticipated and is about 80% complete at the time of writing.
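
Once a node has been upgraded, the new versions can be confirmed from the command line (exact point releases will vary), for example:

andy@macomp001:~$ lsb_release -ds          # should report Ubuntu 22.04 LTS
andy@macomp001:~$ python3 --version        # should report Python 3.10.x
andy@macomp001:~$ R --version | head -1    # should report R 4.3.3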

2023: New cluster jobs status page

The online cluster jobs status display on the cluster management server, openpbs, was replaced on March 11th with a new one which is far more accurate than the old one and properly displays the status of array jobs (jobs sharing the same base job ID but with suffixes added); other problems with displaying the status of queued and blocked jobs have also been resolved. Some of the issues with this facility up until now have been due to small changes in the Maui showq and Torque qstat utilities following last year's cluster upgrade, as well as the original code having been written for PHP 5.x while we are now using PHP 7.2.
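
The same information is also available directly from the command line on the submission nodes, for example:

andy@macomp001:~$ qstat -u $USER    # Torque: list your own queued and running jobs
andy@macomp001:~$ showq -r          # Maui: show all jobs currently running on the cluster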

Issues with the scrambled data sometimes displayed by the jobinfo facility, where specific bits of information end up in the wrong table cells when the 'full job info' button is selected, are now being worked on; the basic job info option is largely bug-free.

Cluster software updates

Most users are probably unaware that minor software updates are carried out clusterwide several times a month, usually in response to user requests for new software, Python modules and R libraries. These updates are so frequent that they are not announced individually.
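
If you want to check whether something you need is already installed before asking, a quick test from any node works well - numpy and ggplot2 below are just example names:

andy@macomp001:~$ python3 -c "import numpy; print(numpy.__version__)"    # fails with an ImportError if the module is missing
andy@macomp001:~$ Rscript -e 'packageVersion("ggplot2")'                 # fails with an error if the library is missing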

2022 cluster update and upgrade

Every two years, during the late summer/early autumn, the cluster is updated to the current LTS (Long Term Support, or stable) version of Ubuntu. This timing ensures we run a stable, matured yet current version of Linux that has been in use for about 18 months, while the upheaval involved in a cluster update happens only once every two years, at a time when the cluster is usually less busy.

The 2022 cluster upgrade is a major one, with 24 of the original nodes being replaced by 18 modern Dell R630 nodes, each with 20 physical ('real') processor cores presenting 40 logical ('virtual') processors thanks to hyperthreading.

Using the Linux du command on the cluster

Those of you who are using clustor2 for file storage may be mystified by the large discrepancy between the file and folder sizes reported by the ls -l and du commands. This is because clustor2's ZFS-based storage subsystem uses file compression internally to improve read/write performance, so the file and folder sizes seen by the du utility on cluster nodes are the compressed sizes as stored in clustor2's ZFS disk pool, not the sizes seen by the operating system or by you!

To support compression-based storage systems such as that used on clustor2, the version of du shipped with current Ubuntu Linux has the --apparent-size option, which reports the actual file size as seen on the cluster rather than the compressed size as stored on the clustor2 server. You can use this option in conjunction with the existing du options such as -h, -m, -c, etc. Here is an example, using a file that is 3.2 GB in size:

ls -l reports the file size is 3.2 GB as expected:

andy@macomp001:~$ ls -l 3_2GB_test_file.tar 
-rw-r--r-- 1 andy root 3343513600 Feb  2  2018 3_2GB_test_file.tar

but du, which reports sizes in 1 kB blocks by default, shows it as less than half this size:

andy@macomp001:~$ du 3_2GB_test_file.tar 
1434360 3_2GB_test_file.tar

using the '--apparent-size' option to du now reports the size you would expect to see:

andy@macomp001:~$ du --apparent-size 3_2GB_test_file.tar 
3265150 3_2GB_test_file.tar

Note: using du to find sizes of files or folders on other servers attached to the compute cluster, for example silo2 or clustor, will show very similar sizes with or without the --apparent-size option since they do not use compression in their underlying storage systems.
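
For example, to summarise a whole folder in human-readable units both ways (my_results is just a placeholder name):

andy@macomp001:~$ du -sh my_results/                      # compressed size as stored on clustor2
andy@macomp001:~$ du -sh --apparent-size my_results/      # actual size as seen by ls and your programs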

Major R upgrade for Maths compute cluster completed

In the spring of 2021 the core R installation on the cluster was upgraded from version 3.4.4 to 4.1, a major upgrade. At the same time the large additional R package collection - mostly from the CRAN repository - was rebuilt from current sources, many packages for the first time since the cluster was introduced.
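
A quick way to confirm which R version you are picking up, and roughly how many packages are now available, is:

andy@macomp001:~$ Rscript -e 'R.version.string'              # should report R version 4.1.x
andy@macomp001:~$ Rscript -e 'nrow(installed.packages())'    # approximate count of installed packages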

Any questions?

As always, I'll be happy to answer any questions you may have.



Andy Thomas

Research Computing Manager,
Department of Mathematics

last updated: 5.3.2025