Introducing job queueing on the Fünf compute cluster
- The Fünf Gruppe cluster has become very popular as a general-purpose compute resource, but this popularity brings some drawbacks: excessive demand and system load lead to slow operation, long compute times and occasional crashes and outages. Sometimes a system in this group is very lightly loaded, with only one user running a single job, but more often than not a dozen users are each trying to run a job or, in some cases, more than one job.
- Each of the systems in this cluster has 8 processor cores (two physical Xeon processors with 4 cores each) and 16 GB of physical memory. Put simply, no more than 8 jobs should run simultaneously on any one system, with one job on each core, and together they should consume no more than 16 GB of memory; beyond these limits performance suffers dramatically. Until fairly recently users have been quite reasonable in their use of this compute facility, but there is now clear evidence that some users are taking more than their fair share by running multiple jobs in parallel across not one but several machines in the cluster.
- So we have decided to introduce a job queueing and control system, not only to control and limit resource consumption but also to monitor it proactively. It should also make life easier for end users: you can enter a job into the queue and then go away, knowing that the job will start automatically when the required resources become available.
- Job queueing systems are not new to the Maths department: for a long time we successfully ran DQS (Distributed Queueing System) on the Maths Physics SuSE Linux cluster. Unfortunately, DQS is no longer maintained and is difficult to build on modern Linux distributions owing to library and header issues. A pre-built DQS is bundled with SuSE and Debian Linux, but the ICT-managed Linux clusters run Red Hat Linux.
- The queue control software being trialled on the Fünf cluster is Torque, a modified version of OpenPBS, while the scheduler component is Maui. Both are commercial products available on a dual-licensing basis: you may either use the software free of charge, as long as you support it yourself, or purchase the enterprise version, which comes with a full support contract.
- More about the new job queue environment
- Let me know if you have any problems with the new queueing system and I will be happy to answer any questions you may have.
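As a rough illustration of the per-machine limits described above (8 cores and 16 GB of memory, with one job per core), the check a queueing system has to make before starting a job can be sketched in Python. This is only a minimal sketch and not how Torque or Maui are actually implemented; the names `Job`, `memory_gb` and `can_start` are hypothetical and chosen purely for this example.

```python
from dataclasses import dataclass
from typing import Sequence

# Per-machine limits taken from the text above.
CORES_PER_MACHINE = 8        # two Xeon processors with 4 cores each
MEMORY_PER_MACHINE_GB = 16   # physical memory per machine


@dataclass
class Job:
    """Hypothetical job record: a name and the memory it expects to use."""
    name: str
    memory_gb: float


def can_start(running: Sequence[Job], candidate: Job) -> bool:
    """Return True if the candidate fits on a machine alongside the running jobs.

    Assumes each job occupies exactly one core, as described in the text.
    """
    cores_in_use = len(running)
    memory_in_use = sum(job.memory_gb for job in running)
    fits_cores = cores_in_use + 1 <= CORES_PER_MACHINE
    fits_memory = memory_in_use + candidate.memory_gb <= MEMORY_PER_MACHINE_GB
    return fits_cores and fits_memory
```

A job that does not fit simply stays in the queue and is checked again when a running job finishes, which is what allows users to submit work and walk away.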
Older news items:
- October 16th, 2008: Introducing the Fünf Gruppe compute cluster
- June 18th, 2008: German City compute farm now expanded to 22 machines
- February 7th, 2008: new applications on the Linux apps server, unclutter your desktop
- November 13th, 2007: aragon and cathedral now general access computers, networked Linux Matlab installation upgraded to R2007a
- September 14th, 2007: Problems with sending outgoing mail for UNIX & Linux users
- July 23rd, 2007: SCAN available full-time over the summer vacation, closure of Imperial's Usenet news server
- May 15th, 2007: Temporary SCAN suspension, closure of the Maths Physics computer room, new research computing facilities
- January 14th, 2005: Exchange mail server upgrade, spam filtering with pine and various other enhancements
Faculty of Natural Sciences
last updated: 11.03.2010