Maths GPU clusters and servers

A number of GPU facilities are currently available to Maths users. Some are used by individual research groups and are not detailed here, but there are four generally accessible stand-alone GPU servers, plus two older GPU clusters (nvidia1 and nvidia2) which, owing to lack of space, are temporarily retired and being kept in storage.

  • nvidia3: this stand-alone GPU server is fitted with two nVidia K40 GPU cards, each with 2880 cores and 12 GB of memory, and has 2 TB of local storage as well as access to the networked storage facilities enjoyed by all compute systems in Maths. With Ubuntu 20.04 and the latest CUDA 11 software as well as support for OpenCL, this facility is considerably more up to date than the older nvidia1 or nvidia2 and is better suited to experimental leading-edge applications.

  • nvidia4: this server has eight nVidia RTX 2080 Ti GPU cards, 1.5 TB of memory and 22 TB of local data storage and is otherwise set up much the same as nvidia3. The GPUs each have 4352 cores and 68 ray-tracing cores as well as 11 GB of memory.

  • nvidia5: one of the two latest additions to the GPU family, this server has four nVidia RTX 3090 Ti GPU cards, 256 GB of memory and 4.6 TB of local on-server storage; the GPUs each have 10496 cores and 24 GB of memory. It is intended for special projects that may require different software or CUDA/cuDNN settings from the 'mainstream' nvidia6 server, but is otherwise identical to nvidia6 apart from the reduced GPU card count and smaller on-server storage.

  • nvidia6: the second of the two latest additions to the GPU family, this server has eight nVidia RTX 3090 Ti GPU cards, 896 GB of memory and 14 TB of local on-server data storage. As with the nvidia5 server, the GPUs each have 10496 cores and 24 GB of memory.

    Programs you run on the GPU servers may be either pre-compiled binaries that have been built and linked on another compatible GPU system, or programs you have written yourself (or built from source code given to you by others) as CUDA source files and compiled with the nvcc compiler. By convention, CUDA source files have the suffix .cu but may contain a mix of C, C++ and CUDA statements; nvcc uses the system's gcc compiler to generate non-GPU object code when necessary, switching automatically to the nVidia PTX compiler for GPU object code.
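
As a rough illustration of the workflow just described, the sketch below writes a minimal CUDA source file and compiles it with nvcc. It is written in Python purely for convenience; the file names hello.cu and hello and the helper names nvcc_command and build are hypothetical, and the equivalent shell command is simply 'nvcc -o hello hello.cu'.

```python
import shutil
import subprocess
from pathlib import Path

# A minimal CUDA program: each of 4 GPU threads prints its own index.
# (Raw string so that the "\n" reaches the C source unchanged.)
CUDA_SOURCE = r"""
#include <cstdio>

__global__ void hello()
{
    printf("Hello from GPU thread %d\n", threadIdx.x);
}

int main()
{
    hello<<<1, 4>>>();        // launch 1 block of 4 threads
    cudaDeviceSynchronize();  // wait for the kernel (and its output) to finish
    return 0;
}
"""


def nvcc_command(source, output):
    """Build the nvcc command line that compiles `source` into `output`."""
    return ["nvcc", "-o", str(output), str(source)]


def build(source=Path("hello.cu"), output=Path("hello")):
    """Write the sample source and compile it; returns True on success."""
    source.write_text(CUDA_SOURCE)
    if shutil.which("nvcc") is None:
        print("nvcc not found on PATH; source written but not compiled")
        return False
    return subprocess.run(nvcc_command(source, output)).returncode == 0


if __name__ == "__main__":
    if build():
        subprocess.run(["./hello"])
```

nvcc compiles for a default GPU architecture unless told otherwise, so depending on the card and CUDA version an explicit -arch option may be needed; the nvcc documentation lists the accepted values.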

    Getting started

    Access to the GPU servers is remote, via ssh. To begin with, you need an account on one or more of the servers; simply email Andy Thomas to request one. Once the account is set up, its details will be mailed to you. The password is randomly generated and you are strongly encouraged to change it when you log in for the first time, using the 'passwd' utility and following the prompts.

    nvcc does have a man page on the server but it's not very useful since it just lists the main nVidia CUDA utilities with very little information on their usage. You'll find a selection of nVidia documentation in PDF format right here on this server and you can also access nVidia's own online documentation for full information on the CUDA Toolkit.

    Checking the status of the GPUs

    If you want to find out what all the GPU cards are doing, use the nvidia-smi utility. Typing 'nvidia-smi' with no parameters produces a summary of their status as shown below:

    [Screenshot: output from the nvidia-smi command]

    which shows both GPUs in nvidia2 fully loaded although only using about 20% of the total available memory; the PIDs and names of the processes running on the host server are also listed and normal Linux utilities such as 'ps ax' can be used to find further information on these.

    Typing 'nvidia-smi -q' produces a very detailed status report for all GPUs in the system. This can be limited to a given GPU of interest with the -i N option, where N is the GPU identifier (0, 1 or 2 for nvidia1, 0 or 1 for nvidia2, and so on). For example, the command

    nvidia-smi -q -i 1

    will show the full information for GPU 1 only. Unlike most other nVidia CUDA programs, nvidia-smi has extensive man page documentation although many of the available options are reserved for the root user since they affect the operation of the GPU card.
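
For scripted monitoring, nvidia-smi can also emit machine-readable output through its --query-gpu and --format=csv options. The following is a minimal sketch, assuming the five fields named in FIELDS are available on the installed driver ('nvidia-smi --help-query-gpu' lists what is supported); the function names are hypothetical.

```python
import csv
import io
import shutil
import subprocess

# Fields requested from nvidia-smi; memory values are reported in MiB.
FIELDS = ["index", "name", "utilization.gpu", "memory.used", "memory.total"]


def _to_int(value):
    """Convert a field to int, or None if the driver reports e.g. '[N/A]'."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV output (no header, no units) into a list of dicts."""
    gpus = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        index, name, util, mem_used, mem_total = [cell.strip() for cell in row]
        gpus.append({
            "index": int(index),
            "name": name,
            "util_percent": _to_int(util),
            "mem_used_mib": _to_int(mem_used),
            "mem_total_mib": _to_int(mem_total),
        })
    return gpus


def query_gpus():
    """Run nvidia-smi and return one dict per GPU, or [] if it is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpus():
        print(f"GPU {gpu['index']} ({gpu['name']}): "
              f"{gpu['util_percent']}% busy, "
              f"{gpu['mem_used_mib']}/{gpu['mem_total_mib']} MiB used")
```

Keeping the parsing separate from the subprocess call means the parser can be exercised on saved output without a GPU present.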

    Are disk quotas imposed on the GPU cluster servers?

    No, but as with all Maths systems, disk usage is continuously monitored. Those who have used a large proportion of the available home directory storage will be asked to move data to one of the silo storage servers or to clustor or clustor2, or to delete unwanted data.

    Is user data on the cluster servers backed up?

    Yes, all four servers are mirrored daily to our onsite backup servers, which in turn are mirrored to the Maths offsite servers in Milton Keynes and Slough.

    What about job scheduling and fair usage controls?

    Job queueing and resource management are not used on the GPU clusters or the stand-alone servers at present because, unlike on the Maths compute cluster in the past, fair usage and contention for resources have not been a problem with the GPU facilities to date. It is also very difficult to implement traditional HPC-style cluster job management on GPU cards because there is no low-level interface to the core and memory resources of any given GPU card, although it is possible to control the use of entire GPU cards. With the present small-scale facilities used by a small group of regular users, it is currently not worth implementing any form of job control.

    Commercial resource management software is now available that will also support GPU management but the cost is beyond our budget (the NextGen HPC uses the Torque and Maui job management systems which are open source and free of charge).
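
Although nothing is enforced on the servers, an individual user can still confine a program to particular whole GPU cards through the CUDA runtime's CUDA_VISIBLE_DEVICES environment variable. The sketch below shows one way to do this from Python; the helper names and the script name run_on_gpu.py are hypothetical, and this is a voluntary courtesy rather than a site policy.

```python
import os
import subprocess
import sys


def env_for_gpus(gpu_ids, base_env=None):
    """Return a copy of the environment that exposes only the given GPUs."""
    env = dict(os.environ if base_env is None else base_env)
    # The CUDA runtime enumerates only the devices listed here,
    # renumbering them from 0 inside the program.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def run_on_gpus(command, gpu_ids):
    """Run `command` (a list of arguments) restricted to the given GPUs."""
    return subprocess.run(command, env=env_for_gpus(gpu_ids), check=False)


if __name__ == "__main__":
    # Hypothetical usage: python run_on_gpu.py 1 ./my_cuda_program arg1 arg2
    if len(sys.argv) < 3:
        print("usage: run_on_gpu.py GPU_ID[,GPU_ID...] COMMAND [ARGS...]")
    else:
        ids = [int(part) for part in sys.argv[1].split(",")]
        sys.exit(run_on_gpus(sys.argv[2:], ids).returncode)
```

Checking nvidia-smi first and choosing a card that is idle keeps jobs from different users off the same GPU.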

    About the GPU servers

    nvidia3 is a SuperMicro GR1027GR-72R2 GPU server that can accommodate up to 3 double-width GPU cards although only two are fitted at present. Two 2.5 GHz quad-core CPUs are fitted and 64 GB of memory is available.

    [Photo: GPUs fitted into nvidia4]

    nvidia4 is a large Tyan FT77DB7109 server fitted with eight nVidia GeForce RTX 2080 Ti GPUs (pictured above), two 16-core 2.8 GHz Xeon CPUs, 1.5 TB of memory and 14 hard disks. Two of these are fast SAS disks arranged as a mirrored pair for the system, while the other 12 form an XFS disk pool with a total usable capacity of 22 TB, one disk being reserved as a 'hot spare'.

    nvidia5 is an Asus ESC4000 G4 server fitted with four nVidia GeForce RTX 3090 Ti GPUs, two 16-core 2.8 GHz Xeon CPUs, 256 GB of memory, two 1 TB hard disks arranged as a mirrored pair for the operating system and six 1 TB disks forming an XFS disk pool to provide 4.6 TB of fast local storage.

    nvidia6 is an Asus ESC8000 G4 server fitted with eight nVidia GeForce RTX 3090 Ti GPUs, two 16-core 2.8 GHz Xeon CPUs, 896 GB of memory, two 300 GB SAS hard disks arranged as a mirrored pair for the operating system and six 2.4 TB disks to provide 11 TB of local on-server data storage.

    Andy Thomas

    Research Computing Manager
    Department of Mathematics

    last updated: 18.7.2023