Maths GPU cluster


Although often referred to as the Maths GPU cluster, this facility actually consists of two separate clusters plus a stand-alone GPU server. Each cluster has its own host server, nvidia1.ma and nvidia2.ma respectively, and both host servers share the same PCI-express expansion chassis that physically accommodates the GPU cards. The two clusters are identical except for the GPU cards installed: nvidia1 has three GPU cards and nvidia2 has two.

2.4 TB of local disk storage is available on each host server for user home directories and, in addition, the Maths data storage servers silo, silo2 and calculus are mounted on each GPU server via NFS. Both host servers run the Ubuntu 14.04 LTS Linux operating system with the nVidia CUDA version 6.5 drivers and Toolkit installed. The CULA Tools sparse and dense libraries are currently not installed because, at present, no version of these works with CUDA 6.5.

The stand-alone GPU server is nvidia3, which is fitted with a single nVidia K40 GPU card with 2880 cores and 12 GB of memory; the server has 50 GB of local storage as well as access to the networked storage facilities. With Ubuntu 16.04 and the latest CUDA software, this facility is considerably more up to date than nvidia1 or nvidia2 and is better suited to experimental leading-edge applications.

Programs you run on the clusters may be either pre-compiled binaries that have been built and linked on another compatible GPU system, or programs that you have written yourself (or obtained as source code from others) as CUDA source files and compiled with the nvcc compiler. By convention, CUDA source files have the suffix .cu but may contain a mix of C, C++ and CUDA statements; nvcc uses the system's gcc compiler to generate non-GPU object code when necessary, switching automatically to the nVidia PTX compiler for GPU object code.
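As a minimal sketch of that workflow, the shell commands below write a tiny CUDA source file and compile it with nvcc. The work directory, the file name hello.cu and the kernel name hello are hypothetical and chosen only for illustration; the compile step is skipped if nvcc is not on your PATH.

```shell
#!/bin/sh
# Hypothetical work directory, used only for this illustration.
workdir="${TMPDIR:-/tmp}/cuda_hello_example"
mkdir -p "$workdir" && cd "$workdir"

# Write a minimal CUDA source file: one GPU kernel plus ordinary host code.
cat > hello.cu <<'EOF'
#include <cstdio>

// GPU code: each thread prints its own index within the block.
__global__ void hello(void)
{
    printf("Hello from GPU thread %d\n", threadIdx.x);
}

// Host code: nvcc passes this part to the system's gcc.
int main(void)
{
    hello<<<1, 4>>>();          // launch one block of four threads
    cudaDeviceSynchronize();    // wait for the kernel and flush its output
    return 0;
}
EOF

# Compile and run only where the CUDA Toolkit is available.
if command -v nvcc >/dev/null 2>&1; then
    nvcc -o hello hello.cu && ./hello
else
    echo "nvcc not found; hello.cu written to $workdir but not compiled"
fi
```

The same file could equally contain plain C or C++ functions alongside the kernel; nvcc separates the host and GPU parts itself, so no extra compiler flags should be needed for a simple program like this one.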

Getting started

Access to either of the GPU clusters or the stand-alone server is by remote login via ssh. To begin with, you need an account on one or both of the host servers - simply email Andy Thomas requesting an account. Once this is set up, the account details will be mailed to you. The initial password is randomly generated and you are strongly encouraged to change it when you log in for the first time, using the 'passwd' utility and following the prompts.

Before you start writing and compiling your own CUDA programs, you might want to have a look at some examples and you'll find a comprehensive selection of ready-to-compile programs in /usr/local/cuda/samples. A script called cuda-install-samples-6.5.sh is provided for you to make a writable copy of these read-only examples in your own home directory so that you can compile and run your own versions - here's an example of its use:

cuda-install-samples-6.5.sh ~/my_samples

will copy the entire set of examples to a directory called my_samples/NVIDIA_CUDA-6.5_Samples in your home directory. Once you have done this, you can explore the examples and if you want to build and run the binary, just change into the directory containing your chosen example and type 'make'. For example, deviceQuery is a useful utility that displays the characteristics of each GPU card attached to the server so to compile and run your own copy of this, do the following:

cd ~/my_samples/NVIDIA_CUDA-6.5_Samples/1_Utilities/deviceQuery
make
./deviceQuery

The utility should report that it has found three GPUs on nvidia1 (two on nvidia2 and just one on nvidia3) and provide a detailed listing of the features of each of them.

nvcc does have a man page on the server but it's not very useful since it just lists the main nVidia CUDA utilities with very little information on their usage. You'll find a selection of nVidia documentation in PDF format right here on this server and you can also access nVidia's own online documentation for full information on the CUDA Toolkit.

Checking the status of the GPUs

If you want to find out what all the GPU cards are doing, use the nvidia-smi utility. Typing 'nvidia-smi' with no parameters produces a summary of their status.

A sample summary taken on nvidia2 showed both GPUs fully loaded although using only about 20% of the total available memory. The PIDs and names of the processes running on the host server are also listed, and normal Linux utilities such as 'ps ax' can be used to find further information on these.

Typing 'nvidia-smi -q' produces a very detailed status report for all GPUs in the system, but this can be limited to a given GPU of interest with the -i N option, where N is the GPU identifier (0, 1 or 2 for nvidia1 and 0 or 1 for nvidia2). For example, the command

nvidia-smi -q -i 1

will show the full information for GPU 1 only. Unlike most other nVidia CUDA programs, nvidia-smi has extensive man page documentation although many of the available options are reserved for the root user since they affect the operation of the GPU card.
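For scripted checks, nvidia-smi can also print selected fields as CSV. The sketch below is not part of the Maths setup: the function name least_loaded_gpu is hypothetical, and it assumes that the nvidia-smi installed with the server's driver supports the --query-gpu and --format options (the man page on the server will confirm this). The function reads "index, utilisation" lines and prints the index of the GPU with the lowest utilisation.

```shell
#!/bin/sh
# least_loaded_gpu (hypothetical helper): reads lines of the form
# "index, utilisation" on standard input and prints the index of the
# GPU with the lowest utilisation percentage. Ties go to the first seen.
least_loaded_gpu() {
    awk -F', *' '
        NR == 1 || $2 + 0 < min { min = $2 + 0; idx = $1 }
        END { if (NR > 0) print idx }
    '
}

# Query the real GPUs only where nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,utilization.gpu \
               --format=csv,noheader,nounits | least_loaded_gpu
fi
```

The index printed can then be passed to 'nvidia-smi -q -i N' to inspect that card in detail before starting a job on it.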

Are disk quotas imposed on the GPU cluster servers?

No, but as with all Maths systems, disk usage is continuously monitored and those who have used a large proportion of the available home directory storage will be asked to move data to one of the silo storage servers or to delete unwanted data.

Is user data on the cluster servers backed up?

Yes, all three servers are mirrored daily to ma-backup1 which in turn is mirrored to the Maths offsite server in Milton Keynes.

What about job scheduling and fair usage controls?

Job queueing and resource management are not used on the GPU clusters or the stand-alone server at present because, unlike on the Maths compute cluster in the past, fair usage and contention for resources have not been a problem with the GPU facilities. Also, it is very difficult to implement traditional HPC-style cluster job management on GPU cards because there is no low-level interface to the core and memory resources on any given GPU card, although it is possible to control the use of entire GPU cards. With the present small-scale clusters used by a small group of regular users, it is currently not worth implementing any form of job control.
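Although no scheduler is in place, you can still confine your own program to particular cards. The following is a minimal sketch, assuming your program uses the standard CUDA runtime, which honours the CUDA_VISIBLE_DEVICES environment variable; my_program is a hypothetical binary name, and this is a courtesy convention rather than a site policy.

```shell
#!/bin/sh
# Restrict CUDA programs started from this shell to GPU 1 only.
# Inside the program, that card is then enumerated as device 0.
export CUDA_VISIBLE_DEVICES=1

# Hypothetical binary: run it only if it exists in the current directory.
if [ -x ./my_program ]; then
    ./my_program
fi

echo "CUDA programs started from this shell will see GPU(s): $CUDA_VISIBLE_DEVICES"
```

Note that the device numbering used by the CUDA runtime does not necessarily match the numbering shown by nvidia-smi, so it is worth confirming with deviceQuery which card a given index refers to.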

About the GPU clusters

The host servers nvidia1 and nvidia2 are blade servers fitted into a Dell C6100 chassis, with each server separately connected via iPASS links to a Dell C410x PCI-express expansion chassis which is capable of housing up to 16 GPU cards. The chassis is configured so that 8 GPU card bays connect to one server and the other 8 bays to the other server, although not all of the bays are populated with GPU cards. The servers each have two 2.67 GHz quad-core Xeon CPUs and 24 GB of memory.

nvidia3 is a SuperMicro GR1027GR-72R2 GPU server that can accommodate up to 3 GPU cards, although only one is fitted at present.


Andy Thomas

Research Computing Manager
Department of Mathematics

last updated: 22.03.2017