HPC’s Shift to the Cloud

Timothy Prickett Morgan writes on The Next Platform about the slow but inevitable shift to cloudy infrastructure. It seems that a tipping point has been reached, where the amount of IT money spent on “cloudy” infrastructure overtook the amount spent on “traditional” datacentre gear. This happened in 2018Q3, according to the IDC report cited in the article.

Prickett Morgan suggests that the transformation from bare metal to the cloud has been faster in HPC than in enterprise IT. In some senses, this makes sense, because HPC has long had the sorts of abstractions between the application and its environment that make it possible to change infrastructure. The days where an atomic energy or climate situation would be capable of running only on dedicated hardware with integrated bench seating are long gone, and all of the top supercomputers are now (highly tuned, admittedly) GNU/Linux clusters running on normal-ish CPUs: mostly Intel, some IBM POWER, and ARM are moving from evaluation to deployment too. All of these technologies, as well as the Nvidia GPUs used in CUDA codes and deep learning applications, and even Google’s TPUs, are to be found in public cloud environments from the big providers.

On the other hand, there still are big honkin’ boxes of bare metal, with the number one spot changing almost every year. So not all HPC applications are cloud-suitable, and some of those codes that people are trying to port to the cloud may prove challenging. Here’s a summary of the components of a “traditional” HPC deployment, and how it might help or hinder adaptation to the cloud.

Infrastructure

Modules

Plenty of HPC sites already virtualise their filesystems to some extent, with the Modules package. Modules let administrators separate the installation and management of packages from their availability to users, by defining modulefiles that configure the environment to make each package accessible.

Where a team is already using modules to set up its environment for building or running codes, adopting containers and similar abstractions should be straightforward. Docker images, for example, can contain the module packages and the environment changes necessary to use the modules, and the HPC application image can be composed on top to include the relevant environment.

Job submission

HPC systems tend to already be built with the kind of self-service in mind that devops teams in commercial software development strive to provide. This heritage has evolved from the necessarily multi-user nature of a large supercomputer deployment. Mainframe batch submission systems, grid middleware (such as Sun -> Oracle -> Univa Grid Engine) and SLURM are based around the idea that a user can request a certain amount of resources to run their codes, the request being queued until the resources are available.

The open source SLURM project already supports cloud-native demand scheduling. Others are using Kubernetes as an elastic demand scheduler.

However, a lot of teams find job-specific submission scripts with hard coded assumptions about the environment they will run on, and codes that are tightly coupled to the submission script. Loosening that coupling will require some effort, but will make the codes “portable” to a cloud environment and enable new workflows for testing and development.

File systems

HPC sites frequently use high-performance parallel filesystems like Lustre or IBM’s GPFS. While these filesystems can be deployed to a cloud environment, the performance characteristics will differ and it will be harder to tune to the specific topology offered by a physical deployment. Notice that HPC filesystems do not perform well in some scenarios so some applications like AI training may benefit from re-evaluating the data access strategy anyway. Portable codes could be tested against new hardware without significant capital outlay; for example Google Cloud uses Intel Optane non-volatile memory.

Job-specific nodes

A traditional cluster will often have login nodes for accessing the cluster from the scientific workstations, batch nodes for running and using the batch submission systems, compiler nodes for building codes, metadata nodes if it uses a parallel filesystem, and finally compute nodes on which the simulations and deep learning jobs are actually executed. The compute nodes may be divided into groups to service different queues, or to differentiate between testing/debugging and production jobs.

While operations teams may be interested in getting close to 100% utilisation out of the compute nodes, the fact is that the other classes of machine exist because they need different configurations, not because they need to always be available with dedicated hardware. They are ideal candidates to lead the transition to on-demand scaling, perhaps treating a physical cluster as a “private cloud” that commits as much hardware to compute as possible, scaling its other functions as needed.

Meanwhile, compilation and computation can be modelled as serverless workloads, consuming resource when they are executing but scaling to zero when not in use.

Application Support

MPI

MPI libraries like Open MPI already support demand-based scaling at job launch, using the -np option to control how many processes are started and the --hostfile to indicate where those hosts are. In principle it might seem like the hosts in the host file could be discovered using the Kubernetes service registry or similar services from other cloud orchestration layers. In practice the MPI library will need to support launching the process on the nodes so a middleware (see above) will still need to be deployed on top, or the MPI software extended with native support for the cloud’s orchestration API.

Software Licences

This turns out to be one of the biggest hurdles for demand-scaling for many teams. HPC software such as proprietary compilers, numerical algorithms libraries and developer tools are licensed with a particular maximum number of parallel uses. Lab-developed codes may have evolved with assumptions about where the licence file is located, and not built defensively against the possibility that a licence can’t be checked out. The ISV may have built assumptions into their licensing scheme, for example the host having a fixed IP or MAC address. A researcher or developer could have copied a particular licence file into their home directory, using that beyond other agreements being arranged with the vendor.

Where the licensing scheme is flexible enough to allow portability of the software, a good technique is to centralise management of the licenses using a secrets store, for example Vault, and to inject them into the HPC applications’ containers when they are launched.

Alternatively, particularly if the licensing scheme is too rigid, it’s worth evaluating the effort required and performance impact sustained to port the code to a different technology, for example an open source compiler. The trade off of this approach is that on the one hand, increased deployment flexibility is strategically beneficial, but the short term costs, staffing requirements and impact on the scientific mission can make it hard to justify or unworkable.

Conclusion

While there are significant benefits to be had in porting high-performance codes to cloud environments, the task is not without its challenges. Labrary consultancy with Graham Lee, bringing his experience in cloud-first devops teams, scalable systems at Facebook, and High-Performance Computing on ARM, can help your team identify and overcome these challenges.

Graham will be at the HPC, Big Data and Data Science devroom at FOSDEM in Brussels, February 2-3. Say hello, grab some time and let’s move your codes forward!

Infrastructure

Modules

Job submission

File systems

Job-specific nodes

Application Support

MPI

Software Licences

Conclusion

About Graham

Leave a Reply

OOP the Easy Way

APPropriate Behaviour

APPosite Concerns

FSF