I was asked the other day what was the connection between High Performance Computing (HPC) and Clouds, so I thought I would try to post an answer here. Let's first talk a little bit about HPC.
High Performance Computing is about finding every single flops and every single iops on the largest machine you can get your hands on, in order to run your code as fast as possible. It is about batch processing on as many cores as you can get, so you can solve the largest problem you are facing. For a while, supercomputers, were large shared memory machines but in the late nineties distributed memory systems appeared, they were cheaper and you could assemble lots of nodes to get hundreds of cpus. Today the Top500 supercomputers are ranked every 6 months, this ranking is the theater of great technological battle between countries, vendors, research labs and programmers. In the latest ranking, Sequoia the supercomputer from Lawrence Livermore National lab topped the ranking at 16.32 PetaFlop/s and 1,572,864 cores. Weather modeling, atomic weapons simulation, molecular dynamics, genomics and high energy physics are among those that benefit from HPC.
There is big difference however within HPC itself. It is the difference between applications that rely heavily on inter-process communication and need a low latency network for message passing, and applications where each process runs an independent task, the so-called embarrassingly parallel applications. (e.g Map-reduce is an example of how to express an embarrassingly parallel problem ). High Throughput Computing (HTC) defines the type of application where access to a large number of cores over a specific amount of time is needed. Protein folding popularized by the Folding@home project running on PS3 as well as desktops is a good example. Financial simulation such as stock price forecasting and portfolio analysis also tend to fall in that category due to their statistical nature. Graphics rendering for animated movies also falls under HTC. HTC cares less about performance -as measured by FLOPS- and more about productivity -as in processing lots of jobs-.
The HPC quest for performance seems totally antagonist with the IaaS layer of clouds, at least when one thinks of true HPC workload that consumes every flop. Virtualization, the key enabler of IaaS, introduces overhead, both in cpus and network latency, and thus has been deemed "evil" for true HPC. Despite directed I/O, pass thrus, VM pinning and other tuning possibilities to reduce the overhead of virtualization, you might think that this would be it, no connection between HPC and Clouds. However according to a recent academic study of hypervisor performance from a group at Indiana University, this may not be entirely true and it would also be forgetting the users and their specific workloads.
In november 2010 a new player in the Top 500 arrived: Amazon EC2. Amazon submitted a benchmark result which placed an EC2 cluster 233rd on the top 500 list. By June 2011, this cluster was down to rank 451. Yet it proved a point: that a Cloud based cluster could do High Performance computing, raking up 82.5 TFlops peak using VM instances and 10GigE network. In november 2011, Amazon followed with a new EC2 cluster ranked 42nd with 17,023 cores and 354 TFlops peak. This cluster is made of "Cluster Compute Eight Extra Large" instances with 16 cores, 60 GB of RAM and 10 GigE interconnect and now ranked 72nd. For $1000 per hour this allows users to get an on-demand HPC cluster that itself ranks in the top500. This is done on-demand and provides users with their personal cluster....