Nvidia's Fermi hits flop-hungry challengers
HPC players tool up
Nvidia's Fermi graphics coprocessors have begun shipping through its OEM partner channel with a slew of tier-two players hoping the flop-happy GPUs give them a competitive edge against established players in the HPC server racket.
The Fermi graphics cards and GPU coprocessors that are based on them were both previewed last November at the SC09 supercomputing conference. The Fermi graphics chips previewed had 512 cores, but for reasons that Nvidia has not explained - and which probably involve chip yields and heating issues - the GeForce graphics cards and Tesla 20 coprocessors that have started shipping only have 448 working cores. And that means their floating-point performance is a little lower than expected.
The Tesla coprocessors come in three different form factors, which was not apparent at the launch last November. The C series GPU coprocessors have fans on them and plug into workstations and personal supercomputers (basically, an x64 workstation on steroids); the M series are fanless units intended for hybrid CPU-GPU setups within the same chassis; and the S series are GPU appliances that plug into servers through external PCI Express links and pack up to four GPUs into a 1U chassis.
Back in November, Nvidia was saying that the C2050 and the C2070, which had initial ratings of 520 and 630 gigaflops doing double-precision math and which cost $2,499 and $3,999, respectively, would use the 512-core Fermi chips. In early April, Nvidia started shipping the C2050, but with only 448 cores and rated at 515 gigaflops double-precision, and the C2070 was pushed out to the third quarter. It's a fair guess that with the number of cores dropping by 12.5 per cent in the C2050 but the aggregate performance of the GPU coprocessor only dropping by one per cent, Nvidia cranked up the clock speed to make up for the lower GPU core count.
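That guess can be put through some back-of-envelope arithmetic. What follows is a minimal sketch that assumes peak flops scale linearly with core count times clock speed, which is a simplification; the function name is made up for illustration and nothing here comes from Nvidia.

```python
# Assumption (not from Nvidia): peak flops scale linearly with cores x clock.
def required_clock_scale(cores_before, cores_after, gflops_before, gflops_after):
    """Factor by which the clock must rise to reach gflops_after with fewer cores."""
    core_ratio = cores_after / cores_before
    perf_ratio = gflops_after / gflops_before
    return perf_ratio / core_ratio

core_drop = 1 - 448 / 512   # fraction of cores lost going from 512 to 448
perf_drop = 1 - 515 / 520   # fraction of rated double-precision gigaflops lost
scale = required_clock_scale(512, 448, 520, 515)

print(f"cores down {core_drop:.1%}, flops down {perf_drop:.1%}")
print(f"implied clock increase: {scale - 1:.1%}")
```

Under that linear assumption, the clock would have had to rise by roughly 13 per cent to land on the 515 gigaflops rating.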
There were to be two variations of the S series GPU appliances, the S2050 appliance using the C2050 GPUs, rated at 2.08 teraflops and costing $12,995, and the S2070 appliance using the faster C2070 GPUs rated at 2.52 teraflops and costing $18,995. The S series boxes aren't shipping yet, and they will be based on the 448-core C series GPUs, likely providing a little less floppy oomph. Sources at Nvidia say that the S series GPU appliances are still on track for delivery this quarter.
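The announced appliance ratings line up with four GPUs per box. Here is a quick sketch, assuming an appliance's rating is simply four times the per-GPU double-precision number; the figure for a 448-core S2050 is an extrapolation on that assumption, not an Nvidia spec.

```python
GPUS_PER_APPLIANCE = 4  # the S series packs up to four GPUs into a 1U chassis

def appliance_teraflops(gflops_per_gpu, gpus=GPUS_PER_APPLIANCE):
    """Aggregate double-precision teraflops, assuming simple linear scaling."""
    return gpus * gflops_per_gpu / 1000

s2050_announced = appliance_teraflops(520)  # matches the 2.08 teraflops rating
s2070_announced = appliance_teraflops(630)  # matches the 2.52 teraflops rating
s2050_448_core = appliance_teraflops(515)   # extrapolated from the shipping C2050

print(s2050_announced, s2070_announced, s2050_448_core)
```

On that basis, a 448-core S2050 would come in at about 2.06 teraflops, which is indeed only a little less oomph.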
Nvidia started peddling the Fermi GPUs in its GeForce graphics card lineup during the first quarter.
The news today is that the Tesla M2050 embedded GPU coprocessor, which is based on the C2050 card as the name suggests and which is rated at the same 515 gigaflops of double-precision and 1.03 teraflops single-precision floating point performance, has begun shipping through OEM server partners. Appro and Super Micro were the first to announce systems using the M series GPUs. (You have to hunt around the Nvidia site to find the M2050 spec sheet, so let me save you the trouble.)
Oak Ridge boys
Nvidia planned to host a big shindig in Washington DC to kick off the M series, with presentations from Oak Ridge National Laboratory, talking about how hybrid CPU-GPU systems are the wave of the future, and from Georgia Tech, which has a project called Keeneland for creating applications that run on hybrid CPU-GPU machines.
Oak Ridge is, of course, one of the first big customers for the Fermi GPUs. Last October, before the Fermi GPU coprocessors were unveiled by Nvidia at SC09 but after the Fermi chips on which they are based were detailed, the Cray XT "Jaguar" massively parallel Opteron super at Oak Ridge weighed in at 1.06 petaflops using the Linpack Fortran benchmark test as a gauge. Shortly thereafter, the upgraded Jaguar machine was pushed to 1.76 petaflops by the addition of new Opteron cores.
The only reason this matters is that in early October last year, Oak Ridge said that it would be building a hybrid CPU-GPU super based on Nvidia cards that would have at least ten times the oomph of Jaguar, most likely meaning it would break the 10 petaflops barrier but not the 20 petaflops barrier. Oak Ridge was intentionally vague, perhaps because it was unsure of what the performance of such a hybrid machine might be.
There is also a rumor going around that Oak Ridge was unhappy about the performance of the Nvidia Tesla 20 GPUs and has canceled the project, but Nvidia says this is untrue. Oak Ridge has yet to say exactly what it is building.
Appro, one of the surviving boutique HPC vendors, has been jazzed about GPU coprocessors for years, and John Lee, vice president of advanced technology solutions at the company, has no qualms about saying that the advent of GPU coprocessing for HPC clusters is as significant a technology change as - and perhaps more significant than - the introduction of low-latency, high-bandwidth InfiniBand networking a decade ago. It was InfiniBand that killed off proprietary interconnects, and it may be GPUs that kill off the idea that CPUs designed to run general-purpose workloads as well as doing calculations are the right kind of machine for doing massive amounts of math in parallel.
Appro has cooked up two different hybrid machines using the M2050 GPU coprocessor, one a rack design and the other a hybrid blade design.
The rack machine is the Tetra 1U server, which comes in flavors using x64 processors from Intel or Advanced Micro Devices. The Tetra 1426G4 server is a two-socket machine sporting Intel's latest six-core Xeon 5600 processors and also cramming four - yes, I said four and I triple checked it because it was hard to believe - M2050 GPU coprocessors in the chassis.
Appro's Tetra 1U CPU-GPU box: Looks like a normal rack server, but packs a floppy punch
According to the spec sheet, this machine has eight DDR3 memory slots for the Xeon chips, supporting up to 96GB of capacity. That can't be right, since there are no 12GB memory sticks as far as I know. It has to be either 96GB with a dozen memory slots or 128GB with eight slots using 16GB sticks.
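The objection comes down to dividing capacity by slot count. Here it is spelled out as a trivial sketch; the helper name is purely illustrative.

```python
def stick_size_gb(total_gb, slots):
    """Capacity each memory stick must have to reach total_gb across the slots."""
    return total_gb / slots

as_listed = stick_size_gb(96, 8)      # 12 GB sticks, the size the spec sheet implies
with_12_slots = stick_size_gb(96, 12) # 8 GB sticks if there are a dozen slots
max_with_16gb = 8 * 16                # 128 GB if eight slots take 16 GB sticks

print(as_listed, with_12_slots, max_with_16gb)
```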
The Tetra chassis has room for six 2.5-inch SATA disks. The GPUs plug into two PCI Express x16 slots with a riser card, but there is one x4 slot left over with a riser card if you need to add something else. The Tetra 1326G4 is the AMD version of this server, and it sports the eight-core and twelve-core Opteron 6100 processors in a two-socket configuration. This AMD box has the same PCI Express 2.0 slots as the Intel machine, with two of the three taking the M2050 GPUs. The spec sheet for this machine says it supports 128GB in eight slots. The Tetra machines will offer 80 teraflops in a rack with 40 machines, and it will take only about a dozen racks to reach a petaflops.
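The rack math can be roughed out from the M2050's double-precision rating. This sketch counts peak GPU flops only and assumes they add up linearly; it ignores whatever the host Xeons or Opterons contribute, and the constant names are just for illustration.

```python
import math

M2050_DP_GFLOPS = 515    # per GPU, double precision
GPUS_PER_SERVER = 4
SERVERS_PER_RACK = 40

# Peak GPU-only teraflops for a full rack of Tetra servers.
rack_tflops = M2050_DP_GFLOPS * GPUS_PER_SERVER * SERVERS_PER_RACK / 1000

# Teraflops delivered by a dozen racks, and racks needed to pass 1,000 teraflops.
dozen_racks_tflops = 12 * rack_tflops
racks_for_petaflops = math.ceil(1000 / rack_tflops)

print(rack_tflops, dozen_racks_tflops, racks_for_petaflops)
```

On GPU flops alone a rack comes to 82.4 teraflops, so a dozen racks land just shy of a petaflops, at about 989 teraflops. The CPUs in the servers would have to supply the remainder, or a thirteenth rack would.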
The Tetra hybrid machines offer about twice the compute density of prior rack-mounted CPU-GPU machines, according to Lee. They will be shipping at the end of May, and a base configuration with all four Fermi GPUs will sell for under $13,000. A beefier configuration with faster processors, more memory, and other options could cost as much as $20,000. Still, $10 per gigaflop is pretty good - provided your applications know how to speak GPU.
Among the blades
If blades float your boat, then Appro has a GreenBlade hybrid setup for you. Or rather, two. One is based on a two-socket Xeon 5600 blade lashed to a GPU blade with two M2050s, while the other uses a two-socket Opteron 6100 blade with the same GPU blade. The GreenBlade system was announced in February 2009 and was tapped by the San Diego Supercomputer Center, Appro's flagship customer, as the basis of a flash-heavy super nicknamed "Gordon" that SDSC said last fall it would build.
The Intel/Nvidia combo blade marries the gB222X blade, which supports up to 96GB of memory and two six-core Xeon 5600 processors, to the GXB100 GPU blade, which has two M2050 GPUs on it. The AMD blade is the gB322H, a two-socket machine with up to 96GB of memory that supports either the eight-core or twelve-core Opteron 6100 processors. It also links directly to the GXB100 GPU blade over PCI Express links.
The GreenBlade is based on a 5U chassis with ten slots, so you can put five of these hybrid blades in the box for about 5 teraflops of aggregate number-crunching performance. With all five blades in the box and a relatively light amount of memory, such a GreenBlade will run you about $30,000, or about $6 per gigaflop. On skinny memory configurations, the Tetras are running about $6.50 per gigaflop. That's not much of a premium for twice the density. The GreenBlade CPU-GPU hybrid boxes will be available in the middle of May.
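The price and density comparison can be worked through from the quoted list prices. This is a rough sketch that counts only the GPUs' 515 gigaflops double-precision rating and ignores CPU flops; the function and variable names are illustrative.

```python
M2050_DP_GFLOPS = 515  # per GPU, double precision

def dollars_per_gigaflop(price_usd, gpus):
    """Price divided by peak GPU-only double-precision gigaflops."""
    return price_usd / (gpus * M2050_DP_GFLOPS)

def teraflops_per_rack_unit(gpus, rack_units):
    """Peak GPU-only teraflops per 1U of rack space."""
    return gpus * M2050_DP_GFLOPS / 1000 / rack_units

tetra_base = dollars_per_gigaflop(13_000, 4)     # 1U Tetra, base configuration
tetra_loaded = dollars_per_gigaflop(20_000, 4)   # 1U Tetra, beefier configuration
greenblade = dollars_per_gigaflop(30_000, 10)    # 5U chassis, five blades x two GPUs

density_ratio = teraflops_per_rack_unit(4, 1) / teraflops_per_rack_unit(10, 5)

print(f"{tetra_base:.2f} {tetra_loaded:.2f} {greenblade:.2f} {density_ratio:.1f}")
```

Counted this way, the figures come out a shade under the round numbers quoted, which look to be based on 2 and 5 teraflops per box. The density ratio between the Tetra and the GreenBlade works out to two.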
Appro's GreenBlade: not as dense on the flops as the Tetra racks.
Lee says that Appro is not just thinking about Nvidia GPUs, but is also planning a similar set of products using AMD's FireStream GPUs. But the FireStream GPUs have some issues - such as not having multi-level caches or error correction on the cache and on the cards' main memory, as the Nvidia GPUs do. Moreover, a lot of people have lined up behind the CUDA environment that Nvidia has created for the Tesla GPUs, and OpenCL is not quite there yet as far as Lee is concerned. But he is convinced it will gain traction over time, and it's hard to ignore the single-precision flops advantage that the latest FireStreams (based on the GPUs used in the Radeon 5870 graphics cards) have over the Fermis: almost three times as high at 2.72 teraflops per GPU, along with 544 gigaflops at double precision.
Whitebox server maker Super Micro is also among the first companies to ship the M2050 embedded GPU coprocessors inside x64 systems aimed at HPC shops.
Super Micro's 1U Tesla 20–based hybrid box has the tongue-twisting name 6016GT-TF-FM205. It's a two-socket box based on Intel's Xeon 5600s and has two PCI Express slots for linking up two M2050 GPUs. Super Micro also has a personal supercomputer, the 7046GT-TRF-FC405, that can be converted into a 4U rack server that supports four C2050 cards, which have the fans to keep them cool.
You will no doubt note that Appro is getting twice as many M2050s into its 1U server as is Super Micro. Engineering still matters. And perhaps so does halon gas. ®