The State of GPU Computing: Is the CPU Dead Yet?
Massively parallel computing engines inside GPUs make them ideal for a wide range of tasks in addition to graphics. But where are the applications?
In the dark ages of PC gaming, the CPU took care of most of the graphics chores. The graphics chip did just the basics: some raster operations, dedicated text modes, and such seemingly quaint tasks as dithering colors down to 256 or 16 colors. As Windows took hold, the graphics equation began to shift a bit, with some Windows bitmap operations handled by “Windows accelerators.” Then along came hardware like the 3dfx Voodoo and the Rendition V1000, and accelerated 3D graphics on the PC took off.
Now it’s coming full circle. Today’s GPUs are fully capable of running massively parallel, double-precision floating-point calculations. GPU computing allows the 3D graphics chip inside your PC to take on other chores. The GPU isn’t just for graphics anymore.

The Fermi Die - GPU compute pioneer Nvidia advanced the cause with its Fermi architecture, which features 512 CUDA cores primed for computational chores.
GPU computing has its roots in an academic movement known as GPGPU, short for “general purpose computing on graphics processing units.” Early GPGPU efforts were limited due to the difficulty of trying to get pre-DirectX 9 GPUs to work effectively with floating-point calculations. In the DirectX 11 era, GPU architectures have evolved, taking on some of the characteristics of traditional CPUs, like loops and branches, dynamic linking, and large addressable memory space, among others.
The new age of GPU compute is also more open. DirectCompute, built into DirectX 11, supports all the major DirectX 11-capable hardware. OpenCL supports multiple operating system platforms, including mobile. We’ll look at each of the major hardware manufacturers and APIs for GPU computing, as well as some applications that utilize the technology.
State of the Hardware
If we stick with GPU hardware, there are currently just two developers shipping GPU compute-enabled hardware: AMD and Nvidia. They’ll soon be joined by Intel, however, with the integrated GPU in the upcoming Ivy Bridge CPU. Let’s take a look at each of them in turn.
Nvidia: Tesla and CUDA
The first attempts at GPGPU used Nvidia GPUs. There were some early experiments with machine-vision applications that actually ran on very early GeForce 256‑series cards, which didn’t even have programmable shaders. However, efforts began to blossom when DirectX 9’s more flexible programmable-shader architecture arrived.
Nvidia took note of these early efforts, and realized that GPUs were potentially very powerful tools, particularly for scientific and high-performance computing (HPC) tasks. So the company's architects began to think about how to make the GPU more useful to general purpose programming. Until then, GPUs were great for graphics, but trying to write applications that were more general was difficult. There were no loops or returns, for example, and shader programs severely restricted the number of lines of code permitted.
Part of the issue, of course, was the lock DirectX 9 had on GPU hardware architecture. Back in the DirectX 9 era, any implementation of features to make life easier for non-graphics applications would be outside of the DirectX standard. Given the raw floating-point and single-instruction, multiple-data (SIMD) performance, however, graphics processors looked like good candidates for certain classes of supercomputing tasks.

The first iteration of Nvidia's CUDA GPU computing platform ran on the 8800 GTX.
To further the GPGPU movement, Nvidia created a more compute-friendly software development framework called CUDA (Compute Unified Device Architecture). With the first release, CUDA 1.0, programmers could use standard C, plus Nvidia extensions, to develop applications, rather than work through the more cumbersome shader language process. In other words, general purpose apps didn’t have to be written like graphics code. CUDA worked with the 8800 GTX and related GPUs. That generation of graphics processors spawned the first products dedicated to GPU compute, the Tesla 870 line.
Since the early days of the 8800, Nvidia has continued to build in architectural features to make the GPU a better general purpose programming tool. The goal isn’t to make the GPU a replacement for the CPU. CPUs still excel at linear or small-scale multithreaded applications. However, GPUs are potentially excellent at large-scale parallel programming applications involving hundreds of threads operating on large volumes of separate but similar data. That programming model is ideal for a certain class of scientific and high-performance applications, including financial analysis.
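To make the "many threads, separate but similar data" model concrete, here is a minimal sketch in plain Python that mimics how a GPU kernel is typically written: one small function runs once per element, and each invocation uses its own index to pick the data it works on. The names saxpy_kernel and launch are hypothetical, and the loop merely stands in for hardware threads that a GPU would run concurrently. Treat it as an illustration of the idea, not as CUDA code.

```python
# CPU-side emulation of the GPU "kernel" model (illustrative only).
# saxpy_kernel and launch are hypothetical names, not part of any GPU API.

def saxpy_kernel(thread_id, a, x, y, out):
    """Work done by ONE logical thread: a single element of a*x + y."""
    # Guard: a launch may include more threads than there are elements.
    if thread_id < len(out):
        out[thread_id] = a * x[thread_id] + y[thread_id]


def launch(kernel, num_threads, *args):
    """Run the kernel once per thread index.

    A GPU would execute these invocations in parallel. This loop runs them
    one after another, which gives the same result because no invocation
    reads or writes another invocation's element.
    """
    for thread_id in range(num_threads):
        kernel(thread_id, *args)


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
result = [0.0] * len(x)

# Launch 8 logical threads for 4 elements; the extra threads do nothing.
launch(saxpy_kernel, 8, 2.0, x, y, result)
print(result)
```

Because every invocation touches only its own element, the work can be spread across as many threads as the hardware offers, which is why this style of problem suits a GPU better than code with long chains of dependent steps.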
It’s significant that Nvidia positioned its latest Fermi architecture as a GPU compute platform before launching it as a graphics processor. The Fermi architecture brought substantial hardware enhancements to make it a better general purpose processor. These include fast atomic memory operations (which means a single memory location won’t be corrupted when different threads try to update it at the same time), a unified memory architecture, better context switching, and more. Since Fermi’s launch, Nvidia has also updated its CUDA software platform several times, which we’ll discuss shortly.
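The value of atomic operations is easiest to see when many threads update the same memory location, as in building a histogram. The sketch below uses Python's threading module, with a lock standing in for a hardware atomic add. The names atomic_add and count_bins are hypothetical, and a GPU would perform the indivisible update in hardware rather than with a software lock, so this only illustrates the guarantee, not how Fermi implements it.

```python
import threading

NUM_BINS = 4
histogram = [0] * NUM_BINS
_lock = threading.Lock()  # stands in for the hardware's atomic guarantee


def atomic_add(counts, index, value):
    """Emulated atomic add: the read-modify-write is one indivisible step."""
    with _lock:
        counts[index] += value


def count_bins(chunk):
    """Work for one thread: tally each value in its chunk into the shared bins."""
    for v in chunk:
        atomic_add(histogram, v % NUM_BINS, 1)


data = list(range(1000))
# Split the input into 8 interleaved chunks, one per thread.
chunks = [data[i::8] for i in range(8)]

threads = [threading.Thread(target=count_bins, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(histogram)
```

Without an indivisible update, two threads could read the same old count, each add one, and write back the same value, silently losing an increment. Making that step fast in hardware is what lets thousands of threads share counters safely.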
Nvidia didn’t just see GPU compute as something for oil exploration and academic computing. Nvidia acquired Ageia, the company behind PhysX, several years ago, discarding the dedicated hardware but keeping the broadly used physics API, so the GPU can accelerate physics calculations. The company has also worked with game developers to incorporate GPU compute into games, for water simulation, optical lens effects, and other compute-intensive tasks. Finally, it has worked with a number of mainstream companies like ArcSoft, Adobe, and CyberLink to enable GPU‑accelerated video transcoding in both high-end and consumer-level video applications.
All the work on Fermi as a compute platform has paid off, as Nvidia’s Tesla compute hardware sales topped $100M last year. Fermi doesn’t get the attention that the desktop graphics or mobile processor divisions have been getting, but its existence has enabled Nvidia to remain at the top of the heap for GPU compute. Still, competitors are nipping at its heels.