The State of GPU Computing: Is the CPU Dead Yet?
Massively parallel computing engines inside GPUs make them ideal for a wide range of tasks in addition to graphics. But where are the applications?
In the dark ages of PC gaming, the CPU took care of most of the graphics chores. The graphics chip did just the basics: some raster operations, dedicated text modes, and such seemingly quaint tasks as dithering colors down to 256 or 16 colors. As Windows took hold, the graphics equation began to shift a bit, with some Windows bitmap operations handled by “Windows accelerators.” Then along came hardware like the 3dfx Voodoo and the Rendition V1000, and accelerated 3D graphics on the PC took off.
Now it’s coming full circle. Today’s GPUs are fully capable of running massively parallel, double-precision floating-point calculations. GPU computing allows the 3D graphics chip inside your PC to take on other chores. The GPU isn’t just for graphics anymore.

The Fermi Die - GPU compute pioneer Nvidia advanced the cause with its Fermi architecture, which features 512 CUDA cores primed for computational chores.
GPU computing has its roots in an academic movement known as GPGPU, short for “general purpose computing on graphics processing units.” Early GPGPU efforts were limited by the difficulty of getting pre-DirectX 9 GPUs to work effectively with floating-point calculations. In the DirectX 11 era, GPU architectures have evolved, taking on some of the characteristics of traditional CPUs, such as loops and branches, dynamic linking, and a large addressable memory space.
The new age of GPU compute is also more open. DirectCompute built into DirectX 11 supports all the major DirectX 11-capable hardware. OpenCL supports multiple operating system platforms, including mobile. We’ll look at each of the major hardware manufacturers and APIs for GPU computing, as well as some applications that utilize the technology.
State of the Hardware
If we stick with GPU hardware, there are currently just two developers shipping GPU compute-enabled hardware: AMD and Nvidia. They’ll soon be joined by Intel, however, with the integrated GPU in the upcoming Ivy Bridge CPU. Let’s take a look at each of them in turn.
Nvidia: Tesla and CUDA
The first attempts at GPGPU used Nvidia GPUs. There were some early experiments with machine-vision applications that actually ran on very early GeForce 256‑series cards, which didn’t even have programmable shaders. However, efforts began to blossom when DirectX 9’s more flexible programmable-shader architecture arrived.
Nvidia took note of these early efforts, and realized that GPUs were potentially very powerful tools, particularly for scientific and high-performance computing (HPC) tasks. So the company's architects began to think about how to make the GPU more useful to general purpose programming. Until then, GPUs were great for graphics, but trying to write applications that were more general was difficult. There were no loops or returns, for example, and shader programs severely restricted the number of lines of code permitted.
Part of the issue, of course, was the lock DirectX 9 had on GPU hardware architecture. Back in the DirectX 9 era, any implementation of features to make life easier for non-graphics applications would be outside of the DirectX standard. Given the raw floating-point and single-instruction, multiple-data (SIMD) performance, however, graphics processors looked like good candidates for certain classes of supercomputing tasks.

The first iteration of Nvidia's CUDA GPU computing platform ran on the 8800 GTX.
To further the GPGPU movement, Nvidia created a more compute-friendly software development framework. CUDA 1.0 was the first version of Nvidia’s CUDA (Compute Unified Device Architecture) software platform. Programmers could now use standard C, plus Nvidia extensions, to develop applications, rather than having to work through the more cumbersome shader-language process. In other words, general purpose apps didn’t have to be written like graphics code. CUDA worked with the 8800 GTX and related GPUs. That generation of graphics processors spawned the first products dedicated to GPU compute, the Tesla 870 line.
Since the early days of the 8800, Nvidia has continued to build in architectural features to make the GPU a better general purpose programming tool. The goal isn’t to make the GPU a replacement for the CPU. CPUs still excel at linear or small-scale multithreaded applications. However, GPUs are potentially excellent at large-scale parallel programming applications involving hundreds of threads operating on large volumes of separate but similar data. That programming model is ideal for a certain class of scientific and high-performance applications, including financial analysis.
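To make that programming model concrete, here is a minimal sketch in plain Python. It is a CPU-side emulation, not CUDA code, and the names saxpy_kernel and launch are hypothetical, invented for illustration. The idea it shows is that the programmer writes the work for one element, and the hardware runs that same small function once per element across many threads.

```python
def saxpy_kernel(i, a, x, y, out):
    """Work done by one logical thread: handle element i and nothing else."""
    out[i] = a * x[i] + y[i]


def launch(kernel, n, *args):
    """Hypothetical stand-in for a GPU launch: run the kernel once per index.

    A real GPU would execute these invocations in parallel. Here they run
    one after another, which gives the same result because no invocation
    reads or writes data that belongs to another.
    """
    for i in range(n):
        kernel(i, *args)


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)

launch(saxpy_kernel, len(x), 2.0, x, y, out)
print(out)  # [12.0, 24.0, 36.0, 48.0]
```

Because each invocation touches only its own slot of the output, the order in which the invocations run doesn't matter. That independence is what lets a GPU spread the work across hundreds of threads.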
It’s significant that Nvidia positioned its latest Fermi architecture as a GPU compute platform before launching it as a graphics processor. The Fermi architecture brought substantial hardware enhancements to make it a better general purpose processor. These include fast atomic memory operations (which ensure that a single memory location isn’t corrupted when different threads try to update it at the same time), a unified memory architecture, better context switching, and more. Since Fermi’s launch, Nvidia has also updated its CUDA software platform several times, which we’ll discuss shortly.
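The value of atomic operations is easiest to see with a shared counter. The sketch below is a deterministic simulation in plain Python, not GPU code, and the function names and the schedule format are assumptions made for illustration. It replays a fixed interleaving of two threads that each try to add 1 to the same location, first with separate read and write steps, then with the update treated as one indivisible step.

```python
def racy_increment(schedule):
    """Non-atomic update: each thread reads the counter, then writes later."""
    counter = 0
    pending = {}  # thread id -> the value that thread read earlier
    for tid, step in schedule:
        if step == "read":
            pending[tid] = counter
        else:  # "write": store the stale value plus one
            counter = pending.pop(tid) + 1
    return counter


def atomic_increment(schedule):
    """Atomic update: read-modify-write happens as one indivisible step."""
    counter = 0
    for tid, step in schedule:
        if step == "write":
            counter += 1
    return counter


# Both threads read before either one writes.
interleaved = [(0, "read"), (1, "read"), (0, "write"), (1, "write")]

print(racy_increment(interleaved))    # 1 (one update is lost)
print(atomic_increment(interleaved))  # 2
```

With thousands of threads in flight, interleavings like this one are routine, which is why hardware support for fast atomics matters for general purpose code.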
Nvidia didn’t just see GPU compute as something for oil exploration and academic computing. Nvidia acquired Ageia, the company behind PhysX, several years ago, discarding the dedicated hardware but keeping the broadly used physics API, so the GPU can accelerate physics calculations. The company has also worked with game developers to incorporate GPU compute into games, for water simulation, optical lens effects, and other compute-intensive tasks. Finally, it has worked with a number of mainstream companies like ArcSoft, Adobe, and CyberLink to enable GPU‑accelerated video transcoding in both high-end and consumer-level video applications.
All the work on Fermi as a compute platform has paid off, as Nvidia’s Tesla compute hardware sales topped $100M last year. Fermi doesn’t get the attention that the desktop graphics or mobile processor divisions have been getting, but its existence has enabled Nvidia to remain at the top of the heap for GPU compute. Still, competitors are nipping at its heels.
Comments
wumpus
January 12, 2012 at 9:16am
Parallelism is the future, and always will be.
Parallel code has been synonymous with speed since roughly the 1980s, when single-chip processors could surpass old-fashioned board-level designs. From that point on, it has been easier to stamp out multiple copies of a chip (or core) than to build a core that is twice as fast.
You may have noticed that everything isn't parallel yet. Some things are easy. Graphics is typically considered "embarrassingly parallel", and thus GPUs have been steadily increasing in power in proportion to their transistors for years, while CPUs have stumbled. If your problem can be broken down into wide, parallel sections (especially producer/consumer threads that share no memory other than queues), you can take advantage of parallelism. If you have to share semaphores and memory: prepare for the heisenbugs of doom.
Then there is Amdahl's law. The catch is that the part that you can't split into multiple parts will dictate your speed. Expect it to be a significant chunk and limit your program to whatever one core of the CPU can do. Adding more GPU gives noticeably diminishing returns after that.
All this is just for CPU parallelism. From what I've seen of OpenCL, it looks pretty weak in terms of trying to merge data computed on other threads back together for the next pass. Presumably this will pass soon, but don't expect everything to be as easy as on a CPU.
orbonsj
January 11, 2012 at 5:21am
I can't speak to the gaming/consumer marketplace, but prior to retirement, I worked in scientific computing since 1965. The problem of utilizing multi-cpu resources has been around since the Cray 1 (or even before). There are two issues. The first is financial. Many commercial codes were developed and verified on single-cpu systems. Changing these codes to a parallel computing platform requires an enormous investment, particularly if you're talking about safety issues (think nuclear). The second is theoretical. Some algorithms require sequential operations, while others can be converted to parallel algorithms. These are being worked on, but the insertion of these techniques is relatively slow.
In short, it's the "If it ain't broke, don't fix it." attitude.
HokieTechie
January 10, 2012 at 9:22am
I use the same Phenom II X4 system for gaming and photo editing, so I "happened to have" a Radeon 6850 installed when DXO added OpenCL support to their image processing software. As a result, processing 16 megapixels of 14-bit RAW data from a Nikon D7000 now takes less than 10 seconds. When I was doing this CPU-only, the same task took 45-50 seconds.
And I only needed to install one Catalyst update to get the system stable . . .
Ulrich
January 10, 2012 at 8:15am
I use the open source Blender 2.61, which has full GPU rendering. In fact, when using the Cycles engine in Blender 2.61 you get realtime rendering in the viewport, which is really nice for setting up lighting and materials. It has the option of using the GPU or CPU (only CUDA is supported as of right now, so no ATI support yet). On my laptop the GPU is substantially faster at rendering than my CPU: NVIDIA GeForce GT540 graphics with 2.0GB Video Memory vs. Intel Core i7-2617M 1.5GHz (2.6GHz Turbo Mode, 4MB Cache).
If you are interested just google Blender cycles.
Ghost XFX
January 09, 2012 at 10:34pm
Huhuhuhu.
Is this what AMD is gambling their future on? It seems to me they put a whole lot of effort into their GPUs as of late compared to their actual CPUs. Personally, I don't like the way they're going right now. Hope they prove me wrong...
The Corrupted One
January 09, 2012 at 8:45pm
1. Thermal issues. Even if the CPU stays the CPU, there will come a point when Moore's Law becomes invalid, because you simply can't get that many transistors on one chip.
2. Modularity. CrossfireX, nuff said
LatiosXT
January 10, 2012 at 8:54am
Well technically they are MIMD processors if this diagram's description is correct: http://en.wikipedia.org/wiki/File:MIMD.svg
Give or take any shader is a PU, and it's instruction and data agnostic.
Jiffy
January 09, 2012 at 3:46pm
They don't even perform the same operations, of course it's not dead yet.