In the dark ages of PC gaming, the CPU took care of most of the graphics chores. The graphics chip did just the basics: some raster operations, dedicated text modes, and such seemingly quaint tasks as dithering colors down to 256 or 16 colors. As Windows took hold, the graphics equation began to shift a bit, with some Windows bitmap operations handled by “Windows accelerators.” Then along came hardware like the 3dfx Voodoo and the Rendition V1000, and accelerated 3D graphics on the PC took off.
Now it’s coming full circle. Today’s GPUs are fully capable of running massively parallel, double-precision floating-point calculations. GPU computing allows the 3D graphics chip inside your PC to take on other chores. The GPU isn’t just for graphics anymore.
GPU computing has its roots in an academic movement known as GPGPU, short for “general purpose computing on graphics processing units.” Early GPGPU efforts were limited by the difficulty of getting pre-DirectX 9 GPUs to work effectively with floating-point calculations. In the DirectX 11 era, GPU architectures have evolved, taking on some of the characteristics of traditional CPUs, like loops and branches, dynamic linking, and a large addressable memory space, among others.
The new age of GPU compute is also more open. DirectCompute built into DirectX 11 supports all the major DirectX 11-capable hardware. OpenCL supports multiple operating system platforms, including mobile. We’ll look at each of the major hardware manufacturers and APIs for GPU computing, as well as some applications that utilize the technology.
If we stick with GPU hardware, there are currently just two developers shipping GPU compute-enabled hardware: AMD and Nvidia. They’ll soon be joined by Intel, however, with the integrated GPU in the upcoming Ivy Bridge CPU. Let’s take a look at each of them in turn.
The first attempts at GPGPU used Nvidia GPUs. There were some early experiments with machine-vision applications that actually ran on very early GeForce 256‑series cards, which didn’t even have programmable shaders. However, efforts began to blossom when DirectX 9’s more flexible programmable-shader architecture arrived.
Nvidia took note of these early efforts, and realized that GPUs were potentially very powerful tools, particularly for scientific and high-performance computing (HPC) tasks. So the company's architects began to think about how to make the GPU more useful to general purpose programming. Until then, GPUs were great for graphics, but trying to write applications that were more general was difficult. There were no loops or returns, for example, and shader programs severely restricted the number of lines of code permitted.
Part of the issue, of course, was the lock DirectX 9 had on GPU hardware architecture. Back in the DirectX 9 era, any implementation of features to make life easier for non-graphics applications would be outside of the DirectX standard. Given the raw floating-point and single-instruction, multiple-data (SIMD) performance, however, graphics processors looked like good candidates for certain classes of supercomputing tasks.
The first iteration of Nvidia's CUDA GPU computing platform ran on the 8800 GTX.
In order to further the GPGPU movement, Nvidia created a more compute-friendly software development framework. CUDA 1.0, as Nvidia dubbed the architecture, was the first version of Nvidia’s CUDA (Compute Unified Device Architecture) software platform. Programmers could now use standard C, plus Nvidia extensions, to develop applications, rather than have to work through the more cumbersome shader language process. In other words, general purpose apps didn’t have to be written like graphics code. CUDA worked with 8800 GTX and related GPUs. That generation of graphics processors spawned the first products dedicated to GPU compute, the Tesla 870 line.
Since the early days of the 8800, Nvidia has continued to build in architectural features to make the GPU a better general purpose programming tool. The goal isn’t to make the GPU a replacement for the CPU. CPUs still excel at linear or small-scale multithreaded applications. However, GPUs are potentially excellent at large-scale parallel programming applications involving hundreds of threads operating on large volumes of separate but similar data. That programming model is ideal for a certain class of scientific and high-performance applications, including financial analysis.
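To make that programming model concrete, here is a minimal CPU-side sketch in Python. It is not CUDA code and does not run on a GPU; the names `saxpy_kernel` and `launch` are our own hypothetical helpers. It only shows the shape of the idea: one small function, invoked once per thread index, with each invocation touching its own element of the data.

```python
def saxpy_kernel(thread_id, a, x, y, out):
    """Work done by one logical thread: a single element of out = a*x + y."""
    out[thread_id] = a * x[thread_id] + y[thread_id]


def launch(kernel, num_threads, *args):
    """Invoke the kernel once per thread index.

    A GPU would run these invocations in parallel. This sketch runs them
    one after another, which gives the same result because no invocation
    reads or writes another invocation's element.
    """
    for thread_id in range(num_threads):
        kernel(thread_id, *args)


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)

launch(saxpy_kernel, len(x), 2.0, x, y, out)
print(out)  # [12.0, 24.0, 36.0, 48.0]
```

Because the elements are independent, the same kernel works whether there are four of them or four million, which is what makes this style of code a good fit for hardware with many simple cores.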
It’s significant that Nvidia positioned its latest Fermi architecture as a GPU compute platform before launching it as a graphics processor. The Fermi architecture brought substantial hardware enhancements to make it a better general purpose processor. These include fast atomic memory operations (which means a single memory location won’t be corrupted by simultaneous accesses from different threads), a unified memory architecture, better context switching, and more. Since Fermi’s launch, Nvidia has also updated its CUDA software platform several times, which we’ll discuss shortly.
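The idea behind an atomic operation can be illustrated on the CPU. The sketch below is an emulation in Python, not GPU code: the hypothetical `AtomicCounter` class uses a software lock to make each read-modify-write indivisible, whereas GPU hardware atomics do the equivalent job in silicon without a lock. The thread and iteration counts are arbitrary.

```python
import threading


class AtomicCounter:
    """Counter whose add() is indivisible (emulated here with a lock)."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def add(self, amount):
        # The read, the addition, and the write happen as one step,
        # so no other thread can interleave and lose an update.
        with self._lock:
            self._value += amount

    @property
    def value(self):
        return self._value


def worker(counter, iterations):
    """Each thread adds 1 to the shared counter `iterations` times."""
    for _ in range(iterations):
        counter.add(1)


counter = AtomicCounter()
threads = [threading.Thread(target=worker, args=(counter, 1000)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.value)  # 8000
```

Without that guarantee, two threads could read the same old value, each add 1, and write back the same result, silently dropping one of the updates.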
Nvidia didn’t just see GPU compute as something for oil exploration and academic computing. Nvidia acquired PhysX several years ago, discarding the dedicated hardware but keeping the broadly used physics API, so the GPU can accelerate physics calculations. The company has also worked with game developers to incorporate GPU compute into games, for water simulation, optical lens effects, and other compute-intensive tasks. Finally, it has worked with a number of mainstream companies like ArcSoft, Adobe, and CyberLink to enable GPU‑accelerated video transcoding in both high-end and consumer-level video applications.
All the work on Fermi as a compute platform has paid off: Nvidia’s Tesla compute hardware sales topped $100M last year. Fermi doesn’t get the attention that the desktop graphics or mobile processor divisions have been getting, but its existence has enabled Nvidia to remain at the top of the heap for GPU compute. Still, competitors are nipping at its heels.
AMD was a little late to the GPU compute party, but it has been working feverishly to catch up. ATI Stream was the company's equivalent to Nvidia's CUDA. The first AMD FireStream cards for dedicated GPU compute were the model 580s, built on the Radeon X1900 GPU, which saw fairly limited pickup. It wasn’t until the Radeon HD 4000 series shipped that AMD really had competitive hardware for GPU compute. The HD 5000 improved on that substantially. The latest Radeon 6000 series has significant enhancements specifically geared for general purpose parallel programming.
Philosophically, though, AMD has taken a slightly different road. At first, the company tried to mimic Nvidia’s CUDA efforts, but eventually discarded that approach and fully embraced open standards like OpenCL and DirectCompute. (We’ll discuss the software platforms in more detail next.)
Recently, AMD has shifted its GPU compute focus more to the mainstream. While AMD ships dedicated compute accelerators under the moniker FireStream, the company is trying to capitalize on its efforts to integrate Radeon graphics technology into mainstream CPUs. The Fusion APUs (accelerated processing units) are available in either mobile or desktop flavors. Even the high-end A3800, sporting a quad-core x86 CPU and 400 Radeon-class programmable shaders, costs less than $150.
AMD calls its approach to mainstream GPU compute App Acceleration. It’s a risky approach, since the mainstream applications ecosystem isn’t exactly rich with products that take advantage of GPU compute. The few applications that exist can run much faster on the GPU side of the APU, but the modest performance of the x86 side of the equation makes it difficult to compete with Intel’s x86 performance dominance. AMD is betting that more software developers will take advantage of GPU compute, shifting the performance equation for the APUs.
Intel has been watching the GPU compute movement with some understandable concern. The company tried to get into discrete graphics with Larrabee, but that project died on the vine. The technology behind Larrabee is now relegated to limited use in some high-performance parallel compute applications, but you can’t go out and buy a Larrabee board.
On the other hand, Intel has made waves with the integrated graphics built into its current Sandy Bridge CPUs. The Intel HD Graphics GPU is pretty average for Intel graphics, but the fixed-function video block is startlingly good. Video decode and transcode are very fast—even faster than most GPU-accelerated transcode. Of course, it’s a fixed-function unit, so it isn’t useful with non-standard codecs. But since a big part of the consumer GPU compute efforts from Nvidia and AMD focus on video encode and transcode, Sandy Bridge graphics stole a little thunder from the traditional graphics companies.
Intel’s upcoming 22nm CPU, code-named Ivy Bridge, may actually change the balance. The x86 CPU itself will offer modest enhancements over Sandy Bridge, but the GPU is being re-architected to be fully DirectX 11 compliant. When asked if GPU compute code could run entirely on the Ivy Bridge graphics core, the lead architect for Intel said it could. Performance is unknown at this point, but if Intel can couple a GPU core that’s equal to the AMD GPU inside Fusion APUs with its raw x86 CPU capabilities, then it may signal a sunset on the era of entry-level discrete graphics cards.
If you can’t write software to take advantage of great hardware, you essentially have really expensive paperweights. Early attempts to turn GPUs into general purpose parallel processors were bootstrapping efforts, requiring programmers to figure out how to write a graphics shader program that would do something other than graphics.
As the hardware evolved, the need for standard programming interfaces became critical. What happened is a recapitulation of graphics history: proprietary technology first, then a steady shift to more open standards.
Nvidia’s CUDA platform was one of the first attempts to build a standard programming interface for GPU compute. Nvidia has always maintained that CUDA isn’t really “Nvidia-only,” but neither AMD nor Intel has really taken up the company’s offer to accept it as a standard. Some of Nvidia’s third-party partners, however, have chipped in, enabling support for Intel CPUs as fallback for some CUDA-based middleware.
CUDA started out small, consisting of libraries and a C compiler to write parallel‑processing code for the GPU. Over the years, CUDA has evolved into an ecosystem of Nvidia and third-party compilers, debugging tools, and full integration with Microsoft Visual Studio.
CUDA has seen most of its success in the HPC and academic supercomputing market, but it has a broader reach than just deskside supercomputers. Adobe has used CUDA in Premiere Pro CS4 and later versions to accelerate high-definition video transcode and some transitions. MotionDSP uses CUDA to help reduce the shaky‑cam effect in home videos. We’ll highlight a few GPU‑accelerated apps later in this article.
We’ll just mention AMD’s Stream software platform briefly, since AMD is no longer pushing it, choosing to focus instead on OpenCL and DirectCompute.
Stream was AMD’s attempt to compete with CUDA, but the company obviously feels that the greater accessibility offered by standards-based platforms is more appealing.
DirectCompute shipped with Microsoft’s DirectX 11 API framework, so it is available only on Windows Vista and Windows 7. It will also be available on Windows 8 when that OS ships. That means there’s no support for DirectCompute on non-Microsoft operating systems. DirectCompute won’t run on Windows XP, either, nor on Windows Phone 7 or the Xbox 360.
DirectCompute works across all GPUs capable of supporting DirectX 11. Today, that means only Nvidia GTX 400 series or later and AMD Radeon HD 5000 series or later. Intel will support DirectX 11 compute shaders when Ivy Bridge ships in 2012.
DirectCompute’s key advantage is that it uses an enhanced version of the same shader language, HLSL, for GPU compute programming as it does for graphics programming. This makes it substantially easier for the large numbers of programmers already facile in Direct3D to write GPU compute code. It also runs across graphics processors from both AMD and Nvidia, giving it broad graphics hardware support.
On the downside, DirectCompute has no CPU fallback. So code specifically written for DirectCompute simply fails if a DirectX 11-capable GPU isn’t available. That means programmers need a separate code path if they want to replicate the results of the DirectCompute code on a system running an older GPU.
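The "separate code path" pattern looks roughly like the Python sketch below. Everything here is hypothetical: `gpu_available`, `blur_gpu`, and `blur_cpu` are placeholder names, the capability check is hard-coded to fail, and no real DirectCompute call is made. The point is only the structure, in which the application checks for capable hardware and routes the same job to a CPU implementation when the check fails.

```python
def gpu_available():
    """Hypothetical capability check.

    A real application would query the graphics API here. For this sketch
    we simply assume no DirectX 11-class GPU is present.
    """
    return False


def blur_gpu(pixels):
    """Placeholder for the GPU compute path (not implemented in this sketch)."""
    raise RuntimeError("no compute-capable GPU in this sketch")


def blur_cpu(pixels):
    """CPU path: 3-tap box blur over a row of pixels, with edges clamped."""
    n = len(pixels)
    return [
        (pixels[max(i - 1, 0)] + pixels[i] + pixels[min(i + 1, n - 1)]) / 3
        for i in range(n)
    ]


def blur(pixels):
    """Use the GPU path when the hardware supports it, else the CPU path."""
    if gpu_available():
        return blur_gpu(pixels)
    return blur_cpu(pixels)


print(blur([0, 3, 6]))  # [1.0, 3.0, 5.0]
```

The cost of this approach is that both paths have to be written, tested, and kept in agreement, which is exactly the extra burden a built-in CPU fallback would remove.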
OpenCL was originally developed by Apple, which turned the framework over to an open standards group called the Khronos Group. Apple retained the name as a trademark, but granted free rights to use it.
OpenCL runs on just about any hardware platform available, including traditional PC CPUs and GPUs inside mobile devices like smartphones and tablets. In fact, Intel has even released an OpenCL interface for the current Sandy Bridge‑integrated GPU. Care must be taken with code designed for multiplatform use, however, as a cell‑phone GPU may not be able to handle the same number of threads as gracefully as an Nvidia GTX 580.
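One common way to cope with that variation is to size the work to whatever the device reports rather than hard-coding a thread count. The Python sketch below shows the arithmetic only. The function `plan_launch` is a hypothetical helper, and the device limits (1024 and 64) are made-up numbers for illustration, not the specifications of any real GPU.

```python
def plan_launch(num_items, preferred_group_size, device_max_group_size):
    """Pick a thread-group size the device can handle and count the groups.

    Returns (group_size, num_groups). The last group may be partly empty,
    so a real kernel would need to skip indices >= num_items.
    """
    group_size = min(preferred_group_size, device_max_group_size)
    num_groups = (num_items + group_size - 1) // group_size  # ceiling division
    return group_size, num_groups


# Hypothetical desktop-class device: our preferred size fits as-is.
print(plan_launch(1_000_000, 256, 1024))  # (256, 3907)

# Hypothetical mobile-class device: the group size is clamped to its limit.
print(plan_launch(1_000_000, 256, 64))  # (64, 15625)
```

The same one million items get processed either way; the smaller device simply works through more, smaller groups.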
On the other hand, OpenCL is still in its infancy. Supporting tools and middleware are still emerging, and for the time being developers may need to create their own custom libraries, instead of relying on commercially available or free middleware to ease programming chores. There’s no integration yet with popular dev tools like Microsoft’s Visual Studio.
The GPU compute API situation today resembles the consumer 3D graphics API wars of the late 1990s. The leading development platform is CUDA. Despite Nvidia’s protestations to the contrary, CUDA remains a proprietary platform. It has a rich ecosystem of developers and applications at this stage, but history hasn’t been kind to single-platform standards over the long haul.
This chart sums up the state of the GPU compute APIs in a nutshell.
You could argue that DirectCompute is also proprietary, since it’s Windows-only—and even lacks support on pre-Vista versions of Windows. However, Windows is by far the leading PC operating system, and DirectCompute supports all existing DirectX 11–capable hardware. That’s where the support ends, however, since there’s no version for mobile hardware, though we may see that change with Windows 8.
OpenCL offers the most promise in the long run, with its support for multiple operating systems, a wide array of hardware platforms, and strong industry support. OpenCL is the native GPU compute API for Mac OS X, which is gaining ground in the PC space, particularly on laptops. But OpenCL is still pretty immature at this stage of the game. There’s a strong need for integration with popular development platforms, more powerful debugging tools and more robust third-party middleware.
To see what kind of strides GPU compute has made, we’re going to focus on consumer applications, not scientific or highly vertical applications. GPUs should do well in applications where the code and data are highly parallel. Examples include some photography apps, video transcoding, and certain tasks in games (those that aren’t just graphical in nature).
Musemage is a complete photo editing application available from Chinese developer Paraken. When running on systems with Nvidia GPUs, Musemage is fully GPU accelerated. Musemage uses the CUDA software layer to accelerate the full range of photographic operations.
Musemage lacks a lot of the automated functions built into more mature tools like Photoshop, but if you’re willing to manually tweak your images, most of the filters and tools act almost instantly, even on very large raw files—provided you’ve got Nvidia hardware.
Adobe’s Premiere Pro is a professional-level video editing tool. One of the tasks necessary for any video editor is previewing projects as you assemble clips, titles, transitions and filters into a coherent whole. Adobe’s Mercury playback engine uses CUDA to accelerate the preview. This is incredibly useful as projects grow in size—you’re able to scrub back and forth on the timeline in real time, even after making changes.
In addition, a number of effects and filters are GPU accelerated, including color correction, various blurs, and more. A complete list can be found at the Adobe website.
Adobe is investigating porting the Mercury engine and other GPU-accelerated portions of Premiere Pro to OpenCL, but we haven’t heard whether a final decision has been made. Given the relative immaturity of the tool sets and drivers, OpenCL may need a little more time before major software companies like Adobe commit to the new standard.
Interestingly, Intel has recently delivered a plugin for Premiere Pro CS5.5 that can speed up HD encoding if you use Adobe Media Encoder. It does require an H67 or Z68 chipset. With a Z68 system, you can use an Nvidia-based GPU to accelerate the Mercury playback engine and QuickSync to perform the final render.
A number of video transcoding apps exist that are GPU accelerated. One of the first was CyberLink’s Media Espresso, which first used Nvidia’s CUDA framework, then OpenCL. The latest version of Media Espresso takes advantage of Intel’s QuickSync. Transcoding with QuickSync can be faster than using a GPU, but only if you use a QuickSync-supported codec.
Higher-end tools, like MainConcept, also use GPU encode. MainConcept offers separate H.264/AVC encoders for Nvidia, running on CUDA, and AMD, which uses OpenCL.
When we think of games and GPUs, it’s natural to think about graphics. But games are increasingly using the GPU for elements that aren’t purely graphical. Physics is the first thing that comes to mind. Usually when we think of physics, we think of collisions and rigid body dynamics.
But physics isn’t just about stuff bouncing off other stuff. Film effects like motion blur, lens effects like bokeh, and volumetric smoke are handled via GPU compute techniques rather than run on the CPU. GPU compute also handles cloth simulations, better-looking water, and even some audio processing. In the future, we might see some of the AI calculations offloaded to the GPU; AMD already demonstrated GPU-controlled AI in an RTS-like setting.
As more GPU compute capability is integrated into the CPU die, it’s possible for the on-die GPU to handle some of these compute tasks while the discrete graphics card takes care of graphics chores. The ability for the on-die GPU and CPU to share data more quickly—without having to move data over the PCI Express bus—may make up for the fewer shader cores available on-die.
CPUs will never go out of fashion. There will always be a need for linear computation, and some applications don’t lend themselves to parallel computation. However, the future of the Internet and PCs is a highly visual one. Digital video, photography, and games may be the initial drivers for this, but the visual Internet, through standards like WebCL and HTML5 Canvas, will create more immersive experiences over the web. And much of the underlying programming for creating these experiences will be parallel in nature. GPUs, whether discrete or integrated on the CPU die, are naturals for this highly visual, parallel future. GPU computing is still in its infancy.