AMD’s Radeon HD 7970: 4.3 Billion Transistors of Pure Performance
AMD moves its high-end GPU family to 28nm, delivering stunning performance and impressive efficiency
We knew this was coming. We saw all the signs: The rumors. The price drops on existing videocards. The tweaked versions of old standbys masquerading as “new” GPUs. But more than anything, it’s been too long since we’ve had something fresh to sink our teeth into. And as has been the case in each of the last several big product launches, AMD is serving the first course.
Eric Demers, CTO of AMD’s Graphics Division, began talking about the company’s Graphics Core Next (GCN) earlier this summer. He described a new GPU architecture that would take graphics to the next level. He promised a GPU-compute monster that would remain highly scalable, so versions could be built into future generations of AMD APUs. The first iteration of Graphics Core Next comes in the form of the Radeon HD 7970, and it marks a substantial architectural shift for Radeon graphics. We’ll examine the overall architecture first, and then we’ll dive into the hardware specifics of the Radeon HD 7970.

Goodbye VLIW
Previous AMD GPU generations used very long instruction words (VLIW), a way of tightly packing multiple GPU instructions in order to move them around the GPU and memory efficiently. VLIW went through a couple of tweaks, including a change to a four-word VLIW scheme from a five-word scheme. VLIW was well tuned for the modern generation of programmable graphics, but it wasn’t so hot for GPU compute.
With AMD betting the farm on Fusion, which inherently takes advantage of a GPU’s parallel-compute capability, the company needed a more flexible architecture. So AMD discarded VLIW in favor of something the company calls GCN Quad SIMD (single instruction, multiple data). Instead of a single VLIW instruction plus four math operations for the ALU (arithmetic logic unit), the GPU uses four SIMDs and a single ALU operation. The four SIMDs can do the same work as a single VLIW, but they can also act independently when needed.
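To see why a VLIW design can struggle with compute code, consider the toy model below. It is only a sketch under our own assumptions, not a description of AMD's compiler or hardware: the `pack_vliw` and `utilization` helpers are hypothetical, and the model assumes a four-slot bundle can hold only operations that don't depend on one another. Graphics math tends to offer plenty of independent operations, while compute kernels often contain long dependent chains that leave slots empty.

```python
def pack_vliw(deps, width=4):
    """Greedily pack operations into VLIW bundles of `width` slots.

    deps[i] lists the operations that op i depends on. An op may join a
    bundle only if all of its dependencies were issued in EARLIER bundles.
    (Hypothetical toy model; assumes the dependency graph has no cycles.)
    """
    done = set()
    remaining = list(range(len(deps)))
    bundles = []
    while remaining:
        bundle = []
        for op in remaining:
            if len(bundle) < width and all(d in done for d in deps[op]):
                bundle.append(op)
        if not bundle:
            raise ValueError("cyclic dependencies: nothing can be issued")
        done.update(bundle)
        remaining = [op for op in remaining if op not in done]
        bundles.append(bundle)
    return bundles


def utilization(bundles, width=4):
    """Fraction of available slots that actually hold an operation."""
    return sum(len(b) for b in bundles) / (len(bundles) * width)


# Four independent ops (think x, y, z, w of a pixel): one full bundle.
independent = [[], [], [], []]
# Four ops where each needs the previous result: one op per bundle.
chain = [[], [0], [1], [2]]

print(utilization(pack_vliw(independent)))  # 1.0
print(utilization(pack_vliw(chain)))        # 0.25
```

In this model the dependent chain fills only a quarter of the slots. A SIMD design that issues one operation at a time across many work items avoids the packing problem, because each lane processes a different work item rather than a different instruction.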

GCN marks a major shift in how AMD GPUs operate, behaving more like a general-purpose vector processor than a pure graphics engine. What’s more, each basic building block, called a GCN Compute Unit, includes a scalar coprocessor that can behave like a traditional—but non-pipelined—CPU. AMD has beefed up the caches that are distributed throughout the GPU. Each GCN core (yes, AMD is calling them cores) has its own dedicated L1 read/write cache. Each group of four cores shares a 16KB instruction cache and a 32KB scalar data cache. All the cores communicate over a shared bus to a partitioned L2 cache that can be sized differently depending on the graphics card and particular GPU die.
AMD intends for GCN to serve as the basis for several product families. The first product, code-named Tahiti, is aimed at gaming enthusiasts who want maximum frame rates while enabling maximum eye candy. The next product, code-named Pitcairn, will supersede the Radeon HD 6800 series. Pitcairn will be followed by a series code-named Cape Verde, which AMD believes will redefine the segment now held by products such as the Radeon HD 6700 series.
Code-name Tahiti
AMD took advantage of TSMC’s new 28nm manufacturing process to build its new high-end GPU. The Radeon HD 7970 sports 4.3 billion transistors in a surprisingly small 365mm² die. AMD product marketing manager Devon Nekechuk tells us AMD’s 28nm yields have been both “good” and “predictable.”
Tahiti is assembled from 32 GCN compute units, which translates to 2,048 stream processors, each of which is based on AMD’s new SIMD-plus-scalar architecture. The existing Radeon HD 6970, by contrast, is equipped with just 1,536 stream processors and doesn’t benefit from the new architecture. The 7970 includes 768KB of L2 cache and eight render back-ends capable of pushing 32 color ROPs per clock and 128 Z/stencil ROPs per clock. The existing 6970 provides the same quantity of render back-ends, but the newer card boasts higher throughput and much-improved efficiency. The 7970 also features a 384-bit interface to 3GB of GDDR5 memory and a PCIe 3.0 interface. Peak memory bandwidth is 264GB/s.
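The headline numbers here are easy to sanity-check. The short sketch below divides the stream processors across the compute units and shows that the 264GB/s figure is consistent with memory bandwidth on a 384-bit bus. One input is our assumption rather than something stated above: an effective GDDR5 data rate of 5.5Gb/s per pin.

```python
# Figures quoted in the article.
COMPUTE_UNITS = 32
STREAM_PROCESSORS = 2048
BUS_WIDTH_BITS = 384

# Assumption (not stated in the article): effective GDDR5 data rate
# per pin, in gigabits per second.
DATA_RATE_GBPS_PER_PIN = 5.5

# Stream processors in each compute unit.
sp_per_cu = STREAM_PROCESSORS // COMPUTE_UNITS

# Bits per second across the whole bus, divided by 8 to get bytes.
bandwidth_gb_per_s = BUS_WIDTH_BITS * DATA_RATE_GBPS_PER_PIN / 8

print(sp_per_cu)           # 64
print(bandwidth_gb_per_s)  # 264.0
```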

Tahiti also implements a feature known as partially resident textures. Local graphics memory is used as a kind of big cache for texture data, and very large textures can be streamed in on demand. This improves performance in game engines that use features such as virtual texturing or mega-textures: Texture sizes can be as large as 32TB (yes, terabytes).
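The idea of treating local memory as a cache for a much larger texture can be illustrated in software. The sketch below is a loose analogy under our own assumptions, not AMD's hardware mechanism: the `TileCache` class, its `fetch` method, and the least-recently-used eviction policy are all hypothetical choices made for illustration.

```python
from collections import OrderedDict


class TileCache:
    """Keeps at most `capacity` texture tiles resident, loading on demand.

    Hypothetical software analogy: `load_tile` stands in for streaming a
    tile from disk or system memory into local graphics memory.
    """

    def __init__(self, capacity, load_tile):
        self.capacity = capacity
        self.load_tile = load_tile
        self.resident = OrderedDict()  # tile_id -> tile data, oldest first
        self.misses = 0

    def fetch(self, tile_id):
        if tile_id in self.resident:
            # Hit: mark the tile as most recently used.
            self.resident.move_to_end(tile_id)
            return self.resident[tile_id]
        # Miss: stream the tile in, evicting the least recently used one
        # if the cache is now over capacity.
        self.misses += 1
        data = self.load_tile(tile_id)
        self.resident[tile_id] = data
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)
        return data


cache = TileCache(capacity=2, load_tile=lambda tile: f"pixels for {tile}")
for tile in [(0, 0), (0, 1), (0, 0), (1, 1)]:
    cache.fetch(tile)

print(cache.misses)          # 3
print(list(cache.resident))  # [(0, 0), (1, 1)]
```

Only the tiles a scene actually touches ever occupy memory, which is what lets the full texture be far larger than the card's 3GB frame buffer.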
The Radeon HD 6970 is oft criticized for its weak tessellation performance, especially when compared to Nvidia’s GeForce GTX 580 series. AMD has beefed up the GCN’s tessellator by improving the reuse of vertices, improving its off-chip buffering performance, and providing larger parameter caches. AMD predicts overall tessellation performance will be as much as 4x better than the 6970, depending on the application.
On the compute side, Tahiti uses dual asynchronous compute engines, which can independently schedule and dispatch work to improve multitasking. The compute engines can work in parallel with the graphics command processor, and AMD reports that context switching is “fast.” The GPU also features dual built-in DMA engines, and AMD suggests the chip can saturate a PCIe 3.0 x16 bus when running compute chores.
Floating-point performance is fully IEEE compliant, and the 7970 is capable of pumping out up to 947 double-precision gigaflops. It is the first GPU to support OpenCL 1.2, DirectCompute 11.1, and C++ AMP in hardware.
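For the curious, the 947-gigaflop figure can be reproduced with back-of-the-envelope math. The sketch below rests on three assumptions that are not stated above: a 925MHz engine clock, two floating-point operations per stream processor per clock (one fused multiply-add), and double-precision work running at one quarter of the single-precision rate.

```python
STREAM_PROCESSORS = 2048  # from the article

# Assumptions (not stated in the article):
ENGINE_CLOCK_GHZ = 0.925   # engine clock
FLOPS_PER_CLOCK = 2        # one fused multiply-add counts as two operations
DP_RATE = 1 / 4            # double precision relative to single precision

# Peak single-precision throughput in gigaflops.
single_precision_gflops = STREAM_PROCESSORS * FLOPS_PER_CLOCK * ENGINE_CLOCK_GHZ

# Peak double-precision throughput in gigaflops.
double_precision_gflops = single_precision_gflops * DP_RATE

print(round(single_precision_gflops, 1))
print(round(double_precision_gflops, 1))
```

Under those assumptions the double-precision result lands at roughly 947 gigaflops, matching the quoted peak.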
Video processing has also been improved. Given the right application, Tahiti can evaluate 7.6 terapixels per second (peak), and it can transcode 1080p video faster than real time.