AMD’s Radeon HD 7970: 4.3 Billion Transistors of Pure Performance
AMD moves its high-end GPU family to 28nm, delivering stunning performance and impressive efficiency
We knew this was coming. We saw all the signs: The rumors. The price drops on existing videocards. The tweaked versions of old standbys masquerading as “new” GPUs. But more than anything, it’s been too long since we’ve had something fresh to sink our teeth into. And as has been the case in each of the last several big product launches, AMD is serving the first course.
Eric Demers, CTO of AMD’s Graphics Division, began talking about the company’s Graphics Core Next (GCN) earlier this summer. He described a new GPU architecture that would take graphics to the next level. He promised a GPU-compute monster that would remain highly scalable, so versions could be built into future generations of AMD APUs. The first iteration of Graphics Core Next comes in the form of the Radeon HD 7970, and it marks a substantial architectural shift for Radeon graphics. We’ll examine the overall architecture first, and then we’ll dive into the hardware specifics of the Radeon HD 7970.

Goodbye VLIW
Previous AMD GPU generations used very long instruction words (VLIW), a way of tightly packing multiple GPU instructions in order to move them around the GPU and memory efficiently. VLIW went through a couple of tweaks, including a change to a four-word VLIW scheme from a five-word scheme. VLIW was well tuned for the modern generation of programmable graphics, but it wasn’t so hot for GPU compute.
With AMD betting the farm on Fusion, which inherently takes advantage of a GPU’s parallel-compute capability, the company needed a more flexible architecture. So AMD discarded VLIW in favor of something the company calls GCN Quad SIMD (single instruction, multiple data). Instead of a single VLIW instruction plus four math operations for the ALU (arithmetic logic unit), the GPU uses four SIMDs and a single ALU operation. The four SIMDs can do the same work as a single VLIW, but they can also act independently when needed.
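To see why VLIW can leave hardware idle on compute code, consider a toy model. The sketch below is our own simplification, not AMD's compiler or scheduler: it greedily packs one thread's operations into four-wide bundles, and an operation can only issue after its inputs were produced in an earlier bundle. The function names and the dependency format are hypothetical.

```python
from typing import Dict, List


def pack_vliw(deps: Dict[str, List[str]], width: int = 4) -> List[List[str]]:
    """Greedily pack ops into bundles of up to `width` slots.

    `deps` maps each op name to the ops whose results it needs. An op is
    eligible only once all of its inputs were issued in an earlier bundle.
    """
    done = set()
    remaining = list(deps)  # preserves program order
    bundles: List[List[str]] = []
    while remaining:
        ready = [op for op in remaining if all(d in done for d in deps[op])]
        if not ready:
            raise ValueError("dependency cycle or unknown input")
        bundle = ready[:width]
        bundles.append(bundle)
        done.update(bundle)
        remaining = [op for op in remaining if op not in bundle]
    return bundles


def slot_utilization(bundles: List[List[str]], width: int = 4) -> float:
    """Fraction of issue slots that actually carried an operation."""
    return sum(len(b) for b in bundles) / (width * len(bundles))


if __name__ == "__main__":
    independent = {"a": [], "b": [], "c": [], "d": []}
    chain = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}
    print(slot_utilization(pack_vliw(independent)))
    print(slot_utilization(pack_vliw(chain)))
```

Four independent operations fill a single bundle, while a chain of four dependent operations needs four bundles that are each three-quarters empty. Graphics shaders tend to look like the first case; compute code often looks like the second, which is the gap a design that issues one operation at a time per SIMD is meant to close.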

GCN marks a major shift in how AMD GPUs operate, behaving more like a general-purpose vector processor than a pure graphics engine. What’s more, each basic building block, called a GCN Compute Unit, includes a scalar coprocessor that can behave like a traditional—but non-pipelined—CPU. AMD has beefed up the caches that are distributed throughout the GPU. Each GCN core (yes, AMD is calling them cores) has its own dedicated L1 read/write cache. Each group of four cores shares a 16KB instruction cache and a 32KB scalar data cache. All the cores communicate over a shared bus to a partitioned L2 cache that can be sized differently depending on the graphics card and particular GPU die.
AMD intends for GCN to serve as the basis for several product families. The first product, code-named Tahiti, is aimed at gaming enthusiasts who want maximum frame rates while enabling maximum eye candy. The next product, code-named Pitcairn, will supersede the Radeon HD 6800 series. Pitcairn will be followed by a series code-named Cape Verde, which AMD believes will redefine the segment now held by products such as the Radeon HD 6700 series.
Code-name Tahiti
AMD took advantage of TSMC’s new 28nm manufacturing process to build its new high-end GPU. The Radeon HD 7970 sports 4.3 billion transistors in a surprisingly small 365mm² die. AMD product marketing manager Devon Nekechuk tells us AMD’s 28nm yields have been both “good” and “predictable.”
Tahiti is assembled from 32 GCN compute units, which translates to 2,048 stream processors, each of which is based on AMD’s new SIMD-plus-scalar architecture. The existing Radeon HD 6970, by contrast, is equipped with just 1,536 stream processors and doesn’t benefit from the new architecture. The 7970 includes 768KB of L2 cache and eight render back-ends capable of pushing 32 color ROPs per clock and 128 Z/stencil ROPs per clock cycle. The existing 6970 provides the same quantity of render back-ends, but the newer card boasts higher throughput and much-improved efficiency; plus, the 7970 features a 384-bit interface to 3GB of GDDR5 memory and a PCIe 3.0 interface. That memory interface delivers peak bandwidth of 264GB/s.
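The headline figures here hang together arithmetically, and a quick back-of-the-envelope check shows how. The sketch below assumes 64 stream processors per compute unit (simply 2,048 divided by 32) and an effective GDDR5 data rate of 5.5Gbps per pin, which is our reading of four transfers per cycle at the 1,375MHz memory clock. The function names are ours, not AMD's.

```python
# Back-of-the-envelope check of the Tahiti figures quoted above.
# Function names are illustrative; the inputs come from the article.

def stream_processors(compute_units: int, lanes_per_cu: int = 64) -> int:
    """Total stream processors; 64 per CU is inferred from 2,048 / 32."""
    return compute_units * lanes_per_cu


def memory_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s: bus pins times per-pin rate, over 8 bits per byte."""
    return bus_width_bits * data_rate_gbps / 8


if __name__ == "__main__":
    print(stream_processors(32))              # Radeon HD 7970
    print(memory_bandwidth_gb_s(384, 5.5))    # 7970: 384-bit bus
    print(memory_bandwidth_gb_s(256, 5.5))    # 6970: 256-bit bus
```

Since both cards run their memory at the same clock, the wider bus alone accounts for the jump in bandwidth over the 6970.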

Tahiti also implements a feature known as partially resident textures. Local graphics memory is used as a kind of big cache for texture data, and very large textures can be streamed in on demand. This improves performance in game engines that use features such as virtual texturing or mega-textures: Texture sizes can be as large as 32TB (yes, terabytes).
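The idea of treating local memory as a cache for a much larger texture can be sketched in a few lines. This is a conceptual toy, not AMD's hardware mechanism: the `TileCache` class, the least-recently-used eviction policy, and the loader callback are all our own assumptions for illustration.

```python
from collections import OrderedDict
from typing import Callable, Tuple

Tile = Tuple[int, int]  # (column, row) of a tile inside the huge virtual texture


class TileCache:
    """Keeps only the most recently used tiles resident (hypothetical LRU policy)."""

    def __init__(self, capacity: int, load_tile: Callable[[Tile], bytes]) -> None:
        self.capacity = capacity
        self.load_tile = load_tile  # streams a tile in from disk or system RAM
        self.resident: "OrderedDict[Tile, bytes]" = OrderedDict()
        self.misses = 0

    def fetch(self, tile: Tile) -> bytes:
        if tile in self.resident:
            self.resident.move_to_end(tile)  # mark as most recently used
            return self.resident[tile]
        self.misses += 1  # not resident: stream it in on demand
        data = self.load_tile(tile)
        self.resident[tile] = data
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)  # evict the least recently used tile
        return data


if __name__ == "__main__":
    cache = TileCache(capacity=2, load_tile=lambda t: f"texels{t}".encode())
    for t in [(0, 0), (0, 1), (0, 0), (5, 5), (0, 1)]:
        cache.fetch(t)
    print(cache.misses, list(cache.resident))
```

The point of the sketch is that the game only pays to load the tiles it actually touches, so the full texture can be far larger than the memory backing it.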
The Radeon HD 6970 is oft criticized for its weak tessellation performance, especially when compared to Nvidia’s GeForce GTX 580 series. AMD has beefed up the GCN’s tessellator by improving the reuse of vertices, improving its off-chip buffering performance, and providing larger parameter caches. AMD predicts overall tessellation performance will be as much as 4x better than the 6970, depending on the application.
On the compute side, Tahiti uses dual asynchronous compute engines, which can independently schedule and dispatch work to improve multitasking. The compute engines can work in parallel with the graphics command processor, and AMD reports that context switching is “fast.” The GPU also features dual built-in DMA engines, and AMD suggests the chip can saturate a PCIe 3.0 x16 bus when running compute chores.
Floating-point performance is fully IEEE compliant, and the 7970 is capable of pumping out up to 947 double-precision gigaflops. It is the first GPU to support OpenCL 1.2, DirectCompute 1.1, and C++ AMP in hardware.
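The 947-gigaflop figure follows from the shader count and the 925MHz core clock. The sketch below assumes each stream processor retires one fused multiply-add per clock (counted as two floating-point operations) and that double precision runs at one quarter of the single-precision rate. That ratio is our inference from AMD's quoted numbers, not something AMD states here.

```python
def peak_gflops(stream_processors: int, clock_mhz: float,
                flops_per_clock: int = 2, rate: float = 1.0) -> float:
    """Theoretical peak in gigaflops.

    flops_per_clock=2 assumes one fused multiply-add per lane per clock.
    rate scales the result for slower modes (e.g. 0.25 for double precision,
    an assumed ratio).
    """
    return stream_processors * flops_per_clock * clock_mhz * rate / 1000.0


if __name__ == "__main__":
    single = peak_gflops(2048, 925)             # Radeon HD 7970, single precision
    double = peak_gflops(2048, 925, rate=0.25)  # assumed quarter-rate double precision
    print(f"{single:.1f} GFLOPS single, {double:.1f} GFLOPS double")
```

These are theoretical peaks; real workloads will land below them depending on memory traffic and how well the code keeps every lane busy.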
Video processing has also been improved. Given the right application, Tahiti can evaluate 7.6 terapixels per second (peak), and it has the ability to transcode 1080p video in faster than real time.
The Radeon HD 7970
Now that we’ve examined the GPU, let’s take a look at a reference-design example of the first videocard that will use it. We’ll start with power efficiency and noise, because AMD has made some notable advances on those fronts. The Radeon HD 6970 drew roughly 20 watts at idle, which was pretty good at the time. The Radeon HD 7970 card AMD provided for this evaluation, which is outfitted with one six-pin and one eight-pin PCIe power connector, idles at just 15 watts. AMD has also developed a new feature called ZeroPower that puts the card into a deeper sleep state (including turning off its cooling fan) when Windows shuts off your display. In this state, the card draws just three watts. ZeroPower delivers benefits when you’re running two or more cards in CrossFire X mode, too. When your computer is simply running normal Windows stuff, the secondary cards will turn their fans off and reduce their power consumption to three watts, since they’re not driving displays. That means a multi-GPU system at idle will consume nearly the same amount of power as a single-GPU rig.
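A little arithmetic shows how small the multi-GPU idle penalty becomes. The sketch below uses the 15-watt and three-watt figures quoted above and assumes that only the primary card stays at normal idle while every secondary card drops into ZeroPower; the function name is ours.

```python
def idle_graphics_power_w(num_cards: int,
                          primary_idle_w: float = 15.0,
                          zeropower_w: float = 3.0) -> float:
    """Combined idle draw of the graphics cards alone, in watts.

    Assumes one primary card at normal idle and all others in ZeroPower.
    """
    if num_cards < 1:
        raise ValueError("need at least one card")
    return primary_idle_w + (num_cards - 1) * zeropower_w


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, "card(s):", idle_graphics_power_w(n), "W")
```

Under those assumptions, each extra card adds three watts at the desktop rather than another 15.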
In our tests, a machine equipped with a reference-design Radeon HD 7970 card, a six-core CPU, 16GB of RAM, and two hard drives consumed just 124 watts at idle, and only 109 watts once the display went to sleep and the card dropped into its “long dark” state. That power consumption is incredibly modest for a machine that powerful. Under load, the 7970 system draws more power than the same machine equipped with AMD’s older high-end GPU, but less than it does with either of the GeForce GTX 580 cards we tested.
AMD will continue to use PowerTune technology to manage power consumption at all performance levels. One microcontroller monitors the thermal and power states of different parts of the card, and a second adjusts voltage and frequencies in real time. This allows AMD to set higher peak clock rates while remaining within the card’s 250-watt TDP rating.
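The monitor-and-adjust loop described above can be pictured as a simple governor. The sketch below is a toy of our own devising, not AMD's PowerTune algorithm: only the 250-watt TDP and the 925MHz peak clock come from the article, while the step size, the clock floor, and the linear watts-per-megahertz power model are made up for illustration.

```python
from typing import List


def next_clock_mhz(clock_mhz: float, est_power_w: float,
                   tdp_w: float = 250.0, step_mhz: float = 25.0,
                   min_mhz: float = 300.0, max_mhz: float = 925.0) -> float:
    """Step the clock down when estimated power exceeds TDP, otherwise step up.

    step_mhz and min_mhz are hypothetical values chosen for the example.
    """
    if est_power_w > tdp_w:
        return max(min_mhz, clock_mhz - step_mhz)
    return min(max_mhz, clock_mhz + step_mhz)


def simulate(watts_per_mhz_trace: List[float], start_mhz: float = 925.0) -> List[float]:
    """Run the governor over a workload trace (hypothetical linear power model)."""
    clock = start_mhz
    history = []
    for watts_per_mhz in watts_per_mhz_trace:
        estimated_power = watts_per_mhz * clock  # the "monitor" half of the loop
        clock = next_clock_mhz(clock, estimated_power)  # the "adjust" half
        history.append(clock)
    return history


if __name__ == "__main__":
    # A heavy workload followed by a light one.
    print(simulate([0.30, 0.30, 0.30, 0.30, 0.20, 0.20]))
```

The design point is that the card can advertise a high peak clock because the governor pulls it back only on the rare workloads that would otherwise blow through the power budget.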
AMD has also reengineered its reference-design cooling mechanism. The fan has larger blades, and all monitor connections are on one half of the mounting bracket, leaving the entire other side free for ventilation slits. Based on our subjective evaluation, the 7970 card was substantially quieter than the 6970 card we compared it to.
Eyefinity 2.0
AMD is improving its Eyefinity technology, and some of those changes will carry over to current-generation cards. All Eyefinity-capable cards will get better bezel compensation and new display configurations, including 5 x 1 support (in either landscape or portrait orientation), which should make driving and flying games incredibly immersive. Maximum supported resolution over multiple displays will be beefed up to 16K x 16K. That, my friend, is a lot of pixels.

AMD is also improving its stereoscopic 3D support, although it will continue to rely on third-party manufacturers to produce compatible glasses and displays. The Radeon HD 7970 will drive three displays in stereoscopic 3D mode using upcoming DisplayPort 3D monitors. Reference-design 7970 cards will be outfitted with one dual-link DVI, two mini DisplayPort, and one HDMI 1.4a. They’ll support up to six displays simultaneously, although you’ll need the right mix of adapters to do that.
AMD’s new Discrete Digital Multi-Point Audio (DDMA) is another interesting feature, which could be useful in online gaming, video-conferencing, and other situations. If you’re engaged in a video conference with several other participants displayed on discrete monitors equipped with speakers, it will enable directional audio, so that when a participant on a monitor to your left speaks, you’ll hear his or her voice on that left-hand monitor. AMD says DDMA will also be useful for multi-room audio setups, so you can play a game on the computer in one room, while music is piped into speakers in other rooms in the home.
Performance
We tested a reference-design Radeon HD 7970 card using beta drivers, so bear in mind that our benchmarks are based on a work in progress. This card did not, however, come with the telltale EMI warning labels that typically mark early engineering samples. We did encounter one glitch, although it happened only once during testing, and we were unable to replicate it: When we cold-booted the system, the GPU’s clock reset to 500MHz (instead of the usual 925MHz). We used AMD’s OverDrive feature, part of the Catalyst control panel, to reset the clock to the factory default.
We compared the 7970 to three other cards: an XFX Radeon HD 6970, running at 880MHz and paired with 2GB of GDDR5; an EVGA GTX 580 SC, which is slightly overclocked at the factory to 797MHz and is outfitted with 1.5GB of GDDR5 memory; and an aggressively overclocked (855MHz) EVGA GTX 580 Classified, which is equipped with 3GB of GDDR5. Departing from our standard tests, we brought the pain by benchmarking all four cards on a 30-inch display at 2560 x 1600 resolution, with 4x AA and all settings at their maximum values.
When the smoke had cleared, the two GTX 580 cards won just a single benchmark, the HAWX2 test, which tessellates just about everything in sight. There were a few effective ties between the 7970 and the EVGA Classified, including Metro 2033 and Just Cause 2; but when the 7970 won, it generally won big. It opened a substantial lead on the Unigine Heaven synthetic test, for instance, even when we cranked tessellation to “extreme.”
AMD suggests that the Radeon HD 7970 has some clock-speed headroom, so we can expect to see factory-overclocked cards pushing the core clock rate up to 1GHz and possibly higher. That 3GB of frame buffer will come in handy for GPU-compute applications. AMD expects retail cards based on the 7970 to sell for $549, which strikes us as a reasonable, but not exceptional, value. AMD will also ship a cost-reduced version of the 7970, in the form of the Radeon HD 7950, in early 2012, but the company hasn’t released specs or pricing on that SKU.
Specifications - Radeon HD 7970 vs. Radeon HD 6970
| | Radeon HD 7970 | Radeon HD 6970 |
| --- | --- | --- |
| Manufacturing Process | 28nm | 40nm |
| Transistor Count | 4.31 billion | 2.64 billion |
| Reference Core Clock | 925MHz | 880MHz |
| Frame Buffer | 3GB GDDR5 | 2GB GDDR5 |
| Memory Clock | 1,375MHz | 1,375MHz |
| Memory Data Rate | 5.5Gbps | 5.5Gbps |
| Memory Bandwidth | 264GB/sec | 176GB/sec |
| Memory Bus | 384-bit | 256-bit |
| Stream Processors | 2,048 | 1,536 |
| Compute Performance | 3.79 single-precision TFLOPs | 2.7 single-precision TFLOPs |
| Texture Units | 128 | 96 |
| Texture Fill Rate (peak) | 118.4 gigatexels/sec | 84.5 gigatexels/sec |
| ROPs | 32 | 32 |
| Z/Stencil | 128 | 128 |
| Maximum Board Power | 250W | 250W |
| Idle Power (active) | 15W | 20W |
| Idle Power (long dark) | 3W | 20W |
Benchmarks
| | AMD Radeon HD 7970 Reference | XFX Radeon HD 6970 | EVGA GTX 580 SC | EVGA GTX 580 Classified |
| --- | --- | --- | --- | --- |
| 3DMark 2011 Perf | 7,985 | 5,750 | 6,747 | 7,321 |
| 3DMark Vantage Perf | 31,873 | 24,453 | 26,936 | 28,559 |
| Unigine Heaven 2.5 (fps) | 28 | 17 | 22 | 23 |
| Shogun 2 (fps) | 28 | 19 | 22 | 24 |
| Far Cry 2 / Long (fps) | 96 | 75 | 85 | 92 |
| HAWX 2 DX11 (fps) | 113 | 73 | 120 | 128 |
| STALKER: CoP DX11 (fps) | 37 | 25 | 28 | 29 |
| Just Cause 2 (fps) | 48 | 31 | 41 | 48 |
| Batman: Arkham City (fps) | 51 | 36 | 45 | 47 |
| Metro 2033 (fps) | 17 | 14 | 15 | 17 |
| DiRT3 (fps) | 60 | 44 | 50 | 55 |
| Core / Memory Clock Speeds (MHz) | 925 / 1375 | 880 / 1375 | 797 / 1013 | 855 / 1053 |
| Power @ idle (W) | 124 | 126 | 140 | 140 |
| Power @ full throttle (W) | 325* | 296 | 344 | 385 |
| Price | $549 | $350 | $550 | $600 |
* "Long dark" system power was 109W
Best scores are bolded. Our test bed is a 3.33GHz Core i7 3960X Extreme Edition in an Asus P9X79 Deluxe motherboard with 16GB of Corsair DDR3/1600 and a Corsair AX1200 PSU. The OS is 64-bit Windows Ultimate. All games are run at 2560 x 1600 with 4x AA, except for the 3DMark tests.
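For readers who want to put numbers on "won big," the game results above reduce to simple ratios. The sketch below copies the frame rates from our benchmark table and reports the 7970's lead over a chosen rival, both per game and as a geometric mean across games. The geometric mean is our choice of summary for this example, not part of our scoring method, and the function names are ours.

```python
from math import prod

CARDS = ("HD 7970", "HD 6970", "GTX 580 SC", "GTX 580 Classified")

# Frame rates (fps) copied from the benchmark table, in CARDS order.
FPS = {
    "Unigine Heaven 2.5": (28, 17, 22, 23),
    "Shogun 2": (28, 19, 22, 24),
    "Far Cry 2 / Long": (96, 75, 85, 92),
    "HAWX 2 DX11": (113, 73, 120, 128),
    "STALKER: CoP DX11": (37, 25, 28, 29),
    "Just Cause 2": (48, 31, 41, 48),
    "Batman: Arkham City": (51, 36, 45, 47),
    "Metro 2033": (17, 14, 15, 17),
    "DiRT3": (60, 44, 50, 55),
}


def lead_percent(game: str, rival: str) -> float:
    """Percentage by which the HD 7970 leads (or trails, if negative) a rival."""
    scores = FPS[game]
    return (scores[0] / scores[CARDS.index(rival)] - 1) * 100


def geomean_lead_percent(rival: str) -> float:
    """Geometric-mean lead of the HD 7970 over a rival across all games."""
    idx = CARDS.index(rival)
    ratios = [scores[0] / scores[idx] for scores in FPS.values()]
    return (prod(ratios) ** (1 / len(ratios)) - 1) * 100


if __name__ == "__main__":
    for rival in CARDS[1:]:
        print(f"vs {rival}: {geomean_lead_percent(rival):+.1f}% (geometric mean)")
    print(f"HAWX 2 vs Classified: {lead_percent('HAWX 2 DX11', 'GTX 580 Classified'):+.1f}%")
```

Swapping in the 3DMark scores or dividing by price is a one-line change if you'd rather weigh value than raw speed.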