ATI Radeon HD 5870: The Fastest Videocard Ever (PS It's $380)


AMD packs 2.15 billion transistors into a tiny chip, offering outstanding performance, DirectX 11 support, and triple-monitor (or better) capability. Nvidia’s response is nowhere to be seen.

AMD’s graphics division, the former ATI Technologies, loves a good surprise. The company has been a perennial also-ran in the graphics performance arena, but every now and then, it one-ups the competition in a big way. That happened back in 2002, with the launch of the original Radeon 9700, which stole the performance lead from archrival Nvidia. It happened again last year, with the Radeon HD 4800 series. The 4850, 4870, and 4890 weren’t always faster than the competition, but they were small, efficient chips that forced Nvidia into a price war that was good for users but bad for Nvidia’s bottom line.

Now AMD’s doing it again, putting some serious hurt on the competition with the first GPU to support Microsoft’s upcoming DirectX 11 API. AMD’s also been paying close attention to the emerging market for non-gaming apps accelerated by GPUs, such as video transcoding and digital photography, fully supporting DirectCompute 11 and OpenCL standards for general purpose computing on graphics cards.

This new chip is no shrinking violet in the numbers department. Every number associated with the new Radeon 5800 series is staggering: 2.15 billion transistors, 2.7 trillion floating-point operations a second, more than 20 gigapixels per second throughput, 1,600 shader units. Other numbers impress because of their smallness. One example: The idle power is a scant 27W, lower than that of many entry-level GPUs.

Given the sheer scale and ambition of this GPU, does it deliver in the performance realm? And will it deliver at a price normal humans can afford? Let’s find out.

Digging into the Radeon HD 5870

At its core is a no-compromise GPU more efficient than any in graphics history

Two years ago, AMD’s ATI division decided to bow out of the game of building huge, hot chips that were expensive to make, ceding the high-end glory to Nvidia’s GT200 chip. That’s not to say AMD gave up on performance; it instead adopted the mantra of building the best performance GPU within a certain cost and power envelope. The Radeon HD 5800 series, originally code-named RV870, is the culmination of that approach. Taking advantage of Moore’s Law, ATI’s designers were able to build a GPU with few compromises using a 40nm manufacturing process.

Radeon GPUs Compared

Radeon HD 4890 vs 5870

                             Radeon HD 4890    Radeon HD 5870
Die Size                     263mm2            334mm2
Transistor Count             956 million       2.15 billion
GPU Clock                    850MHz            850MHz
Memory Clock                 975MHz            1,200MHz
Memory Quantity (GDDR5)      1GB               1GB
Manufacturing Process        55nm              40nm
Stream Processors            800               1,600
Texture Units                40                80
ROPs                         16                32
Maximum Board Power (TDP)
Idle Power                                     27W

Power and Performance

The new GPU is just 334mm2—30 percent larger than the earlier 4870 GPU, but packing more than twice the number of transistors.

At 27W, the idle power is astonishingly low for such a large chip. The key factor was enabling lower memory clocks and voltages during idle, a feat made possible because of significant improvements in the 40nm manufacturing process. The net result is very low power when the board is just rendering your Windows desktop. At the same time, the VRM (voltage regulator module) interface has been improved, preventing overheating while allowing somewhat higher power consumption when performance is actually needed.

So the HD 5870 can draw less power while it’s doing nothing. But we also expect to see better performance, particularly given some of the other specs listed by ATI. The faster memory gives the 5870 overall memory bandwidth of 153GB/s. Feeding that huge pipe is a GPU with twice as much hardware where it matters—stream processors, ROPs, and texture units.
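The 153GB/s figure can be reproduced with a little arithmetic. The sketch below assumes a 256-bit memory bus and four data transfers per memory clock for GDDR5; neither number is stated in this article, so treat both as assumptions. The function name is purely illustrative.

```python
def memory_bandwidth_gb_s(memory_clock_mhz, bus_width_bits=256, transfers_per_clock=4):
    """Peak memory bandwidth in GB/s (1 GB = 10**9 bytes).

    bus_width_bits and transfers_per_clock are assumptions, not article figures.
    """
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = memory_clock_mhz * 1e6 * transfers_per_clock
    return transfers_per_second * bytes_per_transfer / 1e9


hd5870_bandwidth = memory_bandwidth_gb_s(1200)  # memory clock from the spec table
hd4890_bandwidth = memory_bandwidth_gb_s(975)

print(f"HD 5870: {hd5870_bandwidth:.1f} GB/s")
print(f"HD 4890: {hd4890_bandwidth:.1f} GB/s")
```

Under those assumptions, the 1,200MHz memory clock works out to roughly the 153GB/s quoted above, about 23 percent more than the same calculation gives for the 4890.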

The graphics engine itself sports some new features—particularly the hardware tessellation engine. While past ATI products have offered hardware tessellation, the new engine fully supports Microsoft’s DirectX 11 tessellation API. ATI is fond of pointing out that this is actually its sixth generation tessellation hardware.

Texture Units and Caches

Having a robust set of shaders gives the Radeon 5870 unparalleled computational performance, but games still make heavy use of textures. Previous Radeons have been criticized for having fewer texture units and ROPs (raster operations in the render back-ends) than the competition. ATI has responded by doubling the number of texture units, from 40 to 80. The ROPs have also been doubled. The result is a theoretical doubling of throughput to 68 billion bilinear filtered texels per second.
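The 68 billion texels per second figure follows from the unit count and the clock speed. This is a rough sketch that assumes each texture unit delivers one bilinear-filtered texel per clock and each ROP writes one pixel per clock; the helper name is hypothetical.

```python
def peak_rate_billions(units, clock_mhz, per_unit_per_clock=1):
    """Theoretical peak throughput in billions of operations per second.

    Assumes every unit completes per_unit_per_clock operations each cycle.
    """
    return units * clock_mhz * per_unit_per_clock / 1000.0


CLOCK_MHZ = 850  # engine clock of both the HD 4890 and the HD 5870

texels_5870 = peak_rate_billions(80, CLOCK_MHZ)  # 80 texture units
texels_4890 = peak_rate_billions(40, CLOCK_MHZ)  # 40 texture units
pixels_5870 = peak_rate_billions(32, CLOCK_MHZ)  # 32 ROPs

print(f"HD 5870 texel rate: {texels_5870:.1f} Gtexels/s")
print(f"HD 4890 texel rate: {texels_4890:.1f} Gtexels/s")
print(f"HD 5870 pixel rate: {pixels_5870:.1f} Gpixels/s")
```

The same formula applied to the 32 ROPs lands comfortably above the "more than 20 gigapixels per second" quoted earlier.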

The L2 cache sizes have been increased to 128KB to facilitate the additional throughput generated by the increased number of texture units. Raw cache performance has also been improved, upping the L1 texture fetch bandwidth to one terabyte per second. Each SIMD engine has dedicated L1 cache, and the bandwidth between these exclusive caches and the shared L2 cache is 435GB/s. Also, maximum texture size has been increased to 16k x 16k, to support DirectX 11.

Image Quality

In the past, we’ve found that Radeon GPUs’ texture filtering was inferior to that of the equivalent Nvidia GPUs. AMD’s graphics architects took those criticisms to heart and redesigned the texture filtering units. The key is a new anisotropic filtering algorithm that no longer depends on the angle of view. This new, high-quality anisotropic filtering comes with no performance hit when compared to the older method.

The combination of raw compute horsepower, improved filtering algorithms, and double the texture units also gives the RV870 incredible antialiasing performance. AMD estimates the performance hit from going to 8x AA from 4x AA ranges from just a couple percentage points to less than 20 percent. All that graphics performance on tap has allowed AMD to implement supersampling antialiasing, something it had removed a few years back.

DirectX 11 on Tap

The Radeon HD 5800 chip implements all of DirectX 11 in hardware. This includes:

Hardware tessellation

This is the ability to generate geometry from an abstract description of the object defined by patches. Triangles can be interpolated within the patches, adding large amounts of geometric detail without the artist needing to explicitly create new artwork.

Shader Model 5.0

DirectX 11 now sports a unified shader language across all types of shaders: vertex, hull, domain, geometry, pixel, and the new compute shaders.

Object oriented programming model

The era of writing huge shader programs may be past. Instead, programmers can work in more familiar ways, creating shader objects that can be called by parent programs. It’s easier to write, document, and debug.

Order independent transparency

Previously, transparency was handled by alpha blending, or by the application doing expensive traversal of the geometry to understand the order of the triangles. Order independent transparency makes creating complex subjects with many transparent elements easier.


Multithreading

The DirectX API is fully multithreaded, and the drivers can now be more fully multithreaded, taking advantage of multicore CPUs. This feature will actually help speed up applications on older-generation GPUs, once DirectX 11 and new drivers are released.

DirectX 11 Compute

This is the new term for compute shaders. Now DirectX programmers have a standard interface for adding general purpose compute elements to games, such as physics, post-processing, and even alternative renderers, like ray tracing and radiosity rendering.

All these features are enabled in the RV870 hardware, making the new GPU the first fully compliant DirectX 11 graphics chip available.
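To make the tessellation idea from the feature list concrete, here is a minimal CPU-side sketch in Python. It is not the DirectX 11 API and not how the hardware is programmed; it only illustrates how a single four-corner patch can be expanded into many triangles by interpolation. The function name and the flat bilinear patch are assumptions chosen for illustration.

```python
def tessellate_quad_patch(corners, level):
    """Subdivide a four-corner patch into a grid of triangles.

    corners: four (x, y, z) tuples in the order p00, p10, p01, p11.
    level:   number of subdivisions along each edge (must be >= 1).
    Returns (vertices, triangles); triangles hold indices into vertices.
    """
    if level < 1:
        raise ValueError("level must be at least 1")
    p00, p10, p01, p11 = corners

    def lerp(a, b, t):
        # Linear interpolation between two points.
        return tuple(ai + (bi - ai) * t for ai, bi in zip(a, b))

    n = level + 1  # vertices per edge
    vertices = []
    for j in range(n):
        v = j / level
        left = lerp(p00, p01, v)
        right = lerp(p10, p11, v)
        for i in range(n):
            vertices.append(lerp(left, right, i / level))

    triangles = []
    for j in range(level):
        for i in range(level):
            a = j * n + i  # lower-left vertex of this grid cell
            b = a + 1
            c = a + n
            d = c + 1
            triangles.append((a, b, c))
            triangles.append((b, d, c))
    return vertices, triangles


patch = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
vertices, triangles = tessellate_quad_patch(patch, level=4)
print(len(vertices), "vertices,", len(triangles), "triangles")
```

The artist supplies only the four corners; raising the level adds geometric detail without any new artwork, which is the point of doing this step in hardware.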

GDDR5: Fast and Efficient Memory

Since the Radeon HD 5870 essentially doubles the computational power available for graphics, ATI needed a very fast memory subsystem. The new card uses GDDR5 memory running at 1,200MHz (effective), 225MHz faster than the same type of video RAM used on the older HD 4890. According to ATI, this now results in a balanced graphics card; the 4890 often was shader bound, meaning that available memory bandwidth went unused. The card itself also makes use of a GDDR5 feature called lower power strobe mode. This is part of what enables the 5870 to idle at a miserly 27W.

The Other Radeon: the HD 5850

When AMD shipped the first of the Radeon 4800 series, that card was the 4850. Later, the HD 4870 arrived on the scene. The only difference between the 4850 and the 4870 was the width (the 4850 was a single-wide card; the 4870 took two PCI slots), plus core and memory clock speeds.

The HD 5850 actually has a slightly different feature set than the 5870; ATI has disabled some of the functional units. It’s the same chip, but with two of its 20 SIMD engines disabled (leaving 1,440 stream processors) – what’s often called a “salvage” part. The clocks are also cut down a bit. So this time around, there is actually a difference in the two GPUs.

Radeon HD 5850 & 5870

                             Radeon HD 5850    Radeon HD 5870
Transistor Count             2.15 billion      2.15 billion
Memory Clock                                   1,200MHz
Memory Quantity (GDDR5)      1GB               1GB
Stream Processors                              1,600
Texture Units                                  80
ROPs                         32                32
Maximum Board Power (TDP)
Idle Power                                     27W

The Price of Glory

The best news about the Radeon HD 5800 series is the pricing. Final pricing wasn’t available as we wrapped up the article, but AMD noted that the HD 5870’s target price is around $380, with the 5850 aimed at under the $300 mark.

While $380 is a lot of money for a high-end board, the era of shipping the most capable GPU on the planet in a $600 board seems to be over.

Radeon HD 5870 Crushes Nvidia’s 285 GTX

In our GPU cage match, AMD’s new graphics processor delivers a stunning KO against the heavily overclocked competition

It’s a classic graphics card cage match. In one corner, the feisty, but unproven newcomer; in the other corner glowers the grizzled veteran. The newcomer, of course, is AMD’s shiny new Radeon HD 5870, weighing in at 2.15 billion transistors. The grizzled veteran is Nvidia’s 285 GTX. But this is no ordinary 285 GTX. We pitted the Radeon against a souped-up EVGA 285 GTX SSC.

The 285 GTX SSC runs its core clock at 702MHz, more than eight percent faster than the stock 648MHz; the memory clock is pumped up to 1,323MHz, about 6.5 percent faster than the base. In other words, it’s about the fastest Nvidia-based, single-chip graphics card you can get.

The newcomer is AMD’s spiffy new Radeon HD 5870. With an 850MHz engine clock and 1,200MHz GDDR5 clock, AMD’s new progeny looks like it has the chops to take on Nvidia. But we’ve been disappointed by promising GPUs from AMD’s graphics division in the past.

Not this time.

We tested three cards (also tossing in AMD’s previous best, the Radeon HD 4890) in a Core i7 975 system with 6GB of RAM, running on an Asus Rampage II X58 motherboard. All that CPU horsepower is to ensure that the benchmarks stress the graphics card, rather than be held back by CPU or RAM. We used Windows 7 Ultimate RTM as the OS.

After the smoke cleared, the 285 GTX looked like a tired fighter who’d been rope-a-doped and KO’d. The performance differences aren’t minor, they’re huge: The Radeon HD 5870 was 63 percent faster in Crysis, 32 percent faster in Far Cry 2, 33 percent faster in STALKER, and even 24 percent faster in Battleforge, an RTS that’s arguably more dependent on CPU than graphics.

This round of the endless GPU wars, then, is clearly owned by AMD, at least for single-GPU cards. And with performance like this, who wants the heat and power consumption of a dual-GPU card?

On the other hand, we won’t count Nvidia out. While Nvidia’s current high end is now relegated to also-ran status, the company is slaving away on its DirectX 11 GPU, code-named GT300. When that ships, expect a rematch.

Radeon HD 4890
Radeon HD 5870
Crysis (FPS) 22
Far Cry 2 (FPS) 51
BattleForge (FPS)
3DMark Vantage Performance (Score) 12128
3DMark Vantage Extreme (Score) 6276
Idle System Power (W)
Full Load System Power (W) 363

Best scores are bolded. All benchmarks run at 1920x1200 with 4x AA enabled and all graphics settings maxed out unless otherwise specified. Full load system power was taken during a 3DMark Vantage run at 2560x1600 with extreme settings.

Six Monitors, One Card

One of the more intriguing aspects of the Radeon HD 5870 is its use of multiple displays, something AMD dubs “Eyefinity.” The first shipping HD 5870 comes with four display connectors: two DVI, one DisplayPort, and one HDMI. Up to three monitors can be connected to any three of the four connectors. (Due to timing limitations, all four connectors can’t be used simultaneously.) Later this year, AMD will ship a card with six DisplayPort adapters, capable of connecting up to six DisplayPort-equipped monitors at the same time.

For cards supporting up to three displays, usage scenarios might include three wide-screen displays in portrait mode, side by side. Cards capable of driving six simultaneous digital monitors could support a variety of display options: 3 x 2 landscape, 2 x 2, or oddball scenarios such as 3 x 1 with another 1 on top as an extended display (for flight sims, for example).

One underlying technology making a six display configuration possible is DisplayPort. The display controller in the RV870 generates only two timing signals, suitable for DVI or VGA. DisplayPort can source external timing signals, and DisplayPort-equipped monitors can act as timing sources.

GPGPU: GPU Compute Comes of Age

Using graphics chips for general purpose computing is still a pretty new concept. AMD notes that the trend is increasingly toward apps that operate on large amounts of data in parallel within a single application. Traditional CPUs, on the other hand, are geared toward high performance with applications that crunch data serially.

Video transcoding, photo and video effects filters, plus tools like noise reduction and image cleanup have a ravenous appetite for parallel floating-point compute power. The HD 5870 offers up to 2.7 trillion single-precision floating-point operations per second and up to 544 billion double-precision FP operations every second. To put that in context, Intel’s fastest CPU today, the Core i7 975, is capable of about 85 billion FLOPs. Floating-point calculations are now IEEE 754-compliant, which makes life easier for application developers and end users.
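Those headline FLOPS numbers can be sanity-checked from the specs. The sketch below assumes each of the 1,600 stream cores retires one multiply-add per clock, counted as two floating-point operations; that counting convention is an assumption, and the double-precision and CPU figures are simply the ones quoted above.

```python
def peak_gflops(stream_cores, clock_mhz, flops_per_core_per_clock=2):
    """Theoretical peak GFLOPS, counting one multiply-add as two operations."""
    return stream_cores * clock_mhz * flops_per_core_per_clock / 1000.0


single_precision = peak_gflops(1600, 850)  # 1,600 stream cores at 850MHz
double_precision = 544.0                   # figure quoted in the article
core_i7_975 = 85.0                         # figure quoted in the article

print(f"Single precision: {single_precision:.0f} GFLOPS")
print(f"Double precision is {double_precision / single_precision:.0%} of that")
print(f"Single-precision advantage over the CPU: {single_precision / core_i7_975:.0f}x")
```

On these numbers the double-precision rate is one fifth of the single-precision rate, and the GPU's theoretical single-precision peak is about 32 times the CPU figure. Peak rates like these are upper bounds, not measured application performance.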

AMD built in the hooks to make the 5800 series a better general purpose compute engine than past Radeons. This includes full hardware implementation of OpenCL and DirectCompute 11, IEEE 754-2008 floating-point compliance, better memory handling for general applications, and global synchronization and data sharing.

While Nvidia has been pushing its proprietary CUDA architecture hard over the past several years, only a handful of consumer-level applications have really taken advantage of GPU compute. But now we have two emerging standards, both with strong organizations backing them: OpenCL and DirectCompute.

OpenCL development is coordinated and tested by the Khronos Group, which is also responsible for the OpenGL graphics standard. OpenCL is available on a variety of operating systems, including Windows, MacOS, and Linux. DirectCompute is part of Microsoft’s DirectX 11 API and fully supports the unified Shader Model 5 language.

Now that two strong standards have taken root, we’ll likely see GPU computing move beyond its proprietary and experimental roots. Upcoming DirectX 11 games will use DirectCompute for physics, deferred shading, and graphics post-processing. Companies like CyberLink are building consumer video and photo editing applications around OpenCL, moving away from CUDA and embracing standards.

Anatomy of a Thread Processor

With each succeeding generation, games are making heavier use of programmable shaders. It’s no wonder AMD and its competition are increasingly focusing on this key aspect of the GPU. At the heart of the beast that is the HD 5870 programmable engine are thread processors.

The thread processors, consisting of four stream cores and support units, were redesigned, streamlining some key instructions, adding DirectX 11 bit-level operations, and implementing a fused multiply-add capability. All of these increase the number of instructions per clock cycle for each thread processor. The 5870 has five stream cores (including the special function core) per thread processor, 16 thread processors per SIMD engine, and 20 SIMD engines, for a total of 320 thread processors and 1,600 stream cores. The SIMD engine is the smallest logical functional unit, and it’s likely that future DirectX 11 GPUs will pare the GPU at the SIMD engine level to build lower cost cards.
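The totals above are just the product of the three levels of the hierarchy. A small sketch makes the arithmetic explicit and shows how paring SIMD engines would scale a part down; the 10-engine configuration is a hypothetical example, not a product named in this article.

```python
STREAM_CORES_PER_THREAD_PROCESSOR = 5  # four stream cores plus one special function core
THREAD_PROCESSORS_PER_SIMD = 16
HD5870_SIMD_ENGINES = 20


def shader_totals(simd_engines):
    """Return (thread_processors, stream_cores) for a given SIMD engine count."""
    thread_processors = simd_engines * THREAD_PROCESSORS_PER_SIMD
    stream_cores = thread_processors * STREAM_CORES_PER_THREAD_PROCESSOR
    return thread_processors, stream_cores


full_chip = shader_totals(HD5870_SIMD_ENGINES)
half_chip = shader_totals(10)  # hypothetical pared-down part

print("HD 5870:", full_chip)
print("Hypothetical 10-engine part:", half_chip)
```

Because each SIMD engine carries 80 stream cores, every engine removed cuts the shader count in steps of 80.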

Each thread processor consists of four stream cores and one special function core, a branch unit, and a number of general purpose registers. The four stream cores together can pump out four 32-bit floating-point multiply-adds per clock cycle, or generate a pair of FP multiplies. AMD also implemented a math feature called Sum of Absolute Differences, which is useful in video encoding and computer vision applications. A variety of DirectX 11 bit-level operations are also built in.
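For readers unfamiliar with Sum of Absolute Differences, here is a minimal Python sketch of the operation and of the block-matching search that video encoders build on it. This is plain software for illustration, not AMD's hardware instruction; the function names and sample pixel values are made up for the example.

```python
def sum_of_absolute_differences(block_a, block_b):
    """Add up the absolute per-pixel differences between two equal-length blocks."""
    if len(block_a) != len(block_b):
        raise ValueError("blocks must be the same length")
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def best_match_offset(reference, search_row):
    """Slide reference across search_row; return (offset, sad) of the closest match."""
    width = len(reference)
    candidates = range(len(search_row) - width + 1)
    best = min(
        candidates,
        key=lambda off: sum_of_absolute_differences(reference, search_row[off:off + width]),
    )
    return best, sum_of_absolute_differences(reference, search_row[best:best + width])


reference = [10, 20, 30, 40]                  # pixel values from the previous frame
search_row = [0, 0, 12, 19, 31, 40, 5, 5]     # pixel values from the current frame

offset, sad = best_match_offset(reference, search_row)
print(f"Best match at offset {offset} with SAD {sad}")
```

A lower SAD means a closer match, so an encoder can use the winning offset as a motion estimate. The search repeats the same simple operation over many candidate positions, which is why a dedicated instruction helps.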
