The whole truth and nothing but the truth (as far as we know it)
Watching the ongoing race between AMD and Nvidia to build the ultimate graphics processor reminds us of the tale of the tortoise and the hare. AMD has played the hare, aggressively bounding ahead of Nvidia in terms of process size, number of stream processors, frame buffer size, memory interface, die size, and even memory type. Yet Nvidia always manages to snag the performance crown. The GeForce 200 series is but the latest example.
We convinced Nvidia to provide us with an early engineering sample of its high-end reference design (the GeForce GTX 280), with very immature drivers, for a first look at the GPU’s performance potential. At the time of this writing [Ed note: late May], the company was still a full month away from shipping this product, and its lesser cousin, the GeForce GTX 260, so we won’t issue a formal verdict in this issue (our full hands-on review should be online by the time this issue reaches you).
As interesting as the benchmark numbers are, the story behind this new architecture is even more fascinating. We’ll give you all the juicy details, but first, let’s explain the new naming scheme: Nvidia has sowed a lot of brand confusion in the recent past, especially with the 512MB 8800 GTS. That card was based on a completely different GPU architecture than the 8800 GTS models with 320MB and 640MB frame buffers. The Green Team hopes to change that with this generation.
The letters GTX now represent Nvidia’s “performance” brand, and the three digits following those letters will indicate the degree of performance scaling: The higher the number, the more performance you should expect. Using 260 as a starting line should give the company plenty of headroom for future products (as well as leave a few slots open below for budget parts).
AMD jumped ahead to a 55nm manufacturing process with the RV670 (the foundation for the company’s flagship Radeon HD 3870), but Nvidia stuck with the tried-and-true 65nm process for the GeForce 200 series. Nvidia cites the new part’s long development cycle and sensible risk management as justification.
The GTX 280 is an absolute beast of a GPU: Packing 1.4 billion transistors (the 8800 GTX got by with a mere 681 million, and a quad-core Penryn has 820 million), it’s capable of bringing a staggering 930 gigaFLOPs of processing power to any given application (a Radeon HD 3870 delivers 496 gigaFLOPs, while the quad-core Penryn musters just 96).
Considering the transistor count and the 65nm process size, the GeForce 200 die must be absolutely huge (and Nvidia’s manufacturing yields hideously low). Although Nvidia declined to provide numbers on either of those fronts, those two questions will remain academic in the absence of fresh and considerable competition from AMD. (And for the record, all AMD would tell us about its new part is that we can expect it “real soon.”)
You could fit nearly six Penryns onto a single GeForce GTX 280 die, although a portion of the latter part’s massive size can be attributed to the fact that it’s manufactured using a 65nm process, compared to the Penryn’s more advanced 45nm process.
Nvidia packs 240 tiny processing cores into this space, plus 32 raster-operation processors, a host of memory controllers, and a set of texture processors. Thread schedulers, the host interface, and other components reside in the center of the die.
With technologies like CUDA, Nvidia is increasingly targeting general-purpose computing as a primary application for its hardware, reducing its reliance on PC gaming as the raison d’être for such high-end GPUs.
The GeForce GTX 280 has 240 stream processors onboard (Nvidia has taken to calling them “processing cores”). This being Nvidia’s second-generation unified architecture, each core can handle vertex-shader, pixel-shader, or geometry-shader instructions as needed. The cores can handle other types of highly parallel, data-intensive computations, too—including physics, a topic we’ll explore in more depth shortly. The GeForce GTX 260 is equipped with 192 stream processors.
Although the GeForce GTX 280 has nearly twice as many stream processors as Nvidia’s previous best GPU, it’s still 80 shy of the 320 in AMD’s Radeon HD 3870. But Nvidia’s asymmetric clock trick, which enables its stream processors to run at clock speeds more than double that of the core, has so far obliterated AMD’s numerical advantage. In fact, a single GeForce GTX 280 proved to be an average of 28 percent faster than the dual-GPU Radeon HD 3870 X2 with real-world games running on Windows XP, and it was 24 percent faster running on Vista.
We didn’t have an opportunity to benchmark the GTX 280 in SLI mode (or the GTX 260 at all), but a single GTX 280 beat two GeForce 9800 GTX cards running in SLI by a 9-percent margin, thanks in large measure to significantly improved performance with Crysis. (Turn to page 60 for complete benchmark results.)
A significant increase in the number of raster-operation processors (ROPs) and the speed at which they operate likely contributes to the new chip’s impressive performance. The 8800 GTX has 24 ROPs and the 9800 GTX has 16, but if the resulting pixels need to be blended as they’re written to the frame buffer, those two GPUs require two clock cycles to complete the operation. The 9800 GTX, therefore, is capable of blending only eight pixels per clock cycle.
The GTX 280 not only has 32 ROPs but is also capable of blending pixels at full speed—so its 32 ROPs can blend 32 pixels per clock cycle. The GTX 260, which is also capable of full-speed blending, is outfitted with 28 ROPs.
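The blend-rate arithmetic in the two paragraphs above can be checked with a short Python sketch. The ROP counts and the half-speed versus full-speed blending behavior come from the text; the function name and the dictionary layout are our own illustrative choices.

```python
# ROP counts and blending behavior as described in the text.
# cycles_per_blend: clock cycles one ROP needs to blend one pixel.
GPUS = {
    "8800 GTX": {"rops": 24, "cycles_per_blend": 2},
    "9800 GTX": {"rops": 16, "cycles_per_blend": 2},
    "GTX 260": {"rops": 28, "cycles_per_blend": 1},
    "GTX 280": {"rops": 32, "cycles_per_blend": 1},
}


def blended_pixels_per_clock(rops, cycles_per_blend):
    """Pixels blended per core clock cycle across all ROPs (hypothetical helper)."""
    return rops / cycles_per_blend


for name, spec in GPUS.items():
    rate = blended_pixels_per_clock(spec["rops"], spec["cycles_per_blend"])
    print(f"{name}: {rate:g} blended pixels per clock")
```

By this count the GTX 280 blends four times as many pixels per clock as the 9800 GTX, before any difference in clock speed is considered.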
GeForce GTX 280 cards will feature a 1GB frame buffer, and the GPU will access that memory over an interface that’s a full 512 bits wide. AMD’s Radeon 2900 XT, you might recall, also had a 512-bit memory interface, but the company dialed back to a 256-bit interface for the Radeon 3800-series, claiming that the wider alternative didn’t offer much of a performance advantage. That was before Crysis hit the market.
Cards based on the GTX 260 will have 896MB of memory with a 448-bit interface. Despite the news that AMD will move to GDDR5 with its next-generation GPUs, Nvidia is sticking with GDDR3, claiming that the technology “still has plenty of life in it.” Judging by the performance of the GTX 280 compared to the Radeon 3870 X2, which uses GDDR4 memory (albeit half as much and with an interface half as wide as the GTX 280’s), we’d have to agree. Nvidia is taking a similar approach to Direct3D 10.1 and Shader Model 4.1: The GTX 280 and GTX 260 don’t support either.
A stock GTX 280 will run its core at 602MHz while its stream processors hum along at 1.296GHz. Memory will be clocked at 1.107GHz. The GTX 260 will have stock core, stream processor, and memory clock speeds of 576MHz, 1.242GHz, and 999MHz, respectively (what, they couldn’t squeeze out an extra MHz to reach an even gig?).
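As a rough cross-check on the figures quoted so far, the sketch below derives peak shader throughput, memory bandwidth, and the shader-to-core clock ratio from the stock specifications. Two inputs are our assumptions rather than anything Nvidia states above: that each stream processor can retire three floating-point operations per shader clock (a multiply-add plus a multiply), and that GDDR3 transfers data twice per memory clock. The function names are likewise hypothetical.

```python
# Stock specifications quoted in the article (clocks in MHz, bus width in bits).
CARDS = {
    "GTX 280": {"sps": 240, "core": 602, "shader": 1296, "mem": 1107, "bus": 512},
    "GTX 260": {"sps": 192, "core": 576, "shader": 1242, "mem": 999, "bus": 448},
}

FLOPS_PER_SP_PER_CLOCK = 3   # ASSUMPTION: multiply-add (2 ops) plus multiply (1 op)
TRANSFERS_PER_MEM_CLOCK = 2  # ASSUMPTION: GDDR3 moves data twice per clock


def peak_gflops(sps, shader_mhz):
    """Theoretical peak shader throughput in gigaFLOPs."""
    return sps * shader_mhz * FLOPS_PER_SP_PER_CLOCK / 1000.0


def memory_bandwidth_gbs(bus_bits, mem_mhz):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * mem_mhz * TRANSFERS_PER_MEM_CLOCK / 1000.0


def shader_to_core_ratio(shader_mhz, core_mhz):
    """How many times faster the stream processors run than the core."""
    return shader_mhz / core_mhz


for name, c in CARDS.items():
    print(
        f"{name}: {peak_gflops(c['sps'], c['shader']):.0f} gigaFLOPs, "
        f"{memory_bandwidth_gbs(c['bus'], c['mem']):.1f} GB/s, "
        f"shader/core ratio {shader_to_core_ratio(c['shader'], c['core']):.2f}"
    )
```

Under those assumptions the GTX 280 works out to roughly 933 gigaFLOPs, in line with the 930 figure quoted earlier, and both cards run their stream processors at a little over twice the core clock.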
When Nvidia acquired the struggling Ageia, we were disappointed—but not surprised—to learn that Nvidia was interested only in the PhysX software. While it wouldn’t be accurate to say that Nvidia has orphaned the hardware, the company has no plans to continue developing the PhysX silicon. What’s more, there is absolutely no Ageia intellectual property to be found in the GTX 200-series silicon—the new GPU had already been taped out when the acquisition was finalized in February.
But Nvidia didn’t acquire Ageia just to put the company out of its misery. The company’s engineers quickly set about porting the PhysX software to Nvidia’s GeForce 8-, 9-, and 200-series GPUs. When Ageia first introduced the PhysX silicon, the company maintained that it was a superior solution to the CPU and GPU architectures, which weren’t specifically optimized for accelerating complex physics calculations. In reality, the PhysX architecture wasn’t as radically different from modern GPU architectures as we’d been told.
The first PhysX part, for example, had 30 parallel cores; the mobile version that ships in Dell’s XPS 1730 notebook PC has 40 cores. Nvidia tells us it took only three months to get PhysX software running on GeForce, and the software will soon be running on every CUDA platform. See the sidebar on this page for more information on the GeForce 200 series’ physics capabilities.
The screenshot above shows something of what’s possible with PhysX technology. The Unreal Tournament 3 Tornado mod features a whirling vortex that tears the battlefield apart as the game progresses. The tornado can also suck in projectile weapons, such as rockets, adding an exciting new dynamic to the game.
Unfortunately for Ageia, mods such as this were too few and far between, and this chicken-or-the-egg conundrum ultimately killed the PhysX physics processing unit. By the time Nvidia acquired the company, Ageia had convinced just two manufacturers—Asus and BFG—to build add-in boards based on the PPU, and Dell was the only major notebook manufacturer to offer machines featuring the mobile version. Absent a large installed base of customers, few major game developers (aside from Epic and Ubisoft’s GRAW team) saw any reason to support the hardware.
Nvidia will have a much more persuasive argument: When it releases PhysX drivers for the GeForce 8-, 9-, and 200-series GPUs, the installed base will amount to 90 million units—a number expected to swell to 100 million by the end of 2008.
Even then, we predict PhysX will need a killer app if it’s to really take off. Nvidia will need to help foster the development of more PhysX-exclusive games, such as the Tornado and Lighthouse mods for Unreal Tournament 3, and the Ageia Island level in Ghost Recon: Advanced Warfighter.
Nvidia will also remedy one of Ageia’s key marketing mistakes: Consumers couldn’t run a PhysX application unless they had a PhysX processor, which meant they had no idea what they might be missing out on. Under Nvidia’s wing, PhysX applications will fall back to the host CPU in the absence of a CUDA-compatible processor. The app might run like a fly dipped in molasses, but the experience could fuel demand for Nvidia-based videocards.
Nvidia tells us it expects to have PhysX drivers for the GTX 200 series shortly after launch; drivers for GeForce 8- and 9-series parts will follow soon after that.
Both the GeForce GTX 280 and 260 have two SLI edge connectors, so they will support three-way SLI configurations. Nvidia wouldn’t comment on the possibility of a future single-board, dual-GPU product that would allow quad SLI, but reps did tell us they expect the current dual-GPU GeForce 9800 GX2 to fade away.
Nvidia’s reference-design board features two DVI ports and one analog video output on the mounting bracket, with HDMI support available via dongle. The somewhat kludgy solution of bringing digital audio to the board via SPDIF cable remains (we much prefer AMD’s over-the-bus solution). Add-in board partners can choose to offer DisplayPort SKUs for customers who want support for displays with 10-bit color and 120Hz refresh rates.
Nvidia tells us there’s more to the GeForce 200 series than just substantial increases in the numbers of stream processors and ROPs. The new GPUs, for example, are capable of managing three times as many threads in flight at a given time as the previous architecture. Improved dual-issue performance enables each stream processor to execute multiple instructions simultaneously, and the new processors have twice as many registers as the previous generation.
These performance-oriented improvements should allow for faster shader performance and increasingly complex shader effects, according to Nvidia. In a new demo called Medusa, a geometry shader enables the mythical creature to turn a warrior to stone with a single touch. This isn’t a simple texture change or skinning operation—the stone slowly creeps up the warrior’s leg, torso, and face until he is completely transformed. Medusa then knocks off his head with a flick of her tail for good measure.
Nvidia still perceives gaming as a critically important market for its GPUs, but the company is also looking well beyond that large, but still niche, market. Through its CUDA (Compute Unified Device Architecture) initiative, the company is taking on an increasing number of apps that have traditionally been the responsibility of the host CPU. Nvidia isn’t looking to replace the CPU with a GPU; it’s just trying to convince consumers that GPU purchasing decisions and upgrades are more important than CPU purchasing decisions.
CUDA applications will run on any GeForce 8- or 9-series GPU, but the GeForce 200 series delivers an important advantage over those architectures: support for the IEEE-754R double-precision floating-point standard. This should make the new GPUs—and CUDA in general—even more attractive to users who develop or run applications that rely heavily on floating-point math. Such applications are common not only in the scientific, engineering, and financial markets, but also in the mainstream consumer marketplace (for everything from video transcoding to digital photo and video editing).
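To see why double precision matters for math-heavy workloads, consider the sketch below, which adds 0.1 one hundred thousand times in emulated single precision and in double precision. It is plain CPU-side Python, not CUDA code, and the helper names are hypothetical; single precision is emulated by rounding each intermediate result to 32 bits with the standard `struct` module.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step, count, single_precision):
    """Add `step` to a running total `count` times.

    When single_precision is True, the step and every intermediate
    total are rounded to 32 bits, emulating a single-precision sum.
    """
    if single_precision:
        step = to_float32(step)
    total = 0.0
    for _ in range(count):
        total += step
        if single_precision:
            total = to_float32(total)
    return total


COUNT = 100_000
EXACT = 10_000.0  # 0.1 * 100,000 in exact arithmetic

single = accumulate(0.1, COUNT, single_precision=True)
double = accumulate(0.1, COUNT, single_precision=False)

print(f"single precision: {single!r} (error {abs(single - EXACT):.3g})")
print(f"double precision: {double!r} (error {abs(double - EXACT):.3g})")
```

The single-precision total drifts visibly away from 10,000, while the double-precision total stays accurate to many decimal places. Long-running simulations and financial calculations accumulate rounding error in much the same way, which is why those users tend to ask for double-precision hardware.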
Nvidia has made great strides in reducing its GPUs’ power consumption, and the GeForce 200 series promises to be no exception. In addition to supporting Hybrid Power (a feature that can shut down a relatively power-thirsty add-in GPU when a more economical integrated GPU can handle the workload instead), these new chips will have performance modes optimized for times when Vista is idle or the host PC is running a 2D application, when the user is watching a movie on Blu-ray or DVD, and when full 3D performance is called for. Nvidia promises the GeForce device driver will switch between these modes based on GPU utilization in a fashion that’s entirely transparent to the user.
We can’t take the performance of an engineering-sample board with early drivers as gospel, but the benchmark results have us hungry for shipping product!
Few things piss us off as readily as a new architecture that offers only incremental improvements in performance. Fortunately for Nvidia, that’s not the case with the GeForce GTX 280. Assuming the drivers that ship with this card deliver performance as good as these beta versions, Nvidia will have another in what has been a long list of winners on its hands.
The GTX 280 delivered real-world benchmark numbers nearly 50 percent faster than a single GeForce 9800 GTX running on Windows XP, and it was 23 percent faster than that card running on Vista. In fact, it looks as though a single GTX 280 will match, and in some cases beat, two 9800 GTX cards running in SLI, a fact that explains why Nvidia expects the 9800 GX2 to fade from the scene rather quickly.
We’re especially pleased with the performance delta we observed with Crysis: Even with the resolution at 1920x1200, 4x antialiasing enabled, and all the game’s other quality settings on high, our engineering sample delivered the game at more than 30 frames per second running DirectX 9. Games still run slower on Vista, however; Crysis, for example, shed about eight frames per second running DirectX 10, but it was still twice as fast as a single 9800 GTX. And remember, we tested an engineering sample running on pre-release drivers.
The GTX 280 absolutely clobbered AMD’s dual-GPU Radeon 3870 X2, delivering superior overall benchmarks in both Windows XP and Vista. The one bright spot, oddly enough, was the X2’s Crysis performance in Vista: AMD’s part managed to run the game two frames per second faster than Nvidia’s latest. The single GTX 280, on the other hand, was more than twice as fast running the RTS World in Conflict under Vista.
| Benchmark | Prototype GTX 280 | PNY 9800 GTX | PNY 9800 GTX in SLI | MSI Radeon 3870 X2 |
|---|---|---|---|---|
| 3DMark06 Game 1 (fps) | | | | |
| 3DMark06 Game 2 (fps) | | | | |
| 3DMark Vantage: Game 1 (fps) | | | | |
| 3DMark Vantage: Game 2 (fps) | | | | |
| Unreal Tournament 3 (fps) | | | | |
| Company of Heroes: Opposing Fronts (fps) | | | | |
| World in Conflict (fps) | | | | |

Best scores are bolded. Nvidia-based cards were tested with an EVGA 680i SLI motherboard; AMD-based cards were tested with an Intel D975BX2 motherboard. Intel 2.93GHz Core 2 Extreme CPUs and 2GB of Corsair DDR RAM were used in both scenarios. Benchmarks were performed at 1920x1200 resolution on ViewSonic VP2330wb monitors.