Nvidia’s latest generation GPU is going through the most painful, drawn-out gestation period since the company’s first programmable GPU, the GeForce 5800 series. Like the more recent GeForce GTX 280, the current GF100 (the code name, not the final name) chip represents a major, ground-up architectural redesign.
Recently, we spent the better part of a day being briefed on the GF100, which represents the first actual graphics processor built with Nvidia’s Fermi architecture. The presenters included Jonah Alben, Senior VP of GPU engineering, Henry Morton, lead architect for geometry processing, Emmett Kilgariff, Director of GPU Architecture plus a host of Nvidia marketing and PR folk.
The basic Fermi architecture layers graphics functionality atop a powerful parallel compute engine. As GPU compute becomes more important, both in games and in certain classes of mainstream applications, it makes sense to design an architecture with more general-purpose capability. That’s not to say that Fermi will try to take on the functions of a mainstream CPU. For example, though GF100’s compute engines offer some CPU-like capability, like a full L2 read/write data cache, there’s no provision for speculative execution. As Henry Morton, one of the lead architects for the GF100, noted, “You don’t want to do speculative execution on a parallel machine.”
Nvidia’s architecture team had some key goals in mind when designing GF100:
• Geometric realism. As DirectX 8, 9 and 10 progressed, GPUs made huge strides in pixel shader and texturing performance – up to 150x over the first generation of programmable GPUs. Meanwhile, geometry performance only increased about 3x over that time.
• Unrivaled image quality. The GF100 moves well beyond multisampling anti-aliasing, improving the previous CSAA (coverage sample anti-aliasing) by allowing 32-sample CSAA. Substantial work also went into improving anti-aliasing with transparency.
• Beefing up GPU compute for gaming. In addition to improving performance in the now-familiar post-processing effects (motion blur, depth-of-field effects), the GF100 offers more robust accelerated physics, with the potential of also offloading AI and animation functionality previously the realm of CPUs.
Let’s keep these goals in mind as we dive deeper into the architectural features of the GF100. We’ll first take a high level look at the architecture and its components. Then we’ll see how these come together to enable high performance DirectX 11 capable graphics. Finally, we’ll talk about what we know about the actual hardware, and when we might actually see shipping cards.
Nvidia’s Fermi GPU is likely to be one of the largest chips ever manufactured. All that chip real estate is there to support substantial increases in GPU horsepower as well as larger quantities of onboard cache.
GF100 Block Diagram. The GPU is designed to be highly modular, enabling a variety of different products to be built.
The GF100 is a highly modular design, which is scalable at different levels of the architecture. At its coarsest level, we have the GPCs (graphics processing clusters). The GF100 chip contains four of these, each of which is in turn constructed from sets of modules. Surrounding the four GPCs are six 64-bit GDDR5 memory controllers, which together yield a 384-bit memory interface.
Initial GF100 GPUs will have 512 ALUs (called “CUDA cores”), 16 geometry units, 4 raster units, 64 texture units and 48 ROPs. Tying together all the GPCs is a 768KB shared L2 cache. Unlike the 256KB cache on the current GTX 285, which is used for fast storage of texture information, the L2 cache on the GF100 is a fully read/write, coherent write-through cache for all data formats. A least-recently used algorithm manages how long data remains in the L2.
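The unit counts above follow directly from the per-module figures quoted in this article. The short Python sketch below simply multiplies them out; the constant and function names are ours, chosen for illustration, and are not Nvidia terminology.

```python
# Per-module figures as quoted in this article (names are illustrative).
GPCS = 4                      # graphics processing clusters
SMS_PER_GPC = 4               # streaming multiprocessors per GPC (maximum)
CUDA_CORES_PER_SM = 32
POLYMORPH_ENGINES_PER_SM = 1  # one geometry unit per SM
TEXTURE_UNITS_PER_SM = 4
RASTER_ENGINES_PER_GPC = 1
MEMORY_CONTROLLERS = 6
BITS_PER_CONTROLLER = 64
ROP_PARTITIONS = 6
ROPS_PER_PARTITION = 8


def gf100_totals():
    """Multiply the per-module counts out to full-chip totals."""
    sms = GPCS * SMS_PER_GPC
    return {
        "sms": sms,
        "cuda_cores": sms * CUDA_CORES_PER_SM,
        "geometry_units": sms * POLYMORPH_ENGINES_PER_SM,
        "raster_units": GPCS * RASTER_ENGINES_PER_GPC,
        "texture_units": sms * TEXTURE_UNITS_PER_SM,
        "rops": ROP_PARTITIONS * ROPS_PER_PARTITION,
        "memory_bus_bits": MEMORY_CONTROLLERS * BITS_PER_CONTROLLER,
    }


if __name__ == "__main__":
    for name, value in gf100_totals().items():
        print(f"{name}: {value}")
```

Because the design is modular, a cut-down product would presumably change only the constants at the top (fewer SMs or fewer memory controllers, for example), though Nvidia has not said what such products would look like.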
The GigaThread engine manages incoming data fetches from main memory and feeds them to the frame buffer. It’s also responsible for creating and sending blocks of execution threads to the GPU itself. Each GPC is broken down into SMs (streaming multiprocessors). These SMs then take the large thread blocks, break them down into groups of 32 threads (called warps) and allocate them to the various execution units underneath.
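To make the warp idea concrete, here is a minimal Python sketch of carving a thread block into groups of 32. It models only the grouping described above, not how the hardware actually assigns warps to execution units; the function name is hypothetical.

```python
WARP_SIZE = 32  # threads per warp, as described in the text


def split_into_warps(block_size, warp_size=WARP_SIZE):
    """Return one range of thread indices per warp.

    The last warp is only partially filled when block_size is not a
    multiple of warp_size.
    """
    return [
        range(start, min(start + warp_size, block_size))
        for start in range(0, block_size, warp_size)
    ]


if __name__ == "__main__":
    warps = split_into_warps(100)
    for index, warp in enumerate(warps):
        print(f"warp {index}: threads {warp.start}..{warp.stop - 1}")
```

A 100-thread block, for instance, ends up as three full warps plus a fourth that holds only four threads.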
Each GPC consists of a set of SMs (up to four) plus a raster engine. The raster engine takes care of triangle setup, rasterization and z-cull (throwing away vertex data that is occluded in the scene and therefore not visible). You can think of the GPC as a kind of “mini-GPU” unto itself.
Streaming Multiprocessor block diagram. Each SM consists of a large number of different compute cores.
Inside each SM are 32 compute cores, which Nvidia calls CUDA cores (after its CUDA GPU compute initiative). This is up from the 8 CUDA cores per SM in the previous generation GT200 series GPUs. All CUDA cores are scalar. Each core consists of a pipelined integer ALU and a floating point unit. The FPU fully implements the IEEE 754-2008 floating point standard, and incorporates a fused multiply-add (FMA) instruction for both single- and double-precision arithmetic. Because FMA rounds only once, after the multiply and the add are both complete, it reduces the precision errors that can creep in when the two steps are rounded separately.
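The benefit of a fused multiply-add is easiest to see with numbers. The Python sketch below emulates a fused operation in software by computing a*b + c exactly with rational arithmetic and rounding once at the end, then compares it with the ordinary two-step version. It is an illustration of the single-rounding idea in double precision, not a model of the GF100's actual FPU, and the function names are ours.

```python
from fractions import Fraction


def separate_multiply_add(a, b, c):
    """Multiply, round, then add and round again (two roundings)."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """Emulate FMA: compute a*b + c exactly, then round once to a float."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


if __name__ == "__main__":
    a = 1.0 + 2.0 ** -30
    b = 1.0 - 2.0 ** -30
    c = -1.0
    # The exact product is 1 - 2**-60, which rounds to 1.0 as a double,
    # so the two-step version loses the small term entirely.
    print("separate:", separate_multiply_add(a, b, c))
    print("fused:   ", fused_multiply_add(a, b, c))
```

With these inputs the two-step version returns zero, while the single-rounding version preserves the tiny negative result.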
Each SM has 64KB of on-chip memory that’s configurable on the fly as either 48KB of shared memory and 16KB of L1 cache, or 16KB of shared memory and 48KB of L1 cache. For graphics, the 16KB L1 cache configuration is used.
Also built into the SM are four special function units (SFUs). These handle more exotic instructions, such as trigonometric functions (sine, cosine and so on) and square roots. These SFUs are independent of the CUDA cores, and the scheduler can dispatch instructions to other execution units if a particular SFU is busy.
Managing so many threads in flight, and making sure that idle units are given fresh work while busy execution cores are bypassed, is the work of the dual warp scheduler.
The dual warp scheduler makes sure work is distributed across all units, maximizing overall execution efficiency.
The scheduler picks two warps (remember, warps are groups of 32 threads), then issues an instruction from each warp independently to execution units that aren’t busy.
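As a rough illustration of that behavior, the toy Python model below issues up to two instructions per cycle, each from a different warp, and skips any warp whose next instruction needs a unit that is already busy. Everything here is an assumption made for the sake of the example: the unit names, the number of issue slots per unit, and the fixed warp priority order are ours, not details Nvidia disclosed.

```python
from collections import deque


def dual_issue(warps, slots_per_cycle):
    """Toy dual-issue scheduler.

    warps: dict of warp id -> deque of unit names, one per pending instruction.
    slots_per_cycle: dict of unit name -> instructions it accepts per cycle.
    Returns a list of cycles; each cycle is a list of (warp id, unit) pairs.
    """
    schedule = []
    while any(warps.values()):
        free = dict(slots_per_cycle)  # every unit is free at the start of a cycle
        issued = []
        for warp_id, pending in warps.items():
            if len(issued) == 2:      # at most two warps issue per cycle
                break
            if pending and free.get(pending[0], 0) > 0:
                unit = pending.popleft()
                free[unit] -= 1
                issued.append((warp_id, unit))
        if not issued:
            raise ValueError("no warp can issue; check the unit names")
        schedule.append(issued)
    return schedule


if __name__ == "__main__":
    # Hypothetical workload: two warps want the SFU, one wants the cores.
    warps = {0: deque(["sfu"]), 1: deque(["sfu"]), 2: deque(["core"])}
    for cycle, issued in enumerate(dual_issue(warps, {"core": 2, "sfu": 1})):
        print(f"cycle {cycle}: {issued}")
```

In the sample workload, warp 1 is bypassed in the first cycle because the SFU is taken, so warp 2's instruction goes out alongside warp 0's instead of leaving an issue slot empty.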
Each SM also contains the PolyMorph engine, which includes the tessellator, viewport transform and vertex fetch units, plus stream output.
The PolyMorph engine handles tessellation, viewport transforms and vertex fetch.
One of the important aspects of DirectX 11 is hardware tessellation. The application defines a patch, which consists of a set of control points, and sends it to the tessellator. The tessellator slices up the patch based on information from the control points, then sends a mesh of vertices back to the SM. The Domain and Geometry shaders (as defined by DX11) operate on the data and determine the position of each vertex. A displacement map, which is actually a grayscale texture map, can be applied to the patch to add more geometric detail.
Here’s what a displacement map might look like.
In the example map above, the crosshatches define how the geometry of a patch might be changed to create more detail. Take a flat piece of geometry, apply the displacement map, and you’ll get something like the next image:
The final tessellated image, after the displacement map has been applied.
The geometry shader takes care of some final postprocessing, then the whole affair is sent back to the tessellation engine for the final pass.
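The tessellate-then-displace idea can be sketched in a few lines of Python. The example below subdivides a flat unit-square patch into a grid of vertices and then raises each vertex according to a grayscale map. It is a CPU-side illustration only: a real DirectX 11 pipeline does this work in the hull, tessellator and domain stages on the GPU, and the function names and the nearest-texel lookup are our simplifications.

```python
def tessellate_unit_patch(level):
    """Return (level + 1)**2 evenly spaced (u, v) vertices on the unit square."""
    step = 1.0 / level
    return [(i * step, j * step)
            for j in range(level + 1)
            for i in range(level + 1)]


def sample_height(height_map, u, v):
    """Nearest-texel lookup into a grayscale map (rows of 0..255 values)."""
    rows, cols = len(height_map), len(height_map[0])
    x = min(int(u * cols), cols - 1)
    y = min(int(v * rows), rows - 1)
    return height_map[y][x] / 255.0


def displace(vertices, height_map, scale):
    """Lift each flat (u, v) vertex to (u, v, height) using the map."""
    return [(u, v, scale * sample_height(height_map, u, v))
            for u, v in vertices]


if __name__ == "__main__":
    checker = [[0, 255],
               [255, 0]]  # tiny stand-in for a displacement texture
    mesh = displace(tessellate_unit_patch(2), checker, scale=2.0)
    for vertex in mesh:
        print(vertex)
```

Raising the tessellation level adds vertices, which lets the mesh follow finer detail in the map; that is the source of the extra geometric detail the technique promises.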
Each SM also contains four texture units, which compute texture addresses and can fetch four texture samples per clock, which can be filtered. This is a departure from Nvidia’s past generation, where texture units were shared among several SMs.
Overall texture performance has also been improved by the dedicated L1 texture cache, plus the unified L2 cache. The GF100 incorporates new texture modes required by DirectX 11. Perhaps the most interesting outcome of the increased performance of the new texture units is improved shadowing performance due to implementing DX11’s four-offset Gather4 directly in hardware. Four texels can be fetched from a 128x128 grid with one instruction. This, plus jittered sampling (available in previous generations) can substantially improve soft shadowing performance, which was a weak spot in previous generations.
Improved texturing performance allows for more effective use of soft shadows in PC games.
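To show why gathering four offset texels in one go helps with soft shadows, here is a small Python sketch of a percentage-closer style test: fetch four depth values around a point in a shadow map and report what fraction of them leave the fragment lit. The function names, the edge clamping and the sample offsets are assumptions made for illustration; this is not the DirectX 11 Gather4 API itself, which runs as a single shader instruction on the GPU.

```python
def gather_four(shadow_map, x, y, offsets):
    """Fetch one depth texel per offset around (x, y), clamped to the map edges."""
    if len(offsets) != 4:
        raise ValueError("expected exactly four offsets")
    rows, cols = len(shadow_map), len(shadow_map[0])
    texels = []
    for dx, dy in offsets:
        sx = min(max(x + dx, 0), cols - 1)
        sy = min(max(y + dy, 0), rows - 1)
        texels.append(shadow_map[sy][sx])
    return texels


def soft_shadow(shadow_map, x, y, fragment_depth, offsets):
    """Fraction of the four gathered texels that do not occlude the fragment."""
    texels = gather_four(shadow_map, x, y, offsets)
    lit = sum(1 for depth in texels if fragment_depth <= depth)
    return lit / len(texels)


if __name__ == "__main__":
    # Hypothetical 4x4 shadow map: an occluder covers the top two rows.
    shadow_map = [[0.5] * 4, [0.5] * 4, [1.0] * 4, [1.0] * 4]
    offsets = [(-1, -1), (1, -1), (-1, 1), (1, 1)]
    print(soft_shadow(shadow_map, 1, 1, 0.8, offsets))
```

A fragment sitting on the shadow boundary comes back partially lit rather than fully in or out of shadow, which is what produces the soft edge. Jittering the offsets from pixel to pixel would hide the regular sampling pattern.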
The GF100’s six ROP partitions are laid out around the L2 cache. Each partition contains eight ROP units, for a total of 48 ROPs on the initial chip. This is double the number of ROPs per partition over Nvidia’s last generation GT200 GPU. Each ROP unit itself has been streamlined and enhanced to improve performance. The net result is a big step up in anti-aliasing performance. Nvidia suggests that 8x AA is averaging only 9% slower than 4x AA mode on the GF100 – and 230% faster than the GTX 285.
Nvidia is also extending its proprietary coverage sampling AA modes (CSAA) to a 32x mode, which will improve image quality in scenes that use billboarded transparency, such as foliage, railings, fencing and similar types of scenery items.
Using 32x CSAA will improve AA in areas with heavy use of small objects plus transparency.
Nvidia also spent a great deal of time talking up the use of GPU compute in gaming. We’re all familiar with Nvidia’s aggressive promotion of its own PhysX physics engine, though there’s no reason other physics middleware couldn’t be implemented. The soon-to-be-released game Dark Void will implement some creative particle and fluid dynamic effects, improving the overall “gee-whiz” factor. This will differentiate Dark Void on Nvidia hardware from the same game running on PCs with competing hardware, but perhaps more importantly, from the console version.
Other potential uses for GPU compute in gaming include improved post-processing effects (better motion blur, for example), fluid dynamics, cloth effects, realistic hair and more.
Will we finally see realistic hair in games? Maybe…
Nvidia’s Emmett Kilgariff also spent some time extolling the virtues of ray tracing, but the demo shown ran at less than one frame per second on two GF100 cards. Note that the car rendering demo involved a high resolution, fully ray traced scene. Kilgariff did note that games could use mixed-mode rendering, using ray tracing only on small portions of the scenery that required more realistic reflections, for example.
One very cool demo Nvidia had on tap was titled Supersonic Sled. This demo was a big exercise in physics, with lots of fluid dynamic effects, particle effects and many thousands of objects blowing up, colliding and otherwise interacting. It’s a great example of something that could be done in a true DirectX 11 title, given enough GPU horsepower.
We’ve covered a lot of ground regarding the architecture of the GPU. What about the hardware itself? When will we see GF100 cards?
Nvidia’s technical marketing director, Nick Stam, emphatically noted that GF100 would be a “Q1 product,” implying we’ll see first cards before the end of March. However, Nvidia refused to disclose any details on pricing, quantity, die size or yield. We do know that the GPU is being manufactured on TSMC’s 40nm process technology, and uses over 3 billion transistors.
Given that AMD’s Radeon HD 5870 is built on the same 40nm process, has 2.15 billion transistors, and is a 334mm² die, it’s very likely the GF100 chip approaches 500mm². That’s one big chip, and high yields will be necessary to ensure the cost of boards isn’t ridiculously high.
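The back-of-the-envelope math behind that estimate is simple scaling. The Python sketch below assumes, purely for illustration, that GF100 packs transistors at the same density as the Radeon HD 5870 on the same 40nm process; real designs differ in how much area goes to logic, cache and I/O, so treat the result as a ballpark figure only.

```python
def estimate_die_area(ref_area_mm2, ref_transistors, transistors):
    """Scale a reference die area by transistor count, assuming equal density."""
    return ref_area_mm2 * transistors / ref_transistors


if __name__ == "__main__":
    # Reference figures for the Radeon HD 5870 as quoted in the text.
    hd5870_area_mm2 = 334.0
    hd5870_transistors = 2.15e9
    # Nvidia says "over 3 billion", so 3.0e9 gives a lower bound.
    gf100_transistors = 3.0e9

    estimate = estimate_die_area(hd5870_area_mm2, hd5870_transistors,
                                 gf100_transistors)
    print(f"Estimated GF100 die area: at least {estimate:.0f} mm^2")
```

Since 3 billion is a floor rather than an exact count, the real die could be larger than this estimate, which is consistent with a figure approaching 500mm².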
Nvidia representatives also noted that the GPU cooler wasn’t final. We certainly hope it wasn’t. One demo, which featured a dual card SLI configuration, was noticeably loud. When asked about power usage, Nvidia said that the GF100 would “… use more power than our current high end.”
Given that statement, Nvidia will be hard pressed to match the current AMD Radeon HD 5870’s 27W at idle and 188W at full throttle. However, if the final GF100 hardware is substantially faster than the HD 5870, that would justify higher power consumption. We really hope, however, that the card isn’t as noisy as the reference hardware on display at the briefing.
What about multimonitor support? AMD has been generating lots of press over its Eyefinity multi-display technology, though how many people actually take advantage of more than two displays with a single card isn’t known. Shipping GF100 cards will support two displays out of the box; if you want more than two, you’ll need a second card.
Nvidia’s long been a proponent of stereoscopic 3D gaming, even selling a set of LCD shutter glasses under the 3D Vision brand. The company is taking 3D gaming a step forward, supporting stereo 3D on three panels simultaneously, provided the three panels support 120Hz refresh rates.
Triple Panel Stereoscopic 3D
Of course, you’ll need two GF100s for three panel stereo, plus a set of shutter glasses. LG and Asus have announced 1080p 120Hz LCD monitors, but those haven’t quite hit the market yet.
So how will the board stack up against AMD’s Radeon HD 5870? If Nvidia’s numbers are correct, quite well. Nvidia didn’t give hard performance numbers, which would be dubious given that final core and memory clocks haven’t been divulged. However, they did note that the ground-up reimagining of the GPU architecture represented by the GF100 would yield big performance gains in one key DirectX 11 area: tessellation.
Nvidia claims the GF100 tessellation performance will be much higher than the Radeon HD 5870.
Of course, tessellation is only one factor, and most games today don’t make much use of it. What will be important is that the GF100 perform better than the HD 5870 across the board – and by enough to justify what will likely be a fairly substantial price differential.
After you cut through all the tech briefings, the cool demos and the performance comparisons against AMD, one thing remains true:
You can’t buy a GF100 today.
While Nvidia has taken great pains to note that the GF100 will be available in Q1 of this year, the long development cycle is a little worrying. Pricing is also an unknown, but given the size of the chip, it’s likely to be a pricey board. Will the added performance be enough to overcome resistance among high end buyers? Will Nvidia offer reduced cost versions using “salvage” chips, as they did with the GTX 260 when the original GTX 280 shipped? It’s all unknown at this point.
Overall, we’re impressed with the architecture of the GF100. But that’s all we can say at this point, until we get boards in hand, get benchmarks under way and can see for ourselves.