Everything You Need to Know About Nvidia's GF100 (Fermi) GPU
The GF100 DirectX 11 GPU is still not ready for prime time, but Nvidia has opened up the hood to show us the engine that will power next-generation Nvidia-based graphics cards.
Nvidia’s latest-generation GPU is going through the most painful, drawn-out gestation period since the company’s first programmable GPU, the GeForce 5800 series. Like the more recent GeForce GTX 280, the current GF100 (the code name, not the final name) chip represents a major, ground-up architectural redesign.
Recently, we spent the better part of a day being briefed on the GF100, the first actual graphics processor built on Nvidia’s Fermi architecture. The presenters included Jonah Alben, Senior VP of GPU engineering; Henry Morton, lead architect for geometry processing; and Emmett Kilgariff, Director of GPU Architecture, plus a host of Nvidia marketing and PR folk.
The basic Fermi architecture layers graphics functionality atop a powerful parallel compute engine. As GPU compute becomes more important, both in games and in certain classes of mainstream applications, it makes sense to design an architecture with more general-purpose capability. That’s not to say that Fermi will try to take on the functions of a mainstream CPU. For example, though GF100’s compute engines offer some CPU-like features, such as a full L2 read/write data cache, there’s no provision for speculative execution. As Henry Morton, one of the lead architects for the GF100, noted, “You don’t want to do speculative execution on a parallel machine.”
Nvidia’s architecture team had some key goals in mind when designing GF100:
• Geometric realism. As DirectX 8, 9 and 10 progressed, GPUs made huge strides in pixel shader and texturing performance – up to 150x over the first generation of programmable GPUs. Meanwhile, geometry performance only increased about 3x over that time.
• Unrivaled image quality. The GF100 moves well beyond multisampling anti-aliasing, improving on the previous CSAA (coverage sample anti-aliasing) by allowing 32-sample CSAA. Substantial work also went into improving anti-aliasing of transparent textures.
• Beefing up GPU compute for gaming. In addition to improving performance in the now-familiar post-processing effects (motion blur, depth-of-field effects), the GF100 offers more robust accelerated physics, with the potential of also offloading AI and animation functionality previously the realm of CPUs.
Let’s keep these goals in mind as we dive deeper into the architectural features of the GF100. We’ll first take a high-level look at the architecture and its components. Then we’ll see how these come together to enable high-performance, DirectX 11-capable graphics. Finally, we’ll cover what we know about the actual hardware, and when we might see shipping cards.
The GF100 Deconstructed
Nvidia’s Fermi GPU is likely to be one of the largest chips ever manufactured. All that chip real estate is there to support substantial increases in GPU horsepower as well as larger quantities of onboard cache.

GF100 Block Diagram. The GPU is designed to be highly modular, enabling a variety of different products to be built.
The GF100 is a highly modular design, scalable at different levels of the architecture. At the coarsest level are the GPCs (graphics processing clusters). The GF100 chip contains four of these, each of which is in turn constructed from sets of modules. Surrounding the four GPCs are six 64-bit GDDR5 memory controllers, which together yield a 384-bit memory interface.
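The memory interface arithmetic is simple enough to sketch. The snippet below is only an illustration of how the total bus width follows from the controller count; the function name is ours, and the five-controller case is a hypothetical cut-down configuration, not a product Nvidia has announced.

```python
# Figures for the full GF100 chip, taken from the text above.
NUM_CONTROLLERS = 6
CONTROLLER_WIDTH_BITS = 64


def memory_interface_bits(num_controllers, controller_width_bits=CONTROLLER_WIDTH_BITS):
    """Total memory bus width, assuming every controller contributes an equal slice."""
    return num_controllers * controller_width_bits


# Full chip: six 64-bit controllers.
print(memory_interface_bits(NUM_CONTROLLERS))  # 384

# Hypothetical scaled-down part with one controller disabled (our assumption).
print(memory_interface_bits(5))  # 320
```

Because the controllers sit outside the GPCs, a modular design like this could in principle scale memory width independently of shader count.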
Initial GF100 GPUs will have 512 ALUs (called “CUDA cores”), 16 geometry units, 4 raster units, 64 texture units and 48 ROPs. Tying together all the GPCs is a 768KB shared L2 cache. Unlike the 256KB cache on the current GTX 285, which is used for fast storage of texture information, the L2 cache on the GF100 is a fully read/write, coherent write-through cache for all data formats. A least-recently-used (LRU) algorithm determines how long data remains in the L2.
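For readers unfamiliar with least-recently-used replacement, here is a minimal software sketch of the policy. It is a toy model only: Nvidia did not describe the hardware implementation, and the class name, the tiny capacity, and the addresses used here are our own assumptions for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Toy least-recently-used cache; not a model of the actual GF100 L2 hardware."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        # Entries are ordered from least recently used to most recently used.
        self._lines = OrderedDict()

    def read(self, address):
        """Return the cached value, or None on a miss. A hit refreshes the entry."""
        if address not in self._lines:
            return None
        self._lines.move_to_end(address)
        return self._lines[address]

    def write(self, address, value):
        """Store a value. Returns the evicted address if an entry had to be dropped."""
        evicted = None
        if address in self._lines:
            self._lines.move_to_end(address)
        elif len(self._lines) >= self.capacity:
            # Cache is full: drop the entry that has gone longest without being touched.
            evicted, _ = self._lines.popitem(last=False)
        self._lines[address] = value
        return evicted


cache = LRUCache(capacity=2)
cache.write(0xA0, "texture data")
cache.write(0xB0, "vertex data")
cache.read(0xA0)                           # touching 0xA0 makes 0xB0 the oldest entry
evicted = cache.write(0xC0, "compute data")
print(hex(evicted))                        # 0xb0
```

The point of the policy is that data touched recently, whatever its format, stays resident while stale data is the first to go.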
The GigaThread engine manages incoming data fetches from main memory and feeds them to the frame buffer. It’s also responsible for creating and sending blocks of execution threads to the GPU itself. Each GPC is broken down into SMs (streaming multiprocessors). These SMs take the large thread blocks, break them down into groups of 32 threads (called warps) and allocate them to the various execution units underneath.
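To make the warp breakdown concrete, the following sketch splits a thread block into groups of 32. It only illustrates the grouping described above; the function name and the block sizes are our own examples, and real hardware scheduling involves far more than this.

```python
WARP_SIZE = 32  # threads per warp, as stated in the text


def split_into_warps(block_size, warp_size=WARP_SIZE):
    """Return one range of thread indices per warp for a block of block_size threads."""
    if block_size < 0:
        raise ValueError("block_size must be non-negative")
    return [
        range(start, min(start + warp_size, block_size))
        for start in range(0, block_size, warp_size)
    ]


warps = split_into_warps(256)
print(len(warps))            # 8

# A block whose size is not a multiple of 32 leaves a partially filled last warp.
ragged = split_into_warps(100)
print(len(ragged))           # 4
print(len(ragged[-1]))       # 4
```

This is one reason developers typically pick block sizes that are multiples of 32: a partially filled warp still occupies a full warp's worth of scheduling.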