Everything You Need to Know About Nvidia's GF100 (Fermi) GPU
The Streaming Multiprocessor
Each GPC consists of a set of SMs (up to four) plus a raster engine. The raster engine takes care of triangle setup, rasterization and z-cull (discarding pixels that are occluded in the scene and therefore not visible). You can think of the GPC as a kind of “mini-GPU” unto itself.

Streaming Multiprocessor block diagram. Each SM consists of a large number of different compute cores.
Inside each SM are 32 compute cores, which Nvidia calls CUDA cores (after its CUDA GPU compute initiative). This is up from the 8 CUDA cores per SM in the previous-generation GT200 series GPUs. All CUDA cores are scalar. Each core consists of a pipelined integer ALU and a floating point unit. The FPU fully implements the IEEE 754-2008 floating point standard, and incorporates a fused multiply-add (FMA) instruction for both single- and double-precision arithmetic. Because FMA rounds only once, after the add, it avoids the intermediate rounding error of a separate multiply followed by an add.
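To see why a fused multiply-add matters, here is a minimal Python sketch. It does not run on the GPU: it emulates a fused operation in double precision by computing the exact result with rational arithmetic and rounding once. The helper names `fma_emulated` and `multiply_then_add` are made up for this illustration, and the input values were chosen specifically so that the intermediate rounding wipes out the answer.

```python
from fractions import Fraction


def fma_emulated(a, b, c):
    """Emulate a fused multiply-add: compute a*b + c exactly, round once.

    Illustrative helper only; it is not meant for infinities or NaNs.
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def multiply_then_add(a, b, c):
    """Two-step evaluation: the product is rounded, then the sum is rounded."""
    product = a * b
    return product + c


# a * b is exactly 1 - 2**-60, which does not fit in a double's 53-bit
# significand, so the separately rounded product becomes exactly 1.0.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

unfused = multiply_then_add(a, b, c)
fused = fma_emulated(a, b, c)

print("multiply, then add:", unfused)
print("fused multiply-add:", fused)
```

In the two-step version the small term is lost when the product is rounded, so the result is zero. The fused version keeps it because rounding happens only after the add.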
Each SM has 64KB of on-chip memory that’s configurable on the fly as either 48KB of shared memory and 16KB of L1 cache, or 16KB of shared memory and 48KB of L1 cache. For graphics, the 16KB L1 cache configuration is used.
Also built into the SM are four special function units (SFUs). These handle more exotic instructions, such as trigonometric functions (sine, cosine, etc.), square roots and so on. These SFUs are independent of the CUDA cores, and the scheduler can dispatch instructions to other execution units if a particular SFU is busy.
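As a rough illustration of what the two splits mean for compute work, the sketch below counts how many thread blocks would fit in an SM's shared memory under each configuration. The configuration names, the function name and the 12KB-per-block figure are assumptions made for this example. The model also ignores every other per-SM limit, such as registers and thread counts.

```python
ON_CHIP_KB = 64  # per-SM on-chip memory, as stated in the article

# The two splits: (shared memory in KB, L1 cache in KB).
# The dictionary keys are hypothetical labels, not Nvidia's names.
CONFIGS = {
    "prefer_shared": (48, 16),
    "prefer_l1": (16, 48),
}


def blocks_limited_by_shared_memory(config_name, shared_kb_per_block):
    """Return how many blocks fit in shared memory under the given split.

    Simplified model: shared memory is treated as the only limiting resource.
    """
    if shared_kb_per_block <= 0:
        raise ValueError("shared_kb_per_block must be positive")
    shared_kb, l1_kb = CONFIGS[config_name]
    assert shared_kb + l1_kb == ON_CHIP_KB
    return shared_kb // shared_kb_per_block


# Hypothetical kernel whose blocks each need 12KB of shared memory.
for name in CONFIGS:
    print(name, blocks_limited_by_shared_memory(name, 12))
```

Under these assumptions, a kernel that leans heavily on shared memory fits more blocks with the 48KB split, while a kernel that uses little shared memory loses nothing by giving the space to L1.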
Dual Warp Scheduler
Managing so many threads in flight, making sure that idle units are given fresh work while busy execution cores are bypassed, is the work of the dual warp scheduler.

The dual warp scheduler makes sure work is distributed across all units, maximizing overall execution efficiency.
The scheduler picks two warps (remember, warps are groups of 32 threads), then issues an instruction from each warp independently to execution units that are not busy.
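The idea of picking two warps per cycle can be sketched with a toy model. This is not how the hardware selects warps: the round-robin policy, the function names and the per-warp instruction counts are all assumptions, and the model does not track which execution units are busy. It only shows threads being grouped into warps of 32 and two warps being served each cycle.

```python
from collections import deque

WARP_SIZE = 32  # threads per warp, as stated in the article


def group_into_warps(num_threads, warp_size=WARP_SIZE):
    """Return the number of warps needed to hold num_threads threads."""
    return (num_threads + warp_size - 1) // warp_size


def simulate_dual_issue(instructions_per_warp, warps_per_cycle=2):
    """Toy round-robin scheduler (hypothetical policy).

    Each cycle, pick up to `warps_per_cycle` warps that still have work and
    issue one instruction from each. Returns the warp IDs served per cycle.
    """
    ready = deque(
        (warp_id, count)
        for warp_id, count in enumerate(instructions_per_warp)
        if count > 0
    )
    schedule = []
    while ready:
        issued = [ready.popleft() for _ in range(min(warps_per_cycle, len(ready)))]
        # Re-queue warps with work left only after the cycle's picks are made,
        # so the same warp is never chosen twice in one cycle.
        for warp_id, remaining in issued:
            if remaining > 1:
                ready.append((warp_id, remaining - 1))
        schedule.append([warp_id for warp_id, _ in issued])
    return schedule


print(group_into_warps(128))           # 128 threads form 4 warps
print(simulate_dual_issue([2, 1, 1]))  # warp IDs issued in each cycle
```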
The PolyMorph Engine & Tessellator
Each SM also contains the PolyMorph engine, which includes the tessellator, viewport transform and vertex fetch units, plus stream output.

The PolyMorph engine handles tessellation, viewport transforms and vertex fetch.
One of the important aspects of DirectX 11 is hardware tessellation. The application defines a patch, which consists of a set of control points, and sends it to the tessellator. The tessellator slices up the patch based on information from the control points, then sends a mesh of vertices back to the SM. The Domain and Geometry shaders (as defined by DX11) operate on the data and determine the position of each vertex. A displacement map, which is actually a grayscale texture map, can be applied to the patch to add more geometric detail.

Here’s what a displacement map might look like.
In the example map above, the crosshatches define how the geometry of a patch might be changed to create more detail. Take a flat piece of geometry, apply the displacement map, and you’ll get something like the next image:

The final tessellated image, after the displacement map has been applied.
The geometry shader takes care of some final postprocessing, then the whole affair is sent back to the tessellation engine for the final pass.
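The basic flow described in this section, from a flat patch to a finer mesh to displaced vertices, can be sketched in a few lines of Python. This is a CPU-side illustration only, not the DirectX 11 pipeline or Nvidia's implementation. The function names, the uniform grid subdivision, the nearest-neighbour lookup and the tiny 2x2 height map are all assumptions chosen to keep the example short.

```python
def tessellate_flat_patch(level):
    """Subdivide the unit square into a (level+1) x (level+1) grid of vertices.

    Each vertex is (u, v, z) with z = 0.0, i.e. a flat piece of geometry.
    """
    step = 1.0 / level
    return [
        (i * step, j * step, 0.0)
        for j in range(level + 1)
        for i in range(level + 1)
    ]


def sample_height(height_map, u, v):
    """Nearest-neighbour lookup of a grayscale value (0..255), scaled to 0..1."""
    rows, cols = len(height_map), len(height_map[0])
    row = min(int(v * rows), rows - 1)
    col = min(int(u * cols), cols - 1)
    return height_map[row][col] / 255.0


def displace(vertices, height_map, scale=1.0):
    """Push each vertex along z by the height sampled from the grayscale map."""
    return [
        (u, v, z + scale * sample_height(height_map, u, v))
        for (u, v, z) in vertices
    ]


# Hypothetical 2x2 grayscale displacement map: white raises, black leaves flat.
height_map = [
    [0, 255],
    [255, 0],
]

flat = tessellate_flat_patch(2)            # 9 vertices, all with z = 0
bumpy = displace(flat, height_map, scale=0.5)
for vertex in bumpy:
    print(vertex)
```

Raising the tessellation level gives the displacement map more vertices to move, which is where the extra geometric detail comes from.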