The Age of HSA (Heterogeneous System Architecture)


(chorus) Heterogeneous System Architecture

You’ll have to forgive us, but it’s hard to think of AMD’s next-generation APU without breaking into a chorus of that classic New Age song: “Age of Aquarius.”

HSA does, after all, promise a brighter, sunshinier future for AMD, just without the hippies with long hair and bell bottoms dancing around. In short, HSA is the next step that finally melds the CPU and the GPU together.

Sound impossible? AMD’s new “Kaveri” APU does just that, and to a level we've never seen before. But does this New Age APU really have what it takes to push off old-world designs from that Mad Men–dressed Intel? To find out, read on.

Kaveri Arrives

Heterogeneous System Architecture Explained

Let’s be up-front about this: Kaveri’s significance isn’t that it’s a super-high-end, tricked-out part that will rock everyone’s computing world. No. This part won’t and can’t compete with a $1,000 Core i7 Extreme CPU or even a $500 Core i7 chip. Or, for that matter, a $320 Core i7, either. In fact, the chip is a mild-mannered, midrange APU. If your eyes are glazed over already, don’t be quite so jaded. Kaveri is still a very significant milestone in AMD’s quest to see heterogeneous computing turn into a reality.

If you flip back a few years, AMD had just gone through a run of phenomenal successes with its Athlon, Athlon XP, and Athlon 64, which had put its mortal enemy back on its heels for the first time in both companies’ histories. That sleeping giant wouldn’t stay asleep at the wheel forever, though. Those who know current events know that Intel came back with a roaring success with its Core CPUs.

As a tiny chipmaker, and one without a fab of its own, AMD’s only real, true course was to use the one tool that Intel didn’t have: graphics performance. While Intel has long been known as the king of not-good graphics and has received an Academy Award nomination for graphics unimpressiveness, AMD’s GPU division hasn’t always been in first, but everyone acknowledges the company has some good parts.

It’s not just old integrated graphics

One thing you shouldn’t confuse AMD’s APUs with is the move to integrated graphics. Intel, in fact, put a graphics core into its original LGA1156 CPUs before AMD did, but that was just a GPU sitting inside the CPU package. The graphics core eventually was merged into the die itself, but the approach is far different from AMD’s APU philosophy. Ever since the Llano APU was introduced in 2011, AMD has slowly been trying to elevate the integrated GPU to the same status as the CPU. Llano featured a few special interfaces between the cores and memory, Richland unified the CPU and GPU’s north bridge, and Kaveri at last makes the GPU a full equal to the CPU.

HUMA

One of the biggest changes with Kaveri is the memory controller. With Richland, Llano, and other on-die or on-chip graphics parts, the CPU and GPU were still separate entities as far as the memory controller was concerned. In a Richland chip, for example, before the GPU could be tasked with working on something the CPU was processing, the information would have to be copied to the GPU’s portion of memory before work could begin, even though it was already in memory. It’s a bit like two people working on a task simultaneously who must pass the materials over a wall before anything can happen. With Kaveri, both the CPU and GPU can access the same memory (up to 32GB) without the inefficiency of passing things back and forth first. Kaveri also features a queuing technique that lets either the GPU or CPU dispatch workloads. Thus, the GPU is finally getting top billing with the CPU rather than playing second fiddle.
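To make the over-the-wall analogy a little more concrete, here is a minimal sketch in plain Python. It is only an analogy: the buffers are ordinary Python objects, not real CPU or GPU memory, and the function names are ours, invented for illustration. The first function duplicates data that is already in memory (the old way), while the second hands over a view of the very same bytes (the Kaveri way).

```python
# Analogy only: ordinary Python buffers stand in for system memory.
# Function names are hypothetical and not part of any AMD or HSA API.

def hand_off_with_copy(frame: bytearray) -> bytes:
    """Old approach: duplicate the data before the 'GPU' may touch it."""
    return bytes(frame)  # full copy of data that was already in memory


def hand_off_shared(frame: bytearray) -> memoryview:
    """Shared-memory approach: pass a view of the same bytes, no copy."""
    return memoryview(frame)


frame = bytearray(b"pixels")
copied = hand_off_with_copy(frame)
shared = hand_off_shared(frame)

# The 'CPU' updates the buffer after the hand-off.
frame[0] = ord("P")

print(copied)         # the copy is stale: b'pixels'
print(bytes(shared))  # the shared view sees the change: b'Pixels'
```

The copy costs time and memory up front and goes stale the moment either side changes the data. The shared view costs nothing to hand over, which is the whole point of letting both sides address the same memory.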

Those aren’t the only changes AMD made to Kaveri. On the x86 side, AMD has tweaked the new “Steamroller” cores to reduce mispredicted branches by 20 percent, reduce cache misses by 30 percent, and improve general scheduling efficiency by 5 to 10 percent, the company claims. All this adds up to, AMD says, a healthy efficiency increase of 10 to 20 percent over its previous “Piledriver” microarchitecture.

The fundamentals of the chip from the A10-6800K are still there, too: It’s built from dual-core “modules” with some shared resources. All three of the launch Kaveris feature two modules for a total of four cores. Of course, that gets into AMD’s new language for Kaveri about just what a core is. The company says the new Kaveri actually has 12 “compute cores.” Four of those are the CPU cores; the other eight are the GPU cores. What’s a compute core? AMD defines it as an HSA hardware block that’s programmable and capable of running at least one process in its own context and virtual memory space, independently from other cores. Fortunately, AMD isn’t pushing this line too hard and selling it to the public as a 12-core processor, but at least it’s a somewhat defensible argument.

APUs compared
Make/Model              A10-7850K      A10-6800K            A10-5800K      A8-3850
Code name               Kaveri         Richland             Trinity        Llano
Microarchitecture       Steamroller    Enhanced Piledriver  Piledriver     K10
Core/Thread             4/4            4/4                  4/4            4/4
Base/Turbo Clock (GHz)  3.7/4          4.1/4.4              4.1/4.4        2.9/2.9
Die Size                245mm²         246mm²               246mm²         228mm²
Transistor Count        2.41 billion   1.303 billion        1.303 billion  1.178 billion
Graphics                R7             HD 8670              HD 7660        HD 6550D
L2 Cache                4MB            4MB                  4MB            4MB
Process technology      28nm           32nm                 32nm           32nm
TDP                     95 watts       100 watts            100 watts      100 watts
Price (at launch)       $173           $142                 $122           $135

A new process that may not be better for CPUs

Kaveri gets a die shrink, going from Richland’s 32nm process to a 28nm process. The chip is still being made by AMD’s former fab, GlobalFoundries, but there is a key difference. The 32nm process was a High-K Metal Gate Silicon On Insulator (SOI) process that’s less dense than the 28nm Super High Performance (SHP) process used in Kaveri. The 32nm SOI process is better suited for higher-clocked CPUs; witness the cherry-picked AMD FX-9590 parts that could hit 5GHz out of a retail box. The 28nm SHP process (that’s also used for GPUs) is denser but doesn’t scale in frequency as easily as the 32nm process. That’s apparent in these new parts, as the top-end 3.7GHz A10-7850K part Turbos up to 4GHz. In comparison, the older 32nm SOI-based A10-6800K had a base clock of 4.1GHz with a Turbo of 4.4GHz.
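Just how much denser is the new process? A rough back-of-the-envelope sketch, using only the die sizes and transistor counts listed in the APU comparison table above, gives a feel for it. Whole-die averages like this lump CPU, GPU, and cache together, so treat the result as a ballpark rather than a statement about the process itself.

```python
# Die size and transistor count come from the APU comparison table in this
# article. Averaging over the whole die is a simplification on our part.
chips = {
    "A10-7850K (28nm SHP)": {"transistors": 2.41e9, "die_mm2": 245},
    "A10-6800K (32nm SOI)": {"transistors": 1.303e9, "die_mm2": 246},
}


def density_millions_per_mm2(transistors: float, die_mm2: float) -> float:
    """Average transistor density in millions of transistors per square mm."""
    return transistors / die_mm2 / 1e6


densities = {
    name: density_millions_per_mm2(**spec) for name, spec in chips.items()
}

for name, density in densities.items():
    print(f"{name}: {density:.1f} million transistors per mm2")

ratio = densities["A10-7850K (28nm SHP)"] / densities["A10-6800K (32nm SOI)"]
print(f"Kaveri averages about {ratio:.2f}x the density of Richland")
```

By this crude measure, Kaveri squeezes nearly twice as many transistors into essentially the same die area, which is where the room for all those extra graphics cores comes from.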

There will certainly be those who pooh-pooh such a decision, but the choice probably makes sense from AMD’s perspective—it has basically chosen a process technology that’s better suited to its core strength on the graphics cores rather than the CPU side.

It’s dead, Jim

One thing the move to 28nm may signal certainly won’t make AMD enthusiasts happy, though: the end of AM3+. There have been rumors swirling for months that the FX CPUs would be the final parts for the AM3+ socket. We’ll point out that AMD has neither confirmed nor denied AM3+ and FX as dead, but between the leaked roadmaps that go nowhere for the high end and a 28nm process for Steamroller that is so focused on denser transistors that it doesn’t favor building high-clocked CPUs, it doesn’t look good. It also doesn’t help that 2014 will be a transition year with DDR4 on the horizon. Does it make sense for AMD to really roll out a Steamroller using a process technology not optimized for pure CPUs that likely still won’t compete with Intel’s midrange parts, and then transition to DDR4? Painfully, we have to admit: no. Still, we really hope we’re wrong, as we’d love to see an 8-core or more Steamroller hit AM3+ just for old times’ sake.

Don’t take that to mean AMD will abandon the high-end enthusiast altogether, though. It just may be time for AMD to shed AM3+, and even those who have ridden the socket from AM2 till now really can’t complain too much—the company should be patted on the back for providing a fairly smooth road over the years.

AMD’s new Kaveri Lineup
Make/Model                  A10-7850K       A10-7700K       A8-7600 (65W)   A8-7600 (45W)
CPU Cores/GPU Cores         4/8             4/6             4/6             4/6
CPU Base/Turbo Clock (GHz)  3.7/4           3.4/3.8         3.3/3.8         3.1/3.3
TDP                         95 watts        95 watts        65 watts        45 watts
L2 Cache                    4MB             4MB             4MB             4MB
Graphics                    R7              R7              R7              R7
Shader Cores                512             384             384             384
Graphics Clock              Up to 720MHz    Up to 720MHz    Up to 720MHz    Up to 720MHz
Price                       $173            $152            $119            $119
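The lineup numbers also show how AMD’s “compute core” math works out. The short sketch below simply adds CPU and GPU cores and divides the shader count by the GPU core count, using only the figures from the lineup table. The variable names are ours.

```python
# Figures taken from the Kaveri lineup table; the two A8-7600 columns
# differ only in TDP and clocks, so the chip is listed once here.
lineup = {
    "A10-7850K": {"cpu_cores": 4, "gpu_cores": 8, "shaders": 512},
    "A10-7700K": {"cpu_cores": 4, "gpu_cores": 6, "shaders": 384},
    "A8-7600": {"cpu_cores": 4, "gpu_cores": 6, "shaders": 384},
}

summary = {}
for model, spec in lineup.items():
    compute_cores = spec["cpu_cores"] + spec["gpu_cores"]
    shaders_per_gpu_core = spec["shaders"] // spec["gpu_cores"]
    summary[model] = (compute_cores, shaders_per_gpu_core)
    print(
        f"{model}: {compute_cores} compute cores, "
        f"{shaders_per_gpu_core} shaders per GPU core"
    )
```

Every part works out to 64 shaders per GPU core, so the A10-7850K’s 12 compute cores versus 10 on the lesser chips comes down to two extra 64-shader blocks.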

Socket to Me

FM1

The original Socket FM1 was released alongside AMD’s Llano APUs. We’ve lauded AMD over the years for socket stability, but not on FM1. Socket FM1 lived about a year and was made obsolete by Socket FM2. The worst part of it was the confusion on what worked with which part. To this day, we still have to look it up.

FM2

FM2 has been the modern APU socket we’ve grown accustomed to over the last year or so. It works with all A-series parts 5000 and greater, and even a few oddball Athlon II X4 parts, too. AMD is trying to wreck its legacy of socket support, though, by making Kaveri incompatible with existing FM2 boards. Fooey.

FM2+

Kaveri has a slightly different pin configuration than Richland and Trinity, and thus FM2+, or FM2b, was born. The good news is FM2+ boards will work with older A-series APUs 5000 or greater. The bad news is Kaveri won’t fit into your existing FM2 motherboard.

But wait, there’s more

Mantle, TrueAudio get thrown in for free!

Besides a new process, microarchitecture, and uniform memory access, Kaveri also brings AMD’s GCN graphics cores to the table. It’s not just about the graphics cores, though. AMD has also wisely decided to bring AMD TrueAudio technology to the table. TrueAudio promises very advanced audio processing using the power of the graphics compute cores on hand. Kaveri will also benefit from AMD’s new Mantle support. Mantle is AMD’s new API that’s designed to sidestep limitations in DirectX and promises insane performance increases. Interestingly, Mantle’s main beneficiaries may not be Dream Machine–level exotic rigs with overclocked hexa-cores. Instead, Mantle looks to benefit those with far more modest CPUs. You know, such as an A10-7850K. Using the Mantle API, a Core i7-4960X with a Radeon R9 290X would see a modest 1.4 percent improvement at 2560x1600, but that same GPU paired with a lower-end A10-7700K would see an astounding 40.1 percent buff using Mantle, AMD claims.

Earlier, we said it was wise of AMD to include Mantle and TrueAudio in Kaveri. We believe it was a smart decision because the company needs to get more chips on the ground that support Mantle and TrueAudio for either to gain any real traction. Putting them into the very affordable Kaveri accomplishes that. The last piece, which is the icing on the cake, is the ability to use the graphics cores to mine crypto-currency. How much you can make depends on what you’re mining, how much you’re paying for power, and the market value, but one crazy figure AMD officials threw out there was that a Kaveri APU could potentially generate $704 yearly in crypto-currency (or more, and also less, too).

The Evolution of APUs

Llano

The original Llano APU used the same basic K10 cores as Phenom II CPUs. Though it didn’t come out until 2011, AMD had been working on the Llano concept since 2006—soon after its purchase of ATI. Rather than the dual-core modules, Llano used four distinct CPU cores alongside the Radeon HD 6000–class graphics.

Trinity

The second-gen APU used AMD’s newer Piledriver dual-core modules, added a Turbo Core mode, and brought faster Radeon HD 7000–class graphics. A second iteration of the Piledriver APU was called Richland and added mostly power and clock improvements.

Kaveri

Kaveri breaks from previous APUs by allowing both CPU and GPU cores to use the same memory without having to copy the data back and forth. The newest APU also brings R7-class graphics to the mix.

How we tested

Kaveri comes in three flavors, but we think the most interesting of the bunch is the Cindy Brady model: the A8-7600 part. While the high-end APU sounds enticing, at $173 it’s pretty close to a quad-core Haswell chip. And yeah, no one cares about Jan. Far more palatable is the A8-7600. For $119, the A8-7600 does what AMD does best: compete on price. The A8-7600 is also unique among the Kaveris because you can easily set the CPU via BIOS to run at either a 45-watt or a 65-watt TDP. At 65 watts, the base is 3.3GHz with a Turbo of 3.8GHz, and at 45 watts the base is 3.1GHz with a mild Turbo of 3.3GHz. We tested ours at 65 watts.

The A8-7600 went into an ASRock FM2A88X-ITX+ mobo running a Samsung 840 SSD. Before we get into test results, we’d like to point out a sleight of hand AMD performed this time around with the benchmarks it showed off to the public. AMD’s benchmarks paired its own part with 16GB of DDR3/2133 and the Intel system with 16GB of DDR3/1600. That’s not fair. For our testing, we outfitted our Intel chips with dual-channel DDR3/1866 G.Skill RAM. We also obtained a pair of DDR3/2133 SO-DIMMs from G.Skill, but the density was a bit lower at 8GB total. In the end, we tested all of our Intel parts using 16GB of DDR3/1866 and also tested the Kaveri at DDR3/2133 and DDR3/1866. Our testing shows that if you have Kaveri, it’s probably worth paying for DDR3/2133.
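Why does the RAM mismatch matter so much? Integrated graphics has to share system memory, so peak bandwidth is a decent first-order proxy. The sketch below is our own arithmetic, not a measurement: it assumes a standard 64-bit (8-byte) DDR3 channel and two channels per platform, and it computes theoretical peaks that real workloads won’t reach.

```python
# Assumptions (ours, not from the testing described above): each DDR3
# channel is 64 bits (8 bytes) wide and both platforms run dual-channel.
# Results are theoretical peaks, not measured throughput.
BYTES_PER_TRANSFER = 8
CHANNELS = 2


def peak_bandwidth_gbs(megatransfers_per_sec: int) -> float:
    """Theoretical peak bandwidth in GB/s for a given DDR3 speed grade."""
    return megatransfers_per_sec * BYTES_PER_TRANSFER * CHANNELS / 1000


speeds = {"DDR3/1600": 1600, "DDR3/1866": 1866, "DDR3/2133": 2133}
bandwidth = {name: peak_bandwidth_gbs(mts) for name, mts in speeds.items()}

for name, gbs in bandwidth.items():
    print(f"{name}: {gbs:.1f} GB/s peak")

advantage = bandwidth["DDR3/2133"] / bandwidth["DDR3/1600"] - 1
print(f"DDR3/2133 over DDR3/1600: {advantage:.0%} more peak bandwidth")
```

On paper, the RAM in AMD’s own comparison gave its chip roughly a third more memory bandwidth than the Intel system had to work with, which is why we leveled the field at DDR3/1866.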

All of our Intel testing was done using an ASRock Z87-M8 motherboard. It’s also a Mini-ITX board but uses SO-DIMMs rather than DIMMs, hence the problem with getting 16GB of DDR3/2133. Our Intel box was outfitted with a Corsair Neutron SSD. Both platforms were loaded with a clean install of Windows 8.1, the boards were flashed with the latest BIOS, and integrated graphics was used all around. For our Intel contestants we reached for a 3.4GHz Core i3-4130, which sells for $117. We also threw in a 3GHz Core i5-4430 part, the cheapest quad-core Haswell at $182; we include it only as a point of reference.

In choosing our benchmarks, we grabbed a subset of our typical CPU tests and threw in GPU compute and gaming workloads to give the integrated graphics a decent workout. We also added in an HSA JPEG decompression demo that AMD provided. Yes, we’ll say that again—AMD provided it, so stop your belly-aching, Intel fanboys. We wanted to put it into the mix to see how the new Kaveri would do when something was made to exploit the capabilities of HSA.

The Result

In many ways, not much has changed. Clock for clock, core for core, the AMD parts don’t equal their modern Intel competitors when it comes to today’s applications. The quad-core Core i5-4430 has a steep advantage over the AMD quad-core A8-7600 in multi-threaded tasks. This shouldn’t be news to anyone, as AMD’s current CPUs and APUs use dual-core modules that share resources. Although instructions per clock (IPC) is better than before, it’s not enough to bridge the gap with an Intel part. Yeah, we know. You’re thinking “who cares, that’s a $182 CPU against a $119 CPU.” We’d agree. Well, until you look at many of the benchmarks from the dual-core Core i3-4130 part. Thanks to its Hyper-Threading and better core efficiency, it’s damn near a tie with the AMD part, even though the AMD chip has four physical, albeit resource-sharing, cores.

AMD’s JPEG decompression demo is built to show off the advantage of Kaveri’s GPU and CPU being able to access the same memory.

In the X264 HD 5.01 encoding test, for example, the dual-core Core i3 does a good job keeping up with and even passing the A8-7600. Remember, this is two cores against four cores. While Kaveri does OK with encoding, Haswell will generally be well ahead. We also ran through the new PCMark 8 V2 test, which now features OpenCL support. The results weren’t too surprising: When set to load up the GPU, the AMD part won; when run in conventional (aka CPU) mode, the Intel part jumped in front. As part of our testing, we also threw in some older tests, such as Valve’s Particle Test. This primarily tests how a chip will handle theoretical game physics. More threads and more cores are better, and the Intel part was clearly the winner. The Core i3 also aces the AMD chip in Cinebench 11.5, which is a 3D rendering test. Again, we’re talking about a dual-core vs. AMD’s quad-core part, but maybe that old way of thinking is about as antiquated as judging CPUs based on clock speed. We don’t, for example, compare a 3GHz Core i5 to a 3.4GHz A10 chip and automatically assume the A10 part is faster. Looked at that way, the A8-7600 does OK. In purely CPU tasks such as 7-Zip for file compression and TrueCrypt for encryption, both parts are dead even. For AMD CPUs, that’s really not bad.

Fortunately, the AMD part gets payback where you’d expect: graphics. The A8-7600 offers significant performance boosts over the Intel HD4400 in the dual-core Core i3 and the HD4600 in the quad-core Haswell chip. In Fire Strike, the A8-7600’s score is almost double that of the Intel dual-core. Tomb Raider also favored the AMD part. You could almost play Tomb Raider (with enough graphics tweaks) at a reasonable frame rate, too. In GPU compute, the Kaveri also represents well, generally acing the dual-core Haswell. In the one HSA test we ran, Kaveri showed a healthy performance boost over the Intel parts, in line with the promises of HSA, if you believe them.
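For anyone who wants to check the “almost double” math, here is a quick sketch that turns a handful of scores from the benchmark chart at the end of this article into ratios. The numbers are copied from the first A8-7600 column and the Core i3-4130 column; which tests to include was our pick for illustration.

```python
# Scores copied from the benchmark chart in this article:
# (A8-7600, Core i3-4130). Higher is better for every test listed here.
results = {
    "3DMark Fire Strike": (1265, 661),
    "Tomb Raider Norm 19x10 (fps)": (26.1, 16.1),
    "Basemark CL 1.1 GPU": (64.2, 35.8),
    "X264 HD 5.01 Pass 1 (fps)": (34.2, 39.5),
    "Cinebench 11.5": (3.36, 3.68),
}

ratios = {test: amd / intel for test, (amd, intel) in results.items()}
for test, ratio in ratios.items():
    print(f"{test}: A8-7600 scores {ratio:.2f}x the Core i3-4130")

# AMD's HSA JPEG decoder demo on the A8-7600: seconds, lower is better.
jpeg_cpu_sec, jpeg_gpu_sec = 12.14, 6.06
hsa_speedup = jpeg_cpu_sec / jpeg_gpu_sec
print(f"JPEG decode with the GPU path: {hsa_speedup:.2f}x faster")
```

The pattern is easy to see: ratios well above 1.0 wherever the graphics cores do the work, ratios a bit below 1.0 on pure CPU tests, and about a 2x gain on the A8-7600 when the HSA demo moves from the CPU path to the GPU path.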

We can’t say all this without some caveats, though. The first is that when we pushed the GPU compute workloads from the graphics cores to the CPU cores, the AMD part did pretty poorly. AMD’s argument, however, is that the future is about GPU computing and HSA. Even if Intel’s next-generation chip hits it out of the ballpark, the improvement isn’t going to be more than maybe 25 percent, the company would argue. But by moving the same workload to the GPU, it’s possible to see performance gains of several orders of magnitude. That’s the sales pitch, anyway, and though it sounds good, we know that hanging performance on developer support means a very long fight. Anyone hoping that native HSA apps will be a game-changer should burst that bubble now: It’ll be years, if ever, before that happens. Think back to when Intel introduced the Pentium 4: Remember the huge stink around porting existing code to the new chip? As hard as Intel tried, it could never overcome the huge library of legacy applications whose performance on the P4 ranged from bleak to mediocre. The lesson the company learned was that you can’t “recompile the world.”

The other issue with Kaveri is that its main advantage is in graphics. Once you plug in even a $100 GPU, you neutralize the value of the APU but you’re still stuck with the meh performance of the AMD CPU cores. Fortunately, Kaveri is generally acceptable, but the power of those Haswell cores can’t be ignored: Kaveri’s four cores just about equal two Haswell cores.

The Upshot

Kaveri’s role is perfect in a box that is unlikely to ever see a real discrete graphics card, such as an ultra-budget rig or a NUC-style machine. That’s the cheapest Kaveri, though. Once you get into the $150 and $175 range, you’re fighting with the faster quad-core Haswell parts, and we’d temper our verdict with those matchups. The A8-7600 is in a very nice spot, though. It’s far faster than its Intel contemporary in graphics, and generally “just good enough” on x86 performance so people aren’t likely to care too much. We’d honestly have to call that a victory.

Benchmarks
CPU                             A8-7600         Core i3-4130    A8-7600*        Core i5-4430*
Graphics                        R7              HD4400          R7              HD4600
Clock                           3.3GHz/3.8GHz   3.4GHz          3.3GHz/3.8GHz   3GHz/3.2GHz
RAM Clock                       DDR3/1866       DDR3/1866       DDR3/1866       DDR3/1866
Price                           $119            $117            $119            $182
X264 HD 5.01 Pass 1 (fps)       34.2            39.5            34.6            50.4
X264 HD 5.01 Pass 2 (fps)       8               7.62            8               11.6
PCMark 8 V2 Home Accelerated    3,102           2,864           3,164           2,883
PCMark 8 V2 Home Conventional   2,502           2,863           3,422           2,880
Valve Particle Benchmark (fps)  82              101             82              126
Cinebench 11.5                  3.36            3.68            3.4             5.06
TrueCrypt 7.1a (GB/s)           2.1             2               2.1             2.5
7-Zip 9.20 (MIPS)               10,499          10,423          10,888          14,046
JPEG Decoder CPU (sec)          12.14           9.46            11.62           10.04
JPEG Decoder GPU (sec)          6.06            N/A             5.63            N/A
Basemark CL 1.1 GPU             64.2            35.8            66.5            44.4
Basemark CL 1.1 CPU             2.66            8.7             2.7             11.8
LuxMark 2.0 Room GPU            220             177             230             176
LuxMark 2.0 Room CPU            72              165             119             328
Tomb Raider Norm 19x10 (fps)    26.1            16.1            28.2            19.7
Tomb Raider Low 19x10 (fps)     36.7            23.4            40.8            31.1
Tomb Raider Norm 16x10 (fps)    30              18.6            32.4            22.3
Metro: 2033 12x7 (fps)          39.33           18.3            43.95           31.67
Hitman 12x7 (fps)               33.1            25.1            36.1            27.8
3DMark Ice Storm                57,581          44,008          59,761          51,177
3DMark Ice Storm Graphics       71,752          44,844          76,442          52,722
3DMark Ice Storm Physics        34,048          41,315          33,854          46,417
3DMark Cloud Gate               6,141           5,074           6,335           6,191
3DMark Cloud Gate Graphics      8,925           5,819           9,464           7,099
3DMark Cloud Gate Physics       2,936           3,504           3,504           4,277
3DMark Fire Strike              1,265           661             1,325           774
3DMark Fire Strike Graphics     1,370           715             1,434           848
3DMark Fire Strike Physics      4,210           5,028           4,200           5,869
* For reference only.

Best scores are bolded. Windows 8.1 and integrated graphics were used for all testing.
