Your Tablet Benchmarks Suck

Jimmy Thang

The mobile benchmarking scene is facing the same pitfalls the PC once did

At least that’s what Intel engineers are saying as the chip giant finally prepares to go toe-to-toe with ARM-based tablets with its new “Bay Trail” Atom chips.

In talks with the hardware press just before showing off Intel’s new Bay Trail part, principal engineers Ronen Zohar and Francois Piednoel pointed out several “cringe-worthy” issues they found in the source code for many benchmarks being used to test tablets today.

The identities of the benchmarks weren’t disclosed, but Zohar pointed out several popular benchmarks whose source code shows they don’t actually test what they claim to test. For example, one memory bandwidth benchmark didn’t even stress the tablet’s RAM. Another test used an unrealistic math function that, according to the vast majority of research, doesn’t match popular use.

In another case, the developer hoped to create a CPU performance test, but going through the source code with the media, Zohar said it was apparent the test didn’t do that. Instead, it only really measured how fast it could update the status bar.

“Your CPU benchmark ‘tests’ how fast you can update the status bar and how fast you can update the clock,” Zohar said he learned from examining the app's source code.

Perhaps worse than inadequate benchmarks is gaming of the tests by vendors. Zohar said that in one example he witnessed, running the stock browser on a device and pointing it at a web-based benchmark resulted in the CPU ratcheting up to higher clocks.

Those who have followed the PC benchmarking scene for ages will feel a sense of déjà vu, as the PC went through this too in its early days. Piednoel agrees there are echoes of that era, when benchmarking was a bit of a wild west.

It’s not just Intel that believes this either.

“I agree with Intel that mobile benchmarking has gotten completely out of hand, and, it does remind me of the 90's,” said analyst Pat Moorhead of Moor Insights & Strategy. A former AMD exec turned analyst, Moorhead said mobile benchmarks have some maturing to do. He also agreed that cheating is rampant but also not quite as black and white as people make it out to be.

AnTuTu is a very popular mobile benchmark

Intel's hands aren’t entirely clean in this either. This summer the company was accused of cheating in the popular AnTuTu benchmark in showdowns involving its Clover Trail+ SoC. ARM-based tablets would run the full calculation, but when the benchmark was run on Clover Trail+ platforms, it would run the full calculation and then take a shortcut for the rest of the run.

When asked about the summer dustup over AnTuTu, after having just accused competing ARM vendors of benchmarketing, Zohar chalked it up to legitimate compiler optimization. The optimizations had been made many moons ago and made sense, Zohar said. He said Intel had no interaction with the AnTuTu developer except to provide a compiler optimized for Intel hardware, which is not exactly illogical. The optimizations didn’t fabricate numbers, he said; the compiler just knew that if the code is asking for the exact same thing it asked for before, there’s no reason to waste time and energy since you already know the answer. Why not just give the same answer? When AnTuTu was written to execute the same workload over and over again rather than take the compiler shortcut, the Intel CPU actually trailed the ARM chip slightly.
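The kind of shortcut Zohar describes can be pictured with a minimal Python sketch. This is only an analogy under stated assumptions: it is not AnTuTu's source or the output of Intel's compiler, the names `workload`, `run_naive`, and `run_with_shortcut` are hypothetical, and the reuse of the answer is written by hand rather than performed by a compiler.

```python
# Illustrative analogy only: hypothetical names, not AnTuTu or Intel compiler code.

stats = {"calls": 0}  # counts how many times the expensive calculation really runs


def workload(n):
    """Stand-in for a benchmark kernel: sum of the squares below n."""
    stats["calls"] += 1
    return sum(i * i for i in range(n))


def run_naive(n, repeats):
    """Execute the full calculation on every iteration."""
    return [workload(n) for _ in range(repeats)]


def run_with_shortcut(n, repeats):
    """Execute the calculation once, then reuse the already-known answer."""
    first = workload(n)
    return [first] * repeats
```

Both functions hand back identical results, but the second does the real work only once, so timing it says little about how quickly the chip performs the calculation itself.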

Just days ago, AnTuTu Labs, the developer of the AnTuTu benchmark, said it has implemented anti-cheating techniques in AnTuTu X.

Moorhead said Intel’s framing of how it all unrolled doesn’t match what he knows, but allegations of “cheating” aren’t exactly unusual.

“I don't know of any relevant hardware company who hasn't been accused of cheating, particularly in CPUs and graphics,” Moorhead said. “The gray area is that one man's cheat is another's optimization. One man's piling on of resources in a benchmark consortium is interpreted as manipulation.”

Cheating on the PC has been mostly tamed by the move from the early days of purely synthetic benchmarks to an emphasis on “real-world” tests. The move came when the benchmarking community began seeing driver optimizations that increased performance in benchmarks but actually hurt gaming performance. The theory behind the emphasis on real-world testing is that if a vendor is “optimizing” for a game, the end user still benefits. So call it cheating or optimizing, the result is still a better experience for the consumer. At least, that’s the theory. Reality doesn’t always match, though.

More than a decade ago, ATI got in hot water for fudging performance numbers in Quake III Arena

One of the most famous cases of “optimizations” involved Quake III Arena. A tech site found that changing the name of the executable from Quake3.exe to Quack3.exe would cause performance of the ATI Radeon 8500 to drop. When the name was changed back, the performance would increase. Further testing by others found that the “optimization” appeared to come at the cost of image quality.
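Why renaming a file could change performance is easier to see in a minimal Python sketch of name-based detection. This is not ATI's driver code: the function `settings_for`, the `APP_PROFILES` table, and the `texture_quality` setting are all invented for illustration, and the assumption is simply that a driver looks up the executable's file name.

```python
# Hypothetical sketch: not ATI's driver code. Profile contents are invented.

APP_PROFILES = {
    # Assumed per-game profile, keyed on the executable's file name.
    "quake3.exe": {"texture_quality": "reduced"},
}
DEFAULT_SETTINGS = {"texture_quality": "full"}


def settings_for(executable_name):
    """Pick rendering settings purely from the executable's file name."""
    return APP_PROFILES.get(executable_name.lower(), DEFAULT_SETTINGS)
```

In this sketch `settings_for("Quake3.exe")` gets the special profile while `settings_for("Quack3.exe")` falls back to the defaults, even though the program itself is unchanged.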

ATI defended itself by saying that it was indeed an optimization made to give gamers the best combination of performance and visual quality but the fact that people still remember this more than 12 years later tells you how history remembers it.

Years ago, Intel was also caught up in another benchmark brouhaha when it was found that applications compiled with Intel’s compiler didn’t use Streaming SIMD Extensions 2 (SSE2) properly on AMD CPUs that had the feature. The only way to enable the support on AMD CPUs was to make the application appear to be running on an Intel CPU that supported SSE2. The end result was that even if an AMD CPU had SSE2 support, an application compiled with Intel’s compiler would run far slower, using a different code path without SSE2 support. This, in fact, was an allegation in AMD’s antitrust suit against Intel, which both eventually agreed to settle with AMD receiving a $1.25 billion payment.

But showing just how gray “optimizations” can be, defenders of Intel argue that Intel’s C++ compiler is specifically designed for Intel CPUs and it’s not Intel’s job to validate AMD CPUs with a tool made to extract the most performance out of an Intel CPU. Others, of course, argue that Intel’s footprint on the industry is so large that if its compiler was violating Intel’s own guidance to explicitly check for CPU feature set support rather than just the CPUID string, the only answer can be blatant cheating.
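The difference between checking the vendor string and checking the feature set can be sketched in a few lines of Python. This is a simplified model, not Intel's compiler logic: the `Cpu` class and both `pick_path_*` functions are hypothetical, and only the vendor strings `GenuineIntel` and `AuthenticAMD` are the real identifiers those CPUs report.

```python
# Simplified model of code-path dispatch; hypothetical names throughout.
from dataclasses import dataclass


@dataclass(frozen=True)
class Cpu:
    vendor: str                        # vendor identification string
    features: frozenset = frozenset()  # feature flags the CPU reports


def pick_path_by_vendor(cpu):
    """Use the SSE2 path only when the vendor string says Intel."""
    if cpu.vendor == "GenuineIntel" and "sse2" in cpu.features:
        return "sse2"
    return "generic"


def pick_path_by_feature(cpu):
    """Use the SSE2 path whenever the CPU reports the feature."""
    return "sse2" if "sse2" in cpu.features else "generic"
```

Given `Cpu("AuthenticAMD", frozenset({"sse2"}))`, the vendor-based check selects the slower generic path while the feature-based check selects the SSE2 path, which is the gap the two camps are arguing over.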

On the PC, though, these incidents are more the exception than the rule thanks to the bad PR that’s usually generated and a generally skeptical press. Reliance on real applications, such as timing how long it takes to encode a video using Handbrake, has also kept benchmark controversies to a minimum lately.

That’s not the case with tablets and smartphones right now. Samsung’s name was recently dragged through the mud over allegations that the Galaxy S4 and Galaxy Note 3 were maxing out on cores and clock speeds, but only during popular benchmarks. The practice apparently wasn’t confined to Samsung, though: multiple vendors were found to be targeting benchmarks, including Asus, HTC, and LG as well as Samsung.

As the first fingered for optimizing solely for benchmarks, Samsung has denied it’s intentionally trying to cheat, saying it only wants to give the highest performance when running stressful workloads. After all, when you’re running a test that’s supposed to measure an SoC’s theoretical performance, don’t you want the SoC to be running at maximum clock speed with all of the cores active?

In the end, Intel argues that synthetic mobile benchmarks are still misleading.

“It’s not because you have 25 potatoes that you have a good cell phone,” Piednoel said.  “At the end of the day we are asking you to look at the new breed of benchmarks coming from benchmark vendors. Try to measure user experience, stop trying to measure potatoes.”
