Gadget wrote:
Really? I haven't followed hardware closely for quite some time now. Are there plans for incorporating FPGAs into mainstream hardware? That would raise all kinds of interesting questions/issues in a multi-programming environment.
There have been plans for this for years, but none have borne much fruit. People have tried everything from FPGAs on PCMCIA cards to FPGAs directly in processor sockets to hard PPC cores directly embedded in FPGA logic. They haven't been very successful outside a small niche for a few reasons: they're costly, have little utility for your grandma's web browsing, vary widely in resources and capability, and have no standard abstract interface. Cost can be mitigated by scale, but the rest are very much still problems today. I think we will start to see some reconfigurable logic incorporated directly onto SoCs (like Xilinx's Zynq) destined for mobile phones as a means to decrease energy consumption, thereby giving grandma some utility.
The last two points are the real deal-breakers. The ISA abstraction has proved very useful for CPUs: you can buy two CPUs with vastly different resources but run the same, unmodified software on both, and the more capable CPU will just 'magically' work faster. There is no obvious equivalent abstraction for porting FPGA bitstreams; everything about them is as device-specific as it gets. The closest thing to a common abstraction would be behavioral RTL, but even that is light-years away from being one, not to mention completely inaccessible to software developers (even with modern high-level synthesis tools). Even worse, it can take hours of compute time just to create the bitstreams for the specific accelerator...
Heterogeneous computing is a very hot topic, and projects like LiquidMetal and OpenCL are moving toward seamless portability between multiple models of computation. Even with these challenges, there are a number of possible workarounds. Libraries, precomputed bitstreams available OTA, 'warp processing', etc. will all play a role in bringing these devices to the mainstream.
Gadget wrote:
Good example! Which I will gladly upgrade to great if you provide a strong practical reason -- I couldn't come up with anything offhand, but I suspect there might be some uses related to information theory. Keeping in mind that it will still be much slower than a dedicated hardware solution, can anyone come up with a way to reduce the time complexity from Theta(word_size/2) down to Theta(1)?
Well, it was just a particularly simple and illustrative example. I'm not proposing CPU designers incorporate such logic in their chips. If you want a more useful example of an operation with large amounts of bit-level parallelism which typically can't be exploited by CPUs, I would suggest counting the number of bits set in a word. Intel recently introduced the 'popcnt' instruction to do just this, but software has to be specifically coded to take advantage of it. It's a surprisingly common operation, and is useful in pretty much anything from parity checking to bitvector manipulation in DBMSs.
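To make the bit-level parallelism concrete, here is a rough Python sketch of bit counting done two ways. It is only an illustration, not how 'popcnt' is built in hardware; the function names are mine, and I'm assuming a 64-bit word. The naive version walks the word one bit at a time, while the second sums neighboring bit fields in parallel with masked adds, which is about the best software can do without a dedicated instruction.

```python
def popcount_naive(x: int, word_size: int = 64) -> int:
    """Count set bits one position at a time: word_size sequential steps."""
    count = 0
    for i in range(word_size):
        count += (x >> i) & 1
    return count


def popcount_tree(x: int) -> int:
    """Count set bits in a 64-bit word with a tree of masked adds.

    Each line adds adjacent fields (1-bit, then 2-bit, then 4-bit, ...)
    across the whole word at once, so it takes log2(64) = 6 rounds.
    """
    x &= 0xFFFFFFFFFFFFFFFF  # assume a 64-bit word; drop anything above it
    x = (x & 0x5555555555555555) + ((x >> 1) & 0x5555555555555555)
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)
    x = (x & 0x0F0F0F0F0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F0F0F0F0F)
    x = (x & 0x00FF00FF00FF00FF) + ((x >> 8) & 0x00FF00FF00FF00FF)
    x = (x & 0x0000FFFF0000FFFF) + ((x >> 16) & 0x0000FFFF0000FFFF)
    x = (x & 0x00000000FFFFFFFF) + ((x >> 32) & 0x00000000FFFFFFFF)
    return x


def parity(x: int) -> int:
    """Parity check built on the bit count: 1 if an odd number of bits are set."""
    return popcount_tree(x) & 1
```

In hardware, the same adder tree is just a few levels of combinational logic, which is why a dedicated instruction beats either software version.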
Gadget wrote:
Unfortunately, estimating the additional work required for implementing any type of parallelism is going to be extremely difficult. Conceptually, it may not seem difficult to implement some degree of parallelism in a game engine. A really good team of developers with experience in this area might actually complete coding without incurring much additional expense. However, parallelism is going to introduce quite a bit of additional complexity in all of the other phases as well, especially testing and maintenance, which I suspect will be _at least_ 5x more expensive based on my experience helping maintain/fix a couple of defense projects w/ multiple processes (using shared memory in Unix). Finding the cause and then correcting a bug in a parallel program is truly one of the worst levels of hell for software developers.
I agree. Without direct experience with the specific code involved, the best anyone can do is guess. I also have experience working with parallel programs on multiprocessors, and it can indeed be an absolute nightmare to diagnose even the simplest bugs. Personally, I hope those who use their technological knowledge to enable political oppression (I'm looking at YOU, Blue Coat programmers!) and sell out free speech tools to the Saudis (I'm looking at YOU, Moxie!) are reserved the final circle in hell: debugging highly-parallel, tightly-coupled programs on a multiprocessor system with no cache coherency, in a perpetual Sisyphean punishment.
LatiosXT wrote:
I think No instruction set computing would be good for our computing. Apparently NVIDIA Kepler runs on this style.
Before you go "But Kepler sucks at compute performance!", well, that's because, like VLIW, it requires a highly tuned compiler.
I'm not so convinced this is really the way forward. This only exacerbates the problems VLIWs present. VLIWs haven't been adopted on a large scale, but it's not really because of the compiler's obligation to statically schedule everything. Intel's IA-64 compiler is quite good at this. The most serious problem is that the notion of binary-compatible software goes right out the window, even among CPUs sharing a common instruction set! NISC has the exact same problem, but even worse: now you don't even have a common instruction set!
People have tried to mitigate this problem for years; remember Transmeta's Crusoe? It was a VLIW under the hood with an x86 JIT unit. The idea was that the chip would be compatible with existing software, perhaps see some speedup where the JIT unit did a good job, and give great performance for device-specific code. Sounds like a great idea; how did it work out? Well, does anyone remember the Crusoe???
Gadget wrote:
Huh... sounds interesting... although it also sounds like NO instruction set is a bit of a white lie. Maybe I should look into compiler research. The boss will never tell you that optimization / performance doesn't matter plus we're becoming more and more dependent on good compilers.
I should really do a hardware review... is VLIW still considered immature or slow?
It's not really a lie. We're used to an ISA as a specification for interacting with the CPU's control FSM, which in turn manages the microarchitectural details necessary to actually effect your command. In this case, there isn't much need for a decoder and control logic: nearly all the microarchitectural details are handled by the compiler, which encodes the commands for each functional unit (FU) directly. It entirely skips the abstraction layer ISAs provide, hence, "no instruction set". I suppose you could say the instruction set is just the set of all possible combinations of FU commands.
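If it helps, here is a toy Python sketch of what "the instruction set is just the set of all possible combinations of FU commands" might look like. Everything in it is hypothetical (the `ControlWord` fields, the four-register datapath, the ALU operations); it is not modeled on any real NISC design. The point is only that each "instruction" is a bundle of raw control signals the compiler filled in, and the "CPU" wires them straight to the register file, mux, and ALU with no opcode to decode.

```python
from typing import List, NamedTuple


class ControlWord(NamedTuple):
    """One cycle of raw control signals (all field names are hypothetical)."""
    src_a: int      # register-file read port A select
    src_b: int      # register-file read port B select
    alu_op: str     # ALU function select
    use_imm: bool   # mux select: feed imm to ALU input B instead of src_b
    imm: int        # immediate value
    dst: int        # register-file write port select
    write_en: bool  # register-file write enable


# The ALU's function-select lines, modeled as a lookup table.
ALU = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "and": lambda a, b: a & b,
    "pass_b": lambda a, b: b,
}


def run(program: List[ControlWord], num_regs: int = 4) -> List[int]:
    """Apply each control word to the datapath; no opcode decoding happens."""
    regs = [0] * num_regs
    for cw in program:
        a = regs[cw.src_a]
        b = cw.imm if cw.use_imm else regs[cw.src_b]
        result = ALU[cw.alu_op](a, b)
        if cw.write_en:
            regs[cw.dst] = result
    return regs


# What a compiler would emit for: r0 = 5; r1 = 7; r2 = r0 + r1; r3 = r2 - 2
program = [
    ControlWord(0, 0, "pass_b", True, 5, 0, True),
    ControlWord(0, 0, "pass_b", True, 7, 1, True),
    ControlWord(0, 1, "add", False, 0, 2, True),
    ControlWord(2, 0, "sub", True, 2, 3, True),
]
final_regs = run(program)
```

Change the datapath (add an FU, widen a mux) and the control word layout changes with it, which is exactly why binaries for such a machine aren't portable.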
I don't think VLIWs were ever thought of as slow. VLIWs can be extremely fast -- after all, the Itanium saw some significant adoption in supercomputers. With a perfect compiler, code can be scheduled optimally, which is not even close to feasible with dynamic issue OOO CPUs (this is not to say that 'perfect compilation' is feasible in software, either -- last time I checked, we were up to sequences of no more than 7 instructions...).
I also don't think VLIWs are considered immature. The idea has been around for decades and decades. They're not widely used (unless you consider DSPs to be VLIW machines) simply because development is difficult and they don't meet the needs of the market at a feasible cost.
LatiosXT wrote:
The idea is that there's no instruction set. I almost want to say it's like FPGAs with VHDL, but I can't. It's probably something that's way over my head. But all I know is there's no instruction decoder, the processor executes the operation directly from memory.
I would say it's more closely related to CGRAs than RTL for FPGAs. It kind of takes the idea of static scheduling to the next level. VLIW hardware still manages most microarchitectural details itself, while this gives the code generator even more fine-grained control over the microarchitecture. I see why you say this, though. It makes binary code look much more like a bitstream than a program.