## Area vs Time: A look at algorithmic complexity in Hardware and Software

Many of us learned in college that implementing something in hardware is faster than implementing the same thing in software. What I see in this is time complexity being converted into area complexity, together with the benefits of pipelining and the reduced control overhead of special-purpose control logic (as opposed to the more generic and bulky control logic required by a general-purpose CPU).

My own recent explorations in this area suggest something that is not so obvious. The inverse proportion between logic area and time only starts after a discontinuity. Some algorithms benefit substantially when a very small part of the algorithm is implemented in hardware. Beyond that point, area and time follow an inversely proportional relation.

I would call this minimal piece of logic that adds so much value a ‘primitive function’. Now a little bit of theory. A general-purpose processor performs some data manipulation function (e.g. add, subtract, multiply, divide, compare, shift) between two registers. For a simple 32-bit RISC processor whose instructions take two 32-bit register operands and produce one 32-bit register result, there is some combinational logic between the 64-bit input and the 32-bit output, selected by the current instruction. A typical RISC instruction set has fewer than 256 instructions. Is this enough to cover every possible function mapping a 64-bit input to a 32-bit output? Certainly not. You can always find a function that is missing, for example one that mirrors (bit-reverses) one of the 32-bit inputs to the output.
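To get a feel for how far a few hundred opcodes fall short, here is a small counting sketch. It assumes a hypothetical helper, `log2_function_count`, written only for this illustration. The reasoning: a function from two w-bit inputs to one w-bit output is a truth table with 2^(2w) rows, and each row can hold any of 2^w outputs, so there are 2^(w · 2^(2w)) such functions.

```python
def log2_function_count(word_bits: int) -> int:
    """Return log2 of the number of distinct functions that map two
    word_bits-wide inputs to one word_bits-wide output.

    Hypothetical helper, for illustration only.
    """
    input_rows = 2 ** (2 * word_bits)   # one truth-table row per input pair
    return word_bits * input_rows       # each row chooses word_bits output bits freely


if __name__ == "__main__":
    for w in (1, 2, 8, 32):
        print(f"{w:2d}-bit words: 2^{log2_function_count(w)} possible functions")
    # An 8-bit opcode field can name at most 2^8 = 256 of them.
```

For 32-bit words the exponent itself is 2^69, so any realistic instruction set covers only a vanishing fraction of the possible mappings.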

So no general-purpose processor can implement every possible function mapping from a two-word input space to a one-word output space. Instead, it implements only the most commonly used mappings: add, subtract, boolean operators, shifts and so on. Take the case of mirroring the bits in a 32-bit word. On a general-purpose CPU without a dedicated instruction, a straightforward bit-by-bit loop takes at least 32 iterations, and even mask-and-shift tricks need a couple of dozen instructions. Implementing the mirror in hardware, however, costs almost no logic, since it is essentially a reordering of wires, yet it immediately cuts the time required to compute the mirror by a large amount.
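The mirroring example can be sketched in software to show what the CPU has to do when no such primitive exists. This is a minimal illustration, not a benchmark: the function names `mirror_loop` and `mirror_swap` are made up for this post, and Python is used only to make the steps explicit. The instruction counts on a real CPU depend on the ISA and the compiler.

```python
MASK32 = 0xFFFFFFFF  # keep values inside a 32-bit word


def mirror_loop(x: int) -> int:
    """Mirror a 32-bit word one bit at a time (32 loop iterations)."""
    x &= MASK32
    result = 0
    for _ in range(32):
        result = (result << 1) | (x & 1)  # move the lowest bit of x into result
        x >>= 1
    return result


def mirror_swap(x: int) -> int:
    """Mirror a 32-bit word in five mask-and-shift stages.

    Each stage swaps neighbouring groups of 1, 2, 4, 8 and finally 16 bits.
    """
    x &= MASK32
    x = ((x >> 1) & 0x55555555) | ((x & 0x55555555) << 1)
    x = ((x >> 2) & 0x33333333) | ((x & 0x33333333) << 2)
    x = ((x >> 4) & 0x0F0F0F0F) | ((x & 0x0F0F0F0F) << 4)
    x = ((x >> 8) & 0x00FF00FF) | ((x & 0x00FF00FF) << 8)
    x = (x >> 16) | ((x << 16) & MASK32)
    return x


if __name__ == "__main__":
    word = 0x12345678
    print(hex(mirror_loop(word)), hex(mirror_swap(word)))
```

Both versions spend many ALU operations on something that, in hardware, amounts to connecting output bit i to input bit 31 - i.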

So if you are tasked with implementing custom hardware under an area constraint, the first thing to ask is: does my CPU support all the primitive logic functions relevant to the algorithms in use?

TODO: To be updated with some rough sketches indicating my idea