http://semipublic.comp-arch.net/wiki/Why_dynamic_shift_count_instructions_are_often_slower_than_constant_shift_count_instructions
This essay isn't finished - need to collect more info.
= [[Why dynamic shift count instructions are often slower than constant shift count instructions]] =
I don't know.
I don't even know if they are, in general. Terje Mathisen says that they are in his experience,
and on at least two machines I know of anecdotally, dynamic shifts have been a pain. But I am aware of no fundamental reason.
TBD: survey.
== Possible Circuit Reasons - Unlikely ==
Paul Clayton suggests that knowing the static shift count could be used to set up a shifter early, whereas a dynamic shift count may arrive too late. This is possibly true, even probably true - it _could_ be used as an optimization. But in my experience I have not seen this done, and neither has Mitch Alsup in his.
Now, a very basic reason is that a shift by a small constant, e.g. a shift left or right by 1, can be much cheaper. On a 4-pipeline machine I can easily imagine building 4 narrow shifters but only one general shifter. Similarly, I can imagine converting <<1 into ADD instructions. I.e. I can imagine why we might have more small-width shifters than full-width shifters. Similarly for latency. But I still don't see a generic reason.
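To make the <<1-as-ADD point concrete, here is a minimal C sketch - purely illustrative, not any particular machine's datapath: a shift left by one is just adding a value to itself, and a shift by a small constant is a handful of such additions, so these cases can be serviced by any pipeline with an adder rather than by the single full-width shifter.

 #include <stdint.h>
 #include <assert.h>
 
 /* Shift left by 1 is the same as adding a value to itself, so it can run
  * on any pipeline that has an adder, without touching the full shifter. */
 static inline uint32_t shl1_via_add(uint32_t x) {
     return x + x;                     /* same result as x << 1 */
 }
 
 /* A shift left by a small constant k unrolls into k self-additions.
  * Beyond a few bits this loses to a real shifter, which is the point:
  * only small constant shifts are cheap this way. */
 static inline uint32_t shl_small_const(uint32_t x, unsigned k) {
     while (k--)
         x += x;
     return x;
 }
 
 int main(void) {
     assert(shl1_via_add(0x1234u) == (0x1234u << 1));
     assert(shl_small_const(7u, 3) == (7u << 3));
     return 0;
 }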
== x86 dynamic shift - flags hassle ==
The x86 OOO/P6 family slowness for variable shifts is largely due to the fact that a variable shift by a count of zero was defined to be a NOP, leaving the flags unchanged. On a 2-input OOO machine, this necessitated a third input for the old flags, and a second uop:
 tmp := concat( value_to_be_shifted, old_flags )
 dest, new_flags := shift( tmp, shift_count )
or (with lower latency for the shift, and a widget uop to handle the flag selection)
 dest, tmp_flags := shift( value_to_be_shifted, shift_count )
 final_flags := select_shift_flags( tmp_flags, old_flags )
If the instruction had been defined without this 0-NOP flag business, e.g. if a zero count set some defined flag combination rather than inheriting the old one, it would have been faster.
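Here is a minimal C model of the architectural rule that forces the extra input - not Intel's actual microcode, and the flag encoding is made up; the only point is that the final flags are a function of the old flags as well as of the value and the count.

 #include <stdint.h>
 #include <assert.h>
 
 typedef uint32_t flags_t;             /* stand-in for the x86 arithmetic flags */
 
 /* Hypothetical flag computation for a shift that actually moved bits;
  * the details don't matter, only that it does not need the old flags. */
 static flags_t flags_from_shift(uint32_t result, uint32_t carry_out) {
     return (result == 0 ? 1u : 0u) | (carry_out << 1);   /* ZF-ish | CF-ish */
 }
 
 /* Architectural behavior of a 32-bit variable shift left (count masked to
  * 5 bits): a masked count of zero changes neither the value nor the flags,
  * so the final flags depend on old_flags - a third data input. */
 static uint32_t var_shl32(uint32_t value, uint32_t count,
                           flags_t old_flags, flags_t *final_flags) {
     uint32_t c = count & 31;
     if (c == 0) {
         *final_flags = old_flags;     /* the "0-NOP" case: flags inherited */
         return value;
     }
     uint32_t carry_out = (value >> (32 - c)) & 1;   /* last bit shifted out */
     uint32_t result = value << c;
     *final_flags = flags_from_shift(result, carry_out);
     return result;
 }
 
 int main(void) {
     flags_t f = 0x2u;                 /* some pre-existing flags */
     assert(var_shl32(0x80000001u, 0, f, &f) == 0x80000001u && f == 0x2u);
     assert(var_shl32(0x80000001u, 1, f, &f) == 0x00000002u);
     return 0;
 }

The select_shift_flags widget uop above corresponds to the c == 0 choice between the old flags and the freshly computed flags, split out so the shift uop itself needs only two inputs.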
Now that 3-input datapaths are common for multiply-add, this could be undone. Perhaps it has already been?
== Gould - no dynamic shift instruction ==
E.g. on Gould we did not have them: we had to resort to the moral equivalent of self-modifying code, generating the shift-by-constant instruction in a register and then using the execute register instruction.
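A rough C sketch of that workaround - every opcode and field position below is a made-up placeholder, not the real Gould encoding: assemble the shift-by-constant instruction word at run time with the dynamic count in its immediate field, then hand it to the execute-register instruction.

 #include <stdint.h>
 
 /* Hypothetical instruction layout; the real Gould fields were different. */
 #define SHIFT_LEFT_IMM_OPCODE  0x2Au          /* made-up shift-by-immediate opcode */
 #define OPCODE_FIELD_SHIFT     24
 #define COUNT_FIELD_SHIFT      0
 
 /* Stand-in for the execute-register instruction: take an instruction word
  * held in a register and execute it. Stubbed out so the sketch compiles. */
 static void execute_register(uint32_t instruction_word) {
     (void)instruction_word;
 }
 
 /* Emulate a dynamic shift: build a shift-by-constant instruction with the
  * run-time count in its immediate field, then execute it from a register. */
 static void dynamic_shift_via_execute(uint32_t count) {
     uint32_t insn = (SHIFT_LEFT_IMM_OPCODE << OPCODE_FIELD_SHIFT)
                   | ((count & 31u) << COUNT_FIELD_SHIFT);
     execute_register(insn);
 }
 
 int main(void) {
     dynamic_shift_via_execute(13u);           /* count known only at run time */
     return 0;
 }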
== Discussion ==
Apart from this, why are dynamic shifts slow? What machines make them slow?
A dynamic shift is always going to be more expensive than a shift by 1 or 2 bits. At least, you can probably build 4 shift-by-1s, but only 1 dynamic full-width shifter, on a typical datapath. But it is not necessarily more expensive than a shift by a large constant, unless the sort of optimization Paul was talking about is done - which has not been the case in my experience.
But, you are right: dynamic shifts are often penalized.
This sounds like an essay for the comp-arch.net wiki.
What machines make dynamic shifts slower than, say, a shift by 29?
Why?
Is it fundamental, or is it an accident of the instruction set, as it was for x86 and Gould?
What would an instruction set definition of dynamic shift that did NOT cause such implementation artifacts look like?
Disclaimer
The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.
See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.