Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Wednesday, May 18, 2011

Dynamic shifts

http://semipublic.comp-arch.net/wiki/Why_dynamic_shift_count_instructions_are_often_slower_than_constant_shift_count_instructions

This essay isn't finished - need to collect more info.

= [[Why dynamic shift count instructions are often slower than constant shift count instructions]] =

I don't know.

I don't even know if they are, in general. Terje Mathisen says that they are in his experience,
and on at least two anecdotal machines dynamic shifts have been a pain. But I am aware of no fundamental reason.

TBD: survey.

== Possible Circuit Reasons - Unlikely ==

Paul Clayton suggests that knowing the static shift count could be used to do early setup of a shifter, whereas dynamic shift counts may arrive too late. This is possibly true, even probably true - it _could_ be used as an optimization. But I have not seen this done in my experience, nor in Mitch Alsup's.

Now, a very basic reason is that a shift by a small constant, e.g. a shift left or right by 1, can be much cheaper. On a 4 pipeline machine I can easily imagine building 4 narrow shifters but only one general shifter. Similarly, I can imagine converting <<1 into ADD instructions. I.e. I can imagine why we might have more small width shifters than full width shifters. Similarly for latency. But I still don't see a generic reason.
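To make the <<1-as-ADD point concrete, here is a minimal C sketch. The function names are mine, made up for illustration; this only shows the arithmetic identity, not how any particular machine implements it.

```c
#include <assert.h>
#include <stdint.h>

/* Shift left by one, written as the ADD an implementation might substitute.
   Unsigned arithmetic wraps modulo 2^32, so both forms drop the top bit. */
static uint32_t shl1_as_add(uint32_t x)
{
    return x + x;
}

/* General left shift by a run-time count: this is the case that needs
   a full-width shifter rather than an adder. */
static uint32_t shl_dynamic(uint32_t x, unsigned count)
{
    assert(count < 32);   /* larger counts are undefined behavior in C */
    return x << count;
}
```

For any 32-bit x, shl1_as_add(x) and shl_dynamic(x, 1) give the same result, which is why the small constant case can be spread across every ALU that has an adder.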

== x86 dynamic shift - flags hassle ==

The x86 OOO/P6 family slowness for variable shifts is largely due to the fact that a variable shift by zero was defined to be a NOP, flags included. On a 2-input OOO machine, this necessitated a third input for the old flags, and a second uop:

tmp := concat( value_to_be_shifted, old_flags )
dest,new_flags := shift( tmp, shift_count )

or (with lower latency for the shift, and a widget uop to handle the flag selection)

dest,tmp_flags := shift( value_to_be_shifted, shift_count )
final_flags := select_shift_flags( tmp_flags, old_flags)


If the instruction had been defined without this 0-NOP flag business, e.g. if you set a flag combination on zero rather than inheriting one, it would have been faster.
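A rough C model of the semantics may help show where the third input comes from. This is a sketch under assumptions: the Flags and ShiftResult types and both function names are invented for illustration, only CF and ZF are modeled, and the second function is just one possible "set rather than inherit" definition, not anything x86 actually specifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag record: only CF and ZF are modeled here. */
typedef struct {
    bool cf;
    bool zf;
} Flags;

typedef struct {
    uint32_t value;
    Flags flags;
} ShiftResult;

/* 32-bit x86-style SHL by a run-time count.
   Note the three inputs: value, count, and old_flags. */
static ShiftResult shl_x86_style(uint32_t value, uint8_t count, Flags old_flags)
{
    ShiftResult r;
    unsigned n = count & 31u;          /* count is masked to 5 bits */

    if (n == 0) {
        /* Shift by zero: value and flags pass through unchanged. */
        r.value = value;
        r.flags = old_flags;
        return r;
    }
    assert(n >= 1 && n <= 31);         /* keeps the 32 - n shift well defined */
    r.value = value << n;
    r.flags.cf = (value >> (32u - n)) & 1u;   /* last bit shifted out */
    r.flags.zf = (r.value == 0);
    return r;
}

/* Assumed alternative: flags are always written, so old_flags is not
   an input. Clearing CF on a zero count is an arbitrary choice. */
static ShiftResult shl_no_inherit(uint32_t value, uint8_t count)
{
    ShiftResult r;
    unsigned n = count & 31u;

    r.value = value << n;
    r.flags.cf = (n != 0) && ((value >> (32u - n)) & 1u);
    r.flags.zf = (r.value == 0);
    return r;
}
```

The first function cannot be evaluated without old_flags; the second depends only on the value and the count, which is all a 2-input uop can carry.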

Now that 3-input datapaths are common for multiply-add, this could be undone. Perhaps it has already been?

== Gould - no dynamic shift instruction ==


On Gould we did not have dynamic shift instructions at all: we had to resort to the moral equivalent of self-modifying code, generating the shift-by-constant instruction in a register and then using the execute register instruction.
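The trick can be mimicked in C as a toy. Everything in this sketch is an assumption made for illustration: the instruction word layout, the OP_SHL_CONST opcode value, and the function names are invented and are not the real Gould encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical instruction word: opcode in the top byte, constant
   shift count in the low 5 bits. NOT the real Gould encoding. */
enum { OP_SHL_CONST = 0x01 };

/* Step 1: build a shift-by-constant instruction in a register. */
static uint32_t make_shl_const(unsigned count)
{
    assert(count < 32);
    return ((uint32_t)OP_SHL_CONST << 24) | count;
}

/* Step 2: stand-in for "execute register": decode and run the
   instruction word held in a register. */
static uint32_t execute_register(uint32_t insn, uint32_t operand)
{
    unsigned opcode = insn >> 24;
    unsigned count  = insn & 31u;

    assert(opcode == OP_SHL_CONST);   /* only one opcode in this sketch */
    return operand << count;
}

/* A dynamic shift assembled from the two steps above. */
static uint32_t shl_dynamic_via_execute(uint32_t operand, unsigned count)
{
    uint32_t insn = make_shl_const(count);
    return execute_register(insn, operand);
}
```

The extra work of assembling the instruction word before every variable shift is the cost being described.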

== Discussion ==

Apart from this, why are dynamic shifts slow? What machines make them slow?

A dynamic shift is always going to be more expensive than a shift by 1 or 2 bits. At least, you can probably build 4 shift-by-1s, but only 1 dynamic full width shifter, on a typical datapath. But it is not necessarily more expensive than a shift by a large constant, unless the sort of optimization Paul was talking about is done. Which has not been the case in my experience.

But, you are right: dynamic shifts are often penalized.

This sounds like an essay for the comp-arch.net wiki.

What machines make dynamic shifts slower than, say, a shift by 29?

Why?

Is it fundamental, or is it an accident of the instruction set, as it was for x86 and Gould?

What would an instruction set definition of dynamic shift look like that did NOT cause such implementation artifacts?

Sunday, May 15, 2011

ISO ACLs for Wiki; considering Drupal

I have long wanted ACLs for wikis, both at companies like Intel and AMD, and on comp-arch.net.

I punted on this when I set up comp-arch.net, with its semipublic and public areas, and abandoned the public areas when I got spammed.

But now my need is growing again. I am wiki'ing things that may get patented, so they cannot be initially public --- but it is too much hassle to write them up privately, and then later post to a wiki, a year or so later when the patent is published. I want to write them up in the wiki once, and then have them go public when allowed.

Setting up a private wiki is a pain, because wiki admin wants to be shared. It's a pain to have to propagate changes made to a template on a public wiki to a private wiki, and vice versa.

---

This became annoying enough today that I considered switching to a CMS like Drupal or Joomla, from mediawiki.

Transferring the mediawiki content will be a hassle.

---

I'm starting to dream of a file based wiki. Each wiki page, or attachment, living in a filesystem. OS filesystem security and ACLs. WebDAV if appropriate. Wikiserver running as OS user accounts - requiring setuid CGI scripts or the equivalent.

The content files being separate from the wiki web presentation engine. E.g. one might have a CMS-like web system, or a wiki system, that interpret the files separately. In which case changing the CMS/wiki engine is just changing the scripts.

Wednesday, May 11, 2011

Love it!!!!!

http://www.math.utah.edu/~palais/pi.html

Pi is wrong.

Not mathematically - but the article observes that setting some other symbol to be what we now call 2pi or 6.28... leads to many simplifications in formulae, many fewer factors of 2 - and probably far fewer errors.

The astounding thing is how recently the modern use of pi=3.14... emerged. Of course, mathematicians back to the Greeks knew the concept, but the convention arose only in the 18th century, and stuck when Euler used it, borrowing it from a much less prominent Welsh mathematician. (I want to say "an obscure Welsh mathematician", William Jones, but I'll get in trouble for that.) Euler apparently previously used p/c, where p is the periphery and c the radius of any circle.

I'm just piling on to a flash mob, and I'm arriving late. But, what the heck.