Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Friday, February 18, 2011

Micro-optimization primitive instructions

http://semipublic.comp-arch.net/wiki/Micro-optimization_primitive_instructions


By [[micro-optimization primitive instructions]] I mean instructions that expose the internal steps and partial results of what are otherwise considered to be primitive instructions.

One of the best academic and published examples is in the Dally paper referenced at the bottom:
[[Micro-optimization of floating-point operations]].
However, other examples have been found in actual instruction sets, particularly before transistor counts increased to the point where fully pipelined floating point [[FADD]] and [[FMUL]] instructions were common.

= Examples of Floating-Point Micro-Optimization Instructions (from Dally paper) =

Examples from Dally paper: (transcribed not quite exactly)

;Automatic Block Exponent
:Identifying the largest exponent is cascaded additions, aligning all mantissae and adding without normalization.
* Saves: power of intermediate normalizations and roundings, unnecesary shifts
* Cost: Precision, unless extra mantissae width.See also [[block floating-point]], [[superaccumulator]].

;Shift Combining
:An alternative to automatic block exponent used in casxes where order of operations must be preserved: avoids repeatedly shifting left for normalization, and then shifting right for alignment.
;I.e. combines normalization and alignment shifts.
* Saves: power in redundant shifts.
* Cost: precision, sometimes

;Post Multiply Normalization
:Maintain extra guad bits to the right of mantissa, to avoid normalizzation shift, since multiplication of normalized numbers can denormalize by at most one bit position. (Denorm*norm can produce greater shifts.)

;Conventional Optimizations
:E.g.
(A + B) * (A - B)
can [[CSE (common sub-expression)]] the alignment of A and B.
* Saves: power
* Cost: the exposed micro-optimization, possibly requiring a few extra bits of mantissa and exponent.

;Scheduling
:Exposing primitive operations permits more effective optimization by compiler (or, for that matter, by a dynamic or [[OOO]] scheduler).

Dally suggests a fairly horizontal instruction format for these microoptimizations, consisting of
* Exponent Ops
** E+, E1 - exponent add/subtract (EC <- EA op EB) ** FF1 - returns shift required to normalize mantisssa MA (EC <- FF1 MA) * LDE, STE - load or store exponent as integer * Mantissa Ops ** M+, M-, M* - mantisssa add, subtract, multiply (MC <- MA op MB) ** SHR, SHL - mantissa shift left or right; supports [[shifting by a negative count]] in either dirrection ** ABS, NEG - zeroes and complements the mantissa sign bit ** LDM, STM - loads and stores mantissa as integer ** LDF, STF - load and store mantissa and exponent together as standard floating point number * Branch Ops ** BNEG - [[branch on exponent negative]] ** Bcond - [[branch on exponent and mantissa compare]]] (EA,MA) relop (EB,MB) = Other Possible Micro-optimizations = == Booth Encoding == Many processors use Booth encoding and multiplierarrays, for both integer and floating point. Booth encoding involves (a) preparing multiples of one operand. Unfortunatyely, the actual multiples depeend on how advanced the multiplier is - {-1,0,+1}, {-2,-1,0,+1,+2}, 3X, 4X, etc. (b) using the second operand selecting which multiples of the first operand are to be fed into the array for each digit group. We can certainly imagine [[CSE]]ing the work in preparing such multiples, for eother operand. Unfortunately, the multiples needed change with the exact multiplier used - one microarchitecture might require produced -1X and 3X multiples (1X and 2X are "free", shifting - while another might require 5X, etc. Therefore, the number of bits needed might change from chip to chip. We can imagine storing these large intermediate results in a SIMD packed vector, via an instruction that looks like: vreg256b <- booth_encode(64b) Heck - that might be a useful operation to provide even if not doing microoptimization. It has uses. The details of the Booth encoding used might be hidden. However, perhaps an easier approach might be to do this microoptimization in microarchitecture: e.g. have a [[dynamic scheduler]] recognize several multiplications that share a common first operand, and then decide to skip the [[Booth encoding]] pipestage. The skip might shorten the pipeline - or it might just be used to save power. This amounts to [[memoizing]] the Booth encoding. == Virtual Memory Translations == There are many, many, repeated translations of nearby virtual addresses that result in the same physical page address. ; CacheTranslation with Base Registers E.g. Austin and Sohi suggested caching the translations next to the base register, and thereby avoiding TLB lookup, and supporting more translatuons on a low ported TLB.

    Todd M. Austin and Gurindar S. Sohi. 1996. High-bandwidth address translation for multiple-issue processors. In Proceedings of the 23rd annual international symposium on Computer architecture (ISCA '96). ACM, New York, NY, USA, 158-167. DOI=10.1145/232973.232990 http://doi.acm.org/10.1145/232973.232990
Equivalently, power can be saved.


; Instructions to Save Translations
We can imagine that this might be exposed to software as an instruction that looks like
treg := [[translate_virtual_address]]( virtual address )
dreg := load_using_physical_adddress( treg + offset )
store_using_physical_address( treg + offset)

The first instruction saves the translation, and any permissions bits required, in treg.
treg might be an ordinary address sized integer register, or a special register to hold such a translation.

Note: this instruction, [[translate_virtual_address]], is often requested even outside the context of micro-optimization.
See [[uses of an instruction to save virtual to physical address translation]].

The load and store operations use a saved translation. I would imagine that they would trap if the memory location specified by offset and sizze fell outside the saved translation.

NOTE: in a [[virtual machine]] environment, a Guest OS executing this instruction would get at most a partial benefit.

;User Mode Instructions to Save Translations
Instructions such as [[load_using_physical_address]] can only be used by privileged code - if software can construct arbitrary physical registers.

But if the saved translation lives in a special treg, translation register, that can only be written by certain instructions,
then unprivileged code could employ this instruction.
Equivalently, this could be a [[privilege tagged capability register]] - possibly an ordinary register, with an extra bit that
gets set by the translate instruction.
The physical memory accesses would require that bit to be set.
Other operations on such a register might clear the tag bit.

This is how [[capability registers]] work for other capabilities;
physical addresses are just an extra capability.

Except... physical addresses can change, e.g. during [[virtual memory]] operations such as [[page fault]]s and [[COW]].

(1) we could expose the [[translation registers]] to the OS, which can treat them as an extended TLB, changing them as necessary.
: This is not unlike what old OSes needed to do when [[multitasking using base registers]].

(2) However, we might then need to re-translate. This could be accomplished by storing the original virtual address inside the treg [[[translation register]].

I.e. we have circled back to Austin and Sohi - except that instead of the physical address being a cache, it is more exposed.

Not completely exposed - it is a [[covert channel security hole]] to expose physical addresses.
But it is an explicitly acknowledged, albeit opaque, data quantity.


== Division Step, etc. ==

Any "step" operation, such as divide-step, multiply-step, sqrt-step,
can be considered as a micro-optimization primitive.

With the concomitant problem, that the algorithm, may vary between versions of a processor.

== TBD: other micro-optimizations ==

TBD

= References =
W. J. Dally. 1989. [[Micro-optimization of floating-point operations]]. SIGARCH Comput. Archit. News 17, 2 (April 1989), 283-289. DOI=10.1145/68182.68208 http://doi.acm.org/10.1145/68182.68208

W. J. Dally. 1989. Micro-optimization of floating-point operations. In Proceedings of the third international conference on Architectural support for programming languages and operating systems (ASPLOS-III). ACM, New York, NY, USA, 283-289. DOI=10.1145/70082.68208 http://doi.acm.org/10.1145/70082.68208

No comments: