Krazy Glew's Blog

Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Saturday, September 10, 2011

Google Desktop (Search) Dying

Google is end-of-life-ing several products, including Google Desktop, in particular Google Desktop Search, which I use constantly, and Google Notebook and Sidewiki, which I would like to use, but which, fortunately, I decided not to become dependent on.

Google says

Desktop: In the last few years, there’s been a huge shift from local to cloud-based storage and computing, as well as the integration of search and gadget functionality into most modern operating systems. People now have instant access to their data, whether online or offline. As this was the goal of Google Desktop, the product will be discontinued on September 14, including all the associated APIs, services, plugins, gadgets and support.
Fair enough. I *am* trying to keep most of my information in the cloud. But unfortunately I do not have full connectivity - large areas of rural Oregon still do not have cell phone connectivity, voice, let alone data, and, for that matter, I am too cheap to pay for data connectivity at all times. Plus, some things I am not allowed to keep in the cloud, and

So it looks like I will have to use Microsoft Desktop Search, as built in to Windows 7, for stuff on my laptop. I prefer the Google user interface - but since Google Desktop Search will be discontinued, I don't have much choice. And because I am using Microsoft Desktop Search, I have started using Bing for my web search. Here, I still prefer the Google user interface, which I am familiar with - but Bing's user interface is more similar to MS Desktop Search, and I value consistency between Desktop and Web more than I value the more familiar Google Search interface.

I suspect that Microsoft Desktop Search cannot search my Google Chrome browser history. Just like Google Desktop Search could not search my Microsoft OneNote.

Unfortunately, this begins to look like dominoes falling. So long as I have substantial local data on my Microsoft PC, I am attracted, despite my personal preferences, to Microsoft tools like Microsoft Desktop Search, Bing, and Internet Explorer.

Not everyone will be in this situation. Some people will be purely cloud based. But, unfortunately, I am not. Yet?

I wonder if this will affect my next choice of cell phone? I was leaning towards Android, but perhaps this is enough to tilt me towards Windows 8.

Monday, September 05, 2011

Bit Interleaving, Instruction Set Support for

[[Category:ISA]]

http://semipublic.comp-arch.net/wiki/Bit_Interleaving,_Instruction_Set_Support_for
= Instruction Set Support for Bit Interleaving =

== What is Bit Interleaving? ==

[[Bit interleaving]] is a very common form of multi-operand bit permutation.
It has been important enough to occasionally warrant special instruction set support.

2-way bit interleaving of A and B;

for i = 0 to N do
result.bit[2*i] := A.bit[i]
result.bit[2*i+1] := B.bit[i]

3-way bit interleaving off A, B, C:

for i = 0 to N do
result.bit[3*i] := A.bit[i]
result.bit[3*i+1] := B.bit[i]
result.bit[3*i+2] := C.bit[i]

M-way bit interleaving of A[i]

for i = 0 to N do
for j = 0 to M do
result.bit[N*i+j] := A[j].bit[i]

Note that since bit interleaving combines multiple operands into a single operand, issues of overflow may arise.
For example, the result is often used in address arithmetic,
in which case the inputs (aka the coordinates)
are probably not full width addresses.
The bit interleaving may be used in only a subrange of the input bits.

== Why Bit Interleaving? ==

See [[http://en.wikipedia.org/wiki/Morton_number_(number_theory) Wikipedia Morton numbers]].

Bit interleaving is often used as a component of hash functions.

In particular, bit interleaving of 2D and 3D coordinates in computer graphics
often produces hash functions that have better cache locality than uninterleaved coordinates used to index an array.

== Bit Interleaving Instructions ==

Intel Larrabee has 1:1 and 2:1 interleave instructions, in both scalar and vector form:

vbitinterleave11pi: 1:1 bit-interleave vectors
vbitinterleave21pi: 2:1 bit-interleave vectors

bitinterleave11: 1:1 bit-interleave scalar
bitinterleave21: 2:1 bit-interleave scalar

(See [[http://drdobbs.com/architecture-and-design/216402188?pgno=5 Mike Abrash's Dr. Dobb's article]] which comments that
"is useful for generating swizzled addresses, particularly in conjunction with vinsertfield, for example in preparation for texture sample fetches (volume textures in the case of vbitinterleave21pi".)

The 1:1 interleaving instruction obviously accomplishes 2D interleaving,
i.e. the interleaving of two coordinates for a 2D system.

The 2:1 interleaving instruction accoplishes 3D interleaving.
Or, rather, first one interleaves two coordinates, say X and Y, using 1:1,
and then one interleaves the third coordinate, Z, using 2:1.

4:1 interleaving is accomplished by two 1:1 interleaves.
However, operations on objects and spaces of more than 1D, 2D, or 3D are much less common.

== Expand (Interleave with 0) ==

Although 1:1 and 2:1 interleaving "fit" into commion instruction sets,
higher degrees of interleaving may not - too many operands may be required.

Some support may be afforded by bit expansion,
or interleaving with 0.
This takes 2 (or 3) inputs:
* the bitvector to which 0s will be inserted between elements
* the number of 0 bits to be inserted between elements
* possibly an offset to cause the result to be aligned appropriately for the last step.

The last step would be ORing the results together.
(Or equivalently, ANDing).

Saturday, August 13, 2011

Register_File_Port_Reduction_using_Time-based_SIMD

http://semipublic.comp-arch.net/wiki/Register_File_Port_Reduction_using_Time-based_SIMD

= Background: 4-cycle GPU SIMD instruction groups =

The most popular GPUs circa 2010 - Nvidia, AMD/ATI, Intel Gen - share the followng characteristics:
* they have [[SIMD or SIMT]] [[coherent threading]]
* the SIMD is 16-way spatially:
** i.e. each SIMD engine is circa 16 [[lanes]] wide
** although the definition of the lane varies, e.g. from 32 bit wide scalar for Nvidia to 5-way VLIW for AD/ATI
* the SIMD is 4 way temporally
* i.e. every [[SIMD instruction group]], [[wavefront]] or [[warp]], occupies the 16 lanes for 4 cycles.

This wiki page discusses the temporal aspect,
at least one specific advantage of taking at least 4 cycles peer [[SIMD instruction group]].
This is a particular case of
[[register file port reduction]].

= [[Register File Port Reduction using Time-based SIMD]] =

For simplicity, let us ignore the possibility of [[spatial SIMD]].
Let us assume that we have only one ALU,
taking 2 inputs peer cycle and producing a single output per cycle.
(Or possibly 3 inputs, if we want to consider [[multiply-add]].)
(Or more different combinations of inputs and outputs, if we are particularly aggressive.)

I.e.
dest := opcode( src, src2, src3 )

On a conventional scalar microarchitecture, this would require 3 read ports and 1 write port on the register file per cycle.
(Let's forget the possibility of architecturally requiring the registers to belong to banks with non-interfering ports.)

Now, instead, let us imagine that we are dealing with a SIMD instruction group or wavefront, i=0,3.
And let us say that it is distributed over time, 4 successive clock cycles t0..t3

I.e.
t0: dest[0] := opcode( src[0], src2[0], src3[0] )
t1: dest[1] := opcode( src[1], src2[1], src3[1] )
t2: dest[2] := opcode( src[2], src2[2], src3[2] )
t3: dest[3] := opcode( src[3], src2[3], src3[3] )

Now, instead of reading 3 separate (e.g. 32) values per cycle, and writing a single such value every cycle,
we could make each separate access 4X larger, but only do one such 4X larger access every cycle:

I.e.
t-3: src3_4x := RF.read( src3 )[0:127]
t-2: src2_4x := RF.read( src2 )[0:127]
t-1: src1_4x := RF.read( src1 )[0:127]
t0: dest_4x[0:31] := opcode( src1_4x[0:31], src2_4x[0:31], src3_4x[0:31] )
t1: dest_4x[32:63] := opcode( src1_4x[32:63], src2_4x[32:63], src3_4x[32:63] )
t2: dest_4x[64:95] := opcode( src1_4x[64:95], src2_4x[64:95], src3_4x[64:95] )
t3: dest_4x[96:127] := opcode( src1_4x[96:127], src2_4x[96:127], src3_4x[96:127] )
t3+1: RF.write( dest )[0:127] := dest_4x[0:127]

(As is typical in these discussions, we are constrained by the lack of universally understood [[slicing notation]])

This shows that you can get away with a single 4X wider read/write port,
at the cost of muxing/delay elements:

The pipeline might look like

W0.write W1.exec_0
W2.read1* W1.exec_1
W2.read2* W1.exec_2
W2.read3* W1.exec_3
W1.write W2.exec_0*
W3.read1 W2.exec_1*
W3.read2 W2.exec_2*
W3.read3 W2.exec_3*
W2.write* W3.exec_0
... ...

although of course it can be extended to be deeper, tolerating more ALU latency, etc.

= 4X Wider versus 4X Time-skewed =

This register file port reduction can be obtained in at least two ways:

* 4x wider
** by performing 4X wider reads and writes
** in a single "cycle" of RF access
** to and from 4X wider temporary registers
** and then muxing 1/4 of those wider registers in any given cycle of execution

* 4x time skewing
** by having 4 register files
** each skewed from the others by 1 cycle
** e.g. providing the same register number to each, but delayed 1 cycle for each skewing

These are very similar.

The 4X wider approach has the advantage of being very easy to express in conventional synthesized logic,
even though it might be marginally more expensive in full custom logic.
As of 2011 time skewed register files are hard to express in RTL languages.

= Why 4X? =

4X is a convenient power of two,
and [[powers of 2 are convenient in computer architecture]].

4X matches 3 reads and 1 write
for a register file with a single read and write port in any cycle.
Which conveniently matches multiply add, A:=B*C+D, one of the most common operations in graphics
- and GPUs were one of the first places this occurred.

3X could be a possibility, if restricted to [[2-input operations]].
But I suspect that 4X is just plain nicer.

Larger than 4X also a possibility. But, again, 4X is the smallest convenient size
that reaps most of the benefits.

Thursday, July 14, 2011

Things you can do with a double precision multiplier

Consider a processor that has hardware sufficient to do a double precision multiplication.
Or perhaps a multiply-add.

([[WLOG]] we will talk about floating point; similar discussion applies to 2X wide integer.)

A double precision multiplier overall has a multiplier capable of forming 4 single precision products.
Let us draw something like this, except using byte wide multipliers as subcomponents rather than single bit:
Compare the partial products array for 64x64 to 32x32:
XXXX/XXXX
X X/X X
X X/X X
XXXX/XXXX
----+-----
XXXX/XXXX
X X/X X
X X/X X
XXXX/XXXX

or briefly if the numbers are (Ahi+Blo)*(Xhi+Ylo)
AY BY
AX BX

the summation network is similarly larger, although the final [[CPA (Carry Propagate Adder)]] is "only" 2X wider
(more than 2X more gates, but only a bit deeper in logic depth).

Given such a double precision multiplier, we can synthesize several different types of single precision operations

* [[LINE]]: v=p*u+q
:: this is just an [[FMA]], with possibly a different arrangement of inputs
* [[PLANE]]: w=p*u+q*v+r
:: this has 2 multiplications, although the sum network must be adjusted to align the products differently. This can be achieved by shifting the input to the upper half of the multiplier array
* [[LRP]] or [[BLEND]]: w=u*x+v*(1-x)
:: This is like [[PLANE]], except the second multiplier part is calculated. Like 2X, etc. products for advanced [[Booth encoding]]?

The above uses the 4 multiplications of the double precision multiplier,
but only uses 2 of them.
We can be more aggressive, trying to use all 4 - but then the summation network needs considerable adjustment.

An arbitrary 2D outer product:
[[OUTER2]] = (a b) X (x y) =
ax ay
bx by
although this causes some difficulties because it needs to write back an output twice as wide as its inputs.

[[CMUL (Complex multiply)]]: can be achieved using this multiplier: (a+bi) X (x+yi) = ax-by + (ay+bx)i
although once again there are difficulties with alignment in the summation network.

Saturday, July 09, 2011

Paradox and Consistency

I am fascinated by paradox. I am interested in the possibility of logical systems which are not consistent but where the inconsistency is somehow limited, so that it does not propagate and pollute the whole system. I am unhappy with systems where, given any inconsistency, any theorem can be proven.

I suspect that there may be much existing work in this area. Certainly I am aware of Russel's "set of all sets that do not contain themselves". And with Godel (although I probably do not understand Godel as well as I would like.). I should probably research the literature.

But since this is a hobby, I thought I might begin by writing down some simple observations. Not new - not to me, and probably well known.

But fun.

Consider simple systems of statements:

A single statement system:

S0: "This statement is false." The classic Liar's Paradox. The statement is neither false nor true: some say that it reflects an extra logic value, "paradox". Three valued logic.

A two statement system:

S1: "S2 is false"
S2: "S1 is false"

This system is self referential, but not necessarily paradoxical. It might be that S1 is true, in which case S2 is false, confirming that S1 is true. Or, vice versa. Thus, it is not contradictory. Either situation is consistent. But, from the outside, we don't know which.

What should we call this? Bistable? Metastable, if more than 2 stable states? Unknowable?

A three statement system:

S1: "S2 is false"
S2: "S3 is false"
S3: "S4 is false"

This is paradoxical, in the same way that the single statement Liar's Paradox is paradoxical.

Conjecture: I suspect that be a true paradox, then the feedback loop must have an odd number of stages. Whereas to be bistable or metastable, it must have an even number of stages.

Compare to inverter rings or storage bitcells with cross coupled inverters in computer electronics.

Q: can you construct a system that has interescting self referential rings of odd and even length?

Wednesday, July 06, 2011

Address Generation Writeback

http://semipublic.comp-arch.net/wiki/Address_Generation_Writeback

= Hardware Motivation for Address Register Modifying Addressing Modes =

Apart from the software datastructure motivation,
there is a hardware motivation for addressing modes such as pre-increment or decrement.

In more generality - addressing modes that calculate an address, and then save the results of that calculation
in a register - typically one of the registers involved in the addressing mode.

Consider [[Mitch Alsup's Favorite Addressing Mode]],
the x86's [[Base+Scaled-Index+Offset]].

A succession of such instructions might look like:

;an [[RMW]]
r1 := load( M[ rB+rO*4+offset1 ] )
r1 := ... some calculation involving r1 ...
store( M[ rB+rO*4+offset1 ] := r1 )

or

; nearby loads
r1 := load( M[ rB+rO*4+offset1 ] )
r2 := load( M[ rB+rO*4+offset2 ] )
r3 := load( M[ rB+rO*4+offset3 ] )

Notice the possibility of [[CSE (Common Subexpression Elimination)]]:

For the [[RMW]] case, this is straightforward:

rAtmp := [[lea]]( M[ rB+rO*4+offset1 ] )
r1 := load( rAtmp ] )
r1 := ... some calculation involving r1 ...
store( M[ rAtmp ] := r1 )

Is this a win? It depends:
* in terms of performance, on a classic RISC, probably note: the extra instruction adds latency, probably costs a cycle.
* in terms of power, possibly

For the nearby load case, we could do:

rAtmp := lea( M[ rB+rO*4 ] )
r1 := load( M[ rAtmp+offset1 ] )
r2 := load( M[ rAtmp*4+offset2 ] )
r3 := load( M[ rAtmp*4+offset3 ] )

or

rAtmp := lea( M[ rB+rO*4+offset1 ] )
r1 := load( M[ rAtmp ] )
r2 := load( M[ rAtmp*4+(offset2-offset1) ] )
r3 := load( M[ rAtmp*4+(offset3-offset1) ] )

Is this a win? Again, it depends:
* probably not in performance
* possibly in terms of power.

As an aside, let me note that if the base address rB+rO*4 is aligned, then the subsequent addresses could use [[OR indexing]] rather than [[ADD indexing]]. Which may further save power. At the cost of yet another piece of crap in the instruction set.
To avoid conundrums like this - is it better to [[CSE]] the addressing mode or not?
- processors like the [[AMD 29K]]
abandoned addressing modes except for [[absolute addressing mode]] and [[register indirect addressing mode]]
- so to form any nomn-trivial address you had to save the results to a register, and thereby were
encouraged to CSE it whenever possible.

Generalized pre-increment/decrement is a less RISCy approach to this.
Instead of encouraging CSE via separate instructions,
it encourages CSE by making it "free" to save the result ofthe addressing mode calculation.

:Let's call this [[Address Generation Saving]], or, if it modifies a register used in the addressing mode, [[Address Register Modification]]:

In our examples:

;RMW
rAtmp := [[lea]]( M[ rB+rO*4+offset1 ] )
r1 := load( rAtmp := rB+rO*4+offset1 ] )
r1 := ... some calculation involving r1 ...
store( M[ rAtmp ] := r1 )

; nearby load
r1 := load( M[ rAtnp :=rB+rO*4+offset1 ] )
r2 := load( M[ rAtmp*4+(offset2-offset1) ] )
r3 := load( M[ rAtmp*4+(offset3-offset1) ] )

Note that the nearby case that changes the offsets is simpler than the nearby case that saves only part ofthe address calculation:

;Using C's comma expression:
r1 := load( M[ (rAtmp := rB+rO*4, rAtmp+offset1) ] )
r2 := load( M[ rAtmp*4+offset2 ] )
r3 := load( M[ rAtmp*4+offset3 ] )

:: Again, [[OR indexing]] may benefit, for addressing mode [[(Base+Scaled-Index)}Offset]]

What is the probem with this?
* Writeback ports

Such an [[Address Generation Saving]] requires an extra writeback port, in addition to the result of an instruction like load.

For this reason, some instruction sets have proposed to NOT have [[address generation writeback]] for instructions that write other results, such as load,
but only to have [[address generation writeback]] for instructions that do not have a normal writeback, such as a store.

: Glew opinion: workable, minor benefit, but ugly.

Except that pre-increment/decre more general form:

Post-increment and decrement goes further, generalized in this way:
then the address calculation looks like

{ old := address_reg; address_reg := new_address; return old }

Not only does this require a writeback port for the updated address register,
but it also requires two paths from the AGU or RF to where the address is used
- one for the address, the other for the updated address_reg.

OVERALL OBSERVATION: addressing modes begin the slippery slope to VLIW.

Monday, July 04, 2011

Mitre Top 25 SW bugs

http://cwe.mitre.org/top25/index.html

Rank Score ID Name
[1] 93.8 CWE-89 Improper Neutralization of Special Elements used in an SQL Command ('SQL Injection')
[2] 83.3 CWE-78 Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection')

-- the top two are handled by taint propagation. "Improper Neutralization" => "quotification"



[3] 79.0 CWE-120 Buffer Copy without Checking Size of Input ('Classic Buffer Overflow')

-- handled by stuff like Milo Martin's Hardbound and Softbound.


[4] 77.7 CWE-79 Improper Neutralization of Input During Web Page Generation ('Cross-site Scripting')

-- tainting/quotification

[5] 76.9 CWE-306 Missing Authentication for Critical Function
[6] 76.8 CWE-862 Missing Authorization

-- more like real SW bugs, with no crutches like bounds checks of tainting.

-- except that we can imagine that capability systems might help here - it might be made mandatory to activate some capabilities positively, rather than passively inheriting them.

[7] 75.0 CWE-798 Use of Hard-coded Credentials
[8] 75.0 CWE-311 Missing Encryption of Sensitive Data


[9] 74.0 CWE-434 Unrestricted Upload of File with Dangerous Type

-- while we might hope that a capability system might upload such a file but strip it of all ability to do harm, experience suggests that a user will just blindly give the file whatever privileges it asks for.

[10] 73.8 CWE-807 Reliance on Untrusted Inputs in a Security Decision

-- tainiting

[11] 73.1 CWE-250 Execution with Unnecessary Privileges

-- capability systems are supposed to make it easier to implement the "Principle of Least Privilege".  However, it still happens.

-- we can imagine security tools that profile privileges - that determine what privileges have never been used, to allow narrowing the privileges as much as possible.

[12] 70.1 CWE-352 Cross-Site Request Forgery (CSRF)

-- tainting, capabilities


[13] 69.3 CWE-22 Improper Limitation of a Pathname to a Restricted Directory ('Path Traversal')
[14] 68.5 CWE-494 Download of Code Without Integrity Check
[15] 67.8 CWE-863 Incorrect Authorization

[16] 66.0 CWE-829 Inclusion of Functionality from Untrusted Control Sphere

-- tainting


[17] 65.5 CWE-732 Incorrect Permission Assignment for Critical Resource
[18] 64.6 CWE-676 Use of Potentially Dangerous Function
[19] 64.1 CWE-327 Use of a Broken or Risky Cryptographic Algorithm

[20] 62.4 CWE-131 Incorrect Calculation of Buffer Size

-- classic buffer overflow, such as Martin Hardbound/Softbound.

[21] 61.5 CWE-307 Improper Restriction of Excessive Authentication Attempts
[22] 61.1 CWE-601 URL Redirection to Untrusted Site ('Open Redirect')

[23] 61.0 CWE-134 Uncontrolled Format String
[24] 60.3 CWE-190 Integer Overflow or Wraparound

-- these two, 23 and 24, are somewhat handled by buffer overflow checking as in Hardbound and Softbound - the security flaws with integer overflow often manifest themselves as unchecked buffer overflows.

-- however, they can manifest themselves in other ways as well.


[25] 59.9 CWE-759 Use of a One-Way Hash without a Salt