Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Thursday, July 14, 2011

Things you can do with a double precision multiplier

Consider a processor that has hardware sufficient to do a double precision multiplication.
Or perhaps a multiply-add.

([[WLOG]] we will talk about floating point; similar discussion applies to 2X wide integer.)

A double precision multiplier overall has a multiplier capable of forming 4 single precision products.
Let us draw something like this, except using byte wide multipliers as subcomponents rather than single bit:
Compare the partial products array for 64x64 to 32x32:
XXXX/XXXX
X X/X X
X X/X X
XXXX/XXXX
----+-----
XXXX/XXXX
X X/X X
X X/X X
XXXX/XXXX

or briefly if the numbers are (Ahi+Blo)*(Xhi+Ylo)
AY BY
AX BX

the summation network is similarly larger, although the final [[CPA (Carry Propagate Adder)]] is "only" 2X wider
(more than 2X more gates, but only a bit deeper in logic depth).

Given such a double precision multiplier, we can synthesize several different types of single precision operations

* [[LINE]]: v=p*u+q
:: this is just an [[FMA]], with possibly a different arrangement of inputs
* [[PLANE]]: w=p*u+q*v+r
:: this has 2 multiplications, although the sum network must be adjusted to align the products differently. This can be achieved by shifting the input to the upper half of the multiplier array
* [[LRP]] or [[BLEND]]: w=u*x+v*(1-x)
:: This is like [[PLANE]], except the second multiplier part is calculated. Like 2X, etc. products for advanced [[Booth encoding]]?

The above uses the 4 multiplications of the double precision multiplier,
but only uses 2 of them.
We can be more aggressive, trying to use all 4 - but then the summation network needs considerable adjustment.

An arbitrary 2D outer product:
[[OUTER2]] = (a b) X (x y) =
ax ay
bx by
although this causes some difficulties because it needs to write back an output twice as wide as its inputs.

[[CMUL (Complex multiply)]]: can be achieved using this multiplier: (a+bi) X (x+yi) = ax-by + (ay+bx)i
although once again there are difficulties with alignment in the summation network.

Saturday, July 09, 2011

I am fascinated by paradox. I am interested in the possibility of logical systems which are not consistent but where the inconsistency is somehow limited, so that it does not propagate and pollute the whole system. I am unhappy with systems where, given any inconsistency, any theorem can be proven.

I suspect that there may be much existing work in this area. Certainly I am aware of Russel's "set of all sets that do not contain themselves". And with Godel (although I probably do not understand Godel as well as I would like.). I should probably research the literature.

But since this is a hobby, I thought I might begin by writing down some simple observations. Not new - not to me, and probably well known.

But fun.

Consider simple systems of statements:

A single statement system:

S0: "This statement is false." The classic Liar's Paradox. The statement is neither false nor true: some say that it reflects an extra logic value, "paradox". Three valued logic.

A two statement system:

S1: "S2 is false"
S2: "S1 is false"

This system is self referential, but not necessarily paradoxical. It might be that S1 is true, in which case S2 is false, confirming that S1 is true. Or, vice versa. Thus, it is not contradictory. Either situation is consistent. But, from the outside, we don't know which.

What should we call this? Bistable? Metastable, if more than 2 stable states? Unknowable?

A three statement system:

S1: "S2 is false"
S2: "S3 is false"
S3: "S4 is false"

This is paradoxical, in the same way that the single statement Liar's Paradox is paradoxical.

Conjecture: I suspect that be a true paradox, then the feedback loop must have an odd number of stages. Whereas to be bistable or metastable, it must have an even number of stages.

Compare to inverter rings or storage bitcells with cross coupled inverters in computer electronics.

Q: can you construct a system that has interescting self referential rings of odd and even length?

Wednesday, July 06, 2011

= Hardware Motivation for Address Register Modifying Addressing Modes =

Apart from the software datastructure motivation,
there is a hardware motivation for addressing modes such as pre-increment or decrement.

In more generality - addressing modes that calculate an address, and then save the results of that calculation
in a register - typically one of the registers involved in the addressing mode.

Consider [[Mitch Alsup's Favorite Addressing Mode]],
the x86's [[Base+Scaled-Index+Offset]].

A succession of such instructions might look like:

;an [[RMW]]
r1 := load( M[ rB+rO*4+offset1 ] )
r1 := ... some calculation involving r1 ...
store( M[ rB+rO*4+offset1 ] := r1 )

or

r1 := load( M[ rB+rO*4+offset1 ] )
r2 := load( M[ rB+rO*4+offset2 ] )
r3 := load( M[ rB+rO*4+offset3 ] )

Notice the possibility of [[CSE (Common Subexpression Elimination)]]:

For the [[RMW]] case, this is straightforward:

rAtmp := [[lea]]( M[ rB+rO*4+offset1 ] )
r1 := load( rAtmp ] )
r1 := ... some calculation involving r1 ...
store( M[ rAtmp ] := r1 )

Is this a win? It depends:
* in terms of performance, on a classic RISC, probably note: the extra instruction adds latency, probably costs a cycle.
* in terms of power, possibly

For the nearby load case, we could do:

rAtmp := lea( M[ rB+rO*4 ] )
r1 := load( M[ rAtmp+offset1 ] )
r2 := load( M[ rAtmp*4+offset2 ] )
r3 := load( M[ rAtmp*4+offset3 ] )

or

rAtmp := lea( M[ rB+rO*4+offset1 ] )
r1 := load( M[ rAtmp ] )
r2 := load( M[ rAtmp*4+(offset2-offset1) ] )
r3 := load( M[ rAtmp*4+(offset3-offset1) ] )

Is this a win? Again, it depends:
* probably not in performance
* possibly in terms of power.

As an aside, let me note that if the base address rB+rO*4 is aligned, then the subsequent addresses could use [[OR indexing]] rather than [[ADD indexing]]. Which may further save power. At the cost of yet another piece of crap in the instruction set.

To avoid conundrums like this - is it better to [[CSE]] the addressing mode or not?
- processors like the [[AMD 29K]]
abandoned addressing modes except for [[absolute addressing mode]] and [[register indirect addressing mode]]
- so to form any nomn-trivial address you had to save the results to a register, and thereby were
encouraged to CSE it whenever possible.

Generalized pre-increment/decrement is a less RISCy approach to this.
Instead of encouraging CSE via separate instructions,
it encourages CSE by making it "free" to save the result ofthe addressing mode calculation.

:Let's call this [[Address Generation Saving]], or, if it modifies a register used in the addressing mode, [[Address Register Modification]]:

In our examples:

;RMW
rAtmp := [[lea]]( M[ rB+rO*4+offset1 ] )
r1 := load( rAtmp := rB+rO*4+offset1 ] )
r1 := ... some calculation involving r1 ...
store( M[ rAtmp ] := r1 )

r1 := load( M[ rAtnp :=rB+rO*4+offset1 ] )
r2 := load( M[ rAtmp*4+(offset2-offset1) ] )
r3 := load( M[ rAtmp*4+(offset3-offset1) ] )

Note that the nearby case that changes the offsets is simpler than the nearby case that saves only part ofthe address calculation:

;Using C's comma expression:
r1 := load( M[ (rAtmp := rB+rO*4, rAtmp+offset1) ] )
r2 := load( M[ rAtmp*4+offset2 ] )
r3 := load( M[ rAtmp*4+offset3 ] )

:: Again, [[OR indexing]] may benefit, for addressing mode [[(Base+Scaled-Index)}Offset]]

What is the probem with this?
* Writeback ports

Such an [[Address Generation Saving]] requires an extra writeback port, in addition to the result of an instruction like load.

For this reason, some instruction sets have proposed to NOT have [[address generation writeback]] for instructions that write other results, such as load,
but only to have [[address generation writeback]] for instructions that do not have a normal writeback, such as a store.

: Glew opinion: workable, minor benefit, but ugly.

Except that pre-increment/decre more general form:

Post-increment and decrement goes further, generalized in this way:
then the address calculation looks like

{ old := address_reg; address_reg := new_address; return old }

Not only does this require a writeback port for the updated address register,
but it also requires two paths from the AGU or RF to where the address is used
- one for the address, the other for the updated address_reg.

OVERALL OBSERVATION: addressing modes begin the slippery slope to VLIW.

Monday, July 04, 2011

Mitre Top 25 SW bugs

http://cwe.mitre.org/top25/index.html

```Rank Score ID Name
[1] 93.8 CWE-89 Improper Neutralization of Special Elements used in an SQL Command ('SQL Injection')
[2] 83.3 CWE-78 Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection')

-- the top two are handled by taint propagation. "Improper Neutralization" => "quotification"

[3] 79.0 CWE-120 Buffer Copy without Checking Size of Input ('Classic Buffer Overflow')

-- handled by stuff like Milo Martin's Hardbound and Softbound.

[4] 77.7 CWE-79 Improper Neutralization of Input During Web Page Generation ('Cross-site Scripting')

-- tainting/quotification

[5] 76.9 CWE-306 Missing Authentication for Critical Function
[6] 76.8 CWE-862 Missing Authorization

-- more like real SW bugs, with no crutches like bounds checks of tainting.

-- except that we can imagine that capability systems might help here - it might be made mandatory to activate some capabilities positively, rather than passively inheriting them.

[7] 75.0 CWE-798 Use of Hard-coded Credentials
[8] 75.0 CWE-311 Missing Encryption of Sensitive Data

[9] 74.0 CWE-434 Unrestricted Upload of File with Dangerous Type

-- while we might hope that a capability system might upload such a file but strip it of all ability to do harm, experience suggests that a user will just blindly give the file whatever privileges it asks for.

[10] 73.8 CWE-807 Reliance on Untrusted Inputs in a Security Decision

-- tainiting

[11] 73.1 CWE-250 Execution with Unnecessary Privileges

-- capability systems are supposed to make it easier to implement the "Principle of Least Privilege".  However, it still happens.

-- we can imagine security tools that profile privileges - that determine what privileges have never been used, to allow narrowing the privileges as much as possible.

[12] 70.1 CWE-352 Cross-Site Request Forgery (CSRF)

-- tainting, capabilities

[13] 69.3 CWE-22 Improper Limitation of a Pathname to a Restricted Directory ('Path Traversal')
[14] 68.5 CWE-494 Download of Code Without Integrity Check
[15] 67.8 CWE-863 Incorrect Authorization

[16] 66.0 CWE-829 Inclusion of Functionality from Untrusted Control Sphere

-- tainting

[17] 65.5 CWE-732 Incorrect Permission Assignment for Critical Resource
[18] 64.6 CWE-676 Use of Potentially Dangerous Function
[19] 64.1 CWE-327 Use of a Broken or Risky Cryptographic Algorithm

[20] 62.4 CWE-131 Incorrect Calculation of Buffer Size

-- classic buffer overflow, such as Martin Hardbound/Softbound.

[21] 61.5 CWE-307 Improper Restriction of Excessive Authentication Attempts
[22] 61.1 CWE-601 URL Redirection to Untrusted Site ('Open Redirect')

[23] 61.0 CWE-134 Uncontrolled Format String
[24] 60.3 CWE-190 Integer Overflow or Wraparound

-- these two, 23 and 24, are somewhat handled by buffer overflow checking as in Hardbound and Softbound - the security flaws with integer overflow often manifest themselves as unchecked buffer overflows.

-- however, they can manifest themselves in other ways as well.

[25] 59.9 CWE-759 Use of a One-Way Hash without a Salt
```