The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Friday, January 28, 2011



To a collector of computer architecture trivia like me,
it was quite remarkable
that the Nvidia Fermi GPU
provides full-speed support for denormal operands and results,
whereas Intel and AMD CPUs, at the time of writing, do not.

By the way, terminology: [[full-speed support for denormals]] is not quite the same as [[hardware support for denormals]].

The first question is whether you provide hardware support,
or whether it is necessary to [[trap to microcode or software]], to provide the denormal support.

The second question is,
if you do provide hardware support for denormals and do not need to trap to microcode or software,
whether that hardware support runs at full speed.
Even if denormals are performed in hardware, there may be a performance cost:
it may cost latency, e.g. 5 cycles for an FADD rather than 4 cycles;
or it may cost bandwidth.
The latency and bandwidth impacts are related but may be decoupled:
e.g. it is possible that a scheduler can arrange so that throughput is not reduced even if latency is increased,
so long as there is sufficient [[ILP]].
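
To make the latency/throughput decoupling concrete, here is a back-of-the-envelope sketch in Python. It assumes an idealized, fully pipelined FP unit that can accept one new operation per cycle; the function names and the 100-operation workload are made up for illustration and do not describe any real machine.

```python
def cycles_dependent_chain(n_ops, latency):
    """Each op needs the previous result, so nothing overlaps (no ILP)."""
    return n_ops * latency


def cycles_independent(n_ops, latency):
    """All ops are independent: issue one per cycle, then drain the pipeline."""
    if n_ops == 0:
        return 0
    return latency + n_ops - 1


for latency in (4, 5):
    print(latency,
          cycles_dependent_chain(100, latency),
          cycles_independent(100, latency))
# prints:
# 4 400 103
# 5 500 104
```

With no ILP, the extra cycle of FADD latency costs 25% (400 vs 500 cycles). With enough independent work it costs a single cycle out of about a hundred, i.e. throughput is essentially unchanged.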

By [[full-speed denorms]] we mean mainly that throughput or bandwidth is not affected.
E.g. you may arrange to add to the latency of all FP ops,
to avoid [[writeback port collisions for operations with different latencies]].
Since GPUs are throughput machines, this is usually a reasonable tradeoff;
on some GPUs even integer arithmetic is stretched out to 40 cycles.
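
A toy model of the writeback port problem, again in Python. The assumptions are mine: a single writeback port, one op issued per cycle, and a hypothetical mix where ops that touch denormals take 5 cycles and the rest take 4. Two results that complete in the same cycle collide.

```python
from collections import Counter


def writeback_collisions(ops, ports=1):
    """ops is a list of (issue_cycle, latency) pairs.

    Returns how many results arrive in a cycle whose writeback
    port(s) are already taken by another result.
    """
    per_cycle = Counter(issue + latency for issue, latency in ops)
    return sum(max(0, n - ports) for n in per_cycle.values())


# Hypothetical instruction stream: 5-cycle ops are the denormal cases.
mixed = [(0, 5), (1, 4), (2, 4), (3, 5), (4, 4)]

# Same stream, but every op is stretched to the longer latency.
uniform = [(issue, 5) for issue, _ in mixed]

print(writeback_collisions(mixed))    # prints 2
print(writeback_collisions(uniform))  # prints 0
```

Stretching every op to the worst-case latency removes the collisions without reducing the issue rate, which is the tradeoff described above: pay latency, keep throughput.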

So, why would a GPU like Nvidia Fermi provide full-speed denorms,
whereas x86 CPUs from Intel and AMD do not?

Let's skip the possibility of a marketing bullet.

Remember [[GPU-style SIMD or SIMT coherent threading]]?

If a GPU takes a trap, or otherwise takes a hiccup, to handle denorms, then not only is the current thread impacted.
All 16 or 64 threads in the [[wavefront or warp]] are impacted.
And if only one of the [[spatial threads]] in the [[warp]] is taking the trap,
the efficiency of the machine decreases by at least 16-64X in that period.
Let alone the possibility that the denorm handling code is also not well suited to a GPU.
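
The arithmetic behind the 16-64X figure, as a small Python sketch. It assumes the simplest possible model: while the trap handler runs, only the trapping lanes do useful work and every other lane in the warp sits idle. The function names are illustrative only.

```python
def lane_utilization(active_lanes, warp_width):
    """Fraction of the SIMD lanes doing useful work."""
    return active_lanes / warp_width


def slowdown_while_trapping(warp_width, trapping_lanes=1):
    """Loss of efficiency relative to a fully occupied warp."""
    return warp_width / trapping_lanes


for width in (16, 64):
    print(width, lane_utilization(1, width), slowdown_while_trapping(width))
# prints:
# 16 0.0625 16.0
# 64 0.015625 64.0
```

This is a lower bound on the damage, hence "at least": it ignores the cost of entering and leaving the handler, and any inefficiency of the handler code itself.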

I.e. the relative cost of [[denorm handling via trapping]] is much higher on a GPU than on a CPU.
Even denorm handling in hardware, in a manner that impacts latency or throughput,
is relatively more expensive and/or harder to deal with.

Hence: there are good technical reasons to do [[full-speed denorm handling]] in GPUs.
These reasons are not quite so compelling for CPUs
- although I predict that ultimately CPUs will be compelled to follow.


    Anecdote: on one of my few trips to Pixar, I talked to Bruce Perens about denorm handling. Apparently Pixar would be happily rendering a movie at high speed, and then Bam! they would have a scene like a night sky, black, full of denorms, and performance would fall off a cliff. We talked about flush-to-zero modes, which Intel x86 CPUs did not have at that time. I suggested biasing the numbers, e.g. adding 1 or similar, to prevent getting into denorm range. Now that x86 has flush-to-zero the need for such [[kluge]]s is greatly reduced. But denorms exist for a reason. Flush-to-zero can introduce artifacts, and always introduces [[FUD (Fear, Uncertainty, and Doubt)]]. [[Full-speed support for denorms]] may just be the coming thing, a place where GPUs lead the way for CPUs.
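
A small illustration of the kind of artifact flush-to-zero can introduce, and of what the biasing kluge gives up. This is a sketch under assumptions: it uses Python doubles rather than the single precision a renderer would more likely use, it assumes the interpreter runs with default IEEE 754 gradual underflow, and `flush_to_zero` is a software stand-in for the hardware mode, not a real API.

```python
import sys

MIN_NORMAL = sys.float_info.min  # smallest positive normal double


def is_denormal(v):
    """True for nonzero values below the normal range."""
    return v != 0.0 and abs(v) < MIN_NORMAL


def flush_to_zero(v):
    """Software stand-in for a hardware flush-to-zero mode."""
    return 0.0 if is_denormal(v) else v


x = 1.5 * MIN_NORMAL
y = 1.0 * MIN_NORMAL
diff = x - y  # half of MIN_NORMAL, so it lands in the denormal range

print(x != y)                      # prints True
print(is_denormal(diff))           # prints True
print(flush_to_zero(diff) == 0.0)  # prints True

# The biasing kluge: keep values near 1.0 so results stay normal.
tiny = MIN_NORMAL / 256            # a denormal
biased = tiny + 1.0
print(is_denormal(biased))         # prints False
print(biased == 1.0)               # prints True
```

With denormals, `x != y` guarantees `x - y != 0`; with flush-to-zero that guarantee is gone, so code that tests `x != y` and then divides by `x - y` can blow up. The biased value never goes denormal, but the tiny quantity has been rounded away entirely, which is why biasing is a kluge rather than a fix.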