Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Sunday, February 27, 2011

Unified_versus_Split_or_Typed_Register_Files

http://semipublic.comp-arch.net/wiki/Unified_versus_Split_or_Typed_Register_Files

There used to be a big debate in computer architecture about whether
one should have [[Unified versus Split Integera and Floating Point]].
Both ISA or macroarchitecurally, in the programmer exposed register
file. Or microarchitecturally, in the implementation.

E.g. the [[Intel P6 microarchitecture]] implemented the [[Intel
x86/x87 ISA]], with separate integer and floating point. But the
original P6 microarchitecture implemented this is a unified datapath,
in which both integer and floating point use the same scheduler, and
occupy the same [[PRF]] in the form of the unified [[ROB]], each entry
of which can hold a full 80+ bit x87 FP value. But they are split at
the [[RRF]].

Over time, Intel added [[SIMD packed vector extensions]]. [[MMX]]
shared the x86 registers, both architecturally and
microarchitecturally (in most implementations). But wider SIMD, such
as SSE with 128 bit [XMM]] regisyyers, and AVX with 256 bit [[YMM]]
registers, introduced yet another split architectureal register file.
Microarchitecturally, they may have kept the P6 unified structure, but
the wastage of storing an 8, 16, 32, or even 64 bit scalar in a 128 or
256 bit wide PRF entry becomes more and more of a concern.

AMD, by contrast, has typically had separate integer and FP/SIMD
clusters, with separate schedulers and PRFs and ...

Intel's 2011 processor, Sandybridge, has split integer and SIMD PRFs,
a unified scheduler, but split scalar and SIMD clusters.

It is becoming increasingly obvious that any desire to split datapaths
should be based not on type, not on integer versis floating point, but
should be based on width, intrger or scalar versus SIMD packed
vectors. Narrow data types, 8, 16, 32, 64 bits, versus wide packed
data types, 128, 256, possibly even 512 bits wide. Latency versus
throughput.

Currently it is pretty standard to separate

* integer, scalar, 8, 16, 32, and 64 bit data. Addresses and branches. Latency sensitive

* SIMD packed vector, both integer and FP. 128, 256, and possibly 512 bit data. Bandwidth.

The only real question is whether one should support scalar FP on a
datapath optimized for low latency, and interaction with branches and
addresses. Or whether scalar FP should only be handled as a single
lane in a vector register, as in SSE, in a throughput manner.

Now, there will always be workloads where FP scalar latency will
matter. For that matter, there will always be workloads where vector
latency matters. Nevertheless, this division into narrow latency and
wide throughput datapaths seems to be a trend.

It will always be desirable to be able to access the bits of an FP
number, whether scalar or vector, in an ad-hoc manner. It will always
be desirable to branch and form addresses based on such accesses. But
these desires will not always outweigh the separation into narrow
latency and wide throughput clusters.

Power concerns now are paramount. Split datapaths naturally save
power; but unified datapaths may be able to save power in an even
finer grained manner. Much depends on the granularity possible with
[[clock gating]] (which can be quite fine grained, e.g. to bytes of a
datapath) and [[power gating]] (which in 2011 is quite coarse grained,
unlikely to be done within the width of a unified datapath). For that
matter, separate narrow/latency and wide/throughput clusters would
probably be easier to clock at different frequencies, with different
coltages - although as of 2011 this does not seem yet to be done.

I can make no hard and fast conclusion: although there is a trend
towards split narrow/latency and wide/throughput datapaths, there will
still be occasional winning attempts to unifiy. Much depends on the
evolution of power, frequency, and voltage management, in terms of
granularity.

No comments: