I just deleted a blog entry I had been writing for an hour or so. Blogger has a "save" button, which I was using - but when I hit the refresh button, because I had rotated my screen and I wanted the text box to fit, I lost everything.
Grrr....
I go now to find the text box backup plugin I use at work.
Disclaimer
The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.
See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.
Tuesday, December 14, 2010
Sunday, December 12, 2010
Flying Tablet
Flying today once again demonstrates the advantage of a pen-based tablet PC. The seats are just too close to open my convertible tablet PC in clamshell/notebook mode so I can use the keyboard to type. And, even though I have three seats next to each other all to myself, twisting sideways to use the PC keyboard in clamshell mode is too much of a pain. Especially since turbulence was producing a lot of typing errors.
But using it in slate mode works fine. Interestingly, I normally use cursive, as I am doing now, for handwriting input. But earlier, when the turbulence was bad, I found that the "letter comb" was less sensitive to turbulence-induced errors.
--
Sometimes I fear that I am the only person in the world to like handwriting. Yesterday I shopped for a new cell phone and/or tablet. First time I actually held a Dell Streak, the 5" phone. I quite like the size - it fit in my pocket - but the device is slow, with slow wireless, a slow processor, and an old version of Android. I quite liked the Samsung Galaxy, and I actually used its on-screen touch keyboard to enter a page on my comp-arch.net wiki. I quite wish it could be used as a phone, just to avoid having to pay for another data plan. I am considering getting the tablet and going with a cheap non-smart phone.
Wednesday, December 01, 2010
Https not everywhere sucks
Woke up this morning with a little idea wrt an FPGA fun hacking project.
Wrote it up.
Wanted to post on my private wiki, but instead posted on Google sites
https://sites.google.com/site/glewprivate/fpga-processor-isa--uarch-hacking.
Why? I hate Google sites. But Google sites uses https/SSL, whereas my private wiki does not.
Why not? Because my web host, Dreamhost, requires a fixed IP for SSL, and I have only paid extra for two domains to have the fixed IP and SSL - not the domain that hosts my private wiki.
Why not host my private wiki on one of my two fixed-IP https/SSL domains? Because Dreamhost only supports, by default, one UNIX (Linux) user ID per such domain. So all of the websites on such a domain are vulnerable to each other.
I'm moving towards fixing this, with my own virtual private server, so I can set up setuid CGI. Mainly, I haven't gotten around to setting that up yet. But, also, doing so loses me the automatic updates that Dreamhost provides.
Why do I care about https/SSL? I really should not need to answer this, but... In particular, this morning I'm at a motel. I can assume that bad guys may be monitoring all of my traffic that is not encrypted. No VPN. Hence, want to use https/SSL for anything that needs a password.
Why no VPN? A1: I can't use my work VPN for personal traffic. A2: I haven't paid for a personal VPN server. But in any case, a VPN would only encrypt from my tablet, across the motel wifi, to the VPN server somewhere in the net. From there, the traffic would be unencrypted if, e.g., using plain old http. VPN helps, but what is really needed is end-to-end encryption, such as https/SSL gives.
Bottom line: this morning decided to use Google sites, even though I hate it, because it offers https/SSL. Could have used Google docs, similar.
Hmm... Google's www.blogger.com, which I am using now, doesn't seem to use https. Doesn't this make blogging vulnerable? Isn't the rule that everything password protected needs to be encrypted - or at least that the cookies carrying the authentication need to be? Not just the login, but everything authorized?
Personal To Do:
1) Set up setuid cgi asap
2) install whatever wiki I care about - and move my wiki
3) keep playing with virtual server apache
Sunday, November 28, 2010
Monitor/Mwait
http://semipublic.comp-arch.net/wiki/Monitor-Mwait
= Monitor/Mwait =
Monitor/Mwait are a pair of instructions introduced as follows:
What: Monitor/Mwait
Where: Intel x86 PNI SSE3
When: 2004 Prescott
Brief description from [[Instruction_Set_Extensions,_a_Rough_History#Monitor-Mwait]]
- Originally intended to be used for fast thread synchronization, to implement "wait until a memory location has changed" (or "may have changed", since the wait might exit when the cache line is evicted but the variable is unchanged). Really, what people want is a cheap way to "wait until a condition on a memory variable is met", but we implement that by spinning and reevaluating the condition, blocking as efficiently as possible to eliminate as many spins as possible - "hardware event driven". The Monitor part is necessary to handle race conditions in such loops. Unfortunately, Monitor/Mwait had problems that resulted in them not being used outside the OS. MWAIT became a power management instruction, saying to enter a power management state like C3 or C4.
Let's discuss in more detail - not just monitor/mwait, but the class of synchronization operations of which these are an example.
== The Ideal: Monitor a Datastructure in Memory ==
What people, at least naive programmers, want to do is to be able to monitor arbitrary datastructures in memory:
* arbitrary datastructures,
* involving arbitrary numbers of memory locations (some of which are contingent, e.g. linked lists),
* evaluating arbitrary conditions.
I.e. people want to do something like
WAIT UNTIL
arbitrary condition on arbitrary datastructure(s)
or
SIGNAL EVENT WHEN
arbitrary condition on arbitrary datastructure(s)
Note the two forms: the first synchronous, the second interrupt or event handler based.
E.g. something like
WAIT UNTIL
there are 500 elements on the run queue
or
WAIT UNTIL
the B tree is unbalanced by more than 2X
; Transactions
Note a first issue: monitoring an arbitrary datastructure in memory may be hard to do, if the datastructure is constantly being updated, relinked, etc.
The programmer really wants to monitor this transactionally.
Later we will return to this issue inspired by hardware or software transactional memory techniques.
To start off with, however, we will retreat, past multiple memory locations known a priori (non-contingent),
all the way to monitoring a single memory location.
== Incremental Monitoring ==
First, let's get the easy part out of the way: many such monitoring conditions can be handled by having any code that changes the datastructure evaluate the condition,
and invoke the handler.
E.g.
insert onto queue( Q, el )
...
if( Q.size > 500 ) call handler_for_queue_gt_500
This works for single threaded code.
It even works for multithreaded code - the handler on one thread may set a shared variable or signal other threads. In many ways, monitor/mwait devolves into how to perform such cross thread notification easily.
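As a concrete illustration of that cross-thread notification, here is a minimal sketch using a POSIX condition variable; the queue, its size counter, and the 500-element threshold are placeholders carried over from the example above, not a real API.

#include <pthread.h>

/* Sketch only: the modifying code checks the condition and signals; the
 * monitoring thread blocks on a condition variable instead of spinning. */
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_big  = PTHREAD_COND_INITIALIZER;
static int q_size = 0;

void insert_onto_queue(void)                 /* the code that changes the datastructure */
{
    pthread_mutex_lock(&q_lock);
    q_size++;                                /* ... actual insertion elided ... */
    if (q_size > 500)
        pthread_cond_signal(&q_big);         /* notify the monitoring thread */
    pthread_mutex_unlock(&q_lock);
}

void wait_for_big_queue(void)                /* the monitoring thread */
{
    pthread_mutex_lock(&q_lock);
    while (q_size <= 500)
        pthread_cond_wait(&q_big, &q_lock);  /* blocks instead of spinning */
    pthread_mutex_unlock(&q_lock);
}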
It doesn't work when you can't modify or extend the existing code.
E.g. a debugger attempting to evaluate such a condition. However, a debugger is a rather special case.
Note that it is usually not necessary to evaluate the whole of a complicated condition on a datastructure: depending on what has recently happened in the modifying code, the check can often be specialized or can take an early out.
So, let's just get this out of the way, and assume that we still want to do monitor/mwait or the equivalent, between threads, at least on a single variable.
== Monitor/Mwait ==
Restricting ourselves to a single memory location,
we want to evaluate an arbitrary condition on that single memory location.
E.g.
LOOP
tmpReg := load( memLoc )
if( arbitrary_condition(tmpReg) ) then exit loop
END LOOP
E.g. specific:
LOOP
tmpReg := load( memLoc )
if( tmpReg == 42 || tmpReg == -21 ) then exit loop
END LOOP
- The example condition is chosen to be one that is unlikely to be hardwired. In reality, the conditions are usually simpler, such as tmpReg == 0 or -1, etc., which raises the possibility of hardwiring the condition - with the usual issue of what hardwired conditions should be supported.
Such a loop is semantically "good enough" (so long as the programmer understands that the condition may have been changed by another thread immediately thereafter, e.g. even before the loop exits, and so long as there is no need for an operation, such as "if zero then set to 1", that needs to be atomic).
However, such a spin loop can cause unnecessary bus traffic, waste power, etc.
So, what we are looking for with monitor/mwait is a way of (a) saving power, by blocking the spinloop, and (b) possibly allowing the operation to be exported.
The former, (a), is the main story behind monitor/mwait.
The latter, (b), may be considered an advanced topic.
Simply adding a pause inside the loop is considered unacceptable, since it adds latency to detecting the change:
LOOP
tmpReg := load( memLoc )
if( tmpReg == 42 || tmpReg == -21 ) then exit loop
sleep for time T
END LOOP
What we want is a way of monitoring the memory location to see if anything has changed.
In a MESI cache protocol, it is sufficient to obtain exclusive ownership of a cache line.
If the cache line is written by another processor, the local processor loses ownership,
and the condition can be re-evaluated:
LOOP
tmpReg := load( memLoc )
if( tmpReg == 42 || tmpReg == -21 ) then exit loop
mwait( memLoc ) - wait until another processor may have written the cache line
END LOOP
However, because the condition may initially be true, etc., the mwait is accompanied by a monitor outside the loop:
MONITOR(memLoc)
LOOP
tmpReg := load( memLoc )
if( tmpReg == 42 || tmpReg == -21 ) then exit loop
MWAIT( memLoc ) - wait until another processor may have written the cache line
END LOOP
Note that I say "may have",
in "wait until another processor may have written the cache line".
The MWAIT typically unblocks when the cache line in question has been written,
or at least ownership has been transferred so that it may have been written.
But it will also unblock when the cacheline has been thrashed out (e.g. by another thread, if the monitor mechanism is implemented in the cache lines, and not via a dedicated monitor register),
or after a context switch to another processor,
or after going into a low power mode for a while;
... essentially, whenever you can no longer guarantee that the value has not changed.
I.e. it is not monitoring changes; it is monitoring for lack of change, and blocking while lack of change is guaranteed.
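For concreteness, the spinloop above can be written with the actual SSE3 intrinsics. This is only a sketch: as noted in the history blurb, MONITOR/MWAIT ended up usable only in the OS (ring 0) on most systems, so ordinary user code cannot run this directly; the 42/-21 condition is just the running example.

#include <pmmintrin.h>   /* _mm_monitor, _mm_mwait (SSE3 / PNI) */
#include <stdint.h>

static void wait_for_value(volatile int32_t *memLoc)
{
    for (;;) {
        /* real code re-arms the monitor each iteration, since MWAIT only
           waits on the most recently MONITORed address range */
        _mm_monitor((const void *)memLoc, 0, 0);
        int32_t tmp = *memLoc;               /* re-check after arming */
        if (tmp == 42 || tmp == -21)
            return;
        _mm_mwait(0, 0);                     /* block until the line may have been written */
    }
}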
== Monitor/Mwait and LL/SC ==
Monitor/Mwait are similar to load-linked/store-conditional
- the so-called RISC form of atomic synchronization,
which only works in systems with caches, at the processor,
and which does not support [[exporting synchronization operations]].
My tone of voice indicates my attitude:
M/MW, like LL/SC, is limited.
It will not work in all circumstances, neither at the low end nor at the upper end.
However, it may be a point-in-time solution.
For the most future-proof instruction set,
just as I advocate wrapping LL/SC inside instructions such as "atomic increment memory location",
I advocate using M/MW as an implementation mechanism for possible instructions such as
* Wait until memory location is nonzero
Indeed, IBM's virtual machine experience has shown us that it is desirable to identify all sorts
of spinloops to the hardware or VMM, for virtual machines or otherwise.
(This warrants a discussion of its own - [[encapsulating or advising spinloops]].)
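To make the "wrapping" point concrete, here is a sketch of the retry loop that an "atomic increment memory location" instruction would hide from the programmer. C has no portable LL/SC primitive, so the C11 compare-exchange stands in for load-linked/store-conditional; the analogy is structural, not an actual instruction sequence.

#include <stdatomic.h>

static int atomic_increment(_Atomic int *loc)
{
    int old = atomic_load_explicit(loc, memory_order_relaxed);   /* like load-linked */
    while (!atomic_compare_exchange_weak_explicit(
               loc, &old, old + 1,
               memory_order_acq_rel, memory_order_relaxed)) {
        /* like a failed store-conditional: 'old' now holds the current value; retry */
    }
    return old + 1;
}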
== Climbing Back Up ==
Now that we have single location monitor/mwait spinloops:
MONITOR(memLoc)
LOOP
tmpReg := load( memLoc )
if( tmpReg == 42 || tmpReg == -21 ) then exit loop
MWAIT( memLoc ) - wait until another processor may have written the cache line
END LOOP
it is natural to think about extending it to multiple locations.
MONITOR(memLoc1)
MONITOR(memLoc2)
LOOP
tmpReg1 := load( memLoc1 )
tmpReg2 := load( memLoc2 )
if( condition( tmpReg1, tmpReg2 ) ) then exit loop
MWAIT( * ) - wait until another processor may have written any of the monitored locations
END LOOP
You can even imagine re-evaluating the MONITOR'ed locations, e.g. to handle insertions and deletions on a list being monitored.
Note that you can probably get away with having multiple monitors, and a single MWAIT.
In much the same way as people have proposed multiple load-linked with a single store conditional
loop:
tmp1 := load-linked(memLoc1)
tmp2 := load-linked(memLoc2)
if( condition(tmp1,tmp2) ) then store-conditional( someMemLoc )
if failure goto loop
Although this starts us on the slippery slope towards transactions.
First, with an explicit store commit phase
loop:
tmp1 := load-linked(memLoc1)
tmp2 := load-linked(memLoc2)
if_unchanged_then_commit_stores_atomically {
store(memLoc1,...)
store(memLoc2,...)
}
else {
goto loop
}
Then relaxing the requirement of users grouping the stores.
The problem with all of these is that the limits are arbitrary.
How many memory locations can you access? 1? 2? 4? An L1 cache in size?
Sunday, November 21, 2010
Flowchart Connectors: Coming Soon - Google Docs Help
I am very happy to learn that proper diagram drawing - connectors that stay connected - is coming soon to Google draw.
This will put me in a quandary: this makes Google drawings almost exactly what I need for my comp-arch.net wiki.
Except that the wiki aspects of Google docs and Google sites are pretty darn poor.
The URLs are unreadable. And there doesn't seem to be a way to create a link to a page that does not yet exist, that you can click on to create the page, like [[New Idea To Be Filled In Later]] - which IMHO is the essence of wiki.
Moving programming languages beyond ASCII text
Just read Poul-Henning Kamp's CACM article "Sir, Please Step Away from the ASR-33!"
Amen.
Constraining ourselves to use ASCII has led to less readable programming languages and programs, and probably has also led to bugs and security breakins, as Kamp suggests with C's & and &&.
Kamp advocates using unicode, with "the entire gamut of Greek letters, mathematical and technical symbols, brackets, brockets, sprockets, and weird and wonderful glyphs..."
I am not so sure that I would go all the way to full unicode, if for no other reason than that I don't know how I would edit a program that used "...Dentistry symbol light down and horizontal with wave" (0x23c7), as Kamp finishes the previous quote. People complain about APL being write-only! But certainly a subset of unicode - subscripted letters in the major alphabets, and ideograms.
Let us end not just the tyranny of ascii, but the tyranny of English. Variable names with accents for the rest of the Europeans; variable names in Chinese and Japanese ideograms. Internationalization systems for program variable names and keywords!!!!
Okay, maybe that goes a bit far - but nevertheless I feel the pain of non-English native speakers.
Kamp goes on to talk about using wide screens, e.g. "Why can't we pull minor scopes and subroutines out in that right-hand space...?"
I'm not sure about that, but I am a big user of tables. Truth tables, etc., that make it easy to see when you have handled all of the possible cases. Arrange the code in a table; heck, click through to zoom in and see details if the code in a table cell gets too hard to read even on our wide screens.
And Kamp goes on to talk about using color for syntax. Again, amen... although that raises the possibility of red/green colorblindness causing new bugs.
Myself, I learned to program from the Algol 60 primer, with boldfaced keywords.
I think, however, most of this is presentation. We need to start using something like XML as the basis of our languages. It can express all of the above. It can be presented using color, fonts, etc, on wide and narrow screens. It can be customized to your local display devices using CSS.
I know: many people hate XML. I suggest it mainly because it exists, and is sufficient, and there are many existing tools that can be leveraged.
And did I mention language extensions via domain-specific languages? Many conference speakers and famous California professors advocate them.
Me, I am mixed about domain-specific languages: I want to express things as naturally as possible in my domain, but I want all of my existing tools, my emacs modes, my interface generators, etc., to have a chance of handling them. More important than anything, the result must be parseable by language processors that do not necessarily know the syntax of the domain-specific language in use.
Think of how hard it was to extend C to C++: whenever they added a new keyword, existing programs using that keyword as a variable name broke. This will not happen if your keywords are distinguished in XML, e.g. using
<keyword>if</keyword>
or
<keyword ascii="if"/>
or
<if type="keyword"/>
I don't terribly care which representation is used, so long as there are standard tools to convert back and forth to a form that my other tools can expect to receive.
(Although the multiple tries it took me to get Google's blogging software to accept XML literals - I originally had to add spaces inside the angle brackets - shows how much tool change is needed.)
Tuesday, November 16, 2010
Array_Ports:_Dual_Port,_Multiport,_etc.
I keep plodding along:
http://semipublic.comp-arch.net/wiki/Array_Ports:_Dual_Port,_Multiport,_etc.
wikied, USEnet comp.arch posted, and blogged
= Array Basics =
The classic array is a 2D crosspoint array of memory cells.
In an SRAM based array, the memory cells are cross-coupled inverters (or a degenerate version thereof).
In a DRAM based array, the memory cells are capacitors.
In other memory technologies, the memory cells may be a floating gate (which is really a capacitor),
a bit of glass that can be melted, magnetic material, etc., etc.
>2D arrays are possible - but 2D is the basic form.
(Actually, 1D memories have been created - e.g. delay lines, magnetic bubble racetrack memory, etc. - but that is not in the scope of this topic.)
Conventionally, one of the two dimensions, let's say X, horizontal, carries word lines, or decoded address lines.
The other, Y, vertical, carries bit lines or data lines.
- (I don't want to say just "bit lines", since multivalued logic is a possibility.)
One of the word lines is activated, and it selects all of the memory cells along that word, and causes them to connect to the bit lines. Either the value is read out, or a new value is written in.
- Usually only one word line at a time is activated. Although you could activate more than one, e.g. to OR or AND the several bits selected for read. Conversely, you could write a value to multiple words at once. However, this is uncommon; it often pushes the edges of circuitry; and some systems will, in fact, have logic to prevent this from ever happening.
= A Single Port =
Sometimes there is a single bitline; sometimes there is a pair of bitlines, typically Q and Q-bar.
Sometimes there is a single wordline, and another wire, often parallel to the wordline, that indicates read or write.
Sometimes there are separate wordlines for read and write.
== Minor Tweaks ==
Read/write is a "diffuse" signal: while there must be a word line per, well, word, for a single ported array we might be able to have a global signal that said "write". Such a global signal could be, well, diffuse - signaled in the ground plane, etc. Possibly by a light shining from the 3rd dimension. Or you could route the write enables vertically, or diagonally. Usually these are not good ideas, costing more than they save in area or power.
However, you can share read/write lines between pairs of wordlines. Or even larger sets of wordlines, at the cost of routing to get the signal to where it would need to go.
See [[byte write enables]] or [[subword write enables]]
= True Multiport Arrays =
The simplest possible array might have a single bitline, a single wordline, and a read/write signal.
Typically two horizontal wires, and 1 vertical.
This would implement a single port that can be used for both read and write.
The next step in complexity would be to have
* a read port, with a wordline and a bitline
* a write port, with a dedicated wordline and a bitline
I call these true ports because almost arbitrary words can be simultaneously accessed in the same array.
- Almost arbitrary... often there are constraints, such as
* never writing to the same bitcells (words) simultaneously
* never reading and writing the same bitcells simultaneously
* sometimes never reading from more than M ports at a time; sometimes the read limit is M=1
In general, each true port costs 1 X and 1 Y wire.
This implies that true ports have an area cost of O(N^2).
- At least. Many technologies require dual bitlines, or two bitlines to write and 1 to read.
Note the possibility of a hybrid port with three wordlines and two bitlines, using bit/bit-bar to write, and reading on each bitline separately.
Simple arrays with small numbers of ports may not be wire limited: they may be limited by the bitcells.
But, as the number of ports grows, eventually the O(N^2) effect kicks in.
== Enumerate Ports when designing Arrays ==
Because of the possibilities of odd tradeoffs, such as combined read/write ports, or 1-write 2-read ports,
it is desirable to enumerate all of the ports you need in your [[abstract array design]],
and also to think about which ports may not need to be simultaneously active.
= Array Replication for Read Multiporting =
One way to mitigate the O(N^2) effect of ports is to replicate the array.
E.g. a 2-read port, single write port array
may be implemented as two 1R 1W arrays,
with separate read ports,
but the write ports tied together.
Similarly, 2R-1W has area 9c (three ports squared), whereas 2x1R-1W has area 2(4c) = 8c (two copies of two ports squared).
The effect gets bigger as the number of read ports increases.
However, there is often a cost to distributing the wires to the arrays.
Simplistically, if everything flows along a single datapath, there is O(N^2) cost to "turning" the wires to go to the separate array.
However, circumstances may vary: it may not be necessary to turn all the wires back into the same datapath, etc.
;Bluesky (at the time of writing, 2010):
If you can work with the third dimension, the cost of such array replication to get extra read ports is lowered:
whereas in 2D, routing to a separate array costs O(N^2), in 3D, routing to a separate array placed on a layer above the first costs O(N).
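A toy calculation makes the crossover visible. The quadratic-in-ports area model (area per bitcell proportional to ports squared) follows the wire argument above; the constants and the model itself are illustrative assumptions only.

#include <stdio.h>

int main(void)
{
    int W = 1;   /* write ports */
    for (int R = 1; R <= 8; R++) {
        int true_multiport = (R + W) * (R + W);        /* one array with R+W true ports   */
        int replicated     = R * (1 + W) * (1 + W);    /* R copies, each with 1R + W ports */
        printf("R=%d  true=%dc  replicated=%dc\n", R, true_multiport, replicated);
    }
    return 0;
}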
= Time Multiplexing =
Another way to get multiple ports is by time multiplexing.
You might, for example, "double pump" an array to get effectively twice the ports.
This can even be done with wave pipelining.
Similarly, you might know by construction that reads can only occur in 2 out of every 3 clock cycles, while writes may occur only every third clock cycle.
If you cannot make such guarantees, it is necessary to have a mechanism for handling conflicts.
If you can make average rate guarantees but not instantaneous rate guarantees,
a few buffers may suffice.
= Banks =
Banks are a form of spatial multiplexing.
Similar to array replication, but without tying together the write ports.
Address bits are used to assign requests to particular banks.
This weighs heavily in favor of power-of-two-sized banks,
although [[non-power-of-two-sized indexing]] is possible,
especially for arrays relatively far out in the memory subsystem, such as DRAM (e.g. Matrox had 17-way interleaved memory)
or outer level caches.
Bank selection bits may be chosen so as to interleave - successive words in the logical array may be assigned to different banks. Or they may be coarse grained - e.g. the lowest half of the address space is assigned to the first bank, the upper half to a second bank.
The two may be intermixed - interleaving, and coarse grain banking.
Indeed, there may be multiple levels of interleaving and coarse grain banking.
Before we leave this topic, I want to mention that cache lines may possibly be subject to both
intra-cacheline-banking, and inter-cacheline-banking.
E.g. a 64B (512b) cache line may be divided into 8B (64b) banks within the cacheline,
while the cachelines are interleaved between coarser scale banks.
= Terminology: Banks, Ranks, etc. =
Terminology may vary.
I (Andy Glew) tend to follow my standard practice of using [[standard terms, with distinguishing adjectives]].
E.g. I say any division of an array is a bank, and the process of distributing addresses across the banks is an interleaving pattern.
Other people invent different terms:
e.g. banks within a DDR DRAM;
DRAM chips mounted on DIMMS, operated as separate ranks;
memory may be interleaved across different DRAM channels.
= Bank Conflicts and Other Port Conflicts =
Whenever there are not enough ports to meet all needs, the possibility of conflicting accesses arises.
More terminology: I sometimes call banking pseudo-multiporting.
With power-of-2 bitfield extraction used to create the bank selection and indexing functions, there often arise "resonances" on common power-of-two boundaries, such as 64KiB, 1MiB, etc.
Simple hashing, XORing a few upper bits into the indexing and selection functions, sometimes helps.
Non-power-of-two indexing can also help.
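As a sketch of what such an indexing function might look like, here is a power-of-two bank selector with a simple XOR hash folded in. The granule size, bank count, and the choice of which upper bits to XOR in are illustrative assumptions, not taken from any particular design.

#include <stdint.h>

#define LINE_BYTES 64u     /* bank interleave granule */
#define NUM_BANKS   8u     /* must be a power of two for this sketch */

static unsigned bank_select(uint64_t addr)
{
    uint64_t line  = addr / LINE_BYTES;
    unsigned plain = (unsigned)(line & (NUM_BANKS - 1));          /* low bits pick the bank */
    unsigned hash  = (unsigned)((line >> 16) & (NUM_BANKS - 1));  /* a few upper bits       */
    return plain ^ hash;   /* XOR hashing breaks up 64KiB/1MiB-style resonances */
}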
Mechanisms to handle bank conflicts can lead to great complexity.
The simplest case would be to stall one of the requests.
Simple - unless you are working on a statically scheduled machine without pipeline interlocks.
Simple - but of course immediately stalling is impossible in many high speed designs.
Replay or recycling the request is another way of handling bank conflicts:
a fixed pipeline's worth of requests drains out,
and is routed back to the pipeline or array's inputs to be retried.
This is a good workable idea - but it can cause positive feedback leading to unstable behavior.
See [[Replay Pipelines]].
I don't know of any other alternatives to stall or replay - although there are many flavors of each.
However, the microarchitecture can arrange things so as to mitigate the cost of bank conflicts,
(or other port conflicts):
E.g. P6 had a pseudo-dual-ported, banked, L1 cache array, with 1 read and 1 write port. The read port always took priority, since handling such conflicts on a read is more complex because dependent operations may be scheduled.
Whereas write port conflicts, while affecting bandwidth, do not have dependent operations.
Similarly, what I call a [[2 and a half ported array]] is another nice design point:
true dual ports used for two read ports,
with a write port using pseudo-multiport banking.
Bank conflicts on the dual ported reads are avoided, while on the writes bank conflicts can sometimes be tolerated,
or scheduled around.
= Related =
* [[Bank scheduling]]
* [[Read and Write Combining]]
Saturday, November 13, 2010
Content Based Query Install
Throwing documents into my document archive, I realize one of the forms of semantic exposure of deduplication that I want:
When I say "put(file)", if the file is already in the archive, I want to get told "Hey, you, the file is already in the archive!!!"
Perhaps not under the name I want to put it in.
put(Archive,nameInArchive,file) => warn if already there
put(Archive,nameInArchive,file,force=1) => force install, even if already there
Interactively, I want the default to be "don't install if already there".
Editing History
Messing around with Mercurial to keep my stuff in synch, while working on ulfs.
While I am reasonably happy with my scheme for synchronizing partial checkins and checkouts - not merging, but making candidate for merge - it's not yet ready for prime time. Hence my use of Mercurial.
Can't say continued use of Mercurial, because I have only lately switched to using Mercurial from Git.
Anyway: historically, the main reason for wanting to be able to edit history has been legal. E.g. if a court settlement or a sale meant that you were not supposed to have versions of a file that is determined not to belong to you.
Today, another reason: Mercurial on cygwin kept dying. Looks like it was related to large files. I really wanted to remove those large files from the history, but the "Thou shalt not edit history" attitude got in the way. Used hg strip, but that was too much; I really just wanted to delete a few entries. Not enter a counteracting entry, but delete the original.
Really, I am using DVCS not just for revision control of a single project, but for history tracking of an archive of largely unrelated files.
This should not be that hard. Mercurial docs make noise about names incorporating all content checksums. ... Well, while that might be a good integrity check, it suggests that such content based naming is not so great an idea, if it makes history editing hard.
Realistically, all versions of all objects should have names separate from content hashes.
Scratch that: s/name/key/
All versions of all objects should have (a) content hash names, and (b) non-content hash names.
This applies just as much to the "indexes" that list all file versions in a contour, as to individual files.
The history data structure is the interesting issue: should version-sets point to parent version sets, and, if so, by what name? If V3->V2->V1, what happens if V2 gets deleted? How do we convert V3->...->V1 to V3->V1? Especially if V3->V2 lives outside the current repository?
This suggests we may want to keep version-hashes of objects deleted, if only to act as placeholders, and to prevent them from getting recreated.
Saturday, October 30, 2010
Compacting Patent Plaques
Somewhat bittersweet: am today compacting and throwing out my patent plaques. Circa 60-70 of them - eventually I started asking Intel *NOT* to give me any more plaques. They just take up too much space, are a pain to move. Need the bookshelf space.
At the moment I am unscrewing the metal engraving. I'll keep the metal with the actual patent engraved on it. Maybe one day I will actually mount them. But I am going to throw out, discard, recycle, the wooden boards. I wonder if my daughter's school woodshop would want them.
Maybe I'll use the metal for something. Melt it down.
Like beating swords into ploughshares?
Tuesday, October 19, 2010
Textual notation for indexed summation, etc.
I grew up with mathematical notations such as indexed summation written with a big sigma.
Now, I really believe that our programming systems should support such mathematical systems.
But seeing as most don't, I often wrestle with textual notation.
Here's a pleasant textual notation I came up with today:
first, the usual sum or sigma:
sum for i from 1 to N of A[i]
to create a vector
v[i], i=1,n
::== vector for i from 1 to N of v[i]
combining
vector for j from 1 to N of
sum for i from 1 to N of A[j+i]*B[j+N-i]
other
integral for i from 1 to N of f(i)
union for i from 1 to N of f(i)
intersection for i from 1 to N of f(i)
product for i from 1 to N of f(i)
continued fraction for i from 1 to N of f(i)
::== 1/(1+1/(f(1)+1/(1+1/f(2) + ...
2D
array
for i from 1 to M,
for j from 1 to N
of A[i,j]
3D
array
for i from 1 to N,
for j from 1 to M,
for k from 1 to L
of A[i,j,k]
shaped
array
for i from 1 to N,
for j from i to 2*i,
for k from 1 to j
of A[i,j,k]
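As a sanity check on what the notation denotes, here is the "combining" example above written as plain C loops - a minimal sketch, with 1-based indices as in the notation, and A and B assumed long enough (at least 2N+1 elements) for the subscripts to stay in range.

#include <stddef.h>

/* v[j] = sum for i from 1 to N of A[j+i]*B[j+N-i], for j from 1 to N */
void vector_of_sums(size_t N, const double *A, const double *B, double *v)
{
    for (size_t j = 1; j <= N; j++) {        /* vector for j from 1 to N of ... */
        double sum = 0.0;
        for (size_t i = 1; i <= N; i++)      /* sum for i from 1 to N of ...    */
            sum += A[j + i] * B[j + N - i];
        v[j] = sum;                          /* v needs at least N+1 elements   */
    }
}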
Monday, October 18, 2010
Inner Product Instructions
http://semipublic.comp-arch.net/wiki/index.php?title=Inner_product_instructions&action=edit
The mathematical inner product or dot product is often desired in vector instruction sets. However, it poses some challenges for the instruction set designer.
= Integer Dot Product =
For example, consider an integer inner product. E.g. for 4 inputs
A0*B0 + A1*B1 + A2*B2 + A3*B3
If the arithmetic is 2's complement, signed or unsigned, without overflow, this can be evaluated in any order. Indeed, one need not perform carry propagating adds for the products; we can combine the sum of the products with the sums involved in multiplication, in a redundant, carry-free, form.
However, if we are to detect overflow, or equivalently perform saturation, issues are raised. How are the calculations associated:
(((A0*B0 + A1*B1) + A2*B2) + A3*B3)
or
((A0*B0 + A1*B1) + (A2*B2 + A3*B3))
Should we detect intermediate overflow, or should we ignore overflows that are "corrected" by subsequently adding in numbers of opposite sign? I.e., should we perform calculations as if using extended precision intermediate accumulators or temporary values?
If saturating, e.g. if using leftmost association, when we saturate for overflow should we stay saturated, or should we allow the saturated value to be brought back into range? Etc.
All of these boundary cases are annoying.
Glew recommendation: for small vectors, either don't do overflow detection, or calculate as if in extended precision and detect the overflow at the end.
* Pro: easy to describe and understand
* Con: gives different results compared to doing scalar calculations in straightforward loops (with overflow detection)
Alternatively, define the calculations as if performed in a specific order, typically leftmost or tree.
A hybrid approach is possible: calculate in the fastest possible way, e.g. tree, but then have a slower calculation detecting overflow. CON: extra power.
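To make the overflow alternatives concrete, here is a small Python model - mine, not any particular ISA's definition - of a 4-element signed 16-bit dot product under two of the policies above: saturating at every step with leftmost association, versus computing as if in an extended precision accumulator and saturating only at the end:
LO, HI = -(1 << 15), (1 << 15) - 1   # signed 16-bit range

def sat(x):
    return max(LO, min(HI, x))

def dot_saturate_each_step(A, B):
    # leftmost association, saturating after every product and every accumulation
    acc = 0
    for a, b in zip(A, B):
        acc = sat(acc + sat(a * b))
    return acc

def dot_saturate_at_end(A, B):
    # as if computed in an extended precision accumulator, one saturation at the end
    return sat(sum(a * b for a, b in zip(A, B)))

A = [30000, 30000, -30000, -30000]
B = [2, 2, 2, 2]
print(dot_saturate_each_step(A, B))  # -32768: intermediate saturation never recovers
print(dot_saturate_at_end(A, B))     # 0: the exact sum is in range, no saturation needed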
= Floating Point Dot Product =
Floating point dot product instructions have the same issues of order dependence as described for integer dot product with overflow detection - with the additional issue of rounding. Even in the absence of overflow,
(((A0*B0 + A1*B1) + A2*B2) + A3*B3)
or
((A0*B0 + A1*B1) + (A2*B2 + A3*B3))
may give different answers.
As before, some instruction sets define particular orders of evaluation, while some leave it undefined. Some also define an "infinite intermediate precision" form of inner product, where the sums are formed without rounding, and only rounded (and overflow detected) at the end.
[[Superaccumulators]] are a more general form of such arithmetic.
In the absence of such special support, floating point dot product instructions
with linear, e.g. leftmost, associativity are little faster than a succession of [[FMA]] or [[FMAC]] instructions.
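A tiny Python demonstration (not tied to any instruction set) that, even without overflow, leftmost and tree association can round differently:
import math

A = [1e16, 1.0, 1.0, 1.0]
B = [1.0,  1.0, 1.0, 1.0]
prods = [a * b for a, b in zip(A, B)]

leftmost = ((prods[0] + prods[1]) + prods[2]) + prods[3]
tree     = (prods[0] + prods[1]) + (prods[2] + prods[3])
print(leftmost)          # 1e16: each 1.0 is lost when added directly to 1e16
print(tree)              # 1.0000000000000002e16: the pair 1.0+1.0 survives
print(math.fsum(prods))  # 1.0000000000000004e16: as if summed exactly, rounded once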
= Alternative Inner Product Support =
Alternative instructions that can be used to synthesize dot products and other
possibly order dependent [[vector reductions]]
may be motivated by the issues discussed above.
For example, horizontal or pairwise add instructions, such as Intel SSSE3's PHADDW:
2 packed vector inputs
A0, A1, A2 ...
B0, B1, B2 ...
1 packed vector output
A0+A1, A2+A3 ..., B0+B1, B2+B3, ...
Pairwise add could, of course, be done on a single vector operand, but then the
output would only be half-length.
Combining two vector operands in this manner produces a full-length vector output,
which is convenient.
Wiring-wise, if the A and B vector operands are interleaved
to allow [[vertical add]], this only requires wires sufficient to
route A1 to A0, and vice versa.
Note that horizontal add is "only" a combination of a vector interleave or shuffle operation,
and a vertical add. The interleave operation, however, wants to produce two vector outputs,
or else occupy two instructions (or produce a double width vector operand, or pack two singles into
double precision vector elements (but then what do you do for double precision inputs), or ...)
2 packed vector inputs
A0, A1, A2 ...
B0, B1, B2 ...
2 packed vector outputs
A0, B0, A2, B2 ...
A1, B1, A3, B3 ...
Vertical add
A0+A1, B0+B1, ...
Vertical, element by element, multiplication,
combined with sufficient horizontal adds to perform the sum,
builds a dot product.
Or, a single instruction may combine vertical multiplication and horizontal pairwise addition:
2 packed vector inputs
A0, A1, A2 ...
B0, B1, B2 ...
1 result
A0*B0+A1*B1, A2*B2+A3*B3, ...
E.g. Intel SSSE3's PMADDUBSW.
The putative problem of having only half as many vector elements in the PMADDUBSW output is
handled by doubling the precision,
taking 8 bits as input and producing a packed 16 bit vector as output.
Unfortunately, that technique does not work so well for floating point.
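Here is a toy Python model of the dataflow this section describes - a vertical multiply followed by rounds of pairwise horizontal adds - purely to illustrate the reduction order; it is not the exact semantics of any SSE instruction:
def vertical_mul(A, B):
    # element-by-element multiply: A0*B0, A1*B1, ...
    return [a * b for a, b in zip(A, B)]

def horizontal_pairwise_add(X, Y):
    # PHADDW-style: two packed inputs, one packed output
    # X0+X1, X2+X3, ..., Y0+Y1, Y2+Y3, ...
    return [X[i] + X[i + 1] for i in range(0, len(X), 2)] + \
           [Y[i] + Y[i + 1] for i in range(0, len(Y), 2)]

A = [1, 2, 3, 4, 5, 6, 7, 8]
B = [8, 7, 6, 5, 4, 3, 2, 1]

v = vertical_mul(A, B)
# log2(len) rounds of pairwise adds reduce the products to a single sum;
# feeding the same vector in as both operands keeps the example simple
while len(v) > 1:
    v = horizontal_pairwise_add(v, v)[: len(v) // 2]
print(v[0], sum(a * b for a, b in zip(A, B)))   # both 120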
= Vector Parallel versus Vector Sequential =
Defining dot product with leftmost associativity
for purposes of overflow and rounding,
is well suited to a [[vector sequential]] implementation.
[[Vector parallel]] implementations are better suited to horizontal adds,
and other parallel orders of summation.
Modern instruction sets such as SSE tend to have [[vector parallel]] implementations.
It is possible to mix vector parallel and vector sequential,
although a latency penalty is paid when transitioning from vector sequential to vector parallel.
Thus we begin sliding down [[The Slippery Slope of Vector Instruction Sets]]:
there are so many different possible definitions of the vector instructions,
with different evaluation orders,
suited to different microarchitectures.
It is easy to define the most basic vector instructions.
Completing the coverage of vector instructions is more challenging.
WebDAV for my wiki?
I want to have drawings, etc., for my http://comp-arch.net wiki.
But I have thrashed on how to get them. I've tried too many different drawing tools for wiki - all have weaknesses.
I am now leaning towards WebDAV. Indeed, I am leaning towards rebuilding my wiki around WebDAV.
WebDAV basically is a filesystem implemented across web protocols. If you are lucky, your favorite drawing program knows how to access a file stored on WebDAV. Or, you may have a tool that allows you to mount WebDAV as a filesystem on Windows or Linux or whatever, allowing almost any tool to access it.
The problem with WebDAV is that it IS a filesystem. Just exporting a filesystem across the web is not very interesting: listings of files that you can open in whatever tools you want - PowerPoint, Visio, Inkscape, etc.
But, I think if I write my own WebDAV server, combined with a wiki, it can be better than this. When a file is stored to WebDAV, I can arrange to convert it (if I understand its format) into a format that can be directly viewed by most, if not all, web browsers. E.g. GIF, although I am more hopeful about a vector format like SVG. Plus, I can provide a link to the original format, .PPT or .VIS or whatever.
Not just for drawings.
Wiki text might be edited by a wiki editor, or potentially accessed by WebDAV and edited in your favorite text editor. My favorite would be EMACS. In org mode?
Heck, Word and PowerPoint and PDF and PS could be uploaded - although I would prefer not - and viewed in the browser, if I know how to convert them.
---
I was planning on a WebDAV server with git or hg or bzr as the versioning backend.
All I really want from WebDAV is the ability to click on an edit link and, on your favorite OS, open whatever program is associated with that file's type.
E.g. for a .PPT open PowerPoint or OpenOffice; for a .VIS (or whatever Visio uses) open Visio (if there's any Open Source tool that understands Visio files, I want to hear about it). Ditto Inkscape. .TXT or .WIKI in Emacs or your favorite text editor.
And, similar, on saving the file, have my own hooks to translate into some other format.
I'll probably want to provide some canned, in-a-web-page wiki text editor. In fact, I probably want to discourage text being written except in a standard wiki markup. (Although having more than one wiki markup has some attractions, e.g. because I crosspost between wiki and this blog. At least I want restructured text, since I am in the habit of cross posting wiki pages to comp.arch on Usenet. RST is much more readable than standard mediawiki or twiki markup.)
And maybe even a canned, in a web page, drawing editor.
But, my motivation for WebDAV is that I haven't found a drawing editor that makes me happy. Why reinvent the wheel?
The main problem with existing drawing editors is that their drawings cannot be displayed inline in web pages. But once I realized I can set my own put-time hooks in my own WebDAV server, that seems more doable.
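To convince myself the put-time hook idea is workable, here is a bare-bones Python sketch of an HTTP PUT handler with a conversion hook. It is nowhere near a real WebDAV server (no PROPFIND, no locking, no authentication, no path sanitization), and "some-converter" is a placeholder for whatever tool actually understands the drawing format:
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

ROOT = Path("webdav_root")

def put_hook(path: Path):
    # put-time hook: when a drawing in a known format arrives, also produce
    # a browser-viewable SVG next to it.  "some-converter" is a hypothetical
    # placeholder for the real conversion tool (Inkscape, unoconv, ...).
    if path.suffix.lower() in {".ppt", ".vsd", ".odg"}:
        subprocess.run(["some-converter", str(path),
                        "-o", str(path.with_suffix(".svg"))], check=False)

class PutHandler(BaseHTTPRequestHandler):
    def do_PUT(self):
        length = int(self.headers.get("Content-Length", 0))
        target = ROOT / self.path.lstrip("/")      # no path sanitization: toy only
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(self.rfile.read(length))
        put_hook(target)                           # convert to a web-viewable format
        self.send_response(201)                    # Created
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), PutHandler).serve_forever()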
Saturday, October 16, 2010
Why Vectors Increase Frequency of Misalignment
= RISC handling of misaligned memory operations =
TBD: make this into a topic page of its own.
It was a truism during the so-called [[RISC revolution]] that support for misaligned memory operations was unnecessary.
In particular, the rather expensive [[memory alignment mux]] could be greatly reduced in cost,
and the complexity involved in handling memory operands that cross cacheline boundaries or page boundaries could be reduced.
Compilers could either arrange to use [[naturally aligned]] data,
or they could arrange to use instruction sequences such as that below for data that might not be aligned:
loading possibly misaligned data, address in rA
rAlo := rA & ~(size-1)
rAhi := rAlo + size
rDlo := load( rAlo )
rDhi := load( rAhi )
rAlobits := rA & (size-1)
rAlobits := size - rAlobits
rAlobitsMask := (1 << (8*rAlobits)) - 1
rAhibits := size - rAlobits
rAhibitsMask := ((1 << (8*rAhibits)) - 1) << (8*rAlobits)
rDlo := rDlo & rAlobitsMask
rDhi := rDhi & rAhibitsMask
rD := rDlo | rDhi
You should get the idea (even in the likely case that there are errors above).
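Before going further, here is the same idea as a runnable little-endian Python model (my own toy version, using shifts to line the bytes up rather than the masks above):
def aligned_load(mem, addr, size):
    # load `size` bytes from an address that is a multiple of `size`
    assert addr % size == 0
    return int.from_bytes(mem[addr:addr + size], "little")

def misaligned_load(mem, rA, size):
    rAlo = rA & ~(size - 1)
    rAhi = rAlo + size
    lo = aligned_load(mem, rAlo, size)
    hi = aligned_load(mem, rAhi, size)
    off = rA & (size - 1)                  # byte offset within the aligned block
    val = (lo >> (8 * off)) | (hi << (8 * (size - off)))
    return val & ((1 << (8 * size)) - 1)   # keep only `size` bytes

mem = bytes(range(32))
assert misaligned_load(mem, 5, 4) == int.from_bytes(mem[5:9], "little")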
Frequently the compiler can simplify this. It nearly always knows the size, which is nearly always a power of 2.
It often knows the alignment, e.g. of a field in a struct whose alignment is known.
But there will arise cases where these things are not known.
Since the general case of compiled code to perform such boundary crossing support is rather complex,
RISC machines often combine many of these operations into a pair of instructions
rDlo := load_lo.size( rA )
rDhi := load_hi.size( rA )
rD := rDlo | rDhi
The final OR instruction to merge the low and high parts may be eliminated, e.g. by [[writing into instead of onto]] the destination register,
or by having an implicit merge.
rD := load_lo.size( rA )
rD := load_hi.size( rA )
TBD: describe the several ways of doing this compaction.
Note that machines with support for misaligned memory operations often use similar approaches,
except that they may first attempt an aligned load, for example, and then fall back to such a misaligned sequence.
(TBD: describe forms of misaligned support.)
= Why Vectors Increase Frequency of Misalignment =
Unfortunately, the advent of small vector or packed instruction sets,
such as Intel MMX, SSE, or PowerPC Altivec,
led to an increase in the frequency of misaligned memory operations.
Briefly: users of vectors often want to load subvectors or slices of other vectors.
E.g. given a big vector V[0...1023] in memory, the user may want to operate 4 elements at a time.
V[0-3] would be aligned,
but V[1-4] would be misaligned.
The problem of motion estimation in digital video is a very good example of this.
== Motion Estimation and Misaligned Memory References ==
Digital video operates on sequences of frames, each frame being an MxN matrix of pixels
P[0,0] ... P[0,N-1]
... ... ...
P[M-1,0] ... P[M-1,N-1]
where (M,N) may be values such as CIF (288,352). Much larger videos are now common.
Digital video is usually compressed.
One common form of compression is to find parts of the image that seem to have moved between frames,
and then transmit, rather than the entire region that has moved, a motion vector,
and some correction values.
Motion compensation is the process of reconstructing the pixels from this information;
motion estimation is the process of finding such motion vectors and correction values.
Motion estimation is much more computationally intensive than motion compensation.
Most motion estimation algorithms involve a search.
They typically start with a block of pixels in Frame1
Frame1
P[R,C] ... P[R,C+3]
... ... ...
P[R+3,C] ... P[R+3,C+3]
and compare it, in succession, to offset blocks in Frame2
Frame2
P[R+dR,C+dC] ... P[R+dR,C+3+dC]
... ... ...
P[R+3+dR,C+dC] ... P[R+3+dR,C+3+dC]
Typically the comparisons are done a row at a time, and then summed over all rows in the blocks:
tmp1 := load( Frame1.P[R,C..C+3] )
tmp2 := load( Frame2.P[R+dR,C+dC..C+3+dC] )
with the rows compared, e.g. by a sum of absolute differences, and the per-row results accumulated.
The exact search patterns vary by algorithm, but the offset values (dR,dC) nearly always include distances such as 0, 1, 2... etc pixels.
In fact, they often include sub-pixel granularity, such as 1/2 and 1/4 pixels.
It can therefore be seen that motion estimation naturally involves vectors that are not [[naturally aligned]].
Although the pixels may be naturally aligned, a group of 4 pixels P[1..4] is not.
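A tiny Python sketch of one row of that search (purely illustrative, 4-pixel-wide block): the candidate slices at dC = 0, 1, 2, ... are exactly the misaligned accesses in question:
def sad_row(frame1_row, frame2_row, C, dC, width=4):
    # sum of absolute differences between a block row and a shifted candidate row
    block     = frame1_row[C:C + width]             # aligned if C is a multiple of 4
    candidate = frame2_row[C + dC:C + dC + width]   # misaligned for most dC
    return sum(abs(a - b) for a, b in zip(block, candidate))

frame1_row = [10, 20, 30, 40, 50, 60, 70, 80]
frame2_row = [0, 10, 20, 30, 40, 50, 60, 70]        # frame1 shifted right by one pixel
# search over candidate offsets; the best (lowest SAD) match is at dC = 1
scores = {dC: sad_row(frame1_row, frame2_row, 0, dC) for dC in range(0, 4)}
print(min(scores, key=scores.get), scores)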
There have been various attempts to mitigate the alignment issues.
For example, for motion estimation some forms of the [[MPSAD]] instruction may load double width slices, and extract all possible subslices.
These are a special form of [[convolution instructions]] or [[moving window instructions]].
However, this is like squeezing a toothpaste tube: when one source of vector misalignment has been cured, another crops up somewhere else.
Sunday, October 10, 2010
One of those days...
It's been one of those days (weekends) at Mudshoe Manor.
Yesterday my daughter's tablet PC, an HP tx1000 (more precisely, a tx1420us) refused to power on. So I set her up with an account on my tablet PC to do her weekend's homework. In the meantime, I unplugged and removed the battery on her tablet.
By the end of the day, her machine was working. So we transferred her homework to her machine. This morning, she made some revisions, fixing the glaring speech recognition errors. (Yes, Sophie has been using Vista's speech recognition.)
And when we broke for lunch, of course, her tablet PC broke again. This time hard, not fixed by removing power. The blue lightning icon flashes briefly, and then nothing.
So she moved back to my tablet PC to do the rest of her homework. Yes, she had to redo her revisions of the morning.
While she is using my tablet PC, I google hers. Ouch. Apparently the HP tx1000 series has well-known problems with overheating caused by its AMD CPU and, more notoriously, bad solder or wire bonds or something similar on its Nvidia graphics chip. Unfortunately, the simple fix, pressing down the JKL keys while booting, doesn't work.
Youtube is full of videos on how to repair this, usually amounting to opening the case, removing the motherboard, heating up the Nvidia GPU chip without heating anything else, pressing down, and then attaching a penny or a quarter to provide better contact to the heatsink - the heatsink apparently not having a bar dedicated to the GPU.
I've been reading about the Nvidia class action suits. Apparently now I am involved - maybe. Although the problem is notorious on the Internet, apparently HP has not officially acknowledged it.
So, this is where I'm at: do I open up the case to do the DIY fix, or do I wait on the phone to see if HP will deliver a fix pursuant to the Nvidia class action? I'll probably do the latter, and if that fails, the former.
In the meantime, my daughter is PC-less. Or, rather, I am PC-less while she uses my PC.
I'm writing this on our old, and hitherto almost useless, 2go PC - one of the original classmate PCs, early netbooks, from Intel's employee purchase program. Plugged it in and booted it for the first time in almost a year. After installing umpteen upgrades and security patches, I installed the Diamond BVU195 USB display adapter drivers, and Chrome - and now this formerly useless and slow PC is reborn as a web browser appliance. Seems fast enough for blogging.
It's quite amazing to see the 2goPC, with its 1024x600 display, drive 3 1920x1200 monitors.
I love USB display adapters!!!! I hope the Linux drivers are available.
It's like dominoes: one computer has a problem, and I have to fix another, and then another. My daughter's machine doesn't have much on it - we mainly are cloud based - but just in case, I am going to pull the drive. Unfortunately, I don't have another drive with enough space to transfer to. My own tablet PC is full. So, this finally forced me to find the disk drive upgrade for my tablet PC that I had been sitting on, and start transferring to that.
Off to find open source drive imaging software.
I hate computers.
Friday, October 08, 2010
MS License Lock Box - yeah, right
Last year I gave in and purchased a copy of Microsoft Office. "Gave in" because I mainly use Open Office. Gave in, because (a) there was a cheap deal on Office 2007 Home and Student - 80$ for license to install on three machines, (b) I had just realized that my Vista tablet had functional speech recognition, no need to buy expensive Dragon, and my hands were in spasm (much as I prefer open source, I am not aware of any free software that does good speech recognition, or handwriting for that matter), and (c) I have grown to depend on Microsoft OneNote. (This last really minor, just a nice to have.)
80$ to install on three machines. Sounds good, eh?
Also, at the same time I purchased the optional "License Lock Box", which supposedly gave me the right to re-download the software for 3 years.
I installed the software immediately on my tablet. Did not get around to installing it on my wife and daughter's computer(s) for a few months. Was annoyed to find, when I did try, that the original download rights lasted only 60 days. I thought the License Lock Box was supposed to give me more time than that, but could not figure it out.
Wrote it off as yet more money poured down the drain into Microsoft.
Today, decided to try to set my daughter up. Her mother and I are Luddites: my daughter can't type. But her new school requires assignments on computer. Typing lessons are icumen in, but in the meantime, speech recognition. Which works better on Vista, using Microsoft Office.
Since I had long given up hope of using my Office 2007 license, I started purchasing Office 2010. Filled out all the forms, pressed the button "after which your credit card will be charged" - and was taken to a page that said "No such page found". On Microsoft's store website, using Microsoft Internet Explorer.
Sigh. Called Microsoft Store help. It appears that my transaction did not go through. But while I was on the line, asked about my Office 2007 license, and the "Licensing Lock Box".
The very helpful guy on the phone explained that, at that time, software sold from Microsoft's website was sold through a reseller, Digital River. And that the License Lock Box was their product, not Microsoft's. Aha!
Found that I could log in - with one of the umpteen Microsoft LiveId accounts.
Curiously, with Chrome and Firefox I can see my purchase history. But I can't see the "Download Now" button that I am supposed to use to re-download.
But with Internet Explorer, I get told they have no history of my purchase. Again, it is curious that such a Microsoft-related site works worse with Internet Explorer than with non-MS browsers. But then again, it doesn't work well with either.
So, annoyingly, it looks as if I *MIGHT* be able to exercise my license rights to 3 installs of Office 2007. But because of website brokenness, not before Monday. And Sophie has homework to do this weekend.
God! This is one of the big advantages of Open Source: if you have a problem, you can just re-download at any time. On any computer. No licensing hoops to jump through using information that can only be finagled out of the software reseller's broken website on weekdays.
---
I thought that buying this "License Lock Box" was a Microsoft approximation to always being able to download the latest software from the cloud. Maybe so - but it's a bad approximation. Apparently not even from Microsoft.
Although the helpful guy at the Microsoft store said that they allow infinite re-downloads.
Thursday, October 07, 2010
Accumulator ISA vs Uarch
wiki at: http://semipublic.comp-arch.net/wiki/Accumulator_ISA_vs_Uarch
older article on accumulators:
http://semipublic.comp-arch.net/wiki/Accumulators
I'm at home sick today, so I've got to be careful: last time I posted on comp.arch when I was home sick, I set the agenda for 4+ years of work.
First: 22+ gates may well fit in a pipestage. People are headed past 28 FO4, in an effort to tolerate device variation (the more gates / pipestage, the higher your yield).
Of course, usually when you do this we end up trying to cram several pipestages' worth of work into one. As, you might say, redundant bypassing does.
---
As for accumulator architectures and microarchitectures, love/hate.
First, we must distinguish instruction set architecture from microarchitecture.
= Accumulator ISA =
Accumulator ISAs save bits. May lead to very small code. When we have the ability to reorganize instructions over long distances, accumulator ISAs don't need to hurt performance.
By the way, some might say that "2 operand" ISAs are accumulator based:
rD += rS
Myself, although that is more accumulator-like than three operand (I hate these terms, but must use them)
rD := rS1 + rS2
I tend to be a stickler, and say that either there is a single accumulator
acc += rS
or maybe a small number of accumulators, acc0-3 versus r0-31
acc0-3 += rS0-31
== Accumulator versus Stack ==
By the way, for similar instruction size issues, I sometimes look fondly on stack ISAs. Even fewer bits than accumulators.
= Accumulator Microarchitecture =
Now, let's transition into microarchitecture. I call it an accumulator microarchitecture when the accumulator (or accumulators) are "close to" one or more ALUs.
I.e. you may have a small number, say 4, of accumulators near to every individual ALU. E.g. you might have 3 ALUs, with 4 accumulators each.
Or you may have a small number of accumulators shared between a group of ALUs.
Physically, the term accumulator may be associated with a different circuit design: e.g. the accumulators might be flops, whereas the registers might be in an SRAM array.
I don't call it an accumulator microarchitecture when you have, say, 2 clusters of 4 ALUs, each tightly bound to 16 registers. I call that a partitioned register file. But I might be willing to call it an accumulator if there were only 4 such tightly bound registers next to the ALUs. Obviously, this is a matter of degree.
I might call it an accumulator architecture if there were 2 such ALU clusters, each with 4 accumulator registers, and a big shared pool of 64 registers equally accessible to all. But I would require that there
be instructions of the form
cluster.accum_i += global_reg_j
If the global registers could not be used as operands of at least the usual ADD instruction, but were restricted to moving to and from the accumulators, I would call it a two level register file, not an accumulator microarchitecture.
== OOO implementation of Accumulator ISA ==
Accumulator ISAs implemented on accumulator microarchitectures are natural for in-order machines.
On an out-of-order machine, it is trivial to rename the accumulators of the ISA to a single PRF. In which case, the hypothetical advantages of the accumulator ISA are lost.
== OOO implementation of Accumulator Microarchitecture ==
Conversely, it is straightforward to implement a "bypass cache" for a flat register file instruction set, taking advantage of the fact that a value need not be written into the main register file before it is reused. A bypass cache is a generalization of generic bypassing: a bypass cache adds state, and is not just restricted to bypassing from operands in flight.
The bypass cache state is very much like an accumulator microarchitecture. With at least one important difference:
If the bypass cache is CAM indexed by the physical register number in the main RF, then that is an important difference. A second important difference is whether the bypass cache can miss - whether a value you expect to receive from the bypasses or the bypass cache may not be there when you expect it to be, so that you have to fetch it from the main RF. I sometimes call this a priori vs a posteriori caching.
If, however, there is separate renaming for accumulator registers and main registers, then you more exactly have an OOO implementation of, say, a flat RF ISA on an accumulator uarch. But you have all sorts of questions about policy: when do you make a copy of a main RF register in an accumulator, and vice versa.
(Hey, here's an idea: in the scheduler, CAM on main RF regnums. But also broadcast when a main RF regnum has been moved to an accumulator. Issue the instruction with non-CAM indexed accum regnums... How handle replacement? Probably LRU, modified by overwriting - so many values are only read once.)
== OOO implementation of Accumulator ISA on Accumulator Uarch ==
What about combining the two? Well, obviously, you could rename from ISA accum into flat RF, and then into uarch accums. But that rather defeats the purpose.
We might rename accums and main regs separately. The ISA accums guide replacement in the physical accums.
Issue: what if ISA accums and physical accums are not the same in number?
Here's an observation: one of the biggest advantages of accumulator ISAs is that they tend to reuse the accumulators quickly. I.e. they tell us when values are dead, and hence can be more quickly recycled. You might get the same effect by a "kill register" bit.
= Summary =
Accumulators, whether ISA or uarch, are part of our tool kit.
I do not think that there is a uniform way of saying one is better than another. Certainly, at the moment flat RFs preponderate.
Much of the benefit of accumulators can be obtained by bypass caching.
I would hesitate to add true accumulators to a new ISA unless (a) codesize really matters, or (b) it is an embedded system, with the ability to change uarch and ISA every generation.
(But then, that's what they always say.)
On the other hand, I am scared of implementing accumulators, whether in ISA or uarch.
== Accumulators for Extended Precision ==
I forgot to mention one of the biggest advantages of accumulators: extended intermediate precision.
Many DSPs have, for example, 24 bit accumulators to add chains of 16 bit numbers.
One might consider FMA or FMAC, floating point multiply add without intermediate rounding, to be a hidden internal accumulator. Except that it isn't - it flows. If you exposed that intermediate value...
I keep waiting for a nice formulation of exact, infinite precision, vector sum reductions, dot products, etc. Trivially, that can be done via a [[superaccumulator]] - but although I like superaccumulators, they are pretty expensive. What, 1K or 2K bits?
An ISA-level accumulator architecture of the form
superacc := 0
for all i superacc += a[i]
is nice in that you could imagine mapping it to the infinite internal precision versions of 4-way dot product, etc., that one sees in some instruction sets. (E.g. Microunity)
I'd like to see lower cost but still infinite precision ways of doing such FP vector reductions.
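In software the exact-then-round-once behaviour is easy to model, which at least gives a reference to test cheaper schemes against. A Python sketch (a reference model of the semantics, not an ISA proposal): math.fsum already returns the correctly rounded sum, and fractions.Fraction can stand in for a superaccumulator in a dot product:
import math
from fractions import Fraction

a = [1e16, 1.0, -1e16, 1.0]
b = [1.0,  1.0,  1.0,  1.0]

naive = 0.0
for x, y in zip(a, b):
    naive += x * y                      # rounds after every accumulation

# superaccumulator-style: exact products, exact sum, a single rounding at the end
exact = float(sum(Fraction(x) * Fraction(y) for x, y in zip(a, b)))

print(naive, exact)                                  # 1.0 vs 2.0
print(math.fsum(x * y for x, y in zip(a, b)))        # 2.0: correctly rounded sum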
Tuesday, September 21, 2010
Vector_Lanes_versus_Packed_Operations
= Packed =
For want of a better term, I will use the phrase [[Packed Operations]] to refer to instruction set extensions such as Intel MMX, SSE, PowerPC Altivec, etc.:
typically operating on 64 or 128 bit data (although people did propose [[32 bit MMX]], and 256 bit AVX is coming soon in 2010).
Typically operating on
* 8x8b or 16x8b
* 4x16b or 8x16b
* 2x32b or 4x32b
etc.
Typically operating SIMD-style inside ALU operations,
but typically lacking scatter/gather memory operations.
= Packed vs Vector Lanes =
I want to contrast this with microarchitectures that have parallel vector lanes.
(NOT [[vector sequential]] microarchitectures.)
Packed Operations are fundamentally scalar pipelines. They just happen to have wide scalar datapaths, e.g. 64bits.
Since there are typically no 128 bit scalar datapaths, they have evolved beyond scalar;
but typically the datapath is designed as if it is a wide ALU datapath.
Parallel vector lane microarchitectures, on the other hand, are typically formed out of multiple independent ALUs.
Packed ALUs are typically a single ALU, that is subdivided to perform independent packed operations.
E.g. one might take a 64-bit wide adder, and add special buffer bits at every byte boundary.
By controlling the values of these buffer bits, we can control whether carry will propagate across 8b, 16b, or 32b boundaries.
I.e. a 64 bit wide ALU with such byte boundary buffer bits
is really a 72 bit wide ALU,
capable of performing 8x8b, 4x16b, and 2x32b, as well as 64 bit addition.
(For that matter, odd sizes such as 24bit arithmetic can also be achieved.)
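A small Python model of the buffer-bit trick (my own illustration of the idea, not any particular datapath): insert a zero guard bit after each byte of the operands, do one wide add, and the carries die in the guard positions, so nothing propagates across a byte boundary. Allowing the carries to pass through the guard bits instead would give the full 64-bit add.
def insert_guard_bits(x, n_bytes):
    # spread the bytes of x so that each byte gets a 9th, zero, guard bit above it
    out = 0
    for i in range(n_bytes):
        byte = (x >> (8 * i)) & 0xFF
        out |= byte << (9 * i)
    return out

def packed_add_8x8(x, y):
    # 8x8b packed add on a 72-bit-wide adder: carries die in the guard bits
    wide = insert_guard_bits(x, 8) + insert_guard_bits(y, 8)
    out = 0
    for i in range(8):
        out |= ((wide >> (9 * i)) & 0xFF) << (8 * i)
    return out

x = 0xFF_01_80_7F_10_20_30_40
y = 0x01_FF_80_80_01_02_03_04
print(hex(packed_add_8x8(x, y)))   # 0xff11223344: each byte wraps independently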
Similarly for 128 bit packed datapaths. Although nobody probably wants to build a 128 bit wide carry propagate adder.
Whereas in a vector lane microarchitecture, the ALUs are independent.
E.g., in a 512 bit wide vector machine with 16 32b lanes,
there are 16 32b wide adders.
In such a machine crossing vector lanes is difficult.
Of course there is a spectrum of possibilities.
For example, pairs of even and odd vector lanes may be combined to perform double precision arithmetic.
Or, one may have vector lanes, where the operations within the lanes are 128b packed operations.
(128b packed is particularly pleasant as a lane width, since it supports the common 4x32b vectors, as well as 2x64b complex.)
Even on a packed datapath, certain operations that move data between bits are difficult.
E.g. shifts require special handling of buffer bits.
It is more difficult to partition a 64x64bit integer multiplier into 32b*32b multipliers,
than it is to simply add buffer bits for add - nevertheless, it can be done.
(Note that a 64b*64b integer multiplier can be partitioned into 4 32b*32b multipliers - roughly what is required for complex multiplication or an FFT 2-butterfly.)
Floating point is even more difficult to partition. The multiplier array for an FMA can be shared between 32b and 64b.
The adder logic is more challenging.
Nevertheless: considerable sharing is possible on a packed datapath.
But at some point one graduates to separate ALUs for vector lanes.
In "separate ALUs", the operative term is "separated". Physically separated ALUs are part of a vector lane microarchitecture.
= Instruction Set Design Consequences =
In a packed architecture, one tends to have N bits, subdivided into more or fewer smaller pieces.
In vector architectures, one tends to have vector elements of a given width, say VL * EW, where EW is the element width.
One might then have narrower operations that apply to vector elements,
e.g. a vector of VL 32b floats or a vector of VL 64b floats.
(Although one might have packed operations on the lane width vector elements.)
There may be operations that load a vector of 8, 16, or 32 bit elements,
packed in memory, into unpacked 64 bit elements within a vector register.
Whereas on a packed instruction set, one probably just loads the values without unpacking them.
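As a hedged illustration of that difference (the sizes and names here are mine, not any particular ISA): a vector-style load-and-unpack gives each 8-bit element in memory its own wide vector element, while a packed-style load just copies the bytes, still packed, into one register.

# Hypothetical illustration: 8 bytes in memory.
mem = bytes([1, 2, 3, 4, 5, 6, 7, 8])

# Vector-register style: load-and-unpack, one wide vector element per 8-bit source element.
vector_reg = [int(b) for b in mem]           # 8 elements, each conceptually 64 bits wide

# Packed style: the same 8 bytes occupy one 64-bit register, still packed.
packed_reg = int.from_bytes(mem, "little")   # a single 64-bit value, 8x8b packed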
Consider, e.g. 16b wide floating point.
Many machines expand it into 32b floating point in registers.
But one can imagine ALUs that operate on 16b FP in registers.
I.e. packed operations are about using all of the bits of the register datapath.
Whereas vector lane operations are about vector operations on a limited number of datatypes.
Wifi disappears from HP TX 1420 tablet PC
My wife and daughter's HP TX 1420 Tablet PC (originally my wife's, then my daughter's) suddenly behaves as if it no longer has built-in wifi.
"Suddenly" may not be true - it has probably been attached to wired ethernet for a few months. I don't know when wifi disappeared. But it has gone now.
As the link shows, this is apparently a well known problem with HP machines.
"Suddenly" may not be true - it has probably been attached to wired ethernet for a few months. I don't know when wifi disappeared. But it has gone now.
As the link shows, this is apparently a well known problem with HP machines.
Monday, September 20, 2010
Tuesday, September 07, 2010
Varieties of Cache - CompArch
Varieties of Cache - CompArch: "Varieties of Cache"
The basic principle of caching is to use a small, fast structure (typically an [[Array]])
to improve the performance of a large, slow structure,
hopefully approximating a large, fast structure without the expense.
However, the definition of "fast" may vary. And there may be other reasons to use a cache.
The most familiar caches are [[latency caches]].
E.g. an L1 data cache that may take 4 cycles to access, an L2 20 cycles, and main memory 200 cycles.
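As a back-of-the-envelope sketch, only the 4/20/200 cycle latencies come from the text; the hit rates below are assumed purely for illustration. The average access time of such a latency-cache hierarchy is just the hit-rate-weighted sum:

# Assumed hit rates for illustration; the 4/20/200 cycle latencies are from the text above.
l1_hit, l2_hit = 0.90, 0.95        # L2 hit rate is measured on L1 misses
l1_lat, l2_lat, mem_lat = 4, 20, 200

avg = l1_hit * l1_lat + (1 - l1_hit) * (l2_hit * l2_lat + (1 - l2_hit) * mem_lat)
print(avg)                         # ~6.5 cycles on average, despite the 200 cycle memory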
Another useful cache is a [[bandwidth cache]], where the cache provides more bandwidth than the structure it is a cache for.
This is quite common - often latency and bandwidth caches are combined.
However, the use of caches to improve bandwidth sometimes leaves customers dispirited,
such as those on the [[High bandwidth Computing]] mailing list.
Another useful cache has many array read and write ports on the small cache, and fewer on the large structure. E.g. [[array port reduction]].
This is really a particular form of [[bandwidth cache]].
A cache is a fairly good [[read combining]] or [[write combining]] mechanism.
A [[power cache]] attempts to reduce power by keeping frequently accessed data items in the small structure.
This may be done even without an improvement in latency or bandwidth.
Finally(?), a reliability cache may keep certain items expected to have reliability issues in the cache.
Particularly if reliability is related to frequency of access.
TBD: are there any other varieties of cache?
Array Port Reduction - CompArch
Array Port Reduction - CompArch: "Array Port Reduction"
AFAIK "RF Port Reduction" was invented by Dave Papworth on the Intel P6 circa 1991. At that time called "ROB Read Port Reduction". Based on observation that most uops' inputs were captured by the RS, while the ROB read ports were actually
See
* [[Register File Port Reduction]]
* [[Register Renamer Port Reduction]]
After the above, I note some generic trends:
Say that you have N operations, each with L inputs and K outputs. How do you avoid having to access
the array containing such inputs and outputs with N*L read ports and N*K write ports?
I observe that these techniques apply, not just to the RF and RR arrays mentioned above, bit also to other arrays:
* Generic Cache Arrays
** e.g. banking as a form of [[pseudo dual porting]]
* RF arrays
* [[Store Buffers]]
* [[Branch Predictor Arrays]], in particular [[BTB Arrays]]
* Nearly any large [[Queue or Buffer]], such as those in store and forward networks such as the memory hierarchy.
** See [[The Memory Hierarchy is a Store and Forward Network]]
Similar principles of array port reduction apply to all of these arrays.
Simplest is when the arrays are indexed solely and uniquely, e.g. by a memory address or register number.
One might say that these are the primary key.
More challenging is when the array has such a *spatial* index key,
but also has a *temporal* aspect.
E.g. "find the youngest store to the same address older than the present store".
Spatial/temporal may well be a useful distinction, although two dimensional X/Y may also apply.
= Bypassing =
The first principle is as mentioned in the examples of RF port reduction and renamer port reduction:
that group of N operations may not really need to do N*L reads and N*K writes.
Since we are often talking about instructions, some instructions may produce values that later instructions consume. I call this generically "bypassing": it may take the form of conventional execution unit bypassing, or dependency checks inside a block of instructions at the renamer. The P6 [[Capture RS]] that led to Papworth's original ROB Read Port Reduction is a form of [[bypass cache]] in some sense.
= Banking =
Classic spatial banking: banks in a pseudo-multiported cache. Banks in a large register file, such as a GPU.
Banking often seeks to do the opposite of [[spatial locality]] - you try to arrange the banking function, usually a bitfield extraction, occasionally a hash - so that locations likely to occur together are likely to occur in different banks, not the same.
Temporal banking - or otherwise exploiting temporal proximity. E.g. as described in [[Register Renamer Port Reduction]], where uops' inputs are compared to neighbouring older uops first, and then to more distant uops only if the recent predecessor comparison fails.
If these comparisons are done in batches with hard boundaries, it is a form of temporal banking.
If these comparisons are done smoothly - all neighbous compared, etc. - then it is a different way of exploiting temporal proximity, not banking.
Intermediate between these is to divide the temporally sorted array into banks,
and to always compare an element to two or three banks:
* 3 banks: the bank the element is in, the preceding and succeeding bank.
* 2 banks:
** the bank the unit is in
** the preceding bank if the element is close to the beginning of the bank
** the succeeding bank if the element is close to the end of the bank
** possibly no other bank, if the element is in the middle of its own bank.
= Read and Write Combining =
TBD: combining circuits:
;(1) CAMs: compare all operations to all operations
;(2) Choose N oldest (or highest priority) unique. Then CAM for combining.
;(3) Spatial hash to queues.
Within the queues
* CAM
* or just compare nearest neighbours
;(4) Sorting networking
;Sort by index.
Choose N lowest.
ISSUE: loss of age or priority info
;Sort by index, spatially
;Also sort by priority, e.g. time.
(Maybe different sorts, time and priority)
Spatial sort is of tuples (index,seqnum).
When matching indexes (index,seqnum1), (index,seqnum2) link seqnum2 to seqnum1,
and only keep (index,seqnum1) in the sorting datastructure.
Equivalent, keep redundant values in the temporal sort.
Compute dominators in the temporal sort. Extract the N highest priority dominatos in the temporal sort.
Sounds expensive. Less expensive time-wise if thinking about special sorting hardware.
Bluesky.
= Caches =
Small highly ported array,
combined with a larger, less ported array.
May exploit temporal and/or spatial locality.
E.g. may have 1r/1w port on an array of large cache line of 64 bytes,
but have more, narrower, ports within a smaller array.
May use such a cache not to improve latency, but solely to manage ports.
Which are a special form of bandwidth cache. (See [[Varieties of Cache]].)
= TBD =
Q: what other forms of array port reduction have I encountered?
AFAIK "RF Port Reduction" was invented by Dave Papworth on the Intel P6 circa 1991. At that time called "ROB Read Port Reduction". Based on observation that most uops' inputs were captured by the RS, while the ROB read ports were actually
See
* [[Register File Port Reduction]]
* [[Register Renamer Port Reduction]]
After the above, I note some generic trends:
Say that you have N operations, each with L inputs and K outputs. How do you avoid having to access
the array containing such inputs and outputs with N*L read ports and N*K write ports?
I observe that these techniques apply, not just to the RF and RR arrays mentioned above, but also to other arrays:
* Generic Cache Arrays
** e.g. banking as a form of [[pseudo dual porting]]
* RF arrays
* [[Store Buffers]]
* [[Branch Predictor Arrays]], in particular [[BTB Arrays]]
* Nearly any large [[Queue or Buffer]], such as those in store and forward networks such as the memory hierarchy.
** See [[The Memory Hierarchy is a Store and Forward Network]]
Similar principles of array port reduction apply to all of these arrays.
Simplest is when the arrays are indexed solely and uniquely, e.g. by a memory address or register number.
One might say that these are the primary key.
More challenging is when the array has such a *spatial* index key,
but also has a *temporal* aspect.
E.g. "find the youngest store to the same address older than the present store".
Spatial/temporal may well be a useful distinction, although two dimensional X/Y may also apply.
= Bypassing =
The first principle is as mentioned in the examples of RF port reduction and renamer port reduction:
that a group of N operations may not really need to do N*L reads and N*K writes.
Since we are often talking about instructions, some instructions may produce values that later instructions consume. I call this generically "bypassing": it may take the form of conventional execution unit bypassing, or dependency checks inside a block of instructions at the renamer. The P6 [[Capture RS]] that led to Papworth's original ROB Read Port Reduction is a form of [[bypass cache]] in some sense.
= Banking =
Classic spatial banking: banks in a pseudo-multiported cache. Banks in a large register file, such as a GPU.
Banking often seeks to do the opposite of [[spatial locality]] - you try to arrange the banking function, usually a bitfield extraction, occasionally a hash - so that locations likely to occur together are likely to occur in different banks, not the same.
Temporal banking - or otherwise exploiting temporal proximity. E.g. as described in [[Register Renamer Port Reduction]], where uops' inputs are compared to neighbouring older uops first, and then to more distant uops only if the recent predecessor comparison fails.
If these comparisons are done in batches with hard boundaries, it is a form of temporal banking.
If these comparisons are done smoothly - all neighbours compared, etc. - then it is a different way of exploiting temporal proximity, not banking.
Intermediate between these is to divide the temporally sorted array into banks,
and to always compare an element to two or three banks:
* 3 banks: the bank the element is in, the preceding and succeeding bank.
* 2 banks:
** the bank the element is in
** the preceding bank if the element is close to the beginning of the bank
** the succeeding bank if the element is close to the end of the bank
** possibly no other bank, if the element is in the middle of its own bank.
E.g. based on the two most significant bits of the sequence number
00bb...b => compare to this bank plus preceding
11bb...b => compare to this bank plus succeeding
01bb...b OR 10bb...b => don't compare
or 3 bits ...
[[AG Novelty?]] - I've never seen this proposed before. It seems so obvious...
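A minimal software model of the 2-bank variant (my own coding of the rule above, with an assumed power-of-two bank size): the two most significant bits of the position within a bank decide whether the preceding bank, the succeeding bank, or no neighbour also has to be compared.

# Sketch (assumed encoding): the seqnum's low bits give the offset within a bank,
# and the top two bits of that offset choose which neighbouring bank, if any,
# also needs to be compared against.
BANK_BITS = 6                      # assumed: 64-entry banks
BANK_SIZE = 1 << BANK_BITS

def banks_to_compare(seqnum):
    bank = seqnum >> BANK_BITS
    offset_msbs = (seqnum & (BANK_SIZE - 1)) >> (BANK_BITS - 2)
    if offset_msbs == 0b00:        # near the start of the bank
        return [bank, bank - 1]    # this bank plus the preceding one
    if offset_msbs == 0b11:        # near the end of the bank
        return [bank, bank + 1]    # this bank plus the succeeding one
    return [bank]                  # middle of the bank: this bank only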
= Read and Write Combining =
TBD: combining circuits:
;(1) CAMs: compare all operations to all operations
;(2) Choose N oldest (or highest priority) unique. Then CAM for combining.
;(3) Spatial hash to queues.
Within the queues
* CAM
* or just compare nearest neighbours
;(4) Sorting networks
;Sort by index.
Choose N lowest.
ISSUE: loss of age or priority info
;Sort by index, spatially
;Also sort by priority, e.g. time.
(Maybe different sorts, time and priority)
Spatial sort is of tuples (index,seqnum).
When matching indexes (index,seqnum1), (index,seqnum2) link seqnum2 to seqnum1,
and only keep (index,seqnum1) in the sorting datastructure.
Equivalently, keep redundant values in the temporal sort.
Compute dominators in the temporal sort. Extract the N highest priority dominators in the temporal sort.
Sounds expensive. Less expensive time-wise if thinking about special sorting hardware.
Bluesky.
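As one hedged software model of variant (2) above, choose the N oldest unique indexes, then combine younger requests to the same index onto them; anything left over waits. (The function and its data shapes are my own illustration.)

# Sketch of combining variant (2): requests are (seqnum, index) pairs, oldest first.
# Pick the N oldest distinct indexes; later requests to the same index are linked
# (combined) onto the chosen one instead of consuming a port of their own.
def combine(requests, n_ports):
    chosen = {}                        # index -> seqnum that owns the port this cycle
    combined = []                      # (younger_seqnum, owner_seqnum) links
    deferred = []                      # requests that must wait for another cycle
    for seqnum, index in sorted(requests):
        if index in chosen:
            combined.append((seqnum, chosen[index]))
        elif len(chosen) < n_ports:
            chosen[index] = seqnum
        else:
            deferred.append((seqnum, index))
    return chosen, combined, deferred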
= Caches =
Small highly ported array,
combined with a larger, less ported array.
May exploit temporal and/or spatial locality.
E.g. one may have a single 1r/1w port on an array of large 64 byte cache lines,
but more, narrower, ports on a smaller array.
One may use such a cache not to improve latency, but solely to manage ports.
Such a cache is a special form of bandwidth cache. (See [[Varieties of Cache]].)
= TBD =
Q: what other forms of array port reduction have I encountered?
Register Renamer Port Reduction - CompArch
Register Renamer Port Reduction - CompArch: "Register Renamer Port Reduction"
On comp.arch, circa August 2010, Mitch Alsup said something like "It isn't the ports on the register file that are a problem, it's the ports on the renamer." With, IIRC, a comment that you had to squeeze the renamer into a single pipestage.
Not true. This reminds me of a comment by Tim Olson, then of AMD (and active on comp.arch), when I presented HaRRM, my Hardware Register Renaming Mechanism (which is basically the modern form of renaming) to him. He said that it would lose because it required an extra pipestage.
First, renamer ports: the renamer is a much smaller structure than the PRF, in terms of number of entries times number of bits per entry. This means that the wires involved are shorter, even though the logic depth, Mitch's preferred metric in most of his posts, is similar. Wires are a major contributor to delay.
Second, "port reduction" techniques can be applied to the renamer just as they can be applied to the register file. True, out-of-order execution may allow the execution units to pick up a higher fraction of their inputs on bypasses, whereas the renamer is essentially an in-order pipestage, and in-order means less port reduction. Nevertheless, there's a lot to latch on to.
= Components of Register Renaming =
Register renaming has three main parts:
a) bypassing: in a block of instructions, comparing the output registers of older instructions to the input registers of younger instructions
b) lookup: if a register is live-in to a block, i.e. if it is not bypassed, then looking it up in the renamer/map table.
c) allocation: if a register is written within the block, giving it a new physical register (and updating the array); also, arranging for this newly allocated register to be used as inputs by younger instructions in the block
= Lookup =
Lookup is the part that most people think of as accessing a highly ported array. E.g. if you are renaming 4 instructions per cycle, with 2 inputs, you might be looking up 8 inputs per cycle, and writing 4 new outputs.
But observe that whenever there is bypassing you don't need lookups.
For example, consider the code sequence
ADD eax += M[ecx+edx]
tmp := load(ecx+edx)
eax := eax + tmp
CMP eax,ebx
JNE target
The above group of 4 uops only has 4 live-ins, not 8. Only 4 values need to be looked up in the RAT array. The remaining 3 input values (tmp, eax in the CMP, JNE) are produced by earlier uops in the same block.
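A small model of that observation (the uop encoding is my own, and I model the CMP/JNE linkage as an explicit "flags" register purely for illustration): walk the block in order, and only count an input as a live-in needing a RAT lookup if no older uop in the same block has already written it.

# Each uop is (list_of_input_regs, list_of_output_regs).
block = [
    (["ecx", "edx"], ["tmp"]),        # tmp := load(ecx+edx)
    (["eax", "tmp"], ["eax"]),        # eax := eax + tmp
    (["eax", "ebx"], ["flags"]),      # CMP eax, ebx
    (["flags"], []),                  # JNE target
]

written = set()
live_ins = []
for inputs, outputs in block:
    for r in inputs:
        if r not in written and r not in live_ins:
            live_ins.append(r)        # needs a renamer/RAT lookup port
    written.update(outputs)           # younger uops can bypass this value in-block

print(live_ins)                       # ['ecx', 'edx', 'eax', 'ebx'] - 4 lookups, not 8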
True, P6 brute forced this: we renamed three uops per cycle, we always looked up 6, one for each register operand - and we threw away any lookup that was bypassed within the block. This avoided the need to first do the bypassing, and only then do the lookups.
However, even before P6, when I was working on this form of register renaming at UIUC during my MSEE, I anticipated ways to avoid this. Of course, that was back when I was trying to create 16-wide superscalar machines, so reducing complexity was essential.
== Non-Blocking Limited Port Register Renaming ==
IMHO the most elegant way is similar to what is also IMHO the most elegant way to do RF read port reduction:
1) do the bypass comparisons of uop output to uop input earlier, possibly as a separate pipestage.
2) out of this previous work, generate a list of live-ins that need to be looked up. Have a limited number of lookup ports. Of course use port combining, so multiple reads of the same register live-in only get looked up once - although possibly only after some time.
The question is, what happens when you run out of lookup ports? You could do an in-order stall. But, being me, I like to let later instructions that are not dependent on one of the blocked lookups get renamed and proceed to the scheduler. I.e. I want to do out of order renaming. Actually, I think the easiest way to do this is to let the guys blocked on a renamer lookup port proceed to the scheduler, but to have a separate write port that fills in the physical register input later. This is not actually out-of-order: all uops are renamed in-order, but some are immediately renamed to physical registers, while others are renamed to placeholders that later get filled in with physical register numbers. I.e. the renaming is in-order, but the renaming all the way to physical registers may be out of order. (In my most BS-full moods I have imagined that the renaming is actually done out of an OOO scheduler, since even the bypass comparisons might be done with a non-full network, as will be discussed next.)
= Caching Renamed Instructions =
And, oh yes: it is trivial to imagine caching the renames, the results of the bypass comparisons, in a decoded instruction cache / trace cache. That is one of the reasons I worked on trace cache, way back then. You get blocks of instructions - possibly fixed size blocks, possibly variable length traces - with N live-ins that need to be looked up, and M internally bypassed values that are given relative numbers, i.e. "use the K-th register that is allocated to this block."
= Bypassing Renames Inside Instruction Blocks =
Now, how about the bypassing? Apart from the
caching thereof, mentioned above.
Naively, that's O(N^2) comparators:
1st uop
- lookup all of its inputs (some combining)
- compare its output to all N-1 younger uop's inputs
2nd uop
- if its inputs are not bypassed from uop #1
lookup all of its inputs (some combining)
- compare its output to all N-2 younger uop's inputs
etc.
Dana Henry and Bradley Kuszmaul's Ultrascalar showed that you can do this in O(lg N) work. But that's asymptotic, for really big machines.
== Per Logical Register Renaming Circuit ==
Here's another way to do bypassing, suitable for machines with small register sets:
* create a "carry chain" for each logical register
* 2 bits, states "not used", "written", "needs to be looked up"
* insert not used at the oldest instruction
* pull down whenever written by a uop, in the sense
older next
nu nu => nu
lu * => lu
w * => w
This is O(N) delay, O(N*L) hardware, where N= number of uops, L=number of logical registers. It suffices to tell you what registers are live-in, and need to be looked up.
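A behavioural sketch of that chain (the state names follow the table above; the encoding is mine): for each logical register, scan the block from oldest to youngest; a read while the register is still "not used" marks it "needs lookup", and a write pulls the chain down to "written", both states being sticky for the rest of the block.

# Behavioural sketch of the per-logical-register chain.
# States: "nu" (not used), "lu" (needs lookup, i.e. live-in), "w" (written in-block).
def classify_registers(block, logical_regs):
    state = {r: "nu" for r in logical_regs}
    for inputs, outputs in block:         # oldest uop first
        for r in inputs:
            if state[r] == "nu":
                state[r] = "lu"           # read before any in-block write: look it up
        for r in outputs:
            if state[r] == "nu":
                state[r] = "w"            # defined in-block before any outside read
    return state                          # "lu" registers need map-table lookup ports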
If you want, you can make the values inserted be "written by uop #k",
and thereby establish the dataflow. This increases the hardware cost by a multiplicative factor of log2(N) bits. You take those sequence numbers, and look up a queue of free registers, and/or add to a base (the latter if doing a P6-style circular allocation in a ROB).
(I've heard that some companies did freelists as CAMs. Enough so that they veered away from freelists in subsequent processors. Silly people.)
Of course, O(N*L) is not a win over O(N^2) unless L is small. Which it might have been, back when I imagined doing this for a 16-wide x86. But which is no longer, now that N seems to be stuck around 4, and L is 32 or more.
== Incomplete Comparisons ==
Note that you can also do an incomplete comparison: e.g. compare most uops to their immediate predecessor, which catches many dependencies. Queue up requests to do comparisons that are further apart on a smaller set of comparators.
Monday, September 06, 2010
Ulfs - my user level filesystem - Andy Glew's Wiki
Ulfs - my user level filesystem - Andy Glew's Wiki: "Ulfs - my user level filesystem"
ulfs - my User Level FileSystem.
wiki: http://wiki.andy.glew.ca/wiki/Ulfs_-_my_user_level_filesystem
blog: ...
Have wanted to code this up for a long time.
Never could, because anything I did belonged to my employer.
Want it open source, so that I can use it for my personal stuff - wikis, version control.
With my current employment agreement I can work on open source, so starting.
Today got the most trivial put, get, and delete coded & tested.
Not much, but it's a start.
I want a user level filesystem, with at least two bindings
* UNIX command line
* Perl
I imagine that I will eventually want python, javascript, etc.
TBD: generate such via swig?
I want command line, because I'm a UNIX old fart. I still think of the shell as a good way to compose operations, e.g. via pipes.
I want scripting languages like Perl, because much code will be in there.
But mainly because I will implement at least first version in Perl.
Storage types:
* ordinary UNIX (or Windows) hierarchical filesystem tree
* archive formats, like tar
* possibly VC tools, like git, hg, bzr, ...
** delta storage
* possibly databases, even relational
* possibly hybrid filesystem/database
: any of the above may have flavors - e.g. will have metadata in the ordinary filesystem
Filesystem Names
* use the name specified by the user
* allocate and return a name
** sequentially
** content based hashing
*** handling collisions
* metadata queries
:: all of the above, whether explicit, implicit, or content based
* versioning
Command Line Interface:
mkfs:
ulfs -mkfs test.fs -type ordinary
ulfs -mkfs test.fs.tar -type tar
my $ulfs = ULFS_Ordinary::mkfs("test.fs");
my $ulfs = ULFS_Tar::mkfs("test.tar.fs");
my $ulfs = ULFS::openfs("test.fs");
basic operations
ulfs -fs test.fs -to testdata1-in-fs < testdata1 -put
ulfs -fs test.fs -from testdata1-in-fs > testdata1.retrieved -get
ulfs -fs test.fs -from testdata1-in-fs -delete
$ulfs->put(-to=>'testdata1-in-fs', -input=>'testdata1');
$ulfs->get(-from=>'testdata1-in-fs', -output=>'testdata1');
$ulfs->delete(-from=>'testdata1-in-fs');
other operations
copy
move/rename
mkdir
As is my wont, some thoughts:
;Predicates
If I try to use this as a net or web based filesystem, I may want to have predicates specified by the client
executed by the server. E.g. compare-and-swap: put this file, but only if the old version's content is ...
Operations like rmdir are really simple predicates: only remove the name if that name is a directory.
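As a hedged sketch of that compare-and-swap predicate over an ordinary directory-backed store (the function, its parameters, and the SHA-1 choice are all mine, not anything implemented; a real server would also need locking to make this atomic):

# Sketch: put new_data at name, but only if the existing content hashes to expected_old_sha1.
import hashlib, os

def put_if(root, name, new_data, expected_old_sha1=None):
    path = os.path.join(root, name)
    if expected_old_sha1 is not None:
        if not os.path.exists(path):
            return False                       # nothing to compare against: predicate fails
        with open(path, "rb") as f:
            if hashlib.sha1(f.read()).hexdigest() != expected_old_sha1:
                return False                   # someone else changed the file: refuse the put
    with open(path, "wb") as f:                # predicate held (or was not given): do the put
        f.write(new_data)
    return True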
Friday, September 03, 2010
apenwarr - Business is Programming
apenwarr - Business is Programming: "The sad evolution of wikis"
Amen.
Wikipedia is not wiki.
But... I'm contributing. I want comp-arch.net to be a controlled wiki - in part because I've already been spammed, but also because I eventually want to make it into a book. Or at least the wiki equivalent of one.
Also because I have specific contractual requirements. Certain things I can write now, but not publish for a few years. Because I'm lazy, I want to write them on the wiki now, and have them go public later.
Hence my half-assed semipublic/public comp-arch.net. And my dreams of a more complete solution.
The original poster said:
How do you create a vibrant community, but allow for private topics and discussion, but allow for public topics and discussion, and allow me to work for more than one company at a time with multiple private discussions, and have my WikiWords always end up pointing where they're supposed to?
I have no idea.
I actually have quite a few ideas along this direction.
I actually have arranged, in my employment contract, the legal right to work on such, even while employed.
Unfortunately, getting the time to work on this is my problem. What time I have, I tend to spend on comp.arch or comp-arch.net. I.e. content. Improving infrastructure requires blocks of time I only get on vacation.
Monday, August 30, 2010
Memory Transaction Topology
* wiki: http://semipublic.comp-arch.net/wiki/Memory_Transaction_Topology
* blogged, comp.arch posted, hbc mailing list.
Let us discuss the simple shape or topology of memory transactions,
from the point of view of a simple interface, with a processor on one side,
and memory on the other.
When I have drawn such diagrams in the past, I have often used the vertical axis to indicate the passing of time.
Here, drawing in ascii, I will not bother to do so.
= Simple Memory Accesses =
;Simple read
* processor sends an address to memory
P ---A---> M
* sooner or later, memory responds with the read data
P <---D--- M
;Simple Write
* processor sends address and data to memory
P ---AD--> M
Acknowledgement may or may not be necessary; it may be [[store and forget]].
The store address and data may be multiplexed on the same wires, as is roughly indicated above.
Or they may be in parallel.
* processor sends address and data to memory
P -+-A-+-> M
   +-D-+
Or they might even be two separate transactions, joined at the memory
P ---A---> M
P ---D---> M
= Interleaved Memory Accesses =
;Interleaved read
Interleaved reads may be satisfied in the original order
P ---A1---> M
P ---A2---> M
P <---D1--- M
P <---D2--- M
Or out of order. If ooo, then some sort of transaction ID is needed:
P ---A1---> M
P ---A2---> M
P <---D2--- M
P <---D1--- M
Simple writes with address and data available simultaneously do not provide an opportunity for interleaving.
P ---A1D1--> M
P ---A2D2--> M
However, separate address and data transfers do:
P ---A1---> M
P ---A2---> M
P ---D1---> M
P ---D2---> M
or out of order:
P ---A1---> M
P ---A2---> M
P ---D2---> M
P ---D1---> M
We usually assume that the address is sent first, but this need not be the case.
P ---A1---> M
P ---D2---> M
P ---D1---> M
P ---A2---> M
Usually, however, interleaving or reordering of stores refers to the order of stores between different processors. TBD.
= Burst Accesses =
Memory transfers often occur in cache line sized bursts, frequently 4 or 8 clock cycles, sometimes more.
A burst read sends a single address, and the data returns as a multi-beat burst:
P ---A------> M
P <---DDDD--- M
Stores may have address and data near to each other
P ---ADDDD--> M
Decoupled burst store address and store data is also a possibility, but is not common.
= Multiple Transaction Sizes =
Thus we have several different transaction sizes:
;Address Only
P ---A------> M
; Burst Read Data Reply
P <---DDDD--- M
;Burst Store
P ---ADDDD--> M
;Various non-burst or partial
in 8b, 16b, 32b, 64b, 128b, ... sizes
as well as possibly misaligned transfers
P <---D--- M
= Read-Modify-Write =
The following read/write transaction is usually associated with atomic read-modify-write operations (e.g. the GUPS benchmark):
P ---AopD---> M
op
store new data value
return old
P <---D------ M
Note that this can also be done in combination with a cached value:
P $ ------------AopD---> M
op op
store store new data value
ignore (or check) return old
P $ <-----------D------- M
== Modify Write ==
It is possible to have a modify-write transaction, where an operation is performed at memory, but the old value is not returned.
P ---AopD---> M
op
store new data value
X <-- M
A read-modify without the write is even less useful; it really amounts to having an implicit operand that does not need to be transferred.
= Masked Write, including Burst Masked Write =
It is common to have a bitmask specifying which bytes should be written:
P ---AmD---> M
The mask is so small that it is often part of the command.
Often a read-modify-write is necessary at the receiving end, e.g. for ECC.
But nevertheless, not across the interconnect.
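A small model of that receive-side merge (an 8-byte chunk and an 8-bit byte mask are assumed sizes, not from the text): the memory side reads the old chunk, overlays only the masked bytes, and writes the merged result back, which is why ECC forces a read-modify-write even though only a write crossed the interconnect.

# Sketch of a masked write merged at the memory side (assumed 8-byte chunk, 8-bit mask).
def masked_write(old_bytes, new_bytes, mask):
    """Return the merged chunk: new_bytes where the mask bit is set, else old_bytes."""
    return bytes(
        new_bytes[i] if (mask >> i) & 1 else old_bytes[i]
        for i in range(len(old_bytes))
    )

old = bytes(range(8))                       # whatever was already in memory
new = bytes([0xAA] * 8)                     # the data beat of the masked write
print(masked_write(old, new, 0b00001111))   # low four bytes replaced, high four kept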
Less commonly there is a masked read, e.g. for side-effectful memory.
There can also be burst masked writes, where the mask may approach and surpass a data chunk width.
P ---AmDDDD---> M
With [[Bitmask consistency]], burst masked writes become the norm.
Reads that indicate which bytes are an absolute requirement,
whether burst masked reads or otherwise, may also be useful.
= Scatter/Gather =
Rather than separate non burst reads
P ---A1---> M
P ---A2---> M
P <---D2--- M
P <---D1--- M
... or writes, or RMWs
We might take advantage of burst transfers, even though not necessarily burst memory (DRAM) accesses:
P ---A1.A2.A3.A4---> M
P <---D1.D2.D3.D4--- M
P ---A1D1.A2D2.A3D3.A4D4---> M
P ---A1opD1.A2opD2.A3opD3.A4opD4---> M
P <----------D1.D2.D3.D4------------ M
For the RMW case we might constrain the same op to be used everywhere:
P --op-A1D1.A2D2.A3D3.A4D4---> M
P <----------D1.D2.D3.D4------------ M
= BLUESKY: Reducing the Transaction Types in a Scatter/Gather Memory Subsystem =
;read
P ---A---> M
;gather read
P ---A1.A2.A3.A4.A5.A6---> M
; write
P ---op.A.mask.DDDD---> M
; scatter
P ---op.A1D1.A2D2.A3D3---> M
; reply
P <---id.DDDD--- M
If A and D chunks are the same size,
* the "normal" burst is 6 chunks long (A.mask.DDDD)
* with 6 gather read addresses
* or 3 scatter write or RMW address/data pairs
We might even do away with partial reads (---A--->).
Possibly make all replies look like writes.
; reply
P <---op.A.mask.DDDD--- M
Further, we might compress the values transferred.
Compression does not help fixed size transactions with fixed size memory cache lines,
except for power.
But compression does help fixed size cache lines with variable length transfers.
And compression will allow more scatter/gather requests to fit in a single packet.
In general, we would assume that in a multistage interconnect s/g packets will be broken up and routed s/g element by element.
Circuit switched, e.g. optical, works against this.
Thursday, August 26, 2010
New variant of old Windows graphics "window contour trails" bug
This morning, just now, encountered a new variant of an old bug.
I have often run into what I call the Windows contour trail bug: when dragging windows around the screen, they leave a trail of old, partially overwritten, window contents. Looking rather like mouse trails. As if the window were being redrawn every few fractions of a second, each time overwriting the drawing of the window at its old position, and the background was never replacing the overwritten window contents at the old position. Fills the screen with rather pleasant abstract art.
Unfortunately, a reboot usually seems to be necessary.
Today, same Windows contour trail problem - but on only one monitor. I have 4 monitors connected to that machine. The laptop LCD panel, and 3 USB display adapters. Only one of the monitors, connected to only one of the USB display adapters, has gone buggy.
Which gives a few hints: it's not a problem with the overall graphics display infrastructure. Apparently just with one display driver.
... While writing this, apparently fixed itself. Possibly related to going in and out of standby. Possibly related to Windows Explorer desktop shell hanging on a network drive.
---
All sorts of thoughts about micro-rebooting running through my head.