The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Monday, December 12, 2011

Powering up multiple displays on Windows 7 - lots of flashing :-(

I have 4 external 1920x1200 displays connected to my Lenovo Thinkpad X220 Tablet PC, via Diamond DisplayLink based USB adapters.

When I power the besat up, not just cold start, but also, most annoyingly, warm start, from Hibernate (Suspend to Disk) or Suspwend (to RAM) power down modes, it engages in excessive flashing.

The screen blanks and redisplays 15 times!!!

Quite annoying.

Continuing http://andyglew.blogspot.com/2011/12/multiple-displays-on-windows-7-good-but.html

Sunday, December 11, 2011

Lifehack: Binder Clips help Debug Chistmas Lights

Fixing Christmas lights today.

Figured out the binary search approach - e.g. as described by http://www.squidoo.com/howto-fix-broken-christmas-lights#module12970984.

My own embellishment: I find it hard to track where I am in checking the Christmas lights, and/or strand continuity.  Even leaving empty sockets behind doesn't help that much, since the empty sockets are hard to see.  And since I have a small work space, so had to fold the string of lights back on itself 3 times.

So I marked my position with binder clips.  I happened to have at least 4 colours of binder clip. I used four clips to dedfine the boundaries of the interval where I was searching for a duff lightbulb - two clibs at each boundary.  Two clips at each boundary because, given the folding ofg the string of lights, it became hard to tell which direction was towards the end, and which inside the interval.  So I used two colours of binder clip at each end.

Thursday, December 08, 2011

A Modest 1 bit Proposal about Quotification - making the Default Easy

Listening to an old "Security Now" podcast while doing my morning stretches.

Leo Laporte's TWIT website was hacked, and Steve Gibson, the Security Guy, says "Any time you are soliciting user input, there is a risk of malicious input somehow tricking the backend and executing that input, when it is meant to be, you know, benign [input data, like] user name and password.".

This is typical of the classic SQL injection hack, and, indeed, of any hack where the attacker is able to inject scripting code and fool the backend into executing it.  Typically by embedding quotes or the like in the input string.

(For that matter, Steve's description also applies to binary injection via buffer overflow.  But we won't go there; this page will talk only about non-buffer-overflow attacks, sijnce we have elsewhere described our agenda for preventing buffer overflow attacks.)

Say that you are talking user input like NAME, and are somehow using it to create an SQL or other language command, like "SELECT FIELDLIST FROM TABLE WHERE NAME = '$NAME'  ".   But now the attacker, instead of providing a nicely formed string like "John Doe", provides instead something like "anything' OR 'x' = 'x  ". (I added spaces between the single and double quotes for readability.) I.e. the user provides a string that contains quotes in the target language - not the language where the query string is composed, but a language further along.  So the query string becomes "SELECT FIELDLIST FROM TABLE WHERE NAME = 'anything' OR 'x' = 'x'  ". And now the query matches any row in the table.  (http://www.unixwiz.net/techtips/sql-injection.html provides examples, as does wikip[edia.).

The general solution to this is "quotification": take the user input, and either delete or quote anything that looks like a quote in the target language:. E.g. transform the attacker's string "anything' OR 'x' = 'x  " into either "anything OR x = x  " or "anything\' OR \'x\' = \'x  ".

The problem with deleting stuff from the user string is that sometimes the user is supposed to have quotelike things.  Consider names like "O'Toole".  Or consider prioviding, e.g. via cut and paste, Chinese unicode names in an application whose original programmer was English, but where the system is otherwise capable of displaying Chinese.  It is a pity if the barrier to internationalizaion is the "security" code scattered throughout your application that santizes user input. Worse, that is the sort of code that might get fixed by somebody who fixing internationalization problems who doesn't understand the security issues

The problem with quotifiying stuff is that it is hard.  It is not just a case, for you Perl afficionadoes, of doing s/'/\/g - what about input strings that already have \\' inside them?  And so on.

But the real problem, applicable to both deleting and quotification strategies, is that the code doing the user input sanitization does not necessarily know the syntax of all of the languages downstream.  It may know that there is SQL along the way - but it may not know that somebody has just added a special filter that looks for French quotes, << and >>.  Etc.  Not just special symbols: I have defined sublanguages where QuOtE and EnDqUoTe were the quotes.

The security code may know the syntax at the time the sanitization code was written.  But the downstream processing may have changed.  The syntax of the language may have been extended, in a new revision of the SQL or Perl or ... .  (I found a bug like that last year.)

The problem is that the user input santization code is trying to transform user input from strings that may be unsafe, to strings that are guaranteed to be safe forever and ever, no matter what revisions are made to the language, etc.   The problem is that the default for character strings is that ANY CHARCATER MAY BE PART OF A COMMAND unless specially quoted.

We need to change this default.  Here is my moldest proposal:

Let us define a character set whereby there is a special bit free in all characters.  And whereby, if that special bit is set, it is guaranteed by ANY REASONABLE LANGUAGE that no character with that special bit set will be part of any command or language syntax like a quote symbol.

We should strongly suggest, that the visual display for the characters with and without the special bit set is the same.  Or at least, the same in most situations - in others you may want to distinguish them, e.g., by shading.
If you are using something like BNF to describe your language, then it might be:

ORDINARY_CHARACTER ::== 'A' | 'B' |  ...

TAINTED_CHARACTER ::== 1'A' | 1'B' |  ...

where I am using the syntax 1'A' to describe a single character literal. with the special bit set.



(Actually, I am not sure whether a quoted string should be the abnove, or

And we require that the only place where the possibly tainted characters with the tainted bit set are ONLY permitted in strings.  Nowhere else in the language.  Not in keywords, symbols, operators....

Then we just have to ensure that all of our input routines set the special bit. If you really need to form operators, the programmer can untaint the data expliocitly.  Btter to have to untaint explicitly in a few p[laces, than to have to quotify correctly in all places.

Perhaps better to make taintimg the default.  To flip the polarity of the special bit.  And to require that language syntax, keywords, etcv., be set only if the special bit is set.

This is just the well known taint or poison propagation strategy.  Exposed to programming language syntax definitions.

I have elsewhere espoused taking advantage of extensible XML syntax for programming languages.  This is similar, although orthogonal.

Wednesday, December 07, 2011

Multiple Displays on Windows 7... Good, but...

I love the DisplayLink USB display adapters that let my piddling little tablet PC drive 4 1920x1200 external monitors.

However, many, umm, irregularities happen with Windows 7 support and/or the display or device drivers.

Earlier today, and for weeks if not months, I have been able ti have 5 displays - my laptop/tablet PC's built in LCD, and my 4 external displays.

However, I went into hibernation while travelling from home to work (where I did not use the PC) and back to home again.

And when I woke up, I cannot get my laptop built in display to work, when the external displays are plugged. Sure, it works when they are not plugged in; but when they are plugged in, the laptop built in display gets "blackened" in the Orientation box.

Just yet another Windows 7 strangeness.

Saturday, December 03, 2011

Organization and time management, technology for

I felt inspired to start writing this.  More like collecting.

Human handwritten proofreading marks

http://creativeservices.iu.edu/resources/guide/marks.shtml has a nice collection of human proofreading marks.

Some are irrelevant to pen or touch tablet computers - why mark that a period should inserted, when you can just do it?

But some might be usefully considered for use as commands in a pen driven text editing system

Possibly also for depiction of edits and revisions.

Organization Tools, Evolution of

I am fascinated, or at least interested, in tools for organization, scheduling. Time management. Filing.

American schools say "Time management skills are the most important thing you have to learn in high school to be prepared for success in college."  But, as far as I know, nobody actually teaches techniques for organization and scheduling.

Perhaps one of the reasons I am fascinated by this is that it does not come naturally to me.  But, as a result, I have studied it, and my career as a computer programmer and computer architect is really all about organization and scheduling.  Albeit for computers, not people.

My whole career has been about scheduling for computers.  First software operating system schedulers, in particular real time schedulers. Then hardware OOO CPU schedulers.

But I remain interested in scheduling and organization in many different aspects: computerzs, factories (JIT and kanban), people (organizers, PDAs, DayTimers, etc.)

It is quite interesting to see how organizing tools for people have evolved.  I probably need to convert this into a wiki post so that I can evolve it.  http://wiki.andy.glew.ca/wiki/Organization,_technology,_evolution_of

First paper, and now on computer.

In paper:


Three ring folders.

Tracker-keepr --- I just learned from a friend how this tracker/keeper system for holding papers together became popular for elementary and high school kids in the US in the 1980s.  Essentially plastic sheets that folded to provide an envelope out of which is was hard for things to fall.

Three ring folders have made a comeback in my daughter's
school, but with zippers around them.


Paper clips.

A famous engineering author explains how, prior to paper clips and stables, people used straight pins.

School websites

School websites sound like a good thing: use your browser to keep track of your kid's school, homework assignments, etc.

In reality, it turns into a mess unless the website is organized.  And teachers are not necessarily good website architects[*].

The school may have a website.  Good.  In a content management system like Drupal.  Okay.  With pages for every class, a news system, etc. Not so bad.

But then some teachers may use a separate Moodle.

And some other teachers may use a separate Google Apps or Google Sites site,.

So, for class #1 you have to go to the Drupal site. For class #2 you have to go to the Moodle.  For class #3, the homework is posted on the Google Sites site, but there's some background info on the Moodle.  And, oh yes, still other stuff on the Drupal.

AAARGH!!!!  I can't remember which webtool - Drupal, Moodle, Google Sites - to use for any particular class. Yes, I may be a Luddite parent - but I am a reasonably web.savvy Luddite parent.

And it's not just me, the Luddite parent, who finds this confusing.  The students do too.  Not just my kid:  I have had a teacher complain to me that this teacher keeps telling the class to use the Google Sites site for the class, but the students never seem to understand.  Could it be because this teacher is their only teacher using Google Sites instead of Drupal or Moodle, and the kids naturally look to one place?    Perhaps if this teacher created a class page in standard place on the Drupal or Moodle page, and linked to the Google Sites site for his class? It might also help the Luddite parents.

Note that above at the [*] I decided to say "website architect" or "website organizer" rather than "website designer".  Website designers, at least the not so good ones, often concentrate on visual flashy effects.  School websites have plenty of that.  What they seem to lack is organization, or architecture.

Note: it's not just a question of installing a new content management system.  They already have more than enough.  Note that I am not saying "too many" - I am saying "more than enough".  I am not against using different CMSes.   But, if you do, you need to link them together, so you can get from one to the other.


Here is a dream, a suggestion a modest proposal for school websites:

(0) Have a front end, in whatever CMS you want.  Drupal, say.

(1) Allow the teachers to create sites in whatever other CMSes they insist on using. Moodle. Google Sites.  Whatever.

(2) But insist that there be a classroom page created on the front end site (Drupal, in this example).  If nothing else, it should point to where on the other CMSes the real classroom contents.

Note that I don't mean just with text saying "some teachers use the Moodle or other tools".  Which basically means that you have to figure out how to get there on your own.  No, I mean a real, clickable, link to that other site.  If you can't create a link into that other site, well...  (IMHO that's the only reason to forbid using a different tool.)


More on the same lines: Use dates, so that stale data does not keep rising to the top. 

I keep clicking to links and URLs that say stuff like "Fourth Grade Science".  But it turns out that it was 4th grade science for 2008.  Or last years' class.  Or ...

A page identifier like "4th grade science" is really a floating identifier.  At any particular time it should point to something like "4th grade science 2011-2012".  Every year, it should be updated.

I'm quite happy to have the old years' stuff remain on the website(s).  It's nice to look at.  But, I don't like it getting in my way when I am looking for this year's stuff.


Perhaps "News" should expire after a while?

I.e. I am not at all sure that class presentations for history class from two years ago are still "News".  More like "Olds".  Again, useful for reference, but a time waster when I, am a parent, am trying to find what is really new at my child's school website.


In my dreams:  wouldn't it be nice to have a page you or your child could log into, that contained links to the particular class webpages for all the classes your child is taking?

Actually, my effort to do that myself - to create a wiki/Evernote/whatever page, pointing to all of the key information on my child's school website - is what inspired this post.  It's darned frustrating finding it.


Hypertext systems like the web have great potential.  But they are a pain to use if disorganized.

It's hard to keep them organized.

Search engines like Google are the saving grace for the real web.  But Google is not allowed to index a school website with possibly sensitive information.   Most CMSes have their own search, but when you have multiple CMSes, this doesn't help much.

It would be nice if each school ran its own private Google indexer on its various websites and CMSes.  (Hmm, Google, maybe that would be a nice public service?  Or are you hell bent on getting as many schools as possible to use Google Sites and Apps, giving non-Google Drupal and Moodle the cold shoulder?)

But even if there was decent search for school websites, some modicum of organization is desirable.  I think what I suggest is pretty basic:  (1) All class webpages in a fixed place, linking to whatever other websites or services the teacher chooses to use.  (2) With dated content content, and floating labels - ("4th grade science 2008-9" versus "This year's 4th grade science".)


Nothing that I say here is specific to schools.  This is just good website design - or should I say "architecture" - for a heterogeneous system.

Nothing that I say here is critical of school IT.  School IT just sets up the system - the teachers provide the content.    The only slight criticism is that IT should enforce the very minimal standards that I suggest.  Perhaps school IT should set up the default classroom pages on the primary website.  Teachers should be responsible for linking to whatever other non-standard website technology they choose to use.


I just had a thought about why this schizophrenia arises aty school websites.

Teachers are often not the most sophisticated web.users, whether in their own college days, or afterwards.

In college, teachers probably used whatever their university IT department provided. And university IT departments are often pretty good, in particular good at providing services to unsophisticated users (faculty and students) 

But different universities have different CMSes. Some use Drupal, some Moodle, Google Apps and Google Sites increasingly common.

So a newly graduated teacher is used to whatever they used at school.  And is also used to having reasonably good IT support.

But when they start teaching at a K-12 school, you get teachers who graduated from different university programs, each used to different CMSes.  And they all want to use what they are familiar with.  Moreover, the K-12 school IT, although perfectly competent, is NOT as fledged as that of a university IT department. 

I am sure that large school districts enforce website standards: e.g. they probably tell their teachers that Moodle is the only CMS supported, use it or else.  Makes it easier to organize.  However, that is not true for the schools that I have interacted with.  And I remain an advocate of heterogeneous systems.  It;s just that they require a certain minimum amount of work.


I should probably volunteer to help my kid's school organize its website.

Wednesday, November 16, 2011

Reagan Cattledog Links - reverifiable COW BOW links

Thinking about updating shared libraries:

Shared libraries' true advantage is their true disadvantage: multiple users can be updated, whether for good or ill.

Perhaps what we need are shared library linkages that are not automatically updated. Where an update is marked pending, encouraging  the user of the library to update it asap, but the change is not made.

I am calling this a reverifiable COW link. A link,that is broken when somebody else writes to the linked object (hence COW, Copy on Write, or BOW, Break on Write).  But which is reverifiable. Retestable.  (As one of my friends says "If you really believe in unit tests..."  (I do, he doesn't).

I would like very much to be able to have acronym COWBOY instead of COW BOW.  But I am humour deprived.

In the meantime I can call them Reagan Cattledog links.  Get it? BOW, as in bowwow, dog.  Reagan, as in "trust, but verify."


This not just for shared libraries.  Any sharing.  web pages. Like the "cached link" I have described elsewhere.  Cached links are really just COW BOW links which are assumed to be updated when the linkee comes back online.

Shared libraries and data deduplication

People have talked about the advantages of shared libraries: reducing virtual memory requirements, reducing disk space requirements, etc, because of sharing.

Here's a thought: Q: if we had truly ubiquitous data deduplication, what would be the advantages of shared libraries?

A: none of the performance wins through sharing need apply.  Deduplication beats them in a more flexible, more abstract, way.

(Of course, we do not have truly ubiquitous deduplication. And it usually requires things to be block or file aligned.)

This leaves the only fundamental advantage of shared librares

  • the fact that you can effect a ubiquitous change by updating a shared library.
Which is also their fundamental disadvantage.  You can propagate a bug fix.  But you can also propagate bugs.


So I read Furlani et al's papers on "Modules".

Modules is not so bad - it is a nice way of dealing with an annoying problem.

Or, rather, Modules may be the best possible way of dealing with environment dependent code.  But it might be better to discourage environment dependent code in the first place.  See my earlier post about environment dependent code being a dominant meme.


Minor observation: I would like to bring some non-Modules source scripts "kicking and screaming into the '90s with Modules".  I would like to simply wrapperize some existing legacy code that requires you to "set some environment variables and then source foo".  I.re.I don't want to rewrite foo - I would just like to wrap it in a module.

Modules does not seem to be able to do this.

Although it looks as if it would only be a minor extension to modules to handle it.

To do foo, start off in a new window

How many times have you seen "How to" directions begin:

To do foo

  1. Start in a fresh xterm
  2. Start in a fresh shell (typicaly csh)
  3. Log out and log in again so that you get a clean environment

While this may be good advice, certainly good advice for debugging brokenesses and/or avoiding bugs in the first place  - it is basically an admission that something is not right.

Some tool depends on the environment in weird ways.Possibly not just the standard UNIX environment string;possibly also the extended shell environment.

Tools should be callable from almost arbitrary environments.  They should not DEPEND on environment variables. It may be acceptable to USE environment variables to change some behaviors, but, consider: if you had a setuid script, it would probably be unwise to depend on environment variables.  Environment variables should be considered tainted.

I suppose my version of the above is to say

To do foo

  1. Empty all of your environment variables, and start off with a minimum environment
  2. Type the absolute path to the tool,/a/b/.../c/d/tool
IMHO tools should work when invoked like this.  If they are using the equivalent of Perk FindBin, they should be able to locate all of the relevant library files, etc., they need. Or else they should have, embedded in them, the paths to same.

GLEW OPINION: much of the reason for environment abuse is the broken, non-object oriented, UNIX installation model, where a tool may be put in /bin, its libraries in /usr/lib, etc - where the directories are not known at build time.  PATH, LIBPATH. MANPATH.  FindBin can live with this - a FindBin script can be relocated by copying - so long as the relative locations of what it depends on are maintained.

source scripts

I am going to try calling the sort of shell command file that must be source'd, i.e. read into the user so that it can modify the environment, a "source-script".

As opposed to a "shell-script" or other command which is usually executed as a subprocess.

UNIX-style subprocess execution of commands provides isolation.  The parent is unaffected by the child, except for interactions through the filesystem.  (Although with the /proc filesystem, or the child applying a debugger to the parent, that could be significant.)

Whereas, consider a csh style source-script.  It can be changing directories all over the place to get its work done.  And it may terminate with an error before finishing properly.  So the "caller" or the source-script may not know what directory he is in after the source script terminates.

Q:  how many people do:

set saved_dir=`pwd`
source srcscript.csh
cd $saved_dir

And,of course, even that has obvious bugs.

environment setting a dominant meme?

Thinking about why I go through this paroxysm of disgust whenever I encounter a toolchain that depends on environment variables.  Like Modules or modulefiles(1). Like many CAD tools.

This morning it struck me: they are a dominant meme.  An evolutionarily stable strategy.

Not because environment based tools are better.

But because cleanly written stuff,like I try to write, can be called pretty safely from anywhere.  Whereas stuff that  does what I consider unclean environment modifications cannot be called so easily from other code.  It can call other code, but it is hard to be called from other code.  So there is a tendency for users to just give in, and write in csh (since csh in so often the language associated with such environment dependent tools).

Sure, you can try to write code that prints the environment and which then gets called. However, this only catches the UNIX environment - modulefiles(1) rely on sideeffects in the extended shell environment, shell functions and aliases.  You could print these, but would have to parse them to pass to a different language, or at least reread them if passing to a later compatible shell.


The best way to work with such tools is to start a persistent subprocess, pass it commands, and interpret the results.  Expect style.  Coroutines. Which is doable, but is more complex than order function calls / UNIX style subprocess invocations.

Sunday, November 13, 2011

calling a function to change the environment

I think that what I really object to in tools that dedpend on environmebt variables is that it is hard to put a wrapper around environment variables. I.e. it is hard to "call a function" to set environment variables.

However, I have to parse out what I mean by the terms above.
      Remember: I am talking about scripting, in shell and Perl, etc.
      "Calling a function" in a shell script usually means executing a child process. And, by definition, a child process dos not pass environment variables back to its parent.
      Now, of course, in every scripting language worth its salt, it is possible to write a function that sets environment variables. But that's a function in the language you are running in. It's a lot harder to have, e.g. perl or bash call a csh function, and have that child csh set environment variables in the parent perl or bash.
      Similarly, you can source a file in csh, or "." it in bash, and hae some degree of modularization. But again it is same language.

Why do I care? Mainly because I am writing stuff in bash or perl or python, and I want to get whatever environment variables legacy csh scripts set up.
      But, in general, you lose abstraction if you are constrained to call functions only written in your current language.  Or, even then, if only callable via a special syntax, eg. csh's "source file" rather than just executing file.
      Loss of abstraction.  But, requiring a special syntax provides a but of security.You can tell who is side effectful, and who is not.  Pure vs impure functions.

My clear-env script does things oppositely - it allows me to call a script in a subshell with a clean environment, but I don't necessarily see what it set up in the environment.
      Similarly, my friends trick where bash can get a csh script's environment by doing something like
csh -c "source module; exec bash
is NOT a "function call". It's more like continuations.

Part of the trouble is that the whole point of processes is to isolate the parent from the child.
    Except that here, the whole point is to get access tyo the child's side effects.

I think that I may need to create a family of scripts in various languages that execute or source or whatever, and then, at the end, printenv into a file or stdout - so that the parent can parse the printenv.

A better way would be to do something like stop the child process in a debugger - and then have the parent look through the child's environment with the debugger.

I haven't even touched on other side effects.  Not just the environment, but file descriptors.

E.g. "I have a bash script that wants to call a csh script that redirects stdout and stderr - and then have the parent bash script use those settings".

Saturday, November 12, 2011

Why depend on UNIX environment variables?

Previously I have posted about how I find tools that depend on sourcing csh files to get environment variables set a real pain to deal with. Largely because I don't have good ways of automating such procedures, both for use and test. And also because of environment variable interference.

So, I ask myself: Q: why depend on environment variables?

Then I remember how excited I was to learn of shell environment variables.

By setting an environment variable I could arrange for a parameter to be passed from the outside world right into the middle of a tool. Without having to modify any of the interveing levels.

No need to create argument parsing code.

In fact, in Bell Labs type shells,argument parsers are implicitly created for environmrnt variables:
VAR1=val1 VAR2=val2 program options...

I hate (csh) environment based tools

I hate tools that have heavy, undocumented, environment dependencies.

csh scripts seem to be the classic example. Beware of anything that says
source file1
source file2
do this
source file3
do that

where the csh files you are source'ing mainly act by setting up environment variables, but also may act by side effects such as cd'ing.


Why do I hate these things? They are hard to automate. Especaly to automatically test.

Really, to automate I need to insert checks into the script above after each step. At least if it is flakey. (If it is not flakey and is all working, I don't care what it does. As long as I can put t in a black box, and don't have to let its side effects escape.)


Why do I hate these things?

So often the are designed for interactive use. And interfere with other stuff you may be using interactvely.

Oftentimes I need to fall back to re,oving all of my interactive customizations to get something like this working in a clean environment.


I have a script I call clear-env that deletes environment variables, and starts up new subshells. Has saved my bacon many times

However, today I am running into problems that depend on running exactly the sit standard initialization files, .login and .cshrc, before running any other csh-&**^&**^&^-source modules.

Testing: just (start) doing it

Trying to test something that I don;t know how to automate. In fact, my goal is to automate - but I can't automate it enough even to run the smallest self checking automated test. The blinking procedure only works, at all, interactively.

So, do what I can:

I can't run the program under test automatically. (Yet: I hope to change this.)

But the program under test does leave some output files, etc.

Therefore, I can automate CHECKING the output of the manual test.

Perhaps in a while I will be able to automate the whole thing.


It is amazing the sense of ease that even this small degree of automation brings.

Tuesday, November 08, 2011



'via Blog this'

Most version control tools have a .ignore file or ignore - .cvsignore, .hgignore, vk ignore, bzr ignore, etc.

All that I am aware of ignore fils based on pattern matching on the filename. E.g. in Mercurial:
An untracked file is ignored if its path relative to the repository root directory, or any prefix path of that path, is matched against any pattern in .hgignore.
I would like to extend this to be able to do simple queries.

E.g. I usually have an ignore rule that looks something like
I.e. I ignore directories that are suffixed .novc (for no version control).

This works fine, but is somewhat verbose. Plus, it gets in the way when certan tools have other conventions about names.

I should like to get the .novc directvetive out of the filename, and into a placeholder file in the directory.

E.g. if .../path/.novc exists, then ignore .../path

Q: is it there, and I do not know?

Monday, October 17, 2011

Comp-arch.net wiki on hold from October 17, 2011 - CompArch

Comp-arch.net wiki on hold from October 17, 2011 - CompArch:

'via Blog this'

= On Hold =

The comp-arch.net wiki is being put on hold - or at least to a very low boil back burner -
as of Monday, October 17, 2011.

Reason: I (Andy Glew, the creator of and main contributor to comp-arch.net)
have decided to go back to work in industry as a computer architect at MIPS Technologies.

= (Pre)History of comp-arch.net =

I have long wanted to write a book that I threatened to call "The Art of Computer Architecture".
I would like it to be like Knuth's "The Art of Computer Programming",
except that I am no Don Knuth: I am willing to compromise, not necessarily provide full academic references,
if in exchange I can document the "folklore" of computer architecture - the things that hardware engineers
know or used to know, but which never seem to get written down, and which get re-invented every decade or so.

The web, and wikis in particular, provided a forum and technology to allow me to write this in small pieces.

At most of the computer companies I have worked at, in particular AMD and Intel
(but also to a lesser extent at Gould and Motorola, prior to the creation of the web)
I have created wikis like this.
Whenever other engineers asked me a question
for which the answer was known, in the literature and/or folklore,
but not necessarily easily accessible,
I would write up a white paper or wiki page explaining the topic.
Bonus: often in so doing I would realize missing alternatives in a taxonomy,
which would lead to new inventions.

These company-internal web and wiki sites proved popular.
Several times, former co-workers who had left industry to become academics
asked me if they were accessible outside the company.
I always had to say "No".

Intel and AMD never allowed me to create public versions of such websites.
Perhaps they would have given extensive legal review - but I found the process of getting such approval a large disincentive.
The document or wiki or website would be stale before approved.

In September 2009 I left Intel to join Intellectual Ventures.
One of the conditions for my joining IV was that I be allowed to work on such a public website,
(I actually created the website during the brief period between leaving Intel and starting work at IV.)
I am incredibly grateful to IV for giving me that opportunity.

Progress on comp-arch.net was slow, and probably not steady, but at least visible.
In two years I had created 300 public pages on comp-arch.net.

In addition, I created a certain number of private wiki pages, sometimes on topics that I thought might be worth patenting,
sometimes because I was afraid that disclosing the topics I was writing about might create conflict with IV.
Even though my employment agreement might give me the right to work on something in public,
I would not want to get in the way of my employer's business or marketing strategy.
Such conflicts would have loomed very large for Intel
- I would have had trouble writing honestly about Itanium at the time when Itanium was Intel's main emphasis
- and were much less of a problem for IV,
but still FUD and self-censorship were an impediment
to work on comp-arch.net.

However, I say again: I am immensely grateful to Intellectual Ventures for giving me the chance to start working on comp-arch.net.
If I was confident I could stay state-of-the-art as a computer architect while continuing to work on comp-arch.net
and with Intellectual Ventures, I would keep doing so.

= Present of comp-arch.net =

For reasons such as these I left Intellectual Ventures to return to work in industry as a computer architect.
On October 17, 2011, I joined MIPS Technologies.

At MIPS I do not expect to be able to write pages on comp-arch.net and post them in real time.
I will continue to try to work on comp-arch.net in private,
and occasionally seek approval to make stuff public.

= Working on the Wiki =

I will also take this opportunity to work on the technology of comp-arch.net.

In 2009 I started comp-arch.net using mediawiki.

I have long wanted a better wiki technology. My wiki wishlist is documented elsewhere on this wiki, and other places.
It includes better support for drawings,
and better support for melding of public and private content
- since it appears that such restrictions will be something I have to live with for the foreseeable future.
The computer hardware industry is not "open" as the wiki philosophy would have it,
except possibly within companies.

= Future of comp-arch.net =

As mentioned above, I hope to continue working on comp-arch.net,
making stuff public occasionally, when permitted.

Plus, if ever I retire, I hope to continue this labor of love.

= It's not just about me! =

Although I have been the main contributor to comp-arch.net,
there have been other contributors.

For example, Paul Aaron Clayton has made some contributions.

I hope that others can continue to work on comp-arch.net
during this time when I must leave it on hold.

If you are interested, please contact me, and I will arrange for access.

(I may need to do a bit of wiki work to separate the old stuff from new stuff.)

= Au revoir, et à bientôt =

I hope to see myself working on comp-arch.net again.

But for the moment, I am excited to be working towards actual shipping product again at MIPS.

Friday, October 14, 2011

Mint to Quicken 2011

Mint to Quicken 2011:

'via Blog this'

I would be posting this on a Mint.com forum, but I don't want to bother with registering, so I will post it here.

I am yet another person who would like to synch Quicken from Mint, and possibly vice versa.

Although just being able to take transactions from Mint into Quicken would be wonderful. A backup.

In my case, although I am a past Quicken user, I am not currently. Not even a Mint user. I use Yodlee. I would consider switching to any product that allowed me to keep both online and offline accounts.

Wednesday, October 12, 2011

Google Plus Traffic Drops, 1269% Gains Erased

Google Plus Traffic Drops, 1269% Gains Erased:

'via Blog this'

Google+ is declining.

But I am still using Google+.

For one big reason: the "circles" provide better access control than LinkedIn or FaceBook.

Why do you want access control? Well, it often becomes obvious, looking at LinkedIn or FaceBook updates, when a person from company 1 is interviewing with company 2. Now, you can often turn off the automatic notification - but I know former coworkers/managers who proactively searched LinkedIn and FaceBook, to see if people they knew had suddenly become, e.g. linked directly to people in other companies, or even rivals within the same company.


(Not if there is a possibility that your present employer might retaliate.)

(Yes, there are limited fixes for this in LinkedIn and FaceBook. None satisfactory.)

Now, Google+ circles may help here. Except, as is typical with so many Google apps, they don't scale, in the user interface.

E.g. I have separate Google+ circles for all companies I used to work for, as well as old schools, etc. But when you reach more than a dozen circles, they become a pain to deal with. I currently have 24 circles - that's too many for Google's user interface. It needs some sort of hierarchy - e.g. a meta-circle of Companies containing several sub-circles of particular coompanies, and so on.

As is typical with so many Google apps, the user interface doesn't scale. People repeat, over and over and over again, the mistake of providing a flat set of categories (Google+ circles, Gmail labels), rather than providing structure and nesting. I can just imagine the "data driven" conversation at Google: "prove that people need more than 8 circles", "prove that people want nested circles and labels".

Tuesday, October 11, 2011

Colorado College | Block Plan

I learned about the Colorado College Block Plan from my daughter's school counselor. Instead off juggling 8 classes at a time, take one class at a time, intensely.

I find this interesting and attractive - this is almost, for example, how I got into computers, via a summer session Concordia University organized for high school students in the Montreal region.

The school itself has the usual issues with small liberal arts colleges versus universities.

Saturday, October 01, 2011

Load-linked/store-conditional (LL/SC)


The load-linked/store-conditional instruction pair provide a [[RISC]] flavor of [[atomic RMW]] [[synchronization]], emphasizing primitives which can be flexibly composed. It can be viewed as a minimal form of [[transactional memory]].

Part of the [[RISC philosophy]] was to espouse [[load/store architecture]] - instruction sets that separated load and store memory operations
from computational or ALU operations such as add or increment. This works fine for single processor operations, but runs into problems
for multiprocessor synchronization.

== Pre-LL/SC ==

After years of evolution, prior to the so-called [[RISC revolution]]
multiprocessor synchronization instruction sets were converging on simple [[atomic RMW]] instructions such as
[[compare-and-swap]], [[atomic increment memory]] or [[fetch-and-add]] and other [[fetch-and-op]]s.
Now, these atomic RMWs can be seen as composed of fairly simple primitives, for example:

[[locked increment memory]]
tmp := load(MEM)
tmp := tmp+1
store( MEM, tmp )

However, the implementation of begin-atomic/end-atomic is not necessarily simple.

The atomicity can be provided in a simple way:

[[locked increment memory]]
tmp := [[load-locked]](MEM)
tmp := tmp+1
[[store-unlock]]( MEM, tmp )

Where [[load-locked]] may be
* implemented at the memory module
** locking the entire module
** or a limited number of memory locations at the module
** or potentially an arbitrary number of memory locations, using per-location lock-bit [[memory metadata]], e.g. [[stolen from ECC]]

* implemented via a bus lock

* implemented via a cache protocol
** e.g. an [[address lock]]: acquiring ownership of the cache line, and then refusing to respond to [[snoops or probes]] until the [[address lock]] was released
** or, more primitive: acquiring ownership of the cache line, and then acquiring a [[cache lock]], locking the entire cache or a fraction thereof

Interestingly, implementing locking at the memory module quite possibly came first, since many early multiprocessor systems were not snoopy or bus-based.

So far, so good: [[load-locked]] and [[store-unlocked]] are somewhat RISCy.

But they have a problem: [[load-locked]] and [[store-unlocked]] as separate instructions
raise certain security, performance, and reliability problems in some (but not necessarily all) implementations.

E.g. what happens if a user uses load-locked to lock the bus,
and never unlocks it?
That *might* be interpreted as giving one user exclusive access to the bus.

Obviously, one can eliminate these problems by architecture and microarchitecture.
But doing so is a source of complexity.
Many, probably most, implementations have found it easier to package the atomic RMW
so that the primitive operations are not exposed to the user

== [[Load-linked/store-conditional]] ==

The basic idea behind [[LL/SC]] is that, instead of guaranteeing forward progress with a [[load-locked]] that always succeeds,
but which may deadlock,
[[load-linked]] would employ [[optimistic concurrency control]].
Software would assume that it had worked, perform the operation that you want to be atomic,
and then try to commit the operation using [[store-conditional]].

If the operation was atomic, i.e. if nobody else had accessed the memory location load-link'ed, the store-conditional would proceed.

But if the operation were non-atomic, if somebody else had accessed the memory location, the store-conditional would fail.
And the user software would be required to handled that failure, e.g. by looping.

L: oldval := load-linked MEM
newval := oldval+a
store-conditional( MEM, newval )
if SC_failed goto L
return oldval

Typically LL/SC are implemented via a snoopy bus protocol:
a memory location is "linked" by putting its address into a snooper.
If another location writes to it, the link is broken so that SC will fail.

Some implementations did not implement an address snooper - they might fail if there was ANY other activity on the bus.
Needless to say, this does not exhibit good performance on contended systems.

Non-snoopy implementations of LL/SC are possible.
E.g. implementing them at a memory module is possible. [[TBD]].

As with more extensive forms of transactional memory, LL/SC has issues with fairness and forward progress. It is not desirable to loop forever in the LL/SC code.
This is solvable - through much the same mechanisms as one uses to implement a [[locked atomic RMW]] with [[load-locked]].

One way amounts to converting a [[load-linked]] into a [[load-locked]] after a certain number of failures.

== Synthesizing Complex RMWs ==

The nice thing about [[LL/SC]] is that they can be used to implement many different forms of synchronization. E.g. almost any form of [[fetch-and-op]].
Say you want [[floating point fetch-and-add]] .. you can build that with LL/SC.
Whereas if you don't have LL/SC, and just have a fixed repertoire of integer [[atomic RMW]]s, you may just be out of luck.

This *is* one of the big advantages of RISC: provide primitives, rather than full solutions.

The problem, of course, is what was mentioned above: "plain" LL/SC may have problems with fairness and forward progress. Mechanisms that solve these problems
for LL/SC may be more complicated than they would be for non-LL/SC instructions.

See [[list of possible atomic RMW operations]].

== [[Remote Atomic RMWs]] ==

The "most natural" implementation of [[LL/SC]] is to pull the data into registers and/or cache of the processor on which the instructions are executing. By "the instructions" I mean those before the LL, the LL itself, between the LL and SC, the SC itself, and after the SC.

This is also the most natural implementation for locked implementations of atomic RMWs:

[[LOCK INC mem]]
tmp := [[load-locked]] M[addr]
dstreg := tmp+1
[[store-unlock]] M[addr] := dstreg

I call this the "local implementation" of the atomic RMW.

However, many systems have found it useful to create "remote implementations" of the atomic RMW.

For example, special [[bus or interconnect]] transactions could be created that indicate that the memory controller should do the atomic RMW operation itself.

For example, the [[NYU Ultracomputer]] allowed many [[fetch-and-add]] operations to be exported. They could be performed at the ultimate destination, the memory controller. But they could also be handled specially at intermediate nodes in the interconnection fabric, where conflicting atomic RMWs to the same location could be combined, forwarding only their net effect on towards the destination.

You can imagine such [[remote atomic RMW]]s as being performed by

[[LOCK INC mem]]
send the command "fetch-and-add M[addr],1" to the outside system

This is the advantage of CISCy atomic RMW instructions: they can hide the implementation.
If the interconnect fabric supports only [[fetch-and-add]] but not [[fetch-and-OR]], then the two respective [[microoperation expansions]] might be:

[[LOCK INC mem]]
send the command "fetch-and-add M[addr],1" to the outside system

oldval := [[load-locked]] M[addr]
newtmp := oldval OR src1
[[store-unlock]] M[addr] := newtmp

For that matter, the CISC implementation migh well use [[LL/SC]] optimistic concurrency control within its [[microflow]].

It is more work to create such a [[remote atomic RMW]] with [[LL/SC]], since the very strength of LL/SC
- that the user can place almost arbitrarily complicated instruction sequences between the [[LL]] and [[SC]]
is also its weakness.
If you wanted to create a [[remote atomic RMW]] implementation that supported just, say, [[fetch-and-add]],
then you would have to create an [[idiom recognizer]] that recognized
code sequences such as:

L: oldval := load-linked MEM
newval := oldval+a
store-conditional( MEM, newval )
if SC_failed goto L
return oldval

Actually, you might not have to recognize the full sequence, complete with the retry loop.
You would only need to recognize the sequence

oldval := load-linked MEM
newval := oldval+a
store-conditional( MEM, newval )

and convert that into whatever bus operations create the [[remote atomic RMW]].

Recognizing a 3-instruction idiom is not THAT hard.
But in general [[it is harder to combine separate instructions that to break up complex instructions in an instruction decoder]].

One might even imagine creating a fairly complete set of instructions executable on the interconnect.

=== The best of both worlds? ===

Let me end this section by describing a hybrid implementation of the CISC atomic RMWs that combines the best of both worlds wrt [[local and remote atomic RMWs]].

oldval := [[load-locked/fetch-and-op]] M[addr]
newtmp := oldval OP src1
[[store-unlock/fetch-and-op]] M[addr] := newtmp
return oldval

If M[addr] is exclusive in the locak cache, then the [[load-locked/fetch-and-op]] and [[store-unlock/fetch-and-op]]
are equivalent to ordinary [[load-locked]] and [[store-unlock]].

If M[addr]] is not present in the local cache, then [[load-locked/fetch-and-op]] is equivalent to sending a remote atomic RMW fetch-and-op command to the external network. The [[store-unlock/fetch-and-op]] may then become a NOP (although it may be used for bookkeeping purposes).

If M[addr] is present in the local cache in a shared state, then we *COULD* perform the atomic RMW both locally and remotely. The remote version might invalidate or update other copies. However, if the operation is supposed to be [[serializing]], then the local update cannot proceed without coordinating with the remote.

== Extending the semantics ==


(Paul Clayton added this.)

Because existing LL/SC definitions either fail or have [[undefined semantics]] if other memory operations are performed, the semantics can be safely extended without sacrificing [[compatibility]]. (While it may be possible in some architectures ([[TBD]]: check whether this is the case for Alpha, MIPS, ARM, Power, etc.) that a non-SC memory operation was used by software to cancel an atomic section, such is generally discouraged and unlikely to have been done.) Such extensibility could allow increasingly more extensive forms of transactional memory to be implemented without requiring additional instructions. While an implementation of an older architecture would not generate an illegal instruction, an architecture which guaranteed failure on additional memory accesses would guarantee either deadlock (which would simplify debugging and fixing) or the use of an already in place software fallback (which, while slower, would maintain correctness). In architectures which use a full-sized register to return the success condition and define a single success condition, that register could be used to communicate failure information.

A natural progression for such extensions might be: to initially constrain the transaction to an aligned block of 32 or 64 bytes or the implementation-defined unit of coherence (This last could cause a significant performance incompatibility but would maintain the semantics.), perhaps with a single write, then to expand the semantics to something like those presented in Cliff Click's "IWannaBit!" where any cache miss or eviction cancels the reservation (For many uses, this would require that the cache be warmed up before the transaction can succeed. If the reservation mechanism is expensive, prefetching data outside of the atomic section might be preferred over retrying the transaction.), perhaps extending writes to an entire cache line, and then to an arbitrary number of memory accesses.

Providing non-transactional memory accesses within the atomic section would be more difficult. It would be possible to define register numbers which when used as base pointers for memory accesses have non-transactional semantics (See [[Semantic register numbers]]). This definition would have to be made at the first extension of LL/SC and it might be difficult to establish compiler support.

A non-transactional extension of LL/SC would be to guarantee the success of the store-conditional under certain limited conditions. E.g., a simple register-register operation might be guaranteed to succeed, providing the semantics of most RMW instructions. Such a guaranteed transaction could itself be extended to allow more complex operations, though it might be desirable to allow a transaction that can be guaranteed to fail if lock semantics are not desired under contention. E.g., a thread might work on other tasks rather than wait for the completion of a heavily contended locked operation.

== See Also ==

* [[locked versus atomic]]
* [[Returning the old versus new value: fetch-and-add versus add-to-mem]]
* [[atomic RMW spinloops and CISCy instructions]]

Wednesday, September 28, 2011

Xfinity failing every afternoon 2:30-6pm approx; comcast DNS problems fixed by google

Xfinity failing every afternoon 2:30-6pm approx; comcast DNS problems fixed by google

My comcast internet cvonnection was failing regularly, every weekday, starting at approximately 2:30pm, lasting until 6 or 7pm. This started happening shortly after labor day.

I conjecture that this was probably happening because schoolkids were getting off school (which in my area happens circa 2:30pm) and starting to use the net.

The failures were accompanied by error meessages of the form "DNS error... DNS lookup failure... Error 105 net:ERR_NAME_NOT_RESOLVED". That error message was from a Windows 7 machine. I also regularly have a Vista machine and an Apple Mac connected to my home network. The Vista machine occasionally worked when the Windows 7 machine did not, and the Apple Mac more frequently worked; but all regularly failed about the same time in the afternoon. Yes, I tried powering off/disconnecting/connecting only 1 machine at a time. Not a cure.

Coincidence? I think not.

I reported the problem to Comcast, who very quickly sent over a "cable guy" who apologized that my 7 year old Linksys cable modem had not been replaced. So now I have a new SMC cable modem, and the net performs much better, up to 10X faster than in earlier speed tests. This is great!

But, it did not fix the DNS problem.

My Windows 7 machine (actually, the Windows 7 laptop my employer gave me) still got DNS errors between roughly 2:30 and 7pm.

My Vista machine sometiimes ran when the Windows 7 machine failed.

And the Apple Mac seemed nearly always to work, failing only occasionally.

But the Windows 7 machine reliably failed between 2:30 and 7pm.

OK, so I get more serious. I check the DNS settings. Finally, I do what I should have done in the first place: I switch the Windows 7 machine DNS server frm "automatic", i.e. from comcast's DNS servers, to Google's DNS servers, and

Fixed! :-)

Conclusion: Comcast DNS servers have problems. Particularly when the kids get home from school.

But I'm happy that I got a new, faster, cable modem out of this. And it is interesting that different machines with their different OSes behaved differently. Different timeouts (DNS settings that don't appear at the usual configuration places)? BTW, the machine that failed most often is the fastest machine by far.


BTW, I did try fixing DNS earlier, to no avail. OpenDNS did not seem to work. But I wasn't that serious, so I did not try all possibilities. It is possible that both the cable modem needed to be upgraded, AND the DNS needed to point away from comcast's DNS servers.

Saturday, September 17, 2011

Pipelined Bloom Filter


= [[Pipelined Bloom Filter]] =

Occasionally one wants to keep track of what (micro)instructions are in a pipeline.
For example, an aggressively speculative x86 may need to
[[snoop the pipeline]]
to detect [[self-modifying code]] that writes to instructions
that are already in the process of being executed.

A brute force technique might be to [[CAM]] all of the instructions in the pipeline.
Or to mark I$ lines or even ITLB pages when instructions from them are fetched for execution.
These can be expensive and/or complex.

This suggests a Bloom filter implementation:
set bit(s) in a bitvector according to a hash of the physical address of instructions
as the enter the pipeline.

unfortunately, a straightforward Bloom filter only ever sets bits.
It only ever accumulates. Eventually this will reduce in false conflicts being detected.

The basic idea behind a [[pipelined Bloom filter]] is to have a set of Bloom filters.
E.g. 2 1024 bit Bloom filters.
A new Bloom filter is (re)allocated by clearing it every 128 instructions.
It accumulates bits as instructions as executed for the next 128 instructions.
It is checked, but new instructions are not allocated into it, for the next 128 instructions.
It is then reallocated, and cleared.
Two such filters therefore can ping pong, providing coverage to the pipeline.

== Generalized ==

A set of K filters, used to protect a pipeline.

Use filter j for a specified period of time
- e.g. L instructions, or perhaps until B bits are set,
or C entries in a counted Bloom filter are saturated.
Switch to the next.
During this period one sets bits in the filter as new items are placed in the pipeline.

After this "allocation" period, continue to use the filter for checking, without writing new entries.

Deallocate the filter when it is guaranteed that no items inserted into the filter remain in the system, i.e. the pipeline.

== Other Terminology ==

It appears that there are some papers that use the term "pipelined Bloom filter"
to refer solely to a physical or electronic pipeline, and not the the logical structure
of overlapping Bloom filters that can be deallocated.

Saturday, September 10, 2011

Google Desktop (Search) Dying

Google is end-of-life-ing several products, including Google Desktop, in particular Google Desktop Search, which I use constantly, and Google Notebook and Sidewiki, which I would like to use, but which, fortunately, I decided not to become dependent on.

Google says

    Desktop: In the last few years, there’s been a huge shift from local to cloud-based storage and computing, as well as the integration of search and gadget functionality into most modern operating systems. People now have instant access to their data, whether online or offline. As this was the goal of Google Desktop, the product will be discontinued on September 14, including all the associated APIs, services, plugins, gadgets and support.

Fair enough. I *am* trying to keep most of my information in the cloud. But unfortunately I do not have full connectivity - large areas of rural Oregon still do not have cell phone connectivity, voice, let alone data, and, for that matter, I am too cheap to pay for data connectivity at all times. Plus, some things I am not allowed to keep in the cloud, and

So it looks like I will have to use Microsoft Desktop Search, as built in to Windows 7, for stuff on my laptop. I prefer the Google user interface - but since Google Desktop Search will be discontinued, I don't have much choice. And because I am using Microsoft Desktop Search, I have started using Bing for my web search. Here, I still prefer the Google user interface, which I am familiar with - but Bing's user interface is more similar to MS Desktop Search, and I value consistency between Desktop and Web more than I value the more familiar Google Search interface.

I suspect that Microsoft Desktop Search cannot search my Google Chrome browser history. Just like Google Desktop Search could not search my Microsoft OneNote.

Unfortunately, this begins to look like dominoes falling. So long as I have substantial local data on my Microsoft PC, I am attracted, despite my personal preferences, to Microsoft tools like Microsoft Desktop Search, Bing, and Internet Explorer.

Not everyone will be in this situation. Some people will be purely cloud based. But, unfortunately, I am not. Yet?

I wonder if this will affect my next choice of cell phone? I was leaning towards Android, but perhaps this is enough to tilt me towards Windows 8.

Monday, September 05, 2011

Bit Interleaving, Instruction Set Support for


= Instruction Set Support for Bit Interleaving =

== What is Bit Interleaving? ==

[[Bit interleaving]] is a very common form of multi-operand bit permutation.
It has been important enough to occasionally warrant special instruction set support.

2-way bit interleaving of A and B;

for i = 0 to N do
result.bit[2*i] := A.bit[i]
result.bit[2*i+1] := B.bit[i]

3-way bit interleaving off A, B, C:

for i = 0 to N do
result.bit[3*i] := A.bit[i]
result.bit[3*i+1] := B.bit[i]
result.bit[3*i+2] := C.bit[i]

M-way bit interleaving of A[i]

for i = 0 to N do
for j = 0 to M do
result.bit[N*i+j] := A[j].bit[i]

Note that since bit interleaving combines multiple operands into a single operand, issues of overflow may arise.
For example, the result is often used in address arithmetic,
in which case the inputs (aka the coordinates)
are probably not full width addresses.
The bit interleaving may be used in only a subrange of the input bits.

== Why Bit Interleaving? ==

See [[http://en.wikipedia.org/wiki/Morton_number_(number_theory) Wikipedia Morton numbers]].

Bit interleaving is often used as a component of hash functions.

In particular, bit interleaving of 2D and 3D coordinates in computer graphics
often produces hash functions that have better cache locality than uninterleaved coordinates used to index an array.

== Bit Interleaving Instructions ==

Intel Larrabee has 1:1 and 2:1 interleave instructions, in both scalar and vector form:

vbitinterleave11pi: 1:1 bit-interleave vectors
vbitinterleave21pi: 2:1 bit-interleave vectors

bitinterleave11: 1:1 bit-interleave scalar
bitinterleave21: 2:1 bit-interleave scalar

(See [[http://drdobbs.com/architecture-and-design/216402188?pgno=5 Mike Abrash's Dr. Dobb's article]] which comments that
"is useful for generating swizzled addresses, particularly in conjunction with vinsertfield, for example in preparation for texture sample fetches (volume textures in the case of vbitinterleave21pi".)

The 1:1 interleaving instruction obviously accomplishes 2D interleaving,
i.e. the interleaving of two coordinates for a 2D system.

The 2:1 interleaving instruction accoplishes 3D interleaving.
Or, rather, first one interleaves two coordinates, say X and Y, using 1:1,
and then one interleaves the third coordinate, Z, using 2:1.

4:1 interleaving is accomplished by two 1:1 interleaves.
However, operations on objects and spaces of more than 1D, 2D, or 3D are much less common.

== Expand (Interleave with 0) ==

Although 1:1 and 2:1 interleaving "fit" into commion instruction sets,
higher degrees of interleaving may not - too many operands may be required.

Some support may be afforded by bit expansion,
or interleaving with 0.
This takes 2 (or 3) inputs:
* the bitvector to which 0s will be inserted between elements
* the number of 0 bits to be inserted between elements
* possibly an offset to cause the result to be aligned appropriately for the last step.

The last step would be ORing the results together.
(Or equivalently, ANDing).

Saturday, August 13, 2011



= Background: 4-cycle GPU SIMD instruction groups =

The most popular GPUs circa 2010 - Nvidia, AMD/ATI, Intel Gen - share the followng characteristics:
* they have [[SIMD or SIMT]] [[coherent threading]]
* the SIMD is 16-way spatially:
** i.e. each SIMD engine is circa 16 [[lanes]] wide
** although the definition of the lane varies, e.g. from 32 bit wide scalar for Nvidia to 5-way VLIW for AD/ATI
* the SIMD is 4 way temporally
* i.e. every [[SIMD instruction group]], [[wavefront]] or [[warp]], occupies the 16 lanes for 4 cycles.

This wiki page discusses the temporal aspect,
at least one specific advantage of taking at least 4 cycles peer [[SIMD instruction group]].
This is a particular case of
[[register file port reduction]].

= [[Register File Port Reduction using Time-based SIMD]] =

For simplicity, let us ignore the possibility of [[spatial SIMD]].
Let us assume that we have only one ALU,
taking 2 inputs peer cycle and producing a single output per cycle.
(Or possibly 3 inputs, if we want to consider [[multiply-add]].)
(Or more different combinations of inputs and outputs, if we are particularly aggressive.)

dest := opcode( src, src2, src3 )

On a conventional scalar microarchitecture, this would require 3 read ports and 1 write port on the register file per cycle.
(Let's forget the possibility of architecturally requiring the registers to belong to banks with non-interfering ports.)

Now, instead, let us imagine that we are dealing with a SIMD instruction group or wavefront, i=0,3.
And let us say that it is distributed over time, 4 successive clock cycles t0..t3

t0: dest[0] := opcode( src[0], src2[0], src3[0] )
t1: dest[1] := opcode( src[1], src2[1], src3[1] )
t2: dest[2] := opcode( src[2], src2[2], src3[2] )
t3: dest[3] := opcode( src[3], src2[3], src3[3] )

Now, instead of reading 3 separate (e.g. 32) values per cycle, and writing a single such value every cycle,
we could make each separate access 4X larger, but only do one such 4X larger access every cycle:

t-3: src3_4x := RF.read( src3 )[0:127]
t-2: src2_4x := RF.read( src2 )[0:127]
t-1: src1_4x := RF.read( src1 )[0:127]
t0: dest_4x[0:31] := opcode( src1_4x[0:31], src2_4x[0:31], src3_4x[0:31] )
t1: dest_4x[32:63] := opcode( src1_4x[32:63], src2_4x[32:63], src3_4x[32:63] )
t2: dest_4x[64:95] := opcode( src1_4x[64:95], src2_4x[64:95], src3_4x[64:95] )
t3: dest_4x[96:127] := opcode( src1_4x[96:127], src2_4x[96:127], src3_4x[96:127] )
t3+1: RF.write( dest )[0:127] := dest_4x[0:127]

(As is typical in these discussions, we are constrained by the lack of universally understood [[slicing notation]])

This shows that you can get away with a single 4X wider read/write port,
at the cost of muxing/delay elements:

The pipeline might look like

W0.write W1.exec_0
W2.read1* W1.exec_1
W2.read2* W1.exec_2
W2.read3* W1.exec_3
W1.write W2.exec_0*
W3.read1 W2.exec_1*
W3.read2 W2.exec_2*
W3.read3 W2.exec_3*
W2.write* W3.exec_0
... ...

although of course it can be extended to be deeper, tolerating more ALU latency, etc.

= 4X Wider versus 4X Time-skewed =

This register file port reduction can be obtained in at least two ways:

* 4x wider
** by performing 4X wider reads and writes
** in a single "cycle" of RF access
** to and from 4X wider temporary registers
** and then muxing 1/4 of those wider registers in any given cycle of execution

* 4x time skewing
** by having 4 register files
** each skewed from the others by 1 cycle
** e.g. providing the same register number to each, but delayed 1 cycle for each skewing

These are very similar.

The 4X wider approach has the advantage of being very easy to express in conventional synthesized logic,
even though it might be marginally more expensive in full custom logic.
As of 2011 time skewed register files are hard to express in RTL languages.

= Why 4X? =

4X is a convenient power of two,
and [[powers of 2 are convenient in computer architecture]].

4X matches 3 reads and 1 write
for a register file with a single read and write port in any cycle.
Which conveniently matches multiply add, A:=B*C+D, one of the most common operations in graphics
- and GPUs were one of the first places this occurred.

3X could be a possibility, if restricted to [[2-input operations]].
But I suspect that 4X is just plain nicer.

Larger than 4X also a possibility. But, again, 4X is the smallest convenient size
that reaps most of the benefits.

Thursday, July 14, 2011

Things you can do with a double precision multiplier

Consider a processor that has hardware sufficient to do a double precision multiplication.
Or perhaps a multiply-add.

([[WLOG]] we will talk about floating point; similar discussion applies to 2X wide integer.)

A double precision multiplier overall has a multiplier capable of forming 4 single precision products.
Let us draw something like this, except using byte wide multipliers as subcomponents rather than single bit:
Compare the partial products array for 64x64 to 32x32:

or briefly if the numbers are (Ahi+Blo)*(Xhi+Ylo)

the summation network is similarly larger, although the final [[CPA (Carry Propagate Adder)]] is "only" 2X wider
(more than 2X more gates, but only a bit deeper in logic depth).

Given such a double precision multiplier, we can synthesize several different types of single precision operations

* [[LINE]]: v=p*u+q
:: this is just an [[FMA]], with possibly a different arrangement of inputs
* [[PLANE]]: w=p*u+q*v+r
:: this has 2 multiplications, although the sum network must be adjusted to align the products differently. This can be achieved by shifting the input to the upper half of the multiplier array
* [[LRP]] or [[BLEND]]: w=u*x+v*(1-x)
:: This is like [[PLANE]], except the second multiplier part is calculated. Like 2X, etc. products for advanced [[Booth encoding]]?

The above uses the 4 multiplications of the double precision multiplier,
but only uses 2 of them.
We can be more aggressive, trying to use all 4 - but then the summation network needs considerable adjustment.

An arbitrary 2D outer product:
[[OUTER2]] = (a b) X (x y) =
ax ay
bx by
although this causes some difficulties because it needs to write back an output twice as wide as its inputs.

[[CMUL (Complex multiply)]]: can be achieved using this multiplier: (a+bi) X (x+yi) = ax-by + (ay+bx)i
although once again there are difficulties with alignment in the summation network.

Saturday, July 09, 2011

Paradox and Consistency

I am fascinated by paradox. I am interested in the possibility of logical systems which are not consistent but where the inconsistency is somehow limited, so that it does not propagate and pollute the whole system. I am unhappy with systems where, given any inconsistency, any theorem can be proven.

I suspect that there may be much existing work in this area. Certainly I am aware of Russel's "set of all sets that do not contain themselves". And with Godel (although I probably do not understand Godel as well as I would like.). I should probably research the literature.

But since this is a hobby, I thought I might begin by writing down some simple observations. Not new - not to me, and probably well known.

But fun.

Consider simple systems of statements:

A single statement system:

S0: "This statement is false." The classic Liar's Paradox. The statement is neither false nor true: some say that it reflects an extra logic value, "paradox". Three valued logic.

A two statement system:

S1: "S2 is false"
S2: "S1 is false"

This system is self referential, but not necessarily paradoxical. It might be that S1 is true, in which case S2 is false, confirming that S1 is true. Or, vice versa. Thus, it is not contradictory. Either situation is consistent. But, from the outside, we don't know which.

What should we call this? Bistable? Metastable, if more than 2 stable states? Unknowable?

A three statement system:

S1: "S2 is false"
S2: "S3 is false"
S3: "S4 is false"

This is paradoxical, in the same way that the single statement Liar's Paradox is paradoxical.

Conjecture: I suspect that be a true paradox, then the feedback loop must have an odd number of stages. Whereas to be bistable or metastable, it must have an even number of stages.

Compare to inverter rings or storage bitcells with cross coupled inverters in computer electronics.

Q: can you construct a system that has interescting self referential rings of odd and even length?

Wednesday, July 06, 2011

Address Generation Writeback


= Hardware Motivation for Address Register Modifying Addressing Modes =

Apart from the software datastructure motivation,
there is a hardware motivation for addressing modes such as pre-increment or decrement.

In more generality - addressing modes that calculate an address, and then save the results of that calculation
in a register - typically one of the registers involved in the addressing mode.

Consider [[Mitch Alsup's Favorite Addressing Mode]],
the x86's [[Base+Scaled-Index+Offset]].

A succession of such instructions might look like:

;an [[RMW]]
r1 := load( M[ rB+rO*4+offset1 ] )
r1 := ... some calculation involving r1 ...
store( M[ rB+rO*4+offset1 ] := r1 )


; nearby loads
r1 := load( M[ rB+rO*4+offset1 ] )
r2 := load( M[ rB+rO*4+offset2 ] )
r3 := load( M[ rB+rO*4+offset3 ] )

Notice the possibility of [[CSE (Common Subexpression Elimination)]]:

For the [[RMW]] case, this is straightforward:

rAtmp := [[lea]]( M[ rB+rO*4+offset1 ] )
r1 := load( rAtmp ] )
r1 := ... some calculation involving r1 ...
store( M[ rAtmp ] := r1 )

Is this a win? It depends:
* in terms of performance, on a classic RISC, probably note: the extra instruction adds latency, probably costs a cycle.
* in terms of power, possibly

For the nearby load case, we could do:

rAtmp := lea( M[ rB+rO*4 ] )
r1 := load( M[ rAtmp+offset1 ] )
r2 := load( M[ rAtmp*4+offset2 ] )
r3 := load( M[ rAtmp*4+offset3 ] )


rAtmp := lea( M[ rB+rO*4+offset1 ] )
r1 := load( M[ rAtmp ] )
r2 := load( M[ rAtmp*4+(offset2-offset1) ] )
r3 := load( M[ rAtmp*4+(offset3-offset1) ] )

Is this a win? Again, it depends:
* probably not in performance
* possibly in terms of power.

    As an aside, let me note that if the base address rB+rO*4 is aligned, then the subsequent addresses could use [[OR indexing]] rather than [[ADD indexing]]. Which may further save power. At the cost of yet another piece of crap in the instruction set.

To avoid conundrums like this - is it better to [[CSE]] the addressing mode or not?
- processors like the [[AMD 29K]]
abandoned addressing modes except for [[absolute addressing mode]] and [[register indirect addressing mode]]
- so to form any nomn-trivial address you had to save the results to a register, and thereby were
encouraged to CSE it whenever possible.

Generalized pre-increment/decrement is a less RISCy approach to this.
Instead of encouraging CSE via separate instructions,
it encourages CSE by making it "free" to save the result ofthe addressing mode calculation.

:Let's call this [[Address Generation Saving]], or, if it modifies a register used in the addressing mode, [[Address Register Modification]]:

In our examples:

rAtmp := [[lea]]( M[ rB+rO*4+offset1 ] )
r1 := load( rAtmp := rB+rO*4+offset1 ] )
r1 := ... some calculation involving r1 ...
store( M[ rAtmp ] := r1 )

; nearby load
r1 := load( M[ rAtnp :=rB+rO*4+offset1 ] )
r2 := load( M[ rAtmp*4+(offset2-offset1) ] )
r3 := load( M[ rAtmp*4+(offset3-offset1) ] )

Note that the nearby case that changes the offsets is simpler than the nearby case that saves only part ofthe address calculation:

;Using C's comma expression:
r1 := load( M[ (rAtmp := rB+rO*4, rAtmp+offset1) ] )
r2 := load( M[ rAtmp*4+offset2 ] )
r3 := load( M[ rAtmp*4+offset3 ] )

:: Again, [[OR indexing]] may benefit, for addressing mode [[(Base+Scaled-Index)}Offset]]

What is the probem with this?
* Writeback ports

Such an [[Address Generation Saving]] requires an extra writeback port, in addition to the result of an instruction like load.

For this reason, some instruction sets have proposed to NOT have [[address generation writeback]] for instructions that write other results, such as load,
but only to have [[address generation writeback]] for instructions that do not have a normal writeback, such as a store.

: Glew opinion: workable, minor benefit, but ugly.

Except that pre-increment/decre more general form:

Post-increment and decrement goes further, generalized in this way:
then the address calculation looks like

{ old := address_reg; address_reg := new_address; return old }

Not only does this require a writeback port for the updated address register,
but it also requires two paths from the AGU or RF to where the address is used
- one for the address, the other for the updated address_reg.

OVERALL OBSERVATION: addressing modes begin the slippery slope to VLIW.

Monday, July 04, 2011

Mitre Top 25 SW bugs


Rank Score ID Name
[1] 93.8 CWE-89 Improper Neutralization of Special Elements used in an SQL Command ('SQL Injection')
[2] 83.3 CWE-78 Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection')

-- the top two are handled by taint propagation. "Improper Neutralization" => "quotification"

[3] 79.0 CWE-120 Buffer Copy without Checking Size of Input ('Classic Buffer Overflow')

-- handled by stuff like Milo Martin's Hardbound and Softbound.

[4] 77.7 CWE-79 Improper Neutralization of Input During Web Page Generation ('Cross-site Scripting')

-- tainting/quotification

[5] 76.9 CWE-306 Missing Authentication for Critical Function
[6] 76.8 CWE-862 Missing Authorization

-- more like real SW bugs, with no crutches like bounds checks of tainting.

-- except that we can imagine that capability systems might help here - it might be made mandatory to activate some capabilities positively, rather than passively inheriting them.

[7] 75.0 CWE-798 Use of Hard-coded Credentials
[8] 75.0 CWE-311 Missing Encryption of Sensitive Data

[9] 74.0 CWE-434 Unrestricted Upload of File with Dangerous Type

-- while we might hope that a capability system might upload such a file but strip it of all ability to do harm, experience suggests that a user will just blindly give the file whatever privileges it asks for.

[10] 73.8 CWE-807 Reliance on Untrusted Inputs in a Security Decision

-- tainiting

[11] 73.1 CWE-250 Execution with Unnecessary Privileges

-- capability systems are supposed to make it easier to implement the "Principle of Least Privilege".  However, it still happens.

-- we can imagine security tools that profile privileges - that determine what privileges have never been used, to allow narrowing the privileges as much as possible.

[12] 70.1 CWE-352 Cross-Site Request Forgery (CSRF)

-- tainting, capabilities

[13] 69.3 CWE-22 Improper Limitation of a Pathname to a Restricted Directory ('Path Traversal')
[14] 68.5 CWE-494 Download of Code Without Integrity Check
[15] 67.8 CWE-863 Incorrect Authorization

[16] 66.0 CWE-829 Inclusion of Functionality from Untrusted Control Sphere

-- tainting

[17] 65.5 CWE-732 Incorrect Permission Assignment for Critical Resource
[18] 64.6 CWE-676 Use of Potentially Dangerous Function
[19] 64.1 CWE-327 Use of a Broken or Risky Cryptographic Algorithm

[20] 62.4 CWE-131 Incorrect Calculation of Buffer Size

-- classic buffer overflow, such as Martin Hardbound/Softbound.

[21] 61.5 CWE-307 Improper Restriction of Excessive Authentication Attempts
[22] 61.1 CWE-601 URL Redirection to Untrusted Site ('Open Redirect')

[23] 61.0 CWE-134 Uncontrolled Format String
[24] 60.3 CWE-190 Integer Overflow or Wraparound

-- these two, 23 and 24, are somewhat handled by buffer overflow checking as in Hardbound and Softbound - the security flaws with integer overflow often manifest themselves as unchecked buffer overflows.

-- however, they can manifest themselves in other ways as well.

[25] 59.9 CWE-759 Use of a One-Way Hash without a Salt

Wednesday, June 29, 2011

Notions of Time in Computer Architecture


Advanced computer architecture admits several different notions of time.

= Real World Time =

First, there is [[real time]] or [[wall clock time]]. Time in the real world (forgetting Einsteinian relativistic issues for the moment).

Unfortunately, the term [[real time]], which would be the best term for this, is often overloaded with the additional meaning
of pertaining to a system that must respond in real time - e.g. "Drop the rods in the nuclear power plant quickly or the nuclear fuel will melt down."
Or "adjust the aileron pitch with a few milliseconds, or the plane will stall and crash."
These applications are often called [[hard real time]], and there are also [[soft real time]] applications that have less stringent timing requirements.

Because of the use of real time in phrases such as [[real time operating system]] or [[real time response]], terms such as [[wall clock time]] may be used.

== Absolute versus Relative ==

Real time (and, indeed, other times) may be absolute or relative.

Absolute time is wall clock time, or calendar time.

Relative time is the amount of elapsed time since some other event. E.g. one might start a timer on entering a function, and read it on exit.

Obviously, you can calculate relative timer from absolute time samples.
But, it can be considerably easier to build the hardware to measure relative time that it is to measure ab solute time.
Timers used in relative timing may be in isolation, whereas clocks measuring absolute time may need to be kept in synchronization
with other clocks in other systems.

== [[Timers versus Timestamps]] ==

Oftentimes appplications do not need actual time.
Instead, what they really need is timestamps that indicate relative position.
The application may care who came first or second, but not how much time elapsed between arrivals.

Therefore, some timer architectures provide absolute time in uppper bits,
but simply indicate a unique counter in the lower bits of a timestamp.

= Von Neuman Time =

In out-of-order and speculative processors,
even when single threaded, it is often convenient to talk about the
[[Von Neumann sequence number]] or [[Von Neumann time]].

On a single threaded program, this is the sequence number at which every and any [[dynamic instruction sequence]] executes.

The [[Von Neumann time]] or sequence number for an instruction is NOT the [[wall clock time]]
at which an instruction executes.
Even on simple in-order processors, where instructions may have differing [[instruction latency]],
the VN time is not linearly related to the wall clock time.
On complex out-of-order processors instructions that are later in the Von Neumman instruction sequence
may execute earlier in wall clock time (that's what makes them out-of-order).

The [[Von Neumann time]] or sequence number may be a convenient book-keeping tool.
* it is often made available in a [[simulator]] user interface
* I have proposed using it it [[hardware data structures]] such as store buffers and schedulers.
** When I talk about an "age based scheduler", I usually mean a [[Von Neumann age]] based scheduler.
** Of course, real hardware must implement a finite number of bits, typically wrapping

== Multithreading ==

The [[Von Neumann time]] or sequence number is unambiguous for a single threaded program.

For multithreaded programs that interact, it is necessary to establish correspondences - VN1 time of thread A corresponds to VN time of thread B,
at different points if interaction.

It should be seen that this multithreaded time is a partial order. Ideally one from which a total order can be constructed.
(Although certain microarchitectural interactions may imply cycles in such a time graph.)

I know of no standard term for this [[multithreaded time]]. [[Multithreaded Von Neuman time]].

I think that it might be nice to call it [[Lamport time]],
since Leslie Lamport was a pioneer in this area.

= Pipeline Time(s) =

Any given [[dynamic instruction instance]] desn't execute at a single point in time.,
Rather, it executes in different ways at many different points in tiime:
* [[instruction fetch time]]
* [[decode time]]
* [[schedule time]]
* [[operand ready time]]
* [[execution time]]
* [[register read time]]
* [[writeback time]]
* re-execution time]] or [[replay time]]
* [[commit time]]
* [[retirement time]]

Plus of course there is the
* [[Von Neuman time]] or [[Von Neumann sequence number]] of a [[dynamic instruction instance]].

For that matter
* the [[Von Neumann time]] relative to a given thread of execution
may differe from the
* [[Von Neuman time]] relative to the processor it is executed on
because of [[timesharing]] or [[multitasking]] between threads or processors.

= Simulator Times =

* Real time of the machine on which the simulator runs: [[simulator real time]]
* [[Simulated real time]] of the workload running on the simulator: [[simulatee real time]]
* [[Von Neumann time]] or sequence number for the workload running on the simulator.

This can be fully recursive. A simulator may be ruunning on top of a simulator on top of ...

= Conclusion =

Be careful what notion of time you are talking about.
It may not be the notion of time for the person you are talking to.

[[Microarchitecture]] is concerned with [[real time]].
[[Macroarchitecture]] is concerned more with abstract time, such as [[Von Neumann time]].
Systems that have different microarchitectures and hence different real time behavior,
may execute the same [[macroarchitecture]].
But, of course, [[macroarchitecture]] features such as [[ISA]] greatly influence [[real time]] performance.

Thursday, June 23, 2011

cygwin startxwin problem with multiple displays

I use cygwin on my Windows PCs. Especially my tablet PC. AFAIK there is no equally good UNIX or Linux or Apple equivalent for a Microsoft Windows tablet PC. Tell me if there is. (Yes, I write and draw with the pen.))

Unfortunately, my cygwin setup has decayed - things stopped working on various upgrades, and I do not always fix the problem immediately. Here's one:

For a while, I would use startxwin to start up X Windows in Cygwin's multiwindow mode. At some point, it worked, but the xterm started by default was off obably havescreen, could not be seen. If I maximized I could find it, but when non-maximized, not seen.

Turns out that startxwin's default was to start "xterm -geometry +1+1 ...".

And I have been using multiple displays - e.g. currently have 4. Irregularly arranged. So that there is no display at +1+1 in the bounding box of all displays.

FIX: provide my own ~/.startxwinrc, with no location specified for the xterm.


Meta-issue: almost want a sanity check, to make sure that no windows are raised offscreen, invisibly.

Overall, there are probably a lot of such issues relating to assumptions about screen geometry being regular and rectangular, that no longer hold in the world of multiple displays.


Another meta-issue: took a while to find out where the xterm was coming from.

It would have been nice to have an audit trail that allowed me to find where the xterm came from.

Similarly, to find out where its geometry parameter came from.

As it was, I just guessed "startxwin". But I am embarassed to say how long it took...

Sunday, June 19, 2011

A vocabulary of terms for memory operation combining

There are many different terms for the common optimization of memory operation combining or coalescing.
See the discussions of the [[difference between write combining and write coalescing]],
as well as [[write combining]] and [[write coalescing]],
and [[NYU Ultracoomputer]] [[fetch-and-op]] [[combining networks]].

Here is a by no means complete list of such terms:

* [[combining]]
** [[write combining]]
*** [[write combining buffer]]s
*** [[write combining cache]]
** [[read combining]]
** [[fetch-and-op combining]]
*** [[NYU Ultracomputer]] [[fetch-and-op]] [[combining networks]]

* [[coalescing]]
:: on GPUs...
** [[write coalescing]]
** [[read coalescing]]

I have not heard anyone talk about fetch-and-op or atomic RMW coalescing
as distinct from the historic
[[NYU Ultracomputer]] [[fetch-and-op combining]].
But I suppose that inevitably this will arise.

* [[squashing]]
:: mainly refers to finding that an operation is unnecessary, and cancelling it - or at least the unnecessary part
** [[load squashing]]
::: On P6, referred to a cache miss finding that there was already a cache miss in flight for the same cache line. No need to issue a new bus request, but the squashed request was not completely cancelled - arrangement was made so that data return from the squashing cache miss would
** [[store squashing]]
::: I am not aware of this being done, but it could refer to comparing store addresses in a [[store buffer]], and determining that an older store is unnecessary since a later store completely overwrites it (and there would be no intervening operations that make it visible according to the [[memory ordering model]]). (Actually, I am not sure that there could be such, but I am leaving the possibility as a placeholder.)
::: [[Write combining]] accomplishes much the same thing, although [[store squashing]] in the store buffer gets it done earlier.
::: Note that this is similar to [[store buffer combining]] - combining entries in the store buffer, separate from a [[write combining buffer]].

Again, I have not heard of fetch-and-op squashing, although I am sure that it could be done, e.g. for lossy operations such as AND and OR (a later fetch-and-OR with a bitmask that completely includes an earlier...).

* [[snarfing]]
:: I have usually seen this term used in combination with a [[snoopy bus]], although it can also be used with other interconnects. [[Read snarfing]] means that a pending request from P1 that has not yet received the bus in arbitration observes the data for the same cachelines from a different processor P0, and "snarfs" the data as it goes by.
Depending on the cache protocol, it may be necessary to assert a signal to put data into [[shared (S) state]] rather than [[exclusive (E) state]].

I am not sure what [[write snarfing]] would look like.
An [[update cache]] is somewhat like, but it updates an existing cache line, not an existing request.
I.e. an update cache protocol snarfs write data from the bus or interconnect to update a cache line.
Whereas a pending load can snarf data either from read replies or write data transactions on the bus or interconnect.
I.e. a read can snarf from a read or a write.
But does a write "snarf"?
More like a write may combine with another write -
[[snoopy bus based write combining]],
as distinct from [[buffer based write combining]].