Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Sunday, December 21, 2008

Concurrent programming bug in X/VNC/sysadmin distributed config files

Below is a bug report, relating to the timing of a windowing system and window manager.

It is caused by a || b || c in the UNIX shell,
and the fact that X and VNC do not really bind programs to a particular instance of a display.

It amuses me that, after all this time, even such a simplistic concurrent programming example can have a bug. What does this say about the brave new world of multicore?




There is a timing bug in eclogin, specifically in one of the files installed by eclogin -i -f: $HOME/.vnc/xstartup.
The file ends with:

( PATH-TO-WHEREVER-FVWM-IS-INSTALLED/bin/fvwm2 ||
twm ||
mwm ||
olwm ||
xmessage -geometry -10-10 "FATAL ERROR - NO WINDOW MANAGER FOUND in vnc/xstartup" )




This looks innocuous.

But if the user types something like the following
vncserver -kill :2; vncserver
then the fvwm2 of the OLD vncserver will fail, so the script falls through and invokes, say, twm.
But the NEW vncserver may start up before the twm of the old vncserver starts up; indeed, it may start up before the old fvwm2 fails.
In any case, the mwm from the old vncserver may start up and connect to the new vncserver session (the new server is likely to reuse display :2),
resulting in very odd, hard-to-debug behavior, like "Why am I getting mwm, or twm, or whatever, when I am trying to use the EC standard fvwm2 window manager?"
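
The fall-through itself can be reproduced with no X server involved. Here is a minimal sketch, assuming stand-in functions: fake_wm and run_chain are hypothetical names of mine, not part of eclogin or VNC. Each stand-in "window manager" announces itself and then exits with the given status, the way a real one exits non-zero when its server is killed.

```shell
# fake_wm NAME STATUS
# Hypothetical stand-in for a window manager: announce that we started,
# then exit with STATUS (non-zero models "my X server went away").
fake_wm() {
    echo "started $1"
    return "$2"
}

# Same shape as the end of xstartup: each command runs only if the one
# before it returned non-zero, no matter how long that one had been running.
run_chain() {
    fake_wm fvwm2 1 ||
    fake_wm twm 1 ||
    fake_wm mwm 0 ||
    echo "no window manager found"
}

run_chain
```

The shell cannot tell "fvwm2 was never found" from "fvwm2 ran for an hour and then lost its server": both are just a non-zero status, so both start the next candidate.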


I conjecture that this behavior was not seen on old machines, that were slower and/or did not have multiple processors.
I conjecture that it is almost impossible to reproduce this behavior unless you type vncserver -kill :2; vncserver on one line; it can happen on a slow machine, but is much more likely on a fast, multiprocessor one.
I conjecture that only an anal retentive computer user like me would have found this bug - particularly one who is an old sysadmin, who has seen bugs like this.


---



The fix:
A partial fix is to do:




if [ -x PATH-TO-WHEREVER-FVWM-IS-INSTALLED/fvwm2 ] ; then
    PATH-TO-WHEREVER-FVWM-IS-INSTALLED/fvwm2
elif [ -x "$PATH_TO_TWM/twm" ] ; then
    "$PATH_TO_TWM/twm"
...
fi


i.e. this prevents the script from falling through to the next window manager if the fvwm2 command was found and executed for a while.
A better fix would be to see if fvwm2 actually executed. Checking for exit code 0 is not sufficient, since it is usual and normal for fvwm2 to return non-zero when vncserver -kill :2 is called.
A stopgap would be to just document this in ~/.vnc/xstartup - noting something like "You either have to wait a while, or shut down your window manager, before starting a new vncserver".
A better long term fix would be for vncserver to create a nonce, or some other identity cookie.
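
The last two ideas can be sketched together. This is only a sketch under loud assumptions: MIN_RUN_SECONDS, ran_for_a_while, same_session, try_wm and the token file are all hypothetical names of mine. As far as I know vncserver writes no such nonce today, so the token check only shows what xstartup could do if it did. "Ran for a while" stands in for "actually executed", since the exit code cannot be trusted.

```shell
# Hypothetical threshold: a window manager that survived this long is
# assumed to have started properly, so its later exit must not trigger
# a fallback to the next candidate.
MIN_RUN_SECONDS=2

# ran_for_a_while CMD [ARGS...]
# Runs CMD; returns 0 if it stayed up at least MIN_RUN_SECONDS, whatever
# its exit status, and 1 if it exited sooner.
# (date +%s is supported by GNU and BSD date, but is not strictly POSIX.)
ran_for_a_while() {
    rfw_start=$(date +%s)
    "$@"
    rfw_end=$(date +%s)
    [ $((rfw_end - rfw_start)) -ge "$MIN_RUN_SECONDS" ]
}

# same_session TOKEN_FILE EXPECTED
# Returns 0 only if TOKEN_FILE still holds the token seen at startup.
# ASSUMPTION: vncserver would write a fresh nonce there for each session.
same_session() {
    [ -r "$1" ] && [ "$(cat "$1")" = "$2" ]
}

# try_wm TOKEN_FILE EXPECTED CMD [ARGS...]
# Returns 0 ("stop trying") if the session has been replaced, or if CMD
# ran for a while; returns 1 ("try the next one") only on a quick failure.
try_wm() {
    tw_file=$1
    tw_token=$2
    shift 2
    same_session "$tw_file" "$tw_token" || return 0
    ran_for_a_while "$@"
}

# Intended use at the end of xstartup (not executed here):
#   try_wm "$token_file" "$token" fvwm2 ||
#   try_wm "$token_file" "$token" twm ||
#   xmessage "FATAL ERROR - NO WINDOW MANAGER FOUND"
```

The elapsed-time test is a heuristic, not a proof that the window manager came up. The token check is what actually stops an old xstartup from attaching a fallback window manager to a new session.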

---



By the way: it is possible that this could be a security hole - I have not checked. E.g. an attacker could be constantly trying to attach a window manager to a vncserver he does not own.

===



In case you do not believe this is real, here is a shell session showing me killing such window managers one after the other.
I had been thrashing, killing and creating new vncservers in a fairly tight loop, debugging a problem, when I noticed it:

/users/glew/ 86 : vncserver -kill :2
.../vnc/E4.2.5/vncserver -kill :2
Killing Xvnc process ID 3471


/users/glew/ 87 : ps x | grep wm
5869 ? S 0:00 twm
1593 pts/16 S 0:00 PATH-TO-WHEREVER-FVWM-IS-INSTALLED/fvwm2

3953 pts/16 S 0:00 mwm

3955 pts/16 S+ 0:00 grep wm

/users/glew/ 88 : kill -9 3953
3953: No such process

/users/glew/ 89 : ps x | grep wm
5869 ? S 0:00 twm
1593 pts/16 S 0:00 PATH-TO-WHEREVER-FVWM-IS-INSTALLED/fvwm2

3988 pts/16 S+ 0:00 grep wm

/users/glew/ 90 : kill -9 1593

/users/glew/ 91 : ps x | grep wm
5869 ? S 0:00 twm
3992 pts/16 S 0:00 twm
3996 pts/16 S+ 0:00 grep wm

/users/glew/ 92 : kill -9 3992

/users/glew/ 93 : ps x | grep wm
5869 ? S 0:00 twm
4024 pts/16 S 0:00 mwm
4028 pts/16 S+ 0:00 grep wm

/users/glew/ 94 : kill -9 4024

/users/glew/ 95 : ps x | grep wm
5869 ? S 0:00 twm
4036 pts/16 S+ 0:00 grep wm



Code inspection reveals the problem. Of course, anyone at all familiar with concurrent programming should recognize the problem.