Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Saturday, February 04, 2012

Recognizing errors in build scripts

A very important thing, when writing a makefile or some other form of build recipe, is recognizing when an error has occurred that might prevent you from going on to the next step (of the build DAG).

This is easy when commands are well behaved - when they indicate errors tidily (e.g. following the UNIX convention: exit status 0 for success, nonzero for error).

It is less easy for tools that are not so well behaved: tools whose error indications are unreliable, both in the UNIX exit code and in whatever output they send to stdout or stderr. Tools whose past users seem to have nearly always run them by hand, so that there is folklore about which errors can be ignored and which errors matter.

Better if the folklore is recorded, e.g. on a webpage or wiki. Bad if the folklore is passed on only by word of mouth. Worse if no single person knows all the folklore.

Not so bad if the tool produces a well characterized set of outputs, e.g. output files, and if a file exists only if it was correctly built. Bad if the output file can be created but left malformed, incomplete, etc.

I am tempted to say "BKM: make all output go to output.tmp, and then mv to output.final when okay."  But in this same effort I have found several examples where the output file is perfectly fine, but an error occurred later in the tool producing it.
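The write-to-temp-then-rename BKM can be sketched in a few lines of Python. The file names and the `produce` callback here are made up for illustration; the point is that the final name only ever appears if production ran to completion:

```python
import os

def write_atomically(path, produce):
    """Write to a temp file, renaming to the real name only on success."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        produce(f)          # any exception here leaves only the .tmp behind
    os.replace(tmp, path)   # atomic rename on POSIX: no partial output.final

write_atomically("output.final", lambda f: f.write("result data\n"))
```

Note that, per the caveat above, this only guarantees the file is *complete as written*; it cannot catch a tool that writes a fine-looking file and then fails afterwards.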

Not so bad if the output file contains error markers, such as "When processing this record, there was an error" - no matter whether this occurs at the very end or in the middle.  Bad if the output file contains no error markers, so you are left wondering, in a partially created output file, why it stopped.  Really bad if the output file format silently tolerates missing data, such as assuming it is all zeroes.  Really, Really, REALLY bad.  I wasted lots of time on that, umm, "feature".

Here are some things that I have done to cope with such error-ful output:

First, wrapperize the tools that produce unreliable error indications.

Second, in the wrapper or in the build tool, place code to look at the output. Let this "validation" code decide if it is safe to go on to the next step or not.
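A minimal sketch of such a wrapper, in Python. The command and the fatal patterns are hypothetical; the idea is simply that the wrapper's exit status comes from our own validation of the captured output, not from the unreliable tool:

```python
import re
import subprocess

def run_wrapped(cmd, fatal_patterns):
    """Run an unreliable tool; decide success by scanning its output."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    log = proc.stdout + proc.stderr
    for pat in fatal_patterns:
        if re.search(pat, log):
            return 1        # recognized non-ignorable error: stop the build
    return 0                # otherwise proceed, ignoring the flaky exit code

# e.g. treat only these messages as fatal, whatever the tool's exit status:
status = run_wrapped(["echo", "ok"], [r"FATAL", r"cannot open"])
```

The build tool (make, etc.) then sees only the wrapper's exit status, which you control.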

At first, perhaps unconditionally go on to the next step, since any particular error may be ignorable.

But as you debug stuff, start recognizing error conditions.  When you are confident that an error is non-ignorable, have the wrapper return an error.  When you are confident that the error is ignorable, or that there was no error, have the wrapper return success.

Print messages when you are in-between.  Probably continue anyway, but at least flag when your validation code - typically pattern matching on error messages - thinks that something suspicious has happened, but is not sure.

Be schizophrenic.  Don't try to give a single answer.  Have your validation code report multiple verdicts:

GOOD: I saw a passed message
BAD:!!: the log file was much tinier than it has ever been on a successful run
SUSPICIOUS:?: many compilation errors, but they were ignored in the past
BAD:!!: the expected output files were not created
Use any heuristic you can imagine that you are willing to take the time to code.
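The multi-verdict idea above might look like this in Python. The patterns, file names, and the "typical log size" threshold are all invented for the sketch; each heuristic appends a verdict rather than forcing a single pass/fail answer:

```python
import os
import re

def validate(log_text, expected_files, typical_log_size):
    """Collect GOOD/BAD/SUSPICIOUS verdicts instead of one answer."""
    verdicts = []
    if re.search(r"\bpassed\b", log_text, re.IGNORECASE):
        verdicts.append("GOOD: I saw a passed message")
    if len(log_text) < typical_log_size // 10:
        verdicts.append("BAD:!!: log far smaller than any successful run")
    if re.search(r"\berror\b", log_text, re.IGNORECASE):
        verdicts.append("SUSPICIOUS:?: errors seen, but ignored in the past")
    for f in expected_files:
        if not os.path.exists(f):
            verdicts.append("BAD:!!: expected output %s not created" % f)
    return verdicts
```

A caller can then decide policy separately: fail the build on any BAD, print SUSPICIOUS verdicts and continue, and so on.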

Probably most important: save examples of good and bad runs, so that you can start looking for patterns.