--- created_at: '2011-06-05T00:32:25.000Z' title: 'Tom Duff: Reading Code From Top to Bottom (1999)' url: http://iq0.com/notes/deep.nesting.html author: gnosis points: 56 story_text: '' comment_text: num_comments: 22 story_id: story_title: story_url: parent_id: created_at_i: 1307233945 _tags: - story - author_gnosis - story_2620872 objectID: '2620872' year: 1999 --- ### Reading Code From Top to Bottom This is a particularly simple example of a style of code that I often see (this example is a piece of *prman*, sanitized to hide the function of the code it comes from): int match(struct table *tab, char *arg) { int i, found = 0; if (tab != NULL) { for(i=0; initems; ++i) if (tab->item[i].name == arg) break; found = (i < tab->nitems); } return found; } This function determines whether `tab` points to a table containing an item whose name is `arg`. I would write the above like this: int match(struct table *tab, char *arg){ int i; if(tab==NULL) return 0; for(i=0;i!=tab->nitems;i++) if(tab->item[i].name==arg) return 1; return 0; } This avoids deep nesting. (The original wasn't *very* deep, but it's only a simple example.) It avoids re-testing the loop condition after the loop finishes. It avoids scattering multiple assignments to `found` throughout the code. It makes the lifetime of the loop index be just the loop. (Maybe I'm an old fart for this habit, which I cling to from my Fortran days \[early 1970's\], but limiting the lifetimes of variables makes things easier for register allocators and for people reading your code.) Most important, it reads from top to bottom, by which I mean that each step follows from the preceding step, and the code stops (that is, returns) as soon as it knows the answer. It tells a simple story: > If you want to find an element of `tab->item` with `name==arg`, > > - First check that `tab` points to something. > - Then check each element of the pointed-to array, returning if you > find a match. > - If you didn't find it, it's not there. The old version of the story is extremely roundabout: > If you want to find an element of `tab->item` with `name==arg`, > > - First, assume its not there. > - Then check that `tab` points somewhere. If so > - go through the array until you see what you're after. > - When you're done looking, you found a match if and only if you > didn't go through the whole array. > - Having done all that, inform the caller about what you determined. During the Structured Programming Era, programmers were often taught a style very much like the old version. The language they were trained in was normally Pascal, which only allowed a single return from a procedure, more or less mandating that the return value appear in a variable. Furthermore, teachers, influenced by Bohm and Jacopini (*Flow Diagrams, Turing Machines and Languages with only two formation rules*, Comm. ACM 9\#5, 1966, pp 366-371), often advocated reifying control structure into Boolean variables as a way of assuring that programs had reducible flowgraphs. But reducible flowgraphs can still be deeply nested and therefore hard to read. Long chains of reasoning or long sequences of actions are not a particular problem to understand, especially for people of a mathematical or otherwise analytic bent. Sequences that branch, so that a particular step may have multiple antecedents, all but one of which will be far from it on the page are a different matter. Reading them requires keeping in the mind a list of suspended options while paying attention to something else, a combination of mental activities that people are notoriously bad at. You should try as much as possible to structure your programs so that they proceed step-by-step. Whenever nesting gets deep, or code refers to results computed more than a few steps previously, you're headed for trouble. A common indicator of trouble with excessive nesting is narrow tab stops. Deeply nested code, at eight spaces per tab (as is usual on UNIX) quickly migrates over to the right margin. The easy way to fix this is to indent four spaces at a time (or even two\!). The right way is to rewrite it so that it proceeds from top to bottom, not from outside to inside, left to right. A similar phenomenon happens in mathematical proofs. Most proofs involve a chain of implications, each usually following from its immediate prececessor. When the proof requires a step that follows from some non-immediate predecessor, we are brought up short -- violation of the orderly step-by-step progress of the proof disturbs the narrative rhythm of the proof, and is a likely signal of a difficulty, either in the proof or in the reader's understanding. Mathematicians routinely hide these narrative anomalies in lemmas -- proof subroutines. In fact, there is evidence that mathematicians choose the theorems that they work on to be amenable to proof using simple structures. For example, when Doug McIlroy investigated the problem of automatically proving theorems of Euclidean geometry (personal communication, 1984 -- I don't believe this work was ever published) he discovered that when theorems are rewritten as systems of simultaneous equations (sometimes inequalities are required as well) the systems are usually \`triangular' (they're not generally linear, so they're not strictly triangular, but you know what I mean) and the proofs mostly proceed immediately by back-subsititution. ### *Else* Considered Harmful It is often useful to eschew the *else* clause of the *if* statement. Instead of writing int f(int x){ int v; if(b(x)) v=f1(x); else v=f2(x); return v; } it usually pays to write int f(int x){ if(b(x)) return f1(x); return f2(x); } Even in a simple function like this, the increase in clarity is obvious to me. In a more complicated situation, where `v=f1(x)` is replaced by a more complicated computation with more nested ifs, the reduction in nesting can be dramatic. Using the single-tailed if gives you some extra freedom in wording your code to read top-to-bottom. As an extreme example, the following int g(int x){ if(b1(x)){ if(b2(x)){ if(b3(x)){ if(b4(x)){ if(b5(x))return g1(x); return g2(x); } return g3(x); } return g4(x); } return g5(x); } return g6(x); } should be rewritten as int g(int x){ if(!b1(x)) return g6(x); if(!b2(x)) return g5(x); if(!b3(x)) return g4(x); if(!b4(x)) return g3(x); if(!b5(x)) return g2(x); return g1(x); } If this function had involved two-tailed rather than one-tailed ifs, reversing the sense of the tests would not change the depth to which the code was nested, and there would likely be no net change in readability. ### Another Example Right now (Oct 7, 1998) I'm working on an interpreter for a programming language that looks a lot like C, but lacks a few features. (Most of the restrictions are inspired by concerns discussed in this note.) To test its call and return mechanisms I thought I'd try running Ackerman's function. I have a C version sitting around that just parrots the usual lambda-calculus definition: int a(int i, int j){ return i?a(i-1, j?a(i, j-1):1):j+1; } int ackerman(int i){ return a(i, i); } The language I'm testing is missing the `?:` conditional expression, so I had to reword function `a` using ordinary if statements. Doing this mechanically, you get int a(int i, int j){ if(i){ if(j) return a(i-1, a(i, j-1)); return a(i-1, 1); } return j+1; } Of course I thought this read awkwardly and reversed the tests, yielding int a(int i, int j){ if(i==0) return j+1; if(j==0) return a(i-1, 1); return a(i-1, a(i, j-1)); } *Much* better. ### An Aesthetic Conundrum I don't like side effects in conditionals (or embedded in most expressions, but let's just think about conditionals for now.) Code almost always reads better if you separate the side effects from the tests. For example, I write ``` do c=getchar(); while(strchr(" \t\n", c)); ``` in preference to ``` while(strchr(" \t\n", c=getchar())); ``` although many people prefer the latter. This is justifiable as reading top-to-bottom instead of inside-out. Nevertheless, today (Sep 4, 1998) I found myself writing a function that read like this: char *skip(char *s, int c){ while(*s!='\0' && *s++!=c); return s; } This function just skips past the first occurrence of `c` in `s`, stopping also if it finds a *nul*. I did this after examining several alternatives, like char *skip(char *s, int c){ while(*s!='\0' && *s!=c) s++; if(*s==c) s++; return s; } and char *skip(char *s, int c){ for(;*s!='\0';s++) if(*s==c) return s+1; return s; } Neither of the last two satisfied me, the first because it retests `*s==c` after the loop and the second because the two halves of the test are unnaturally separated. I wound up using the first version in spite of the side effect in the test because it struck me as less infelicitous than the others -- that doesn't mean I'm happy with it. ### Another Problematic Fragment Yesterday (Feb 21, 1999) I rewrote my *com* command (because I was working on a un-networked machine at home, and I was too lazy to get up and transfer it on a floppy\!) The [original](../duffgram/com.lic) contains this code (omitting some error handling), which looks for the sequence `/*%` in a file: ``` for(;;){ c=getc(f); if(c==EOF){ /* error code goes here */ } if(c=='/'){ if((c=getc(f))=='*' && (c=getc(f))=='%') break; ungetc(c, f); /* might be another slash! */ } } ``` Today, I wouldn't write it that way -- I'd avoid the `if` with side-effects. The first version I wrote yesterday looked something like this: ``` while((c=getc(f))!=EOF){ if(c=='/'){ c=getc(f); if(c=='*'){ c=getc(f); if(c=='%'){ /* code that reacts to finding /*% */ } else ungetc(c, f); } else ungetc(c, f); } } /* error code goes here */ ``` Several things struck me about this: - It's obviously written inside out -- the code reads horizontally, not vertically. - I deliberately avoided side effects in the `if` tests, but in a moment of weakness retained `while((c=getc(f))!=EOF)`. - I don't like that it behaves oddly if `ungetc(EOF, f)` does the wrong thing. The standard says that it has no effect -- a buggy `stdio` might push `EOF&0xff` back. - Besides, `ungetc` is a silly thing to do with a character that's just going to reappear in `c` immediately. Keeping in mind the shallow-nesting and top-to-bottom-reading ideas that I've been writing about, I redid it like this: ``` for(;;){ c=getc(f); MightBeSlash: if(c==EOF){ /* error code goes here */ } if(c!='/') continue; c=getc(f); if(c!='*') goto MightBeSlash; c=getc(f); if(c!='%') goto MightBeSlash; break; } ``` Now it reads top to bottom and, to me at least, says what it means. It's a net improvement over the other version. But surely this is no good. The `goto`s are fairly benign, but still grate. Most egregious is the loop whose last statement is `break;`\! I let this code sit for an hour (it was lunch time) and then rewrote it. I can't remember what I eventually used, but it was not very satisfying. Looking at the various versions today, the one that strikes me as best is the original, with all of its calls to `getc` inside the tests. (Of course, if I were writing this code professionally, I'd probably call `fgets` and `strstr` rather than paw through the file byte by byte.) What does that say about all my preaching? Only that truly general principals are hard to come by, and ultimately every case is a different case. Now that I have two examples of side effects inside conditionals that I think read better than the alternatives, maybe I'd better think harder about what principles are involved. ### What About `goto`? *If you want to go somewhere, goto is the best way to get there.* \-Ken Thompson How can I say that a program with two `goto`s is \`a net improvement' over one with none? The problem with `goto` is that it promotes spaghetti -- when you're trying to understand a program, every time you see a label, you must look at the code at each `goto` that targets the label to see what the sequence of operations might be. This is the same problem that deep nesting causes -- code at any number of far away places might immediately precede the code at a label, and it is often difficult to keep in mind all the things that might have occurred when you reach the label. During the structured programming era, `goto` was a major bugaboo, so much so that most programmers today, with little or no experience thinking about code that uses `goto`, nevertheless recoil in revulsion at its use. The whole structured programming debate was started by [Edsger W. Dijkstra's essay](http://www.acm.org/classics/oct95/) *Go To Statement Considered Harmful*, Communications of the ACM, Vol. 11, No. 3, March 1968, pp. 147-148. One of the oft-cited bolsters of the argument to eschew the `goto` was that it appeared to be very difficult to describe `goto` in terms of C. A. R. Hoare's system of axiomatic semantics. Hoare's system involved specifying a predicate that gives the initial assumptions (pre-conditions) about a program. For each statement type there is a rule of inference that describes how to derive a post-condition from the statement's pre-condition. For example, let's think about the statment `while(B)S`, where `B` is an arbitrary Boolean (with no side effects\!) and `S` is any statement (that doesn't transfer control out of the loop\!). Hoare writes `P{S}Q` to mean that if proposition `P` is true before `S` is executed, then you can infer that `Q` is true afterwards. So, Hoare's rule for the `while` statement is: All this says is that If executing `S` doesn't falsify `P` whenever `B` is also initially true, then executing the loop `while(B) S` preserves `P` and guarantees that `B` becomes false. (`P` is usually called the *loop invariant*.) As far as Hoare's semantics is concerned, this is the meaning of a while loop. As I said, people were fond of pointing out that `goto` was \`\`obviously'' hard to formulate in terms of Hoare's semantics. So imagine everyone's surprise when Hoare discovered the following simple rule: First, for each label `L` in the program, we attach a predicate `PL` that is to be true whenever we reach `L`. Then, the `goto` rule is All this says is that if some proposition is true before you execute a `goto` then it's also true after you get there and furthermore, you can't fall through the `goto`, because `false` is true if you do. We also need a rule that says what happens at a label. This is just That is, a statement still does the same thing if you put a label on it. For me, the Hoare semantics is a good way of thinking about `goto`: a `goto` is a way of getting to somewhere that you can do the same things as where you are now. The main practical consequences of this view are that I think very carefully about whether the points at the label and at the `goto` are really semantically equivalent, and that I try to carefully choose labels to say what the corresponding predicate is. For example, in the loop in the previous section, the label `MightBeSlash` suggests that that we've read a character into `c`, but have not yet been checked to see whether or not it's a slash. (Of course, if you look at the code, you see that another possibility is that it might be an EOF, but I took the view that `MightBeSlashOrEOF` was an awfully long name for a label that's never mentioned more than 6 lines away from its definition. Maybe I should have called it `ExamineNewCharacter` or something like that.) The best reference on programming in a disciplined way using `goto` is still D. E. Knuth's paper Structured Programming with go to Statements, ACM Computing Surveys, Vol. 6, No. 4, December 1974, pp. 261-302. I think that everything I have to say about `goto` is in his paper.