hn-classics/_stories/1999/2620872.md

16 KiB

created_at title url author points story_text comment_text num_comments story_id story_title story_url parent_id created_at_i _tags objectID year
2011-06-05T00:32:25.000Z Tom Duff: Reading Code From Top to Bottom (1999) http://iq0.com/notes/deep.nesting.html gnosis 56 22 1307233945
story
author_gnosis
story_2620872
2620872 1999

Reading Code From Top to Bottom

This is a particularly simple example of a style of code that I often see (this example is a piece of prman, sanitized to hide the function of the code it comes from):

int
match(struct table *tab, char *arg)
{
    int i, found = 0;

    if (tab != NULL) {
    for(i=0; i<tab->nitems; ++i)
        if (tab->item[i].name == arg)
        break;
    found = (i < tab->nitems);
    }
    return found;
}

This function determines whether tab points to a table containing an item whose name is arg.

I would write the above like this:

int match(struct table *tab, char *arg){
    int i;

    if(tab==NULL) return 0;
    for(i=0;i!=tab->nitems;i++)
        if(tab->item[i].name==arg)
            return 1;
    return 0;
}

This avoids deep nesting. (The original wasn't very deep, but it's only a simple example.) It avoids re-testing the loop condition after the loop finishes. It avoids scattering multiple assignments to found throughout the code. It makes the lifetime of the loop index be just the loop. (Maybe I'm an old fart for this habit, which I cling to from my Fortran days [early 1970's], but limiting the lifetimes of variables makes things easier for register allocators and for people reading your code.)

Most important, it reads from top to bottom, by which I mean that each step follows from the preceding step, and the code stops (that is, returns) as soon as it knows the answer. It tells a simple story:

If you want to find an element of tab->item with name==arg,

  • First check that tab points to something.
  • Then check each element of the pointed-to array, returning if you find a match.
  • If you didn't find it, it's not there.

The old version of the story is extremely roundabout:

If you want to find an element of tab->item with name==arg,

  • First, assume its not there.
  • Then check that tab points somewhere. If so
    • go through the array until you see what you're after.
    • When you're done looking, you found a match if and only if you didn't go through the whole array.
  • Having done all that, inform the caller about what you determined.

During the Structured Programming Era, programmers were often taught a style very much like the old version. The language they were trained in was normally Pascal, which only allowed a single return from a procedure, more or less mandating that the return value appear in a variable. Furthermore, teachers, influenced by Bohm and Jacopini (Flow Diagrams, Turing Machines and Languages with only two formation rules, Comm. ACM 9#5, 1966, pp 366-371), often advocated reifying control structure into Boolean variables as a way of assuring that programs had reducible flowgraphs.

But reducible flowgraphs can still be deeply nested and therefore hard to read. Long chains of reasoning or long sequences of actions are not a particular problem to understand, especially for people of a mathematical or otherwise analytic bent. Sequences that branch, so that a particular step may have multiple antecedents, all but one of which will be far from it on the page are a different matter. Reading them requires keeping in the mind a list of suspended options while paying attention to something else, a combination of mental activities that people are notoriously bad at. You should try as much as possible to structure your programs so that they proceed step-by-step. Whenever nesting gets deep, or code refers to results computed more than a few steps previously, you're headed for trouble.

A common indicator of trouble with excessive nesting is narrow tab stops. Deeply nested code, at eight spaces per tab (as is usual on UNIX) quickly migrates over to the right margin. The easy way to fix this is to indent four spaces at a time (or even two!). The right way is to rewrite it so that it proceeds from top to bottom, not from outside to inside, left to right.

A similar phenomenon happens in mathematical proofs. Most proofs involve a chain of implications, each usually following from its immediate prececessor. When the proof requires a step that follows from some non-immediate predecessor, we are brought up short -- violation of the orderly step-by-step progress of the proof disturbs the narrative rhythm of the proof, and is a likely signal of a difficulty, either in the proof or in the reader's understanding. Mathematicians routinely hide these narrative anomalies in lemmas -- proof subroutines.

In fact, there is evidence that mathematicians choose the theorems that they work on to be amenable to proof using simple structures. For example, when Doug McIlroy investigated the problem of automatically proving theorems of Euclidean geometry (personal communication, 1984 -- I don't believe this work was ever published) he discovered that when theorems are rewritten as systems of simultaneous equations (sometimes inequalities are required as well) the systems are usually `triangular' (they're not generally linear, so they're not strictly triangular, but you know what I mean) and the proofs mostly proceed immediately by back-subsititution.

Else Considered Harmful

It is often useful to eschew the else clause of the if statement. Instead of writing

int f(int x){
    int v;

    if(b(x)) v=f1(x);
    else v=f2(x);
    return v;
}

it usually pays to write

int f(int x){
    if(b(x)) return f1(x);
    return f2(x);
}

Even in a simple function like this, the increase in clarity is obvious to me. In a more complicated situation, where v=f1(x) is replaced by a more complicated computation with more nested ifs, the reduction in nesting can be dramatic. Using the single-tailed if gives you some extra freedom in wording your code to read top-to-bottom. As an extreme example, the following

int g(int x){
    if(b1(x)){
        if(b2(x)){
            if(b3(x)){
                if(b4(x)){
                    if(b5(x))return g1(x);
                    return g2(x);
                }
                return g3(x);
            }
            return g4(x);
        }
        return g5(x);
    }
    return g6(x);
}

should be rewritten as

int g(int x){
    if(!b1(x)) return g6(x);
    if(!b2(x)) return g5(x);
    if(!b3(x)) return g4(x);
    if(!b4(x)) return g3(x);
    if(!b5(x)) return g2(x);
    return g1(x);
}

If this function had involved two-tailed rather than one-tailed ifs, reversing the sense of the tests would not change the depth to which the code was nested, and there would likely be no net change in readability.

Another Example

Right now (Oct 7, 1998) I'm working on an interpreter for a programming language that looks a lot like C, but lacks a few features. (Most of the restrictions are inspired by concerns discussed in this note.) To test its call and return mechanisms I thought I'd try running Ackerman's function. I have a C version sitting around that just parrots the usual lambda-calculus definition:

int a(int i, int j){
    return i?a(i-1, j?a(i, j-1):1):j+1;
}
int ackerman(int i){
    return a(i, i);
}

The language I'm testing is missing the ?: conditional expression, so I had to reword function a using ordinary if statements. Doing this mechanically, you get

int a(int i, int j){
    if(i){
        if(j) return a(i-1, a(i, j-1));
        return a(i-1, 1);
    }
    return j+1;
}

Of course I thought this read awkwardly and reversed the tests, yielding

int a(int i, int j){
    if(i==0) return j+1;
    if(j==0) return a(i-1, 1);
    return a(i-1, a(i, j-1));
}

Much better.

An Aesthetic Conundrum

I don't like side effects in conditionals (or embedded in most expressions, but let's just think about conditionals for now.) Code almost always reads better if you separate the side effects from the tests. For example, I write

    do
        c=getchar();
    while(strchr(" \t\n", c));

in preference to

    while(strchr(" \t\n", c=getchar()));

although many people prefer the latter. This is justifiable as reading top-to-bottom instead of inside-out.

Nevertheless, today (Sep 4, 1998) I found myself writing a function that read like this:

char *skip(char *s, int c){
    while(*s!='\0' && *s++!=c);
    return s;
}

This function just skips past the first occurrence of c in s, stopping also if it finds a nul.

I did this after examining several alternatives, like

char *skip(char *s, int c){
    while(*s!='\0' && *s!=c) s++;
    if(*s==c) s++;
    return s;
}

and

char *skip(char *s, int c){
    for(;*s!='\0';s++) if(*s==c) return s+1;
    return s;
}

Neither of the last two satisfied me, the first because it retests *s==c after the loop and the second because the two halves of the test are unnaturally separated.

I wound up using the first version in spite of the side effect in the test because it struck me as less infelicitous than the others -- that doesn't mean I'm happy with it.

Another Problematic Fragment

Yesterday (Feb 21, 1999) I rewrote my com command (because I was working on a un-networked machine at home, and I was too lazy to get up and transfer it on a floppy!) The original contains this code (omitting some error handling), which looks for the sequence /*% in a file:

    for(;;){
        c=getc(f);
        if(c==EOF){ /* error code goes here */ }
        if(c=='/'){
            if((c=getc(f))=='*' && (c=getc(f))=='%') break;
            ungetc(c, f);   /* might be another slash! */
        }
    }

Today, I wouldn't write it that way -- I'd avoid the if with side-effects. The first version I wrote yesterday looked something like this:

    while((c=getc(f))!=EOF){
        if(c=='/'){
            c=getc(f);
            if(c=='*'){
                c=getc(f);
                if(c=='%'){
                    /* code that reacts to finding /*% */
                }
                else ungetc(c, f);
            }
            else ungetc(c, f);
        }
    }
    /* error code goes here */

Several things struck me about this:

  • It's obviously written inside out -- the code reads horizontally, not vertically.
  • I deliberately avoided side effects in the if tests, but in a moment of weakness retained while((c=getc(f))!=EOF).
  • I don't like that it behaves oddly if ungetc(EOF, f) does the wrong thing. The standard says that it has no effect -- a buggy stdio might push EOF&0xff back.
  • Besides, ungetc is a silly thing to do with a character that's just going to reappear in c immediately.

Keeping in mind the shallow-nesting and top-to-bottom-reading ideas that I've been writing about, I redid it like this:

    for(;;){
        c=getc(f);
    MightBeSlash:
        if(c==EOF){ /* error code goes here */ }
        if(c!='/') continue;
        c=getc(f);
        if(c!='*') goto MightBeSlash;
        c=getc(f);
        if(c!='%') goto MightBeSlash;
        break;
    }

Now it reads top to bottom and, to me at least, says what it means. It's a net improvement over the other version.

But surely this is no good. The gotos are fairly benign, but still grate. Most egregious is the loop whose last statement is break;!

I let this code sit for an hour (it was lunch time) and then rewrote it. I can't remember what I eventually used, but it was not very satisfying. Looking at the various versions today, the one that strikes me as best is the original, with all of its calls to getc inside the tests. (Of course, if I were writing this code professionally, I'd probably call fgets and strstr rather than paw through the file byte by byte.)

What does that say about all my preaching? Only that truly general principals are hard to come by, and ultimately every case is a different case. Now that I have two examples of side effects inside conditionals that I think read better than the alternatives, maybe I'd better think harder about what principles are involved.

What About goto?

If you want to go somewhere,
goto is the best way to get there.

-Ken Thompson

How can I say that a program with two gotos is `a net improvement' over one with none? The problem with goto is that it promotes spaghetti -- when you're trying to understand a program, every time you see a label, you must look at the code at each goto that targets the label to see what the sequence of operations might be. This is the same problem that deep nesting causes -- code at any number of far away places might immediately precede the code at a label, and it is often difficult to keep in mind all the things that might have occurred when you reach the label.

During the structured programming era, goto was a major bugaboo, so much so that most programmers today, with little or no experience thinking about code that uses goto, nevertheless recoil in revulsion at its use.

The whole structured programming debate was started by Edsger W. Dijkstra's essay Go To Statement Considered Harmful, Communications of the ACM, Vol. 11, No. 3, March 1968, pp. 147-148. One of the oft-cited bolsters of the argument to eschew the goto was that it appeared to be very difficult to describe goto in terms of C. A. R. Hoare's system of axiomatic semantics. Hoare's system involved specifying a predicate that gives the initial assumptions (pre-conditions) about a program. For each statement type there is a rule of inference that describes how to derive a post-condition from the statement's pre-condition. For example, let's think about the statment while(B)S, where B is an arbitrary Boolean (with no side effects!) and S is any statement (that doesn't transfer control out of the loop!). Hoare writes P{S}Q to mean that if proposition P is true before S is executed, then you can infer that Q is true afterwards. So, Hoare's rule for the while statement is:

All this says is that If executing S doesn't falsify P whenever B is also initially true, then executing the loop while(B) S preserves P and guarantees that B becomes false. (P is usually called the loop invariant.) As far as Hoare's semantics is concerned, this is the meaning of a while loop.

As I said, people were fond of pointing out that goto was ``obviously'' hard to formulate in terms of Hoare's semantics. So imagine everyone's surprise when Hoare discovered the following simple rule:

First, for each label L in the program, we attach a predicate PL that is to be true whenever we reach L. Then, the goto rule is

All this says is that if some proposition is true before you execute a goto then it's also true after you get there and furthermore, you can't fall through the goto, because false is true if you do.

We also need a rule that says what happens at a label. This is just

That is, a statement still does the same thing if you put a label on it.

For me, the Hoare semantics is a good way of thinking about goto: a goto is a way of getting to somewhere that you can do the same things as where you are now. The main practical consequences of this view are that I think very carefully about whether the points at the label and at the goto are really semantically equivalent, and that I try to carefully choose labels to say what the corresponding predicate is. For example, in the loop in the previous section, the label MightBeSlash suggests that that we've read a character into c, but have not yet been checked to see whether or not it's a slash. (Of course, if you look at the code, you see that another possibility is that it might be an EOF, but I took the view that MightBeSlashOrEOF was an awfully long name for a label that's never mentioned more than 6 lines away from its definition. Maybe I should have called it ExamineNewCharacter or something like that.)

The best reference on programming in a disciplined way using goto is still D. E. Knuth's paper Structured Programming with go to Statements, ACM Computing Surveys, Vol. 6, No. 4, December 1974, pp. 261-302. I think that everything I have to say about goto is in his paper.