16 KiB
created_at | title | url | author | points | story_text | comment_text | num_comments | story_id | story_title | story_url | parent_id | created_at_i | _tags | objectID | year | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2011-06-05T00:32:25.000Z | Tom Duff: Reading Code From Top to Bottom (1999) | http://iq0.com/notes/deep.nesting.html | gnosis | 56 | 22 | 1307233945 |
|
2620872 | 1999 |
Reading Code From Top to Bottom
This is a particularly simple example of a style of code that I often see (this example is a piece of prman, sanitized to hide the function of the code it comes from):
int
match(struct table *tab, char *arg)
{
int i, found = 0;
if (tab != NULL) {
for(i=0; i<tab->nitems; ++i)
if (tab->item[i].name == arg)
break;
found = (i < tab->nitems);
}
return found;
}
This function determines whether tab
points to a table containing an
item whose name is arg
.
I would write the above like this:
int match(struct table *tab, char *arg){
int i;
if(tab==NULL) return 0;
for(i=0;i!=tab->nitems;i++)
if(tab->item[i].name==arg)
return 1;
return 0;
}
This avoids deep nesting. (The original wasn't very deep, but it's
only a simple example.) It avoids re-testing the loop condition after
the loop finishes. It avoids scattering multiple assignments to found
throughout the code. It makes the lifetime of the loop index be just the
loop. (Maybe I'm an old fart for this habit, which I cling to from my
Fortran days [early 1970's], but limiting the lifetimes of variables
makes things easier for register allocators and for people reading your
code.)
Most important, it reads from top to bottom, by which I mean that each step follows from the preceding step, and the code stops (that is, returns) as soon as it knows the answer. It tells a simple story:
If you want to find an element of
tab->item
withname==arg
,
- First check that
tab
points to something.- Then check each element of the pointed-to array, returning if you find a match.
- If you didn't find it, it's not there.
The old version of the story is extremely roundabout:
If you want to find an element of
tab->item
withname==arg
,
- First, assume its not there.
- Then check that
tab
points somewhere. If so
- go through the array until you see what you're after.
- When you're done looking, you found a match if and only if you didn't go through the whole array.
- Having done all that, inform the caller about what you determined.
During the Structured Programming Era, programmers were often taught a style very much like the old version. The language they were trained in was normally Pascal, which only allowed a single return from a procedure, more or less mandating that the return value appear in a variable. Furthermore, teachers, influenced by Bohm and Jacopini (Flow Diagrams, Turing Machines and Languages with only two formation rules, Comm. ACM 9#5, 1966, pp 366-371), often advocated reifying control structure into Boolean variables as a way of assuring that programs had reducible flowgraphs.
But reducible flowgraphs can still be deeply nested and therefore hard to read. Long chains of reasoning or long sequences of actions are not a particular problem to understand, especially for people of a mathematical or otherwise analytic bent. Sequences that branch, so that a particular step may have multiple antecedents, all but one of which will be far from it on the page are a different matter. Reading them requires keeping in the mind a list of suspended options while paying attention to something else, a combination of mental activities that people are notoriously bad at. You should try as much as possible to structure your programs so that they proceed step-by-step. Whenever nesting gets deep, or code refers to results computed more than a few steps previously, you're headed for trouble.
A common indicator of trouble with excessive nesting is narrow tab stops. Deeply nested code, at eight spaces per tab (as is usual on UNIX) quickly migrates over to the right margin. The easy way to fix this is to indent four spaces at a time (or even two!). The right way is to rewrite it so that it proceeds from top to bottom, not from outside to inside, left to right.
A similar phenomenon happens in mathematical proofs. Most proofs involve a chain of implications, each usually following from its immediate prececessor. When the proof requires a step that follows from some non-immediate predecessor, we are brought up short -- violation of the orderly step-by-step progress of the proof disturbs the narrative rhythm of the proof, and is a likely signal of a difficulty, either in the proof or in the reader's understanding. Mathematicians routinely hide these narrative anomalies in lemmas -- proof subroutines.
In fact, there is evidence that mathematicians choose the theorems that they work on to be amenable to proof using simple structures. For example, when Doug McIlroy investigated the problem of automatically proving theorems of Euclidean geometry (personal communication, 1984 -- I don't believe this work was ever published) he discovered that when theorems are rewritten as systems of simultaneous equations (sometimes inequalities are required as well) the systems are usually `triangular' (they're not generally linear, so they're not strictly triangular, but you know what I mean) and the proofs mostly proceed immediately by back-subsititution.
Else Considered Harmful
It is often useful to eschew the else clause of the if statement. Instead of writing
int f(int x){
int v;
if(b(x)) v=f1(x);
else v=f2(x);
return v;
}
it usually pays to write
int f(int x){
if(b(x)) return f1(x);
return f2(x);
}
Even in a simple function like this, the increase in clarity is obvious
to me. In a more complicated situation, where v=f1(x)
is replaced by a
more complicated computation with more nested ifs, the reduction in
nesting can be dramatic. Using the single-tailed if gives you some extra
freedom in wording your code to read top-to-bottom. As an extreme
example, the following
int g(int x){
if(b1(x)){
if(b2(x)){
if(b3(x)){
if(b4(x)){
if(b5(x))return g1(x);
return g2(x);
}
return g3(x);
}
return g4(x);
}
return g5(x);
}
return g6(x);
}
should be rewritten as
int g(int x){
if(!b1(x)) return g6(x);
if(!b2(x)) return g5(x);
if(!b3(x)) return g4(x);
if(!b4(x)) return g3(x);
if(!b5(x)) return g2(x);
return g1(x);
}
If this function had involved two-tailed rather than one-tailed ifs, reversing the sense of the tests would not change the depth to which the code was nested, and there would likely be no net change in readability.
Another Example
Right now (Oct 7, 1998) I'm working on an interpreter for a programming language that looks a lot like C, but lacks a few features. (Most of the restrictions are inspired by concerns discussed in this note.) To test its call and return mechanisms I thought I'd try running Ackerman's function. I have a C version sitting around that just parrots the usual lambda-calculus definition:
int a(int i, int j){
return i?a(i-1, j?a(i, j-1):1):j+1;
}
int ackerman(int i){
return a(i, i);
}
The language I'm testing is missing the ?:
conditional expression, so
I had to reword function a
using ordinary if statements. Doing this
mechanically, you get
int a(int i, int j){
if(i){
if(j) return a(i-1, a(i, j-1));
return a(i-1, 1);
}
return j+1;
}
Of course I thought this read awkwardly and reversed the tests, yielding
int a(int i, int j){
if(i==0) return j+1;
if(j==0) return a(i-1, 1);
return a(i-1, a(i, j-1));
}
Much better.
An Aesthetic Conundrum
I don't like side effects in conditionals (or embedded in most expressions, but let's just think about conditionals for now.) Code almost always reads better if you separate the side effects from the tests. For example, I write
do
c=getchar();
while(strchr(" \t\n", c));
in preference to
while(strchr(" \t\n", c=getchar()));
although many people prefer the latter. This is justifiable as reading top-to-bottom instead of inside-out.
Nevertheless, today (Sep 4, 1998) I found myself writing a function that read like this:
char *skip(char *s, int c){
while(*s!='\0' && *s++!=c);
return s;
}
This function just skips past the first occurrence of c
in s
,
stopping also if it finds a nul.
I did this after examining several alternatives, like
char *skip(char *s, int c){
while(*s!='\0' && *s!=c) s++;
if(*s==c) s++;
return s;
}
and
char *skip(char *s, int c){
for(;*s!='\0';s++) if(*s==c) return s+1;
return s;
}
Neither of the last two satisfied me, the first because it retests
*s==c
after the loop and the second because the two halves of the test
are unnaturally separated.
I wound up using the first version in spite of the side effect in the test because it struck me as less infelicitous than the others -- that doesn't mean I'm happy with it.
Another Problematic Fragment
Yesterday (Feb 21, 1999) I rewrote my com command (because I was
working on a un-networked machine at home, and I was too lazy to get up
and transfer it on a floppy!) The original
contains this code (omitting some error handling), which looks for the
sequence /*%
in a file:
for(;;){
c=getc(f);
if(c==EOF){ /* error code goes here */ }
if(c=='/'){
if((c=getc(f))=='*' && (c=getc(f))=='%') break;
ungetc(c, f); /* might be another slash! */
}
}
Today, I wouldn't write it that way -- I'd avoid the if
with
side-effects. The first version I wrote yesterday looked something like
this:
while((c=getc(f))!=EOF){
if(c=='/'){
c=getc(f);
if(c=='*'){
c=getc(f);
if(c=='%'){
/* code that reacts to finding /*% */
}
else ungetc(c, f);
}
else ungetc(c, f);
}
}
/* error code goes here */
Several things struck me about this:
- It's obviously written inside out -- the code reads horizontally, not vertically.
- I deliberately avoided side effects in the
if
tests, but in a moment of weakness retainedwhile((c=getc(f))!=EOF)
. - I don't like that it behaves oddly if
ungetc(EOF, f)
does the wrong thing. The standard says that it has no effect -- a buggystdio
might pushEOF&0xff
back. - Besides,
ungetc
is a silly thing to do with a character that's just going to reappear inc
immediately.
Keeping in mind the shallow-nesting and top-to-bottom-reading ideas that I've been writing about, I redid it like this:
for(;;){
c=getc(f);
MightBeSlash:
if(c==EOF){ /* error code goes here */ }
if(c!='/') continue;
c=getc(f);
if(c!='*') goto MightBeSlash;
c=getc(f);
if(c!='%') goto MightBeSlash;
break;
}
Now it reads top to bottom and, to me at least, says what it means. It's a net improvement over the other version.
But surely this is no good. The goto
s are fairly benign, but still
grate. Most egregious is the loop whose last statement is break;
!
I let this code sit for an hour (it was lunch time) and then rewrote it.
I can't remember what I eventually used, but it was not very satisfying.
Looking at the various versions today, the one that strikes me as best
is the original, with all of its calls to getc
inside the tests. (Of
course, if I were writing this code professionally, I'd probably call
fgets
and strstr
rather than paw through the file byte by byte.)
What does that say about all my preaching? Only that truly general principals are hard to come by, and ultimately every case is a different case. Now that I have two examples of side effects inside conditionals that I think read better than the alternatives, maybe I'd better think harder about what principles are involved.
What About goto
?
If you want to go somewhere,
goto is the best way to get there.
-Ken Thompson
How can I say that a program with two goto
s is `a net improvement'
over one with none? The problem with goto
is that it promotes
spaghetti -- when you're trying to understand a program, every time you
see a label, you must look at the code at each goto
that targets the
label to see what the sequence of operations might be. This is the same
problem that deep nesting causes -- code at any number of far away
places might immediately precede the code at a label, and it is often
difficult to keep in mind all the things that might have occurred when
you reach the label.
During the structured programming era, goto
was a major bugaboo, so
much so that most programmers today, with little or no experience
thinking about code that uses goto
, nevertheless recoil in revulsion
at its use.
The whole structured programming debate was started by Edsger W.
Dijkstra's essay Go To Statement
Considered Harmful, Communications of the ACM, Vol. 11, No. 3, March
1968, pp. 147-148. One of the oft-cited bolsters of the argument to
eschew the goto
was that it appeared to be very difficult to describe
goto
in terms of C. A. R. Hoare's system of axiomatic semantics.
Hoare's system involved specifying a predicate that gives the initial
assumptions (pre-conditions) about a program. For each statement type
there is a rule of inference that describes how to derive a
post-condition from the statement's pre-condition. For example, let's
think about the statment while(B)S
, where B
is an arbitrary Boolean
(with no side effects!) and S
is any statement (that doesn't transfer
control out of the loop!). Hoare writes P{S}Q
to mean that if
proposition P
is true before S
is executed, then you can infer that
Q
is true afterwards. So, Hoare's rule for the while
statement is:
All this says is that If executing S
doesn't falsify P
whenever B
is also initially true, then executing the loop while(B) S
preserves
P
and guarantees that B
becomes false. (P
is usually called the
loop invariant.) As far as Hoare's semantics is concerned, this is the
meaning of a while loop.
As I said, people were fond of pointing out that goto
was
``obviously'' hard to formulate in terms of Hoare's semantics. So
imagine everyone's surprise when Hoare discovered the following simple
rule:
First, for each label L
in the program, we attach a predicate PL
that is to be true whenever we reach L
. Then, the goto
rule is
All this says is that if some proposition is true before you execute a
goto
then it's also true after you get there and furthermore, you
can't fall through the goto
, because false
is true if you do.
We also need a rule that says what happens at a label. This is just
That is, a statement still does the same thing if you put a label on it.
For me, the Hoare semantics is a good way of thinking about goto
: a
goto
is a way of getting to somewhere that you can do the same things
as where you are now. The main practical consequences of this view are
that I think very carefully about whether the points at the label and at
the goto
are really semantically equivalent, and that I try to
carefully choose labels to say what the corresponding predicate is. For
example, in the loop in the previous section, the label MightBeSlash
suggests that that we've read a character into c
, but have not yet
been checked to see whether or not it's a slash. (Of course, if you look
at the code, you see that another possibility is that it might be an
EOF, but I took the view that MightBeSlashOrEOF
was an awfully long
name for a label that's never mentioned more than 6 lines away from its
definition. Maybe I should have called it ExamineNewCharacter
or
something like that.)
The best reference on programming in a disciplined way using goto
is
still D. E. Knuth's paper Structured Programming with go to Statements,
ACM Computing Surveys, Vol. 6, No. 4, December 1974, pp. 261-302. I
think that everything I have to say about goto
is in his paper.