hn-classics/_stories/1999/2620872.md

458 lines
16 KiB
Markdown

---
created_at: '2011-06-05T00:32:25.000Z'
title: 'Tom Duff: Reading Code From Top to Bottom (1999)'
url: http://iq0.com/notes/deep.nesting.html
author: gnosis
points: 56
story_text: ''
comment_text:
num_comments: 22
story_id:
story_title:
story_url:
parent_id:
created_at_i: 1307233945
_tags:
- story
- author_gnosis
- story_2620872
objectID: '2620872'
year: 1999
---
### Reading Code From Top to Bottom
This is a particularly simple example of a style of code that I often
see (this example is a piece of *prman*, sanitized to hide the function
of the code it comes from):
int
match(struct table *tab, char *arg)
{
int i, found = 0;
if (tab != NULL) {
for(i=0; i<tab->nitems; ++i)
if (tab->item[i].name == arg)
break;
found = (i < tab->nitems);
}
return found;
}
This function determines whether `tab` points to a table containing an
item whose name is `arg`.
I would write the above like this:
int match(struct table *tab, char *arg){
int i;
if(tab==NULL) return 0;
for(i=0;i!=tab->nitems;i++)
if(tab->item[i].name==arg)
return 1;
return 0;
}
This avoids deep nesting. (The original wasn't *very* deep, but it's
only a simple example.) It avoids re-testing the loop condition after
the loop finishes. It avoids scattering multiple assignments to `found`
throughout the code. It makes the lifetime of the loop index be just the
loop. (Maybe I'm an old fart for this habit, which I cling to from my
Fortran days \[early 1970's\], but limiting the lifetimes of variables
makes things easier for register allocators and for people reading your
code.)
Most important, it reads from top to bottom, by which I mean that each
step follows from the preceding step, and the code stops (that is,
returns) as soon as it knows the answer. It tells a simple story:
> If you want to find an element of `tab->item` with `name==arg`,
>
> - First check that `tab` points to something.
> - Then check each element of the pointed-to array, returning if you
> find a match.
> - If you didn't find it, it's not there.
The old version of the story is extremely roundabout:
> If you want to find an element of `tab->item` with `name==arg`,
>
> - First, assume its not there.
> - Then check that `tab` points somewhere. If so
> - go through the array until you see what you're after.
> - When you're done looking, you found a match if and only if you
> didn't go through the whole array.
> - Having done all that, inform the caller about what you determined.
During the Structured Programming Era, programmers were often taught a
style very much like the old version. The language they were trained in
was normally Pascal, which only allowed a single return from a
procedure, more or less mandating that the return value appear in a
variable. Furthermore, teachers, influenced by Bohm and Jacopini (*Flow
Diagrams, Turing Machines and Languages with only two formation rules*,
Comm. ACM 9\#5, 1966, pp 366-371), often advocated reifying control
structure into Boolean variables as a way of assuring that programs had
reducible flowgraphs.
But reducible flowgraphs can still be deeply nested and therefore hard
to read. Long chains of reasoning or long sequences of actions are not a
particular problem to understand, especially for people of a
mathematical or otherwise analytic bent. Sequences that branch, so that
a particular step may have multiple antecedents, all but one of which
will be far from it on the page are a different matter. Reading them
requires keeping in the mind a list of suspended options while paying
attention to something else, a combination of mental activities that
people are notoriously bad at. You should try as much as possible to
structure your programs so that they proceed step-by-step. Whenever
nesting gets deep, or code refers to results computed more than a few
steps previously, you're headed for trouble.
A common indicator of trouble with excessive nesting is narrow tab
stops. Deeply nested code, at eight spaces per tab (as is usual on UNIX)
quickly migrates over to the right margin. The easy way to fix this is
to indent four spaces at a time (or even two\!). The right way is to
rewrite it so that it proceeds from top to bottom, not from outside to
inside, left to right.
A similar phenomenon happens in mathematical proofs. Most proofs involve
a chain of implications, each usually following from its immediate
prececessor. When the proof requires a step that follows from some
non-immediate predecessor, we are brought up short -- violation of the
orderly step-by-step progress of the proof disturbs the narrative rhythm
of the proof, and is a likely signal of a difficulty, either in the
proof or in the reader's understanding. Mathematicians routinely hide
these narrative anomalies in lemmas -- proof subroutines.
In fact, there is evidence that mathematicians choose the theorems that
they work on to be amenable to proof using simple structures. For
example, when Doug McIlroy investigated the problem of automatically
proving theorems of Euclidean geometry (personal communication, 1984 --
I don't believe this work was ever published) he discovered that when
theorems are rewritten as systems of simultaneous equations (sometimes
inequalities are required as well) the systems are usually \`triangular'
(they're not generally linear, so they're not strictly triangular, but
you know what I mean) and the proofs mostly proceed immediately by
back-subsititution.
### *Else* Considered Harmful
It is often useful to eschew the *else* clause of the *if* statement.
Instead of writing
int f(int x){
int v;
if(b(x)) v=f1(x);
else v=f2(x);
return v;
}
it usually pays to write
int f(int x){
if(b(x)) return f1(x);
return f2(x);
}
Even in a simple function like this, the increase in clarity is obvious
to me. In a more complicated situation, where `v=f1(x)` is replaced by a
more complicated computation with more nested ifs, the reduction in
nesting can be dramatic. Using the single-tailed if gives you some extra
freedom in wording your code to read top-to-bottom. As an extreme
example, the following
int g(int x){
if(b1(x)){
if(b2(x)){
if(b3(x)){
if(b4(x)){
if(b5(x))return g1(x);
return g2(x);
}
return g3(x);
}
return g4(x);
}
return g5(x);
}
return g6(x);
}
should be rewritten as
int g(int x){
if(!b1(x)) return g6(x);
if(!b2(x)) return g5(x);
if(!b3(x)) return g4(x);
if(!b4(x)) return g3(x);
if(!b5(x)) return g2(x);
return g1(x);
}
If this function had involved two-tailed rather than one-tailed ifs,
reversing the sense of the tests would not change the depth to which the
code was nested, and there would likely be no net change in readability.
### Another Example
Right now (Oct 7, 1998) I'm working on an interpreter for a programming
language that looks a lot like C, but lacks a few features. (Most of the
restrictions are inspired by concerns discussed in this note.) To test
its call and return mechanisms I thought I'd try running Ackerman's
function. I have a C version sitting around that just parrots the usual
lambda-calculus definition:
int a(int i, int j){
return i?a(i-1, j?a(i, j-1):1):j+1;
}
int ackerman(int i){
return a(i, i);
}
The language I'm testing is missing the `?:` conditional expression, so
I had to reword function `a` using ordinary if statements. Doing this
mechanically, you get
int a(int i, int j){
if(i){
if(j) return a(i-1, a(i, j-1));
return a(i-1, 1);
}
return j+1;
}
Of course I thought this read awkwardly and reversed the tests, yielding
int a(int i, int j){
if(i==0) return j+1;
if(j==0) return a(i-1, 1);
return a(i-1, a(i, j-1));
}
*Much* better.
### An Aesthetic Conundrum
I don't like side effects in conditionals (or embedded in most
expressions, but let's just think about conditionals for now.) Code
almost always reads better if you separate the side effects from the
tests. For example, I write
```
do
c=getchar();
while(strchr(" \t\n", c));
```
in preference to
```
while(strchr(" \t\n", c=getchar()));
```
although many people prefer the latter. This is justifiable as reading
top-to-bottom instead of inside-out.
Nevertheless, today (Sep 4, 1998) I found myself writing a function that
read like this:
char *skip(char *s, int c){
while(*s!='\0' && *s++!=c);
return s;
}
This function just skips past the first occurrence of `c` in `s`,
stopping also if it finds a *nul*.
I did this after examining several alternatives, like
char *skip(char *s, int c){
while(*s!='\0' && *s!=c) s++;
if(*s==c) s++;
return s;
}
and
char *skip(char *s, int c){
for(;*s!='\0';s++) if(*s==c) return s+1;
return s;
}
Neither of the last two satisfied me, the first because it retests
`*s==c` after the loop and the second because the two halves of the test
are unnaturally separated.
I wound up using the first version in spite of the side effect in the
test because it struck me as less infelicitous than the others -- that
doesn't mean I'm happy with it.
### Another Problematic Fragment
Yesterday (Feb 21, 1999) I rewrote my *com* command (because I was
working on a un-networked machine at home, and I was too lazy to get up
and transfer it on a floppy\!) The [original](../duffgram/com.lic)
contains this code (omitting some error handling), which looks for the
sequence `/*%` in a file:
```
for(;;){
c=getc(f);
if(c==EOF){ /* error code goes here */ }
if(c=='/'){
if((c=getc(f))=='*' && (c=getc(f))=='%') break;
ungetc(c, f); /* might be another slash! */
}
}
```
Today, I wouldn't write it that way -- I'd avoid the `if` with
side-effects. The first version I wrote yesterday looked something like
this:
```
while((c=getc(f))!=EOF){
if(c=='/'){
c=getc(f);
if(c=='*'){
c=getc(f);
if(c=='%'){
/* code that reacts to finding /*% */
}
else ungetc(c, f);
}
else ungetc(c, f);
}
}
/* error code goes here */
```
Several things struck me about this:
- It's obviously written inside out -- the code reads horizontally,
not vertically.
- I deliberately avoided side effects in the `if` tests, but in a
moment of weakness retained `while((c=getc(f))!=EOF)`.
- I don't like that it behaves oddly if `ungetc(EOF, f)` does the
wrong thing. The standard says that it has no effect -- a buggy
`stdio` might push `EOF&0xff` back.
- Besides, `ungetc` is a silly thing to do with a character that's
just going to reappear in `c` immediately.
Keeping in mind the shallow-nesting and top-to-bottom-reading ideas that
I've been writing about, I redid it like this:
```
for(;;){
c=getc(f);
MightBeSlash:
if(c==EOF){ /* error code goes here */ }
if(c!='/') continue;
c=getc(f);
if(c!='*') goto MightBeSlash;
c=getc(f);
if(c!='%') goto MightBeSlash;
break;
}
```
Now it reads top to bottom and, to me at least, says what it means. It's
a net improvement over the other version.
But surely this is no good. The `goto`s are fairly benign, but still
grate. Most egregious is the loop whose last statement is `break;`\!
I let this code sit for an hour (it was lunch time) and then rewrote it.
I can't remember what I eventually used, but it was not very satisfying.
Looking at the various versions today, the one that strikes me as best
is the original, with all of its calls to `getc` inside the tests. (Of
course, if I were writing this code professionally, I'd probably call
`fgets` and `strstr` rather than paw through the file byte by byte.)
What does that say about all my preaching? Only that truly general
principals are hard to come by, and ultimately every case is a different
case. Now that I have two examples of side effects inside conditionals
that I think read better than the alternatives, maybe I'd better think
harder about what principles are involved.
### What About `goto`?
*If you want to go somewhere,
goto is the best way to get there.*
\-Ken Thompson
How can I say that a program with two `goto`s is \`a net improvement'
over one with none? The problem with `goto` is that it promotes
spaghetti -- when you're trying to understand a program, every time you
see a label, you must look at the code at each `goto` that targets the
label to see what the sequence of operations might be. This is the same
problem that deep nesting causes -- code at any number of far away
places might immediately precede the code at a label, and it is often
difficult to keep in mind all the things that might have occurred when
you reach the label.
During the structured programming era, `goto` was a major bugaboo, so
much so that most programmers today, with little or no experience
thinking about code that uses `goto`, nevertheless recoil in revulsion
at its use.
The whole structured programming debate was started by [Edsger W.
Dijkstra's essay](http://www.acm.org/classics/oct95/) *Go To Statement
Considered Harmful*, Communications of the ACM, Vol. 11, No. 3, March
1968, pp. 147-148. One of the oft-cited bolsters of the argument to
eschew the `goto` was that it appeared to be very difficult to describe
`goto` in terms of C. A. R. Hoare's system of axiomatic semantics.
Hoare's system involved specifying a predicate that gives the initial
assumptions (pre-conditions) about a program. For each statement type
there is a rule of inference that describes how to derive a
post-condition from the statement's pre-condition. For example, let's
think about the statment `while(B)S`, where `B` is an arbitrary Boolean
(with no side effects\!) and `S` is any statement (that doesn't transfer
control out of the loop\!). Hoare writes `P{S}Q` to mean that if
proposition `P` is true before `S` is executed, then you can infer that
`Q` is true afterwards. So, Hoare's rule for the `while` statement is:
All this says is that If executing `S` doesn't falsify `P` whenever `B`
is also initially true, then executing the loop `while(B) S` preserves
`P` and guarantees that `B` becomes false. (`P` is usually called the
*loop invariant*.) As far as Hoare's semantics is concerned, this is the
meaning of a while loop.
As I said, people were fond of pointing out that `goto` was
\`\`obviously'' hard to formulate in terms of Hoare's semantics. So
imagine everyone's surprise when Hoare discovered the following simple
rule:
First, for each label `L` in the program, we attach a predicate `PL`
that is to be true whenever we reach `L`. Then, the `goto` rule is
All this says is that if some proposition is true before you execute a
`goto` then it's also true after you get there and furthermore, you
can't fall through the `goto`, because `false` is true if you do.
We also need a rule that says what happens at a label. This is just
That is, a statement still does the same thing if you put a label on it.
For me, the Hoare semantics is a good way of thinking about `goto`: a
`goto` is a way of getting to somewhere that you can do the same things
as where you are now. The main practical consequences of this view are
that I think very carefully about whether the points at the label and at
the `goto` are really semantically equivalent, and that I try to
carefully choose labels to say what the corresponding predicate is. For
example, in the loop in the previous section, the label `MightBeSlash`
suggests that that we've read a character into `c`, but have not yet
been checked to see whether or not it's a slash. (Of course, if you look
at the code, you see that another possibility is that it might be an
EOF, but I took the view that `MightBeSlashOrEOF` was an awfully long
name for a label that's never mentioned more than 6 lines away from its
definition. Maybe I should have called it `ExamineNewCharacter` or
something like that.)
The best reference on programming in a disciplined way using `goto` is
still D. E. Knuth's paper Structured Programming with go to Statements,
ACM Computing Surveys, Vol. 6, No. 4, December 1974, pp. 261-302. I
think that everything I have to say about `goto` is in his paper.