hn-classics/_stories/2007/12199836.md

7.5 KiB

created_at title url author points story_text comment_text num_comments story_id story_title story_url parent_id created_at_i _tags objectID year
2016-08-01T01:59:38.000Z A Regular Expression Matcher (2007) http://www.cs.princeton.edu/courses/archive/spr09/cos333/beautiful.html _acme 132 34 1470016778
story
author__acme
story_12199836
12199836 2007

Source

A Regular Expression Matcher

Code by Rob Pike

Exegesis by Brian Kernighan

Draft version Jan 28 2007

Introduction

Beautiful code is likely to be simple -- clear and easy to understand. Beautitful code is likely to be compact -- just enough code to do the job and no more -- but not cryptic, to the point where it cannot be understood. Beautiful code may well be general, solving a broad class of problems in a uniform way. One might even describe it as elegant, showing good taste and refinement.

In this chapter I will describe a piece of beautiful code, for matching regular expressions that meets all of these criteria.

Regular expressions are a notation for describing patterns of text, in effect a special-purpose language for pattern matching. Although there are many variants, all share the idea that most characters in a pattern match literal occurrences of themselves but some "metacharacters" have special meaning, for example "*" to indicate some kind of repetition or [...] to mean any one character from the set within the brackets.

In practice, most searches in programs like text editors are for literal words, so the regular expressions are often literal strings like print, which will match printf or sprint or printer paper anywhere. In so-called "wild cards" in filenames in Unix and Windows, a * matches any number of characters, so the pattern *.c matches all filenames that end in .c. There are many, many variants of regular expressions, even in contexts where one would expect them to be the same. Jeffrey Friedl's Mastering Regular Expressions (O'Reilly, 2006) is an exhaustive study of the topic.

Regular expressions were invented by Stephen Kleene in the mid-1950's as a notation for finite automata, and in fact they are equivalent to finite automata in what they represent. Regular expressions first appeared in a program setting in Ken Thompson's version of the QED text editor in the mid-1960's. In 1967, Ken applied for a patent on a mechanism for rapid text matching based on regular expressions; it was granted in 1971, one of the very first software patents [US Patent 3,568,156, Text Matching Algorithm, March 2, 1971].

Regular expressions moved from QED to the Unix editor ed, and then to the quintessential Unix tool, grep, which Ken created by performing radical surgery on ed. Through these widely used programs, regular expressions became familiar throughout the early Unix community.

Ken's original matcher was very fast because it combined two independent ideas. One was to generate machine instructions on the fly during matching so that it ran at machine speed, not by interpretation. The other was to carry forward all possible matches at each stage, so it did not have to backtrack to look for alternative potential matches. Matching code in later text editors that Ken wrote, like ed, used a simpler algorithm that backtracked when necessary. In theory this is slower but the patterns found in practice rarely involved backtracking, so the ed and grep algorithm and code were good enough for most purposes.

Subsequent regular expression matchers like egrep and fgrep added richer classes of regular expressions, and focused on fast execution no matter what the pattern. Ever fancier regular expressions became popular, and were included not only in C-based libraries but also as part of the syntax of scripting languages like Awk and Perl.

The Practice of Programming

In 1998, Rob Pike and I were writing The Practice of Programming ("TPOP"). The last chapter of the book, "Notation", collected a number of examples where a good notation led to better programs and better programming. This included the use of simple data specifications (printf formats, for instance), and the generation of code from tables.

Given our Unix backgrounds and many years of experience with tools based on regular expression notation, we naturally wanted to include a discussion of regular expressions, and it seemed mandatory to include an implementation as well. Given our emphasis on tools, it also seemed best to focus on the class of regular expressions found in grep, rather than say those from shell wild cards, since we could also talk about the design of grep itself.

The problem was that any existing regular expression package was far too big. The local grep was over 500 lines long (about 10 book pages). Open-source regular expression packages tended to be huge, roughly the size of the entire book, because they were engineered for generality, flexibility, and speed; none was remotely suitable for pedagogy.

I suggested to Rob that we needed to find the smallest regular expression package that would illustrate the basic ideas while still recognizing a useful and non-trivial class of patterns. Ideally, the code would fit on a single page.

Rob disappeared into his office, and at least as I remember it now, appeared again in no more than an hour or two with the 30 lines of C code that subsequently appeared in Chapter 9 of TPOP. That code implements a regular expression matcher that handles these constructs:

    c    matches any literal character c
    .    matches any single character
    ^    matches the beginning of the input string
    $    matches the end of the input string
    *    matches zero or more occurrences of the previous character

This is quite a useful class; in my own experience of using regular expressions on a day-to-day basis, it easily accounts for 95 percent of all instances. In many situations, solving the right problem is a big step on the road to a beautiful program. Rob deserves great credit for choosing so wisely, from among a wide set of options, a very small yet important, well-defined and extensible set of features.

Rob's implementation itself is a superb example of beautiful code: compact, elegant, efficient, and useful. It's one of the best examples of recursion that I have ever seen, and it shows the power of C pointers. Although at the time we were most interested in conveying the important role of a good notation in making a program easier to use and perhaps easier to write as well, the regular expression code has also been an excellent way to illustrate algorithms, data structures, testing, performance enhancement, and other important topics.

Implementation

In the book, the regular expression matcher is part of a program that mimics grep, but the regular expression code is completely separable from its surroundings. The main program is not interesting here -- it simply reads from its standard input or from a sequence of files, and prints those lines that contain a match of the regular expression, as does the original grep and many other Unix tools.

This is the matching code:

    /* match: search for regexp anywhere in text */
    int match(char *regexp, char *text)
    {
        if (regexp[0] == '^')
            return matchhere(regexp+1, text);
        do {    /* must look even if string is empty */
            if (matchhere(regexp, text))
                return 1;
        } while (*text++ != '