---
created_at: '2016-08-08T14:31:28.000Z'
title: 'Apocalypse 5: Pattern Matching (2006)'
url: http://perl6.org/archive/doc/design/apo/A05.html
author: _acme
points: 45
num_comments: 3
created_at_i: 1470666688
_tags:
- story
- author__acme
- story_12248040
objectID: '12248040'
year: 2006
---

[Source](http://perl6.org/archive/doc/design/apo/A05.html)

Apocalypse 5: Pattern Matching

This file is part of the Perl 6 Archive

> Note: these documents may be out of date. Do not use as reference!

To see what is currently happening visit http://www.perl6.org/


TITLE

Apocalypse 5: Pattern Matching

AUTHOR

Larry Wall <larry@wall.org>

VERSION

  Maintainer: Larry Wall <larry@wall.org>
  Date: 4 Jun 2002
  Last Modified: 18 May 2006
  Number: 5
  Version: 7

This is the Apocalypse on Pattern Matching, generally having to do with what we call "regular expressions", which are only marginally related to real regular expressions. Nevertheless, the term has grown with the capabilities of our pattern matching engines, so I'm not going to try to fight linguistic necessity here. I will, however, generally call them "regexes" (or "regexen", when I'm in an Anglo-Saxon mood).

Here are the RFCs covered in this Apocalypse. PSA stands for "problem, solution, acceptance", my private rating of how this RFC will fit into Perl 6. Doubtless I have misclassified your RFC, though the other ratings are pretty accurate. :-)

    RFC   PSA   Title
    ---   ---   -----
    072   aaa   Variable-length lookbehind. 
    093   abb   Regex: Support for incremental pattern matching
    110   bbb   counting matches
    112   acc   Assignment within a regex
    135   acr   Require explicit m on matches, even with ?? and // as delimiters.
    144   aaa   Behavior of empty regex should be simple
    145   acr   Brace-matching for Perl Regular Expressions
    150   acc   Extend regex syntax to provide for return of a hash of matched subpatterns
    156   aaa   Replace first match function (?...?) with a flag to the match command.
    164   ccr   Replace =~, !~, m//, s///, and tr// with match(), subst(), and trade()
    165   acc   Allow Variables in tr///
    166   abc   Alternative lists and quoting of things
    191   bbc   smart container slicing
    197   cdr   Numeric Value Ranges In Regular Expressions
    198   adr   Boolean Regexes
    261   dbr   Pattern matching on perl values
    274   acc   Generalised Additions to Regexs
    276   aaa   Localising Paren Counts in qr()s.
    308   dar   Ban Perl hooks into regexes
    316   bcr   Regex modifier for support of chunk processing and prefix matching
    317   aaa   Access to optimisation information for regular expressions
    331   acc   Consolidate the $1 and \1 notations
    332   abc   Regex: Make /$/ equivalent to /\z/ under the '/s' modifier
    348   bcc   Regex assertions in plain Perl code
    360   acb   Allow multiply matched groups in regexes to return a listref of all matches
    361   abb   Simplifying split()

Interestingly, there were no withdrawn RFCs for pattern matching. That means either that there were no cork-brained ideas proposed, or that regex culture is so cork-brained already that the cork-brained ideas blend right in. I know where my money is... :-)

In fact, regular expression culture is a mess, and I share some of the blame for making it that way. Since my mother always told me to clean up my own messes, I suppose I'll have to do just that.

For prior Apocalypses, I've used the RFCs as a springboard for discussion of my thinking, but this one is special, because none of the RFCs were courageous enough (or foolhardy enough) to look at the big picture and propose radical change where it's needed. But Perl has often been tagged as a language in which it's easy to write programs that are difficult to read, and it's no secret that regular expression syntax has been the chief culprit. Funny that other languages have been borrowing Perl's regular expressions as fast as they can...

That's primarily because we took several large steps in Perl 5 to enhance regex capabilities. We took one large step forwards with the /x option, which allowed whitespace between regex tokens. But we also took several large steps sideways with the (?...) extension syntax. I call them steps sideways, but they were simultaneously steps forward in terms of functionality and steps backwards in terms of readability. At the time, I rationalized it all in the name of backward compatibility, and perhaps that approach was correct for that time and place. It's not correct now, since the Perl 6 approach is to break everything that needs breaking all at once.

And unfortunately, there's a lot of regex culture that needs breaking.

Regex culture has gone wrong in a variety of ways, but it's not my intent to assign blame--there's plenty of blame to go around, and plenty of things that have gone wrong that are nobody's fault in particular. For example, it's nobody's fault that you can't realistically complement a character set anymore. It's just an accident of the way Unicode defines combining characters. The whole notion of character classes is mutating, and that will have some bearing on the future of regular expression syntax.

Given all this, I need to warn you that this Apocalypse is going to be somewhat radical. We'll be proposing changes to certain "sacred" features of regex culture, and this is guaranteed to result in future shock for some of our more conservative citizens. Do not be alarmed. We will provide ways for you to continue programming in old-fashioned regular expressions if you desire. But I hope that once you've thought about it a little and worked through some examples, you'll like most of the changes we're proposing here.

So although the RFCs did contribute greatly to my thinking for this Apocalypse, I'm going to present my own vision first for where regex culture should go, and then analyze the RFCs with respect to that vision.

First, let me enumerate some of the things that are wrong with current regex culture.

  • Too much history
  • Too compact and "cute"
  • Poor Huffman coding
  • Too much reliance on too few metacharacters
  • Different things look too similar
  • Poor end-weight design
  • Too much reliance on modifiers
  • Too many special rules and boobytraps
  • Backreferences not useful enough
  • Too hard to match a literal string
  • Two-level interpretation is problematic
  • Too little abstraction
  • Little support for named captures
  • Difficult to use nested patterns
  • Little support for grammars
  • Inability to define variants
  • Poor integration with "real" language
  • Missing backtracking controls
  • Difficult to define assertions

I'm sure there are other problems, but that'll do for starters. Let's look at each of these in more detail.

Too much history

Most of the other problems stem from trying to deal with a rich history. Now there's nothing wrong with history per se, but those of us who are doomed to repeat it find that many parts of history are suboptimal and contradictory. Perl has always tried to err on the side of incorporating as much history as possible, and sometimes Perl has succeeded in that endeavor.

Cultural continuity has much to be said for it, but what can you do when the culture you're trying to be continuous with is itself discontinuous? As it says in Ecclesiastes, there's a time to build up, and a time to tear down. The first five versions of Perl mostly built up without tearing down, so now we're trying to redress that omission.

Too compact and "cute"

Regular expressions were invented by computational linguists who love to write examples like /aa*b*(cd)*ee/. While these are conducive to reasoning about pattern matching in the abstract, they aren't so good for pattern matching in the concrete. In real life, most atoms are longer than "a" or "b". In real life, tokens are more recognizable if they are separated by whitespace. In the abstract, /a+/ is reducible to /aa*/. In real life, nobody wants to repeat a 15 character token merely to satisfy somebody's idea of theoretical purity. So we have shortcuts like the + quantifier to say "one or more".

Now, you may rightly point out that + is something we already have, and we already introduced /x to allow whitespace, so why is this bullet point here? Well, there's a lot of inertia in culture, and the problem with /x is that it's not the default, so people don't think to turn it on when it would probably do a lot of good. The culture is biased in the wrong direction. Whitespace around tokens should be the norm, not the exception. It should be acceptable to use whitespace to separate tokens that could be confused. It should not be considered acceptable to define new constructs that contain a plethora of punctuation, but we've become accustomed to constructs like (?<=...) and (??{...}) and [\r\n\ck\p{Zl}\p{Zp}], so we don't complain. We're frogs who are getting boiled in a pot full of single-character morphemes, and we don't notice.
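
To make the contrast concrete, here is an invented Perl 5 snippet showing the same pattern written densely and then spread out under /x:

    # Dense, default-style Perl 5: every token runs together.
    my $date = qr/^(\d{4})-(\d{2})-(\d{2})$/;

    # The same pattern under /x: whitespace and comments between tokens.
    my $date_x = qr/^ (\d{4})    # year
                    - (\d{2})    # month
                    - (\d{2})    # day
                  $/x;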

Poor Huffman coding

Huffman invented a method of data compaction in which common characters are represented by a small number of bits, and rarer characters are represented by more bits. The principle is more general, however, and language designers would do well to pay attention to the "other" Perl slogan: Easy things should be easy, and hard things should be possible. However, we haven't always taken our own advice. Consider those two regex constructs we just saw:

    (?<=...)
    (??{...})

Which one do you think is likely to be the most common in everyday use? Guess which one is longer...

There are many examples of poor Huffman coding in current regexes. Consider these:

    (...)
    (?:...)

Is it really the case that grouping is rarer than capturing? And by two gobbledygooky characters' worth? Likewise there are many constructs that are the same length that shouldn't be:

    (?:...)
    (?#...)

Grouping is much more important than the ability to embed a comment. Yet they're the same length currently.

Too much reliance on too few metacharacters

A lot of our Huffman troubles came about because we were trying to shoehorn new capabilities into an old syntax without breaking anything. The (?...) construct succeeded at that goal, but it was new wine in old wineskins, as they say. More successful was the *? minimal matching hack, but it's still symptomatic of the problem that we only had three characters to choose from that would have worked at that point in the grammar. We've pretty nearly exhausted the available backslash sequences.
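
For reference, here is a small Perl 5 illustration (invented example) of the *? / +? minimal-matching hack mentioned above:

    my $html = '<b>bold</b> and <i>italic</i>';
    my ($greedy)  = $html =~ /<(.+)>/;   # greedy: grabs everything up to the last '>'
    my ($minimal) = $html =~ /<(.+?)>/;  # minimal: stops at the first '>', yielding 'b'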

The waterbed theory of linguistic complexity says that if you push down one place, it goes up somewhere else. If you arbitrarily limit yourself to too few metacharacters, the complexity comes out somewhere else. So it seems obvious to me that the way out of this mess is to grab a few more metacharacters. And the metacharacters I want to grab are...well, we'll see in a moment.

Different things look too similar

Consider these constructs:

    (??{...})
    (?{...})
    (?#...)
    (?:...)
    (?i:...)
    (?=...)
    (?!...)
    (?<=...)
    (?<!...)
    (?>...)
    (?(...)...|...)

These all look quite similar, but some of them do radically different things. In particular, the (?<...) does not mean the opposite of the (?>...). The underlying visual problem is the overuse of parentheses, as in Lisp. Programs are more readable if different things look different.

Poor end-weight design

In linguistics, the notion of end-weight is the idea that people tend to prefer sentences where the short things come first and the long things come last. That minimizes the amount of stuff you have to remember while you're reading or listening. Perl violates this with regex modifiers. It's okay when you say something short like this:

    s/foo/bar/g

But when you say something like we find in RFC 360:

    while ($text =~ /name:\s*(.*?)\n\s*
                    children:\s*(?:(?@\S+)[, ]*)*\n\s*
                    favorite\ colors:\s*(?:(?@\S+)[, ]*)*\n/sigx) {...}

it's not until you read the /sigx at the end that you know how to read the regex. This actually causes problems for the Perl 5 parser, which has to defer parsing the regular expression till it sees the /x, because that changes how whitespace and comments work.

Too much reliance on modifiers

The /s modifier in the previous example changes the meaning of the . metacharacter. We could, in fact, do away with the /s modifier entirely if we only had two different representations for "any character", one of which matched a newline, and one which didn't. A similar argument applies to the /m modifier. The whole notion of something outside the regex changing the meaning of the regex is just a bit bogus, not because we're afraid of context sensitivity, but because we need to have better control within the regex of what we mean, and in this case the context supplied outside the regex is not precise enough. (Perl 5 has a way to control the inner contexts, but it uses the self-obfuscating (?...) notation.)
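
As a concrete Perl 5 example (invented data), the same dot matches differently depending on a modifier that sits far away at the end of the pattern:

    my $text = "line one\nline two";
    print "matched\n" if $text =~ /one.line/s;   # succeeds: under /s, . also matches "\n"
    print "matched\n" if $text =~ /one.line/;    # fails: by default . excludes "\n"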

Modifiers that control how the regex is used as a whole do make some sense outside the regex. But they still have the end-weight problem.

Too many special rules and boobytraps

Without knowing the context, you cannot know what the pattern // will do. It might match a null string, or it might match the previously successful match.
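
A short Perl 5 demonstration of that trap (invented example):

    "abc" =~ /b/;                        # a successful match is remembered
    print "matched\n" if "xbx" =~ //;    # // silently reuses /b/ here, so this matches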

The local operator behaves differently inside regular expressions than it does outside.

It's too easy to write a null pattern accidentally. For instance, the following will never match anything but the null string:

    /
    | foo
    | bar
    | baz
    /x

Even when it's intentional, it may not look intentional:

    (a|b|c|)

That's hard to read because it's difficult to make the absence of something visible.

It's too easy to confuse the multiple meanings of dot. Or the multiple meanings of ^ and $. And the opposite of \A is frequently not \Z, but \z. Tell me again, when do I say \1, and when do I say $1? Why are they different?

Backreferences not useful enough

Speaking of \1, backreferences have a number of shortcomings. The first is actually getting ahold of the right backreference. Since captures are numbered from the beginning, you have to count, and you can easily count wrong. For many purposes it would be better if you could ask for the last capture, or the one before that. Or perhaps if there were a way to restart the numbering part way through...

Another major problem with backreferences is that you can't easily modify one to search for a variant. Suppose you match an opening parenthesis, bracket, or curly. You'd like to search for everything up to the corresponding closing parenthesis, bracket, or curly, but there's no way to transmogrify the opening version to the closing version, because the backref search is hardwired independently of ordinary variable matching. And that's because Perl doesn't instantiate $1 soon enough. And that's because Perl relies on variable interpolation to get subexpressions into regexes. Which leads us to...
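
In Perl 5 you end up doing the transmogrification by hand, outside the regex. A sketch (the lookup table and variable names are invented):

    my %close = ( '(' => ')', '[' => ']', '{' => '}' );
    my $str = '{ stuff }';
    # \1 could only repeat the *opening* bracket; to find the closing one
    # we must capture first, map it, and then interpolate a second pattern:
    if ($str =~ /([(\[{])/) {
        my $closer = $close{$1};
        print "balanced\n" if $str =~ /\Q$1\E(.*?)\Q$closer\E/;
    }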

Too hard to match a literal string

Since regexes undergo an interpolation pass before they're compiled, anything you interpolate is forced to be treated as a regular expression. Often that's not what you want, so we have the klunky \Q$string\E mechanism to hide regex metacharacters. And that's because...
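
A quick Perl 5 illustration of the problem and the workaround (invented strings):

    my $lit = 'a.c';
    print "oops\n" if 'abc' =~ /$lit/;        # matches: the interpolated '.' acts as a metacharacter
    print "ok\n"   if 'a.c' =~ /\Q$lit\E/;    # matches only the three literal characters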

Two-level interpretation is problematic

The problem with \Q$string\E arises because of the fundamental mistake of using interpolation to build regexes instead of letting the regex control how it treats the variables it references. Regexes aren't strings, they're programs. Or, rather, they're strings only in the sense that any piece of program is a string. Just as you have to work to eval a string as a program, you should have to work to eval a string as a regular expression. Most people tend to expect a variable in a regular expression to match its contents literally. Perl violates that expectation. And because it violates that expectation, we can't make $1 synonymous with \1. And interpolated parentheses throw off the capture count, so you can't easily use interpolation to call subrules, so we invented (??{$var}) to get around that. But then you can't actually get at the parentheses captured by the subrule. The ramifications go on and on.

Too little abstraction

Historically, regular expressions were considered a very low-level language, a kind of glorified assembly language for the regex engine. When you're only dealing with ASCII, there is little need for abstraction, since the shortest way to say [a-z] is just that. With the advent of the eighth bit, we started getting into a little bit of trouble, and POSIX started thinking about names like [:alpha:] to deal with locale difficulties. But as with the problem of conciseness, the culture was still biased away from naming abstractly anything that could be expressed concretely.

However, it's almost impossible to write a parser without naming things, because you have to be able to name the separate grammar rules so that the various rules can refer to each other.

It's difficult to deal with any subset of Unicode without naming it. These days, if you see [a-z] in a program, it's probably an outright bug. It's much better to use a named character property so that your program will work right in areas that don't just use ASCII.
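
For instance, in Perl 5 (a sketch; the \p{Lower} named property assumes a reasonably modern perl):

    use utf8;   # the source below contains a non-ASCII literal

    print "matches\n" if 'naive' =~ /^[a-z]+$/;        # fine for pure ASCII...
    print "matches\n" if 'naïve' =~ /^[a-z]+$/;        # ...but fails on the accented letter
    print "matches\n" if 'naïve' =~ /^\p{Lower}+$/;    # the named property handles it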

Even where we do allow names, it tends to be awkward because of the cultural bias against it. To call a subrule by name in Perl 5 you have to say this:

    (??{$rule})

That has 4 or 5 more characters than it ought to. Dearth of abstraction produces bad Huffman coding.
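
For the record, here is the classic Perl 5 idiom (essentially the example from the perlre documentation) for a self-referential subrule via that construct:

    # Match balanced parentheses by deferring the subrule lookup to match time:
    my $balanced;
    $balanced = qr/ \( (?: [^()]+ | (??{ $balanced }) )* \) /x;
    print "balanced\n" if '(a(b)c)' =~ $balanced;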

Little support for named captures

Make that "no support" in Perl, unless you include assignment to a list. This is just a part of the bias against naming things. Instead we are forced to number our capturing parens and count. That works okay for the top-level regular expression, when we can do list assignment or assign $1 to $foo. But it breaks down as soon as you start trying to use nested regexes. It also breaks down when the capturing parentheses match more than once. Perl handles this currently by returning only the last match. This is slightly better than useless, but not by much.
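
In Perl 5 the best you can do is name things at assignment time, by position, and the repeated-group limitation bites quickly (invented example):

    # Naming happens outside the regex, by position:
    my ($year, $month, $day) = '2006-05-18' =~ /^(\d+)-(\d+)-(\d+)$/;

    # A capture that matches repeatedly keeps only its final match:
    my @kids = 'children: tom dick harry' =~ /children:(?: (\w+))+/;
    # @kids is ('harry'); tom and dick are lost.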

Difficult to use nested patterns

For many of the reasons we've mentioned, it's difficult to make regexes refer to each other, and even if you do, it's almost impossible to get the nested information back out of them. And there are entire classes of parsing problems that are not solvable without recursive definitions.

Little support for grammars

Even if it were easier for regexes to refer to other regexes, we'd still have the problem that those other regexes aren't organized in any meaningful way. They might be off in variables that come and go at the whim of the surrounding context.

When we have an organized system of parsing rules, we call it a grammar. One advantage of having a grammar is that you can optimize based on the assumption that the rules maintain their relationship to each other. For instance, if you think of grammar rules as a funny kind of subroutine, you can write an optimizer to inline some of the subrules--but only if you know the subrule is fixed in the grammar.

Without support for grammar classes, there's no decent way to think of deriving one grammar from another. And if you can't derive one grammar from another, you can't easily evolve your language to handle new kinds of problems.

Inability to define variants

If we want to have variant grammars for Perl dialects, then what about regex dialects? Can regexes be extended either at compile time or at run time? Perl 5 has some rudimentary overloading magic for rewriting regex strings, but that's got the same problems as source filters for Perl code; namely that you just get the raw regex source text and have to parse it yourself. Once again the fundamental assumption is that a regex is a funny kind of string, existing only at the behest of the surrounding program.

Do we think of regexes as a real, living language?

Poor integration with "real" language

Let's face it, in the culture of computing, regex languages are mostly considered second-class citizens, or worse. "Real" languages like C and C++ will exploit regexes, but only through a strict policy of apartheid. Regular expressions are our servants or slaves; we tell them what to do, they go and do it, and then they come back to say whether they succeeded or not.

At the other extreme, we have languages like Prolog or Snobol where the pattern matching is built into the very control structure of the language. These languages don't succeed in the long run because thinking about that kind of control structure is rather difficult in actual fact, and one gets tired of doing it constantly. The path to freedom is not to make everyone a slave.

However, I would like to think that there is some happy medium between those two extremes. Coming from a C background, Perl has historically treated regexes as servants. True, Perl has treated them as trusted servants, letting them move about in Perl society better than any other C-like language to date. Nevertheless, if we emancipate regexes to serve as co-equal control structures, and if we can rid ourselves of the regexist attitudes that many of us secretly harbor, we'll have a much more productive society than we currently do. We need to empower regexes with a sense of control (structure). It needs to be just as easy for a regex to call Perl code as it is for Perl code to call a regex.

Missing backtracking controls

Perl 5 started to give regexes more control of their own destiny with the "grab" construct, (?>...), which tells the regex engine that when it fails to match the rest of the pattern, it should not backtrack into the innards of the grab, but skip back to before it. That's a useful notion, but there are problems. First, the notation sucks, but you knew that already. Second, it doesn't go far enough. There's no way to backtrack out of just the current grouping. There's no way to backtrack out of just the current rule. Both of these are crucial for giving first-class status to the control flow of regexes.
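
A minimal Perl 5 demonstration of the grab's effect (invented strings):

    print "plain\n" if 'aaab' =~ /a+ab/;      # matches: a+ backtracks and gives up one 'a'
    print "grab\n"  if 'aaab' =~ /(?>a+)ab/;  # fails: backtracking into the grab is forbidden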

Difficult to define assertions

Notionally, a regex is an organization of assertions that either succeed or fail. Some assertions are easily expressed in traditional regex language, while others are more easily expressed in a procedural language like Perl.

The natural (but wrong) solution is to try to reinvent Perl expressions within regex language. So, for instance, I'm rejecting those RFCs that propose special assertion syntax for numerics or booleans. The better solution is to make it easier to embed Perl assertions within regexes.
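
By way of contrast, here is roughly what such an embedded assertion looks like in Perl 5 today, using the conditional (?(?{...})...) construct to veto a numeric match. This is a sketch assuming a reasonably modern perl; the names are invented:

    # Match a number only if it fits in an octet (0..255):
    my $octet = qr/ \b (\d{1,3}) \b (?(?{ $1 > 255 }) (?!) ) /x;
    print "ok\n"  if '200' =~ $octet;     # matches
    print "bad\n" if '999' =~ $octet;     # never prints: the assertion vetoes it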

Brave New World

I've just made a ton of negative assertions about the current state of regex culture. Now I'd like you to perform a cool mental trick and turn all those negative assertions into positive assertions about what I'm going to say, because I'm not intending to give the rationale again, but just present the design as it stands. Damian will discuss an extended example in his Exegesis 5, which will show the big picture of how these various features work together to produce a much more readable whole.

So anyway, here's what's new.

Metacharacter Reform

Some things stay the same: (...) captures text just as it did before, and the quantifiers *, +, and ? are also unchanged. The vertical bar | still separates alternatives. The backslash \ still protects the following character from its ordinary interpretation. The ? suffix character still does minimal matching. (Note that these are by far the most commonly used metacharacters, so many ordinary regexes will look nearly identical in Perl 5 and Perl 6.)

Since /x extended syntax is now the default, # is now always a metacharacter indicating a comment, and whitespace is now always "meta". Whitespace is now the standard way to separate regex tokens that would otherwise be confused as a single token.
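
To give a flavor of the new defaults, here is a sketch of a pattern using constructs introduced later in this Apocalypse, so take the details as illustrative rather than definitive:

    / \d**{4} '-' \d**{2} '-' \d**{2}    # an ISO date: whitespace and comments are free
    /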

Even in character classes, whitespace is not taken literally any more. Backwhack the space if you mean it literally, or use <sp>, or `