245 lines
12 KiB
Markdown
245 lines
12 KiB
Markdown
---
|
||
created_at: '2015-12-10T19:01:01.000Z'
|
||
title: Why ML/OCaml are good for writing compilers (1998)
|
||
url: http://flint.cs.yale.edu/cs421/case-for-ml.html
|
||
author: jasim
|
||
points: 140
|
||
story_text:
|
||
comment_text:
|
||
num_comments: 79
|
||
story_id:
|
||
story_title:
|
||
story_url:
|
||
parent_id:
|
||
created_at_i: 1449774061
|
||
_tags:
|
||
- story
|
||
- author_jasim
|
||
- story_10712566
|
||
objectID: '10712566'
|
||
year: 1998
|
||
|
||
---
|
||
[Source](http://flint.cs.yale.edu/cs421/case-for-ml.html "Permalink to Why ML/OCaml are good for writing compilers")
|
||
|
||
# Why ML/OCaml are good for writing compilers
|
||
|
||
Why ML/OCaml are good for writing compilers
|
||
|
||
### Why ML/OCaml are good for writing compilers
|
||
|
||
| ----- |
|
||
|
|
||
|
||
| |
|
||
|
|
||
|
||
| **Author:** | | Dwight VandenBerghe |
|
||
| **Email:** | dwight@pentasoft.com |
|
||
| **Date:** | 1998/07/28 | | | | |
|
||
| **Forums:** | comp.compilers |
|
||
|
||
|
|
||
|
||
|
||
|
||
Let's use the term "ML" to mean SML or Objective Caml. I'm a devotee
|
||
of Ocaml, but I have SML/NJ installed and although I prefer the
|
||
distribution, tools and overall implementation of Ocaml, I'd be
|
||
happy to write in SML/NJ if Ocaml wasn't around. (I can't speak
|
||
for Haskell, gofer, hugs, or any of the lazy languages. I think
|
||
that there are different kinds of people, some of whom like things
|
||
like deferred evaluation and other who like strict call-by-value.
|
||
I'm decidedly in the latter camp; I like how clean strict evaluation
|
||
is, how easy it is to understand exactly what is going on, and I
|
||
find that I have to give a little of that up with Haskell. But
|
||
your mileage may vary.) So although when I am writing about "ML"
|
||
I am writing about Ocaml, what I say also applies, I think, to SML/NJ.
|
||
|
||
So here is an unordered list of language features that seems to me
|
||
to make writing compilers a pleasure rather than a horrendous chore.
|
||
|
||
1. Garbage collection. It may seem elementary to mention this,
|
||
but gc is an incredible boon to programs that have lots of complex
|
||
data structures with short-to-intermediate lifetimes. And compilers
|
||
are, above all, programs with complex data structures. You gotta
|
||
be a data structure junkie to like compilers. Well, C/C++, Pascal,
|
||
and their ilk, with unrestricted pointers and malloc and their
|
||
cousins, make you take out your own trash. You have to be the
|
||
programmer and the janitor, at the same time. Some of us are
|
||
better programmers than we are janitors. ML has, perhaps, the
|
||
best gc around; it's so fast that for many real apps it's as fast
|
||
as the C++ malloc/free, and maybe even a little faster in some
|
||
cases. You don't have to wring your hands about using ML's gc,
|
||
as you do with Java's, which is slow. It's just there, invisible,
|
||
blindingly fast, and it works all the time, and your life is thus
|
||
made much easier.
|
||
|
||
2. Tail recursion is optimized. Thus, once you know how to take
|
||
advantage of it, you can write tree walks that don't eat up stack
|
||
space, and are very fast. Now, to be fair, there are C++ compilers
|
||
(like VisualC++ 5) that supposedly have some tail recursion
|
||
eliminated, but you can't count on it (VC has a lot of bugs in
|
||
theirs). ML is matched to recursion very well, and since many
|
||
of the data structures in compilers tend to be best handled with
|
||
recursive procedures, there is a good fit there.
|
||
|
||
3. The data types in ML match the compiling process. Compilers
|
||
don't tend to worry about unsigned shorts vs signed chars,
|
||
but use "ints" everywhere, as well as strings. Strings are used
|
||
all over the place, and C/C++ are pretty terrible with strings,
|
||
even in the presence of templates. There are some places in
|
||
compilers where it's nice to have arithmetic on quantities a
|
||
little larger than ints, so the "bignum" facility is actually
|
||
useful in those cases (like when you have to fold constants
|
||
or tokenize with a little more precision than the underlying
|
||
numeric types, then convert back to the native form). If you
|
||
don't have bignums, then you have to write your own, or resort
|
||
to subterfuge. (See guru compiler writer Dave Hanson's "C
|
||
Interfaces and Implementations" for an example of what I mean.)
|
||
|
||
4. The type constructors in ML are just plain wonderful for
|
||
describing things like ASTs (abstract syntax trees). They
|
||
implement what are sometimes called "tagged unions" - that is,
|
||
a union data type (efficient, small) that, unlike with C/C++,
|
||
come along with a tag field that tells what's in the union.
|
||
This is enforced: in other words, there's no way to circumvent
|
||
it. ML pattern matching is designed to work with tagged unions
|
||
so that the source code is incredibly readable, for functions
|
||
that take a data structure as an argument. Combine this with
|
||
type inference and tail recursion elimination, and you have
|
||
a language that is optimized for recursive functions that
|
||
take complex data structures are arguments ... sound familiar?
|
||
|
||
5. Safety. ML was conceived as a solution to the main problem
|
||
faced by mathematicians using automated theorem provers: that
|
||
because of the type-free, dangerous, cavalier nature of the
|
||
usual language for that domain (Lisp), you could never be sure
|
||
that your program was going to work. Of course, this could
|
||
be said about all languages, but ML was an attempt to restrict
|
||
the domain a bit, to add efficiency and safety at the same time.
|
||
ML programs can't crash the system; if it compiles, it will run,
|
||
and you won't get a segmentation fault. You can prove certain
|
||
properties about your program, you can trust that certain kinds
|
||
of errors just plain can't happen. For example, since lists are
|
||
immutable, and must contain elements of a single type, you don't
|
||
have to worry about screwing up a string list by putting an
|
||
integer in it by accident (as you can with Scheme and Lisp).
|
||
In fact, you can't put anything into it at all; you have to
|
||
create a new list. This makes the underlying routines able
|
||
to be much faster than lists in other languages, where you
|
||
have to be concerned about double-links, destructive updating,
|
||
and so on. With a blindingly fast GC system, this all works
|
||
out just fine ... and you sleep better at night.
|
||
|
||
6. ML was designed for an application domain (theorem
|
||
proving) that is characterized by big, hairy, recursive data
|
||
structures that have complex algorithms running against them.
|
||
Sound familiar?
|
||
|
||
7. Exceptions. ML implements fast and clean exception handling,
|
||
which is a real joy if you've never had the pleasure of using them
|
||
before. You write a table lookup by assuming that the key will
|
||
be found, and then wrapping the search in a "try" block that
|
||
catches the exceptional case ("not found"). So you can't ever
|
||
screw up your program by forgetting to test the not-found case;
|
||
if you do that, the runtime system will stop with an uncaught
|
||
exception, telling just where the exception was thrown. Programs
|
||
become easier to read, cleaner, more robust, when you learn to
|
||
use exceptions.
|
||
|
||
8. Type inference. In a thousand lines of ML you may have to
|
||
declare only two or maybe three variables, if that. It just figures
|
||
out the types by how the variables are used. And it doesn't guess,
|
||
like (say) Perl does. It knows for sure. One of the reasons I
|
||
like Ocaml over SML is that Ocaml doesn't like operator overloading:
|
||
there are different operators for float addition ("+.") and integer
|
||
addition ("+"), for example. Type inference and operator overloading
|
||
are uncomfortable bed partners; in my opinion, a language designer
|
||
should choose between them, rather than try to serve both masters.
|
||
But anything's better than Pascal or C, where the compiler just
|
||
tosses out everything it knows about usage and makes you state
|
||
the obvious, over and over and over again. All we can do is
|
||
screw it up, and so we do, over and over.
|
||
|
||
9. Lex/yacc/burg. ML has great implementations of these standard
|
||
tools, and once you know how to use them, they make a lot of jobs
|
||
simpler. No, I'm not a great lex fan, nor do I prefer lalr(1)
|
||
over ll(k), but I'm a pragmatist: if it's there, and it's well-
|
||
implemented, I'll use it. Ocaml and SML/NJ both have really
|
||
good, solid compiler tools implementations that are used by
|
||
their developers. Not many languages come with better toolkits.
|
||
|
||
10. Did I mention that Ocaml is fast? I wrote a compiler for
|
||
an actuarial financial modelling language in Ocaml, probably
|
||
10K lines of code, that would have been 20K or more in C++,
|
||
and it compiles the largest known program in 3 seconds on my
|
||
pentium 200. Most programs compile in under 1 second. I find
|
||
this astounding.
|
||
|
||
11. Support. I've gotten better support from Inria (Xavier
|
||
Leroy and Pierre Weis, among others) than I ever have from
|
||
any other language vendor. There just aren't many problems
|
||
with the language, and those few problems I've encountered
|
||
have been fixed in a matter of days. Compare this to the
|
||
support for, say, VC++ or Turbo Pascal.
|
||
|
||
12. Library. ML's standard library has a lot of data structure
|
||
stuff in it, which helps a lot. I find it more complete and
|
||
succinct and usable than the usual glommed-together mess of
|
||
most other languages.
|
||
|
||
13. The module system. ML has strong, well-thought-out support
|
||
for separate compilation that allows a separately compiled module
|
||
to support polymorphism (that is, it can operate on arbitrary
|
||
types). The visibility of the internals of the module can be
|
||
exactly controlled. "Functors" can specialize a module into
|
||
a specific incarnation. It's pretty cool, like C++ templates
|
||
without the pain and suffering.
|
||
|
||
So it's mostly about data structures. ML is extraordinarily
|
||
facile at allowing you to express complex data structures and
|
||
recursive algorithms around them. The most basic data structures
|
||
(lists, arrays, structs, unions, property lists, hash tables,
|
||
binary trees, queues, and so on) are sitting there in the
|
||
language already, well-implemented and ready to go. You start
|
||
off from a place that you would have had to build up in another
|
||
language.
|
||
|
||
Is ML the perfect language? Lord, no. It has many flaws,
|
||
like every other language. The syntax is weird, some parts
|
||
of it are hard to learn, some parts are hard to use. A simple
|
||
thing like Printf can be quite strange to express. You can't
|
||
use ML to advantage for many problem domains (like the one I
|
||
work a lot in, embedded systems). It would be terrible to
|
||
use for, say, programming an FFT inside a DSP chip. ML
|
||
doesn't sit all that well with GUIs yet, as far as I've seen,
|
||
and support for OOP is lacking (although I see that as a
|
||
feature, not a restriction).
|
||
|
||
But all languages have some problem domains in which they
|
||
shine. I think that compiler implementation is one of those areas
|
||
for ML. You're writing a compiler and in middle of a function
|
||
you need a 9mm crescent wrench with a box on the other end.
|
||
You open up your toolkit and ... there it is, in the top drawer,
|
||
bright and shiny and strong. You use it, and then a few minutes
|
||
later you need a small phillips screwdriver with a clip and
|
||
a magnet ... and there it is, bright and shiny and strong.
|
||
|
||
It's not that there are an extraordinary number of tools in
|
||
the toolbox. (No; in fact, the toolbox is much smaller than
|
||
the usual toolboxes, the ones used by your friends that contain
|
||
everything but the sink.) It's that the toolbox was carefully
|
||
and very thoughtfully assembled by some very bright toolsmiths,
|
||
distilling their many decades of experience, and designed, as
|
||
all good toolkits are, for a very specific purpose: building
|
||
fast, safe and solid programs that are oriented around separate
|
||
compilation of functions that primarily do recursive manipulation
|
||
of very complex data structures.
|
||
|
||
Program, like, say ... compilers.
|
||
|
||
Dwight
|
||
|
||
|