2018-02-23 18:58:03 +00:00
|
|
|
|
---
|
|
|
|
|
created_at: '2015-12-10T19:01:01.000Z'
|
|
|
|
|
title: Why ML/OCaml are good for writing compilers (1998)
|
|
|
|
|
url: http://flint.cs.yale.edu/cs421/case-for-ml.html
|
|
|
|
|
author: jasim
|
|
|
|
|
points: 140
|
|
|
|
|
story_text:
|
|
|
|
|
comment_text:
|
|
|
|
|
num_comments: 79
|
|
|
|
|
story_id:
|
|
|
|
|
story_title:
|
|
|
|
|
story_url:
|
|
|
|
|
parent_id:
|
|
|
|
|
created_at_i: 1449774061
|
|
|
|
|
_tags:
|
|
|
|
|
- story
|
|
|
|
|
- author_jasim
|
|
|
|
|
- story_10712566
|
|
|
|
|
objectID: '10712566'
|
2018-06-08 12:05:27 +00:00
|
|
|
|
year: 1998
|
2018-02-23 18:58:03 +00:00
|
|
|
|
|
|
|
|
|
---
|
2018-02-23 18:19:40 +00:00
|
|
|
|
[Source](http://flint.cs.yale.edu/cs421/case-for-ml.html "Permalink to Why ML/OCaml are good for writing compilers")
|
|
|
|
|
|
|
|
|
|
# Why ML/OCaml are good for writing compilers
|
|
|
|
|
|
|
|
|
|
Why ML/OCaml are good for writing compilers
|
|
|
|
|
|
|
|
|
|
### Why ML/OCaml are good for writing compilers
|
|
|
|
|
|
|
|
|
|
| ----- |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| **Author:** | | Dwight VandenBerghe |
|
|
|
|
|
| **Email:** | dwight@pentasoft.com |
|
|
|
|
|
| **Date:** | 1998/07/28 | | | | |
|
|
|
|
|
| **Forums:** | comp.compilers |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Let's use the term "ML" to mean SML or Objective Caml. I'm a devotee
|
|
|
|
|
of Ocaml, but I have SML/NJ installed and although I prefer the
|
|
|
|
|
distribution, tools and overall implementation of Ocaml, I'd be
|
|
|
|
|
happy to write in SML/NJ if Ocaml wasn't around. (I can't speak
|
|
|
|
|
for Haskell, gofer, hugs, or any of the lazy languages. I think
|
|
|
|
|
that there are different kinds of people, some of whom like things
|
|
|
|
|
like deferred evaluation and other who like strict call-by-value.
|
|
|
|
|
I'm decidedly in the latter camp; I like how clean strict evaluation
|
|
|
|
|
is, how easy it is to understand exactly what is going on, and I
|
|
|
|
|
find that I have to give a little of that up with Haskell. But
|
|
|
|
|
your mileage may vary.) So although when I am writing about "ML"
|
|
|
|
|
I am writing about Ocaml, what I say also applies, I think, to SML/NJ.
|
|
|
|
|
|
|
|
|
|
So here is an unordered list of language features that seems to me
|
|
|
|
|
to make writing compilers a pleasure rather than a horrendous chore.
|
|
|
|
|
|
|
|
|
|
1. Garbage collection. It may seem elementary to mention this,
|
|
|
|
|
but gc is an incredible boon to programs that have lots of complex
|
|
|
|
|
data structures with short-to-intermediate lifetimes. And compilers
|
|
|
|
|
are, above all, programs with complex data structures. You gotta
|
|
|
|
|
be a data structure junkie to like compilers. Well, C/C++, Pascal,
|
|
|
|
|
and their ilk, with unrestricted pointers and malloc and their
|
|
|
|
|
cousins, make you take out your own trash. You have to be the
|
|
|
|
|
programmer and the janitor, at the same time. Some of us are
|
|
|
|
|
better programmers than we are janitors. ML has, perhaps, the
|
|
|
|
|
best gc around; it's so fast that for many real apps it's as fast
|
|
|
|
|
as the C++ malloc/free, and maybe even a little faster in some
|
|
|
|
|
cases. You don't have to wring your hands about using ML's gc,
|
|
|
|
|
as you do with Java's, which is slow. It's just there, invisible,
|
|
|
|
|
blindingly fast, and it works all the time, and your life is thus
|
|
|
|
|
made much easier.
|
|
|
|
|
|
|
|
|
|
2. Tail recursion is optimized. Thus, once you know how to take
|
|
|
|
|
advantage of it, you can write tree walks that don't eat up stack
|
|
|
|
|
space, and are very fast. Now, to be fair, there are C++ compilers
|
|
|
|
|
(like VisualC++ 5) that supposedly have some tail recursion
|
|
|
|
|
eliminated, but you can't count on it (VC has a lot of bugs in
|
|
|
|
|
theirs). ML is matched to recursion very well, and since many
|
|
|
|
|
of the data structures in compilers tend to be best handled with
|
|
|
|
|
recursive procedures, there is a good fit there.
|
|
|
|
|
|
|
|
|
|
3. The data types in ML match the compiling process. Compilers
|
|
|
|
|
don't tend to worry about unsigned shorts vs signed chars,
|
|
|
|
|
but use "ints" everywhere, as well as strings. Strings are used
|
|
|
|
|
all over the place, and C/C++ are pretty terrible with strings,
|
|
|
|
|
even in the presence of templates. There are some places in
|
|
|
|
|
compilers where it's nice to have arithmetic on quantities a
|
|
|
|
|
little larger than ints, so the "bignum" facility is actually
|
|
|
|
|
useful in those cases (like when you have to fold constants
|
|
|
|
|
or tokenize with a little more precision than the underlying
|
|
|
|
|
numeric types, then convert back to the native form). If you
|
|
|
|
|
don't have bignums, then you have to write your own, or resort
|
|
|
|
|
to subterfuge. (See guru compiler writer Dave Hanson's "C
|
|
|
|
|
Interfaces and Implementations" for an example of what I mean.)
|
|
|
|
|
|
|
|
|
|
4. The type constructors in ML are just plain wonderful for
|
|
|
|
|
describing things like ASTs (abstract syntax trees). They
|
|
|
|
|
implement what are sometimes called "tagged unions" - that is,
|
|
|
|
|
a union data type (efficient, small) that, unlike with C/C++,
|
|
|
|
|
come along with a tag field that tells what's in the union.
|
|
|
|
|
This is enforced: in other words, there's no way to circumvent
|
|
|
|
|
it. ML pattern matching is designed to work with tagged unions
|
|
|
|
|
so that the source code is incredibly readable, for functions
|
|
|
|
|
that take a data structure as an argument. Combine this with
|
|
|
|
|
type inference and tail recursion elimination, and you have
|
|
|
|
|
a language that is optimized for recursive functions that
|
|
|
|
|
take complex data structures are arguments ... sound familiar?
|
|
|
|
|
|
|
|
|
|
5. Safety. ML was conceived as a solution to the main problem
|
|
|
|
|
faced by mathematicians using automated theorem provers: that
|
|
|
|
|
because of the type-free, dangerous, cavalier nature of the
|
|
|
|
|
usual language for that domain (Lisp), you could never be sure
|
|
|
|
|
that your program was going to work. Of course, this could
|
|
|
|
|
be said about all languages, but ML was an attempt to restrict
|
|
|
|
|
the domain a bit, to add efficiency and safety at the same time.
|
|
|
|
|
ML programs can't crash the system; if it compiles, it will run,
|
|
|
|
|
and you won't get a segmentation fault. You can prove certain
|
|
|
|
|
properties about your program, you can trust that certain kinds
|
|
|
|
|
of errors just plain can't happen. For example, since lists are
|
|
|
|
|
immutable, and must contain elements of a single type, you don't
|
|
|
|
|
have to worry about screwing up a string list by putting an
|
|
|
|
|
integer in it by accident (as you can with Scheme and Lisp).
|
|
|
|
|
In fact, you can't put anything into it at all; you have to
|
|
|
|
|
create a new list. This makes the underlying routines able
|
|
|
|
|
to be much faster than lists in other languages, where you
|
|
|
|
|
have to be concerned about double-links, destructive updating,
|
|
|
|
|
and so on. With a blindingly fast GC system, this all works
|
|
|
|
|
out just fine ... and you sleep better at night.
|
|
|
|
|
|
|
|
|
|
6. ML was designed for an application domain (theorem
|
|
|
|
|
proving) that is characterized by big, hairy, recursive data
|
|
|
|
|
structures that have complex algorithms running against them.
|
|
|
|
|
Sound familiar?
|
|
|
|
|
|
|
|
|
|
7. Exceptions. ML implements fast and clean exception handling,
|
|
|
|
|
which is a real joy if you've never had the pleasure of using them
|
|
|
|
|
before. You write a table lookup by assuming that the key will
|
|
|
|
|
be found, and then wrapping the search in a "try" block that
|
|
|
|
|
catches the exceptional case ("not found"). So you can't ever
|
|
|
|
|
screw up your program by forgetting to test the not-found case;
|
|
|
|
|
if you do that, the runtime system will stop with an uncaught
|
|
|
|
|
exception, telling just where the exception was thrown. Programs
|
|
|
|
|
become easier to read, cleaner, more robust, when you learn to
|
|
|
|
|
use exceptions.
|
|
|
|
|
|
|
|
|
|
8. Type inference. In a thousand lines of ML you may have to
|
|
|
|
|
declare only two or maybe three variables, if that. It just figures
|
|
|
|
|
out the types by how the variables are used. And it doesn't guess,
|
|
|
|
|
like (say) Perl does. It knows for sure. One of the reasons I
|
|
|
|
|
like Ocaml over SML is that Ocaml doesn't like operator overloading:
|
|
|
|
|
there are different operators for float addition ("+.") and integer
|
|
|
|
|
addition ("+"), for example. Type inference and operator overloading
|
|
|
|
|
are uncomfortable bed partners; in my opinion, a language designer
|
|
|
|
|
should choose between them, rather than try to serve both masters.
|
|
|
|
|
But anything's better than Pascal or C, where the compiler just
|
|
|
|
|
tosses out everything it knows about usage and makes you state
|
|
|
|
|
the obvious, over and over and over again. All we can do is
|
|
|
|
|
screw it up, and so we do, over and over.
|
|
|
|
|
|
|
|
|
|
9. Lex/yacc/burg. ML has great implementations of these standard
|
|
|
|
|
tools, and once you know how to use them, they make a lot of jobs
|
|
|
|
|
simpler. No, I'm not a great lex fan, nor do I prefer lalr(1)
|
|
|
|
|
over ll(k), but I'm a pragmatist: if it's there, and it's well-
|
|
|
|
|
implemented, I'll use it. Ocaml and SML/NJ both have really
|
|
|
|
|
good, solid compiler tools implementations that are used by
|
|
|
|
|
their developers. Not many languages come with better toolkits.
|
|
|
|
|
|
|
|
|
|
10. Did I mention that Ocaml is fast? I wrote a compiler for
|
|
|
|
|
an actuarial financial modelling language in Ocaml, probably
|
|
|
|
|
10K lines of code, that would have been 20K or more in C++,
|
|
|
|
|
and it compiles the largest known program in 3 seconds on my
|
|
|
|
|
pentium 200. Most programs compile in under 1 second. I find
|
|
|
|
|
this astounding.
|
|
|
|
|
|
|
|
|
|
11. Support. I've gotten better support from Inria (Xavier
|
|
|
|
|
Leroy and Pierre Weis, among others) than I ever have from
|
|
|
|
|
any other language vendor. There just aren't many problems
|
|
|
|
|
with the language, and those few problems I've encountered
|
|
|
|
|
have been fixed in a matter of days. Compare this to the
|
|
|
|
|
support for, say, VC++ or Turbo Pascal.
|
|
|
|
|
|
|
|
|
|
12. Library. ML's standard library has a lot of data structure
|
|
|
|
|
stuff in it, which helps a lot. I find it more complete and
|
|
|
|
|
succinct and usable than the usual glommed-together mess of
|
|
|
|
|
most other languages.
|
|
|
|
|
|
|
|
|
|
13. The module system. ML has strong, well-thought-out support
|
|
|
|
|
for separate compilation that allows a separately compiled module
|
|
|
|
|
to support polymorphism (that is, it can operate on arbitrary
|
|
|
|
|
types). The visibility of the internals of the module can be
|
|
|
|
|
exactly controlled. "Functors" can specialize a module into
|
|
|
|
|
a specific incarnation. It's pretty cool, like C++ templates
|
|
|
|
|
without the pain and suffering.
|
|
|
|
|
|
|
|
|
|
So it's mostly about data structures. ML is extraordinarily
|
|
|
|
|
facile at allowing you to express complex data structures and
|
|
|
|
|
recursive algorithms around them. The most basic data structures
|
|
|
|
|
(lists, arrays, structs, unions, property lists, hash tables,
|
|
|
|
|
binary trees, queues, and so on) are sitting there in the
|
|
|
|
|
language already, well-implemented and ready to go. You start
|
|
|
|
|
off from a place that you would have had to build up in another
|
|
|
|
|
language.
|
|
|
|
|
|
|
|
|
|
Is ML the perfect language? Lord, no. It has many flaws,
|
|
|
|
|
like every other language. The syntax is weird, some parts
|
|
|
|
|
of it are hard to learn, some parts are hard to use. A simple
|
|
|
|
|
thing like Printf can be quite strange to express. You can't
|
|
|
|
|
use ML to advantage for many problem domains (like the one I
|
|
|
|
|
work a lot in, embedded systems). It would be terrible to
|
|
|
|
|
use for, say, programming an FFT inside a DSP chip. ML
|
|
|
|
|
doesn't sit all that well with GUIs yet, as far as I've seen,
|
|
|
|
|
and support for OOP is lacking (although I see that as a
|
|
|
|
|
feature, not a restriction).
|
|
|
|
|
|
|
|
|
|
But all languages have some problem domains in which they
|
|
|
|
|
shine. I think that compiler implementation is one of those areas
|
|
|
|
|
for ML. You're writing a compiler and in middle of a function
|
|
|
|
|
you need a 9mm crescent wrench with a box on the other end.
|
|
|
|
|
You open up your toolkit and ... there it is, in the top drawer,
|
|
|
|
|
bright and shiny and strong. You use it, and then a few minutes
|
|
|
|
|
later you need a small phillips screwdriver with a clip and
|
|
|
|
|
a magnet ... and there it is, bright and shiny and strong.
|
|
|
|
|
|
|
|
|
|
It's not that there are an extraordinary number of tools in
|
|
|
|
|
the toolbox. (No; in fact, the toolbox is much smaller than
|
|
|
|
|
the usual toolboxes, the ones used by your friends that contain
|
|
|
|
|
everything but the sink.) It's that the toolbox was carefully
|
|
|
|
|
and very thoughtfully assembled by some very bright toolsmiths,
|
|
|
|
|
distilling their many decades of experience, and designed, as
|
|
|
|
|
all good toolkits are, for a very specific purpose: building
|
|
|
|
|
fast, safe and solid programs that are oriented around separate
|
|
|
|
|
compilation of functions that primarily do recursive manipulation
|
|
|
|
|
of very complex data structures.
|
|
|
|
|
|
|
|
|
|
Program, like, say ... compilers.
|
|
|
|
|
|
|
|
|
|
Dwight
|
|
|
|
|
|
|
|
|
|
|