--- created_at: '2015-12-10T19:01:01.000Z' title: Why ML/OCaml are good for writing compilers (1998) url: http://flint.cs.yale.edu/cs421/case-for-ml.html author: jasim points: 140 story_text: comment_text: num_comments: 79 story_id: story_title: story_url: parent_id: created_at_i: 1449774061 _tags: - story - author_jasim - story_10712566 objectID: '10712566' year: 1998 --- [Source](http://flint.cs.yale.edu/cs421/case-for-ml.html "Permalink to Why ML/OCaml are good for writing compilers") # Why ML/OCaml are good for writing compilers Why ML/OCaml are good for writing compilers ### Why ML/OCaml are good for writing compilers | ----- | | |   | | | **Author:** |   | Dwight VandenBerghe | | **Email:** | dwight@pentasoft.com | | **Date:** | 1998/07/28 | | | | | | **Forums:** | comp.compilers | | Let's use the term "ML" to mean SML or Objective Caml. I'm a devotee of Ocaml, but I have SML/NJ installed and although I prefer the distribution, tools and overall implementation of Ocaml, I'd be happy to write in SML/NJ if Ocaml wasn't around. (I can't speak for Haskell, gofer, hugs, or any of the lazy languages. I think that there are different kinds of people, some of whom like things like deferred evaluation and other who like strict call-by-value. I'm decidedly in the latter camp; I like how clean strict evaluation is, how easy it is to understand exactly what is going on, and I find that I have to give a little of that up with Haskell. But your mileage may vary.) So although when I am writing about "ML" I am writing about Ocaml, what I say also applies, I think, to SML/NJ. So here is an unordered list of language features that seems to me to make writing compilers a pleasure rather than a horrendous chore. 1. Garbage collection. It may seem elementary to mention this, but gc is an incredible boon to programs that have lots of complex data structures with short-to-intermediate lifetimes. And compilers are, above all, programs with complex data structures. You gotta be a data structure junkie to like compilers. Well, C/C++, Pascal, and their ilk, with unrestricted pointers and malloc and their cousins, make you take out your own trash. You have to be the programmer and the janitor, at the same time. Some of us are better programmers than we are janitors. ML has, perhaps, the best gc around; it's so fast that for many real apps it's as fast as the C++ malloc/free, and maybe even a little faster in some cases. You don't have to wring your hands about using ML's gc, as you do with Java's, which is slow. It's just there, invisible, blindingly fast, and it works all the time, and your life is thus made much easier. 2. Tail recursion is optimized. Thus, once you know how to take advantage of it, you can write tree walks that don't eat up stack space, and are very fast. Now, to be fair, there are C++ compilers (like VisualC++ 5) that supposedly have some tail recursion eliminated, but you can't count on it (VC has a lot of bugs in theirs). ML is matched to recursion very well, and since many of the data structures in compilers tend to be best handled with recursive procedures, there is a good fit there. 3. The data types in ML match the compiling process. Compilers don't tend to worry about unsigned shorts vs signed chars, but use "ints" everywhere, as well as strings. Strings are used all over the place, and C/C++ are pretty terrible with strings, even in the presence of templates. There are some places in compilers where it's nice to have arithmetic on quantities a little larger than ints, so the "bignum" facility is actually useful in those cases (like when you have to fold constants or tokenize with a little more precision than the underlying numeric types, then convert back to the native form). If you don't have bignums, then you have to write your own, or resort to subterfuge. (See guru compiler writer Dave Hanson's "C Interfaces and Implementations" for an example of what I mean.) 4. The type constructors in ML are just plain wonderful for describing things like ASTs (abstract syntax trees). They implement what are sometimes called "tagged unions" - that is, a union data type (efficient, small) that, unlike with C/C++, come along with a tag field that tells what's in the union. This is enforced: in other words, there's no way to circumvent it. ML pattern matching is designed to work with tagged unions so that the source code is incredibly readable, for functions that take a data structure as an argument. Combine this with type inference and tail recursion elimination, and you have a language that is optimized for recursive functions that take complex data structures are arguments ... sound familiar? 5. Safety. ML was conceived as a solution to the main problem faced by mathematicians using automated theorem provers: that because of the type-free, dangerous, cavalier nature of the usual language for that domain (Lisp), you could never be sure that your program was going to work. Of course, this could be said about all languages, but ML was an attempt to restrict the domain a bit, to add efficiency and safety at the same time. ML programs can't crash the system; if it compiles, it will run, and you won't get a segmentation fault. You can prove certain properties about your program, you can trust that certain kinds of errors just plain can't happen. For example, since lists are immutable, and must contain elements of a single type, you don't have to worry about screwing up a string list by putting an integer in it by accident (as you can with Scheme and Lisp). In fact, you can't put anything into it at all; you have to create a new list. This makes the underlying routines able to be much faster than lists in other languages, where you have to be concerned about double-links, destructive updating, and so on. With a blindingly fast GC system, this all works out just fine ... and you sleep better at night. 6. ML was designed for an application domain (theorem proving) that is characterized by big, hairy, recursive data structures that have complex algorithms running against them. Sound familiar? 7. Exceptions. ML implements fast and clean exception handling, which is a real joy if you've never had the pleasure of using them before. You write a table lookup by assuming that the key will be found, and then wrapping the search in a "try" block that catches the exceptional case ("not found"). So you can't ever screw up your program by forgetting to test the not-found case; if you do that, the runtime system will stop with an uncaught exception, telling just where the exception was thrown. Programs become easier to read, cleaner, more robust, when you learn to use exceptions. 8. Type inference. In a thousand lines of ML you may have to declare only two or maybe three variables, if that. It just figures out the types by how the variables are used. And it doesn't guess, like (say) Perl does. It knows for sure. One of the reasons I like Ocaml over SML is that Ocaml doesn't like operator overloading: there are different operators for float addition ("+.") and integer addition ("+"), for example. Type inference and operator overloading are uncomfortable bed partners; in my opinion, a language designer should choose between them, rather than try to serve both masters. But anything's better than Pascal or C, where the compiler just tosses out everything it knows about usage and makes you state the obvious, over and over and over again. All we can do is screw it up, and so we do, over and over. 9. Lex/yacc/burg. ML has great implementations of these standard tools, and once you know how to use them, they make a lot of jobs simpler. No, I'm not a great lex fan, nor do I prefer lalr(1) over ll(k), but I'm a pragmatist: if it's there, and it's well- implemented, I'll use it. Ocaml and SML/NJ both have really good, solid compiler tools implementations that are used by their developers. Not many languages come with better toolkits. 10. Did I mention that Ocaml is fast? I wrote a compiler for an actuarial financial modelling language in Ocaml, probably 10K lines of code, that would have been 20K or more in C++, and it compiles the largest known program in 3 seconds on my pentium 200. Most programs compile in under 1 second. I find this astounding. 11. Support. I've gotten better support from Inria (Xavier Leroy and Pierre Weis, among others) than I ever have from any other language vendor. There just aren't many problems with the language, and those few problems I've encountered have been fixed in a matter of days. Compare this to the support for, say, VC++ or Turbo Pascal. 12. Library. ML's standard library has a lot of data structure stuff in it, which helps a lot. I find it more complete and succinct and usable than the usual glommed-together mess of most other languages. 13. The module system. ML has strong, well-thought-out support for separate compilation that allows a separately compiled module to support polymorphism (that is, it can operate on arbitrary types). The visibility of the internals of the module can be exactly controlled. "Functors" can specialize a module into a specific incarnation. It's pretty cool, like C++ templates without the pain and suffering. So it's mostly about data structures. ML is extraordinarily facile at allowing you to express complex data structures and recursive algorithms around them. The most basic data structures (lists, arrays, structs, unions, property lists, hash tables, binary trees, queues, and so on) are sitting there in the language already, well-implemented and ready to go. You start off from a place that you would have had to build up in another language. Is ML the perfect language? Lord, no. It has many flaws, like every other language. The syntax is weird, some parts of it are hard to learn, some parts are hard to use. A simple thing like Printf can be quite strange to express. You can't use ML to advantage for many problem domains (like the one I work a lot in, embedded systems). It would be terrible to use for, say, programming an FFT inside a DSP chip. ML doesn't sit all that well with GUIs yet, as far as I've seen, and support for OOP is lacking (although I see that as a feature, not a restriction). But all languages have some problem domains in which they shine. I think that compiler implementation is one of those areas for ML. You're writing a compiler and in middle of a function you need a 9mm crescent wrench with a box on the other end. You open up your toolkit and ... there it is, in the top drawer, bright and shiny and strong. You use it, and then a few minutes later you need a small phillips screwdriver with a clip and a magnet ... and there it is, bright and shiny and strong. It's not that there are an extraordinary number of tools in the toolbox. (No; in fact, the toolbox is much smaller than the usual toolboxes, the ones used by your friends that contain everything but the sink.) It's that the toolbox was carefully and very thoughtfully assembled by some very bright toolsmiths, distilling their many decades of experience, and designed, as all good toolkits are, for a very specific purpose: building fast, safe and solid programs that are oriented around separate compilation of functions that primarily do recursive manipulation of very complex data structures. Program, like, say ... compilers. Dwight