hn-classics/_stories/2008/12471604.md

283 lines
15 KiB
Markdown
Raw Permalink Normal View History

---
created_at: '2016-09-11T01:05:24.000Z'
title: Why are the Microsoft Office file formats so complicated? (2008)
url: http://www.joelonsoftware.com/items/2008/02/19.html
author: diziet
points: 393
story_text:
comment_text:
num_comments: 227
story_id:
story_title:
story_url:
parent_id:
created_at_i: 1473555924
_tags:
- story
- author_diziet
- story_12471604
objectID: '12471604'
2018-06-08 12:05:27 +00:00
year: 2008
---
2018-03-03 09:35:28 +00:00
Last week, Microsoft published the [binary file formats for
Office](http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx).
These formats appear to be almost completely insane. The Excel 97-2003
file format is a 349 page PDF file. But wait, thats not all there is to
it\! This document includes the following interesting comment:
2018-02-23 18:19:40 +00:00
2018-03-03 09:35:28 +00:00
> Each Excel workbook is stored in a compound file.
2018-02-23 18:19:40 +00:00
2018-03-03 09:35:28 +00:00
You see, Excel 97-2003 files are OLE compound documents, which are,
essentially, file systems inside a single file. These are sufficiently
complicated that you have to read another 9 page spec to figure that
out. And these “specs” look more like C data structures than what we
traditionally think of as a spec. Its a whole hierarchical file system.
2018-02-23 18:19:40 +00:00
2018-03-03 09:35:28 +00:00
If you started reading these documents with the hope of spending a
weekend writing some spiffy code that imports Word documents into your
blog system, or creates Excel-formatted spreadsheets with your personal
finance data, the complexity and length of the spec probably cured you
of that desire pretty darn quickly. A normal programmer would conclude
that Offices binary file formats:
- are deliberately obfuscated
- are the product of a demented Borg mind
- were created by insanely bad programmers
- and are impossible to read or create correctly.
Youd be wrong on all four counts. With a little bit of digging, Ill
show you how those file formats got so unbelievably complicated, why it
doesnt reflect bad programming on Microsofts part, and what you can do
to work around it.
The first thing to understand is that the binary file formats were
designed with very different design goals than, say, HTML.
**They were designed to be fast on very old computers.** For the early
versions of Excel for Windows, 1 MB of RAM was a reasonable amount of
memory, and an 80386 at 20 MHz had to be able to run Excel comfortably.
There are a lot of optimizations in the file formats that are intended
to make opening and saving files much faster:
- These are binary formats, so loading a record is usually a matter of
just copying (blitting) a range of bytes from disk to memory, where
you end up with a C data structure you can use. Theres no lexing or
parsing involved in loading a file. [Lexing and parsing are orders
of magnitude slower than
blitting](https://www.joelonsoftware.com/articles/fog0000000319.html).
- The file format is contorted, where necessary, to make common
operations fast. For example, Excel 95 and 97 have something called
“Simple Save” which they use sometimes as a faster variation on
the OLE compound document format, which just wasnt fast enough for
mainstream use. Word had something called [Fast
Save](http://support.microsoft.com/kb/197978). To save a long
document quickly, 14 out of 15 times, only the changes are appended
to the end of the file, instead of rewriting the whole document from
scratch. On the hard drives of the day, this meant saving a long
document took one second instead of thirty. (It also meant that
deleted data in a document was still in the file. [This turned out
to be not what people
wanted](http://www.news.com/8301-10784_3-9780073-7.html).)
**They were designed to use libraries.** If you wanted to write a
from-scratch binary importer, youd have to support things like the
Windows Metafile Format (for drawing things) and OLE Compound Storage.
If youre running on Windows, theres library support for these that
makes it trivial… using these features was a shortcut for the Microsoft
team. But if youre writing everything on your own from scratch, you
have to do all that work yourself.
Office has extensive support for compound documents, for example, you
can embed a spreadsheet in a Word document. A perfect Word file format
parser would also have to be able to do something intelligent with the
embedded spreadsheet.
**They were not designed with interoperability in mind.** The
assumption, and a fairly reasonable one at the time, was that the Word
file format only had to be read and written by Word. That means that
whenever a programmer on the Word team had to make a decision about how
to change the file format, the only thing they cared about was (a) what
was fast and (b) what took the fewest lines of code in the Word code
base. The idea of things like SGML and HTML—interchangeable,
standardized file formats—didnt really take hold until the Internet
made it practical to interchange documents in the first place; this was
a decade later than the Office binary formats were first invented. There
was always an assumption that you could use importers and exporters to
exchange documents. In fact Word does have a format designed for easy
interchange, called
[RTF](http://en.wikipedia.org/wiki/Rich_Text_Format), which has been
there almost since the beginning. Its still 100% supported.
**They have to reflect all the complexity of the applications.** Every
checkbox, every formatting option, and every feature in Microsoft Office
has to be represented in file formats somewhere. That checkbox in Words
paragraph menu called “Keep With Next” that causes a paragraph to be
moved to the next page if necessary so that its on the same page as the
paragraph after it? That has to be in the file format. And that means if
you want to implement a perfect Word clone than can correctly read Word
documents, you have to implement that feature. If youre creating a
competitive word processor that has to load Word documents, it may only
take you a minute to write the code to load that bit from the file
format, but it might take you weeks to change your page layout algorithm
to accommodate it. If you dont, customers will open their Word files in
your clone and all the pages will be messed up.
**They have to reflect the history of the applications.** A lot of the
complexities in these file formats reflect features that are old,
complicated, unloved, and rarely used. Theyre still in the file format
for backwards compatibility, and because it doesnt cost anything for
Microsoft to leave the code around. But if you really want to do a
thorough and complete job of parsing and writing these file formats, you
have to redo all that work that some intern did at Microsoft 15 years
ago. The bottom line is that there are **thousands of developer years**
of work that went into the current versions of Word and Excel, and if
you really want to clone those applications completely, youre going to
have to do thousands of years of work. A file format is just a concise
summary of all the features an application supports.
Just for kicks, lets look at one tiny example in depth. An Excel
worksheet is a bunch of BIFF records of different types. I want to look
at the very first BIFF record in the spec. Its a record called
**1904**.
The Excel file format specification is remarkably obscure about this. It
just says that the 1904 record indicates “if the 1904 date system is
used.” Ah. A classic piece of useless specification. If you were a
developer working with the Excel file format, and you found this in the
file format specification, you might be justified in concluding that
Microsoft is hiding something. This piece of information does not give
you enough information. You also need some [outside
knowledge](http://support.microsoft.com/kb/214330), which Ill fill you
in on now. There are two kinds of Excel worksheets: those where the
epoch for dates is 1/1/1900 (with a [leap-year
bug](http://support.microsoft.com/kb/214058/) deliberately created for
1-2-3 compatibility that is too boring to describe here), and those
where the epoch for dates is 1/1/1904. Excel supports both because the
first version of Excel, for the Mac, just used that operating systems
epoch because that was easy, but Excel for Windows had to be able to
import 1-2-3 files, which used 1/1/1900 for the epoch. Its enough to
bring you to tears. At no point in history did a programmer ever not do
the right thing, but there you have it.
Both 1900 and 1904 file types are commonly found in the wild, usually
depending on whether the file originated on Windows or Mac. Converting
from one to another silently can cause data integrity errors, so Excel
wont change the file type for you. To parse Excel files you have to
handle both. Thats not just a matter of loading this bit from the file.
It means you have to rewrite all of your date display and parsing code
to handle both epochs. That would take several days to implement, I
think.
Indeed, as you work on your Excel clone, youll discover all kinds of
subtle details about date handling. When does Excel convert numbers to
dates? How does the formatting work? Why is 1/31 interpreted as January
31 of this year, while 1/50 is interpreted as January 1st, 1950? All of
these subtle bits of behavior cannot be fully documented without writing
a document that has the same amount of information as the Excel source
code.
And this is only the first of hundreds of BIFF records you have to
handle, and one of the simplest. Most of them are complicated enough to
reduce a grown programmer to tears.
The only possible conclusion is this. Its very helpful of Microsoft to
release the file formats for Microsoft and Office, but its not really
going to make it any easier to import or save to the Office file
formats. These are insanely complex and rich applications, and you
[cant just implement the most popular 20% and expect 80% of the
people to be
happy](https://www.joelonsoftware.com/articles/fog0000000020.html). The
binary file specification is, at most, going to save you a few minutes
reverse engineering a remarkably complex system.
OK, I promised some workarounds. The good news is that for almost all
common applications, trying to read or write the Office binary file
formats is the wrong decision. There are two major alternatives you
should seriously consider: letting Office do the work, or using file
formats that are easier to write.
**Let Office do the heavy work for you**. Word and Excel have extremely
complete object models, available via COM Automation, which allow you to
programmatically do anything. In many situations, you are better off
reusing the code inside Office rather than trying to reimplement it.
Here are a few examples.
1. You have a web-based application thats needs to output existing
Word files in PDF format. Heres how I would implement that: a few
lines of Word VBA code loads a file and saves it as a PDF using the
built in PDF exporter in Word 2007. You can call this code directly,
even from ASP or ASP.NET code running under IIS. Itll work. The
first time you launch Word itll take a few seconds. The second
time, Word will be kept in memory by the COM subsystem for a few
minutes in case you need it again. Its fast enough for a reasonable
web-based application.
2. Same as above, but your web hosting environment is Linux. Buy one
Windows 2003 server, install a fully licensed copy of Word on it,
and build a little web service that does the work. Half a day of
work with C\# and ASP.NET.
3. Same as above, but you need to scale. Throw a load balancer in front
of any number of boxes that you built in step 2. No code required.
This kind of approach would work for all kinds of common Office types of
applications you might perform on your server. For example:
- Opening an Excel workbook, storing some data in input cells,
recalculating, and pulling some results out of output cells
- Using Excel to generate charts in GIF format
- Pulling just about any kind of information out of any kind of Excel
worksheet without spending a minute thinking about file formats
- Converting Excel file formats to CSV tabular data (another approach
is to use [Excel ODBC
drivers](http://support.microsoft.com/kb/178717) to suck data out
using SQL queries).
- Editing Word documents
- Filling out Word forms
- Converting files between any of the many file formats supported by
Office (there are importers for dozens of word processor and
spreadsheet formats)
In all of these cases, there are ways to tell the Office objects that
theyre not running interactively, so they shouldnt bother updating the
screen and they shouldnt prompt for user input. By the way, if you go
this route, there are a few gotchas, and its not officially supported
by Microsoft, so read their [knowledge base
article](http://support.microsoft.com/default.aspx?scid=kb;EN-US;257757)
before you get started.
**Use a simpler format for writing files**. If you merely have to
produce Office documents programmatically, theres almost always a
better format than the Office binary formats that you can use which Word
and Excel will open happily, without missing a beat.
- If you simply have to produce tabular data for use in Excel,
consider CSV.
- If you really need worksheet calculation features that CSV doesnt
support, the [WK1
format](http://www.schnarff.com/file-formats/index.html) (Lotus
1-2-3) is a heck of a lot simpler than Excel, and Excel will open it
fine.
- If you really, really have to generate native Excel files, find an
extremely old version of Excel… Excel 3.0 is a good choice, before
all the compound document stuff, and save a minimum file containing
only the exact features you want to use. Use this file to see the
exact minimum BIFF records that you have to output and just focus on
that part of the spec.
- For Word documents, consider writing HTML. Word will open those
fine, too.
- If you really want to generate fancy formatted Word documents, your
best bet is to create an RTF document. Everything that Word can do
can be expressed in RTF, but its a text format, not binary, so you
can change things in the RTF document and itll still work. You can
create a nicely formatted document with placeholders in Word, save
as RTF, and then using simple text substitution, replace the
placeholders on the fly. Now you have an RTF document that every
version of Word will open happily.
Anyway, unless youre literally trying to create a competitor to Office
that can read and write all Office files perfectly, in which case,
youve got thousands of years of work cut out for you, chances are
that reading or writing the Office binary formats is the most labor
intensive way to solve whatever problem it is that youre trying to
solve.