Crash-only software: More than meets the eye [LWN.net]

Source: http://lwn.net/Articles/191059/

July 12, 2006

This article was contributed by Valerie Henson

Next time your Linux laptop crashes, pull out your watch (or your cell phone) and time how long it takes to boot up. More than likely, you're running a journaling file system, and not only did your system boot up quickly, but it didn't lose any data that you cared about. (Maybe you lost the last few bytes of your DHCP client's log file, darn.) Now, keep your timekeeping device of choice handy and execute a normal shutdown and reboot. More than likely, you will find that it took longer to reboot "normally" than it did to crash your system and recover it - and for no perceivable benefit.

George Candea and Armando Fox noticed that, counter-intuitively, many software systems can crash and recover more quickly than they can be shut down and restarted. They reported the following measurements in their paper, Crash-only Software (published in Hot Topics in Operating Systems IX in 2003):

| System | Clean reboot | Crash reboot | Speedup |
| ----- | ----- | ----- | ----- |
| RedHat 8 (ext3) | 104 sec | 75 sec | 1.4x |
| JBoss 3.0 app server | 47 sec | 39 sec | 1.2x |
| Windows XP | 61 sec | 48 sec | 1.3x |

In their experiments, no important data was lost. This is not surprising as, after all, good software is designed to safely handle crashes. Software that loses or ruins your data when it crashes isn't very popular in today's computing environment - remember how frustrating it was to use word processors without an auto-save feature? What is surprising is that most systems have two methods of shutting down - cleanly or by crashing - and two methods of starting up - normal start up or recovery - and that frequently the crash/recover method is, by all objective measures, a better choice. Given this, why support the extra code (and associated bugs) to do a clean start up and shutdown? In other words, why should I ever type "halt" instead of hitting the power button?

The main reason to support explicit shutdown and start-up is simple: performance. Often, designers must trade off higher steady state performance (when the application is running normally) with performance during a restart - and with acceptable data loss. File systems are a good example of this trade-off: ext2 runs very quickly while in use but takes a long time to recover and makes no guarantees about when data hits disk, while ext3 has somewhat lower performance while in use but is very quick to recover and makes explicit guarantees about when data hits disk. When overall system availability and acceptable data loss in the event of a crash are factored into the performance equation, ext3 or any other journaling file system is the winner for many systems, including, more than likely, the laptop you are using to read this article.

Crash-only software is software that crashes safely and recovers quickly. The only way to stop it is to crash it, and the only way to start it is to recover. A crash-only system is composed of crash-only components which communicate with retryable requests; faults are handled by crashing and restarting the faulty component and retrying any requests which have timed out. The resulting system is often more robust and reliable because crash recovery is a first-class citizen in the development process, rather than an afterthought, and you no longer need the extra code (and associated interfaces and bugs) for explicit shutdown. All software ought to be able to crash safely and recover quickly, but crash-only software must have these qualities, or their lack becomes quickly evident.
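As a concrete (and purely illustrative) sketch of these ideas, consider a toy crash-only component in Python: it has no clean-shutdown path at all, requests carry timeouts, and a timed-out request is handled by crash/restarting the component and retrying. All class and function names here are invented for this example; they are not from the paper.

```python
import queue
import threading

# A toy crash-only component: a worker thread draining a request queue.
# The only way to stop it is to crash it; the only way to start it is
# recover() -- there is no clean-shutdown code path at all.

class Component:
    def __init__(self):
        self.requests = queue.Queue()
        self.recover()          # starting *is* recovering

    def recover(self):
        self.alive = True
        self.thread = threading.Thread(target=self._loop, daemon=True)
        self.thread.start()

    def crash(self):
        self.alive = False      # a stand-in for an external kill -9

    def _loop(self):
        while self.alive:
            try:
                req, reply = self.requests.get(timeout=0.05)
            except queue.Empty:
                continue
            reply.put(req * 2)  # the component's "work"

def call(component, req, timeout=0.2, retries=3):
    """Retryable request: on timeout, crash/restart the component, retry."""
    for _ in range(retries):
        reply = queue.Queue()
        component.requests.put((req, reply))
        try:
            return reply.get(timeout=timeout)
        except queue.Empty:
            component.crash()
            component.recover()
    raise TimeoutError(req)
```

Note that the caller never distinguishes "the component was slow" from "the component was dead": both surface as a timeout, and both are handled the same way, by crash/restart and retry.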

The concept of crash-only software has received quite a lot of attention since its publication. Besides several well-received research papers demonstrating useful implementations of crash-only software, crash-only software has been covered in several popular articles in publications as diverse as Scientific American, Salon.com, and CIO Today. It was cited as one of the reasons Armando Fox was named one of Scientific American's list of top 50 scientists for 2003 and George Candea as one of MIT Technology Review's Top 35 Young Innovators for 2005. Crash-only software has made its mark outside the press room as well; for example, Google's distributed file system, GoogleFS, is implemented as crash-only software, all the way through to the metadata server. The term "crash-only" is now regularly bandied about in design discussions for production software. I myself wrote a blog entry on crash-only software back in 2004. Why bother writing about it again? Quite simply, the crash-only software meme became so popular that, inevitably, mutations arose and flourished, sometimes to the detriment of allegedly crash-only software systems. In this article, we will review some of the more common misunderstandings about designing and implementing crash-only software.

Misconceptions about crash-only software

The first major misunderstanding is that crash-only software is a form of free lunch: you can be lazy and not write shutdown code, not handle errors (just crash it! whee!), or not save state. Just pull up your favorite application in an editor, delete the code for normal start up and shutdown, and voila! instant crash-only software. In fact, crash-only software involves greater discipline and more careful design, because if your checkpointing and recovery code doesn't work, you will find out right away. Crash-only design helps you produce more robust, reliable software; it doesn't exempt you from writing robust, reliable software in the first place.

Another mistake is overuse of the crash/restart "hammer." One of the ideas in crash-only software is that if a component is behaving strangely or suffering from some bug, you can just crash it and restart it, and more than likely it will start functioning again. This will often be faster than diagnosing and fixing the problem by hand, making it a good technique for high-availability services. Some programmers overuse the technique by deliberately writing code to crash the program whenever something goes wrong, when the correct solution is to handle all the errors you can think of correctly, and then rely on crash/restart for unforeseen error conditions. Another overuse is crashing and restarting the whole system whenever anything goes wrong. One tenet of crash-only system design is the idea that crash/restart is cheap - because you are only crashing and recovering small, self-contained parts of the system (see the paper on microreboots). Try telling your users that your whole web browser crashes and restarts every 2 minutes because it is crash-only software and see how well that goes over. If instead the browser quietly crashes and recovers only the thread that is misbehaving, you will have much happier users.

On the face of it, the simplest part of crash-only software would be implementing the "crash" part. How hard is it to hit the power button? There is a subtle implementation point that is easy to miss, though: the crash mechanism has to be entirely outside and independent of the crash-only system - hardware power switch, kill -9, shutting down the virtual machine. If it is implemented through internal code, it takes away a valuable part of crash-only software: that you have an all-powerful, reliable method to take any misbehaving component of the system and crash/restart it into a known state.
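A minimal demonstration of an external crash mechanism, assuming a POSIX system: the parent process delivers SIGKILL, which the component can neither catch nor block, and "starting up" the component again is the same code path as recovering it. The child program here is just a stand-in loop.

```python
import os
import signal
import subprocess
import sys

# The crash mechanism lives *outside* the component: the parent delivers
# SIGKILL, which the child cannot catch, block, or handle.

CHILD = "import time\nwhile True: time.sleep(0.1)"

def start():
    # "start up" and "recover" are the same operation
    return subprocess.Popen([sys.executable, "-c", CHILD])

proc = start()
os.kill(proc.pid, signal.SIGKILL)   # external, all-powerful crash
proc.wait()
assert proc.returncode == -signal.SIGKILL
proc = start()                      # recovery: just run it again
assert proc.poll() is None          # recovered instance is running
os.kill(proc.pid, signal.SIGKILL)   # clean up the demo
proc.wait()
```

Because the kill comes from outside, this exercises the recovery path for crashes at arbitrary points, not just the ones the component's own code anticipated.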

I heard of one "crash-only" system in which the shutdown code was replaced with a call to abort() as part of a "crash-only" design. There were two problems with this approach. One, it relied on the system not having any bugs in the code path leading to the abort() call, or any deadlocks which would prevent it from being executed. Two, shutting down the system in this manner only exercised a subset of the total possible crash space, since it was only testing what happened when the system successfully received and handled a request to shut down. For example, a single-threaded program that handled requests in an event loop would never be crashed in the middle of handling another request, and so the recovery code would not be tested for this case. One more example of a badly implemented "crash" is a database that, when it ran out of disk space for its event logging, could not be safely shut down because it wanted to write a log entry before shutting down, but it was out of disk space, so...

Another common pattern is to ignore the trade-offs of performance vs. recovery time vs. reliability and take an absolutist approach to optimizing for one quality while maintaining superficial allegiance to crash-only design. The major trade-off is that checkpointing your application's state improves recovery time and reliability but reduces steady state performance. The two extremes are checkpointing or saving state far too often and checkpointing not at all; like Goldilocks, you need to find the checkpoint frequency that is Just Right for your application.

What frequency of checkpointing will give you acceptable recovery time, acceptable performance, and acceptable data loss? I once used a web browser which only saved preferences and browsing history on a clean shutdown of the browser. Saving the history every millisecond is clearly overkill, but saving changed items every minute would be quite reasonable. The chosen strategy, "save only on shutdown," turned out to be equivalent to "save never" - how often do people close their browsers, compared to how often they crash? I ended up solving this problem by explicitly starting up the browser for the sole purpose of changing the settings and immediately closing it again, after the third or fourth time I lost my settings. (This is a good example of how all software should be written to crash safely but does not.) Most implementations of bash I have used take the same approach to saving the command history; as a result, I now explicitly "exit" out of running shells (all 13 or so of them) whenever I shut down my computer so I don't lose my command history.
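One common way to make periodic checkpointing crash-safe (a general sketch, not anything specific from the article) is to write the new state to a temporary file, fsync it, and atomically rename it over the old checkpoint. A crash at any instant then leaves either the complete old state or the complete new state on disk, never a torn file. File names and state shape below are invented.

```python
import json
import os
import tempfile

def checkpoint(state, path):
    # Write to a temp file in the same directory, so the rename is atomic.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())    # data on disk before the rename
    os.rename(tmp, path)        # atomic on POSIX filesystems

def recover(path, default=None):
    # Recovery is just reading the last complete checkpoint.
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

How often to call checkpoint() is exactly the Goldilocks trade-off in the text: every call costs an fsync, but everything since the last call is lost on a crash.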

Shutdown code should be viewed as, fundamentally, only of use to optimize the next start up sequence and should not be used to do anything required for correctness. One way to approach shutdown code is to add a big comment at the top of the code saying "WISHFUL THINKING: This code may never be executed. But it sure would be nice."

Another class of misunderstanding is about what kind of systems are suitable for crash-only design. Some people think crash-only software must be stateless, since any part of the system might crash and restart, and lose any uncommitted state in the process. While this means you must carefully distinguish between volatile and non-volatile state, it certainly doesn't mean your system must be stateless! Crash-only software only says that any non-volatile state your system needs must itself be stored in a crash-only system, such as a database or session state store. Usually, it is far easier to use a special purpose system to store state, rather than rolling your own. Writing a crash-safe, quick-recovery state store is an extremely difficult task and should be left to the experts (and will make your system easier to implement).
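For example, rather than hand-rolling a state store, a component can keep its non-volatile state in SQLite, which already provides crash-safe journaling and quick recovery. A minimal sketch, with schema and key names invented for illustration:

```python
import sqlite3

def open_store(path):
    # WAL journaling gives crash-safe writes and fast recovery for free.
    db = sqlite3.connect(path)
    db.execute("PRAGMA journal_mode=WAL")
    db.execute("CREATE TABLE IF NOT EXISTS state (k TEXT PRIMARY KEY, v TEXT)")
    return db

def put(db, k, v):
    with db:   # each write is its own committed transaction
        db.execute("INSERT OR REPLACE INTO state VALUES (?, ?)", (k, v))

def get(db, k):
    row = db.execute("SELECT v FROM state WHERE k = ?", (k,)).fetchone()
    return row[0] if row else None
```

The component itself can then be crashed and restarted freely: any state it needs to survive the crash lives behind put()/get(), in a store whose recovery code is exercised by millions of deployments rather than just yours.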

Crash-only software makes explicit the trade-off between optimizing for steady-state performance and optimizing for recovery. Sometimes this is taken to mean that you can't use crash-only design for high performance systems. As usual, it depends on your system, but many systems suffer bugs and crashes often enough that crash-only design is a win when you consider overall up time and performance, rather than performance only when the system is up and running. Perhaps your system is robust enough that you can optimize for steady state performance and disregard recovery time... but it's unlikely.

Because it must be possible to crash and restart components, some people think that a multi-threaded system using locks can't be crash-only - after all, what happens if you crash while holding a lock? The answer is that locks can be used inside a crash-only component, but all interfaces between components need to allow for the unexpected crash of components. Interfaces between components need to strongly enforce fault boundaries, put timeouts on all requests, and carefully formulate requests so that they don't rely on uncommitted state that could be lost. As an example, consider how the recently-merged robust futex facility makes crash recovery explicit.
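To illustrate the shape of such an interface (this is a thread-based stand-in, not the real robust futex API): acquire with a timeout, and treat a timeout as evidence that the holder crashed, running explicit recovery before taking a fresh lock.

```python
import threading

# Locks are fine *inside* a crash-only component, but an interface shared
# across components must survive a holder that crashed mid-critical-section.
# Illustrative sketch only.

class RecoverableLock:
    def __init__(self):
        self._lock = threading.Lock()

    def acquire_or_recover(self, timeout, recover):
        if self._lock.acquire(timeout=timeout):
            return          # normal path: nobody crashed
        recover()           # holder presumed dead: make state consistent
        self._lock = threading.Lock()   # fresh lock over recovered state
        self._lock.acquire()

    def release(self):
        self._lock.release()
```

A waiter on an abandoned lock thus invokes the recovery callback instead of blocking forever, which is the same contract robust futexes make explicit at the kernel level.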

Some people end up with the impression that crash-only software is less reliable and unsuitable for important "mission-critical" applications because the design explicitly admits that crashes are inevitable. Crash-only software is actually more reliable because it takes into account from the beginning an unavoidable fact of computing - unexpected crashes.

A criticism often leveled at systems designed to improve reliability by handling errors in some way other than complete system crash is that they will hide or encourage software bugs by masking their effects. First, crash-only software in many ways exposes previously hidden bugs, by explicitly testing recovery code in normal use. Second, explicitly crashing and restarting components as a workaround for bugs does not preclude taking a crash dump or otherwise recording data that can be used to solve the bug.

How can we apply crash-only design to operating systems? One example is file systems, and the design of chunkfs (discussed in last week's LWN article on the 2006 Linux file systems workshop and in more detail here). We are trying to improve reliability and data availability by separating the on-disk data into individually checkable components with strong fault isolation. Each chunk must be able to be individually "crashed" - unmounted - and recovered - fsck'd - without bringing down the other chunks. The code itself must be designed to allow the failure of individual chunks without holding locks or other resources indefinitely, which could cause system-wide deadlocks and unavailability. Updates within each chunk must be crash-safe and quickly recoverable. Splitting the file system up into smaller, restartable, crash-only components creates a more reliable, easier to repair crash-only system.

The conclusion

Properly implemented, crash-only software produces higher quality, more reliable code; poorly understood, it results in lazy programming. Probably the most common misconception is the idea that writing crash-only software allows you to take shortcuts when writing and designing your code. Wake up, Sleeping Beauty, there ain't no such thing as a free lunch. But you can get a more reliable, easier to debug system if you rigorously apply the principles of crash-only design.

[Thanks to Brian Warner for inspiring this article, George Candea and Armando Fox for comments and for codifying crash-only design in general, and the implementer(s) of the Emacs auto-save feature, which has saved my work too many times to count.]



Crash-only software: More than meets the eye

Posted Jul 13, 2006 11:15 UTC (Thu) by nix (subscriber, #2304) [Link]

The Emacs autosave feature makes clear another aspect of crash-only software design applicable to systems with persistent state: that sometimes you want to have an explicit 'save' option as well, such that the persistent state that is saved at shutdown is only used to restore the state on startup. This is especially true if there's a lot of state.

Many text editors, including a number of vi implementations (not vim), implemented their preserve feature by saving the file periodically, not to /var/tmp or a .swp file, but to the original location. This turned out to be annoying if you decided your changes were a bad idea: :q left the file half-way altered. vim doesn't have this problem, and of course Emacs never did (it would be rather obvious if M-x revert-buffer stopped working, and while in fact it was broken at one point in the XEmacs 21.4 release cycle, this was a bug, not intentional!)

This particular approach to state saving was very common in the DOS/Windows world for a time, and was exceptionally annoying: if anything, it was more annoying than a system with no automatic state saving would have been. It felt like making any change was horribly irrevocable, not least because these systems either had short undo queues or didn't save them on crashing.

No need for explicit saving

Posted Oct 22, 2008 4:34 UTC (Wed) by pgan (guest, #54573) [Link]

I think it's simpler for the user to never think about saving. The program should just save the history, and the user will find the document (or other object) in the state they left it. If they don't like the current state, they can undo to one they like.

that just doesn't work well

Posted Oct 22, 2008 8:17 UTC (Wed) by dlang (subscriber, #313) [Link]

the OLPC people believed the same thing, and the result is a flood of junk with the documents that you care about sprinkled through it. it's almost impossible to retrieve documents after a very short timeframe.

Crash-only software: More than meets the eye

Posted Jul 13, 2006 11:39 UTC (Thu) by th0ma7 (guest, #24698) [Link]

This is by far the article that best describes the 24/7 environment in which I work (Canadian weather warnings/severe weather prognostics, terminal airport forecasts, etc.).

We are regularly faced with the need for multithreaded crash-only software to keep our mission-critical systems up (radar calculations, radar/satellite imagery systems, weather bulletin/forecast transmissions, etc.)... Systems on which we also try to implement no single point of failure (which can actually be seen as applying part of the crash-only concept to "hardware"?). I just didn't know there was actually a name/theory for that.

In the past 10 years we've been faced with almost every single aspect described in this article!

I have a strong feeling that our current migration from HPUX -> Redhat -> Debian (which is now at the Redhat -> Debian part) is the right way to go, and that Linux will provide (and, from my perspective, actually is providing) more tools, more powerful components, and a way bigger community at a way lower cost.

Thanks for this really good article, and keep up the good work, you Linux kernel dev freaks.

- vin

Crash-only vs fault tolerant

Posted Jul 14, 2006 18:07 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

What you're describing is known by the name "fault tolerant," not "crash-only."

A crash-only program is one that doesn't have a clean shutdown operation. Because the only way to stop the program is to cause a fault, it is obviously fault tolerant as well.

The article makes the point that your fault tolerant code is more likely to work right if you put it in a crash-only program because it gets exercised, and thought about, more.

Crash-only software: More than meets the eye

Posted Jul 13, 2006 18:54 UTC (Thu) by Segora (subscriber, #8209) [Link]

Hi,

this made me think of Joe Armstrong's (of Erlang fame) work on fault tolerant systems [1]. The canonical way to make an Erlang/OTP system is to divide it into one or more applications, each of which has a supervisor tree of processes. When a process crashes, the restart strategy determines if only the crashed process is to be restarted, all processes on the same level are restarted, or the supervisor crashes and the fault is propagated upwards, leading to the whole node being restarted via hardware watchdog in the extreme case.

Segora

[1] Making Reliable Distributed Systems in the Presence of Software Errors (2003), http://citeseer.ist.psu.edu/armstrong03making.html

Crash-only software: More than meets the eye

Posted Jul 14, 2006 0:18 UTC (Fri) by pimlott (guest, #1535) [Link]

Rats, I was going to mention Erlang! Erlang's motto is "let it fail", and this philosophy (counterintuitively!) builds extremely high reliability telecommunications routers. Walking through the Erlang tutorial is a great exercise in this style of design.

Crash-only software: More than meets the eye

Posted Jul 14, 2006 3:25 UTC (Fri) by jzbiciak (subscriber, #5246) [Link]

You know what's interesting is that it's not only the software that fails. This is especially true in infrastructure computing (such as telecom), where boxes are literally everywhere and deployed "forever." Bit-flips due to radiation, aging components, etc... none of those should bring down the phone network, but you might drop a phone call.

Crash-only software: More than meets the eye

Posted Jul 13, 2006 20:57 UTC (Thu) by smoogen (subscriber, #97) [Link]

Writing a crash-safe, quick-recovery state store is an extremely difficult task and should be left to the experts (and will make your system easier to implement).

One area that would be interesting is a set of provable crash-only libraries, with an API that people could write to; that might help with this part. It would cut down on the number of people inventing their own versions that might not be correct.
I honestly wouldn't know where to start on this though :/.

Crash-only software: More than meets the eye

Posted Jul 14, 2006 19:44 UTC (Fri) by kingdon (guest, #4526) [Link]

Prevayler is one such (http://www.prevayler.org/), for Java code which can be written in the right style for Prevayler. Although the concept of Prevayler is pretty simple, there is some nice trickiness in the journaling code (in particular, which makes sure that the data has been sent to disk before acting as if the data has been written, without destroying performance).

Prevayler is not the only journaling code you'll ever need (even if you are writing in Java), but I agree that it makes sense to think in terms of journaling libraries rather than each application trying to get this right (and possibly failing in subtle ways).

Three comments

Posted Jul 14, 2006 9:47 UTC (Fri) by ncm (subscriber, #165) [Link]

First, this (otherwise excellent) article repeats a common misconception: in fact, journaling file systems, like journaling databases, are not generally safe against power drops. The underlying reason is that, for evidently unavoidable marketing reasons, most drives lie about whether blocks sent to the drive really are physically on the disk; it may take a few seconds for blocks actually to get written after the drive has sworn up-and-down that it's already been done. Drives that never lie lose performance benchmarks.

Furthermore, there is an urban myth that many/most drives will use the motor as a generator to provide power to finish writing the current block and park the head. It is generally false -- all drives you're likely to encounter happily write random stuff if voltage drops while they're writing, even if they do park the head afterward.

The implication is that a high-reliability system really must either have some form of battery-backed up disk drive power (e.g. UPS), or must somehow ensure its drives really don't lie (good luck verifying that!), or their journal blocks must have 64-bit-or-better checksums independent of whatever the drive uses, and be prepared for late journal blocks to be unreadable or just wrong. (Available drives have an appallingly high specified bit-error rate, so using 64-bit checksums everywhere reliability matters is a good idea anyhow.) Finally, when testing failure recovery, don't confuse hardware or software reset with power failure.
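The checksum suggestion can be sketched as follows: each journal record carries its own 64-bit checksum, computed independently of anything the drive does, so a torn or late block is detected on replay and trusted data stops there. The record layout and the choice of blake2b are illustrative assumptions, not from the comment.

```python
import hashlib

def seal(payload: bytes) -> bytes:
    # Prefix the record with an independent 64-bit checksum of its payload.
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    return digest + payload

def unseal(block: bytes):
    # Verify before trusting; a torn or corrupt block fails the check.
    digest, payload = block[:8], block[8:]
    if hashlib.blake2b(payload, digest_size=8).digest() != digest:
        return None          # stop replay at the first bad record
    return payload
```

Replay then treats the first record that fails verification as the end of the journal, rather than applying whatever garbage a late or half-finished write left behind.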

Second, the principles behind crash-only design are embodied, thus far uniquely, in the C++ exception mechanism. A well-designed C++ program will have only a very few places that catch and process exceptions, and almost all the code that is executed during an exception is also run frequently during normal operation, in destructors. This differs fundamentally from languages with superficially similar exception features that depend on the "try-finally" construct. There, exception-handling code is scattered pervasively throughout the system, and much of it cannot be executed in any practical test process.

Third, crash-only design is tied very closely to logging and log-replaying. Often log-replaying code can be recycled for user-level undo/redo, and thus exercised in normal operation. Note that if you're not worried about power failure (because of your UPS), there's no need for the program to flush (i.e. fsync) the log file frequently. The kernel will do that if the program crashes, and mmapping a big data-structure image (e.g. to support undoing a deletion) to the end of the log file is very cheap. "Auto-save" is a very poor substitute for good logging.
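A minimal append-and-replay log in this spirit (record format and helper names invented; fsync only when durability is actually required):

```python
import os

def log_append(path, record, durable=True):
    # Append one record; fsync only when the caller needs durability now.
    with open(path, "a") as f:
        f.write(record + "\n")
        if durable:
            f.flush()
            os.fsync(f.fileno())

def log_replay(path, apply, state):
    # Rebuild state purely by replaying the log from the start; the same
    # replay machinery can back user-level undo/redo.
    with open(path) as f:
        for line in f:
            state = apply(state, line.rstrip("\n"))
    return state
```

Because recovery and undo share the replay path, the recovery code is exercised in normal operation, which is exactly the property the article argues for.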

Three comments

Posted Jul 14, 2006 11:58 UTC (Fri) by nix (subscriber, #2304) [Link]

Isn't the drive-write-late-and-garbage problem exactly what write barriers are meant to solve, and the major reason why the journalled filesystems and md layer make use of write barriers? (Do any drives actually lie about write barriers, too, and say they're passed when the stuff they bar is not yet on the medium?)

Write barriers

Posted Jul 14, 2006 19:15 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

No, the problem that write barriers solve is where the device says "OK, I've got the data" and Linux considers the data to be permanent based on that. Before write barriers, that's what happens.

That's not ridiculous, by the way. "Permanent" is a matter of degree, and being written on the platter is just one degree in the middle of the scale. Once the device has the data, it is safe from a Linux kernel crash, and that's a lot.

Write barriers are, BTW, a Linux kernel block layer phenomenon; the device doesn't know the concept. Linux has various ways to know that the device has put the data on the platter and uses them to implement write barriers. But if the device lies, the write barriers won't work.

Since the device lies to circumvent a system that explicitly asked for the data to go on the platter, I rather doubt that it would refrain from lying when Linux write barriers are involved.

BTW, I can't confirm or deny that devices lie like this, and to the extent claimed. If someone can back up this claim, I'd love to see it.

Three comments

Posted Jul 14, 2006 19:30 UTC (Fri) by ncm (subscriber, #165) [Link]

I don't know to what degree modern drives really obey write barriers. If history is any guide, they obey write barriers when the data rate is low, but toss them when the buffers fill up, or any time they seem to recognize a benchmark being run. In any case, they won't protect against sectors being half-written.

Three comments

Posted Jul 18, 2006 6:21 UTC (Tue) by nix (subscriber, #2304) [Link]

That's just horrible enough that it might be true, but the idealist in me hopes that it isn't, because it would render the entire concept of write barriers pointless :(

Three comments

Posted Jul 18, 2006 9:04 UTC (Tue) by ncm (subscriber, #165) [Link]

Not at all... it just means that backup power is necessary for a reliable system. Disk drive designers have learned not to pretend they can make a whole system reliable all by themselves, and that (furthermore) the market won't pay for them to try. It doesn't take much backup power; if you can get the CPU's or ATA interface's power to drop out of tolerance a few seconds before the drive's, that may be all you need.

Three comments

Posted Jul 19, 2006 7:17 UTC (Wed) by drs (guest, #16570) [Link]

My understanding of this problem (recalling a Ted Tso talk on the issue)
is that as system voltage sags in an unexpected power failure, different
system components fail to operate correctly at different voltage levels.

More specifically, the voltage at which main memory maintains coherency is
significantly above the voltage at which the DMA engines in the
North/South bridge stop operating. So there is every likelihood that the
DMA engines will happily write random garbage from main memory to the
still operating disk. :(

writing garbage when the voltage drops

Posted Jul 14, 2006 19:19 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

there is an urban myth that many/most drives will use the motor as a generator to provide power to finish writing the current block and park the head. It is generally false -- all drives you're likely to encounter happily write random stuff if voltage drops while they're writing, even if they do park the head afterward.

I can easily believe that the motor generating power is fantasy, but I always assumed there was a capacitor in there that could supply enough energy to finish writing the current sector. Why wouldn't there be?

writing garbage when the voltage drops

Posted Jul 15, 2006 5:21 UTC (Sat) by roelofs (guest, #2599) [Link]

I can easily believe that the motor generating power is fantasy, but I always assumed there was a capacitor in there that could supply enough energy to finish writing the current sector. Why wouldn't there be?

Size, maybe? I'm just shooting the breeze here, but caps associated with power supplies tend to be immensely bigger than typical hard-drive components, and I'd guess that one capable of acting as a power-supply stand-in for even a few milliseconds would still be quite a bit bigger than the little surface-mount discretes used on drives today.

But maybe I'm suffering from cranio-rectal impaction again... I hate it when that happens.

Greg

writing garbage when the voltage drops

Posted Jul 17, 2006 14:29 UTC (Mon) by giraffedata (subscriber, #1954) [Link]

OK, I did some calculations. I think the drive needs less than 10 microseconds to finish writing a sector. In that time, it needs up to 1 ampere, and can work with at least 4 V out of the 5 V power supply. So a 10 µF capacitor, which is the size of a pea, should suffice.

The stored energy in the disk probably is relevant too, in that it keeps the disk spinning fast enough for an acceptable write 10 µs after the motor loses power.

writing garbage when the voltage drops

Posted Jul 25, 2006 3:59 UTC (Tue) by barrygould (guest, #4774) [Link]

I'd expect you want clusters, not sectors, ensured to be written safely.

writing garbage when the voltage drops

Posted Jul 15, 2006 23:19 UTC (Sat) by ncm (subscriber, #165) [Link]

Suffice to say that disk-drive manufacturing is a very cost-sensitive business. They'd be happy to make drives fail better if it didn't actually cost anything, but nobody is willing to pay if it does cost.

writing garbage when the voltage drops

Posted Jul 18, 2006 15:40 UTC (Tue) by giraffedata (subscriber, #1954) [Link]

Now that I think about it, the atomic write in the case of power failure isn't all that useful, because if the sector doesn't get completely written, it can't be read back. The CRC in the trailer won't have been written. That means you can achieve the same thing by writing two copies of the critical sector: On readback, if you can't read the first copy, you just use the second copy, which is the complete old version.

You'd probably want that redundancy anyway, because it's probably a really important sector and write failures happen even without power failures.

For the benefit of those who are wondering why people think atomic sector writes at power failure are important: Some systems deal with the possibility of system failure in the middle of a complex disk update as follows: Keep the original data intact and write a whole second, updated copy. (Use copy-on-write if you have to for practicality). A single sector points to current copy. When you have a complete updated copy, update the pointer sector to point to the updated copy. Then delete the original copy. Any kind of failure before you update the pointer sector just means the complex update never happened. But if the update of the pointer sector itself gets interrupted, then you've got neither the original nor the updated copy.
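The scheme described above can be sketched with files standing in for disk areas (all names invented): two full copies of the data plus a tiny "pointer" naming the current one. An update writes the other copy completely, syncs it, and only then atomically flips the pointer, so a crash before the flip leaves the old version current.

```python
import os

def init(d, data):
    with open(os.path.join(d, "copy_a"), "w") as f:
        f.write(data)
    with open(os.path.join(d, "current"), "w") as f:
        f.write("copy_a")        # the pointer names the live copy

def read_current(d):
    with open(os.path.join(d, "current")) as f:
        which = f.read().strip()
    with open(os.path.join(d, which)) as f:
        return f.read()

def write_update(d, data):
    with open(os.path.join(d, "current")) as f:
        which = f.read().strip()
    other = "copy_b" if which == "copy_a" else "copy_a"
    with open(os.path.join(d, other), "w") as f:
        f.write(data)            # full new copy on disk first...
        f.flush()
        os.fsync(f.fileno())
    tmp = os.path.join(d, "current.tmp")
    with open(tmp, "w") as f:    # ...then flip the pointer atomically
        f.write(other)
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp, os.path.join(d, "current"))
```

The comment's caveat maps directly onto the last step: if the pointer update itself can be torn (a half-written sector on power loss), neither copy is reachable, which is why atomic-or-verifiable pointer writes matter.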

Three comments

Posted Jul 20, 2006 16:20 UTC (Thu) by renox (subscriber, #23785) [Link]

journaling file systems, like journaling databases, are not generally safe against power drops

I think it depends on the type of journaling: journaling metadata only doesn't protect your files, which can still be corrupted, but ext3, among others, offers data+metadata journaling. That has a quite high performance impact, but it should 'protect' in the sense that the data is either there or not there, never half there (as seen with ReiserFSv3: the passwd file appended with binary data, urgh).

Of course, even 'data+metadata' journaling can work correctly only if the disk obeys some ordering when writing the data.

Three comments

Posted Jul 20, 2006 23:43 UTC (Thu) by ncm (subscriber, #165) [Link]

...even 'data+metadata' journaling can work correctly only if the disk obeys some ordering when writing the data.

Precisely the point. However, data+metadata journaling may be rather pointless if you have no way of telling how much of the data you meant to write was, in fact, written. For example, if you have an outage while compiling a kernel, no amount of journaling can make it safe to skip "make clean" before running "make" again. That makes the cheapest, fastest journaling regime also the best for such an environment.

Excellent article, thanks

Posted Jul 14, 2006 12:21 UTC (Fri) by dion (guest, #2764) [Link]

I really enjoyed this article, most of all because it made me think of all those times when I've seen the same results but never realized it was actually a recognized concept.

I fondly remember my first free-text index, which was basically a binary tree of all the words found in a collection of 20,000 largish text files.

It took longer to free the resulting tree at shutdown than it took to load it from disk at startup, so I ended up never free()ing the data; since the program was shutting down anyway, it never mattered.

Keep 'em coming

Posted Jul 15, 2006 21:33 UTC (Sat) by alspnost (guest, #2763) [Link]

Val Henson is a star - I've really enjoyed this recent series of articles. It's opened my eyes to so many crazy things that we don't normally think about in computing! I like the way that Val presents things in a zany and entertaining way, whilst retaining technical depth and rigour. Quite a feat, to be honest!

Crash-only software: More than meets the eye

Posted Jul 18, 2006 7:44 UTC (Tue) by ortalo (subscriber, #4654) [Link]

It seems to me that there is an area your article leaves uncovered.
OK for "crash-only" (i.e. backward recovery in dependability terminology), fault containment, increased reliability, etc.

But what about fault detection, then? In more practical terms: when do you fire up the crash/kill/terminate procedure? Do you let the user decide when to hit the power button? (Do you really trust users? What if he cuts the power cord with a knife?) Do you have another magical watchdog program running in some corner that knows what to do?

Fault management should not be limited to the recovery procedure; sometimes the detection procedure is just as important, and it brings out the overall assumptions made about the system (fail-stop, fail-silent, fail-arbitrary, etc.).
Dunno how it applies to OS development, however (except for pragmatic ideas like the Linux software watchdog).

Crash-only software: More than meets the eye

Posted Oct 22, 2008 4:57 UTC (Wed) by pgan (guest, #54573) [Link]

What if he cuts it with a rusty old pirate's sword?

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds