---
created_at: '2015-03-24T23:05:18.000Z'
title: So what's wrong with 1975 programming? (2006)
url: https://www.varnish-cache.org/trac/wiki/ArchitectNotes
author: mooreds
points: 91
story_text: ''
comment_text:
num_comments: 20
story_id:
story_title:
story_url:
parent_id:
created_at_i: 1427238318
_tags:
- story
- author_mooreds
- story_9260169
objectID: '9260169'
year: 2006
---

# Notes from the Architect

Once you start working with the Varnish source code, you will notice
that Varnish is not your average, run-of-the-mill application.

That is not a coincidence.

I have spent many years working on the FreeBSD kernel, and only rarely
did I venture into userland programming, but when I had occasion to do
so, I invariably found that people programmed like it was still 1975.

So when I was approached about the Varnish project I wasn't really
interested until I realized that this would be a good opportunity to
put some of my knowledge of how hardware and kernels work to good use,
and now that we have reached the alpha stage, I can say I have really
enjoyed it.

## So what's wrong with 1975 programming?

The really short answer is that computers no longer have two kinds of
storage.

It used to be that you had the primary store, which was anything from
acoustic delay lines filled with mercury, via small magnetic doughnuts,
via transistor flip-flops, to dynamic RAM.

And then there was the secondary store: paper tape, magnetic tape, disk
drives the size of houses, then the size of washing machines, and these
days so small that girls get disappointed if they think they have got
hold of something other than the MP3 player you had in your pocket.

And people program this way.

They have variables in "memory" and move data to and from "disk".

Take Squid for instance, a 1975 program if I ever saw one: you tell it
how much RAM and how much disk it can use. It will then spend
inordinate amounts of time keeping track of which HTTP objects are in
RAM and which are on disk, and it will move them back and forth
depending on traffic patterns.

Well, today computers really only have one kind of storage, and it is
usually some sort of disk; the operating system and the virtual memory
management hardware have converted the RAM into a cache for the disk
storage.

So what happens with Squid's elaborate memory management is that it
gets into fights with the kernel's elaborate memory management, and
like any civil war, that never gets anything done.

What happens is this: Squid creates an HTTP object in "RAM" and it gets
used a few times rapidly after creation. Then after some time it gets
no more hits and the kernel notices this. Then somebody tries to get
memory from the kernel for something, and the kernel decides to push
those unused pages of memory out to swap space and use the RAM more
sensibly for some data which is actually used by a program. This,
however, is done without Squid knowing about it. Squid still thinks
that these HTTP objects are in RAM, and they will be, the very second
it tries to access them, but until then the RAM is used for something
productive.

This is what Virtual Memory is all about.

If Squid did nothing else, things would be fine, but this is where the
1975 programming kicks in.

After some time, Squid will also notice that these objects are unused,
and it decides to move them to disk so the RAM can be used for busier
data. So Squid goes out, creates a file, and then writes the HTTP
objects to the file.

Here we switch to the high-speed camera: Squid calls write(2); the
address it gives is a "virtual address", and the kernel has it marked
as "not at home".

So the CPU hardware's paging unit will raise a trap, a sort of
interrupt to the operating system, telling it "fix the memory please".

The kernel tries to find a free page; if there are none, it will take a
little-used page from somewhere, likely another little-used Squid
object, and write it to the paging pool space on the disk (the "swap
area"). When that write completes, it will read the data it "paged out"
from another place in the paging pool into the now unused RAM page, fix
up the paging tables, and retry the instruction which failed.

Squid knows nothing about this; for Squid it was just a single normal
memory access.

So now Squid has the object in a page in RAM, and written to disk in
two places: one copy in the operating system's paging space and one
copy in the filesystem.

Squid now uses this RAM for something else, but after some time the
HTTP object gets a hit, so Squid needs it back.

First Squid needs some RAM, so it may decide to push another HTTP
object out to disk (repeat the above); then it reads the filesystem
file back into RAM, and then it sends the data out on the network
connection's socket.

Did any of that sound like wasted work to you?

Here is how Varnish does it:

Varnish allocates some virtual memory; it tells the operating system to
back this memory with space from a disk file. When it needs to send the
object to a client, it simply refers to that piece of virtual memory
and leaves the rest to the kernel.
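
A minimal sketch of that scheme with mmap(2) follows; the function
name and sizes are illustrative, not Varnish's actual storage code:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a disk file as ordinary memory.  Once mapped, the kernel's
 * virtual memory system decides which pages are in RAM and which
 * stay on disk; the application just uses pointers. */
static void *map_storage(const char *path, size_t size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return MAP_FAILED;
    if (ftruncate(fd, (off_t)size) < 0) {   /* reserve the file space */
        close(fd);
        return MAP_FAILED;
    }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);          /* the mapping keeps the file's blocks reachable */
    return p;
}
```

Storing an object is then just a memcpy into the mapping, and sending
it is a pointer plus a length; whether those bytes happen to be in RAM
at that moment is the kernel's problem, not the program's.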

If/when the kernel decides it needs to use the RAM for something else,
the page will get written to the backing file and the RAM page reused
elsewhere.

The next time Varnish refers to the virtual memory, the operating
system will find a RAM page, possibly freeing one, and read the
contents back in from the backing file.

And that's it. Varnish doesn't really try to control what is cached in
RAM and what is not; the kernel has code and hardware support to do a
good job at that, and it does a good job.

Varnish also has only a single file on the disk, whereas Squid puts
each object in its own separate file. The HTTP objects are not needed
as filesystem objects, so there is no point in wasting time in the
filesystem name space (directories, filenames and all that) for each
object; all we need in Varnish is a pointer into virtual memory and a
length, and the kernel does the rest.

Virtual memory was meant to make it easier to program when data was
larger than the physical memory, but people have still not caught on.

## More caches

But there are more caches around. The silicon mafia has more or less
stalled at a 4GHz CPU clock, and to get even that far they have had to
put level 1, 2 and sometimes 3 caches between the CPU and the RAM
(which is effectively the level 4 cache). There are also things like
write buffers, pipelines and page-mode fetches involved, all to make it
a tad less slow to pick up something from memory.

And since they have hit the 4GHz limit, but decreasing silicon feature
sizes give them more and more transistors to work with, multi-CPU
designs have become the fancy of the world, despite the fact that they
suck as a programming model.

Multi-CPU systems are nothing new, but writing programs that use more
than one CPU at a time has always been tricky, and it still is.

Writing programs that perform well on multi-CPU systems is even
trickier.

Imagine I have two statistics counters:

``` c
unsigned n_foo;
unsigned n_bar;
```

So one CPU is chugging along and has to execute `n_foo++`.

To do that, it reads `n_foo` and then writes `n_foo` back. It may or
may not involve a load into a CPU register, but that is not important.

To read a memory location means to check if we have it in the CPU's
level 1 cache. It is unlikely to be there unless it is very frequently
used. Next we check the level 2 cache, and let us assume that is a
miss as well.

If this is a single-CPU system, the game ends here: we pick it out of
RAM and move on.

On a multi-CPU system, and it doesn't matter if the CPUs share a socket
or have their own, we first have to check if any of the other CPUs have
a modified copy of `n_foo` stored in their caches, so a special bus
transaction goes out to find this out. If some CPU comes back and says
"yeah, I have it", that CPU gets to write it to RAM. On good hardware
designs, our CPU will listen in on the bus during that write operation;
on bad designs it will have to do a memory read afterwards.

Now the CPU can increment the value of `n_foo` and write it back. But
it is unlikely to go directly back to memory; we might need it again
quickly, so the modified value gets stored in our own L1 cache, and
then at some point it will end up in RAM.

Now imagine that another CPU wants to execute `n_bar++` at the same
time. Can it do that? No. Caches operate not on bytes but on some
"line size" of bytes, typically from 8 to 128 bytes in each line. So
since the first CPU was busy dealing with `n_foo`, the second CPU will
be trying to grab the same cache line, so it will have to wait, even
though it is a different variable.
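
The usual defence against this "false sharing" is to give each hot
counter its own cache line, as sketched below; the 64-byte line size
is an assumption for the sketch, and real code should use the
platform's actual value:

```c
#include <stddef.h>

#define CACHE_LINE 64           /* assumed line size for this sketch */

/* Pad each counter out to a full cache line, so two CPUs
 * incrementing different counters never tug at the same line. */
struct padded_counter {
    unsigned n;
    char pad[CACHE_LINE - sizeof(unsigned)];
};

static struct padded_counter foo_ctr;   /* one line for n_foo */
static struct padded_counter bar_ctr;   /* a separate line for n_bar */
```

The price is memory, a whole line per counter instead of a few bytes,
which is usually a bargain compared with cache-line ping-pong.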

Starting to get the idea?

Yes, it's ugly.

## How do we cope?

Avoid memory operations if at all possible.

Here are some ways Varnish tries to do that:

When we need to handle an HTTP request or response, we have an array of
pointers and a workspace. We do not call malloc(3) for each header; we
call it once for the entire workspace and then pick space for the
headers from there. The nice thing about this is that we usually free
the entire header in one go, and we can do that simply by resetting a
pointer to the start of the workspace.
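
A minimal sketch of such a workspace (the names are illustrative;
Varnish's real workspace API differs):

```c
#include <stdlib.h>

/* One malloc(3) per workspace; individual allocations are pointer
 * bumps, and freeing them all is a single pointer reset. */
struct workspace {
    char *base, *next, *end;
};

static int ws_init(struct workspace *ws, size_t len)
{
    ws->base = malloc(len);
    if (ws->base == NULL)
        return -1;
    ws->next = ws->base;
    ws->end = ws->base + len;
    return 0;
}

static void *ws_alloc(struct workspace *ws, size_t len)
{
    if ((size_t)(ws->end - ws->next) < len)
        return NULL;            /* workspace exhausted */
    void *p = ws->next;
    ws->next += len;
    return p;
}

static void ws_reset(struct workspace *ws)
{
    ws->next = ws->base;        /* "frees" every allocation at once */
}
```

A production version would also round `len` up for alignment; the
sketch skips that for brevity.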

When we need to copy an HTTP header from one request to another (or
from a response to another) we don't copy the string, we just copy the
pointer to it. Provided we do not change or free the source headers,
this is perfectly safe; a good example is copying from the client
request to the request we will send to the backend.

When the new header has a longer lifetime than the source, then we have
to copy it, for instance when we store headers in a cached object. But
in that case we build the new header in a workspace, and once we know
how big it will be, we do a single malloc(3) to get the space, and then
we put the entire header in that space.
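
In outline, the copy-out step looks roughly like this (illustrative
names, assuming the header was first assembled in scratch space):

```c
#include <stdlib.h>
#include <string.h>

/* Copy a fully assembled header out of scratch space.  Because the
 * final size is already known, this costs exactly one malloc(3) and
 * one memcpy(3), instead of an allocation per header field. */
static char *store_header(const char *scratch, size_t len)
{
    char *p = malloc(len);
    if (p != NULL)
        memcpy(p, scratch, len);
    return p;
}
```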

We also try to reuse memory which is likely to be in the caches.

The worker threads are used in "most recently busy" fashion: when a
worker thread becomes free, it goes to the front of the queue, where it
is most likely to get the next request, so that all the memory it
already has cached (stack space, variables, etc.) can be reused while
still in the cache, instead of requiring expensive fetches from RAM.

We also give each worker thread a private set of variables it is likely
to need, all allocated on the stack of the thread. That way we are
certain that they occupy a page in RAM which none of the other CPUs
will ever think about touching as long as this thread runs on its own
CPU. That way they will not fight over the cache lines.
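
Sketched with POSIX threads (a toy worker, not Varnish's actual thread
code):

```c
#include <pthread.h>

/* All of this worker's scratch state lives on its own stack, so the
 * pages it touches belong to one thread and, while that thread stays
 * on one CPU, to one CPU's caches. */
static void *worker(void *arg)
{
    unsigned hits = 0;          /* private: no shared cache line */
    char scratch[256];          /* private per-request buffer */
    int i;

    for (i = 0; i < 100; i++) {
        scratch[0] = (char)i;   /* stand-in for real request work */
        hits++;
    }
    *(unsigned *)arg = hits;    /* publish the result once, at the end */
    return NULL;
}
```

Contrast this with keeping the same state in a shared global array
indexed by thread number, where neighbouring slots can land on one
cache line and ping-pong between CPUs.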

If all this sounds foreign to you, let me just assure you that it
works: we spend fewer than 18 system calls on serving a cache hit, and
even many of those are calls to get timestamps for statistics.

These techniques are also nothing new; we have used them in the kernel
for more than a decade. Now it's your turn to learn them :-)

So welcome to Varnish, a 2006 architecture program.

Poul-Henning Kamp, Varnish architect and coder.