---
created_at: '2015-03-24T23:05:18.000Z'
title: So what's wrong with 1975 programming? (2006)
url: https://www.varnish-cache.org/trac/wiki/ArchitectNotes
author: mooreds
points: 91
story_text: ''
comment_text:
num_comments: 20
story_id:
story_title:
story_url:
parent_id:
created_at_i: 1427238318
_tags:
- story
- author_mooreds
- story_9260169
objectID: '9260169'
year: 2006
---

# Notes from the Architect

Once you start working with the Varnish source code, you will notice
that Varnish is not your average, run-of-the-mill application.

That is not a coincidence.

I have spent many years working on the FreeBSD kernel, and only rarely
did I venture into userland programming, but when I had occasion to do
so, I invariably found that people programmed like it was still 1975.

So when I was approached about the Varnish project I wasn't really
interested until I realized that this would be a good opportunity to
put some of my knowledge of how hardware and kernels work to good use,
and now that we have reached the alpha stage, I can say I have really
enjoyed it.

## So what's wrong with 1975 programming?

The really short answer is that computers no longer have two kinds of
storage.

It used to be that you had the primary store, which was anything from
acoustic delay lines filled with mercury, via small magnetic doughnuts,
via transistor flip-flops, to dynamic RAM.

And then there was the secondary store: paper tape, magnetic tape, disk
drives the size of houses, then the size of washing machines, and these
days so small that girls get disappointed if they think they have got
hold of something other than the MP3 player you had in your pocket.

And people program this way.

They have variables in "memory" and move data to and from "disk".

Take Squid for instance, a 1975 program if I ever saw one: you tell it
how much RAM and how much disk it can use. It will then spend
inordinate amounts of time keeping track of which HTTP objects are in
RAM and which are on disk, and it will move them back and forth
depending on traffic patterns.

Well, today computers really only have one kind of storage, and it is
usually some sort of disk; the operating system and the virtual memory
management hardware have converted the RAM into a cache for the disk
storage.

So what happens with Squid's elaborate memory management is that it
gets into fights with the kernel's elaborate memory management, and
like any civil war, that never gets anything done.

What happens is this: Squid creates an HTTP object in "RAM" and it gets
used a few times rapidly after creation. Then after some time it gets
no more hits and the kernel notices this. Then somebody tries to get
memory from the kernel for something, and the kernel decides to push
those unused pages of memory out to swap space and use the RAM more
sensibly for some data which is actually used by a program. This,
however, is done without Squid knowing about it. Squid still thinks
that these HTTP objects are in RAM, and they will be, the very second
it tries to access them, but until then the RAM is used for something
productive.

This is what Virtual Memory is all about.

If Squid did nothing else, things would be fine, but this is where the
1975 programming kicks in.

After some time, Squid will also notice that these objects are unused,
and it decides to move them to disk so the RAM can be used for busier
data. So Squid goes out, creates a file, and then writes the HTTP
objects to the file.

Here we switch to the high-speed camera: Squid calls write(2); the
address it gives is a "virtual address", and the kernel has it marked
as "not at home".

So the CPU hardware's paging unit will raise a trap, a sort of
interrupt to the operating system, telling it "fix the memory please".

The kernel tries to find a free page; if there are none, it will take a
little-used page from somewhere, likely another little-used Squid
object, and write it to the paging pool space on the disk (the "swap
area"). When that write completes, it will read the data it "paged out"
from another place in the paging pool into the now unused RAM page, fix
up the paging tables, and retry the instruction which failed.

Squid knows nothing about this; for Squid it was just a single normal
memory access.

So now Squid has the object in a page in RAM, and written to disk in
two places: one copy in the operating system's paging space and one
copy in the filesystem.

Squid now uses this RAM for something else, but after some time the
HTTP object gets a hit, so Squid needs it back.

First Squid needs some RAM, so it may decide to push another HTTP
object out to disk (repeat the above); then it reads the filesystem
file back into RAM, and then it sends the data out on the network
connection's socket.

Did any of that sound like wasted work to you?

Here is how Varnish does it:

Varnish allocates some virtual memory; it tells the operating system to
back this memory with space from a disk file. When it needs to send the
object to a client, it simply refers to that piece of virtual memory
and leaves the rest to the kernel.
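
A minimal sketch of that scheme with mmap(2) follows; the function
name and sizes are illustrative, not Varnish's actual storage code:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a disk file as ordinary memory.  Once mapped, the kernel's
 * virtual memory system decides which pages are in RAM and which
 * stay on disk; the application just uses pointers. */
static void *map_storage(const char *path, size_t size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return MAP_FAILED;
    if (ftruncate(fd, (off_t)size) < 0) {   /* reserve the file space */
        close(fd);
        return MAP_FAILED;
    }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);          /* the mapping keeps the file's blocks reachable */
    return p;
}
```

Storing an object is then just a memcpy into the mapping, and sending
it is a pointer plus a length; whether those bytes happen to be in RAM
at that moment is the kernel's problem, not the program's.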

If/when the kernel decides it needs to use the RAM for something else,
the page will get written to the backing file and the RAM page reused
elsewhere.

The next time Varnish refers to the virtual memory, the operating
system will find a RAM page, possibly freeing one, and read the
contents back in from the backing file.

And that's it. Varnish doesn't really try to control what is cached in
RAM and what is not; the kernel has code and hardware support to do a
good job at that, and it does a good job.

Varnish also has only a single file on the disk, whereas Squid puts
each object in its own separate file. The HTTP objects are not needed
as filesystem objects, so there is no point in wasting time in the
filesystem name space (directories, filenames and all that) for each
object; all we need in Varnish is a pointer into virtual memory and a
length, and the kernel does the rest.

Virtual memory was meant to make it easier to program when data was
larger than the physical memory, but people have still not caught on.

## More caches

But there are more caches around. The silicon mafia has more or less
stalled at a 4GHz CPU clock, and to get even that far they have had to
put level 1, 2 and sometimes 3 caches between the CPU and the RAM
(which is effectively the level 4 cache). There are also things like
write buffers, pipelines and page-mode fetches involved, all to make it
a tad less slow to pick up something from memory.

And since they have hit the 4GHz limit, but decreasing silicon feature
sizes give them more and more transistors to work with, multi-CPU
designs have become the fancy of the world, despite the fact that they
suck as a programming model.

Multi-CPU systems are nothing new, but writing programs that use more
than one CPU at a time has always been tricky, and it still is.

Writing programs that perform well on multi-CPU systems is even
trickier.

Imagine I have two statistics counters:

``` c
unsigned n_foo;
unsigned n_bar;
```

So one CPU is chugging along and has to execute `n_foo++`.

To do that, it reads `n_foo` and then writes `n_foo` back. It may or
may not involve a load into a CPU register, but that is not important.

To read a memory location means to check if we have it in the CPU's
level 1 cache. It is unlikely to be there unless it is very frequently
used. Next we check the level 2 cache, and let us assume that is a
miss as well.

If this is a single-CPU system, the game ends here: we pick it out of
RAM and move on.

On a multi-CPU system, and it doesn't matter if the CPUs share a socket
or have their own, we first have to check if any of the other CPUs have
a modified copy of `n_foo` stored in their caches, so a special bus
transaction goes out to find this out. If some CPU comes back and says
"yeah, I have it", that CPU gets to write it to RAM. On good hardware
designs, our CPU will listen in on the bus during that write operation;
on bad designs it will have to do a memory read afterwards.

Now the CPU can increment the value of `n_foo` and write it back. But
it is unlikely to go directly back to memory; we might need it again
quickly, so the modified value gets stored in our own L1 cache, and
then at some point it will end up in RAM.

Now imagine that another CPU wants to execute `n_bar++` at the same
time. Can it do that? No. Caches operate not on bytes but on some
"line size" of bytes, typically from 8 to 128 bytes in each line. So
since the first CPU was busy dealing with `n_foo`, the second CPU will
be trying to grab the same cache line, so it will have to wait, even
though it is a different variable.
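
The usual defence against this "false sharing" is to give each hot
counter its own cache line, as sketched below; the 64-byte line size
is an assumption for the sketch, and real code should use the
platform's actual value:

```c
#include <stddef.h>

#define CACHE_LINE 64           /* assumed line size for this sketch */

/* Pad each counter out to a full cache line, so two CPUs
 * incrementing different counters never tug at the same line. */
struct padded_counter {
    unsigned n;
    char pad[CACHE_LINE - sizeof(unsigned)];
};

static struct padded_counter foo_ctr;   /* one line for n_foo */
static struct padded_counter bar_ctr;   /* a separate line for n_bar */
```

The price is memory, a whole line per counter instead of a few bytes,
which is usually a bargain compared with cache-line ping-pong.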

Starting to get the idea?

Yes, it's ugly.

## How do we cope?

Avoid memory operations if at all possible.

Here are some ways Varnish tries to do that:

When we need to handle an HTTP request or response, we have an array of
pointers and a workspace. We do not call malloc(3) for each header; we
call it once for the entire workspace and then pick space for the
headers from there. The nice thing about this is that we usually free
the entire header in one go, and we can do that simply by resetting a
pointer to the start of the workspace.
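
A minimal sketch of such a workspace (the names are illustrative;
Varnish's real workspace API differs):

```c
#include <stdlib.h>

/* One malloc(3) per workspace; individual allocations are pointer
 * bumps, and freeing them all is a single pointer reset. */
struct workspace {
    char *base, *next, *end;
};

static int ws_init(struct workspace *ws, size_t len)
{
    ws->base = malloc(len);
    if (ws->base == NULL)
        return -1;
    ws->next = ws->base;
    ws->end = ws->base + len;
    return 0;
}

static void *ws_alloc(struct workspace *ws, size_t len)
{
    if ((size_t)(ws->end - ws->next) < len)
        return NULL;            /* workspace exhausted */
    void *p = ws->next;
    ws->next += len;
    return p;
}

static void ws_reset(struct workspace *ws)
{
    ws->next = ws->base;        /* "frees" every allocation at once */
}
```

A production version would also round `len` up for alignment; the
sketch skips that for brevity.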

When we need to copy an HTTP header from one request to another (or
from a response to another) we don't copy the string, we just copy the
pointer to it. Provided we do not change or free the source headers,
this is perfectly safe; a good example is copying from the client
request to the request we will send to the backend.

When the new header has a longer lifetime than the source, then we have
to copy it, for instance when we store headers in a cached object. But
in that case we build the new header in a workspace, and once we know
how big it will be, we do a single malloc(3) to get the space, and then
we put the entire header in that space.
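
In outline, the copy-out step looks roughly like this (illustrative
names, assuming the header was first assembled in scratch space):

```c
#include <stdlib.h>
#include <string.h>

/* Copy a fully assembled header out of scratch space.  Because the
 * final size is already known, this costs exactly one malloc(3) and
 * one memcpy(3), instead of an allocation per header field. */
static char *store_header(const char *scratch, size_t len)
{
    char *p = malloc(len);
    if (p != NULL)
        memcpy(p, scratch, len);
    return p;
}
```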

We also try to reuse memory which is likely to be in the caches.

The worker threads are used in "most recently busy" fashion: when a
worker thread becomes free, it goes to the front of the queue, where it
is most likely to get the next request, so that all the memory it
already has cached (stack space, variables, etc.) can be reused while
still in the cache, instead of requiring expensive fetches from RAM.

We also give each worker thread a private set of variables it is likely
to need, all allocated on the stack of the thread. That way we are
certain that they occupy a page in RAM which none of the other CPUs
will ever think about touching as long as this thread runs on its own
CPU. That way they will not fight over the cache lines.
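
Sketched with POSIX threads (a toy worker, not Varnish's actual thread
code):

```c
#include <pthread.h>

/* All of this worker's scratch state lives on its own stack, so the
 * pages it touches belong to one thread and, while that thread stays
 * on one CPU, to one CPU's caches. */
static void *worker(void *arg)
{
    unsigned hits = 0;          /* private: no shared cache line */
    char scratch[256];          /* private per-request buffer */
    int i;

    for (i = 0; i < 100; i++) {
        scratch[0] = (char)i;   /* stand-in for real request work */
        hits++;
    }
    *(unsigned *)arg = hits;    /* publish the result once, at the end */
    return NULL;
}
```

Contrast this with keeping the same state in a shared global array
indexed by thread number, where neighbouring slots can land on one
cache line and ping-pong between CPUs.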

If all this sounds foreign to you, let me just assure you that it
works: we spend fewer than 18 system calls on serving a cache hit, and
even many of those are calls to get timestamps for statistics.

These techniques are also nothing new; we have used them in the kernel
for more than a decade. Now it's your turn to learn them :-)

So welcome to Varnish, a 2006 architecture program.

Poul-Henning Kamp, Varnish architect and coder.