[Source](http://yarchive.net/comp/linux/semaphores.html "Permalink to Semaphores (Linus Torvalds) ")

# Semaphores (Linus Torvalds)

* * *
From: torvalds@transmeta.com (Linus Torvalds)
Newsgroups: comp.os.linux.development.system
Subject: [Re: NT kernel guy playing with Linux][5]
Date: 27 Jun 1999 03:25:59 GMT

In article <7l3oan$mrj$1@wire.cadcamlab.org>,
Peter Samuelson <psamuels@sampo.creighton.edu> wrote:

>I say this in the hope of sparing others the confusion I had about what
>a spinlock is. (It looked a lot like what CS calls a semaphore, so why
>don't they just call it a semaphore, and what's this semaphore_t...?)

If your CS courses didn't tell you the difference between a semaphore
and a spinlock, your CS courses were bad (or just didn't cover much
about concurrency, which is fairly common).

Blame your professors, don't blame the Linux kernel code.

>A spinlock is a semaphore used for very short critical sections; while
>waiting for a spinlock to be released, the kernel sits on the CPU and
>does nothing.

A spinlock is a mutual exclusion mechanism, not a semaphore (a semaphore
is a very specific _kind_ of mutual exclusion).

But yes, you're right in that it's (by design) a busy-waiting one.
That's why they are called "spinlocks" - they "spin" waiting for the
lock to go away.

> A `semaphore_t' is a longer-term semaphore which has the
>advantage that the kernel puts the needy process to sleep and goes and
>does something else rather than busy-waiting.

A semaphore also has the ability to have more than one process enter the
critical region. Basically semaphores were (as far as I know) first
proposed by Dijkstra, and they explicitly imply a "sleep"/"wakeup"
behaviour, ie they are _not_ spinlocks. They originally had operations
called "P()" and "V()", but nobody ever remembers whether P() was down()
or up(), so nobody uses those names any more. Dijkstra was probably a
bit heavy on drugs or something (I think the official explanation is
that P and V are the first letters in some Dutch words, but I personally
find the drug overdose story much more believable).

Also, unlike basic spinlocks, a semaphore has a "count" value: each
process that does a down() operation decrements the count if positive,
until the count would go negative. Only then do they sleep. The
original intent of this was to allow multiple entries to the region
protected by the semaphore: by initializing the count to 4, you allow
four down() operations and only the fifth one will block.

However, almost all practical use of semaphores is a special case where
the counter is initialized to 1, and where they are used as simple
mutual exclusion with only one user allowed in the critical region.
Such a semaphore is often called a "mutex" semaphore for MUTual
EXclusion.

I've never really seen anybody use the more complex case of semaphores,
although I do know of cases where it can be useful. For example, one use
of a more complex semaphore is as a "throttle", where you do something
like this:

	/* Maximum concurrent users */
	#define MAX_CONCURRENT_USERS 20
	struct semaphore sem;

	init_sema(&sem, MAX_CONCURRENT_USERS);

and then each user does a down() on the semaphore before starting an
operation. It won't block until you have 20 users - you've not created
a mutual exclusion, but you HAVE created a throttling mechanism. See?
Potentially useful, as I said, but the common case (and the only case
currently in use in the Linux kernel - even though the implementation
definitely can handle the general case) is certainly the mutex one.

>However, you need to grab a spinlock for the purpose of grabbing a
>semaphore_t, so for short critical sections a spinlock (even with the
>potential of busy-waiting) is more efficient.

Only in bad implementations or on bad hardware.

All reasonably modern CPU's can do a semaphore without having grabbed a
spinlock. Often a spinlock is needed for the _contention_ case, but if
done right that is very rare. The contention case for a semaphore is
usually the very expensive one, because that's when you have to
re-schedule etc.

So basically spinlocks are much simpler, and faster under short-lived
contention, so that's why they tend to be used. Also, semaphores cannot
be used by interrupt handlers, as Linux doesn't allow interrupt handlers
to sleep, so anything that protects interrupts needs to be a spinlock.

In addition to semaphores, there are other mutual exclusion notions, the
most popular being a "read-write" lock - something that requires
exclusive access for writers, but allows any number of readers at a
time. Linux has the spinning version of this, but not the blocking one.
We'll probably add a blocking version some day, as it's often very
useful, but it hasn't been a major issue yet.

For example, the per-VM memory management semaphore could very usefully
be a blocking read-write lock, but without heavy thread contention a
mutex semaphore is basically equivalent.

> Since the usual practice
>is to make your critical sections as short as possible, the result is
>that the kernel uses a lot more spinlocks than semaphore_t's.

True. Semaphores are really only useful around anything that does IO,
for example. When the potential contention period is multiple
milliseconds as opposed to nano- or microseconds, blocking operations
(ie operations that cause a re-schedule on contention) are the right way
to go.

Note that some people believe in a mix-and-match approach, where you
have a spinlock that gets upgraded to a semaphore if it waits too long.
Personally, I think that only makes sense if (a) you're in user space
and don't know what the scheduling rules are or (b) your locking is so
badly designed that you have tons of short-lived contention on your
semaphores, so you want to try the light approach first.

		Linus

* * *

Date: Sun, 8 Apr 2001 20:08:13 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: [Re: rw_semaphores][6]
Newsgroups: fa.linux.kernel

On Sun, 8 Apr 2001, Andrew Morton wrote:
>
> One issue with the current implementation is the (necessary)
> design where one reader (or writer) is sleeping in
> down_x_failed_biased(), with its bias in place, whereas
> all other sleepers are sleeping in down_x_failed(), with
> their bias not in place.

Why not do this the same way the _real_ semaphores do it?

You have a fast-path that has a single integer, and a slow path which has
more state and is protected by a spinlock. The _only_ worry I see is to
make sure that we properly have a "contention" state, because both readers
and writers need to be able to know when they should wake things up. But
there's a _trivial_ contention state. Read on:

Forget about the BIAS stuff, and go back to the old simple "negative is
writer, positive is reader" implementation:

	0          - unlocked
	1..n       - 1-n readers
	0xc0000000 - writer
	other <0   - contention

Do you see anything wrong with this?

Implementation:

 - fast path:

	down_read:
		lock incl (%sem)
		js __down_read_failed

	down_write:
		xorl %eax,%eax
		movl $0xc0000000,%r
		lock cmpxchgl %r,(%sem)
		jne __down_write_failed

	up_read:
		lock decl (%sem)
		js __up_read_wakeup

	up_write:
		lock andl $0x3fffffff,(%sem)
		jne __up_write_wakeup

The above are all fairly obvious for the non-failure case, agreed?
Including the guarantee that _if_ there is contention, the "up()"
routines will always go to the slow path to handle the contention.
Now, the _slow_ case could be your generic "protected by a spinlock" code,
although I have a suggestion. As follows:

The only half-way "subtle" case is that __down_write_failed needs to make
sure that it marks itself as a contender (the down_read() fast-case code
will have done so already by virtue of incrementing the counter, which is
guaranteed to have resulted in a "contention" value).

While down_read() automatically gets a "contention value" on failure,
"down_write()" needs to do extra work. The extra work is not all that
expensive: the simplest thing is to just do

	subl $0x8000,(%sem)

at the beginning - which will cause a "contention" value regardless of
whether it was pure-reader before (it goes negative by the simple
assumption that there will never be more than 32k concurrent readers), or
whether it was locked by a writer (it stays negative due to the 0x40000000
bit in the write lock logic). In both cases, both an up_write() and an
up_read() will notice that they need to handle the contention of a waiting
writer-to-be.

Would you agree that the above fast-path implementation guarantees that we
always get to the spinlock-protected slow path?

Now, the above _heavily_ favours readers. There's no question that the
above is unfair. I think that's ok. I think fairness and efficiency tend
to be at odds. But queuing theory shows that the faster you can get out of
the critical region, the smaller your queue will be - exponentially. So
speed is worth it.

Let's take an example at this point:

 - lock is zero

 - writer(1) gets lock: lock is 0xc0000000

 - reader(2) tries to get in: lock becomes 0xc0000001, synchronizes on
   spinlock.

 - another writer(3) tries to get in: lock becomes 0xbfff8001, synchronizes
   on spinlock.

 - writer(1) does up_write: lock becomes 0x3fff8001, != 0, so writer decides
   it needs to wake up, and synchronizes on spinlock.

 - another reader(4) comes in on another CPU, increments, and notices that
   it can do so without it being negative: lock becomes 0x3fff8002 and
   this one does NOT synchronize on the spinlock.

End result: reader(4) "stole base" and actually got the lock without ever
seeing any contention, and we now have (1), (2) and (3) who are serialized
inside the spinlock.
So we get to what the serializers have to do, ie the slow path:

First, let's do the __up_read/write_wakeup() case, because that one is the
easy one. In fact, I think it ends up being the same function:

	spin_lock(&sem->lock);
	wake_up(&sem->waiters);
	spin_unlock(&sem->lock);

and we're all done. The only optimization here (which we should do for
regular semaphores too) is to use the same spinlock for the semaphore lock
and the wait-queue lock.

The above is fairly obviously correct, and sufficient: we have shown that
we'll get here if there is contention, and the only other thing that the
wakeup could sanely do would possibly be to select which process to wake.
Let's not do that yet.

The harder case is __down_read/write_failed(). Here is my suggested
pseudo-code:

	__down_write_failed(sem)
	{
		DECLARE_WAIT_QUEUE(wait, current);

		lock subl $0x8000,(%sem)	/* Contention marker */
		spin_lock(&sem->lock);
		add_wait_queue_exclusive(&sem->wait, &wait);
		for (;;) {
			unsigned int value, newvalue;

			set_task_state(TASK_SLEEPING);
			value = sem->value;

			/*
			 * Ignore other pending writers: but if there
			 * are pending readers or a write holder we
			 * should sleep
			 */
			if (value & 0xc0007fff) {
				spin_unlock(&sem->lock);
				schedule();
				spin_lock(&sem->lock);
				continue;
			}

			/*
			 * This also undoes our contention marker thing, while
			 * leaving other waiters' contention markers in place
			 */
			newvalue = (value + 0x8000) | 0xc0000000;
			if (lock_cmpxchg(sem->value, value, newvalue))
				break;	/* GOT IT! */

			/* Damn, somebody else changed it from under us */
			continue;
		}
		remove_wait_queue(&sem->wait, &wait);
		spin_unlock(&sem->lock);
	}

The down_read() slow case is equivalent, but ends up being much simpler
(because we don't need to play with contention markers or ignore other
people's contention markers):

	__down_read_failed(sem)
	{
		DECLARE_WAIT_QUEUE(wait, current);

		spin_lock(&sem->lock);
		add_wait_queue(&sem->wait, &wait);
		for (;;) {
			set_task_state(TASK_SLEEPING);
			/*
			 * Yah! We already did our "inc", so if we ever see
			 * a positive value we're all done.
			 */
			if (sem->value > 0)
				break;
			spin_unlock(&sem->lock);
			schedule();
			spin_lock(&sem->lock);
		}
		remove_wait_queue(&sem->wait, &wait);
		spin_unlock(&sem->lock);
	}

Can anybody shoot any holes in this? I haven't actually tested it, but
race conditions in locking primitives are slippery things, and I'd much
rather have an algorithm we can _think_ about and prove to be working. And
I think the above one is provably correct.

Not that I want to go to that kind of extremes.

Anybody? Andrew? Mind giving the above a whirl on your testbed? Or can you
see some thinko in it without even testing?

Note in particular how the above keeps the non-contention "down_read()" /
"up_read()" cases as two single instructions, no slower than a spinlock.

(There are the following assumptions in the code: there are at most 32k
active readers, and also at most 32k pending writers. The limits come from
the writer contention marker logic. You have the 32 bits split up as:

 - 2 bits "write lock owner" (it's really only one bit, but the other bit
   is set so that the writer contention marker won't turn the most
   negative number into a positive one, so we have one "extra" bit set to
   keep the thing negative for the whole duration of a write lock)
 - 15 "reader count" bits
 - 15 "pending writer count" bits

and the 15 bits is why you have the 32k user limitation. I think it's an
acceptable one - and if not you can expand on it by adding extra fields
that are only accessed within the spinlock).
		Linus

* * *

Date: Sun, 8 Apr 2001 21:18:20 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: [Re: rw_semaphores][6]
Newsgroups: fa.linux.kernel

The "down_writer_failed()" case was wrong:

On Sun, 8 Apr 2001, Linus Torvalds wrote:
>
> __down_write_failed(sem)
> {
>	....
>	/*
>	 * Ignore other pending writers: but if there
>	 * are pending readers or a write holder we
>	 * should sleep
>	 */
>	if (value & 0xc0007fff) {
	....

The "value & 0xc0007fff" test is wrong, because it's actually always true
for some normal contention case (multiple pending writers waiting for a
reader to release the lock). Because even if there is no write lock
holder, other pending writers (and we're one) will have caused the high
bits to be set, so the above would end up causing us to think that we
can't get the lock even if there is no real lock holder.

The comment is right, though. It's just the test that is simplistic and
wrong.

The pending readers part is correct, and obvious enough:

	(value & 0x7fff) != 0

implies that there are readers. In which case we should try again. Simple
enough.

The pending writers part is slightly more subtle: we know that there is at
least _one_ pending writer, namely us. It turns out that we must check the
two high bits, and the logic is:

 - 11: only pending writers, no write lock holder (if we had a write lock
       holder, he would have set the bits to 11, but a pending writer
       would have borrowed from the lower bit, so you'd get bit pattern
       10).
 - 10: we have a real lock holder, and the pending writers borrowed from
       the low lock bit when they did the "subl $0x8000" to mark off
       contention.
 - 01: must not happen. BUG.
 - 00: we had a real write lock holder, but he released the lock and
       cleared both bits.

So the "is there a write lock holder" test basically becomes

	(value & 0xc0000000) == 0x80000000

and the full test should be

	if ((value & 0x7fff) || ((value & 0xc0000000) == 0x80000000)) {
		spin_unlock();
		schedule();
		spin_lock();
		continue;
	}

which might be rewritten some simpler way. I'm too lazy to think about it
even more.
For similar reasons, the "newvalue" calculation was subtly bogus: we must
make sure that we maintain the correct logic for the two upper bits in the
presence of _other_ pending writers. We can't just do the unconditional
binary "or" operation to set the two upper bits, because then we'd break
the above rules if there are other pending writers. So the newvalue
calculation should be something along the lines of

	/* Undo _our_ contention marker */
	newvalue = value + 0x8000;

	/* Get rid of stale writer lock bits */
	newvalue &= 0x3fffffff;

	/*
	 * If there were no other pending writers (newvalue == 0), we set
	 * both high bits, otherwise we only set bit 31.
	 * (see above on the "borrow bit being clear" logic).
	 */
	if (!newvalue)
		newvalue = 0xc0000000;
	newvalue |= 0x80000000;

And THEN I think the algorithm in the email I'm following up to should
actually really work.

Does anybody find any other details I missed?

And no, I have not yet actually _tested_ any of this. But all my code
works on the first try (or, in this case, second try if you want to be a
stickler for details).

No wonder we didn't get this right first time through. It's not really all
that horribly complicated, but the _details_ kill you.

		Linus

* * *

Date: Mon, 9 Apr 2001 22:43:53 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: [Re: rw_semaphores][7]
Newsgroups: fa.linux.kernel

On Tue, 10 Apr 2001, Tachino Nobuhiro wrote:
>
> I am not familiar with semaphore or x86, so this may not be correct,
> but if the following sequence is possible, the writer can call wake_up()
> before the reader calls add_wait_queue() and reader may sleep forever.
> Is it possible?

The ordering is certainly possible, but if it happens,
__down_read_failed() won't actually sleep, because it will notice that the
value is positive and just return immediately. So it will do some
unnecessary work (add itself to the wait-queue only to remove itself
immediately again), but it will do the right thing.

		Linus

* * *

Date: Tue, 10 Apr 2001 12:42:07 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: [Re: [PATCH] i386 rw_semaphores fix][8]
Newsgroups: fa.linux.kernel

On Tue, 10 Apr 2001, David Howells wrote:
>
> Here's a patch that fixes RW semaphores on the i386 architecture. It is very
> simple in the way it works.

XADD only works on Pentium+.

That's no problem if we make this SMP-specific - I doubt anybody actually
uses SMP on i486's even if the machines exist, as I think they all had
special glue logic that Linux would have trouble with anyway. But the
advantage of being able to use one generic kernel that works on plain UP
i386 machines as well as SMP P6+ machines is big enough that I would want
to be able to say "CONFIG_X86_GENERIC" + "CONFIG_SMP".

Even if it would be noticeably slower (ie a fallback to a spinlock might
be perfectly ok).

If you do this, I would suggest having asm-i386/{rwsem.h|rwsem-xadd.h},
and just having a

	#ifndef CONFIG_XADD
	#include <asm/rwsem.h>
	#else
	#include <asm/rwsem-xadd.h>
	#endif

(And adding "CONFIG_XADD" to the list of generated optimization
configuration options in arch/i386/config.in, of course).

That way we don't make the semaphore.h file even more unreadable.

		Linus

* * *
Date: Tue, 10 Apr 2001 13:16:10 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: [Re: [PATCH] i386 rw_semaphores fix][8]
Newsgroups: fa.linux.kernel

On Tue, 10 Apr 2001, Andi Kleen wrote:
>
> I guess 386 could live with an exception handler that emulates it.

That approach is fine, although I'd personally prefer to take the
exception just once and just rewrite the instruction as a "call". The
places that need xadd would have to follow some strict guidelines (long
modrms or other instructions to pad out to enough size, and have the
arguments in fixed registers)

> (BTW a generic exception handler for CMPXCHG would also be very useful
> for glibc -- currently it has special checking code for 386 in its mutexes)
> The 386 are so slow that nobody would probably notice a bit more slowness
> by a few exceptions.

Ehh. I find that the slower the machine is, the more easily I _notice_
that it is slow. So..

		Linus

* * *

Date: Tue, 10 Apr 2001 17:55:09 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: [Re: [PATCH] i386 rw_semaphores fix][8]
Newsgroups: fa.linux.kernel

On Wed, 11 Apr 2001, David Weinehall wrote:
> >
> > Yes, and with CMPXCHG handler in the kernel it wouldn't be needed
> > (the other 686 optimizations like memcpy also work on 386)
>
> But the code would be much slower, and it's on 386's and similarly
> slow beasts you need every cycle you can get, NOT on a Pentium IV.

Note that the "fixup" approach is not necessarily very painful at all,
from a performance standpoint (either on 386 or on newer CPU's). It's not
really that hard to just replace the instruction in the "undefined
instruction" handler by having strict rules about how to use the "xadd"
instruction.

For example, you would not actually fix up the xadd to be a function call
to something that emulates "xadd" itself on a 386. You would fix up the
whole sequence of "inline down_write()" with a simple call to an
out-of-line "i386_down_write()" function.

Note that down_write() on an old 386 is likely to be complicated enough
that you want to do it out-of-line anyway, so the code-path you take
(after the first time you've trapped on that particular location) would be
the one you would take for an optimized 386 kernel anyway. And similarly,
the restrictions you place on non-386-code to make it fixable are simple
enough that it probably shouldn't make a difference for performance on
modern chips.

		Linus

* * *
Date: Wed, 11 Apr 2001 11:41:06 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: [Re: [PATCH] i386 rw_semaphores fix][9]
Newsgroups: fa.linux.kernel

On Wed, 11 Apr 2001, David Howells wrote:
>
> > These numbers are infinity :)
>
> I know, but I think Linus may be happy with the resolution for the moment. It
> can be extended later by siphoning off excess quantities of waiters into a
> separate counter (as is done now) and by making the access count use a larger
> part of the variable.

I'm certainly ok with the count being limited to "thousands". I don't see
people being able to exploit it any practical way. But we should remember
to be careful: starting thousands of threads and trying to make them all
take page faults and overflowing the read counter would be a _nasty_
attack. It would probably not be particularly easy to arrange, but still.

Note that blocking locks are different from spinlocks: for spinlocks we
can get by with just 7 bits in a byte, and be guaranteed that that is
enough for any machine with less than 128 processors. For the blocking
locks, that is not true.

(Right now the default "max user processes" ulimit already protects us
from this exploit, I just wanted to make sure that people _think_ about
this).

So a 16-bit count is _fine_. And I could live with less.

We should remember the issue, though. If we ever decide to expand it, it
would be easy enough to make an alternative "rwsem-reallybig.h" that uses
cmpxchg8b instead, or whatever. You don't need to start splitting the
counts up to expand them past 16 bits, you could have a simple system
where the _read_ locks only look at one (fast) 32-bit word for their fast
case, and only the write lock actually needs to use cmpxchg8b.

(I think it's reasonable to optimize the read-lock more than the
write-lock: in the cases where you want to do rw-locking, the common
reason is because you really _want_ to allow many concurrent readers. That
also implies that the read case is the common one).

So you could fairly easily expand past 16-bit counters by using a 31-bit
counter for the reader, and making the high bit in the reader count be the
"contention" bit. Then the slow path (and the write path) would use the
64-bit operations offered by cmpxchg8b.

And yes, it's a Pentium+ instruction (and this time I -checked- ;), but by
the time you worry about hundreds of thousands of threads I think you can
safely just say "you'd better be running on a big, modern machine", and
just make the code conditional on CONFIG_PENTIUM+

So no real trickiness needed for expanding the number space, but certainly
also no real _reason_ for it at this time.

		Linus

* * *
Date: Mon, 16 Apr 2001 10:05:57 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: [Re: rw_semaphores][10]
Newsgroups: fa.linux.kernel

On Mon, 16 Apr 2001 yodaiken@fsmlabs.com wrote:
>
> I'm trying to imagine a case where 32,000 sharing a semaphore was anything but a
> major failure and I can't. To me: the result of an attempt by the 32,768th locker
> should be a kernel panic. Is there a reasonable scenario where this is wrong?

Hint: "I'm trying to imagine a case when writing all zeroes to /etc/passwd
is anything but a major failure, but I can't. So why don't we make
/etc/passwd world-writable?"

Right. Security.

There is _never_ any excuse for panic'ing because of some inherent
limitation of the data structures. You can return -ENOMEM, -EAGAIN or
something like that, but you must _not_ allow a panic (or a roll-over,
which would just result in corrupted kernel data structures).

Note that the limit is probably really easy to work around even without
extending the number of bits: a sleeper that notices that the count is
even _halfway_ to rolling around could easily do something like:

 - undo "this process" action
 - sleep for 1 second
 - try again from the beginning.

I certainly agree that no _reasonable_ pattern can cause the failure, but
we need to worry about people who are malicious. The above trivial
approach would take care of that, while not penalizing any non-malicious
users.

So I'm not worried about this at all. I just want people _always_ to think
about "how could I mis-use this if I was _truly_ evil", and making sure it
doesn't cause problems for others on the system.

(NOTE: This does not mean that the kernel has to do anything _reasonable_
under all circumstances. There are cases where Linux has decided that
"this is not something a reasonable program can do, and if you try to do
it, we'll give you random results back - but they will not be _security_
holes". We don't need to be _nice_ to unreasonable requests. We just must
never panic, otherwise crash, or allow unreasonable requests to mess up
_other_ people)

		Linus

* * *
Date: Fri, 20 Apr 2001 10:46:01 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: [Re: [andrea@suse.de: Re: generic rwsem [Re: Alpha "process table][11]
Newsgroups: fa.linux.kernel

On Fri, 20 Apr 2001, David Howells wrote:
>
> The file should only be used for the 80386 and maybe early 80486's where
> CMPXCHG doesn't work properly, everything above that can use the XADD
> implementation.

Why are those not using the generic files? The generic code is obviously
more maintainable.

> But if you want it totally non-inline, then that can be done. However, whilst
> developing it, I did notice that that slowed things down, hence why I wanted
> it kept in line.

I want to keep the _fast_ case in-line.

I do not care at ALL about the stupid spinlock version. That should be the
_fallback_, and it should be out-of-line. It is always going to be the
slowest implementation, modulo bugs in architecture-specific code.

For i386 and i486, there is no reason to try to maintain a complex fast
case. The machines are unquestionably going away - we should strive to not
burden them unnecessarily, but we should _not_ try to save two cycles.

In short:

 - the only case that _really_ matters for performance is the uncontended
   read-lock for "reasonable" machines. An i386 no longer counts as
   reasonable, and designing for it would be silly. And the write-lock
   case is much less compelling.
 - We should avoid any inlines where the inline code is >2* the
   out-of-line code. Icache issues can overcome any cycle gains, and do
   not show up well in benchmarks (benchmarks tend to have very hot
   icaches). Note that this is less important for the out-of-line code in
   another segment that doesn't get brought into the icache at all for the
   non-contention case, but that should still be taken _somewhat_ into
   account if only because of kernel size issues.

Both of the above rules imply that the generic spin-lock implementation
should be out-of-line.

> (1) asm-i386/rwsem-spin.h is wrong, and can probably be replaced with the
>     generic spinlock implementation without inconveniencing people much.
>     (though someone has commented that they'd want this to be inline as
>     cycles are precious on the slow 80386).

Icache is also precious on the 386, which has no L2 in 99% of all cases.
Make it out-of-line.

> (2) "fix up linux/rwsem-spinlock.h": do you want the whole generic spinlock
>     implementation made non-inline then?

Yes. People who care about performance _will_ have architecture-specific
inlines on architectures where they make sense (ie 99% of them).

		Linus

* * *
|
||
|
|
||
|
|
||
|
Date: Fri, 20 Apr 2001 16:45:32 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: [Re: x86 rwsem in 2.4.4pre[234] are still buggy [was Re: rwsem][12]
Newsgroups: fa.linux.kernel

On Fri, 20 Apr 2001, Andrea Arcangeli wrote:
>
> While dropping the list_empty check to speed up the fast path I faced the same
> complexity of the 2.4.4pre4 lib/rwsem.c and so before reinventing the wheel I
> read how the problem was solved in 2.4.4pre4.

I would suggest the following:

 - the generic semaphores should use the lock that already exists in the
   wait-queue as the semaphore spinlock.

 - the generic semaphores should _not_ drop the lock. Right now it drops
   the semaphore lock when it goes into the slow path, only to re-acquire
   it. This is due to bad interfacing with the generic slow-path routines.

   I suspect that this lock-drop is why Andrea sees problems with the
   generic semaphores. The changes to "count" and "sleeper" aren't
   actually atomic, because we don't hold the lock over them all. And
   re-using the lock means that we don't need the two levels of
   spinlocking for adding ourselves to the wait queue. Easily done by just
   moving the locking _out_ of the wait-queue helper functions, no?

 - the generic semaphores are entirely out-of-line, and are just declared
   universally as regular FASTCALL() functions.

The fast-path x86 code looks ok to me. The debugging stuff makes it less
readable than it should be, I suspect, and is probably not worth it at
this stage. The users of rw-semaphores are so well-defined (and so well
debugged) that the debugging code only makes the code harder to follow
right now.

Comments? Andrea? Your patches have looked ok, but I absolutely refuse to
see the non-inlined fast-path for reasonable x86 hardware..

Linus

* * *

Date: Sat, 21 Apr 2001 10:18:06 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: [Re: x86 rwsem in 2.4.4pre[234] are still buggy [was Re: rwsem][13]
Newsgroups: fa.linux.kernel

On Sat, 21 Apr 2001, Russell King wrote:
>
> Erm, spin_lock()? What if up_read or up_write gets called from interrupt
> context (is this allowed)?

Currently that is not allowed.

We allow it for regular semaphores, but not for rw-semaphores.

We may some day have to revisit that issue, but I suspect we won't have
much reason to.

Linus

* * *

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH 1/19] MUTEX: Introduce simple mutex implementation
Date: Fri, 16 Dec 2005 21:43:03 UTC
Message-ID: <[fa.fv9t5bi.g0g5qg@ifi.uio.no][14]>
Original-Message-ID: <[Pine.LNX.4.64.0512161339140.3698@g5.osdl.org][15]>

On Fri, 16 Dec 2005, Thomas Gleixner wrote:

> On Thu, 2005-12-15 at 21:32 +0100, Geert Uytterhoeven wrote:
> > > Why have the "MUTEX" part in there? Shouldn't that just be DECLARE_SEM
> > > (oops, I mean DEFINE_SEM). Especially that MUTEX_LOCKED! What is that?
> > > How does a MUTEX start off as locked. It can't, since a mutex must
> > > always have an owner (which, by the way, helped us in the -rt patch to
> > > find our "compat_semaphores"). So who's the owner of a
> > > DEFINE_SEM_MUTEX_LOCKED?
> >
> > No one. It's not really a mutex, but a completion.
>
> Well, then let us use a completion and not some semantically wrong
> workaround

It is _not_ wrong to have a semaphore start out in locked state.

For example, it makes perfect sense if the data structures that the
semaphore needs need initialization. The way you _should_ handle that is
to make the semaphore come up as locked, and the data structures in some
"don't matter" state, and then the thing that initializes stuff can do so
properly and then release the semaphore.

Yes, in some cases such a locked semaphore is only used once, and ends up
being a "completion", but that doesn't invalidate the fact that this is
a perfectly fine way to handle a real issue.

Linus

* * *

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH 1/19] MUTEX: Introduce simple mutex implementation
Date: Fri, 16 Dec 2005 22:21:08 UTC
Message-ID: <[fa.g0a54rd.h0o5at@ifi.uio.no][16]>
Original-Message-ID: <[Pine.LNX.4.64.0512161414370.3698@g5.osdl.org][17]>

On Fri, 16 Dec 2005, Thomas Gleixner wrote:
>
> Well, in case of a semaphore it is a semantically correct use case. In
> case of a mutex it is not.

I disagree.

Think of "initialization" as a user. The system starts out initializing
stuff, and as such the mutex should start out being held. It's that
simple. It _is_ mutual exclusion, with one user being the early bootup
state.

Linus

* * *

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH 1/12]: MUTEX: Implement mutexes
Date: Sun, 18 Dec 2005 18:43:42 UTC
Message-ID: <[fa.g09n3bh.i067av@ifi.uio.no][18]>
Original-Message-ID: <[Pine.LNX.4.64.0512181027220.4827@g5.osdl.org][19]>

On Sun, 18 Dec 2005, Russell King wrote:
>
> On Sat, Dec 17, 2005 at 10:30:41PM -0800, Linus Torvalds wrote:
> > An interrupt can never change the value without changing it back, except
> > for the old-fashioned use of "up()" as a completion (which I don't think
> > we do any more - we used to do it for IO completion a looong time ago).
>
> I doubt you can guarantee that statement, or has the kernel source
> been audited for this recently?

Well, _if_ it's a noticeable performance win, we should just do it. We
already know that people don't call "down()" in interrupts (it just
wouldn't work), we can instrument "up()" too.

It's easy enough to add a "might_sleep()" to the up(). Not strictly true,
but conceptually it would make sense to make up/down match in that sense.
We'd have to mark the (few) places that do down_trylock() + up() in
interrupt context with a special "up_in_interrupt()", but that would be ok
even from a documentation standpoint.

> However, the real consideration is stability - if a semaphore was
> used for a completion and it was merged, would it be found and
> fixed? Probably not, because it won't cause any problems on
> architectures where semaphores have atomic properties.

Actually, the reason we have completions is that using semaphores as
completions caused some really subtle problems that had nothing to do with
atomicity of the operations themselves, so if you find somebody who uses a
semaphore from an interrupt, I think we want to know about it.

Completions actually have another - and more important - property than the
fact that they have a more logical name for a particular usage.

The completion has "don't touch me" guarantees. A thread or interrupt that
does an "up()" on a semaphore may still touch the memory that was
allocated for the semaphore after the "down()" part has been released.

And THAT was the reason for the completions: we allocate them on the stack
or in temporary allocations, and the thing that does the "down()" to wait
for something to finish will also do the _release_ of the memory.

With semaphores, that caused problems, because the side doing the "up()"
would thus possibly touch memory that got released from under it.

This problem happens only on SMP (since you need to have the downer and
the upper running at the same time), but it's totally independent of the
other atomicity issues. And almost any semaphore that is used as a
completion from an interrupt will have this problem, so yes, if you find
somebody doing an "up()" in interrupt context, we'll fix it.

It would be good to make the rules clear, that you can never touch a
semaphore from irq context without changing it back before you return.

Of course, that still leaves the following sequence

	if (!down_trylock(..)) {
		... do something ..
		up(..);
	}

which is actually used from interrupts too. At least the console layer
does that (printk() can happen from interrupts, and we do a down_trylock
on the console semaphore). But that one shouldn't mess with the _count_,
although it does mean that the wait-queue preparation etc (for when the
fast case fails) does still need to be protected against interrupts.

But that would be the slow case, so from a performance standpoint, it
would still allow the case that really _matters_ to be done with
interrupts enabled.

> Unless of course sparse can be extended to detect the use of unbalanced
> semaphores in interrupt contexts.

In theory, yes, but in practice I'd much rather just do the stupid brute
force things.

> > (Of course, maybe it's not worth it. It might not be a big performance
> > issue).
>
> Balancing the elimination of 4 instructions per semaphore operation,
> totalling about 4 to 6 cycles, vs stability I'd go for stability
> unless we can prove the above assertion via (eg) sparse.

I agree, if ARM interrupt disables are fast. For example, on x86 (where
this isn't needed, because you can have an "interrupt-safe" decrement by
just having it as a single instruction, even if it isn't SMP-safe),
disabling and re-enabling interrupts is just one instruction each, but the
combination is usually something like 50+ cycles. So if this was an issue
on x86, we'd definitely care.

But if you don't think it's a big issue on ARM, it just doesn't matter.

Linus

* * *

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH 1/12]: MUTEX: Implement mutexes
Date: Sun, 18 Dec 2005 19:55:47 UTC
Message-ID: <[fa.fva543d.j0g6ir@ifi.uio.no][20]>
Original-Message-ID: <[Pine.LNX.4.64.0512181153080.4827@g5.osdl.org][21]>

On Sun, 18 Dec 2005, James Bottomley wrote:
>
> Actually, I don't think you want might_sleep(): there are a few cases
> where we do an up() from under a spinlock, which will spuriously trigger
> this. I'd suggest WARN_ON(in_interrupt()) instead.

Ahh, good point. Yes.

However, if even the arm people aren't all that interested in this, maybe
it simply doesn't matter. A lot of other architectures either have
"decrement in memory" or can already use ll/sc for it.

(of course, on some architectures, ll/sc is really really slow, so they
might well prefer using a normal load and store instead).

Linus

* * *

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 00/15] Generic Mutex Subsystem
Date: Mon, 19 Dec 2005 19:12:41 UTC
Message-ID: <[fa.g19j43e.h026iq@ifi.uio.no][22]>
Original-Message-ID: <[Pine.LNX.4.64.0512191053400.4827@g5.osdl.org][23]>

On Mon, 19 Dec 2005, Ingo Molnar wrote:
>
> in fact, generic mutexes are _more_ fair than struct semaphore in their
> wait logic. In the stock semaphore implementation, when a waiter is
> woken up, it will retry the lock, and if it fails, it goes back to the
> _tail_ of the queue again - waiting one full cycle again.

Ingo, I don't think that is true.

It shouldn't be true, at least. The whole point with the "sleeper" count
was to not have that happen. Of course, bugs happen, so I won't guarantee
that's actually true, but ..

If you are woken up as a waiter on a semaphore, you shouldn't fail to get
it. You will be woken first, and nobody else will get at it, because the
count has been kept negative or zero even by the waiters, so that a
fast-path user shouldn't be able to get the lock without going through the
slow path and adding itself (last) to the list.

But hey, somebody should test it with <n> kernel threads that just do
down()/up() and some make-believe work in between to make sure there
really is contention, and count how many times they actually get the
semaphore. That code has been changed so many times that it may not work
the way it is advertised ;)

[ Oh. I'm looking at the semaphore code, and I realize that we have a
  "wake_up(&sem->wait)" in the __down() path because we had some race long
  ago that we fixed by band-aiding over it. Which means that we wake up
  sleepers that shouldn't be woken up. THAT may well be part of the
  performance problem.. The semaphores are really meant to wake up just
  one at a time, but because of that race hack they'll wake up _two_ at a
  time - once by up(), once by down().

  That also destroys the fairness. Does anybody remember why it's that
  way? ]

Ho humm.. That's interesting.

Linus

* * *

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 00/15] Generic Mutex Subsystem
Date: Mon, 19 Dec 2005 19:57:13 UTC
Message-ID: <[fa.g1a13rj.h0c6qh@ifi.uio.no][24]>
Original-Message-ID: <[Pine.LNX.4.64.0512191148460.4827@g5.osdl.org][25]>

On Mon, 19 Dec 2005, Benjamin LaHaise wrote:
>
> The only thing I can see as an improvement that a mutex can offer over
> the current semaphore implementation is if we can perform the same
> optimization that spinlocks perform in the unlock operation: don't use
> a locked, serialising instruction in the up() codepath. That might be
> a bit tricky to implement, but it's definitely a win on the P4 where the
> cost of serialisation can be quite high.

Good point. However, it really _is_ hard, because we also need to know if
the mutex was under contention. A spinlock doesn't care, so we can just
overwrite the lock value. A mutex would always care, in order to know
whether it needs to do the slow wakeup path.

So I suspect you can't avoid serializing the unlock path for a mutex. The
issue of "was there contention while I held it" fundamentally _is_ a
serializing question.

> > [ Oh. I'm looking at the semaphore code, and I realize that we have a
> >   "wake_up(&sem->wait)" in the __down() path because we had some race long
> >   ago that we fixed by band-aiding over it. Which means that we wake up
> >   sleepers that shouldn't be woken up. THAT may well be part of the
> >   performance problem.. The semaphores are really meant to wake up just
> >   one at a time, but because of that race hack they'll wake up _two_ at a
> >   time - once by up(), once by down().
> >
> >   That also destroys the fairness. Does anybody remember why it's that
> >   way? ]
>
> History?

Oh, absolutely, I already checked the old BK history too, and that extra
wake_up() has been there at least since before we even started using BK.
So it's very much historical, I'm just wondering if somebody remembers far
enough back that we'd know.

I don't see why it's needed (since we re-try the "atomic_add_negative()"
inside the semaphore wait lock, and any up() that saw contention should
have always been guaranteed to do a wakeup that should fill the race in
between that atomic_add_negative() and the thing going to sleep).

It may be that it is _purely_ historical, and simply isn't needed. That
would be funny/sad, in the sense that we've had it there for years and
years ;)

Linus

* * *

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 00/15] Generic Mutex Subsystem
Date: Mon, 19 Dec 2005 20:14:24 UTC
Message-ID: <[fa.fvpr2re.jg27qq@ifi.uio.no][26]>
Original-Message-ID: <[Pine.LNX.4.64.0512191203120.4827@g5.osdl.org][27]>

On Mon, 19 Dec 2005, Ingo Molnar wrote:
>
> average cost per op: 206.59 usecs
> average cost per op: 512.13 usecs

(mutex vs semaphore).

That looks suspiciously like exactly double the cost, so I do believe that
the double wake_up() might be exactly what is going on.

However:

> hm, removing that wakeup quickly causes hung test-tasks.

So clearly it really is still hiding some bug.

> and even considering that the current semaphore implementation may have
> a fairness bug, i cannot imagine that making it more fair would also
> speed it up.

That's not the point. The extra wakeup() in the semaphore code wakes up
two processes for every single up(), so the semaphores end up not just
being unfair, they also end up doing twice the work (because it will
result in the other processes effectively just doing the down() twice).

> I personally find the semaphore implementation clever but too complex,
> maybe that's a reason why such bugs might be hiding there. (possibly
> for many years already ...)

Oh, absolutely. It is too complex.

And don't get me wrong: if it's easier to just ignore the performance bug,
and introduce a new "struct mutex" that just doesn't have it, I'm all for
it. However, if so, I do NOT want to do the unnecessary renaming. "struct
semaphore" should stay as "struct semaphore", and we should not affect old
code in the _least_.

Then code can switch to "struct mutex" if people want to. And if one
reason for it ends up being that the code avoids a performance bug in the
process, all the better ;)

IOW, I really think this should be a series of small patches that don't
touch old users of "struct semaphore" at all. None of this "semaphore" to
"arch_semaphore" stuff, and the new "struct mutex" would not re-use _any_
of the names that the old "struct semaphore" uses.

Linus

* * *

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 04/15] Generic Mutex Subsystem,
 add-atomic-call-func-x86_64.patch
Date: Tue, 20 Dec 2005 20:13:01 UTC
Message-ID: <[fa.g09n334.i0672i@ifi.uio.no][28]>
Original-Message-ID: <[Pine.LNX.4.64.0512201202200.4827@g5.osdl.org][29]>

On Tue, 20 Dec 2005, Russell King wrote:
>
> Also, Nico has an alternative idea for mutexes which does not
> involve decrementing or incrementing - it's an atomic swap.
> That works out at about the same cycle count on non-Intel ARM
> CPUs as the present semaphore path. I'm willing to bet that
> it will be faster than the present semaphore path on Intel ARM
> CPUs.

Don't be so sure, especially not in the future.

An atomic "swap" operation is, from a CPU design standpoint, fundamentally
more expensive than a "load + store".

Now, most ARM architectures don't notice this, because they are all
in-order, and not SMP-aware anyway. No subtle memory ordering, no nothing.
Which is the only case when "swap" basically becomes a cheap "load+store".

What I'm trying to say is that a plain "load + store" is almost always
going to be the best option in the long run.

It's also almost certainly always the best option for UP + non-preempt,
for both present and future CPU's. The reason is simply that a
microarchitecture will _always_ be optimized for that case, since it's
pretty much by definition the common situation.

Is preemption even the common case on ARM? I'd assume not. Why are people
so interested in the preemption case? IOW, why don't you just do

	ldr	lr,[%0]
	subs	lr, lr, %1
	str	lr,[%0]
	blmi	failure

as the _base_ timings, since that should be the common case. That's the
drop-dead fastest UP case.

There's an additional advantage of the regular load/store case: if some
CPU has scheduling issues, you can actually write this out as C code (with
an extra empty ASM to make sure that the compiler doesn't move anything
out of the critical region). Again, that probably doesn't matter on most
ARM chips, but in the general case it sure does matter.

(Btw, inlining _any_ of these except perhaps the above trivial case, is
probably wrong. None of the ARM chips tend to have caches all that big, I
bet).

Linus

* * *

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 04/15] Generic Mutex Subsystem,
 add-atomic-call-func-x86_64.patch
Date: Tue, 20 Dec 2005 22:07:04 UTC
Message-ID: <[fa.g09r4b6.i066ak@ifi.uio.no][30]>
Original-Message-ID: <[Pine.LNX.4.64.0512201354210.4827@g5.osdl.org][31]>

On Tue, 20 Dec 2005, Nicolas Pitre wrote:
>
> I mean...... what is it with mutexes that you dislike to the point of
> bending backward that far, and even after seeing the numbers, with such
> a semaphore implementation that _I_ even wouldn't trust people to use
> correctly?

Quite frankly, what has disgusted me about this mutex discussion is the
totally specious arguments for the new mutexes that just rub me entirely
the wrong way.

If it had _started_ with a mutex implementation that was faster, simpler,
and didn't rename the old and working semaphores, I'd have been perfectly
fine with it.

As it is, the discussion has been pretty much everything but that.

And then people who argue about single cycles, end up dismissing the
single cycles when I argue that "ld+st" is faster - like you just did.

Be consistent, dammit. If single cycles matter, they matter. If they
don't, then the existing code is better, since it's existing and works.
You can't have it both ways.

In other words: if people didn't mix up issues that had nothing to do with
this into it, I'd be happier. I've already said that a mutex that does
_not_ replace semaphore (and doesn't mess with naming) is acceptable.

We've done that before. But do it RIGHT, dammit. And don't mix existing
semaphores into it (for example, completions didn't change any old users).

Linus

* * *

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 2/8] mutex subsystem, add asm-generic/mutex-[dec|xchg].h
Date: Thu, 22 Dec 2005 23:58:36 UTC
Message-ID: <[fa.hb476cr.k76389@ifi.uio.no][32]>
Original-Message-ID: <[Pine.LNX.4.64.0512221550290.14098@g5.osdl.org][33]>

On Fri, 23 Dec 2005, Ingo Molnar wrote:
>
> add the two generic mutex fastpath implementations.

Now this looks more like it. This is readable code without any #ifdef's in
the middle.

Now the only #ifdef's seem to be for mutex debugging. Might it be
worthwhile to have a generic debugging, that just uses spinlocks and just
accept that it's going to be slow, but shared across absolutely all
architectures?

Then you could have <linux/mutex.h> just doing a single

	#ifdef CONFIG_MUTEX_DEBUG
	# include <asm-generic/mutex-dbg.h>
	#else
	# include <asm/mutex.h>
	#endif

and have mutex-dbg.h just contain prototypes (no point in inlining them,
they're going to be big anyway) and then have a

	obj$(CONFIG_MUTEX_DEBUG) += mutex-debug.c

in the kernel/ subdirectory? That way you could _really_ have a clean
separation, with absolutely zero pollution of any architecture mess or
debugging #ifdef's in any implementation code.

At that point I'd like to switch to mutexes just because the code is
cleaner!

Linus

* * *

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 08/19] mutex subsystem, core
Date: Tue, 03 Jan 2006 15:40:21 UTC
Message-ID: <[fa.g0a14jd.i0g62p@ifi.uio.no][34]>
Original-Message-ID: <[Pine.LNX.4.64.0601030736531.3668@g5.osdl.org][35]>

On Tue, 3 Jan 2006, Ingo Molnar wrote:
> >
> > Is this an interrupt deadlock, or do you not allow interrupt contexts
> > to even trylock a mutex?
>
> correct, no irq contexts are allowed. This is also checked for if
> CONFIG_DEBUG_MUTEXES is enabled.

Note that semaphores are definitely used from interrupt context, and as
such you can't replace them with mutexes if you do this.

The prime example is the console semaphore. See kernel/printk.c, look for
"down_trylock()", and realize that they are all about interrupts.

Linus

* * *

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] Replace completions with semaphores
Date: Tue, 15 Apr 2008 17:00:38 UTC
Message-ID: <[fa.z+KqPzoOtCdF/f8Xq6ejdEs5kZ4@ifi.uio.no][36]>

On Tue, 15 Apr 2008, Andi Kleen wrote:
>
> > - probably add support for completions to do counting
>
> But that's just a semaphore, isn't it?

Exactly. But the point here is:

 - nobody should use semaphores anyway (use mutexes)
 - making *more* code use semaphores is wrong
 - completions have a different _mental_ model

IOW, this is not about implementation issues. It's about how you think
about the operations.

We should _not_ implement completions as semaphores, simply because we
want to get *rid* of semaphores some day.

So rather than this long and involved patch series that first makes
semaphores generic, and then makes them be used as completions, I'd much
rather just skip this whole pointless exercise entirely.

Why have "generic semaphores" at all, if we want to get rid of them?

Linus

* * *

From: Linus Torvalds <torvalds@linux-foundation.org>
|
||
|
Newsgroups: fa.linux.kernel
|
||
|
Subject: Re: [PATCH] Replace completions with semaphores
|
||
|
Date: Tue, 15 Apr 2008 18:20:47 UTC
|
||
|
Message-ID: <[fa.6KvRw3P6/PVf+3AnODUXgA0i0AU@ifi.uio.no][37]>
|
||
|
|
||
|
On Tue, 15 Apr 2008, Matthew Wilcox wrote:
|
||
|
>
|
||
|
> > In other words, what makes me not like this is hat we first turn
|
||
|
> > semaphores into the generic code (which is largely what completions were:
|
||
|
> > just a special case of the generic semaphores!) and then turns completions
|
||
|
> > into these things. That just doesn't make any sense to me!
|
||
|
>
|
||
|
> Blame me for not realising that completions were semaphores under a
|
||
|
> different name.
|
||
|
|
||
|
The origin of completions is literally the semaphore code - just
|
||
|
simplified to use spinlocks and be usable as just a mutex. We used to use
|
||
|
semaphores, and because of the subtle race with lockless semaphores I
|
||
|
wrote that stupid completion code as a "generic semaphore with a very
|
||
|
specific usage scenario" and called them "completions".
|
||
|
|
||
|
The completions _could_ have been extended/used as mutex semaphores, but
|
||
|
the difference was really the mental model for them. That then limited the
|
||
|
implementation of them: the functions working on completions are defined
|
||
|
on purpose to be limited - it doesn't really have "up()" and "down()"
|
||
|
functions: "complete()" is really a up(), but "wait_for_completion()" is
|
||
|
more like a "wait_until_I_could_do_a_trydown()" function.
|
||
|
Would it make sense to use completions for countable events too? Yeah. In
fact, we have some things that really would like to do counting, both in
the sense of "wait for <n> events to all complete" _and_ in the sense of
"allow up to <n> events to be outstanding". Both of which could be done as
a counting function (just make "complete" increment the counter, and then
make "wait for <n> events" initialize it to negative, while "allow <n>
outstanding events" would be a positive counter, and make
"wait_for_completion()" basically be a "decrement and wait until it
is zero").

IOW, completions() really follow the same patterns as semaphores, and it
*does* make sense to just have one single code-base. But if we want to
make semaphores go away, I think that it would be better to implement
semaphores in terms of "extended completions" rather than the other way
around. That way, we could one day really get rid of semaphores entirely.

Linus

* * *

[Index][1] [Home][2] [About][3] [Blog][4]

[1]: http://yarchive.net/index.html
[2]: http://yarchive.net/home.html
[3]: http://yarchive.net/about.html
[4]: http://yarchive.net/blog
[5]: http://groups.google.com/groups/search?as_ugroup=comp.os.linux.development.system&as_uauthors=Linus+Torvalds&as_usubject=playing+kernel+linux&as_drrb=b&as_mind=26&as_minm=6&as_miny=1999&as_maxd=28&as_maxm=6&as_maxy=1999&sitesearch=groups.google.com
[6]: http://groups.google.com/groups/search?as_ugroup=fa.linux.kernel&as_uauthors=Linus+Torvalds&as_usubject=rw_semaphores&as_drrb=b&as_mind=7&as_minm=4&as_miny=2001&as_maxd=9&as_maxm=4&as_maxy=2001&sitesearch=groups.google.com
[7]: http://groups.google.com/groups/search?as_ugroup=fa.linux.kernel&as_uauthors=Linus+Torvalds&as_usubject=rw_semaphores&as_drrb=b&as_mind=8&as_minm=4&as_miny=2001&as_maxd=10&as_maxm=4&as_maxy=2001&sitesearch=groups.google.com
[8]: http://groups.google.com/groups/search?as_ugroup=fa.linux.kernel&as_uauthors=Linus+Torvalds&as_usubject=rw_semaphores+patch+i386&as_drrb=b&as_mind=9&as_minm=4&as_miny=2001&as_maxd=11&as_maxm=4&as_maxy=2001&sitesearch=groups.google.com
[9]: http://groups.google.com/groups/search?as_ugroup=fa.linux.kernel&as_uauthors=Linus+Torvalds&as_usubject=rw_semaphores+patch+i386&as_drrb=b&as_mind=10&as_minm=4&as_miny=2001&as_maxd=12&as_maxm=4&as_maxy=2001&sitesearch=groups.google.com
[10]: http://groups.google.com/groups/search?as_ugroup=fa.linux.kernel&as_uauthors=Linus+Torvalds&as_usubject=rw_semaphores&as_drrb=b&as_mind=15&as_minm=4&as_miny=2001&as_maxd=17&as_maxm=4&as_maxy=2001&sitesearch=groups.google.com
[11]: http://groups.google.com/groups/search?as_ugroup=fa.linux.kernel&as_uauthors=Linus+Torvalds&as_usubject=generic+process+andrea&as_drrb=b&as_mind=19&as_minm=4&as_miny=2001&as_maxd=21&as_maxm=4&as_maxy=2001&sitesearch=groups.google.com
[12]: http://groups.google.com/groups/search?as_ugroup=fa.linux.kernel&as_uauthors=Linus+Torvalds&as_usubject=buggy+rwsem+still&as_drrb=b&as_mind=19&as_minm=4&as_miny=2001&as_maxd=21&as_maxm=4&as_maxy=2001&sitesearch=groups.google.com
[13]: http://groups.google.com/groups/search?as_ugroup=fa.linux.kernel&as_uauthors=Linus+Torvalds&as_usubject=buggy+rwsem+still&as_drrb=b&as_mind=20&as_minm=4&as_miny=2001&as_maxd=22&as_maxm=4&as_maxy=2001&sitesearch=groups.google.com
[14]: http://groups.google.com/groups/search?as_umsgid=fa.fv9t5bi.g0g5qg%40ifi.uio.no
[15]: http://mid.gmane.org/Pine.LNX.4.64.0512161339140.3698%40g5.osdl.org
[16]: http://groups.google.com/groups/search?as_umsgid=fa.g0a54rd.h0o5at%40ifi.uio.no
[17]: http://mid.gmane.org/Pine.LNX.4.64.0512161414370.3698%40g5.osdl.org
[18]: http://groups.google.com/groups/search?as_umsgid=fa.g09n3bh.i067av%40ifi.uio.no
[19]: http://mid.gmane.org/Pine.LNX.4.64.0512181027220.4827%40g5.osdl.org
[20]: http://groups.google.com/groups/search?as_umsgid=fa.fva543d.j0g6ir%40ifi.uio.no
[21]: http://mid.gmane.org/Pine.LNX.4.64.0512181153080.4827%40g5.osdl.org
[22]: http://groups.google.com/groups/search?as_umsgid=fa.g19j43e.h026iq%40ifi.uio.no
[23]: http://mid.gmane.org/Pine.LNX.4.64.0512191053400.4827%40g5.osdl.org
[24]: http://groups.google.com/groups/search?as_umsgid=fa.g1a13rj.h0c6qh%40ifi.uio.no
[25]: http://mid.gmane.org/Pine.LNX.4.64.0512191148460.4827%40g5.osdl.org
[26]: http://groups.google.com/groups/search?as_umsgid=fa.fvpr2re.jg27qq%40ifi.uio.no
[27]: http://mid.gmane.org/Pine.LNX.4.64.0512191203120.4827%40g5.osdl.org
[28]: http://groups.google.com/groups/search?as_umsgid=fa.g09n334.i0672i%40ifi.uio.no
[29]: http://mid.gmane.org/Pine.LNX.4.64.0512201202200.4827%40g5.osdl.org
[30]: http://groups.google.com/groups/search?as_umsgid=fa.g09r4b6.i066ak%40ifi.uio.no
[31]: http://mid.gmane.org/Pine.LNX.4.64.0512201354210.4827%40g5.osdl.org
[32]: http://groups.google.com/groups/search?as_umsgid=fa.hb476cr.k76389%40ifi.uio.no
[33]: http://mid.gmane.org/Pine.LNX.4.64.0512221550290.14098%40g5.osdl.org
[34]: http://groups.google.com/groups/search?as_umsgid=fa.g0a14jd.i0g62p%40ifi.uio.no
[35]: http://mid.gmane.org/Pine.LNX.4.64.0601030736531.3668%40g5.osdl.org
[36]: http://groups.google.com/groups/search?as_umsgid=fa.z+KqPzoOtCdF%2ff8Xq6ejdEs5kZ4%40ifi.uio.no
[37]: http://groups.google.com/groups/search?as_umsgid=fa.6KvRw3P6%2fPVf+3AnODUXgA0i0AU%40ifi.uio.no