---
created_at: '2014-10-28T21:00:37.000Z'
title: X86 versus other architectures (Linus Torvalds) (2003)
url: http://yarchive.net/comp/linux/x86.html
author: tambourine_man
points: 63
story_text: ''
comment_text:
num_comments: 62
story_id:
story_title:
story_url:
parent_id:
created_at_i: 1414530037
_tags:
- story
- author_tambourine_man
- story_8523550
objectID: '8523550'
---

[Source](http://yarchive.net/comp/linux/x86.html "Permalink to x86 versus other architectures (Linus Torvalds)")

# x86 versus other architectures (Linus Torvalds)

[Index][1] [Home][2] [About][3] [Blog][4]

* * *

Newsgroups: fa.linux.kernel
From: torvalds@transmeta.com (Linus Torvalds)
Subject: Re: Minutes from Feb 21 LSE Call
Original-Message-ID: <[b3b6oa$bsj$1@penguin.transmeta.com][5]>
Date: Sun, 23 Feb 2003 19:23:46 GMT
Message-ID: <[fa.k71001p.1m862d@ifi.uio.no][6]>

In article <20030223082036.GI10411@holomorphy.com>,
William Lee Irwin III <wli@holomorphy.com> wrote:
>On Sun, Feb 23, 2003 at 12:07:50AM -0800, David Lang wrote:
>> Garrit, you missed the prior poster's point. IA64 has the same fundamental
>> problem as the Alpha, PPC, and Sparc processors: it doesn't run x86
>> binaries.
>
>If I didn't know this mattered I wouldn't bother with the barfbags.
>I just wouldn't deal with it.

Why?

The x86 is a hell of a lot nicer than the ppc32, for example. On the
x86, you get good performance and you can ignore the design mistakes (ie
segmentation) by just basically turning them off.

On the ppc32, the MMU braindamage is not something you can ignore: you
have to write your OS for it, and if you turn it off (ie enable soft-fill
on the ones that support it) you now have to have separate paths in the
OS for it.

And the baroque instruction encoding on the x86 is actually a _good_
thing: it's a rather dense encoding, which means that you win on icache.
It's a bit hard to decode, but who cares? Existing chips do well at
decoding, and thanks to the icache win they tend to perform better - and
they load faster too (which is important - you can make your CPU have
big caches, but _nothing_ saves you from the cold-cache costs).

The low register count isn't an issue when you code in any high-level
language, and it has actually forced x86 implementors to do a hell of a
lot better job than the competition when it comes to memory loads and
stores - which helps in general. While the RISC people were off trying
to optimize their compilers to generate loops that used all 32 registers
efficiently, the x86 implementors instead made the chip run fast on
varied loads and used tons of register renaming hardware (and looked at
_memory_ renaming too).

IA64 made all the mistakes everybody else did, and threw out all the good
parts of the x86 because people thought those parts were ugly. They
aren't ugly, they're the "charming oddity" that makes it do well. Look
at them the right way and you realize that a lot of the grottiness is
exactly _why_ the x86 works so well (yeah, and the fact that they are
everywhere ;).

The only real major failure of the x86 is the PAE crud. Let's hope
we'll get to forget it, the same way the DOS people eventually forgot
about their memory extenders.

(Yeah, and maybe IBM will make their ppc64 chips cheap enough that they
will matter, and people can overlook the grottiness there. Right now
Intel doesn't even seem to be interested in "64-bit for the masses", and
maybe IBM will be. AMD certainly seems to be serious about the "masses"
part, which in the end is the only part that really matters).

Linus

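Linus's code-density argument lends itself to a back-of-the-envelope model. The numbers below (average instruction sizes, cache size) are illustrative assumptions, not measurements:

```python
# Rough model of how instruction-encoding density translates into
# effective icache capacity. All sizes are illustrative assumptions.

ICACHE_BYTES = 32 * 1024          # a 32 KiB instruction cache

AVG_INSN_BYTES = {
    "x86 (variable-length)": 3.0,  # assumed average; real code varies
    "RISC (fixed 32-bit)": 4.0,
}

def insns_cached(avg_bytes, cache_bytes=ICACHE_BYTES):
    """How many instructions fit in the icache at a given density."""
    return int(cache_bytes / avg_bytes)

for name, size in AVG_INSN_BYTES.items():
    print(f"{name}: ~{insns_cached(size)} instructions cached")

# Denser encoding keeps more of the working set in cache -- and, as
# Linus notes, smaller binaries also load faster from cold (disk).
density_win = insns_cached(3.0) / insns_cached(4.0)
print(f"density win: ~{density_win:.2f}x")
```

Under these assumed sizes the dense encoding holds about a third more of the instruction working set in the same cache, which is the "win on icache" being claimed.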
* * *

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Minutes from Feb 21 LSE Call
Original-Message-ID: <[Pine.LNX.4.44.0302231326370.1534-100000@home.transmeta.com][7]>
Date: Sun, 23 Feb 2003 21:39:07 GMT
Message-ID: <[fa.m6ucdqo.140m9go@ifi.uio.no][8]>

On Sun, 23 Feb 2003, David Mosberger wrote:
>
> But does x86 really work so well? Itanium 2 on 0.13um performs a lot
> better than P4 on 0.13um.

On WHAT benchmark?

Itanium 2 doesn't hold a candle to a P4 on any real-world benchmarks.

As far as I know, the _only_ things Itanium 2 does better on are (a) FP
kernels, partly due to a huge cache, and (b) big databases, entirely
because the P4 is crippled when there is lots of memory, because Intel
refuses to do a 64-bit version (because they know it would totally kill
ia-64).

Last I saw, P4 was kicking ia-64 butt on specint and friends.

That's also ignoring the fact that ia-64 simply CANNOT DO the things a P4
does every single day. You can't put an ia-64 in a reasonable desktop
machine, partly because of pricing, but partly because it would just suck
so horribly at things people expect not to suck (games spring to mind).

And I further bet that using a native distribution (ie totally ignoring
the power and price and bad x86 performance issues), ia-64 will work a lot
worse for people simply because the binaries are bigger. That was quite
painful on alpha, and ia-64 is even worse - to offset the bigger binaries,
you need a faster disk subsystem etc just to not feel slower than a
bog-standard PC.

Code size matters. Price matters. Real world matters. And ia-64 at least
so far falls flat on its face on ALL of these.

> As far as I can guess, the only reason P4
> comes out on 0.13um (and 0.09um) before anything else is due to the
> latter part you mention: it's where the volume is today.

It's where all the money is ("ia-64: 5 billion dollars in the red and
still sinking") so of _course_ it's where the efforts get put.

Linus

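The "bigger binaries feel slower" claim is easy to put numbers on. A minimal sketch; the binary sizes and disk throughput below are invented for illustration (not actual alpha or ia-64 figures):

```python
# Cold-start cost scales with binary size: everything must come off
# the disk before the icache even enters the picture. All numbers are
# invented purely for illustration.

DISK_MB_PER_S = 40.0   # assumed sustained read throughput of a 2003 disk

def cold_load_seconds(binary_mb, disk_mb_s=DISK_MB_PER_S):
    """Time just to page a binary in from disk on a cold start."""
    return binary_mb / disk_mb_s

x86_mb, ia64_mb = 10.0, 14.0   # assumed sizes for the same program
penalty = cold_load_seconds(ia64_mb) / cold_load_seconds(x86_mb)
print(f"ia-64 cold start ~{penalty:.1f}x slower for the same workload")
```

The point being: the penalty is linear in code size, so "to offset the bigger binaries, you need a faster disk subsystem" just to break even.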
* * *

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Minutes from Feb 21 LSE Call
Original-Message-ID: <[Pine.LNX.4.44.0302231634150.1690-100000@home.transmeta.com][9]>
Date: Mon, 24 Feb 2003 00:45:46 GMT
Message-ID: <[fa.m5ugfii.150ub8u@ifi.uio.no][10]>

On Sun, 23 Feb 2003, David Mosberger wrote:
>
> 2 GHz Xeon: 701 SPECint
> 1 GHz Itanium 2: 810 SPECint
>
> That is, Itanium 2 is 15% faster.

Ehh, and this is with how much cache?

Last I saw, the Itanium 2 machines came with 3MB of integrated L3 cache,
and I suspect that whatever 0.13 Itanium numbers you're looking at are
with the new 6MB caches.

So your "apples to apples" comparison isn't exactly that.

The only thing that is meaningful is "performance at the same time of
general availability". At which point the P4 beats the Itanium 2 senseless
with a 25% higher SpecInt. And last I heard, by the time Itanium 2 is up
at 2GHz, the P4 is apparently going to be at 5GHz, comfortably keeping
that 25% lead.

Linus

* * *

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Minutes from Feb 21 LSE Call
Original-Message-ID: <[Pine.LNX.4.44.0302231840220.1690-100000@home.transmeta.com][11]>
Date: Mon, 24 Feb 2003 02:59:50 GMT
Message-ID: <[fa.m6eefqe.14gcagq@ifi.uio.no][12]>

On Sun, 23 Feb 2003, David Mosberger wrote:
> >> 2 GHz Xeon: 701 SPECint
> >> 1 GHz Itanium 2: 810 SPECint
>
> >> That is, Itanium 2 is 15% faster.
>
> Unfortunately, HP doesn't sell 1.5MB/1GHz Itanium 2 workstations, but
> we can do some educated guessing:
>
> 1GHz Itanium 2, 3MB cache: 810 SPECint
> 900MHz Itanium 2, 1.5MB cache: 674 SPECint
>
> Assuming pure frequency scaling, a 1GHz/1.5MB Itanium 2 would get
> around 750 SPECint. In reality, it would get slightly less, but most
> likely substantially more than 701.

And as Dean pointed out:

2GHz Xeon MP with 2MB L3 cache: 842 SPECint

In other words, the P4 eats the Itanium for breakfast even if you limit it
to 2GHz due to some "process" rule.

And if you don't make up any silly rules, but simply look at "what's
available today", you get

2.8GHz Xeon MP with 2MB L3 cache: 907 SPECint

or even better (much cheaper CPUs):

3.06 GHz P4 with 512kB L2 cache: 1074 SPECint
AMD Athlon XP 2800+: 933 SPECint

These are systems that you can buy today. With _less_ cache, and clearly
much higher performance: comparing the best-performing published ia-64
with the best P4 on specint, the P4 is 32% faster. Even with the "you can
only run the P4 at 2GHz because that is all it ever ran at in 0.18" thing,
the ia-64 falls behind.

> Linus> The only thing that is meaningful is "performance at the same
> Linus> time of general availability".
>
> You claimed that x86 is inherently superior. I provided data that
> shows that much of this apparent superiority is simply an effect of
> the larger volume that x86 achieves today.

And I showed that your data is flawed. Clearly the P4 outperforms ia-64
on an architectural level _even_ when taking process into account.

Linus

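The SPECint comparisons traded back and forth in this thread can be recomputed directly from the quoted scores (the "pure frequency scaling" step is David Mosberger's assumption, not a measurement):

```python
# Recomputing the thread's comparisons from the SPECint scores it quotes.

scores = {
    "2 GHz Xeon":                  701,
    "1 GHz Itanium 2 (3MB L3)":    810,
    "900 MHz Itanium 2 (1.5MB)":   674,
    "2 GHz Xeon MP (2MB L3)":      842,
    "2.8 GHz Xeon MP (2MB L3)":    907,
    "3.06 GHz P4 (512kB L2)":     1074,
    "Athlon XP 2800+":             933,
}

def pct_faster(a, b):
    """How much faster score a is than score b, in percent."""
    return (a / b - 1) * 100

# "Itanium 2 is 15% faster" (than the 2 GHz Xeon):
print(f"{pct_faster(810, 701):.1f}%")   # 15.5%, rounded down in the thread

# Mosberger's scaling guess: the 900 MHz/1.5MB part scaled to 1 GHz
scaled = 674 * (1000 / 900)
print(f"~{scaled:.0f} SPECint")         # ~749, matching his "around 750"

# "the P4 is 32% faster" (best P4 vs best published ia-64):
print(f"{pct_faster(1074, 810):.1f}%")  # 32.6%
```

So the arithmetic on both sides checks out; the disagreement is entirely about which comparison (same process vs same availability date) is the fair one.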
* * *

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Minutes from Feb 21 LSE Call
Original-Message-ID: <[Pine.LNX.4.44.0302231343050.1534-100000@home.transmeta.com][13]>
Date: Sun, 23 Feb 2003 21:49:50 GMT
Message-ID: <[fa.m5e8eal.15gi80t@ifi.uio.no][14]>

On Sun, 23 Feb 2003, John Bradford wrote:
>
> I could be wrong, but I always thought that Sparc, and a lot of other
> architectures could mark arbitrary areas of memory, (such as the
> stack), as non-executable, whereas x86 only lets you have one
> non-executable segment.

The x86 has that stupid "executability is tied to a segment" thing, which
means that you cannot make things executable on a page-per-page level.
It's a mistake, but it's one that _could_ be fixed in the architecture if
it really mattered, the same way the WP bit got fixed in the i486.

I'm definitely not saying that the x86 is perfect. It clearly isn't. But a
lot of people complain about the wrong things, and a lot of people who
tried to "fix" things just made them worse by throwing out the good parts
too.

Linus

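The contrast John Bradford raises can be sketched as two toy permission models: one execute limit covering the whole code segment, versus a per-page execute bit. A deliberately simplified simulation, not real MMU code; all addresses and limits are arbitrary illustrative values:

```python
# Toy models of the two executability schemes discussed above.

def segment_executable(addr, cs_limit):
    """Classic x86 scheme: everything below the code-segment limit is
    executable; you cannot punch a non-executable hole into it."""
    return addr < cs_limit

def page_executable(addr, exec_bits, page_size=4096):
    """Per-page scheme (Sparc-style, later x86 NX): each page carries
    its own execute permission."""
    return exec_bits.get(addr // page_size, False)

CS_LIMIT = 0x40000000                 # everything above is non-executable

# Per-page table: a code page is executable, a stack page is not.
exec_bits = {0x08048: True,           # a code page
             0xbffff: False}          # a stack page

# The segment scheme cannot make an arbitrary region non-executable:
assert segment_executable(0x08048000, CS_LIMIT)   # code: runs
assert segment_executable(0x30000000, CS_LIMIT)   # data below limit: also runs!
# The per-page scheme can:
assert page_executable(0x08048000, exec_bits)     # code page: runs
assert not page_executable(0xbffff000, exec_bits) # stack page: blocked
print("per-page execute control is strictly finer-grained")
```

This is exactly why Linus calls it fixable: adding one more per-page bit (as the WP bit was added in the i486, and as NX later was) subsumes the segment limit.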
* * *

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Minutes from Feb 21 LSE Call
Original-Message-ID: <[Pine.LNX.4.44.0302231805240.1690-100000@home.transmeta.com][15]>
Date: Mon, 24 Feb 2003 02:43:43 GMT
Message-ID: <[fa.m6eieqj.14g0bgv@ifi.uio.no][16]>

On 24 Feb 2003 linux@horizon.com wrote:
>
> Now wait a minute. I thought you worked at Transmeta.
>
> There were no development and debugging costs associated with getting
> all those different kinds of gates working, and all the segmentation
> checking right?

So? The only thing that matters is the end result.

> Wouldn't it have been easier to build the system, and shift the effort
> where it would really do some good, if you didn't have to support
> all that crap?

Probably not appreciably. You forget - it's been tried. Over and over
again. The whole RISC philosophy was all about "wouldn't it perform better
if you didn't have to support that crap".

The fact is, the "crap" doesn't matter that much. As proven by the fact
that the "crap" processor family ends up being the one that eats pretty
much everybody else for lunch on performance issues.

Yes, the "crap" does end up making it a harder market to enter. There's a
lot of IP involved in knowing what all the rules are, and having literally
_millions_ of tests that check for conformance to the architecture (and
much of the "architecture" is a de-facto thing, not really written down in
architecture manuals).

But clearly even that is not insurmountable, as shown by the fact that not
only does the x86 perform well, it's also one of the few CPUs that are
actively worked on by multiple different companies (including Transmeta,
as you point out - although clearly the "crap" is one reason why the sw
approach works at all).

> Transmeta's software decoding is an extreme example of what all modern
> x86 processors are doing in their L1 caches, namely predecoding the
> instructions and storing them in expanded form. This varies from
> just adding boundary tags (Pentium) and instruction type (K7) through
> converting them to uops and caching those (P4).

But you seem to imply that that is somehow a counter-argument to _my_
argument. And I don't agree.

I think what Transmeta (and AMD, and VIA etc) show is that the ugliness
doesn't really matter - there are different ways of handling it, and you
can either throw hardware at it or software at it, but it's still worth
doing, because in the end what matters is not the bad parts of it, but the
good parts.

Btw, the P4 trace cache does pretty much exactly the same thing that
Transmeta does, except in hardware. It's based on a very simple reality:
decoding _is_ going to be the bottleneck for _any_ instruction set, once
you've pushed the rest hard enough. If you're not doing predecoding, that
only means that you haven't pushed hard enough yet - _regardless_ of your
architecture.

> This exactly undoes any L1 cache size benefits. The win, of course, is
> that you don't have as much shifting and aligning on your i-fetch path,
> which all the fixed-instruction-size architectures already started with.

No. You don't understand what the "cold-cache" case really means. It's
more than just bringing the thing in from memory to the cache. It's also
all about loading the dang thing from disk.

> So your comments only apply to the L2 cache.

And the disk.

> And for the expense of all the instruction predecoding logic between
> L2 and L1, don't you think someone could build an instruction compressor
> to fit more into the die-size-limited L2 cache?

It's been done. See the PPC stuff. I've read the papers (it's been a long
time, admittedly - it's not something new), and the fact is, it's not
apparently being used that much. Because it's quite painful, unlike the
x86 approach.

> > stores - which helps in general. While the RISC people were off trying
> > to optimize their compilers to generate loops that used all 32 registers
> > efficiently, the x86 implementors instead made the chip run fast on
> > varied loads and used tons of register renaming hardware (and looking at
> > _memory_ renaming too).
>
> I don't disagree that chip designers have managed to do very well with
> the x86, and there's nothing wrong with making a virtue out of a necessity,
> but that doesn't make the necessity good.

Actually, you miss my point.

The necessity is good because it _forced_ people to look at what really
matters. Instead of wasting 15 years and countless PhDs on things that
are, in the end, just engineering masturbation (nr of registers etc).

> The low register count *does* affect you when using a high-level language,
> because if you have too many live variables floating around, you start
> suffering. Handling these spills is why you need memory renaming.

Bzzt. Wrong answer.

The right answer is that you need memory renaming and memory alias
hardware _anyway_, because doing dynamic scheduling of loads vs stores is
something that is _required_ to get the kind of performance that people
expect today. And all the RISC stuff that tried to avoid it was just a BIG
WASTE OF TIME. Because the _only_ thing the RISC approach ended up showing
was that eventually you have to do the hard stuff anyway, so you might as
well design for doing it in the first place.

Which is what ia-64 did wrong - and what I mean by making the same mistakes
that everybody else did 15 years ago. Look at all the crap that ia64 does
in order to do compiler-driven loop modulo-optimizations. That's part of
the whole design, with predication and those horrible register windows.
Can you say "risc mistakes all over again"?

My strong suspicion (and that makes it a "fact" ;) is that in another 5
years they'll get to where the x86 has been for the last 10 years, and
they'll realize that they will need to do out-of-order accesses etc, which
makes all of that modulo optimization pretty much useless, since the
hardware pretty much has to do it _anyway_.

> It's true that x86 processors have had fancy architectural features
> sooner than similar-performance RISCs, but I think there's a fair case
> that that's because they've *needed* them.

Which is exactly my point. And by the time you implement them, you notice
that the half-way measures don't mean anything, and in fact make for more
problems.

For example, that small register state is a pain in the ass, no? But since
you basically need register renaming _anyway_, the small register state
actually has some advantages in that it makes it easier to have tons of
read ports and still keep the register file fast. And once you do renaming
(including memory state renaming), IT DOESN'T MUCH MATTER.

> Why do the P4 and K7/K8 have
> such enormous reorder buffers, able to keep around 100 instructions
> in flight at a time? Because they need it to extract parallelism out
> of an instruction stream serialized by a miserly register file.

You think this is bad?

Look at it another way: once you have hundreds of instructions in flight,
you have hardware that automatically

- executes legacy applications reasonably well, since compilers aren't
  the most important thing.

  End result: users are happy.

- doesn't need compilers that do stupid things like unrolling loops,
  thus keeping your icache pressure down, since you do loop unrolling
  in hardware thanks to deep pipelines.

Even the RISC people are doing hundreds of instructions in flight (ie
Power5), but they started doing it years after the x86 did, because they
claimed that they could force their users to recompile their binaries
every few years. And look where it actually got them..

> They've developed some great technology to compensate for the weaknesses,
> but it's sure nice to dream of an architecture with all that great
> technology but with fewer initial warts. (Alpha seemed like the
> best hope, but *sigh*. Still, however you apportion blame for its
> demise, performance was clearly not one of its problems.)

So my premise is that you always end up doing the hard things anyway, and
the "crap" _really_ doesn't matter.

Alpha was nice, no question about it. But it took them way too long to get
to the whole OoO thing, because they tried to take a short-cut that in the
end wasn't the answer. It _looked_ like the answer (the original alpha
design was done explicitly to not _need_ things like complex out-of-order
execution), but it was all just wrong.

The thing about the x86 is that hard cold reality (ie millions of
customers that have existing applications) really _forces_ you to look at
what matters, and so far it clearly appears that the things you are
complaining about (registers and segmentation) simply do _not_ matter.

> I think the same claim applies much more powerfully to the ppc32's MMU.
> It may be stupid, but it is only visible from inside the kernel, and
> a fairly small piece of the kernel at that.
>
> It could be scrapped and replaced with something better without any
> effect on existing user-level code at all.
>
> Do you think you can replace the x86's register problems as easily?

They _have_ been solved. The x86 performs about twice as well as any ppc32
on the market. End of discussion.

> > The only real major failure of the x86 is the PAE crud.
>
> So you think AMD extended the register file just for fun?

I think the AMD register file extension was unnecessary, yes. They did it
because they could, and it wasn't a big deal. That's not the part that
makes the architecture interesting. As you should well know.

> Hell, the "PAE crud" is the *same* problem as the tiny register
> file. Insufficient virtual address space leading to physical > virtual
> kludges.

Nope. The small register file is a non-issue. Trust me. I do work for
Transmeta, and we do the register renaming in software, and it doesn't
matter in the end.

Linus

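Linus's point that renaming makes the small architectural register file a non-issue can be illustrated with a toy rename table: a handful of architectural names map onto a much larger physical file, so reusing a name creates no false (write-after-write) dependency. A minimal sketch under those assumptions:

```python
# Minimal register-renaming sketch: each architectural write gets a
# fresh physical register, so reusing the same architectural name does
# not serialize independent computations. Purely illustrative.

class RenameTable:
    def __init__(self, num_physical=128):
        self.free = list(range(num_physical))   # free physical registers
        self.map = {}                           # arch name -> phys reg

    def write(self, arch_reg):
        """Allocate a fresh physical register for a new value."""
        phys = self.free.pop(0)
        self.map[arch_reg] = phys
        return phys

    def read(self, arch_reg):
        """Reads see whichever physical register holds the latest value."""
        return self.map[arch_reg]

rt = RenameTable()
p0 = rt.write("eax")        # eax = a + b
first_use = rt.read("eax")  # a consumer of the first value
p1 = rt.write("eax")        # eax = c + d  (reuses only the *name*)
assert p0 != p1             # two independent physical registers, so the
assert first_use == p0      # two computations can execute in parallel
```

With 128 physical registers behind 8 architectural names, the "miserly register file" the reorder buffer complaint is about largely disappears, which is the sense in which "IT DOESN'T MUCH MATTER".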
* * *

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Minutes from Feb 21 LSE Call
Original-Message-ID: <[Pine.LNX.4.44.0302232041130.4453-100000@home.transmeta.com][17]>
Date: Mon, 24 Feb 2003 05:02:35 GMT
Message-ID: <[fa.m7tseqi.160q9go@ifi.uio.no][18]>

On Sun, 23 Feb 2003, Martin J. Bligh wrote:
>
> > The fact is, the "crap" doesn't matter that much. As proven by the fact
> > that the "crap" processor family ends up being the one that eats pretty
> > much everybody else for lunch on performance issues.
>
> But is that because it's a better design? Or because it has more money
> thrown at it? I suspect it's merely its mass-market dominance generating
> huge amounts of cash to improve it ... and it got there through history,
> not technical prowess.

Sure. It's to a large degree "more money and resources", no question about
that.

But what is "better design"? Would it have been possible to put as much
effort as Intel (and others) put into the x86 architecture into something
else, and make it even better?

MY standpoint is that the above question is _meaningless_ and stupid.
People did try. Very hard. Claiming anything else is clearly misguided.
But compatibility and price matter as much - and often more - than
raw performance. Which means that even _if_ another architecture performed
better (and it certainly happened, in the heyday of the alpha), it
wouldn't much matter. People still stayed away from it in droves.

And in the end, that's why I don't like IA-64. I'll take back every single
bad thing I've ever said about IA-64 if Intel were just to sell those
things to the mass market instead of P4s. But clearly the IA-64 can't
make it in that market, and thus it is made irrelevant. The same way alpha
was made irrelevant, _despite_ having had much better performance - an
advantage that ia-64 clearly doesn't have.

(Admittedly, alpha didn't have hugely better performance for very long.
Intel came out with the PPro, and took a _lot_ of people by surprise).

AMD's x86-64 approach is a lot more interesting not so much because of any
technical issues, but because AMD _can_ try to avoid the "irrelevant"
part. By having a part that _can_ potentially compete in the market
against a P4, AMD has something that is worth hoping for. Something that
can make a difference.

IBM with Power5 and Apple could be the same thing (yeah yeah, I personally
suspect it goes enough against IBM's normal approach that it will cause
some friction). A CPU that actually competes in a market that is relevant.

Because server CPUs simply aren't very interesting from a technical
standpoint. I don't know of a _single_ CPU that ever grew down. But we've
seen a _lot_ of CPUs grow _up_. In other words: the small machines tend
to eat into the large ones, not the other way around.

And if you start from the large ones, you aren't going to make it in the
long run.

Put yet another way: if I were on Intel's IA-32 team, I'd be a lot more
worried about those XScale people finally getting their act together than
I would be about IA-64.

Linus

* * *

[Index][1] [Home][2] [About][3] [Blog][4]

[1]: http://yarchive.net/index.html
[2]: http://yarchive.net/home.html
[3]: http://yarchive.net/about.html
[4]: http://yarchive.net/blog
[5]: http://mid.gmane.org/b3b6oa%24bsj%241%40penguin.transmeta.com
[6]: http://groups.google.com/groups/search?as_umsgid=fa.k71001p.1m862d%40ifi.uio.no
[7]: http://mid.gmane.org/Pine.LNX.4.44.0302231326370.1534-100000%40home.transmeta.com
[8]: http://groups.google.com/groups/search?as_umsgid=fa.m6ucdqo.140m9go%40ifi.uio.no
[9]: http://mid.gmane.org/Pine.LNX.4.44.0302231634150.1690-100000%40home.transmeta.com
[10]: http://groups.google.com/groups/search?as_umsgid=fa.m5ugfii.150ub8u%40ifi.uio.no
[11]: http://mid.gmane.org/Pine.LNX.4.44.0302231840220.1690-100000%40home.transmeta.com
[12]: http://groups.google.com/groups/search?as_umsgid=fa.m6eefqe.14gcagq%40ifi.uio.no
[13]: http://mid.gmane.org/Pine.LNX.4.44.0302231343050.1534-100000%40home.transmeta.com
[14]: http://groups.google.com/groups/search?as_umsgid=fa.m5e8eal.15gi80t%40ifi.uio.no
[15]: http://mid.gmane.org/Pine.LNX.4.44.0302231805240.1690-100000%40home.transmeta.com
[16]: http://groups.google.com/groups/search?as_umsgid=fa.m6eieqj.14g0bgv%40ifi.uio.no
[17]: http://mid.gmane.org/Pine.LNX.4.44.0302232041130.4453-100000%40home.transmeta.com
[18]: http://groups.google.com/groups/search?as_umsgid=fa.m7tseqi.160q9go%40ifi.uio.no