---
created_at: '2014-10-28T21:00:37.000Z'
title: X86 versus other architectures (Linus Torvalds) (2003)
url: http://yarchive.net/comp/linux/x86.html
author: tambourine_man
points: 63
story_text: ''
comment_text:
num_comments: 62
story_id:
story_title:
story_url:
parent_id:
created_at_i: 1414530037
_tags:
- story
- author_tambourine_man
- story_8523550
objectID: '8523550'
---

[Source](http://yarchive.net/comp/linux/x86.html "Permalink to x86 versus other architectures (Linus Torvalds)")

# x86 versus other architectures (Linus Torvalds)

[Index][1] [Home][2] [About][3] [Blog][4]

* * *

Newsgroups: fa.linux.kernel
From: torvalds@transmeta.com (Linus Torvalds)
Subject: Re: Minutes from Feb 21 LSE Call
Original-Message-ID: <[b3b6oa$bsj$1@penguin.transmeta.com][5]>
Date: Sun, 23 Feb 2003 19:23:46 GMT
Message-ID: <[fa.k71001p.1m862d@ifi.uio.no][6]>

In article <20030223082036.GI10411@holomorphy.com>,
William Lee Irwin III <wli@holomorphy.com> wrote:
>On Sun, Feb 23, 2003 at 12:07:50AM -0800, David Lang wrote:
>> Garrit, you missed the prior poster's point. IA64 has the same fundamental
>> problem as the Alpha, PPC, and Sparc processors: it doesn't run x86
>> binaries.
>
>If I didn't know this mattered I wouldn't bother with the barfbags.
>I just wouldn't deal with it.

Why?

The x86 is a hell of a lot nicer than the ppc32, for example. On the
x86, you get good performance and you can ignore the design mistakes (ie
segmentation) by just basically turning them off.

On the ppc32, the MMU braindamage is not something you can ignore: you
have to write your OS for it, and if you turn it off (ie enable soft-fill
on the ones that support it) you now have to have separate paths in the
OS for it.

And the baroque instruction encoding on the x86 is actually a _good_
thing: it's a rather dense encoding, which means that you win on icache.
It's a bit hard to decode, but who cares? Existing chips do well at
decoding, and thanks to the icache win they tend to perform better - and
they load faster too (which is important - you can make your CPU have
big caches, but _nothing_ saves you from the cold-cache costs).

The low register count isn't an issue when you code in any high-level
language, and it has actually forced x86 implementors to do a hell of a
lot better job than the competition when it comes to memory loads and
stores - which helps in general. While the RISC people were off trying
to optimize their compilers to generate loops that used all 32 registers
efficiently, the x86 implementors instead made the chip run fast on
varied loads and used tons of register renaming hardware (and looked at
_memory_ renaming too).

IA64 made all the mistakes everybody else did, and threw out all the good
parts of the x86 because people thought those parts were ugly. They
aren't ugly, they're the "charming oddity" that makes it do well. Look
at them the right way and you realize that a lot of the grottiness is
exactly _why_ the x86 works so well (yeah, and the fact that they are
everywhere ;).

The only real major failure of the x86 is the PAE crud. Let's hope
we'll get to forget it, the same way the DOS people eventually forgot
about their memory extenders.

(Yeah, and maybe IBM will make their ppc64 chips cheap enough that they
will matter, and people can overlook the grottiness there. Right now
Intel doesn't even seem to be interested in "64-bit for the masses", and
maybe IBM will be. AMD certainly seems to be serious about the "masses"
part, which in the end is the only part that really matters).

Linus

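Linus's code-density argument lends itself to a back-of-the-envelope model. The numbers below (average instruction sizes, cache size) are illustrative assumptions, not measurements:

```python
# Rough model of how instruction-encoding density translates into
# effective icache capacity. All sizes are illustrative assumptions.

ICACHE_BYTES = 32 * 1024          # a 32 KiB instruction cache

AVG_INSN_BYTES = {
    "x86 (variable-length)": 3.0,  # assumed average; real code varies
    "RISC (fixed 32-bit)": 4.0,
}

def insns_cached(avg_bytes, cache_bytes=ICACHE_BYTES):
    """How many instructions fit in the icache at a given density."""
    return int(cache_bytes / avg_bytes)

for name, size in AVG_INSN_BYTES.items():
    print(f"{name}: ~{insns_cached(size)} instructions cached")

# Denser encoding keeps more of the working set in cache -- and, as
# Linus notes, smaller binaries also load faster from cold (disk).
density_win = insns_cached(3.0) / insns_cached(4.0)
print(f"density win: ~{density_win:.2f}x")
```

Under these assumed sizes the dense encoding holds about a third more of the instruction working set in the same cache, which is the "win on icache" being claimed.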
* * *

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Minutes from Feb 21 LSE Call
Original-Message-ID: <[Pine.LNX.4.44.0302231326370.1534-100000@home.transmeta.com][7]>
Date: Sun, 23 Feb 2003 21:39:07 GMT
Message-ID: <[fa.m6ucdqo.140m9go@ifi.uio.no][8]>

On Sun, 23 Feb 2003, David Mosberger wrote:
>
> But does x86 really work so well? Itanium 2 on 0.13um performs a lot
> better than P4 on 0.13um.

On WHAT benchmark?

Itanium 2 doesn't hold a candle to a P4 on any real-world benchmarks.

As far as I know, the _only_ things Itanium 2 does better on are (a) FP
kernels, partly due to a huge cache, and (b) big databases, entirely
because the P4 is crippled when there is lots of memory, because Intel
refuses to do a 64-bit version (because they know it would totally kill
ia-64).

Last I saw, P4 was kicking ia-64 butt on specint and friends.

That's also ignoring the fact that ia-64 simply CANNOT DO the things a P4
does every single day. You can't put an ia-64 in a reasonable desktop
machine, partly because of pricing, but partly because it would just suck
so horribly at things people expect not to suck (games spring to mind).

And I further bet that using a native distribution (ie totally ignoring
the power and price and bad x86 performance issues), ia-64 will work a lot
worse for people simply because the binaries are bigger. That was quite
painful on alpha, and ia-64 is even worse - to offset the bigger binaries,
you need a faster disk subsystem etc just to not feel slower than a
bog-standard PC.

Code size matters. Price matters. Real world matters. And ia-64 at least
so far falls flat on its face on ALL of these.

> As far as I can guess, the only reason P4
> comes out on 0.13um (and 0.09um) before anything else is due to the
> latter part you mention: it's where the volume is today.

It's where all the money is ("ia-64: 5 billion dollars in the red and
still sinking") so of _course_ it's where the efforts get put.

Linus

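The "bigger binaries feel slower" claim is easy to put numbers on. A minimal sketch; the binary sizes and disk throughput below are invented for illustration (not actual alpha or ia-64 figures):

```python
# Cold-start cost scales with binary size: everything must come off
# the disk before the icache even enters the picture. All numbers are
# invented purely for illustration.

DISK_MB_PER_S = 40.0   # assumed sustained read throughput of a 2003 disk

def cold_load_seconds(binary_mb, disk_mb_s=DISK_MB_PER_S):
    """Time just to page a binary in from disk on a cold start."""
    return binary_mb / disk_mb_s

x86_mb, ia64_mb = 10.0, 14.0   # assumed sizes for the same program
penalty = cold_load_seconds(ia64_mb) / cold_load_seconds(x86_mb)
print(f"ia-64 cold start ~{penalty:.1f}x slower for the same workload")
```

The point being: the penalty is linear in code size, so "to offset the bigger binaries, you need a faster disk subsystem" just to break even.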
* * *

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Minutes from Feb 21 LSE Call
Original-Message-ID: <[Pine.LNX.4.44.0302231634150.1690-100000@home.transmeta.com][9]>
Date: Mon, 24 Feb 2003 00:45:46 GMT
Message-ID: <[fa.m5ugfii.150ub8u@ifi.uio.no][10]>

On Sun, 23 Feb 2003, David Mosberger wrote:
>
> 2 GHz Xeon: 701 SPECint
> 1 GHz Itanium 2: 810 SPECint
>
> That is, Itanium 2 is 15% faster.

Ehh, and this is with how much cache?

Last I saw, the Itanium 2 machines came with 3MB of integrated L3 cache,
and I suspect that whatever 0.13 Itanium numbers you're looking at are
with the new 6MB caches.

So your "apples to apples" comparison isn't exactly that.

The only thing that is meaningful is "performance at the same time of
general availability". At which point the P4 beats the Itanium 2 senseless
with a 25% higher SpecInt. And last I heard, by the time Itanium 2 is up
at 2GHz, the P4 is apparently going to be at 5GHz, comfortably keeping
that 25% lead.

Linus

* * *

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Minutes from Feb 21 LSE Call
Original-Message-ID: <[Pine.LNX.4.44.0302231840220.1690-100000@home.transmeta.com][11]>
Date: Mon, 24 Feb 2003 02:59:50 GMT
Message-ID: <[fa.m6eefqe.14gcagq@ifi.uio.no][12]>

On Sun, 23 Feb 2003, David Mosberger wrote:
> >> 2 GHz Xeon: 701 SPECint
> >> 1 GHz Itanium 2: 810 SPECint
>
> >> That is, Itanium 2 is 15% faster.
>
> Unfortunately, HP doesn't sell 1.5MB/1GHz Itanium 2 workstations, but
> we can do some educated guessing:
>
> 1GHz Itanium 2, 3MB cache: 810 SPECint
> 900MHz Itanium 2, 1.5MB cache: 674 SPECint
>
> Assuming pure frequency scaling, a 1GHz/1.5MB Itanium 2 would get
> around 750 SPECint. In reality, it would get slightly less, but most
> likely substantially more than 701.

And as Dean pointed out:

2GHz Xeon MP with 2MB L3 cache: 842 SPECint

In other words, the P4 eats the Itanium for breakfast even if you limit it
to 2GHz due to some "process" rule.

And if you don't make up any silly rules, but simply look at "what's
available today", you get

2.8GHz Xeon MP with 2MB L3 cache: 907 SPECint

or even better (much cheaper CPUs):

3.06 GHz P4 with 512kB L2 cache: 1074 SPECint
AMD Athlon XP 2800+: 933 SPECint

These are systems that you can buy today. With _less_ cache, and clearly
much higher performance: comparing the best-performing published ia-64
with the best P4 on specint, the P4 is 32% faster. Even with the "you can
only run the P4 at 2GHz because that is all it ever ran at in 0.18" thing,
the ia-64 falls behind.

> Linus> The only thing that is meaningful is "performance at the same
> Linus> time of general availability".
>
> You claimed that x86 is inherently superior. I provided data that
> shows that much of this apparent superiority is simply an effect of
> the larger volume that x86 achieves today.

And I showed that your data is flawed. Clearly the P4 outperforms ia-64
on an architectural level _even_ when taking process into account.

Linus

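The SPECint comparisons traded back and forth in this thread can be recomputed directly from the quoted scores (the "pure frequency scaling" step is David Mosberger's assumption, not a measurement):

```python
# Recomputing the thread's comparisons from the SPECint scores it quotes.

scores = {
    "2 GHz Xeon":                  701,
    "1 GHz Itanium 2 (3MB L3)":    810,
    "900 MHz Itanium 2 (1.5MB)":   674,
    "2 GHz Xeon MP (2MB L3)":      842,
    "2.8 GHz Xeon MP (2MB L3)":    907,
    "3.06 GHz P4 (512kB L2)":     1074,
    "Athlon XP 2800+":             933,
}

def pct_faster(a, b):
    """How much faster score a is than score b, in percent."""
    return (a / b - 1) * 100

# "Itanium 2 is 15% faster" (than the 2 GHz Xeon):
print(f"{pct_faster(810, 701):.1f}%")   # 15.5%, rounded down in the thread

# Mosberger's scaling guess: the 900 MHz/1.5MB part scaled to 1 GHz
scaled = 674 * (1000 / 900)
print(f"~{scaled:.0f} SPECint")         # ~749, matching his "around 750"

# "the P4 is 32% faster" (best P4 vs best published ia-64):
print(f"{pct_faster(1074, 810):.1f}%")  # 32.6%
```

So the arithmetic on both sides checks out; the disagreement is entirely about which comparison (same process vs same availability date) is the fair one.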
* * *

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Minutes from Feb 21 LSE Call
Original-Message-ID: <[Pine.LNX.4.44.0302231343050.1534-100000@home.transmeta.com][13]>
Date: Sun, 23 Feb 2003 21:49:50 GMT
Message-ID: <[fa.m5e8eal.15gi80t@ifi.uio.no][14]>

On Sun, 23 Feb 2003, John Bradford wrote:
>
> I could be wrong, but I always thought that Sparc, and a lot of other
> architectures could mark arbitrary areas of memory, (such as the
> stack), as non-executable, whereas x86 only lets you have one
> non-executable segment.

The x86 has that stupid "executability is tied to a segment" thing, which
means that you cannot make things executable on a page-per-page level.
It's a mistake, but it's one that _could_ be fixed in the architecture if
it really mattered, the same way the WP bit got fixed in the i486.

I'm definitely not saying that the x86 is perfect. It clearly isn't. But a
lot of people complain about the wrong things, and a lot of people who
tried to "fix" things just made them worse by throwing out the good parts
too.

Linus

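The contrast John Bradford raises can be sketched as two toy permission models: one execute limit covering the whole code segment, versus a per-page execute bit. A deliberately simplified simulation, not real MMU code; all addresses and limits are arbitrary illustrative values:

```python
# Toy models of the two executability schemes discussed above.

def segment_executable(addr, cs_limit):
    """Classic x86 scheme: everything below the code-segment limit is
    executable; you cannot punch a non-executable hole into it."""
    return addr < cs_limit

def page_executable(addr, exec_bits, page_size=4096):
    """Per-page scheme (Sparc-style, later x86 NX): each page carries
    its own execute permission."""
    return exec_bits.get(addr // page_size, False)

CS_LIMIT = 0x40000000                 # everything above is non-executable

# Per-page table: a code page is executable, a stack page is not.
exec_bits = {0x08048: True,           # a code page
             0xbffff: False}          # a stack page

# The segment scheme cannot make an arbitrary region non-executable:
assert segment_executable(0x08048000, CS_LIMIT)   # code: runs
assert segment_executable(0x30000000, CS_LIMIT)   # data below limit: also runs!
# The per-page scheme can:
assert page_executable(0x08048000, exec_bits)     # code page: runs
assert not page_executable(0xbffff000, exec_bits) # stack page: blocked
print("per-page execute control is strictly finer-grained")
```

This is exactly why Linus calls it fixable: adding one more per-page bit (as the WP bit was added in the i486, and as NX later was) subsumes the segment limit.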
* * *

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Minutes from Feb 21 LSE Call
Original-Message-ID: <[Pine.LNX.4.44.0302231805240.1690-100000@home.transmeta.com][15]>
Date: Mon, 24 Feb 2003 02:43:43 GMT
Message-ID: <[fa.m6eieqj.14g0bgv@ifi.uio.no][16]>

On 24 Feb 2003 linux@horizon.com wrote:
>
> Now wait a minute. I thought you worked at Transmeta.
>
> There were no development and debugging costs associated with getting
> all those different kinds of gates working, and all the segmentation
> checking right?

So? The only thing that matters is the end result.

> Wouldn't it have been easier to build the system, and shift the effort
> where it would really do some good, if you didn't have to support
> all that crap?

Probably not appreciably. You forget - it's been tried. Over and over
again. The whole RISC philosophy was all about "wouldn't it perform better
if you didn't have to support that crap".

The fact is, the "crap" doesn't matter that much. As proven by the fact
that the "crap" processor family ends up being the one that eats pretty
much everybody else for lunch on performance issues.

Yes, the "crap" does end up making it a harder market to enter. There's a
lot of IP involved in knowing what all the rules are, and having literally
_millions_ of tests that check for conformance to the architecture (and
much of the "architecture" is a de-facto thing, not really written down in
architecture manuals).

But clearly even that is not insurmountable, as shown by the fact that not
only does the x86 perform well, it's also one of the few CPUs that are
actively worked on by multiple different companies (including Transmeta,
as you point out - although clearly the "crap" is one reason why the sw
approach works at all).

> Transmeta's software decoding is an extreme example of what all modern
> x86 processors are doing in their L1 caches, namely predecoding the
> instructions and storing them in expanded form. This varies from
> just adding boundary tags (Pentium) and instruction type (K7) through
> converting them to uops and caching those (P4).

But you seem to imply that that is somehow a counter-argument to _my_
argument. And I don't agree.

I think what Transmeta (and AMD, and VIA etc) show is that the ugliness
doesn't really matter - there are different ways of handling it, and you
can either throw hardware at it or software at it, but it's still worth
doing, because in the end what matters is not the bad parts of it, but the
good parts.

Btw, the P4 trace cache does pretty much exactly the same thing that
Transmeta does, except in hardware. It's based on a very simple reality:
decoding _is_ going to be the bottleneck for _any_ instruction set, once
you've pushed the rest hard enough. If you're not doing predecoding, that
only means that you haven't pushed hard enough yet - _regardless_ of your
architecture.

> This exactly undoes any L1 cache size benefits. The win, of course, is
> that you don't have as much shifting and aligning on your i-fetch path,
> which all the fixed-instruction-size architectures already started with.

No. You don't understand what the "cold-cache" case really means. It's
more than just bringing the thing in from memory to the cache. It's also
all about loading the dang thing from disk.

> So your comments only apply to the L2 cache.

And the disk.

> And for the expense of all the instruction predecoding logic between
> L2 and L1, don't you think someone could build an instruction compressor
> to fit more into the die-size-limited L2 cache?

It's been done. See the PPC stuff. I've read the papers (it's been a long
time, admittedly - it's not something new), and the fact is, it's not
apparently being used that much. Because it's quite painful, unlike the
x86 approach.

> > stores - which helps in general. While the RISC people were off trying
> > to optimize their compilers to generate loops that used all 32 registers
> > efficiently, the x86 implementors instead made the chip run fast on
> > varied loads and used tons of register renaming hardware (and looking at
> > _memory_ renaming too).
>
> I don't disagree that chip designers have managed to do very well with
> the x86, and there's nothing wrong with making a virtue out of a necessity,
> but that doesn't make the necessity good.

Actually, you miss my point.

The necessity is good because it _forced_ people to look at what really
matters. Instead of wasting 15 years and countless PhDs on things that
are, in the end, just engineering masturbation (nr of registers etc).

> The low register count *does* affect you when using a high-level language,
> because if you have too many live variables floating around, you start
> suffering. Handling these spills is why you need memory renaming.

Bzzt. Wrong answer.

The right answer is that you need memory renaming and memory alias
hardware _anyway_, because doing dynamic scheduling of loads vs stores is
something that is _required_ to get the kind of performance that people
expect today. And all the RISC stuff that tried to avoid it was just a BIG
WASTE OF TIME. Because the _only_ thing the RISC approach ended up showing
was that eventually you have to do the hard stuff anyway, so you might as
well design for doing it in the first place.

Which is what ia-64 did wrong - and what I mean by making the same mistakes
that everybody else did 15 years ago. Look at all the crap that ia64 does
in order to do compiler-driven loop modulo-optimizations. That's part of
the whole design, with predication and those horrible register windows.
Can you say "risc mistakes all over again"?

My strong suspicion (and that makes it a "fact" ;) is that in another 5
years they'll get to where the x86 has been for the last 10 years, and
they'll realize that they will need to do out-of-order accesses etc, which
makes all of that modulo optimization pretty much useless, since the
hardware pretty much has to do it _anyway_.

> It's true that x86 processors have had fancy architectural features
> sooner than similar-performance RISCs, but I think there's a fair case
> that that's because they've *needed* them.

Which is exactly my point. And by the time you implement them, you notice
that the half-way measures don't mean anything, and in fact make for more
problems.

For example, that small register state is a pain in the ass, no? But since
you basically need register renaming _anyway_, the small register state
actually has some advantages in that it makes it easier to have tons of
read ports and still keep the register file fast. And once you do renaming
(including memory state renaming), IT DOESN'T MUCH MATTER.

> Why do the P4 and K7/K8 have
> such enormous reorder buffers, able to keep around 100 instructions
> in flight at a time? Because they need it to extract parallelism out
> of an instruction stream serialized by a miserly register file.

You think this is bad?

Look at it another way: once you have hundreds of instructions in flight,
you have hardware that automatically

- executes legacy applications reasonably well, since compilers aren't
  the most important thing.

  End result: users are happy.

- doesn't need compilers that do stupid things like unrolling loops,
  thus keeping your icache pressure down, since you do loop unrolling
  in hardware thanks to deep pipelines.

Even the RISC people are doing hundreds of instructions in flight (ie
Power5), but they started doing it years after the x86 did, because they
claimed that they could force their users to recompile their binaries
every few years. And look where it actually got them..

> They've developed some great technology to compensate for the weaknesses,
> but it's sure nice to dream of an architecture with all that great
> technology but with fewer initial warts. (Alpha seemed like the
> best hope, but *sigh*. Still, however you apportion blame for its
> demise, performance was clearly not one of its problems.)

So my premise is that you always end up doing the hard things anyway, and
the "crap" _really_ doesn't matter.

Alpha was nice, no question about it. But it took them way too long to get
to the whole OoO thing, because they tried to take a short-cut that in the
end wasn't the answer. It _looked_ like the answer (the original alpha
design was done explicitly to not _need_ things like complex out-of-order
execution), but it was all just wrong.

The thing about the x86 is that hard cold reality (ie millions of
customers that have existing applications) really _forces_ you to look at
what matters, and so far it clearly appears that the things you are
complaining about (registers and segmentation) simply do _not_ matter.

> I think the same claim applies much more powerfully to the ppc32's MMU.
> It may be stupid, but it is only visible from inside the kernel, and
> a fairly small piece of the kernel at that.
>
> It could be scrapped and replaced with something better without any
> effect on existing user-level code at all.
>
> Do you think you can replace the x86's register problems as easily?

They _have_ been solved. The x86 performs about twice as well as any ppc32
on the market. End of discussion.

> > The only real major failure of the x86 is the PAE crud.
>
> So you think AMD extended the register file just for fun?

I think the AMD register file extension was unnecessary, yes. They did it
because they could, and it wasn't a big deal. That's not the part that
makes the architecture interesting. As you should well know.

> Hell, the "PAE crud" is the *same* problem as the tiny register
> file. Insufficient virtual address space leading to physical > virtual
> kludges.

Nope. The small register file is a non-issue. Trust me. I do work for
Transmeta, and we do the register renaming in software, and it doesn't
matter in the end.

Linus

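Linus's point that renaming makes the small architectural register file a non-issue can be illustrated with a toy rename table: a handful of architectural names map onto a much larger physical file, so reusing a name creates no false (write-after-write) dependency. A minimal sketch under those assumptions:

```python
# Minimal register-renaming sketch: each architectural write gets a
# fresh physical register, so reusing the same architectural name does
# not serialize independent computations. Purely illustrative.

class RenameTable:
    def __init__(self, num_physical=128):
        self.free = list(range(num_physical))   # free physical registers
        self.map = {}                           # arch name -> phys reg

    def write(self, arch_reg):
        """Allocate a fresh physical register for a new value."""
        phys = self.free.pop(0)
        self.map[arch_reg] = phys
        return phys

    def read(self, arch_reg):
        """Reads see whichever physical register holds the latest value."""
        return self.map[arch_reg]

rt = RenameTable()
p0 = rt.write("eax")        # eax = a + b
first_use = rt.read("eax")  # a consumer of the first value
p1 = rt.write("eax")        # eax = c + d  (reuses only the *name*)
assert p0 != p1             # two independent physical registers, so the
assert first_use == p0      # two computations can execute in parallel
```

With 128 physical registers behind 8 architectural names, the "miserly register file" the reorder buffer complaint is about largely disappears, which is the sense in which "IT DOESN'T MUCH MATTER".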
* * *

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Minutes from Feb 21 LSE Call
Original-Message-ID: <[Pine.LNX.4.44.0302232041130.4453-100000@home.transmeta.com][17]>
Date: Mon, 24 Feb 2003 05:02:35 GMT
Message-ID: <[fa.m7tseqi.160q9go@ifi.uio.no][18]>

On Sun, 23 Feb 2003, Martin J. Bligh wrote:
>
> > The fact is, the "crap" doesn't matter that much. As proven by the fact
> > that the "crap" processor family ends up being the one that eats pretty
> > much everybody else for lunch on performance issues.
>
> But is that because it's a better design? Or because it has more money
> thrown at it? I suspect it's merely its mass-market dominance generating
> huge amounts of cash to improve it ... and it got there through history,
> not technical prowess.

Sure. It's to a large degree "more money and resources", no question about
that.

But what is "better design"? Would it have been possible to put as much
effort as Intel (and others) put into the x86 architecture into something
else, and make it even better?

MY standpoint is that the above question is _meaningless_ and stupid.
People did try. Very hard. Claiming anything else is clearly misguided.
But compatibility and price matter as much - and often more - than
raw performance. Which means that even _if_ another architecture performed
better (and it certainly happened, in the heyday of the alpha), it
wouldn't much matter. People still stayed away from it in droves.

And in the end, that's why I don't like IA-64. I'll take back every single
bad thing I've ever said about IA-64 if Intel were just to sell those
things to the mass market instead of P4s. But clearly the IA-64 can't
make it in that market, and thus it is made irrelevant. The same way alpha
was made irrelevant, _despite_ having had much better performance - an
advantage that ia-64 clearly doesn't have.

(Admittedly, alpha didn't have hugely better performance for very long.
Intel came out with the PPro, and took a _lot_ of people by surprise).

AMD's x86-64 approach is a lot more interesting not so much because of any
technical issues, but because AMD _can_ try to avoid the "irrelevant"
part. By having a part that _can_ potentially compete in the market
against a P4, AMD has something that is worth hoping for. Something that
can make a difference.

IBM with Power5 and Apple could be the same thing (yeah yeah, I personally
suspect it goes enough against IBM's normal approach that it will cause
some friction). A CPU that actually competes in a market that is relevant.

Because server CPUs simply aren't very interesting from a technical
standpoint. I don't know of a _single_ CPU that ever grew down. But we've
seen a _lot_ of CPUs grow _up_. In other words: the small machines tend
to eat into the large ones, not the other way around.

And if you start from the large ones, you aren't going to make it in the
long run.

Put yet another way: if I were on Intel's IA-32 team, I'd be a lot more
worried about those XScale people finally getting their act together than
I would be about IA-64.

Linus

* * *

[Index][1] [Home][2] [About][3] [Blog][4]

[1]: http://yarchive.net/index.html
[2]: http://yarchive.net/home.html
[3]: http://yarchive.net/about.html
[4]: http://yarchive.net/blog
[5]: http://mid.gmane.org/b3b6oa%24bsj%241%40penguin.transmeta.com
[6]: http://groups.google.com/groups/search?as_umsgid=fa.k71001p.1m862d%40ifi.uio.no
[7]: http://mid.gmane.org/Pine.LNX.4.44.0302231326370.1534-100000%40home.transmeta.com
[8]: http://groups.google.com/groups/search?as_umsgid=fa.m6ucdqo.140m9go%40ifi.uio.no
[9]: http://mid.gmane.org/Pine.LNX.4.44.0302231634150.1690-100000%40home.transmeta.com
[10]: http://groups.google.com/groups/search?as_umsgid=fa.m5ugfii.150ub8u%40ifi.uio.no
[11]: http://mid.gmane.org/Pine.LNX.4.44.0302231840220.1690-100000%40home.transmeta.com
[12]: http://groups.google.com/groups/search?as_umsgid=fa.m6eefqe.14gcagq%40ifi.uio.no
[13]: http://mid.gmane.org/Pine.LNX.4.44.0302231343050.1534-100000%40home.transmeta.com
[14]: http://groups.google.com/groups/search?as_umsgid=fa.m5e8eal.15gi80t%40ifi.uio.no
[15]: http://mid.gmane.org/Pine.LNX.4.44.0302231805240.1690-100000%40home.transmeta.com
[16]: http://groups.google.com/groups/search?as_umsgid=fa.m6eieqj.14g0bgv%40ifi.uio.no
[17]: http://mid.gmane.org/Pine.LNX.4.44.0302232041130.4453-100000%40home.transmeta.com
[18]: http://groups.google.com/groups/search?as_umsgid=fa.m7tseqi.160q9go%40ifi.uio.no