hn-classics/_stories/2001/8254254.md

created_at: 2014-09-01T19:07:23.000Z
title: The Pentium 4 and the G4e: An Architectural Comparison (2001)
url: http://arstechnica.com/features/2001/05/p4andg4e/
author: CoolGuySteve
points: 47
num_comments: 25
story_id: 8254254
year: 2001

Source: The Pentium 4 and the G4e: an Architectural Comparison: Part I | Ars Technica


The Pentium 4 and the G4e: an Architectural Comparison: Part I

Jon Stokes - May 12, 2001 1:45 am UTC

Introduction

When the Pentium 4 hit the market in November of 2000, it was the first major new x86 microarchitecture from Intel since the Pentium Pro. In the years prior to the P4's launch the P6 core dominated the market in its incarnations as the Pentium II and Pentium III, and anyone who was paying attention during that time learned at least one major lesson: clock speed sells. Intel was definitely paying attention, and as the Willamette team labored away in Hillsboro, Oregon, they kept MHz foremost in their minds. This singular focus is evident in everything from Intel's Pentium 4 promotional and technical literature down to the very last detail of the processor's design. As this article will show, the successor to the most successful x86 microarchitecture of all time is a machine built from the ground up for stratospheric clock speed.

This article will examine the tradeoffs and design decisions that the P4's architects made in their effort to build a MHz monster, paying special attention to the innovative features that the P4 sports and the ways those features fit with the processor's overall design philosophy and target application domain. We'll cover the P4's ultra-deep pipeline, its trace cache, its double-pumped ALUs, and a host of other aspects of its design, all with an eye to their impact on performance.

One thing I've found in writing about technology is that it's never enough to just explain how something new works. Most of us need a point of reference from which to evaluate design decisions. When covering a new product, I always try to compare it to either a previous version of the same product or (even better) to a competitor's product. Such a comparison provides a context for understanding what's "new" about a new technology and why this technology matters. To this end, I'll be using Motorola's new MPC7450 (a.k.a. the G4e or G4+) as a basis from which to talk about the P4. Note that this article is not a performance comparison; performance comparisons are best done in the lab by testing and benchmarking with real-world applications. I will talk about performance quite a bit, but not in a manner that pits the two processors directly against each other. In the end, it's best to think of this article as an article about the P4 that uses the G4e as a point of reference. I'll be using the G4e as sort of a baseline processor that will give you a feel for how things are "normally" done. Then I'll talk about how and why the P4 does things differently.

Before we talk about the two processors in detail, it might help to review a few basics of processor design. If you've read my previous work, especially my G4 vs. K7 article, then you're probably familiar with most of what I'll cover in the following short review section. More advanced readers will want to skip to the next section. Still, if you haven't thought about microarchitecture in a while, you might want to give the section below a quick read.

Basic instruction flow

One useful division that computer architects use when talking about CPUs is that of "front end" vs. "back end" or "execution engine." When instructions are fetched from the cache or main memory, they must be decoded and dispatched for execution. This fetching, decoding and dispatching takes place in the processor's front end.

Figure 1.1: Basic CPU Instruction Flow

Instructions make their way from the cache to the front end and down through the execution engine, which is where the actual work of number crunching gets done. Once they leave the execution engine, their results are written back to main memory. This process, in which you FETCH the instruction from the cache, DECODE it into a form that the internals of the processor can understand, EXECUTE it, and WRITE its results back to memory, makes up the basic, four-stage pipeline that all aspiring CPU architects learn in their undergraduate architecture courses. Each pipeline stage must take exactly one clock cycle to complete its business and send the instruction on to the next stage, so the more quickly all of the stages can do their thing, the shorter the processor's clock cycle time (and the higher its clock speed or frequency) can be.

(For a thorough explanation of all things pipelining--what it is, its relation to the CPU's clock speed, etc.--see my first K7 article. From here on out, I'll just assume that you understand the basic concepts of pipelined execution. If you don't, you should read up on it before proceeding further.)
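The relationship between stage latency and clock speed can be made concrete with a small sketch. The stage latencies below are invented purely for illustration; they don't correspond to the P4, the G4e, or any real chip.

```python
# Minimal sketch of the classic four-stage pipeline described above.
# Stage latencies (in nanoseconds) are hypothetical, for illustration only.
stage_latency_ns = {
    "FETCH": 0.9,
    "DECODE": 1.1,
    "EXECUTE": 1.0,
    "WRITE": 0.8,
}

# Every stage must finish within one clock cycle, so the cycle time is
# set by the slowest stage, and the clock frequency is its reciprocal.
cycle_time_ns = max(stage_latency_ns.values())
frequency_ghz = 1.0 / cycle_time_ns
print(f"cycle time: {cycle_time_ns} ns -> {frequency_ghz:.2f} GHz")

# Splitting the slow DECODE stage into two faster sub-stages shortens
# the cycle time and raises the achievable clock, at the cost of a
# deeper pipeline -- the basic trade the P4's designers embraced.
deeper = dict(stage_latency_ns, DECODE1=0.6, DECODE2=0.6)
del deeper["DECODE"]
print(f"deeper pipeline cycle time: {max(deeper.values())} ns")
```

Note that deepening the pipeline raises the clock but doesn't make any single instruction finish sooner; it only lets more instructions be in flight at once.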

Figure 1.2: Basic 4-stage pipeline

This basic pipeline represents the "normal" path that instructions take through the processor, and as I just noted it assumes that all instructions spend only one cycle being EXECUTEd. While most processors do have one-cycle instructions (the P4 even has 0.5-cycle instructions), they also have some really complicated instructions that need to spend multiple cycles in the EXECUTE stage. To accommodate these multi-cycle instructions, the different functional units have their own EXECUTE pipelines (some with one stage, some with more), so they can add stages to the processor's basic pipeline.

Figure 1.3: 4-stage pipeline with pipelined execution units
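The effect of multi-cycle EXECUTE stages on an instruction's total latency can be sketched as follows. The stage counts here are hypothetical, not taken from either processor; the point is only that a pipelined functional unit lengthens an instruction's latency without blocking the instructions behind it.

```python
# Hypothetical stage counts, for illustration only: two front-end
# stages (FETCH, DECODE), a variable number of EXECUTE stages, and
# one WRITE stage at the back.
def completion_cycle(issue_cycle, execute_stages, front_stages=2, back_stages=1):
    """Cycle in which an instruction finishes, assuming one stage per cycle
    and no stalls: front end + N EXECUTE stages + write-back."""
    return issue_cycle + front_stages + execute_stages + back_stages

# A 1-cycle add issued at cycle 0 and a 4-cycle multiply issued at
# cycle 1: the multiply's extra EXECUTE stages lengthen its own
# latency, but because the unit is pipelined it doesn't stall issue.
print(completion_cycle(0, 1))  # add finishes at cycle 4
print(completion_cycle(1, 4))  # multiply finishes at cycle 8
```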

The take-home message here is that when we talk about how many pipeline stages a processor has we use an ideal number that pretends that each instruction spends only one cycle in the EXECUTE stage, but most instructions pass through multiple EXECUTE stages in the various functional units.
