hn-classics/_stories/2001/8254254.md

created_at: 2014-09-01T19:07:23.000Z
title: The Pentium 4 and the G4e: An Architectural Comparison (2001)
url: http://arstechnica.com/features/2001/05/p4andg4e/
author: CoolGuySteve
points: 47
num_comments: 25
story_id: 8254254
year: 2001

Source: The Pentium 4 and the G4e: an Architectural Comparison: Part I | Ars Technica


The Pentium 4 and the G4e: an Architectural Comparison: Part I

Jon Stokes - May 12, 2001 1:45 am UTC

Introduction

When the Pentium 4 hit the market in November of 2000, it was the first major new x86 microarchitecture from Intel since the Pentium Pro. In the years prior to the P4's launch the P6 core dominated the market in its incarnations as the Pentium II and Pentium III, and anyone who was paying attention during that time learned at least one major lesson: clock speed sells. Intel was definitely paying attention, and as the Willamette team labored away in Hillsboro, Oregon, they kept MHz foremost in their minds. This singular focus is evident in everything from Intel's Pentium 4 promotional and technical literature down to the very last detail of the processor's design. As this article will show, the successor to the most successful x86 microarchitecture of all time is a machine built from the ground up for stratospheric clock speed.

This article will examine the tradeoffs and design decisions that the P4's architects made in their effort to build a MHz monster, paying special attention to the innovative features that the P4 sports and the ways those features fit with the processor's overall design philosophy and target application domain. We'll cover the P4's ultra-deep pipeline, its trace cache, its double-pumped ALUs, and a host of other aspects of its design, all with an eye to their impact on performance.

One thing I've found in writing about technology is that it's never enough to just explain how something new works. Most of us need a point of reference from which to evaluate design decisions. When covering a new product, I always try to compare it to either a previous version of the same product or (even better) to a competitor's product. Such a comparison provides a context for understanding what's "new" about a new technology and why this technology matters. To this end, I'll be using Motorola's new MPC7450 (a.k.a. the G4e or G4+) as a basis from which to talk about the P4. Note that this article is not a performance comparison; performance comparisons are best done in the lab by testing and benchmarking with real-world applications. I will talk about performance quite a bit, but not in a manner that pits the two processors directly against each other. In the end, it's best to think of this article as an article about the P4 that uses the G4e as a point of reference. I'll be using the G4e as sort of a baseline processor that will give you a feel for how things are "normally" done. Then I'll talk about how and why the P4 does things differently.

Before we talk about the two processors in detail, it might help to review a few basics of processor design. If you've read my previous work, especially my G4 vs. K7 article, then you're probably familiar with most of what I'll cover in the following short review section. More advanced readers will want to skip to the next section. Still, if you haven't thought about microarchitecture in a while, you might want to give the section below a quick read.

Basic instruction flow

One useful division that computer architects use when talking about CPUs is that of "front end" vs. "back end" or "execution engine." When instructions are fetched from the cache or main memory, they must be decoded and dispatched for execution. This fetching, decoding and dispatching takes place in the processor's front end.

Figure 1.1: Basic CPU Instruction Flow

Instructions make their way from the cache to the front end and down through the execution engine, which is where the actual work of number crunching gets done. Once they leave the execution engine, their results are written back to main memory. This process, in which you FETCH the instruction from the cache, DECODE it into a form that the internals of the processor can understand, EXECUTE it, and WRITE its results back to memory, makes up the basic, four-stage pipeline that all aspiring CPU architects learn in their undergraduate architecture courses. Each pipeline stage must take exactly one clock cycle to complete its business and send the instruction on to the next stage, so the more quickly all of the stages can do their thing, the shorter the processor's clock cycle time (and the higher its clock speed or frequency) can be.

(For a thorough explanation of all things pipelining--what it is, its relation to the CPU's clock speed, etc.--see my first K7 article. From here on out, I'll just assume that you understand the basic concepts of pipelined execution. If you don't, you should read up on it before proceeding further.)
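The relationship between stage latency and clock speed can be made concrete with a small sketch. The stage latencies below are invented purely for illustration; they don't correspond to the P4, the G4e, or any real chip.

```python
# Minimal sketch of the classic four-stage pipeline described above.
# Stage latencies (in nanoseconds) are hypothetical, for illustration only.
stage_latency_ns = {
    "FETCH": 0.9,
    "DECODE": 1.1,
    "EXECUTE": 1.0,
    "WRITE": 0.8,
}

# Every stage must finish within one clock cycle, so the cycle time is
# set by the slowest stage, and the clock frequency is its reciprocal.
cycle_time_ns = max(stage_latency_ns.values())
frequency_ghz = 1.0 / cycle_time_ns
print(f"cycle time: {cycle_time_ns} ns -> {frequency_ghz:.2f} GHz")

# Splitting the slow DECODE stage into two faster sub-stages shortens
# the cycle time and raises the achievable clock, at the cost of a
# deeper pipeline -- the basic trade the P4's designers embraced.
deeper = dict(stage_latency_ns, DECODE1=0.6, DECODE2=0.6)
del deeper["DECODE"]
print(f"deeper pipeline cycle time: {max(deeper.values())} ns")
```

Note that deepening the pipeline raises the clock but doesn't make any single instruction finish sooner; it only lets more instructions be in flight at once.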

Figure 1.2: Basic 4-stage pipeline

This basic pipeline represents the "normal" path that instructions take through the processor, and as I just noted it assumes that all instructions spend only one cycle being EXECUTEd. While most processors do have one-cycle instructions (the P4 even has 0.5-cycle instructions), they also have some really complicated instructions that need to spend multiple cycles in the EXECUTE stage. To accommodate these multi-cycle instructions, the different functional units have their own EXECUTE pipelines (some with one stage, some with more), so they can add stages to the processor's basic pipeline.

Figure 1.3: 4-stage pipeline with pipelined execution units
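The effect of multi-cycle EXECUTE stages on an instruction's total latency can be sketched as follows. The stage counts here are hypothetical, not taken from either processor; the point is only that a pipelined functional unit lengthens an instruction's latency without blocking the instructions behind it.

```python
# Hypothetical stage counts, for illustration only: two front-end
# stages (FETCH, DECODE), a variable number of EXECUTE stages, and
# one WRITE stage at the back.
def completion_cycle(issue_cycle, execute_stages, front_stages=2, back_stages=1):
    """Cycle in which an instruction finishes, assuming one stage per cycle
    and no stalls: front end + N EXECUTE stages + write-back."""
    return issue_cycle + front_stages + execute_stages + back_stages

# A 1-cycle add issued at cycle 0 and a 4-cycle multiply issued at
# cycle 1: the multiply's extra EXECUTE stages lengthen its own
# latency, but because the unit is pipelined it doesn't stall issue.
print(completion_cycle(0, 1))  # add finishes at cycle 4
print(completion_cycle(1, 4))  # multiply finishes at cycle 8
```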

The take-home message here is that when we talk about how many pipeline stages a processor has we use an ideal number that pretends that each instruction spends only one cycle in the EXECUTE stage, but most instructions pass through multiple EXECUTE stages in the various functional units.
