Computer Architecture: A Quantitative Approach, 4th Edition
John L. Hennessy, David A. Patterson
The era of seemingly unlimited growth in processor performance is over: single-chip architectures can no longer overcome the performance limitations imposed by the power they consume and the heat they generate. Today, Intel and other semiconductor firms are abandoning the single fast processor model in favor of multi-core microprocessors: chips that combine two or more processors in a single package. In the fourth edition of Computer Architecture, the authors focus on this historic shift, increasing their coverage of multiprocessors and exploring the most effective ways of achieving parallelism as the key to unlocking the power of multiple-processor architectures. Additionally, the new edition has expanded and updated coverage of design topics beyond processor performance, including power, reliability, availability, and dependability.
CD System Requirements
The CD material includes PDF documents that you can read with a PDF viewer such as Adobe Acrobat or Adobe Reader. Recent versions of Adobe Reader for some platforms are included on the CD.
The content is designed to be viewed in a browser window that is at least 720 pixels wide. You may find the content does not display well if your display is not set to at least 1024x768 pixel resolution.
This CD can be used under any operating system that includes an HTML browser and a PDF viewer. This includes Windows, Mac OS, and most Linux and Unix systems.
Increased coverage on achieving parallelism with multiprocessors.
Case studies of the latest industry technology, including the Sun Niagara multiprocessor, AMD Opteron, and Pentium 4.
Three review appendices, included in the printed volume, cover the basic and intermediate principles the main text relies upon.
Eight reference appendices, collected on the CD, cover a range of topics, including specific architectures, embedded systems, and application-specific processors; some are guest-authored by subject experts.
programs, and that all proposed alternatives to time as the metric or to real programs as the items measured have eventually led to misleading claims or even mistakes in computer design. Even execution time can be defined in different ways depending on what we count. The most straightforward definition of time is called wall-clock time, response time, or elapsed time, which is the latency to complete a task, including disk accesses, memory accesses, input/output activities, operating system
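The distinction this excerpt draws can be seen in a short measurement. The sketch below is only an illustration, not something taken from the book: the helper names `measure` and `sleepy_task` are hypothetical, and the contrast with CPU time is an addition, since the excerpt shown here stops before discussing it. Wall-clock time counts everything that happens between start and finish, including time spent waiting, while CPU time counts only the time the processor spends working for the process.

```python
import time


def measure(task):
    """Return (wall_seconds, cpu_seconds) for one call to task()."""
    wall_start = time.perf_counter()  # elapsed (wall-clock) time
    cpu_start = time.process_time()   # CPU time charged to this process
    task()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def sleepy_task():
    # Sleeping stands in for waiting on I/O: it consumes elapsed time
    # but almost no CPU time.
    time.sleep(0.2)


if __name__ == "__main__":
    wall, cpu = measure(sleepy_task)
    print(f"wall-clock: {wall:.3f} s, CPU: {cpu:.3f} s")
```

On a task dominated by waiting, the wall-clock figure should be much larger than the CPU figure; on a purely compute-bound task the two should be close.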
items are likely to be accessed in the near future. Spatial locality says that items whose addresses are near one another tend to be referenced close together in time. We will see these principles applied in Chapter 5.
Focus on the Common Case
Perhaps the most important and pervasive principle of computer design is to focus on the common case: In making a design trade-off, favor the frequent case over the infrequent case. This principle applies when determining how to spend resources, since the
;store result ;decrement pointer 8 bytes ;branch R1!=R2
The data dependences in this code sequence involve both floating-point data:
    Loop: L.D    F0,0(R1)    ;F0=array element
          ADD.D  F4,F0,F2    ;add scalar in F2
          S.D    F4,0(R1)    ;store result
and integer data:
          DADDIU R1,R1,-8    ;decrement pointer
                             ;8 bytes (per DW)
          BNE    R1,R2,Loop  ;branch R1!=R2
Both of the above dependent sequences, as shown by the arrows, have each instruction depending on the previous one. The arrows here and in following
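For readers who do not work in MIPS assembly, the loop in this excerpt can be modeled in a few lines of Python. This is a rough sketch under stated assumptions, not the book's code: the function name `add_scalar` is hypothetical, memory is modeled as a dictionary from byte addresses to doubles, and R1 is assumed to start at the address of the highest element while R2 holds the address at which the loop stops. Each statement is annotated with the instruction it stands for and the value it depends on.

```python
def add_scalar(mem, r1, r2, f2):
    """Add the scalar f2 to every double from address r1 down to r2 + 8.

    mem maps byte addresses to floats (a hypothetical memory model).
    """
    while True:
        f0 = mem[r1]    # L.D    F0,0(R1)    load array element
        f4 = f0 + f2    # ADD.D  F4,F0,F2    depends on F0 from the load
        mem[r1] = f4    # S.D    F4,0(R1)    depends on F4 from the add
        r1 -= 8         # DADDIU R1,R1,-8    step back one doubleword
        if r1 == r2:    # BNE    R1,R2,Loop  depends on the updated R1
            break
    return mem


if __name__ == "__main__":
    memory = {0: 0.5, 8: 1.0, 16: 2.0, 24: 3.0}
    print(add_scalar(memory, r1=24, r2=0, f2=10.0))
```

The two chains are visible in the body: the load, add, and store pass a floating-point value along, while the pointer decrement feeds the branch test.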
toward taken or untaken. Figure 2.3 shows the success of branch prediction using this strategy. The same input data were used for runs and for collecting the profile; other studies have shown that changing the input so that the profile is for a different run leads to only a small change in the accuracy of profile-based prediction. The effectiveness of any branch prediction scheme depends both on the accuracy of the scheme and the frequency of conditional branches, which vary in SPEC from 3% to
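The dependence on both accuracy and branch frequency can be made concrete with a small calculation: the number of mispredictions per 1000 instructions is 1000 times the branch frequency times the misprediction rate. The sketch below is illustrative only. The function name, the 90% accuracy, and the 15% branch frequency are assumptions chosen for the example, not figures from the text; only the 3% value appears in the excerpt.

```python
def mispredictions_per_1000(branch_fraction, accuracy):
    """Expected mispredicted branches per 1000 instructions.

    branch_fraction: fraction of instructions that are conditional branches.
    accuracy: fraction of those branches predicted correctly.
    """
    if not (0.0 <= branch_fraction <= 1.0 and 0.0 <= accuracy <= 1.0):
        raise ValueError("both arguments must be fractions between 0 and 1")
    return 1000 * branch_fraction * (1 - accuracy)


if __name__ == "__main__":
    # Same predictor accuracy, different branch frequencies.
    for fraction in (0.03, 0.15):  # 0.15 is a hypothetical value
        rate = mispredictions_per_1000(fraction, accuracy=0.90)
        print(f"branches {fraction:.0%}: {rate:.1f} mispredictions per 1000 instructions")
```

With accuracy held fixed, a program with five times as many branches suffers five times as many mispredictions, which is why accuracy alone does not determine how much a prediction scheme helps.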
pipeline stages during their execution. Instead, various stages of execution (instruction fetch, decode, uop issue, rename, schedule, execute, and retire) can take varying numbers of clock cycles. In the Pentium III,
[Figure from Chapter Two, Instruction-Level Parallelism and Its Exploitation: block diagram whose labels include front-end BTB (4K entries), instruction prefetch, system bus (64 bits), instruction decoder, microcode ROM, trace cache BTB (2K entries), execution trace cache (12K uops), bus interface unit, µop queue, and register renaming]