Computer Systems: An Integrated Approach to Architecture and Operating Systems: Lecture Slides 0321486137, 9780321486134

In the early days of computing, hardware and software systems were designed separately. Today, as multicore systems pred


English Pages 756 [769] Year 2008


COMPUTER SYSTEMS An Integrated Approach to Architecture and Operating Systems

Chapter 1 Introduction

©Copyright 2008 Umakishore Ramachandran and William D. Leahy Jr.

What’s Inside the Box?

Levels of Abstraction

Hardware Software Interface

From Electrons & Holes to a Multiplayer Video Game

The Role of the Operating System • Resource manager • Provide consistent interface to resources • Job scheduler

Client Application (Halo 3)

CLIENT
• Player clicks mouse cursor on target
• OS: Recognizes interrupt ("It's a mouse interrupt!") and sends it to client application
• Client application creates message to send to server application
• OS: Sends message to server

SERVER
• OS: Receives message ("Got a message!") and sends it to server application
• Application examines message and state of game and determines Master Chief dies! Sends message back to client.
• OS: Sends message to client

CLIENT
• OS: Receives message and sends it to application
• Client application generates required images, etc. Sends I/O requests to OS
• OS changes I/O devices to show Master Chief blowing up!!! ("Uh oh!")

What’s Happening Inside the Box?
• Processor
• Memory
• I/O
• Parallelism
• Networking

Layers of Abstraction
• Application (Algorithms expressed in High Level Language)
• System software (Compiler, OS, etc.)
• Computer Architecture
• Machine Organization (Datapath and Control)
• Sequential and Combinational Logic Elements
• Logic Gates
• Transistors
• Solid-State Physics (Electrons and Holes)

Where Does This Course Fit? Fundamentals of Digital Electronic & Logic Design

Fundamentals of Programming

Integrated Approach to Computer Architecture and Operating Systems

Advanced Topics in Operating Systems

Advanced Topics in Computer Architecture

Advanced Topics in Computer Networks

Questions?

Chapter 2 Processor Architecture

Overview • Architectural Issues – Instruction Set – Machine Organization

• Historical Perspective – Programming primarily in Assembly Language – Development of sophisticated compiler technology – Development of Operating Systems

Processor Design • Hardware resources are like letters • Instruction set is like words • Instruction set is key differentiation between different processors (e.g. Intel x86 & Power PC)

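[Figure: source languages (C, Fortran, Ada, Basic, Java, etc.) pass through compilers, byte code, assembly language, an assembler, and an interpreter to produce an executable for the Instruction Set Architecture, which sits above HW Implementation 1, HW Implementation 2, ..., HW Implementation N]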

Instruction Set Design Goals Maximize Performance

Easy to Build Compiler(s)

Easy to Build Hardware

Minimize Cost

High Level Language Constructs
• High Level Language Constructs
a = b + c; /* add b and c and place in a */
d = e - f; /* subtract f from e and place in d */
x = y & z; /* AND y and z and place in x */

• Assembly Language Constructs
add a, b, c;    a ← b + c
sub d, e, f;    d ← e - f
and x, y, z;    x ← y & z

Where to Keep the Operands Processor

Memory

Devices

Processor

ALU Memory Registers

Devices

Memory Address Specification processor

Memory b

r2

• Need a way of specifying an address in an instruction • Problem: Addresses are as long as (if not longer than) the instruction itself • One solution: Store in the instruction the number of a register that contains the address • Include an offset in the space left over

Base + Offset ld rdest, offset(rbase) ld r2, 3(r1) • Semantics – Value in rbase is added to offset forming an effective address – The contents of that effective address are fetched and placed into rdest

• Register transfer language r2←M[(r1) + 3]
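
The base + offset semantics above can be sketched in C as a tiny simulated machine. This is only an illustration: the `Machine` type, the 16-register file, the 256-word memory, and the function names are assumptions made for the sketch, not part of the slides.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS  16   /* assumed register count for the sketch */
#define MEM_WORDS 256  /* tiny word-addressed memory, hypothetical size */

typedef struct {
    uint32_t regs[NUM_REGS];
    uint32_t mem[MEM_WORDS];
} Machine;

/* Effective address = contents of the base register + signed offset. */
static uint32_t effective_address(const Machine *m, unsigned rbase, int32_t offset)
{
    assert(rbase < NUM_REGS);
    return m->regs[rbase] + (uint32_t)offset;
}

/* ld rdest, offset(rbase):  rdest <- M[(rbase) + offset] */
static void ld(Machine *m, unsigned rdest, int32_t offset, unsigned rbase)
{
    uint32_t ea = effective_address(m, rbase, offset);
    assert(rdest < NUM_REGS && ea < MEM_WORDS);
    m->regs[rdest] = m->mem[ea];
}
```

With r1 holding 100 and memory word 103 holding some value, `ld(&m, 2, 3, 1)` mirrors `ld r2, 3(r1)` and copies that value into r2.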

Operand Width
• How many bits does a load instruction fetch from memory?
• How many should it fetch? 1? 4? 8? 16? 32? 64? More?
• High level languages typically support multiple data types, so for efficiency most processors can fetch different sized “chunks” of data
• The minimum is usually 8 bits; the maximum has continually increased.

Other Questions • On what size data can arithmetic & logical operations be performed? • What size data can the processor move to and from memory? • What size data can a register hold?

Endianness
• How should bytes be numbered?

Byte numbering within each successive word:

Big Endian          Little Endian
100 101 102 103     103 102 101 100
104 105 106 107     107 106 105 104
108 109 110 111     111 110 109 108
112 113 114 115     115 114 113 112

• And why does it matter?

Endianness
• Different manufacturers have “standardized” on different endiannesses
• Normally on a single machine this is not an issue.
– Especially if data is handled commensurate with the way it was declared
– Can it be an issue?
• In a networked environment it can cause big problems
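
A short C sketch may make the networking concern concrete. It shows one way to detect the host byte order and to serialize a 32-bit word into a fixed, agreed-upon byte order (big endian is chosen here purely as an assumption) so that two machines with different endianness read the same value. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the least significant byte is stored at the lowest address. */
static int is_little_endian(void)
{
    uint32_t word = 1;
    unsigned char first_byte;
    memcpy(&first_byte, &word, 1);
    return first_byte == 1;
}

/* Reverse the four bytes of a 32-bit word. */
static uint32_t swap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Write x most significant byte first, independent of host byte order. */
static void put_be32(unsigned char out[4], uint32_t x)
{
    assert(out != NULL);
    out[0] = (unsigned char)(x >> 24);
    out[1] = (unsigned char)(x >> 16);
    out[2] = (unsigned char)(x >> 8);
    out[3] = (unsigned char)x;
}

/* Read a word that was written most significant byte first. */
static uint32_t get_be32(const unsigned char in[4])
{
    assert(in != NULL);
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}
```

Because `put_be32` and `get_be32` use shifts rather than raw memory copies, they behave the same on big-endian and little-endian hosts.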

Packing Operands
Word Operand Alignment
• Consider struct { char a; char b[3]; }

[Figure: two candidate memory layouts for a, b[0], b[1], and b[2], drawn as words with byte offsets +0 to +3; the addresses shown run from 100 to 104]

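Packing Operands
Word Operand Alignment
• Consider struct { char a; int b; }

[Figure: two candidate layouts, drawn as words at addresses 100 and 104 with byte offsets +0 to +3. Packed: a sits at 100+0, b follows immediately, and b straddles the boundary between words 100 and 104. Aligned: a sits at 100+0 and all of b (bmsb ... blsb) occupies word 104.]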
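
How a particular C compiler lays out this struct can be inspected with `offsetof`. The sketch below is an illustration only: the struct tag `Example` and the helper names are hypothetical, and the exact padding depends on the compiler and target.

```c
#include <assert.h>
#include <stddef.h>

struct Example {
    char a;
    int  b;
};

/* Byte offset at which the compiler placed b. */
static size_t offset_of_b(void)
{
    return offsetof(struct Example, b);
}

/* Unused bytes between the end of a and the start of b. */
static size_t padding_after_a(void)
{
    return offsetof(struct Example, b) - sizeof(char);
}

/* Round offset up to the next multiple of alignment (a power of two). */
static size_t align_up(size_t offset, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}
```

On a target that aligns `int` to 4 bytes, `align_up(1, 4)` gives 4, which is where an aligning compiler would place `b`, leaving three bytes of padding after `a`.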

Compiling High Level Data Abstractions
• Consider
struct {
    int a;
    char c;
    int d;
    long e;
}

• Given the address of the structure, can we access the fields using base+offset addressing?

Compiling High Level Data Abstractions • Now consider int a[1000];

• Can individual array elements be accessed using base+offset addressing? a[6] = a[6] + 3;

a[j] = a[i] + 3;

• Perhaps an instruction that formed an effective address by adding two registers together would be useful?

Compiling Conditional Statements
• In what order are program statements normally executed?
• How do we know what instruction to execute next?
• How can we handle this type of high-level language construct: if(x==y) z = 7;

Compiling Conditional Statements if(x==y) z = 7;

• Steps to execute
– Evaluate the predicate (x==y)
– If false, change the normal flow of control to skip the z = 7; statement and continue execution
– If true, execute z = 7; and then continue execution

Compiling Conditional Statements • Need an instruction that will evaluate a predicate and change program flow • As an example • beq r1, r2, offset – Semantics: if the contents of registers r1 and r2 are equal add the offset to the (already incremented) PC and store that address in the PC

Compiling Conditional Statements
• C
if(a==b) c = d + e; else c = f + g;

Assuming r1 = a, r2 = b, r3 = c, r4 = d, r5 = e, r6 = f, r7 = g

• Assembly
        beq r1, r2, then
        add r3, r6, r7
        beq r1, r1, skip*
then    add r3, r4, r5
skip    …

* Effectively an unconditional branch
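
The same control flow can be written in C with `goto`, which makes the correspondence with the assembly visible line by line. This is a sketch: the function names are hypothetical, and `beq_next_pc` assumes word addressing with the PC already incremented, as in the beq semantics given earlier.

```c
#include <assert.h>

/* if (a == b) c = d + e; else c = f + g;  lowered to branches. */
static int if_else_lowered(int a, int b, int d, int e, int f, int g)
{
    int c;
    if (a == b) goto then_part;  /* beq r1, r2, then            */
    c = f + g;                   /* add r3, r6, r7              */
    goto skip;                   /* beq r1, r1, skip (always)   */
then_part:
    c = d + e;                   /* add r3, r4, r5              */
skip:
    return c;
}

/* Next PC for beq: the offset is added to the already incremented PC. */
static unsigned beq_next_pc(unsigned pc, int rx, int ry, int offset)
{
    unsigned incremented = pc + 1;
    return (rx == ry) ? incremented + (unsigned)offset : incremented;
}

/* Small self-check that doubles as a usage example. */
static void self_check(void)
{
    assert(if_else_lowered(1, 1, 2, 3, 10, 20) == 5);   /* then path */
    assert(if_else_lowered(1, 2, 2, 3, 10, 20) == 30);  /* else path */
    assert(beq_next_pc(10, 7, 7, 5) == 16);             /* taken     */
    assert(beq_next_pc(10, 7, 8, 5) == 11);             /* not taken */
}
```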

Compiling Loops
• C
while(j != 0) { /* loop body */ t = t + a[j--]; }

• Assembly
        beq r1,r0,done
        ; loop body
        …
done    …
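
A C version written with `goto` shows the shape of the compiled loop. Note the assumptions: the slide shows only the exit test, so the backward jump at the end of the body is filled in here as the usual way to close a loop, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* while (j != 0) { t = t + a[j--]; }  lowered to branches. */
static int sum_lowered(const int *a, int j)
{
    int t = 0;
    assert(a != NULL && j >= 0);
loop:
    if (j == 0) goto done;  /* beq r1, r0, done (r0 assumed to hold zero) */
    t = t + a[j];           /* loop body */
    j = j - 1;
    goto loop;              /* assumed unconditional branch back to the test */
done:
    return t;
}
```

The loop reads `a[j]` down to `a[1]`, so element 0 is never touched.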

Compiling Switch Statements
if(n==0) x=a; else if(n==1) x=b; else if(n==2) x=c; else x=d;

switch (n) { case 0: x=a; break; case 1: x=b; break; case 2: x=c; break; default: x=d; }

Do these produce essentially equivalent assembly code?
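
One way to explore the question is to compare a chain of comparisons with a table indexed by n. The sketch below uses a table of values rather than a table of code addresses, so it only approximates what a compiler's jump table does; the function names are hypothetical.

```c
#include <assert.h>

/* Chain of comparisons: up to three tests before the default is reached. */
static int select_chain(int n, int a, int b, int c, int d)
{
    if (n == 0) return a;
    else if (n == 1) return b;
    else if (n == 2) return c;
    else return d;
}

/* Table-driven: one range check, then a single indexed access. */
static int select_table(int n, int a, int b, int c, int d)
{
    const int table[3] = { a, b, c };
    return (n >= 0 && n < 3) ? table[n] : d;
}

/* Both forms should agree for in-range and out-of-range selectors. */
static void check_agreement(void)
{
    int n;
    for (n = -1; n < 5; n++) {
        assert(select_chain(n, 10, 20, 30, 40) == select_table(n, 10, 20, 30, 40));
    }
}
```

The results are identical, but the work done differs: the chain costs more comparisons as n grows, while the table costs the same for every case.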

Compiling Procedure Calls
int main() {
    return-value = foo(actual-parms);
    /* continue upon returning from foo */
}

int foo(formal-parameters) {
    /* code for function foo */
    return();
}

Issues with Compiling
1. Preserve caller state (registers)
2. Pass actual parameters
3. Save the return address
4. Transfer control to callee
5. Allocate space for callee’s local variables
6. Produce return value(s); give to caller
7. Return

Caller State • Where should caller’s register values be saved? – Why do we even need to save them? – Dedicated registers? • What limitation might arise?

– Who should save what? • Caller saved registers • Callee saved registers

What’s Left? • Parameter passing – Dedicated registers – Stack

• Return address – JAL Instruction

• Transfer Control – Change PC – JAL Instruction

What’s Left? • Local variables? – Stack

• Return value(s) – Dedicated registers – Stack

• Returning to point of call – JAL back through link

Software Conventions • Registers s0-s2 are the caller’s s registers • Registers t0-t2 are the temporary registers • Registers a0-a2 are the parameter passing registers • Register v0 is used for return value • Register ra is used for return address • Register at is used for target address • Register sp is used as a stack pointer

Activation Record • Used to store – Caller saved registers – Additional parameters – Additional return values – Return address – Callee saved registers – Local variables

STACK (each step lists the stack contents starting at the stack pointer)

Step 1. Caller saves any of registers t0-t2 on the stack (if it needs the values in them upon return).
Stack: Saved t registers (from t registers)

Step 2. Caller places the parameters in a0-a2 (using the stack for additional parameters if needed).
Stack: Additional parameters (from function call) | Saved t registers

Step 3. Caller allocates space for any additional return values on the stack.
Stack: Additional return values | Additional parameters | Saved t registers

Step 4. Caller saves ra.
Stack: ra (from ra) | Additional return values | Additional parameters | Saved t registers

Step 5. Caller executes JAL at, ra (no effect on stack).
Stack: ra | Additional return values | Additional parameters | Saved t registers

Step 6. Callee saves on the stack any of registers s0-s2 that it plans to use during its execution.
Stack: Saved s registers (from s registers) | ra | Additional return values | Additional parameters | Saved t registers

Step 7. Callee allocates space for any local variables on the stack.
Stack: Local variables | Saved s registers | ra | Additional return values | Additional parameters | Saved t registers

Step 8. Prior to return, callee restores any saved s0-s2 registers from the stack.
Stack: Saved s registers (to s registers) | ra | Additional return values | Additional parameters | Saved t registers

Step 9. Upon return, caller restores ra.
Stack: ra (to ra) | Additional return values | Additional parameters | Saved t registers

Step 10. Caller stores additional return values as desired.
Stack: Additional return values (as desired) | Additional parameters | Saved t registers

Step 11. Upon return, caller moves stack pointer to discard additional parameters.
Stack: Additional parameters | Saved t registers

Step 12. Upon return, caller restores any saved t0-t2 registers from the stack.
Stack: Saved t registers (to t registers)

[Figure: the stack holds one activation stack frame per active call: baz nearest the stack pointer, then bar, foo, and main. A frame contains local variables, saved s registers, ra, additional return values, additional parameters, and saved t registers.]

Recursion • Does recursion require any additional instruction set architecture items?

Frame Pointer • During execution of a given module it is possible for the stack pointer to move. • Since the location of all items in a stack frame is based on the stack pointer, it is useful to define a fixed point in each stack frame and maintain the address of this fixed point in a register called the frame pointer • This necessitates storing the old frame pointer (i.e., the caller’s frame pointer) in each stack frame

STACK

New Step 6. Callee stores old frame pointer, then copies contents of stack pointer into frame pointer.
Stack (starting at the frame pointer): Old frame pointer | ra | Additional return values | Additional parameters | Saved t registers

* Stack pointer can eventually be anywhere

Instruction Set Architecture Choices
• Specific set of arithmetic and logic instructions
• Addressing modes
• Architectural style
• Memory layout of the instruction
• Drivers
– Technology trends
– Implementation feasibility
– Goal of elegant/efficient support for high-level language constructs

Instructions • MIPS – All loads and stores 32 bits – Special instructions exist for extracting bytes

• DEC Alpha – Instructions for loading and storing different sizes

• Some architectures have predefined values – e.g. 0, 1, etc.

• DEC Vax – Single instruction to load or store all registers

Addressing Modes • Additional modes – Indirect addressing ld @(ra)

• Pseudo-direct addressing – Address is formed from first 6 bits of PC and last 26 bits of instruction

Architecture Styles • Stack oriented – Burroughs

• Memory oriented – IBM s/360 et al

• Register oriented – MIPS, Alpha, ARM

• Hybrid – Intel x86, Power PC

Instruction Format • Zero Operand Instructions – Halt, NOP – Stack machines: Add

• One Operand Instructions – Inc, Dec, Neg, Not – Accumulator machines: Load M, Add M

• Two Operand Instructions – Add r1, r2 (i.e. r1 = r1 + r2) – Mov r1, r2

• Three Operand Instructions – Add r1, r2, r3 – Load rd, rb, offset

Instruction Format Fixed Length Instructions • Pros – Simplifies implementation – Can start interpreting

• Cons – May waste space – May need additional logic in datapath – Limits instruction set designer

Variable Length Instructions • Pros – No wasted space – Fewer constraints on designer – More flexibility in opcodes, addressing modes and operands

• Cons – Complicates implementation

LC-2200 Instruction Set
• 32-bit
• Register-oriented
• Little-endian
• Fixed length instruction format
• 16 general-purpose registers
• Program counter (PC) register
• All addresses are word addresses

Instruction Format • R-type instructions – add and nand

• I-type instructions – addi, lw, sw, and beq

• J-type instruction – jalr

• O-type instruction – halt

Instruction Format

• R-type instructions (add, nand):
bits 31-28: opcode
bits 27-24: reg X
bits 23-20: reg Y
bits 19-4: unused (should be all 0s)
bits 3-0: reg Z

• I-type instructions (addi, lw, sw, beq):
bits 31-28: opcode
bits 27-24: reg X
bits 23-20: reg Y
bits 19-0: Immediate value or address offset (a 20-bit, 2s complement number with a range of -524288 to +524287)

• J-type instructions (jalr):
bits 31-28: opcode
bits 27-24: reg X (target of the jump)
bits 23-20: reg Y (link register)
bits 19-0: unused (should be all 0s)

• O-type instructions (halt):
bits 31-28: opcode
bits 27-0: unused (should be all 0s)
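
The bit layouts above translate directly into shifts and masks. The following C sketch extracts the fields of a 32-bit instruction word and sign-extends the 20-bit offset. The type and function names are hypothetical, and since the slides do not list opcode numbers, the opcode is treated as an opaque 4-bit value.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned opcode;  /* bits 31-28 */
    unsigned rx;      /* bits 27-24 */
    unsigned ry;      /* bits 23-20 */
    unsigned rz;      /* bits 3-0 (R-type only) */
    int32_t  offset;  /* bits 19-0, sign extended (I-type only) */
} Fields;

/* Interpret the low 20 bits as a 2s complement number. */
static int32_t sign_extend20(uint32_t v)
{
    v &= 0xFFFFFu;
    return (v & 0x80000u) ? (int32_t)v - 0x100000 : (int32_t)v;
}

/* Pull every field out of the word; the caller uses those its type defines. */
static Fields decode(uint32_t insn)
{
    Fields f;
    f.opcode = (insn >> 28) & 0xFu;
    f.rx     = (insn >> 24) & 0xFu;
    f.ry     = (insn >> 20) & 0xFu;
    f.rz     = insn & 0xFu;
    f.offset = sign_extend20(insn);
    return f;
}

/* Build an I-type word from its fields. */
static uint32_t encode_itype(unsigned opcode, unsigned rx, unsigned ry, int32_t offset)
{
    assert(opcode < 16 && rx < 16 && ry < 16);
    assert(offset >= -524288 && offset <= 524287);
    return ((uint32_t)opcode << 28) | ((uint32_t)rx << 24) |
           ((uint32_t)ry << 20) | ((uint32_t)offset & 0xFFFFFu);
}
```

Encoding an I-type word with offset -5 and decoding it again returns -5, which confirms that the sign extension matches the stated range.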

LC-2200 Register Set

Reg #  Name   Use                         callee-save?
0      $zero  always zero (by hardware)   n.a.
1      $at    reserved for assembler      n.a.
2      $v0    return value                No
3      $a0    argument                    No
4      $a1    argument                    No
5      $a2    argument                    No
6      $t0    Temporary                   No
7      $t1    Temporary                   No
8      $t2    Temporary                   No
9      $s0    Saved register              YES
10     $s1    Saved register              YES
11     $s2    Saved register              YES
12     $k0    reserved for OS/traps       n.a.
13     $sp    Stack pointer               No
14     $fp    Frame pointer               YES
15     $ra    return address              No

Issues Influencing Processor Design
• Instruction Set
• Applications
• Other
– Operating system
– Support for modern languages
– Memory system
– Parallelism
– Debugging
– Virtualization
– Fault Tolerance
– Security

Instruction Set • Over-arching concern: Compiling high level language constructs into efficient machine code • But other factors are in play – Market pressure – Performance – Technology workarounds

Influence of Applications on Instruction Set Design • Number crunching requires efficient floating point – Development of floating point hardware

• Media applications deal with streaming data – Intel MMX extensions

• Gaming requires sophisticated graphic processing – High end games now include GPU chips

Other Issues Driving Processor Design
• Operating system
• Modern languages: Java, C++ and C#
• Memory system
• Parallelism
• Debugging
• Virtualization
• Fault tolerance
• Security

Summary • High-level language constructs shape ISA • Support needed in the ISA for compiling basic program statements (assignment, loops, conditionals, etc.) • Registers • Addressing modes • Software conventions • Software stack/procedure calls • Extensions to minimal ISA’s • Other design issues

Chapter 3 Processor Implementation

Processor Implementation • Implementation given an instruction set • Instruction-set is not a description of the implementation of the processor – Contract between hardware and software – Allows a compiler writer to generate code for different high-level languages to execute on a processor that implements this contract

• Can there be different implementations of the same instruction set?

3.1 Architecture versus Implementation
Why?
• Market demands
• Parallel hardware and software development
• Maintain compatibility for legacy software

3.2 What is involved in Processor Implementation?

• Organization of the electrical components (ALUs, buses, registers, etc.) commensurate with the expected price/performance characteristic of the processor. • Thermal and mechanical aspects including cooling and physical geometry for placement in mother boards. Super Computers High performance primary objective

Servers Intermediate performance and cost

Desktops & PCs Low cost primary objective

Embedded Small size, low cost, and low power consumption primary objectives

3.3 Key hardware concepts A review of important design principles

3.3.1 Circuits • Combinational logic – For a given set of inputs there is one unique output

• Sequential logic – Circuits contain elements that remember state – Output depends on inputs and state

3.3.2 Hardware resources of the datapath
• Memory
• ALU
• Register file
• Program Counter
• Instruction Register

3.3.3 Logic Triggering
[Figure: a logic element with inputs, outputs, and a clock]

Level Triggering • Outputs change based on inputs whenever clock is high • Memory will be considered to be level triggered (for cost reasons)

Edge Triggering • Outputs change based on inputs only when clock transitions • Positive edge triggered: the leading edge causes triggering • Negative edge triggered: the trailing edge causes triggering

3.3.4 Connecting the datapath elements
[Figure: datapath elements and their connections. Labels shown: PC, Memory (Addr, Din, Dout), IR, Register-file, ALU]

3.3.5 Towards bus-based Design • In principle we must make connections between circuit elements for every instruction • Numerous connections are expensive and take up valuable space • Have a set of wires that all elements can connect to and share in order to transfer information

Single Bus Design
[Figure: datapath elements sharing a single bus. Labels shown: PC, MAR, Register-file (DPRF), IR, Memory (Addr, Din, Dout), ALU]

Dual Bus Design
[Figure: datapath elements sharing two buses. Labels shown: PC, MAR, Register-file (DPRF), IR, Addr, Din, Memory, Dout1, Dout2, ALU]

3.3.6 Finite State Machine (FSM)
• Abstraction of a sequential logic circuit which captures
– States
– Outputs while in each state
– Designated start state
– Possible transitions
– Inputs which will trigger transitions

[Figure: state diagram with states Fetch, Decode, Execute]

3.4 Datapath Design • Processing Unit (CPU) consists of the Datapath and the Control Unit • Datapath is the combination of hardware resources and their connections • Example for LC-2200 – ALU capable of ADD, NAND, SUB, and A + 1 – Register file with 16 registers (32-bit) shown in Figure 3.14 – PC (32-bit) – Memory with 2^32 x 32-bit words

Sample Datapath: LC-2200 Datapath
[Figure: datapath built around a single 32-bit bus. Recoverable labels:]
• Registers loaded from the bus: PC (LdPC), A (LdA), B (LdB), MAR (LdMAR), IR (LdIR)
• Z register: 1 bit, loaded (LdZ) from the =0? zero-detect logic
• Bus drivers: DrPC, DrALU, DrREG, DrMEM, DrOFF
• ALU (2-bit func): 00: ADD, 01: NAND, 10: A - B, 11: A + 1
• Registers: 16 x 32 bits; Din, Dout, 4-bit regno, WrREG
• Memory: 2^32 x 32 bits; Addr, Din, Dout, WrMEM
• IR fields
– IR[31..28] OP: 4-bit opcode to control logic
– IR[27..24] Rx: 4-bit register number to control logic
– IR[23..20] Ry: 4-bit register number to control logic
– IR[3..0] Rz: 4-bit register number to control logic
– IR[19..0]: 20-bit offset, sign extended, driven by DrOFF
• Z: 1-bit boolean to control logic
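
The ALU function codes in the datapath can be modeled in a few lines of C. This is a behavioral sketch only: the enum and function names are hypothetical, and the zero-detect helper models the value that would be loaded into Z.

```c
#include <assert.h>
#include <stdint.h>

/* 2-bit func encodings as listed on the datapath slide. */
enum { ALU_ADD = 0, ALU_NAND = 1, ALU_SUB = 2, ALU_INC = 3 };

/* Combinational ALU: the output depends only on func, A and B. */
static uint32_t alu(unsigned func, uint32_t a, uint32_t b)
{
    assert(func <= 3);
    switch (func) {
    case ALU_ADD:  return a + b;
    case ALU_NAND: return ~(a & b);
    case ALU_SUB:  return a - b;
    default:       return a + 1;   /* ALU_INC: B is ignored */
    }
}

/* Zero-detect logic: the 1-bit value that LdZ would latch into Z. */
static int zero_detect(uint32_t bus_value)
{
    return bus_value == 0;
}
```

Subtracting two equal values and passing the result to `zero_detect` yields 1, which is how the equality test of a branch can be derived from the ALU.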

3.4.1 ISA and Datapath Width • We normally define a size for instructions, addresses and data operands (e.g. 32 bits) • Implementation could use bus and/or interconnects of smaller size (e.g. 8 or 16 bits) • Would require more operations to move a 32 bit value. Would require less chip real estate • Tradeoff speed vs. price

3.4.2 Width of the Clock Pulse • Combinational logic elements have a propagation delay. • Register files have an access time • Writing to a register requires input to be stable both before and after the leading edge of the clock arrives (set up time and hold time) • Wires have a transmission delay • Clock pulse must be wide enough to allow for all of the above

3.4.3 Checkpoint • You should now understand the following basic concepts – Basics of logic design including combinational and sequential logic circuits – Hardware resources for a datapath such as register file, ALU, and memory – Edge-triggered logic and how to arrive at the width of a clock cycle – Datapath interconnection and buses – Finite State Machines

3.5 Control Unit Design
• The control unit is an implementation of the Finite State Machine
• Depending on the current state and inputs it moves to the correct next state
• Typical outputs from control unit (e.g. LC-2200)
– Drive signals: DrPC, DrALU, DrREG, DrMEM, DrOFF
– Load signals: LdPC, LdA, LdB, LdMAR, LdIR
– Write Memory signal: WrMEM
– Write Registers signal: WrREG
– ALU function selector: func
– Register selector: regno
• Several alternatives exist for implementation

3.5.1 ROM plus state register
[Figure: layout of a ROM word. Fields: ... | Drive Signals (PC, ALU, Reg, MEM, OFF) | Load Signals (PC, A, B, MAR, IR) | Write Signals (MEM, REG) | Func | RegSel]

3.5.2 FETCH macro state
• Need to do
– Send PC to the memory
– Read the memory contents
– Bring the memory contents read into the IR
– Increment the PC
• Microstates to accomplish
– ifetch1: PC → MAR
– ifetch2: MEM[MAR] → IR
– ifetch3: PC → A
– ifetch4: A+1 → PC

3.5.2 FETCH macro state (Simplifying)
• ifetch1
– PC → MAR
– PC → A
• ifetch2
– MEM[MAR] → IR
• ifetch3
– A+1 → PC

3.5.2 FETCH macro state: Adding in control signals
• ifetch1
– PC → MAR
– PC → A
– Control signals needed: DrPC, LdMAR, LdA
• ifetch2
– MEM[MAR] → IR
– Control signals needed: DrMEM, LdIR
• ifetch3
– A+1 → PC
– Control signals needed: func = 11, DrALU, LdPC
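
The three fetch microstates can be traced with a small C model in which each function performs one bus transfer. This is a sketch under assumptions: the `Datapath` struct, the 64-word memory, and the function names are invented for illustration, and timing is not modeled.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_WORDS 64  /* tiny memory for the sketch, hypothetical size */

typedef struct {
    uint32_t pc, a, mar, ir;
    uint32_t mem[MEM_WORDS];
} Datapath;

/* ifetch1: DrPC puts PC on the bus; LdMAR and LdA both latch it. */
static void ifetch1(Datapath *d)
{
    uint32_t bus = d->pc;
    d->mar = bus;
    d->a = bus;
}

/* ifetch2: DrMEM puts MEM[MAR] on the bus; LdIR latches it. */
static void ifetch2(Datapath *d)
{
    uint32_t bus;
    assert(d->mar < MEM_WORDS);
    bus = d->mem[d->mar];
    d->ir = bus;
}

/* ifetch3: func = 11 selects A + 1; DrALU drives it; LdPC latches it. */
static void ifetch3(Datapath *d)
{
    uint32_t bus = d->a + 1;
    d->pc = bus;
}

/* The FETCH macro state is the three microstates in sequence. */
static void fetch(Datapath *d)
{
    ifetch1(d);
    ifetch2(d);
    ifetch3(d);
}
```

Loading A in ifetch1, on the same bus transfer that loads MAR, is what allows the four-step sequence to shrink to three.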

3.5.3 DECODE macro state
[Figure: from Fetch, the decode step branches to one of O-Type, R-Type, I-Type, or J-Type]

3.5.4 EXECUTE macro state: ADD instruction (part of R-Type)
• RX ← RY + RZ
• add1
– Ry → A
– Control signals needed: RegSel = 01, DrREG, LdA
• add2
– Rz → B
– Control signals needed: RegSel = 10, DrREG, LdB
• add3
– A+B → Rx
– Control signals needed: func = 00, DrALU, RegSel = 00, WrREG

[Figure: state sequence ifetch1, ..., add1, add2, add3]

3.5.5 EXECUTE macro state: NAND instruction (part of R-Type) • What must be changed in ADD to implement NAND?

3.5.6 EXECUTE macro state: JALR instruction (part of J-Type)
• JALR instruction does the following:
– RY ← PC + 1
– PC ← RX
• jalr1
– PC → Ry
– Control signals needed: DrPC, RegSel = 01, WrREG
• jalr2
– Rx → PC
– Control signals needed: RegSel = 00, DrREG, LdPC

3.5.7 EXECUTE macro state: LW instruction (part of I-Type)
• RX ← MEMORY[RY + signed address-offset]
• lw1
– Ry → A
– Control signals needed: RegSel = 01, DrREG, LdA
• lw2
– Sign-extended offset → B
– Control signals needed: DrOFF, LdB
• lw3
– A+B → MAR
– Control signals needed: func = 00, DrALU, LdMAR
• lw4
– MEM[MAR] → Rx
– Control signals needed: DrMEM, RegSel = 00, WrREG
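
The four LW microstates can be followed in a C model of the datapath registers. As before, this is an illustrative sketch: the `Datapath` struct, the 64-word memory, and the function names are assumptions, and the register numbers are taken from the IR fields described in the instruction format.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_WORDS 64  /* tiny memory for the sketch, hypothetical size */

typedef struct {
    uint32_t regs[16];
    uint32_t a, b, mar, ir;
    uint32_t mem[MEM_WORDS];
} Datapath;

/* Sign-extend IR[19..0] to 32 bits (what DrOFF drives onto the bus). */
static uint32_t sign_extended_offset(uint32_t ir)
{
    uint32_t v = ir & 0xFFFFFu;
    return (v & 0x80000u) ? (v | 0xFFF00000u) : v;
}

/* Execute LW for the instruction currently held in IR. */
static void exec_lw(Datapath *d)
{
    unsigned rx = (d->ir >> 24) & 0xFu;    /* RegSel = 00 selects IR[27..24] */
    unsigned ry = (d->ir >> 20) & 0xFu;    /* RegSel = 01 selects IR[23..20] */

    d->a = d->regs[ry];                    /* lw1: Ry -> A       (DrREG, LdA)         */
    d->b = sign_extended_offset(d->ir);    /* lw2: offset -> B   (DrOFF, LdB)         */
    d->mar = d->a + d->b;                  /* lw3: A+B -> MAR    (func=00, DrALU)     */
    assert(d->mar < MEM_WORDS);
    d->regs[rx] = d->mem[d->mar];          /* lw4: MEM[MAR] -> Rx (DrMEM, WrREG)      */
}
```

A negative offset works because the 32-bit addition wraps: with Ry holding 10 and an offset of -3, MAR becomes 7.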

3.5.8 EXECUTE macro state: SW and ADDI instructions (part of I-Type) • SW similar to LW • ADDI similar to ADD

3.5.9 EXECUTE macro state: BEQ instruction (part of I-Type)
• BEQ instruction has the following semantics:
If (RX == RY) PC ← PC + 1 + signed offset
else Nothing*

*PC remains unchanged so execution continues to next instruction in memory

• beq1
– Rx → A
– Control signals needed: RegSel = 00, DrREG, LdA
• beq2
– Ry → B
– Control signals needed: RegSel = 01, DrREG, LdB
• beq3
– A–B
– Load Z register with result of zero detect logic
– Control signals needed: func = 10, DrALU, LdZ

These microsteps execute only if we are taking the branch:
• beq4
– PC → A
– Control signals needed: DrPC, LdA
• beq5
– Sign-extended offset → B
– Control signals needed: DrOFF, LdB
• beq6
– A+B → PC
– Control signals needed: func = 00, DrALU, LdPC

3.5.10 Engineering a conditional branch in the microprogram
[Figure: state sequence ifetch1, ..., beq1, beq2, beq3, beq4, beq5, beq6]

[Figure: ROM word layout, with Z labeled. Fields: ... | Drive Signals (PC, ALU, Reg, MEM, OFF) | Load Signals (PC, A, B, MAR, IR) | Write Signals (MEM, REG) | Func | RegSel]

3.5.11 DECODE macro state revisited
[Figure: ROM word layout. Fields: ... | Drive Signals (PC, ALU, Reg, MEM, OFF) | Load Signals (PC, A, B, MAR, IR) | Write Signals (MEM, REG) | Func | RegSel]

3.6 Alternative Style of Control Unit Design A number of different approaches may be used to implement the Control Unit

3.6.1 Microprogrammed Control • As presented our design works • Problem: Too slow – Solution: Prefetch the next microinstruction

• Problem: Too much memory required – Solution: Have bit positions control different things as a function of opcode

3.6.2 Hardwired control • State machine can be represented as sequential logic truth table • Thus can be implemented using normal logic or FPGA

3.6.3 Choosing between the two control design styles

Microprogrammed
• Pros: Simplicity, maintainability, flexibility; rapid prototyping
• Cons: Potential for space and time inefficiency
• Comment: Space inefficiency may be mitigated with vertical microcode; time inefficiency may be mitigated with prefetching
• When to use: For complex instructions, and for quick nonpipelined prototyping of architectures
• Examples: PDP 11 series, IBM 360 and 370 series, Motorola 68000, complex instructions in Intel x86 architecture

Hardwired
• Pros: Amenable for pipelined implementation; potential for higher performance
• Cons: Potentially harder to change the design; longer design time
• Comment: Maintainability can be increased with the use of structured hardware such as PLAs and FPGAs
• When to use: For high performance pipelined implementation of architectures
• Examples: Most modern processors including Intel Pentium series, IBM PowerPC, MIPS

3.7 Historical Perspective
[Figure: timeline from 1940 to 1990]
• Hardware expensive, memory expensive: accumulators (EDSAC, IBM 701)
• Hardware less expensive, memory expensive: register oriented machines (2 address) and register-memory CISC (IBM 360, DEC PDP-11, VAX, Motorola 68000, Intel 80x86); also, as a fringe element, stack machines (Burroughs B-5000; banks)
• Hardware and memory cheap, microprocessors, compilers getting good: RISC (IBM 801; Berkeley RISC and SPARC, Dave Patterson; Stanford MIPS and SGI, John Hennessy)

Questions?

Chapter 4 Interrupts, Traps and Exceptions

4 Interrupts, Traps and Exceptions • Interrupts, traps and exceptions are discontinuities in program flow • Students asking a teacher questions in a classroom is a good analogy to the handling of discontinuities in program flow

4.1 Discontinuities in program execution • We must first understand – Synchronous events: Occur at well defined points aligned with activity of the system • Making a phone call • Opening a file

– Asynchronous events: Occur unexpectedly with respect to ongoing activity of the system • Receiving a phone call • A user presses a key on a keyboard

4.1 Discontinuities in program execution • There is no universally accepted set of definitions for interrupts, traps and exceptions, so we will use these: – Interrupts: Asynchronous events, usually produced by I/O devices, which the processor must handle by interrupting execution of the currently running process – Traps: Synchronous events produced by special instructions, typically used to allow secure entry into operating system code – Exceptions: Synchronous events usually associated with software requesting something the hardware can’t perform, e.g. illegal addressing, illegal opcode, etc.

4.1 Discontinuities in program execution
Type | Sync/Async | Source | Intentional? | Examples
Exception | Sync | Internal | No | Overflow, Divide by zero, Illegal memory address
Trap | Sync | Internal | Yes and No | System call, Page fault, Emulated instructions
Interrupt | Async | External | Yes | I/O device completion

4.2 Dealing with program discontinuities • Can happen anywhere even in the middle of an instruction execution. • Unplanned for and forced by the hardware. Hardware has to save the program counter since we are jumping to the handler. • Address of the handler is unknown. Therefore, hardware must manufacture address. • Since hardware saved PC, handler has to discover where to return upon completion.

4.3 Architectural enhancements to handle program discontinuities • When should the processor handle an interrupt? • How does the processor know there is an interrupt? • How do we save the return address? • How do we manufacture the handler address? • How do we handle multiple cascaded interrupts? • How do we return from the interrupt?

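4.3.1 Modifications to FSM
[Figure: Fetch → Decode → Execute state machine. After Execute, if int = N the machine returns to Fetch; if int = Y it enters the INT macro state, which performs $k0 ← PC; PC ← new PC, and then returns to Fetch.]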

4.3.2 A simple interrupt handler Handler: save processor registers; execute device code; restore processor registers; return to original program;

What happens if an interrupt arrives during handling an interrupt?

[Figure: the same Fetch → Decode → Execute state machine. The INT macro state (taken when int = Y) now performs $k0 ← PC; PC ← new PC; Disable Ints. Add new instruction: Enable Ints.]

4.3.2 A simple interrupt handler Handler: save processor registers; execute device code; restore processor registers; enable ints; return to original program;

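4.3.3 Handling cascaded interrupts
[Figure: the original program is interrupted and $k0 ← return address into the original program. While the first interrupt handler runs, a second interrupt arrives and $k0 ← return address into the first handler, overwriting the earlier return address; the second interrupt handler then runs.]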

4.3.3 Handling cascaded interrupts

[Figure: the Fetch → Decode → Execute state machine as before. The INT macro state (taken when int = Y) performs $k0 ← PC; PC ← new PC; Disable Ints. Add 2 new instructions: Enable Ints and Disable Ints.]

4.3.3 Handling cascaded interrupts Handler: /* The interrupts are disabled when we enter */ save $k0; enable interrupts; save processor registers; execute device code; restore processor registers; disable interrupts; restore $k0; enable interrupts; return to original program;

Yay! It works perfectly!!! Or does it? What happens if an interrupt occurs after "enable interrupts" but before "return to original program"?

4.3.4 Returning from the handler • Returning involves jumping to the address in $k0, which can be accomplished with jalr $k0, $zero

• But as we have just seen, an interrupt at precisely the wrong moment would destroy $k0 and cause a failure • What do we need? The final steps of the handler are: restore $k0; enable interrupts; return to original program; The last two must happen as one indivisible step, which is the job of a return-from-interrupt instruction.

4.3.5 Summary of architectural enhancements to LC-2200 to handle interrupts

• Three new instructions to LC-2200: – Enable interrupts – Disable interrupts – Return from interrupt

• Upon an interrupt, store the current PC implicitly into a special register $k0.

4.4 Hardware details for handling external interrupts • What we have presented thus far is what is required for interrupts, traps and exceptions • What do we need specifically for external interrupts?

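4.4.1 Datapath details for interrupts
[Figure: the processor, Device 1 and Device 2 share the address bus and the data bus, plus an INT line into the processor and an INTA line out of it.]
[Figure: with many devices, request lines INT1 ... INT8 feed a priority encoder that drives the processor's INT input, with matching acknowledge lines INTA1 ... INTA8; several devices (Device 1, Device 2, ...) may share one line.]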

4.4.2 Details of receiving the address of the handler

Handshake between Processor and Device • Device asserts INT line • Processor upon completion of the current instruction, checks the INT line • If interrupt pending, then processor enters INT macrostate and asserts INTA line on bus • Device upon receiving the INTA from the processor, places its vector on the data bus. • Processor receives vector and looks up entry in interrupt vector table for this vector. Entry is address of handler so we put it in PC • The processor saves the current PC in $k0, and loads PC with value from interrupt vector table

4.4.3 Stack for saving/restoring • Hardware has no guarantee for stack behavior by user program (register/conventions) • Equip processor with 2 stack pointers (User/System) • On interrupt swap stack pointers
[Figure: swap sequence on interrupt: (1) $sp is saved into USP, (2) $sp is loaded from SSP.]

4.4.3 Stack for saving/restoring • Use system stack for saving all necessary information • Upon completion of interrupt restore registers, etc. • Then restore user stack pointer by reversing earlier swap
[Figure: swap sequence on return: (1) $sp is saved into SSP, (2) $sp is loaded from USP.]

4.5 Putting it all together

A. Executing instruction at 19999. The PC has already been incremented. Device signals interrupt in middle of instruction. $sp points to user stack.
[Figure: machine state.]
• Vector table: entry 40 contains 1000
• Handler code: instructions at 1000, 1001, ...
• System stack: top at 299
• Original program: instructions at 19999, 20000, ...
• Register file: PC = 20000; $sp = user stack; $k0 not yet written
• MODE = USER; INT REQ = 1; INT ACK = 0; INT Enable = 1

4.5 Putting it all together

B. Interrupt has been sensed. $k0 gets PC. Interrupts are disabled. Interrupt is acknowledged. Device puts vector on bus.
[Figure: machine state.]
• Bus: carries vector 40
• Register file: $k0 = 20000; PC = 20000; $sp = user stack
• MODE = USER; INT REQ = 1; INT ACK = 1; INT Enable = 0
• Vector table (entry 40 contains 1000), handler code (1000, 1001, ...), system stack (top at 299) and original program (19999, 20000, ...) are unchanged

4.5 Putting it all together

C. Handler address is put into PC; Current mode is saved in system stack; New mode is set to kernel; $sp now points to system stack; Interrupt code at 1000 is set to handle the interrupt.
[Figure: machine state.]
• Register file: PC = 1000; $k0 = 20000; $sp = 299
• MODE = KERNEL; INT REQ = 0; INT ACK = 0; INT Enable = 0
• Vector table (entry 40 contains 1000), handler code (1000, 1001, ...) and original program (19999, 20000, ...) are unchanged

4.5 Putting it all together

D. RETI instruction restores mode from system stack; since returning to user program in this example, $sp now points to user stack; also, copies $k0 into PC, re-enables interrupts and sets Mode to User.
[Figure: machine state.]
• Register file: $k0 = 20000; PC = 20000; $sp = user stack
• MODE = USER; INT REQ = 0; INT ACK = 0; INT Enable = 1
• Vector table (entry 40 contains 1000), handler code (1000, 1001, ...), system stack (top at 299) and original program (19999, 20000, ...) are unchanged

4.6 Summary
• Interrupts help a processor communicate with the outside world.
• An interrupt is a specific instance of program discontinuity.
• Processor/Bus enhancements included
– Three new instructions
– User stack and system stack
– Mode bit
– INT macro state
– Control lines called INT and INTA

4.6 Summary • Software mechanism needed to handle interrupts, traps and exceptions is similar. • Discussed how to write a generic interrupt handler that can handle nested interrupts. • Intentionally simplified. Interrupt mechanisms in modern processors are considerably more complex. For example, modern processors categorize interrupts into two groups: maskable and non-maskable. – maskable: Interrupts that can be temporarily turned off – Non-maskable: Interrupts that cannot be turned off

4.6 Summary • Presented mode as a characterization of the internal state of a processor. Intentionally simplistic view. • Processor state may have a number of other attributes available as discrete bits of information (similar to the mode bit). – Modern processors aggregate all of these bits into one register called processor status word (PSW). – Upon an interrupt and its return, the hardware implicitly pushes and pops, respectively, both the PC and the PSW on the system stack.

• The interested reader is referred to more advanced textbooks on computer architecture for details on how the interrupt architecture is implemented in modern processors.

4.6 Summary • Presented simple treatment of the interrupt handler code to understand what needs to be done in the processor architecture to deal with interrupts. The handler would typically do a lot more than save processor registers. • LC-2200 designates a register $k0 for saving PC in the INT macro state. In modern processors, there is no need for this since the hardware automatically saves the PC on the system stack.

Questions?


Chapter 5 Processor Performance and Rudiments of Pipelined Processor Design


5.1 Space and Time Metrics • Two important metrics for any program – Space: How much memory does the program code and data require? (Memory footprint) – Time: What is the execution time for the program?

• Different design methodologies – CISC – RISC

• Memory footprint and execution time are not necessarily correlated

What determines execution time? • Execution time = (Σ CPI_j) × clock cycle time, summed over 1 ≤ j ≤ n • Execution time = n × CPI_Avg × clock cycle time, where n is the number of instructions executed (the dynamic count, not the static instruction count)

5.2 Instruction Frequency • Static instruction frequency refers to number of times a particular instruction occurs in compiled code. – Impacts memory footprint – If a particular instruction appears a lot in a program, can try to optimize amount of space it occupies by clever instruction encoding techniques in the instruction format.

• Dynamic instruction frequency refers to number of times a particular instruction is executed when program is run. – Impacts execution time of program – If dynamic frequency of an instruction is high then can try to make enhancements to datapath and control to ensure that CPI taken for its execution is minimized.

5.3 Benchmarks • Benchmarks are a set of programs that are representative of the workload for a processor. • The key difficulty is to be sure that the benchmark programs selected really are representative. • A radical new design is hard to benchmark because there may not yet be a compiler or much code.

Evaluating a Suite of Benchmark Programs • Total execution time: cumulative total of execution times of individual programs. • Arithmetic mean (AM): Simply an average of all individual program execution times. – It should be noted, however, that this metric may bias the summary value towards a time-consuming benchmark program (e.g. execution times of programs: P1 = 100 secs; P2 = 1 sec; AM = 50.5 secs).

• Weighted arithmetic mean (WAM): weighted average of the execution times of all the individual programs, taking into account the relative frequency of execution of the programs in the benchmark mix • Geometric mean (GM): pth root of the product of p values. This metric removes the bias present in the arithmetic mean

SPECint2006: 12 programs for quantifying performance of processors on integer programs
Intel Core 2 Duo E6850 (3 GHz)
Program name | Description | Time in seconds
400.perlbench | Applications in Perl | 510
401.bzip2 | Data compression | 602
403.gcc | C Compiler | 382
429.mcf | Optimization | 328
445.gobmk | Game based on AI | 548
456.hmmer | Gene sequencing | 593
458.sjeng | Chess based on AI | 679
462.libquantum | Quantum computing | 422
464.h264ref | Video compression | 708
471.omnetpp | Discrete event simulation | 362
473.astar | Path-finding algorithm | 466
483.xalancbmk | XML processing | 302

5.4 Increasing the Processor Performance • Execution time = n × CPI_Avg × clock cycle time • Reduction in the number of executed instructions • Datapath organization leading to lower CPI • Increasing clock speed

5.5 Speedup • Assume a base case execution time of 10 sec. • Assume an improved case execution time of 5 sec. • Percent improvement = (base - new)/base • Percent improvement = (10 - 5)/10 = 50% • Speedup = base/new • Speedup = 10/5 = 2 • Speedup is preferred by advertising copy writers

Amdahl’s Law • Amdahl’s law: Time_after = Time_unaffected + Time_affected / x, where x is the speedup of the affected part

5.6 Increasing the Throughput of the Processor • Don’t focus on trying to speed up individual instructions • Instead focus on throughput, i.e. the number of instructions executed per unit time

5.7 Introduction to Pipelining
• Consider a sandwich shop with a five step process
– Take order
– Bread
– Cheese
– Meat
– Veggies
• One employee can do the job
• Now imagine 5 employees making sandwiches
[Figure: assembly line of five stations: Order → Bread → Cheese → Meat → Veggies.]

Pipeline Math • If it takes one person 5 minutes to make a sandwich • And we pipeline the process using 5 people each taking a minute • And we start making sandwiches constantly (i.e. ignore startup pipeline filling) • How long does it actually take to make a single sandwich (real elapsed time)? • What is the effective time to produce a sandwich (i.e. a sandwich exits from the pipeline every how many minutes)?

5.8 Towards an instruction processing assembly line
Macro State | Functional Units in Use
FETCH | IR, ALU, PC, MEM
DECODE | IR
EXECUTE (ADD) | IR, ALU, Reg-file
EXECUTE (LW) | IR, ALU, Reg-file, MEM, Sign extender

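[Figure: two charts of instructions I1 to I4 plotted against time, each instruction passing through F, D and E. In one chart each instruction starts only after the previous one finishes; in the other the instructions overlap, with a new instruction entering F every cycle.]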

5.9 Problems with a simple-minded instruction pipeline • The different stages often need the same datapath resources (e.g. ALU, IR). – Structural Hazards

• The amount of work done in the different stages is not the same. – T_Fetch ≠ T_Decode ≠ T_Execute

5.10 Fixing the problems with the instruction pipeline
• IF: Instruction Fetch
• ID/RR: Instruction Decode/Read Registers
• EX: Execute
• MEM: Memory
• WB: Write Back

Instruction pipeline with buffers between stages
[Figure: Instruction in → IF → buffer → ID/RR → buffer → EX → buffer → MEM → buffer → WB → Instruction out.]

5.11 Datapath elements for the instruction pipeline
[Figure: datapath elements in each stage, with a buffer between stages. IF: PC, I-MEM, ALU. ID/RR: DPRF (outputs A and B), decode logic. EX: ALU-1, ALU-2. MEM: D-MEM. WB: data written to DPRF.]

5.12 Pipeline-conscious architecture and implementation • Need for a symmetric instruction format • Need to ensure equal amount of work in each stage
[Figure: five-stage pipelined datapath (IF, ID/RR, EX, MEM, WB) showing the PC, instruction memory, adders, multiplexers, DPRF, sign extender (SE), ALU, zero test (0?), data memory, and the pipeline registers between stages.]

5.12.1 Anatomy of an instruction passage through the pipeline
[Figure: IF → FBUF → ID/RR → DBUF → EX → EBUF → MEM → MBUF → WB.]

Pipeline Buffers
Name | Output of Stage | Contents
FBUF | IF | Primarily contains instruction read from memory
DBUF | ID/RR | Decoded IR and values read from register file
EBUF | EX | Primarily contains result of ALU operation plus other parts of the instruction depending on the instruction specifics
MBUF | MEM | Same as EBUF if instruction is not LW or SW; if instruction is LW, then buffer contains the contents of memory location read

5.12.2 Design of the Pipeline Registers • Design the pipeline registers solely for the LDR instruction
[Figure: field layouts of FBUF, DBUF, EBUF and MBUF.]

5.12.3 Implementation of the stages • Design and implementation of a pipeline processor may be simpler than a non-pipelined processor. • Pipelined implementation modularizes design. • Layout and interpretation of the pipeline registers are analogous to well-defined interfaces between components of a large software system. • Since datapath actions of each stage happen in one clock cycle, the design of each stage is purely combinational. Each stage: – At the beginning of each clock cycle, interprets input pipeline register, – Carries out the datapath actions using the combinational logic for this stage – Writes the result of the datapath action into its output pipeline register.

5.13 Hazards
• Three kinds: structural, control, data
• Reduce throughput to < 1 instruction/cycle
• Pipeline is synchronous
• Pipeline is stalled when an instruction cannot proceed to next stage.
• A stall introduces bubble into pipeline.
• NOP instruction is manifestation of bubble.
• Stage executing NOP instruction does nothing for one cycle.
• Output buffer remains unchanged from previous cycle.
• Stalls, bubbles, and NOPs used interchangeably in the textbook to mean the same thing.

5.13.1 Structural hazard
• Caused by limitations in hardware that don’t allow concurrent execution of different instructions
• Examples
– Bus
– Single ALU
– Single Memory for instructions and data
– Single IR

• Remedy is to add additional elements to datapath to eliminate hazard

5.13.2 Data Hazard • Consider these three pairs of instructions. Could they be executed in any sequence and yield correct results?
Pair 1: R1 ← R2 + R3; R4 ← R1 + R5
Pair 2: R4 ← R1 + R5; R1 ← R2 + R3
Pair 3: R1 ← R4 + R5; R1 ← R2 + R3

5.13.2.1 RAW Hazard
[Figure sequence: the five-stage pipelined datapath (PC, instruction memory, DPRF, ALU, data memory) shown cycle by cycle as R1 ← R2+R3 enters, followed by R4 ← R1+R5. The second instruction needs R1 in ID/RR before the first has written it back, so it waits there while R1 ← R2+R3 advances through the remaining stages; only then does R4 ← R1+R5 proceed.]

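5.13.2.2 Solving the RAW Data Hazard Problem: Data Forwarding
[Figure: the pipelined datapath with R4 ← R1+R5 behind R1 ← R2+R3. A comparator (=?) matches the destination register R1 of the earlier instruction against the source register R1 of the later one, and the result is forwarded to it.]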

5.13.2.2 Solving the RAW Data Hazard Problem: Data Forwarding • Forwarding components have to be installed to take care of all possible cases

5.13.2.3 Dealing with RAW Data Hazard introduced by Load instructions
[Figure sequence: LW R1,3(R2) enters the pipeline followed by R4 ← R1+R5, with R1 flagged (R = 1) in the figure. The loaded value is not available until the load has accessed data memory, so a NOP (bubble) is inserted between the two instructions before R1 is forwarded to R4 ← R1+R5.]

5.13.2.4 Other types of Data Hazards
• WAR – Not a problem in our pipeline
– Example: R4 ← R1 + R5 followed by R1 ← R2 + R3
• WAW – Becomes an issue in complex pipelines with many stages


5.13.3 Control Hazard • Typically associated with branch instructions • PC must contain address of next instruction before we know it!!! • Simple solution: Stall pipeline • But what is the impact?

5.13.3 Control Hazard
[Figure sequence: BEQ R1, R2, X moves through the pipelined datapath cycle by cycle. The instruction to fetch behind it is unknown (???), so NOPs are fed into the pipeline after the branch until its outcome is known.]

5.13.3.1 Dealing with branches in the pipelined processor • Delayed Branch • Branch Prediction • Branch prediction with target buffer

5.13.3.2 Summary of dealing with branches in a pipelined processor
Name | Pros | Cons | Examples
Stall | Simple | Performance loss | IBM 360
Predict (not taken) | Good performance | Need additional hardware to be able to flush pipeline | Most modern processors use this technique. Some also employ sophisticated branch target buffers
Predict (taken) | Good performance | Requires more elaborate hardware since target not available until EX | (shares the Examples entry of the row above)
Delayed Branch | Needs no hardware, just compiler recognition that it exists | Deep pipelines make it difficult to fill all the delay slots | Older RISC architectures, e.g. MIPS, PA-RISC, SPARC

5.13.4 Summary of Hazards • Structural

• Data

• Control

5.14 Dealing with interrupts in a pipelined processor • First Method 1. Stop sending new instructions into the pipeline 2. Wait until the instructions that are in partial execution complete their execution (i.e. drain the pipe). 3. Go to the interrupt state

• Second Method – The other possibility is to flush the pipeline.

5.15 Advanced topics in processor design • Pipelined processor designs have their roots in high performance processors and vector processors of the 1960's and 1970's • Many of the concepts used are still relevant today

5.15.1 Multiple Issue Processors • Sequential Program Model – Perceived Program Order Execution – Actual instruction overlap

• Instruction Level Parallelism (ILP) – Limited by hazards especially control hazards – Basic blocks

5.15.2 Deeper pipelines • Pipelines may have more than 20 stages • Basic blocks are often in the range of 3-7 instructions • Must develop techniques to exploit ILP to make it worthwhile • For example, can issue multiple instructions in one cycle – Assume hardware and/or compiler has selected a group of instruction that can be executed in parallel (no hazards)

• Need additional functional units

5.15.2 Deeper pipelines

Necessity for Deep Pipelining
• Relative increase in storage access time
• Microcode ROM access
• Multiple functional units
• Dedicated floating point pipelines
• Out of order execution and reorder buffer
• Register renaming
• Hardware-based speculation

Different Pipeline Depths

5.15.3 Revisiting program discontinuities in the presence of out-of-order processing • External Interrupts can be handled by stopping instruction issue and allowing pipeline to drain • Exceptions and traps were problematic in early pipelined processors – Instructions following instruction causing exception or trap may have already finished and changed processor state – Known as imprecise execution

• Modern processors retire instructions in program order – Potential exceptions are buffered in re-order buffer and will manifest in strictly program order.

5.15.4 Managing shared resources • Managing shared resources such as register files becomes more challenging with multiple functional units • Solutions – Scoreboard keeps track of all resources needed by an instruction – Tomasulo algorithm equips functional units with registers which act as surrogates to the architecture-visible registers

5.15.5 Power Consumption • Speeding up processors can drive designers to try and pack more smaller components onto the chip. • This can allow the clock cycle time to decrease • Unfortunately higher operational frequencies will cause power consumption to increase • Higher power consumption can also lead to thermal problems with chip operating temperatures

5.15.5 Power Consumption

5.15.6 Multi-core Processor Design • One solution to achieving higher performance without increasing power consumption and temperature beyond acceptable limits is multicore processors • Essentially the chip has more than one processor. • Such a design is not as transparent to the programmer as instruction level parallelism and as such brings a whole new set of challenges and opportunities to effectively utilize these new chips.

5.15.7 Intel Core Microarchitecture: An example pipeline

5.16 Historical Perspective • Amdahl works out basic pipelining principles for his dissertation at UW-Madison in 1952 • Amdahl is chief architect of the IBM System/360, where pipelining is originally implemented in high-end mainframe processors • Early minicomputers did not use pipelining • Killer micros did use pipelining to get needed performance advantages • Today almost all processors, except for very low-end embedded processors, use some form of pipelining

Questions?


Chapter 6 Processor Scheduling

©Copyright 2009 Umakishore Ramachandran and William D. Leahy Jr.

6.1 Introduction • Things to Do – Laundry – Study for Test – Cook and eat dinner – Call Mom for her birthday

• How would you do it?

6.2 Programs and Processes • What is an operating system? • What are resources? • How do we create programs?

6.2 Programs and Processes • What is the memory footprint of a user program?
[Figure: memory from low memory to high memory: use by the OS; then the memory footprint of the user program (program code, program global data, program heap, program stack); then use by the OS.]
• What is the overall view of memory? • Why?
[Figure: memory holding Program 1, Program 2, ..., Program n, OS data structures, and OS routines.]

6.2 Programs and Processes • What resources are required to run: Hello, World! • What is a scheduler?
Program Properties: expected running time; expected memory usage; expected I/O requirements
Process/System Properties: available system memory; arrival time of a program; instantaneous memory requirements
[Figure: Process 1, Process 2, ..., Process n are presented to the scheduler, which picks a winner to run on the processor.]

6.2 Programs and Processes
Program
• On disk
• Static
• No state – No PC – No register usage
• Fixed size
Process
• In memory (and disk)
• Dynamic – changing
• State – PC – Registers
• May grow or shrink
• Fundamental unit of scheduling
One program may yield many processes

6.2 Programs and Processes
Name | Usual Connotation | Use in this chapter
Job | Unit of scheduling | Synonymous with process
Process | Program in execution; unit of scheduling | Synonymous with job
Thread | Unit of scheduling and/or execution; contained within a process | Not used in the scheduling algorithms described in this chapter
Task | Unit of work; unit of scheduling | Not used in the scheduling algorithms described in this chapter, except in describing the scheduling algorithm of Linux

6.3 Scheduling Environments

6.3 Scheduling Environments
Name | Environment | Role
Long term scheduler | Batch oriented OS | Control the job mix in memory to balance use of system resources (CPU, memory, I/O)
Loader | In every OS | Load user program from disk into memory
Medium term scheduler | Every modern OS (timeshared, interactive) | Balance the mix of processes in memory to avoid thrashing
Short term scheduler | Every modern OS (timeshared, interactive) | Schedule the memory resident processes on the CPU
Dispatcher | In every OS | Populate the CPU registers with the state of the process selected for running by the short-term scheduler

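6.3 Scheduling Environments Process States
[Figure: process state diagram.]
• New → Ready: Admitted
• Ready → Running: Scheduler Dispatch
• Running → Ready: Interrupt
• Running → Waiting: I/O or Event Wait
• Waiting → Ready: I/O or Event Completion
• Running → Halted: Exit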

6.4 Scheduling Basics

6.4 Scheduling Basics
• Schedulers come in two basic flavors
– Preemptive
– Non-preemptive
• Basic scheduler steps
1. Grab the attention of the processor.
2. Save the state of the currently running process.
3. Select a new process to run.
4. Dispatch the newly selected process to run on the processor.

6.4 Scheduling Basics • What information is important to know about a process?

6.4 Scheduling Basics • Process Control Block

#define NUMREGS 16                  /* assumed: number of general-purpose registers */
typedef unsigned int address;       /* assumed: a memory address type */

enum state_type {new, ready, running, waiting, halted};

typedef struct control_block_type {
    struct control_block_type *next_pcb;   /* list ptr */
    enum state_type state;                  /* current state */
    address PC;                             /* where to resume */
    int reg_file[NUMREGS];                  /* contents of GPRs */
    int priority;                           /* extrinsic property */
    address address_space;                  /* where in memory */
    /* … */
} control_block;

[Figure: a PCB drawn as a box holding next_pcb and info…]

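6.4 Scheduling Basics
[Figure: queueing diagram. Processes wait in the ready queue for the CPU. A process leaves the CPU on an I/O request (it joins the I/O queue and then performs I/O), when its time slice expires, when it forks a child (the child executes), or when it waits for an interrupt (the interrupt occurs); each path leads back to the ready queue. Partially executed swapped-out processes also re-enter the ready queue.]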

6.4 Scheduling Basics
Name | Description
CPU burst | Continuous CPU activity by a process before requiring an I/O operation
I/O burst | Activity initiated by the CPU on an I/O device
PCB | Process control block that holds the state of a process (i.e., program in execution)
Ready queue | Queue of PCBs that represent the set of memory resident processes that are ready to run on the CPU
I/O queue | Queue of PCBs that represent the set of memory resident processes that are waiting for some I/O operation either to be initiated or completed
Non-Preemptive algorithm | Algorithm that allows the currently scheduled process on the CPU to voluntarily relinquish the processor (either by terminating or making an I/O system call)
Preemptive algorithm | Algorithm that forcibly takes the processor away from the currently scheduled process in response to an external event (e.g. I/O completion interrupt, timer interrupt)
Thrashing | A phenomenon wherein the dynamic memory usage of the processes currently in the ready queue exceeds the total memory capacity of the system

6.5 Performance Metrics • System Centric. – CPU Utilization: Percentage of time the processor is busy. – Throughput: Number of jobs executed per unit time. – Average turnaround time: Average elapsed time for jobs entering and leaving the system. – Average waiting time: Average of amount of time each job waits while in system

• User Centric – Response time: Time until system responds to user.

6.5 Performance Metrics • Two other qualitative issues – Starvation: The scheduling algorithm prevents a process from ever completing – Convoy Effect: The scheduling algorithm allows long-running jobs to dominate the CPU

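6.5 Performance Metrics
[Figure: timeline of jobs P1, P2 and P3 run one after another, marking for each job its wait time (w1, w2, w3), execution time (e1, e2, e3) and elapsed time (t1, t2, t3).]
wi, ei, and ti are, respectively, the wait time, execution time, and elapsed time for a job ji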

6.5 Performance Metrics
[Figure: timeline of P1, P2 and P3 on the CPU, with time marks at 5, 9, 12, 14 and 19. Assume times are in ms.]

6.5 Performance Metrics • System Centric. – CPU Utilization: – Throughput: – Average turnaround time: – Average waiting time

• User Centric – Response time:

6.5 Performance Metrics • Assumptions for following slides – Context switch time is negligible – Single I/O queue – Simple model (first-come-first-served) for scheduling I/O requests.

6.6 Non-preemptive Scheduling Algorithms • Non-preemptive means that once a process is running it will continue to do so until it relinquishes control of the CPU. This would be because it terminates, voluntarily yields the CPU to some other process (waits) or requests some service from the operating system.

6.6.1 First-Come First-Served (FCFS)
• Intrinsic property: Arrival time
• May exhibit convoy effect
• No starvation
• High variability of average waiting time

6.6.2 Shortest Job First (SJF)
• Uses anticipated burst time
• No convoy effect
• Provably optimal for best average waiting time
• May suffer from starvation – May be addressed with aging rules

6.6.3 Priority • Each process is assigned a priority • May have additional policy such as FCFS for all jobs with same priority • Attractive for environments where different users will pay more for preferential treatment • SJF is a special case with Priority=1/burst time • FCFS is a special case with Priority = arrival time

6.7 Preemptive Scheduling Algorithms • Two simultaneous implications: – Scheduler is able to assume control of the processor at any time, unbeknownst to the currently running process. – Scheduler is able to save the state of the currently running process for proper resumption from the point of preemption.

• Any of the Non-preemptive algorithms can be made Preemptive

6.7.1 Round Robin Scheduler • Appropriate for time-sharing environments • Need to determine time quantum q: Amount of time a process gets before being context switched out (also called timeslice) – Context switching time becomes important

• FCFS is a special case with q = ∞ • If n processes are running under round robin they will have the illusion they have exclusive use of a processor running at 1/n times the actual processor speed

6.7.1.1 Details of Round Robin Algorithm • What do we mean by context?

• How does the dispatcher get run?

• How does the dispatcher switch contexts?

6.7.1.1 Details of Round Robin Algorithm • Round Robin Scheduling Algorithm

Dispatcher: get head of ready queue; set timer; dispatch;

Timer interrupt handler: save context in PCB; move PCB to the end of the ready queue; upcall to dispatcher;

I/O request trap: save context in PCB; move PCB to I/O queue; upcall to dispatcher;

I/O completion interrupt handler: save context in PCB; move PCB of I/O completed process to ready queue; upcall to dispatcher;

Process termination trap handler: free PCB; upcall to dispatcher;
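The dispatcher and timer-interrupt handler above can be sketched as a small simulation. This is an illustrative sketch only: it assumes CPU-bound processes that are all ready at time 0 (so the I/O handlers never fire), negligible context-switch time, and a hypothetical bound MAX_PROCS on the number of processes. The function name round_robin is likewise hypothetical.

```c
#include <assert.h>

#define MAX_PROCS 16   /* hypothetical upper bound for this sketch */

/* Simulate round robin with time quantum q for n CPU-bound processes,
 * all ready at time 0. burst[i] is the CPU time process i needs;
 * finish[i] receives its completion time. */
void round_robin(const int burst[], int n, int q, int finish[])
{
    assert(n > 0 && n <= MAX_PROCS && q > 0);
    int remaining[MAX_PROCS];
    int queue[MAX_PROCS];      /* circular ready queue of process ids */
    int head = 0, count = n, clock = 0;

    for (int i = 0; i < n; i++) {
        remaining[i] = burst[i];
        queue[i] = i;
    }
    while (count > 0) {
        /* Dispatcher: take the process at the head of the ready queue. */
        int p = queue[head];
        head = (head + 1) % MAX_PROCS;
        count--;

        /* "Set timer": run for one quantum or until the process is done. */
        int slice = remaining[p] < q ? remaining[p] : q;
        clock += slice;
        remaining[p] -= slice;

        if (remaining[p] > 0) {
            /* Timer interrupt: move the process to the end of the queue. */
            queue[(head + count) % MAX_PROCS] = p;
            count++;
        } else {
            /* Process termination: record when it finished. */
            finish[p] = clock;
        }
    }
}
```

With a very large q every process finishes within its first quantum, so the schedule degenerates to FCFS, matching the "q = ∞" special case noted earlier.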

6.8 Combining Priority and Preemption • Modern general purpose operating systems such as Windows NT/XP/Vista and Unix/Linux use multi-level feedback queues • System consists of a number of different queues each with a different expected quantum time • Each individual queue uses FCFS except base queue which uses Round Robin

6.8 Combining Priority and Preemption q1 A process that doesn’t finish before qi drops down 1 level

Note: q1 …

Allocation table (fragment)
Start address   Size   Process
…               …      FREE
8K              3K     P3
11K             2K     FREE

As P2 terminates the memory allocator can look for adjacent free blocks

7.3.2 Variable Size Partitions After coalescing adjacent free blocks

Allocation table
Start address   Size   Process
0               8K     FREE
8K              3K     P3
11K             2K     FREE
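Coalescing can be sketched as a single pass over an allocation table that is kept sorted by start address. The struct layout and the function name coalesce below are assumptions made for illustration; sizes and addresses are in KB.

```c
#include <assert.h>
#include <stdbool.h>

/* One row of the allocation table (hypothetical layout, units of KB). */
struct block {
    int start;
    int size;
    bool is_free;
    int pid;        /* owning process; meaningful only when !is_free */
};

/* Merge runs of adjacent free blocks in a table sorted by start address.
 * Compacts the table in place and returns the new number of rows. */
int coalesce(struct block table[], int n)
{
    assert(n >= 0);
    int out = 0;    /* number of rows kept so far */
    for (int i = 0; i < n; i++) {
        if (out > 0 && table[out - 1].is_free && table[i].is_free &&
            table[out - 1].start + table[out - 1].size == table[i].start) {
            /* This free block starts where the previous free block ends:
             * absorb it into the previous row. */
            table[out - 1].size += table[i].size;
        } else {
            table[out++] = table[i];
        }
    }
    return out;
}
```

For example, if a table holds free blocks at 0 (2K) and 2K (6K) followed by P3 at 8K and a free block at 11K, the first two rows merge into one 8K free block at address 0. The 2K and 6K split is an assumed starting point; the merged result matches the table above. Keeping the table indexed by start address is what makes this neighbor check cheap.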

7.3.2 Variable Size Partitions • When space is requested several possible options – Best Fit • Lower internal fragmentation • Longer search time • Table may be indexed by start address which is good for coalescing • Table may be indexed by size which is faster for allocation

– First Fit • Faster allocation • Table may be indexed by start address which is good for coalescing
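The two placement policies differ only in how the table of free blocks is searched. The sketch below assumes a plain array of free regions; struct hole and the function names are hypothetical, and units are KB.

```c
#include <assert.h>

/* A free region of memory (a "hole"); hypothetical layout, units of KB. */
struct hole {
    int start;
    int size;
};

/* First fit: index of the first hole large enough, or -1 if none.
 * The search can stop at the first match. */
int first_fit(const struct hole holes[], int n, int request)
{
    assert(request > 0);
    for (int i = 0; i < n; i++)
        if (holes[i].size >= request)
            return i;
    return -1;
}

/* Best fit: index of the smallest hole that is large enough, or -1.
 * Every hole must be examined unless the table is kept sorted by size. */
int best_fit(const struct hole holes[], int n, int request)
{
    assert(request > 0);
    int best = -1;
    for (int i = 0; i < n; i++)
        if (holes[i].size >= request &&
            (best < 0 || holes[i].size < holes[best].size))
            best = i;
    return best;
}
```

With holes of 8K, 2K, and 3K and a 2K request, first fit picks the 8K hole (leaving a 6K remainder) while best fit picks the 2K hole exactly, at the cost of scanning the whole table.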

7.3.3 Compaction • If fragmentation becomes excessive we can compact memory by moving processes • This is virtually impossible with modern architectures! – Base register concept

Allocation table
Start address   Size   Process
0               3K     P3
3K              10K    FREE

7.4 Paged Virtual Memory • As memory size increases, the problem of external fragmentation increases • Want to attack the problem of external fragmentation • The user views the memory partition as contiguous memory • But does it really have to be contiguous?

7.4 Paged Virtual Memory [Figure: the CPU issues a virtual address to a BROKER, which sends the corresponding physical address to Memory]

• Need a system to take user virtual addresses and translate into physical address corresponding to the physical memory present

7.4 Paged Virtual Memory • Conceptually break up both logical (virtual) memory and physical memory into equal sized blocks [Figure: logical memory divided into pages and physical memory divided into frames; Page Size = Frame Size]

7.4 Paged Virtual Memory • Where is logical memory? • Need mechanism to translate from logical pages to physical frames [Figure: the CPU issues a virtual address to the BROKER, which uses the Page Table to send a physical address to Memory]

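7.4 Paged Virtual Memory [Figure: the user's view (Page 0 to Page 3) is mapped through the page table to frames scattered between the LOW and HIGH ends of physical memory]

Page Table
Page 0   frame 35
Page 1   frame 52
Page 2   frame 12
Page 3   frame 15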

Key Point • The user still perceives a contiguous memory space • The space is not necessarily contiguous in physical memory • External fragmentation is eliminated!

7.4.1 Page Table • Suppose pages and frames are 4096 bytes long. • What do logical addresses look like?

Decimal   Binary (32 bits)
0         00000000000000000000000000000000
4095      00000000000000000000111111111111
4096      00000000000000000001000000000000
32456     00000000000000000111111011001000

7.4.1 Page Table • Suppose pages and frames are 4096 bytes long. • What do addresses look like? With 4096-byte pages, the low 12 bits (three hex digits) are the offset and the remaining high bits are the virtual page number.

Decimal   Hexadecimal   Virtual Page Number   Offset
0         0x00000000    0x00000               0x000
4095      0x00000FFF    0x00000               0xFFF
4096      0x00001000    0x00001               0x000
32456     0x00007EC8    0x00007               0xEC8

7.4.1 Page Table • Assume page/frame size is 4096 bytes • Assume 32 bit virtual address (4 GB) • Assume 28 bit physical address (256 MB) • What is the layout of the virtual address and the physical address? • How does a virtual address like 0x3E1234 get translated into a physical address?

[Figure: the CPU issues virtual address 0x003E1 234 (virtual page number 0x003E1, offset 0x234)]

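7.4.1 Page Table [Figure: the PTBR points to the page table in memory; the table has a column labeled v; the entry for VPN 0x3E1 holds PFN 0x0044, so the physical address is 0x0044 234]

Page table
VPN     PFN
0x0     0x0023
0x1     0x0124
0x2     0x1111
0x3     0x3F04
0x3E0   0x0000
0x3E1   0x0044
0x3E2   0x0068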
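The translation in this example can be sketched in C as follows. It assumes the parameters stated on the slide (4096-byte pages, 32-bit virtual addresses, 28-bit physical addresses, hence 16-bit PFNs) and a flat page table indexed by VPN; the function names and the table_len bounds check are hypothetical additions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                        /* 4096-byte pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)  /* low 12 bits = offset */

/* Split a 32-bit virtual address into its two fields. */
uint32_t vpn_of(uint32_t vaddr)    { return vaddr >> PAGE_SHIFT; }
uint32_t offset_of(uint32_t vaddr) { return vaddr & PAGE_MASK; }

/* Translate using a flat page table indexed by VPN; each entry holds a
 * 16-bit PFN. table_len is the number of entries the table really has. */
uint32_t translate(const uint16_t page_table[], uint32_t table_len,
                   uint32_t vaddr)
{
    uint32_t vpn = vpn_of(vaddr);
    assert(vpn < table_len);
    uint32_t pfn = page_table[vpn];
    /* Physical address = PFN concatenated with the unchanged offset. */
    return (pfn << PAGE_SHIFT) | offset_of(vaddr);
}
```

With the table above, virtual address 0x3E1234 has VPN 0x3E1 and offset 0x234; entry 0x3E1 holds PFN 0x0044, so the result is 0x0044234. In hardware the table lookup is itself a memory access, which is what the question in the next slide is getting at.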

7.4.2 Hardware for Paging • PTBR • Translation hardware VPN to PFN • Page table is in kernel memory space • Note: each process has a page table • How many memory accesses are required for each memory request by the CPU?

7.4.3 Page Table Set up

typedef struct control_block_type {
    enum state_type state;
    address PC;
    int reg_file[NUMREGS];
    struct control_block_type *next_pcb;  /* link to the next PCB in a queue */
    int priority;
    address PTBR;                         /* page table base register for this process */
    … …
} control_block;

7.5 Segmented Virtual Memory • Segmentation is a system allowing a process's memory space to be subdivided into chunks of memory each associated with some aspect of the overall program • Typical segments – Code – Global data – Heap – Stack

7.5 Segmented Virtual Memory • Process address space divided up into n distinct segments • Each segment has – A number – A size

• Each segment starts at its own address 0 and goes up to its size – 1. • Segment addressing

7.5.1 Hardware for Segmentation

7.6 Paging versus Segmentation

Attribute: User shielded from size limitation of physical memory
  Paging: Yes
  Segmentation: Yes

Attribute: Relationship to physical memory
  Paging: Physical memory may be less than or greater than virtual memory
  Segmentation: Physical memory may be less than or greater than virtual memory

Attribute: Address spaces per process
  Paging: One
  Segmentation: Several

Attribute: Visibility to the user
  Paging: User unaware of paging; user is given an illusion of a single linear address space
  Segmentation: User aware of multiple address spaces each starting at address 0

Attribute: Software engineering
  Paging: No obvious benefit
  Segmentation: Allows organization of the program components into individual segments at user discretion; enables modular design; increases maintainability

Attribute: Program debugging
  Paging: No obvious benefit
  Segmentation: Aided by the modular design

Attribute: Sharing and protection
  Paging: User has no direct control; operating system can facilitate sharing and protection of pages across address spaces but this has no meaning from the user’s perspective
  Segmentation: User has direct control of orchestrating the sharing and protection of individual segments; especially useful for object-oriented programming and development of large software

Attribute: Size of page/segment
  Paging: Fixed by the architecture
  Segmentation: Variable chosen by the user for each individual segment

Attribute: Internal fragmentation
  Paging: Internal fragmentation possible for the portion of a page that is not used by the address space
  Segmentation: None

Attribute: External fragmentation
  Paging: None
  Segmentation: External fragmentation possible since the variable sized segments have to be allocated in the available physical memory thus creating holes (see Figure 7.18)

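7.6.1 Interpreting the CPU generated address

Memory System: Segmentation
  Virtual Address Computation: Segment-Number, Segment Offset
  Physical Address Computation: Segment Start Address = Segment-Table[Segment-Number]; Physical address = Segment Start Address + Segment Offset
  Size of Tables: Segment table size = 2^nseg entries, where nseg is the number of bits in the segment number

Memory System: Paging
  Virtual Address Computation: VPN, page offset
  Physical Address Computation: PFN = Page-Table[VPN]; Physical address = PFN concatenated with the page offset
  Size of Tables: Page table size = 2^nVPN entries, where nVPN is the number of bits in the VPN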

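7.7 Summary

Memory management criteria by scheme

User/Kernel Separation
  Improved resource utilization: No
  Independence and protection: No
  Liberation from resource limitation: No
  Sharing by concurrent processes: No
  Facilitates good software engineering practice: No

Fixed Partition
  Improved resource utilization: Internal fragmentation bounded by partition size; External fragmentation
  Independence and protection: Yes
  Liberation from resource limitation: No
  Sharing by concurrent processes: No
  Facilitates good software engineering practice: No

Variable-sized Partition
  Improved resource utilization: External fragmentation
  Independence and protection: Yes
  Liberation from resource limitation: No
  Sharing by concurrent processes: No
  Facilitates good software engineering practice: No

Paged Virtual Memory
  Improved resource utilization: Internal fragmentation bounded by page size
  Independence and protection: Yes
  Liberation from resource limitation: Yes
  Sharing by concurrent processes: Yes
  Facilitates good software engineering practice: No

Segmented Virtual Memory
  Improved resource utilization: External fragmentation
  Independence and protection: Yes
  Liberation from resource limitation: Yes
  Sharing by concurrent processes: Yes
  Facilitates good software engineering practice: Yes

Paged-segmented Virtual Memory
  Improved resource utilization: Internal fragmentation bounded by page size
  Independence and protection: Yes
  Liberation from resource limitation: Yes
  Sharing by concurrent processes: Yes
  Facilitates good software engineering practice: Yes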

7.7 Summary

Scheme: User/Kernel Separation
  Hardware Support: Fence register
  Still in Use?: No

Scheme: Fixed Partition
  Hardware Support: Bounds registers
  Still in Use?: Not in any production operating system

Scheme: Variable-sized Partition
  Hardware Support: Base and limit registers
  Still in Use?: Not in any production operating system

Scheme: Paged Virtual Memory
  Hardware Support: Page table and page table base register
  Still in Use?: Yes, in most modern operating systems

Scheme: Segmented Virtual Memory
  Hardware Support: Segment table and segment table base register
  Still in Use?: Segmentation in this pure form is not supported in any commercially popular processor

Scheme: Paged-segmented Virtual Memory
  Hardware Support: Combination of the hardware for paging and segmentation
  Still in Use?: Yes, most modern operating systems based on Intel x86 use this scheme [1]

[1] It should be noted that Intel’s segmentation is quite different from the pure form of segmentation presented in this chapter. Please see Section 7.8.2 for a discussion of Intel’s paged-segmentation scheme.

7.8 Historical Perspective • Burroughs Corporation introduced segmented virtual memory in B5000 line of machines • GE, in partnership with MULTICS project at MIT introduced paged-segmentation in GE 600 line of machines • IBM introduces system/360 with base and limit registers. Relocation system not effective since base register visible to programmers • IBM introduces system/370 with true virtual memory which eventually dominates market

7.8.1 MULTICS • Some academic computing projects have a profound impact on evolution of field for a very long time. • MULTICS project at MIT was one such. • Unix, Linux, Paging, Segmentation, Security, Protection, etc. had their birth in this project. • OS concepts introduced in MULTICS project were way ahead of their time and processor architectures of that time were not geared to support advanced concepts of memory protection advocated by MULTICS. • MULTICS introduced the concept of paged segmentation.

7.8.2 Intel’s Memory Architecture • Intel Pentium line of processors uses paged-segmentation. • Approximately, a virtual address is a segment selector plus an offset • Total segment space divided into two halves – System segments are common to all processes and are used by OS – User segments are unique to each process. • Two descriptor tables – Global Descriptor Table (GDT) common to all processes – Local Descriptor Table (LDT) for each process • A bit in the segment selector identifies whether the segment being named by the virtual address is a system or a user segment. • Segment descriptor for the selected segment contains the details for translating the offset specified in the virtual address to a physical address. • Choice – Use simple segmentation w/o any paging (compatible with earlier processors) – Use paged-segmentation.

Chapter 8 Topics in Page-based Memory Management

8.1 Demand Paging • Paging as described in Chapter 7 implied that the whole program was in memory • But does it have to be? • On average – 30% of a program's memory footprint is the primary logic of the program – 70% is little-used error handling

• Therefore, prudent for memory manager not to load entire program into memory on startup. • Basic idea is to load parts of the program that are not in memory on demand. • This technique, referred to as demand paging results in better memory utilization.

What would be the main advantage of demand paging?

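8.1.1 Hardware for demand paging [Figure: the address-translation diagram from Section 7.4.1: the CPU issues virtual address 0x003E1 234, the PTBR locates the page table (with a column labeled v), and the entry for VPN 0x3E1 yields physical address 0x0044 234 in Memory]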

8.1.1 Hardware for demand paging [Figure: five-stage pipeline (IF, ID/RR, EX, MEM, WB) with buffers between stages; instructions I5, I4, I3, I2, and I1 occupy the stages from IF to WB]

• Potential page faults – If I5 page faults…handle – If I2 page faults • Let I1 complete and squash I3-I5 before INT. • INT needs to save PC corresponding to I2 for re-starting I2 after servicing page fault. • Note that there is no harm in squashing instructions I3-I5 since they have not modified permanent state of program

8.1.2 Page fault handler 1. Find a free page frame 2. Load the faulting virtual page from the disk into the free page frame 3. Update the page table for the faulting process 4. Place the PCB of the process back in the ready queue of the scheduler

8.1.3 Data structures for Demand-paged Memory Management • Free-list of page frames • Frame table • Disk map

8.1.3 Data structures for Demand-paged Memory Management • Free-list of page frames [Figure: free-list linking Pframe 52, Pframe 20, Pframe 200, and further frames through Pframe 8]

8.1.3 Data structures for Demand-paged Memory Management • Frame table [Figure: frame table indexed by physical frame number (Pframe 0 to 7), with some entries marked free]

8.1.3 Data structures for Demand-paged Memory Management • Disk map [Figure: the disk map for P1 gives a disk address for each virtual page (VPN 0 to 7); the swap space on disk holds the images of P1, P2, …, Pn]

8.1.4 Anatomy of a Page Fault • Find a free page frame • Pick victim page and evict • Load faulting page • Update page table for faulting process and frame table • Restart faulting process

Eviction • When evicting a page we must consider its status – Clean: the page has not been written to and thus matches its counterpart on disk – Dirty: the page has been written to and no longer matches what is on disk

• Clean pages may be simply evicted • Dirty pages must be written back to disk

8.2 Interaction between the Process Scheduler and Memory Manager [Figure: user-level processes 1 to n; in the kernel, the CPU scheduler (ready_q of PCBs) and the memory manager (frame table FT, freelist of Pframes, and a page table PT and disk map DM per process); in hardware, process dispatch to the CPU, with a timer interrupt causing upcall (1) and a page fault causing upcall (2)]

The CPU scheduler dispatches a process, which runs until one of the following happens:
1. HW timer interrupts CPU, causing upcall (1) to CPU scheduler that may result in a process switch. CPU scheduler takes appropriate action to schedule next process on CPU.
2. Process incurs a page fault, resulting in an upcall (2) to memory manager that results in page fault handling.
3. Process makes system call, resulting in another subsystem (not shown) getting an upcall.

8.3 Page Replacement Policies • How to pick victim page to evict from physical memory when page fault & free-list is empty. • For a given string of page references, policy should result in least number of page faults. – This attribute ensures that the amount of time spent in OS dealing with page faults is minimized.

• Ideally, once a particular page has been brought into physical memory, policy should not incur a page fault for same page again. – This attribute ensures that page fault handler attempts to respect reference pattern of the user programs.

8.3 Page Replacement Policies • May use – Local victim selection • Simple • Don't need frame table • Poor memory utilization

– Global victim selection • Better memory utilization • The norm

• Ideally there are no page faults and memory manager never runs. • Goal is to minimize (or eliminate) page faults

8.3.1 Belady’s Min • In 1966 Laszlo Belady proposed an optimal page replacement algorithm that requires knowing the page reference string in advance • Obviously this is impossible • But the performance of Belady's Min may be used as a reference standard against which to compare the performance of other policies

8.3.2 First In First Out (FIFO) • Affix a timestamp when a page is brought in to physical memory • If a page has to be replaced, choose the longest resident page as the victim • No special hardware needed • Queue length is number of physical frames

[Figure: circular queue with Head and Tail pointers, showing full and free entries]

FIFO • Maintain queue. As page is read in enqueue. Use head of queue as frame to replace • Sample 1,2,3,4,1,2,5,1,2,3,4,5
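The FIFO policy on the sample reference string can be checked with a short simulation. This is a sketch under simple assumptions: pages are identified by small integers, all frames start empty, and MAX_FRAMES is a hypothetical bound chosen for the example. The function name fifo_faults is also hypothetical.

```c
#include <assert.h>

#define MAX_FRAMES 16   /* hypothetical upper bound for this sketch */

/* Count page faults for FIFO replacement with nframes physical frames. */
int fifo_faults(const int refs[], int nrefs, int nframes)
{
    assert(nframes > 0 && nframes <= MAX_FRAMES);
    int frame[MAX_FRAMES];   /* circular queue of resident pages */
    int resident = 0;        /* number of frames in use so far */
    int head = 0;            /* oldest page: the next victim once full */
    int faults = 0;

    for (int i = 0; i < nrefs; i++) {
        int hit = 0;
        for (int j = 0; j < resident; j++)
            if (frame[j] == refs[i]) { hit = 1; break; }
        if (hit)
            continue;        /* a hit does not change FIFO order */
        faults++;
        if (resident < nframes) {
            frame[resident++] = refs[i];   /* fill a free frame */
        } else {
            frame[head] = refs[i];         /* replace longest-resident page */
            head = (head + 1) % nframes;
        }
    }
    return faults;
}
```

Running this on 1,2,3,4,1,2,5,1,2,3,4,5 gives 9 faults with three frames and 10 faults with four frames: adding a frame makes FIFO worse on this string.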

FIFO [Figure: frame contents over time for the reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 under FIFO replacement]

Belady’s Anomaly [Figure: FIFO frame contents over time for the reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]

8.3.3 Least Recently Used (LRU) • LRU policy makes assumption that if a page has not been referenced in a long time there is a good chance it will not be referenced in the future as well. • Thus, victim page in LRU policy is page that has not been used for longest time.
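True LRU can be modeled in software with an array acting as the push down stack. The sketch below is for illustration only and says nothing about how hardware would do it; MAX_FRAMES and the function name lru_faults are hypothetical, and all frames are assumed empty at the start.

```c
#include <assert.h>

#define MAX_FRAMES 16   /* hypothetical upper bound for this sketch */

/* Count page faults for true LRU with nframes physical frames.
 * stack[0] is the most recently used page; stack[depth-1] is the
 * least recently used page (the victim when memory is full). */
int lru_faults(const int refs[], int nrefs, int nframes)
{
    assert(nframes > 0 && nframes <= MAX_FRAMES);
    int stack[MAX_FRAMES];
    int depth = 0, faults = 0;

    for (int i = 0; i < nrefs; i++) {
        int pos = -1;
        for (int j = 0; j < depth; j++)
            if (stack[j] == refs[i]) { pos = j; break; }
        if (pos < 0) {              /* page fault */
            faults++;
            if (depth < nframes)
                depth++;            /* a free frame is available */
            pos = depth - 1;        /* new slot, or the bottom entry (victim) */
        }
        for (int j = pos; j > 0; j--)   /* push entries above pos down by one */
            stack[j] = stack[j - 1];
        stack[0] = refs[i];         /* referenced page goes on top */
    }
    return faults;
}
```

On 1,2,3,4,1,2,5,1,2,3,4,5 this yields 10 faults with three frames and 8 with four. Note that the stack is rearranged on every reference, hit or miss, which is the cost that makes a direct hardware implementation unattractive. Cycling through N+1 pages with N frames (for example 1,2,3,4 repeated with three frames) faults on every reference.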

[Figure: push down stack of page frames with Top and Bottom pointers and free entries]

8.3.3 Least Recently Used (LRU) [Figure: physical frames and push down stack contents over time for the reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]

8.3.3 Least Recently Used (LRU) • LRU is appealing but actually not feasible – Stack has as many entries as number of physical frames. For a physical memory of 64 MB and an 8 KB page size, the stack needs 64 MB / 8 KB = 8 K entries. Too big in datapath! – On every access, hardware has to modify stack to place current reference on top of stack. Too slow.

• LRU may be bad choice in certain situations – e.g. Access N+1 pages in a processor with N frames available

8.3.3.1 Approximate LRU: A Small Hardware Stack • Add a hardware stack with ~16 entries • Push references onto stack – If they are already in stack bring to top – Bottom reference falls out of stack

• When free frame needed randomly select one not in stack • Shown to be successful in some applications • Probably not fast enough for high speed pipelined processor

8.3.3.2 Approximate LRU: Reference bit per page frame • Associate a bit with each frame. – hardware sets on reference – software reads and clears

• Have an n-bit counter register for each frame • Periodically (daemon) right shifts all counters and puts reference bit into high order bit • Highest value counters are recently used frames; lowest value counters are LRU frames
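The reference-bit scheme above can be sketched as follows, taking n = 8 so that each counter fits in a byte. The struct layout and function names are hypothetical, and the hardware's setting of ref_bit is simulated by the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame bookkeeping (hypothetical layout). */
struct frame_info {
    uint8_t ref_bit;   /* set by hardware on any reference to the frame */
    uint8_t counter;   /* 8-bit history; most recent interval in the MSB */
};

/* Periodic daemon step: shift each history right by one, move the
 * reference bit into the high-order bit, then clear the reference bit. */
void age_frames(struct frame_info f[], int n)
{
    for (int i = 0; i < n; i++) {
        f[i].counter = (uint8_t)((f[i].counter >> 1) |
                                 (f[i].ref_bit ? 0x80 : 0x00));
        f[i].ref_bit = 0;
    }
}

/* Approximate LRU victim: the frame with the smallest counter value.
 * Ties are broken in favor of the lowest frame index. */
int pick_victim(const struct frame_info f[], int n)
{
    assert(n > 0);
    int victim = 0;
    for (int i = 1; i < n; i++)
        if (f[i].counter < f[victim].counter)
            victim = i;
    return victim;
}
```

Because the newest reference bit lands in the high-order position, a frame referenced in the latest interval always outranks one referenced only in earlier intervals, so comparing counters as unsigned integers orders frames roughly by recency.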

8.3.4 Second chance page replacement algorithm • Initially, OS clears reference bits of all frames. As program executes, hardware sets reference bits for pages referenced by program. • If a page has to be replaced, memory manager chooses replacement candidate in FIFO manner. • If chosen victim’s reference bit is set, then manager clears reference bit and this page is moved to end of FIFO queue. • The victim is first candidate in FIFO order whose reference bit is not set.
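Victim selection under second chance can be sketched with the frames held in a circular array. Advancing a pointer past a frame is used here as a stand-in for "moving the page to the end of the FIFO queue"; that equivalence, the array representation, and the function name second_chance are assumptions of this sketch.

```c
#include <assert.h>

/* Second-chance victim selection over a circular FIFO of frames.
 * ref_bit[i] is the reference bit of frame i; *hand indexes the frame
 * currently at the head of the FIFO order. Returns the victim frame
 * and leaves *hand pointing just past it. */
int second_chance(int ref_bit[], int nframes, int *hand)
{
    assert(nframes > 0 && *hand >= 0 && *hand < nframes);
    for (;;) {
        int candidate = *hand;
        *hand = (*hand + 1) % nframes;   /* candidate goes to the end of the queue */
        if (ref_bit[candidate] == 0)
            return candidate;            /* not referenced recently: evict */
        ref_bit[candidate] = 0;          /* referenced: clear bit, spare it once */
    }
}
```

The loop always terminates: in the worst case every bit is set, the first sweep clears them all, and the frame the sweep started at is chosen on the second pass, which is plain FIFO behavior.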

8.3.5 Review of page replacement algorithms

Page replacement algorithm: FIFO
  Hardware assist needed: None
  Comments: Could lead to anomalous behavior

Page replacement algorithm: Belady’s MIN
  Hardware assist needed: Oracle
  Comments: Provably optimal performance; not realizable in hardware; useful as a standard for performance comparison

Page replacement algorithm: True LRU
  Hardware assist needed: Push down stack
  Comments: Expected performance close to optimal; infeasible for hardware implementation due to space and time complexity; worst-case performance may be similar or even worse compared to FIFO

Page replacement algorithm: Approximate LRU #1
  Hardware assist needed: A small hardware stack
  Comments: Expected performance close to optimal; worst-case performance may be similar or even worse compared to FIFO

Page replacement algorithm: Approximate LRU #2
  Hardware assist needed: Reference bit per page
  Comments: Expected performance close to optimal; moderate hardware complexity; worst-case performance may be similar or even worse compared to FIFO

Page replacement algorithm: Second Chance Replacement
  Hardware assist needed: Reference bit per page
  Comments: Expected performance better than FIFO; memory manager implementation simplified compared to LRU schemes

8.3.6 Optimizing Memory Management • Beyond the basic techniques presented, additional optimizations are possible • These optimizations are layered on top of the techniques already presented

8.3.7 Pool of free page frames • Instead of waiting for the free page count to reach 0 • Periodically run a daemon to evict pages, keeping a pool of n free pages

8.3.7.1 Overlapping I/O with Processing • Upon eviction we add the evicted frame to the free list • If the frame was dirty it is scheduled for write back • When a frame is needed only clean frames are selected skipping over dirty frames still awaiting write back

8.3.7.2 Reverse Mapping to Page Tables • When daemon runs to maintain free list at a certain level it will take pages from processes that might turn around and page fault on those pages • If we maintain additional info in free list we can know this and give page back to process (since the data is still intact)

[Figure: freelist linking Dirty Pframe 52, Clean Pframe 22, Clean Pframe 200, ….]

8.3.8 Thrashing • Suppose many processes are in memory but CPU utilization is low. What could cause this? 1. Too many I/O bound processes? 2. Too many CPU bound processes?

• Should we add more processes into memory?

8.3.8 Thrashing • If some processes don't really have enough pages to support their current configuration they will constantly be page faulting and trying to grab frames from other processes • This may lead those processes to also start page faulting and grabbing frames • Does this sound like the ideal place to introduce more processes into memory?

8.3.8 Thrashing • Controlling thrashing starts with understanding temporal locality • Temporal locality is the tendency for the same memory location to be accessed repeatedly over a short period of time • During a given time period t, certain pages will be accessed and others will not • If during t the pages that need to be accessed are in memory, then no page faults will occur

[Figure: a memory reference string cycling through pages 1 to 5, with the resulting frame contents over time]

8.3.9 Working set • Working set is the set of pages that defines the locus of activity of a program • The working set size (WSS) denotes the number of distinct pages touched by a process in a window of time. • The total memory pressure (TMP) exerted on the system is the summation of the WSS of all the processes currently competing for resources.
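The definitions of WSS and TMP can be made concrete with a small sketch. It assumes the page reference history of a process is available as an array of page numbers and that the window is measured in references rather than wall-clock time; MAX_PAGES and both function names are hypothetical.

```c
#include <assert.h>

#define MAX_PAGES 1024   /* hypothetical bound on virtual page numbers */

/* Working set size at reference index `now`: the number of distinct
 * pages among the last `window` references, refs[now-window+1 .. now].
 * If fewer than `window` references exist, all of them are used. */
int working_set_size(const int refs[], int now, int window)
{
    assert(window > 0 && now >= 0);
    unsigned char seen[MAX_PAGES] = {0};
    int first = now - window + 1;
    if (first < 0)
        first = 0;
    int wss = 0;
    for (int i = first; i <= now; i++) {
        assert(refs[i] >= 0 && refs[i] < MAX_PAGES);
        if (!seen[refs[i]]) {      /* first time this page appears in the window */
            seen[refs[i]] = 1;
            wss++;
        }
    }
    return wss;
}

/* Total memory pressure: sum of the working set sizes of all processes. */
int total_memory_pressure(const int wss[], int nprocs)
{
    int tmp = 0;
    for (int i = 0; i < nprocs; i++)
        tmp += wss[i];
    return tmp;
}
```

Comparing total_memory_pressure against the number of physical frames gives the kind of test used to decide whether the degree of multiprogramming should go down or up.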

8.3.10 Controlling thrashing 1. If TMP > Physical Memory – Decrease the degree of multiprogramming. – Else: Increase

2. Monitor Page fault rate • if pfr>High • Decrease progs • if pfr