English Pages 756 [769] Year 2008
COMPUTER SYSTEMS An Integrated Approach to Architecture and Operating Systems
Chapter 1 Introduction
©Copyright 2008 Umakishore Ramachandran and William D. Leahy Jr.
What’s Inside the Box?
Levels of Abstraction
Hardware Software Interface
From Electrons & Holes to a Multiplayer Video Game
The Role of the Operating System • Resource manager • Provide consistent interface to resources • Job scheduler
CLIENT
1. Client Application (Halo 3): Player clicks mouse cursor on target
2. OS: Recognizes interrupt ("It's a mouse interrupt!"). Sends it to client application
3. Client Application creates message to send to server application
4. OS: Sends message to server
SERVER
5. OS: Receives message ("Got a message!"). Sends it to server application
6. Application examines message and state of game and determines Master Chief dies! Sends message back to client.
7. OS: Sends message to client
CLIENT
8. OS: Receives message and sends it to application
9. Client Application generates required images, etc. Sends I/O requests to OS
10. OS changes I/O devices to show Master Chief blowing up!!! ("Uh oh!")
What’s Happening Inside the Box? • Processor • Memory • I/O • Parallelism • Networking
Layers of Abstraction
• Application (Algorithms expressed in High Level Language)
• System software (Compiler, OS, etc.)
• Computer Architecture
• Machine Organization (Datapath and Control)
• Sequential and Combinational Logic Elements
• Logic Gates
• Transistors
• Solid-State Physics (Electrons and Holes)
Where Does This Course Fit? Fundamentals of Digital Electronic & Logic Design
Fundamentals of Programming
Integrated Approach to Computer Architecture and Operating Systems
Advanced Topics in Operating Systems
Advanced Topics in Computer Architecture
Advanced Topics in Computer Networks
Questions?
Computer Systems An Integrated Approach to Architecture and Operating Systems
Chapter 2 Processor Architecture
Overview • Architectural Issues – Instruction Set – Machine Organization
• Historical Perspective – Programming primarily in Assembly Language – Development of sophisticated compiler technology – Development of Operating Systems
Processor Design • Hardware resources are like letters • Instruction set is like words • Instruction set is key differentiation between different processors (e.g. Intel x86 & Power PC)
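[Figure: programs written in C, Fortran, Ada, Basic, Java, etc. pass through a compiler, an interpreter, or (for assembly language) an assembler; Java is compiled to byte code. The resulting executable targets the Instruction Set Architecture, which can be realized by HW Implementation 1, HW Implementation 2, ..., HW Implementation N.]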
Instruction Set Design Goals Maximize Performance
Easy to Build Compiler(s)
Easy to Build Hardware
Minimize Cost
High Level Language Constructs • High Level Language Constructs a = b + c; /* add b and c and place in a */ d = e - f; /* subtract f from e and place in d */ x = y & z; /* AND y and z and place in x */
• Assembly Language Constructs add a, b, c; sub d, e, f; and x, y, z;
a ← b + c; d ← e - f; x ← y & z
Where to Keep the Operands
[Figure: a processor connected to memory and devices; a second view shows the ALU and registers inside the processor, still connected to memory and devices.]
Memory Address Specification
[Figure: the processor loads register r2 with the value of b held in memory.]
• We need a way of specifying an address in an instruction • Problem: Addresses are as long as (if not longer than) the instruction itself • One solution: Store in the instruction the number of a register containing the address • Include an offset in the space left over
Base + Offset ld rdest, offset(rbase) ld r2, 3(r1) • Semantics – Value in rbase is added to offset forming an effective address – The contents of that effective address are fetched and placed into rdest
• Register transfer language r2←M[(r1) + 3]
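The base + offset semantics can be sketched in C. This is only an illustration: the function names and the toy word-addressed memory array are assumptions, not part of the slides.

```c
#include <stdint.h>

/* Hypothetical toy memory size, in words (illustration only). */
enum { MEM_WORDS = 16 };

/* Effective address = value held in the base register + signed offset. */
uint32_t effective_address(uint32_t rbase_value, int32_t offset)
{
    return rbase_value + (uint32_t)offset;
}

/* ld rdest, offset(rbase): fetch the contents of the effective address.
 * The caller places the returned value in rdest. */
uint32_t load_word(const uint32_t mem[], uint32_t rbase_value, int32_t offset)
{
    return mem[effective_address(rbase_value, offset)];
}
```

With r1 holding 5, `load_word(mem, 5, 3)` models `ld r2, 3(r1)` and reads word 8 of the toy memory.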
Operand Width • How many bits does a load instruction fetch from memory? • How many should it fetch? 1? 4? 8? 16? 32? 64? More?
• High level languages typically support multiple data types, so for efficiency most processors can fetch different sized “chunks” of data • The minimum is usually 8 bits; the maximum has continually increased.
Other Questions • On what size data can arithmetic & logical operations be performed? • What size data can the processor move to and from memory? • What size data can a register hold?
Endianness • How should bytes be numbered within a word? • And why does it matter?
Byte addresses within each word, left to right as drawn:
Word   Big Endian         Little Endian
100    100 101 102 103    103 102 101 100
104    104 105 106 107    107 106 105 104
108    108 109 110 111    111 110 109 108
112    112 113 114 115    115 114 113 112
Endianness • Different manufacturers have “standardized” on different endiannesses • Normally on a single machine this is not an issue. – Especially if data is handled commensurate with the way it was declared – Can it be an issue?
• In a networked environment it could cause big problems
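One way to see why byte order matters between machines is to look at the bytes of a word directly. The sketch below is an illustration with hypothetical function names: it tests the host's byte order, reverses the bytes of a word, and stores a word in a fixed (big-endian) order so that both ends of a connection agree regardless of their native endianness.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if the lowest-addressed byte of a word is its least significant byte. */
int host_is_little_endian(void)
{
    const uint32_t word = 0x01020304u;
    unsigned char bytes[sizeof word];
    memcpy(bytes, &word, sizeof word);
    return bytes[0] == 0x04;
}

/* Reverses the byte order of a 32-bit word. */
uint32_t swap_bytes32(uint32_t w)
{
    return (w >> 24) | ((w >> 8) & 0x0000FF00u) |
           ((w << 8) & 0x00FF0000u) | (w << 24);
}

/* Stores w most significant byte first, independent of the host's byte order. */
void store_big_endian32(unsigned char out[4], uint32_t w)
{
    out[0] = (unsigned char)(w >> 24);
    out[1] = (unsigned char)(w >> 16);
    out[2] = (unsigned char)(w >> 8);
    out[3] = (unsigned char)w;
}
```

Shifting and masking, as in `store_big_endian32`, gives the same byte sequence on every host, which is why it is a common choice for data that crosses a network.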
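Packing Operands Word Operand Alignment • Consider struct { char a; char b[3]; }
[Figure: the four bytes a, b[0], b[1], b[2] packed into the single word at address 100 (byte offsets +0 to +3), drawn for both byte orderings; the next word is at address 104.]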
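Packing Operands Word Operand Alignment • Consider struct { char a; int b; }
[Figure: a occupies one byte of the word at address 100; the four bytes of b (bmsb ... blsb) occupy the whole word at address 104 (byte offsets +0 to +3), drawn for both byte orderings.]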
Compiling High Level Data Abstractions • Consider struct { int a; char c; int d; long e; }
• Given the address of the structure can we access the fields using base+offset addressing?
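A C sketch suggests the answer is yes: each field sits at a fixed offset from the start of the structure, which is exactly what base + offset addressing needs. The tag `record` and the helper `load_d` are hypothetical names used for illustration; the actual offsets depend on the compiler and target.

```c
#include <stddef.h>

/* The structure from the slide, given a tag so it can be named. */
struct record {
    int  a;
    char c;
    int  d;
    long e;
};

/* Reads field d given only the structure's base address, much as a
 * compiler would emit: ld rdest, offset_of_d(rbase). */
int load_d(const void *base)
{
    const char *p = (const char *)base + offsetof(struct record, d);
    return *(const int *)p;
}
```

`offsetof` is the compile-time constant the compiler would place in the offset field of the load instruction.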
Compiling High Level Data Abstractions • Now consider int a[1000];
• Can individual array elements be accessed using base+offset addressing? a[6] = a[6] + 3;
a[j] = a[i] + 3;
• Perhaps an instruction that formed an effective address by adding two registers together would be useful?
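The address arithmetic behind `a[j] = a[i] + 3` can be written out in C. In this sketch (the function name is hypothetical), `base` plays the role of a register holding the address of `a[0]`, and `i` and `j` play the role of index registers whose values are not known at compile time.

```c
#include <stdint.h>

/* a[j] = a[i] + 3 with the effective addresses formed explicitly. */
void add3_indexed(int32_t *base, uint32_t i, uint32_t j)
{
    int32_t *src = base + i;  /* effective address = base register + index register */
    int32_t *dst = base + j;  /* C scales the index by the element size */
    *dst = *src + 3;
}
```

For `a[6] = a[6] + 3` the index is a constant and fits in the offset field; for `a[i]` it lives in a register, which is what motivates a base + index addressing mode.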
Compiling Conditional Statements • In what order are program statements normally executed? • How do we know what instruction to execute next? • How can we handle this type of high-level language construct: if(x==y) z = 7;
Compiling Conditional Statements if(x==y) z = 7;
• Steps to execute – Evaluate the predicate (x==y) – If false, change the normal flow of control to skip the z = 7; statement and continue execution – If true, execute the z = 7; statement and then continue execution
Compiling Conditional Statements • Need an instruction that will evaluate a predicate and change program flow • As an example • beq r1, r2, offset – Semantics: if the contents of registers r1 and r2 are equal add the offset to the (already incremented) PC and store that address in the PC
Compiling Conditional Statements • C if(a==b) c = d + e; else c = f + g;
Assuming r1 = a, r2 = b, r3 = c, r4 = d, r5 = e, r6 = f, r7 = g
• Assembly
      beq r1, r2, then
      add r3, r6, r7
      beq r1, r1, skip*
then: add r3, r4, r5
skip: …
* Effectively an unconditional branch
Compiling Loops • C while(j != 0) { /* loop body */ t = t + a[j--]; }
• Assembly
      beq r1, r0, done
      ; loop body
      …
done: …
Compiling Switch Statements
if(n==0) x=a; else if(n==1) x=b; else if(n==2) x=c; else x=d;
switch (n) { case 0: x=a; break; case 1: x=b; break; case 2: x=c; break; default: x=d; }
Do these produce essentially equivalent assembly code?
Compiling Procedure Calls
int main() {
    return-value = foo(actual-parms);
    /* continue upon returning from foo */
}
int foo(formal-parameters) {
    /* code for function foo */
    return();
}
Issues with Compiling
1. Preserve caller state (registers)
2. Pass actual parameters
3. Save the return address
4. Transfer control to callee
5. Allocate space for callee’s local variables
6. Produce return value(s); give to caller
7. Return
Caller State • Where should caller’s register values be saved? – Why do we even need to save them? – Dedicated registers? • What limitation might arise?
– Who should save what? • Caller saved registers • Callee saved registers
What’s Left? • Parameter passing – Dedicated registers – Stack
• Return address – JAL Instruction
• Transfer Control – Change PC – JAL Instruction
What’s Left? • Local variables? – Stack
• Return value(s) – Dedicated registers – Stack
• Returning to point of call – JAL back through link
Software Conventions • Registers s0-s2 are the saved (callee-saved) s registers • Registers t0-t2 are the temporary (caller-saved) registers • Registers a0-a2 are the parameter passing registers • Register v0 is used for return value • Register ra is used for return address • Register at is used for target address • Register sp is used as a stack pointer
Activation Record • Used to store – Caller saved registers – Additional parameters – Additional return values – Return address – Callee saved registers – Local variables
STACK (each step lists the stack contents from the stack pointer down)
Step 1. Caller saves any of registers t0-t2 on the stack (if it needs the values in them upon return). Stack: saved t registers (from t registers).
Step 2. Caller places the parameters in a0-a2 (using the stack for additional parameters if needed). Stack: additional parameters (from function call), saved t registers.
Step 3. Caller allocates space for any additional return values on the stack. Stack: additional return values, additional parameters, saved t registers.
Step 4. Caller saves ra. Stack: ra (from ra), additional return values, additional parameters, saved t registers.
Step 5. Caller executes JAL at, ra (no effect on stack). Stack: ra, additional return values, additional parameters, saved t registers.
Step 6. Callee saves any of registers s0-s2 that it plans to use during its execution on the stack. Stack: saved s registers (from s registers), ra, additional return values, additional parameters, saved t registers.
Step 7. Callee allocates space for any local variables on the stack. Stack: local variables, saved s registers, ra, additional return values, additional parameters, saved t registers.
Step 8. Prior to return, Callee restores any saved s0-s2 registers from the stack. Stack: saved s registers (to s registers), ra, additional return values, additional parameters, saved t registers.
Step 9. Upon return, Caller restores ra. Stack: ra (to ra), additional return values, additional parameters, saved t registers.
Step 10. Caller stores additional return values as desired. Stack: additional return values, additional parameters, saved t registers.
Step 11. Upon return, Caller moves stack pointer to discard additional parameters. Stack: additional parameters, saved t registers.
Step 12. Upon return, Caller restores any saved t0-t2 registers from the stack. Stack: saved t registers (to t registers).
[Figure: the stack holding one activation record (stack frame) each for main, foo, bar, and baz, with the stack pointer at the frame for baz. A frame holds, from the stack pointer down: local variables, saved s registers, ra, additional return values, additional parameters, saved t registers.]
Recursion • Does recursion require any additional instruction set architecture items?
Frame Pointer • During execution of a given module it is possible for the stack pointer to move. • Since the location of all items in a stack frame is based on the stack pointer, it is useful to define a fixed point in each stack frame and maintain the address of this fixed point in a register called the frame pointer • This necessitates storing the old frame pointer (i.e. the caller’s frame pointer) in each stack frame
STACK
New Step 6. Callee stores old frame pointer then copies contents of stack pointer into frame pointer.
Stack (from the frame pointer down): old frame pointer, ra, additional return values, additional parameters, saved t registers.
*Stack pointer can eventually be anywhere
Instruction Set Architecture Choices
• Specific set of arithmetic and logic instructions
• Addressing modes
• Architectural style
• Memory layout of the instruction
• Drivers – Technology trends – Implementation feasibility – Goal of elegant/efficient support for high-level language constructs
Instructions • MIPS – All loads and stores 32 bits – Special instructions exist for extracting bytes
• DEC Alpha – Instructions for loading and storing different sizes
• Some architectures have predefined values – e.g. 0, 1, etc.
• DEC Vax – Single instruction to load or store all registers
Addressing Modes • Additional modes – Indirect addressing ld @(ra)
• Pseudo-direct addressing – Address is formed from first 6 bits of PC and last 26 bits of instruction
Architecture Styles • Stack oriented – Burroughs
• Memory oriented – IBM s/360 et al
• Register oriented – MIPS, Alpha, ARM
• Hybrid – Intel x86, Power PC
Instruction Format • Zero Operand Instructions – Halt, NOP – Stack machines: Add
• One Operand Instructions – Inc, Dec, Neg, Not – Accumulator machines: Load M, Add M
• Two Operand Instructions – Add r1, r2 (i.e. r1 = r1 + r2) – Mov r1, r2
• Three Operand Instructions – Add r1, r2, r3 – Load rd, rb, offset
Instruction Format Fixed Length Instructions • Pros – Simplifies implementation – Can start interpreting
• Cons – May waste space – May need additional logic in datapath – Limits instruction set designer
Variable Length Instructions • Pros – No wasted space – Fewer constraints on designer – More flexibility in opcodes, addressing modes and operands
• Cons – Complicates implementation
LC-2200 Instruction Set
• 32-bit
• Register-oriented
• Little-endian
• Fixed length instruction format
• 16 general-purpose registers
• Program counter (PC) register
• All addresses are word addresses
Instruction Format • R-type instructions – add and nand
• I-type instructions – addi, lw, sw, and beq
• J-type instruction – jalr
• O-type instruction – halt
Instruction Format
• R-type instructions (add, nand):
  bits 31-28: opcode
  bits 27-24: reg X
  bits 23-20: reg Y
  bits 19-4: unused (should be all 0s)
  bits 3-0: reg Z
• I-type instructions (addi, lw, sw, beq):
  bits 31-28: opcode
  bits 27-24: reg X
  bits 23-20: reg Y
  bits 19-0: Immediate value or address offset (a 20-bit, 2s complement number with a range of -524288 to +524287)
• J-type instructions (jalr):
  bits 31-28: opcode
  bits 27-24: reg X (target of the jump)
  bits 23-20: reg Y (link register)
  bits 19-0: unused (should be all 0s)
• O-type instructions (halt):
  bits 31-28: opcode
  bits 27-0: unused (should be all 0s)
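The bit positions of these formats translate directly into shifts and masks. The sketch below is an illustration, not an official LC-2200 tool: the function names are hypothetical, and opcode numbers are passed in by the caller because the slides do not list them.

```c
#include <stdint.h>

/* Field extraction, following the bit positions of the formats above. */
uint32_t field_opcode(uint32_t insn) { return (insn >> 28) & 0xFu; } /* bits 31-28 */
uint32_t field_regx(uint32_t insn)   { return (insn >> 24) & 0xFu; } /* bits 27-24 */
uint32_t field_regy(uint32_t insn)   { return (insn >> 20) & 0xFu; } /* bits 23-20 */
uint32_t field_regz(uint32_t insn)   { return insn & 0xFu; }         /* bits 3-0   */

/* I-type offset: sign-extend the 20-bit 2s complement field in bits 19-0. */
int32_t field_offset(uint32_t insn)
{
    uint32_t raw = insn & 0xFFFFFu;
    if (raw & 0x80000u)                 /* bit 19 set: negative value */
        return (int32_t)raw - 0x100000;
    return (int32_t)raw;
}

/* Assemble an I-type word; offset must lie in -524288..+524287. */
uint32_t encode_itype(uint32_t opcode, uint32_t rx, uint32_t ry, int32_t offset)
{
    return ((opcode & 0xFu) << 28) | ((rx & 0xFu) << 24) |
           ((ry & 0xFu) << 20) | ((uint32_t)offset & 0xFFFFFu);
}

/* Assemble an R-type word; bits 19-4 are left as 0s. */
uint32_t encode_rtype(uint32_t opcode, uint32_t rx, uint32_t ry, uint32_t rz)
{
    return ((opcode & 0xFu) << 28) | ((rx & 0xFu) << 24) |
           ((ry & 0xFu) << 20) | (rz & 0xFu);
}
```

The sign extension in `field_offset` mirrors what the datapath's sign-extend block does with IR[19..0].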
LC-2200 Register Set
Reg #   Name    Use                             callee-save?
0       $zero   always zero (by hardware)       n.a.
1       $at     reserved for assembler          n.a.
2       $v0     return value                    No
3       $a0     argument                        No
4       $a1     argument                        No
5       $a2     argument                        No
6       $t0     Temporary                       No
7       $t1     Temporary                       No
8       $t2     Temporary                       No
9       $s0     Saved register                  YES
10      $s1     Saved register                  YES
11      $s2     Saved register                  YES
12      $k0     reserved for OS/traps           n.a.
13      $sp     Stack pointer                   No
14      $fp     Frame pointer                   YES
15      $ra     return address                  No
Issues Influencing Processor Design • Instruction Set • Applications • Other – Operating system – Support for modern languages – Memory system – Parallelism – Debugging – Virtualization – Fault Tolerance – Security
Instruction Set • Over-arching concern: Compiling high level language constructs into efficient machine code • But other factors are in play – Market pressure – Performance – Technology workarounds
Influence of Applications on Instruction Set Design • Number crunching requires efficient floating point – Development of floating point hardware
• Media applications deal with streaming data – Intel MMX extensions
• Gaming requires sophisticated graphic processing – High end games now include GPU chips
Other Issues Driving Processor Design • Operating system • Modern languages: Java, C++ and C# • Memory system • Parallelism • Debugging • Virtualization • Fault tolerance • Security
Summary • High-level language constructs shape ISA • Support needed in the ISA for compiling basic program statements (assignment, loops, conditionals, etc.) • Registers • Addressing modes • Software conventions • Software stack/procedure calls • Extensions to minimal ISA’s • Other design issues
COMPUTER SYSTEMS An Integrated Approach to Architecture and Operating Systems
Chapter 3 Processor Implementation
Processor Implementation • Implementation given an instruction set • Instruction-set is not a description of the implementation of the processor – Contract between hardware and software – Allows a compiler writer to generate code for different high-level languages to execute on a processor that implements this contract
• Can there be different implementations of the same instruction set?
3.1 Architecture versus Implementation Why? • Market demands • Parallel hardware and software development • Maintain compatibility for legacy software
3.2 What is involved in Processor Implementation?
• Organization of the electrical components (ALUs, buses, registers, etc.) commensurate with the expected price/performance characteristic of the processor. • Thermal and mechanical aspects including cooling and physical geometry for placement in motherboards.
– Supercomputers: High performance primary objective
– Servers: Intermediate performance and cost
– Desktops & PCs: Low cost primary objective
– Embedded: Small size, low cost, and low power consumption primary objectives
3.3 Key hardware concepts A review of important design principles
3.3.1 Circuits • Combinational logic – For a given set of inputs there is one unique output
• Sequential logic – Circuits contain elements that remember state – Output depends on inputs and state
3.3.2 Hardware resources of the datapath • • • • •
Memory ALU Register file Program Counter Instruction Register
3.3.3 Logic Triggering
[Figure: a logic element with inputs, outputs, and a clock.]
Level Triggering • Outputs change based on inputs whenever clock is high • Memory will be considered to be level triggered (for cost reasons)
Edge Triggering • Outputs change based on inputs only when clock transitions • Positive edge triggered logic when leading edge causes triggering • Negative edge triggered when trailing edge causes triggering
3.3.4 Connecting the datapath elements
[Figure: PC, memory (Addr, Din, Dout), IR, register file, and ALU joined by dedicated connections.]
3.3.5 Towards bus-based Design • In principle we must make connections between circuit elements for every instruction • Numerous connections are expensive and take up valuable space • Have a set of wires that all elements can connect to and share in order to transfer information
Single Bus Design
[Figure: PC, MAR, register file (DPRF), IR, memory (Addr, Din, Dout), and ALU all attached to one shared bus.]
Dual Bus Design
[Figure: PC, MAR, register file (DPRF, with outputs Dout1 and Dout2), IR, memory (Addr, Din, Dout), and ALU attached to two buses.]
3.3.6 Finite State Machine (FSM) • Abstraction of a sequential logic circuit which captures – States – Outputs while in each state – Designated start state – Possible transitions – Inputs which will trigger transitions
[Figure: FSM with states Fetch, Decode, and Execute.]
3.4 Datapath Design • Central Processing Unit (CPU) consists of the Datapath and the Control Unit • Datapath is the combination of hardware resources and their connections • Example for LC-2200 – ALU capable of ADD, NAND, SUB, and A + 1 – Register file with 16 registers (32-bit) shown in Figure 3.14 – PC (32-bit) – Memory with 2^32 x 32-bit words
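Sample Datapath: LC-2200 Datapath
[Figure: a single 32-bit bus connecting PC, A, B, MAR, IR, Z, the register file (16 x 32 bits; Din, Dout), memory (2^32 x 32 bits; Addr, Din, Dout), and the ALU.]
• Load signals: LdPC, LdA, LdB, LdMAR, LdIR, LdZ
• Drive signals: DrPC, DrALU, DrREG, DrMEM, DrOFF
• Write signals: WrREG, WrMEM
• ALU func (2 bits): 00: ADD, 01: NAND, 10: A - B, 11: A + 1
• regno (4 bits): selects a register in the register file
• Z: 1-bit boolean from the =0? zero-detect logic, to control logic
• IR[31..0] fields
  – IR[31..28] OP: 4-bit opcode to control logic
  – IR[27..24] Rx: 4-bit register number to control logic
  – IR[23..20] Ry: 4-bit register number to control logic
  – IR[3..0] Rz: 4-bit register number to control logic
  – IR[19..0]: 20-bit offset, sign extended, driven onto the bus by DrOFF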
3.4.1 ISA and Datapath Width • We normally define a size for instructions, addresses and data operands (e.g. 32 bits) • Implementation could use bus and/or interconnects of smaller size (e.g. 8 or 16 bits) • Would require more operations to move a 32 bit value. Would require less chip real estate • Tradeoff speed vs. price
3.4.2 Width of the Clock Pulse • Combinational logic elements have a propagation delay. • Register files have an access time • Writing to a register requires input to be stable both before and after the leading edge of the clock arrives (set up time and hold time) • Wires have a transmission delay • Clock pulse must be wide enough to allow for all of the above
3.4.3 Checkpoint • You should now understand the following basic concepts – Basics of logic design including combinational and sequential logic circuits – Hardware resources for a datapath such as register file, ALU, and memory – Edge-triggered logic and how to arrive at the width of a clock cycle – Datapath interconnection and buses – Finite State Machines
3.5 Control Unit Design • The control unit is an implementation of the Finite State Machine • Depending on the current state and inputs it moves to the correct next state • Typical outputs from control unit (e.g. LC-2200)
– Drive signals: DrPC, DrALU, DrREG, DrMEM, DrOFF
– Load signals: LdPC, LdA, LdB, LdMAR, LdIR
– Write Memory signal: WrMEM
– Write Registers signal: WrREG
– ALU function selector: func
– Register selector: regno
• Several alternatives exist for implementation
3.5.1 ROM plus state register
[Figure: a state register addresses a ROM; each ROM word holds the Drive Signals (PC, ALU, Reg, MEM, OFF), Load Signals (PC, A, B, MAR, IR), Write Signals (MEM, REG), Func, and RegSel.]
3.5.2 FETCH macro state • Need to do – Send PC to the memory – Read the memory contents – Bring the memory contents read into the IR – Increment the PC
• Microstates to accomplish
– ifetch1: PC → MAR
– ifetch2: MEM[MAR] → IR
– ifetch3: PC → A
– ifetch4: A+1 → PC
3.5.2 FETCH macro state (Simplifying)
• ifetch1 – PC → MAR – PC → A
• ifetch2 – MEM[MAR] → IR
• ifetch3 – A+1 → PC
3.5.2 FETCH macro state: Adding in control signals
• ifetch1 – PC → MAR – PC → A – Control signals needed: DrPC, LdMAR, LdA
• ifetch2 – MEM[MAR] → IR – Control signals needed: DrMEM, LdIR
• ifetch3 – A+1 → PC – Control signals needed: func = 11, DrALU, LdPC
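The three fetch microstates can be written down as ROM words, one bit per control signal. This is a sketch under stated assumptions: the bit positions, the type `microinstruction`, and the names `fetch_rom` and `drives_exactly_one` are hypothetical, and the single-driver check reflects the general rule that only one element may drive a shared bus at a time.

```c
#include <stdint.h>

/* One bit per control signal; the bit positions are arbitrary (assumption). */
enum {
    DR_PC  = 1u << 0, DR_ALU = 1u << 1, DR_REG = 1u << 2,
    DR_MEM = 1u << 3, DR_OFF = 1u << 4,
    LD_PC  = 1u << 5, LD_A   = 1u << 6, LD_B   = 1u << 7,
    LD_MAR = 1u << 8, LD_IR  = 1u << 9,
    WR_MEM = 1u << 10, WR_REG = 1u << 11
};

struct microinstruction {
    uint32_t signals;  /* OR of the flags above */
    uint8_t  func;     /* ALU function: 0 ADD, 1 NAND, 2 A - B, 3 A + 1 */
};

/* The FETCH macro state as three ROM words. */
const struct microinstruction fetch_rom[3] = {
    { DR_PC  | LD_MAR | LD_A, 0 },  /* ifetch1: PC -> MAR, PC -> A */
    { DR_MEM | LD_IR,         0 },  /* ifetch2: MEM[MAR] -> IR     */
    { DR_ALU | LD_PC,         3 },  /* ifetch3: A + 1 -> PC        */
};

/* Returns 1 if exactly one drive signal is asserted. */
int drives_exactly_one(uint32_t signals)
{
    uint32_t d = signals & (DR_PC | DR_ALU | DR_REG | DR_MEM | DR_OFF);
    return d != 0 && (d & (d - 1)) == 0;
}
```

Note that ifetch1 asserts two load signals with a single drive signal: one value on the bus can be clocked into several registers at once, which is what lets four microstates collapse to three.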
3.5.3 DECODE macro state
[Figure: after Fetch, decoding selects one of four paths: O-Type, R-Type, I-Type, or J-Type.]
3.5.4 EXECUTE macro state: ADD instruction (part of R-Type) • Rx ← Ry + Rz
• add1 – Ry → A – Control signals needed: RegSel = 01, DrREG, LdA
• add2 – Rz → B – Control signals needed: RegSel = 10, DrREG, LdB
• add3 – A+B → Rx – Control signals needed: func = 00, DrALU, RegSel = 00, WrREG
[Figure: microstate sequence ifetch1 ... add1 → add2 → add3.]
3.5.5 EXECUTE macro state: NAND instruction (part of R-Type) • What must be changed in ADD to implement NAND?
3.5.6 EXECUTE macro state: JALR instruction (part of J-Type) • JALR instruction does the following: – Ry ← PC + 1 – PC ← Rx
• jalr1 – PC → Ry – Control signals needed: DrPC, RegSel = 01, WrREG
• jalr2 – Rx → PC – Control signals needed: RegSel = 00, DrREG, LdPC
3.5.7 EXECUTE macro state: LW instruction (part of I-Type) • Rx ← MEMORY[Ry + signed address-offset]
• lw1 – Ry → A – Control signals needed: RegSel = 01, DrREG, LdA
• lw2 – Sign-extended offset → B – Control signals needed: DrOFF, LdB
• lw3 – A+B → MAR – Control signals needed: func = 00, DrALU, LdMAR
• lw4 – MEM[MAR] → Rx – Control signals needed: DrMEM, RegSel = 00, WrREG
3.5.8 EXECUTE macro state: SW and ADDI instructions (part of I-Type) • SW similar to LW • ADDI similar to ADD
3.5.9 EXECUTE macro state: BEQ instruction (part of I-Type) • BEQ instruction has the following semantics: if (Rx == Ry) PC ← PC + 1 + signed offset, else nothing*
*PC remains unchanged, so execution continues with the next instruction in memory
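The BEQ semantics reduce to a small next-PC function. In this sketch the function name is hypothetical, and `pc` is assumed to be the word address of the beq instruction itself, so the `+ 1` models the increment already performed in the FETCH macro state.

```c
#include <stdint.h>

/* Returns the PC value after executing beq at word address pc. */
uint32_t beq_next_pc(uint32_t pc, uint32_t rx, uint32_t ry, int32_t offset)
{
    uint32_t incremented = pc + 1;              /* done during FETCH */
    if (rx == ry)
        return incremented + (uint32_t)offset;  /* branch taken */
    return incremented;                         /* fall through */
}
```

For example, a beq at address 100 with equal registers and an offset of 5 continues at 106; with unequal registers it continues at 101.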
3.5.9 EXECUTE macro state: BEQ instruction (part of I-Type)
• beq1 – Rx → A – Control signals needed: RegSel = 00, DrREG, LdA
• beq2 – Ry → B – Control signals needed: RegSel = 01, DrREG, LdB
• beq3 – A – B – Load Z register with result of zero detect logic – Control signals needed: func = 10, DrALU, LdZ
These microsteps execute only if we are taking the branch:
• beq4 – PC → A – Control signals needed: DrPC, LdA
• beq5 – Sign-extended offset → B – Control signals needed: DrOFF, LdB
• beq6 – A+B → PC – Control signals needed: func = 00, DrALU, LdPC
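3.5.10 Engineering a conditional branch in the microprogram
[Figure: microstate sequence ifetch1 ... beq1 → beq2 → beq3 → beq4 → beq5 → beq6; beq4 through beq6 are reached only when the branch is taken.]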
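3.5.10 Engineering a conditional branch in the microprogram
[Figure: ROM plus state register, now with Z shown as an input. Each ROM word holds the Drive Signals (PC, ALU, Reg, MEM, OFF), Load Signals (PC, A, B, MAR, IR), Write Signals (MEM, REG), Func, and RegSel.]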
3.5.11 DECODE macro state revisited
[Figure: ROM plus state register. Each ROM word holds the Drive Signals (PC, ALU, Reg, MEM, OFF), Load Signals (PC, A, B, MAR, IR), Write Signals (MEM, REG), Func, and RegSel.]
3.6 Alternative Style of Control Unit Design A number of different approaches may be used to implement the Control Unit
3.6.1 Microprogrammed Control • As presented our design works • Problem: Too slow – Solution: Prefetch the next microinstruction
• Problem: Too much memory required – Solution: Have bit positions control different things as a function of opcode
3.6.2 Hardwired control • State machine can be represented as sequential logic truth table • Thus can be implemented using normal logic or FPGA
3.6.3 Choosing between the two control design styles
• Microprogrammed
– Pros: Simplicity, maintainability, flexibility. Rapid prototyping
– Cons: Potential for space and time inefficiency
– Comment: Space inefficiency may be mitigated with vertical microcode. Time inefficiency may be mitigated with prefetching
– When to use: For complex instructions, and for quick non-pipelined prototyping of architectures
– Examples: PDP 11 series, IBM 360 and 370 series, Motorola 68000, complex instructions in Intel x86 architecture
• Hardwired
– Pros: Amenable for pipelined implementation. Potential for higher performance
– Cons: Potentially harder to change the design. Longer design time
– Comment: Maintainability can be increased with the use of structured hardware such as PLAs and FPGAs
– When to use: For high performance pipelined implementation of architectures
– Examples: Most modern processors including Intel Pentium series, IBM PowerPC, MIPS
3.7 Historical Perspective
[Figure: timeline from 1940 to 1990.]
• Hardware expensive, memory expensive – Accumulators: EDSAC, IBM 701
• Hardware less expensive, memory expensive – Register oriented machines (2 address), register-memory, CISC: VAX, IBM 360, Motorola 68000, DEC PDP-11, Intel 80x86 – Fringe element, stack machines: Burroughs B-5000 (banks)
• Hardware and memory cheap, microprocessors, compilers getting good – Also RISC: IBM 801; Berkeley RISC, Sparc (Dave Patterson); Stanford MIPS, SGI (John Hennessy)
Questions?
COMPUTER SYSTEMS An Integrated Approach to Architecture and Operating Systems
Chapter 4 Interrupts, Traps and Exceptions
4 Interrupts, Traps and Exceptions • Interrupts, traps and exceptions are discontinuities in program flow • Students asking a teacher questions in a classroom is a good analogy to the handling of discontinuities in program flow
4.1 Discontinuities in program execution • We must first understand – Synchronous events: Occur at well defined points aligned with activity of the system • Making a phone call • Opening a file
– Asynchronous events: Occur unexpectedly with respect to ongoing activity of the system • Receiving a phone call • A user presses a key on a keyboard
4.1 Discontinuities in program execution • There is no universally accepted set of definitions for interrupts, traps and exceptions so we will use these – Interrupts: Asynchronous events usually produced by I/O devices which must be handled by the processor by interrupting execution of the currently running process – Traps: Synchronous events produced by special instructions typically used to allow secure entry into operating system code – Exceptions: Synchronous events usually associated with software requesting something the hardware can’t perform i.e. illegal addressing, illegal op code, etc.
4.1 Discontinuities in program execution
Type        Sync/Async   Source     Intentional?   Examples
Exception   Sync         Internal   No             Overflow, Divide by zero, Illegal memory address
Trap        Sync         Internal   Yes and No     System call, Page fault, Emulated instructions
Interrupt   Async        External   Yes            I/O device completion
4.2 Dealing with program discontinuities • Can happen anywhere even in the middle of an instruction execution. • Unplanned for and forced by the hardware. Hardware has to save the program counter since we are jumping to the handler. • Address of the handler is unknown. Therefore, hardware must manufacture address. • Since hardware saved PC, handler has to discover where to return upon completion.
4.3 Architectural enhancements to handle program discontinuities • When should the processor handle an interrupt? • How does the processor know there is an interrupt? • How do we save the return address? • How do we manufacture the handler address? • How do we handle multiple cascaded interrupts? • How do we return from the interrupt?
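4.3.1 Modifications to FSM [Figure: FSM with macrostates Fetch, Decode, Execute. With int = N, control returns to Fetch; with int = Y, control enters the INT macrostate, which performs $k0 ← PC; PC ← new PC.]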
4.3.2 A simple interrupt handler Handler: save processor registers; execute device code; restore processor registers; return to original program;
What happens if an interrupt arrives during handling an interrupt?
[Figure: the same FSM (Fetch, Decode, Execute, INT). On int = Y the INT macrostate now performs $k0 ← PC; PC ← new PC; Disable Ints. Add new instruction: Enable Ints.]
4.3.2 A simple interrupt handler Handler: save processor registers; execute device code; restore processor registers; enable ints; return to original program;
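4.3.3 Handling cascaded interrupts [Figure: the original program is interrupted and $k0 ← return address in the original program; the first interrupt handler is interrupted in turn and $k0 ← return address in the first handler; the second interrupt handler then runs.]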
4.3.3 Handling cascaded interrupts [Figure: the same FSM (Fetch, Decode, Execute, INT). On int = Y the INT macrostate performs $k0 ← PC; PC ← new PC; Disable Ints. Add 2 new instructions: Enable Ints and Disable Ints.]
4.3.3 Handling cascaded interrupts Handler: /* The interrupts are disabled when we enter */ save $k0; enable interrupts; save processor registers; execute device code; restore processor registers; disable interrupts; restore $k0; enable interrupts; return to original program;
Yay! It works perfectly!!! Or does it? What happens if an interrupt occurs after the final “enable interrupts” but before “return to original program”?
4.3.4 Returning from the handler • Returning involves jumping to the address in $k0 which can be accomplished with jalr $k0 $zero
• But as we have just seen an interrupt at precisely the wrong moment would destroy $k0 and cause a failure • What do we need? restore $k0; enable interrupts return to original program;
4.3.5 Summary of architectural enhancements to LC-2200 to handle interrupts
• Three new instructions to LC-2200: – Enable interrupts – Disable interrupts – Return from interrupt
• Upon an interrupt, store the current PC implicitly into a special register $k0.
4.4 Hardware details for handling external interrupts • What we have presented thus far is what is required for interrupts, traps and exceptions • What do we need specifically for external interrupts?
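4.4.1 Datapath details for interrupts [Figure 1: the processor, Device 1 and Device 2 share the address bus, the data bus, and the INT and INTA lines. Figure 2: each device has its own pair of lines (INT1/INTA1 through INT8/INTA8) connected to a priority encoder, which presents a single INT/INTA pair to the processor.]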
4.4.2 Details of receiving the address of the handler
Handshake between Processor and Device • Device asserts INT line • Processor upon completion of the current instruction, checks the INT line • If interrupt pending, then processor enters INT macrostate and asserts INTA line on bus • Device upon receiving the INTA from the processor, places its vector on the data bus. • Processor receives vector and looks up entry in interrupt vector table for this vector. Entry is address of handler so we put it in PC • The processor saves the current PC in $k0, and loads PC with value from interrupt vector table
4.4.3 Stack for saving/restoring • Hardware has no guarantee for stack behavior by user program (register/conventions) • Equip processor with 2 stack pointers (User/System) • On interrupt swap stack pointers [Figure: $sp moves from (1) the user stack pointer USP to (2) the system stack pointer SSP.]
4.4.3 Stack for saving/restoring • Use system stack for saving all necessary information • Upon completion of interrupt restore registers, etc. • Then restore the user stack pointer by reversing the earlier swap [Figure: $sp moves from (1) the system stack pointer SSP back to (2) the user stack pointer USP.]
4.5 Putting it all together
A. Executing instruction at 19999. The PC has already been incremented. Device signals interrupt in middle of instruction. $sp points to user stack
[Figure, state A. Vector table: address 40 contains 1000. MODE = USER, INT REQ = 1, INT ACK = 0, INT Enable = 1, $k0 not yet set, PC = 20000, $sp = user stack. System stack shown at addresses 299 and 300. Handler code at addresses 1000 and 1001. Original program instructions at addresses 19999 and 20000.]
4.5 Putting it all together
B. Interrupt has been sensed. $k0 gets PC. Interrupts are disabled. Interrupt is acknowledged. Device puts vector on bus.
[Figure, state B. Vector table: address 40 contains 1000. BUS = 40. MODE = USER, INT REQ = 1, INT ACK = 1, INT Enable = 0, $k0 = 20000, PC = 20000, $sp = user stack. System stack shown at addresses 299 and 300. Handler code at addresses 1000 and 1001. Original program instructions at addresses 19999 and 20000.]
4.5 Putting it all together
C. Handler address is put into PC; Current mode is saved in system stack; New mode is set to kernel; $sp now points to system stack; Interrupt code at 1000 is set to handle the interrupt.
[Figure, state C. Vector table: address 40 contains 1000. MODE = KERNEL, INT REQ = 0, INT ACK = 0, INT Enable = 0, $k0 = 20000, PC = 1000, $sp = 299. System stack shown at addresses 299 and 300. Handler code at addresses 1000 and 1001. Original program instructions at addresses 19999 and 20000.]
4.5 Putting it all together
D. RETI instruction restores mode from system stack; since returning to user program in this example, $sp now points to user stack; also, copies $k0 into PC, re-enables interrupts and sets Mode to User
[Figure, state D. Vector table: address 40 contains 1000. MODE = USER, INT REQ = 0, INT ACK = 0, INT Enable = 1, $k0 = 20000, PC = 20000, $sp = user stack. System stack shown at addresses 299 and 300. Handler code at addresses 1000 and 1001. Original program instructions at addresses 19999 and 20000.]
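The walkthrough in steps A to D can be traced with a small C model. This is only a sketch under stated assumptions: the `Cpu` struct, the `other_sp` field (whichever of USP/SSP is not currently in $sp), the `mem` array and the function names are hypothetical and not part of LC-2200, and the order of the micro-steps inside `int_macrostate` is one plausible reading of the slides.

```c
#include <stdbool.h>

enum Mode { USER = 0, KERNEL = 1 };

typedef struct {
    unsigned pc, k0;
    unsigned sp;                /* active stack pointer ($sp) */
    unsigned other_sp;          /* hypothetical: the inactive one of USP/SSP */
    enum Mode mode;
    bool int_enabled;
    unsigned vector_table[64];  /* vector -> handler address */
    unsigned mem[512];          /* small memory holding the system stack */
} Cpu;

/* Exchange the active and inactive stack pointers. */
static void swap_sp(Cpu *c)
{
    unsigned t = c->sp;
    c->sp = c->other_sp;
    c->other_sp = t;
}

/* INT macrostate: entered once the current instruction has completed. */
void int_macrostate(Cpu *c, unsigned vector)
{
    c->k0 = c->pc;                      /* $k0 <- PC */
    c->int_enabled = false;             /* Disable Ints */
    c->pc = c->vector_table[vector];    /* PC <- handler address */
    if (c->mode == USER)
        swap_sp(c);                     /* $sp now points to the system stack */
    c->sp -= 1;
    c->mem[c->sp] = (unsigned)c->mode;  /* save current mode on system stack */
    c->mode = KERNEL;
}

/* RETI: all of these steps happen as one uninterruptible instruction. */
void reti(Cpu *c)
{
    c->mode = (enum Mode)c->mem[c->sp]; /* restore mode from system stack */
    c->sp += 1;
    if (c->mode == USER)
        swap_sp(c);                     /* $sp points to the user stack again */
    c->pc = c->k0;                      /* PC <- $k0 */
    c->int_enabled = true;              /* re-enable interrupts */
}
```

Starting from PC = 20000, vector 40 mapped to 1000, and an assumed system stack pointer of 300, `int_macrostate` leaves PC = 1000, $k0 = 20000 and $sp = 299, as in snapshot C; `reti` then restores the values of snapshot D.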
4.6 Summary • Interrupts help a processor communicate with the outside world. • An interrupt is a specific instance of program discontinuity. • Processor/Bus enhancements included – Three new instructions – User stack and system stack – Mode bit – INT macro state – Control lines called INT and INTA
4.6 Summary • Software mechanism needed to handle interrupts, traps and exceptions is similar. • Discussed how to write a generic interrupt handler that can handle nested interrupts. • Intentionally simplified. Interrupt mechanisms in modern processors are considerably more complex. For example, modern processors categorize interrupts into two groups: maskable and non-maskable. – maskable: Interrupts that can be temporarily turned off – Non-maskable: Interrupts that cannot be turned off
4.6 Summary • Presented mode as a characterization of the internal state of a processor. Intentionally simplistic view. • Processor state may have a number of other attributes available as discrete bits of information (similar to the mode bit). – Modern processors aggregate all of these bits into one register called processor status word (PSW). – Upon an interrupt and its return, the hardware implicitly pushes and pops, respectively, both the PC and the PSW on the system stack.
• The interested reader is referred to more advanced textbooks on computer architecture for details on how the interrupt architecture is implemented in modern processors.
4.6 Summary • Presented simple treatment of the interrupt handler code to understand what needs to be done in the processor architecture to deal with interrupts. The handler would typically do a lot more than save processor registers. • LC-2200 designates a register $k0 for saving PC in the INT macro state. In modern processors, there is no need for this since the hardware automatically saves the PC on the system stack.
Questions?
Chapter 5 Processor Performance and Rudiments of Pipelined Processor Design
©Copyright 2008 Umakishore Ramachandran and William D. Leahy Jr.
5.1 Space and Time Metrics • Two important metrics for any program – Space: How much memory does the program code and data require? (Memory footprint) – Time: What is the execution time for the program?
• Different design methodologies – CISC – RISC
• Memory footprint and execution time are not necessarily correlated
What determines execution time? • Execution time = (∑ CPIj) * clock cycle time, where 1 ≤ j ≤ n • Execution time = n * CPIAvg * clock cycle time, where n is the number of instructions (executed not static instruction count)
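The two forms of the execution time equation can be checked against each other with a short C sketch. The function names and the sample numbers are illustrative assumptions; only the formulas come from the slide.

```c
#include <stddef.h>

/* Execution time as the sum of the CPI of every executed (dynamic)
   instruction, multiplied by the clock cycle time. */
double exec_time_sum(const unsigned *cpi, size_t n, double cycle_time)
{
    unsigned long total_cycles = 0;
    for (size_t j = 0; j < n; j++)
        total_cycles += cpi[j];
    return (double)total_cycles * cycle_time;
}

/* Equivalent form: n * CPI_avg * clock cycle time. */
double exec_time_avg(size_t n, double cpi_avg, double cycle_time)
{
    return (double)n * cpi_avg * cycle_time;
}
```

For four executed instructions taking 4, 5, 3 and 4 cycles with a cycle time of 2 time units, both functions give 32 time units, since the average CPI is 4.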
5.2 Instruction Frequency • Static instruction frequency refers to number of times a particular instruction occurs in compiled code. – Impacts memory footprint – If a particular instruction appears a lot in a program, can try to optimize amount of space it occupies by clever instruction encoding techniques in the instruction format.
• Dynamic instruction frequency refers to number of times a particular instruction is executed when program is run. – Impacts execution time of program – If dynamic frequency of an instruction is high then can try to make enhancements to datapath and control to ensure that CPI taken for its execution is minimized.
5.3 Benchmarks • Benchmarks are a set of programs that are representative of the workload for a processor. • The key difficulty is to be sure that the benchmark program selected really are representative. • A radical new design is hard to benchmark because there may not yet be a compiler or much code.
Evaluating a Suite of Benchmark Programs • Total execution time: cumulative total of execution times of individual programs. • Arithmetic mean (AM): Simply an average of all individual program execution times. – It should be noted, however that this metric may bias the summary value towards a time-consuming benchmark program (e.g. execution times of programs: P1 = 100 secs; P2 = 1 secs; AM = 50.5 secs).
• Weighted arithmetic mean (WAM) : weighted average of the execution times of all the individual programs taking into account the relative frequency of execution of the programs in the benchmark mix • Geometric mean (GM), pth root of the product of p values. This metric removes the bias present in arithmetic mean
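Written out as formulas, the three summary metrics look as follows. The symbols are assumptions introduced here for illustration: p is the number of programs, t_i the execution time of program i, and f_i its relative frequency in the benchmark mix.

```latex
\mathrm{AM} = \frac{1}{p}\sum_{i=1}^{p} t_i
\qquad
\mathrm{WAM} = \sum_{i=1}^{p} f_i\, t_i \quad \text{with } \sum_{i=1}^{p} f_i = 1
\qquad
\mathrm{GM} = \Bigl(\prod_{i=1}^{p} t_i\Bigr)^{1/p}
```

For the two-program example (P1 = 100 secs, P2 = 1 sec), AM = 50.5 secs while GM = \sqrt{100 \cdot 1} = 10 secs, which shows how the geometric mean damps the influence of the long-running program.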
SPECint2006: 12 programs for quantifying performance of processors on integer programs. Times measured on an Intel Core 2 Duo E6850 (3 GHz).
Program name | Description | Time in seconds
400.perlbench | Applications in Perl | 510
401.bzip2 | Data compression | 602
403.gcc | C Compiler | 382
429.mcf | Optimization | 328
445.gobmk | Game based on AI | 548
456.hmmer | Gene sequencing | 593
458.sjeng | Chess based on AI | 679
462.libquantum | Quantum computing | 422
464.h264ref | Video compression | 708
471.omnetpp | Discrete event simulation | 362
473.astar | Path-finding algorithm | 466
483.xalancbmk | XML processing | 302
5.4 Increasing the Processor Performance • Execution time = n * CPIAvg * clock cycle time • Reduction in the number of executed instructions • Datapath organization leading to lower CPI • Increasing clock speed
5.5 Speedup • Assume a base case execution time of 10 sec. • Assume an improved case execution time of 5 sec. • Percent improvement = (base-new)/base • Percent improvement = (10-5)/10 = 50% • Speedup = base/new • Speedup = 10/5 = 2 • Speedup is preferred by advertising copy writers
Amdahl’s Law • Amdahl’s law: Timeafter = Timeunaffected + Timeaffected/x where x is speedup
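A minimal C sketch of Amdahl's law as stated above. The function names and the sample numbers in the note below are illustrative assumptions.

```c
/* Amdahl's law: only the affected part of the run time shrinks by x. */
double amdahl_time_after(double time_unaffected, double time_affected, double x)
{
    return time_unaffected + time_affected / x;
}

/* Overall speedup = time before the enhancement / time after it. */
double amdahl_speedup(double time_unaffected, double time_affected, double x)
{
    double before = time_unaffected + time_affected;
    return before / amdahl_time_after(time_unaffected, time_affected, x);
}
```

If 8 of 10 seconds are affected and that part is sped up by x = 4, the new time is 2 + 8/4 = 4 seconds, an overall speedup of 2.5 rather than 4, because the unaffected 2 seconds remain.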
5.6 Increasing the Throughput of the Processor • Don’t focus on trying to speedup individual instructions • Instead focus on throughput i.e. number of instructions executed per unit time
5.7 Introduction to Pipelining • Consider a sandwich shop with a five step process – Take order – Bread – Cheese – Meat – Veggies • One employee can do the job • Now imagine 5 employees making sandwiches [Figure: Order → Bread → Cheese → Meat → Veggies]
Pipeline Math • If it takes one person 5 minutes to make a sandwich • And we pipeline the process using 5 people each taking a minute • And we start making sandwiches constantly (i.e. ignore startup pipeline filling) • How long does it actually take to make a single sandwich? (Real elapsed time) • What is the effective time to produce a sandwich? (i.e. a sandwich exits from the pipeline every how many minutes?)
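The two questions can be answered with a short calculation, sketched here in C. The function names are hypothetical, and the model assumes every stage takes the same time and the pipeline never stalls.

```c
/* Time for one item to pass through every stage (real elapsed time). */
unsigned pipeline_latency(unsigned stages, unsigned stage_time)
{
    return stages * stage_time;
}

/* Time to finish n items, including the initial fill: the first item
   needs `stages` steps, and each later item exits one step after the
   previous one. */
unsigned pipeline_total_time(unsigned stages, unsigned stage_time, unsigned n)
{
    if (n == 0)
        return 0;
    return (stages + n - 1) * stage_time;
}
```

With 5 stages of 1 minute each, a single sandwich still takes 5 minutes of elapsed time, but 100 sandwiches take 104 minutes, so once the pipeline is full one sandwich exits every minute.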
5.8 Towards an instruction processing assembly line
Macro State | Functional Units in Use
FETCH | IR, ALU, PC, MEM
DECODE | IR
EXECUTE (ADD) | IR, ALU, Reg-file
EXECUTE (LW) | IR, ALU, Reg-file, MEM, Sign extender
[Figure: two plots of instructions I1 to I4 against time, each instruction passing through F, D, E; in one plot the instructions run one after another, in the other they overlap.]
5.9 Problems with a simple-minded instruction pipeline • The different stages often need the same datapath resources (e.g. ALU, IR). – Structural Hazards • The amount of work done in the different stages is not the same. – TFetch ≠ TDecode ≠ TExecute
5.10 Fixing the problems with the instruction pipeline • IF: Instruction Fetch • ID/RR: Instruction Decode/Read Registers • EX: Execute • MEM: Memory • WB: Write Back
Instruction pipeline with buffers between stages [Figure: Instruction in → IF → BUFFER → ID/RR → BUFFER → EX → BUFFER → MEM → BUFFER → WB → Instruction out]
5.11 Datapath elements for the instruction pipeline [Figure: IF: PC, I-MEM, ALU → BUFFER → ID/RR: DPRF (A, B), decode logic → BUFFER → EX: ALU-1, ALU-2 → BUFFER → MEM: D-MEM → BUFFER → WB: data, DPRF]
5.12 Pipeline-conscious architecture and implementation • Need for a symmetric instruction format • Need to ensure equal amount of work in each stage [Figure: five-stage pipelined datapath (IF, ID/RR, EX, MEM, WB) showing the PC, adders, instruction memory, DPRF, sign extender (SE), ALU, data memory, multiplexers, and the pipeline registers between stages.]
5.12.1 Anatomy of an instruction passage through the pipeline [Figure: IF → FBUF → ID/RR → DBUF → EX → EBUF → MEM → MBUF → WB]
Pipeline Buffers
Name | Output of Stage | Contents
FBUF | IF | Primarily contains instruction read from memory
DBUF | ID/RR | Decoded IR and values read from register file
EBUF | EX | Primarily contains result of ALU operation plus other parts of the instruction depending on the instruction specifics
MBUF | MEM | Same as EBUF if instruction is not LW or SW; if instruction is LW, then buffer contains the contents of memory location read
5.12.2 Design of the Pipeline Registers • Design the pipeline registers solely for the LDR instruction [Figure: layouts of FBUF, DBUF, EBUF and MBUF]
5.12.3 Implementation of the stages • Design and implementation of a pipeline processor may be simpler than a non-pipelined processor. • Pipelined implementation modularizes design. • Layout and interpretation of the pipeline registers are analogous to well-defined interfaces between components of a large software system. • Since datapath actions of each stage happen in one clock cycle, the design of each stage is purely combinational. Each stage: – At the beginning of each clock cycle, interprets input pipeline register, – Carries out the datapath actions using the combinational logic for this stage – Writes the result of the datapath action into its output pipeline register.
5.13 Hazards Structural Control Data
• Reduce throughput to < 1 instruction/cycle • Pipeline is synchronous • Pipeline is stalled when an instruction cannot proceed to next stage. • A stall introduces bubble into pipeline. • NOP instruction is manifestation of bubble. • Stage executing NOP instruction does nothing for one cycle. • Output buffer remains unchanged from previous cycle. • Stalls, bubbles, and NOPs used interchangeably in the textbook to mean the same thing.
5.13.1 Structural hazard • Caused by limitations in hardware that don’t allow concurrent execution of different instructions • Examples – Bus – Single ALU – Single Memory for instructions and data – Single IR • Remedy is to add additional elements to datapath to eliminate hazard
5.13.2 Data Hazard • Consider these three pairs of instructions. Could they be executed in any sequence and yield correct results?
Pair 1: R1 ← R2 + R3; R4 ← R1 + R5
Pair 2: R4 ← R1 + R5; R1 ← R2 + R3
Pair 3: R1 ← R4 + R5; R1 ← R2 + R3
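The three pairs illustrate the three kinds of register dependence (RAW, WAR, WAW), which can be detected mechanically by comparing register numbers. The sketch below is illustrative only: the `Instr` struct, the `HAZ_*` constants and `classify_hazard` are hypothetical names, and instructions are reduced to the form dst ← src1 op src2.

```c
typedef struct {
    int dst, src1, src2;    /* dst <- src1 op src2, as register numbers */
} Instr;

enum { HAZ_NONE = 0, HAZ_RAW = 1, HAZ_WAR = 2, HAZ_WAW = 4 };

/* Returns a bitmask of the dependences that `second` has on `first`,
   where `first` precedes `second` in program order. */
int classify_hazard(Instr first, Instr second)
{
    int h = HAZ_NONE;
    if (second.src1 == first.dst || second.src2 == first.dst)
        h |= HAZ_RAW;       /* second reads what first writes */
    if (second.dst == first.src1 || second.dst == first.src2)
        h |= HAZ_WAR;       /* second writes what first reads */
    if (second.dst == first.dst)
        h |= HAZ_WAW;       /* both write the same register */
    return h;
}
```

Applied to the three pairs above, the function reports RAW, WAR and WAW respectively, so none of the pairs can be reordered freely.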
5.13.2.1 RAW Hazard [Figure sequence: the pipelined datapath (PC, Instr Mem, DPRF, sign extender, ALU, Data Mem, multiplexers) shown cycle by cycle as R1 ← R2+R3 enters the pipeline, followed one stage behind by R4 ← R1+R5, until both instructions have left the pipeline.]
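5.13.2.2 Solving the RAW Data Hazard Problem: Data Forwarding [Figure: the pipelined datapath with R4 ← R1+R5 one stage behind R1 ← R2+R3; the register R1 being written by the leading instruction is compared (=?) with the register being read by the trailing instruction.]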
5.13.2.2 Solving the RAW Data Hazard Problem: Data Forwarding • Forwarding components have to be installed to take care of all possible cases
5.13.2.3 Dealing with RAW Data Hazard introduced by Load instructions [Figure sequence: the pipelined datapath shown cycle by cycle as LW R1,3(R2) enters the pipeline followed by R4 ← R1+R5; the match on R1 is detected and a NOP is inserted between the load and the add.]
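The check that triggers the NOP in the load case can be sketched in C. This is an illustration under assumptions, not the LC-2200 control logic: the `PipeInstr` struct, the `Opcode` values and `needs_load_use_stall` are hypothetical names, and the model assumes results of ALU instructions are forwarded so that only a load followed immediately by a dependent instruction needs a bubble.

```c
#include <stdbool.h>

typedef enum { OP_ADD, OP_LW, OP_NOP } Opcode;

typedef struct {
    Opcode op;
    int dst, src1, src2;    /* register numbers; -1 when unused */
} PipeInstr;

/* True when the instruction in ID/RR must wait one cycle because the
   instruction ahead of it, in EX, is a load into one of its sources. */
bool needs_load_use_stall(PipeInstr in_ex, PipeInstr in_idrr)
{
    if (in_ex.op != OP_LW || in_idrr.op == OP_NOP)
        return false;
    return in_idrr.src1 == in_ex.dst || in_idrr.src2 == in_ex.dst;
}
```

For LW R1,3(R2) followed by R4 ← R1+R5 the function returns true; for R1 ← R2+R3 followed by the same add it returns false, because in this model the ALU result can be forwarded without a stall.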
5.13.2.4 Other types of Data Hazards • WAR – Not a problem in our pipeline – Example: R4 ← R1 + R5 followed by R1 ← R2 + R3 • WAW – Becomes an issue in complex pipelines with many stages
5.13.3 Control Hazard • Typically associated with branch instructions • PC must contain address of next instruction before we know it!!! • Simple solution: Stall pipeline • But what is the impact?
5.13.3 Control Hazard [Figure sequence: the pipelined datapath shown cycle by cycle as BEQ R1, R2, X moves through the pipeline; the instruction to fetch after the branch is unknown (???), so NOPs follow the branch until its outcome is known.]
5.13.3.1 Dealing with branches in the pipelined processor • Delayed Branch • Branch Prediction • Branch prediction with target buffer
5.13.3.2 Summary of dealing with branches in a pipelined processor
Name | Pros | Cons | Examples
Stall | Simple | Performance loss | IBM 360
Predict (not taken) | Good performance | Need additional hardware to be able to flush pipeline | Most modern processors use this technique. Some also employ sophisticated branch target buffers (this entry spans both prediction rows)
Predict (taken) | Good performance | Requires more elaborate hardware since target not available until EX | (see previous row)
Delayed Branch | Needs no hardware, just compiler recognition that it exists | Deep pipelines make it difficult to fill all the delay slots | Older RISC architectures e.g. MIPS, PA-RISC, SPARC
5.13.4 Summary of Hazards • Structural
• Data
• Control
5.14 Dealing with interrupts in a pipelined processor • First Method 1. Stop sending new instructions into the pipeline 2. Wait until the instructions that are in partial execution complete their execution (i.e. drain the pipe). 3. Go to the interrupt state
• Second Method – The other possibility is to flush the pipeline.
5.15 Advanced topics in processor design • Pipelined processor designs have their roots in high performance processors and vector processors of the 1960's and 1970's • Many of the concepts used are still relevant today
5.15.1 Multiple Issue Processors • Sequential Program Model – Perceived Program Order Execution – Actual instruction overlap
• Instruction Level Parallelism (ILP) – Limited by hazards especially control hazards – Basic blocks
5.15.2 Deeper pipelines • Pipelines may have more than 20 stages • Basic blocks are often in the range of 3-7 instructions • Must develop techniques to exploit ILP to make it worthwhile • For example, can issue multiple instructions in one cycle – Assume hardware and/or compiler has selected a group of instruction that can be executed in parallel (no hazards)
• Need additional functional units
5.15.2 Deeper pipelines
Necessity for Deep Pipelining • Relative increase in storage access time • Microcode ROM access • Multiple functional units • Dedicated floating point pipelines • Out of order execution and reorder buffer • Register renaming • Hardware-based speculation
Different Pipeline Depths
5.15.3 Revisiting program discontinuities in the presence of out-of-order processing • External Interrupts can be handled by stopping instruction issue and allowing pipeline to drain • Exceptions and traps were problematic in early pipelined processors – Instructions following instruction causing exception or trap may have already finished and changed processor state – Known as imprecise execution
• Modern processors retire instructions in program order – Potential exceptions are buffered in re-order buffer and will manifest in strictly program order.
5.15.4 Managing shared resources • Managing shared resources such as register files becomes more challenging with multiple functional units • Solutions – Scoreboard keeps track of all resources needed by an instruction – Tomasulo algorithm equips functional units with registers which act as surrogates to the architecture-visible registers
5.15.5 Power Consumption • Speeding up processors can drive designers to try and pack more smaller components onto the chip. • This can allow the clock cycle time to decrease • Unfortunately higher operational frequencies will cause power consumption to increase • Higher power consumption can also lead to thermal problems with chip operating temperatures
5.15.5 Power Consumption
5.15.6 Multi-core Processor Design • One solution to achieving higher performance without increasing power consumption and temperature beyond acceptable limits is multicore processors • Essentially the chip has more than one processor. • Such a design is not as transparent to the programmer as instruction level parallelism and as such brings a whole new set of challenges and opportunities to effectively utilize these new chips.
5.15.7 Intel Core Microarchitecture: An example pipeline
5.16 Historical Perspective • Amdahl works out basic pipelining principles for dissertation at UW-Madison in 1952 • Amdahl is chief architect of IBM s/360, where pipelining is implemented originally in high end mainframe processors • Early minicomputers did not use pipelining • Killer micros did use pipelining to get needed performance advantages • Today almost all processors, except for very low end embedded processors, use some form of pipelining
Questions?
Chapter 6 Processor Scheduling
©Copyright 2009 Umakishore Ramachandran and William D. Leahy Jr.
6.1 Introduction • Things to Do – Laundry – Study for Test – Cook and eat dinner – Call Mom for her birthday
• How would you do it?
6.2 Programs and Processes • What is an operating system? • What are resources? • How do we create programs?
6.2 Programs and Processes • What is the memory footprint of a user program? [Figure: from low memory to high memory: Use by the OS; then the memory footprint of the user program (Program code, Program global data, Program heap, Program stack); then Use by the OS.] • What is the overall view of memory? [Figure: Program 1, Program 2, ..., Program n, OS Data Structures, OS routines] • Why?
6.2 Programs and Processes • What resources are required to run: Hello, World! • What is a scheduler? • Program Properties – Expected running time – Expected memory usage – Expected I/O requirements • Process/System Properties – Available system memory – Arrival time of a program – Instantaneous memory requirements [Figure: Process 1, Process 2, ..., Process n feed into the scheduler, which selects the winner to run on the processor.]
6.2 Programs and Processes • Program – On disk – Static – No state (no PC, no register usage) – Fixed size • Process – In memory (and disk) – Dynamic (changing) – State (PC, registers) – May grow or shrink – Fundamental unit of scheduling • One program may yield many processes
6.2 Programs and Processes
Name | Usual Connotation | Use in this chapter
Job | Unit of scheduling | Synonymous with process
Process | Program in execution; unit of scheduling | Synonymous with job
Thread | Unit of scheduling and/or execution; contained within a process | Not used in the scheduling algorithms described in this chapter
Task | Unit of work; unit of scheduling | Not used in the scheduling algorithms described in this chapter, except in describing the scheduling algorithm of Linux
6.3 Scheduling Environments
6.3 Scheduling Environments
Name | Environment | Role
Long term scheduler | Batch oriented OS | Control the job mix in memory to balance use of system resources (CPU, memory, I/O)
Loader | In every OS | Load user program from disk into memory
Medium term scheduler | Every modern OS (timeshared, interactive) | Balance the mix of processes in memory to avoid thrashing
Short term scheduler | Every modern OS (timeshared, interactive) | Schedule the memory resident processes on the CPU
Dispatcher | In every OS | Populate the CPU registers with the state of the process selected for running by the short-term scheduler
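6.3 Scheduling Environments Process States [Figure: state diagram. New → Ready (Admitted); Ready → Running (Scheduler Dispatch); Running → Ready (Interrupt); Running → Waiting (I/O or Event Wait); Waiting → Ready (I/O or Event Completion); Running → Halted (Exit).]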
6.4 Scheduling Basics
6.4 Scheduling Basics • Schedulers come in two basic flavors – Preemptive – Non-preemptive • Basic scheduler steps 1. Grab the attention of the processor. 2. Save the state of the currently running process. 3. Select a new process to run. 4. Dispatch the newly selected process to run on the processor.
6.4 Scheduling Basics • What information is important to know about a process?
6.4 Scheduling Basics • Process Control Block
enum state_type {new, ready, running, waiting, halted};

typedef struct control_block_type {
    struct control_block_type *next_pcb;  /* list ptr */
    enum state_type state;                /* current state */
    address PC;                           /* where to resume */
    int reg_file[NUMREGS];                /* contents of GPRs */
    int priority;                         /* extrinsic property */
    address address_space;                /* where in memory */
    …
} control_block;

[Figure: PCBs chained through next_pcb, each followed by info…]
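The next_pcb link is what lets the operating system keep PCBs on queues. A minimal FIFO queue sketch follows; it uses a reduced, hypothetical `pcb_t` (only the link and an invented `pid` field) so that the block is self-contained, and the names `queue_t`, `enqueue` and `dequeue` are illustrative rather than taken from the slides.

```c
#include <stddef.h>

typedef struct pcb {
    struct pcb *next_pcb;   /* list ptr, as in the control block above */
    int pid;                /* hypothetical identifier for the example */
} pcb_t;

typedef struct {
    pcb_t *head, *tail;     /* both NULL when the queue is empty */
} queue_t;

/* Append a PCB at the tail of the queue. */
void enqueue(queue_t *q, pcb_t *p)
{
    p->next_pcb = NULL;
    if (q->tail)
        q->tail->next_pcb = p;
    else
        q->head = p;
    q->tail = p;
}

/* Remove and return the PCB at the head, or NULL if the queue is empty. */
pcb_t *dequeue(queue_t *q)
{
    pcb_t *p = q->head;
    if (p) {
        q->head = p->next_pcb;
        if (!q->head)
            q->tail = NULL;
        p->next_pcb = NULL;
    }
    return p;
}
```

A ready queue and an I/O queue can both be instances of `queue_t`; moving a process between them is a `dequeue` from one followed by an `enqueue` on the other.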
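6.4 Scheduling Basics [Figure: queueing diagram. Processes wait in the Ready Queue for the CPU. A process leaves the CPU on an I/O Request (joining the I/O Queue and then performing I/O), when its Time Slice Expires, when it Forks a Child (Child Executes), or to Wait for an Interrupt (Interrupt Occurs). Partially Executed Swapped Out Processes are also shown.]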
6.4 Scheduling Basics
Name | Description
CPU burst | Continuous CPU activity by a process before requiring an I/O operation
I/O burst | Activity initiated by the CPU on an I/O device
PCB | Process control block that holds the state of a process (i.e., program in execution)
Ready queue | Queue of PCBs that represent the set of memory resident processes that are ready to run on the CPU
I/O queue | Queue of PCBs that represent the set of memory resident processes that are waiting for some I/O operation either to be initiated or completed
Non-preemptive algorithm | Algorithm that allows the currently scheduled process on the CPU to voluntarily relinquish the processor (either by terminating or making an I/O system call)
Preemptive algorithm | Algorithm that forcibly takes the processor away from the currently scheduled process in response to an external event (e.g. I/O completion interrupt, timer interrupt)
Thrashing | A phenomenon wherein the dynamic memory usage of the processes currently in the ready queue exceeds the total memory capacity of the system
6.5 Performance Metrics • System Centric. – CPU Utilization: Percentage of time the processor is busy. – Throughput: Number of jobs executed per unit time. – Average turnaround time: Average elapsed time for jobs entering and leaving the system. – Average waiting time: Average of amount of time each job waits while in system
• User Centric – Response time: Time until system responds to user.
6.5 Performance Metrics • Two other qualitative issues – Starvation: The scheduling algorithm prevents a process from ever completing – Convoy Effect: The scheduling algorithm allows long-running jobs to dominate the CPU
6.5 Performance Metrics w1
P1
P2
P3
e1
e2
e3
t1
w2 t2
w3 t3
wi, ei, and ti, are respectively the wait time, execution time, and the elapsed time for a job ji
6.5 Performance Metrics P1 2
P2
3
3
4
P3
2
5
5 9
12 14 19
Assume times are in ms
6.5 Performance Metrics • System Centric. – CPU Utilization: – Throughput: – Average turnaround time: – Average waiting time
• User Centric – Response time:
6.5 Performance Metrics • Assumptions for following slides – Context switch time is negligible – Single I/O queue – Simple model (first-come-first-served) for scheduling I/O requests.
6.6 Non-preemptive Scheduling Algorithms • Non-preemptive means that once a process is running it will continue to do so until it relinquishes control of the CPU. This would be because it terminates, voluntarily yields the CPU to some other process (waits) or requests some service from the operating system.
6.6.1 First-Come First-Served (FCFS) • Intrinsic property: Arrival time • May exhibit convoy effect • No starvation • High variability of average waiting time
6.6.2 Shortest Job First (SJF) • Uses anticipated burst time • No convoy effect • Provably optimal for best average waiting time • May suffer from starvation – May be addressed with aging rules
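The difference between FCFS and SJF shows up directly in the average waiting time. The C sketch below is a simplified illustration, not the book's algorithm listing: it assumes all jobs arrive at time 0, each job is a single CPU burst with no I/O, and context switch time is zero. The function names, the 64-job limit and the sample burst times are assumptions.

```c
#include <stddef.h>

/* Average waiting time when jobs run back to back in the given order. */
double avg_wait_in_order(const unsigned *burst, size_t n)
{
    unsigned long elapsed = 0, total_wait = 0;
    for (size_t i = 0; i < n; i++) {
        total_wait += elapsed;      /* job i waits for all earlier jobs */
        elapsed += burst[i];
    }
    return n ? (double)total_wait / (double)n : 0.0;
}

/* SJF: run the same jobs in order of increasing burst time. */
double avg_wait_sjf(const unsigned *burst, size_t n)
{
    unsigned sorted[64];
    if (n > 64)
        n = 64;                     /* sketch limit, not a real constraint */
    for (size_t i = 0; i < n; i++) {    /* insertion sort into sorted[] */
        unsigned key = burst[i];
        size_t j = i;
        while (j > 0 && sorted[j - 1] > key) {
            sorted[j] = sorted[j - 1];
            j--;
        }
        sorted[j] = key;
    }
    return avg_wait_in_order(sorted, n);
}
```

For bursts of 24, 3 and 3 ms arriving in that order, FCFS gives waits of 0, 24 and 27 ms (average 17 ms), the convoy effect in miniature, while SJF gives waits of 0, 3 and 6 ms (average 3 ms).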
6.6.3 Priority • Each process is assigned a priority • May have additional policy such as FCFS for all jobs with same priority • Attractive for environments where different users will pay more for preferential treatment • SJF is a special case with Priority=1/burst time • FCFS is a special case with Priority = arrival time
6.7 Preemptive Scheduling Algorithms • Two simultaneous implications. – Scheduler is able to assume control of the processor anytime unbeknownst to the currently running process. – Scheduler is able to save the state of the currently running process for proper resumption from the point of preemption.
• Any of the Non-preemptive algorithms can be made Preemptive
6.7.1 Round Robin Scheduler • Appropriate for time-sharing environments • Need to determine time quantum q: Amount of time a process gets before being context switched out (also called timeslice) – Context switching time becomes important
• FCFS is a special case with q = ∞ • If n processes are running under round robin they will have the illusion they have exclusive use of a processor running at 1/n times the actual processor speed
6.7.1.1 Details of Round Robin Algorithm • What do we mean by context?
• How does the dispatcher get run?
• How does the dispatcher switch contexts?
6.7.1.1 Details of Round Robin Algorithm: Round Robin Scheduling Algorithm
• Dispatcher: get head of ready queue; set timer; dispatch;
• Timer interrupt handler: save context in PCB; move PCB to the end of the ready queue; upcall to dispatcher;
• I/O request trap: save context in PCB; move PCB to I/O queue; upcall to dispatcher;
• I/O completion interrupt handler: save context in PCB; move PCB of I/O completed process to ready queue; upcall to dispatcher;
• Process termination trap handler: Free PCB; upcall to dispatcher;
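The timer-driven part of the algorithm can be simulated in a few lines of C. This sketch is a simplification under stated assumptions: all processes arrive at time 0, none performs I/O, and context switch time is zero, so cycling over the array in index order is equivalent to moving each preempted PCB to the end of the ready queue. The function name and the 64-process limit are hypothetical.

```c
#include <stddef.h>

/* Fills completion[i] with the time at which process i finishes under
   round robin with time quantum q. */
void round_robin(const unsigned *burst, size_t n, unsigned q,
                 unsigned *completion)
{
    unsigned remaining[64];
    unsigned now = 0;
    size_t left = 0;

    if (n > 64)
        n = 64;                         /* sketch limit */
    for (size_t i = 0; i < n; i++) {
        remaining[i] = burst[i];
        completion[i] = 0;
        if (burst[i] > 0)
            left++;
    }
    while (left > 0 && q > 0) {
        for (size_t i = 0; i < n; i++) {
            if (remaining[i] == 0)
                continue;               /* already terminated */
            unsigned slice = remaining[i] < q ? remaining[i] : q;
            now += slice;               /* runs until timer or termination */
            remaining[i] -= slice;
            if (remaining[i] == 0) {
                completion[i] = now;    /* process termination */
                left--;
            }
        }
    }
}
```

With bursts of 5, 3 and 1 ms and q = 2 ms, the processes finish at 9, 8 and 5 ms respectively: the short job completes early instead of waiting behind the long one.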
6.8 Combining Priority and Preemption • Modern general purpose operating systems such as Windows NT/XP/Vista and Unix/Linux use multi-level feedback queues • System consists of a number of different queues each with a different expected quantum time • Each individual queue uses FCFS except base queue which uses Round Robin
6.8 Combining Priority and Preemption q1 A process that doesn’t finish before qi drops down 1 level
Note: q1 FREE 2K
8K
3K
P3
11K
2K
FREE
As P2 terminates the memory allocator can look for adjacent free blocks
7.3.2 Variable Size Partitions After coalescing adjacent free blocks
Allocation table
Start address | Size | Process
0 | 8K | FREE
8K | 3K | P3
11K | 2K | FREE
[Figure: memory laid out as an 8K free block, the 3K block of P3, and a 2K free block.]
7.3.2 Variable Size Partitions • When space is requested several possible options – Best Fit • Lower internal fragmentation • Longer search time • Table may be indexed by start address which is good for coalescing • Table may be indexed by size which is faster for allocation
– First Fit • Faster allocation • Table may be indexed by start address which is good for coalescing
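The two placement policies differ only in how the table is searched. A minimal sketch, assuming a hypothetical `block` record and a table indexed by start address:

```c
#include <stddef.h>

/* Hypothetical allocation-table entry; sizes in KB. */
typedef struct {
    unsigned start;
    unsigned size;
    int free;          /* nonzero if the block is unallocated */
} block;

/* First fit: index of the first free block large enough, or -1. */
int first_fit(const block *t, size_t n, unsigned request) {
    for (size_t i = 0; i < n; i++)
        if (t[i].free && t[i].size >= request)
            return (int)i;
    return -1;
}

/* Best fit: index of the smallest free block large enough, or -1.
 * The whole table is scanned, which is the longer search time. */
int best_fit(const block *t, size_t n, unsigned request) {
    int best = -1;
    for (size_t i = 0; i < n; i++)
        if (t[i].free && t[i].size >= request &&
            (best < 0 || t[i].size < t[(size_t)best].size))
            best = (int)i;
    return best;
}
```

For a 2K request against free blocks of 8K (at 0) and 2K (at 11K), first fit returns the 8K block and best fit returns the 2K block.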
7.3.3 Compaction • If fragmentation becomes excessive we can compact memory by moving processes • This is virtually impossible with modern architectures! – Base register concept
Allocation table (after compaction)
Start address | Size | Process
0 | 3K | P3
3K | 10K | FREE
7.4 Paged Virtual Memory • As memory size increases problem of external fragmentation increases • Want to attack problem of external fragmentation • User views his memory partition as contiguous memory • But does it have to really be contiguous?
7.4 Paged Virtual Memory
[Figure: the CPU issues a virtual address to a BROKER, which presents a physical address to memory]
• Need a system to take user virtual addresses and translate into physical address corresponding to the physical memory present
7.4 Paged Virtual Memory • Conceptually break up both logical (virtual) memory and physical memory into equal sized blocks
[Figure: logical memory divided into pages and physical memory divided into frames; Page Size = Frame Size]
7.4 Paged Virtual Memory • Where is logical memory? • Need mechanism to translate from logical pages to physical frames
[Figure: the CPU issues a virtual address to the BROKER, which uses the page table to produce the physical address sent to memory]
7.4 Paged Virtual Memory
[Figure: the user’s view is Page 0 to Page 3; the page table holds frame numbers 35, 52, 12, 15; physical memory, drawn from LOW to HIGH, holds those pages in frames 12, 15, 35 and 52]
Key Point • The user still perceives a contiguous memory space • The space is not necessarily contiguous in physical memory • External fragmentation is eliminated!
7.4.1 Page Table • Suppose pages and frames are 4096 bytes long. • What do logical addresses look like?
Decimal | Binary
0 | 00000000000000000000000000000000
4095 | 00000000000000000000111111111111
4096 | 00000000000000000001000000000000
32456 | 00000000000000000111111011001000
7.4.1 Page Table • Suppose pages and frames are 4096 bytes long. • What do addresses look like?
Decimal | Hexadecimal (32 bits) | Virtual Page Number | Offset
0 | 0x00000000 | 0x00000 | 0x000
4095 | 0x00000FFF | 0x00000 | 0xFFF
4096 | 0x00001000 | 0x00001 | 0x000
32456 | 0x00007EC8 | 0x00007 | 0xEC8
7.4.1 Page Table
• Assume page/frame size is 4096 bytes
• Assume 32 bit virtual address (4 GB)
• Assume 28 bit physical address (256 MB)
• What is the layout of the virtual address and the physical address?
• How does a virtual address like 0x3E1234 get translated into a physical address?
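7.4.1 Page Table
[Figure: the CPU issues virtual address 003E1 | 234; the PTBR locates the page table; entry 0x3E1 supplies PFN 0x0044, so physical address 0044 | 234 goes to memory]
Page table
VPN | PFN
0x0 | 0x0023
0x1 | 0x0124
0x2 | 0x1111
0x3 | 0x3F04
0x3E0 | 0x0000
0x3E1 | 0x0044
0x3E2 | 0x0068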
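The translation in this example can be traced in a few lines of C. This is a sketch under stated assumptions: a real page table is an array indexed directly by VPN, whereas the hypothetical `pt_entry` list here is sparse and searched linearly just to keep the example small; `lookup` and `translate` are illustrative names.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12u                        /* 4096-byte pages and frames */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)  /* low 12 bits hold the offset */

/* Hypothetical sparse page-table entry, for illustration only. */
typedef struct {
    uint32_t vpn;
    uint32_t pfn;
} pt_entry;

/* Finds the PFN mapped to `vpn`; returns 0 on success, -1 if unmapped. */
int lookup(const pt_entry *pt, size_t n, uint32_t vpn, uint32_t *pfn) {
    for (size_t i = 0; i < n; i++) {
        if (pt[i].vpn == vpn) {
            *pfn = pt[i].pfn;
            return 0;
        }
    }
    return -1;
}

/* Splits the virtual address into VPN and offset, replaces the VPN
 * with the PFN, and keeps the offset unchanged. */
int translate(const pt_entry *pt, size_t n, uint32_t va, uint32_t *pa) {
    uint32_t vpn = va >> PAGE_SHIFT;
    uint32_t offset = va & PAGE_MASK;
    uint32_t pfn;
    if (lookup(pt, n, vpn, &pfn) != 0)
        return -1;
    *pa = (pfn << PAGE_SHIFT) | offset;
    return 0;
}
```

With the page table shown on the slide, virtual address 0x003E1234 has VPN 0x3E1 and offset 0x234, and translates to physical address 0x0044234.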
7.4.2 Hardware for Paging
• PTBR
• Translation hardware VPN to PFN
• Page table is in kernel memory space
• Note: each process has a page table
• How many memory accesses are required for each memory request by the CPU?
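7.4.3 Page Table Set up
typedef struct control_block_type {
    enum state_type state;               /* state of the process */
    address PC;                          /* saved program counter */
    int reg_file[NUMREGS];               /* saved register contents */
    struct control_block_type *next_pcb; /* link to the next PCB in a queue */
    int priority;
    address PTBR;                        /* page table base register value for this process */
    … …
} control_block;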
7.5 Segmented Virtual Memory • Segmentation is a system allowing a process's memory space to be subdivided into chunks of memory each associated with some aspect of the overall program • Typical segments – Code – Global data – Heap – Stack
7.5 Segmented Virtual Memory • Process address space divided up into n distinct segments • Each segment has – A number – A size
• Each segment starts at its own address 0 and goes up to its size – 1. • Segment addressing
7.5.1 Hardware for Segmentation
7.6 Paging versus Segmentation
Attribute | Paging | Segmentation
User shielded from size limitation of physical memory | Yes | Yes
Relationship to physical memory | Physical memory may be less than or greater than virtual memory | Physical memory may be less than or greater than virtual memory
Address spaces per process | One | Several
Visibility to the user | User unaware of paging; user is given an illusion of a single linear address space | User aware of multiple address spaces each starting at address 0
Software engineering | No obvious benefit | Allows organization of the program components into individual segments at user discretion; enables modular design; increases maintainability
Program debugging | No obvious benefit | Aided by the modular design
Sharing and protection | User has no direct control; operating system can facilitate sharing and protection of pages across address spaces but this has no meaning from the user’s perspective | User has direct control of orchestrating the sharing and protection of individual segments; especially useful for object-oriented programming and development of large software
Size of page/segment | Fixed by the architecture | Variable chosen by the user for each individual segment
Internal fragmentation | Internal fragmentation possible for the portion of a page that is not used by the address space | None
External fragmentation | None | External fragmentation possible since the variable sized segments have to be allocated in the available physical memory thus creating holes (see Figure 7.18)
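7.6.1 Interpreting the CPU generated address
Memory System | Virtual Address Computation | Physical Address Computation | Size of Tables
Segmentation | Segment number, segment offset | Segment Start address = Segment-Table[Segment-Number]; Physical address = Segment Start Address + Segment Offset | Segment table size = 2^nseg entries (nseg = number of bits in the segment number)
Paging | VPN, page offset | PFN = Page-Table[VPN]; Physical address = PFN concatenated with page offset | Page table size = 2^nVPN entries (nVPN = number of bits in the VPN)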
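The segmentation row can be sketched in C as follows. The `segment` struct and function names are assumptions made for this illustration, and the size check reflects the earlier statement that each segment runs from its own address 0 up to its size – 1.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical segment-table entry. */
typedef struct {
    uint32_t start;   /* segment start address in physical memory */
    uint32_t size;    /* segment size in bytes */
} segment;

/* Physical address = segment start address + segment offset.
 * Returns 0 on success, -1 for an unknown segment number or an
 * offset at or beyond the segment size. */
int seg_translate(const segment *table, size_t nsegs,
                  uint32_t segnum, uint32_t offset, uint32_t *pa) {
    if (segnum >= nsegs)
        return -1;
    if (offset >= table[segnum].size)
        return -1;
    *pa = table[segnum].start + offset;
    return 0;
}

/* Number of entries in a table indexed by a `bits`-bit field: 2^bits.
 * Assumes bits < 64. */
uint64_t table_entries(unsigned bits) {
    return (uint64_t)1 << bits;
}
```

For example, a 20-bit VPN (32-bit virtual address with 4096-byte pages) gives a page table of 2^20 = 1048576 entries.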
7.7 Summary
Memory Management Criterion | User/Kernel Separation | Fixed Partition | Variable-sized Partition | Paged Virtual Memory | Segmented Virtual Memory | Paged-segmented Virtual Memory
Improved resource utilization | No | Internal fragmentation bounded by partition size; External fragmentation | External fragmentation | Internal fragmentation bounded by page size | External fragmentation | Internal fragmentation bounded by page size
Independence and protection | No | Yes | Yes | Yes | Yes | Yes
Liberation from resource limitation | No | No | No | Yes | Yes | Yes
Sharing by concurrent processes | No | No | No | Yes | Yes | Yes
Facilitates good software engineering practice | No | No | No | No | Yes | Yes
7.7 Summary
Scheme | Hardware Support | Still in Use?
User/Kernel Separation | Fence register | No
Fixed Partition | Bounds registers | Not in any production operating system
Variable-sized Partition | Base and limit registers | Not in any production operating system
Paged Virtual Memory | Page table and page table base register | Yes, in most modern operating systems
Segmented Virtual Memory | Segment table, and segment table base register | Segmentation in this pure form not supported in any commercially popular processors
Paged-segmented Virtual Memory | Combination of the hardware for paging and segmentation | Yes, most modern operating systems based on Intel x86 use this scheme [1]
[1] It should be noted that Intel’s segmentation is quite different from the pure form of segmentation presented in this chapter. Please see Section 7.8.2 for a discussion of Intel’s paged-segmentation scheme.
7.8 Historical Perspective • Burroughs Corporation introduced segmented virtual memory in the B5000 line of machines • GE, in partnership with the MULTICS project at MIT, introduced paged-segmentation in the GE 600 line of machines • IBM introduced System/360 with base and limit registers. Relocation system not effective since base register visible to programmers • IBM introduced System/370 with true virtual memory, which eventually dominated the market
7.8.1 MULTICS • Some academic computing projects have a profound impact on evolution of field for a very long time. • MULTICS project at MIT was one such. • Unix, Linux, Paging, Segmentation, Security, Protection, etc. had their birth in this project. • OS concepts introduced in MULTICS project were way ahead of their time and processor architectures of that time were not geared to support advanced concepts of memory protection advocated by MULTICS. • MULTICS introduced the concept of paged segmentation.
7.8.2 Intel’s Memory Architecture
• Intel Pentium line of processors uses paged-segmentation.
• Approximately, a virtual address is a segment selector plus an offset
• Total segment space divided into two halves – System segments are common to all processes and are used by OS – User segments are unique to each process.
• Two descriptor tables – Global Descriptor Table (GDT) common to all processes – Local Descriptor Table (LDT) for each process
• A bit in the segment selector identifies whether the segment being named by the virtual address is a system or a user segment.
• Segment descriptor for the selected segment contains the details for translating the offset specified in the virtual address to a physical address.
• Choice – Use simple segmentation w/o any paging (compatible with earlier processors) – Use paged-segmentation.
Chapter 8 Topics in Page-based Memory Management ©Copyright 2008 Umakishore Ramachandran and William D. Leahy Jr.
8.1 Demand Paging • Paging as described in Chapter 7 implied the whole program was in memory • But does it have to be? • On average – 30% of a program's memory footprint is the primary logic of the program – 70% is little used error handling
• Therefore, prudent for memory manager not to load entire program into memory on startup. • Basic idea is to load parts of the program that are not in memory on demand. • This technique, referred to as demand paging results in better memory utilization.
What would be the main advantage of demand paging?
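8.1.1 Hardware for demand paging
[Figure: the CPU issues virtual address 003E1 | 234; the PTBR locates the page table; entry 0x3E1 supplies PFN 0x0044, so physical address 0044 | 234 goes to memory. Each page table entry now also carries a valid bit (v); the bit values are not recoverable]
Page table
VPN | PFN
0x0 | 0x0023
0x1 | 0x0124
0x2 | 0x1111
0x3 | 0x3F04
0x3E0 | 0x0000
0x3E1 | 0x0044
0x3E2 | 0x0068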
8.1.1 Hardware for demand paging
[Figure: five-stage pipeline IF, ID/RR, EX, MEM, WB with buffers between stages; I5 is in IF, I4 in ID/RR, I3 in EX, I2 in MEM and I1 in WB]
• Potential page faults
• If I5 page faults…handle
• If I2 page faults
– Let I1 complete and squash I3-I5 before INT.
– INT needs to save PC corresponding to I2 for re-starting I2 after servicing page fault.
• Note that there is no harm in squashing instructions I3-I5 since they have not modified permanent state of program
8.1.2 Page fault handler 1. Find a free page frame 2. Load the faulting virtual page from the disk into the free page frame 3. Update the page table for the faulting process 4. Place the PCB of the process back in the ready queue of the scheduler
8.1.3 Data structures for Demand-paged Memory Management • Free-list of page frames • Frame table • Disk map
8.1.3 Data structures for Demand-paged Memory Management • Free-list of page frames
[Figure: free-list linking Pframe 52 -> Pframe 20 -> Pframe 200 -> … -> Pframe 8]
8.1.3 Data structures for Demand-paged Memory Management • Frame table
[Figure: frame table indexed by Pframe 0 to 7; frames 1, 4 and 7 are marked free; the contents of the other entries are not recoverable]
8.1.3 Data structures for Demand-paged Memory Management • Disk map
[Figure: disk map for P1 giving a disk address for each VPN 0 to 7; the swap space on disk holds the images of P1, P2, …, Pn]
8.1.4 Anatomy of a Page Fault
• Find a free page frame
• Pick victim page and evict
• Load faulting page
• Update page table for faulting process and frame table
• Restart faulting process
Eviction • When evicting a page we must consider its status – Clean: the page has not been written to and thus matches its counterpart on disk – Dirty: the page has been written to and no longer matches what is on disk
• Clean pages may be simply evicted • Dirty pages must be written back to disk
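The page fault steps and the clean/dirty distinction can be put together in a small sketch. Everything here is an assumption made for illustration: a single process, a tiny memory of `NFRAMES` frames, a simple rotating victim pointer standing in for the replacement policy, and counters in place of real disk I/O.

```c
#include <stdbool.h>

#define NFRAMES 2   /* tiny physical memory, for illustration only */
#define NPAGES  4

typedef struct { bool valid; bool dirty; int pfn; } pte;
typedef struct { bool in_use; int vpn; } frame;

typedef struct {
    pte   pt[NPAGES];    /* page table of the (single) process */
    frame ft[NFRAMES];   /* frame table */
    int   next_victim;   /* rotating victim pointer (an assumption) */
    int   disk_reads;    /* pages loaded from disk */
    int   disk_writes;   /* dirty pages written back */
} mm_state;

/* Services a fault on `vpn`; returns the frame now holding the page. */
int page_fault(mm_state *m, int vpn) {
    int f = -1;
    for (int i = 0; i < NFRAMES; i++)            /* find a free page frame */
        if (!m->ft[i].in_use) { f = i; break; }
    if (f < 0) {                                  /* none free: pick victim and evict */
        f = m->next_victim;
        m->next_victim = (m->next_victim + 1) % NFRAMES;
        pte *victim = &m->pt[m->ft[f].vpn];
        if (victim->dirty)
            m->disk_writes++;                     /* dirty page: write back to disk */
        victim->valid = false;                    /* clean page: simply evicted */
        victim->dirty = false;
    }
    m->disk_reads++;                              /* load the faulting page */
    m->pt[vpn] = (pte){ .valid = true, .dirty = false, .pfn = f };  /* update page table */
    m->ft[f]   = (frame){ .in_use = true, .vpn = vpn };             /* and frame table */
    return f;                                     /* caller restarts the faulting process */
}
```

Only the eviction of a dirty page costs a disk write; evicting a clean page costs nothing beyond the read of the incoming page.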
8.2 Interaction between the Process Scheduler and Memory Manager
[Figure: at user level, Process 1 to Process n; in the kernel, the CPU scheduler with its ready_q of PCBs and the memory manager with the freelist of Pframes, the frame table (FT) and a page table (PT) and disk map (DM) per process; in hardware, the CPU, with process dispatch going down and timer interrupt (1) and page fault (2) going up into the kernel]
• CPU scheduler dispatches process, it runs until one of following happens
1. HW timer interrupts CPU causing upcall (1) to CPU scheduler that may result in a process switch. CPU scheduler takes appropriate action to schedule next process on CPU.
2. Process incurs a page fault resulting in an upcall (2) to memory manager that results in page fault handling
3. Process makes system call resulting in another subsystem (not shown) getting an upcall
8.3 Page Replacement Policies • How to pick victim page to evict from physical memory when page fault & free-list is empty. • For a given string of page references, policy should result in least number of page faults. – This attribute ensures that the amount of time spent in OS dealing with page faults is minimized.
• Ideally, once a particular page has been brought into physical memory, policy should not incur a page fault for same page again. – This attribute ensures that page fault handler attempts to respect reference pattern of the user programs.
8.3 Page Replacement Policies • May use – Local victim selection • Simple • Don't need frame table • Poor memory utilization
– Global victim selection • Better memory utilization • The norm
• Ideally there are no page faults and memory manager never runs. • Goal is to minimize (or eliminate) page faults
8.3.1 Belady’s Min • In 1966 Laszlo Belady proposed an optimal page replacement algorithm that requires knowing the page reference string in advance • In general this is impossible, since the memory manager cannot see future references • But the performance of Belady's Min may be used as a reference standard against which to compare other policies
8.3.2 First In First Out (FIFO) • Affix a timestamp when a page is brought in to physical memory • If a page has to be replaced, choose the longest resident page as the victim • No special hardware needed • Queue length is number of physical frames
[Figure: circular queue of page frames with Head and Tail pointers; entries between Head and Tail are full, the rest are free]
FIFO • Maintain queue. As page is read in enqueue. Use head of queue as frame to replace • Sample 1,2,3,4,1,2,5,1,2,3,4,5
FIFO • Page faults on the sample reference string 1,2,3,4,1,2,5,1,2,3,4,5 for different numbers of physical frames
Physical frames | Page faults
1 | 12
2 | 12
3 | 9
4 | 10
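Belady’s Anomaly
• FIFO on the reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5: going from 3 to 4 physical frames raises the number of page faults from 9 to 10
[Table: frame contents over time with 5 physical frames; each of pages 1 to 5 stays resident once it has been loaded]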
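The anomaly is easy to reproduce with a short FIFO simulation on the sample reference string from the FIFO slide. This is a minimal sketch; the `fifo_faults` name and the `MAX_FRAMES` limit are assumptions made for the example.

```c
#include <stddef.h>

#define MAX_FRAMES 16   /* arbitrary upper bound for this sketch */

/* Counts page faults for FIFO replacement with `nframes` physical frames.
 * Returns -1 if `nframes` is out of range. */
int fifo_faults(const int *refs, size_t nrefs, int nframes) {
    int frames[MAX_FRAMES];
    int count = 0;    /* frames currently filled */
    int head = 0;     /* index of the longest-resident page */
    int faults = 0;
    if (nframes < 1 || nframes > MAX_FRAMES)
        return -1;
    for (size_t i = 0; i < nrefs; i++) {
        int hit = 0;
        for (int j = 0; j < count; j++)
            if (frames[j] == refs[i]) { hit = 1; break; }
        if (hit)
            continue;
        faults++;
        if (count < nframes) {
            frames[count++] = refs[i];    /* a free frame is still available */
        } else {
            frames[head] = refs[i];       /* replace the longest-resident page */
            head = (head + 1) % nframes;  /* circular queue: advance the head */
        }
    }
    return faults;
}
```

On the string 1,2,3,4,1,2,5,1,2,3,4,5 this gives 9 faults with 3 frames and 10 faults with 4 frames, so adding a frame makes FIFO worse here.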
8.3.3 Least Recently Used (LRU) • LRU policy makes assumption that if a page has not been referenced in a long time there is a good chance it will not be referenced in the future as well. • Thus, victim page in LRU policy is page that has not been used for longest time.
[Figure: push down stack of page frames with Top and Bottom pointers; unused entries are free]
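8.3.3 Least Recently Used (LRU)
[Table: contents of the physical frames and of the push down stack over time for the reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5; cell values not recoverable]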
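A software model of the push down stack shows how true LRU behaves on the same reference string. This is a sketch under stated assumptions: `lru_faults` and `MAX_FRAMES` are illustrative names, and the stack is a plain array with the most recently used page at index 0.

```c
#include <stddef.h>

#define MAX_FRAMES 16   /* arbitrary upper bound for this sketch */

/* Counts page faults for true LRU with `nframes` physical frames.
 * Returns -1 if `nframes` is out of range. */
int lru_faults(const int *refs, size_t nrefs, int nframes) {
    int stack[MAX_FRAMES];   /* stack[0] is the most recently used page */
    int depth = 0;
    int faults = 0;
    if (nframes < 1 || nframes > MAX_FRAMES)
        return -1;
    for (size_t i = 0; i < nrefs; i++) {
        int pos = -1;
        for (int j = 0; j < depth; j++)
            if (stack[j] == refs[i]) { pos = j; break; }
        if (pos < 0) {
            faults++;
            if (depth < nframes)
                depth++;          /* a free frame is still available */
            pos = depth - 1;      /* otherwise the bottom (LRU) entry is the victim */
        }
        for (int j = pos; j > 0; j--)
            stack[j] = stack[j - 1];   /* push the more recent entries down */
        stack[0] = refs[i];            /* current reference goes on top */
    }
    return faults;
}
```

On the string 1,2,3,4,1,2,5,1,2,3,4,5, LRU incurs 10 faults with 3 frames and 8 faults with 4 frames. Every reference moves an entry to the top of the stack, which is the per-access cost that makes a hardware version impractical.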
8.3.3 Least Recently Used (LRU) • LRU is appealing but actually not feasible – Stack has as many entries as number of physical frames. For a physical memory of 64 MB & an 8 KB page size, the stack needs 64 MB / 8 KB = 8K entries. Too big in datapath! – On every access, hardware has to modify stack to place current reference on top of stack. Too slow.
• LRU may be bad choice in certain situations – e.g. cyclically accessing N+1 pages with only N frames available: every reference evicts the page that will be needed soonest
8.3.3.1 Approximate LRU: A Small Hardware Stack • Add a hardware stack with ~16 entries • Push references onto stack – If they are already in stack bring to top – Bottom reference falls out of stack
• When free frame needed randomly select one not in stack • Shown to be successful in some applications • Probably not fast enough for high speed pipelined processor
8.3.3.2 Approximate LRU: Reference bit per page frame • Associate a bit with each frame. – hardware sets on reference – software reads and clears
• Have an n-bit counter register for each frame • Periodically (daemon) right shifts all counters and puts reference bit into high order bit • Highest value counters are recently used frames; lowest value counters are LRU frames
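The reference-bit scheme can be sketched as a periodic daemon step plus a victim search. The 8-bit counter width, the `frame_age` struct and the function names are assumptions made for this illustration.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint8_t ref;       /* reference bit: set by hardware, cleared by software */
    uint8_t counter;   /* n-bit history; n = 8 is an assumption */
} frame_age;

/* Daemon step: shift each counter right, put the reference bit into
 * the high order bit, then clear the reference bit. */
void age_tick(frame_age *f, size_t n) {
    for (size_t i = 0; i < n; i++) {
        f[i].counter = (uint8_t)((f[i].counter >> 1) | (f[i].ref ? 0x80u : 0u));
        f[i].ref = 0;
    }
}

/* The frame with the lowest counter approximates the LRU frame.
 * Ties go to the lowest frame index. */
size_t pick_victim(const frame_age *f, size_t n) {
    size_t v = 0;
    for (size_t i = 1; i < n; i++)
        if (f[i].counter < f[v].counter)
            v = i;
    return v;
}
```

A frame referenced in the most recent period always outranks one that was last referenced earlier, because the newest reference bit lands in the most significant position.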
8.3.4 Second chance page replacement algorithm • Initially, OS clears reference bits of all frames. As program executes, hardware sets reference bits for pages referenced by program. • If a page has to be replaced, memory manager chooses replacement candidate in FIFO manner. • If chosen victim’s reference bit is set, then manager clears reference bit and this page is moved to end of FIFO queue. • The victim is first candidate in FIFO order whose reference bit is not set.
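A compact way to sketch second chance is to treat the FIFO queue as circular, so that "moving a page to the end of the queue" becomes advancing a pointer past it. The `sc_frame` struct, the `hand` pointer and the function name are assumptions made for this illustration; `n` must be at least 1.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int page;
    bool referenced;   /* reference bit, set by hardware on access */
} sc_frame;

/* Frames form a circular FIFO and *hand indexes the oldest frame.
 * Returns the index of the victim frame and leaves *hand just past it. */
size_t second_chance_victim(sc_frame *f, size_t n, size_t *hand) {
    for (;;) {
        size_t i = *hand;
        *hand = (*hand + 1) % n;
        if (!f[i].referenced)
            return i;               /* first frame in FIFO order with bit not set */
        f[i].referenced = false;    /* clear the bit: the page gets a second chance */
    }
}
```

The loop always terminates: after one full sweep every reference bit has been cleared, so in the worst case the oldest page is chosen, which is plain FIFO.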
8.3.5 Review of page replacement algorithms
Page replacement algorithm | Hardware assist needed | Comments
FIFO | None | Could lead to anomalous behavior
Belady’s MIN | Oracle | Provably optimal performance; not realizable in hardware; useful as a standard for performance comparison
True LRU | Push down stack | Expected performance close to optimal; infeasible for hardware implementation due to space and time complexity; worst-case performance may be similar or even worse compared to FIFO
Approximate LRU #1 | A small hardware stack | Expected performance close to optimal; worst-case performance may be similar or even worse compared to FIFO
Approximate LRU #2 | Reference bit per page | Expected performance close to optimal; moderate hardware complexity; worst-case performance may be similar or even worse compared to FIFO
Second Chance Replacement | Reference bit per page | Expected performance better than FIFO; memory manager implementation simplified compared to LRU schemes
8.3.6 Optimizing Memory Management • Beyond the basic techniques presented, additional optimizations are possible • These optimizations are on top of the already presented techniques
8.3.7 Pool of free page frames • Instead of waiting for the free page count to reach 0 • Periodically run a daemon that evicts pages to keep a pool of n free page frames
8.3.7.1 Overlapping I/O with Processing • Upon eviction we add the evicted frame to the free list • If the frame was dirty it is scheduled for write back • When a frame is needed only clean frames are selected skipping over dirty frames still awaiting write back
8.3.7.2 Reverse Mapping to Page Tables • When daemon runs to maintain free list at a certain level it will take pages from processes that might turn around and page fault on those pages • If we maintain additional info in free list we can know this and give page back to process (since the data is still intact)
[Figure: freelist linking Dirty Pframe 52 -> Clean Pframe 22 -> Clean Pframe 200 -> ….]
8.3.8 Thrashing • Suppose many processes are in memory but CPU utilization is low. What could cause this? 1. Too many I/O bound processes? 2. Too many CPU bound processes?
• Should we add more processes into memory?
8.3.8 Thrashing • If some processes don't really have enough pages to support their current configuration they will constantly be page faulting and trying to grab frames from other processes • This may lead those processes to also start page faulting and grabbing frames • Does this sound like the ideal place to introduce more processes into memory?
8.3.8 Thrashing • Controlling thrashing starts with understanding temporal locality • Temporal locality is the tendency for the same memory location to be accessed repeatedly over a short period of time • During a given time period t certain pages will be accessed, others will not • If during t the pages that need to be accessed are in memory then no page faults will occur
[Figure: frame contents over time for a memory reference string that repeatedly cycles through pages 1, 2, 3, 4, 5]
8.3.9 Working set • Working set is the set of pages that defines the locus of activity of a program • The working set size (WSS) denotes the number of distinct pages touched by a process in a window of time. • The total memory pressure (TMP) exerted on the system is the summation of the WSS of all the processes currently competing for resources.
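The two definitions can be computed directly from a reference trace. This is a sketch under stated assumptions: the window is measured in number of references rather than in time, and the function names are illustrative.

```c
#include <stddef.h>

/* Working set size: number of distinct pages among the last `window`
 * references ending at index `end` (inclusive). */
size_t working_set_size(const int *refs, size_t end, size_t window) {
    size_t start = (end + 1 > window) ? end + 1 - window : 0;
    size_t distinct = 0;
    for (size_t i = start; i <= end; i++) {
        int seen = 0;
        for (size_t j = start; j < i; j++)
            if (refs[j] == refs[i]) { seen = 1; break; }
        if (!seen)
            distinct++;   /* first occurrence of this page inside the window */
    }
    return distinct;
}

/* Total memory pressure: sum of the WSS of all competing processes. */
size_t total_memory_pressure(const size_t *wss, size_t nprocs) {
    size_t total = 0;
    for (size_t i = 0; i < nprocs; i++)
        total += wss[i];
    return total;
}
```

For the trace 1,2,1,3,1,2,4,4, a window of 4 references ending at index 5 covers 1,3,1,2 and gives a WSS of 3. Comparing the TMP against the number of physical frames is then a single comparison.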
8.3.10 Controlling thrashing 1. If TMP > Physical Memory – Decrease the degree of multiprogramming. – Else: Increase
2. Monitor Page fault rate • if pfr>High • Decrease progs • if pfr