Embedded and Real-Time Operating Systems [2 ed.] 9783031287008, 9783031287015

127 118 44MB

English Pages 862 Year 2023

Table of contents :
Contents
Chapter 1: Introduction
1.1 About This Book
1.2 Motivations of This Book
1.3 Objective and Intended Audience
1.4 Unique Features of This Book
1.5 Book Contents
1.6 Use This Book as a Textbook for Embedded Systems
1.7 Use This Book as a Textbook for Operating Systems
1.8 Use This Book for Self-Study
References
Chapter 2: ARM Architecture and Programming
2.1 ARM Processor Modes
2.2 ARM CPU Registers
2.2.1 General Registers
2.2.2 Status Registers
2.2.3 Change ARM Processor Mode
2.3 Instruction Pipeline
2.4 ARM Instructions
2.4.1 Condition Flags and Conditions
2.4.2 Branch Instructions
2.4.3 Arithmetic Operations
2.4.4 Comparison Operations
2.4.5 Logical Operations
2.4.6 Data Movement Operations
2.4.7 Immediate Value and Barrel Shifter
2.4.8 Multiply Instructions
2.4.9 LOAD and Store Instructions
2.4.10 Base Register
2.4.11 Block Data Transfer
2.4.12 Stack Operations
2.4.13 Stack and Subroutines
2.4.14 Software Interrupt (SWI)
2.4.15 PSR Transfer Instructions
2.4.16 Coprocessor Instructions
2.5 ARM Toolchain
2.6 ARM System Emulators
2.7 ARM Programming
2.7.1 ARM Assembly Programming Example 1
2.7.2 ARM Assembly Programming Example 2
2.7.3 Combine Assembly with C Programming
2.7.3.1 Execution Image
2.7.3.2 Function Call Convention in C
2.7.3.3 Long Jump
2.7.3.4 Call Assembly Function from C
2.7.3.5 Call C Function from Assembly
2.7.3.6 In-Line Assembly
2.8 Device Drivers
2.8.1 System Memory Map
2.8.2 GPIO Programming
2.8.3 UART Driver for Serial I/O
2.8.3.1 Demonstration of UART Driver
2.8.3.2 Use TCP/IP Telnet Session as UART Port
2.8.4 Color LCD Display Driver
2.8.4.1 Display Image Files
2.8.4.2 Include Binary Data Sections
2.8.4.3 Programming Example C2.6: LCD Driver
2.8.4.4 Demonstration of Display Images on LCD
2.8.4.5 Display Text
2.8.4.6 Color LCD Display Driver Program
2.8.4.7 Explanations of the LCD Driver Code
2.8.4.8 Demonstration of LCD Driver Program
2.9 Summary
List of Sample Programs
Problems
References
Chapter 3: Interrupts and Exceptions Processing
3.1 ARM Exceptions
3.1.1 ARM Processor Modes
3.1.2 ARM Exceptions
3.1.3 Exceptions Vector Table
3.1.4 Exception Handlers
3.1.5 Return from Exception Handlers
3.2 Interrupts and Interrupts Processing
3.2.1 Interrupt Types
3.2.2 Interrupt Controllers
3.2.2.1 ARM PL190/192 Interrupt Controller
3.2.2.2 Vectored and Non-vectored IRQs
3.2.2.3 Interrupt Priorities
3.2.3 Primary and Secondary Interrupt Controllers
3.3 Interrupt Processing
3.3.1 Vector Table Contents
3.3.2 Hardware Interrupt Sequence
3.3.3 Interrupts Control in Software
3.3.3.1 Enable/Disable Interrupts
3.3.3.2 Interrupt Masking
3.3.3.3 Clear Device Interrupt Request
3.3.3.4 Send EOI to Vectored Interrupt Controller
3.3.4 Interrupt Handlers
3.3.4.1 Interrupt Handler Types
3.3.5 Non-nested Interrupt Handler
3.4 Timer Driver
3.4.1 ARM Versatile 926EJS Timers
3.4.2 Timer Driver Program
3.5 Keyboard Driver
3.5.1 ARM PL050 Mouse-Keyboard Interface
3.5.2 Keyboard Driver
3.5.3 Interrupt-Driven Driver Design
3.5.4 Keyboard Driver Program
3.5.5 Keyboard Driver Using Keyset 2
3.6 UART Driver
3.6.1 ARM PL011 UART Interface
3.6.2 UART Registers
3.6.3 Interrupt-Driven UART Driver Program
3.6.3.1 The uart.c File
3.6.3.2 Explanations of the UART Driver Code
3.6.3.3 Demonstration of KBD and UART Drivers
3.7 Secure Digital (SD) Cards
3.7.1 SD Card Protocols
3.7.2 SD Card Driver
3.7.3 Improved SDC Driver
3.7.4 Multi-sector Data Transfer
3.8 Vectored Interrupts
3.8.1 ARM PL190 Vectored Interrupts Controller (VIC)
3.8.2 Configure VIC for Vectored Interrupts
3.8.3 Vectored Interrupts Handlers
3.8.4 Demonstration of Vectored Interrupts
3.9 Nested Interrupts
3.9.1 Why Nested Interrupts
3.9.2 Nested Interrupts in ARM
3.9.3 Handle Nested Interrupts in SYS Mode
3.9.4 Demonstration of Nested Interrupts
3.10 Nested Interrupts and Process Switch
3.11 Summary
List of Sample Programs
Problems
References
Chapter 4: Models of Embedded Systems
4.1 Program Structures of Embedded Systems
4.2 Super-Loop Model
4.2.1 Super-Loop Program Examples
4.3 Event-Driven Model
4.3.1 Shortcomings of Super-Loop Programs
4.3.2 Events
4.3.3 Periodic Event-Driven Program
4.3.4 Asynchronous Event-Driven Program
4.4 Event Priorities
4.5 Process Models
4.5.1 Uniprocessor Process Model
4.5.2 Multiprocessor Process Model
4.5.3 Real Address Space Process Model
4.5.4 Virtual Address Space Process Model
4.5.5 Static Process Model
4.5.6 Dynamic Process Model
4.5.7 Non-preemptive Process Model
4.5.8 Preemptive Process Model
4.6 Uniprocessor (UP) Kernel Model
4.7 Uniprocessor (UP) Operating System Model
4.8 Multiprocessor (MP) System Model
4.9 Real-Time (RT) System Model
4.10 Design Methodology of Embedded System Software
4.10.1 High-Level Language Support for Event-Driven Programming
4.10.2 State Machine Model
4.10.3 StateChart Model
4.11 Summary
Problems
References
Chapter 5: Process Management in Embedded Systems
5.1 Multitasking
5.2 The Process Concept
5.3 Multitasking and Context Switch
5.3.1 A Simple Multitasking Program
5.3.2 Context Switching
5.3.3 Demonstration of Multitasking
5.4 Dynamic Processes
5.4.1 Dynamic Process Creation
5.4.2 Demonstration of Dynamic Processes
5.5 Process Scheduling
5.5.1 Process Scheduling Terminology
5.5.2 Goals, Policy, and Algorithms of Process Scheduling
5.5.3 Process Scheduling in Embedded Systems
5.6 Process Synchronization
5.6.1 Sleep and Wakeup
5.6.2 Device Drivers Using Sleep/Wakeup
5.6.2.1 Input Device Drivers
5.6.2.2 Output Device Drivers
5.7 Event-Driven Embedded Systems Using Sleep/Wakeup
5.7.1 Demonstration of Event-Driven Embedded System Using Sleep/Wakeup
5.8 Resource Management Using Sleep/Wakeup
5.8.1 Shortcomings of Sleep/Wakeup
5.9 Semaphores
5.10 Applications of Semaphores
5.10.1 Semaphore Lock
5.10.2 Mutex Lock
5.10.3 Resource Management Using Semaphore
5.10.4 Wait for Interrupts and Messages
5.10.5 Process Cooperation
5.10.5.1 Producer-Consumer Problem
5.10.5.2 Reader-Writer Problem
5.10.6 Advantages of Semaphores
5.10.7 Cautions of Using Semaphores
5.10.8 Use Semaphores in Embedded Systems
5.10.8.1 Device Drivers Using Semaphores
5.10.8.2 Event-Driven Embedded System Using Semaphore
5.11 Other Synchronization Mechanisms
5.11.1 Event Flags in OpenVMS
5.11.2 Event Variables in MVS
5.11.3 ENQ/DEQ in MVS
5.12 High-Level Synchronization Constructs
5.12.1 Condition Variables
5.12.2 Monitors
5.13 Process Communication
5.13.1 Shared Memory
5.13.2 Pipes
5.13.2.1 Pipes in Unix/Linux
5.13.2.2 Pipes in Embedded Systems
5.13.2.3 Demonstration of Pipes
5.13.3 Signals
5.13.4 Message Passing
5.13.4.1 Asynchronous Message Passing
5.13.4.2 Synchronous Message Passing
5.13.4.3 Demonstration of Message Passing
5.14 Uniprocessor (UP) Embedded System Kernel
5.14.1 Non-preemptive UP Kernel
5.14.2 Demonstration of Non-preemptive UP Kernel
5.14.3 Preemptive UP Kernel
5.14.4 Demonstration of Preemptive UP Kernel
5.15 Summary
List of Sample Programs
Problems
References
Chapter 6: Memory Management in ARM
6.1 Process Address Spaces
6.2 Memory Management Unit (MMU) in ARM
6.3 MMU Registers
6.4 Accessing MMU Registers
6.4.1 Enabling and Disabling the MMU
6.4.1.1 Enable MMU
6.4.1.2 Disable MMU
6.4.2 Domain Access Control
6.4.3 Translation Table Base Register
6.4.4 Domain Access Control Register
6.4.5 Fault Status Registers
6.4.6 Fault Address Register
6.5 Virtual Address Translations
6.5.1 Translation Table Base
6.5.2 Translation Table
6.5.3 Level-One Descriptor
6.5.3.1 Page Table Descriptor
6.5.3.2 Section Descriptor
6.6 Translation of Section References
6.7 Translation of Page References
6.7.1 Level-1 Page Table
6.7.2 Level-2 Page Descriptors
6.7.3 Translation of Small Page References
6.7.4 Translation of Large Page References
6.8 Memory Management Example Programs
6.8.1 One-Level Paging Using 1 MB Sections
6.8.2 Two-Level Paging Using 4 KB Pages
6.8.3 One-Level Paging with High VA Space
6.8.4 Two-Level Paging with High VA Space
6.8.5 Vector Table in High VA Space
6.9 Summary
List of Sample Programs
Problems
Reference
Chapter 7: User Mode Process and System Calls
7.1 User Mode Processes
7.2 Virtual Address Space Mapping
7.3 User Mode Process
7.3.1 User Mode Image
7.4 System Kernel Supporting User Mode Processes
7.4.1 PROC Structure
7.4.2 Reset Handler
7.4.2.1 Exception and IRQ Stacks
7.4.2.2 Copy Vector Table
7.4.2.3 Create Kernel Mode Page Table
7.4.2.4 Process Context Switching Function
7.4.2.5 System Call Entry and Exit
7.4.2.6 SVC Handler
7.4.2.7 Exception Handlers
7.4.3 Kernel Code
7.4.3.1 Create Process With User Mode Image
7.4.3.2 Execution of User Mode Image
7.4.4 Kernel Compile-link Script
7.4.5 Demonstration of Kernel With User Mode Process
7.5 Embedded System with User Mode Processes
7.5.1 Processes in the Same Domain
7.5.2 Demonstration of Processes in the Same Domain
7.5.3 Processes with Individual Domains
7.5.4 Demonstration of Processes with Individual Domains
7.6 RAM Disk
7.6.1 Creating RAM Disk Image
7.6.2 Process Image File Loader
7.6.3 Binary Image File Loader Program
7.7 Process Management
7.7.1 Process Creation
7.7.2 Process Termination
7.7.3 Process Family Tree
7.7.4 Wait for Child Process Termination
7.7.5 fork-exec in Unix/Linux
7.7.6 Implementation of Fork
7.7.7 Implementation of Exec
7.7.8 Demonstration of fork-exec
7.7.9 Simple sh for Command Execution
7.7.10 vfork
7.7.11 Demonstration of vfork
7.8 Threads
7.8.1 Thread Creation
7.8.2 Demonstration of Threads
7.8.3 Threads Synchronization
7.9 Embedded System with Two-level Paging
7.9.1 Static 2-level Paging
7.9.2 Demonstration of Static 2-level Paging
7.9.3 Dynamic 2-level Paging
7.9.3.1 Modifications to Kernel for Dynamic Paging
7.9.4 Demonstration of 2-level Dynamic Paging
7.10 KMH Memory Mapping
7.10.1 Remap Vector Table
7.10.2 KMH using One-level Static Paging
7.10.3 KMH Using Two-level Static Paging
7.10.4 KMH using Two-level Dynamic Paging
7.11 Embedded System Supporting File Systems
7.11.1 Create SD Card Image
7.11.2 Format SD Card Partitions as File Systems
7.11.3 Create Loop Devices for SD Card Partitions
7.12 Embedded System With SDC File System
7.12.1 SD Card Driver Using Semaphore
7.12.2 System Kernel Using SD File System
7.12.3 Demonstration of SDC File System
7.13 Boot Kernel Image from SDC
7.13.1 SDC Booter Program
7.13.2 Demonstration of Booting Kernel from SDC
7.13.3 Booting Kernel from SDC with Dynamic Paging
7.13.4 Two-Stage Booting
7.13.4.1 Stage-1 Booter
7.13.4.2 Stage-2 Booter
7.13.5 Demonstration of Two-stage Booting
7.14 Summary
List of Sample Programs
Problems
References
Chapter 8: General Purpose Embedded Operating Systems
8.1 General Purpose Operating Systems
8.2 Embedded General Purpose Operating Systems
8.3 Porting Existing GPOS to Embedded Systems
8.4 Develop an Embedded GPOS for ARM
8.5 Organization of EOS
8.5.1 Hardware Platform
8.5.2 EOS Source File Tree
8.5.3 EOS Kernel Files
8.5.4 Capabilities of EOS
8.5.5 Startup Sequence of EOS
8.5.6 Process Management in EOS
8.5.6.1 PROC and Resource Structures
8.5.7 Assembly Code of EOS
8.5.7.1 Reset Handler
8.5.7.2 Initial Page Table
8.5.7.3 System Call Entry and Exit
8.5.7.4 IRQ Handler
8.5.7.5 IRQ and Process Preemption
8.5.8 Kernel Files of EOS
8.5.8.1 The main() Function
8.5.8.2 Kernel Initialization
8.5.8.3 Process Scheduling Functions
8.5.9 Process Management Functions
8.5.9.1 fork-exec
8.5.9.2 exit-wait
8.5.10 Pipes
8.5.11 Message Passing
8.5.12 Demonstration of Message Passing
8.6 Memory Management in EOS
8.6.1 Memory Map of EOS
8.6.2 Virtual Address Spaces
8.6.3 Kernel Mode Pgdir and Page Tables
8.6.4 Process User Mode Page Tables
8.6.5 Switch Pgdir During Process Switch
8.6.6 Dynamic Paging
8.7 Exception and Signal Processing
8.7.1 Signal Processing in Unix/Linux
8.8 Signal Processing in EOS
8.8.1 Signals in PROC Resource
8.8.2 Signal Origins in EOS
8.8.3 Deliver Signal to Process
8.8.4 Change Signal Handler in Kernel
8.8.5 Signal Handling in EOS Kernel
8.8.6 Dispatch Signal Catcher for Execution in User Mode
8.9 Device Drivers
8.10 Process Scheduling in EOS
8.11 Timer Service in EOS
8.12 File System
8.12.1 File Operation Levels
8.12.2 File I/O Operations
8.12.3 EXT2 File System in EOS
8.12.3.1 File System Organization
8.12.3.2 Source Files in the EOS/FS Directory
8.12.4 Implementation of Level-1 FS
8.12.4.1 mkdir-creat-mknod
8.12.4.2 chdir-getcwd-stat
8.12.4.3 rmdir
8.12.4.4 link-unlink
8.12.4.5 symlink-readlink
8.12.4.6 Other Level-1 Functions
8.12.5 Implementation of Level-2 FS
8.12.5.1 open-close-lseek
8.12.5.2 Read Regular Files
8.12.5.3 Write Regular Files
8.12.5.4 Read-Write Special Files
8.12.5.5 Opendir-readdir
8.12.6 Implementation of Level-3 FS
8.12.6.1 mount-umount
8.12.6.2 Implications of Mount
8.12.6.3 File Protection
8.12.6.4 Real and Effective uid
8.12.6.5 File Locking
8.13 Block Device I/O Buffering
8.14 I/O Buffer Management Algorithm
8.14.1 Performance of the I/O Buffer Cache
8.15 User Interface
8.15.1 The INIT Program
8.15.2 The Login Program
8.15.3 The sh Program
8.16 Demonstration of EOS
8.16.1 EOS Startup
8.16.2 Command Processing in EOS
8.16.3 Signal and Exception Handling in EOS
8.16.3.1 Interval Timer and Alarm Signal Catcher
8.16.3.2 Exceptions Handling in EOS
8.17 Summary
Problems
References
Chapter 9: Multiprocessing in Embedded Systems
9.1 Multiprocessing
9.2 SMP System Requirements
9.3 ARM MPcore Processors
9.4 ARM Cortex-A9 MPcore Processor
9.4.1 Processor Cores
9.4.2 Snoop Control Unit (SCU)
9.4.3 Generic Interrupt Controller (GIC) [4]
9.5 GIC Programming Example
9.5.1 Configure the GIC to Route Interrupts
9.5.2 Explanations of the GIC Configuration Code
9.5.3 Interrupt Priority and Interrupt Mask
9.5.4 Demonstration of GIC Programming
9.6 Startup Sequence of ARM MPcores
9.6.1 Raw Startup Sequence
9.6.2 Booter Assisted Startup Sequence
9.6.3 SMP Booting on Virtual Machines
9.7 ARM SMP Startup Examples
9.7.1 ARM SMP Startup Example 1
9.7.2 Demonstration of SMP Startup Example 1
9.7.3 ARM SMP Startup Example 2
9.7.4 Demonstration of ARM SMP Startup Example 2
9.8 Critical Regions in SMP
9.8.1 Implementation of Critical Regions in SMP
9.8.2 Shortcomings of XCHG/SWAP Operations
9.8.3 ARM Synchronization Instructions for SMP
9.8.3.1 ARM LDREX/STREX Instructions
9.8.3.2 ARM WFI, WFE, SEV Instructions
9.8.3.3 ARM Memory Barriers
9.9 Synchronization Primitives in SMP
9.9.1 Spinlocks
9.9.2 Spinlock Example
9.9.3 Demonstration of SMP Startup with Spinlock
9.9.4 Mutex in SMP
9.9.5 Implementation of Mutex Using Spinlock
9.9.6 Demonstration of SMP Startup with Mutex Lock
9.10 Global and Local Timers
9.10.1 Demonstration of Local Timers in SMP
9.11 Semaphores in SMP
9.11.1 Applications of Semaphores in SMP
9.12 Conditional Locking
9.12.1 Conditional Spinlock
9.12.2 Conditional Mutex
9.12.3 Conditional Semaphore Operations
9.13 Memory Management in SMP
9.13.1 Memory Management Models in SMP
9.13.2 Uniform VA Spaces
9.13.2.1 Demonstration of Uniform VA Space Mapping
9.13.3 Non-uniform VA Spaces
9.13.3.1 Demonstration of Non-uniform VA Space Mapping
9.13.4 Parallel Computing in Non-uniform VA Spaces
9.14 Multitasking in SMP
9.15 SMP Kernel for Process Management
9.15.1 The ts.s file
9.15.2 The t.c file
9.15.3 Kernel.c file
9.15.4 Device Driver and Interrupt Handlers in SMP
9.15.4.1 The SDC Driver in SMP
9.15.5 Demonstration of Process Management in SMP
9.16 General Purpose SMP Operating System
9.16.1 Organization of SMP_EOS
9.16.1.1 Hardware Platform
9.16.2 SMP_EOS Source File Tree
9.16.3 SMP_EOS Kernel Files
9.16.4 Process Management in SMP_EOS
9.16.4.1 PROC Structure
9.16.4.2 Running PROC Pointers
9.16.4.3 Spinlocks and Semaphores
9.16.4.4 Use Sleep/Wakeup in SMP
9.16.5 Protect Kernel Data Structures in SMP
9.16.6 Deadlock Prevention in SMP
9.16.6.1 Deadlock Prevention in Parallel Algorithms
9.16.7 Adapt UP Algorithms for SMP
9.16.7.1 Adapt UP Process Scheduling Algorithm for SMP
9.16.7.2 Adapt UP Pipe Algorithm for SMP
9.16.7.3 Adapt UP I/O Buffer Management Algorithm for SMP
9.16.8 Device Driver and Interrupt Handlers in SMP
9.17 SMP_EOS Demonstration System
9.17.1 Startup Sequence of SMP_EOS
9.17.2 Capabilities of SMP_EOS
9.17.3 Demonstration of SMP_EOS
9.18 Summary
List of Sample Programs
Problems
References
Chapter 10: Embedded Real-Time Operating Systems
10.1 Concepts of RTOS
10.2 Task Scheduling in RTOS
10.2.1 Rate-Monotonic Scheduling (RMS)
10.2.2 Earliest-Deadline-First (EDF) Scheduling
10.2.3 Deadline-Monotonic Scheduling (DMS)
10.3 Priority Inversion
10.4 Priority Inversion Prevention
10.4.1 Priority Ceiling
10.4.2 Priority Inheritance
10.5 Survey of RTOS
10.5.1 FreeRTOS
10.5.2 MicroC/OS (μC/OS)
10.5.3 NuttX
10.5.4 VxWorks
10.5.5 QNX
10.5.6 Real-Time Linux [12]
10.5.6.1 Real-Time Linux Patches
10.5.6.2 RTLinux
10.5.7 Critique of Existing RTOS
10.5.7.1 RTOS Organizations
10.5.7.2 Tasks in RTOS
10.5.7.3 Memory Management in RTOS
Real Address Space
Virtual Address Spaces
Dynamic Memory Allocation
Stack Overflow Checking
10.5.7.4 POSIX Compliant APIs
10.5.7.5 Synchronization Primitives
10.5.7.6 Interrupts Handling in RTOS
10.5.7.7 Task Deadlines
10.6 Design Principles of RTOS
10.6.1 Interrupt Processing
10.6.2 Task Management
10.6.3 Task Scheduling
10.6.4 Synchronization Tools
10.6.5 Task Communication
10.6.6 Memory Management
10.6.7 File System
10.6.8 Tracing and Debugging
10.7 Uniprocessor RTOS (UP_RTOS)
10.7.1 Task Management in UP_RTOS
10.7.2 Task Synchronization in UP_RTOS
10.7.3 Task Scheduling in UP_RTOS
10.7.4 Task Communication in UP_RTOS
10.7.4.1 Shared Memory
10.7.4.2 Pipes for Data Streams
10.7.4.3 Message Queues
10.7.5 Protection of Critical Regions
10.7.6 File System and Logging
10.7.7 UP_RTOS with Static Periodic Tasks and Round-Robin Scheduling
10.7.8 UP_RTOS with Static Periodic Tasks and Preemptive Scheduling
10.7.9 UP_RTOS with Dynamic Tasks Sharing Resources
10.8 Multiprocessor RTOS (SMP_RTOS)
10.8.1 SMP_RTOS Kernel for Task Management
10.8.1.1 PROC Structure in SMP_RTOS
10.8.1.2 Kernel Data Structures
10.8.1.3 Protection of Kernel Data Structures
10.8.1.4 Synchronization Tools
10.8.1.5 Task Management
10.8.1.6 Demonstration of Task Management in SMP_RTOS
10.8.2 Adapt UP_RTOS to SMP
10.8.2.1 Demonstration of MP_RTOS
10.8.3 Nested Interrupts in SMP_RTOS
10.8.4 Preemptive Task Scheduling in SMP_RTOS
10.8.5 Task Switch by SGI
10.8.6 Timesliced Task Scheduling in SMP_RTOS
10.8.7 Preemptive Task Scheduling in SMP_RTOS
10.8.7.1 Reschedule Function for Task Preemption
10.8.7.2 Task Creation with Preemption
10.8.7.3 Semaphore with Preemption
10.8.7.4 Mutex with Preemption
10.8.7.5 Nested IRQ Interrupts with Preemption
10.8.8 Priority Inheritance
10.8.9 Demonstration of the SMP_RTOS System
10.9 Summary
List of Sample Programs
Problems
References
Chapter 11: ARMv8 Architecture and Programming
11.1 ARMv8 Architecture and Processors
11.1.1 Exception Levels
11.1.2 Secure World and Normal World
11.1.3 Execution States
11.1.4 Change Exception Levels
11.1.5 Change Execution States
11.2 ARMv8 Registers
11.2.1 ARMv8 Special Registers
11.2.2 Stack Pointers
11.2.3 Program Counter
11.2.4 Exception Link Registers
11.2.5 PSTATE and SPSR Registers
11.3 ARMv8 System Registers
11.4 Aarch64 Instructions
11.4.1 Data Processing Instructions
11.4.2 Shift Operations
11.4.3 Conditional Instructions
11.4.4 Memory Access Instructions
11.4.5 Flow Control Instructions
11.4.6 Exception Handling Instructions
11.4.7 Function Call Conventions in ARMv8
11.5 ARMv8 Exception and Interrupt Handling
11.5.1 Exception Types: Synchronous and Asynchronous Exceptions
11.5.2 Exception Handling
11.5.3 ARMv8 Exception Vector Table
11.5.4 Return from Exception
11.5.5 Interrupt Handling
11.5.6 A Simple Interrupt Handler
11.5.7 Nested Interrupt Handler
11.6 Generic Interrupt Controller Versions 3 and 4
11.6.1 GICv3 Fundamentals
11.6.1.1 Interrupt Types
11.6.1.2 Interrupt Identifiers
11.6.1.3 Security Model
11.6.2 Programming Model of GICv3
11.6.2.1 Distributor (GICD_*)
11.6.2.2 Redistributors (GICR_*)
11.6.2.3 CPU interfaces (ICC_*_ELn)
11.6.2.4 GICv3 Register Names
11.6.2.5 Distributor Register Map
11.6.2.6 Redistributor Register Map
11.6.2.7 CPU Interface System Registers
11.6.3 Routing in GICv3
11.6.3.1 Affinity Routing
11.6.3.2 Routing SPIs and SGIs by PE Affinity
11.6.3.3 Participating Nodes
11.6.4 Enable and Distribution of Interrupts
11.6.4.1 Enable Interrupt Groups
11.6.4.2 Enabling Individual Interrupts
11.6.5 Handling Interrupts
11.6.5.1 Interrupt Priority and Priority Mask
11.6.5.2 Interrupt Acknowledge
11.7 ARMv8 Generic Timer
11.7.1 Processor Timers
11.7.2 Counter and Frequency
11.7.3 Timer Registers
11.7.4 Configure Timer
11.7.5 Timer Interrupts
11.8 Programming Examples
11.8.1 Example 1: GICv3 in Version 2 Legacy Mode
11.8.2 Example 2: GICv3 with EL3 Physical Timer
11.8.3 Example 3: GICv3 with EL3 and EL1 Timer
11.8.4 Example 4: ARMv8 from EL3 to EL1
11.8.4.1 Changing Exception Levels
11.8.4.2 Switch from EL3 to Non-secure EL0
11.8.4.3 Example 4 Program Files
11.8.5 Example 5: OS Kernel with Processes at EL1
11.8.6 Example 6: User Mode Process and System Calls
11.8.6.1 Kernel Mode and User Mode
11.8.6.2 System Calls
11.8.6.3 Example 6 Program Code
11.8.7 Example 7: Refinement of User Mode Process
11.8.8 Example 8: Separate User Mode Image
11.9 Memory Management in ARMv8
11.9.1 Virtual Address Versus Physical Address
11.9.2 System Control Registers for MMU Operations
11.9.3 Separation of Kernel and Application Virtual Address Spaces
11.9.4 Translation Granule and Translation Tables
11.9.5 ARMv8 Translation Table Descriptor Format
11.9.6 Translation Table Descriptor Attributes
11.9.7 OS Kernel Use of Translation Table Descriptors
11.9.8 Security and the MMU
11.9.9 Context Switching in MMU
11.10 Address Translation Cases
11.10.1 4KB Page Size
11.10.2 64KB Page Size
11.10.3 64KB Page Size with 42-Bit Input VA Address
11.11 Translations in EL2 and EL3
11.12 MMU Programming Examples
11.12.1 Example 9: One-Level Table in EL3
11.12.2 Example 10: EL3 with L1 and L2 Mappings
11.12.3 Example 11: TTBR0 and TTBR1 in EL1
11.12.4 Example 12: A Simple OS Kernel with User Mode Processes
11.12.5 Example 13: OS Kernel with HIGH VA
11.13 ARMv9
11.13.1 ARMv9 for Improved Security
11.13.2 ARMv9 for Improved Computing Capability
11.14 Conclusions
References
Chapter 12: ARM TrustZone and Secure Operating Systems
12.1 ARM TrustZone Architecture
12.2 ARM TrustZone Hardware Architecture
12.2.1 Hardware Features for Secure and Normal Worlds
12.2.2 Processor Architecture
12.2.3 Switching Between Security States
12.2.4 Secure and Normal Memory Spaces
12.2.5 Secure and Normal Interrupts
12.2.6 Secure and Normal Worlds in MP Processors
12.2.7 TrustZone in ARMv8-M
12.3 TrustZone Software Architecture
12.3.1 Secure Operating System
12.3.2 Booting a Secure System
12.3.3 Software Chain of Trusts
12.3.4 Example Secure Systems
12.4 Programming Examples of Secure Systems
12.4.1 Example 12.1 Secure and Normal Worlds
12.4.2 Secure World with Interrupts
12.4.3 Programming Project
12.5 Conclusions
References
Index

Recommend Papers

Embedded and Real-Time Operating Systems [2 ed.] 9783031287015, 9783031287008

This book covers the basic concepts and principles of operating systems, showing how to apply them to the design and imp

112 26 110MB Read more

Embedded and Real-Time Operating Systems [2 ed.] 3031287002, 9783031287008, 9783031287015, 9783031295355

This book covers the basic concepts and principles of operating systems, showing how to apply them to the design and imp

118 41 43MB Read more

Realtime Operating Systems. Concepts and Implementation of Microkernels

491 72 570KB Read more

Operating Systems Design and Implementation 0131429388, 9780131429383

Operating Systems Design and Implementation, 3e, is ideal for introductory courses on computer operating systems. Writte

872 12 5MB Read more

Operating Systems: Design And Implementation [Second Edition]

Most books on operating systems deal with theory while ignoring practice. While the usual principles are covered in deta

1,597 88 32MB Read more

Embedded Intelligent Systems 9783486593280

Das Buch vermittelt Kenntnisse zur Technologie und dem Design von Embedded Intelligent Systems. Dazu werden neben der Ke

163 86 44MB Read more

Operating Systems and Middleware - Supporting Controlled Interaction

467 17 3MB Read more

Operating Systems - Three Easy Pieces

766 113 5MB Read more

Operating systems theory 0136378684, 9780136378686

544 17 19MB Read more

Embedded Systems 9788122434989, 8122434983

Cover ; Preface ; Contents ; Chapter 1 Embedded Systems-An Introduction ; 1.1 Basic Idea on System ; 1.2 Embedded System

540 63 8MB Read more

Embedded and Real-Time Operating Systems [2 ed.]
9783031287008, 9783031287015

Author / Uploaded
K.C. Wang

0 0 0
Like this paper and download? You can publish your own PDF file online for free in a few minutes! Sign Up

File loading please wait...

Citation preview

K.C. Wang

Embedded and Real-Time Operating Systems Second Edition

Embedded and Real-Time Operating Systems

K. C. Wang

Embedded and Real-Time Operating Systems Second Edition

K. C. Wang School of Electrical Engineering and Computer Science Washington State University Pullman, WA, USA

ISBN 978-3-031-28700-8 ISBN 978-3-031-28701-5 (eBook) https://doi.org/10.1007/978-3-031-28701-5 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2017, 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Contents

1 Introduction�� 1 1.1 About This Book�� 1 1.2 Motivations of This Book �� 1 1.3 Objective and Intended Audience �� 1 1.4 Unique Features of This Book�� 2 1.5 Book Contents�� 3 1.6 Use This Book as a Textbook for Embedded Systems �� 5 1.7 Use This Book as a Textbook for Operating Systems�� 6 1.8 Use This Book for Self-Study�� 6 References�� 6 2 ARM Architecture and Programming �� 7 2.1 ARM Processor Modes �� 7 2.2 ARM CPU Registers�� 8 2.2.1 General Registers�� 8 2.2.2 Status Registers�� 8 2.2.3 Change ARM Processor Mode�� 9 2.3 Instruction Pipeline �� 10 2.4 ARM Instructions�� 10 2.4.1 Condition Flags and Conditions �� 10 2.4.2 Branch Instructions �� 11 2.4.3 Arithmetic Operations�� 11 2.4.4 Comparison Operations�� 12 2.4.5 Logical Operations�� 12 2.4.6 Data Movement Operations�� 12 2.4.7 Immediate Value and Barrel Shifter�� 12 2.4.8 Multiply Instructions�� 13 2.4.9 LOAD and Store Instructions �� 13 2.4.10 Base Register�� 13 2.4.11 Block Data Transfer�� 14 2.4.12 Stack Operations �� 14 2.4.13 Stack and Subroutines�� 14 2.4.14 Software Interrupt (SWI)�� 14 2.4.15 PSR Transfer Instructions �� 15 2.4.16 Coprocessor Instructions�� 15 2.5 ARM Toolchain�� 15 2.6 ARM System Emulators �� 16 2.7 ARM Programming�� 16 2.7.1 ARM Assembly Programming Example 1 �� 16 2.7.2 ARM Assembly Programming Example 2 �� 18 2.7.3 Combine Assembly with C Programming�� 19

v

vi

Contents

2.8 Device Drivers�� 26 2.8.1 System Memory Map�� 26 2.8.2 GPIO Programming�� 26 2.8.3 UART Driver for Serial I/O�� 27 2.8.4 Color LCD Display Driver�� 31 2.9 Summary �� 43 List of Sample Programs�� 43 Problems�� 43 References�� 44 3 Interrupts and Exceptions Processing�� 47 3.1 ARM Exceptions�� 47 3.1.1 ARM Processor Modes �� 47 3.1.2 ARM Exceptions�� 47 3.1.3 Exceptions Vector Table�� 49 3.1.4 Exception Handlers�� 49 3.1.5 Return from Exception Handlers�� 50 3.2 Interrupts and Interrupts Processing �� 51 3.2.1 Interrupt Types�� 51 3.2.2 Interrupt Controllers�� 51 3.2.3 Primary and Secondary Interrupt Controllers �� 52 3.3 Interrupt Processing�� 52 3.3.1 Vector Table Contents �� 52 3.3.2 Hardware Interrupt Sequence �� 53 3.3.3 Interrupts Control in Software�� 54 3.3.4 Interrupt Handlers �� 54 3.3.5 Non-nested Interrupt Handler �� 55 3.4 Timer Driver�� 56 3.4.1 ARM Versatile 926EJS Timers �� 56 3.4.2 Timer Driver Program�� 56 3.5 Keyboard Driver�� 60 3.5.1 ARM PL050 Mouse-Keyboard Interface�� 60 3.5.2 Keyboard Driver�� 61 3.5.3 Interrupt-Driven Driver Design�� 61 3.5.4 Keyboard Driver Program�� 62 3.5.5 Keyboard Driver Using Keyset 2�� 66 3.6 UART Driver�� 71 3.6.1 ARM PL011 UART Interface �� 71 3.6.2 UART Registers�� 71 3.6.3 Interrupt-Driven UART Driver Program�� 72 3.7 Secure Digital (SD) Cards�� 76 3.7.1 SD Card Protocols�� 76 3.7.2 SD Card Driver �� 77 3.7.3 Improved SDC Driver �� 84 3.7.4 Multi-sector Data Transfer�� 85 3.8 Vectored Interrupts�� 87 3.8.1 ARM PL190 Vectored Interrupts Controller (VIC)�� 87 3.8.2 Configure VIC for Vectored Interrupts�� 88 3.8.3 Vectored Interrupts Handlers�� 88 3.8.4 Demonstration of Vectored Interrupts�� 89 3.9 Nested Interrupts�� 90 3.9.1 Why Nested Interrupts�� 90 3.9.2 Nested Interrupts in ARM�� 91

Contents

vii

3.9.3 Handle Nested Interrupts in SYS Mode�� 91 3.9.4 Demonstration of Nested Interrupts�� 92 3.10 Nested Interrupts and Process Switch�� 94 3.11 Summary �� 95 List of Sample Programs�� 95 Problems�� 95 References�� 97 4 Models of Embedded Systems�� 99 4.1 Program Structures of Embedded Systems�� 99 4.2 Super-Loop Model�� 99 4.2.1 Super-Loop Program Examples��100 4.3 Event-Driven Model��101 4.3.1 Shortcomings of Super-Loop Programs��101 4.3.2 Events��101 4.3.3 Periodic Event-Driven Program��102 4.3.4 Asynchronous Event-Driven Program��105 4.4 Event Priorities��106 4.5 Process Models ��107 4.5.1 Uniprocessor Process Model��107 4.5.2 Multiprocessor Process Model��107 4.5.3 Real Address Space Process Model��107 4.5.4 Virtual Address Space Process Model��108 4.5.5 Static Process Model��108 4.5.6 Dynamic Process Model ��108 4.5.7 Non-preemptive Process Model��108 4.5.8 Preemptive Process Model��108 4.6 Uniprocessor (UP) Kernel Model ��108 4.7 Uniprocessor (UP) Operating System Model ��108 4.8 Multiprocessor (MP) System Model��109 4.9 Real-Time (RT) System Model��109 4.10 Design Methodology of Embedded System Software��109 4.10.1 High-Level Language Support for Event-Driven Programming��109 4.10.2 State Machine Model��109 4.10.3 StateChart Model��113 4.11 Summary ��113 Problems��114 References��114 5 Process Management in Embedded Systems��115 5.1 Multitasking��115 5.2 The Process Concept��115 5.3 Multitasking and Context Switch��116 5.3.1 A Simple Multitasking Program ��116 5.3.2 Context Switching��117 5.3.3 Demonstration of Multitasking ��122 5.4 Dynamic Processes ��122 5.4.1 Dynamic Process Creation��122 5.4.2 Demonstration of Dynamic Processes��125 5.5 Process Scheduling ��125 5.5.1 Process Scheduling Terminology��125 5.5.2 Goals, Policy, and Algorithms of Process Scheduling��126 5.5.3 Process Scheduling in Embedded Systems��127

viii

Contents

5.6 Process Synchronization �� 127 5.6.1 Sleep and Wakeup �� 127 5.6.2 Device Drivers Using Sleep/Wakeup�� 128 5.7 Event-Driven Embedded Systems Using Sleep/Wakeup�� 131 5.7.1 Demonstration of Event-Driven Embedded System Using Sleep/Wakeup�� 133 5.8 Resource Management Using Sleep/Wakeup �� 134 5.8.1 Shortcomings of Sleep/Wakeup�� 134 5.9 Semaphores �� 134 5.10 Applications of Semaphores �� 135 5.10.1 Semaphore Lock �� 136 5.10.2 Mutex Lock�� 136 5.10.3 Resource Management Using Semaphore�� 136 5.10.4 Wait for Interrupts and Messages�� 136 5.10.5 Process Cooperation�� 136 5.10.6 Advantages of Semaphores�� 137 5.10.7 Cautions of Using Semaphores�� 138 5.10.8 Use Semaphores in Embedded Systems �� 138 5.11 Other Synchronization Mechanisms �� 140 5.11.1 Event Flags in OpenVMS �� 141 5.11.2 Event Variables in MVS�� 141 5.11.3 ENQ/DEQ in MVS �� 142 5.12 High-Level Synchronization Constructs�� 142 5.12.1 Condition Variables�� 142 5.12.2 Monitors�� 143 5.13 Process Communication�� 143 5.13.1 Shared Memory�� 143 5.13.2 Pipes�� 143 5.13.3 Signals�� 147 5.13.4 Message Passing �� 148 5.14 Uniprocessor (UP) Embedded System Kernel�� 153 5.14.1 Non-preemptive UP Kernel�� 153 5.14.2 Demonstration of Non-preemptive UP Kernel �� 159 5.14.3 Preemptive UP Kernel�� 159 5.14.4 Demonstration of Preemptive UP Kernel �� 165 5.15 Summary �� 166 List of Sample Programs�� 166 Problems�� 166 References�� 168 6 Memory Management in ARM�� 169 6.1 Process Address Spaces�� 169 6.2 Memory Management Unit (MMU) in ARM �� 169 6.3 MMU Registers�� 170 6.4 Accessing MMU Registers �� 171 6.4.1 Enabling and Disabling the MMU�� 171 6.4.2 Domain Access Control�� 172 6.4.3 Translation Table Base Register�� 172 6.4.4 Domain Access Control Register�� 173 6.4.5 Fault Status Registers�� 173 6.4.6 Fault Address Register�� 173 6.5 Virtual Address Translations �� 174 6.5.1 Translation Table Base�� 174 6.5.2 Translation Table�� 174 6.5.3 Level-One Descriptor�� 174

Contents

ix

6.6 Translation of Section References�� 175 6.7 Translation of Page References�� 176 6.7.1 Level-1 Page Table�� 176 6.7.2 Level-2 Page Descriptors�� 176 6.7.3 Translation of Small Page References�� 176 6.7.4 Translation of Large Page References�� 177 6.8 Memory Management Example Programs �� 177 6.8.1 One-Level Paging Using 1 MB Sections�� 178 6.8.2 Two-Level Paging Using 4 KB Pages�� 183 6.8.3 One-Level Paging with High VA Space�� 185 6.8.4 Two-Level Paging with High VA Space �� 189 6.8.5 Vector Table in High VA Space�� 190 6.9 Summary �� 192 List of Sample Programs�� 192 Problems�� 192 Reference �� 192 7 User Mode Process and System Calls�� 193 7.1 User Mode Processes�� 193 7.2 Virtual Address Space Mapping�� 193 7.3 User Mode Process �� 194 7.3.1 User Mode Image�� 194 7.4 System Kernel Supporting User Mode Processes�� 197 7.4.1 PROC Structure�� 197 7.4.2 Reset Handler�� 201 7.4.3 Kernel Code�� 203 7.4.4 Kernel Compile-link Script�� 209 7.4.5 Demonstration of Kernel With User Mode Process�� 209 7.5 Embedded System with User Mode Processes �� 210 7.5.1 Processes in the Same Domain �� 210 7.5.2 Demonstration of Processes in the Same Domain�� 210 7.5.3 Processes with Individual Domains�� 212 7.5.4 Demonstration of Processes with Individual Domains�� 213 7.6 RAM Disk�� 213 7.6.1 Creating RAM Disk Image �� 214 7.6.2 Process Image File Loader�� 216 7.6.3 Binary Image File Loader Program�� 218 7.7 Process Management�� 221 7.7.1 Process Creation�� 221 7.7.2 Process Termination�� 221 7.7.3 Process Family Tree�� 222 7.7.4 Wait for Child Process Termination�� 222 7.7.5 fork-exec in Unix/Linux�� 223 7.7.6 Implementation of Fork�� 224 7.7.7 Implementation of Exec�� 225 7.7.8 Demonstration of fork-exec�� 227 7.7.9 Simple sh for Command Execution�� 228 7.7.10 vfork�� 231 7.7.11 Demonstration of vfork�� 232 7.8 Threads�� 233 7.8.1 Thread Creation�� 233 7.8.2 Demonstration of Threads�� 235 7.8.3 Threads Synchronization�� 236

x

Contents

7.9 Embedded System with Two-level Paging�� 236 7.9.1 Static 2-level Paging �� 237 7.9.2 Demonstration of Static 2-level Paging�� 239 7.9.3 Dynamic 2-level Paging�� 239 7.9.4 Demonstration of 2-level Dynamic Paging�� 242 7.10 KMH Memory Mapping �� 242 7.10.1 Remap Vector Table�� 242 7.10.2 KMH using One-level Static Paging�� 243 7.10.3 KMH Using Two-level Static Paging�� 270 7.10.4 KMH using Two-level Dynamic Paging�� 273 7.11 Embedded System Supporting File Systems�� 273 7.11.1 Create SD Card Image�� 274 7.11.2 Format SD Card Partitions as File Systems�� 275 7.11.3 Create Loop Devices for SD Card Partitions�� 275 7.12 Embedded System With SDC File System �� 276 7.12.1 SD Card Driver Using Semaphore�� 276 7.12.2 System Kernel Using SD File System�� 280 7.12.3 Demonstration of SDC File System�� 281 7.13 Boot Kernel Image from SDC�� 281 7.13.1 SDC Booter Program�� 282 7.13.2 Demonstration of Booting Kernel from SDC �� 286 7.13.3 Booting Kernel from SDC with Dynamic Paging�� 287 7.13.4 Two-Stage Booting �� 287 7.13.5 Demonstration of Two-stage Booting �� 288 7.14 Summary �� 290 List of Sample Programs�� 290 Problems�� 290 References�� 291 8 General Purpose Embedded Operating Systems�� 293 8.1 General Purpose Operating Systems�� 293 8.2 Embedded General Purpose Operating Systems�� 293 8.3 Porting Existing GPOS to Embedded Systems�� 293 8.4 Develop an Embedded GPOS for ARM �� 294 8.5 Organization of EOS�� 294 8.5.1 Hardware Platform�� 294 8.5.2 EOS Source File Tree �� 295 8.5.3 EOS Kernel Files�� 295 8.5.4 Capabilities of EOS�� 296 8.5.5 Startup Sequence of EOS�� 296 8.5.6 Process Management in EOS�� 297 8.5.7 Assembly Code of EOS�� 299 8.5.8 Kernel Files of EOS�� 306 8.5.9 Process Management Functions�� 310 8.5.10 Pipes�� 310 8.5.11 Message Passing �� 311 8.5.12 Demonstration of Message Passing�� 312 8.6 Memory Management in EOS�� 313 8.6.1 Memory Map of EOS�� 313 8.6.2 Virtual Address Spaces �� 313 8.6.3 Kernel Mode Pgdir and Page Tables �� 313 8.6.4 Process User Mode Page Tables �� 314 8.6.5 Switch Pgdir During Process Switch�� 314 8.6.6 Dynamic Paging�� 314

Contents

xi

8.7 Exception and Signal Processing�� 315 8.7.1 Signal Processing in Unix/Linux�� 316 8.8 Signal Processing in EOS �� 317 8.8.1 Signals in PROC Resource �� 317 8.8.2 Signal Origins in EOS�� 317 8.8.3 Deliver Signal to Process�� 317 8.8.4 Change Signal Handler in Kernel �� 318 8.8.5 Signal Handling in EOS Kernel�� 318 8.8.6 Dispatch Signal Catcher for Execution in User Mode�� 319 8.9 Device Drivers�� 319 8.10 Process Scheduling in EOS�� 320 8.11 Timer Service in EOS �� 320 8.12 File System�� 321 8.12.1 File Operation Levels�� 322 8.12.2 File I/O Operations �� 324 8.12.3 EXT2 File System in EOS�� 330 8.12.4 Implementation of Level-1 FS�� 332 8.12.5 Implementation of Level-2 FS�� 338 8.12.6 Implementation of Level-3 FS�� 344 8.13 Block Device I/O Buffering�� 345 8.14 I/O Buffer Management Algorithm�� 346 8.14.1 Performance of the I/O Buffer Cache �� 348 8.15 User Interface�� 349 8.15.1 The INIT Program�� 349 8.15.2 The Login Program �� 349 8.15.3 The sh Program �� 350 8.16 Demonstration of EOS�� 351 8.16.1 EOS Startup�� 351 8.16.2 Command Processing in EOS�� 351 8.16.3 Signal and Exception Handling in EOS�� 352 8.17 Summary �� 353 Problems�� 354 References�� 354 9 Multiprocessing in Embedded Systems �� 357 9.1 Multiprocessing�� 357 9.2 SMP System Requirements�� 357 9.3 ARM MPcore Processors�� 359 9.4 ARM Cortex-A9 MPcore Processor �� 359 9.4.1 Processor Cores�� 359 9.4.2 Snoop Control Unit (SCU) �� 360 9.4.3 Generic Interrupt Controller (GIC) [4] �� 360 9.5 GIC Programming Example �� 360 9.5.1 Configure the GIC to Route Interrupts�� 360 9.5.2 Explanations of the GIC Configuration Code�� 367 9.5.3 Interrupt Priority and Interrupt Mask�� 368 9.5.4 Demonstration of GIC Programming�� 368 9.6 Startup Sequence of ARM MPcores �� 368 9.6.1 Raw Startup Sequence�� 369 9.6.2 Booter Assisted Startup Sequence�� 369 9.6.3 SMP Booting on Virtual Machines �� 369

xii

Contents

9.7 ARM SMP Startup Examples �� 369 9.7.1 ARM SMP Startup Example 1�� 369 9.7.2 Demonstration of SMP Startup Example 1�� 375 9.7.3 ARM SMP Startup Example 2�� 375 9.7.4 Demonstration of ARM SMP Startup Example 2�� 376 9.8 Critical Regions in SMP �� 377 9.8.1 Implementation of Critical Regions in SMP�� 377 9.8.2 Shortcomings of XCHG/SWAP Operations �� 378 9.8.3 ARM Synchronization Instructions for SMP�� 378 9.9 Synchronization Primitives in SMP�� 379 9.9.1 Spinlocks�� 379 9.9.2 Spinlock Example �� 380 9.9.3 Demonstration of SMP Startup with Spinlock �� 381 9.9.4 Mutex in SMP �� 381 9.9.5 Implementation of Mutex Using Spinlock�� 382 9.9.6 Demonstration of SMP Startup with Mutex Lock�� 383 9.10 Global and Local Timers�� 384 9.10.1 Demonstration of Local Timers in SMP �� 387 9.11 Semaphores in SMP�� 387 9.11.1 Applications of Semaphores in SMP�� 388 9.12 Conditional Locking �� 388 9.12.1 Conditional Spinlock�� 389 9.12.2 Conditional Mutex�� 389 9.12.3 Conditional Semaphore Operations�� 390 9.13 Memory Management in SMP�� 390 9.13.1 Memory Management Models in SMP�� 391 9.13.2 Uniform VA Spaces�� 391 9.13.3 Non-uniform VA Spaces �� 393 9.13.4 Parallel Computing in Non-uniform VA Spaces �� 396 9.14 Multitasking in SMP �� 398 9.15 SMP Kernel for Process Management�� 400 9.15.1 The ts.s file�� 400 9.15.2 The t.c file �� 407 9.15.3 Kernel.c file�� 409 9.15.4 Device Driver and Interrupt Handlers in SMP�� 412 9.15.5 Demonstration of Process Management in SMP�� 415 9.16 General Purpose SMP Operating System �� 416 9.16.1 Organization of SMP_EOS�� 416 9.16.2 SMP_EOS Source File Tree �� 417 9.16.3 SMP_EOS Kernel Files�� 418 9.16.4 Process Management in SMP_EOS�� 418 9.16.5 Protect Kernel Data Structures in SMP�� 419 9.16.6 Deadlock Prevention in SMP�� 421 9.16.7 Adapt UP Algorithms for SMP �� 423 9.16.8 Device Driver and Interrupt Handlers in SMP�� 423 9.17 SMP_EOS Demonstration System �� 424 9.17.1 Startup Sequence of SMP_EOS�� 424 9.17.2 Capabilities of SMP_EOS�� 425 9.17.3 Demonstration of SMP_EOS�� 425 9.18 Summary �� 425 List of Sample Programs�� 426 Problems�� 427 References�� 427

Contents

xiii

10 Embedded Real-Time Operating Systems�� 429 10.1 Concepts of RTOS�� 429 10.2 Task Scheduling in RTOS �� 429 10.2.1 Rate-Monotonic Scheduling (RMS) �� 430 10.2.2 Earliest-Deadline-First (EDF) Scheduling�� 430 10.2.3 Deadline-Monotonic Scheduling (DMS)�� 431 10.3 Priority Inversion�� 431 10.4 Priority Inversion Prevention�� 432 10.4.1 Priority Ceiling �� 432 10.4.2 Priority Inheritance �� 432 10.5 Survey of RTOS�� 432 10.5.1 FreeRTOS �� 432 10.5.2 MicroC/OS (μC/OS) �� 433 10.5.3 NuttX�� 433 10.5.4 VxWorks �� 434 10.5.5 QNX�� 434 10.5.6 Real-Time Linux �� 435 10.5.7 Critique of Existing RTOS�� 437 10.6 Design Principles of RTOS �� 439 10.6.1 Interrupt Processing�� 439 10.6.2 Task Management �� 440 10.6.3 Task Scheduling�� 440 10.6.4 Synchronization Tools�� 440 10.6.5 Task Communication�� 440 10.6.6 Memory Management�� 440 10.6.7 File System �� 440 10.6.8 Tracing and Debugging�� 440 10.7 Uniprocessor RTOS (UP_RTOS)�� 440 10.7.1 Task Management in UP_RTOS �� 441 10.7.2 Task Synchronization in UP_RTOS�� 441 10.7.3 Task Scheduling in UP_RTOS�� 442 10.7.4 Task Communication in UP_RTOS�� 442 10.7.5 Protection of Critical Regions�� 443 10.7.6 File System and Logging�� 443 10.7.7 UP_RTOS with Static Periodic Tasks and Round-Robin Scheduling�� 444 10.7.8 UP_RTOS with Static Periodic Tasks and Preemptive Scheduling�� 454 10.7.9 UP_RTOS with Dynamic Tasks Sharing Resources�� 461 10.8 Multiprocessor RTOS (SMP_RTOS)�� 468 10.8.1 SMP_RTOS Kernel for Task Management�� 468 10.8.2 Adapt UP_RTOS to SMP�� 484 10.8.3 Nested Interrupts in SMP_RTOS�� 486 10.8.4 Preemptive Task Scheduling in SMP_RTOS�� 488 10.8.5 Task Switch by SGI�� 489 10.8.6 Timesliced Task Scheduling in SMP_RTOS�� 491 10.8.7 Preemptive Task Scheduling in SMP_RTOS�� 493 10.8.8 Priority Inheritance �� 496 10.8.9 Demonstration of the SMP_RTOS System�� 498 10.9 Summary �� 502 List of Sample Programs�� 502 Problems�� 502 References�� 503

xiv

Contents

11 ARMv8 Architecture and Programming �� 505 11.1 ARMv8 Architecture and Processors�� 505 11.1.1 Exception Levels�� 505 11.1.2 Secure World and Normal World�� 506 11.1.3 Execution States�� 506 11.1.4 Change Exception Levels�� 507 11.1.5 Change Execution States�� 509 11.2 ARMv8 Registers�� 509 11.2.1 ARMv8 Special Registers�� 510 11.2.2 Stack Pointers�� 510 11.2.3 Program Counter�� 510 11.2.4 Exception Link Registers�� 510 11.2.5 PSTATE and SPSR Registers�� 511 11.3 ARMv8 System Registers�� 512 11.4 Aarch64 Instructions�� 512 11.4.1 Data Processing Instructions�� 512 11.4.2 Shift Operations�� 513 11.4.3 Conditional Instructions�� 513 11.4.4 Memory Access Instructions�� 514 11.4.5 Flow Control Instructions �� 514 11.4.6 Exception Handling Instructions�� 514 11.4.7 Function Call Conventions in ARMv8�� 515 11.5 ARMv8 Exception and Interrupt Handling�� 518 11.5.1 Exception Types: Synchronous and Asynchronous Exceptions �� 518 11.5.2 Exception Handling�� 518 11.5.3 ARMv8 Exception Vector Table �� 519 11.5.4 Return from Exception�� 521 11.5.5 Interrupt Handling�� 522 11.5.6 A Simple Interrupt Handler�� 523 11.5.7 Nested Interrupt Handler�� 523 11.6 Generic Interrupt Controller Versions 3 and 4�� 524 11.6.1 GICv3 Fundamentals�� 525 11.6.2 Programming Model of GICv3�� 526 11.6.3 Routing in GICv3�� 530 11.6.4 Enable and Distribution of Interrupts �� 532 11.6.5 Handling Interrupts �� 532 11.7 ARMv8 Generic Timer �� 533 11.7.1 Processor Timers�� 533 11.7.2 Counter and Frequency �� 534 11.7.3 Timer Registers �� 534 11.7.4 Configure Timer�� 535 11.7.5 Timer Interrupts�� 535 11.8 Programming Examples�� 535 11.8.1 Example 1: GICv3 in Version 2 Legacy Mode �� 536 11.8.2 Example 2: GICv3 with EL3 Physical Timer �� 549 11.8.3 Example 3: GICv3 with EL3 and EL1 Timer �� 562 11.8.4 Example 4: ARMv8 from EL3 to EL1�� 587 11.8.5 Example 5: OS Kernel with Processes at EL1�� 605 11.8.6 Example 6: User Mode Process and System Calls �� 625 11.8.7 Example 7: Refinement of User Mode Process�� 654 11.8.8 Example 8: Separate User Mode Image�� 659

Contents

xv

11.9 Memory Management in ARMv8�� 668 11.9.1 Virtual Address Versus Physical Address�� 669 11.9.2 System Control Registers for MMU Operations�� 670 11.9.3 Separation of Kernel and Application Virtual Address Spaces �� 670 11.9.4 Translation Granule and Translation Tables �� 671 11.9.5 ARMv8 Translation Table Descriptor Format�� 673 11.9.6 Translation Table Descriptor Attributes�� 674 11.9.7 OS Kernel Use of Translation Table Descriptors�� 675 11.9.8 Security and the MMU�� 677 11.9.9 Context Switching in MMU�� 677 11.10 Address Translation Cases�� 677 11.10.1 4KB Page Size�� 677 11.10.2 64KB Page Size�� 678 11.10.3 64KB Page Size with 42-Bit Input VA Address�� 678 11.11 Translations in EL2 and EL3�� 679 11.12 MMU Programming Examples�� 680 11.12.1 Example 9: One-Level Table in EL3�� 680 11.12.2 Example 10: EL3 with L1 and L2 Mappings�� 695 11.12.3 Example 11: TTBR0 and TTBR1 in EL1 �� 698 11.12.4 Example 12: A Simple OS Kernel with User Mode Processes�� 714 11.12.5 Example 13: OS Kernel with HIGH VA �� 737 11.13 ARMv9 �� 790 11.13.1 ARMv9 for Improved Security �� 791 11.13.2 ARMv9 for Improved Computing Capability�� 791 11.14 Conclusions�� 792 References�� 792 12 ARM TrustZone and Secure Operating Systems�� 793 12.1 ARM TrustZone Architecture �� 793 12.2 ARM TrustZone Hardware Architecture�� 793 12.2.1 Hardware Features for Secure and Normal Worlds�� 793 12.2.2 Processor Architecture�� 794 12.2.3 Switching Between Security States�� 795 12.2.4 Secure and Normal Memory Spaces�� 796 12.2.5 Secure and Normal Interrupts �� 796 12.2.6 Secure and Normal Worlds in MP Processors�� 797 12.2.7 TrustZone in ARMv8-M �� 797 12.3 TrustZone Software Architecture�� 797 12.3.1 Secure Operating System�� 797 12.3.2 Booting a Secure System�� 798 12.3.3 Software Chain of Trusts�� 798 12.3.4 Example Secure Systems�� 799 12.4 Programming Examples of Secure Systems �� 799 12.4.1 Example 12.1 Secure and Normal Worlds�� 799 12.4.2 Secure World with Interrupts�� 823 12.4.3 Programming Project�� 845 12.5 Conclusions�� 845 References�� 845 Index�� 847

Chapter 1 Introduction

1.1 About This Book This book is about the design and implementation of embedded and real-time operating systems [10]. It covers the basic concepts and principles of operating systems (OS) [15–18], embedded systems architecture [1], embedded systems programming [2], real-time system concepts, and real-time system requirements [8]. It shows how to apply the theory and practice of OS to the design and implementation of operating systems for embedded and real-time systems.

1.2 Motivations of This Book In the early days, most embedded systems were designed for special applications. An embedded system usually consists of a microcontroller and a few I/O devices, which is designed to monitor some input sensors and generate signals to control external devices, such as to turn on LEDs or activate some switches, etc. For this reason, control programs of early embedded systems are also very simple. They are usually written in the form of a super-loop or a simple event-driven program structure. However, as the computing power of embedded systems increases in recent years, embedded systems have undergone a tremendous leap in both complexity and areas of applications. As a result, the traditional approaches to software design for embedded systems are no longer adequate. In order to cope with the ever-increasing system complexity and demands for extra functionality, embedded systems need more powerful software. As of now, many embedded systems are in fact highpower computing machines with multicore processors, gigabyte memory, and multi-gigabyte storage devices. Such systems are intended to run a wide range of application programs. In order to fully realize their potential, modern embedded systems need the support of multifunctional operating systems. A good example is the evolution of earlier cell phones to present smart phones. Whereas the former were designed to perform the simple task of placing or receiving phone calls only, the latter may use multicore processors and run adapted versions of Linux, such as Android, to perform multitasks. The present trend of embedded systems software design is clearly moving in the direction of developing multifunctional operating systems suitable for future mobile environment. The purpose of this book is to show how to apply the theory and practice of operating systems to develop OS for embedded and real-time systems.

1.3 Objective and Intended Audience The objective of this book is to provide a suitable platform for teaching and learning the theory and practice of embedded and real-time operating systems. It covers embedded systems architecture, embedded systems programming, basic concepts and principles of operating systems (OS), and real-time systems. It shows how to apply these principles and programming techniques to the design and implementation of real OS for both embedded and real-time systems. This book is intended for computer science students and computer professionals, who wish to study the internal details of embedded and real-time operating systems. It is suitable as a textbook for courses on embedded and real-time systems in technically oriented computer science/engineering curricula that strive for a balance between theory and practice. The book’s evolutional style, coupled

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. C. Wang, Embedded and Real-Time Operating Systems, https://doi.org/10.1007/978-3-031-28701-5_1

1

2

1 Introduction

with detailed example code and complete working sample systems, makes it especially suitable for self-study by computer enthusiasts. The book covers the entire spectrum of software design for embedded and real-time systems, ranging from simple superloop and event-driven control programs for uniprocessor (UP) systems to complete symmetric multiprocessing (SMP) operating systems on multicore systems. It is also suitable for advanced study of embedded and real-time operating systems.

1.4 Unique Features of This Book This book has many unique features, which distinguish it from other books. 1. This book is self-contained. It includes all the foundation and background information for studying embedded systems, real-time systems, and operating systems in general. These include the ARM architecture, ARM instructions and programming [1], toolchain for developing programs [5], virtual machines [13] for software implementation and testing, program execution image, function call conventions, run-time stack usage, and link C programs with assembly code. 2. Interrupts and interrupt processing are essential to embedded systems. This book covers interrupt hardware and interrupt processing in great detail. These include non-vectored interrupts, vectored interrupts [4, 7] non-nested interrupts, nested interrupts [12], and programming the Generic Interrupt Controller (GIC) [3] in ARM MPcore based systems. It shows how to apply the principles of interrupt processing to develop interrupt-driven device drivers and event-driven embedded systems. 3. The book presents a general framework for developing interrupt-driven device drivers, with an emphasis on the interaction and synchronization between interrupt handlers and processes. For each device, it explains the principles of operation and programming techniques before showing the actual driver implementation, and it demonstrates the device drivers by complete working sample programs. 4. The book shows the design and implementation of complete OS for embedded systems in incremental steps. First, it develops a simple multitasking kernel to support process management and process synchronization. Then it incorporates the memory management unit (MMU) hardware into the system to provide virtual address (VA) mappings and extends the simple kernel to support user mode processes and system calls. Then it adds process scheduling, signal processing, file system, and user interface to the system, making it a complete operating system. The book’s evolutional style helps the reader better understand the material. 5. Chapter 9 covers symmetric multiprocessing (SMP) embedded systems in detail. First, it explains the requirements of SMP systems and compares the ARM MPcore architecture [6] with the SMP system architecture of Intel. Then it describes the SMP features of ARM MPcore processors, which include the SCU and GIC for interrupt routing and interprocessor communication and synchronization by software-generated interrupts (SGIs). It uses a series of programming examples to show how to start up the ARM MPcore processors and points out the need for synchronization in a SMP environment. Then it explains the ARM LDREX/STREX instructions and memory barriers and uses them to implement spinlocks, mutexes, and semaphores for process synchronization in SMP systems. It presents a general methodology for SMP kernel design and shows how to apply the principles to adapt a uniprocessor (UP) kernel for SMP. In addition, it also shows how to use parallel algorithms for process and resource management to improve the concurrency and efficiency in SMP systems. 6. Chapter 10 covers real-time operating systems (RTOS). It introduces the concepts and requirements of real-time systems. It covers the various kinds of task scheduling algorithms in RTOS, and it shows how to handle priority inversion and task preemption in real-time systems. It includes case studies of several popular RTOS and formulates a set of general guidelines for RTOS design. It shows the design and implementation of a UP_RTOS for uniprocessor (UP) systems. Then it extends the UP_RTOS to SMP_RTOS for SMP, which supports nested interrupts, preemptive task scheduling, priority inheritance, and interprocessor synchronization by SGI. 7. Chapter 11 covers ARMv8 architecture and programming, which are very different from earlier versions of ARMv6 and ARMv7. It describes the exception levels and execution states of ARMv8 architecture, instruction set, general-purpose registers for 64-bit execution state, and special registers for exception and interrupt handling. Then it covers exception and interrupt handling in ARMv8 in detail. Next, it covers the GICv3 interrupt controller operations, including its hierarchical organization of distributors, redistributors, CPU interface, interrupt routing, and interrupt groups for added security. Then it uses the generic timers and example programs to show interrupt processing in detail. It also shows how to start up an ARM processor in EL3, migrate to EL1, and develop a simple OS kernel with processes at EL1. The last part

1.5 Book Contents

3

of Chap. 11 covers the memory management unit (MMU) in ARMv8 for virtual address to physical address (PA) translation, which is totally different from earlier versions of ARM processors. As usual, it uses example programs to show VA to PA translation in detail. The cumulative result of this chapter shows a simple OS kernel running at EL1 using high VA space and user mode processes running at EL0 using low VA space. The OS kernel implements system calls, allowing user mode processes to execute functions in the OS kernel and return to user mode with the desired results. It also supports the fork-exec operations similar to Linux kernel. Finally, it describes ARMv9 with improved security and computing capability, which will have impact on machine learning, AI, and supercomputing in general. 8. Chapter 12 covers ARM TrustZone technology and secure operating systems. It explains the goals and functions of the ARM TrustZone technology for security in computing systems. It describes the security model of TrustZone, which consists of a secure world and a nonsecure or normal world, the hardware and software requirements for secure computing, and interactions between the two different worlds. 9. Throughout the book, it uses completely working sample systems to demonstrate the design principles and implementation techniques. It uses the ARM toolchain under Ubuntu (20.04) Linux to develop software for embedded systems, and it uses emulated ARM virtual machines under QEMU as the platform for implementation and testing.

1.5 Book Contents This book is organized as follows. Chapter 2 covers the ARM architecture, ARM instructions, ARM programming, and development of programs for execution on ARM virtual machines. These include ARM processor modes, register banks in different modes, instructions, and basic programming in ARM assembly. It introduces the ARM toolchain under Ubuntu (15.10) Linux and emulated ARM virtual machines under QEMU. It shows how to use the ARM toolchain to develop programs for execution on the ARM Versatilepb virtual machine by a series of programming examples. It explains the function call convention in C and shows how to interface assembly code with C programs. Then it develops a simple UART driver for I/O on serial ports and a LCD driver for displaying both graphic images and text. It also shows the development of a generic printf() function for formatted printing to output devices that support the basic print char operation. Chapter 3 covers interrupt and exception processing. It describes the operating modes of ARM processors, exception types, and exception vectors. It explains the functions of interrupt controllers and the principles of interrupt processing in detail. Then it applies the principles of interrupt processing to the design and implementation of interrupt-driven device drivers. These include drivers for timers, keyboard, UARTs, and SD cards [14], and it demonstrates the device drivers by example programs. It explains the advantages of vectored interrupts over non-vectored interrupts. It shows how to configure the vector interrupt controllers (VICs) for vectored interrupts and demonstrates vectored interrupt processing by example programs. It also explains the principles of nested interrupts and demonstrates nested interrupt processing by example programs. Chapter 4 covers models of embedded systems. First, it explains and demonstrates the simple super-loop system model and points out its shortcomings. Then it discusses the event-driven model and demonstrates both periodic and asynchronous event-driven systems by example programs. In order to go beyond the simple super-loop and event-driven system models, it justifies the needs for processes or tasks in embedded systems. Then it introduces the various kinds of process models, which are used as the models of developing embedded systems in the book. Lastly, it presents the formal methodologies for embedded systems design. It illustrates the finite state machine (FSM) [11] model by a complete design and implementation example in detail. Chapter 5 covers process management. It introduces the process concept and the basic principle of multitasking. It demonstrates the technique of multitasking by context switching. It shows how to create processes dynamically and discusses the goals, policy, and algorithms of process scheduling. It covers process synchronization and explains the various kinds of process synchronization mechanisms, which include sleep/wakeup, mutexes, and semaphores. It shows how to use process synchronization to implement event-driven embedded systems. It discusses the various schemes for interprocess communication, which include shared memory, pipes, and message passing. It shows how to integrate these concepts and techniques to implement a uniprocessor (UP) kernel for process management, and it demonstrates the system requirements and programming techniques for both non-preemptive and preemptive process scheduling. The UP kernel serves as the foundation for developing complete operating systems in later chapters. Chapter 6 covers the ARM memory management unit (MMU) and virtual address space mappings. It explains the ARM MMU in detail and shows how to configure the MMU for virtual address mapping using both one-level and two-level paging.

4

1 Introduction

In addition, it also explains the distinction between low VA space and high VA space mappings and their implications on system implementations. Rather than only discussing the principles of memory management, it demonstrates the various kinds of virtual address mapping schemes by complete working example programs. Chapter 7 covers user mode processes and system calls. First, it extends the basic UP kernel of Chap. 5 to support additional process management functions, which include dynamic process creation, process termination, process synchronization, and wait for child process termination. Then it extends the basic kernel to support user mode processes. It shows how to use memory management to provide each process with a private user mode virtual address space that is isolated from other processes and protected by the MMU hardware. It covers and demonstrates the various kinds of memory management schemes, which include one-level sections and two-level static and dynamic paging. It covers the advanced concepts and techniques of fork, exec, vfork, and threads. In addition, it shows how to use SD cards for storing both kernel and user mode image files in a SDC file system. It also shows how to boot up the system kernel from SDC partitions. This chapter serves as a foundation for the design and implementation of general-purpose OS for embedded systems. Chapter 8 presents a fully functional general-purpose OS (GPOS), denoted by EOS, for uniprocessor (UP) ARM-based embedded systems. The following is a brief summary of the organization and capabilities of the EOS system. 1. System images: Bootable kernel image and user mode executables are generated from a source tree by the ARM toolchain under Ubuntu (15.10) Linux and reside in an EXT2 [9] file system on a SDC partition. The SDC contains stage 1 and stage 2 booters for booting up the kernel image from the SDC partition. After booting up, the EOS kernel mounts the SDC partition as the root file system. 2. Processes: The system supports NPROC=64 processes and NTHRED=128 threads per process; both can be increased if needed. Each process (except the idle process P0) runs in either kernel mode or user mode. Memory management of process images is by 2-level dynamic paging. Process scheduling is by dynamic priority and timeslice. It supports interprocess communication by pipes and message passing. The EOS kernel supports process management functions of fork, exec, vfork, threads, exit, and wait for child termination. 3. Device drivers: It contains device drivers for the most commonly used I/O devices, which include LCD display, timer, keyboard, UART, and SDC. 4. File system: EOS supports an EXT2 file system that is totally Linux compatible. It shows the principles of file operations, the control path, and data flow from user space to kernel space down to the device driver level. It shows the internal organization of file systems, and it describes the implementation of a complete file system in detail. 5. Timer service, exceptions, and signal processing: It provides timer service functions, and it unifies exception handling with signal processing, which allows users to install signal catchers to handle exceptions in user mode. 6. User interface: It supports multiuser logins to the console and UART terminals. The command interpreter sh supports executions of simple commands with I/O redirections, as well as multiple commands connected by pipes. 7. Porting: The EOS system runs on a variety of ARM virtual machines under QEMU, mainly for convenience. It should also run on real ARM-based system boards that support suitable I/O devices. Porting EOS to some popular ARM-based systems, e.g., Raspberry PI-2 is presently underway. The plan is to make it available for readers to download as soon as it is ready. Chapter 9 covers multiprocessing in embedded systems. It explains the requirements of symmetric multiprocessing (SMP) systems and compares the approach to SMP of Intel with that of ARM. It lists ARM MPcore processors and describes the components and functions of ARM MPcore processors in support of SMP. All ARM MPcore-based systems depend on the Generic Interrupt Controller (GIC) for interrupt routing and interprocessor communication. It shows how to configure the GIC to route interrupts and demonstrates GIC programming by examples. It shows how to start up ARM MPcores and points out the need for synchronization in a SMP environment. It shows how to use the classic test-and-set or equivalent instructions to implement atomic updates and critical regions and points out their shortcomings. Then it explains the new features of ARM MPcores in support of SMP. These include the ARM LDRES/STRES instructions and memory barriers. It shows how to use the new features of ARM MPcore processors to implement spinlocks, mutexes, and semaphores for process synchronization in SMP. It defines conditional spinlocks, mutexes, and semaphores and shows how to use them for deadlock prevention in SMP kernels. It also covers the additional features of the ARM MMU for SMP. It presents a general methodology for adapting uniprocessor OS kernel to SMP. Then it applies the principles to develop a complete SMP_EOS for embedded SMP systems, and it demonstrates the capabilities of the SMP_EOS system by example programs.

1.6 Use This Book as a Textbook for Embedded Systems

5

Chapter 10 covers real-time operating systems (RTOS). It introduces the concepts and requirements of real-time systems. It covers the various kinds of task scheduling algorithms in RTOS, which include RMS, EDF, and DMS. It explains the problem of priority inversion due to preemptive task scheduling and shows how to handle priority inversion and task preemption. It includes case studies of several popular real-time OS and presents a set of general guidelines for RTOS design. It shows the design and implementation of a UP_RTOS for uniprocessor (UP) systems. Then it extends the UP_RTOS to a SMP_RTOS, which supports nested interrupts, preemptive task scheduling, priority inheritance, and interprocessor synchronization by SGI. Chapters 11 and 12 are new in the second edition of this book. Chapter 11 covers ARMv8 architecture and programming, which are very different from earlier versions of ARMv6 and ARMv7. It describes the exception levels and execution states of ARMv8 architecture, instruction set, general-purpose registers for 64-bit execution state, and special registers for exceptions and interrupt processing. Then it covers exception and interrupt handling in ARMv8 in detail. Next, it covers the GICv3 interrupt controller, including its hierarchical organization of distributors, redistributors, CPU interface, interrupt routing, and interrupt groups for added security. Then it uses the generic timer and example programs to show interrupt processing in detail. It also shows how to start up an ARM processor in EL3, migrate to EL1, and develop a simple OS kernel with dynamic processes in EL1. The last part of Chap. 11 covers the memory management unit (MMU) in ARMv8 for virtual address to physical address translation, which is totally different from earlier versions of ARM processors. As usual, it uses example programs to show VA to PA translations. The chapter concludes with the development of an OS kernel running in EL1 using high VA space and user mode processes running in EL0 using low VA space. The OS kernel implements system calls, allowing user mode processes to enter the OS kernel to execute kernel functions and return to user mode with the desired results. It also supports the fork-exec operations similar to Unix/Linux. Finally, it describes improved security and computing capability of ARMv9, which will have significant impact on machine learning, AI, and supercomputing in general. Chapter 12 covers ARM TrustZone technology and secure operating systems. It explains the goals and functions of ARM TrustZone technology for security in computing systems. It describes the security model of TrustZone, which consists of a secure world and a nonsecure or normal world, the hardware and software requirements for secure computing, and interactions between the two different worlds.

1.6 Use This Book as a Textbook for Embedded Systems This book is suitable as a textbook for technically oriented courses on embedded and real-time systems in computer science/ engineering curricula that strive for a balance between theory and practice. A one-semester course based on this book may include the following topics: 1. Introduction to embedded systems, ARM architecture, basic ARM programming, ARM toolchain and software development for embedded systems, interface assembly code with C programs, execution image and run-time stack usage, program execution on ARM virtual machines, and simple device drivers for I/O (Chap. 2) 2. Exceptions and interrupts, interrupt processing, and design and implementation of interrupt-driven device drivers (Chap. 3) 3. Embedded system models, super-loop, event-driven embedded systems, and process and formal models of embedded systems (Chap. 4) 4. Process management and process synchronization (Chap. 5) 5. Memory management, virtual address mappings, and memory protection (Chap. 6) 6. Kernel mode and user mode processes and system calls (Chap. 7) 7. Introduction to real-time embedded systems (parts of Chap. 10) 8. Introduction to SMP embedded systems (parts of Chaps. 9 and 10) 9. ARMv8 architecture and programming (parts of Chap. 11) 10. ARM TrustZone and secure operating systems (parts of Chap. 12)

6

1 Introduction

1.7 Use This Book as a Textbook for Operating Systems This book is also suitable as a textbook for technically oriented courses on general-purpose operating systems. A one- semester course based on this book may include the following topics: 1. Computer architecture and systems programming: the ARM architecture, ARM instructions and programming, ARM toolchain, virtual machines, interface of assembly code with C, run-time images and stack usage, simple device drivers (Chap. 2). 2. Exceptions and interrupts, interrupt processing and interrupt-driven device drivers (Chap. 3) 3. Process management and process synchronization (Chap. 5). 4. Memory management, virtual address mappings and memory protection (Chap. 6). 5. Kernel mode and user mode processes, system calls (Chap. 7). 6. General-purpose operating system kernel, process management, memory management, device drivers, file system, signal processing and user interface (Chap. 8). 7. Introduction to SMP, interprocessor communication and synchronization of multicore processors, process synchronization in SMP, SMP kernel design and SMP operating systems (Chap. 9). 8. Introduction to real-time systems (parts of Chap. 10). 9. ARMv8 architecture and programming (Chap. 11) The problems section of each chapter contains questions designed to review the concepts and principles presented in the chapter. While some of the questions involve only simple modifications of the example programs to let the students experiment with alternative design and implementation, many other questions are suitable for advanced programming projects.

1.8 Use This Book for Self-Study Judging from the large number of OS developing projects, and many popular sites on embedded and real-time systems posted on the Internet and their enthusiastic followers, there are a tremendous number of computer enthusiasts who wish to learn the practical side of embedded and real-time operating systems. The evolutional style of this book, coupled with ample code and demonstration system programs, makes it especially suitable for self-study. It is hoped that this book will be useful and beneficial to such readers.

References 1. ARM Architectures: http://www.arm.products/processors/instruction-set-architectures, ARM Information Center, 2016 2. ARM Programming: “ARM Assembly Language Programming”, http://www.peter-cockerell.net/aalp/html/frames.html, 2016 3. ARM GIC: ARM Generic Interrupt Controller (PL390) Technical Reference Manual, ARM Information Center, 2013 4. ARM PL190: PrimeCell Vectored Interrupt Controller (PL190), http://infocenter.arm.com/help/topic/com.arm.doc.ddi0181e/DDI0181.pdf, 2004 5. ARM toolchain: http://gnutoolchains.com/arm-eabi, 2016 6. ARM11: ARM11 MPcore Processor Technical Reference Manual, r2p0, ARM Information Center, 2008 7. ARM926EJ-S: “ARM926EJ-S Technical Reference Manual”, ARM Information Center, 2008 8. Dietrich, S., Walker, D., "The evolution of Real-Time Linux", http://www.cse.nd.edu/courses/cse60463/www/amatta2.pdf, 2015 9. EXT2: www.nongnu.org/ext2-doc/ext2.html, 2001 10. Gajski, DD, Vahid, F, Narayan, S, Gong, J, “Specification and design of embedded systems”, PTR Prentice Hall, 1994 11. Katz, R. H. and G. Borriello, “Contemporary Logic Design”, 2nd Edition, Pearson, 2005. 12. Nesting Interrupts: Nesting Interrupts, ARM Information Center, 2011 13. QEMU Emulators: “QEMU Emulator User Documentation”, http://wiki.qemu.org/download/qemu-doc.htm, 2010. 14. SDC: Secure Digital cards: SD Standard Overview-SD Association https://www.sdcard.org/developers/overview, 2016 15. Silberschatz, A., P.A. Galvin, P.A., Gagne, G, "Operating system concepts, 8th Edition", John Wiley & Sons, Inc. 2009 16. Stallings, W. "Operating Systems: Internals and Design Principles (7th Edition)", Prentice Hall, 2011 17. Tanenbaum, A.S., Woodhull, A.S.,"Operating Systems, Design and Implementation, third Edition", Prentice Hall, 2006 18. Wang, K.C., “Design and Implementation of the MTX Operating Systems, Springer International Publishing AG, 2015

Chapter 2 ARM Architecture and Programming

ARM [1] is a family of reduced instruction set computing (RISC) microprocessors developed specifically for mobile and embedded computing environments. Due to their small sizes and low power requirements, ARM processors have become the most widely used processors in mobile devices, e.g., smart phones, and embedded systems. Presently, most embedded systems are based on ARM processors. In many cases, embedded systems programming has become almost synonymous with ARM processor programming [8]. For this reason, we shall also use ARM processors for the design and implementation of embedded systems in this book. Depending on their release time, ARM processors can be classified into classic cores and the more recent (since 2005) Cortex cores. Depending on their capabilities and intended applications, ARM Cortex cores can be classified into three categories [2]: • The Cortex-M series: These are microcontroller-oriented processors intended for microcontroller unit (MCU) and system-on-chip (SoC) applications. • The Cortex-R series: These are embedded processors intended for real-time signal processing and control applications. • The Cortex-A series: These are application processors intended for general-purpose applications, such as embedded systems with full featured operating systems. The ARM Cortex-A series processors are the most powerful ARM processors, which include the Cortex-A8 [2] single core and the Cortex A9 MPcore [3] with up to 4 CPUs. Because of their advanced capabilities, most recent embedded systems are based on the ARM Cortex-A series processors. On the other hand, there are also a large number of embedded systems intended for dedicated applications that are based on the classic ARM cores, which proved to be very cost-effective. In this book, we shall cover both the classic and ARM-A series processors. Specifically, we shall use the classic ARM926EJ-S core for single CPU systems and the Cortex A9 MPcore for multiprocessor systems. A primary reason for choosing these ARM cores is because they are available as emulated virtual machines (VMs). A major goal of this book is to show the design and implementation of embedded systems in an integrated approach. In addition to covering the theory and principles, it also shows how to apply them to the design and implementation of embedded systems by programming examples. Since most readers may not have access to real ARM-based systems, we shall use emulated ARM virtual machines under QEMU for implementation and testing. In this chapter, we shall cover the following topics: . ARM architecture. . ARM instructions and basic programming in ARM assembly. . Interface assembly code with C programs. . ARM toolchains for (cross) compile-link programs. . ARM system emulators and run programs on ARM virtual machines. . Develop simple I/O device drivers. These include UART drivers for serial ports and . LCD driver for displaying both images and text.

2.1 ARM Processor Modes The ARM processor has seven (7) operating modes, which are specified by the 5 mode bits [4:0] in the Current Processor Status Register (CPSR). The seven ARM processor modes are:

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. C. Wang, Embedded and Real-Time Operating Systems, https://doi.org/10.1007/978-3-031-28701-5_2

7

8

2 ARM Architecture and Programming

USR mode: unprivileged User mode SYS mode: System mode using the same set of registers as User mode FIQ mode: Fast interrupt request processing mode IRQ mode: normal interrupt request processing mode SVC mode: Supervisor mode on reset or SWI (Software Interrupt) ABT mode: data exception abort mode UND mode: undefined instruction exception mode

2.2 ARM CPU Registers The ARM processor has 37 registers in total, all of which are 32 bits wide. These include: 1 dedicated program counter (PC) 1 dedicated current program status register (CPSR) 5 dedicated saved program status registers (SPSR) 30 general-purpose registers The registers are arranged into several banks. The accessibility of the banked registers is governed by the processor mode. In each mode, the ARM CPU can access: A particular set of R0-R12 registers A particular R13 (stack pointer), R14 (link register), and SPSR (saved program status) The same R15 (program counter) and CPSR (current program status register)

2.2.1 General Registers Figure 2.1 shows the organization of general registers in the ARM processor. User and System modes share the same set of registers. Registers R0–R12 are the same in all modes, except for the FIQ mode, which has its own separate registers R8–R12. Each mode has its own stack pointer (R13) and link register (R14). The program counter (PC or R15) and current program status register (CPSR) are the same in all modes. Each privileged mode (SVC to FIQ) has its own Saved Processor Status Register (SPSR).

2.2.2 Status Registers In all modes, the ARM processor has the same current program status register (CPSR). Figure 2.2 shows the contents of the CPSR register. In the CPSR register, NZCV are the condition bits, I and F are IRQ and FIQ interrupt mask bits, T = Thumb state, and M[4:0] are the processor mode bits, which define the processor mode as: Fig. 2.1 Register banks in ARM processor

2.2 ARM CPU Registers

9

Fig. 2.2 Status register of ARM processor USR: FIQ: IRQ: SVC: ABT: UND: SYS:

10000 10001 10010 10011 10111 11011 11111

(0x10) (0x11) (0x12) (0x13) (0x17) (0x1B) (0x1F)

2.2.3 Change ARM Processor Mode All the ARM modes are privileged except the User mode, which is unprivileged. Like most other CPUs, the ARM processor changes mode in response to exceptions or interrupts. Specifically, it changes to FIQ mode when an FIQ interrupt occurs. It changes to IRQ mode when an IRQ interrupt occurs. It enters the SVC mode when power is turned on, following reset or executing an SWI instruction. It enters the Abort mode when a memory access exception occurs, and it enters the UND mode when it encounters an undefined instruction. An unusual feature of the ARM processor is that, while in a privileged mode, it can change mode freely by simply altering the mode bits in the CPSR, by using the MSR and MRS instructions. For example, when the ARM processor starts or following a reset, it begins execution in SVC mode. While in SVC mode, the system initialization code must set up stack pointers of other modes. To do these, it simply changes the processor to an appropriate mode, initialize the stack pointer (R13_mode) and the saved program status register (SPSR) of that mode. The following code segment shows how to switch the processor to IRQ mode while preserving other bits, e.g., F and I bits, in the CPSR: MRS BIC ORR MSR

r0, cpsr r1, r0, #01F r1, r1, #0x12 cpsr, r1

// // // //

get cpsr into r0 clear 5 mode bits in r0 change to IRQ mode write to cpsr

If we do not care about the CPSR contents other than the mode field, e.g., during system initialization, changing to IRQ mode can be done by writing a value to CPSR directly, as in: MSR cpsr, #0x92

// IRQ mode with I bit=1

A special usage of SYS mode is to access User mode registers, e.g., R13 (sp), R14 (lr) from privileged mode. In an operating system, processes usually run in the unprivileged User mode. When a process does a system call (by SWI), it enters the system kernel in SVC mode. While in kernel mode, the process may need to manipulate its User mode stack and return address to the User mode image. In this case, the process must be able to access its User mode sp and lr. This can be done by switching the CPU to SYS mode, which shares the same set of registers with User mode. Likewise, when an IRQ interrupt occurs, the ARM processor enters IRQ mode to execute an interrupt service routine (ISR) to handle the interrupt. If the ISR allows nested interrupts, it must switch the processor from IRQ mode to a different privileged mode to handle nested interrupts. We shall demonstrate this later in Chap. 3 when we discuss exceptions and interrupts processing in ARM-based systems.

10

2 ARM Architecture and Programming

2.3 Instruction Pipeline The ARM processor uses an internal pipeline to increase the rate of instruction flow to the processor, allowing several operations to be undertaken simultaneously, rather than serially. In most ARM processors, the instruction pipeline consists of three stages, FETCH-DECODE-EXECUTE, as shown below: PC PC-4 PC-8

FETCH DECODE EXECUTE

Fetch instruction from memory Decode registers used in instruction Execute the instruction

The program counter (PC) actually points to the instruction being fetched, rather than the instruction being executed. This has implications to function calls and interrupt handlers. When calling a function using the BL instruction, the return address is actually PC-4, which is adjusted by the BL instruction automatically. When returning from an interrupt handler, the return address is also PC-4, which must be adjusted by the interrupt handler itself, unless the ISR is defined with the __attribute__ ((interrupt)) attribute, in which case the compiled code will adjust the link register automatically. For some exceptions, such as Abort, the return address is PC-8, which points to the original instruction that caused the exception.

2.4 ARM Instructions 2.4.1 Condition Flags and Conditions In the CPSR of ARM processors, the highest 4 bits, NZVC, are condition flags or simply condition code, where: N = negative, Z = zero, V = overflow, C = carry bit out Condition flags are set by comparison and TST operations. By default, data processing operations do not affect the condition flags. To cause the condition flags to be updated, an instruction can be postfix with the S symbol, which sets the S bit in the instruction encoding. For example, both of the following instructions add two numbers, but only the ADDS instruction affects the condition flags: ADD r0, r1, r2 ADDS r0, r1, r2

; r0 = r1 + r2 ; r0 = r1 + r2 and set condition flags

In the ARM 32-bit instruction encoding, the leading 4 bits [31:28] represent the various combinations of the condition flag bits, which form the condition field of the instruction (if applicable). Based on the various combinations of the condition flag bits, conditions are defined mnemonically as EQ, NE, LT, GT, LE, GE, etc. The following shows some of the most commonly used conditions and their meanings: 0000 0001 0010 0101 1000 1001 1010 1011 1100 1101 1110

: : : : : : : : : : :

EQ NE CS VS HI LS GE LT GT LE AL

Equal Not equal Carry set Overflow set Unsigned higher Unsigned lower or same Signed greater than or equal Signed less than Signed greater than Signed less than or equal Always

(Z set) (Z clear) (C set) (V set) (C set and Z clear) (C clear or Z set) (C = V) (C != V) (Z=0 and N=V) (Z=1 or N!=V)

A rather unique feature of the ARM architecture is that almost all instructions can be executed conditionally. An instruction may contain an optional condition suffix, e.g., EQ, NE, LT, GT, GE, LE, GT, LT, etc., which determines whether the

2.4 ARM Instructions

11

CPU will execute the instruction based on the specified condition. If the condition is not met, the instruction will not be executed at all without any side effects. This eliminates the need for many branches in a program, which tend to disrupt the instruction pipeline. To execute an instruction conditionally, simply postfix it with the appropriate condition. For example, a non-conditional ADD instruction has the form: ADD r0, r1, r2

; r0 = r1 + r2

To execute the instruction only if the zero flag is set, append the instruction with EQ: ADDEQ r0, r1, r2

; If zero flag is set then r0 = r1 + r2

Similarly for other conditions.

2.4.2 Branch Instructions Branching instructions have the form: B{} label ; branch to label BL{} subroutine ; branch to subroutine with link The Branch (B) instruction causes a direct branch to an offset relative to the current PC. The Branch with link (BL) instruction is for subroutine calls. It writes PC-4 into LR of the current register bank and replaces PC with the entry address of the subroutine, causing the CPU to enter the subroutine. When the subroutine finishes, it returns by the saved return address in the link register R14. Most other processors implement subroutine calls by saving the return address on stack. Rather than saving the return address on stack, the ARM processor simply copies PC-4 into R14 and branches to the called subroutine. If the called subroutine does not call other subroutines, it can use the LR to return to the calling place quickly. To return from a subroutine, the program simply copies LR (R14) into PC (R15), as in: MOV PC, LR

or

BX LR

However, this works only for one-level subroutine calls. If a subroutine intends to make another call, it must save and restore the LR register explicitly since each subroutine call will change the current LR. Instead of the MOV instruction, the MOVS instruction can also be used, which restores the original flags in the CPSR.

2.4.3 Arithmetic Operations The syntax of arithmetic operations is: {}{S} Rd, Rn, Operand2 The instruction performs an arithmetic operation on two operands and places the results in the destination register Rd. The first operand, Rn, is always a register. The second operand can be either a register or an immediate value. In the latter case, the operand is sent to the ALU via the barrel shifter to generate an appropriate value. Examples: ADD r0, r1, r2 SUB r3, r3, #1

; r0 = r1 + r2 ; r3 = r3 - 1

12

2 ARM Architecture and Programming

2.4.4 Comparison Operations CMP: operand1 - operand2, but result not written TST: operand1 AND operand2, but result not written TEQ: operand1 EOR operand2, but result not written Comparison operations update the condition flag bits in the status register, which can be used as conditions in subsequent instructions. Examples: CMP r0, r1 TSTEQ r2, #5

; set condition bits in CPSR by r0-r1 ; test r2 and 5 for equal and set Z bit in CPSR

2.4.5 Logical Operations AND: EOR: ORR: BIC:

operand1 AND operand2 operand1 EOR operand2 operand1 OR operand2 operand1 AND (NOT operand2)

; bit-wise AND ; bit-wise exclusive OR ; bit-wise OR ; clear bits

Examples: AND ORR BIC EORS

r0, r0, r0, r1,

r1, r0, r0, r3,

r2 #0x80 #0xF r0

; ; ; ;

r0 = r1 & r2 set bit 7 of r0 to 1 clear low 4 bits of r0 r1 = r3 ExOR r0 and set condition bits

2.4.6 Data Movement Operations MOV operand1, operand2 MVN operand1, NOT operand2 Examples: MOV r0, r1 MOVS r2, #10 MOVNEQ r1, #0

; r0 = r1 : Always execute ; r2 = 10 and set condition bits Z=0 N=0 ; r1 = 0 only if Z bit != 0

2.4.7 Immediate Value and Barrel Shifter The Barrel shifter is another unique feature of ARM processors. It is used to generate shift operations and immediate operands inside the ARM processor. The ARM processor does not have actual shift instructions. Instead, it has a barrel shifter, which performs shits as part of other instructions. Shift operations include the conventional shift left, right, and rotate, as in: MOV r0, r0, LSL #1 MOV r1, r1, LSR #2 MOV r2, r2, ROR #4

; shift r0 left by 1 bit (multiply r0 by 2) ; shift r1 right by 2 bits (divide r1 by 4) ; swap high and low 4 bits of r2

Most other processors allow loading CPU registers with immediate values, which form parts of the instruction stream, making the instruction length variable. In contrast, all ARM instructions are 32 bits long, and they do not use the instruction stream as data. This presents a challenge when using immediate values in instructions. The data processing instruction for-

2.4 ARM Instructions

13

mat has 12 bits available for operand2. If used directly, this would only give a range of 0 to 4095. Instead, it is used to store a 4-bit rotate value and an 8-bit constant in the range of 0–255. The 8 bits can be rotated right an even number of positions (i.e., RORs by 0, 2, 4,..,30). This gives a much larger range of values that can be directly loaded. For example, to load r0 with the immediate value 4096, use: MOV r0, #0x40, 26

; generate 4096 (0x1000) by 0x40 ROR 26

To make this feature easier to use, the assembler will convert to this form if given the required constant in an instruction, for example: MOV r0, #4096

The assembler will generate an error if the given value cannot be converted this way. Instead of MOV, the LDR instruction allows loading an arbitrary 32-bit value into a register, for example: LDR rd, =numeric_constant

If the constant value can be constructed by using either a MOV or MVN, then this will be the instruction actually generated. Otherwise, the assembler will generate an LDR with a PC-relative address to read the constant from a literal pool.

2.4.8 Multiply Instructions MUL{}{S} Rd, Rm, Rs MLA{}{S} Rd, Rm, Rs,Rn

; Rd = Rm * Rs ; Rd = (Rm * Rs) + Rn

2.4.9 LOAD and Store Instructions The ARM processor is a Load/Store Architecture. Data must be loaded into registers before using. It does not support memory to memory data processing operations. The ARM processor has three sets of instructions which interact with memory. These are: • Single register data transfer (LDR/STR) • Block data transfer (LDM/STM) • Single Data Swap (SWP) The basic load and store instructions are: Load and Store Word or Byte: LDR/STR/LDRB/STRB

2.4.10 Base Register Load/store instructions may use a base register as an index to specify the memory location to be accessed. The index may include an offset in either pre-index or post-index addressing mode. Examples of using index registers are: STR LDR STR STR

r0, r2, r0, r0,

[r1] [r1] [r1, #12] [r1], #12

; ; ; ;

store r0 to location pointed by r1. load r2 from memory pointed to by r1. pre-index addressing :STR r0 to [r1+12] Post-index addressing :STR r0 to [r1],r1+12

14

2 ARM Architecture and Programming

2.4.11 Block Data Transfer The base register is used to determine where memory access should occur. Four different addressing modes allow increment and decrement inclusive or exclusive of the base register location. Base register can be optionally updated following the data transfer by appending it with a “!” symbol. These instructions are very efficient for saving and restoring execution context, e.g., to use a memory area as stack or move large blocks of data in memory. It is worth noting that, when using these instructions to save/restore multiple CPU registers to/from memory, the register order in the instruction does not matter. Lower numbered registers are always transferred to/from lower addresses in memory.

2.4.12 Stack Operations A stack is a memory area which grows as new data is “pushed” onto the “top” of the stack and shrinks as data is “popped” off the top of the stack. Two pointers are used to define the current limits of the stack: • A base pointer: used to point to the “bottom” of the stack (the first location) • A stack pointer: used to point the current “top” of the stack A stack is called descending if it grows downward in memory, i.e., the last pushed value is at the lowest address. A stack is called ascending if it grows upward in memory. The ARM processor supports both descending and ascending stacks. In addition, it also allows the stack pointer to either point to the last occupied address (Full stack) or to the next occupied address (Empty stack). In ARM, stack operations are implemented by the STM/LDM instructions. The stack type is determined by the postfix in the STM/LDM instructions: • STMFD/LDMFD: Full Descending stack • STMFA/LDMFA: Full Ascending stack • STMED/LDMED: Empty Descending stack • STMEA/LDMEA: Empty Ascending stack The C compiler always uses Full Descending stack. Other forms of stacks are rare and almost never used in practice. For this reason, we shall only use Full Descending stacks throughout this book.

2.4.13 Stack and Subroutines A common usage of stack is to create temporary workspace for subroutines. When a subroutine begins, any registers that are to be preserved can be pushed onto the stack. When the subroutine ends, it restores the saved registers by popping them off the stack before returning to the caller. The following code segment shows the general pattern of a subroutine: STMFD sp!, {r0-r12, lr} ; save all registers and return address { Code of subroutine } ; subroutine code LDMFD sp!, {r0-r12, pc} ; restore saved registers and return by lr

If the pop instruction has the “S” bit set (by the “^” symbol), then transferring the PC register while in a privileged mode also copies the saved SPSR to the previous mode CPSR, causing return to the previous mode prior to the exception (by SWI or IRQ).

2.4.14 Software Interrupt (SWI) In ARM, the SWI instruction is used to generate a software interrupt. After executing the SWI instruction, the ARM processor changes to SVC mode and executes from the SVC vector address 0x08, causing it to execute the SWI handler, which is usually the entry point of system calls to the OS kernel. We shall demonstrate system calls in Chap. 5.

2.5 ARM Toolchain

15

2.4.15 PSR Transfer Instructions The MRS and MSR instructions allow contents of CPSR/SPSR to be transferred from appropriate status register to a generalpurpose register. Either the entire status register or only the flag bits can be transferred. These instructions are used mainly to change the processor mode while in a privileged mode: MRS{} Rd, MSR{} , Rm

; Rd = ; = Rm

2.4.16 Coprocessor Instructions The ARM architecture treats many hardware components, e.g., the Memory Management Unit (MMU), as coprocessors, which are accessed by special coprocessor instructions. We shall cover coprocessors in Chap. 6 and later chapters.

2.5 ARM Toolchain A toolchain is a collection of programming tools for program development, from source code to binary executable files. A toolchain usually consists of an assembler, a compiler, a linker, some utility programs, e.g., objcopy, for file conversions, and a debugger. Figure 2.3 depicts the components and dataflows of a typical toolchain. A toolchain runs on a host machine and generates code for a target machine. If the host and target architectures are different, the toolchain is called a cross toolchain or simply a cross compiler. Quite often, the toolchain used for embedded system development is a cross toolchain. In fact, this is the standard way of developing software for embedded systems. If we develop code on a Linux machine based on the Intel ×86 architecture but the code is intended for an ARM target machine, then we need a Linux-based ARM-targeting cross compiler. There are many different versions of Linux based toolchains for the ARM architecture [9]. In this book, we shall use the arm-none-eabi toolchain under Ubuntu Linux Versions 14.04/15.10. The reader can get and install the toolchain, as well as the qemu-system-arm for ARM virtual machines, on Ubuntu Linux as follows:

Fig. 2.3 Toolchain components

2 ARM Architecture and Programming

16

sudo apt-get install gcc-arm-none-eabi sudo apt-get install qemu-system-arm In the following sections, we shall demonstrate how to use the ARM toolchain and ARM virtual machines under QEMU by programming examples.

2.6 ARM System Emulators QEMU supports many emulated ARM machines [10]. These includes: ARM Integrator/CP board, ARM Versatile baseboard, ARM RealView baseboard, and several others. The supported ARM CPUs include ARM926E, ARM1026E, ARM946E, ARM1136, or Cortex-A8. All these are uniprocessor (UP) or single CPU systems. To begin with, we shall consider only uniprocessor (UP) systems. Multiprocessor (MP) systems will be covered later in Chap. 9. Among the emulated ARM virtual machines, we shall choose the ARM Versatilepb baseboard [5] as the platform for implementation and testing, for the following reasons: (1). It supports many peripheral devices. According to the QEMU user manual, the ARM Versatilepb baseboard is emulated with the following devices: • ARM926E, ARM1136, or Cortex-A8 CPU • PL190 Vectored Interrupt Controller • Four PL011 UARTs • PL110 LCD controller • PL050 KMI with PS/2 keyboard and mouse • PL181 MultiMedia Card Interface with SD card • SMC 91c111 Ethernet adapter • PCI host bridge (with limitations) • PCI OHCI USB controller • LSI53C895A PCI SCSI Host Bus Adapter with hard disk and CD-ROM devices (2). The ARM Versatile board architecture is well documented by online articles in the ARM Information Center. (3). QEMU can boot up the emulated ARM Versatilepb virtual machine directly. For example, we may generate a binary executable file, t.bin, and run it on the emulated ARM VersatilepbVM by: qemu-system-arm –M versatilepb –m 128M –kernel t.bin –serial mon:stdio QEMU will load the t.bin file to 0x10000 in RAM and executes it directly. This is very convenient since it eliminates the need for storing the system image in a flash memory and relying on a dedicated booter to boot up the system image.

2.7 ARM Programming 2.7.1 ARM Assembly Programming Example 1 We begin ARM programming by a series of example programs. For ease of reference, we shall label the example programs by C2.x, where C2 denotes the chapter number and x denotes the program number. The first example program, C2.1, consists of a ts.s file in ARM assembly. The following shows the steps of developing and running the example program: (1). ts.s file of C2.1 /************ ts.s file of C2.1 ***********/ .text .global start start:

2.7 ARM Programming mov r0, #1 MOV R1, #2 ADD R1, R1, R0 ldr r2, =result str r1, [r2] stop: b stop .data result: .word 0

17 @ @ // // /*

r0 = 1 r1 = 2 r1 = r1 + r0 r2 = &result result = r1 */

/* a word location */

The program code loads the CPU registers r0 with the value 1 and r1 with the value 2. Then it adds r0 to r1 and stores the result into the memory location labeled result. Before continuing, it is worth noting the following. First, in an assembly code program, instructions are case insensitive. An instruction may use uppercase, lowercase, or even mixed cases. For better readability, the coding style should be consistent, either all lowercase or all uppercase. However, other symbols, e.g., memory locations, are case sensitive. Second, as shown in the program, we may use the symbol @ or // to start a comment line or include comments in matched pairs of /* and */. Which kind of comment lines to use is a matter of personal preference. In this book, we shall use // for single comment lines and use matched pairs of /* and */ for comment blocks that may span multiple lines, which are applicable to both assembly code and C programs. (2). The mk script file A sh script, mk, is used to (cross) compile-link ts.s into an ELF file. Then it uses objcopy to convert the ELF file into a binary executable image named t.bin: arm-none-eabi-as –o ts.o ts.s arm-none-eabi-ld –T t.ld –o t.elf ts.o arm-none-eabi-nm t.elf arm-none-eabi-objcopy –O binary t.elf t.bin

# # # #

assemble ts.s to ts.o link ts.o to t.elf file show symbols in t.elf objcopy t.elf to t.bin

(3). Linker script file In the linking step, a linker script file, t.ld, is used to specify the entry point and the layout of the program sections: ENTRY(start) /* SECTIONS /* { . = 0x10000; /* .text : { *(.text) .data : { *(.data) .bss : { *(.bss) . = ALIGN(8); . = . + 0x1000; stack_top = .; /* }

define start as the entry address */ program sections */ loading address, required by QEMU */ } /* all text in .text section */ } /* all data in .data section */ } /* all bss in .bss section */ /* 4KB stack space */ stack_top is a symbol exported by linker */

The linker generates an ELF file. If desired, the reader may view the ELF file contents by: arm-none-eabi-readelf -a t.elf arm-none-eabi-objdump -d t.elf

# display all information of t.elf # disassemble t.elf file

The ELF file is not yet executable. In order to execute, it must be converted to a binary executable file by objcopy, as in: arm-none-eabi-objcopy –O binary t.elf t.bin # convert t.elf to t.bin

18

2 ARM Architecture and Programming

Fig. 2.4 Register contents of program C2.1

(4). Run the binary executable: To run t.bin on the ARM Versatilepb virtual machine, enter the command: qemu-system-arm –M versatilepb –kernel t.bin –nographic –serial /dev/null The reader may include all the above commands in a mk script, which will compile-link and run the binary executable by a single script command. (5). Check results: To check the results of running the program, enter the QEMU monitor commands: info registers : display CPU registers xp /wd [address] : display memory contents in 32-bit words Figure 2.4 shows the register contents of running the C2.1 program. As the figure shows, the register R2 contains 0x0001001C, which is the address of result. Alternatively, the command line arm-none-eabi-nm t.elf in the mk script also shows the locations of symbols in the program. The reader may enter the QEMU monitor command: xp /wd 0x1001C to display the contents of result, which should be 3. To exit QEMU, enter Control-a x, or Control-C, to terminate the QEMU process.

2.7.2 ARM Assembly Programming Example 2 The next example program, denoted by C2.2, uses ARM assembly code to compute the sum of an integer array. It shows how to use a stack to call subroutine. It also demonstrates indirect and post-index addressing modes of ARM instructions. For the sake of brevity, we only show the ts.s file. All other files are the same as in the C2.1 program. C2.2:

ts.s file: .text .global start start: ldr sp, =stack_top // set stack pointer bl sum // call sum stop: b stop // looping sum: // int sum(): compute the sum of an int array in Result stmfd sp!, {r0-r4, lr} // save r0-r4, lr on stack mov r0, #0 // r0 = 0 ldr r1, =Array // r1 = &Array ldr r2, =N // r2 = &N ldr r2, [r2] // r2 = N loop: ldr r3, [r1], #4 // r3 = *(r1++) add r0, r0, r3 // r0 += r3 sub r2, r2, #1 // r2— cmp r2, #0 // if (r2 != 0 ) bne loop // goto loop; ldr r4, =Result // r4 = &Result

2.7 ARM Programming

19

str r0, [r4] // Result = r0 ldmfd sp!, {r0-r4, pc} // pop stack, return to caller .data N: .word 10 // number of array elements Array: .word 1,2,3,4,5,6,7,8,9,10 Result: .word 0

The program computes the sum of an integer array. The number of array elements (10) is defined in the memory location labeled N, and the array elements are defined in the memory area labeled Array. The sum is computed in R0, which is saved into the memory location labeled Result. As before, run the mk script to generate a binary executable t.bin. Then run t.bin under QEMU. When the program stops, use the monitor commands info and xp to check the results. Figure 2.5 shows the results of running the C2.2 program. As the figure shows, the register R0 contains the computed result of 0x37 (55 in decimal). The reader may use the command: arm-none-eabi-nm t.elf to display symbols in an object code file. It lists the memory locations of the global symbols in the t.elf file, such as: 0001004C 00010050 00010078

N Array Result

Then, use xp /wd 0x10078 to see the contents of Result, which should be 55 in decimal.

2.7.3 Combine Assembly with C Programming Assembly programming is indispensable, e.g., when accessing and manipulating CPU registers, but it is also very tedious. In systems programming, assembly code should be used as a tool to access and control low-level hardware, rather than as a means of general programming. In this book, we shall use assembly code only if absolutely necessary. Whenever possible, we shall implement program code in the high-level language C. In order to integrate assembly and C code in the same program, it is essential to understand program execution images and the calling convention of C.

2.7.3.1 Execution Image An executable image (file) generated by a complier-linker consists of three logical parts: Text section: also called Code section containing executable code Data section: initialized global and static variables and static constants BSS section: un-initialized global and static variables (BSS is not in the image file) During execution, the executable image is loaded into memory to create a run-time image, which looks like the following: (Low address)

----------------------------

| Code | Data | BSS | Heap | Stack |

---------------------------Fig. 2.5 Register contents of program C2.2

(High address)

20

2 ARM Architecture and Programming

A run-time image consists of five (logically) contiguous sections. The Code and Data sections are loaded directly from the executable file. The BSS section is created by the BSS section size in the executable file header. Its contents are usually cleared to zero. In the execution image, the Code, Data, and BSS sections are fixed and do not change. The Heap area is for dynamic memory allocation within the execution image. The stack is for function calls during execution. It is logically at the high (address) end of the execution image, and it grows downward, i.e., from high address toward low address.

2.7.3.2 Function Call Convention in C The function call convention of C consists of the following steps between the calling function (the caller) and the called function (the callee). ---------------------------------- Caller ---------------------------------------------- (1). Load first four parameters in r0–r3; push any extra parameters on stack. (2). Transfers control to callee by BL callee. ----------------------------------- Callee --------------------------------------------- (3). Save LR, FP(r12) on stack; establish stack frame (FP point at saved LR). (4). Shift SP downward to allocate local variables and temp spaces on stack. (5). Use parameters, locals (and globals), to perform the function task. (6). Compute and load return value in R0, pop stack to return control to caller. ------------------------------------Caller ------------------------------------------- (7). Get return value from R0. (8). Clean up stack by popping off extra parameters, if any. The function call convention can be best illustrated by an example. The following t.c file contains a C function func(), which calls another function g(): /****************** t.c file ********************/ extern int g(int x, in y); // an external function int func(int a, int b, int c, int d, int e, int f) { int x, y, z; // local variables x = 1; y = 2; z = 3; // access locals g(x, y); // call g(x,y) return a + e; // return value }

Use the ARM cross compiler to generate an assembly code file named t.s by: arm-none-eabi-gcc –S –mcpu=arm926ej-s t.c The following shows the assembly code generated by the ARM GCC compiler, in which the symbol sp is the stack pointer (R13) and fp is the stack frame pointer (R12): .global func // export func as global symbol func: (1). Establish stack frame stmfd sp!, {fp, lr} // save lr, fp in stack add fp, sp, #4 // FP point at saved LR (2). Shift SP downward 8 (4-byte) slots for locals and temps sub sp, sp, #32 (3). Save r0-r3 (parameters a,b,c,d) in stack at –offsets(fp) str r0, [fp, #-24] // save r0 a str r1, [fp, #-28] // save r1 b str r2, [fp, #-32] // save r2 c str r3, [fp, #-36] // save r3 d

2.7 ARM Programming

21

(4). Execute x=1; y=2; z=3; show their locations on stack mov r3, #1 str r3, [fp, #-8] // x=1 at -8(fp); mov r3, #2 str r3, [fp, #-12] // y=2 at -12(fp) mov r3, #3 str r3, [fp, #-16] // z=3 at -16(fp) (5). Prepare to call g(x,y) ldr r0, [fp, #-8] // r0 = x ldr r1, [fp, #-12] // r1 = y bl g // call g(x,y) (6). Compute a+e as return value in r0 ldr r2, [fp, #-24] // r2 = a (saved at -24(fp) ldr r3, [fp, #4] // r3 = e at +4(fp) add r3, r2, r3 // r3 = a+e mov r0, r3 // r0 = return value in r0 (7). Return to caller sub sp, fp, #4 // sp=fp+4 (point at saved FP) ldmfd sp!, {fp, pc} // return to caller

When calling the function func(), the caller must pass (six) parameters (a, b, c, d, e, f) to the called function. The first four parameters (a, b, c, d) are passed in registers r0–r3. Any extra parameters are passed via the stack. When control enters the called function, the stack top contains the extra parameters (in reverse order). For this example, the stack top contains the two extra parameters e and f. The initial stack looks like the following: High SP low ----|---|---|-------------------| f | e | ----|exParam|--------------------

(1). Upon entry, the called function first establishes the stack frame by pushing LR, FP into stack and letting FP (r12) point at the saved LR. (2). Then it shifts SP downward (toward low address) to allocate space for local variables and temporary working area, etc. The stack becomes: SP SP High |-push->|---- subtract SP by #32 ------>|----- Low ---|+8 |+4 | 0 |-4 |-8 |-12|-16|-20|-24|-28|-32|-36| | f | e |LR |FP | | | | | | | | | ---|exParam|-|-|---| local variables, working space|----FP

In the diagram, the (byte) offsets are relative to the location pointed by the FP register. While execution is inside a function, the extra parameters (if any) are at [fp, +offset], local variables and saved parameters are at [fp, –offset], and all are referenced by using FP as the base register. From the assembly code lines (3) and (4), which save the first four parameters passed in r0–r3 and assign values to local variables x, y, and z, we can see that the stack contents become as shown in the next diagram: High SP Low ----|+8 |+4 | 0 |-4 |-8 |-12|-16|-20|-24|-28|-32|-36|--| f | e |LR |FP | x | y | z | ? | a | b | c | d | ----|exParam|-|-|---|- locals --|---| saved r0-r3 --|--FP |< ----------- stack frame of func ----------- >|

22

2 ARM Architecture and Programming

Although the stack is a piece of contiguous memory, logically each function can only access a limited area of the stack. The stack area visible to a function is called the stack frame of the function, and FP (r12) is called the stack frame pointer. At the assembly code lines (5), the function calls g(x,y) with only two parameters. It loads x into r0, y into r1, and then BL to g. At the assembly code lines (6), it computes a + e as the return value in r0. At the assembly code lines (7), it deallocates the space in stack and pops the saved FP and LR into PC, causing execution return to the caller. The ARM C compiler generated code only uses r0 to r3. If a function defines any register variables, they are assigned the registers r4 to r11, which are saved in stack first and restored later when the function returns. If a function does not call out, there is no need to save/restore the link register LR. In that case, the ARM C compiler generated code does not save and restore the link register LR, allowing faster entry/exit of function calls.

2.7.3.3 Long Jump In a sequence of function calls, such as: main() --> A() --> B()-->C(); when a called function finishes, it normally returns to the calling function, e.g., C() returns to B(), which returns to A(), etc. It is also possible to return directly to an earlier function in the calling sequence by a long jump. The following program demonstrates long jump in Unix/Linux:

{

/***** longjump.c demonstrating long jump in Linux *****/ #include #include jmp_buf env; // for saving longjmp environment main()

int r, a=100; printf("call setjmp to save environment\n"); if ((r = setjmp(env)) == 0){ A(); printf("normal return\n"); } else{ printf("back to main() via long jump, r=%d a=%d\n", r, a); } } int A() { printf("enter A()\n"); B(); printf("exit A()\n"); } int B() { printf("enter B()\n"); printf("long jump? (y|n) "); if (getchar()=='y') longjmp(env, 1234); printf("exit B()\n"); }

In the above longjump.c program, the main() function first calls setjmp(), which saves the current execution environment in a jmp_buf structure and returns 0. Then it proceeds to call A(), which calls B(). While in the function B(), if the user chooses not to return by long jump, the functions will show the normal return sequence. If the user chooses to return by

2.7 ARM Programming

23

longjmp(env, 1234), execution will return to the last saved environment with a nonzero value. In this case, it causes B() to return to main() directly, bypassing A(). The principle of long jump is very simple. When a function finishes, it returns by the (callerLR, callerFP) in the current stack frame, as shown in the following diagram: -------------------------------------------|params|callerLR|callerFP|..............| ------------|---------------------------|--CPU.FP CPU.SP

If we replace (callerLR, callerFP) with (savedLR, savedFP) of an earlier function in the calling sequence, execution would return to that function directly. For example, we may implement setjmp(int env[2]) and longjmp(int env[2], int value) in assembly as follows:

setjmp:

.global setjmp, longjmp // int setjmp(int env[2]); save LR, FP in env[2]; return 0 stmfd sp!, {fp, lr} add fp, sp, #4 ldr r1, [fp] // caller’s return LR str r1, [r0] // save LR in env[0] ldr r1, [fp, #-4] // caller’s FP str r1, [r0, #4] // save FP in env[1] mov r0, #0 // return 0 to caller sub sp, fp, #4 ldmfd sp!, {fp, pc}

longjmp: // int longjmp(int env[2], int value) stmfd sp!, {fp, lr} add fp, sp, #4 ldr r2, [r0] // return function’s LR str r2, [fp] // replace saved LR in stack frame ldr r2, [r0, #4] // return function’s FP str r2, [fp, #-4] // replace saved FP in stack frame mov r0, r1 // return value sub sp, fp, #4 ldmfd sp!, {fp, pc} // return via REPLACED LR and FP

Long jump can be used to abort a function in a calling sequence, causing execution to resume to a known environment saved earlier. In addition to the (savedLR, savedFP), setjmp() may also save other CPU registers and the caller’s SP, allowing longjmp() to restore the complete execution environment of the original function. Although rarely used in user mode programs, long jump is a common technique in systems programming. For example, it may be used in a signal catcher to bypass a user mode function that caused an exception or trap error. We shall demonstrate this technique later in Chap. 8 on signals and signal processing.

2.7.3.4 Call Assembly Function from C The next example program C2.3 shows how to call assembly function from C. The main() function in C calls the assembly function sum() with six parameters, which returns the sum of all the parameters. In accordance with the calling convention of C, the main() function passes the first four parameters a, b, c, and d in r0–r3 and the remaining parameters e and f on stack. Upon entry to the called function, the stack top contains the parameters e and f, in the order of increasing addresses. The called function first establishes the stack frame by saving LR and FP on stack and letting FP (r12) point at the saved LR. The parameters e and f are now at FP + 4 and FP + 8, respectively. The sum function simply adds all the parameters in r0 and returns to the caller:

24

2 ARM Architecture and Programming

(1). /************ t.c file int g; int main() { int a, b, c, d, e, f; a = b = c =d =e = f = 1; g = sum(a,b,c,d,e,f); }

of Program C2.3 ***********/ // un- initialized global

// local variables // values do not matter // call sum(), passing a,b,c,d,e,f

(2). /*********** ts.s file of Program C2.3 **********/ .global start, sum start: ldr sp, =stack_top bl main // call main() in C stop: b stop sum: // int sum(int a,b,c,d,e,f){ return a+b+e+d+e+f;} // upon entry, stack top contains e, f, passed by main()in C // Establish stack frame stmfd sp!, {fp, lr} // push fp, lr add fp, sp, #4 // fp -> saved lr on stack // Compute sum of all (6) parameters add r0, r0, r1 // first 4 parameters are in r0-r3 add r0, r0, r2 add r0, r0, r3 ldr r3, [fp, #4] // load e into r3 add r0, r0, r3 // add to sum in r0 ldr r3, [fp, #8] // load f into r3 add r0, r0, r3 // add to sum in r0 // Return to caller

sub

sp, fp, #4

ldmfd sp!, {fp, pc}

// sp=fp-4 (point at saved FP)

// return to caller

It is noted that in the C2.3 program, the sum() function does not save r0–r3 but use them directly. Therefore, the code should be more efficient than that generated by the ARM GCC compiler. Does this mean we should write all programs in assembly? The answer is, of course, a resounding no. It should be easy for the reader to figure out the reasons. (3). Compile-link t.c and ts.s into an executable file: arm-none-eabi-as –o ts.o ts.s # assemble ts.s arm-none-eabi-gcc –c t.c # compile t.c into t.o arm-none-eabi-ld –T t.ld –o t.elf t.o ts.o # link to t.elf file arm-none-eabi-objcopy –O binary t.elf t.bin # convert t.elf to t.bin

(4). Run t.bin and check results on an ARM virtual machine as before.

2.7.3.5 Call C Function from Assembly Calling C functions with parameters from assembly is also easy if we follow the calling convention of C. The following program C2.4 shows how to call a C function from assembly: /*********** t.c file of Program 2.4 ****************/ int sum(int x, int y){ return x + y; } // t.c file /*********** ts.s file of Program C2.4 *************/ .text .global start, sum

2.7 ARM Programming start:

ldr ldr ldr ldr ldr bl ldr str stop: b

a: b: c:

sp, r2, r0, r2, r1,

=stack_top =a [r2] =b [r2] sum r2, =c r0, [r2] stop

25 // need a stack to make calls // r0 = a // r1 = b // c = sum(a,b) // store return value in c

.data .word 1 .word 2 .word 0

2.7.3.6 In-Line Assembly In the above examples, we have written assembly code in a separate file. Most ARM toolchains are based on GCC. The GCC compiler supports in-line assembly, which is often used in C code for convenience. The basic format of in-line assembly is: __asm__("assembly code"); or simply asm("assembly code"); If the assembly code has more than one line, the statements are separated by \n\t; as in: asm("mov %r0, %r1\n\t; add %r0,#10,r0\n\t"); In-line assembly code can also specify operands. The template of such in-line assembly code is: asm ( assembler template : output operands : input operands : list of clobbered registers ); The assembly statements may specify output and input operands, which are referenced as %0, %1. For example, in the following code segment: int a, b=10; asm("mov %1,%%r0; mov %%r0,%0;" :"=r"(a) :"r"(b) :"%r0" );

// // // //

use %%REG for registers output MUST have = input clobbered registers

In the above code segment, %0 refers to a, %1 refers to b, and %%r0 refers to the r0 register. The constraint operator “r” means to use a register for the operand. It also tells the GCC compiler that the r0 register will be clobbered by the in-line code. Although we may insert fairly complex in-line assembly code in a C program, overdoing it may compromise the readability of the program. In practice, in-line assembly should be used only if the code is very short, e.g., a single assembly instruction or the intended operation involves a CPU control register. In such cases, in-line assembly code is not only clear but also more efficient than calling an assembly function.

26

2 ARM Architecture and Programming

2.8 Device Drivers The emulated ARM Versatilepb board is a virtual machine. It behaves just like a real hardware system, but there are no drivers for the emulated peripheral devices. In order to do any meaningful programming, whether on a real or virtual system, we must implement device drivers to support basic I/O operations. In this book, we shall develop drivers for the most commonly used peripheral devices by a series of programming examples. These include drivers for UART serial ports, timers, LCD display, keyboard, and the multimedia SD card, which will be used later as a storage device for file systems. A practical device driver should use interrupts. We shall show interrupt-driven device drivers in Chap. 3 when we discuss interrupts and interrupts processing. In the following, we shall show a simple UART driver by polling and an LCD driver, which does not use interrupts. In order to do these, it is necessary to know the ARM Versatile system architecture.

2.8.1 System Memory Map The ARM system architecture uses memory-mapped I/O. Each I/O device is assigned a block of contiguous memory in the system memory map. Internal registers of each I/O device are accessed as offsets from the device base address. Table 2.1 shows the (condensed) memory map of the ARM Versatile/926EJ-S board [4]. In the memory map, I/O devices occupy a 2 MB area beginning from 256 MB.

2.8.2 GPIO Programming Most ARM-based system boards provide general-purpose input-output (GPIO) pins as I/O interface to the system. Some of the GPIO pins can be configured for inputs. Other pins can be configured for outputs. In many beginning-level embedded system courses, the programming assignments and course project are usually to program the GPIO pins of a small embedded system board to interface with some real devices, such as switches, sensors, LEDs, relays, etc. Compared with other I/O devices, GPIO programming is relatively simple. A GPIO interface, e.g., the LPC2129 GPIO MCU used in many early embedded system boards, consists of four 32-bit registers:

Table 2.1 Memory map of ARM Versatile/ARM926EJ-S

2.8 Device Drivers

27

GPIODIR: set pin direction: 0 for input and 1 for output. GPIOSET: set pin voltage level to high (3.3V). GPIOCLR: set pin voltage level to low (0V). GPIOPIN: read this register returns the states of all pins. The GPIO registers can be accessed as word offsets from a (memory mapped) base address. In the GPIO registers, each bit corresponds to a GPIO pin. Depending on the direction setting in the IODIR, each pin can be connected to an appropriate I/O device. As a specific example, assume that we want to use the GPIO pin0 for input, which is connected to a (de-bounced switch), and pin1 for output, which is connected to the (ground side) of an LED with its own +3.3 V voltage source and a currentlimiting resistor. We can program the GPIO registers as follows: GPIODIR: bit0=0 (input), bit1=1 (output). GPIOSET: all bits=0 (no pin is set to high). GPIOCLR: bit1=1 (set to low or ground). GPIOPIN: read pin state; check pin0 for any input. Similarly, we may program other pins for desired I/O functions. Programming the GPIO registers can be done in either assembly code or C. Given the GPIO base address and the register offsets, it should be fairly easy to write a GPIO control program, which: . Turn on the LED if the input switch is pressed or closed . Turn off the LED if the input switch is released or open We leave this and other GPIO programming cases as exercises in the Problem section. In some systems, the GPIO interface may be more sophisticated, but the programming principle remains the same. For example, on the ARM Versatile-PB board, GPIO interfaces are arranged in separate groups called ports (Port0 to Port2), which are at the base addresses 0x101E4000-0x101E6000. Each port provides eight GPIO pins, which are controlled by a (8-bit) GPIODIR register and a (8-bit) GPIODATA register. Instead of checking the input pin states, GPIO inputs may use interrupts. Although interesting and inspiring to students, GPIO programming can only be performed on real hardware systems. Since the emulated ARM VMs do not have GPIO pins, we can only describe the general principles of GPIO programming. However, all the ARM VMs support a variety of other I/O devices. In the following sections, we shall show how to develop drivers for such devices.

2.8.3 UART Driver for Serial I/O Relying on the QEMU monitor commands to display register and memory contents is very tedious. It would be much better if we can develop device drivers to do I/O directly. In the next example program, we shall write a simple UART driver for I/O on emulated serial terminals. The ARM Versatile board supports four PL011 UART devices for serial I/O [6]. Each UART device has a base address in the system memory map. The base addresses of the four UARTs are: UART0: UART1: UART2: UART3:

0x101F1000 0x101F2000 0x101F3000 0x10090000

Each UART has a number of registers, which are byte offsets from the base address. The following lists the most important UART registers: 0x00 0x18 0x24 0x2C 0x38

UARTDR UARTFR UARIBRD UARTLCR UARTIMIS

Data register: for read/write chars Flag register: TxEmpty, RxFull, etc. Baud rate register: set baud rate Line control register: bits per char, parity, etc. Interrupt mask register for TX and RX interrupts

28

2 ARM Architecture and Programming

In general, a UART must be initialized by the following steps: (1). Write a divisor value to the baud rate register for a desired baud rate. The ARM PL011 technical reference manual lists the following integer divisor values (based on 7.38 MHz UART clock) for the commonly used baud rates: 0x4=1152000, 0xC=38400, 0x18=192000, 0x20=14400, 0x30=9600

( 2). Write to Line Control register to specify the number of bits per char and parity, e.g., 8 bits per char with no parity. (3). Write to Interrupt Mask register to enable/disable RX and TX interrupts. When using the emulated ARM Versatilepb board, it seems that QEMU automatically uses default values for both baud rate and line control parameters, making steps (1) and (2) either optional or unnecessary. In fact, it is observed that writing any value to the integer divisor register (0x24) would work but this is not the norm for UARTs in real systems. For the emulated Versatilepb board, all we need to do is to program the Interrupt Mask register (if using interrupts) and check the Flag register during serial I/O. To begin with, we shall implement the UART I/O by polling, which only checks the Flag status register. Interrupt-driven device drivers will be covered later in Chap. 3 when we discuss interrupts and interrupts processing. When developing device drivers, we may need assembly code in order to access CPU registers and the interface hardware. However, we shall use assembly code only if absolutely necessary. Whenever possible, we shall implement the driver code in C, thus keeping the amount of assembly code to a minimum. The UART driver and test program, C2.5, consists of the following components: (1). ts.s file: When the ARM CPU starts, it is in the Supervisor or SVC mode. The ts.s file sets the SVC mode stack pointer and calls main() in C:

start:

.global start, stack_top

// stack_top defined in t.ld

ldr sp, =stack_top // set SVC mode stack pointer bl main // call main() in C b . // if main() returns, just loop

(2). t.c file: This file contains the main() function, which initializes the UARTs and uses UART0 for I/O on the serial port: /************* t.c file of C2.5 **************/ int v[] = {1,2,3,4,5,6,7,8,9,10}; // data array int sum; #include "string.c" #include "uart.c"

// contains strlen(), strcmp(), etc. // UART driver code file

int main() { int i; char string[64]; UART *up; uart_init(); // initialize UARTs up = &uart[0]; // test UART0 uprints(up, "Enter lines from serial terminal 0\n\r"); while(1){ ugets(up, string); uprints(up, " "); uprints(up, string); uprints(up, "\n\r"); if (strcmp(string, "end")==0) break; }

2.8 Device Drivers

}

29

uprints(up, "Compute sum of array:\n\r"); sum = 0; for (i=0; in = i; } uart[3].base = (char *)(0x10009000); // uart3 at 0x10009000 }

30

2 ARM Architecture and Programming

int ugetc(UART *up) // input a char from UART pointed by up { while (*(up->base+UFR) & RXFE); // loop if UFR is RXFE return *(up->base+UDR); // return a char in UDR } int uputc(UART *up, char c) // output a char to UART pointed by up { while (*(up->base+UFR) & TXFF ); // loop if UFR is TXFF *(up->base+UDR) = c; // write char to data register } int ugets(UART *up, char *s) // input a string of chars { while ((*s = ugetc(up)) != '\r'){ uputc(up, *s); s++; } *s = 0; } int uprints(UART *up, char *s) // output a string of chars { while(*s) uputc(up, *s++); }

(4). link-script file: The linker script file, t.ld, is the same as in the program C2.2. (5). mk and run script file: The mk script file is also the same as in C2.2. For one serial port, the run script is: qemu-system-arm -M versatilepb -m 128M -kernel t.bin -serial mon:stdio For more serial ports, add –serial /dev/pts/1 –serial /dev/pts/2, etc., to the command line. Under Linux, open xterm(s) as pseudo terminals. Enter the Linux ps command to see the pts/n numbers of the pseudo terminals, which must match the pts/n numbers in the –serial /dev/pts/n option of QEMU. On each pseudo terminal, there is a Linux sh process running, which will grab all the inputs to the terminal. To use a pseudo terminal as serial port, the Linux sh process must be made inactive. This can be done by entering the Linux sh command: sleep 1000000 which lets the Linux sh process sleep for a large number of seconds. Then the pseudo terminal can be used as a serial port of QEMU.

2.8.3.1 Demonstration of UART Driver In the uart.c file, each UART device is represented by a UART data structure. As of now, the UART structure only contains a base address and a unit ID number. During UART initialization, the base address of each UART structure is set to the physical address of the UART device. The UART registers are accessed as *(up->base+OFFSET) in C. The driver consists of two basic I/O functions, ugetc() and uputc(): (1). int ugetc(UART *up): This function returns a char from the UART port. It loops until the UART flag register is no longer RXFE, indicating there is a char in the data register. Then it reads the data register, which clears the RXFF bit and sets the RXFE bit in FR, and returns the char. (2). int uputc(UART *up, c): This function outputs a char to the UART port. It loops until the UART’s flag register is no longer TXFF, indicating the UART is ready to transmit another char. Then it writes the char to the data register for transmission out. The functions ugets() and uprints() are for I/O of strings or lines. They are based on ugetc() and uputc(). This is the typical way of how I/O functions are developed. For example, with gets(), we can implement an int itoa(char *s) function which converts a sequence of numerical digits into an integer. Similarly, with putc(), we can implement a printf() function for

2.8 Device Drivers

31

Fig. 2.6 Demonstration of UART driver program

formatted printing, etc. We shall develop and demonstrate the printf() function in the next section on LCD driver. Figure 2.6 shows the outputs of running the C2.5 program, which demonstrates UART drivers.

2.8.3.2 Use TCP/IP Telnet Session as UART Port In addition to pseudo terminals, QEMU also supports TCP/IP telnet sessions as serial ports. First, run the program as: qemu-system-arm -M versatilepb -m 128M -kernel t.bin \ -serial telnet:localhost:1234,server When QEMU starts, it will wait until a telnet connection is made. From another (X-window) terminal, enter telnet localhost 1234 to connect. Then, enter lines from the telnet terminal.

2.8.4 Color LCD Display Driver The ARM Versatile board supports a color LCD display, which uses the ARM PL110 Color LCD controller [7]. On the Versatile board, the LCD controller is at the base address 0x10120000. It has several timing and control registers, which can be programmed to provide different display modes and resolutions. To use the LCD display, the controller’s timing and control registers must be set up properly. ARM’s Versatile Application Baseboard manual provides the following timing register settings for VGA and SVGA modes: Mode Resolution OSC1 timeReg0 timeReg1 timeReg2 -----------------------------------------------------------VGA 640x480 0x02C77 0x3F1F3F9C 0x090B61DF 0x067F1800 SVGA 800x600 0x02CAC 0x1313A4C4 0x0505F6F7 0x071F1800 ------------------------------------------------------------

The LCD’s frame buffer address register must point to a frame buffer in memory. With 24 bits per pixel, each pixel is represented by a 32-bit integer, in which the low 3 bytes are the BGR values of the pixel. For VGA mode, the needed frame buffer size is 1220 KB bytes. For SVGA mode, the needed frame buffer size is 1895 KB. In order to support both VGA and SVGA modes, we shall allocate a frame buffer size of 2 MB. Assuming that the system control program runs in the lowest 1 MB of physical memory, we shall allocate the memory area from 2 to 4 MB for the frame buffer. In the LCD Control register (0x1010001C), bit0 is LCD enable and bit11 is power-on, and both must be set to 1. Other bits are for byte order, number of bits per pixel, mono or color mode, etc. In the LCD driver, bits3-1 are set to 101 for 24 bits per pixel, and all other bits are 0 s for little-endian byte order by default. The reader may consult the LCD technical manual for the meanings of the various bits. It should be noted that, although the ARM manual lists the LCD Control register at 0x1C, it is in fact at 0x18 on the emulated Versatilepb board of QEMU. The reason for this discrepancy is unknown.

2.8.4.1 Display Image Files As a memory mapped display device, the LCD can display both images and text. It is actually much easier to display images than text. The principle of displaying images is rather simple. An image consists of H (height) by W (width) pixels, where H base = (char *)0x10006000; *(kp->base + KCNTL) = 0x10; // bit4=Enable bit0=INT on *(kp->base + KCLK) = 8; kp->head = kp->tail = 0; kp->data = 0; kp->room = 128; // KBD driver state variables shifted = 0; release = 0; control = 0;

}

printf("Detect KBD scan code: press the ENTER key : "); while( (*(kp->base + KSTAT) & 0x10) == 0); scode = *(kp->base + KDATA); printf("scode=%x ", scode); if (scode==0x5A) keyset=2; printf("keyset=%d\n", keyset);

// kbd_handler1() for scan code set 1 void kbd_handler1() { u8 scode, c; KBD *kp = &kbd; color = YELLOW; scode = *(kp->base + KDATA); //printf("scan code = %x ", scode);

if (scode & 0x80) return; c = unsh[scode];

3.5 Keyboard Driver

}

if (c != '\r') printf("kbd interrupt: c=%x %c\n", c, c); kp->buf[kp->head++] = c; kp->head %= 128; kp->data++; kp->room--;

// kbd_handelr2() for scan code set 2 void kbd_handler2() { u8 scode, c; KBD *kp = &kbd; color = YELLOW; scode = *(kp->base + KDATA); // printf("scanCode = %x ", scode); if (scode == 0xF0){ // key release release = 1; return; } if (release && scode != 0x12){ // ordinay key release release = 0; return; } if (release && scode == 0x12){ // Left shift key release release = 0; shifted = 0; return; } if (!release && scode == 0x12){// left key press and HOLD release = 0; shifted = 1; return; } if (!release && scode == LCTRL){// left Control key press and HOLD release = 0; control = 1; return; } if (release && scode == LCTRL){ // Left Control key release release = 0; control = 0; return; } /********* catch Control-C ****************/ if (control && scode == 0x21){ // Control-C // send number 2 signal to processes on KBD printf("Control-C: %d\n", control); control = 0;

69

70

3 Interrupts and Exceptions Processing }

return;

if (!shifted) c = ltab[scode]; // lowercase else c = utab[scode]; // uppercase if (c != '\r') printf("kbd interrupt: c=%x %c\n", c, c); if (control && scode == 0x23){ // Control-D c = 0x04; printf("Control-D: c = %x\n", c); }

}

kp->buf[kp->head++] = c; kp->head %= 128; kp->data++; kp->room--;

void kbd_handler() { if (keyset == 1) kbd_handler1(); else kbd_handler2(); } int kgetc() { char c; KBD *kp = &kbd; unlock(); while(kp->data == 0);

}

lock(); c = kp->buf[kp->tail++]; kp->tail %= 128; kp->data--; kp->room++; unlock(); return c;

int kgets(char s[ ]) { char c; while( (c = kgetc()) != '\r'){ if (c=='\b'){ s--; continue; } *s++ = c; } *s = 0; return strlen(s); }

3.6 UART Driver

71

3.6 UART Driver 3.6.1 ARM PL011 UART Interface The ARM Versatile board supports four PL011 UART devices for serial I/O [3]. Each UART device has a base address in the system memory map: UART0: UART1: UART2: UART3:

0x101F1000 0x101F2000 0x101F3000 0x10009000

The first three UARTs, UART0 to UART2, are adjacent in the system memory map. They interrupt at IRQ12 to IRQ14 on the primary PIC. UART4 in located at 0x10000900, and it interrupts at IRQ6 on the SIC. In general, a UART must be initialized by the following steps: (1). Write a divisor value to the baud rate register for a desired baud rate. The ARM PL011 technical reference manual lists the following integer divisor values (based on 7.38 MHz UART clock) for the commonly used baud rates: 0x4=1152000, 0xC=38400, 0x18=192000, 0x20=14400, 0x30=9600

( 2). Write to Line Control register to specify the number of bits per char and parity, e.g., 8 bits per char with no parity, etc. (3). Write to Interrupt Mask register to enable/disable RX and TX interrupts. When using the emulated ARM Versatilepb board under QEMU, it seems that QEMU automatically uses default values for both baud rate and line control parameters, making steps (1) and (2) either optional or unnecessary. In fact, it is observed that writing any value to the integer divisor register (0x24) would work, but the reader should be aware this is not the norm for UARTs in real systems. In this case, we only need to program the Interrupt Mask register (if using interrupts) and check the Flag register during serial I/O.

3.6.2 UART Registers Each UART interface contains several 32-bit registers. The most important UART registers are: 0ffset -----0x00 0x04 0x18 0x2c 0x30 0x38 0x40

Name Function -------- ----------------------------UARTDR data register: for read/write chars UARTSR receive status/clear error UARTFR TxEmpty, RxFull, etc. 0x24 UARIBRD set baud rate UARTLCR line control register UARTCR control register UARTIMSC interrupt mask for TX and RX UARTMIS interrupt status

Some of the UART registers are already explained in the program C2.3 of Chap. 2. Here, we shall focus on the registers that are related to interrupts. The ARM UART interface supports many kinds of interrupts. For simplicity, we shall only consider Rx (receiving) and Tx (transmitting) interrupts for data transfer and ignore such interrupts as modem status and error conditions. To allow UART Rx and Tx interrupts, bits 4 and 5 of the interrupt mask register (UARTIMSC) must be set to 1. When a UART interrupts, the masked interrupt status register (UARTMIS) contains the interrupt identification, e.g., bit 4=1 if it is an Rx interrupt and bit 5=1 if it is a Tx interrupt. Depending on the interrupt type, the interrupt handler can branch to a corresponding ISR to handle the interrupt.

72

3 Interrupts and Exceptions Processing

3.6.3 Interrupt-Driven UART Driver Program This section shows the design and implementation of an interrupt-driven UART driver for serial I/O. In order to keep the driver simple, we shall only use UART0 and UART1, but the same code is also applicable to other UARTs. The UART driver program is denoted by C3.3, which is organized as follows.

3.6.3.1 The uart.c File This file implements the UART driver using interrupts. Each UART is represented by a UART structure. It contains the UART base address, a unit number, an input buffer, an output buffer, and control variables. Both buffers are circular, with head pointers for entering chars and tail pointers for removing chars. Among the control variables, data is the number of chars in the buffer, and room is the number of empty spaces in the buffer. For outputs, txon is a flag indicating whether the UART is already in the transmission state. To use interrupts, the UART Interrupt Mask Set/Clear register (IMSC) must be set up properly. The UART supports many kinds of interrupts. For simplicity, we shall only consider TX (bit 5) and RX (bit 4) interrupts. In uart_init(), the C statement *(up->base+IMSC) |= 0x30; // bits 4,5 = 1

sets bits 4 and 5 of IMSC to 1, which enable TX and RX interrupts of the UART. For simplicity, we shall only use UART0 and UART1. UART0 interrupts at IRQ12 and UART1 interrupts at IRQ13, both on the primary VIC. In order to allow UART interrupts, bits 12 and 13 of the PIC must be set to 1. These are done in main() by the statements: VIC_INTENABLE |= (1n = i; UART ID number up->indata = up->inhead = up->intail = 0; up->inroom = SBUFSIZE; up->outdata = up->outhead = up->outtail = 0; up->outroom = SBUFSIZE; up->txon = 0; } uart[3].base = (char *)(0x10009000); // uart3 at 0x10009000 } void uart_handler(UART *up) { u8 mis = *(up->base + MIS); // read MIS register if (mis & (1inhead %= SBUFSIZE; up->indata++; up->inroom--; } int do_tx(UART *up) // TX interrupt handler { char c; printf("TX interrupt\n"); if (up->outdata base+MASK) = 0x10; // disable TX interrupt up->txon = 0; // turn of txon flag return; } c = up->outbuf[up->outtail++]; up->outtail %= SBUFSIZE; *(up->base+UDR) = (int)c; // write c to DR up->outdata--; up->outroom++; } int ugetc(UART *up) // return a char from UART { char c; while(up->indata data > 0 READONLY

73

74

3 Interrupts and Exceptions Processing c = up->inbuf[up->intail++]; up->intail %= SBUFSIZE; // updating variables: must disable interrupts lock(); up->indata--; up->inroom++; unlock(); return c;

} int uputc(UART *up, char c) // output a char to UART { kprintf("uputc %c ", c); if (up->txon){ //if TX is on, enter c into outbuf[] up->outbuf[up->outhead++] = c; up->outhead %= 128; lock(); up->outdata++; up->outroom--; unlock(); return; } // txon==0 means TX is off => output c & enable TX interrupt // PL011 TX is riggered only if write char, else no TX interrupt int i = *(up->base+UFR); // read FR while( *(up->base+UFR) & 0x20 ); // loop while FR=TXF *(up->base+UDR) = (int)c; // write c to DR UART0_IMSC |= 0x30; // 0000 0000: bit5=TX mask bit4=RX mask up->txon = 1; } int ugets(UART *up, char *s) // get a line from UART { kprintf("%s", "in ugets: "); while ((*s = (char)ugetc(up)) != '\r'){ uputc(up, *s++); } *s = 0; } int uprints(UART *up, char *s) // print a line to UART { while(*s) uputc(up, *s++); }

3.6.3.2 Explanations of the UART Driver Code (1). uart_handler(UART *up): The uart interrupt handler reads the MIS register to determine the interrupt type. It is an RX interrupt if bit 4 of MIS is 1. It is a TX interrupt if bit 5 of MIS is 1. Depending on the interrupt type, it calls do_rx() or do_tx() to handle the interrupt. (2). do_rx(): This is an input interrupt. It reads the ASCII char from the data register, which clears the RX interrupt of the UART. Then it enters the char into the circular input buffer and increments the data variable by 1. The data variable represents the number of chars in the input buffer. (3). do_tx(): This is an output interrupt, which is triggered by writing the last char to the output data register and transmission of that char has finished. The handler checks whether there are any chars in the output buffer. If the output buffer is empty, it disables the UART TX interrupt and returns. If the TX interrupt is not disabled, it would cause an infinite sequence of TX interrupts. If the output buffer is not empty, it takes a char from the buffer and writes it to the data register for transmission out.

3.6 UART Driver

75

(4). ugetc(): ugetc is for the main program to get a char from a UART port. Its logic and synchronization with the RX interrupt handler are the same as kgetc() of the keyboard driver. So we shall not repeat them here again. (5). uputc(): uputc() is for the main program to output a char to a UART port. If the UART port is not transmitting (the txon flag is off), it writes the char to the data register, enables TX interrupt, and sets the txon flag. Otherwise, it enters the char into the output buffer, updates the data variable, and returns. The TX interrupt handler will output the chars from the output buffer on each successive interrupt. (6). Formatted printing: uprintf(UART *up, char *fmt,…) is for formatted printing to a UART port. It is based on uputc().

3.6.3.3 Demonstration of KBD and UART Drivers The t.c file contains the IRQ_handler() and the main() function. The main() function first initializes the UART devices and the VIC for interrupts. Then it tests the UART driver by issuing serial I/O on the UART terminals. For each IRQ interrupt, the IRQ_handler() determines the interrupt source and calls an appropriate handler to handle the interrupt. For clarity, UARTrelated codes are shown in bold-faced lines: /*************** C3.3. t.c file *******************/ #include "defines.h" #include "string.c" #include "uart.c" // UART driver #include "kbd.c" // keyboard driver #include "timer.c" // timer #include "vid.c" // LCD display driver #include "exceptions.c" void copy_vectors(void){

// same as before }

void IRQ_handler() // IRQ interrupt handler in C { // read VIC status registers to find out which interrupt int vicstatus = VIC_STATUS; // VIC status BITs: timer0,1=4,uart0=13,uart1=14 if (vicstatus & (1base+TVALUE)==0) // timer 3 timer_handler(3); } if (vicstatus & (1nmsg); // wait for message P(running->mlock); // lock PROC.mqueue MBUF *mp = dequeue(running->mqueue); // get a message V(running->mlock); // release mlock copy(mp->contents, msg); // copy contents to Umode put_mbuf(mp); // free mbuf

}

The above s_send/s_recv algorithm is correct in terms of process synchronization, but there are other problems. Whenever a blocking protocol is used, there are chances of deadlock. Indeed, the s_send/s_recv algorithm may lead to the following deadlock situations: (1). If processes only send but do not receive, all processes would eventually be blocked at P(nmbuf) when there are no more free mbufs. (2). If no process sends but all try to receive, every process would be blocked at its own nmsg semaphore. (3). A process Pi sends a message to another process Pj and waits for a reply from Pj, which does exactly the opposite. Then Pi and Pj would mutually wait for each other, which is the familiar cross-locked deadlock. As for how to handle deadlocks in message passing, the reader may consult Chap. 6 of Wang [15], which also contains a server-client-based message passing protocol.

5.13 Process Communication

5.13.4.3 Demonstration of Message Passing The Example Program C5.7 demonstrates synchronous message passing. (1). Type.h file: added MBUF type

struct semaphore{ int value; struct proc *queue; }; typedef struct mbuf{ struct mbuf *next; int priority; int pid; char contents[128]; }MBUF; typedef struct proc{ // same as before, but added MBUF *mQueue; struct semaphore mQlock; struct semaphore nmsg; int kstack[SSIZE]; }PROC;

(2). Message.c file /******** message.c file ************/ #define NMBUF 10 struct semaphore nmbuf, mlock; MBUF mbuf[NMBUF], *mbufList; // mbufs buffers and mbufList int menqueue(MBUF **queue, MBUF *p){// enter p into queue by priority} MBUF *mdequeue(MBUF **queue){// return first queue element} int msg_init() { int i; MBUF *mp; printf("mesg_init()\n"); mbufList = 0; for (i=0; ipid; mp->priority = 1; strcpy(mp->contents, msg); P(&p->mQlock); menqueue(&p->mQueue, mp); V(&p->mQlock); V(&p->nmseg); return 0; } int recv(char *msg) // recv msg from own msgqueue { P(&running->nmsg); P(&running->mQlock); MBUF *mp = mdequeue(&running->mQueue); V(&running->mQlock); strcpy(msg, mp->contents); int sender = mp->pid; put_mbuf(mp); return sender; }

(3). t.c file: #include "type.h" #include "message.c" int sender() // send task code { struct uart *up = &uart[0]; char line[128]; while(1){ ugets(up, line); printf("task%d got a line=%s\n", running->pid, line); send(line, 4); printf("task%d send %s to pid=4\n", running->pid,line); } } int receiver() // receiver task code { char line[128]; int pid; while(1){ printf("task%d try to receive msg\n", running->pid); pid = recv(line); printf("task%d received: [%s] from task%d\n", running->pid, line, pid); } }

5 Process Management in Embedded Systems

5.14 Uniprocessor (UP) Embedded System Kernel

153

Fig. 5.12 Demonstration of message passing int main() { msg_init();

}

kprintf("P0 kfork tasks\n"); kfork((int)sender, 1); // sender process kfork((int)receiver, 1); // receiver process while(1){ if (readyQueue) tswitch(); }

Figure 5.12 shows the sample outputs of running the C5.7 Program.

5.14 Uniprocessor (UP) Embedded System Kernel An embedded system kernel comprises dynamic processes, all of which execute in the same address space of the kernel. The kernel provides functions for process management, such as process creation, synchronization, communication, and termination. In this section, we shall show the design and implementation of uniprocessor (UP) embedded system kernels. There are two distinct types of kernels: non-preemptive and preemptive. In a non-preemptive kernel, each process runs until it gives CPU voluntarily. In a preemptive kernel, a running process can be preempted either by priority or by time slice.

5.14.1 Non-preemptive UP Kernel A uniprocessor (UP) kernel is non-preemptive if each process runs until it gives up the CPU voluntarily. While a process runs, it may be diverted to handle interrupts, but control always returns to the point of interruption in the same process at the end of interrupt processing. This implies that in a non-preemptive UP kernel only one process runs at a time. Therefore, there

154

5 Process Management in Embedded Systems

is no need to protect data objects in the kernel from the concurrent executions of processes. However, while a process runs, it may be diverted to execute an interrupt handler, which may interfere with the process if both try to access the same data object. To prevent interference from interrupt handlers, it suffices to disable interrupts when a process executes a piece of critical code. This simplifies the system design. The Example Program C5.8 demonstrates the design and implementation of a non-preemptive kernel for uniprocessor embedded systems. We assume that the system hardware consists of two timers, which can be programmed to generate timer interrupts with different frequencies, a UART, a keyboard, and an LCD display. The system software consists of a set of concurrent processes, all of which execute in the same address space but with different priorities. Process scheduling is by non-preemptive priority. Each process runs until it goes to sleep, blocks itself, or terminates. Timer0 maintains the time of day and displays a wall clock on the LCD. Since the task of displaying the wall clock is short, it is performed by the timer0 interrupt handler directly. Two periodic processes, timer_task1 and timer_task2, each of which calls the pause(t) function to suspend itself for a number of seconds. After registering a pause time in the PROC structure, the process changes status to PAUSE, enters itself into a pauseList, and gives up the CPU. On each second, Timer2 decrements the pause time of every process in the pauseList by 1. When the time reaches 0, it makes the paused process ready to run again. Although this can be accomplished by the sleep-wakeup mechanism, it is intended to show that periodic tasks can be implemented in the general framework of timer service functions. In addition, the system supports two sets of cooperative processes, which implement the producer-consumer problem to demonstrate process synchronization using semaphores. Each producer process tries to get an input line from UART0. Then it deposits the chars into a shared buffer, char pcbuffer[N] of size N bytes. Each consumer process tries to get a char from the pcbuff[N] and displays it to the LCD. Producer and consumer processes share the common data buffer as a pipe. For brevity, we only show the relevant code segments of the system: (1). /********* timer.c file of C5.8 **********/ PROC *pauseList; // a list of pausing processes typedef struct timer{ u32 *base; // timer's base address; int tick, hh, mm, ss; // per timer data area char clock[16]; }TIMER; TIMER timer[4]; // 4 timers; void timer_init() { printf("timer_init()\n"); pauseList = 0; // pauseList initially 0 // initialize all 4 timers } void timer_handler(int n) // n=timer unit { int i; TIMER *t = &timer[n]; t->tick++; t->ss = t->tick; if (t->ss == 60){ t->ss=0; tp->mm++; if (t->mm == 60){ t->mm = 0; t->hh++; } } if (n==0){ // timer0: display wall-clock directly for (i=0; iclock[i], n, 70+i); t->clock[7]='0'+(t->ss%10); t->clock[6]='0'+(t->ss/10); t->clock[4]='0'+(t->mm%10); t->clock[3]='0'+(t->mm/10); t->clock[1]='0'+(t->hh%10); t->clock[0]='0'+(t->hh/10); for (i=0; iclock[i], n, 70+i); }

5.14 Uniprocessor (UP) Embedded System Kernel if (n==2){// timer2: process PAUSed PROCs in pauseList PROC *p, *tempList = 0; while ( p = dequeue(&pauseList) ){ p->pause--; if (p->pause == 0){ // pause time expired p->status = READY; enqueue(&readyQueue, p); } else enqueue(&tempList, p); } pauseList = tempList; // updated pauseList } timer_clearInterrupt(n);

} int timer_start(int n) { TIMER *tp = &timer[n]; kprintf("timer_start %d base=%x\n", n, tp->base); *(tp->base+TCNTL) |= 0x80; // set enable bit 7 } int timer_clearInterrupt(int n) { TIMER *tp = &timer[n]; *(tp->base+TINTCLR) = 0xFFFFFFFF; }

(2). /************ uart.c file of C5.8 *************/ typedef struct uart{ char *base; // base address; as char * u32 id; // uart number 0-3 char inbuf[BUFSIZE]; int inhead, intail; struct semaphore indata, uline; char outbuf[BUFSIZE]; int outhead, outtail; struct semaphore outroom; int txon; // 1=TX interrupt is on }UART; UART uart[4]; // 4 UART structures int uart_init() { int i; UART *up; for (i=0; ibase = (char *)(0x101f1000 + i*0x1000); *(up->base+0x2C) &= ~0x10; // disable FIFO *(up->base+0x38) |= 0x30; up->id = i; up->inhead = up->intail = 0; up->outhead = up->outtail = 0; up->txon = 0; up->indata.value = 0; up->indata.queue = 0; up->uline.value = 0; up->uline.queue = 0; up->outroom.value = BUFSIZE; up->outroom.queue = 0; } }

155

156

5 Process Management in Embedded Systems

void uart_handler(UART *up) { u8 mis = *(up->base + MIS); // read MIS register if (mis & 0x10) do_rx(up); if (mis & 0x20) do_tx(up); } int do_rx(UART *up) // UART rx interrupt handler { char c; c = *(up->base+UDR); up->inbuf[up->inhead++] = c; up->inhead %= BUFSIZE; V(&up->indata); if (c=='\r'){ V(&up->uline); } } int do_tx(UART *up) // UART tx interrupt handler { char c; u8 mask; if (up->outroom.value >= SBUFSIZE){ // outbuf[ ] empty // disable TX interrupt; return *(up->base+IMSC) = 0x10; // mask out TX interrupt up->txon = 0; return; } c = up->outbuf[up->outtail++]; up->outtail %= BUFSSIZE; *(up->base + UDR) = (int)c; V(&up->outroom); } int ugetc(UART *up) { char c; P(&up->indata); lock(); c = up->inbuf[up->intail++]; up->intail %= BUFSIZE; unlock(); return c; } int ugets(UART *up, char *s) { while ((*s = (char)ugetc(up)) != '\r'){ uputc(up, *s); s++; } *s = 0; } int uputc(UART *up, char c) { if (up->txon){ // if TX is on => enter c into outbuf[] P(&up->outroom); // wait for room lock(); up->outbuf[up->outhead++] = c;

5.14 Uniprocessor (UP) Embedded System Kernel up->outhead %= 128; unlock(); return;

}

} int i = *(up->base+UFR); // read FR while( *(up->base+UFR) & 0x20 ); // loop while FR=TXF *(up->base + UDR) = (int)c; // write c to DR *(up->base+IMSC) |= 0x30; up->txon = 1;

(3). /**************** t.c file #include "type.h" #include "string.c" #include "queue.c" #include "pv.c" #include "vid.c" #include "kbd.c" #include "uart.c" #include "timer.c" #include "exceptions.c" #include "kernel.c"

of C5.8******************/

// IRQ interrupts handler entry point void irq_chandler() { int vicstatus, sicstatus; // read VIC SIV status registers to find out which interrupt vicstatus = VIC_STATUS; sicstatus = SIC_STATUS; if (vicstatus & (1priority >= running->priority){ if (intnest==0){ // not in IRQ handler: preempt immediately printf("%d PREEMPT %d NOW\n", readyQueue->pid, running->pid); tswitch(); } else{ // still in IRQ handler: defer preemption printf("%d DEFER PREEMPT %d\n", readyQueue->pid, running->pid); swflag = 1; // set need to switch task flag } } int_on(SR); } int kwakeup(int event) { PROC *p, *tmp=0; int SR = int_off(); while((p = dequeue(&sleepList))!=0){ if (p->event==event){ p->status = READY; enqueue(&readyQueue, p); } else{ enqueue(&tmp, p); } } sleepList = tmp; reschedule(); // if wake up any higher priority procs int_on(SR); }

5.14 Uniprocessor (UP) Embedded System Kernel int V(struct semaphore *s) { PROC *p; int cpsr; int SR = int_off(); s->value++; if (s->value queue); p->status = READY; enqueue(&readyQueue, p); printf("timer: V up task%d pri=%d; running pri=%d\n", p->pid, p->priority, running->priority); resechedule(); } int_on(SR); } int mutex_unlock(MUTEX *s) { PROC *p; int SR = int_off(); printf("task%d unlocking mutex\n", running->pid); if (s->lock==0 || s->owner != running){ // unlock error int_on(SR); return -1; } // mutex is locked and running task is owner if (s->queue == 0){ // mutex has no waiter s->lock = 0; // clear lock s->owner = 0; // clear owner } else{ // mutex has waiters: unblock one as new owner p = dequeue(&s->queue); p->status = READY; s->owner = p; printf("%d mutex_unlock: new owner=%d\n", running->pid,p->pid); enqueue(&readyQueue, p); resechedule(); } int_on(SR); return 0; } int kfork(int func, int priority) { // create new task with priority as before mutex_lock(&readyQueuelock); enqueue(&readyQueue, p); mutex_unlock(&readyQueuelock); reschedule(); return p-pid; }

163

164

(3). t.c file: /**************** t.c file of Program C5.9***************/ #include "type.h" MUTEX *mp; // global mutex struct semaphore s1; // global semaphore #include "queue.c" #include "pv.c" // P/V on semaphores #include "mutex.c" // mutex functions #include "kbd.c" #include "uart.c" #include "vid.c" #include "exceptions.c" #include "kernel.c" #include "timer.c" // timer drivers int copy_vectors(){ // same as before } int irq_chandler(){ // shown above } int task3() { printf("PROC%d running: ", running->pid); mutex_lock(mp); printf("PROC%d inside CR\n", running->pid); mutex_unlock(mp); printf("PROC%d exit\n", running->pid); kexit(); } int task2() { printf("PROC%d running\n", running->pid); kfork((int)task3, 3); // create P3 with priority=3 printf("PROC%d exit:", running->pid); kexit(); } int task1() { printf("proc%d start\n", running->pid); while(1){ P(&s1); // task1 is Ved up by timer periodically mutex_lock(mp); kfork((int)task2, 2); // create P2 with priority=2 printf("proc%d inside CR\n", running->pid); mutex_unlock(mp); printf("proc%d finished loop\n", running->pid); } } int main() { fbuf_init(); // LCD driver uart_init(); // UARTs driver kbd_init(); // KBD driver kprintf("Welcome to Wanix in ARM\n"); // configure VIC for vectored nterrupts timer_init(); timer_start(0); // timer0: wall clock timer_start(2); // timer2: timing events mp = mutex_create(); // create global mutex s1.value = 0; // initialize semaphore s1 s1.queue = (PROC*)0;

5 Process Management in Embedded Systems

5.14 Uniprocessor (UP) Embedded System Kernel

}

kernel_init(); kfork((int)task1, 1); while(1){ if (readyQueue) tswitch(); }

165

// initialize kernel and run P0 // create P1 with priority=1 // P0 loop

5.14.4 Demonstration of Preemptive UP Kernel The sample system C5.9 demonstrates fully preemptive process scheduling. When the system starts, it creates and runs the initial process P0, which has the lowest priority 0. P0 creates a new process P1 with priority = 1. Since P1 has a higher priority than P0, it immediately preempts P0, which demonstrates direct preemption without any delay. When P1 runs in task1(), it first waits for a timer event by P(s1 = 0), which is Ved up by a timer periodically (every 4 seconds). While P1 waits on the semaphore, P0 resumes running. When the timer interrupt handler V up P1, it tries to preempt P0 by P1. Since task switch is not allowed inside interrupt handler, the preemption is deferred, which demonstrates preemption may be delayed by interrupt processing. As soon as interrupt processing ends, P1 will preempt P0 to become running again. To illustrate process preemption due to blocking, P1 first locks the mutex mp. While holding the mutex lock, P1 creates a process P2 with a higher priority = 2, which immediately preempts P1. We assume that P2 does not need the mutex. It creates a process P3 with a higher priority = 3, which immediately preempts P2. In the task3() code, P3 tries to lock the same mutex mp, which is still held by P1. Thus, P3 gets blocked on the mutex, which switches to run P2. When P2 finishes, it calls kexit() to terminate, causing P1 to resume running. When P1 unlocks the mutex, it unblocks P3, which has a higher priority than P1, so it immediately preempts P1. After P3 terminates, P1 resumes running again and the cycle repeats. Figure 5.14 shows the sample outputs of running the Program C5.9.

Fig. 5.14 Demonstration of process preemption

5 Process Management in Embedded Systems

166

It is noted that, in a strict priority system, the current running process should always be the one with the highest priority. However, in the sample system C5.9, when the process P3, which has the highest priority, tries to lock the mutex that is already held by P1, which has a lower priority, it becomes blocked on the mutex and switches to run the next runnable process. In the sample system, we assumed that the process P2 does not need the mutex, so it becomes the running process when P3 gets blocked on the mutex. In this case, the system is running P2, which does not have the highest priority. This violates the strict priority principle, resulting in what’s known as a priority inversion [8], in which a low priority process may block a higher priority process. If P2 keeps on running or it switches to another process of the same or lower priority, process P3 would be blocked for an unknown amount of time, resulting in an unbounded priority inversion. Whereas simple priority inversion may be considered as natural whenever processes are allowed to compete for exclusive control of resources, unbounded priority inversion could be detrimental to systems with critical timing requirements. The Example Program C5.9 actually implements a scheme called priority inheritance, which prevents unbounded priority inversion. We shall discuss priority inversion in more detail later in Chap. 10 on real-time systems.

5.15 Summary This chapter covers process management. It introduces the process concept, the principle, and technique of multitasking by context switching. It showed how to create processes dynamically and discussed the principles of process scheduling. It covered process synchronization and the various kinds of process synchronization mechanisms. It showed how to use process synchronization to implement event-driven embedded systems. It discussed the various kinds of process communication schemes, which include shared memory, pipes, and message passing. It shows how to integrate these concepts and techniques to implement a uniprocessor (UP) kernel that supports process management with both non-preemptive and preemptive process scheduling. The UP kernel will be the foundation for developing complete operating systems in later chapters.

List of Sample Programs C5.1: Context switch C5.2: Multitasking C5.3. Dynamic processes C5.4. Event-driven multitasking system using sleep/wakeup C5.5: Even-driven multitasking system using semaphores C5.6: Pipe C5.7. Message passing C5.8: Non-preemptive UP kernel C5.9: Preemptive UP kernel

Problems 1. In the Example Program C5.2, the tswitch function saves all the CPU registers in the process kstack, and it restores all the save registers of the next process when it resumes. Since tswitch() is called as function, it is clearly unnecessary to save/restore R0. Assume that the tswitch function is implemented as: tswitch: // disable IRQ interrupts stmfd sp!, {r4-r12, lr} LDR r0, =running // r0=&running LDR r1, [r0, #0] // r1->runningPROC str sp, [r1, #4] // running->ksp = sp

Problems

167 bl scheduler LDR r0, =running LDR r1, [r0, #0] // r1->runningPROC lDR sp, [r1, #4] // enable IRQ interrupts ldmfd sp!, {r4-r12, pc}

(1). Show how to initialize the kstack of a new process for it to start to execute the body() function. (2). Assume that the body() function is written as: int body(int dummy, int pid, int ppid){ }

where the parameters pid and ppid are the process id and the parent process id of the new process. Show how to modify the kfork() function to accomplish this. 2. Rewrite the UART driver in Chap. 3 by using sleep/wakeup to synchronize processes and the interrupt handler. 3. In the Example Program C5.3, all the processes are created with the same priority (so that they take turn to run). (1). What would happen if the processes are created with different priorities? (2). Implement a change_priority(int new_priority) function, which changes the running task’s priority to new_priority. Switch process if the current running process no longer has the highest priority. 4. With dynamic processes, a process may terminate when it has completed its task. Implement a kexit() function for tasks to terminate. 5. In all the example programs, each PROC structure has a statically allocated 4 KB kstack. (1). Implement a simple memory manager to allocate/deallocate memory dynamically. When the system starts, reserve a piece of memory, e.g., a 1 MB area beginning at 4 MB, as a free memory area. The function char *malloc(int size) allocates a piece of free memory of size bytes. When a memory area is no longer needed, it is released back to the free memory area by void mfree(char *address, int size)

Design a data structure to represent the current available free memory. Then implement the malloc() and mfree() functions. (2). Modify the kstack field of the PROC structure as an integer pointer int *kstack; and modify the kfork() function as int kfork(int func, int priority, int stack_size)

which dynamically allocates a memory area of stack_size (in 1 KB units) for the new process. (3). When a process terminates, its stack area must be freed. How to implement this? 6. It is well known that an interrupt handler must never go to sleep, become blocked, or wait. Explain why. 7. Misuse of semaphores may result in deadlocks. Survey the literatures to find out how to deal with deadlocks by deadlock prevention, deadlock avoidance, deadlock diction, and recovery. 8. The Pipe Program C5.6 is similar to named pipes (FIFOs) in Linux. Read the Linux man page on fifo to learn how to use named pipes for interprocess communication. 9. Modify the Example Program C5.8 by adding another set of cooperative produced-consumer processes to the system. Let producers get lines from the KBD, process the chars, and pipe them to the consumers, which output the chars to the second UART terminal.

168

5 Process Management in Embedded Systems

10. Modify the Example Program C5.9 to handle nested IRQ interrupts in SYS mode but still allow task preemption at the end of nested interrupt processing. 11. Assume that all processes have the same priority. Modify the Example Program C5.9 to support process scheduling by time slice.

References 1. Accetta, M. et al., “Mach: A New Kernel Foundation for UNIX Development”, Technical Conference – USENIX, 1986 2. ARM toolchain: http://gnutoolchains.com/arm-eabi, 2016 3. Bach, M. J., “The Design of the Unix operating system”, Prentice Hall, 1990 4. Buttlar, D, Farrell, J, Nichols, B., “PThreads Programming, A POSIX Standard for Better Multiprocessing”, O’Reilly Media, 1996 5. Dijkstra, E.W., “Co-operating Sequential Processes”, in Programming Languages, Academic Press, 1965 6. Hoare, C.A.R, “Monitors: An Operating System Structuring Concept”, CACM, Vol. 17, 1974 7. IBM MVS Programming Assembler Services Guide, Oz/OS V1R11.0, IBM, 2010 8. Lampson, B; Redell, D. (June 1980). “Experience with processes and monitors in MESA”. Communications of the ACM (CACM) 23(2): 105–117, 1980 9. OpenVMS: HP OpenVMS systems Documentation, http://www.hp.com/go/openvms/doc, 2014 10. Pthreads: https://computing.llnl.gov/tutorials/pthreads/, 2015 11. QEMU Emulators: “QEMU Emulator User Documentation”, http://wiki.qemu.org/download/qemu-doc.htm, 2010. 12. Silberschatz, A., Galvin, P.A., Gagne, G, “Operating system concepts, 8th Edition”, John Wiley & Sons, Inc. 2009 13. Stallings, W. “Operating Systems: Internals and Design Principles (7th Edition)”, Prentice Hall, 2011 14. Tanenbaum, A. S., Woodhull, A. S., “Operating Systems, Design and Implementation, third Edition”, Prentice Hall, 2006 15. Wang, K.C., “Design and Implementation of the MTX Operating Systems”, Springer International Publishing AG, 2015

Chapter 6 Memory Management in ARM

6.1 Process Address Spaces After power-on or reset, the ARM processor starts to execute the reset handler code in Supervisor (SVC) mode. The reset handler first copies the vector table to address 0, initializes stacks of the various privileged modes for interrupts and exception processing, and enables IRQ interrupts. Then it executes the system control program, which creates and starts up processes or tasks. In the static process model, all the tasks run in SVC mode in the same address space of the system kernel. The main disadvantage of this scheme is the lack of memory protection. While executing in the same address space, tasks share the same global data objects and may interfere with one another. An ill-designed or misbehave task may corrupt the shared address space, causing other tasks to fail. For better system security and reliability, each task should run in a private address space, which is isolated and protected from other tasks. In the ARM architecture, tasks may run in the unprivileged User mode. It is very easy to switch the ARM CPU from a privileged mode to User mode. However, once in User mode, the only way to enter privileged mode is by one of the following means. Exceptions: When an exception occurs, the CPU enters a corresponding privileged mode to handle the exception. Interrupts: An interrupt causes the CPU to enter either FIQ or IRQ mode. SWI: The SWI instruction causes the CPU to enter the Supervisor or SVC mode. In the ARM architecture, System mode is a separate privileged mode, which share the same set of CPU registers with User mode, but it is not the same system or kernel mode found in most other processors. To avoid confusion, we shall refer to the ARM SVC mode as the Kernel mode. SWI can be used to implement system calls, which allow a User mode process to enter Kernel mode, execute kernel functions, and return to User mode with the desired results. In order to separate and protect the memory regions of individual tasks, it is necessary to enable the memory management hardware, which provides each task with a separate virtual address (VA) space. In this chapter, we shall cover the ARM memory management unit (MMU) and demonstrate virtual address mapping and memory protection by example programs.

6.2 Memory Management Unit (MMU) in ARM The ARM memory management unit (MMU) [1] performs two primary functions: First, it translates virtual addresses into physical addresses (PA). Second, it controls memory access by checking permissions. The MMU hardware which performs these functions consists of a translation lookaside buffer (TLB), access control logic, and translation table walking logic. The ARM MMU supports memory accesses based on either Sections or Pages. Memory management by sections is a one-level paging scheme. The level-1 page table contains section descriptors, each of which specifies a 1 MB block of memory. Memory management by paging is a two-level paging scheme. The level-1 page table contains page table descriptors, each of which describes a level-2 page table. The level-2 page table contains page descriptors, each of which specifies a page frame in memory and access control bits. The ARM paging scheme supports two different page sizes. Small pages consist of 4 KB blocks of memory, and large pages consist of 64 KB blocks of memory. Each page comprises four sub-pages. Access control can be extended to 1 KB sub-pages within small pages and to 16 KB sub-pages within large pages. The ARM MMU © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. C. Wang, Embedded and Real-Time Operating Systems, https://doi.org/10.1007/978-3-031-28701-5_6

169

170

6 Memory Management in ARM

also supports the concept of domains. A domain is a memory area that can be defined with individual access rights. The Domain Access Control Register (DACR) specifies the access rights for up to 16 different domains, 0 to 15. The accessibility of each domain is specified by a 2-bit permission, where 00 is for no access, 01 is for client mode, which checks the access permission (AP) bits of the domain or page table entries, and 11 is for manager mode, which does not check the AP bits in the domain. The TLB contains 64 translation entries in a cache. During most memory accesses, the TLB provides the translation information to the access control logic. If the TLB contains a translated entry for the virtual address, the access control logic determines whether access is permitted. If access is permitted, the MMU outputs the appropriate physical address corresponding to the virtual address. If access is not permitted, the MMU signals the CPU to abort. If the TLB does not contain a translated entry for the virtual address, the translation table walk hardware is invoked to retrieve the translation information from a translation table in physical memory. Once retrieved, the translation information is placed into the TLB, possibly overwriting an existing entry. The entry to be overwritten is chosen by cycling sequentially through the TLB locations. When the MMU is turned off, e.g., during reset, there is no address translation. In this case, every virtual address is a physical address. Address translation takes effect only when the MMU is enabled.

6.3 MMU Registers The ARM processor treats the MMU as a coprocessor. The MMU contains several 32-bit registers which control the operation of the MMU. Figure 6.1 shows the format of MMU registers. MMU registers can be accessed by using the MRC and MCR instructions. The following is a brief description of the ARM MMU registers c0 to c10. Register c0 is for access to the ID register, cache type register, and TCM status registers. Reading from this register returns the device ID, the cache type, or the TCM status depending on the value of Opcode_2 used. Register c1 is the Control Register, which specifies the configuration of the MMU. In particular, setting the M bit (bit 0) enables the MMU, and clearing the M bit disables the MMU. The V bit of c1 specifies whether the vector table is remapped during reset. The default vector table location is 0x00. It may be remapped to 0xFFFF0000 during reset. Register c2 is the Translation Table Base Register (TTBR). It holds the physical address of the first-level translation table, which must be on a 16 KB boundary in main memory. Reading from c2 returns the pointer to the currently active firstlevel translation table. Writing to register c2 updates the pointer to the first-level translation table. Register c3 is the Domain Access Control Register. It consists of 16 2-bit fields; each defines the access permissions for one of the 16 domains (D15–D0). Register c4 is currently not used. Register c5 is the Fault Status Register (FSR). It indicates the domain and type of access being attempted when an abort occurred. Bits 7:4 specify which of the 16 domains (D15–D0) was being accessed, and Bits 3:1 indicate the type of access being attempted. A write to this register flushes the TLB.

Register 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10

1 write

9

8

7

6

5

4

3

2

1

0

0 R S B 1 D P W C A M

Translation Table Base

2 write 3 write

Control

0 0 0 0 0

15

14

4 read

13

12

11

10

Domain Access Control 9 8 7 6

Fault Status Flush TLB

6 read

Fault Address

Fig. 6.1 ARM MMU registers

Purge Address

4

0 0 0 0

5 write

6 write

5

3

2

Domain

1

0

Status

6.4 Accessing MMU Registers

171

Register c6 accesses the Fault Address Register (FAR). It holds the virtual address of the access when a fault occurred. A write to this register causes the data written to be treated as an address, and, if it is found in the TLB, the entry is marked as invalid. This operation is known as a TLB purge. The Fault Status Register and Fault Address Register are only updated for data faults, not for prefetch faults. Register c7 controls the caches and the write buffer. Register c8 is the TLB Operations Register. It is used mainly to invalidate TLB entries. The TLB is divided into two parts: a set-associative part and a fully associative part. The fully associative part, also referred to as the lockdown part of the TLB, is used to store entries to be locked down. Entries held in the lockdown part of the TLB are preserved during an invalidate TLB operation. Entries can be removed from the lockdown TLB using an invalidate TLB single entry operation. The invalidate TLB operations invalidate all the unpreserved entries in the TLB. The invalidate TLB single entry operations invalidate any TLB entry corresponding to the virtual address. Register c9 accesses the Cache Lockdown and TCM Region Register on some ARM boards equipped with TCM. Register c10 is the TLB Lockdown Register. It controls the lockdown region in the TLB.

6.4 Accessing MMU Registers The registers of CP15 can be accessed by MRC and MCR instructions in a privileged mode. The instruction format is shown in Fig. 6.2. MCR{cond} p15,,,,, MRC{cond} p15,,,,, The CRn field specifies the coprocessor register to access. The CRm field and Opcode_2 fields specify a particular operation when addressing registers. The L bit distinguishes MRC (L = 1) and MCR (L = 0) instructions.

6.4.1 Enabling and Disabling the MMU The MMU is enabled by writing the M bit, bit 0, of the CP15 Control Register c1. On reset, this bit is cleared to 0, disabling the MMU.

6.4.1.1 Enable MMU Before enabling the MMU, the system must do the following: 1. Program all relevant CP15 registers. This includes setting up suitable translation tables in memory. 2. Disable and invalidate the Instruction Cache. The Instruction Cache can be enabled when enabling the MMU. To enable the MMU, proceed as follows: 1. Program the Translation Table Base and Domain Access Control Registers. 2. Program first-level and second-level descriptor page tables as needed. 3. Enable the MMU by setting bit 0 in the CP15 Control Register c1.

6.4.1.2 Disable MMU To disable the MMU, proceed as follows: Fig. 6.2 MCR and MRC instruction format

172

6 Memory Management in ARM

(1). Clear bit 2 in the CP15 Control Register c1. The Data Cache must be disabled prior to, or at the same time as, the MMU being disabled, by clearing bit 2 of the Control Register. If the MMU is enabled, then disabled, and subsequently re-enabled, the contents of the TLBs are preserved. If the TLB entries are now invalid, they must be invalidated before the MMU is re-enabled. (2). Clear bit 0 in the CP15 Control Register c1. When the MMU is disabled, memory accesses are treated as follows: • All data accesses are treated as Noncacheable. The value of the C bit, bit 2, of the CP15 Control Register c1 should be zero (SBZ). • All instruction accesses are treated as Cacheable if the I bit (bit 12) of the CP15 Control Register c1 is set to 1 and Noncacheable if the I bit is set to 0. • All explicit accesses are Strongly Ordered. The value of the W bit, bit 3, of the CP15 Control Register c1 is ignored. • No memory access permission checks are performed, and no aborts are generated by the MMU. • The physical address for every access is equal to its virtual address. This is known as a flat address mapping. • The FCSE PID should be zero when the MMU is disabled. This is the reset value of the FCSE PID. If the MMU is to be disabled, the FCSE PID must be cleared. • All CP15 MMU and cache operations work as normal when the MMU is disabled. • Instruction and data prefetch operations work as normal. However, the Data Cache cannot be enabled when the MMU is disabled. Therefore, a data prefetch operation has no effect. Instruction prefetch operations have no effect if the Instruction Cache is disabled. No memory access permissions are performed, and the address is flat mapped. • Accesses to the TCMs work as normal if the TCMs are enabled.

6.4.2 Domain Access Control Memory access is controlled primarily by domains. There are 16 domains, each defined by 2 bits in the Domain Access Control Register. Each domain supports two kinds of users. Clients: Clients use a domain. Managers: Managers control the behavior of the domain. The domains are defined in the Domain Access Control Register. In Fig. 6.1, row 3 illustrates how the 32 bits of the register are allocated to define 16 2-bit domains. Table 6.1 shows the meanings of the domain access bits.

6.4.3 Translation Table Base Register Register c2 is the Translation Table Base Register (TTBR), for the base address of the first-level translation table. Reading from c2 returns the pointer to the currently active first-level translation table in bits [31:14] and an unpredictable value in bits [13:0]. Writing to register c2 updates the pointer to the first-level translation table from the value in bits [31:14] of the written value. Bits [13:0] should be zero. The TTBR can be accessed by the following instructions: MRC p15, 0, , c2, c0, 0 MCR p15, 0, , c2, c0, 0

; read TTBR ; write TTBR

The CRm and Opcode_2 fields are SBZ (should be zero) when writing to c2. Table 6.1 Access bits in Domain Access Control Register

6.4 Accessing MMU Registers

173

6.4.4 Domain Access Control Register Register c3 is the Domain Access Control Register consisting of 16 2-bit fields. Each 2-bit field defines the access permissions for one of the 16 domains, D15-D0. Reading from c3 returns the value of the Domain Access Control Register. Writing to c3 writes the value of the Domain Access Control Register. The 2-bit domain access control bits are defined as: Value 00 01 10 11

Meaning No access Client Reserved Manager

Description Any access generates a domain fault. Accesses are checked against the access permission bits in the section or page descriptor. Currently behaves like the no access mode. Accesses are not checked against the access permission bits so a permission fault cannot be generated.

The Domain Access Control Register can be accessed by the following instructions: MRC p15, 0, , c3, c0, 0 MCR p15, 0, , c3, c0, 0

; read domain access permissions ; write domain access permissions

6.4.5 Fault Status Registers Register c5 is the Fault Status Registers (FSRs). The FSRs contain the source of the last instruction or data fault. The instruction-side FSR is intended for debug purposes only. The FSR is updated for alignment faults and external aborts that occur while the MMU is disabled. The FSR accessed is determined by the value of the Opcode_2 field: Opcode_2 = 0 Opcode_2 = 1

Data Fault Status Register (DFSR) Instruction Fault Status Register (IFSR)

The FSR can be accessed by the following instructions: MRC MCR MRC MCR

p15, p15, p15, p15,

0, 0, 0, 0,

, c5, , c5, , c5, , c5,

c0, c0, c0, c0,

0 0 1 1

;read DFSR ;write DFSR ;read IFSR ;write IFSR

The format of the Fault Status Register (FSR) is: |31 9| 8 |7 6 5 4 |3 2 1 0 | --------------------------------------------| UNP/SBZ | 0 | Domain | Status | ---------------------------------------------

The following describes the bit fields in the FSR: Bits [31:9] [8] [7:4] [3:0]

Description UNP/SBZP. Always reads as zero. Writes ignored. Specify the domain (D15-D0) being accessed when a data fault occurred. Type of fault generated. Table 6.2 shows the encodings of the status field in the FSR and if the domain field contains valid information.

6.4.6 Fault Address Register Register c6 is the Fault Address Register (FAR). It contains the modified virtual address of the access being attempted when a data abort occurred. The FAR is only updated for data aborts, not for prefetch aborts. The FAR is updated for alignment

174

6 Memory Management in ARM

Table 6.2 FSR status field encoding

faults and external aborts that occur while the MMU is disabled. The FAR can be accessed by using the following instructions: MRC p15, 0, , c6, c0, 0 MCR p15, 0, , c6, c0, 0

; read FAR ; write FAR

Writing c6 sets the FAR to the value of the data written. This is useful for a debugger to restore the value of the FAR to a previous state. The CRm and Opcode_2 fields are SBZ (should be zero) when reading or writing CP15 c6.

6.5 Virtual Address Translations The MMU translates virtual addresses generated by the CPU into physical addresses to access external memory and also derives and checks the access permission. Translation information, which consists of both the address translation data and the access permission data, resides in a translation table located in physical memory. The MMU provides the logic needed to traverse the translation table, obtain the translated address, and check the access permission. The translation process consists of the following steps.

6.5.1 Translation Table Base The Translation Table Base (TTB) Register points to the base of a translation table in physical memory which contains Section and/or Page descriptors.

6.5.2 Translation Table The translation table is the level-one page table. It contains 4096 4-byte entries, and it must be located on a 16 KB boundary in physical memory. Each entry is a descriptor, which specifies either a level-2 page table base or a section base. Figure 6.3 shows the format of the level-one page entries.

6.5.3 Level-One Descriptor A level-one descriptor is either a page table descriptor or a section descriptor, and its format varies accordingly. Figure 6.3 shows the format of level-one descriptors. The descriptor type is specified by the two least significant bits.

6.5.3.1 Page Table Descriptor A page table descriptor (second row in Fig. 6.3) defines a level-2 page table. We shall discuss 2-level paging in Sect. 6.6.

175

6.6 Translation of Section References

Fig. 6.3 Level-one descriptors

Fig. 6.4 Translation of section references

6.5.3.2 Section Descriptor A section descriptor (third row in Fig. 6.3) has a 12-bit base address, a 2-bit AP field, a 4-bit domain field, the C and B bits, and a type identifier (b10). The bit fields are defined as follows: Bits 31:20: Base address of a 1 MB section in memory. Bits 19:12: Always 0. Bits 11:10 (AP): Access permissions of this section. Their interpretation depends on the S and R bits (bits 8–9 of MMU control register C1). The most commonly used AP and SR settings are as follows: AP 00 01 10 11

SR xx xx xx xx

-SupervisorNo access Read/write Read/write Read/write

---User--No access No access Read only Read/write

----------------Note---------------------Access allowed only in Supervisor mode Writes in User mode cause permission fault Access permitted in both modes.

Bits 8:5: Specify one of the 16 possible domains (in the Domain Access Control Register) that form the primary access controls. Bit 4: Should be 1. Bits 3:2 (C and B): Control the cache and write buffer-related functions as follows: C – Cacheable: Data at this address will be placed in the cache (if the cache is enabled). B – Bufferable: Data at this address will be written through the write buffer (if the write buffer is enabled).

6.6 Translation of Section References In the ARM architecture, the simplest kind of paging scheme is by 1 MB sections, which uses only a level-one page table. So we discuss memory management by sections first. When the MMU translates a virtual address to a physical address, it consults the page tables. The translation process is commonly referred to as a page table walk. When using sections, the translation consists of the following steps, which are depicted in Fig. 6.4.

176

6 Memory Management in ARM

(1). A virtual address (VA) comprises a 12-bit Table index and a 20-bit Section index, which is the offset in the section. The MMU uses the 12-bit Table index to access a section descriptor in the translation table pointed by the TTBR. (2). The section descriptor contains a 12-bit base address, which points to a 1 MB section in memory, a (2-bit) AP field, and a (4-bit) domain number. First, it checks the domain access permissions in the Domain Access Control Register. Then, it checks the AP bits for accessibility to the section. (3). If the permission checking passes, it uses the 12-bit section base address and the 20-bit Section index to generate the physical address as: (32-bit)PA = ((12-bit)Section_base_address // 1MB at 6MB has space for 64 pgdirs of 64 PROCs for (i=0; idata_abort printf("P0 switch to P1 : enter a line : \n"); kgets(line); // enter a line to continue while(1){ // P0 code while(readyQueu==0); // loop if readyQueue empty tswitch(); // switch to run any ready process } }

7.4 System Kernel Supporting User Mode Processes

207

7.4.3.1 Create Process With User Mode Image (8). fork.c file: This file implements the kfork() function, which creates a child process with a user mode image u1 in the user mode area of the new process. In addition, it also ensures that the new process can execute its Umode image in user mode when it runs. #define UIMAGE_SIZE 0x100000 PROC *kfork() { extern char _binary_u1_start, _binary_u1_end; int usize; char *cp, *cq; PROC *p = get_proc(&freeList); if (p==0){ printf("kfork failed\n"); return (PROC *)0; } p->ppid = running->pid; p->parent = running; p->status = READY; p->priority = 1; // all NEW procs priority=1 // clear all “saved” regs in kstack to 0 for (i=1; ikstack[SSIZE-i] = 0; // set kstack to resume to goUmode in ts.s p->kstack[SSIZE-15] = (int)goUmode; // let saved ksp point to kstack[SSIZE-28] p->ksp = &(p->kstack[SSIZE-28]); // load Umode image to Umode memory cp = (char *)&_binary_u1_start; usize = &_binary_u1_end - &_binary_u1_start; cq = (char *)(0x800000 + (p->pid-1)*UIMAGE_SIZE); memcpy(cq, cp, usize); // return to VA 0 in Umode image; p->kstack[SSIZE-1] = (int)VA(0); p->usp = VA(UIMAGE_SIZE); // empty Umode stack pointer p->ucpsr = 0x10; // CPU.spsr=Umode enqueue(&readyQueue, p); printf("proc %d kforked a child %d\n", running->pid, p->pid); printQ(readyQueue); return p; }

Explanation of kfork(): kfork() creates a new process with a user mode image and enters it into the readyQueue. When the new process begins to run, it first resumes in kernel. Then it returns to user mode to execute the Umode image. The current kfork() function is the same as in Chap. 5 except for the user mode image part. Since this part is crucial, we shall explain it in more detail. In order for a process to execute its Umode image, we may ask the question: how did the process wind up in the readyQueue? The sequence of events must be as follows:

(1). It did a system call from Umode by SWI #0, which causes it to enter kernel to execute SVC handler (in ts.s), in which it uses STMFD sp!, {r0–r12, lr} to save Umode registers into the (empty) kstack, which becomes

208

7 User Mode Process and System Calls

-----------------------------------------------------------|uLR|ur12 ur11 ur10 ur9 ur8 ur7 ur6 ur5 ur4 ur3 ur2 ur1|ur0| ------------------------------------------------------------1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14

where the prefix u denotes Umode registers and –i means SSIZE-i. It also saves Umode sp and cpsr into PROC.usp and PROC.ucpsr. Then, it called tswitch() to give up CPU, in which it again uses STMFD sp!, {r0-r12, lr} to save Kmode registers into kstack. This adds one more frame to kstack, which becomes ksp -------goUmode-------------------------------------------|--|KLR kr12 kr11 kr10 kr9 kr8 kr7 kr6 kr5 kr4 kr3 kr2 kr1 kr0| -------------------------------------------------------------15 -16 -17 -18 -19 -20 -21 -22 -23 -24 -25 -26 -27 -28

where the prefix k denotes Kmode registers. In the PROC kstack, kLR = where the process called tswitch() and that’s where it shall resume to, uLR = where the process did the system call, and that’s where it shall return to when it goes back to Umode. Since the process never really ran before, all other “saved” CPU registers do not matter, so they can all be set to 0. Accordingly, we initialize the new process kstack as follows: 1. Clear all “saved” registers in kstack to 0 for (i=1; ikstack[SSIZE-i] = 0; } 2. Set saved ksp to kstack[SSIZE-28] p->ksp = &(p->kstack[SSIZE-28]); 3. Set kLR = goUmode, so that p will resume to goUmode (in ts.s) p->kstack[SSIZE-15] = (int)goUmode; 4. Set uLR to VA(0), so that p will execute from VA=0 in Umode p->kstack[SSIZE-1] = VA(0); // beginning of Umode image 5. Set new process usp to point at ustack TOP and ucpsr to Umode p->usp = (int *)VA(UIMAGE_SIZE); // high end of Umode image p->ucpsr = (int *)0x10; // Umode status register

7.4.3.2 Execution of User Mode Image With this setup, when the new process begins to run, it first resumes to goUmode (in ts.s), in which it sets Umode sp=PROC. usp, cpsr=PROC.ucpsr. Then it executes ldmfd sp!, {r0-r12, pc}^ which causes it to execute from uLR=VA(0) in Umode, i.e., from the beginning of the Umode image with the stack pointer pointing at the high end of the Umode image. Upon entry to us.s, it calls main() in C. To verify that the process is indeed executing in Umode, we get the CPU’s cpsr register and show the current mode, which should be 0x10. To test memory protection by the MMU, we try to access VAs outside of the process 1MB VA range, e.g., 0x80200000, as well as VA in kernel space, e.g., 0x4000. In either case, the MMU should detect the error and generate a data abort exception. In the data_abort handler, we read the MMU’s fault_status and fault_address registers to show the cause of the exception as well as the VA address that caused the exception. When the data_abort handler finishes, we let it return to PC-4, i.e., skip over the bad instruction that caused the data_abort exception, allowing the process continue. In a real system, when a Umode process commits memory access exceptions, it is a very serious matter, which usually causes the process to terminate. As for how to deal with exceptions caused by Umode processes in general, the reader may consult Chap. 9 of [14] on signal processing.

7.4 System Kernel Supporting User Mode Processes

209

7.4.4 Kernel Compile-link Script (2). mk script: generate kernel image and run it under QEMU (cd USER; mku u1) # create binary executable u1.o arm-none-eabi-as -mcpu=arm926ej-s ts.s -o ts.o arm-none-eabi-gcc -c -mcpu=arm926ej-s t.c -o t.o arm-none-eabi-ld -T t.ld ts.o t.o u1.o -Ttext=0x10000 -o t.elf arm-none-eabi-objcopy -O binary t.elf t.bin qemu-system-arm -M versatilepb -m 256M -kernel t.bin

In the linker script, u1.o is used as a raw binary data section in the kernel image. For each raw data section, the linker exports its symbolic addresses, such as _binary_u1_start, _binary_u1_end, _binary_u1_size which can be used to access the raw data section in the loaded kernel image.

7.4.5 Demonstration of Kernel With User Mode Process Figure 7.1 shows the output screen of running the C7.1 program. When a process begins to execute the Umode image, it first gets the CPU’s cpsr to verify it is indeed executing in user mode (mode=0x10). Then it tries to access some VA outside of its VA space. As the figure shows, each invalid VA generates a data abort exception. After testing the MMU for memory protection, it issues syscalls to get its pid and ppid. Then, it shows a menu and asks for a command to execute. Each command issues a syscall, which causes the process to enter kernel to execute the corresponding syscall function in kernel. Then it

Fig. 7.1 Demonstration of User Mode Process and System Calls

210

7 User Mode Process and System Calls

returns to Umode and prompts for a command to execute again. The reader may run the program and enter commands to test system calls in the sample system.

7.5 Embedded System with User Mode Processes Based on the example program C7.1, we propose two different models for embedded systems to support multiple User mode processes.

7.5.1 Processes in the Same Domain Instead of a single Umode image, we may create many Umode images, denoted by u1, u2, …, un. Modify the linker script to include all the Umode images as separate raw data sections in the kernel image. When the system starts, create n processes, P1, P2, .. , Pn, each executes a corresponding image in User mode. Modify kfork() to kfork(int i), which creates process Pi and loads the image ui to the memory area of process Pi. On ARM-based systems, use the simplest memory management scheme by allocating each process a 1MB Umode image area by process PID, e.g., P1 at 8MB, P2 at 9MB, etc. Some of the processes can be periodic while others can be event-driven. All the processes run in the same virtual address space of [2GB to 2GB + 1MB] but each has a separate physical memory area, which is isolated from other processes and protected by the MMU hardware. We demonstrate such a system by an example.

7.5.2 Demonstration of Processes in the Same Domain In this example system, we create 4 Umode images, denoted by u1 to u4. All Umode images are compile-linked with the same starting virtual address 0x80000000. They execute the same ubody(int i) function. Each process calls ubody(pid) with a unique process ID number for identification. When setting up the process page tables, the kernel space is assigned the domain number 0. All Umode spaces are assigned the domain number 1. When a process begins execution in Umode, it allows the user to test memory protection by trying to access invalid virtual addresses, which would generate memory protection faults. In the data abort exception handler, it displays the MMU’s fault_status and fault_addr registers to show the exception cause as well as the faulting virtual address. Then each process executes an infinite loop, in which it prompts for a command and executes the command. Each command invokes a system call interface, which issues a system call, causing the process to execute the system call function in kernel. To demonstrate additional capabilities of the system, we add the following commands: switch: enter kernel to switch process; sleep: enter kernel to sleep on process pid; wakeup: enter kernel to wake up sleeping process by pid; If desired, we may also use sleep/wakeup to implement event-driver processes, as well as for process cooperation. To support and test the added user mode commands, simply add them to the command and syscall interfaces. #include "string.c" #include "uio.c" int ubody(int id) { char line[64]; int *va; int pid = getpid(); int ppid = getppid(); int PA = getPA(); printf("test memory protection\n”); printf("try VA=0x80200000 : "); va = (int *)0x80200000; *va = 123;

7.5 Embedded System with User Mode Processes

211

printf("try VA=0x1000 : "); va = (int *)0x1000; *va = 456; while(1){ printf("Process #%d in Umode at %x parent=%d\n", pid, PA, ppid); umenu(); printf("input a command : "); ugetline(line); uprintf("\n"); if (!strcmp(line, "switch")) // only show the added commands uswitch(); if (!strcmp(line, "sleep")) usleep(); if (!strcmp(line, "wakeup")) uwakeup(); }

} /******* u1.c file ******/ // u2.c, u3.c u4.c are similar #include "ucode.c" main(){ ubody(1); }

The following shows the (condensed) kernel files. First, the syscall routing table, svc_handler(), is modified to support the added syscalls. int svc_handler(volatile int a, int b, int c, int d) { int r; switch(a){ case 0: r = kgetpid(); break; case 1: r = kgetppid(); break; case 2: r = kps(); break; case 3: r = kchname((char *)b); break; case 4: r = ktswitch(); break; case 5: r = ksleep(running->pid); break; case 6: r = kwakeup((int)b); break; case 92:r = kgetPA(); // return running->pgdir[2048]&0xFFF00000; } running->kstack[SSIZE-14] = r; }

In the kernel t.c file, after system initialization it creates 4 processes, each with a different Umode image. In kfork(pid), it uses the new process pid to load the corresponding image into the process memory area. The loading addresses are P1 at 8MB, P2 at 9MB, P3 at 10MB, and P4 at 11MB. Process page tables are set up in the same way as in Program C7.1. Each process has a page table in the 6MB area. In the process page tables, entry 2048 (VA=0x80000000) points to the process Umode image area in physical memory. /*********** kernel t.c file ************/ #define VA(x) (0x80000000 + (u32)x) int main() { // initialize kernel as before for (int i=1; i never return again }

212

7 User Mode Process and System Calls

PROC *kfork(int i) // i = 1 to 4 { // create new PROC *p with pid=i as before u32 *addr = (char *)(0x800000 + (p->pid-1)*0x100000); // copy Umode image ui of p to addr p->usp = (int *)VA(0x100000); // high end of 1MB VA p->kstack[SSIZE-1] = VA(0); // starting VA=0 enqueue(&readyQueue, p); return p; }

Figure 7.2 shows the outputs of running the C7.2 program. It shows that the switch command switches the running process from P1 to P2. The reader may run the system and enter other user mode commands to test the system.

7.5.3 Processes with Individual Domains In the first system model of C7.2, each process has its own page table. Switching process requires switching page table, which in turn requires flushing the TLB and I&D caches. In the next system model, the system supports n ppid = running->pid; p->parent = running; p->status = READY; p->priority = 1;

}

232

}

7 User Mode Process and System Calls

p->vforked = 1; // add vforked entry in PROC p->pgdir = running->pgdir; // share parent’s pgdir for (i=1; i kstack[SSIZE-i] = running->kstack[SSIZE-i]; } for (i=15; ikstack[SSIZE-i] = 0; } p->kstack[SSIZE - 14] = 0; // child return pid = 0 p->kstack[SSIZE-15] = (int)goUmode; // resume to goUmode p->ksp = &(p->kstack[SSIZE-28]); p->ucpsr = running->ucpsr; pusp = (int)running->usp; cusp = pusp - 1024; // child ustack: 1024 bytes down p->usp = (int *)cusp; memcpy((char *)cusp, (char *)pusp, 128); // 128 entries enough enqueue(&readyQueue, p); printf("proc %d vforked a child %d\n",running->pid, p->pid); return p->pid;

int kexec(char *cmdline) // cmdline=VA in Umode space { // fetch cmdline and get cmd filename: SAME as before if (p->vforked){ p->pgdir = (int *)(0x600000 + p->pid*0x4000); printf("%d is VFORKED: switchPgdir to %x",p->pid, p->pgdir); switchPgdir(p->pgdir); p->vforked = 0; } // load cmd file, set up kstack; return to VA=0: SAME as before }

7.7.11 Demonstration of vfork The sample system C7.5 implements vfork. To demonstrate vfork, we add a vfork command to Umode programs. The vfork command calls the uvfork() function, which issues a syscall to execute kvfork() in kernel. int uvfork() { int ppid, pid, status; ppid = getpid(); pid = syscall(11, 0,0,0); // vfork() syscall# = 11 if (pid){ printf("vfork parent %d return child pid=%d ", ppid, pid); printf("wait for child to terminate\n"); pid = wait(&status); printf("vfork parent: dead child=%d, status=%x\n", pid, status); } else{ printf("vforked child: mypid=%d ", getpid()); printf("vforked child: exec(\"u2 test vfork\")\n"); syscall(10, "u2 test vfork",0,0); } }

7.8 Threads

233

Fig. 7.5 Demonstration of vfork

In uvfork(), the process issues a syscall to create a child by vfork(). Then it waits for the vforked child to terminate. The vforked child issues an exec syscall to change image to a different program. When the child exits, it wakes up the parent, which would never know that the child was executing in its Umode image before. Figure 7.5 shows the outputs of running the samples system C7.5.

7.8 Threads In the process model, each process is an independent execution unit in a unique Umode address space. In general, the Umode address spaces of processes are all distinct and separate. The mechanism of vfork() allows a process to create a child process which temporarily shares the same address space with the parent, but they eventually diverge. The same technique can be used to create separate execution entities in the same address space of a process. Such execution entities in the same address space of a process are called light-weighted processes, which are more commonly known as threads [12]. The reader may consult [2, 10, 11], for more information on threads and threads programming. Threads have many advantages over processes. For a detailed analysis of the advantages of threads and their applications, the reader may consult Chap. 5 of [14]. In this section, we shall demonstrate the technique of extending vfork() to create threads.

7.8.1 Thread Creation (1). Thread PROC structures: As an independent execution entity, each thread needs a PROC structure. Since threads of the same process execute in the same address space, they share many things in common, such the pgdir, opened file descriptors, etc. It suffices to maintain only one copy of such shared information for all threads in the same process. Rather than drastically altering the PROC structure, we shall add a few fields to the PROC structure and use them for both processes and threads. The modified PROC structure is

234

7 User Mode Process and System Calls #define NPROC 10 #define NTHREAD 16 typedef struct proc{ // same as before, but add int type; // PROCESS|THREAD type int tcount; // number of threads in a process int kstack[SSIZE] }PROC; PROC proc[NPROC+NTHREAD],*freeList,*tfreeList,*readyQueue;

During system initialization, we put the first NPROC PROCs into freeList as before, but put the remaining NTHREAD PROCs into a separate tfreeList. When creating a process by fork() or vfork(), we allocate a free PROC from the freeList and the type is PROCESS. When creating a thread, we allocate a PROC from tfreeList and the type is THREAD. The advantage of this design is that it keeps the needed modifications to a minimum. For instance, the system falls back the pure process model if NTHREAD=0. With threads, each process may be regarded as a container of threads. When creating a process, it is created as the main thread of the process. With only the main thread, there is virtually no difference between a process and thread. However, the main thread may create other threads in the same process. For simplicity, we shall assume that only the main thread may create other threads and the total number of threads in a process is limited to TMAX. (2). Thread creation: The system call int thread(void *fn, int *ustack, int *ptr); creates a thread in the address space of a process to execute a function, fn(ptr), using the ustack area as its execution stack. The algorithm of thread() is similar to vfork(). Instead of temporarily sharing Umode stack with the parent, each thread has a dedicated Umode stack and the function is executed with a specified parameter (which can be a pointer to a complex data structure). The following shows the code segment of kthread() in kernel: // create a thread to execute fn(ptr); return thread's pid int kthread(int fn, int *stack, int *ptr) { int i, uaddr, tcount; tcount = running->tcount; if (running->type!=PROCESS || tcount >= TMAX){ printf("non-process OR max tcount %d reached\n", tcount); return -1; } p = (PROC *)get_proc(&tfreeList); // get a thread PROC if (p == 0){ printf("\nno more THREAD PROC "); return -1; } p->status = READY; p->ppid = running->pid; p->parent = running; p->priority = 1; p->event = 0; p->exitCode = 0; p->pgdir = running->pgdir; // same pgdir as parent p->type = THREAD; p->tcount = 0; // tcount not used by threads p->ucpsr = running->ucpsr; p->usp = stack + 1024; // high end of stack for (i=1; ikstack[SSIZE-i] = 0; p->kstack[SSIZE-1] = fn; // uLR = fn p->kstack[SSIZE-14] = (int)ptr; // saved r0 for fn(ptr)

7.8 Threads

}

235

p->kstack[SSIZE-15] = (int)goUmode; // resume to goUmode p->ksp = &p->kstack[SSIZE-28]; // aved ksp enqueue(&readyQueue, p); running->tcount++; // inc caller tcount return(p->pid);

When a thread starts to run, it first resumes to goUmode. Then it follows the syscall stack frame to return to Umode to execute fn(ptr) as if it was invoked by a function call. When control enters fn(), it uses stmfd sp!, {fp, lr} to save the return link register in Umode stack. When the function finishes, it returns by ldmfs sp!, {fp, pc} In order for the fn() function to return gracefully when it finishes, the initial Umode lr register must contain a proper return address. In the scheduler() function, when the next running PROC is a thread, we load the Umode lr register with the value of VA(4). At the virtual address 4 of every Umode image is an exit(0) syscall (in ts.s), which allows the thread to terminate normally. Each thread may run statically, i.e., in an infinite loop, and never terminate. If needed, it uses either sleep or semaphore to suspend itself. A dynamic thread terminates when it has completed the designated task. As usual, the parent (main) thread may use the wait system call to dispose of terminated children threads. For each terminated child, it decrements its tcount value by 1. We demonstrate a system which supports threads by an example.

7.8.2 Demonstration of Threads The sample system C7.6 implements support for threads. In the ucode.c file, we add a thread command and thread-related syscalls to the kernel. The thread command calls uthread(), in which it issues the thread syscall to create N pid, running->pgdir); printf("pgdir[0] = %x\n", running->pgdir[0]); switchPgdir(PA(running->pgdir); printf("OK\n"); }

256 // (4) -------- wait.c file -------int ksleep(int event) { int sr = int_off(); printf("proc %d ksleep on %x\n", running->pid, event); running->event = event; running->status = SLEEP; enqueue(&sleepList, running); printSleepList(sleepList); int_on(sr); tswitch(); } int kwakeup(int event) { PROC *p, *tmp=0; int sr = int_off(); while((p = dequeue(&sleepList))!=0){ if (p->event==event){ printf("kwakeup %d\n", p->pid); p->status = READY; enqueue(&readyQueue, p); } else{ enqueue(&tmp, p); } } sleepList = tmp; int_on(sr); } int kexit(int value) { int i; PROC *p; int wk1; wk1 = 0; printf("%d in kexit, value=%d\n", running->pid, value); if (running->pid==1){ kprintf("P1 never dies\n"); return -1; } for (i=1; istatus != FREE) && (p->ppid == running->pid)){ printf("give %d to P1\n", p->pid); p->ppid = 1; p->parent = &proc[1]; wk1++; } } running->exitCode = value; running->status = ZOMBIE; kwakeup((int)running->parent);

7 User Mode Process and System Calls

7.10 KMH Memory Mapping

}

if (wk1) kwakeup((int)&proc[1]); tswitch(); int kwait(int *status) { int i; PROC *p; int child = 0; printf("%d in kwait() : ", running->pid); for (i=1; istatus != FREE && p->ppid == running->pid){ child++; } } if (child==0){ printf("no child\n"); return -1; }

}

while(1){ for (i=1; istatus==ZOMBIE) && (p->ppid == running->pid)){ kprintf("found a ZOMBIE child %d\n", p->pid); *status = p->exitCode; p->status = FREE; p->priority = 0; putproc(p); return p->pid; } } printf("sleep on %x\n", running); ksleep((int)running); }

// (5). ------ fork.c file -------int body(), goUmode(); char *istring = "init start"; PROC *kfork(char *filename) { int i; char *cp; char line[8]; int *ustacktop, *usp; u32 BA, Btop, Busp; PROC *p = getproc(); if (p==0){ kprintf("kfork failed\n"); return (PROC *)0; }

257

258

7 User Mode Process and System Calls

p->ppid = running->pid; p->parent = running; p->parent = running; p->status = READY; p->priority = 1; p->cpsr = (int *)0x10; // previous mode = UMODE p->paddr = (int *)(0x80800000 + (p->pid-1)*0x100000);

// set kstack to resume to body for (i=1; ikstack[SSIZE-i] = 0; p->kstack[SSIZE-15] = (int)goUmode; p->ksp = &(p->kstack[SSIZE-28]);

// in dec reg=address ORDER !!!

// kstack must contain a resume frame FOLLOWed by a goUmode frame // ksp // -|----------------------------------------// r0 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 fp ip pc| // ------------------------------------------// 28 27 26 25 24 23 22 21 20 19 18 17 16 15 // // usp NOTE: r0 is NOT saved in svc_entry() // -|-----goUmode-------------------------------// r0 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 ufp uip upc| //------------------------------------------------// 14 13 12 11 10 9 8 7 6 5 4 3 2 1 /******************** // to go Umode, must set new PROC's Umode cpsr to Umode=10000 // this was done in ts.s sring init the mode stacks ==> // user mode's cspr was set to IF=00, mode=10000 ***********************/ // must load filename to Umode image area at 8MB+(pid-1)*1MB load(filename, p); // p->PROC containing pid, pgdir, etc

// // // // // //

must fix Umode ustack for it to goUmode: how did the PROC come to Kmode? by swi # from VA=0 in Umode => at that time all CPU regs are 0 p's ustack is at its Uimage (8mb+(pid- 1)*1mb) high end from PROC's point of view, it's a VA at 1MB (from its VA=0) but we in Kmode must access p's Ustack directly

/***** this sets each proc's ustack differently, thinking each in 8MB+ustacktop = (int *)(0x800000+(p->pid)*0x100000 + 0x100000); TRY to set it to OFFSET 1MB in its section; regardless of pid *****************************************************************/ // p's physical begin address is at BA=p->pgdir[2048] & 0xFFF00000 // its ustacktop is at BA+1MB = BA + 0x100000; // put istring in it, let p->usp be its VA pointing at the istring

7.10 KMH Memory Mapping BA = p->pgdir[0] & 0xFFF00000; Btop = BA + 0x100000; Busp = Btop - 128 + 0x80000000; printf("BA=%x Busp=%x\n", BA, Busp); cp = (char *)Busp; strcpy(cp, istring); //for 1MB Umode image, ustack top is at VA=1MB, set it to 1MB-128 cp = (char *)(0x100000 - 128); printf("cp = %x\n", cp); p->kstack[SSIZE-14] = p->kstack[SSIZE-13] = (int)(cp); p->usp = (int *)(cp); p->kstack[SSIZE-1] = (int)(0); // -|-----goUmode----------------------------------------------// r0 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 ufp uip upc|string | //-------------------------------------------------------------// 14 13 12 11 10 9 8 7 6 5 4 3 2 1 | | enqueue(&readyQueue, p);

}

printQ(readyQueue); kprintf("proc %d kforked a child %d\n", running->pid, p->pid); return p;

int fork() { // fork a child proc with identical Umode image=> same u1 file int i, j; char *cp, *cq; u32 *ucp, *upp; u32 PBA, PBtop, Pusp; u32 CBA, CBtop, Cusp; PROC *p = dequeue(&freeList); if (p==0){ kprintf("fork failed\n"); return -1; } p->ppid = running->pid; p->parent = running; p->status = READY; p->priority = 1; p->paddr = (int *)(0x80800000 + (p->pid-1)*0x100000); printf("running usp=%x linkR=%x\n", running->usp, running->upc); PBA = (running->pgdir[0] & 0xFFFF0000); PBA += 0x80000000; PBtop = PBA + 0x100000; // top of 1MB Uimage printf("FORK: parent %d uimage at %x\n", running->pid, PBA);

259

260

7 User Mode Process and System Calls

CBA = (p->pgdir[0] & 0xFFFF0000); CBA += 0x80000000; CBtop = CBA + 0x100000; // top of 1MB Uimage printf("FORK: child %d uimage at %x\n", p->pid, CBA); printf("copy Umode image from %x to %x\n", PBA, CBA); // copy 1MB of Umode image upp = (int *)PBA; ucp = (int *)CBA; for (i=0; iupc = running->upc; p->usp = running->usp; // both should be VA in their sections p->cpsr = running->cpsr; // the hard part: child must resume to the same place as parent // child kstack must contain |parent kstack|goUmode stack| printf("copy kernel mode stack\n"); j = &running->kstack[SSIZE] - running->ksp; // printf("j=%d\n", j); // this frame must be copied from parent's kstack, except PC, r0=0 // 1 2 3 4 5 6 7 8 9 10 11 12 13 14 // ---------------------------------------------// PC ip fp r10 r9 r8 r7 r6 r5 r4 r3 r2 r1 r0 | //----------------------------------------------// 15 16 17 18 19 20 21 22 23 24 25 26 27 28 //-------------------------------------------------// goUmode ip fp r10 r9 r8 r7 r6 r5 r4 r3 r2 r1 r0 | // -----------------------------------------------for (i=1; i kstack[SSIZE-i] = running->kstack[SSIZE-i]; //printf("%x ", p->kstack[SSIZE-i]); } //printf("\n"); printf("FIX UP child resume PC to %x\n", running->upc); p->kstack[SSIZE-1] = (int)(running->upc); p->kstack[SSIZE-(j+1)] = (int)goUmode; p->ksp = &(p->kstack[SSIZE-(j+14)]); enqueue(&readyQueue, p);

kprintf("KERNEL: proc %d forked a child %d\n", running->pid, p->pid); printQ(readyQueue); }

return p->pid;

7.10 KMH Memory Mapping // (6) ---------- exec.c file -----------int exec(char *uline) { int i, *ip; char *cp, kline[64]; PROC *p = running; char file[32], filename[32]; int *usp, *ustacktop; u32 BA, Btop, Busp, uLINE; char line[32]; printf("EXEC: proc %d uline=%x\n", running->pid, uline); BA = (p->pgdir[0] & 0xFFF00000); Btop = BA + 0x100000; // top of 1MB Uimage printf("EXEC: proc %d Uimage at %x uline=%x\n", running->pid, BA, uline); uLINE = (int)uline; kstrcpy(kline, (char *)uLINE); printf("EXEC: proc %d line = %s\n", running->pid, kline); // first token of kline = filename cp = kline; i=0; while(*cp != ' ' && *cp != 0){ filename[i] = *cp; i++; cp++; } filename[i] = 0; BA = p->pgdir[0] & 0xFFF00000; kprintf("load file %s to %x\n", file, BA); // load filename to Umode image if (load(filename, p) pgdir[0] & 0xFFF00000; Btop = BA + 0x100000; Busp = Btop - 128 + 0x80000000; printf("BA=%x Busp=%x\n", BA, Busp); cp = (char *)Busp; strcpy(cp, kline); // for 1MB Umode image, ustack top is at VA=1MB, set it to 1MB-128 cp = (char *)(0x100000 - 128); p->kstack[SSIZE-14] = p->kstack[SSIZE-13] = (int)(cp); p->usp = (int *)(cp);

261

262 p->kstack[SSIZE-1] = (int)(0); // -|-----goUmode------------------------------------------------// r0 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 ufp uip upc|string | //---------------------------------------------------------------// 14 13 12 11 10 9 8 7 6 5 4 3 2 1 | | p->kstack[SSIZE-14] = p->kstack[SSIZE-13] = (int)(cp); p->kstack[SSIZE-1] = (int)(0);

}

kprintf("kexec exit\n"); return 0;

// (7) -------svc.c file ----------

int kgetpid() { return running->pid; } int kgetppid() { return running->ppid; } char *pstatus[]={"FREE ","READY ","SLEEP ","BLOCK ","ZOMBIE ", "RUN "}; int kps() { int i; PROC *p; for (i=0; ipid, p->ppid); if (p==running) printf("%s ", pstatus[5]); else printf("%s", pstatus[p->status]); printf("name=%s\n", p->name); } } int kchname(char *s) { kprintf("kchname: name=%s\n", s); strcpy(running->name, s); return 123; } int kkfork() { PROC *p = kfork("/bin/u1"); if (p) return p->pid; return -1; }

7 User Mode Process and System Calls

7.10 KMH Memory Mapping int kkwait(int *status) { int pid, e; pid = kwait(&e); printf("write %x to status at %x in Umode\n", e, status); *status = e; return pid; } int ktswitch() { kprintf("%d in ktswitch()\n", running->pid); tswitch(); kprintf("%d exit ktswitch()\n", running->pid); } int getusp() { return (int)running->usp; } int svc_handler(volatile int a, int b, int c, int d) { int r; //kprintf("svc_handler: a=%d b=%d c=%d d=%d\n",a,b,c,d); switch(a){ case 0: r = kgetpid(); break; case 1: r = kgetppid(); break; case 2: r = kps(); break; case 3: r = kchname((char *)b); break; case 4: r = kkfork(); break; case 5: r = ktswitch(); break; case 6: r = kkwait((int *)b); break; case 7: r = kexit(b); break; case 8: r = getusp(); break; case 9: r = fork(); break; case 10: r = exec((char *)b); break;

}

case 90: case 91: case 92: default:

r = kgetc() & 0x7F; break; r = kputc(b); break; r = (int)running->paddr - 0x80000000; break; printf("invalid syscall %d\n", a);

running->kstack[SSIZE-14] = r; // save uR0 = r for Umode

}

return r; // return r to goUmode

# (8) --------- t.ld file --------ENTRY(reset_handler) SECTIONS

263

264 {

7 User Mode Process and System Calls

. = 0x80010000; .text :

{

// loading VA address of kernel

ts.o t.o

} .data :

{

ts.o(.data) t.o (.data)

} .bss : { _sbegin = .; *(.bss); _send = .; }

}

. = . = /* . = /*

ALIGN(8); . + 0x1000; /* 4kB of stack memory */ stack_top = .; */ . + 0x1000; /* 4kB of irq stack memory */ irq_stack_top = .; */

# (9) ---------- mk sh script file -----------# sdimage is a disk image containing an EXT2 file system # generate u1, u2 and copy to /bin of sdimage (cd USER; mku u1; mku u2) echo compile-link: kernel to 0x80010000 Kernel Mapped High to 2GB arm-none-eabi-as -mcpu=arm926ej-s ts.s -o ts.o arm-none-eabi-gcc -c -mcpu=arm926ej-s t.c -o t.o arm-none-eabi-ld -T t.ld ts.o t.o -Ttext=0x80010000 -o t.elf arm-none-eabi-objcopy -O binary t.elf t.bin rm *.o *.elf echo ready to go? read dummy qemu-system-arm -M versatilepb -m 128M -sd sdimage -kernel t.bin \ -serial mon:stdio

Explanations of C7.9 Program (1). High VA in kernel image: Assume that the kernel VA space is mapped to 2GB. In order to let the kernel use high virtual addresses, the link command must be modified to generate high virtual address. For the kernel image, the modified link command is arm-none-eabi-ld -T t.ld ts.o t.o mtx.o -Ttext=0x80010000 -o t.elf arm-none-eabi-objcopy -O binary t.elf t.bin # convert ELF to binary

The kernel image is loaded at the physical address 0x10000 but its VA is 0x800100000. For user mode images, the link command is changed to generate Umode images at VA=0 arm-none-eabi-ld -T u.ld us.o u1.o -Ttext=0x0 -o u1.elf arm-none-eabi-objcopy -O binary u1.elf u1 # convert ELF to binary

7.10 KMH Memory Mapping

265

(2). VA to PA conversion: The kernel code uses VA but page table entries must use PA. For convenience, we define the macros #define VA(x) ((u32)(x) + 0x80000000) #define PA(x) ((u32)(x) - 0x80000000)

for conversions between VA and PA. (3). VA for I/O Base Addresses: The base addresses of I/O devices are specified in PA, which must be remapped to VA. For the Versatilepb board, I/O devices are located in the 2MB I/O space beginning at PA 256MB. The I/O device base addresses must be mapped to VA by the VA(x) macro. (4). Initial Page Table: Since the kernel code is compile-linked with VA, we cannot execute the kernel’s C code before configuring the MMU for address translation. The initial page table must be constructed in assembly code when the system starts. Set up all the page tables in assembly code is tedious. Therefore, we build the page tables in several stages, first a simple page table in assembly for the system to begin to use VA. Then we build more page tables in C once the OS kernel runs in VA. (5). The ts.s file: The reset_handler sets up the initial page table and enables the MMU for address translation. It also sets bit-13 of the MMU control register C1 to 1 for vector remap. Then it sets up the privileged mode stacks, and call main() in C. For the sake of brevity, we shall only show the code segments that are relevant to memory mapping. The assembly code shown below sets up an initial one-level page table at PA=0x4000 (16KB) using 1MB sections. In the one-level page table, entries 2048 to 2048 + 258 map VA = (2GB, 2GB + 258MB) to the low 258 physical memory, which includes the 2MB I/O space at 256MB. The kernel mode space is assigned the domain 0, with the access bits set to AP=01 (client mode) to prevent user mode process from accessing kernel’s VA space. It configures the MMU control register C1 to remap vector table to 0xFFFF0000. Then it sets the TLB base register and enables the MMU for address translation. After these, the system is running with VA in the range of (2GB to 2GB + 258MB). /************* ts.s file **************/ //.set Mtable, 0x4000 // level-1 page table must be at 16K boundary reset_handler: mov r0, #0x4000

// r0 = Mtable contents=0x4000 at 16KB

// Assume Versatilepb, arm926ej-s has 256MB RAM + 2MB I/O at 256MB, // so the first 258 entries should be sufficient, but mov r1, #value // must use a value acceptable to the mov # instruction, so try 256 mov r1, #256 // 256 will certainly work add r1, r1, #2 // add 2 to get 258 mov r2, #0x100000 // r2=1M mov r3, #(0x1 st_nlink); // link count printf(“%4d “, sp->st_uid // uid printf(“%8d ", sp->st_size); // file size strcpy(ftime, ctime(&sp->st_ctime)); ftime[strlen(ftime)-1] = 0; // kill \n at end printf("%s ",ftime); // time in calendar form printf("%s", basename(fname)); // file basename if (S_ISLNK(sp->st_mode)){ // if symbolic link r = readlink(fname, sbuf, 4096); printf(" -> %s", sbuf); // -> linked pathname } printf("\n"); } int ls_dir(char *dname) // list a DIR {

328

8 General Purpose Embedded Operating Systems

char name[256]; // EXT2 filename: 1-255 chars DIR *dp; struct dirent *ep; // open DIR to read names dp = opendir(dname); // opendir() syscall while (ep = readdir(dp)){ // readdir() syscall strcpy(name, ep->d_name); if (!strcmp(name, ".") || !strcmp(name, "..")) continue; // skip over . and .. entries strcpy(name, dname); strcat(name, "/"); strcat(name, ep->d_name); ls_file(name); // call list_file() }

} int main(int argc, char *argv[]) { struct stat mystat, *sp; int r; char *s; char filename[1024], cwd[1024]; s = argv[1]; // ls [filename] if (argc == 1) // no parameter: ls CWD s = "./"; sp = &mystat; if ((r = stat(s, sp)) < 0){ // stat() syscall perror("ls"); exit(1); } strcpy(filename, s); if (s[0] != '/'){ // filename is relative to CWD getcwd(cwd, 1024); // get CWD path strcpy(filename, cwd); strcat(filename, "/"); strcat(filename,s); // construct $CWD/filename } if (S_ISDIR(sp->st_mode)) ls_dir(filename); // list DIR else ls_file(filename); // list single file }

The reader may compile and run the program C8.2 under Linux. It should list either a single file or a DIR in the same format as does the Linux ls –l command. Example 4: File copy programs: This example shows the implementations of two file copy programs; one uses system calls and the other one uses library I/O functions. Both programs are run as a.out src dest, which copies src to dest. We list the programs side by side in order to show their similarities and differences.

8.12 File System

329

------------ cp.sysall.c ------------|------ cp.libio.c -----------#include | #include #include | #include #include | main(int argc, char *argv[ ]) | main(int argc, char *argv[ ]) { | { int fd, gd; | FILE *fp, *gp; int n; | int n; char buf[4096]; | char buf[4096]; if (argc < 3) exit(1); | if (argc < 3) exit(1); fd = open(argv[1], O_RDONLY); | fp = fopen(argv[1], "r"); gd = open(argv[2],O_WRONLY|O_CREAT);| gp = fopen(argc[1], “w+”); if (fd < 0 || gd < 0) exit(2); | if (fp==0 || gp==0) exit(2); while(n=read(fd, buf, 4096)){ | while(n=fread(buf,1,4096,fp)){ write(gd, buf, n); | fwrite(buf, 1, n, gp); } | } close(fd); close(gd); | fclose(fp); fclose(gp); } | } ---------------------------------------------------------------------

The program cp.syscall.c shown on the left-hand side uses system calls. First, it issues open() system calls to open the src file for READ and the dest file foe WRITE, which creates the dest file if it does not exist. The open() syscalls return two (integer) file descriptors, fd and gd. Then it uses a loop to copy data from fd to gd. In the loop, it issues read() syscall on fd to read up to 4 KB data from the kernel to a local buffer. Then it issues write() syscall on gd to write the data from the local buffer to kernel. The loop ends when read() returns 0, indicating that the source file has no more data to read. The program cp.libio.c shown on the right-hand side uses library I/O functions. First, it calls fopen() to create two file streams, fp and gp, which are pointers to FILE structures. fopen() creates a FILE structure (defined in stdio.h) in the program’s heap area. Each FILE structure contains a local buffer, char fbuf[BLKSIZE], of file block size and a file descriptor field. Then it issues an open() system call to get a file descriptor and record the file descriptor in the FILE structure. Then it returns a file stream (pointer) pointing at the FILE structure. When the program calls fread(), it tries to read data from fbuf in the FILE structure. If fbuf is empty, fread() issues a read() system call to read a BLKSIZE of data from kernel to fbuf. Then it transfers data from fbuf to the program’s local buffer. When the program calls fwrite(), it writes data from the program’s local buffer to fbuf in the FILE structure. If fbuf is full, fwrite() issues a write() system call to write a BLKSIZE of data from fbuf to kernel. Thus, library I/O functions are built on top of system calls, but they issue system calls only when needed and they transfer data from/to kernel in file block size for better efficiency. Based on these discussions, the reader should be able to deduce which program is more efficient if the objective is only to transfer data. However, if a user mode program intends to access single chars in files or read/write lines, etc., using library I/O functions would be the better choice. Exercise 3: Assume that we rewrite the data transfer loops in the programs of Example 3 as follows, both transfer one byte at a time: cp.syscall.c | cp.syscall.c -------------------------------------------------------------while((n=read(fd, buf, 1)){ | while((n=fgetc(fp))!= EOF){ write(gd, buf, n); | fputc(n, gp); } | } --------------------------------------------------------------

Which program would be more efficient? Explain your reasons. Exercise 4: When copying files, the source file must be a regular file and we should never copy a file to itself. Modify the programs in Example 3 to handle these cases.

330

8 General Purpose Embedded Operating Systems

8.12.3 EXT2 File System in EOS For many years, Linux used EXT2 [4] as the default file system. EXT3 [6] is an extension of EXT2. The main addition to EXT3 is a journal file, which records changes made to the file system in a journal log. The log allows for quicker recovery from errors in case of a file system crash. An EXT3 file system with no error is identical to an EXT2 file system. The newest extension of EXT3 is EXT4 [3]. The major change in EXT4 is in the allocation of disk blocks. In EXT4, block numbers are 48 bits. Instead of discrete disk blocks, EXT4 allocates contiguous ranges of disk blocks, called extents. EOS is a small system intended mainly for teaching and learning the internals of embedded operating systems. Large file storage capacity is not the design goal. Principles of file system design and implementation, with an emphasis on simplicity and compatibility with Linux, are the major focal points. For these reasons, we choose ETX2 as the file system. Support for other file systems, e.g., FAT and NTFS, are not implemented in the EOS kernel. If needed, they can be implemented as user-level utility programs. This section describes the implementation of an EXT2 file system in the EOS kernel. Implementation of EXT2 file system is discussed in detail in [10]. For the sake of completeness and reader convenience, we include similar information here.

8.12.3.1 File System Organization Figure 8.5 shows the internal organization of an EXT2 file system. The organization diagram is explained by the labels (1) to (5). (1). is the PROC structure of the running process. Each PROC has a cwd field, which points to the in-memory INODE of the PROC’s Current Working Directory (CWD). It also contains an array of file descriptors, fd[], which point to opened file instances.

Fig. 8.5 EXT2 File System Data Structures

8.12 File System

331

(2). is the root directory pointer of the file system. It points to the in-memory root INODE. When the system starts, one of the devices is chosen as the root device, which must be a valid EXT2 file system. The root INODE (inode #2) of the root device is loaded into memory as the root directory (/) of the file system. This operation is known as "mount root file system". (3). is an openTable entry. When a process opens a file, an entry of the PROC’s fd array points to an openTable, which points to the in-memory INODE of the opened file. (4). is an in-memory INODE. Whenever a file is needed, its INODE is loaded into a minode slot for reference. Since INODEs are unique, only one copy of each INODE can be in memory at any time. In the minode, (dev, ino) identify where the INODE came from, for writing the INODE back to disk if modified. The refCount field records the number of processes that are using the minode. The dirty field indicates whether the INODE has been modified. The mounted flag indicates whether the INODE has been mounted on and, if so, the mntabPtr points to the mount table entry of the mounted file system. The lock field is to ensure that an in-memory INODE can only be accessed by one process at a time, e.g., when modifying the INODE or during a read/write operation. (5). is a table of mounted file systems. For each mounted file system, an entry in the mount table is used to record the mounted file system information. In the in-memory INODE of the mount point, the mounted flag is turned on and the mntabPtr points to the mount table entry. In the mount table entry, mntPointPtr points back to the in-memory INODE of the mount point. As will be shown later, these doubly-linked pointers allow us to cross mount points when traversing the file system tree. In addition, a mount table entry may also contain other information of the mounted file system, such as the device name, superblock, group descriptor, and bitmaps for quick reference. Naturally, if any such information has been changed while in memory, they must be written back to the storage device when the mounted file system is umounted.

8.12.3.2 Source Files in the EOS/FS Directory In the EOS kernel source tree, the FS directory contains files which implement an EXT2 file system. The files are organized as follows: ------------------ Common files of FS -----------------------------type.h : EXT2 data structure types global.c: global variables of FS util.c : common utility functions: getino(), iget(), iput(), search(), etc. allocate_deallocate.c : inodes/blocks management functions Implementation of the file system is divided into three levels. Each level deals with a distinct part of the file system. This makes the implementation process modular and easier to understand. Level-1 implements the basic file system tree. It contains the following files, which implement the indicated functions: -------------------------- Level-1 of FS ---------------------------mkdir_creat.c : make directory, create regular and special file cd_pwd.c : change directory, get CWD path rmdir.c : remove directory link_unlink.c : hard link and unlink files symlink_readlink.c : symbolic link files stat.c : return file information misc1.c : access, chmod, chown, touch, etc. ----------------------------------------------------------------User-level programs which use the level-1 FS functions include mkdir, creat, mknod, rmdir, link, unlink, symlink, rm, ls, cd and pwd, etc. Level-2 implements functions for reading/writing file contents.

332

8 General Purpose Embedded Operating Systems

---------------------- Level-2 of FS ------------------------------open_close_lseek.c : open file for READ|WRITE|APPEND, close file and lseek read.c : read from an opened file descriptor write.c : write to an opened file descriptor opendir_readdir.c : open and read directory dev_switch_table : read/write special files buffer.c : block device I/O buffer management Level-3 implements mount, umount, and file protection. ---------------------- Level-3 of FS -----------------------------mount_umount.c : mount/umount file systems file protection : access permission checking file-locking : lock/unlock files ------------------------------------------------------------------

8.12.4 Implementation of Level-1 FS (1). type.h file: This file contains the data structure types of the EXT2 file system, such as superblock, group descriptor, inode, and directory entry structures. In addition, it also contains the open file table, mount table, pipes and PROC structures, and constants of the EOS kernel. (2). global.c file: This file contains global variables of the EOS kernel. Examples of global variables are

MINODE minode[NMINODES]; MOUNT mounttab[NMOUNT]; OFT oft[NOFT];

// in memory INODEs // mount table // Opened file instance

(3). util.c file: This file contains utility functions of the file system. The most important utility functions are getino(), iget(), and iput(), which are explained in more detail. (3).1. u32 getino(int *dev, char *pathname): getino() returns the inode number of a pathname. While traversing a pathname the device number may change if the pathname crosses mounting point(s). The parameter dev is used to record the final device number. Thus, getino() essentially returns the (dev, ino) of a pathname. The function uses tokenize() to break up pathname into component strings. Then it calls search() to search for the component strings in successive directory minodes. Search() returns the inode number of the component string if it exists, or 0 if not. (3).2. MINODE *iget(in dev, u32 ino): This function returns a pointer to the in-memory INODE of (dev, ino). The returned minode is unique, i.e., only one copy of the INODE exists in kernel memory. In addition, the minode is locked (by the minode’s locking semaphore) for exclusive use until it is either released or unlocked. (3).3. iput(MINODE *mip): This function releases and unlocks a minode pointed by mip. If the process is the last one to use the minode (refCount = 0), the INODE is written back to disk if it is dirty (modified). (3).4. Minodes Locking: Every minode has a lock field, which ensures that a minode can only be accessed by one process at a time, especially when modifying the INODE. Unix uses a busy flag and sleep/wakeup to synchronize processes accessing the same minode. In EOS, each minode has a lock semaphore with an initial value 1. A process is allowed to access a minode only if it holds the semaphore lock. The reason for minodes locking is as follows. Assume that a process Pi needs the inode of (dev, ino), which is not in memory. Pi must load the inode into a minode entry. The minode must be marked as (dev, ino) to prevent other processes from loading the same inode again. While loading the inode from disk, Pi may wait for I/O completion, which switches to another process Pj. If Pj needs exactly the same inode, it would find the needed minode already exists. Without the minode lock, Pj would proceed to use the minode before it is even loaded in yet. With the lock, Pj must wait until the minode is loaded, used, and then released by Pi. In addition, when a process read/write an opened file, it must lock the file’s minode to ensure that each read/write operation is atomic.

8.12 File System

333

(4). allocate_deallocate.c file: This file contains utility functions for allocating and deallocating minodes, inodes, disk blocks and open file table entries. It is noted that both inode and disk block numbers count from 1. Therefore, in the bitmaps bit i represents inode/block number i+1. (5). mount_root.c file: This file contains the mount_root() function, which is called during system initialization to mount the root file system. It reads the superblock of the root device to verify the device is a valid EXT2 file system. It loads the root INODE (ino=2) into a minode and sets the root pointer to the root minode. Then it unlocks the root minode to allow all processes to access the root minode. A mount table entry is allocated to record the mounted root file system. Some key parameters on the root device, such as the starting blocks of the bitmaps and inodes table, are also recorded in the mount table for quick reference. (6). mkdir_creat.c file: This file contains mkdir and creat functions for making directories and creating files, respectively. mkdir and creat are very similar, so they share some common code. Before discussing the algorithms of mkdir and creat, we first show how to insert/delete a DIR entry into/from a parent directory. Each data block of a directory contains DIR entries of the form

|ino rlen nlen name|ino rlen nlen name| ...

where name is a sequence of nlen chars without a terminating NULL byte. Since each DIR entry begins with a u32 inode number, the rec_len of each DIR entry is always a multiple of 4 (for memory alignment). The last entry in a data block spans the remaining block, i.e., its rec_len is from where the entry begins to the end of block. In mkdir and creat, we assume the following: (a) A DIR file has at most 12 direct blocks. This assumption is reasonable because, with 4 KB block size and an average file name of 16 chars, a DIR can contain more than 3000 entries. We may assume that no user would put that many entries in a single directory. (b) Once allocated, a DIR’s data block is kept for reuse even if it becomes empty. With these assumptions, the insertion and deletion algorithms are as follows: /************* Algorithm of Insert_dir_entry ******************/ (1). need_len = 4*((8+name_len+3)/4); // new entry need length (2). for each existing data block do { if (block has only one entry with inode number==0) enter new entry as first entry in block; else{ (3). go to last entry in block; ideal_len = 4*((8+last_entry’s name_len+3)/4); remain = last entry's rec_len - ideal_len; if (remain >= need_len){ trim last entry's rec_len to ideal_len; enter new entry as last entry with rec_len = remain; } (4). else{ allocate a new data block; enter new entry as first entry in the data block; increase DIR's size by BLKSIZE; } } write block to disk; } (5). mark DIR's minode modified for write back; /************* Algorithm of Delete_dir_entry (name) *************/ (1). search DIR's data block(s) for entry by name; (2). if (entry is the only entry in block)

334

8 General Purpose Embedded Operating Systems

clear entry’s inode number to 0; else{ (3). if (entry is last entry in block) add entry's rec_len to predecessor entry's rec_len; (4). else{ // entry in middle of block add entry's rec_len to last entry's rec_len; move all trailing entries left to overlay deleted entry; } } (5). write block back to disk;

Note that in the Delete_dir_entry algorithm, an empty block is not deallocated but kept for reuse. This implies that a DIR’s size will never decrease. Alternative schemes are listed in the Problem section as programming exercises.

8.12.4.1 mkdir-creat-mknod mkdir creates an empty directory with a data block containing the default . and .. entries. The algorithm of mkdir is /********* Algorithm of mkdir *********/ int mkdir(char *pathname) { 1. if (pathname is absolute) dev = root->dev; else dev = PROC's cwd->dev 2. divide pathname into dirname and basename; 3. // dirname must exist and is a DIR: pino = getino(&dev, dirname); pmip = iget(dev, pino); check pmip->INODE is a DIR 4. // basename must not exist in parent DIR: search(pmip, basename) must return 0; 5. call kmkdir(pmip, basename) to create a DIR; kmkdir() consists of 4 major steps: 5-1. allocate an INODE and a disk block: ino = ialloc(dev); blk = balloc(dev); mip = iget(dev,ino); // load INODE into an minode 5-2. initialize mip->INODE as a DIR INODE; mip->INODE.i_block[0] = blk; other i_block[ ] are 0; mark minode modified (dirty); iput(mip); // write INODE back to disk 5-3. make data block 0 of INODE to contain . and .. entries; write to disk block blk. 5-4. enter_child(pmip, ino, basename); which enters (ino, basename) as a DIR entry to the parent INODE; 6. increment parent INODE's links_count by 1 and mark pmip dirty; iput(pmip); }

Creat creates an empty regular file. The algorithm of creat is similar to mkdir. The algorithm of creat is as follows: /****************** Algorithm of creat () ******************/ creat(char * pathname) { This is similar to mkdir() except (1). the INODE.i_mode field is set to REG file type, permission bits set to 0644 for rw-r--r--, and

8.12 File System

}

335

(2). no data block is allocated, so the file size is 0. (3). Do not increment parent INODE’s links_count

It is noted that the above creat algorithm differs from that in Unix/Linux. The new file’s permissions are set to 0644 by default and it does not open the file for WRITE mode and return a file descriptor. In practice, creat is rarely used as a standalone syscall. It is used internally by the kopen() function, which may create a file, open it for WRITE and return a file descriptor. The open operation will be described later. Mknod creates a special file which represents either a char or block device with a device number=(major, minor). The algorithm of mknod is /*********** Algorithm of mknod ***********/ mknod(char *name, int type, int device_number) { This is similar to creat() except (1). the default parent directory is /dev; (2). INODE.i_mode is set to CHAR or BLK file type; (3). INODE.I_block[0] contains device_number=(major, minor); }

8.12.4.2 chdir-getcwd-stat Each process has a Current Working Directory (CWD), which points to the CWD minode of the process in memory. chdir(pathname) changes the CWD of a process to pathname. getcwd() returns the absolute pathname of CWD. stat() returns the status information of a file in a STAT structure. The algorithm of chdir() is /********** Algorithm of chdir ************/ int chdir(char *pathname) { (1). get INODE of pathname into a minode; (2). verify it is a DIR; (3). change running process CWD to minode of pathname; (4). iput(old CWD); return 0 for OK; }

getcwd() is implemented by recursion. Starting from CWD, get the parent INODE into memory. Search the parent INODE’s data block for the name of the current directory and save the name string. Repeat the operation for the parent INODE until the root directory is reached. Construct an absolute pathname of CWD on return. Then copy the absolute pathname to user space. stat(pathname, STAT *st) returns the information of a file in a STAT structure. The algorithm of stat is /********* Algorithm of stat *********/ int stat(char *pathname, STAT *st) // st points to STAT struct { (1). get INODE of pathname into a minode; (2). copy (dev, ino) to (st_dev, st_ino) of STAT struct in Umode (3). copy other fields of INODE to STAT structure in Umode; (4). iput(minode); return 0 for OK; }

8.12.4.3 rmdir As in Unix/Linux, in order to rm a DIR, the directory must be empty, for the following reasons. First, removing a non-empty directory implies removing all the files and subdirectories in the directory. Although it is possible to implement a rrmdir() operation, which recursively removes an entire directory tree, the basic operation is still to remove one directory at a time. Second, a non-empty directory may contain files that are actively in use, e.g., opened for read/write, etc. Removing such a

336

8 General Purpose Embedded Operating Systems

directory is clearly unacceptable. Although it is possible to check whether there are any active files in a directory, it would incur too much overhead in the kernel. The simplest way out is to require that a directory to be removed must be empty. The algorithm of rmdir() is /******** Algorithm of rmdir ********/ rmdir(char *pathname) { 1. get in-memory INODE of pathname: ino = getino(&de, pathname); mip = iget(dev,ino); 2. verify INODE is a DIR (by INODE.i_mode field); minode is not BUSY (refCount = 1); DIR is empty (traverse data blocks for number of entries = 2); 3. /* get parent's ino and inode */ pino = findino(); //get pino from .. entry in INODE.i_block[0] pmip = iget(mip->dev, pino); 4. /* remove name from parent directory */ findname(pmip, ino, name); //find name from parent DIR rm_child(pmip, name); 5. /* deallocate its data blocks and inode */ truncat(mip); // deallocate INODE's data blocks 6. deallocate INODE idalloc(mip->dev, mip->ino); iput(mip); 7. dec parent links_count by 1; mark parent dirty; iput(pmip); 8. return 0 for SUCCESS. }

8.12.4.4 link-unlink The link_unlikc.c file implements link and unlink. link(old_file, new_file) creates a hard link from new_file to old_file. Hard links can only be to regular files, not DIRs, because linking to DIRs may create loops in the file system name space. Hard link files share the same inode. Therefore, they must be on the same device. The algorithm of link is

}

/********* Algorithm of link ********/ link(old_file, new_file) { 1. // verify old_file exists and is not DIR; oino = getino(&odev, old_file); omip = iget(odev, oino); check file type (cannot be DIR). 2. // new_file must not exist yet: nion = get(&ndev, new_file) must return 0; ndev of dirname(newfile) must be same as odev 3. // creat entry in new_parent DIR with same ino pmip -> minode of dirname(new_file); enter_name(pmip, omip->ino, basename(new_file)); 4. omip->INODE.i_links_count++; omip->dirty = 1; iput(omip); iput(pmip);

8.12 File System

337

unlink decrements the file’s links_count by 1 and deletes the file name from its parent DIR. When a file’s links_count reaches 0, the file is truly removed by deallocating its data blocks and inode. The algorithm of unlink() is /*********** Algorithm of unlink *********/ unlink(char *filename) { 1. get filename's minode: ino = getino(&dev, filename); mip = iget(dev, ino); check it's a REG or SLINK file 2. // remove basename from parent DIR rm_child(pmip, mip->ino, basename); pmip->dirty = 1; iput(pmip); 3. // decrement INODE's link_count mip->INODE.i_links_count--; if (mip->INODE.i_links_count > 0){ mip->dirty = 1; iput(mip); } 4. if (!SLINK file) // assume:SLINK file has no data block truncate(mip); // deallocate all data blocks deallocate INODE; iput(mip); }

8.12.4.5 symlink-readlink symlink(old_file, new_file) creates a symbolic link from new_file to old_file. Unlike hard links, symlink can link to anything, including DIRs or files not on the same device. The algorithm of symlink is Algorithm of symlink(old_file, new_file) { 1. check: old_file must exist and new_file not yet exist; 2. create new_file; change new_file to SLINK type; 3. // assume length of old_file name INODE.i_block); }

338

8 General Purpose Embedded Operating Systems

8.12.4.6 Other Level-1 Functions Other level-1 functions include stat, access, chmod, chown, touch, etc. The operations of all such functions are of the same pattern: (1). get the in-memory INODE of a file by ino = getinod(&dev, pathname); mip = iget(dev,ino); (2). get information from the INODE or modify the INODE; (3). if INODE is modified, mark it DIRTY for write back; (4). iput(mip);

8.12.5 Implementation of Level-2 FS Level-2 of FS implements read/write operations of file contents. It consists of the following functions: open, close, lseek, read, write, opendir, and readdir.

8.12.5.1 open-close-lseek The file open_close_lseek.c implements open(), close() and lseek(). The system call int open(char *filename, int flags); opens a file for read or write, where flags = 0|1|2|3|4 for R|W|RW|APPEND, respectively. Alternatively, flags can also be specified as one of the symbolic constants O_RDONLY, O_WRONLY, O_RDWR, which may be bitwise or-ed with file creation flags O_CREAT, O_APPEND, O_TRUNC. These symbolic constants are defined in type.h. On success, open() returns a file descriptor for subsequent read()/write() system calls. The algorithm of open() is

}

/************** Algorithm of open() ***********/ int open(file, flags) { 1. get file's minode: ino = getino(&dev, file); if (ino==0 && O_CREAT){ creat(file); ino = getino(&dev, file); } mip = iget(dev, ino); 2. check file INODE's access permission; for non-special file, check for incompatible open modes; 3. allocate an openTable entry; initialize openTable entries; set byteOffset = 0 for R|W|RW; set to file size for APPEND mode; 4. Search for a FREE fd[ ] entry with the lowest index fd in PROC; let fd[fd]point to the openTable entry; 5. unlock minode; return fd as the file descriptor;

Figure 8.6 shows the data structure created by open(). In the figure, (1) is the PROC structure of the process that calls open(). The returned file descriptor, fd, is the index of the fd[ ] array in the PROC structure. The contents of fd[fd] points to a OFT, which points to the minode of the file. The OFT’s refCount represents the number of processes which share the same instance of an opened file. When a process first opens a file, it sets the refCount in the OFT to 1. When a process forks, it copies all opened file descriptors to the child process, so that the child process shares all the opened file descriptors with the

8.12 File System

339

Fig. 8.6 Data Structures of open()

parent, which increments the refCount of every shared OFT by 1. When a process closes a file descriptor, it decrements the OFT’s refCount by 1 and clears its fd[] entry to 0. When an OFT’s refCount reaches 0, it calls iput() to dispose of the minode and deallocates the OFT. The OFT’s offset is a conceptual pointer to the current byte position in the file for read/write. It is initialized to 0 for R|W|RW mode or to file size for APPEND mode. In EOS, lseek(fd, position) sets the offset in the OFT of an opened file descriptor to the byte position relative to the beginning of the file. Once set, the next read/write begins from the current offset position. The algorithm of lseek () is trivial. For files opened for READ, it only checks the position value to ensure it’s within the bounds of [0, file_size]. If fd is a regular file opened for WRITE, lseek allows the byte offset to go beyond the current file size but it does not allocate any disk block for the file. Disk blocks will be allocated when data are actually written to the file. The algorithm of closing a file descriptor is

}

/************* Algorithm of close() *****************/ int close(int fd) { (1). check fd is a valid opened file descriptor; (2). if (PROC's fd[fd] != 0){ (3). if (openTable's mode == READ/WRITE PIPE) return close_pipe(fd); // close pipe descriptor; (4). if (--refCount == 0){ // if last process using this OFT lock(minodeptr); iput(minode); // release minode } } (5). clear fd[fd] = 0; // clear fd[fd] to 0 (6). return SUCCESS;

8.12.5.2 Read Regular Files The system call int read(int fd, char buf[ ], int nbytes) reads nbytes from an opened file descriptor into a buffer area in user space. read() invokes kread() in kernel, which implements the read system call. The algorithm of kread() is /**************** Algorithm of kread() in kernel ****************/ int kread(int fd, char buf[ ], int nbytes, int space) //space=K|U { (1). validate fd; ensure oft is opened for READ or RW; (2). if (oft.mode = READ_PIPE)

340

8 General Purpose Embedded Operating Systems

return read_pipe(fd, buf, nbytes); (3). if (minode.INODE is a special file) return read_special(device,buf,nbytes); (4). (regular file): return read_file(fd, buf, nbytes, space);

}

/*************** Algorithm of read regular files ****************/ int read_file(int fd, char *buf, int nbytes, int space) { (1). lock minode; (2). count = 0; avil = fileSize - offset; (3). while (nbytes){ compute logical block: lbk = offset / BLKSIZE; start byte in block: start = offset % BLKSIZE; (4). convert logical block number, lbk, to physical block number, blk, through INODE.i_block[ ] array; (5). read_block(dev, blk, kbuf); // read blk into kbuf[BLKSIZE]; char *cp = kbuf + start; remain = BLKSIZE - start; (6) while (remain){// copy bytes from kbuf[ ] to buf[ ] (space)? put_ubyte(*cp++, *buf++) : *buf++ = *cp++; offset++; count++; // inc offset, count; remain--; avil--; nbytes--; // dec remain, avil, nbytes; if (nbytes==0 || avail==0) break; } } (7). unlock minode; (8). return count; }

The algorithm of read_file() can be best explained in terms of Fig. 8.7. Assume that fd is opened for READ. The offset in the OFT points to the current byte position in the file from where we wish to read nbytes. To the kernel, a file is just a sequence of (logically) contiguous bytes, numbered from 0 to fileSize-1. As Fig. 8.7 shows, the current byte position, offset, falls in a logical block, lbk = offset / BLKSIZE, the byte to start read is start = offset % BLKSIZE and the number of bytes

Fig. 8.7 Data Structures for Read_file()

8.12 File System

341

remaining in the logical block is remain = BLKSIZE - start. At this moment, the file has avail = fileSize - offset bytes still available for read. These numbers are used in the read_file algorithm. In EOS, block size is 4 KB and files have at most double indirect blocks. The algorithm of converting logical block number to physical block number for read is /* Algorithm of Converting Logical Block to Physical Block */ u32 map(INODE, lbk){ // convert lbk to blk if (lbk < 12) // direct blocks blk = INODE.i_block[lbk]; else if (12 offset / BLOCK_SIZE; compute start byte: start = oftp->offset % BLOCK_SIZE;

342

8 General Purpose Embedded Operating Systems

Fig. 8.8 Data Structures for write_file() (4). (5). (6)

(7).

convert lbk to physical block number, blk; read_block(dev, blk, kbuf); //read blk into kbuf[BLKSIZE]; char *cp = kbuf + start; remain = BLKSIZE - start; while (remain){ // copy bytes from kbuf[ ] to ubuf[ ] put_ubyte(*cp++, *ubuf++); offset++; count++; // inc offset, count; remain --; nbytes--; // dec remain, nbytes; if (offset > fileSize) fileSize++; // inc file size if (nbytes dev != dev){ if (bp->dev >= 0) out_devlist(bp); bp->dev = dev; enter_devlist(bp); } bp->dev = dev; bp->blk = blk; bp->valid = 0; bp->async = 0; bp->dirty = 0; return bp;

} } int brelse(struct buf *bp) { if (bp->lock.value < 0){ // bp has waiter V(&bp->lock); return; } if (freebuf.value < 0 && bp->dirty){ awrite(bp); return; } enter_freelist(bp); // enter b pint bfreeList bp->busy = 0; // bp non longer busy V(&bp->lock); // unlock bp V(&freebuf); // V(freebuf) }

Since both processes and the SDC interrupt handler access and manipulate the free buffer list and device I/O queue, interrupts are disabled when processes operate on these data structures to prevent any race conditions.

8.14.1 Performance of the I/O Buffer Cache With I/O buffering, the hit ratio of the I/O buffer cache is about 40% when the system starts. During system operation, the hit ratio is constantly above 60%. This attests to the effectiveness of the I/O buffering scheme.

8.15 User Interface

349

8.15 User Interface All user commands are ELF executable files in the /bin directory (on the root device). From the EOS system point of view, the most important user mode programs are init, login, and sh, which are necessary to start up the EOS system. In the following, we shall explain the roles and algorithms of these programs.

8.15.1 The INIT Program When EOS starts, the initial process P0 is handcrafted. P0 creates a child P1 by loading the /bin/init file as its Umode image. When P1 runs, it executes the init program in user mode. Henceforth, P1 plays the same role as the INIT process in Unix/ Linux. A simple init program, which forks only one login process on the system console, is shown below. The reader may modify it to fork several login processes, each on a different terminal. /********************* init.c file ****************/ #include "ucode.c" int console; int parent() // P1’s code { int pid, status; while(1){ printf("INIT : wait for ZOMBIE child\n"); pid = wait(&status); if (pid==console){ // if console login process died printf("INIT: forks a new console login\n"); console = fork(); // fork another one if (console) continue; else exec("login /dev/tty0"); // new console login process } printf("INIT: I just buried an orphan child proc %d\n", pid); } } main() { int in, out; // file descriptors for terminal I/O in = open("/dev/tty0", O_RDONLY); // file descriptor 0 out = open("/dev/tty0", O_WRONLY); // for display to console printf("INIT : fork a login proc on console\n"); console = fork(); if (console) // parent parent(); else // child: exec to login on tty0 exec("login /dev/tty0"); }

8.15.2 The Login Program All login processes execute the same login program, each on a different terminal, for users to login. The algorithm of the login program is

350

8 General Purpose Embedded Operating Systems

/****************** Algorithm of login *******************/ // login.c : Upon entry, argv[0]=login, argv[1]=/dev/ttyX #include "ucode.c" int in, out, err; char name[128],password[128] main(int argc, char *argv[]) { (1). close file descriptors 0,1 inherited from INIT. (2). open argv[1] 3 times as in(0), out(1), err(2). (3). settty(argv[1]); // set tty name string in PROC.tty (4). open /etc/passwd file for READ; while(1){ (5). printf("login:"); gets(name); printf("password:"); gets(password); for each line in /etc/passwd file do{ tokenize user account line; (6). if (user has a valid account){ (7). change uid, gid to user's uid, gid; // chuid() change cwd to user's home DIR // chdir() close opened /etc/passwd file // close() (8). exec to program in user account // exec() } } printf("login failed, try again\n"); } }

8.15.3 The sh Program After login, the user process typically executes the command interpreter sh, which gets command lines from the user and executes the commands. For each command line, if the command is non-trivial, i.e., not cd or exit, sh forks a child process to execute the command line and waits for the child to terminate. For simple commands, the first token of a command line is an executable file in the /bin directory. A command line may contain I/O redirection symbols. If so, the child sh handles I/O redirections first. Then it uses exec to change image to execute the command file. When the child process terminates, it wakes up the parent sh, which prompts for another command line. If a command line contains a pipe symbol, such as cmd1 | cmd2, the child sh handles the pipe by the following do_pipe algorithm: /***************** do_pipe Algorithm **************/ int pid, pd[2]; pipe(pd); // create a pipe: pd[0]=READ, pd[1]=WRITE pid = fork(); // fork a child to share the pipe if (pid){ // parent: as pipe READER close(pd[1]); // close pipe WRITE end dup2(pd[0], 0); // redirect stdin to pipe READ end exec(cmd2); } else{ // child : as pipe WRITER close(pd[0]); // close pipe READ end dup2(pd[1], 1); // redirect stdout to pipe WRITE end exec(cmd1); }

Multiple pipes are handled recursively, from right to left.

8.16 Demonstration of EOS

351

8.16 Demonstration of EOS 8.16.1 EOS Startup Figure 8.9 shows the startup screen of the EOS system. After booting up, it first initializes the LCD display, configures vectored interrupts, and initializes device drivers. Then it initializes the EOS kernel to run the initial process P0. P0 builds page directories, page tables, and switches page directory to use dynamic 2-level paging. P0 creates the INIT process P1 with /bin/ init as Umode image. Then it switches process to run P1 in user mode. P1 forks a login process P2 on the console and another login process P3 on a serial terminal. When creating a new process, it shows the dynamically allocated page frames of the process image. When a process terminates, it releases the allocated page frames for reuse. Then P1 executes in a loop, waiting for any ZOMBIE child. Each login process opens its own terminal special file as stdin (0), stdout (1), and stderr (2) for terminal I/O. Then each login process displays a login: prompt on its terminal and waits for a user to login. When a user tries to login, the login process validates the user by checking the user account in the /etc/passwd file. After a user login, the login process becomes the user process by acquiring its uid and changing directory to the user’s home directory. Then the user process changes image to execute the command interpreter sh, which prompts for user commands and executes the commands.

8.16.2 Command Processing in EOS Figure 8.10 shows the processing sequence of the command line “cat f1 | grep line” by the EOS sh process (P2). For any non-trivial command line, sh forks a child process to execute the command and waits for the child to terminate. Since the command line has a pipe symbol, the child sh (P8) creates a pipe and forks a child (P9) to share the pipe. Then the child sh (P8) reads from the pipe and executes the command grep. The child sh (P9) writes to the pipe and executes command cat. P8 and P9 are connected by a pipe and run concurrently. When the pipe reader (P8) terminates, it sends the child P9 as an orphan to the INIT process P1, which wakes up to free the orphan process P9.

Fig. 8.9 Startup Screen of EOS

352

8 General Purpose Embedded Operating Systems

Fig. 8.10 Command Processing by the EOS sh

8.16.3 Signal and Exception Handling in EOS In EOS, exceptions are handled by the unified framework of signal processing. We demonstrate exception and signal processing in EOS by the following examples.

8.16.3.1 Interval Timer and Alarm Signal Catcher In the USER directory, the itimer.c program demonstrates interval timer, alarm signal, and alarm signal catcher. /******************** itimer.c file ************************/ void catcher(int sig) { printf("proc %d in catcher: sig=%d\n", getpid(), sig); } main(int argc, char *argv[]) { int t = 1; if (argc>1) t = atoi(argv[1]); // timer interval printf("install catcher? [y|n]”); if (getc()==’y’) signal(14, catcher); // install catcher() for SIGALRM(14) itimer(t); // set interval timer in kernel printf(“proc %d looping until SIGALRM\n”, getpid()); while(1); // looping until killed by a signal }

In the itimer.c program, it first lets the user to either install or not to install a catcher for the SIGALRM(14) signal. Then, it sets an interval timer of t seconds and executes a while(1) loop. When the interval timer expires, the timer interrupt handler sends a SIGALRM(14) signal to the process. If the user did not install a signal 14 catcher, the process will die by the signal. Otherwise, it will execute the catcher once and continue to loop. In the latter case, the process can be killed by other means, e.g., by the Control_C key or by a kill pid command from another process. The reader may modify the catcher() function to install the catcher again. Recompile and run the system to observe the effect.

8.17 Summary

353

8.16.3.2 Exceptions Handling in EOS In addition to timer signals, we also demonstrate unified exception and signal processing by the following user mode programs. Each program can be run as a user command. Data.c: This program demonstrates data_abort exception handling. In the data_abort handler, we first read and display the MMU’s fault status and address registers to show the MMU status and invalid VA that caused the exception. If the exception occurred in Kmode, it must be due to bugs in the kernel code. In this case, there is nothing the kernel can do. So it prints a PANIC message and stops. If the exception occurred in Umode, the kernel converts it to a signal, SIGSEG(11) for segmentation fault, and sends the signal to the process. If the user did not install a catcher for signal 11, the process will die by the signal. If the user has installed a signal 11 catcher, the process will execute the catcher function in Umode when it gets a number 11 signal. The catcher function uses a long jump to bypass the faulty code, allowing the process to terminate normally. Prefetch.c: This program demonstrates prefetch_abort exception handling. The C code attempts to execute the in-line assembly code asm(“bl 0x1000”); which would cause a prefetch_abort at the next PC address 0x1004 because it is outside of the Umode VA space. In this case, the process gets a SIGSEG(11) signal also, which is handled the same way as a data_abort exception. Undef.c: This program demonstrates undefined exception handling. The C code attempts to execute asm(“mcr p14, 0, r1, c8, c7, 0”) ; which would cause an undef_abort because the coprocessor p14 does not exist. In this case, the process gets an illegal instruction signal SIGILL(4). If the user has not installed a signal 4 catcher, the process will die by the signal. Divide by zero: Most ARM processors do not have divide instructions. Integer divisions in ARM are implemented by idiv and udiv functions in the aeabi library, which check for divide by zero errors. When a divide by zero is detected, it branches to the __aeabi_idiv0 function. The user may use the link register to identify the offending instruction and take remedial actions. In the Versatilepb VM, divide by zero simply returns the largest integer value. Although it is not possible to generate divide by zero exceptions on the ARM Versatilepb VM, the exception handling scheme of EOS should be applicable to other ARM processors.

8.17 Summary This chapter presents a fully functional general purpose embedded OS, denoted by EOS. The following is a brief summary of the organization and capabilities of the EOS system: 1. System Images: Bootable kernel image and user mode executables are generated from a source tree by ARM toolchain (of Ubuntu 15.0 Linux) and reside in an EXT2 file system on a SDC partition. The SDC contains stage-1 and stage-2 booters for booting up the kernel image from the SDC partition. After booting up, the kernel mounts the SDC partition as the root file system. 2. Processes: The system supports NPROC=64 processes and NTHRED=128 threads per process, both can be increased if needed. Each process (except the idle process P0) runs in either kernel mode or user mode. Memory management of process images is by 2-level dynamic paging. Process scheduling is by dynamic priority and time-slice. It supports interprocess communication by pipes and messages passing. The EOS kernel supports fork, exec, vfork, threads, exit, and wait for process management. 3. It contains device drivers for the most commonly used I/O devices, e.g., LCD display, timer, keyboard, UART, and SDC. It implements a fully Linux compatible EXT2 file system with I/O buffering of SDC read/write to improve efficiency and performance. 4. It supports multi-user logins to the console and UART terminals. The user interface sh supports executions of simple commands with I/O re-directions, as well as multiple commands connected by pipes. 5. It provides timer service functions, and it unifies exceptions handling with signal processing, allowing users to install signal catchers to handle exceptions in user mode.

8 General Purpose Embedded Operating Systems

354

6. The system runs on a variety of ARM virtual machines under QEMU, mainly for convenience. It should also run on real ARM-based system boards that support suitable I/O devices. Porting EOS to some popular ARM-based systems, e.g., Raspberry PI-2 is currently underway. The plan is to make it available for readers to download as soon as it is ready.

Problems 1. In EOS, the initial process P0’s kpgdir is at 32 KB. Each process has its own pgdir in PROC.res. With a 4 MB Umode image size, entries 2048-2051 of each PROC’s pgdir define the Umode page tables of the process. When switch task, we use switchPgdir((int)running->res->pgdir); to switch to the pgdir of the next running process. Modify the scheduler() function (in kernel.c file) as follows:

int *kpgdir = (int *)0x8000; // Kmode pgdir at 32KB if (running != old){ // truly switch process for (i=0; ires->pgdir[2048+i]; switchPgdir((int)kpgdir); // use the same kpgdir at 32KB }

( 1). Verify that the modified scheduler() still works, and explain WHY? (2). Extend this scheme to use only one kpgdir for ALL processes. (3). Discuss the advantages and disadvantages of using one pgdir per process vs. one pgdir for all processes. 2. The send/recv operations in EOS use the synchronous protocol, which is blocking and may lead to deadlock due to, e.g., no free message buffers. Redesign the send/recv operations to prevent deadlocks. 3. Modify the Delete_dir_entry algorithm as follows. If the deleted entry is the only entry in a data block, deallocate the data block and compact the DIR INODE’s data block array. Modify the Insert_dir_entry algorithm accordingly and implement the new algorithms in the EOS file system. 4. In the read_file algorithm of Sect. 8.12.5.2, data can be read from a file to either user space or kernel space. (1). Justify why it is necessary to read data to kernel space. (2). In the inner loop of the read_file algorithm, data are transferred one byte at a time for clarity. Optimize the inner loop by transferring chunks of data at a time (HINT: minimum of data remaining in the block and available data in the file). 5. Modify the write-file algorithm in Sect. 8.12.5.3 to allow (1). Write data to kernel space, and (2). Optimize data transfer by copying chunks of data. 6. Assume: dir1 and dir2 are directories. cpd2d dir1 dir2 recursively copies dir1 into dir2. (1). Write C code for the cpd2d program. (2). What if dir1 contains dir2, e.g., cpd2d /a/b /a/b/c/d? (3). How to determine whether dir1 contains dir2? 7. Currently, EOS does not yet support file streams. Implement library I/O functions to support file streams in user space.

References 1. Android: https://en.wikipedia.org/wiki/Android_operating_system, 2016 2. ARM Versatilepb: ARM 926EJ-S, 2016: Versatile Application Baseboard for ARM926EJ-S User Guide, Arm information Center, 2016 3. Cao, M., Bhattacharya, S, Tso, T., "Ext4: The Next Generation of Ext2/3 File system", IBM Linux Technology Center, 2007. 4. Card, R., Theodore Ts'o,T., Stephen T weedie,S., “Design and Implementation of the Second Extended Filesystem”, web.mit.edu/tytso/www/ linux/ext2intro.html, 1995

References 5. EXT2: www.nongnu.org/ext2-doc/ext2.html, 2001 6. EXT3: jamesthornton.com/hotlist/linux-filesystems/ext3-journal, 2015 7. FreeBSD: FreeBSD/ARM Project, https://www.freebsd.org/platforms/arm.html, 2016 8. Raspberry_Pi: https://www.raspberrypi.org/products/raspberry-pi-2-model-b, 2016 9. Sevy, J., “Porting NetBSD to a new ARM SoC”, http://www.netbsd.org/docs/kernel/porting_netbsd_arm_soc.html, 2016 10. Wang, K.C., “Design and Implementation of the MTX Operating System”, Springer International Publishing AG, 2015

355

Chapter 9 Multiprocessing in Embedded Systems

9.1 Multiprocessing A multiprocessor system consists of a multiple number of processors, including multi-core processors, which share main memory and I/O devices. If the shared main memory is the only memory in the system, it is called a Uniform Memory Access (UMA) system. If, in addition to the shared memory, each processor also has private local memory, it is called a Non-uniform Memory Access (NUMA) system. If the roles of the processors are not the same, e.g., only some of the processors may execute kernel code while others may not, it is called an Asymmetric MP (ASMP) system. If all the processors are functionally identical, it is called a Symmetric MP (SMP) system. With the current multicore processor technology, SMP has become virtually synonymous with MP. In this chapter, we shall discuss SMP operating systems on ARM-based multiprocessor systems.

9.2 SMP System Requirements A SMP system requires much more than just a multiple numbers of processors or processor cores. In order to support SMP, the system architecture must have additional capabilities. SMP is not new. In the PC world, it started in the early 1990s by Intel on their x86-based multicore processors. Intel’s Multiprocessor Specification [8] defines SMP-compliant systems as PC/AT compatible systems with the following capabilities: (1). Cache Coherence: In a SMP-compliant system, many CPUs or cores share memory. In order to speed up memory access, the system typically employs several levels of cache memory, e.g., L1 cache inside each CPU and L2 cache in between the CPUs and main memory, etc. The memory subsystem must implement a cache coherence protocol to ensure consistency of the cache memories. (2). Support interrupts routing and inter-processor interrupts. In a SMP-compliant system, interrupts from I/O devices can be routed to different processors to balance the interrupt processing load. Processors can interrupt each other by InterProcessor Interrupts (IPIs) for communication and synchronization. In a SMP-compliant system, these are provided by a set of Advanced Programmable Interrupt Controllers (APICs). A SMP-compliant system usually has a system-wide IOAPIC and a set of local APICs of the individual processors. Together, the APICs implement an inter-processor communication protocol, which supports interrupts routing and IPIs. (3). An extended BIOS, which detects the system configuration and builds SMP data structures for the operating system to use. (4). When a SMP-compliant system starts, one of the processors is designated as the Boot Processor (BSP), which executes the boot code to start up the system. All other processors are called Application Processors (APs), which are held in the idle state initially but can receive IPIs from the BSP to start up. After booting up, all processors are functionally identical. In comparison, SMP on ARM-based systems is relatively new and still evolving. It may be enlightening to compare ARM’s approach to SMP with that of Intel. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. C. Wang, Embedded and Real-Time Operating Systems, https://doi.org/10.1007/978-3-031-28701-5_9

357

358

9 Multiprocessing in Embedded Systems

(1). Cache Coherence: all ARM MPcore-based systems include a Snoop Control Unit (SCU), which implements a cache memory coherence protocol to ensure the consistency of the cache memories. Due to the internal pipelines of the ARM CPUs, ARM introduced several kinds of barriers to synchronize both memory access and instruction executions. (2). Interrupts Routing: Similar to the APICs of Intel, all ARM MPcore systems use a Generic Interrupt Controller (GIC) for interrupts routing. The GIC consists of two parts: an interrupt distributor and an interface to CPUs. The GIC’s interrupt distributor receives external interrupt requests and sends them to the CPUs. Each CPU has a CPU interface, which can be programmed to either allow or disallow interrupts to be sent to the CPU. Each interrupt has an interrupt ID number, where lower numbers have higher priorities. Interrupts routing may be further controlled by the CPU’s interrupt priority mask register. For I/O devices, the interrupt ID numbers correspond roughly to their traditional vector numbers. (3). Inter-processor Interrupts: The sixteen special GIC interrupts of ID numbers 0 to 15 are reserved for Software Generated Interrupts (SGIs), which correspond to the IPIs in the Intel SMP architecture. In an ARM MPcore-based system, a CPU may issue SGIs to wakeup other CPUs from the WFI state, causing them to take actions, e.g., to execute an interrupt handler, as a means of inter-processor communication. In the ARM Cortex-A9 MPcore, the SGI register (GICD_SGIR) is at the offset 0x1F00 from the Peripheral Base Address. The SGI register contents are ---------------- SGI Register Contents ------------------bits 25-24 : targetListfilter: 00 = to a specified CPU 01 = to all other CPUs 10 = to the requesting CPU bits 23-16 : CPUtargetList; each bit for a CPU interface bit 15 : for CPUs with security extension only bits 3-0 : Interrupt ID (0-15) ---------------------------------------------------------

To issue a SGI, simply write an appropriate value to the SGI register. For example, the following code segment sends a SGI of interrupt ID to a specific target CPU: int send_sgi(int intID, int targetCPU, int filter) { int *sgi_reg = (int *)(CGI_BASE + 0x1F00); *sgi_reg = (filterclock = 8; }

(4). timer.c file of C9.1 /************ timer.c file of C9.1 ***********/ #define CTL_ENABLE ( 0x00000080 ) #define CTL_MODE ( 0x00000040 ) #define CTL_INTR ( 0x00000020 ) #define CTL_PRESCALE_1 ( 0x00000008 ) #define CTL_PRESCALE_2 ( 0x00000004 ) #define CTL_CTRLEN ( 0x00000002 )

1 1

0 0

364 #define CTL_ONESHOT #define DIVISOR 64

9 Multiprocessing in Embedded Systems ( 0x00000001 )

typedef struct timer{ u32 LOAD; // Load Register, TimerXLoad 0x00 u32 VALUE; // Current Value Register, TimerXValue 0x04 u32 CONTROL;// Control Register, TimerXControl 0x08 u32 INTCLR; // Interrupt Clear Register, TimerXIntClr 0x0C u32 RIS; // Raw Interrupt Status Register, TimerXRIS 0x10 u32 MIS; // Masked Interrupt Status Register,TimerXMIS 0x14 u32 BGLOAD; // Background Load Register, TimerXBGLoad 0x18 u32 *base; }TIMER; TIMER *tp[4]; // 4 timers; 2 timers per unit; at 0x00 and 0x20 int kprintf(char *fmt, ...); extern int row, col; int kpchar(char, int, int); int unkpchar(char, int, int); char clock[16]; char *blanks = " : : "; int hh, mm, ss; u32 tick = 0; void timer0_handler() { int i; tick++; if (tick >= DIVISOR){ tick = 0; ss++; if (ss==60){ ss = 0; mm++; if (mm==60){ mm = 0; hh++; } } } if (tick==0){ // every second: display a wall clock color = GREEN; for (i=0; iRIS = 0x0; tp[i]->MIS = 0x0; tp[i]->LOAD = 0x100; // 0x62=|011- 0010=|NOTEn|Pe|IntE|-|scal=00|1=32- bit|0=wrap| tp[i]->CONTROL = 0x62; tp[i]->BGLOAD = 0xF0000/DIVISOR; } strcpy(clock, "00:00:00"); hh = mm = ss = 0; } void timer_start(int n) // timer_start(0), 1, etc. { TIMER *tpr; printf("timer_start\n"); tpr = tp[n]; tpr->CONTROL |= 0x80; // set enable bit 7 } int timer_clearInterrupt(int n) // timer_start(0), 1, etc. { TIMER *tpr = tp[n]; tpr->INTCLR = 0xFFFFFFFF; }

(5). vid.c file of C9.1: same as in previous chapters except LCD base=0x10120000 (6). t.c file of C9.1 /************* t.c file of C9.1: ConfigGIC *************/ #define GIC_BASE 0x1F000000 #include #include #include #include

"uart.c" "kbd.c" "timer.c" "vid.c"

int copy_vectors(){ // same as before } int config_int(int intID, int targetCPU) { int reg_offset, index, address; char priority = 0x80; // set intID BIT in int ID register reg_offset = (intID>>3) & 0xFFFFFFFC; index = intID & 0x1F; address = (GIC_BASE + 0x1100) + reg_offset; *(int *)address |= (1 5) 2 // same as intID & 0x3

Then the following lines of code address = (GIC_BASE + 0x1800) + reg_offset + index; *(char *)address = (char)(1 0xF. The reader may try the following experiment. If we set the CPU’s priority mask register to a value svc_stack (16KB area in t.ld) mov r1, r11 // r1 = cpuid add r1, r1, #1 // cpuid++ lsl r2, r1, #12 // (cpuid+1)* 4096 add r0, r0, r2 mov sp, r0 // SVC sp=svc_stack[cpuid] high end // go in IRQ mode with interrupts OFF MSR cpsr, #0x92 // set IRQ stack LDR r0, =irq_stack // r0->irq_stack (16KB area in t.ld) mov r1, r11 add r1, r1, #1 lsl r2, r1, #12 // (cpuid+1) * 4096 add r0, r0, r2 mov sp, r0 // IRQ sp=irq_stack[cpuid] high end // go back to SVC mode with IRQ ON MSR cpsr, #0x13 cmp r11, #0 bne APs // only CPU0 copy vectors, call main() BL copy_vectors // copy vectors to address 0 BL main // CPU0 call main() in C B . APs: // each AP call APstart() in C adr r0, APaddr ldr pc, [r0] APaddr: .word APstart irq_handler: sub lr, lr, #4 stmfd sp!, {r0-r3, r12, lr} bl irq_chandler // call irq_chandler() in C ldmfd sp!, {r0-r3, r12, pc}^ vectors_start: LDR PC, reset_handler_addr LDR PC, undef_handler_addr LDR PC, swi_handler_addr LDR PC, prefetch_abort_handler_addr LDR PC, data_abort_handler_addr B .

9 Multiprocessing in Embedded Systems

9.7 ARM SMP Startup Examples

371

LDR PC, irq_handler_addr LDR PC, fiq_handler_addr reset_handler_addr: .word undef_handler_addr: .word swi_handler_addr: .word prefetch_abort_handler_addr: .word data_abort_handler_addr: .word irq_handler_addr: .word fiq_handler_addr: .word vectors_end: // unused dummy exception handlers undef_handler: swi_handler: prefetch_abort_handler: data_abort_handler: fiq_handler: B . enable_scu: // MRC p15, 4, r0, c15, c0, 0 // LDR r1, [r0] // ORR r1, r1, #0x1 // STR r1, [r0] // BX lr get_cpuid: MRC p15, 0, r0, c0, c0, 5 // AND r0, r0, #0x03 // MOV pc, lr // ----------------- end of ts.s

reset_handler undef_handler swi_handler prefetch_abort_handler data_abort_handler irq_handler fiq_handler

enable the SCU Read peripheral base address read SCU Control Register set bit0 (Enable bit) to 1 write back modified value

read CPU ID register mask in the CPUID field file -------------------

Explanations of the ts.s file: The system is compile-linked to a t.bin executable image with starting address 0x10000. In the linker script file, t.ld, it specifies svc_stack as the beginning address of a 16 KB area, which will be the SVC mode stacks of the 4 CPUs. Similarly, the CPUs will use the 16 KB areas at irq_stack as their IRQ mode stacks. Since we do not intend to handle any other kinds of exceptions, the ABT and UND mode stacks, as well as exception handlers are omitted. The image is run under QEMU as qemu-system-arm –m realview-pbx-a9 –smp 4 –m 512M –kernel t.bin –serial mon:stdio For clarity, the machine type (–m realview-pbx-a9) and number of CPUs (-smp 4) are shown in bold face. When the system starts, the executable image is loaded to 0x10000 by QEMU and runs from there. When CPU0 starts to execute reset_handler, QEMU has started up the other CPUs (CPU1 to CPU3) also but they are held in the WFI state. So, only CPU0 executes the reset_handler at first. It sets up the SVC and IRQ mode stacks, copies the vector table to address 0, and then calls main() in C. CPU0 first initializes the system. Then it writes 0x10000 as the start address of the APs to the SYS_ FLAGSSET register (at 0x10000030), and issues a SGI to all the APs, causing them to execute from the same reset_handler code at 0x10000. However, based on their CPU ID number, each AP only sets up its own SVC and IRQ mode stacks and calls APstart() in C, bypassing the system initialization code, such as copying vectors, that are already done by CPU0. (2). t.c file: in main(), CPU0 enables the SCU and configures the GIC to route interrupts. For simplicity, the system only supports UART0 (0x10009000) and timer0 (0x10011000). In config_gic(), timer interrupts are routed to CPU0 and UART interrupts are routed to CPU1, which are quite arbitrary. If desired, the reader may verify that they can be routed to any CPU. Then CPU0 initializes the device drivers and starts the timer. Then it writes the AP start address to 0x10000030 and issues SGI to activate the APs. After that, all CPUs execute in WFI loops, but CPU0 and CPU1 can respond to and handle timer and UART interrupts.

372 /******************** t.c file of C9.2 ***************************/ #include “type.h” #define GIC_BASE 0x1F000000 int *apAddr = (int *)0x10000030; // SYS_FLAGSSET register #include "uart.c" #include "timer.c" int copy_vectors(){ // same as before } int APstart() // AP startup code { int cpuid = get_cpuid(); uprintf("CPU%d start: ", cpuid); uprintf("CPU%d enter WFI state\n", cpuid); while(1){ asm("WFI"); } } int config_int(int intID, int targetCPU) { int reg_offset, index, address; char priority = 0x80; reg_offset = (intID>>3) & 0xFFFFFFFC; index = intID & 0x1F; address = (GIC_BASE + 0x1100) + reg_offset; *(int *)address = (1 LOAD = 0x0;

373

374

9 Multiprocessing in Embedded Systems

tp->VALUE= 0xFFFFFFFF; tp->RIS = 0x0; tp->MIS = 0x0; tp->LOAD = 0x100; tp->CONTROL = 0x62; // |En|Per|Int|-|Sca|00|32B|Wrap|=01100010 tp->BGLOAD = 0xF0000/DIVISOR;

} int timer_start() { TIMER *tpr = tp; tpr->CONTROL |= 0x80; } int timer_clearInterrupt() { TIMER *tpr = tp; tpr->INTCLR = 0xFFFFFFFF;

}

// set enable bit 7

// write to INTCLR register

(4). uart.c file: this is the UART driver. The driver is ready to 4 UARTs but it only uses UART0. For the sake of brevity, the uprintf() code is not shown. /************ uart.c file of C9.2 ***********/ #define UART0_BASE 0x10009000 typedef struct uart{ u32 DR; // data reg u32 DSR; u32 pad1[4]; // 8+16=24 bytes to FR register u32 FR; // flag reg at 0x18 u32 pad2[7]; u32 IMSC; // at offset 0x38 }UART; UART *upp[4]; // 4 UART pointers UART *up; // active UART pointer int uprintf(char *fmt, ...){ // same as before } int uart_handler(int ID) { char c; int cpuid = get_cpuid(); up = upp[ID]; c = up->DR; uprintf("UART%d interrupt on CPU%d : c=%c\n", ID, cpuid, c); } int uart_init() { int i; for (i=0; iIMSC |= (1priority > running->priority){ if (intsest==0)// if not in interrupt handler tswitch(); // preempt running task immediately else{ swflag = 1; // defer preemption until end of interrupts } interrupt_handler_exit // switch task if swflag set & end of IRQs { if (!intnest && swflag) tswtich() in SVC mode; } }

In reschedule(), if the current running task is no longer the highest priority and execution is not inside any interrupt handler, it preempts the current running task immediately. If execution is still inside an interrupt handler, it sets a switch task flag, which defers task switch until all nested interrupt processing have ended.

10.7.4 Task Communication in UP_RTOS The UP_RTOS kernel provides three kinds of mechanisms for inter-task communication.

10.7.4.1 Shared Memory When the system starts, it initializes a fixed number (32) of 64 KB memory regions, e.g., from 32 to 34 MB, for inter-task communication through shared memory. Each shared memory region is represented by a structure: #define NPID NPROC/sizeof(int) struct shmem{ int procID[NPID]; // bit vector for NPROC task IDs MUTEX mutex; // mutex lock for shared region char *address; // start address of memory region }shmem[32];

When the system starts, it initializes the shmem structures to contain procID = {0}; // no task using the shmem yet mutex = UNLOCKED; // for exclusive access to the shared region address = pointer to a unique 64KB memory area

10.7 Uniprocessor RTOS (UP_RTOS)

443

To use a shared memory, a task must first attach itself to a shmem structure by int shmem_attach(struct shmem *mp);

shmem_attach() records the task ID bit in procID (for accessibility checking) and returns the number of tasks c urrently attached to the shared memory. After attaching to a shared memory, tasks may use shmem_read( struct shmem *mp, char buf[ ], int nbytes); shmem_write(struct shmem *mp, char buf[ ], int nbytes);

to read/write data from/to the shared memory. The read/write functions only guarantee each read/write operation is atomic (by the mutex lock of the shared memory). The data format and meaning of the shared memory contents are entirely up to the user to decide.

10.7.4.2 Pipes for Data Streams This is the same pipe mechanism of Chap. 5 (Sect. 5.13.2). It allows tasks to use pipes to read/write streams of data. 10.7.4.3 Message Queues This is the same synchronous message passing mechanism of Chap. 5 (Sect. 5.13.4). It allows tasks to communicate by exchange messages. As in the case of shared memory, the user may design the message format and contents to suit the needs.

10.7.5 Protection of Critical Regions In UP_RTOS, all critical regions are protected by mutex locks. The mutex locking order is always unidirectional, so deadlocks can never occur in the UP_RTOS kernel.

10.7.6 File System and Logging Logging typically requires writing information to a log file in a file system, which may incur a variable amount of delay time to tasks. We do not see any justifiable needs for tasks in a real-time system to perform file operations. For simplicity and efficiency, we implement logging by a special logging task, which receives logging requests from other tasks by messages and writes the logging information to a storage device, such as a SDC, directly. When the system starts, it creates a logging task with the second lowest priority 1 (higher than the idle task at priority 0). Other tasks uses

log char log_ information

to record a line in the log. The logging operation sends a message to the logging task, which formats the log information in the form

timestamp : taskID : line

and writes it to a SDC (1 KB) block whenever it runs. It also writes the logged lines to a UART port to let the user see the logging activities online. When the system terminates, the log information can be retrieved from the SDC for tracing and analysis. We show the implementation of the UP_RTOS kernel and demonstrate its capabilities by the following example programs: C10.1: UP_RTOS with static periodic tasks and round-robin scheduling C10.2: UP_RTOS with static periodic tasks and preemptive scheduling C10.3: UP_RTOS with dynamic tasks and preemptive scheduling

444

10 Embedded Real-Time Operating Systems

10.7.7 UP_RTOS with Static Periodic Tasks and Round-Robin Scheduling The first UP_RTOS example, denoted by C10.1, demonstrates static periodic tasks with round-robin scheduling. We assume that all tasks are periodic with the same period, hence the same priority in accordance with the RMS scheduling algorithm. As a uniprocessor (UP) system, the UP_RTOS kernel maintains a single readyQueue for task scheduling. In the readyQueue, tasks are ordered by priority. Tasks with the same priority are ordered first in, first out (FIFO). Since all the tasks have the same priority, they are scheduled to run by round-robin. When the system starts, it creates four static tasks, all execute the same taskCode() function with the same period. Each task executes an infinite loop, in which a task first blocks itself on a unique semaphore (initialized to 0). A blocked task will be V-ed up by a timer periodically by the task period. When a task is unblocked and becomes ready to run, we get its ready time from a global time and record it in the task PROC structure. When a task runs, it first gets the start time. Then it does some computation, which is simulated by a delay loop. At the end of the execution loop, each task gets the end time and computes its execution and deadline times as follows: Execution time Ci = end_time - start_time; Deadline time Di = end_time - ready_time; It compares the deadline time with task period to see whether it has met the deadline (equal to task period). Then it repeats the loop again. The following lists the code of the C10.1 program. In order to keep the program simple, the system only supports nested interrupts and preemptive task scheduling but without priority inheritance which will be implemented later. (1). ts.s file of C10.1: The main feature of the ts.s file is that it supports nested IRQ interrupts. Task switch is deferred until nested interrupt processing have ended. Details of task preemption are explained later: /************* ts.s file of C10.1 ***************/ .text .code 32 .global vectors_start, vectors_end .global proc, procsize .global tswitch, scheduler, running .global int_off, int_on, lock, unlock .global swflag, intnest, int_end .set vectorAddr, 0x10140030 // VIC vector base address reset_handler: // set SVC stack to high end of proc[0].kstack ldr r0, =proc ldr r1, =procsize ldr r2, [r1, #0] add r0, r0, r2 mov sp, r0 // copy vector table to address 0 bl copy_vectors // go in IRQ mode, set IRQ stack msr cpsr, #0x92 ldr sp, =irq_stack_top // call main() in SVC mode with IRQ on: all tasks run in SVC mode msr cpsr, #0x13 bl main b .

10.7 Uniprocessor RTOS (UP_RTOS) irq_handler: // support nested interrupts in SVC mode sub lr, lr, #4 stmfd sp!, {r0-r12, lr} mrs r0, spsr stmfd sp!, {r0} // push SPSR ldr r0, =intnest // r0->intnest ldr r1, [r0] add r1, #1 str r1, [r0] // intnest++ mov r1, sp // get irq sp into r1 ldr sp, =irq_stack_top // reset IRQ stack poiner to IRQ stack top // switch to SVC mode // to allow nested IRQs: clear IRQ source MSR cpsr, #0x93 // to SVC mode with interrupts OFF sub sp, #60 // dec SVC mode sp by 15 entries mov r0, sp // r0=SVC stack top // copy IRQ stack to SVC stack mov r3, #15 // 15 times copy_stack: ldr r2, [r1], #4 // get an entry from IRQ stack str r2, [r0], #4 // write to proc's kstack sub r3, #1 // copy 15 entries from IRQ stack to PROC's kstack cmp r3, #0 copy_stack // read vectoraddress register: MUST!!! else no interrupts ldr r1, =vectorAddr ldr r0, [r1] // read vectorAddr register to ACK interrupt stmfd sp!, {r0-r3, lr} msr cpsr, #0x13 // still in SVC mode but enable IRQ bl irq_chandler // handle interrupt in SVC mode, IRQ off msr cpsr, #0x93 ldmfd sp!, {r0-r3, lr} ldr r0, =intnest // check interrupt nest level ldr r1, [r0] sub r1, #1 str r1, [r0] // intnest-cmp r1, #0 // if intnest != 0 => no_switch bne no_switch // intnest==0: END OF IRQs: if swflag=1: switch task ldr r0, =swflag ldr r0, [r0] cmp r0, #0 bne do_switch // if swflag=0: no task switch no_switch: ldmfd sp!, {r0} msr spsr, r0 // restore SPSR // irq_chandler() already issued EOI ldmfd sp!, {r0-r12, pc}^ // return via SVC stack do_switch: // still in IRQ mode bl endIRQ // show "at end of IRQs" bl tswitch // call tswitch(): resume to here // will switch task, so must issue EOI ldr r1, =vectorAddr str r0, [r1] // issue EOI ldmfd sp!, {r0} msr spsr, r0

445

446 ldmfd sp!, {r0-r12, pc}^

10 Embedded Real-Time Operating Systems // return via SVC stack

tswitch: // for task switch in SVC mode // disable IRQ interrupts mrs r0, cpsr orr r0, r0, #0x80 // set I bit to MASK out IRQ interrupts msr cpsr, r0 stmfd sp!, {r0-r12, lr} ldr r0, =running // r0=&running ldr r1, [r0, #0] // r1->runningPROC str sp, [r1, #4] // running->ksp = sp bl scheduler ldr r0, =running ldr r1, [r0, #0] // r1->runningPROC ldr sp, [r1, #4] // enable IRQ interrupts mrs r0, cpsr bic r0, r0, #0x80 // clear bit means UNMASK IRQ interrupt msr cpsr, r0 ldmfd sp!, {r0-r12, pc} // utility functions: int_on/int_off/lock/unlock: NOT shown vectors_start: LDR PC, reset_handler_addr LDR PC, undef_handler_addr LDR PC, swi_handler_addr LDR PC, prefetch_abort_handler_addr LDR PC, data_abort_handler_addr B . LDR PC, irq_handler_addr LDR PC, fiq_handler_addr reset_handler_addr: .word reset_handler undef_handler_addr: .word undef_handler swi_handler_addr: .word swi_handler prefetch_abort_handler_addr: .word prefetch_abort_handler data_abort_handler_addr: .word data_abort_handler irq_handler_addr: .word irq_handler fiq_handler_addr: .word fiq_handler vectors_end:

Explanations of the ts.s File Reset_handler: As usual, reset_handler is the entry point. First, it sets the SVC mode stack pointer to the high end of proc[0] and copies the vector table to address 0. Next, it changes to IRQ mode to set the IRQ mode stack. Then it calls main() in SVC mode. During system operation, all tasks run in SVC mode in the same address of the kernel. Irq_handler: Task switch is usually triggered by interrupts, which may make a blocked task ready to run. Thus, irq_handler is the most important piece of assembly code relevant to process preemption. Therefore, we only focus on the irq_handler code. As pointed out in Chap. 2, the ARM CPU cannot handle nested interrupts in IRQ mode. Nested interrupt processing must be performed in a different privileged mode. In order to support process preemption due to interrupts, we choose to handle IRQ interrupts in SVC mode. The algorithm of irq_handler can be best described by the following algorithm:

10.7 Uniprocessor RTOS (UP_RTOS)

/********* Algorithm of IRQ handler for full task preemption ********/ (1). Upon entry, adjust return lr; save context, including spsr, in IRQ stack (2). Increment interrupt nesting counter by 1 (3). Switch to SVC mode with interrupts disabled (4). Transfer context from IRQ stack to proc’s SVC stack; reset IRQ stack (5). Ack and clear interrupt source (prevent infinite loops from the same interrupt source) (6). Enable IRQ interrupts; save working registers in SVC stack (7). Call ISR in SVC mode with IRQ interrupts enabled (8). (return from ISR): disable IRQ interrupts; restore working registers (9). Decrement interrupt nesting counter by 1 (10). If still inside interrupt handler (counter nonzero): goto no_switch (11). (end of IRQs): if swflag is set: goto do_switch (12). no_switch: return via saved context in SVC stack (13). do_switch: write EOI to interrupt controller; call tswitch() to switch task (14). (when switched-out task resumes): return via saved interrupt context in SVC stack (2). uart.c file: UARTs driver – for outputs only by TX interrupts. (3). vid.c file: LCD driver – same as before except frame buffer is at 4 MB. (4). timer.c file: Use timer0 to activate tasks periodically: /** timer.c file of C10.1 **/ #define TLOAD 0x0 #define TVALUE 0x1 #define TCNTL 0x2 #define TINTCLR 0x3 #define TRIS 0x4 #define TMIS 0x5 #define TBGLOAD 0x6 typedef struct timer{ u32 *base; // timer's base address int tick, hh, mm, ss; // per timer data area char clock[16]; }TIMER; TIMER timer[4]; // 4 timers, use only timer0 void timer_init() { int i; TIMER *tp; printf("timer_init(): "); gtime = 0; for (i=0; ibase = (u32 *)0x101E2000; if (i==1) tp->base = (u32 *)0x101E2020; if (i==2) tp->base = (u32 *)0x101E3000; if (i==3) tp->base = (u32 *)0x101E3020; *(tp->base+TLOAD) = 0x0; // reset *(tp->base+TVALUE)= 0xFFFFFFFF; *(tp->base+TRIS) = 0x0; *(tp->base+TMIS) = 0x0; *(tp->base+TLOAD) = 0x100;

447

448

}

}

10 Embedded Real-Time Operating Systems //0x62=|011-0000=|NOTEn|Pe|IntE|-|scal=00|32- bit|0=wrap| *(tp->base+TCNTL) = 0x62; *(tp->base+TBGLOAD) = 0xF00; // timer count tp->tick = tp->hh = tp->mm = tp->ss = 0; strcpy((char *)tp->clock, "00:00:00");

void timer_handler(int n) { int i; TIMER *t = &timer[n]; gtime++; // increment global time t->tick++; // for local wall clock if (t->tick >= 64){ t->tick=0; t->ss++; if (t->ss == 60){ t->ss=0; t->mm++; if (t->mm==60){ t->mm=0; t->hh++; } } } // display a wall clock if (t->tick == 0){ // display wall clock for (i=0; iclock[i], 0, 70+i); } t->clock[7]='0'+(t->ss%10); t->clock[6]='0'+(t->ss/10); t->clock[4]='0'+(t->mm%10); t->clock[3]='0'+(t->mm/10); t->clock[1]='0'+(t->hh%10); t->clock[0]='0'+(t->hh/10); for (i=0; iclock[i], 0, 70+i); } } // unblock tasks by period if ((gtime % period)==0){ // period = N*T timer ticks for (i=1; ibase+TCNTL) |= 0x80; // set enable bit 7 } int timer_clearInterrupt(int n) // timer_start(0), 1, etc. { TIMER *tp = &timer[n]; *(tp->base+TINTCLR) = 0xFFFFFFFF; }

10.7 Uniprocessor RTOS (UP_RTOS)

(5). Kernel files of C10.1 /*********** pv.c file of C10.1 ***********/ extern PROC *running; extern PROC *readyQueue; extern int swflag; extern int intnest; int P(struct semaphore *s) // no priority inheritance { int SR = int_off(); s->value--; if (s->value < 0){ running->status = BLOCK; enqueue(&s->queue, running); int_on(SR); tswitch(); } int_on(SR); } int V(struct semaphore *s) // no priority inherence { PROC *p; int cpsr; int SR = int_off(); s->value++; if (s->value queue); p->status = READY; enqueue(&readyQueue, p); printf("V up task%d pri=%d; running pri=%d\n", p->pid, p->priority, running->priority); reschedule(); // may preempty running task

}

} int_on(SR);

/**************** kernel.c file of C10.1 **************/ #define NPROC 256 PROC proc[NPROC], *running, *freeList, *readyQueue; int procsize = sizeof(PROC); int swflag = 0; // switch task flag int intnest; // interrupt nesting level int kernel_init() { int i, j; PROC *p; kprintf("kernel_init()\n"); for (i=0; ipid = i; p->status = READY; p->run_time = 0; p->next = p + 1;

449

450

}

10 Embedded Real-Time Operating Systems

} proc[NPROC-1].next = 0; freeList = &proc[0]; sleepList = 0; readyQueue = 0; intnest = 0; running = getproc(&freeList); // create and run P0 running->priority = 0; printf("running = %d\n", running->pid);

int scheduler() { printf("task%d switch task: ", running->pid); if (running->status==READY) enqueue(&readyQueue, running); printQ(readyQueue); running = dequeue(&readyQueue); printf("next running = task%d pri=%d realpri=%d\n", running->pid, running->priority, running->realPriority); color = RED+running->pid; swflag = 0; } int reschedule() // called from inside V()with IRQ disabled { if (readyQueue && readyQueue->priority > running->priority){ if (intnest==0){ printf("task%d PREEMPT task%d IMMEDIATELY\n", readyQueue->pid, running->pid); tswitch(); } else{ printf("task%d DEFER PREEMPT task%d ", readyQueue->pid, running->pid); swflag = 1; // IRQ are disabled, so no need to lock/unlock } } } // kfork() create a new task and enter into readyQueue PROC *kfork(int func, int priority) { int i; PROC *p = getproc(&freeList); if (p==0){ kprintf("kfork failed\n"); return (PROC *)0; } p->ppid = running->pid; p->parent = running; p->status = READY; p->realPriority = p->priority = priority; p->run_time = 0; p->ready_time = 0;

10.7 Uniprocessor RTOS (UP_RTOS)

}

451

// set kstack to resume to execute func() for (i=1; ikstack[SSIZE-i] = 0; p->kstack[SSIZE-1] = (int)func; p->ksp = &(p->kstack[SSIZE-14]); enqueue(&readyQueue, p); printf("task%d create a child task%d\n", running->pid, p->pid); reschedule(); return p;

/******************* t.c file of C10.1 ****************/ // set T=5 timer ticks, task period = 8*T #define T 5 #define period 8*T #include "type.h" #include "string.c" #include "queue.c" #include "pv.c" #include "uart.c" #include "vid.c" #include "exceptions.c" #include "kernel.c" #include "timer.c" // globals struct semaphore ss[5]; volatile u32 gtime; UART *up0, *up1; int tcount = 0; void copy_vectors(void) {

// semaphores for task to block // global time // number of timer interrupts in low IRQ // same as before }

int enterint() // for nested IRQs: clear interrupt source { int status, ustatus, scode; status = *((int *)(VIC_BASE_ADDR)); // read status register if (status & (1base = (u32 *)0x101E2020; if (i==2) tp->base = (u32 *)0x101E3000; if (i==3) tp->base = (u32 *)0x101E3020; *(tp->base+TLOAD) = 0x0; // reset *(tp->base+TVALUE)= 0xFFFFFFFF; *(tp->base+TRIS) = 0x0; *(tp->base+TMIS) = 0x0; *(tp->base+TLOAD) = 0x100; //0x62=|011-0000=|NOTEn|Pe|IntE|-|scal=00|32- bit|0=wrap| *(tp->base+TCNTL) = 0x62; *(tp->base+TBGLOAD) = 0xF00; // timer count tp->tick = tp->hh = tp->mm = tp->ss = 0; strcpy((char *)tp->clock, "00:00:00"); } } void timer_handler(int n) { int i; TIMER *t = &timer[n]; gtime++; // increment global time t->tick++; // for local wall clock if (t->tick >= 64){ t->tick=0; t->ss++; if (t->ss == 60){ t->ss=0; t->mm++; if (t->mm==60){ t->mm=0; t->hh++; } } } if (t->tick == 0){ // display wall clock for (i=0; iclock[i], 0, 70+i); } t->clock[7]='0'+(t->ss%10); t->clock[6]='0'+(t->ss/10); t->clock[4]='0'+(t->mm%10); t->clock[3]='0'+(t->mm/10); t->clock[1]='0'+(t->hh%10); t->clock[0]='0'+(t->hh/10); for (i=0; iclock[i], 0, 70+i); }

} if ((gtime % (2*T))==0){ // V(&ss[4]); proc[4].ready_time = gtime; } if ((gtime % (3*T))==0){ // V(&ss[3]); proc[3].ready_time = gtime; } if ((gtime % (4*T))==0){ // V(&ss[2]); proc[2].ready_time = 0; } if ((gtime % (5*T))==0){ // V(&ss[1]); proc[1].ready_time = gtime; } timer_clearInterrupt(n);

activate P4 every 2T

activate P3 every 3T

activate P2 every 4T

activate P1 every 5T

} void timer_start(int n) // timer_start(0), 1, etc. { TIMER *tp = &timer[n]; printf("timer_start %d\n", n); *(tp->base+TCNTL) |= 0x80; // set enable bit 7 } int timer_clearInterrupt(int n) // timer_start(0), 1, etc. { TIMER *tp = &timer[n]; *(tp->base+TINTCLR) = 0xFFFFFFFF; }

(5). Kernel files of C10.2: Same as in C10.1 (6). t.c file of C10.2 /**************** t.c file of C10.2 ****************/ #define T 32 #include "type.h" #include "string.c" #include "queue.c" #include "pv.c" #include "uart.c" #include "vid.c" #include "exceptions.c" #include "kernel.c" #include "timer.c" // globals

458 struct semaphore ss[5]; volatile u32 gtime; UART *up0, *up1; int tcount = 0; void copy_vectors(void) {

10 Embedded Real-Time Operating Systems // semaphores for task to block // global time // number of timer interrupts in low IRQ // same as before }

int enterint() // for nested IRQs: clear interrupt source { int status, ustatus, scode; status = *((int *)(VIC_BASE_ADDR)); // read status register if (status & (1priority > running->priority){ if (intnest==0){ // not in IRQ handler: preempt immediately printf("%d PREEMPT %d IMMEDIATELY\n", readyQueue->pid, pid); tswitch(); }

462

10 Embedded Real-Time Operating Systems else{ // still in IRQ handler: defer preemption printf("%d DEFER PREEMPT %d\n", readyQueue->pid, running->pid); swflag = 1; // set need to switch task flag }

} int_on(SR);

} int P(struct semaphore *s) { int SR = int_off(); s->value--; if (s->value < 0){ // block running task running->status = BLOCK; enqueue(&s->queue, running); running->status = BLOCK; enqueue(&s->queue, running); // priority inheritance if (running->priority > s->owner->priority){ s->owner->priority = running->priority; // boost owner priority reorder_readyQueue(); // re-order readyQueue } tswitch(); } s->owner = running; // as owner of semaphore int_on(SR); } int V(struct semaphore *s) { PROC *p; int cpsr; int SR = int_off(); s->value++; s->owner = 0; // clear s owner field if (s->value queue); p->status = READY; s->owner = p; // new owner enqueue(&readyQueue, p); running->priority = running->rpriority; // restore owner priority printf("timer: V up task%d pri=%d; running pri=%d\n", p->pid, p->priority, running->priority); resechedule(); } int_on(SR); } int mutex_lock(MUTEX *s) { PROC *p; int SR = int_off(); printf("task%d locking mutex %x\n", running->pid, s); if (s->lock==0){ // mutex is in unlocked state s->lock = 1; s->owner = running; } else{ // mutex is already locked: caller BLOCK on mutex running->status = BLOCK;

10.7 Uniprocessor RTOS (UP_RTOS) enqueue(&s->queue, running); // priority inheritance if (running->priority > s->owner->priority){ s->owner->priority = running->priority; // boost owner priority reorder_readyQueue(); // re-order readyQueue } tswitch(); // switch task

} int_on(SR);

} int mutex_unlock(MUTEX *s) { PROC *p; int SR = int_off(); printf("task%d unlocking mutex\n", running->pid); if (s->lock==0 || s->owner != running){ // unlock error int_on(SR); return -1; } // mutex is locked and running task is owner if (s->queue == 0){ // mutex has no waiter s->lock = 0; // clear lock s->owner = 0; // clear owner } else{ // mutex has waiters: unblock one as new owner p = dequeue(&s->queue); p->status = READY; s->owner = p; enqueue(&readyQueue, p); running->priority = running->rpriority; // restore owner priority resechedule(); } int_on(SR); return 1; } PROC *kfork(int func, int priority) { // create new task p with priority as before p->rpriority = p->priority = priority; // with dynamic tasks, readyQueue must be protected by a mutex mutex_lock(&readyQueuelock); enqueue(&readyQueue, p); // enter new task into readyQueue mutex_unlock(&readyQueuelock); reschedule(); return p; }

(3). t.c file of C10.3 /**************** t.c file of C10.3 ***************/ #include "type.h" MUTEX *mp; // global mutex struct semaphore s1; // global semaphore #include "queue.c" #include "mutex.c" // modified mutex operations #include "pv.c" // modified P/V operations

463

464 #include #include #include #include #include #include #include #include

10 Embedded Real-Time Operating Systems "kbd.c" "uart.c" "vid.c" "exceptions.c" "kernel.c" "timer.c" "sdc.c" "mesg.c"

// SDC driver // message passing

int klog(char *line){ send(line, 2) } // send msg to log task #2 int endIRQ(){ printf("until END of IRQ\n") } void copy_vectors(void) { // same as before } // use vectored interrupts of PL190 int timer0_handler(){ timer_handler(0) } int uart0_handler() { uart_handler(&uart[0]) } int uart1_handler() { uart_handler(&uart[1]) } int v31_handler() { int sicstatus = *(int *)(SIC_BASE_ADDR+0); if (sicstatus & (1pid); P(&s1); klog("running"); mutex_lock(mp); printf("task%d inside CR\n", running->pid); klog("create task4"); kfork((int)task, 4); // create P4 while holding mutex lock delay(); mutex_unlock(mp); pid = kwait(&status); } } int task2() // logging task at priority 1 { int blk = 0; char line[1024]; // 1KB log line printf("task%d: ", running->pid); while(1){ //printf("task%d:recved line=%s\n", running->pid, line); r = recv(line); put_block(blk, line); // write log line to SDC blk++; uprintf("%s\n", line); // display to UART0 on-line printf("log: "); } } int task1() { int status, pid; pid = kfork((int)task2, 2); pid = kfork((int)task3, 3);

465

466

}

10 Embedded Real-Time Operating Systems while(1) // task1 waits for any ZOMBIE children pid = kwait(&status);

int main() { fbuf_init(); // LCD uart_init(); // UARTs kbd_init(); // KBD printf("Welcome to UP_RTOS in ARM\n"); /* configure VIC and SIC for interrupts */ VIC_INTENABLE = 0; VIC_INTENABLE |= (1 high end svcstack[cpuid] MOV sp, r0 // IRQ sp of CPU ID -> highend of irq[ID][1024] //AP: go in ABT mode to set ABT stack MSR cpsr, #0x97 LDR sp, =abt_stack lsl r2, r1, #12 // r2 = r1*4096 ADD r0, r0, r2 // r0 -> high end svcstack[cpuid] MOV sp, r0 // IRQ sp of CPU ID -> highend of irq[ID][4096] //AP: go in UND mode to set UND stack MSR cpsr, #0x9B LDR sp, =und_stack lsl r2, r1, #12 // r2 = r1*4096 ADD r0, r0, r2 // r0 -> high end svcstack[cpuid]

10.8 Multiprocessor RTOS (SMP_RTOS)

473

MOV sp, r0 // IRQ sp of CPU ID -> highend of irq[ID][4096] //AP: go back in SVC mode to set SPSR MSR cpsr, #0x93 MSR spsr, #0x13 //AP: call APstart() with IRQ on MSR cpsr, #0x13 bl APstart // call APstart() in C

(3). The main() and APstart() Functions in C In main(), CPU0 first initializes the device drivers, configures the GIC for interrupts, initializes the kernel data structures, and runs the initial task with pid=1000. Then it issues SGI to activate the APs. After all the APs are ready, it creates a P0 as the logging task and the INIT task P1. Then it executes run_task(), trying to run tasks from the readyQueue of the same CPU. Upon receiving a SGI from CPU0, each AP starts to execute the apStart code in assembly. The actions of each AP are similar to those of CPU0, i.e., it sets SVC mode stack pointer to the high end of iproc[cpuid] and IRQ and other mode stacks by cpuid. Then it calls APstart() in C. In APstart(), each AP configures and starts its local timer and updates the global variable ncpu to synchronize with CPU0. Then it calls run_task() also. At this moment, the readyQueues of the APs are still empty. So each AP runs an idle task with pid=1000+cpuid and enters the WFE state. After creating tasks on the APs, there are two ways to let the APs continue. CPU0 may send a SGI to all the APs, causing them to get up from the WFE state, or each AP will get up when its local timer interrupts. The following shows the code segments of run_task(), APstart() and main(): int run_task() // each CPU tries to run task from its own readyQ { int cpuid = get_cpuid(); while(1){ slock(&readylock[cpuid]); if (readyQueue[cpuid] == 0){ // check own readyQueue sunlock(&readylock[cpuid]); asm("WFE"); // wait for events/interrupts } else{ sunlock(&readylock[cpuid]); ttswitch(); // switch task on CPU } } } int APstart() { int cpuid = get_cpuid(); slock(&aplock); color = YELLOW; printf("CPU%d in APstart ... ", cpuid); config_int(29, cpuid); // configure local timer ptimer_init(); ptimer_start(); // start local timer ncpu++; running = &iproc[cpuid]; // run initial iproc[cpuid] printf("%d running on CPU%d\n", running->pid, cpuid); sunlock(&aplock); run_task(); } int main() { // initialize device drivers // configure GIC for interrupts

474

}

10 Embedded Real-Time Operating Systems kernel_init(); //initialize kernel printf("CPU0 startup APs: "); int *APaddress = (int *)0x10000030; *APaddress = (int)apStart; send_sgi(0x0, 0x0F, 0x01); // send SGI 0 to all Aps printf("CPU0 waits for APs ready\n"); while(ncpu < 4); printf("CPU0 continue\n"); kfork(int)logtask, 1, 0); // create P0 on CPU0 as logging task kfork((int)f1, 1, 0); // create P1 on CPU0 as INIT task run_task();

10.8.1.6 Demonstration of Task Management in SMP_RTOS We demonstrate task management in the initial SMP_RTOS kernel by a sample program, denoted by C10.4. Task management includes task creation, task termination, and wait for child task termination. The program consists of the following components: (1). ts.s file of C10.4: /*********** ts.s file of C10.4: SMP_RTOS kernel **********/ .text .code 32 .global reset_handler, vectors_start, vectors_end .global proc, procsize, apStart .global tswitch, scheduler .global int_on, int_off, lock, unlock .global get_fault_status, get_fault_addr, get_cpsr, get_spsr; .global slock, sunlock, get_cpuid reset_handler: // same as shown above apStart: // AP entry point // same as shown above irq_handler: // IRQ interrupts entry point sub lr, lr, #4 stmfd sp!, {r0-r12, lr} // save context bl irq_chandler // call irq_handler() ldmfd sp!, {r0-r12, pc}^ // return data_handler: sub lr, lr, #4 stmfd sp!, {r0-r12, lr} bl data_abort_handler ldmfd sp!, {r0-r12, pc}^ tswitch: // tswitch() in SVC mode msr cpsr, #0x93 // interrupts off stmfd sp!, {r0-r12, lr} ldr r4, =run // r4=&run mrc p15, 0, r5, c0, c0, 5 // read CPUID register to r5 and r5, r5, #0x3 // only the CPU ID mov r6, #4 // r6 = 4 mul r6, r5 // r6 = 4*cpuid add r4, r6 // r4 = &run[cpuid]

10.8 Multiprocessor RTOS (SMP_RTOS) ldr r6, [r4, #0] // r6->running PROC str sp, [r6, #4] // save sp in PROC.ksp bl scheduler // call scheduler() in C ldr r4, =run // r4=&run mrc p15, 0, r5, c0, c0, 5 // read CPUID register to r5 and r5, r5, #0x3 // only the CPU ID mov r6, #4 // r6 = 4 mul r6, r5 // r6 = 4*cpuid add r4, r6 // r4 = &run[cpuid] ldr r6, [r4, #0] // r6->running PROC ldr sp, [r6, #4] // restore ksp msr cpsr, #0x13 // interrupts on ldmfd sp!, {r0-r12, pc} // resume new running int_off: // int sr=int_off() stmfd sp!, {r1} mrs r1, cpsr mov r0, r1 // r0 = r1 orr r1, r1, #0x80 msr cpsr, r1 // mask out IRQ ldmfd sp!, {r1} mov pc, lr // return original cpsr int_on: // int_on(SR) mrs cpsr, r0 mov pc, lr lock: // maks out IRQ in cpsr mrs r0, cpsr orr r0, r0, #0x80 msr cpsr, r0 mov pc, lr unlock: // mask in IRQ in cpsr mrs r0, cpsr bic r0, r0, #0x80 msr cpsr, r0 mov pc, lr .global send_sgi // void send_sgi(int ID, int target_list, int filter_list) send_sgi: // send_sgi with CPUTarget list and r3, r0, #0x0F // Mask off unused bits of ID to r3 and r1, r1, #0x0F // Mask off unused bits of target_filter and r2, r2, #0x0F // Mask off unused bits of filter_list orr r3, r3, r1, LSL #16 // Combine ID and target_filter orr r3, r3, r2, LSL #24 // and now the filter list // Get the address of the GIC mrc p15, 4, r0, c15, c0, 0 // Read periph base address add r0, r0, #0x1F00 // Add offset of the sgi_trigger reg str r3, [r0] // Write to SGI Register(ICDSGIR) bx lr slock: // int slock(&spin) ldrex r1, [r0] cmp r1, #0x0 WFENE bne slock mov r1, #1 strex r2, r1, [r0]

475

476

10 Embedded Real-Time Operating Systems

cmp r2, #0x0 bne slock DMB // barrier bx lr sunlock: // sunlock(&spin) mov r1, #0x0 DMB str r1, [r0] DSB SEV bx lr get_fault_status: // read and return MMU reg 5 mrc p15,0,r0,c5,c0,0 // read DFSR mov pc, lr get_fault_addr: // read and return MMU reg 6 mrc p15,0,r0,c6,c0,0 // read DFSR mov pc, lr get_cpsr: mrs r0, cpsr mov pc, lr get_spsr: mrs r0, spsr mov pc, lr get_cpuid: mrc p15, 0, r0, c0, c0, 5 // Read CPU ID register and r0, r0, #0x03 // Mask off, leaving the CPU ID field bx lr .global enable_scu // void enable_scu(void): Enables the SCU enable_scu: mrc p15, 4, r0, c15, c0, 0 // Read periph base address ldr r1, [r0, #0x0] // Read the SCU Control Register orr r1, r1, #0x1 // Set bit 0 (The Enable bit) str r1, [r0, #0x0] // Write back modifed value bx lr vectors_start: LDR PC, reset_handler_addr LDR PC, undef_handler_addr LDR PC, svc_handler_addr LDR PC, prefetch_abort_handler_addr LDR PC, data_abort_handler_addr B . LDR PC, irq_handler_addr LDR PC, fiq_handler_addr reset_handler_addr: .word reset_handler undef_handler_addr: .word undef_handler svc_handler_addr: .word svc_entry prefetch_abort_handler_addr: .word prefetch_abort_handler data_abort_handler_addr: .word data_handler irq_handler_addr: .word irq_handler fiq_handler_addr: .word fiq_handler vectors_end: // other SMP utility functions: NOT SHOWN

10.8 Multiprocessor RTOS (SMP_RTOS)

(2). Kernel.c file of SMP_RTOS /*********** kernel.c file of SMP_RTOS ***********/ #define NCPU 4 #define NPROC 256 #define SSIZE 1024 #define FREE 0 #define READY 1 #define BLOCK 2 #define ZOMBIE 3 typedef struct proc{ struct proc *next; int *ksp; // saved stack pointer int status; // FREE|READY|BLOCK|ZOMBIE, etc. int priority; // real priority int pid; int ppid; struct proc *parent; int exitCode; // exit code int ready_time; struct mbuf *mqueue; // message queue struct mutex mlock; // message queue mutex lock struct semaphore nmsg; // wait for message semaphore int cpuid; // which CPU this proc is on int epriority; // effective priority for PI int timeslice; // for time sliced scheduling struct semaphore wchild; // wait for ZOMBIE children int kstack[SSIZE]; }PROC; #define running run[get_cpuid()] // kernel data structures and spinlocks PROC iproc[NCPU]; // per CPU initial/idle PROCs PROC *run[NCPU]; // pointer to running PROC on each CPU PROC proc[NPROC]; // PROC structures PROC *freeList; // freeList PROC *readyQueue[NCPU]; // per CPU readyQueue int readylock[NCPU]; // per CPU readyQueue spinlcok int intnest[NCPU]; // per CPU IRQ nesting level int swflag[NCPU]; // per CPU switch task flag int freeListlock = 0; // freeList spinlock int procsize = sizeof(PROC); // PROC size int irq_stack[NCPU][1024]; // per CPU IRQ mode stack int abt_stack[NCPU][1024]; // per CPU ABT mode stack int und_stack[NCPU][1024]; // per CPU UND mode stack int kernel_init() { int i; PROC *p; char *cp; printf("kernel_init(): init iprocs\n"); for (i=0; ipid = 1000 + i; // special pid for per CPU idle PROCs p->status = READY;

477

478

10 Embedded Real-Time Operating Systems p->ppid = p->pid; run[i] = p; readyQueue[i] = 0; readylock[i] = 0; intnest[i] = 0; swflag[i] = 0;

}

} for (i=0; ipid = i; p->status = FREE; p->priority = 0; p->ppid = 0; p->next = p + 1; } proc[NPROC-1].next = 0; freeList = &proc[0]; sleepList = 0; freelock = 0; running = run[0];

// // // // //

run[i] points to idle iPROC[i] ready queues spinlocks IRQ nest levels switch task flags

// freeList of PROCs

// initial task on CPU0

int ttswitch() // switch task on CPUID { int cpuid = get_cpuid(); lock(); slock(&readylock[cpuid]); tswitch(); sunlock(&readylock[cpuid]); unlock(); } int scheduler() { int pid; PROC *old=running; int cpuid = get_cpuid(); slock(&readylock[cpuid]; if (running->pid < 1000){ // regular tasks if (running->status==READY) enqueue(&readyQueue[cpuid], running); } running = dequeue(&readyQueue[cpuid]); if (running == 0) running = &iproc[cpuid];// run idle task on CPU swflag[cpuid] = 0; sunlock(&readylock[cpuid]); } int kfork(int func, int priority, int cpu) { int cpuid = get_cpuid(); PROC *p = getproc(&freeList); // allocate a free PROC if (p==0){ printf("kfork1 failed\n"); return -1 } p->ppid = running->pid; p->parent = running;

10.8 Multiprocessor RTOS (SMP_RTOS)

}

479

p->status = READY; p->priority = priority; p->cpuid = cpuid for (i=1; ikstack[SSIZE-i] = 0; p->kstack[SSIZE-1] = (int)func; // resume to func p->ksp = &(p->kstack[SSIZE-14]); // saved ksp slock(&readylock[cpu]); enqueue(&readyQueue[cpu], p); // enter p into rQ[cpu] sunlock(&readylock[cpu]); if (p->priority > running->priority){ if (cpuid == cpu){ // if on the same CPU printf("%d PREEMPT %d on CPU%d\n", p->pid, running->pid, cpuid); ttswitch(); } } return p->pid;

int P(struct semaphore *s) // P operation in SMP { int SR = int_off(); slock(&s->lock); s->value--; if (s->value < 0){ running->status = BLOCK; enqueue(&s->queue, running); // priority inheritance only if on the same CPU if (running->priority > s->owner->priority){ if (s->owner->cpu == running->cpu){ s->owner->priority = running->priority; //raise owner riority reorder_readyQueue(); // re-order readyQueue } } sunlock(&s->lock); ttswitch(); } else{ s->owner = running; // as owner of s sunlock(&s->lock); } int_on(SR); } int V(struct semaphore *s) // { PROC *p; int SR = int_off(); slock(&s->lock); s->value++; s->owner = 0; // if (s->value queue); p->status = READY; s->owner = p; //

V operation in SMP

clear s owner field s queue has waiters

new owner

480

10 Embedded Real-Time Operating Systems slock(&readylock[p->cpuid]); // rQ of original CPU enqueue(&readyQueue[p->cpuid], p); sunlock(&readylock[p->cpuid]); running->priority = running->rpriority; // restore owner priority resechedule();

}

} sunlock(&s->lock); int_on(SR);

int kexit(int value) // task termination { int i; PROC *p; if (running->pid==1){ return -1 } // P1 never dies slock(&proclock); // acquire proclock for (i=2; istatus != FREE) && (p->ppid == running->pid)){ if (p->status==ZOMBIE){ p->status = FREE; putproc(&freeList, p); // release any ZOMBIE child } else{ printf("send child %d to P1\n", p->pid); p->ppid = 1; p->parent = &proc[1]; } } sunlock(&proclock); // release proclock running->exitCode = value; running->status = ZOMBIE; V(&running->parent->wchild); ttswitch(); } int kwait(int *status) // { int i, nChild = 0; PROC *p; slock(&proclock); // for (i=2; istatus != FREE && nChild++; } sunlock(&proclock); if (!nChild){ // return -1; } P(&running->wchild); // slock(&proclock); for (i=2; istatus==ZOMBIE) *status = p->exitCode;

wait for any ZOMBIE child

acquire proclock

p->ppid == running->pid)

no child error wait for a ZOMBIE child

&& (p->ppid == running->pid)){

10.8 Multiprocessor RTOS (SMP_RTOS)

}

}

p->status = FREE; putproc(&freeList, p); sunlock(&proclock); return p->pid;

(3). t.c file of SMP_RTOS /************* t.c file of SMP_RTOS ***********/ #include "type.h" int aplock = 0; int ncpu = 1; #include “string.c” #include "queue.c" #include "pv.c" #include "uart.c" #include "kbd.c" #include "ptimer.c" #include "vid.c" #include "exceptions.c" #include "kernel.c" #include "sdc.c" #include “message.c” int copy_vectors(){ // same as before } #define GIC_BASE 0x1F000000 int config_gic() { // set int priority mask register *(int *)(GIC_BASE + 0x104) = 0xFFFF; // set CPU interface control register: enable signaling interrupts *(int *)(GIC_BASE + 0x100) = 1; // distributor control register to send pending interrupts to CPUs *(int *)(GIC_BASE + 0x1000) = 1; config_int(29, 0); // ptimer at 29 to CPU0 config_int(44, 1); // UAR0 to CPU1 config_int(49, 2); // SDC to CPU0 config_int(52, 3); // KBD to CPU1 } int config_int(int N, int targetCPU) { int reg_offset, index, value, address; reg_offset = (N>>3)&0xFFFFFFFC; index = N & 0x1F; value = 0x1 pid, cpuid); sunlock(&aplock); run_task(); } int main() { int cpuid = get_cpuid(); enable_scu(); fbuf_init(); printf("Welcome to SMP_RTOS in ARM\n"); sdc_init(); kbd_init(); uart_init(); ptimer_init(); ptimer_start(); config_gic(); printf("CPU0 initialize kernel\n"); kernel_init(); printf("CPU0 startup APs: "); int *APaddr = (int *)0x10000030; *APaddr = (int)apStart; send_sgi(0x0, 0x0F, 0x01); printf("CPU0 waits for APs ready\n"); while(ncpu < 4); printf("CPU0 continue\n"); kfork(int)logtask, 1, 0); // P0 on CPU0 as loggin task kfork((int)INIT_task, 1, 0); // P1 on CPU0 as INIT task run_task(); }

483

484

10 Embedded Real-Time Operating Systems

Fig. 10.5 Task management in SMP_RTOS

The object of the initial SMP_RTOS kernel is to demonstrate that the system can create and run tasks on different CPUs. When the system starts, CPU0 creates and runs the initial task P1000, which calls main(). In main(), P1000 initializes device drivers and the system kernel. Then it activates the secondary CPUs (APs). Each AP initializes itself and runs an initial task P1000+cpuid, which calls run_task(), trying to run tasks from its own readyQueue[cpuid]. After activating the APs, P1000 creates a logging task P0 and an INIT task P1 on CPU0. Then P1000 switches task to run the INIT task P1. P1 creates two children tasks on each CPU and waits for any child task to terminate. Each task executes a loop, in which it delays for some time to simulate task computation and calls ttswitch() to switch tasks. In the sample system, tasks run the loop continually. The reader may test the kwait() operation by modifying the taskCode() function to let tasks terminate. Figure 10.5 shows the sample outputs of running the program C10.4, which demonstrates task management in the SMP_RTOS kernel.

10.8.2 Adapt UP_RTOS to SMP The simplest way to develop a RTOS for multiprocessors is to adapt the UP_RTOS for MP operations. In this approach, the system comprises multiple CPUs. Each CPU executes a set of tasks that are bound to the CPU and operate in the local environment of the CPU. The main advantage of this approach lies in its simplicity. It requires very little work to convert the UP_ROTS into a MP RTOS. The disadvantage is that the resulting system is only an asymmetric MP system, not a SMP system. We illustrate such a system, denoted by MP_RTOS, by a sample program C10.5. The program consists of the following components: (1). Ts.s file: Same as in UP_RTOS, except the simplified irq_hanler, which supports task switch but no nested interrupts: irq_handler: sub lr, lr, #4 stmfd sp!, {r0-r12, lr} mrs r0, spsr

// IRQ interrupts entry point // save all Umode regs in kstack

10.8 Multiprocessor RTOS (SMP_RTOS)

485

stmfd sp!, {r0} bl irq_chandler // check return value: 0=nornal, 1=swflag was set => swtich task cmp r0, #0 bgt doswitch ldmfd sp!, {r0} msr spsr, r0 ldmfd sp!, {r0-r12, pc}^ doswitch: msr cpsr, #0x93 stmfd sp!, {r0-r3, lr} bl ttswitch msr cpsr, #0x93 ldmfd sp!, {r0-r3, lr} msr cpsr, #0x92 ldmfd sp!, {r0} msr spsr, r0 ldmfd sp!, {r0-r12, pc}^

(2). kernel.c file: Same as in UP_RTOS for task management, except we define struct semaphore sem[NCPU];

and initialize each sem.value = 0. Each semaphore is associated with a CPU, which can only be accessed by tasks executing on the same CPU. If a task becomes blocked on a semaphore, it is unblocked by the CPU’s local timer periodically: void ptimer_handler() { int cpuid = id = get_cpuid(); struct ttt *tp = &tt[cpuid]; slock(&plock); // update tick, ss, mm, hh: same as before if (tp->tick==0){ // every second: display a wall clock // display wall clock: same as before if ((tp->ss % 5)==0) // every 5 seconds, activate a task V(&sem[cpuid]); // V to unblock a waiting task } sunlock(&plock); ptimer_clearInterrupt(); // clear timer interrupt }

(3). t.c file: Same as in UP_RTOS for task management, except for the following modifications. Instead of creating task on different CPUs, the initial task of each CPU creates a task to execute INIT_task() on the same CPU. Each task creates another task to execute taskCode(), also on the same CPU. All the tasks have the same priority 1. Each task executes an infinite loop, in which it delays for some time to simulate task computation and then blocks itself on a semaphore associated with the CPU. The local timer of each CPU periodically calls V to unblock the tasks, allowing them to continue. Whenever a ready task is entered into the local scheduling queue of a CPU, it may preempt the current running task on the same CPU: /*********** t.c file of MP_RTOS C10.5 ************/ int run_task(){ // same as before int kdelay(int d){ // delay }

486

10 Embedded Real-Time Operating Systems

int taskCode() { int cpuid = get_cpuid(); while(1){ printf("TASK%d ON CPU%d\n", running->pid, cpuid); kdelay(running->pid*10000); reset_sem(&sem[cpuid]); // reset sem.value to 0 P(&sem[cpuid]); } } int INIT_task() { int cpuid = get_cpuid(); kfork((int)taskCode, 1, cpuid); while(1){ printf("task%d on CPU%d\n", running->pid, cpuid); kdelay(running->pid*10000); reset_sem(&sem[cpuid]); // reset sem.value to 0 P(&sem[cpuid]); } } int APstart() { int cpuid = get_cpuid(); slock(&aplock); printf("CPU%d in APstart: ", cpuid); config_int(29, cpuid); // need this per CPU ptimer_init(); ptimer_start(); ncpu++; running = &iproc[cpuid]; printf("%d running on CPU%d\n", running->pid, cpuid); sunlock(&aplock); kfork((int)INIT_task, 1, cpuid); run_task(); } int main() { // initialize device drivers, GIC and kernel: same as before kprintf("CPU0 continue\n"); kfork((int)f1, 1, 0); run_task(); }

10.8.2.1 Demonstration of MP_RTOS Figure 10.6 shows the outputs of running the MP_RTOS system.

10.8.3 Nested Interrupts in SMP_RTOS In a UP RTOS kernel, only one CPU handles all the interrupts. To ensure fast responses to interrupts, it is vital to allow nested interrupts. In a SMP RTOS kernel, interrupts are usually routed to different CPUs to balance the interrupt processing load.

10.8 Multiprocessor RTOS (SMP_RTOS)

487

Fig. 10.6 Demonstration of MP_RTOS system

Despite this, it is still necessary to allow nested interrupts. For instance, each CPU may use a local timer, which generates timer interrupts with high priority. Without nested interrupts, a low-priority device driver may block timer interrupts, causing the timer to lose ticks. In this section, we shall show how to implement nested interrupts in the SMP_RTOS kernel by the sample program C10.6. The principle is the same as in UP_RTOS, except for the following minor difference. All ARM MPcore processors use the GIC for interrupt control. In order to support nested interrupts, the interrupt handler must read the GIC’s interrupt ID register to acknowledge the current interrupt before enabling interrupts. The GIC’s interrupt ID register can only be read once. The interrupt handler must use the interrupt ID to invoke the irq_chandler(int intID) in C. For special SGI interrupts (0–15), they should be handled separately since they have high interrupt priorities and do not need any nesting: .set GICreg, 0x1F00010C // GIC intID register .set GICeoi, 0x1F000110 // GIC EOI register irq_handler: sub lr, lr, #4 stmfd sp!, {r0-r12, lr} mrs r0, spsr stmfd sp!, {r0} // push SPSR // read GICreg to get intID and ACK current interrupt ldr r1, =GICreg ldr r0, [r1] // handle SGI intIDs direcly cmp r0, #15 // SGI interrupts bgt nesting ldr r1, =GICeoi str r0, [r1] // issue EOI

488 b return nesting: // to SVC mode msr cpsr, #0x93 // stmfd sp!, {r0-r3, lr} msr cpsr, #0x13 // bl irq_chandler // msr cpsr, #0x93 // ldmfd sp!, {r0-r3, lr} // msr cpsr, #0x92 // return: ldmfd sp!, {r0} msr spsr, r0 ldmfd sp!, {r0-r12, pc}^

10 Embedded Real-Time Operating Systems

SVC mode with IRQ off SVC mode with IRQ interrupts on call irq_chandler(intID) SVC mode with IRQ off IRQ mode with interrupts off

int irq_chandler(int intID) // called with interrupt ID { int cpuid = get_cpuid(); int cpsr = get_cpsr() & 0x9F; // CPSR with |IRQ|mode| mask

}

intnest[cpuid]++; // if (intID != 29) printf("IRQ%d on CPU%d in if (intID == 29){ // ptimer_handler(); } if (intID == 44){ // uart_handler(0); } if (intID == 49){ // sdc_handler(); } if (intID == 52){ // kbd_handler(); } *(int *)(GIC_BASE + 0x110) = intnest[cpuid]--; //

inc intsest by 1 mode=%x\n", intID, cpuid, cpsr); local timer of CPU

UART0 interrpt

SDC interrupt

KBD interrupt

intID; // issue EOF dec intnest by 1

Figure 10.7 shows the outputs of running the program C10.6, which demonstrates SMP_RTOS with nested interrupts. As the figure shows, all interrupts are handled in SVC mode.

10.8.4 Preemptive Task Scheduling in SMP_RTOS In order to meet task deadlines, every ROTS must support preemptive task scheduling to ensure high-priority tasks can be executed as soon as possible. In this section, we shall show how to implement preemptive task scheduling in the SMP_RTOS kernel and demonstrate task preemption by the sample program C10.7. The task preemption problem can be classified into two main cases. Case 1: When a running task makes another task of higher priority ready to run, e.g., when creating a new task with a higher priority or unblocking a task with a higher priority. Case 2: When an interrupt handler makes a task runnable, e.g., when unblocking a task with a higher priority or when the current running task has exhausted its time quantum in round-robin scheduling by timeslice.

10.8 Multiprocessor RTOS (SMP_RTOS)

489

Fig. 10.7 Nested interrupts in SMP_RTOS

In a UP kernel, implementing task preemption is straightforward. For the first case, whenever a task makes another task ready to run, it simply checks whether the new task has a higher priority. If so, it invokes task switching to preempt itself immediately. For the second case, the situation is only slightly more complex. With nested interrupts, the interrupt handler must check whether all nested interrupt processing have ended. If so, it can invoke task switching to preempt the current running task. Otherwise, it defers task switching until the end of nested interrupt processing. In SMP, the situation is very different. In a SMP kernel, an execution entity, i.e., a task or an interrupt handler, executing on one CPU may make another task ready to run on a different CPU. If so, it must inform the other CPU to reschedule tasks for possible task preemption. In this case, it must issue a SGI, e.g., with interrupt ID=2, to the other CPU. Responding to a SGI_2 interrupt, the target CPU can execute a SGI_2 interrupt handler to reschedule tasks. We illustrate SGI-based task switch by the sample program C10.7. For the sake of brevity, we only show the relevant code segments.

10.8.5 Task Switch by SGI /************ irq_handler() in ts.s file *************/ irq_handler: // IRQ interrupts entry point sub lr, lr, #4 stmfd sp!, {r0-r12, lr} // save all Umode regs in kstack bl irq_chandler // call irq_handler() in C in svc.c file // return value=1 means swflag is set => swtich task cmp r0, #0 bne doswitch ldmfd sp!, {r0-r12, pc}^ // return doswitch: msr cpsr, #0x93 stmfd sp!, {r0-r3, lr} bl ttswitch // switch task in SVC mode ldmfd sp!, {r0-r3, lr} msr cpsr, #0x92 // return by context in IRQ stack ldmfd sp!, {r0-r12, pc}^

490

10 Embedded Real-Time Operating Systems

/************** t.c file ****************************/ int send_sgi(int ID, int targetCPU, int filter) { int target = (1 3) // > 3 means to all CPUs target = 0xF; // turn on all 4 CPU ID bits int *sgi_reg = (int *)(GIC_BASE + 0x1F00); *sgi_reg = (filter pid*100000); // simulate task compute time } } int main() { // start up APs: same as before printf("CPU0 continue\n"); kfork((int)f1, 1, 1); // create tasks on CPU1 kfork((int)f2, 1, 1); kfork((int)f1, 1, 2); // create tasks on CPU2 kfork((int)f2, 1, 2);

10.8 Multiprocessor RTOS (SMP_RTOS)

}

491

while(1){ printf("input a line: "); kgetline(line); printf("\n"); send_sgi(2, 1, 0); // SGI_2 to CPU1 send_sgi(2, 2, 0); // SGI_2 to CPU2 }

In the sample program C10.7, after starting up the APs, the main program executing on CPU0 creates two sets of tasks on both CPU1 and CPU2, all of the same priority 1. All the tasks execute an infinite loop. Without external inputs, each CPU would continue to execute the same task forever. After creating tasks on different CPUs, the main program prompts for an input line from the keyboard. Then it sends a SGI with intID=2 to the other CPUs, causing them to execute the SGI2_handler. In the SGI2_handler, each CPU turns on its switch task flag and returns a 1. In the irq_handler code, it checks the return value. If the return value is nonzero, indicating a need for task switch, it changes to SVC mode and calls ttswitch() to switch task. When the switched-out task resumes, it returns to the original interrupted point by the saved context in IRQ stack. Figure 10.8 shows the outputs of task switching by SGI. As the figure shows, each input line sends SGI-2 to both CPU1 and CPU2, causing them to switch tasks.

10.8.6 Timesliced Task Scheduling in SMP_RTOS In a RTOS, tasks may have the same priority. Instead of waiting for such tasks to give up CPU voluntarily, it may be desirable or even necessary to schedule them by round-robin with timeslice. In this scheme, when a task is scheduled to run, it is given a timeslice as the maximum amount of time the task is allowed to run. In the timer interrupt handler, it decrements the running task’s timeslice periodically. When the running task’s timeslice reaches zero, it is preempted to run another task. We demonstrate task scheduling by timeslice in the SMP_RTOS kennel by an example. The example program, C10.8, is the same as C10.7 except for the following modifications:

Fig. 10.8 Task switch by SGI in SMP_RTOS

492

10 Embedded Real-Time Operating Systems

/********** kernel.c file of C10.8 ***********/ int scheduler() { int pid; PROC *old=running; int cpuid = get_cpuid(); if (running->pid < 1000){ // regular tasks only if (running->status==READY) enqueue(&readyQueue[cpuid], running); } running = dequeue(&readyQueue[cpuid]); if (running == 0) // if CPU’s readyQueue is empty running = &iproc[cpuid]; // run the idle task running->timeslice = 4; // 4 seconds timeslice swflag[cpuid] = 0; sunlock(&readylock[cpuid]); } /********* ptimer.c file of C10.8 ***************/ void ptimer_handler() // local timer handler { int cpuid = get_cpuid(); // update tick, display wall clock: same as before ptimer_clearInterrupt(); // clear timer interrupt if (tp->tick == 0){ // per second if (running->pid < 1000){// only regular tasks running->timeslice--; printf("%dtime=%d ", running->pid, running->timeslice); if (running->timeslice pid, cpuid); swflag[cpuid] = 1; // set swflag o switch task at end of IRQ } } } } /************* t.c file of C10.8 ****************/ int f3() { int cpuid = get_cpuid(); while(1){ printf("TASK%d ON CPU%d\n", running->pid, cpuid); kdelay(running->pid * 100000); // simulate computation } } int f2() { int cpuid = get_cpuid(); while(1){ printf("task%d on cpu%d\n", running->pid, cpuid); kdelay(running->pid * 100000); // simulate computation } } int f1() { int pid, status; kfork((int)f2, 1, 1); // create task on CPU1 kfork((int)f3, 1, 1); // create task on CPU1

10.8 Multiprocessor RTOS (SMP_RTOS)

493

while(1){ pid = kwait(&status); // P1 waits for ZOMBIE child }

} int run_task(){ // same as before } int main() { // start APs” same as before printf("CPU0 continue\n"); kfork((int)f1, 1, 0); // create task 1 on CPU0 run_task(); }

In the sample program C10.8, the initial task of CPU0 creates task1 to execute f1() on CPU0. Task1 creates task2 and task3 on CPU1 with the same priority. Then task1 waits for any child task to terminate. Since both task2 and task3 have the same priority, without timeslicing, CPU1 would continue to run task2 forever. With timeslicing, task2 and task3 would take turn to run, each runs for a timeslice = 4 seconds. Figure 10.9 shows the outputs of running the C10.8 program, which demonstrates timesliced task scheduling.

10.8.7 Preemptive Task Scheduling in SMP_RTOS Based on the above discussions, we now show the implementation of preemptive task scheduling in the SMP_ROTS kernel.

Fig. 10.9 Timesliced task scheduling in SMP_RTOS

494

10 Embedded Real-Time Operating Systems

10.8.7.1 Reschedule Function for Task Preemption

int reschedule(PROC *p) { int cpuid = get_cpuid(); // executing CPU int targetCPU = p->cpuid; // target CPU if (cpuid == targetCPU){ // if same CPU if (readyQueue[cpuid]->priority > running->priority){ if (intnest[cpuid]==0){ // PREEMPTION immediately ttswitch(); } else{ // defer until end of nested IRQs swflag[cpuid] = 1; } } } else{ // different CPU send_sgi(intID=2, CPUID=targetCPU, filter=0); } }

10.8.7.2 Task Creation with Preemption

int kfork(int func, int priority, int cpu) { int cpuid = get_cpuid(); // executing CPU // create a new task p, enter into readyQueue[cpuid] as before reschedule(p); // invoke resechedule return p->pid; }

10.8.7.3 Semaphore with Preemption

int V(struct semaphore *s) { PROC *p = 0; int SR = int_off(); slock(&s->lock); s->value++; if (s->value queue); p->status = READY; slock(&readylock[p->cpuid]); enqueue(&readyQueue[p->cpuid], p); sunlock(&readylock[p->cpuid]); } sunlock(&s->lock); int_on(SR); if (p) // if has unblocked a waiter reschedule(); }

10.8 Multiprocessor RTOS (SMP_RTOS)

10.8.7.4 Mutex with Preemption

int mutex_unlock(MUTEX *s) { PROC *p = 0; int SR = int_off(); slock(&s->lock) // acquire spinlock // ASSUME: mutex has waiters: unblock one as new owner p = dequeue(&s->queue); p->status = READY; s->owner = p; slock(&readylock[p->cpuid]); enqueue(&readyQueue[p->cpuid], p); sunlock(&readylock[p->cpuid]); sunlock(&s->lock); // release spinlock int_on(SR); if (p) // if has unblocked a waiter reschedule(); }

10.8.7.5 Nested IRQ Interrupts with Preemption

int SGI2_handler() { int cpuid = get_cpuid(); PROC *p = readyQueue[cpuid]; if (p && p->priority > running->priority){ swflag[cpuid] = 1; // set task switch flag } } int irq_chandler(int intID) // called with interrupt ID { int cpuid = get_cpuid(); intnest[cpuid]++; // inc intsest count by 1 if (intID == 2){ // SGI_2 interrupt SGI2_handler(); } // other IRQ interrupt handlers: same as before *(int *)(GIC_BASE + 0x110) = (cpuIDstate==UNLOCKED{ // mutex is in unlocked state s->state = LOCKED; s->owner = running; sunlock(&s->lock); int_on(SR); return 1; } /************** mutex already locked ******************/ if (s->owner == running){ // already locked by this task printf("mutex already locked by you!\n"); sunlock(&s->lock); int_on(SR); return 1; } printf("TASK%d BLOCK ON MUTEX: ", running->pid); running->status = BLOCK; enqueue(&s->queue, running); // boost owner priority to running's priority if (running->priority > s->owner->priority){ if (s->owner->cpuid == cpuid){ // same CPU s->owner->priority = running->priority; // boost owner priority } else{ // not the same CPU PID[s->owner-cpuid] = s->owner->pid; // owner’s pid send_sgi(3, s->owner->cpuid, 0); // send SGI to recehdule task } } sunlock(&s->lock); tswitch(); // switch task int_on(SR); return 1; } int mutex_unlock(MUTEX *s) { PROC *p; int cpuid = get_cpuid(); int SR = int_off(); slock(&s->lock); // spinlock if (s->state==UNLOCKED || s->owner != running){ printf("%d mutex_unlock error\n", running->pid); sunlock(&s->lock); int_on(SR); return 0; } // caller is owner and mutex was locked

498

}

10 Embedded Real-Time Operating Systems

if (s->queue == 0){ // mutex has no waiter s->state = UNLOCKED; // clear locked state s->owner = 0; // clear owner running->priority = running->rpriority; } else{ // mutex has waiters: unblock one as new owner p = dequeue(&s->queue); p->status = READY; s->owner = p; slock(&readylock[p->cpuid]); enqueue(&readyQueue[p->cpuid], p); sunlock(&readylock[p->cpuid]); running->priority = running->realPriority; // restore priority if (p->cpuid == cpuid){ // same CPU if (p->priority > running->priority){ if (intnest==0) // not inside IRQ handler ttswitch(); // preemption now else{ swflag[cpuid] = 1; // defer preemption until IRQ end } else // caller and p on different CPUs => SGI to p's CPU send_sgi(2, p->cpuid, 0); // send SGI to reschedule task } sunlock(&s->lock); int_on(SR); return 1;

10.8.9 Demonstration of the SMP_RTOS System The SMP_RTOS system integrates all the functionalities discussed above to provide the system with the following capabilities: ( 1). A SMP kernel for task management. Tasks are distributed to different CPUs to improve concurrency in SMP. (2). Preemptive task scheduling by priority and timeslice. (3). Support nested interrupts. (4). Priority inheritance in mutex and semaphore operations. (5). Task communication by shared memory and messages. (6). Inter-processor synchronization by SGI. (7). A logging task for recording task activities to a SDC. We demonstrate the SMP_RTOS system by the sample program C10.9. For the sake of brevity, we only show the t.c file of the system: /******** t.c file of SMP_RTOS demo. system C10.9 ********/ #include "type.h" // system types and constants #include "string.c" struct semaphore s0, s1; // for task synchroonization int irq_stack[4][1024], abt_stack[4][1024], und_stack[4][1024]; #include "queue.c" // enqueue/dequeue functions #include "pv.c" // semaphores with priority inheritance #include "mutex.c" // mutex with priority inheritance #include "uart.c" // UARTs driver

10.8 Multiprocessor RTOS (SMP_RTOS) #include #include #include #include #include #include #include #include #include int int int int int

"kbd.c" "ptimer.c" "vid.c" "exceptions.c" "kernel.c" "wait.c" "fork.c" "sdc.c" "message.c"

// // // // // // // // //

499 KBD driver local timer driver LCD display driver exceptions handlers kernel init and task schedulers kexit() and kwait() functions kfork() function SDC driver message passing

copy_vectors(){ // same as before } config_gic() { // same as before } config_int(int N, int targetCPU){// same as before } irq_chandler(){ // same as before } APstart() { // same as before }

int SGI_handler() // sgi handler { int cpuid = get_cpuid(); printf("%d on CPU%d got SGI-2: ", running->pid, cpuid); if (readyQueue[cpuid]){ // if readQ nonempty printf("set swflag\n"); swflag[cpuid] = 1; } else{ // empty readyQ printf("NO ACTION\n"); swflag[cpuid] = 0; } } int klog(char *line){ send(line, 2) } char *logmsg = "log information"; int task5() { printf("task%d running: ", running->pid); klog("start"); mutex_lock(mp); printf("task%d inside CR\n", running->pid); klog("inside CR"); mutex_unlock(mp); klog("terminate"); kexit(0); } int task4() { int cpuid = get_cpuid(); while(1){ printf("%d on CPU%d: ", running->pid, cpuid); kfork((int)task5, 5, 2); // creat task5 task on CPU2 kdelay(running->pid*10000); kexit(0); } }

500 int task3() { int cpuid = get_cpuid(); while(1){ printf("%d on CPU%d: ", running->pid, cpuid); klog(logmsg); mutex_lock(mp); // lock mutex pid = kfork((int)task4, 4, 2); // create task4 on CPU2 kdelay(running->pid*10000); mutex_unlock(mp); pid = kwait(&status); } } int task2() // log task { int r, blk; char line[1024]; printf(“log task %d start\n”, running->pid); blk = 0; // start write to block 0 of SDC while(1){ printf("LOGtask%d: receiving\n", running->pid); r = recv(line); // receive a log line put_block(blk, line); // write log to SDC (1KB) block blk++; uprintf("%s", line); // display log to UART0 on-line printf("log: "); } } int task1() { int status, pid, p3; int cpuid = get_cpuid(); fs_init(); kfork((int)task2, 2, 1); V(&s0); // start up logtask kfork((int)task3, 3, 2); while(1) pid = kwait(&status); } int run_task(){ // same as before } int main() { int cpuid = get_cpuid(); enable_scu(); fbuf_init(); kprintf("Welcome to SMP_RTOS in ARM\n"); sdc_init(); kbd_init(); uart_init(); ptimer_init(); ptimer_start(); msg_init(); config_gic(); kprintf("CPU0 initialize kernel\n");

10 Embedded Real-Time Operating Systems

10.8 Multiprocessor RTOS (SMP_RTOS)

}

501

kernel_init(); mp = mutex_create(); s0.lock = s1.lock = s0.value = s1.value = 0; s0.queue = s1.queue = = 0; printf("CPU0 startup APs: "); int *APaddr = (int *)0x10000030; *APaddr = (int)apStart; send_sgi(0x0, 0x0F, 0x01); printf("CPU0 waits for APs ready\n"); while(ncpu < 4); kfork((int)task1, 1, 0); run_task();

Figure 10.10 shows running the SMP_RTOS system. The top part of Fig. 10.10 shows the online log information generated by the logging task. The bottom part of Fig. 10.10 shows the startup screen of SMP_RTOS.

Fig. 10.10 Demonstration of SMP_RTOS

502

10 Embedded Real-Time Operating Systems

In the main() function of the SMP_RTOS system, the initial process on CPU0 creates a task P1, which behaves as the INIT process. P1 creates task2 as the logging task on CPU1 and task3 on CPU2. Then P1 executes a loop to wait for any ZOMBIE child. Task3 first locks up a mutex. Then it creates task4, which creates task5, all on CPU2 but with different priorities. When task5 runs, it tries to acquire the same mutex lock, which is still held by task3. In this case, task5 will be blocked on the mutex, but it raises the effective priority of task3 to task5, which demonstrates priority inheritance. In addition, the system also shows task preemption and inter-processor synchronization by SGIs. All the tasks may send log information as messages to the logging task, which invokes the SDC driver directly to write the log information to a (1 KB) block of the SDC. It also writes the log information to a UART port to display it online. Variation to the logging scheme is listed as a programming project in the Problems section.

10.9 Summary This chapter covers embedded real-time operating systems (RTOS). It introduces the concepts and requirements of real-time systems. It covers the various kinds of task scheduling algorithms in RTOS, which include RMS, EDF, and DMS. It explains the problem of priority inversion due to preemptive task scheduling and shows how to handle priority inversion and task preemption. It includes case studies of several popular real-time OS and presents a set of general guidelines for RTOS design. It shows the design and implementation of a UP_RTOS for uniprocessor (UP) systems. Then it extends the UP_RTOS to a SMP_RTOS, which supports nested interrupts, preemptive task scheduling, priority inheritance, and inter-processor synchronization by SGI.

List of Sample Programs C10.1. UP_RTOS with periodic tasks and round-robin scheduling C10.2. UP_RTOS with periodic tasks and preemptive scheduling C10.3. UP_RTOS with dynamic tasks C10.4. SMP_RTOS kernel for task management C10.5. MP_RTOS, a localized SMP system C10.6. Nested interrupts in SMP_RTOS C10.7. Task preemption by SGI in SMP_RTOS C10.8. Timesliced task scheduling in SMP_RTOS C10.9. Demonstration of SMP_RTOS

Problems 1. In the UP_RTOS and SMP_RTOS systems, the logging task saves log information to a SDC directly, bypassing the file system. Modify the programs C10.1 and C10.8 to let the logging task save log information to a file in an (EXT2/3) file system on a SDC. Discuss the advantages and disadvantages of using log files. 2. In the UP_RTOS kernel, priority inheritance in only one-level. Implement priority inherence in a chain of nested requests for mutex or semaphore locks. 3. For the example UP_ROTS system C10.3, perform a response time analysis to determine whether the tasks will meet their deadlines. Verify whether the task can meet their deadlines empirically. 4. In the MP_RTOS system, tasks are bound to separate CPUs for parallel processing. Devise a way to support task migration, which allows tasks to be distributed to different CPUs to balance the processing load. 5. In the SMP_RTOS system, tasks running on different CPUs may interfere with one another since they all execute in the same address space. Use MMU with non-uniform VA to PA mapping to provide each CPU with a separate VA space. 6. In a SMP RTOS, interrupts may be handled directly by ISRs or by pseudo-tasks. Design and implement RTOS systems which employ these interrupt processing schemes and compare their performances.

References

503

References 1. Audsley, N.C: “Deadline Monotonic Scheduling”, Department of Computer Science, University of York, 1990 2. Audsley, N. C, Burns, A., Richardson, M. F., Tindell, K., Wellings, A. J.: “Applying new scheduling theory to static priority pre-emptive scheduling”. Software Engineering Journal, 8(5):284–292, 1993. 3. Buttazzo, G. C.: Counter RMS claims Rate Monotonic vs. EDF: Judgment Day, Real-Time Systems, 29, 5–26, 2005 4. Dietrich, S., Walker, D., “The evolution of Real-Time Linux”, http://www.cse.nd.edu/courses/cse60463/www/amatta2.pdf, 2015 5. DNX: DNX RTOS, http://www.dnx-rtos.org, 2015 6. FreeRTOS: FreeRTOS, http://www.freertos.org, 2016 7. Hagen, W. “Real-Time and Performance Improvements in the 2.6 Linux Kernel”, Linux journal, 2005 8. Josheph, M and Pandya, P., “Finding response times in a real-time system”, Comput. J., 1986, 29, (5). pp. 390–395 9. Jones, M. B. (December 16, 1997). “What really happened on Mars?”. Microsoft.com. 1997 10. Labrosse, J.: Micro/OS-II, R&D Books, 1999 11. Leung, J.Y T., Merrill, M. L: A note on preemptive scheduling of periodic, real-time tasks. Information Processing Letters, 11(3):115–118, 1982 12. Linux: “Intro to Real-Time Linux for Embedded Developers”, https://www.linux.com/blog/intro-real-time-linux-embedded-developers 13. Liu, C. L.; Layland, J. “Scheduling algorithms for multiprogramming in a hard real-time environment”, Journal of the ACM 20 (1): 46–61, 1973 14. Micrium: Micro/OS-III, https://www.micrium.com, 2016 15. Nutt, G. NuttX, Real-Time Operating system, nuttx.org, 2016 16. POSIX 1003.1b: https://en.wikipedia.org/wiki/POSIX, 2016 17. QNX: QNX Neutrino RTOS, http://www.qnx.com/products/neutrino-rtos, 2015 18. Reeves, G. E.: “What really happened on Mars? – Authoritative Account”. Microsoft.com, 1997. 19. RTLinux: RTLinux, https://en.wikipedia.org/wiki/RTLinux 20. Sha, L., Rajkumar, R., Lehoczky, J.P.: (September 1990). “Priority Inheritance Protocols: An Approach to Real-Time Synchronization”, IEEE Transactions on Computers, Vol 39, pp1175–1185, 1990 21. VxWorks: VxWorks: http://windriver.com, 2016. 22. Yodaiken, V.: “The RTLinux Manifesto”, Proceedings of the 5th Linux Conference, 1999 23. Wang, K. C., “Design and Implementation of the MTX Operating System”, Springer International Publishing AG, 2015

Chapter 11 ARMv8 Architecture and Programming

11.1 ARMv8 Architecture and Processors The ARMv8 architecture [1–4] is a successor to the ARMv7. Unlike ARMv7, which only supports 32-bit code, ARMv8 supports both 64-bit and 32-bit executions. It introduces the ability to perform execution with 64-bit-wide registers while preserving backward compatibility with existing 32-bit ARMv7 software. Representative ARMv8 processors include Cortex-A53 and Cortex-A57 [5, 6]. The Cortex-A53 processor is a mid-range, low-power processor with between one and four cores in a single cluster, each with an L1 cache subsystem, an optional integrated GICv3/4 interface, and an optional L2 cache controller. The Cortex-A57 processor is targeted at mobile and enterprise computing applications including compute intensive 64-bit applications such as high-end computer, tablet, and server products. The Cortex-A57 processor features cache coherent interoperability with other processors, including the ARM Mali family of graphics processing units (GPUs) for GPU compute, and provides optional reliability and scalability features for high-performance enterprise applications. It provides significantly more performance than the ARMv7 Cortex-A15 processor, at a higher level of power efficiency. The inclusion of cryptography extensions improves performance on cryptography algorithms by ten times over the previous generation of processors.

11.1.1

Exception Levels

The ARMv8 architecture classifies processor executions into four exception levels, denoted by EL0 to EL3 [7]. The notation of ELn denotes exception level n, with larger n value means higher privilege level. The concept of exception levels is fundamental to the ARMv8 architecture. All operations take place at a defined exception level. Exception levels provide a logical separation of software execution privilege that applies across all operating states of the ARMv8 architecture. It supports and realizes the concept of hierarchical protection domains common in computer science theory. It is similar to the four layers of protection rings in the x86 architecture of Intel [8], except the privilege levels are reversed. In the Intel architecture, the innermost ring 0 is the highest privilege level, and the outermost ring 3 the lowest privilege level. The following is a typical example of what software runs at each exception level in the ARMv8 architecture: EL0: EL1: EL2: EL3:

Normal user applications at the unprivileged level Operating system kernel, privileged level Hypervisor for virtualization to support multiple operating systems Low-level firmware, including the secure monitor

Figure 11.1 depicts the various kinds of software that can run at different exception levels

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. C. Wang, Embedded and Real-Time Operating Systems, https://doi.org/10.1007/978-3-031-28701-5_11

505

506

11 ARMv8 Architecture and Programming

Fig. 11.1 ARMv8 exception levels

Fig. 11.2 ARMv8 exception levels in normal and secure worlds

11.1.2

Secure World and Normal World

In addition to exception levels, ARMv8 also defines two security states: secure and non-secure. The secure state is referred to as the secure world, and the non-secure state is referred to as the non-secure or normal world. The ARM TrustZone technology allows the system resources to be partitioned between the secure and normal worlds. This provides protection against certain software attacks and hardware attacks. The secure monitor acts as a gateway for moving between the normal and secure worlds. The secure monitor in the ARMv8 architecture is at a higher exception level than all other software. Figure 11.2 shows the exception levels in the normal and secure world.

11.1.3

Execution States

ARMv8 supports both 64- and 32-bit executions. Accordingly, it defines two execution states, AArch64 and AArch32, denoted by Ar64 and Ar32 for short. As their names imply, Ar64 executes instructions in 64-bit mode and uses 64-bit-wide general-purpose registers. Ar32 executes 32-bit code and uses 32-bit registers to provide backward compatibility with ARMv7. However, the two execution states in ARMv8 are separate and incompatible. At each exception level, an execution must be either in Ar64 state or in Ar32 state. It is not possible to have mixed 64-bit and 32-bit executions at the same time. We shall explain this later in Sect. 11.1.5, which shows how to change execution states. In order to avoid this confusion, we shall ignore the 32-bit execution state and consider only the 64-bit execution state.

11.1 ARMv8 Architecture and Processors

11.1.4

507

Change Exception Levels

In ARMv6 and ARMv7, processors may execute in different modes at different privilege levels P0 to P2, which are as follows: USER: Unprivileged User mode at PL0 SYS: System mode shares same registers with User mode at PL1 SVC: Supervisor mode at PL1 ABT: Abort mode at PL1 UND: Undefined mode at PL1 IRQ: IRQ interrupt mode at PL1 FIQ: FIQ interrupt mode at PL1 ---------- with HYP/MON extensions --------HYP: Hypervisor extension mode at PL1 MON: Monitor extension mode at PL2

Transitions between the various modes are by exceptions, which cause the processor to enter a privileged mode to handle the exception. Exceptions include interrupts, traps, and special system calls of SVC, HVC, and SMC. Specifically, interrupts, traps, and SVC call cause transitions from the current mode at PL0 or PL1 into a privileged mode at PL1: Interrupts: Traps: SVC call:

to FIQ or IRQ mode to ABT or UND mode to SVC mode

While in a privileged modes at PL1, the processor may enter HYP mode by HVC (hypervisor call). Transition from the privileged mode at PL1 to the secure monitor mode at PL2 is by SMC (Secure Monitor Call) [10]. HVC call: SMC call:

to hypervisor mode to Secure Monitor mode

At the end of exception handling, the processor may return to a previous mode at the same privilege level, e.g., from the IRQ mode back to the SVC mode or to a lower privilege level, e.g., from the SVC mode back to the User mode. In ARMv8, this model of handling exceptions is retained in the 32-bit execution state, but exception handling in 64-bit mode is totally different. First, ARMv8 defines two security states, secure world or normal world, and four privilege levels L0 to L3. In ARMv8, the execution modes at PL1 of ARMv7 (Supervisor, IRQ, FIQ, ABT, UND, and System modes) are combined into a single group, which is in the secure world if EL3 is executing in the Ar32 state. Otherwise, they are all in the non-secure world of EL1. Figure 11.3 shows the mapping of the various exception modes of ARMv7 into a single group at EL1 in ARMv8. These changes have a significant impact on exception handling. In ARMv7, each mode, such as SVC, IRQ, FIQ, ABT, UND, and SYS, has a separate stack pointer. An exception must be handled in a specific mode using the corresponding stack.

Fig. 11.3 Processor modes in AR64

508

11 ARMv8 Architecture and Programming

This is completely changed in ARMv8. Instead of a separate stack pointer in each exception mode as in ARMv7, there is only one stack pointer SP_ELn in each exception level. An OS kernel running at EL1 can handle all exceptions with a single SP_EL1 stack, rather than different stacks of different exception modes, a very cumbersome feature in ARMv7. Exception handling will be covered in more detail in later sections. In ARMv8, exception level can only be changed by exceptions, either when taking an exception or when returning from an exception handling. In ARMv8, exceptions include the following: Interrupts such as IRQ and FIQ Memory system aborts (ABT) Undefined instructions (UND) System Errors (SError) System calls (SVC) from EL0 to OS kernel at EL1 Hypervisor calls (HVC) to EL2 Secure Monitor Calls (SMC) to EL3 Changing exception levels obeys the following rules: . Exceptions may occur in the lowest exception level EL0, but there is no exception handling in EL0. All exceptions must be handled in a privileged exception level EL1 to EL3. . An exception usually takes the execution to a higher exception level, e.g., an exception occurred in EL0 causes execution to EL1, but it may also stay in the same level, e.g., from EL1 to EL1. An exception can never change to a lower exception level. . End of exception handling and return to the previous exception level is performed by the new ERET operation, which restores the saved processor state and returns to the address when the exception was taken. ERET is similar to IRET (return from interrupts) in Intel x86. . Return from an exception (by ERET) can stay at the same exception level or enter a lower exception level. It can never move to a higher exception level. When an exception occurs, the processor normally enters a higher exception level to handle the exception, but it may stay at the same level if there is either no higher level or no need to migrate to a higher level. At the end of exception handling, it returns to the previous exception level, which is usually a lower level but may stay in the same level again. This is the standard practice of handling nested interrupts used in all OS kernels. It is welcome news to see ARMv8 now follows the same principle to remedy the defects of its early architectures of ARMv6 and ARMv7. The above rules of changing exception levels can be clarified by some concrete example cases: (1). Assume that the processor is executing at the exception level EL0. Without any exception, the processor will stay in EL0. Before letting the processor run in EL0, we may program the control register to let the processor enter EL1 on any exception in EL0. While executing at EL0, exceptions may occur by device interrupts. If so, the processor will enter EL1 on any interrupt. Even if there are no interrupts, we may cause a trap by letting the processor execute bad instructions that cause memory access violation or undefined instruction, which will cause the CPU to enter EL1 to handle the trap. Even if without these two reasons, we can explicitly request to enter EL1 by SVC call, which always causes the processor to enter EL1. So there are many options to let the processor change the exception level from EL0 to EL1. When execution of the exception handling routine in EL1 finishes, we may let the processor return to EL0 by ERET, which restores saved processor status from SPSR_EL1 register and return to the place of exception by saved address in the Exception Link Register ELR_EL1. It is worthy to compare the ARM exception handling scheme with that of Intel: In Intel x86, all the information are saved in the higher-level stack when an exception occurs. The IRET instruction causes return (from interrupts) by popping saved information off the stack [25]. In the ARM scheme, the processor state is saved in SPSR_ELn register and return address in Exception Link Register ELR_RLn. Software must save these registers (in stack) in order to handle nested interrupts. (2). When the power is on or following a reset, the processor usually begins execution in the highest exception level. In ARMv7, it begins from the SVC mode. So the OS kernel starts execution in SVC mode. Chapter 7 of this book showed how to let an OS kernel execute process in User mode. In ARMv8, assuming the highest implemented exception level is EL3, the system will start execution in EL3. A crucial question is how to let the processor go into EL1 to start an OS

11.2 ARMv8 Registers

509

Fig. 11.4 Ar64 and Ar32 states

kernel? The trick is to pretend the processor was executing in EL1, and it entered EL3 by an exception (even though there was none). If so, SPSR_EL3 and ELR_EL3 must contain information for the processor to return to EL1. While in EL3, we can access control registers in EL1 to set up their initial values and fabricate a set of saved processor status and return address in SPSR_EL3 and ELR_EL3, followed by ERET. The processor would obey the contents of SPSR_EL3 and ELR_EL3 to return to the entry address of the OS kernel in EL1. This is also the standard technique for an OS kernel to run User mode processes.

11.1.5

Change Execution States

The ARMv8 architecture defines two execution states, AArch64 and AArch32, denoted by Ar64 and Ar32 for short. Each state is used to describe execution using 64-bit-wide general-purpose registers or 32-bit-wide general-purpose registers, respectively. In ARMv8, the Ar64 and Ar32 execution states are incompatible. An Ar64 OS may support both Ar64 and Ar32 applications, and an Ar32 OS can only support Ar32 applications. Even under an Ar64 OS, an application must be in either Ar64 or Ar32 state, but not in mixed execution states. These restrictions are better illustrated in Fig. 11.4. Figure 11.4 shows an Ar64 hypervisor that hosts both an Ar64 and Ar32 OS. An Ar64 OS can host a mix of Ar64 and AAr32 applications, but an Ar32 OS can only host Ar32 applications. Execution states can only be changed by changing the exception level. Taking an exception can change execution state from Ar32 to Ar64, and returning from an exception can change it from Ar64 to Ar32. As an example, it is possible to have 32-bit and 64-bit applications running under a 64-bit OS. In this case, the 32-bit application can switch to the 64-bit application by the following means. First, it executes a Supervisor Call (SVC) or receives an interrupt, which causes it to enter the OS kernel at EL1 in Ar64 state. The OS kernel can switch task and return to the 64-bit task at EL0 in Ar64 state. Likewise, a 64-bit application can switch to a 32-bit application by the same mechanism. Practically speaking, this means that we cannot have a mixed 32-bit and 64-bit application, because there is no direct way of calling between them. The incompatibility between 64-bit and 32-bit applications of ARMv8 is a big surprise, which may hinder the use of mixed 64-bit and 32-bit applications in ARM systems.

11.2 ARMv8 Registers ARMv8-A provides 31 64-bit general-purpose registers X0 to X30, which are always accessible in all exception levels. Each register (X0–X30) is 64 bits wide, but each also has a 32-bit form (W0–W30). Figure 11.5 shows the register format. Each 32-bit W register forms the lower half of a corresponding 64-bit X register. Read from a W registers ignore the higher 32 bits of the corresponding X register and leave them unchanged. Write to a W registers set the higher 32 bits of the X register to zero. As an example, writing 0xFFFFFFFF into W0 sets X0 to 0x00000000FFFFFFFF. Among the 31 general registers, X29 is the stack frame pointer, and X30 is the link registers for function calls. We shall explain their roles in Sect. 11.4.7, which covers function call conventions in ARMv8.

510

11 ARMv8 Architecture and Programming

Fig. 11.5 64-bit general- purpose register format

Fig. 11.6 ARMv8 special registers

11.2.1

ARMv8 Special Registers

ARMv8 has many special registers, which are shown in Fig. 11.6. In Fig. 11.6, XZR/WZR is the zero register, which contains value 0. It can be used to set other items to 0 by register to register transfer without the need of an immediate value in the instruction. SP_ELn are stack pointers at the exception level ELn. PC is the program counter shared by all exception levels. SPSR_ELn and ELR_Ln are Saved Processor Status register and Exception Linker register of ELn (n > 0).

11.2.2

Stack Pointers

The stack pointer (SP_ELn) is a banked register that points to the top of the stack at that exception level. By default, taking an exception selects the stack pointer for the target exception level (SP_ELn). For example, taking an exception to EL1 selects SP_EL1. However, when in Ar64 execution state at an exception level other than EL0, the processor can use either: • The stack pointer that is associated with that exception level (SP_ELn) • The stack pointer that is associated with EL0 (SP_EL0) In other words, ARMv8 allows an exception handler at high exception levels to use stack pointer at EL0. This unique feature of ARMv8 is unavailable in other architectures, but its usage and effectiveness are both highly questionable.

11.2.3

Program Counter

The program counter (PC) contains the next instruction address. However, in ARMv8, the PC register cannot be accessed directly as in ARMv7. The only instructions that use the PC are those whose function is to compute a PC-relative address (ADR, ADRP, literal load, and direct branches) and branch-and-link instructions that store a return address in the link register (BL and BLR). The only way to modify the program counter PC is using branch, exception generation, and exception return instructions.

11.2.4

Exception Link Registers

The Exception Link Register ELR_ELn is also banked for each n > 0. It holds the return address of an exception, and the processor status at the point of exception is saved in the SPSR_ELn register. They are used by the IRET instruction to return from an exception handling.

11.2 ARMv8 Registers

11.2.5

511

PSTATE and SPSR Registers

Unlike ARMv7 which has a Current Processor Status register (also called Current Program Status register) CPSR, ARMv8 does not have a single CPSR. The current processor status information is maintained in many system registers, which are collectively known as the PSTATE. Important fields of PSTATE include the following: NZCV: condition flag, same as in ARMv7; accessible in EL0 DAIF: Debug, SError, IRQ, FIQ mask bits SS: Software Step bit. IL: Illegal execution state bit EL (2): Exception level nRW execution state 0 = 64-bit 1= 32-bit SP stack pointer selector. 0 = SP_EL0, 1 = SP_ELn PSTATE fields are accessed using special-purpose registers, which are read using the MRS instruction and written using the MSR instructions. The special registers are as follows: CurrentEL: DAIF: NZCV: SPSel:

Holds the current exception level. Specifies the current interrupt mask bits D, A, I, F. Holds the condition flags N, Z, C, V. At EL1 or higher, this selects SP_ELn or SP_EL0 as stack pointer.

For example, to access the SPSel, use the following instructions: MRS X0, SPSel MSR SPSel, X0

// Read SPSel into X0 // Write X0 to SPSel

Exception handler code can use this feature to switch from using SP_ELn to SP_EL0. This switch is controlled by writing to the SPSel bit, as shown in the following code: MSR SPSel, #0 // switch to SP_EL0 MSR SPSel, #1 // switch to the SP of current Exception level ELn ARM documents justify the use of either SP_ELn or SP_EL0 for exception processing as follows. SP_EL1 might point to a piece of memory that holds a small stack that the kernel can guarantee to always be valid. SP_ EL0 might point to a kernel task stack that is larger, but not guaranteed to be safe from overflow. By using SP_EL0 allows the OS kernel to access User mode information at EL0 easily. However, none of these reasons are convincing enough, and I do not know of any OS that uses this exotic feature in their kernel. I can predict this will be another ARM flop in its next architectural release. Other PSTATE fields can be accessed using the following operands: MSR DAIFSet, #Imm4 // Used to set any or all of DAIF to 1 MSR DAIFClr, #Imm4 // Used to clear any or all of DAIF to 0 SPSR Registers AVMv8 has a Saved Processor Status register (also called Saved Program Status register) SPSR_ELn at each exception level n > 0. When taking an exception, the processor state is saved in the relevant SPSR_ELn, in a similar way to the SPSR in ARMv7. The SPSR holds the value of PSTATE fields before taking an exception. It is used to restore the value of PSTATE fields when executing an exception return. Figure 11.7 shows the SPSR contents when exceptions are taken from:

Fig. 11.7 Saved Processor Status register

11 ARMv8 Architecture and Programming

512

In the SPSR register, NZCV = condition bits SS = Software Step IL = value of PSTATE.IL before the exception is taken DAIF = mask bits for Debug, SError, IRQ, and FIQ M = execution state: 0 for Ar64 state, 1 for Ar32 state M[3:0] = exception level of exception taken from

11.3 ARMv8 System Registers In Ar64 state of ARMv8, system configuration is controlled through system registers. Some important control registers are as follows: System control registers SCTLR_ELn Vector base registers VBAR_ELn Translation table base registers TTBR0_ELn, TTBR1_Eln Registers that have the suffix ELn also have a separate, banked copy in some or all the levels, except EL0. The ELn suffix of each register specifies the lowest exception level that it can be accessed from. For example: TTBR0_EL1 is accessible from EL1, EL2, and EL3. TTBR0_EL2 is accessible from EL2 and EL3. In ARMv7, system registers are accessed through coprocessor 15 (CP15). In ARMv8, system registers are accessed by MSR and MRS instructions. The code to access a system register takes the following form: MRS X0, TTBR0_EL1 MSR TTBR0_EL1, X0

// Copy TTBR0_EL1 into X0 // Copy X0 into TTBR0_EL1

11.4 Aarch64 Instructions In Ar64, instructions are still 4 bytes, but they support 64-bit operations unavailable in Ar32. Instruction encodings in Ar64 and Ar32 are different. Programmers only need to use the instructions by standard instruction mnemonics, such as ADD and SUB. ADD W0, W1, W2 ADD X0, X1, X2 ADD X0, X1, #42

// add 32-bit registers // add 64-bit registers // add immediate to 64-bit register

The following is a brief summary of instructions in Ar64 state.

11.4.1

Data Processing Instructions

The general format of data processing instructions is Instruction Rd, Rn, Op2 // Rd=dest, Rn, Op2 are operands The use of R indicates that it can be either an X or a W register. The second operand Op2 may be a register, a modified register, or an immediate value. The data processing operations include the following: Arithmetic: Logical : Comparison: Move :

ADD, AND, CMP, MOV,

SUB, NEG BIC, ORR, EOR TST MVN

11.4 Aarch64 Instructions

513

Some instructions also have an S suffix, indicating that the instruction sets flags. This includes ADDS, SUBS, ANDS, and BICS. There are other flag-setting instructions, notably CMP and TST, but these do not take an S suffix. Examples of arithmetic and logical instructions are as follows: ADD W0, W1, W2, LSL #3 SUBS X0, X4, X3, ASR #2 MOV X0, X1 CMP W3, W4 ADD W0, W5, #27

// // // // //

W0 = W1 + (W2 > 2), set flags Copy X1 to X0 Set flags based on W3 - W4 W0 = W5 + 27

The logical operations are essentially the same as the corresponding Boolean operators operating on individual bits of the register. The BIC (Bitwise Bit Clear) instruction performs an AND of the register that is the first after the destination register, with the inverted value of the second operand. For example, to clear bit [11] of register X0, use the following: MOV X1, #0x800 BIC X0, X0, X1 Multiply and Divide Instructions The multiply instructions provided are broadly similar to those in ARMv7, but with the ability to perform 64-bit multiplies in a single instruction. There are multiply instructions that operate on 32-bit or 64-bit values and return a result of the same size as the operands. For example, two 64-bit registers can be multiplied to produce a 64-bit result. Likewise, two 32-bit registers can be multiplied to produce a 32-bit result: MUL X0, X1, X2 UDIV W0, W1, W2 SDIV X0, X1, X2

// X0 = X1 * X2 // W0 = W1 / W2 (unsigned, 32-bit divide) // X0 = X1 / X2 (signed, 64-bit divide)

However, you cannot directly multiply a 32-bit W register by a 64-bit X register.

11.4.2

Shift Operations

The following instructions are specifically for shifting: Logical shift left (LSL). The LSL instruction performs multiplication by a power of 2. Logical shift right (LSR). The LSR instruction performs division by a power of 2. Arithmetic shift right (ASR). The ASR instruction performs division by a power of 2, preserving the sign bit. Rotate right (ROR). The ROR instruction performs a bitwise rotation, wrapping the bits rotated from the LSB into MSB.

11.4.3

Conditional Instructions

The Ar64 instruction set does not support conditional execution for every instruction. The Ar64 instruction set enables conditional execution of only program flow control branch instructions. This is in contrast to Ar32 where most instructions can be predicated with a condition code. ARM cites the reason to drop predicated execution for every instruction is because it does not offer sufficient benefit to justify its significant use of opcode space. The real reason is probably due to the fact that few C compilers, e.g., GCC, generate code that uses conditional instructions. This proves again that something looks good in theory or on paper may not survive the real-world test.

514

11 ARMv8 Architecture and Programming

11.4.4

Memory Access Instructions

As with all prior ARM processors, the ARMv8 architecture is a load/store architecture. This means that no data processing instruction operates directly on data in memory. The data must first be loaded into registers inside the processor, modified, and then stored to memory. The program must specify an address, the size of data to be transferred, and a source or destination register. There are additional load/store instructions which provide further options, such as non-temporal load/store, load/store exclusives, and acquire/release, which are useful in multiprocessing [Chap. 9 of this book]. The general form of a load instruction is as follows: LDR LDR LDR LDR LDR

Rt, X0, X0, X0, X0,

[X1] [X1, #8] [X1, X2] [X1, X2, LSL, #3]

// // // //

Load Load Load Load

from from from from

the address in X1 address X1 + 8 address X1 + X2 address X1 + (X2 0), and the handler also executes in the same level ELn but uses SP_ELn. Group 3 is used if the exception is taken from a lower ELn that executes in Ar64. Group 4 is used if the exception is taken from a lower ELn that executes in Ar32. The following code segment shows a typical vector table for EL3 vectors for exceptions in Ar64: // Typical exception vector table code. .balign 0x800 // KCW: at 2KB boundary // Group 1 of table : handler using SPL0 that NOBODY use it Vector_table_el3: curr_el_sp0_sync: // The exception handler for a synchronous // exception from the current EL using SP0. .balign 0x80 // 128 bytes separator curr_el_sp0_irq: // The exception handler for an IRQ exception // from the current EL using SP0. .balign 0x80 curr_el_sp0_fiq: // The exception handler for an FIQ exception // from the current EL using SP0. .balign 0x80 curr_el_sp0_serror: // The exception handler for a System Error // exception from the current EL using SP0. // Group 2 of table: use current level stack, which make sense .balign 0x80 curr_el_spx_sync: // The exception handler for a synchronous // exception from current EL using current SP. .balign 0x80 curr_el_spx_irq: // The exception handler for an IRQ exception // from current EL using current SP. .balign 0x80 curr_el_spx_fiq: // The exception handler for an FIQ from // the EL using current SP.

11.5 ARMv8 Exception and Interrupt Handling

.balign 0x80 curr_el_spx_serror:

521

// The exception handler for a System Error // exception from current EL using current SP.

// Group 3 of table: for Ar64, exception occurred in lower level .balign 0x80 lower_el_aarch64_sync: // The exception handler for a synchronous // exception from a lower EL (AArch64). .balign 0x80 lower_el_aarch64_irq: // The exception handler for an IRQ from a // lower EL(Aarch64). .balign 0x80 lower_el_aarch64_fiq: // The exception handler for an FIQ from a // lower EL(Aarch64). .balign 0x80 lower_el_aarch64_serror: // The exception handler for a System Error // exception from a lower EL(Aarch64). // Group 4 of table: for Ar32, exception occurred in lower level .balign 0x80 lower_el_aarch32_sync: // The exception handler for a synchronous // exception from a lower EL (Aarch32). .balign 0x80 lower_el_aarch32_irq: // The exception handler for an IRQ exception // from a lower EL (Aarch32). .balign 0x80 lower_el_aarch32_fiq: // The exception handler for an FIQ exception // from a lower EL (Aarch32). .balign 0x80 lower_el_aarch32_serror: // The exception handler for a System Error // exception from a lower EL (Aarch32). If we only consider Ar64 execution state and the handler always uses stack pointer of the current ELn, then Groups 1 and 4 of the vector table become irrelevant and can be ignored. This leaves only two possible cases, either Group 2 or Group 3. As an example, assuming that exceptions and interrupts are not routed to either EL2 or EL3, then all exceptions must be handled in EL1 by default. Assume also SPSel in PSTATE is set to 1 (to select SP_ELn). Case 1: If an IRQ interrupt occurs in the OS kernel at EL1, it will be handled in EL1 within the OS kernel, and then Group 2 applies and execution takes place from address VBAR_EL1 + 0x280, using SP_EL1 stack. Case 2: If an exception (including interrupts) occurs in EL0, which can only be handled in EL1, then Group 3 applies and execution takes place from the address VBAR_EL1 + 0x400, using SP_EL1 also. If the SPSel bit in PSTATE is 0, the handler would use SP_EL0. In fact, the handler code can switch from using SP_ELn to using SP_EL0 by changing the SPSel bit in PSTATE, which is most astounding. Why ARM allows for this case is unclear. ARM documents cited two possible reasons as follows. First, SP_EL1 may point to a piece of memory that holds a small stack which the kernel can always guarantee to be valid. SP_EL0 may point to a kernel task stack that is larger, but not guaranteed to be safe from overflow. Second, switching to use SP_EL0 inside the handler facilitates access to the values from the thread in the handler. However, none of these reasons make any sense. For the first reason, why cannot the OS kernel uses a large stack at EL1? For the second reason, what kind of values can be accessed from a potentially corrupted stack in EL0? So, none of the reasons seem very convincing. I am not aware of any operating system that uses this feature, and it shall not be used in this book either.

11.5.4

Return from Exception

At the end of an exception handler, control returns to the point of original exception by the ERET instruction, which restores the pre-exception PSTATE from SPSR_ELn and returns execution back to the original location by restoring PC from ELR_ELn.

522

11 ARMv8 Architecture and Programming

Note that ELR_Eln is different from the register X30, which is the link register in subroutine calls by BL or BLR. Return from subroutines is by the RET instruction. In contrast, the ELR_ELn register is used to store the return address from an exception, and return from exceptions is by the ERET instruction. These ARM features are almost the same as taking an exception and IRET in Intel X86. In Intel X86, when an exception occurs, the CPU automatically saves the return address and CPU status register in stack. In addition, the CPU may also push an error status on stack. In contrast, ARM processors do not use the stack but keeps such information in separate special registers for faster processing speed, at the expense of more programming complexity.

11.5.5

Interrupt Handling

Interrupts are special kinds of exceptions. Interrupt handling follow the same principle of exception handling. However, interrupt handling deserves a special treatment due to its physical nature and interactions with devices and interrupt controller hardware. The reader may read the principle of interrupts and interrupt handling in Chap. 3 of this book. This section briefly reviews these concepts and techniques. ARM commonly uses interrupt to mean interrupt signal. On ARM A-profile and R-profile processors, that means an external IRQ or FIQ interrupt signal. The architecture does not specify how these signals are used. FIQ is often reserved for secure interrupt sources. Both A-profile and R-profile processor cores have two physical interrupt pins, IRQ and FIQ. An FIQ can mask IRQs, but an IRQ cannot mask FIQs. Historically, the architecture had features to permit lower-latency processing of FIQs. When the processor takes an exception to Ar64 execution state, all the DAIF interrupt masks in PSTATE are set automatically. This means that further interrupts are disabled. If the software is to support nested interrupts, e.g., to allow a higher-priority interrupt to interrupt the handling of a lower-priority interrupt, then the software must explicitly re-enable interrupts using the following instruction: MSR DAIFClr, #imm The immediate value is in fact a 4-bit field. There are also masks for A (for SError) and D (for Debug). An interrupt handler begins execution from an entry in the vector table, which is typically in assembly code. After initial processing, it usually calls a handler function in C, which actually handles the interrupt. The execution control flow of such an interrupt handler is shown in Fig. 11.8. Fig. 11.8 Interrupt handling

11.5 ARMv8 Exception and Interrupt Handling

11.5.6

523

A Simple Interrupt Handler

The following code demonstrates a simple interrupt handler for non-nested interrupts: IRQ_Handler // Stack all corruptible registers and save PCS corruptible interrupts STP X0, X1, [SP, #-16]! // push x0, x1 STP X2, X3, [SP, #-16]! // push x2, x2 ... // PUSH rest of corruptible registers BL read_irq_source // a function to find out interrupt source // and clear the interrupt request BL C_irq_handler // the C interrupt handler // restore from stack corruptible registers LDP X2, X3, [SP], #16 // pop x2, x3 LDP X0, X1, [SP], #16 // pop x0, x1 ... ERET However, from a performance point of view, the following sequence can also be used:IRQ_Handler SUB SP, SP, # STP X0, X1, [SP], +16 STP X2, X3, [SP], +16 ... // Interrupt handling BL read_irq_source BL C_irq_handler

// // // //

SP = SP - Store X0,X1 at base of frame Store X2,X3 at base of frame + 16 bytes more register storing

// find out interrupt source, clear request // the C interrupt handler

// restore from stack the corruptible registers LDP X0, X1, [SP, #-16]! // Load X0, X1 at base of frame LDP X2, X3, [SP, #-16]! // Load X2, X3 at base of frame + 16 bytes ... // more register loading ADD SP, SP, # // Restore SP at its original value ... ERET

11.5.7

Nested Interrupt Handler

The ARMv7 architecture was not designed to handle nested interrupts efficiently. The main reason is due to its lacking of a link register dedicated to interrupt processing. In ARMv7, in order to handle nested interrupts, the interrupt handler routine must change from interrupt mode stack to SYS mode stack, back and forth, which was a major flop of the ARMv7 architecture. In ARMv8, linker registers for subroutine calls and exception processing are different, and each exception level n > 0 has its own ELR_ELn, which avoids the link register corruption problem in ARMv7. In ARMv8, a nested interrupt handler only requires a little extra code. It only needs to preserve on the stack the contents of SPSR_EL1 and ELR_EL1. IRQs can be reenabled after determining (and clearing) the interrupt source. Figure 11.9 shows the control flow of nested interrupt handling. The following code demonstrates a simple handler for nested interrupts: ASM_IRQ_Handler: // Stack all corruptible registers // Save ... // Read SPSR_EL1 and ELR_EL1 into GP registers // PCS MRS X0, SPSR_EL1 // Corruptible MRS X1, ELR_EL1 // registers

524

11 ARMv8 Architecture and Programming

Fig. 11.9 Handling of nested interrupts

// Stack SPSR_EL1 and ELR_EL1 STP X0, X1, [SP, #-16]! BL identify_and_clear_source // Unmask IRQs MSR DAIFClr, #0b0010 BL C_IRQ_Handler // Mask IRQs MSR DAIFSet, #0b0010 // Restore SPSR_EL1 and ELR_EL1 LDP X0, X1, [SP], #16 MSR SPSR_EL1, X0 MSR ELR_EL1, X1 // Restore corruptible registers ... ERET

// SPSR_EL1 // ELR-EL1

// Unmask // IRQs // // // // // //

Mask IRQs Restore PCS corruptible registers, SPSR_EL1 and ELR_EL1

// Exception return

11.6 Generic Interrupt Controller Versions 3 and 4 The Generic Interrupt Controller (GIC) [15] is the default interrupt controller of ARM. It has evolved into several versions, ranging from GICv1 to GICv4. Currently, GICv3 is the primary GIC for ARMv8. It supports all key features of GICv2. In addition, it supports interrupt groups with affinity routing, message-based interrupts, and an enhanced security model, with separate secure and non-secure group 1 interrupts. As an example, ARM CoreLink GIC-500 is an implementation of GICv3, which also supports legacy software written for GICv2. The newest version of GIC is GICv4 [16], which supports direct virtual interrupt injections for enhanced virtualizations in EL2.

11.6 Generic Interrupt Controller Versions 3 and 4

11.6.1

525

GICv3 Fundamentals

This section describes the basic operation of the GICv3 interrupt controller. It also describes the different programming interfaces in GICv3.

11.6.1.1 Interrupt Types GICv3 defines the following types of interrupts: SGIs (Software Generated Interrupt): SGIs are used for inter-processor communication. They are generated by writing to a SGI register in the GIC. SGIs are available in GICv2 also. Chapter 9 of this book showed how to use SGIs for interprocessor communication in an ARM MPcore system. PPIs (Private Peripheral Interrupts): PPIs are peripheral interrupts that target a single, specific Processor Element (PE). An example of PPI is an interrupt from the generic timer of a PE. Each PE has its own generic timer. A generic timer interrupt is directed only to that PE. SPIs (Shared Peripheral Interrupts): SPIs are global peripheral interrupts that can be routed to a specified PE or to one of a group of PEs. LPIs (Locality-Specific Peripheral Interrupts): LPIs are new in GICv3. They are different from other types of interrupts in a number of ways. First, LPIs are only supported when GICD_CTLR.ARE_NS==1. Second, LPIs are always messagebased interrupts, and their configuration is held in tables in memory rather than in registers. 11.6.1.2 Interrupt Identifiers Each interrupt source is identified by an interrupt ID number, referred to as an INTID. The available INTIDs are grouped into ranges. Each range is assigned to a particular type of interrupts, as the following table shows: INTID 0–15 16–31 32–1019 1020–1023 1024–8191 8182 and greater

Interrupt type SGIs PPIs SPIs Special Reserved LPIs

Notes Banked per PE Banked per PE Shared Special cases

SGIs (Software Generated Interrupts) are used for inter-processor communication. PPIs (Private Peripheral Interrupts) are PE specific, which are directed to a specific PE. SPIs are Shared Peripheral Interrupt that can be routed to individual PEs or groups of PEs. LPIs are a new class of Locality-Specific Peripheral Interrupts, which intended to extend the interrupt ID space in GIVv3 and GICv4.

11.6.1.3 Security Model The ARM GICv3 architecture supports ARM TrustZone technology. Each INTID must be assigned a group and security setting. GICv3 supports three combinations of group and security settings, as the following shows: Interrupt type Secure Group 0 Secure Group 1 Non-secure Group 1

Example use Interrupts for EL3 (secure firmware) Interrupts for secure EL1 (trusted OS) Interrupts for non -secure state (OS or hypervisor)

Group 0 interrupts are always signaled as FIQs. Group 1 interrupts are signaled as either IRQs or FIQs, depending on the current security state and exception level of the PE, as shown below: EL security state of PE

Group 0

Secure EL0/1 Non-secure EL0/1/2

FIQ FIQ

Group 1 Secure IRQ FIQ

Non-secure FIQ IRQ

Figure 11.10 shows the GICD_CTLR register of GICv3. The Disable Security (DS) bit of GICD_CTLR register indicates whether a GIC is configured to support the ARMv8-A security model. Its reset value is 1 for a single security state. When GICD_CTLR.DS == 0, the GIC supports two security states and three interrupt groups: Group 0, secure Group 1, and non-secure Group 1.

526

11 ARMv8 Architecture and Programming

Fig. 11.10 GICD_CTLR register

Fig. 11.11 Security states of GICv3

When GICD_CTLR.DS == 1, the GIC supports only a single security state, which can be either secure state or non-secure state, and two interrupt groups: Group 0 and Group 1. Figure 11.11 shows the two security states of GICv3.

11.6.2

Programming Model of GICv3

The biggest change in the ARM GICv3 interrupt controller is in the hierarchical organization of interface components. In GICv3, the interface is split into three groups. Figure 11.12 shows the organization of the register interface. At the top level of the interface is the Distributor, which provides the routing configuration for SPIs and holds all the associated routing and priority information. At the next level is the Redistributor, which provides the configuration settings for PPIs and SGIs. At the bottom level is the CPU interface, which provides controls of interrupts to specific processor. The following describes their general functionalities.

11.6.2.1 Distributor (GICD_*) The Distributor registers are memory-mapped. They contain global settings that affect all PEs connected to the interrupt controller. The Distributor provides a programming interface for the following: • Interrupting prioritization and distribution of SPIs

11.6 Generic Interrupt Controller Versions 3 and 4

527

Fig. 11.12 Programming interface of GICv3 interrupt controller

• • • • • • •

Enabling and disabling SPIs Setting the priority level of each SPI Routing information for each SPI Setting each SPI to be level-sensitive or edge-triggered Generating message-based SPIs Controlling the active and pending state of SPIs Controlling to determine the programming model that is used in each security state (affinity routing or legacy)

11.6.2.2 Redistributors (GICR_*) Redistributors are new in GICv3 and GICv4. For each connected PE, there is a Redistributor. Registers of Redistributors are memory-mapped. The Redistributors provide a programming interface for the following: • • • • • •

Enabling and disabling SGIs and PPIs Setting the priority level of SGIs and PPIs Setting each PPI to be level-sensitive or edge-triggered Assigning each SGI and PPI to an interrupt group Controlling the state of SGIs and PPIs Base address control for the data structures in memory that support the associated interrupt properties and pending state for LPIs • Power management support for the connected PE

11.6.2.3 CPU interfaces (ICC_*_ELn) Each processor has a CPU interface connected to a Redistributor. The CPU interface provides a programming interface for the following: • • • • • •

General control and configuration to enable interrupt handling Acknowledging an interrupt Performing a priority drop and deactivation of interrupts Setting an interrupt priority mask for the PE Defining the preemption policy for the PE Determining the highest-priority pending interrupt for the PE

528

11 ARMv8 Architecture and Programming

In GICv2, CPU interface consists of memory-mapped registers denoted by GICC_*. In GICv3, CPU interface are system registers denoted by ICC_*_ELn. Memory-mapped GICC_* registers are only available in legacy mode in support of GICv2.

11.6.2.4 GICv3 Register Names All of the GIC registers have names that provide a short mnemonic for the function of the register: • Memory-mapped registers are prefixed by one of the following: GICC, to indicate a CPU interface register, used only in legacy mode of GICv2 GICD, to indicate a Distributor register, for both GICv3 and GICv2 GICH, to indicate a virtual interface control register, typically accessed by a hypervisor GICR, to indicate a Redistributor register, used only in GICv3 and GICv4 GICV, to indicate a virtual CPU interface register, used only in hypervisor GITS, to indicate an ITS register, used only with interrupt translation service (ITS) • System registers are prefixed by the following: ICC, to indicate a physical GIC CPU interface system register ICV, to indicate a virtual GIC CPU interface system register ICH, to indicate a virtual interface control system register

11.6.2.5 Distributor Register Map Distributor registers are located at offsets from the GIC base address. The following shows some of the GICD registers as offsets from the GIC base address: #define #define #define #define #define #define #define #define #define

GICD_CTLR GICD_TYPER GICD_IIDR GICD_STATUSR GICD_SETSPI_NSR GICD_CLRSPI_NSR GICD_SETSPI_SR GICD_CLRSPI_SR GICD_IROUTER

0x0000 0x0004 0x0008 0x0010 0x0040 0x0048 0x0050 0x0058 0x0600

// a range of IROUTER registers

The GIC base address depends on implementation. In ARM Virt virtual machines of QEMU, GIC_BASE = 0x08000000. Alternatively, the GIC base address can also be found out from the device tree blob (DTB). As an example, the following Linux commands will get the DTB of ARM Virt VM under QEMU: apt install device-tree-compiler # install device-tree-compiler utility qemu-system-aarch64 -machine virt, gic-version=3, dumpdtb=dump.dtb dtc -o dump.dts -I dtb dump.dtb Then the dump.dts file contains the device tree structure. For GICv2, the GIC base is at 0x08000000, and GICC base is at 0x08010000. For GICv3, the GIC base is at 0x08000000, and GICR base is at 0x080A0000.

11.6.2.6 Redistributor Register Map Redistributor registers are defined as offsets from the GICR base address. For the ARM Virt VM, GICR_BASE = GIC_BASE + 0xA0000: /* ReDistributor Registers for Control and Physical LPIs */ #define GICR_CTLR 0x0000 #define GICR_IIDR 0x0004 #define GICR_TYPER 0x0008 #define GICR_STATUSR 0x0010 #define GICR_WAKER 0x0014

11.6 Generic Interrupt Controller Versions 3 and 4

#define #define #define #define #define #define #define #define #define #define

GICR_SETLPIR GICR_CLRLPIR GICR_SEIR GICR_PROPBASER GICR_PENDBASER GICR_INVLPIR GICR_INVALLR GICR_SYNCR GICR_MOVLPIR GICR_MOVALLR

529

0x0040 0x0048 0x0068 0x0070 0x0078 0x00A0 0x00B0 0x00C0 0x0100 0x0110

/* ReDistributor Registers for SGIs and PPIs */ #define GICR_IGROUPRn 0x0080 #define GICR_ISENABLERn 0x0100 #define GICR_ICENABLERn 0x0180 #define GICR_ISPENDRn 0x0200 #define GICR_ICPENDRn 0x0280 #define GICR_ISACTIVERn 0x0300 #define GICR_ICACTIVERn 0x0380 #define GICR_IPRIORITYRn 0x0400 #define GICR_ICFGR0 0x0C00 #define GICR_ICFGR1 0x0C04 #define GICR_IGROUPMODRn 0x0D00 #define GICR_NSACRn 0x0E00 Each Redistributor defines two 64KB frames in the physical address map, which are as follows: • RD_base for controlling the overall behavior of the Redistributor, for controlling LPIs, and for generating LPIs in a system that does not include at least one ITS • SGI_base for controlling and generating PPIs and SGIs The two Redistributor frames are contiguous and ordered as follows: 1. RD_base 64KB frame 2. SGI_base 64KB frame Thus, each Redistributor has two 64KB frames, for a total frame size of 128KB. In GICv4, there are two additional 64KB frames: a frame to control virtual LPIs and a reserved page. In GICv4, the total frame size of each Redistributor is 256 KB.

11.6.2.7 CPU Interface System Registers In GICv3 and GICv4, CPU interface is by system registers, denoted by ICC_*_ELn: /* CPU #define #define #define #define #define #define #define #define #define #define #define

Interface System Registers */ ICC_IAR0_EL1 S3_0_C12_C8_0 ICC_IAR1_EL1 S3_0_C12_C12_0 ICC_EOIR0_EL1 S3_0_C12_C8_1 ICC_EOIR1_EL1 S3_0_C12_C12_1 ICC_HPPIR0_EL1 S3_0_C12_C8_2 ICC_HPPIR1_EL1 S3_0_C12_C12_2 ICC_BPR0_EL1 S3_0_C12_C8_3 ICC_BPR1_EL1 S3_0_C12_C12_3 ICC_DIR_EL1 S3_0_C12_C11_1 ICC_PMR_EL1 S3_0_C4_C6_0 ICC_RPR_EL1 S3_0_C12_C11_3

530

11 ARMv8 Architecture and Programming

#define ICC_CTLR_EL1 #define ICC_CTLR_EL3

S3_0_C12_C12_4 S3_6_C12_C12_4

#define #define #define #define #define #define #define #define #define #define

S3_0_C12_C12_5 S3_4_C12_C9_5 S3_6_C12_C12_5 S3_0_C12_C12_6 S3_0_C12_C12_7 S3_6_C12_C12_7 S3_0_C12_C13_0 S3_0_C12_C11_7 S3_0_C12_C11_5 S3_0_C12_C11_6

ICC_SRE_EL1 ICC_SRE_EL2 ICC_SRE_EL3 ICC_IGRPEN0_EL1 ICC_IGRPEN1_EL1 ICC_IGRPEN1_EL3 ICC_SEIEN_EL1 ICC_SGI0R_EL1 ICC_SGI1R_EL1 ICC_ASGI1R_EL1

Although each ICC system register has a unique symbolic name, some cross-compiler, e.g., aarch64-linux-gnu-gcc, may not recognize the register names yet. In that case, the ICC registers can be accessed by their encodings in the form Sopc0_opc1_Cn_Cm_op2 For example, ICC_CTLR_EL3 can be accessed as s3_6_c12_c12_4

11.6.3

Routing in GICv3

11.6.3.1 Affinity Routing GICv3 uses affinity routing to identify connected PEs and to route interrupts to a specific PE or group of PEs. Affinity routing is a hierarchical address-based routing scheme to identify specific PE nodes for interrupt routing. In Ar64 state, the affinity of a PE is held in the Multiprocessor Affinity Register MPIDR_EL1. The affinity value is represented by four 8-bit fields a.b.c.d., which stands for the affinity hierarchy levels ... In the (64-bit) MPIDR_EL1, bits [39:31]=a, bits [23:16]=b, bits [15:8]=c, bits[7:0]=d. Figure 11.13 shows an affinity routing diagram. With four levels of affinity routing, there is a Redistributor at each level which routes the interrupt based on its level values. The affinity a.b.c.d. of an interrupt may be resolved to a specific processor core in an ARM system comprising cluster of groups, each group comprising many machine, each machine has many cores, as in

Fig. 11.13 Affinity routing in GICv3/v4

11.6 Generic Interrupt Controller Versions 3 and 4

531

For an interrupt with ID n (n=32 to 1019), its routing affinity a.b.c.d. is specified in GICD_IROUTER. For example, in Fig. 11.13, an interrupt ID=n is routed by its affinity a.b.c.d. = 0.0.0.15 to the PE in cluster 0, group 0, processor 0, core 15 in an ARM system. Affinity routing is controlled by the Affinity Routing Enable ARE_S and ARE_NS bits in the distributor register GICD_ CTLR: To enable affinity routing for Secure IRQs, set GICD_CTLR.ARE_S to 1. To enable affinity routing for non-secure IRQs, set GICD_CTLR.ARE_NS to 1. With affinity routing disabled, GICv3 supports the legacy mode of GICv2.

11.6.3.2 Routing SPIs and SGIs by PE Affinity Among the different type of interrupts, PPIs do not need routing since they are for specific processor. Only SPIs and SGIs need routing. SPIs are routed using an affinity address and the routing mode information that is held in GICD_IROUTER. SGIs are routed using the affinity address and routing mode information that is written by software when it generates the SGI. SPIs and SGIs are routed using different registers. SPIs are routed using GICD_IROUTER.Interrupt_Routing_Mode. GICD_IROUTER is a 64-bit register. For each SPI interrupt ID n = 32 to1019, there is a corresponding GICR_IROUTER at GICD_BASE + 0x6000+8*n, which specifies the affinity routing information a.b.c.d. of that interrupt ID. The bit fields of GICD_IROUTER are as follows: Bit31 = Interrupt_Routing_Mode: Bit[39:32]=a, [23:16]=b, [15:8]=c, [7:0]=d. If Interrupt_Routing_Mode is 0, SPIs are routed to a single PE specified by a.b.c.d. If Interrupt_Routing_Mode is 1, SPIs are routed to any PE defined as a participating node. For example, a.b.c.d = 0.0.0.1 route the interrupt to CPU0 in an ARM MPcore processor. For a SPI interrupt ID=n, the affinity values can be set as follows: clear GICD_IROUTER by writing 0 to GICD_IROUTERn write value=0x000000aa_00bbccdd to GICD_IROUTERn SGIs are routed using ICC_SGI0R_EL1.IRM and ICC_SGI1R_EL1.IRM: If the IRM bit is set to 1, SGIs are routed to all participating PEs in the system, excluding the originating PE. If the IRM bit is cleared to 0, SGIs are routed to a group of PEs, specified by a.b.c.targetlist. The target list provides a bit field encoding for affinity level 0 values of 0–15.

11.6.3.3 Participating Nodes Participating nodes for affinity routing are controlled by GICD and GICR registers. An enabled SPI configured to use the 1 of N distribution model can target a PE when: GICR_WAKER.ProcessorSleep == 0 and the interrupt group of the interrupt is enabled on the PE. GICD_CTLR.E1NWF == 1. GICR_TYPER.DPGS == 1, and, for the interrupt group of the interrupt, GICR_CTLR.{DPG1S, DPG1NS, DPG0} == 0. In the Redistributor Control Register GICR_CTLR, the DPG1S, DPG1NS, and DPG0 fields are defines as: DPG1S, DPG1NS, DPG0,

bit [26] Disable Processor selection for Group 1 secure interrupts. bit [25] Disable Processor selection for Group 1 non-secure interrupts. bit [24] Disable Processor selection for Group 0 interrupts.

These bits should all be cleared to 0 to allow processor selects

532

11.6.4

11 ARMv8 Architecture and Programming

Enable and Distribution of Interrupts

11.6.4.1 Enable Interrupt Groups The following control bits enable and disable the distribution of interrupts in various groups: • GICD_CTLR.EnableGrp1S • GICD_CTLR.EnableGrp1NS • GICD_CTLR.EnableGrp0 The following control bits enable and disable the distribution of interrupt groups at the CPU interface: • ICC_IGRPEN0_EL1.Enable for Group 0 interrupts • ICC_IGRPEN1_EL1.Enable for Group 1 interrupts

11.6.4.2 Enabling Individual Interrupts PPIs: PPIs can be enabled and disabled by writing to GICR_ISENABLER0 and GICR_ICENABLER0 when affinity routing is enabled for the security state of the interrupt. Individual PPIs can also be enabled and disabled by writing to GICD_ ISENABLER and GICD_ICENABLER n = 0 for PPIs, if legacy operation for physical interrupts is supported and configured. PPIs in the optional, extended PPI range are enabled and disabled by writing to GICR_ISENABLERE and GICD_ ICENABLERE. SPIs: Individual SPIs can be enabled and disabled by writing to GICD_ISENABLER and GICD_ICENABLER n>0 for SPIs. SPIs in the optional, extended SPI range are enabled and disabled by writing to GICD_ISENABLERE, and GICD_ ICENABLERE. SGIs: SGIs can be enabled and disabled by writing to GICR_ISENABLER0 and GICR_ICENABLER0 when affinity routing is enabled. Individual SGIs can also be enabled and disabled by writing to GICD_ISENABLER and GICD_ ICENABLER n = 0 for SGIs, if legacy operation for physical interrupts is supported and configured.

LPIs: Individual LPIs can be enabled by setting the enable bits programmed in the LPI configuration table.

11.6.5

Handling Interrupts

When the source of the interrupt is asserted, e.g., when a peripheral asserting a dedicated interrupt signal, an interrupt becomes pending. When an interrupt becomes pending, the interrupt controller decides whether to send the interrupt to one of the connected PEs. The PE which the interrupt controller selects, if any, depends on the following settings: • Group enables: Sect. 11.6.1.3 described how INTID is assigned to a Group (Group 0, secure Group 1 or non-secure Group 1). For each group, there is a group bit in the Distributor and in the CPU interface. An interrupt that is a member of a disabled group cannot be signaled to a PE. • Interrupt enables: Individually disabled interrupts can become pending, but will not be forwarded to a PE. • Routing controls: Depending on the type of interrupt, the interrupt controller must decide which PEs can receive the interrupt.

11.7 ARMv8 Generic Timer

533

For SPIs, this is controlled by GICD_IROUTERn. A SPI can target one specific PE or anyone of the connected PEs. For LPIs, the routing information comes from the ITS if an ITS is implemented PPIs are specific to one PE and can only be handled by that PE. For SGIs, the originating PE defines the list of target PEs.

11.6.5.1 Interrupt Priority and Priority Mask Each PE has a Priority Mask register (ICC_PMR_EL1) in its CPU interface. This register sets the minimum priority that is required for an interrupt to be forwarded to that PE. Only interrupts with a higher priority than the value in the register are signaled to the PE. Running priority: If the PE is not already handling an interrupt, the running priority is the idle priority (0xFF). Only an interrupt with a higher priority than the running priority can preempt the current interrupt.

11.6.5.2 Interrupt Acknowledge The CPU interface has two Interrupt Acknowledge Registers (IARs). Reading the IAR returns the INTID and advances the interrupt state machine. In a typical interrupt handler, the first step when handling an interrupt is to read one of the IARs. IAR Register ICC_IAR0_EL1 ICC_IAR1_EL1

Usage Used to acknowledge Group 0 interrupts Used to acknowledge Group 1 interrupts

Now that we have a GICv3 interrupt controller, which can be configured to route interrupts, all we need is some I/O device that can generate interrupts to the GICv3 controller. In ARMv6 and ARMv7, there are many hardware boards with ARM processors, memory, and I/O devices that form a complete computer system. For most such ARM boards, there are also virtual machines that emulate the real hardware boards. The situation is somewhat different in ARMv8. While there are many ARMv8-based hardware boards, there are only a few ARMv8-based virtual machines, most with limited support for I/O devices. For example, the ARM Virt virtual machine emulates ARMv8 processors but with only GPIO and UART for I/O. Raspberry PI-3 appears to be a good choice with reasonable price, but programming Rasp3 is not as easy as it may appear. In order to use Rasp3, a first requirement is to have a special cable to connect the Rasp3 to a host computer that emulates a serial port. After much searching, I have decided to use the ARM Virt VM as the platform for ARMv8 programming. The ARM Virt VM supports a variety of ARMv8 processors, and every ARM processor supports generic timers, which can be used as interrupt sources.

11.7 ARMv8 Generic Timer Every ARMv8 processor has a generic timer, which provides a standardized timer framework for ARM cores [17]. The generic timer includes a System Counter and a set of per-core timers, as shown in the following diagram (Diagram 11.1). The System Counter is an always-on device. It provides a fixed frequency incrementing system count. The system count value is broadcast to all the cores in the system, giving the cores a common view of the passage of time. The system count value is between 56 bits and 64 bits in width, with a frequency typically in the range of 1 MHz to 50 MHz. Each core has a set of timers. These timers are comparators, which compare against the broadcast system count that is provided by the System Counter. Software can configure timers to generate interrupts. Software can also use the system count to add timestamps, because the system count gives a common reference point for all cores.

11.7.1

Processor Timers

The number of timers that a core provides depends on which extensions are implemented, as shown in the following table: Timer name EL1 physical timer

When is the timer present Always

11 ARMv8 Architecture and Programming

534

Diagram 11.1 Generic timer Timer name EL1 virtual timer Non-secure EL2 physical timer Non-secure EL2 virtual timer EL3 physical timer Secure EL2 physical timer Secure EL2 virtual timer

When is the timer present Always EL2 ARMv8.4-VHE EL3 ARMv8.4-SecEL2 ARMv8.4-SecEL2

For simplicity, we shall ignore timers in EL2 and consider only timers in EL1 and EL3.

11.7.2

Counter and Frequency

The CNTPCT_EL0 system register reports the current system count value. To ensure the read count is in order, it may be necessary to use the ISB instruction to synchronize the order of instructions. CNTFRQ_EL0 reports the frequency of the system count. This register is not populated by hardware. The register is write-able at the highest implemented exception level and readable at all exception levels. Firmware, typically running at EL3, populates this register as part of early system initialization. Higher-level software, like an operating system, can then use the register to get the frequency. It is observed that on most virtual machines the counter frequency is prefixed, which makes these steps optional.

11.7.3

Timer Registers

Each timer has the following three system registers: Register _CTL_ELn _CVAL_ELn _TVAL_ELn

Purpose Control register Comparator value Timer value

In the register name, identifies which timer is being accessed. The following table shows the possible values:

535

11.8 Programming Examples Timer EL1 physical timer EL1 virtual timer EL3 physical timer

Register prefix CNTP CNTV CNTPS

ELn EL0 EL0 EL1

For example, CNTP_CVAL_EL0 is the Comparator register of the EL1 physical timer.

11.7.4

Configure Timer

There are two ways to configure a timer, either using the comparator (CVAL) register or using the timer (TVAL) register. The comparator register, CVAL, is a 64-bit register. Software writes a value to this register, and the timer triggers when the count reaches, or exceeds, that value, i.e.: Timer Condition Met: CVAL /dev/null

aarch64-linux-gnu-as -o ts.o ts.s aarch64-linux-gnu-gcc -c -mcpu=cortex- a57 -Wall -o t.o t.c #2> /dev/null aarch64-linux-gnu-ld -T t.ld ts.o t.o -o kernel.elf rm *.o

echo go? read DUMMY

qemu-system-aarch64 -machine virt -cpu cortex-a57 \ -machine virtualization=on -machine secure=on -machine gic-version=3 \ -nographic -kernel kernel.elf Figure 11.17 shows the sample outputs of the Example 4 program. Referring to Fig. 11.17, the outputs can be explained as follows. (1). It first shows the CPU starts in EL3 (currentEL bits=0xC). It initializes the GICv3 controller and configures the EL1 physical timer. (2). Then it enters EL2 to print a line, which shows the currentEL bits =0x8. Then it enters EL1 to print a line in EL1, which shows the currentEL bits = 0x4. (3). While in EL1, it tests exception handling by brk, hvc, and svc. In each case, the trap handler shows the exception syndrome in the ESR_EL1 register, which matches ARM specifications. For the brk and hvc instructions, the return address in ELR_EL1 points to the same instruction, which would cause an infinite sequence of the same kind of traps again when the trap handler returns. The trap handler adds 4 to the return address to skip over the offending instruction. For the SVC instruction, the return address is already at instruction address + 4, so no need to adjust the return address.

11.8 Programming Examples

605

Fig. 11.17 Sample outputs of Example 4 program

( 4). Then it starts the timer and falls back to a WFI loop, waiting for timer interrupts. (5). When a timer interrupt occurs, the processor is diverted to execute irq_handler to handle the interrupt.

11.8.5

Example 5: OS Kernel with Processes at EL1

The Example 5 program implements a simple OS kernel in EL1. The simple OS kernel supports dynamic process creation and process switch by context switching. The system defines NPROC=9 process structures, each represents a process. After initializing the kernel data structures, the OS kernel starts to run an initial process P0 (pid=0). P0 forks NPROC-1=8 processes into a readyQueue ordered by process scheduling priority. Then it waits for timer interrupts. Every 4 or 5 seconds, it switches process to run the next process from the readyQueue. The example consists of the following files: (1). Ts.s file /********************** ts.s file ************************/ .global start start:

// Initialize general registers X0-30 .irp reg, 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 MOV X\reg, XZR .endr .irp reg, 16,17,18,19,20,21,22,23,24,25,26,27,28,29,30

606

11 ARMv8 Architecture and Programming

MOV .endr

X\reg, XZR

// Check which core is running // ---------------------------// Core 0.0.0.0 should continue to execute // All other cores should sleep (WFI) MRS x0, MPIDR_EL1 UBFX x1, x0, #32, #8 // Extract Aff3 BFI w0, w1, #24, #8 // Insert Aff3 into bits [31:24], so that [31:0] // is now Aff3.Aff2.Aff1.Aff0 // Using w register means bits [63:32] are zeroed CBZ w0, core0 // If 0.0.0.0, branch to code for core0

1:

WFI B

1b

// If not 0.0.0.0, then go to sleep

core0: // set up SP_EL3 pointer LDR x0, =stack3_top mov sp, x0

// Disable trapping of CPTR_EL3 accesses or use of Adv.SIMD/FPU // ------------------------------------------------------------MOV MSR

x0, #0 CPTR_EL3, x0

// Clear all trap bits

MSR ISB

SCR_EL3, xzr

// Ensure NS bit is clear

// // // //

Configure GIC CPU IF ------------------For processors that do not support legacy operation these steps could be omitted.

// set MOV //MSR MSR ISB //MSR MSR

Interrupt Controller System Register Enable to use ICC x0, #1 ICC_SRE_EL3, x0 s3_6_C12_C12_5, x0 // ICC_SRE_EL3, x0

MOV //MSR MSR ISB //MSR MSR ISB //MSR MSR

x1, #1 SCR_EL3, x1 s3_6_C12_C12_5, x1

// Set NS bit, to access Non-secure registers

ICC_SRE_EL2, x0 s3_4_C12_C9_5, x0

// bit0=0: memory mapped, 1 use ICC // ICC_SRE_EL2, x0

ICC_SRE_EL1, x0 s3_0_C12_C12_5, x0

// bit0=0: memory mapped, 1 use system regs // ICC_SRE_EL1, x0

ICC_SRE_EL1, x0 s3_0_C12_C12_5, x0

// ICC_SRE_EL1, x0

// Now do the NS SRE bits in Secure Config register SCR_EL3 // bit0 NS=0: EL0,EL1 are in secure state, NS=1: EL < 3 are non-secure // SCR_EL3, x1

// NOT route interrupts to EL3 MOV x1, #0 // Initial value of register is unknown BIC x1, x1, #(1 x1, x0->x0

632

11 ARMv8 Architecture and Programming

.global trap1_handler trap1_handler: handler trap_chandler

.global trap0_handler trap0_handler: // trap from EL0 may be systeam call by SVC handler trap_chandler goUmode: // trap from EL0, with goUmode: .global fiq_handler fiq_handler: handler fiq_chandler

.global el1_vector_table .balign 0x800

el1_vector_table:

//Group 1: Current EL with SP0: Group 1 does NOT apply curr_el_sp0_sync: b . .balign 0x80 curr_el_sp0_irq: b . .balign 0x80 curr_el_sp0_fiq: b . .balign 0x80 curr_el_sp0_serr: b . //Group 2: Current EL with SPx: // Group 2: YES .balign 0x80 curr_el_spx_sync: b trap1_handler .balign 0x80 curr_el_spx_irq:

// synchronous as traps from EL1

b irq_handler // if timer EL1 generate IRQ .balign 0x80 curr_el_spx_fiq: b fiq_handler .balign 0x80 curr_el_spx_serr: b .

// if timer EL1 uses FIQ // Serror

//Group 3: Lower EL using AArch64: // Group 3: from EL0 to OS kernel at EL1 .balign 0x80 lower_el_a64_sync: b trap0_handler

.balign 0x80 lower_el_a64_irq: b irq_handler .balign 0x80 lower_el_a64_fiq: b fiq_handler

// trap from EL0

11.8 Programming Examples

.balign 0x80 lower_el_a64_serr: b .

//Group4: Lower EL using AArch32: // Group 4 does NOT apply .balign 0x80 lower_el_a32_sync: b . .balign 0x80 lower_el_a32_irq: b . .balign 0x80 lower_el_a32_fiq: b . .balign 0x80 lower_el_a32_serr: b . // ------ end of el1_vector_table -------wfi:

.global wfi wfi ret

// for trap_handler to use: ESR_EL1, ELR_EL1 .global get_ESR_EL1 get_ESR_EL1: MRS x0, ESR_EL1 ret .global get_ELR_EL1 get_ELR_EL1: MRS x0, ELR_EL1 ret

.global add4ELR_EL1 add4ELR_EL1: // add 4 to ELR_EL1 to skip over bad instruction MRS x0, ELR_EL1 add x0, x0, #4 MSR ELR_EL1, x0 ret // for generate traps in EL1 .global do_brk, do_hvc, do_svc do_brk: brk #1 ret do_hvc: hvc #2 ret do_svc: svc #3 ret .global tswitch tswitch: stp x29, x30, [sp, #-16]! stp x27, x28, [sp, #-16]!

633

634

11 ARMv8 Architecture and Programming

stp stp stp stp stp stp stp stp stp stp stp stp stp stp

x25, x23, x21, x19, x17, x15, x13, x11, x9, x7, x5, x3, x1,

x26, x24, x22, x20, x18, x16, x14, x12, x10, x8, x6, x4, x2,

[sp, [sp, [sp, [sp, [sp, [sp, [sp, [sp, [sp, [sp, [sp, [sp, [sp,

#-16]! #-16]! #-16]! #-16]! #-16]! #-16]! #-16]! #-16]! #-16]! #-16]! #-16]! #-16]! #-16]!

x0, xzr, [sp, #-16]!

// sp->|xzr|x0|x1|.....|x29|x30| LDR x0, =running LDR x1, [x0, #0] mov x2, sp str x2, [x1, #8] bl

scheduler

// r0=&running // r1->runningPROC // running->ksp = sp

LDR x0, =running LDR x1, [x0, #0] ldr x2, [x1, #8] mov sp, x2

// r1->runningPROC

// restore registers from (x29, x30) to (x1, x2) ldp x1, x0, [sp], #16 // pop xzr->x1, x0->x0 ldp x1, x2, [sp], #16 ldp x3, x4, [sp], #16 ldp x5, x6, [sp], #16 ldp x7, x8, [sp], #16 ldp x9, x10, [sp], #16 ldp x11, x12, [sp], #16 ldp x13, x14, [sp], #16 ldp x15, x16, [sp], #16 ldp x17, x18, [sp], #16 ldp x19, x20, [sp], #16 ldp x21, x22, [sp], #16 ldp x23, x24, [sp], #16 ldp x25, x26, [sp], #16 ldp x27, x28, [sp], #16 ldp x29, x30, [sp], #16 // return by link register X30, X29=FP ret

unlock:

.global unlock MSR ret

DAIFClr, 0x3

11.8 Programming Examples

635

.global toUmode // toUmode(usp) toUmode:

// Determine the EL0 Execution state. MOV X1, #0b00000 // DAIF=0000 M[4:0]=00000 EL0t. MSR SPSR_EL1, X1

ADR x1, el0_entry // el1_entry points to the first instruction of MSR ELR_EL1, X1 // EL0 code. ERET

el0_entry: mov

sp, x0

bl uCode

// x0 = proc.usp

//--------------------------------.global syscall syscall: svc #1 ret

/*********** g.c file ************/ #include #define u8 uint8_t #define u32 uint32_t #define u64 uint64_t

#define GICD_BASE 0x08000000 // GICD base of QEMU virt VM #define GICR_BASE 0x08000000 + 0xA0000 // GICR base of QEMU virt VM //GIC Distributor registers: #define GICD_CTLR #define GICD_TYPER #define GICD_IIDR #define GICD_STATUSR #define GICD_SETSPI_NSR #define GICD_CLRSPI_NSR #define GICD_SETSPI_SR #define GICD_CLRSPI_SR #define GICD_IROUTER //Redistributor registers: #define GICR_CTLR #define GICR_IIDR #define GICR_TYPER #define GICR_STATUSR #define GICR_WAKER #define #define #define #define #define #define #define

GICR_SETLPIR GICR_CLRLPIR GICR_PROPBASER GICR_PENDBASER GICR_INVLPIR GICR_INVALLR GICR_SYNCR

(GICD_BASE (GICD_BASE (BICD_BASE (GICD_BASE (GICD_BASE (GICD_BASE (GICD_BASE (GICD_BASE (GICD_BASE

+ + + + + + + + +

0x0000) 0x0004) 0x0008) 0x0010) 0x0040) 0x0048) 0x0050) 0x0058) 0x6000)

(GICR_BASE (GICR_BASE (GICR_BASE (GICR_BASE (GICR_BASE

+ + + + +

0x0000) 0x0004) 0x0008) 0x0010) 0x0014)

(GICR_BASE (GICR_BASE (GICR_BASE (GICR_BASE (GICR_BASE (GICR_BASE (GICR_BASE

+ + + + + + +

0x0040) 0x0048) 0x0070) 0x0078) 0x00A0) 0x00B0) 0x00C0)

636

11 ARMv8 Architecture and Programming

//registers relative to SGI_base, add 0x10000 = 64KB: #define GICR_IGROUPR0 (GICR_BASE + 0x10080) #define GICR_ISENABLER0 (GICR_BASE + 0x10100) #define GICR_ICENABLER0 (GICR_BASE + 0x10180) #define GICR_ISPENDR0 (GICR_BASE + 0x10200) #define GICR_ICPENDR0 (GICR_BASE + 0x10280) #define GICR_ISACTIVER0 (GICR_BASE + 0x10300) #define GICR_ICACTIVER0 (GICR_BASE + 0x10380) #define GICR_IPRIORITYR (GICR_BASE + 0x10400) //byte-accessible register #define GICR_ICFGR0 (GICR_BASE + 0x10C00) #define GICR_ICFGR1 (GICR_BASE + 0x10C04) #define GICR_IGRPMODR0 (GICR_BASE + 0x10D00) #define GICR_NSACR (GICR_BASE + 0x10E00) // GICv3: GICR regs are in one of 2 frames, each 128KB: always use frame0 // read/write GICD functions: u64 readGICD_IROUTER(u64 n) { u64 address = GICD_IROUTER + 8*n; // puts("read: addr="); putx(address); puts("\n"); return *(u64 *)(address); }

void writeGICD_IROUTER(u32 n, u64 data) { u64 address = GICD_IROUTER + 8*n; //puts("write:addr="); putx(address); puts(" data="); putx(data); puts("\n"); *(u64 *)(address) = data; } void gicdWrite(u32 reg, u32 data) { *(u32 *)(reg) = data; } u32 gicdRead(u32 reg) { return *(u32 *)(reg); } u32 gicrRead(u32 reg) { return *(u32 *)(reg); }

void gicrWrite(u32 reg, u32 data) { *(u32 *)(reg) = data; }

//Set SGI or PPI's priority: //intID: 0~31, SGI or PPI interrupt ID.

void setSGIPPIPriority(u32 intID, u8 priority) { u32 regAddr; regAddr = GICR_IPRIORITYR + intID; // byte accessible *(char *)regAddr = priority; }

11.8 Programming Examples

637

// Initialize GICv3 Districutor void gicv3_init() { ups("gic_init()\n");

ups("read GICD_TYPER: "); u32 r1 = gicdRead(GICD_TYPER); ups("type="); upx(r1); ups("\n");

// typer=0x03780407: // 31-16=reserved 15-11=LSP# 10=secureExtension=1 7:5=CPU# // 3:0 32(n+1)=32*8=256 interrupts ugetc();

ups("write 0xF7 to GICD_CTLR\n"); //gicdWrite(GICD_CTLR, 0xB7); gicdWrite(GICD_CTLR, 0xF7); // // // //

// KCW: bit6 DS=1 disable secure state

GICD_CTLR = 0xB7=1011 0111 = bit0: enable Grp0=1, bit1=enableGrp1NS=1, bit2=enableGrp1S=1 bi4=ARE_S=1, bit5=ARE_NS=1, bit6=DS=1: disable Secure state KCW: if GICD_CTLR.DS = 0: intID would be 1024

ups("read GICD_CTLR & poll bit 31: "); while(1){ // keep read until bit 31 = 1 u32 v = *(u32 *)(GICD_CTLR); if ((v & 0x80000000)==0) break; } ups("after read GICD_CTLR\n"); //*(u32 *)(GICR_WAKER) = 0; // clear ups("readWaker "); u32 r = gicrRead(GICR_WAKER);

//ups("writeWaker\n"); gicrWrite(GICR_WAKER, r & 0xFFFFFFFFC); ups("poll GICR_WAKER for 0 "); while(1){ r = gicrRead(GICR_WAKER); if ( (r & 0x4)==0 ) break; } ups("after poll WAKER\n"); ugetc();

//Enable SRE: bit0 of ICC_SRE_EL3 enable use of ICC system registers ups("enable SRE in ICC_SRE_EL3\n"); u32 regVal = getICC_SRE_EL3(); regVal |= 0x01; // Enable SRE setICC_SRE_EL3(regVal); // set to minimal priority 0xFF ups("set ICC_PMR_EL1 priority to 0xFF\n"); setICC_PMR_EL1(0xFF);

638

11 ARMv8 Architecture and Programming

//binary point: .sssss. //all 5 bits are subpriority bits, not supporting interrupt preemption. ups("set binary point ICC_BPR0_EL1\n"); setICC_BPR0_EL1(0x04); /******************************************** //ICC_CTLR_EL3: bits432=EOImode: 0 to use regVal=getICC_CTLR_EL3(); puts("set EOImode_EL3 for interrupt pri drop and deactivation\n");

// EOImode_EL3: priority drop and interrupt deactivation together. // ICC_CTLR_EL3 bits 321=EOImode: 0 : to use ICC_EOIR0_EL1, ICC_EOIR1_EL1 // for priority drop and interrupt deactivation // clear bit 2 to 0: let EOImode_EL3 use ICC_EOIR0_EL1, ICC_EOIR1_EL1 setICC_CTLR_EL3(regVal & ~0x04); ****************************/ //ICC_CTLR_EL1: bits432=EOImode: 0 to use regVal=getICC_CTLR_EL1(); ups("set EOImode_EL1 for interrupt pri drop and deactivation\n"); // // // //

EOImode_EL1: priority drop and interrupt deactivation together. ICC_CTLR_EL3 bits 321=EOImode: 0 : to use ICC_EOIR0_EL1, ICC_EOIR1_EL1 for priority drop and interrupt deactivation clear bit 2 to 0: let EOImode_EL3 use ICC_EOIR0_EL1, ICC_EOIR1_EL1

setICC_CTLR_EL1(regVal & ~0x04);

//All SGI and PPI interrupts are of secure group 0 //Set group for interrupts ups("set interrupt to Group 0\n"); // set interrupts to Group 0); gicrWrite(GICR_IGROUPR0, 0); gicrWrite(GICR_IGRPMODR0, 0); //Enable Group 0 interrupts at CPU interface: ups("enable Group 0 ICC_IGROPEN0_EL1\n"); setICC_IGRPEN0_EL1(0x01);

} //#include //#define u8 uint8_t //#define u32 uint32_t //#define u64 uint64_t #define TIMER_INTID u64 timer_freq;

void config_timer() { u32 val; int x;

30

// EL1 Physical Timer interrupt ID

ups("configTimer()\n"); ups("read TimerFreq freq="); timer_freq = read_TimerFreq(); upx(timer_freq); ups("\n");

11.8 Programming Examples

639

//timer interrupt is level sensitive per ARM document: val = gicrRead(GICR_ICFGR1); // for x 15-0: bits 2x+1,2x 0:level 1:edge x = TIMER_INTID - 16; x = x*2 + 1; val &= ~(0x1next = q; return; } while (q->next && p->priority next->priority){ q = q->next; } p->next = q->next; q->next = p; } PROC *dequeue(PROC **queue) { PROC *p = *queue; if (p) *queue = p->next; return p; } void printQ(PROC *p) { ups("readyQueue = "); while(p){ upc('['); upi(p->pid); upc(']'); ups("->"); p = p->next; } ups("NULL\n"); } void printList(PROC *p) { ups("freeList = "); while(p){ upi(p->pid); ups("->"); p = p->next; } ups("NULL\n"); }

11.8 Programming Examples

/********** kernel.c **************/ extern void goUmode(); extern void uCode();

#define NPROC 9 PROC proc[NPROC], *running, *freeList, *readyQueue; u64 procsize = sizeof(PROC); void body();

char *pname[NPROC]={"sun", "mercury", "venus", "earth", "mars", "jupiter", "saturn","uranus","neptune"}; void kernel_init() { u64 i; PROC *p; ups("kernel_init() : "); for (i=0; ipid = i; strcpy(p->name, pname[i]); p->status = READY; p->next = p + 1; } proc[NPROC-1].next = 0; // circular proc list freeList = &proc[0]; readyQueue = 0;

}

p = dequeue(&freeList); p->priority = 0; p->ppid = 0; running = p; ups("running = P"); upi(running->pid); ups("\n"); printList(freeList);

void kexit() { ups("proc "); upi(running->pid); ups(" kexit\n"); running->status = FREE; running->priority = 0; enqueue(&freeList, running); tswitch(); } u64 kfork(u64 func, u64 priority) { u64 i; PROC *p = dequeue(&freeList);; if (p==0){ ups("no more PROC, kfork failed\n"); return 0; } p->status = READY; p->priority = priority; p->ppid = running->pid;

643

644

11 ARMv8 Architecture and Programming

p->usp = (u64 *)(ustack + (p->pid)*0x4000); // each proc has own usp // set kstack to resume to body // stack = x30 x29 ...... x0 0

//syscall frame | tswitch frame // x30 29 .... x2 x1 x0 zr spsr elr| x30 x29 ... x1 x0 zr| // 1 2 31 32 33 34 | 35 34+32=66 // spsr VA0 | goUmode for (i=1; ikstack[SSIZE-i] = 0;

p->kstack[SSIZE-33] = 0x0; // spsr M[4:0]=0 ar64, previous=SP_EL0 p->kstack[SSIZE-34] = uCode; // elr_el1= umode Code p->kstack[SSIZE-35] = (u64)func; p->ksp = &(p->kstack[SSIZE-66]);

// kCode() OR goUmode()

enqueue(&readyQueue, p);

ups("proc "); upi(running->pid); ups("kforked a child proc "); upi(p->pid); ups(" "); printQ(readyQueue); }

return p->pid;

void scheduler() { ups("P"); upi(running->pid); ups("in scheduler "); if (running->status == READY) enqueue(&readyQueue, running); running = dequeue(&readyQueue); ups("next running = "); upi(running->pid); ups("\n"); //ups("proc "); upi(running->pid); ups("running\n"); ups("----------------------------------------\n"); } void kCode() { char c; u64 pid;

//ups("proc "); upi(running->pid); ups(" resume to body()\n"); while(1){ unlock(); pid = running->pid; ups("P"); upi(running->pid); ups(running->name); ups(" running in OS kernel at EL1\n");

}

}

ups("enter a c = ugetc(); switch(c){ case 'f': case 's': case 'u': }

command [f|s|u] : "); uputc(c); ups("\n");

kfork(kCode, 1); tswitch(); toUmode(running->usp);

break; break; break;

11.8 Programming Examples

645

void kmode() { char c; u64 pid;

while(1){ unlock(); pid = running->pid; ups("proc P"); upi(running->pid); ups(running->name); ups(" running in OS kernel at EL1\n");

}

}

ups("enter a c = ugetc(); switch(c){ case 'f': case 's': case 'u': }

command [f|s|u] : "); uputc(c); ups("\n"); kfork(kCode, 1); tswitch(); return;

break; break; break;

/************* syscall.c file ****************/ typedef struct stack{ u64 elr, spsr, zxr, x0, x1, x2, x3; ; // only need elr, x0-x3 }KSTACK;

void trap_chandler(KSTACK *p) // p->kstack Top { u64 a,b,c,d; // syscall parameters from Umode in syscall(a,b,c,d) ups("trap_chandler "); u64 esr = get_ESR_EL1(); u64 elr = get_ELR_EL1(); u64 spsr = get_SPSR_EL1();

ups("elr="); upx(elr); ups("\n"); u32 r = esr >> 24; r = r & 0x000000FF;

ups("syndrome="); up2x(r); ups(" "); ups("spsr="); upx(spsr); ups("\n");

if (r == 0xF2 || r == 0x02){ // if brk or hvc: inc elr by 4, return p->elr += 4; return; } // looking for SVC from EL0, syndrome=0x56 // BAD ARM idea of lumping SVC into async exceptions!!!

if (r == 0x56){ // svc from EL0 a = p->x0; b = p->x1; c = p->x2; d = p->x3; // syscall parameters ups("syscall number="); upi(a); if (a==0){ ups("getpid "); r = running->pid; }

646

}

11 ARMv8 Architecture and Programming

if (a==1){ ups("getppid "); r = running->ppid; } if (a==2){ ups("chname "); ups(b); ups(" "); strcpy(running->name, b); ups("new name="); ups(running->name); ups("\n"); r = 0; } if (a==3){ ups("ps : "); printList(freeList); printQ(readyQueue); ups("running = P"); upi(running->pid); ups(running->name); ups(" "); r = 0; } if (a==4){ ups("SWITCH PROCESS in OS kernel\n"); tswitch(); r = 0; } if (a==5){ ups("ENTER Kmode\n"); kmode(); r = 0; } p->x0 = r; // return value to Umode

}

// process Umode code in EL0 void uCode() { u64 r; while(1){ ups("P"); upi(running->pid); ups("in Umode try get CurrentEL\n"); //ups("usp="); upx(running->usp); ups("\n"); u64 el = get_current_el(); ups("el from Umode = "); upx(el); ups("\n"); //ugetc();

ups("test syscalls to OS kernel in EL1 and back to EL0\n"); ups("syscall 0 getpid "); r = syscall(0,0,0,0); ups("return value = "); upi(r); ups("\n"); ugetc(); ups("syscall 1 getppid "); r = syscall(1,0,0,0); ups("return value = "); upi(r); ups("\n"); ugetc();

11.8 Programming Examples

/******** commented out for shorter display ******* ups("syscall 2 chanme "); r = syscall(2,"KCW",0, 0); ups("running->name="); ups(running->name); ups(" "); ups("retrun value = "); upi(r); ups("\n"); ugetc(); ups("syscall 3 ps "); r = syscall(3,0,0,0); ups("return value = "); upi(r); ups("\n");

ups("syscall 4 switch process\n"); r = syscall(4,0,0,0); *************************************************/

}

}

ups("syscall 5 to OS kernel\n"); r = syscall(5,0,0,0);

#include #define u8 uint8_t #define u32 uint32_t #define u64 uint64_t

u64 switchProc; u64 seconds; u64 ustack; // contain ustack_start in t.ld #include "uart.c" #include "gicv3.c" #include "timer.c"

#include "kernel.c"

u64 getTimerCount(); u64 getMPIDR_EL1(), getMPIDR_EL3(); void strcpy(char *dest, char *src) { while(*src){ *dest = *src; dest++; src++; } *dest = 0; }

// these are needed for aarch64 gcc if change sp i code unsigned long __stack_chk_guard;

void __stack_chk_guard_setup(void) { __stack_chk_guard = 0xBAAAAAAD;//provide some magic numbers } void __stack_chk_fail(void) { // Error message }

647

648

11 ARMv8 Architecture and Programming

char line[64]; int main() { switchProc = 0; seconds = 0;

ups("currentEL ="); u64 r = read_current_el(); upx(r); ups("\n");

/***** skip these tests ****** ups("do_brk\n"); do_brk(); // puts("after brk\n"); ups("do_hvc\n"); do_hvc(); //puts("after hvc\n");

ups("do_svc\n"); do_svc(); //puts("after svc\n"); ugetc(); **************************/

ups("initialize OS kernel in EL1\n"); kernel_init(); //ups("P0 fork P1 OS kernel : "); kfork(kCode, 1);

// fork P1

//printQ(readyQueue);

ups("enter a key to continue : "); ugetc(); tswitch(); }

while(1) wfi();

ENTRY(start)

SECTIONS { /* start at Virt RAM address */ . = 0x40000000; .text : { ts.o *(.text) } .data : { *(.data) } .bss : { *(.bss) } . = ALIGN(16);

11.8 Programming Examples

649

. = . + 0x4000; stack3_top = .; . = . + 0x4000; stack2_top = .; . = . + 0x4000; stack1_top = .; . = . + 0x4000; stack0_top = .;

ustack_start = .; . = . + 0x4000; p1stack_top = .; . = . + 0x4000; p2stack_top = .; . = . + 0x4000; p3stack_top = .; . = . + 0x4000; p4stack_top = .; . = . + 0x4000; p5stack_top = .; . = . + 0x4000; p6stack_top = .; . = . + 0x4000; p7stack_top = .; }

. = . + 0x4000; p8stack_top = .;

rm *.elf 2> /dev/null (cd USER; mku u1)

aarch64-linux-gnu-as -o ts.o ts.s aarch64-linux-gnu-gcc -c -mcpu=cortex- a57 -Wall -o t.o t.c #2> /dev/null aarch64-linux-gnu-ld -T t.ld ts.o t.o -o kernel.elf aarch64-linux-gnu-objdump -D kernel.elf > kernel.list rm *.o echo go? read DUMMY

qemu-system-aarch64 -machine virt -cpu cortex-a57 \ -machine virtualization=on -machine secure=on -machine gic-version=3 \ -nographic -kernel kernel.elf Explanations of Example 6 Code (1). Startup sequence: Similar to Example 5, Example 6 program starts execution in EL3. After configuring GICv3 and timer, it drops down to EL2 and then to EL1 to execute main() in C. In the main() function, it first initializes the OS kernel data structures, create, and run the initial process P0 (pid=0). P0 forks processes with User mode execution images and enters them into the readyQueue. Then P0 switches process to run P1. (2). Process with User mode image: Unlike Example 5, in which all processes run in the OS kernel at EL1, processes (other than P0) in Example 6 can run in User mode at EL0. This is done in the revised kfork() function. Before examining the kfork() function code, we first explain how to set up a process to execute in the OS kernel.

650

11 ARMv8 Architecture and Programming

We may think of each process as an execution entity in a forever loop, in which it runs for a while and then it calls tswitch() to give up CPU to another process. When it regains CPU, it emerges from tswitch() to run again, etc., and tswitch() consists of three parts: tswitch: SAVE:

save CPU registers in proc’s (kmode) stack; save stack pointer sp in proc.ksp FIND: bl scheduler // call scheduler to let running point to next PROC to run RESUME: restore CPU stack pointer sp from (running) proc.ksp pop saved CPU registers from proc’s ksatck ret // by saved x30 When creating processes to execute in User mode, we may use the same technique. First, we pretend the new process ran in User mode before. The reason why it is not running now is because it did a syscall from User mode to the OS kernel, followed by a tswitch() to put itself in the readyQueue and gave up the CPU. If so, its Kmode stack must contain two frames: a syscall frame followed by a tswitch frame, as the following shows | syscall frame | tswitch frame | kstack = |x30 x29 ….. x1,x0 xzr spsr elr | x30 x29 … x1 x0, zxr| | 1 2 31 32 33 34 | 35 36 .. 66| where all the saved register entries can be 0, except kstack[SSIZE-33] = spsr : bits[5:0]=00000: Previous in ar64, EL0 kstack[SSIZE-34] = entry address of Umode image kstack[SSIZE-35] = resume address in Kmode = address of kCode() Optionally, we may also set the saved x0 in the syscall frame as the return value to User mode and the saved x0 in the tswitch frame to the address of entry 34 as the kmode stack pointer when it is ready to goUmode, as in kstack[SSIZE-31 = return value to User mode, e.g. process pid kstack[SSIZE-65] = &kstck[SSIZE-34]; Accordingly, we modify the kfork() function as follows, where the changes are shown in bold face for clarity: u64 kfork(u64 func, u64 priority) {

u64 i; PROC *p = dequeue(&freeList);; if (p==0){ ups("no more PROC, kfork failed\n"); return 0; } p->status = READY; p->priority = priority; p->ppid = running->pid; p->parent = running;

// set kstack for new proc to resume and goUmode // syscall frame | tswitch frame | // x30 29 .... x2 x1 x0 zr spsr elr | x30 x29 ... x1 x0 zr| // 1 2 31 32 33 34 | 35 65 66 // spsr uCode | func for (i=1; ikstack[SSIZE-i] = 0;

// zero out kstack[ ]

p->kstack[SSIZE-31] = p->pid; p->kstack[SSIZE-33] = 0x0; p->kstack[SSIZE-34] = uCode;

// return pid // spsr=00000: ar64, was in EL0 // elr_el1= umode Code

11.8 Programming Examples

651

p->kstack[SSIZE-35] = (u64)func; // kCode() OR goUmode() p->kstack[SSIZE-65] = &p->kstack[SSIZE-34]; p->ksp = &(p->kstack[SSIZE-66]); // saved proc.ksp p->usp = ustack + (p->pid)*0x4000; // each proc has its own usp enqueue(&readyQueue, p); //ups("proc "); upi(running->pid); ups("kforked a child proc "); //upi(p->pid); ups("\n"); printQ(readyQueue); }

return p->pid;

Process Resume Sequence: Now we explain the resume sequence of a newly created process to start running. The generated code of the handler macro consists of three parts: (1). Save registers x30-x1, x0, zxr, spsr, elr in process Kmode stack; Save process usp (SP_EL0) into proc.usp; (2). Call trap_chandler(sp); goUmode: (3). Restore process usp from proc.usp; Restore spsr, elr (which may be modified by OS kernel) Restore saved registers; // saved x0=return value to Umode ERET When a newly created process is scheduled to run, it resumes from the bottom half of tswitch() labeled RESUME, which is RESUME: Restore ksp from proc.ksp; Use the tswitch stack frame to return to the saved return address in x30. When P0 creates a P1 by kfork(kCode, 1); the saved x30 at P1’s kstack[SSIZE-35] = kCode. When P1 starts to run, it will resume to execute kCode() in the OS kernel. kCode displays a short command menu [f|s|u] where the command f: fork a child process by kfork() s: switch process u: go to User mode to execute uCode() in EL0. When a process gets the “u” command, it cannot simply return or execute goUmode. This is because its kmode stack pointer has been changed while executing kCode(). So, we let it call toUmode(running->usp) in assembly, where running>usp is the User mode stack pointer of the process. Each process has its own User mode stack, but they all execute the same uCode() function code in User mode. For ease of reference, the toUmode() code is shown below: .global toUmode // called as toUmode(running->usp) toUmode: // x0 = usp // Determine EL0 Execution state MOV X1, #0b00000 // DAIF=0000 M[4:0]=00000 for Ar64 and EL0 MSR SPSR_EL1, X1 // SPSR of EL1 as if from EL0 before

652

11 ARMv8 Architecture and Programming

ADR x1, el0_entry MSR ELR_EL1, X1 ERET

el0_entry: mov sp, x0 bl uCode

// el1_entry point // load into ELR_SP1 // go EL0

// x0=proc.usp; each proc has its own usp in EL0 // call uCode() in C

(3). System calls to OS kernel Once a process is in User mode at EL0, it cannot change the exception level arbitrarily for obvious reasons. The only way a User mode process can go back to the OS kernel at EL1 is by one of the following means: (a) Interrupts: While executing in User mode at EL0, the CPU’s PSTATE.DAIF bits are 0s so that it will take any interrupts, which cause the CPU to enter EL1, hence the process executing on the CPU to the OS kernel at EL1. (b) Traps: These are called synchronous exceptions in ARMv8, which include all the traditional trap errors and execution of the SVC instruction. Here we focus on system calls by SVC. While executing uCode() in a User mode, a process may use SVC instruction to do system calls, or syscalls for short, to the OS kernel, by int r = syscall(int a, int b, int c, int d); where the parameter a is the syscall number and b, c, and d are the parameters to the OS kernel functions. Syscall itself is implemented in assembly as syscall: svc #1 ret In ARM, the syscall parameters a, b, c, and d are passed in registers x0 to x3, and syscall return value is in the result register x0. A User mode process may issue any number of syscalls to the OS kernel, provided that the OS kernel implements such calls. For simplicity and demonstration purpose, we only implemented six syscalls: Syscall Function -----------------------------------------------------------------------0 getpid : return pid of process 1 getppid : return pid of parent process 2 chname: change process name in proc.name field 3 ps : print proc information in OS kernel 4 switch : switch process in OS kernel 5 kmode : go back to OS kernel to run at EL1 ----------------------------------------------------------------------------Syscall Handler in OS Kernel AMRv8 classifies SVC as a kind of synchronous exception. For syscalls by SVC from EL0, the OS kernel’s EL1 vector table must be configured properly. Specifically, Group 3 of the EL1 vector table must be filled with entries to accept such exceptions from EL0: //Group 3: Lower EL using AArch64: // Group 3: from EL0 to OS kernel // at EL1 .balign 0x80 lower_el_a64_sync: b trap0_handler

.balign 0x80 lower_el_a64_irq: b irq_handler

// traps from EL0 // IRQ from EL0

11.8 Programming Examples

.balign 0x80 lower_el_a64_fiq: b fiq_handler

653

// FIQ from EL0

.balign 0x80 lower_el_a64_serr: // do not consider Serror b . In the vector table, the lower_el_a64_sync entry contains b trap0_handler which calls the handler macro with two parameters handler trap_chandler goUmode: where trap_chandler is the trap handler function in C and goUmode: is a label in the generated code, which can be used by processes to return to Umode from the OS kernel. Other entries for IRQ and FIQ call the same handler macro with only a Chnadler() to call. Their generated code does not need a label because interrupt processing normally does not pause in the OS kernel. When a SYNC exception occurs or when a process issues a syscall by SVC, the CPU is directed to execute the same trap_chandler() in the OS kernel. A standard systems programming technique is to pass the kernel mode stack pointer to the trap_chandler(u64 sp) as a parameter, allowing the trap_chandler() to access all the needed information in the process kmode stack. Since there are many different types of SYNC exceptions, such as memory faults and illegal instruction, the trap_chandler code must examine the syndrome register ESR_EL1 to recognize syscalls by SVC, which has the syndrome bit pattern 0x56. The following shows the trap_chandler() code. For clarity, the code that deals with syscalls are shown in bold face lines: typedef struct stack{ // as saved in kstack by the handler macro u64 elr, spsr, zxr, x0, x1, x2, x3; // need elr, x0-x3 }KSTACK; void trap_chandler(KSTACK *p) // p->kstack Top { u64 a,b,c,d; // syscall parameters from Umode syscall(a,b,c,d) ups("trap_chandler "); u32 esr = getESR_EL1(); // get ESR_EL1 u64 elr = getELR_EL1(); // ELR_EL1 also in kstack u64 spsr = get_SPSR_EL1(); ups("esr="); upx(esr); ups(" "); ups("elr="); upx(elr); ups(" "); u32 r = esr >> 24; r = r & 0x000000FF;

ups("syndrome="); up2x(r); ups("\n"); ups(“spsr = “); upx(spsr); ups(“\n”); if (r == 0xF2 || r == 0x02){ // if brk or hvc: inc elr by 4 p->elr += 4; return; } // looking for SVC from EL0, syndrome=0x56 // ARM lumping SVC into sync exceptions NOT a good idea !!! if (r == 0x56){ // got it! This is svc from EL0 a = p->x0; b = p->x1;

654

11 ARMv8 Architecture and Programming

c = p->x2; d = p->x3;

ups("syscall number="); upi(a);

}

// implement syscalls in OS kernel if (a==0){ ups("getpid "); r = running->pid; } if (a==1){ ups("getppid "); r = running->ppid; } if (a==2){ ups("chname "); ups(b); ups(" "); strcpy(running->name, b); ups("new name="); ups(running->name); ups("\n"); r = 0; } if (a==3){ ups("ps : "); printList(freeList); printQ(readyQueue); ups("running = P"); upi(running->pid); ups(running->name); ups(" "); r = 0; } if (a==4){ ups("SWITCH PROCESS in OS kernel\n"); tswitch(); r = 0; } if (a==5){ ups("ENTER Kmode\n"); kMode(); r = 0; } p->x0 = r; // syscall return value to Umode

}

When a process executing in the User mode at EL0 tries to get Current EL, it generates a trap error. The trap_chndler() in OS kernel at EL1 displays SPSR.bits[4:0] = 0, which confirms the process was executing at EL0. Figure 11.19 shows the outputs of the Example 6 program.

11.8.7

Example 7: Refinement of User Mode Process

In Example 6, every process starts to execute kCode in the OS kernel. It enters the User mode only by the “u” command. When it gets a “u” command, the process arranges its own return path to User mode, executes an IRET to enter EL0, and then calls uCode() to execute in EL0. Logically speaking, there is nothing wrong with this sequence of actions taken by the process itself to migrate from EL1 to EL0. It actually gives the process and the programmer a sense of total control of their own destiny. However, in doing so the process totally ignored the syscall frame in its kstack. In a real system like Unix or Linux, every process, except P0 (pid=0), is forked by a parent, which sets up all the execution environment of the child pro-

11.8 Programming Examples

655

Fig. 11.19 Sample outputs of Example 6 program

cess. When the child process starts to run, it immediately goes into the User mode to execute a User mode image directly. We can achieve the same effect in our simple OS kernel as well. Recall that every new process is created with its kernel mode stack containing a syscall frame followed by a tswitch frame, as in // set kstack for new proc to resume and goUmode // syscall frame | tswitch frame | // x30 29 .... x2 x1 x0 zr spsr elr | x30 x29 ... x1 x0 zr| // 1 2 31 32 33 34 | 35 66 // spsr uCode| func In the tswitch stack frame, the func entry at kstack[SSIZE-35] is the starting address of the process when it resumes running in the OS kernel. If we set func to goUmode, the process would immediately execute goUmode code, which restores SPSR_EL1 and ELR_EL1 from the saved spsr and uCode to go into the User mode at EL0. The needed modifications are listed below: (1). kernel.c file:

extern void uCode(); extern void goUmode();

u64 kfork(u64 func, u64 priority) { u64 i; PROC *p = dequeue(&freeList);;

656

11 ARMv8 Architecture and Programming

if (p==0){ ups("no more PROC, kfork failed\n"); return 0; } p->status = READY; p->priority = priority; p->ppid = running->pid; p->parent = running;

// syscall frame | tswitch frame | // x30 29 .... x2 x1 x0 zr spsr elr | x30 x29 ... x1 x0 zr| // 1 2 31 32 33 34 | 35 66 // spsr VA0 | goUmode for (i=1; ikstack[SSIZE-i] = 0;

p->kstack[SSIZE-31] = p->pid; // proc pid to Umode p->kstack[SSIZE-33] = 0x0; // spsr bitsM[4:0]=00000:ar64, EL0 p->kstack[SSIZE-34] = uCode; // elr_el1= uCode p->kstack[SSIZE-35] = (u64)goUmode; p->ksp = &(p->kstack[SSIZE-66]); p->usp = (u64 *)(ustack + (p->pid)*0x4000); enqueue(&readyQueue, p);

//ups("proc "); upi(running->pid); ups("kforked a child proc "); //upi(p->pid); ups("\n"); printQ(readyQueue); }

return p->pid;

void scheduler() { ups("proc "); upi(running->pid); ups("in scheduler "); if (running->status == READY) enqueue(&readyQueue, running); running = dequeue(&readyQueue); ups("next running = "); upi(running->pid); ups("\n"); //ups("proc "); upi(running->pid); ups("running\n"); } void kCode() { char c; u64 pid;

//ups("proc "); upi(running->pid); ups(" resume to kCode()\n"); while(1){ unlock(); pid = running->pid; ups("proc "); upi(running->pid); ups(running->name); ups(" running in OS kernel at EL1\n"); ups("enter a command [f|s|u] : "); c = ugetc(); uputc(c); ups("\n");

11.8 Programming Examples

}

}

switch(c){ case 'f': kfork(kCode, 1); break; case 's': tswitch(); break; case 'u': toUmode(); break; }

void kMode() { char c; u64 pid;

while(1){ unlock(); pid = running->pid; ups("proc P"); upi(running->pid); ups(running->name); ups(" running in OS kernel at EL1\n"); ups("enter a command [f|s|u] : "); c = ugetc(); uputc(c); ups("\n");

}

}

switch(c){ case 'f': kfork(goUmode, 1); break; case 's': tswitch(); break; case 'u': return; break; }

typedef struct stack{ u64 elr, spsr, zxr, x0, x1, x2, x3; }KSTACK; only need elr, x0 to x3

void trap_chandler(KSTACK *p) // p->kstack Top { u64 a,b,c,d; // syscall parameters from Umode in syscall(a,b,c,d) ups("trap_chandler: "); u64 esr = get_ESR_EL1(); u64 elr = get_ELR_EL1(); u64 spsr = get_SPSR_EL1();

//ups("esr="); upx(esr); ups(" "); ups("elr="); upx(elr); ups(" "); ups(“spsr=”); upx(spsr); ups(“ ”);

u32 r = esr >> 24; r = r & 0x000000FF; ups("syndrome="); up2x(r); ups("\n");

if (r == 0xF2 || r == 0x02){ // if brk or hvc: inc elr by 4, return p->elr += 4; return; } // looking for SVC from EL0, syndrome=0x56 // BAD ARM idea of lumping SVC into async exceptions!!!

657

658

11 ARMv8 Architecture and Programming

if (r a = b = c = d =

== 0x56){ // svc from EL0 p->x0; p->x1; // syscall parameters a,b p->x2; p->x3;

ups("syscall number = "); upi(a);

}

}

if (a==0){ ups("getpid "); r = running->pid; } if (a==1){ ups("getppid "); r = running->ppid; } if (a==2){ ups("chname "); ups(b); ups(" "); strcpy(running->name, b); ups("new name="); ups(running->name); ups("\n"); //ugetc(); r = 0; } if (a==3){ ups("ps : "); printList(freeList); printQ(readyQueue); ups("running = P"); upi(running->pid); ups(running->name); ups("\n"); r = 0; } if (a==4){ ups("SWITCH PROCESS in OS kernel\n"); tswitch(); r = 0; } if (a==5){ ups("ENTER KMODE in OS KERNEL\n"); kMode(); r = 0; } p->x0 = r;

// process Umode code in EL0 void uCode(int pid) { u64 r; while(1){ ups("P"); upi(pid); ups("in EL0 try to get CurrentEL\n"); ups("usp="); upx(running->usp); ups("\n"); ups("test syscalls to OS kernel in EL1 and back to EL0\n");

11.8 Programming Examples

659

ups("umode syscall 0: "); r = syscall(0,0,0,0); ups("return value = "); upi(r); ups("\n"); //ugetc(); ups("umode syscall 1: "); r = syscall(1,0,0,0); ups("return value = "); upi(r); ups("\n"); //ugetc();

/********** skip these for shorter print out ******** ups("umode syscall 2: "); r = syscall(2,"myname is KCW",0, 0); ups("running->name="); ups(running->name); ups(" "); ups("return value = "); upi(r); ups("\n"); ugetc(); ups("umode syscall 3: "); r = syscall(3,0,0,0); ups("return value = "); upi(r); ups("\n"); ugetc();

ups("umode syscall 4: "); r = syscall(4,0,0,0); //ups("return value = "); upi(r); ups("\n"); ****************************************************/

}

ups("umode syscall 5: "); r = syscall(5,0,0,0); //ups("return value = "); upi(r); ups("\n");

}

(2). t.c file:

ups("initialize OS kernel in EL1\n"); kernel_init(); ups("P0 fork P1 in OS kernel\n"); kfork(goUmode, 1);

// new process goUmode directly

Figure 11.20 shows the sample outputs of the Example 7 program.

11.8.8

Example 8: Separate User Mode Image

Our final refinement of Example 7 program is to use a separate User mode image, rather than as a part of the OS kernel image. First, we develop a User mode program in a USER directory as follows: // USER Directory Files (1). us.s file : entry to User mode image .global uEntry, syscall, getusp .text // on entry, x0=process pid arranged in kfork() in OS kernel uEntry: bl main

660

11 ARMv8 Architecture and Programming

Fig. 11.20 Sample outputs of Example 7 program

// user process issues syscall(a,b,c,d) to OS kernel syscall: svc #0 ret getusp: mov x0, sp ret (2). u1.c file /*********** u1.c file ***********/ #include #define u8 uint8_t #define u32 uint32_t #define u64 uint64_t

// these are needed for aarch64 gcc if change sp in code unsigned long __stack_chk_guard; void __stack_chk_guard_setup(void) { __stack_chk_guard = 0xBAAAAAAD;//provide some magic numbers

11.8 Programming Examples

} void __stack_chk_fail(void) { // Error message }

// Access UART of I/O u32 *UART0DR = (unsigned int *) 0x09000000; u32 *UART0FR = (unsigned int *) 0x09000018; int ugetc() { while( *UART0FR & (1> i) & 0xf]);

661

662

}

11 ARMv8 Architecture and Programming

}

if (i == 32) uputc(' ');

void urpu(u32 x) { char c; if (x){ c = ctab[x % 10]; urpu(x/10); uputc(c); } } void upi(u32 x) { if (x==0) uputc('0'); else urpu(x); uputc(' '); }

extern u64 syscall();

int strcmp(char *s1, char *s2) { while((*s1 && *s2) && (*s1==*s2)){ s1++; s2++; } if ((*s1==0) && (*s2==0)) return 0; return(*s1 - *s2); } int main(u64 pid) { u64 ppid; char line[64]; while(1){

ups("process P"); upi(pid); ups("in EL0\n"); ups("input a command |getpid|getppid|kmode| : "); ugets(line); ups("\n"); if (strcmp(line, "getpid")==0){ pid = syscall(0,0,0,0); ups("pid="); upi(pid); ups("\n"); }

if (strcmp(line, "getppid")==0){ ppid = syscall(1,0,0,0); ups("ppid="); upi(ppid); ups("\n"); }

11.8 Programming Examples

}

}

663

if (strcmp(line, "kmode")==0){ syscall(2,0,0,0); }

(3). u.ld file : Memory Address = 0x40100000 ENTRY(uEntry)

SECTIONS { . = 0x40100000; .text : { *(.text) } .data : { *(.data) } .bss : { *(.bss) } } (4). mku sh script file echo mku $1

aarch64-linux-gnu-as -o us.o us.s aarch64-linux-gnu-gcc -c -mcpu=cortex- a57 -Wall -o $1.o $1.c aarch64-linux-gnu-ld -T u.ld us.o $1.o -o $1.elf aarch64-linux-gnu-objdump -D $1.elf > $1.list aarch64-linux-gnu-objcopy -O binary $1.elf $1 # aarch64-linux-gnu-objcopy: supported targets: # elf64-littleaarch64 # elf64-bigaarch64

aarch64-linux-gnu-objcopy -I binary -O elf64-littleaarch64 \ -B aarch64 $1 $1.o cp $1.o ../ // ------------ end of USER files ----------------

In the USER directory, the mku scripts compile and link us.s and u1.c into a u1.elf file with starting address 0x40100000. Then it converts u1.elf into a pure binary executable and finally into a u1.o file in object code format, which can be included in the OS kernel image as a dummy data section. Since u1 can only be executed from the address 0x40100000, it must be copied to the correct address before any process can execute it. This is done by the OS kernel when it starts. Then the OS kernel can create processes to execute u1 as their User mode images at 0x40100000. This simulates “loading” the User mode image into a known physical address for the process to execute in the User mode. In Example 8, the ts.s, gicv3.c, timer.c, and uart.c files are the same as in Example 7. We only show the modified kernel.c, t.c, t.ld, and mk sh script files. For the sake of simplicity, we only implement three syscalls, getpid, getppid, and kmode, which lets the User mode process to execute a kMode() function in kernel. The kModeC() function supports fork, process switch, and return to the User mode. For clarity, important modifications are shown in bold face lines. (1). Kernel.c file /****** kernel.c file ********/ void kernel_init() { u64 i, j; PROC *p; ups("kernel_init() : "); for (i=0; ipid = i; strcpy(p->name, pname[i]); p->status = READY; p->next = p + 1;

} proc[NPROC-1].next = 0; // circular proc list freeList = &proc[0]; readyQueue = 0;

}

p = dequeue(&freeList); p->priority = 0; p->ppid = 0; running = p; ups("running = P"); upi(running->pid); ups("\n"); printList(freeList);

void kexit() { ups("proc "); upi(running->pid); ups(" kexit\n"); running->status = FREE; running->priority = 0; enqueue(&freeList, running); tswitch(); } u64 kfork(u64 kcode, u64 priority, u64 ucode) { u64 i; PROC *p = dequeue(&freeList);; if (p==0){ ups("no more PROC, kfork failed\n"); return 0; } p->status = READY; p->priority = priority; p->ppid = running->pid; p->parent = running; // syscall frame // x30 29 .... x2 x1 x0 zr spsr elr // 1 2 31 32 33 34 // spsr VA0

| | | |

p->kstack[SSIZE-31] p->kstack[SSIZE-33] p->kstack[SSIZE-34] p->kstack[SSIZE-35]

// // // //

for (i=1; ikstack[SSIZE-i] = 0; = = = =

p->pid; 0x0; ucode; (u64)goUmode;

tswitch frame x30 x29 ... 35 goUmode

x1 x0 zr| 66

return pid to Umode spsr=00000: ar64, EL0 elr_el1= umode Code goUmode directly

p->ksp = &(p->kstack[SSIZE-66]); p->usp = (u64 *)(0x90000000 + 0x10000 - (p->pid)*0x4000);

enqueue(&readyQueue, p); ups("proc "); upi(running->pid); ups("kforked a child proc ");

11.8 Programming Examples

}

665

upi(p->pid); ups(" "); printQ(readyQueue); return p->pid;

void scheduler() { ups("proc "); upi(running->pid); ups("in scheduler "); if (running->status == READY) enqueue(&readyQueue, running); running = dequeue(&readyQueue); ups("next running = "); upi(running->pid); ups("\n"); //ups("proc "); upi(running->pid); ups("running\n"); } void kMode() { char c; u64 pid;

// process kMode code

while(1){ unlock(); pid = running->pid; ups("proc "); upi(pid); ups(running->name); ups(" running in OS kernel at EL1\n"); ups("enter a c = ugetc(); switch(c){ case 'f': case 's': case 'u':

}

}

command [f|s|u] : "); uputc(c); ups("\n");

kfork(goUmode, 1, 0x40100000); break; tswitch(); break; return; break;

} typedef struct stack{ u64 elr, spsr, zxr, x0, x1, x2, x3; }KSTACK;

// only need elr, x0 to x3

void trap_chandler(KSTACK *p) // p->kstack Top { u64 a,b,c,d; // syscall parameters from Umode ups("trap_chandler: "); u64 esr = getESR_EL1(); u64 elr = getELR_EL1();

//ups("esr="); upx(esr); ups(" "); ups("elr="); upx(elr); ups(" ");

u32 r = esr >> 24; r = r & 0x000000FF; ups("syndrome="); up2x(r); ups("\n");

if (r == 0xF2 || r == 0x02){ // if brk or hvc: inc elr by 4, return p->elr += 4; return; }

666

11 ARMv8 Architecture and Programming

// looking for SVC from EL0, syndrome=0x56 // ARM lumping SVC into async exceptions very BAD idea !!! if (r a = b = c = d =

== 0x56){ // svc from EL0 p->x0; p->x1; // syscall parameters a,b p->x2; p->x3;

ups("syscall #="); upi(a);

}

}

// For simplicity, we only implement 3 syscalls if (a==0){ ups("getpid "); r = running->pid; } if (a==1){ ups("getppid "); r = running->ppid; } if (a==2){ ups("EXECUTE IN OS KERNEL\n"); kMode(); r = 0; } p->x0 = r;

(2). t.c file

/********* t.c file ********/ #include #define u8 uint8_t #define u32 uint32_t #define u64 uint64_t

u64 switchProc; u64 seconds; u64 ustack; // contain ustack_start in t.ld #include "uart.c" #include "gicv3.c" #include "timer.c"

#include "kernel.c" u64 getTimerCount(); u64 getMPIDR_EL1(), getMPIDR_EL3(); void strcpy(char *dest, char *src) { while(*src){ *dest = *src; dest++; src++; } *dest = 0; }

11.8 Programming Examples

667

// these are needed for aarch64 gcc if change sp i code unsigned long __stack_chk_guard; void __stack_chk_guard_setup(void) { __stack_chk_guard = 0xBAAAAAAD;//provide some magic numbers } void __stack_chk_fail(void) { // Error message } extern char _binary_u1_start, _binary_u1_end; extern int _binary_u1_size; int main() { switchProc = 0; seconds = 0;

char *cp = &_binary_u1_start; u64 usize = &_binary_u1_end - &_binary_u1_start; ups("copy u1 to 0x40100000\n"); char *src = cp; char *dest = 0x40100000; for (int i=0; i < usize; i++){ *dest = *src; dest++; src++; }

ups("initialize OS kernel in EL1\n"); kernel_init();

//ups("P0 fork P1 at 0x40100000 in OS kernel : kfork(goUmode, 1, 0x40100000);

ups("enter a key to continue : "); ugetc(); tswitch(); }

while(1) wfi();

(3). t.ld file ENTRY(start)

SECTIONS { /* start at Virt RAM address */ . = 0x40000000; .text : { ts.o *(.text) } .data : {

");

668

11 ARMv8 Architecture and Programming

*(.data) } .bss : { *(.bss) } .data : { /* u1.o as a dummy data section */ u1.o }

. = ALIGN(16);

. = . + 0x4000; stack3_top = .; . = . + 0x4000; stack2_top = .; . = . + 0x4000; stack1_top = .; . = . + 0x4000; stack0_top = .;

}

(4) mk3 sh script file

rm *.elf 2> /dev/null cp USER/u1.o ./

# use USER/u1.o as a data section

aarch64-linux-gnu-as -o ts.o ts.s aarch64-linux-gnu-gcc -c -mcpu=cortex- a57 -Wall -o t.o t.c aarch64-linux-gnu-ld -T t.ld ts.o t.o -o kernel.elf aarch64-linux-gnu-objdump -D kernel.elf > kernel.list rm *.o

echo go? read DUMMY

qemu-system-aarch64 -machine virt -cpu cortex-a57 –m 2048M

\

-machine virtualization=on -machine secure=on -machine gic-version=3 \ -nographic -kernel kernel.elf //-------------- end of Example 8 program files -----------Figure 11.21 shows the sample outputs of Example 8 program. Exercise: In a real operating system, I/O operation is done by either privileged instruction or accessing I/O mapped memory. As a result, only the OS kernel can do I/O operations. User mode processes should not be able to do I/O directly. Basic I/O operations in the User mode should be syscalls to the OS kernel. In Example 8, the User mode program u1.c uses the same physical address of UART to do I/O, which is logically incorrect. Modify the u1.c code to make both uget() and uputc() syscalls to the OS kernel, which actually performs I/O for User mode processes.

11.9 Memory Management in ARMv8 When ARMv8 processors are executing in the Ar64 execution state, full virtual address (VA) is 64 bits, which is usually denoted as two sections of eight hex digits each in the form 0x12345678_12345678

11.9 Memory Management in ARMv8

669

Fig. 11.21 Sample outputs of Example 8 program

or as four sections of four hex digits each in the form 0x1234_1234_1234_1234. However, the actual implemented VA size is only 48 bits, with an optional maximum of 52 bits. Virtual addresses (VA) are used by executing programs. Physical addresses (PA) are actual addresses of physical memory. During execution, the processor must translate virtual addresses (VA) to physical addresses (PA) by the memory management unit (MMU) hardware. In addition to mapping VA to PA, the translation tables also specify the memory attributes and access rights for protection. Due to the larger VA space and several exception levels, address translation in ARMv8 is quite complex, which is very different from that in ARMv7.

11.9.1

Virtual Address Versus Physical Address

Conversion of VA to PA is done by the memory management unit (MMU) [20, 21]. In ARMv8, system resources including memory can be partitioned into secure (S) world at exception level EL3 and non-secure (NS) world at exception levels EL2, EL1, and EL0. Each world has its own VA and PA addresses. Memory management occurs at different exception levels. Table 11.1 shows the memory types of VA to PA translations at different exception levels. Each translation of VA to PA is called a translation stage. Translation of VA to PA in EL2 and EL3 involves only one stage. Translation in the non-secure world of EL1 and EL0 may require two stages. This is because the hypervisor at EL2 may support multiple guest operating systems at EL1 through virtualization. Each guest OS kernel translates its VA to an intermediate PA (IPA), which is translated to real PA by the hypervisor at EL2. For simplicity, this book considers only one-stage translations.

670

11 ARMv8 Architecture and Programming

Table 11.1 VA to PA in secure and non-secure world

Table 11.2 System control register for MMU operations

11.9.2

System Control Registers for MMU Operations

In ARMv8, MMU operations are controlled by system control registers. Table 11.2 shows the system registers that control MMU operations. As usual, the ELn suffix of the registers indicates the lowest exception level that can access the registers. The following lists some of the important fields of system control registers: SCTLR_ELx : System control registers: M enables MMU, C enables data cache TCR_ELx : Translation Control Registers: IPS, OA size; TGn, granule; TnSZ, IA size TTBR0_EL1, TTBR1_EL1 : Translation table base registers for EL0 and EL1. TTBR0_EL2, TTBR0_EL3 : Translation table base register for EL2 and EL3. There are no TTBR1 registers in EL2 and EL3. Physical address size: The ID_AA64MMFR0_EL1.PARange 4-bit field specifies implemented physical address size, which ranges from 0x0 to 0x6 for 32, 36, 40, 42, 44, 48, and 52 bits, respectively. For ARMv8, the most common physical memory size is 48 bits. The Translation Control Registers (TCR_ELx): Figure 11.22 shows the fields of the TCR_ELx register, which controls the input space size, output PA space size, and translation granule size. PS/IPS: Size of PA or IPA space, maximum output address size, e.g., 101 for 48 bits. TGn: The two-bit Translation Granule TG1 and TG0 fields define the granule size for kernel and user space, respectively, TBn=00 for 4KB, 01 for16KB, and 11 for 64KB. TxSZ fields specify the input VA address size. Specifically, TCR_EL1.T0SZ specifies the size for the lower VA range, using TTBR0_EL1. TCR_EL1.T1SZ specifies the size for the upper VA range, using TTBR1_EL1. Each of the TCR_EL2 and TCR_EL3 only has a single T0SZ field.

11.9.3

Separation of Kernel and Application Virtual Address Spaces

In ARMv8, operating system kernel runs in exception level EL1, and User mode applications run in exception level EL0. ARMv8 supports separate VA spaces for the OS kernel and applications, for the following reasons. Chapter 6 of this book covers memory management of VA to PA mapping. Under an operating system, a process may execute in either kernel (SVC) mode or User (USR) mode. In ARMv6 and ARMv7, there is only one translation base register TTBR, which points the first-level page table of the current running process. Therefore, each process can only have one set of page tables which defines the VA to PA mappings of the process in both modes. The entire 4GB VA space can be divided into two

11.9 Memory Management in ARMv8

671

Fig. 11.22 Translation Control Register (TCR)

equal halves of 2GB each. In the Kernel Mapped High (KMH) scheme, the kernel VA space is at the high end of VA space, and User mode VA is at the low end of the VA space. The level-1 page table of each process contains 4096 entries, and each entry defines a 1MB section. The low 2048 entries define User mode VA to PA mapping. The high 2048 entries define kernel mode VA to PA mapping. Since processes share the same VA space in kernel mode, the high 2048 entries of all processes are identical. When switching process, the OS kernel must switch the page tables of one process to another. This typically requires the MMU to flush its translation lookaside buffer (TLB), at least to invalidate some entries in the TLB. Since all processes in kernel mode share the same kernel VA space, which rarely changes, switching entire page table may be wasteful. ARMv8 MMU is designed to remedy this problem. In ARMv8, the VA space is divided into two subranges. Each process has two separate translation tables, one for kernel mode in EL1 and another one for User mode in EL0. When switch processes, the OS kernel only needs to switch User mode translation table without changing the kernel mode translation table. The division VA space into kernel mode and User mode spaces is done as follows. The Translation Control Register TCR_EL1 contains the size fields T0SZ[5:0] and T1SZ[5:0]. The integer in the field gives the number of the most significant bits in a 64-bit VA address that must be either all 0s or all 1s. This divides the VA space into two subranges. For each VA subrange, the input address size is 2**(64-TnSZ), where TnSZ is one of TCR_EL1. {T0SZ, T1SZ}. This means the two VA subranges are as follows: Lower VA subrange: 0x0000_0000_0000_0000 to (2(**64-T0SZ)- 1) Upper VA subrange: (2**64 - 2**(64-T1SZ) ) to 0xFFFF_FFFF_FFFF_FFFF By choosing different values of T1SZ and T0SZ between 16 and 39, the size of the two subranges may vary. Increasing T0SZ value decreases the upper limit of the lower VA s ubrange, which reduces the User mode space size. Increasing T1SZ value increases the lower limit of upper VA subrange, which reduces the kernel space size. Figure 11.23 shows the division of VA space into kernel mode space and User mode space. It also shows the effect of TnSZ on the subrange sizes. The maximum VA subranges correspond to both T0SZ and T1SZ having the minimum value of 16. In this case, the subranges are as follows: Lower VA subrange: 0x0000_0000_0000_0000 to 0x0000_FFFF_FFFF_FFFF Upper VA subrange: 0xFFFF_0000_0000_0000 to 0xFFFF_FFFF_FFFF_FFFF Corresponding to the division of VA space into two subranges, there are two separate translation base registers, TTBR0_ EL1 and TTBR1_EL1. Bits[55:48] of the TTBR contains an Address Space ID field, and bits[47:1] is the table base address. Which ASID to use is controlled by the Translation Base Control register TBBCR.A1 field. Given a 64-bit VA, the MMU selects the translation table pointed to by TTBR0 if the upper 16 bits of the VA are all 0. It selects TTBR1 if the upper 16 bits of the VA are all 1. Note that EL2 and EL3 only have a TTBR0, but no TTBR1. This means EL2 or EL3 can only use the low-end virtual address range 0x0 to 0x0000FFFF_FFFFFFFF. Figure 11.24 shows the VA to PA mappings of kernel space by TTBR1_EL1 and user space by TTBR0_EL1.

11.9.4

Translation Granule and Translation Tables

In every VA to PA translation scheme, the translation table size and the needed translation levels depend on VA space size and page size. A small page size makes more efficient usage of physical memory but requires more translation levels. Large page size reduces the translation levels but may lead to inefficient memory usage due to internal wastage in each page. An

672

11 ARMv8 Architecture and Programming

Fig. 11.23 Kernel and user subranges

Fig. 11.24 Address translation of kernel and user spaces

OS kernel must try to achieve a balance between these two conflicting requirements. In ARMv8, the TG1 and TG0 fields of Translation Control Register (TCR) define the translation granule, which is the smallest page size. In ARMv8, the supported granule sizes are 4KB, 16KB, or 64KB. The different granule sizes can affect the number and size of translation tables required. For a page size of n bits, the last-level page table maps the least significance n bits VA to PA directly. The remaining 48-n VA bits [47:n] must be resolved by address translation. The required number of levels can be determined as follows: Each translation table entry is 64 bits or 8 bytes. A translation table must be in a page, which can contain a maximum of 2**n/8 = 2**(n−3) entries. Each level can resolve at most n-3 VA bits. For page sizes 4K, 16K, and 64K, the number of VA bits that can be resolved in each level of translation is (n−3) = 9, 11, 13, respectively. These are summarized below: -----------------------------------------------Page size : 4KB 16KB 64KB Last level size: 12 bits 14 bits 16 bits Bits per level : 9 bits 11 bits 13 bits ------------------------------------------------

11.9 Memory Management in ARMv8

673

The translation levels are denoted by L0 to L3. Depending on the input VA size and page size, the required translation levels, as well as the size of each level, can be determined by trying to fill the tables from the last level L3 backward. Some examples are as follows: 4KB page size: When using 4KB page size, the maximum number of VA bits that can be resolved in each level is 9. The leading 16 bits [63:48] of a VA is used to select either TTBR0 or TTBR1. The last 12 VA bits [11:0] are the byte offsets within a 4KB page. The remaining 48 − 12 = 36 bits can be divided into four sets of 9 bits each of the form [9:9:9:9]. In this case, the translation levels needed is 4, as the following figure shows.

16KB page size: When using 16KB page size, the maximum number of VA bits that can be resolved in a level is 11. The leading 16 bits [63:48] of a VA is used to select either TTBR0 or TTBR1. The last 14 VA bits [13:0] are the byte offset within a 16KB page. The remaining 48 − 14 = 34 bits can be divided into four sets of [1:11:11:11] for translation levels [0:1:2:3]. In this case, the needed number of levels is also 4. For this reason, an OS should not use 16KB page size. | 16 |1 | 11 | 11 | 11 | 14 | |63 --- 48|47|46 – 36|35 – 25|24 – 14|13 – 0| | select |L0| L1 | L2 | L3 | offset | 64KB page size: When using 64KB page size, the maximum number of VA bits that can be resolved in a level is 13. The leading 16 bits [63:48] of a VA is used to select either TTBR0 or TTBR1. The last 16 VA bits [15:0] are the byte offsets within a 64KB page. The remaining 48 – 16 = 32 bits are divided into three sets of [6:13:13] for levels [1:2:3]. In this case, it only requires three translation levels.

11.9.5

ARMv8 Translation Table Descriptor Format

Figure 11.25 shows the translation table descriptor format, in which bit0=0 for invalid entries, bit1 denotes the descriptor type, 0 = block entry, and 1 = table entry. Table entries are for levels L0, L1, and L2. Each table entry points to a next-level table. Block entries are for levels L1 and L2. Each block entry points to a physical block memory address. Each translation table begins with a lowest level L0 or L1 containing table descriptors (type = 11). The last level L3 can only contain block entry descriptors. Each of the level L1 (if not the lowest level) and level L2 entries can be either a table entry, which points to a next-level table or a block entry, which points to a block of physical memory. Thus, a translation table may end before reaching the last level L3, which allows the system to use memory blocks larger than the granule size. This kind of flexibility of allowing memory to be accessed in different ways is unique to ARMv8 MMU. Fig. 11.25 Translation table descriptor type

674

11.9.6

11 ARMv8 Architecture and Programming

Translation Table Descriptor Attributes

In addition to a base address, a descriptor also contains attribute fields, which control the access to next-level tables or the memory region of a block or page. Depending on the descriptor type, the attribute fields have different meanings. Attributes in table descriptors: In a table descriptor, bits [63:59] of the descriptor define the attributes for the next-level translation table access.

. The attributes of table descriptor are as follows: NSTable bit[63]: For memory accesses from secure state, this specifies the security level for subsequent levels of lookup. For memory accesses from non-secure state, this bit is ignored. APTable bits[62:61]: Access permissions limit for subsequent levels of lookup, e.g., 00 to allow access from all later levels. UXNTable or XNTable bit[60]: XN limit for subsequent levels of lookup. PXNTable bit[59]: PXN limit for subsequent levels of lookup. The table descriptor format supports hierarchical attributes, so that an attribute set at one level can be inherited by lower (higher numbered) levels. It means that a table entry in an L0, L1, or L2 table can override one or more attributes that are specified in the table that it points to. This can be used for access permissions, security, and execution permissions. For example, an entry in the L1 table that has NSTable = 1 means that the NS bits in the L2 and L3 tables that it points to are ignored and all the entries are treated as having NS = 1. This feature only restricts subsequent levels of lookup for the same stage of translation. Attribute fields in block and page descriptors: In block and page descriptors, the memory attributes are split into an upper attribute block and a lower attribute block. Figure 11.26 shows the attributes of file descriptors. The attributes are as follows: Reserved for software bits[58:55]: Reserved for OS kernel to use. UNX|XN Bit[54]: Execute-never bit, whether the region is executable. PXN Bit[53]: Privileged execute-never bit, whether the region is executable at EL1. nG Bit[11]: No global bit, whether the TLB entry applies to all ASID or only to the current ASID. AF Bit[10]: Access flag. SH Bits[9:8]: Shareability field. AP Bits[7:6]: Data access permission bits, as shown in the figure. NS Bit[5]: Non-secure bit. For memory accesses from secure state. AttrIndex Bits[4:2]: Stage 1 memory attributes index field into MAIR_ELn register. MARI_ELn register indices define memory region as device/normal memory, shareable or not in multi-core processors, inner or outer shareable for cache coherence. Access permissions: The attribute fields of block or page table entry specify access permission (AP) bits and other attributes of the page region. Access permissions (AP) control whether a region is readable or writeable, or both, and can be set separately to EL0 for unprivileged and to EL1, EL2, and EL3 for privileged accesses, as shown in Fig. 11.27. So access permissions are all specified in table descriptors. ARMv8 MMU no longer uses domains for memory protection as in ARMv6 and ARMv7. Other attributes: The operating system kernel runs in the exception level EL1. It defines the translation table of EL1 for the kernel itself to use and a translation of EL0 for applications to use. The OS kernel can specify different permissions for its own data area and for those of applications in accordance with the permissions shown in Fig. 11.27. Other attribute fields

11.9 Memory Management in ARMv8

675

Fig. 11.26 Page descriptor attributes

Fig. 11.27 Page access permissions

control the executability of code pages. Blocks of memory or pages can be marked as executable or non-executable (Execute Never (XN)). The OS kernel code can set the attributes Unprivileged Execute Never (UXN) and Privileged Execute Never (PXN) separately to prevent application code running with kernel privilege or attempts to execute kernel code while in an unprivileged state.

The hypervisor, which runs at exception level EL2, and secure monitor at EL3 only have translation schemes for their own use, and therefore there is no need for a privileged and unprivileged split in permissions.

11.9.7

OS Kernel Use of Translation Table Descriptors

As described in Chap. 6, paging schemes can be full-paging or demand-paging. In full-paging, all the pages of an execution image are allocated before execution starts, and all the pages are present in memory during execution. In demand-paging, not all pages of an execution image are allocated, which are allocated only as needed. Full-paging can be static or dynamic. In static paging, each User mode process image is allocated a piece of contiguous memory, which is divided into page frames, with page table entries pointing at the page frames. In dynamic paging, the page frames of an image are allocated dynamically from a pool of free page frames, which may not be contiguous. In demand-paging, the page frames of an executing program are allocated only when they are needed.

676

11 ARMv8 Architecture and Programming

Virtual memory: In general, the term virtual memory refers to any scheme that maps virtual addresses to physical addresses. However, it is commonly used to mean a system can run large programs with less real memory. The scheme that makes this possible is by demand-paging. In the demand-paging scheme, when an OS kernel runs a process, it only loads a small portion of the process pages into memory for the process to start execution. Those pages that are not loaded into memory are marked as not-present (by a P bit in the page table descriptor entry, which is the AF bit in ARM). When the process attempts to access a page that is marked not-present, it triggers a page fault, which traps the process to the OS kernel to execute a page fault handler. If the page fault is due to a not-present page, the OS kernel can load in the needed page from the process image in secondary storage, mark the page table entry present, and let the process continue to re-execute the instruction that triggered the page fault earlier. The principle is simple but quite complex to implement as it opens up a sequence of issues, which must be handled properly in order for the scheme to work. A fundamental issue in demand-paging is that if a new page frame must be allocated but there are no free page frames, some existing pages must be evicted to make room for the needed page. This is known as the page replacement problem, which has been studied extensively in literatures and covered in many OS books. The reader may consult such books for more information. Although this book does not implement demand-paging, it is still worthy to discuss the principles of page replacement. (1). Ideal Page Replacement Rule: An ideal or optimal page replacement rule is to replace those pages that will incur fewest page faults in the future, but this is unrealizable since there is no way to predicate the future behavior of a system. The simplest replacement rule that is easy to implement is by FIFO (first in, first out), but studies have shown it has very poor performance in terms of page faults. We can only use some realistic algorithms that approximate the ideal replacement rule. (2). Least Recently Used (LRU) Page Replacement: The scheme is to replace a page that has not been referenced for the longest time. The rationale is such pages are not likely to be referenced again, based on the sequential flow of execution instructions which tend to access data in different memory regions. To support LRU, we may maintain all the pages in a link list with most recently used pages in front and least recently used pages at the end. During page replacement, choose pages from the end of the link list. Since the link list must be reordered on every memory reference, implementation of LRU by the link list method is feasible but very expensive. An alternative scheme is to assume a large global counter which is used to update page entries on every memory reference. Whenever a page is accessed, copy the current counter value as its access time. During page replacement, choose a page with the least access time to replace. However, the scheme is impractical due to too much overhead and inefficiency. (3). Not Recently Used (NRU) Page Replacement: Each page frame has an A (access) bit and D (dirty) bit, which can be used to implement a simple NRU page replacement algorithm. In this scheme, the kernel periodically clears these bits (by timer interrupts), which are set by the MMU hardware when the pages are accessed, i.e., set the A bit whenever the page is accessed, set the D bit whenever the page is modified. During page replacement, use the (A,D) bits to rank the pages, and replace a page with the highest rank order: Rank ---3 2 1 0

(A,D) Choice ----- ----------------------------------------------------------------------------(0,0): best candidate since page is not accessed and not modified (0,1): this may happen if (A,D) are not cleared at the same time (1,0): third best, even though the page will probably be accessed again soon (1,1): worst candidate, replace this page must write the page back first

In ARmv8, the page descriptor only has an access flag (AF) bit, which can be used to indicate whether a page has been accessed or not. According to documents of ARMv8.0 on memory management [20], the AF must be maintained by the software. When a page is created, its descriptor AF bit is cleared to 0. Whenever the page is accessed for the first time, it triggers a page fault because AF=0. The OS kernel must set the AF bit to 1 and let the process continue. The OS kernel may use the AF bit for page replacement when using demand-paging. For example, the Linux ARM64 kernel uses the AF bit for PTE_AF to check whether a page has ever been accessed. When a page must be swapped out of memory, it would choose to swap out pages that are not accessed. Also, in ARMv8.0 page table entries do not have a dirty flag bit, but it has a reserved field which can be used by the OS kernel for this purpose. In page descriptors, bits [58:55] are reserved for software use. An OS kernel may use it to record

11.10 Address Translation Cases

677

OS-specific information in the translation table. For example, the Linux kernel uses one of these bits to mark an entry as clean or dirty. If the page is later swapped out of memory, a clean page can simply be discarded, but a dirty page must be written back to memory first. At any rate, an OS kernel must maintain these flag bits by software. In ARMv8.1 [20], both the AF and dirty flags in page table entries can be updated automatically by the MMU hardware. However, these features must be explicitly enabled when configuring the MMU. These changes reflect the evolution style of the ARM processor. (4). Local versus global: When trying to evict some existing pages, the final question is where to look for such pages. The search is local if we only try to evict pages of the current process. It is global if we may evict pages of other processes in the system. In the latter case, the candidates are usually non-runnable, i.e., blocked or sleeping, processes. In ARMv8, the Address Space ID (ASID) can be used to identify pages by process ID for replacement of local pages.

11.9.8

Security and the MMU

The ARMv8 architecture defines two security states, secure and non-secure. It also defines two physical address spaces, secure and non-secure, such that the normal world can only access the non-secure physical address space. The secure world can access both the secure and non-secure physical address spaces. In non-secure state, the NS bits and NSTable bits in translation tables are ignored. Only non-secure memory can be accessed. In secure state, the NS bits and NSTable bits control whether a virtual address translates to a secure or non-secure physical address. The SCR_EL3.CIF bit can be used to prevent the secure world from executing virtual address that translates to a non-secure physical address. Additionally, when in the secure world, the SCR.CIF bit can also be used to control whether secure instruction fetches can be made to non-secure physical memory.

11.9.9

Context Switching in MMU

For EL0 and EL1, there are two translation tables. TTBR0_EL1 provides translations for the bottom of virtual address space, which is typically application space. TTBR1_EL1 covers the top of virtual address space, typically kernel space. This split means that the OS mappings do not have to be replicated in the translation tables of each task. Translation table entries contain a non-global (nG) bit. If the nG bit is set for a particular page, it is associated with a specific task or application. If the bit is marked as 0, then the entry is global and applies to all tasks. For non-global entries, when the TLB is updated and the entry is marked as non-global, a value is stored in the TLB entry in addition to the normal translation information. This value is called the Address Space ID (ASID), which is a number assigned by the OS to each individual task. Subsequent TLB lookups only match on that entry if the current ASID matches with the ASID that is stored in the entry. This permits multiple valid TLB entries to be present for a particular page marked as non-global but with different ASID values. In other words, we do not necessarily need to flush the TLBs when we context switch. In AArch64, this ASID value can be specified as either an 8-bit or 16-bit value, controlled by the TCR_EL1.AS bit. The current ASID value is specified in either TTBR0_EL1 or TTBR1_EL1. TCR_EL1 controls which TTBR holds the ASID, but it is normally TTBR0_EL1, as this corresponds to application space.

11.10 Address Translation Cases 11.10.1

4KB Page Size

When using 4kB granule size, the required translation levels is 4. The leading 16 bits [63:48] of a 64-bit VA is used to select either TTBR0 or TTBR1. The last 12 VA bits [11:0] are the byte offsets within a 4KB page. The remaining 48 − 12 = 36 bits are divided into four levels of 9 bits each for levels L3 to L0, as shown in Fig. 11.28. Figure 11.28 shows the translation diagram using 4KB page size. The top part of the figure shows the division of VA bits into four sets [a:b:c:d]. The middle part of the figure shows the roles of the VA bits in [a:b:c:d]. The bottom part of the figure shows the translation process. First, bits [47:39] of the VA index into a 512-entry L0 table. Each of these table entries spans a 512 GB range and points to an L1 table. Within the 512-entry L1 table, bits [38:30] are used as index to select an entry and each entry

678

11 ARMv8 Architecture and Programming

Fig. 11.28 Translation diagram using 4KB page

points to either a 1GB block or an L2 table. Bits [29:21] index into a 512-entry L2 table, and each entry points to a 2MB block or a L3 table. At the last level L3, bits [20:12] index into a 512-entry L3 table, and each entry points to a 4kB block.

11.10.2

64KB Page Size

When using 64kB granule size, the leading 16 bits of VA[63:48] select either TTBR0_EL1 or TTBR1_EL1. The least significant 16 VA bits [15:0] define a 64KB page. The remaining VA bits [47:16] are divided into three levels, L3, L2, and L1, with L1 containing only 6 bits [47:42]. In this case, the translation has three levels. Figure 11.29 shows the partition of VA bits into levels and the translation process for 64KB pages.

11.10.3

64KB Page Size with 42-Bit Input VA Address

This example assumes 42-bit actual VA space. The highest VA bits [63:42] select either TTBR0 or TTBR1. The least significant 16 VA bits define a 64KB block. The remaining 42 − 16 = 26 VA bits must be resolved by address translation. The needed translation level is 2, with 13 bits in each of L3 and L2. Each level table contains 8192 entries. Detailed steps of the translation process are as follows. 1. If VA[63:42] = 1, then TTBR1 is used for the base address for the first page table. If VA[63:42] = 0, TTBR0 is used for the base address for the first page table. 2. The page table contains 8192 64-bit page table entries and is indexed via VA[41:29]. The MMU reads the pertinent level 2 page table entry from the table. 3. The MMU checks the level 2 page table entry for validity (type=11) and whether or not the requested memory access AP bits is allowed. Assuming it is valid and the memory access is allowed. 4. In Fig. 11.30, the level 2 page table entry is a table descriptor (type=11) which refers to the address of the level 3 page table. 5. Bits [47:16] are taken from the level 2 page table entry and form the base address of the level 3 page table. 6. Bits [28:16] of the VA are used to index the level 3 page table entry. The MMU reads the pertinent level 3 page table entry from the table. 7. The MMU checks the level 3 page table entry for validity and whether or not the requested memory access is allowed. Assuming it is valid and the memory access is allowed. 8. In Fig. 11.30, the level 3 page table entry is a page descriptor which refers to a 64KB page. 9. Bits [47:16] are taken from the level 3 page table entry and used to form PA[47:16]. 10. Because we have a 64KB page, VA[15:0] is taken to form PA[15:0]. 11. The full PA[47:0] is returned, along with additional information from the page table entries

11.11 Translations in EL2 and EL3

679

Fig. 11.29 Translation process of 64KB pages

Fig. 11.30 Virtual to physical translation for a 64KB page

11.11 Translations in EL2 and EL3 The virtualization extensions to the ARMv8 architecture introduce a second stage of translation. When a hypervisor is present in the system, one or more guest operating systems might be present. These continue to use TTBRn_EL1 as previously described, and MMU operation appears unchanged. The hypervisor must perform some extra translation steps in a two-stage process to share the physical memory system between the different guest operating systems. In the first stage, a virtual address (VA) is translated to an intermediate physical address (IPA). This is usually under OS control. A second stage, controlled by the hypervisor, then performs translation of the IPA to the final physical address (PA). The hypervisor and secure monitor also have their set of stage 1 translation tables for their own code and data, which perform mapping directly from VA to PA. Figure 11.31 summarizes this two-stage translation process. The stage 2 translations, which convert an intermediate physical address to a physical address, use an extra set of tables under control of the hypervisor. These must be explicitly enabled by writing to the hypervisor configuration register HCR_ EL2. This process only applies to non-secure EL1/0 accesses. The base address of this stage 2 translation table is specified in the Virtualization Translation Table Base Register VTTBR0_EL2. It specifies a single contiguous address space at the bottom of memory. The size of the supported address space is specified in the TSZ[5:0] field of the Virtualization Translation Control Register, VTCR_EL2. The TG field of this

680

11 ARMv8 Architecture and Programming

Fig. 11.31 Two-stage translation process

register specifies the granule size, while the SL0 field controls the first level of table lookup. Any access outside the defined address range causes a translation fault.

11.12 MMU Programming Examples 11.12.1

Example 9: One-Level Table in EL3

This example is adapted from ARM MMU examples [22]. The original program was targeted for the ARM FVP Base Platforms [23, 24]. It is adapted to the ARM Virt virtual machine under QEMU [18]. For this example, we assume that the ARM Virt VM has three contiguous blocks of secure memory at the physical addresses: 0x0000_0000 - 0x3FFF_FFFF : 1GB Secure Peripheral 0x4000_0000 - 0x7FFF_FFFF : 1GB Secure DRAM 0x8000_0000 - 0xBFFF_FFFF : 1GB Secure DRAM The VM’s Translation Control Register at EL3 (TCLR_EL3) is configured with 39-bit input VA address, 32-bit output PA address, and 4KB granule size. With 39-bit input VA space and 4KB granule size, the number of VA bits to be resolved is 39 – 12 = 27, which can be divided into three sets of [9:9:9] for levels L1, L2, and L3. If the system were to use 4KB pages, it would require three levels of translation tables. However, the example only uses 1GB blocks by one-level translation. The starting level is L1, which should have 512 entries, each covers 1GB of VA. The mapping table of this example is an identity mapping of VA to PA. The example code is in Ar64 assembly. The code runs in EL3, which is the startup exception level of Ar64 systems equipped with EL3. For the ARM Virt virtual machine, it is necessary to specify the –machine secure=on option when running the ARM Virt VM under QEMU. The following lists the complete code of Example 9: // (1). ********** part1.S file **************

//************************************************************* .align 3 // ------------ bits in SPSR to choose stack pointer --------.equ AArch64_EL2_SP2, 0x09 // EL2h 1001 .equ AArch64_EL2_SP0, 0x08 // EL2t 1000 .equ AArch64_EL1_SP1, 0x05 // EL1h 0101 .equ AArch64_EL1_SP0, 0x04 // EL1t 0100 .equ AArch64_EL0_SP0, 0x00

11.12 MMU Programming Examples

681

// ------------ Define page table entry templates -------------// Table entry template: |attribute| address |type=0x11| // |NSTable|PXNTable|UXNTable|APTable|=0,addr=0,type=3 .equ TT_S1_TABLE, // // // //

MAIR register 0 = b01000100 1 = b11111111 2 = b00000000

0x00000000000000003

// table entry

contains 8-bit index=7 to 0 for memory types = Normal, Inner/Outer Non-Cacheable 0x44 = Normal, Inner/Outer WB/WA/RA 0xFF = Device-nGnRnE 0x00

// Block entry templates for L1, L2 = |attr| addr |attr|type=0x01| .equ TT_S1_FAULT, 0x0 .equ TT_S1_NORMAL_NO_CACHE, 0x00000000000000401 // Index = 0, AF=1 .equ TT_S1_NORMAL_WBWA, 0x00000000000000405 // Index = 1, AF=1 .equ TT_S1_DEVICE_nGnRnE, 0x00600000000000409 // Index = 2, AF=1 // PXN, UXN=1 // Attribute bits in block entry .equ TT_S1_UXN, (1 |xzr|x0|x1|.....|x29|x30| LDR x10, =running LDR x11, [x10, #0] mov x12, sp str x12, [x11, #8] bl

scheduler

LDR x10, =running LDR x11, [x10, #0] ldr x12, [x11, #8] mov sp, x12

// r10=&running // r11->runningPROC // running->ksp = sp

// r1->runningPROC

// restore registers from (x29, x30) to (x1, x2) ldp x1, x0, [sp], #16 // pop xzr->x1, x0->x0 ldp ldp ldp

x1, x2, [sp], #16 x3, x4, [sp], #16 x5, x6, [sp], #16

11.12 MMU Programming Examples

ldp ldp ldp ldp ldp ldp ldp ldp ldp ldp ldp ldp

x7, x9, x11, x13, x15, x17, x19, x21, x23, x25, x27, x29,

x8, x10, x12, x14, x16, x18, x20, x22, x24, x26, x28, x30,

759

[sp], [sp], [sp], [sp], [sp], [sp], [sp], [sp], [sp], [sp], [sp], [sp],

#16 #16 #16 #16 #16 #16 #16 #16 #16 #16 #16 #16

// return by link register X30, X29=FP ret

.global unlock

unlock:

MSR ret

DAIFClr, 0x3

.global syscall syscall: svc #1 ret

.global getSPSel getSPSel: // mrs X0, SPSel // MRS:read MSR: write mrs X0, s3_0_C4_C2_0 //SPSel // MRS:read ret .global getksp

// get Kmode sp: should be SP_EL1

getksp:

mov x0, sp ret

.global get_VBAR_EL1 get_VBAR_EL1: MRS x0, VBAR_EL1 ret wfi:

.global wfi wfi ret

.global get_TTBR0_EL1 get_TTBR0_EL1: MRS x0, TTBR0_EL1 ret .global get_FAR_EL1 get_FAR_EL1: MRS x0, FAR_EL1 ret .global get_ESR_EL1 get_ESR_EL1: MRS x0, ESR_EL1 ret

MSR: write

760

11 ARMv8 Architecture and Programming

.global get_ELR_EL1 get_ELR_EL1: MRS x0, ELR_EL1 ret

.global add4ELR_EL1 .global read_current_el read_current_el: MRS X0, CurrentEL ret .global get_SP_EL0 get_SP_EL0: MRS x0, SP_EL0 ret .global get_SP_EL1 get_SP_EL1: MRS x0, SP_EL1 ret

.global switchPgdir switchPgdir: lsl x0, x0, #32 lsr x0, x0, #32 MSR TTBR0_EL1, x0 ISB

// Invalidate TLBs // ---------------TLBI VMALLE1 DSB SY ISB

ret // ------end of OS kernel ts.s file ------(2). ----- OS Kernel uart.c file ------//----- OS kernel uart.c file ------#include #define u8 uint8_t #define u32 uint32_t #define u64 uint64_t #define printf kprintf

u32 *UART0DR = (unsigned int *) 0xFFFFff8009000000; u32 *UART0FR = (unsigned int *) 0xFFFFff8009000018; char *ctab = "0123456789ABCDEF"; void up2x(u8 n) { ups("0x"); upc(ctab[(n >> 4) & 0xF]); upc(ctab[n &0x0F]); upc(' '); }

11.12 MMU Programming Examples

int ugetc() { while( *UART0FR & (1next = q; return; } while (q->next && p->priority next->priority){ q = q->next; } p->next = q->next; q->next = p; } PROC *dequeue(PROC **queue) { PROC *p = *queue; if (p) *queue = p->next; return p; }

763

764

11 ARMv8 Architecture and Programming

void printQ(PROC *p) { ups("readyQueue = "); while(p){ upc('['); upi(p->pid); upc(']'); ups("->"); p = p->next; } ups("NULL\n"); } void printList(PROC *p) { ups("freeList = "); while(p){ upi(p->pid); ups("->"); p = p->next; } ups("NULL\n"); }

/****** kernel.c file ********/ void ksleep(int event) { running->event = event; running->status = SLEEP; tswitch(); }

void wakeup(int event) { int i; PROC *p; for (i=1; istatus == SLEEP && p->event == event){ p->event = 0; p->status = READY; enqueue(&readyQueue, p); } } } void kexit(int value) { int i; PROC *p; if (running->pid == 1){ printf("P1 never die\n", null); return; } printf("P%d in kexit: ", running->pid); for (i=1; iparent == running){

11.12 MMU Programming Examples

p->parent = &proc[1]; p->ppid = 1;

}

} } running->exitCode = value; running->status = ZOMBIE; running->priority = 0; wakeup(running->parent); printf("P%d become a ZOMBIE, wakeup parent\n", running->pid); tswitch();

int kwait(int *status) { int i; PROC *p; int hasChild = 0;

printf("P%d in kwait: ", running->pid); for (i=2; iparent == running){ printf("has child %d: ", p->pid); hasChild = 1; break; } } if (!hasChild){ ups("wait error: no child\n"); return -1; }

}

while(1){ for (i=2; iparent == running && p->status == ZOMBIE){ printf("found ZOMBIE child %d\n", p->pid); *status = p->exitCode; p->status = FREE; p->priority = 0; p->ppid = 0; p->parent = 0; enqueue(&freeList, p); return p->pid; } } ksleep(running); }

void kernel_init() { u64 i; PROC *p; printf("kernel_init() : %s", null);

765

766

11 ARMv8 Architecture and Programming

for (i=0; ipid = i; strcpy(p->name, pname[i]); p->status = FREE; p->next = p + 1; } proc[NPROC-1].next = 0; // circular proc list freeList = &proc[0]; readyQueue = 0;

}

p = dequeue(&freeList); p->priority = 0; // P0 has lowest priority 0 p->ppid = 0; running = p; printf("running = P%d\n", running->pid); printList(freeList);

extern extern extern extern

char int char int

_binary_u1_start, _binary_u1_end; _binary_u1_size; _binary_u2_start, _binary_u2_end; _binary_u2_size;

int load(char *file, u64 address) { int i; u64 cp, size; char *src, *dest; PROC *p = running; if (strcmp(file, "u1")==0){ cp = &_binary_u1_start; size = &_binary_u1_end - &_binary_u1_start; } if (strcmp(file, "u2")==0){ cp = &_binary_u2_start; size = &_binary_u2_end - &_binary_u2_start; }

// OS kernel must use HIGH VA printf("load %s ", file); printf(" to %x ", 0xFFFFff8000000000 + address); src = cp; dest = 0xFFFFff8000000000 + address;

}

for (i=0; i < size; i++){ *dest = *src; dest++; src++; } ups("done\n");

extern extern extern extern

u64 u64 u64 u64

p10, p20, p30, p11, p21, p31, ptable0[8]; // ptable1[8]; //

p40, p50, p60, p70, p80; p41, p51, p61, p71, p81; = {&p10,&p20,&p30,&p40,&p50,&p60,&p70,&p80}; = {&p11,&p21,&p31,&p41,&p51,&p61,&p71,&p81};

11.12 MMU Programming Examples

extern extern extern extern

char int char int

767

_binary_u1_start, _binary_u1_end; _binary_u1_size; _binary_u2_start, _binary_u2_end; _binary_u2_size;

#define TT_S1_TABLE // // // // //

0x00000000000000003

// type=3

TT block entries templates (L1 and L2, NOT L3) Assuming table contents: 0 = b01000100 = Normal, Inner/Outer Non-Cacheable 1 = b11111111 = Normal, Inner/Outer WB/WA/RA 2 = b00000000 = Device-nGnRnE

#define #define #define #define

TT_S1_FAULT TT_S1_NORMAL_NO_CACHE TT_S1_NORMAL_WBWA TT_S1_DEVICE_nGnRnE

0x0 0x00000000000000401 0x00000000000000405 0x00600000000000409

#define #define #define #define

TT_S1_UXN TT_S1_PXN TT_S1_nG TT_S1_NS

(1 (1 (1 (1

#define #define #define #define #define

TT_S1_PRIV_RW TT_S1_PRIV_RO TT_S1_USER_RW TT_S1_USER_RO TT_S1_DBM

(0x0) (0x2 pgdir) = v0; //printf("pgdir

// type=3

= %x ", p->pgdir); printf("v0=%x\n", v0);

p->paddress = 0x80000000 + (p->pid- 1)*0x200000; // 2MB area

768

11 ARMv8 Architecture and Programming

u64 *t1 = ptable1[p->pid-1]; for (i=0; ipaddress; *t1 = v1; //printf("paddress=%x ", p->paddress); //printf("l1=%x ", t1); //printf("v1=%x\n",v1); // optional: pass string to Umode image via ustack char *s = 0xFFFFff8000000000 + p->paddress + 0x200000 - 128; strcpy(s, "u1 start"); p->usp = 0x200000 - 128; //printf("usp = %x\n", p->usp); // load umode image to PA load("u1", p->paddress);

// must be VA

// syscall frame // x30 29 .... x2 x1 x0 zr spsr elr // 1 2 31 32 33 34 // spsr VA0

| | | |

p->kstack[SSIZE-33] = 0x0; p->kstack[SSIZE-34] = ucode;

// spsr=00000: ar64, was in EL0 // Umode entry VA

for (i=1; ikstack[SSIZE-i] = 0;

tswitch frame x30 x29 ... 35 goUmode

x1 x0 zr| 66

p->kstack[SSIZE-31] = p->usp; // parameter to Umode p->kstack[SSIZE-35] = (u64)goUmode; // goUmode directly p->kstack[SSIZE-65] = p->usp; // return value from Kernel

p->ksp = &(p->kstack[SSIZE-66]); enqueue(&readyQueue, p);

// saved ksp

printf("proc %d kforked a child ",running->pid); upi(p->pid); ups(" "); printQ(readyQueue); }

return p->usp;

// as parameter to Umode image

void scheduler() { PROC *old = running; printf("proc %d in scheduler ", running->pid); if (running->status == READY) enqueue(&readyQueue, running); running = dequeue(&readyQueue); printf("next running = %d\n", running->pid);

}

if (old != running){ printf("switchPgdir to %x\n", running->pgdir); switchPgdir(running->pgdir); }

11.12 MMU Programming Examples

char *pstatus[]={"FREE

769

","READY

int kps() { int i; PROC *p; for (i=0; ipid); printf("ppid=%d", p->ppid);

}

}

","SLEEP

","BLOCK

","ZOMBIE ", " RUN

"};

if (p==running) printf("%s ", pstatus[5]); else printf("%s", pstatus[p->status]); printf("name=%s\n", p->name);

typedef struct stack{ u64 elr, spsr, zxr, x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15; u64 x16, x17, x18, x19, x20, x21, x22, x23, x24, x25, x26, x27, x28, x29, x30; // only need elr, x0 to x3 }KSTACK; u64 get_FAR_EL1(); u64 get_ESR_EL1(); u64 get_ELR_EL1();

void trap_chandler(KSTACK *p) // p->kstack Top { u64 a,b,c,d; // syscall parameters from Umode in syscall(a,b,c,d) u64 spsr = p->spsr;

u64 esr = get_ESR_EL1(); u64 elr = get_ELR_EL1(); u64 far = get_FAR_EL1();

// printf("trap_chandler: elr = %x\n", elr); u32 r = esr >> 24; r = r & 0x000000FF; // printf("syndrome = %x\n", r);

// if brk or hvc or MM fault: inc elr by 4, return if (r != 0x56){ printf("trap_chandler: elr = %x\n", elr); printf("syndrome = %x ", r); printf("far = %x\n",far); printf("spsr= %x\n",spsr); }

p->elr += 4; return;

// looking for SVC from EL0, syndrome=0x56 // ARM lumping SVC into async exceptions very BAD idea !!!

770

11 ARMv8 Architecture and Programming

if (r a = b = c = d =

== 0x56){ // svc from EL0 p->x0; p->x1; // syscall parameters a,b p->x2; p->x3;

//printf("syscall # = %d ", a); if (a==0){ // ups("getpid "); r = running->pid; } if (a==1){ ups("getppid "); r = running->ppid; } if (a==2){ kps(); r = 0; }

if (a==3){ ups("switch process\n"); tswitch(); r = 0; } if (a==4){ r = fork(); } if (a==5){ kexit(b); }

if (a==6){ r = kwait(b); } if (a==7){ r = kexec(b); } if (a==90){ r = ugetc(); }

}

}

if (a==91){ uputc(b); } p->x0 = r;

int fork() { // fork a child process with identical Umode image

11.12 MMU Programming Examples

771

int i; PROC *p = dequeue(&freeList); if (p==0){ printf("fork failed %s\n", null); return -1; } // printf("parent usp=%x\n", running->usp); // u64 sp0 = get_SP_EL0(); // printf("sp0=%x\n", sp0); p->ppid = running->pid; p->parent = running; p->status = READY; p->priority = 1;

// make new TTBR0 mapping tables for Umode image p->pgdir = ptable0[p->pid-1]; u64 v0 = TT_S1_TABLE; // 0x00000000000000003 // type=3 v0 |= ptable1[p->pid-1]; v0 = (v0 > 32); *(p->pgdir) = v0; //printf("pgdir = %x ", p->pgdir); printf("v0=%x\n", v0); p->paddress = 0x80000000 + (p->pid- 1)*0x200000; u64 *t1 = ptable1[p->pid-1]; for (i=0; ipaddress; *t1 = v1; //printf("paddress=%x ", p->paddress); //printf("l1=%x ", t1); //printf("v1=%x\n",v1);

p->paddress = 0x80000000 + (p->pid- 1)*0x200000; u64 PA = running->paddress + 0xFFFFff8000000000; u64 CA = p->paddress + 0xFFFFff8000000000; printf("copy Umode image from %x ", PA); // copy 2MB of Umode image

printf("%x\n", CA);

char *src = PA; char *dest = CA;

for (i=0; iusp = running->usp;

// both are VA in their images

//printf("usp=%x\n", p->usp);

772

11 ARMv8 Architecture and Programming

// syscall frame // x30 29 .... x2 x1 x0 zr spsr elr // 1 2 31 32 33 34 // spsr VA0

| | | |

tswitch frame x30 x29 ... 35 goUmode

x1 x0 zr| 66

for (i=1; i < 35; i++){ p->kstack[SSIZE-i] = running->kstack[SSIZE-i]; } printf("spsr=%x ", p->kstack[SSIZE-33]); printf("elr =%x\n", p->kstack[SSIZE-34]); for (i=35; ikstack[SSIZE-i] = 0;

p->kstack[SSIZE-31] = 0; // child return pid=0 p->kstack[SSIZE-35] = (u64)goUmode; // resume to goUmode() p->ksp = &(p->kstack[SSIZE-66]); enqueue(&readyQueue, p);

printf("KERNEL: P%d forked a child ", running->pid); printf("%d\n", p->pid); printQ(readyQueue); }

return p->pid;

int kexec(char *line) { // change Umode image to file, passing cmd line to new image int i; PROC *p = running; char kline[64]; strcpy(kline, line); //printf("exec line=%s\n", line); char *file, *cp; file = kline; cp = kline; while (*cp && *cp != ' ') cp++; *cp = 0; //printf("file=%s\n", file);

load(file, running->paddress);

char *s = 0xFFFFff8000000000 + p->paddress + 0x200000 - 128; strcpy(s, line); //printf("s=%s\n", s); p->usp = 0x200000 - 128; p->kstack[SSIZE-33] = 0x0; p->kstack[SSIZE-34] = 0x0; }

return p->usp;

// spsr=00000: ar64, was in EL0 // Umode entry VA

11.12 MMU Programming Examples

(4) -----EOS kernel t.c file ----// OS kernel t.c file #include #define u8 uint8_t #define u32 uint32_t #define u64 uint64_t

#define printf kprintf char *null = ""; u64 utable; u64 ustack;

// contain ustack_start in t.ld

#include "uart.c" #include "kernel.c"

int strcmp(char *s1, char *s2) { while((*s1 && *s2) && (*s1==*s2)){ s1++; s2++; } if ((*s1==0) && (*s2==0)) return 0; }

return(*s1 - *s2);

void strcpy(char *dest, char *src) { while(*src){ *dest = *src; dest++; src++; } *dest = 0; }

// these are needed for aarch64 gcc if change sp i code unsigned long __stack_chk_guard;

void __stack_chk_guard_setup(void) { __stack_chk_guard = 0xBAAAAAAD;//provide some magic numbers } void __stack_chk_fail(void) { // Error message } u64 read_current_el(); u64 get_VBAR_EL1();

extern char _binary_u1_start, _binary_u1_end; extern int _binary_u1_size; extern u64 T0_L0; extern u64 tt0_L0;

773

774

11 ARMv8 Architecture and Programming

extern u64 p10, p20, p30, p40, p50, extern u64 p11, p21, p31, p41, p51, u64 ptable0[8] = {&p10, &p20, &p30, u64 ptable1[8] = {&p11, &p21, &p31, u64 get_TTBR0_CL1(); u64 get_SP_EL1(); u64 get_SP_EL0();

p60, p70, p80; p61, p71, p81; &p40, &p50, &p60, &p70, &p80}; &p41, &p51, &p61, &p71, &p81};

int main() { u64 cp, size; char *src, *dest; int i; u64 *up; u64 el = read_current_el(); printf("Enter EOS kernel currentEL = %x\n", el); /****** skip memory tests ****** ups("test VA=0xC0000000\n"); up = 0xC0000000; *up = 1234; ups("test VA=0xFFFFff80_C0000000\n"); up = 0xFFFFff80C0000000; *up = 1234; ******************************/ u64 ttbr0 = get_TTBR0_EL1(); printf("ttbr0=%x\n", ttbr0);

printf("initialize OS kernel in EL1 %s\n", null); kernel_init();

kfork(goUmode, 1, 0x00000000);

printf("enter a key to swtich to P1 : %s", null); ugetc(); ups("\n"); tswitch();

while(1){ if (readyQueue) tswitch(); } }

wfi();

(5). ------ OS kernel t.ld file -----ENTRY(start)

PA = 0xFFFFff8040100000;

SECTIONS { /* start at Virt RAM PA address */ /* . = 0xFFFFff8080000000;*/ . = PA; .text : {

11.12 MMU Programming Examples

*(.text) } .data : { *(.data) } .bss : { *(.bss) } .data : { u1.o } .data : { u2.o }

. = ALIGN(16); . = . + 0x4000; stack3_top = .; . = . + 0x4000; stack2_top = .; . = . + 0x4000; stack1_top = .; . = . + 0x4000; stack0_top = .;

. = ALIGN(4096);

utable_start = .; p1table . = . + p2table . = . + p3table . = . + p4table . = . + p5table . = . + p6table . = . + p7table . = . + p8table }

= .; 0x2000; = .; 0x2000; = .; 0x2000; = .; 0x2000; = .; 0x2000; = .; 0x2000; = .; 0x2000; = .;

. = . + 0x4000; ustack_start = .;

(6). ------- OS kernel mk sh scritp file ------rm *.elf 2> /dev/null rm *.o (cd ../USER; mku u1; mku u2)

aarch64-linux-gnu-as -o ts.o ts.s aarch64-linux-gnu-gcc -c -mcpu=cortex- a57 -Wall -o t.o t.c

775

776

11 ARMv8 Architecture and Programming

aarch64-linux-gnu-ld -T t.ld ts.o t.o -o eos.elf aarch64-linux-gnu-objdump -D eos.elf > eos.list

aarch64-linux-gnu-objcopy -O binary eos.elf eos aarch64-linux-gnu-objcopy -I binary -O elf64-littleaarch64 \ -B aarch64 eos eos.o cp eos.o ../

# copy eos.o to top directory

// ------- end of OS kernel files ------------// USER directory files

(1). ------- USER us.s file -----------

.global uEntry, syscall, get_usp, main0 .text uEntry: bl main0 b .

// user process issues syscall(a,b,c,d) ==> a,b,c,d in x0 to x3 syscall: svc #0 ret get_usp: mov x0, sp ret

.global get_current_el get_current_el: MRS X0, CurrentEL ret

(2). ----- USER ucode.c file ----// ucode.c file

#define printf kprintf #define null "" extern u64 syscall();

// these are needed for aarch64 gcc if change sp i code unsigned long __stack_chk_guard;

void __stack_chk_guard_setup(void) { __stack_chk_guard = 0xBAAAAAAD;//provide some magic numbers } void __stack_chk_fail(void) { // Error message } /******* no longer available in Umode *** u32 *UART0DR = (unsigned int *) 0x09000000; u32 *UART0FR = (unsigned int *) 0x09000018; *****************************************/

11.12 MMU Programming Examples

int ugetc() { return syscall(90, 0, 0, 0); } void uputc(char c) { syscall(91, c, 0, 0); }

char *ctab = "0123456789ABCDEF"; void up2x(u8 n) { ups("0x"); uputc(ctab[(n >> 4) & 0xF]); uputc(ctab[n &0x0F]); uputc(' '); } char *ugets(char *s) { char *cp = s; char c = ugetc(); while (c != '\r'){ uputc(c); *s= c; s++; c = ugetc(); } *s = 0; return cp; } void ups(char *s) { while(*s){ uputc(*s); s++; } }

void upx(u64 n) { ups("0x"); for (int i = 60; i >= 0; i -= 4){ uputc(ctab[(n >> i) & 0xf]); if (i == 32) uputc('_'); } } void urpu(u32 x) { char c; if (x){ c = ctab[x % 10];

777

778

}

11 ARMv8 Architecture and Programming

}

urpu(x/10); uputc(c);

void upi(int x) { if (x==0){ uputc('0'); return; } if (x $1.list aarch64-linux-gnu-objcopy -O binary $1.elf $1 aarch64-linux-gnu-objcopy -I binary -O elf64-littleaarch64 \ -B aarch64 $1 $1.o cp $1.o ../EOS

# copy to EOS directory

// ------- end of Example 13 files ---------------Explanations of Example 13 Code In the KMH scheme, the OS kernel runs in HIGH VA space, and User mode processes run in LOW VA space. The OS kernel must be compiled with HIGH virtual addresses, e.g., starting from VA=0xFFFFff80_00000000. This creates a dilemma, which is hard to resolve. On the one hand, the OS kernel cannot execute in HIGH VA space until it has configured and enabled the MMU for VA to PA translation. On the other hand, it cannot configure the MMU while in HIGH VA space because it cannot even start to execute. In a real hardware system, the MMU is usually turned off when the CPU starts, so the CPU begins execution in real address mode. The kernel startup code can execute a piece of position-independent code to configure the MMU and enable VA to PA translation. Then it can coerce the CPU to switch from PA to VA (usually by a PCrelative jump or branch), thus completing the seemingly impossible task of resolving the dilemma. Unfortunately, this trick does not seem to work on the ARM Virt virtual machines. It seems that when the ARM Virt virtual machine starts, the processor is stuck in the VA space if the OS kernel is compiled with HIGH VA. After several failed trials, I had to use the following scheme to work around the problem.

11.12 MMU Programming Examples

783

(1). OS kernel image and User mode image files The OS kernel EOS is compiled with HIGH VA at 0xFFFFff80_4010000000. The User mode image u1 is compiled with LOW VA at 0x00000000. The User mode image is converted to a pure binary executable file, which is included in the OS kernel image as a dummy data section. When the OS kernel starts up, it can copy u1 to a PA as a process User mode image. (2). The BOOTER The BOOTER is a program compiled with real addresses. It includes the OS kernel image EOS as a dummy data section. The BOOTER runs on the ARM Virt VM at the PA=0x40000000 first. When the BOOTER starts, it first migrates from EL3 to EL1. Then it configures the MMU at EL1 to use both TTBR0 for LOW VA mapping and TTBR1 for HIGH VA mapping. Then it copies the OS kernel image to the PA at 0x4010000000, which is mapped to VA=0xFFFFff80_40100000 by the mapping tables of TTBR1_EL1. These are done by the following code segment of the BOOTER’s ts.s file: #define VA #define PA

0xFFFFff8040100000 0x0000000040100000

u64 cp = &_binary_eos_start; u64 size = &_binary_eos_end - &_binary_eos_start; ups("copy eos to PA: "); upx(PA); ups(" char *src = cp; char *dest = PA;

");

for (int i=0; i[pi0 = a table entry] -> [pi1 = 2MB block for USER_RW] Then the OS kernel calls kernel_init() to initialize the system, creates, and runs the initial process P0 as usual. P0 creates a child process P1 with u1 as its Umode image by kfork(goUmode, 1, 0x00000000); // kfork(kCode, priority, uCode) To any process executing in the User mode, it is always at VA=0 as long as TTBR0 maps its VA=0 to PA of the Umode image. In an OS kernel, each process has its own TTBR0 mapping table (recorded in its PROC.pgdir), which maps the process Umode VA to PA. When switching process, we must switch the mapping tables of TTBR0. These are done in scheduler() and switchPgdir(), as shown below: void scheduler() { PROC *old = running; printf("proc %d in scheduler ", running->pid); if (running->status == READY) enqueue(&readyQueue, running); running = dequeue(&readyQueue); printf("next running = %d\n", running->pid);

}

if (old != running){ printf("switchPgdir to %x\n", running->pgdir); switchPgdir(running->pgdir); }

.global switchPgdir // swithPgdir(u64 newPgdir) switchPgdir: lsl x0, x0, #32 // must use PA lsr x0, x0, #32 MSR TTBR0_EL1, x0 ISB // Invalidate TLBs // ---------------TLBI VMALLE1 DSB SY ISB ret

// change TTBR0_EL1 to newPgdir

All processes share the same VA space in Kmode, so there is no need to switch TTBR1 during process switch.

11.12 MMU Programming Examples

785

kfork(): kfork is only used by P0 to create a child process with a User mode image u1. Each process has a 2MB User mode memory area at PA=0x80000000 + (pid-1)*2MB. P1’s Umdoe image is at 0x80000000. kfork() first sets up the mapping table of the new process for it to access the 2MB area as VA=0 to 2MB. Then it copies u1 image to the PA memory of the new process. These are done by the following code segment of kfork(): extern char _binary_u1_start, _binary_u1_end; extern int _binary_u1_size; extern char _binary_u2_start, _binary_u2_end; extern int _binary_u2_size; int load(char *file, u64 address) { int i; u64 cp, size; char *src, *dest; PROC *p = running;

if (strcmp(file, “u1”) || strcmp(file, “u2)) return 0; // only u1 or u2 if (strcmp(file, "u1")==0){ cp = &_binary_u1_start; size = &_binary_u1_end - &_binary_u1_start; } if (strcmp(file, "u2")==0){ cp = &_binary_u2_start; size = &_binary_u2_end - &_binary_u2_start; }

// OS kernel must use HIGH VA printf("load %s ", file); printf(" to %x ", 0xFFFFff8000000000 + address); src = cp; dest = 0xFFFFff8000000000 + address;

}

for (i=0; i < size; i++){ *dest = *src; dest++; src++; } ups("done\n"); return 1;

extern u64 ptable0[8]; //{&p10,&p20,&p30,&p40,&p50,&p60,&p70,&p80}; extern u64 ptable1[8]; //{&p11,&p21,&p31,&p41,&p51,&p61,&p71,&p81}; p->pgdir = ptable0[p->pid-1]; u64 v0 = TT_S1_TABLE; // 0x00000000000000003 // type=3 v0 |= ptable1[p->pid-1]; v0 = (v0 > 32); *(p->pgdir) = v0; printf("pgdir = %x ", p->pgdir); printf("v0=%x\n", v0); p->paddress = 0x80000000 + (p->pid- 1)*0x200000; u64 *t1 = ptable1[p->pid-1]; for (i=0; ipaddress; *t1 = v1; printf("paddr = %x ", p->paddress); printf("l1=%x ", t1); printf("v1=%x\n",v1); load(“u1”, p->paddress);

Note that the User mode memory AP bits = TT_S1_USER_RW = 01 for User mode RW in EL0. After setting up the Umode memory mapping table for the new process, the next crucial step of kfork() is to create the environment for the new process to return to its Umode image when it is scheduled to run. These are done by the following code segment in kfork(): // syscall frame // x30 29 .... x2 x1 x0 zr spsr elr // 1 2 31 32 33 34 // spsr VA0

| | | |

tswitch frame x30 x29 ... 35 goUmode

x1 x0 zr| 66

for (i=1; ikstack[SSIZE-i] = 0;

// zero out proc.kstack

p->kstack[SSIZE-33] = 0x0; p->kstack[SSIZE-34] = ucode;

// spsr=00000: ar64, was in EL0 // Umode entry VA

p->kstack[SSIZE-31] = p->usp;

// return value to Umode image

p->kstack[SSIZE-35] = (u64)goUmode; // goUmode directly p->ksp = &(p->kstack[SSIZE-66]);

// optional: pass a string to Umdoe image via ustack char *s = 0xFFFFff8000000000 + p->paddress + 0x200000 - 128; strcpy(s, "u1 start"); p->usp = 0x200000 - 128; p->kstack[SSIZE-65] = p->usp;

// must be VA // return value from Kmode

enqueue(&readyQueue, p); return p->usp; // pass to Umode image as a string pointer

The statement p->kstack[SSIZE-33] = 0x0 is most critical. It tells the new process that its previous execution state was Ar64 and its previous exception level was EL0. When the new process begins to run, it resumes to goUmode() (in ts.s file), which restores SPSR_EL1 = 0x0 ERL_EL1 = ucode Followed by ERET, which causes the process to execute from VA=ucode in EL0. When a process executes in the User mode, it first tries to verify it is indeed executing in the User mode at EL0. Unfortunately, there is no direct way to know the exception level at EL0, which can only be inferred indirectly. If the CPU executing in EL0 tries to read the CurrentEL register, it will generate a synchronous exception fault due to privilege violation, which traps the process to the OS kernel. The trap handler in the OS kernel can get and display contents of the SPSR_EL1 register. If SPSR_EL1.bits[4:0] = b0000, it proves indirectly that the previous exception level was EL0. The EOS User mode process has a code to check this, but they are commented out. Readers may comment in the code to verify that User mode processes are indeed executing at EL0. While running at EL0, a User mode process can issue syscalls to enter the OS kernel, execute kernel functions, and return to the User mode with the desired results.

11.12 MMU Programming Examples

787

(4). Fork-Exec in the OS kernel After kfork P1, P0 switches process to run P1. Since P0 has the lowest priority 0, it will be at the bottom of the readyQueue. P0 will run again only when there are no other runnable processes. In that case, P0 resumes running in a loop. It switches process as soon as the readyQueue becomes nonempty. After P1 runs, all other processes are created by the int pid = fork(); syscall. fork() creates a child process with a Umode image identical to the parent. The parent process returns from fork() with child pid. When the child process runs, its returned pid value is 0. While executing, a process may change its Umode execution image by the exec(command_line) // command_line = filename parameter_strings syscall. Exec() replaces the process Umode image with the image from a filename and execute the new Umode image. The EOS kernel implements and supports fork-exec as in Unix/Linux. For simplicity, the EOS kernel only supports two Umode image files u1 and u2. For ease of identifying which file is executing, u1 prints in lowercase and u2 prints in uppercase. The reader may create and test other User mode image files. (5). Trap and syscall handlers in OS kernel When an exception occurs, the CPU follows the vector table to execute the trap handler code generated by the handler macro: .macro handler CHANDLER stp x29, x30, stp x27, x28, stp x25, x26, stp x23, x24, stp x21, x22, stp x19, x20, stp x17, x18, stp x15, x16, stp x13, x14, stp x11, x12, stp x9, x10, stp x7, x8, stp x5, x6, stp x3, x4, stp x1, x2, stp

xzr,

GO // 2 parameters: CHANDLER to call and a label GO [sp, #-16]! [sp, #-16]! [sp, #-16]! [sp, #-16]! [sp, #-16]! [sp, #-16]! [sp, #-16]! [sp, #-16]! [sp, #-16]! [sp, #-16]! [sp, #-16]! [sp, #-16]! [sp, #-16]! [sp, #-16]! [sp, #-16]!

x0, [sp, #-16]!

// x0-x3 are syscall parameters, preserve them on EL1 stack // push elr_el1, spsr_el1 mrs x10, elr_el1 mrs x11, spsr_el1 stp x10, x11, [sp, #-16]!

// stack = x30 29 .... x2 x1 x0 zxr spsr_el1 elr_el1

// LDR LDR MRS str

load x10, x11, x10, x10,

sp_el0 with =running [x10, #0] sp_el0 [x11, #24]

proc.usp // r10=&running // r11->running PROC

// save usp at proc.usp

mov x0, sp // x0 = sp BL \CHANDLER // Call \CHANDLER in C x0=sp

// need a label here: goUmode: OR NULL \GO // load sp_el0 with proc.usp LDR x10, =running // r10=&running

788

11 ARMv8 Architecture and Programming

LDR x11, [x10, #0] // r11->running PROC ldr x10, [x11, #24] // x10 = proc.usp MSR sp_el0, x10 ldp

x0, x1, [sp], #16

// pop spsr, elr into x1, x2

ldp

x1,

// pop xzr->x1, x0->x0

msr msr ldp ldp ldp ldp ldp ldp ldp ldp ldp ldp ldp ldp ldp ldp ldp .endm

// restore sp_el0 = this proc's usp

ERET

elr_el1, x0 spsr_el1, x1 x1, x3, x5, x7, x9, x11, x13, x15, x17, x19, x21, x23, x25, x27, x29,

x0, [sp], #16

x2, x4, x6, x8, x10, x12, x14, x16, x18, x20, x22, x24, x26, x28, x30,

[sp], [sp], [sp], [sp], [sp], [sp], [sp], [sp], [sp], [sp], [sp], [sp], [sp], [sp], [sp],

#16 #16 #16 #16 #16 #16 #16 #16 #16 #16 #16 #16 #16 #16 #16

// restore elr_el1 // restore spsr_el1

.global trap1_handler trap1_handler: handler trap_chandler // trap from EL1, no need for a label

.global trap0_handler trap0_handler: // trap from EL0 may be systeam call by SVC handler trap_chandler goUmode: // trap from EL0, with goUmode: All low-level trap handlers call trap_chandler() in the OS kernel, which is shown below: typedef struct stack{ u64 elr, spsr, zxr, x0, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15; u64 x16, x17, x18, x19, x20, x21, x22, x23, x24, x25, x26, x27, x28, x29, x30; // only need elr, x0 to x3 }KSTACK; u64 get_FAR_EL1(); u64 get_ESR_EL1(); u64 get_ELR_EL1();

void trap_chandler(KSTACK *p) // p->kstack Top { u64 a,b,c,d; // syscall parameters from Umode in syscall(a,b,c,d) u64 spsr = p->spsr;

u64 esr = get_ESR_EL1(); u64 elr = get_ELR_EL1();

11.12 MMU Programming Examples

u64 far = get_FAR_EL1();

// printf("trap_chandler: elr = %x\n", elr); u32 r = esr >> 24; r = r & 0x000000FF; // printf("syndrome = %x\n", r);

// if brk or hvc or MM fault: inc elr by 4, return if (r != 0x56){ printf("trap_chandler: elr = %x\n", elr); printf("syndrome = %x ", r); printf("far = %x\n",far); printf("spsr= %x\n",spsr); }

p->elr += 4; return;

// looking for SVC from EL0, syndrome=0x56 // ARM lumping SVC into async exceptions very BAD idea !!! if (r a = b = c = d =

== 0x56){ // svc from EL0 p->x0; p->x1; // syscall parameters a,b,c,d p->x2; p->x3;

//printf("syscall # = %d ", a); if (a==0){ // ups("getpid "); r = running->pid; } if (a==1){ ups("getppid "); r = running->ppid; }

if (a==2){ kps(); r = 0; } if (a==3){ ups("switch process\n"); tswitch(); r = 0; } if (a==4){ r = fork(); } if (a==5){ kexit(b); } if (a==6){

789

11 ARMv8 Architecture and Programming

790

}

r = kwait(b);

if (a==7){ r = kexec(b); } if (a==90){ r = ugetc(); }

}

if (a==91){ uputc(b); } p->x0 = r;

}

Kernel Syscall Functions For demonstration purpose, the OS kernel implements the following syscall functions: 0: getpid: 1: getppid: 2: ps: 3: switch: 4: fork: 5: exit: 6: wait: 7: exec:

return process pid return process ppid display process information in OS kernel switch process in OS kernel fork a child process with identical Umdoe image terminate with exitCode=pid, become a ZOMBIE, wakeup parent wait for (and FREE) a ZOMBIE child, return its pid and exitCode change User mode image

The following sequence of commands is used to test run Example 13 program: enter: P0 switch to run P1 fork : P1: forks a child P2 in readyQueue; switch: P1 switch to run P2 exec : “u2 one two three”: P2 change Umode image to u2, Command-line parameters = u2, one, two, three exit: P2 calls kexit(pid) to become a ZOMBIE, wakeup parent, tswitch() P1 running: wait: P1 waits for a ZOMBIE child P2 and frees it. Figure 11.36 shows the sample outputs of Example 13 program. Exercise: Modify Example 13 to use 48-bit input VA space and 32-bit output PA space.

11.13 ARMv9 This section describes the new features of the ARMv9 architecture for future computing needs. ARMv9 was announced by ARM in early 2021 [12]. It is a successor to the ARMv8 architecture with enhancements for improved performances in two major areas: improved security and improved computing capability in general [13].

11.13 ARMv9

791

Fig. 11.36 Sample outputs of Example 13 program

11.13.1

ARMv9 for Improved Security

On improved security, the ARMv9 introduces the Confidential Computing Architecture (CCA), which extends the TrustZone technology to provide additional security in embedded systems and mobile devices. Confidential computing shields portions of code and data from access or modification while in use, even from privileged software, by performing computation in a hardware-based secure environment. The AARM CCA introduces the concept of dynamically created Realms, useable by all applications, in a region that is separate from both the secure and non-secure worlds. Exactly how the ARM proposed Realms scheme works remains to be seen. However, real and tangible improvements of the ARMv9 architecture are much more noticeable in the area of improved computing capability.

11.13.2

ARMv9 for Improved Computing Capability

On improved computing capability, Fugaku, the world’s fastest supercomputer in 2021, was based on a Fujitsu A64FX processor that implements the Scalable Vector Extension (SVE) jointly developed by ARM and Fujitsu. ARMv9 has developed a new SVE-2 based on SVE for enhanced high-powered computing needs in machine learning (ML), digital signal processing (DSP), and AI in general. The SVE2 improves ARM’s SIMD capabilities by allowing variable vector size and predicated execution of floating point and vectored instructions. In SVE2, vector size ranges from 128 to 2048 bytes. It even allows variable 128-byte granularity of vectors, irrespective of what the actual hardware is running on. Predicated execution of instructions has proved a failure in ARMv7, which is de-emphasized in ARMv8. The situation could be very different in supercomputing, in which a vendor will try to take every bit advantage over its rivals to claim the world’s fastest computer title. Predicated instruction program-

792

11 ARMv8 Architecture and Programming

ming may just give that needed nano-meter thin edge, but such software probably cannot count on GCC compiler even though most supercomputers use Linux as operating systems. Machine learning (ML) and AI computing rely heavily on efficient matrix multiplication operations implemented in hardware. Traditionally, matrix operations are specialized and strong features of graphics processor units (GPUs). Some supercomputers even rely heavily on GPUs, such as the GPU of Nvidia. ARMv9 plans to incorporate both variable-sized vector processing and matrix operations in hardware. These improved computing capabilities will have a significant impact on future computing. Although these topics are important, they are not directly related to operating systems. Also, due to the lack of available ARMv9 platforms for programming, this book can only present some general information on the ARMv9 architecture without any programming examples.

11.14 Conclusions This chapter covers three major areas of ARMv8 architecture and programming. First, it describes the exception levels of ARMv8, which divides an ARM-based system into a secure world and a non-secure or normal world. It explains the Ar64 and Ar32 execution states, change exception levels, and execution states. It covers the instruction set and general and special registers in 64-bit execution environment. Then it covers exception and interrupt handling in ARMv8 in detail. Next, it covers the GICv3 interrupt controller operations, including the hierarchical organization of distributors, redistributors, CPU interface, interrupt routing, and interrupt groups for added security. Then it uses the generic timers and example programs to show interrupt processing in detail. It also shows how to start up an ARM processor in EL3, migrate to EL1, and develop a simple OS kernel with processes at EL1. The last part of this chapter covers the memory management unit (MMU) in ARMv8 for virtual address to physical address translation, which is totally different from earlier versions of ARM processors. In addition to presenting principles, it uses example programs to show VA to PA translation in detail. The high point of this chapter is the development of a OS kernel running in EL1 using HIGH VA space and User mode processes running in EL0 using LOW VA space. The OS kernel supports system calls, allowing User mode processes to execute functions in the OS kernel and return to the User mode with the desired results. It also supports the fork-exec operations similar to Linux kernel. Finally, it describes ARMv9 with improved security and computing capability, which will have an impact on machine learning, AI, and supercomputing in general.

References 1. Fundamentals of ARMv8-A, ARM 2017 2. Introduction to ARMv8 Architecture, V2.0, ARM Developer, 2022 3. ARM Architecture Reference Manual, ARMv8, for ARMv8-A architecture profile, ARM 2013 4. ARM Cortex-A Series Programmer’s Guide for ARMv8-A, Version 1.0, ARM Developer, 2022 5. ARM Cortex -A53 MPCore Processor Technical Reference Manual, ARM, 2014 6. ARM Cortex-A57 MPCore Processor Technical Reference Manual. ARM 2013 7. AArch64 Exception Model, ARM Developer, 2022 8. Protection ring, https:// en.wikipedia.org/wiki/Protection_ring, 2022 9. ARM 64-Bit Assembly Language, L. D. Pyeatt, W. Ughetta, Newnes, 2020 10. SMC CALLING CONVENTION System Software on ARM Platforms, ARM, 2016 11. Procedure Call Standard for the Arm 64-bit Architecture, ARM, 2022 12. Arm’s solution to the future needs of AI, security and specialized computing is v9, ARM, March 30, 2021 13. Arm Architecture Reference Manual Supplement Armv9, for Armv9-A architecture Profile, ARM, 2021 14. AArch64 Exception and Interrupt Handling, ARM Developer, 2022 15. ARM Generic Interrupt Controller Architecture Specification GIC architecture version 3.0 and version 4.0, ARM, 2015 16. GICv3 and GICv4 Software Overview, ARM 2016 17. Generic Timer, ARM Developer, 2022 18. Virt Generic Virtual Platform, https://www.qemu.org/docs/master/system/riscv/virt.html 19. Bare-metal Boot Code for ARMv8-A Processors, V1.0, ARM, 2017 20. ARMv8-A Memory Model Version 1.0 ARM 2019 21. ARMv8-A Memory Management, Develop Paper, 2020 22. AArch64 memory management examples, version 1.0, ARM 2019 23. Fast Models, Fixed Virtual Platforms FVP Reference Guide, Version 11.4, ARM, 2018 24. Fast Model Reference Guide version 11.18, ARM 2022 25. Wang, K.C., “Design and Implementation of the MTX Operating Systems, Springer International Publishing AG, 2015

Chapter 12 ARM TrustZone and Secure Operating Systems

ARM processors have dominated the world of mobile devices, which have seen their explosive growth in recent years. Nowadays, mobile devices, such as smart phones, are not only used for personal communication but also for all kinds of activities through the Internet, such as online shopping, online bank transactions, personal identification, and even proof of health as in countries like China during the COVID pandemic. In short, mobile devices have becoming indispensable tools in daily lives. As such, they also contain a wide range of sensitive personal information, such as bank account, credits information, passwords, and medical records. Such information must be maintained and accessed in a secure environment. Security in mobile devices, and computing systems in general, has become ever more important. ARM TrustZone technology [1, 2] is a hardware security extension aimed to provide a trusted or secure execution environment by splitting computer resources into two distinct worlds: a secure world comprising information and operations under strict security protection and a non-secure or normal world for running applications under an ordinary operating system. ARM TrustZone was available as a security extension in ARMv6 and ARMv7. It is now a standard feature in ARMv8. It is also the basis of security in recently released ARMv9 [3, 4]. TrustZone was not included in this book originally for reasons of simplicity. With the advent of secure computing technology and increasing demands for security in embedded systems and mobile devices, it is now appropriate and necessary to cover this important area. This chapter covers secure systems based on the ARM TrustZone.

12.1 ARM TrustZone Architecture This section introduces ARM TrustZone technology and explains its goal and functions [5]. ARM TrustZone aims to provide a trusted computing environment by partitioning computer resources into a secure world and a non-secure or normal world. This is accomplished by adding security features in both hardware and software system architectures.

12.2 ARM TrustZone Hardware Architecture 12.2.1 Hardware Features for Secure and Normal Worlds At the hardware level, ARM TrustZone uses added hardware components to partition computing resources into secure and normal worlds. Such hardware components include the Advanced Microcontroller Bus Architecture (AMBA3), which uses a NS (non-secure) bit to partition memory into secure (NS = 0) and non-secure (NS = 1) regions. In essence, this NS bit is added to all memory system transactions, including cache tags, access to system memory, and peripherals. In ARMv7, which uses 32-bit addressing, the NS bit can be considered as an additional address bit, giving a 32-bit physical address space for the secure world and a completely separate 32-bit physical address space for the normal world. The memory management unit (MMU) is modified to have different translation table base registers to allow each world to manage its own VA to PA mappings. Processor L1 cache is also modified to use the NS bit to support secure and normal access of cache lines. For devices outside the normal memory, an Advanced Peripheral Bus (APB) attached to the system bus via an AXI-to-APB bridge can be used to classify peripheral devices as either secure or normal. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. C. Wang, Embedded and Real-Time Operating Systems, https://doi.org/10.1007/978-3-031-28701-5_12

793

794

12 ARM TrustZone and Secure Operating Systems

In ARMv8, these hardware components have evolved into security features known as TrustZone Address Space Controller (TZASC) and TrustZone Memory Adapter (TZMA). The TZASC can be used to configure specific memory regions as secure or non-secure, such that software running in the secure world can access memory associated with both worlds, but applications running in the normal world cannot access secure world memory. Partitioning the DRAM into different memory regions and its respective association with a specific world is performed by the TZASC under the control of a programming interface that is restricted to software with secure world privileges. A similar memory partitioning functionality is implemented by the TZMA but targeting off-chip ROM or SRAM. All these additional hardware components are designed to partition resources in a computer system into secure and normal worlds. Resources can only be managed from the secure world. Normal world can only access resources allocated to that world.

12.2.2 Processor Architecture ARM TrustZone began in ARMv6 and ARMv7 as a security extension. Over time, it has evolved to its present state. The discussion here will only cover the current state of TrustZone in ARMv8 and its extension in ARMv9. ARMv8 processors supporting security have exception (privilege) levels EL0 to EL3, where higher number means higher privilege level. Normal world comprises user applications at EL0, a regular non-secure OS at EL1 and an optional hypervisor for virtualization at EL2. Secure world consists of a secure monitor at EL3 and possibly a trusted OS at secure EL1 and trusted applications at secure EL0. This book does not consider virtualization for virtual machines, so we shall ignore EL2. A simple secure system may contain two processors executing in different modes, one for each world, as shown in Fig. 12.1. To comply with the low-power requirements of ARM architecture, the two processors can be virtual, implemented by a single physical processor through timeslicing. In ARMv8, the processor executing in exception levels EL0, EL1, and EL2 can be in either secure state or non-secure state, which is controlled by the SCR_EL3.NS bit. The NS bit is often denoted in different ELn levels as NS.EL1: Non-secure state, exception level 1 S.EL1: Secure state, exception level 1 EL3 is always in secure state, regardless of the NS bit value. It only enforces and indicates the security states of ELn with n