English Pages 272 Year 2022
Dissecting Computer Architecture
Alvin Albuero De Luna
ARCLER Press
www.arclerpress.com
Arcler Press 224 Shoreacres Road Burlington, ON L7L 2H2 Canada www.arclerpress.com
e-book Edition 2023 ISBN: 978-1-77469-583-8 (e-book)
This book contains information obtained from highly regarded resources. Reprinted material sources are indicated, and copyright remains with the original owners. Copyright for images and other graphics remains with the original owners as indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data. Authors, editors, and publishers are not responsible for the accuracy of the information in the published chapters or for the consequences of its use. The publisher assumes no responsibility for any damage or grievance to persons or property arising out of the use of any materials, instructions, methods, or thoughts in the book. The authors or editors and the publisher have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission has not been obtained. If any copyright holder has not been acknowledged, please write to us so we may rectify. Notice: Registered trademarks of products or corporate names are used only for explanation and identification without intent of infringement.
© 2023 Arcler Press ISBN: 978-1-77469-439-8 (Hardcover)
Arcler Press publishes a wide variety of books and eBooks. For more information about Arcler Press and its products, visit our website at www.arclerpress.com
ABOUT THE AUTHOR
Alvin Albuero De Luna is an IT educator at the Laguna State Polytechnic University under the College of Computer Studies, located in the Province of Laguna, in the Philippines. He earned his Bachelor of Science in Information Technology from STI College and his Master of Science in Information Technology from Laguna State Polytechnic University. He is also a holder of two (2) National Certifications from TESDA (Technical Education and Skills Development Authority), namely NC II - Computer Systems Servicing and NC III - Graphics Design. He has also passed the Career Service Professional Eligibility given by the Civil Service Commission of the Philippines.
TABLE OF CONTENTS
List of Figures
List of Tables
List of Abbreviations
Preface

Chapter 1 Introduction to Computer Architecture
1.1. Introduction
1.2. Fundamental Concepts
1.3. Processors
1.4. Basic System Architecture
1.5. Characteristics of Von Neumann Machine
1.6. Buses
1.7. Processor Operation
1.8. ALU
1.9. Interrupts
1.10. CISC and RISC
1.11. Input/Output
1.12. DMA
1.13. Parallel and Distributed Computers
1.14. Embedded Computer Architecture
References

Chapter 2 Classification of Computer Architecture
2.1. Introduction
2.2. Von-Neumann Architecture
2.3. Harvard Architecture
2.4. Instruction Set Architecture
2.5. Microarchitecture
2.6. System Design
References

Chapter 3 Computer Memory Systems
3.1. Introduction
3.2. Memory Hierarchy
3.3. Managing the Memory Hierarchy
3.4. Caches
3.5. Main Memory
3.6. Present and Future Research Problems
References

Chapter 4 Computer Processing and Processors
4.1. Introduction
4.2. Computer Processors
4.3. Computer Processes (Computing)
4.4. Multitasking and Process Management
4.5. Process States
4.6. Inter-Process Communication (IPC)
4.7. Historical Background of Computer Processing
4.8. Types of Central Processing Units (CPUs)
References

Chapter 5 Interconnection Networks
5.1. Introduction
5.2. Questions About Interconnection Networks
5.3. Uses of Interconnection Networks
5.4. Network Basics
References

Chapter 6 Superscalar Processors
6.1. Introduction
6.2. Sources of Complexity
6.3. Basic Structures
6.4. Current Implementations
6.5. A Complexity-Effective Microarchitecture
References

Chapter 7 Measurement of Computer Performance
7.1. Introduction
7.2. Common Goals of Performance Analysis
7.3. Solution Techniques
7.4. Assessing Performance With Benchmarks
References

Chapter 8 Recent Developments in High-Performance Computing
8.1. Introduction
8.2. A Short History of Supercomputers
8.3. 2000–2005: Intel Processors, Cluster, and the Earth-Simulator
8.4. 2005 and Beyond
References

Index
LIST OF FIGURES
Figure 1.1. Software layers
Figure 1.2. Basic computer system
Figure 1.3. Data flow
Figure 1.4. Ported versus memory-mapped I/O spaces
Figure 1.5. Harvard architecture
Figure 1.6. Three-bus system
Figure 1.7. ALU block diagram
Figure 1.8. The devices of eight bit-organized 8×1 as well as one word-organized 8×8
Figure 1.9. Shared-memory MIMD
Figure 1.10. Message-passing MIMD
Figure 1.11. Block diagram of a generic computer
Figure 1.12. Block diagram of an embedded computer
Figure 2.1. Architecture of CPU
Figure 2.2. Memory architecture for the von Neumann
Figure 2.3. Harvard architecture
Figure 2.4. Instruction set architecture
Figure 2.5. Microarchitecture illustration
Figure 2.6. Design and architecture of a software
Figure 3.1. Computing system
Figure 3.2. Hierarchical representation of memory systems of computer
Figure 3.3. System of main memory and cache
Figure 3.4. A memory hierarchy having three tiers of caches is an instance
Figure 3.5. The memory hierarchy of a computer is depicted in this diagram
Figure 3.6. Virtual memory may work in tandem with the actual memory of a computer to offer quicker, more fluid processes
Figure 3.7. System of virtual memory
Figure 3.8. The virtual memory is organized in a general manner
Figure 3.9. Virtual memory enhancement
Figure 3.10. Memory caches (caches of central processing unit) employ high-speed static RAM (SRAM) chips, while disc caches are frequently part of main memory and are made up of ordinary dynamic RAM (DRAM) chips
Figure 3.11. Cache data
Figure 3.12. The logical organization
Figure 3.13. To enable rapid access to data, a distributed cache consolidates the RAM of numerous computers into a single in-memory data store, which is then utilized as a data cache
Figure 3.14. Single versus multiple-level caches
Figure 3.15. DRAM types
Figure 3.16. Kinds of DRAMs
Figure 3.17. The organization of DRAM
Figure 3.18. Bank organization
Figure 3.19. The management of cache
Figure 3.20. A cache block is being transferred
Figure 3.21. Challenges of DRAM scaling
Figure 3.22. The design of a three-dimensional stacked DRAM memory cell
Figure 4.1. Representation of a typical processor
Figure 4.2. CPU working diagram
Figure 4.3. Components of CPU
Figure 4.4. A simplified view of the instruction cycle
Figure 4.5. Program vs. process vs. thread (scheduling, preemption, context switching)
Figure 4.6. A list of methods as shown by htop
Figure 4.7. A process table as shown by the KDE system guard
Figure 4.8. Examples of computer multitasking programs
Figure 4.9. In the state diagram, the different process stages are shown, with arrows suggesting probable transitions between them
Figure 4.10. CPU types
Figure 5.1. The functional view of an interconnection network
Figure 5.2. To connect the CPU and memory, an interconnection network is used
Figure 5.3. The processor-memory connection requires two packet types
Figure 5.4. A typical I/O network links several host adapters to a greater number of I/O devices, such as disc discs in this example
Figure 5.5. Interconnection networks are used as a switching fabric by certain network routers, moving packets between line cards that transmit and receive packets across network channels
Figure 5.6. The configurations of nodes, marked by circles numbered 00 to 33, and the channels linking the nodes make up a network topology
Figure 5.7. A 16-node torus topology is packaged in this way
Figure 5.8. A 16-node ring network has shorter latency than the 16-node, 2-D torus of Figure 5.6 for the limitations of our scenario. This delay is attained at the cost of reduced throughput
Figure 5.9. In the 2-D torus of Figure 5.6, there are two routes from 01 to 22
Figure 5.10. Two flow control strategies are depicted in time-space diagrams. The horizontal axis represents time, while the vertical axis represents space (channels)
Figure 5.11. Simplified block diagram of a router
Figure 5.12. Router architecture
Figure 6.1. Superscalar architecture
Figure 6.2. Baseline superscalar model
Figure 6.3. Block diagram of an out-of-order superscalar processor
Figure 6.4. Register rename logic
Figure 6.5. Dependence-based microarchitecture
Figure 6.6. Microarchitecture based on dependencies
Figure 6.7. An example of instruction steering
Figure 6.8. Performance (IPC) of a dependency-based microarchitecture
Figure 6.9. Grouping the dependency-based microarchitecture: An eight-way machine divided into two four-way clusters (2 X 4-way)
Figure 6.10. Performance of clustered dependency-based microarchitecture
Figure 7.1. Computer performance measurement system
Figure 7.2. A graph showing the performance of a CPU at different byte transmission rates
Figure 7.3. System tuning utility interface
Figure 7.4. The relative performance of different CPUs
Figure 7.5. Diagnostic tool for performance debugging in Windows
Figure 7.6. Software to test the performance of a PC
Figure 8.1. HPC in the digital arena
Figure 8.2. For the past 60 years, the efficiency of the faster computer systems has been measured
Figure 8.3. Characteristics of supercomputers
Figure 8.4. Cluster computing is illustrated in this diagram
Figure 8.5. Principal architectural classifications listed among the Top-500
Figure 8.6. Families of main processors shown in the Top-500
Figure 8.7. The rate of replacement in the Top-500 is described as the number of systems eliminated from the list due to their insufficient performance
Figure 8.8. As shown in the Top-500, high-performance computing systems are used by people from all around the world
Figure 8.9. As shown in the Top-500, Asian users of high-performance computing systems
Figure 8.10. In the Top-500, the overall increase in cumulative and individual performance may be viewed
Figure 8.11. Extrapolation of current performance growth rates shown in the diagram
LIST OF TABLES
Table 5.1. Processor-memory interconnection network parameters
Table 5.2. I/O connectivity network parameters
Table 5.3. A packet switching fabric’s parameters
Table 6.1. Baseline simulation model
Table 6.2. Delay of reservation table in technology
Table 7.1. A comparison of the performance-analysis solution techniques
LIST OF ABBREVIATIONS
ALU - arithmetic logic unit
CISC - complex instruction set computer
CM - connection machine
CPU - central processing unit
CU - control unit
DMA - direct memory access
DMAC - DMA controller
DRAM - dynamic random-access memory
DSP - digital signal processor
EA - effective address
EAROM - electrically alterable read-only memory
EEROM - electrically erasable read-only memory
ERDs - entity-relationship diagrams
HPC - high-performance computing
ICP - in-circuit programming
IMC - integrated memory controller
IPC - instructions per cycle
IPC - inter-process communication
ISP - in-system programming
MAC - multiply-and-accumulate
MIMD - multiple-instruction multiple-data
MPP - massive parallel processors
OS - operating system
OTP - one-time programmable
PSW - processor status word
RAM - random access memory
RISC - reduced instruction set computer
ROM - read-only memory
SMP - symmetric multiprocessor
SoC - system-on-chip
SRAM - static random-access memory
TLB - translation lookaside buffer
UI - user interface
XOR - exclusive OR
PREFACE
In the nearly six decades since the first consumer computer was constructed, computer technology has advanced tremendously. A mobile computer today costs less than $500 and has more performance, main memory, and disk space than a computer that cost $1 million in 1985. Advances in the technology used to build computers, as well as advances in computer design, have contributed to this rapid improvement.

While technological advances have been fairly steady, progress resulting from improved computer architectures seems to have been slower. Both forces contributed significantly during the first 25 years of electronic computing, with annual performance improvements of roughly 25%. The microprocessor first appeared in the late 1970s. Its capacity to keep up with advances in integrated circuit technology resulted in a higher rate of performance improvement, roughly 35% each year. This growth rate, along with the economic advantages of mass-produced microprocessors, resulted in microprocessors accounting for a growing share of the computer market. Furthermore, two important shifts in the computer industry made commercial success with a new design easier than ever before. First, the virtual elimination of assembly language programming reduced the need for object-code compatibility. Second, the development of standardized, vendor-independent operating systems (OS) such as UNIX and its counterpart, Linux, reduced the cost and risk of releasing a new design.

Computer science education should reflect the current state of the discipline while also introducing the ideas that are shaping computing. We believe that readers in all areas of computing should be aware of the organizational paradigms that influence computer system performance, capabilities, and success. Professionals in every computing field must be able to understand both hardware and software in today's world. The interaction of hardware and software at various levels also provides a foundation for understanding computer fundamentals. The essential ideas in computer organization and design are the same whether you are interested in electrical engineering or computer science. As a result, the focus of this book is on demonstrating the relationship between hardware and software, as well as the concepts that underpin modern computers.

Readers with no prior knowledge of assembly language or logic design who need to understand basic computer organization, as well as readers with prior knowledge of logic design or assembly language, will benefit from this book. This book is aimed at software developers, system designers, computer engineers, computer professionals, reverse engineers, and anyone else interested in the architecture and design principles that underpin all types of modern computer systems, from tiny embedded systems to cellphones to huge cloud server farms. Additionally, readers will learn about the possible directions these technologies will take in the coming years. A working knowledge of computer processors is advantageous but not required.

Author
CHAPTER 1
INTRODUCTION TO COMPUTER ARCHITECTURE
1.1. INTRODUCTION
Designing a computer entails creating a machine capable of storing and manipulating data. Computer systems fall broadly into two kinds. The first is the most obvious: the desktop computer. When you say the word “computer” to someone, this is usually the machine that comes to mind (Hennessy and Patterson, 2011). An embedded computer, on the other hand, is a computer that is incorporated into some other system for control and/or monitoring purposes. Embedded computers are far more prevalent than desktop computers, but they are also far less evident. If you ask a typical person how many computers he or she has at home, the answer may be one or two. In fact, there may be 30 or more hidden in televisions, remote controls, VCRs, washing machines, DVD players, gaming consoles, mobile phones, ovens, air conditioners, and a variety of other equipment (El-Rewini and Abd-El-Barr, 2005).

We'll examine computer architecture in this section. The discussion applies to embedded and desktop computers alike, since the major distinction between an embedded computer and a general-purpose computer is its intended use. The essential operating principles and core architectures are the same (Cragon, 2000). Both have a CPU, memory, and frequently a variety of I/O devices. The fundamental distinction between them lies in their application, which is reflected in both their system architecture and their software. Desktop computers can run a wide range of applications, with an operating system (OS) managing the computer's resources. By installing different applications, the desktop computer's function can be changed. It could be used as a word processor one minute and as an MP3 player or database client the next. The user controls which software is installed and run (Page, 2009; Martínez-Monés et al., 2005).

By comparison, an embedded computer is usually dedicated to a single purpose. Embedded systems are often used to replace application-specific electronics. The benefit of an embedded microprocessor over discrete electronics is that the software, not the hardware, determines the system's operation. This makes an embedded system simpler to manufacture and much simpler to develop than a sophisticated circuit (Van De Goor, 1989).

Typically, an embedded system runs just a single program. The embedded computer may or may not have an operating system, and it seldom allows the user to install new software arbitrarily. The software is usually held in the system's nonvolatile memory, unlike a desktop computer, where the nonvolatile memory holds only boot software and (maybe) low-level drivers (Eeckhout, 2010; Kaxiras and Martonosi, 2008).

While embedded hardware is often simpler than a desktop system, it can also be significantly more sophisticated. An embedded computer may be built on a single chip with a few supporting components and serve as something as simple as a controller for a lawn irrigation system. Alternatively, the embedded computer may be a 150-processor, distributed parallel machine that handles all of the flight and control systems of a commercial jet. As varied as embedded hardware can be, the fundamental design concepts remain the same (Hwang and Faye, 1984). This chapter discusses some key ideas in computer design, with a focus on those that pertain to embedded systems. Its objective is to provide a foundation before proceeding to more practical material (Hill et al., 2000).
1.2. FUNDAMENTAL CONCEPTS
A computer is essentially a machine that processes, stores, and retrieves data. Numbers in a spreadsheet, characters of text in a document, dots of color in an image, waveforms of sound, or the state of a machine such as an air conditioner or a CD player are all examples of data. The computer stores all information as numbers. When we are immersed in C code, pondering intricate algorithms and data structures, it is easy to forget this (Bindal, 2017).

The computer manipulates data by performing operations on the numbers. Transferring a set of numbers to video memory, each number representing a pixel of color, allows a picture to be shown on a screen. To play an MP3 file, the computer reads a set of numbers from disk into memory, manipulates those numbers to convert the compressed audio data into raw audio data, and then sends the raw audio data to the audio chip (Dumas II, 2018).

From browsing the web to printing, everything a computer does involves moving and processing numbers. A computer's electronics are little more than a mechanism for storing, moving, and changing numbers. A computer system is made up of many hardware and software components. The CPU, the hardware that executes computer programs, is the essential part of the computer. The computer also contains memory, which may be of several kinds in one machine (Wang and Ledley, 2012). The memory stores programs while the processor is running them, as well as the data those programs operate on. The computer also includes devices for storing data and for exchanging data with the outside world. These may enable text input through a keyboard, data display on a screen, or program and data transfer to and from a disk drive (Leon-Garcia and Widjaja, 2000).

Software controls the computer's operation and functionality. In a computer, there are numerous “layers” of software (Figure 1.1). A layer usually interacts only with the layers directly above and below it (Stokes, 2007).
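Returning to the point that the computer stores all information as numbers, the following Python fragment is a minimal sketch of that idea. The sample text and the 2×2 "image" are invented for illustration and do not come from any real file format.

```python
# Every kind of data below ends up as plain integers.

text = "Hi!"
# One number per character, using the ASCII encoding.
text_as_numbers = list(text.encode("ascii"))

# A hypothetical 2x2 image: each pixel is a (red, green, blue) triple.
pixels = [(255, 0, 0), (0, 255, 0),
          (0, 0, 255), (255, 255, 255)]
# Flatten the pixels into the sequence of numbers that would sit in memory.
pixels_as_numbers = [channel for pixel in pixels for channel in pixel]

# "Manipulating" data is arithmetic on those numbers: halve the brightness.
dimmed = [value // 2 for value in pixels_as_numbers]

print(text_as_numbers)
print(dimmed)
```

Here the text becomes three integers and the image becomes twelve; dimming the image is nothing more than integer division applied to each stored number.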
Figure 1.1. Software layers.
At the lowest level are programs that the CPU runs when the computer is first powered on. These programs initialize the other hardware components and prepare the computer for proper operation. Because it is permanently stored in the computer's memory, this software is referred to as firmware (Akram and Sawalha, 2019).

The firmware includes the bootloader. A bootloader is a special program run by the processor that reads the OS from a disk (or from nonvolatile memory, or over a network interface) and places it in memory so that the processor can run it. The bootloader is found in desktop and workstation computers, as well as in certain embedded computers (El-Rewini and Abd-El-Barr, 2005).

The OS, which runs on top of the firmware, is in charge of the computer's operation. It manages memory use and controls devices including the keyboard, mouse, screen, and disk drives, among others. It is also the software that presents an interface to the user, allowing her to run applications and access her data on the hard drive. The OS typically provides a collection of software tools for application programs, letting them access the screen, disk drives, and other resources (Porter et al., 2013).

Not all embedded systems require or use an OS. In many cases, an embedded device simply runs code dedicated to its task, and an OS is unnecessary. In other cases, such as network routers, an OS provides software integration and makes development much simpler. Whether an OS is necessary and useful is largely determined by the embedded computer's intended function and, to a lesser degree, by the designer's preference (Patti et al., 2012).

The application software, at the top layer, consists of the programs that provide the computer's functionality. Everything that isn't an application is referred to as system software. On embedded devices, the line between application and system software is often blurred. This reflects the basic principle of embedded design, which states that a system should be designed to achieve its goal in as simple and straightforward a manner as possible (Hasan and Mahmood, 2012).
1.3. PROCESSORS
The CPU is the most critical component of the computer; it is the focal point around which everything else revolves. The processor is, in effect, the computing part of the computer. A processor is an electronic device capable of manipulating data (information) according to a sequence of instructions. The instructions are also known as opcodes or machine code. Computers are programmable because this sequence of instructions can be changed to suit the application. A program is a sequence of instructions (McGrew et al., 2019).

Instructions in a computer are numbers, just like data. Different numbers, when read and executed by a processor, cause different things to happen. The mechanism of a music box is a good analogy (Kozyrakis and Patterson, 1998). A music box has a rotating drum with small bumps and a row of prongs. As the drum rotates, the bumps strike particular prongs, producing music. Similarly, the bit patterns of instructions are fed to the processor's execution unit. Different bit patterns activate or deactivate different parts of the processor core. Thus, the bit pattern of one instruction may trigger an arithmetic operation, while another bit pattern may cause a byte to be stored in memory (Ho et al., 1995).

A sequence of instructions is a machine-code program. Each type of CPU has its own instruction set, which means that the function of the instructions (and the bit patterns that activate them) varies from one type to another. A processor's instructions are often quite simple, such as “add two numbers” or “call this function.” In some processors, however, they can be as complex and sophisticated as “if the result of the previous operation was zero, then use this particular number to reference another number in memory, and then increment the first number.” This is discussed in more depth later in this chapter, in the section on CISC and RISC processors (Kozyrakis et al., 1997).
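To illustrate how the bit pattern of an instruction selects what the processor does, here is a small Python sketch that decodes 16-bit instruction words. The instruction format, the opcode values, and the mnemonics are assumptions made up for this example; they do not correspond to any real processor.

```python
# Hypothetical 16-bit instruction format (an assumption for illustration):
#   bits 15-12: opcode, bits 11-0: operand
OPCODES = {0x1: "LOAD", 0x2: "ADD", 0x3: "STORE", 0xF: "HALT"}

def decode(word: int) -> tuple[str, int]:
    """Split a 16-bit instruction word into its mnemonic and operand."""
    opcode = (word >> 12) & 0xF    # top four bits select the operation
    operand = word & 0x0FFF        # remaining twelve bits are the operand
    return OPCODES.get(opcode, "UNKNOWN"), operand

# A program is just a list of numbers until it is decoded.
program = [0x1010, 0x2011, 0x3012, 0xF000]
decoded = [decode(word) for word in program]
for word, (name, operand) in zip(program, decoded):
    print(f"{word:#06x} -> {name} {operand:#05x}")
```

Much like the bumps on the music-box drum, the top four bits of each word determine which operation is selected, while the remaining bits say what it operates on.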
1.4. BASIC SYSTEM ARCHITECTURE
The CPU alone cannot do much. It requires memory (for storing instructions and data), supporting logic, and at least one I/O (“input/output”) device. Figure 1.2 depicts the basic computer system (Harmer et al., 2002).
Figure 1.2. Basic computer system.
A microprocessor is a processor implemented, usually, on a single integrated circuit. Almost all current processors are microprocessors, except for those found in very large supercomputers, and the two terms are often used interchangeably. The Freescale/IBM PowerPC, Sun SPARC, Intel Pentium series, MIPS, and ARM, among many others, are popular microprocessors today. A microprocessor is also referred to as a central processing unit (CPU) (Taylor and Frederick, 1984).

A microcontroller is an integrated circuit that contains a CPU, memory, and some I/O devices and is designed for use in embedded systems. The CPU and its I/O are linked by buses that are part of the same integrated circuit. Microcontrollers come in a wide variety. They range from the tiniest PICs and AVRs (which will be discussed in this book) to PowerPC processors with built-in I/O. We'll look at both microprocessors and microcontrollers in this book (Eastman et al., 1993).

Microcontrollers resemble system-on-chip (SoC) processors, which are used in conventional computers such as PCs and workstations. SoC processors have a different set of I/O that reflects their intended use and are built to connect to large amounts of external memory. Microcontrollers typically have all of their memory on-chip and may provide only limited support for external memory devices (Trenas et al., 2010).

The computer system's memory holds both the instructions that the processor will execute and the data that it will manipulate. A computer system's memory is never empty. It always contains something, whether instructions, meaningful data, or simply the random garbage that appeared in the memory when the system was first powered on (Carter and Bouricius, 1971).

As shown in Figure 1.3, instructions are read (fetched) from memory, whereas data is both read from and written to memory (Arbelaitz et al., 2014).
Figure 1.3. Data flow.
This form of computer architecture is known as a Von Neumann machine, named after John Von Neumann, one of the originators of the concept. With very few exceptions, nearly all modern computers follow this form. Von Neumann computers are what can be termed control-flow computers: the steps taken by the computer are governed by the sequential control of a program. In other words, the computer follows a step-by-step program that governs its operation (Barua, 2001).
1.5. CHARACTERISTICS OF VON NEUMANN MACHINE
1.5.1. There Is No Real Difference between Data and Instructions
A processor can be directed to begin execution at any point in memory, and it has no way of knowing whether the sequence of numbers beginning at that point is data or instructions. The instruction 0x4143 may also be data (the number 0x4143, or the ASCII characters “A” and “C”). The processor has no way of telling what is data and what is an instruction. If a number is to be executed by the processor, it is an instruction; if it is to be manipulated, it is data (Shin and Yoo, 2019; Backus, 1978).
Because of this lack of distinction, the processor is capable of changing its own instructions (treating them as data) under program control. And because the processor cannot distinguish between data and instructions, it will blindly execute whatever it is given, whether or not it is a meaningful sequence of instructions (Iannucci, 1988).
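The 0x4143 example above can be reproduced in a few lines. This is only a sketch: the variable names are illustrative, and big-endian byte order is assumed so that the high byte comes first.

```python
value = 0x4143

# Interpreted as an unsigned integer.
as_number = value

# Interpreted as two ASCII characters (big-endian byte order assumed).
as_text = value.to_bytes(2, byteorder="big").decode("ascii")

print(as_number, as_text)
```

Nothing in the stored bits says which reading is intended; the program that consumes them decides.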
1.5.2. Data Has No Inherent Meaning
There is nothing to distinguish a number that represents a dot of color in an image from a number that represents a character in a text document. Meaning comes from how these numbers are treated during the execution of a program (Backus, 1978).
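A minimal sketch of this point: the same four bytes yield different values depending on how a program chooses to read them. The byte values are arbitrary, and the three interpretations (integer, floating-point number, pixel) are chosen here only for illustration.

```python
import struct

raw = bytes([0x42, 0x28, 0x00, 0x00])  # four arbitrary bytes

as_int = struct.unpack(">I", raw)[0]    # big-endian unsigned 32-bit integer
as_float = struct.unpack(">f", raw)[0]  # big-endian IEEE 754 single precision
as_pixel = tuple(raw)                   # e.g. one pixel with four 8-bit channels

print(as_int, as_float, as_pixel)
```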
1.5.3. Data and Instructions Share the Same Memory
This means that sequences of instructions in a program may be treated as data by another program. A compiler creates a program binary by generating a sequence of numbers (instructions) in memory (Zhao et al., 2020). To the compiler, the compiled program is just data. It becomes a program only when the processor begins executing it. Likewise, an OS loads an application program from disk by treating that program’s sequence of instructions as data. The program is loaded into memory just as an image or text file would be, and this is possible because of the shared memory space (Buehrer and Ekanadham, 1987).
1.5.4. Memory Is a Linear (One-Dimensional) Array of Storage Locations
The OS, the various applications, and their associated data are all located within the same linear space in the computer’s memory. Each location in the memory space has a unique, sequential address. The address of a memory location is used to specify (and select) that location (Feustel, 1973). The memory space is also known as the address space, and the memory map describes how that address space is partitioned among the various memory and I/O devices. The address space is the array of all addressable memory locations. In an 8-bit processor (such as the 68HC11) with a 16-bit address bus, this amounts to 2^16 = 65,536 = 64 K locations. The processor is therefore said to have a 64 K address space. Processors with 32-bit address buses can access 2^32 = 4,294,967,296 = 4 G locations (Traversa and Di Ventra, 2015).
Some processors, notably the Intel x86 family, have a separate address space for I/O devices, with separate instructions for accessing it. This is known as ported I/O. Most processors, however, make no distinction between memory and I/O devices within the address space. I/O devices and memory devices share the same linear space, and the same instructions are used to access each. This is known as memory-mapped I/O. Memory-mapped I/O is the most common form of I/O. Ported I/O address spaces are becoming rare, and the use of the term even rarer (Giloi et al., 1978).
Most microprocessors available are Von Neumann machines. The Harvard architecture departs from this in that instructions and data have separate memory spaces, each with its own address, data, and control buses (Figures 1.4 and 1.5). This has a number of advantages: instruction and data fetches can occur concurrently, and the size of an instruction is not bound by the size of the standard data unit (word) (Kanerva, 2009; Pippenger, 1990).
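The address-space arithmetic above, and the idea of a memory map in which memory and I/O share one linear space, can be sketched as follows. The region names and boundaries in `MEMORY_MAP` are invented for illustration and do not describe the 68HC11 or any other real device.

```python
def address_space_size(address_bits):
    """Number of distinct addresses reachable with the given bus width."""
    return 2 ** address_bits


# Hypothetical memory map for a processor with a 16-bit address bus.
MEMORY_MAP = [
    (0x0000, 0x7FFF, "RAM"),
    (0x8000, 0x80FF, "memory-mapped I/O"),
    (0x8100, 0xFFFF, "ROM"),
]


def region_for(address):
    """Return the name of the region that contains the address."""
    for start, end, name in MEMORY_MAP:
        if start <= address <= end:
            return name
    raise ValueError("address is outside the address space")


print(address_space_size(16))  # 16-bit address bus
print(address_space_size(32))  # 32-bit address bus
print(region_for(0x8004))
```

With memory-mapped I/O, the same lookup serves both memory and device registers, which is why ordinary load and store instructions can reach either.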
Figure 1.4. Ported versus memory-mapped I/O spaces.
Figure 1.5. Harvard architecture.
1.6. BUSES
A bus is a physical group of signal lines that have a related function. Buses allow electrical signals to be passed between different parts of a computer system, and thereby allow data to be transferred from one device to another (Mange et al., 1997). The data bus, for instance, is the group of signal lines that carry data between the processor and the various subsystems of the computer. The “width” of a bus is the number of signal lines dedicated to transferring information. An 8-bit-wide bus, for example, transfers 8 bits of data in parallel. Most microprocessors available today (with a few exceptions) use the three-bus system architecture. The three buses are the address bus, the data bus, and the control bus (Figure 1.6) (Eigenmann and Lilja, 1998).
Figure 1.6. Three-bus system.
The data bus is bidirectional, with the direction of transfer determined by the processor. The address bus carries the address, which points to the location in memory that the processor is attempting to access. It is the job of external circuitry to determine in which device a given memory location exists and to activate that device. This is known as address decoding (Händler, 1975). The control bus carries information from the processor about the state of the current access, such as whether it is a read or a write operation. The control bus can also carry information back to the processor about the current access, such as an address error. Different processors have different control lines, but some control lines are common to many processors. The control bus may include output signals such as read, write, valid address, and so on. A processor usually has several input control lines as well, such as reset, one or more interrupt lines, and a clock input (Nowatzki et al., 2015).
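The roles of the three buses can be sketched as a single bus cycle: the address selects a device (address decoding), a control signal says whether the access is a read or a write, and the data bus carries the value. The `Device` class, `bus_access` function, and address ranges below are hypothetical and model the idea only, not any particular hardware.

```python
class Device:
    """A block of byte-wide storage selected by an address range."""

    def __init__(self, base, size):
        self.base = base
        self.cells = [0] * size

    def selected(self, address):
        # Address decoding: is this address inside the device's range?
        return self.base <= address < self.base + len(self.cells)


def bus_access(devices, address, write, data=0):
    """Perform one bus cycle.

    address models the address bus, write the control bus, data the data bus.
    Returns the value on the data bus at the end of the cycle.
    """
    for device in devices:
        if device.selected(address):
            offset = address - device.base
            if write:
                device.cells[offset] = data & 0xFF  # 8-bit data bus
            return device.cells[offset]
    # No device responded: report an address error back to the processor.
    raise LookupError("address error")


ram = Device(base=0x0000, size=0x100)
io = Device(base=0x8000, size=0x10)
bus_access([ram, io], 0x8001, write=True, data=0x5A)
print(hex(bus_access([ram, io], 0x8001, write=False)))
```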
1.7. PROCESSOR OPERATION
There are six basic types of access that a processor can perform with external chips. The processor can write data to memory or to an I/O device, read data from memory or from an I/O device, read instructions from memory, and perform internal manipulation of data within the processor (Inoue and Pham, 2017). In many systems, writing data to memory is functionally identical to writing data to an I/O device. Likewise, reading data from memory is the same external operation as reading data from an I/O device or reading an instruction from memory. In other words, the processor makes no distinction between memory and I/O (Ganguly et al., 2019). The internal data storage of the processor is known as its registers. The processor has a limited number of registers, which are used to hold the current data/operands that the processor is manipulating (Buehrer and Ekanadham, 1987).
1.8. ALU
The arithmetic logic unit (ALU) performs the internal arithmetic manipulation of data in the processor. The instructions read and executed by the processor control the flow of data between the registers and the ALU. The instructions also control, through the ALU’s control inputs, which arithmetic operation the ALU performs. An ALU is shown in symbolic form in Figure 1.7 (Iannucci, 1988).
Figure 1.7. ALU block diagram.
When directed by the processor, the ALU performs an operation (typically addition, subtraction, NOT, AND, OR, XOR, shift left/right, or rotate left/right) on one or more values. These values, called operands, are typically obtained from registers or from a memory location. The result of the operation is then placed back into a given destination register or memory location (Šilc et al., 1999). The status outputs indicate any special attributes of the operation, such as whether the result was zero or negative, or whether an overflow or carry occurred. Some processors have separate units for multiplication and division, and for bit shifting, providing faster operation and increased throughput. Each architecture has its own set of ALU features, and these can vary greatly from one processor to another. They are, nevertheless, all variations on a theme and share common characteristics (Yazdanpanah et al., 2013).
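A minimal sketch of an ALU with status outputs follows. The operation names, the 8-bit default width, and the flag names are assumptions chosen for illustration; real ALUs differ in which operations and flags they provide.

```python
def alu(op, a, b=0, width=8):
    """Return (result, flags) for one ALU operation on unsigned operands."""
    mask = (1 << width) - 1
    sign = 1 << (width - 1)
    a &= mask
    b &= mask
    carry = False
    overflow = False

    if op == "ADD":
        full = a + b
        carry = full > mask
        # Signed overflow: operands share a sign that differs from the result.
        overflow = bool(~(a ^ b) & (a ^ full) & sign)
    elif op == "SUB":
        full = a - b
        carry = a < b  # borrow
        overflow = bool((a ^ b) & (a ^ full) & sign)
    elif op == "AND":
        full = a & b
    elif op == "OR":
        full = a | b
    elif op == "XOR":
        full = a ^ b
    elif op == "NOT":
        full = ~a
    elif op == "SHL":
        full = a << 1
        carry = bool(a & sign)  # bit shifted out of the top
    else:
        raise ValueError("unsupported operation: " + op)

    result = full & mask
    flags = {
        "zero": result == 0,
        "negative": bool(result & sign),
        "carry": carry,
        "overflow": overflow,
    }
    return result, flags


print(alu("ADD", 0xFF, 0x01))
print(alu("ADD", 0x7F, 0x01))
```

The first call wraps around to zero and sets the carry flag; the second crosses from the largest positive signed value into the negative range and sets the overflow flag.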
1.9. INTERRUPTS
Interrupts (also known as traps or exceptions in some processors) are a technique for diverting the processor from the execution of the current program so that it may deal with some event that has occurred. Such an event may be an error from a peripheral, or simply that an I/O device has finished the last task it was given and is now ready for another. An interrupt is generated in your computer every time you press a key or move the mouse. You can think of it as a hardware-generated function call (Verhulst, 1997). Interrupts free the processor from having to continuously check the I/O devices to determine whether they require service. Instead, the processor may continue with other tasks. The I/O devices notify it when they require attention by asserting one of the processor’s interrupt inputs (Bevier, 1989).
Interrupts can be of varying priorities in some processors, thereby assigning differing importance to the events that can interrupt the processor. If the processor is servicing a low-priority interrupt, it will pause it in order to service a higher-priority interrupt. However, if the processor is servicing an interrupt and a second, lower-priority interrupt occurs, the processor will ignore the lower-priority interrupt until it has finished the higher-priority service (Pesavento, 1995).
When an interrupt occurs, the processor usually saves its state by pushing its registers and program counter onto the stack. The processor then loads an interrupt vector into the program counter. The interrupt vector is the address at which an interrupt service routine (ISR) lies. Thus, loading the vector into the program counter causes the processor to begin executing the ISR, which performs whatever service the interrupting device requires (Halmos, 1973). The last instruction of an ISR is always a Return from Interrupt instruction. This causes the processor to reload its saved state (registers and program counter) from the stack and resume execution of the original program.
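The save, vector, and restore sequence described above can be sketched with a toy processor model. The `CPU` class, its four registers, and the addresses used are hypothetical; the point is only the order of the steps.

```python
class CPU:
    def __init__(self, vector_table):
        self.pc = 0
        self.registers = [0, 0, 0, 0]
        self.stack = []
        self.vector_table = vector_table  # interrupt number -> ISR address

    def interrupt(self, number):
        # Save state: push the program counter and registers onto the stack.
        self.stack.append((self.pc, list(self.registers)))
        # Load the interrupt vector (the ISR address) into the program counter.
        self.pc = self.vector_table[number]

    def return_from_interrupt(self):
        # Restore the saved state and resume the interrupted program.
        self.pc, self.registers = self.stack.pop()


cpu = CPU(vector_table={1: 0x0400})
cpu.pc, cpu.registers = 0x0120, [7, 8, 9, 10]
cpu.interrupt(1)
cpu.registers[0] = 99  # the ISR is free to use the registers
in_isr_pc = cpu.pc
cpu.return_from_interrupt()
print(hex(in_isr_pc), hex(cpu.pc), cpu.registers)
```

After the return, the program counter and registers hold their pre-interrupt values, which is why the interrupted program does not notice the detour.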
Interrupts are largely transparent to the original program. This means that the original program is completely “unaware” that the processor was interrupted, apart from a lost interval of time (Feustel, 1973). Processors with shadow registers use these to save their current state, rather than pushing their register bank onto the stack. This saves a considerable number of memory accesses (and therefore time) when processing an interrupt. However, since only one set of shadow registers exists, a processor servicing multiple interrupts must “manually” preserve the state of the registers before servicing another interrupt. If it does not, important state information will be lost. On return from an ISR, the contents of the shadow registers are swapped back into the main register array (Giloi, 1997).
1.9.1. Hardware Interrupts
There are two ways of telling when an I/O device (such as a serial controller or a disk controller) is ready for the next sequence of data to be transferred. The first is busy waiting, or polling, in which the processor repeatedly checks the device’s status register until the device is ready. This is the least efficient use of the processor’s time, but it is also the simplest to implement. For some time-critical applications, polling can reduce the time it takes for the processor to respond to a change of state in a peripheral (Tsafrir, 2007).
A better way is for the device to generate an interrupt to the processor when it is ready for a transfer to take place. Small, simple processors may have only one (or two) interrupt inputs, so several devices may have to share the processor’s interrupt lines. When an interrupt occurs, the processor must check each device to determine which one generated the interrupt. (This can also be considered a form of polling.) The advantage of interrupt polling over ordinary polling is that the polling occurs only when a device needs to be serviced. Polling interrupts is suitable only in systems that have a small number of devices; otherwise, the processor will spend too long trying to determine the source of the interrupt (Feng et al., 2008).
The other technique for servicing an interrupt is to use vectored interrupts, in which the interrupting device provides the interrupt vector that the processor is to take. Vectored interrupts considerably reduce the time it takes the processor to determine the source of the interrupt. If an interrupt request can be generated from more than one source, it is necessary to assign priorities (levels) to the different interrupts (Liu et al., 2010). This can be done in either hardware or software, depending on the particular application. In this scheme, the processor has numerous interrupt lines, each corresponding to a given interrupt vector.
So, for example, when an interrupt of priority 7 occurs (the interrupt lines corresponding to “7” are asserted), the processor loads vector 7 into its program counter and starts executing the service routine for interrupt 7 (Regehr and Duongsaa, 2005). Vectored interrupts can be taken one step further. Some processors and devices support the device actually placing the appropriate vector onto the data bus when it generates an interrupt. This makes the system even more versatile: rather than being limited to one interrupt per peripheral, each device can supply an interrupt vector specific to the event that is causing the interrupt. However, the processor must support this feature, and most do not (Weaver et al., 2013).
Some processors have a feature known as a fast hardware interrupt. With this interrupt, only the program counter is saved. It is assumed that the ISR will protect the contents of the registers by manually saving their state as required. Fast interrupts are useful when an I/O device requires a very fast response from the processor and cannot wait for the processor to save all of its registers to the stack. A special (and separate) interrupt line is used to generate fast interrupts (Wright and Dawson, 1988).
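The priority scheme described in this section, with one vector per interrupt level, can be sketched as a small selection function. The vector addresses in `VECTORS` and the seven-level layout are assumptions made for the example, not values taken from a real processor.

```python
# Hypothetical vector table: priority level -> ISR address (values invented).
VECTORS = {level: 0x0100 + 4 * level for level in range(1, 8)}


def next_interrupt(pending, current_level):
    """Pick the interrupt to service, if any.

    pending       -- set of priority levels whose interrupt lines are asserted
    current_level -- level being serviced now (0 means the normal program)
    Returns (level, vector), or None when nothing outranks the current level.
    """
    eligible = [level for level in pending if level > current_level]
    if not eligible:
        return None  # lower-priority requests wait
    level = max(eligible)
    return level, VECTORS[level]


print(next_interrupt({3, 7}, current_level=0))
print(next_interrupt({3}, current_level=5))
```

Because the level indexes the table directly, the processor does not have to poll each device to find the source.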
1.9.2. Software Interrupts
A software interrupt is generated by an instruction. It is the lowest-priority interrupt and is generally used by programs to request a service from the system software (operating system or firmware). So why are software interrupts used? Why isn’t the appropriate section of code called directly? For that matter, why use an OS to perform tasks for us at all? It comes down to compatibility (Mogul and Ramakrishnan, 1997). Jumping to a subroutine (calling a function) means jumping to a specific address in memory. A future version of the system software may not locate the subroutines at the same addresses as earlier versions. By using a software interrupt, our program does not need to know where the routines lie. It relies on the entry in the vector table to direct it to the correct location (Slye and Elnozahy, 1998).
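The compatibility argument can be sketched as follows: two versions of a toy system-software image place the same routine at different addresses, yet the application reaches it through the same software interrupt number. The interrupt number 0x21, the addresses, and the function names are arbitrary choices for illustration.

```python
def make_system(version):
    """Build a toy system-software image; routine addresses vary by version."""
    base = 0x1000 if version == 1 else 0x2400  # routine moved in version 2
    routines = {base: lambda text: "printed: " + text}
    vector_table = {0x21: base}  # software interrupt number -> routine address
    return routines, vector_table


def software_interrupt(system, number, argument):
    """Look up the routine through the vector table and run it."""
    routines, vector_table = system
    address = vector_table[number]
    return routines[address](argument)


# The application code is identical for both versions of the system software.
for version in (1, 2):
    print(software_interrupt(make_system(version), 0x21, "hello"))
```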
1.10. CISC AND RISC
There are two major approaches to processor architecture: complex instruction set computer (CISC, pronounced “sisk”) processors and reduced instruction set computer (RISC) processors. Classic CISC processors include the Intel x86, Motorola 68xxx, and National Semiconductor 32xxx processors, as well as the Intel Pentium. Common RISC architectures include the Freescale/IBM PowerPC, the MIPS architecture, Sun’s SPARC, the ARM, the Atmel AVR, and the Microchip PIC (Jamil, 1995).
CISC processors have a single processing unit, external memory, a relatively small register set, and many hundreds of different instructions. In many ways, they are just smaller versions of the processing units of mainframe computers from the 1960s. Throughout the late 1970s and early 1980s, the trend in processor design was toward bigger and more complicated instruction sets. Need to input a string of characters from an I/O port (Blem et al., 2013)? With CISC (the 80x86 family), there is a single instruction to do it. In some CISC processors, such as the Motorola 68000, the number of instructions can exceed 1,000. This had the advantage of making assembly-language programming easier, since less code was needed to get the job done. Because memory was slow and expensive, it also made sense to have each instruction do more (Bhandarkar, 1997). This reduced the number of instructions needed to perform a given function, and thereby reduced the memory space and the number of memory accesses required to fetch instructions.
As memory became cheaper and faster, and compilers became more efficient, the relative advantages of the CISC approach began to diminish. One of the main disadvantages of CISC is that the processor itself becomes increasingly complicated as a consequence of supporting such a large and diverse instruction set. The control and instruction decode units are complex and slow, the silicon is large and hard to produce, and the processors consume a lot of power and therefore generate a lot of heat. As processors became more advanced, the overheads that CISC imposed on the silicon became oppressive (Andrews and Sand, 1992). A given processor feature, when considered alone, may increase processor performance but may actually decrease the performance of the total system if it increases the total complexity of the device. Processors can be made simpler and faster by reducing their instruction sets to the most commonly used instructions.
Fewer cycles are required to decode and execute each instruction, and the cycles are shorter (Lozano and Ito, 2016). The drawback is that more (simpler) instructions are required to perform a task, but this is more than made up for by the performance boost to the processor. For example, if both the cycle time and the number of cycles per instruction are reduced by a factor of 4, while the number of instructions required to perform a task grows by 50%, the execution of the processor is sped up by a factor of at least 8 (Blem et al., 2015). This realization led to a rethink of processor design. The result was the RISC architecture, which has led to the development of very high-performance processors. The basic philosophy behind RISC is to move the complexity from the silicon to the language compiler. The hardware is kept as simple and fast as possible (Ko et al., 1998; Ritpurkar et al., 2014).
A given complex instruction can be performed by a sequence of much simpler instructions. For example, many processors have an XOR (exclusive OR) instruction for bit manipulation, and they also have a clear instruction to set a given register to zero. However, a register can also be set to zero by XORing it with itself. Thus, the separate clear instruction is no longer required (Khazam and Mowery, 1994). It can be replaced with the XOR that is already present. Further, many processors are able to clear a memory location directly by writing a zero to it. The same function can be implemented by clearing a register and then storing that register to the memory location. The instruction to load a register with a literal number can be replaced with a clear instruction followed by an add instruction with the literal number as its operand. Thus, six instructions (xor, clear reg, clear memory, load literal, store, and add) can be replaced with just three (xor, store, and add). So, the following CISC assembly pseudocode (Garth, 1991; Vanhaverbeke and Noorderhaven, 2001):
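    clear addr        ; clear the memory location addr
    load r1, #n       ; load register r1 with the literal n
becomes the following RISC pseudocode:
    xor r1, r1        ; clear register r1
    store r1, addr    ; clear the memory location addr
    add r1, #n        ; r1 now holds the literal n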
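The substitution described above (a clear-memory and a load-literal instruction replaced by xor, store, and add) can be checked with a small simulation. This is only a sketch: the register name `r1`, the stale starting values, and the function names are assumptions, and Python assignments stand in for the individual instructions.

```python
def run_cisc(addr, n):
    """Model of: clear addr; load r1, #n"""
    memory = {addr: 0xAB}   # arbitrary stale contents
    r1 = 0xCD               # arbitrary stale contents
    memory[addr] = 0        # clear addr
    r1 = n                  # load r1, #n
    return r1, memory[addr]


def run_risc(addr, n):
    """Model of: xor r1, r1; store r1, addr; add r1, #n"""
    memory = {addr: 0xAB}
    r1 = 0xCD
    r1 = r1 ^ r1            # xor r1, r1    (r1 is now zero)
    memory[addr] = r1       # store r1, addr
    r1 = r1 + n             # add r1, #n
    return r1, memory[addr]


print(run_cisc(0x1000, 5), run_risc(0x1000, 5))
```

Both sequences leave the register holding the literal and the memory location cleared; the second simply takes one more instruction to get there.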
The resulting code is larger, but the reduced complexity of the instruction decode unit can result in faster overall operation. Many similar code optimizations exist to give RISC its simplicity (Krad and Al-Taie, 2007).
RISC processors have a number of distinguishing characteristics. They have large register sets (numbering over 1,000 in some architectures), thereby reducing the number of times the processor must access main memory. Often-used variables can be left inside the processor, reducing the number of accesses to (slow) external memory. Compilers of high-level languages (such as C) take advantage of this to optimize processor performance (El-Aawar, 2006). By having smaller and simpler instruction decode units, RISC processors achieve fast instruction execution, and this also reduces the size and power consumption of the processing unit. Generally, RISC instructions take only one or two cycles to execute (this depends greatly on the particular processor) (Wolfe and Chanin, 1992). This is in contrast to instructions for a CISC processor, which can take many tens of cycles to execute. For example, one instruction (integer multiplication) on an 80486 CISC processor takes 42 cycles to complete. The same instruction on a RISC processor may take just one cycle. Instructions on a RISC processor have a simple format. All instructions are generally the same length (which makes the instruction decode units simpler) (Tokhi and Hossain, 1995).
RISC processors implement what is known as a “load/store” architecture. This means that the only instructions that actually reference memory are load and store. In contrast, many (if not all) instructions on a CISC processor may access or manipulate memory. On a RISC processor, all other instructions (aside from load and store) work on the registers only. This helps RISC processors complete (most of) their instructions in a single cycle. Consequently, RISC processors do not have the range of addressing modes found on CISC processors (Rebaudengo et al., 2000).
RISC processors also often have pipelined instruction execution. This means that while one instruction is being executed, the next instruction in the sequence is being decoded, and the third is being fetched. At any given moment, several instructions are in the pipeline and in the process of being executed. Again, this improves processor performance.
Thus, even though not every instruction can be completed in a single cycle, the processor can issue and retire instructions on every cycle, thereby achieving effective single-cycle execution (Xiang et al., 2021). Some RISC processors have overlapped instruction execution, in which a load operation allows subsequent, unrelated instructions to proceed before the data requested by the load has been returned from memory. These instructions can therefore overlap the load, improving processor performance (El-Aawar, 2008; Bromley, 1997).
Because of their low power consumption and computing power, RISC processors are becoming widely used, particularly in embedded computer systems, and many RISC attributes are appearing in what were once CISC designs (such as the Intel Pentium). Ironically, many RISC architectures are adding CISC-like features, so the distinction between RISC and CISC is blurring (Colwell et al., 1983). An excellent discussion of RISC architectures and processor performance issues can be found in Kevin Dowd and Charles Severance’s High Performance Computing (O’Reilly) (Ditzel, 1991).
So, which is better for embedded and industrial applications, RISC or CISC? If power consumption needs to be low, RISC is probably the better architecture to use. However, if the space available for program storage is small, a CISC processor may be the better alternative, since CISC instructions get more “bang” for the byte (Dandamudi, 2005).
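The speed-up example given earlier in this section (cycle time and cycles per instruction each reduced by a factor of 4, instruction count up by 50%) can be checked with the usual relation: execution time = instruction count × cycles per instruction × cycle time. The baseline numbers below are arbitrary assumptions; only the ratios matter.

```python
def execution_time(instructions, cycles_per_instruction, cycle_time):
    """Total time = instruction count x cycles per instruction x cycle time."""
    return instructions * cycles_per_instruction * cycle_time


# Baseline design (arbitrary values).
before = execution_time(instructions=1000, cycles_per_instruction=8, cycle_time=4.0)

# Simplified design: 50% more instructions, a quarter of the cycles per
# instruction, and a quarter of the cycle time.
after = execution_time(instructions=1500, cycles_per_instruction=2, cycle_time=1.0)

speedup = before / after
print(round(speedup, 2))
```

The ratio works out to 16/1.5, a little under 11, which is consistent with the stated factor of at least 8.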
1.10.1. Digital Signal Processors (DSPs)
A special type of processor architecture is that of the digital signal processor (DSP). These processors have instruction sets and architectures optimized for numerical processing of array data. They often extend the Harvard architecture concept further by not only having separate data and code spaces but also by splitting the data space into two or more banks. This allows concurrent instruction fetch and data accesses for multiple operands. As a result, DSPs can achieve very high throughput and can outperform both CISC and RISC processors in certain applications (Tokhi and Hossain, 1995).
DSPs have special hardware well suited to numerical processing of arrays. They often have hardware looping, whereby special registers allow for and control the repeated execution of an instruction sequence (Weiss and Fettweis, 1996). This is often known as zero-overhead looping, since the software does not need to test any conditions explicitly as part of the looping process. DSPs also often have dedicated hardware for speeding up arithmetic operations. High-speed multipliers, multiply-and-accumulate (MAC) units, and barrel shifters are common features. DSP processors are commonly used in embedded applications, and many conventional embedded microcontrollers include some DSP functionality (Martin and Owen, 1998).
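The multiply-and-accumulate operation mentioned above is the inner step of many array computations. The sketch below applies it in a finite impulse response (FIR) filter, chosen here as a common DSP workload; the function names and the two-tap coefficients are assumptions. A DSP would perform each loop iteration in dedicated hardware, whereas this Python loop only models the arithmetic.

```python
def multiply_accumulate(samples, coefficients):
    """Sum of products, computed one multiply-and-accumulate step at a time."""
    accumulator = 0
    for sample, coefficient in zip(samples, coefficients):
        accumulator += sample * coefficient  # one MAC operation
    return accumulator


def fir_filter(signal, coefficients):
    """Apply a FIR filter to a list of samples."""
    taps = len(coefficients)
    history = [0] * taps  # most recent sample first
    output = []
    for sample in signal:
        history = [sample] + history[:-1]
        output.append(multiply_accumulate(history, coefficients))
    return output


print(fir_filter([1, 2, 3, 4], [0.5, 0.5]))  # two-tap moving average
```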
1.10.2. Memory
Memory is used to hold data and software for the processor. There are many types of memory, and a single system often uses a mix of them. Some memory will retain its contents while there is no power, yet will be slow to access. Other memory devices will be high-capacity, yet will require additional support circuitry and will be slower to access. Still other memory devices will trade capacity for speed, giving relatively small devices that are nonetheless capable of keeping up with the fastest processors (Ko et al., 1998). Memory chips can be organized in one of two ways: word-organized or bit-organized. In the word-organized scheme, complete nybbles, bytes, or words are stored within a single component, whereas in the bit-organized scheme, each bit of a byte or word is allocated to a separate component (Figure 1.8) (Tolley, 1991).
Figure 1.8. Eight bit-organized 8×1 devices and one word-organized 8×8 device.
Memory chips come in different sizes, with the width specified as part of the size description. A DRAM (dynamic RAM) chip, for example, might be described as 4M×1 (bit-organized), whereas an SRAM (static RAM) chip might be 512K×8 (word-organized). In both cases, each chip has exactly the same storage capacity, but the chips are organized differently (Andrews and Sand, 1992). In the DRAM case, it would take eight chips to complete a memory block for an 8-bit data bus, whereas the SRAM would require only one chip. Because the DRAMs are organized in parallel, however, they are all accessed simultaneously. The final size of the DRAM block is (4M×1)×8 devices, which is 32 Mbit. It is common practice to place multiple DRAMs on a memory module. This is how DRAMs are usually installed in standard PCs. The most common widths for memory chips are x1, x4, and x8, although x16 devices are also available. A 32-bit-wide bus can be implemented with thirty-two x1 devices, eight x4 devices, or four x8 devices (Dandamudi, 2005).
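The capacity and chip-count arithmetic above can be sketched with two small helpers. The description format accepted by `parse_organization` (for example "4Mx1") and the function names are assumptions made for this example.

```python
def parse_organization(description):
    """Turn a string such as '4Mx1' or '512Kx8' into (locations, width_bits)."""
    depth, width = description.upper().split("X")
    multipliers = {"K": 2 ** 10, "M": 2 ** 20}
    locations = int(depth[:-1]) * multipliers[depth[-1]]
    return locations, int(width)


def chip_capacity_bits(description):
    """Total storage of one chip, in bits."""
    locations, width = parse_organization(description)
    return locations * width


def chips_for_bus(description, bus_width):
    """Number of identical chips placed in parallel to fill the data bus."""
    _, width = parse_organization(description)
    return bus_width // width


print(chip_capacity_bits("4Mx1") == chip_capacity_bits("512Kx8"))
print(chips_for_bus("4Mx1", 8), chips_for_bus("512Kx8", 8))
```

Multiplying the chip count by the per-chip capacity gives the size of the resulting memory block.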
1.10.3. RAM
RAM stands for random access memory. This is a bit of a misnomer, since most (if not all) computer memory may be considered “random access.” RAM is the “working memory” of the computer system. It is where the processor may easily write data for temporary storage. RAM is generally volatile, meaning that it loses its contents when the system loses power (Bromley, 1997). Any information stored in RAM that must be retained has to be written to some form of permanent storage before the system powers down. There are special nonvolatile RAMs that integrate a battery-backup system, so that the RAM remains powered even when the rest of the computer system has shut down. RAMs generally fall into two categories: static RAM (SRAM) and dynamic RAM (DRAM) (Ko et al., 1998).
SRAMs use pairs of logic gates to hold each bit of data. SRAMs are the fastest form of RAM available, require little external support circuitry, and have relatively low power consumption. Their drawbacks are that their capacity is considerably lower than that of DRAM and that they are much more expensive. Their relatively low capacity means that more chips are needed to implement the same amount of memory. A modern PC built using nothing but SRAM would be a considerably bigger machine and would cost a small fortune to produce. (It would, however, be very fast.) (Suryakant et al., 2018).
DRAM uses arrays of what are essentially capacitors to hold individual bits of data. The capacitor arrays hold their charge only for a short period before it begins to diminish. As a result, DRAMs need continuous refreshing, every few milliseconds or so. This perpetual need for refreshing requires additional support and can delay processor access to the memory. If a processor access conflicts with the need to refresh the array, the refresh cycle must take precedence (Patterson, 2018).
DRAMs are the highest-capacity memory devices available and come in a wide variety of subspecies. Interfacing DRAMs to small microcontrollers is generally not possible, and certainly not practical. Most processors with large address spaces include support for DRAMs. Connecting DRAMs to such processors is simply a case of “connecting the dots” (or pins, as the case may be). For processors that do not include DRAM support, special DRAM controller chips are available that make interfacing the DRAMs very simple (Blem et al., 2013).
Many processors have instruction and/or data caches, which store recent memory accesses. These caches are often (but not always) internal to the processor and are implemented with fast memory cells and high-speed data paths. Instruction execution normally runs out of the instruction cache, which provides fast execution. Should a cache miss occur, the processor can rapidly reload the caches from main memory. Some processors have logic that can anticipate a cache miss and begin reloading the cache before the miss occurs. Caches are implemented using very fast SRAM and are often used in large systems to compensate for the slowness of DRAM (Karshmer et al., 1990).
1.10.4. ROM
Read-only memory (ROM) is a type of memory that can only be read. The name is somewhat misleading, because many (modern) ROMs can also be written. ROMs are nonvolatile memory: they do not need power to retain their contents. They are generally slower than RAM, and much slower than fast static RAM (Stefan and Stefan, 1999).
The main function of ROM in a computer system is to store the code (and occasionally data) that must be available when the computer is powered on. This software is known as firmware; it starts the computer by placing the I/O devices in a known state. It may include a bootloader for loading an OS from disk or the network or, in the case of an embedded device, it may include the application itself (Bromley, 1997). Many microcontrollers have on-chip ROM, which reduces component count and simplifies system design.
A typical ROM is built (in a simplified sense) from a large array of diodes. The unprogrammed state of a ROM is all 1s, so every byte location reads 0xFF. Loading software into a ROM is known as burning the ROM (Malaiya and Feng, 1988). The term comes from the fact that programming passes a large enough current through the appropriate diodes to “burn” or “blow” them, producing a zero at that bit position. This can be done with a ROM burner or, if the system supports it, the ROM can be programmed in-circuit. This is known as in-system programming (ISP) or in-circuit programming (ICP).
One-time programmable (OTP) ROMs, as the name suggests, can be burned only once. Computer manufacturers typically use them in systems where the firmware is stable and the product ships in bulk to customers. Mask-programmable ROMs are also one-time programmable but, unlike OTPs, they are burned by the chipmaker before shipping. Like OTPs, they are used once the software is stable, and they have the benefit of lowering production costs for large shipments (Jex, 1991).
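Because burning can only turn a 1 into a 0, programming a ROM byte behaves like a bitwise AND with the value being written. The following sketch models that behavior; the `burn` helper and the four-byte device are hypothetical and stand in for no particular part.

```python
ERASED = 0xFF  # every bit of an unprogrammed ROM byte reads as 1


def burn(rom, address, value):
    """Program one byte. Bits can go from 1 to 0, but never back to 1."""
    rom[address] &= value
    return rom[address]


rom = bytearray([ERASED] * 4)   # a blank 4-byte OTP ROM (hypothetical size)
burn(rom, 0, 0x3C)              # first pass: the byte now holds 0x3C
burn(rom, 0, 0xC3)              # second pass cannot restore the blown bits
print(hex(rom[0]), hex(rom[1]))
```

The second call leaves address 0 at 0x00 instead of 0xC3, which is why a one-time programmable part cannot be reused once it holds the wrong code.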
1.10.5. EPROM
While OTP ROMs are ideal for shipping in finished products, they are wasteful for debugging, since every iteration of code change requires a new chip to be burned and the old one discarded. OTPs are therefore a very expensive development option. Nobody in their right mind uses OTPs for development and testing (Wolfe and Chanin, 1992). The erasable programmable read-only memory, or EPROM, is a (slightly) better choice for system development and debugging. Shining ultraviolet light through a small window on top of the chip erases the EPROM, allowing it to be reprogrammed and reused. EPROMs are pin- and signal-compatible with comparable OTP and mask parts. Thus, an EPROM can be used during development while OTPs are used in production, with no change to the rest of the system (Burgelman and Grove, 1996).
EPROMs and their OTP counterparts range in capacity from a few kilobytes (very rare nowadays) to a megabyte or more. The drawback of EPROM technology is that the chip must be removed from the circuit to be erased, and erasure can take several minutes. The chip is then placed in the burner, programmed, and reinserted into the circuit (Ganesan et al., 2003). This can make for very slow debugging cycles. It also makes the device useless for storing changeable system parameters. EPROMs are becoming rare these days. While they are still available, flash-based memory (discussed shortly) is far more common and is the medium of choice (Elahi and Arjeski, 2015).
1.10.6. EEROM
Electrically erasable read-only memory (EEROM) is also known as EEPROM (electrically erasable programmable read-only memory). On rare occasions it is also called electrically alterable read-only memory (EAROM). EEROM may be pronounced “e-e ROM,” “e-squared ROM,” or simply “e-squared” for short (Bannatyne, 1999).
EEROMs can be erased and reprogrammed in-circuit. Because their capacity is substantially smaller than that of ordinary ROM (typically only a few kilobytes), they are unsuitable for storing firmware. Instead, they are typically used to hold system parameters and mode information that must be retained across power-off (Zhang, 2011). Many microcontrollers have a small EEROM on-chip for storing system settings. This is particularly useful in embedded systems, where it may hold network addresses, configuration settings, serial numbers, and maintenance data, among other things (Gunes et al., 2000).
1.10.7. Flash
Flash is the most recent ROM technology, and it is now widely used. Flash memory combines the reprogrammability of EEROM with the large capacity of ordinary ROMs. Flash chips are sometimes called “flash ROMs” and sometimes “flash RAMs.” Because they are not like regular ROMs or RAMs, I prefer to refer to them simply as “flash” to avoid confusion (Fujishima et al., 1992). Flash is usually organized as sectors, which has the benefit that individual sectors can be erased and rewritten without affecting the rest of the device’s contents. A sector must usually be erased before it can be written; it cannot simply be overwritten as RAM can. There are several different flash technologies, and the requirements for erasing and programming flash devices vary from one manufacturer to the next (Chiang and Chang, 1999).
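The erase-before-write rule can be sketched with a toy model. It assumes, for illustration only, that erased cells read 0xFF and that programming can only clear bits; the `Flash` class, its four-byte sectors, and the error message are hypothetical, and real devices differ by manufacturer as noted above.

```python
class Flash:
    """Toy flash device: erased cells read 0xFF; programming only clears bits."""

    SECTOR_SIZE = 4  # bytes per sector (hypothetical; real sectors are far larger)

    def __init__(self, num_sectors):
        self.cells = bytearray([0xFF] * (num_sectors * self.SECTOR_SIZE))

    def erase_sector(self, sector):
        # Erasure works on a whole sector and leaves other sectors untouched.
        start = sector * self.SECTOR_SIZE
        for address in range(start, start + self.SECTOR_SIZE):
            self.cells[address] = 0xFF

    def program(self, address, value):
        # Reject any write that would need a 0 bit to become 1 again.
        if (self.cells[address] & value) != value:
            raise ValueError("erase the sector before writing this value")
        self.cells[address] = value


flash = Flash(num_sectors=2)
flash.program(0, 0x12)          # lands in sector 0
flash.program(4, 0x34)          # lands in sector 1
try:
    flash.program(0, 0x56)      # needs bits that are already 0
    overwrite_ok = True
except ValueError:
    overwrite_ok = False
flash.erase_sector(0)           # wipes sector 0 only
flash.program(0, 0x56)
print(overwrite_ok, hex(flash.cells[0]), hex(flash.cells[4]))
```

The direct overwrite fails, the erase of sector 0 makes the write possible, and the byte in sector 1 survives throughout.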
1.11. INPUT/OUTPUT
Devices other than memory may also reside in the processor’s address space. The CPU uses these input/output devices (I/O devices, also called peripherals) to communicate with the outside world. Examples include serial controllers, which communicate with keyboards, mice, modems, and other peripherals; parallel I/O devices, which control external subsystems; and disk-drive controllers, video and audio controllers, and network interfaces (Ritpurkar et al., 2014). There are three basic methods for exchanging data with the outside world, which are discussed in the following subsections.
1.11.1. Programmed I/O
The processor receives or transmits data at times that are convenient for it (the processor) (Yano et al., 1994).
1.11.2. Interrupt-Driven I/O
External events force the processor to suspend the current program until the external event has been handled. When an external device interrupts the processor (by asserting an interrupt control line), the CPU suspends the current task (program) and begins executing an interrupt service routine. Servicing an interrupt may involve transferring data from input to memory or from memory to output (Güven et al., 2017).
1.11.3. Direct Memory Access (DMA)
Direct memory access (DMA) allows data to be transferred directly between I/O devices and memory without passing through the CPU. DMA is used in high-speed systems where the data transfer rate is critical. Not all CPUs support DMA (Kim et al., 2002).
1.12. DMA
DMA is a way of speeding up data transfers between two regions of memory, or between memory and an I/O device. Suppose you wish to read 100 megabytes from disk and store them in memory. You have two options. The first is for the CPU to read one byte at a time from the disk controller into a register and then store the contents of the register at the appropriate memory address. For every byte transferred, the CPU must fetch an instruction, decode it, read the data, fetch the next instruction, decode it, and then store the data. The process then repeats for the next byte (Chartoff et al., 2009).
DMA is the second way to move large volumes of data around the system. A DMA controller (DMAC) is a special device that handles high-speed transfers between memory and I/O devices. By creating a channel between the I/O device and memory, DMA bypasses the CPU. Data is read from the I/O device and written to memory without the CPU having to perform byte-by-byte (or word-by-word) transfers (Stewin and Bystrov, 2012).
To perform a DMA transfer, the DMAC must have access to the address and data buses. The system designer has several options for implementing this. The most common (and perhaps simplest) is for the CPU to be halted and to “release” its buses (the buses are tri-stated). This allows the DMAC to “take control” of the buses for the short time needed to complete the transfer. DMA-capable processors often have a special control signal that allows a DMAC (or another CPU) to request the buses (Zahler et al., 1991). DMA transfers fall into four categories, which are discussed in the following subsections.
1.12.1. Standard Block Transfer
A standard block transfer is accomplished by a sequence of memory transfers performed by the DMAC. Each transfer consists of a load from a source address followed by a store to a destination address. Standard block transfers are initiated by software and are used to move data structures from one region of memory to another (Aggarwal et al., 1987).
1.12.2. Demand-Mode Transfers
Unlike a standard block transfer, a demand-mode transfer is controlled by an external device. Demand-mode transfers are used to move data from memory to I/O or vice versa. The I/O device requests and synchronizes the data transfer (Antohe and Wallace, 2002).
1.12.3. Fly-by Transfer
Fly-by transfers provide fast data movement through the system. Unlike normal DMA transfers, which require multiple bus accesses, a fly-by transfer moves data from source to destination in a single access. The data is not stored in the DMAC before being sent to its destination. During a fly-by transfer, memory and I/O are given different bus control signals. For instance, a read request is issued to an I/O device at the same time as a write request to memory. Data moves directly from the I/O device to the memory device (Lung and Wolfner, 1999).
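A rough count of bus accesses shows why these approaches differ in speed. The figures below are a simplified model built from the descriptions above and are assumptions, not measurements: four accesses per byte for a CPU copy (two instruction fetches, one data read, one data write), two for a standard DMA transfer (load, then store), and one for a fly-by transfer. The function name and table are hypothetical.

```python
# Assumed bus accesses needed to move one byte, per the simplified model above.
ACCESSES_PER_BYTE = {
    "cpu": 4,     # fetch load instruction, read data, fetch store instruction, write data
    "dma": 2,     # DMAC loads from the source, then stores to the destination
    "fly-by": 1,  # source read and destination write happen in a single access
}


def bus_accesses(num_bytes, method):
    """Total bus accesses for a transfer of num_bytes using the given method."""
    return num_bytes * ACCESSES_PER_BYTE[method]


totals = {method: bus_accesses(100, method) for method in ACCESSES_PER_BYTE}
print(totals)
```

Real processors complicate this picture with caches, burst transfers, and wider buses, so the ratios should be read as an illustration of the trend only.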
1.12.4. Data-Chaining Transfers
Data-chaining transfers allow DMA transfers to be performed as specified by a linked list in memory. Data chaining starts with a pointer to a descriptor in memory. The descriptor is a table that specifies the byte count, the destination addresses, and a pointer to the next descriptor. The DMAC loads the transfer parameters from this table and begins moving data. The transfer continues until the number of bytes moved equals the byte-count entry. When a descriptor is complete, the pointer to the next descriptor is loaded. This process repeats until a null pointer is encountered (Raghunathan et al., 1999).
To illustrate the use of DMA, consider a fly-by transfer of data from a hard-disk controller to RAM. At the start of a DMA transfer, the CPU configures the DMAC for the transfer. This configuration specifies the source, destination, size, and other parameters of the transfer. The disk controller then issues a service request to the DMAC (not to the processor). The DMAC in turn asserts HOLD or BR (bus request) to the CPU (Spreen, 1976). The processor completes its current instruction; places the address, control, and data buses in a high-impedance state (floats, tri-states, or releases them); responds to the DMAC with HOLD-acknowledge or BG (bus grant); and enters a dormant state. When the DMAC receives the HOLD-acknowledge, it places the address of the memory location where the transfer is to begin on the address bus and generates a WRITE to memory, while the disk controller places the data on the data bus. The transfer is thus made directly from the disk controller to memory (Sebbel, 1976).
Transfers from memory to I/O devices are possible in the same way. The DMAC can transfer data in blocks. As the I/O device produces (or consumes) data, the DMAC automatically increments the address on the address bus to point to the next memory location. When the transfer is complete, the buses are returned to the CPU, which resumes normal operation (Huynh et al., 1993). Not every DMAC supports every kind of DMA.
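The data-chaining scheme described earlier in this subsection amounts to walking a linked list of descriptors. The sketch below models it in software; the `Descriptor` fields and the `run_chain` function are hypothetical names, the `source` field is an assumption added so that the example is self-contained, and a `bytearray` stands in for system memory.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Descriptor:
    source: int                           # assumed field: where the data comes from
    destination: int                      # where the data is written
    byte_count: int                       # number of bytes to move
    next: Optional["Descriptor"] = None   # None plays the role of the null pointer


def run_chain(memory, first):
    """Follow the descriptor chain, copying bytes until a null pointer is met."""
    moved = 0
    descriptor = first
    while descriptor is not None:
        for offset in range(descriptor.byte_count):
            memory[descriptor.destination + offset] = memory[descriptor.source + offset]
        moved += descriptor.byte_count
        descriptor = descriptor.next      # load the pointer to the next descriptor
    return moved


memory = bytearray(b"ABCDEFGH" + bytes(8))          # 8 data bytes, 8 empty bytes
second = Descriptor(source=4, destination=12, byte_count=4)
first = Descriptor(source=0, destination=8, byte_count=4, next=second)
moved = run_chain(memory, first)
print(moved, bytes(memory[8:16]))
```

The CPU’s only job in this model is to build the list and hand over the first descriptor; everything after that is driven by the chain itself.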
Some DMACs read data from a source, store it locally, and then write it to a destination. They carry out the transfer just as a processor would. The advantage of using a DMAC rather than the CPU is that each transfer performed by the processor would also require instruction fetches. Thus, although the transfer still takes place sequentially, the DMAC does not need to fetch and execute instructions, which makes the transfer faster than one performed by a CPU (Lauderdale and Khan, 2012).
Small microcontrollers usually lack DMA support. Some midrange (16-bit, low-end 32-bit) CPUs may support DMA. All high-end processors (32-bit and above) support DMA, and a few include an on-chip DMAC. Likewise, peripherals intended for small-scale computers do not support DMA, whereas peripherals intended for high-speed, powerful systems do (Tibbals, 1976).
1.13. PARALLEL AND DISTRIBUTED COMPUTERS
Some embedded applications demand more processing power than a single processor can provide. It may not be financially feasible to build a design around the latest superscalar RISC processor, or the application may lend itself naturally to distributed computing, where tasks are spread among several communicating computers. It can be more cost-effective to deploy a fleet of lower-cost units across the installation. The use of parallel processors in embedded systems has become increasingly common (Kalaiselvi and Rajaraman, 2000).
1.13.1. Introduction to Parallel Architectures
The traditional computer architecture is the serial Von Neumann architecture. Computers of this kind typically have a single sequential CPU. The main limitation of this architecture is that a conventional processor can execute only one instruction at a time. As a result, algorithms that run on such computers must be expressed as serial problems. A given task must be broken into a series of sequential steps, each of which is executed in order, one at a time (Parhami, 2006).
Many computationally hard problems are also highly parallel. These problems are characterized by an algorithm that is applied to a large data set. Usually, the computation for each element of the data set is identical and depends only loosely on the results of computations on neighboring data. Performance gains can therefore be achieved by carrying out the computations for every element of the data set simultaneously, instead of traversing the data set and computing each result serially. In such applications, machines with a large number of processors working concurrently on a data set often outperform conventional computers by a wide margin (Leighton, 2014).
The grain of a computer is defined by the number of processing elements within the machine. A coarse-grained computer has a small number of processors, while a fine-grained machine contains thousands of processing elements. Generally, the processing elements of a fine-grained machine are much less powerful than those of a coarse-grained computer. The processing power is obtained by the brute-force approach of using a large number of processing elements. Parallel machines come in a variety of architectures. Each architecture has its advantages and disadvantages, and each has its proponents (Barney, 2010).
1.13.2. SIMD Computers
Single-instruction multiple-data (SIMD) computers are highly parallel machines with large arrays of simple processing elements. Each processing element in a SIMD machine has a small amount of local memory. The instructions executed by the SIMD computer are broadcast from a central instruction server to every processing element in the machine. As a result, every processing element executes the same instruction as all the others. Because each processing element executes the instruction on its own local data, all elements of the data structure are operated on concurrently (Nassimi and Sahni, 1981).
A SIMD machine is usually used in conjunction with a conventional computer. The Connection Machine (CM-1) by Thinking Machines Corporation was an example of this; its “host” computer was a VAX minicomputer or a Silicon Graphics or Sun workstation. The CM-1 was a fine-grained SIMD computer with up to 64 K processing elements that appeared to the host machine as a block of 64 K “intelligent memory.” A host program loaded a data set into the processing array of the CM-1, with each processing element acting as a single memory location. The host then broadcast instructions to all of the CM-1’s processing elements simultaneously. Once the computations were complete, the host read the results back from the CM-1 as if it were ordinary memory (Maresca and Li, 1989).
The major advantage of a SIMD machine is that it is built from simple and inexpensive processing elements. Considerable computing power can therefore be obtained using low-cost, off-the-shelf components. Furthermore, because every processing element executes the same instruction and so shares a common instruction fetch, the architecture of the machine is simplified. Only one instruction store is needed for the entire machine (Park, 2004).
The fundamental drawback of SIMD is also its use of many processing elements, each executing the same instruction in lockstep. Many problems cannot be decomposed into a form that a SIMD computer can execute. Furthermore, the data sets associated with a given problem may not match a particular SIMD architecture well. A SIMD machine with 10 k processing elements, for instance, is poorly matched to a data set of 12 k data items (Bjørstad et al., 1992).
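The mismatch in the 10 k and 12 k example can be quantified. The sketch below is a software model only: `simd_apply` mimics the broadcast of one instruction to a fixed number of processing elements, and `simd_passes` computes how many passes a data set needs and what fraction of the processing elements does useful work. Both function names are hypothetical, and “k” is taken as 1,000 for the arithmetic (the resulting ratio is the same if it means 1,024).

```python
import math


def simd_passes(num_items, num_pes):
    """Passes needed to cover the data set, and the resulting utilization."""
    passes = math.ceil(num_items / num_pes)
    utilization = num_items / (passes * num_pes)
    return passes, utilization


def simd_apply(instruction, data, num_pes):
    """Apply one instruction to the data, num_pes elements per pass."""
    results = []
    for start in range(0, len(data), num_pes):
        chunk = data[start:start + num_pes]               # one item per processing element
        results.extend(instruction(item) for item in chunk)  # same instruction everywhere
    return results


passes, utilization = simd_passes(12_000, 10_000)
doubled = simd_apply(lambda x: 2 * x, [1, 2, 3, 4, 5], num_pes=4)
print(passes, utilization, doubled)
```

Under these assumptions the 12 k data set needs two passes, and in the second pass only 2 k of the 10 k processing elements have data to work on, so overall utilization falls to 60 percent.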
1.13.3. MIMD Computers
The multiple-instruction multiple-data (MIMD) computer is the other main type of parallel machine. These machines are usually coarse-grained collections of semi-autonomous processors, each with its own local memory and program. An algorithm running on a MIMD computer is usually divided into several subtasks, each of which runs on a different processor of the MIMD machine. By loading the same program into every processor of the MIMD machine, it can be made to behave as a SIMD computer. A MIMD computer has much coarser granularity than a SIMD machine. MIMD computers typically use a small number of very powerful processors rather than a large number of less powerful ones (Bozkus et al., 1994).
MIMD computers fall into two categories: shared-memory MIMD computers and message-passing MIMD computers. Shared-memory MIMD systems consist of a number of high-speed processors, each with local memory or cache and each with access to a large, global memory (Figure 1.9). The data and programs to be executed by the machine are held in the global memory. A table of programs (or subprograms) awaiting execution is also held in this memory (Hatcher and Quinn, 1991). Each CPU loads a task and its associated data into its local memory or cache and runs semi-independently of the other processors in the system. Interprocess communication also takes place through the global memory (Radcliffe, 1990).
Figure 1.9. Shared-memory MIMD.
Sharing the workload among several powerful processors brings performance benefits. However, circuitry within the system must arbitrate between processors for access to the global memory and the associated shared buses. Furthermore, allowance must be made for a CPU attempting to access stale data in global memory. If processor A loads a process and its data structure into its local memory and then modifies that data structure, processor B must be notified that a more recent version of the data structure exists when it tries to access the same data structure in global memory. Such arbitration is handled by processors like the (now defunct) Motorola MC88110, which was designed for and used in shared-memory MIMD machines (Lord et al., 1983).
The message-passing MIMD computer is an alternative MIMD design (Figure 1.10). In this system, each CPU has its own local memory. The machine has no global memory. Each processing node (CPU with local memory) either loads, or has loaded into it, the programs (and associated data) that it is to execute. Each process runs independently on its own CPU, and interprocess communication is accomplished by passing messages over a shared medium. The processors may communicate over a single shared bus (like Ethernet, CAN, or SCSI) or over a more elaborate interprocessor connection architecture, such as 2-D arrays, N-dimensional hypercubes, rings, stars, trees, or fully interconnected systems (Offutt et al., 1992).
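The message-passing arrangement described above can be mimicked in software. In the sketch below a thread stands in for a processing node and two queues stand in for the shared medium; this is an analogy under those assumptions, not a model of any particular machine, and the `worker` function and the messages are hypothetical.

```python
import queue
import threading


def worker(inbox, outbox):
    """A processing node: it shares no data and learns of work only via messages."""
    while True:
        message = inbox.get()
        if message is None:          # sentinel message: no more work
            break
        outbox.put(sum(message))     # reply with a result message


inbox, outbox = queue.Queue(), queue.Queue()
node = threading.Thread(target=worker, args=(inbox, outbox))
node.start()
for block in ([1, 2, 3], [10, 20]):
    inbox.put(block)                 # send a block of data to the node
inbox.put(None)
node.join()
results = [outbox.get() for _ in range(2)]
print(results)
```

Because the node communicates only through its queues, there is no shared data structure that could become stale, which is the property that distinguishes this design from the shared-memory one.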
Figure 1.10. Message-passing MIMD.
These machines do not suffer from the bus-contention problems of shared-memory machines. However, the most appropriate and efficient way of interconnecting the processing nodes of a message-passing MIMD machine remains a significant area of research. Every architecture has advantages and disadvantages, and which is best for a given application depends to some extent on what that application is. Problems that require only a small amount of interprocess communication can run efficiently on a machine with little interconnectivity, whereas other applications may clog the communication network with message traffic. If part of a processing node’s time is spent relaying messages for its neighbors, a machine with a high level of interprocess communication but a low level of connectivity may spend most of its time passing messages, with little time left for actual computation (Hord, 2018).
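The cost side of this trade-off can be made concrete by counting links for some of the interconnection schemes named earlier. The formulas are standard graph results: a fully interconnected system of n nodes needs n(n - 1)/2 links, a d-dimensional hypercube has 2^d nodes and d * 2^d / 2 links, and a ring of n nodes has n links. The function names and the choice of 64 nodes are illustrative only.

```python
def fully_connected_links(nodes):
    """Every node has a direct link to every other node."""
    return nodes * (nodes - 1) // 2


def hypercube_links(dimension):
    """A hypercube of the given dimension has 2**dimension nodes."""
    nodes = 2 ** dimension
    return nodes * dimension // 2    # each node has `dimension` links, each shared by two nodes


def ring_links(nodes):
    """Each node links to its two neighbors (valid for three or more nodes)."""
    return nodes


link_counts = {
    "fully connected": fully_connected_links(64),
    "hypercube": hypercube_links(6),     # 2**6 = 64 nodes
    "ring": ring_links(64),
}
print(link_counts)
```

For 64 nodes, full interconnection needs 63 links per node, a hypercube needs 6, and a ring needs 2, which illustrates why less densely connected schemes are attractive despite the extra message relaying they imply.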
The ideal interconnection architecture is the fully connected system, in which every processing node has a direct communication link to every other processing node. However, owing to the cost and complexity of such a high degree of interconnection, this is not always feasible. One solution is to give each processing node a limited number of connections, on the assumption that a node will not need, or be able, to communicate with every other node in the machine at the same time. These limited connections from each processing node can then be joined through a crossbar switch, providing full machine connectivity with a limited number of links per node (Hiranandani et al., 1992).
A distributed machine is made up of separate computers networked together to form a loosely coupled MIMD parallel machine. Beowulf clusters and projects such as SETI@Home are examples of distributed machines. Distributed machines are common in the embedded world. A group of small processing nodes spread throughout a facility may provide local control and monitoring, together forming a parallel machine that executes the overall control algorithm. The avionics of military and commercial aircraft are also distributed parallel computers (de Cougny et al., 1996). Let us now take a closer look at applications and how they relate to design.
1.14. EMBEDDED COMPUTER ARCHITECTURE
What a computer is used for, what tasks it must perform, and how it interacts with people and with other systems determine its functionality and hence its architecture, memory, and I/O (Ramakrishna and Michael, 2001). Figure 1.11 depicts a generic desktop computer (not necessarily a PC). It has a large main memory to hold the operating system, applications, and data, and an interface to mass storage devices (disks and DVD/CD-ROM drives). It has a variety of input/output (I/O) devices for user input (keyboard, mouse, and audio), user output (display and audio), and connectivity (networking and peripherals). The fast processor requires a system manager to monitor its core temperature and supply voltages and to generate a system reset (Platzner et al., 2000).
Figure 1.11. Block diagram of a generic computer.
Large-scale embedded computers may take a similar form. They may, for instance, act as a wireless router or access point, requiring one or more network interfaces, a large amount of memory, and a fast processor. They may also need a user interface (UI) as part of their embedded application and, in many respects, may be nothing more than a conventional computer dedicated to a single task. As a result, many high-performance embedded systems are not that different from a conventional desktop workstation in terms of hardware (McLoughlin, 2011).
Smaller embedded systems use microcontrollers as their processor, with the advantage that most of the computer’s functionality is combined on a single chip. Figure 1.12 depicts an arbitrary embedded system based on a generic microcontroller. A microcontroller contains a CPU, a small amount of internal memory (ROM and/or RAM), and some form of I/O, all implemented as subsystem blocks within the microcontroller. These subsystems give the processor additional functionality and are common to many processors. In the following chapters, we will go through the subsystems found in most microcontrollers (Wilson, 2001).
Figure 1.12. Block diagram of an embedded computer.
For the time being, let us take a brief tour and look at the uses to which they might be put. Digital I/O, also known as general-purpose I/O or GPIO, is the most common form of I/O. These are pins that software can configure, on a pin-by-pin basis, as either a digital input or a digital output. As digital inputs, they can be used to read the state of switches or push buttons, or to read the digital status of another device. As outputs, they can be used to switch external devices on or off or to convey status to them (Kisacanin et al., 2008). A digital output might, for instance, be used to activate the control circuit for a motor, switch a light on or off, or trigger another device such as a water valve for a garden irrigation system. Digital inputs and outputs can also be used together to synthesize an interface and protocol to another chip. Many microcontrollers have subsystems other than digital I/O, but these can generally be converted to general-purpose digital I/O if the subsystem’s functionality is not needed. As a design engineer, this gives you a great deal of flexibility in how you use your microcontroller in your application (Schlessman and Wolf, 2015).
Many microcontrollers also include analog inputs, allowing sensors to be sampled for monitoring or recording. An embedded computer may, for example, measure light levels, temperature, vibration or acceleration, air or water pressure, humidity, or magnetic field. Alternatively, the analog inputs can be used to monitor simple voltages, perhaps to ensure the reliable operation of a larger system (Schoeberl, 2009).
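The pin-by-pin configuration of digital I/O described above is usually done through bit masks on a direction register. The sketch below models one port in software; the `GpioPort` class, its attribute names, and the convention that a set direction bit means “output” are assumptions for illustration, since real microcontrollers each define their own registers.

```python
class GpioPort:
    """Toy model of one GPIO port with per-pin direction control."""

    def __init__(self):
        self.direction = 0   # bit set = output, bit clear = input (assumed convention)
        self.output = 0      # levels driven by software on output pins
        self.external = 0    # levels applied by outside devices to input pins

    def configure(self, pin, as_output):
        if as_output:
            self.direction |= 1 << pin
        else:
            self.direction &= ~(1 << pin)

    def write(self, pin, level):
        if not self.direction & (1 << pin):
            raise ValueError("pin is configured as an input")
        if level:
            self.output |= 1 << pin
        else:
            self.output &= ~(1 << pin)

    def read(self, pin):
        # Output pins read back what was driven; input pins read the outside world.
        source = self.output if self.direction & (1 << pin) else self.external
        return (source >> pin) & 1


port = GpioPort()
port.configure(0, as_output=True)    # pin 0 drives a light
port.configure(3, as_output=False)   # pin 3 senses a push button
port.external = 0b1000               # the button on pin 3 is pressed
port.write(0, 1)                     # switch the light on
light, button = port.read(0), port.read(3)
print(light, button)
```

Attempting `port.write(3, 1)` in this model raises an error, reflecting the fact that a pin configured as an input cannot be driven by software until it is reconfigured.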
REFERENCES
1. Aggarwal, A., Chandra, A. K., & Snir, M., (1987). Hierarchical memory with a block transfer. In: 28th Annual Symposium on Foundations of Computer Science (SFCS 1987) (Vol. 1, pp. 204–216). IEEE.
2. Akram, A., & Sawalha, L., (2019). A survey of computer architecture simulation techniques and tools. IEEE Access, 7(2), 78120–78145.
3. Andrews, K., & Sand, D., (1992). Migrating a CISC computer family onto RISC via object code translation. ACM SIGPLAN Notices, 27(9), 213–222.
4. Antohe, B. V., & Wallace, D. B., (2002). Acoustic phenomena in a demand mode piezoelectric inkjet printer. Journal of Imaging Science and Technology, 46(5), 409–414.
5. Arbelaitz, O., Martín, J. I., & Muguerza, J., (2014). Analysis of introducing active learning methodologies in a basic computer architecture course. IEEE Transactions on Education, 58(2), 110–116.
6. Backus, J., (1978). Can programming be liberated from the von Neumann style? A functional style and its algebra of programs. Communications of the ACM, 21(8), 613–641.
7. Bannatyne, R., (1999). Semiconductor developments for automotive systems. In: 1999 IEEE 49th Vehicular Technology Conference (Cat. No. 99CH36363) (Vol. 2, pp. 1392–1396). IEEE.
8. Barney, B., (2010). Introduction to Parallel Computing (Vol. 6, No. 13, pp. 10–15). Lawrence Livermore National Laboratory.
9. Barua, S., (2001). An interactive multimedia system on “computer architecture, organization, and design”. IEEE Transactions on Education, 44(1), 41–46.
10. Bevier, W. R., (1989). Kit and the short stack. Journal of Automated Reasoning, 5(4), 519–530.
11. Bhandarkar, D., (1997). RISC versus CISC: A tale of two chips. ACM SIGARCH Computer Architecture News, 25(1), 1–12.
12. Bindal, A., (2017). Fundamentals of Computer Architecture and Design (3rd ed., pp. 3–6). Cham, Switzerland: Springer International Publishing.
13. Bjørstad, P., Manne, F., Sørevik, T., & Vajteršic, M., (1992). Efficient matrix multiplication on SIMD computers. SIAM Journal on Matrix Analysis and Applications, 13(1), 386–401.
14. Blem, E., Menon, J., & Sankaralingam, K., (2013). Power struggles: Revisiting the RISC vs. CISC debate on contemporary ARM and x86 architectures. In: 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA) (Vol. 1, pp. 1–12). IEEE.
15. Blem, E., Menon, J., Vijayaraghavan, T., & Sankaralingam, K., (2015). ISA wars: Understanding the relevance of ISA being RISC or CISC to performance, power, and energy on modern architectures. ACM Transactions on Computer Systems (TOCS), 33(1), 1–34.
16. Bozkus, Z., Choudhary, A., Fox, G., Haupt, T., Ranka, S., & Wu, M. Y., (1994). Compiling Fortran 90D/HPF for distributed memory MIMD computers. Journal of Parallel and Distributed Computing, 21(1), 15–26.
17. Bromley, A., (1997). Hardware experiments with CISC and RISC computer architectures. In: Proceedings of the 2nd Australasian Conference on Computer Science Education (Vol. 1, pp. 207–215).
18. Buehrer, R., & Ekanadham, K., (1987). Incorporating data flow ideas into von Neumann processors for parallel execution. IEEE Transactions on Computers, 100(12), 1515–1522.
19. Burgelman, R. A., & Grove, A. S., (1996). Strategic dissonance. California Management Review, 38(2), 8–28.
20. Carter, W. C., & Bouricius, W. G., (1971). A survey of fault-tolerant computer architecture and its evaluation. Computer, 4(1), 9–16.
21. Chartoff, R. P., Menczel, J. D., & Dillman, S. H., (2009). Dynamic mechanical analysis (DMA). Thermal Analysis of Polymers: Fundamentals and Applications, 2(3), 387–495.
22. Chiang, M. L., & Chang, R. C., (1999). Cleaning policies in mobile computers using flash memory. Journal of Systems and Software, 48(3), 213–231.
23. Colwell, R. P., Hitchcock, C. Y., & Jensen, E. D., (1983). Peering through the RISC/CISC fog: An outline of the research. ACM SIGARCH Computer Architecture News, 11(1), 44–50.
24. Cragon, H. G., (2000). Computer Architecture and Implementation (Vol. 1, No. 2, pp. 4–9). Cambridge University Press.
25. Dandamudi, S. P., (2005). RISC principles. Guide to RISC Processors: For Programmers and Engineers (2nd edn., pp. 39–44).
26. De Cougny, H. L., Shephard, M. S., & Özturan, C., (1996). Parallel three-dimensional mesh generation on distributed memory MIMD computers. Engineering with Computers, 12(2), 94–106. 27. Ditzel, D. R., (1991). Why RISC has won. In: The SPARC Technical Papers (Vol. 1, pp. 67–70). Springer, New York, NY. 28. Dumas, II. J. D., (2018). Computer Architecture: Fundamentals and Principles of Computer Design (Vol. 3, No. 5, pp. 4–8). CRC Press. 29. Eastman, C. M., Chase, S. C., & Assal, H. H., (1993). System architecture for computer integration of design and construction knowledge. Automation in Construction, 2(2), 95–107. 30. Eeckhout, L., (2010). Computer architecture performance evaluation methods. Synthesis Lectures on Computer Architecture, 5(1), 1–145. 31. Eigenmann, R., & Lilja, D. J., (1998). Von Neumann computers. Wiley Encyclopedia of Electrical and Electronics Engineering, 23(1), 387– 400. 32. El-Aawar, H., (2006). CISC vs. RISC hardware and programming complexity measures of addressing modes. In: Proceedings of the 2nd International Conference on Perspective Technologies and Methods in MEMS Design (Vol. 1, pp. 43–48). IEEE. 33. El-Aawar, H., (2008). An application of complexity measures in addressing modes for CISC- and RISC architectures. In: 2008 IEEE International Conference on Industrial Technology (Vol. 2, pp. 1–7). IEEE. 34. Elahi, A., & Arjeski, T., (2015). Logic gates and introduction to computer architecture. In: ARM Assembly Language with Hardware Experiments (Vol. 1, pp. 17–34). Springer, Cham. 35. El-Rewini, H., & Abd-El-Barr, M., (2005). Advanced Computer Architecture and Parallel Processing (4th edn., pp. 2–7). John Wiley & Sons. 36. Feng, X., Shao, Z., Dong, Y., & Guo, Y., (2008). Certifying low-level programs with hardware interrupts and preemptive threads. ACM SIGPLAN Notices, 43(6), 170–182. 37. Feustel, E. A., (1973). On the advantages of tagged architecture. IEEE Transactions on Computers, 100(7), 644–656. 38. 
Fujishima, M., Yamashita, M., Ikeda, M., Asada, K., Omura, Y., Izumi, K., & Sugano, T., (1992). 1 GHz 50 mu W 1/2 frequency divider fabricated on ultra-thin SIMOX substrate. In: 1992 Symposium on VLSI Circuits Digest of Technical Papers (Vol. 1, pp. 46, 47). IEEE.
39. Ganesan, P., Venugopalan, R., Peddabachagari, P., Dean, A., Mueller, F., & Sichitiu, M., (2003). Analyzing and modeling encryption overhead for sensor network nodes. In: Proceedings of the 2nd ACM International Conference on Wireless Sensor Networks and Applications (Vol. 1, pp. 151–159).
40. Ganguly, A., Muralidhar, R., & Singh, V., (2019). Towards energy-efficient non-von Neumann architectures for deep learning. In: 20th International Symposium on Quality Electronic Design (ISQED) (Vol. 1, pp. 335–342). IEEE.
41. Garth, S. C., (1991). Combining RISC and CISC in PC systems. In: IEE Colloquium on RISC Architectures and Applications (Vol. 1, pp. 10–19). IET.
42. Giloi, W. K., & Berg, H. K., (1978). Data structure architectures: A major operational principle. In: Proceedings of the 5th Annual Symposium on Computer Architecture (Vol. 1, pp. 175–181).
43. Giloi, W. K., (1997). Konrad Zuse’s Plankalkül: The first high-level, “non von Neumann” programming language. IEEE Annals of the History of Computing, 19(2), 17–24.
44. Gunes, S., Yaldiz, E., & Sayin, M. V., (2000). The design and implementation of microcontroller supported amalgamator [for dentistry]. In: 2000 10th Mediterranean Electrotechnical Conference. Information Technology and Electrotechnology for the Mediterranean Countries; Proceedings; MeleCon 2000 (Cat. No. 00CH37099) (Vol. 2, pp. 758–760). IEEE.
45. Güven, Y., Coşgun, E., Kocaoğlu, S., Gezici, H., & Yılmazlar, E., (2017). Understanding the Concept of Microcontroller Based Systems to Choose the Best Hardware for Applications (2nd edn., pp. 3–9).
46. Halmos, P. R., (1973). The legend of John Von Neumann. The American Mathematical Monthly, 80(4), 382–394.
47. Händler, W., (1975). On classification schemes for computer systems in the post-von-Neumann-era. In: GI-4. Jahrestagung (Vol. 1, pp. 439–452). Springer, Berlin, Heidelberg.
48. Harmer, P. K., Williams, P. D., Gunsch, G. H., & Lamont, G. B., (2002). An artificial immune system architecture for computer security applications. IEEE Transactions on Evolutionary Computation, 6(3), 252–280.
49. Hasan, R., & Mahmood, S., (2012). Survey and evaluation of simulators suitable for teaching computer architecture and organization supporting undergraduate students at Sir Syed University of engineering & technology. In: Proceedings of 2012 UKACC International Conference on Control (Vol. 1, pp. 1043–1045). IEEE.
50. Hatcher, P. J., & Quinn, M. J., (1991). Data-Parallel Programming on MIMD Computers (Vol. 90, No. 77). MIT Press.
51. Hennessy, J. L., & Patterson, D. A., (2011). Computer Architecture: A Quantitative Approach (Vol. 1, pp. 2–6). Elsevier.
52. Hill, M. D., Jouppi, N. P., & Sohi, G. S., (2000). Readings in Computer Architecture (Vol. 1, pp. 3–8). Gulf Professional Publishing.
53. Hiranandani, S., Kennedy, K., & Tseng, C. W., (1992). Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, 35(8), 66–80.
54. Ho, R. C., Yang, C. H., Horowitz, M. A., & Dill, D. L., (1995). Architecture validation for processors. In: Proceedings 22nd Annual International Symposium on Computer Architecture (Vol. 1, pp. 404–413). IEEE.
55. Hord, R. M., (2018). Parallel Supercomputing in MIMD Architectures (4th edn., pp. 4–9). CRC press.
56. Huynh, K., Khoshgoftaar, T., & Marazas, G., (1993). High-level performance analysis of the IBM subsystem control block (SCB) architecture. Micro Processing and Microprogramming, 36(3), 109–125.
57. Hwang, K., & Faye, A., (1984). Computer Architecture and Parallel Processing (2nd edn., pp. 5–9).
58. Iannucci, R. A., (1988). Toward a dataflow/von Neumann hybrid architecture. ACM SIGARCH Computer Architecture News, 16(2), 131–140.
59. Inoue, K., & Pham, C. K., (2017). The memorism processor: Towards a memory-based artificially intelligence complementing the von Neumann architecture. SICE Journal of Control, Measurement, and System Integration, 10(6), 544–550.
Introduction to Computer Architecture
60. Jamil, T., (1995). RISC versus CISC. IEEE Potentials, 14(3), 13–16. 61. Jex, J., (1991). Flash memory BIOS for PC and notebook computers. In: IEEE Pacific Rim Conference on Communications, Computers and Signal Processing Conference Proceedings (Vol. 1, pp. 692–695). IEEE. 62. Kalaiselvi, S., & Rajaraman, V., (2000). A survey of checkpointing algorithms for parallel and distributed computers. Sadhana, 25(5), 489–510. 63. Kanerva, P., (2009). Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors. Cognitive Computation, 1(2), 139–159. 64. Karshmer, A. I., Thomas, J. N., Annaiyappa, P. K., Eshner, D., Kankanahalli, S., & Kurup, G., (1990). Architectural support for operating systems: A popular RISC vs. a popular CISC. Microprocessing and Microprogramming, 30(1–5), 21–32. 65. Kaxiras, S., & Martonosi, M., (2008). Computer architecture techniques for power efficiency. Synthesis Lectures on Computer Architecture, 3(1), 1–207. 66. Khazam, J., & Mowery, D., (1994). The commercialization of RISC: Strategies for the creation of dominant designs. Research Policy, 23(1), 89–102. 67. Kim, Y. R., Little, D. N., Lytton, R. L., D’Angelo, J., Davis, R., Rowe, G., & Tashman, L., (2002). Use of dynamic mechanical analysis (DMA) to evaluate the fatigue and healing potential of asphalt binders in sand asphalt mixtures. In: Asphalt Paving Technology: Association of Asphalt Paving Technologists-Proceedings of the Technical Sessions (Vol. 71, pp. 176–206). Association of Asphalt Paving Technologists. 68. Kisacanin, B., Bhattacharyya, S. S., & Chai, S., (2008). Embedded Computer Vision (4th ed., pp. 2–6). Springer Science & Business Media. 69. Ko, U., Balsara, P. T., & Nanda, A. K., (1998). Energy optimization of multilevel cache architectures for RISC and CISC processors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 6(2), 299–308. 70. Kozyrakis, C. E., & Patterson, D. A., (1998). 
A new direction for computer architecture research. Computer, 31(11), 24–32.
71. Kozyrakis, C. E., Perissakis, S., Patterson, D., Anderson, T., Asanovic, K., Cardwell, N., & Yelick, K., (1997). Scalable processors in the billion-transistor era: IRAM. Computer, 30(9), 75–78. 72. Krad, H., & Al-Taie, A. Y., (2007). A new trend for CISC and RISC architectures. Asian J. Inform. Technol., 6(11), 1125–1131. 73. Lauderdale, C., & Khan, R., (2012). Towards a code-based runtime for exascale computing: Position paper. In: Proceedings of the 2nd International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era (Vol. 1, pp. 21–26). 74. Leighton, F. T., (2014). Introduction to Parallel Algorithms and Architectures: Arrays·Trees·Hypercubes (Vol. 1, pp. 2–6). Elsevier. 75. Leon-Garcia, A., & Widjaja, I., (2000). Communication Networks: Fundamental Concepts and Key Architectures (Vol. 2, pp. 2–5). New York: McGraw-Hill. 76. Liu, M., Liu, D., Wang, Y., Wang, M., & Shao, Z., (2010). On improving real-time interrupt latencies of hybrid operating systems with twolevel hardware interrupts. IEEE Transactions on Computers, 60(7), 978–991. 77. Lord, R. E., Kowalik, J. S., & Kumar, S. P., (1983). Solving linear algebraic equations on MIMD computer. Journal of the ACM (JACM), 30(1), 103–117. 78. Lozano, H., & Ito, M., (2016). Increasing the code density of embedded RISC applications. In: 2016 IEEE 19th International Symposium on Real-Time Distributed Computing (ISORC) (Vol. 1, pp. 182–189). IEEE. 79. Lung, O., & Wolfner, M. F., (1999). Drosophila seminal fluid proteins enter the circulatory system of the mated female fly by crossing the posterior vaginal wall. Insect Biochemistry and Molecular Biology, 29(12), 1043–1052. 80. Malaiya, Y. K., & Feng, S., (1988). Design of testable RISC-to-CISC control architecture. In: Proceedings of the 21st Annual Workshop on Microprogramming and Microarchitecture (Vol. 1, pp. 57–59). 81. Mange, D., Madon, D., Stauffer, A., & Tempesti, G., (1997). 
Von Neumann revisited: A Turing machine with self-repair and self-reproduction properties. Robotics and Autonomous Systems, 22(1), 35–58.
82. Maresca, M., & Li, H., (1989). Connection autonomy in SIMD computers: A VLSI implementation. Journal of Parallel and Distributed Computing, 7(2), 302–320. 83. Martin, D., & Owen, R. E., (1998). A RISC architecture with uncompromised digital signal processing and microcontroller operation. In: Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No. 98CH36181) (Vol. 5, pp. 3097–3100). IEEE. 84. Martínez-Monés, A., Gómez-Sánchez, E., Dimitriadis, Y. A., JorrínAbellán, I. M., Rubia-Avi, B., & Vega-Gorgojo, G., (2005). Multiple case studies to enhance project-based learning in a computer architecture course. IEEE Transactions on Education, 48(3), 482–489. 85. McGrew, T., Schonauer, E., & Jamieson, P., (2019). Framework and tools for undergraduates designing RISC-V processors on an FPGA in computer architecture education. In: 2019 International Conference on Computational Science and Computational Intelligence (CSCI) (Vol. 1, pp. 778–781). IEEE. 86. McLoughlin, I. V., (2011). Computer Architecture: An Embedded Approach (2nd ed., pp. 1–5). McGraw-Hill. 87. Mogul, J. C., & Ramakrishnan, K. K., (1997). Eliminating receive livelock in an interrupt-driven kernel. ACM Transactions on Computer Systems, 15(3), 217–252. 88. Nassimi, D., & Sahni, S., (1981). Data broadcasting in SIMD computers. IEEE Transactions on Computers, 100(2), 101–107. 89. Nowatzki, T., Gangadhar, V., & Sankaralingam, K., (2015). Exploring the potential of heterogeneous von Neumann/dataflow execution models. In: Proceedings of the 42nd Annual International Symposium on Computer Architecture (Vol. 2, pp. 298–310). 90. Offutt, A. J., Pargas, R. P., Fichter, S. V., & Khambekar, P. K., (1992). Mutation testing of software using MIMD computer. In: ICPP (2) (Vol. 1, pp. 257–266). 91. Page, D., (2009). A Practical Introduction to Computer Architecture (Vol. 2, No. 5, pp. 2–8). Springer Science & Business Media. 92. Parhami, B., (2006). 
Introduction to Parallel Processing: Algorithms and Architectures (2nd ed., pp. 1–5). Springer Science & Business Media.
93. Park, J. W., (2004). Multiaccess memory system for attached SIMD computer. IEEE Transactions on Computers, 53(4), 439–452. 94. Patterson, D., (2018). 50 years of computer architecture: From the mainframe CPU to the domain-specific you and the open RISC-V instruction set. In: 2018 IEEE International Solid-State Circuits Conference-(ISSCC) (Vol. 1, pp. 27–31). IEEE. 95. Patti, D., Spadaccini, A., Palesi, M., Fazzino, F., & Catania, V., (2012). Supporting undergraduate computer architecture students using a visual mips64 CPU simulator. IEEE Transactions on Education, 55(3), 406–411. 96. Pesavento, U., (1995). An implementation of von Neumann’s selfreproducing machine. Artificial Life, 2(4), 337–354. 97. Pippenger, N., (1990). Developments in “the synthesis of reliable organisms from unreliable components”. The Legacy of John Von Neumann, 50, 311–324. 98. Platzner, M., Rinner, B., & Weiss, R., (2000). Toward embedded qualitative simulation: A specialized computer architecture for QSIM. IEEE Intelligent Systems and their Applications, 15(2), 62–68. 99. Porter, L., Garcia, S., Tseng, H. W., & Zingaro, D., (2013). Evaluating student understanding of core concepts in computer architecture. In: Proceedings of the 18th ACM Conference on Innovation and Technology in Computer Science Education (Vol. 1, pp. 279–284). 100. Radcliffe, N. J., (1990). Genetic Neural Networks on MIMD Computers (2nd ed pp. 2–8). KB thesis scanning project 2015. 101. Raghunathan, A., Dey, S., & Jha, N. K., (1999). Register transfer level power optimization with an emphasis on glitch analysis and reduction. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 18(8), 1114–1131. 102. Ramakrishna, B., & Michael, S., (2001). Embedded computer architecture and automation. IEEE Computer, 34(4), 75–83. 103. Rebaudengo, M., Reorda, M. S., Violante, M., Cheynet, P., Nicolescu, B., & Velazco, R., (2000). 
Evaluating the effectiveness of a software fault-tolerance technique on RISC-and CISC-based architectures. In: Proceedings 6th IEEE International On-Line Testing Workshop (Cat. No. PR00646) (Vol. 1, pp. 17–21). IEEE. 104. Regehr, J., & Duongsaa, U., (2005). Preventing interrupt overload. ACM SIGPLAN Notices, 40(7), 50–58.
105. Ritpurkar, S. P., Thakare, M. N., & Korde, G. D., (2014). Synthesis and simulation of a 32Bit MIPS RISC processor using VHDL. In: 2014 International Conference on Advances in Engineering & Technology Research (ICAETR-2014) (Vol. 1, pp. 1–6). IEEE. 106. Schlessman, J., & Wolf, M., (2015). Tailoring design for embedded computer vision applications. Computer, 48(5), 58–62. 107. Schoeberl, M., (2009). Time-predictable computer architecture. EURASIP Journal on Embedded Systems, 2(3), 1–17. 108. Sebbel, H., (1976). Input/output microprogramming for the 7.755 central processing unit of siemens system 7.000. Euromicro Newsletter, 2(3), 47–53. 109. Shin, D., & Yoo, H. J., (2019). The heterogeneous deep neural network processor with a non-von Neumann architecture. Proceedings of the IEEE, 108(8), 1245–1260. 110. Šilc, J., Silc, J., Robic, B., & Ungerer, T., (1999). Processor Architecture: From Dataflow to Superscalar and Beyond; with 34 Tables (4th edn., pp. 1–5). Springer Science & Business Media. 111. Slye, J. H., & Elnozahy, E. N., (1998). Support for software interrupts in log-based rollback-recovery. IEEE Transactions on Computers, 47(10), 1113–1123. 112. Spreen, H., (1976). Partially integrated input/output channels. Euromicro Newsletter, 2(3), 41–46. 113. Stefan, D., & Stefan, G., (1999). A processor network without an interconnection path. In: CAS’99 Proceedings: 1999 International Semiconductor Conference (Cat. No. 99TH8389) (Vol. 1, pp. 305– 308). IEEE. 114. Stewin, P., & Bystrov, I., (2012). Understanding DMA malware. In: International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (Vol. 1, pp. 21–41). Springer, Berlin, Heidelberg. 115. Stokes, J., (2007). Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture (Vol. 1, No. 2, pp. 5–9). No starch press. 116. Suryakant, U. A., Shelke, S. D., & Bartakke, P. P., (2018). RISC controller with multiport RAM. 
In: 2018 International Conference On Advances in Communication and Computing Technology (ICACCT) (Vol. 1, pp. 571–574). IEEE.
117. Taylor, J. H., & Frederick, D. K., (1984). An expert system architecture for computer-aided control engineering. Proceedings of the IEEE, 72(12), 1795–1805. 118. Tibbals, H. F., (1976). A structure for interprocess communication in a data communications handler. In: Proceedings of the 1976 Annual Conference (Vol. 1, pp. 356–360). 119. Tokhi, M. O., & Hossain, M. A., (1995). CISC, RISC, and DSP processors in real-time signal processing and control. Microprocessors and Microsystems, 19(5), 291–300. 120. Tolley, D. B., (1991). Analysis of CISC versus RISC microprocessors for FDDI network interfaces. In: [1991] Proceedings 16th Conference on Local Computer Networks (Vol. 1, pp. 485–486). IEEE Computer Society. 121. Traversa, F. L., & Di Ventra, M., (2015). Universal memcomputing machines. IEEE Transactions on Neural Networks and Learning Systems, 26(11), 2702–2715. 122. Trenas, M. A., Ramos, J., Gutierrez, E. D., Romero, S., & Corbera, F., (2010). Use of a new Moodle module for improving the teaching of a basic course on computer architecture. IEEE transactions on Education, 54(2), 222–228. 123. Tsafrir, D., (2007). The context-switch overhead inflicted by hardware interrupts (and the enigma of do-nothing loops). In: Proceedings of the 2007 Workshop on Experimental Computer Science (Vol. 1, pp. 4-es). 124. Van De, G. A. J., (1989). Computer Architecture and Design (Vol. 1, pp. 1–7). Addison-Wesley Longman Publishing Co., Inc. 125. Vanhaverbeke, W., & Noorderhaven, N. G., (2001). Competition between alliance blocks: The case of the RISC microprocessor technology. Organization Studies, 22(1), 1–30. 126. Verhulst, E., (1997). Non-sequential processing: Bridging the semantic gap left by the von Neumann architecture. In: 1997 IEEE Workshop on Signal Processing Systems; SiPS 97 Design and Implementation formerly VLSI Signal Processing (Vol. 1, pp. 35–49). IEEE. 127. Wang, S. P., & Ledley, R. S., (2012). 
Computer Architecture and Security: Fundamentals of Designing Secure Computer Systems (Vol. 1, pp. 1–5). John Wiley & Sons. 128. Weaver, V. M., Terpstra, D., & Moore, S., (2013). Non-determinism and overcount on modern hardware performance counter implementations.
In: 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (Vol. 1, pp. 215–224). IEEE. 129. Weiss, M. H., & Fettweis, G. P., (1996). Dynamic code width reduction for VLIW instruction set architectures in digital signal processors. In: Proceedings IWISP’96 (Vol. 1, pp. 517–520). Elsevier Science Ltd. 130. Wilson, G. R., (2001). Embedded Systems and Computer Architecture (Vol. 1, pp. 3–9). Elsevier. 131. Wolfe, A., & Chanin, A., (1992). Executing compressed programs on an embedded RISC architecture. ACM Sigmicro Newsletter, 23(1, 2), 81–91. 132. Wright, R. D., & Dawson, M. R., (1988). Using hardware interrupts for timing visual displays and reaction-time key interfacing on the commodore 64. Behavior Research Methods, Instruments, & Computers, 20(1), 41–48. 133. Xiang, T., Zhang, L., An, S., Ye, X., Zhang, M., Liu, Y., & Fan, D., (2021). RISC-NN: Use RISC, NOT CISC as Neural Network Hardware Infrastructure (Vol. 1, pp. 5–9). arXiv preprint arXiv:2103.12393. 134. Yano, K., Sasaki, Y., Rikino, K., & Seki, K., (1994). Lean integration: Achieving a quantum leap in performance and cost of logic LSIs. In: Proceedings of IEEE Custom Integrated Circuits Conference-CICC’94 (Vol. 1, pp. 603–606). IEEE. 135. Yazdanpanah, F., Alvarez-Martinez, C., Jimenez-Gonzalez, D., & Etsion, Y., (2013). Hybrid dataflow/von-Neumann architectures. IEEE Transactions on Parallel and Distributed Systems, 25(6), 1489–1509. 136. Zahler, A. M., Williamson, J. R., Cech, T. R., & Prescott, D. M., (1991). Inhibition of telomerase by G-quartet DMA structures. Nature, 350(6320), 718–720. 137. Zhang, D., (2011). Wastewater handle system design based on PICmicrocontroller. In: 2011 International Conference on Computer Science and Service System (CSSS) (Vol. 1, pp. 2831–2833). IEEE. 138. Zhao, Y., Fan, Z., Du, Z., Zhi, T., Li, L., Guo, Q., & Chen, Y., (2020). Machine learning computers with fractal Von Neumann architecture. IEEE Transactions on Computers, 69(7), 998–1014.
CHAPTER 2
CLASSIFICATION OF COMPUTER ARCHITECTURE
CONTENTS
2.1. Introduction
2.2. Von-Neumann Architecture
2.3. Harvard Architecture
2.4. Instruction Set Architecture
2.5. Microarchitecture
2.6. System Design
References
2.1. INTRODUCTION
Computer architecture comprises the rules, methods, and procedures that describe how computer systems are constructed and how they function. An architecture is designed to meet the demands of the user while respecting economic and budgetary constraints. Previously, architecture was designed on paper and implemented as hardware (Hwang and Jotwani, 1993). Since the introduction of transistor-transistor logic, architectures have been built, tested, and produced in hardware form. A computer architecture may be characterized by the performance, efficiency, reliability, and cost of the computer system. It is concerned with technology standards for both hardware and software, and it covers the CPU, memory, input/output devices, and the communication channels that link them (Loo, 2007; Shaout and Eldos, 2003). The several types of computer architecture are discussed in the following sections.
2.2. VON-NEUMANN ARCHITECTURE
This design was proposed by John von Neumann. Modern computers, such as those we use today, are built on the von-Neumann architecture, which is founded on several notions (Mariantoni et al., 2011). The system has a single read/write memory that holds both data and instructions. Whenever we talk about memory here, we mean this single storage space, which is used for reading and writing both instructions and data and whose contents are accessed by address (Figure 2.1) (Faust et al., 2015).
Figure 2.1. The Architecture of CPU. Source: https://www.educba.com/types-of-computer-architecture/.
Memory consists of numerous locations, each with its own address. Any instruction or data item can be read or written by addressing its location, regardless of what kind of content is stored there. Execution proceeds in sequential order unless a change is required. For instance, if we are executing instructions from line 1 to line 10 and then need to run line 50 rather than line 11, we jump to instruction 50 and execute it (Shin and Yoo, 2019). Data and instruction codes are transferred over a bus. The input device supplies data or commands, and the central processing unit (CPU) performs one operation at a time, either fetching an instruction or moving data into or out of memory. Once processing is complete, the result is sent to the output device. The CPU contains the control and logic components that carry out processing (Dutta et al., 2020). This design employs the stored-program concept, which is the most significant part of the von Neumann model. The following are the main characteristics of such an architecture (Langdon and Poli, 2006):
• There is no distinction between data and instructions. This stipulation has several major ramifications (Kassani et al., 2019):
– Just like data, instructions are represented as numbers. This uniform handling of data and instructions simplifies the design of both software and memory.
– Rather than having separate memories for data and instructions, a single memory is used for both. As a result, a single memory path carries both data and instructions.
– Memory is addressed by location, irrespective of the nature of the data stored there.
• By default, the instructions of the stored program are executed in the order in which they appear in the program, unless otherwise specified (Laird, 2009; Tsai et al., 2006).
The data to be processed is placed in the computer’s memory along with the program that operates on it (Figure 2.2 shows a computer with a von Neumann architecture, which has a single memory space that holds both data and instructions). This can result in a condition known as the von Neumann bottleneck, which limits the effective speed at which the CPU can operate (Quirita et al., 2016). Data and instructions share the same pathway between memory and the CPU; therefore, while the CPU is writing a data value to memory, it cannot fetch the next instruction to be executed. It must wait until the data has been written before fetching, and conversely, it must wait for an instruction fetch to finish before accessing data (Giannetsos et al., 2010).
Figure 2.2. Memory architecture for the von Neumann.
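The stored-program idea described above can be illustrated with a small simulation. The following is a minimal sketch, not a model of any real processor: the function name `run`, the opcodes (LOAD, ADD, STORE, JUMP, HALT), and the tuple-based instruction format are all assumptions made for readability (a real machine would encode each instruction as a number). It shows one memory holding both instructions and data, sequential execution, and a jump that changes the default order.

```python
# Toy von Neumann machine: ONE memory list holds both instructions and data.
# Opcodes and the (opcode, operand) tuple format are hypothetical.

def run(memory):
    """Execute the program stored in `memory`, starting at address 0."""
    pc = 0    # program counter: address of the next instruction
    acc = 0   # accumulator register
    while True:
        opcode, operand = memory[pc]   # fetch from the shared memory
        pc += 1                        # default: continue sequentially
        if opcode == "LOAD":
            acc = memory[operand]      # read a data word by its address
        elif opcode == "ADD":
            acc += memory[operand]
        elif opcode == "STORE":
            memory[operand] = acc      # write a data word by its address
        elif opcode == "JUMP":
            pc = operand               # break the sequential order
        elif opcode == "HALT":
            return memory
        else:
            raise ValueError(f"unknown opcode: {opcode}")

memory = [
    ("LOAD", 6),    # address 0: acc = memory[6]
    ("ADD", 7),     # address 1: acc = acc + memory[7]
    ("JUMP", 4),    # address 2: skip the instruction at address 3
    ("STORE", 9),   # address 3: never executed
    ("STORE", 8),   # address 4: memory[8] = acc
    ("HALT", 0),    # address 5: stop
    2,              # address 6: data
    3,              # address 7: data
    0,              # address 8: result is written here
    0,              # address 9: stays untouched because of the jump
]
run(memory)
print(memory[8])
```

Because instructions and data live in the same list, every fetch and every data access goes through the same `memory` object, which is the software analogue of the single shared pathway discussed above.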
2.2.1. Control Unit (CU)
The control unit (CU) directs the memory, ALU, and I/O devices of the computer, telling them how to respond to the program instructions that it reads and interprets from the memory unit. The CU also provides the control signals that the other computer components need (Lo et al., 2005).
2.2.2. Buses
Buses carry data from one part of a computer to another and connect all major internal components, such as the CPU and memory. A conventional CPU system bus is made up of three buses: the control bus, the data bus, and the address bus (Bodin and Berman, 1979).
• Control Bus: Carries control signals/commands from the CPU (and status signals from other devices) in order to control and coordinate all activity within the computer.
• Data Bus: Carries data among the CPU, the memory unit, and the I/O devices.
• Address Bus: Carries the addresses of data (but not the data itself) between the CPU and memory.
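The division of labor among the three buses can be sketched in code. This is an illustrative model only: the names `BusCycle` and `Memory` and the control values "READ" and "WRITE" are assumptions, not part of any real bus standard. Each bus cycle carries one value per bus, and the memory unit reacts according to the control signal.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BusCycle:
    control: str                 # control bus: "READ" or "WRITE" (hypothetical signals)
    address: int                 # address bus: which memory location is meant
    data: Optional[int] = None   # data bus: the value being written, if any

class Memory:
    def __init__(self, size):
        self.cells = [0] * size

    def handle(self, cycle):
        """React to one bus cycle; return the value placed on the data bus, if any."""
        if cycle.control == "WRITE":
            self.cells[cycle.address] = cycle.data
            return None
        if cycle.control == "READ":
            return self.cells[cycle.address]
        raise ValueError(f"unknown control signal: {cycle.control}")

ram = Memory(16)
ram.handle(BusCycle("WRITE", 5, 42))     # CPU writes 42 to address 5
value = ram.handle(BusCycle("READ", 5))  # CPU reads address 5 back
print(value)
```

Note that the address never travels on the data bus and the data never travels on the address bus, mirroring the separation listed above.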
2.2.3. Memory Unit
The memory unit consists of RAM, which is also known as main memory. This memory is faster than a hard drive (secondary memory) and is directly accessible by the CPU. RAM is divided into partitions, each of which consists of an address and its contents (Barnes et al., 1968). The address uniquely identifies each location in memory. Loading data from permanent storage (the hard drive) into the faster, directly accessible random access memory (RAM) enables the CPU to operate at a much faster pace (Parrilla-Gutierrez et al., 2020).
2.3. HARVARD ARCHITECTURE
The Harvard architecture is used when code and data are stored in independent memory sections (Figure 2.3). Separate memory blocks are required for instructions and data: data is accessed from one memory region, while instructions are accessed from another. Data storage is contained entirely within the central processing unit (CPU) (Huynh-The et al., 2020). A single set of clock cycles is sufficient, and pipelining is possible, although the design is complex. The CPU can fetch instructions and perform data accesses. The instruction and data address spaces in the Harvard design are separate; that is, data address zero and instruction address zero are not the same. Instruction address zero might identify a 24-bit value, whereas data address zero might identify an 8-bit byte that is not part of that 24-bit value (Francillon and Castelluccia, 2008).
Figure 2.3. Harvard architecture. Source: https://www.w3schools.in/computer-fundamentals/types-of-computerarchitecture.
The modified (enhanced) Harvard architecture machine is much like a Harvard architecture machine, except that its independent instruction and data caches share the same address space. The Harvard approach is used in digital signal processors (DSPs), which run simple or complex audio or video algorithms (Konyavsky and Ross, 2019). Microcontrollers, which have a small amount of program and data memory, speed up processing by fetching instructions and accessing data concurrently. In Figure 2.3, we can see that there are separate instruction and data memories, each with its own bus to the CPU. The CPU contains a separate arithmetic and logic unit and can perform I/O operations concurrently (Kong et al., 2010).
In a conventional computer that follows the von Neumann architecture, data and instructions are both kept in the same memory and are transported on the same buses. This means that the CPU cannot do both (read/write data and fetch an instruction) at the same time. The Harvard architecture is a type of computer architecture that features separate storage for instructions and data, as well as separate bus systems (signal paths). It was developed to circumvent the von Neumann bottleneck. Having separate instruction and data buses provides the essential advantage of allowing the CPU to fetch instructions while reading and writing data at the same time (Li and Yan, 2010). The structure of the Harvard architecture is explained in the following subsections.
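The effect of separate buses can be estimated with a rough cycle count. The sketch below rests on deliberately idealized assumptions: every memory access takes exactly one cycle, caches and pipeline start-up are ignored, and on the separate-bus machine the fetch of the next instruction fully overlaps with the data access of the current one. The function names and the sample workload are hypothetical.

```python
def cycles_shared_bus(data_accesses):
    """Von Neumann style: each instruction fetch and each data access
    must take turns on the single bus, so the costs add up."""
    return sum(1 + d for d in data_accesses)

def cycles_separate_buses(data_accesses):
    """Harvard style (idealized): an instruction fetch overlaps with the
    data accesses, so each instruction costs the longer of the two."""
    return sum(max(1, d) for d in data_accesses)

# Number of data accesses made by each of five instructions (made-up workload).
data_accesses = [1, 0, 1, 1, 0]

shared = cycles_shared_bus(data_accesses)
separate = cycles_separate_buses(data_accesses)
print(shared, separate)
```

Under these assumptions the shared-bus machine needs 8 memory cycles and the separate-bus machine needs 5; real gains depend heavily on caches and on the instruction mix.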
2.3.1. Buses
Buses serve as signal pathways. In the Harvard architecture, the data and instruction buses are separated from one another. Buses are available in a range of designs and capacities (Chung, 1995; Ben Mahjoub and Atri, 2019).
• Data Bus: Carries data among the processor, the main memory system, and the input/output devices.
• Data Address Bus: Carries the addresses of data from the CPU to the main memory system.
• Instruction Bus: Carries instructions among the CPU, the main memory, and the input/output devices (Matuszczyk and Maltese, 1995).
• Instruction Address Bus: Carries the addresses of instructions from the CPU to the main memory system.
2.3.2. Operational Registers
The architecture has several kinds of registers, which are used to store the addresses of different kinds of instructions. The Memory Data Register and the Memory Address Register, for instance, are operational registers (Flynn, 1972).
2.3.3. Program Counter
The program counter holds the location of the next instruction to be executed. It then passes this address to the memory address register (Zhou et al., 2004).
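The interplay of the program counter with the memory address register (MAR) and memory data register (MDR) during a fetch can be sketched as follows. The class name `FetchUnit`, its attribute names, and the instruction strings are assumptions chosen for illustration; real hardware performs these transfers with signals, not method calls.

```python
class FetchUnit:
    """Hypothetical model of the fetch step: PC -> MAR -> memory -> MDR."""

    def __init__(self, instruction_memory):
        self.instruction_memory = instruction_memory
        self.pc = 0      # program counter: address of the next instruction
        self.mar = 0     # memory address register
        self.mdr = None  # memory data register

    def fetch(self):
        self.mar = self.pc                            # PC hands its address to the MAR
        self.mdr = self.instruction_memory[self.mar]  # memory places the word in the MDR
        self.pc += 1                                  # PC now points at the following instruction
        return self.mdr

unit = FetchUnit(["LOAD R1", "ADD R1, R2", "STORE R1"])
first = unit.fetch()
second = unit.fetch()
print(first, "|", second, "| next address:", unit.pc)
```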
2.3.4. Arithmetic and Logic Unit (ALU)
The ALU is the part of the CPU that carries out all of the necessary computations. It performs addition, subtraction, comparison, logical operations, bit shifting, and various other arithmetic operations (Syamala and Tilak, 2011).
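A few of the operations listed above can be sketched as a single function. This is a simplified illustration, assuming an 8-bit word and a single zero flag; the function name `alu` and the operation mnemonics are hypothetical. Comparison for equality is obtained here by subtracting and inspecting the zero flag.

```python
def alu(op, a, b, width=8):
    """Return (result, zero_flag) for a small set of operations.
    Results are truncated to `width` bits, as in fixed-width hardware."""
    mask = (1 << width) - 1
    if op == "ADD":
        result = a + b
    elif op == "SUB":
        result = a - b
    elif op == "AND":
        result = a & b
    elif op == "OR":
        result = a | b
    elif op == "SHL":
        result = a << b            # shift left by b bit positions
    elif op == "SHR":
        result = (a & mask) >> b   # logical shift right
    else:
        raise ValueError(f"unsupported operation: {op}")
    result &= mask                 # discard bits that do not fit in the word
    return result, result == 0

print(alu("ADD", 200, 100))  # overflow wraps around in 8 bits
print(alu("SUB", 5, 5))      # equal operands set the zero flag
print(alu("SHL", 1, 3))
```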
2.3.5. Control Unit (CU)
The control unit (CU) is the part of the CPU that regulates all of the processor’s signals. It manages the input and output devices and the flow of data and instructions throughout the system (Padgett et al., 1989).
2.3.6. Input/Output System
Input devices are used to read data into main memory under the control of CPU input instructions. Output devices deliver the computer’s results: through them, the computer presents the outcome of its computations (Larraza-Mendiluze and Garay-Vitoria, 2014).
2.3.7. Harvard Architecture’s Benefits
The Harvard design features two independent buses, one for data and one for instructions. As a result, the CPU can fetch instructions and read/write data concurrently, which is the principal benefit of the Harvard architecture (Ross and Westerman, 2004). In practice, a Modified Harvard Architecture is used, in which there are two distinct caches (instruction and data). This is a common pattern in both ARM and x86 CPUs.
2.4. INSTRUCTION SET ARCHITECTURE
The goals of this section are to understand the relevance of the instruction set architecture (ISA), to discuss the factors that must be addressed when designing the ISA of a machine, and to look at an example ISA, MIPS (Goodacre and Sloss, 2005).
Figure 2.4. Instruction set architecture. Source: https://www.embedded.com/a-quick-introduction-to-instruction-setarchitecture-and-extensibility/.
As we’ve seen, the instruction set architecture and the organization of the computer are both important components of a computer architecture course. The ISA determines what the processor is capable of doing and how it goes about doing it. As a result, the ISA acts as the bridge between your program and your hardware. The instruction set is the only mechanism through which the rest of the system can communicate with the CPU (Liu et al., 2016; Chen et al., 2019). To instruct the computer, you must first learn to speak its language. The instructions are the words of the computer’s language, and the instruction set is its vocabulary. You will not get satisfactory answers from the computer until you know these words and have a large vocabulary. The ISA is the part of the machine that is visible to a compiler writer, an application programmer, or an assembly language programmer (Patterson and Ditzel, 1980). It is the only link you have to the hardware: the ISA describes what the computer is capable of doing, and the computer must be built so that it can carry out the functions specified in the ISA. The ISA is also the only mechanism through which you can communicate with your computer. Understanding it gives you a better grasp of how software and hardware interact (Patterson and Sequin, 1998).
Consider the following scenario: you have a high-level program written in C that is independent of the architecture on which you wish to run it. To run on a particular architecture, this high-level program must be translated into an assembly language program specific to that architecture (McQuillan, 1978). Suppose the resulting assembly program consists of instructions such as LOAD, ADD, and STORE; everything you wrote in the high-level language has now been translated into a sequence of instructions particular to that architecture. All of the instructions shown here are part of the instruction set architecture of the MIPS architecture (Fox and Myreen, 2010), which is depicted in the following diagram. These instructions are written as English mnemonics, which the processor cannot understand, because it is built entirely from digital elements that understand only ones and zeros. Consequently, the assembly language must be translated into machine language, namely object code made up of 0s and 1s. The compiler performs the translation from the high-level language to assembly language, and the assembler generates the binary code (Wirthlin and Hutchings, 1995). We will examine the characteristics of the instruction set and determine what is stored in the 0s and 1s, as well as how the 0s and 1s are interpreted as instructions, data, or addresses. ISAs are meant to last through several implementations: they must be portable and compatible, usable in many different ways, and general, and they must provide convenient functionality to other layers of the system. The taxonomy of ISAs is presented in the next section (Adve et al., 2003).
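As a rough illustration of the last step in this chain, the sketch below assembles a three-instruction program for A = B + C into 16-bit binary words. The opcode values, the 8-bit opcode plus 8-bit address layout, and the symbol addresses are all hypothetical choices made for this example; they are not the MIPS encoding.

```python
# Toy assembler: turns mnemonics into binary words.
# The opcode numbers and the 16-bit layout are invented for illustration only.
OPCODES = {"LOAD": 0x01, "ADD": 0x02, "STORE": 0x03}   # hypothetical opcodes
SYMBOLS = {"A": 0x10, "B": 0x11, "C": 0x12}            # hypothetical addresses


def assemble(line):
    """Encode 'MNEMONIC OPERAND' as an 8-bit opcode followed by an 8-bit address."""
    mnemonic, operand = line.split()
    return (OPCODES[mnemonic] << 8) | SYMBOLS[operand]


program = ["LOAD B", "ADD C", "STORE A"]               # assembly for A = B + C
machine_code = [assemble(line) for line in program]

for line, word in zip(program, machine_code):
    print(f"{line:<8} -> {word:016b}")
```

Each printed line pairs a mnemonic that a person can read with the bit pattern that the processor would actually fetch.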
2.4.1. Taxonomy
The amount of internal storage available in a CPU varies with the instruction set architecture. Consequently, depending on where the operands are kept and whether they are named implicitly or explicitly, ISAs can be classified into the following categories (Landwehr et al., 1994). In a single accumulator organization, one of the general-purpose registers is designated as the accumulator and is always used to hold one of the operands. One operand is therefore implicitly in the accumulator, and only the other operand needs to be specified in the instruction (Skillicorn, 1988).
In the general register organization, all operands are stated explicitly, and the operands may be in registers or in memory. The organization is classed as register-register if all operands are kept in registers. Because only the load and store instructions may access memory, these are referred to as “load-store” architectures (Sulistio et al., 2004). In the register-memory organization, one operand is in memory and the other is in a register. In the memory-memory organization, every operand is specified as a memory operand. In the stack organization, the operands are kept on a stack and operations are performed on the top of the stack; the operands are specified implicitly in this case. Consider the situation in which you must perform the operation A = B + C, with all three operands in memory. In an accumulator-based instruction set architecture, one of the general-purpose registers is designated as the accumulator and one of the operands is always in the accumulator, so you must first load one operand into the accumulator. The ADD instruction then needs to specify only the address of the other operand (Mentzas, 1994). GPR-based ISAs are divided into three primary categories. In the register-memory instruction set architecture, one operand must be moved into a register, but the other may remain a memory operand. In the register-register instruction set architecture, both operands must first be moved into registers, and the ADD instruction operates only on register values. In the memory-memory instruction set architecture, both memory operands are permitted (Chacón, 1992), so you can add them directly.
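The following minimal sketch shows how A = B + C might be carried out under three of the organizations introduced above: accumulator, load-store (register-register), and stack. The mnemonics in the comments, the dictionary used as memory, and the function names are illustrative assumptions rather than the instruction set of any particular machine.

```python
# Sketch of A = B + C on three ISA organizations.
# The mnemonics and the dictionary-based "memory" are illustrative assumptions.

def run_accumulator(mem):
    acc = mem["B"]          # LOAD B   : accumulator <- B (accumulator is implicit)
    acc = acc + mem["C"]    # ADD C    : accumulator <- accumulator + C
    mem["A"] = acc          # STORE A  : A <- accumulator
    return 3                # number of instructions executed


def run_load_store(mem):
    r1 = mem["B"]           # LOAD R1, B
    r2 = mem["C"]           # LOAD R2, C
    r3 = r1 + r2            # ADD R3, R1, R2 (operates on registers only)
    mem["A"] = r3           # STORE R3, A
    return 4


def run_stack(mem):
    stack = []
    stack.append(mem["B"])                    # PUSH B
    stack.append(mem["C"])                    # PUSH C
    stack.append(stack.pop() + stack.pop())   # ADD: pop two values, push the sum
    mem["A"] = stack.pop()                    # POP A
    return 4


results = {}
for name, machine in [("accumulator", run_accumulator),
                      ("load-store", run_load_store),
                      ("stack", run_stack)]:
    memory = {"A": 0, "B": 5, "C": 7}
    count = machine(memory)
    results[name] = (memory["A"], count)
    print(f"{name:12s} A = {memory['A']}  ({count} instructions)")
```

All three variants leave the same value in A; they differ only in how many instructions are needed and in where the operands are named.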
In a stack-based instruction set architecture, you must first push both operands onto the stack and then use an ADD instruction to add the top two elements of the stack and store the result back on the stack, as shown in the following example. As these examples show, there are several ways of performing the same operation, depending on the instruction set architecture being used. The register-register instruction set architecture is the most frequently used of these, and it is used in all RISC architectures (Gredler, 1986). We now look at the elements that must be taken into account when designing an instruction set architecture. They are as follows (Cook and Donde, 1982):
• Operand types and sizes;
• Types of instructions;
• Memory addressing;
• Addressing modes;
• Compiler issues;
• Instruction formats and encoding.
First of all, you should determine the types of instructions that you want to support in the instruction set architecture. Computer programs have traditionally been composed of a series of small steps, such as multiplying two numbers or moving data from a register to a memory location. Other steps include testing for a particular condition, such as zero, reading a character from an input device, or sending a character to be displayed on an output device. A computer should be able to execute the following types of instructions (Saucède et al., 2021):
• Data manipulation instructions;
• Data transfer instructions;
• Input and output instructions;
• Program sequencing and control instructions.
Data transfer instructions move data between the various storage areas of a computer system, such as registers, memory, and input/output devices. Because both instructions and data are stored in memory, the processor must read them from memory. All of the results of processing must also be stored in memory (Giloi, 1983; Ferrara et al., 1997). This results in the need for two fundamental memory operations, namely load (or read) and store (or write). Data is copied from memory to the CPU during the load operation, whereas data is copied from the CPU to memory during the store operation. Additional data transfer instructions are needed to move data from one register to another or between input/output devices and the CPU (Dandamudi, 2003).
Data manipulation instructions are used to express the processor’s computational capabilities and to perform operations on data (Scott, 2003). Logical operations, arithmetic operations, and shift operations are examples of such computations. The arithmetic operations include addition, subtraction, multiplication, division, increment, decrement, and computing the complement of a number. The logical and bit manipulation operations include AND, OR, XOR, clear carry, set carry, and similar instructions. Various kinds of shift and rotate operations may also be provided (Pope, 1996).
Ordinarily, we assume that instructions are executed in sequence. In other words, instructions that are placed in consecutive locations are executed one after another. Program sequencing and control instructions, however, enable you to change the flow of the program while it is running. This is best explained with an example. Consider the task of adding a list of n numbers. One possible sequence is the following (Kiltz et al., 2007):
• Move DATA1, R0
• Add DATA2, R0
• Add DATA3, R0
• …
• Add DATAn, R0
• Move R0, SUM
The addresses of the memory locations holding the n numbers are symbolically denoted DATA1, DATA2, …, DATAn, and each number is added to the contents of register R0 by a separate Add instruction. Once all the numbers have been added, the result is stored in memory location SUM. Rather than a long series of Add instructions, a single Add instruction can be placed in a program loop, as seen below (Meadows, 1993):
• Move N, R1
• Clear R0
• LOOP: Determine the address of the “Next” number and add the “Next” number to R0
• Decrement R1
• Branch > 0, LOOP
• Move R0, SUM
The loop is a sequence of instructions that is repeated as many times as necessary. It begins at location LOOP and ends at the Branch > 0 instruction. During each pass through the loop, the address of the next list entry is determined, and that entry is fetched and added to R0 (Ali et al., 2010; Kim, 2014). As discussed in the following section, an operand’s address may be specified in several different ways. For the time being, you need to understand how a program loop is set up and controlled. Suppose that memory location N contains the number of entries in the list, n. Register R1 is used as a counter to keep track of how many times the loop has been executed. Accordingly, at the start of the program, the contents of location N are loaded into register R1. Then, within the body of the loop, the Decrement R1 instruction reduces the contents of R1 by 1 each time through the loop. As long as the result of the decrement operation is greater than zero, the loop is executed again (Reilly et al., 2009; Burrell, 2004).
You should now be able to follow branch instructions. A branch instruction loads a new value into the program counter. As a consequence, rather than fetching and executing the instruction at the location that follows the branch instruction in sequential address order, the processor fetches and executes the instruction at this new address, called the branch target. A branch instruction may be unconditional or conditional. An unconditional branch instruction branches to the specified location regardless of any condition (Wupper and Meijer, 1998). A conditional branch instruction causes a branch only if a specified condition is met. If the condition is not met, the PC is incremented as usual, and the next instruction in sequential address order is fetched and executed. The Branch > 0 LOOP (branch if greater than zero) instruction in the previous example is a conditional branch instruction that causes a branch to location LOOP if the result of the immediately preceding instruction, which is the decremented value in register R1, is greater than 0 (Anderson and Jensen, 1975). The loop is therefore repeated as long as there are entries in the list that have not yet been added to R0. After the nth pass through the loop, the Decrement instruction produces a value of 0, and so the branch is not taken.
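The loop program can be emulated step by step as a minimal sketch. The register names, the labels N and SUM, and the Branch > 0 test follow the text; the list contents, the variable next_index, and the use of a Python while loop to stand in for the program counter are assumptions made for illustration.

```python
# Emulation of the loop program from the text. Register and label names follow
# the text; the list contents and the Python control structure are assumptions.
memory = {"N": 5, "DATA": [10, 20, 30, 40, 50], "SUM": 0}

r1 = memory["N"]        # Move N, R1      : loop counter
r0 = 0                  # Clear R0        : running total
next_index = 0          # position of the "Next" number in the list
passes = 0              # how many times the loop body has run

while True:             # LOOP:
    r0 += memory["DATA"][next_index]   # add the "Next" number to R0
    next_index += 1                    # advance to the following entry
    r1 -= 1                            # Decrement R1
    passes += 1
    if r1 > 0:                         # Branch > 0, LOOP
        continue                       # branch taken: go back to LOOP
    break                              # branch not taken: fall through

memory["SUM"] = r0      # Move R0, SUM
print(memory["SUM"], passes)
```

The counter in R1 reaches zero on the fifth pass, at which point the conditional branch falls through to the instruction that follows it.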
When the branch is not taken, the Move instruction is fetched and executed instead. It moves the final result from R0 into memory location SUM. Certain ISAs refer to branch instructions as jumps. For use by subsequent conditional branch instructions, the processor keeps track of information about the results of various operations (Yehezkel, 2002). It does so by recording the required information in individual bits, referred to as condition code flags. These flags are normally grouped together in a special processor register called the condition code register or status register. Individual condition code flags are set to 1 or cleared to 0 depending on the outcome of the operation performed. Sign, Zero, Overflow, and Carry are among the most widely used flags (Starr et al., 2008; Maltese et al., 1993).
Subroutines are used together with the call and return instructions. A subroutine is a self-contained sequence of instructions that performs a specific task. A subroutine can be called many times during the execution of a program to perform its function at different points in the main program. Each time a subroutine is called, a branch is made to the start of the subroutine to begin executing its instructions. After the subroutine has been executed, the return instruction is used to branch back to the main program.
Interrupts can also alter a program’s flow (Wu et al., 2011). A program interrupt occurs when control is transferred from the currently running program to another service program as a result of an internally or externally generated request. After the service program has completed, control returns to the original program. The interrupt procedure is, in principle, quite similar to a subroutine call except for three differences: (1) an interrupt is usually initiated by an internal or external signal rather than by the execution of an instruction; (2) the address of the interrupt service program is determined by the hardware, or from information supplied by the interrupt signal or the interrupt-causing instruction; and (3) an interrupt procedure usually stores all of the information necessary to define the state of the CPU rather than just the program counter (Van Heerden et al., 2012). When the processor is interrupted, it saves its current state, such as the register contents, the processor status word (PSW), and the return address, and then jumps to the interrupt handler or interrupt service routine. When the handler has finished, the processor returns to the main program. Interrupts are covered in greater depth in the following unit on I/O (Wang et al., 2021). Input and output instructions transfer information between memory, registers, and I/O devices.
Input/output transfers can be performed either with special instructions dedicated to I/O or with the ordinary memory-related instructions. If you are designing an embedded processor for a particular application, you will need to include instructions specific to that application. For a general-purpose processor, you include only general-purpose instructions (Maier and Größler, 1998). There are also saturating arithmetic operations, multiply-and-accumulate instructions, and instructions that aim to exploit data-level parallelism, in which the same operation, such as addition or subtraction, is performed on different data. The data types and sizes indicate the kinds of data the processor supports and their lengths. Common operand types include single-precision floating point (1 word) and double-precision floating point (2 words). Floating-point numbers follow the IEEE 754 standard; packed and unpacked decimal numbers may also be supported (Meisel et al., 2010).
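To make the two floating-point sizes concrete, the sketch below packs the value 1.0 in both IEEE 754 formats with Python’s standard struct module and then splits the single-precision pattern into its fields. It assumes a 32-bit word, so that one word is 4 bytes and two words are 8 bytes; the choice of the value 1.0 and of big-endian byte order is arbitrary.

```python
import struct

# Pack the same value as IEEE 754 single and double precision (big-endian byte order).
single = struct.pack(">f", 1.0)   # 32 bits: 1 sign, 8 exponent, 23 fraction bits
double = struct.pack(">d", 1.0)   # 64 bits: 1 sign, 11 exponent, 52 fraction bits

print(len(single), single.hex())  # length in bytes and the raw bit pattern
print(len(double), double.hex())

# Split the single-precision pattern into its three fields.
bits = int.from_bytes(single, "big")
sign = bits >> 31
exponent = (bits >> 23) & 0xFF    # stored with a bias of 127
fraction = bits & 0x7FFFFF
print(sign, exponent, fraction)
```

For 1.0 the stored exponent equals the bias and the fraction is zero, which is why the bit pattern looks so sparse.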
2.4.2. Addressing Modes
The operation field of an instruction specifies the operation to be performed. This operation is carried out on data that is immediately available in the instruction or that is stored in the computer’s registers or memory words. The addressing mode of the instruction determines how the operands are selected during program execution. The addressing mode specifies a rule for interpreting or modifying the address field of the instruction before the operand is actually referenced. The most important addressing modes found in current CPUs are covered in this section (Chow et al., 1987). Computers use addressing mode techniques for one or both of the following purposes:
• To give the user programming flexibility by providing facilities such as loop control counters, indexing of data, memory pointers, and program relocation.
• To reduce the number of bits in the addressing field of the instruction.
Constants, pointers, global and local variables, and arrays are all used in high-level programming languages. When translating a high-level language program into assembly language, the compiler must be able to implement these constructs using the facilities provided by the instruction set of the machine on which the program will run. Addressing modes are the different ways in which the location of an operand is specified in an instruction. The most basic data types are constants and variables, which are found in practically every computer program. In assembly language, a variable is represented by allocating a register or a memory location to hold its value (Hartley, 1992).
• Register Mode: The operand is the contents of a processor register; the name (address) of the register is given in the instruction.
• Absolute Mode: The operand is in a memory location, and the address of this location is given explicitly in the instruction. This mode is often referred to as direct.
• Immediate Mode: This mode can be used to represent address and data constants in assembly language (Fiskiran et al., 2001).
In immediate mode, the operand is given explicitly in the instruction. The instruction Move 200 immediate, R0, for instance, places the value 200 in register R0. The immediate mode can only be used to specify the value of a source operand. A common convention is to use the sharp sign (#) in front of a value to indicate that it is to be used as an immediate operand. We therefore write the above instruction as Move #200, R0. Constant values are widely used in high-level language programs. The expression A = B + 6, for example, contains the constant 6. Assuming that A and B have been declared as variables and can be accessed using the absolute mode, this statement can be compiled as follows (Bolanakis et al., 2008):
• Move B, R1
• Add #6, R1
• Move R1, A
In assembly language, constants are also used to increment a counter, test a bit pattern, and so on.
• Indirect Mode: In the addressing modes that follow, the instruction does not give the operand or its address explicitly. Instead, it provides information from which the memory address of the operand can be determined. This address is referred to as the effective address (EA) of the operand. In indirect mode, the EA of the operand is the contents of a register or memory location whose address appears in the instruction. Indirection is indicated by placing the name of the register or the memory address given in the instruction in parentheses (Freudenthal and Carter, 2009). Consider, for instance, the instruction Add (R1), R0. To execute the Add instruction, the processor uses the value in register R1 as the EA of the operand. It requests a read operation from memory to read the contents of this location. The CPU then adds the value read to the contents of register R0. Indirect addressing through a memory location is also possible, as in the instruction Add (A), R0. In this case, the processor first reads the contents of memory location A and then requests a second read operation, using this value as an address, to obtain the operand. A register or memory location that holds the address of an operand is called a pointer. Indirection and the use of pointers are important and powerful concepts in programming. In this example, changing the contents of location A causes different operands to be fetched and added to register R0 (El-Aawar, 2008).
• Index Mode: The next addressing mode provides more flexibility for accessing operands. It is useful when working with lists and arrays. In this mode, the EA of the operand is generated by adding a constant value (displacement) to the contents of a register. The register used may be a special register provided for this purpose or one of the processor’s general-purpose registers (Maltese and Ferrara, 1996). In either case, it is referred to as an index register. The Index mode is indicated symbolically as X(Ri), where X denotes the constant value contained in the instruction and Ri is the name of the register involved. The EA of the operand is EA = X + [Ri]. The contents of the index register are not changed in the process of generating the EA. In an assembly language program, the constant X may be given either as an explicit number or as a symbolic name representing a numerical value. When the instruction is translated into machine code, the constant X is included in the instruction and is usually represented by fewer bits than the word length of the computer. Because X is a signed integer, it must be sign-extended to the register length before being added to the contents of the register (Patterson and Sequin, 1998).
• Relative Mode: In the preceding discussion, the Index mode was defined using general-purpose processor registers. A useful version of this mode is obtained when the program counter, PC, is used instead of a general-purpose register, as shown in Figure 4.2 (Patterson and Ditzel, 1980). Then X(PC) can be used to address a memory location that is X bytes away from the location currently pointed to by the program counter. Because the addressed location is identified “relative” to the program counter, which always identifies the current execution point in a program, the name Relative mode is associated with this type of addressing.
In this case, the EA is determined by the Index mode using the program counter in place of the general-purpose register Ri. This addressing mode is most often used for control flow instructions (Hamza, 2017). It can also be used to access data operands, but its most common use is to specify the target address in branch instructions. An instruction such as Branch > 0 LOOP causes program execution to go to the branch target location identified by the name LOOP if the branch condition is satisfied. This location can be computed by specifying it as an offset from the current value of the program counter. Because the branch target may be either before or after the branch instruction, the offset is given as a signed number. Recall that during the execution of an instruction, the CPU increments the PC to point to the next instruction. Most computers use this updated value in computing the effective address in Relative mode (Ichikawa et al., 1992). The following two modes are useful for accessing data items in successive memory locations.
• Autoincrement Mode: The EA of the operand is the contents of a register specified in the instruction. After the operand is accessed, the contents of this register are automatically incremented to point to the next item in a list (Cohen, 1992). The Autoincrement mode is denoted by putting the specified register in parentheses, to show that the contents of the register are used as the EA, followed by a plus sign, to indicate that these contents are to be incremented after the operand is accessed. Thus, the Autoincrement mode is written as (Ri)+.
• Autodecrement Mode: As a companion to the Autoincrement mode, the Autodecrement mode accesses the items of a list in the reverse order (Hutty, 1984). In this mode, the contents of a register specified in the instruction are first decremented and then used as the EA of the operand. The Autodecrement mode is denoted by putting the specified register in parentheses, preceded by a minus sign, to indicate that the contents of the register are to be decremented before being used as the EA. Thus, we write -(Ri). In this mode, operands are accessed in descending address order. You may wonder why the address is decremented before it is used in the Autodecrement mode but incremented after it is used in the Autoincrement mode. The reason is that these two modes can be used together to implement a stack (Morris, 1985).
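The effective-address rules for the modes in this section can be summarized in a short sketch. The register names, the memory contents, the function names, and the 4-byte operand size used by the automatic modes are assumptions chosen for the example; the final lines illustrate how the Autodecrement and Autoincrement modes together behave like a stack.

```python
# Effective-address (EA) calculation for the modes described above.
# Register names, memory contents, and the 4-byte operand size are assumptions.
regs = {"R1": 1000, "R2": 2000, "PC": 500}
memory = {1000: 42, 1020: 7, 3000: 1000}
WORD = 4  # assumed operand size in bytes


def ea_absolute(address):
    return address                      # EA is given directly in the instruction


def ea_indirect_register(reg):
    return regs[reg]                    # EA = [Ri]


def ea_indirect_memory(address):
    return memory[address]              # EA = contents of the named memory location


def ea_index(x, reg):
    return x + regs[reg]                # EA = X + [Ri]; Ri is left unchanged


def ea_relative(x):
    return x + regs["PC"]               # EA = X + [PC]; X may be negative


def ea_autoincrement(reg):
    ea = regs[reg]                      # use the register contents first ...
    regs[reg] += WORD                   # ... then increment: (Ri)+
    return ea


def ea_autodecrement(reg):
    regs[reg] -= WORD                   # decrement first ...
    return regs[reg]                    # ... then use as the EA: -(Ri)


print(ea_absolute(1000), ea_indirect_register("R1"), ea_indirect_memory(3000))
print(ea_index(20, "R1"), ea_relative(-12))

# A stack built from the two automatic modes: push with -(R2), pop with (R2)+.
memory[ea_autodecrement("R2")] = 11     # push 11
memory[ea_autodecrement("R2")] = 22     # push 22
popped = [memory[ea_autoincrement("R2")], memory[ea_autoincrement("R2")]]
print(popped, regs["R2"])
```

Because the decrement happens before a push and the increment after a pop, the register always points at the most recently pushed item, and the values come back in last-in, first-out order.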
2.4.3. Instruction Formats
The preceding sections showed that the processor can execute different types of instructions and that the operands can be specified in a variety of ways. Once all of this has been decided, the information must be presented to the processor in the form of an instruction format. The bits of the instruction are divided into groups called fields. The most common fields found in instruction formats are the following (Liang, 2019):
• An operation code field that specifies the operation to be performed. The number of bits in this field determines the number of operations that can be specified;
• An address field that designates a memory address or a processor register. The number of bits depends on the size of memory or the number of registers;
• A mode field that specifies the way the operand or the EA is determined. This depends on the number of addressing modes supported by the CPU.
Depending on the kind of ISA, there may be 1, 2, or 3 address fields. Keep in mind also that the length of an instruction varies with the number of operands supported and the sizes of the individual fields. Some processors fit all instructions into a single fixed-size format, while others use formats of different sizes. You therefore have the choice between a fixed and a variable format (Anger et al., 1994).
Memory address interpretation: there are two ways of interpreting memory addresses, the big-endian and the little-endian arrangement. Memories are normally organized in bytes, and each address of a memory location holds 8 bits of data. The word length of the processor, however, may be more than one byte. Consider a 32-bit CPU, whose word is made up of 4 bytes. These 4 bytes span four memory locations (Ziefle, 2008). When you specify the address of a word, do you give the address of the most significant byte as the address of the word (big end) or the address of the least significant byte (little end)? That is the difference between the big-endian and the little-endian arrangement. The big-endian arrangement is used by Motorola, IBM, and HP, whereas the little-endian arrangement is used by Intel. There is also the question of alignment when data spans several memory locations: an access to a word that starts at a word boundary is aligned.
If you access a word that does not start at a word boundary, the access may still be possible, but it is not aligned. Whether or not misaligned data is supported is a design issue. Even where misaligned accesses are permitted, they typically take additional memory cycles (Tokoro et al., 1977).
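The two byte orders and the alignment test can be illustrated with a short sketch. The value 0x12345678, the base address 0x100, and the helper name is_aligned are arbitrary choices for this example; the alignment check uses the usual rule that an address is aligned when it is a multiple of the access size.

```python
# Byte layout of one 32-bit word under the two arrangements.
# The value and the base address are arbitrary example choices.
value = 0x12345678
base = 0x100

big = value.to_bytes(4, "big")        # most significant byte at the lowest address
little = value.to_bytes(4, "little")  # least significant byte at the lowest address

for offset in range(4):
    print(f"address {base + offset:#x}: big-endian {big[offset]:02x}  "
          f"little-endian {little[offset]:02x}")


def is_aligned(address, size):
    """An access of `size` bytes is aligned when the address is a multiple of size."""
    return address % size == 0


print(is_aligned(0x100, 4), is_aligned(0x102, 4))
```

The same four bytes are stored in both cases; only the order in which they occupy the four consecutive addresses differs.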
Ultimately, when it comes to the function of compilers in creating the instruction set architecture, the compiler has several responsibilities. The days have been gone when architectures and compilers were assumed to be independent of one another. Whenever the compiler comprehends the interior design of the processor, it would be capable of writing more efficient code. Consequently, the architecture should show itself to the compiler, and the compiler should utilize all reachable hardware. The ISA must be easy to compile. The orthogonality, regularity, and the flexibility to balance alternatives are the three key approaches wherein the instruction set architecture can assist the compiler (Asaad, 2021). Eventually, all of the characteristics of an ISA have been explained concerning the 80×86 and MIPS processors: •
•
•
ISA Class: Nearly all ISAs are classified as general-purpose register architectures, using memory or register locations as operands. The MIPS has 32 general-purpose registers and 32 floating-point registers, whereas the 80×86 has 16 floating-point registers and 16 general-purpose registers. The register-memory ISAs, like the 80×86, that may access memory as a component of several instructions, and the load-store ISAs, like MIPS, that may only access memory via load or store instructions, are the two most prevalent forms of this class. All of the most current ISAs are load-store ISAs (Aditya et al., 2000). Memory Addressing: Byte addressing is used to access memory operands in essentially all server and desktop computers, such as the 80×86 and MIPS. Object alignment is required by certain architectures, such as MIPS. If A mod s = 0, access to an object of size as bytes at byte location A is arranged. Although the 80×86 does not need operand alignment, accesses are often quicker when operands are aligned (Liang, 2020). Addressing Modes: These indicate the address of a memory item in addition to constant and registers operands. Immediate (for constants), Register, and Displacement are MIPS addressing modes, in which a constant offset is appended to a register to produce the memory address. The 80×86 provides those 3+3 displacement variations: two registers (indexed having displacement), no register (absolute), and two registers wherein a register is incremented through the operand size in bytes (depending upon displacement and scaled index). It’s similar to
Dissecting Computer Architecture
70
•
•
•
•
the last 3, but without the displacement field: indirect register, scaled index, and indexed (Cluley, 1987). The Operand Sizes and Types: These are supported by 80×86 and MIPS: 64-bit (long integer or doubleword), 32-bit (integer or word), 16-bit (Unicode character or half word), 8-bit (ASCII character), and IEEE 754 floating point in 64-bit (double precision) and 32-bit (single precision). The 8086 is also capable of supporting an 80-bit floating-point operation (extensive double precision) (Lee and Park, 2010). Operations: Data transmission, control, floating-point, and arithmetic logical operations are the four major kinds of operations. MIPS represents the RISC architectures which are used in 2006. It is a basic and easy-to-pipeline instruction set architecture. The 80×86 has a significantly more extensive and comprehensive range of operations (Lancioni et al., 2000). Instructions for the Control Flow: Unconditional jumps, conditional branches, returns, and procedure calls are supported by virtually all ISAs, including the 80×86 and MIPS. Both employ PC-relative addressing, which specifies the branch address via an address field appended to the PC. There have been a few minor distinctions. MIPS conditional branches (BNE, BE, etc.), verify the contents of registers, whereas 80×86 conditional branches (JNE, JE, etc.), verify condition code bits set as a result of logic/ arithmetic operations. The MIPS procedure call (JAL) keeps the return address in a register, whereas the 80×86 call (CALLF) keeps it on a memory stick (Germain et al., 2000). ISA Encoding: There have been two fundamental encoding options: variable and fixed length. The length of all MIPS commands is 32 bits which facilitates command decoding. The length of the 80×86 encoding is varied and ranges from 1 to 18 bytes. Because variable-length instructions take up less space as compared to fixed-length instructions, a program built for the 80×86 is often smaller than a program compiled for MIPS. 
Note that the choices described above affect how instructions are encoded in binary. For instance, the number of registers and the number of addressing modes both have a considerable effect on instruction size, because the register field and the addressing mode field may appear several times within a single instruction (Sites, 1993).
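To make the link between field widths and instruction size concrete, the following is a minimal sketch of fixed-length encoding. It assumes the standard MIPS32 R-type layout (6-bit opcode, three 5-bit register fields, 5-bit shift amount, 6-bit function code), which is not spelled out in the text above; the helper name `encode_r_type` is hypothetical.

```python
def encode_r_type(rs: int, rt: int, rd: int, shamt: int, funct: int, opcode: int = 0) -> int:
    """Pack the six R-type fields into one 32-bit word (assumed MIPS32 layout)."""
    # (value, width in bits), listed from the most significant field down.
    fields = [(opcode, 6), (rs, 5), (rt, 5), (rd, 5), (shamt, 5), (funct, 6)]
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        word = (word << width) | value
    return word


# add $t0, $t1, $t2: rd = $t0 (register 8), rs = $t1 (9), rt = $t2 (10), funct = 0x20
word = encode_r_type(rs=9, rt=10, rd=8, shamt=0, funct=0x20)
print(f"{word:#010x}")  # 0x012a4020
```

Three register fields of 5 bits each already consume 15 of the 32 bits, so doubling the register count to 64 would cost three extra bits in this format. That is the kind of trade-off the paragraph above refers to.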
2.5. MICROARCHITECTURE
Microarchitecture, also known as computer organization and sometimes abbreviated as µarch or uarch in computer engineering, is the way a given instruction set architecture is implemented in a particular processor. A given instruction set architecture may be implemented with different microarchitectures; implementations may vary because of different design goals or advances in technology (Figure 2.5) (Ronen et al., 2001).
Figure 2.5. Microarchitecture illustration. Source: https://en.wikipedia.org/wiki/Microarchitecture.
2.5.1. The Instruction Cycle
To run programs, all single-chip and multi-chip CPUs must do the following (Sinharoy et al., 2005):
• Read an instruction and decode it;
• Locate any additional data required to process the instruction;
• Process the instruction;
• Write out the results.
The instruction cycle repeats continuously until the power is switched off.
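The four steps above can be sketched as a small interpreter loop. This is only an illustration under stated assumptions: the three-opcode instruction format, the four-register file, and the `HALT` opcode (used so the loop ends without switching the power off) are all invented for the example.

```python
def run(program, memory):
    """Repeat the four-step instruction cycle until a HALT instruction is read."""
    registers = [0] * 4          # hypothetical four-register machine
    pc = 0                       # program counter
    while True:
        # Step 1: read the instruction and decode it into its fields.
        op, dst, a, b = program[pc]
        pc += 1
        if op == "HALT":
            return registers
        # Step 2: locate the data the instruction needs.
        if op == "LOAD":
            x, y = memory[a], 0
        else:
            x, y = registers[a], registers[b]
        # Step 3: process the instruction.
        if op == "LOAD":
            result = x
        elif op == "ADD":
            result = x + y
        elif op == "SUB":
            result = x - y
        else:
            raise ValueError(f"unknown opcode {op!r}")
        # Step 4: write out the result.
        registers[dst] = result


program = [
    ("LOAD", 0, 0, 0),   # r0 = memory[0]
    ("LOAD", 1, 1, 0),   # r1 = memory[1]
    ("ADD", 2, 0, 1),    # r2 = r0 + r1
    ("HALT", 0, 0, 0),
]
print(run(program, memory=[7, 5]))  # [7, 5, 12, 0]
```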
2.5.2. Multicycle Microarchitecture
Historically, the earliest computers were multicycle designs, and many of the smallest and cheapest computers still use this approach. Multicycle designs often use the fewest total logic elements and consume little power. They can be designed for predictable timing and high reliability; in particular, they have no pipeline to stall when taking conditional branches or interrupts. Other microarchitectures, on the other hand, often execute more instructions per unit of time while using the same logic family. When “improved performance” is discussed, the improvement is often measured relative to a multicycle design (Harris and Harris, 2016).
In a multicycle computer, the four steps are performed in sequence over several clock cycles. Some designs can complete the sequence in two clock cycles by executing successive stages on alternate clock edges, possibly with longer operations taking place outside the main cycle: for example, stage one on the rising edge of the first cycle, stage two on the falling edge of the first cycle, and so on (Cong et al., 2004).
In the control logic, the cycle counter, the cycle state (high or low), and the bits of the instruction decode register determine exactly what each part of the computer should be doing. The control logic can be designed from a table of bits that describes the control signals sent to each part of the computer in each cycle of each instruction. The logic table can then be tested in a software simulation running test code. If the logic table is stored in a memory and used to actually run a real computer, it is called a microprogram. In some computer designs, the logic table is instead optimized into combinational logic built from logic gates, usually with the help of logic optimization software. Before Maurice Wilkes invented this tabular approach and called it microprogramming, early computers used ad hoc logic design for control (Lim et al., 2009).
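The tabular idea behind microprogramming can be sketched as a lookup from (instruction, cycle) to a word of control bits. The signal names and the two instructions below are hypothetical, chosen only to show the shape of such a table; they do not describe any real machine.

```python
# Control signals of a toy machine, in a fixed bit order (illustrative names only).
SIGNALS = ("mem_read", "ir_load", "alu_add", "reg_write")

# One row per cycle of each instruction: which signals are asserted in that cycle.
MICROPROGRAM = {
    ("LOAD", 0): {"mem_read": 1, "ir_load": 1},    # fetch the instruction
    ("LOAD", 1): {"mem_read": 1, "reg_write": 1},  # read data, write register
    ("ADD", 0): {"mem_read": 1, "ir_load": 1},     # fetch the instruction
    ("ADD", 1): {"alu_add": 1},                    # compute the sum
    ("ADD", 2): {"reg_write": 1},                  # write the result
}


def control_word(opcode, cycle):
    """Return the control bits for one cycle of one instruction."""
    row = MICROPROGRAM[(opcode, cycle)]
    return tuple(row.get(name, 0) for name in SIGNALS)


def cycles_for(opcode):
    """Count how many cycles (table rows) an instruction occupies."""
    return sum(1 for (op, _cycle) in MICROPROGRAM if op == opcode)


print(control_word("ADD", 1))  # (0, 0, 1, 0)
print(cycles_for("ADD"))       # 3
```

Because the table is plain data, it can be exercised in simulation before any hardware exists, which is the testing step the text describes.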
2.5.3. Increasing Execution Speed
This seemingly straightforward series of steps is complicated by the fact that the memory hierarchy, which comprises cache, main memory, and non-volatile storage such as hard drives (where the program instructions and data reside), is slower than the CPU itself. The second step, locating the data an instruction needs, frequently introduces a long (in CPU terms) delay while the data is transferred across the computer bus (Koufaty and Marr, 2003). Designs that avoid such delays as far as possible have been the subject of extensive research. Over the years, a central goal has been to execute more instructions in parallel, increasing the effective execution speed of a program. These efforts introduced complicated logic and circuit structures. At first, such techniques could only be implemented on expensive mainframes or supercomputers because of the amount of circuitry they required. As semiconductor manufacturing progressed, more and more of these techniques could be implemented on a single semiconductor chip, a trend described by Moore’s law (Sinharoy et al., 2005).
2.5.4. Instruction Set Choice
Over the years, instruction sets have shifted from being very simple to being, in some respects, very complex. Load-store architectures, as well as VLIW and EPIC designs, have been popular in recent years. Architectures that deal with data parallelism include SIMD and vector architectures (Atasu et al., 2003). Some of the labels used to denote classes of CPU architectures are not particularly descriptive, especially the “CISC” label; many early designs that were retroactively labeled “CISC” are in fact considerably simpler than modern RISC processors. The choice of instruction set architecture can nevertheless have a significant effect on the difficulty of implementing high-performance devices (Koufaty and Marr, 2003). The dominant strategy used to develop the first RISC processors was to reduce each instruction to a minimum of semantic complexity while keeping the encoding highly regular and simple. Such uniform instructions were easily fetched, decoded, and executed in a pipelined fashion, and this simple strategy reduced the number of logic levels so that high operating frequencies could be reached. Instruction caches compensated for the higher operating frequency and the inherently lower code density, and large register sets were used to factor out as many of the (slow) memory accesses as possible (Chou et al., 2004).
2.5.5. Instruction Pipelining
Instruction pipelining is one of the first and most powerful techniques for improving performance. Early processor designs carried out all of the steps listed above for one instruction before moving on to the next. Large portions of the circuitry were left idle at any given step; for example, the instruction decoding circuitry was idle during execution, and so on (Koufaty and Marr, 2003).
Pipelining improves performance by allowing several instructions to work their way through the processor at the same time. In the same basic example, the processor would start to decode (step one) a new instruction while the previous one was waiting for results. Up to four instructions can be “in flight” at one time, making the processor appear four times as fast. Although any one instruction takes just as long to complete (there are still four steps), the CPU as a whole “retires” instructions much faster (Sinharoy et al., 2005).
RISC makes pipelines smaller and much easier to construct by cleanly separating each stage of the instruction process and making each stage take the same amount of time: one cycle. The processor as a whole operates like an assembly line, with instructions coming in on one side and results emerging on the other. Because of the reduced complexity of the classic RISC pipeline, the pipelined core and an instruction cache could be placed on a die of the same size that would otherwise hold the core alone in a CISC design. This was the real reason that RISC was faster. Early designs such as SPARC and MIPS often ran over 10 times as fast as Intel and Motorola CISC solutions at the same clock speed and price (Marr et al., 2002).
Pipelines are by no means limited to RISC designs. By 1986, the top-of-the-line VAX implementation (VAX 8800) was a heavily pipelined design, slightly predating the first commercial MIPS and SPARC designs. Most modern CPUs (even embedded CPUs) are pipelined, and microcoded CPUs without pipelining are found only in the most area-constrained embedded processors. Large CISC machines, from the VAX 8800 to the Pentium 4 and Athlon, are implemented with both microcode and pipelines. Improvements in pipelining and caching are the two major microarchitectural advances that have allowed processor performance to keep pace with the circuit technology on which it is based (Ronen et al., 2001).
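The “four times as fast” claim can be checked with a short calculation. The sketch below assumes an idealized pipeline in which every stage takes exactly one cycle and nothing ever stalls; the function names are hypothetical.

```python
def unpipelined_cycles(n_instructions: int, stages: int) -> int:
    """Each instruction runs all stages before the next one starts."""
    return n_instructions * stages


def pipelined_cycles(n_instructions: int, stages: int) -> int:
    """The first instruction fills the pipeline; each later one retires a cycle apart."""
    if n_instructions == 0:
        return 0
    return stages + n_instructions - 1


def speedup(n_instructions: int, stages: int) -> float:
    """Ratio of unpipelined to pipelined cycle counts."""
    return unpipelined_cycles(n_instructions, stages) / pipelined_cycles(n_instructions, stages)


print(unpipelined_cycles(4, 4))       # 16
print(pipelined_cycles(4, 4))         # 7
print(round(speedup(1000, 4), 3))     # 3.988
```

The speedup approaches the number of stages only for long instruction streams, and only in the absence of the stalls discussed in the sections on caches and branch prediction.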
2.5.6. Cache
It was not long before advances in chip manufacturing allowed even more circuitry to be placed on the die, and designers began looking for ways to use it. One of the most common was to add an ever-increasing amount of cache memory on-die. Cache is memory that is both very fast and expensive. It can be accessed in a few cycles, rather than the many needed to “talk” to main memory. The CPU includes a cache controller that automates reading from and writing to the cache. If the data is already in the cache, it is accessed from there, at a considerable saving in time; if it is not, the processor “stalls” while the cache controller reads it in (Mak et al., 2009).
RISC designs began to include cache in the second half of the 1980s, often only 4 KB in total. This amount has grown over time: typical CPUs now have at least 2 MB of cache, more powerful CPUs come with 4, 6, 12, or even 32 MB or more, and the recent EPYC Milan-X series has as much as 768 MB, organized in multiple levels of a memory hierarchy. In general, more cache means better performance because stalling is reduced (Rotenberg et al., 1999).
Caches and pipelines were a perfect match for each other. Previously, it made little sense to build a pipeline that could run faster than the access latency of off-chip memory. Using on-chip cache memory instead meant that a pipeline could run at the speed of the cache access latency, a much shorter length of time. This allowed the operating frequencies of processors to increase at a much faster rate than that of off-chip memory (Talpes and Marculescu, 2005).
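The hit-or-stall behavior of the cache controller can be sketched as follows. The text does not say how a cache is organized, so this example assumes the simplest common organization, a direct-mapped cache, and tracks only which block occupies each line (no data is stored). The class name and the sizes are hypothetical.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that records hits and misses only."""

    def __init__(self, num_lines: int = 4, line_size: int = 16):
        self.num_lines = num_lines
        self.line_size = line_size          # bytes per cache line
        self.tags = [None] * num_lines      # tag currently held by each line
        self.hits = 0
        self.misses = 0

    def access(self, address: int) -> bool:
        """Return True on a hit; on a miss, load the block (the CPU would stall)."""
        block = address // self.line_size   # which memory block holds the address
        index = block % self.num_lines      # the only line that block may occupy
        tag = block // self.num_lines       # identifies the block within that line
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag              # evict whatever was there before
        self.misses += 1
        return False


cache = DirectMappedCache()
results = [cache.access(addr) for addr in (0, 4, 8, 64, 0)]
print(results)                    # [False, True, True, False, False]
print(cache.hits, cache.misses)   # 2 3
```

Addresses 0 and 64 map to the same line in this configuration, so they evict each other. A larger cache reduces such conflicts, which is one reason more cache generally means fewer stalls.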
2.5.7. Branch Prediction
The main obstacle to achieving higher performance through instruction-level parallelism is pipeline stalls and flushes caused by branches. Because conditional branches depend on results held in a register, it is usually not known whether a conditional branch will be taken until late in the pipeline (Jiménez, 2005). Between the time the processor’s instruction decoder recognizes that it has encountered a conditional branch instruction and the time the deciding register value can be read out, the pipeline must either be stalled for several cycles or, if it is not stalled and the branch is taken, be flushed. As clock rates increase, the depth of the pipeline increases with them, and some modern processors have 20 stages or more. On average, every fifth instruction executed is a branch, so without any intervention this amounts to a significant amount of stalling (Wenisch et al., 2005).
Techniques such as branch prediction and speculative execution are used to reduce these branch penalties. In branch prediction, the hardware makes an educated guess as to whether a particular branch will be taken. In practice, one side of a branch is usually taken much more often than the other (Lipasti and Shen, 1997). Most modern designs contain rather complex statistical prediction systems, which track the outcomes of past branches to predict future ones with greater accuracy. The guess allows the hardware to prefetch instructions without waiting for the register to be read. Speculative execution goes one step further: the code along the predicted path is not only prefetched but also executed before it is known whether the branch should be taken. This yields better performance when the guess is correct, at the risk of a large penalty when the guess is wrong, because the instructions must be undone (Haskins and Skadron, 2003).
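As an illustration of prediction from past outcomes, the sketch below uses a table of two-bit saturating counters. This is one classic textbook scheme, chosen here as an assumption; the text does not say which algorithm any particular processor uses, and the class name and table size are hypothetical.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by the branch address."""

    def __init__(self, table_size: int = 16):
        self.table_size = table_size
        # Counter values: 0-1 predict not taken, 2-3 predict taken.
        self.counters = [1] * table_size    # start at "weakly not taken"

    def predict(self, pc: int) -> bool:
        return self.counters[pc % self.table_size] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = pc % self.table_size
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor: TwoBitPredictor, pc: int, outcomes: list) -> float:
    """Fraction of outcomes predicted correctly, updating after each branch."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(outcomes)


# A loop branch: taken nine times, then falls through once.
loop_branch = [True] * 9 + [False]
print(accuracy(TwoBitPredictor(), pc=0x40, outcomes=loop_branch))  # 0.8
```

The predictor is wrong only on the first iteration (before it has learned anything) and on the loop exit, which reflects the observation that one side of a branch is usually taken far more often than the other.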
2.5.8. Superscalar
Even with all of the added complexity and gates needed to support the concepts described above, advances in semiconductor manufacturing soon allowed even more logic gates to be used (Smith and Sohi, 1995). In the outline above, the processor handles parts of a single instruction at a time. Programs can be executed faster if several instructions are processed simultaneously. This is what superscalar processors achieve, by replicating functional units such as ALUs. Replicating functional units only became practical once the die area of a single-issue processor no longer stretched the limits of what could be reliably manufactured. By the late 1980s, superscalar designs were beginning to appear on the market (Palacharla et al., 1997).
A typical modern design has two load units, one store unit (since many instructions produce no results to store), two or more integer math units, two or more floating-point math units, and often a SIMD unit of some kind. The instruction issue logic grows in complexity because it must read a large list of instructions from memory and hand them off to whichever execution units are idle at that point. The results are then collected and reordered at the end (Choudhary et al., 2012).
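The job of the issue logic, handing instructions to whichever units are idle, can be sketched with a greedy scheduler. This is a deliberately simplified model: it assumes every instruction is independent of the others and every unit finishes in one cycle, and it reuses the unit counts mentioned above. The names `UNITS` and `issue_schedule` are hypothetical.

```python
from collections import Counter

# Execution units per kind, following the typical mix described in the text.
UNITS = {"load": 2, "store": 1, "int": 2, "fp": 2}


def issue_schedule(instructions, units=UNITS):
    """Return a list of cycles; each cycle lists the instruction kinds issued in it."""
    pending = list(instructions)
    schedule = []
    while pending:
        free = Counter(units)            # all units are idle at the start of a cycle
        issued, waiting = [], []
        for kind in pending:
            if free[kind] > 0:           # an idle unit of the right kind exists
                free[kind] -= 1
                issued.append(kind)
            else:                        # every such unit is busy: wait a cycle
                waiting.append(kind)
        if not issued:
            raise ValueError(f"no execution unit can run {pending[0]!r}")
        schedule.append(issued)
        pending = waiting
    return schedule


stream = ["int", "int", "int", "load", "store", "fp"]
for cycle, kinds in enumerate(issue_schedule(stream)):
    print(cycle, kinds)
# 0 ['int', 'int', 'load', 'store', 'fp']
# 1 ['int']
```

Six instructions complete in two cycles instead of six. Real issue logic must also check data dependences between instructions, which this sketch ignores.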
2.5.9. Out-of-Order Execution
The addition of caches reduces the frequency and duration of stalls caused by waiting for data to be fetched from the memory hierarchy, but it does not eliminate them. In early designs, a cache miss forced the cache controller to stall the processor and wait. There may, however, be another instruction in the program whose data is already available in the cache. Out-of-order execution allows that ready instruction to be processed while an older instruction waits on the cache, and then reorders the results to make it appear that everything happened in the programmed order. The same technique also avoids other operand dependency stalls, such as an instruction awaiting the result of a long-latency floating-point operation or another multicycle operation (Capalija and Abdelrahman, 2012).
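The benefit can be illustrated with a toy timing model. The sketch below assumes that at most one instruction starts per cycle, that an instruction may start once all the instructions it depends on have finished, and that results are retired in program order. The instruction names, latencies, and function names are invented for the example.

```python
def in_order_time(instrs):
    """Blocking model: the processor waits for each instruction before the next."""
    return sum(latency for _name, latency, _deps in instrs)


def out_of_order(instrs):
    """Start any ready instruction each cycle; retire results in program order."""
    names = {name for name, _latency, _deps in instrs}
    for _name, _latency, deps in instrs:
        if not set(deps) <= names:
            raise ValueError("dependency on an unknown instruction")
    done_at = {}                         # name -> cycle at which its result is ready
    remaining = list(instrs)
    cycle = 0
    while remaining:
        for instr in remaining:
            name, latency, deps = instr
            if all(d in done_at and done_at[d] <= cycle for d in deps):
                done_at[name] = cycle + latency
                remaining.remove(instr)
                break                    # at most one instruction starts per cycle
        cycle += 1
    # Reorder: an instruction retires only after all older ones have retired.
    retire, last = [], 0
    for name, _latency, _deps in instrs:
        last = max(last, done_at[name])
        retire.append((name, last))
    return done_at, retire


program = [
    ("load_a", 10, []),          # cache miss: long latency
    ("use_a", 1, ["load_a"]),    # must wait for load_a
    ("load_b", 1, []),           # cache hit, independent of the above
    ("use_b", 1, ["load_b"]),
]
done_at, retire = out_of_order(program)
print(in_order_time(program))    # 13
print(done_at["load_b"])         # 2
print(retire[-1][1])             # 11
```

The independent pair `load_b` and `use_b` completes while `load_a` is still waiting on memory, yet nothing retires ahead of an older instruction, so the program-visible order is preserved.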
2.5.10. Register Renaming
Register renaming is a technique for avoiding unnecessary serialized execution of instructions that happen to reuse the same registers. Suppose two groups of instructions use the same register. Normally, the first group must execute before the second so that the register is free for the second group to use. If the second group is instead assigned a different but equivalent register, the two groups can be executed in parallel as well as in series (Sinharoy et al., 2005).
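A minimal sketch of the idea follows. It assumes an unlimited supply of physical registers and gives every write a fresh one, so that two groups reusing the same architectural register no longer collide. The `(dest, src1, src2)` tuple format and the function name are hypothetical, and a real design would recycle physical registers through a free list.

```python
def rename(instructions, num_arch_regs: int = 4):
    """Map architectural registers to physical ones; each write gets a fresh register.

    instructions: list of (dest, src1, src2) architectural register numbers.
    Returns the same instructions expressed in physical register numbers.
    """
    mapping = {r: r for r in range(num_arch_regs)}   # architectural -> physical
    next_physical = num_arch_regs
    renamed = []
    for dest, src1, src2 in instructions:
        # Read the sources through the current mapping before changing it.
        p1, p2 = mapping[src1], mapping[src2]
        mapping[dest] = next_physical                # fresh register for this write
        renamed.append((next_physical, p1, p2))
        next_physical += 1
    return renamed


code = [
    (1, 2, 3),   # group 1: r1 = r2 op r3
    (0, 1, 1),   # group 1: r0 = r1 op r1
    (1, 2, 2),   # group 2: r1 = r2 op r2  (reuses r1)
    (3, 1, 1),   # group 2: r3 = r1 op r1
]
print(rename(code))  # [(4, 2, 3), (5, 4, 4), (6, 2, 2), (7, 6, 6)]
```

After renaming, the first group works through physical register 4 and the second through physical register 6, so neither group has to wait for the other to finish with r1.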
2.6. SYSTEM DESIGN
As the name suggests, system design is the process of defining the architecture, interfaces, modules, and data of a system so that it meets user requirements, and it is closely linked to product design. It is the process of taking marketing information and turning it into product designs that are ready for manufacturing. Standardized software and hardware make it possible to build modular systems (Figure 2.6) (Wolf, 2012).
Figure 2.6. Design and architecture of the software. Source: https://www.tutorialspoint.com/software_architecture_design/introduction.htm.
If the broader topic of product development blends the perspectives of marketing, design, and manufacturing into a single approach to product creation (Wolf, 2012), then design is the act of taking the marketing information and creating the specification of the product to be built. System design is therefore the process of defining and developing systems to satisfy the specified requirements of the user (Luff et al., 1990; Padalia et al., 2003). The basic subject of system design is the understanding of component parts and their subsequent interaction with one another (Luff et al., 1990). Until the 1990s, system design held a crucial and respected position in the data processing industry. In the 1990s, the standardization of software and hardware made it possible to build modular systems. The growing importance of software running on generic platforms has strengthened the discipline of software engineering (Leveson, 1995; Rose and Hill, 1997).
2.6.1. Architectural Design
The architectural design of a system focuses on the system architecture, which describes the structure, behavior, and other views and analyses of that system (Grobman et al., 2009).
2.6.2. Logical Design
The logical design of a system is an abstract representation of the data flows, inputs, and outputs of the system, intended to clarify how the system is used. It is commonly produced through modeling, which involves building a highly abstract (and sometimes graphical) model of the actual system. Entity-relationship diagrams (ERDs) are part of the logical design (Grobman et al., 2010).
2.6.3. Physical Design
The physical design of a system is concerned with the actual input and output processes of the system. It describes how data is entered into the system, how it is verified and authenticated, how it is processed, and how it is displayed. The following requirements are decided during the physical design phase (Leeuwen and Wagter, 1997):
• Input requirements;
• Output requirements;
• Storage requirements;
• Processing requirements;
• System control and backup or recovery.
Put another way, the physical portion of system design can generally be divided into three sub-tasks (Sait and Youssef, 1999):
• User interface (UI) design;
• Data design; and
• Process design.
UI design is concerned with how users enter information into the system and how the system presents information back to them. Data design is concerned with how data is represented and stored within the system. Finally, process design is concerned with how data moves through the system, and with how and where it is validated, secured, and transformed as it flows into, through, and out of the system. At the end of the system design phase, documentation describing the three sub-tasks is produced and made available for use in the next phase (Keitel-Schulz and Wehn, 2001).
Physical design in this context does not refer to the tangible physical design of an information system. To use an analogy, the physical design of a personal computer involves input through a keyboard, processing inside the CPU, and output through a monitor, printer, and other output devices. It does not concern the actual layout of the tangible hardware, which for a PC would include a monitor, motherboard, CPU, hard drive, modems, video/graphics cards, USB ports, and so on. Instead, it involves the detailed design of the user interface, the product database structure, and the processing and control logic. A specification is then developed for the proposed system (Hornby and Pollack, 2001).
REFERENCES
1. Aditya, S., Rau, B. R., & Johnson, R., (2000). Automatic Design of VLIW and EPIC Instruction Formats (Vol. 2, No. 3, p. 94). HP Laboratories Technical Report HPL.
2. Adve, V., Lattner, C., Brukman, M., Shukla, A., & Gaeke, B., (2003). LLVA: A low-level virtual instruction set architecture. In: Proceedings 36th Annual IEEE/ACM International Symposium on Microarchitecture; MICRO-36 (Vol. 1, pp. 205–216). IEEE.
3. Ali, N. M., Hosking, J., & Grundy, J., (2010). A taxonomy of computer-supported critics. In: 2010 International Symposium on Information Technology (Vol. 3, pp. 1152–1157). IEEE.
4. Anderson, G. A., & Jensen, E. D., (1975). Computer interconnection structures: Taxonomy, characteristics, and examples. ACM Computing Surveys (CSUR), 7(4), 197–213.
5. Anger, W. K., Rohlman, D. S., & Sizemore, O. J., (1994). A comparison of instruction formats for administering a computerized behavioral test. Behavior Research Methods, Instruments, & Computers, 26(2), 209–212.
6. Asaad, R. R., (2021). A study on instruction formats on computer organization and architecture. Icontech International Journal, 5(2), 18–24.
7. Atasu, K., Pozzi, L., & Ienne, P., (2003). Automatic application-specific instruction-set extensions under microarchitectural constraints. International Journal of Parallel Programming, 31(6), 411–428.
8. Barnes, G. H., Brown, R. M., Kato, M., Kuck, D. J., Slotnick, D. L., & Stokes, R. A., (1968). The ILLIAC IV computer. IEEE Transactions on Computers, 100(8), 746–757.
9. Ben, M. A., & Atri, M., (2019). An efficient end-to-end deep learning architecture for activity classification. Analog Integrated Circuits and Signal Processing, 99(1), 23–32.
10. Bodin, L. D., & Berman, L., (1979). Routing and scheduling of school buses by computer. Transportation Science, 13(2), 113–129.
11. Bolanakis, D. E., Evangelakis, G. A., Glavas, E., & Kotsis, K. T., (2008). Teaching the addressing modes of the M68HC08 CPU using a practicable lesson. In: Proceedings of the 11th IASTED International Conference on Computers and Advanced Technology in Education (Vol. 2, No. 5, pp. 446–450). Crete, Greece.
12. Burrell, M., (2004). Addressing modes. In: Fundamentals of Computer Architecture (Vol. 1, pp. 189–199). Palgrave, London.
13. Capalija, D., & Abdelrahman, T. S., (2012). Microarchitecture of a coarse-grain out-of-order superscalar processor. IEEE Transactions on Parallel and Distributed Systems, 24(2), 392–405.
14. Chacón, F., (1992). A taxonomy of computer media in distance education. Open Learning: The Journal of Open, Distance and E-Learning, 7(1), 12–27.
15. Chen, Y., Lan, H., Du, Z., Liu, S., Tao, J., Han, D., & Chen, T., (2019). An instruction set architecture for machine learning. ACM Transactions on Computer Systems (TOCS), 36(3), 1–35.
16. Chou, Y., Fahs, B., & Abraham, S., (2004). Microarchitecture optimizations for exploiting memory-level parallelism. In: Proceedings. 31st Annual International Symposium on Computer Architecture (3rd edn., pp. 76–87). IEEE.
17. Choudhary, N., Wadhavkar, S., Shah, T., Mayukh, H., Gandhi, J., Dwiel, B., & Rotenberg, E., (2012). Fabscalar: Automating superscalar core design. IEEE Micro, 32(3), 48–59.
18. Chow, F., Correll, S., Himelstein, M., Killian, E., & Weber, L., (1987). How many addressing modes are enough? ACM SIGOPS Operating Systems Review, 21(4), 117–121.
19. Chung, K. L., (1995). Prefix computations on a generalized mesh-connected computer with multiple buses. IEEE Transactions on Parallel and Distributed Systems, 6(2), 196–199.
20. Cluley, J. C., (1987). Instruction formats and address modes. In: An Introduction to Low-Level Programming for Microprocessors (4th edn., pp. 19–42). Palgrave, London.
21. Cohen, R., (1992). Addressing modes and management protocols in a gigabit LAN with switching tables. In: Proceedings 17th Conference on Local Computer Networks (4th edn., pp. 669–670). IEEE Computer Society.
22. Cong, J., Fan, Y., Han, G., Yang, X., & Zhang, Z., (2004). Architecture and synthesis for on-chip multicycle communication. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23(4), 550–564.
23. Cook, R. P., & Donde, N., (1982). An experiment to improve operand addressing. ACM SIGARCH Computer Architecture News, 10(2), 87–91.
24. Dandamudi, S. P., (2003). Addressing modes. Fundamentals of Computer Organization and Design, 2(5), 435–469.
25. Dutta, S., Jeong, H., Yang, Y., Cadambe, V., Low, T. M., & Grover, P., (2020). Addressing unreliability in emerging devices and non-von Neumann architectures using coded computing. Proceedings of the IEEE, 108(8), 1219–1234.
26. El-Aawar, H., (2008). An application of complexity measures in addressing modes for CISC and RISC architectures. In: 2008 IEEE International Conference on Industrial Technology (Vol. 5, No. 6, pp. 1–7). IEEE.
27. Faust, S., Mukherjee, P., Nielsen, J. B., & Venturi, D., (2015). A tamper and leakage resilient von Neumann architecture. In: IACR International Workshop on Public Key Cryptography (Vol. 1, pp. 579–603). Springer, Berlin, Heidelberg.
28. Ferrara, V., Beccherelli, R., Campoli, F., D’Alessandro, A., Galloppa, A., Galbato, A., & Maltese, P., (1997). Matrix addressing waveforms for grey shades SSFLC displays. Molecular Crystals and Liquid Crystals Science and Technology. Section A; Molecular Crystals and Liquid Crystals, 304(1), 363–370.
29. Fiskiran, A. M., & Lee, R. B., (2001). Performance impact of addressing modes on encryption algorithms. In: Proceedings 2001 IEEE International Conference on Computer Design: VLSI in Computers and Processors: ICCD 2001 (2nd edn., pp. 542–545). IEEE.
30. Flynn, M. J., (1972). Some computer organizations and their effectiveness. IEEE Transactions on Computers, 100(9), 948–960.
31. Fox, A., & Myreen, M. O., (2010). A trustworthy monadic formalization of the ARMv7 instruction set architecture. In: International Conference on Interactive Theorem Proving (Vol. 1, pp. 243–258). Springer, Berlin, Heidelberg.
32. Francillon, A., & Castelluccia, C., (2008). Code injection attacks on Harvard-architecture devices. In: Proceedings of the 15th ACM Conference on Computer and Communications Security (Vol. 1, pp. 15–26).
33. Freudenthal, E., & Carter, B., (2009). A gentle introduction to addressing modes in a first course in computer organization. In: 2009 Annual Conference & Exposition, 3(5), (pp. 14–31).
34. Germain, C. A., Jacobson, T. E., & Kaczor, S. A., (2000). A Comparison of the Effectiveness of Presentation Formats for Instruction: Teaching First-Year Students (Vol. 61, No. 1, pp. 65–72). College & Research Libraries.
35. Giannetsos, T., Dimitriou, T., Krontiris, I., & Prasad, N. R., (2010). Arbitrary code injection through self-propagating worms in von Neumann architecture devices. The Computer Journal, 53(10), 1576–1593.
36. Giloi, W. K., (1983). Towards a taxonomy of computer architecture based on the machine data type view. ACM SIGARCH Computer Architecture News, 11(3), 6–15.
37. Goodacre, J., & Sloss, A. N., (2005). Parallelism and the ARM instruction set architecture. Computer, 38(7), 42–50.
38. Gredler, M. B., (1986). A taxonomy of computer simulations. Educational Technology, 26(4), 7–12.
39. Grobman, Y. J., Yezioro, A., & Capeluto, I. G., (2009). Computer-based form generation in architectural design: A critical review. International Journal of Architectural Computing, 7(4), 535–553.
40. Grobman, Y. J., Yezioro, A., & Capeluto, I. G., (2010). Non-linear architectural design process. International Journal of Architectural Computing, 8(1), 41–53.
41. Hamza, H. S., (2017). Automatic system design for finding the addressing modes based on context-free grammar. In: 2017 Second Al-Sadiq International Conference on Multidisciplinary in IT and Communication Science and Applications (AIC-MITCSA) (Vol. 5, No. 6, pp. 85–89). IEEE.
42. Harris, D. M., & Harris, S. L., (2016). Microarchitecture. In: Digital Design and Computer Architecture (4th edn., pp. 364–410). Graphic World Publishing Services.
43. Hartley, D. H., (1992). Optimizing stack frame accesses for processors with restricted addressing modes. Software: Practice and Experience, 22(2), 101–110.
44. Haskins, J. W., & Skadron, K., (2003). Memory reference reuse latency: Accelerated warmup for sampled microarchitecture simulation. In: 2003 IEEE International Symposium on Performance Analysis of Systems and Software: ISPASS 2003 (Vol. 1, pp. 195–203). IEEE.
45. Hornby, G. S., & Pollack, J. B., (2001). The advantages of generative grammatical encodings for physical design. In: Proceedings of the 2001 Congress on Evolutionary Computation (IEEE Cat. No. 01th8546) (Vol. 1, pp. 600–607). IEEE.
46. Hutty, R., (1984). Nested loops and addressing modes. In: Programming in Z80 Assembly Language, (Vol. 5, No. 6, pp. 33–39). Palgrave, London.
47. Huynh-The, T., Hua, C. H., Pham, Q. V., & Kim, D. S., (2020). MCNet: An efficient CNN architecture for robust automatic modulation classification. IEEE Communications Letters, 24(4), 811–815.
48. Hwang, K., & Jotwani, N., (1993). Advanced Computer Architecture: Parallelism, Scalability, Programmability (Vol. 199, pp. 1–6). New York: McGraw-Hill.
49. Ichikawa, S., Sato, M., & Goto, E., (1992). Evaluation of range-checking addressing modes and the architecture of FLATS2. Systems and Computers in Japan, 23(10), 1–13.
50. Jiménez, D. A., (2005). Piecewise linear branch prediction. In: 32nd International Symposium on Computer Architecture (ISCA’05) (Vol. 1, pp. 382–393). IEEE.
51. Kassani, S. H., Kassani, P. H., Wesolowski, M. J., Schneider, K. A., & Deters, R., (2019). A hybrid deep learning architecture for leukemic B-lymphoblast classification. In: 2019 International Conference on Information and Communication Technology Convergence (ICTC) (Vol. 1, pp. 271–276). IEEE.
52. Keitel-Schulz, D., & Wehn, N., (2001). Embedded DRAM development: Technology, physical design, and application issues. IEEE Design & Test of Computers, 18(3), 7–15.
53. Kiltz, S., Lang, A., & Dittmann, J., (2007). Taxonomy for computer security incidents. In: Cyber Warfare and Cyber Terrorism (3rd edn., pp. 421–418). IGI Global.
54. Kim, D. H., (2014). Addressing mode extension to the ARM/thumb architecture. Advances in Electrical and Computer Engineering, 14(2), 85–89.
55. Kong, J. H., Ang, L. M., & Seng, K. P., (2010). Minimal instruction set AES processor using Harvard architecture. In: 2010 3rd International Conference on Computer Science and Information Technology (Vol. 9, pp. 65–69). IEEE.
56. Konyavsky, V. A., & Ross, G. V., (2019). Secure computers of the new Harvard architecture. Asia Life Sciences, 28(1), 33–53.
57. Koufaty, D., & Marr, D. T., (2003). Hyperthreading technology in the NetBurst microarchitecture. IEEE Micro, 23(2), 56–65.
58. Laird, A., (2009). The von Neumann architecture topic paper# 3. Computer Science (3rd edn., Vol. 319, pp. 360–8771).
59. Lancioni, G. E., O’Reilly, M. F., Seedhouse, P., Furniss, F., & Cunha, B., (2000). Promoting independent task performance by persons with severe developmental disabilities through a new computer-aided system. Behavior Modification, 24(5), 700–718.
60. Landwehr, C. E., Bull, A. R., McDermott, J. P., & Choi, W. S., (1994). A taxonomy of computer program security flaws. ACM Computing Surveys (CSUR), 26(3), 211–254.
61. Langdon, W. B., & Poli, R., (2006). The halting probability in von Neumann architectures. In: European Conference on Genetic Programming (Vol. 1, pp. 225–237). Springer, Berlin, Heidelberg.
62. Larraza-Mendiluze, E., & Garay-Vitoria, N., (2014). Approaches and tools used to teach the computer input/output subsystem: A survey. IEEE Transactions on Education, 58(1), 1–6.
63. Lee, Y. J., & Park, J. W., (2010). Instruction formats of processing visual media dedicated to the pipelined SIMD architectural multi-access memory system. In: International Conference on Future Information & Communication Engineering (Vol. 3, No. 1, pp. 11–14).
64. Leeuwen, J. P. V., & Wagter, H., (1997). Architectural design-by-features. In: CAAD Futures 1997 (Vol. 1, pp. 97–115). Springer, Dordrecht.
65. Leveson, N. G., (1995). Safeware: System safety and computers. ACM, 4(5), 2–8.
66. Li, B., & Yan, X. Q., (2010). Berth allocation problem with Harvard architecture and agent-based computing. In: 2010 International Conference on Computer Application and System Modeling (ICCASM 2010) (Vol. 1, pp. V1–197). IEEE.
Classification of Computer Architecture
87
67. Liang, X., (2019). Computer architecture simulators for different instruction formats. In: 2019 International Conference on Computational Science and Computational Intelligence (CSCI) (Vol. 1, No. 2, pp. 806–811). IEEE. 68. Liang, X., (2020). More on computer architecture simulators for different instruction formats. In: 2020 International Conference on Computational Science and Computational Intelligence (CSCI) (Vol. 2, No. 3, pp. 910–916). IEEE. 69. Lim, K. H., Kim, Y., & Kim, T., (2009). Interconnect and communication synthesis for distributed register-file microarchitecture. IET Computers & Digital Techniques, 3(2), 162–174. 70. Lipasti, M. H., & Shen, J. P., (1997). Super speculative microarchitecture for beyond AD 2000. Computer, 30(9), 59–66. 71. Liu, S., Du, Z., Tao, J., Han, D., Luo, T., Xie, Y., & Chen, T., (2016). Cambricon: An instruction set architecture for neural networks. In: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA) (Vol. 1, pp. 393–405). IEEE. 72. Lo, C. H., Tsung, T. T., & Chen, L. C., (2005). Shape-controlled synthesis of Cu-based nanofluid using submerged arc nanoparticle synthesis system (SANSS). Journal of Crystal Growth, 277(1–4), 636–642. 73. Loo, A. W. S., (2007). Computer architecture. In: Peer-to-Peer Computing (pp. 209–226). Springer, London. 74. Luff, P., Gilbert, N. G., & Frohlich, D., (1990). Computers and Conversation (Vol. 1, No. 4, pp. 4–9). Academic Press. 75. Maier, F. H., & Größler, A., (1998). A taxonomy for computer simulations to support learning about socio-economic systems. In: Proceedings of the 16th International Conference of the System Dynamics Society (Vol. 99, pp. 1–9). Quebec City, Canada: System Dynamics Society. 76. Mak, P. K., Walters, C. R., & Strait, G. E., (2009). IBM System z10 processor cache subsystem microarchitecture. IBM Journal of Research and Development, 53(1), 2–1. 77. Maltese, P., & Ferrara, V., (1996). 
Rome addressing modes for surfacestabilized ferroelectric liquid-crystal (SSFLC) matrix displays. Journal of the Society for Information Display, 4(2), 75–82.
88
Dissecting Computer Architecture
78. Maltese, P., Piccolo, R., & Ferrara, V., (1993). An addressing effective computer model for surface stabilized ferroelectric liquid crystal cells. Liquid Crystals, 15(6), 819–834. 79. Mariantoni, M., Wang, H., Yamamoto, T., Neeley, M., Bialczak, R. C., Chen, Y., & Martinis, J. M., (2011). Implementing the quantum von Neumann architecture with superconducting circuits. Science, 334(6052), 61–65. 80. Marr, D. T., Binns, F., Hill, D. L., Hinton, G., Koufaty, D. A., Miller, J. A., & Upton, M., (2002). Hyper-threading technology architecture and microarchitecture. Intel Technology Journal, 6(1), 2–7. 81. Matuszczyk, T., & Maltese, P., (1995). Addressing modes of ferroelectric liquid crystal displays. In: Liquid Crystals: Materials Science and Applications (Vol. 2372, pp. 298–311). International Society for Optics and Photonics. 82. McQuillan, J. M., (1978). Enhanced message addressing capabilities for computer networks. Proceedings of the IEEE, 66(11), 1517–1527. 83. Meadows, C., (1993). An outline of a taxonomy of computer security research and development. In: Proceedings on the 1992–1993 Workshop on New Security Paradigms (Vol. 2, No. 5, pp. 33–35). 84. Meisel, M., Pappas, V., & Zhang, L., (2010). A taxonomy of biologically inspired research in computer networking. Computer Networks, 54(6), 901–916. 85. Mentzas, G., (1994). A functional taxonomy of computer-based information systems. International Journal of Information Management, 14(6), 397–410. 86. Morris, N. M., (1985). An instruction set and addressing modes. In: Microelectronic and Microprocessor-Based Systems (Vol. 1, No. 5, pp. 94–117). Palgrave, London. 87. Padalia, K., Fung, R., Bourgeault, M., Egier, A., & Rose, J., (2003). Automatic transistor and physical design of FPGA tiles from an architectural specification. In: Proceedings of the 2003 ACM/SIGDA Eleventh International Symposium on Field Programmable Gate Arrays (Vol. 1, pp. 164–172). 88. Padgett, H. C., Schmidt, D. G., Luxen, A., Bida, G. 
T., Satyamurthy, N., & Barrio, J. R., (1989). Computer-controlled radiochemical synthesis: A chemistry process control unit for the automated production of radiochemicals. International Journal of Radiation Applications and
Classification of Computer Architecture
89.
90.
91.
92.
93. 94.
95.
96.
97.
89
Instrumentation; Part A; Applied Radiation and Isotopes, 40(5), 433– 445. Palacharla, S., Jouppi, N. P., & Smith, J. E., (1997). Complexityeffective superscalar processors. In: Proceedings of the 24th Annual International Symposium on Computer Architecture (Vol. 1, No. 2, pp. 206–218). Parrilla-Gutierrez, J. M., Sharma, A., Tsuda, S., Cooper, G. J., AragonCamarasa, G., Donkers, K., & Cronin, L., (2020). A programmable chemical computer with memory and pattern recognition. Nature Communications, 11(1), 1–8. Patterson, D. A., & Ditzel, D. R., (1980). The case for the reduced instruction set computer. ACM SIGARCH Computer Architecture News, 8(6), 25–33. Patterson, D. A., & Sequin, C. H., (1998). RISC I: A reduced instruction set VLSI computer. In: 25 Years of the International Symposia on Computer Architecture (Selected Papers) (Vol. 1, pp. 216–230). Pope, S. T., (1996). A taxonomy of computer music. Contemporary Music Review, 13(2), 137–145. Quirita, V. A. A., Da Costa, G. A. O. P., Happ, P. N., Feitosa, R. Q., Da Silva, F. R., Oliveira, D. A. B., & Plaza, A., (2016). A new cloud computing architecture for the classification of remote sensing data. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 10(2), 409–416. Reilly, S., Barron, P., Cahill, V., Moran, K., & Haahr, M., (2009). A general-purpose taxonomy of computer-augmented sports systems. In: Digital Sport for Performance Enhancement and Competitive Evolution: Intelligent Gaming Technologies (5th edn., pp. 19–35). IGI Global. Ronen, R., Mendelson, A., Lai, K., Lu, S. L., Pollack, F., & Shen, J. P., (2001). Coming challenges in microarchitecture and architecture. Proceedings of the IEEE, 89(3), 325–340. Rose, J., & Hill, D., (1997). Architectural and physical design challenges for one-million gate FPGAs and beyond. In: Proceedings of the 1997 ACM Fifth International Symposium on Field-Programmable Gate Arrays (Vol. 1, pp. 129–132).
90
Dissecting Computer Architecture
98. Ross, J. W., & Westerman, G., (2004). Preparing for utility computing: The role of IT architecture and relationship management. IBM Systems Journal, 43(1), 5–19. 99. Rotenberg, E., Bennett, S., & Smith, J. E., (1999). A trace cache microarchitecture and evaluation. IEEE Transactions on Computers, 48(2), 111–120. 100. Sait, S. M., & Youssef, H., (1999). VLSI Physical Design Automation: Theory and Practice (Vol. 6, pp. 1–5). World Scientific. 101. Saucède, T., Eléaume, M., Jossart, Q., Moreau, C., Downey, R., Bax, N., & Vignes-Lebbe, R., (2021). Taxonomy 2.0: Computer-aided identification tools to assist Antarctic biologists in the field and the laboratory. Antarctic Science, 33(1), 39–51. 102. Scott, T., (2003). Bloom’s taxonomy is applied to testing in computer science classes. Journal of Computing Sciences in Colleges, 19(1), 267–274. 103. Shaout, A., & Eldos, T., (2003). On the classification of computer architecture. International Journal of Science and Technology (2nd edn., Vol. 14, pp. 2–9). 104. Shin, D., & Yoo, H. J., (2019). The heterogeneous deep neural network processor with a non-von Neumann architecture. Proceedings of the IEEE, 108(8), 1245–1260. 105. Sinharoy, B., Kalla, R. N., Tendler, J. M., Eickemeyer, R. J., & Joyner, J. B., (2005). POWER5 system microarchitecture. IBM Journal of Research and Development, 49(4.5), 505–521. 106. Sites, R. L., (1993). Alpha AXP architecture. Communications of the ACM, 36(2), 33–44. 107. Skillicorn, D. B., (1988). A taxonomy for computer architectures. Computer, 21(11), 46–57. 108. Smith, J. E., & Sohi, G. S., (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE, 83(12), 1609–1624. 109. Starr, C. W., Manaris, B., & Stalvey, R. H., (2008). Bloom’s taxonomy revisited: Specifying assessable learning objectives in computer science. ACM SIGCSE Bulletin, 40(1), 261–265. 110. Sulistio, A., Yeo, C. S., & Buyya, R., (2004). 
A taxonomy of computerbased simulations and its mapping to parallel and distributed systems simulation tools. Software: Practice and Experience, 34(7), 653–673.
Classification of Computer Architecture
91
111. Syamala, Y., & Tilak, A. V. N., (2011). Reversible arithmetic logic unit. In: 2011 3rd International Conference on Electronics Computer Technology (Vol. 5, pp. 207–211). IEEE. 112. Talpes, E., & Marculescu, D., (2005). Execution cache-based microarchitecture for power-efficient superscalar processors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 13(1), 14–26. 113. Tokoro, M., Tamura, E., Takase, K., & Tamaru, K., (1977). An approach to microprogram optimization considering resource occupancy and instruction formats. ACM SIGMICRO Newsletter, 8(3), 92–108. 114. Tsai, W. T., Fan, C., Chen, Y., Paul, R., & Chung, J. Y., (2006). Architecture classification for SOA-based applications. In: Ninth IEEE International Symposium on Object and Component-Oriented RealTime Distributed Computing (ISORC’06) (Vol. 1, p. 8). IEEE. 115. Van, H. R. P., Irwin, B., Burke, I. D., & Leenen, L., (2012). A computer network attack taxonomy and ontology. International Journal of Cyber Warfare and Terrorism (IJCWT), 2(3), 12–25. 116. Wang, Z., She, Q., & Ward, T. E., (2021). Generative adversarial networks in computer vision: A survey and taxonomy. ACM Computing Surveys (CSUR), 54(2), 1–38. 117. Wenisch, T. F., Wunderlich, R. E., Falsafi, B., & Hoe, J. C., (2005). TurboSMARTS: Accurate microarchitecture simulation sampling in minutes. ACM SIGMETRICS Performance Evaluation Review, 33(1), 408, 409. 118. Wirthlin, M. J., & Hutchings, B. L., (1995). A dynamic instruction set computer. In: Proceedings IEEE Symposium on FPGAs for Custom Computing Machines (Vol. 1, pp. 99–107). IEEE. 119. Wolf, M., (2012). Computers as Components: Principles of Embedded Computing System Design (Vol. 2, No. 3, pp. 4–7). Elsevier. 120. Wu, Z., Ou, Y., & Liu, Y., (2011). A taxonomy of network and computer attacks based on responses. In: 2011 International Conference of Information Technology, Computer Engineering and Management Sciences (Vol. 1, pp. 26–29). IEEE. 121. 
Wupper, H., & Meijer, H., (1998). Towards a taxonomy for computer science. In: Informatics in Higher Education (Vol. 1, pp. 217–228). Springer, Boston, MA.
92
Dissecting Computer Architecture
122. Yehezkel, C., (2002). A taxonomy of computer architecture visualizations. ACM SIGCSE Bulletin, 34(3), 101–105. 123. Zhou, P., Liu, W., Fei, L., Lu, S., Qin, F., Zhou, Y., & Torrellas, J., (2004). AccMon: Automatically detecting memory-related bugs via program counter-based invariants. In: 37th International Symposium on Microarchitecture (MICRO-37’04) (Vol. 2, pp. 269–280). IEEE. 124. Ziefle, M., (2008). Instruction formats and navigation aids in mobile devices. In: Symposium of the Austrian HCI and Usability Engineering Group (2nd edn., pp. 339–358). Springer, Berlin, Heidelberg.
CHAPTER 3
COMPUTER MEMORY SYSTEMS

CONTENTS
3.1. Introduction
3.2. Memory Hierarchy
3.3. Managing the Memory Hierarchy
3.4. Caches
3.5. Main Memory
3.6. Present and Future Research Problems
References
3.1. INTRODUCTION
As Figure 3.1 shows, computing systems are composed of three basic units: (i) computation units that execute operations on information; (ii) storage units (or memory) that hold information that is either being operated on or being archived; and (iii) communication units that transfer information between storage units and computation units. Storage units are further divided into two types: (i) the memory system, which serves as working storage for the data currently being operated on by running programs, and (ii) the backup storage system, such as the hard disk, which retains data for long periods. The memory system, often called the "working storage area" of the CPU, is the subject of this chapter (Eckert, 1953; Chen, 1992).
Figure 3.1. Computing system. Source: https://kilthub.cmu.edu/articles/journal_contribution/Memory_Systems/6469010/1.
In computing, the memory system is a data storage system whose information the CPU (or processors) can read and modify. While a computing system is operating, the processor reads information from the memory system, performs computation on it, and writes the modified information back into the memory system; this process repeats until all required computation has been performed on all the required information (Figure 3.2) (Olivera, 2000; Bivens et al., 2010).
Figure 3.2. Hierarchical representation of computer memory systems. Source: https://www.tutorialandexample.com/computer-memory.
3.1.1. Basic Concepts and Metrics
The total amount of data that a memory system can hold is its capacity. Each item of data kept in the memory system is assigned a distinct address. The first item of data, for instance, has an address of 0, while the last has an address of capacity - 1. The address space of the memory system is the whole range of possible addresses, from 0 to capacity - 1. To retrieve a specific item from the memory system, the processor must therefore provide the memory system with its address (Hollingshead, 1998; Xia et al., 2015).
Three significant metrics define the performance of a memory system: (i) latency; (ii) bandwidth; and (iii) parallelism. A high-performance memory system has low latency, high bandwidth, and high parallelism. Latency is the length of time required for the CPU to obtain an item of information from the memory system. Bandwidth, also referred to as throughput, is the rate at which the CPU can receive information from the memory system.
At first glance, bandwidth and latency seem to be inverses of each other. If it takes T seconds to obtain one item of data, it is tempting to conclude that 1/T items can be obtained per second. This isn't always the case, though. To properly understand the link between bandwidth and latency, we must also consider the memory system's third performance metric (Li and Hudak, 1989).
Parallelism is the number of accesses that the memory system can service concurrently. When a memory system has a parallelism of one, accesses are handled one at a time, and bandwidth is the inverse of latency. However, if a memory system supports a parallelism greater than one, accesses to distinct locations can be serviced at the same time, overlapping their latencies. When the parallelism is equal to 2, for instance, a second access is serviced in parallel during the time T it takes to service one access. As a result, although the latency of an individual access stays constant at T, the memory system's bandwidth doubles to 2/T (Burger, 1996; Henke, 2010).
The link between memory system bandwidth, parallelism, and latency may be described in more general terms as follows:
\[ \text{Bandwidth [accesses/time]} = \frac{\text{Parallelism [unitless]}}{\text{Latency [time/access]}} \]
Whereas bandwidth may be measured in accesses per time (above equation), it may also be expressed in bytes per time, that is, the quantity of data accessed per time:
\[ \text{Bandwidth [bytes/time]} = \frac{\text{Parallelism [unitless]}}{\text{Latency [time/access]}} \times \text{DataSize [bytes/access]} \]
Another attribute of a memory system is its cost. The cost of a memory system is the amount of money needed to implement it. Cost is strongly tied to performance and capacity: as the capacity and performance of a memory system increase, its cost typically increases as well (Sorin et al., 2003; Somers, 2003).
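The relationship between latency, parallelism, and bandwidth described above can be checked numerically. The following is a minimal sketch; the function names and the example values (a 50 ns latency, a parallelism of 2, and 64 bytes per access) are illustrative assumptions, not figures from the text.

```python
def bandwidth_accesses_per_sec(latency_ns: float, parallelism: int) -> float:
    """Accesses per second when `parallelism` accesses overlap their latencies."""
    return parallelism * 1e9 / latency_ns


def bandwidth_bytes_per_sec(latency_ns: float, parallelism: int,
                            bytes_per_access: int) -> float:
    """Bytes per second: access rate multiplied by the data moved per access."""
    return bandwidth_accesses_per_sec(latency_ns, parallelism) * bytes_per_access


# Assumed example values (hypothetical, for illustration only).
LATENCY_NS = 50.0
serial = bandwidth_accesses_per_sec(LATENCY_NS, parallelism=1)
overlapped = bandwidth_accesses_per_sec(LATENCY_NS, parallelism=2)
byte_rate = bandwidth_bytes_per_sec(LATENCY_NS, parallelism=2, bytes_per_access=64)

print(f"parallelism 1: {serial:.3e} accesses/s")
print(f"parallelism 2: {overlapped:.3e} accesses/s")
print(f"parallelism 2, 64 B per access: {byte_rate:.3e} bytes/s")
```

Doubling the parallelism doubles the bandwidth while the latency of each individual access stays the same, which is the 2/T case from the paragraph above.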
3.1.2. Two Elements of the Memory System
When building a memory system, computer designers aim to optimize all three of the aforementioned characteristics: high performance, large capacity, and low cost. A memory system with both large capacity and high performance, however, is quite costly. When memory capacity is increased, for example, memory latency nearly always rises as well. As a result, there is a basic trade-off between performance and capacity when designing a memory system within a realistic cost budget: it is feasible to attain either large capacity or high performance at low cost, but not both at the same time (Burnett and Coffman, 1970; Bailey, 1995).
To balance the trade-off between performance and capacity, a contemporary memory system often comprises two elements: a cache (pronounced "cash"), which is small but relatively fast to access, and main memory, which is large but relatively slow to access. Of the two, main memory is the all-encompassing repository, with a capacity that is normally chosen so that accesses to the storage system are kept to a minimum (Bradley, 1962; Hollingshead and Brandon, 2003).
If the memory system consisted solely of main memory, however, every access would be served by main memory, which is slow. Therefore, a memory system also includes a cache: although a cache is much smaller than main memory, it is much faster. A cache is used by copying a portion of the information in main memory into the cache, so that some of the memory system's accesses can be served more quickly from the cache. The data stored in the cache is a copy of data that is already kept in main memory (Figure 3.3) (Saleh et al., 1990; Hangal et al., 2004).
Figure 3.3. System of main memory and cache. Source: https://www.apogeeweb.net/article/160.html.
3.2. MEMORY HIERARCHY
The three features of an ideal memory system are high performance (high bandwidth and low latency), low cost, and large capacity. However, technical and physical constraints make it difficult to build a memory system that satisfies all three at once. Instead, one feature can be improved only at the expense of another, which implies a basic trade-off among capacity, performance, and cost. (To keep things simple, we'll assume that we always want low cost, which reduces the problem to a "zero-sum" trade-off between capacity and performance.) For instance, main memory is an element of the memory system that offers large capacity but lower performance, whereas a cache is an element that offers higher performance but small capacity. As a result, main memory and caches sit at opposite ends of the performance-to-capacity trade-off spectrum (Harper, 1991; Mutlu and Subramanian, 2014).
As previously stated, current memory systems include both main memory and caches. The reasoning is to get the best of both: a memory system with the high performance of a cache and the large capacity of main memory. If the memory system were made up exclusively of main memory, every access would suffer the high latency (and hence low performance) of main memory. When caches are added on top of main memory, some of the accesses are served by the caches at low latency, whereas the rest are guaranteed to find their information in main memory owing to its large capacity, although at a higher latency. Consequently, with the combination of main memory and caches, the memory system's effective latency (that is, its average latency) falls below that of main memory while the large capacity of main memory is retained (Laha et al., 1988; Cooper-Balis et al., 2012).
Main memory and caches are said to be components of a memory hierarchy within a memory system. In more technical terms, a memory hierarchy describes how distinct elements (such as main memory and caches) with different capacity and performance characteristics are combined to build the memory system. The fastest (although smallest) element is found at the top level of the memory hierarchy, while the slowest (but largest) element is found at the bottom level, as illustrated in Figure 3.4. The memory hierarchy normally comprises several levels of caches (such as the L1, L2, and L3 caches in Figure 3.4, which stand for level-1, level-2, and level-3), each with a larger capacity than the one above it, and a single level of main memory at the bottom with the largest capacity. The last-level cache is the bottom-most cache (for instance, the L3 cache in Figure 3.4, which sits directly above main memory) (Protic et al., 1995; Heersmink, 2017).
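The claim that a cache pulls the average latency below that of main memory can be made concrete with a small calculation. This is a minimal sketch under a simplifying assumption that is not taken from the text: a hit costs only the cache latency and a miss costs only the main-memory latency. The function name and the example numbers are hypothetical.

```python
def effective_latency(hit_rate: float, cache_latency_ns: float,
                      memory_latency_ns: float) -> float:
    """Average latency when a fraction `hit_rate` of accesses hit in the cache.

    Simplifying assumption: a miss costs only the main-memory latency
    (the time spent checking the cache first is ignored).
    """
    return hit_rate * cache_latency_ns + (1.0 - hit_rate) * memory_latency_ns


# Assumed example values: 1 ns cache, 100 ns main memory.
no_cache = effective_latency(0.0, 1.0, 100.0)    # every access goes to main memory
with_cache = effective_latency(0.9, 1.0, 100.0)  # 90% of accesses hit in the cache

print(f"without cache: {no_cache:.1f} ns, with 90% hit rate: {with_cache:.1f} ns")
```

Under these assumed numbers the average latency drops by roughly an order of magnitude, even though the slow main memory still serves one access in ten.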
Figure 3.4. An example of a memory hierarchy with three levels of caches. Source: https://kilthub.cmu.edu/articles/journal_contribution/Memory_Systems/6469010/1.
When the CPU accesses the memory system, it usually does so sequentially, beginning with a search of the topmost element to check whether it holds the required information. If the information is found, the access is said to have "hit" in the topmost element, and it is served immediately without having to search the lower elements of the memory hierarchy. Otherwise, the access is said to have "missed" in the topmost element, and it is passed down to the element directly below in the memory hierarchy, where it may hit or miss again. The likelihood of a hit increases as the access moves deeper into the hierarchy because of the larger capacity of the lower elements, until it reaches the bottom level of the hierarchy, where it is guaranteed to always hit (McAulay, 1991; Avizienis et al., 2004).
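The top-down search described above can be sketched in a few lines. This is an illustrative model only: each level is represented as a hypothetical Python dictionary from address to data, and the sketch omits what real hardware does after a miss (for example, copying the data into the upper levels).

```python
def access(address, caches, main_memory):
    """Search the hierarchy from the top; return (level_name, data) for the first hit."""
    for name, contents in caches:       # ordered from the top (L1) to the last-level cache
        if address in contents:         # hit: served here, lower levels are not searched
            return name, contents[address]
        # miss: pass the access down to the next level
    return "main memory", main_memory[address]   # the bottom level always hits


# Hypothetical contents: each lower level is larger than the one above it.
main_memory = {addr: f"data{addr}" for addr in range(16)}
caches = [
    ("L1", {a: main_memory[a] for a in (0,)}),
    ("L2", {a: main_memory[a] for a in (0, 1, 2)}),
    ("L3", {a: main_memory[a] for a in range(8)}),
]

for addr in (0, 2, 5, 12):
    level, data = access(addr, caches, main_memory)
    print(f"address {addr}: hit in {level} ({data})")
```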
3.3. MANAGING THE MEMORY HIERARCHY
As previously stated, each item of information in a memory system is assigned a distinct address. The address space is the collection of all possible addresses for a memory system, and it runs from 0 to capacity - 1, where capacity is the memory system's size. When software programmers write computer programs, they naturally depend on the memory system, assuming that particular pieces of the program's information can be kept at particular addresses without restriction, as long as the addresses are valid, that is, as long as they don't exceed the memory system's address space. In reality, however, the memory system doesn't expose its address space to computer programs directly, for two reasons (Tam et al., 1990; Avizienis et al., 2001).
First, while writing a computer program, the software programmer has no means of knowing the capacity of the memory system on which the program will run. The program might, for instance, run on a computer with a large memory system (for example, 1 TB capacity) or a tiny one (for example, 1 MB capacity). Whereas an address of 1 GB is valid for the first memory system, it is invalid for the second because it exceeds the upper bound of that address space (1 MB). Consequently, if the memory system's address space were exposed directly to the program, the software programmer could never know which addresses are valid and may be used to store the information of the program being written (Figure 3.5) (Worlton, 1991; Sutton et al., 2009).
Figure 3.5. The memory hierarchy of a computer is depicted in this diagram. Source: https://en.wikipedia.org/wiki/Memory_hierarchy.
Second, while a computer program is being developed, the software programmer has no means of knowing which other programs will be running at the same time. When the user launches the program, it may run on the same computer as many other programs, some of which may happen to use the same address (for instance, address 0) to store a particular piece of their information. In this situation, when one program modifies the information at that address, it overwrites the information that another program had stored there, even though this must not be permitted. Consequently, if programs had direct access to the address space of the memory system, they could overwrite and corrupt each other's information, resulting in incorrect execution (McCalpin, 1995; Jacob, 2009).
3.3.1. Physical versus Virtual Address Spaces
To overcome these two problems, a memory system doesn't directly expose its address space, but instead creates the illusion of a very large address space that is separate for every program (Figure 3.6).
Figure 3.6. Virtual memory may work in tandem with the actual memory of a computer to offer quicker, more fluid processes. Source: https://www.enterprisestorageforum.com/hardware/virtual-memory/.
The illusion of a large address space overcomes the first problem of differing capacities among memory systems, while the illusion of a separate large address space for every program overcomes the second problem of multiple programs modifying data at the same address. These large, separate address spaces are known as virtual address spaces because they are an illusion created for the benefit of programmers, to make their lives simpler when developing computer programs. The physical address space of the memory system, on the other hand, is its actual underlying address space. In current-generation systems (circa 2013), the physical address space of a typical 8 GB memory system ranges from 0 to 8 GB, whereas the virtual address space of a program typically ranges from 0 to 256 TB (Freitas and Lavington, 2000; Lazzaroni et al., 2010).
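To get a feel for the two sizes quoted above, the short sketch below computes how many address bits each space needs. It assumes byte-addressable memory and binary units (1 GB = 2^30 bytes, 1 TB = 2^40 bytes); both are assumptions of this illustration, and the function name is hypothetical.

```python
def address_bits(size_in_bytes: int) -> int:
    """Number of bits needed to give every byte in the space a distinct address."""
    return (size_in_bytes - 1).bit_length()


GB = 2**30  # assumed binary units
TB = 2**40

physical_bits = address_bits(8 * GB)    # 8 GB physical address space
virtual_bits = address_bits(256 * TB)   # 256 TB virtual address space

print(f"physical: {physical_bits} bits, virtual: {virtual_bits} bits")
print(f"virtual space is {(256 * TB) // (8 * GB)} times larger")
```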
3.3.2. Virtual Memory System
Virtual memory refers to the complete mechanism discussed so far, through which the OS maps and translates between virtual and physical addresses (Obaidat and Boudriga, 2010). When designing a virtual memory system as part of an OS, there are three things to consider (Figure 3.7):
• when to map a virtual address to a physical address;
• the granularity of the mapping; and
• what to do when physical addresses are exhausted.
Figure 3.7. System of virtual memory. Source: http://www.cburch.com/books/vm/.
First, most virtual memory systems map a virtual address when it is accessed for the very first time, which is known as on-demand mapping. In other words, if a virtual address is never accessed, it is never mapped to a physical address (Liedtke, 1995; Ryoo et al., 2022). Even though the virtual address space is extraordinarily large (that is, 256 TB), the vast majority of programs use only a very small portion of it. It would therefore be wasteful to map the entire virtual address space to the physical address space, since the overwhelming majority of virtual addresses are never accessed. Moreover, the virtual address space is far larger than the physical address space, so it is impossible to map all virtual addresses up front in any case (Figure 3.8) (Alpern et al., 1994; Coleman and Davidson, 2001).
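On-demand mapping can be illustrated with a small sketch. The class below is hypothetical and deliberately simplified: it keeps one table per program, hands out free physical chunks in order, and does not handle the case where no free chunk remains.

```python
class OnDemandMapper:
    """Hypothetical per-program table that maps a virtual chunk on its first access."""

    def __init__(self, free_physical_chunks):
        self.free = list(free_physical_chunks)  # physical chunks not yet in use
        self.table = {}                         # virtual chunk -> physical chunk

    def translate(self, virtual_chunk):
        if virtual_chunk not in self.table:
            # First access: pick a free physical chunk and remember the mapping.
            self.table[virtual_chunk] = self.free.pop(0)
        return self.table[virtual_chunk]


mapper = OnDemandMapper(free_physical_chunks=range(4))
first = mapper.translate(1_000_000)   # first access creates the mapping
again = mapper.translate(1_000_000)   # later accesses reuse the same mapping
mapped_count = len(mapper.table)      # only chunks that were touched are mapped

print(first, again, mapped_count)
```

Although the virtual chunk number is far beyond the four physical chunks available, only one mapping exists after the two accesses, because nothing is mapped until it is used.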
Figure 3.8. General organization of virtual memory. Source: https://edux.pjwstk.edu.pl/mat/264/lec/main85.html.
Second, a virtual memory system must choose a granularity at which it maps addresses from the virtual address space to the physical address space. When the granularity is one byte, for instance, the virtual memory system evenly divides the virtual and physical address spaces into one-byte virtual and physical "chunks." The virtual memory system may then map a one-byte virtual chunk to any one-byte physical chunk, as long as that physical chunk is free. However, such a fine division of the address spaces into a huge number of small chunks has a significant drawback: it increases the complexity of the virtual memory system. As previously stated, once a mapping between a virtual chunk and a physical chunk is made, the virtual memory system must remember it. Hence, a large number of virtual and physical chunks implies a large number of possible mappings between them, which increases the bookkeeping cost of remembering the mappings. To reduce this cost, most virtual memory systems coarsely divide the address spaces into a smaller number of chunks, called pages, with a typical size of 4 KB. A virtual page is a 4 KB chunk of the virtual address space, while a physical page is a 4 KB chunk of the physical address space. Each time a virtual page is mapped to a physical page, the OS records the mapping in a data structure known as the page table (Welch, 1978; Mellor-Crummey et al., 2001).
Third, when a program accesses a new virtual page for the first time, the virtual memory system maps the virtual page to a free physical page. However, if this occurs repeatedly, the physical address space can become full, so that none of the physical pages is free because all of them are mapped to virtual pages. At this point, the virtual memory system must "create" a free physical page by reclaiming one of the mapped physical pages. It does so by evicting a physical page's information from main memory and "un-mapping" the physical page from its virtual page. Once a free physical page has been created in this way, the virtual memory system can map it to a new virtual page (Figure 3.9) (Hallnor and Reinhardt, 2005; Endoh et al., 2005).
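The page-granularity mapping and the reclaiming step described above can be sketched together. This is an illustrative model, not an actual OS implementation: the class and method names are hypothetical, the victim is chosen in first-mapped, first-evicted order (an assumption, since the text does not name a policy), and only the mappings are tracked, not the page contents.

```python
PAGE_SIZE = 4 * 1024  # 4 KB pages, as in the text


class PagedMemory:
    """Hypothetical page table with on-demand mapping and page reclaiming."""

    def __init__(self, num_physical_pages):
        self.free_pages = list(range(num_physical_pages))
        self.page_table = {}  # virtual page number -> physical page number

    def translate(self, virtual_address):
        # Split the address into a page number and an offset within the page.
        vpn, offset = divmod(virtual_address, PAGE_SIZE)
        if vpn not in self.page_table:
            if not self.free_pages:
                self._reclaim_page()
            self.page_table[vpn] = self.free_pages.pop(0)
        return self.page_table[vpn] * PAGE_SIZE + offset

    def _reclaim_page(self):
        # Assumed policy: un-map the oldest mapping (dicts keep insertion order).
        victim_vpn = next(iter(self.page_table))
        self.free_pages.append(self.page_table.pop(victim_vpn))


mem = PagedMemory(num_physical_pages=2)
a = mem.translate(0x1234)             # virtual page 1, offset 0x234
b = mem.translate(5 * PAGE_SIZE + 7)  # virtual page 5, offset 7
c = mem.translate(9 * PAGE_SIZE)      # no free page left: virtual page 1 is un-mapped

print(hex(a), hex(b), hex(c), mem.page_table)
```

With 4 KB pages, one page-table entry covers 4096 addresses, which is the bookkeeping saving the text attributes to coarse granularity.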
Figure 3.9. Virtual memory enhancement. Source: https://www.bsc.es/research-development/research-areas/computerarchitecture-and-codesign/improving-virtual-memory.
3.4. CACHES
In general, the term "cache" refers to any structure that stores data that is likely to be accessed again (for example, recently accessed or frequently accessed data) in order to avoid the long-latency operation needed to fetch the data from a slower structure. For example, web servers on the internet use caches to store very popular photos or news items so that they can be retrieved quickly and delivered to the end user. In the context of the memory system, the term cache refers to a small yet fast element of the memory hierarchy that stores the most recently (or most frequently) accessed data among all the data in the memory system (Drost et al., 2005; Camp et al., 2011). Caches are commonly used in computer memory systems. Because a cache is designed to be significantly faster than main memory, data stored in a cache can be retrieved rapidly by the processor (Figure 3.10) (Moreno et al., 2011; Sardashti et al., 2015).
Figure 3.10. Memory caches (caches of the central processing unit) employ high-speed static RAM (SRAM) chips, while disc caches are frequently part of main memory and are made up of ordinary dynamic RAM (DRAM) chips. Source: https://www.pcmag.com/encyclopedia/term/cache.
The effectiveness of a cache depends on whether a large fraction of memory accesses "hit" in the cache and therefore avoid being served by the slow main memory. Because many computer programs exhibit locality in their memory access behavior, a cache can achieve a high hit rate despite its small capacity: data that has been accessed in the past is likely to be accessed again in the near future. As a result, a small cache with a capacity far lower than that of main memory can serve the majority of memory requests, as long as it holds the most recently (or most frequently) accessed data (Rege, 1976; Lioupis et al., 1993).
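The effect of locality on hit rate can be illustrated with a small simulation. The sketch below assumes a cache that keeps the most recently accessed items and evicts the least recently used one. The function name `hit_rate` and the two synthetic access traces are made up for the example and do not model any real program.

```python
import random
from collections import OrderedDict


def hit_rate(addresses, capacity):
    """Fraction of accesses served by a small cache that keeps the most
    recently accessed items (least-recently-used eviction, an assumption)."""
    cache = OrderedDict()
    hits = 0
    for addr in addresses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[addr] = True
    return hits / len(addresses)


rng = random.Random(0)
# High locality: a loop that keeps revisiting the same 8 items.
local_trace = [i % 8 for i in range(10_000)]
# Low locality: uniformly random items drawn from 10,000 candidates.
random_trace = [rng.randrange(10_000) for _ in range(10_000)]
```

Calling `hit_rate(local_trace, 16)` and `hit_rate(random_trace, 16)` with the same 16-entry cache shows how strongly the hit rate depends on the access pattern rather than on capacity alone.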
3.4.1. Basic Design Considerations
When building a cache, there are three primary design parameters:
• Physical substrate;
• Capacity; and
• The granularity of data management.
We go through them one by one below. Firstly, the physical substrate used to implement the cache must deliver significantly lower latency than main memory (Cabeza et al., 1999). That is why SRAM (static random-access memory, pronounced "es-ram") has been, and continues to be, the most popular physical substrate for caches. The main benefit of SRAM is that it can run at speeds comparable to those of the CPU. Furthermore, SRAM is built from the same kind of semiconductor transistors as the CPU. As a result, an SRAM-based cache can be placed on the same semiconductor chip as the CPU at a reasonable cost, enabling even lower latencies. The smallest unit of SRAM is an SRAM cell, which typically consists of six transistors. The six transistors jointly store a single bit of data (zero or one) in the form of an electrical voltage ("low" or "high"). When the processor reads from an SRAM cell, the transistors supply the data value that corresponds to their voltage levels. When the processor writes to an SRAM cell, on the other hand, the voltage of the transistors is adjusted to reflect the new value being written (Lin et al., 2001; Viana et al., 2004). Secondly, when choosing a cache's capacity, it is important to strike a balance between low cost, high hit rate, and low latency. A large cache, while more likely to yield a high hit rate, has two drawbacks. Firstly, because of the large number of transistors needed to build it, a large cache costs more. Secondly, a large cache is likely to have a longer latency.
This is because coordinating the operation of a large number of transistors is a complex task that introduces additional latency (for example, it may take longer to determine the location of an address). In practice, a cache is made large enough to achieve a sufficiently high hit rate, but not so large that it incurs a major increase in cost and latency (Neefs et al., 1998; Basak et al., 2019). Finally, a cache is divided into many small pieces known as cache blocks. A cache block is the granularity at which the cache manages data. In current memory systems, a cache block is typically 64 bytes in size. A 64-kilobyte cache, for instance, is made up of 1,024 separate cache blocks. Each of these cache blocks can hold data from any 64-byte "chunk" of the address space; that is, any cache block may hold the data starting at address 0, address 64, address 128, address 192, and so on. As a result, a cache block needs a "label" (commonly called a tag) that identifies the address of the data held in the cache block (Figure 3.11) (Fatahalian et al., 2006; Yasui et al., 2011).
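The arithmetic behind these numbers is short enough to check directly. The sketch below uses the 64 KB cache and 64-byte block sizes from the text. The helper name `chunk_of` is hypothetical and simply splits an address into the start of its 64-byte chunk and the byte offset inside it.

```python
CACHE_SIZE = 64 * 1024   # 64 KB cache, as in the text
BLOCK_SIZE = 64          # 64-byte cache blocks

NUM_BLOCKS = CACHE_SIZE // BLOCK_SIZE  # number of cache blocks


def chunk_of(address):
    """Return (chunk_start, offset): the 64-byte chunk an address falls in
    and the byte position inside that chunk."""
    offset = address % BLOCK_SIZE
    return address - offset, offset
```

The chunk start address plays the role of the label: it is the information a cache block must carry so that the cache can tell which part of the address space the block currently holds.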
Figure 3.11. Cache data. Source: https://kilthub.cmu.edu/articles/journal_contribution/Memory_Systems/6469010/1.
3.4.2. Logical Organization
In addition to the basic considerations discussed above, the computer architect must also decide how a cache is logically organized, that is, how it maps 64-byte chunks of the address space onto its 64-byte cache blocks. Because the memory system's address space is much larger than the cache's capacity, there are many more chunks than cache blocks. Consequently, the mapping between chunks and cache blocks is necessarily "many-to-few." For each chunk of the address space, however, the architect can design the cache so that the chunk may map to (i) any of its cache blocks or (ii) only one specific cache block. These two logical organizations represent the two opposite extremes in the degree of freedom available when mapping a chunk to a cache block. As explained below, this freedom creates a key tradeoff between the complexity and the utilization of the cache (Bahn et al., 2002; Matthews et al., 2008).
On the one hand, a cache that offers the most freedom in mapping a chunk to a cache block is said to have a fully associative organization. A fully associative cache places no restriction on where a new chunk may be stored when it is brought in from main memory; that is, the chunk may be stored in any of the cache blocks. As long as the cache has at least one empty cache block, the chunk can therefore be stored without evicting an already occupied (non-empty) cache block. In this respect, a fully associative cache is the most effective at utilizing all of its cache blocks. Its disadvantage is that whenever the processor accesses the cache, it must search all cache blocks exhaustively, because any one of them might contain the data that the processor needs. Searching through all cache blocks, however, takes a long time and consumes a lot of energy (Figure 3.12) (Kessler and Hill, 1992; Mogul, 2000).
Figure 3.12. The logical organization. Source: https://slideplayer.com/slide/16157911/.
On the other hand, a cache that offers the least freedom in mapping a chunk to a cache block is said to have a directly mapped organization. When a new chunk is brought in from main memory, a directly mapped cache allows the chunk to be stored in only one specific cache block. Consider, for example, a 64-kilobyte cache with 1,024 cache blocks of 64 bytes each. In a straightforward implementation of a directly mapped cache, every 1,024th chunk (64 bytes) of the address space maps to the same cache block; for example, the chunks at addresses 0, 64 K, and 128 K all map to the 0th cache block (Agarwal and Pudar, 1993). If that cache block is already occupied by a different chunk, the old chunk must be evicted before the new chunk can be stored in it. A "conflict" is a situation in which two different chunks (corresponding to two different addresses) contend for the same cache block. In a directly mapped cache, conflicts can occur at one cache block even while all other cache blocks are empty. In this respect, a directly mapped cache is the least effective at utilizing all of its cache blocks. Its advantage, on the other hand, is that the processor can quickly determine whether the cache contains the data it is looking for by searching only one cache block. A directly mapped cache therefore has a lower access latency (Hill, 1988; Seznec, 1993). There is a third option, the set-associative organization, which serves as a middle ground between the two organizations (fully associative and directly mapped) by allowing a chunk to map to one of several (but not all) cache blocks in the cache. If a cache has N cache blocks, a fully associative organization allows a chunk to map to any of the N blocks, whereas a directly mapped organization allows a chunk to map to only one specific block. A set-associative organization is built on the idea of sets, which are small, non-overlapping groups of cache blocks. In a set-associative cache, a chunk maps to only one set, similar to a directly mapped cache.
On the other hand, a set-associative cache is similar to a fully associative cache in that the chunk may map to any cache block within its set. As an example, consider a two-way set-associative cache, that is, a set-associative cache in which every set consists of two cache blocks. In such a cache, a chunk is first mapped to one specific set out of all N/2 sets. Within that set, the chunk may then map to either of the two cache blocks that make up the set (Lawless, 1958; McFarling, 1989).
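The three organizations differ only in how many cache blocks a chunk is allowed to occupy, which a single function can express. In the sketch below, the function name `candidate_blocks`, the modulo-based set selection, and the contiguous numbering of blocks within a set are assumptions made for illustration. Real caches may index their sets differently.

```python
BLOCK_SIZE = 64  # bytes per cache block, as in the text


def candidate_blocks(address, num_blocks, ways):
    """Return the cache block indices where the chunk holding `address`
    may be placed.

    ways == 1            -> directly mapped
    ways == num_blocks   -> fully associative
    otherwise            -> set-associative with `ways` blocks per set
    Blocks of a set are assumed to be numbered contiguously.
    """
    num_sets = num_blocks // ways
    chunk_number = address // BLOCK_SIZE
    set_index = chunk_number % num_sets
    first = set_index * ways
    return list(range(first, first + ways))
```

For a cache with 1,024 blocks, `ways=1` yields exactly one candidate block, `ways=2` yields the two blocks of one of the 512 sets, and `ways=1024` yields every block in the cache.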
3.4.3. Managing Multiple Caches
So far, we have discussed design considerations and management policies for a single cache. A memory hierarchy, however, often comprises more than one cache (Figure 3.13). The policies that govern how multiple caches in the memory hierarchy interact with one another are discussed below:
• Inclusion;
• Write handling; and
• Partitioning.
Figure 3.13. To enable rapid access to data, a distributed cache consolidates the RAM of numerous computers into a single in-memory data store, which is then utilized as a data cache. Source: https://hazelcast.com/glossary/distributed-cache/.
Firstly, the memory hierarchy may include multiple levels of caches in addition to main memory. Main memory, which is the lowest level, always holds a superset of the data stored in any of the caches (Rosenberg, 1957; Johnson, 1990). To put it another way, every cache is inclusive with respect to main memory. Depending on the inclusion policy of the memory system, however, the same relationship may not hold between one cache and another. There are three kinds of inclusion policies: (i) inclusive, (ii) exclusive, and (iii) non-inclusive. Firstly, under the inclusive policy, a piece of data found in one cache is guaranteed to be found in all lower levels of cache. Secondly, under the exclusive policy, a piece of data found in one cache is guaranteed not to be found in any lower level of cache. Thirdly, under the non-inclusive policy, a piece of data found in one cache may or may not be found in lower levels of cache. The inclusive and exclusive policies are opposites, and any policy in between is classified as non-inclusive (Bloch et al., 1948; Boutwell and Hoskinson, 1963). On the one hand, because the exclusive policy does not keep multiple copies of the same data across caches, it does not waste cache capacity. On the other hand, the inclusive policy has the benefit of simplifying the search for data when the computing system has several processors (Figure 3.14) (Eachus, 1961; Bobrovnik, 2014).
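The difference between the three inclusion policies can be seen by comparing the contents of two adjacent cache levels. The sketch below classifies a single snapshot of those contents. The function name `inclusion_relation` and the representation of each cache as a set of block addresses are assumptions for illustration. A real policy is a guarantee that holds at all times, which one snapshot cannot prove.

```python
def inclusion_relation(upper, lower):
    """Classify a snapshot of two adjacent cache levels.

    `upper` is the set of block addresses held by the cache closer to the
    processor; `lower` is the set held by the next level down.
    """
    if upper <= lower:
        return "inclusive"       # everything in upper is also in lower
    if upper.isdisjoint(lower):
        return "exclusive"       # no block is stored twice
    return "non-inclusive"       # some blocks are duplicated, some are not
```

The exclusive case makes the capacity argument concrete: since no block is stored twice, the two levels together hold `len(upper) + len(lower)` distinct blocks.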
Figure 3.14. Single versus multiple-level caches. Source: https://pediaa.com/difference-between-cache-memory-and-virtual-memory/.
Secondly, when the processor writes new data into a cache block, the data stored in the cache block is modified and no longer matches the data that was originally brought into the cache. While the cache holds the most recent copy of the data, all lower levels of the memory hierarchy (that is, lower-level caches and main memory) still hold an older copy (Beckman et al., 1961; Li et al., 2010). In other words, when a cache block is modified, a discrepancy arises between the modified cache block and the lower levels of the memory hierarchy. The memory system uses a write handling policy to resolve this discrepancy. There are two kinds of write handling policies: write-through and write-back (Astrahan and Rochester, 1952; Martin and O'Brien, 2014). Thirdly, a cache at a given level can be partitioned into two smaller caches, each dedicated to one of two kinds of data: (i) instructions and (ii) data. Instructions are a kind of data that tells a computer how to manipulate other data (for example, by adding, subtracting, or moving it). Having two separate caches (an instruction cache and a data cache) has two benefits. Firstly, it prevents one kind of data from monopolizing the cache. The processor needs both kinds of data to execute a program; if the cache is filled with only one kind, the processor may be forced to fetch the other kind from lower levels of the memory hierarchy, incurring a long latency. Secondly, it allows each cache to be placed closer to the processor, reducing the time it takes to supply instructions and data to the processor (Alonso et al., 1963; Honig, 1993).
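Returning to the two write handling policies named above: the text does not define them, so the sketch below relies on their standard meaning. A write-through cache updates the lower level on every write, while a write-back cache updates it only when the modified block is evicted. The class `OneBlockCache`, with its single cache block and a dictionary standing in for the lower levels, is a deliberately tiny, hypothetical model.

```python
class OneBlockCache:
    """A single cache block in front of a dict that stands in for the
    lower levels of the memory hierarchy."""

    def __init__(self, memory, write_through):
        self.memory = memory
        self.write_through = write_through
        self.address = None
        self.value = None
        self.dirty = False   # True when the cached copy is newer than memory

    def write(self, address, value):
        if self.address is not None and self.address != address:
            self.evict()                   # make room for the new block
        self.address, self.value = address, value
        if self.write_through:
            self.memory[address] = value   # update lower level immediately
        else:
            self.dirty = True              # defer the update until eviction

    def evict(self):
        if self.dirty:
            self.memory[self.address] = self.value  # write back on eviction
        self.address, self.value, self.dirty = None, None, False
```

After a write, the write-through variant leaves memory up to date, whereas the write-back variant leaves memory stale until `evict` runs. That stale copy is the discrepancy the paragraph on write handling describes.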
3.4.4. Specialized Caches for Virtual Memory
In virtual memory systems, specialized caches are used to speed up address translation. The most commonly used of these caches is the translation lookaside buffer (TLB). The purpose of a TLB is to cache the most recently used virtual-to-physical address translations, so that translation is fast for virtual addresses that hit in the TLB. In essence, a TLB is a cache that stores the portions of the page table that the processor has recently accessed (Langmuir, 1960; Frankel, 1964).
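A TLB can be pictured as a small cache placed in front of the page table. The sketch below is illustrative only: the class name `TinyTLB`, the four-entry capacity, the least-recently-used eviction, and the page table modeled as a plain dictionary are all assumptions rather than a description of real hardware.

```python
from collections import OrderedDict

PAGE_SIZE = 4 * 1024  # 4 KB pages


class TinyTLB:
    """Small cache of recent virtual-to-physical page translations."""

    def __init__(self, page_table, entries=4):
        self.page_table = page_table       # full mapping: vpn -> ppn
        self.entries = entries
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_address):
        vpn, offset = divmod(virtual_address, PAGE_SIZE)
        if vpn in self.cache:
            self.hits += 1
            self.cache.move_to_end(vpn)    # most recently used
        else:
            self.misses += 1               # slow path: consult the page table
            if len(self.cache) >= self.entries:
                self.cache.popitem(last=False)
            self.cache[vpn] = self.page_table[vpn]
        return self.cache[vpn] * PAGE_SIZE + offset
```

Two accesses to the same virtual page produce one miss followed by one hit. Only the miss has to consult the page table.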
3.5. MAIN MEMORY
Whereas caches are designed primarily for fast performance, the primary goal of main memory is to provide as much capacity as possible at the lowest possible cost and at reasonably low latency. As a result, DRAM (dynamic random-access memory, pronounced "dee-ram") has been the physical substrate of choice for building main memory (Bersagol et al., 1996; Benenson et al., 2004). The smallest unit of DRAM is a DRAM cell, which consists of one transistor (1T) and one capacitor (1C). A DRAM cell stores a single bit of data (0 or 1) in the form of electrical charge in its capacitor ("charged" or "discharged"). The main advantage of DRAM is its small cell size: a DRAM cell requires fewer electrical components (1T and 1C) than an SRAM cell (6T). Consequently, many more DRAM cells can be packed into the same area of a semiconductor chip, allowing DRAM-based main memory to attain a significantly larger capacity for roughly the same cost (Figure 3.15) (Yu et al., 2010; Kojiri and Watanabe, 2016).
Figure 3.15. DRAM types. Source: https://fr.transcend-info.com/Support/FAQ-1263.
SRAM-based caches are often built on the same semiconductor chip as the CPU. DRAM-based main memory, on the other hand, is implemented using one or more dedicated DRAM chips that are separate from the CPU chip. There are two reasons for this. Firstly, a large-capacity main memory requires so many DRAM cells that they cannot all fit on the same chip as the CPU. Secondly, the process technology used to make DRAM cells (and their capacitors) cannot be combined at low cost with the process technology used to make processor chips (Muller, 1964; Statnikov and Akhutina, 2013). As a result, integrating logic and DRAM on one chip would significantly raise costs in today's systems. Instead, main memory is implemented as one or more DRAM chips, each dedicated to holding a large number of DRAM cells. Because DRAM-based main memory is not on the same chip as the processor, it is often referred to as "off-chip" main memory from the CPU's viewpoint (Figure 3.16) (Lukaszewicz, 1963; Langton, 1984).
Figure 3.16. Kinds of DRAMs. Source: https://www.javatpoint.com/dram-in-computer-organization.
The processor communicates with the DRAM chips through a set of electrical wires known as the memory bus. An integrated memory controller (IMC) inside the CPU is connected to the DRAM chips through the memory bus (Symons and Schweitzer, 1984; Treu, 1992). To access a piece of data stored in the DRAM chips, the memory controller sends and receives the appropriate electrical signals over the memory bus. There are three kinds of signals:
• Address signals;
• Command signals; and
• Data signals.
The address indicates where the data is accessed, the command indicates how the data is accessed (for example, read or write), and the data is the actual value being accessed. Correspondingly, the memory bus is divided into three smaller sets of wires, each dedicated to one kind of signal: the address bus, the command bus, and the data bus. Of these three buses, only the data bus is bi-directional, because the memory controller may either send data to or receive data from the DRAM chips; the address and command buses are unidirectional, because only the memory controller sends addresses and commands to the DRAM chips (Koganov, 2017; Bouressace and Csirik, 2018).
3.5.1. DRAM Organization
A DRAM-based main memory system is logically organized as a hierarchy of:
• Channels;
• Ranks; and
• Banks.
Banks are the smallest memory structures that can be accessed in parallel with one another. This is referred to as bank-level parallelism. A rank, in turn, is a group of DRAM chips (and their banks) that operate in lockstep (Claerbout, 1985; Gibson et al., 1998). Typically, a DRAM rank comprises eight DRAM chips, each with eight banks. Because the chips operate in lockstep, the rank has just eight independent banks, each of which is the set of the ith bank across all chips. Banks in different ranks are fully decoupled with respect to their chip-level electrical operations and therefore offer better bank-level parallelism than banks in the same rank. Finally, a channel is made up of all the banks that share the same memory bus (address bus, command bus, and data bus). Whereas banks on the same channel contend for the memory bus, banks on different channels can be accessed independently of one another. Although the DRAM system offers varying degrees of parallelism at different levels of its organization, two memory requests that access the same bank must be served one after another. To see why, let us look at the logical organization of a DRAM bank as seen by the memory controller (Mowshowitz, 1997; Dzhaliashvili and Suhorukova, 2001). The logical organization of a DRAM bank is shown in Figure 3.17. A DRAM bank is a two-dimensional array of capacitor-based DRAM cells. It is viewed as a collection of rows, each of which consists of many columns. As a result, every cell is identified by a pair of addresses: a row address and a column address. Every bank contains a row-buffer, which is an array of sense amplifiers. The purpose of a sense amplifier is to read from a cell by reliably detecting the very small amount of electrical charge stored in it. When writing to a cell, on the other hand, the sense amplifier acts as an electrical driver that programs the cell by filling or emptying its stored charge (Flynn, 1972; Poremba and Xie, 2012).
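One way to picture the channel, rank, bank, row, and column hierarchy is as a decomposition of a block number into coordinates. The sketch below is purely illustrative. The names `DramLocation` and `locate`, the default sizes other than the eight banks per rank taken from the text, and the order in which the fields are peeled off are all assumptions. Real memory controllers use many different address-mapping schemes.

```python
from typing import NamedTuple


class DramLocation(NamedTuple):
    channel: int
    rank: int
    bank: int
    row: int
    column: int


def locate(block_number, channels=2, ranks=2, banks=8, columns=128):
    """Split a block number into DRAM coordinates.

    Field order (column, then channel, bank, rank, row) is an assumed
    interleaving chosen so that consecutive blocks fall in the same row.
    """
    block_number, column = divmod(block_number, columns)
    block_number, channel = divmod(block_number, channels)
    block_number, bank = divmod(block_number, banks)
    row, rank = divmod(block_number, ranks)
    return DramLocation(channel, rank, bank, row, column)
```

Under this assumed mapping, neighboring blocks share a row of one bank, while blocks that are `columns` apart land on a different channel and can therefore be accessed independently.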
Figure 3.17. The organization of DRAM. Source: https://kilthub.cmu.edu/articles/journal_contribution/Memory_Systems/6469010/1.
Wires known as bit lines run across a bank in the column-wise direction, and each of them can connect a sense amplifier to any of the cells in the same column. A wire called the word line determines whether or not the corresponding row of cells is connected to the bit lines. There are four word lines in total, one per row (Figure 3.18).
Figure 3.18. Bank organization.
3.5.2. Bank Operation
To serve a memory request that accesses the data at a particular row and column address, the memory controller issues three commands to a bank in the order listed below. Each command triggers a specific sequence of events within the bank to access the cells associated with the address:
• ACTIVATE (issued with a row address): Load the entire row into the row-buffer.
• READ/WRITE (issued with a column address): Access the data stored in a column from the row-buffer. For a READ command, transfer the data from the row-buffer to the processor. For a WRITE command, modify the data stored in the row-buffer according to the data received from the processor.
• PRECHARGE: Clear the row-buffer.
Every DRAM command incurs a latency while it is processed by the DRAM chip. If a command is issued before the previous command has been fully processed, undefined behavior can occur. To prevent this, the memory controller must obey a set of timing constraints when issuing commands to a DRAM chip (Manegold et al., 2002; Vömel and Freiling, 2011). These constraints specify when a command becomes ready to be scheduled relative to all earlier commands issued to the same channel, rank, or bank. The exact values of the timing constraints are given in the datasheet of each DRAM chip and vary from chip to chip. However, all three DRAM commands (ACTIVATE, READ/WRITE, and PRECHARGE) take roughly 15 nanoseconds each. We recommend Ferreira et al. (2010) for further details on the organization and operation of a DRAM bank.
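The three-command sequence can be mimicked with a toy model of a bank. This sketch makes several simplifying assumptions: the class name `Bank` is hypothetical, every command is charged a flat 15 ns as a stand-in for the real timing constraints, and the row-buffer is modeled as a direct reference to the open row so that writes reach the cells.

```python
COMMAND_LATENCY_NS = 15  # approximate per-command latency from the text


class Bank:
    """Toy DRAM bank: a list of rows plus a single row-buffer."""

    def __init__(self, num_rows, num_columns):
        self.rows = [[0] * num_columns for _ in range(num_rows)]
        self.open_row = None      # index of the row held in the row-buffer
        self.row_buffer = None
        self.elapsed_ns = 0

    def activate(self, row):
        if self.open_row is not None:
            raise RuntimeError("PRECHARGE before activating another row")
        self.open_row = row
        self.row_buffer = self.rows[row]   # alias, so writes reach the cells
        self.elapsed_ns += COMMAND_LATENCY_NS

    def read(self, column):
        self.elapsed_ns += COMMAND_LATENCY_NS
        return self.row_buffer[column]

    def write(self, column, value):
        self.elapsed_ns += COMMAND_LATENCY_NS
        self.row_buffer[column] = value

    def precharge(self):
        self.open_row = None
        self.row_buffer = None
        self.elapsed_ns += COMMAND_LATENCY_NS
```

A full access (ACTIVATE, one READ or WRITE, PRECHARGE) costs three command latencies in this model, which is why keeping a row open for several accesses is attractive.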
3.5.3. Memory Request Scheduling
When many memory requests are waiting to access DRAM, the memory controller must decide which one to schedule next. It does so by applying a memory scheduling algorithm whose goal, in many modern high-performance systems, is to pick the most favorable memory request so as to reduce overall memory request latency. Consider a memory request that is waiting to access a row that is currently loaded in the row-buffer. Because the row is already in the row-buffer, the memory controller can skip the ACTIVATE and PRECHARGE commands and proceed directly to the READ/WRITE command. Consequently, the memory controller can serve that particular memory request quickly. A memory request that accesses the row held in the row-buffer is called a row-buffer hit. Many memory scheduling algorithms exploit the low latency of row-buffer hits and prioritize them over other requests. We refer the reader to recent publications on memory request scheduling for further detail (Seznec, 2010).
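A scheduling rule that prioritizes row-buffer hits fits in one line. The sketch below is a simplified illustration: the function name `pick_next` and the representation of requests as `(arrival, row)` tuples are assumptions, and a real memory controller would also have to respect the timing constraints and handle many banks at once.

```python
def pick_next(pending, open_row):
    """Choose the next request from `pending`, a list of (arrival, row)
    tuples. Requests to the row already in the row-buffer go first;
    ties are broken by arrival order."""
    return min(pending, key=lambda req: (req[1] != open_row, req[0]))
```

For example, with row 7 open, a younger request to row 7 is chosen ahead of an older request to row 5. When no pending request hits the open row, the oldest request is chosen.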
3.5.4. Refresh
A DRAM cell stores data as charge in a capacitor. Over time, this charge slowly leaks away, causing the data to be lost. This is why DRAM is called dynamic random-access memory: its charge changes over time. To preserve data integrity, the charge in every DRAM cell must be periodically restored, or refreshed. DRAM cells are refreshed at the granularity of a row by reading the row out and writing it back in, which is equivalent to issuing an ACTIVATE and a PRECHARGE command to the row in succession (Kong and Zhou, 2010).
3.6. PRESENT AND FUTURE RESEARCH PROBLEMS
The memory system remains a significant bottleneck even in today's high-performance computing systems, particularly in terms of power and energy consumption. It is becoming an even greater bottleneck for the following reasons: a growing number of cores and processing agents share parts of the memory system; the applications running on the cores are becoming increasingly data- and memory-intensive; memory consumes a significant amount of power and energy in modern systems; and it is becoming increasingly difficult to scale well-established memory technologies, such as DRAM, to smaller technology nodes. As a result of these trends, memory management is becoming ever more critical at all levels of the transformation hierarchy, at both the hardware and software levels. To solve the difficult performance, energy-efficiency, correctness, security, and reliability problems we face in designing and building memory systems today, techniques (or combinations of techniques) that apply new ideas cooperatively across several levels appear to be a promising direction (Kjelsø et al., 1999).
3.6.1. Caches
3.6.1.1. Efficient Utilization
Many replacement policies have been proposed to improve on the basic least-recently-used (LRU) policy and make better use of a cache's limited capacity. The LRU policy is not necessarily the best choice for all memory access patterns, which exhibit varying degrees of locality. Under the LRU policy, for instance, the most recently accessed data is always kept in the cache, even if that data has low locality. To make matters worse, the low-locality data stays in the cache for an unnecessarily long time. This is because the LRU policy always inserts data into the cache as the most recently accessed and evicts data only after it has become the least recently accessed. To address this, researchers have combined three techniques to build more sophisticated replacement policies. Firstly, when a cache block is allocated in the cache, it need not always be inserted as the most recently accessed. Cache blocks with low locality can instead be inserted as the least recently accessed, so that they are quickly evicted to make room for cache blocks with higher locality. Secondly, when a cache block is evicted from the cache, it need not always be the least recently accessed block. Ideally, "dead" cache blocks (those that will never be accessed again) should be evicted, regardless of whether they were accessed recently. Thirdly, when deciding which cache block to evict, the cost of re-fetching the cache block from main memory should be taken into account: all else being equal, the cache block with the shortest re-fetch latency should be evicted (Figures 3.19 and 3.20) (Stone, 1970).
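The first of the three techniques, inserting low-locality blocks at the least recently accessed position, can be sketched as a small change to plain LRU. In the sketch below, the function name `access` and the `low_locality` flag are hypothetical. How a real cache predicts that a block has low locality is outside the scope of this example and is simply passed in by the caller.

```python
def access(cache, block, capacity, low_locality=False):
    """Update `cache` (a list ordered from least to most recently used)
    for one access and return True on a hit."""
    if block in cache:
        cache.remove(block)
        cache.append(block)          # promote to most recently used
        return True
    if len(cache) >= capacity:
        cache.pop(0)                 # evict the least recently used block
    if low_locality:
        cache.insert(0, block)       # next in line for eviction
    else:
        cache.append(block)          # conventional LRU insertion
    return False
```

With a three-block cache already holding `a` and `b`, inserting a streaming block `x` with `low_locality=True` means the next miss evicts `x` rather than `a`. Under plain LRU insertion, the same miss would evict `a`.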
Figure 3.19. The management of cache. Source: https://www.webroam.com/cache-management.php.
Figure 3.20. A cache block is being transferred. Source: https://slideplayer.com/slide/14569058/.
3.6.1.2. Quality-of-Service
In a multi-core system, the processor consists of several cores, each of which can run a different program. In such a system, parts of the memory hierarchy may be shared by some of the cores. The last-level cache of a processor, for instance, is usually shared by all cores. Because the last-level cache has a large capacity, providing a separate last-level cache for every core would be prohibitively costly. When the cache is shared, however, it is critical to ensure that every core uses it fairly. Otherwise, a program running on one of the cores could fill the last-level cache with only its own data, evicting data needed by other programs. To prevent this, researchers have developed mechanisms that provide quality of service when a cache is shared by many cores. One approach is to partition the shared cache among the cores so that every core has its own dedicated partition in which only that core's data is stored. Ideally, a core's partition should be just large enough to hold the data the core is actively accessing (the core's working set) and no larger. Moreover, if the size of a core's working set changes over time, the size of its partition should grow or shrink dynamically to match (Qureshi et al., 2009).
3.6.1.3. Low Power
Improving the energy efficiency of computing systems is one of the major challenges that computer architects face. Because caches require a large number of transistors, all of which consume power, they are a prime target for energy optimization. Researchers have proposed, for instance, lowering the operating voltage of the cache's transistors to reduce power consumption. However, this introduces trade-offs that must be weighed when designing a cache: while a low-voltage cache consumes less power, it can also be slower and more prone to errors. Researchers are investigating low-voltage caches that provide adequate reliability at a reasonable cost.
3.6.2. Main Memory
3.6.2.1. Challenges in DRAM Scaling
DRAM is the most popular physical substrate for building main memory, mainly because of its low cost per bit. Moreover, as DRAM process technology has scaled to pack more DRAM cells into the same area of a semiconductor chip, the cost per bit of DRAM has steadily declined. However, increasing DRAM cell density by shrinking the cell, as has been done in the past, is becoming more difficult because of rising manufacturing cost and complexity and reduced cell reliability. Consequently, researchers are looking into new ways to improve the performance and energy efficiency of DRAM while keeping its cost low. Recent proposals include, for instance, reducing DRAM access latency, increasing DRAM parallelism, and reducing the number of DRAM refreshes (Figure 3.21) (Garcia-Molina and Salem, 1992).
Figure 3.21. Challenges of DRAM scaling. Source: https://semiwiki.com/semiconductor-services/297904-spie-2021-applied-materials-dram-scaling/.
3.6.2.2. 3D-Stacked Memory
A DRAM chip is connected to the processor chip by a memory bus. In today's systems, the memory bus is implemented using electrical wires on the motherboard. This has three drawbacks: high power consumption, limited bandwidth, and high cost. Firstly, because a motherboard wire is long and thick, transmitting an electrical signal through it takes a significant amount of power. Secondly, because sending many electrical signals in a short period takes even more power, the bandwidth that a motherboard wire can provide is limited. Thirdly, both the CPU chip and the DRAM chip have a stub (called a pin) to which either end of a motherboard wire is connected, and pins are expensive. Consequently, a memory bus made up of many motherboard wires (and therefore requiring many pins on the chips) raises the cost of the memory system. As a solution, researchers have proposed stacking one or more DRAM chips directly on top of a CPU chip rather than placing them side by side on the motherboard. A 3D-stacked design allows the CPU to communicate directly with the DRAM chip(s), reducing the need for motherboard wires and pins (Figure 3.22) (Garcia-Molina and Salem, 1992).
Figure 3.22. The design of a three-dimensional stacked DRAM memory cell. Source: https://www.researchgate.net/figure/Architecture-of-a-3D-stackedDRAM-based-on-39-92_fig2_333077335.
REFERENCES 1.
Agarwal, A., & Pudar, S. D., (1993). Column-associative caches: A technique for reducing the miss rate of direct-mapped caches. In: Proceedings of the 20th Annual International Symposium on Computer Architecture (pp. 179–190). 2. Alonso, R. L., Blair-Smith, H., & Hopkins, A. L., (1963). Some aspects of the logical design of a control computer: A case study. IEEE Transactions on Electronic Computers, (6), 687–697. 3. Alpern, B., Carter, L., Feig, E., & Selker, T., (1994). The uniform memory hierarchy model of computation. Algorithmica, 12(2), 72– 109. 4. Astrahan, M. M., & Rochester, N., (1952). The logical organization of the new IBM scientific calculator. In: Proceedings of the 1952 ACM National Meeting (Pittsburgh) (pp. 79–83). 5. Avizienis, A., Laprie, J. C., & Randell, B., (2001). Fundamental concepts of computer system dependability. In: Workshop on Robot Dependability: Technological Challenge of Dependable Robots in Human Environments (pp. 1–16). Citeseer. 6. Avizienis, A., Laprie, J. C., Randell, B., & Landwehr, C., (2004). Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1), 11–33. 7. Bahn, H., Koh, K., Noh, S. H., & Lyul, S. M., (2002). Efficient replacement of nonuniform objects in web caches. Computer, 35(6), 65–73. 8. Bailey, D. H., (1995). Unfavorable strides in cache memory systems (RNR technical report rnr-92-015). Scientific Programming, 4(2), 53– 58. 9. Basak, A., Li, S., Hu, X., Oh, S. M., Xie, X., Zhao, L., & Xie, Y., (2019). Analysis and optimization of the memory hierarchy for graph processing workloads. In: 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA) (pp. 373–386). IEEE. 10. Beckman, F. S., Brooks, F. P., & Lawless, W. J., (1961). Developments in the logical organization of computer arithmetic and control units. Proceedings of the IRE, 49(1), 53–66.
11. Benenson, Y., Gil, B., Ben-Dor, U., Adar, R., & Shapiro, E., (2004). An autonomous molecular computer for logical control of gene expression. Nature, 429(6990), 423–429. 12. Bersagol, V., Dessalles, J. L., Kaplan, F., Marze, J. C., & Picault, S., (1996). XMOISE: A logical spreadsheet to elicit didactic knowledge. In: International Conference on Computer-Aided Learning and Instruction in Science and Engineering (pp. 430–432). Springer, Berlin, Heidelberg. 13. Bivens, A., Dube, P., Franceschini, M., Karidis, J., Lastras, L., & Tsao, M., (2010). Architectural design for next generation heterogeneous memory systems. In: 2010 IEEE International Memory Workshop (pp. 1–4). IEEE. 14. Bloch, R. M., Campbell, R. V. D., & Ellis, M., (1948). The logical design of the Raytheon computer. Mathematical Tables and Other Aids to Computation, 3(24), 286–295. 15. Bobrovnik, V. I., (2014). Structure and logical organization of current studies in track and field sports. Pedagogics, Psychology, MedicalBiological Problems of Physical Training and Sports, 18(3), 3–18. 16. Bouressace, H., & Csirik, J., (2018). Recognition of the logical structure of Arabic newspaper pages. In: International Conference on Text, Speech, and Dialogue (pp. 251–258). Springer, Cham. 17. Boutwell, Jr. E., & Hoskinson, E. A., (1963). The logical organization of the PB 440 micro programmable computer. In: Proceedings of the Fall Joint Computer Conference (pp. 201–213). 18. Bradley, E. M., (1962). Properties of magnetic films for memory systems. Journal of Applied Physics, 33(3), 1051–1057. 19. Burger, D., (1996). Memory systems. ACM Computing Surveys (CSUR), 28(1), 63–65. 20. Burnett, G. J., & Coffman, Jr. E. G., (1970). A study of interleaved memory systems. In: Proceedings of the Spring Joint Computer Conference (pp. 467–474). 21. Cabeza, M. L. C., Clemente, M. I. G., & Rubio, M. L., (1999). CacheSim: A cache simulator for teaching memory hierarchy behavior. ACM SIGCSE Bulletin, 31(3), 181. 22. 
Camp, D., Childs, H., Chourasia, A., Garth, C., & Joy, K. I., (2011). Evaluating the benefits of an extended memory hierarchy for parallel streamline algorithms. In: 2011 IEEE Symposium on Large Data Analysis and Visualization (pp. 57–64). IEEE.
23. Chen, C. L., (1992). Symbol error-correcting codes for computer memory systems. IEEE Transactions on Computers, 41(02), 252–256.
24. Claerbout, J. F., (1985). Imaging the Earth’s Interior (Vol. 1, pp. 5–10). Oxford: Blackwell scientific publications.
25. Coleman, C. L., & Davidson, J. W., (2001). Automatic memory hierarchy characterization. In: ISPASS (pp. 103–110).
26. Cooper-Balis, E., Rosenfeld, P., & Jacob, B., (2012). Buffer-on-board memory systems. ACM SIGARCH Computer Architecture News, 40(3), 392–403.
27. Drost, R., Forrest, C., Guenin, B., Ho, R., Krishnamoorthy, A. V., Cohen, D., & Sutherland, I., (2005). Challenges in building a flat-bandwidth memory hierarchy for a large-scale computer with proximity communication. In: 13th Symposium on High Performance Interconnects (HOTI’05) (pp. 13–22). IEEE.
28. Dzhaliashvili, Z. O., & Suhorukova, M. V., (2001). Information, optoelectronics, and information technologies (historic, philosophical, and logical aspects). In: Selected Papers from the International Conference on Optoelectronic Information Technologies (Vol. 4425, pp. 102–105). SPIE.
29. Eachus, J. J., (1961). Logical organization of the Honeywell H-290. Transactions of the American Institute of Electrical Engineers, Part I: Communication and Electronics, 79(6), 715–719.
30. Eckert, J. P., (1953). A survey of digital computer memory systems. Proceedings of the IRE, 41(10), 1393–1406.
31. Endoh, T., Ohsawa, T., Koike, H., Hanyu, T., & Ohno, H., (2012). Restructuring of the memory hierarchy in computing system with spintronics-based technologies. In: 2012 Symposium on VLSI Technology (VLSIT) (pp. 89, 90). IEEE.
32. Fatahalian, K., Horn, D. R., Knight, T. J., Leem, L., Houston, M., Park, J. Y., & Hanrahan, P., (2006). Sequoia: Programming the memory hierarchy. In: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (p. 83-es).
33. Ferreira, A. P., Zhou, M., Bock, S., Childers, B., Melhem, R., & Mossé, D., (2010). Increasing PCM main memory lifetime. In: 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010) (pp. 914–919). IEEE.
34. Flynn, M. J., (1972). Some computer organizations and their effectiveness. IEEE Transactions on Computers, 100(9), 948–960.
35. Frankel, S., (1964). R64-36 the logical organization of the PB 440 micro programmable computer. IEEE Transactions on Electronic Computers, (3), 311–321.
36. Freitas, A. A., & Lavington, S. H., (2000). Basic concepts on parallel processing. Mining Very Large Databases with Parallel Processing, 61–69.
37. Garcia-Molina, H., & Salem, K., (1992). Main memory database systems: An overview. IEEE Transactions on Knowledge and Data Engineering, 4(6), 509–516.
38. Gibson, D., Kleinberg, J., & Raghavan, P., (1998). Inferring web communities from link topology. In: Proceedings of the Ninth ACM Conference on Hypertext and Hypermedia: Links, Objects, Time and Space---Structure in Hypermedia Systems (pp. 225–234).
39. Hallnor, E. G., & Reinhardt, S. K., (2005). A unified compressed memory hierarchy. In: 11th International Symposium on High-Performance Computer Architecture (pp. 201–212). IEEE.
40. Hangal, S., Vahia, D., Manovit, C., Lu, J. Y., & Narayanan, S., (2004). TSOtool: A program for verifying memory systems using the memory consistency model. In: Proceedings. 31st Annual International Symposium on Computer Architecture (pp. 114–123). IEEE.
41. Harper, III. D. T., (1991). Block, multistrike vector, and FFT accesses in parallel memory systems. IEEE Transactions on Parallel & Distributed Systems, 2(01), 43–51.
42. Heersmink, R., (2017). Distributed selves: Personal identity and extended memory systems. Synthese, 194(8), 3135–3151.
43. Henke, K., (2010). A model for memory systems based on processing modes rather than consciousness. Nature Reviews Neuroscience, 11(7), 523–532.
44. Hill, M. D., (1988). A case for direct-mapped caches. Computer, 21(12), 25–40.
45. Hollingshead, A. B., & Brandon, D. P., (2003). Potential benefits of communication in transactive memory systems. Human Communication Research, 29(4), 607–615. 46. Hollingshead, A. B., (1998). Retrieval processes in transactive memory systems. Journal of Personality and Social Psychology, 74(3), 659. 47. Honig, W. M., (1993). Logical organization of knowledge with inconsistent and undecidable algorithms using imaginary and transfinite exponential number forms in a non-Boolean field-I. Basic principles. IEEE Transactions on Knowledge and Data Engineering, 5(2), 190–203. 48. Jacob, B., (2009). The memory system: You can’t avoid it, you can’t ignore it, you can’t fake it. Synthesis Lectures on Computer Architecture, 4(1), 1–77. 49. Johnson, S. D., (1990). Manipulating logical organization with system factorizations. In: Hardware Specification, Verification and Synthesis: Mathematical Aspects (pp. 260–281). Springer, New York, NY. 50. Kessler, R. E., & Hill, M. D., (1992). Page placement algorithms for large real-indexed caches. ACM Transactions on Computer Systems (TOCS), 10(4), 338–359. 51. Kjelsø, M., Gooch, M., & Jones, S., (1999). Performance evaluation of computer architectures with main memory data compression. Journal of Systems Architecture, 45(8), 571–590. 52. Koganov, A. V., (2017). The tests for checking of the parallel organization in the logical calculation are based on the algebra and the automats. Computer Research and Modeling, 9(4), 621–638. 53. Kojiri, T., & Watanabe, Y., (2016). Contents organization support for logical presentation flow. In: Pacific Rim International Conference on Artificial Intelligence (pp. 77–88). Springer, Cham. 54. Kong, J., & Zhou, H., (2010). Improving privacy and lifetime of PCMbased main memory. In: 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN) (pp. 333–342). IEEE. 55. Laha, S., Patel, J. H., & Iyer, R. K., (1988). 
Accurate low-cost methods for performance evaluation of cache memory systems. IEEE Transactions on Computers, 37(11), 1325–1336. 56. Langmuir, C. R., (1960). A logical machine for measuring problem solving ability. In: Papers Presented at the Eastern Joint IRE-AIEE-ACM Computer Conference (pp. 1–9).
57. Langton, C. G., (1984). Self-reproduction in cellular automata. Physica D: Nonlinear Phenomena, 10(1, 2), 135–144. 58. Lawless, W. J., (1958). Developments in computer logical organization. In: Advances in Electronics and Electron Physics (Vol. 10, pp. 153– 184). Academic Press. 59. Lazzaroni, M., Piuri, V., & Maziero, C., (2010). Computer security aspects in industrial instrumentation and measurements. In: 2010 IEEE Instrumentation & Measurement Technology Conference Proceedings (pp. 1216–1221). IEEE. 60. Li, G., Lu, H., & Wang, T., (2010). Modeling knowledge logical organization with an intelligent topic map. In: The 3rd International Conference on Information Sciences and Interaction Sciences (pp. 1–5). IEEE. 61. Li, K., & Hudak, P., (1989). Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems (TOCS), 7(4), 321–359. 62. Liedtke, J., (1995). On micro-kernel construction. ACM SIGOPS Operating Systems Review, 29(5), 237–250. 63. Lin, W. F., Reinhardt, S. K., & Burger, D., (2001). Reducing DRAM latencies with an integrated memory hierarchy design. In: Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture (pp. 301–312). IEEE. 64. Lioupis, D., Kanellopoulos, N., & Stefanidakis, M., (1993). The memory hierarchy of the CHESS computer. Microprocessing and Microprogramming, 38(1–5), 99–107. 65. Lukaszewicz, L., (1963). Outline of the logical design of the ZAM-41 computer. IEEE Transactions on Electronic Computers, (6), 609–612. 66. Manegold, S., Boncz, P., & Kersten, M., (2002). Optimizing mainmemory join on modern hardware. IEEE Transactions on Knowledge and Data Engineering, 14(4), 709–730. 67. Martin, R. P., & O’Brien, J. E., (2014). Physical systems for regulatory investigation: Part I logical organization of observation and reason. In: Proceedings of Advances in Thermal-Hydraulics, American Nuclear Society Annual Meeting (Vol. 1, pp. 1–15). Reno, NV.
68. Matthews, J., Trika, S., Hensgen, D., Coulson, R., & Grimsrud, K., (2008). Intel® turbo memory: Nonvolatile disk caches in the storage hierarchy of mainstream computer systems. ACM Transactions on Storage (TOS), 4(2), 1–24. 69. McAulay, A. D., (1991). Optical Computer Architectures: The Application of Optical Concepts to Next Generation Computers (Vol. 1, pp. 15–20). John Wiley & Sons, Inc. 70. McCalpin, J. D., (1995). Memory bandwidth and machine balance in current high-performance computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, 2, 19–25. 71. McFarling, S., (1989). Program optimization for instruction caches. In: Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems (pp. 183–191). 72. Mellor-Crummey, J., Whalley, D., & Kennedy, K., (2001). Improving memory hierarchy performance for irregular applications using data and computation reordering. International Journal of Parallel Programming, 29(3), 217–247. 73. Mogul, J. C., (2000). Squeezing more bits out of HTTP caches. IEEE Network, 14(3), 6–14. 74. Moreno, L., González, E. J., Popescu, B., Toledo, J., Torres, J., & Gonzalez, C., (2011). MNEME: A memory hierarchy simulator for an engineering computer architecture course. Computer Applications in Engineering Education, 19(2), 358–364. 75. Mowshowitz, A., (1997). Virtual organization. Communications of the ACM, 40(9), 30–37. 76. Muller, D. E., (1964). The place of logical design and switching theory in the computer curriculum. Communications of the ACM, 7(4), 222– 225. 77. Mutlu, O., & Subramanian, L., (2014). Research problems and opportunities in-memory systems. Supercomputing Frontiers and Innovations, 1(3), 19–55. 78. Neefs, H., Van, H. P., & Van, C. J. M., (1998). Latency requirements of optical interconnect at different memory hierarchy levels of a computer system. In: Optics in Computing’98 (Vol. 3490, pp. 552– 555). 
International Society for Optics and Photonics.
79. Obaidat, M. S., & Boudriga, N. A., (2010). Fundamentals of Performance Evaluation of Computer and Telecommunication Systems (Vol. 1 pp. 15–25). John Wiley & Sons 80. Olivera, F., (2000). Memory systems in organizations: An empirical investigation of mechanisms for knowledge collection, storage, and access. Journal of Management Studies, 37(6), 811–832. 81. Poremba, M., & Xie, Y., (2012). Nvmain: An architectural-level main memory simulator for emerging non-volatile memories. In: 2012 IEEE Computer Society Annual Symposium on VLSI (pp. 392–397). IEEE. 82. Protic, J., Tomasevic, M., & Milutinovic, V., (1995). A survey of distributed shared memory systems. In: Proceedings of the TwentyEighth Annual Hawaii International Conference on System Sciences (Vol. 1, pp. 74–84). IEEE. 83. Qureshi, M. K., Srinivasan, V., & Rivers, J. A., (2009). Scalable high performance main memory system using phase-change memory technology. In: Proceedings of the 36th Annual International Symposium on Computer Architecture (pp. 24–33). 84. Rege, S. L., (1976). Cost, performance, and size tradeoffs for different levels in a memory hierarchy. Computer, 9(4), 43–51. 85. Rosenberg, J., (1957). Logical organization of the DIGIMATIC computer. In: Papers and Discussions Presented at the Eastern Joint Computer Conference: Computers with Deadlines to Meet (pp. 25–29). 86. Ryoo, J., Kandemir, M. T., & Karakoy, M., (2022). Memory space recycling. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 6(1), 1–24. 87. Saleh, A. M., Serrano, J. J., & Patel, J. H., (1990). Reliability of scrubbing recovery techniques for memory systems. IEEE Transactions on Reliability, 39(1), 114–122. 88. Sardashti, S., Arelakis, A., Stenström, P., & Wood, D. A., (2015). A primer on compression in the memory hierarchy. Synthesis Lectures on Computer Architecture, 10(5), 1–86. 89. Seznec, A., (1993). A case for two-way skewed-associative caches. ACM SIGARCH Computer Architecture News, 21(2), 169–178. 
90. Seznec, A., (2010). A phase-change memory is a secure main memory. IEEE Computer Architecture Letters, 9(1), 5–8.
91. Somers, H., (2003). Translation memory systems. Benjamins Translation Library, 35, 31–48. 92. Sorin, D. T., Pai, V. S., Adve, S., Vemon, M. K., & Wood, D. A., (1998). Analytic evaluation of shared-memory systems with ILP processors. In: Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No. 98CB36235) (pp. 380–391). IEEE. 93. Statnikov, A. I., & Akhutina, T. V., (2013). Logical-grammatical constructions comprehension and serial organization of speech: Finding the link using computer-based tests. Procedia-Social and Behavioral Sciences, 86, 518–523. 94. Stone, H. S., (1970). A logic-in-memory computer. IEEE Transactions on Computers, 100(1), 73–78. 95. Sutton, M. A., Orteu, J. J., & Schreier, H., (2009). Image Correlation for Shape, Motion and Deformation Measurements: Basic Concepts, Theory and Applications (Vol. 1, pp. 10–16). Springer Science & Business Media. 96. Symons, C. R., & Schweitzer, J. A., (1984). A proposal for an automated logical access control standard (ALACS) is a standard for computer logical access security. ACM SIGCHI Bulletin, 16(1), 17–23. 97. Tam, M. C., Smith, J. M., & Farber, D. J., (1990). A taxonomy-based comparison of several distributed shared memory systems. ACM SIGOPS Operating Systems Review, 24(3), 40–67. 98. Treu, S., (1992). Interface structures: Conceptual, logical, and physical patterns applicable to human-computer interaction. International Journal of Man-Machine Studies, 37(5), 565–593. 99. Viana, P., Barros, E., Rigo, S., Azevedo, R., & Araújo, G., (2003). Exploring memory hierarchy with ArchC. In: Proceedings. 15th Symposium on Computer Architecture and High Performance Computing (pp. 2–9). IEEE. 100. Vömel, S., & Freiling, F. C., (2011). A survey of main memory acquisition and analysis techniques for the windows operating system. Digital Investigation, 8(1), 3–22. 101. Welch, T. A., (1978). Memory hierarchy configuration analysis. IEEE Transactions on Computers, 27(05), 408–413. 102. 
Worlton, J., (1991). Toward a taxonomy of performance metrics. Parallel Computing, 17(10, 11), 1073–1092.
103. Xia, F., Jiang, D. J., Xiong, J., & Sun, N. H., (2015). A survey of phase change memory systems. Journal of Computer Science and Technology, 30(1), 121–144. 104. Yasui, Y., Fujisawa, K., Goto, K., Kamiyama, N., & Takamatsu, M., (2011). Netal: High-performance implementation of network analysis library considering computer memory hierarchy (< Special Issue> SCOPE (seminar on computation and optimization for new extensions)). Journal of the Operations Research Society of Japan, 54(4), 259–280. 105. Yu, Y., Bai, Y. S., Li, N., & Xu, W. H., (2010). The logical structure design of the web-based courses interface elements based on human factors. In: 2010 2nd IEEE International Conference on Information Management and Engineering (pp. 458–462). IEEE.
CHAPTER 4
COMPUTER PROCESSING AND PROCESSORS
CONTENTS
4.1. Introduction
4.2. Computer Processors
4.3. Computer Processes (Computing)
4.4. Multitasking and Process Management
4.5. Process States
4.6. Inter-Process Communication (IPC)
4.7. Historical Background of Computer Processing
4.8. Types of Central Processing Units (CPUs)
References
4.1. INTRODUCTION
The central processing unit is commonly abbreviated as CPU. The CPU is one of the most critical components of any digital computer or programmable device, and the type of CPU largely determines how efficiently the device operates. Before looking at the many types of processors, it is important to understand the purpose of the CPU in plain terms (Figure 4.1) (Krikelis and Weems, 1994; Kapasi et al., 2003).
Figure 4.1. Representation of a typical processor. Source: https://digitalworld839.com/types-of-central-processing-unit/.
The CPU is the primary processing component of a computer or other device, responsible for ensuring that tasks such as gaming, editing, Internet browsing, and chatting are completed effectively. It is mounted in a socket on the motherboard, next to the voltage regulator module (VRM) area (Kuroda and Nishitani, 1998). Technically, the CPU serves as the brain of the device: it processes the device’s instructions and performs arithmetic and logical operations. It then sends signals to the other parts of the computer and to peripherals, and carries out the operations involved in managing computer memory. This activity generates heat, which is why CPUs are fitted with small cooling fans (Afif et al., 2020; Rose et al., 2021).
In a nutshell, the CPU takes in the programs and data the machine needs to carry out a task, processes them, and provides the result to the user as output (Figure 4.2) (Walker and Cragon, 1995).
Figure 4.2. CPU working diagram. Source: https://digitalworld839.com/types-of-central-processing-unit/.
The term “CPU” is also used informally for the PC cabinet (system unit) that houses the processor together with internal units such as ROM and RAM, the power supply unit, and hard drives (Cole et al., 1986; Chen et al., 2007). The cabinet provides ports through which you can attach a variety of input, output, and storage devices, such as displays, pen drives, microphones, headphones, keyboards, mice, cameras, printers, and digital graphics tablets.
4.2. COMPUTER PROCESSORS
A processor, also known as a microprocessor, is a small chip used in computers and other electronic devices to perform calculations. The processor handles all basic instructions, such as input/output (I/O), logical, and arithmetic instructions, that are issued by the hardware or the operating system (OS). Its primary job is to take input from input devices and deliver correct results to output devices. Today, more advanced processors are available on the market, each capable of executing billions of instructions per second (Darringer, 1969; Lazzaro et al., 1992). Processors are used in personal computers, but they are also found in other electronic devices such as smartphones, tablets, and personal digital assistants. Intel and AMD are the two major companies that manufacture CPUs for the personal computer market (Dulberger, 1993; Rao, 2019).
4.2.1. Basic Components of Processor
• ALU: The arithmetic logic unit carries out all arithmetic and logic operations.
• FPU: The floating-point unit, also known as the “math coprocessor,” handles floating-point calculations.
• Registers: Hold instructions, data, and the results of operations, and supply operands to the ALU.
• Cache Memory: Reduces the time spent fetching data from main memory (Figure 4.3).
Figure 4.3. Components of CPU. Source: https://en.wikipedia.org/wiki/Central_processing_unit.
4.2.2. Primary Processor Operations
• Fetch: The processor retrieves the next instruction from main memory (RAM).
• Decode: The decoder converts the instruction into signals that the other components of the CPU can act on. The decoder handles the entire conversion.
• Execute: The operation is carried out, activating every component of the CPU needed to perform the instruction.
• Write-Back: After the operation completes, the result is written back to a register or to memory (Figure 4.4).
Figure 4.4. A simplified view of the instruction cycle. Source: https://padakuu.com/operational-overview-of-cpu-199-article.
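The fetch, decode, execute, and write-back steps listed above can be sketched as a toy interpreter. This is only an illustrative model, not a description of any real CPU: the three-field instruction format, the opcode names (`LOAD`, `ADD`, `HALT`), and the register names `r0` and `r1` are assumptions made up for this example.

```python
# Toy instruction cycle. The instruction set here is hypothetical.
def run(program):
    registers = {"r0": 0, "r1": 0}
    pc = 0  # program counter: index of the next instruction in "memory"
    while True:
        instruction = program[pc]            # fetch
        pc += 1
        opcode, dest, operand = instruction  # decode
        if opcode == "HALT":
            return registers
        if opcode == "LOAD":                 # execute: take an immediate value
            result = operand
        elif opcode == "ADD":                # execute: add two registers
            result = registers[dest] + registers[operand]
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        registers[dest] = result             # write-back


program = [
    ("LOAD", "r0", 2),
    ("LOAD", "r1", 3),
    ("ADD", "r0", "r1"),
    ("HALT", None, None),
]
final_registers = run(program)
print(final_registers)
```

Each pass through the loop handles exactly one instruction, which mirrors the cycle shown in Figure 4.4. Real processors overlap these stages in a pipeline rather than finishing one instruction before starting the next.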
4.2.3. Speed of Processor
The speed of a processor depends on the capabilities of the CPU. If the CPU is unlocked, you can overclock it, that is, run it at a clock frequency higher than its stock setting (Owston and Wideman, 1997; Geer, 2005). If you buy a locked CPU, you will not be able to raise the processor’s speed later.
4.3. COMPUTER PROCESSES (COMPUTING)
In computing, a process is an instance of a computer program that is being executed by one or more threads. It contains the program code and its activity. Depending on the OS, a process may be made up of multiple threads of execution that execute instructions concurrently (Figure 4.5) (Decyk et al., 1996; Wiriyasart et al., 2019).
Figure 4.5. Program vs. process vs. thread (scheduling, preemption, context switching). Source: https://en.wikipedia.org/wiki/Process_(computing).
Whereas a computer program is a passive collection of instructions, typically stored in a file on disk, a process is the execution of those instructions after they have been loaded from disk into memory. Several processes may be associated with the same program; for example, opening several instances of the same application often results in more than one process being executed (Rose et al., 2021).
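The distinction between a program and a process can be observed directly. The sketch below starts the same one-line program twice and collects the process ID that each instance reports. The helper name `launch_instances` and the one-line program are assumptions for illustration; the example relies only on Python's standard `subprocess` module.

```python
import subprocess
import sys

# The "program": a passive piece of code that prints its own process ID.
PROGRAM = "import os; print(os.getpid())"


def launch_instances(count):
    """Start `count` processes from the same program and return their PIDs."""
    # Start every child before waiting on any of them, so all instances
    # exist at the same time.
    children = [
        subprocess.Popen(
            [sys.executable, "-c", PROGRAM],
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in range(count)
    ]
    # communicate() waits for the child and returns (stdout, stderr).
    return [int(child.communicate()[0].strip()) for child in children]


pids = launch_instances(2)
print(pids)
```

The two numbers printed differ from run to run, but they are never equal to each other: one program, two processes.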
Multitasking is a technique that allows multiple processes to share processors (CPUs) and other system resources. Each CPU (core) executes a single task at a time. Multitasking, however, lets each processor switch between the tasks being executed without having to wait for each task to finish. Switches may occur when tasks initiate and wait for the completion of input/output operations, when a task voluntarily yields the CPU, on hardware interrupts, or when the OS scheduler decides that a process has used up its fair share of CPU time (Guirchoun et al., 2005).
A common form of multitasking is CPU time-sharing, a method for interleaving the execution of users’ processes and threads, and even of independent kernel tasks, although the latter is feasible only in preemptive kernels such as Linux. Preemption has an important side effect for interactive processes: because they are given higher priority than CPU-bound processes, users are assigned computing resources immediately when they simply press a key or move the mouse (Alverson et al., 1990). In addition, applications such as video and music playback are given some kind of real-time priority, preempting any other lower-priority process. In time-sharing systems, context switches are performed rapidly, which makes it seem as if multiple processes are being executed simultaneously on the same processor. This seemingly simultaneous execution of multiple processes is called concurrency (Fox et al., 1989; Ginosar, 2012).
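A minimal sketch of time-sharing is given below. It assumes a simple round-robin policy with a fixed time quantum; the text does not name a specific scheduling policy, so this choice, the function name `round_robin`, and the task names are illustrative assumptions. Real schedulers also account for priorities, I/O waits, and interrupts.

```python
from collections import deque


def round_robin(burst_times, quantum):
    """Return the sequence of (task, time used) slices on a single CPU."""
    ready = deque(burst_times.items())  # (name, remaining time) pairs
    timeline = []
    while ready:
        name, remaining = ready.popleft()
        used = min(quantum, remaining)   # run until done or quantum expires
        timeline.append((name, used))
        remaining -= used
        if remaining > 0:
            # The task used up its share of CPU time: it is preempted
            # and goes to the back of the ready queue.
            ready.append((name, remaining))
    return timeline


timeline = round_robin({"editor": 3, "player": 5, "backup": 2}, quantum=2)
print(timeline)
```

Only one task runs in any slice, yet all three make progress in turn, which is what gives the impression of simultaneous execution.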
Because of the need for security and stability, most current OSs prohibit direct communication between independent processes, instead offering inter-process communication (IPC) capabilities that are carefully managed and monitored (Figures 4.6 and 4.7) (Venu, 2011).
Figure 4.6. A list of processes as shown by htop. Source: https://en.wikipedia.org/wiki/Process_(computing).
Figure 4.7. A process table as shown by KDE System Guard. Source: https://en.wikipedia.org/wiki/Process_(computing).
Generally, a process in a computer system consists of (or is said to own) the following resources:
• An image of the executable machine code associated with a program;
• Memory, which holds the executable code, process-specific data (input and output), a call stack, and a heap;
• Descriptors of resources allocated to the process by the OS, such as file descriptors or handles, and data sources and sinks;
• Security attributes, such as the process owner and the process’s set of permissions;
• Processor state (context), such as the contents of registers and physical memory addressing. The state is typically held in processor registers while the process is executing and in memory otherwise.
The OS keeps most of this information about active processes in data structures called process control blocks. In OSs that support threads or child processes, any subset of the resources, typically at least the processor state, may be associated with each of the process’s threads (Zhang et al., 2005; Hall et al., 2015). The OS keeps its processes separate and allocates the resources they need, so that they are less likely to interfere with one another and cause system failures. The OS may also provide mechanisms for IPC that let processes interact in safe and predictable ways (Park et al., 2017; Estrada-Torres, 2021).
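The per-process bookkeeping described above can be pictured as a record that the OS keeps for each process, commonly called a process control block. The sketch below is a simplified illustration only: the field names and the `save_context` helper are assumptions for this example and do not correspond to the layout used by any particular operating system.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessControlBlock:
    """Hypothetical, simplified per-process record."""
    pid: int
    owner: str                                       # security attribute
    state: str = "waiting"
    program_counter: int = 0                         # saved processor state
    registers: dict = field(default_factory=dict)    # saved processor state
    open_files: list = field(default_factory=list)   # resource descriptors
    permissions: set = field(default_factory=set)    # security attribute


def save_context(pcb, program_counter, registers):
    """Copy the processor state into the record when the process stops running."""
    pcb.program_counter = program_counter
    pcb.registers = dict(registers)  # copy, so later changes do not leak in


pcb = ProcessControlBlock(pid=42, owner="alice",
                          open_files=[0, 1, 2], permissions={"read"})
save_context(pcb, program_counter=128, registers={"r0": 7})
print(pcb)
```

While the process runs, its state lives in the processor's registers; the record holds the saved copy whenever the process is not running.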
4.4. MULTITASKING AND PROCESS MANAGEMENT
Even though only one process can be executing on a single CPU at any given moment, a multitasking OS can switch between processes to give the appearance of many processes executing simultaneously. It is usual to associate a single process with a main program, and child processes with any spin-off, parallel processes, which behave like asynchronous subroutines. A process is said to own resources, one of which is an image of its program in memory. In multiprocessing systems, however, many processes may run off of, or share, the same reentrant program at the same location in memory, yet each process is said to own its own image of the program (Egger et al., 2008; Ghaffari and Emsley, 2016). This description applies both to processes managed by an OS and to processes as defined by process calculi. If a process requests something for which it must wait, it is blocked. A process in the blocked state is eligible for swapping to disk, but in a virtual memory system this is transparent, because regions of a process’s memory may reside on disk rather than in main memory at any time. Even portions of active processes are eligible for swapping to disk if they have not been used recently. Not all parts of an executing program and its data need to be in physical memory for the associated process to be active (Figure 4.8) (Chu et al., 2011; Schultz et al., 2013).
Figure 4.8. Examples of computer multitasking programs. Source: https://medium.com/@rmsrn.85/multitasking-operating-system-typesand-its-benefits-deb1211c1643.
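The way a single CPU gives the impression of concurrency by switching among processes can be sketched with a toy round-robin scheduler. This is a minimal simulation under stated assumptions: Python generators stand in for processes, one `next()` call stands in for one time slice, and the names `task` and `round_robin` are hypothetical.

```python
from collections import deque


def task(name, steps):
    """A toy 'process': yields after each unit of work so it can be switched out."""
    for i in range(1, steps + 1):
        yield f"{name}:{i}"


def round_robin(tasks):
    """Run generator-based tasks one step at a time, in rotation."""
    ready = deque(tasks)
    trace = []
    while ready:
        current = ready.popleft()
        try:
            trace.append(next(current))  # run one time slice
            ready.append(current)        # return to the back of the ready queue
        except StopIteration:
            pass                         # task has finished; do not requeue it
    return trace


trace = round_robin([task("A", 3), task("B", 2)])
print(trace)
```

Only one task runs at any instant, yet the interleaved trace shows both making progress, which is the effect a multitasking OS produces.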
4.5. PROCESS STATES
For an OS kernel to support multitasking, processes must have defined states. The names of these states are not standardized, but they serve the same purpose:
• The process is first “created” by being loaded from a secondary storage device (hard disk drive, CD-ROM, etc.) into main memory. The process scheduler then assigns it the “waiting” state.
• While “waiting,” the process waits for the scheduler to perform a so-called context switch. The context switch loads the process into the processor and changes its state to “running,” while the previously “running” process is placed in the “waiting” state.
• A process in the “running” state that has to wait for a resource is assigned the “blocked” state. When the process no longer needs to wait, its state is changed back to “waiting” (Figure 4.9).
Figure 4.9. In the state diagram, the different process stages are shown, with arrows suggesting probable transitions between them. Source: https://en.wikipedia.org/wiki/Process_state.
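The transitions described in this section can be written down as a small state machine. The sketch below covers only the four states named in the text; the `ALLOWED` table and the `Process` class are hypothetical names chosen for illustration, and a real kernel tracks more states than these.

```python
# Allowed transitions, taken from the description in this section.
ALLOWED = {
    "created": {"waiting"},             # loaded into main memory, then queued
    "waiting": {"running"},             # context switch onto the processor
    "running": {"waiting", "blocked"},  # switched out, or must wait for a resource
    "blocked": {"waiting"},             # resource became available
}


class Process:
    """Toy process that only tracks its scheduling state."""

    def __init__(self, pid):
        self.pid = pid
        self.state = "created"

    def move_to(self, new_state):
        """Change state, rejecting transitions the model does not allow."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state


p = Process(pid=1)
for nxt in ("waiting", "running", "blocked", "waiting"):
    p.move_to(nxt)
print(p.state)
```

Note that in this model a blocked process never moves straight back to "running"; it must first return to "waiting" and be selected by the scheduler again.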
4.6. INTER-PROCESS COMMUNICATION (IPC)
When processes need to communicate with one another, they must either share parts of their address spaces or use other IPC mechanisms. In a shell pipeline, for example, the output of the first process must be passed to the second, and so on; another example is a task that can be decomposed into cooperating but partially independent processes that can run at the same time (Bidra et al., 2013).
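The shell-pipeline case can be sketched with Python's standard `subprocess` module, which connects the standard output of one child process to the standard input of another through an OS pipe. The two one-line child programs and the function name `run_pipeline` are made up for the illustration.

```python
import subprocess
import sys

# Hypothetical child programs: the first prints two words, the second upper-cases its input.
PRODUCER = "print('alpha'); print('beta')"
CONSUMER = "import sys\nfor line in sys.stdin:\n    print(line.strip().upper())"


def run_pipeline():
    """Connect producer stdout to consumer stdin, like `producer | consumer` in a shell."""
    producer = subprocess.Popen([sys.executable, "-c", PRODUCER],
                                stdout=subprocess.PIPE)
    consumer = subprocess.Popen([sys.executable, "-c", CONSUMER],
                                stdin=producer.stdout,
                                stdout=subprocess.PIPE,
                                text=True)
    producer.stdout.close()          # the consumer now holds the read end of the pipe
    output, _ = consumer.communicate()
    producer.wait()
    return output.split()


if __name__ == "__main__":
    print(run_pipeline())
```

The two processes never share memory here; all data passes through the pipe that the OS provides.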
4.7. HISTORICAL BACKGROUND OF COMPUTER PROCESSING
By the early 1960s, computer control software had evolved from monitor control software, such as IBSYS, to executive control software. Over time, computers became faster, but computer time was still neither cheap nor fully utilized; this environment made multiprogramming both possible and necessary. Multiprogramming means that several programs run concurrently. At first, because of the underlying uniprocessor architecture, more than one program ran on a single processor, and the programs shared scarce and limited hardware resources; the concurrency was therefore serial in nature. On later systems with multiple processors, multiple programs may run concurrently in parallel (Medvegy et al., 2002; Gugerty, 2006).
Programs consist of sequences of instructions for processors. A single processor can execute only one instruction at a time, so it cannot truly run several programs simultaneously. A program may need a resource, such as an input device, that has a large delay, or it may start a slow operation, such as sending output to a printer. While the program waits, the CPU would sit idle. To keep the processor busy at all times, the execution of such a program is halted and the OS switches the processor to run another program. To the user, the programs appear to run at the same time (Medvegy et al., 2002; Tkalcic and Tasic, 2003).
Soon afterward, the notion of a “program” was expanded to that of an “executing program and its context.” The concept of a process was born, and it became necessary with the invention of reentrant code. Threads came somewhat later (Gatos et al., 2004; Piotrowski, 2012).
4.8. TYPES OF CENTRAL PROCESSING UNITS (CPUS)
CPUs are commonly grouped by core count into six types: single-core, dual-core, quad-core, hexa-core, octa-core, and deca-core. These types are found in desktop computers, laptops, and mobile phones. The type of CPU influences a device's multithreading capability, efficiency, speed, clock frequency, cache, and overall responsiveness (Lipkin, 1984; Matsuzawa, 2003).
How fast software applications run depends on the power of the CPU. For PCs, the main manufacturers are Intel and AMD; for mobile devices, they are MediaTek, Qualcomm, Apple, and Samsung (Exynos), each with its own CPU designs. The following sections look at the different types of CPUs and their characteristics (Figure 4.10).
Figure 4.10. CPU types. Source: https://digitalworld839.com/types-of-central-processing-unit/.
The CPU is a central part of the computer because it performs the computations and issues the control signals that are passed to the other components of the computer and its peripherals. It runs at high speed according to the instructions of the program it is given. The capability of the CPU determines how well the components connected to it can be used, so it is important to choose a suitable CPU and program it appropriately (Gray and Siewiorek, 1991; Chuvieco et al., 2019). The Intel 486 is faster than the Intel 386; since the introduction of the Pentium, processors have been marketed under names such as Athlon, Celeron, Pentium, and Duron. Processors are built on different architectures, such as 32-bit and 64-bit, which affect speed and addressable capacity. The most common types of CPU are single-core, dual-core, quad-core, hexa-core, octa-core, and deca-core processors, which are described in the sections below (Reddy and Petrov, 2010; Kim et al., 2016).
4.8.1. Single-Core CPU
A single-core CPU contains one processing core, so it executes only one instruction stream at a time and must switch between programs in order to multitask. AMD and Intel are the two principal manufacturers of PC processors (Lukas et al., 2010).
4.8.2. Dual-Core CPU
A dual-core CPU is a single CPU with two cores that behaves like two CPUs in one package. A single-core CPU must switch back and forth between different data streams; when more than one thread is running, a dual-core CPU handles the multitasking more effectively. To make the most of a dual-core CPU, the operating system and the running programs must be written to use multiple threads, an approach associated with simultaneous multithreading (SMT) technology. Dual-core CPUs are faster than single-core CPUs but not as powerful as quad-core CPUs (Septien et al., 2010).
4.8.3. Quad-Core CPU
A quad-core CPU places four cores on a single chip and is a refinement of the multi-core design. Like a dual-core CPU, it divides the workload among its cores, which allows for effective multitasking. This does not mean that any single operation runs four times faster. Unless applications and programs are written with multithreaded (SMT) code, the speed increase will be unnoticeable. Quad-core CPUs suit users who need to run several programs at the same time, such as gamers; the Supreme Commander series of games, for example, is optimized for multi-core CPUs (Jozwik et al., 2011).
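Why four cores do not make a program four times faster can be illustrated with Amdahl's law. The law is not part of the discussion above; it is brought in here only as a standard rule of thumb. The sketch assumes a hypothetical program in which a fixed fraction of the work can be spread evenly over the cores, and the function name and the 80% figure are illustrative.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup when `parallel_fraction` of the work scales across `cores`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Assumed workload: 80% of the run time can be parallelized.
for cores in (1, 2, 4, 6, 8, 10):
    print(f"{cores:2d} cores -> speedup {amdahl_speedup(0.8, cores):.2f}x")
```

Under this assumption a quad-core CPU gives a speedup of 2.5 rather than 4, and a program with no parallel code gains nothing from extra cores.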
4.8.4. Hexa-Core Processors
A hexa-core processor is a multi-core CPU with six cores that can carry out tasks faster than dual-core and quad-core processors. Hexa-core processors are readily available to personal computer users; Intel introduced a hexa-core Core i7 in 2010. Smartphones originally used only dual-core and quad-core CPUs, but hexa-core CPUs are now available in smartphones as well.
4.8.5. Octa-Core Processors
Just as a dual-core processor has two cores, a quad-core four, and a hexa-core six, an octa-core processor has eight independent cores and can complete work more efficiently and more quickly than a quad-core processor. Many octa-core processors consist of two sets of quad-core processors that divide different tasks between them. Most of the time, the lower-powered set of cores is used; when demand rises, the faster set of four cores kicks in. In short, an octa-core processor is typically organized as two quad-core clusters that are tuned to deliver efficient performance (Ghaffari and Emsley, 2016).
4.8.6. Deca-Core Processor
Deca-core CPUs, which have ten cores, offer more capability than the processors described above and are becoming increasingly popular. Some smartphones now ship with low-cost deca-core CPUs, and many devices on the market have been updated with newer processors to give customers additional useful features.
4.8.7. Mainstream Processors
Mainstream processors are mid-range CPUs that are somewhat larger and handle demanding tasks such as 3D gaming, video editing, and other multimedia applications. They are comparable to budget processors, which are used to carry out basic tasks cost-effectively; office applications, photo editing, web browsing, and other everyday tasks all run comfortably on such a CPU (Spink et al., 2008).
REFERENCES
1. Afif, M., Said, Y., & Atri, M., (2020). Computer vision algorithms acceleration using graphic processors NVIDIA CUDA. Cluster Computing, 23(4), 3335–3347.
2. Alverson, R., Callahan, D., Cummings, D., Koblenz, B., Porterfield, A., & Smith, B., (1990). The Tera computer system. In: Proceedings of the 4th International Conference on Supercomputing (pp. 1–6).
3. Bidra, A. S., Taylor, T. D., & Agar, J. R., (2013). Computer-aided technology for fabricating complete dentures: Systematic review of historical background, current status, and future perspectives. The Journal of Prosthetic Dentistry, 109(6), 361–366.
4. Chen, T. P., Budnikov, D., Hughes, C. J., & Chen, Y. K., (2007). Computer vision on multi-core processors: Articulated body tracking. In: 2007 IEEE International Conference on Multimedia and Expo (pp. 1862–1865). IEEE.
5. Chu, R., Gu, L., Liu, Y., Li, M., & Lu, X., (2011). SenSmart: Adaptive stack management for multitasking sensor networks. IEEE Transactions on Computers, 62(1), 137–150.
6. Chuvieco, E., Mouillot, F., Van, D. W. G. R., San, M. J., Tanase, M., Koutsias, N., & Giglio, L., (2019). Historical background and current developments for mapping burned area from satellite earth observation. Remote Sensing of Environment, 225, 45–64.
7. Cole, R., Chen, Y. C., Barquin-Stolleman, J. A., Dulberger, E., Helvacian, N., & Hodge, J. H., (1986). Quality-adjusted price indexes for computer processors and selected peripheral equipment. Survey of Current Business, 66(1), 41–50.
8. Darringer, J. A., (1969). The Description, Simulation, and Automatic Implementation of Digital Computer Processors (Vol. 1, pp. 5–12). Carnegie Mellon University.
9. Decyk, V. K., Karmesin, S. R., De Boer, A., & Liewer, P. C., (1996). Optimization of particle-in-cell codes on reduced instruction set computer processors. Computers in Physics, 10(3), 290–298.
10. Dulberger, E. R., (1993). Sources of price decline in computer processors: Selected electronic components. Price Measurements and Their Uses, 57, 103–124.
11. Egger, B., Lee, J., & Shin, H., (2008). Scratchpad memory management in a multitasking environment. In: Proceedings of the 8th ACM International Conference on Embedded Software (pp. 265–274).
12. Estrada-Torres, B., Camargo, M., Dumas, M., García-Bañuelos, L., Mahdy, I., & Yerokhin, M., (2021). Discovering business process simulation models in the presence of multitasking and availability constraints. Data & Knowledge Engineering, 134, 101897.
13. Fox, G. C., Johnson, M., Lyzenga, G., Otto, S., Salmon, J., Walker, D., & White, R. L., (1989). Solving problems on concurrent processors vol. 1: General techniques and regular problems. Computers in Physics, 3(1), 83, 84.
14. Gatos, B., Pratikakis, I., & Perantonis, S. J., (2004). An adaptive binarization technique for low quality historical documents. In: International Workshop on Document Analysis Systems (pp. 102–113). Springer, Berlin, Heidelberg.
15. Geer, D., (2005). Chipmakers turn to multicore processors. Computer, 38(5), 11–13.
16. Ghaffari, M., & Emsley, M. W., (2016). The boundary between good and bad multitasking in CCPM. The Journal of Modern Project Management, 4(1).
17. Ghaffari, M., & Emsley, M. W., (2016). The impact of good and bad multitasking on buffer requirements of CCPM portfolios. The Journal of Modern Project Management, 4(2).
18. Ginosar, R., (2012). Survey of processors for space. Data Systems in Aerospace (DASIA). Eurospace, 1–5.
19. Gray, J., & Siewiorek, D. P., (1991). High-availability computer systems. Computer, 24(9), 39–48.
20. Gugerty, L., (2006). Newell and Simon’s logic theorist: Historical background and impact on cognitive modeling. In: Proceedings of the Human Factors and Ergonomics Society Annual Meeting (Vol. 50, No. 9, pp. 880–884). Sage CA: Los Angeles, CA: SAGE Publications.
21. Guirchoun, S., Martineau, P., & Billaut, J. C., (2005). Total completion time minimization in a computer system with a server and two parallel processors. Computers & Operations Research, 32(3), 599–611.
22. Hall, N. G., Leung, J. Y. T., & Li, C. L., (2015). The effects of multitasking on operations scheduling. Production and Operations Management, 24(8), 1248–1265.
23. Jozwik, K., Tomiyama, H., Edahiro, M., Honda, S., & Takada, H., (2011). Rainbow: An OS extension for hardware multitasking on dynamically partially reconfigurable FPGAs. In: 2011 International Conference on Reconfigurable Computing and FPGAs (pp. 416–421). IEEE.
24. Kapasi, U. J., Rixner, S., Dally, W. J., Khailany, B., Ahn, J. H., Mattson, P., & Owens, J. D., (2003). Programmable stream processors. Computer, 36(8), 54–62.
25. Kim, J. H., Yang, X., & Putri, M., (2016). Multitasking performance and workload during a continuous monitoring task. In: Proceedings of the Human Factors and Ergonomics Society Annual Meeting (Vol. 60, No. 1, pp. 665–669). Sage CA: Los Angeles, CA: SAGE Publications.
26. Krikelis, A., & Weems, C. C., (1994). Associative processing and processors. Computer, 27(11), 12–17.
27. Kuroda, I., & Nishitani, T., (1998). Multimedia processors. Proceedings of the IEEE, 86(6), 1203–1221.
28. Lazzaro, J., Wawrzynek, J., Mahowald, M., Sivilotti, M., & Gillespie, D., (1992). Silicon auditory processors as computer peripherals. Advances in Neural Information Processing Systems, 5.
29. Lipkin, M., (1984). Historical background on the origin of computer medicine. In: Proceedings of the Annual Symposium on Computer Application in Medical Care (p. 987). American Medical Informatics Association.
30. Lukas, S., Schuh, G., Bender, D., Piller, F. T., Wagner, P., & Koch, I., (2010). Multitasking in the product development process: How opposing cognitive requirements affect the designing process. In: PICMET 2010 Technology Management for Global Economic Growth (pp. 1–10). IEEE.
31. Matsuzawa, T., (2003). The Ai project: Historical and ecological contexts. Animal Cognition, 6(4), 199–211.
32. Medvegy, M., Duray, G., Pintér, A., & Préda, I., (2002). Body surface potential mapping: Historical background, present possibilities, diagnostic challenges. Annals of Noninvasive Electrocardiology, 7(2), 139.
33. Owston, R. D., & Wideman, H. H., (1997). Word processors and children’s writing in a high-computer-access setting. Journal of Research on Computing in Education, 30(2), 202–220.
34. Park, J. J. K., Park, Y., & Mahlke, S., (2017). Dynamic resource management for efficient utilization of multitasking GPUs. In: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (pp. 527–540).
35. Piotrowski, M., (2012). Natural language processing for historical texts. Synthesis Lectures on Human Language Technologies, 5(2), 1–157.
36. Rao, R. P., (2019). Towards neural co-processors for the brain: Combining decoding and encoding in brain-computer interfaces. Current Opinion in Neurobiology, 55, 142–151.
37. Reddy, R., & Petrov, P., (2010). Cache partitioning for energy-efficient and interference-free embedded multitasking. ACM Transactions on Embedded Computing Systems (TECS), 9(3), 1–35.
38. Rose, G. S., Shawkat, M. S. A., Foshie, A. Z., Murray, J. J., & Adnan, M. M., (2021). A system design perspective on neuromorphic computer processors. Neuromorphic Computing and Engineering, 1(2), 022001.
39. Schultz, C., Schreyoegg, J., & Von, R. C., (2013). The moderating role of internal and external resources on the performance effect of multitasking: Evidence from the R&D performance of surgeons. Research Policy, 42(8), 1356–1365.
40. Septien, J., Mecha, H., Mozos, D., & Tabero, J., (2010). Fragmentation management for HW multitasking in 2D reconfigurable devices: Metrics and defragmentation heuristics. In: Parallel and Distributed Computing (p. 11). IntechOpen.
41. Spink, A., Cole, C., & Waller, M., (2008). Multitasking behavior. Annual Review of Information Science and Technology, 42(1), 93–118.
42. Tkalcic, M., & Tasic, J. F., (2003). Color Spaces: Perceptual, Historical and Applicational Background (Vol. 1, pp. 304–308). IEEE.
43. Venu, B., (2011). Multi-Core Processors: An Overview. arXiv preprint arXiv:1110.3535.
44. Walker, W., & Cragon, H. G., (1995). Interrupt processing in concurrent processors. Computer, 28(6), 36–46.
45. Wiriyasart, S., Hommalee, C., & Naphon, P., (2019). Thermal cooling enhancement of dual processors computer with thermoelectric air cooler module. Case Studies in Thermal Engineering, 14, 100445.
46. Zhang, Y., Goonetilleke, R. S., Plocher, T., & Liang, S. F. M., (2005). Time-related behavior in multitasking situations. International Journal of Human-Computer Studies, 62(4), 425–455.
CHAPTER 5
INTERCONNECTION NETWORKS

CONTENTS
5.1. Introduction
5.2. Questions About Interconnection Networks
5.3. Uses of Interconnection Networks
5.4. Network Basics
References
5.1. INTRODUCTION
Digital systems are pervasive in today's world. Digital computers are used for tasks such as simulating physical systems, managing large databases, and preparing documents. Digital communication systems relay telephone calls, television signals, and Internet data. Audio and video entertainment is increasingly delivered and processed in digital form. Finally, almost every product, from automobiles to household appliances, is digitally controlled. A digital system is composed of three basic building blocks: logic, memory, and communication. Logic transforms and combines data, for example by performing arithmetic operations or making decisions. Memory stores data for later retrieval, moving it in time. Communication moves data from one location to another. The communication component of digital systems is the subject of this chapter. It focuses on interconnection networks, which are used to transport data between the subsystems of a digital system (Cherkasova et al., 1996; Fontaine et al., 2008).
The performance of many digital systems is now limited by their communication or interconnection rather than by their logic or memory. In a high-end system, most of the power is used to drive wires, and most of the clock cycle is spent on wire delay rather than gate delay. As technology improves, memories and processors become smaller, faster, and less expensive. The speed of light, however, does not change. The pin density and wiring density that govern the interconnections between system components are scaling more slowly than the components themselves. In addition, the frequency of communication between components lags far behind the clock rates of modern processors. These factors combine to make interconnection a key factor in the success of future digital systems (Hluchyj and Karol, 1988; Pan et al., 2004).
As designers strive to make more efficient use of scarce interconnection bandwidth, interconnection networks are emerging as a nearly universal solution to the system-level communication problems of modern digital systems. Originally developed for the demanding communication requirements of multicomputers, interconnection networks are beginning to replace buses as the standard system-level interconnection. They are also replacing dedicated wiring in special-purpose systems as designers discover that routing packets is both faster and more economical than routing wires (Klinkowski et al., 2005; Baliga et al., 2005).
5.2. QUESTIONS ABOUT INTERCONNECTION NETWORKS
Before going any further, we address some basic questions about interconnection networks: What is an interconnection network? Where are interconnection networks found? Why are they important?
5.2.1. What Is an Interconnection Network?
As shown in Figure 5.1, an interconnection network is a programmable system that transports data between terminals. The figure shows six terminals, T1 through T6, connected to a network. When T3 wants to communicate some data to T5, it sends a message containing the data into the network, and the network delivers the message to T5. The network is programmable in the sense that it makes different connections at different points in time (Yang et al., 2005). The network in the figure may deliver a message from T3 to T5 in one cycle and then use the same resources to deliver a message from T3 to T1 in the next cycle. The network is a system because it is composed of many components, including buffers, channels, switches, and controllers, that work together to deliver data. Networks that meet this broad definition occur at many scales. On-chip networks can deliver data between the memory arrays, registers, and arithmetic units within a single processor. Board-level and system-level networks tie processors to memories or input ports to output ports. Finally, local-area and wide-area networks connect disparate systems within an enterprise or across the globe. This chapter restricts its attention to the smaller scales, from the chip level to the system level. Many excellent texts already cover the larger-scale networks. The issues at the system level and below, where channels are short and data rates very high, are fundamentally different from those at the larger scales and demand different solutions (Newman, 1989; Hunter and Andonovic, 2000).
Figure 5.1. The functional view of an interconnection network. Note: Channels are used to link the terminals (designated T1 through T6) to the rest of the network. The presence of arrowheads at either end of the channel indicates that it is bidirectional, allowing data to be sent both into and out of the interconnection network. Source: https://er.yuvayana.org/metrics-for-interconnection-networks-andperformance-factors/.
5.2.2. Where Do You Find Interconnection Networks?
Interconnection networks are used in almost all digital systems that are large enough to have two components to connect. The most common applications are in computer systems and communication switches. In computer systems, they connect processors to memories and input/output (I/O) devices to I/O controllers. They connect the input ports to the output ports of communication switches and network routers. They also connect sensors and actuators to processors in control systems. Anywhere that bits are transported between two components of a system, an interconnection network is likely to be found (Awdeh, 1993; Bintjas et al., 2003).
5.2.3. Why Are Interconnection Networks Important?
Interconnection networks are important because they are a limiting factor in the performance of many systems. In a computer system, the interconnection network between the processor and the memory largely determines the memory latency and memory bandwidth, two key performance factors. In a communication switch, the performance of the interconnection network largely determines the capacity of the switch (data rate and number of ports). Because the demand for interconnection has grown more rapidly than the capability of the underlying wires, interconnection has become a critical bottleneck in most systems (Dorren et al., 2003). Interconnection networks are an attractive alternative to dedicated wiring because they allow scarce wiring resources to be shared by several low-duty-factor signals. Suppose that each terminal in Figure 5.1 needs to communicate one word with each other terminal once every 100 cycles. We could provide a dedicated channel between each pair of terminals, requiring a total of 30 unidirectional channels (Chiaroni, 2003). However, each channel would be idle 99% of the time. If the six terminals are instead connected in a ring, only six channels are needed. With the ring network, the number of channels is reduced by a factor of five, and the channel duty factor rises from 1% to 12.5% (Feng, 1981; Duato et al., 2003).
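The channel counts in this example follow from simple arithmetic, sketched below for six terminals. The function names are hypothetical; the ring's duty factor depends on routing assumptions that the text does not spell out, so it is not recomputed here.

```python
def dedicated_channels(terminals):
    """One unidirectional channel for every ordered pair of distinct terminals."""
    return terminals * (terminals - 1)


def ring_channels(terminals):
    """One channel from each terminal to its neighbor around the ring."""
    return terminals


def duty_factor(words_sent, cycles):
    """Fraction of cycles in which a channel is actually carrying a word."""
    return words_sent / cycles


n = 6
dedicated = dedicated_channels(n)                  # 6 * 5 = 30
ring = ring_channels(n)                            # 6
reduction = dedicated / ring                       # factor of five
dedicated_duty = duty_factor(1, 100)               # one word per 100 cycles
print(dedicated, ring, reduction, f"{dedicated_duty:.0%}")
```

Each dedicated channel carries one word per 100 cycles, which is where the 1% duty factor and the 99% idle time come from.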
5.3. USES OF INTERCONNECTION NETWORKS
To understand the requirements placed on the design of interconnection networks, it helps to examine how they are used in digital systems. This section examines three common uses of interconnection networks and how they drive network requirements. For each application, we look at how it determines the following network parameters:
• The number of terminals;
• Each terminal's peak bandwidth;
• Each terminal's average bandwidth;
• The required latency;
• The message size, or a distribution of message sizes;
• The expected traffic pattern(s);
• The required quality of service;
• The required reliability and availability of the interconnection network (Wu and Feng, 1980).
The number of terminals, or ports, in a network corresponds directly to the number of components that must be connected to the network. In addition to knowing the number of terminals, the designer must also know how the terminals will interact with the network (Owens et al., 2007; Xu, 2013). Each terminal requires a certain amount of bandwidth from the network, usually expressed in bits per second (bit/s). Unless stated otherwise, we assume that terminal bandwidths are symmetric, that is, the input and output bandwidths of the terminal are equal. The peak bandwidth is the maximum data rate that a terminal will request from the network over a short period of time, whereas the average bandwidth is the average data rate that a terminal will require. Knowing both the peak and the average bandwidth becomes important when trying to minimize the implementation cost of the interconnection network, as discussed in the following section on the processor-memory interconnect (see Figure 5.2) (Biberman and Bergman, 2012; Bermond, 2016).
In addition to the rate at which messages must be accepted and delivered by the network, the time required to deliver an individual message, the message latency, is also specified for the network. An ideal network would support both high bandwidth and low latency, but in practice there is usually a tradeoff between the two. For example, a network that supports high bandwidth tends to keep its resources busy, which often leads to contention. Contention occurs when two or more messages want to use the same shared resource in the network. All but one of these messages must wait for the resource to become free, which increases their latency. If, on the other hand, resource utilization were lowered by reducing the bandwidth demands, latency would also fall. Message size, the length of a message in bits, is another important design consideration. When messages are small, overheads in the network have a larger impact on performance than when messages are large and the overheads can be amortized over the length of the message. Many systems use several different message sizes (Kachris et al., 2013; Yunus et al., 2016).
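The effect of message size on overhead can be made concrete with a small calculation. The sketch assumes a fixed per-message overhead (for example, a header) of 64 bits and a 1 Gbit/s channel; both figures and both function names are hypothetical and chosen only for illustration.

```python
def payload_efficiency(payload_bits, overhead_bits):
    """Fraction of the transmitted bits that carry useful data."""
    return payload_bits / (payload_bits + overhead_bits)


def effective_bandwidth(channel_bits_per_s, payload_bits, overhead_bits):
    """Useful data rate once the per-message overhead is accounted for."""
    return channel_bits_per_s * payload_efficiency(payload_bits, overhead_bits)


OVERHEAD_BITS = 64   # assumed fixed overhead per message
CHANNEL = 1e9        # assumed raw channel rate in bit/s

for payload in (64, 512, 4096):
    eff = payload_efficiency(payload, OVERHEAD_BITS)
    rate = effective_bandwidth(CHANNEL, payload, OVERHEAD_BITS)
    print(f"{payload:5d}-bit payload: {eff:.1%} efficient, {rate / 1e6:.0f} Mbit/s useful")
```

With these assumed numbers, a 64-bit message spends half of the channel on overhead, while a 4,096-bit message amortizes the same overhead to under 2%.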
The traffic pattern of a network is defined by how the messages from every terminal are spread among all of the potential destination terminals on the network. For instance, each terminal may have an equal chance of sending messages to all of the other terminals in the network. This is a representation of the random traffic pattern. Terminals that prefer to deliver messages exclusively to other terminals in the immediate vicinity, on the other hand, might benefit from the underlying network’s spatial proximity to lower costs. In other networks, on the other hand, the requirements must remain valid in the face of arbitrary traffic patterns. Some networks will also need a certain level of service quality (QoS). Quality of service (QoS) is defined as the equitable deployment of resources in accordance with a service policy. For instance, when many messages are competing for a similar resource in a network, the conflict can be addressed in a variety of ways depending on the situation. A first-come, first-served approach might be used to serve messages based on how long they have been waiting for the resource in the issue. Another option gives preference to the message that has been in the network for the longest period of time (the longest duration approach). The decision between these and other allocation strategies is dependent on the services that are essential to the network and the resources available (Krishnamoorthy and Krishnamurthy, 1987; Bhuyan et al., 1989). Finally, the level of dependability and availability required from an interconnection network has an impact on the architecture of the network itself. The reliability of a network is a measure of how frequently the network completes the task of delivering messages successfully. In the majority of cases, communications must be delivered in whole and without error 100% of the time. 
A 100% reliable network can be achieved by adding specialized hardware to detect and correct errors, by implementing a higher-level software protocol, or by a combination of these techniques. Alternatively, as we shall see in the section on packet switching fabrics (Section 5.3.3), some networks can tolerate the loss of a small fraction of messages. The availability of a network is the fraction of time that the network is available and operating correctly. Typical Internet router specifications demand 99.999% availability, which corresponds to roughly five minutes of total downtime per year. The difficulty in delivering this degree of availability is that the components used to build the network will frequently fail many times per minute. Thus, the network must be built to detect and quickly recover from these failures while continuing to operate normally (Bermond et al., 1986; Heydemann, 1997).
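The relationship between an availability figure and yearly downtime is simple arithmetic. The short Python sketch below converts the availability values discussed here into minutes of downtime per year. The function name and the 365-day year are assumptions made for illustration only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year for an availability given as a fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


for availability in (0.999, 0.9999, 0.99999):
    minutes = downtime_minutes_per_year(availability)
    print(f"{availability}: {minutes:.2f} minutes of downtime per year")
```

For 99.999% availability this gives about 5.3 minutes per year, while 99.9% allows almost nine hours.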
5.3.1. Processor-Memory Interconnect
Figure 5.2 depicts two techniques for connecting processors to memory through an interconnection network. Figure 5.2(a) depicts a dance-hall design in which an interconnection network connects P processors to M memory banks. The integrated-node arrangement depicted in Figure 5.2(b) is used by the majority of current computers (Kruskal and Snir, 1983; Reed and Grunwald, 1987).
Figure 5.2. To connect the CPU and memory, an interconnection network is used.
Note: (a) Separate processor (P) and memory (M) ports in a dance-hall design. (b) An integrated-node design with a single memory bank and shared CPU and memory ports. Source: https://dl.acm.org/doi/pdf/10.5555/2821589.
In an integrated node, processors and memory are combined. In this configuration, each processor can access its local memory through a communication switch C without using the network. The network requirements imposed by each configuration are listed in Table 5.1. The number of processor ports can be in the thousands, as in a Cray T3E with 2,176 processor ports, or as low as one for a single processor. In today's high-end servers, configurations of 64 to 128 processors are typical, and this number is growing. In the integrated-node design, each of these processor ports is also a memory port. In a dance-hall arrangement, on the other hand, the number of memory ports is often substantially larger than the number of processor ports. One high-end vector processor, for example, has 32 processor ports that issue requests to 4,096 memory banks. This large ratio maximizes memory bandwidth and reduces the likelihood of bank conflicts, which occur when two processors need access to the same memory bank at the same time (Dally, 1990; Day et al., 1997).
Table 5.1. Processor-Memory Interconnection Network Parameters
Parameter              Value
Processor ports        1–2,048
Peak bandwidth         8 Gbytes/s
Memory ports           0–4,096
Message latency        100 ns
Average bandwidth      400 Mbytes/s
Traffic patterns       Arbitrary
Message size           64 or 576 bits
Reliability            No message loss
Quality of service     None
Availability           0.999 to 0.99999
A contemporary microprocessor executes roughly 10^9 instructions per second, each of which may require two 64-bit words from memory. If one of these references misses in the caches, a block of eight words is typically fetched from memory. If we truly needed to fetch two words from memory per cycle, we would require 16 Gbytes/s of bandwidth. Fortunately, only around a third of all instructions reference data in memory, and caches reduce the number of references that must actually reach a memory bank. With typical cache-miss ratios, the average bandwidth is more than an order of magnitude lower, around 400 Mbytes/s. If we reduce the peak bandwidth too much, however, a sudden burst of memory requests will quickly clog the processor's network port. Forcing a burst of requests through a lower-bandwidth network port causes serialization, which increases message latency much as a clogged sink drains slowly. To keep serialization from adding to memory latency during bursts of requests, we require a peak bandwidth of 8 Gbytes/s (Wang et al., 2002; Lemieux and Lewis, 2004).
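The bandwidth figures in this paragraph can be checked with a few lines of Python. The sketch below recomputes the worst-case demand and then estimates the serialization time of one 576-bit message (the larger message size in Table 5.1) at the peak and at the average bandwidth. The function and constant names are illustrative assumptions, and the model ignores everything except the time needed to push the bits through the port.

```python
INSTRUCTIONS_PER_SECOND = 1e9   # from the text
WORDS_PER_INSTRUCTION = 2       # worst case: two 64-bit words per instruction
BYTES_PER_WORD = 8

PEAK_BANDWIDTH = 8e9            # bytes/s, Table 5.1
AVERAGE_BANDWIDTH = 400e6       # bytes/s, Table 5.1


def worst_case_bandwidth() -> float:
    """Bandwidth needed if every instruction fetched two words from memory."""
    return INSTRUCTIONS_PER_SECOND * WORDS_PER_INSTRUCTION * BYTES_PER_WORD


def serialization_ns(message_bits: int, port_bytes_per_s: float) -> float:
    """Time to push one message through a port of the given bandwidth."""
    return (message_bits / 8) / port_bytes_per_s * 1e9


demand = worst_case_bandwidth()
print(f"worst-case demand: {demand / 1e9:.0f} Gbytes/s")
print(f"demand / average:  {demand / AVERAGE_BANDWIDTH:.0f}x")
print(f"576-bit message at peak bandwidth:    "
      f"{serialization_ns(576, PEAK_BANDWIDTH):.0f} ns")
print(f"576-bit message at average bandwidth: "
      f"{serialization_ns(576, AVERAGE_BANDWIDTH):.0f} ns")
```

Under these assumptions the worst-case demand is 40 times the average. A port sized only for the average would need about 180 ns to serialize one 576-bit message, more than the 100 ns latency budget, whereas the 8 Gbytes/s port needs about 9 ns.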
Memory latency, and therefore the latency of the network over which memory requests and replies are delivered, has a significant impact on processor performance. Table 5.1 lists a latency requirement of 100 ns because this is the basic latency of a typical memory system without a network. If our network adds another 100 ns of delay, we have doubled the effective memory latency.
When load and store instructions miss the processor's cache, they are converted into read-request and write-request packets and sent across the network to the appropriate memory bank. Each read-request packet contains the memory address to be read, and each write-request packet contains both the memory address and the word or cache line to be written. After receiving a request packet, the memory bank performs the requested operation and returns a read-reply or write-reply packet (Adams et al., 1987; Dandamudi and Eager, 1990).
Notice that we have begun to distinguish between messages and packets. A message is the unit of transfer from the network's clients to the network. At the network interface, a single message can generate one or more packets. Large messages can be divided into several smaller packets, or variable-length messages can be divided into fixed-length packets, which allows the underlying network to be simplified. Because the messages generated in this processor-memory interconnect are quite small, we assume a one-to-one correspondence between messages and packets (Pfister et al., 1985; Soteriou et al., 2006).
Read-request and write-reply packets contain no data, but they do carry an address. This address, together with some header and packet-type information, fits comfortably within 64 bits. Read-reply and write-request packets contain the same 64 bits of header and address information plus the contents of a 512-bit cache line, for a total of 576 bits. Figure 5.3 depicts the two packet formats (Shang et al., 2003).
Figure 5.3. The processor-memory connection requires two packet types. Source: https://www.amazon.com/Principles-Practices-Interconnection-Networks-Architecture/dp/0122007514.
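The request and reply exchange described above can be modelled in a few lines of Python. This is a minimal sketch, not the book's design: the class and function names are hypothetical, the internal layout of the 64 header bits is not specified in the text and is not modelled, and writes are simplified to whole cache lines. Only the two sizes, 64 and 576 bits, come from the text.

```python
from dataclasses import dataclass
from typing import Dict, Optional

HEADER_BITS = 64                      # header, packet type, and address
CACHE_LINE_BITS = 512                 # payload of the long packet format
CACHE_LINE_BYTES = CACHE_LINE_BITS // 8

CARRIES_DATA = {"read-reply", "write-request"}   # long format, 576 bits
HEADER_ONLY = {"read-request", "write-reply"}    # short format, 64 bits


@dataclass(frozen=True)
class Packet:
    kind: str
    address: int
    data: Optional[bytes] = None

    def __post_init__(self) -> None:
        # Enforce that only the long format carries a cache line.
        if self.kind in CARRIES_DATA:
            if self.data is None or len(self.data) != CACHE_LINE_BYTES:
                raise ValueError(f"{self.kind} needs a {CACHE_LINE_BYTES}-byte line")
        elif self.kind in HEADER_ONLY:
            if self.data is not None:
                raise ValueError(f"{self.kind} carries no data")
        else:
            raise ValueError(f"unknown packet kind: {self.kind}")

    def size_bits(self) -> int:
        payload = 0 if self.data is None else len(self.data) * 8
        return HEADER_BITS + payload


def service(bank: Dict[int, bytes], request: Packet) -> Packet:
    """Memory-bank side: perform the request and build the matching reply."""
    if request.kind == "read-request":
        stored = bank.get(request.address, bytes(CACHE_LINE_BYTES))
        return Packet("read-reply", request.address, stored)
    if request.kind == "write-request":
        bank[request.address] = request.data
        return Packet("write-reply", request.address)
    raise ValueError("a memory bank only services request packets")


bank: Dict[int, bytes] = {}
line = bytes(range(CACHE_LINE_BYTES))
ack = service(bank, Packet("write-request", 0x40, line))
reply = service(bank, Packet("read-request", 0x40))
print(ack.kind, ack.size_bits())      # write-reply 64
print(reply.kind, reply.size_bits())  # read-reply 576
```

In the long format, 512 of the 576 bits are payload, so the header overhead is one ninth of the packet. In the short format the whole packet is header.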
As is usual for processor-memory interconnects, we do not require any QoS. This is because the network is inherently self-throttling. If the network becomes congested, memory requests take longer to be serviced. Because each processor can have only a limited number of requests outstanding at a time, the processors begin to idle while waiting for replies. Idle processors issue no new requests, so network congestion decreases. This stabilizing behavior is called self-throttling. Most QoS guarantees affect the network only when it is congested, and self-throttling tends to avoid congestion, which makes QoS less useful in processor-memory interconnects (Bhuyan and Agrawal, 1983; Navaridas et al., 2011).
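A back-of-the-envelope way to see self-throttling is to bound the request rate of one processor with Little's law: with at most N requests outstanding and a round-trip time T, no more than N / T requests can be issued per second. The Python sketch below applies this bound. The limit of eight outstanding requests is a hypothetical value chosen for illustration, not a figure from the text.

```python
def max_request_rate(outstanding_limit: int, round_trip_ns: float) -> float:
    """Upper bound on requests per second for one processor (Little's law)."""
    if outstanding_limit < 1 or round_trip_ns <= 0:
        raise ValueError("need a positive limit and a positive round-trip time")
    return outstanding_limit / (round_trip_ns * 1e-9)


OUTSTANDING_LIMIT = 8  # hypothetical number of requests a processor can have in flight

for latency_ns in (100, 200, 400, 800):
    rate = max_request_rate(OUTSTANDING_LIMIT, latency_ns)
    print(f"{latency_ns:4d} ns round trip -> at most {rate / 1e6:.0f} M requests/s")
```

Each doubling of the round-trip time halves the load that the processor can offer, which is the feedback that relieves congestion without any explicit QoS mechanism.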
5.3.2. I/O Interconnect
Interconnection networks are also used in computer systems to connect I/O devices, such as disc drives, displays, and network interfaces, to processors and/or memory. Figure 5.4 shows a typical I/O network that connects a set of host adapters to an array of disc drives (at the bottom of the figure). The network operates in the same way as the processor-memory interconnect, but at a different granularity and timescale. These differences, notably a greater tolerance for latency, push the network design in quite different directions (Agrawal, 1983; Chen and Peh, 2003).
Figure 5.4. A graphical representation of a typical I/O network. Source: https://www.elsevier.com/books/principles-and-practices-of-interconnection-networks/dally/978-0-12-200751-4.
Disc operations are carried out by transferring sectors of 4 Kbytes or more. Because of the disc's rotational latency and the time required to reposition the head, the latency of a sector access can be several milliseconds. A disc read is initiated by a host adapter, which sends a control packet specifying the disc address to be read and the memory block that is the target of the read. When the disc receives the request, it schedules a head movement to read the requested sector. Once the disc has read the requested sector, it sends a response packet containing the sector and identifying the target memory block to the appropriate host adapter (Blake and Trivedi, 1989; Hromkovič et al., 1996).
Table 5.2 lists the parameters of a high-performance I/O interconnection network. This network can support up to 64 host adapters, with many physical devices, such as hard discs, connected to each host adapter. In this example there are up to 64 I/O devices per host adapter, for a total of 4,096 devices. More typical configurations connect a few host adapters to 100 or more devices.
The peak-to-average bandwidth ratio of the disc ports is quite high. While transferring consecutive sectors, a disc can read data at up to 200 Mbytes/s; this value determines the peak bandwidth in the table. More typically, the disc must move its head between sectors, which takes 5 milliseconds on average, so the average data rate is one 4-Kbyte sector per 5 milliseconds, or less than 1 Mbyte/s. Because the host ports each handle the aggregate traffic from 64 disc ports, they have a lower peak-to-average bandwidth ratio (Gratz et al., 2007; Bistouni and Jahanshahi, 2020).
The large difference between average and peak bandwidth at the device ports calls for a network topology with concentration. Designing the network to handle the peak bandwidth of all devices simultaneously would certainly be sufficient, but the resulting network would be very expensive. Alternatively, we could design the network to support only the average bandwidth, but this introduces serialization latency, as explained in the processor-memory interconnect example. Given the high peak-to-average bandwidth ratio, this serialization latency would be considerable. A more efficient strategy is to concentrate the requests of many devices onto a shared port (Chen et al., 1981; Jiang et al., 2009).
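The disc figures quoted above can be worked through in Python to show why concentration is attractive. The sketch below is a rough model under stated assumptions: a 4-Kbyte sector is taken as 4,096 bytes, the transfer time of the sector itself is ignored when computing the average rate, and all names are illustrative.

```python
SECTOR_BYTES = 4 * 1024        # one 4-Kbyte sector, taken as 4,096 bytes
SEEK_SECONDS = 5e-3            # average head movement between sectors
PEAK_BANDWIDTH = 200e6         # bytes/s while streaming consecutive sectors
DEVICES_PER_HOST = 64


def average_device_bandwidth() -> float:
    """One sector per head movement, ignoring the transfer time itself."""
    return SECTOR_BYTES / SEEK_SECONDS


def transfer_time_s(n_bytes: int, bandwidth: float) -> float:
    """Serialization time of n_bytes over a port of the given bandwidth."""
    return n_bytes / bandwidth


average = average_device_bandwidth()
print(f"average per device:    {average / 1e6:.2f} Mbytes/s")
print(f"peak-to-average ratio: {PEAK_BANDWIDTH / average:.0f}")
print(f"64 devices, average:   {DEVICES_PER_HOST * average / 1e6:.1f} Mbytes/s")
print(f"sector at peak rate:   {transfer_time_s(SECTOR_BYTES, PEAK_BANDWIDTH) * 1e6:.1f} us")
print(f"sector at 1 Mbyte/s:   {transfer_time_s(SECTOR_BYTES, 1e6) * 1e3:.1f} ms")
```

Under these assumptions a single device averages about 0.8 Mbytes/s against a 200 Mbytes/s peak, a ratio of roughly 244. A port sized for the average would take about 4 ms to serialize one sector, while the combined average of 64 devices is only about 52 Mbytes/s, well within what a single shared port running at the peak rate can carry.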
Table 5.2. I/O Connectivity Network Parameters
Parameter              Value
Device ports           1–4,096
Host ports             1–64
Peak bandwidth         200 Mbytes/s
Average bandwidth      1 Mbytes/s (devices), 64 Mbytes/s (hosts)
Message latency        10 μs
Message size           32 bytes or 4 Kbytes
Traffic patterns       Arbitrary
Reliability            No message loss
Availability           0.999 to 0.99999
Disc operations are carried out by transferring sectors that are 4 Kbytes or larger. Access to a sector may take several milliseconds because of the rotational latency of the disc and the time needed to reposition the head. To begin a disc read, a host adapter transmits a control packet that specifies the disc address to be read and the memory block that will receive the data. As soon as the disc receives the request, it schedules a head movement to read the relevant sector. When the disc has read the requested sector, it sends the appropriate host adapter a response packet that contains the sector along with information identifying the target memory block (Jwo et al., 1993; Nabavinejad et al., 2020).
The parameters of a high-performance I/O interconnection network are listed in Table 5.2. There can be up to 64 host adapters on this network, and each host adapter can have a large number of physical devices attached to it, such as hard discs. In this example, there are a total of 4,096 I/O devices, with each host adapter supporting up to 64 I/O devices. In more common configurations, a small number of host adapters connect to a hundred or more devices (Bermond et al., 1989; Sanchez et al., 2010).
The disc ports have a very high peak-to-average bandwidth ratio. When transferring consecutive sectors, a disc can read data at up to 200 Mbytes/s; this number sets the peak bandwidth shown in the table. More typically, a disc must move its head between sectors, which takes an average of 5 milliseconds, giving an average data rate of one 4-Kbyte sector per 5 milliseconds, or less than 1 Mbyte/s. Because they handle the aggregate traffic from 64 disc ports, the host ports have a lower peak-to-average bandwidth ratio than the disc ports.
A network topology with concentration is necessary because of the large difference between average and peak bandwidth at the device ports. A network that can handle the peak bandwidth of all devices simultaneously is certainly sufficient, but it would be prohibitively expensive to build and maintain. Alternatively, we could design the network to support only the average bandwidth, but this would increase serialization latency, as discussed in the processor-memory interconnect example. Given the high peak-to-average bandwidth ratio, this serialization latency would be significant (Adve and Vernon, 1994; Siddiqui et al., 2017).
5.3.3. Packet Switching Fabric
Interconnection networks have taken over from buses and crossbars as the switching fabric for communication network switches and routers. In this application, an interconnection network acts as a component of a router for a larger-scale network (local-area or wide-area). Figure 5.5 illustrates this use. An array of line cards terminates the large-scale network channels. The line cards process each packet or cell: they determine its destination, check for compliance with the service agreement, rewrite specific fields of the packet, and update statistics counters (Patel, 1981; Batten et al., 2013).
Figure 5.5. Interconnection networks are used as a switching fabric by certain network routers, moving packets between line cards that transmit and receive packets across network channels. Source: http://cva.stanford.edu/books/ppin/.
The parameters of a typical interconnection network used as a switching fabric are shown in Table 5.3. The most significant differences between the requirements of a switching fabric and those of processor-memory and I/O networks are its high average bandwidth and its need for quality of service. The large packet size of a switch fabric, together with its insensitivity to latency, simplifies network design because latency and message overhead do not have to be highly optimized. The actual packet sizes are determined by the router's protocol (Patel, 1979; Bhuyan, 1985).
Table 5.3. A Packet Switching Fabric's Parameters
Parameter
Value
Average bandwidth
7 Gbits/s
Peak bandwidth Ports Packet payload size Message latency Reliability Traffic patterns Quality of service Availability
10 Gbits/s 4–512 40–64 Kbytes 10 μs