English Pages 214 Year 2019
The Programmer’s Guide To Theory Great ideas explained First Edition
Mike James
I/O Press I Programmer Library
Copyright © 2019 IO Press All rights reserved. This book or any portion thereof may not be reproduced or used in any manner whatsoever without the express written permission of the publisher except for the use of brief quotations in a book review. Mike James The Programmers Guide To Theory 1st Edition ISBN Paperback: 9781871962437 First Printing, 2019 Revision 0 Published by IO Press
www.iopress.info
In association with I Programmer
www.i-programmer.info
The publisher recognizes and respects all marks used by companies and manufacturers as a means to distinguish their products. All brand names and product names mentioned in this book are trade marks or service marks of their respective companies and our omission of trade marks is not an attempt to infringe on the property of others.
Preface
Computer science, specifically the theory of computation, deserves to be better known even among non-computer scientists. The reason is simply that it is full of profound thoughts and ideas. It contains some paradoxes that reveal the limits of human knowledge. It provides ways to reason about information and randomness that are understandable without the need to resort to abstract math. It is the very physical reality of computation that makes it such a “solid” thing to reason about and yet doing so reveals paradoxical results that are as interesting as anything philosophy has to offer. In fact, computational theory has a claim to being the new philosophy of the digital age.
And yet one of the barriers to learning the great ideas it contains is that computer scientists have little choice but to present what they know in mathematical form. Many books on the subject adopt the dry theorem-proof format and this does nothing to inspire understanding and enthusiasm. I’m not saying that theorems and proofs are unnecessary. We need mathematical rigor to be sure what we know is what we know, but there is no need for it in an exposition that explains the basic ideas.
My intention is that if you read this book you will understand the ideas well enough that, if you have the mathematical prowess, you should be able to construct proofs of your own, or failing this, understand proofs presented in other books. This is not an academic textbook but a precursor to reading an academic textbook.
Finally I hope that my reader will enjoy the ideas and be able to understand them well enough not to make the common mistakes when applying them to other areas such as philosophy, questions of consciousness and even moral logic – which so often happens.
My grateful thanks are due to Sue Gee and Kay Ewbank for their input in proof-reading the manuscript.
As with any highly technical book there are still likely to be mistakes, hopefully few in number and small in importance, and if you spot any please let me know. Mike James November 2019
This book is based on a series of informal articles in a number of magazines and on the I Programmer website: www.i-programmer.info To keep informed about forthcoming titles visit the I/O Press website: www.iopress.info. This is also where you will find errata and update information, and where you can provide feedback to help improve future editions.
Table of Contents
Chapter 1 What Is Computer Science?
  Computable?
  Hard or Easy?
  Understanding Computational Theory
  The Roadmap
Part I What Is Computable?
Chapter 2 What Is Computation?
  Turing Machines
  Tapes and Turing Machines
  Infinite or Simply Unlimited
  The Church-Turing Thesis
  Turing-Complete
  Summary
Chapter 3 The Halting Problem
  The Universal Turing Machine
  What Is Computable? The Halting Problem
  Reduction
  The Problem With Unbounded
  Non-Computable Numbers
  Summary
Chapter 4 Finite State Machines
  A Finite History
  Representing Finite State Machines
  Finite Grammars
  Grammar and Machines
  Regular Expressions
  Other Grammars
  Turing Machines
  Turing Machines and Finite State Machines
  Turing Thinking
  Summary
Chapter 5 Practical Grammar
  Backus-Naur Form - BNF
  Extended BNF
  BNF in Pictures - Syntax Diagrams
  Why Bother?
  Generating Examples
  Syntax Is Semantics
  Traveling the Tree
  A Real Arithmetic Grammar
  Summary
Chapter 6 Numbers, Infinity and Computation
  Integers and Rationals
  The Irrationals
  The Number Hierarchy
  Aleph-Zero and All That
  Unbounded Versus Infinite
  Comparing Size
  In Search of Aleph-One
  What Is Bigger than Aleph-Zero?
  Finite But Unbounded “Irrationals”
  Enumerations
  Enumerating the Irrationals
  Aleph-One and Beyond
  Not Enough Programs!
  Not Enough Formulas!
  Transcendental and Algebraic Irrationals
  π - the Special Transcendental
  Summary
Chapter 7 Kolmogorov Complexity and Randomness
  Algorithmic Complexity
  Kolmogorov Complexity Is Not Computable
  Compressability
  Random and Pseudo Random
  Randomness and Ignorance
  Pseudo Random
  True Random
  Summary
Chapter 8 Algorithm of Choice
  Zermelo and Set Theory
  The Axiom of Choice
  To Infinity and...
  Choice and Computability
  Non-Constructive
  Summary
Chapter 9 Gödel’s Incompleteness Theorem
  The Mechanical Math Machine
  The Failure Axioms
  Gödel’s First Incompleteness Theorem
  End of the Dream
  Summary
Chapter 10 Lambda Calculus
  What is Computable?
  The Basic Lambda
  Reduction
  Reduction As Function Evaluation
  More Than One Parameter
  More Than One Function
  Bound, Free and Names
  Using Lambdas
  The Role of Lambda In Programming
  Summary
Part II Bits, Codes and Logic
Chapter 11 Information Theory
  Surprise, Sunrise!
  Logs
  Bits
  A Two-Bit Tree
  The Alphabet Game
  Compression
  Channels, Rates and Noise
  More Information - Theory
  Summary
Chapter 12 Coding Theory – Splitting the Bit
  Average Information
  Make it Equal
  Huffman Coding
  Efficient Data Compression
  Summary
Chapter 13 Error Correction
  Parity Error
  Hamming Distance
  Hypercubes
  Error Correction
  Real Codes
  Summary
Chapter 14 Boolean Logic
  Who was George Boole?
  Boolean Logic
  Truth Tables
  Practical Truth Tables
  From Truth Tables to Electronic Circuits
  Logic in Hardware
  Binary Arithmetic
  Sequential Logic
  De Morgan's Laws
  The Universal Gate
  Logic, Finite State Machines and Computers
  Summary
Part III Computational Complexity
Chapter 15 How Hard Can It Be?
  Orders
  Problems and Instances
  Polynomial Versus Exponential Time
  A Long Wait
  Where do the Big Os Come From?
  Finding the Best Algorithm
  How Fast can you Multiply?
  Prime Testing
  Summary
Chapter 16 Recursion
  Ways of Repeating Things
  Self-Reference
  Functions as Objects
  Conditional recursion
  Forward and Backward Recursion
  What Use is Recursion?
  A Case for Recursion - The Binary Tree
  Nested Loops
  The Paradox of Self-Reference
  Summary
Chapter 17 NP Versus P Algorithms
  Functions and Decision Problems
  Non-Deterministic Polynomial Problems
  Co-NP
  Function Problems
  The Hamiltonian Problem
  Boolean Satisfiability
  NP-Complete and Reduction
  Proof That SAT Is NP-Complete
  NP-Hard
  What if P = NP?
  Summary
Chapter 1 What Is Computer Science?
Computer science is no more about computers than astronomy is about telescopes.
Edsger Dijkstra
What can this possibly mean? Computer science has to be about computers, doesn’t it? While computers are a central part of computer science, they are not its subject of study. It is what they compute that is of more interest than the actual method of computation. Computer science as practiced and taught in higher educational establishments is a very broad topic and it can include many practical things, but there is a pure form of the subject which is only interested in questions such as what can be computed, what makes something difficult or easy to compute, and so on. This is the interpretation of computer science that we are going to explore in the rest of this book.
So what is this computing and what sort of questions are there about it? Today we nearly all have a rough idea of what computing is. You start from some data, do something to it and end up with an answer, or at least something different. This process can be deeply hidden in what we are doing. For example, a word processor is an example of computation, but exactly what it is doing is very complex and to the average user very obscure. To keep things simple we can say that computation is just the manipulation of data in precise and reproducible ways.
At a more practical level, computations are rarely just “games with symbols”. They usually involve finding the answer to some real world problem or implementing a task. For example, can you find me the solution to a set of equations? What about finding the best route between a number of towns, or perhaps how best to pack a suitcase? These are all potential candidates for computation. We start off with some data on the initial configuration of the problem and then we perform manipulations that eventually get us the answer we are looking for. Sometimes we demand an exact answer and sometimes we can make do with an answer that is good enough.
You can see that this has the potential to become increasingly complex even though we have attempted to keep things simple. Most of the time we are interested in exact answers, even though these may be more than we need in practice. The reason is that we are interested in what can be achieved in theory as this guides us as to what is possible in practice. So, what sort of questions can we have about computation?
Computable?
The first and most obvious is – what can we compute? You might think that also has an obvious answer – surely everything is computable. We might not know how to compute something, but this is just a matter of discovering how to do it. In principle, everything is computable in the sense that every question has an answer and every computation has a result. This was a thought that occurred to mathematicians in the nineteenth century. The German mathematician, David Hilbert (1862-1943), for example, believed in the automation of the whole of mathematics. A machine would perform all of the proofs and mathematicians would become secondary. It was thought impossible that there could be anything that wasn’t computable.
This deeply held belief turns out not to be true. There are questions and computations that are in principle impossible, not just things that cannot be computed because they take too long, but things that are in principle non-computable and questions that are undecidable. This was, and is, one of the big shocks of computer science and one of the reasons for learning more about it. The fact that there exist things that cannot be computed is deeply surprising and it tells us something about the world we live in.
However, it is important to really understand the nature of these non-computable results. The conditions under which they occur are subtle and may sometimes be considered unrealistic. What this means is that the existence of the non-computable is a pure result worthy of consideration, but perhaps not too much practical concern. In this respect it is much like infinity – it’s an interesting idea but you can’t actually order an infinity of anything or perform an infinite number of operations. You can head in that direction, but you can’t actually get there. As you will see, the idea of infinity in all its subtle forms plays a big role in the theory of computation.
There are many clearly mistaken cases of non-computable results being used to argue that real world computations are impossible. For example, it was argued that a robot could not kill a human because of a famous non-computable question called the halting problem. I don’t advise anyone to rely on this result if pursued by a homicidal robot. If we are to avoid silly erroneous applications of the idea, it is important to understand what non-computability actually means and how it arises, rather than to assume that it means what its superficial description seems to imply.
There are other aspects of non-computability that might have more to tell us about the practical nature of the universe, or more precisely our description of the universe, but these are much more difficult to explore. We generally believe that when you do a computation that results in a number then every possible number is a candidate for the result. Put another way, it is easy to believe that every number has its day as the result of some program that produces it as a result of a computation. There are very simple arguments that indicate that this is not so. In fact the vast majority of numbers cannot possibly be the result of a computation for the simple reason that there are more numbers than programs that can produce them. This is perhaps an even more shocking conclusion than the discovery of non-computable problems. We take it for granted that there are programs that can produce numbers like π or e to any precision, but this type of number is relatively rare. An even deeper reason is the possibility that these numbers lack the regularity to allow a definition that is shorter than they are when written out. That is, π is an infinite sequence of digits, but we can write down a program involving a few tens of characters that will produce these digits one after another. Most numbers do not seem to permit their digit representation to be condensed in this way. This is also a hint that perhaps ideas of randomness and information play a role in what is computable. The majority of numbers have digit sequences that are so close to random that they do not allow a program to capture the inherent pattern of digits because there is no such pattern. This is an area we still do not understand well and it relates to subtle mathematical ideas such as the axiom of choice, transfinite numbers and so on.
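To make the claim about π concrete, here is a minimal sketch of such a short program in Python. It is not taken from this book: it uses a well-known streaming (“spigot”) method usually credited to Jeremy Gibbons, and the function name pi_digits is simply a label chosen for this illustration. The point is only that a few lines of text are enough to describe an unending sequence of digits.

```python
from itertools import islice


def pi_digits():
    """Yield the decimal digits of pi one after another, without end.

    Illustrative sketch of a streaming spigot method; all arithmetic
    uses Python's arbitrary-precision integers, so no digit is rounded.
    """
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is now certain, so emit it and rescale.
            yield n
            q, r, t, k, n, l = (10 * q, 10 * (r - n * t), t, k,
                                (10 * (3 * q + r)) // t - 10 * n, l)
        else:
            # Not enough information yet: absorb one more term of the series.
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)


# Take as many digits as you like; here just the first ten.
first_ten = list(islice(pi_digits(), 10))
print(first_ten)
```

The program text stays the same size however many digits you ask for, which is exactly the kind of condensed description that most numbers do not appear to allow.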
Hard or Easy?
This sort of absolute non-computability is interesting, but often from a more philosophical than practical point of view. Another slightly different form of the question of computability is how hard the computation is. In this case we are not looking for things that cannot be computed at any cost, only things which cost so much they are out of our reach. This is a much more practical definition of what is computable and what is non-computable and, in particular, modern cryptography is based on problems which are difficult enough to be regarded as practically non-computable. This is the area of computational complexity theory and it relates to how time, or resources in general, scale with a problem. To make this work we need some measure of the size of a problem. Usually there is a natural measure, but it isn’t difficult to prove that the actual way that we measure the size of a problem isn’t that important and doesn’t alter our conclusions. For example, if I give you a list of numbers and ask you to add them up then it is clear that the time it takes depends on the number of values, i.e. the length of
the list. The longer the list, the longer it takes. Now suppose that I set the task of multiplying each possible pair of numbers, including any repeats. The time this takes goes up as the length of the list squared. You can see that the multiplication task is much harder than the sum task and that the length of list you can sum is much longer than the list you can multiply. This is how we measure the difficulty of tasks. There are lots of tasks that are easy where the time or other resources which they need to be completed increase relatively slowly so we can process large examples. Other tasks are much more demanding and quickly reach a size where no amount of computer availability could complete them in a reasonable time. What qualifies as a reasonable time depends on the application and can vary from a few years to the expected lifetime of the universe. You would probably agree that a problem that has a computation time equal to the lifetime of the universe is effectively non-computable. It might be possible in theory, but it isn’t possible in practice. Such problems are the foundation for modern cryptography. You might say that how accessible these difficult computations are depends on the computer hardware you have. Double the speed of your computer and you halve the time it takes. This is true, but half the lifetime of the universe is still too long and you only have to increase the size of the problem by a little and again your improved computer cannot finish the computation in a reasonable time. The discussion may be about efficiency, but it isn’t much affected by the power of the hardware at your disposal. It isn’t so much about the time anything takes on real hardware, more about how that time scales with the size of the problem. In some cases the time scales in proportion to the size of the problem and in other cases it increases much faster than this. 
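The difference between the sum task and the multiply-every-pair task can be seen by simply counting operations. The following Python sketch is only an illustration; the function names and the step counters are assumptions introduced here, and counting one step per arithmetic operation is a simplification of real running time.

```python
def sum_with_steps(values):
    """Add up a list and count the additions performed."""
    total = 0
    steps = 0
    for v in values:
        total += v
        steps += 1
    return total, steps


def pair_products_with_steps(values):
    """Multiply every possible pair, repeats included, and count the multiplications."""
    products = []
    steps = 0
    for a in values:
        for b in values:
            products.append(a * b)
            steps += 1
    return products, steps


# Doubling the list doubles the work of the sum but quadruples the work of the products.
for n in (10, 20, 40):
    data = list(range(n))
    _, additions = sum_with_steps(data)
    _, multiplications = pair_products_with_steps(data)
    print(n, additions, multiplications)
```

A faster computer changes how long each step takes, but not the way the step count grows with the length of the list.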
Notice that difficult problems might be effectively non-computable, but unlike real non-computable problems they do have a solution. In many cases, if someone gives you a solution to the problem then you can check that it is correct very quickly. This is a strange class of computational tasks for which finding a solution takes a long time, but checking a solution that someone gives you is very quick. These are the so-called “NP problems”. “Easy problems” are usually called “P problems” and today the greatest single unsolved question in computer science is whether NP problems really are P problems in disguise. That is, does NP = P? If so, we just haven’t noticed that there are fast ways of solving all difficult problems that are easy to check. This isn’t a small theoretical difficulty, as modern cryptography depends on the assumption that NP≠P, and there is a million dollar prize for settling the question, one way or the other.
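As a hedged illustration of “slow to find, quick to check”, consider the subset sum puzzle: given a list of numbers, is there a selection of them that adds up to a target? This example is not from the text and the function names are invented for the sketch. Checking a proposed selection takes one pass over it, whereas the obvious search shown here may have to try every possible selection, and the number of selections doubles with each extra number in the list.

```python
from itertools import combinations


def check_selection(numbers, target, indices):
    """Quick check: are the indices valid, distinct, and do the chosen numbers hit the target?"""
    return (len(set(indices)) == len(indices)
            and all(0 <= i < len(numbers) for i in indices)
            and sum(numbers[i] for i in indices) == target)


def find_selection(numbers, target):
    """Slow search: try every non-empty selection until one works (or none does)."""
    for size in range(1, len(numbers) + 1):
        for indices in combinations(range(len(numbers)), size):
            if sum(numbers[i] for i in indices) == target:
                return indices
    return None


numbers = [3, 34, 4, 12, 5, 2]
solution = find_selection(numbers, 9)
print(solution, check_selection(numbers, 9, solution))
```

Nobody knows whether the exhaustive search can always be replaced by something fundamentally faster, which is the P versus NP question in miniature.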
One of the big difficulties in all of this is in proving that you have a method that is as efficient as it is possible to be. If you find that it takes too long to solve a problem then perhaps the problem really is difficult or perhaps we have just not thought hard enough about how to do the computation. Computer science is intriguing because of the possibility of finding a fast method of working something out that previously the best minds were unable to find. It is worth saying that this does happen and computations that were previously thought to take an impractical amount of time have been simplified to make them accessible.
Understanding Computational Theory
I hope that the meaning of Dijkstra’s quote at the start of this chapter is becoming clearer. We can examine computation without getting bogged down in the design or use of a computer to actually perform the computation. This is the theory of computation and it does inform the practice of computation, but it isn’t dependent on any practical implementation. In many ways this area of computer science is more like mathematics than anything else and it is possible to pursue mathematical theorem-proof style explanations of the subject. Many books do exactly this, but while this approach is rigorous and essential for progress, it doesn’t help develop intuition about what is actually being described, let alone being proved. In this book you mostly won’t find theorems and the only “proofs” are explanations that hopefully make a proof seem possible and reasonable. If you have a mathematical background then converting your intuitions into a full proof will be an interesting exercise and so much more rewarding than spending much more time converting a theorem-proof presentation into intuitions. In most cases I have tried to present how to think about the main ideas in computability rather than present watertight proofs. As a result this is not a computer science textbook, but it will help you read and master any of the excellent textbooks on the subject.
The Roadmap
This book has three parts, the first of which is about theoretical computation. It looks, in Chapter 2, at what constitutes a model that captures the essence of computation and the best known exponent, the Turing machine. We move on, in Chapter 3, to consider what is logically computable. This involves proof by contradiction, which is always troubling because it is difficult to see exactly what is causing the problem. In this case, we examine the best known non-computable problem – the halting problem – and find out what makes it impossible and what changes are needed to make it possible. In Chapter 4 we take a step back to a slightly simpler machine, the finite state machine, which
isn’t as powerful as a Turing machine, but is much closer to what is practically possible. The interesting thing is that there is a hierarchy of machines that result in a hierarchy of language grammars. Chapter 5 looks in more detail at the idea of grammar and how it is useful in computing to convey, not just syntax, but also meaning. One of the problems in studying computer science is that many of its ideas make use of mathematical concepts. In Chapter 6 we look in detail at numbers, the basis of much computation, and at the different types of infinity. Rather than a pure math approach to these ideas, we take an algorithmic approach to make the ideas real. Chapter 7 takes us into an area that is often overlooked – algorithmic complexity theory. This is easy enough to understand, but has some strange consequences. Even more overlooked is the subject of Chapter 8, the axiom of choice, which deserves to be better known as it has connections to algorithmic information and non-computability. Chapter 9 takes us into the ideas of Gödel’s incompleteness theorem, which is very much connected to non-computability, but in a different arena. Finally in Chapter 10, the section ends with a look at lambda calculus, which is an alternative expression of what it is to compute.
Part II is about more practical, rather than logical, matters. What is a bit forms the subject matter of Chapter 11 on classical information theory. Chapter 12 expands this to look at coding theory, which is at the heart of implementing compression algorithms. Chapter 13 is about error correcting codes. Even though we know about bits and information, we don’t yet know about bits and logic, which is what Chapter 14 is all about.
Part III looks at the ideas of computational complexity theory. This isn’t about studying things that are logically non-computable but things that are practically non-computable.
This is about algorithms that take so long to work that the answers they provide are as surely out of our reach as a solution to the halting problem. Chapter 15 examines how long computations take in terms of how the time scales with the problem size. Chapter 16 looks at recursion, which is fundamental to the longest running algorithms. Finally, Chapter 17 is about NP and P, two fundamental and related classes of algorithms. The class P is the set of algorithms that are computable in reasonable time for reasonable size problems. The set NP is the set of algorithms that are checkable in reasonable time for reasonable size problems. These two are very similar and yet so different that there is a $1 million prize on offer for proving whether they are the same or different.
Part I What Is Computable?
Chapter 2 What Is Computation?
Chapter 3 The Halting Problem
Chapter 4 Finite State Machines
Chapter 5 Practical Grammar
Chapter 6 Numbers, Infinity and Computation
Chapter 7 Kolmogorov Complexity
Chapter 8 The Algorithm of Choice
Chapter 9 Gödel’s Incompleteness Theorem
Chapter 10 The Lambda calculus
Chapter 2 What Is Computation?
Today we might simply say that computation is what a computer does. Until recently, however, there were no computing machines. If you wanted a computation performed then you had little choice but to ask a human to do it. So in a sense, computing is what humans do with whatever help they can get – slide rule, calculator or even digital computer. But we still don’t really know what a computer does or what a computation actually is.
Turing Machines You have probably heard of Alan Turing. He is often cited as being one of the Bletchley Park, UK, code breakers who shortened World War II, but he is equally, if not more, important for his fundamental work in computing.
Alan Mathison Turing 1912-1954 Turing constructed an abstract “thought” model of computing that contained the essence of what a human computer does. In it, he supposed that the human computer had an unlimited supply of paper to write on and a book of rules telling them how to proceed. He reasoned that the human would read something from the paper, then look it up in the book which would tell them what to do next. As a result the “computer” would write something on the paper and move to the next rule to use. It is worth noting that at the time Turing was working on this, the term “computer” did usually refer to a human who was employed to perform calculations using some sort of mechanical desk calculator.
A human seems like a simple model of computation, but it has many vague and undefined parts that make it difficult to reason with. For example, what operations are allowed, where can the computer write on the paper, where can they read from and so on. We need to make the model more precise. The first simplification is that the paper is changed into a more restrictive paper tape that can only be written to or read from in one location, usually called a “cell”. The tape can be moved one place left or right. This seems limiting, but you can see that with enough moves anything can be written on the tape. The second simplification is that the rule book has a very simple form. It tells the computer what to write on the tape, which direction to move the tape in, and which rule to move to next.
One of the interesting things about the Turing machine is that you can introduce variation on how it works, but you usually end up with something with the same computing power. That is, tinkering with the definition doesn’t provide the computer with the ability to compute something it previously couldn’t. It sometimes alters the speed of the computation but it introduces nothing new that could not have been computed before. This means that we can settle on a Turing machine being something like the description of the human with a paper tape and a book of rules and what can be computed using this setup isn’t changed by the exact way we define it. This is usually expressed as the fact that what is computable is robust to the exact formulation of the computing device. However, if we are all going to understand what is going on, then we might as well settle on one precise mathematical definition of what a Turing machine is, but we need to remember that our conclusions don't really depend on the exact details.
Tapes and Turing Machines In most definitions a Turing machine is a finite state machine with the ability to read and write data to a tape. Obviously to understand this definition we need to look at what a finite state machine is. The idea is so important we will return to it in Chapter 4, but for the moment we can make do with an intuitive understanding.
Essentially a finite state machine consists of a number of states. When a symbol, a character from some alphabet say, is input to the machine it changes state in such a way that the next state depends only on the current state and the input symbol. That is, the current state becomes the next state depending on the symbol read from the tape. You can represent a finite state machine in a form that makes it easier to understand and think about. All you have to do is draw a circle for every state and arrows that show which state follows for each input symbol. For example, the finite state machine in the diagram has three states. If the machine is in state 1 then an A as the next symbol moves it to state 3 and a B moves it to state 2.
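The idea that the next state depends only on the current state and the input symbol can be sketched as a lookup table. The following Python is only an illustration: the two transitions out of state 1 are the ones described above, while every other entry in the table, and the name run_fsm, are assumptions made up to complete the example.

```python
# Transition table for a three-state machine: (state, symbol) -> next state.
# Only the two transitions out of state 1 come from the text; the rest are
# invented so that the table is complete.
TRANSITIONS = {
    (1, "A"): 3,
    (1, "B"): 2,
    (2, "A"): 1,  # assumed
    (2, "B"): 3,  # assumed
    (3, "A"): 2,  # assumed
    (3, "B"): 1,  # assumed
}


def run_fsm(symbols, start=1):
    """Feed the symbols in one at a time and return the states visited."""
    state = start
    visited = [state]
    for symbol in symbols:
        # The next state depends only on the current state and the symbol.
        state = TRANSITIONS[(state, symbol)]
        visited.append(state)
    return visited


print(run_fsm("AB"))
```

Notice that the machine has no memory other than the state it is currently in, which is what limits it compared to the Turing machine described next.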
A Turing machine is a finite state machine that has an unlimited supply of paper tape that it can write on and read back from. There are many formulations of a Turing machine that vary the way the tape is used, but essentially the machine reads a symbol from the tape, to be used as an input to the finite state machine. This takes the input symbol and, according to it and the current state, does three things:
1. Prints something on the tape
2. Moves the tape right or left by one cell
3. Changes to a new state
To give you an example of a simple Turing machine, consider the rules, one for each of two states:
1. In state 1, if the tape reads blank, write a 0, move right and move to state 2
2. In state 2, if the tape reads blank, write a 1, move right and move to state 1
The entire tape is assumed to be blank when the machine is started in state 1 and so the Turing machine writes: 010101010101... where the ... means "goes on forever".
Notice that the basic operations of a Turing machine are very simple compared to what you might expect a human computer to perform. For example, you don’t have arithmetic operations. You can’t say, “add the number on the tape to 42”, or anything similarly sophisticated. All you can do is read symbols, write symbols and change state. At this point you might think that this definition of a Turing machine is nothing like a human reading instructions from a book, but it is. You can build complex operations like addition simply by manipulating symbols at the very basic level. You can, for example, create a Turing machine that will add two numbers written on a tape and write the resulting sum onto the tape. It would be a very tedious task to actually create and specify such a Turing machine, but it can be done. Once you have a Turing machine that can add two numbers, you can use this to generate other operations. Again tedious, but very possible.
The whole point is that a Turing machine is the ultimate reduction of computation. It does hardly anything – read a symbol, write a symbol and change state – but this is enough to build up all of the other operations you need to do something that looks like computation. If you attempt to make it any simpler then it likely won’t do the job.
“Everything should be made as simple as possible, but no simpler.” Albert Einstein
A Turing machine can also perform a special action – it can stop or halt – and surprisingly it is this behavior that attracts a great deal of attention.
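The two-rule machine described above is small enough to simulate directly. What follows is a minimal sketch in Python, not a definitive formulation: the names RULES and run_turing_machine are made up for the illustration, the head is moved along the tape rather than the tape under the head (the two views are equivalent), and a step limit is added only because the real machine never stops.

```python
BLANK = " "

# (state, symbol read) -> (symbol to write, head move, next state)
RULES = {
    (1, BLANK): ("0", +1, 2),
    (2, BLANK): ("1", +1, 1),
}


def run_turing_machine(rules, steps, start_state=1):
    """Run for at most `steps` steps and return what is on the tape."""
    tape = {}  # only cells that have been written are stored
    head, state = 0, start_state
    for _ in range(steps):
        key = (state, tape.get(head, BLANK))
        if key not in rules:  # no rule applies: the machine halts
            break
        write, move, state = rules[key]
        tape[head] = write
        head += move
    return "".join(tape[i] for i in sorted(tape))


print(run_turing_machine(RULES, 12))  # 010101010101
```

The dictionary used for the tape grows only as cells are written, which is a reasonable stand-in for a tape that is always long enough.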
For example, a Turing machine is said to recognize a sequence of symbols written on the tape if it is started on the tape and halts in a special state called a final state. What is interesting about this idea, as we will discover in more detail later, is that there are sequences that a Turing machine can recognize that a finite state machine, i.e. a Turing machine without a tape, can’t. For example, a finite state machine can recognize a sequence that has three As followed by three Bs, AAABBB, and so can a Turing machine. But only a Turing machine can recognize a sequence that has an arbitrary number of As followed by the same number of Bs. That is, a Turing machine is more powerful than a finite state machine because it can count without limit.
At first this seems very odd, but a little thought reveals it to be quite reasonable, obvious even! To discover if the number of Bs is the same as the number of As you have to do something that is equivalent to counting them, or at least matching them up, and given there can be arbitrarily many input A symbols even before you ever see a B you need unlimited storage to count or match them up. The finite state machine doesn't have unlimited storage but the Turing machine does in the form of its tape! This is a very important observation, and one we will return to, but it deserves some explanation before we move on.
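To make the matching-up idea concrete, here is a sketch of one possible Turing machine that recognizes some number of As followed by the same number of Bs. It is only one of many designs and none of its details come from the text: the state names, the marker symbols X and Y used to cross off matched As and Bs, and the decision to accept an empty tape are all assumptions. The input is assumed to contain only the symbols A and B.

```python
BLANK = "_"
ACCEPT = "accept"

# (state, symbol read) -> (symbol to write, head move, next state)
# X marks an A that has been matched, Y marks a B that has been matched.
RULES = {
    ("start", "A"): ("X", +1, "find_b"),     # cross off the next A
    ("start", "Y"): ("Y", +1, "check"),      # no As left: check the rest
    ("start", BLANK): (BLANK, +1, ACCEPT),   # empty tape (assumed accepted)
    ("find_b", "A"): ("A", +1, "find_b"),    # skip over remaining As
    ("find_b", "Y"): ("Y", +1, "find_b"),    # skip over crossed-off Bs
    ("find_b", "B"): ("Y", -1, "rewind"),    # cross off the matching B
    ("rewind", "A"): ("A", -1, "rewind"),
    ("rewind", "Y"): ("Y", -1, "rewind"),
    ("rewind", "X"): ("X", +1, "start"),     # back at the last crossed-off A
    ("check", "Y"): ("Y", +1, "check"),      # only crossed-off Bs may remain
    ("check", BLANK): (BLANK, +1, ACCEPT),
}


def recognizes(text):
    """Return True if the machine halts in the final state on this tape."""
    tape = dict(enumerate(text))
    head, state = 0, "start"
    while state != ACCEPT:
        key = (state, tape.get(head, BLANK))
        if key not in RULES:  # halted, but not in the final state
            return False
        write, move, state = RULES[key]
        tape[head] = write
        head += move
    return True


print(recognizes("AAABBB"), recognizes("AAB"))
```

The machine has only four working states however long the input is. All of the counting is done by marks left on the tape, which is exactly the storage a finite state machine lacks.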
Infinite or Simply Unlimited Infinity is a difficult concept and one about which it is very easy to make mistakes when reasoning. You might think that in the tangible, physical world of computers there would be no need for infinity, but in the theoretical world of computer science infinity figures large, very large. Many of the paradoxes and their conclusions depend on infinity in some form or another and while proofs and discussions often cover up this fact, it is important that you know where the infinities are buried. What we have just discovered is that a Turing machine is more powerful than a finite state machine because it has unlimited storage in the form of a paper tape. At this point we could fall to arguing the impossibility of the Turing machine because its tape can’t be infinitely long. Some definitions of a Turing machine do insist that the tape is infinite, but this is more than is required. The machine doesn't, in fact, need an infinite tape, just one that isn't limited in length, and this is a subtle difference. Imagine if you will that Turing machines are produced with a very long, but finite, tape and if the machine ever runs out of space you can just order some more tape. This is the difference between the tape actually being infinitely long or just unlimited. Mathematicians refer to this idea as finite but unbounded. That is, at any given time the tape has a very definite finite length, but you never run out of space to write new symbols. This is a slightly softer and more amenable form of infinity because you never actually get to infinity, you just head towards it as far as you need to go. Where a definition does state that the tape is infinite, in most cases this amounts to the same thing, as the Turing machine only uses as much tape as it needs – the tape is effectively used as if it were finite but unbounded. This idea is subtle and it is discussed in more detail in Chapter 6.
The Church-Turing Thesis You cannot ignore the fact that a Turing machine is very simple. It certainly isn’t a practical form of computation, but that isn’t the intent. A Turing machine is the most basic model of computation possible and the intent is to use it to see what is computable, not how easy it is to compute. This is summarized in the Church-Turing thesis, named in honor of Alonzo Church whose work was along the same lines as that of Turing.
Alonzo Church 1903–1995 Church invented the Lambda calculus, see Chapter 10, as a mathematical embodiment of what computation is. If you like, Turing stripped the physical computer down to its barest simplicity and Church did the same thing for mathematics. The Church-Turing thesis is a thesis rather than a conjecture or a theorem because it most likely cannot ever be proved. It simply states: ● Anything that can be computed, can be computed by a Turing machine. Notice that it doesn’t say anything about how fast. It doesn’t matter if it takes a Turing machine several lifetimes of the universe to get the answer, it is still computable. The Turing machine is thus the gold standard for saying which questions have answers and which don’t – which are decidable and which are not. Notice also that a consequence of the Church-Turing thesis is that there cannot be a computing device that is superior to a Turing machine. That is, the Church-Turing thesis rules out the existence of a super-Turing machine. Indeed about the only way you could falsify the Church-Turing thesis would be to find a super-Turing machine – needless to say, to date, no-one has. There are many suggestions for alternative computational systems – Post’s production system; Lambda calculus; recursive functions; and many variations on the basic Turing machine – but they all are equivalent to the Turing machine described above. In the jargon they are Turing equivalent.
You might ask, why bother with these alternative systems if they are equivalent to a Turing machine, and the answer is that in some cases they are simpler to use for the problem in hand. In most cases, however, the Turing machine is the most direct, almost physical, interpretation of what we mean by “compute”. The latest star of computing, the quantum computer, is often claimed to have amazing powers to compute things that other computers cannot, but it is easy to prove that even a quantum computer is Turing-equivalent. A standard non-quantum computer can simulate anything a quantum computer can do – it is just much slower. Thus a quantum computer is Turing-equivalent, but it is much faster – so much faster that it does enable things to be computed that would otherwise simply take too long. In the real world time does matter. A computational system is Turing-equivalent if it can simulate a Turing machine – i.e. it can do what a Turing machine can – and if a Turing machine can simulate it – i.e. a Turing machine can do what it can.
Turing-Complete This idea that every computational system is going to be at most as good as a Turing machine leads on to the idea of a system being Turing-complete. There are systems of computation, regular expressions for example as we’ll see later, that look powerful and yet they cannot compute things a Turing machine can. These, not quite full programming systems, are interesting in their own right for the light they shine on the entire process of computing. Any computer system that can compute everything a Turing machine can is called Turing-complete. This is the badge that makes it a fully grown up computer, capable of anything. Of course, it might not be efficient, and hence too slow for real use, but in principle it is a full computing system. One of the fun areas of computer science is identifying systems that are Turing-complete by accident – unintended Turing-completeness. So much the better if they are strange and outlandish and best of all if they seem unreasonable. For example, the single x86 machine instruction MOV is Turing-complete on its own without any other instructions. Some of the more outlandish Turing-complete systems include knitting, sliding block puzzles, shunting trains and many computer and card games. Sometimes identifying Turing-completeness can be a sign that things have gone too far. For example, markup languages such as HTML or XML, which are designed only to format a document, would be considered dangerous if they were proved Turing-complete. The reason is simply that having a Turing-complete language where you don’t need one is presenting an unnecessary opportunity for a hacker to break in. You might think that simple languages accidentally becoming Turing-complete is unlikely, but CSS – a style language – has been proved to be Turing-complete.
Turing-completeness may be more than just fun, however. When you look at the physical world you quickly discover examples of Turing-completeness. Some claim that this is a consequence of the Universe being a computational system and its adherents want to found a new subject called digital physics. Mostly this is just playing with words, as clearly the Universe is indeed a computational system and this is generally what our computers are trying to second guess. This is not to say that there are no good questions about natural Turing-completeness that await an answer, but its existence in such abundance is not at all surprising.
Summary
● To investigate what computation is we need a model of computation that is simple enough to reason about.
● A Turing machine is a simplified abstraction of a “human” computer working with pen and paper and a book of instructions.
● A Turing machine is composed of a finite state machine, which corresponds to the book of instructions, and an unlimited supply of paper tape, which corresponds to the pen and paper.
● This model of computation is the subject of the Church-Turing thesis which roughly states that anything that can be computed can be computed by a Turing machine.
● The fact that the tape is unlimited is perhaps the most important feature of a Turing machine. If you make the tape a fixed length then much of the computing power is lost.
● There are many alternative formulations of the Turing machine and they all amount to the same thing. The definition of a Turing machine is robust against small changes.
● There are also many alternative models of computation that could be used in place of the Turing machine, but they are all equivalent to it. This gives rise to the idea of Turing equivalent – a machine that can compute everything a Turing machine can and vice-versa.
● It also gives rise to the idea of a Turing-complete system, something that can compute everything a Turing machine can.
● Unintentional or accidental Turing-completeness is often observed and is interesting and amusing.
● Turing-completeness is also very common in natural systems and this isn’t so mysterious when you consider that the real world has to perform, in some form, the computations that our computers perform when they simulate real world problems.
Chapter 3 The Halting Problem
Now we have the Church-Turing thesis and the prime model of computation, the Turing machine, it is time to ask what computations are out of reach. If you have not encountered these ideas before then prepare to be shocked and puzzled. However paradoxical it may all sound, it is, perhaps at the expense of some mysticism, very easy to understand. First we have to look at the second good idea Turing had – the universal Turing machine.
The Universal Turing Machine Suppose you have a Turing machine that computes something. It consists of a finite state machine, an initial state and an initialized tape. These three things completely define the Turing machine. The key idea in creating a universal Turing machine (UTM) is to notice that the information that defines a specific Turing machine can be coded onto a tape. The description of the finite state machine can be written first as a table of symbols representing states and transitions to states, i.e. it is a lookup table for the next state given the current state and the input symbol. The initialization information can then be written on the tape and the remainder of the tape can be blank and used as the work area for the universal Turing machine. Now all we have to do is to design a Turing machine that reads the beginning of the tape and uses this as a “lookup table” for the behavior of the finite state machine. It can then use another special part of the tape to record the machine’s current state. It can read the symbols from the original machine’s initial tape, read the machine definition part of the tape to look up what it should do next, and write the new state out to the working area. What we have just invented is a Turing machine that can simulate any other Turing machine – i.e. a universal Turing machine. Just in case you hadn’t noticed, a universal Turing machine is just a programmable computer and the description on the tape of another Turing machine is a program. In that the universal Turing machine simulates any specific Turing machine in software, it is perhaps the first example of a virtual machine. You might think that this is obvious, but such ideas were not commonplace when Turing thought it up and implemented it. Even today you should be slightly
surprised that such a simple device – a Turing machine – is capable of simulating any other Turing machine you can think of. Of course, if it wasn’t then the Church-Turing Thesis would have been disproved. If you read about the universal Turing machine in another book then you might encounter a more complete description including the structure of the finite state machine and the coding used to describe the Turing machine to be simulated. All of this fine detail had to be done once to make sure it could be done, but you can see that it is possible in general terms without having to deal with the details. Be grateful someone has checked them out, but don’t worry about it too much.
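The central idea, that a machine's rule table is just data that another machine can read, can be sketched in a few lines of Python. This is an illustration under loud assumptions rather than a real universal Turing machine: the text encoding produced by encode is invented for the example, and Python itself plays the part of the universal machine's own finite state machine instead of everything being done with symbols on a single tape. The step limit is there only so that the example stops.

```python
BLANK = "_"


def encode(rules):
    """Write a rule table out as a string, the 'description on the tape'."""
    return ";".join(
        f"{state},{read},{write},{move},{nxt}"
        for (state, read), (write, move, nxt) in rules.items()
    )


def universal(description, data, start_state, max_steps):
    """Decode a machine description, then simulate that machine on data."""
    table = {}
    for field in description.split(";"):
        state, read, write, move, nxt = field.split(",")
        table[(state, read)] = (write, int(move), nxt)
    tape = dict(enumerate(data))
    head, state = 0, start_state
    for _ in range(max_steps):
        key = (state, tape.get(head, BLANK))
        if key not in table:  # the simulated machine has halted
            break
        write, move, state = table[key]
        tape[head] = write
        head += move
    return "".join(tape[i] for i in sorted(tape))


# A machine that writes alternating 0s and 1s on a blank tape.
ALTERNATE = {("1", BLANK): ("0", 1, "2"), ("2", BLANK): ("1", 1, "1")}
program = encode(ALTERNATE)
print(program)
print(universal(program, "", "1", 8))
```

The function universal never changes. Give it a different description string and it behaves like a different machine, which is all that being programmable means.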
What Is Computable? The Halting Problem At this point it all seems very reasonable – anything that can be computed can be computed by a specific Turing machine. Even more generally, anything that can be computed can be computed by a universal Turing machine, a very particular Turing machine, by virtue of it being able to simulate any other Turing machine. All you have to do is give the universal Turing machine a description of the specific Turing machine you are interested in on a tape and let it run. The universal Turing machine will then do whatever the specific Turing machine did, or would do if you actually built it. This all seems innocent, but there is a hidden paradox waiting to come out of the machinery and attack – and again this was the reason Turing invented all of this! It is known as the “halting problem”. This is just the problem of deciding if a given Turing machine will eventually halt on a specific tape. In principle it looks reasonably easy. You examine the structure of the finite state part of the machine and the symbols on its tape and work out if the result halts or loops infinitely. Given that we claim that anything that can be computed can be computed by a Turing machine, let’s suppose that there is just such a machine called “D”. D will read the description of any Turing machine, “T”, and its initial data from its tape and will decide if T ever stops. It is important to note that D is a single fixed machine that is supposed to work for all T. This rules out custom designing a machine that works for a particular type of T or a machine that simply guesses – it has to be correct all of the time.
There are various proofs that the halting problem isn’t computable, but the one that reveals what is going on most clearly is one credited to Christopher Strachey, a British computer scientist. To make the proof easier to follow, it is better to describe what the Turing machines do rather than give their detailed construction. You should be able to see that such Turing machines can be constructed.
Christopher Strachey 1916-1975
Suppose we have a Turing machine, called halt, that solves the halting problem for a machine, described by tape T, working on tape t:
halt(T,t)
   if T halts on t then state is true
   else state is false
gives true if the machine described by tape T working on tape t halts and false otherwise, where true and false can be coded as any two states when the Turing machine stops. Now consider the following use of the halt Turing machine to create another Turing machine:
paradox(P)
   if halt(P,P)==true loop forever
   else stop
Nothing is wrong with this Turing machine, despite its name. It takes a description of a machine in the form of a tape P and runs halt(P,P) to discover if machine P halts on tape P, i.e. halts on its own description as data. If machine P halts then paradox doesn’t, and if machine P doesn’t halt then paradox does. This is a little silly but there is nothing wrong with the construction. Now consider: halt(paradox,paradox)
Now our hypothetical halting machine has to work out if paradox halts on its own tape.
Suppose it does halt, then halt(paradox,paradox) is true, but in paradox the call to halt(P,P) is the same because P=paradox, so it has to be true as well and hence it loops for ever and doesn’t halt. Now suppose it doesn’t halt. By the same reasoning halt(paradox,paradox) is false and the if in paradox isn’t taken and paradox stops. Thus, if paradox halts, it doesn’t and if paradox doesn’t halt, it does – so it is indeed well named and it is a paradox and cannot exist. The only conclusion that can be drawn is that halt cannot exist and the halting problem is undecidable. At this point you probably want to start to pick your way through this explanation to find the flaw. I have to admit that I have cut a few corners in the interests of simplicity, but you can trust me that any defects can be patched up without changing the overall paradox. Notice that a key step in constructing this paradox is the use of halt within the program you are feeding to halt. This is self reference and like most self references, it results in a paradox. For example, the well known barber conundrum is a paradox: The barber is the "one who shaves all those, and those only, who do not shave themselves". The question is, does the barber shave himself? It is impossible to build a Turing machine that can solve the halting problem – and given we believe that anything that can be computed can be computed by a Turing machine we now have the unpleasant conclusion that we have found a problem that has no solution – it is undecidable. We have found something that is beyond the limits of computation.
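The shape of the argument can be tried out in Python. This is only a sketch: no real halt function can be written, so the two candidates below are made-up stand-ins, and "loop forever" is represented by returning a label instead of actually looping so that the example can be run. The names make_paradox and check are likewise invented for the illustration.

```python
def make_paradox(halt):
    """Build paradox from a candidate halt(program, data) -> bool."""
    def paradox(program):
        if halt(program, program):
            return "loops forever"  # stands in for an endless loop
        return "halts"
    return paradox


def check(halt):
    """Compare what halt predicts for paradox(paradox) with what it does."""
    paradox = make_paradox(halt)
    predicted = "halts" if halt(paradox, paradox) else "loops forever"
    actual = paradox(paradox)
    return predicted, actual


# Two hypothetical candidates for halt; neither is a real halting test.
def always_yes(program, data):
    return True


def always_no(program, data):
    return False


for candidate in (always_yes, always_no):
    predicted, actual = check(candidate)
    print(candidate.__name__, "predicts:", predicted, "| actually:", actual)
```

Whatever answer a candidate gives for paradox run on itself, paradox is built to do the opposite, so the prediction and the behavior can never agree.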
Reduction As soon as you have one, the flood gates open and there are lots of non-computable problems: Does machine T halt on input tape TT? Does machine T ever print a particular symbol? Does T ever stop when started on a blank tape? and so on. All of these allow the construction of a paradox and hence they are non-computable. Once you have a specimen non-computable problem you can prove other problems non-computable by showing that a solution to the new problem would also solve the old one. More precisely, if a solution to problem A can be used to solve problem B, we say that B can be reduced to A. So if we can show that a solution to A can be used to solve the halting problem, i.e. the halting problem can be reduced to A, then A is also undecidable.
This is a standard technique in complexity theory, reduction, because while it is often difficult to prove something, it is easy to show that something is equivalent to what you have already proved. For example, the Busy Beaver problem is simple enough to state. Find the maximum number of steps a Turing machine with n states can make before halting, i.e. find the function BB(n) which gives the maximum number of steps the machine takes. If this function could be computed you could use it to put a bound on the number of steps any machine could take and so create the function halt(T,t). If you have BB(n) you know that any machine with n states that halt(T,t) has to process either halts within BB(n) steps or loops forever. A way of computing BB(n) would therefore give us a way of computing halt(T,t), and so BB(n) cannot be computable. Arguing the other way, there are some well known math problems that have solutions if the halting problem does. For example, Goldbach’s conjecture is that every even number greater than 2 can be written as the sum of two primes. If the halting problem was decidable you could use halt(T,t) to solve the Goldbach conjecture. Simply create a Turing machine G that checks each even integer greater than 2 in turn to see if it is representable as the sum of two primes and halts if it finds a counter example. Now we compute halt(G,t) and if it returns true then G halts and the Goldbach conjecture is false. If it returns false then no counter example exists and the Goldbach conjecture is true. The fact that halt is non-computable rules out this approach to proving many mathematical conjectures that involve searching an infinite number of examples to find a counter example.
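The machine G is easy to sketch as an ordinary program. The Python below is an illustration only: the function names are made up, and the optional limit parameter is an addition that is not part of G, included so that the sketch can be run and stopped. Called with no limit, goldbach_search behaves like G and returns only if the conjecture is false.

```python
def is_prime(n):
    """Trial division; slow, but enough for an illustration."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def is_sum_of_two_primes(n):
    """True if n = p + q for some primes p and q."""
    return any(is_prime(p) and is_prime(n - p) for p in range(2, n // 2 + 1))


def goldbach_search(limit=None):
    """Step through the even numbers from 4, halting at a counter example.

    With limit=None this is the machine G. The limit is an assumption added
    only so that the search can be cut short; None is returned if the limit
    is reached without finding a counter example.
    """
    n = 4
    while limit is None or n <= limit:
        if not is_sum_of_two_primes(n):
            return n  # counter example found: G halts
        n += 2
    return None


print(goldbach_search(limit=1000))
```

A bounded run like this proves nothing about the conjecture, of course. It is only a working halt(G,t) that would turn the unbounded search into a proof one way or the other.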
The Problem With Unbounded It is very important to note that hidden within all these descriptions is the use of the fact that a Turing machine has a finite but unbounded tape. We never ask if it is impossible to construct the paradoxical machine because of a resource constraint. In the real world the memory is finite and all Turing machines are bounded. That is, there is a maximum length of tape. The key point is that any Turing machine with a tape of length n, a symbol set of size s and x states (in the finite state machine) can be simulated by a finite state machine with a = s^n x states (that is, s to the power n, multiplied by x). We will return to this in the next chapter, but you should be able to see that a resource-limited Turing machine has a finite set of states – a combination of the states of the finite state machine and the states the tape can be in. Hence it is itself just a finite state machine. Why does this matter? Because for any finite state machine the halting problem is computable. The reason is that if you wait enough time for the finite state machine with a states to complete a+1 steps it has either already halted or entered one of the a
states more than once and hence it will loop forever and never halt. This is often called the pigeon hole principle. If you have a things and you pick a+1 of them you must have picked one of them twice. There is even a simple algorithm for detecting the repetition of a state that doesn't involve recording all of the states visited. It is the unbounded tape that makes the Turing machine more powerful than a finite state machine because a Turing machine can have as many states as it likes. The fact that the Turing machine has an unbounded tape is vital to the construction of the halting paradox. You might now ask where the unbounded nature of the Turing machine comes into the construction of the paradoxical machine? After all we never actually say “assuming the tape is unlimited” in the construction. The answer is that we keep on creating ever bigger machine descriptions from each initial machine description without really noticing. An informal description is not difficult and it is possible to convert it into a formal proof. Suppose we have a bound b on the size of the tape we can construct – that is the tape cannot be bigger than b locations. Immediately we have the result that the halting problem can be solved for these machines as they have a finite number of states and hence are just finite state machines. Let’s ignore this easy conclusion for the moment and instead construct a machine that implements halt(T,t) then the function halt(paradox,paradox), which has to evaluate paradox(paradox), which in turn has to evaluate halt(paradox,paradox), which in turn has to … You can see that what we have is an infinite regression each time trying to evaluate halt(paradox,paradox). This is fine if the tape, or more generally the resources available, are unbounded and the evaluations can go on forever. Notice that the use of halt in paradox is an example of recursion – a subject we will return to in Chapter 16. 
What happens if the Turing machine is bounded to b tape locations? Then at some point it will run out of tape and the evaluation of halt(paradox,paradox) will fail and the whole set of evaluations will complete without ever having evaluated halt(paradox,paradox). That is, an error stops the computation without a result. In this case no paradox exists and we haven’t proved that halt(p,t) cannot exist, only that if it does exist there is a limit to the size of machine it can process. The halting problem can only be solved by a machine that uses a tape bigger than the class of machines it solves the problem for and this implies it cannot solve the problem for itself or any machine constructed by adding to its tape. You need a bit of infinity to make the halting problem emerge.
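The earlier claim that halting is decidable for a machine with a finite number of states can also be sketched in code. The text mentions a simple algorithm for detecting a repeated state without recording every state visited but does not name it. One algorithm of that kind is Floyd's "tortoise and hare" cycle detection, and using it here is an assumption about what is meant. The function step, which maps a configuration to the next one or to None once the machine has halted, is a made-up interface for the illustration.

```python
def halts(step, start):
    """Decide halting for a deterministic machine with finitely many
    configurations. step(config) returns the next configuration, or None
    once the machine has halted."""
    slow = fast = start
    while True:
        # The hare takes two steps for every one the tortoise takes.
        for _ in range(2):
            fast = step(fast)
            if fast is None:
                return True   # the run ended: the machine halts
        slow = step(slow)
        if slow == fast:
            return False      # a configuration repeated: it loops forever


def countdown(n):
    """Counts down to zero and then halts."""
    return None if n == 0 else n - 1


def lead_in_then_cycle(n):
    """Counts up to 10 and is then stuck in the cycle 5, 6, ..., 10."""
    return n + 1 if n < 10 else 5


print(halts(countdown, 5), halts(lead_in_then_cycle, 0))
```

Only two configurations are stored at any time. The method relies on there being finitely many configurations, because that is what guarantees that a run which doesn't halt must eventually repeat itself. With an unbounded tape there is no such guarantee and the loop above need never return.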
Even if you expand the conditions so that the computing machine can have memory bounded by some function of the size of the input, the halting problem is still decidable. As soon as you have an unbounded set then there are questions you can ask that have no answer. If you still find this strange consider the following question. We have an arbitrary, but bounded by b, set of things – is it possible to compute a value n that is larger than the number of things in the set? Easy, set n = b+1 and you have the required number. Now change the conditions so that the set is finite but unbounded. Can you find a value of n? The answer is no. As the set is unbounded, the question is undecidable. Any value of n that you try can be exceeded by increasing the size of the set. In this sense the size of an unbounded set is non-computable by definition as it has no upper bound.

The situation with the halting paradox is very similar. Another way of thinking of this is that the machine that solves the halting problem has to be capable of having a tape that is larger than that of any machine it solves the problem for and yet the machine corresponding to the paradox machine has an even larger tape. Just like n in the unbounded set, a Turing machine with the largest tape doesn’t exist as you can always make a bigger one. For the record:
All real physical computing devices have bounded memory and therefore are finite state machines, not Turing machines, and hence they are not subject to the undecidability of the halting problem.
A Turing machine has an unbounded tape and this is crucial in the proof of the undecidability of the halting problem, even if this isn’t made clear in the proof.
Having said this, it is important to realize that for a real machine the number of states could be so large that the computation is impractical. But this is not the same as the absolute ban imposed by the undecidability of the halting problem for Turing machines. There are many other undecidable propositions and they all share the same sort of structure and depend on the fact that an unbounded tape means there is no largest size Turing machine and no restriction on what a Turing machine can process as long as it can just ask for more tape. Notice that humans are also finite state machines, we do not have unlimited memory, and hence there is no real difference, from the point of view of computer science, between a robot and a human.
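The earlier remark that a repeated state can be detected without recording every state visited can be sketched in code. One well-known algorithm of this kind is Floyd's "tortoise and hare" cycle detection; whether it is the one the text has in mind is an assumption, and the names `halts`, `step`, `countdown` and `spin` are invented for illustration. The sketch assumes a deterministic machine with finitely many states whose `step` function returns `None` when the machine halts.

```python
def halts(step, start):
    """Decide halting for a deterministic machine with finitely many states.

    step(state) returns the next state, or None once the machine halts.
    Only two states are stored at any time, however long the run is.
    """
    slow = fast = start
    while True:
        # The fast runner takes two steps and is the first to see a halt.
        for _ in range(2):
            fast = step(fast)
            if fast is None:
                return True
        # The slow runner takes one step; it only visits states the fast
        # runner has already passed, so it cannot see a halt first.
        slow = step(slow)
        if slow == fast:
            # Both runners are inside a cycle: the machine loops forever.
            return False


def countdown(n):
    """Hypothetical machine that counts down and halts at zero."""
    return None if n == 0 else n - 1


def spin(n):
    """Hypothetical machine that cycles through seven states forever."""
    return (n + 1) % 7


print(halts(countdown, 10))  # True
print(halts(spin, 0))        # False
```

Because the number of states is finite, the loop always terminates: either the fast runner reaches a halt or the two runners meet inside a cycle. With an unbounded tape no such guarantee exists, which is the distinction the text draws.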
Non-Computable Numbers

You can comfort yourself with the thought that these non-computable problems are a bit esoteric and not really practical – but there is more and this time it is, or possibly is, practical. There is another type of non-computability which relates to numbers and which is covered in detail in Chapters 6 and 7. However, while we are on the topic of what cannot be computed, it is worth giving a brief overview of the non-computable numbers and how we know they exist.

We are all very smug about the ability to compute numbers such as π to millions of decimal places; it seems to be what computers were invented for. The bad news is that there are non-computable numbers. To see that this is the case, think of each Turing machine tape as being a number – simply read the binary number you get from a coding of its symbols on the tape that describes it to a universal Turing machine. So to each tape there is an integer and to each integer there is a tape. We have constructed a one-to-one numbering of all the possible Turing machines. Now consider the number R which starts with a decimal point and has a 1 in decimal position n if the nth Turing tape represents a machine that halts when run on U, the universal Turing machine, and a 0 if it doesn’t halt. Clearly there is no Turing machine that computes R because this would need a solution to the halting problem and there isn’t one. R is a non-computable number, but again you can comfort yourself with the idea that it is based on an unbounded computing machine and so isn’t really practical. Perhaps non-computable numbers are all of this sort and not really part of the real world. The really bad news is that it is fairly easy to prove that there are far more non-computable numbers than there are computable numbers.
Just think about how many real numbers there are and how many tapes – there is only a countable infinity of computable numbers as there is only a countable infinity of tapes, but there is an uncountable infinity of real numbers in total, see Chapter 6 if you are unclear what countable and uncountable is all about. Put even more simply - there are more real numbers than there are programs to compute them. Most of the world is outside of our computing reach… What does this mean? Some people are of the opinion that the fault lies with the real numbers. They perhaps don’t correspond to any physical reality. Some think that it has something deep to do with information and randomness. Some of these ideas are discussed in more detail later.
There seem to be two distinct types of non-computable numbers. The first are those that we infer the existence of because there just aren’t enough programs to do the computing and the second are those that are based on the undecidability of some problems. The first type are inherently inaccessible as the proof that they exist is non-constructive. It doesn’t give you an example of the number, or the proposed algorithm that cannot compute it. It just states that there are many non-computable numbers. Any attempt that you make to exhibit such a number is doomed to failure because to succeed would turn it into a computable number. The second type of non-computable numbers are based on constructing a number from an undecidable problem. The simplest example is the number given earlier where the nth bit is 0 if the nth Turing machine doesn’t halt and 1 if it does. These non-computable numbers are very different and there is a sense in which they are computable if you make the Turing machine’s tape bounded.
Summary
● The universal Turing machine can simulate any other Turing machine by reading a description of it on its tape plus its starting tape.
● The halting problem is the best known example of a non-computable or undecidable problem. All you need to do to solve it is create a Turing machine that implements halt(p,t) which is true if and only if the machine described by tape p halts on tape t.
● Notice that the halting problem has to be solved for all Turing machines not just a subset.
● If halt(p,t) exists and is computable for all p and t then it is possible to use it to construct another machine paradox(p) which halts if p loops on p and loops if p halts on p. This results in a paradox when you give halt the problem of deciding if halt(paradox,paradox) halts or not.
● The halting problem is an archetypal undecidable problem and it can be used to prove that many other problems are undecidable by reducing the halting problem to them.
● If the halting problem were decidable we could answer many mathematical questions such as the Goldbach conjecture by creating a Turing machine that searches for counter examples and then using halt to discover if a counter example existed.
● The undecidability of the halting problem for Turing machines has its origin in the unboundedness of the tape.
● A Turing machine with a bounded tape is a finite state machine f for which halt(f) is computable.
● The reason that the proof that halt is undecidable fails if the tape is bounded is that it involves a non-terminating recursion, which requires an infinite tape.
● All real computers, humans and robots are subject to finite memory and hence are finite state machines and are not subject to the undecidability of the halting problem.
● It is possible to create numbers that are not computable using undecidable problems such as halt(p,t) – such as the number whose nth bit is 1 if the nth Turing machine halts and 0 if it doesn’t. Such numbers are based on the unboundedness of the Turing machine.
● There is a class of non-computable numbers that in a sense are even more non-computable. There are only a countable number of Turing machines, but an uncountable number of real numbers. Therefore most real numbers do not have programs that compute them.
Chapter 4 Finite State Machines
We have already met the finite state machine as part of the Turing machine, but now we need to consider it in its own right. There is a sense in which every real machine is a finite state machine of one sort or another while Turing machines, although theoretically interesting, are just theoretical. We know that if the Church-Turing thesis is correct, Turing machines can perform any computation that can be performed. This isn’t the end of the story, however, because we can learn a lot about the nature of computation by seeing what even simpler machines can do. We can gain an understanding of how hard a computation is by asking what is the least that is needed to perform the computation. As already stated the simplest type of computing machine is called a ‘finite state machine’ and as such it occupies an important position in the hierarchy of computation.
A Finite History

A finite state machine consists of a fixed number of states. When a symbol, a character from some alphabet say, is input to the machine it changes state in such a way that the next state depends only on the current state and the input symbol. Notice that this is more sophisticated than you might think because inputting the same symbol doesn’t always produce the same behavior or result because of the change of state. You can also have the machine output a character as a result of changing state or you can consider each state to have some sort of action associated with it.

What this means is that the entire history of the machine is summarized in its current state. All of the inputs since the machine was started determine its current state and thus the current state depends on the entire history of the machine. However, all that matters for its future behavior is the state that it is in and not how it reached this state. This means that a finite state machine can "forget" aspects of its history that it deems irrelevant to its future.
Before you write off the finite state machine as so feeble as to be not worth considering as a model of computation, it is worth pointing out that in addition to being able to "forget" irrelevant aspects of its history it can record as many as it needs. As you can have as many states as you care to invent, the machine can record arbitrarily long histories. Suppose some complex machine has a long and perhaps complex set of histories which determine what it will do next. It is still a finite state machine because all you need is a state for each of the possible past histories and the finite state machine can respond just like the seemingly complex machine. In this case the state that you find the machine in is an indication of its complete past history and hence can determine what happens next. Because a finite state machine can represent any history and, by regarding the change of state as a response to the history, a reaction, it has been argued that it is a sufficient model of human behavior. Unless you know of some way that a human can have an unbounded history or equivalently an unbounded memory then this seems to be an inescapable conclusion - humans are finite state machines.
Representing Finite State Machines

You can represent a finite state machine in a form that makes it easier to understand and think about. All you have to do is draw a circle for every state and arrows that show which state follows for each input symbol. For example, the finite state machine in the diagram below has three states. If the machine is in state 1 then an A moves it to state 2 and a B moves it to state 3.

This really does make the finite state machine look very simple and you can imagine how it jumps around between states as symbols are applied to it.
This really is a simple machine but its simplicity can be practically useful. There are some applications which are best modeled as a finite state machine. For example, many communications protocols, such as USB, can be defined by a finite state machine diagram showing what happens as different pieces of information are input. You can even write, or obtain, a compiler that will take a finite state machine’s specification and produce code that behaves correctly. Many programming problems are most easily solved by actually implementing a finite state machine. You set up an array, or other data structure, which stores the possible states and you implement a pointer to the location that is the current state. Each state contains a lookup table that shows what the next state is, given an input symbol. When a symbol is read in, your program simply has to look it up in the table and move the pointer to the new state. This is a very common approach to organizing games.
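The lookup-table approach just described can be sketched in a few lines of Python. This is only an illustration: the two moves out of state 1 are the ones given in the text, while the remaining transitions, and the names `TRANSITIONS` and `run`, are invented for the example.

```python
# TRANSITIONS[state][symbol] gives the next state.
# Only the row for state 1 comes from the text; rows 2 and 3 are assumed.
TRANSITIONS = {
    1: {"A": 2, "B": 3},
    2: {"A": 1, "B": 3},
    3: {"A": 3, "B": 1},
}


def run(symbols, state=1):
    """Feed each input symbol to the machine and return the final state."""
    for symbol in symbols:
        # The "pointer" to the current state is just this variable.
        state = TRANSITIONS[state][symbol]
    return state


print(run("AB"))   # 1 -A-> 2 -B-> 3, so this prints 3
print(run("ABB"))  # 1 -A-> 2 -B-> 3 -B-> 1, so this prints 1
```

All of the machine's behavior lives in the table, so changing the machine means changing data rather than code, which is one reason the approach is popular.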
Finite Grammars

The practical uses of finite state machines are reason enough to be interested in them. Every programmer should know about finite state machines and shouldn't be afraid of implementing them as solutions to problems. However, the second good reason is perhaps more important - but it does depend on your outlook. Finite state machines are important because they allow us to explore the theory of computation. They help us discover what resources are needed to compute particular types of problem. In particular, finite state machines are deeply connected with the idea of grammars and languages that follow rules.

If you define two of the machine’s states as special – a starting and a finishing state – then you can ask what sequence of symbols will move it from the starting to the finishing state. Any sequence that does this is said to be ‘accepted’ by the machine. Equally you can think of the finite state machine as generating the sequence by outputting the symbols as it moves from state to state. That is, a list of state changes obeyed in order, from the start to the finish state, generates a particular string of symbols. Any string that can be generated in this way will also be accepted by the machine. The point is that the simplicity, or complexity, of a sequence of symbols is somehow connected to the simplicity or complexity of the finite state machine that accepts it.
So we now have a way to study sequences of symbols and ask meaningful questions. As a simple example, consider the finite state machine given below with state 1 as start and state 3 as finish – what sequences does it accept? Assuming that A and B are the only two symbols available, it is clear from the diagram that any sequence like BABAA is accepted by it.
A finite machine accepts a set of sequences

In general, the machine will accept all sequences that can be described by the computational grammar, see Chapter 5.

1) <S1> → B<S2> | A#
2) <S2> → A<S1> | B<S2>

A computational grammar is a set of rules that specify how you can change the symbols that you are working with. The | symbol means select one of the versions of the rule. You can use a non-terminal to specify the starting state and # to specify the final state; in this case we have used <S1> as the starting state. You can have many hours of happy fun trying to prove that this grammar parses the same sequences as the finite state machine accepts. To see that it does, just try generating a sequence:

Start with <S1> and apply rule 1 to get B<S2>. You could have selected A# instead and that would be the end of the sequence.
Use rule 2 to get BA<S1>. You have replaced <S2> by A<S1>.
Use rule 1 to get BAB<S2>.
Use rule 2 to get BABB<S2>.
Use rule 2 to get BABBA<S1>.

You can carry on using rule 1 and 2 alternately until you get bored and decide to use the A# alternative of rule 1 giving something like BABBAA#.
Grammar and Machines

Now you can start to see why we have moved on from machines to consider grammar. The structure of the machine corresponds to sequences that obey a given grammar. The question is which grammars correspond to finite state machines? More to the point, can you build a finite state machine that will accept any family of sequences? In other words, is a finite state machine sufficiently powerful to generate or accept sequences generated by any grammar? The answer is fairly easy to discover by experimenting with a few sequences. It doesn’t take long to realize that a finite state machine cannot recognize a palindromic sequence. That is, if you want to recognize sequences like AAABAAA, where the number of As on either side of the B has to be the same, then you can’t use a finite state machine to do the job.

If you want to think about this example for a while you can learn a lot. For example, it indicates that a finite state machine cannot count without limit. You may also be puzzled by the earlier statement that you can build a finite state machine that can "remember" any past history. Doesn't this statement contradict the fact that it cannot recognize a palindromic sequence? Not really. You can build a finite state machine that accepts AAABAAA and palindromes up to this size, but it won't recognize AAAABAAAA as a similar sequence because it is bigger than the size you designed the machine to work with. Any finite state machine that you build will have a limit on the number of repeats it can recognize and so you can always find a palindromic sequence that it can't recognize. The point here isn't that you can't build a finite state machine that can recognize palindromes, but that you can't do it so that it recognizes all palindromes of any size. If you think back to Chapter 3, you might notice that this is another argument about bounded and unbounded sequences.
If you want to recognize palindromes of any size you need a machine with an unbounded storage facility, such as a Turing machine. A finite state machine can count, but only up to a fixed maximum. For some this doesn't quite capture the human idea of a palindrome. In our heads we have an algorithm that doesn't put any limits on the size - it is a Turing algorithm rather than a finite state algorithm. The definition of a palindrome doesn't include a size limit. A practical machine that accepts palindromes has to have a limit, but we tend to ignore this in our thinking. More on this idea later.
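The limit on counting can be made concrete with a small sketch. The function `build_machine` below constructs the transition table of a finite state machine that accepts sequences of the AAABAAA type up to a chosen number of As, and `accepts` runs it. Both names, and the way states are labeled, are assumptions made for illustration.

```python
def build_machine(limit):
    """Build an FSM table accepting n As, a B, then n As, for 0 <= n <= limit.

    States are labeled ("up", n) while counting As before the B and
    ("down", n) while counting them off afterwards.
    """
    table = {}
    for n in range(limit + 1):
        if n < limit:
            table[("up", n), "A"] = ("up", n + 1)      # one more leading A
        table[("up", n), "B"] = ("down", n)            # B seen, start matching
        if n > 0:
            table[("down", n), "A"] = ("down", n - 1)  # one trailing A matched
    return table


def accepts(table, text):
    """Run the machine; a missing transition means the input is rejected."""
    state = ("up", 0)
    for symbol in text:
        state = table.get((state, symbol))
        if state is None:
            return False
    return state == ("down", 0)


machine = build_machine(3)
print(accepts(machine, "AAABAAA"))    # True
print(accepts(machine, "AAAABAAAA"))  # False: four As exceed the limit of 3
```

Raising `limit` gives a bigger machine that handles longer sequences, but every machine built this way has some sequence that is too long for it, which is the point the text makes.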
There are lots of other examples of sequences that cannot be recognized by a finite state machine but can be parsed or specified by a suitable grammar. What this means is that finite state machines correspond to a restricted type of grammar. The type of sequence that can be recognized by a finite state machine is called a ‘regular sequence’ - and, yes, this is connected to the regular expressions available in so many programming languages. A grammar that can parse or generate a regular sequence is called a ‘regular’ or ‘type 3 grammar’ and its rules have the form:

<non-terminal1> → symbol <non-terminal2>

or

<non-terminal1> → symbol
Regular Expressions

Most computer languages have a regular expression facility that allows you to specify a string of characters you are looking for. There is usually an OR symbol, often |, so A|B matches A or B. Then there is a quantifier usually * which means zero or more. So A* means A, AA, AAA and so on or the null string. There are also symbols which match whole sets of characters. For example, \d specifies any digit, \w specifies any "word" character, i.e. a letter, digit or underscore, and \s specifies white space. There are usually more symbols so you can match complicated patterns but these three enable you to match a lot of patterns easily. Note that an unescaped dot usually matches any character, so a literal dot has to be written as \. in a pattern. For example, a file name ending in .txt is specified as:

\w*\.txt

A file name ending in .txt or .bak as:

\w*\.txt|\w*\.bak
A name starting with A and ending with Z as: A\w*Z
and so on. As already mentioned, regular expressions usually allow many more types of specifiers, far too many to list here. The key point is that a regular expression is another way of specifying a string that a finite state machine can recognize, i.e. it is a regular grammar and this means there are limits on what you can do with it. In particular, you generally can’t use it to parse a real programming language and there is no point in trying. It is also worth knowing that the "*" operator is often called the Kleene star operator after the logician Stephen Kleene. It is used generally in programming and usually means "zero or more". For example z* means zero or more z characters.
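As a quick check, the three patterns can be tried with Python's standard `re` module. This is a minimal sketch: the helper name `matches` and the sample file names are invented, and `re.fullmatch` is used so that the whole string has to fit the pattern. In Python's syntax an unescaped dot matches any character, so the dot before the extension is written as `\.` here.

```python
import re


def matches(pattern, text):
    """Return True if the whole of text fits the regular expression."""
    return re.fullmatch(pattern, text) is not None


print(matches(r"\w*\.txt", "notes.txt"))           # True
print(matches(r"\w*\.txt", "notes.bak"))           # False
print(matches(r"\w*\.txt|\w*\.bak", "notes.bak"))  # True
print(matches(r"A\w*Z", "ABCZ"))                   # True

# Without the backslash the dot matches any character, so a name with
# no real extension slips through.
print(matches(r"\w*.txt", "notes_txt"))            # True
```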
Other Grammars

Now that we know that regular grammars, finite state machines and the sequences, or languages, that they work with do not include everything, the next question is, what else is there? The answer is that there is a hierarchy of machines and grammars, each one slightly more powerful than the last.

Avram Noam Chomsky 1928-

This hierarchy was first identified by linguist Noam Chomsky who was interested in the grammar of natural language. He wanted to find a way to describe, analyze and generate language and this work is still under development today. So what is the machine one step more powerful than a finite state machine? The answer is a ‘pushdown machine’. This is a finite state machine with the addition of a pushdown stack or Last In First Out (LIFO) stack.
On each state transition the machine can pop a symbol off the top of the stack or push the input symbol onto it. The transition that the machine makes is also determined by the current input symbol and the symbol on the top of the stack. If you are not familiar with stacks then it is worth saying that pushing a symbol onto the top of the stack pushes everything already on the stack down one place and popping a symbol off the top of the stack makes everything move up by one.
At first sight the pushdown machine doesn’t seem to be any more powerful than a finite state machine – but it is. The reason it is more powerful is that, while its stack only ever holds a finite number of symbols, that number is unbounded and the stack can grow to deal with anything you can throw at it. That is, we don't add in the realistic condition that the stack can only store a maximum number of symbols. If you recall, it is the unbounded nature of the tape that gives a Turing machine its extra powers. For example, a pushdown stack machine can accept palindromes of the type AAABAAA where the number of As on each side of the B has to be the same. It simply counts the As on both sides by pushing them on the stack and then popping them off again after the B. Have a look at the pushdown machine below – it recognizes palindromes of the type given above. If you input a string like AAABAAA to it then it will end up in the finish state and as long as you have used up the sequence, i.e. there are no more input symbols, then it is a palindrome. If you have symbols left over or if you end up in state 4 it isn’t a palindrome.
A palindrome detector (TOS = Top Of Stack)

So a pushdown machine is more powerful than a finite state machine. It also corresponds to a more powerful grammar – a ‘context-free grammar’. A grammar is called context-free or ‘type 2’ if all its rules are of the form:

<non-terminal> → almost anything

The key feature of a context-free grammar is that the left-hand side of the rule is just a non-terminal class. For example, the rule:

A<S1> → A<S1>A

isn’t context-free because there is a terminal symbol on the left. However, the rule:

<S1> → A<S1>A

is context-free and you can see that this is exactly what you need to parse or generate a palindrome like ABA with the addition of:

<S1> → B

which is also context-free.
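The push-then-pop idea behind the palindrome detector can be sketched as follows. This is not a transcription of the machine in the diagram: the function name `is_centered_palindrome` and the use of a flag in place of explicit numbered states are assumptions made to keep the example short. A Python list plays the part of the stack and, as in the text, no limit is placed on how large it can grow.

```python
def is_centered_palindrome(text):
    """Accept strings made of n As, a single B, then exactly n As."""
    stack = []
    seen_b = False            # stands in for the machine's control state
    for symbol in text:
        if symbol == "A" and not seen_b:
            stack.append("A")         # count a leading A by pushing it
        elif symbol == "B" and not seen_b:
            seen_b = True             # the middle of the sequence
        elif symbol == "A" and seen_b:
            if not stack:
                return False          # more trailing As than leading As
            stack.pop()               # match a trailing A against a pushed one
        else:
            return False              # a second B or an unknown symbol
    # Accept only if the B was seen and every pushed A was matched.
    return seen_b and not stack


print(is_centered_palindrome("AAABAAA"))    # True
print(is_centered_palindrome("AAAABAAAA"))  # True
print(is_centered_palindrome("AABA"))       # False
```

Unlike a finite state machine built for a fixed size, the same code handles any number of As because the stack, not the set of states, does the counting.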
Turing Machines

You can probably guess that, if the language corresponding to a pushdown machine is called context-free, the next step up is going to be ‘context-sensitive’. This is true, but the machine that accepts a context-sensitive sequence is a little more difficult to describe. Instead of a stack, the machine needs a ‘tape’, which stores the input symbols. It can read or write the tape and then move it to the left or the right. Yes, it’s a Turing machine! A Turing machine is more powerful than a pushdown machine because it can read and write symbols in any order, i.e. it isn’t restricted to LIFO order. However, it is also more powerful for another, more subtle, reason. A pushdown machine can only ever use an amount of storage that is proportional to the length of its input sequence, but a machine with a tape that can move in any direction can, in principle, use any amount of storage. For example, the machine could simply write a 1 on the tape and move one place left, irrespective of what it finds on the tape. Such a machine would create an unbounded sequence of 1s on the tape and so use an unbounded amount of storage.

It turns out that a Turing machine is actually too powerful for the next step up the ladder. What we need is a Turing machine that is restricted to using just the portion of tape that its input sequence is written on, a ‘linear bounded machine’. You can think of a linear bounded machine as a Turing machine with a short tape or a full Turing machine as a linear bounded machine that has as long a tape as it needs. The whole point of this is that a linear bounded machine accepts context-sensitive languages defined by context-sensitive grammars that have rules of the form:

anything1 → anything2

but with the restriction that the sequence on the output side of the rule is at least as long as the input side.
Consider:

A<S1> → A<S1>A

as an example of a context-sensitive rule. Notice that you can think of it as the rule:

<S1> → <S1>A

but only applied when an A comes before and hence the name “context-sensitive”. A full Turing machine can recognize any sequence produced by any grammar – generally called a ‘phrase structured’ grammar. In fact, as we already know, a Turing machine can be shown to be capable of computing anything that can reasonably be called computable and hence it is the top of the hierarchy as it can implement any grammar. If there was a grammar that a Turing machine couldn't cope with then the Church-Turing thesis wouldn't hold. Notice that a phrase structured grammar is just a context-sensitive grammar that can also make sequences shrink as well as grow.
The languages, the grammar and the machines

That’s our final classification of languages and the power of computing machines. Be warned, there are lots of other equivalent, or nearly equivalent, ways of doing the job. The reason is, of course, that there are many different formulations of computing devices that have the same sort of power. Each one of these seems to produce a new theory expressed in its own unique language. There are also finer divisions of the hierarchy of computation, not based on grammars and languages. However, this is where it all started and it is enough for the moment.
Turing Machines and Finite State Machines

We already know that a Turing machine is more powerful than a finite state machine; however, this advantage is often over-played. A Turing machine is only more powerful because of its unbounded memory. Any algorithm that a Turing machine actually implements has a bounded tape if the machine stops. Obviously, if the machine stops it can only have executed a finite number of steps and so the tape has a bounded length. Thus any computation that a bounded Turing machine can complete can be performed by a finite state machine. The subtle point is that the finite state machine is different in each case. There is no single finite state machine that can do the work of any bounded Turing machine that you care to select but once you have selected such a Turing machine it can be converted into a finite state machine.

As explained in the previous chapter, if a Turing machine uses a tape of length n, a symbol set of size s and x states (in its finite state machine controller), it can be simulated by a finite state machine with $a = s^n x$ states. Proving this is very simple. The tape can be considered as a string of length n and each cell can have one of the s symbols. Thus, the total number of possible strings is $s^n$. For example, if we have a tape of length 2 and a symbol set consisting of 0,1 then we have the following possible tapes:

[0|0] [0|1] [1|0] [1|1]

and you can see that this is $2^2$, i.e. 4. For another example, a tape of length 3 on two symbols has the following states:

[0|0|0] [0|0|1] [0|1|0] [0|1|1] [1|0|0] [1|0|1] [1|1|0] [1|1|1]

i.e. $2^3 = 8$ states. For each state of the tape, the Turing machine’s finite state machine controller can be in any of x states, making a total of $s^n x$ states. This means you can construct a finite state machine using $s^n x$ states arranged so that the transition from one state to another copies the changes in the Turing machine’s internal state and the state of its tape. Thus the finite state machine is exactly equivalent to the Turing machine. What you cannot do is increase the number of states in the finite state machine to mimic any addition to the Turing machine’s tape - but as the tape is bounded we don't have to.

It is sometimes argued that modern computers have such large storage capacities that they are effectively unbounded. That is, a big computer is like a Turing machine. This isn’t correct as the transition from finite state machine to Turing machine isn’t gradual – it is all or nothing. The halting problem for a Turing machine is logically not decidable. The halting problem for a finite state machine or a bounded Turing machine is always decidable, in theory. It may take a while in practice, but there is a
distinct difference between the inaccessibility of the halting problem for a Turing machine and the practical difficulties encountered in the finite state machine case. What it is reasonable to argue is that for any bounded machine a finite state machine can do the same job. So the language and grammar hierarchy discussed in the last section vanishes when everything has to be finite. A finite state machine can recognize a type 0, 1 or 2 language as long as the length of the string is limited.
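The state count used in this argument can be checked by brute force. The sketch below enumerates every combination of tape contents and controller state for a small bounded machine; the function name `configurations` and the choice of four controller states are assumptions made for illustration. It follows the count given in the text, which does not track the position of the read/write head separately; including it would multiply the total by a further factor of n, which is still finite.

```python
from itertools import product


def configurations(symbols, tape_length, controller_states):
    """List every (tape contents, controller state) pair of a bounded machine."""
    tapes = product(symbols, repeat=tape_length)   # s**n possible tapes
    return [(tape, q) for tape in tapes for q in range(controller_states)]


tapes_only = configurations("01", 3, 1)
print(len(tapes_only))   # 2**3 = 8 tapes, as listed in the text

full = configurations("01", 3, 4)
print(len(full))         # 2**3 * 4 = 32 states for the equivalent FSM
```

Each element of the list would become one state of the equivalent finite state machine, with transitions copied from the behavior of the bounded Turing machine.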
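Turing Thinking

One of the interesting things about Turing machines versus finite state machines is the way we write practical programs or, more abstractly, the way we formulate algorithms. We nearly always formulate algorithms as if we were working with a Turing machine. For example, consider the simple problem of matching parentheses. This is not a task a finite state machine can perform because the grammar that generates the “language” is context-free:

1. <S> → (<S>)
2. <S> → <S><S>
3. <S> → #

where S is the current state of the string. You can prove that this is the grammar of balanced parentheses or you can just try a few examples and see that this is so. For example, starting with S="" i.e. the null string, rule 1 gives: ()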
where S is the current state of the string. You can prove that this is the grammar of balanced parentheses or you can just try a few examples and see that this is so. For example, starting with S="" i.e. the null string, rule 1 gives: ()
Applying rule 2 gives: ()()
Applying rule 1 gives: (()())
and finally rule 3 terminates the string. Let’s create an algorithm that checks to make sure that a string of parentheses is balanced. The most obvious one is to use a stack in which for each symbol:

if symbol is )
    if the stack is empty then stop
    else pop the stack
else push ( on stack
When the last symbol has been processed, the stack is empty if the parentheses are balanced. If the stack isn't empty then the parentheses are unbalanced. If there is an attempt to pop an empty stack while still processing symbols the parentheses are unbalanced.
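The stack algorithm translates almost line for line into Python. This is a minimal sketch in which the function name `balanced` is an assumption and a list serves as the stack; the input is assumed to contain only parentheses.

```python
def balanced(text):
    """Return True if the parentheses in text are balanced."""
    stack = []
    for symbol in text:
        if symbol == ")":
            if not stack:
                return False      # attempt to pop an empty stack
            stack.pop()
        else:
            stack.append("(")     # push ( on stack
    return not stack              # balanced only if nothing is left over


print(balanced("(()())()"))   # True
print(balanced("(()()))("))   # False
```

Notice that nothing in the code limits how deep the stack can grow, which is exactly the resource-blind style the text goes on to call Turing thinking.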
If you try it you should be convinced that this works. Take (()())(), then the stack operations are:

string      operation   stack
(()())()    push (      (
()())()     push (      ((
)())()      pop         (
())()       push (      ((
))()        pop         (
)()         pop         empty
()          push (      (
)           pop         empty
                        balanced
as the final stack is empty, the string is balanced. Now try (()()))(:

string      operation   stack
(()()))(    push (      (
()()))(     push (      ((
)()))(      pop         (
()))(       push (      ((
)))(        pop         (
))(         pop         empty
)(          pop         empty
                        unbalanced
as an attempt to pop an empty stack has occurred the parentheses are unbalanced. This algorithm is an example of Turing thinking. The algorithm ignores any issue of what resources are needed. It is assumed that pushes and pops will never fail for lack of resources. The algorithm is a Turing machine style unbounded algorithm that will deal with a string of any size. Now consider a finite machine approach to the same problem. We need to set up some states to react to each ( and each ) such that if the string is balanced we end in a final state. In the machine below we start from state 1 and move to new states according to whether we have a left or right parenthesis. State 5 is an error halt state and state 0 is an unbalanced halt state.
Consider (())()

string    transition
(())()    state 1 → state 2
())()     state 2 → state 3
))()      state 3 → state 2
)()       state 2 → state 1
()        state 1 → state 2
)         state 2 → state 1
and as we finish in state 1, the string is balanced.

string    transition
(()))(    state 1 → state 2
()))(     state 2 → state 3
)))(      state 3 → state 2
))(       state 2 → state 1
)(        state 1 → state 0
as we finish in state 0, the string is unbalanced. Finally consider:

string      transition
(((())))    state 1 → state 2
((())))     state 2 → state 3
(())))      state 3 → state 4
())))       state 4 → state 5
as we finish in state 5, an out of memory error has occurred.

The difference between the two approaches is that one doesn’t even think about resources and uses the pattern in the data to construct an algorithm that can process any amount of data given the resources. The second deals with a problem with an upper bound – as the last trace shows, a fourth left parenthesis in a row takes the machine to the error state, so no more than three can be open at once. Of course, this distinction is slightly artificial in that we could impose a bound on the depth of the stack and include a test for overflow and stop the algorithm. Would this be finite state thinking or just Turing thinking with a fix for when things go wrong? The point is that we construct pure algorithms that tend not to take resource limitations into account – we think Turing, even if we implement finite state.

There is also a sense in which Turing thinking captures the abstract essence of the algorithm that goes beyond a mere mechanism to implement it. The human mind can see algorithms that always work in principle even if in practice they have limits. It is the computer science equivalent of the mathematician conceiving of a line as having no thickness, of a perfect circle or indeed something that has no end.
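For comparison with the stack version, here is a sketch of the bounded machine as a lookup table. The transition table is an assumption: it is inferred from the three traces above rather than copied from the machine diagram, and the move from state 4 back to state 3 on a right parenthesis is a guess by symmetry. The names TRANSITIONS, run_machine and verdict are invented for the example, as is the decision to report a string that ends in state 2, 3 or 4 as unbalanced.

```python
# States 1 to 4 record that 0 to 3 left parentheses are currently open.
# State 0 is the unbalanced halt state, state 5 the out-of-memory halt state.
TRANSITIONS = {
    (1, "("): 2, (1, ")"): 0,
    (2, "("): 3, (2, ")"): 1,
    (3, "("): 4, (3, ")"): 2,
    (4, "("): 5, (4, ")"): 3,  # (4, ")") is assumed, not shown in the traces
}
HALT_STATES = {0, 5}


def run_machine(text):
    """Feed text to the machine and return the state it finishes in."""
    state = 1
    for symbol in text:
        state = TRANSITIONS[(state, symbol)]
        if state in HALT_STATES:
            break  # a halt state ignores the rest of the input
    return state


def verdict(text):
    """Translate the final state into a description."""
    state = run_machine(text)
    if state == 1:
        return "balanced"
    if state == 5:
        return "out of memory"
    return "unbalanced"  # state 0, or parentheses left open in states 2 to 4


for example in ["(())()", "(()))(", "(((())))"]:
    print(example, run_machine(example), verdict(example))
```

The whole of the machine's "memory" is the state number, which is why the bound is built in rather than bolted on.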
Summary
● The finite state machine is simple and less powerful than other machines with unbounded storage such as the Turing machine.
● Finite state machines accept sequences generated by a regular grammar.
● Regular expressions found in most programming languages are a practical implementation of a regular grammar.
● The pushdown machine is slightly more powerful than a finite state machine and it accepts sequences generated by a context-free grammar.
● A full Turing machine can accept sequences generated by a phrase structured grammar.
● The different types of machine and grammars form a hierarchy of increasingly complex languages.
● Although a Turing machine is more powerful than a finite state machine, in practice all machines are finite state machines.
● A Turing machine with a tape limited to n cells is equivalent to a finite state machine with $s^n x$ states, where s is the number of symbols it uses and x is the number of states in its controller.
● Even though Turing machines are an abstraction and do not exist, we tend to create algorithms as if they did, making no assumptions about resource limits. If we do think of resource limits they are afterthoughts bolted on to rescue our inherently unbounded algorithms.
● Algorithms that are unbounded are the pure thought of computer science.
Chapter 5 Practical Grammar
This chapter is a slight detour from the subject of what is computable. We have discovered that different types of computing model are equivalent to different types of grammar and the languages they produce. What we haven’t looked at so far is the impact this practical problem has had on theoretical computer science. Many hours are spent studying grammar, how to specify it and how to use it. What is often ignored is the question of what practical relevance grammar has to the meaning of a language. Surely the answer is none whatsoever! After all, grammar is about syntax and semantics is about meaning. This isn’t quite true. A grammar that doesn’t reflect meaning is a useless theoretical construct. There is an argument that the arithmetic expression is the whole reason that grammar and parsing methods were invented, but once you have the theory of how to do it you might as well use it for the entire structure of a language.
Backus-Naur Form - BNF
In the early days of high-level languages the only really difficult part was the translation of arithmetic expressions to machine code and much of the theory of grammar was invented to deal with just this problem.
John Warner Backus 1924 - 2007
At the core of this theory, and much used in the definition of programming languages, is BNF, or Backus Normal Form. Because Donald Knuth pointed out that it really isn't a normal form, in the sense of providing a unique normalized grammar, it is often known as Backus-Naur Form after John Backus the inventor of Fortran, and Peter Naur one of the people involved in creating Algol 60.

Fortran was the first "high level" computer language and its name derives from FORmula TRANslation. The intention was to create a computer language that would allow programmers to write something that looked like an arithmetic expression or formula. For example, A=B*2+C is a Fortran expression. Working out what exactly had to be done from a formula is not as easy as you might think and other languages ducked the issue by insisting that programmers wrote things out explicitly - "take B multiply it by two and add C". This was the approach that the language Cobol took and it was some time before it added the ability to use formulas.

Not only isn't Backus Normal Form a normal form, there isn't even a single standard notation for BNF, and everyone feels free to make up their own variation on the basic theme. However, it is always easy enough to understand and it is very similar to the way we have been describing grammar to this point. For example, using "arrow notation" you might write:

<additive expression> → <variable> + <variable>
You can read this as saying that an additive expression is formed by taking a variable, a plus sign and another variable. This rule only fits expressions like A+B, and not A+B+C, but you get the general idea. Quantities in angle brackets like:

<variable>

are called “non-terminal” symbols because they are defined by a BNF rule and don’t appear in the final language. The plus sign on the other hand is a “terminal” symbol because it isn't further defined by a BNF rule. You might notice that there is a slight cheat going on in that the non-terminal <variable> was replaced by a terminal in the examples, but without a rule allowing this to happen. Well, in proper BNF, you have to have a rule that defines every non-terminal in terms of terminal symbols, but in the real world this becomes so tedious that we often don’t bother and rely instead on common sense. In this case the missing rule is something like:

<variable> → A|B|C|D etc .. |Z

where the vertical bar is read as “OR”. Notice that this defines <variable> as just a single character - you need a slightly more complicated rule for a full multicharacter variable name.
If you want to define a variable that was allowed to have more than a one-letter name you might use:

1. <variable> → <variable> <letter> | <letter>
2. <letter> → A|B|C|D etc .. |Z

This is the BNF equivalent of a repeating operation. It says that a variable is either a variable followed by a letter or a letter. For example, to use this definition you start with:

<variable>

and use rule 1 to replace it by:

<variable> <letter>

then use rule 2 to replace <letter> by A to give:

<variable> A

Next we use rule 1 to replace <variable> to get:

<variable> <letter> A

and so on building up the variable name one character at a time. Boring isn’t it? But it is an ideal way to tell a computer what the rules of a language are. Just to check that you are following – why does rule 1 include the alternative | <letter>?
The reason is that this is the version of the rule we use to stop a variable name growing forever, because once it is selected the next step is to pick a letter and then we have no non-terminals left in the result. Also notice that the BNF definition of a variable name cannot restrict its length. Rules like “fewer than 8 characters” have to be imposed as notes that are supplied outside of the BNF grammar.
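To see the two rules at work, the following sketch lists every intermediate string produced while deriving a given variable name. It assumes the rules read "a variable is a variable followed by a letter, or a letter" and "a letter is one of A to Z". The function name derive is invented for the illustration.

```python
import string


def derive(name):
    """Return each string produced while deriving name from <variable>."""
    if not name or any(ch not in string.ascii_uppercase for ch in name):
        raise ValueError("name must be one or more letters A-Z")
    steps = ["<variable>"]
    tail = []  # letters fixed so far; the name grows from the right
    for letter in reversed(name[1:]):
        # Rule 1, first alternative: a variable followed by a letter.
        steps.append(" ".join(["<variable>", "<letter>"] + tail))
        # Rule 2: replace the letter non-terminal by an actual letter.
        tail.insert(0, letter)
        steps.append(" ".join(["<variable>"] + tail))
    # Rule 1, second alternative: a single letter, which stops the growth.
    steps.append(" ".join(["<letter>"] + tail))
    # Rule 2 again, leaving only terminal symbols.
    steps.append(" ".join([name[0]] + tail))
    return steps


for step in derive("CAB"):
    print(step)
```

The next-to-last step is the only place where the second alternative of rule 1 is used, and it is what removes the last non-terminal that could keep the name growing.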
Extended BNF
To make repeats easier you will often find that an extended BNF notation is used where curly brackets are used to indicate that the item can occur as many times as you like – including zero. In this notation:

<variable> → <letter> {<letter>}

means “a variable is a letter followed by any number of letters”. Another extension is the use of square brackets to indicate an optional item. For example:

<integer> → [+|-] <digit> {<digit>}

means that you can have a plus or minus sign, or no sign at all, in front of a sequence of at least one digit. The notation isn't universal by any means, but something like it is often used to simplify BNF rules.
If you know regular expressions you might recognize some of the conventions used to specify repeated or alternative patterns. This is no accident. Both regular expressions and BNF are capable of parsing regular languages. The difference is that BNF can do even more. See the previous chapter for a specification of what sorts of grammars are needed to model different complexities of language and machines.
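As a rough illustration of that similarity, the two extended BNF rules described above (a letter followed by any number of letters, and an optional sign in front of at least one digit) can be written as regular expressions using Python's re module. The pattern names are invented, and restricting letters to A-Z and digits to 0-9 is an assumption.

```python
import re

# A letter followed by any number of letters: the curly brackets become *.
VARIABLE = re.compile(r"[A-Z][A-Z]*")

# An optional sign, then at least one digit: the square brackets become ?.
SIGNED_NUMBER = re.compile(r"[+-]?[0-9][0-9]*")


def matches(pattern, text):
    """Return True if the whole of text fits the pattern."""
    return pattern.fullmatch(text) is not None


print(matches(VARIABLE, "TOTAL"))     # True
print(matches(SIGNED_NUMBER, "-42"))  # True
print(matches(SIGNED_NUMBER, "+"))    # False
```

What a regular expression cannot do is refer back to itself, which is exactly what a BNF rule for nested structures, such as balanced parentheses, needs.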
BNF In Pictures - Syntax Diagrams
You will also encounter BNF in picture form. Syntax diagrams or "railroad diagrams" are often used as an easy way to understand and even use BNF rules. The idea is simple - each diagram shows how a non-terminal can be constructed from other elements. Alternatives are shown as different branches of the diagram and repeats are loops. You construct a legal string by following a path through the diagram and you can determine if a string is legal by finding a path through the diagram that it matches. For example, the following syntax diagram is how the classic language Pascal defines a variable name:
It corresponds to the rules: → → |
If you want to see more syntax diagrams simply see if your favorite language has a list of them - although it has to be admitted that their use has declined. Here is a classic example in the 1973 document defining Pascal:
Why Bother?
What have we got so far?

● A grammar is a set of rules which defines the types of things that you can legitimately write in a given language.

This is very useful because it allows designers of computer languages to define the language unambiguously. If you look at the definition of almost any modern language – C, C++, C#, Java, Python, Ruby and so on - you will find that it is defined using some form of BNF. For example, if you look up the grammar of C++ you will discover that an expression is defined as:

expression:
    assignment-expression
    expression , assignment-expression

Which in our BNF notation would be:

<expression> → <assignment-expression> | <expression> , <assignment-expression>
As already mentioned, sometimes the BNF has to be supplemented by a note restricting what is legal. This use of BNF in defining a language is so obviously useful that most students of computer science and practicing programmers immediately take it to heart as a good method. Things really only seem to begin to go wrong when parsing methods are introduced. Most people just get lost in the number and range of complexity of the methods introduced and can’t see why they should bother with any of it.
Generating Examples You can think of the BNF rules that define a language as a way of generating examples of the language or of testing to see if a statement is a valid example of the language. Generating examples is very easy. It’s rather like playing dominoes with production rules. You pick a rule and select one of the parts of the rule separated by “OR” and then try and find a rule that starts with any non-terminal on the right. You replace the non-terminal items until you have a symbol string with no non-terminal items – this is a legal example of the language. It has to be because you generated it using nothing but the rules that define the language. Working the other way, i.e. from an example of the language to a set of rules that produce it, turns out to be more difficult. Finding the production rules is called “parsing” the expression and which parsing algorithms work well can depend on the nature of the grammar. That is, particularly simple grammars can be parsed using particularly simple and efficient methods. At this point
the typical course in computer science spends a great deal of time explaining the different parsing algorithms available and a great many students simply cannot see the wood for the trees – syntax trees that is! While the study of parsing methods is important, you only understand why you are doing it when you discover what parsing can do for you. Obviously if you can construct a legal parse of an expression then the expression is grammatical and this is worth knowing. For example, any reasonable grammar of arithmetic expressions would soon throw out /3+*5 as nonsense because there is no set of rules that create it. This is valuable but there is more. If you have parsed an expression the way that it is built up shows you what it means. This is a function of grammar that is usually ignored because grammar is supposed to be only to do with syntax, which is supposed to have nothing to do with meaning or semantics. In English, for example, you can have syntax without meaning; “The green dream sleeps furiously.”
This is a perfectly good English sentence from the point of view of grammar, but, ignoring poetry for the moment, it doesn’t have any meaning. It is pure syntax without any semantics. This disassociation of syntax and semantics is often confused with the idea that syntax conveys no meaning at all and this is wrong. For example, even in our nonsense sentence you can identify that something “the dream” has the property “green” and it is doing something, i.e. “sleeping”, and that it is doing it in a certain way, i.e. “furiously”. This is a lot of information that is brought to you courtesy of the syntax and while the whole thing may not make sense the syntax is still doing its best to help you get at the meaning.
Syntax Is Semantics So in natural languages syntax carries with it some general indications of meaning. The same is true of the grammar of a programming language. Consider a simple arithmetic expression: 3+2*5
As long as you know the rules of arithmetic, you will realize that you have to do the multiplication first. Arithmetic notation is a remarkably sophisticated mini-language, which is why it takes time to learn in school and why beginners make mistakes. Implementing this arithmetic expression as a program is difficult because you can't simply read it from left to right and implement each operation as you meet it. That is, 3+2*5 isn't (3+2)*5 i.e. doing the operations in the order presented 3+2 and then *5. As the multiplication has a higher priority it is 3+(2*5) which written in order of operations is 2*5+3.
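A simple grammar for this type of expression, leaving out the obvious detail, might be:

<expression> → <expression> + <expression> | <expression> * <expression>

This parses the expression perfectly, but it doesn’t help with the meaning of the expression because there are two possible ways that the grammar fits:

<expression> → <expression> + <expression>
               3            + 2*5

or

<expression> → <expression> * <expression>
               3+2          * 5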
These are both perfectly valid parses of the expression as far as the grammar is concerned, but only the first parsing groups the expressions together in a way that is meaningful. We know that the 2 and 5 should group as a unit of meaning and not the 3 and the 2, but this grammar gives rise to two possible syntax trees
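We need to use a grammar that reflects the meaning of the expression. For example, the grammar:

<expression> → <term> + <term>
<term> → <value> * <value> | <value>

also allows 3+2*5 to be legal but it only fits in one way:

<expression> → <term> + <term>
               3      + 2*5
<term> → <value> * <value>
         2       * 5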
This means that this particular grammar only gives the syntax tree that corresponds to the correct grouping of the arithmetic operators and their operands. It contains the information that multiplication groups together more strongly than addition. In this case we have a grammar that reflects the semantics or meaning of the language and this is vital if the grammar is going to help with the translation.
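A small recursive descent parser shows how a layered grammar forces the right grouping. This is a sketch under stated assumptions, not the book's own parser: it uses a slightly more general grammar than the one in the text (an expression is one or more terms joined by +, a term is one or more values joined by *), values are single digits, and the names parse, parse_expression, parse_term and parse_value are invented.

```python
DIGITS = "0123456789"


def parse_value(tokens, pos):
    """value -> a single digit"""
    if pos >= len(tokens) or tokens[pos] not in DIGITS:
        raise SyntaxError(f"digit expected at position {pos}")
    return int(tokens[pos]), pos + 1


def parse_term(tokens, pos):
    """term -> value {* value}"""
    node, pos = parse_value(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "*":
        right, pos = parse_value(tokens, pos + 1)
        node = ("*", node, right)
    return node, pos


def parse_expression(tokens, pos=0):
    """expression -> term {+ term}"""
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_term(tokens, pos + 1)
        node = ("+", node, right)
    return node, pos


def parse(text):
    """Return the syntax tree of text as nested tuples."""
    tokens = [ch for ch in text if not ch.isspace()]
    tree, pos = parse_expression(tokens)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected symbol at position {pos}")
    return tree


print(parse("3+2*5"))  # ('+', 3, ('*', 2, 5))
```

Because a term is completed before the parser ever looks for a plus sign, 2*5 is always grouped as a unit and the tree with (3+2) as a unit can never be produced.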
There may be many grammars that generate a language and any one of these is fine for generating an expression or proving an expression legal, but when it comes to parsing we need a grammar that means something. Perhaps it isn't true to say syntax is semantics, but you really have to pick the syntax that reflects the semantics.
Traveling the Tree Now that we have said the deeply unfashionable thing that syntax is not isolated from semantics, we can now see why we bother to use a grammar analyzer within a compiler. Put simply, a syntax tree or its equivalent can be used to generate the machine code or intermediate code that the expression corresponds to. The syntax tree can be considered as a program that tells you how to evaluate the expression. For example, a common method of generating code from a tree is to walk all its nodes using a “depth first” algorithm. That is, visit the deepest nodes first and generate an instruction corresponding to the value or operator stored at each node. The details of how to do this vary but you can see the general idea in this diagram:
So now you know. We use grammar to parse expressions, to make syntax trees, to generate the code.
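The depth first walk can be sketched as follows. The tree for 3+2*5 is written out by hand as nested tuples, and the instructions PUSH, ADD and MUL belong to a hypothetical stack machine invented for this example rather than to any real instruction set.

```python
OPCODES = {"+": "ADD", "*": "MUL"}


def generate(node, code=None):
    """Walk the tree depth first, emitting one instruction per node."""
    if code is None:
        code = []
    if isinstance(node, tuple):
        operator, left, right = node
        generate(left, code)            # deepest nodes first
        generate(right, code)
        code.append(OPCODES[operator])  # then the operator that joins them
    else:
        code.append(f"PUSH {node}")
    return code


def run(code):
    """Execute the generated instructions on a simple stack."""
    stack = []
    for instruction in code:
        if instruction.startswith("PUSH"):
            stack.append(int(instruction.split()[1]))
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if instruction == "ADD" else left * right)
    return stack.pop()


tree = ("+", 3, ("*", 2, 5))  # the syntax tree for 3+2*5
code = generate(tree)
print(code)       # ['PUSH 3', 'PUSH 2', 'PUSH 5', 'MUL', 'ADD']
print(run(code))  # 13
```

The multiplication is emitted before the addition simply because it sits deeper in the tree, so the grouping chosen by the grammar becomes the order of evaluation.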
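A Real Arithmetic Grammar
You might like to see what a real grammar for simple arithmetic expressions looks like:

<expression> → <expression> + <term> | <expression> - <term> | <term>
<term> → <term> * <factor> | <term> / <factor> | <factor>
<factor> → x | y | ... | <number>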
Of course, this simplified rule leaves out proper variable names, operators other than +, -, * and /, but it is a reasonable start.
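A grammar like this can also be run in the generating direction described earlier in the chapter, by repeatedly replacing non-terminals until none are left. The sketch below assumes the usual three-layer reading of the rules (expression, term, factor) and cuts a factor down to the three variables x, y and z. The depth limit is an addition of the example, needed because the rules refer to themselves and a random choice could otherwise keep expanding for a very long time.

```python
import random

# Assumed layout of the grammar: each non-terminal maps to its alternatives.
GRAMMAR = {
    "<expression>": [
        ["<expression>", "+", "<term>"],
        ["<expression>", "-", "<term>"],
        ["<term>"],
    ],
    "<term>": [
        ["<term>", "*", "<factor>"],
        ["<term>", "/", "<factor>"],
        ["<factor>"],
    ],
    "<factor>": [["x"], ["y"], ["z"]],
}


def generate(symbol, rng, depth=0, max_depth=6):
    """Expand symbol until only terminal symbols are left."""
    if symbol not in GRAMMAR:
        return symbol  # a terminal stands for itself
    alternatives = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth limit, drop the alternatives that mention the
        # symbol itself so that the expansion is forced to finish.
        alternatives = [alt for alt in alternatives if symbol not in alt]
    chosen = rng.choice(alternatives)
    return "".join(generate(part, rng, depth + 1, max_depth) for part in chosen)


rng = random.Random(2019)  # fixed seed so that runs are repeatable
for _ in range(3):
    print(generate("<expression>", rng))
```

Every string printed is a legal expression, because nothing but the rules was used to build it.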
Summary
● Grammar is not just related to the different machines that characterize the languages they describe. Grammars are generally useful in computer science.
● The most common method of describing a grammar is some variation on BNF or Backus-Naur Form. It is important to realize that there is no standard for this, but the variations are usually easy to understand.
● Extended BNF allows it to describe language forms that are specified as regular expressions.
● Often syntax diagrams are easier to understand than pure BNF.
● A grammar is often said to be to do with the syntax of a language and nothing at all to do with semantics, the meaning. This isn’t completely true.
● To be useful, a grammar has to reveal the meaning of a language and it has to be compatible with its meaning.
● With a suitable grammar you can work backwards from the sequence of symbols to the rules that created them. This is called parsing.
● There are many parsing algorithms and in general simple grammars have simple parsing methods.
● If a grammar is compatible with the meaning of a language then parsing can be used to construct a syntax tree which shows how the components group together.
● A syntax tree can have many uses, but the main one in computer science is to generate a sequence of machine instructions that correspond to the original sequence of symbols. This is how a compiler works.
Chapter 6 Numbers, Infinity and Computation
Infinity is a concept that mathematicians are supposed to understand and it is a key, if often hidden, component of computer science arguments and formal proofs. Not understanding the nature of infinity will lead you to conclude all sorts of very wrong, and even silly, results. So let’s discover what infinity is all about and how to reason with it. It isn’t so difficult and it is a lot of fun. As infinity is often said to be “just” a very big number, first we need some background on what exactly numbers are. In particular, we need to know something about what sort of numbers there are. There is a complexity hierarchy of numbers starting from the most simple, and probably the most obvious, working their way up through rationals and irrationals to finally the complex numbers. I'm not going to detour into the history of numbers, fascinating though it is, and I’m not going to go into the details of the mathematical theory. What is important is that you have an intuitive feel for each of the types of number, starting with the simple numbers that we all know.
Integers and Rationals All you really need to know is that in the beginning were the whole or natural numbers. The natural, or counting, numbers are just the ones we need to count things and in most cases these don't include zero which was quite a late invention. After all who, apart from programmers, counts zero things? The integers are what you get when you add zero and negative numbers. You need the negatives to solve problems like "What do I add to 5 to get 3?" and zero appears as a result of "What is 5-5?". You can motivate all of the different types of number by the need to solve equations of particular types. So the integers are the solution to equations like: x+a=b
where a and b are natural numbers and x is sometimes a natural number and sometimes the new sort of number - a negative integer. Notice that the definition of the new type of number, an integer, only makes use of numbers we already know about, the natural numbers.
Notice that we can immediately see that the natural numbers and the integers are infinite in the usual sense of the term – there is always another integer. You can also see that if you were asked to count the number of integers in a set you could do this. The integers are naturally countable. If you are surprised that the fact that something is countable is worthy of comment, you need to know that there are things that you cannot count – see later. Next in the number hierarchy come the fractions, the numbers which fill in all the spaces on the number line. That is, if you draw a line between two points then each point will correspond to an integer or a fraction on some measuring system. The fractional numbers are usually known as the rationals because they are a ratio of two integers. That is, the rationals are solutions of equations like: x*a=b
where again a and b are what we already know about, i.e. integers, and x is sometimes an integer and sometimes the new type of number – a rational. The definition of the new type of number once again only involves the old type of number.
A small part of the number line - to every rational number there is a point, but is the reverse true? Is there a number for every point?

Notice that if you pick any two points on the line then you can always find another fractional point between them. In fact, if you pick any two points you can find an infinity of fractions that lie between them. This is quite unlike the integers where there most certainly isn't an integer between every pair of integers - for example, what is the integer between 1 and 2? Also notice that this seems to be a very different sort of infinity to the countable infinity of the integers. It can be considered to be a "small" infinity in that there are an infinite number of rational points in any small section of the line. Compare this to the "big" infinity of the integers which extends forever. You can hold an infinite number of rationals in your hand, whereas it would be impossible to hold an infinity of integers in any container, however big. The infinity of the rationals seems to be very different from the infinity of the integers, but is it really so different?
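The claim that there is always another fraction between any two can be tried out with Python's fractions module. In this sketch the midpoint is used as the "fraction in between", which is just one convenient choice, and the function names are invented.

```python
from fractions import Fraction


def between(p, q):
    """Return a fraction lying strictly between two different fractions."""
    return (p + q) / 2


def fractions_between(p, q, count):
    """Return count different fractions that lie strictly between p and q."""
    found = []
    upper = q
    for _ in range(count):
        upper = between(p, upper)  # halve the remaining gap each time
        found.append(upper)
    return found


print(fractions_between(Fraction(1), Fraction(2), 3))
# [Fraction(3, 2), Fraction(5, 4), Fraction(9, 8)]
```

The loop can be run for as many steps as you like without ever reaching 1, which is the sense in which an infinity of fractions fits into a finite stretch of the line.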
The Irrationals
Given the integers and the fractions is there a number to every point in the number line and a point to every number? No, but this fact is far from obvious. If you disagree then you have been exposed to too much math - not in itself a bad thing but it can sometimes make it harder to see the naive viewpoint. It was followers of Pythagoras who discovered a shocking truth - that there were points on the number line that didn't correspond either to integers or fractions. The square root of two, √2, is one such point. You can find the point that corresponds to √2 simply by drawing a unit square and drawing the diagonal - now you have a line segment that is exactly √2 long and thus lines of that length exist and are easy enough to construct. This means you can find a point on the line that is √2 from the zero point quite easily. The point that is √2 exists, but does a rational number that labels it exist?
So far so much basic math but, and this is the big but, it can be proved (fairly easily) that there can be no fraction, i.e. a number of the form a/b that is √2. You don't need to know the proof to trust its results but the proof is very simple. However, skip this section if you want to.

Suppose that you can express √2 as a rational number in its lowest terms, i.e. all factors canceled out, as a/b. Then $(a/b)^2$ is 2 by definition and hence $a^2 = 2b^2$. This implies that $a^2$ is even and hence a has to be even (as odd numbers squared are still odd). Suppose a=2c for some integer c. Then $4c^2 = 2b^2$ or $b^2 = 2c^2$ and hence $b^2$ is also even and hence b is even. Thus b=2d for some integer d. Now we can write a/b=2c/2d, which means that a/b isn't in its lowest form as we can now remove a factor of 2 to get c/d. As you can reduce any fraction to its lowest form by canceling common factors there can be no a/b that squares to give 2.
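A program cannot replace the proof, because it can only ever test a finite number of fractions, but a search is a useful sanity check. The sketch below looks for whole numbers a and b with a squared equal to twice b squared, up to a chosen limit on b. The function name and the limit are arbitrary choices made for the example.

```python
from math import isqrt


def rational_roots_of_two(max_denominator):
    """Return every (a, b) with a*a == 2*b*b and 1 <= b <= max_denominator."""
    hits = []
    for b in range(1, max_denominator + 1):
        a = isqrt(2 * b * b)  # the only whole number that could possibly work
        if a * a == 2 * b * b:
            hits.append((a, b))
    return hits


print(rational_roots_of_two(10000))  # []
```

The empty list is what the proof predicts, and the proof also tells us that raising the limit will never change the result.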
What this means is that if we only have the integers and the fractions or the rationals (because they are a ratio of two integers) then there are points on the line that do not correspond to a number - and this is unacceptable. The solution was that we simply expanded what we regard as a number and the irrational numbers were added to the mix. But what exactly are irrationals and how can you write one down? This is a question that occupied mathematicians for a long time and it took a while to get it right. Today we think of an irrational number as a value with an infinite sequence of digits after the decimal point. This sequence of digits goes on forever and can never fall into a repeating pattern because if it does it is fairly easy to show that the value is a rational. That is in decimal: 0.33333333333333
repeating for ever isn't an irrational number because it repeats and it is exactly 1/3 i.e. it is rational. Notice that there is already something strange going on because while we have an intellectual solution to what an irrational is, we still can't write one down. How could you, as the digits never end and never fall into a repeating pattern? This means that in general writing an irrational down isn't going to be a practical proposition. And yet I can write down √2 and this seems to specify the number precisely. Also what sort of equations do irrationals solve? A partial answer is that they are solutions to equations like:

$x^a = b$
Again a and b are rational with b positive and x is sometimes the new class of numbers i.e. an irrational. The problem is that only a relatively small proportion of the irrationals come from the roots of polynomials as we shall see. Now we have integers like 1 or -2, fractions like 5/6 and irrationals like √2. From the point of view of computing this is where it all gets very messy, but first we are now ready to look at the subtlety of infinity.
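The link between rationals and repeating decimals mentioned above can be made concrete with long division. When dividing by a given denominator there are only finitely many possible remainders, so either a remainder of zero turns up and the digits stop, or a remainder comes round again and the digits repeat from that point on. The following sketch assumes positive whole numbers, and its function name is invented.

```python
def decimal_expansion(numerator, denominator):
    """Return (integer part, non-repeating digits, repeating digits)."""
    integer_part, remainder = divmod(numerator, denominator)
    digits = []
    seen = {}  # remainder -> position of the digit it produced
    while remainder != 0 and remainder not in seen:
        seen[remainder] = len(digits)
        digit, remainder = divmod(remainder * 10, denominator)
        digits.append(str(digit))
    if remainder == 0:
        return integer_part, "".join(digits), ""  # the expansion terminates
    start = seen[remainder]  # the digits repeat from this position onwards
    return integer_part, "".join(digits[:start]), "".join(digits[start:])


print(decimal_expansion(1, 3))   # (0, '', '3')
print(decimal_expansion(1, 6))   # (0, '1', '6')
print(decimal_expansion(22, 7))  # (3, '', '142857')
```

No such loop can exist for √2: if its digits ever repeated, it would be a ratio of two integers.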
The Number Hierarchy There is a standard mathematical procedure for defining and expanding the number system. Starting out with a set of numbers that are well defined, you look at simple equations involving those numbers. In each case you discover that some of the equations don't have a solution that is a number in the set of numbers you started with. To make these solutions available you add a new type of number to create a larger set of numbers. You repeat this until you end up with a set of numbers for which all the equations you can think up
have a solution in the same set of numbers. That is, as far as equations go you now have a complete set of numbers. This sounds almost magical and the step that seems to hold the magic is when you add something to the existing set of numbers to expand its scope. It seems almost crazy that you can simply invent a new type of number. In fact it isn't magical because it is the equation that defines all of the properties of the new numbers.

For example, you can't solve $x^2 = 2$ using nothing but rational numbers. So you invent a symbol to represent the new number √2 and you can use it in calculations as if it was an ordinary number √2*2+3/4 and you simply work with the symbol as if it was a perfectly valid number. You can even occasionally remove the new symbol using the fact that √2*√2=2 which is, of course, given by the equation that √2 solves. If you can get rid of all of the uses of the new symbol so much the better but if you can't then at the end of the calculation you replace the new symbol with a numeric approximation that gets you as close to the true answer as you need.

To summarize: In each case the equation listed is the one that leads to the next class of numbers, i.e. the one for which no solution exists in the current class of numbers. Notice that the complex numbers are complete in the sense that we don’t need to invent any more types of number to get the solutions of all equations that involve the complex numbers. The complex numbers are very important, but from our point of view we only need to understand the hierarchy up to the irrationals.
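Working with √2 as a symbol is easy to mimic in a program. In the sketch below a number is stored as a pair of rationals p and q standing for p + q√2, and the only fact about √2 that the code ever uses is that √2*√2=2. The class name Root2Number and its methods are invented for the illustration.

```python
from fractions import Fraction


class Root2Number:
    """A number of the form p + q*√2, where p and q are rational."""

    def __init__(self, p, q=0):
        self.p = Fraction(p)
        self.q = Fraction(q)

    def __add__(self, other):
        return Root2Number(self.p + other.p, self.q + other.q)

    def __mul__(self, other):
        # (p1 + q1√2)(p2 + q2√2) = (p1*p2 + 2*q1*q2) + (p1*q2 + q1*p2)√2
        return Root2Number(
            self.p * other.p + 2 * self.q * other.q,
            self.p * other.q + self.q * other.p,
        )

    def __eq__(self, other):
        return self.p == other.p and self.q == other.q

    def __repr__(self):
        return f"{self.p} + {self.q}*√2"

    def approximate(self):
        """Replace the symbol by a numeric approximation at the very end."""
        return float(self.p) + float(self.q) * 2 ** 0.5


ROOT2 = Root2Number(0, 1)

print(ROOT2 * ROOT2 == Root2Number(2))  # True: the symbol has been removed
value = ROOT2 * Root2Number(2) + Root2Number(Fraction(3, 4))  # √2*2+3/4
print(value)                # 3/4 + 2*√2
print(value.approximate())
```

All of the arithmetic is exact until approximate is called, which is the step the text describes as replacing the symbol at the end of the calculation.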
Aleph-Zero and All That
There are different orders of infinity. The usual infinity, which can be considered the smallest infinity, is usually called aleph-zero, aleph-naught or aleph-null and is written:

ℵ₀
As this is aleph-zero and not just aleph, you have probably guessed that there are others in the series and aleph-one, aleph-two and so on are all different orders of infinity, and yes there are, probably an aleph-zero of them. What is this all about? Surely there is just infinity and that's that? Certainly not, and the existence of different orders of infinity or the "transfinite" numbers is an idea that plays an important role in computer science and the way we think about algorithms in general. Even if you are not going to specialize in computer science and complexity theory, it is part of your intellectual growth to understand one of the deepest and most profound theories of the late 19th and early 20th centuries. It was the German mathematician Georg Cantor who constructed the theory of the transfinite numbers a task that was said to have driven him mad. In fact contemplation of the transfinite is a matter of extreme logic and mathematical precision rather than lunacy.
Georg Cantor 1845-1918
Unbounded Versus Infinite
Programmers meet infinity sooner and more often than the average person. OK, mathematicians meet it all the time, but perhaps not in the reality of an infinite loop.

Question: How long does an infinite loop last?
Answer: As long as you like.

This highlights a really important idea. We really don't have an infinite loop, what we have is a finite but unbounded loop. If you have read the earlier chapters this will be nothing new, but there is a sense in which it is very much the programmer’s form of infinity. Something that is finite, but unbounded, is on its way to being infinite, but it isn't actually there yet so we don’t have to be too philosophical. Consider again the difference between the following two sets:

U = the set of counting numbers less than n
N = the set of natural numbers, i.e. 0, 1, 2, 3, ...
Set U is finite for any n, but it is unbounded in that it can be as big as you like depending on n. On the other hand, N is the real thing - it is an infinite set. It is comparable to the difference between someone offering you a very big stick to hold - maybe one that reaches the orbit of the moon, the sun or perhaps the nearest star- and a stick that is really infinitely long. A stick that is finite but "as big as you like" is similar to the sort of sticks that you know. For one thing it has an end. An infinite stick doesn't have an end – it is something new and perhaps confined to the realm of theory. The same arguments hold for U and N. The set U always has a largest member but N has no largest member. Infinity is different from finite but unbounded. When you write an "infinite loop" you are really just writing a finite but unbounded loop - some day it will stop. Often finite but unbounded is all we need to prove theorems or test ideas and it avoids difficulties. For example, a finite but unbounded tape in a Turing machine does have a last cell, but an infinite tape doesn't. A Turing machine working with tape that doesn't have an end is more than just a mechanical problem – how do you move it? Indeed can it move? You can avoid this particular problem by moving the reading/writing head, which is finite, but this doesn’t solve all the conceptual difficulties of a tape that is actually infinite. In the early days of computer science there was a tendency to prefer “finite but unbounded”, but as the subject became increasingly math-based, true infinity has replaced it in most proofs. This is a shame as the programmer’s infinity is usually all that is needed.
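The difference between U and N shows up directly in code. In this sketch U(n) builds an actual finite set, while N() can only be a generator that hands out the next natural number on request. Taking the counting numbers to start at 1 is an assumption, and the function names simply echo the set names in the text.

```python
from itertools import count, islice


def U(n):
    """The finite set of counting numbers less than n."""
    return set(range(1, n))


def N():
    """A generator standing in for the natural numbers 0, 1, 2, 3, ..."""
    return count(0)


print(max(U(10)))            # 9: U always has a largest member
print(len(U(1000000)))       # 999999: as big as you like, but finite
print(list(islice(N(), 5)))  # [0, 1, 2, 3, 4]
```

Asking for max(N()) would never return, and in practice the loop behind it would run only until the machine was switched off, which makes it finite but unbounded in exactly the sense described above.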
Comparing Size
The basic idea of the size of something is that if two sets are the same size then we can put them into one-to-one correspondence. Clearly if this is the case then the two sets are of the same size as there can't be any "things" hiding in one set that make it bigger than the other set. This definition seems innocent enough, and it is, but you have to trust it and the conclusions you reach when you use it to compare infinite sets. For example, suppose you take the set Z of all integers then this is clearly an infinite set. Now take the set O of all the odd integers - this set clearly has half the number of things in it. But wait - if you set up the 1:1 correspondence between Z and O such that each n in Z is associated with 2n+1 in O, every integer is paired with exactly one odd integer and every odd integer with exactly one integer. We have to conclude that O and Z have the same number of elements - they are both sets with an infinity of elements.
This is the source of the reasonably well-known story featuring Hotel Hilbert, named in honor of the German mathematician David Hilbert. The hotel has an infinite number of rooms and on this night they are all full when a tour bus arrives with n new guests. The passengers are accommodated by the simple rule that the existing guest in room m moves to room m+n - thus clearing the first n rooms. If n were 10 then the person in room 1 would move to room 11, room 2 to room 12 and so on. Notice that in this case the pigeonhole principle doesn’t apply. For a finite number of rooms, n say, putting m>n people into the rooms would imply that at least one room held two or more people. This is not the case if the number of rooms is infinite as we can make more space.
From a programmer's point of view the question really is “how long do the new guests wait before moving into their rooms?” If the move is made simultaneously then all guests come out into the corridor in one time step and move to their new room at time step two. The new guests then move into their new accommodation in one more time step. So the move is completed in finite time.
However, if the algorithm is to move guests one per time step you can see that the move will never be completed in finite time because the guest in room 1 would have to wait for the guest in room 1+n to move into room 1+2n and so on. It takes infinite time to move infinite guests one guest at a time. Suppose a coach carrying an infinite number of new guests arrives? Can they be accommodated? Yes, but in this case the move is from room n to room 2n. This frees all odd-numbered rooms and there is an infinity of these. Once again if this is performed simultaneously it takes finite time but if the move is one guest at a time it takes an infinite time to complete.
What about Hotel Hilbert with a finite but unbounded number of rooms? In this case when a finite number of new guests appear the same move occurs but now we have to add n rooms to the hotel. OK, this might take a few months but the new guests are accommodated in finite time even if the current guests move one at a time. Finite but unbounded is much more reasonable. This argument also works if a finite but unbounded number of new guests appear. Things only go wrong when the coach turns up with an infinite number of guests. In this case it really does take infinity to build the new rooms – a task that is practically speaking never completed.
In short, you can add or subtract any number from an infinite set and you still have an infinite number, even if the number added or subtracted is infinite. Put more formally, the union of a finite or countably infinite number of countably infinite sets is still countably infinite.
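The two room-moving rules are just functions on room numbers, and it can help to see them written out. The following is a minimal Python sketch; the function names are invented for illustration and, of course, only a finite stretch of the hotel can actually be inspected.

```python
def move_for_finite_coach(room: int, n: int) -> int:
    """Guest in `room` moves up n rooms, which clears rooms 1..n."""
    return room + n


def move_for_infinite_coach(room: int) -> int:
    """Guest in `room` moves to room 2*room, which clears every odd room."""
    return 2 * room


# Look at the guests who start in the first 20 rooms.
first_guests = range(1, 21)
after_finite = [move_for_finite_coach(r, 10) for r in first_guests]
after_infinite = [move_for_infinite_coach(r) for r in first_guests]

# No two guests end up sharing a room, and the promised rooms are free.
print(min(after_finite))                        # lowest occupied room is 11
print(all(r % 2 == 0 for r in after_infinite))  # only even rooms are occupied
```

Both rules are one-to-one, which is why no guest ever has to share, however many rooms you look at.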
My favorite expression of this fact is:
Aleph-null bottles of beer on the wall,
Aleph-null bottles of beer,
You take one down, and pass it around,
Aleph-null bottles of beer on the wall.
In Search of Aleph-One
Is there an infinity bigger than the infinity of the integers? At first you might think that the rationals, i.e. numbers like a/b where a and b are integers, form a bigger set - but they don't. It is fairly easy, once you have seen how, to arrange a one-to-one assignment between integers and rational fractions. If you have two sets A and B then the Cartesian product A × B of the two sets is the set of all pairs (a,b) with a taken from A and b taken from B. If A and B are both the set of natural numbers 0,1,2,3,… you can consider all the pairs (a,b) as the co-ordinates of a point in a grid with integer co-ordinates.
Now simply start at the origin and traverse the grid in a diagonal pattern assigning a count as you go: 0->(0,0), 1->(1,0), 2->(0,1), 3->(2,0), 4->(1,1)
and so on. Clearly we have a one-to-one mapping of the natural numbers to the co-ordinate pairs and so the co-ordinate pairs have the same order of infinity as the natural numbers. This also proves that the Cartesian product, i.e. the set of all pairs, of two countably infinite sets is the same size as either of them. You can modify the proof slightly to show that the rationals a/b, with b not equal to 0, are also just the standard infinity. You can see that in fact using this enumeration we count some rationals more than once, for example 1/2 and 2/4, but skipping the repeats still leaves every rational with a place in the count. Basically, if you can discover an algorithm that counts the number of things in a set, then that set is countable and an infinite countable set has order of infinity aleph-zero. Two questions for you before moving on.
1. Can you write a for loop that enumerates the set (a,b) or equivalently give a formula i→(a,b)?
2. Why do we have to do the diagonal enumeration? Why can’t you just write two nested loops which count a from 0 to infinity and b from 0 to infinity?
See later in the chapter for answers.
What Is Bigger Than Aleph-Zero?
Consider the real numbers, that is the rational numbers plus the irrationals. The reals are the set of infinite decimal fractions - you can use other definitions and get to the same conclusions. That is, a real number is something of the form:
integer . infinite sequence of digits
e.g. 12.34567891234567...
The reals include all of the types of numbers except of course the complex numbers. If the fractional part is 0 we have the integers. If the fractional part eventually repeats we have a rational. If the fractional part doesn’t repeat we have an irrational. You also only have to consider the set of reals in the interval [0,1] to find a set that has more elements than the integers. How to prove that this is true?
Cantor invented the diagonal argument for this and it has been used ever since to prove things like Gödel's incompleteness theorem, the halting problem and more. It has become a basic tool of logic and computer science. To make things easy let's work in binary fractions. Suppose there is an enumeration that gives you all the binary fractions between 0 and 1. We don't actually care exactly what this enumeration is as long as we can use it to build up a table of numbers:
1 0.010101110111101...
2 0.101110111010111...
3 0.101101001011001...
and so on. All we are doing is writing down the enumeration i -> a binary real number for i an integer. If this enumeration exists then we have proved that there are as many reals as integers and vice versa and so the reals have an order of infinity that is aleph-zero, i.e. they are countable. If this is all true the enumeration contains all of the reals as long as you keep going. If we can find a real number that isn't in the list then we have proved that it isn't a complete enumeration and there are perhaps more reals than integers.
Now consider the following argument. Take the first bit after the binary point of the first number and start a new number with its logical Boolean NOT - that is, if the bit is 0 use 1 and if it is 1 use 0. Next take the second bit from the second number and use its NOT for the second bit of the new number. In fact, you can see that the new number is simply the logical NOT of the diagonal of the table. In this case:
s=0.110...
This new number s is not equal to the first number in the table because it has been constructed to differ in the first bit. It is not equal to the second number because it has been constructed to be different in the second bit, and so on. In fact, you can see that s differs in the nth bit after the binary point from the nth number in the table. We have no choice but to conclude that s isn't in the table and so there is a real number that isn’t in the supposedly complete enumeration.
In other words, there can be no such complete enumeration because if you present me with one I can use it to create a number that it doesn't include. You cannot count the real numbers and there are more real numbers than there are integers.
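The construction of s is mechanical enough to write down. Here is a minimal Python sketch applied to the three rows of the table above; the function name is made up for illustration, and a real table would of course have infinitely many rows and columns rather than three.

```python
def flipped_diagonal(table: list[str]) -> str:
    """Build a bit string that differs from row i in bit position i.

    Each row holds the bits after the binary point; every row must be at
    least as long as the number of rows.
    """
    return "".join("1" if row[i] == "0" else "0" for i, row in enumerate(table))


table = [
    "010101110111101",
    "101110111010111",
    "101101001011001",
]
s = flipped_diagonal(table)
print("s = 0." + s + "...")  # s = 0.110...

# s cannot equal any row, because it disagrees with row i at position i.
print(all(s[i] != row[i] for i, row in enumerate(table)))  # True
```

Whatever rows you supply, the check at the end succeeds, which is the whole point of the argument.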
Can you see why the argument fails if the fractions are not infinite sequences of bits?
Finite But Unbounded “Irrationals”
Suppose we repeat the arguments for a sequence of n digits that are finite but unbounded in n. Is the size of this set of unbounded “irrationals” aleph-one? The answer is no, the size of these irrationals is still aleph-zero. You can discover that it isn’t aleph-one by simply trying to prove that it is using the diagonal argument. For any n the table of all sequences of length n has more than n rows, so the diagonal number, which needs one digit per row, has more digits than n and hence isn’t included in the table of irrationals of size n. Thus the diagonal proof fails.
Consider the sets Sn of all “irrationals” on n digits, i.e. Sn is the set of sequences of length n. It is obvious that each set has a finite and countable number of elements. Now consider the set S, which is the union of all of the sets Sn. This is clearly countable as the union of countable sets is always countable and it is infinite – it is aleph-zero in size. You can convert this loose argument into something more precise. You can also prove it by giving an enumeration based on listing the contents of each set in order n. You have to have the set S∞, i.e. the set of infinite sequences of digits, to get to aleph-one.
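The enumeration "list the contents of each set in order n" can be sketched directly. The Python below is an illustrative sketch only: it uses bits rather than decimal digits to keep the output short, and the function names are assumptions, not anything defined in the text.

```python
from itertools import count, islice, product


def finite_bit_strings():
    """Yield every finite bit string: all of length 1, then length 2, ..."""
    for n in count(1):                      # n is finite but unbounded
        for bits in product("01", repeat=n):
            yield "".join(bits)


def position_of(target: str) -> int:
    """Return how many strings are listed before `target` appears."""
    for index, s in enumerate(finite_bit_strings()):
        if s == target:
            return index


print(list(islice(finite_bit_strings(), 6)))  # ['0', '1', '00', '01', '10', '11']
print(position_of("101"))                     # 11
```

Every finite string has a definite position in the list, so the set is countable. An infinite string never appears, because no single value of n ever produces it.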
Enumerations
To see this more clearly you only have to think a little about iteration or enumeration. This section considers this mathematics from a programmer’s viewpoint. Put simply, anything you can count out is enumerable. Integers are enumerable - just write a for loop:
for i=0 to 10
  display i
next i
which enumerates the integers 0 to 10. Even an infinite number of integers is enumerable, it’s just that the for loop would take forever to complete:
for i=0 to infinity
  display i
next i
OK, the loop would never end, but if you wait long enough you will see any integer you care to pick displayed. In this sense, the integers, all of them, are enumerable. The point is that for any particular integer you select the wait is finite. When it comes to enumerating infinity, the state of getting to whatever item we are interested in if we wait long enough is the best we can hope for. An infinite enumeration cannot be expected to end.
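The "for i=0 to infinity" loop has a direct counterpart in most languages. As a small illustration, here is a Python sketch using itertools.count as the unbounded loop; the function name steps_until is invented for this example.

```python
from itertools import count


def steps_until(target: int) -> int:
    """Run the never-ending loop until `target` shows up.

    Returns the number of iterations that were needed, which is finite
    for any particular non-negative integer you choose.
    """
    for steps, i in enumerate(count(0), start=1):
        if i == target:
            return steps


print(steps_until(0))     # 1
print(steps_until(1000))  # 1001
```

The loop itself has no end condition. It is only the choice of a particular target that makes the wait finite.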
The next question is: are the rationals enumerable? At first look the answer seems to be no because of the following little problem. Suppose you start your for loop at 0. The next integer is 1 but what is the next rational? The answer is that there isn't one. You can get as close as you like to zero by making the fraction smaller and smaller – 1/10, 1/100, 1/1000 and so on. There is no next fraction (technically the fractions are "dense") and this seems to make writing a for loop to list them impossible. However, this isn't quite the point. We don't need to enumerate the fractions in any particular order, we just need to make sure that as long as the loop keeps going every rational number will eventually be constructed. So what about:
for n=0 to infinity
  for i=n to 0 step -1
    display (n-i)/i
  next i
next n
This may seem like a strange transformation but it corresponds to the “zig zag” enumeration of the pairs (a,b) given earlier, in which the grid is traversed one diagonal at a time.
You can see that n selects a diagonal of the grid, the one that meets the axis at column n. When n is 4, say, we need to enumerate (0,4), (1,3), (2,2), (3,1) and (4,0) and you should be able to confirm that this is what the nested for loops do, reading each pair (a,b) as the fraction a/b. Also ignore the fact that we are listing improper rationals of the form a/0 - you cannot divide by zero, so these pairs are simply skipped. You should be able to see that if you wait long enough every fraction will eventually be displayed and in addition all of the possible combinations of (a,b) will eventually be produced. We can agree that even an infinite set is enumerable as long as you are guaranteed to see every element if you wait long enough and so the rationals are enumerable. Notice that we have transformed two infinite loops into a single infinite loop by adopting a zig-zag order. This idea generalizes to n infinite loops, but it is more complicated.
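The zig-zag loops translate almost line for line into a running program. The following Python sketch is one way to do it, under a few stated assumptions: it covers only the non-negative rationals, it skips the pairs with a zero denominator, and it also skips repeats such as 2/4, which the pseudocode does not bother to do. The function names are invented for illustration.

```python
from fractions import Fraction
from itertools import count, islice


def pairs():
    """Zig-zag over the grid: diagonal n holds the pairs (a, b) with a + b = n."""
    for n in count(0):
        for i in range(n, -1, -1):      # i = n, n-1, ..., 0
            yield (n - i, i)


def rationals():
    """Yield each non-negative rational a/b exactly once."""
    seen = set()                        # grows without bound as the loop runs
    for a, b in pairs():
        if b == 0:
            continue                    # a/0 is not a number
        q = Fraction(a, b)
        if q not in seen:
            seen.add(q)
            yield q


print(list(islice(pairs(), 6)))
# [(0, 0), (0, 1), (1, 0), (0, 2), (1, 1), (2, 0)]
print([str(q) for q in islice(rationals(), 6)])
# ['0', '1', '1/2', '2', '1/3', '3']
```

Two unbounded loops have become one unbounded loop over n with a bounded loop inside it, which is why every pair is reached after a finite wait.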
Enumerating the Irrationals
Fractions, and hence the rationals, are enumerable. The next question is - are the irrationals enumerable? At first sight it seems not, by the same argument that there is no "next" irrational number, but as we have seen this argument fails with the rationals because they can be transformed to a zig-zag order and hence they are enumerable. However, first we have another problem to solve - how do we write down the irrationals? It is clear that you cannot write them down as anything like a/b, so how do you do it? After lots of attempts at trying to find a way of representing irrational numbers the simplest way of doing the job is to say that an irrational number is represented exactly by an infinite sequence of digits, for example, 3.14159…, where the … means "goes on for ever". You can see that a single irrational number is composed of an enumerable sequence of digits that can be produced by:
for i=0 to infinity
  display the ith digit of √2
next i
You should probably begin to expect something different given that now each number is itself an enumerable sequence of digits, whereas the integers and the rationals are simply enumerable as a set. What about the entire set of irrational numbers – is it enumerable? This is quite a different problem because now you have a nested set of for loops that both go on to infinity:
for i=0 to infinity
  for j=0 to infinity
    display jth digit of the ith irrational
  next j
next i
The problem here is that you never get to the end of the inner loop so the outer loop never steps on. In other words, no matter what the logic of the program looks like, it is, in fact, functionally equivalent to:
i=0
for j=0 to infinity
  display jth digit of the ith irrational
next j
The original nested loops only appear to be an enumeration because they look like the finite case. In fact, the program only ever enumerates the first irrational number you gave it. It’s quite a different sort of abstraction to the enumeration of the rationals and, while it isn't a proof that you can't do it, it does suggest what the problem might be. As the inner loop never finishes, the outer loop never moves on to another value of i and there are huge areas of the program that are logically reachable but are never reached because of the infinite inner loop.
It is as if you can have one infinite for loop and see results, all the results if you wait forever - but a pair of nested infinite loops is non-computable because there are results you will never see, no matter how long you wait. The point is that you can't enumerate a sequence of infinite sequences. You never finish enumerating the first one, so you never get to the second. As already stated, this isn't a proof but you can construct a proof out of it to show that there is no effective enumeration of the irrationals. However, this isn't about proof, it's about seeing why an infinite sequence of infinite sequences is different from just an infinite sequence and you can derive Cantor’s diagonal argument from it. Now let’s return to the alephs.
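You can watch the outer loop starve by running the nested loops for a while and recording which values of i were ever used. The Python sketch below does this. Note that digit(i, j) is a purely hypothetical stand-in for "the jth digit of the ith sequence", made up so that there is something to display. For contrast, the same list of sequences is also walked in zig-zag order.

```python
from itertools import count, islice


def digit(i: int, j: int) -> int:
    """Hypothetical stand-in for the jth digit of the ith sequence."""
    return (i + j) % 10


def nested():
    """Two nested unbounded loops, exactly as in the pseudocode."""
    for i in count(0):
        for j in count(0):              # never finishes, so i never advances
            yield (i, j, digit(i, j))


def zig_zag():
    """Walk the same (i, j) grid one diagonal at a time."""
    for n in count(0):
        for i in range(n + 1):
            yield (i, n - i, digit(i, n - i))


rows_nested = {i for i, _, _ in islice(nested(), 1000)}
rows_zig_zag = {i for i, _, _ in islice(zig_zag(), 1000)}
print(rows_nested)        # {0}
print(len(rows_zig_zag))  # 44
```

The zig-zag walk does reach every digit of every sequence in the list it is given, but it still never finishes displaying any single sequence, and it presupposes that the sequences have already been arranged in a list indexed by i. The diagonal argument says that no such list can contain all of the reals.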
Aleph-One and Beyond
The size of the real numbers is called aleph-one and it is the size of the continuum – that model of space where between each point there is not only another point but an infinity of points. (Strictly, the size of the continuum is 2 to the power aleph-zero, and identifying it with aleph-one, the next infinity after aleph-zero, assumes the continuum hypothesis described at the end of this section.) Aleph-zero is the sort of infinity we usually denote by ∞ and so aleph-one is bigger than our standard sort of countable infinity. This all seems a little shocking - we now have two infinities. You can show that aleph-one behaves in the same way as aleph-zero in the sense that you can take lots of elements away, even aleph-one elements, and there can still be aleph-one elements left.
After the previous discussion, you can also see why the reals are not effectively enumerable and lead to a new order of infinity – they need two infinite loops. Of course, the problem with these loops is that the inner loop never ends so the outer one never gets to step on to the next real number in the enumeration. This raises the question of what we did to get from aleph-zero to aleph-one and can we repeat it to get to aleph-two? The answer is yes. If two sets A and B are aleph-zero in size then we already know that the usual set operations give results that also have aleph-zero elements. For example, A ∪ B, the union of A and B, is everything in either set - often said as an infinity plus an infinity is the same infinity - and A × B, the Cartesian product, is all of the pairs obtained by taking one element from A and one from B.
However, there is another operation that we haven't considered - the power set. If you have a set of elements A, then the power set of A, usually written 2^A, is the set of all subsets of A including the empty set and A itself. So if A={a,b} the power set is {∅,{a},{b},A}. You can see why it is called the power set as a set with two things in it gives rise to a power set with four things, i.e. 2^2=4. This is generally true and if A has n elements its power set has 2^n elements.
Notice that this is a much bigger increase than other set operations, i.e. the union of A with a second, disjoint, set of n elements has 2n elements and A × A has n^2 elements.
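For a finite set you can simply build the power set and count. The following Python sketch does this with itertools; the helper name power_set is an assumption of this example, not a library function.

```python
from itertools import chain, combinations, product


def power_set(items):
    """Return every subset of `items`, from the empty set up to the whole set."""
    items = list(items)
    subsets = chain.from_iterable(
        combinations(items, size) for size in range(len(items) + 1)
    )
    return [set(subset) for subset in subsets]


print(len(power_set({"a", "b"})))  # 4

# Compare the three ways of growing a set with n = 5 elements.
A = {0, 1, 2, 3, 4}
B = {10, 11, 12, 13, 14}           # a second set, disjoint from A
print(len(A | B))                  # 10 = 2n
print(len(list(product(A, A))))    # 25 = n^2
print(len(power_set(A)))           # 32 = 2^n
```

Try increasing n and the power set quickly becomes too large to build, long before the union or the Cartesian product cause any trouble.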
The power set really does seem to change gear on the increase in the number of elements. So much so that if A has aleph-zero elements then it can be proved that 2^A, i.e. the power set, has more elements than A - aleph-one elements in the terminology used here. And, yes, if A has aleph-one elements its power set has aleph-two elements, and so on.
● In general if A has aleph-n elements, the power set 2^A has aleph-(n+1) elements (strictly, this is the generalized continuum hypothesis).
There is an infinity of orders of infinity. Notice also that the reals, R, are related to the power set of the integers, just as the rationals, Q, are related to the Cartesian product of the integers, i.e. R=2^Z and Q=Z × Z with suitable definitions and technical adjustments. The set of alephs is called the transfinite numbers and it is said that this is what sent Cantor crazy but you should be able to see that it is perfectly logical and there is no madness within. So finally, are there aleph-zero, i.e. a countable infinity, of transfinite numbers or are there more? This is a hard question and much hinges on the answer to the question "is there a set with an order of infinity between aleph-zero and the size of the reals". That there isn't is the so-called continuum hypothesis, and it is the first of Hilbert's 23 problems. Gödel and Cohen showed that it can be neither proved nor disproved from the standard axioms of set theory, and here we get into some very deep mathematics.
Not Enough Programs!
Now we come to the final knockout blow for computation and it is based on the infinities. How many programs are there? Any program is simply a sequence of bits and any sequence of bits can be read as a binary number, a binary integer to be exact. Thus, given any program, you have a binary number corresponding to it. Equally, given any binary number you can regard it as a program - it might not do anything useful, but that doesn't spoil the principle:
● Every program is an integer and every integer is a program.
This correspondence is an example of a Gödel numbering and it is often encountered in computer science. This means that the set of programs is the same as the set of integers and we realize immediately that the set of programs is enumerable. All I have to do is write a for loop that counts from 0 to 2^n - 1 and I have all the programs that can be stored in n bits. Even if you let this process go to infinity the number of programs is enumerable. It is infinite, but it is enumerable and so there are aleph-zero programs. This is a very strange idea. The Gödel numbering implies that if you start generating such a numbering eventually you will reach a number that is Windows or Linux or Word or whatever program you care to name. This has occasionally been suggested as a programming methodology, but only as a joke.
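The correspondence between programs and integers is easy to try out. The Python sketch below is one possible numbering, chosen for simplicity: it reads the UTF-8 bytes of the source text as one big binary integer. The function names are invented for illustration, and the scheme has a known wrinkle, in that leading zero bytes are lost on the way back.

```python
def program_to_int(source: str) -> int:
    """Read the bytes of a program's text as a single binary integer."""
    return int.from_bytes(source.encode("utf-8"), "big")


def int_to_program(number: int) -> str:
    """Turn an integer back into text; most integers give nonsense programs."""
    length = (number.bit_length() + 7) // 8
    return number.to_bytes(length, "big").decode("utf-8", errors="replace")


def all_programs(n: int) -> list[str]:
    """Every bit pattern that fits in n bits, i.e. the loop from 0 to 2^n - 1."""
    return [format(i, f"0{n}b") for i in range(2 ** n)]


code = "print(1)"
number = program_to_int(code)
print(number)                  # one (large) integer stands for the program
print(int_to_program(number))  # print(1)
print(all_programs(2))         # ['00', '01', '10', '11']
```

Counting upwards from zero therefore visits every program text eventually, although almost all of the integers along the way decode to something that will not run.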
You can repeat the argument with Turing machines rather than programs. Each Turing machine can be represented as a tape to be fed into a universal Turing machine. The tape is a sequence of symbols that can be binary coded and read as a single number. We have another Gödel numbering and hence the set of Turing machines is enumerable and if we allow the tape to be finite but unbounded there are aleph-zero Turing machines.
The important point is that the set of irrational numbers isn't enumerable, it is aleph-one in size, which means there are more irrational numbers than there are programs. So what? Think about the task of computing each possible number. That is, imagine each number has a program that computes it and displays it. You can do it for the integers and you can do it for the rationals, but you can't do it for the irrationals as there are more irrational numbers than there are programs, so many, in fact most, irrational numbers don't have programs that compute them. We touched on this idea in Chapter 3, but now we have the benefit of knowing about the orders of infinity. Numbers like √2, π and e clearly do have programs that compute them, so not all irrational numbers are non-computable - but the vast majority are non-computable. To express it another way:
● Most of the irrational numbers do not have programs that compute them.
What does this mean? Well, if you think about it, a program is a relatively short, regular construct and if it generates an irrational number then somehow the information in that number must be about the same as the information in the program that generates it. That is, computable numbers are regular in some complicated sense, but a non-computable number is so irregular that you can't compress its structure into a program. This leads on to the study of algorithmic information theory which is another interesting area of computer science full of strange ideas and even stranger conclusions, see Chapter 7.
Notice that you can’t get around this conclusion by somehow extending the way you count programs. You can’t, for example, say there are an aleph-zero of programs that run on computer A and another aleph-zero that run on computer B so we have more programs. This is wrong because aleph-zero plus aleph-zero programs is still just aleph-zero programs. To avoid the problem you have to invent a computer that works with real numbers as a program so that there are aleph-one of them – and you can probably see the circular argument here.
Not Enough Formulas!
If you are now thinking that programs to compute numbers aren't that important, let's take one final rephrasing of the idea. Consider a mathematical formula. It is a finite sequence of n symbols taken from an enumerable set of symbols and as such the set of all formulas is itself enumerable. Which means that there are more irrational numbers than there are mathematical formulas. In other words, if you want to be dramatic, most of the numbers in mathematics are out of the reach of mathematics.
There are some subtle points that I have to mention to avoid the mistake of inventing "solutions" to the problem. Surely you can write an equation like x=a where a is an irrational number and so every irrational number has its equation in this trivial sense. The problem here is that you cannot write this equation down without writing an infinite sequence of digits, i.e. the numeric specification for a. Mathematical formulas can't include explicit irrational numbers unless they are assigned a symbol like π or e and these irrationals, the ones we give a symbol to, are exactly the ones that have formulas that define them. That is, the computable irrational numbers are the ones we often give names to. They are the ones that have specific properties that enable us to reason about them. For example, I can use the symbol √2 to stand as the name of an irrational number, but this is more than a name. For example, I can do exact arithmetic with the symbol:
(√2(1+√2))^2 = 2(1+2√2+√2√2) = 2(1+2√2+2) = 2(3+2√2) = 6+4√2
We repeatedly use the fact that √2√2=2 and it is this property, and the ability to compute an approximation to √2, that makes the symbol useful. This is not the case for irrational numbers in general and we have no hope of making it so. That is, a typical irrational number requires an infinite sequence of symbols to define it.
For a few we can replace this by a finite symbol, like √2, that labels its properties and this allows it to be used in computation and provides a way of approximating the infinite sequence to any finite number of places. Also notice that if you allow irrational numbers within a program or a formula then the set of programs or formulas is no longer enumerable. What this means is that if you extend what you mean by computing to include irrational numbers within programs, then the irrationals become computable. So far this idea hasn't proved particularly useful because practical computation is based on enumerable sets.
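The exact arithmetic with the symbol √2 can be mechanized using nothing but the rule √2√2=2. The Python sketch below represents a number of the form a + b√2 by the pair (a, b); the class name QSqrt2 and its layout are assumptions made for this example. It reproduces the calculation (√2(1+√2))^2 = 6+4√2 without ever computing a digit of √2.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class QSqrt2:
    """An exact number a + b*sqrt(2) with rational a and b."""
    a: Fraction  # rational part
    b: Fraction  # coefficient of sqrt(2)

    def __add__(self, other: "QSqrt2") -> "QSqrt2":
        return QSqrt2(self.a + other.a, self.b + other.b)

    def __mul__(self, other: "QSqrt2") -> "QSqrt2":
        # (a + b√2)(c + d√2) = (ac + 2bd) + (ad + bc)√2, using √2√2 = 2
        return QSqrt2(
            self.a * other.a + 2 * self.b * other.b,
            self.a * other.b + self.b * other.a,
        )

    def __str__(self) -> str:
        return f"{self.a}+{self.b}√2"


ONE = QSqrt2(Fraction(1), Fraction(0))
ROOT2 = QSqrt2(Fraction(0), Fraction(1))

x = ROOT2 * (ONE + ROOT2)   # √2(1+√2) = 2+√2
print(x * x)                # 6+4√2
```

The symbol is useful precisely because a finite rule captures everything the arithmetic needs, which is not something a typical irrational number offers.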
Transcendental and Algebraic Irrationals
This lack of programs to compute all the numbers also spills over into classical math. Long before we started to worry about computability mathematicians had run into the problem. The first irrational numbers that were found were roots of simple polynomials - polynomials with rational coefficients. If you think about it for a moment you will see that a polynomial is a sort of program for working something out and finding the roots of a polynomial is also an algorithmic procedure. A polynomial is something like:
ax^n + bx^(n-1) + cx^(n-2) + ...
For example:
2x^4 + (1/2)x^3 + 8x^2 + 9x + 13
is a polynomial of degree 4. A few minutes thinking should convince you that there are only an enumerable set of polynomials. You can encode a polynomial as a sequence (a,b,c,...) where the a, b, c are the coefficients of the powers of x. The length of the sequence is finite but unbounded. As in the earlier argument, the set of such sequences of fixed length n is countable and hence the union of all of these sets is countable and hence there are aleph-zero polynomials. If you don’t like this argument write a program that enumerates them but you have to use a zig-zag enumeration like the one used for the rationals.
As the set of polynomials is countable, and a polynomial of degree n has at most n distinct roots, the set of all roots is also countable and hence the set of irrationals that are roots of polynomials is enumerable. That is, there are aleph-zero irrationals that are the roots of polynomials. Such irrational numbers are called algebraic numbers, they are enumerable and as such they are a "small" subset of all of the irrationals, which is aleph-one and hence much bigger. That is, most irrationals are not algebraic numbers. We call this bulk of the irrational numbers transcendental numbers and they are not the roots of any finite polynomial with rational coefficients. So what are transcendental numbers?
Most of them don't have names, but the few that do are well known to us - π and e are perhaps the best known. The transcendental numbers that we can work with have programs that allow us to compute the digit sequence to any length we care to. In this sense the non-computable numbers are a subset of the transcendental numbers. That is, in this sense the computable numbers are the integers, the rationals, the algebraic irrationals and the transcendentals that are "regular enough" to have ways of computing approximations.
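The suggested program that enumerates the polynomials can be sketched as follows. This Python version makes two simplifying assumptions that are not in the text: it uses integer coefficients, which lose no roots because any polynomial with rational coefficients can be multiplied through by its denominators, and it orders the polynomials by a "height", here taken to be the degree plus the sum of the absolute values of the coefficients. Only finitely many polynomials share a height, so working through the heights in turn is a zig-zag in disguise.

```python
from itertools import count, islice, product


def polynomials():
    """Yield coefficient tuples (leading coefficient first), height by height."""
    for height in count(1):
        for degree in range(height):
            budget = height - degree    # what the |coefficients| must add up to
            for coeffs in product(range(-budget, budget + 1), repeat=degree + 1):
                leading_ok = coeffs[0] != 0
                if leading_ok and sum(abs(c) for c in coeffs) == budget:
                    yield coeffs


print(list(islice(polynomials(), 6)))
# [(-1,), (1,), (-2,), (2,), (-1, 0), (1, 0)]

# x^2 - 2, whose roots are ±√2, turns up after a finite wait.
x_squared_minus_2 = (1, 0, -2)
print(x_squared_minus_2 in islice(polynomials(), 1000))  # True
```

Listing the roots of each polynomial as it appears would then give an enumeration of the algebraic numbers, with repeats.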
The situation is very strange when you try to think about it in terms of computational complexity. The algebraic numbers are the roots of polynomials and in this sense they are easily computable numbers. The transcendentals are not the roots of polynomials and hence a harder class of things to compute. Indeed most of them are non-computable, but some of them are computable, and some, like π, are fairly easy to compute.
π - the Special Transcendental
π is the most fascinating of transcendental numbers because it has so many ways of computing its digits. What formulas can you find that give a good approximation to π? Notice that an infinite series for π gives an increasingly accurate result as you compute more terms of the series. So if you want the 56th digit of π you have to compute all the digits up to the 56th as well as the digit of interest - or do you? There is a remarkable formula - the Bailey–Borwein–Plouffe formula - which can provide the nth binary digit of π without having to compute any others. Interestingly, this is the original definition of a computable number given by Turing – there exists a program that prints the nth digit of the number after a finite time.
The digits of π cannot be random - we compute them rather than throw dice for them - but they are pseudo random. It is generally said that if you enumerate π for long enough then you will eventually find every number you care to specify and if you use a numeric code you will eventually find any text you specify. So π contains the complete works of Shakespeare, or any other text you care to mention; the complete theory of QFT; and the theory of life the universe and everything. However, this hasn't been proved. A number in which every finite sequence of digits occurs, and occurs as often as you would expect by chance, is called normal, and we have never proved that π is normal. All of this is pure math, but it is also computer science because it is about information, computational complexity and more – see the next chapter.
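As an illustration of digit extraction, here is a Python sketch of the usual way the Bailey–Borwein–Plouffe series, π = Σ 16^(-k) [4/(8k+1) - 2/(8k+4) - 1/(8k+5) - 1/(8k+6)], is used. It produces hexadecimal digits, each of which is four binary digits. The function names are invented for this example, and because it relies on ordinary floating point it should only be trusted for modest values of n.

```python
def _partial_sum(j: int, n: int) -> float:
    """Fractional part of the sum over k of 16^(n-k) / (8k + j)."""
    total = 0.0
    for k in range(n + 1):
        denom = 8 * k + j
        # Modular exponentiation keeps only the part that affects the fraction.
        total = (total + pow(16, n - k, denom) / denom) % 1.0
    k = n + 1
    while True:                          # rapidly shrinking tail of the series
        term = 16.0 ** (n - k) / (8 * k + j)
        if term < 1e-17:
            break
        total += term
        k += 1
    return total % 1.0


def pi_hex_digit(n: int) -> str:
    """Hex digit of π at position n + 1 after the point, found on its own."""
    x = (4 * _partial_sum(1, n) - 2 * _partial_sum(4, n)
         - _partial_sum(5, n) - _partial_sum(6, n)) % 1.0
    return "0123456789abcdef"[int(16 * x)]


# π in hexadecimal starts 3.243f6a88...
print("".join(pi_hex_digit(n) for n in range(8)))  # 243f6a88
```

Each call starts from scratch, so asking for a later digit does not require any of the earlier digits to have been produced first.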
Summary
● There is a hierarchy of number types starting from the counting numbers, expanding to the integers, the rationals, the irrationals and finally the complex numbers. Each expansion is due to the need to solve equations that can be written in one number type but don’t have a solution in that number type.
● Two sets are the same size if it is possible to match their elements up one to one.
● The size of the set of integers – the usual meaning of infinity – is aleph-zero. The rationals can be put into one-to-one correspondence with the integers and so there are aleph-zero rationals.
● To find a set with more elements than aleph-zero you need to move to the irrationals. The irrationals cannot be enumerated and there are aleph-one of them.
● The union of sets of size aleph-zero is a set of size aleph-zero. The Cartesian product of sets of size aleph-zero is a set of size aleph-zero. It is only when you take the power set 2^A that you get a set of size aleph-one.
● The number of programs or equivalent Turing machines is aleph-zero. Therefore there are not enough programs to compute all of the irrationals – the majority are therefore non-computable numbers.
● Similarly there are not enough formulas for all the irrationals and in this sense they are beyond the reach of mathematics.
● The irrationals that are roots of polynomials are called algebraic. However, as there are only aleph-zero polynomials, most of the irrationals are not the roots of polynomials and these are called transcendental.
● The majority of transcendentals are non-computable but some, including numbers like π, are not only computable, but seem to be very easy to compute.
Chapter 7 Kolmogorov Complexity and Randomness
At the end of the previous chapter we met the idea that if you want to generate an infinite anything then you have to use a finite program and this seems to imply that whatever it is that is being computed has to have some regularity. This idea is difficult to make precise, but that doesn’t mean it isn’t an important one. It might be the most important idea in the whole of computer science because it explains the relationship between the finite and the infinite and it makes clear what randomness is all about.
Algorithmic Complexity
Suppose I give you a string like 111111... which goes on for one hundred ones in the same way. The length of the string is 100 characters, but you can write a short program that generates it very easily:
repeat 100
  print "1"
end repeat
Now consider the string "231048322087232.." and so on for one hundred digits. This is supposed to be a random string. It isn't truly random, because I typed it in by hitting number keys as best I could, but even so you would be hard pressed to create a program to print it that is shorter than the string itself. In other words, there is no way to specify this random-looking string other than to quote it. This observation of the difference between these two strings is what leads to the idea of Kolmogorov, or algorithmic, complexity. The first string can be generated by a program with roughly 30 characters, and so you could say it has 30 bytes of information, but the second string needs a program of at least one hundred characters to quote the number as a literal and so it has 100 bytes of information. You can already see that this is a nice idea, but it has problems. Clearly the number of bytes needed for a program that prints one hundred ones isn't a well-defined number - it depends on the programming language you use. However, for any given programming language we can define the Kolmogorov complexity as the length of the smallest program that generates the string in question.
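As a rough illustration of the comparison between the two strings, the sketch below writes each "program" as a line of Python source and compares its length with the string it prints. Python is an assumption here, as are the names `short_program`, `literal_program` and `run`. The digit string is made up by a seeded generator purely to stand in for the typed digits, which, ironically, means it does have a short description of its own.

```python
import contextlib
import io
import random

# The regular string and a short program (as source text) that prints it.
ones = "1" * 100
short_program = 'print("1" * 100)'

# A stand-in for the typed random-looking digits (illustrative only),
# and a program that can do no better than quote them as a literal.
rng = random.Random(7)
digits = "".join(rng.choice("0123456789") for _ in range(100))
literal_program = 'print("' + digits + '")'

def run(program: str) -> str:
    """Execute program text and return what it printed, minus the newline."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(program, {})
    return buffer.getvalue().rstrip("\n")

for name, program, target in [("ones", short_program, ones),
                              ("digits", literal_program, digits)]:
    assert run(program) == target  # each program really generates its string
    print(name, "string length:", len(target), "program length:", len(program))
```

The exact program lengths depend on the language, which is the problem the text goes on to discuss. What matters is that one program is much shorter than its output and the other is not.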
Andrey Kolmogorov was a Russian mathematician credited with developing this theory of information but it was based on a theorem of Ray Solomonoff which was later rediscovered by Gregory Chaitin - hence the theory is often called Solomonoff-Kolmogorov-Chaitin complexity theory.
Andrey Nikolaevich Kolmogorov 1903-1987
Obviously one way around the problem that the measure of complexity depends on the programming language is to use the size of a Turing machine that generates the sequence, but even this can result in slightly different answers depending on the exact definition of the Turing machine. However, in practice the Turing machine description is the one preferred. So complexity is defined with respect to a given description language – often a Turing machine. The fact that you cannot get an exact absolute measure of Kolmogorov complexity is irritating but not a real problem as any two measures can be shown to differ by a constant. The Kolmogorov complexity of a string is just the size of the smallest program that generates it. For infinite strings things are a little more interesting because, if you don't have a program that will generate the string, you essentially don't have the string in any practical sense. That is, without a program that generates the digits of an infinite sequence you can't actually define the string. This is also the connection between irrational numbers and non-computable numbers. As explained in the previous chapter, an irrational number is an infinite sequence of digits. For example: 2.31048322087232 ... where the ... means carry on forever. Some irrationals have programs that generate them and as such their Kolmogorov complexity is a lot less than infinity.
However, as there are only a countable number of programs and there are an uncountable number of irrationals – see the previous chapter - there have to be a lot of irrational numbers that don't have programs that generate them and hence that have an infinite Kolmogorov complexity. Put simply, there aren't enough programs to compute all of the irrationals and hence most irrationals have an infinite Kolmogorov complexity. To be precise, there is an aleph-zero, or a countable infinity, of irrational numbers that have Kolmogorov complexity less than infinity and an aleph-one, or an uncountable infinity of them, that have a Kolmogorov complexity of infinity. A key theorem in algorithmic complexity is:
● There are strings that have arbitrarily large Kolmogorov complexity.
If this were not so we could generate the aleph-one set of infinite strings using an aleph-zero set of programs. The irrationals that have a smaller than infinite Kolmogorov complexity are very special, but there are an infinity of these too. In a sense these are the "nice" irrationals - numbers like π and e - that have interesting programs that compute them to any desired precision. How would you count the numbers that have a less than infinite Kolmogorov complexity? Simple: just enumerate all of the programs you can create by listing their machine code as a binary number. Not all of these programs would generate a number, indeed most of them wouldn't do anything useful, but among this aleph-zero of programs you would find the aleph-zero of "nice" irrational numbers. Notice that included among the nice irrationals are some transcendentals. A transcendental is a number that isn't the root of any finite polynomial equation. Any number that is the root of a finite polynomial is called algebraic. Clearly, for a number that is the root of a finite polynomial, i.e. not transcendental but algebraic, you can specify it by writing a program that solves the polynomial.
For example, √2 is an irrational, but it is algebraic and so it has a program that generates it and hence it’s a nice irrational.
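To make the √2 example concrete, here is a minimal sketch of a finite program that produces as many digits of √2 as you ask for. The function name `sqrt2_digits` and the use of exact integer square roots are assumptions chosen for illustration, not a method given in the text.

```python
from math import isqrt

def sqrt2_digits(places: int) -> str:
    """Return the square root of 2 truncated to `places` decimal places."""
    # floor(sqrt(2) * 10**places), computed exactly with integer arithmetic
    scaled = isqrt(2 * 10 ** (2 * places))
    text = str(scaled)
    return text[0] + "." + text[1:]

print(sqrt2_digits(30))
```

The program is a few lines long however many digits are requested, which is the sense in which the Kolmogorov complexity of √2 is far less than infinity.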
Kolmogorov Complexity Is Not Computable
The second difficulty inherent in the measure of Kolmogorov complexity is that, given a random-looking string, you can't really be sure that there isn't a simple program that generates it. This situation is slightly worse than it seems because you can prove that the Kolmogorov complexity of a string is itself a non-computable function. That is, there is no program (or function) that can take a string and output its Kolmogorov complexity. The proof of this, in the long tradition of such non-computable proofs, is proof by contradiction.
If you assume that you have a program that will work out the Kolmogorov complexity of any old string, then you can use this to generate the string using a smaller program and hence you get a contradiction. To see that this is so, suppose we have a function Kcomplex(S) which will return the Kolmogorov complexity of any string S. Now suppose we use this in an algorithm:
for I = 1 to infinity
  for all strings S of length I
    if Kcomplex(S) > K then return S
You can see the intent: for each length of string, test each string until one is found whose complexity is greater than K, and there seems to be nothing wrong with this. Now suppose that the size of this program is N; then any string it generates has a Kolmogorov complexity of N or less. If you set K to N or any value larger than N you immediately see the problem. Suppose the algorithm returns S with a complexity greater than K, but S has just been produced by an algorithm that is N in size and hence string S has a complexity of N or less - a contradiction. This is similar to the reasonably well known Berry paradox: “The smallest positive integer that cannot be defined in fewer than twenty English words” Consider the statement to be a program that specifies the number n which is indeed the smallest positive integer that cannot be defined in fewer than twenty English words. Then it cannot be the number as it has just been defined in 14 words. Notice that once again we have a paradox courtesy of self-reference. However, this isn't to say that all is lost.
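The search used in the argument can be sketched in Python to show its structure. Since no real Kcomplex can exist, the `kcomplex` parameter below is a hypothetical stand-in that the caller must supply, and the name `first_string_above` is likewise an assumption. The usage line passes `len`, which is only a crude placeholder and not a complexity measure.

```python
from itertools import count, product
from typing import Callable

def first_string_above(k: int, kcomplex: Callable[[str], int]) -> str:
    """Return the first binary string, in length order, that kcomplex rates above k.

    kcomplex is a stand-in: the argument in the text shows that no program
    can compute the true Kolmogorov complexity.
    """
    for length in count(1):                        # for I = 1 to infinity
        for bits in product("01", repeat=length):  # for all strings S of length I
            s = "".join(bits)
            if kcomplex(s) > k:
                return s

# With string length as the placeholder measure the search halts quickly.
print(first_string_above(3, len))
```

The contradiction arises only if `kcomplex` were the true complexity: this fixed-size search would then output a string that supposedly needs a larger program than the search itself.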
Compressibility
You can estimate the Kolmogorov complexity for any string fairly easily. If a string is of length L and you run it through a compression algorithm you get a representation which is L-C in length where C is the number of characters removed in the compression. You can see that this compression factor is related to the length of a program that generates the string, i.e. you can generate the string from a description that is only L-C characters plus the decompression program. Any string which cannot be compressed by even one character is called incompressible. There have to be incompressible strings because, by another counting argument, there are $2^n$ strings of length n but only $2^n - 1$ strings of length less than n. That is, there aren't enough shorter strings to represent all of the strings of length n. Again we can go further and prove that most strings aren't significantly compressible, i.e. the majority of strings have a high Kolmogorov complexity.
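A quick way to try the compression estimate is with Python's standard `zlib` module. This is only a sketch: the compressed size is an upper-bound style estimate that ignores the size of the decompressor, and the names `compressed_size` and `shorter_strings`, as well as the seeded noise bytes, are assumptions for illustration.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed form of data."""
    return len(zlib.compress(data, 9))

def shorter_strings(n: int) -> int:
    """Number of binary strings of length less than n (lengths 0 to n-1)."""
    return sum(2 ** k for k in range(n))

ones = b"1" * 100
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(100))  # random-looking bytes

print("ones :", len(ones), "->", compressed_size(ones))
print("noise:", len(noise), "->", compressed_size(noise))

# The counting argument: fewer short strings than strings of length n.
print(shorter_strings(10), "strings shorter than 10 bits versus", 2 ** 10)
```

The regular string shrinks to a handful of bytes, while the random-looking bytes do not shrink and pick up a little overhead from the compression format.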
The theory says that if you pick a string of length n at random then the probability that it is not compressible by c is at least $1 - 2^{1-c} + 2^{-n}$. Plotting this probability against c for a string length of 50, you can see at once that most strings are fairly incompressible once you reach a c of about 5.
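In place of a plot, the numbers can be tabulated. The sketch below reads the expression above as 1 - 2^(1-c) + 2^(-n), the standard counting lower bound on the chance that a random string of length n cannot be compressed by c. That reading, and the function name `incompressible_bound`, are assumptions.

```python
def incompressible_bound(n: int, c: int) -> float:
    """Lower bound on the probability that a random n-bit string
    cannot be compressed by c bits: 1 - 2**(1 - c) + 2**(-n)."""
    return 1 - 2.0 ** (1 - c) + 2.0 ** (-n)

# Tabulate the bound for strings of length 50.
for c in range(1, 11):
    print(c, round(incompressible_bound(50, c), 4))
```

By c = 5 the bound is already above 0.93, so the great majority of strings of length 50 cannot be shortened by as much as five bits.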
Armed with all of these ideas you can see that a string can be defined as algorithmically random if there is no program shorter than it that can generate it - and most strings are algorithmically random.
Random and Pseudo Random
The model of computation that we have been examining up to this point is that of the mechanistic universe. Our computations are just what the universe does. The planets orbit the sun according to the law of gravity and we can take this law and create a computer program that simulates this phenomenon. In principle, we can perform the simulation to any accuracy that we care to select. The planets orbiting the sun and the computational model will be arbitrarily close to one another – closer as the accuracy increases. It all seems to be a matter of decimal places and how many we want our computations to work with. This gives the universe a mechanistic feel. The planets are where they are today because of where they were yesterday and where they were 1000 years earlier and so on back to the start of the universe. If you could give the position of every atom at time zero then you could run the model or the real universe and everything would be where it is today. This is the model of the universe that dominated science until quite recently and was accepted despite it containing some disturbing ideas. If the universe really is a deterministic machine there can be no “free will”. You cannot choose to do something because it’s all a matter of the computation determining what happens next. The idea of absolute determinism reduces the universe to nothing but a playing out of a set of events that were set to happen by the initial conditions and the laws that govern their time development. In this universe there is no room for random.
Randomness and Ignorance
The standard idea of probability and randomness is linked with the idea that there is no way that we can predict what is going to happen even in principle. We have already mentioned the archetypal example of supposed randomness – the toss of a coin. We summarize the situation by saying that there is a 50% chance of it landing on either of its sides as if its dynamics were controlled by nothing but randomness. Of course this is not the case – the coin obeys precise laws that in principle allow you to predict exactly which side it will land on. You can take the initial conditions of the coin and how it is about to be thrown into the air and use the equations of Newtonian mechanics to predict exactly how it will land on the table. In principle, this can be done with certainty and there is no need to invoke randomness or probability to explain the behavior of the coin. Of course, in practice we cannot know all of the initial conditions and so a probability model is used to describe certain features of the exact dynamics of the system. That is, the coin tends to come down equally on both sides unless there is something in its makeup that forces it to come down more on one side than the other for a range of initial conditions. You need to think about this for a little while, but it soon becomes clear that a probability model is an excuse for what we don’t know – but still very useful nonetheless. What we do is make up for our lack of deterministic certainty by summarizing the dynamics by the observed frequency that we get any particular result. For example, in the case of the coin we could throw it 100 times and count the number of heads and the number of tails. For a fair coin we would expect this to be around 50% of each outcome. This is usually summarized by saying that the probability of getting heads is 0.5. Of course, after 100 throws it is unlikely that we would observe exactly 50 heads and this is allowed for in the theory.
What we expect, however, is that as the number of throws increases the proportion gets closer and closer to 50% and the deviations away from 50% become smaller and smaller. It takes quite a lot of math, look up The Law of Large Numbers, to make these intuitions exact but this is the general idea. You can, if you want to, stop the theory at this point and just live with the idea that some systems are too complex to predict and the relative frequency of each outcome is all you can really work with. However, you can also move beyond observed frequencies by making arguments about symmetry. You can argue that as one side of a coin is like the other, and the throwing of the coin shows no preference for one side or another, then by symmetry you would expect 50% of each outcome. Another way of saying this is that the physics is indifferent to the coin’s sides and the side of a coin cannot affect the dynamics. Compare this to a coin with a small something stuck to one side. Now there isn’t the symmetry we used before and the dynamics does take account of the side of the coin. Hence we cannot in this case conclude that the relative frequency of each side is 50%.
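The way the proportion of heads settles down as the number of throws grows can be watched in a small simulation. This is only a sketch: the coin is simulated by Python's seeded generator, so it is itself deterministic, and the name `heads_fraction`, the seed and the bias value 0.6 are assumptions for illustration.

```python
import random

def heads_fraction(throws: int, p_heads: float = 0.5, seed: int = 1) -> float:
    """Fraction of heads in a run of simulated coin throws."""
    rng = random.Random(seed)  # seeded, so the run is repeatable
    heads = sum(rng.random() < p_heads for _ in range(throws))
    return heads / throws

for throws in (100, 10_000, 1_000_000):
    print(throws, "fair:", heads_fraction(throws),
          "weighted:", heads_fraction(throws, p_heads=0.6))
```

For the fair coin the fraction drifts toward 0.5 as the run gets longer, while the weighted coin, like the coin with something stuck to one side, settles somewhere else.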
Pseudo Random
When you work with probabilities it is all too easy to lose sight of the fact that the phenomena you are describing aren’t really random – they are deterministic but beyond your ability to model accurately. Computers are often used to generate random numbers, and at first this seems to be a contradiction. How can a program generate random numbers when its dynamics are completely determined – more determined than the toss of a coin, say? We usually reserve the term pseudo random for numbers generated in this way but they are no more pseudo random than the majority of other numbers that we regard as truly random. Notice that as pseudo random numbers are generated by a program much smaller than the generated sequence they are most certainly not algorithmically random in the sense of Kolmogorov. It is also true that an algorithmically random sequence might not be so useful as a sequence of random numbers because it might have irregularities that allow the prediction of the next value. For example an algorithmically random sequence might have more sixes say than any other digit. It would still be algorithmically random because this doesn’t imply that you can write a shorter program to generate it. What is important about pseudo random numbers is that they are not predictable in a statistical sense, rather than there being no smaller program that generates the sequence. Statistical prediction is based on detecting patterns that occur more often than they should for a random sequence. In the example given above, the occurrence of the digit six more often than any other digit would mean that guessing that the next digit was six would be correct more often than chance. When we use the term pseudo random we mean that in principle all digits occur with equal frequencies, all pairs of digits occur with equal frequencies, all triples of digits occur with equal frequencies and so on.
A pseudo random sequence that satisfies this set of conditions allows no possibility of predicting what might come next as every combination is equally likely. In practice, pseudo random numbers don’t meet this strict condition and we tend to adopt a less stringent requirement depending on what the numbers are going to be used for. The key idea is that the sequence should not give a supposed adversary a good chance of guessing what comes next. For example, pseudo random numbers used in a game can be of relatively low quality as the player has little chance of analyzing the sequence and so finding a weakness. Pseudo random numbers needed for cryptography have to be of a much higher quality as you can imagine that a nation state might put all of the computer power it has into finding a pattern. Pseudo random numbers are number sequences that satisfy a condition of non-predictability given specified resources. In this sense what is pseudo random is a relative term.
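A tiny generator makes the point that a short deterministic program can produce digits that look statistically even. The sketch below uses a linear congruential generator, a technique chosen here for illustration and not one named in the text, with the constants published in Numerical Recipes. The name `lcg_digits` is an assumption, and a generator of this kind is of the low, game-level quality described above, not suitable for cryptography.

```python
from collections import Counter
from itertools import islice

def lcg_digits(seed: int):
    """Yield decimal digits from a linear congruential generator."""
    a, c, m = 1664525, 1013904223, 2 ** 32  # constants from Numerical Recipes
    state = seed % m
    while True:
        state = (a * state + c) % m
        yield state * 10 // m  # take the digit from the high-order part

first = list(islice(lcg_digits(42), 10_000))
again = list(islice(lcg_digits(42), 10_000))

print("same seed, same sequence:", first == again)
print("digit counts:", sorted(Counter(first).items()))
```

The sequence is completely determined by the seed, and the whole program is far shorter than its output, so it cannot be algorithmically random, yet a simple frequency count gives an observer little to go on.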
True Random
If you have followed this reasoning you should be happy to accept the fact that there is no random, only pseudo random. What would a truly random event look like? It would have to differ from deterministic randomness by not being deterministic at all. In other words, there would have to be no theory that could predict the outcome, even in principle. A fairly recent discovery, chaos, is often proposed as an example of something that can stand in for randomness. The idea is that systems exist that are so sensitive to their initial conditions that predicting their behavior rapidly becomes impractical. If the throw of a coin were a chaotic system then the behavior of the coin would be very different for starting states that differed by very little. If you knew the initial state of the coin and this allowed you to predict that it would come down heads then, for a chaotic system, a tiny change in that initial state would make it come down tails. Given that you cannot in practice know the initial state perfectly then the chaotic dynamics gives you no choice but to resort to probabilities. A chaotic system however is still not in principle unpredictable. In this case adding more-and-more information still makes the predictions more accurate. This is not true randomness even though it is very interesting at a practical and theoretical level. The question of whether or not there are truly random processes, i.e. ones that are even in principle not predictable, is considered open for some people, and answered by quantum mechanics for others. What is so special about quantum mechanics? The answer is that it has probability built into it. The equations of quantum mechanics don’t predict the exact state of a system, only the probability that it will be in any one of a number of states. If a coin was governed by quantum mechanics then all that the equations could say is that there is a 50% chance of it being heads and 50% chance of it being tails.
Even given all of the information about the quantum coin and its initial state, the dynamics of the system is inherently probabilistic. If quantum mechanics is correct, to paraphrase Einstein, “God does play dice with the universe”. If you don’t find this disturbing, and Einstein certainly did, consider that when the quantum coin lands heads or tails you cannot ask about the deterministic sequence of events that led to the outcome. If there was such a sequence, it could have been used to predict the outcome and probability wouldn’t enter into it. To say that a coin lands heads without anything on the way to the outcome actually determining this is very strange. For a real coin we think that the initial throw, the air currents and how far it travels before landing determine its final configuration, but this cannot be so for a quantum coin. A quantum event does not have an explanation only a probability.
Some find this so disturbing that they simply postulate that, like our supposedly random physical coin, we are simply missing some information in the quantum case as well. However, to date no complete theory has been shown to give the same predictions as standard quantum mechanics and so far these predictions have been correct. So, if you want true randomness you need to use a quantum system? No, in most cases this isn’t necessary; pseudo random is more than good enough. It is interesting to note that even a supposedly quantum-based random number generator is difficult to create as it is all too easy for the system to be biased by non-quantum effects. For example, a random number generator based on radiation can be biased by non-quantum noise in the detector. This idea that there is no causal sequence leading up to the outcome is very similar to the idea of algorithmic randomness. There is no program smaller than the sequence that generates it and, in a sense, this means that there is no deterministic explanation for it. The program is the deterministic sequence, an explanation if you like for the infinite sequence it generates. To see this, consider how we could create an algorithmic random number generator. We would need to write a finite program that generated the sequence but, as the sequence is algorithmically random, there is no such program. You could write an infinite program, but this would be equivalent to simply quoting the digits in the sequence. But how can you quote an infinite sequence of digits? How can you determine the next digit? There seems to be no possibility of a causal way of selecting the next digit because if there was you could use it to construct a shorter program. In this sense algorithmically random sequences are not deterministic. Perhaps this is the key to the difference between pseudo random and physically random numbers. This leads us on to the subject of the next chapter, “the axiom of choice”.
The annoying thing about Kolmogorov complexity is that it is easy to understand and yet in any given case you really can't measure it in any absolute sense. Even so, it seems to say a lot about the nature of the physical world and our attempts to describe it. You will notice that at the core of this understanding is the idea of a program that generates the sequence of behavior. It is tempting to speculate that many of the problems we have with, say, modern physics are due to there simply not being enough programs to go around or, what amounts to the same thing, too many random sequences.
Summary
● The algorithmic, or Kolmogorov, complexity of a string of symbols is the size of the smallest program that will generate it.
● Kolmogorov complexity clearly depends on the machine used to host the program, but it is defined up to a constant which allows for the variation in implementation.
● Most irrationals don’t have programs that generate them and hence their Kolmogorov complexity is infinite.
● Kolmogorov complexity isn’t computable in the sense that there isn’t a single function or Turing machine that will return the complexity of an arbitrary string.
● A C-compressible string can be reduced by C symbols by a compression program.
● A string that cannot be reduced by even one symbol is said to be incompressible. Such strings have to exist by a simple counting principle. This means that the majority of strings have a high Kolmogorov complexity.
● A string with a high Kolmogorov complexity is algorithmically random.
● Most random numbers are pseudo random in that they are theoretically predictable if not practically predictable. This includes examples of systems that are usually considered to be truly random, such as the toss of a coin. Clearly, with enough data, you can predict which face the coin will come down on.
● The only example of true randomness is provided by quantum mechanics where the randomness is built into the theory – there is no way of predicting the outcome with more information.
Chapter 8 Algorithm of Choice
After looking at some advanced math concerning infinity, you can start to appreciate that algorithms, or programs, have a lot to do with math. The fact that the number of programs is smaller than the number of numbers gives rise to many mathematical phenomena. One particularly exotic mathematical idea, known as the axiom of choice, isn’t often discussed in computer science circles and yet it probably deserves to be better known. However, this said, if you want to stay close to a standard account of computer science, you are free to skip this chapter. Before we get started, I need to say that this isn't a rigorous mathematical exposition of the axiom of choice. What it is attempting to do is to give the idea of the "problem" to a non-mathematician, i.e. an average programmer. I also need to add that while there is nothing much about the axiom of choice that a practical programmer needs to know to actually get on with programming, it does have connections with computer science and computability. The axiom of choice may not be a particularly practical sort of mathematical concern, but it is fascinating and it is controversial.
Zermelo and Set Theory
The axiom of choice was introduced by Ernst Zermelo in 1904 to prove what seems a very reasonable theorem.
Ernst Zermelo 1871-1953
In set theory you often find that theorems that seem to state the obvious actually turn out to be very difficult to prove. In this case the idea that needed a proof was the "obvious" fact that every set can be well ordered. That is, there is an order relation that can be applied to the elements so that every non-empty subset has a least element under the order. This is Zermelo's well-ordering theorem. To prove that this is the case Zermelo had to invent the axiom of choice. It now forms one of the axioms of Zermelo-Fraenkel set theory which is, roughly speaking, the theory that most would recognize as standard set theory. There are a lot of mathematical results that depend on the axiom of choice, but notice it is an axiom and not a deduction. That is, despite attempts to prove it from simpler axioms, no proof has ever been produced. Mathematicians generally distinguish between Zermelo-Fraenkel set theory without the axiom of choice and a bigger theory where more things can be proved with the axiom of choice.
The Axiom of Choice
So exactly what is the axiom of choice? It turns out to be surprisingly simple: "The axiom of choice allows you to select one element from each set in a collection of sets" - yes, it really is that simple. A countable collection of sets is just an enumeration of sets Si for a range of values of i. The axiom of choice says that for each and every i you can pick an element from the set Si. It is so obvious that it hardly seems worth stating as an axiom, but it has a hidden depth. Another way of formulating the axiom of choice is to say that for any collection of sets there exists a choice function f which selects one element from each set, i.e. f(Si) is an element of Si. Notice that if you have a collection of sets that comes with a natural choice function then you don't need the axiom of choice. The fact you have an explicit choice function means that you have a mechanism for picking one element for each set. The axiom of choice is a sort of reassurance that even if you don't have an explicit choice function one exists - you just don't know what it is. Another way to look at this is that the axiom of choice says that a collection of sets for which there is no choice function doesn't exist.
To Infinity and...
So where is this hidden depth that makes this obvious axiom so controversial? The answer is, as it nearly always is in these cases, infinity. If you have a finite collection of sets then you can prove that there is a choice function by induction and you don't need the axiom of choice as it is a theorem of standard set theory.
Things start to get a little strange when we work with infinite collections of sets. In this case, even if the collection is a countable infinity of sets, you cannot prove that there is a choice function for any arbitrary collection and hence you do need the axiom of choice. That is, to make set theory carry on working you have to assume that there is a choice function in all cases. In the case where you have a non-countable infinity of sets then things are more obviously difficult. Some non-countable collections do have obvious choice functions and others don't. The only way to see what the difficulties are is to look at some simple examples. First consider the collection of all closed finite intervals of the reals, i.e. intervals like the set of points x satisfying a ≤ x ≤ b or [a,b] in the usual notation. Notice that the interval a to b includes its end-points, i.e. a and b are in the set due to the use of less-than-or-equals. This is an infinite and uncountable collection of sets, but there is an obvious choice function. All you have to do is define F([a,b]) as the mid point of the interval. Given you have found a choice function there is no need to invoke the axiom of choice. Now consider a collection of sets that sounds innocent enough - the collection of all subsets of the reals. It is clear that you can't use the mid-point of each subset as a choice function because not every subset would have a mid-point. You need to think of this collection of subsets as including sets that are arbitrary collections of points. Any rule that you invent for picking a point is bound to encounter sets that don't have a point with that property (note: this isn't a rigorous argument). You can even go a little further. If you invent a choice function, simply consider the sets that don’t have that point – they must exist as this is the set of all subsets! Now you can start to see the difficulty in supplying a choice function.
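Going back to the closed intervals for a moment, the explicit choice function F([a,b]) can be written down directly, which is exactly what it means not to need the axiom. This is a minimal sketch in which an interval is represented as a pair of floats and the names `Interval` and `midpoint_choice` are assumptions for illustration.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (a, b) stands for the closed interval [a, b]

def midpoint_choice(interval: Interval) -> float:
    """Explicit choice function for closed intervals: pick the mid point."""
    a, b = interval
    if a > b:
        raise ValueError("not a valid closed interval")
    return (a + b) / 2

for interval in [(0.0, 1.0), (-3.0, 5.0), (2.5, 2.5)]:
    x = midpoint_choice(interval)
    assert interval[0] <= x <= interval[1]  # the chosen point lies in the set
    print(interval, "->", x)
```

One rule covers every interval in the uncountable collection. No comparable rule is available for arbitrary subsets of the reals, which is where the difficulty described above begins.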
No matter what algorithmic formulation you invent, some sub-sets just won’t have a point that satisfies the specification. You cannot, for example, use the smallest value in the sub-set to pick a point because open sets, sets like a