Numerical Analysis: An Introduction 9783110573329, 9783110573305

Numerical analysis deals with the development and analysis of algorithms for scientific computing, and is in itself a ve


English Pages 193 [194] Year 2019

Table of contents :
Preface
Software
Acknowledgment
Support
Contents
1. Computer representation of numbers and roundoff error
2. Solving linear systems of equations
3. Least squares problems
4. Finite difference methods
5. Solving nonlinear equations
6. Eigenvalues and eigenvectors
7. Interpolation
8. Numerical integration
9. Initial value problems
10. Partial differential equations
Index


Numerical Analysis: An Introduction
9783110573329, 9783110573305

Author / Uploaded
Timo Heister
Leo G. Rebholz
Fei Xue



Timo Heister, Leo G. Rebholz, and Fei Xue

Numerical Analysis: An Introduction

Mathematics Subject Classification 2010
65-01, 65F05, 65F35, 65H10, 65G50, 65L10, 65L12, 65L05

Authors
Prof. Dr. Timo Heister, University of Utah, Department of Mathematics, 155 S 1400 E, Room 233, Salt Lake City, UT 84112-0090, USA
Prof. Dr. Leo G. Rebholz, Clemson University, School of Mathematical and Statistical Sciences, Clemson, SC 29634, USA
Prof. Dr. Fei Xue, Clemson University, School of Mathematical and Statistical Sciences, Clemson, SC 29634, USA

ISBN 978-3-11-057330-5
e-ISBN (PDF) 978-3-11-057332-9
e-ISBN (EPUB) 978-3-11-057333-6

Library of Congress Control Number: 2018963420

Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.dnb.de.

© 2019 Walter de Gruyter GmbH, Berlin/Boston
Cover image: Taken from [RVX19]; it is a benchmark problem from [A] that depicts a reconstruction of a basilar artery with an aneurysm at the terminal bifurcation.
[RVX19] L. Rebholz, A. Viguerie and M. Xiao, Efficient nonlinear iteration schemes based on algebraic splitting for the incompressible Navier–Stokes equations, Mathematics of Computation, https://doi.org/10.1090/mcom/3411, 2019.
[A] Aneurisk–Team. AneuriskWeb project website: http://ecm2.mathcs.emory.edu/aneuriskweb
Typesetting: VTeX UAB, Lithuania
Printing and binding: CPI books GmbH, Leck
www.degruyter.com

Author LR dedicates this book to Professor Abhay Gaur for his fantastic support and guidance many years ago during undergraduate and early graduate study. Without him, LR surely would not have chosen this wonderful career path. Author TH dedicates this book to Wolfgang for being his mentor, cycling buddy, and good friend. Author FX dedicates this book to Dandan for her extraordinary patience, understanding, and full-fledged support in daily life.

Preface

This book is intended for math majors interested in learning tools for solving differential equations, eigenvalue problems, linear systems of equations, and nonlinear systems of equations, and for performing curve fitting. The major topics for a first course in scientific computing are covered, and we emphasize fundamental ideas, simple codes, and mathematical proofs. We do not get bogged down with low level details of the fifty different methods there are to solve a particular problem. Instead, we try to give an undergraduate math major an idea of what scientific computing and numerical analysis are all about, along with exposure to the fundamental ideas of the proofs.

Some key questions we aim to address are:
– How do we best approximate mathematical processes/operations that cannot be exactly represented on a computer?
– How accurate are our approximations?
– How efficient are our approximations?

It should be no surprise that we want to quantify accuracy as much as possible. Moreover, when a method fails, we want to know why it fails. In this book, we will see how to mathematically analyze the accuracy of many numerical methods. Concerning efficiency, we can never have the answer fast enough, but often there is a trade-off between speed and accuracy. Hence, we also analyze efficiency so that we can “choose wisely” when selecting an algorithm.

Thus, to put it succinctly, the purpose of this course is to:
– Introduce students to some basic numerical methods and how to use them on a computer.
– Analyze these methods for accuracy and efficiency.
– Implement these methods and use them to solve problems.

We assume a knowledge of calculus, differential equations, and linear algebra, and also that students have some programming experience (e.g., an introductory computer programming course) and have used MATLAB in at least some capacity.

This book is related to a sophomore engineering course in scientific computing written by the authors and published in 2015 by De Gruyter [1]. Some parts of this book are very similar, including roundoff error. But most of the chapters have been expanded to include more in-depth discussion and mathematical theory wherever possible, and there is a significant amount of new material.

Lastly, despite countless hours of effort, there are bound to be remaining typos and mistakes. We would greatly appreciate it if users of this book would point them out to us. We would also appreciate any other constructive comments and criticisms regarding the presentation of the material.

[1] T. Heister and L. Rebholz, Scientific Computing for Scientists and Engineers, De Gruyter (Berlin), 2015.

https://doi.org/10.1515/9783110573329-201

Software

Algorithms given in the text are written in the language of MATLAB and Octave. Currently at Clemson, all students have free access to MATLAB. Octave is free software that is largely compatible with MATLAB and has almost all of the same functionality. Newer versions of MATLAB have more bells and whistles, but for the purposes of this book, either MATLAB or Octave can be used. We have created a website for the codes used in this book, where all MATLAB/Octave codes from the text can be downloaded: http://www.math.clemson.edu/~rebholz/nabook/

https://doi.org/10.1515/9783110573329-202

Acknowledgment

We wish to thank Vince Ervin for valuable comments and suggestions. Also, we thank the students who used draft versions of this text and helped to find many small errors and typos. In particular, Sarah Kelly, Allison Miller, Rianna Recchia, and especially Claire Evans helped to clean up the text.

https://doi.org/10.1515/9783110573329-203

Support

The authors thank the National Science Foundation for partial support of their research, and indirectly in the writing of this book, through the grants DMS-1522191, DMS-1719461, DMS-1819097, the Computational Infrastructure in Geodynamics initiative (CIG), through EAR-0949446 and the University of California – Davis, and Technical Data Analysis, Inc. through US Navy SBIR N16A-T003.

https://doi.org/10.1515/9783110573329-204

Contents

Preface
Software
Acknowledgment
Support

1 Computer representation of numbers and roundoff error
  1.1 Examples of the effects of roundoff error
  1.2 Binary numbers
  1.3 64 bit floating-point numbers
    1.3.1 Adding large and small numbers is bad
    1.3.2 Subtracting two nearly equal numbers is bad
  1.4 Exercises

2 Solving linear systems of equations
  2.1 Solving triangular linear systems
  2.2 From Gaussian elimination to LU factorization
  2.3 Pivoting
    2.3.1 Work in LU/GE
  2.4 Direct methods for large sparse linear systems
  2.5 Conjugate gradient
  2.6 Preconditioning and ILU
  2.7 Accuracy in solving linear systems
    2.7.1 Matrix and vector norms and condition number
    2.7.2 Condition number of a matrix
    2.7.3 Sensitivity in linear system solving
    2.7.4 Error and residual in linear system solving
  2.8 Exercises

3 Least squares problems
  3.1 Solving LSQ problems with the normal equations
  3.2 QR factorization and the Gram–Schmidt process
  3.3 Curve of best fit
  3.4 Exercises

4 Finite difference methods
  4.1 Convergence terminology
  4.2 Approximating the first derivative
    4.2.1 Forward and backward differences
    4.2.2 Centered difference
    4.2.3 Three point difference formulas
    4.2.4 Further notes
  4.3 Approximating the second derivative
  4.4 Application: Initial value ODEs using the forward Euler method
  4.5 Application: Boundary value ODEs
  4.6 Exercises

5 Solving nonlinear equations
  5.1 Convergence criteria of iterative methods for nonlinear systems
  5.2 The bisection method
  5.3 Fixed-point theory and algorithms
  5.4 Newton’s method
  5.5 Secant method
  5.6 Comparing bisection, Newton, Secant method
  5.7 Combining secant and bisection and the fzero command
  5.8 Equation solving in higher dimensions
  5.9 Exercises

6 Eigenvalues and eigenvectors
  6.1 Theoretical background
  6.2 Single-vector iterations
  6.3 Multiple-vector iterations
  6.4 Finding all eigenvalues and eigenvectors of a matrix
  6.5 Exercises

7 Interpolation
  7.1 Interpolation by a single polynomial
    7.1.1 Lagrange interpolation
  7.2 Chebyshev interpolation
  7.3 Piecewise linear interpolation
  7.4 Piecewise cubic interpolation (cubic spline)
  7.5 Exercises

8 Numerical integration
  8.1 Preliminaries
  8.2 Newton–Cotes rules
  8.3 Composite rules
  8.4 Clenshaw–Curtis quadrature
  8.5 Gauss quadrature
  8.6 MATLAB’s integral function
  8.7 Exercise

9 Initial value problems
  9.1 Reduction of higher order IVPs to first order
  9.2 The forward Euler method
  9.3 Heun’s method and RK4
  9.4 Stiff problems and numerical stability
    9.4.1 Implicit methods and unconditional stability
  9.5 Summary, general strategy, and MATLAB ODE solvers
  9.6 Fitting ODE parameters to data
  9.7 Exercises

10 Partial differential equations
  10.1 The 1D heat equation
    10.1.1 A maximum principle and numerical stability
  10.2 The 2D Poisson equation
  10.3 Exercises

Index

1 Computer representation of numbers and roundoff error

In this chapter, we will introduce the notion and consequences of a finite number system. Each number in a computer must be physically stored. Therefore, a computer can only hold a finite number of digits for any number. Decades of research and experimentation have led us to a (usually) reasonable approximation of numbers by representing them with (about) sixteen digits in base 10. While this approximation may seem at first to be reasonable, we will explore its consequences in this chapter. In particular, we will discuss how to avoid mistakes that can arise from using a finite number system. The representation of numbers is not specific to a program or programming language like MATLAB; it is part of the hardware of the processors of virtually every computer.

1.1 Examples of the effects of roundoff error

To motivate the need to study computer representation of numbers, let us first consider some examples taken from MATLAB (we note that the same thing happens in C, Java, etc.):

1. The order in which you add numbers on a computer makes a difference!

    >> 1 + 1e-16 + 1e-16 + 1e-16 + 1e-16 + 1e-16 + 1e-16 + 1e-16
    ans =
         1
    >> 1e-16 + 1e-16 + 1e-16 + 1e-16 + 1e-16 + 1e-16 + 1e-16 + 1
    ans =
       1.000000000000001

Note: AAAeBBB is a common notation for a floating-point number with the value $\text{AAA} \times 10^{\text{BBB}}$. So 1e-16 = $10^{-16}$. As we will see later in this chapter, the computer stores about 16 base 10 digits for each number; this means we get 15 digits after the first nonzero digit of a number. Hence, if you try to add 1e-16 to 1, there is nowhere for the computer to store the 1e-16 since it is the 17th digit of a number starting with 1. It does not matter how many times you add 1e-16; it just gets lost in each intermediate step, since operations are always done from left to right. So even if we add 1e-16 to 1, 10 times in a row, we get back exactly 1. However, if we first add the 1e-16’s together, then add the 1, these small numbers get a chance to combine to become big enough not to be lost when added to 1.

2. Consider

$$f(x) = \frac{e^x - e^{-x}}{x}.$$

Suppose we wish to calculate

$$\lim_{x \to 0} f(x).$$

By L’Hôpital’s rule, we can easily determine the answer to be 2. However, how might one do this on a computer? A limit is an infinite process, and moreover, it requires some analysis to get an answer. Hence on a computer one is seemingly left with the option of choosing small x’s and plugging them into f. Table 1.1 shows what we get back from MATLAB by doing so.

Table 1.1: Unstable limit computation.

    x            (e^x − e^{−x}) / x
    10^{-6}      1.999999999946489
    10^{-7}      1.999999998947288
    10^{-8}      1.999999987845058
    10^{-9}      2.000000054458440
    10^{-10}     2.000000165480742
    10^{-11}     2.000000165480742
    10^{-12}     2.000066778862220
    10^{-13}     1.999511667349907
    10^{-14}     1.998401444325282
    10^{-15}     2.109423746787797
    10^{-16}     1.110223024625157
    10^{-17}     0

Moreover, if we choose x any smaller than 1e-17, we still get 0. The main numerical issue here is, as we will learn, that subtracting two nearly equal numbers on a computer is bad and can lead to large errors. Interestingly, if we create the limit table using a mathematically equivalent expression for f(x), we can get a much better answer. Recall (i.e., this is a fact which we hope you have seen before) that the exponential function is defined by

$$e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \frac{x^5}{5!} + \cdots.$$

Using this definition, we calculate

$$\frac{e^x - e^{-x}}{x} = \frac{2x + 2\frac{x^3}{3!} + 2\frac{x^5}{5!} + 2\frac{x^7}{7!} + \cdots}{x} = 2 + 2\frac{x^2}{3!} + 2\frac{x^4}{5!} + 2\frac{x^6}{7!} + \cdots.$$

Notice that if $|x| < 10^{-4}$, then $2\frac{x^4}{5!} < 10^{-16}$ and, therefore, this term (and all the ones after it in the sum) will not affect the calculation of the sum in any of the first 16 digits. Hence, we have that, in MATLAB,

$$\frac{e^x - e^{-x}}{x} = 2 + \frac{x^2}{3}$$

provided $|x| < 10^{-4}$. We can expect this approximation of f(x) to be accurate to 16 digits in base 10. Recalculating a limit table based on this mathematically equivalent expression provides much better results, as can be seen in Table 1.2, which shows the numerical limit is clearly 2.

Table 1.2: Stable limit computation.

    x            2 + x^2/3
    10^{-6}      2.000000000000334
    10^{-7}      2.000000000000004
    10^{-8}      2.000000000000000
    10^{-9}      2.000000000000000
    10^{-10}     2.000000000000000
    10^{-11}     2.000000000000000
    10^{-12}     2.000000000000000
    10^{-13}     2.000000000000000
    10^{-14}     2.000000000000000
    10^{-15}     2.000000000000000
    10^{-16}     2.000000000000000
    10^{-17}     2.000000000000000
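The two limit tables can be regenerated with a few lines of MATLAB/Octave. The following is a minimal sketch rather than the authors' code; the variable names `p`, `x`, `f_naive`, and `f_series` are assumptions chosen for illustration, and the exact digits printed may differ slightly between systems.

```matlab
% Compare the naive formula with the truncated series for small x.
p = (6:17)';                          % exponents 6, 7, ..., 17
x = 10.^(-p);                         % x = 1e-6, 1e-7, ..., 1e-17

f_naive  = (exp(x) - exp(-x)) ./ x;   % subtracts two nearly equal numbers
f_series = 2 + x.^2 / 3;              % mathematically equivalent for |x| < 1e-4

format long
disp([x, f_naive, f_series]);         % columns: x, unstable value, stable value
```

The middle column should drift away from 2 and eventually collapse to 0, while the last column stays at 2 to all displayed digits.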

3. We learn in Calculus II that the integral

$$\int_1^{\infty} \frac{1}{x}\,dx = \infty.$$

Note that since this is a decreasing, concave up function, approximating it with the left rectangle rule gives an over-approximation of the integral. That is,

$$\sum_{n=1}^{\infty} \frac{1}{n} > \int_1^{\infty} \frac{1}{x}\,dx,$$

and so we have that

$$\sum_{n=1}^{\infty} \frac{1}{n} = \infty.$$

But if we calculate this sum in MATLAB, we converge to a finite number instead of infinity. As mentioned above, the computer can only store (about) 16 digits. Since the running sum will be greater than 1, any number smaller than 1e-16 will be lost due to roundoff error. Thus, if we add in order from left to right, we get that

$$\sum_{n=1}^{\infty} \frac{1}{n} = (\text{in MATLAB}) \sum_{n=1}^{10^{16}} \frac{1}{n} < \infty.$$

Hence numerical error has caused a sum which should be infinite to be finite.
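Summing about $10^{16}$ terms in double precision is not practical, but the same stagnation can be observed quickly in 32-bit single precision. The sketch below is an illustration under that assumption (single precision is not what the text uses, and the variable names are hypothetical): it adds terms of the harmonic series from left to right until adding the next term no longer changes the running sum.

```matlab
% Harmonic sum in SINGLE precision (an assumption, chosen so that the
% sum stagnates after a few million terms instead of roughly 1e16).
s = single(0);      % running sum
n = 0;              % index of the last term tried
while true
    n = n + 1;
    t = s + single(1) / single(n);
    if t == s       % the new term was lost entirely to rounding
        break
    end
    s = t;
end
fprintf('sum stopped growing at n = %d, value %.6f\n', n, s);
```

The loop terminates with a finite value even though the exact series diverges, which is the effect described in the third example.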

1.2 Binary numbers

Computers and software allow us to work in base 10, but behind the scenes everything is done in base 2. This is because numbers are stored in computer memory (essentially) as “voltage on” (1) or “voltage off” (0). Hence, it is natural to represent numbers in their base 2, or binary, representation. To explain this, let us start with the base 10, or decimal, number system. In base 10, the number 12.625 can be expanded into powers of 10, each multiplied by a coefficient:

$$12.625 = 1 \times 10^1 + 2 \times 10^0 + 6 \times 10^{-1} + 2 \times 10^{-2} + 5 \times 10^{-3}.$$

It should be intuitive that the coefficients of the powers of 10 must be digits between 0 and 9. Also, the decimal point goes between the coefficients of $10^0$ and $10^{-1}$. Base 2 numbers work in an analogous fashion. First, note that it only makes sense to have digits of 0 and 1, for the same reason that digits in base 10 must be 0 through 9. Also, the decimal point goes between the coefficients of $2^0$ and $2^{-1}$. Hence in base 2 we have, for example, that

$$(11.001)_{\text{base2}} = 1 \times 2^1 + 1 \times 2^0 + 0 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3} = 2 + 1 + \frac{1}{8} = 3.125.$$
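The expansion of $(11.001)_{\text{base2}}$ into powers of 2 can be checked numerically. This is a small illustrative sketch; the names `bits`, `powers`, and `value` are assumptions, not notation from the text.

```matlab
% Expand (11.001) in base 2 into powers of 2.
bits   = [1 1 0 0 1];        % binary digits, read left to right
powers = [1 0 -1 -2 -3];     % power of 2 attached to each digit
value  = sum(bits .* 2.^powers);
disp(value)                  % displays 3.1250
```

Every term is a power of 2, so the sum is computed without any rounding.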

Converting a base 2 number to a base 10 number is nothing more than expanding it into powers of 2. To get an intuition for this, consider Table 1.3, which converts the base 10 numbers 1 through 10.

Table 1.3: Binary representation of the numbers from 1 to 10.

    Base 10 representation    Base 2 representation
    1                         1
    2                         10
    3                         11
    4                         100
    5                         101
    6                         110
    7                         111
    8                         1000
    9                         1001
    10                        1010

The following algorithm will convert a base 10 number to a base 2 number. Note this is not the most efficient computational algorithm, but perhaps it is the easiest to understand for beginners.

Given a base 10 decimal d,
1. Find the biggest power p of 2 such that $2^p \le d$ but $2^{p+1} > d$, and save the index number p.
2. Set $d = d - 2^p$.
3. If d is 0 or very small (compared to the d you started with), then stop; otherwise, go to step 1.

Finally, write down your number as 1’s and 0’s, putting 1’s in the entries where you saved the indices (the p above) and 0 everywhere else.

Example 1. Convert the base 10 number d = 11.5625 to base 2.
1. Find p = 3 because $2^3 = 8 \le 11.5625 < 16 = 2^4$. Set d = 11.5625 − 8 = 3.5625.
2. Find p = 1 because $2^1 = 2 \le 3.5625 < 4 = 2^2$. Set d = 3.5625 − 2 = 1.5625.
3. Find p = 0 because $2^0 = 1 \le 1.5625 < 2 = 2^1$. Set d = 1.5625 − 1 = 0.5625.
4. Find p = −1 because $2^{-1} = \frac{1}{2} \le 0.5625 < 1 = 2^0$. Set d = 0.5625 − 0.5 = 0.0625.
5. Find p = −4 because $2^{-4} = \frac{1}{16} = 0.0625 \le 0.0625 < \frac{1}{8} = 2^{-3}$. Set d = 0.0625 − 0.0625 = 0.

Thus the process has terminated, since d = 0. Our base 2 number thus has 1’s in the 3, 1, 0, −1, and −4 places, so we get

$$(11.5625)_{\text{base10}} = (1011.1001)_{\text{base2}}.$$

Of course, not every base 10 number will terminate to a finite number of base 2 digits. For most computer number systems, we have a total of 53 base 2 digits to represent a base 10 number.

We now introduce the notion of standard binary form, which can be considered analogous to exponential formatting in base 10. The idea here is that every binary number, except 0, can be represented as

$$x = \pm 1.b_1 b_2 b_3 \cdots \times 2^{\text{exponent}},$$

where each $b_i$ is a 0 or a 1, and the exponent (a positive or negative whole number) is adjusted so that the only digit to the left of the decimal point is a single 1. Some examples are given in Table 1.4.

Table 1.4: Examples of standard binary form. Note that the exponent is a number in decimal format.

    Base 2           Standard binary form
    1.101            1.101 × 2^0
    −1011.1101       −1.0111101 × 2^3
    0.000011101      1.1101 × 2^{−5}
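The conversion algorithm of Example 1 can be sketched in MATLAB/Octave as follows. This is a minimal illustration under stated assumptions, not the book's code: the variable names and the relative tolerance `1e-12` are hypothetical choices, the input is assumed to be at least 1, and the two-output form of `log2` (which returns `f` and `e` with `r = f * 2^e` and `0.5 <= f < 1`) is used to find the biggest power of 2 not exceeding the remainder. The first power saved is also the exponent of the standard binary form.

```matlab
d = 11.5625;             % base 10 number to convert (assumed >= 1)
r = d;                   % remainder still to be represented
tol = 1e-12 * d;         % "very small compared to the d you started with"
powers = [];             % saved indices p

while r > tol
    [~, e] = log2(r);    % r = f * 2^e with 0.5 <= f < 1
    p = e - 1;           % hence 2^p <= r < 2^(p+1)
    powers(end+1) = p;   % step 1: save the index
    r = r - 2^p;         % step 2: subtract that power of 2
end

% Put 1's in the saved places and 0's everywhere else.
places = powers(1):-1:powers(end);
marks  = double(ismember(places, powers));
str = [sprintf('%d', marks(places >= 0)), '.', sprintf('%d', marks(places < 0))];

disp(powers)             % saved places: 3 1 0 -1 -4
disp(str)                % 1011.1001
```

For this input every subtraction is exact, so the loop stops with a remainder of exactly 0; for numbers such as 0.1 the tolerance is what ends the loop.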

1.3 64 bit floating-point numbers

By far the most common computer number representation system is the 64-bit “double” floating-point number system. This is the default used by all major mathematical and computational software. In some cases, it makes sense to use 32 or 128 bit number systems, but that is a discussion for later (later, as in “not in this book”), as first we must learn the basics. Each “bit” on a computer is a 0 or a 1, and each number on a computer is represented by 64 0’s and 1’s. If we assume each number is in standard binary form, then the important information for each number is (i) the sign of the number, (ii) the exponent, and (iii) the digits after the decimal point. Note that the number 0 is an exception and is treated as a special case for the number system.

The IEEE standard divides up the 64 bits as follows:
– 1 bit sign: 0 for positive, 1 for negative;
– 11 bit exponent: the base 2 representation of (standard binary form exponent + 1023);
– 52 bit mantissa: the first 52 digits after the decimal point from standard binary form.

The reason for the “shift” (sometimes also called bias) of 1023 in the exponent is so that the computer does not have to store a sign for the exponent (more numbers can be stored this way). The computer knows internally that the number is shifted, and knows how to handle it. With the bits from above denoted as sign $s$, exponent $E$, and mantissa $b_1, \ldots, b_{52}$, the corresponding number in standard binary form is

$$(-1)^s \cdot 1.b_1 \ldots b_{52} \times 2^{E-1023}.$$

Example 2. Convert the base 10 number d = 11.5625 to 64 bit double floating-point representation.

From a previous example, we know that $11.5625 = (1011.1001)_{\text{base2}}$, and so it has the standard binary representation $1.0111001 \times 2^3$. Hence, we immediately know that

    sign bit = 0
    mantissa = 0111001000000000000000000000000000000000000000000000

For the exponent, we need the binary representation of (3 + 1023) = 1026 = 1024 + 2, and thus,

    exponent = 10000000010

There are some obvious consequences to this number system:

1. There is a biggest and smallest positive representable number.
   – The exponent can hold 11 bits total, which means the base 10 numbers 0 to 2047. Due to the shift, the biggest positive unshifted number it can hold is 1024, and the most negative unshifted number it can hold is −1023. However, the unshifted numbers 1024 and −1023 have special meanings (e.g., representing 0 and infinity). So the smallest and largest workable exponents are −1022 and 1023. This means the largest number that a computer can hold is
     $$n_{\max} = 1.1111111 \ldots 1 \times 2^{1023} \approx 10^{308}$$
     and similarly the smallest positive representable number is
     $$n_{\min} = 1.00000 \ldots 0 \times 2^{-1022} \approx 10^{-308}.$$
     Compare these with what MATLAB returns for realmin and realmax. That said, MATLAB can represent slightly smaller numbers, down to approximately $10^{-324}$. These numbers are called denormalized numbers.
   – Having these numbers as upper and lower bounds on positive representable numbers is generally not a problem. Usually, if one needs to deal with numbers larger or smaller than this, the entire problem can be rescaled into units that are representable.
2. The relative spacing between two adjacent floating-point numbers is $2^{-52} \approx 2.22 \times 10^{-16}$.
   – This relative spacing between numbers is generally referred to as machine epsilon, or eps in MATLAB, and we will denote it by $\epsilon_{\text{mach}} = 2^{-52}$.
   – Given a number $d = 1.b_1 b_2 \ldots b_{52} \times 2^{\text{exponent}}$, the smallest increment we can add to it is in the 52nd digit of the mantissa, which is $2^{-52} \times 2^{\text{exponent}}$. Any number smaller would be after the 52nd digit in the mantissa.
   – As there is spacing between two floating-point numbers, any real number between two floating-point numbers must be rounded (to the nearest one). This means the maximum relative error in representing a number on the computer is about $1.11 \times 10^{-16}$. Thus, we can expect 16 digits of accuracy if we enter a number into the computer.
   – Although it is usually enough to have 16 digits of accuracy, in some situations this is not sufficient. Since we often rely on computer calculations to let us know a plane is going to fly, a boat is going to float, or a reactor will not melt down, it is critical to know when computer arithmetic error can cause a problem and how to avoid it.
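Both Example 2 and the consequences listed above can be inspected directly in MATLAB/Octave. The sketch below is illustrative only; the variable names are assumptions, and the bit string is obtained by converting each of the 16 hexadecimal digits returned by `num2hex` into 4 binary digits.

```matlab
% Bit pattern of the double 11.5625, split into the three IEEE fields.
d = 11.5625;
hexstr = num2hex(d);                                 % 16 hex digits = 64 bits
bits = reshape(dec2bin(hex2dec(hexstr(:)), 4).', 1, []);

sign_bit  = bits(1);                                 % 1 bit sign
expo_bits = bits(2:12);                              % 11 bit shifted exponent
mant_bits = bits(13:64);                             % 52 bit mantissa
unshifted = bin2dec(expo_bits) - 1023;               % undo the shift of 1023

fprintf('sign     = %s\n', sign_bit);
fprintf('exponent = %s (unshifted: %d)\n', expo_bits, unshifted);
fprintf('mantissa = %s\n', mant_bits);

% Limits and spacing of the number system (1 means true).
disp(realmax == (2 - 2^-52) * 2^1023)   % largest representable number
disp(realmin == 2^-1022)                % smallest normalized positive number
disp(realmin / 2 > 0)                   % denormalized numbers go below realmin
disp(eps == 2^-52)                      % spacing between 1 and the next double
disp(1 + eps/2 == 1)                    % half a spacing is rounded away
disp(eps(4096) == 2^-40)                % absolute spacing scales with 2^exponent
```

The fields printed for 11.5625 should agree with the sign bit, exponent, and mantissa found by hand in Example 2.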

There are two main types of catastrophic computer arithmetic errors: adding large and small numbers together, and subtracting two nearly equal numbers. We will describe each of these issues now.

1.3.1 Adding large and small numbers is bad

As we saw in the first example of this chapter, if we add $10^{-16}$ to 1, it does not change the 1 at all. The next computer representable number after 1 is $1 + 2^{-52} = 1 + 2.22 \times 10^{-16}$. Since $1 + 10^{-16}$ is closer to 1 than it is to $1 + 2.22 \times 10^{-16}$, it gets rounded to 1, leaving the $10^{-16}$ to be lost forever. This is the effect we observed at the beginning of this chapter when repeatedly adding $10^{-16}$ to 1.

Theoretically speaking, addition in floating-point computation is not associative, meaning $(A + B) + C = A + (B + C)$ may not hold, due to rounding. One way to minimize this type of error when adding several numbers is to add from smallest to largest (if they all have the same sign) and to use factorizations that lessen the problem. There are other, more complicated ways to deal with this kind of error that are beyond the scope of this book, for example, the “Kahan Summation Formula.”
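The loss of associativity and the benefit of adding from smallest to largest can be seen in a short experiment. This is a minimal sketch with hypothetical variable names; explicit loops are used so that the order of the additions is exactly the one written.

```matlab
% Floating-point addition is not associative.
a = 1; b = 1e-16; c = 1e-16;
left  = (a + b) + c;      % b is lost, then c is lost: result is exactly 1
right = a + (b + c);      % b + c = 2e-16 is large enough to move 1 to 1 + eps
disp(left == right)       % displays 0 (false)

% Sum the same eight numbers in the given order and from smallest to largest.
x = [1, 1e-16 * ones(1, 7)];
s_given = 0;
for k = 1:numel(x)
    s_given = s_given + x(k);      % each 1e-16 is lost when added to 1
end

xs = sort(x);                      % ascending: small terms first
s_sorted = 0;
for k = 1:numel(xs)
    s_sorted = s_sorted + xs(k);   % small terms accumulate before meeting 1
end

fprintf('given order: %.15f\nsorted     : %.15f\n', s_given, s_sorted);
```

Sorting costs extra work and only helps when all terms have the same sign, as noted above.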

1.3.2 Subtracting two nearly equal numbers is bad

The issue here is that insignificant digits can become significant digits, and the problem is illustrated in Example 2, earlier in this chapter. Consider the following MATLAB command and output:

    >> 1 + 1e-15 - 1
    ans =
       1.110223024625157e-15

Clearly, the answer should be $10^{-15}$, but we do not get that, as we observe error in the second significant digit. It is true that each operation here is accurate to about 16 digits relative to the numbers involved, but there is a potential problem with the “garbage” digits 110223024625157 (these digits arise from rounding error). If we are calculating a limit, for example, they could play a role.

Consider using the computer to find the derivative of $f(x) = x$ at $x = 1$. Everyone in the world, including babies and the elderly, knows the answer is $f'(1) = 1$. But suppose for a moment we do not know the answer and wish to calculate an approximation. The definition of the derivative tells us

$$f'(1) = \lim_{h \to 0} \frac{f(1 + h) - f(1)}{h}.$$

It might seem reasonable to just pick a very small h to get a good answer, but this is a bad idea. Consider the following MATLAB commands, which plug in values of h from $10^{-1}$ to $10^{-20}$. When $h = 10^{-15}$, we see the “garbage” digits have become significant and alter the second significant digit. For even smaller values of h, 1 + h gives back 1, and so the derivative approximation is 0.

    >> h = 10.^-[1:20]';
    >> fp = ((1+h) - 1) ./ h;
    >> disp([h, fp]);
       0.100000000000000   1.000000000000001
       0.010000000000000   1.000000000000001
       0.001000000000000   0.999999999999890
       0.000100000000000   0.999999999999890
       0.000010000000000   1.000000000006551
       0.000001000000000   0.999999999917733
       0.000000100000000   1.000000000583867
       0.000000010000000   0.999999993922529
       0.000000001000000   1.000000082740371
       0.000000000100000   1.000000082740371
       0.000000000010000   1.000000082740371
       0.000000000001000   1.000088900582341
       0.000000000000100   0.999200722162641
       0.000000000000010   0.999200722162641
       0.000000000000001   1.110223024625157
       0.000000000000000                   0
       0.000000000000000                   0
       0.000000000000000                   0
       0.000000000000000                   0
       0.000000000000000                   0

We will discuss in detail, in a later chapter, more accurate ways to approximate derivatives. Although the roundoff error will always be a problem, more accurate methods can give good approximations for h only moderately small, and thus, minimize potential issues from roundoff. There is another fact about floating-point numbers to be aware of. What do you expect the following program to do?

    i = 1.0
    while i ~= 0.0
        i = i - 0.1
    end

The operator “~=” means “not equal” in MATLAB. We would expect the loop to be counting down from 1.0 in steps of 0.1 until we reach 0, right? No, it turns out that we are running an endless loop counting downward because i is never exactly equal to 0.

Rule: Never compare floating-point numbers for equality (“==” or “~=”). Instead, compare against a small tolerance: for example, test abs(i) < 1e-12 in place of i == 0.0, so that the loop condition i ~= 0.0 from the above code becomes abs(i) > 1e-12.
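A tolerance-based version of the countdown loop might look as follows. This is a sketch only; the tolerance `1e-12` follows the rule above, and the counter `steps` is a hypothetical addition used to show how many passes the loop makes.

```matlab
% Countdown from 1.0 in steps of 0.1 with a tolerance-based stopping test.
i = 1.0;
steps = 0;
while abs(i) > 1e-12      % "i is not (numerically) zero"
    i = i - 0.1;
    steps = steps + 1;
end
fprintf('stopped after %d steps with i = %g\n', steps, i);
```

The loop stops after the expected ten steps, and the final value of i is tiny but not exactly 0, which is why the original equality test never succeeds.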

1.4 Exercises

1. Convert the binary number 1101101.1011 to decimal format.
2. Convert the decimal number 63.125 to binary format. What is its 64-bit floating-point representation?
3. Show that the decimal number 0.1 cannot be represented exactly as a finite binary number (i.e., the base 2 expansion is not terminal). Use this fact to explain that “0.1*3 == 0.3” returns 0 (meaning false) in MATLAB.
4. Write your own MATLAB function tobinary(n) that, given a whole number, outputs its base 2 representation. The output should be a vector with 0s and 1s. Example:

    >> tobinary(34)
    ans =
         1     0     0     0     1     0

   Hint: The MATLAB command pmax=floor(log2(n)) finds the biggest power of 2 contained in n.
5. What are the minimum and maximum distances between two adjacent floating-point numbers?
6. What is the best way to numerically evaluate $\|x\|_2 = \sqrt{\sum_{n=1}^{N} x_n^2}$? (i.e., is there a best order for the addition?)
7. Consider the function

   $$g(x) = \frac{e^x - 1}{x}.$$

   (a) Find $\lim_{x \to 0} g(x)$ analytically.
   (b) Write a MATLAB program to calculate g(x) for $x = 10^{-1}, 10^{-2}, \ldots, 10^{-15}$. Explain why the analytical limit and the numerical ‘limit’ do not agree.
   (c) Extend your calculation to $x = 10^{-16}, 10^{-17}, \ldots, 10^{-20}$, and explain this behavior.

8. The polynomial $(x - 2)^6$ is 0 when x = 2 but is positive everywhere else. Plot both this function and its expanded form $x^6 - 12x^5 + 60x^4 - 160x^3 + 240x^2 - 192x + 64$ near x = 2, on [1.99, 2.01], using 10,000 points. Explain the differences in the plots.
9. What is the next biggest computer representable number after 1 (assuming 64 bit double floating-point numbers)? What about after 4096? What about after $\frac{1}{8}$?
10. Read about the Patriot missile failure during the Gulf War, in which, due to roundoff error, a Patriot battery failed to intercept an incoming Scud missile that then hit an American Army barracks, killing 28 soldiers and injuring over 100 others: http://www-users.math.umn.edu/~arnold/disasters/patriot.html. More examples of serious problems are discussed at http://mathworld.wolfram.com/RoundoffError.html. Write a paragraph about the importance of roundoff error in software calculations.
11. The mathematical definition of the exponential function is

    $$\exp(x) = e^x = 1 + x + \frac{x^2}{2!} + \cdots + \frac{x^n}{n!} + \cdots = \sum_{n=0}^{\infty} \frac{x^n}{n!}.$$

    Since the factorial dominates power functions, this infinite sum will converge, although one may need n to be very large. For x = 2.2, calculate $e^x$ using MATLAB’s “exp” function, and also using the definition. How large must n be in order for these two methods to give the same answer to 16 digits?
12. The mathematical definition of the sine function is

    $$\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots = \sum_{n=0}^{\infty} \frac{(-1)^n}{(2n+1)!} x^{2n+1}.$$

    Since the factorial dominates power functions, this infinite sum will converge, although one may need n to be very large. For $x = \frac{\pi}{4}$, calculate sin(x) using MATLAB’s “sin” function and also using the definition. How large must n be in order for these two methods to give the same answer to 16 digits?

2 Solving linear systems of equations

The need to solve systems of linear equations Ax = b arises across nearly all of engineering and science, business, statistics, economics, and many other fields. In a standard undergraduate linear algebra course, we learned how to solve this problem using Gaussian Elimination (GE). We will show here how such a procedure is equivalent to an LU factorization of the coefficient matrix A, followed by a forward and a back substitution. To achieve stability of the factorization in computer arithmetic, a strategy called pivoting is necessary, which leads to the LU factorization with partial pivoting. This is the standard direct method for solving linear systems where A is a dense matrix.

Linear systems with large and sparse (most entries are zero) coefficient matrices arise often in numerical solution methods for differential equations, for example, from finite element and finite difference discretizations. State-of-the-art direct methods can nowadays efficiently solve such linear systems up to an order of a few million, using advanced strategies to keep the LU factors as sparse as possible and the factorization stable. However, problems of ever-increasing dimension need to be tackled, and sparse linear systems of order tens of millions to billions have become more routine. To efficiently solve these large systems approximately, iterative methods such as the Conjugate Gradient (CG) method are typically used, and on sufficiently large problems they can be advantageous over direct methods. This chapter will mainly focus on direct methods but will also discuss the CG method.

"Linear solvers" has become a vast field and is a very active research area. We aim here to provide a fundamental understanding of the basic types of solvers, but note that we are just scratching the surface, in particular for iterative methods.

2.1 Solving triangular linear systems

Consider a system of linear equations Ax = b, where the coefficient matrix A is square and nonsingular. Recall that the GE procedure gradually eliminates all entries in the coefficient matrix below the main diagonal by elementary row operations, until the modified coefficient matrix becomes an upper triangular matrix U. The solution remains unchanged during the entire procedure. In this section, we consider how to solve a linear system where the coefficient matrix is upper or lower triangular. The procedure of elimination will be reviewed and explored in the new perspective of matrix factorization in the next section.

https://doi.org/10.1515/9783110573329-002

Example 3 (Back substitution for an upper triangular system). Consider the linear system $x_1 + 2x_2 + 3x_3 = 2$, $4x_2 + 5x_3 = 3$ and $6x_3 = -6$. It can be written in matrix form as

$$\begin{pmatrix} 1 & 2 & 3 \\ & 4 & 5 \\ & & 6 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 2 \\ 3 \\ -6 \end{pmatrix},$$

where the coefficient matrix is upper triangular. To solve this linear system, we start from the last equation $6x_3 = -6$, which immediately gives $x_3 = \frac{-6}{6} = -1$. Then, from the second equation $4x_2 + 5x_3 = 3$, we get $x_2 = \frac{3 - 5x_3}{4} = 2$. Finally, the first equation $x_1 + 2x_2 + 3x_3 = 2$ leads to $x_1 = 2 - 2x_2 - 3x_3 = 1$.

This procedure illustrates the general procedure of back substitution. Given an upper triangular linear system with nonzero diagonal entries

$$Ux = \begin{pmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ & \ddots & \ddots & \vdots \\ & & u_{(n-1)(n-1)} & u_{(n-1)n} \\ & & & u_{nn} \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_{n-1} \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ \vdots \\ b_{n-1} \\ b_n \end{pmatrix},$$

we start with the last equation and evaluate $x_n = \frac{b_n}{u_{nn}}$ directly, then substitute it into the previous equation and compute $x_{n-1} = \frac{b_{n-1} - u_{(n-1)n} x_n}{u_{(n-1)(n-1)}}$. Assume in general that we have already solved for $x_{i+1}, \ldots, x_n$; then

$$x_i = \frac{b_i - \sum_{j=i+1}^{n} u_{ij} x_j}{u_{ii}}$$

can be evaluated. We continue until the value of $x_1$ is found. A MATLAB code for back substitution is given below.

    function [x] = BackSubstitution(A,b)
    % solve the upper triangular system Ax = b
    % using back-substitution
    n = length(b);
    x = zeros(n,1);
    for j = n:-1:1
        % Check to see if the diagonal entry is zero
        if abs(A(j,j)) < 1e-15
            error('A is singular (diagonal entries of zero)')
        end
        % Compute solution component
        x(j) = b(j) / A(j,j);
        % Update the RHS vector
        for i = 1:j-1
            b(i) = b(i) - A(i,j)*x(j);
        end
    end


Similarly, consider a lower triangular linear system

$$\begin{pmatrix} l_{11} & & & \\ l_{21} & l_{22} & & \\ \vdots & \vdots & \ddots & \\ l_{n1} & l_{n2} & \cdots & l_{nn} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}.$$

Here, we start from the first equation and get $x_1 = \frac{b_1}{l_{11}}$ first, then move to the second equation. At Step i, we can evaluate

$$x_i = \frac{b_i - \sum_{j=1}^{i-1} l_{ij} x_j}{l_{ii}}.$$

This procedure, called forward substitution, moves on until $x_n$ is determined.

One can follow these algorithms and see that both substitutions need $\sum_{i=1}^{n} (i-1) = \frac{n(n-1)}{2} \approx \frac{1}{2} n^2$ addition/subtractions and $\sum_{i=1}^{n} [(i-1)+1] = \frac{n(n+1)}{2} \approx \frac{1}{2} n^2$ multiplication/divisions, or $n^2$ combined. Both will be used after we complete an LU factorization of the coefficient matrix A to solve Ax = b.
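Forward substitution can be sketched in the same way as the back substitution code. The following is a minimal illustration, not code from the text: it is written as a script rather than a function, the small test system is made up for the example, and the loop is row-oriented, following the formula for $x_i$ directly.

```matlab
% Forward substitution for a lower triangular system L*x = b.
% Illustrative sketch: the matrix and right-hand side are made-up test data.
L = [2 0 0; 1 3 0; 4 -1 5];
b = [2; 7; 12];
n = length(b);
x = zeros(n,1);
for i = 1:n
    % A zero diagonal entry means L is singular
    if L(i,i) == 0
        error('L is singular (zero diagonal entry)')
    end
    % x_i = (b_i - sum_{j<i} l_ij * x_j) / l_ii
    x(i) = (b(i) - L(i,1:i-1)*x(1:i-1)) / L(i,i);
end
disp(x')
disp(norm(L*x - b))   % residual, expected to be at roundoff level
```

For i = 1 the index range 1:0 is empty, so the inner product contributes zero and the first step reduces to $x_1 = b_1/l_{11}$.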

2.2 From Gaussian elimination to LU factorization

In this section, we show that the standard GE without row swapping can be done through elementary matrix operations, which naturally leads to an LU factorization without pivoting. Let us briefly review GE. Given a system of linear equations Ax = b, where $A \in \mathbb{R}^{n \times n}$ is nonsingular, and $b \in \mathbb{R}^n$, GE updates the coefficient matrix and the right-hand side by elementary row operations, one column at a time, eliminating all entries below the main diagonal. This produces a new linear system Ux = c, where U is an upper triangular matrix. Then back substitution is used to solve Ux = c for x. Note that this procedure does not change the solution to Ax = b, that is, it is the same x in Ax = b and Ux = c.

After reviewing this process below, we will show how it can be "saved" in an LU factorization A = LU, with L lower triangular and U upper triangular. The U is the upper triangular matrix that results from the GE procedure. The L is a matrix with ones on the diagonal, and whose entries are easily determined from the GE procedure (although the mathematical reasons why it works are a little more complicated).

In your linear algebra course, the GE procedure could be performed in many different ways, so long as the Gauss rules are not violated. However, for a computer algorithm, it is necessary to avoid ambiguity and to have a more precise algorithm: we will always move left to right, and use the diagonal entry to eliminate entries below it.

Example 4 (Gaussian elimination). Apply GE to solve the system of linear equations

$$\begin{aligned} 2x_1 + x_2 + 3x_3 &= 5, \\ 4x_1 - x_2 + 2x_3 &= -1, \\ -x_1 + 4x_2 + x_3 &= 7. \end{aligned}$$

This linear system can be written as Ax = b, for which the augmented matrix is

$$(A \mid b) = \left(\begin{array}{ccc|c} 2 & 1 & 3 & 5 \\ 4 & -1 & 2 & -1 \\ -1 & 4 & 1 & 7 \end{array}\right).$$

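Step 1: Eliminate the entries in column 1 below the main diagonal, namely, the (2, 1) entry 4 and the (3, 1) entry −1. This is done by subtracting twice row 1 from row 2 ($R_2 \leftarrow R_2 - 2R_1$), and adding half of row 1 to row 3 ($R_3 \leftarrow R_3 + \frac{1}{2} R_1$). This is described as

$$\left(\begin{array}{ccc|c} 2 & 1 & 3 & 5 \\ 4 & -1 & 2 & -1 \\ -1 & 4 & 1 & 7 \end{array}\right) \xrightarrow[R_3 \leftarrow R_3 + \frac{1}{2} R_1]{R_2 \leftarrow R_2 - 2R_1} \left(\begin{array}{ccc|c} 2 & 1 & 3 & 5 \\ 0 & -3 & -4 & -11 \\ 0 & \frac{9}{2} & \frac{5}{2} & \frac{19}{2} \end{array}\right).$$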

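Let $m_{ji}$ be the multiplier used to zero-out the (j, i) entry: $m_{21} = -2$, $m_{31} = \frac{1}{2}$.

Step 2: Eliminate the entries in column 2 below the diagonal, which is simply the (3, 2) entry $\frac{9}{2}$. This is done by adding $\frac{3}{2}$ times row 2 to row 3 ($R_3 \leftarrow R_3 + \frac{3}{2} R_2$). That is,

$$\left(\begin{array}{ccc|c} 2 & 1 & 3 & 5 \\ 0 & -3 & -4 & -11 \\ 0 & \frac{9}{2} & \frac{5}{2} & \frac{19}{2} \end{array}\right) \xrightarrow{R_3 \leftarrow R_3 + \frac{3}{2} R_2} \left(\begin{array}{ccc|c} 2 & 1 & 3 & 5 \\ 0 & -3 & -4 & -11 \\ 0 & 0 & -\frac{7}{2} & -7 \end{array}\right).$$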

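Here, $m_{32} = \frac{3}{2}$.

Step 3: Solve the upper triangular system, bottom to top. From the last equation $-\frac{7}{2} x_3 = -7$, we have $x_3 = 2$; then from the 2nd equation $-3x_2 - 4x_3 = -11$, we get $x_2 = 1$; finally, from the 1st equation $2x_1 + x_2 + 3x_3 = 5$, we find $x_1 = -1$.

If we define the matrix L by

$$l_{jj} = 1, \qquad l_{ji} = 0 \ \text{ if } j < i, \qquad l_{ji} = -m_{ji} \ \text{ if } j > i,$$

then we obtain the LU factorization A = LU. One can check that

$$A = \begin{pmatrix} 2 & 1 & 3 \\ 4 & -1 & 2 \\ -1 & 4 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ -\frac{1}{2} & -\frac{3}{2} & 1 \end{pmatrix} \begin{pmatrix} 2 & 1 & 3 \\ 0 & -3 & -4 \\ 0 & 0 & -\frac{7}{2} \end{pmatrix} = LU.$$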
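The elimination in Example 4 can be reproduced with a few lines of MATLAB. This is a minimal sketch, not the text's own code: it assumes every pivot U(i,i) is nonzero (no row swaps), and the variable names are chosen here for illustration. The multiplier m follows the sign convention of the example, so that $l_{ji} = -m_{ji}$.

```matlab
% LU factorization without pivoting, applied to the matrix of Example 4.
% Illustrative sketch: assumes every pivot U(i,i) is nonzero.
A = [2 1 3; 4 -1 2; -1 4 1];
n = size(A,1);
U = A;
L = eye(n);
for i = 1:n-1
    for j = i+1:n
        m = -U(j,i) / U(i,i);        % multiplier m_ji of the text
        U(j,:) = U(j,:) + m*U(i,:);  % R_j <- R_j + m_ji * R_i
        L(j,i) = -m;                 % l_ji = -m_ji
    end
end
disp(L); disp(U);
disp(norm(A - L*U));                 % expected to be at roundoff level
```

The computed L and U should agree with the factors displayed in the example, up to roundoff.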

We now discuss why L can be created this way. At Step i, we want to eliminate all entries in the ith column below the diagonal. To eliminate the (j, i) entry (j > i), adding $m_{ji}$ times row i to row j is equivalent to left-multiplying by the elementary matrix

$$E_{ji} = \begin{pmatrix} 1 & & & & \\ & \ddots & & & \\ & & \ddots & & \\ & m_{ji} & & \ddots & \\ & & & & 1 \end{pmatrix},$$

which differs from the identity matrix $I_n$ at the (j, i) entry alone. This can be verified by looking at the jth row of $E_{ji}$ applied to the current (modified) coefficient matrix:

$$(0, \ldots, 0, \underbrace{m_{ji}}_{\text{the } i\text{th entry}}, 0, \ldots, \underbrace{1}_{\text{the } j\text{th entry}}, \ldots, 0) \begin{pmatrix} R_1 \\ \vdots \\ R_n \end{pmatrix} = R_j + m_{ji} R_i.$$

Observe that all other rows remain intact after this matrix multiplication. We can verify that the product of all such elementary matrices accumulated at Step i is

$$L_i = E_{ni} E_{(n-1)i} \cdots E_{(i+1)i} = \begin{pmatrix} 1 & & & & & \\ & \ddots & & & & \\ & & 1 & & & \\ & & m_{(i+1)i} & \ddots & & \\ & & \vdots & & \ddots & \\ & & m_{ni} & & & 1 \end{pmatrix},$$

where the diagonal 1 directly above $m_{(i+1)i}$ is the (i, i) entry,

which differs from the identity at the entries below the ith diagonal entry. Interestingly, a simple calculation shows that the inverse of $L_i$ is obtained by simply negating the off-diagonal entries (check this by verifying $L_i L_i^{-1} = I_n$):

$$L_i^{-1} = \begin{pmatrix} 1 & & & & & \\ & \ddots & & & & \\ & & 1 & & & \\ & & -m_{(i+1)i} & \ddots & & \\ & & \vdots & & \ddots & \\ & & -m_{ni} & & & 1 \end{pmatrix},$$

where again the diagonal 1 directly above $-m_{(i+1)i}$ is the (i, i) entry.

Another interesting fact is that the product of $L_i^{-1}$ matrices, taken in order of increasing index, creates a matrix that keeps all the nonzero entries of each matrix unchanged. For i < j,

$$L_i^{-1} L_j^{-1} = \begin{pmatrix} 1 & & & & & & & \\ & \ddots & & & & & & \\ & & 1 & & & & & \\ & & -m_{(i+1)i} & \ddots & & & & \\ & & \vdots & & 1 & & & \\ & & \vdots & & -m_{(j+1)j} & \ddots & & \\ & & \vdots & & \vdots & & \ddots & \\ & & -m_{ni} & & -m_{nj} & & & 1 \end{pmatrix}.$$

We can now describe the GE procedure in terms of multiplication of A by the $L_i$ matrices:

$$U = L_{n-1} L_{n-2} \cdots L_1 A.$$

Thus, we can write

$$A = L_1^{-1} L_2^{-1} \cdots L_{n-1}^{-1} U.$$

From the discussion above, we can define $L := L_1^{-1} L_2^{-1} \cdots L_{n-1}^{-1}$, which is easily calculated to be

$$L := \begin{pmatrix} 1 & & & & \\ -m_{21} & 1 & & & \\ \vdots & -m_{32} & 1 & & \\ \vdots & \vdots & \ddots & \ddots & \\ -m_{n1} & -m_{n2} & \cdots & -m_{n(n-1)} & 1 \end{pmatrix}.$$

Hence, even though it took a bit of work to develop the representation of L, in practice it comes down to simply placing the negative of these GE multipliers $m_{ji}$ in their corresponding locations in L. Thus, no extra work is needed to create L and U, and this is why the terms GE and LU are used synonymously.
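The two structural facts used in this derivation (the inverse of $L_i$ negates its off-diagonal entries, and the product $L_1^{-1} L_2^{-1}$ simply merges the columns) can be checked numerically. The sketch below uses the multipliers of Example 4; the variable names L1, L2, L1inv, L2inv are chosen here for illustration.

```matlab
% Check the structural facts about the L_i matrices, using the
% multipliers of Example 4 (m21 = -2, m31 = 1/2, m32 = 3/2).
A  = [2 1 3; 4 -1 2; -1 4 1];
L1 = [1 0 0; -2 1 0; 0.5 0 1];   % Step 1: E_31 * E_21
L2 = [1 0 0; 0 1 0; 0 1.5 1];    % Step 2: E_32
% Negating the off-diagonal entries: 2*I - L_i keeps the unit diagonal
L1inv = 2*eye(3) - L1;
L2inv = 2*eye(3) - L2;
disp(norm(L1*L1inv - eye(3)));   % confirms L1inv is the inverse of L1
disp(norm(L2*L2inv - eye(3)));
% Product of the inverses, in increasing index order, merges the columns
L = L1inv * L2inv;
disp(L);
U = L2 * L1 * A;                 % upper triangular factor
disp(norm(A - L*U));
```

Reversing the order of the product (L2inv * L1inv) would in general not preserve the entries, which is why the ordering in $L = L_1^{-1} L_2^{-1} \cdots L_{n-1}^{-1}$ matters.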

2.3 Pivoting

GE breaks down at Step i if the ith diagonal entry of the current (modified) coefficient matrix, referred to as the pivot, is zero (or close to 0), since there is no way to eliminate a nonzero entry using a zero pivot. A zero pivot may arise at any step of GE, making the algorithm fail, even if A is nonsingular and a unique solution to Ax = b exists. Consider, for example, the linear equation

$$\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 2 \end{pmatrix}.$$

There is a unique solution ($x_1 = 2$ and $x_2 = 0$), but GE fails at the first step.


To fix this issue, if a zero pivot (current (i, i) entry) appears at Step i, we can switch the ith row containing the zero pivot with a row below it, say the jth (j > i) row where the (j, i) entry is nonzero, so that the new pivot (new (i, i) entry) is no longer zero. This strategy is called partial pivoting. In exact arithmetic (no round-off errors), this would be sufficient to make sure GE proceeds to the end.

However, partial pivoting must also be used if the original pivot is excessively small. Consider the example

$$A = \begin{pmatrix} \frac{1}{2}\epsilon & -1 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 1 & \\ 2\epsilon^{-1} & 1 \end{pmatrix} \begin{pmatrix} \frac{1}{2}\epsilon & -1 \\ & 1 + 2\epsilon^{-1} \end{pmatrix} = LU,$$

where $\epsilon$ is the machine precision. Assume that $2\epsilon^{-1}$ can be evaluated exactly in computer arithmetic. Then the floating point representation of the (2, 2) entry of U, namely $1 + 2\epsilon^{-1}$, is $2\epsilon^{-1}$ due to rounding in computer arithmetic. Let the computed factors be

$$\hat{L} = L \quad \text{and} \quad \hat{U} = \begin{pmatrix} \frac{1}{2}\epsilon & -1 \\ & 2\epsilon^{-1} \end{pmatrix},$$

such that

$$\hat{A} = \hat{L}\hat{U} = \begin{pmatrix} \frac{1}{2}\epsilon & -1 \\ 1 & 0 \end{pmatrix}.$$

In other words, the computed factors $\hat{L}$ and $\hat{U}$ form the exact LU factorization of a matrix $\hat{A}$ that is very different from A. This is, of course, not acceptable.

The standard partial pivoting used for GE that will address these issues is as follows. At Step i, we have eliminated all entries of (A | b) in the first i − 1 columns below the diagonal, and we need to do so for the ith column. The current modified coefficient matrix we have is

$$\begin{pmatrix} a_{11} & \cdots & a_{1i} & \cdots & a_{1n} \\ 0 & \ddots & \vdots & & \vdots \\ \vdots & 0 & a_{ii} & \cdots & a_{in} \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & a_{ni} & \cdots & a_{nn} \end{pmatrix}.$$

We compare the entries $a_{ii}, a_{(i+1)i}, \ldots, a_{ni}$ in absolute value and find the largest

$$|a_{ji}| = \max_{i \le k \le n} |a_{ki}|,$$

where we denote by j the row with the largest entry. Next, swap row i and row j (i ≤ j), and perform the current step of elimination as usual.
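The loss of information caused by the tiny pivot $\frac{1}{2}\epsilon$, and the effect of swapping the rows first, can be observed directly in MATLAB. This is an illustrative sketch: it takes $\epsilon$ to be MATLAB's built-in eps ($2^{-52}$ in double precision), and the variable names are chosen here for the example.

```matlab
% Effect of a tiny pivot, using the 2-by-2 example with machine epsilon.
A = [eps/2 -1; 1 1];

% (a) No pivoting: the multiplier 2/eps is huge
l21  = A(2,1) / A(1,1);
L    = [1 0; l21 1];
U    = [A(1,:); 0, A(2,2) - l21*A(1,2)];  % 1 + 2/eps is rounded to 2/eps
Ahat = L*U;
disp(Ahat);                % the (2,2) entry 1 of A has been lost

% (b) Partial pivoting: swap the rows first, so the multiplier is eps/2
B    = A([2 1],:);
l21p = B(2,1) / B(1,1);
Lp   = [1 0; l21p 1];
Up   = [B(1,:); 0, B(2,2) - l21p*B(1,2)];
disp(norm(Lp*Up - B));     % expected to be at roundoff level
```

With the rows swapped, every multiplier has magnitude at most 1, which is the property that the largest-entry rule enforces at each step.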

The row swapping can be done equivalently by left-multiplying by the permutation matrix $P_i$, which itself is obtained by switching rows i and j of the identity matrix I; the elimination is performed by left-multiplying by the unit lower triangular matrix $L_i$ introduced before. Then we proceed to the next step, continuing until we have eliminated all entries below the diagonal and obtain the upper triangular U. This procedure is described in matrix operations as

$$\underbrace{L_{n-1} P_{n-1}}_{\text{Step } (n-1)} \underbrace{L_{n-2} P_{n-2}}_{\text{Step } (n-2)} \cdots \underbrace{L_2 P_2}_{\text{Step 2}} \underbrace{L_1 P_1}_{\text{Step 1}} A = U.$$

To develop a deeper insight into the above relation, note that all $P_i$ are symmetric permutation matrices, such that $P_i P_i^T = P_i^2 = I$. With some effort, we can show that this equation can be written as

$$(L_{n-1}) (P_{n-1} L_{n-2} P_{n-1}) \cdots (P_{n-1} \cdots P_2 L_1 P_2 \cdots P_{n-1}) \, P_{n-1} \cdots P_2 P_1 A = U,$$

where each matrix in parentheses is lower triangular with 1's on the diagonal. This might not seem obvious at first glance, but can be shown by noting that $L_i$ is unit lower triangular with nonzeros only in the ith column, and that it has the identical nonzero pattern to $P_{i+1} L_i P_{i+1}$ (switching rows i + 1 and j of $L_i$, then its columns i + 1 and j, j ≥ i + 1) and $P_{i+2} P_{i+1} L_i P_{i+1} P_{i+2}$, etc. In summary, we have

$$\tilde{L}_{n-1} \tilde{L}_{n-2} \cdots \tilde{L}_1 P A = U,$$

where each $\tilde{L}_i = P_{n-1} \cdots P_{i+1} L_i P_{i+1} \cdots P_{n-1}$ is a lower triangular matrix with 1's on the diagonal, and thus so is $\tilde{L}_{n-1} \tilde{L}_{n-2} \cdots \tilde{L}_1$, and $P = P_{n-1} \cdots P_1$ is a permutation matrix. Taking the inverse of the lower triangular matrices on both sides, we have

$$PA = \tilde{L}_1^{-1} \cdots \tilde{L}_{n-2}^{-1} \tilde{L}_{n-1}^{-1} U = LU,$$

where $L = \tilde{L}_1^{-1} \cdots \tilde{L}_{n-2}^{-1} \tilde{L}_{n-1}^{-1}$ is also unit lower triangular.

The above derivation gives clear guidance on how to generate the U and P matrices. However, the L factor has a complicated expression, and it would be awkward and inefficient to construct it in the original formula using matrix multiplications and inverses. In light of the above analysis (given for the non-pivoting case), all three factors can be constructed as follows. Initialize P = L = I. Then at Step i (1 ≤ i ≤ n − 1),
1. Find the largest (in absolute value) (j, i) entry in the ith column (1 ≤ i ≤ j ≤ n) of the modified coefficient matrix, at or below the (i, i) entry;
2. Swap rows i and j in the modified coefficient matrix, the permutation matrix P, and these two rows in columns 1 through i − 1 in L;
3. Eliminate all subdiagonal entries in the ith column of the modified coefficient matrix, updating rows/columns i + 1 through n;
4. Place in the (k, i) entry (i < k ≤ n) of L the negative of the multiplying factor of row i added to row k used in the elimination.


Upon completion, U is the final modified upper triangular coefficient matrix, L is a lower triangular matrix, P is a permutation matrix, and PA = LU.

Example 5 (LU with partial pivoting, or PLU). We follow the procedure above to factorize

$$A = \begin{pmatrix} 2 & 1 & 3 \\ 4 & -1 & 2 \\ -1 & 4 & 1 \end{pmatrix}$$

into PA = LU, with initial $P = I_3$ and $L = I_3$.

Step 1: Compare the entries on and below the diagonal in column 1 and find the largest in modulus (the (2, 1) entry 4). Then we swap rows 1 and 2 of A and P and get the updated matrix

$$\begin{pmatrix} 4 & -1 & 2 \\ 2 & 1 & 3 \\ -1 & 4 & 1 \end{pmatrix} \quad \text{and} \quad P = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$

There is no swapping to do for L. Next, we eliminate the new (2, 1) entry by performing $R_2 \leftarrow R_2 - \frac{1}{2} R_1$ and the new (3, 1) entry via $R_3 \leftarrow R_3 + \frac{1}{4} R_1$. We have the modified coefficient matrix

$$\begin{pmatrix} 4 & -1 & 2 \\ 0 & \frac{3}{2} & 2 \\ 0 & \frac{15}{4} & \frac{3}{2} \end{pmatrix}$$

and the updated

$$L = \begin{pmatrix} 1 & 0 & 0 \\ \frac{1}{2} & 1 & 0 \\ -\frac{1}{4} & 0 & 1 \end{pmatrix}.$$

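Step 2: Compare the entries on and below the diagonal in column 2 and find the largest in modulus (the (3, 2) entry $\frac{15}{4}$). Then we swap rows 2 and 3 of this modified matrix and of P, and these two rows in column 1 of L to get

$$\begin{pmatrix} 4 & -1 & 2 \\ 0 & \frac{15}{4} & \frac{3}{2} \\ 0 & \frac{3}{2} & 2 \end{pmatrix}, \quad P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}, \quad \text{and} \quad L = \begin{pmatrix} 1 & 0 & 0 \\ -\frac{1}{4} & 1 & 0 \\ \frac{1}{2} & 0 & 1 \end{pmatrix}.$$

Next, we eliminate the (3, 2) entry by performing $R_3 \leftarrow R_3 - \frac{2}{5} R_2$. This gives

$$\begin{pmatrix} 4 & -1 & 2 \\ 0 & \frac{15}{4} & \frac{3}{2} \\ 0 & 0 & \frac{7}{5} \end{pmatrix}$$

along with updated

$$L = \begin{pmatrix} 1 & 0 & 0 \\ -\frac{1}{4} & 1 & 0 \\ \frac{1}{2} & \frac{2}{5} & 1 \end{pmatrix}.$$

Since the matrix has become upper triangular, we have completed the algorithm. We recover

$$U = \begin{pmatrix} 4 & -1 & 2 \\ 0 & \frac{15}{4} & \frac{3}{2} \\ 0 & 0 & \frac{7}{5} \end{pmatrix},$$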

and have the final L and P as their last values in the algorithm. Note that the algorithm above is built into MATLAB, and can be called via lu:

    >> A = [2 1 3; 4 -1 2; -1 4 1]
    A =
         2     1     3
         4    -1     2
        -1     4     1

    >> [L,U,P] = lu(A)
    L =
        1.0000         0         0
       -0.2500    1.0000         0
        0.5000    0.4000    1.0000

    U =
        4.0000   -1.0000    2.0000
             0    3.7500    1.5000
             0         0    1.4000

    P =
         0     1     0
         0     0     1
         1     0     0
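For comparison with the built-in lu, the four-step procedure with partial pivoting can be written out by hand. The following is a minimal sketch for small dense matrices, not an efficient or robust replacement for lu; the variable names are chosen here for illustration, and L stores the values $l_{ri}$ directly (that is, the negatives of the multipliers).

```matlab
% LU factorization with partial pivoting (PA = LU), following Steps 1-4.
% Illustrative sketch for small dense matrices; not a replacement for lu.
A = [2 1 3; 4 -1 2; -1 4 1];
n = size(A,1);
U = A;  L = eye(n);  P = eye(n);
for i = 1:n-1
    % Step 1: largest entry (in absolute value) at or below the diagonal
    [~, k] = max(abs(U(i:n,i)));
    j = k + i - 1;
    % Step 2: swap rows i and j of U and P, and columns 1..i-1 of L
    U([i j],:)     = U([j i],:);
    P([i j],:)     = P([j i],:);
    L([i j],1:i-1) = L([j i],1:i-1);
    % Steps 3-4: eliminate below the pivot and record the entries of L
    for r = i+1:n
        L(r,i) = U(r,i) / U(i,i);
        U(r,:) = U(r,:) - L(r,i)*U(i,:);
    end
end
disp(L); disp(U); disp(P);
disp(norm(P*A - L*U));       % expected to be at roundoff level
```

Applied to the matrix of Example 5, the factors should match the hand computation and the output of lu up to roundoff.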


Remark 6. If A is real, symmetric, and positive definite ($x^T A x > 0$ for all $x \in \mathbb{R}^n \setminus \{0\}$), the above process can be modified to produce the Cholesky factorization $A = LL^T$. This factorization is more efficient, requires no pivoting, and reduces storage since only one triangular matrix needs to be stored. MATLAB has this factorization built in as well, and it can be called via the chol command.
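As a small illustration of Remark 6, the sketch below factors a symmetric positive definite matrix with chol and solves a linear system with the single factor. The 3-by-3 matrix and right-hand side are made-up test data; the 'lower' option asks chol for the lower triangular factor L (by default chol returns the upper triangular factor).

```matlab
% Cholesky factorization of a small SPD matrix (made-up test data).
A = [4 2 2; 2 5 3; 2 3 6];
b = [8; 10; 11];
L = chol(A, 'lower');        % lower triangular factor, A = L*L'
x = L' \ (L \ b);            % forward substitution, then back substitution
disp(norm(A - L*L'));        % expected to be at roundoff level
disp(norm(A*x - b));
```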

2.3.1 Work in LU/GE

How many floating point operations (flops) are needed to perform PLU? At Step i, we need to compare n − i + 1 entries $a_{ii}, \ldots, a_{ni}$ and choose the largest one in modulus. After pivoting, we need n − i divisions to get the multipliers used for the row operations. To eliminate each subdiagonal (j, i) entry (i < j ≤ n) in the ith column, it takes n − i scalar multiplications and n − i additions to subtract a multiple of the new ith row from the jth row. In total, there will be

$$\sum_{i=1}^{n-1} (n-i+1) = \tfrac{1}{2}(n+2)(n-1) \approx \tfrac{1}{2} n^2$$

scalar comparisons,

$$\sum_{i=1}^{n-1} \left[(n-i) + (n-i)(n-i)\right] = \tfrac{1}{3}(n^3 - n) \approx \tfrac{1}{3} n^3$$

multiplication/divisions, and

$$\sum_{i=1}^{n-1} (n-i)(n-i) = \tfrac{1}{6} n(n-1)(2n-1) \approx \tfrac{1}{3} n^3$$

addition/subtractions. If multiplications and additions take about the same time (which is the case nowadays on many platforms), we can combine them and simply say that the arithmetic cost for LU factorization is $\frac{2}{3} n^3$ flops plus $\frac{1}{2} n^2$ comparisons.

The above estimates assume that every entry of A is nonzero. If A has certain special nonzero structures, the estimate may become smaller. For instance, if all entries of A below the kth subdiagonal are zero, where k is independent of n, then the total arithmetic cost is at most $\mathcal{O}(kn^2)$. This is because at each step of the factorization, at most k − 1 entries need to be eliminated, and hence at most k − 1 rows will be updated. In addition, for such a matrix A, if all entries above the kth superdiagonal are zero, then the total work is at most $\mathcal{O}(k^2 n)$.

Finally, assume that we have completed the LU factorization and have P, L, and U such that PA = LU. To solve the linear system Ax = b, note that it is equivalent to LUx = PAx = Pb, which leads to $x = U^{-1} L^{-1} P b$. This can be evaluated by (a) solving the lower triangular system Ly = Pb by forward substitution, then (b) solving the upper triangular system Ux = y by back substitution. Note that if one needs to solve many linear systems $Ax_k = b_k$ (k = 1, 2, ...) with the same matrix and different right-hand sides $b_1, b_2, \ldots$, then at the first step one does $\mathcal{O}(n^3)$ work to create the LU factorization, but then reuses L and U so that each additional solve needs only $\mathcal{O}(n^2)$ work.
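The factor-once, solve-many pattern described above can be sketched as follows. The matrix is the one from Examples 4 and 5; the right-hand sides stored as the columns of B are illustrative, with the first column equal to the right-hand side of Example 4.

```matlab
% Factor once, then reuse L, U, P for several right-hand sides.
A = [2 1 3; 4 -1 2; -1 4 1];
B = [5 1 0; -1 0 1; 7 2 3];      % columns are right-hand sides b_1, b_2, b_3
[L, U, P] = lu(A);               % O(n^3) work, done once
X = zeros(size(B));
for k = 1:size(B,2)
    y = L \ (P*B(:,k));          % (a) forward substitution, O(n^2)
    X(:,k) = U \ y;              % (b) back substitution,    O(n^2)
end
disp(X(:,1)');                   % solution for the right-hand side of Example 4
disp(norm(A*X - B));
```

When backslash is given a triangular matrix, MATLAB solves by substitution rather than refactorizing, so each pass through the loop costs only the two triangular solves.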

2.4 Direct methods for large sparse linear systems

Direct methods such as LU and Cholesky factorizations can also be used to solve large sparse linear systems. Such systems typically have mostly zeroes, and a smart algorithm should not repeatedly zero-out zeros. Typically, for such linear systems arising from discretization of partial differential equations in 2-D domains, direct methods can be quite competitive. Overall, the success of direct methods primarily depends on our ability to construct a factorization that is sufficiently sparse. In particular, what is needed is a reordering of the rows and columns so that after the factorization is created, the factor matrices (e.g., the L and U) have far fewer nonzeros than they would have otherwise.

It is easy to construct such a factorization for a symmetric positive definite matrix. We first apply a fill-reducing ordering of A, represented as a permutation matrix P, such that the Cholesky factorization of the reordered matrix $\hat{A} = P^T A P = \hat{L}\hat{L}^T$ leads to a factor $\hat{L}$ sparser than the counterpart L for $A = LL^T$. The solution to Ax = b is $x = A^{-1} b = L^{-T} L^{-1} b$ or $x = P \hat{L}^{-T} \hat{L}^{-1} P^T b$, but the latter corresponds to a computationally more efficient solve thanks to the lower density (fewer nonzero entries) of $\hat{L}$. The following example shows this:

    >> A = delsq(numgrid('B',512)); b = randn(length(A),1);
    >> tic; L = chol(A,'lower'); toc;   % time to factorize the original A
    Elapsed time is 2.960520 seconds.
    >> tic; x = L'\(L\b); toc;   % time to solve with the original L factor
    Elapsed time is 0.259174 seconds.
    >> tic; p = symamd(A); toc;   % time for fill-reducing permutation
    Elapsed time is 0.251770 seconds.
    >> P = speye(size(A)); P = P(:,p);
    >> tic; Lhat = chol(A(p,p),'lower'); toc;   % time to factorize the permuted A
    Elapsed time is 0.380372 seconds.
    >> tic; xhat = P*(Lhat'\(Lhat\(P'*b))); toc;   % time to solve with the sparser L factor
    Elapsed time is 0.050527 seconds.
    >> disp([nnz(L) nnz(Lhat)]);
        86216840     5848939
    >> disp([norm(A*x-b)/norm(b) norm(A*xhat-b)/norm(b)]);
       2.6099e-14   7.0041e-15

Here, A is a matrix arising from the discrete negative Laplacian on a 2-D region, and $\hat{A}$ is a reordered version of A, using the symmetric approximate minimum degree permutation (symamd). The above result shows that the Cholesky factor of $\hat{A}$ is about 15 times sparser than that of the original A (about $5.8 \times 10^6$ entries vs. $8.6 \times 10^7$ entries, shown by nnz). Direct solves using the two Cholesky factors produce numerically equivalent solutions to the linear system Ax = b, but the one with sparser factors needs fewer arithmetic operations and less storage. Consequently, the time needed to factorize the permuted matrix $\hat{A}$ is much less than that for factorizing the original matrix A (0.38 vs. 2.96 seconds), and solving the linear system Ax = b using the sparser Cholesky factor is faster than the solve using the original Cholesky factor (0.05 vs. 0.26 seconds).

Similar strategies can be applied to nonsymmetric and symmetric indefinite matrices, such as using colamd in MATLAB. This is not always as effective, but in general can still give considerable improvement over non-reordered matrices.

Quality software packages based on the above ideas have been developed. In MATLAB, the function lu with five output parameters generates a diagonal scaling matrix S (named R in the code below), permutation matrices P and Q, a unit lower triangular L and an upper triangular U, such that $P S^{-1} A Q = LU$. The solution to Ax = b is therefore $x = A^{-1} b = Q U^{-1} L^{-1} P S^{-1} b$. For symmetric indefinite matrices, the function ldl with four output parameters generates a diagonal scaling matrix S, a permutation P, a unit lower triangular L, and a block diagonal D with 1×1 or 2×2 blocks along the diagonal, such that $P^T S A S P = L D L^T$. The solution to Ax = b is $x = A^{-1} b = S P L^{-T} D^{-1} L^{-1} P^T S b$. Here is an example:

    >> n = 512^2; A = gallery('neumann', n) - speye(n);
    >> b = randn(n,1);
    >> [L,U,P,Q,R] = lu(A);
    >> xa = Q*(U\(L\(P*(R\b))));
    >> B = (A+A')/2;   % symmetric part of A, indefinite
    >> [L,D,P,S] = ldl(B);
    >> xb = S*(P*(L'\(D\(L\(P'*(S*b))))));
    >> disp([norm(A*xa-b)/norm(b) norm(B*xb-b)/norm(b)]);
       9.7249e-11   4.5134e-11

Finally, we note that MATLAB has the "backslash" operator, which is a very efficient solver for sparse linear systems. For dense systems, it is essentially just the PLU algorithm. For sparse matrices (which need to be of sparse type in MATLAB), it uses the efficient reorderings and factorizations described above, and automatically multi-threads the process. If the factorization of a coefficient matrix does not need to be saved (e.g., if this coefficient matrix appears in only one linear system), then in MATLAB, backslash is the direct solver of choice. If it does need to be saved (say, if we solve multiple linear systems with the same coefficient matrix), then we follow the above examples to compute the LU, LDL, or Cholesky factors of A first, depending on the symmetry and definiteness of A, and solve Ax = b using the triangular factors.


2.5 Conjugate gradient

We discuss now an important class of iterative methods for solving large sparse linear systems approximately. Given an initial approximation $x_0$, iterative methods generate a sequence of approximate solutions $x_1, x_2, \ldots$ that gradually converge to the solution $x = A^{-1} b$. Specifically, we consider Krylov subspace methods, which construct subspaces

$$\mathcal{K}_k(A, r_0) = \operatorname{span}\{r_0, A r_0, A^2 r_0, \ldots, A^{k-1} r_0\}$$

of low dimension ($k \ll n$, e.g., $k = 100$ for $n = 10^7$), where $r_0 = b - A x_0$, such that good approximate solutions $x_k$ can be extracted from the space $W_k = x_0 + \mathcal{K}_k(A, r_0)$.

The conjugate gradient method (CG) for symmetric positive definite (SPD) linear systems is the earliest development. Subsequent milestones include the minimal residual method (MINRES) and simplified quasi-minimal residual (SQMR) for symmetric indefinite systems, and the generalized minimal residual method (GMRES), bi-conjugate gradient stabilized (BI-CGSTAB), BI-CGSTAB(ℓ)¹ and induced dimension reduction² (IDR(s)) for nonsymmetric linear systems. We will concentrate exclusively on CG for SPD systems, and note that theory for methods for non-SPD matrices (such as GMRES, BI-CGSTAB and IDR) is available but more complex.³ The way one uses these other methods in MATLAB is essentially identical to how one uses CG for solving SPD systems. Another excellent book on Krylov methods and applications has recently been published by Olshanskii and Tyrtyshnikov.⁴

The Conjugate Gradient (CG) method for SPD Ax = b

    Choose tol > 0, $x_0$, compute $r_0 = b - A x_0$, set $p_0 = r_0$.
    For k = 0, 1, ..., until convergence
        $\alpha_k = \dfrac{r_k^T r_k}{p_k^T A p_k}$
        $x_{k+1} = x_k + \alpha_k p_k$,  $r_{k+1} = r_k - \alpha_k A p_k$
        If $\dfrac{\|r_{k+1}\|_2}{\|b\|_2} \le$ tol, exit
        $\beta_k = \dfrac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$,  $p_{k+1} = r_{k+1} + \beta_k p_k$
    End For
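The algorithm above translates almost line by line into MATLAB. The following is a minimal unpreconditioned sketch, not the text's own code: the test matrix (a 1-D discrete negative Laplacian), the tolerance, and the iteration cap maxit are all chosen here for illustration, and the scalar rho simply caches $r_k^T r_k$ so that it is not recomputed.

```matlab
% Unpreconditioned CG for a small SPD system, following the algorithm above.
% Illustrative sketch: test matrix, tolerance and iteration cap are made up.
n = 50;
e = ones(n,1);
A = spdiags([-e 2*e -e], -1:1, n, n);   % 1-D discrete negative Laplacian (SPD)
b = ones(n,1);
tol = 1e-10;  maxit = 4*n;
x = zeros(n,1);                         % initial guess x_0
r = b - A*x;  p = r;
rho = r'*r;                             % r_k' * r_k
for k = 0:maxit-1
    Ap = A*p;                           % the one matrix-vector product
    alpha = rho / (p'*Ap);
    x = x + alpha*p;
    r = r - alpha*Ap;
    rho_new = r'*r;
    if sqrt(rho_new)/norm(b) <= tol     % relative residual test
        break
    end
    beta = rho_new / rho;
    p = r + beta*p;
    rho = rho_new;
end
fprintf('stopped after %d iterations, relative residual %.2e\n', ...
        k+1, norm(b - A*x)/norm(b));
```

The residual printed at the end is recomputed from its definition b − Ax rather than taken from the recurrence, since the two can drift apart slightly in floating point arithmetic.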

1 BI-CGSTAB(ℓ) is described in BiCGstab(ℓ) for linear equations involving unsymmetric matrices with complex spectrum, G. L. G. Sleijpen and D. R. Fokkema, Elec. Trans. Numer. Anal., Vol. 1 (1993), pp. 11–32.
2 IDR(s) is described in IDR(s): a family of simple and fast algorithms for solving large nonsymmetric systems of linear equations, P. Sonneveld and M. B. van Gijzen, SIAM J. Sci. Comput., Vol. 31 (2008), pp. 1035–1062.
3 For more details about Krylov subspace methods, see Iterative methods for sparse linear systems (2nd edition) by Y. Saad, SIAM, 2003.
4 M. A. Olshanskii and E. E. Tyrtyshnikov, Iterative Methods for Linear Systems, Theory and Applications, SIAM (Philadelphia), 2014.


The CG algorithm itself is fairly straightforward. In each iteration, the computational work is one matrix-vector multiplication $A p_k$, two inner products $p_k^T A p_k$ and $r_{k+1}^T r_{k+1}$ ($r_k^T r_k$ has been evaluated in the previous iteration), and three vector updates for $x_{k+1}$, $r_{k+1}$ and $p_{k+1}$. The dominant cost of CG is the one matrix-vector product required at every iteration. If A is dense, this costs $\mathcal{O}(n^2)$ flops. However, if A is sparse with at most m nonzeros ($m \ll n$) in each row, then the cost is just $\mathcal{O}(mn)$ flops per iteration. Note that for matrices arising from finite element or finite difference methods, m is small, maybe 5 to 20, and does not grow when n grows (we will learn more about this in later chapters). In fact, CG is so computationally efficient that one would not need to consider an alternative Krylov subspace method for solving large SPD linear systems.

The mathematics behind CG is not quite straightforward and probably more appropriate for a graduate level course in numerical analysis. Nevertheless, we summarize the major theorems for this algorithm.

Theorem 7. The kth CG iterate $x_k$ is the unique vector in $W_k = x_0 + \mathcal{K}_k(A, r_0)$ for which the A-norm error $\|e\|_A = \sqrt{(x - x^*)^T A (x - x^*)}$ or, equivalently, $\varphi(x) = \frac{1}{2} x^T A x - b^T x$ is minimized over all $x \in W_k$. Here $x^* = A^{-1} b$ is the exact solution.

This theorem means that at step k, $x_k$ is the best solution in the affine space $W_k$. Noting that $W_k$ typically has dimension k, a more delicate argument (not explored here) guarantees that CG will converge to the exact solution in at most n steps, if there were no round-off errors. Unfortunately, round-off error is an issue with CG, as there is an implicit orthogonalization process, and such processes are round-off sensitive. Hence, CG may not get the exact solution in n steps due to the problems with round-off. However, CG need not get the solution exactly for the algorithm to converge, and it may get a good approximate solution in fewer than n steps. Moreover, with a good preconditioner, as we mention below, the number of steps needed to converge can be reduced even further.

Theorem 8. Let $\kappa = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}$, where $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ denote the largest and the smallest eigenvalues of A (both positive). After k steps of CG, the error $e_k = x_k - x^*$ satisfies

$$\frac{\|e_k\|_A}{\|e_0\|_A} \le 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^k.$$

In addition, if A has only m distinct eigenvalues, then CG converges to the exact solution $x^* = A^{-1} b$ in at most m steps.

Here, $\kappa = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}$ for an SPD matrix equals its 2-condition number, defined as $\kappa_2(A) = \|A\|_2 \|A^{-1}\|_2$ (the equivalence is not valid for non-SPD matrices). Theorem 8 suggests that a smaller condition number κ of A (or having eigenvalues closer together) usually leads to faster convergence of CG, but a large κ does not necessarily mean the convergence is slow, especially if A has only a few distinct eigenvalues.
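To get a feeling for the bound in Theorem 8, one can compute the smallest k for which its right-hand side drops below a given tolerance. The values of kappa and tol below are illustrative choices, not taken from the text; since the theorem gives only an upper bound on the error, CG often converges in fewer steps than this estimate.

```matlab
% Smallest k for which the bound of Theorem 8 drops below a tolerance.
% kappa and tol are illustrative values chosen for this sketch.
kappa = 100;
tol   = 1e-8;
rho = (sqrt(kappa) - 1) / (sqrt(kappa) + 1);   % contraction factor per step
k   = ceil(log(tol/2) / log(rho));             % solves 2*rho^k <= tol for k
fprintf('kappa = %g: bound guarantees tol = %g after k = %d steps\n', ...
        kappa, tol, k);
```

Because the contraction factor depends on $\sqrt{\kappa}$, the estimated number of steps grows roughly in proportion to $\sqrt{\kappa}$ for large κ, which is one reason why reducing the condition number pays off.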

We give an example to illustrate the use of CG. In the last section of this chapter, we will discuss the crucial role of a technique called "preconditioning" and provide a MATLAB code for preconditioned CG with additional examples.

    % a discrete negative 2-D Laplacian
    >> A = delsq(numgrid('L',512));
    >> rng default;
    >> b = randn(length(A),1);   % right-hand side vector
    % use CG to find an approximate solution to relative tol 1e-8
    >> x = pcg(A,b,1e-8,1500);
    pcg converged at iteration 1430 to a solution with relative residual 1e-08.

Although we will not discuss other types of iterative solvers, we note that if A were nonsymmetric and one wanted to use the solver BI-CGSTAB, then one can simply replace pcg in the code above with bicgstab. Other solvers are available and have been built into MATLAB, such as GMRES and BI-CGSTAB(ℓ).

2.6 Preconditioning and ILU

In the previous section, we used CG to solve linear systems Ax = b approximately. This approach is called an unpreconditioned solve, and it usually suffers from slow convergence. To speed up the convergence, a crucial technique called preconditioning needs to be applied. Nowadays, it is widely acknowledged that unpreconditioned methods are not efficient, and that it is preconditioning that makes these methods competitive for solving large sparse linear systems.

Generally speaking, preconditioning is a linear process represented by a matrix M ≈ A, which transforms the original linear system into a new one with the same solution. Consider the “left-sided” preconditioned linear system $M^{-1}Ax = M^{-1}b$. An approximate solution to the preconditioned system is also an approximate solution to the original system. With a good preconditioner M ≈ A, the preconditioned coefficient matrix $M^{-1}A$ should have most eigenvalues clustered around 1, with a small number of outliers. Consequently, if such an M can be constructed, iterative methods generally converge to the solution in significantly fewer iterations than they do for the original system.

It is important to understand that the preconditioned coefficient matrix $M^{-1}A$ should in practice never be formed explicitly. Instead, when a Krylov subspace method is applied, the action of $M^{-1}A$ on a vector $p_k$ is obtained in two stages: first a multiplication by A is done, followed by a linear solve with the matrix M, or some other linear procedure, to compute


the action of $M^{-1}$ on $Ap_k$. Hence, it is important that the action of $M^{-1}$ on any vector be cheap to compute, in addition to M clustering the eigenvalues of $M^{-1}A$.

What could be a reasonably good approximation of A that is also relatively cheap to apply? One of the earliest developments of general-purpose preconditioners is the incomplete LU factorization (ILU). For example, let PA = LU be a complete LU factorization with partial pivoting, and let $\tilde{L}$, $\tilde{U}$ be approximations to L, U, respectively, with fewer nonzero entries. Consequently, we have $PA \approx \tilde{L}\tilde{U}$ and, therefore, $M^{-1}u = \tilde{U}^{-1}\tilde{L}^{-1}Pu$ can define the action of preconditioning. Note that such incomplete LU factors are obtained by dropping certain entries from the triangular factors during the factorization; they should not be obtained by computing the complete LU factors first and then dropping entries (if we could perform the complete LU factorization, then we would simply solve the system directly). We remark that if the matrix is symmetric positive definite, the incomplete Cholesky factorization is a more efficient variant of ILU that produces only one factor L, such that $A \approx LL^T$.

Consider now an example of using a variant of ILU to precondition a linear system. Below we solve a 195,075 × 195,075 SPD linear system, first with unpreconditioned CG, then using incomplete Cholesky (ICHOL) as a preconditioner. Note that MATLAB’s pcg allows for the function $u \mapsto QL^{-T}L^{-1}Q^Tu$ to be input as the preconditioner, where Q is the approximate minimum degree permutation matrix for reducing fill-ins, and L is the lower triangular ICHOL factor. We also output FLAG (0 means convergence), RELRES (relative residual of the final iterate), ITER (total number of iterations performed), and RESVEC (a vector of absolute residual norms at each step).

    A = delsq(numgrid('L',512));
    b = randn(length(A),1); % right-hand side vector
    tic; [x,FLAG,RELRES,ITER,RESVEC] = pcg(A,b,1e-8,2000); toc;
    [FLAG, ITER]
    % precondition with incomplete Cholesky (ichol since A is SPD)
    q = symamd(A);
    Q = speye(size(A)); Q = Q(:,q);
    L = ichol(A(q,q), struct('type','ict','droptol',1e-3));
    % the function implementing the action of preconditioning
    mfun = @(u) Q*(L'\(L\(Q'*u)));
    tic; [x,FLAGp,RELRESp,ITERp,RESVECp] = pcg(A,b,1e-8,500,mfun); toc;
    [FLAGp, ITERp]

The output we get is

    Elapsed time is 9.350714 seconds.
    ans =
         0        1402
    Elapsed time is 1.792773 seconds.
    ans =
         0          53

The methods both converged (flags are 0), and the preconditioner allows for convergence in 53 iterations instead of 1402. In addition, unpreconditioned CG takes 9.35 seconds to converge, whereas ICHOL-preconditioned CG only needs 1.79 seconds. RESVEC and RESVECp let us see how the methods converged. See Figure 2.1.

Figure 2.1: Convergence of unpreconditioned and ICHOL-preconditioned PCG.

Finally, as mentioned above, a good preconditioner M should cluster the eigenvalues of $M^{-1}A$. Note that if M = A, then all eigenvalues will be 1, and CG would converge in one step (see footnote 5). We wish we could plot all eigenvalues of A and $M^{-1}A$ to see the effect of eigenvalue clustering, but determining all the eigenvalues of a large matrix whose order is in the tens of thousands or more is too expensive. Fortunately, one may see such an effect by looking at analogous smaller matrices. Below are the eigenvalues of an analogous but much smaller A plotted on the complex plane, along with the eigenvalues of $M^{-1}A$, where $M = LL^T$ comes from the incomplete Cholesky factorization. In fact, since CG is applicable to symmetric matrices only, the preconditioned system should be formulated as $L^{-1}AL^{-T}\tilde{x} = L^{-1}b$ (called “split preconditioning”), where an approximate solution $\tilde{x}_k$ can be used to retrieve an approximate solution $x_k = L^{-T}\tilde{x}_k$ to the original linear system Ax = b. Note that the eigenvalues of the preconditioned coefficient matrix $L^{-1}AL^{-T}$ (identical to those of $M^{-1}A$) are much closer together, which is exactly what preconditioning should achieve. See Figure 2.2.
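The clustering effect described above can be reproduced on a small matrix. The sketch below makes some assumptions that are not in the text: the grid size 20 is an arbitrary choice, and the default zero-fill `ichol(A)` is used in place of the threshold-based variant from the earlier example.

```matlab
% Small analogue of the eigenvalue clustering experiment.
% Assumptions: grid size 20 and zero-fill ichol are illustrative choices.
A = delsq(numgrid('L', 20));      % small SPD discrete Laplacian (sparse)
L = ichol(A);                     % incomplete Cholesky factor, A ~ L*L'
S = full(L \ A / L');             % split-preconditioned matrix L^{-1} A L^{-T}
S = (S + S') / 2;                 % symmetrize to remove round-off asymmetry
eA = eig(full(A));                % eigenvalues of A
eS = eig(S);                      % eigenvalues of the preconditioned matrix
fprintf('A:       eigenvalues in [%.4f, %.4f], ratio %.1f\n', ...
        min(eA), max(eA), max(eA)/min(eA));
fprintf('precond: eigenvalues in [%.4f, %.4f], ratio %.1f\n', ...
        min(eS), max(eS), max(eS)/min(eS));
```

Forming `S` explicitly is done here only to inspect the spectrum of a small matrix; as stressed earlier, the preconditioned matrix should not be formed in an actual solve.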

Figure 2.2: Eigenvalues of the matrix A (left) and the clustered eigenvalues of the preconditioned matrix $L^{-1}AL^{-T}$ (right).

5 But in this case, the action of preconditioning $M^{-1}$ applied to b is equivalent to the linear solve $x = A^{-1}b$ itself, which is assumed to be too expensive here. A tradeoff is needed between the quality/cost of preconditioning and the number of iterations.

2.7 Accuracy in solving linear systems

An important question which we now consider is whether numerical solutions to linear systems are accurate. Some systems are very sensitive to small changes in data or roundoff error, and thus their answers are potentially inaccurate. Other systems are not sensitive, and their solutions are likely good. We will quantify the sensitivity and accuracy of systems with the notion of matrix conditioning.

There are two major sources of error that arise when solving linear systems of equations. The first comes from poor representation of the equations in the computer. This arises in the 16th digit from rounding error, but also, if the equations are created from experiments, then there is likely measurement error in the fourth (or so) digit in each entry of A and b. Hence, although one wants to solve Ax = b, one is really solving $\hat{A}\hat{x} = \hat{b}$. The question then arises, how close is $\hat{x}$ to x? Note that this type of error is not from numerical calculations, but from error in the representation of the linear system.

The second source of error comes from the calculations that produce a numerical solution to Ax = b. When GE (or some variant of it) is used as the linear solver, the numerical error produced may be in the last few digits of the solution components (i. e., relative error is small). With other types of solvers such as CG, we may accept an approximate solution when the relative residual drops to $10^{-6}$. We will aim to quantify this phenomenon in this chapter, too.

2.7.1 Matrix and vector norms and condition number

In order to study how “close” two vectors are, we introduce norms. Some commonly used vector norms are:
1. Infinity norm: $\|v\|_\infty = \max_{1\le i\le n}|v_i|$
2. Euclidean norm (2-norm): $\|v\|_2 = \left(\sum_{i=1}^n v_i^2\right)^{1/2}$

Although the intuitive measure of distance is the 2-norm, mathematically speaking, any function ‖ ⋅ ‖ that satisfies the following four properties can be considered a mathematical measure of distance, which we refer to as a norm:
1. ‖x‖ ≥ 0
2. ‖x‖ = 0 if and only if x = 0
3. ‖αx‖ = |α| ‖x‖ for all α ∈ ℝ
4. ‖x + y‖ ≤ ‖x‖ + ‖y‖

The reason why we define a different way to measure the “size” of a vector, as opposed to just using the Euclidean norm for everything, is that all norms in finite dimensions are equivalent (not equal, however), and so we want to pick one that is easy/cheap to perform calculations with.

To show that $\|v\|_\infty$ is a norm for vectors in $ℝ^n$, we must show that each of the 4 properties of a norm holds for all vectors $v \in ℝ^n$.

Lemma 9. The function $\|v\|_\infty = \max_{1\le i\le n}|v_i|$ defines a norm for vectors in $ℝ^n$.

Proof. We prove this by showing the function satisfies all 4 properties of a norm, with v being an arbitrary vector in $ℝ^n$. For property 1, we can immediately see that $\|v\|_\infty \ge 0$ since it is a max of nonnegative numbers $|v_i|$. Property 2 is an if-and-only-if statement, and so we must show both implications. If v = 0, then each of its components is zero, and so $\|v\|_\infty = \max_i |v_i| = 0$. Conversely, if $\|v\|_\infty = 0$, then $\max_i |v_i| = 0$, but if the max of nonnegative numbers is 0, then they must all be 0. Hence v = 0. To prove property 3, we calculate

$$\|\alpha v\|_\infty = \max_{1\le i\le n}|\alpha v_i| = \max_{1\le i\le n}|\alpha||v_i| = |\alpha|\max_{1\le i\le n}|v_i| = |\alpha|\|v\|_\infty,$$

where we use only properties of real numbers. Similarly, we only need properties of real numbers to prove property 4:

$$\|v + w\|_\infty = \max_{1\le i\le n}|v_i + w_i| \le \max_{1\le i\le n}\left(|v_i| + |w_i|\right),$$

which is upper bounded if we apply the max to both terms,

$$\max_{1\le i\le n}\left(|v_i| + |w_i|\right) \le \max_{1\le i\le n}|v_i| + \max_{1\le i\le n}|w_i| = \|v\|_\infty + \|w\|_\infty.$$
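The two vector norms, the triangle inequality, and the notion of norm equivalence can be checked numerically with MATLAB's `norm`. This is only a sketch: the vectors v and w are illustrative and not from the text, and the equivalence bounds $\|v\|_\infty \le \|v\|_2 \le \sqrt{n}\,\|v\|_\infty$ are a standard fact quoted here without proof.

```matlab
% Illustrative vectors (hypothetical, not from the text)
v = [3; -4; 1];
w = [-2; 5; 0.5];
n = numel(v);
ninf = norm(v, inf);            % max_i |v_i|
n2   = norm(v, 2);              % sqrt(sum_i v_i^2)
% Property 4 (triangle inequality) in the infinity norm
lhs = norm(v + w, inf);
rhs = norm(v, inf) + norm(w, inf);
fprintf('||v||_inf = %g, ||v||_2 = %g\n', ninf, n2);
fprintf('triangle inequality: %g <= %g\n', lhs, rhs);
% Norm equivalence: ||v||_inf <= ||v||_2 <= sqrt(n)*||v||_inf
fprintf('equivalence: %g <= %g <= %g\n', ninf, n2, sqrt(n)*ninf);
```

The two norms give different numbers, but they differ by at most a factor of $\sqrt{n}$, which is the sense in which they are equivalent.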

We will also need to discuss the “size” of a matrix. We will consider (only) matrix norms induced by vector norms, that is,

$$\|A\| = \max_{\|x\|=1}\|Ax\|,$$

where ‖ ⋅ ‖ is a vector norm (it could be the 2-norm, the ∞-norm, or other). We leave the proof that ‖ ⋅ ‖ is a norm for matrices in $ℝ^{n\times n}$ as an exercise. Proving it requires proving, for arbitrary $A, B \in ℝ^{n\times n}$:
1. ‖A‖ ≥ 0
2. ‖A‖ = 0 if and only if A = 0
3. ‖αA‖ = |α|‖A‖ for all α ∈ ℝ
4. ‖A + B‖ ≤ ‖A‖ + ‖B‖.

Additionally, since we define matrix norms as being induced by vector norms, we also have the properties, for any $A, B \in ℝ^{n\times n}$ and $x \in ℝ^n$:
5. ‖Ax‖ ≤ ‖A‖ ‖x‖
6. ‖AB‖ ≤ ‖A‖‖B‖

The ∞-matrix norm is simple and easy to calculate (although we omit the derivation of this formula). It is the maximum absolute row sum of a matrix:

$$\|A\|_\infty = \max_{1\le i\le n}\sum_{j=1}^n |a_{ij}|.$$

Unfortunately, no such simple formula in terms of the entries exists for the 2-matrix norm, and calculating it is not cheap. For this reason, and because all norms are equivalent in finite dimensions, we will use the ∞-matrix and vector norm for the rest of this chapter. What equivalent means essentially is that although we will get different numbers with different norms, the numbers are usually relatively close and, in practice, typically of the same or similar order of magnitude.

Example 10. Find $\|A\|_\infty$ and $\|v\|_\infty$ for

$$A = \begin{pmatrix} 1 & -3 & 7 \\ -3 & 1 & 10 \\ -10 & 8 & -4 \end{pmatrix}, \qquad v = \begin{pmatrix} 1 \\ -6 \\ 5 \end{pmatrix}.$$

For the vector norm, we calculate $\|v\|_\infty = \max\{1, 6, 5\} = 6$. For the matrix norm, we calculate the absolute row sums and take the largest, $\|A\|_\infty = \max\{11, 14, 22\} = 22$.

2.7.2 Condition number of a matrix

Recall the concept of matrix inverses. If a square matrix A is nonsingular, then there exists a matrix $A^{-1}$ such that $AA^{-1} = I$ and $A^{-1}A = I$. Recall from linear algebra that $A^{-1}$ can be found using Gauss-Jordan elimination (like GE, but zero out above the diagonal as well) applied to all columns of the matrix. In general, we do not need to actually calculate $A^{-1}$; we just need to know it exists.

We now define the condition number of a matrix:

$$\operatorname{cond}(A) = \|A\|\|A^{-1}\|.$$

The condition number satisfies 1 ≤ cond(A) ≤ ∞, and its value increases as the rows (or columns) of A get closer to being linearly dependent (i. e., the matrix is getting closer to being singular). That is, if we think of the rows (or columns) of a matrix as n-dimensional vectors, then if they are all perpendicular to each other and of equal length, the 2-condition number is 1. However, as the smallest angle made by the vectors shrinks, the 2-condition number grows in an inversely proportional manner.

As we will see in the next section, the condition number is the fundamental measure of sensitivity in solving linear systems. If the condition number of A is small, then we say Ax = b is a well-conditioned system and we expect an accurate numerical solution by stable algorithms (such as GE with partial pivoting). However, if the condition number of A is large, then the system is called ill-conditioned and the numerical solution is most likely inaccurate.

2.7.3 Sensitivity in linear system solving

We begin this section with an example.

Example 11. Consider the system of equations

$$Ax = \begin{pmatrix} 6 & -2 \\ 11.5 & -3.85 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 10 \\ 17 \end{pmatrix} = b.$$

The solution to this system is $x = [45\ \ 130]^T$. However, for the slightly perturbed system

$$\tilde{A}\tilde{x} = \begin{pmatrix} 6 & -2 \\ 11.5 & -3.84 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 10 \\ 17 \end{pmatrix} = b,$$

the solution to the system is $\tilde{x} = [110\ \ 325]^T$.

Why is there such a difference in solutions? After all, we made just a “small” change to one entry of A. The reason comes down to the sensitivity of the matrix, which can be measured by its condition number. If a matrix is ill-conditioned (large condition number), then small changes to data/entries can cause large changes in solutions. Obviously, we want to be aware of such problems. This fact can be mathematically described as follows.

Theorem 12. Assume A is nonsingular and let Ax = b and $\hat{A}\hat{x} = b$. Then

$$\frac{\|x - \hat{x}\|}{\|\hat{x}\|} \le \operatorname{cond}(A)\,\frac{\|A - \hat{A}\|}{\|A\|}.$$

Proof. Since b = Ax and $b = \hat{A}\hat{x}$, we have $Ax = \hat{A}\hat{x}$, and subtracting $A\hat{x}$ from both sides gives

$$A(x - \hat{x}) = (\hat{A} - A)\hat{x}.$$

Multiplying both sides by $A^{-1}$ gives

$$(x - \hat{x}) = A^{-1}(\hat{A} - A)\hat{x},$$

and then taking norms of the vectors on both sides gives

$$\|x - \hat{x}\| = \|A^{-1}(\hat{A} - A)\hat{x}\| \le \|A^{-1}\|\|(\hat{A} - A)\hat{x}\| \le \|A^{-1}\|\|\hat{A} - A\|\|\hat{x}\|,$$

with the two inequalities coming from the matrix norm properties. Next, divide both sides by $\|\hat{x}\|$ and multiply the right-hand side by $1 = \|A\|/\|A\|$ to reveal

$$\frac{\|x - \hat{x}\|}{\|\hat{x}\|} \le \|A^{-1}\|\|\hat{A} - A\|\,\frac{\|A\|}{\|A\|}.$$

Using the definition of cond(A) and rearranging completes the proof.

The consequence of this theorem is that the relative change made to A (or, equivalently, the error in the representation of A) is magnified in the solution by up to cond(A). In fact, we can think of the theorem as saying: the percent difference between the solutions is bounded by the percent difference between the matrices times the condition number.

Example 13. Suppose we wish to solve Ax = b, but we are sure of only the first 6 digits in each entry of A. If the condition number of A is $10^2$ in the ∞-norm, how many digits of the solution do we expect will agree with the true solution?

Here, we are solving $\hat{A}\hat{x} = b$, since we are solving a linear system but not using A exactly. Using the infinity norm, $\frac{\|A - \hat{A}\|_\infty}{\|A\|_\infty} \le 10^{-6}$ since we are sure of only the first 6 digits of A (with the infinity norm, “digits of agreement” and “relative difference” are directly related in this way). The theorem shows that

$$\frac{\|x - \hat{x}\|_\infty}{\|\hat{x}\|_\infty} \le \operatorname{cond}_\infty(A)\,\frac{\|\hat{A} - A\|_\infty}{\|A\|_\infty} \le 10^2 \cdot 10^{-6} = 10^{-4},$$

which says precisely that each entry of the computed solution $\hat{x}$ will agree with that of the true solution x up to at least 4 digits.

There is a similar result for changes in the vector b, as follows.

Theorem 14. Let Ax = b and $A\hat{x} = \hat{b}$. Then

$$\frac{\|x - \hat{x}\|}{\|x\|} \le \operatorname{cond}(A)\,\frac{\|b - \hat{b}\|}{\|b\|}.$$

In general, for a given linear system, the best accuracy we can hope for is

Relative error in numerical solution ≈ cond(A) × machine epsilon

This ignores possible error in the numerical method used to compute $\hat{x}$, and the error could be worse if b comes from data measurements. Thus, knowing cond(A) is very important, as it will determine how much we should believe our computed solution.

Returning now to Example 11, we can answer the question of why the two solutions differed by so much. We calculate the condition number first, via

$$A^{-1} = \begin{pmatrix} 38.5 & -20 \\ 115 & -60 \end{pmatrix},$$

$$\operatorname{cond}(A) = \|A\|_\infty\|A^{-1}\|_\infty = \max\{8, 15.35\} \cdot \max\{58.5, 175\} = 2686.25.$$

The relative change in A is $\frac{\|A - \tilde{A}\|_\infty}{\|A\|_\infty} = \frac{0.01}{15.35} \approx 6.515 \times 10^{-4}$. From the theorem, we can expect the relative difference between x and $\tilde{x}$ to be bounded by

$$\frac{\|x - \tilde{x}\|}{\|\tilde{x}\|} \le 2686.25 \cdot 6.515 \times 10^{-4} = 1.75,$$

and thus, we cannot even expect accuracy in the first digit of the solution entries, which is precisely what we observe. This terrible outcome is caused directly by the condition number being large relative to the size of the perturbation.

2.7.4 Error and residual in linear system solving

Assume now that we can represent A and b exactly, and let us consider a different type of error. All numerical methods for solving Ax = b introduce error; that is, they almost surely find $\hat{x} \ne x$. Unfortunately, we usually never know x, but we still want to have an idea of the size of the error $e = \hat{x} - x$. What we do know, if given an approximation $\hat{x}$, is the residual $r = b - A\hat{x}$. Residual and error are different, but related:

$$Ae = A(\hat{x} - x) = A\hat{x} - Ax = A\hat{x} - b = -r.$$

Multiplying both sides of Ae = −r by $A^{-1}$ gives $e = -A^{-1}r$, and then taking norms of both sides yields

$$\|e\| = \|A^{-1}r\| \le \|A^{-1}\|\|r\|,$$

where the inequality comes from a property of matrix norms. Dividing both sides by $\|\hat{x}\|$ and multiplying the right-hand side by $\|A\|/\|A\|$ yields

$$\frac{\|e\|}{\|\hat{x}\|} \le \operatorname{cond}(A)\,\frac{\|r\|}{\|A\|\|\hat{x}\|}.$$

The left-hand side is the relative error of the solution, and the right-hand side is the condition number of A times the relative residual $\frac{\|r\|}{\|A\|\|\hat{x}\|}$. With $\hat{x}$ computed by numerically stable algorithms such as GE with partial pivoting, the relative residual $\frac{\|r\|}{\|A\|\|\hat{x}\|}$ is on the order of machine epsilon (see footnote 6). Overall, direct solvers for sparse matrices typically produce approximations that have very small relative residuals, for example, smaller than $10^{-12}$. Iterative solvers, such as CG or GMRES, often use the relative residual size as a stopping criterion, usually on the order of $10^{-6}$ or $10^{-8}$. Hence, if the product of the condition number and the relative residual is not small, then the relative error $\frac{\|x - \hat{x}\|}{\|\hat{x}\|}$ may be large.
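The numbers quoted for Example 11, and the relation between residual and error from Section 2.7.4, can be reproduced in a few lines. This is a sketch using the data of Example 11; the variable names are our own, and the perturbed solution is treated as an "approximation" to x purely for illustration.

```matlab
% Data from Example 11
A  = [6 -2; 11.5 -3.85];
At = [6 -2; 11.5 -3.84];            % perturbed matrix
b  = [10; 17];
x  = A  \ b;                        % solution of the original system
xt = At \ b;                        % solution of the perturbed system
condA = cond(A, inf);               % ||A||_inf * ||A^{-1}||_inf
relA  = norm(A - At, inf) / norm(A, inf);    % relative change in A
relx  = norm(x - xt, inf) / norm(xt, inf);   % relative change in solution
fprintf('cond_inf(A)          = %.2f\n', condA);
fprintf('relative change in A = %.3e\n', relA);
fprintf('relative change in x = %.3f (bound %.3f)\n', relx, condA*relA);
% Residual versus error: treat xt as an approximate solution of A*x = b
r = b - A*xt;
relres = norm(r, inf) / (norm(A, inf)*norm(xt, inf));
fprintf('relative residual    = %.3e\n', relres);
fprintf('relative error       = %.3f (bound %.3f)\n', relx, condA*relres);
```

The relative residual is small while the relative error is not, which is exactly the situation the bound allows when cond(A) is large.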

6 Sometimes the relative residual is also defined as $\frac{\|r\|}{\|b\|}$, but this quantity itself can be as large as cond(A) times machine epsilon.

2.8 Exercises

1. By hand, solve the triangular linear systems

$$\begin{pmatrix} 2 & 3 & 4 \\ 0 & -1 & 2 \\ 0 & 0 & 3 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 11 \\ 5 \\ 9 \end{pmatrix}$$

and

$$\begin{pmatrix} 4 & 0 & 0 \\ -1 & 5 & 0 \\ 1 & 3 & 2 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} -8 \\ 7 \\ 7 \end{pmatrix}.$$

2. Write MATLAB code for the forward substitution (lower triangular), and test it on the second system in Exercise 1.

3. By hand, use (a) Gaussian elimination with no pivoting, (b) LU factorization with no pivoting, and (c) LU factorization with partial pivoting, to solve the system of linear equations

$$\begin{pmatrix} 1 & -1 & -1 \\ -1 & -3 & 1 \\ 2 & 6 & -1 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} -6 \\ 2 \\ -1 \end{pmatrix}.$$

4. By hand, compute the LU factorization of

$$A = \begin{pmatrix} \epsilon/2 & -1 \\ 1 & 1 \end{pmatrix}$$

with partial pivoting, taking round-off errors into consideration. Check if the product of the computed $\hat{L}$ and $\hat{U}$ is close to PA (see the example in Section 2.3).

5. Consider a 5 × 5 tridiagonal matrix A (nonzeros appear only on the main diagonal, and the first superdiagonal and subdiagonal), and assume that LU factorization with no pivoting is stable for this A. Where would the nonzero entries appear in the L and U factors, respectively? Generalize your observation to n × n tridiagonal matrices. How many flops are needed to obtain such an LU factorization and to perform the forward/back substitutions?

6. In MATLAB, run A = delsq(numgrid('B',256)); and B = A + sprandn(A); to construct an SPD and a nonsymmetric matrix, respectively. Use the techniques in Section 2.4 to solve the linear systems Ax = b and Bx = b, where $b = [1, \dots, 1]^T$. Check if all the relative residual norms $\frac{\|r\|}{\|A\|\|\hat{x}\|}$, $\frac{\|r\|}{\|B\|\|\hat{x}\|}$ and $\frac{\|r\|}{\|b\|}$ are sufficiently small.

7. Let us derive some preliminary properties of the conjugate gradient method. For a symmetric positive definite matrix $A \in ℝ^{n\times n}$, a vector $b \in ℝ^n$, and any nonzero vector $x \in ℝ^n$, define $\varphi(x) = \frac{1}{2}x^TAx - b^Tx$.
(a) Given $x_k, p_k \in ℝ^n \setminus \{0\}$, find the scalar $\alpha_k$ such that $\varphi(x_k + \alpha_k p_k)$ is minimized. Simplify the result by introducing the residual $r_k = b - Ax_k$.
(b) Let $x_{k+1} = x_k + \alpha_k p_k$, with $\alpha_k$ defined in (a), such that $r_{k+1} = b - Ax_{k+1} = r_k - \alpha_k Ap_k$. Show that $r_{k+1} \perp p_k$. Let $\beta_k = -\frac{r_{k+1}^TAp_k}{p_k^TAp_k}$ and $p_{k+1} = r_{k+1} + \beta_k p_k$. Show that $p_{k+1} \perp Ap_k$.
(c) From $p_{k+1} = r_{k+1} + \beta_k p_k$, write $r_{k+1}$ in terms of $p_k$ and $p_{k+1}$. Then replace k with k + 1 in $r_{k+1} = r_k - \alpha_k Ap_k$ so that an expression of $r_{k+2}$ is obtained. Show that $r_{k+1} \perp r_{k+2}$, using the relations shown in (b).

8. (a) In MATLAB, run the following commands:

    n = 32;
    A = delsq(numgrid('S',n+2));
    b = ones(length(A),1);
    x = pcg(A,b,1e-8,length(A)-1);
    disp(condest(A));

Increase n to 64, 128, 256, and 512, and make a table summarizing n, the number of CG iterations needed to reach the relative tolerance $10^{-8}$, and the 2-condition number (estimate) of A. Are the results consistent with Theorem 8 in Section 2.5? (Disregard the fact that we ask CG with $x_0 = 0$ to terminate when $\frac{\|r_k\|_2}{\|r_0\|_2} \le 10^{-8}$ instead of $\frac{\|e_k\|_A}{\|e_0\|_A} \le 10^{-8}$.)
(b) Run the following commands:

    un = ones(256,1);
    D = diag([un; 2e1*un; 4e2*un; 1.6e5*un]);
    rng default;
    V = randn(size(D));
    [V,~] = qr(V);
    A = V*D*V';
    A = (A+A')/2;
    b = ones(length(A),1);
    x = pcg(A,b,1e-8,length(A)-1);
    disp(cond(A));

In how many iterations does CG converge? Compare this result to that in part (a), and provide an explanation of the iteration count here.

9. Consider a matrix that arises from solving the 2D Poisson problem. This matrix A has 4 for each diagonal entry, and −1’s on each of the first upper and lower diagonals, and again on the √n ± 1 diagonals. The rest of the matrix is all zeros. You can create such a 2500×2500 matrix in MATLAB with

    A = gallery('poisson',50);

Note the 50 gets squared, as it has to do with the number of unknowns in each dimension. This matrix is automatically created as “sparse” type. Create such matrices of size 2500, 3600, 4900, . . ., and vectors b of all ones of the commensurate lengths, respectively. Solve the linear systems Ax = b two ways, and keep track of timings: First, solve using the sparse direct solver in MATLAB by solving the linear system with backslash. Next, repeat the process but using nonsparse matrices (e. g., A2 = full(A);). Compare timings of the sparse versus nonsparse solve.

10. Given

$$A = \begin{pmatrix} 2 & 4 \\ -3 & -6.001 \end{pmatrix} \quad \text{and} \quad b = \begin{pmatrix} 2 \\ 3 \end{pmatrix},$$

(a) Solve Ax = b using backslash.
(b) Change the second entry of b to 3.01, and solve for the new solution.
(c) What is the relative difference between the solutions?
(d) Does this agree with Theorem 14?

11. Repeat Exercise 10, but now using the matrix

$$A = \begin{pmatrix} 2 & 4 \\ 3 & 6.001 \end{pmatrix}.$$

12. Prove that $\|v\|_\infty = \max_{1\le i\le n}|v_i|$ defines a norm for vectors in $ℝ^n$.

13. Prove properties 5 and 6 of vector-norm-induced matrix norms (assume that the associated vector norms satisfy the norm properties).

14. Prove Theorem 14.

3 Least squares problems

In this chapter, we consider problems where there are more linear equations than unknowns, or equivalently, more constraints than variables. Such problems arise, for example, when we try to find parameters that “best fit” a model to a data set.

To help with the discussion, consider a representative example of finding a line of best fit for given data points $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$. To determine such a line, we want to determine $a_0$ and $a_1$ such that

$$l(x) = a_0 + a_1x$$

“best fits” the data. For this problem, we have 2 unknowns, $a_0$ and $a_1$, but we have n equations we wish to satisfy:

$$y_i = l(x_i), \qquad i = 1, 2, 3, \dots, n.$$

Combining these equations into a matrix equation, we obtain

$$\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}\begin{pmatrix} a_0 \\ a_1 \end{pmatrix} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \iff Ax = b.$$

It is highly unlikely that there can be a solution to these n equations, unless the n points happen to all lie on a single line. So instead of looking for an exact solution, we will look for a “least squares” solution in the sense that we want to find $a_0$ and $a_1$ so that the least squares error

$$e(a_0, a_1) = \sum_{i=1}^n \left(y_i - (a_0 + a_1x_i)\right)^2$$

is minimized. Note there are different ways to measure the misfit, but typically for data fitting problems, least squares error minimization is both intuitive and has good mathematical properties.

Since equality is not expected here, when writing least squares problems, we typically use the notation ≅ instead of =, and so write the system as

$$Ax \cong b,$$

with the understanding that this means minimizing the sum of squares error. A nice mathematical property of using least squares error is that it is the same as minimizing the Euclidean norm of the residual Ax − b. In fact, simply expanding out $\|Ax - b\|_2^2$ reveals

$$\|Ax - b\|_2^2 = \sum_{i=1}^n \left(y_i - (a_0 + a_1x_i)\right)^2.$$

https://doi.org/10.1515/9783110573329-003

Thus, by defining the least squares (LSQ) problem as

$$Ax \cong b \iff x \text{ minimizes } \|Ax - b\|_2,$$

we have that it is the exact same thing as minimizing the sum of squares error:

$$Ax \cong b \iff x \text{ minimizes } \|Ax - b\|_2$$
$$\iff x \text{ is the solution of } \min_x \|Ax - b\|_2^2$$
$$\iff (a_0, a_1) \text{ is the solution of } \min_{a_0, a_1}\sum_{i=1}^n \left(y_i - (a_0 + a_1x_i)\right)^2.$$

Hence, we observe that solving the LSQ problem is exactly the same as minimizing the sum of squared vertical distances of the data points from the predicted line.

3.1 Solving LSQ problems with the normal equations

We consider now a linear system of n equations and m unknowns that can be written as

$$A_{n\times m}x_{m\times 1} \cong b_{n\times 1}.$$

We assume that n > m, and A has full column rank (i. e., rank(A) = m). This assumption corresponds to the parameters being independent of each other, which is typically a safe assumption. For example, in a line of best fit, one wants to find $a_0$ and $a_1$ that best fit a line $l(x) = a_0 + a_1x$ to a data set. Here, m = 2 and we would have full column rank (see the matrix A above). But if we changed the problem to instead look for coefficients of 1, x, and also (x + 1), then we would have m = 3 but a column rank of only 2. This is because if we tried to use $a_0$, $a_1$, and $a_2$ to best fit a line $l(x) = a_0 + a_1x + a_2(x + 1)$ to a set of data, column 3 of the matrix would be a linear combination of the first two columns. Hence, for LSQ problems, the assumption of full column rank is reasonable since if the columns are not linearly independent, then typically this can be fixed by eliminating redundant columns and unknowns.

We now derive a solution method for the LSQ problem Ax ≅ b. Since this problem is defined to be finding x that minimizes $\|Ax - b\|_2$, and with that also $\|Ax - b\|_2^2$, we define the function

$$h(x) = \|Ax - b\|_2^2.$$

Expanding this norm using its definition, we obtain

$$h(x) = (Ax - b)^T(Ax - b) = x^TA^TAx - x^TA^Tb - b^TAx + b^Tb.$$

To minimize h(x), a key point is to notice that h(x) is a quadratic function in $x_1, x_2, \dots, x_m$, and moreover, the coefficients of $x_1^2, x_2^2, \dots, x_m^2$ in h(x) must be positive (see Exercise 1). A quadratic function with positive coefficients of these terms is an upward facing paraboloid (in $ℝ^m$), and thus must have a unique minimum at its vertex. Hence we can solve the LSQ problem by finding the vertex of the paraboloid defined by h(x), and this can be done by setting ∇h(x) = 0 and solving for x (i. e., set each first partial derivative $\frac{\partial h}{\partial x_i}$ to zero and solve).

A calculation (see Exercise 2) shows that

$$\nabla h(x) = 2A^TAx - 2A^Tb,$$

and so the LSQ solution x must satisfy

$$(A^TA)x = A^Tb.$$

Since A has full column rank, $A^TA$ is square SPD and, therefore, nonsingular (see Exercise 3). Thus, we can solve this m × m linear system to get the LSQ solution. That is, we have transformed the minimization problem Ax ≅ b into the square linear system $(A^TA)x = A^Tb$. We now perform an example to demonstrate the theory above.

Example 15. Find the line of best fit for the data points

    xy = [ 1.0000  1.0000
           1.9000  1.0000
           2.5000  1.1000
           3.0000  1.6000
           4.0000  2.0000
           7.0000  3.4500 ]

We proceed to find the best fit coefficients $a_0$ and $a_1$. Following the derivation above, this defines a LSQ problem Ax ≅ b. We can input A and b easily into MATLAB using the data points:

    >> A = [xy(:,1).^0, xy(:,1).^1]
    A =
        1.0000    1.0000
        1.0000    1.9000
        1.0000    2.5000
        1.0000    3.0000
        1.0000    4.0000
        1.0000    7.0000

    >> b = xy(:,2)
    b =
        1.0000
        1.0000
        1.1000
        1.6000
        2.0000
        3.4500

Next, solve the system using the normal equations $(A^TA)x = A^Tb$:

    >> x = (A' * A) \ (A' * b)
    x =
        0.2627
        0.4419

We now have the coefficients, $a_0 = 0.2627$ and $a_1 = 0.4419$. As a sanity check, we plot the line $a_0 + a_1x$ along with the data points and see what we expected: a line of best fit.
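The steps of Example 15 can be collected into one short script. The sketch below uses the data of the example; the gradient check, the perturbation d, and the plotting commands are our own additions for illustration.

```matlab
% Data from Example 15: first column x_i, second column y_i
xy = [1.0 1.0; 1.9 1.0; 2.5 1.1; 3.0 1.6; 4.0 2.0; 7.0 3.45];
A = [ones(size(xy,1),1), xy(:,1)];   % columns correspond to 1 and x
b = xy(:,2);
x = (A'*A) \ (A'*b);                 % normal equations; x = [a0; a1]
g = 2*(A'*A)*x - 2*(A'*b);           % gradient of h at x, should be ~0
err = norm(A*x - b)^2;               % least squares error e(a0, a1)
% Perturbing the coefficients (arbitrary d) should not decrease the error
d = [0.01; -0.01];
err_pert = norm(A*(x + d) - b)^2;
fprintf('a0 = %.4f, a1 = %.4f\n', x(1), x(2));
fprintf('||grad h|| = %.2e, e = %.4f, e(perturbed) = %.4f\n', ...
        norm(g), err, err_pert);
% Plot the data together with the fitted line
xs = linspace(min(xy(:,1)), max(xy(:,1)), 100);
plot(xy(:,1), xy(:,2), 'o', xs, x(1) + x(2)*xs, '-');
legend('data', 'line of best fit', 'Location', 'northwest');
```

The vanishing gradient confirms that x is the vertex of the paraboloid h, and the perturbed coefficients give a larger error, as expected at a minimum.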

3.2 QR factorization and the Gram–Schmidt process

Solving a LSQ problem, as done above, is called solving with the normal equations, since the matrix $A^TA$ is a “normal” matrix. For smaller problems, this is generally effective. For problems where m is large (greater than a few thousand or so), the linear solve can be difficult because $A^TA$ is often fairly dense, even if A is sparse. Further, this linear solve is highly prone to conditioning problems. To see this, consider for simplicity a square, symmetric, nonsingular matrix M. Here, one can calculate using properties of normal matrices that

$$\operatorname{cond}_2(M^TM) = \|M^TM\|_2\|(M^TM)^{-1}\|_2 = \|M\|_2^2\|M^{-1}\|_2^2 = \left(\operatorname{cond}_2(M)\right)^2.$$

Although this is the simplest case, the situation is similar for rectangular matrices (we would have to build up notation and theory for conditioning of rectangular matrices), and the takeaway is this: if a matrix M is even mildly ill-conditioned, then $M^TM$ is very ill-conditioned. We next show a way to solve the LSQ problems without performing a linear solve with a potentially ill-conditioned matrix.

The Gram–Schmidt process is a method to transform a set of m linearly independent vectors into an orthonormal set of m vectors that have the same span. Recalling from Calculus III the notion of a vector projection of v onto u (the part of v that is in the direction of u),

$$\operatorname{proj}_u(v) = \frac{u \cdot v}{\|u\|^2}\,u,$$

a linearly independent set of vectors $\{a_1, a_2, a_3, \dots\}$ can be transformed into an orthogonal set by the process

$$a_1 = a_1,$$
$$a_2 = a_2 - \operatorname{proj}_{a_1}(a_2),$$
$$a_3 = a_3 - \operatorname{proj}_{a_1}(a_3) - \operatorname{proj}_{a_2}(a_3),$$

and so on, where each assignment overwrites the vector on its left-hand side, so that later projections use the already orthogonalized vectors. After each step, each $q_i$ is obtained by normalizing via

$$q_i = \frac{a_i}{\|a_i\|}.$$

For numerical stability, we alter the process to the following mathematically equivalent process:

$$q_1 = a_1$$
$$q_1 = q_1/\|q_1\|$$
$$q_2 = a_2$$
$$q_2 = q_2 - (q_1 \cdot q_2)q_1$$
$$q_2 = q_2/\|q_2\|$$
$$q_3 = a_3$$
$$q_3 = q_3 - (q_1 \cdot q_3)q_1$$
$$q_3 = q_3 - (q_2 \cdot q_3)q_2$$
$$q_3 = q_3/\|q_3\|$$

and so on.

One can easily check that the set $\{q_i\}_{i=1,\dots,m}$ is orthonormal. Considering the set of vectors as columns of a matrix A, the Gram–Schmidt process transforms it into a matrix Q, where the columns of Q are orthonormal and have the same span as the columns of A.

Note that if we consider A to be the matrix with columns $a_i$, then each step of the Gram–Schmidt algorithm is an operation using just the columns of (modified) A. This is a fairly straightforward MATLAB algorithm, which we give below. Note that here we normalize each $q_i$ right after it has been orthogonalized, before moving on to the next column. We also keep track of the multipliers (scalar projections) $r_{ij} = q_i \cdot a_j$, as keeping track of these will allow us to reverse the process, which results in a matrix factorization.

    function [Q,R] = gramschmidt_QR(A)
    % Gram-Schmidt orthogonalization of the columns of A, so that A = Q*R
    Q = A;
    n = size(Q,2);              % number of columns
    R = zeros(n,n);
    for j = 1 : n
        q = A(:,j);
        for k = 1 : j-1
            qk = Q(:,k);
            R(k,j) = (qk'*q);   % scalar projection onto q_k
            q = q - qk*R(k,j);  % remove the component along q_k
        end
        R(j,j) = norm(q,2);
        Q(:,j) = q / R(j,j);    % normalize
    end

We test the algorithm below on three random vectors of length 5, which are stored in a 5 × 3 matrix A. One can check that the vectors are orthonormal by checking that $Q^TQ = I$, since the i, j component of $Q^TQ$ is $q_i \cdot q_j$.

    >> A = rand(5,3)
    A =
        0.8407    0.3500    0.3517
        0.2543    0.1966    0.8308
        0.8143    0.2511    0.5853
        0.2435    0.6160    0.5497
        0.9293    0.4733    0.9172

    >> Q = gramschmidt_QR(A)
    Q =
        0.5476   -0.1063   -0.4617
        0.1656    0.1400    0.8493
        0.5304   -0.2697    0.0545
        0.1586    0.9456   -0.1711
        0.6052    0.0465    0.1824

    >> Q' * Q
    ans =
        1.0000    0.0000   -0.0000
        0.0000    1.0000    0.0000
       -0.0000    0.0000    1.0000

Interestingly, the transformation of the matrix $A$ to $Q$ involves only rescaling columns and replacing a column by its sum with a multiple of another column. Each of these operations is a linear transformation, and these linear transformations can be combined into a single matrix. In particular, once we know $Q$ (which we do from the algorithm), we can easily undo the Gram–Schmidt operations using the upper triangular matrix $R$, which holds the scalar projections that were created during the orthogonalization process. This creates a factorization of the form

$$A_{n\times m} = Q_{n\times m} R_{m\times m}.$$

This $A = QR$ factorization turns out to be quite useful for solving the LSQ problem $Ax \cong b$. Since the unique LSQ solution satisfies $A^T A x = A^T b$, inserting the factorization yields

$$R^T Q^T Q R x = R^T Q^T b.$$

Since $Q^T Q = I_{m\times m}$, and since $R$ is nonsingular because $A$ has full column rank, the equation reduces to

$$R x = Q^T b.$$

This solve with $R$ is fast and easy since $R$ is upper triangular, and moreover, it turns out to be much better for conditioning than using the normal equations. To see this, again take for simplicity the case of a square symmetric matrix $A$. Then $\mathrm{cond}_2(R) = \mathrm{cond}_2(A)$, since $Q^{-1} = Q^T$ and $\mathrm{cond}_2(Q) = 1$. Thus, solving the LSQ problem this way is not ill-conditioned unless $A$ is ill-conditioned (as opposed to solving using the normal equations, where the condition number is squared).
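The reduction to $Rx = Q^T b$ and the conditioning claims can be checked on a small problem. The following is a minimal sketch, not a definitive implementation: the data `t` and `b` are made up for illustration, and MATLAB's built-in `qr` is used in place of the Gram–Schmidt routine so that the script is self-contained.

```matlab
% Sketch: solve a small least squares problem A*x ~ b via QR and compare
% with the normal equations. All data below are hypothetical.
t = (0:0.25:2)';                          % 9 sample points
A = [t.^0, t.^1, t.^2];                   % 9 x 3 matrix with full column rank
b = 1 + 2*t - 0.5*t.^2 + 0.01*sin(7*t);   % right-hand side, slightly perturbed

[Q, R] = qr(A, 0);                        % Q is 9x3 with orthonormal columns, R is 3x3
x_qr = R \ (Q' * b);                      % solve R x = Q^T b (R is upper triangular)
x_ne = (A' * A) \ (A' * b);               % normal equations, for comparison

% cond(A'*A) is about cond(A)^2, while cond(R) matches cond(A)
fprintf('cond(A)     = %.4e\n', cond(A));
fprintf('cond(R)     = %.4e\n', cond(R));
fprintf('cond(A''*A)  = %.4e\n', cond(A' * A));
fprintf('norm of difference between the two solutions: %.2e\n', norm(x_qr - x_ne));
```

For a well-conditioned matrix such as this one, both approaches give essentially the same solution; the difference only becomes visible when `cond(A)` is large.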

There are several algorithms to compute QR factorizations, such as those based on Householder reflections or Givens rotations, with each having advantages over the others in different situations. As a general rule, it is safe to use MATLAB's QR factorization process based on Householder reflections, which can be called via [Q,R] = qr(A,0). The second argument 0 makes it produce $Q_{n\times m}$ instead of a full square $Q_{n\times n}$, in which the additional columns $m+1, m+2, \ldots, n$ are created so that $Q_{n\times n}$ is a square orthogonal matrix. These extra columns are unnecessary and wasteful if one is only interested in using the QR factorization to solve the LSQ problem.

3.3 Curve of best fit

Since many data relations are not linear, it is important to also consider "curves of best fit." The ideas in the previous sections can be extended to this setting without much difficulty. There are three types of two-parameter curves to consider:
1. exponential: $y = b e^{mx}$
2. inverse: $y = \frac{1}{mx+b}$
3. power: $y = b x^m$

Each of these functions depends on two parameters, which we call $m$ and $b$. It is our goal to find the $m$ and $b$ that provide a curve of best fit for given data points $(x_i, y_i)$, $i = 1, 2, \ldots, n$. The ideas below can easily be extended to other types of two-parameter curves.

The procedure has three steps: (1) find a transformation that creates a linear relationship in the transformed data, (2) fit a line to the transformed data, and (3) untransform the line to get a curve of best fit.

Step 1 of this procedure is the hardest, as it requires us to look at the data and determine what kind of function the data might represent. If the data is exponential, then we would expect the points $(x_i, \log(y_i))$ to look linear. Similarly, if the data comes from an inverse function, then we would expect $(x_i, 1/y_i)$ to appear linear. Hence, Step 1 requires an educated guess, followed by a check that the transformed data is indeed linear.

Step 2 of the procedure is clear. Once we have linearized data, we already know how to find the line of best fit for it.

Step 3 is to take the line of best fit for the transformed data and turn it into a curve for the original data. How to do this depends on the transformation. For example, in the exponential case, we fit a line $\log(y) = a_0 + a_1 x$; then to untransform, we raise $e$ to both sides and get $y = e^{a_0 + a_1 x} = e^{a_0} e^{a_1 x}$. If we set $b = e^{a_0}$ and $m = a_1$, we now have our parameters for a curve of best fit.

The transformation process for each of these data types is as follows:
1. If the data comes from $y = b e^{mx}$, then taking the log of both sides gives $\log(y) = \log(b) + mx$. Thus, a plot of $x_i$ versus $\log(y_i)$ will look linear.
2. If the data comes from $y = \frac{1}{mx+b}$, then taking reciprocals of both sides gives $y^{-1} = mx + b$. So a plot of $x_i$ versus $y_i^{-1}$ will look linear.
3. If the data comes from $y = b x^m$, then taking the log of both sides gives $\log(y) = \log(b) + m \log(x)$. Thus, a plot of $\log(x_i)$ vs. $\log(y_i)$ will look linear.
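As an illustration of the three steps for the power case, here is a minimal sketch. The data are synthetic (generated from the hypothetical values $b = 2$, $m = 1.5$ with no noise), so the fit should recover those parameters up to roundoff; with real data one would only recover them approximately.

```matlab
% Sketch of the three-step procedure for a power curve y = b*x^m.
% Synthetic data from b = 2, m = 1.5 (values chosen only for illustration).
x = (1:8)';
y = 2 * x.^1.5;

% Step 1: transform. For a power law, (log x, log y) should look linear.
X = log(x);
Y = log(y);

% Step 2: line of best fit Y = a0 + a1*X via a QR-based least squares solve.
A = [X.^0, X.^1];
[Q, R] = qr(A, 0);
a = R \ (Q' * Y);

% Step 3: untransform. From log(y) = log(b) + m*log(x): b = e^(a0), m = a1.
b = exp(a(1));
m = a(2);
fprintf('b = %.4f, m = %.4f\n', b, m);
```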

Example 16. Fit a curve to the data points.

    >> xy = [ 0.2500   0.3662
              0.5000   0.4506
              0.7500   0.5054
              1.0000   0.5694
              1.2500   0.6055
              1.5000   0.6435
              1.7500   0.7426
              2.0000   0.9068
              2.2500   0.9393
              2.5000   1.1297
              2.7500   1.2404
              3.0000   1.4441
              3.2500   1.5313
              3.5000   1.7706
              3.7500   1.9669
              4.0000   2.3129
              4.2500   2.5123
              4.5000   2.9238
              4.7500   3.3070 ]

Our first step is to plot the data points (xi , yi ), as follows:

By examining the data, we suspect this could be either an exponential or a power function. Thus we plot the transformed data (xi , log(yi )):


and (log(xi ), log(yi )):

From these plots, we conclude the data most likely comes from an exponential function. The next step is to fit a line of best fit to the transformed (linear) data $(x_i, \log(y_i))$, so we use the line of best fit procedure with $\log(y_i)$ in place of $y_i$.

    >> A = [xy(:,1).^0, xy(:,1).^1];
    >> b = log(xy(:,2));
    >> [Q,R] = qr(A,0);
    >> x = R \ (Q'*b)
    x =
       -1.0849
        0.4752

Thus, we have fit a line of best fit to the data $(x_i, \log(y_i))$, and it is given by

$$\log(y) = -1.0849 + 0.4752x.$$

Now we untransform the line (into a curve) by raising $e$ to both sides to get

$$y = e^{-1.0849} e^{0.4752x} = 0.3379\, e^{0.4752x}.$$

Since the line was a line of best fit for the transformed data, this will be a curve of best fit for the original data, which we can see in the plot below.


Let us now summarize the different curves to fit:

    name:         equation:              fit line (y = a0 + a1 x) to:    answer:
    line          y = b + m x            (x_i, y_i)                      b = a0,     m = a1
    exponential   y = b e^(m x)          (x_i, log y_i)                  b = e^(a0), m = a1
    power         y = b x^m              (log x_i, log y_i)              b = e^(a0), m = a1
    inverse       y = (b + m x)^(-1)     (x_i, 1/y_i)                    b = a0,     m = a1
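The summary can also support the "educated guess" of Step 1. The sketch below fits a line to each of the three transformed data sets and compares a relative residual (the residual norm divided by the spread of the transformed values). This particular measure is an assumption made here for illustration, not something prescribed by the text, and the data are synthetic, generated from an exponential with hypothetical parameters $b = 0.5$, $m = 0.8$.

```matlab
% Sketch: compare how linear each transformation makes the data.
% Synthetic exponential data (b = 0.5, m = 0.8 chosen only for illustration).
x = (0.5:0.5:4)';
y = 0.5 * exp(0.8 * x);

names = {'exponential', 'power', 'inverse'};
X = {x,      log(x), x     };   % transformed abscissas, one per curve type
Y = {log(y), log(y), 1 ./ y};   % transformed ordinates, one per curve type

relres = zeros(3, 1);
for i = 1:3
    A = [ones(size(x)), X{i}];
    a = A \ Y{i};                                     % line of best fit
    relres(i) = norm(Y{i} - A*a) / norm(Y{i} - mean(Y{i}));
    fprintf('%-12s relative residual = %.3e\n', names{i}, relres(i));
end
[~, best] = min(relres);
fprintf('closest to linear: %s\n', names{best});
```

In practice, plotting the transformed data as in Example 16 remains the more reliable check; a single number can hide systematic curvature.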

3.4 Exercises

1. Let $x = [x_1 \; x_2]^T$. Prove that the coefficients of $x_1^2$ and $x_2^2$ that arise in the polynomial $x^T (A^T A) x$ are positive (you may assume that $A$ is 3 × 2).
2. For $A_{3\times 2}$, $x_{2\times 1}$, let $h(x) = \|b - Ax\|_2^2$. Show that $\nabla h(x) = 2A^T(Ax - b)$.
3. If $A$ has full column rank, prove that $(A^T A)$ is symmetric positive definite.
4. Prove that, for a matrix norm induced by a vector norm, $\|AB\| \le \|A\|\|B\|$ for any matrices $A$ and $B$ where $AB$ is well defined.
5. Calculate $\mathrm{cond}_2(A)$ and $\mathrm{cond}_2(A^T A)$ for

$$A = \begin{pmatrix} 3.0170 & 1.8236 \\ -5.8075 & -3.5070 \end{pmatrix}.$$

   Is the condition number of the normal matrix approximately the square of the matrix condition number?
6. Consider fitting a nonlinear curve to the data points

    x = [-1.0000   0        0.5000   1.0000   2.2500   3.0000    3.4000]
    y = [ 0.3715   1.0849   1.7421   2.7862   9.5635   20.1599   30.0033]

   (a) For the different transformations in this chapter, plot the corresponding transformed data, and decide which looks closest to linear.
   (b) Find a curve of best fit for the data.
7. Solve LSQ problems to find the line of best fit, quadratic of best fit, cubic of best fit, and quartic of best fit for the points

        x         y
    0.1622    0.0043
    0.7943    0.5011
    0.3112    0.0301
    0.5285    0.1476
    0.1656    0.0045
    0.6020    0.2181
    0.2630    0.0182
    0.6541    0.2798
    0.6892    0.3274
    0.7482    0.4188

   Plot them all together on [0.1622, 0.7943]. Which is best?
8. Consider the matrix $A$ defined by

    A = [ 0.7577   0.7060   0.8235   0.4387   2.7260
          0.7431   0.0318   0.6948   0.3816   1.8514
          0.3922   0.2769   0.3171   0.7655   1.7518
          0.6555   0.0462   0.9502   0.7952   2.4471
          0.1712   0.0971   0.0344   0.1869   0.4896 ]

   and the vector $b$ defined to be a 5 × 1 vector of all 1's. Solve $Ax = b$ two ways:
   – Using the normal equations, that is, solve $A^T A x = A^T b$
   – Using a QR factorization
   Which is the most accurate (compare $\|b - Ax\|_2$ for each solution)? Explain the differences using the condition numbers of the associated matrices.

4 Finite difference methods

Taylor series and Taylor's theorem play a fundamental role in finite difference method approximations of derivatives. We begin this chapter with a review of these important results.

Theorem 17 (Taylor series). If a function $f$ is infinitely differentiable on $\mathbb{R}$, then for any chosen expansion point $x_0$, it holds that

$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2 + \frac{f'''(x_0)}{3!}(x - x_0)^3 + \cdots$$

for every real number $x$.

Theorem 18 (Taylor's theorem). If a function $f$ is $n + 1$ times differentiable in an interval around a chosen expansion point $x_0$, then for every real number $x$ in the interval, there is a number $c$ between $x$ and $x_0$ such that

$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2 + \cdots + \frac{f^{(n)}(x_0)}{n!}(x - x_0)^n + \frac{f^{(n+1)}(c)}{(n+1)!}(x - x_0)^{n+1}.$$

The importance of Taylor's theorem is that we can pick a point $x_0$, and then represent the function $f(x)$, no matter how "messy" it is, by the polynomial

$$f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2 + \cdots + \frac{f^{(n)}(x_0)}{n!}(x - x_0)^n.$$

The error is given by the single term $\frac{f^{(n+1)}(c)}{(n+1)!}(x - x_0)^{n+1}$. Although we do not know $c$, we do know it lies between $x$ and $x_0$. In many cases, this term can be estimated to give an upper bound on the error in approximating a function by a Taylor polynomial.

One way to think of these Taylor polynomials is as the tangent polynomials for a function $f$ at a point $x_0$. If $n = 1$, we recover the tangent line to $f$ at $x_0$, and if $n = 2$ we obtain the tangent parabola. As we learned in Calculus I, the tangent line is quite accurate for $x$ close to $x_0$ and can be quite inaccurate if $x$ is far away from $x_0$. This is exactly what we see in the error term $\frac{f''(c)}{2!}(x - x_0)^2$; the difference $(x - x_0)$ is squared, which directly shows large error for $x$ far away from $x_0$. In general, Taylor polynomials are useful approximations when points are close together and bad approximations when points are far apart. In what we do in this chapter, which includes applications to differential equations, the points will be close together, and thus, Taylor polynomials will be a useful tool.
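To see the error term at work, the following sketch compares the actual error of the Taylor polynomial of $e^x$ about $x_0 = 0$ at $x = 0.5$ with the bound obtained from the remainder term. The choice of function and points is an illustrative assumption; for $e^x$ every derivative is $e^x$, so $|f^{(n+1)}(c)| \le e^{0.5}$ for $c$ between 0 and 0.5.

```matlab
% Sketch: actual Taylor polynomial error versus the remainder bound for
% f(x) = exp(x), expansion point x0 = 0, evaluated at x = 0.5.
x0 = 0;
x  = 0.5;
nmax = 5;
err   = zeros(nmax, 1);
bound = zeros(nmax, 1);
for n = 1:nmax
    k = 0:n;
    p = sum((x - x0).^k ./ factorial(k));    % degree-n Taylor polynomial (all derivatives at 0 equal 1)
    err(n)   = abs(exp(x) - p);              % true error
    bound(n) = exp(x) * abs(x - x0)^(n+1) / factorial(n+1);   % remainder bound
    fprintf('n = %d: error = %.3e, bound = %.3e\n', n, err(n), bound(n));
end
```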

https://doi.org/10.1515/9783110573329-004

4.1 Convergence terminology

For a given mathematical problem, assume there is a solution $u$. If we use a numerical algorithm to approximate $u$, then we will get a numerical solution $\tilde{u}$ (it is extremely rare for $u = \tilde{u}$). The fundamental question is: How close is $\tilde{u}$ to $u$? It is our job to quantify this difference.

In many cases, the error in an approximation depends on a parameter. For example, in Newton's method, the error typically depends on how many iterations are performed. If one is approximating a derivative $f'(x)$ by calculating $\frac{f(x+h) - f(x)}{h}$ for a user-chosen $h$, the error will naturally depend on $h$. Hence in this case, we will want to quantify the error in terms of $h$. That is, we want to be able to write

$$\left| \frac{f(x+h) - f(x)}{h} - f'(x) \right| \le C h^k,$$

where $C$ is a problem dependent constant (independent of $h$ and $k$), and we want to determine $k$. If $k > 0$, then as $h$ decreases, we are assured the error will go to 0. The larger $k$ is, the faster it will go to zero. Often, the value of $k$ will allow us to compare different algorithms in order to determine which is more accurate.

Definition 19 (Big O notation). Suppose $u$ is the true solution to a mathematical problem, and $\tilde{u}(h)$ is an approximation to the solution that depends on a parameter $h$. If it holds that

$$|u - \tilde{u}(h)| \le C h^k,$$

with $C$ being a constant independent of $h$ and $k$, then we write

$$|u - \tilde{u}(h)| = O(h^k).$$

This is interpreted as "The error is on the order of $h^k$."

For first-order convergence ($k = 1$), the error is reduced proportionally to the reduction of $h$. In other words, if $h$ gets cut in half, you can expect the error to be approximately cut in half also. For second-order convergence, that is, $O(h^2)$, if $h$ gets cut in half, the error gets cut by $\left(\frac{1}{2}\right)^2 = \frac{1}{4}$, which is obviously much better.

Example 20. Suppose we have an algorithm where the error depends on $h$, and for a sequence of $h$'s $\{1, 1/2, 1/4, 1/8, 1/16, 1/32, \ldots\}$, the sequence of errors is given by

    10, 5, 2.5, 1.25, 0.625, 0.3125

and converges with first-order accuracy, that is, $O(h)$. This is because when $h$ gets cut in half, so do the errors.

For the same $h$'s, the sequence of errors

    100, 25, 6.25, 1.5625, 0.390625, 0.09765625

converges with second-order accuracy, that is, $O(h^2)$, since the errors get cut by a factor of $4 = 2^2$ when $h$ is cut in half.

In the example above, it was clear that the exponents of $h$ were 1 and 2, but in general, the $k$ in $O(h^k)$ need not be an integer. To approximate $k$ from two approximation errors $e_1$, $e_2$ corresponding to parameters $h_1$, $h_2$, we treat the error bound as a close approximation and start with $e \approx C h^k$ with $C$ independent of $h$ and $k$. Then we have that

$$e_1 \approx C h_1^k, \qquad e_2 \approx C h_2^k.$$

Solving for $C$ in both equations and setting them equal gives

$$\frac{e_1}{h_1^k} \approx \frac{e_2}{h_2^k} \implies \left(\frac{h_2}{h_1}\right)^k \approx \frac{e_2}{e_1} \implies k \log\left(\frac{h_2}{h_1}\right) \approx \log\left(\frac{e_2}{e_1}\right) \implies k \approx \frac{\log\left(\frac{e_2}{e_1}\right)}{\log\left(\frac{h_2}{h_1}\right)}.$$

Given a sequence of h’s and corresponding errors, we calculate k’s for each successive error, and typically, k will converge to a number.
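The formula for $k$ can be applied directly to the two error sequences of Example 20. The following is a small sketch; the variable names are chosen here for illustration.

```matlab
% Sketch: estimate the convergence order k from successive errors,
% using k ~ log(e2/e1) / log(h2/h1) and the data of Example 20.
h     = (1/2).^(0:5);                               % 1, 1/2, ..., 1/32
err_a = [10, 5, 2.5, 1.25, 0.625, 0.3125];          % first sequence of errors
err_b = [100, 25, 6.25, 1.5625, 0.390625, 0.09765625];   % second sequence

logh = log(h(2:end) ./ h(1:end-1));
k_a  = log(err_a(2:end) ./ err_a(1:end-1)) ./ logh;
k_b  = log(err_b(2:end) ./ err_b(1:end-1)) ./ logh;

disp(k_a);    % estimated orders for the first sequence
disp(k_b);    % estimated orders for the second sequence
```

Because these sequences are exactly geometric, every estimate equals 1 for the first sequence and 2 for the second; for errors from an actual computation, the estimates typically only approach the limiting value as $h$ decreases.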

4.2 Approximating the first derivative

4.2.1 Forward and backward differences

Consider a discretization of an interval $[a, b]$ with $N + 1$ equally spaced points, call them $x_0, x_1, \ldots, x_N$. Call the point spacing $h$, so that $h = x_{i+1} - x_i$. Suppose we are given function values $f(x_0), f(x_1), \ldots, f(x_N)$, so we know just the points, but not the entire curve, as in the plot in Figure 4.1. Suppose we want to know $f'(x_i)$ from just this limited information.

Figure 4.1: Set of example points.

Recalling that the definition of the derivative is

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h},$$

a first guess at approximating $f'(x_i)$ is to use one of

2 point forward difference: $$f'(x_i) \approx \frac{f(x_{i+1}) - f(x_i)}{h},$$

2 point backward difference: $$f'(x_i) \approx \frac{f(x_i) - f(x_{i-1})}{h}.$$

Illustrations of these ideas are shown in Figure 4.2. These two finite difference approximations to the derivative are simply the slopes between the adjacent points.

Figure 4.2: The forward and backward finite difference approximations to $f'(x_i)$.

Important questions to ask about these approximations are "How accurate are they?" and "Is one of these approximations any better than the other?" Answers to these questions can be determined from Taylor series. Consider the spacing of the $x$ points, $h$, to be a parameter. We expect that as $h$ gets small, the forward and backward difference approximations should approach the true value of $f'(x_i)$ (assuming no significant roundoff error, which should not be a problem if $h$ is not too small). Using Taylor's theorem with expansion point $x_i$ and choosing $x = x_{i+1}$ gives us

$$f(x_{i+1}) = f(x_i) + f'(x_i)(x_{i+1} - x_i) + \frac{f''(c)}{2}(x_{i+1} - x_i)^2,$$

for some $x_i \le c \le x_{i+1}$. Then, since $h = x_{i+1} - x_i$, this reduces to

$$f(x_{i+1}) = f(x_i) + h f'(x_i) + \frac{h^2 f''(c)}{2}.$$

Now some simple algebra yields

$$\frac{f(x_{i+1}) - f(x_i)}{h} - f'(x_i) = \frac{h f''(c)}{2}.$$

Note that the left-hand side is precisely the difference between the forward difference approximation and the actual derivative $f'(x_i)$. Thus we have proved the following result.

Theorem 21. Let $f \in C^2([x_i, x_{i+1}])$.¹ Then the error in approximating $f'(x_i)$ with the forward difference approximation is bounded by

$$\left| \frac{f(x_{i+1}) - f(x_i)}{h} - f'(x_i) \right| \le \frac{h}{2} \max_{x_i \le x \le x_{i+1}} |f''(x)| = Ch,$$

where $C$ is a constant independent of $h$.

¹ This means $f$ is a twice continuously differentiable function on the interval $[x_i, x_{i+1}]$.

The theorem above tells us the relationship between the error and the point spacing $h$ is linear. In other words, if we cut $h$ in half, we can expect the error to get cut (approximately) in half as well.

For the backward difference approximation, we can perform a similar analysis as for the forward difference approximation and will get the following result.

Theorem 22. Let $f \in C^2([x_{i-1}, x_i])$. Then the error in approximating $f'(x_i)$ with the backward difference approximation is bounded by

$$\left| \frac{f(x_i) - f(x_{i-1})}{h} - f'(x_i) \right| \le Ch,$$

where $C$ is a constant independent of $h$ and depends on the value of the second derivative of $f$ near $x_i$.

Example 23. Use the forward difference method, with varying $h$, to approximate the derivative of $f(x) = e^{\sin(x)}$ at $x = 0$. Verify numerically that the error is $O(h)$.

We type the following commands into MATLAB and get the following output for the forward difference calculations and the associated error (note $f'(0) = 1$) for a series of decreasing $h$'s.

    >> h = (1/2).^[1:10];
    >> fprime_fd = (exp(sin(0+h)) - exp(sin(0)))./h;
    >> error = fprime_fd - 1;
    >> [h', fprime_fd', error', [0, error(1:end-1)./error(2:end)]']

    ans =
       5.0000e-01   1.2303e+00   2.3029e-01            0
       2.5000e-01   1.1228e+00   1.2279e-01   1.8756e+00
       1.2500e-01   1.0622e+00   6.2240e-02   1.9728e+00
       6.2500e-02   1.0312e+00   3.1218e-02   1.9937e+00
       3.1250e-02   1.0156e+00   1.5621e-02   1.9985e+00
       1.5625e-02   1.0078e+00   7.8120e-03   1.9996e+00
       7.8125e-03   1.0039e+00   3.9062e-03   1.9999e+00
       3.9062e-03   1.0020e+00   1.9531e-03   2.0000e+00
       1.9531e-03   1.0010e+00   9.7656e-04   2.0000e+00
       9.7656e-04   1.0005e+00   4.8828e-04   2.0000e+00

The first column is $h$, the second is the forward difference approximation, the third column is the error, and the fourth column is the ratio of successive errors. To verify the method is $O(h)$, we expect that the errors get cut in half when $h$ gets cut in half. The fourth column verifies this is true, since the ratios of successive errors converge to 2.

4.2.2 Centered difference

There are (many) more ways to approximate $f'(x_i)$ using finite differences. One thing to notice about the forward and backward differences is that if $f$ has curvature, then, for smaller $h$, it must be true that one of the methods is an overestimate and the other is an underestimate. Then it makes sense that an average of the two methods might produce a better approximation. After averaging, we get a formula that could also arise from using a finite difference of the values at $x_{i-1}$ and $x_{i+1}$:

2 point centered difference: $$f'(x_i) \approx \frac{f(x_{i+1}) - f(x_{i-1})}{2h}.$$

A graphical illustration is given in Figure 4.3.

Figure 4.3: The centered difference approximation to $f'(x_i)$.

Example 24. Using $h = 0.1$, approximate $f'(1)$ with $f(x) = x^3$ using forward, backward, and centered difference methods. Note that the true solution is $f'(1) = 3$.

For $h = 0.1$, we get the approximations

$$FD = \frac{f(1.1) - f(1)}{0.1} = 3.31 \quad (|\text{error}| = 0.31), \tag{4.1}$$

$$BD = \frac{f(1) - f(0.9)}{0.1} = 2.71 \quad (|\text{error}| = 0.29), \tag{4.2}$$

$$CD = \frac{f(1.1) - f(0.9)}{0.2} = 3.01 \quad (|\text{error}| = 0.01). \tag{4.3}$$
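The three values in Example 24 can be reproduced with a few lines. This is a minimal sketch; the variable names are chosen here for illustration.

```matlab
% Sketch: forward, backward, and centered differences for f(x) = x^3 at x = 1.
f = @(x) x.^3;
x = 1;
h = 0.1;

FD = (f(x + h) - f(x)) / h;             % forward difference
BD = (f(x) - f(x - h)) / h;             % backward difference
CD = (f(x + h) - f(x - h)) / (2*h);     % centered difference

fprintf('FD = %.4f (|error| = %.4f)\n', FD, abs(FD - 3));
fprintf('BD = %.4f (|error| = %.4f)\n', BD, abs(BD - 3));
fprintf('CD = %.4f (|error| = %.4f)\n', CD, abs(CD - 3));
```

Note that the centered difference equals the average of the forward and backward differences, which is how the formula was motivated.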

We see in the example that the centered difference method is more accurate. As it is dangerous to draw conclusions based on one example, let us look at what some mathematical analysis tells us. Consider the Taylor series of a function $f$ expanded about $x_i$:

$$f(x) = f(x_i) + f'(x_i)(x - x_i) + \frac{f''(x_i)}{2!}(x - x_i)^2 + \frac{f'''(x_i)}{3!}(x - x_i)^3 + \cdots.$$

Choose $x = x_{i+1}$ and then $x = x_{i-1}$ to get the two equations

$$f(x_{i+1}) = f(x_i) + h f'(x_i) + h^2 \frac{f''(x_i)}{2!} + h^3 \frac{f'''(x_i)}{3!} + \cdots,$$

$$f(x_{i-1}) = f(x_i) - h f'(x_i) + h^2 \frac{f''(x_i)}{2!} - h^3 \frac{f'''(x_i)}{3!} + \cdots.$$

Subtracting the equations cancels out the even-order terms on the right-hand sides (those with $h^0, h^2, \ldots$) and leaves

$$f(x_{i+1}) - f(x_{i-1}) = 2h f'(x_i) + 2h^3 \frac{f'''(x_i)}{3!} + \cdots,$$

and thus, using some algebra, we get

$$\frac{f(x_{i+1}) - f(x_{i-1})}{2h} - f'(x_i) = h^2 \frac{f'''(x_i)}{3!} + \cdots.$$

We now see that the leading error term in the centered difference approximation is $h^2 \frac{f'''(x_i)}{3!}$. Thus, we repeat this procedure, but use Taylor's theorem to truncate at the third derivative terms. This gives us

$$\frac{f(x_{i+1}) - f(x_{i-1})}{2h} - f'(x_i) = h^2 \frac{f'''(c_1) + f'''(c_2)}{12}.$$

We have proved the following theorem.

Theorem 25. Let $f \in C^3([x_{i-1}, x_{i+1}])$. Then, the error in approximating $f'(x_i)$ with the centered difference approximation is bounded by

$$\left| \frac{f(x_{i+1}) - f(x_{i-1})}{2h} - f'(x_i) \right| \le Ch^2 = O(h^2),$$

where $C$ is a constant with respect to $h$ and depends on the value of the third derivative of $f$ near $x_i$.

This is a big deal. We have proven that the error in centered difference approximations is $O(h^2)$, whereas the error in forward and backward differences is $O(h)$. So, if $h$ gets cut by a factor of 10, we expect FD and BD errors to each get cut by a factor of 10, but the CD error will get cut by a factor of 100. Hence, we can expect much better answers using CD, as we suspected. Note that Theorem 25 requires more regularity of $f$ (one more derivative needs to exist).

Example 26. Use the centered difference method, with varying $h$, to approximate the derivative of $f(x) = \sin(x)$ at $x = 1$. Verify numerically that the error is $O(h^2)$.

We type the following commands into MATLAB and get the following output for the centered difference calculations and the associated error (note $f'(1) = \cos(1)$):

    >> h = (1/2).^[1:10];
    >> fprime_cd = (sin(1+h) - sin(1-h))./(2*h);
    >> error = abs(fprime_cd - cos(1));
    >> [h', fprime_cd', error', [0, error(1:end-1)./error(2:end)]']

    ans =
       5.0000e-01   5.1807e-01   2.2233e-02            0
       2.5000e-01   5.3469e-01   5.6106e-03   3.9627e+00
       1.2500e-01   5.3890e-01   1.4059e-03   3.9906e+00
       6.2500e-02   5.3995e-01   3.5169e-04   3.9977e+00
       3.1250e-02   5.4021e-01   8.7936e-05   3.9994e+00
       1.5625e-02   5.4028e-01   2.1985e-05   3.9999e+00
       7.8125e-03   5.4030e-01   5.4962e-06   4.0000e+00
       3.9062e-03   5.4030e-01   1.3741e-06   4.0000e+00
       1.9531e-03   5.4030e-01   3.4351e-07   4.0000e+00
       9.7656e-04   5.4030e-01   8.5879e-08   4.0000e+00

As in the previous example, the first column is $h$, the second is the centered difference approximation, the third column is the error, and the fourth column is the ratio of successive errors. To verify the method is $O(h^2)$, we expect that the errors get cut by a factor of four when $h$ gets cut in half, and the fourth column verifies this is true.

4.2.3 Three point difference formulas

The centered difference formula offers a clear advantage in accuracy over the backward and forward difference formulas. However, the centered difference formula cannot be used at the endpoints. Hence if one desires to approximate $f'(x_0)$ or $f'(x_N)$ with accuracy greater than $O(h)$, we have to derive new formulas. The idea in the derivations is to use Taylor series approximations with more points; if we use only two points, we cannot do better than forward or backward difference formulas.

Hence consider deriving a formula for $f'(x_0)$ based on the points $(x_0, f(x_0))$, $(x_1, f(x_1))$, and $(x_2, f(x_2))$. Since we are going to use Taylor series approximations, the obvious choice of the expansion point is $x_0$. Note that this is the only way to get the equations to contain $f'(x_0)$. As for which $x$-points to plug in, we have already decided to use $x_0, x_1, x_2$, and since we choose $x_0$ as the expansion point, consider Taylor series for $x = x_1$ and $x = x_2$:

$$f(x_1) = f(x_0) + h f'(x_0) + h^2 \frac{f''(x_0)}{2!} + h^3 \frac{f'''(x_0)}{3!} + \cdots,$$

$$f(x_2) = f(x_0) + 2h f'(x_0) + (2h)^2 \frac{f''(x_0)}{2!} + (2h)^3 \frac{f'''(x_0)}{3!} + \cdots.$$

The goal is to add scalar multiples of these equations together so that we get a formula in terms of $f(x_0)$, $f(x_1)$, and $f(x_2)$ that is equal to $f'(x_0) + O(h^k)$ where $k$ is as large as possible. The idea is thus to "kill off" as many terms after the $f'(x_0)$ term as possible. For these two equations, this is achieved by adding $-4 \times$ equation 1 $+$ equation 2. This will kill the $f''$ terms, but will leave the $f'''$ term. Hence we truncate the two Taylor series at the $f'''$ term and then combine to get the equation

$$3f(x_0) - 4f(x_1) + f(x_2) = -2h f'(x_0) + \frac{h^3}{6}\left(-4f'''(c_1) + 8f'''(c_2)\right),$$

where $x_0 \le c_1 \le x_1$ and $x_0 \le c_2 \le x_2$. Assuming that $f$ is three times differentiable near $x_0$, dividing through by $-2h$ gives

$$\frac{3f(x_0) - 4f(x_1) + f(x_2)}{-2h} = f'(x_0) + Ch^2,$$

with $C$ being a constant depending on $f'''$ but independent of $h$. A formula for $f'(x_N)$ can be derived analogously, and note that we could have used any three consecutive $x$-points and gotten an analogous result (but it typically only makes sense to do it at the endpoints, as otherwise one would just use centered difference). Thus, we have proven the following.

Theorem 27. Let $f$ be a three times differentiable function near $x_i$. Then the error in approximating $f'(x_i)$ with the three point, one-sided difference approximation satisfies

$$\left| \frac{-3f(x_i) + 4f(x_{i+1}) - f(x_{i+2})}{2h} - f'(x_i) \right| = O(h^2),$$

$$\left| \frac{3f(x_i) - 4f(x_{i-1}) + f(x_{i-2})}{2h} - f'(x_i) \right| = O(h^2).$$

Thus, if we wanted second-order accurate formulas for $f'(x_i)$, we would use the centered difference formula at all interior points and the three point formulas at the endpoints.
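In the spirit of Examples 23 and 26, the second-order accuracy of the one-sided three point formula can be checked numerically. The sketch below uses $f(x) = \sin(x)$ at $x = 1$, a choice made here for illustration (it has a nonzero third derivative there, so the $O(h^2)$ term is visible).

```matlab
% Sketch: numerical check that the three point one-sided formula is O(h^2),
% using f(x) = sin(x) at x = 1, where f'(1) = cos(1).
f = @(x) sin(x);
x = 1;
h = (1/2).^(1:8);

% (-3 f(x) + 4 f(x+h) - f(x+2h)) / (2h), evaluated for every h at once
fprime_3pt = (-3*f(x) + 4*f(x + h) - f(x + 2*h)) ./ (2*h);
err    = abs(fprime_3pt - cos(1));
ratios = err(1:end-1) ./ err(2:end);      % ratios of successive errors

disp([h', fprime_3pt', err', [0, ratios]']);
```

The ratios of successive errors should approach 4 as $h$ is halved, as expected for an $O(h^2)$ formula.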

4.2.4 Further notes

There are some important notes to these approximations that should be considered.
– If you use more points in the formulas, and assume that higher order derivatives of $f$ exist, you can derive higher order accurate formulas. However, even more complicated formulas will be needed at the boundary to obtain the higher order accuracy. The $O(h^2)$ formulas are by far the most common in practice.
– If you do not have equal point spacing, $O(h^2)$ formulas can still be derived at each point $x_i$, using Taylor series as in this section.
– If you are trying to approximate $f'$ for given data and the data is noisy, using these methods is probably not a good idea. A better alternative is to find a "curve of best fit" function for the noisy data, then take the derivative of that function.

4.3 Approximating the second derivative

Similar ideas as those used for the first derivative approximations can be used for the second derivative. The procedure for deriving the formulas still uses Taylor polynomials, but now we must be careful that our formulas for $f''$ are only in terms of function values and not in terms of $f'$. The following is a 3-point centered difference approximation to $f''(x_i)$:

$$f''(x_i) \approx \frac{f(x_{i-1}) - 2f(x_i) + f(x_{i+1})}{h^2}.$$

We can prove this formula is $O(h^2)$.

Theorem 28. Assume $f$ is four times differentiable in an interval containing $x_i$. Then

$$\left| \frac{f(x_{i-1}) - 2f(x_i) + f(x_{i+1})}{h^2} - f''(x_i) \right| \le Ch^2,$$

where $C$ is independent of $h$.

Proof. Let $x_i$ be the expansion point in a Taylor series, and get 2 equations from this series by choosing $x$ values of $x_{i+1}$ and $x_{i-1}$. Then truncate the series at the fourth derivative terms. This gives us

$$f(x_{i+1}) = f(x_i) + h f'(x_i) + h^2 \frac{f''(x_i)}{2!} + h^3 \frac{f'''(x_i)}{3!} + h^4 \frac{f''''(c_1)}{4!},$$

$$f(x_{i-1}) = f(x_i) - h f'(x_i) + h^2 \frac{f''(x_i)}{2!} - h^3 \frac{f'''(x_i)}{3!} + h^4 \frac{f''''(c_2)}{4!},$$

with $x_{i-1} \le c_2 \le x_i \le c_1 \le x_{i+1}$. Now adding the formulas together and doing a little algebra provides

$$\frac{f(x_{i-1}) - 2f(x_i) + f(x_{i+1})}{h^2} = f''(x_i) + h^2 \frac{f''''(c_1) + f''''(c_2)}{4!}.$$

The assumption that $f$ is four times differentiable in an interval containing $x_i$ means that for $h$ small enough, the term $\frac{f''''(c_1) + f''''(c_2)}{4!}$ will be bounded above, independent of $h$. This completes the proof.

More accurate (and more complicated) formulas can be derived for approximating the second derivative by using more x points in the formula. However, we will stop our derivation of formulas here.
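The $O(h^2)$ accuracy of the 3-point centered formula can be checked in the same way as for the first derivative formulas. The sketch below uses $f(x) = \sin(x)$ at $x = 1$, where $f''(1) = -\sin(1)$; the function and the range of $h$ are illustrative choices.

```matlab
% Sketch: numerical check that the 3-point centered second difference is O(h^2),
% using f(x) = sin(x) at x = 1, where f''(1) = -sin(1).
f = @(x) sin(x);
x = 1;
h = (1/2).^(1:6);

fpp_cd = (f(x - h) - 2*f(x) + f(x + h)) ./ h.^2;
err    = abs(fpp_cd - (-sin(1)));
ratios = err(1:end-1) ./ err(2:end);      % ratios of successive errors

disp([h', fpp_cd', err', [0, ratios]']);
```

The ratios should approach 4. The range of $h$ is kept moderate because the division by $h^2$ amplifies roundoff error when $h$ is very small.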

4.4 Application: Initial value ODEs using the forward Euler method

Consider the initial value ODE: For a given $f$ and initial data $(t_0, y_0)$, find a function $y(t)$ satisfying

$$y'(t) = f(t, y), \quad t_0 < t \le T,$$
$$y(t_0) = y_0.$$

In a sophomore level differential equations course, students learn how to analytically solve a dozen or so special cases. But how do you solve the other infinitely many cases? Finite difference methods can be an enabling technology for such problems.

We discuss now the forward Euler method for approximating solutions to the ODE above. This is the first and simplest solver for initial value ODEs, and we will discuss and analyze this and more complex (and more accurate) methods in detail in a later chapter.

We begin by discretizing time with $N + 1$ points, and let $t_0 < t_1 < t_2 < t_3 < \cdots < t_N = T$ be these points. For simplicity, assume the points are equally spaced and call the point spacing $\Delta t$. Since we cannot find a function $y(t)$, we wish to approximate the solution $y(t)$ at each $t_i$; call these approximations $y_i$. These are the unknowns, and we want them to satisfy $y_i \approx y(t_i)$ in some sense.

Consider our ODE at a particular $t$, $t = t_n > t_0$. The equation must hold at $t_n$ for $n > 0$ since it holds for all $t_0 < t \le T$. Thus we have that

$$y'(t_n) = f(t_n, y(t_n)).$$

Applying the forward difference formula gives us

$$\frac{y(t_{n+1}) - y(t_n)}{\Delta t} \approx f(t_n, y(t_n)).$$

The forward Euler timestepping algorithm is created by replacing the approximation signs by equals signs and the true solutions $y(t_n)$ by their approximations $y_n$:

Step 1: Given $y_0$.
Step 2: For $n = 0, 1, 2, \ldots, N - 1$:

$$y_{n+1} = y_n + \Delta t\, f(t_n, y_n).$$

This is a simple and easy-to-implement algorithm that will yield a set of points $(t_0, y_0), (t_1, y_1), \ldots, (t_N, y_N)$ to approximate the true solution $y(t)$ on $[t_0, T]$. The question of accuracy will be addressed in more detail in a later chapter, but for now we just state that the accuracy is $O(\Delta t)$. We know that the forward difference approximation itself is only first order, so we could not expect forward Euler to be any better than that, but it does reach this accuracy. Hence we can be assured that as we use more and more points (by cutting the timestep $\Delta t$), the error will shrink, and the forward Euler solution will converge to the true solution as $\Delta t \to 0$.

The forward Euler code is shown below. It takes as input a right-hand side function named func, a vector of $t$'s (which discretize the interval $[t_0, T]$), and the initial condition. Since MATLAB indexing starts at 1, the initial condition is stored in y(1).

    function y = forwardEuler(func, t, y1)
    N = length(t);
    y = zeros(N,1);
    % Set initial condition:
    y(1) = y1;
    % use forward Euler to find y(i+1) using y(i)
    for i = 1:N-1
        y(i+1) = y(i) + ( t(i+1) - t(i) ) * func(t(i), y(i));
    end

Example 29. Given the initial value problem:

$$y' = \frac{1}{t^2} - \frac{y}{t} - y^2, \qquad y(1) = -1,$$

use the forward Euler method to approximate the solution on $[1, 2]$. Run it with 11 points, 51 points, and 101 points (so $\Delta t = 0.1$, $0.02$, and $0.01$, resp.), and compare your solutions to the true solution $y(t) = \frac{-1}{t}$.

We first need to implement the right-hand side $f(t, y) = \frac{1}{t^2} - \frac{y}{t} - y^2$ (here we wrote it as an anonymous function):

    f = @(t,y) 1/(t^2) - y/t - y^2

One can call the solver via

    t101 = linspace(1,2,101);
    y101 = forwardEuler(f, t101, -1);

We plot the computed solutions below, along with the true solution, and observe convergence of the forward Euler solutions to the true solution.
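The convergence seen in the plot can also be quantified. The following sketch repeats the computation of Example 29 for the three resolutions and reports the maximum error against the true solution $y(t) = -1/t$. To keep the script self-contained, the forward Euler loop is written inline rather than calling forwardEuler; the variable names are chosen here for illustration.

```matlab
% Sketch: forward Euler on Example 29 with 11, 51, and 101 points,
% measuring the maximum error against the true solution y(t) = -1/t.
f     = @(t, y) 1./(t.^2) - y./t - y.^2;
ytrue = @(t) -1 ./ t;

Npts   = [11, 51, 101];
maxerr = zeros(size(Npts));
for j = 1:numel(Npts)
    t = linspace(1, 2, Npts(j))';
    y = zeros(Npts(j), 1);
    y(1) = -1;                                   % initial condition y(1) = -1
    for i = 1:Npts(j)-1
        y(i+1) = y(i) + (t(i+1) - t(i)) * f(t(i), y(i));
    end
    maxerr(j) = max(abs(y - ytrue(t)));
    fprintf('%4d points, dt = %.2f, max error = %.3e\n', Npts(j), t(2) - t(1), maxerr(j));
end
```

Going from 51 to 101 points halves $\Delta t$, so for a first-order method the maximum error should drop by a factor of roughly 2.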

4.5 Application: Boundary value ODEs Consider next the 1D diffusion equation with given boundary data: Find the function u(x) satisfying u󸀠󸀠 (x) = f (x) u(a) = ua ,

on a < x < b,

u(b) = ub ,

(4.4) (4.5) (4.6)

for a given function $f$ and boundary values $u_a$ and $u_b$. Finite difference methods can be used to approximate solutions to equations of this form. Similar to initial value problems, the first step is to discretize the domain, so we pick $N$ equally spaced points on $[a, b]$, with point spacing $h$:
\[ a = x_1 < x_2 < \cdots < x_N = b. \]
Since $u''(x) = f(x)$ on all of the interval $(a, b)$, it must be true that
\begin{align*}
u''(x_2) &= f(x_2),\\
u''(x_3) &= f(x_3),\\
u''(x_4) &= f(x_4),\\
&\;\;\vdots\\
u''(x_{N-1}) &= f(x_{N-1}).
\end{align*}
Now for each equation, we approximate the left-hand side using the second-order finite difference method. This changes the system of equations to a system of approximations
\begin{align*}
u(x_1) - 2u(x_2) + u(x_3) &\approx h^2 f(x_2),\\
u(x_2) - 2u(x_3) + u(x_4) &\approx h^2 f(x_3),\\
u(x_3) - 2u(x_4) + u(x_5) &\approx h^2 f(x_4),\\
&\;\;\vdots\\
u(x_{N-2}) - 2u(x_{N-1}) + u(x_N) &\approx h^2 f(x_{N-1}).
\end{align*}
Denote by $u_n$ the approximation to $u(x_n)$ (for $1 \le n \le N$), and let these approximations exactly satisfy the system above. This gives us a system of linear equations for the approximations $u_1, u_2, \ldots, u_N$:
\begin{align*}
u_1 - 2u_2 + u_3 &= h^2 f(x_2),\\
u_2 - 2u_3 + u_4 &= h^2 f(x_3),\\
u_3 - 2u_4 + u_5 &= h^2 f(x_4),\\
&\;\;\vdots\\
u_{N-2} - 2u_{N-1} + u_N &= h^2 f(x_{N-1}).
\end{align*}
Recall that we know the values of $u_1$ and $u_N$ since they are given, so we can plug in their values and move them to the right-hand sides of their equations. Now the system becomes
\begin{align*}
-2u_2 + u_3 &= h^2 f(x_2) - u_a,\\
u_2 - 2u_3 + u_4 &= h^2 f(x_3),\\
u_3 - 2u_4 + u_5 &= h^2 f(x_4),\\
&\;\;\vdots\\
u_{N-2} - 2u_{N-1} &= h^2 f(x_{N-1}) - u_b.
\end{align*}
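For reference, the reduced system in the unknowns $u_2, \ldots, u_{N-1}$ can be written out as a tridiagonal matrix equation. The display below is a sketch assembled directly from the equations above; blank entries are zeros.

```latex
\[
\begin{pmatrix}
-2 & 1 \\
1 & -2 & 1 \\
 & \ddots & \ddots & \ddots \\
 & & 1 & -2 & 1 \\
 & & & 1 & -2
\end{pmatrix}
\begin{pmatrix} u_2 \\ u_3 \\ \vdots \\ u_{N-2} \\ u_{N-1} \end{pmatrix}
=
\begin{pmatrix}
h^2 f(x_2) - u_a \\ h^2 f(x_3) \\ \vdots \\ h^2 f(x_{N-2}) \\ h^2 f(x_{N-1}) - u_b
\end{pmatrix}
\]
```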

This is a system of $N-2$ linear equations with $N-2$ unknowns, which means it can be written in matrix-vector form as $Ax = b$, and we can use Gaussian elimination to solve it.

Example 30. Use the finite difference method above to approximate a solution on $[-1, 1]$ to
\[ u''(x) = 2e^x + xe^x, \]
with boundary values $u(-1) = -e^{-1}$ and $u(1) = e$.

First, we set up the linear system above. Define the function f, the x-points, and the boundary values. We choose N = 100 points.

>> f = @(x) (2+x).*exp(x);
>> ua = -exp(-1);
>> ub = exp(1);
>> N = 100;
>> x = linspace(-1,1,N);


The linear system has an $(N-2) \times (N-2)$ matrix with −2's on the diagonal, 1's on the first upper and lower diagonals, and 0 for the rest of the entries. The right-hand side is created with $h^2 f(x_i)$, where $i$ goes from 2 to $N-1$, using the mesh width h.

>> d1 = -2*ones(N-2,1);
>> d2 = ones(N-3,1);
>> A = diag(d1) + diag(d2,1) + diag(d2,-1);
>> h = x(2) - x(1);
>> b = h^2 * f(x(2:N-1));
>> b(1) = b(1) - ua;
>> b(N-2) = b(N-2) - ub;

We can now solve the linear system, and then add on the endpoints to get a finite difference approximate solution at all N points:

>> u = A \ b';
>> u = [ua; u; ub];

Note that our solve is inefficient, since the matrix is very sparse, but we are treating it as a full matrix. Fixing this is an exercise. For this test example, we know the true solution is $u_{true}(x) = xe^x$, and so we plot our solution together with the true solution to show it is accurate.

>> plot(x,u,'k-',x,x.*exp(x),'r--','LineWidth',2.5);
>> xlabel('x','FontSize',20)
>> ylabel('u','FontSize',20)
>> legend('Finite diff approx','True soln')
>> set(gca,'FontSize',20)

We observe from the plot that the two curves lie on top of each other, which means success. Note that if we did not know the true solution, we would run the code with several different values of N (say 1000, 2000, and 5000), and then make sure all three plots are the same. The method will converge with $O(h^2)$, but you must make sure your h is small enough that your numerical solution is sufficiently converged.

Lastly, we will illustrate that the method converges with $O(h^2)$. This accuracy is expected, since this is the accuracy of the finite difference approximations that were made to create the method. We calculate the solution using point spacing of h = 0.01, 0.005, 0.0025, and 0.00125, and then compare to the true solution. Shown below are the errors and error ratios for these h's (which correspond to N = 201, 401, 801, and 1601):

N       errors        ratios
201     1.9608e-05    0
401     4.9020e-06    4.0000e+00
801     1.2255e-06    4.0000e+00
1601    3.0638e-07    4.0000e+00
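A table of this kind can be generated along the following lines. This is only a sketch: it reuses the dense-matrix setup of Example 30, runs on coarser grids (N = 51, 101, 201, chosen here to keep the dense solve cheap), and the names Ns, errs, and ratios are illustrative.

```matlab
% Convergence check for u'' = (2+x)e^x on [-1,1], true solution x*e^x.
f = @(x) (2+x).*exp(x);
utrue = @(x) x.*exp(x);
ua = -exp(-1); ub = exp(1);
Ns = [51 101 201];                 % each refinement halves h
errs = zeros(size(Ns));
for k = 1:numel(Ns)
    N = Ns(k);
    x = linspace(-1, 1, N)';
    h = x(2) - x(1);
    A = diag(-2*ones(N-2,1)) + diag(ones(N-3,1),1) + diag(ones(N-3,1),-1);
    b = h^2 * f(x(2:N-1));
    b(1) = b(1) - ua;
    b(end) = b(end) - ub;
    u = [ua; A\b; ub];             % attach the known boundary values
    errs(k) = max(abs(u - utrue(x)));
end
ratios = errs(1:end-1) ./ errs(2:end);
disp([errs; 0, ratios])
```

Halving h should reduce the error by a factor close to 4 if the method is second order.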

The errors were calculated for each solution as the maximum absolute error of the approximate solution at each node. The ratios of 4 indicate $O(h^2)$ convergence.

The same procedure can easily be extended to advection-diffusion-reaction systems, that is, equations of the form
\begin{align}
\alpha u''(x) + \beta u'(x) + \gamma u(x) &= f(x) \quad \text{on } a < x < b, \tag{4.7}\\
u(a) &= u_a, \tag{4.8}\\
u(b) &= u_b, \tag{4.9}
\end{align}

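where $\alpha$, $\beta$, $\gamma$ are constants. Here, we approximate the second derivative in the same way, and for the first derivative, we use the centered difference method. Before applying the boundary conditions, the resulting linear system is
\begin{align*}
\alpha\frac{u_1 - 2u_2 + u_3}{h^2} + \beta\frac{u_3 - u_1}{2h} + \gamma u_2 &= f(x_2),\\
\alpha\frac{u_2 - 2u_3 + u_4}{h^2} + \beta\frac{u_4 - u_2}{2h} + \gamma u_3 &= f(x_3),\\
\alpha\frac{u_3 - 2u_4 + u_5}{h^2} + \beta\frac{u_5 - u_3}{2h} + \gamma u_4 &= f(x_4),\\
&\;\;\vdots\\
\alpha\frac{u_{N-2} - 2u_{N-1} + u_N}{h^2} + \beta\frac{u_N - u_{N-2}}{2h} + \gamma u_{N-1} &= f(x_{N-1}).
\end{align*}
Next, using that $u_1 = u_a$ and $u_N = u_b$ and doing some algebra, we get the linear system
\begin{align*}
\left(\frac{-2\alpha}{h^2} + \gamma\right) u_2 + \left(\frac{\alpha}{h^2} + \frac{\beta}{2h}\right) u_3 &= f(x_2) - \frac{\alpha u_a}{h^2} + \frac{\beta u_a}{2h},\\
\left(\frac{\alpha}{h^2} - \frac{\beta}{2h}\right) u_2 + \left(\frac{-2\alpha}{h^2} + \gamma\right) u_3 + \left(\frac{\alpha}{h^2} + \frac{\beta}{2h}\right) u_4 &= f(x_3),\\
\left(\frac{\alpha}{h^2} - \frac{\beta}{2h}\right) u_3 + \left(\frac{-2\alpha}{h^2} + \gamma\right) u_4 + \left(\frac{\alpha}{h^2} + \frac{\beta}{2h}\right) u_5 &= f(x_4),\\
&\;\;\vdots\\
\left(\frac{\alpha}{h^2} - \frac{\beta}{2h}\right) u_{N-2} + \left(\frac{-2\alpha}{h^2} + \gamma\right) u_{N-1} &= f(x_{N-1}) - \frac{\alpha u_b}{h^2} - \frac{\beta u_b}{2h}.
\end{align*}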

Notice that the coefficients in each equation form a pattern, and this can be exploited to easily generate the resulting matrix. It is left as an exercise to write a MATLAB program that solves such a boundary value problem.

4.6 Exercises

1. Find Taylor series approximations using quadratic polynomials (n = 2) for the function $f(x) = e^x$ at x = 0.9, using the expansion point $x_0 = 1$. Find an upper bound on the error using Taylor's theorem, and compare it to the actual error.
2. Prove Theorem 22. State any assumptions that you make.
3. Prove the second result in Theorem 27. State any assumptions that you make.
4. Approximate the derivative of $f(x) = \sin^2(x)$ at x = 1, using backward, forward and centered difference approximations. Use the same h as in the examples in this section. Verify numerically that the convergence orders are $O(h)$, $O(h)$ and $O(h^2)$, respectively.
5. Suppose a function f is twice differentiable but not three times differentiable. What would you expect for its accuracy? Would it still be $O(h^2)$? Why or why not?
6. Show that
\[ \left| \frac{f(x_{n+3}) - 9f(x_{n+1}) + 8f(x_n)}{-6h} - f'(x_n) \right| = O(h^2). \]
What assumptions are needed to get this result?
7. Find a formula for $f'(x_n)$ that is $O(h^4)$ accurate that uses $f(x_{n-2})$, $f(x_{n-1})$, $f(x_{n+1})$, $f(x_{n+2})$. Test it on $f(x) = \sin(x)$ at x = 1 using several h values to show it is indeed fourth-order accurate.
8. Write a program that uses the finite difference method to solve $u''(x) = f(x)$ on $[a, b]$, given boundary values and the number of points N. Your code should input f, a, b, $u_a$, $u_b$, and N and should output the vectors x and u. The above example for $u'' = f$ is a good template to start from. Your code should solve the linear system more efficiently by having a sparse matrix A created using spdiags (instead of diag). Test your code on $u'' = 8\sin(2x)$ with boundary values $u(0) = 0$ and $u(\pi) = 0$. The true solution is $-2\sin(2x)$, and your code should converge to it with rate $O(h^2)$.
9. Use the finite difference method to approximate a solution to the boundary value problem
\[ y'' - 5y' - 2y = x^2 e^x, \quad -1 < x < 2, \quad y(-1) = 0, \quad y(2) = 0. \]
Here, you must write your own code. Run the code using N = 11, 21, 51, 101, 201, and 401 points. Plot all the solutions on the same graph. Has it converged by N = 401?
10. The differential equation
\[ W''(x) - \frac{S}{D} W(x) = \frac{-ql}{2D} x + \frac{q}{2D} x^2, \quad 0 \le x \le l \]
describes plate deflection under an axial tension force, where l is the plate length, D is rigidity, q is intensity of the load, and S is axial force. Approximate the deflection of a plate if q = 200 lb/in², S = 200 lb/in, D = 1,000,000 lb/in and l = 12 in, assuming W(0) = W(l) = 0. Plot your solution, and discuss why you think it is correct.
11. Solve the initial value problem
\[ y'(t) = -e^t \sin(e^t - 1), \quad y(0) = 0, \]
on [0, 2], using the forward Euler method. You should run your code with several choices of $\Delta t$ to make sure it converges.

5 Solving nonlinear equations

There should be no doubt that there is a great need to solve nonlinear equations with numerical methods. For most nonlinear equations, finding an analytical solution is impossible or very difficult. Just in one variable, we have equations such as $e^x = x^2$ which cannot be analytically solved. In multiple variables, the situation only gets worse. The purpose of this chapter is to study some common numerical methods for solving nonlinear equations. We will quantify how they work, when they work, how well they work, and when they fail.

5.1 Convergence criteria of iterative methods for nonlinear systems

This section will develop and discuss algorithms that will (hopefully) converge to solutions of a nonlinear equation. Each method we discuss will be in the form of:

Step 0: Guess at the solution with $x_0$ (or two initial guesses $x_0$ and $x_1$)
Step k: Use $x_0, x_1, x_2, \ldots, x_{k-1}$ to generate a better approximation $x_k$

These algorithms all build sequences $\{x_k\}_{k=0}^{\infty} = \{x_0, x_1, x_2, \ldots\}$ which hopefully will converge to a solution $x^*$. As one might expect, some algorithms converge quickly, and some converge slowly or not at all. It is our goal to find rigorous criteria for when the discussed algorithms converge, and to quantify how quickly they converge when they are successful.

Although there are several ways to determine how an algorithm is converging, it is typically best mathematically to measure the error in the kth iteration, $e_k = |x_k - x^*|$, that is, the distance between $x_k$ and the solution. Once this is sufficiently small, the algorithm can be terminated, and the last iterate becomes the solution. We now define the notions of linear, superlinear, and quadratic convergence.

Definition 31. Suppose an algorithm generates a sequence of iterates $\{x_0, x_1, x_2, \ldots\}$ which converges to $x^*$. We say the algorithm converges linearly if
\[ \lim_{k\to\infty} \frac{e_{k+1}}{e_k} = \lim_{k\to\infty} \frac{|x_{k+1} - x^*|}{|x_k - x^*|} \le \alpha < 1, \]
and that it converges superlinearly if
\[ \lim_{k\to\infty} \frac{e_{k+1}}{e_k^p} = \lim_{k\to\infty} \frac{|x_{k+1} - x^*|}{|x_k - x^*|^p} \le C, \]
where $p > 1$ (note that C need not be smaller than 1). If $p = 2$, the convergence is called quadratic.

https://doi.org/10.1515/9783110573329-005

To see the benefit of superlinear convergence, consider the error once the algorithm gets "close" to the root, that is, once $|x_k - x^*|$ gets small. For $p > 1$, we can write
\[ \lim_{k\to\infty} \frac{|x_{k+1} - x^*|}{|x_k - x^*|} \le C|x_k - x^*|^{p-1}, \]
which allows us to observe that the term on the right, $C|x_k - x^*|^{p-1}$, becomes smaller with each iteration. This means that convergence to the limit accelerates as the iteration progresses, and the method will eventually converge faster than any linearly convergent method.
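To make the difference concrete, the following sketch tabulates two model error sequences that satisfy the definitions with equality: a linear one with $\alpha = 0.5$ and a quadratic one with $C = 1$, $p = 2$. These constants and the starting error 0.1 are illustrative choices, not values from the text.

```matlab
% Model error sequences (illustrative constants, not from the text):
%   linear:    e_{k+1} = alpha * e_k    with alpha = 0.5
%   quadratic: e_{k+1} = C * e_k^2      with C = 1
alpha = 0.5; C = 1; K = 5;
elin = zeros(K+1,1); equad = zeros(K+1,1);
elin(1) = 0.1; equad(1) = 0.1;        % same starting error for both
for k = 1:K
    elin(k+1)  = alpha * elin(k);
    equad(k+1) = C * equad(k)^2;
end
disp([elin, equad])                   % columns: linear, quadratic
```

The linear sequence gains a fixed factor per step, while the number of correct digits in the quadratic sequence roughly doubles per step once the error is small.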

5.2 The bisection method

The bisection method is a very robust method for solving nonlinear equations in one variable. It is based on a special case of the intermediate value theorem, which we recall now from first semester Calculus.

Theorem 32 (Intermediate Value Theorem). Suppose a function f is continuous on $[a, b]$, and $f(a) \cdot f(b) < 0$. Then there exists a number c in the interval $(a, b)$ such that $f(c) = 0$.

The theorem says that if $f(a)$ and $f(b)$ are of opposite signs, and f is continuous, then the graph of f must cross the x-axis (y = 0) between a and b. Consider the following illustration of the theorem in Figure 5.1. In the top picture, two points are shown whose y-values have opposite signs. Do you think you can draw a continuous curve that connects these points without crossing the dashed line? No, you cannot! This is precisely what the intermediate value theorem says. Now that we have established that any continuous curve that connects the points must cross the dashed line, let us name the x-point where it crosses to be c, as in Figure 5.1. Note that a curve may cross multiple times, but we are guaranteed that it crosses at least once.

Figure 5.1: Illustration of the intermediate value theorem and the bisection method.

The bisection method for rootfinding is based on a simple idea and uses the intermediate value theorem. Consider two points, $(a, f(a))$ and $(b, f(b))$, from a function/graph with $f(b)$ and $f(a)$ having opposite signs. Thus, we know that there is a root of f in the interval $(a, b)$. Let m be the midpoint of the interval $[a, b]$, so that $m = (a + b)/2$, and evaluate $f(m)$. One of the following is true: either $f(m) = 0$, $f(a)$ and $f(m)$ have opposite signs, or $f(m)$ and $f(b)$ have opposite signs; it is simple to check which one. This means that we now know an interval that a root lives in, which is half the width of the previous interval! So if we repeat the process over and over, we can repeatedly reduce the size of the interval which we know contains a root. Do this enough times to "close in" on the root, and then the midpoint of that final interval can be considered a good estimate of a root. In a sense, this is the same basic concept as using your calculator to zoom in on a root. It is just that we find where the curve crosses the axis with equations instead of a picture. Let us now illustrate this process.

Example 33. Use the bisection method to find the root of $f(x) = \cos(x) - e^x$ on $[-2, -0.5]$.

Clearly, the function is continuous on the interval, and now we check that the function values at the endpoints have opposite signs:
f(−2) = −0.551482119783755 < 0,
f(−0.5) = 0.271051902177739 > 0.
Hence by the intermediate value theorem, we know that a root exists in [−2, −0.5], and we can proceed with the bisection algorithm.

Step 1: Call m the midpoint of the interval, so m = (−2 + (−0.5))/2 = −1.25, and we evaluate f(−1.25) = 0.028817565535079 > 0. Thus, there is a sign change on [−2, −1.25], so we now know an interval containing a root is [−2, −1.25]. Hence, we have reduced in half the width of the interval.

Step 2: For our next step, call m the midpoint of [−2, −1.25], so m = −1.625 and then we calculate f(−1.625) = −0.251088810231130 < 0. Thus, there is a sign change in [−1.625, −1.25], so we keep this interval and move on.

Step 3: For our next step, call m the midpoint of [−1.625, −1.25], so m = −1.4375 and then we calculate f(−1.4375) = −0.104618874642633 < 0. Thus, there is a sign change in [−1.4375, −1.25], so we keep this interval and move on.

Repeating this process enough will zoom in on the root −1.292695719373398. Note that, if the process is terminated after some number of iterations, we will know the maximum error between the last midpoint and true solution is half the length of the last interval.

Clearly, this process repeats itself on each step. We give below a code for the bisection process.

function [y, data] = bisect(a, b, func, tol)
% bisect(a,b,func,tol), uses the bisection method to find
% a root of func in (a,b) within tolerance tol
% evaluate the function at the end points
fa = func(a);
fb = func(b);
% Check that fa and fb have opposite signs
if fa*fb >= 0
    error('The function fails to satisfy f(a)*f(b) [...]

[...] tol)
    % perform the Newton iteration
    fxnew = fun(xnew);
    fprimexnew = funderiv(xnew);
    numits = numits + 1;
    xold = xnew;
    xnew = xnew - fxnew / fprimexnew;
end
root = xnew;
end

Example 44. Use Newton's method to solve $e^x = x^2$ using initial guess $x_0 = -1$ and tolerance $10^{-14}$.

We need to find the zero of $f(x) = e^x - x^2$, so we define $f(x)$ (like in the last section) and the derivative $f'(x)$ so we can pass them to the newt function.

>> myfun1 = @(x) exp(x) - x^2;
>> myfun1prime = @(x) exp(x) - 2*x;
>> [root, numiterations] = newt(myfun1, myfun1prime, -1, 1e-14)
root =
  -0.703467422498392
numiterations =
     5

We observe that the correct answer is found to a much smaller tolerance in 5 iterations than the bisection method found in 20! It is typical for Newton's method to converge much faster than bisection. While the bisection method always converges if the function is continuous and the starting interval contains a root, Newton's method can run into one of several difficulties:

1. If the initial guess is not sufficiently close, the initial tangent line approximation could be terrible and prevent convergence.
2. Vanishing derivatives. If $f'(x_k) \approx 0$, the tangent line has no root, or one that is very far away. Here, Newton's method can become numerically unstable and may not converge.
3. Cycles. You can construct examples where Newton's method jumps between two points (so $x_{k+2} = x_k$) and never converges. One example is $f(x) = x^3 - 2x + 2$ with $x_1 = 0$.
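The cycle in the last item is easy to observe numerically. The following sketch applies the Newton update by hand to $f(x) = x^3 - 2x + 2$ starting from 0; the variable names are illustrative, and the loop is written inline rather than through the newt function.

```matlab
% Newton's method on f(x) = x^3 - 2x + 2 starting at 0: a 2-cycle.
f  = @(x) x^3 - 2*x + 2;
fp = @(x) 3*x^2 - 2;            % derivative of f
x = 0;
iterates = zeros(1,5);
iterates(1) = x;
for k = 1:4
    x = x - f(x)/fp(x);         % Newton update
    iterates(k+1) = x;
end
disp(iterates)                  % alternates between 0 and 1
```

Since $f(0) = 2$, $f'(0) = -2$ and $f(1) = 1$, $f'(1) = 1$, the update maps 0 to 1 and 1 back to 0, so the iteration never approaches the real root.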

In summary, Newton's method is not guaranteed to work, but when it does, it is very fast. We prove next that it converges quadratically, under some assumptions on f and a good initial condition.

Theorem 45. Suppose $f \in C^3(B_\epsilon(x^*))$, where $x^*$ is a root of $f$, and $f'(x^*) \neq 0$. Then Newton's method converges quadratically, provided the initial guess is good enough.

Proof. The Newton iteration
\[ x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} \]
can be considered as a fixed-point iteration $x_{k+1} = g(x_k)$, where
\[ g(x) = x - \frac{f(x)}{f'(x)}. \]
Note that the root $x^*$ of f is also a fixed point of g. For this proof, we will directly apply Theorem 41. This involves showing that $g \in C^2(B_\epsilon(x^*))$ and $g'(x^*) = 0$. We have assumed that $x^*$ is a root of f and is, therefore, a fixed point of g, so
\[ g(x^*) = x^* \quad \text{and} \quad f(x^*) = 0. \]
Since $f \in C^3(B_\epsilon(x^*))$, it immediately holds that $g \in C^2(B_\epsilon(x^*))$. We now calculate $g'(x)$ to be
\[ g'(x) = 1 - \frac{(f'(x))^2 - f(x) f''(x)}{(f'(x))^2}. \]
Plugging in $x^*$, and using that $f(x^*) = 0$, $f''(x^*)$ exists, and $f'(x^*) \neq 0$, we calculate
\begin{align*}
g'(x^*) &= 1 - \frac{(f'(x^*))^2 - f(x^*) f''(x^*)}{(f'(x^*))^2}\\
&= 1 - \frac{(f'(x^*))^2 - 0 \cdot f''(x^*)}{(f'(x^*))^2}\\
&= 1 - \frac{(f'(x^*))^2}{(f'(x^*))^2}\\
&= 0.
\end{align*}
This completes the proof.


5.5 Secant method

Although Newton's method is much faster than bisection, it has a disadvantage in that it explicitly uses the derivative of the function. In many processes where rootfinding is needed (e.g., optimization processes), we may not know an analytical expression for the function, and thus may not be able to find an expression for the derivative. That is, a function evaluation may be the result of a process or experiment. While we may theoretically expect the function to be differentiable, we cannot find the derivative explicitly. For situations like this, the secant method has been developed, and one can think of the secant method as being the same as Newton's method, except that instead of using the tangent line at $x_k$, you use the secant line at $x_k$ and $x_{k-1}$. This defines the following algorithm:

Algorithm 46 (Secant method). Given: f, tol, $x_0$, $x_1$
while ($|x_{k+1} - x_k| > tol$):
\[ x_{k+1} = x_k - \frac{f(x_k)}{\dfrac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}}}. \]

Hence we may think of the secant method as Newton's method, but with the derivative term $f'(x_k)$ replaced by the backward difference $\frac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}}$.

It is tedious (but not hard) to prove that the secant method converges superlinearly, with rate $p = \frac{1 + \sqrt{5}}{2} \approx 1.618$.
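Algorithm 46 can be sketched as a short script. The version below is a minimal illustration, not the text's code: it is applied to $f(x) = e^x - x^2$, and the starting guesses −1 and −0.5, the iteration cap maxits, and the variable names are all assumptions made for the example.

```matlab
% Secant iteration for f(x) = exp(x) - x^2 (a sketch of Algorithm 46).
f = @(x) exp(x) - x^2;
x0 = -1; x1 = -0.5;            % two starting guesses (assumed values)
tol = 1e-12;
maxits = 50;                   % safety cap, not part of Algorithm 46
numits = 0;
while abs(x1 - x0) > tol && numits < maxits
    slope = (f(x1) - f(x0)) / (x1 - x0);   % backward difference replaces f'(x1)
    x2 = x1 - f(x1)/slope;                 % Newton-like update
    x0 = x1;
    x1 = x2;
    numits = numits + 1;
end
root = x1
```

This form evaluates f more often than necessary for readability; storing the previous function value would bring the cost to one function evaluation per step.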

5.6 Comparing bisection, Newton, Secant method

We now compare the bisection, Newton, and secant methods in the following table.

                     | Bisection                         | Newton                       | Secant
requirements on f?   | continuous, sign change in [a, b] | continuous, differentiable   | continuous, differentiable
needs                | interval [a, b]                   | point x0                     | points x0 and x1
always converges?    | yes                               | no, only if "close"          | no, only if "close"
higher dimensions?   | no                                | yes                          | yes (1)
work per step        | 1 function                        | 1 function, 1 derivative (2) | 1 function
convergence rate (3) | 1                                 | 2                            | ∼1.6

(1) The secant method can be extended to higher dimensions, but the approximation of the derivative requires more work. There are several similar methods known as "Quasi-Newton" or "Jacobian-free Newton" methods.
(2) The user needs to supply the derivative to the algorithm, which can be problematic, for example, if it is difficult to compute by hand or not accessible.
(3) To achieve the given convergence rate there are more technical requirements on f (smoothness) and it only works if starting "close enough" (see above). Nevertheless, Newton's is typically faster than secant, which is in turn faster than bisection.


5.7 Combining secant and bisection and the fzero command

From the comparisons, we see that bisection is the most robust (least number of assumptions to get guaranteed convergence), but is slow. On the other hand, secant is much faster but needs good initial guesses. Hence, it seems reasonable that an algorithm that used the bisection method to get close to a root, then the secant method to "zoom in" quickly, would get the best of both worlds. Indeed, this can be done with a simple algorithm: Start with an interval that has a sign change, and then instead of using the midpoint to shrink the interval, use the secant method. But if the secant method gives a guess that is outside of the interval, then use the midpoint instead. Eventually, the secant method will be used at each step.

MATLAB/Octave have a built-in function that does this (and more) for functions in one variable, and it is called "fzero." To use it, you simply need to give it a user-defined function and an initial guess or interval. The function intelligently alternates between bisection, secant, and a third method called inverse quadratic interpolation (use 3 points to fit a quadratic and use a root, if one exists, as the next guess).

Consider now using "fzero" to solve $x^3 = \sin(x) + 1$. First, make a function where the solution of the equation is a root of the function, as in the following:

function y = myNLfun1(x)
y = x^3 - sin(x) - 1;

Then simply pass this function into fzero, and give it an initial guess:

>> fzero(@myNLfun1, 5)
ans =
   1.249052148501195

We can test that it worked by plugging the answer into the function.

>> myNLfun1(1.249052148501195)
ans =
   1.332267629550188e-15
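fzero also accepts a two-element interval in place of a single initial guess, provided the function changes sign across it. The sketch below uses an anonymous function g (an illustrative stand-in for myNLfun1) and the bracket [1, 2], which is a choice made here because g(1) < 0 < g(2).

```matlab
% fzero with a bracketing interval instead of a single initial guess.
g = @(x) x.^3 - sin(x) - 1;     % same equation as myNLfun1, as an anonymous function
r = fzero(g, [1 2]);            % g(1) < 0 and g(2) > 0, so [1,2] brackets a root
residual = g(r);
disp([r, residual])
```

Supplying a bracket keeps the search inside the interval, which is useful when a function has several roots and a particular one is wanted.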

5.8 Equation solving in higher dimensions

Without getting too deep into details, we simply state here that the Newton algorithm works in higher dimensions, with the same "good" and "bad" points as in the 1D case. The only difference in the algorithm is that instead of solving $f(x) = 0$ with
\[ x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}, \]
we use the analogue for vector valued functions to solve $\mathbf{f}(\mathbf{x}) = \mathbf{0}$ using
\[ \mathbf{x}_{k+1} = \mathbf{x}_k - (\nabla \mathbf{f}(\mathbf{x}_k))^{-1} \mathbf{f}(\mathbf{x}_k). \]
In the 2 × 2 case,
\[ \mathbf{f}(\mathbf{x}) = \begin{pmatrix} f_1(x_1, x_2) \\ f_2(x_1, x_2) \end{pmatrix}, \]
and
\[ \nabla \mathbf{f} = \begin{pmatrix} \dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} \\[2mm] \dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} \end{pmatrix}. \]
As an example, consider solving the system of nonlinear equations
\begin{align*}
x_1^3 &= x_2,\\
x_1 + \sin(x_2) &= -3.
\end{align*}
First, we create MATLAB functions for the function and its derivative (put these into separate .m files):

function y = myNDfun(x)
y(1,1) = x(1)^3 - x(2);
y(2,1) = x(1) + sin(x(2)) + 3;
end

function y = myNDfunprime(x)
y(1,1) = 3*x(1)^2;
y(1,2) = -1;
y(2,1) = 1;
y(2,2) = cos(x(2));
end

Next, create the n dimensional Newton method function:

function [root, numits] = newton(fun, gradfun, x0, tol)
% Solve fun(x)=0 using Newton's method
% We are given the function and its gradient gradfun
% Starting from the initial guess x0.
x0 = x0(:);      % this will force x0 to be a column vector
xold = x0 + 1;   % this needs to be ~= x0 so that we enter the while loop
xnew = x0;
numits = 0;
n = length(x0);
while norm(xnew - xold) > tol
    gradfxk = gradfun(xnew);
    fxk = fun(xnew);
    fxk = fxk(:);   % this will force fxk to be a column vector
    [a, b] = size(fxk);
    if a ~= n || b ~= 1
        error('function has wrong dimension')
    end
    [a, b] = size(gradfxk);
    if a ~= n || b ~= n
        error('gradient has wrong dimension')
    end
    xold = xnew;
    % x_k+1 = x_k - (grad f(xk))^{-1} * f(xk),
    % but implement as a linear solve
    xnew = xnew - gradfxk \ fxk;
    numits = numits + 1;
    if (numits >= 100)
        error('no convergence after 100 iterations', numits);
    end
end
root = xnew;
end

Running it gives the following answer:

>> [root, numits] = newton(@myNDfun, @myNDfunprime, [-2; -15], 1e-8)
root =
  -2.474670119857577
 -15.154860516817051
numits =
     5

We can check that we are very close to a root:

>> value = myNDfun(root)
value =
   1.0e-14 *
  -0.532907051820075
  -0.088817841970013

If we start too far away from the root, the algorithm will not converge:

>> [root, numits] = newton(@myNDfun, @myNDfunprime, [100; -10], 1e-8)
current step:
  -1.990536027147847
  -7.886133867955427
Error using newton (line 36)
no convergence after 100 iterations
Error in ch4_newtondimex3 (line 3)
[root, numits] = newton(@myNDfun, @myNDfunprime, [100; -10], 1e-8)
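A hand-coded Jacobian is a common source of mistakes, and one inexpensive safeguard is to compare it against centered differences before running Newton's method. The following is a sketch of such a check under stated assumptions: F and J are anonymous-function copies of myNDfun and myNDfunprime, and the test point and step size h = 1e-6 are arbitrary choices.

```matlab
% Compare an analytic Jacobian with a centered-difference approximation.
F = @(x) [x(1)^3 - x(2); x(1) + sin(x(2)) + 3];   % same system as myNDfun
J = @(x) [3*x(1)^2, -1; 1, cos(x(2))];            % same entries as myNDfunprime
x = [-2; -15];                  % test point (assumed)
h = 1e-6;                       % difference step (assumed)
n = numel(x);
Jfd = zeros(n);
for j = 1:n
    e = zeros(n,1);
    e(j) = h;                   % perturb only the j-th variable
    Jfd(:,j) = (F(x+e) - F(x-e)) / (2*h);
end
err = norm(J(x) - Jfd, 'fro')
```

A discrepancy far above the expected differencing error usually points to a wrong entry in the analytic Jacobian.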

5.9 Exercises

1. Suppose you had two algorithms for rootfinding: one is linearly convergent with a given $\alpha < 1$, and the other is superlinearly convergent with a given C and p. How close must $x_k$ be to the root $x^*$ for the superlinear method to outperform (reduce the error more) the linear method for every step after step k?
2. Suppose you ran a nonlinear solver algorithm and calculated errors at each step to be
9.0579e-03
3.4483e-03
8.0996e-04
9.2205e-05
3.5416e-06
2.6659e-08
Classify the convergence as linear or superlinear, and determine the associated constants ($\alpha$, or C and p).
3. Determine the root of $f(x) = x^2 - e^{-x}$ by hand using a) the bisection method with a = 0 and b = 1 and b) Newton's method with $x_1 = 0$. Carry out three iterations each. You can round the numbers to 5 digits and use a calculator or Matlab to do each individual calculation.
4. Since the bisection method is known to cut the interval containing the root in half at each iteration, once we are given the stopping criteria "tol" (i.e., stop when $|b - a| < tol$), we immediately know how many steps of bisection need to be taken. Change the code "bisect.m" to use a "for loop" instead of a "while loop," so that the approximation to the root is within tol of a root, but the minimum number of bisection iterations are used (you will need to calculate the number of "for loop" iterations). Verify your code by solving the same equation as the previous problem, and making sure you get the same answer.
5. Consider the trisection method, which is analogous to bisection except that at each iteration, it creates 3 intervals instead of 2 and keeps one interval where there is a sign change.
(a) Is the trisection method guaranteed to converge if the initial interval has a sign change? Why or why not?
(b) Adapt the bisection code to create a code that performs the trisection method. Compare the results of bisection to trisection for "myfun1.m" for the same stopping tolerance, but several different starting intervals. How do the methods compare?
(c) The main computational cost of bisection and trisection is function evaluations. The other operations it does are a few additions and divisions, but the evaluation of most functions, including sin, cos, exp, etc. are much more costly. Based on this information, explain why you would never want to use trisection over bisection.
6. Use the bisection, Newton, and secant methods to find the smallest positive solution to $\sin(x) = \cos(x)$, using a tolerance of $10^{-12}$. Note you will create your own secant code; it is a small change from the Newton code in this chapter. Report the solutions, the number of iterations each method needed, the number of function evaluations performed, and the commands you used to find the solutions. Conclude which method is fastest.
7. Prove that if $x^*$ is a fixed point of g, $g'(x^*) = 0$, $g''(x^*) = 0$, and $g \in C^3(B_\epsilon(x^*))$, then for $x_0$ chosen close enough to $x^*$, the fixed point algorithm $x_{k+1} = g(x_k)$ converges cubically to $x^*$.
8. Show that Theorem 36 need not hold if the interval I is not closed.
9. Which of Newton, secant, and bisection is most appropriate to find the smallest positive root of the function $f(x) = |\cos(50x)| - 1/2$, and why?
10. Use the "fzero" command to find a solution to $\sin^2(x) = \cos^3(x)$.
11. Use Newton's method with initial guess ⟨0, 0, 0⟩ to find the solution of
\begin{align*}
x_1^2 + x_2 - 33 &= 0,\\
x_1 - x_2^2 - 4 &= 0,\\
x_1 + x_2 + x_3 - 2 &= 0
\end{align*}
to 10 digits of accuracy.
12. Solve the following nonlinear system using Newton's method in Matlab:
\begin{align*}
x^2 &= y + \sin(z),\\
x + 20 &= y - \sin(10y),\\
(1 - x)z &= 2.
\end{align*}
Hint: You need to find a suitable starting value for x, y, and z so that the method converges.
13. Show that for a linear function $f(x) = ax + b$, for any initial guesses $x_0$ and $x_1$, the secant method will find the root of f in one step.

6 Eigenvalues and eigenvectors

Eigenvalues and eigenvectors play important roles in a variety of applications. For example, a system of linear ordinary differential equations $y'(t) = Ay(t)$ can be transformed to the decoupled system $z'(t) = \Lambda z(t)$, where $A = V\Lambda V^{-1}$ is an eigenvalue decomposition, and $z(t) = V^{-1}y(t)$. An ideal spring-mass system with no damping can be modeled by an eigenvalue problem $Kv = \lambda Mv$, where $\sqrt{\lambda}$ is a resonant frequency. Eigenvalue problems also arise from linear stability analysis of steady-state solutions of dynamical systems, where the solution is linearly stable if all eigenvalues have negative real parts. A steady state of a Markov chain is the eigenvector associated with its dominant eigenvalue, 1. The time-independent Schrödinger equation is an eigenvalue problem where the eigenvalue describes the energy of stationary states. The Google PageRank values are the entries of the dominant right eigenvector of a modified adjacency matrix of the Webgraph.

Eigenvalues are also used to find the roots of a polynomial. If $p(x) = c_0 + c_1 x + c_2 x^2 + \cdots + c_{n-1} x^{n-1} + x^n$, then the roots of p are the same as the eigenvalues of the companion matrix
\[
\begin{pmatrix}
0 & 0 & \cdots & \cdots & 0 & -c_0\\
1 & 0 & \cdots & \cdots & 0 & -c_1\\
0 & 1 & 0 & \cdots & 0 & -c_2\\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\
0 & \cdots & \cdots & 0 & 1 & -c_{n-1}
\end{pmatrix}.
\]
Since root formulas for quadratic, cubic, and quartic equations are susceptible to catastrophic round-off error, finding roots as eigenvalues of a companion matrix is usually a more reliable approach (there are even more robust methods based on alternative polynomial bases and the colleague matrix). As there is no formula for higher order polynomial roots, finding the roots as eigenvalues circumvents this problem.
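As a small illustration of the companion matrix approach, the sketch below builds the matrix for the cubic $p(x) = x^3 - 6x^2 + 11x - 6 = (x-1)(x-2)(x-3)$, a test polynomial chosen here because its roots are known, and computes its eigenvalues with eig.

```matlab
% Roots of p(x) = x^3 - 6x^2 + 11x - 6 as eigenvalues of its companion matrix.
c = [-6; 11; -6];                 % c0, c1, c2 (the leading coefficient is 1)
n = numel(c);
C = [zeros(1,n-1); eye(n-1)];     % ones on the first subdiagonal
C = [C, -c];                      % last column holds -c0, ..., -c_{n-1}
r = sort(real(eig(C)))            % eigenvalues, sorted for display
```

The eigenvalues should agree with the known roots 1, 2, and 3 up to round-off.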

6.1 Theoretical background

We now give just a brief review of the mathematical problem, since it is assumed that students in this course have already completed a linear algebra course.

An eigenvalue-eigenvector pair (or eigenpair) of a matrix $A \in \mathbb{C}^{n \times n}$ is a scalar $\lambda$ and vector $v \in \mathbb{C}^n \setminus \{0\}$ satisfying $Av = \lambda v$, or equivalently, $(\lambda I - A)v = 0$. In other words, $\lambda$ is a scalar for which $\lambda I - A$ is a singular matrix, and $v \in \mathrm{null}(\lambda I - A)$. Eigenvectors can be scaled by any nonzero factor, but are usually normalized to have unit 2-norm.

https://doi.org/10.1515/9783110573329-006

Since $\lambda I - A$ is singular with an eigenvalue $\lambda$, $\det(\lambda I - A) = 0$. That is, eigenvalues are the roots of the characteristic polynomial $p_n(\lambda) \equiv \det(\lambda I - A)$, whose coefficient for $\lambda^n$ is 1. If A is real, $\det(\lambda I - A)$ has all real coefficients, and so the eigenvalues are either real or are complex conjugate pairs. The set of all eigenvalues of A is called the spectrum, denoted as $\Lambda(A)$. It can be shown that $\det(A) = \prod_{i=1}^n \lambda_i$ and $\mathrm{trace}(A) = \sum_{i=1}^n \lambda_i$. Eigenvalue algorithms for general matrices of order ≥ 5 are necessarily iterative, because there is no root formula for general polynomials of order ≥ 5.

Diagonalizability. The algebraic multiplicity $\mathrm{alg}_A(\lambda)$ of an eigenvalue $\lambda$ is the smallest integer m such that $p_n^{(m)}(\lambda) \neq 0$, and the geometric multiplicity is $\mathrm{geo}_A(\lambda) = \dim \mathrm{null}(\lambda I - A)$. An eigenvalue $\lambda$ is simple if $\mathrm{alg}_A(\lambda) = \mathrm{geo}_A(\lambda) = 1$, semi-simple if $\mathrm{alg}_A(\lambda) = \mathrm{geo}_A(\lambda) > 1$, and defective if $\mathrm{alg}_A(\lambda) > \mathrm{geo}_A(\lambda)$ (note that $\mathrm{alg}_A(\lambda) \ge \mathrm{geo}_A(\lambda)$ always). A matrix A is diagonalizable if it has no defective eigenvalues, that is, $Av_i = \lambda_i v_i$ ($1 \le i \le n$) or $A = V\Lambda V^{-1}$ where $V = [v_1, \ldots, v_n]$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$; otherwise it is defective (nondiagonalizable) and has a Jordan canonical form. We only consider diagonalizable matrices in this chapter.

We call A a normal matrix if $A^H A = AA^H$. Examples include real symmetric ($A^T = A$), complex Hermitian ($A^H = A$), real skew-symmetric ($A^T = -A$) and complex skew-Hermitian ($A^H = -A$) matrices. All eigenvalues of the first two classes of matrices are real, and those of the last two classes are pure imaginary or zero. Normal matrices have a complete set of orthonormal eigenvectors, such that $A = V\Lambda V^{-1} = V\Lambda V^H$ or $A = V\Lambda V^T$.

Two matrices A and B are called similar if there exists a nonsingular matrix V such that $A = VBV^{-1}$. Similar matrices have the same eigenvalues.
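Several of these facts are easy to check numerically. The sketch below uses a small real symmetric matrix, chosen here purely for illustration, to compare the product and sum of the eigenvalues with the determinant and trace, and to confirm that the eigenvectors returned by eig are orthonormal.

```matlab
% Numerical check of det/trace identities and orthonormal eigenvectors.
A = [4 1 0; 1 3 1; 0 1 2];        % small symmetric test matrix (assumed example)
[V, D] = eig(A);
lam = diag(D);
disp([prod(lam), det(A)])         % product of eigenvalues vs determinant
disp([sum(lam), trace(A)])        % sum of eigenvalues vs trace
orthoErr = norm(V'*V - eye(3))    % near zero if eigenvectors are orthonormal
reconErr = norm(A - V*D*V')       % near zero: A = V*Lambda*V^T
```

Because A is real symmetric, all entries of lam should be real, in line with the statement about the first two classes of normal matrices.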

6.2 Single-vector iterations

For certain applications, we may require only one eigenpair of a matrix. Single-vector iterations are particularly suitable for this purpose. If used appropriately, these methods can be a simple and elegant solution; however, one needs to understand the conditions under which they converge robustly and rapidly.

Power method. The power method is probably the best-known eigenvalue algorithm, aiming to compute the dominant eigenpair (the eigenvalue of largest modulus and its eigenvector) of a matrix. If the dominant eigenvalue is unique, this method is essentially guaranteed to converge. Starting with an initial eigenvector approximation $x_0$ (which can be set as a random vector) normalized in the 2-norm, the algorithm in each iteration multiplies $x_k$ by $A$ and then normalizes the new approximation $x_{k+1}$. Let $\mu_k = x_k^H A x_k$ be the Rayleigh quotient associated with $x_k$. Once the condition $\frac{\|Ax_k - \mu_k x_k\|_2}{\|Ax_k\|_2} \le tol$ is satisfied for some small tolerance $tol > 0$, the method terminates; the final $x_k$ is the approximate dominant eigenvector, and the Rayleigh quotient $\mu_k$ is the approximate dominant eigenvalue.

Power Method for computing the dominant eigenpair of A
    Choose $tol > 0$, $x_0 \in \mathbb{C}^n$ with $\|x_0\|_2 = 1$.
    For k = 0, 1, ..., until convergence
        $\mu_k = x_k^H A x_k$;
        $r_k = Ax_k - \mu_k x_k$;
        If $\|r_k\|_2 / \|Ax_k\|_2 \le tol$, exit;
        $x_{k+1} = Ax_k$; $x_{k+1} = x_{k+1}/\|x_{k+1}\|_2$;
    End For

The MATLAB code for this algorithm is given below.

    function [eval, evec, itcount] = PowerMethod(A, x0, tol)
    % normalize initial condition
    x = x0 / norm(x0);
    % keep track of Rayleigh quotient as approx of lambda
    rq = x' * (A*x);
    itcount = 0;
    while norm(A*x - rq*x) / norm(A*x) > tol
        itcount = itcount + 1;
        y = A*x;
        x = y / norm(y);
        rq = x' * (A*x);
    end
    eval = rq;
    evec = x;

Example 47. Find the dominant eigenvalue of the matrix

$$A = \begin{pmatrix} 1 & 2 & 3 \\ 1 & 3 & 1 \\ 2 & 6 & 2 \end{pmatrix}.$$

We run the power method code above with initial guess [1; 2; 3], via the following commands:

    >> A = [1, 2, 3; 1, 3, 1; 2, 6, 2];
    >> [l, x, itcount] = PowerMethod(A, [1;2;3], 1e-6)
    l =
       6.4641e+00
    x =
       5.4779e-01
       3.7415e-01
       7.4829e-01
    itcount =
         6
    >> norm(A*x - l*x)
    ans =
       2.6768e-07
    >> eig(A)
    ans =
       6.4641e+00
      -4.6410e-01
      -7.4015e-16

Observe that the algorithm converges in 6 iterations to the eigenvalue λ = 6.4641 and the associated eigenvector. As a sanity check, we make sure Ax ≈ λx. We use the eig function (discussed more later in this chapter) as an additional sanity check: eig returns all the eigenvalues of a matrix, and we see the three eigenvalues are 6.4641, −0.4641, and 0 (up to roundoff). The power method indeed converged to the dominant eigenvalue.

We now discuss why the power method converges to the dominant eigenvalue of a matrix. We need three assumptions for this analysis. First, we assume the matrix $A$ is diagonalizable (which is almost always true in practice). Second, we assume $A$ has a unique dominant eigenvalue, that is, we can order the eigenvalues of $A$ as $|\lambda_1| > |\lambda_2| \ge \cdots \ge |\lambda_n|$. Let $v_i$ be the eigenvector associated with $\lambda_i$, such that $Av_i = \lambda_i v_i$. The final assumption is that, when the initial guess $x_0$ is written as a linear combination of the eigenvectors of $A$, the linear combination has a nonzero component of $v_1$. This is also almost always true in practice.

Since $A$ is diagonalizable, its eigenvectors form a basis for $\mathbb{C}^n$. Then we can decompose $x_0$ in this basis:

$$x_0 = \alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_n v_n = \sum_{i=1}^n \alpha_i v_i,$$

where the $\alpha_i$ are (possibly complex) scalars. The power method at each iteration multiplies the previous iterate by $A$, normalizes the resulting vector, and checks for convergence. Multiplying $x_0$ by $A$ yields

$$Ax_0 = A\sum_{i=1}^n \alpha_i v_i = \sum_{i=1}^n \alpha_i A v_i,$$

where $Av_i = \lambda_i v_i$. Thus we have

$$Ax_0 = \sum_{i=1}^n \alpha_i \lambda_i v_i,$$

and now applying the normalization, we obtain $x_1$:

$$x_1 = \frac{Ax_0}{\|Ax_0\|_2} = \frac{1}{\|Ax_0\|_2}\sum_{i=1}^n \alpha_i \lambda_i v_i.$$

Repeating this process yields for $x_2$:

$$x_2 = \frac{1}{\|Ax_0\|_2}\,\frac{1}{\|Ax_1\|_2}\sum_{i=1}^n \alpha_i \lambda_i^2 v_i,$$

and for $x_m$:

$$x_m = \left(\prod_{k=0}^{m-1}\frac{1}{\|Ax_k\|_2}\right)\sum_{i=1}^n \alpha_i \lambda_i^m v_i.$$

Here is the key point: since $|\lambda_1| > |\lambda_2| \ge \cdots \ge |\lambda_n|$, for $m$ large enough,

$$\sum_{i=1}^n \alpha_i \lambda_i^m v_i = \lambda_1^m\left(\alpha_1 v_1 + \sum_{i=2}^n \alpha_i\left(\frac{\lambda_i}{\lambda_1}\right)^m v_i\right) \approx \alpha_1 \lambda_1^m v_1,$$

that is, all the terms containing $v_2, \ldots, v_n$ eventually become negligible, and only the one with $v_1$ remains. This gives us that for $m$ large enough,

$$x_m \approx \left(\prod_{k=0}^{m-1}\frac{1}{\|Ax_k\|_2}\right)\alpha_1 \lambda_1^m v_1 = Cv_1,$$

where $C = \left(\prod_{k=0}^{m-1}\frac{1}{\|Ax_k\|_2}\right)\alpha_1\lambda_1^m$ is just a nonzero scalar. Since eigenvectors are only unique in their direction, this means that $x_m$ converges to the eigenvector $v_1$ (and is normalized to have length 1 in the Euclidean norm). With $Ax_m \approx \lambda_1 x_m$ established, and $\|x_m\|_2 = 1$, the Rayleigh quotient satisfies

$$\mu_m = \frac{x_m^H (Ax_m)}{\|x_m\|_2^2} = x_m^H(Ax_m) \approx \lambda_1\|x_m\|_2^2 = \lambda_1.$$

Hence, once the power method converges to a direction (eigenvector $v_1$), the Rayleigh quotient recovers the corresponding dominant eigenvalue.

The rate of convergence depends on how quickly $\sum_{i=1}^n \alpha_i\lambda_i^m v_i \approx \alpha_1\lambda_1^m v_1$. Since the eigenvalues are ordered by magnitude, and the $\alpha_i$'s and $v_i$'s are fixed, convergence happens when $|\lambda_1|^m \gg |\lambda_2|^m$, which is equivalent to $\left(\frac{|\lambda_2|}{|\lambda_1|}\right)^m$ being very small. Hence, it is the ratio of the second largest to the largest eigenvalue (in modulus) that determines the convergence rate: if $\frac{|\lambda_2|}{|\lambda_1|}$ is small (close to zero) then convergence is rapid, but if it is close to one then convergence can be slow.

Example 48. Run the power method on the matrix

$$A = \begin{pmatrix} 1.0181\mathrm{e}{+}00 & 4.4535\mathrm{e}{-}02 & 3.1901\mathrm{e}{-}02 \\ -1.6856\mathrm{e}{-}03 & 1.0017\mathrm{e}{+}00 & -6.5115\mathrm{e}{-}04 \\ -6.5794\mathrm{e}{-}03 & -3.6852\mathrm{e}{-}02 & 9.8021\mathrm{e}{-}01 \end{pmatrix}.$$

How many iterations does it need to converge to a tolerance of $10^{-6}$? We enter the matrix into MATLAB and run the power method, with initial guess [1; 1; 1]:

    >> A = [ 1.0181e+00  4.4535e-02  3.1901e-02
            -1.6856e-03  1.0017e+00 -6.5115e-04
            -6.5794e-03 -3.6852e-02  9.8021e-01];
    >> [l, x, itcount] = PowerMethod(A, [1;1;1], 1e-6)
    l =
       1.0100e+00
    x =
       9.7905e-01
      -2.0103e-01
       3.2415e-02
    itcount =
       996

This took many more iterations than our previous example to converge (996 versus 6). This is because the ratio for this matrix is $\frac{|\lambda_2|}{|\lambda_1|} = \frac{1}{1.01} \approx 0.990$ (close to 1), but for the previous example it was $\frac{|\lambda_2|}{|\lambda_1|} = \frac{0.4641}{6.4641} \approx 0.0718$ (close to 0).
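The link between the eigenvalue ratio and the iteration count can be made roughly quantitative. The sketch below is only a heuristic: it assumes the error shrinks like $(|\lambda_2|/|\lambda_1|)^m$ and ignores the constants $\alpha_i$ and the exact form of the stopping test, so it predicts the order of magnitude of the iteration count rather than the exact number. The variable names are hypothetical.

```matlab
tol = 1e-6;
% |lambda_2|/|lambda_1| for Example 47 and Example 48
ratios = [0.4641/6.4641, 1.00/1.01];
% smallest m with ratio^m <= tol, i.e. m >= log(tol)/log(ratio)
m_est = ceil(log(tol) ./ log(ratios));
fprintf('ratio = %.4f, estimated iterations = %d\n', [ratios; m_est]);
```

The estimates are of the same order as the observed counts (6 and 996): a handful of iterations for the first matrix and on the order of a thousand for the second.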

Inverse power method. Recall from your linear algebra course that a nonsingular matrix $A$ has the same eigenvectors as its inverse, and its eigenvalues are the reciprocals of those of its inverse. To see this, take $Ax = \lambda x$, left-multiply both sides by $A^{-1}$, and then multiply by $\frac{1}{\lambda}$, which reveals $\lambda^{-1}x = A^{-1}x$. Note $\lambda \ne 0$ since $A$ is assumed nonsingular. Thus, the smallest eigenvalue of $A$ (in modulus) is the reciprocal of the largest eigenvalue of $A^{-1}$ (in modulus). We can therefore run the power method on $A^{-1}$ to determine the smallest eigenvalue of $A$. Of course, one should never explicitly form an inverse, so in the implementation the method's multiplication $x_{k+1} = A^{-1}x_k$ is replaced by a linear solve: find $x_{k+1}$ satisfying $Ax_{k+1} = x_k$. Code for the inverse power method is as follows:

    function [eval, evec, itcount] = InvPowerMethod(A, x0, tol)
    % normalize initial condition
    x = x0 / norm(x0);
    % keep track of Rayleigh quotient as approx of lambda
    rq = x' * (A*x);
    % will do many solves with A, so prefactor
    [L, U, P] = lu(A);
    itcount = 0;
    while norm(A*x - rq*x) / norm(A*x) > tol
        itcount = itcount + 1;
        % solve Ay = x -> PAy = Px -> LUy = Px
        y = U \ (L \ (P*x));
        % normalize
        x = y / norm(y);
        rq = x' * (A*x);
    end
    eval = rq;
    evec = x;

We run the inverse power method on the same matrix as the previous example (code shown below), and correctly converge to the smallest eigenvalue $\lambda_3 = 0.99$ and its corresponding eigenvector:

    >> [l, x, itcount] = InvPowerMethod(A, [1;1;1], 1e-6)
    l =
       9.9000e-01
    x =
      -7.0438e-01
      -6.2037e-02
       7.0710e-01
    itcount =
       764

Shift-invert power method. We have shown above how to find the largest and smallest eigenvalues (in modulus) of a matrix. We can also use a similar technique to find the eigenvalue of $A$ that is closest to a particular scalar $\sigma \in \mathbb{C}$. The key idea is that the eigenvalues of $(A - \sigma I)$ are given by $\lambda_i - \sigma$, where $\lambda_i$ is an eigenvalue of $A$. To see this, start with $Ax = \lambda x$, and subtract $\sigma x$ from both sides to reveal $Ax - \sigma x = \lambda x - \sigma x = (\lambda - \sigma)x$, and consequently, $(A - \sigma I)x = (\lambda - \sigma)x$. Thus for any eigenpair $(\lambda, x)$ of $A$, $(\lambda - \sigma, x)$ is an eigenpair of $(A - \sigma I)$. Therefore, if $(A - \sigma I)$ is nonsingular, then we can run the inverse power method on it to find its smallest eigenvalue in modulus, which has the form $\lambda_j - \sigma$. Adding $\sigma$ to it recovers $\lambda_j$, which must be the eigenvalue of $A$ closest to $\sigma$. The code to do this is identical to the inverse power method, except that the matrix $(A - \sigma I)$ is used in the linear solves:

    function [eval, evec, itcount] = ShiftInvPowerMethod(A, sigma, x0, tol)
    % normalize initial condition
    x = x0 / norm(x0);
    % keep track of Rayleigh quotient as approx of lambda
    rq = x' * (A*x);
    % will do many solves with A - sigma*I, so prefactor
    n = size(A, 1);
    [L, U, P] = lu(A - sigma*eye(n));
    itcount = 0;
    while norm(A*x - rq*x) / norm(A*x) > tol
        itcount = itcount + 1;
        % solve (A - sigma*I)y = x using the LU factors
        y = U \ (L \ (P*x));
        % normalize
        x = y / norm(y);
        rq = x' * (A*x);
    end
    eval = rq;
    evec = x;

Using the same matrix as the previous examples, whose three eigenvalues are 1.01, 1, and 0.99, we can target any one of them by choosing σ closer to it than to the others; for instance, a shift sufficiently close to 1 finds the second largest eigenvalue. Here we choose the shift σ = 0.98, and converge to the eigenvalue 0.99 in 12 iterations. This is much faster than using the inverse power method to find this eigenvalue.

    >> [l, x, itcount] = ShiftInvPowerMethod(A, 0.98, [1;1;1], 1e-6)
    l =
       9.9000e-01
    x =
      -7.0437e-01
      -6.2080e-02
       7.0712e-01
    itcount =
        12

Rayleigh quotient iteration. The shift-invert power method with a fixed shift σ converges linearly to the unique eigenvalue closest to σ. Following the analysis of the power method, we can show that the factor of convergence is $\left|\frac{\lambda_1 - \sigma}{\lambda_2 - \sigma}\right|$, where $\lambda_1$ and $\lambda_2$ denote the eigenvalues of $A$ closest and second closest to σ. That is, if $|\sigma - \lambda_1| \ll |\sigma - \lambda_2|$, we can expect rapid convergence. This suggests that we may use a variable shift $\sigma_k$ for the shift-invert power method, letting it converge to the eigenvalue of interest. In particular, if we let $\sigma_k = \mu_k = x_k^H A x_k$ with $\|x_k\|_2 = 1$, the resulting algorithm is called the Rayleigh quotient iteration.

Assume that $x_k$ (with $\|x_k\|_2 = 1$) is very close in direction to a particular eigenvector $v_\ell$ of $A$, such that the Rayleigh quotient $\mu_k = x_k^H A x_k$ is closer to the corresponding eigenvalue $\lambda_\ell$ than it is to all other eigenvalues. Then one can show that the Rayleigh quotient iteration converges asymptotically to $(\lambda_\ell, v_\ell)$ cubically if $A$ is real symmetric or complex Hermitian, and in general quadratically otherwise.

Two issues need attention for the Rayleigh quotient iteration. First, it converges quadratically or cubically only if $x_k$ is already sufficiently close in direction to the desired eigenvector $v_\ell$. This is similar to Newton's method for solving nonlinear equations. To make sure this condition is satisfied, we may first start with the shift-invert power method with σ close to $\lambda_\ell$, perform several iterations so that $x_k \approx v_\ell$, and then switch to the Rayleigh quotient iteration. Second, though we can achieve quadratic/cubic asymptotic convergence, we have to solve a linear system with a new coefficient matrix $A - \mu_k I$ in each iteration. These linear solves are much more expensive than those in the shift-invert power method, which can reuse the LU factors of $A - \sigma I$ computed only once, since σ is fixed.
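The Rayleigh quotient iteration described above can be sketched by modifying the shift-invert code so that the shift is updated in every iteration. This is a minimal illustration, not a robust implementation: the function name RayleighQuotientIteration and the iteration cap maxit are assumptions introduced here, and MATLAB may warn that $A - \mu_k I$ is nearly singular as $\mu_k$ approaches an eigenvalue, which is expected for this method.

```matlab
function [lam, x, itcount] = RayleighQuotientIteration(A, x0, tol, maxit)
% Sketch of Rayleigh quotient iteration: a shift-invert power method whose
% shift is reset to the current Rayleigh quotient in every iteration.
n = size(A, 1);
x = x0 / norm(x0);
lam = x' * (A*x);            % initial Rayleigh quotient
itcount = 0;
while norm(A*x - lam*x) / norm(A*x) > tol && itcount < maxit
    itcount = itcount + 1;
    % the coefficient matrix changes every iteration, so a fresh solve
    % (and factorization) is needed; LU factors cannot be reused
    y = (A - lam*eye(n)) \ x;
    x = y / norm(y);
    lam = x' * (A*x);        % updated shift
end
```

For a small symmetric matrix such as A = [2 1 0; 1 3 1; 0 1 4] with x0 = [1;1;1], the residual typically drops to roundoff level within a few iterations, in line with the fast asymptotic convergence discussed above. Which eigenpair is found depends on the starting vector.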

6.3 Multiple-vector iterations

The single-vector iteration methods in their original form can find only one eigenpair. In many situations, however, we need several eigenpairs with similar spectral characteristics, for example, several dominant eigenpairs or a dozen eigenpairs around a specified shift $\sigma \in \mathbb{C}$. Iterations using multiple vectors can be used in this setting, and are a natural extension of the single-vector methods.

The subspace iteration (also called simultaneous iteration) is a widely used extension of this type. We provide the outline of a basic version of this method. One can see easily that it is a straightforward extension of the power method, and the only essential difference is that the single-vector iterate $x_k$ (with $\|x_k\|_2 = 1$) of the power method is replaced with a block of $p$ vectors $X_k$ with orthonormal columns. In addition, if we want individual dominant eigenpairs (instead of an orthonormal basis of this invariant subspace), a post-processing step is needed to retrieve these eigenpairs of $A$ from $X_k$ and the block Rayleigh quotient $M_k = X_k^H A X_k$.

Subspace Iteration for computing p dominant eigenpairs of A
    Choose $tol > 0$, $X_0 \in \mathbb{C}^{n\times p}$ with $X_0^H X_0 = I_p$.
    For k = 0, 1, ..., until convergence
        $M_k = X_k^H A X_k$;
        $R_k = AX_k - X_k M_k$;
        If $\|R_k\|_F / \|AX_k\|_F \le tol$, exit;
        $X_{k+1} = AX_k$;
        Orthonormalize $X_{k+1}$ such that $X_{k+1}^H X_{k+1} = I_p$;
            (typically done by a reduced QR factorization of $X_{k+1}$)
    End For
    Diagonalize $M_k = [w_1^{(k)}, \ldots, w_p^{(k)}]\,\operatorname{diag}(\mu_1^{(k)}, \ldots, \mu_p^{(k)})\,[w_1^{(k)}, \ldots, w_p^{(k)}]^{-1}$;
    output the eigenpair approximations $(\mu_i^{(k)}, X_k w_i^{(k)})$ ($1 \le i \le p$) of A.

The analysis of the subspace iteration is almost parallel to that of the power method. Assume that the eigenvalues of $A$ are ordered such that $|\lambda_1| \ge \cdots \ge |\lambda_p| > |\lambda_{p+1}| \ge \cdots \ge |\lambda_n|$. If the algorithm proceeds without orthonormalizing $X_{k+1}$ in each iteration, then each column of $X_{k+1}$ will converge in direction to the dominant eigenvector $v_1$ if $|\lambda_1| > |\lambda_2|$. However, the orthonormalization forces the columns of $X_{k+1}$ to have unit 2-norm and be orthogonal to each other. As a result, $X_{k+1}$ will converge to an orthonormal basis of the dominant invariant subspace $\operatorname{span}\{v_1, \ldots, v_p\}$ associated with $\lambda_1, \ldots, \lambda_p$. The post-processing step at the end will extract each individual eigenpair approximation. The subspace iteration converges linearly to these eigenpairs at the asymptotic rate $\left|\frac{\lambda_{p+1}}{\lambda_p}\right|$.

If we want to compute $p$ eigenvalues near a specified shift σ, we can use the shift-invert variant of the subspace iteration. We only need to replace $X_{k+1} = AX_k$ with $X_{k+1} = (A - \sigma I)^{-1}X_k$, achieved by solving the linear system of equations $(A - \sigma I)X_{k+1} = X_k$ with $p$ right-hand sides $X_k = [x_1^{(k)}, \ldots, x_p^{(k)}]$. An LU factorization should be performed only once, before the iteration begins, and its factors reused for the solution of such a linear system in each iteration of the shift-invert subspace iteration. If there is a unique set of $p$ eigenvalues of $A$ closest to σ, the shift-invert subspace iteration will usually converge to the desired eigenpairs.

6.4 Finding all eigenvalues and eigenvectors of a matrix

All the algorithms introduced in previous sections can be used to compute one or a few eigenvalues of a small dense or a large and typically sparse matrix. Notably absent from this chapter is a discussion of finding all eigenvalues and eigenvectors of a matrix. In MATLAB, this can be done with the eig command, and most students use this command in their introductory linear algebra class. We have omitted discussion of this topic for two reasons. First, additional technical details are needed to efficiently extend the ideas above for subspace iteration to the whole space and make the iteration converge rapidly. Second, it is not a practical algorithm for very large matrices: it is slow ($O(n^2)$ flops for real symmetric or complex Hermitian matrices and $O(n^3)$ flops for general nonsymmetric matrices), and is storage-consuming since the eigenvectors will form a dense matrix the same size as the input matrix. The algorithm is practical for matrices of size up to approximately twenty to thirty thousand on a personal computer (though many hours may be needed). As this is an overview course, we choose to leave the material to a first graduate course in numerical linear algebra.
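For reference, a brief sketch of how eig is typically called to obtain all eigenpairs of a small dense matrix is given here, using the matrix from Example 47. The variable names res, idx and lam_dom are hypothetical.

```matlab
A = [1 2 3; 1 3 1; 2 6 2];
% columns of V are eigenvectors, diagonal entries of D are eigenvalues
[V, D] = eig(A);
% A*V = V*D should hold up to roundoff
res = norm(A*V - V*D);
% pick out the dominant eigenvalue (largest modulus)
[~, idx] = max(abs(diag(D)));
lam_dom = D(idx, idx);
fprintf('residual = %.2e, dominant eigenvalue = %.4f\n', res, lam_dom);
```

The dominant eigenvalue returned this way agrees with the value the power method converged to in Example 47.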

6.5 Exercises

1. Assume that $A, B \in \mathbb{C}^{n\times n}$, and at least one of them is nonsingular. Show that $\Lambda(AB) = \Lambda(BA)$, that is, the eigenvalues of $AB$ and $BA$ are identical.

2. Let
   $$A = \begin{pmatrix} 1 & 10^6 \\ 0 & 2 \end{pmatrix} \quad\text{and}\quad B = \begin{pmatrix} 1 & 10^6 \\ 2\times 10^{-6} & 2 \end{pmatrix}.$$
   Evaluate $\|B - A\|_\infty$ and see how large the difference between $A$ and $B$ is. How much do the eigenvalues change? If
   $$A = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix} \quad\text{and}\quad B = \begin{pmatrix} 1 & 2\times 10^{-6} \\ 2\times 10^{-6} & 2 \end{pmatrix},$$
   explore the same questions. (This example shows how sensitive certain eigenvalues of a nonsymmetric matrix can be under perturbations, and how insensitive they are in the symmetric case.)

3. Let $A \in \mathbb{R}^{n\times n}$ be an orthogonal matrix such that $A^T A = AA^T = I_n$. Show that for any vector $x \in \mathbb{R}^n$, $\|Ax\|_2 = \|x\|_2$. From this fact, what can we say about the modulus of the eigenvalues of $A$? Would the power method or subspace iteration be able to find one or a few dominant eigenpairs of $A$?

4. Let
   $$A = \begin{pmatrix} 5 & 7 & 3 \\ 0 & 1 & 2 \\ 4 & -1 & 6 \end{pmatrix} \quad\text{and}\quad B = \begin{pmatrix} 2 & 4 & -1 \\ 7 & 0 & 1 \\ 3 & 6 & 5 \end{pmatrix}.$$
   Use MATLAB's eig to find the eigenvalues of both matrices. Can the power method find the dominant eigenvalue of $A$ and $B$, respectively? What about using the inverse power method to find the smallest (in modulus) eigenvalue?

5. (a) Let $x_0 = [1\ 1\ 1]^T$. Use the power method for $A$ in Problem 4 to compute $x_1$ without any normalization. Evaluate the Rayleigh quotient $\mu_k = \frac{x_k^T A x_k}{x_k^T x_k}$ ($k = 0, 1$), and compare with the dominant eigenvalue of $A$.
   (b) Let $x_0 = [-13\ 15\ -13]^T$. Invoke the inverse power method for $B$ in Problem 4 to compute $x_1$, without any normalization. Evaluate the Rayleigh quotient $\mu_k = \frac{x_k^T B x_k}{x_k^T x_k}$ ($k = 0, 1$); compare with the smallest eigenvalue of $B$.

6. (a) Follow the analysis for the power method in Section 6.2 to derive an analysis of the shift-invert power method. Show that this method would typically converge to the eigenvalue closest to the specified shift σ under certain assumptions, and the asymptotic rate of convergence is $\left|\frac{\lambda_1 - \sigma}{\lambda_2 - \sigma}\right|$, where $\lambda_1$ and $\lambda_2$ denote the eigenvalues of $A$ closest and second closest to σ.
   (b) Using the conclusion in part (a), explain why the shift-invert method with shift σ = 0.98 converges to λ = 0.99 of the 3 × 3 matrix in Section 6.2 much more quickly than the inverse power method (hint: the inverse power method is nothing but the shift-invert power method with shift σ = 0).

7. (a) Implement the subspace iteration in MATLAB. To test your code, run the MATLAB command load west0479; and compute the 8 eigenvalues of largest modulus to relative tolerance $10^{-8}$. Compare your computed eigenvalues with those obtained from the MATLAB command eigs(west0479,8,'lm'). You may use the command [X,~] = qr(X,0) to orthonormalize $X_{k+1}$.
   (b) Implement the shift-invert subspace iteration. Make sure to use lu to factorize $A - \sigma I$ before the for loop and use the LU factors to solve the linear systems $(A - \sigma I)X_{k+1} = X_k$. To test your code, compute the 6 eigenvalues around σ = 1 of west0479 to relative tolerance $10^{-8}$. Compare your results with those obtained from the command eigs(west0479,6,1).

7 Interpolation

It is both helpful and convenient to express relations in data with functions. This allows for estimating the dependent variables at values of the independent variables not given in the data, taking derivatives, integrating, and even solving differential equations. In this chapter, we will look at one of the common classes of such methods: interpolants.

An interpolant of a set of points is a function that passes through each of the data points. For example, given the points $(0, 0)$, $(\pi/2, 1)$, and $(-\pi/2, -1)$, both $f(x) = \sin(x)$ and $g(x) = \frac{2x}{\pi}$ would be interpolating functions, as shown in the plot below.

There are many applications where we would prefer to describe data with an interpolant. In this chapter, we will consider interpolation by a single polynomial, as well as piecewise-polynomial interpolation.

7.1 Interpolation by a single polynomial

Given n points $(x_i, y_i)$, $i = 1, 2, \ldots, n$, with distinct $x_i$'s, a polynomial of degree at most $n - 1$ can be found that passes through each of the n points. Such a polynomial is then an interpolant of the data points. Since the interpolating polynomial has degree at most $n - 1$, it must be of the form

$$p(x) = c_1 + c_2 x + c_3 x^2 + \cdots + c_n x^{n-1}.$$

Thus, there are n unknowns $(c_1, \ldots, c_n)$ that, if we could determine them, would give us the interpolating polynomial. Requiring $p(x)$ to interpolate the data leads to n equations. That is, if $p(x)$ is to pass through the data point $(x_i, y_i)$, then it must hold that $y_i = p(x_i)$. Thus, for each $i$ ($1 \le i \le n$), we get the equation

$$y_i = c_1 + c_2 x_i + c_3 x_i^2 + \cdots + c_n x_i^{n-1}.$$

https://doi.org/10.1515/9783110573329-007

Since all the $x_i$'s and $y_i$'s are known, each of these n equations is linear in the unknown $c_j$'s. Thus, we have a square linear system to determine the unknown coefficients:

$$\begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^{n-1} \\ 1 & x_2 & x_2^2 & \cdots & x_2^{n-1} \\ 1 & x_3 & x_3^2 & \cdots & x_3^{n-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^{n-1} \end{pmatrix} \begin{pmatrix} c_1 \\ c_2 \\ c_3 \\ \vdots \\ c_n \end{pmatrix} = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix} \qquad (7.1)$$

It can be shown that the coefficient matrix is nonsingular since all $x_i$'s are distinct. Solving this linear system will uniquely determine the polynomial. Consider the following example.

Example 49. Find a polynomial that interpolates the five points (0, 1), (1, 2), (2, 2), (3, 6), (4, 9).

    % x and y values of the 5 points to interpolate:
    px = [0 1 2 3 4];
    py = [1 2 2 6 9];
    n = length(px);
    % build nxn matrix column by column
    A = zeros(n, n);
    for i = 1:n
        A(:, i) = px.^(i-1);
    end
    % now solve for the coefficients:
    c = A \ py';
    % output matrix and coefficients:
    A
    c
    % plot the points and the polynomial at the points x
    x = linspace(-1, 5, 100);
    y = c(1) + c(2)*x + c(3)*x.^2 + c(4)*x.^3 + c(5)*x.^4;
    plot(px, py, 'rs', x, y, 'k')

This code will produce the output:

    A =
         1     0     0     0     0
         1     1     1     1     1
         1     2     4     8    16
         1     3     9    27    81
         1     4    16    64   256
    c =
       1.000000000000000
       5.666666666666661
      -7.583333333333332
       3.333333333333333
      -0.416666666666667

Thus we have found our interpolating polynomial for the 5 points:

$$p(x) = 1.0000 + 5.6667x - 7.5833x^2 + 3.3333x^3 - 0.4167x^4.$$

Plotting it along with the original data points gives the plot below, from which we can see that the curve does indeed pass through each point.
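As a cross-check of the coefficients found above, MATLAB's built-in polyfit and polyval can be used on the same data. This is only a sanity check under the assumption that a degree $n-1$ fit through $n$ points reproduces the interpolant; the names c_desc and max_resid are hypothetical.

```matlab
px = [0 1 2 3 4];
py = [1 2 2 6 9];
n = length(px);
% polyfit returns coefficients in descending powers of x,
% so flip them to match the ordering c1, c2, ..., cn used in the text
c_desc = polyfit(px, py, n-1);
c = fliplr(c_desc);
% the interpolant must reproduce the data at the interpolation points
max_resid = max(abs(polyval(c_desc, px) - py));
disp(c);
disp(max_resid);
```

The flipped coefficient vector should match the values of c printed in Example 49 up to roundoff.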

7.1.1 Lagrange interpolation

The matrix from equation (7.1) is the famous Vandermonde matrix, and it is ill-conditioned for n more than a few. For example, if the $x_i$'s are equally spaced on [0, 10], then we can calculate the condition numbers to be

    n=7:  cond(A)=1.2699e+07
    n=8:  cond(A)=2.9823e+08
    n=9:  cond(A)=7.2134e+09
    n=10: cond(A)=1.7808e+11
    n=11: cond(A)=4.4628e+12
    n=12: cond(A)=1.1313e+14
    n=13: cond(A)=2.8914e+15
    n=14: cond(A)=7.2577e+16
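A table of this kind can be generated with a short loop. The sketch below assumes the matrix is built exactly as in Example 49, with n equally spaced points on [0, 10]; the array name conds is hypothetical, and the last digits may differ slightly between MATLAB versions for the largest n, where the condition number is near the reciprocal of machine epsilon.

```matlab
conds = zeros(1, 8);
for n = 7:14
    x = linspace(0, 10, n);        % n equally spaced points on [0,10]
    A = zeros(n, n);
    for i = 1:n
        A(:, i) = x.^(i-1);        % Vandermonde matrix, column by column
    end
    conds(n-6) = cond(A);
    fprintf('n=%d: cond(A)=%.4e\n', n, conds(n-6));
end
```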

In general, it is not a good idea to try to fit data to a single higher-order polynomial at equally spaced or randomly selected points, since such polynomials tend to be oscillatory near the two ends of the interval of interest. However, if one does wish to construct a single higher-order polynomial, then there are different ways to build interpolating polynomials that do not require an ill-conditioned linear solve. One such way is called "Lagrange interpolation," and since the interpolating polynomial of degree $n - 1$ is unique for n data points, it will produce the same answer as we get from solving the linear system above (if no numerical error is present).

To define Lagrange interpolation, we first must define the Lagrange basis. As a point of reference, the monomial basis for degree $n - 1$ polynomials is $\{1, x, \ldots, x^{n-1}\}$, and our task with the linear system was to determine the coefficients $c_i$, which in turn defined the interpolating polynomial $p(x)$. In the Lagrange case, there are also n basis functions, and they are defined as follows, for each $i = 1, 2, \ldots, n$:

$$l_i(x) = \frac{(x - x_n)(x - x_{n-1})\cdots(x - x_{i+1})(x - x_{i-1})\cdots(x - x_1)}{(x_i - x_n)(x_i - x_{n-1})\cdots(x_i - x_{i+1})(x_i - x_{i-1})\cdots(x_i - x_1)} = \frac{\prod_{j=1, j\ne i}^n (x - x_j)}{\prod_{j=1, j\ne i}^n (x_i - x_j)}.$$

The important things to notice about these basis functions are:
– They are degree $n - 1$ polynomials, so any linear combination of them is a polynomial of degree at most $n - 1$,
– $l_i(x_i) = 1$,
– $l_i(x_j) = 0$ if $j \ne i$.

We wish to find an interpolating polynomial that is a linear combination of the $l_i$'s, which means we want to find the $c_i$'s that define

$$p_L(x) = c_1 l_1(x) + c_2 l_2(x) + \cdots + c_n l_n(x), \quad\text{where } p_L(x_i) = y_i, \quad i = 1, 2, \ldots, n.$$

It turns out that finding the $c_i$'s is very easy. Consider the first data point $(x_1, y_1)$, for which we want $y_1 = p_L(x_1)$. By construction of the $l_i$'s, we have that $l_1(x_1) = 1$ and $l_j(x_1) = 0$ for $j = 2, 3, \ldots, n$. Thus $p_L(x_1) = c_1 l_1(x_1) + 0 + \cdots + 0 = c_1$, which means $c_1 = y_1$. The same thing can be done for the other data points. This precisely and completely defines the Lagrange interpolating polynomial:

$$p_L(x) = y_1 l_1(x) + y_2 l_2(x) + \cdots + y_n l_n(x). \qquad (7.2)$$

Again, since the interpolating polynomial of degree at most $n - 1$ through n points is unique, and since the interpolants built from the Lagrange basis and from the monomial basis both have degree at most $n - 1$, the two interpolants must be equal.

Example 50. Use the Lagrange interpolation method to find the interpolating polynomial of the points (0, 1), (1, 2), (2, 2), (3, 6), (4, 9).

Note that in the previous example, using the monomial basis, we found the interpolating polynomial to be

$$p_M(x) = 1.0000 + 5.6667x - 7.5833x^2 + 3.3333x^3 - 0.4167x^4.$$

The Lagrange basis functions are

$$l_1(x) = \frac{(x-1)(x-2)(x-3)(x-4)}{(-1)(-2)(-3)(-4)} = \frac{(x-1)(x-2)(x-3)(x-4)}{24},$$
$$l_2(x) = \frac{(x-0)(x-2)(x-3)(x-4)}{(1)(-1)(-2)(-3)} = \frac{(x-0)(x-2)(x-3)(x-4)}{-6},$$
$$l_3(x) = \frac{(x-0)(x-1)(x-3)(x-4)}{(2)(1)(-1)(-2)} = \frac{(x-0)(x-1)(x-3)(x-4)}{4},$$
$$l_4(x) = \frac{(x-0)(x-1)(x-2)(x-4)}{(3)(2)(1)(-1)} = \frac{(x-0)(x-1)(x-2)(x-4)}{-6},$$
$$l_5(x) = \frac{(x-0)(x-1)(x-2)(x-3)}{(4)(3)(2)(1)} = \frac{(x-0)(x-1)(x-2)(x-3)}{24},$$

and the Lagrange interpolating polynomial is defined by

$$p_L(x) = l_1(x) + 2l_2(x) + 2l_3(x) + 6l_4(x) + 9l_5(x).$$

With some arithmetic, one can expand $p_L(x)$ to show that $p_L(x) = p_M(x)$.

The following MATLAB script creates the Lagrange basis functions, then creates the Lagrange interpolant:

    xpts = [0, 1, 2, 3, 4];
    ypts = [1, 2, 2, 6, 9];
    % define the Lagrange basis functions
    l1 = @(x) (x-xpts(2)).*(x-xpts(3)).*(x-xpts(4)).*(x-xpts(5)) ...
        ./ ((xpts(1)-xpts(2)).*(xpts(1)-xpts(3)).* ...
        (xpts(1)-xpts(4)).*(xpts(1)-xpts(5)));
    l2 = @(x) (x-xpts(1)).*(x-xpts(3)).*(x-xpts(4)).*(x-xpts(5)) ...
        ./ ((xpts(2)-xpts(1)).*(xpts(2)-xpts(3)).* ...
        (xpts(2)-xpts(4)).*(xpts(2)-xpts(5)));
    l3 = @(x) (x-xpts(1)).*(x-xpts(2)).*(x-xpts(4)).*(x-xpts(5)) ...
        ./ ((xpts(3)-xpts(1)).*(xpts(3)-xpts(2)).* ...
        (xpts(3)-xpts(4)).*(xpts(3)-xpts(5)));
    l4 = @(x) (x-xpts(1)).*(x-xpts(2)).*(x-xpts(3)).*(x-xpts(5)) ...
        ./ ((xpts(4)-xpts(1)).*(xpts(4)-xpts(2)).* ...
        (xpts(4)-xpts(3)).*(xpts(4)-xpts(5)));
    l5 = @(x) (x-xpts(1)).*(x-xpts(2)).*(x-xpts(3)).*(x-xpts(4)) ...
        ./ ((xpts(5)-xpts(1)).*(xpts(5)-xpts(2)).* ...
        (xpts(5)-xpts(3)).*(xpts(5)-xpts(4)));
    % Define the interpolating polynomial
    p = @(x) ypts(1)*l1(x) + ypts(2)*l2(x) + ...
        ypts(3)*l3(x) + ypts(4)*l4(x) + ypts(5)*l5(x);
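The script above hard-codes five basis functions. For an arbitrary number of points, formula (7.2) can be evaluated with two nested loops. The following is a minimal sketch; the function name lagrange_eval is an assumption introduced here, and the approach costs $O(n^2)$ flops per evaluation point.

```matlab
function p = lagrange_eval(xpts, ypts, x)
% Evaluate the Lagrange interpolant through (xpts(i), ypts(i)) at every
% element of x, using p(x) = sum_i ypts(i)*l_i(x) directly.
n = length(xpts);
p = zeros(size(x));
for i = 1:n
    % build l_i(x) as a product over all j different from i
    li = ones(size(x));
    for j = [1:i-1, i+1:n]
        li = li .* (x - xpts(j)) / (xpts(i) - xpts(j));
    end
    p = p + ypts(i) * li;
end
```

Calling lagrange_eval([0 1 2 3 4], [1 2 2 6 9], x) should agree, up to roundoff, with the handle p defined in the script and with the monomial-basis polynomial $p_M(x)$ of Example 49.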

7.2 Chebyshev interpolation

In this section, we discuss an optimal approach to efficiently construct and evaluate the polynomial interpolant $p(x)$ that is guaranteed to converge to $f(x)$. We recommend single polynomial interpolation to be performed this way.¹

First, we emphasize that the x-points of data should be chosen appropriately whenever possible. If n is larger than a few, then typically it is a bad idea to interpolate data with a single polynomial at equally spaced or randomly selected points. In many cases, such a high order polynomial tends to be oscillatory near the two ends of the interval $[a, b]$ of interpolation. This is called the "Runge phenomenon" and has been incorrectly associated with the high order of the interpolant in some textbooks. In fact, it is the location of the x-points that causes the poor behavior of the interpolant, not the degree itself. If we choose the x-points at the Chebyshev points

$$x_i = -\frac{b-a}{2}\cos\frac{(i-1)\pi}{n-1} + \frac{b+a}{2}, \quad 1 \le i \le n \qquad (7.3)$$

on the interval $[a, b]$ of interest, the interpolant $p(x)$ becomes an increasingly accurate approximation to $f(x)$ as the number n of data points increases, as long as $f(x)$ is differentiable on $[a, b]$. This $p(x)$ is called a Chebyshev interpolant.

Example 51. Approximate $f(x) = \frac{1}{1+25x^2}$ over the interval $[-1, 1]$, using 9 and 19 equally spaced x-points, and also Chebyshev x-points (the corresponding y-points obtained by plugging the x's into $f(x)$) to create a polynomial interpolant.

1 For in-depth discussion about polynomial approximations, see Approximation Theory and Approximation Practice, by L. N. Trefethen, SIAM, 2013.

As the number of data points n increases, the polynomial interpolant $p(x)$ based on equispaced points moves away from $f(x)$ near the ends of the interval, whereas the interpolant based on Chebyshev points converges to $f(x)$. In addition, if the y-data $f(x_i)$ change slightly, the Chebyshev interpolant $p(x)$ will also change only slightly. The unstable oscillatory behavior of $p(x)$ based on interpolation points far from Chebyshev (e.g., equispaced or random) is avoided in a reliable manner.

Barycentric formula. To construct the polynomial interpolant $p(x)$ and efficiently evaluate it at any point of interest on $[a, b]$, we need to discuss an equivalent expression of the Lagrange interpolating polynomial $p(x) = y_1 l_1(x) + y_2 l_2(x) + \cdots + y_n l_n(x)$. This standard original formula, though mathematically convenient for analysis, is expensive for evaluating $p(x)$ at many different values of x if n is not very small. In fact, we can see from the expression of the Lagrange basis $l_i(x)$ that it takes $O(n^2)$ flops to evaluate $p(x)$ at a single value of x, and hence $O(mn^2)$ flops to do so for m different values of x.

To derive a more efficient formula, define $\mu_i = \frac{1}{\prod_{j=1, j\ne i}^n (x_i - x_j)}$, such that

$$l_i(x) = \frac{\prod_{j=1, j\ne i}^n (x - x_j)}{\prod_{j=1, j\ne i}^n (x_i - x_j)} = \mu_i \frac{l(x)}{x - x_i}, \quad\text{where } l(x) = \prod_{j=1}^n (x - x_j).$$

Therefore,

$$p(x) = l(x)\left(\frac{y_1\mu_1}{x - x_1} + \frac{y_2\mu_2}{x - x_2} + \cdots + \frac{y_n\mu_n}{x - x_n}\right), \quad x \ne x_i. \qquad (7.4)$$

Recall that the polynomial interpolant of degree ≤ n − 1 that goes through $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ is unique. Let $f(x) \equiv 1$, so that the interpolant is $p(x) = y_1 l_1(x) + y_2 l_2(x) + \cdots + y_n l_n(x) = l_1(x) + l_2(x) + \cdots + l_n(x)$. Since $f(x)$ itself is a polynomial of degree 0 (≤ n − 1) going through these points, $f(x) \equiv p(x)$ by the uniqueness. That is, $l_1(x) + l_2(x) + \cdots + l_n(x) \equiv 1$ for any x. Dividing $l_i(x) = \mu_i\frac{l(x)}{x - x_i}$ by this identity, with some work, we have

$$l_i(x) = \frac{\mu_i}{x - x_i} \Big/ \left(\frac{\mu_1}{x - x_1} + \frac{\mu_2}{x - x_2} + \cdots + \frac{\mu_n}{x - x_n}\right),$$

and it follows that

$$p(x) = \frac{\dfrac{y_1\mu_1}{x - x_1} + \dfrac{y_2\mu_2}{x - x_2} + \cdots + \dfrac{y_n\mu_n}{x - x_n}}{\dfrac{\mu_1}{x - x_1} + \dfrac{\mu_2}{x - x_2} + \cdots + \dfrac{\mu_n}{x - x_n}}, \quad x \ne x_i. \qquad (7.5)$$

The new formulas (7.4) and (7.5) are called the modified Lagrange interpolating polynomial formula and the barycentric formula, respectively. The virtue of (7.5) is that the quantities $\mu_i$ appear in both the numerator and the denominator and, therefore, we can multiply or divide all of them by the same nonzero number (called “scaling”) without changing the formula, whereas this is not possible for (7.4). Using the Chebyshev $x$-points $x_i = -\dfrac{b-a}{2}\cos\dfrac{(i-1)\pi}{n-1} + \dfrac{b+a}{2}$ for polynomial interpolation, one can show that after proper scaling,

\[
\mu_1 = \frac{1}{2}, \qquad \mu_i = (-1)^{i-1} \ \ (2 \le i \le n-1), \qquad \mu_n = \frac{1}{2}(-1)^{n-1}. \tag{7.6}
\]
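The scaling claim in (7.6) can be checked numerically for a small case. The sketch below, which assumes $n = 7$ Chebyshev points on $[-1,1]$, computes the unscaled $\mu_i$ from their definition and rescales them so that the first entry equals $1/2$; the variable names are ours.

```matlab
% Check (7.6) for n = 7 Chebyshev points on [-1,1] (a sketch).
n  = 7;
xs = -cos((0:n-1)/(n-1)*pi);
mu = zeros(1, n);
for i = 1:n
    others = xs([1:i-1, i+1:n]);
    mu(i)  = 1/prod(xs(i) - others);   % definition of mu_i
end
scaled   = mu/(2*mu(1));               % rescale so that the first entry is 1/2
expected = [1/2, -1, 1, -1, 1, -1, 1/2];
disp([mu; scaled; expected]);
```

The second and third rows of the displayed matrix should agree up to roundoff.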

It is easy to memorize these quantities: they are just the alternating sequence $1, -1, 1, -1, \ldots$, but the first and the last one need to be halved.

We recommend that single polynomial interpolation be performed by choosing the Chebyshev $x$-points (7.3), and evaluated by the barycentric formula (7.5) with $\mu_i$ defined in (7.6). Such a formula has two major strengths: (1) $p(x)$ is guaranteed to converge to $f(x)$ for any $x$ on $[a,b]$, as long as $f(x)$ is differentiable on this interval, and the error $\max_{a\le x\le b}|f(x) - p(x)|$ goes to zero exponentially with $n$ if $f(x)$ is analytic; (2) evaluation of $p(x)$ for a single value of $x$ takes only $O(n)$ flops (instead of the $O(n^2)$ flops needed by the original Lagrange interpolation formula), which is essentially optimal.

A MATLAB code for the Chebyshev interpolation using the barycentric formula is given as follows:

    function pxeval = chebinterp(func,a,b,n,xeval)
    % for a function f(x) differentiable on [a,b], we construct the
    % polynomial interpolant p(x) based on n Chebyshev points, and
    % evaluate p(x) at each element of 'xeval'

    % Chebyshev x-points (x1,x2,...,xn)
    xs = -(b-a)/2*cos((0:(n-1))/(n-1)*pi)+(a+b)/2;
    % corresponding y-points (y1,y2,...,yn) = (f(x1),f(x2),...,f(xn))
    ys = func(xs);
    % mu = [1/2 -1 1 -1 ... 1/2*(-1)^(n-1)]
    mus = ones(1,n); mus(2:2:end) = -1;
    mus([1 end]) = mus([1 end])/2;
    pxeval = zeros(size(xeval));
    % use the barycentric formula to evaluate p(x) at the 'xeval'
    % x-points; for simplicity, assume that all elements of 'xeval'
    % are not equal to any elements of the Chebyshev x-points
    for k = 1 : length(xeval)
        denom = mus./(xeval(k)-xs);
        numer = ys.*denom;
        pxeval(k) = sum(numer)/sum(denom);
    end

We illustrate the accuracy and efficiency of the Chebyshev interpolation on an example similar to the previous one. We approximate $f(x) = \dfrac{1}{1+2500x^2}$ on $[-1,1]$ by the interpolant $p(x)$ of degree 499, 999, and 1999, respectively, and evaluate $p(x)$ at 100,000 uniformly distributed random $x$-values on $[-1,1]$.

    func = @(x) 1./(1+2500*x.^2);
    a = -1; b = 1;
    % uniformly distributed evaluation points on [a,b]
    xeval = rand(100000,1)*(b-a)+a;
    ts1 = tic; px1 = chebinterp(func,a,b,500,xeval);  te1 = toc(ts1);
    ts2 = tic; px2 = chebinterp(func,a,b,1000,xeval); te2 = toc(ts2);
    ts3 = tic; px3 = chebinterp(func,a,b,2000,xeval); te3 = toc(ts3);
    err1 = norm(px1-func(xeval),'inf');
    err2 = norm(px2-func(xeval),'inf');
    err3 = norm(px3-func(xeval),'inf');
    fprintf('inf-norm of |f(x)-p(x)|: %.3e, %.3e, %.3e.\n', ...
        err1,err2,err3);
    fprintf('Timing: %.3f, %.3f, %.3f secs.\n',te1,te2,te3);

    inf-norm of |f(x)-p(x)|: 9.267e-05, 4.210e-09, 7.550e-15.
    Timing: 0.275, 0.457, 0.853 secs.

We see that $f(x) = \dfrac{1}{1+2500x^2}$ on $[-1,1]$ can be approximated by the Chebyshev interpolant of degree 1999 to near machine epsilon. Also, the time needed to evaluate $p(x)$ grows roughly linearly with $n$, the number of data points.
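The code for `chebinterp` assumes that no evaluation point coincides with a Chebyshev point, since (7.5) would then divide by zero. One possible way to lift this restriction, sketched below under our own assumptions (test function $1/(1+25x^2)$, $n = 50$, and a hand-picked `xeval`), is to return the data value $y_i$ directly whenever $x = x_i$; this is a variant for illustration and not the book's code.

```matlab
% Barycentric evaluation that tolerates evaluation points equal to a node
% (a sketch; the test function and the points in 'xeval' are assumptions).
func = @(x) 1./(1 + 25*x.^2);
a = -1; b = 1; n = 50;
xs  = -(b-a)/2*cos((0:(n-1))/(n-1)*pi) + (a+b)/2;   % Chebyshev points
ys  = func(xs);
mus = ones(1, n); mus(2:2:end) = -1;
mus([1 end]) = mus([1 end])/2;                       % weights (7.6)

xeval  = [xs(1), xs(10), 0.3, xs(end)];              % includes three nodes
pxeval = zeros(size(xeval));
for k = 1:length(xeval)
    d   = xeval(k) - xs;
    hit = find(d == 0, 1);
    if ~isempty(hit)
        pxeval(k) = ys(hit);             % p(x_i) = y_i by interpolation
    else
        w = mus./d;
        pxeval(k) = sum(w.*ys)/sum(w);   % formula (7.5)
    end
end
```

The exact comparison `d == 0` is deliberate: the barycentric formula remains well behaved for points that are merely close to a node, so only exact hits need special treatment.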

7.3 Piecewise linear interpolation

In some applications, it is inconvenient or impossible to choose the $x$-points of data at Chebyshev points to construct polynomial interpolation. For example, we measure the temperature at a place several times during a 24-hour period and want to interpolate the measured temperatures to get a smooth temperature function for any time during this period. It would be more convenient to sample such a time-dependent function at equally spaced time intervals, but as we have seen in the previous section, a single polynomial $p(x)$ based on equispaced data points would usually be a bad approximation to the true function. Instead, piecewise interpolation is a natural solution.

The first step to construct such an interpolation is to divide the entire interval $[a,b]$ into several intervals, $[x_1,x_2], [x_2,x_3], \ldots, [x_{n-1},x_n]$ (where $x_1 = a$ and $x_n = b$ as usual). Piecewise linear interpolation is defined, on interval $i$, to be

\[
p_i(x) = y_i + \frac{y_{i+1} - y_i}{x_{i+1} - x_i}\,(x - x_i).
\]

The interpolating polynomial $p(x)$ is defined so that $p(x) = p_i(x)$ on interval $i$. This may seem simplistic, but often it is reasonably accurate. It avoids oscillations, and even satisfies the following error estimate.

Theorem 52. Suppose we choose $n$ data points, $(x_i, y_i)$, $i = 1, \ldots, n$, with max spacing $h$ in the $x$-points, from a function $f \in C^2([x_1,x_n])$ (i.e., $y_i = f(x_i)$ at the data points). Then the difference between the function and the piecewise linear interpolant satisfies

\[
\max_{x_1 \le x \le x_n} |f(x) - p(x)| \le h^2 \max_{x_1 \le x \le x_n} |f''(x)|.
\]

Proof. Consider the maximum error on an arbitrary interval $i$, taking its length $x_{i+1} - x_i$ to be $h$:

\[
\begin{aligned}
\max_{x_i \le x \le x_{i+1}} |f(x) - p_i(x)|
&= \max_{x_i \le x \le x_{i+1}} \left| f(x) - \left( y_i + \frac{y_{i+1} - y_i}{x_{i+1} - x_i}(x - x_i) \right) \right| \\
&= \max_{x_i \le x \le x_{i+1}} \left| f(x) - \left( f(x_i) + \frac{f(x_{i+1}) - f(x_i)}{h}(x - x_i) \right) \right|.
\end{aligned}
\]

From Taylor’s theorem, we can write

\[
f(x) = f(x_i) + f'(x_i)(x - x_i) + \frac{f''(c(x))}{2}(x - x_i)^2,
\]

where $c$ depends on $x$ and is between $x_i$ and $x$. Inserting this into the error definition, we have

\[
\max_{x_i \le x \le x_{i+1}} |f(x) - p_i(x)|
= \max_{x_i \le x \le x_{i+1}} \left| f'(x_i)(x - x_i) + \frac{f''(c(x))}{2}(x - x_i)^2 - \frac{f(x_{i+1}) - f(x_i)}{h}(x - x_i) \right|.
\]

Again using Taylor series, we have that

\[
\begin{aligned}
f(x_{i+1}) &= f(x_i) + f'(x_i)(x_{i+1} - x_i) + \frac{f''(c_0)}{2}(x_{i+1} - x_i)^2 \\
&= f(x_i) + f'(x_i)h + \frac{f''(c_0)}{2}h^2,
\end{aligned} \tag{7.7}
\]

where $c_0$ is between $x_i$ and $x_{i+1}$. Some arithmetic on this equation reveals that

\[
\frac{f(x_{i+1}) - f(x_i)}{h} = f'(x_i) + \frac{f''(c_0)}{2}h,
\]

and now using this identity to simplify the error equation further, we obtain

\[
\begin{aligned}
\max_{x_i \le x \le x_{i+1}} |f(x) - p_i(x)|
&= \max_{x_i \le x \le x_{i+1}} \left| f'(x_i)(x - x_i) + \frac{f''(c(x))}{2}(x - x_i)^2 - \frac{f(x_{i+1}) - f(x_i)}{h}(x - x_i) \right| \\
&= \max_{x_i \le x \le x_{i+1}} \left| f'(x_i)(x - x_i) + \frac{f''(c(x))}{2}(x - x_i)^2 - \left( f'(x_i) + \frac{f''(c_0)}{2}h \right)(x - x_i) \right| \\
&= \max_{x_i \le x \le x_{i+1}} \left| \frac{f''(c(x))}{2}(x - x_i)^2 - \left( \frac{f''(c_0)}{2}h \right)(x - x_i) \right|.
\end{aligned}
\]

Since $x$ is in interval $i$, both $c(x)$ and $c_0$ are also in the interval, and $|x - x_i| \le h$. Applying the triangle inequality, we have the bound

\[
\begin{aligned}
\max_{x_i \le x \le x_{i+1}} |f(x) - p_i(x)|
&\le h^2 \max_{x_i \le x \le x_{i+1}} \left( \frac{|f''(c(x))|}{2} + \frac{|f''(c_0)|}{2} \right) \\
&\le h^2 \max_{x_i \le x \le x_{i+1}} |f''(x)| \\
&\le h^2 \max_{x_1 \le x \le x_n} |f''(x)|.
\end{aligned}
\]

Since this bound holds on an arbitrary interval $i$, it holds for every interval, and thus the theorem is proven.

Example 53. We now test the error rate in linear interpolation. Using the function $f(x) = e^x$ on $[-2,2]$, we create linear interpolants of $f(x)$ using different numbers of points, and then calculate the maximum error in the linear interpolant. The MATLAB code to do this is as follows:

    format shorte
    f = @(x) exp(x);
    % true solution at 1000 points
    x = linspace(-2,2,1000);
    y = f(x);
    for m = 1 : 8
        n = 10*2^m + 1;
        xpts = linspace(-2,2,n);
        ypts = f(xpts);
        yLI = interp1(xpts,ypts,x,'linear');
        maxerr(m) = max(abs(yLI-y));
    end
    maxerr'
    ratios = log(maxerr(1:end-1)./maxerr(2:end))/log(2);
    ratios'

which produces the output for the errors of

    3.3456e-02
    8.7800e-03
    2.2495e-03
    5.6933e-04
    1.3782e-04
    3.5046e-05
    8.8090e-06
    2.2109e-06

and for the ratios of errors of

    1.9300e+00
    1.9646e+00
    1.9823e+00
    2.0465e+00
    1.9755e+00
    1.9922e+00
    1.9943e+00

The ratios are converging to 2, which is consistent with the predicted $O(h^2)$ error.
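The definition of $p_i(x)$ and the bound of Theorem 52 can also be checked without `interp1`. The sketch below evaluates the piecewise linear interpolant by hand for $f(x) = e^x$ on $[-2,2]$ with $n = 21$ equispaced nodes (our assumed choice), compares it with `interp1`, and compares the measured error with $h^2 \max|f''|$; the variable names are ours.

```matlab
% Piecewise linear interpolation by hand, compared with interp1 and with
% the bound of Theorem 52 (a sketch; variable names are ours).
f    = @(x) exp(x);
n    = 21;
xpts = linspace(-2, 2, n)';            % nodes (column vector)
ypts = f(xpts);
x    = linspace(-2, 2, 1000)';         % evaluation points

% index i of the interval [x_i, x_{i+1}] containing each x:
% count how many left endpoints x_1,...,x_{n-1} are <= x
idx   = sum(bsxfun(@ge, x, xpts(1:end-1)'), 2);
slope = (ypts(idx+1) - ypts(idx))./(xpts(idx+1) - xpts(idx));
p     = ypts(idx) + slope.*(x - xpts(idx));

h      = max(diff(xpts));
maxerr = max(abs(p - f(x)));
bound  = h^2*exp(2);                   % h^2 * max|f''| on [-2,2]
fprintf('max error %.3e, bound %.3e\n', maxerr, bound);
```

The measured error should sit well below the bound, which reflects that the constant in Theorem 52 is not sharp.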

7.4 Piecewise cubic interpolation (cubic spline)

Often, a better choice of piecewise interpolant is a cubic spline interpolant. This type of interpolant will be a cubic function on each interval and will satisfy:
1. On interval $i$, $p_i(x) = c_0^{(i)} + c_1^{(i)} x + c_2^{(i)} x^2 + c_3^{(i)} x^3$.
2. $p \in C^2([x_1,x_n])$ (i.e., it is smooth, not choppy like the piecewise linear interpolant).

Determining the $4(n-1)$ unknowns is done by enforcing that the polynomial $p$ interpolates the data ($2n-2$ equations), enforcing $p \in C^2$ at nodes $x_2$ through $x_{n-1}$ ($2n-4$ equations), and enforcing the “not-a-knot” condition (2 equations). This defines a linear system of order $4(n-1)$ to determine the coefficients $c_j^{(i)}$. If the data points come from a smooth function, then cubic splines can be much more accurate than piecewise linear interpolants.


Theorem 54. Suppose we choose $n$ data points, $(x_i, y_i)$, $i = 1, \ldots, n$, with max point spacing $h$ in the $x$-points, from a function $f \in C^4([x_1,x_n])$. Then the difference between the function and the cubic spline $p(x)$ satisfies

\[
\max_{x_1 \le x \le x_n} |f(x) - p(x)| \le C h^4 \max_{x_1 \le x \le x_n} |f^{(4)}(x)|.
\]

MATLAB has a built-in function called spline to do this. We simply give it the data points, and MATLAB takes care of the rest as follows.

Example 55. For the data points $(i, \sin(i))$ for $i = 0, 1, 2, \ldots, 6$, find a cubic spline. Then plot the spline along with the data points and the function $\sin(x)$.

First, we create the data points and the spline.

    >> x = [0:6]
    x =
         0     1     2     3     4     5     6
    >> y = sin(x);
    >> cs = spline(x,y);

Next, we can evaluate points in the spline using MATLAB’s ppval. So for example, we can plug in $x = \pi$ via

    >> ppval(cs,pi)
    ans =
      -1.3146e-04

Thus to plot the cubic spline, we just need to give it $x$-points to plot $y$ values at. We do this below, and include in our plot the $\sin(x)$ curve as well.

    >> xx = linspace(0,6,1000);
    >> yy = ppval(cs,xx);
    >> plot(x,y,'rs',xx,yy,'r',xx,sin(xx),'b')
    >> legend('data points','cubic spline','sin(x)')

This produces the following plot:

From the plot, we see that with just 7 points, we are able to match the sin(x) curve on a large interval quite well.
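The defining properties listed at the start of this section ($4(n-1)$ coefficients, $C^2$ continuity at the interior nodes, and the not-a-knot condition) can be inspected on the spline of Example 55. The sketch below uses `unmkpp` to extract the piecewise polynomial; note that MATLAB stores each piece in the local form $a(x-x_i)^3 + b(x-x_i)^2 + c(x-x_i) + d$ rather than in powers of $x$ as in item 1. The jump variables are our own names.

```matlab
% Inspect the piecewise cubic returned by spline (a sketch).
x  = 0:6;
y  = sin(x);
cs = spline(x, y);
[breaks, coefs, npieces, order] = unmkpp(cs);
% Row i of 'coefs' holds [a b c d] for the local form
%   a*(x-x_i)^3 + b*(x-x_i)^2 + c*(x-x_i) + d   on [x_i, x_{i+1}].
hs = diff(breaks(:));
a = coefs(:,1); b = coefs(:,2); c = coefs(:,3); d = coefs(:,4);
hl = hs(1:end-1);                       % lengths of pieces 1,...,n-2

% jumps of p, p', p'' at the interior nodes x_2,...,x_{n-1}
jump0 = a(1:end-1).*hl.^3 + b(1:end-1).*hl.^2 + c(1:end-1).*hl ...
        + d(1:end-1) - d(2:end);
jump1 = 3*a(1:end-1).*hl.^2 + 2*b(1:end-1).*hl + c(1:end-1) - c(2:end);
jump2 = 6*a(1:end-1).*hl + 2*b(1:end-1) - 2*b(2:end);
% not-a-knot: p''' is also continuous at x_2 and x_{n-1}
jump3 = [a(1) - a(2); a(end-1) - a(end)];
fprintf('%d pieces, %d coefficients\n', npieces, numel(coefs));
```

All four jump vectors should vanish up to roundoff, which accounts for the $2n-4$ smoothness equations and the 2 not-a-knot equations.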

Example 56. For $f(x) = \sin(x)$ on $[-1,1]$, calculate errors and rates for cubic spline and linear interpolants for varying numbers of points in MATLAB.

    % Create the function f(x) = sin(x)
    format shorte;
    f = @(x) sin(x);
    % true solution at 1000 points
    x = linspace(-1,1,1000);
    y = f(x);
    % cubic spline and piecewise linear interpolants
    % (note: the nodes span [-2,2]; the errors are measured on [-1,1])
    for m = 1 : 8
        n = 10*2^m + 1;
        xpts = linspace(-2,2,n);
        ypts = f(xpts);
        yLI = interp1(xpts,ypts,x,'linear');
        CS = spline(xpts,ypts);
        yCS = ppval(CS,x);
        maxerrCS(m) = max(abs(yCS-y));
        maxerrLI(m) = max(abs(yLI-y));
    end
    ratiosCS = log(maxerrCS(1:end-1)./maxerrCS(2:end))/log(2);
    ratiosLI = log(maxerrLI(1:end-1)./maxerrLI(2:end))/log(2);
    [maxerrCS', [0; ratiosCS'], maxerrLI', [0; ratiosLI']]

which produces the output of [CS-error, CS-rates, LI-error, LI-rates], which correspond to the errors and order of cubic splines, and the errors and order of piecewise linear interpolants, respectively:

    3.3589e-06            0   3.9139e-03            0
    2.1236e-07   3.9834e+00   1.0165e-03   1.9449e+00
    1.3443e-08   3.9816e+00   2.5831e-04   1.9765e+00
    8.4636e-10   3.9894e+00   6.5114e-05   1.9880e+00
    5.3116e-11   3.9941e+00   1.6345e-05   1.9942e+00
    3.2817e-12   4.0166e+00   4.0410e-06   2.0160e+00
    2.0672e-13   3.9887e+00   1.0168e-06   1.9906e+00
    1.2768e-14   4.0171e+00   2.5278e-07   2.0081e+00

We observe the expected convergence order of 4 for cubic spline and 2 for piecewise linear interpolation. We also observe that the cubic spline is much more accurate.

Now let us consider a less smooth function $f(x) = \sqrt{|\sin(10x)|}$. This function is not even differentiable at some points in the interval, and its second derivative (where it exists) has vertical asymptotes. Repeating the same code above with this function produces the following output:

    9.7436e-01            0   8.1422e-01            0
    5.3906e-01   8.5402e-01   6.0886e-01   4.1929e-01
    3.5088e-01   6.1946e-01   4.0968e-01   5.7163e-01
    2.1509e-01   7.0603e-01   2.5880e-01   6.6266e-01
    1.3780e-01   6.4240e-01   1.6801e-01   6.2328e-01
    8.8288e-02   6.4226e-01   9.4428e-02   8.3126e-01
    6.9666e-02   3.4176e-01   5.3083e-02   8.3098e-01
    3.0434e-02   1.1948e+00   1.9971e-02   1.4104e+00

Here, the approximations do not achieve their optimal convergence rates of 4 and 2, but this is expected since the derivatives required in the theorems do not exist. Interestingly, the cubic spline is no longer clearly better: at the finest resolutions, the linear interpolant is even slightly more accurate than the cubic spline.

Chebyshev or piecewise interpolation? To interpolate an analytic function $f(x)$ and approximate it accurately by a polynomial $p(x)$ on a finite interval $[a,b]$, piecewise interpolants, including cubic splines, should be used only if it is inconvenient or impossible to choose the interpolation points at the Chebyshev points. If we can freely choose the interpolation points $x_1, x_2, \ldots, x_n$ and efficiently evaluate $f(x)$ at these points, the Chebyshev interpolant is preferred because it is asymptotically a more accurate approximation to $f(x)$ based on the same number of interpolation points $n$. The error $\max_{x\in[a,b]}|f(x) - p(x)|$ is $\mathcal{O}(\rho^{-n})$ (for some fixed $\rho > 1$) for the Chebyshev interpolant, and it is $\mathcal{O}(n^{-p})$ (for some fixed $p \ge 2$) for a piecewise interpolant (see Problem 8). Nevertheless, in some applications (such as modeling fluid flows), the true function may not be so smooth, and piecewise interpolation is then predominantly used.

7.5 Exercises

1. Consider the data points

       x = [1    2    3    4     5   6   7   8   9   10]
       y = [1.1  4.1  8.9  16.1  26  36  52  63  78  105]

   (a) Find a single polynomial that interpolates the data points.
   (b) Check with a plot that your polynomial does indeed interpolate the data.
   (c) Would this interpolating polynomial likely give a good prediction of what is happening at $x = 1.5$? Why or why not?
   (d) Create a cubic spline for the same data, and plot it on the same graph as the single polynomial interpolant. Does it do a better job fitting the data?
2. Find both the monomial and Lagrange interpolating polynomials for the data points $(0,0)$, $(1,1)$, $(2,3)$, $(4,-1)$.
3. If an $x$ value gets repeated in the data points, why is it a problem in both of the interpolating polynomial constructions we considered? Furthermore, should it create a problem (i.e., is it even possible to get an answer)?

4. Compare the classic Lagrange interpolation formula (7.2) and the barycentric formula (7.5) (with all $\mu_i$’s known). Verify that the former needs $O(n^2)$ flops to evaluate $p(x)$ for a single $x$, whereas the latter needs $O(n)$ flops.
5. Consider the Chebyshev points $x_i = -\cos\dfrac{(i-1)\pi}{6}$ ($1 \le i \le 7$) and the quantities $\mu_i = \dfrac{1}{\prod_{j=1,\,j\neq i}^{7}(x_i - x_j)}$. Verify by hand that $\mu_1 = \frac{8}{3}$, $\mu_2 = -\frac{16}{3}$, $\mu_3 = \frac{16}{3}$, and $\mu_4 = -\frac{16}{3}$. Use symmetry to find $\mu_5$, $\mu_6$, and $\mu_7$ directly without evaluation. After scaling, are they consistent with the sequence $\frac{1}{2}, -1, 1, -1, \ldots, \frac{1}{2}(-1)^{n-1}$ defined in (7.6)?
6. Consider $f(x) = \ln x$ on $[1,5]$. Give the expression of the barycentric formula (7.5) for the polynomial interpolation $p(x)$ based on $n = 4$ Chebyshev points. Then evaluate $p(x)$ at $x = 1.5$ and $x = 2.5$ and compare with $f(x)$.
7. Use the MATLAB code given to construct Chebyshev interpolants $p(x)$ for $f(x) = e^{-5x}\sin\dfrac{1}{x}\,\sin\dfrac{1}{\sin\frac{1}{x}}$ on $[0.1596, 0.3175]$ using $n = 200, 500, 800, 1100$ Chebyshev points, and evaluate $p(x)$ at 100,000 uniformly distributed random evaluation $x$-points on this interval. How does $\max_{0.1596 \le x \le 0.3175}|f(x) - p(x)|$ change with $n$? Also, does the timing seem roughly linear with $n$?
8. Consider the function $f(x) = \sin^2(x)$ on $[0,\pi]$.
   (a) Verify the convergence theorems for piecewise linear and cubic spline interpolation. Using 11, 21, 41, and 81 equally spaced points, calculate the errors in the approximations (error as defined in the theorems) and the convergence rates.
   (b) By trial and error, find out how many equally spaced points are needed so that the cubic spline $p(x)$ approximates $f(x)$ to near machine epsilon. Use 100,000 uniformly distributed random evaluation $x$-points on $[0,\pi]$ to approximate $\max_{0\le x\le\pi}|f(x) - p(x)|$.
   (c) Use Chebyshev interpolation to approximate $f(x)$. Do the same test as in part (b) and see how many data points are needed to reach near machine epsilon accuracy. Do you prefer Chebyshev interpolation or cubic splines?

8 Numerical integration

In this chapter, we study algorithms for approximating the definite integral

\[
\int_a^b f(x)\,dx.
\]

We assume that $[a,b]$ is finite and $f(x)$ is continuous. We know from calculus that finding an elementary antiderivative of $f(x)$ can be rather challenging, and in many cases impossible even for $f(x)$ with quite simple expressions, such as $f(x) = \sqrt[3]{x^2+1}$, $\dfrac{1}{\ln x}$, $\dfrac{\sin x}{x}$, $e^{-x^2}$, and so on. In addition, we may not have an analytic expression of $f(x)$ but instead can only evaluate it wherever convenient. In these cases, the most commonly used solution is to find an approximate value of the integral by numerical integration (or quadrature). Our focus is on a variety of quadrature rules that strike different balances between accuracy and evaluation cost.

https://doi.org/10.1515/9783110573329-008

8.1 Preliminaries

A fundamental idea for quadrature is to use a polynomial $p_n(x)$ to approximate $f(x)$ on $[a,b]$, so that $\int_a^b f(x)\,dx$ can be approximated by $\int_a^b p_n(x)\,dx$, and the integration of polynomials is relatively easy. The first thought here is to let $p_n(x) = \sum_{k=0}^{n} f(x_k)L_k(x)$ be the Lagrange interpolant of $f(x)$ at distinct nodes $x_0, x_1, \ldots, x_n \in [a,b]$. Let $w_k = \int_a^b L_k(x)\,dx$, and it follows that

\[
Q(f) \equiv \sum_{k=0}^{n} w_k f(x_k) = \int_a^b \sum_{k=0}^{n} f(x_k)L_k(x)\,dx = \int_a^b p_n(x)\,dx.
\]

Here, $\{x_k\}_{k=0}^{n}$ and $\{w_k\}_{k=0}^{n}$ are called the quadrature nodes and weights, respectively. The weights usually depend on the nodes, which should be independent of $f(x)$. A quadrature rule is defined by the choice of nodes and weights.

The generic quadrature rule above is a linear functional. For any continuous functions $f$, $g$ and scalars $\alpha$, $\beta$,

\[
Q(\alpha f + \beta g) = \sum_{k=0}^{n} w_k\big(\alpha f(x_k) + \beta g(x_k)\big) = \alpha\left(\sum_{k=0}^{n} w_k f(x_k)\right) + \beta\left(\sum_{k=0}^{n} w_k g(x_k)\right) = \alpha Q(f) + \beta Q(g).
\]

Note that $\sum_{k=0}^{n} w_k = b - a$. In fact, consider the function $f(x) \equiv 1$ on $[a,b]$, which is a special polynomial of degree no higher than $n$. The polynomial interpolant $p_n(x)$ of $f(x)$ at $\{(x_k, f(x_k))\}_{k=0}^{n}$ of degree $\le n$ is unique and, therefore, $p_n(x)$ must be $f(x)$ itself. It follows that $p_n(x) = \sum_{k=0}^{n} f(x_k)L_k(x) = \sum_{k=0}^{n} L_k(x) \equiv 1$ and, therefore, $\sum_{k=0}^{n} w_k = \sum_{k=0}^{n}\int_a^b L_k(x)\,dx = \int_a^b \sum_{k=0}^{n} L_k(x)\,dx = b - a$.

The degree of accuracy $m$ of a quadrature rule is defined as the largest integer $m$ such that the rule gives the exact value of the definite integral for every polynomial integrand of degree $\le m$. For example, consider the midpoint rule $Q(f) = w_0 f(x_0)$, where $x_0 = \frac{a+b}{2}$ and $w_0 = \int_a^b L_0(x)\,dx = b - a$. If $f(x) = \alpha x + \beta$, then $\int_a^b f(x)\,dx = \alpha\,\frac{b^2 - a^2}{2} + \beta(b - a)$, and the quadrature is $Q(f) = w_0 f(x_0) = (b - a)\left[\alpha\left(\frac{a+b}{2}\right) + \beta\right] = \int_a^b f(x)\,dx$. We can verify that the midpoint rule is generally not exact for a quadratic integrand. The degree of accuracy of this rule is therefore 1.

Theorem 57. The degree of accuracy $m$ for quadrature rules based on Lagrange interpolation at $n+1$ distinct nodes satisfies $m \ge n$, and $m \ge n+1$ if $n$ is even, $x_{n/2} = \frac{a+b}{2}$, and $x_{n/2-k}$ and $x_{n/2+k}$ ($k = 1, \ldots, \frac{n}{2}$) are symmetric with respect to $x_{n/2}$.

Proof. Given arbitrary $n+1$ distinct nodes $x_0, \ldots, x_n$, there is a unique polynomial $p_n(x)$ of degree $\le n$ going through the data points $\{(x_k, f(x_k))\}_{k=0}^{n}$. If the integrand $f(x)$ itself is a polynomial of degree $\le n$, the corresponding interpolant $p_n(x)$ must be identical to $f(x)$. Therefore, the quadrature rule based on $\int_a^b p_n(x)\,dx$ gives the exact value of $\int_a^b f(x)\,dx$. That is, any quadrature rule based on Lagrange interpolation of $f(x)$ necessarily has a degree of accuracy $m \ge n$, wherever the $n+1$ distinct nodes are located on $[a,b]$.

Next, we want to show that $m \ge n+1$ if $n$ is even, $x_{n/2} = \frac{a+b}{2}$, and $x_{n/2-k}$ and $x_{n/2+k}$ ($k = 1, \ldots, \frac{n}{2}$) are symmetric with respect to $x_{n/2}$. Let $p_{n+1}(x) = c_{n+1}x^{n+1} + c_n x^n + \cdots + c_0$ be an arbitrary given polynomial of degree $\le n+1$. Define $p^c_{n+1}(x) = \left(x - \frac{a+b}{2}\right)^{n+1}$ such that $p_{n+1}(x) = c_{n+1}p^c_{n+1}(x) + r_n(x)$, where $r_n$ is a polynomial of degree $\le n$. Due to the linearity of quadrature rules,

\[
\begin{aligned}
Q(p_{n+1}) &= c_{n+1}Q(p^c_{n+1}) + Q(r_n) = c_{n+1}Q(p^c_{n+1}) + \int_a^b r_n(x)\,dx \\
&= c_{n+1}\sum_{k=0}^{n} w_k\left(x_k - \frac{a+b}{2}\right)^{n+1} + \int_a^b r_n(x)\,dx = \int_a^b r_n(x)\,dx,
\end{aligned}
\]

where $\sum_{k=0}^{n} w_k\left(x_k - \frac{a+b}{2}\right)^{n+1} = 0$ because $n+1$ is odd, $x_{n/2} = \frac{a+b}{2}$, $x_{n/2-k}$ and $x_{n/2+k}$ ($k = 1, \ldots, \frac{n}{2}$) are symmetric with respect to $x_{n/2}$, and $w_{n/2-k} = w_{n/2+k}$ due to the symmetry of the nodes (this can be shown without difficulty).

Meanwhile, it is also easy to see that $\int_a^b p_{n+1}(x)\,dx = c_{n+1}\int_a^b p^c_{n+1}(x)\,dx + \int_a^b r_n(x)\,dx = \int_a^b r_n(x)\,dx$, due to the odd symmetry of $p^c_{n+1}(x)$ with respect to the midpoint $\frac{a+b}{2}$. Therefore, such a quadrature rule with an even $n$ and symmetry in nodes and weights has a degree of accuracy $m \ge n+1$.

An immediate result following Theorem 57 is summarized as follows.

Theorem 58. Suppose that the degree of accuracy of a quadrature rule is $m$, and all $w_k > 0$. Then for $f \in C^{m+1}([a,b])$, the error of the quadrature satisfies

\[
\left| \int_a^b f(x)\,dx - Q(f) \right| \le \frac{m+3}{(m+2)!} \max_{c\in[a,b]} \left| f^{(m+1)}(c) \right| (b-a)^{m+2}.
\]

Proof. Write $f(x)$ on $[a,b]$ in a Taylor expansion at expansion point $a$ as

\[
f(x) = p_m(x) + \frac{f^{(m+1)}(c_x)}{(m+1)!}(x - a)^{m+1}, \tag{8.1}
\]

where $a \le c_x \le x \le b$, and $p_m(x) = \sum_{k=0}^{m} \frac{f^{(k)}(a)}{k!}(x - a)^k$ is a polynomial of degree $\le m$. Applying the quadrature rule to both sides of (8.1), and using that the rule is exact for $p_m$, we have

\[
\begin{aligned}
Q(f) &= Q\left(p_m + \frac{f^{(m+1)}(c_x)}{(m+1)!}(x - a)^{m+1}\right)
= Q(p_m) + \frac{1}{(m+1)!}\,Q\left(f^{(m+1)}(c_x)(x - a)^{m+1}\right) \\
&= \int_a^b p_m(x)\,dx + \frac{1}{(m+1)!}\,Q\left(f^{(m+1)}(c_x)(x - a)^{m+1}\right).
\end{aligned}
\]

Meanwhile, $\int_a^b f(x)\,dx = \int_a^b p_m(x)\,dx + \frac{1}{(m+1)!}\int_a^b f^{(m+1)}(c_x)(x - a)^{m+1}\,dx$. Subtracting $Q(f)$ from $\int_a^b f(x)\,dx$ and multiplying by $(m+1)!$, we have

\[
\begin{aligned}
(m+1)!\left| \int_a^b f(x)\,dx - Q(f) \right|
&= \left| \int_a^b f^{(m+1)}(c_x)(x - a)^{m+1}\,dx - Q\left(f^{(m+1)}(c_x)(x - a)^{m+1}\right) \right| \\
&\le \left| \sum_{k=0}^{n} w_k f^{(m+1)}(c_{x_k})(x_k - a)^{m+1} \right| + \left| \int_a^b f^{(m+1)}(c_x)(x - a)^{m+1}\,dx \right| \\
&\le \max_{c\in[a,b]} \left| f^{(m+1)}(c) \right| (b-a)^{m+1} \sum_{k=0}^{n} |w_k| + \max_{c\in[a,b]} \left| f^{(m+1)}(c) \right| \int_a^b (x - a)^{m+1}\,dx \\
&= \max_{c\in[a,b]} \left| f^{(m+1)}(c) \right| (b-a)^{m+2}\left(1 + \frac{1}{m+2}\right),
\end{aligned} \tag{8.2}
\]

where we used $\sum_{k=0}^{n} |w_k| = \sum_{k=0}^{n} w_k = b - a$ since all $w_k > 0$. Equivalently, we have

\[
\left| \int_a^b f(x)\,dx - Q(f) \right| \le \frac{m+3}{(m+2)!} \max_{c\in[a,b]} \left| f^{(m+1)}(c) \right| (b-a)^{m+2}.
\]

Note that the constant factor in the upper bound on the quadrature error could be reduced significantly with more refined analysis. Nevertheless, we are more interested in the order of the quadrature rules in this textbook.
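The degree of accuracy defined in this section can be estimated numerically by applying a rule to the monomials $1, x, x^2, \ldots$ and recording the first degree at which it stops being exact. The sketch below does this on the assumed interval $[0,1]$ for the midpoint rule and for two closed rules whose nodes and weights follow from $w_k = \int_a^b L_k(x)\,dx$; the `rules` cell array and the tolerance are our own choices.

```matlab
% Estimate the degree of accuracy of a rule by testing monomials
% (a sketch on [a,b] = [0,1]; names and tolerance are assumptions).
a = 0; b = 1; tol = 1e-12;
rules = { ...
  struct('name','midpoint',  'x',(a+b)/2,       'w',b-a), ...
  struct('name','trapezoid', 'x',[a b],         'w',(b-a)/2*[1 1]), ...
  struct('name','Simpson',   'x',[a (a+b)/2 b], 'w',(b-a)/6*[1 4 1])};
m = zeros(1, numel(rules));
for r = 1:numel(rules)
    for j = 0:10
        Q     = sum(rules{r}.w .* rules{r}.x.^j);   % rule applied to x^j
        exact = (b^(j+1) - a^(j+1))/(j+1);          % integral of x^j
        if abs(Q - exact) > tol
            m(r) = j - 1;      % last degree integrated exactly
            break
        end
    end
    fprintf('%-10s degree of accuracy m = %d\n', rules{r}.name, m(r));
end
```

The midpoint rule ($n = 0$) and the three-node rule ($n = 2$) have symmetric nodes with even $n$, so by Theorem 57 they gain one degree over the generic bound $m \ge n$, while the two-node rule ($n = 1$) does not.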

8.2 Newton–Cotes rules

The Newton–Cotes quadrature rules are based on interpolating $f(x)$ at equally spaced (or equispaced) nodes. These rules are called open if the quadrature nodes are

\[
x_k = a + (k+1)h \quad (0 \le k \le n), \qquad \text{where } h = \frac{b-a}{n+2},
\]

and they are called closed if

\[
x_k = a + kh \quad (0 \le k \le n), \qquad \text{where } h = \frac{b-a}{n}.
\]

We will consider only one open rule, namely, the midpoint rule, and all others are closed. Composite rules based on closed Newton–Cotes rules on each subinterval are slightly more efficient than those using open rules if $n \ge 1$, because the former can use the same node at the common endpoint of two adjacent subintervals.

By Theorem 57, for both open and closed Newton–Cotes rules, their degree of accuracy is $m \ge n$ if $n$ is odd, and $m \ge n+1$ if $n$ is even due to the symmetry in nodes. One can verify that these are actually equalities, meaning that there is no additional degree of accuracy achieved by Newton–Cotes rules.

To derive the Newton–Cotes quadrature rule formulas, let us begin with the midpoint rule. This rule is open, with only one node $x_0 = \frac{a+b}{2}$, and the corresponding weight is $w_0 = \int_a^b L_0(x)\,dx = \int_a^b 1\,dx = b - a$.

The simplest closed rule is the trapezoidal rule ($n = 1$) with nodes $x_0 = a$ and $x_1 = b$, and weights $w_0 = \int_a^b L_0(x)\,dx = \int_a^b \frac{x - x_1}{x_0 - x_1}\,dx = \frac{b-a}{2}$ and $w_1 = \int_a^b L_1(x)\,dx = \int_a^b \frac{x - x_0}{x_1 - x_0}\,dx = \frac{b-a}{2}$.

Simpson’s rule ($n = 2$) is one of the most widely used closed rules, with nodes $x_0 = a$, $x_1 = \frac{a+b}{2}$ and $x_2 = b$, and weights $w_0 = \int_a^b L_0(x)\,dx = \int_a^b \frac{(x - x_1)(x - x_2)}{(x_0 - x_1)(x_0 - x_2)}\,dx = \frac{b-a}{6}$, $w_1 = \int_a^b L_1(x)\,dx = \int_a^b \frac{(x - x_0)(x - x_2)}{(x_1 - x_0)(x_1 - x_2)}\,dx = \frac{2(b-a)}{3}$, and $w_2 = \int_a^b L_2(x)\,dx = w_0 = \frac{b-a}{6}$.

The above basic Newton–Cotes rules are illustrated in Figure 8.1.

Figure 8.1: Several low order Newton–Cotes quadrature rules.

We can continue deriving closed rules with more nodes, but the derivation becomes increasingly complex and tedious. Instead, we simply provide a table of these Newton–Cotes rules (Table 8.1), their number of nodes $n+1$, the degree of accuracy $m$, together with their errors $\int_a^b f(x)\,dx - Q(f)$. Note that alternative error bounds can be obtained directly from Theorems 57 and 58. Though those bounds are not sharp in the constant factor, as they are derived without consideration of the equispacing of nodes and specific values of weights, they do have correct exponents of $(b-a)$ and the order of derivatives of $f(x)$. For example, Boole’s rule has $n+1 = 5$ nodes, $x_2 = \frac{a+b}{2}$, and the nodes are symmetric. Therefore, by Theorems 57 and 58, the degree of accuracy is $m = 5$, and an error bound is $\frac{m+3}{(m+2)!}\max_{c\in[a,b]}|f^{(m+1)}(c)|(b-a)^{m+2} = \frac{1}{630}\max_{c\in[a,b]}|f^{(6)}(c)|(b-a)^7$, which has a constant factor much larger than that of the error given in Table 8.1. In other words, if we simply want to construct quadrature rules based on Lagrange interpolation with error bounds of the form $C\max_{c\in[a,b]}|f^{(m+1)}(c)|(b-a)^{m+2}$ ($m$ is the degree of accuracy), there is a significant degree of freedom to do so; equispaced nodes are not necessary for this purpose.

Table 8.1: Low order Newton–Cotes quadrature rules.

    Name | n | m | formula | error $\int_a^b f(x)\,dx - Q(f)$
    Midpoint | 0 | 1 | $(b-a)\,f\!\left(\frac{a+b}{2}\right)$ | $+\frac{(b-a)^3}{24} f^{(2)}(c)$
    Trapezoid | 1 | 1 | $\frac{b-a}{2}\,[f(a) + f(b)]$ | $-\frac{(b-a)^3}{12} f^{(2)}(c)$
    Simpson | 2 | 3 | $\frac{b-a}{6}\,[f(a) + 4f(\frac{a+b}{2}) + f(b)]$ | $-\frac{(b-a)^5}{2880} f^{(4)}(c)$
    3/8-rule | 3 | 3 | $\frac{b-a}{8}\,[f(a) + 3f(\frac{2a+b}{3}) + 3f(\frac{a+2b}{3}) + f(b)]$ | $-\frac{(b-a)^5}{6480} f^{(4)}(c)$
    Boole | 4 | 5 | $\frac{b-a}{90}\,[7f(a) + 32f(\frac{3a+b}{4}) + 12f(\frac{a+b}{2}) + 32f(\frac{a+3b}{4}) + 7f(b)]$ | $-\frac{(b-a)^7}{1935360} f^{(6)}(c)$
    (no name) | 5 | 5 | $\frac{b-a}{288}\,[19f(a) + 75f(\frac{4a+b}{5}) + 50f(\frac{3a+2b}{5}) + 50f(\frac{2a+3b}{5}) + 75f(\frac{a+4b}{5}) + 19f(b)]$ | $-\frac{11(b-a)^7}{37800000} f^{(6)}(c)$
    Weddle | 6 | 7 | $\frac{b-a}{840}\,[41f(a) + 216f(\frac{5a+b}{6}) + 27f(\frac{2a+b}{3}) + 272f(\frac{a+b}{2}) + 27f(\frac{a+2b}{3}) + 216f(\frac{a+5b}{6}) + 41f(b)]$ | $-\frac{(b-a)^9}{1567641600} f^{(8)}(c)$

The closed rule with 8 nodes ($n = 7$) has an error comparable to that of Weddle’s rule ($n = 6$). The closed rule with $n = 8$ is the first one with negative weights, and the one with $n = 9$ is the last one with all positive weights. Starting from $n = 10$, all closed Newton–Cotes rules have both positive and negative weights, whose maximum magnitude overall increases to infinity as $n$ increases. For the sake of numerical stability, quadrature rules with all weights positive are preferred to those with negative weights, especially those with large negative weights. The following theorem shows why.

Theorem 59. For a quadrature rule $Q(f) = \sum_{k=0}^{n} w_k f(x_k)$ with all $w_k > 0$, if $f, g \in C([a,b])$ satisfy $\max_{x\in[a,b]}|f(x) - g(x)| \le \delta$, then $|Q(f) - Q(g)| \le \delta(b - a)$.

Proof. Direct evaluation leads to

\[
|Q(f) - Q(g)| = \left| \sum_{k=0}^{n} w_k\big(f(x_k) - g(x_k)\big) \right| \le \sum_{k=0}^{n} |w_k|\,\big|f(x_k) - g(x_k)\big| \le \sum_{k=0}^{n} w_k\,\delta = \delta(b - a).
\]

The last inequality in the proof would break down if there are negative weights, as $\sum_{k=0}^{n}|w_k| > \sum_{k=0}^{n} w_k = b - a$. Moreover, assume that for any given $M > 0$, there is a number $n$ such that an unstable quadrature rule has both positive weights $\{w_k\}$ with $k \in S_+ \subset \{0, 1, \ldots, n\}$, and negative weights $\{w_k\}$ with $k \in S_- = \{0, 1, \ldots, n\} \setminus S_+$, with $\max_{k\in S_+}|w_k| \ge M$ or $\max_{k\in S_-}|w_k| \ge M$. Let $f, g \in C([a,b])$ be such that $f(x_k) - g(x_k) = \delta > 0$ at $x_k$ if $k \in S_+$, and $f(x_k) - g(x_k) = -\delta$ at $x_k$ if $k \in S_-$. Then

\[
|Q(f) - Q(g)| = \left| \sum_{k=0}^{n} w_k\big(f(x_k) - g(x_k)\big) \right| = \left| \sum_{k\in S_+} w_k\,\delta + \sum_{k\in S_-} w_k(-\delta) \right| = \delta\sum_{k=0}^{n}|w_k| \ge \delta M.
\]

In other words, such a quadrature of two functions very close to each other may have quite different values. Quadrature rules of this type, for example, higher order Newton–Cotes, have stability issues and cannot be used.

Example 60. Approximate $\int_1^3 \ln x\,dx$, using the midpoint, trapezoidal, Simpson’s, Boole’s, and Weddle’s rules. Here, $f(x) = \ln x$, $a = 1$ and $b = 3$, and the exact integral is $\big[x\ln x - x\big]_1^3 = 3\ln 3 - 2 = 1.29583687$.

The midpoint rule has $w_0 = b - a = 2$ and $x_0 = \frac{a+b}{2} = 2$. Therefore, $Q(f) = w_0 f(x_0) = 2\ln 2 = 1.38629436$.

The trapezoidal rule has $w_0 = w_1 = \frac{b-a}{2} = 1$, and $x_0 = 1$, $x_1 = 3$. It follows that $Q(f) = w_0 f(x_0) + w_1 f(x_1) = \ln 3 = 1.09861229$.

Simpson’s rule has $w_0 = w_2 = \frac{b-a}{6} = \frac{1}{3}$ and $w_1 = \frac{4(b-a)}{6} = \frac{4}{3}$, and $x_0 = 1$, $x_1 = 2$, and $x_2 = 3$. Consequently, $Q(f) = w_0 f(x_0) + w_1 f(x_1) + w_2 f(x_2) = \frac{4}{3}\ln 2 + \frac{1}{3}\ln 3 = 1.29040034$.

Example 60. Approximate ∫1 ln xdx, using the midpoint, trapezoidal, Simpson’s, Boole’s, and Weddle’s rules. Here, f (x) = ln x, a = 1 and b = 3, and the exact integral 󵄨3 is x ln x − x󵄨󵄨󵄨1 = 3 ln 3 − 2 = 1.29583687. = 2. Therefore, Q(f ) = w0 f (x0 ) = The midpoint rule has w0 = b−a = 2 and x0 = a+b 2 2 ln 2 = 1.38629436. The trapezoidal rule has w0 = w1 = b−a = 1, and x0 = 1, x1 = 3. It follows that 2 Q(f ) = w0 f (x0 ) + w1 f (x1 ) = ln 3 = 1.09861229. = 31 and w1 = 4(b−a) = 43 , and x0 = 1, x1 = 2, and Simpson’s rule has w0 = w2 = b−a 6 6 4 x2 = 3. Consequently, Q(f ) = w0 f (x0 ) + w1 f (x1 ) + w2 f (x2 ) = 3 ln 2 + 31 ln 3 = 1.29040034.

Boole’s rule has $w_0 = w_4 = \frac{7(b-a)}{90}$, $w_1 = w_3 = \frac{32(b-a)}{90}$, $w_2 = \frac{12(b-a)}{90}$, and $x_0 = 1$, $x_1 = 1.5$, $x_2 = 2$, $x_3 = 2.5$ and $x_4 = 3$. It follows that $Q(f) = \sum_{k=0}^{4} w_k f(x_k) = 1.29564976$.

Similarly, Weddle’s rule gives $\sum_{k=0}^{6} w_k f(x_k) = 1.29582599$, using the nodes $x_k = a + k\,\frac{b-a}{6}$ ($0 \le k \le 6$) and the weights in Table 8.1. For this problem, higher order rules give more accurate approximations.

Example 61. Use Newton–Cotes rules to approximate

\[
I_1 = \int_{-0.5}^{0.5} \frac{1}{1+36x^2}\,dx = 0.4163485908
\qquad \text{and} \qquad
I_2 = \int_{0.5}^{1.5} \frac{1}{1+36x^2}\,dx = 0.0351822222.
\]

The quadrature values are summarized in Table 8.2, where $n$ is a multiple of 4.

Table 8.2: Newton–Cotes quadrature for approximating $\int_{-0.5}^{0.5}\frac{dx}{1+36x^2}$ and $\int_{0.5}^{1.5}\frac{dx}{1+36x^2}$.

     n    I1              I2
     0    1.0000000000    0.0270270270
     4    0.3676923077    0.0352724944
     8    0.3691018013    0.0351827526
    12    0.3598308365    0.0351822243
    16    0.3337916510    0.0351822222
    20    0.2811316793    0.0351822222

In this example, the quadrature seems to converge to $I_2$ but fails to converge to $I_1$ as $n$ increases. This is because the Lagrange interpolation polynomial $p_n(x)$ of $f(x)$ based on $n+1$ equispaced nodes approximates $f(x)$ accurately on $[0.5, 1.5]$ for $n \approx 20$, but it does not approximate $f(x)$ well on $[-0.5, 0.5]$ for any $n$. The failure of convergence is precisely the well-known Runge phenomenon. To make sure that the quadrature converges to the integral, one approach is to use composite rules based on Newton–Cotes, discussed in the next section.
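The hand computations in Example 60 can be reproduced numerically. The sketch below obtains the closed Newton–Cotes weights by requiring exactness for the monomials $1, t, \ldots, t^n$ on the reference interval $[-1,1]$ (an equivalent route to $w_k = \int_a^b L_k(x)\,dx$, chosen here for brevity and not taken from the text), maps them to $[a,b] = [1,3]$, and applies them to $\ln x$. It also records the smallest weight for each $n$. The variable names and the list of values of $n$ are our assumptions.

```matlab
% Closed Newton-Cotes weights from moment equations, applied to Example 60
% (a sketch; solving a Vandermonde system is only sensible for small n).
a = 1; b = 3; f = @(x) log(x);
exact = 3*log(3) - 2;
ns   = [1 2 4 6 8];                  % trapezoid, Simpson, Boole, Weddle, n = 8
Q    = zeros(size(ns));
minw = zeros(size(ns));
for idx = 1:numel(ns)
    n    = ns(idx);
    t    = linspace(-1, 1, n+1);             % reference nodes on [-1,1]
    j    = (0:n)';
    A    = bsxfun(@power, t, j);             % A(j+1,k) = t_k^j
    mom  = (1 - (-1).^(j+1))./(j+1);         % integral of t^j over [-1,1]
    wref = (A\mom)';                         % weights on [-1,1]
    x    = (a+b)/2 + (b-a)/2*t;              % nodes mapped to [a,b]
    w    = (b-a)/2*wref;                     % weights mapped to [a,b]
    Q(idx)    = sum(w.*f(x));
    minw(idx) = min(w);
    fprintf('n = %d: Q = %.8f, error = %.2e, min weight = %+.4f\n', ...
            n, Q(idx), exact - Q(idx), minw(idx));
end
```

For $n = 8$ the smallest weight should come out negative, in line with the remark that this is the first closed rule with negative weights.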

8.3 Composite rules

Assume that the degree of accuracy of a Newton–Cotes rule is $m$, $f(x) \in C^{m+2}$ on any closed interval in its domain, and $f^{(m+2)}$ is uniformly bounded on all such closed intervals. Under these assumptions, we see from the absolute errors of Newton–Cotes rules in Table 8.1 that these errors tend to be smaller on a shorter interval. In addition, if $f(x)$ does not change sign and is bounded away from zero and bounded in modulus on $[a,b]$, then $\int_a^b f(x)\,dx$ is typically and roughly proportional to the length of the interval $b - a$. It follows that the relative error of Newton–Cotes rules is

\[
\frac{C_1(b-a)^{m+2}}{\int_a^b f(x)\,dx} \approx \frac{C_1(b-a)^{m+2}}{C_2(b-a)} = C_3(b-a)^{m+1}.
\]

The relative error will also decrease on a shorter interval.

128 | 8 Numerical integration The above informal discussion justifies the use of composite quadrature rules: divide [a, b] into many subintervals, apply regular quadrature rules such as Newton– Cotes on each subinterval and sum up the quadratures. For example, we may divide [a, b] into ℓ subintervals [a, a + h], [a + h, a + 2h], . . . , [b − h, b] with h = b−a , and use ℓ the midpoint rule on each subinterval. The composite midpoint rule is therefore ℓ−1 1 3 1 1 QCM (f ) = hf (a + h) + hf (a + h) + ⋅ ⋅ ⋅ + hf (b − h) = h ∑ f (a + (k + ) h) . 2 2 2 2 k=0

This composite rule is illustrated in Figure 8.2.

Figure 8.2: Composite midpoint quadrature rule.
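As an illustration, the composite midpoint rule can be sketched in a few lines of Python (the book's own listings are in MATLAB). The function name `composite_midpoint` and the test integrand are assumptions made for this sketch only.

```python
import math

def composite_midpoint(f, a, b, num_subintervals):
    """Composite midpoint rule: h times the sum of f at the subinterval midpoints."""
    h = (b - a) / num_subintervals
    return h * sum(f(a + (k + 0.5) * h) for k in range(num_subintervals))

# Illustrative integrand: ln x on [1, 2.2], whose exact integral is 2.2 ln(2.2) - 1.2.
approx = composite_midpoint(math.log, 1.0, 2.2, 6)
exact = 2.2 * math.log(2.2) - 1.2
print(approx, exact)
```

Since ln x is concave, the midpoint values overestimate the integral on every subinterval, so `approx` lies slightly above `exact`.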

Similarly, we may also approximate the integral on each subinterval by the trapezoidal rule. The composite trapezoidal rule is

$$\begin{aligned} Q_{CT}(f) &= \frac{h}{2}\left(f(a) + f(a+h)\right) + \frac{h}{2}\left(f(a+h) + f(a+2h)\right) + \cdots + \frac{h}{2}\left(f(b-h) + f(b)\right) \\ &= \frac{h}{2}\left(f(a) + 2f(a+h) + \cdots + 2f(b-h) + f(b)\right) = \frac{h}{2}\left(f(a) + 2\sum_{k=1}^{\ell-1} f(a+kh) + f(b)\right). \end{aligned}$$

The composite Simpson’s rule is as follows. If we choose to evaluate f(x) only at the end points (and not the midpoints) of the subintervals, the subinterval count ℓ must be even, because Simpson’s rule is applied to two adjacent subintervals at a time:

$$\begin{aligned} Q_{CS}(f) &= \frac{2h}{6}\left(f(a) + 4f(a+h) + f(a+2h)\right) + \frac{2h}{6}\left(f(a+2h) + 4f(a+3h) + f(a+4h)\right) \\ &\quad + \cdots + \frac{2h}{6}\left(f(b-2h) + 4f(b-h) + f(b)\right) \\ &= \frac{h}{3}\left(f(a) + 4\sum_{k=1}^{\ell/2} f(a+(2k-1)h) + 2\sum_{k=1}^{\ell/2-1} f(a+2kh) + f(b)\right). \end{aligned}$$

Higher-order composite quadrature rules can be constructed, for example, based on Boole’s and Weddle’s rules. As with the composite Simpson’s rule, the number of subintervals ℓ should then be a multiple of 4 and 6, respectively.

How do we estimate the error of composite quadrature rules? Consider the composite trapezoidal rule, for example. From Table 8.1, note that the error on each subinterval is bounded by $\frac{h^3}{12}\max|f^{(2)}(c)|$ and, therefore, the total error is bounded by

$$\ell\,\frac{h^3}{12}\max|f^{(2)}(c)| = \frac{h^2}{12}(h\ell)\max|f^{(2)}(c)| = \frac{b-a}{12}\max|f^{(2)}(c)|\,h^2$$

(note that hℓ = b − a). In general, if the degree of accuracy of a regular Newton–Cotes rule is m, the error of the corresponding composite rule is bounded by $C(b-a)\max|f^{(m+1)}(c)|\,h^{m+1}$. In particular, the errors of the composite Simpson’s, Boole’s, and Weddle’s rules are bounded by (b − a) times $\mathcal{O}(h^4)$, $\mathcal{O}(h^6)$, and $\mathcal{O}(h^8)$, respectively. These orders of accuracy are usually sufficient for applications. See Table 8.3 for a summary of the above analysis.

Table 8.3: Composite Newton–Cotes rule error bounds ($h = \frac{b-a}{\ell}$, ℓ + 1 equispaced nodes).

composite rule    error bound
Midpoint          $\frac{\max|f^{(2)}(c)|}{24}(b-a)h^2$
Trapezoid         $\frac{\max|f^{(2)}(c)|}{12}(b-a)h^2$
Simpson           $\frac{\max|f^{(4)}(c)|}{90}(b-a)h^4$
Boole             $\frac{8\max|f^{(6)}(c)|}{945}(b-a)h^6$
Weddle            $\frac{9\max|f^{(8)}(c)|}{1400}(b-a)h^8$
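The same counting argument used for the trapezoidal rule gives the midpoint entry of the table. The short derivation below is a sketch that assumes the single-interval midpoint error bound $\frac{h^3}{24}\max|f^{(2)}(c)|$ on an interval of length h (the standard bound, taken here to be the one listed in Table 8.1):

$$\left|Q_{CM}(f) - \int_a^b f(x)\,dx\right| \le \ell\,\frac{h^3}{24}\max|f^{(2)}(c)| = \frac{h^2}{24}(h\ell)\max|f^{(2)}(c)| = \frac{b-a}{24}\max|f^{(2)}(c)|\,h^2 .$$

In both cases one power of h is absorbed by the number of subintervals, since hℓ = b − a, which is why a per-subinterval error of order $h^3$ becomes a total error of order $h^2$.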

Example 62. Approximate $\int_1^{2.2} \ln x\,dx = 0.534606$ by the composite trapezoidal and Simpson’s rules based on 6 subintervals. Here, a = 1, b = 2.2, and the end points of the 6 subintervals are 1, 1.2, 1.4, 1.6, 1.8, 2.0, and 2.2. The composite trapezoidal rule gives

$$\frac{h}{2}\left(f(1) + 2f(1.2) + 2f(1.4) + 2f(1.6) + 2f(1.8) + 2f(2.0) + f(2.2)\right) = 0.532792,$$

and our composite Simpson’s rule leads to

$$\frac{h}{3}\left(f(1) + 4f(1.2) + 2f(1.4) + 4f(1.6) + 2f(1.8) + 4f(2.0) + f(2.2)\right) = 0.534591,$$

which is more accurate than the composite trapezoidal quadrature.

Example 63. Consider $I_f = \int_{-0.6}^{0.6} \frac{1}{1+36x^2}\,dx = 0.4332831588188253$, which regular high-order Newton–Cotes rules have difficulty approximating. We divide [−0.6, 0.6] into ℓ = 192 subintervals of the same length $h = \frac{0.6-(-0.6)}{192} = \frac{1}{160}$, and use the composite trapezoidal, Simpson’s, Boole’s, and Weddle’s rules to approximate $I_f$. Since ℓ = 192 is a multiple of 2, 4, and 6, all these composite rules can be used. The results are summarized in Table 8.4. The correct digits in each result are underlined. Clearly, with a sufficiently small subinterval length h, higher-order composite rules give more accurate approximations.

Table 8.4: Composite Newton–Cotes quadrature for approximating $\int_{-0.6}^{0.6} \frac{1}{1+36x^2}\,dx$.

composite rule    Q(f)
trapezoid         0.4332817156597703
Simpson           0.4332831587192119
Boole             0.4332831588187392
Weddle            0.4332831588188242
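The trapezoid and Simpson values quoted in Examples 62 and 63 can be reproduced with a short Python sketch (the book's listings are in MATLAB). The function names `composite_trapezoid`, `composite_simpson`, and `runge` are assumptions made for this illustration.

```python
import math

def composite_trapezoid(f, a, b, num_subintervals):
    """Composite trapezoidal rule on num_subintervals equal subintervals."""
    h = (b - a) / num_subintervals
    interior = sum(f(a + k * h) for k in range(1, num_subintervals))
    return h / 2 * (f(a) + 2 * interior + f(b))

def composite_simpson(f, a, b, num_subintervals):
    """Composite Simpson's rule; the number of subintervals must be even."""
    if num_subintervals % 2 != 0:
        raise ValueError("Simpson's rule needs an even number of subintervals")
    h = (b - a) / num_subintervals
    half = num_subintervals // 2
    odd_nodes = sum(f(a + (2 * k - 1) * h) for k in range(1, half + 1))
    even_nodes = sum(f(a + 2 * k * h) for k in range(1, half))
    return h / 3 * (f(a) + 4 * odd_nodes + 2 * even_nodes + f(b))

def runge(x):
    """Integrand of Example 63."""
    return 1.0 / (1.0 + 36.0 * x * x)

# Example 62: ln x on [1, 2.2] with 6 subintervals.
trap_62 = composite_trapezoid(math.log, 1.0, 2.2, 6)
simp_62 = composite_simpson(math.log, 1.0, 2.2, 6)

# Example 63: 192 subintervals of [-0.6, 0.6].
trap_63 = composite_trapezoid(runge, -0.6, 0.6, 192)
simp_63 = composite_simpson(runge, -0.6, 0.6, 192)
print(trap_62, simp_62, trap_63, simp_63)
```

Boole's and Weddle's composite rules follow the same pattern with longer weight patterns; they are omitted here to keep the sketch short.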

Finally, we point out that the composite trapezoidal rule works particularly well for numerically integrating periodic functions. In particular, the error of the quadrature decreases exponentially with the number of subintervals.

Theorem 64. Suppose that the real function $f \in C^\infty([a, b])$ with period T = b − a can be extended to the complex plane, that f(z) is analytic in the horizontal strip −μ < Im(z) < μ for some μ > 0, and that |f(z)| ≤ M in this strip. Let $Q_{CT}^{\ell}(f)$ be the composite trapezoidal quadrature of f on [a, b] based on ℓ subintervals of length $h = \frac{b-a}{\ell}$. Then

$$\left|Q_{CT}^{\ell}(f) - \int_a^b f(x)\,dx\right| \le \frac{2TM}{e^{2\pi\mu\ell/T} - 1} \approx 2TM e^{-2\pi\mu\ell/T}.$$

Example 65 (Poisson’s elliptic integral, 1827). Use the composite trapezoidal rule to approximate $I = \frac{1}{2\pi}\int_0^{2\pi} \sqrt{1 - 0.36\sin^2\theta}\,d\theta = 0.9027799277721939$. The results are summarized in Table 8.5. Whenever the number of subintervals ℓ increases by 4, we get roughly 2 more digits of accuracy.

Table 8.5: Composite trapezoidal rule for approximating $\frac{1}{2\pi}\int_0^{2\pi} \sqrt{1 - 0.36\sin^2\theta}\,d\theta$.

subinterval #ℓ    $Q_{CT}^{\ell}(f)$        subinterval #ℓ    $Q_{CT}^{\ell}(f)$
8                 0.9027692569068708    12                0.9027798586495662
16                0.9027799272275734    20                0.9027799277674322
24                0.9027799277721495    28                0.9027799277721936
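The rapid convergence in Table 8.5 can be checked with the following Python sketch. It relies on the fact that, for a periodic integrand over a full period, f(a) = f(b), so the composite trapezoidal rule reduces to h times the sum over ℓ equispaced nodes. The names `trapezoid_periodic` and `integrand` are assumptions made for this illustration.

```python
import math

def trapezoid_periodic(f, a, b, num_subintervals):
    """Composite trapezoidal rule for a (b - a)-periodic integrand.

    Because f(a) = f(b), the two half-weighted end points merge into one node.
    """
    h = (b - a) / num_subintervals
    return h * sum(f(a + k * h) for k in range(num_subintervals))

def integrand(theta):
    """Integrand of Example 65, including the factor 1/(2 pi)."""
    return math.sqrt(1.0 - 0.36 * math.sin(theta) ** 2) / (2.0 * math.pi)

reference = 0.9027799277721939  # value quoted in Example 65
errors = {}
for ell in (8, 12, 16, 20, 24, 28):
    q = trapezoid_periodic(integrand, 0.0, 2.0 * math.pi, ell)
    errors[ell] = abs(q - reference)
    print(ell, q, errors[ell])
```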

8.4 Clenshaw–Curtis quadrature

Quadrature rules based on equispaced nodes seem natural, but one may wonder what can be achieved if nonequispaced nodes are used. In fact, carefully chosen nonequispaced nodes lead to highly efficient and accurate quadrature rules, which converge to the integrals of $f \in C^\infty([a, b])$ exponentially as the number of nodes increases, asymptotically outperforming all previously discussed quadrature rules with errors on the order of $\mathcal{O}(h^p)$ for any integer p > 0. We shall discuss two rules: Clenshaw–Curtis and Gauss.¹

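Clenshaw–Curtis Quadrature. The Clenshaw–Curtis quadrature for $\int_a^b f(x)\,dx$ is

$$Q_{2C}^{n}(f) = \int_a^b p_n(x)\,dx = \sum_{k=0}^{n} w_k f(x_k),$$

where $p_n(x)$ is the Chebyshev interpolant for f(x), and the quadrature nodes $x_k = -\cos\left(\frac{k\pi}{n}\right)\frac{b-a}{2} + \frac{a+b}{2}$ are the Chebyshev points. For an odd n, the weights $w_k = \int_a^b L_k(x)\,dx$ satisfy

$$w_k = \begin{cases} \dfrac{b-a}{2n^2}, & k = 0, n, \\[2mm] \dfrac{b-a}{n}\left\{1 - \displaystyle\sum_{j=1}^{\frac{n-1}{2}} \dfrac{2}{4j^2-1}\cos\left(\dfrac{2kj\pi}{n}\right)\right\}, & 1 \le k \le n-1, \end{cases} \tag{8.3}$$

and for an even n,

$$w_k = \begin{cases} \dfrac{b-a}{2(n^2-1)}, & k = 0, n, \\[2mm] \dfrac{b-a}{n}\left\{1 - \dfrac{(-1)^k}{n^2-1} - \displaystyle\sum_{j=1}^{\frac{n}{2}-1} \dfrac{2}{4j^2-1}\cos\left(\dfrac{2kj\pi}{n}\right)\right\}, & 1 \le k \le n-1. \end{cases} \tag{8.4}$$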

The first few Clenshaw–Curtis quadrature rules are summarized in Table 8.6. The degree of accuracy and upper bound on the errors of these quadrature rules are similar to those of the Newton–Cotes rules (see Theorems 57 and 58). Although Clenshaw–Curtis rules could be used to construct composite quadrature rules, as with Newton–Cotes, this is not the most efficient way to make full use of their approximation power. Instead, we should use a high-order Clenshaw–Curtis rule, whose nodes and weights can be computed very efficiently.

Table 8.6: Low order Clenshaw–Curtis quadrature.

n    Clenshaw–Curtis rule
1    $\frac{b-a}{2}\left[f(a) + f(b)\right]$
2    $\frac{b-a}{6}\left[f(a) + 4f\left(\frac{a+b}{2}\right) + f(b)\right]$
3    $\frac{b-a}{18}\left[f(a) + 8f\left(\frac{3a+b}{4}\right) + 8f\left(\frac{a+3b}{4}\right) + f(b)\right]$
4    $\frac{b-a}{30}\left[f(a) + 8f\left(\frac{2+\sqrt{2}}{4}a + \frac{2-\sqrt{2}}{4}b\right) + 12f\left(\frac{a+b}{2}\right) + 8f\left(\frac{2-\sqrt{2}}{4}a + \frac{2+\sqrt{2}}{4}b\right) + f(b)\right]$
6    $\frac{b-a}{630}\left[9f(a) + 80f\left(\frac{2+\sqrt{3}}{4}a + \frac{2-\sqrt{3}}{4}b\right) + 144f\left(\frac{3a+b}{4}\right) + 164f\left(\frac{a+b}{2}\right) + 144f\left(\frac{a+3b}{4}\right) + 80f\left(\frac{2-\sqrt{3}}{4}a + \frac{2+\sqrt{3}}{4}b\right) + 9f(b)\right]$

1 More details of these quadrature rules can be found in Approximation Theory and Approximation Practice by Lloyd N. Trefethen, SIAM, 2013.

Convergence. Simply speaking, the Clenshaw–Curtis quadrature converges to the integral exponentially with the number of nodes if the integrand f(x) is analytic. The following results are presented for reference purposes only.

Suppose that $f^{(k)}$ (0 ≤ k ≤ m − 1) are absolutely continuous, and $f^{(m+1)}$ is such that $V = \int_{-1}^{1} \left|\frac{f^{(m+1)}(x)}{\sqrt{1-x^2}}\right| dx < \infty$. Let $Q_{2C}^{n}(f)$ be the (n + 1)-point Clenshaw–Curtis quadrature for $I_f = \int_{-1}^{1} f(x)\,dx$. For sufficiently large n,

$$|I_f - Q_{2C}^{n}(f)| \le \frac{32V}{15\pi m(2n+1-m)^m}.$$

In addition, if $f \in C^\infty([-1, 1])$ can be analytically continued to the ellipse $E_\rho = \frac{\rho e^{i\theta} + \rho^{-1}e^{-i\theta}}{2}$ (ρ > 1, θ ∈ [0, 2π)) in the complex plane, and |f(z)| ≤ M inside $E_\rho$, then

$$|I_f - Q_{2C}^{n}(f)| \le \frac{64M\rho}{15(\rho^2-1)\rho^n}.$$

An efficient implementation of the Clenshaw–Curtis quadrature based on the Fast Fourier Transform (FFT) is as follows:

    function Q = clenshawcurtis(fun,a,b,n)
    % (n+1)-pt Clenshaw-Curtis quadrature for integral(f,a,b)
    % Compute Chebyshev points and evaluate f at these points:
    x = -cos(pi*(0:n)'/n)*(b-a)/2+(a+b)/2;
    fx = feval(fun,x)/(2*n);
    % apply Fast Fourier Transform:
    g = real(fft(fx([1:n+1 n:-1:2])));
    % compute Chebyshev coefficients:
    coef = [g(1); g(2:n)+g(2*n:-1:n+2); g(n+1)];
    % compute vector with weights:
    w = zeros(n+1,1);
    w(1:2:end) = 2./(1-(0:2:n).^2);
    % compute result:
    Q = w'*coef*(b-a)/2;

The cost of computing all weights $\{w_k\}$ for the (n + 1)-point Clenshaw–Curtis rule is $\mathcal{O}(n \log n)$, almost linear in n, thanks to the use of the FFT. Also, whenever n doubles, half of the new nodes coincide with the old ones, so that function evaluations at these nodes need not be repeated.

Example 66. Approximate $\int_1^3 \ln x\,dx$ by the Clenshaw–Curtis quadrature using n + 1 = 5 nodes $x_k = -\cos\left(\frac{k\pi}{n}\right)\frac{b-a}{2} + \frac{a+b}{2}$, where a = 1, b = 3, and n = 4. Direct evaluation gives $x_0 = 1$, $x_1 = 2 - \frac{\sqrt{2}}{2}$, $x_2 = 2$, $x_3 = 2 + \frac{\sqrt{2}}{2}$, and $x_4 = 3$. From (8.4), we have $w_0 = w_4 = \frac{1}{15}$, $w_1 = w_3 = \frac{8}{15}$, and $w_2 = \frac{12}{15}$. The Clenshaw–Curtis quadrature is therefore $Q_{2C}^{n}(f) = \sum_{k=0}^{4} w_k f(x_k) = 1.2958988$, with 5 digits of accuracy. The quadrature with n + 1 = 20 nodes is accurate to machine precision.
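The hand computation in Example 66 can be cross-checked by evaluating the weight formulas (8.3) and (8.4) directly. The Python sketch below does this in O(n²) operations and is not the FFT-based algorithm of the MATLAB listing; the function names are assumptions made for this illustration.

```python
import math

def clenshaw_curtis_nodes(n, a, b):
    """Chebyshev points x_k = -cos(k pi / n)(b - a)/2 + (a + b)/2, k = 0..n."""
    return [-math.cos(k * math.pi / n) * (b - a) / 2 + (a + b) / 2
            for k in range(n + 1)]

def clenshaw_curtis_weights(n, a, b):
    """Weights from formula (8.3) for odd n and formula (8.4) for even n."""
    weights = []
    for k in range(n + 1):
        if n % 2 == 1:  # formula (8.3)
            if k in (0, n):
                w = (b - a) / (2 * n * n)
            else:
                s = sum(2.0 / (4 * j * j - 1) * math.cos(2 * k * j * math.pi / n)
                        for j in range(1, (n - 1) // 2 + 1))
                w = (b - a) / n * (1.0 - s)
        else:  # formula (8.4)
            if k in (0, n):
                w = (b - a) / (2 * (n * n - 1))
            else:
                s = sum(2.0 / (4 * j * j - 1) * math.cos(2 * k * j * math.pi / n)
                        for j in range(1, n // 2))
                w = (b - a) / n * (1.0 - (-1) ** k / (n * n - 1) - s)
        weights.append(w)
    return weights

def clenshaw_curtis(f, a, b, n):
    """(n+1)-point Clenshaw-Curtis quadrature as a weighted sum."""
    nodes = clenshaw_curtis_nodes(n, a, b)
    weights = clenshaw_curtis_weights(n, a, b)
    return sum(w * f(x) for w, x in zip(weights, nodes))

w4 = clenshaw_curtis_weights(4, 1.0, 3.0)        # weights of Example 66
q4 = clenshaw_curtis(math.log, 1.0, 3.0, 4)      # quadrature of Example 66
print(w4, q4, 3 * math.log(3) - 2)
```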

8.5 Gauss quadrature

Another type of quadrature that converges exponentially for analytic integrands is the Gauss quadrature. Although Gauss quadrature is much better known than Clenshaw–Curtis, the two are comparable in many respects.

Consider the integral $I_f = \int_a^b f(x)\rho(x)\,dx$, where ρ(x) > 0 is a weight function. A quadrature $Q(f) = \sum_{k=0}^{n} w_k f(x_k)$ (note that it does not evaluate ρ(x) anywhere) is an (n + 1)-node Gauss quadrature if its degree of accuracy is 2n + 1. The nodes and weights of Gauss quadrature can be constructed directly: we set up the nonlinear system of equations $\int_a^b x^\ell \rho(x)\,dx = \sum_{k=0}^{n} w_k x_k^\ell$ for 0 ≤ ℓ ≤ 2n + 1, and solve for all nodes and weights, for example, by Newton’s method. This approach is fine for small n, but not a good choice for large n, because each iteration of Newton’s method needs to solve a linear system of equations involving 2n + 2 unknowns, taking $\mathcal{O}(n^3)$ flops per iteration.

Example 67. Determine the 2-point (n = 1) Gauss quadrature rule for ρ(x) = 1 on [−1, 1] by hand. Let the quadrature rule be $Q = w_0 f(x_0) + w_1 f(x_1)$, which is exact for polynomial integrands of degree ≤ 2n + 1 = 3. Therefore, we have the following equations for the unknowns $w_0$, $w_1$, $x_0$, and $x_1$:

$$\begin{aligned} w_0 + w_1 &= \int_{-1}^{1} 1\,dx = 2, & w_0 x_0 + w_1 x_1 &= \int_{-1}^{1} x\,dx = 0, \\ w_0 x_0^2 + w_1 x_1^2 &= \int_{-1}^{1} x^2\,dx = \frac{2}{3}, & w_0 x_0^3 + w_1 x_1^3 &= \int_{-1}^{1} x^3\,dx = 0. \end{aligned}$$

This system of nonlinear equations seems hard to solve by hand. However, we note that the quadrature rule should be symmetric with respect to the origin; that is, $x_0 = -x_1$ and $w_0 = w_1$. This observation easily leads to $w_0 = w_1 = 1$, $x_0 = -\frac{\sqrt{3}}{3}$, and $x_1 = \frac{\sqrt{3}}{3}$. We can determine the 3-point Gauss quadrature similarly.

Legendre polynomials. We focus on the most common case where ρ(x) ≡ 1. For $f \in C^{n+1}([a, b])$, let us consider the Lagrange interpolation $p_n(x) = \sum_{k=0}^{n} f(x_k)L_k(x)$, such that $f(x) = p_n(x) + \frac{1}{(n+1)!}f^{(n+1)}(c_x)\omega_{n+1}(x)$, where $c_x \in (a, b)$ and $\omega_{n+1}(x) = (x - x_0)(x - x_1)\cdots(x - x_n)$. It follows that

$$\begin{aligned} \int_a^b f(x)\,dx &= \sum_{k=0}^{n} f(x_k)\int_a^b L_k(x)\,dx + \frac{1}{(n+1)!}\int_a^b f^{(n+1)}(c_x)\omega_{n+1}(x)\,dx \\ &= \sum_{k=0}^{n} w_k f(x_k) + \frac{1}{(n+1)!}\int_a^b f^{(n+1)}(c_x)\omega_{n+1}(x)\,dx = Q(f) + R_n, \end{aligned} \tag{8.5}$$

where $R_n = \frac{1}{(n+1)!}\int_a^b f^{(n+1)}(c_x)\omega_{n+1}(x)\,dx$. For the quadrature to have a degree of accuracy 2n + 1, $R_n$ must be zero for all polynomial integrands f of degree ≤ 2n + 1, whose (n + 1)-st derivative is a polynomial of degree ≤ n.

One can show that if the quadrature nodes $\{x_k\}_{k=0}^{n}$ are such that $\omega_{n+1}(x)$ with zeros $\{x_k\}_{k=0}^{n}$ is orthogonal to all polynomials of degree ≤ n under the inner product $(u, w) = \int_a^b u(x)w(x)\,dx$, then the degree of accuracy is 2n + 1. Such a polynomial $\omega_{n+1}(x)$ is the Legendre polynomial of degree n + 1, up to a scaling factor.

Let us again consider the standard interval [−1, 1]. Legendre polynomials are defined as $P_0(x) = 1$, $P_1(x) = x$, and $P_{n+1}(x) = \frac{2n+1}{n+1}xP_n(x) - \frac{n}{n+1}P_{n-1}(x)$ for n ≥ 1. They satisfy $P_n(1) = 1$ and $P_n(-1) = (-1)^n$ for all n, and they are orthogonal in the sense that $\int_{-1}^{1} P_m(x)P_n(x)\,dx = \frac{2}{2n+1}\delta_{mn}$. In addition, $P_n(x)$ has precisely n simple roots in (−1, 1).

Let $\{x_k\}_{k=0}^{n}$ be the roots of $P_{n+1}(x)$, which are the nodes for the (n + 1)-point Gauss quadrature on [−1, 1]. Let $L_k(x)$ be the Lagrange basis associated with $x_k$. It can be shown that the corresponding weight is

$$w_k = \int_{-1}^{1} L_k(x)\,dx = \frac{2(1 - x_k^2)}{\left((n+1)P_n(x_k)\right)^2} > 0.$$
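The recurrence and the weight formula above are enough to compute Gauss nodes and weights numerically. The Python sketch below finds each root of $P_{n+1}$ by Newton's method, using the standard derivative identity $P_m'(x) = m\,(xP_m(x) - P_{m-1}(x))/(x^2 - 1)$ and the cosine initial guess that also appears in the book's MATLAB listing. The function names, tolerance, and iteration cap are assumptions made for this illustration.

```python
import math

def legendre_pair(m, x):
    """Return (P_m(x), P_{m-1}(x)) via the three-term recurrence (m >= 1)."""
    p_prev, p = 0.0, 1.0  # placeholder for P_{-1}, and P_0
    for j in range(m):
        # P_{j+1} = ((2j + 1) x P_j - j P_{j-1}) / (j + 1)
        p_prev, p = p, ((2 * j + 1) * x * p - j * p_prev) / (j + 1)
    return p, p_prev

def gauss_legendre(n, tol=1e-14, max_iter=100):
    """Nodes and weights of the (n+1)-point Gauss quadrature on [-1, 1]."""
    m = n + 1  # the nodes are the roots of P_m
    nodes, weights = [], []
    for i in range(1, m + 1):
        x = -math.cos(math.pi * (i - 0.25) / (m + 0.5))  # initial guess
        for _ in range(max_iter):
            p, p_prev = legendre_pair(m, x)
            dp = m * (x * p - p_prev) / (x * x - 1.0)  # P_m'(x)
            dx = p / dp
            x -= dx  # Newton update
            if abs(dx) < tol:
                break
        _, p_n = legendre_pair(m, x)  # P_n evaluated at the node
        nodes.append(x)
        weights.append(2.0 * (1.0 - x * x) / ((n + 1) * p_n) ** 2)
    return nodes, weights

nodes, weights = gauss_legendre(2)  # 3-point rule
print(nodes, weights)
```

For n = 2 this returns nodes close to −√(3/5), 0, √(3/5) and weights close to 5/9, 8/9, 5/9.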

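It is not difficult to follow the above discussion to derive the first few Legendre polynomials and then compute their roots (quadrature nodes) and weights. The results are summarized in Table 8.7.

Table 8.7: Nodes and weights of low order Gauss quadrature on [−1, 1].

n    nodes $x_i$    weights $w_i$
0    $0$    $2$
1    $-\frac{\sqrt{3}}{3},\ \frac{\sqrt{3}}{3}$    $1,\ 1$
2    $-\sqrt{\frac{3}{5}},\ 0,\ \sqrt{\frac{3}{5}}$    $\frac{5}{9},\ \frac{8}{9},\ \frac{5}{9}$
3    $-\sqrt{\frac{3}{7} + \frac{2}{7}\sqrt{\frac{6}{5}}},\ -\sqrt{\frac{3}{7} - \frac{2}{7}\sqrt{\frac{6}{5}}},\ \sqrt{\frac{3}{7} - \frac{2}{7}\sqrt{\frac{6}{5}}},\ \sqrt{\frac{3}{7} + \frac{2}{7}\sqrt{\frac{6}{5}}}$    $\frac{18-\sqrt{30}}{36},\ \frac{18+\sqrt{30}}{36},\ \frac{18+\sqrt{30}}{36},\ \frac{18-\sqrt{30}}{36}$
4    $-\frac{1}{3}\sqrt{5 + 2\sqrt{\frac{10}{7}}},\ -\frac{1}{3}\sqrt{5 - 2\sqrt{\frac{10}{7}}},\ 0,\ \frac{1}{3}\sqrt{5 - 2\sqrt{\frac{10}{7}}},\ \frac{1}{3}\sqrt{5 + 2\sqrt{\frac{10}{7}}}$    $\frac{322-13\sqrt{70}}{900},\ \frac{322+13\sqrt{70}}{900},\ \frac{128}{225},\ \frac{322+13\sqrt{70}}{900},\ \frac{322-13\sqrt{70}}{900}$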

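In particular, the 1-, 2-, and 3-point Gauss quadrature on [a, b] are, respectively,

$$\begin{aligned} Q_{G1} &= (b-a)f\left(\frac{a+b}{2}\right), \\ Q_{G2} &= \frac{b-a}{2}\left[f\left(\frac{3+\sqrt{3}}{6}a + \frac{3-\sqrt{3}}{6}b\right) + f\left(\frac{3-\sqrt{3}}{6}a + \frac{3+\sqrt{3}}{6}b\right)\right], \\ Q_{G3} &= \frac{b-a}{18}\left[5f\left(\frac{5+\sqrt{15}}{10}a + \frac{5-\sqrt{15}}{10}b\right) + 8f\left(\frac{a+b}{2}\right) + 5f\left(\frac{5-\sqrt{15}}{10}a + \frac{5+\sqrt{15}}{10}b\right)\right]. \end{aligned}$$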
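A quick way to see the degree of accuracy of these rules is to apply them to monomials. The following Python sketch implements the 2- and 3-point rules on [a, b] exactly as stated above; the function names `gauss2` and `gauss3` are assumptions made for this illustration.

```python
import math

def gauss2(f, a, b):
    """2-point Gauss quadrature on [a, b] (degree of accuracy 3)."""
    c = math.sqrt(3.0)
    x_left = (3 + c) / 6 * a + (3 - c) / 6 * b
    x_right = (3 - c) / 6 * a + (3 + c) / 6 * b
    return (b - a) / 2 * (f(x_left) + f(x_right))

def gauss3(f, a, b):
    """3-point Gauss quadrature on [a, b] (degree of accuracy 5)."""
    c = math.sqrt(15.0)
    x_left = (5 + c) / 10 * a + (5 - c) / 10 * b
    x_right = (5 - c) / 10 * a + (5 + c) / 10 * b
    return (b - a) / 18 * (5 * f(x_left) + 8 * f((a + b) / 2) + 5 * f(x_right))

cubic = gauss2(lambda x: x ** 3, 0.0, 2.0)     # exact integral is 4
quintic = gauss3(lambda x: x ** 5, 0.0, 2.0)   # exact integral is 64/6
quartic = gauss2(lambda x: x ** 4, -1.0, 1.0)  # exact integral is 2/5; rule is not exact
print(cubic, quintic, quartic)
```

The 2-point rule integrates the cubic exactly but returns 2/9 rather than 2/5 for $x^4$ on [−1, 1], consistent with a degree of accuracy of 3.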

Computation of nodes and weights for large n. Unlike Clenshaw–Curtis, there are no elementary expressions for the nodes and weights of the Gauss quadrature. Although the values of these quantities can be found exactly for small n, we need to compute them numerically for larger n to make full use of the power of the Gauss quadrature. The mathematics needed for this purpose is advanced and shall not be discussed, but we can still present the major results for the quadrature on [a, b].

Theorem 68. Let

$$T_{n+1} = \begin{pmatrix} 0 & \beta_1 & & & \\ \beta_1 & 0 & \beta_2 & & \\ & \beta_2 & \ddots & \ddots & \\ & & \ddots & 0 & \beta_n \\ & & & \beta_n & 0 \end{pmatrix} \quad \text{with} \quad \beta_k = \frac{1}{2\sqrt{1 - (2k)^{-2}}}.$$

Let $T_{n+1} = VDV^T$ be an eigenvalue decomposition, where $V = [v_1, \ldots, v_{n+1}]$ and $D = \mathrm{diag}(\lambda_1, \ldots, \lambda_{n+1})$. Then the (n + 1)-point Gauss quadrature nodes are $x_k = \lambda_{k+1}\frac{b-a}{2} + \frac{a+b}{2}$, and the weights are $w_k = (b-a)\left(e_1^T v_{k+1}\right)^2$ (0 ≤ k ≤ n).
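As a sanity check, Theorem 68 can be worked out by hand for n = 1; the following is a sketch of that computation. Here $\beta_1 = \frac{1}{2\sqrt{1 - 1/4}} = \frac{1}{\sqrt{3}}$, so

$$T_2 = \begin{pmatrix} 0 & \frac{1}{\sqrt{3}} \\ \frac{1}{\sqrt{3}} & 0 \end{pmatrix}, \qquad \lambda_{1,2} = \mp\frac{1}{\sqrt{3}}, \qquad v_{1,2} = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ \mp 1 \end{pmatrix}.$$

The theorem then gives

$$x_{0,1} = \mp\frac{\sqrt{3}}{3}\cdot\frac{b-a}{2} + \frac{a+b}{2}, \qquad w_0 = w_1 = (b-a)\left(\frac{1}{\sqrt{2}}\right)^2 = \frac{b-a}{2},$$

which agrees with the 2-point Gauss rule: on [−1, 1] the nodes are $\mp\frac{\sqrt{3}}{3}$ and both weights equal 1.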

Since the computation of all eigenpairs of the real symmetric matrix $T_{n+1}$ by the QR iteration takes $\mathcal{O}(n^2)$ flops, this is an upper bound on the cost of computing the nodes and weights for Gauss quadrature. In recent years, new algorithms have been proposed to compute these quantities in $\mathcal{O}(n)$ flops, making Gauss quadrature essentially as efficient as Clenshaw–Curtis. Nevertheless, unlike Clenshaw–Curtis, whenever n changes, all nodes change for Gauss quadrature, so that the integrand must be evaluated at all the new nodes.

Convergence. Similar to Clenshaw–Curtis, Gauss quadrature also exhibits exponential convergence to the integral for an analytic integrand f(x). The following results are also presented for reference purposes.

Assume that $f^{(k)}$ (0 ≤ k ≤ m − 1) are absolutely continuous, and $f^{(m)}$ is of bounded variation with $V = \int_{-1}^{1} |f^{(m+1)}(x)|\,dx < \infty$. Let $Q_{G}^{n}(f)$ be the (n + 1)-point Gauss quadrature for $I_f = \int_{-1}^{1} f(x)\,dx$. For all $n \ge \frac{m}{2}$,

$$|I_f - Q_{G}^{n}(f)| \le \frac{32V}{15\pi m(2n+1-m)^m}.$$

In addition, if $f \in C^\infty([-1, 1])$ can be analytically continued to the ellipse $E_\rho = \frac{\rho e^{i\theta} + \rho^{-1}e^{-i\theta}}{2}$ (ρ > 1, θ ∈ [0, 2π)) in the complex plane, and |f(z)| ≤ M inside $E_\rho$, then

$$|I_f - Q_{G}^{n}(f)| \le \frac{64M}{15(\rho^2 - 1)\rho^{2n}}.$$

In other words, Gauss and Clenshaw–Curtis quadrature exhibit similar convergence behavior for integrands with limited regularity, and Clenshaw–Curtis may need up to twice as many nodes as Gauss to achieve the same accuracy for an analytic integrand. In practice, the difference in convergence rate between the two quadratures is even smaller, especially for a relatively small number of nodes.

We have a MATLAB code of the Gauss quadrature based on Theorem 68:

    function Q = gausslg(fun,a,b,n)
    % (n+1)-pt Gauss-Legendre quadrature for integral(fun,a,b)
    m = (n+2)/2;
    xg = zeros(n+1,1);
    wg = zeros(n+1,1);
    for i = 1:m
        z = cos(pi*(i-0.25)/(n+1+0.5));
        while true
            p1 = 1; p2 = 0;
            for j = 1:n+1
                p3 = p2; p2 = p1;
                p1 = ((2*j-1)*z*p2-(j-1)*p3)/j;
            end
            pp = (n+1)*(z*p1-p2)/(z*z-1);
            z1 = z;
            z = z1-p1/pp;
            if abs (z - z1 ) 0,

y(0) = a.

This problem is an exercise, so ignore the fact that you could solve this ODE in closed form. Your task is to best fit (by minimizing the least squares error) the α and β to the data. Note that numerically, it may be best to shift time by subtracting 1790 from each t point. A very rough guess of α = 0.0001 and β = 200 will be sufficient for the optimization routine to converge. Use your newly found “optimal” ODE to predict the population until 1990. The actual data is as follows:

year    population
1950    150.697
1960    179.323
1970    203.185
1980    226.546
1990    248.710

Comment on the ability of this “optimal” solution to predict the future.

10 Partial differential equations

Many equations of interest involve derivatives in more than one variable, for example, both space and time. Examples of such equations are those that govern the motion of waves (wave equation), the evolution of heat dispersion (heat equation), the evolution of fluid flow (Navier–Stokes equations), electromagnetism (Maxwell’s equations), and so on. To give a very basic introduction to this huge field of mathematics, we will consider the 1D heat equation and 2D Poisson equation.

10.1 The 1D heat equation

The following set of equations describes how heat is distributed on a thin, insulated rod, over space and time. For simplicity, assume the rod has length 1 and has ends at x = 0 and x = 1. Heat is a function of space and time, so heat = u(x, t). We assume the initial heat distribution at time t = 0 is known to be $u_{init}(x)$. Further, we assume the temperatures at the ends of the rod are fixed to be $u_L$ and $u_R$. The equations read:

$$\begin{aligned} u_t(x, t) - c\,u_{xx}(x, t) &= 0 && \text{on } (0, T] \times (0, 1), \\ u(x, 0) &= u_{init}(x) && \text{on } [0, 1], \\ u(0, t) &= u_L && \text{on } 0 < t \le T, \\ u(1, t) &= u_R && \text{on } 0 < t \le T. \end{aligned}$$

In this system, c > 0 is a material constant that depends on how fast heat disperses in that medium. In practice, for a given material, we can look up c in a table. Since one can rescale the problem by the change of variables $\tilde{x} = x/\sqrt{c}$ (which also rescales the length of the rod), it is mathematically fine to consider only the case c = 1. Thus, we have the evolution equation reduced to $u_t(x, t) - u_{xx}(x, t) = 0$.

Note that this PDE is equipped with both an initial condition and boundary values; such equations are often referred to as initial-boundary value problems. We remark that it would only take small changes in the discussion and code to include time dependent boundary conditions (see exercises).

Our approach to solving the system above will be to use finite differences in both space and time, in order to turn this PDE into a sequence of algebraic equations. The domain for our problem is two-dimensional: [0, T] × [0, 1], one dimension for time and another for space. Just as in initial value problems, we need to discretize time into a finite number (M + 1) of points: $\{0 = t_0, t_1, t_2, \ldots, t_M = T\}$.

https://doi.org/10.1515/9783110573329-010

For simplicity, assume the t-points in time are equally spaced, and denote the point spacing by Δt. Similarly, just as in boundary value problems, we need to discretize space into a finite number (N + 1) of points: $\{0 = x_0, x_1, x_2, \ldots, x_N = 1\}$. We also assume that the x-points in space are equally spaced, and denote the point spacing by Δx. Our goal will now be to approximate the heat distribution u(x, t) at each $t_m$ and $x_n$. That is, we want to know $u(x_n, t_m)$ for n = 0, 1, . . . , N, and for m = 0, 1, . . . , M. Denote our approximation by $u(x_n, t_m) \approx u_n^m$. The subscript n tells us the x-coordinate is $x_n$ and the superscript m tells us the t-coordinate is $t_m$.

Figure 10.1: Finite difference grid for space and time.

Note that our discretization of time and space has produced a grid, which can be seen in Figure 10.1 for a discretization of [0, 0.4] × [0, 1] with M = 6 and N = 11. Thus, we can also think of our goal as being to approximate the solution at each of the grid intersection points (i.e., ‘nodes’). Due to the initial condition, we already know the values for $\{u_0^0, u_1^0, u_2^0, \ldots, u_N^0\}$ (the bottom row), which are $u_j^0 = u_{init}(x_j)$. Due to the boundary conditions, we also know the values for all times at $x_0$ and $x_N$ (i.e., the left and right sides); thus,

$$u_0^j = u_L \quad \text{for } j = 1, 2, 3, \ldots, M, \qquad u_N^j = u_R \quad \text{for } j = 1, 2, 3, \ldots, M.$$


Figure 10.2: The finite difference grid. Blue x’s denote nodes where the values are known due to initial and boundary conditions.

Figure 10.2 shows (with blue x’s) the nodes at which we now know the values of $u_n^m$ (the bottom, and the right and left sides). We will now derive a set of discrete equations to fill in the grid. Consider a backward difference in time for u at the node point $(x_n, t_m)$:

$$u_t(x_n, t_m) \approx \frac{u(x_n, t_m) - u(x_n, t_{m-1})}{\Delta t}.$$

An approximation of the second spatial derivative is given by

$$u_{xx}(x_n, t_m) \approx \frac{u(x_{n+1}, t_m) - 2u(x_n, t_m) + u(x_{n-1}, t_m)}{\Delta x^2}.$$

Substituting these approximations into our equation gives

$$\frac{u(x_n, t_m) - u(x_n, t_{m-1})}{\Delta t} - \frac{u(x_{n+1}, t_m) - 2u(x_n, t_m) + u(x_{n-1}, t_m)}{\Delta x^2} \approx 0,$$

and now we do the standard ‘trick’ of replacing the approximation sign with an equals sign, and replacing the true values $u(x_n, t_m)$ with approximations $u_n^m$. This gives the equation

$$\frac{u_n^m - u_n^{m-1}}{\Delta t} - \frac{u_{n+1}^m - 2u_n^m + u_{n-1}^m}{\Delta x^2} = 0, \tag{10.1}$$

for each m = 1, 2, . . . , and n = 1, 2, . . . , N − 1. Note that $u_n^{m-1}$ is known, since it is obtained from the previous time step (or initial condition). So we write the equation as

$$u_n^m - \frac{\Delta t}{\Delta x^2}\left(u_{n+1}^m - 2u_n^m + u_{n-1}^m\right) = u_n^{m-1}.$$

Thus, at each time step, we have to solve an equation that takes the form of the boundary value problems from a previous chapter on finite differences. Our approach is to solve the equation(s) (i.e., “fill in” the grid) going row by row, from bottom to top. Thus, consider trying to first find solutions at $t = t_1$. Our first set of unknowns will be $\{u_1^1, u_2^1, u_3^1, \ldots, u_{N-1}^1\}$, which are approximations to u(x, t) at the red x’s shown in Figure 10.3.

Figure 10.3: The finite difference grid. Blue x’s denote nodes where the values are known due to initial and boundary conditions, and red x’s denote the unknowns for step 1 (i. e., the first time step).

Our discrete equations for m = 1 take the form

$$u_n^1 - \frac{\Delta t}{\Delta x^2}\left(u_{n+1}^1 - 2u_n^1 + u_{n-1}^1\right) = u_n^0.$$

Grouping like terms, we have, for n = 1, 2, . . . , N − 1,

$$-\frac{\Delta t}{\Delta x^2}u_{n+1}^1 + \left(1 + 2\frac{\Delta t}{\Delta x^2}\right)u_n^1 - \frac{\Delta t}{\Delta x^2}u_{n-1}^1 = u_n^0,$$

along with known boundary values $u_0^1 = u_L$ and $u_N^1 = u_R$. Thus, we have N − 1 linear equations in the N − 1 unknowns $\{u_1^1, u_2^1, u_3^1, \ldots, u_{N-1}^1\}$, and we can perform a linear solve to get their values. After we find the values $\{u_1^1, u_2^1, u_3^1, \ldots, u_{N-1}^1\}$, we move to time step 2. Now the unknowns will be $\{u_1^2, u_2^2, u_3^2, \ldots, u_{N-1}^2\}$, and the same procedure as in step 1 leads us to the N − 1 linear equations

$$-\frac{\Delta t}{\Delta x^2}u_{n+1}^2 + \left(1 + 2\frac{\Delta t}{\Delta x^2}\right)u_n^2 - \frac{\Delta t}{\Delta x^2}u_{n-1}^2 = u_n^1.$$
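The time-stepping procedure described above can be sketched in Python as follows. Because the system at each step is tridiagonal, this sketch solves it with the Thomas algorithm (tridiagonal Gaussian elimination) rather than a general sparse solver; that choice, along with the names `solve_tridiagonal` and `heat1d_implicit`, is an assumption made for this illustration. The usage lines apply it to a rod on [0, 1] with zero initial temperature, left end held at 1, and right end held at 0.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; lower[0] and upper[-1] are ignored."""
    n = len(rhs)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def heat1d_implicit(a, b, T, uL, uR, u_init, N, M):
    """Backward Euler scheme (10.1) on [a, b] x [0, T]; returns (x, rows),
    where rows[m][n] approximates u(x_n, t_m)."""
    dx = (b - a) / N
    dt = T / M
    r = dt / dx ** 2
    x = [a + n * dx for n in range(N + 1)]
    u = [u_init(xn) for xn in x]
    rows = [u[:]]
    lower = [-r] * (N - 1)
    diag = [1.0 + 2.0 * r] * (N - 1)
    upper = [-r] * (N - 1)
    for _ in range(M):
        rhs = u[1:N]          # copy of the interior values at the previous step
        rhs[0] += r * uL      # boundary values enter the first and last equations
        rhs[-1] += r * uR
        interior = solve_tridiagonal(lower, diag, upper, rhs)
        u = [uL] + interior + [uR]
        rows.append(u)
    return x, rows

x, rows = heat1d_implicit(0.0, 1.0, 0.5, 1.0, 0.0, lambda s: 0.0, 20, 50)
final = rows[-1]
print(final[10])  # close to the steady-state value 1 - 0.5 at x = 0.5
```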

Once time step 2 is finished by doing a linear solve, we move to time step 3, and so on until we are finished. Note that the matrix is exactly the same for step 2 as for step 1 (and for all other steps), but the right-hand side changes. The MATLAB code for a general interval [a, b], end time T, initial condition $u_{init}(x)$, and boundary values $u_L$ and $u_R$ is as follows:

    function [U,x,t] = heat1d(a,b,T,uL,uR,uInit,N,M)
    % solve the heat equation on [a,b] x [0,T],
    % with boundary conditions
    % u_L (at a) and u_R (at b), with initial condition uInit
    % Discretize with N+1 points in space, and M+1 in time
    x = linspace(a,b,N+1);
    dx = x(2) - x(1);
    t = linspace(0,T,M+1);
    dt = t(2) - t(1);
    % initialize solution matrix
    U = zeros(N+1,M+1);
    % Insert initial condition in first column of matrix
    U(:,1) = uInit(x');
    % Now we march through time. Note the matrix is the same at
    % each time step, so we can build it first.
    e = ones(N-1,1);
    r = dt/(dx^2);
    A = spdiags([-r*e (1+2*r)*e -r*e], -1:1, N-1, N-1);
    for m=2:M+1
        rhs = U(2:end-1,m-1);
        % boundary conditions affect first and last equation
        rhs(1) = rhs(1) + r*uL;
        rhs(end) = rhs(end) + r*uR;
        % solve the linear system for the values at time step m
        U(2:end-1,m) = A \ rhs;
        % add in the boundary values at time m
        U(1,m) = uL;
        U(N+1,m) = uR;
    end

Example 80. Solve the heat equation on [−1, 2] × [0, 0.2] with boundary values $u_L = u_R = 0$ at all times, and initial condition

$$u_{init}(x) = \begin{cases} 1 & 0.4 \le x \le 0.6, \\ 0 & \text{otherwise.} \end{cases}$$

First, we create a function for the initial condition and call the solver:

    >> uinit = @(x) (x>=.4) .* (x<=.6);
    >> [u,x,t] = heat1d(-1,2,.2,0,0,uinit,100,50);

Plots of the solution at various times are shown below. It is evident from the scale that the heat distribution is shrinking in a smooth and symmetric fashion. The plots are made with the commands

    figure
    plot(x,u(:,11),'k-','LineWidth',2);
    title(['t=' num2str(t(11))],'FontSize',20)
    xlabel('x','FontSize',20)
    ylabel('U','FontSize',20)
    set(gca,'FontSize',14)

We can also make a plot of the entire solution with a “waterfall” plot. The syntax to do this is below, followed by the plot:

    % make waterfall plot to view solution all at once
    [X,T] = meshgrid(x,t);
    waterfall(X,T,u')
    colormap jet
    xlabel('x','FontSize',20)
    ylabel('t','FontSize',20)
    zlabel('U(x,t)','FontSize',20)
    view(-136,16)
    set(gca,'FontSize',14)

Example 81. Consider another physical problem involving heat. Suppose you have a thin insulated rod with length 1 (i.e., on the interval [0, 1]), that begins with a uniform temperature of 0. For t > 0, the left end of the rod (at x = 0) touches an object having temperature 1, and the right end of the rod has fixed temperature 0. What will be the heat distribution on the rod as time moves forward to T = 0.5?

This problem is very similar to the previous one. The interval in space is [0, 1] and the interval in time is [0, 0.5]. The initial condition for the rod is $u_{init} = 0$, and the boundary conditions are u(0, t) = 1 and u(1, t) = 0. Thus, we can use our heat1d code from above, calling it via

    >> uinit = @(x) 0*x;
    >> [u,x,t] = heat1d(0,1,.5,1,0,uinit,100,50);

The plot below shows the solutions with time, which have converged to a “steady state” solution (one that no longer changes in time) by T = 0.5.

The numerical solution appears to converge in time to the line from (1, 0) to (0, 1), suggesting the steady state solution is u(x) = 1 − x. This line is a steady solution of the heat equation, since it satisfies $u_t - u_{xx} = 0$ and satisfies the boundary conditions.

10.1.1 A maximum principle and numerical stability

From your own intuition and from the examples above with the heat equation, it should be easy to believe that the physics of temperature evolution dictate that the maximum temperature achieved during the thin rod experiments must occur at the boundary or in the initial condition. In other words, temperature cannot grow on its own (away from the boundaries) as time progresses. Heat wants to disperse, and find an equilibrium solution that satisfies the boundary conditions. Mathematically speaking, this is a maximum principle for the heat equation and can be expressed as follows. Let

$$R = [0, 1] \times [0, T], \qquad \Gamma = \{(x, t) \in R : x = 0 \text{ or } x = 1 \text{ or } t = 0\}.$$

In other words, R is the entire domain (rectangle) with the boundaries, and Γ is just the bottom, left, and right sides. Then

$$\sup_{R} u(x, t) = \sup_{\Gamma} u(x, t),$$

which says precisely that if a maximum occurs in the interior of the rectangle, then that maximum must also occur on the boundary. This means that the solution cannot grow in the interior beyond the boundary and initial conditions. We leave the proof of this result to a student’s first PDE course.

It is very important that this maximum principle property also holds for numerical solutions. If it does not hold, the computed temperature can grow nonphysically in the interior of the domain, and so the solution is nonphysical and, therefore, cannot be believed (often, bad things that can happen numerically, do happen). Moreover, something bad is happening with the discretization that is allowing nonphysical growth to take place, even though such growth cannot happen in the PDE. As we saw with IVPs, unnatural growth causes numerical instability, and so whether a numerical method for the heat equation is stable depends on whether it preserves the maximum principle.

The numerical scheme (10.1), which assumes a uniform grid in space and in time, does preserve this maximum principle. To see this, suppose a maximum exists in the interior at $u_n^m$. Then $u_{n+1}^m - 2u_n^m + u_{n-1}^m \le 0$, and we consider the cases of strict inequality and of equality separately. If $u_{n+1}^m - 2u_n^m + u_{n-1}^m < 0$, then

$$\frac{u_n^m - u_n^{m-1}}{\Delta t} = \frac{u_{n+1}^m - 2u_n^m + u_{n-1}^m}{\Delta x^2} < 0,$$

which implies that $u_n^m < u_n^{m-1}$, a contradiction. Thus, this case is not possible. If $u_{n+1}^m - 2u_n^m + u_{n-1}^m = 0$, this implies $u_n^m = u_n^{m-1}$. Also, since $u_n^m$ is assumed to be a maximum, we have that the three terms at time level m are equal. Thus, we have that

$$u_{n+1}^m = u_n^m = u_{n-1}^m = u_n^{m-1}.$$

We have now shown that the maximum is not unique, and must also occur at the point below $u_n^m$, that is, at $u_n^{m-1}$. In particular, note we can repeat this argument to show that the maximum must also occur at $u_n^0$ and at $u_0^m$. This means the maximum also occurs at the boundary and initial condition, which in turn means that the numerical solution cannot grow beyond initial and boundary data. We have thus established that for (10.1),

$$\max_{n,m} u_n^m = \max\left(\max_n u_n^0,\ \max_m u_0^m,\ \max_m u_N^m\right).$$

Interestingly, a maximum principle does not necessarily hold for the explicit counterpart of (10.1). That is, if instead of applying a backward Euler temporal discretization, we apply forward Euler, we obtain the scheme

$$\frac{u_n^{m+1} - u_n^m}{\Delta t} - \frac{u_{n+1}^m - 2u_n^m + u_{n-1}^m}{\Delta x^2} = 0. \tag{10.2}$$

This scheme is much easier to compute than (10.1), as there is no linear system to solve at each time step. Instead, one can just use known values from the previous time step to calculate $u_n^{m+1}$. There is a time step size restriction for this method to achieve a maximum principle, which takes the form $\Delta t \le \mathcal{O}(\Delta x^2)$. We omit the details, but note that even with an initial condition that is zero except at one node, where it equals ε (i.e., machine epsilon ≈ 2.2e−16), using an excessively large Δt would lead the numerical solution to grow arbitrarily large, exponentially fast. Testing this method is left as an exercise.
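A minimal Python sketch of such a test is given below. It runs scheme (10.2) with zero boundary values and an initial condition that is zero except for machine epsilon at the middle node, and reports the largest value at the final step. The classical stability condition for this scheme is $\Delta t/\Delta x^2 \le 1/2$; the function name `explicit_heat_max` and the grid sizes are assumptions made for this illustration.

```python
import sys

def explicit_heat_max(r, num_intervals=20, num_steps=200):
    """Run forward Euler scheme (10.2) with r = dt/dx^2 and zero boundary values.

    The initial condition is zero except for machine epsilon at the middle node.
    Returns the largest |u| at the final time step.
    """
    eps = sys.float_info.epsilon
    u = [0.0] * (num_intervals + 1)
    u[num_intervals // 2] = eps
    for _ in range(num_steps):
        new_u = u[:]  # boundary entries stay at zero
        for n in range(1, num_intervals):
            new_u[n] = u[n] + r * (u[n + 1] - 2.0 * u[n] + u[n - 1])
        u = new_u
    return max(abs(v) for v in u)

stable = explicit_heat_max(0.4)    # dt/dx^2 below 1/2: stays bounded by epsilon
unstable = explicit_heat_max(1.0)  # dt/dx^2 above 1/2: grows by many orders of magnitude
print(stable, unstable)
```

With r = 0.4 each new value is a weighted average of old values with nonnegative weights, so the maximum cannot grow. With r = 1 the weight on $u_n^m$ is negative and the highest-frequency component is amplified at every step.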

168 | 10 Partial differential equations

10.2 The 2D Poisson equation We consider now the 2D Poisson equation, which is a boundary value problem in two dimensions: −Δu(x, y) = f (x, y) in Ω, u = g(x, y) on 𝜕Ω, where f and g are given functions, Ω ⊂ ℝ2 is an open, connected domain, and 𝜕Ω denotes the boundary of Ω. For simplicity, we will consider Ω = (0, 1)2 . Recall that Δ denotes the Laplacian, which is defined to be the sum of the second partials, that is, Δ = 𝜕xx + 𝜕yy 󳨐⇒ Δu = uxx + uyy . The Poisson problem has a physical meaning: The solution u can be thought of as the displacement of a membrane that has fixed position at the boundary (given by g) and has a force f applied to it. A simple example is blowing air against a thin soap film (“blowing bubbles”). The domain would be the circle at the end of the wand, the edges of the bubble are fixed to the wand at locations g(x, y), and the forcing f (x, y) of air against the membrane deforms/displaces the membrane. To solve the 2D Poisson equation, we can use a similar approach as for the 1D boundary value problem. However, now we must apply finite difference approximations in both the x and y directions. The first step in creating a discrete algebraic system is, again, to discretize the domain. For simplicity, consider an equal point spacing h in both the x and y directions, using N points in each direction, creating a uniform grid, as shown in Figure 10.4. The unknowns will be at the grid intersection points (called nodes). Thus, there will be N unknowns in each row and column, for a total of N 2 unknowns. We label the nodes and unknowns from left to right, top to bottom. The unknown corresponding to the node (xn , ym ) is U(m−1)N+n . We will now proceed to create N 2 linear equations in the unknowns Uj , j = 1, 2, . . . , N 2 . For the points on the boundary, the equations are simple: if node j is a boundary node at coordinates (xn , ym ) (note j = (m − 1)N + n), then the equation is Uj = g(xn , ym ). The nodes not on the boundary are interior ones. 
At the interior nodes, applying the centered finite difference approximation to the second partial derivatives, we have

$$
\begin{aligned}
u_{xx}(x_n, y_m) + u_{yy}(x_n, y_m)
&\approx \frac{u(x_{n+1}, y_m) - 2u(x_n, y_m) + u(x_{n-1}, y_m)}{h^2} + \frac{u(x_n, y_{m+1}) - 2u(x_n, y_m) + u(x_n, y_{m-1})}{h^2} \\
&= \frac{u(x_n, y_{m+1}) + u(x_{n+1}, y_m) - 4u(x_n, y_m) + u(x_{n-1}, y_m) + u(x_n, y_{m-1})}{h^2}.
\end{aligned}
$$


Figure 10.4: Node numbering of the 2D grid.
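The accuracy of the five-point approximation of $u_{xx} + u_{yy}$ can be checked numerically before it is used in a solver. The following is a minimal sketch, assuming an arbitrarily chosen smooth test function $u(x, y) = e^x \sin(2y)$ and evaluation point $(0.3, 0.6)$; neither comes from the text. It evaluates the approximation for several values of $h$ and estimates the rate from successive errors.

```matlab
% Apply the 5-point approximation of the Laplacian to a smooth test
% function at one point and watch the error as h is halved.
% The test function and the point (x0, y0) are arbitrary choices.
u   = @(x, y) exp(x).*sin(2*y);
lap = @(x, y) -3*exp(x).*sin(2*y);     % exact u_xx + u_yy for this u
x0 = 0.3;
y0 = 0.6;

hs  = 1 ./ [4 8 16 32];
err = zeros(size(hs));
for k = 1:numel(hs)
    h = hs(k);
    approx = (u(x0, y0+h) + u(x0+h, y0) - 4*u(x0, y0) ...
              + u(x0-h, y0) + u(x0, y0-h)) / h^2;
    err(k) = abs(approx - lap(x0, y0));
end

% estimated order: log2 of the ratio of successive errors
rates = log2(err(1:end-1) ./ err(2:end));
disp([hs', err', [0; rates']])
```

Since the centered difference for each second derivative has an $O(h^2)$ error, the estimated rates should be close to 2 for a smooth function such as this one.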

To put this approximation in terms of the grid points, denote by $U_j$ the unknown corresponding to the node at $(x_n, y_m)$ (note that $j = (m-1)N + n$). The approximation then becomes

$$u_{xx}(x_n, y_m) + u_{yy}(x_n, y_m) \approx \frac{U_{j+N} + U_{j+1} - 4U_j + U_{j-1} + U_{j-N}}{h^2}.$$

Thus, at an interior node $j$, the equation involves 5 nodes: $j$, $j+1$, $j-1$, $j+N$, and $j-N$ (the node $j$, the ones left and right of it, and the ones above and below). These are shown in red in the grid plot in Figure 10.4. The approximation equation for $-\Delta u(x_n, y_m) = f(x_n, y_m)$ is thus

$$-U_{j+N} - U_{j+1} + 4U_j - U_{j-1} - U_{j-N} = h^2 f(x_n, y_m).$$

It is now clear that our discretization will yield an $N^2 \times N^2$ linear system, which when solved will produce $U_1, U_2, \ldots, U_{N^2}$. To explicitly build the matrix, we will need to loop over the nodes from 1 to $N^2$, and there will be 2 cases:

Case 1: If node $j$ is on the boundary, $U_j = g(x_n, y_m)$.
Case 2: If node $j$ is not on the boundary, $-U_{j+N} - U_{j+1} + 4U_j - U_{j-1} - U_{j-N} = h^2 f(x_n, y_m)$.

The code is given below:

    function [x,y,U] = poisson2d(f,g,N)
    % solve poisson equation on unit square with
    % Dirichlet boundary u = g, and forcing f

    % discretize domain
    x = linspace(0,1,N);
    y = linspace(0,1,N);
    h = x(2) - x(1);

    % initialize matrix and rhs
    A = zeros(N^2,N^2);
    b = zeros(N^2,1);

    % build matrix and rhs, node by node
    for m=1:N
        for n=1:N
            j = (m-1)*N+n;
            if (m==1 || m==N || n==1 || n==N)
                % on bdry
                A(j,j) = 1;
                b(j) = g(x(n),y(m));
            else
                % interior
                A(j,j-N) = -1;
                A(j,j-1) = -1;
                A(j,j) = 4;
                A(j,j+1) = -1;
                A(j,j+N) = -1;
                b(j) = h^2 * f(x(n),y(m));
            end
        end
    end

    U = A \ b;

    % make vector solution into matrix to correspond to solution
    % in 2d domain
    UU = reshape(U,N,N)';
    [X,Y] = meshgrid(x,y);
    surf(X,Y,UU)

Example 82. Solve the 2D Poisson problem, with the right-hand side function $f(x, y) = \sin(x) - 6y$ and Dirichlet boundary condition $g = \sin(x) + y^3$, on the domain $\Omega = (0, 1)^2$. Use $h = 1/4$, $1/8$, $1/16$, and $1/32$, and calculate convergence rates to the true solution $u_{true}(x, y) = \sin(x) + y^3$.


We wrote the following script to accomplish this:

    % Define f and g and utrue
    f = @(x,y) sin(x) - 6*y;
    g = @(x,y) sin(x) + y.^3;
    utrue = @(x,y) sin(x) + y.^3;

    % solve the problem on varying meshes
    for i = 2:5
        h = 1/(2^i);
        N = round(1/h)+1;
        [x,y,U] = poisson2d(f,g,N);

        % calculate error of this approximate solution: do this by
        % comparing answer at each grid point to true solution, then
        % taking the max - easiest to do the calculation as
        % a vector instead of matrix
        UU = reshape(U,N,N)';
        [X,Y] = meshgrid(x,y);
        UTrue = utrue(X,Y);
        Linferr(i-1) = max(max(abs(UU - UTrue)));
    end

    % print out errors / rates
    [ Linferr', [0, log(Linferr(1:end-1)./Linferr(2:end))/log(2)]' ]

This produced the output

    ans =
       1.7278e-04            0
       4.7007e-05   1.8780e+00
       1.1856e-05   1.9872e+00
       2.9708e-06   1.9968e+00

Thus, we observe second order convergence (each halving of $h$ reduces the error by a factor of about 4, giving rates near 2), which is expected since the finite difference approximation used to create the method was second order. We can plot the solution using the command “surf,” by setting up the plotting format via meshgrid:

    [x,y,U] = poisson2d(f,g,N);
    UU = reshape(U,N,N)';
    [X,Y] = meshgrid(x,y);
    surf(X,Y,UU)

This will produce the output as seen in Figure 10.5.


Figure 10.5: Solution of the 2D Poisson problem.
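The matrix built in poisson2d is stored densely, so it holds $N^4$ entries even though each row has at most 5 nonzeros. One possible alternative, sketched here as an assumption rather than as the text's method, is to collect (row, column, value) triplets in the same node loop and pass them to MATLAB's built-in `sparse`. The script form, the array names `I`, `Jc`, `V`, and the choice $N = 17$ (that is, $h = 1/16$) are illustrative only; the data $f$ and $g$ are those of Example 82.

```matlab
% Sparse assembly of the same linear system as in poisson2d (a sketch).
% I, Jc, V are illustrative names for the triplet arrays.
f = @(x, y) sin(x) - 6*y;
g = @(x, y) sin(x) + y.^3;            % also the true solution in Example 82
N = 17;                               % h = 1/16

x = linspace(0, 1, N);
y = linspace(0, 1, N);
h = x(2) - x(1);

% preallocate for at most 5 nonzeros per row
I  = zeros(5*N^2, 1);
Jc = zeros(5*N^2, 1);
V  = zeros(5*N^2, 1);
b  = zeros(N^2, 1);
k  = 0;                               % number of triplets stored so far

for m = 1:N
    for n = 1:N
        j = (m-1)*N + n;
        if (m == 1 || m == N || n == 1 || n == N)
            % boundary node: U_j = g(x_n, y_m)
            k = k + 1;
            I(k) = j;  Jc(k) = j;  V(k) = 1;
            b(j) = g(x(n), y(m));
        else
            % interior node: five-point stencil
            I(k+1:k+5)  = j;
            Jc(k+1:k+5) = [j-N, j-1, j, j+1, j+N];
            V(k+1:k+5)  = [-1, -1, 4, -1, -1];
            k = k + 5;
            b(j) = h^2 * f(x(n), y(m));
        end
    end
end

A = sparse(I(1:k), Jc(1:k), V(1:k), N^2, N^2);
U = A \ b;

% compare with the true solution on the grid
UU = reshape(U, N, N)';
[X, Y] = meshgrid(x, y);
err = max(max(abs(UU - g(X, Y))));
fprintf('N = %d, nonzeros = %d, max error = %.4e\n', N, nnz(A), err);
```

Because the linear system is the same as the one assembled in poisson2d, the computed error should agree with the dense version up to roundoff, while the storage grows like $5N^2$ rather than $N^4$.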

10.3 Exercises

1. Consider again the 1D heat equation from Example 81, but with a slight twist. Suppose now the rod is completely insulated except at $x = 0$, but again assume the initial temperature is 0 on the entire rod, and for $t > 0$ a heat source with temperature 1 is placed against the rod at $x = 0$. What is the steady state heat distribution? The difference between this problem and the previous example is the boundary condition at $x = 1$. We can no longer set it to air temperature, since it is insulated. However, a physical boundary condition at $x = 1$ would now be $u_x(t, 1) = 0$, which means that the boundary at $x = 1$ should have no effect on the heat transfer. In a finite difference scheme, this can be implemented as $u_N = u_{N-1}$ at every time level after the initial condition.

2. Solve the 2D Poisson problem with $f = 1$ and $g = 0$. Take $h$ small enough so that the solution converges. Plot the solution using ‘surf’. Does your solution make physical sense, as a membrane undergoing a forcing that is attached at its boundary?

3. Write a MATLAB code to implement (10.2). Your code should have the same input and output as the “heat1d.m” implementation of (10.1). Test your code by repeating Example 80. For $h = \frac{1}{20}$, $\frac{1}{40}$, $\frac{1}{80}$, and $\frac{1}{160}$, experiment to find the time step size $\Delta t$ needed for stability. Do your results agree with $\Delta t \le Ch^2$ being the restriction?
