Calculus BLUE Multivariable Calculus Vol II Derivatives [2, 3 ed.] 9781944655044



English Pages 475 Year 2019


Table of contents :
BLUE 2 INTRO
COVER
Title Page
Table of Contents
Instructions
LET’S GO!
SONG 19
BLUE 2 PROLOGUE
TITLE
CHORUS
We got it…
CHORUS
Change is…
How it all works
CASE: Gradients
CASE: Linearization
CASE: Optimization
CASE: Lagrange
BUT SO WHAT?
CASE: Approximation
CASE: the IFT
CASE: Statistics
CHORUS
SO MUCH MORE!
ONWARD!
Chapter 1 - multivariate functions
TITLE
CALCULUS!
CHORUS
Graphing functions 1-D
Graphing functions 2-D
Functions!
CHORUS
CASE: curves
CASE: surfaces
CASE: level set curves
CASE: level set surfaces
CASE: color, time, & more
CHORUS
CASE: weather
CASE: weather
CASE: weather
CHORUS
CASE: coordinate change
CASE: coordinate change
CASE: coordinate change
CHORUS
CASE: market equilibrium
CASE: market equilibrium
CASE: epidemic models
CHORUS
CASE: robot kinematics
CASE: data
CHORUS
Spin the Dials!
CHORUS
The BIG PICTURE
PROBLEMS
PROBLEMS
ACKNOWLEDGEMENTS
Chapter 2 - partial derivatives
TITLE
Remember…
The derivative
DEFINITION: partial derivative
DEFINITION: partial derivative
CHORUS
EXAMPLE: partial & implicit differentiation
NOTATION: partial derivatives
CHORUS
Slopes & derivatives
LET'S SEE: Slopes & derivatives
CHORUS
EXAMPLE: inputs & outputs
CASE: managing a bakery
CHORUS
It’s the question that drives you
The BIG PICTURE
PROBLEMS
PROBLEMS
Chapter 3 - the derivative
TITLE
What is the derivative?
It’s a matrix!
BUT SO WHAT?
IMPORTANT!
CHORUS
Curves & rates of change
Surfaces & rates of change
NOTATION: partial derivatives
EXAMPLE: rates of change
EXAMPLE: polar coordinates
Seeing derivatives
CHORUS
EXAMPLE: sensitivity
SUMMARY
CAUTION
The BIG PICTURE
PROBLEMS
PROBLEMS
Chapter 4 - differentiation
TITLE
DEFINITION: derivative
The Derivative
EXAMPLE: linear functions
CHORUS
An unusual annoyance
OK, so what? Oh wait.
YOU CAN SEE: the wrinkle
PANIC!!!
CHORUS
EXAMPLE: squaring square matrices
EXAMPLE: AHHA...
CHORUS
THINK ABOUT IT
Like, seriously, think!
CHORUS
The BIG PICTURE
PROBLEMS
PROBLEMS
ACKNOWLEDGEMENTS
Chapter 5 - the chain rule
TITLE
Remember…
1-d chain rule
In other words…
CHORUS
It’s time…
THEOREM: Chain Rule
CHORUS
EXAMPLE: evaluation
EXAMPLE: higher dimensions
CHORUS
EXAMPLE: chain rule
FALSE CHORUS
Classical chain rules
Classical chain rules
CHORUS
The BIG PICTURE
PROBLEMS
PROBLEMS
Chapter 6 - differentiation rules
TITLE
CHORUS
Linearity
EXAMPLE: linear & affine functions
CHORUS
EXAMPLE: dot products
CHORUS
The product rule, REDUX
CHORUS
EXAMPLE: quadratic forms
…continued
CHORUS
The material derivative
Material derivative derivation
EXAMPLE: BOOP!
FORESHADOWING
The BIG PICTURE
PROBLEMS
PROBLEMS
Chapter 7 - the inverse function theorem
TITLE
CHORUS
Recall: inverse rule for derivatives
CHORUS
DEFINITION: inverse functions
EXAMPLE: local nonlinear inverses
The BIG IDEA
CHORUS
The inverse rule for derivatives
EXAMPLE: derivative of an inverse
CHORUS
THEOREM: Inverse Function Theorem
CHORUS
EXAMPLE: polar coordinates, redux
CHORUS
EXAMPLE: solving nonlinear equations
CHORUS
EXAMPLE: inverse kinematics
The BIG PICTURE
PROBLEMS
PROBLEMS
Chapter 8 - the implicit function theorem
TITLE
Remember…
The classical case
EXAMPLE: classical case
CHORUS
EXAMPLE: implicit function
CHORUS
THEOREM: implicit function theorem
EXAMPLE: implicit function theorem
CHORUS
EXAMPLE: price equilibria
EXAMPLE: price equilibria
CHORUS
EXAMPLE: GPS accuracy
EXAMPLE: GPS accuracy
EXAMPLE: GPS accuracy
CHORUS
The BIG PICTURE
PROBLEMS
PROBLEMS
ACKNOWLEDGEMENTS
Chapter 9 - gradients
TITLE
CHORUS
Level sets in 2-d…
…are curves
Level sets in 3-d?
Nonlinear classifiers
CHORUS
Vectors or Matrices?
DEFINITION: gradient
The Gradient
CHORUS
EXAMPLE: computing gradients
CHORUS
Gradients & level sets
You can see this in 2-d
You can see this in 3-d
EXAMPLE: spheres
CHORUS
FALSE CHORUS
DEFINITION: vectors
Dictionary
BONUS: stochastic gradient descent
FORESHADOWING: gradient fields
CHORUS
Remember this…
The BIG PICTURE
PROBLEMS
PROBLEMS
Chapter 10 - tangent spaces
TITLE
CHORUS
Tangent planes
EXAMPLE: tangent planes
CHORUS
Parametrized curves
CHORUS
Parametrized surfaces
EXAMPLE: parametrized tangent planes
EXAMPLE: continued…
But what about…?
CHORUS
BONUS! Kernels & images
Tangent spaces: implicit
Tangent spaces: parametric
The BIG PICTURE
PROBLEMS
PROBLEMS
Please sign…
Chapter 11 - linearization
TITLE
CHORUS
Linearization to approximate curves
Linearization to approximate surfaces
CHORUS
Differentials or derivatives?
CHORUS
EXAMPLE: numerical approximation
EXAMPLE: tolerances and error
CHORUS
Relative rates
EXAMPLE: beam deflection error
EXAMPLE: beam deflection error
Do you see a pattern?
CHORUS
Accuracy of linear approximation
Taylor time!
The BIG PICTURE
PROBLEMS
PROBLEMS
ACKNOWLEDGEMENTS
Chapter 12 - taylor series
TITLE
Remember…
Let’s recall: Taylor series
What a Taylor polynomial means
CHORUS
What a Taylor polynomial means 2
The FORMULA
RELAX!!!
CHORUS
NOTATION: variables
NOTATION: powers
EXAMPLE: multi-index monomials
NOTATION: factorials
CHORUS
NOTATION: derivatives
EXAMPLE: higher derivatives
IMPORTANT!
CHORUS
FORMULA: Taylor series about 0
FORMULA: Taylor series about a
CHORUS
Remarks
The BIG PICTURE
PROBLEMS
PROBLEMS
Chapter 13 - computing taylor series
TITLE
Let’s see some…
CHORUS
EXAMPLE: 2-d Taylor series
CHORUS
EXAMPLE: Taylor the hard way
EXAMPLE: Taylor the easy way
CHORUS
EXAMPLE: Using the chain rule
BUT SO WHAT?
EXAMPLE: local solutions to equations
CHORUS
WHY?
DEFINITION: Hessian
BONUS!
CAUTION!
The BIG PICTURE
PROBLEMS
Chapter 14 - critical points and optimization
TITLE
Remember…
CHORUS
Time for a definition
DEFINITION: critical points
LEMMA: critical points
CHORUS
EXAMPLE: finding critical points
Graphing critical points
A saddle!
CHORUS
The second derivative
Trace-determinant method
The second derivative test
EXAMPLE: classification
Degenerate critical points
CHORUS
CAUTION: boundaries
CHORUS
The BIG PICTURE
PROBLEMS
Please sign…
Chapter 15 - optimization - linear regression
TITLE
Linear regression
CHORUS
Least squares problem
Take a partial derivative
Take another partial derivative
Solve for the critical point
CHORUS
Solve for the critical point
The solution!
CHORUS
Check minimality
BONUS!
Linear regression
Nonlinear regression?
Topological data analysis?!
The BIG PICTURE
PROBLEMS
Chapter 16 - optimization - nash equilibria
TITLE
It's time for some
CHORUS
Payoff matrices
EXAMPLE: rock, scissors, paper
EXAMPLE: even-odd
EXAMPLE: Mendelsohn
CHORUS
Playing at random
Expected payoffs from random play
CHORUS
EXAMPLE: even-odd mixed strategy
…continued
A saddle point equilibrium
Nash equilibria
LET'S SEE: Nash equilibrium
BONUS!
EXAMPLE: Nash equilibria
CHORUS
THINK!
The BIG PICTURE
PROBLEMS
ACKNOWLEDGEMENTS
Chapter 17 - constrained optimization
TITLE
CHORUS
CASES: constraints
CHORUS
RECALL: 1-d bounded optimization
BUT SO WHAT?
CHORUS
EXAMPLE: a simple constraint
…continued
…continued
CHORUS
EXAMPLE: an optimal box
…continued
CHORUS
EXAMPLE: a parametrized boundary
…continued
…continued
Higher dimensions? Yikes!
The BIG PICTURE
PROBLEMS
Chapter 18 - the lagrange multiplier
TITLE
CHORUS
EXAMPLE: a simple constraint
CHORUS
EXAMPLE: a simple constraint
Think gradient…
CHORUS
You can see it…3D
You can see it…2D
THIS WORKS!
CHORUS
THEOREM: Lagrange multiplier
EXAMPLE: Lagrange
The BIG IDEA
The Lagrange equations
WHAT IS IT?
Interpretations of Lagrange
CHORUS
BONUS! Lagrange & rates of change
Lagrange multiplier: A PICTURE!
CHORUS
The BIG PICTURE
PROBLEMS
Chapter 19 - using the lagrange equation
TITLE
The Lagrange equations
CHORUS
EXAMPLE: supply chains
CHORUS
EXAMPLE: inscribing a cone
CHORUS
EXAMPLE: hard & in 3-d
…continued
…continued
CHORUS
EXAMPLE: resource allocation
CHORUS
EXAMPLE: algebraic-geometric inequality
…continued
FORESHADOWING
The BIG PICTURE
PROBLEMS
PROBLEMS
BLUE 2 EPILOGUE
TITLE
SO MUCH MORE!
CHORUS
You should…
OPTIMIZATION
Multi-Lagrange
…continued
CHORUS
EIGENVALUES
CHORUS
DYNAMICAL SYSTEMS
Linear ODE systems
The matrix exponential
Classification of equilibria
CHORUS
Nonlinear dynamics
CHORUS
OW MY HEAD!
SO MUCH MORE!
BLUE 2 FORESHADOW
TITLE
CHORUS
FAIL!!!
The definite integral
CHORUS
One weird trick…
CHORUS
Take it to the limit
CHORUS
APPLICATION: mass
APPLICATION: probability
CHORUS
the BIG THEOREM
CHORUS
Coordinate systems
APPLICATION: surface area
Higher dimensions…
The BIG PICTURE
LET’S GO!
BLUE 2 CLOSE
SONG 20
COVER
About the Author
REFERENCES
Where credit is due
Publisher of Beautiful Mathematics

CALCULUS BLUE
MULTIVARIABLE VOLUME 2 : DERIVATIVES
ROBERT GHRIST
3rd edition, kindle format
Copyright © 2019 Robert Ghrist. All rights reserved worldwide.
Agenbyte Press, Jenkintown PA, USA
ISBN 978-1-944655-04-4
1st edition © 2016 Robert Ghrist
2nd edition © 2018 Robert Ghrist

prologue
chapter 1: multivariate functions
chapter 2: partial derivatives
chapter 3: the derivative
chapter 4: differentiation
chapter 5: the chain rule
chapter 6: differentiation rules
chapter 7: inverse function theorem
chapter 8: implicit function theorem
chapter 9: gradients
chapter 10: tangent spaces
chapter 11: linearization
chapter 12: taylor series
chapter 13: computing taylor series
chapter 14: critical points
chapter 15: optimization: regression
chapter 16: optimization: game theory
chapter 17: constrained optimization
chapter 18: the lagrange multiplier
chapter 19: using lagrange’s method
epilogue
foreshadowing: integrals

enjoy learning! use your full imagination & read joyfully… this material may seem easy, but it’s not! it takes hard work to learn mathematics well… work with a teacher, tutor, or friends and discuss what you are learning. this text is meant to teach you the big ideas and how they are useful in modern applications; it’s not rigorous, and it’s not comprehensive, but it should inspire you to do things with math… exercises at chapter ends are for you to practice. don’t be too discouraged if some are hard… keep working! keep learning!

Waves o’er the heavens deep…

What have we learned?

Well…

we’VE LEARNED TO WORK WITH…

&

but

is understanding & using derivatives

how to maximize/minimize a function with multiple inputs?

What lies beyond maxima/minima?

the lagrange multiplier will be one of our most useful optimization methods

it works in the setting of constrained optimization

such constraints arise naturally in economics, engineering, physics, & more

why do multivariate derivatives?

derivatives are critical to so many applications! along the way, we’ll learn…

x0

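$$\begin{pmatrix} m \\ b \end{pmatrix} = A^{-1}\mathbf{b} = \frac{1}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2}\begin{pmatrix} n\sum_i x_i y_i - \left(\sum_i x_i\right)\left(\sum_i y_i\right) \\[1ex] -\left(\sum_i x_i\right)\sum_i x_i y_i + \left(\sum_i x_i^2\right)\left(\sum_i y_i\right) \end{pmatrix}$$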

we’ll learn some applications ranging from statistics to game theory…

Most of the calculus you have learned is concerned with the local & global features of functions…

$f:\mathbb{R}\to\mathbb{R}$

$f:\mathbb{R}^2\to\mathbb{R}$

these curves are “level sets” where the function is constant: see chapter 9

multivariable calculus deals with functions of the form

$f:\mathbb{R}^n\to\mathbb{R}^m$ ($n$ inputs, $m$ outputs)

you’ve seen lots of multivariate functions before...

$f:\mathbb{R}\to\mathbb{R}^n$

$f:\mathbb{R}^2\to\mathbb{R}^n$

f(x,y) = c

these curves are “level sets” where the function is constant: see chapter 9

$f:\mathbb{R}^2\to\mathbb{R}$

f(x,y,z ) = c

the corresponding construction in 3-d yields surfaces...

$f:\mathbb{R}^3\to\mathbb{R}$

Sometimes, you can use color or density or time to visualize functions with more than two inputs/outputs…

$f:\mathbb{R}^4\to\mathbb{R}^2$

often have position and/or time as inputs

meteorological quantities are functions of position

$f:\mathbb{R}^2\to\mathbb{R}$

$f:\mathbb{R}^2\to\mathbb{R}^2$

$f:\mathbb{R}^3\to\mathbb{R}^3$

can “convert” between different units

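functions can switch between coordinate systems (polar and euclidean)

$$\begin{pmatrix} x \\ y \end{pmatrix} = f\begin{pmatrix} r \\ \theta \end{pmatrix} = \begin{pmatrix} r\cos\theta \\ r\sin\theta \end{pmatrix} \qquad\qquad \begin{pmatrix} r \\ \theta \end{pmatrix} = f^{-1}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} \sqrt{x^2+y^2} \\ \arctan(y/x) \end{pmatrix}$$

these are typically inverse functions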
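The polar/euclidean conversion can be tried out directly. The following is a minimal sketch, not taken from the text: the function names are made up for illustration, and it uses `math.atan2` in place of a bare arctan(y/x) so that the angle lands in the correct quadrant (which is one reason the two maps are only "typically" inverse to each other).

```python
import math

def polar_to_euclidean(r, theta):
    """(r, theta) -> (x, y) = (r cos(theta), r sin(theta))."""
    return (r * math.cos(theta), r * math.sin(theta))

def euclidean_to_polar(x, y):
    """(x, y) -> (r, theta); atan2 handles all four quadrants and x = 0."""
    return (math.hypot(x, y), math.atan2(y, x))

# round trip: polar -> euclidean -> polar
x, y = polar_to_euclidean(2.0, math.pi / 3)
r, theta = euclidean_to_polar(x, y)
print((x, y), (r, theta))
```

The round trip recovers the original (r, θ) as long as r > 0 and θ lies in (−π, π].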

can be explicit functions of inputs

(Q = quantity, P = price)

supply : $S(P) = S_0 + aP$
demand : $D(P) = D_0 - bP$

supply = demand at the market equilibrium, with market price $P_m$ and market quantity $Q_m$:

$$Q_m = S_0 + aP_m = D_0 - bP_m$$

$$(a+b)\,P_m = D_0 - S_0 \quad\Rightarrow\quad P_m = \frac{D_0 - S_0}{a+b}$$

$$Q_m = S_0 + aP_m = S_0 + a\,\frac{D_0 - S_0}{a+b} = \frac{aD_0 + bS_0}{a+b}$$

so the equilibrium is an explicit function of the inputs:

$$\begin{pmatrix} P_m \\ Q_m \end{pmatrix} = f\begin{pmatrix} a \\ b \\ S_0 \\ D_0 \end{pmatrix} = \begin{pmatrix} \dfrac{D_0 - S_0}{a+b} \\[2ex] \dfrac{aD_0 + bS_0}{a+b} \end{pmatrix}$$
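As a small illustration of the equilibrium being a function with four inputs and two outputs, here is a sketch in Python. The function name and the sample numbers are hypothetical; only the two formulas come from the derivation above.

```python
def market_equilibrium(a, b, S0, D0):
    """Return (Pm, Qm) where supply S0 + a*P meets demand D0 - b*P."""
    if a + b == 0:
        raise ValueError("supply and demand lines are parallel: no unique equilibrium")
    Pm = (D0 - S0) / (a + b)
    Qm = (a * D0 + b * S0) / (a + b)
    return Pm, Qm

# made-up sample inputs
Pm, Qm = market_equilibrium(a=2.0, b=3.0, S0=10.0, D0=60.0)
print(Pm, Qm)
```

Plugging the returned price back into either line (supply or demand) should give the same quantity, which is a handy sanity check.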

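$$\frac{d}{dt}\begin{pmatrix} S \\ E \\ I \\ R \end{pmatrix} = f\begin{pmatrix} S \\ E \\ I \\ R \end{pmatrix} = \begin{pmatrix} \mu - \beta SI - \mu S \\ \beta SI - (\mu+\alpha)E \\ \alpha E - (\mu+\gamma)I \\ \gamma I - \mu R \end{pmatrix}$$

S(t) = susceptible ; E(t) = exposed ; I(t) = infected ; R(t) = recovered ; α, β, γ, μ = rates (positive constants)

the rates of change dS/dt, dE/dt, dI/dt, dR/dt describe how these cohort sizes evolve

this is just one possible model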
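To see this four-input, four-output function in action, here is a minimal sketch that evaluates the rates and takes crude forward-Euler time steps. The parameter values, the initial cohort sizes, the step size, and the choice of Euler's method are all assumptions made for illustration; none of them come from the text.

```python
def seir_rates(S, E, I, R, alpha, beta, gamma, mu):
    """Right-hand side f(S, E, I, R) of the SEIR model above."""
    dS = mu - beta * S * I - mu * S
    dE = beta * S * I - (mu + alpha) * E
    dI = alpha * E - (mu + gamma) * I
    dR = gamma * I - mu * R
    return dS, dE, dI, dR

def euler_step(state, dt, **params):
    """One forward-Euler step: state + dt * f(state)."""
    rates = seir_rates(*state, **params)
    return tuple(x + dt * dx for x, dx in zip(state, rates))

params = dict(alpha=0.2, beta=0.5, gamma=0.1, mu=0.01)  # hypothetical rates
state = (0.99, 0.0, 0.01, 0.0)                          # cohort fractions summing to 1
for _ in range(1000):
    state = euler_step(state, 0.1, **params)
print(state, sum(state))
```

Adding the four rates gives μ − μ(S+E+I+R), so when the cohorts are fractions summing to 1 the total is conserved, which the printed sum should reflect.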

are sometimes “implicitly defined”

is to think about rates of change

the mathematics of change

the big picture

Most “real-world” functions have multiple inputs & multiple outputs… multivariable calculus is what makes sense of them!

1

Consider functions $f:\mathbb{R}^n\to\mathbb{R}^m$ and $g:\mathbb{R}^k\to\mathbb{R}^n$ and $h:\mathbb{R}^m\to\mathbb{R}^k$ where $m \neq n \neq k$. A) Which pairwise compositions $f\circ g$, $f\circ h$, $g\circ h$, etc. are permitted? B) For each such legal composition, state the number of its inputs & outputs.

2

recall that the domain of a function is the set of all “legal” inputs. What subsets { (x, y) } of $\mathbb{R}^2$ comprise the domains of these functions?

x A) f = y

3

xy 1-x2-y2

y x b) f = y x

x c) f = y

2x – 5y 7x + y -x + 4y

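[challenge?] find the inverses of the following change-of-coordinates functions

A) $\begin{pmatrix} u \\ v \end{pmatrix} = f\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x+y \\ x-y \end{pmatrix}$

b) $\begin{pmatrix} u \\ v \end{pmatrix} = f\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} y^3 \\ x-y^3 \end{pmatrix}$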

Can you draw pictures of what is happening in these coordinate changes?

4

if you wanted to model the flow of a fluid (such as wind) in 3-d as a function of particle positions and time, how many inputs and outputs would you need?

5

continuity for multivariate functions is a little more complicated than saying “draw the graph without lifting the pencil”… consider the following function: $f(x, y) = \dfrac{xy}{x^2+y^2}$ a) note that this is not well-defined at the origin… b) take the limit of f(x,0) as x → 0. are you happy? c) now take the limit of f(t,t) as t → 0. are you still happy? d) relax! you will probably never see a function like this in “the wild”. maybe.

6

a famous model in economics is the cobb-douglas model for production. it says: $P = C L^{\alpha} M^{1-\alpha}$ where P, L, & M are the amounts of goods produced, labor used, & materials used, respectively; and C and 0m)

the derivative gives a system of equations to solve for the tangent space

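given a level set in $\mathbb{R}^n$

$$f^{-1}(0) = \{\, x : f(x) = 0 \,\} \qquad\text{where}\qquad x = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}$$

the tangent space at $x_0$ is given by

$$[Df]_{x_0}\,(x - x_0) = 0 \qquad \text{with } [Df] \text{ evaluated at } x_0$$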
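Here is a small sketch of that equation for one concrete level set, the sphere x² + y² + z² − 9 = 0. The choice of sphere, the base point, and the function names are illustrative assumptions rather than anything from the text.

```python
def derivative_sphere(p):
    """[Df] at p for f(x, y, z) = x^2 + y^2 + z^2 - 9, as a row of partials."""
    x, y, z = p
    return (2 * x, 2 * y, 2 * z)

def tangent_residual(Df, x0, x):
    """Value of [Df](x - x0); it is zero exactly when x lies in the tangent space at x0."""
    return sum(d * (xi - x0i) for d, xi, x0i in zip(Df, x, x0))

x0 = (1.0, 2.0, 2.0)          # a point on the sphere, since 1 + 4 + 4 = 9
Df = derivative_sphere(x0)    # (2, 4, 4)
on_plane = (3.0, 1.0, 2.0)    # satisfies x + 2y + 2z = 9
off_plane = (2.0, 4.0, 4.0)   # does not
print(tangent_residual(Df, x0, on_plane), tangent_residual(Df, x0, off_plane))
```

With a single output, the system is one linear equation, so the tangent space here is a plane.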

let’s say you have a “nice” parametrized “manifold” defined by

f: _ 

(kRn such that (1) Each $\varphi_\alpha$ maps $U_\alpha$ onto its image continuously with a continuous inverse $\varphi_\alpha^{-1}$. (2) If $U_\alpha$ intersects $U_\beta$, then the change of coordinates map $\varphi_\beta\circ\varphi_\alpha^{-1}:\mathbb{R}^n\to\mathbb{R}^n$ is differentiable on its domain $\varphi_\alpha(U_\alpha \cap U_\beta)$. This is not an easy definition. Just think “locally Euclidean”.

You should not worry about this too much. One never really constructs all the covers and coordinate maps. Instead, one relies on the Big Tools: the Inverse and Implicit Function Theorems, which hold for manifolds. These can be used to generate n-manifolds as solutions to implicit equations. Manifold theory is used all the time in physics, robotics, dynamical systems, control theory, and (!) Mathematics.

I have read the above completely and agree to abide by these terms

for more than just geometry…

Finding tangent spaces is really a linearization of a function

The linearization varies from point-to-point…

…and provides an approximation that is locally valid

it’s simplest to use differential notation

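$$f(x) = f(x_1, x_2, x_3, \ldots, x_n)$$

$$df = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}\,dx_i$$

this is a linear combination of differentials. if we “pretend” that each differential represents a small change, then…

$$\left[\frac{\partial f}{\partial x}\right] = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} & \dfrac{\partial f}{\partial x_2} & \cdots & \dfrac{\partial f}{\partial x_n} \end{bmatrix}$$

this is a linear transformation acting on vectors of rates of change

$$\left[\frac{\partial f}{\partial x}\right] dx = df$$

where “dx” is a vector of differentials $dx_i$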

are really helpful when estimating errors

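numerical approximations

Consider the function

$$f(x,y,z) = \frac{15}{xyz}$$

and use it to estimate

$$\frac{15}{4.9\,\pi\cos(3)}$$

let’s estimate the terms…

$x = 5$ ; $dx = -0.1$

$y = 3$ ; $dy \approx 0.14$

$z = \cos\pi = -1$ ; $dz \approx 0.01$

the last will take a bit more work…

$$\cos 3 = \cos(\pi - 0.14\ldots) \approx -1 + \tfrac{1}{2}(0.14)^2 \approx -0.99$$

you can check that…

$$df = \frac{-15}{xyz}\left(\frac{dx}{x} + \frac{dy}{y} + \frac{dz}{z}\right)$$

now we can compute…

$$\frac{15}{4.9\,\pi\cos(3)} \approx \frac{15}{5\cdot 3\cdot(-1)} + \frac{-15}{5\cdot 3\cdot(-1)}\left(\frac{-0.1}{5} + \frac{0.14}{3} + \frac{0.01}{-1}\right) = -1 + (-0.020 + 0.047 - 0.010) = -0.983$$

the true value is $-0.98427\ldots$

that’s not bad for by hand…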
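The hand computation can be double-checked in a few lines. This is only a sketch: the function names are made up, and the base point (5, 3, −1) and the increments are the ones used in the example above.

```python
import math

def f(x, y, z):
    return 15.0 / (x * y * z)

def df(x, y, z, dx, dy, dz):
    """Differential of f = 15/(xyz): df = -f * (dx/x + dy/y + dz/z)."""
    return -f(x, y, z) * (dx / x + dy / y + dz / z)

x, y, z = 5.0, 3.0, -1.0          # base point where f is easy to evaluate
dx, dy, dz = -0.1, 0.14, 0.01     # small changes toward 4.9, pi, cos(3)

estimate = f(x, y, z) + df(x, y, z, dx, dy, dz)
exact = 15.0 / (4.9 * math.pi * math.cos(3.0))
print(estimate, exact, abs(estimate - exact))
```

The linear estimate and the floating-point value agree to roughly three decimal places, consistent with the by-hand result.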

tolerances and error

beam length: a = 24 ± 1/8 inches
rafter length: b = 30 ± 1/4 inches
angle: θ = 57° ± 1°

estimate the strut length L

$$L^2 = a^2 + b^2 - 2ab\cos\theta \quad\Rightarrow\quad L \approx 26.30 \text{ in}$$

errors: $|da| < 1/8$ ; $|db| < 1/4$ ; $|d\theta| < 1^\circ$ ($= \pi/180$ rad)

$$L\,dL = (a - b\cos\theta)\,da + (b - a\cos\theta)\,db + (ab\sin\theta)\,d\theta$$

$$dL = \frac{(a - b\cos\theta)\,da + (b - a\cos\theta)\,db + (ab\sin\theta)\,d\theta}{L}$$

maximize & compute: $|dL| < 0.60$ in
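A numerical check of the strut estimate, as a sketch under one explicit assumption: the 1° angle tolerance is converted to radians as π/180 ≈ 0.01745. Under that assumption the worst-case bound on |dL| comes out near 0.60 inches. Variable names are illustrative.

```python
import math

a, b = 24.0, 30.0                 # beam and rafter lengths, inches
theta = math.radians(57.0)        # angle between them
da, db = 1 / 8, 1 / 4             # length tolerances, inches
dtheta = math.radians(1.0)        # angle tolerance: 1 degree = pi/180 rad

# law of cosines for the strut length
L = math.sqrt(a * a + b * b - 2 * a * b * math.cos(theta))

# worst case: every term of L dL pushes in the same direction
bound = (abs(a - b * math.cos(theta)) * da
         + abs(b - a * math.cos(theta)) * db
         + abs(a * b * math.sin(theta)) * dtheta) / L
print(L, bound)
```

The angle term dominates the bound, so tightening the angle measurement matters more than tightening either length.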

Are especially important in applications

if you have/need information about percentage errors, you will want to use relative rates

“if each input is known with a 1% accuracy, with what accuracy are the outputs known?”

relative rate of change of u:

$$d(\ln u) = \frac{du}{u}$$

represents percent change

Beam deflection

The elastic deflection at the midpoint of a beam, loaded at its center, & supported by two simple supports is

$$u = \frac{FL^3}{48EI}$$

The cross-sectional moment of inertia of a rectangular beam (cross-section of width w and height h) is

$$I = \frac{wh^3}{12}$$

if each variable is subject to a 2% error, what is the net impact?

Beam deflection

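$$u = \frac{FL^3}{48EI} = \frac{FL^3}{4Ewh^3}$$

$$du = \frac{L^3}{4Ewh^3}\,dF + \frac{3FL^2}{4Ewh^3}\,dL - \frac{FL^3}{4Ew^2h^3}\,dw - \frac{3FL^3}{4Ewh^4}\,dh$$

The relative rate of error in deflection:

$$\frac{du}{u} = \frac{dF}{F} + 3\,\frac{dL}{L} - \frac{dw}{w} - 3\,\frac{dh}{h}$$

assume each term is +/- 2% at worst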

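$$u(x) = u(x_1, x_2, x_3, \ldots, x_n)$$

it’s of the form

$$u = K\prod_{i=1}^{n} x_i^{c_i}$$

$$\frac{du}{u} = d(\ln u) = d\!\left(\ln K + \sum_{i=1}^{n} \ln\!\left(x_i^{c_i}\right)\right) = \sum_{i=1}^{n} c_i\, d(\ln x_i) = \sum_{i=1}^{n} c_i\,\frac{dx_i}{x_i}$$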
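For a product of powers, the worst-case relative error is just the exponents (in absolute value) weighted by the relative input errors. The sketch below applies this to the beam deflection u = FL³/(4Ewh³) with a 2% error on each of F, L, w, h; the helper name is hypothetical, and E is treated as exact, as in the relative-rate formula for the beam.

```python
def worst_case_relative_error(exponents, relative_errors):
    """Sum of |c_i| * |dx_i / x_i| for u = K * prod(x_i ** c_i)."""
    return sum(abs(c) * abs(e) for c, e in zip(exponents, relative_errors))

# exponents of F, L, w, h in u = F L^3 / (4 E w h^3)
exponents = {"F": 1, "L": 3, "w": -1, "h": -3}
errors = {name: 0.02 for name in exponents}   # 2% on each input

worst = worst_case_relative_error(exponents.values(), errors.values())

# compare with the exact worst case: F, L up by 2%, w, h down by 2%
exact = (1.02 ** 1 * 1.02 ** 3) / (0.98 ** 1 * 0.98 ** 3) - 1
print(worst, exact)
```

The linearized figure slightly underestimates the exact worst case, which is expected since the differential drops higher-order terms.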

is great… but it’s just the first step

in some regions, linear approximation is very accurate... ...but not everywhere!

of course, it’s… did I really need to ask?

you might want to review that before proceeding…

the big picture

the derivative is key to both linearization & approximation of a nonlinear function

1

practice your implicit differentiation on the following functions, writing your answer in terms of differentials A) w = (x 2y2z ) B) a 2b - ab 2 = ab c) u = e vw - e wv

2

use differentials to give a numerical approximation for $1 / ( 0.99 + 0.49^2 )$. try to do it without using a calculator! Hint: use $f(x,y) = 1 / ( x + y^2 )$.

3

if you know that e3 = 20.0855… and that (e/π)-( 3/2) ≅ -0.00077.., then, using differentials, estimate e5/ π2 ≅ 15.0374… how close is your estimate? now let’s say you also know that π3 ≅ 31.0063… then, estimate π/e (without knowing the square root of 3) by using the expression π/e = π3(e/π)2/e3. yes, this is kind of ridiculous.

4

to which variable (height or diameter) is the volume of a cylinder more sensitive? does this impact aspect ratios of canned goods? think of volumes and dimensions of cans of soda/pop, energy drinks, vegetables, juice, etc.

6

consider the area A of a parallelogram in the plane determined by vectors a i + b j and c i + d j. if each of the constants (a, b, c, d) can vary by up to 10%, what percentage variation can arise for A? your answer will depend on (a, b, c, d). are there values of (a, b, c, d) which lead to very large percentage errors in A?

7

assume that you are told the surface area of a sphere with a possible 10% error in measurement. with what confidence do you know its volume? radius? diameter? How does your answer change if the object is a cube?

8

[challenge] consider a solid object, such as a sphere, cone, or cube, in 3-d. if you have a measurement of the length scale (radius, side length, height, etc.) with 1% error, which is more accurate: measurement of volume? or surface area? does it matter what the exact shape is?

9

what is the relative rate of error of an exponential function $f(u) = Ce^u$ ? how does this compare to that of a logarithmic function $f(u) = C\ln u$ ?

you learned taylor series?

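for $f:\mathbb{R}\to\mathbb{R}$ the taylor series is

about x=0

$$f(x) = \sum_{i=0}^{\infty} \frac{1}{i!}\,\frac{d^i f}{dx^i}\bigg|_0\, x^i = f(0) + f'(0)\,x + \tfrac{1}{2} f''(0)\,x^2 + \ldots$$

about x=a

$$f(x) = \sum_{i=0}^{\infty} \frac{1}{i!}\,\frac{d^i f}{dx^i}\bigg|_a\, (x-a)^i \qquad\qquad f(a+h) = \sum_{i=0}^{\infty} \frac{1}{i!}\,\frac{d^i f}{dx^i}\bigg|_a\, h^i$$

local variable: h = x - a

if one thinks of differentiation as an operator D

$$f(x) = \sum_{i=0}^{\infty} \frac{1}{i!}\, D^i f\big|_0\, x^i \qquad f(x) = \sum_{i=0}^{\infty} \frac{1}{i!}\, D^i f\big|_a\, (x-a)^i \qquad f(a+h) = \sum_{i=0}^{\infty} \frac{1}{i!}\, D^i f\big|_a\, h^i$$

this converges within a radius of convergence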

as you recall… truncations to taylor polynomials provide approximations of increasing fidelity with degree
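A quick numerical illustration of that claim, as a minimal sketch: the choice of e^x as the test function and the helper name are assumptions made for the example.

```python
import math

def taylor_exp(x, degree, a=0.0):
    """Degree-N taylor polynomial of e^x about a (every derivative of e^x at a equals e^a)."""
    h = x - a
    return sum(math.exp(a) * h ** i / math.factorial(i) for i in range(degree + 1))

# error at x = 1 for truncations of degree 1 through 7
errors = [abs(math.exp(1.0) - taylor_exp(1.0, n)) for n in range(1, 8)]
for n, err in enumerate(errors, start=1):
    print(n, err)
```

Each extra degree shrinks the error at x = 1, as the printed list shows.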

is simplest in the single-output case

in like manner…

a taylor polynomial is the “best fit” polynomial

in a neighborhood of the expansion point

about x=0

$$f(x) = \sum_{I} \frac{1}{I!}\, D^I f\big|_0\; x^I \qquad \text{(a sum over “multi-indices” } I\text{)}$$

it looks intimidating, but that is to be expected… remember how the single variable case seemed hard at first? we will learn what all these terms mean…

is intimidating-looking, but really useful

attention must be paid to:

multivariate sums work best with a multi-index

given n variables…

$$x = ( x_1 , x_2 , \ldots , x_n )$$

a multi-index is n ordered indices…

$$I = ( i_1 , i_2 , \ldots , i_n )$$

the degree of a multi-index $I$ is defined as

$$| I | := i_1 + i_2 + \ldots + i_n$$

the monomial $x^I$ is

$$x^I = x_1^{i_1}\, x_2^{i_2} \cdots x_n^{i_n}$$

the degree of the monomial term equals the degree of the multi-index

multi-index monomials

with x = ( x, y, z ):

$x^{(1,2,3)} = x^1 y^2 z^3 = x y^2 z^3$
$x^{(0,1,0)} = x^0 y^1 z^0 = y$
$x^{(1,0,1)} = x^1 y^0 z^1 = xz$
$x^{(0,0,0)} = x^0 y^0 z^0 = 1$

with x = ( a, b, c, d ):

$x^{(1,2,3,1)} = a b^2 c^3 d$
$x^{(0,1,1,1)} = bcd$
$x^{(1,0,0,0)} = a$
$x^{(0,0,0,0)} = 1$

“linear” terms have degree 1 ; “quadratic” terms degree 2 ; etc.

$$I = ( i_1, i_2 , \ldots , i_n ) \qquad\Rightarrow\qquad I! = i_1!\; i_2! \cdots i_n!$$

$(1, 2, 3)! = 1!\,2!\,3! = 12$
$(2, 2, 2)! = 2!\,2!\,2! = 8$
$(0, 0, 0)! = 0!\,0!\,0! = 1$
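The three pieces of multi-index notation so far (degree, monomial, factorial) translate almost word for word into code. This is a small sketch with made-up function names, using only the Python standard library.

```python
import math

def degree(I):
    """|I| = i1 + i2 + ... + in."""
    return sum(I)

def monomial(x, I):
    """x^I = x1^i1 * x2^i2 * ... * xn^in, evaluated at the numbers in x."""
    return math.prod(xk ** ik for xk, ik in zip(x, I))

def multi_factorial(I):
    """I! = i1! * i2! * ... * in!."""
    return math.prod(math.factorial(i) for i in I)

print(degree((1, 2, 3)), multi_factorial((1, 2, 3)), monomial((2, 3, 5), (1, 2, 3)))
```

Here `monomial((2, 3, 5), (1, 2, 3))` evaluates x y² z³ at x = 2, y = 3, z = 5.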

INTERACT NICELY WITH THIS NOTATION

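$f = f(x)$ ; $x = ( x_1 , x_2 , \ldots , x_n )$ ; $I = ( i_1 , i_2 , \ldots , i_n )$

take partials according to the multi-index

$$D^I f = \frac{\partial^{i_1}}{\partial x_1^{i_1}}\,\frac{\partial^{i_2}}{\partial x_2^{i_2}} \cdots \frac{\partial^{i_n}}{\partial x_n^{i_n}}\, f$$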

higher derivatives

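$f(x) = x^3 y z^2$ with x = ( x, y, z )

$$D^{(1,2,3)} f = \frac{\partial}{\partial x}\,\frac{\partial^2}{\partial y^2}\,\frac{\partial^3}{\partial z^3}\, f = 0$$

$$D^{(1,0,1)} f = \frac{\partial}{\partial x}\,\frac{\partial}{\partial z}\, f = 6x^2yz$$

$$D^{(0,0,0)} f = f = x^3 y z^2$$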
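For a single monomial, a multi-index derivative only changes the coefficient and lowers the powers, so it can be computed without any symbolic library. The sketch below represents c·x^P as a pair (c, P); this representation and the function names are assumptions made for the example.

```python
def falling(power, order):
    """power * (power-1) * ... * (power-order+1); becomes 0 once order exceeds power."""
    result = 1
    for k in range(order):
        result *= power - k
    return result

def diff_monomial(coeff, powers, I):
    """Apply D^I to coeff * x^powers and return (new_coeff, new_powers)."""
    new_coeff = coeff
    new_powers = []
    for p, i in zip(powers, I):
        new_coeff *= falling(p, i)
        new_powers.append(max(p - i, 0))
    return new_coeff, tuple(new_powers)

f = (1, (3, 1, 2))                      # x^3 y z^2
print(diff_monomial(*f, (1, 2, 3)))     # coefficient 0: the derivative vanishes
print(diff_monomial(*f, (1, 0, 1)))     # (6, (2, 1, 1)), i.e. 6 x^2 y z
print(diff_monomial(*f, (0, 0, 0)))     # f itself
```

A returned coefficient of 0 means the derivative is identically zero, whatever powers are reported alongside it.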

it’s not hard to compute multiple partial derivatives, but…

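if f has continuous second partial derivatives, then…

$$\frac{\partial}{\partial x_i}\,\frac{\partial}{\partial x_j}\, f = \frac{\partial}{\partial x_j}\,\frac{\partial}{\partial x_i}\, f \qquad \text{for all } i \text{ and } j$$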

And now After all that hard work… The notation makes more sense… so

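about $x = 0$:

$$f(x) = \sum_{I} \frac{1}{I!}\, D^I f\big|_0\; x^I$$

a sum over multi-indices $I = ( i_1 , i_2 , \ldots , i_n )$, with all derivatives evaluated at zero

about $x = a$:

$$f(x) = \sum_{I} \frac{1}{I!}\, D^I f\big|_a\; (x-a)^I$$

letting $h = x - a$, with all derivatives evaluated at $a$:

$$f(a+h) = \sum_{I} \frac{1}{I!}\, D^I f\big|_a\; h^I$$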

so much work to compute !

not all series converge! in general, you have to compute the radius of convergence and stay within that distance to the expansion point. that’s not always easy to do – we will not explore this here… we know what the first derivative of a function is, but what are the higher derivatives? are they like matrices? the answer is not simple, as you can see from the complexity of the taylor formula.

the big picture

multi-index notation gives a general formula for taylor series that closely resembles the usual formula we know & love

1

evaluate the following multi-index factorials: A) ( 1, 2 ) ! b) ( 0, 1, 0 ) ! c) ( 3, 0, 1, 2 ) ! d) ( 4, 0 ) ! e) ( 1, 1, 1, 1, 1 ) ! f) ( 2, 2, 2, 2, 1, 0 ) !

2

evaluate the following multi-index monomials xI, where: A) x = ( x, y ) ; I = ( 1, 2 ) b) x = ( x, y, z ) ; I = ( 3, 0, 2 ) c) x = ( x1, x2 , x3, x4 ) ; I = ( 1, 4, 0, 2 ) d) x = ( a, b, c ) ; I = ( 0, 2, 1 ) state the degree of each monomial in each case

3

compute the following multi-index derivatives DIf , where: A) x = ( x, y ) ; I = ( 2, 1 ) ; f(x) = exy b) x = ( x1, x2 , x3, x4 ) ; I = ( 1, 2, 3, 4 ) ; f(x) = ( x1 + x2 - x3 + x4 )3 c) x = ( u, v, w ) ; I = ( 2, 1, 1 ) ; f(x) = u3v2w – (u + v + w)2 d) x = ( x, y, z ) ; I = ( 0, 2, 1 ) ; f(x) = ( xyz )

4

compute the taylor series of f(x, y) = 3 – x + 2y + 5xy – y2 about the point x=1 and y=2. do this step-by-step, computing all derivatives. note: your answer should be a polynomial in the variables (x-1) and (y-2). when finished, be sure to check your work by multiplying everything out and confirming that you get the original polynomial back.

5

fix n the number of variables. there are, clearly, n multi-indices of degree equal to 1. how many multi-indices are there of degree 2? of degree 3? can you guess at (or prove) how many multi-indices have degree K for K>0 ?

6

[challenge] prove that mixed partial derivatives commute for multivariate polynomial functions. if you’re stuck, start with monomials in two variables…

7

for $f:\mathbb{R}^n\to\mathbb{R}$ the second derivative $[D^2f]$ is defined to be the square matrix whose entries are the second partial derivatives: $[D^2f]_{i,j} = \partial^2 f/\partial x_i\partial x_j = \partial/\partial x_i\,(\partial f/\partial x_j)$. what does the fact that mixed partials commute tell you about $[D^2f]$?

in the case of a planar function

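$f:\mathbb{R}^2\to\mathbb{R}$

for a planar function about origin…

$$\begin{aligned}
f(x,y) ={}& f(0,0) + \frac{\partial f}{\partial x}\bigg|_0 x + \frac{\partial f}{\partial y}\bigg|_0 y \\
&+ \frac{1}{2}\frac{\partial^2 f}{\partial x^2}\bigg|_0 x^2 + \frac{\partial^2 f}{\partial x\,\partial y}\bigg|_0 xy + \frac{1}{2}\frac{\partial^2 f}{\partial y^2}\bigg|_0 y^2 \\
&+ \frac{1}{6}\frac{\partial^3 f}{\partial x^3}\bigg|_0 x^3 + \frac{1}{2}\frac{\partial^3 f}{\partial x^2\,\partial y}\bigg|_0 x^2y + \frac{1}{2}\frac{\partial^3 f}{\partial x\,\partial y^2}\bigg|_0 xy^2 + \frac{1}{6}\frac{\partial^3 f}{\partial y^3}\bigg|_0 y^3 \\
&+ \ldots
\end{aligned}$$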

that is still a lot of derivatives

doing things the hard way…

taylor expand $e^{x^2+y}$ about x=y=0 up to terms of order three

I         I!    D^I f                               at (0,0)    (x,y)^I
(0, 0)    1     e^{x^2+y}                           1           1
(1, 0)    1     2x e^{x^2+y}                        0           x
(0, 1)    1     e^{x^2+y}                           1           y
(2, 0)    2     4x^2 e^{x^2+y} + 2 e^{x^2+y}        2           x^2
(1, 1)    1     2x e^{x^2+y}                        0           xy
(0, 2)    2     e^{x^2+y}                           1           y^2

idea: use the taylor series for $e^u$

$$e^u = 1 + u + \tfrac{1}{2}u^2 + \tfrac{1}{6}u^3 + O(u^4)$$

$$\begin{aligned}
e^{x^2+y} &= 1 + (x^2+y) + \tfrac{1}{2}(x^2+y)^2 + \tfrac{1}{6}(x^2+y)^3 + O(|x|^4) \\
&= 1 + (x^2+y) + \left(\tfrac{1}{2}x^4 + x^2y + \tfrac{1}{2}y^2\right) + \left(\tfrac{1}{6}x^6 + \tfrac{1}{2}x^4y + \tfrac{1}{2}x^2y^2 + \tfrac{1}{6}y^3\right) + O(|x|^4) \\
&= 1 + y + x^2 + \tfrac{1}{2}y^2 + x^2y + \tfrac{1}{6}y^3 + O(|x|^4)
\end{aligned}$$
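One way to sanity-check a truncated expansion is to watch how its error shrinks near the origin. The sketch below compares e^{x²+y} with its third-order taylor polynomial 1 + y + x² + y²/2 + x²y + y³/6 along the diagonal x = y = t. The sample points are arbitrary choices, and the polynomial is written out by hand as an assumption of the sketch.

```python
import math

def f(x, y):
    return math.exp(x * x + y)

def taylor3(x, y):
    """All terms of total degree <= 3 in the expansion of e^(x^2 + y) about the origin."""
    return 1 + y + x * x + 0.5 * y * y + x * x * y + y ** 3 / 6

def error(t):
    return abs(f(t, t) - taylor3(t, t))

for t in (0.1, 0.05, 0.025):
    print(t, error(t))

ratio = error(0.1) / error(0.05)   # close to 2**4 = 16 if the error is fourth order
print(ratio)
```

Halving the distance to the origin cuts the error by a factor near 16, which is what an O(|x|⁴) remainder predicts.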

is, as ever, so very helpful

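using the chain rule

taylor expand $\sqrt{1+y+xz}$ about x=y=z=0 up to terms of order three

$$(1+u)^{1/2} = 1 + \tfrac{1}{2}u - \tfrac{1}{8}u^2 + \tfrac{1}{16}u^3 + O(u^4)$$

$$\begin{aligned}
\sqrt{1+y+xz} &= 1 + \tfrac{1}{2}(y+xz) - \tfrac{1}{8}(y+xz)^2 + \tfrac{1}{16}(y+xz)^3 + O(|x|^4) \\
&= 1 + \tfrac{1}{2}(y+xz) - \tfrac{1}{8}(y^2+2xyz) + \tfrac{1}{16}(y^3) + O(|x|^4) \\
&= 1 + \tfrac{1}{2}y + \tfrac{1}{2}xz - \tfrac{1}{8}y^2 - \tfrac{1}{4}xyz + \tfrac{1}{16}y^3 + O(|x|^4)
\end{aligned}$$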

Why bother with taylor series?

remember, when given a “strange” function, and a particular local region of inputs, taylor expansion reveals its behavior…

local solutions to equations

approximate solutions to

$$\sin(xy) = e^{x^3} - \cos(x^2y)$$

near ( 0, 0 )

$$xy - O((xy)^3) = \left(1 + x^3 + O(x^6)\right) - \left(1 - O((x^2y)^2)\right)$$

$$xy - x^3 + O(|x|^6) = 0$$

ignore higher order terms

$$xy - x^3 = x\,(y - x^2) = 0$$

so, near the origin, solutions lie close to the line x = 0 and the parabola y = x²
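The local picture can be checked numerically: the residual of the original equation should be tiny on the approximate solution curves and noticeably larger off them. This is a sketch only; it assumes the equation reads sin(xy) = e^{x³} − cos(x²y), which is what the series expansions in the example correspond to, and the sample points are arbitrary.

```python
import math

def residual(x, y):
    """sin(xy) - e^(x^3) + cos(x^2 y); zero exactly on the true solution set."""
    return math.sin(x * y) - math.exp(x ** 3) + math.cos(x * x * y)

on_parabola = residual(0.1, 0.1 ** 2)   # on y = x^2: only sixth-order terms remain
on_axis = residual(0.0, 0.3)            # on x = 0: exactly zero
off_curve = residual(0.1, 0.5)          # away from both curves: roughly xy - x^3

print(on_parabola, on_axis, off_curve)
```

On the parabola the leftover is of size x⁶, matching the terms that were ignored.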

complicated? yes. Useful? oh yes.

let’s rewrite the second-order expansion to see a pattern…

$$f(a+h) = f(a) + \frac{1}{1!}\,[Df]_a\, h + \frac{1}{2!}\, h^T [D^2f]_a\, h + O(|h|^3)$$

we can form the 2nd partial derivatives into a matrix, making this term a quadratic form

the hessian (or 2nd derivative)

$$[D^2f]_{i,j} = \frac{\partial^2 f}{\partial x_i\,\partial x_j}$$

$$[D^2f] = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1\,\partial x_n} \\ \vdots & \dfrac{\partial^2 f}{\partial x_j^2} & \vdots \\ \dfrac{\partial^2 f}{\partial x_n\,\partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

this is a symmetric matrix: $[D^2f] = [D^2f]^T$

but it’s not a 2-d array… such multidimensional arrays are known as tensors, and they have a rather complex algebraic structure

the big picture

as with single-variable taylor series, multivariate taylor series are often easily computed via the chain rule

1

make sure you remember your taylor series! write out the standard, singlevariable taylor series about zero for the following functions: a)  © b)  © c)  (1+©) d) 1 1-© α e) h © f) h © g) (1+©) be sure to include any bounds on domain of convergence…

2

compute the taylor series of the following functions about the origin, using standard taylor series and composition. a)  ( x2 – 2y ) all terms up to & including order 4 b) h ( xz + y3 )

all terms up to & including order 9

c)  ( 1 + x  ( y – xey ) )

all terms up to & including order 5

d) ( 1 + ( 1 + x ) / ( 1 – y ) ) 1/2

all terms up to & including order 3

you learned how to solve max/min problems ?

in the case of a multi-input function

$f:\mathbb{R}^n\to\mathbb{R}$

or two, to find & classify extrema of functions…

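for $f:\mathbb{R}^n\to\mathbb{R}$, a critical point is an input whose derivative is zero…

$$[Df] = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} & \dfrac{\partial f}{\partial x_2} & \cdots & \dfrac{\partial f}{\partial x_n} \end{bmatrix} = \begin{bmatrix} 0 & 0 & \cdots & 0 \end{bmatrix}$$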

of course, this is, equivalently, a point where the gradient is zero

a real-valued function $f:\mathbb{R}^n\to\mathbb{R}$ attains a local maximum or minimum at a only if the input a is a critical point

and just like single-variable calculus

finding the critical points

find and classify the extrema of

$$f(x, y) = x^3 + y^3 - 3xy$$

$$\frac{\partial f}{\partial x} = 3x^2 - 3y = 0 \qquad\qquad \frac{\partial f}{\partial y} = 3y^2 - 3x = 0$$

critical points: ( 0, 0 ) and ( 1, 1 )

$f(0,0) = 0$ ; $f(1,1) = -1$

$f \to +\infty$ as $x, y \to +\infty$ ; $f \to -\infty$ as $x, y \to -\infty$

is what we need

$$f(a+h) = f(a) + [Df]_a\, h + \frac{1}{2!}\, h^T [D^2f]_a\, h + O(|h|^3)$$

THE SECOND DERIVATIVE DOMINATES LOCAL BEHAVIOR

$h^T [D^2f]_a\, h > 0$ for all $h \neq 0$ IMPLIES THAT THE CRITICAL POINT IS A LOCAL MINIMUM

$h^T [D^2f]_a\, h < 0$ for all $h \neq 0$ IMPLIES THAT THE CRITICAL POINT IS A LOCAL MAXIMUM

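$$[D^2f] = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \qquad \det = ad - bc \qquad \operatorname{tr} = a + d$$

it’s easy (and sufficient) to consider the case of a diagonal matrix. in this case, the signs of the diagonal terms classify the critical point

$\begin{bmatrix} + & 0 \\ 0 & - \end{bmatrix}$ : det < 0

$\begin{bmatrix} + & 0 \\ 0 & + \end{bmatrix}$ : det > 0 , tr > 0

$\begin{bmatrix} - & 0 \\ 0 & - \end{bmatrix}$ : det > 0 , tr < 0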

ASSUME $f:\mathbb{R}^2\to\mathbb{R}$
ASSUME a critical

> if DET$[D^2f]_a$ < 0 then a = SADDLE;
> if DET$[D^2f]_a$ > 0 then
>     TR$[D^2f]_a$ > 0 => a = LOCAL MIN;
>     TR$[D^2f]_a$ < 0 => a = LOCAL MAX;
> else a = DEGENERATE : FAIL

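classify $f(x, y) = x^3 + y^3 - 3xy$

$$[D^2f] = \begin{bmatrix} 6x & -3 \\ -3 & 6y \end{bmatrix}$$

at ( 1, 1 ): $[D^2f] = \begin{bmatrix} 6 & -3 \\ -3 & 6 \end{bmatrix}$ ; $\det [D^2f] = 27 > 0$ ; $\operatorname{tr} [D^2f] = 12 > 0$ ; so ( 1, 1 ) is a local min

at ( 0, 0 ): $[D^2f] = \begin{bmatrix} 0 & -3 \\ -3 & 0 \end{bmatrix}$ ; $\det [D^2f] = -9 < 0$ ; so ( 0, 0 ) is a saddle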
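The trace-determinant test is mechanical enough to write down as code. Below is a minimal sketch for the example f(x, y) = x³ + y³ − 3xy; the gradient and hessian are entered by hand, and the function names are made up for illustration.

```python
def gradient(x, y):
    """Partials of f(x, y) = x^3 + y^3 - 3xy."""
    return (3 * x * x - 3 * y, 3 * y * y - 3 * x)

def hessian(x, y):
    """Second derivative [D^2 f] of the same f, as a 2-by-2 tuple of rows."""
    return ((6 * x, -3.0), (-3.0, 6 * y))

def classify(H):
    """Trace-determinant test for a 2-by-2 second derivative at a critical point."""
    (a, b), (c, d) = H
    det = a * d - b * c
    tr = a + d
    if det < 0:
        return "saddle"
    if det > 0:
        return "local min" if tr > 0 else "local max"
    return "degenerate"

for point in [(0.0, 0.0), (1.0, 1.0)]:
    print(point, gradient(*point), classify(hessian(*point)))
```

The test only makes sense at critical points, so the sketch prints the gradient alongside the verdict to confirm it vanishes there.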

this function vanishes to 2nd order at the origin

this is sometimes called a "monkey saddle"

a classification of n-dimensional critical points

see the epilogue for a few hints…

before dealing with these complications

the big picture

the second derivative classifies critical points into local maxima, local minima, & saddles, (or degenerates)

1

Compute and classify the critical points of the following functions: A) $2x^2 + 2y - 4y^2 + 4x$

b) $x(x+1)(y-2) - y(x+3)(y-4)$

c) $x^3 + 3xy + 2y^2 + 3x - 4y$

d) $x^3 - 4xy + 2y^2$

e)  ( x2 + 4y2 + 1 )

f) h ( (x-1)2 -(y-2)2 ) h) (x1-1)2 + (x2-2)2 + (x3-3)2 + … + (xn-n)2

g) $e^{-x^2-(y+2)^2}$

2

For which values of C will the function $f(x,y) = Cx^2 + 4xy + Cy^2$ have a (local) maximum at (0,0)? What about minimum? Saddle?

3

Compute the second derivative (or “hessian”) of the following functions at 0 : 3x2 + 4xy -4y2 + 4x -3y + 2

A) c)  (x-3y) + 2x  (y-2x) 2 | x | = x*x e)

2

2

e2x +3y

b) d) x2 + 2xz - y2 + 3xy -2yz + 4z2 f) xTAx for A a square matrix

The treatment of critical point classification given in this chapter only works for 2-d and uses trace and determinant. Both this and the general n-dimensional case are greatly simplified with the use of eigenvalues. From the Taylor expansion of $f : \mathbb{R}^n \to \mathbb{R}$ about a critical point $a$, one sees that the second derivative (or Hessian) is the dominant term:

$$f(a+h) = f(a) + \tfrac{1}{2}\, h\, [D^2 f]\, h^T + \dots$$

Denote by $\{\lambda_i\}$, $i = 1 \dots n$, the eigenvalues of $[D^2 f]$.

LEMMA: All eigenvalues $\lambda_i$ of $[D^2 f]$ are real. (This follows from $[D^2 f]$ being a symmetric matrix, due to mixed partial derivatives commuting. From there, well, you should take a linear algebra course. Really!)

CONCLUSION: The signs of the eigenvalues $\{\lambda_i\}$ completely determine the critical point type:
1) ALL $\lambda_i > 0$ ==> MINIMUM (ie, $[D^2 f]$ positive definite)
2) ALL $\lambda_i < 0$ ==> MAXIMUM (ie, $[D^2 f]$ negative definite)
3) $\lambda_i > 0 > \lambda_j$ ==> SADDLE (some go up, all others down)
4) if any $\lambda_i = 0$ ==> DEGENERATE (need more info to determine)

COROLLARY: The quadratic form in the Taylor series about a critical point $a$ converts after a linear change of coordinates to:

$$f(a+h) = f(a) + \tfrac{1}{2} \sum_i \lambda_i h_i^2 + O(|h|^3)$$

This is the best way to classify extrema.
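The eigenvalue test can be tried out on small cases. The following is a minimal sketch for the 2-by-2 symmetric case only, using the closed-form eigenvalues of [[a, b], [b, d]]; the function names and the tolerance `tol` are illustrative choices, not part of the text. It is applied to the two second derivatives [[6, -3], [-3, 6]] and [[0, -3], [-3, 0]] of x3 + y3 - 3xy.

```python
import math

def eigenvalues_sym2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], smallest first."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius, mean + radius

def classify(a, b, d, tol=1e-12):
    """Classify a critical point from the signs of the Hessian eigenvalues."""
    lo, hi = eigenvalues_sym2(a, b, d)
    if abs(lo) <= tol or abs(hi) <= tol:
        return "degenerate"      # some eigenvalue is zero: need more info
    if lo > 0:
        return "minimum"         # all eigenvalues positive
    if hi < 0:
        return "maximum"         # all eigenvalues negative
    return "saddle"              # mixed signs

print(classify(6, -3, 6))   # second derivative of x^3 + y^3 - 3xy at (1, 1)
print(classify(0, -3, 0))   # second derivative of x^3 + y^3 - 3xy at (0, 0)
```

For larger matrices one would normally hand the eigenvalue computation to a linear algebra library; the classification step stays the same.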

I have read the above completely and agree to abide by these terms

given a collection of data points in the plane, what is the “best fit line” through them?

what if we don’t want to “take a good guess”? that’s what you do, right?

data presented as n pairs $(x_i, y_i)$: find a “best fit” line $y = mx + b$

minimize the square sum of vertical differences between points and the line

$$f(m, b) = \sum_i \bigl( y_i - (m x_i + b) \bigr)^2$$

here $m$ is the slope, $b$ is the Y-intercept, $y_i$ is the Y-value of a data point, and $m x_i + b$ is the Y-coordinate of the line

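$$\frac{\partial f}{\partial m} = \sum_i \frac{\partial}{\partial m} \bigl( y_i - (m x_i + b) \bigr)^2 \qquad \text{(by linearity)}$$

$$= \sum_i 2 \bigl( y_i - (m x_i + b) \bigr)\, \frac{\partial}{\partial m} \bigl( y_i - (m x_i + b) \bigr) \qquad \text{(chain rule)}$$

$$= \sum_i 2 \bigl( y_i - (m x_i + b) \bigr) (-x_i) = \sum_i -2 x_i \bigl( y_i - (m x_i + b) \bigr)$$

take the partial there, that’s not so bad!

$$\frac{\partial f}{\partial b} = \sum_i 2 \bigl( y_i - (m x_i + b) \bigr)\, \frac{\partial}{\partial b} \bigl( y_i - (m x_i + b) \bigr) = \sum_i -2 \bigl( y_i - (m x_i + b) \bigr)$$

$$[Df] = \begin{bmatrix} \dfrac{\partial f}{\partial m} & \dfrac{\partial f}{\partial b} \end{bmatrix} = \begin{bmatrix} \sum_i -2 x_i (y_i - m x_i - b) & \sum_i -2 (y_i - m x_i - b) \end{bmatrix}$$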

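set the derivative to zero:

$$\sum_i -2 x_i (y_i - m x_i - b) = 0, \qquad \sum_i -2 (y_i - m x_i - b) = 0$$

expand

$$\sum_i x_i y_i - \sum_i m x_i^2 - \sum_i b x_i = 0, \qquad \sum_i y_i - \sum_i m x_i - \sum_i b = 0$$

pull out the variables

$$\sum_i x_i y_i - m \sum_i x_i^2 - b \sum_i x_i = 0, \qquad \sum_i y_i - m \sum_i x_i - b \sum_i 1 = 0$$

rearrange

$$m \sum_i x_i^2 + b \sum_i x_i = \sum_i x_i y_i, \qquad m \sum_i x_i + b \sum_i 1 = \sum_i y_i$$

a linear system!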

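$$\begin{bmatrix} \sum x_i^2 & \sum x_i \\ \sum x_i & \sum 1 \end{bmatrix} \begin{bmatrix} m \\ b \end{bmatrix} = \begin{bmatrix} \sum x_i y_i \\ \sum y_i \end{bmatrix}, \qquad \sum 1 = n$$

that is, $A\mathbf{x} = \mathbf{b}$ with

$$A = \begin{bmatrix} \sum x_i^2 & \sum x_i \\ \sum x_i & n \end{bmatrix}, \qquad \mathbf{x} = \begin{bmatrix} m \\ b \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} \sum x_i y_i \\ \sum y_i \end{bmatrix}$$

is A invertible? well…

$$\det A = n \sum x_i^2 - \Bigl( \sum x_i \Bigr)^2$$

which is >0 so long as the xi are not all same

thanks to the standard 2-by-2 formula

$$A^{-1} = \frac{1}{n \sum x_i^2 - (\sum x_i)^2} \begin{bmatrix} n & -\sum x_i \\ -\sum x_i & \sum x_i^2 \end{bmatrix}$$

$$\begin{bmatrix} m \\ b \end{bmatrix} = A^{-1} \mathbf{b} = \frac{1}{n \sum x_i^2 - (\sum x_i)^2} \begin{bmatrix} n & -\sum x_i \\ -\sum x_i & \sum x_i^2 \end{bmatrix} \begin{bmatrix} \sum x_i y_i \\ \sum y_i \end{bmatrix}$$

slope:

$$m = \frac{n \sum (x_i y_i) - (\sum x_i)(\sum y_i)}{n \sum x_i^2 - (\sum x_i)^2}$$

Y-intercept:

$$b = \frac{-(\sum x_i) \sum (x_i y_i) + (\sum x_i^2)(\sum y_i)}{n \sum x_i^2 - (\sum x_i)^2}$$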
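The slope and Y-intercept formulas translate directly into a few sums. The following is a minimal sketch in plain Python; the function name `best_fit_line` and the sample points are illustrative assumptions, not part of the text.

```python
def best_fit_line(points):
    """Return (m, b) minimizing the sum of squared vertical differences."""
    n = len(points)
    sx = sum(x for x, _ in points)          # sum of x_i
    sy = sum(y for _, y in points)          # sum of y_i
    sxx = sum(x * x for x, _ in points)     # sum of x_i^2
    sxy = sum(x * y for x, y in points)     # sum of x_i * y_i
    det = n * sxx - sx * sx                 # det A; zero iff all x_i coincide
    if det == 0:
        raise ValueError("all x values coincide; no unique best-fit line")
    m = (n * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return m, b

# points sampled from the line y = 2x + 1
m, b = best_fit_line([(0, 1), (1, 3), (2, 5)])
print(m, b)
```

Since these sample points lie exactly on a line, the fit should return that line.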

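how do we know this solution MINIMIZES distance to the line?

we know…

$$[Df] = \begin{bmatrix} \sum_i -2 x_i (y_i - m x_i - b) & \sum_i -2 (y_i - m x_i - b) \end{bmatrix}$$

compute the second derivatives…

$$[D^2 f] = \begin{bmatrix} 2 \sum_i x_i^2 & 2 \sum_i x_i \\ 2 \sum_i x_i & 2n \end{bmatrix} = 2A$$

twice the matrix we used earlier!

$$\operatorname{tr} [D^2 f] > 0, \qquad \det [D^2 f] = 4 \det A = 4 \Bigl( n \sum_i x_i^2 - \bigl( \sum_i x_i \bigr)^2 \Bigr) = 4 \sum_{i<j} (x_i - x_j)^2 > 0$$

thus, this choice of m, b, minimizes the least-squares distance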

there’s a lot more to linear regression than just planar line-fitting…

with n dependent and m independent variables, one is looking for a best-fit m-dimensional “subspace” in $\mathbb{R}^{n+m}$

the “goodness” of the fit is crucial

& not all relationships are linear!

LINEAR regression is AS INTERESTING AS IT IS USEFUL…

& IS A GREAT REASON TO LEARN MORE LINEAR ALGEBRA

IS THERE NON-LINEAR REGRESSION? OF COURSE!

ONE CAN USE A LOW-ORDER POLYNOMIAL APPROXIMATION & SOLVE FOR THE COEFFICIENTS OF THE TAYLOR SERIES

& BEYOND THIS LIES…

but that’s another story…

the big picture

regression formulae in statistics look intimidating, but are easily derived as solutions to optimization problems

1

what happens to the best fit line when you rescale the yi values in the data set by a (nonzero) constant C? [that is, the new yi equals Cyi]

2

consider the following set of points in the plane: (xi , yi) = ((i), (i)) : i = 1…n
A) use software to compute the values of m and b for the best fit line, for various values of n. be sure to use radians when computing (xi , yi). any patterns?
b) now, if you wish, plot the points (xi , yi) for the same values of n. now what do you notice? does this shed light on your answers for m and b above?

3

prove that if the data set (xi , yi) consists of points that are sampled from a straight line, then the linear regression values (m, b) recover that line.

4

a challenge: given data (xi , yi , zi) try to find the best-fit plane z = ax + by + c by minimizing the sum of square-distances along the z axis. can you do it? what size matrices do you get? can you get explicit formulae for a, b, & c?

is another field in which optimization plays a part

this is a large subject, and we will merely sample it… specifically, we will look at

two person zero sum finite games

encode such games

each player has a finite set of strategies each round, they (privately) pick their strategies & play

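$$P = \begin{bmatrix} 1 & -3 & 5 & -1 & -4 \\ -7 & -1 & 0 & 3 & 6 \\ -1 & 5 & -1 & 2 & -1 \\ 6 & 0 & -3 & -7 & 3 \end{bmatrix}$$

two players, A & B: the rows are player A’s strategies 1, 2, …, m and the columns are player B’s strategies 1, 2, …, n

player A “wins” and B “loses” the (EQUAL) payout value

A and B play repeatedly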

Pij = payoff from strategy choices ( wins 3 from )

some classical payofF matrices… rock, scissors, paper

$$P = \begin{bmatrix} 0 & 1 & -1 \\ -1 & 0 & 1 \\ 1 & -1 & 0 \end{bmatrix}$$

this game is “fair” in that all plays are symmetric. you can “see” the fairness in the “skew-symmetry” $P^T = -P$

some classical payofF matrices… The even-odd game

$$P = \begin{bmatrix} -2 & 3 \\ 3 & -4 \end{bmatrix}$$

Two players: “ODD” AND “EVEN” EACH PUTS OUT 1 OR 2 FINGERS. IF THE SUM TOTAL IS ODD, ODD WINS THAT AMOUNT; ELSE EVEN WINS THE SUM

some classical payofF matrices… A MENDELSOHN GAME

$$P = \begin{bmatrix} 0 & 1 & -2 \\ -1 & 0 & 1 \\ 2 & -1 & 0 \end{bmatrix}$$

EACH PLAYER CHOOSES 1, 2, OR 3. IF THE SAME NUMBER, NO PAYOFF. IF YOU ARE HIGHER BY 1, YOU LOSE 1 POINT. IF YOU ARE HIGHER BY 2, YOU WIN 2 POINTS.

You played strategies at random ?

this is a very cool idea !

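player A picks strategies 1 to 4 with probabilities (.25, .4, .25, .1); player B picks strategies 1 to 5 with probabilities (.1, .2, .3, .1, .3). These must add up to 100%

$$P = \begin{bmatrix} 1 & -3 & 5 & -1 & -4 \\ -7 & -1 & 0 & 3 & 6 \\ -1 & 5 & -1 & 2 & -1 \\ 6 & 0 & -3 & -7 & 3 \end{bmatrix}$$

if player a plays strategy 2 for 40% of the games & player b plays strategy 4 for 10% of the games, then this combination happens 4% of the time

with net expected payoff for this combination… $0.04 \times P_{24} = 0.04 \times 3 = 0.12$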

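player A chooses among m strategies at random using $a$; player B chooses among n strategies at random using $b$

$$a = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{bmatrix}, \qquad b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}$$

These must add up to one

the average payoff is

$$a^T P\, b$$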

random play leads to predictable average outcome
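The average payoff a^T P b is a double sum over strategy pairs, each weighted by how often that pair occurs. The following is a minimal sketch using the 4-by-5 payoff matrix and the probabilities from the example above; the function name `expected_payoff` is an illustrative choice.

```python
def expected_payoff(a, P, b):
    """Average payoff a^T P b for mixed strategies a (rows) and b (columns)."""
    return sum(a[i] * P[i][j] * b[j]
               for i in range(len(a))
               for j in range(len(b)))

P = [[ 1, -3,  5, -1, -4],
     [-7, -1,  0,  3,  6],
     [-1,  5, -1,  2, -1],
     [ 6,  0, -3, -7,  3]]
a = [0.25, 0.4, 0.25, 0.1]       # player A's probabilities, summing to 1
b = [0.1, 0.2, 0.3, 0.1, 0.3]    # player B's probabilities, summing to 1

# contribution of the single combination (A plays 2, B plays 4)
print(a[1] * P[1][3] * b[3])
# average payoff over all combinations
print(expected_payoff(a, P, b))
```

Indices are zero-based in the code, so strategy 2 for A and strategy 4 for B correspond to `P[1][3]`.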

MIXED STRATEGY FOR THE EVEN-ODD GAME PAYOFF MATRIX

$$P = \begin{bmatrix} -2 & 3 \\ 3 & -4 \end{bmatrix}$$

As you play over & over, you can average the payoff per turn

Let’s say players choose 1 or 2 at random with some probability…

probability distributions:

$$a = \begin{bmatrix} a \\ 1-a \end{bmatrix}, \qquad b = \begin{bmatrix} b \\ 1-b \end{bmatrix}, \qquad 0 \le a \le 1, \quad 0 \le b \le 1$$

Each player has a separate random strategy. These are vectors adding up to 1

the average payout (from P2 to P1) is:

$$f(a, b) = a^T P\, b = -2ab + 3(1-a)b + 3a(1-b) - 4(1-a)(1-b) = -4 + 7a + 7b - 12ab$$

$$\frac{\partial f}{\partial a} = 7 - 12b = 0, \qquad \frac{\partial f}{\partial b} = 7 - 12a = 0$$

$$b = \tfrac{7}{12}, \qquad a = \tfrac{7}{12}$$

$$a = \begin{bmatrix} 7/12 \\ 5/12 \end{bmatrix}, \qquad b = \begin{bmatrix} 7/12 \\ 5/12 \end{bmatrix}$$

But what kind of optimum is this?

$$[D^2 f] = \begin{bmatrix} 0 & -12 \\ -12 & 0 \end{bmatrix}$$

the determinant is -144 < 0, so this critical point is a saddle

optimal average payout (from P2 to P1):

$$f(7/12, 7/12) = -4 + 49/12 + 49/12 - 49/12 = 1/12$$

PLAYER 1 (“ODD”) HAS A SLIGHT ADVANTAGE & CAN GUARANTEE AN AVERAGE NET WIN

NO MATTER WHAT P2 DOES, P1 WILL WIN ON AVERAGE AT LEAST 1/12

NO MATTER WHAT P1 DOES, P2 CAN LOSE ON AVERAGE NO MORE THAN 1/12
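The guarantee can be checked with exact arithmetic: when one player uses probability 7/12, the average payout no longer depends on what the other player does. The following is a minimal sketch with Python's `fractions` module; the grid of test probabilities is an arbitrary illustrative choice.

```python
from fractions import Fraction

def payout(a, b):
    """Average payout f(a, b) = -4 + 7a + 7b - 12ab for the even-odd game."""
    return -4 + 7 * a + 7 * b - 12 * a * b

star = Fraction(7, 12)
grid = [Fraction(k, 10) for k in range(11)]   # probabilities 0, 0.1, ..., 1

# P1 plays 7/12: the set of payouts over all of P2's choices
against_any_b = {payout(star, b) for b in grid}
# P2 plays 7/12: the set of payouts over all of P1's choices
against_any_a = {payout(a, star) for a in grid}

print(against_any_b, against_any_a)
```

Each set collapses to a single value, which is the saddle behaviour described above: neither player can move the average by changing strategy alone.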

AT A NASH EQUILIBRIUM, neither PLAYER CAN DO better, GIVEN THE OPPONENT’S STRATEGY


here’s a general (and powerful) theorem about saddle-point equilibria…

minimax theorem: given any payoff matrix P, there exists a nash equilibrium ( a, b ). that is, there is a mixed strategy pair such that

$$\max_x \; x^T ( P b ) = \min_y \; ( a^T P )\, y$$

(the left side maximizes gain; the right side minimizes loss)

this requires some deeper tools than calculus…

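SOME 3-BY-3 nash equilibria. With work, you can compute these…

rock, scissors, paper:

$$P = \begin{bmatrix} 0 & 1 & -1 \\ -1 & 0 & 1 \\ 1 & -1 & 0 \end{bmatrix}, \qquad a = \begin{bmatrix} 1/3 \\ 1/3 \\ 1/3 \end{bmatrix}, \qquad b = \begin{bmatrix} 1/3 \\ 1/3 \\ 1/3 \end{bmatrix}$$

the mendelsohn game:

$$P = \begin{bmatrix} 0 & 1 & -2 \\ -1 & 0 & 1 \\ 2 & -1 & 0 \end{bmatrix}, \qquad a = \begin{bmatrix} 1/4 \\ 1/2 \\ 1/4 \end{bmatrix}, \qquad b = \begin{bmatrix} 1/4 \\ 1/2 \\ 1/4 \end{bmatrix}$$

a third game, with payoff entries 0 1 -2 -2 0 4 3 -2 1:

$$a = \begin{bmatrix} 9/23 \\ 7/23 \\ 7/23 \end{bmatrix}, \qquad b = \begin{bmatrix} 17/46 \\ 10/23 \\ 9/46 \end{bmatrix}$$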

IN GENERAL, YOU HAVE TO BE CAREFUL (OR LUCKY). SOME GAMES HAVE SADDLE POINTS WITH COORDINATES THAT VIOLATE THE CONSTRAINTS (E.G., NEGATIVE). FOR NON-SQUARE MATRICES, YOU NEED A DIFFERENT APPROACH FOR FINDING THE NASH EQUILIBRIUM

1 what happens if the payoffs change as you play?
2 what happens if you have more than two players?
3 what happens if the set of strategies is not discrete?

the big picture

payoff matrices, acting on vectors of probability distributions, lead to optimal strategies, which give saddle points

1

what is the payoff matrix for the even-odd game in which players choose numbers in the set { 1 , 2 , 3 , 4 } ?

2

consider the following payoff matrices: 0 2 -1 1 3 -3 A) b) c) -2 0 2 -2 0 1 3 -2 1 -2 0 2 -2 1 compute the payoff functions and the resulting nash equilibrium and expected payoff for each. -1 2

3

recall that a matrix A is skew-symmetric if $A^T = -A$. is the product of two skew-symmetric matrices also skew-symmetric?

4

try to compute the nash equilibrium for the payoff matrix: what goes wrong? why? what is the optimal strategy?

2 -1 3 -2

Often come with constraints

The size of the tumor must be non-negative

rent cannot exceed 60% of after-tax income

a rectangular box can be shipped only if the length plus the girth (perimeter of the cross-section) is below 120cm.

This engine must operate from -20°C to 40°C

total parts per million cannot exceed 350

the total expenditure on materials, capital equipment, & labor must not exceed available investment funds

is something you should remember…

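bounded optimization in the 1-d case: optimize $f(x) = x^3 - 2x^2 + x - 4$ over $-2 \le x \le 2$

$$f' = 3x^2 - 4x + 1 = 0 \quad\Rightarrow\quad 0 = (3x - 1)(x - 1) \quad\Rightarrow\quad x = 1/3 \ \text{or}\ x = 1$$

$$f'' = 6x - 4$$

at x = 1/3, f" = -2 (a local max), with f( 1/3 ) = 4/27 - 4

at x = 1, f" = 2 (a local min), with f( 1 ) = -4

at the endpoints… f(-2) = -22 and f(2) = -2

however… f( 1/3 ) = 4/27 - 4 < -2 = f(2), and f( 1 ) = -4 > -22 = f(-2), so neither local extremum is the global one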

Ok, ok, so we have to check endpoints…
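In one dimension the check is a finite comparison: evaluate f at the critical points and at the two endpoints, then pick the largest and smallest. The following is a minimal sketch for the example above; the variable names are illustrative.

```python
def f(x):
    return x**3 - 2 * x**2 + x - 4

critical_points = [1 / 3, 1.0]     # roots of f'(x) = (3x - 1)(x - 1)
endpoints = [-2.0, 2.0]            # boundary of the interval -2 <= x <= 2
candidates = critical_points + endpoints

x_max = max(candidates, key=f)     # location of the global maximum
x_min = min(candidates, key=f)     # location of the global minimum

print(x_max, f(x_max))
print(x_min, f(x_min))
```

Here both global extrema sit at endpoints, which is why the interior critical points alone would give the wrong answer.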

What’s the big deal? Just check the boundary points and move on…

the boundaries are not a finite set!

A simple boundary constraint

maximize f( x, y, z ) = xyz over x, y, z with x + y + z = 30 and x, y, z ≥ 0

use the constraint z = 30 - x - y, then maximize

$$f(x, y) = xy\,(30 - x - y)$$

this is a nice, 2-d problem…

$$[Df] = \begin{bmatrix} 30y - 2xy - y^2 & 30x - x^2 - 2xy \end{bmatrix}$$

$$0 = y\,(30 - 2x - y), \qquad 0 = x\,(30 - x - 2y)$$

two solutions to each equation

( 0 , 0 ) ( 0 , 30 ) ( 30 , 0 ) ( 10 , 10 )

there are four critical points to be classified. hmmmmm… i wonder which is the answer?

f( 0, 0 ) = 0*0*30 = 0
f( 30, 0 ) = 30*0*0 = 0
f( 0, 30 ) = 0*30*0 = 0
f( 10, 10 ) = 10*10*10 = 1000

$$[D^2 f] = \begin{bmatrix} -2y & 30 - 2x - 2y \\ 30 - 2x - 2y & -2x \end{bmatrix} \;\overset{(10,10)}{=}\; \begin{bmatrix} -20 & -10 \\ -10 & -20 \end{bmatrix}$$

$$\det [D^2 f] = 300 > 0, \qquad \operatorname{tr} [D^2 f] = -40 < 0$$

so ( 10 , 10 ) is a local maximum

[figure: the legal domain in the (x, y) plane, a triangle with edges x = 0, y = 0, and z = 0]

since the function vanishes along the entire boundary of the legal domain, we conclude that the interior maximum is, indeed, the global maximum

is not a fixed boundary at all…

an optimal box without bounds

choose box dimensions x, y, z with xyz = 48 and x, y, z ≥ 0, at material costs

front/back = $1/ft2 top/bottom = $2/ft2 left/right = $3/ft2

f( x, y, z ) = 4xy + 2xz + 6yz

use the constraint z = 48 / xy, then minimize

f( x, y ) = 4xy + 96/y + 288/x

this is a nice, 2-d problem…

[ Df ] = [ 4y - 288/x^2 , 4x - 96/y^2 ]

setting 4y = 288/x^2 and 4x = 96/y^2 gives x = 6 & y = 2

there is a single critical point

[ D2f ] = [ 576/x^3 , 4 ; 4 , 192/y^3 ], which at (6, 2) equals [ 8/3 , 4 ; 4 , 24 ]

det > 0 tr > 0

f( x, y ) = 4xy + 96/y + 288/x has its critical point at x = 6 & y = 2; this is a local minimum

as x → 0, f → ∞

as y → 0, f → ∞

since the function has a single local minimum and “blows up” to infinity along all boundaries and as you go off “to infinity”, it is a global minimum
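A coarse numerical scan is one way to sanity-check the claim that the single critical point is the global minimum. The following is a minimal sketch; the grid range and spacing are arbitrary illustrative choices, and a scan like this suggests rather than proves the result.

```python
def cost(x, y):
    """Material cost of the box after substituting z = 48 / (x * y)."""
    return 4 * x * y + 96 / y + 288 / x

# sample x and y at 0.25, 0.5, ..., 20.0
grid = [k / 4 for k in range(1, 81)]
best = min(((x, y) for x in grid for y in grid), key=lambda p: cost(*p))

print(best, cost(*best))
```

The corresponding third dimension is z = 48 / (6 * 2) = 4.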

that all problems are this simple

a parameterized boundary

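minimize a stress function over x, y with $-2 \le x, y \le 3$

$$f(x, y) = x^3 - 6xy + 3y^2$$

use the derivative

$$\frac{\partial f}{\partial x} = 3x^2 - 6y = 0, \qquad \frac{\partial f}{\partial y} = -6x + 6y = 0 \quad\Rightarrow\quad (0, 0) \ \&\ (2, 2)$$

$$[D^2 f] = \begin{bmatrix} 6x & -6 \\ -6 & 6 \end{bmatrix}$$

at $(0, 0)$: $[D^2 f] = \begin{bmatrix} 0 & -6 \\ -6 & 6 \end{bmatrix}$, det < 0

at $(2, 2)$: $[D^2 f] = \begin{bmatrix} 12 & -6 \\ -6 & 6 \end{bmatrix}$, det > 0 and tr > 0

there are two critical points in the interior of the square, and one is a local minimum. one would guess that this is the desired minimum; however, the entire boundary must be examined…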

on the top edge y = 3:

$$f(x, 3) = x^3 - 18x + 27, \qquad -2 \le x \le 3$$

$$f' = 3x^2 - 18, \qquad f' = 0 \ \text{at}\ x = \sqrt{6}, \qquad f'' = 6x > 0 \ \text{at}\ x = \sqrt{6}$$

local min at $(\sqrt{6}, 3)$; local max at $(-2, 3)$; local max at $(3, 3)$

[figure: the square -2 ≤ x, y ≤ 3 with the points (-2, 3), (√6, 3), (-2, -2), and (3, -2) marked]

can you imagine what it would take to do this problem for a 3-d cube? or a more complex shape?
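Before working through every edge by hand, a brute-force scan of the square gives a useful hint about where the minimum lives. The following is a minimal sketch; the 0.1 grid spacing is an arbitrary illustrative choice, and a scan like this does not replace the edge-by-edge analysis.

```python
def stress(x, y):
    return x**3 - 6 * x * y + 3 * y**2

# grid over the square -2 <= x, y <= 3 with spacing 0.1 (51 points per side)
grid = [-2 + k / 10 for k in range(51)]
best = min(((x, y) for x in grid for y in grid), key=lambda p: stress(*p))

print(best, stress(*best))        # smallest sampled value and its location
print(stress(2, 2))               # value at the interior local minimum
```

The scan lands on a corner of the square rather than on the interior local minimum, which is exactly why the boundary (corners included) has to be examined.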

the big picture Constraints are common (& commonly difficult!) Be sure to check the boundary, boundaries, & “infinity” as needed

1

compute the maximal volume of a cone whose radius, r, and height, h, must satisfy the constraint r+h = 10, with r and h positive.

2

find the point on the plane 2x + y - z = 4 which is closest to the origin. hint: compute the minimum of the square of the distance from ( x, y, z ) to ( 0, 0, 0 ). be sure to argue why it’s a minimum, based on what happens as you go to infinity.

3

find the maxima and minima of $f(x, y) = e^x e^{-y}$ on the rectangle given by -1 ≤ x ≤ 1 and -2 ≤ y ≤ 2. hint: don’t forget corners!

4

what are the maximum and minimum of the function $f(x, y) = x^2 y - 2xy$ on the disc of radius two given by $x^2 + y^2 \le 4$? note: you need to compute the extrema in the interior of the disc and then on the boundary. hint: parametrize the boundary using an angular variable θ via x = 2 cos θ, y = 2 sin θ.

5

challenge: maximize f(x, y) = x2 - 2y2 - 3x + y on the square given by -1 ≤ x, y ≤ 2

Is almost never easy, but…

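constrained to a circle… optimize $f(x, y) = x - 2y$ subject to $x^2 + y^2 - 4 = 0$

hey! no critical points on $\mathbb{R}^2$

parametrize the circle:

$$x(t) = 2\cos t, \qquad y(t) = 2\sin t, \qquad t = 0 \dots 2\pi$$

$$f(t) = x(t) - 2y(t) = 2\cos t - 4\sin t$$

$$f'(t) = -2\sin t - 4\cos t = 0 \quad\Rightarrow\quad \tan t = -2, \quad t = \arctan(-2) \ \text{or}\ \arctan(-2) + \pi$$

the usual single-variable method

$$\text{max: } \Bigl( \tfrac{2}{\sqrt{5}}, \tfrac{-4}{\sqrt{5}} \Bigr), \qquad \text{min: } \Bigl( \tfrac{-2}{\sqrt{5}}, \tfrac{4}{\sqrt{5}} \Bigr)$$

if you can parametrize the constraint set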

constrained to a circle… level sets of f

look at the level sets of f… it appears that the extrema of f, when restricted to the constraint set, occur precisely where the level set is tangent to the constraint set

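the level sets of f are tangent to the constraint set G=0 precisely where $\nabla f$ is parallel to $\nabla G$

with $f(x, y) = x - 2y$ and $G(x, y) = x^2 + y^2 - 4 = 0$:

$$\nabla f = \begin{bmatrix} 1 \\ -2 \end{bmatrix}, \qquad \nabla G = \begin{bmatrix} 2x \\ 2y \end{bmatrix}$$

$$\nabla f = \lambda \nabla G: \qquad 1 = 2\lambda x, \quad -2 = 2\lambda y \quad\Rightarrow\quad y = -2x$$

$$x^2 + y^2 = 4 \quad\Rightarrow\quad 5x^2 = 4$$

$$\text{max: } \Bigl( \tfrac{2}{\sqrt{5}}, \tfrac{-4}{\sqrt{5}} \Bigr), \qquad \text{min: } \Bigl( \tfrac{-2}{\sqrt{5}}, \tfrac{4}{\sqrt{5}} \Bigr)$$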
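The two points obtained from y = -2x and 5x2 = 4 can be compared against a direct sampling of the circle. The following is a minimal sketch; the sampling density of 3600 points is an arbitrary illustrative choice.

```python
import math

def f(x, y):
    return x - 2 * y

s = 2 / math.sqrt(5)
lagrange_max = (s, -2 * s)     # x = 2/sqrt(5), y = -4/sqrt(5)
lagrange_min = (-s, 2 * s)     # x = -2/sqrt(5), y = 4/sqrt(5)

# sample f along the circle x = 2 cos t, y = 2 sin t
samples = [f(2 * math.cos(t), 2 * math.sin(t))
           for t in (2 * math.pi * k / 3600 for k in range(3600))]

print(f(*lagrange_max), max(samples))
print(f(*lagrange_min), min(samples))
```

The sampled extremes should approach, but never exceed, the values at the two tangency points.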

ARE THE RIGHT PERSPECTIVE

a constrained max/min is not a max/min of the full function

Is it always the case that the constraint level set and the optimal function level sets are tangent? With parallel gradients?

AT OPTIMA, LEVEL SETS OF THE FUNCTION AND THE CONSTRAINT LEVEL SET ARE TANGENT!

$\nabla f \perp \{ G = 0 \}$

in 2-d & beyond!

$\nabla f \parallel \nabla G$

a general method for constrained optimization

given (differentiable) functions $f : \mathbb{R}^n \to \mathbb{R}$ and $G : \mathbb{R}^n \to \mathbb{R}$, any extremum $a$ of $f(x)$ restricted to $G(x) = 0$ must satisfy

$$[Df]_a = \lambda\, [DG]_a \quad \text{for some } \lambda$$

equivalently, $\nabla f|_a = \lambda\, \nabla G|_a$

The fine print: make sure that $[DG] \neq 0$ at $a$

this is equivalent to finding unconstrained optima of the “lagrangian”

$$L(x, \lambda) = f - \lambda G : \mathbb{R}^{n+1} \to \mathbb{R}$$

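a simple example

$$f(x, y) = x^2 - 2xy + 2y^2, \qquad G(x, y) = 2x - 3y - 5 = 0$$

Lagrange: $[Df] = \lambda [DG]$

$$\frac{\partial}{\partial x}: \quad 2x - 2y = 2\lambda \qquad\qquad \frac{\partial}{\partial y}: \quad -2x + 4y = -3\lambda$$

solve: $\lambda = x - y$

substitute: $-2x + 4y = -3(x - y)$

simplify: $x + y = 0$, so $y = -x$

substitute into G=0: $2x - 3(-x) - 5 = 0$

solve: $x = 1$, $y = -1$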

can you “see” that this is a minimum?

this is certainly solvable by other means; however, the lagrange multiplier method is “automatic” and requires little other than algebra…
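Because f is quadratic and G is linear in this example, the Lagrange equations together with the constraint form a linear system in (x, y, λ), so the “automatic” algebra can be handed to a small elimination routine. The following is a minimal sketch using exact fractions; the helper `solve` is an illustrative Gauss-Jordan routine written for this example and assumes the system has a unique solution.

```python
from fractions import Fraction

def solve(matrix, rhs):
    """Solve a square linear system by Gauss-Jordan elimination (exact)."""
    n = len(rhs)
    rows = [[Fraction(v) for v in row] + [Fraction(r)]
            for row, r in zip(matrix, rhs)]
    for col in range(n):
        # find a row with a nonzero entry in this column and move it up
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]

# unknowns (x, y, lam):
#   2x - 2y - 2 lam = 0      (d/dx)
#  -2x + 4y + 3 lam = 0      (d/dy)
#   2x - 3y         = 5      (constraint G = 0)
x, y, lam = solve([[2, -2, -2], [-2, 4, 3], [2, -3, 0]], [0, 0, 5])
print(x, y, lam)
```

The same routine would not apply to a nonlinear constraint such as a circle, where the Lagrange equations are no longer linear.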

to solve constrained optimization you can convert to unconstrained optimization with an extra variable

LAGRANGE EQUATIONS

[Df] = λ[DG] and G = 0

to optimize f(x) constrained to the level set of G(x)

means different things to different people

force of constraint

shadow price

the multiplier is the rate of change of the optimal value with respect to the constraint value

I’ll tell you

consider the constraint value as a variable, c (think “cost”)

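G(x) = c

then use lagrange to optimize f(x) when constrained by G - c = 0:

[Df] = λ[DG] and G - c = 0

this gives n+1 equations on n+2 variables (x, λ, c)

thanks to the implicit function theorem, we can solve for the optima of f as a function of c (locally): x = x(c)

the rate of change of the optimal f-value is:

$$\frac{df}{dc} = \frac{\partial f}{\partial x} \frac{dx}{dc} = \lambda\, \frac{\partial G}{\partial x} \frac{dx}{dc} = \lambda\, \frac{dG}{dc} = \lambda$$

since G(x(c)) = c gives dG/dc = 1. that is,

$$\frac{d}{dc}\, f(x(c)) = \lambda$$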
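The identity df/dc = λ can be checked numerically on the earlier simple example, f = x2 - 2xy + 2y2 with the constraint 2x - 3y = c. The closed form x = c/5, y = -c/5, λ = x - y used below is an assumption of this sketch, obtained by repeating the same Lagrange algebra with 5 replaced by c; the step size `h` is an arbitrary choice.

```python
def f(x, y):
    return x**2 - 2 * x * y + 2 * y**2

def optimum(c):
    """Constrained optimum of f on 2x - 3y = c (assumed closed form)."""
    x = c / 5          # from x + y = 0 and 2x - 3y = c
    y = -x
    lam = x - y        # from 2x - 2y = 2 lam
    return x, y, lam

c, h = 5.0, 1e-6
x_hi, y_hi, _ = optimum(c + h)
x_lo, y_lo, _ = optimum(c - h)

# central-difference estimate of d/dc of the optimal value
slope = (f(x_hi, y_hi) - f(x_lo, y_lo)) / (2 * h)
lam = optimum(c)[2]

print(slope, lam)
```

The finite-difference slope and the multiplier should agree to within rounding error.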



the lagrange multiplier at this local max is larger…

…than the lagrange multiplier at this local max

because of how the critical values change with a small change in constraint value

can we get back to solving some real problems?

the big picture lagrange’s method converts constrained optimization into

unconstrained optimization with an extra variable…

the lagrange multiplier

1

use the lagrange equations to find the point on the plane ax + by + cz = 1 closest to the origin. Hint: extremize the square-distance f = x2+y2+z2.

2

use the lagrange equations to compute a formula for the minimal distance D from a point (x0, y0) in the plane to a line of the form ax + by = 1. hint: compute the minimal square-distance D2, then take its square root.

3

challenge: try to generalize the previous problem to n dimensions, finding the minimal (square) distance from the point P = (x1, … , xn) to the hyperplane given by the equation a1x1 + a2x2 + … + anxn = 1. this may require courage.

4

consider the cost function f = xy for x, y > 0, constrained to a level set of the form ax + by = C, for a, b, C > 0. draw the level sets of the cost and constraint functions in the plane. use lagrange to solve for the optima in terms of C. what does the lagrange multiplier λ tell you? as cost C is increased, does the lagrange multiplier λ get larger or smaller? can you “see” the answer?


Is straightforward, but…

A supply-chain problem

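commodities are consumed at constant rate & restocked when depleted. stock reorders happen y times per year, each time ordering amount x. a total amount of S is needed per year:

$$G(x, y) = xy - S = 0$$

the cost is

$$f(x, y) = ax + by$$

a ~ average storage cost ; b ~ order delivery cost

lagrange: $[Df] = \lambda [DG]$

$$\frac{\partial}{\partial x}: \quad a = \lambda y \ \Rightarrow\ a/y = \lambda \qquad\qquad \frac{\partial}{\partial y}: \quad b = \lambda x \ \Rightarrow\ b/x = \lambda$$

$$\frac{x}{y} = \frac{b}{a}$$

use G:

$$S = xy = \frac{b}{a}\, y^2 = \frac{a}{b}\, x^2$$

$$x = \sqrt{\frac{bS}{a}} \quad \text{(reorder this much)}, \qquad y = \sqrt{\frac{aS}{b}} \quad \text{(at this frequency)}$$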
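The reorder formulas are easy to evaluate for concrete numbers. The following is a minimal sketch; the function name `reorder_plan` and the sample values a = 1, b = 4, S = 100 are illustrative assumptions, not data from the text.

```python
import math

def reorder_plan(a, b, S):
    """Return (x, y): order size and orders per year from the Lagrange solution."""
    x = math.sqrt(b * S / a)   # reorder this much
    y = math.sqrt(a * S / b)   # at this frequency
    return x, y

x, y = reorder_plan(a=1.0, b=4.0, S=100.0)
print(x, y)            # order size and frequency
print(x * y)           # should reproduce the yearly total S
print(1.0 / y, 4.0 / x)  # a/y and b/x: both equal the multiplier
```

The last line checks the two Lagrange equations a/y = λ = b/x directly.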

That “degenerate” optima can be rejected

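find the maximal inscribed cone (radius r, height h) in a sphere of radius R

$$V(r, h) = \pi r^2 h / 3, \qquad G(r, h) = (h - R)^2 + r^2 - R^2 = 0$$

lagrange: $[DV] = \lambda [DG]$

$$\frac{\partial}{\partial r}: \quad 2\pi r h / 3 = 2\lambda r \quad\Rightarrow\quad r = 0 \ \text{ or } \ \lambda = \pi h / 3$$

$$\frac{\partial}{\partial h}: \quad \pi r^2 / 3 = 2\lambda (h - R) \quad\Rightarrow\quad \pi r^2 / 3 = 2 (h - R)\, \pi h / 3 \quad\Rightarrow\quad r^2 = 2 (h - R)\, h$$

substitute into the constraint:

$$0 = (h - R)^2 + r^2 - R^2 = (h - R)^2 + 2 (h - R)\, h - R^2 = 3h^2 - 4hR = h\,(3h - 4R)$$

$$h = 0 \ \text{ or } \ h = 4R/3$$

at $h = 4R/3$:

$$r^2 = 2 (h - R)\, h = 8R^2/9$$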

These could have been solved with substitution & “single-variable” methods

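now in 3-d! With extra linear algebra!

$$f(x, y, z) = xy + yz - x^2 + y^2 - z^2, \qquad G(x, y, z) = x^2 + y^2 + z^2 - 8 = 0$$

The lagrange equations are:

$$\frac{\partial}{\partial x}: \ y - 2x = 2\lambda x \qquad \frac{\partial}{\partial y}: \ x + z + 2y = 2\lambda y \qquad \frac{\partial}{\partial z}: \ y - 2z = 2\lambda z$$

$$\begin{bmatrix} -2-2\lambda & 1 & 0 \\ 1 & 2-2\lambda & 1 \\ 0 & 1 & -2-2\lambda \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}$$

NEED A NONZERO SOLUTION FOR THE constraint TO HOLD… THIS ONLY HAPPENS IF… DET = 0

compute the determinant…

$$\det \begin{bmatrix} -2-2\lambda & 1 & 0 \\ 1 & 2-2\lambda & 1 \\ 0 & 1 & -2-2\lambda \end{bmatrix} = -4\,(1 + \lambda)(2\lambda^2 - 3) = 0$$

$$\lambda = -1 \quad \text{OR} \quad \lambda = \pm\sqrt{3/2}$$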

for $\lambda = -1$, solve

$$\begin{bmatrix} 0 & 1 & 0 \\ 1 & 4 & 1 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}$$

this leads to two equations… $y = 0$ and $x = -z$

substitute into the constraint equation…

$$x^2 + y^2 + z^2 - 8 = 0 \quad\Rightarrow\quad x^2 + 0 + (-x)^2 = 8 \quad\Rightarrow\quad x = \pm 2 ;\ y = 0 ;\ z = -x$$

so $\lambda = -1$ gives $x = 2, y = 0, z = -2$ OR $x = -2, y = 0, z = 2$

for $\lambda = \pm\sqrt{3/2}$:

$$x = z = \pm\sqrt{\frac{2}{3 \pm \sqrt{6}}}, \qquad y = (2 \pm \sqrt{6})\, x$$

(the sign in front of $\sqrt{6}$ follows the sign of $\lambda$)

is this awful? yes. it is awful.

You can still get results from lagrange

Economics without numbers

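invest in an amount E of equipment at unit rate cost e
invest in an amount L on labor at unit rate cost l
fixed production function P = P( E , L ) : constant output
cost: C( E , L ) = e E + l L

Lagrange: $[DC] = \lambda [DP]$

$$\frac{\partial}{\partial E}: \quad e = \lambda \frac{\partial P}{\partial E} \qquad\qquad \frac{\partial}{\partial L}: \quad l = \lambda \frac{\partial P}{\partial L}$$

solve for 1/λ & equate:

$$\frac{1}{e} \frac{\partial P}{\partial E} = \frac{1}{l} \frac{\partial P}{\partial L}$$

at the optimal cost, equals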

higher dimensional problems are really hard

a high-dimensional maximization

variables $x_i \ge 0$ for $i = 1 \dots n$

$$f = x_1^2\, x_2^2 \cdots x_{n-1}^2\, x_n^2 = \prod_{i=1}^n x_i^2, \qquad G = \sum_{i=1}^n x_i^2 = a^2 \ge 0$$

Lagrange: $[Df] = \lambda [DG]$

$$\frac{\partial}{\partial x_j}: \quad 2 x_j \prod_{i \neq j} x_i^2 = 2 x_j \lambda$$

either $x_j = 0$ (not maximal) or

$$\prod_{i \neq j} x_i^2 = \lambda = \text{constant for all } j$$

by symmetry, we conclude $x_j^2 = a^2/n$ for all j, that is,

$$x_i = \frac{a}{\sqrt{n}} \quad \text{for all } i$$

this must be the maximum since the function is positive & all other critical points have value zero

the maximal value is

$$f_{max} = \prod_{i=1}^n x_i^2 = \Bigl( \frac{a^2}{n} \Bigr)^n$$

for $x_i \ge 0$ and $G = \sum_i x_i^2 = a^2 \ge 0$ we have

$$f = \prod_{i=1}^n x_i^2 \;\le\; f_{max} = \Bigl( \frac{a^2}{n} \Bigr)^n = \Bigl( \frac{1}{n} \sum_{i=1}^n x_i^2 \Bigr)^n$$

taking n-th roots,

$$\Bigl( \prod_{i=1}^n x_i^2 \Bigr)^{1/n} \le \frac{1}{n} \sum_{i=1}^n x_i^2$$

that is, writing $y_i = x_i^2$,

$$\Bigl( \prod_{i=1}^n y_i \Bigr)^{1/n} \le \frac{1}{n} \sum_{i=1}^n y_i$$

relies on core inequalities many of which are proved via lagrange optimization
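The inequality between the geometric and arithmetic means can be spot-checked on random data. The following is a minimal sketch; the sample size, the value range, and the fixed seed are arbitrary illustrative choices, and a random test of this kind illustrates the inequality rather than proving it. It uses `math.prod`, which requires Python 3.8 or later.

```python
import math
import random

def geometric_mean(ys):
    return math.prod(ys) ** (1.0 / len(ys))

def arithmetic_mean(ys):
    return sum(ys) / len(ys)

random.seed(0)
samples = [[random.uniform(0.0, 10.0) for _ in range(6)] for _ in range(1000)]

# smallest observed gap (arithmetic mean minus geometric mean)
gap = min(arithmetic_mean(ys) - geometric_mean(ys) for ys in samples)
print(gap)

# equality case: all values the same, matching x_i = a / sqrt(n) above
equal = [3.0] * 6
print(geometric_mean(equal), arithmetic_mean(equal))
```

The equality case corresponds to the maximizer found by Lagrange, where all the variables coincide.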

the big picture When using lagrange’s method…

Stay calm & Do the algebra

1

recall (from exercises earlier in this text) the cobb-douglas model for P (production) in terms of labor x and materials y is P = C xα yβ, where α, β, C > 0 are constants and α + β = 1. assuming that labor costs A dollars per unit and materials cost B dollars per unit, use the lagrange method to optimize resource allocation x, y so as to maximize P with a fixed cost K.

2

the lagrange multiplier λ has a definite meaning in load balancing for electric network problems. Consider three generators that can output xi megawatts, i=1…3. Each generator costs Ci = 3xi + (i/40)xi2. If the total power needed is 1000 MW, what load balance (x1, x2 , x3) minimizes cost? in this problem, what is λ and what are the units of λ? if you are operating at the optimal load balance, and a request comes in for additional power at $20/MW, is this a good price? how do you determine this?

3

assume a solid body with three principal normal stresses $\sigma_1 > \sigma_2 > \sigma_3 > 0$. along a plane with unit normal vector n, the shear stress is given by the function:

$$\tau^2 = \sigma_1^2 n_1^2 + \sigma_2^2 n_2^2 + \sigma_3^2 n_3^2 - ( \sigma_1 n_1^2 + \sigma_2 n_2^2 + \sigma_3 n_3^2 )^2$$

write out the lagrange equations for extremizing the shear stress τ as a function of the normal vector n satisfying $n_1^2 + n_2^2 + n_3^2 = 1$. try to solve! (hint: there are 3 solutions, each of which has one component $n_i = 0$.) what is the maximal shear stress τ as a function of normal stresses?

4

challenge: it’s hard to do a problem in n-dimensions, but here’s an example. Minimize a quadratic form f(x) = xT Q x for a positive definite square matrix Q subject to a hyperplane constraint x * b = 1 for a constant nonzero vector b. as a warm-up, you might want to start with a simpler case… A) let Q be a 2-by-2 matrix with determinant and trace both positive b) let Q be an n-by-n diagonal matrix with positive diagonal terms

modern applications of derivatives

a function $f : \mathbb{R}^n \to \mathbb{R}$ to be optimized while satisfying constraints $G(x) = 0$ for $G : \mathbb{R}^n \to \mathbb{R}^m$

the solutions to the constrained optimization problem equal those of an unconstrained problem with more variables, one for each constraint…

Define $L : \mathbb{R}^{n+m} \to \mathbb{R}$ by

$$L(x, \lambda) = f(x) - \lambda \cdot G(x)$$

where λ is a vector of m lagrange multipliers

$$0 = [DL] = \begin{bmatrix} \underbrace{[Df] - \lambda \cdot [DG]}_{\partial / \partial x} & \underbrace{-G_1 \ \ -G_2 \ \ \dots \ \ -G_m}_{\partial / \partial \lambda} \end{bmatrix}$$

This gives two sets of equations

$$[Df] = \lambda \cdot [DG], \qquad G = 0$$

classifying maxima/minima/saddles

In the unconstrained case of $f : \mathbb{R}^n \to \mathbb{R}$, the eigenvalues of the 2nd derivative $[D^2 f]$ determine everything!

the 2nd derivative $[D^2 f]$ has n eigenvalues; they are all real, since $[D^2 f]$ is symmetric

if the eigenvalues of $[D^2 f]$ are…
all positive: local min
all negative: local max
mixed +/-: saddle
some zero: ???

in the constrained setting, the 2nd derivative $[D^2 L]$ is used… this is not so simple…careful!

yes. oh, yes. oh, yes yes yes yes yes.

three time-dependent variables with linearly related evolution

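$$x' = 3x - 2y + z, \qquad y' = x + 4y - 2z, \qquad z' = 5x - y + 3z$$

can be expressed as a simple first-order linear system

$$\begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = \begin{bmatrix} 3 & -2 & 1 \\ 1 & 4 & -2 \\ 5 & -1 & 3 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix}$$

that is… $\mathbf{x}' = A\mathbf{x}$ using matrix-vector notation

you can check that the solution to this is…

$$\mathbf{x}(t) = e^{At}\, \mathbf{x}(0)$$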

this is, of course, the old single-variable story rewritten with matrices

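for a square matrix, the exponential is, clearly, defined as…

$$e^{At} = I + At + \frac{1}{2!} (At)^2 + \frac{1}{3!} (At)^3 + \dots + \frac{1}{n!} (At)^n + \dots$$

for a diagonal matrix… the exponential is too

$$A = \begin{bmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{bmatrix}, \qquad e^{At} = \begin{bmatrix} e^{\lambda_1 t} & 0 & 0 \\ 0 & e^{\lambda_2 t} & 0 \\ 0 & 0 & e^{\lambda_3 t} \end{bmatrix}$$

in the same way that eigenvalues classify critical points in optimization, eigenvalues classify dynamical equilibria…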

the solutions can be classified into types based on eigenvalues reminiscent of minima/saddles/maxima

notice how these have an equilibrium (constant solution) at zero… for complex pairs of eigenvalues, one gets spiraling behavior: either stable (sink), unstable (source), or “balanced” (center)
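The series definition of the matrix exponential can be summed directly for small matrices. The following is a minimal sketch in plain Python; truncating at 40 terms is an arbitrary choice that is adequate for small entries, and production code would normally rely on a numerical library instead. The diagonal test matrix `D` is an illustrative assumption.

```python
import math

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(A, t, terms=40):
    """Approximate e^{At} by the partial sum I + At + (At)^2/2! + ..."""
    n = len(A)
    At = [[a * t for a in row] for row in A]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]                               # (At)^0 / 0!
    for k in range(1, terms):
        # turn (At)^(k-1)/(k-1)! into (At)^k/k!
        term = [[v / k for v in row] for row in mat_mul(term, At)]
        result = [[r + v for r, v in zip(rr, tr)]
                  for rr, tr in zip(result, term)]
    return result

# for a diagonal matrix the exponential should be diagonal too
D = [[1.0, 0.0, 0.0], [0.0, -2.0, 0.0], [0.0, 0.0, 0.5]]
E = mat_exp(D, 1.0)
print(E[0][0], E[1][1], E[2][2])
```

Comparing the diagonal of `E` with `math.exp` of each diagonal entry checks the diagonal formula above.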

when the system is not linear & not solvable…

x' = F(x)

FOR A NONLINEAR SYSTEM, YOU FIND THE EQUILIBRIA (CONSTANT SOLUTIONS) AND LINEARIZE THE DYNAMICS BY COMPUTING THE DERIVATIVE…

THE EIGENVALUES OF THE DERIVATIVE AT AN EQUILIBRIUM TELL YOU ABOUT THE NONLINEAR DYNAMICS… LOCALLY!

more interesting things can happen

don’t panic if that all doesn’t make much sense… the epilogues are meant to be mind-crunching

are, of course, anti-derivatives

d dx

f:_ f' :  _ 

n

∫-dx

f: _ D

m

?

[Df] the derivative is not the same type of function… so you can’t invert!

let $f : \mathbb{R}^n \to \mathbb{R}$ be an “integrable” function on a region $R \subset \mathbb{R}^n$

$$\int_R f \, dx$$

is a limit (like everything else in mathematics)

over higher-dimensional domains

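let $f : \mathbb{R}^n \to \mathbb{R}$ be a “sufficiently integrable” function on a region $R \subset \mathbb{R}^n$. then,

$$\int_R f \, dx = \int \Bigl( \int \dots \Bigl( \Bigl( \int f \, dx_1 \Bigr) dx_2 \Bigr) \dots \, dx_{n-1} \Bigr) dx_n$$

these “partial integrals” are analogous to undoing partial derivatives…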

is not computing the integral

limits on a triple integral 1

1

@ @ @

1-x

z

f dz dx dy

y

y=-1 x=y2 z=0

=

1

1-x

@ @ @

1-x

f dy dz dx

x

x=0 z=0 y=- 1-x

these don’t look the same, but they are!

(0,1)

z

(0,1)

z=1-x

z

y

z=1-y2

(0,0) (0,0)

(1,0)

x

(-1,0)

x

(1,0)

y

(1,1)

x=1-y2

(1,-1)

in all manner of applications

r

r l

r

2M r2 5

2 r 5

2M r2 3

2 r 3

M 2 2 ( 3r + l ) 12

3r2

2

+l 12

are mostly limited to one tool

$$\int_{F(D)} h(u) \, du = \int_D h(F(x)) \, \bigl| \det [DF] \bigr| \, dx$$

where $u = F(x)$ and $[DF] = \partial u / \partial x$

difficult & useful integrals

z

z

y

x

r = x2 + y2 θ =  ( y/x ) z=z

y

x

ρ = x2 + y2 + z2 θ =  ( y/x ) φ =  ( z/ρ )

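For a surface $S \subset \mathbb{R}^n$ parametrized by $G : \mathbb{R}^2 \to \mathbb{R}^n$, taking $(s, t)$ to $(x_1, x_2, \dots, x_n)$, the surface area element is

$$d\sigma = \sqrt{\det \bigl( [DG]^T [DG] \bigr)} \; ds \, dt$$

(note the transpose & the square root!)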

compute the volume $B_n$ and “surface area” $\Omega_n$ of the unit-radius ball in $\mathbb{R}^n$

take advantage of spherical coordinates…

$$dV_n = \rho^{n-1} \, d\rho \, d\Omega_n$$

where $dV_n$ = volume element, $\rho$ = radial coordinate, $d\Omega_n$ = solid angle element

we will need to use the gamma function…

$$\Gamma(x) = \int_{t=0}^{\infty} e^{-t}\, t^{x-1} \, dt = (x-1)!$$

the big picture

in higher dimensions, one needs to integrate definitely & iteratively. the difficulty is not integrating… WHAT’S HARD IS setting it up & choosing proper coordinates!

bY

Robert Ghrist is the Andrea Mitchell Professor of Mathematics and Electrical & Systems Engineering at the University of Pennsylvania

He’s an award-winning researcher, teacher, writer, & speaker. his 1995 ph.d. is in applied mathematics from cornell

Good textbooks on calculus that use matrices & matrix algebra:
Colley, S. J., Vector Calculus, 4th ed., Pearson, 2011.
Hubbard, J. and Hubbard, B. B., Vector Calculus, Linear Algebra, and Differential Forms: A Unified Approach, 5th ed., Matrix Editions, 2015.

Good introduction to game theory:
Ferguson, T., Game Theory, 2nd ed., web site, 2014. https://www.math.ucla.edu/~tom/Game_Theory/Contents.html

Good introduction to least-squares regression & more:
Boyd, S., and Vandenberghe, L., Introduction to Applied Linear Algebra – Vectors, Matrices, & Least Squares, Cambridge, to appear, 2018.

Good graduate level text on optimization theory & applications:
Boyd, S., and Vandenberghe, L., Optimization Theory, Cambridge, 2004.

all writing, design, drawing, & layout by prof/g [Robert Ghrist]

prof/g acknowledges the support of Andrea Mitchell & the fantastic engineering students at the University of Pennsylvania

during the writing of calculus blue, prof/g’s research was generously supported by the United States Department of Defense through the ASDR&E Vannevar Bush Faculty Fellowship