UNCONSTRAINED OPTIMIZATION PAUL SCHRIMPF OCTOBER 24, 2013

UNIVERSITY OF BRITISH COLUMBIA
ECONOMICS 526

Today's lecture is about unconstrained optimization. If you're following along in the syllabus, you'll notice that we've skipped the fourth topic, eigenvalues and definite matrices. We will cover these things as part of our study of optimization.

1. NOTATION AND DEFINITIONS

An optimization problem refers to finding the maximum or minimum of a function, perhaps subject to some constraints. In economics, the most common optimization problems are utility maximization and profit maximization. Because of this, we will state most of our definitions and results for maximization problems. Of course, we could just as well state each definition and result for a minimization problem by reversing the sign of most inequalities. In this lecture we will be interested in unconstrained optimization problems such as

    max_{x ∈ U} F(x)

where U ⊆ R^n and F : U → R. If F* = max_{x ∈ U} F(x), we mean that F* ≥ F(x) for all x ∈ U and F* = F(x*) for some x* ∈ U.

Definition 1.1. F* = max_{x ∈ U} F(x) is the maximum of F on U if F* ≥ F(x) for all x ∈ U and F* = F(x*) for some x* ∈ U.

There may be more than one such x*. We denote the set of all x* such that F(x*) = F* by arg max_{x ∈ U} F(x) and might write x* ∈ arg max_{x ∈ U} F(x), or, if we know there is only one such x*, we sometimes write x* = arg max_{x ∈ U} F(x).

Definition 1.2. x* ∈ U is a maximizer of F on U if F(x*) = max_{x ∈ U} F(x). The set of all maximizers is denoted arg max_{x ∈ U} F(x).

Definition 1.3. x* ∈ U is a strict maximizer of F on U if F(x*) > F(x) for all x ∈ U with x ≠ x*.

Recall the definition of a local maximum from lecture 8.

Definition 1.4. F has a local maximum at x if there exists δ > 0 such that F(y) ≤ F(x) for all y ∈ N_δ(x) ∩ U. Each such x is called a local maximizer of F. If F(y) < F(x) for all y ≠ x, y ∈ N_δ(x) ∩ U, then we say F has a strict local maximum at x.

When we want to be explicit about the distinction between a local maximum and the maximum in definition 1.1, we refer to the latter as the global maximum.
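To make the distinction between the maximum and the set of maximizers concrete, here is a minimal Python sketch (not part of the original notes; the example function and grid are assumptions) that approximates both by grid search for F(x) = cos(x) on U = [0, 4π].

    import numpy as np

    # Approximate max and argmax of F(x) = cos(x) on [0, 4*pi] by grid search.
    xs = np.linspace(0.0, 4 * np.pi, 100001)
    F = np.cos(xs)
    Fmax = F.max()                          # the maximum (approximately 1)
    maximizers = xs[F >= Fmax - 1e-12]      # the set of (approximate) maximizers
    print(Fmax)
    print(maximizers)                       # approximately [0, 2*pi, 4*pi]

The maximum is a single number, while arg max is a set; here it contains three points.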

Example 1.1. Here are some examples of functions from R → R and their maxima and minima.
(1) F(x) = x^2 is minimized at x = 0 with minimum 0.
(2) F(x) = c has minimum and maximum c. Any x is a maximizer.
(3) F(x) = cos(x) has maximum 1 and minimum -1. x = 2πn for n ∈ Z is a maximizer.
(4) F(x) = cos(x) + x/2 has no global maximizer or minimizer, but has many local ones.

2. FIRST ORDER CONDITIONS

In lecture 8, we proved that if F has a local maximum at x*, then DF(x*) = 0. We restate that theorem and proof here.

Theorem 2.1. Let U ⊆ R^n, F : U → R, and suppose F has a local maximum or minimum at x*, F is differentiable at x*, and x* ∈ interior(U). Then DF(x*) = 0.

Proof. We will write the proof for when F has a local maximum at x*. The exact same reasoning works when F has a local minimum. Since x* is in the interior of U, we can choose δ > 0 such that N_δ(x*) ⊆ U. Since x* is a local maximum, we can also choose δ such that F(x*) ≥ F(y) for all y ∈ N_δ(x*). Since F is differentiable, we can write

    F(x* + h) = F(x*) + DF(x*) h + r(x*, h)

where lim_{h→0} r(x*, h)/||h|| = 0. Let h = tv for some v ∈ R^n with ||v|| = 1 and t ∈ R. If DF(x*) v > 0, then for t > 0 small enough, we would have

    (F(x* + tv) - F(x*))/t = DF(x*) v + r(x*, tv)/t > DF(x*) v / 2 > 0

and F(x* + tv) > F(x*), in contradiction to x* being a local maximum. Similarly, if DF(x*) v < 0, then for t < 0 and small, we would have

    (F(x* + tv) - F(x*))/t = DF(x*) v + r(x*, tv)/t < DF(x*) v / 2 < 0

and, since t < 0, F(x* + tv) > F(x*). Thus, it must be that DF(x*) v = 0 for all v, i.e. DF(x*) = 0. ∎

The first order condition is the fact that DF(x*) = 0 is a necessary condition for x* to be a local maximizer or minimizer of F.

Definition 2.1. Any point x such that DF(x) = 0 is called a critical point of F.

If F is differentiable, F cannot have local minima or maxima (= local extrema) at non-critical points. F might have local extrema at its critical points, but it does not have to. Consider F(x) = x^3: F'(0) = 0, but 0 is not a local maximizer or minimizer of F. Similarly, if F : R^2 → R, F(x) = x_1^2 - x_2^2, then DF(0) = 0, but 0 is not a local minimum or maximum of F.

3. SECOND ORDER CONDITIONS

To determine whether a given critical point is a local minimum or maximum or neither, we can look at the second derivative of the function. Let F : R^n → R and suppose x* is a critical point. Then DF(x*) = 0. To see if x* is a local maximum, we need to look at

F(x* + h) for small h. If F is twice continuously differentiable, we can take a second order Taylor expansion of F around x*:

    F(x* + h) = F(x*) + DF(x*) h + (1/2) h^T D^2F(x*) h + r(x*, h)

where D^2F(x*) is the matrix of F's second derivatives. It is called the Hessian of F:

    D^2F(x*) = [ ∂^2F/∂x_1^2 (x*)     ...   ∂^2F/∂x_1∂x_n (x*) ]
               [        ...           ...          ...         ]
               [ ∂^2F/∂x_1∂x_n (x*)   ...   ∂^2F/∂x_n^2 (x*)   ]

Since x* is a critical point, DF(x*) = 0, so

    F(x* + h) - F(x*) = (1/2) h^T D^2F(x*) h + r(x*, h).

We can see that x* is a local maximum if

    (1/2) h^T D^2F(x*) h + r(x*, h) ≤ 0

for all h ≠ 0 with ||h|| < δ for some δ > 0. We know that r(x*, h) is small, so we expect that the above inequality will be true if h^T D^2F(x*) h < 0 for all h ≠ 0. The Hessian D^2F(x*) is just some symmetric n by n matrix, and h^T D^2F(x*) h is a quadratic form in h. This motivates the following definition.

Definition 3.1. Let A be a symmetric matrix. Then A is
- Negative definite if x^T A x < 0 for all x ≠ 0
- Negative semi-definite if x^T A x ≤ 0 for all x ≠ 0
- Positive definite if x^T A x > 0 for all x ≠ 0
- Positive semi-definite if x^T A x ≥ 0 for all x ≠ 0
- Indefinite if there is some x_1 such that x_1^T A x_1 > 0 and some other x_2 such that x_2^T A x_2 < 0.

In the next section we will derive some conditions on A that ensure it is negative (semi-)definite. For now, just observe that if D^2F(x*) is negative definite, then x* is a strict local maximum; if D^2F(x*) is only negative semi-definite, then x* may or may not be a local maximum (see the examples below). The following theorem restates the results of this discussion.

Theorem 3.1. Let F : U → R be twice continuously differentiable on U and let x* be a critical point in the interior of U. If
(1) the Hessian D^2F(x*) is negative definite, then x* is a strict local maximizer,
(2) the Hessian D^2F(x*) is positive definite, then x* is a strict local minimizer,
(3) the Hessian D^2F(x*) is indefinite, then x* is neither a local min nor a local max,
(4) the Hessian is positive or negative semi-definite, then x* could be a local maximum, minimum, or neither.

Proof. We will only prove the first case. The second and third cases can be proven similarly. We will go over some examples of the fourth case.

The main idea of the proof is contained in the discussion at the start of this section. The only tricky part is carefully showing that r(x*, h) is small enough to ignore. As in the discussion preceding the theorem, x* is a strict local maximizer if

    F(x* + h) - F(x*) = (1/2) h^T D^2F(x*) h + r(x*, h) < 0

for all sufficiently small h ≠ 0. Dropping the factor 1/2 (it does not affect the sign) and factoring, we have

    h^T D^2F(x*) h + r(x*, h) = h^T D^2F(x*) h + h^T h (r(x*, h)/||h||^2)
                              = h^T ( D^2F(x*) + (r(x*, h)/||h||^2) I ) h.

From our theorem on Taylor series, we know lim_{h→0} r(x*, h)/||h||^2 = 0, so for any ϵ > 0 there is a δ > 0 such that if ||h|| < δ, then |r(x*, h)|/||h||^2 < ϵ. So,

    h^T ( D^2F(x*) + (r(x*, h)/||h||^2) I ) h ≤ h^T D^2F(x*) h + ϵ ||h||^2.

Let t = ||h|| and q = h/||h||, so h = tq. Then we have

    h^T D^2F(x*) h + ϵ ||h||^2 = t^2 q^T D^2F(x*) q + ϵ t^2.

We can pick ϵ < inf_{||q||=1} ( -q^T D^2F(x*) q ). The set {q : ||q|| = 1} is compact, so there is some q̄ that achieves this infimum, and q̄^T D^2F(x*) q̄ < 0, so such an ϵ > 0 exists.(1) Then for all 0 < t < δ and ||q|| = 1,

    t^2 q^T D^2F(x*) q + ϵ t^2 < t^2 ( q^T D^2F(x*) q - q̄^T D^2F(x*) q̄ ) ≤ 0,

where the last inequality follows because q̄ maximizes q^T D^2F(x*) q over the unit sphere. Thus, F(x* + h) - F(x*) < 0 for all 0 < ||h|| < δ. ∎

(1) If we didn't have compactness, we might have ended up with ϵ = 0, which is not allowed.

When the Hessian is not positive definite, negative definite, or indefinite, the result of this theorem is ambiguous. Let's go over some examples of this case.

Example 3.1. F : R → R, F(x) = x^4. The first order condition is 4x^3 = 0, so x = 0 is the only critical point. The Hessian is F''(x) = 12x^2, which equals 0 at x = 0. However, x^4 has a strict local minimum at 0.
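Here is a quick numerical check of example 3.1 (a Python sketch, not from the notes): the second derivative vanishes at the critical point, so the second order condition is silent, yet 0 is still a strict local minimizer.

    import numpy as np

    F = lambda x: x**4
    # F''(0) = 0, so theorem 3.1 does not apply, but 0 is a strict local minimum:
    print(all(F(h) > F(0.0) for h in np.linspace(-0.1, 0.1, 201) if h != 0.0))  # True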

Example 3.2. F : R^2 → R, F(x_1, x_2) = -x_1^2. The first order condition is DF(x) = (-2x_1, 0) = 0, so the points with x_1 = 0, x_2 ∈ R are all critical points. The Hessian is

    D^2F(x) = [ -2  0 ]
              [  0  0 ]

This is negative semi-definite because h^T D^2F(x) h = -2h_1^2 ≤ 0. Also, graphing the function would make it clear that the points with x_1 = 0, x_2 ∈ R are all non-strict local maxima.

Example 3.3. F : R^2 → R, F(x_1, x_2) = -x_1^2 + x_2^4. The first order condition is DF(x) = (-2x_1, 4x_2^3) = 0, so x = (0, 0) is a critical point. The Hessian is

    D^2F(x) = [ -2     0     ]
              [  0  12x_2^2  ]

This is negative semi-definite at x = (0, 0) because h^T D^2F(0, 0) h = -2h_1^2. However, (0, 0) is not a local maximum because F(0, x_2) > F(0, 0) for any x_2 ≠ 0. (0, 0) is also not a local minimum because F(x_1, 0) < F(0, 0) for all x_1 ≠ 0.

In each of these examples, the second order condition is inconclusive because h^T D^2F(x*) h = 0 for some h ≠ 0. In these cases we could determine whether x* is a local maximum, local minimum, or neither by either looking at higher derivatives of F at x*, or looking at D^2F(x) for all x in a neighborhood of x*. We will not often encounter cases where the second order condition is inconclusive, so we will not study these possibilities in detail.

The converse of theorem 3.1 is nearly true. If x* is a local maximizer, then D^2F(x*) must be negative semi-definite.

Theorem 3.2. Let F : U → R be twice continuously differentiable on U and let x* in the interior of U be a local maximizer (or minimizer) of F. Then DF(x*) = 0 and D^2F(x*) is negative (respectively positive) semi-definite.

Proof. Using the same notation and setup as in the proof of theorem 3.1 (for the maximizer case), for any q with ||q|| = 1 and small t,

    0 ≥ F(x* + tq) - F(x*) = (1/2) t^2 q^T D^2F(x*) q + r(x*, tq).

Because lim_{t→0} r(x*, tq)/t^2 = 0, for any ϵ > 0 there exists δ > 0 such that if 0 < |t| < δ, then

    0 ≥ (1/2) t^2 q^T D^2F(x*) q + r(x*, tq) ≥ (1/2) t^2 q^T D^2F(x*) q - ϵ t^2,

so ϵ ≥ (1/2) q^T D^2F(x*) q. This is only possible for all ϵ > 0 if q^T D^2F(x*) q ≤ 0. ∎

I do not expect you to remember the proofs of the last two theorems. However, it is important to remember the second order condition, and know how to check it. To check the second order condition, we need some practical way to tell whether a matrix is positive or negative semi-definite.
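Before developing such tests, the following Python sketch (an illustration, not part of the notes) revisits examples 3.2 and 3.3: both functions have the same negative semi-definite Hessian at the critical point, which is exactly why a test based on D^2F(x*) alone cannot distinguish a local maximum (example 3.2) from a point that is neither a maximum nor a minimum (example 3.3).

    import numpy as np

    def F2(x): return -x[0]**2              # example 3.2
    def F3(x): return -x[0]**2 + x[1]**4    # example 3.3

    H = np.array([[-2.0, 0.0],              # D^2F(0) for both examples
                  [ 0.0, 0.0]])

    x_star = np.zeros(2)
    t = 0.1
    for h in [np.array([1.0, 0.0]), np.array([0.0, 1.0])]:
        print(h, h @ H @ h,
              F2(x_star + t * h) - F2(x_star),
              F3(x_star + t * h) - F3(x_star))
    # Along h = (0, 1) the quadratic form is 0; F2 does not increase in that
    # direction, but F3 does, so the semi-definite case is inconclusive.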

4. DEFINITE MATRICES

The second order condition says that a critical point is a local maximum if the Hessian is negative definite. In this section we will develop some conditions for whether a matrix is negative definite. Let's start with the simplest case where A is a one by one matrix. Then x^T A x = a x^2, which is negative definite if and only if a < 0. A is negative semi-definite if a ≤ 0.

What if A is a two by two symmetric matrix? Then

    x^T A x = (x_1 x_2) [ a  b ] (x_1)  = a x_1^2 + 2b x_1 x_2 + c x_2^2.
                        [ b  c ] (x_2)

Completing the square we get

    x^T A x = a x_1^2 + 2b x_1 x_2 + c x_2^2
            = a (x_1 + (b/a) x_2)^2 - (b^2/a) x_2^2 + c x_2^2
            = a (x_1 + (b/a) x_2)^2 + ((ac - b^2)/a) x_2^2.

Thus x^T A x < 0 for all x ≠ 0 if a < 0 and (ac - b^2)/a < 0, i.e. ac - b^2 > 0. Notice that ac - b^2 = det(A). So A is negative definite if a < 0 and det(A) > 0. If we wanted A to be positive definite, we would need a > 0 and det(A) > 0. For semi-definiteness a similar pattern holds with weak inequalities, although, as noted below, the semi-definite cases require checking more determinants.

It turns out that in general this sort of pattern continues. We can determine whether A is negative definite by looking at the determinants of certain submatrices of A.

Definition 4.1. Let A be an n by n matrix. The k by k submatrix

    A_k = [ a_11 ... a_1k ]
          [  ...      ... ]
          [ a_k1 ... a_kk ]

is the kth leading principal submatrix of A. The determinant of A_k is the kth order leading principal minor of A.

Theorem 4.1. Let A be an n by n symmetric matrix. Then
(1) A is positive definite if and only if all n of its leading principal minors are strictly positive.
(2) A is positive semi-definite if and only if all of its principal minors (the determinants of the submatrices obtained by deleting any set of rows and the corresponding columns, not just the leading ones) are weakly positive.
(3) A is negative definite if and only if its leading principal minors alternate in sign as follows: det(A_1) < 0, det(A_2) > 0, det(A_3) < 0, etc.
(4) A is negative semi-definite if and only if all of its principal minors weakly alternate in sign: every principal minor of odd order is ≤ 0 and every principal minor of even order is ≥ 0.
(5) A is indefinite if and only if none of the above four cases holds.

(Leading principal minors alone are not enough for the semi-definite cases: for example, A = [ 0 0 ; 0 -1 ] has both leading principal minors equal to 0, yet it is negative semi-definite, not positive semi-definite.)

The proof of this theorem is a bit tedious, so we will skip it. There is a proof in chapter 16.4 of Simon and Blume. Figure 1 shows each of the five types of definiteness.
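The definite cases of theorem 4.1 are easy to check numerically. Below is a minimal Python sketch (not from the notes) that classifies a symmetric matrix using its leading principal minors; it only attempts the definite cases, since the semi-definite cases require all principal minors.

    import numpy as np

    def classify(A):
        n = A.shape[0]
        minors = [np.linalg.det(A[:k, :k]) for k in range(1, n + 1)]
        if all(m > 0 for m in minors):
            return "positive definite"
        if all((m < 0 if k % 2 == 1 else m > 0) for k, m in enumerate(minors, start=1)):
            return "negative definite"
        return "not definite (check all principal minors or the eigenvalues)"

    print(classify(np.array([[ 2.0, 1.0], [1.0,  2.0]])))   # positive definite
    print(classify(np.array([[-2.0, 1.0], [1.0, -2.0]])))   # negative definite
    print(classify(np.array([[ 1.0, 0.0], [0.0, -1.0]])))   # not definite (indefinite)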

FIGURE 1. Definite quadratic forms. [Surface plots of five quadratic forms: positive definite x^2 + y^2, negative definite -x^2 - y^2, positive semi-definite x^2, negative semi-definite -y^2, and indefinite x^2 - y^2.]

4.1. Eigenvectors and eigenvalues. Another way to check whether a matrix is negative definite is by looking at the matrix's eigenvalues. Eigenvalues are of interest in their own right as well because they have many other uses, such as determining the stability of systems of difference and differential equations.

Definition 4.2. If A is an n by n matrix, λ is a scalar, v ∈ R^n with ||v|| = 1, and Av = λv, then λ is an eigenvalue of A and v is an eigenvector.

If v is an eigenvector of A with eigenvalue λ, then A(tv) = λ(tv) for any t ≠ 0. Thus, the requirement that ||v|| = 1 is just a normalization to pin down v. There are a few equivalent ways of defining eigenvalues, some of which are given by the following lemma.

Lemma 4.1. Let A be an n by n matrix and λ a scalar. Each of the following are equivalent:
(1) λ is an eigenvalue of A.
(2) A - λI is singular.
(3) det(A - λI) = 0.

Proof. We know that (2) and (3) are equivalent from our results on systems of linear equations and matrices. Also, if A - λI is singular, then the null space of A - λI contains non-zero vectors. Choose v ∈ N(A - λI) such that v ≠ 0. Then (A - λI)v = 0, so A(v/||v||) = λ(v/||v||), as in the definition of eigenvalues. ∎

The function χ_A : R → R defined by χ_A(x) = det(A - xI) is called the characteristic polynomial of A. It is a polynomial of order n. You may know from some other math course that polynomials of degree n have n roots, some of which might not be real and some of which might not be distinct. Suppose that A has m ≤ n distinct real eigenvalues, λ_1, ..., λ_m. Associated with each eigenvalue is a linear subspace of eigenvectors,

    {v ∈ R^n : Av = λ_i v} = N(A - λ_i I).

By definition of eigenvalues, each of these subspaces has dimension at least 1. Some of them may be of larger dimension, but they can be of dimension at most n. For each N(A - λ_i I) there is an orthonormal basis of eigenvectors, i.e. v_i^1, ..., v_i^{k_i} such that A v_i^j = λ_i v_i^j, ||v_i^j|| = 1, and (v_i^j)^T v_i^l = 0 for j ≠ l.

Lemma 4.2. Let A be an n by n matrix with m distinct real eigenvalues, and let v_i^1, ..., v_i^{k_i} be an orthonormal basis for N(A - λ_i I). Then {v_1^1, ..., v_1^{k_1}, ..., v_m^1, ..., v_m^{k_m}} are linearly independent.

Proof. Suppose the eigenvectors are not linearly independent. Then we could write(2)

    v_1^1 = c_1^2 v_1^2 + ... + c_m^{k_m} v_m^{k_m}

(2) Possibly we cannot do this for v_1^1, but we can for some v_i^j. Without loss of generality, assume that we can for v_1^1.

with at least one c_i^j ≠ 0. By the definition of eigenvalues and eigenvectors, we have

    A v_1^1 = λ_1 v_1^1
    A (c_1^2 v_1^2 + ... + c_m^{k_m} v_m^{k_m}) = λ_1 (c_1^2 v_1^2 + ... + c_m^{k_m} v_m^{k_m})
    c_1^2 λ_1 v_1^2 + ... + c_m^{k_m} λ_m v_m^{k_m} = λ_1 (c_1^2 v_1^2 + ... + c_m^{k_m} v_m^{k_m})
    c_2^1 (λ_2 - λ_1) v_2^1 + ... + c_m^{k_m} (λ_m - λ_1) v_m^{k_m} = 0,

where in the last line the terms involving the eigenvectors for λ_1 cancel. If c_2^1 = ... = c_m^{k_m} = 0, then the original equation becomes v_1^1 = c_1^2 v_1^2 + ... + c_1^{k_1} v_1^{k_1} with one of these c's non-zero. That would contradict the way v_1^1, ..., v_1^{k_1} were constructed to be orthonormal, hence linearly independent. Hence, c_i^j ≠ 0 for at least one i > 1. Since the λ_i are distinct, λ_k - λ_1 ≠ 0 for any k ≠ 1, so this says that v_2^1, ..., v_m^{k_m} are linearly dependent as well. We can repeat this argument to show that v_3^1, ..., v_m^{k_m} are linearly dependent, and then repeat it again and again to eventually show that v_{m-1}^{k_{m-1}} and v_m^{k_m} are linearly dependent. Then v_{m-1}^{k_{m-1}} = b v_m^{k_m} for some b ≠ 0, and

    A v_{m-1}^{k_{m-1}} = λ_{m-1} v_{m-1}^{k_{m-1}} = λ_{m-1} b v_m^{k_m}

and

    A v_{m-1}^{k_{m-1}} = A (b v_m^{k_m}) = b λ_m v_m^{k_m}.

This implies that λ_m = λ_{m-1}, contrary to the eigenvalues being distinct. Therefore, the collection of eigenvectors must be linearly independent. ∎

Since the eigenvectors are linearly independent, in R^n there can be at most n of them. However, there are some matrices that have strictly fewer eigenvectors. To say much more about eigenvalues we must impose the extra condition that A is symmetric. Fortunately, Hessian matrices are symmetric, so for the current purposes, imposing symmetry is not a problem.

Since A is symmetric, A = A^T. Also, by the definition of transpose, <Ax, y> = <x, A^T y>, or, specializing to R^n, (Ax)^T y = x^T A^T y = x^T A y, where the last equality comes from symmetry of A. By the fundamental theorem of algebra (which says that polynomials have n possibly complex roots), there is at least one λ_1 and v_1 such that A v_1 = λ_1 v_1. A short argument using complex numbers (which we have not covered) and the symmetry of A shows that λ_1 must be real. If we then consider A as a symmetric linear operator on the (n-1)-dimensional space of vectors orthogonal to v_1 (the orthogonal complement of span{v_1}), we can repeat this argument to show that there is a second (not necessarily distinct) eigenvalue λ_2 and associated eigenvector v_2 with v_2 ⊥ v_1. If we repeat this n times, we will construct n eigenvalues (some of which may be the same) and n orthogonal eigenvectors. Moreover, we can rescale the eigenvectors so that ||v_i|| = 1 for each i. These n eigenvectors form an orthonormal basis for R^n. You can imagine them as some rotation of the usual axes.
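Numerically, this orthonormal basis of eigenvectors is what numpy.linalg.eigh returns for a symmetric matrix. The following minimal sketch (an illustration, not part of the notes; the matrix is an arbitrary example) checks orthonormality and anticipates the eigendecomposition A = V Λ V^T stated in the next theorem.

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 2.0]])           # a symmetric matrix

    lam, V = np.linalg.eigh(A)           # eigenvalues (ascending) and eigenvectors
    print(lam)                                        # [1. 3.]
    print(np.allclose(V.T @ V, np.eye(2)))            # True: columns are orthonormal
    print(np.allclose(V @ np.diag(lam) @ V.T, A))     # True: A = V Lambda V^T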

Next, we can write A in terms of its eigenvectors and eigenvalues. If we make Λ a diagonal matrix consisting of the eigenvalues of A and V a matrix whose columns are the eigenvectors of A, then AV = VΛ. Moreover, V is an orthogonal matrix, so V^{-1} = V^T. This relationship is called the eigendecomposition of A.

Theorem 4.2 (Eigendecomposition). Let A be an n by n symmetric matrix. Then A has n (not necessarily distinct) eigenvalues and

    A = V Λ V^T,

where Λ is the diagonal matrix consisting of the eigenvalues of A, the columns of V are the associated eigenvectors of A, and V is an orthogonal matrix (its columns are orthonormal).

Comment 4.1. There are non-symmetric matrices that cannot be decomposed into eigenvalues and eigenvectors, for example

    [ 1  1 ]
    [ 0  1 ].

There are other non-symmetric matrices that can be eigendecomposed, for example

    [ 1  1 ]
    [-1  1 ].

Any square matrix with A^T A = A A^T can be eigendecomposed.

Using the eigendecomposition, we can relate eigenvalues to the definiteness of a matrix.

Theorem 4.3. If A is an n by n symmetric matrix with eigenvalues λ_1, ..., λ_n, then
(1) λ_i > 0 for all i iff A is positive definite,
(2) λ_i ≥ 0 for all i iff A is positive semi-definite,
(3) λ_i < 0 for all i iff A is negative definite,
(4) λ_i ≤ 0 for all i iff A is negative semi-definite,
(5) if some λ_i > 0 and some λ_j < 0, then A is indefinite.

Proof. Let A = V Λ V^T be the eigendecomposition of A. Let x ∈ R^n, x ≠ 0. Then z = V^T x ≠ 0 because V is nonsingular. Also,

    x^T A x = x^T V Λ V^T x = z^T Λ z.

Since Λ is diagonal we have

    z^T Λ z = z_1^2 λ_1 + ... + z_n^2 λ_n.

Thus, x^T A x = z^T Λ z > 0 for all x ≠ 0 iff λ_i > 0 for each i. The other parts of the theorem follow similarly. ∎

4.2. Other facts about eigendecomposition. The results in this subsection are not essential for the rest of this lecture, but they are occasionally useful. If a matrix is singular, then there is an x ≠ 0 such that Ax = 0. By definition, x/||x|| is then an eigenvector of A with eigenvalue 0. Conversely, if 0 is an eigenvalue of A, then there is an x with ||x|| = 1 such that Ax = 0, so A is singular.

Lemma 4.3 (Eigenvalues and invertibility). A is singular iff 0 is an eigenvalue of A.

Recall that we defined the norm of a linear transformation as

    ||A||_{BL} = sup_{x : ||x|| = 1} ||Ax||.

When A is symmetric, so that its eigendecomposition exists, this norm is equal to the largest eigenvalue of A in absolute value.

Lemma 4.4 (Eigenvalues and operator norm). Suppose the n by n symmetric matrix A has eigenvalues λ_1, ..., λ_n. Then ||A||_{BL} = max_{1 ≤ i ≤ n} |λ_i|.

We saw that definiteness is related to both the determinant and the eigenvalues. It should come as no surprise that determinants and eigenvalues are directly related: the determinant of a matrix is equal to the product of its eigenvalues.

Lemma 4.5 (Eigenvalues and determinants). If A has eigenvalues λ_1, ..., λ_n, then

    det(A) = ∏_{i=1}^n λ_i    and    det(A - tI) = ∏_{i=1}^n (λ_i - t).

5. GLOBAL MAXIMUM AND MINIMUM

The second order condition (theorem 3.1) along with the first order condition gives a nice way of finding local maxima of a function, but what about the global maximum? In general, you could use the first and second order conditions to find each of the local maxima of F in the interior of U. You then must compare each of these local maxima with each other and with the value of F on the boundary of U to find the global maximum of F on U. If there are many local maxima, or the boundary of U is complicated, this procedure can be quite involved. It would be nice if there were some simpler necessary and sufficient condition for a global maximum. Unfortunately, there is no such general necessary and sufficient condition. There is, however, a sufficient condition that is sometimes useful.

Definition 5.1. Let f : U → R. f is convex if for all x, y ∈ U with the line segment from x to y contained in U, we have

    f(tx + (1-t)y) ≤ t f(x) + (1-t) f(y)

for all t ∈ [0, 1]. Equivalently, f is convex if the set {(x, y) ∈ U × R : y ≥ f(x)} is convex. This set is called the epigraph of f. If you draw the graph of the function, the epigraph is the set of points above the function.

Definition 5.2. Let f : U → R. f is concave if for all x, y ∈ U with the line segment from x to y contained in U, we have

    f(tx + (1-t)y) ≥ t f(x) + (1-t) f(y)

for all t ∈ [0, 1]. Equivalently, f is concave if the set {(x, y) ∈ U × R : y ≤ f(x)} is convex.

Theorem 5.1. Let F : U → R be twice continuously differentiable and U ⊆ R^n be convex. Then,
(1) For maximization:
  (a) The following three conditions are equivalent:

    (i) F is a concave function on U,
    (ii) F(y) - F(x) ≤ DF(x)(y - x) for all x, y ∈ U,
    (iii) D^2F(x) is negative semi-definite for all x ∈ U.
  (b) If F is a concave function on U and DF(x*) = 0 for some x* ∈ U, then x* is a global maximizer of F on U.
(2) For minimization:
  (a) The following three conditions are equivalent:
    (i) F is a convex function on U,
    (ii) F(y) - F(x) ≥ DF(x)(y - x) for all x, y ∈ U,
    (iii) D^2F(x) is positive semi-definite for all x ∈ U.
  (b) If F is a convex function on U and DF(x*) = 0 for some x* ∈ U, then x* is a global minimizer of F on U.

This theorem is often summarized by saying that convex minimization problems (i.e. problems where the set U and the function F are both convex) have unique solutions. We will not prove this theorem until after covering convexity in more detail, which we may or may not have time for.

An important fact about convex optimization problems is that their solutions can be efficiently computed. A general optimization problem is what is known as NP-hard, which in practice means that you can write down what looks like a reasonable, not too large, optimization problem whose solution takes prohibitively long to compute. In contrast, convex optimization problems can be solved in polynomial time. In particular, if f is convex, x is n dimensional, the number of steps needed to compute f(x) is M, and you want to find an x̂ such that f(x̂) - f(x*) < ϵ, then there are algorithms that can find such an x̂ in O(n(n^3 + M) log(1/ϵ)) steps. In fact, there are algorithms for special types of convex problems that can solve them nearly as fast as a least squares problem or a linear program. In practice, this means that most convex problems can be solved quickly with up to about one thousand variables, and some convex problems can be solved very quickly even with tens or hundreds of thousands of variables. The discovery that convex optimization problems can be solved efficiently was fairly recent. The main theoretical breakthroughs began in the late eighties and early nineties, and it remains an active area of research.

6. APPLICATIONS

6.1. Profit maximization.

6.1.1. Competitive multi-product firm. Suppose a firm produces k goods using n inputs with production function f : R^n → R^k. The prices of the goods are p, and the prices of the inputs are w, so that the firm's profits are

    Π(x) = p^T f(x) - w^T x.

The firm chooses x to maximize profits:

    max_x p^T f(x) - w^T x.

The first order condition is

    p^T Df(x*) = w^T,

or, without using matrices,

    Σ_{j=1}^k p_j (∂f_j/∂x_i)(x*) = w_i

for i = 1, ..., n. The second order condition is that

    D[p^T Df(x)] = [ Σ_{j=1}^k p_j ∂^2f_j/∂x_1^2     ...   Σ_{j=1}^k p_j ∂^2f_j/∂x_1∂x_n ]
                   [            ...                  ...              ...                ]
                   [ Σ_{j=1}^k p_j ∂^2f_j/∂x_1∂x_n   ...   Σ_{j=1}^k p_j ∂^2f_j/∂x_n^2   ]

must be negative semi-definite.

6.1.2. Multi-product monopolist. Consider the same setup as before, but now with a monopolist who recognizes that prices depend on output. Let the inverse demand function be P(q), where P : R^k → R^k. Now the firm's problem is

    max_x P(f(x))^T f(x) - w^T x.

The first order condition is

    f(x)^T DP(f(x)) Df(x) + P(f(x))^T Df(x) = w^T,

or, without matrices,

    Σ_{j=1}^k [ Σ_{l=1}^k (∂P_j/∂q_l)(f(x)) (∂f_l/∂x_i)(x) f_j(x) + P_j(f(x)) (∂f_j/∂x_i)(x) ] = w_i.

We can get something a bit more interpretable by writing this in terms of elasticities. Recall that the elasticity of demand for good l with respect to the price of good j is ϵ_{lj} = (∂q_l/∂p_j)(p_j/q_l); here it is convenient to use its inverse-demand counterpart

    ϵ_{lj}^{-1} = (∂P_j/∂q_l)(q) (q_l / P_j(q)).

Then,

    Σ_{j=1}^k P_j(f(x)) [ Σ_{l=1}^k ϵ_{lj}^{-1} (f_j(x)/f_l(x)) (∂f_l/∂x_i)(x) + (∂f_j/∂x_i)(x) ] = w_i

    Σ_{j=1}^k P_j(f(x)) [ (1 + ϵ_{jj}^{-1}) (∂f_j/∂x_i)(x) + Σ_{l ≠ j} ϵ_{lj}^{-1} (f_j(x)/f_l(x)) (∂f_l/∂x_i)(x) ] = w_i.

There is a way to compare the price and quantity produced by the monopolist to the competitive firm. There are also things that can be said about the comparison between a single-product monopolist (k = 1) and a multi-product monopolist. It might be interesting to derive some of these results. To begin with, let k = 1 and compare the monopolist to the competitive firm. Under some reasonable assumptions on f and P, you can show that x_m < x_c, where x_m is the input choice of the monopolist and x_c that of the competitive firm. You can also show that p_m > p_c and that |ϵ^{-1}| < 1 for the monopolist, i.e. the monopolist produces where demand is elastic. Although I left it out of the notation above, ϵ depends on x.
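To make the competitive firm's first and second order conditions concrete, here is a minimal Python sketch (purely illustrative; the Cobb-Douglas technology, the price numbers, and the use of scipy.optimize are assumptions, not part of the notes) for a single-output firm with f(x) = x_1^0.3 x_2^0.4.

    import numpy as np
    from scipy.optimize import minimize

    p = 2.0                          # output price
    w = np.array([1.0, 1.0])         # input prices

    def f(x):
        return x[0]**0.3 * x[1]**0.4

    def profit(x):
        return p * f(x) - w @ x

    res = minimize(lambda x: -profit(x), x0=np.array([1.0, 1.0]),
                   bounds=[(1e-6, None), (1e-6, None)])
    x_star = res.x

    # Check the first order condition p * Df(x*) = w by finite differences.
    eps = 1e-6
    grad_f = np.array([(f(x_star + eps * np.eye(2)[i]) - f(x_star)) / eps
                       for i in range(2)])
    print(x_star)
    print(p * grad_f, w)             # p * Df(x*) should be close to w

Because the assumed production function has decreasing returns to scale, profit is concave on the positive orthant, so by theorem 5.1 the critical point found here is a global maximizer.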