Differentiable Functions

Let $S \subset \mathbb{R}^n$ be open and let $f : \mathbb{R}^n \to \mathbb{R}$. We recall that, for $x^o = (x^o_1, x^o_2, \ldots, x^o_n) \in S$, the partial derivative of $f$ at the point $x^o$ with respect to the component $x_j$ is defined as

$$\frac{\partial f(x^o)}{\partial x_j} := \lim_{h \to 0} \frac{f(x^o_1, x^o_2, \ldots, x^o_{j-1}, x^o_j + h, x^o_{j+1}, \ldots, x^o_n) - f(x^o_1, x^o_2, \ldots, x^o_n)}{h},$$

provided this limit exists. If this limit exists for the points $x \in S$, then we can differentiate the resulting function $x \mapsto \partial f(x)/\partial x_j$ with respect to any of the components of $x$ to obtain

$$\frac{\partial^2 f(x)}{\partial x_k \, \partial x_j},$$

the second partial derivative of $f$. In particular, if $k \neq j$ we refer to this second partial derivative as the mixed partial derivative. An important property of this mixed partial derivative is that

$$\frac{\partial}{\partial x_k}\left(\frac{\partial f(x)}{\partial x_j}\right) = \frac{\partial}{\partial x_j}\left(\frac{\partial f(x)}{\partial x_k}\right),$$

provided these second derivatives exist and are continuous. Higher order derivatives are defined in a similar manner.

A real-valued function $f : S \to \mathbb{R}$ will be said to be of class $C^{(k)}$ on the open set $S$ provided it is continuous and possesses continuous partial derivatives of all orders up to and including $k$. It will be said to be of class $C^{(\infty)}$ on $S$ if it is of class $C^{(k)}$ for all integers $k$. It will be said to be of class $C^{(k)}$ on an arbitrary set $S$ provided it is of class $C^{(k)}$ on a neighborhood of that set. We will also use the notation $C$ and $D$ for the classes $C^{(1)}$ and $C^{(2)}$ respectively.

Alternate notations for the partial derivatives will also be used, for example,

$$f_{x_j} = \frac{\partial f}{\partial x_j}, \qquad f_{x_k, x_j} = \frac{\partial}{\partial x_k}\left(\frac{\partial f}{\partial x_j}\right), \quad \text{etc.}$$

The notion of differentiability of functions of several variables is related to the existence of partial derivatives but is not coincident with the existence of the partials. Indeed, we have the following definition: A function $f : S \to \mathbb{R}^m$, where $S \subset \mathbb{R}^n$ is an open set, is said to be differentiable at a point $x^o \in S$ provided there is a linear transformation $L : \mathbb{R}^n \to \mathbb{R}^m$ such that

$$\lim_{h \to 0} \frac{\| f(x^o + h) - f(x^o) - L h \|}{\| h \|} = 0,$$

where $h$ ranges over $\mathbb{R}^n$.
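The limit in this definition can be watched numerically. The sketch below is only an illustration under stated assumptions: the sample function $f(x, y) = x^2 + 3xy$, the base point $(2, 0)$, the direction $(0.6, -0.8)$ and the helper names are all chosen here for convenience, and the candidate map $L h = 4h_1 + 6h_2$ is built from the partial derivatives $f_x = 2x + 3y$ and $f_y = 3x$ evaluated at that point.

```python
import math

def f(x, y):
    # Sample function (an assumption made for this illustration).
    return x**2 + 3*x*y

def L(h1, h2):
    # Candidate derivative at (2, 0), built from the partials f_x = 4, f_y = 6.
    return 4.0*h1 + 6.0*h2

def ratio(h1, h2, x0=2.0, y0=0.0):
    # The quotient |f(x0 + h) - f(x0) - L h| / |h| from the definition.
    num = abs(f(x0 + h1, y0 + h2) - f(x0, y0) - L(h1, h2))
    return num / math.hypot(h1, h2)

# Shrink h along one fixed direction of unit length and record the quotient.
ratios = [ratio(0.6*s, -0.8*s) for s in (1e-1, 1e-2, 1e-3)]
print(ratios)
```

For this particular $f$ the numerator equals $|h_1^2 + 3h_1h_2|$ exactly, so the quotient shrinks linearly with $\|h\|$. Sampling one direction is only a sanity check; the definition requires the limit along every way of letting $h \to 0$.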
In the case that such a linear transformation $L$ exists, it is called the derivative (sometimes the Fréchet derivative or the differential) of the function $f$ at the point $x^o$. There are
various notations for the differential. Since the linear transformation depends on the point $x^o$ we may denote it as $L = L(f; x^o)$ when we need to be specific. Other notations will be $f'(x^o; h) = L(f; x^o)(h)$.

Now, in the case that $f : S \to \mathbb{R}$, the linear transformation is a linear map from $\mathbb{R}^n$ to $\mathbb{R}$ and is therefore called a linear functional. This linear functional can be realized by the application of a dot product. Given any fixed vector $z \in \mathbb{R}^n$ it is clear that the map $\mathbb{R}^n \to \mathbb{R}$ given by

$$l_z(h) := z \cdot h = z^{\top} h$$

is a linear map, a fact which follows from the elementary properties of the dot product. On the other hand, it is well known that, given the standard basis of unit vectors, every linear transformation from $\mathbb{R}^n$ to $\mathbb{R}$ is realized as a $1 \times n$ matrix. So, if $y$ denotes this matrix, the linear functional is given by $h \mapsto y\,h$. In other words there is a one-to-one correspondence between vectors in $\mathbb{R}^n$ and linear transformations from $\mathbb{R}^n$ to $\mathbb{R}$.

It is shown in advanced calculus texts that if a real-valued function is differentiable on an open set, the partial derivatives exist and the linear functional that defines the derivative is given by the map

$$h \mapsto (f_{x_1}, f_{x_2}, \ldots, f_{x_n}) \cdot (h_1, h_2, \ldots, h_n) = \sum_{i=1}^{n} f_{x_i} h_i.^{1}$$

Here we have suppressed the dependence on the point $x^o$. The vector $(f_{x_1}, f_{x_2}, \ldots, f_{x_n})$ is called the gradient of $f$ and is written variously as $\operatorname{grad} f$ or $\nabla f$. Hence we write the differential as

$$f'(x^o; h) = h \cdot \nabla f(x^o).$$

As a simple example in $\mathbb{R}^3$, suppose that $f(x, y, z) := x^2 + 3y^2 + 2z^2$. Then $\operatorname{grad} f(x, y, z) = (2x, 6y, 4z)$ so that, for example, at the point $x^o = (1, 0, 1)$, $\operatorname{grad} f(1, 0, 1) = (2, 0, 4)$. Note that the differential at this point is the map $h \mapsto \nabla f(1, 0, 1) \cdot h$, or $(h_1, h_2, h_3) \mapsto 2h_1 + 4h_3$.

If the function $f$ is of class $C^{(2)}$ then the second partial derivatives are defined and continuous. We can consider the map from $S$ to $\mathbb{R}^n$ given by $\nabla f$ and ask for its derivative. This derivative is again a linear transformation defined on $\mathbb{R}^n$, now with values in $\mathbb{R}^n$.
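The gradient example above, $f(x, y, z) = x^2 + 3y^2 + 2z^2$ at the point $(1, 0, 1)$, can be checked with central differences. This is a minimal sketch and not part of the original notes: the helper names `numerical_gradient` and `differential`, the step size `eps` and the test vector $h = (1, 1, 1)$ are assumptions made for illustration.

```python
def f(x, y, z):
    # The example function from the text.
    return x**2 + 3*y**2 + 2*z**2

def numerical_gradient(func, point, eps=1e-6):
    # Central-difference approximation of each partial derivative.
    grad = []
    for j in range(len(point)):
        fwd = list(point)
        bwd = list(point)
        fwd[j] += eps
        bwd[j] -= eps
        grad.append((func(*fwd) - func(*bwd)) / (2*eps))
    return grad

def differential(grad, h):
    # The linear functional h -> grad . h.
    return sum(g*hi for g, hi in zip(grad, h))

grad = numerical_gradient(f, (1.0, 0.0, 1.0))
value = differential(grad, (1.0, 1.0, 1.0))
print(grad)   # approximately (2, 0, 4)
print(value)  # approximately 2*1 + 4*1 = 6
```

Central differences are exact for quadratics up to rounding, which is why the agreement with $(2, 0, 4)$ is so close here; for a general function the error would depend on `eps`.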
($^{1}$ Conversely, it can be shown that if the partial derivatives exist at a point $x^o$ and are all continuous in a neighborhood of that point, then the function $f$ is differentiable at $x^o$.)

It can be shown that this second derivative is then a bilinear form on $\mathbb{R}^n \times \mathbb{R}^n$ which can be realized in terms of a matrix, represented relative to the standard basis as
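$$Q(x^o) := \begin{pmatrix}
\dfrac{\partial^2 f(x^o)}{\partial x_1 \partial x_1} & \dfrac{\partial^2 f(x^o)}{\partial x_1 \partial x_2} & \cdots & \dfrac{\partial^2 f(x^o)}{\partial x_1 \partial x_n} \\
\dfrac{\partial^2 f(x^o)}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f(x^o)}{\partial x_2 \partial x_2} & \cdots & \dfrac{\partial^2 f(x^o)}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial^2 f(x^o)}{\partial x_n \partial x_1} & \dfrac{\partial^2 f(x^o)}{\partial x_n \partial x_2} & \cdots & \dfrac{\partial^2 f(x^o)}{\partial x_n \partial x_n}
\end{pmatrix},$$

which, since the second partial derivatives are continuous, is a symmetric matrix. The second differential, or second Fréchet derivative, of the function is then given by

$$f''(x^o; h, k) := k^{\top} Q h.$$

The matrix $Q$ is referred to as the Hessian matrix of $f$. Clearly the mapping $\mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ given by $(h, k) \mapsto f''(x^o; h, k) = k^{\top} Q h$ is a bilinear form, that is, a form that is linear in $h$ for every fixed $k$ and linear in $k$ for each fixed $h$.

We note that the values of this form are completely determined by the values of $f''(x^o; h, h)$ for $h \in \mathbb{R}^n$. This can be seen by the following computation, which is reminiscent of the binomial theorem:

$$f''(x^o; h + k, h + k) = f''(x^o; h, h + k) + f''(x^o; k, h + k) = f''(x^o; h, h) + 2 f''(x^o; h, k) + f''(x^o; k, k),$$

and hence

$$f''(x^o; h, k) = \frac{1}{2}\left( f''(x^o; h + k, h + k) - f''(x^o; h, h) - f''(x^o; k, k) \right).$$

Let us pause for a concrete example. Consider the case $n = 2$ and write the variables as $(x, y)$ rather than $(x_1, x_2)$. We continue to write $h = (h_1, h_2)$ and $k = (k_1, k_2)$. Then we have

$$f(x, y) = x^3 y^2,$$

so that

$$\nabla f = (3x^2 y^2, \; 2x^3 y), \qquad f'(x; h) = 3x^2 y^2 \, h_1 + 2x^3 y \, h_2,$$

while

$$f''(x^o; h, k) = \begin{pmatrix} k_1 & k_2 \end{pmatrix} \begin{pmatrix} 6xy^2 & 6x^2 y \\ 6x^2 y & 2x^3 \end{pmatrix} \begin{pmatrix} h_1 \\ h_2 \end{pmatrix} = 6xy^2 \, h_1 k_1 + 6x^2 y \, (h_2 k_1 + h_1 k_2) + 2x^3 \, h_2 k_2.$$

In this example the matrix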
$$Q = \begin{pmatrix} 6xy^2 & 6x^2 y \\ 6x^2 y & 2x^3 \end{pmatrix}$$

is the Hessian matrix. Note that it is symmetric.

A particularly instructive and useful example for our future work is given in the case that $f$ is the quadratic function

$$f(x) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j + \sum_{i=1}^{n} b_i x_i + c,$$

where $a_{ij} = a_{ji}$, $b_i$, and $c$ are given constants. In matrix form, we write

$$f(x) = \frac{1}{2} x^{\top} A x + b^{\top} x + c,$$

where the $n \times n$ matrix $A$ is symmetric.

For a given index $k$, the variable $x_k$ is repeated in pairs in the first term defining $f$, namely, when the index $j = k$ and when the index $i = k$. (This is the reason for the factor of $1/2$ in the definition.) So, for example, the derivative of the first term with respect to $x_1$ is

$$\frac{1}{2}\left( \sum_{j=1}^{n} a_{1j} x_j + \sum_{i=1}^{n} a_{i1} x_i \right).$$

Hence, differentiating the expression for $f$ with respect to $x_k$ we obtain

$$\frac{\partial f}{\partial x_k} = \frac{1}{2}\left( \sum_{i=1}^{n} a_{ik} x_i + \sum_{j=1}^{n} a_{kj} x_j \right) + b_k = \sum_{j=1}^{n} a_{kj} x_j + b_k,$$

since $a_{jk} = a_{kj}$ for all $j$ by hypothesis. Clearly, from this last form we have also that

$$\frac{\partial^2 f}{\partial x_i \, \partial x_j} = a_{ij}.$$

It follows that the gradient $\nabla f$ and the Hessian $Q$ are given by

$$\nabla f(x) = \left( \sum_{j=1}^{n} a_{1j} x_j + b_1, \; \ldots, \; \sum_{j=1}^{n} a_{nj} x_j + b_n \right) = \left( \sum_{j=1}^{n} a_{1j} x_j, \; \ldots, \; \sum_{j=1}^{n} a_{nj} x_j \right) + b$$

and

$$Q = (a_{ij}),$$
or

$$\nabla f(x) = Ax + b, \qquad Q = A.$$

Note, in particular, that if $\phi(x) := x \cdot x$ then $\nabla \phi(x) = 2x$, and if $\nu(x) := \|x\|$ then $\nabla \nu(x) = x / \|x\|$ for $x \neq 0$.

Now, the Hessian that appears in the above formula is a symmetric matrix, and for such matrices we have the following definition.

Definition 1.1 An $n \times n$ symmetric matrix $Q$ is said to be positive semi-definite provided, for all $x \in \mathbb{R}^n$,

$$\langle x, Qx \rangle \geq 0.$$

The matrix $Q$ is said to be positive definite provided, for all $x \in \mathbb{R}^n$, $x \neq 0$,

$$\langle x, Qx \rangle > 0.$$

We emphasize that the notions of positive definite and positive semi-definite are defined only for symmetric matrices.

It is important in the theory of optimization to interpret the differentials of $f$ as directional derivatives. Recall that we begin with a fixed unit vector (or direction) $\hat{u} \in \mathbb{R}^n$ and a real-valued function $f$ defined and continuous in a neighborhood of the point $x^o$. We assume that the neighborhood is convex, i.e. that for any two points in the neighborhood, the line segment joining the two points is completely contained in the neighborhood. (In $\mathbb{R}^n$ it is easy to see that a $\delta$-neighborhood is such a set.) Then the directional derivative of $f$ at $x^o$ in the direction $\hat{u}$ is defined to be

$$(D_{\hat{u}} f)(x^o) := \lim_{t \to 0} \frac{f(x^o + t\hat{u}) - f(x^o)}{t}.$$

As a simple example, consider $f(x, y) := x^2 + 3xy$ and let $x^o = (2, 0)$. Let $\hat{u} = (1/\sqrt{2}, \, -1/\sqrt{2})$. Then
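$$x^o + t\hat{u} = \left( 2 + \frac{t}{\sqrt{2}}, \; -\frac{t}{\sqrt{2}} \right),$$

$$f(x^o + t\hat{u}) = \left( 2 + \frac{t}{\sqrt{2}} \right)^2 + 3\left( 2 + \frac{t}{\sqrt{2}} \right)\left( -\frac{t}{\sqrt{2}} \right) = 4 - \sqrt{2}\, t - t^2,$$

and so

$$(D_{\hat{u}} f)(x^o) = \lim_{t \to 0} \frac{(4 - \sqrt{2}\, t - t^2) - 4}{t} = -\sqrt{2}.$$

Now, if the function $f : \mathbb{R}^n \to \mathbb{R}$ is of class $C^{(k)}$ on $S$ and, for some $\delta > 0$, $y := x^o + t\hat{u} \in S$ for $-\delta < t < \delta$, then the function

$$\phi(t) := f(x^o + t\hat{u}) = f(x^o_1 + t u_1, \, x^o_2 + t u_2, \, \ldots, \, x^o_n + t u_n), \qquad -\delta < t < \delta,$$

is of class $C^{(k)}$ in $t$ on $(-\delta, \delta)$. If, in particular, $f \in C^{(1)}$ then by the chain rule for differentiation, we have

$$\phi'(t) = f_{x_1}(x^o + t\hat{u})\, u_1 + \cdots + f_{x_n}(x^o + t\hat{u})\, u_n,$$

and we have $\phi'(0) = f'(x^o; \hat{u})$. We often write

$$\left. \frac{d}{dt} f(x^o + t\hat{u}) \right|_{t=0} = f'(x^o; \hat{u}).$$

This shows how to compute the directional derivative, since $f'(x^o; \hat{u}) = \nabla f(x^o) \cdot \hat{u}$. In the simple example given above, since $f(x, y) = x^2 + 3xy$, we have $\nabla f = (2x + 3y, \, 3x)$ so that

$$(D_{\hat{u}} f)(x^o) = \left( \frac{1}{\sqrt{2}}, \, -\frac{1}{\sqrt{2}} \right) \cdot (4, 6) = \frac{4}{\sqrt{2}} - \frac{6}{\sqrt{2}} = -\sqrt{2},$$

as before. Clearly, if $\hat{u} = e_j$, the standard $j$th unit vector, then we recover the usual $j$th partial derivative.

We now turn to the multidimensional analog of Taylor's formula. Again, we assume that $f$ is a real-valued function defined on an open set $S$. Then, if $x \in S$ we can choose $h$ so that, for every $t$, $0 \leq t \leq 1$, the line segment parameterized by $y(t) = x + th$ lies in $S$. Then Taylor's formula for $f$ can be derived from the Taylor formula for the function $\phi(t) := f(x + th)$. Indeed $\phi : [0, 1] \to \mathbb{R}$ and, if $f \in C^{(1)}$, we have

$$\phi(1) = f(x + h), \qquad \phi(0) = f(x), \qquad \phi'(\theta) = f'(x + \theta h; h).$$

Using the Mean Value Theorem for functions of one variable, we have that there is a number $\theta_1$, $0 < \theta_1 < 1$, such that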
$$\phi(1) = \phi(0) + \phi'(\theta_1) = \phi(0) + \int_0^1 \phi'(\theta) \, d\theta,$$

the second relation holding by integration. In terms of the original function $f$ we then have what we call the first-order Taylor expansion

$$f(x + h) = f(x) + f'(x + \theta_1 h; h) = f(x) + \int_0^1 f'(x + \theta h; h) \, d\theta.$$

Now, if the function $f \in C^{(2)}$ then so is the function $\phi$ and we have, using the second-order Taylor expansion for a function of one variable,

$$\phi(1) = \phi(0) + \phi'(0) + \frac{1}{2} \phi''(\theta_2),$$

where $\theta_2 \in (0, 1)$. Or, in terms of the integral remainder term,

$$\phi(1) = \phi(0) + \phi'(0) + \int_0^1 (1 - \theta)\, \phi''(\theta) \, d\theta.$$

(To derive this form, start with the first order Taylor expansion for $\phi$ with integral remainder and integrate by parts.)

In terms of the original function $f$, writing $f''(x; h)$ as shorthand for $f''(x; h, h)$, we have

$$f(x + h) = f(x) + f'(x; h) + \frac{1}{2} f''(x + \theta_2 h; h) = f(x) + f'(x; h) + \int_0^1 (1 - \theta)\, f''(x + \theta h; h) \, d\theta.$$

Now, if we write

$$r_2(x, h) := \int_0^1 (1 - \theta) \left[ f''(x + \theta h; h) - f''(x; h) \right] d\theta,$$
we can write the second order Taylor expansion in the form

$$f(x + h) = f(x) + f'(x; h) + \frac{1}{2} f''(x; h) + r_2(x, h).$$
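The second order expansion can be tried out on the earlier example $f(x, y) = x^3 y^2$, whose gradient and Hessian were computed above. The following sketch rests on choices made here for illustration only: the base point $(1, 1)$, increments of the form $h = (s, s)$ and the helper names are assumptions, and the remainder is obtained by subtraction rather than from the integral formula.

```python
def f(x, y):
    return x**3 * y**2

def grad(x, y):
    # Gradient of x^3 y^2, as computed in the text.
    return (3*x**2*y**2, 2*x**3*y)

def hessian(x, y):
    # Hessian of x^3 y^2, as computed in the text.
    return ((6*x*y**2, 6*x**2*y),
            (6*x**2*y, 2*x**3))

def taylor2(x, y, h):
    # f(x) + f'(x; h) + (1/2) f''(x; h, h)
    g = grad(x, y)
    Q = hessian(x, y)
    first = g[0]*h[0] + g[1]*h[1]
    second = sum(h[i]*Q[i][j]*h[j] for i in range(2) for j in range(2))
    return f(x, y) + first + 0.5*second

def remainder(x, y, h):
    # r_2(x, h), computed as the difference f(x + h) - (second order expansion).
    return f(x + h[0], y + h[1]) - taylor2(x, y, h)

# Quotient r_2 / |h|^2 for h = (s, s), where |h|^2 = 2 s^2.
quotients = [remainder(1.0, 1.0, (s, s)) / (2*s*s) for s in (1e-1, 1e-2, 1e-3)]
print(quotients)
```

At $(1, 1)$ with $h = (s, s)$ one has $f(x + h) = (1 + s)^5$, so the remainder is $10s^3 + 5s^4 + s^5$ and the printed quotients shrink roughly like $5s$. This illustrates that $r_2(x, h)$ is small compared with $\|h\|^2$ for this smooth example.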