Vector Derivatives and the Gradient

Size: px

Start display at page:

Download "Vector Derivatives and the Gradient"

Jodie McKinney
5 years ago
Views:

1 ECE 275AB Lecture 10 Fall 2008 V1.1 c K. Kreutz-Delgado, UC San Diego p. 1/1 Lecture 10 ECE 275A Vector Derivatives and the Gradient

2 ECE 275AB Lecture 10 Fall 2008 V1.1 c K. Kreutz-Delgado, UC San Diego p. 2/1 Objective Functions l(x) Let l(x) be a real-valued function (aka functional) of an n-dimensional real vector x X = R n l( ) : X = R n V = R where X is an n-dimensional real Hilbert space with metric matrix Ω = Ω T > 0. Note that henceforth vectors in X are represented a column vectors in R n. Because here the value space Y = R is one-dimensional, wlog we can take V to be Cartesian (because all (necessarily positive scalar) metric weightings yield inner products and norms that are equivalent up to an overall positive scaling). We will call l(x) an objective function. If l(x) is a cost, loss, or penalty function, then we assume that it is bounded from below and we attempt to minimize it wrt x. If l(x) is a profit, gain, or reward function, then we assume that it is bounded from above and we attempt to maximize it wrt x. For example, suppose we wish to match a model pdf p x (y) to a true, but unknown, density p x 0(y) for an observed random vector, where we assume p x (y) p x 0(y), x. We can then use a penalty function of x to be given by a measure of (non-averaged or instantaneous) divergence or discrepancy D I (x 0 x) of the model pdf p x (y) from the true pdf p x 0(y) defined by ( ) D I (x 0 px 0(y) x) log p x (y) = log p x 0(y) log p x (y)

3 ECE 275AB Lecture 10 Fall 2008 V1.1 c K. Kreutz-Delgado, UC San Diego p. 3/1 Instantaneous Divergence & Negative Log-likelihood Note that minimizing the instantaneous divergence D I (x 0 x) is equivalent to maximizing the log-likelihood p x (y) or minimizing the negative log-likelihood l(x) = log p x (y) A special case arises from use of the nonlinear Gaussian additive noise model with known noise covariance C and known mean function h( ) y = h(x) + n, n N(0, C) y N(h(x), C) which yields the nonlinear weighted least-squares problem l(x) = log p x (y). = y h(x) 2 W, W = C 1 Further setting h(x) = Ax is the linear weighted least-squares problem we have already discussed l(x) = log p x (y) =. y Ax 2 W, W = C 1 The symbol. = denotes that fact that we are ignoring additive terms and multiplicative factors which are irrelevant for the purposes of obtaining a extremum (here, a minimum) of a loss function. Of course we cannot ignore these terms if we are interested in the optimal value of the loss function itself.

4 ECE 275AB Lecture 10 Fall 2008 V1.1 c K. Kreutz-Delgado, UC San Diego p. 4/1 Stationary Points and the Vector Partial Derivative Henceforth let the real scalar function l(x) be twice partial differentiable with respect to all components of x R n. A necessary condition for x be be a local extremum (maximum or minimum) of l(x) is that l(x) (... 1 n }{{} 1 n ) = 0 where the vector partial derivative operator ( 1... n ) is defined as a row operator. (See the extensive discussion in the Lecture Supplement on Real Vector Derivative.) A vector that satisfies l(x) = 0 is known as a stationary point of l(x). Stationary points are points at which l(x) has a local maximum, minimum, or inflection. Sufficient conditions for a stationary point to be a local extremum require that we develop a theory of vector differentiation that will allow us to clearly and succinctly discuss second-order derivative properties of objective functions.

5 ECE 275AB Lecture 10 Fall 2008 V1.1 c K. Kreutz-Delgado, UC San Diego p. 5/1 Derivative of a Vector-Valued Function The Jacobian Let f(x) R m have elements f i (x), i = 1,, m, which are all differentiable with respect to the components of x R n. We define the vector partial derivative of the vector function f(x) as J f (x) f(x) f 1(x). f m(x) = f 1 (x) f 1 (x) 1 n f m (x) f m (x) 1 n }{{} m n The matrix J f (x) = f(x) is known as the Jacobian matrix or operator of the mapping f(x). It is the linearization of the nonlinear mapping f(x) at the point x. Often we write y = f(x) and the corresponding Jacobian as J y (x). If m = n, then y = f(x) can be viewed as a change of variables, in which case det J y (x) is known as the Jacobian of the transformation. det J y (x) plays a fundamental role in the change of variable formula of pdf s.

6 ECE 275AB Lecture 10 Fall 2008 V1.1 c K. Kreutz-Delgado, UC San Diego p. 6/1 The Jacobian Cont. A function f : R n R m is locally one-to-one in an open neighborhood of x if and only if its Jacobian (linearization) J f (ξ) = f(ξ) is a one-to-one matrix for all points ξ R n in the open neighborhood of x. A function f : R n R m is locally onto an open neighborhood of y = f(x) if and only if its Jacobian (linearization) J f (ξ) = f(ξ) is an onto matrix for all points ξ R n in the corresponding neighborhood of x. A function f : R n R m is locally invertible in an open neighborhood of x if and only if it is locally one-to-one and onto in the open neighborhood of x which is true if and only if its Jacobian (linearization) J f (ξ) = f(ξ) is a one-to-one and onto (and hence invertible) matrix for all points ξ R n in the open neighborhood of x. This is known as the Inverse Function Theorem.

7 ECE 275AB Lecture 10 Fall 2008 V1.1 c K. Kreutz-Delgado, UC San Diego p. 7/1 Vector Derivative Identities c T x Ax g T (x)h(x) T Ax T Ωx h(g(x)) = c T for an arbitrary vector c = A for an arbitrary matrix A = g T (x) h(x) + ht (x) g(x), = x T A + x T A T for an arbitrary matrix A = 2x T Ω when Ω = Ω T = h g g (Chain Rule) g(x)t h(x) scalar Note that the last identity) is a statement about Jacobians and can be restated in an illuminating manner as J h g = J h J g (1) which says that the linearization of a composition is the composition of the linearizations.

8 ECE 275AB Lecture 10 Fall 2008 V1.1 c K. Kreutz-Delgado, UC San Diego p. 8/1 Application to Linear Gaussian Model Stationary points of l(x) = log p x (y). = e(x) 2 W where e(x) = y A(x) and W = C 1 satisfy or 0 = l = l e e = (2eT W)( A) = 2(y Ax) T WA A T W(y Ax) = 0 e(x) = y Ax N(A ) with A = A T W Therefore stationary points satisfy the Normal Equation A T WAx = A T y A A = A y

9 ECE 275AB Lecture 10 Fall 2008 V1.1 c K. Kreutz-Delgado, UC San Diego p. 9/1 The Hessian of an Objective Function The Hessian, or matrix of second partial derivatives of l(x), is defined by H(x) ( ) T l(x) = n n n n }{{} n n As a consequence of the fact that the Hessian is obviously symmetric i j = j i H(x) = H T (x)

10 ECE 275AB Lecture 10 Fall 2008 V1.1 c K. Kreutz-Delgado, UC San Diego p. 10/1 Vector Taylor Series Expansion Taylor series expansion of a scalar-valued function l(x) about a point x 0 to second-order in x = x x 0 : l(x 0 + x) = l(x 0 ) + l(x 0) x xt H(x 0 ) x + h.o.t. where H is the Hessian of l(x). Taylor series expansion of a vector-valued function h(x) about a point x 0 to first-order in x = x x 0 : h(x) = h(x 0 + x) = h(x 0 ) + h(x 0) x + h.o.t. To obtain notationally uncluttered expressions for higher order expansions, one switches to the use of tensor notation.

11 ECE 275AB Lecture 10 Fall 2008 V1.1 c K. Kreutz-Delgado, UC San Diego p. 11/1 Sufficient Condition for an Extremum Let x 0 be a stationary value of l(x), l(x 0) of l(x) about x 0 we have = 0. Then from the second-order expansion l(x) = l(x) l(x 0 ) 1 2 (x x 0) T H(x 0 )(x x 0 ) = 1 2 xt H(x 0 ) x assuming that x = x x 0 is small enough in norm so that higher order terms in the expansion can be neglected. (That is, we consider only local excursions away from x 0.) We see that If the Hessian is positive definite, then all local excursions of x away from x 0 increase the value of l(x) and thus Suff. Cond. for Stationary Point x 0 to be a Unique Local Min: H(x 0 ) > 0 Contrawise, if the Hessian is negative definite, then all local excursions of x away from x 0 decrease the value of l(x) and thus Suff. Cond. for Stationary Point x 0 to be a Unique Local Max: H(x 0 ) < 0 If the Hessian H(x 0 ) is full rank and indefinite at a stationary point x 0, then x 0 is a saddle point. If H(x 0 ) 0 then x 0 is a non-unique local minimum. If H(x 0 ) 0 then x 0 is a non-unique local maximum.

12 ECE 275AB Lecture 10 Fall 2008 V1.1 c K. Kreutz-Delgado, UC San Diego p. 12/1 Application to Linear Gaussian Model Cont. Note that the scalar loss function l(x). = y Ax 2 w = (y Ax) T W(y Ax) = y T Wy 2y T WAx + x T A T WAx has an exact quadratic expansion. Thus the arguments given in the previous slide hold for arbitrarily large (global) excursions away from a stationary point x 0. In particular, if H is positive definite, the stationary point x 0 must be a unique global minimum. Having shown that l = 2xT A T WA 2y T WA we determine the Hessian to be ( ) l T = 2 A T WA 0 H(x) = Therefore stationary points (which, as we have seen, necessarily satisfy the normal equation) are global minima of l(x). Furthermore, if A is one-to-one (has full column rank), then H(x) = 2 A T WA > 0 and there is only one unique stationary point (i.e., weighted least-squares solution) which minimizes l(x). Of course this could not be otherwise, as we know from our previous analysis of the weighted least-squares problem using Hilbert space theory.

13 ECE 275AB Lecture 10 Fall 2008 V1.1 c K. Kreutz-Delgado, UC San Diego p. 13/1 The Gradient of an Objective Function l(x) Note that dl(x) = dx or, setting l = dl/dt and v = dx/dt, l(v) = v = Ω 1 x Ω xv = [ Ω 1 x ( ) ] T T Ω x v = x l(x), v with the Gradient of l(x) defined by x l(x) Ω 1 x ( ) T = Ω 1 x 1. n with Ω x a local Riemannian metric. Note that l(v) = x l(x),v is a linear functional of the velocity vector v. In the vector space structure we have seen to date Ω x is independent of x, Ω x = Ω, and represents the metric of the (globally) defined metric space containing x. Furthermore, if Ω = I, then the space is a (globally) Cartesian vector space. When considering spaces of smoothly parameterized regular family probability density functions, a natural Riemannian (local, non-cartesian) metric is provided by the Fisher Information Matrix to be discussed later in this course.

14 ECE 275AB Lecture 10 Fall 2008 V1.1 c K. Kreutz-Delgado, UC San Diego p. 14/1 The Gradient as the Direction of Steepest Ascent What unit velocities, v = 1, in the domain space X result in the fastest rate of change of l(x) as measured by l(v)? Equivalently, What unit velocity directions v in the domain space X result in the fastest rate of change of l(x) as measured by l(v)? From the Cauchy-Schwarz inequality, we have l(v) = x l(x), v x l(x) v = x l(x) or x l(x) l(v) x l(x) Note that v = c x l(x) with c = x l(x) 1 l(v) = x l(x) x l(x) = direction of steepest ascent. v = c x l(x) with c = x l(x) 1 l(v) = x l(x) x l(x) = direction of steepest descent.

15 ECE 275AB Lecture 10 Fall 2008 V1.1 c K. Kreutz-Delgado, UC San Diego p. 15/1 The Cartesian or Naive Gradient In a Cartesian vector space the gradient of a cost function l(x) corresponds to taking Ω = I, in which case we have the Cartesian gradient c xl(x) = ( ) T = 1. n Often one naively assumes that the gradient takes this form even if it is not evident that the space is, in fact, Cartesian. In this case one might more accurately refer to the gradient shown as the standard gradient or, even, the naive gradient. Because the Cartesian gradient is the standard form assumed in many applications, it is common to just refer to it as the gradient, even if it is not the the correct, true gradient. (Assuming that we agree that the true gradient must give the direction of steepest descent and therefore depends on the metric Ω x.) This is the terminology adhered to by Amari and his colleagues, who then refer to the true Riemannian metric-dependent gradient as the natural gradient. As mentioned earlier, when considering spaces of smoothly parameterized regular family probability density functions, a natural Riemannian metric is provided by the Fisher Information Matrix. Amari is one of the first researchers to consider parametric estimation from this Information Geometry perspective. He has argued that the use of the natural (true) gradient can significantly improve the performance of statistical learning algorithms.

Generalized Gradient Descent Algorithms

ECE 275AB Lecture 11 Fall 2008 V1.1 c K. Kreutz-Delgado, UC San Diego p. 1/1 Lecture 11 ECE 275A Generalized Gradient Descent Algorithms ECE 275AB Lecture 11 Fall 2008 V1.1 c K. Kreutz-Delgado, UC San