
Maria Cameron

Nonlinear Least Squares Problem

The nonlinear least squares problem arises when one needs to find the optimal set of parameters for a nonlinear model given a large set of data. The variables $x_1, \ldots, x_n$ represent the parameters of the model. The residuals $r_1, \ldots, r_m$, $m \ge n$, represent the discrepancies between the model $\varphi(x, t_j)$ and the data $y_j$:

$$r_j(x) = y_j - \varphi(x, t_j).$$

The objective function is typically chosen to be

$$f(x) := \frac{1}{2} \sum_{j=1}^{m} r_j^2(x). \tag{1}$$

The Jacobian of the residual vector $r(x) = (r_1(x), \ldots, r_m(x))^T$ is defined by

$$J(x) := \begin{bmatrix} \dfrac{\partial r_1}{\partial x_1} & \cdots & \dfrac{\partial r_1}{\partial x_n} \\ \vdots & & \vdots \\ \dfrac{\partial r_m}{\partial x_1} & \cdots & \dfrac{\partial r_m}{\partial x_n} \end{bmatrix}. \tag{2}$$

Then we have

$$\nabla f(x) = J^T(x)\, r(x), \qquad \nabla^2 f(x) = J^T(x) J(x) + \sum_{j=1}^{m} r_j(x) \nabla^2 r_j(x). \tag{3}$$

The expressions above follow from

$$\frac{\partial f}{\partial x_i} = \sum_{j=1}^{m} r_j \frac{\partial r_j}{\partial x_i}, \qquad \frac{\partial^2 f}{\partial x_i \partial x_k} = \sum_{j=1}^{m} \left[ \frac{\partial r_j}{\partial x_k} \frac{\partial r_j}{\partial x_i} + r_j \frac{\partial^2 r_j}{\partial x_i \partial x_k} \right].$$

In the methods for the nonlinear least squares problem, the second term of $\nabla^2 f$ is often assumed to be small and hence is neglected. This is justified in two cases: (i) if the residuals $r_j$, $j = 1, \ldots, m$, are small, and (ii) if the model $\varphi(x, t)$ is nearly linear in $x$.

1 Motivation of the choice of the objective function

The objective function $f$ can be chosen in many ways. For example, instead of Eq. (1) one can set $f$ to be

$$f(x) = \sum_{j=1}^{m} |y_j - \varphi(x, t_j)|.$$

There are situations where a choice of $f$ other than Eq. (1) is optimal. However, Eq. (1) is common as it has the following statistical motivation. Let us assume that the model $\varphi(x, t)$ adequately describes the process at hand, and that the data $y_j$ do not contain a systematic error, i.e., an error that persists from measurement to measurement. Then the residuals $r_j$ are independent and identically distributed random variables with a certain variance $\sigma^2$ and probability density function $g_\sigma(\cdot)$. The likelihood of a particular set of observations $y_j$, $j = 1, 2, \ldots, m$, given that the actual parameter vector is $x$, is given by the function

$$p(y; x, \sigma) = \prod_{j=1}^{m} g_\sigma(r_j) = \prod_{j=1}^{m} g_\sigma(y_j - \varphi(x, t_j)).$$

Since we know the $y_j$'s, the most likely value of $x$ is obtained by maximizing $p(y; x, \sigma)$ with respect to $x$. The resulting value of $x$ is called the maximum likelihood estimate of the parameters. Next, we assume that the discrepancies $r_j$ follow the normal distribution, i.e.,

$$g_\sigma(z) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{z^2}{2\sigma^2}}.$$

Substitution into the expression for $p(y; x, \sigma)$ gives

$$p(y; x, \sigma) = (2\pi\sigma^2)^{-m/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{j=1}^{m} (y_j - \varphi(x, t_j))^2 \right).$$

For any fixed variance $\sigma^2$, $p$ is maximized when the sum of the squares is minimized. Therefore, when the discrepancies are assumed to be independent and identically distributed with a normal distribution, the maximum likelihood estimate is obtained by minimizing the sum of squares.

2 The Gauss-Newton method

The simplest algorithm for the nonlinear least squares problem is the Gauss-Newton method. This is a line search quasi-Newton method where the approximate Hessian is chosen to be $J^T J$. Hence the search direction $p_k^{GN}$ is given by

$$J_k^T J_k\, p_k^{GN} = -J_k^T r_k. \tag{4}$$

This method has several important advantages.

(1) The estimation of the Hessian is essentially free as soon as the Jacobian $J_k$ is calculated for the right-hand side of Eq. (4): $\nabla f_k = J_k^T r_k$.

(2) In many interesting situations, the approximate Hessian $J^T J$ is close to the exact one: in the small residual case and in the nearly linear model case. Hence, the Gauss-Newton method can achieve a performance similar to that of the Newton method even though the second term in Eq. (3) is omitted.

(3) Whenever $J_k$ has full rank and $\nabla f_k \neq 0$, the search direction is a descent direction. Indeed,

$$\nabla f_k^T p_k^{GN} = (J_k^T r_k)^T p_k^{GN} = -(J_k^T J_k\, p_k^{GN})^T p_k^{GN} = -(p_k^{GN})^T J_k^T J_k\, p_k^{GN} = -\|J_k p_k^{GN}\|^2 \le 0.$$

The equality takes place if and only if $J_k p_k^{GN} = 0$, which is equivalent to $J_k^T r_k = \nabla f_k = 0$.

(4) The Gauss-Newton direction $p_k^{GN}$ is the solution of the linear least squares problem

$$\min_p \|J_k p + r_k\|^2,$$

as Eq. (4) is the normal equation for this problem. Hence one can find it by using the techniques for the linear least squares problem such as the QR decomposition and the SVD.

A shortcoming of the Gauss-Newton method is that it requires the Jacobian to have full rank in order to guarantee convergence.

2.1 Convergence of Gauss-Newton method

The following theorem states sufficient conditions for the Gauss-Newton method to converge to a point where $\nabla f = J^T r = 0$.

Theorem 1. Suppose each residual function $r_j$ is Lipschitz continuously differentiable in a neighborhood of the level set $L := \{x : f(x) \le f(x_0)\}$, and that the Jacobians satisfy the uniform full-rank condition $\|J(x) z\| \ge \gamma \|z\|$ with a constant $\gamma > 0$. Then if the iterates $x_k$ are generated by the Gauss-Newton method with step lengths satisfying the Wolfe conditions, we have

$$\lim_{k \to \infty} J_k^T r_k = 0.$$

Proof. The Gauss-Newton method is a line search method. Lipschitz continuity of $\nabla r_j$ implies Lipschitz continuity of $\nabla f$ over the neighborhood of $L$. Therefore, we have all assumptions that are sufficient for the Zoutendijk condition to hold. Hence, it suffices to show that $\cos\theta_k$, $k = 0, 1, 2, \ldots$, are uniformly bounded away from zero. Lipschitz continuity of each $\nabla r_j$ implies continuity of $J(x)$. Hence there is a constant $\beta > 0$ such that $\|J(x)\| \le \beta$, $x \in L$. We have

$$\cos\theta_k = -\frac{(\nabla f_k)^T p_k^{GN}}{\|\nabla f_k\|\, \|p_k^{GN}\|} = -\frac{r_k^T J_k p_k^{GN}}{\|J_k^T r_k\|\, \|p_k^{GN}\|} = \frac{\|J_k p_k^{GN}\|^2}{\|J_k^T J_k p_k^{GN}\|\, \|p_k^{GN}\|} \ge \frac{\gamma^2 \|p_k^{GN}\|^2}{\beta^2 \|p_k^{GN}\|^2} = \frac{\gamma^2}{\beta^2} > 0.$$

Now we will establish the rate of convergence. Doing the same type of analysis as we conducted for the line search methods, and writing $x^*$ for the solution and $\nabla f^* = \nabla f(x^*)$, we have for a unit step

$$x_{k+1} - x^* = x_k + p_k^{GN} - x^* = x_k - x^* - [J_k^T J_k]^{-1} \nabla f_k = [J_k^T J_k]^{-1} \left[ [J_k^T J_k](x_k - x^*) + \nabla f^* - \nabla f_k \right].$$

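Here we have used that $\nabla f^* = \nabla f(x^*) = 0$. Now, let us introduce the following notation for the omitted part of the Hessian of $f$:

$$\nabla^2 f = J^T J + \sum_{j=1}^{m} r_j \nabla^2 r_j \equiv J^T J + H.$$

Then we can write

$$\nabla f_k - \nabla f^* = \int_0^1 J^T J(x^* + t(x_k - x^*))(x_k - x^*)\, dt + \int_0^1 H(x^* + t(x_k - x^*))(x_k - x^*)\, dt.$$

We assume that $J^T J(x^*)$ is invertible and $J$ is Lipschitz-continuous. Then for any $\epsilon > 0$ there is a ball $B_\epsilon^1$ around $x^*$ such that

$$\|[J^T J(x)]^{-1}\| \le (1 + \epsilon) \|[J^T J(x^*)]^{-1}\| \quad \text{for all } x \in B_\epsilon^1.$$

Furthermore, we assume that $H$ is continuous. Then there is a ball $B_\epsilon^2$ around $x^*$ such that

$$\|H(x)\| \le (1 + \epsilon) \|H(x^*)\| \quad \text{for all } x \in B_\epsilon^2.$$

If $x_k \in B_\epsilon^1 \cap B_\epsilon^2$, we have

$$\|x_{k+1} - x^*\| = \left\| [J^T J(x_k)]^{-1} \left\{ J^T J(x_k)(x_k - x^*) - \int_0^1 J^T J(x^* + t(x_k - x^*))(x_k - x^*)\, dt - \int_0^1 H(x^* + t(x_k - x^*))(x_k - x^*)\, dt \right\} \right\|$$

$$\le \|[J^T J(x_k)]^{-1}\| \int_0^1 \|H(x^* + t(x_k - x^*))\|\, \|x_k - x^*\|\, dt + O(\|x_k - x^*\|^2)$$

$$\le (1 + \epsilon)^2\, \|[J^T J(x^*)]^{-1}\|\, \|H(x^*)\|\, \|x_k - x^*\| + O(\|x_k - x^*\|^2).$$

The $O(\|x_k - x^*\|^2)$ term collects the difference between $J^T J(x_k)$ and $J^T J$ along the segment from $x^*$ to $x_k$, which is bounded using the Lipschitz continuity of $J$.

Therefore, the rate of convergence of the Gauss-Newton method hinges on the quantity $\|[J^T J(x^*)]^{-1}\|\, \|H(x^*)\|$. In order to expect that $\|x_k + p_k^{GN} - x^*\| < \|x_k - x^*\|$, i.e., that the Gauss-Newton step of unit length takes us closer to the solution, we need to have $\|[J^T J(x^*)]^{-1}\|\, \|H(x^*)\| < 1$. In the small residual case and in the nearly linear case we have $\|[J^T J(x^*)]^{-1}\|\, \|H(x^*)\| \ll 1$, and therefore the convergence is rapid. The convergence is quadratic if $H(x^*) = 0$.

3 The Levenberg-Marquardt method

The Levenberg-Marquardt method, like the Gauss-Newton method, uses $J^T J$ as an approximation for the Hessian. However, it employs the trust region strategy rather than the line search strategy. The main advantage of this method over the Gauss-Newton method is that it handles the case where $J$ does not have full rank naturally, and its convergence is not impacted by rank deficiency of $J$.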
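The convergence analysis of Section 2.1 can be illustrated numerically. The following is a minimal sketch of the Gauss-Newton iteration, not a reference implementation. It makes several assumptions that are not in the notes: the model $\varphi(x, t) = x_1 e^{x_2 t}$, the synthetic noise-free data, the starting point, and the names `gauss_newton`, `residual` and `jacobian` are all hypothetical, and a simple Armijo backtracking rule stands in for a line search satisfying the Wolfe conditions. Because the data are noise-free, the residuals vanish at the solution, so $H(x^*) = 0$ and the gradient norms should drop very quickly.

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, tol=1e-10, max_iter=50):
    """Gauss-Newton with Armijo backtracking (a simplification of Wolfe steps)."""
    x = np.asarray(x0, dtype=float)
    grad_norms = []
    for _ in range(max_iter):
        r = residual(x)
        J = jacobian(x)
        grad = J.T @ r                      # gradient of f, Eq. (3)
        grad_norms.append(np.linalg.norm(grad))
        if grad_norms[-1] < tol:
            break
        # Search direction: least squares solution of min_p ||J p + r||,
        # whose normal equation is Eq. (4).
        p, *_ = np.linalg.lstsq(J, -r, rcond=None)
        f = 0.5 * (r @ r)
        alpha = 1.0
        while alpha > 1e-12:
            r_new = residual(x + alpha * p)
            # Armijo sufficient-decrease test; grad @ p = -||J p||^2 <= 0.
            if 0.5 * (r_new @ r_new) <= f + 1e-4 * alpha * (grad @ p):
                break
            alpha *= 0.5
        x = x + alpha * p
    return x, grad_norms

# Hypothetical zero-residual test problem: y_j = 2 * exp(-t_j).
t = np.linspace(0.0, 2.0, 20)
x_true = np.array([2.0, -1.0])
y = x_true[0] * np.exp(x_true[1] * t)

def residual(x):
    return y - x[0] * np.exp(x[1] * t)

def jacobian(x):
    # Derivatives of r_j = y_j - phi(x, t_j) with respect to x_1 and x_2.
    e = np.exp(x[1] * t)
    return -np.column_stack([e, x[0] * t * e])

x_hat, grad_norms = gauss_newton(residual, jacobian, [1.5, -0.8])
```

Inspecting `grad_norms` shows how fast $\|J_k^T r_k\|$ decays. Adding noise to `y` makes the residuals at the solution nonzero, which is the situation where the term $H(x^*)$ in the estimate above starts to matter.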

Originally the Levenberg-Marquardt method was proposed without making a connection with the trust region strategy. This connection was made later by Moré. For a spherical trust region of radius $\Delta_k$, the problem to be solved at every iteration is

$$\min_p \frac{1}{2} \|J_k p + r_k\|^2, \quad \|p\| \le \Delta_k. \tag{5}$$

Explicitly, the model $m_k(p)$ is given by

$$m_k(p) = \frac{1}{2} \|r_k\|^2 + p^T J_k^T r_k + \frac{1}{2} p^T J_k^T J_k\, p. \tag{6}$$

The following convergence result is a direct consequence of the convergence theorem for the trust region methods.

Theorem 2. Let $\eta \in (0, \frac{1}{4})$ in the Trust Region Algorithm. Suppose the functions $r_j(\cdot)$ are twice continuously differentiable in a neighborhood of a level set $L := \{x : f(x) \le f(x_0)\}$, and that for each $k$ the approximate solution $p_k$ of Eq. (5) satisfies the inequality

$$m_k(0) - m_k(p_k) \ge c_1 \|J_k^T r_k\| \min\left( \Delta_k, \frac{\|J_k^T r_k\|}{\|J_k^T J_k\|} \right) \tag{7}$$

for some constant $c_1 > 0$. Then

$$\lim_{k \to \infty} \nabla f_k = \lim_{k \to \infty} J_k^T r_k = 0.$$

There is no need to calculate the right-hand side of inequality (7) or to check it explicitly. Instead, we can simply require that the decrease given by $p_k$ at least matches that given by the Cauchy point, which can be calculated inexpensively.

The solution of Eq. (5) can be characterized in the same way as the solution of the general trust region problem. The particular form of the approximate Hessian leads to some simplification.

Theorem 3. The vector $p^{LM}$ is a solution of Eq. (5) if and only if $\|p^{LM}\| \le \Delta$ and there is a scalar $\lambda \ge 0$ such that

$$(J^T J + \lambda I)\, p^{LM} = -J^T r, \tag{8}$$

$$\lambda (\Delta - \|p^{LM}\|) = 0. \tag{9}$$

Note that the third condition, that $J^T J + \lambda I$ is positive semidefinite, holds automatically since $J^T J$ is always positive semidefinite and $\lambda \ge 0$. Eq. (8) is the normal equation for the linear least squares problem

$$\min_p \frac{1}{2} \left\| \begin{bmatrix} J \\ \sqrt{\lambda}\, I \end{bmatrix} p + \begin{bmatrix} r \\ 0 \end{bmatrix} \right\|^2, \tag{10}$$

where $J$ is $m \times n$, $I$ is $n \times n$, $r$ is $m \times 1$, and $0$ is $n \times 1$.
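The equivalence between Eq. (8) and the augmented problem (10) can be checked numerically. The sketch below is illustrative only: the function names `lm_step_augmented` and `lm_step_normal`, the random test matrix, and the chosen values of $\lambda$ are assumptions made for this example, and $\lambda$ is treated as given rather than determined from the trust region radius $\Delta$ through Eq. (9). The Jacobian is deliberately made rank deficient, so $J^T J$ is singular while $J^T J + \lambda I$ remains invertible for every $\lambda > 0$.

```python
import numpy as np

def lm_step_augmented(J, r, lam):
    """Solve Eq. (10): least squares with the stacked matrix [J; sqrt(lam) I]."""
    n = J.shape[1]
    A = np.vstack([J, np.sqrt(lam) * np.eye(n)])
    b = -np.concatenate([r, np.zeros(n)])
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p

def lm_step_normal(J, r, lam):
    """Solve Eq. (8) directly: (J^T J + lam I) p = -J^T r."""
    n = J.shape[1]
    return np.linalg.solve(J.T @ J + lam * np.eye(n), -J.T @ r)

# Hypothetical test data with a rank-deficient Jacobian:
# the third column is the sum of the first two.
rng = np.random.default_rng(0)
J = rng.standard_normal((8, 3))
J[:, 2] = J[:, 0] + J[:, 1]
r = rng.standard_normal(8)

lams = [1e-2, 1.0, 1e2]
steps_aug = [lm_step_augmented(J, r, lam) for lam in lams]
steps_normal = [lm_step_normal(J, r, lam) for lam in lams]
step_norms = [np.linalg.norm(p) for p in steps_aug]
```

The two routes give the same step, and `step_norms` shrinks as $\lambda$ grows, which is how a larger $\lambda$ corresponds to a smaller trust region. Working with the stacked matrix avoids forming $J^T J$ explicitly, which squares the condition number; this is the usual reason to prefer form (10) with a QR or SVD based solver.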

Least-squares problems are often poorly scaled. If this is the case, it makes sense to make the trust region elliptic rather than circular. For example, if the variable $x_1$ is of the order of $10^4$ while the variable $x_2$ is of the order of $10^{-6}$, then the trust region can be defined by

$$\frac{p_1^2}{(10^4)^2} + \frac{p_2^2}{(10^{-6})^2} \le \Delta^2,$$

or equivalently,

$$\|Dp\| \le \Delta, \quad \text{where } D = \begin{bmatrix} 10^{-4} & 0 \\ 0 & 10^{6} \end{bmatrix}.$$

In this case, the trust region problem is

$$\min_p \frac{1}{2} \|Jp + r\|^2, \quad \|Dp\| \le \Delta,$$

and its solution satisfies an equation of the form

$$(J^T J + \lambda D^2)\, p = -J^T r,$$

and, equivalently, solves the linear least squares problem

$$\min_p \frac{1}{2} \left\| \begin{bmatrix} J \\ \sqrt{\lambda}\, D \end{bmatrix} p + \begin{bmatrix} r \\ 0 \end{bmatrix} \right\|^2.$$

The elements of the scaling matrix $D$ can change from iteration to iteration, as we gather information about the typical values of each component of $x$.

References

[1] J. Nocedal, S. Wright, Numerical Optimization, Springer, 1999.
