Least squares methods in maximum likelihood problems. M.R. Osborne, School of Mathematical Sciences, ANU
1 Least squares methods in maximum likelihood problems. M.R. Osborne, School of Mathematical Sciences, ANU. September 22,
2 Abstract: It is well known that the Gauss-Newton algorithm for solving nonlinear least squares problems is a special case of the scoring algorithm for maximizing log likelihoods. What has received less attention is that the computation of the current correction in the scoring algorithm, in both its line search and trust region forms, can be cast as a linear least squares problem. This is an important observation both because it provides likelihood methods with a general framework which accords with computational orthodoxy, and because it can be seen as underpinning computational procedures which have been developed for particular classes of likelihood problems (for example, generalised linear models). Aspects of this orthodoxy as it affects considerations such as convergence and effectiveness will be reviewed.
3 1. The point of the paper has changed somewhat, with more emphasis on what we know about Gauss-Newton.
2. Will not consider estimation of DEs per se. However, (i) the results apply to stable simple shooting estimation problems, and (ii) to suitably imbedded multiple shooting formulations - and we know how to do this.
3. Strictly, the results do not apply to (our work on) the simultaneous approach to ODE estimation. The problem is technically a mixed problem as the limiting multipliers satisfy a stochastic DE.
4. Will not say anything about optimum observation points. However, the information matrix will be much in evidence, and its determinant is a quantity to maximize as a function of these.
4 Start with independent event outcomes $y_t \in \mathbb{R}^q$, $t = 1, 2, \dots, n$, associated pdf $g(y_t; \theta_t, t)$ indexed by points $t \in T \subset \mathbb{R}^l$, and structural information provided by a known parametric model $\theta_t = \eta(t, x)$ where $\theta \in \mathbb{R}^s$ and $x \in \mathbb{R}^p$. A priori information is the experimental design $T_n$, $|T_n| = n$. For asymptotics require
$$\frac{1}{n}\sum_{t \in T_n} f(t) \to \int_{S(T)} f(t)\rho(t)\,dt.$$
The problem: given the event outcomes $y_t$ it is required to estimate $x$. It is not assumed that the individual components of $y_t$ are independent.
5 Example. Exponential family:
$$g\left(y; \begin{bmatrix}\theta\\ \phi\end{bmatrix}\right) = c(y, \phi)\exp\left[\left\{y^T\theta - b(\theta)\right\}/a(\phi)\right],$$
$$E\{y\} = \mu(x, t) = \nabla_\theta b(\theta)^T,\qquad V\{y\} = a(\phi)\,\nabla_\theta^2 b(\theta).$$
Signal in noise model: $\theta = \mu$.
(i) Normal density:
$$g = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right),\qquad c(y, \phi) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{y^2}{2\sigma^2}\right),$$
$$a(\phi) = \sigma^2,\qquad \theta = \mu,\qquad b(\theta) = \frac{\mu^2}{2}.$$
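The exponential-family identities are easy to sanity check numerically. The following sketch (mine, not from the slides; the values of θ and σ are arbitrary) differences $b(\theta) = \theta^2/2$ for the normal case and compares with the known mean and variance:

```python
import math

# Hedged numerical check of E{y} = b'(theta) and V{y} = a(phi) b''(theta)
# for the normal case, with b(theta) = theta^2/2 and a(phi) = sigma^2.
def b(theta):
    return 0.5 * theta ** 2

sigma = 1.5
theta = 0.7          # theta = mu in the signal-in-noise model
eps = 1e-5

# central finite differences for b' and b''
b1 = (b(theta + eps) - b(theta - eps)) / (2 * eps)
b2 = (b(theta + eps) - 2 * b(theta) + b(theta - eps)) / eps ** 2

mean = b1              # should equal mu = theta = 0.7
var = sigma ** 2 * b2  # should equal sigma^2 = 2.25
print(mean, var)       # -> approximately 0.7 and 2.25
```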
6 (ii) Multinomial (discrete) distribution:
$$g(n; \omega) = \frac{n!}{\prod_{j=1}^p n_j!}\prod_{j=1}^p \omega_j^{n_j} = \frac{n!}{\prod_{j=1}^p n_j!}\,e^{\sum_{j=1}^p n_j\log\omega_j},$$
where $\sum_{j=1}^p n_j = n$, and the frequencies must satisfy the condition $\sum_{j=1}^p \omega_j = 1$. Eliminating $\omega_p$ gives
$$\sum_{j=1}^p n_j\log\omega_j = \sum_{j=1}^{p-1} n_j\log\frac{\omega_j}{1 - \sum_{i=1}^{p-1}\omega_i} + n\log\left(1 - \sum_{i=1}^{p-1}\omega_i\right).$$
It follows that
$$\theta_j = \log\frac{\omega_j}{1 - \sum_{i=1}^{p-1}\omega_i},\qquad b(\theta) = n\log\frac{1}{1 - \sum_{j=1}^{p-1}\omega_j} = n\log\left(1 + \sum_{j=1}^{p-1}e^{\theta_j}\right).$$
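These identities can be verified on a small case (a sketch of mine, with illustrative counts and frequencies): rewriting the log term as $\sum_{j<p} n_j\theta_j - b(\theta)$ must reproduce $\sum_j n_j\log\omega_j$.

```python
import math

# Hedged check of the multinomial natural-parameter identities for p = 3
# cells; the counts and frequencies are invented for the illustration.
omega = [0.2, 0.5, 0.3]          # frequencies, summing to 1
counts = [2, 5, 3]               # n_j, with n = 10
n = sum(counts)
p = len(omega)

# theta_j = log(omega_j / (1 - sum_{i<p} omega_i)) = log(omega_j / omega_p)
theta = [math.log(omega[j] / omega[p - 1]) for j in range(p - 1)]
# b(theta) = n log 1/(1 - sum_{j<p} omega_j) = -n log omega_p
b = -n * math.log(omega[p - 1])

lhs = sum(counts[j] * math.log(omega[j]) for j in range(p))
rhs = sum(counts[j] * theta[j] for j in range(p - 1)) - b
print(lhs, rhs)   # the two expressions agree
```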
7 Likelihood:
$$G(y; x, T) = \prod_{t\in T} g(y_t; \theta_t, t).$$
Estimation principle:
$$\hat{x} = \arg\max_x G(y; x, T).$$
Log likelihood:
$$L(y; x, T) = \sum_{t\in T}\log g(y_t; \theta_t, t) = \sum_{t\in T} L_t(y_t; \theta_t, t).$$
Assume: true model η, parameter vector $x^*$; $x^*$ properly in the interior of the region in which L is well behaved; boundedness of integrals (computing expectations etc.).
8 Properties:
$$E\{\nabla_xL(y; x, T)\} = 0;\qquad E\{\nabla_x^2L(y; x, T)\} = -E\{\nabla_xL^T\nabla_xL\}.$$
Fisher information:
$$I_n = \frac{1}{n}E\{\nabla_xL(y; x, T)^T\nabla_xL(y; x, T)\} = \frac{1}{n}V\{\nabla_xL(y; x, T)\}.$$
Maximum likelihood estimates are consistent and asymptotically minimum variance.
9 Algorithms: Describe the step h, $x \to x + h$.
Newton:
$$J_n = -\frac{1}{n}\nabla_x^2L(y; x, T_n),\qquad h = J_n^{-1}\frac{1}{n}\nabla_xL(y; x, T)^T.$$
Scoring:
$$h = I_n^{-1}\frac{1}{n}\nabla_xL(y; x, T)^T.$$
Sample:
$$S_n = \frac{1}{n}\sum_{t\in T_n}\nabla_xL_t(y_t; x, t)^T\nabla_xL_t(y_t; x, t),\qquad h = S_n^{-1}\frac{1}{n}\nabla_xL(y; x, T)^T.$$
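To make the three corrections concrete, here is a hedged one-parameter toy (model, data, and noise invented for the illustration, not from the talk): a normal signal-in-noise model $\mu(x, t) = \exp(-xt)$, for which all the derivatives are available in closed form.

```python
import math

# Hedged toy: one-parameter normal signal-in-noise model mu(x, t) = exp(-x t),
# so L_t = -(y_t - mu)^2 / (2 sigma^2) up to a constant. The data are
# deterministic pseudo-observations: true curve plus a fixed perturbation.
sigma = 0.1
ts = [0.1 * i for i in range(1, 11)]
x_true, x = 2.0, 1.5
ys = [math.exp(-x_true * t) + sigma * math.sin(7.0 * t) for t in ts]

n = len(ts)
g = 0.0    # (1/n) dL/dx, the scaled score
J_n = 0.0  # Newton:  -(1/n) d2L/dx2
I_n = 0.0  # scoring: expected (Fisher) information
S_n = 0.0  # sample:  (1/n) sum_t (dL_t/dx)^2
for t, y in zip(ts, ys):
    mu = math.exp(-x * t)
    dmu, d2mu = -t * mu, t * t * mu
    r = y - mu
    g += r * dmu / sigma**2 / n
    J_n += (dmu * dmu - r * d2mu) / sigma**2 / n
    I_n += dmu * dmu / sigma**2 / n
    S_n += (r * dmu / sigma**2) ** 2 / n

h_newton, h_scoring, h_sample = g / J_n, g / I_n, g / S_n
print(h_newton, h_scoring, h_sample)
```

All three corrections here are ascent steps moving the estimate toward the value that generated the data.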
10 Equivalence:
$$\lim_{n\to\infty}I_n(x^*) = \lim_{n\to\infty}S_n(x^*) = \lim_{n\to\infty}J_n(x^*) = I,$$
where
$$I = -\int_{S(T)}E\{\nabla_x^2L(y; \theta_t, t)\}\rho(t)\,dt.$$
Transformation invariance - scoring, sample, but not Newton. Let $u = u(x)$, $W = \frac{\partial u}{\partial x}$; then $h_u = Wh_x$.
$$\nabla_xL = \nabla_uL\,W,\qquad I_x = W^TI_uW,$$
$$h_x = \left(W^TI_uW\right)^{-1}\frac{1}{n}W^T\nabla_uL^T = W^{-1}(I_u)^{-1}\frac{1}{n}\nabla_uL^T = W^{-1}h_u.$$
11 Implementation: This requires: (1) a method for generating h, a direction in which the objective is increasing; (2) a method for estimating progress. A full step need not be satisfactory. This is especially true of initial steps. To measure progress introduce a monitor Φ(x) - it needs the same local stationary points, and to be increasing when the objective F is increasing:
$$\nabla F\,h \ge 0 \;\Rightarrow\; \nabla\Phi\,h \ge 0.$$
Two approaches: (1) Line search: an effective search direction is computed, and the monitor is used to gauge a profitable step in this direction. (2) Trust region: the step is required to lie in an adaptively defined control region - typically one in which the linearization of F does not depart too far from true behaviour.
12 The direction of search is required to be one which increases the objective. This is the case here for both the scoring and sample algorithms:
$$\nabla_xL\,h = \frac{1}{n}\nabla_xL\,I_n^{-1}\nabla_xL^T > 0,\qquad x \ne \hat{x}.$$
Note this shows that L is a suitable monitor. $\nabla_xL\,h$ is invariant:
$$n\nabla_xL\,h_x = \nabla_xL\,I_n^{-1}\nabla_xL^T = \nabla_uL\,W\left(W^TI_n^uW\right)^{-1}W^T\nabla_uL^T = \nabla_uL\left(I_n^u\right)^{-1}\nabla_uL^T = n\nabla_uL\,h_u.$$
Will see there are good ways to compute this.
13 Measuring effectiveness: step $x \to x + \lambda h$.
Goldstein: accept if
$$\rho \le \Psi(\lambda, x, h) \le 1 - \rho,\qquad 0 < \rho < .5,\qquad \Psi = \frac{\Phi(x + \lambda h) - \Phi(x)}{\lambda\,\nabla_x\Phi(x)h}.$$
Can always choose λ to satisfy this test.
Armijo: let $0 < \rho < 1$, and $\lambda = \rho^k$ where k is the smallest integer such that
$$\Phi(x + \rho^{k-1}h) < \Phi(x + \rho^kh).$$
Goldstein is fine for proving results, but there is no direct link in the sense of a small step asymptotic equivalence relating λ and $\rho^k$, and it does not tell how to find λ, while Armijo does. The choice λ = 1 is favoured for scale invariant, Newton type methods. Also $\Psi \to .5$ for the scoring and sample methods provided n is large enough; $.5 \ge \rho \ge .1$.
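A minimal backtracking search in this spirit can be sketched as follows. This is the common textbook sufficient-increase form of the Armijo idea, not necessarily the exact variant used here, and the toy objective is invented:

```python
# Hedged sketch: backtracking line search with a sufficient-increase test.
# Phi is the monitor; here Phi = L for a toy concave objective.
def backtrack(phi, grad_phi_h, x, h, rho=0.25, shrink=0.5, max_iter=50):
    """Return a step multiplier lam giving sufficient increase of phi."""
    phi0 = phi(x)
    lam = 1.0
    for _ in range(max_iter):
        # accept when the achieved increase is at least rho * lam * slope
        if phi(x + lam * h) - phi0 >= rho * lam * grad_phi_h:
            return lam
        lam *= shrink
    return lam

# toy concave objective L(x) = -(x - 3)^2, ascent direction h = L'(x)
L = lambda x: -(x - 3.0) ** 2
x0 = 0.0
h = -2.0 * (x0 - 3.0)            # gradient = 6 > 0
lam = backtrack(L, h * h, x0, h)  # directional derivative is h^2 = 36
print(lam, L(x0 + lam * h))       # -> 0.5 0.0 (half step lands on the maximizer)
```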
14 Convergence: a nice inequality (Kantorovich):
$$\frac{\nabla_xL\,h}{\|\nabla_xL\|\,\|h\|} \ge \frac{1}{\operatorname{cond}(I_n)^{1/2}}.$$
If the region $R_L$ is bounded and $I_n$ positive definite, then an ascent direction exists uniformly and a lower bound $\lambda_R$ exists for the line search step multiplier. Effectiveness means L can be used as a monitor (not true for pure Newton). Goldstein gives
$$\nabla_xL\,h \le \frac{L(x + \lambda h) - L(x)}{\rho\lambda_R} \to 0,$$
as L(x) is increasing and bounded, giving convergence. Also
$$h = I_n^{-1}\frac{1}{n}\nabla_xL^T \;\Rightarrow\; \|\nabla_xL\| \le n\|I_n\|\,\|h\|.$$
Putting it all together gives
$$\|\nabla_xL\|^2 \le K\left(L(x + \lambda h) - L(x)\right) \to 0.$$
15 What happens if $\inf\{\lambda_i\} = 0$? There must be a sequence $\{\lambda_i\}$, $\inf\lambda_i = 0$, with
$$\rho > \frac{L(x_i + \lambda_ih_i) - L(x_i)}{\lambda_i\,\nabla_xL(x_i)h_i}.$$
Here Taylor expansion plus the mean value theorem gives
$$\frac{1}{2(1-\rho)n}\left\|\nabla_x^2L(\tilde{x})\right\| > \frac{\sigma_{\min}(I_n)}{\lambda_i}.$$
Almost a global result! The set of allowed approximations need not be closed:
$$\eta = x(1) + x(2)\exp(-x(3)t);\qquad t = \lim_{n\to\infty}\left(n - n\exp\left(-\tfrac{1}{n}t\right)\right).$$
16 Least squares. Consider the sample estimate:
$$S_n = \frac{1}{n}\sum_{t\in T_n}\nabla_xL_t^T\nabla_xL_t = \frac{1}{n}\mathbb{S}_n^T\mathbb{S}_n,\qquad \mathbb{S}_n = \begin{bmatrix}\vdots\\ \nabla_xL_t\\ \vdots\end{bmatrix}.$$
The correction satisfies the least squares problem
$$\min_h r^Tr;\qquad r = \mathbb{S}_nh - e,$$
where $e$ is the vector of ones.
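For a single parameter the equivalence between the least squares form and the sample correction can be checked directly (the per-observation scores $s_t$ below are invented for the illustration):

```python
# Hedged scalar illustration: with rows s_t = dL_t/dx, the sample correction
# h = S_n^{-1} (1/n) dL/dx coincides with the least squares solution of
# min_h sum_t (s_t h - 1)^2, i.e. r = S h - e with e the vector of ones.
st = [0.8, -0.3, 1.1, 0.5, -0.9]   # illustrative per-observation scores s_t
n = len(st)

# least squares solution of min_h ||S h - e||^2 (one parameter)
h_ls = sum(st) / sum(s * s for s in st)

# sample-scoring form: S_n = (1/n) sum s_t^2, scaled score (1/n) sum s_t
S_n = sum(s * s for s in st) / n
grad = sum(st) / n
h_scoring = grad / S_n

print(h_ls, h_scoring)   # identical, by the normal equations
```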
17 Scoring works pretty much the same:
$$I_n = \frac{1}{n}\sum_{t\in T_n}\nabla_x\eta^T\,E\{\nabla_\eta L_t^T\nabla_\eta L_t\}\,\nabla_x\eta.$$
Set $V_t^{-1} = E\{\nabla_\eta L_t^T\nabla_\eta L_t\}$; then obtain the least squares problem
$$\min_h r^Tr;\qquad r = I_n^Lh - b,$$
where
$$I_n^L = \begin{bmatrix}\vdots\\ V_t^{-1/2}\nabla_x\eta\\ \vdots\end{bmatrix},\qquad b = \begin{bmatrix}\vdots\\ V_t^{1/2}\nabla_\eta L_t^T\\ \vdots\end{bmatrix}.$$
18 Example: normal distribution.
$$L_t = -\frac{1}{2\sigma^2}(y_t - \mu(x,t))^2,\qquad \nabla_xL_t = \frac{1}{\sigma^2}(y_t - \mu(x,t))\nabla_x\mu_t,$$
$$-\nabla_x^2L_t = \frac{1}{\sigma^2}\left(\nabla_x\mu_t^T\nabla_x\mu_t - (y_t - \mu(x,t))\nabla_x^2\mu_t\right),$$
$$I_n^L = \begin{bmatrix}\vdots\\ \frac{1}{\sigma}\nabla_x\mu_t\\ \vdots\end{bmatrix},\qquad I_n = \frac{1}{n\sigma^2}\sum_{t\in T_n}\nabla_x\mu_t^T\nabla_x\mu_t.$$
As σ cancels, the result is the Gauss-Newton method. In contrast,
$$S_n = \frac{1}{n\sigma^2}\sum_{t\in T_n}\frac{(y_t - \mu(x,t))^2}{\sigma^2}\nabla_x\mu_t^T\nabla_x\mu_t.$$
Here the scale does not cancel so agreeably.
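A small scalar check of the remark about scale (model and residual values invented): the scoring (Gauss-Newton) step is unchanged when σ changes, while the sample step scales with σ².

```python
import math

# Hedged scalar illustration: compute scoring and sample corrections at two
# values of sigma for a one-parameter model mu(x, t) = exp(-x t). The grid
# and residual perturbations are invented for the demonstration.
ts = [0.2, 0.4, 0.6, 0.8, 1.0]
x = 1.5
mus = [math.exp(-x * t) for t in ts]
ys = [m + d for m, d in zip(mus, [0.05, -0.02, 0.03, -0.04, 0.01])]

def steps(sigma):
    g = sum((y - m) * (-t * m) for t, y, m in zip(ts, ys, mus)) / sigma**2
    info = sum((t * m) ** 2 for t, m in zip(ts, mus)) / sigma**2
    samp = sum(((y - m) * (-t * m)) ** 2 for t, y, m in zip(ts, ys, mus)) / sigma**4
    return g / info, g / samp   # scoring step, sample step

h1_scoring, h1_sample = steps(sigma=0.1)
h2_scoring, h2_sample = steps(sigma=0.5)
print(h1_scoring, h2_scoring)   # equal: sigma has cancelled
print(h1_sample, h2_sample)     # differ by the factor (0.5/0.1)^2 = 25
```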
19 Computation:
$$I_n^L = \begin{bmatrix}Q_1 & Q_2\end{bmatrix}\begin{bmatrix}U\\ 0\end{bmatrix},\qquad h = U^{-1}Q_1^Tb.$$
Can get more value from the factorization:
$$\nabla_xL\,h = \left(b^TI_n^L\right)h = b^TQ\begin{bmatrix}U\\ 0\end{bmatrix}U^{-1}Q_1^Tb = \left\|Q_1^Tb\right\|^2.$$
This is needed for the Goldstein test. Also it $\to 0$ as $\nabla_xL\,h \to 0$, so it provides a scale invariant quantity for convergence testing. Typically one would scale the columns of $I_n^L$ - see Higham's book.
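This computation can be sketched in miniature (the matrix and right hand side are invented; a tiny Gram-Schmidt QR stands in for a library factorization):

```python
# Hedged sketch: factor I_n^L = Q1 U, solve h = U^{-1} Q1^T b by back
# substitution, and check the identity grad(L) h = ||Q1^T b||^2.
A = [[1.0, 0.5], [0.8, -0.2], [0.3, 0.9], [0.6, 0.4]]   # stand-in for I_n^L
b = [0.2, -0.1, 0.4, 0.3]

m, p = len(A), len(A[0])
Q = [[0.0] * p for _ in range(m)]
U = [[0.0] * p for _ in range(p)]
for j in range(p):                      # classical Gram-Schmidt, thin QR
    v = [A[i][j] for i in range(m)]
    for k in range(j):
        U[k][j] = sum(Q[i][k] * A[i][j] for i in range(m))
        v = [v[i] - U[k][j] * Q[i][k] for i in range(m)]
    U[j][j] = sum(c * c for c in v) ** 0.5
    for i in range(m):
        Q[i][j] = v[i] / U[j][j]

c1 = [sum(Q[i][j] * b[i] for i in range(m)) for j in range(p)]  # Q1^T b
h = [0.0] * p                           # back substitution: U h = c1
for j in reversed(range(p)):
    h[j] = (c1[j] - sum(U[j][k] * h[k] for k in range(j + 1, p))) / U[j][j]

grad_L_h = sum(b[i] * sum(A[i][j] * h[j] for j in range(p)) for i in range(m))
print(grad_L_h, sum(c * c for c in c1))  # equal: grad L h = ||Q1^T b||^2
```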
20 Rate of convergence: Consider the unit step scoring iteration in fixed point form:
$$x_{i+1} = F_n(x_i),\qquad F_n(x) = x + I_n(x)^{-1}\frac{1}{n}\nabla_xL(x)^T.$$
The condition for convergence is $\varpi(\nabla F_n(\hat{x}_n)) < 1$, where $\varpi(\nabla F_n(\hat{x}_n))$ is the spectral radius of the variation $\nabla F_n = \partial F_n/\partial x$. Then
$$\varpi(\nabla F_n(\hat{x}_n)) \to 0,\quad \text{a.s.},\quad n\to\infty.$$
$\varpi(\nabla F_n(\hat{x}_n))$ is a (Newton-like) invariant of the likelihood surface, is a measure of the quality of the modelling, and can be estimated by a modification of the power method.
21 To calculate $\varpi(\nabla F_n(\hat{x}_n))$ note that $\nabla_xL(\hat{x}_n) = 0$, thus
$$\nabla F_n(\hat{x}_n) = I + I_n(\hat{x}_n)^{-1}\frac{1}{n}\nabla_x^2L(\hat{x}_n) = I_n(\hat{x}_n)^{-1}\left(I_n(\hat{x}_n) + \frac{1}{n}\nabla_x^2L(\hat{x}_n)\right).$$
If the right hand side were evaluated at $x^*$ then the result would follow from the strong law of large numbers, which shows that the matrix gets small (hence ϖ gets small) almost surely as $n\to\infty$. But, by consistency of the estimates, we have
$$\varpi(\nabla F_n(\hat{x}_n)) = \varpi(\nabla F_n(x^*)) + O(\|\hat{x}_n - x^*\|),\quad \text{a.s.},$$
and the desired result follows.
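The power-method estimate mentioned here can be sketched as follows. The small fixed matrix simply stands in for the variation $\nabla F_n(\hat{x}_n)$; in practice its action on a vector would be obtained from the scoring computation itself.

```python
# Hedged sketch of a power-method estimate of the spectral radius. M is a
# stand-in for the variation matrix; its dominant eigenvalue is 0.5.
M = [[0.3, 0.1], [0.2, 0.4]]
v = [1.0, 1.0]
radius = 0.0
for _ in range(200):
    w = [sum(M[i][j] * v[j] for j in range(2)) for i in range(2)]
    radius = max(abs(c) for c in w)   # dominant eigenvalue estimate
    v = [c / radius for c in w]       # renormalize the iterate
print(radius)   # -> 0.5
```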
22 Trust region: The idea is to limit the scope of the linear subproblem to a closed control region containing the current estimate:
$$\min_h r^Tr;\qquad \|h\|_D \le \gamma,\qquad r = I_n^Lh - b,\qquad \|h\|_D^2 = h^TD^2h,\qquad D > 0\ \text{diagonal}.$$
Necessary conditions give
$$\begin{bmatrix}r^T & 0\end{bmatrix} = \lambda^T\begin{bmatrix}I & I_n^L\end{bmatrix} - \pi\begin{bmatrix}0 & h^TD^2\end{bmatrix},$$
so h satisfies the perturbed scoring equations
$$\left(I_n + \frac{\pi}{n}D^2\right)h = \frac{1}{n}\nabla_xL^T.$$
π can be used to control the trust region by controlling the size of h. Differentiating with respect to π gives
$$\left(I_n + \frac{\pi}{n}D^2\right)\frac{dh}{d\pi} = -\frac{1}{n}D^2h,$$
$$h^TD^2\frac{dh}{d\pi} = -n\,\frac{dh}{d\pi}^T\left(I_n + \frac{\pi}{n}D^2\right)\frac{dh}{d\pi} < 0 \;\Rightarrow\; \frac{d}{d\pi}\|h\|_D^2 < 0.$$
23 The classical form of the algorithm goes back to Levenberg. Here two parameters α, β are kept, and the basic sequence of operations is:

    count = 1
    do while F(x + h(pi)) < F(x)
        count = count + 1
        pi = alpha * pi
    loop
    x <- x + h(pi)
    if count = 1 then pi = beta * pi

Successful steps will be taken eventually as
$$h \to \frac{1}{\pi}D^{-2}\nabla_xL^T,\qquad \pi\to\infty.$$
Experience suggests the choices of α, β are not critical. Typically αβ < 1 to approach Newton like methods. The scoring and sample algorithms are not exact Newton methods. Both can be regarded as regularised methods because of their generic positive definiteness. It is not obvious that setting π = 0 will increase the rate of convergence.
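The α, β loop can be rendered directly in code. This is a hedged toy version of mine: the objective is an invented one-parameter concave function, and a positive curvature surrogate plays the role of $I_n$ in the perturbed equation $(I_n + \pi)h = \nabla L^T$.

```python
# Hedged toy rendering of the classical alpha, beta Levenberg loop.
def F(x):
    return -(x - 2.0) ** 4         # invented objective to maximize

def step(x, pi):
    g = -4.0 * (x - 2.0) ** 3      # gradient of F
    info = 12.0 * (x - 2.0) ** 2   # positive curvature surrogate for I_n
    return g / (info + pi)         # h(pi) from (I_n + pi) h = g

alpha, beta = 2.5, 0.1
x, pi = 0.0, 1.0
for _ in range(100):
    count = 1
    while F(x + step(x, pi)) < F(x):   # failed step: shrink trust region
        count += 1
        pi = alpha * pi
    x = x + step(x, pi)
    if count == 1:
        pi = beta * pi                 # success at first try: expand region
print(x)   # approaches the maximizer x = 2
```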
24 Basic theorems mirror the line search results.
Convergence: Let $\{x_i\}$ produced by the α, β procedure be contained in a compact region R in which $\{\pi_i\}$ is bounded; then $\{L(y; x_i, T_n)\}$ converges, and limit points of $\{x_i\}$ are stationary points of L.
Boundedness: If the sequence $\{\pi_i\}$ determined by the α, β procedure is unbounded in R, then the norm of $\nabla_x^2L$ is also unbounded.
Remember iterations have the potential to become unbounded for smooth sets of nonlinear approximating functions, as closure of these sets of functions cannot be guaranteed without more work.
25 Computation: The trust region problem is:
$$\min_h r^Tr;\qquad r = \begin{bmatrix}I_n^L\\ \sqrt{\pi}I\end{bmatrix}h - \begin{bmatrix}b\\ 0\end{bmatrix}$$
(taking D = I, e.g. after column scaling). Let $I_n^L = Q\begin{bmatrix}U\\ 0\end{bmatrix}$; then $h(\pi)$ can be found by solving the typically much smaller problem
$$\min_h s^Ts;\qquad s = \begin{bmatrix}U\\ \sqrt{\pi}I\end{bmatrix}h - \begin{bmatrix}c_1\\ 0\end{bmatrix},$$
where $c_1 = Q_1^Tb$. Make the further factorization
$$\begin{bmatrix}U\\ \sqrt{\pi}I\end{bmatrix} = Q'\begin{bmatrix}U'\\ 0\end{bmatrix},\qquad Q'^T\begin{bmatrix}c_1\\ 0\end{bmatrix} = \begin{bmatrix}c_1'\\ c_2'\end{bmatrix},$$
then
$$h(\pi) = (U')^{-1}c_1',$$
$$\nabla_xL\,h(\pi) = c_1^TUh(\pi) = \begin{bmatrix}c_1^T & 0\end{bmatrix}\begin{bmatrix}U\\ \sqrt{\pi}I\end{bmatrix}\left(U^TU + \pi I\right)^{-1}\begin{bmatrix}U^T & \sqrt{\pi}I\end{bmatrix}\begin{bmatrix}c_1\\ 0\end{bmatrix} = \|c_1'\|^2.$$
26 The linear subproblem has a useful invariance property with respect to diagonal scaling. Introduce the new variables $u = Tx$ where T is diagonal. Then
$$T^{-1}\left(\mathbb{S}_x^T\mathbb{S}_x + \pi D^2\right)T^{-1}\,Th_x = T^{-1}\nabla_xL^T.$$
This is equivalent to
$$\left(\mathbb{S}_u^T\mathbb{S}_u + \pi T^{-1}D^2T^{-1}\right)h_u = \nabla_uL^T.$$
Thus if $D_i$ transforms with $x_i$ then $T_i^{-1}D_i$ transforms in the same way with respect to $u_i$. This requirement is satisfied by $D_i = \|(\mathbb{S}_n)_i\|$, the norm of the $i$'th column. This transformation effects a rescaling of the least squares problem. We have
$$h = \left(\mathbb{S}_n^T\mathbb{S}_n + \pi D^2\right)^{-1}\mathbb{S}_n^Te,\qquad Dh = \left(D^{-1}\mathbb{S}_n^T\mathbb{S}_nD^{-1} + \pi I\right)^{-1}D^{-1}\mathbb{S}_n^Te.$$
The effect of this choice is to rescale the columns of $\mathbb{S}_n$ to have unit length. It is often sufficient to set π = 1 and $D = \operatorname{diag}\{\|(\mathbb{S}_n)_i\|,\ i = 1, 2, \dots, p\}$ initially.
27 One catch is the initial choice $\pi = \pi_0$. One example is provided by models containing exponential terms such as $e^{-x_kt}$ which should have negative exponents ($x_k > 0$) but which can become positive ($x_k + h_k < 0$) by too large a step. To fix this, introduce a damping factor τ such that the critical components of $x + \tau h(\pi_0)$ remain $\ge 0$. $\|\tau h(\pi_0)\|$ provides a possible choice for a revised trust region bound γ. This involves solving the equation $\|h(\pi)\| = \gamma$ for π by, e.g., Newton's method. $\frac{dh}{d\pi}$ can be found by solving
$$\left(U^TU + \pi I\right)\frac{dh}{d\pi} = -h.$$
This can be written in least squares form:
$$\min s^Ts;\qquad s = \begin{bmatrix}U\\ \pi^{1/2}I\end{bmatrix}\frac{dh}{d\pi} + \begin{bmatrix}U^{-T}h\\ 0\end{bmatrix}.$$
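A scalar sketch of the Newton iteration for $\|h(\pi)\| = \gamma$ (numbers are illustrative; with one parameter, $U$ and $h$ are scalars and the linear solves are divisions):

```python
# Hedged scalar sketch: solve ||h(pi)|| = gamma by Newton's method, with
# dh/dpi obtained from (U^T U + pi I) dh/dpi = -h. Here h(pi) = U c1/(U^2+pi).
u, c, gamma = 2.0, 3.0, 0.25     # U factor, c1 = Q1^T b, trust radius gamma
pi = 0.0
for _ in range(50):
    h = u * c / (u * u + pi)     # h(pi) solves (U^T U + pi) h = U^T c1
    dh = -h / (u * u + pi)       # from (U^T U + pi) dh/dpi = -h
    f = abs(h) - gamma           # want ||h(pi)|| = gamma
    df = dh if h > 0 else -dh    # d||h||/dpi
    pi_new = pi - f / df
    if abs(pi_new - pi) < 1e-12:
        pi = pi_new
        break
    pi = pi_new
print(pi, u * c / (u * u + pi))  # pi such that ||h(pi)|| = gamma
```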
28 Newton's method extrapolates the function as linear. An alternative strategy could be preferable here. This is based on the observation that, if $A = W\Sigma V^T$ is the singular value decomposition of the matrix in the least squares formulation, then
$$h = \sum_{i=1}^p \frac{\sigma_i\left(w_i^Tb\right)}{\sigma_i^2 + \pi}v_i$$
is a rational function of π. To mirror this, set
$$\|h\| \approx \frac{a}{b + (\pi - \pi_0)}.$$
To identify the parameters a and b, use the values $\|h(\pi_0)\|$ and $\frac{d}{d\pi}\|h(\pi_0)\|$ - this corresponds to the same information as is used in Newton's method.
29 To solve for the parameters we have the equations:
$$\|h(\pi_0)\| = \frac{a}{b},\qquad \frac{d}{d\pi}\|h(\pi_0)\| = \frac{h^T\frac{dh}{d\pi}}{\|h(\pi_0)\|} = -\frac{a}{b^2}.$$
The result is
$$b = -\frac{\|h(\pi_0)\|}{\frac{d}{d\pi}\|h(\pi_0)\|},\qquad a = \|h(\pi_0)\|\,b,$$
giving the correction
$$\pi = \pi_0 - \frac{\|h(\pi_0)\| - \gamma}{\frac{d}{d\pi}\|h(\pi_0)\|}\cdot\frac{\|h(\pi_0)\|}{\gamma}.$$
The effect of the rational extrapolation is just the Newton step modulated by the term $\frac{\|h(\pi_0)\|}{\gamma}$. As this term $\to 1$ this is also a second order convergent process.
30 Simple exponential model: Here the model used is
$$\mu(t, x) = x(1) + x(2)\exp(-x(3)t).\qquad(1)$$
The values chosen for the parameters are x(1) = 1, x(2) = 5, and x(3) = 10. Initial values are generated using
$$x(i)_0 = x(i) + (1 + x(i))(.5 - \text{Rnd}),$$
where Rnd indicates a call to a uniform random number generator giving values in [0, 1]. This model is not difficult in the sense that the graph has features that depend on each of the parameters. Thus the interest is in the effects of simulated data errors.
31 Two types of random numbers are used to simulate the experimental data.
Normal data: The data is generated by evaluating μ(t, x) on a uniform grid with spacing 1/(n+1) and then perturbing these values using normally distributed random numbers to give values
$$z_i = \mu(i, x) + \varepsilon_i,\qquad \varepsilon_i \sim N(0, \sigma^2),\qquad i = 1, 2, \dots, n.$$
The choice of standard deviation was made to make small sample problems (n = 32) relatively difficult. The log likelihood is taken as
$$L(x) = -\frac{1}{2}\sum_{i=1}^n\left(z_i - \mu(i, x)\right)^2.$$
While the scale is not evident here, it resurfaces in its effects on the generated data.
32 Poisson data: A Poisson random number generator is used to generate random counts $z_i$ corresponding to μ(i, x) as the mean model. The log likelihood used is
$$L(x) = \sum_{i=1}^n z_i\log\left(\frac{\mu(i, x)}{z_i}\right) + \left(z_i - \mu(i, x)\right).$$
Note that if $z_i = 0$ then the contribution from the logarithm term to the log likelihood is zero. The rows of the least squares problem design matrix are given by
$$e_i^TA = \frac{1}{s_i}\begin{bmatrix}1, & \exp(-x(3)t_i), & -x(2)t_i\exp(-x(3)t_i)\end{bmatrix},\qquad i = 1, 2, \dots, n,$$
where $s_i = \sqrt{\mu(i, x)}$. The corresponding components of the right hand side are
$$b_i = \frac{z_i - \mu(i, x)}{s_i}.$$
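As a check on the Poisson rows (my own sketch, with invented counts and grid; taking $s_i = \sqrt{\mu(i, x)}$, which makes $A^TA$ the information and $A^Tb$ the score), the rows and right hand side should reproduce the gradient of the Poisson log likelihood:

```python
import math

# Hedged check that the Poisson least squares rows reproduce the score:
# row_i = grad mu / s_i and b_i = (z_i - mu) / s_i with s_i = sqrt(mu)
# give A^T b = grad_x L for L = sum z log(mu/z) + (z - mu).
x = [1.0, 5.0, 10.0]
ts = [0.1, 0.2, 0.3, 0.4]
zs = [5, 4, 3, 2]               # illustrative Poisson counts

def mu(t):
    return x[0] + x[1] * math.exp(-x[2] * t)

grad = [0.0, 0.0, 0.0]          # score dL/dx_j, computed directly
Atb = [0.0, 0.0, 0.0]           # A^T b assembled from the rows
for t, z in zip(ts, zs):
    m = mu(t)
    dmu = [1.0, math.exp(-x[2] * t), -x[1] * t * math.exp(-x[2] * t)]
    s = math.sqrt(m)
    for j in range(3):
        grad[j] += (z / m - 1.0) * dmu[j]       # (z/mu - 1) dmu/dx_j
        Atb[j] += (dmu[j] / s) * ((z - m) / s)  # row_j * b_i
print(grad, Atb)   # the two gradients agree
```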
33 Numerical experiments comparing the performance of the line search (LS) and trust region (TR) methods are summarised below. For each n the computations were initiated with 10 different seeds for the basic random number generator, and the average number of iterations is reported as a guide to algorithm performance. The parameter settings used are α = 2.5, β = .1 for the trust region method and ρ = .25 for the Armijo parameter used in the line search. Experimenting with these values (for example, the choice α = 1.5, β = .5) made very little difference in the trust region results. Convergence is assumed if $\nabla_xL\,h < 1.0\mathrm{e}{-8}$. This corresponds to final values of ‖h‖ in the range 1.e−4 to 1.e−6.
[Table: average iteration counts, with columns n | LS | TR under Normal data and LS | TR under Poisson data; the numerical entries were lost in transcription, starred entries marking nonconvergence.]
34 The starred entries in the table correspond to two cases of nonconvergence. The figure shows (in red) the current estimate after 50 iterations together with the data and the starting estimate. This gives an illustration that the set of approximations is not closed.
[Figure caption: Result shows a straight line fit]
35 Second example - Gaussians with an exponential background.
$$\mu = x(1)\exp(-x(2)t) + x(3)\exp\left(-(t - x(4))^2/x(5)\right) + x(6)\exp\left(-(t - x(7))^2/x(8)\right)$$
Line search case:
$$x(1) = 5,\ x(2) = 10,\ x(3) = 18,\ x(4) = .3333,\ x(5) = .05,\ x(6) = 15,\ x(7) = .6667,\ x(8) =$$
36 The results here are much more of a mixed bag. The problems are closely related to the choice of starting values. The figure is produced using peak widths of .01, which makes it easier to see what is going on. In this case the starting values do not see the second peak. Both the background and the first peak are picked up well.
[Figure caption: Initial conditions miss the second peak]
37 Third example - trinomial example. Data for a trinomial example (m = 3) came from a consulting exercise with CSIRO. It is derived from a study of the effects of a cattle virus on chicken embryos.
[Table: cattle virus data - counts of dead, deformed, and normal embryos against log10(titre); the numerical entries were lost in transcription.]
The model suggested fits the frequencies explicitly:
$$\omega_1 = \frac{1}{1 + \exp(-\beta_1 - \beta_3\log(t))},\qquad 1 - \omega_2 = \frac{1}{1 + \exp(-\beta_2 - \beta_3\log(t))},\qquad \omega_3 = 1 - \omega_1 - \omega_2.$$
38 It makes sense to develop the algorithm in terms of the frequencies. Numerical results show an impressive rate of convergence for a relatively small data set. This suggests the model analysis is good.
[Table: results of computations for the trinomial data - columns its (iterations) | L | ∇Lh | β₁ | β₂ | β₃; the numerical entries were lost in transcription.]
39 In conclusion: Work on the Levenberg algorithm in the 1970s was responsible for at least some of the encouragement for the shift from line search to trust region methods in optimization problems. It can now be argued that this component of the move had somewhat dubious validity.
1. Use of the expected Hessian is already a regularising step. Improved conditioning derived from the trust region parameter could be illusory if the aim is small values for rapid convergence. If significant values of π are required then in the data analytic context the modelling could well be suspect.
2. The old papers relied on a small residual argument to explain good convergence rates. Again, in our context this is not satisfactory. It is not completely obvious what the effect of nonzero trust region parameters is in any particular case.
3. The trust region algorithm does not scale as well as the line search scoring algorithm.
4. Global convergence results of similar power appear available for both approaches.
More informationClosest Moment Estimation under General Conditions
Closest Moment Estimation under General Conditions Chirok Han and Robert de Jong January 28, 2002 Abstract This paper considers Closest Moment (CM) estimation with a general distance function, and avoids
More information30.3. LU Decomposition. Introduction. Prerequisites. Learning Outcomes
LU Decomposition 30.3 Introduction In this Section we consider another direct method for obtaining the solution of systems of equations in the form AX B. Prerequisites Before starting this Section you
More informationLecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016
Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,
More informationVector Derivatives and the Gradient
ECE 275AB Lecture 10 Fall 2008 V1.1 c K. Kreutz-Delgado, UC San Diego p. 1/1 Lecture 10 ECE 275A Vector Derivatives and the Gradient ECE 275AB Lecture 10 Fall 2008 V1.1 c K. Kreutz-Delgado, UC San Diego
More informationThe Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation:
Oct. 1 The Dirichlet s P rinciple In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: 1. Dirichlet s Principle. u = in, u = g on. ( 1 ) If we multiply
More informationLecture 2: Computing functions of dense matrices
Lecture 2: Computing functions of dense matrices Paola Boito and Federico Poloni Università di Pisa Pisa - Hokkaido - Roma2 Summer School Pisa, August 27 - September 8, 2018 Introduction In this lecture
More informationPhysics 202 Laboratory 5. Linear Algebra 1. Laboratory 5. Physics 202 Laboratory
Physics 202 Laboratory 5 Linear Algebra Laboratory 5 Physics 202 Laboratory We close our whirlwind tour of numerical methods by advertising some elements of (numerical) linear algebra. There are three
More informationOptimal Newton-type methods for nonconvex smooth optimization problems
Optimal Newton-type methods for nonconvex smooth optimization problems Coralia Cartis, Nicholas I. M. Gould and Philippe L. Toint June 9, 20 Abstract We consider a general class of second-order iterations
More informationLecture 8: Differential Equations. Philip Moriarty,
Lecture 8: Differential Equations Philip Moriarty, philip.moriarty@nottingham.ac.uk NB Notes based heavily on lecture slides prepared by DE Rourke for the F32SMS module, 2006 8.1 Overview In this final
More informationAN AUGMENTED LAGRANGIAN AFFINE SCALING METHOD FOR NONLINEAR PROGRAMMING
AN AUGMENTED LAGRANGIAN AFFINE SCALING METHOD FOR NONLINEAR PROGRAMMING XIAO WANG AND HONGCHAO ZHANG Abstract. In this paper, we propose an Augmented Lagrangian Affine Scaling (ALAS) algorithm for general
More informationNotes on Markov Networks
Notes on Markov Networks Lili Mou moull12@sei.pku.edu.cn December, 2014 This note covers basic topics in Markov networks. We mainly talk about the formal definition, Gibbs sampling for inference, and maximum
More informationLecture 11 and 12: Penalty methods and augmented Lagrangian methods for nonlinear programming
Lecture 11 and 12: Penalty methods and augmented Lagrangian methods for nonlinear programming Coralia Cartis, Mathematical Institute, University of Oxford C6.2/B2: Continuous Optimization Lecture 11 and
More informationA globally convergent Levenberg Marquardt method for equality-constrained optimization
Computational Optimization and Applications manuscript No. (will be inserted by the editor) A globally convergent Levenberg Marquardt method for equality-constrained optimization A. F. Izmailov M. V. Solodov
More informationStability theory is a fundamental topic in mathematics and engineering, that include every
Stability Theory Stability theory is a fundamental topic in mathematics and engineering, that include every branches of control theory. For a control system, the least requirement is that the system is
More information7 Asymptotics for Meromorphic Functions
Lecture G jacques@ucsd.edu 7 Asymptotics for Meromorphic Functions Hadamard s Theorem gives a broad description of the exponential growth of coefficients in power series, but the notion of exponential
More informationChapter 2. Solving Systems of Equations. 2.1 Gaussian elimination
Chapter 2 Solving Systems of Equations A large number of real life applications which are resolved through mathematical modeling will end up taking the form of the following very simple looking matrix
More informationGeneralized Linear Models. Last time: Background & motivation for moving beyond linear
Generalized Linear Models Last time: Background & motivation for moving beyond linear regression - non-normal/non-linear cases, binary, categorical data Today s class: 1. Examples of count and ordered
More informationClassical Estimation Topics
Classical Estimation Topics Namrata Vaswani, Iowa State University February 25, 2014 This note fills in the gaps in the notes already provided (l0.pdf, l1.pdf, l2.pdf, l3.pdf, LeastSquares.pdf). 1 Min
More informationInexact Newton Methods and Nonlinear Constrained Optimization
Inexact Newton Methods and Nonlinear Constrained Optimization Frank E. Curtis EPSRC Symposium Capstone Conference Warwick Mathematics Institute July 2, 2009 Outline PDE-Constrained Optimization Newton
More informationMath 471 (Numerical methods) Chapter 3 (second half). System of equations
Math 47 (Numerical methods) Chapter 3 (second half). System of equations Overlap 3.5 3.8 of Bradie 3.5 LU factorization w/o pivoting. Motivation: ( ) A I Gaussian Elimination (U L ) where U is upper triangular
More information1. Introduction Boundary estimates for the second derivatives of the solution to the Dirichlet problem for the Monge-Ampere equation
POINTWISE C 2,α ESTIMATES AT THE BOUNDARY FOR THE MONGE-AMPERE EQUATION O. SAVIN Abstract. We prove a localization property of boundary sections for solutions to the Monge-Ampere equation. As a consequence
More informationHigher-Order Methods
Higher-Order Methods Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. PCMI, July 2016 Stephen Wright (UW-Madison) Higher-Order Methods PCMI, July 2016 1 / 25 Smooth
More informationChapter 4: Asymptotic Properties of the MLE
Chapter 4: Asymptotic Properties of the MLE Daniel O. Scharfstein 09/19/13 1 / 1 Maximum Likelihood Maximum likelihood is the most powerful tool for estimation. In this part of the course, we will consider
More informationFree energy estimates for the two-dimensional Keller-Segel model
Free energy estimates for the two-dimensional Keller-Segel model dolbeaul@ceremade.dauphine.fr CEREMADE CNRS & Université Paris-Dauphine in collaboration with A. Blanchet (CERMICS, ENPC & Ceremade) & B.
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table
More informationConstrained Nonlinear Optimization Algorithms
Department of Industrial Engineering and Management Sciences Northwestern University waechter@iems.northwestern.edu Institute for Mathematics and its Applications University of Minnesota August 4, 2016
More informationSwitching Regime Estimation
Switching Regime Estimation Series de Tiempo BIrkbeck March 2013 Martin Sola (FE) Markov Switching models 01/13 1 / 52 The economy (the time series) often behaves very different in periods such as booms
More informationEstimation theory. Parametric estimation. Properties of estimators. Minimum variance estimator. Cramer-Rao bound. Maximum likelihood estimators
Estimation theory Parametric estimation Properties of estimators Minimum variance estimator Cramer-Rao bound Maximum likelihood estimators Confidence intervals Bayesian estimation 1 Random Variables Let
More informationTHE RESIDUE THEOREM. f(z) dz = 2πi res z=z0 f(z). C
THE RESIDUE THEOREM ontents 1. The Residue Formula 1 2. Applications and corollaries of the residue formula 2 3. ontour integration over more general curves 5 4. Defining the logarithm 7 Now that we have
More information6.3 Fundamental solutions in d
6.3. Fundamental solutions in d 5 6.3 Fundamental solutions in d Since Lapalce equation is invariant under translations, and rotations (see Exercise 6.4), we look for solutions to Laplace equation having
More informationDiscreteness of Transmission Eigenvalues via Upper Triangular Compact Operators
Discreteness of Transmission Eigenvalues via Upper Triangular Compact Operators John Sylvester Department of Mathematics University of Washington Seattle, Washington 98195 U.S.A. June 3, 2011 This research
More informationEconometrics I, Estimation
Econometrics I, Estimation Department of Economics Stanford University September, 2008 Part I Parameter, Estimator, Estimate A parametric is a feature of the population. An estimator is a function of the
More informationCOMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017
COMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University PRINCIPAL COMPONENT ANALYSIS DIMENSIONALITY
More information3. THE SIMPLEX ALGORITHM
Optimization. THE SIMPLEX ALGORITHM DPK Easter Term. Introduction We know that, if a linear programming problem has a finite optimal solution, it has an optimal solution at a basic feasible solution (b.f.s.).
More informationCHAPTER 11. A Revision. 1. The Computers and Numbers therein
CHAPTER A Revision. The Computers and Numbers therein Traditional computer science begins with a finite alphabet. By stringing elements of the alphabet one after another, one obtains strings. A set of
More informationLECTURE-15 : LOGARITHMS AND COMPLEX POWERS
LECTURE-5 : LOGARITHMS AND COMPLEX POWERS VED V. DATAR The purpose of this lecture is twofold - first, to characterize domains on which a holomorphic logarithm can be defined, and second, to show that
More informationBindel, Fall 2016 Matrix Computations (CS 6210) Notes for
1 Iteration basics Notes for 2016-11-07 An iterative solver for Ax = b is produces a sequence of approximations x (k) x. We always stop after finitely many steps, based on some convergence criterion, e.g.
More informationCOMPOSITIONS WITH A FIXED NUMBER OF INVERSIONS
COMPOSITIONS WITH A FIXED NUMBER OF INVERSIONS A. KNOPFMACHER, M. E. MAYS, AND S. WAGNER Abstract. A composition of the positive integer n is a representation of n as an ordered sum of positive integers
More informationLecture 4 Eigenvalue problems
Lecture 4 Eigenvalue problems Weinan E 1,2 and Tiejun Li 2 1 Department of Mathematics, Princeton University, weinan@princeton.edu 2 School of Mathematical Sciences, Peking University, tieli@pku.edu.cn
More informationLecture 17: Numerical Optimization October 2014
Lecture 17: Numerical Optimization 36-350 22 October 2014 Agenda Basics of optimization Gradient descent Newton s method Curve-fitting R: optim, nls Reading: Recipes 13.1 and 13.2 in The R Cookbook Optional
More informationNonlinear Programming Models
Nonlinear Programming Models Fabio Schoen 2008 http://gol.dsi.unifi.it/users/schoen Nonlinear Programming Models p. Introduction Nonlinear Programming Models p. NLP problems minf(x) x S R n Standard form:
More informationECE 275B Homework # 1 Solutions Version Winter 2015
ECE 275B Homework # 1 Solutions Version Winter 2015 1. (a) Because x i are assumed to be independent realizations of a continuous random variable, it is almost surely (a.s.) 1 the case that x 1 < x 2
More information