Least squares problems


1 Least squares problems
How to state and solve them, then evaluate their solutions
Stéphane Mottelet, Université de Technologie de Compiègne, 30 September 2016

2 Outline
1 Motivation and statistical framework
2 Maths reminder (survival kit)
3 Linear Least Squares (LLS)
4 Non Linear Least Squares (NLLS)
5 Statistical evaluation of solutions
6 Model selection

3 Motivation and statistical framework

4 Motivation: regression problem
Data: $(x_i, y_i)_{i=1..n}$; model: $y = f(x; \theta)$
- $x \in \mathbb{R}$: independent variable
- $y \in \mathbb{R}$: dependent variable (value found by observation)
- $\theta \in \mathbb{R}^p$: parameters
Regression problem: find $\theta$ such that the model best explains the data, i.e. $y_i$ close to $f(x_i; \theta)$ for $i = 1, \dots, n$.

5 Motivation: regression problem, example
Simple linear regression: given $(x_i, y_i) \in \mathbb{R}^2$, find $\theta_1, \theta_2$ such that the data fits the model $y = \theta_1 + \theta_2 x$.
How does one measure the fit/misfit?

6 Motivation: least squares method
The least squares method measures the fit with the function
$$S(\theta) = \sum_{i=1}^{n} \left(y_i - f(x_i; \theta)\right)^2,$$
and aims to find $\hat\theta$ such that $S(\hat\theta) \le S(\theta)$ for all $\theta \in \mathbb{R}^p$, or equivalently
$$\hat\theta = \arg\min_{\theta \in \mathbb{R}^p} S(\theta).$$
Important issues:
- statistical interpretation
- existence, uniqueness and practical determination of $\hat\theta$ (algorithms)
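As a minimal illustration (not from the original slides), the criterion $S$ can be evaluated in Scilab for the simple linear model $y = \theta_1 + \theta_2 x$; the data vectors x and y are assumed to be already available as columns:

// least squares criterion for the model y = theta(1) + theta(2)*x
function s=S_crit(theta, x, y)
    r = y - (theta(1) + theta(2)*x);   // residual vector
    s = sum(r.^2);                     // sum of squared residuals
endfunction

// example: compare two candidate parameter vectors on synthetic data
x = (1:10)'; y = 2 + 3*x + 0.1*rand(10, 1, "normal");
disp(S_crit([2;3], x, y))              // small misfit near the data-generating parameters
disp(S_crit([0;0], x, y))              // much larger misfit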

7 Statistical framework: hypotheses
1. $(x_i)_{i=1..n}$ are given.
2. $(y_i)_{i=1..n}$ are samples of statistically independent random variables
$$y_i = f(x_i; \theta) + \varepsilon_i,$$
where $\varepsilon_i$ is a random variable with $E[\varepsilon_i] = 0$, $E[\varepsilon_i^2] = \sigma^2$ and density $\varepsilon \mapsto g(\varepsilon)$.
The probability density of $y_i$ is given by
$$\phi_i : \mathbb{R} \times \mathbb{R}^p \to \mathbb{R}, \qquad \phi_i(y; \theta) = g(y - f(x_i; \theta)),$$
and $E[y_i \mid \theta] = f(x_i; \theta)$.

8 Statistical framework: example
If $\varepsilon$ is normally distributed, i.e. $g(\varepsilon) = (\sigma\sqrt{2\pi})^{-1} \exp\!\left(-\frac{\varepsilon^2}{2\sigma^2}\right)$, then
$$\phi_i(y; \theta) = (\sigma\sqrt{2\pi})^{-1} \exp\!\left(-\frac{1}{2\sigma^2}\left(y - f(x_i; \theta)\right)^2\right).$$

9 Statistical framework: joint probability density and likelihood function
Joint density: when $\theta$ is given, as the $(y_i)$ are independent, the density of the whole vector $y = (y_1, \dots, y_n)$ is
$$\phi(y; \theta) = \prod_{i=1}^{n} \phi_i(y_i; \theta).$$
Interpretation: for $D \subset \mathbb{R}^n$,
$$\mathrm{Prob}[y \in D \mid \theta] = \int_D \phi(y; \theta)\, dy_1 \dots dy_n.$$
Likelihood function: when a sample of $y$ is given, $L(\theta; y) = \phi(y; \theta)$ is called the likelihood of the parameters $\theta$.

10 Statistical framework: maximum likelihood estimation
The maximum likelihood estimate of $\theta$ is the vector $\hat\theta$ defined by
$$\hat\theta = \arg\max_{\theta \in \mathbb{R}^p} L(\theta; y).$$
Under the Gaussian hypothesis,
$$L(\theta; y) = \prod_{i=1}^{n} (\sigma\sqrt{2\pi})^{-1} \exp\!\left(-\frac{1}{2\sigma^2}\left(y_i - f(x_i; \theta)\right)^2\right) = (\sigma\sqrt{2\pi})^{-n} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - f(x_i; \theta)\right)^2\right),$$
hence we recover the least squares solution:
$$\arg\max_{\theta \in \mathbb{R}^p} L(\theta; y) = \arg\min_{\theta \in \mathbb{R}^p} S(\theta).$$

11 Statistical framework: alternatives, least absolute deviation regression
Least absolute deviation regression: the misfit is measured by
$$S_1(\theta) = \sum_{i=1}^{n} \left|y_i - f(x_i; \theta)\right|.$$
Is $\hat\theta = \arg\min_{\theta \in \mathbb{R}^p} S_1(\theta)$ a maximum likelihood estimate? Yes, if $\varepsilon_i$ has a Laplace distribution
$$g(\varepsilon) = (\sigma\sqrt{2})^{-1} \exp\!\left(-\frac{\sqrt{2}}{\sigma}|\varepsilon|\right).$$
First issue: $S_1$ is not differentiable.

12 Statistical framework: alternatives, least absolute deviation regression
[figure: densities of Gaussian vs. Laplace random variables, both with zero mean and unit variance]
Second issue: the two statistical hypotheses are very different!

13 Statistical framework: take home message
Take home message #1: doing least squares regression means that you assume that the model error is Gaussian. However, if you have no idea about the model error:
1. the nice theoretical and computational framework you will get is worth making this assumption...
2. a posteriori goodness-of-fit tests can be used to assess the normality of errors.

14 Maths reminder (survival kit)

15 Maths reminder: matrix algebra
Notation: $A \in M_{n,m}(\mathbb{R})$, $x \in \mathbb{R}^n$,
$$A = \begin{pmatrix} a_{11} & \dots & a_{1m} \\ \vdots & & \vdots \\ a_{n1} & \dots & a_{nm} \end{pmatrix}, \qquad x = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}.$$
Product: for $B \in M_{m,p}(\mathbb{R})$, $C = AB \in M_{n,p}(\mathbb{R})$ with $c_{ij} = \sum_{k=1}^{m} a_{ik} b_{kj}$.
Identity matrix: $I$, with ones on the diagonal and zeros elsewhere.

16 Maths reminder: matrix algebra
Transposition, inner product and norm: for $A \in M_{m,n}(\mathbb{R})$, $[A^\top]_{ij} = a_{ji}$.
For $x, y \in \mathbb{R}^n$,
$$\langle x, y \rangle = x^\top y = \sum_{i=1}^{n} x_i y_i, \qquad \|x\|^2 = x^\top x.$$

17 Maths reminder: matrix algebra
Linear dependence / independence:
- a set $\{x_1, \dots, x_m\}$ of vectors in $\mathbb{R}^n$ is dependent if some vector $x_j$ can be written as $x_j = \sum_{k=1,\, k \ne j}^{m} \alpha_k x_k$
- a set of vectors which is not dependent is called independent
- a set of $m > n$ vectors is necessarily dependent
- a set of $n$ independent vectors in $\mathbb{R}^n$ is called a basis
The rank of $A \in M_{n,m}(\mathbb{R})$ is the number of its linearly independent columns:
$$\mathrm{rank}(A) = m \iff \left(Ax = 0 \Rightarrow x = 0\right).$$

18 Maths reminder: linear system of equations
When $A$ is square, $\mathrm{rank}(A) = n$ is equivalent to the existence of $A^{-1}$ such that $A^{-1}A = AA^{-1} = I$.
When this property holds, for all $y \in \mathbb{R}^n$ the system of equations $Ax = y$ has a unique solution $x = A^{-1}y$.
Computation: Gauss elimination algorithm (no computation of $A^{-1}$); in Scilab/Matlab: x = A\y.

19 Maths reminder: differentiability
Definition: let $f : \mathbb{R}^n \to \mathbb{R}^m$ with components $f(x) = (f_1(x), \dots, f_m(x))^\top$, $f_i : \mathbb{R}^n \to \mathbb{R}$. Then $f$ is differentiable at $a \in \mathbb{R}^n$ if
$$f(a+h) = f(a) + J_f(a)\,h + \|h\|\,\varepsilon(h), \qquad \lim_{h \to 0} \varepsilon(h) = 0.$$
Jacobian matrix (partial derivatives): $[J_f(a)]_{ij} = \dfrac{\partial f_i}{\partial x_j}(a)$.
Gradient: if $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable at $a$,
$$f(a+h) = f(a) + \nabla f(a)^\top h + \|h\|\,\varepsilon(h), \qquad \lim_{h \to 0} \varepsilon(h) = 0.$$

20 Maths reminder: nonlinear system of equations
When $f : \mathbb{R}^n \to \mathbb{R}^n$, a solution $\hat x$ of the system of equations $f(\hat x) = 0$ can be found (or not) by Newton's method: given $x_0$, for each $k$:
1. consider the affine approximation of $f$ at $x_k$: $T(x) = f(x_k) + J_f(x_k)(x - x_k)$
2. take $x_{k+1}$ such that $T(x_{k+1}) = 0$, i.e.
$$x_{k+1} = x_k - J_f(x_k)^{-1} f(x_k).$$
Newton's method can be very fast... if $x_0$ is not too far from $\hat x$!
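A minimal Scilab sketch of this iteration (not part of the original slides; the 2x2 toy system below is purely illustrative):

// Newton's method for f(x) = 0 on a toy system (circle / line intersection)
function y=fun(x)
    y = [x(1)^2 + x(2)^2 - 1; x(1) - x(2)];
endfunction
function J=jacf(x)
    J = [2*x(1), 2*x(2); 1, -1];
endfunction

x = [1; 0];                        // starting point x0
for k = 1:10
    x = x - jacf(x) \ fun(x);      // solve J_f(x_k) d = f(x_k), then step
end
disp(x)                            // converges to (1/sqrt(2), 1/sqrt(2))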

21 Maths reminder: find a local minimum, gradient algorithm
When $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable, a vector $\hat x$ satisfying $\nabla f(\hat x) = 0$ and $f(\hat x) \le f(x)$ for all $x \in \mathbb{R}^n$ can be found by a descent algorithm: given $x_0$, for each $k$:
1. select a direction $d_k$ such that $\nabla f(x_k)^\top d_k < 0$
2. select a step $\rho_k$ such that $x_{k+1} = x_k + \rho_k d_k$ satisfies (among other conditions) $f(x_{k+1}) < f(x_k)$.
The choice $d_k = -\nabla f(x_k)$ leads to the gradient algorithm.

22 Maths reminder: find a local minimum, gradient algorithm
[figure: gradient iterates $x_0, x_1, x_2, \dots$ converging towards a local minimizer]
$$x_{k+1} = x_k - \rho_k \nabla f(x_k).$$
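A fixed-step gradient descent in Scilab (an illustrative sketch, not from the slides; the quadratic objective and the step size are made up):

// fixed-step gradient descent on a smooth function
function y=f_obj(x)
    y = (x(1)-1)^2 + 10*(x(2)+2)^2;
endfunction
function g=gradf(x)
    g = [2*(x(1)-1); 20*(x(2)+2)];
endfunction

x = [0; 0]; rho = 0.04;            // starting point and step size
for k = 1:200
    x = x - rho*gradf(x);          // d_k = -grad f(x_k)
end
disp(x)                            // approaches the minimizer (1, -2)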

23 Linear Least Squares (LLS)

24 Linear least squares: linear models
The model $y = f(x; \theta)$ is linear w.r.t. $\theta$, i.e.
$$y = \sum_{k=1}^{p} \theta_k \phi_k(x), \qquad \phi_k : \mathbb{R} \to \mathbb{R}.$$
Examples:
- $y = \sum_{k=1}^{p} \theta_k x^{k-1}$
- $y = \sum_{k=1}^{p} \theta_k \cos\dfrac{2\pi(k-1)x}{T}$, where $T = x_n - x_1$
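For such models the fitting problem reduces to building the $n \times p$ design matrix with entries $\phi_k(x_i)$; a small Scilab sketch for the polynomial basis (the abscissae x and the order p below are arbitrary):

// design matrix A(i,k) = x_i^(k-1) for the polynomial model
x = linspace(0, 1, 20)';          // example abscissae
p = 4;                            // model order (arbitrary)
A = ones(x);                      // first column: x.^0
for k = 2:p
    A = [A, x.^(k-1)];            // append column phi_k(x) = x^(k-1)
end
disp(size(A))                     // 20 x 4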

25 Linear least squares: simple linear regression
$$S(\theta) = \sum_{i=1}^{n} (\theta_1 + \theta_2 x_i - y_i)^2 = \|r(\theta)\|^2,$$
with residual components
$$r_i(\theta) = [1,\ x_i] \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} - y_i.$$
For the whole residual vector,
$$r(\theta) = A\theta - y, \qquad y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad A = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}.$$

26 Linear least squares: optimality conditions
Linear least squares problem: find
$$\hat\theta = \arg\min_{\theta \in \mathbb{R}^p} S(\theta), \qquad S(\theta) = \|A\theta - y\|^2.$$
Necessary optimality condition: $\nabla S(\hat\theta) = 0$.
Compute the gradient by expanding $S(\theta + h)$.

27 Linear least squares: optimality conditions
$$\begin{aligned}
S(\theta + h) &= \|A(\theta + h) - y\|^2 = \|A\theta - y + Ah\|^2 \\
&= (A\theta - y + Ah)^\top (A\theta - y + Ah) \\
&= (A\theta - y)^\top (A\theta - y) + (A\theta - y)^\top Ah + (Ah)^\top (A\theta - y) + (Ah)^\top Ah \\
&= \|A\theta - y\|^2 + 2(A\theta - y)^\top Ah + \|Ah\|^2 \\
&= S(\theta) + \nabla S(\theta)^\top h + \|Ah\|^2,
\end{aligned}$$
hence
$$\nabla S(\theta) = 2A^\top (A\theta - y).$$

28 Linear least squares: optimality conditions
Theorem: a solution of the LLS problem is given by $\hat\theta$, solution of the normal equations
$$A^\top A\, \hat\theta = A^\top y;$$
moreover, if $\mathrm{rank}\,A = p$, then $\hat\theta$ is unique.
Proof:
$$S(\theta) = S(\hat\theta + \theta - \hat\theta) = S(\hat\theta) + \nabla S(\hat\theta)^\top (\theta - \hat\theta) + \|A(\theta - \hat\theta)\|^2 = S(\hat\theta) + \|A(\theta - \hat\theta)\|^2 \ge S(\hat\theta).$$
Uniqueness: if $S(\theta) = S(\hat\theta)$, then $\|A(\theta - \hat\theta)\|^2 = 0$, hence $A(\theta - \hat\theta) = 0$ and, since $\mathrm{rank}\,A = p$, $\theta = \hat\theta$.
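As a quick numerical check (a sketch on made-up data, not from the slides), solving the normal equations and using the backslash operator give the same estimate:

// compare the normal equations with the backslash solver on a toy line fit
x = (0:0.5:5)';
y = 1 + 2*x + 0.05*rand(size(x,1), 1, "normal");   // noisy straight line
A = [ones(x), x];
theta_ne = (A'*A) \ (A'*y);        // solve A'A theta = A'y
theta_bs = A \ y;                  // least squares solve (QR-based)
disp([theta_ne, theta_bs])         // the two columns coincide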

29 Linear least squares: simple linear regression
$\mathrm{rank}\,A = 2$ if there exist $i \ne j$ such that $x_i \ne x_j$.
Computations:
$$S_x = \sum_{i=1}^{n} x_i, \quad S_y = \sum_{i=1}^{n} y_i, \quad S_{xy} = \sum_{i=1}^{n} x_i y_i, \quad S_{xx} = \sum_{i=1}^{n} x_i^2,$$
$$A^\top A = \begin{pmatrix} n & S_x \\ S_x & S_{xx} \end{pmatrix}, \qquad A^\top y = \begin{pmatrix} S_y \\ S_{xy} \end{pmatrix},$$
$$\theta_1 = \frac{S_y S_{xx} - S_x S_{xy}}{n S_{xx} - S_x^2}, \qquad \theta_2 = \frac{n S_{xy} - S_x S_y}{n S_{xx} - S_x^2}.$$
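These closed-form formulas translate directly into Scilab (a sketch; x and y are assumed to be data vectors already in memory):

// closed-form simple linear regression coefficients
n = size(x, 1);
Sx = sum(x);     Sy = sum(y);
Sxy = sum(x.*y); Sxx = sum(x.^2);
d = n*Sxx - Sx^2;                  // nonzero as soon as rank A = 2
theta1 = (Sy*Sxx - Sx*Sxy) / d;
theta2 = (n*Sxy - Sx*Sy) / d;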

30 Linear least squares: practical resolution in Scilab or Matlab
A=[ones(x),x,x.^2]
theta=A\y
plot(x,y,"o",x,A*theta,"r")

32 Linear least squares: an interesting example
Find a circle which best fits $(x_i, y_i)_{i=1..n}$ in the plane. Algebraic distance:
$$d(a, b, R) = \sum_{i=1}^{n} \left((x_i - a)^2 + (y_i - b)^2 - R^2\right)^2 = \|r\|^2.$$
The residual is nonlinear w.r.t. $(a, b, R)$, but
$$r_i = R^2 - a^2 - b^2 + 2a x_i + 2b y_i - (x_i^2 + y_i^2) = [2x_i,\ 2y_i,\ 1] \begin{pmatrix} a \\ b \\ R^2 - a^2 - b^2 \end{pmatrix} - (x_i^2 + y_i^2)$$
is linear w.r.t. $\theta = (a, b, R^2 - a^2 - b^2)$.

33 Linear least squares: an interesting example
Standard form:
$$A = \begin{pmatrix} 2x_1 & 2y_1 & 1 \\ \vdots & \vdots & \vdots \\ 2x_n & 2y_n & 1 \end{pmatrix}, \qquad z = \begin{pmatrix} x_1^2 + y_1^2 \\ \vdots \\ x_n^2 + y_n^2 \end{pmatrix}, \qquad d(a, b, R) = \|A\theta - z\|^2.$$
In Scilab:
A=[2*x,2*y,ones(x)]
z=x.^2+y.^2
theta=A\z
a=theta(1)
b=theta(2)
R=sqrt(theta(3)+a^2+b^2)
t=linspace(0,2*%pi,100)
plot(x,y,"o",a+R*cos(t),b+R*sin(t))
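To try this on synthetic data (an illustrative sketch; the centre, radius and noise level below are made up):

// noisy points on a circle of centre (2,-1) and radius 3
t = linspace(0, 2*%pi, 40)';
x = 2 + 3*cos(t) + 0.05*rand(40, 1, "normal");
y = -1 + 3*sin(t) + 0.05*rand(40, 1, "normal");
// then run the fitting code above: A=[2*x,2*y,ones(x)], z=x.^2+y.^2, theta=A\z, ...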

35 Linear least squares: take home message
Take home message #2: solving a linear least squares problem is just a matter of linear algebra.

36 Non Linear Least Squares (NLLS)

37 Non linear least squares: example
Consider data $(x_i, y_i)$ to be fitted by the nonlinear model
$$y = f(x; \theta) = \exp(\theta_1 + \theta_2 x).$$
The log trick leads some people to minimize
$$S_{\log}(\theta) = \sum_{i=1}^{n} \left(\log y_i - (\theta_1 + \theta_2 x_i)\right)^2,$$
i.e. to do simple linear regression of $(\log y_i)$ against $(x_i)$, but this violates a fundamental hypothesis: if $y_i - f(x_i; \theta)$ is Gaussian, then $\log y_i - \log f(x_i; \theta)$ is not!

38 Non linear least squares: possible angles of attack
Recall that $S(\theta) = \|r(\theta)\|^2$ with $r_i(\theta) = f(x_i; \theta) - y_i$. A local minimum of $S$ can be found by different methods.
Compute a solution of the nonlinear system of equations
$$\nabla S(\theta) = 2 J_r(\theta)^\top r(\theta) = 0$$
with Newton's method:
- needs the Jacobian of the gradient itself (do you really want to compute second derivatives?)
- does not guarantee convergence towards a minimum

39 Non linear least squares: possible angles of attack
Use the spirit of Newton's method as follows: start with $\theta_0$ and, for each $k$, consider the development of the residual vector at $\theta_k$ with $h = \theta - \theta_k$,
$$r(\theta) = r(\theta_k) + J_r(\theta_k)(\theta - \theta_k) + \|\theta - \theta_k\|\,\varepsilon(\theta - \theta_k),$$
and take $\theta_{k+1}$ as the vector minimizing
$$S_k(\theta) = \|r(\theta_k) + J_r(\theta_k)(\theta - \theta_k)\|^2.$$
Finding $\theta_{k+1}$ is a LLS problem!

40 Non linear least squares: Gauss-Newton method
Original formulation:
$$\theta_{k+1} = \theta_k - \left[J_r(\theta_k)^\top J_r(\theta_k)\right]^{-1} J_r(\theta_k)^\top r(\theta_k) = \theta_k - \tfrac{1}{2}\left[J_r(\theta_k)^\top J_r(\theta_k)\right]^{-1} \nabla S(\theta_k).$$
What can you do when the rank of $J_r(\theta_k)$ is deficient? Pick a $\lambda > 0$ and take $\theta_{k+1}$ as the vector minimizing
$$\|r(\theta_k) + J_r(\theta_k)(\theta - \theta_k)\|^2 + \lambda \|\theta - \theta_k\|^2.$$
Levenberg-Marquardt method:
$$\theta_{k+1} = \theta_k - \tfrac{1}{2}\left[J_r(\theta_k)^\top J_r(\theta_k) + \lambda I\right]^{-1} \nabla S(\theta_k).$$
$\lambda$ allows one to balance between speed ($\lambda \to 0$) and robustness ($\lambda \to \infty$).
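A bare-bones Gauss-Newton loop follows directly from the formula above (a sketch only, with no step control; resid and jacr stand for user-supplied residual and Jacobian functions):

// naive Gauss-Newton iteration (no line search, no damping)
// resid(theta): residual vector r(theta); jacr(theta): Jacobian J_r(theta)
function theta=gauss_newton(theta, resid, jacr, niter)
    for k = 1:niter
        J = jacr(theta);
        r = resid(theta);
        theta = theta - (J'*J) \ (J'*r);   // add lambda*eye(J'*J) for Levenberg-Marquardt
    end
endfunction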

41 Non linear least squares: example 1
Consider again data $(x_i, y_i)$ to be fitted by the nonlinear model $f(x; \theta) = \exp(\theta_1 + \theta_2 x)$. The Jacobian of $r(\theta)$ is given by
$$J_r(\theta) = \begin{pmatrix} \exp(\theta_1 + \theta_2 x_1) & x_1 \exp(\theta_1 + \theta_2 x_1) \\ \vdots & \vdots \\ \exp(\theta_1 + \theta_2 x_n) & x_n \exp(\theta_1 + \theta_2 x_n) \end{pmatrix}.$$

42 Non linear least squares: example 1
In Scilab, use the lsqrsolve or leastsq function:
function r=resid(theta,n)
  r=exp(theta(1)+theta(2)*x)-y;
endfunction
function j=jac(theta,n)
  e=exp(theta(1)+theta(2)*x);
  j=[e x.*e];
endfunction

load("data_exp.dat")
theta0=[0;0];
theta=lsqrsolve(theta0,resid,30,jac);
plot(x,y,"ob",x,exp(theta(1)+theta(2)*x),"r")
Result: $\hat\theta = (1.001, \dots)$

43 Non linear least squares: example 2
Enzymatic kinetics:
$$s'(t) = \frac{\theta_2\, s(t)}{s(t) + \theta_3}, \quad t > 0, \qquad s(0) = \theta_1,$$
with $y_i$ = measurement of $s$ at time $t_i$, and
$$S(\theta) = \|r(\theta)\|^2, \qquad r_i(\theta) = \frac{y_i - s(t_i)}{\sigma_i}.$$
Individual weighting allows one to take into account different standard deviations of the measurements.

44 Non linear least squares: example 2
In Scilab, use the lsqrsolve or leastsq function:
function dsdt=michaelis(t,s,theta)
  dsdt=theta(2)*s/(s+theta(3))
endfunction
function r=resid(theta,n)
  s=ode(theta(1),0,t,list(michaelis,theta))
  r=(s-y)./sigma
endfunction

load("michaelis_data.dat")
theta0=[y(1);20;80];
theta=lsqrsolve(theta0,resid,n)
Result: $\hat\theta = (887.9, 37.6, 97.7)$
If not provided, the Jacobian $J_r(\theta)$ is approximated by finite differences (but analytical derivatives always speed up convergence).

45 Non linear least squares: take home message
Take home message #3: solving nonlinear least squares problems is not that difficult with adequate software and good starting values.

46 Statistical evaluation of solutions

47 Statistical evaluation of solutions: motivation
Since the data $(y_i)_{i=1..n}$ is a sample of random variables, $\hat\theta$ is a random variable too! Confidence intervals for $\hat\theta$ can be obtained by at least two methods:
- Monte-Carlo method: allows one to estimate the distribution of $\hat\theta$, but needs thousands of resamplings
- linearized statistics: very fast, but can be very approximate for high levels of measurement error

48 Statistical evaluation of solutions: Monte Carlo method
The Monte Carlo method is a resampling method, i.e. it works by generating new samples of synthetic measurements and redoing the estimation of $\hat\theta$. Here the model is $y = \theta_1 + \theta_2 x + \theta_3 x^2$ and the data is corrupted by noise with $\sigma = 1/2$.
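A compact sketch of such a resampling loop (illustrative only; the noise level sigma, the design matrix A and the fitted values yhat = A*theta are assumed to come from the original fit):

// Monte Carlo confidence intervals for a linear fit (sketch)
N = 2000;                                    // number of synthetic resamplings
thetas = zeros(3, N);
for m = 1:N
    ysim = yhat + sigma*rand(size(yhat,1), 1, "normal");  // synthetic measurements
    thetas(:, m) = A \ ysim;                 // re-estimate on each sample
end
// 95% confidence interval for each parameter from empirical quantiles
for j = 1:3
    s = gsort(thetas(j,:), "g", "i");        // sort in increasing order
    disp([s(ceil(0.025*N)), s(floor(0.975*N))])
end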

49 Statistical evaluation of solutions: Monte Carlo method
[figure: histograms of the resampled estimates of $\theta_1$, $\theta_2$, $\theta_3$]
At 95% confidence level: $\hat\theta_1 \in [0.99, 1.29]$, $\hat\theta_2 \in [-1.20, -0.85]$, $\hat\theta_3 \in [-2.57, -1.91]$.

50 Statistical evaluation of solutions: linearized statistics
Define the weighted residual $r(\theta)$ by
$$r_i(\theta) = \frac{y_i - f(x_i; \theta)}{\sigma_i},$$
where $\sigma_i$ is the standard deviation of $y_i$. The covariance matrix of $\hat\theta$ can be approximated by $V(\hat\theta) = F(\hat\theta)^{-1}$, where $F(\hat\theta)$ is the Fisher information matrix, given by
$$F(\theta) = J_r(\theta)^\top J_r(\theta).$$
For example, when $\sigma_i = \sigma$ for all $i$, in LLS problems
$$V(\hat\theta) = \sigma^2 (A^\top A)^{-1}.$$

51 Statistical evaluation of solutions: linearized statistics
d=derivative(resid,theta)
V=inv(d'*d)
sigma_theta=sqrt(diag(V))
// fractile of the Student distribution
t_alpha=cdft("T",n-3,0.975,0.025);
thetamin=theta-t_alpha*sigma_theta
thetamax=theta+t_alpha*sigma_theta
With $\hat\theta = (887.9, 37.6, 97.7)$, at 95% confidence level:
$\hat\theta_1 \in [856.68,\ \dots]$, $\hat\theta_2 \in [34.13, 41.21]$, $\hat\theta_3 \in [93.37,\ \dots]$.

52 Model selection

53 Model selection: motivation. Which model is the best?
[figure: the same data fitted with polynomial models of increasing order]

54 Model selection: motivation. Which model is the best?
On the previous slide the data has been fitted with the model
$$y = \sum_{k=0}^{p} \theta_k x^k$$
for several values of $p$. Considering $S(\hat\theta)$ as a function of the model order $p$ does not help much: the training misfit keeps decreasing as $p$ grows.
[table of $S(\hat\theta)$ versus model order $p$; the values were lost in transcription]

55 Model selection: validation
Validation is the key to model selection:
1. Define two sets of data: $V \subset \{1, \dots, n\}$ for validation and $T = \{1, \dots, n\} \setminus V$ for model training.
2. For each value of the model order $p$:
   - compute the optimal parameters with the training data, $\hat\theta_p = \arg\min_{\theta \in \mathbb{R}^p} \sum_{i \in T} (y_i - f(x_i; \theta))^2$
   - compute the validation residual $S_V(\hat\theta_p) = \sum_{i \in V} (y_i - f(x_i; \hat\theta_p))^2$
A Scilab sketch of this procedure is given below.
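A possible transcription of this loop (a sketch under assumptions: x and y are column data vectors, the model is the polynomial one of slide 54, and the validation set is a simple every-other-point split):

// train/validation split for polynomial model selection (sketch)
n = size(x, 1);
iV = 2:2:n;  iT = 1:2:n;                     // validation / training indices
SV = zeros(1, 8);
for p = 0:7
    A = ones(x);
    for k = 1:p, A = [A, x.^k]; end          // design matrix columns x^0 ... x^p
    theta_p = A(iT,:) \ y(iT);               // fit on the training data only
    SV(p+1) = sum((y(iV) - A(iV,:)*theta_p).^2);  // validation residual
end
disp(SV')                                     // choose the order p minimizing SV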

56 Model selection: validation
Validation helps a lot: here the best model order is clearly $p = 3$!
[table of $S_V(\hat\theta_p)$ versus model order $p$; the values were lost in transcription]

57 Statistical evaluation and model selection: take home message
Take home message #4: always evaluate your models, either by computing confidence intervals for the parameters or by using validation.
