The Levenberg-Marquardt method for nonlinear least squares curve-fitting problems

© Henri P. Gavin
Department of Civil and Environmental Engineering, Duke University
March 9, 2016

Abstract

The Levenberg-Marquardt method is a standard technique used to solve nonlinear least squares problems. Least squares problems arise when fitting a parameterized function to a set of measured data points by minimizing the sum of the squares of the errors between the data points and the function. Nonlinear least squares problems arise when the function is not linear in the parameters. Nonlinear least squares methods iteratively improve the parameter values in order to reduce the sum of the squares of the errors between the function and the measured data points. The Levenberg-Marquardt curve-fitting method is actually a combination of two minimization methods: the gradient descent method and the Gauss-Newton method. In the gradient descent method, the sum of the squared errors is reduced by updating the parameters in the steepest-descent direction. In the Gauss-Newton method, the sum of the squared errors is reduced by assuming the least squares function is locally quadratic and finding the minimum of that quadratic. The Levenberg-Marquardt method acts more like a gradient-descent method when the parameters are far from their optimal values, and acts more like the Gauss-Newton method when the parameters are close to their optimal values. This document describes these methods and illustrates the use of software to solve nonlinear least squares curve-fitting problems.

1 Introduction

In fitting a function ŷ(t; p) of an independent variable t and a vector of n parameters p to a set of m data points (t_i, y_i), it is customary and convenient to minimize the sum of the weighted squares of the errors (or weighted residuals) between the measured data y(t_i) and the curve-fit function ŷ(t_i; p). This scalar-valued goodness-of-fit measure is called the chi-squared error criterion,

    χ²(p) = Σ_{i=1}^{m} [ ( y(t_i) − ŷ(t_i; p) ) / w_i ]²          (1)
          = (y − ŷ(p))ᵀ W (y − ŷ(p))                               (2)
          = yᵀWy − 2 yᵀWŷ + ŷᵀWŷ .                                 (3)

The value w_i is a measure of the error in measurement y(t_i). The weighting matrix W is diagonal, with W_ii = 1/w_i². If the function ŷ is nonlinear in the model parameters p, then the minimization of χ² with respect to the parameters must be carried out iteratively. The goal of each iteration is to find a perturbation h to the parameters p that reduces χ².
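To make equations (1)-(3) concrete, the short Matlab sketch below evaluates the weighted χ² both in the summation form (1) and in the matrix form (2); the two forms give the same value. The function handle y_hat_func and the variable names are illustrative assumptions, not part of lm.m.

% Sketch: evaluating the chi-squared error criterion of equations (1)-(3).
% y_hat_func, p, t, y, and w are illustrative placeholders (column vectors).
y_hat = y_hat_func(t, p);            % model curve at the current parameters
r     = y - y_hat;                   % residuals, y(t_i) - y_hat(t_i;p)
chi2_sum    = sum( (r ./ w).^2 );    % summation form, equation (1)
W           = diag( 1 ./ w.^2 );     % diagonal weighting matrix, W_ii = 1/w_i^2
chi2_matrix = r' * W * r;            % matrix form, equation (2); equals chi2_sum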

2 The Gradient Descent Method

The steepest descent method is a general minimization method which updates parameter values in the downhill direction: the direction opposite to the gradient of the objective function. The gradient descent method converges well for problems with simple objective functions [3, 4]. For problems with thousands of parameters, gradient descent methods are sometimes the only viable method. The gradient of the chi-squared objective function with respect to the parameters is

    ∂χ²/∂p = 2 (y − ŷ(p))ᵀ W ∂(y − ŷ(p))/∂p                        (4)
           = −2 (y − ŷ(p))ᵀ W [ ∂ŷ(p)/∂p ]                         (5)
           = −2 (y − ŷ)ᵀ W J ,                                     (6)

where the m × n Jacobian matrix [∂ŷ/∂p] represents the local sensitivity of the function ŷ to variation in the parameters p. For notational simplicity, J will be used for [∂ŷ/∂p]. The parameter update h that moves the parameters in the direction of steepest descent is given by

    h_gd = α Jᵀ W (y − ŷ) ,                                        (7)

where the positive scalar α determines the length of the step in the steepest-descent direction.

3 The Gauss-Newton Method

The Gauss-Newton method is a method for minimizing a sum-of-squares objective function. It presumes that the objective function is approximately quadratic in the parameters near the optimal solution [1]. For moderately sized problems the Gauss-Newton method typically converges much faster than gradient-descent methods [5]. The function evaluated with perturbed model parameters may be locally approximated through a first-order Taylor series expansion,

    ŷ(p + h) ≈ ŷ(p) + [ ∂ŷ/∂p ] h = ŷ + J h .                      (8)

Substituting the approximation for the perturbed function, ŷ + Jh, for ŷ in equation (3),

    χ²(p + h) ≈ yᵀWy + ŷᵀWŷ − 2 yᵀWŷ − 2 (y − ŷ)ᵀ W J h + hᵀ Jᵀ W J h .   (9)

This shows that χ² is approximately quadratic in the perturbation h, and that the Hessian of the chi-squared fit criterion is approximately JᵀWJ. The parameter update h that minimizes χ² is found from ∂χ²/∂h = 0:

    ∂χ²(p + h)/∂h ≈ −2 (y − ŷ)ᵀ W J + 2 hᵀ Jᵀ W J ,                (10)

and the resulting normal equations for the Gauss-Newton update are

    [ Jᵀ W J ] h_gn = Jᵀ W (y − ŷ) .                               (11)
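A minimal Matlab sketch of one step of each method may help fix the notation: the steepest-descent update of equation (7) scales the gradient direction, while the Gauss-Newton update of equation (11) solves the normal equations. The variable names (J, W, r, alpha, and so on) are illustrative placeholders, not lm.m internals.

% Sketch of the updates in equations (7) and (11); names are illustrative.
r    = y - y_hat;              % current residual vector, y - y_hat(p)
JtWJ = J' * W * J;             % approximate Hessian of chi^2
JtWr = J' * W * r;             % J'W(y - y_hat), the negative half-gradient
h_gd = alpha * JtWr;           % gradient-descent update, equation (7)
h_gn = JtWJ \ JtWr;            % Gauss-Newton update from the normal equations (11)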

4 The Levenberg-Marquardt Method

The Levenberg-Marquardt algorithm adaptively varies the parameter updates between the gradient descent update and the Gauss-Newton update,

    [ Jᵀ W J + λ I ] h_lm = Jᵀ W (y − ŷ) ,                         (12)

where small values of the algorithmic parameter λ result in a Gauss-Newton update and large values of λ result in a gradient descent update. The parameter λ is initialized to be large so that the first updates are small steps in the steepest-descent direction. If an iteration happens to result in a worse approximation, λ is increased. As the solution improves, λ is decreased, the Levenberg-Marquardt method approaches the Gauss-Newton method, and the solution typically accelerates to the local minimum [3, 4, 5].

Marquardt's suggested update relationship [5],

    [ Jᵀ W J + λ diag(Jᵀ W J) ] h_lm = Jᵀ W (y − ŷ) ,              (13)

makes the effect of particular values of λ less problem-specific, and is used in the Levenberg-Marquardt algorithm implemented in the Matlab function lm.m.

4.1 Numerical Implementation

Many variations of the Levenberg-Marquardt method have been published in papers and in code. This document borrows from some of these, including the enhancement of a rank-1 Jacobian update. In iteration i, the step h is evaluated by comparing χ²(p) to χ²(p + h). The step is accepted if the metric ρ_i [6] is greater than a user-specified threshold, ε_4 > 0. This metric is a measure of the actual improvement in χ² as compared to the improvement of an LM update assuming the approximation (8) were exact:

    ρ_i(h_lm) = [ χ²(p) − χ²(p + h_lm) ] / [ (y − ŷ)ᵀ W (y − ŷ) − (y − ŷ − J h_lm)ᵀ W (y − ŷ − J h_lm) ]       (14)
              = [ χ²(p) − χ²(p + h_lm) ] / [ h_lmᵀ ( λ_i h_lm + Jᵀ W (y − ŷ(p)) ) ]               if using eqn (12) for h_lm   (15)
              = [ χ²(p) − χ²(p + h_lm) ] / [ h_lmᵀ ( λ_i diag(Jᵀ W J) h_lm + Jᵀ W (y − ŷ(p)) ) ]  if using eqn (13) for h_lm   (16)

If in an iteration ρ_i(h) > ε_4, then p + h is sufficiently better than p, p is replaced by p + h, and λ is reduced by a factor. Otherwise λ is increased by a factor, and the algorithm proceeds to the next iteration.

4.1.1 Initialization and update of the L-M parameter, λ, and the parameters p

In lm.m users may select one of three methods for initializing and updating λ and p.

1. λ_0 = λ_o; λ_o is user-specified [5].
   Use eqn (13) for h_lm and eqn (16) for ρ.
   If ρ_i(h) > ε_4:  p ← p + h;  λ_{i+1} = max[ λ_i / L_↓, 10⁻⁷ ];
   otherwise:        λ_{i+1} = min[ λ_i L_↑, 10⁷ ];

2. λ_0 = λ_o max[ diag[Jᵀ W J] ]; λ_o is user-specified.
   Use eqn (12) for h_lm and eqn (15) for ρ.
   α = ( (Jᵀ W (y − ŷ(p)))ᵀ h ) / ( (χ²(p + h) − χ²(p))/2 + 2 (Jᵀ W (y − ŷ(p)))ᵀ h );
   If ρ_i(αh) > ε_4:  p ← p + αh;  λ_{i+1} = max[ λ_i / (1 + α), 10⁻⁷ ];
   otherwise:         λ_{i+1} = λ_i + |χ²(p + αh) − χ²(p)| / (2α);

3. λ_0 = λ_o max[ diag[Jᵀ W J] ]; λ_o is user-specified [6].
   Use eqn (12) for h_lm and eqn (15) for ρ.
   If ρ_i(h) > ε_4:  p ← p + h;  λ_{i+1} = λ_i max[ 1/3, 1 − (2ρ_i − 1)³ ];  ν_i = 2;
   otherwise:        λ_{i+1} = λ_i ν_i;  ν_{i+1} = 2 ν_i;

For the examples in section 4.4, method 1 [5] with L_↑ = 11 and L_↓ = 9 has good convergence properties.

4.1.2 Computation and rank-1 update of the Jacobian, [∂ŷ/∂p]

In the first iteration, in every 2n iterations, and in iterations where χ²(p + h) > χ²(p), the Jacobian (J ∈ R^{m×n}) is numerically approximated using forward differences,

    J_ij = ∂ŷ_i/∂p_j = [ ŷ(t_i; p + δp_j) − ŷ(t_i; p) ] / ||δp_j|| ,               (17)

or central differences (the default),

    J_ij = ∂ŷ_i/∂p_j = [ ŷ(t_i; p + δp_j) − ŷ(t_i; p − δp_j) ] / ( 2 ||δp_j|| ) ,   (18)

where the j-th element of δp_j is the only non-zero element and is set to Δp_j (1 + |p_j|). In all other iterations, the Jacobian is updated using the Broyden rank-1 update formula,

    J = J + ( ( ŷ(p + h) − ŷ(p) − J h ) hᵀ ) / ( hᵀ h ) .                           (19)

For problems with many parameters, a finite-differences Jacobian is computationally expensive. Convergence can be achieved with fewer function evaluations if the Jacobian is re-computed using finite differences only occasionally. The rank-1 Jacobian update equation (19) requires no additional function evaluations.

4.1.3 Convergence criteria

Convergence is achieved when one of the following three criteria is satisfied: convergence in the gradient, max |Jᵀ W (y − ŷ)| < ε_1; convergence in parameters, max |h_i / p_i| < ε_2; or convergence in χ², χ²/(m − n + 1) < ε_3. Otherwise, iterations terminate when the iteration count exceeds a pre-specified limit.
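The loop below is a compact Matlab sketch of the procedure just described: a forward-difference Jacobian (equation (17)), the Marquardt-form step of equation (13), the acceptance metric of equation (16), and the λ update of method 1. It is an illustration only, not the lm.m source; convergence tests and the Broyden update of equation (19) are omitted, and func, p, t, y, W, dp, lambda, L_up, L_dn, eps4, and max_iter are assumed placeholder names (p and y are column vectors).

% Illustrative sketch of the core Levenberg-Marquardt iteration (update method 1).
y_hat = func(t, p);
X2    = (y - y_hat)' * W * (y - y_hat);                 % chi-squared, equation (2)
for iter = 1:max_iter
  J = zeros(length(y), length(p));                      % forward-difference Jacobian, eqn (17)
  for j = 1:length(p)
    dpj     = zeros(size(p));
    dpj(j)  = dp * (1 + abs(p(j)));
    J(:, j) = (func(t, p + dpj) - y_hat) / dpj(j);
  end
  JtWJ = J' * W * J;
  JtWr = J' * W * (y - y_hat);
  h    = (JtWJ + lambda * diag(diag(JtWJ))) \ JtWr;     % LM step, equation (13)
  y_try  = func(t, p + h);
  X2_try = (y - y_try)' * W * (y - y_try);
  rho = (X2 - X2_try) / (h' * (lambda * diag(JtWJ) .* h + JtWr));   % equation (16)
  if rho > eps4                                         % accept: keep p+h, decrease lambda
    p = p + h;  y_hat = y_try;  X2 = X2_try;
    lambda = max(lambda / L_dn, 1e-7);
  else                                                  % reject: increase lambda
    lambda = min(lambda * L_up, 1e7);
  end
end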

4.2 Error Analysis

Once the optimal curve-fit parameters p_fit are determined, parameter statistics are computed for the converged solution using the same weight value w for every data point. This weight value is equal to the mean square measurement error, σ_y²:

    w² = σ_y² = ( 1 / (m − n + 1) ) (y − ŷ(p_fit))ᵀ (y − ŷ(p_fit)) .          (20)

The parameter covariance matrix is then computed from

    V_p = [ Jᵀ W J ]⁻¹ .                                                       (21)

The asymptotic standard parameter errors are given by

    σ_p = √( diag( [ Jᵀ W J ]⁻¹ ) ) .                                          (22)

The asymptotic standard parameter error is a measure of how unexplained variability in the data propagates to variability in the parameters, and is essentially an error measure for the parameters. The standard error of the fit is given by

    σ_ŷ = √( diag( J [ Jᵀ W J ]⁻¹ Jᵀ ) ) .                                     (23)

The standard error of the fit indicates how variability in the parameters affects the variability in the curve-fit. The asymptotic standard prediction error reflects the standard error of the fit as well as the mean square measurement error,

    σ_ŷp = √( σ_y² + diag( J [ Jᵀ W J ]⁻¹ Jᵀ ) ) .                             (24)

4.3 Matlab code: lm.m

The Matlab function lm.m implements the Levenberg-Marquardt method for curve-fitting problems. The code and examples are available here: hpgavin/lm.m, hpgavin/lm_examp.m, hpgavin/lm_func.m, hpgavin/lm_plots.m.
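The error measures of equations (20)-(24) can be evaluated directly from the Jacobian at the converged solution. The sketch below is illustrative only, not lm.m internals; it assumes J is the Jacobian evaluated at p_fit, y_hat_fit is the converged model curve, and that the uniform weight of equation (20) is used.

% Sketch of the error analysis of equations (20)-(24); names are illustrative.
m  = length(y);                            % number of data points
n  = length(p_fit);                        % number of parameters
r  = y - y_hat_fit;                        % residuals at the converged fit
w2 = (r' * r) / (m - n + 1);               % mean square measurement error, eqn (20)
W  = eye(m) / w2;                          % uniform weighting implied by eqn (20)
Vp = inv(J' * W * J);                      % parameter covariance matrix, eqn (21)
sigma_p  = sqrt(diag(Vp));                 % asymptotic standard parameter errors, eqn (22)
sigma_y  = sqrt(diag(J * Vp * J'));        % standard error of the fit, eqn (23)
sigma_yp = sqrt(w2 + diag(J * Vp * J'));   % asymptotic standard prediction error, eqn (24)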

function [p,X2,sigma_p,sigma_y,corr_p,R_sq,cvg_hst] = lm(func,p,t,y_dat,weight,dp,p_min,p_max,c,opts)
% [p,X2,sigma_p,sigma_y,corr_p,R_sq,cvg_hst] = lm(func,p,t,y_dat,weight,dp,p_min,p_max,c,opts)
%
% Levenberg-Marquardt curve-fitting: minimize sum of weighted squared residuals
%
% INPUT VARIABLES
% func   = function of n independent variables, t, and m parameters, p,
%          returning the simulated model: y_hat = func(t,p,c)
% p      = n-vector of initial guess of parameter values
% t      = m-vectors or matrix of independent variables (used as arg to func)
% y_dat  = m-vectors or matrix of data to be fit by func(t,p)
% weight = weighting vector for least squares fit ( weight >= 0 ) ...
%          inverse of the standard measurement errors
%          Default: sqrt( d.o.f. / ( y_dat' * y_dat ) )
% dp     = fractional increment of p for numerical derivatives
%          dp(j) > 0 central differences calculated
%          dp(j) < 0 one-sided, backwards differences calculated
%          dp(j) = 0 sets corresponding partials to zero, i.e., holds p(j) fixed
%          Default: 0.001
% p_min  = n-vector of lower bounds for parameter values
% p_max  = n-vector of upper bounds for parameter values
% c      = an optional matrix of values passed to func(t,p,c)
% opts   = vector of algorithmic parameters
%               parameter        defaults   meaning
% opts(1)  =    prnt             3          >1 intermediate results; >2 plots
% opts(2)  =    MaxIter          Npar       maximum number of iterations
% opts(3)  =    epsilon_1        1e-3       convergence tolerance for gradient
% opts(4)  =    epsilon_2        1e-3       convergence tolerance for parameters
% opts(5)  =    epsilon_3        1e-3       convergence tolerance for Chi-square
% opts(6)  =    epsilon_4        1e-1       determines acceptance of a L-M step
% opts(7)  =    lambda_0         1e-2       initial value of L-M parameter
% opts(8)  =    lambda_UP_fac    11         factor for increasing lambda
% opts(9)  =    lambda_DN_fac    9          factor for decreasing lambda
% opts(10) =    Update_Type      1          1: Levenberg-Marquardt lambda update
%                                           2: Quadratic update
%                                           3: Nielsen's lambda update equations
%
% OUTPUT VARIABLES
% p       = least-squares optimal estimate of the parameter values
% X2      = Chi-squared criterion
% sigma_p = asymptotic standard error of the parameters
% sigma_y = asymptotic standard error of the curve-fit
% corr_p  = correlation matrix of the parameters
% R_sq    = R-squared coefficient of multiple determination
% cvg_hst = convergence history

The m-file to solve a least-squares curve-fit problem with lm.m can be as simple as:

my_data = load('my_data_file');    % load the data
t       = my_data(:,1);            % if the independent variable is in column 1
y_dat   = my_data(:,2);            % if the dependent variable is in column 2

p_min  = [ ];                      % minimum expected parameter values
p_max  = [ ];                      % maximum expected parameter values
p_init = [ ];                      % initial guess for parameter values

[ p_fit, X2, sigma_p, sigma_y, corr, R_sq, cvg_hst ] = ...
    lm ( 'lm_func', p_init, t, y_dat, 1, -0.1, p_min, p_max )

where the user-supplied function lm_func.m could be, for example,

function y_hat = lm_func(t,p,c)
y_hat = p(1) * t .* exp(-t/p(2)) .* cos( 2*pi*( p(3)*t - p(4) ) );

It is common and desirable to repeat the same experiment two or more times and to estimate a single set of curve-fit parameters from all the experiments. In such cases the data file may be arranged as follows:

% t-variable    y (1st experiment)    y (2nd experiment)    y (3rd experiment)
     :                  :                     :                     :
    etc.               etc.                  etc.                  etc.

If your data is arranged as above, you may prepare the data for lm.m using the following lines:

my_data = load('my_data_file');          % load the data
t_column  = 1;                           % column of the independent variable
y_columns = [ 2 3 4 ];                   % columns of the measured dependent variables
y_dat = my_data(:, y_columns);           % the measured data
y_dat = y_dat(:);                        % a single column vector

t = my_data(:, t_column);                % the independent variable
t = t * ones(1, length(y_columns));      % a column of t for each column of y
t = t(:);                                % a single column vector

Note that the arguments t and y_dat to lm.m may be matrices as long as the dimensions of t match the dimensions of y_dat. The columns of t need not be identical. Results may be plotted with lm_plots.m:

function lm_plots ( t, y_dat, y_fit, sigma_y, cvg_hst, filename )
% lm_plots ( t, y_dat, y_fit, sigma_y, cvg_hst, filename )
% Plot statistics of the results of a Levenberg-Marquardt least squares
% analysis with lm.m

y_dat = y_dat(:);
y_fit = y_fit(:);

[max_it,n] = size(cvg_hst); n = n-3;

figure(1);                   % plot convergence history of parameters, chi^2, lambda
clf
subplot(211)
plot( cvg_hst(:,1), cvg_hst(:,2:n+1), '-o', 'linewidth',4);
for i = 1:n
    text( 1.2*cvg_hst(max_it,1), cvg_hst(max_it,1+i), sprintf('%d',i) );
end
ylabel('parameter values')
subplot(212)
semilogy( cvg_hst(:,1), [ cvg_hst(:,n+2) cvg_hst(:,n+3) ], '-o', 'linewidth',4)
text( 0.8*cvg_hst(max_it,1), 0.2*cvg_hst(max_it,n+2), '\chi^2/(m-n+1)' );
text( 0.8*cvg_hst(max_it,1), 0.5*cvg_hst(max_it,n+3), '\lambda' );
%legend('\chi^2/(m-n+1)', '\lambda', 3);
ylabel('\chi^2/(m-n+1) and \lambda')
xlabel('function calls')

figure(2);                   % plot data, fit, and confidence interval of fit
clf
plot( t,y_dat,'og', t,y_fit,'-b', ...
      t,y_fit+1.96*sigma_y,'.k', t,y_fit-1.96*sigma_y,'.k' );
legend('y_{data}','y_{fit}','95% c.i.');
ylabel('y(t)')
xlabel('t')
%subplot(212)
% semilogy(t,sigma_y,'-r','linewidth',4);
% ylabel('\sigma_y(t)')

figure(3);                   % plot histogram of residuals; are they Gaussian?
clf
hist( real(y_dat - y_fit) )
title('histogram of residuals')
axis('tight')
xlabel('y_{data} - y_{fit}')
ylabel('count')
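One possible way to generate these plots after a fit is sketched below. The call to lm and the evaluation of y_fit follow the earlier example; the final filename string is an illustrative placeholder.

% Sketch: plotting the results of a fit; follows the earlier lm_func example.
[ p_fit, X2, sigma_p, sigma_y, corr, R_sq, cvg_hst ] = ...
    lm ( 'lm_func', p_init, t, y_dat, 1, -0.1, p_min, p_max );
y_fit = lm_func ( t, p_fit, [] );       % fitted curve at the measurement points
lm_plots ( t, y_dat, y_fit, sigma_y, cvg_hst, 'lm_fit' );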

4.4 Numerical Examples

The robustness of lm.m is tested in three numerical examples by curve-fitting simulated experimental measurements. Noisy experimental measurements y are simulated by adding random measurement noise to the curve-fit function evaluated with a set of true parameter values, ŷ(t; p_true). The random measurement noise is normally distributed with a mean of zero and a standard deviation of 0.2:

    y_i = ŷ(t_i; p_true) + N(0, 0.2) .                                         (25)

The convergence of the parameters from an erroneous initial guess p_initial to values closer to p_true is then examined.

Each numerical example below has four parameters (n = 4) and one-hundred measurements (m = 100). Each numerical example has a different curve-fit function ŷ(t; p), a different true parameter vector p_true, and a different vector of initial parameters p_initial.

For several values of p_2 and p_4, the χ² error criterion is calculated and plotted as a surface over the p_2, p_4 plane. The bowl-shaped nature of the objective function is clearly evident in each example. The objective function, however, need not be quadratic in the parameters, and it may have multiple minima. The presence of measurement noise does not affect the smoothness of the objective function.

The gradient descent method endeavors to move parameter values in a downhill direction to minimize χ²(p). This often requires small step sizes, but it is the approach required when the objective function is not quadratic. The Gauss-Newton method approximates the bowl shape as a quadratic and endeavors to move parameter values to the minimum in a small number of steps. This method works well when the parameters are close to their optimal values. The Levenberg-Marquardt method retains the best features of both the gradient-descent method and the Gauss-Newton method.

The evolution of the parameter values, the evolution of χ², and the evolution of λ from iteration to iteration are plotted for each example. The simulated experimental data, the curve fit, and the 95-percent confidence interval of the fit are plotted; the standard error of the fit and a histogram of the fit errors are also plotted. The initial parameter values p_initial, the true parameter values p_true, the fit parameter values p_fit, the standard error of the fit parameters σ_p, and the correlation matrix of the fit parameters are tabulated. The true parameter values lie within the confidence interval p_fit − 1.96 σ_p < p_true < p_fit + 1.96 σ_p with a confidence level of 95 percent.
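The simulated measurements of equation (25) can be generated with a few lines of Matlab; the time vector, p_true, and the use of lm_func below are illustrative placeholders for whichever example is being run.

% Sketch: generating simulated noisy measurements as in equation (25).
% The time vector, p_true, and the model function are illustrative placeholders.
m     = 100;                                        % number of measurements
t     = (1:m)';                                     % independent variable
y_dat = lm_func(t, p_true, []) + 0.2*randn(m,1);    % true curve plus zero-mean Gaussian noise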

4.4.1 Example 1

Consider fitting the following function to a set of measured data:

    ŷ(t; p) = p_1 exp(−t/p_2) + p_3 t exp(−t/p_4) .                            (26)

The m-function to be used with lm.m is simply:

function y_hat = lm_func(t,p,c)
y_hat = p(1)*exp(-t/p(2)) + p(3)*t.*exp(-t/p(4));

The true parameter values p_true, the initial parameter values p_initial, the resulting curve-fit parameter values p_fit, and the standard errors of the fit parameters σ_p are shown in Table 1. The R² fit criterion is 98 percent. The standard parameter errors are less than one percent of the parameter values, except for the standard error for p_2, which is 1.5 percent of p_2. The parameter correlation matrix is given in Table 2. Parameters p_3 and p_4 are the most correlated, at -96 percent. Parameters p_1 and p_4 are the least correlated, at -3 percent.

The bowl-shaped nature of the χ² objective function is shown in Figure 1(a). This shape is nearly quadratic and has a single minimum.

The convergence of the parameters and the evolution of χ² and λ are shown in Figure 1(b). The data points, the curve fit, and the curve-fit confidence band are plotted in Figure 1(c). Note that the standard error of the fit is smaller near the center of the fit domain and larger at the edges of the domain.

A histogram of the difference between the data values and the curve-fit is shown in Figure 1(d). Ideally these curve-fit errors should be normally distributed.

Table 1. Parameter values and standard errors.
          p_initial    p_true    p_fit    σ_p    σ_p / p_fit (%)

Table 2. Parameter correlation matrix.
          p_1    p_2    p_3    p_4

Figure 1. (a) The sum of the squared errors as a function of p_2 and p_4. (b) Top: the convergence of the parameters with each iteration; Bottom: values of χ² and λ at each iteration. (c) Top: data y, curve-fit ŷ(t; p_fit), curve-fit + error, and curve-fit − error; Bottom: standard error of the fit, σ_ŷ(t). (d) Histogram of the errors between the data and the fit.

4.4.2 Example 2

Consider fitting the following function to a set of measured data:

    ŷ(t; p) = p_1 (t/max(t)) + p_2 (t/max(t))² + p_3 (t/max(t))³ + p_4 (t/max(t))⁴ .     (27)

This function is linear in the parameters and may be fit using methods of linear least squares. The m-function to be used with lm.m is simply:

function y_hat = lm_func(t,p,c)
mt = max(t);
y_hat = p(1)*(t/mt) + p(2)*(t/mt).^2 + p(3)*(t/mt).^3 + p(4)*(t/mt).^4;

The true parameter values p_true, the initial parameter values p_initial, the resulting curve-fit parameter values p_fit, and the standard errors of the fit parameters σ_p are shown in Table 3. The R² fit criterion is 99.9 percent. In this example, the standard parameter errors are larger than in example 1. The standard error for p_2 is 12 percent and the standard error of p_3 is 17 percent. Note that a very high value of the R² coefficient of determination does not necessarily mean that parameter values have been found with great accuracy. The parameter correlation matrix is given in Table 4. These parameters are highly correlated with one another, meaning that a change in one parameter will almost certainly result in changes in the other parameters.

The bowl-shaped nature of the χ² objective function is shown in Figure 2(a). This shape is nearly quadratic and has a single minimum. The correlation of parameters p_2 and p_4, for example, is easily seen from this figure.

The convergence of the parameters and the evolution of χ² and λ are shown in Figure 2(b). The parameters converge monotonically to their final values.

The data points, the curve fit, and the curve-fit confidence band are plotted in Figure 2(c). Note that the standard error of the fit approaches zero at t = 0 and is largest at the largest value of t. This is because ŷ(0; p) = 0, regardless of the values in p.

A histogram of the difference between the data values and the curve-fit is shown in Figure 2(d). Ideally these curve-fit errors should be normally distributed, and they appear to be so in this example.

Table 3. Parameter values and standard errors.
          p_initial    p_true    p_fit    σ_p    σ_p / p_fit (%)

Table 4. Parameter correlation matrix.
          p_1    p_2    p_3    p_4

Figure 2. (a) The sum of the squared errors as a function of p_2 and p_4. (b) Top: the convergence of the parameters with each iteration; Bottom: values of χ² and λ at each iteration. (c) Top: data y, curve-fit ŷ(t; p_fit), curve-fit + error, and curve-fit − error; Bottom: standard error of the fit, σ_ŷ(t). (d) Histogram of the errors between the data and the fit.

4.4.3 Example 3

Consider fitting the following function to a set of measured data:

    ŷ(t; p) = p_1 exp(−t/p_2) + p_3 sin(t/p_4) .                               (28)

This function is nonlinear in the parameters p_2 and p_4. The m-function to be used with lm.m is simply:

function y_hat = lm_func(t,p,c)
y_hat = p(1)*exp(-t/p(2)) + p(3)*sin(t/p(4));

The true parameter values p_true, the initial parameter values p_initial, the resulting curve-fit parameter values p_fit, and the standard errors of the fit parameters σ_p are shown in Table 5. The R² fit criterion is 99.8 percent. In this example, the standard parameter errors are all less than one percent. The parameter correlation matrix is given in Table 6. Parameter p_4 is not correlated with the other parameters. Parameters p_1 and p_2 are the most correlated, at 73 percent.

The bowl-shaped nature of the χ² objective function is shown in Figure 3(a). This shape is clearly not quadratic and has multiple minima. In this example, the initial guess for parameter p_4, the period of the oscillatory component, has to be within ten percent of the true value; otherwise the algorithm in lm.m will converge to a very small value of the amplitude of oscillation p_3 and an erroneous value for p_4. When such an occurrence arises, the standard errors σ_p of the fit parameters p_3 and p_4 are quite large and the histogram of curve-fit errors (Figure 3(d)) is not normally distributed.

The convergence of the parameters and the evolution of χ² and λ are shown in Figure 3(b). The parameters converge monotonically to their final values.

The data points, the curve fit, and the curve-fit confidence band are plotted in Figure 3(c). A histogram of the difference between the data values and the curve-fit is shown in Figure 3(d). Ideally these curve-fit errors should be normally distributed, and they appear to be so in this example.

15 Table. Parameter values and standard errors. p initial p true p fit σ p σ p /p fit (%) Table 6. Parameter correlation matrix. p 1 p 2 p 3 p 4 p p p p (a) log (χ 2 ) 1.8 y(t) p p 2 y data y fit y fit +1.96σ y y fit -1.96σ y (b) parameter values χ 2 and λ count χ 2 λ function calls histogram of residuals σ y (t) (c) t (d) y data - y fit Figure 3. (a) The sum of the squared errors as a function of p 2 and p 4. (b) Top: the convergence of the parameters with each iteration, (b) Bottom: values of χ 2 and λ each iteration. (c) Top: data y, curve-fit ŷ(t; p fit ), curve-fit+error, and curve-fit-error; (c) Bottom: standard error of the fit, σŷ(t). (d) Histogram of the errors between the data and the fit. 1

4.5 Fitting in Multiple Dimensions

The code lm.m can carry out fitting in multiple dimensions. For example, the function

    ẑ(x, y) = ( p_1 x^{p_2} + (1 − p_1) y^{p_2} )^{1/p_2}

may be fit to data points z_i(x_i, y_i), (i = 1, ..., m), using lm.m with an m-file such as

my_data = load('my_data_file');    % load the data
x_dat = my_data(:,1);              % if the independent variable x is in column 1
y_dat = my_data(:,2);              % if the independent variable y is in column 2
z_dat = my_data(:,3);              % if the dependent variable z is in column 3

p_min  = [ 0.1  0.1 ];             % minimum expected parameter values
p_max  = [ 0.9  2.0 ];             % maximum expected parameter values
p_init = [ 0.5  1.0 ];             % initial guess for parameter values

t = [ x_dat y_dat ];               % x and y are column vectors of independent variables

[ p_fit, Chi_sq, sigma_p, sigma_y, corr, R2, cvg_hst ] = ...
    lm ( 'lm_func2d', p_init, t, z_dat, weight, 0.1, p_min, p_max );

with the m-function lm_func2d.m

function z_hat = lm_func2d(t,p)
% example function used for nonlinear least squares curve-fitting
% to demonstrate the Levenberg-Marquardt function, lm.m,
% in two fitting dimensions

x_dat = t(:,1);
y_dat = t(:,2);
z_hat = ( p(1)*x_dat.^p(2) + (1-p(1))*y_dat.^p(2) ).^(1/p(2));

Remarks

This text and lm.m were written in an attempt to understand and explain methods of nonlinear least squares for curve-fitting applications.

It is important to remember that the purpose of applying statistical methods for data analysis is primarily to estimate parameter statistics... not only the parameter values themselves. Reasonable parameter values can often be found using non-statistical minimization algorithms, such as random search methods, the Nelder-Mead simplex method, or simply gridding the parameter space and finding the best combination of parameter values.

Nonlinear least squares problems can have objective functions with multiple local minima. Fitting algorithms will converge to different local minima depending upon values of the initial guess, the measurement noise, and algorithmic parameters. It is perfectly appropriate and good to use the best available estimate of the desired parameters as the initial guess. In the absence of physical insight into a curve-fitting problem, a reasonable initial guess may be found by coarsely gridding the parameter space and finding the best combination of parameter values. There is no sense in forcing any statistical curve-fitting algorithm to work too hard by starting it with a poor initial guess. In most applications, parameters identified from neighboring initial guesses (±5%) should converge to similar parameter estimates (±0.1%). The fit statistics of these converged parameters should be the same, within two or three significant figures.

References

[1] A. Björck, Numerical Methods for Least Squares Problems, SIAM, Philadelphia, 1996.
[2] K. Levenberg, "A Method for the Solution of Certain Non-Linear Problems in Least Squares," The Quarterly of Applied Mathematics, 2:164-168, 1944.
[3] M.I.A. Lourakis, "A brief description of the Levenberg-Marquardt algorithm implemented by levmar," Technical Report, Institute of Computer Science, Foundation for Research and Technology - Hellas, 2005.
[4] K. Madsen, H.B. Nielsen, and O. Tingleff, Methods for Nonlinear Least Squares Problems, Technical Report, Informatics and Mathematical Modeling, Technical University of Denmark, 2004.
[5] D.W. Marquardt, "An algorithm for least-squares estimation of nonlinear parameters," Journal of the Society for Industrial and Applied Mathematics, 11(2):431-441, 1963.
[6] H.B. Nielsen, "Damping Parameter in Marquardt's Method," Technical Report IMM-REP-1999-05, Dept. of Mathematical Modeling, Technical University of Denmark, 1999.
[7] W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery, Numerical Recipes in C, Cambridge University Press, second edition, 1992.
[8] M.K. Transtrum, B.B. Machta, and J.P. Sethna, "Why are nonlinear fits to data so challenging?," Phys. Rev. Lett. 104, 060201 (2010).
[9] M.K. Transtrum and J.P. Sethna, "Improvements to the Levenberg-Marquardt algorithm for nonlinear least-squares minimization," preprint submitted to Journal of Computational Physics, January 30, 2012.


More information

VHF Dipole effects on P-Band Beam Characteristics

VHF Dipole effects on P-Band Beam Characteristics VHF Dipole effects on P-Band Beam Characteristics D. A. Mitchell, L. J. Greenhill, C. Carilli, R. A. Perley January 7, 1 Overview To investigate any adverse effects on VLA P-band performance due the presence

More information

(One Dimension) Problem: for a function f(x), find x 0 such that f(x 0 ) = 0. f(x)

(One Dimension) Problem: for a function f(x), find x 0 such that f(x 0 ) = 0. f(x) Solving Nonlinear Equations & Optimization One Dimension Problem: or a unction, ind 0 such that 0 = 0. 0 One Root: The Bisection Method This one s guaranteed to converge at least to a singularity, i not

More information

DUAL REGULARIZED TOTAL LEAST SQUARES SOLUTION FROM TWO-PARAMETER TRUST-REGION ALGORITHM. Geunseop Lee

DUAL REGULARIZED TOTAL LEAST SQUARES SOLUTION FROM TWO-PARAMETER TRUST-REGION ALGORITHM. Geunseop Lee J. Korean Math. Soc. 0 (0), No. 0, pp. 1 0 https://doi.org/10.4134/jkms.j160152 pissn: 0304-9914 / eissn: 2234-3008 DUAL REGULARIZED TOTAL LEAST SQUARES SOLUTION FROM TWO-PARAMETER TRUST-REGION ALGORITHM

More information

Neural Networks Lecture 3:Multi-Layer Perceptron

Neural Networks Lecture 3:Multi-Layer Perceptron Neural Networks Lecture 3:Multi-Layer Perceptron H.A Talebi Farzaneh Abdollahi Department of Electrical Engineering Amirkabir University of Technology Winter 2011 H. A. Talebi, Farzaneh Abdollahi Neural

More information

Filtering via Rank-Reduced Hankel Matrix CEE 690, ME 555 System Identification Fall, 2013 c H.P. Gavin, September 24, 2013

Filtering via Rank-Reduced Hankel Matrix CEE 690, ME 555 System Identification Fall, 2013 c H.P. Gavin, September 24, 2013 Filtering via Rank-Reduced Hankel Matrix CEE 690, ME 555 System Identification Fall, 2013 c H.P. Gavin, September 24, 2013 This method goes by the terms Structured Total Least Squares and Singular Spectrum

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Matrix Notation Mark Schmidt University of British Columbia Winter 2017 Admin Auditting/registration forms: Submit them at end of class, pick them up end of next class. I need

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Exploring the energy landscape

Exploring the energy landscape Exploring the energy landscape ChE210D Today's lecture: what are general features of the potential energy surface and how can we locate and characterize minima on it Derivatives of the potential energy

More information

A Sobolev trust-region method for numerical solution of the Ginz

A Sobolev trust-region method for numerical solution of the Ginz A Sobolev trust-region method for numerical solution of the Ginzburg-Landau equations Robert J. Renka Parimah Kazemi Department of Computer Science & Engineering University of North Texas June 6, 2012

More information

Homework #2 Due Monday, April 18, 2012

Homework #2 Due Monday, April 18, 2012 12.540 Homework #2 Due Monday, April 18, 2012 Matlab solution codes are given in HW02_2012.m This code uses cells and contains the solutions to all the questions. Question 1: Non-linear estimation problem

More information

Nonlinear Regression. Summary. Sample StatFolio: nonlinear reg.sgp

Nonlinear Regression. Summary. Sample StatFolio: nonlinear reg.sgp Nonlinear Regression Summary... 1 Analysis Summary... 4 Plot of Fitted Model... 6 Response Surface Plots... 7 Analysis Options... 10 Reports... 11 Correlation Matrix... 12 Observed versus Predicted...

More information

SECTION 7: CURVE FITTING. MAE 4020/5020 Numerical Methods with MATLAB

SECTION 7: CURVE FITTING. MAE 4020/5020 Numerical Methods with MATLAB SECTION 7: CURVE FITTING MAE 4020/5020 Numerical Methods with MATLAB 2 Introduction Curve Fitting 3 Often have data,, that is a function of some independent variable,, but the underlying relationship is

More information

Lecture 7 Unconstrained nonlinear programming

Lecture 7 Unconstrained nonlinear programming Lecture 7 Unconstrained nonlinear programming Weinan E 1,2 and Tiejun Li 2 1 Department of Mathematics, Princeton University, weinan@princeton.edu 2 School of Mathematical Sciences, Peking University,

More information

Linear Regression (9/11/13)

Linear Regression (9/11/13) STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter

More information

The Chi-squared Distribution of the Regularized Least Squares Functional for Regularization Parameter Estimation

The Chi-squared Distribution of the Regularized Least Squares Functional for Regularization Parameter Estimation The Chi-squared Distribution of the Regularized Least Squares Functional for Regularization Parameter Estimation Rosemary Renaut Collaborators: Jodi Mead and Iveta Hnetynkova DEPARTMENT OF MATHEMATICS

More information

Lecture 17: Numerical Optimization October 2014

Lecture 17: Numerical Optimization October 2014 Lecture 17: Numerical Optimization 36-350 22 October 2014 Agenda Basics of optimization Gradient descent Newton s method Curve-fitting R: optim, nls Reading: Recipes 13.1 and 13.2 in The R Cookbook Optional

More information