SEPARABLE LEAST SQUARES, VARIABLE PROJECTION, AND THE GAUSS-NEWTON ALGORITHM


M. R. OSBORNE
Mathematical Sciences Institute, Australian National University, ACT 0200, Australia
(For Gene, who introduced me to this problem more than thirty years ago.)

Key words. nonlinear least squares, scoring, Newton's method, expected Hessian, Kaufman's modification, rate of convergence, random errors, law of large numbers, consistency, large data sets, maximum likelihood

AMS subject classifications. 62-07, 65K99, 90-08

Abstract. A regression problem is separable if the model can be represented as a linear combination of functions which have a nonlinear parametric dependence. The Gauss-Newton algorithm is a method for minimizing the residual sum of squares in such problems. It is known to be effective both when residuals are small, and when measurement errors are additive and the data set is large. The large data set result, that the iteration asymptotes to a second order rate as the data set size becomes unbounded, is sketched here. Variable projection is a technique introduced by Golub and Pereyra for reducing the separable estimation problem to one of minimizing a sum of squares in the nonlinear parameters only. The application of Gauss-Newton to minimize this sum of squares (the RGN algorithm) is known to be effective in small residual problems. The main result presented is that the RGN algorithm shares the good convergence rate behaviour of the Gauss-Newton algorithm on large data sets even though the errors are no longer additive. A modification of the RGN algorithm due to Kaufman, which aims to reduce its computational cost, is shown to produce iterates which are almost identical to those of the Gauss-Newton algorithm on the original problem. Aspects of the question of which algorithm is preferable are discussed briefly, and an example is used to illustrate the importance of the large data set behaviour.

1. Introduction. The Gauss-Newton algorithm is a modification of Newton's method for minimization developed for the particular case when the objective function can be written as a sum of squares. It has a cost advantage in that it avoids the calculation of second derivative terms in estimating the Hessian. Other advantages possessed by the modified algorithm are that its Hessian estimate is generically positive definite, and that it actually has better transformation invariance properties than those possessed by the original algorithm. It has the disadvantage that it has a generic first order rate of convergence. This can make the method unsuitable except in two important cases:
1. The case of small residuals. This occurs when the individual terms in the sum of squares can be made small simultaneously so that the associated nonlinear system is consistent or nearly so.
2. The case of large data sets. An important application of the Gauss-Newton algorithm is to parameter estimation problems in data analysis. Nonlinear least squares problems occur in maximizing likelihoods based on the normal distribution. Here Gauss-Newton is a special case of the Fisher scoring algorithm [6]. In appropriate circumstances this asymptotes to a second order convergence rate as the number of independent observations in the data set becomes unbounded.

The large data set problem is emphasised here. This seeks to estimate the true parameter vector x* ∈ R^p by solving the optimization problem

    min_x F_n(x, ε),   (1.1)

where

    F_n(x, ε) = (1/2n) ‖f_n(x, ε)‖²,   (1.2)

f_n ∈ R^n is a vector of smooth enough functions f_i(x, ε), i = 1, 2, ..., n, ∇f_n has full column rank p in the region of parameter space of interest, and ε ∈ R^n ∼ N(0, σ²I) plays the role of observational error. The norm is assumed to be the Euclidean vector norm unless otherwise specified. It is assumed that the measurement process that generated the data set can be conceptualised for arbitrarily large n, and that the estimation problem is consistent in the sense that there exists a sequence {x̂_n} of local minimisers of (1.1) such that x̂_n → x* a.s., n → ∞. Here the mode of convergence is almost sure convergence. A good reference on asymptotic methods in statistics is [12].

Remark 1.1. A key point is that the errors are assumed to enter the model additively. That is, the f_i, i = 1, 2, ..., n, have the functional form

    f_i(x, ε) = y_i − µ_i(x),   (1.3)

where, corresponding to the case of observations made on a signal in the presence of noise,

    y_i = µ_i(x*) + ε_i.   (1.4)

Thus differentiation of f_i removes the random component. Also F_n is directly proportional to the problem log likelihood, and the property of consistency becomes a consequence of the other assumptions.

In a number of special cases there is additional structure in f_n, so it becomes a legitimate question to ask if this can be used to advantage. A nonlinear regression model is called separable if the problem residuals b_i can be represented in the form

    b_i(α, β, ε) = y_i − Σ_{j=1}^{m} φ_ij(β) α_j,  i = 1, 2, ..., n.   (1.5)

Here the model has the form of a linear combination, expressed by α ∈ R^m, of nonlinear functions φ_ij(β), β ∈ R^p. The modified notation

    f_i(x, ε) → b_i(α, β, ε),  µ_i → Σ_{j=1}^{m} φ_ij(β) α_j,

is used here to make this structure explicit. It is assumed that the problem functions are φ_ij(β) = φ_j(t_i, β), j = 1, 2, ..., m, where the t_i, i = 1, 2, ..., n, are sample points where observations on the underlying signal are made. There is no restriction in assuming t_i ∈ [0, 1]. One source of examples is provided by general solutions of the m'th order linear ordinary differential equation with fundamental solutions given by the φ_i(t, β).
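To make the structure of (1.5) concrete, the following short NumPy sketch builds the matrix Φ(β) and simulates data according to (1.4) for an illustrative sum-of-exponentials model; the function name build_phi, the particular choice of φ_j, and the parameter values are assumptions made for illustration only and are not part of the original development.

```python
import numpy as np

def build_phi(beta, t):
    """Design matrix Phi(beta) with entries phi_ij(beta) = phi_j(t_i, beta).

    Illustrative choice: phi_j(t, beta) = exp(-beta_j * t), so the fitted
    signal is sum_j alpha_j * exp(-beta_j * t), a separable model of form (1.5).
    """
    t = np.asarray(t)
    return np.exp(-np.outer(t, beta))          # n x m, column j = phi_j(t, beta)

# Simulated observations following (1.4): y_i = mu_i(x*) + eps_i.
rng = np.random.default_rng(0)
n = 256
t = np.linspace(0.0, 1.0, n)                   # sample points t_i in [0, 1]
alpha_true = np.array([2.0, -1.0])             # linear parameters, alpha in R^m
beta_true = np.array([3.0, 10.0])              # nonlinear parameters, beta in R^p
sigma = 0.05
y = build_phi(beta_true, t) @ alpha_true + sigma * rng.standard_normal(n)
```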

In [1] a systematic procedure (variable projection) is introduced for reducing the estimation problem to a nonlinear least squares problem in the nonlinear parameters only. A recent survey of developments and applications of variable projection is [2]. To introduce the technique let Φ(β) : R^m → R^n, n > m, be the matrix with components φ_ij(β). The rank assumption in the problem formulation now requires [Φ  ∇_β(Φα)] to have full column rank m + p. Also let P(β) : R^n → R^n be the orthogonal projection matrix defined by

    P Φ = 0.   (1.6)

Here P has the explicit representation

    P = I − Φ(Φ^T Φ)^{−1} Φ^T.   (1.7)

Then

    F_n = (1/2n) {‖P y‖² + ‖(I − P) b‖²}.   (1.8)

The first term on the right of this equation is independent of α and the second can be reduced to zero by setting

    α = α̂(β) = (Φ^T Φ)^{−1} Φ^T y.   (1.9)

Thus an equivalent formulation of (1.1) in the separable problem is

    min_β (1/2n) ‖P(β) y‖²,   (1.10)

which is a sum of squares in the nonlinear parameters only so that, at least formally, the Gauss-Newton algorithm can be applied. However, now the random errors do not enter additively but are coupled with the nonlinear parameters in setting up the objective function.
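As an illustration of the reduction (1.6)–(1.10), the sketch below evaluates P(β)y, the minimizing linear parameters α̂(β) of (1.9), and the reduced sum of squares (1.10) for a trial β. It reuses the assumed build_phi, t and y of the previous sketch; in practice a pivoted QR or SVD of Φ would be preferred to explicit normal equations, and the routine relies on a least squares solve for the same reason.

```python
import numpy as np

def varpro_pieces(beta, t, y, build_phi):
    """Return P(beta) y, alpha_hat(beta) of (1.9), and the reduced objective (1.10)."""
    Phi = build_phi(beta, t)                       # n x m, assumed full column rank
    n = Phi.shape[0]
    # alpha_hat = (Phi^T Phi)^{-1} Phi^T y, computed via a stable least squares solve.
    alpha_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    Py = y - Phi @ alpha_hat                       # P(beta) y: residual after projection
    reduced = 0.5 * (Py @ Py) / n                  # (1/2n) ||P(beta) y||^2
    return Py, alpha_hat, reduced

# Example call at a trial value of the nonlinear parameters:
# Py, a_hat, F_red = varpro_pieces(np.array([2.5, 9.0]), t, y, build_phi)
```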

The plan of the paper is as follows. The large data set rate of convergence analysis appropriate to the Gauss-Newton method in the case of additive errors is summarized in the next section. The third section shows why this analysis cannot immediately be extended to the RGN algorithm. Here the rather harder work needed to arrive at similar conclusions is summarised. Most implementations of the variable projection method use a modification due to Kaufman [4] which serves to reduce the amount of computation needed in the RGN algorithm. This modified algorithm also shows the favourable large data set rates despite being developed using an explicit small residual argument. However, it is actually closer to the additive Gauss-Newton method than is the full RGN algorithm. A brief discussion of which form of algorithm is appropriate in particular circumstances is given in the final section. This is complemented by an example of a classic data fitting problem which is used to illustrate the importance of the large sample convergence rate.

2. Large data set convergence rate analysis. The basic iterative step in Newton's method for minimizing F_n defined in (1.1) is

    x_{i+1} = x_i − J_i^{−1} ∇F_n(x_i),   (2.1)
    J_i = ∇²F_n(x_i).   (2.2)

In the case of additive errors the scoring/Gauss-Newton method replaces the Hessian with an approximation which is constructed as follows. The true Hessian is

    J_n = (1/n) { {∇f_n}^T {∇f_n} + Σ_{i=1}^{n} f_i ∇²f_i }.   (2.3)

The stochastic component enters only through (1.4), so taking expectations gives

    E{J_n} = I_n − (1/n) Σ_{i=1}^{n} (µ_i(x) − µ_i(x*)) ∇²f_i,   (2.4)

where

    I_n = (1/n) {∇f_n}^T {∇f_n}.   (2.5)

The Gauss-Newton method replaces J_n with I_n in (2.1). The key point to notice is

    I_n(x*) = E{J_n(x*)}.   (2.6)

Several points can be made here:
1. It follows from the special form of (2.5) that the Gauss-Newton correction x_{i+1} − x_i solves the linear least squares problem

    min_t ‖y − µ(x_i) − ∇µ(x_i) t‖²   (2.7)

(a short numerical sketch of this step is given after this list).
2. It is an important result, conditional on an appropriate experimental setup, that I_n(x) is generically a bounded, positive definite matrix for all n large enough [6]. A similar result is sketched in Lemma 3.2.
3. The use of the form of the expectation which holds at the true parameter values is a characteristic simplification of the scoring algorithm and is available for more general likelihoods [7]. Here it leads to the same result as ignoring small residual terms in (2.3).
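The sketch below carries out one full Gauss-Newton (scoring) step for the additive-error model by solving the linear least squares problem (2.7); the model and Jacobian evaluations mu and jac are assumed user-supplied routines, so this is an illustration of the step rather than a complete algorithm.

```python
import numpy as np

def gauss_newton_step(x, y, mu, jac):
    """One full Gauss-Newton (scoring) step for F_n(x) = (1/2n) ||y - mu(x)||^2.

    The correction t solves the linear least squares problem (2.7),
        min_t || y - mu(x) - grad_mu(x) t ||^2,
    i.e. the Hessian is approximated by I_n = (1/n) grad_mu(x)^T grad_mu(x).
    """
    r = y - mu(x)                                  # current residual vector
    J = jac(x)                                     # n x p matrix with rows grad mu_i(x)
    t, *_ = np.linalg.lstsq(J, r, rcond=None)      # solve min_t ||r - J t||
    return x + t
```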

The full-step Gauss-Newton method has the form of a fixed point iteration:

    x_{i+1} = Q(x_i),  Q(x) = x − I_n(x)^{−1} ∇F_n(x).   (2.8)

The condition for x̂_n to be an attractive fixed point is

    ϖ(Q′(x̂_n)) < 1,   (2.9)

where ϖ denotes the spectral radius of the variational matrix Q′. This quantity determines the first order convergence multiplier of the Gauss-Newton algorithm. The key to the good large sample behaviour is the result

    ϖ(Q′(x̂_n)) → 0 a.s.,  n → ∞,   (2.10)

which shows that the algorithm tends to a second order convergent process as n → ∞. The derivation of this result will now be outlined. As ∇F_n(x̂_n) = 0, it follows that

    Q′(x̂_n) = I_p − I_n(x̂_n)^{−1} ∇²F_n(x̂_n).

Now define W_n : R^p → R^{p×p} by

    W_n(x) = I_n(x)^{−1} {I_n(x) − ∇²F_n(x)}.   (2.11)

Then

    W_n(x̂_n) = Q′(x̂_n) = W_n(x*) + O(‖x̂_n − x*‖),   (2.12)

by consistency. By (2.6),

    W_n(x*) = −I_n(x*)^{−1} {∇²F_n(x*) − E{∇²F_n(x*)}}.   (2.13)

It has been noted that I_n(x*) is bounded, positive definite. Also, a factor n^{−1} is implicit in the second term of the right hand side of (2.13), and the components of n∇²F_n(x*) are sums of independent random variables. Thus it follows by an application of the law of large numbers [12] that W_n(x*) → 0 a.s. component-wise as n → ∞. An immediate consequence is that

    ϖ(W_n(x*)) → 0 a.s.,  n → ∞.   (2.14)

The desired convergence rate result (2.10) now follows from (2.12). Note that the property of consistency that derives from the maximum likelihood connection is an essential component of the argument. Also, that this is not a completely straightforward application of the law of large numbers, because a sequence of sets of observation points {t_i, i = 1, 2, ..., n} is involved. For this case see [13].

3. Rate estimation for separable problems. Variable projection leads to the nonlinear least squares problem (1.10) where

    f_n(β, ε) = P(β) y,   (3.1)
    F_n(β, ε) = (1/2n) y^T P(β) y.   (3.2)

Implementation of the Gauss-Newton algorithm for this problem (the RGN algorithm) has been discussed in detail in [11]. It uses an approximate Hessian computed from (2.5) and requires derivatives of P. The derivative of P in the direction defined by t ∈ R^p is

    P′[t] = −{P Φ′[t] Φ^+ + (Φ^+)^T Φ′[t]^T P}   (3.3)
          = A(β, t) + A(β, t)^T,   (3.4)

where A(β, t) = −P Φ′[t] Φ^+ ∈ R^{n×n}, the matrix directional derivative (dΦ/dβ)[t] is written Φ′[t] to emphasise both the linear dependence on t and that t is held fixed in this operation, explicit dependence on both β and n is understood, and Φ^+ denotes the generalised inverse of Φ. Note that Φ^+ P = Φ^+ − Φ^+ΦΦ^+ = 0, so the two components of P′[t] in (3.4) are orthogonal. Define matrices K, L : R^p → R^n by

    A(β, t) y = K(β, y) t,   (3.5)
    A(β, t)^T y = L(β, y) t.   (3.6)

Then the RGN correction solves

    min_t ‖P y + (K + L) t‖²,   (3.7)

where

    L^T K = 0   (3.8)

as a consequence of the orthogonality noted above.
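For illustration, the RGN correction (3.7) can be assembled directly from the definitions (3.3)–(3.6); the sketch also forms the correction obtained by dropping the L term, anticipating the Kaufman modification (3.9) discussed in Remark 3.1 below. The routine dphi, assumed to return ∂Φ/∂β_k, and the column-by-column construction of K and L are illustrative assumptions chosen for clarity, not the implementations of [11] or [4].

```python
import numpy as np

def rgn_and_kaufman_corrections(beta, t_pts, y, build_phi, dphi):
    """Corrections solving (3.7) (RGN) and (3.9) (L term dropped).

    dphi(beta, t_pts, k) is assumed to return the n x m matrix dPhi/dbeta_k.
    """
    Phi = build_phi(beta, t_pts)                   # n x m
    n, m = Phi.shape
    p = beta.size
    Phi_pinv = np.linalg.pinv(Phi)                 # generalised inverse Phi^+
    alpha_hat = Phi_pinv @ y                       # (1.9)
    Py = y - Phi @ alpha_hat                       # P(beta) y
    K = np.empty((n, p))
    L = np.empty((n, p))
    for k in range(p):
        dPhi_k = dphi(beta, t_pts, k)
        v = dPhi_k @ alpha_hat
        K[:, k] = -(v - Phi @ (Phi_pinv @ v))      # column of K: -P dPhi_k alpha_hat
        L[:, k] = -(Phi_pinv.T @ (dPhi_k.T @ Py))  # column of L: -(Phi^+)^T dPhi_k^T P y
    t_rgn, *_ = np.linalg.lstsq(K + L, -Py, rcond=None)   # min_t ||P y + (K + L) t||
    t_dropL, *_ = np.linalg.lstsq(K, -Py, rcond=None)     # min_t ||P y + K t||
    return t_rgn, t_dropL
```

The updated nonlinear parameters are then β + t, with α recomputed from (1.9) at the new β.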

Remark 3.1. Kaufman [4] has examined these terms in more detail. We have

    t^T K^T K t = y^T A^T A y = O(‖α‖²),
    t^T L^T L t = y^T A A^T y = O(‖P y‖²).

If the orthogonality noted above is used then the second term in the design matrix in (3.7) corresponds to a small residual term when ‖P y‖² is relatively small and can be ignored. The resulting correction solves

    min_t ‖P y + K t‖².   (3.9)

This modification was suggested by Kaufman. It can be implemented with less computational cost, and it is favoured for this reason. Numerical experience is reported to be very satisfactory [2].

The terms in the sum of squares in the reduced problem (1.10) are

    f_i = Σ_{j=1}^{n} P_ij y_j,  i = 1, 2, ..., n.   (3.10)

Now, because the noise ε is coupled with the nonlinear parameters and so does not disappear under differentiation, I_n is quadratic in the noise contributions. An immediate consequence is that

    I_n ≠ (1/n) E{∇f_n^T ∇f_n}.   (3.11)

Thus it is not possible to repeat exactly the rate of convergence calculation of the previous section. Instead it is convenient to rewrite equation (2.11):

    W_n = −{∇f_n^T ∇f_n}^{−1} Σ_{i=1}^{n} f_i ∇²f_i,   (3.12)

where the right hand side is evaluated at β* (here ∇ denotes the gradient with respect to β, written as a row vector). The property of consistency is unchanged so the asymptotic convergence rate is again determined by ϖ(W_n(β*)). We now examine this expression in more detail.

Lemma 3.2.

    (1/n) Φ^T Φ → G(β),  n → ∞,   (3.13)

where

    G(β)_{ij} = ∫_0^1 φ_i(t, β) φ_j(t, β) ρ(t) dt,  1 ≤ i, j ≤ m,

and the density ρ is determined by the asymptotic properties of the method for generating the sample points t_i, i = 1, 2, ..., n, for large n. The Gram matrix G is bounded and generically positive definite. Let Π = I − P. Then

    Π_ij = (1/n) φ̄_i^T G^{−1} φ̄_j + o(n^{−1}),   (3.14)

where φ̄_i = [φ_1(t_i, β) φ_2(t_i, β) ⋯ φ_m(t_i, β)]^T. This gives an O(n^{−1}) component-wise estimate which applies also to derivatives of both P and Π with respect to β.

Proof. The result (3.13) is discussed in detail in [6]. It follows from

    (1/n) (Φ^T Φ)_{ij} = (1/n) Σ_{k=1}^{n} φ_i(t_k) φ_j(t_k) = G_{ij} + O(n^{−1})

by interpreting the sum as a quadrature formula. Positive definiteness is a consequence of the problem rank assumption. To derive (3.14) note that

    Π = Φ(Φ^T Φ)^{−1} Φ^T = (1/n) Φ G^{−1} Φ^T + o(n^{−1}).

The starting point for determining the asymptotics of the convergence rate of the RGN algorithm as n → ∞ is the computation of the expectations of the numerator and denominator matrices in (3.12). The expectation of the denominator is bounded and generically positive definite. The expectation of the numerator is O(n^{−1}) as n → ∞. This suggests strongly that the spectral radius of Q′(β̂_n) → 0, n → ∞, a result of essentially similar strength to that obtained for the additive error case. To complete the proof requires showing that both numerator and denominator terms converge to their expectations with probability 1. Consider first the denominator term.

Lemma 3.3. Fix β = β*. Then

    E{(1/n) ∇f_n^T ∇f_n} = σ² M_1 + M_2,   (3.15)

where M_1 = O(n^{−1}), and M_2 tends to a limit which is a bounded, positive definite matrix when the problem rank assumption is satisfied. In detail, these matrices are

    M_1 = (1/n) Σ_{i=1}^{n} Σ_{j=1}^{n} ∇P_ij^T ∇P_ij,   (3.16)

    M_2 = (1/n) Σ_{j=1}^{n} ∇µ_j^T ∇µ_j − (1/n) Σ_{j=1}^{n} Σ_{k=1}^{n} Π_jk ∇µ_j^T ∇µ_k.   (3.17)

Proof. Set

    (1/n) ∇f_n^T ∇f_n = (1/n) Σ_{i=1}^{n} ∇f_i^T ∇f_i = (1/n) Σ_{i=1}^{n} (Σ_{j=1}^{n} ∇P_ij y_j)^T (Σ_{k=1}^{n} ∇P_ik y_k).   (3.18)

To calculate the expectation note that it follows from equation (1.4) that

    E{y_j y_k} = σ² δ_jk + µ_j µ_k,   (3.19)

where µ_j = e_j^T Φ(β*) α*. It follows that

    E{(1/n) ∇f_n^T ∇f_n} = (σ²/n) Σ_{i=1}^{n} Σ_{j=1}^{n} ∇P_ij^T ∇P_ij + (1/n) Σ_{i=1}^{n} Σ_{j=1}^{n} Σ_{k=1}^{n} µ_j µ_k ∇P_ij^T ∇P_ik
        = σ² M_1 + M_2.

To show M_1 → 0 is a counting exercise: M_1 consists of the sum of n² terms, each of which is a p × p matrix of O(1) gradient terms, divided by n³, as a consequence of Lemma 3.2. M_2 can be simplified somewhat by noting that Σ_{j=1}^{n} P_ij µ_j = 0 identically in β by (1.6), so that

    Σ_{j=1}^{n} µ_j ∇P_ij = −Σ_{j=1}^{n} P_ij ∇µ_j.

This gives, using the symmetry of P = I − Π,

    Σ_{i=1}^{n} Σ_{j=1}^{n} Σ_{k=1}^{n} µ_j µ_k ∇P_ij^T ∇P_ik = Σ_{i=1}^{n} Σ_{j=1}^{n} Σ_{k=1}^{n} P_ij P_ik ∇µ_j^T ∇µ_k
        = Σ_{j=1}^{n} Σ_{k=1}^{n} P_jk ∇µ_j^T ∇µ_k   (3.20)
        = Σ_{j=1}^{n} ∇µ_j^T ∇µ_j − Σ_{j=1}^{n} Σ_{k=1}^{n} Π_jk ∇µ_j^T ∇µ_k.

Boundedness of M_2 as n → ∞ now follows using the estimates for the size of the Π_ij computed in Lemma 3.2. To show that M_2 is positive definite note that it follows from (3.20) that

    t^T M_2 t = (1/n) dµ^T {I − Π} dµ ≥ 0,

where dµ = ∇µ t is the directional derivative of µ in the direction t. As n → ∞, this expression can vanish only if there is a direction t ∈ R^p such that dµ = γ µ for some γ ≠ 0. This requirement is contrary to the Gauss-Newton rank assumption that [Φ  ∇_β(Φα)] has full rank m + p.

Lemma 3.4. The numerator in the expression (3.12) defining W_n is

    (1/n) Σ_{i=1}^{n} f_i ∇²f_i = (1/n) Σ_{i=1}^{n} Σ_{j=1}^{n} Σ_{k=1}^{n} y_j y_k P_ij ∇²P_ik.   (3.21)

Let M_3 = E{(1/n) Σ_{i=1}^{n} f_i ∇²f_i}; then

    M_3 = (σ²/n) Σ_{i=1}^{n} {∇²P_ii − Σ_{j=1}^{n} Π_ij ∇²P_ij},   (3.22)

and M_3 → 0, n → ∞.

Proof. This is similar to that of Lemma 3.3. The new point is that the contribution to M_3 from the signal terms µ_j in the expectation (3.19) is

    (1/n) Σ_{i=1}^{n} Σ_{j=1}^{n} Σ_{k=1}^{n} µ_j µ_k P_ij ∇²P_ik = 0

by summing over j keeping i and k fixed. The previous counting argument can be used again to give the estimate M_3 = O(n^{−1}), n → ∞.

The final step required is to show that the numerator and denominator terms in (3.12) approach their expectations as n → ∞. Only the case of the denominator is considered here.

Lemma 3.5.

    (1/n) ∇f_n^T ∇f_n → M_2 a.s.,  n → ∞.   (3.23)

Proof. The basic quantities are:

    (1/n) ∇f_n^T ∇f_n = (1/n) Σ_{i=1}^{n} ∇f_i^T ∇f_i
        = (1/n) Σ_{i=1}^{n} (Σ_{j=1}^{n} ∇P_ij y_j)^T (Σ_{k=1}^{n} ∇P_ik y_k)
        = (1/n) Σ_{i=1}^{n} Σ_{j=1}^{n} Σ_{k=1}^{n} {µ_j µ_k + µ_j ε_k + µ_k ε_j + ε_j ε_k} ∇P_ij^T ∇P_ik.

The first of the three terms in this last expansion is M_2. Thus the result requires showing that the remaining terms tend to 0. Let

    π_i = Σ_{j=1}^{n} ε_j ∇P_ij^T,  π_i ∈ R^p.

As, by Lemma 3.2, the components of ∇P_ij are O(n^{−1}), it follows by applications of the law of large numbers that π_i → 0 a.s., n → ∞, componentwise. Specifically, given δ > 0, there is an n_0 such that, for all i and all n > n_0, ‖π_i‖_∞ < δ with probability 1. Consider the third term. Let

    S_n = (1/n) Σ_{i=1}^{n} Σ_{j=1}^{n} Σ_{k=1}^{n} ε_j ε_k ∇P_ij^T ∇P_ik = (1/n) Σ_{i=1}^{n} π_i π_i^T.

Then, in the maximum norm, with probability 1 for n > n_0,

    ‖S_n‖_∞ ≤ p δ²,

showing that the third sum tends to 0 almost surely. A similar argument applies to the second term, which proves to be O(δ). These results can now be put together to give the desired convergence result.

Theorem 3.6.

    W_n(β*) → 0 a.s.,  n → ∞.   (3.24)

Proof. The idea is to write each component term Ω in (3.12) in the form

    Ω = E{Ω} + (Ω − E{Ω}),

and then to appeal to the asymptotic convergence results established in the preceding lemmas.

Remark 3.7. This result when combined with consistency suffices to establish the analogue of (2.10) in this case. The asymptotic convergence rate of the RGN algorithm can be expected to be similar to that of the full Gauss-Newton method. While the numerator expectation in the Gauss-Newton method is 0, and that in the RGN algorithm is O(n^{−1}) by Lemma 3.4, these are both smaller than the discrepancies Ω − E{Ω} between their full expressions and their expectations. Thus it is these discrepancy terms that are critical in determining the convergence rates. Here these correspond to law of large numbers rates for which a scale of O(n^{−1/2}) is appropriate.

4. The Kaufman modification. As the RGN algorithm possesses similar convergence rate properties to Gauss-Newton in large sample problems, and, as the Kaufman modification is favoured in implementation, it is of interest to ask if it too shares the same good large sample convergence rate properties. Fortunately the answer is in the affirmative. This result can be proved in the same way as the main lemmas in the previous section. The calculation is similar to the preceding and is relegated to the Appendix. In this section the close connection between the modified algorithm and the full Gauss-Newton method is explored. That both can be implemented with the same amount of work is shown in [11]. First note that equation (2.7) for the Gauss-Newton correction here becomes

    min_{δα, δβ} ‖y − Φα − [Φ  ∇_β(Φα)] [δα; δβ]‖².   (4.1)

Introducing the variable projection matrix P permits this to be written:

    min_{δβ} ‖P(y − ∇_β(Φα) δβ)‖² + min_{δα} ‖(I − P)(y − ∇_β(Φα) δβ) − Φ(α + δα)‖².   (4.2)

Comparison with (3.3) shows that the first minimization is just

    min_{δβ} ‖P y + K δβ‖².   (4.3)

Thus, given α, the Kaufman search direction computed using (3.9) is exactly the Gauss-Newton correction for the nonlinear parameters. If α is set using (1.9) then the second minimization gives

    δα = −Φ^+ ∇_β(Φα) δβ
       = −Φ^+ Φ′[δβ] Φ^+ y,   (4.4)

while the increment in α arising from the Kaufman correction is

    α(β + δβ) − α(β) = ∇_β(Φ^+ y) δβ + O(‖δβ‖²).

Note this increment is not computed as part of the algorithm. To examine (4.4) in more detail we have

    dΦ^+ = −(Φ^T Φ)^{−1} (dΦ^T Φ + Φ^T dΦ)(Φ^T Φ)^{−1} Φ^T + (Φ^T Φ)^{−1} dΦ^T
         = −Φ^+ dΦ Φ^+ + (Φ^T Φ)^{−1} dΦ^T (I − ΦΦ^+)
         = (Φ^T Φ)^{−1} dΦ^T P − Φ^+ dΦ Φ^+.

The second term in this last equation occurs in (4.4). Thus, setting δβ = ‖δβ‖ t,

    α(β + δβ) − α(β) − δα = ‖δβ‖ (Φ^T Φ)^{−1} Φ′[t]^T P y + O(‖δβ‖²)
        = (‖δβ‖/n) G^{−1} Φ′[t]^T P (Φ(β*) α* + ε) + O(‖δβ‖²).

The magnitude of this resulting expression can be shown to be small almost surely compared with ‖δβ‖ when n is large enough, using the law of large numbers and consistency as before. The proximity of the increments in the linear parameters plus the identity of the calculation of the nonlinear parameter increments demonstrates the close alignment between the Kaufman and Gauss-Newton algorithms. The small residual result is discussed in [11].

5. Discussion. It has been shown that both of the variants of the Gauss-Newton algorithm considered possess similar convergence properties in large data set problems. However, that does not help resolve the question of the method of choice in any particular application. There is agreement that the Kaufman modification of the RGN algorithm has an advantage in being cheaper to compute, but it is not less expensive than the full Gauss-Newton algorithm [11]. Thus a choice between variable projection and Gauss-Newton must depend on other factors. These include flexibility, ease of use, and global behaviour. Flexibility tends to favour the full Gauss-Newton method because it can be applied directly to solve a range of maximum likelihood problems [7], so it has strong claims to be provided as a general purpose procedure. Ease of use is just about a draw. While Gauss-Newton requires starting values for both α and β, given β the obvious approach is to compute α by solving the linear least squares problem. Selecting between the methods on some a priori prediction of effectiveness appears much harder. It is argued in [2] that variable projection can take fewer iterations in important cases. There are two significant points to be made here.
1. Nonlinear approximation families need not be closed. Especially if the data is inadequate then the iterates generated by the full Gauss-Newton method may tend to a function in the closure of the family. In this case some parameter values will tend to ∞ and divergence is the correct answer. The nonlinear parameters can be bounded so it is possible for variable projection to yield a well determined answer. However, it still needs to be interpreted correctly. An example involving the Gauss-Newton method is discussed in [7].

[Fig. 5.1. No convergence: fit after 50 iterations, case σ = 4, n = 64.]

2. There is some evidence that strategies which eliminate the linear parameters in separable models can be spectacularly effective in exponential fitting problems with small numbers of variables [5], [9]. Similar behaviour has not been observed for rational fitting [8], which is also a separable regression problem. It seems there is something else going on in the exponential fitting case, as ill-conditioning of the computation of the linear parameters affects directly both the conditioning of the linear parameter correction in Gauss-Newton and the accuracy of the calculation of P in variable projection in both these classes of problems. It should be noted that maximum likelihood is not the way to estimate frequencies, which are just the nonlinear parameters in a closely related problem [10]. Some possible directions for developing modified algorithms are considered in [3].

The importance of large sample behaviour, and the need for appropriate instrumentation for data collection, are consequences of the result that maximum likelihood parameter estimates have the property that n^{1/2}(x̂_n − x*) is asymptotically normally distributed [12]. The effect of sample size on the convergence rate of the Gauss-Newton method is illustrated in Table 5.1 for an estimation problem involving fitting three Gaussian peaks plus an exponential background term. Such problems are common in scientific data analysis and are well enough conditioned if the peaks are reasonably distinct. In such cases it is relatively easy to set adequate initial parameter estimates. Here the chosen model is

    µ(x, t) = 5e^{−10t} + 8e^{−(t−0.25)²/0.05} + 5e^{−(t−0.5)²/0.03} + 10e^{−(t−0.75)²/0.05}.

Initial conditions are chosen such that there are random errors of up to 50% in the background parameters and peak heights, 2.5% in peak locations, and 25% in peak width parameters. Numbers of iterations are reported for an error process corresponding to a particular sequence of independent, normally distributed random numbers, standard deviations σ = 1, 2, 4, and equispaced sample points n = 64, 256, 1024, 4096, 16384.
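A sketch of the test signal just described, with equispaced sample points on [0, 1] and additive normal errors; the helper name peaks_model and the random seed are illustrative assumptions, and no attempt is made to reproduce the particular random sequences used for Table 5.1.

```python
import numpy as np

def peaks_model(x, t):
    """Exponential background plus three Gaussian peaks, as in Section 5:
    mu(x, t) = x[0] e^{-x[1] t} + sum_k h_k e^{-(t - c_k)^2 / w_k},
    with (h_k, c_k, w_k) read consecutively from x[2:]."""
    mu = x[0] * np.exp(-x[1] * t)
    for h, c, w in (x[2:5], x[5:8], x[8:11]):
        mu = mu + h * np.exp(-((t - c) ** 2) / w)
    return mu

# True parameters of the chosen model and one simulated data set.
x_true = np.array([5.0, 10.0,  8.0, 0.25, 0.05,  5.0, 0.5, 0.03,  10.0, 0.75, 0.05])
n, sigma = 1024, 2.0
t = np.linspace(0.0, 1.0, n)
rng = np.random.default_rng(1)
y = peaks_model(x_true, t) + sigma * rng.standard_normal(n)
```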

Table 5.1. Iteration counts for peak fitting with exponential background (nc = no convergence).

      n       σ = 1   σ = 2   σ = 4
      64        7       6      nc
      256      12      11      50
      1024      7       7       8
      4096      6       6       7
      16384     6       6       7

The most sensitive parameters prove to be those determining the exponential background, and they trigger the lack of convergence that occurred when σ = 4, n = 64. The apparent superior convergence behaviour in the n = 64 case over the n = 256 case for the smaller σ values can be explained by the sequence of random numbers generated producing more favourable residual values in the former case. The sequence used here corresponds to the first quarter of the sequence for n = 256. Plots for the fits obtained for σ = 4, n = 64 and σ = 4, n = 256 are given in Figure 5.1 and Figure 5.2 respectively. The difficulty with the background estimation in the former shows up in the sharp kink in the fitted red curve near t = 0. This figure gives the result after 50 iterations, when x_1 = 269 and x_2 = 327, so divergence of the background parameters is evident. However, the rest of the signal is being picked up pretty well. The quality of the signal representation suggests possible non-compactness, but the diverging parameters mix linear and nonlinear, making interpretation of the cancellation occurring difficult. A similar phenomenon is discussed in [7]. This involves linear parameters only, and it is easier to see what is going on. The problem is attributed to lack of adequate parameter information in the given data. The green curves give the fit obtained using the initial parameter values and are the same in both cases. These curves manage to hide the middle peak fairly well, so the overall fits obtained are quite satisfactory. The problem would be harder if the number of peaks was not known a priori.

[Fig. 5.2. Fit obtained: case σ = 4, n = 256.]

6. Appendix. The variational matrix whose spectral radius evaluated at β* determines the convergence rate of the Kaufman iteration is

    Q′ = I − ((1/n) K^T K)^{−1} ∇²F_n = −((1/n) K^T K)^{−1} {(1/n) Σ_{i=1}^{n} f_i ∇²f_i + (1/n) L^T L}.   (6.1)

It is possible here to draw on work already done to establish the key convergence rate result (2.10). Lemmas 3.3 and 3.5 describe the convergence behaviour of I_n = (1/n){K^T K + L^T L} as n → ∞. Here it proves to be possible to separate out the properties of the individual terms by making use of the orthogonality of K and L, once it has been shown that (1/n) E{L(β, ε)^T L(β, ε)} → 0, n → ∞. This calculation can proceed as follows.

Let t ∈ R^p. Then

    (1/n) E{t^T L^T L t} = (1/n) E{ε^T P Φ′[t] Φ^+ (Φ^+)^T Φ′[t]^T P ε}
        = (1/n) E{ε^T P Φ′[t] (Φ^T Φ)^{−1} Φ′[t]^T P ε}
        = (1/n²) trace{Φ′[t] G^{−1} Φ′[t]^T P E{εε^T} P} + smaller terms
        = (σ²/n²) trace{Φ′[t] G^{−1} Φ′[t]^T (I − Π)} + smaller terms.

This last expression breaks into two terms, one involving the unit matrix and the other involving the projection. Both lead to terms of the same order. The unit matrix term gives

    (σ²/n²) trace{Φ′[t] G^{−1} Φ′[t]^T} = (σ²/n²) Σ_{i=1}^{n} t^T Ψ_i G^{−1} Ψ_i^T t,

where

    (Ψ_i)_{jk} = ∂φ_ik/∂β_j,  Ψ_i : R^m → R^p.

It follows that

    (σ²/n²) Σ_{i=1}^{n} Ψ_i G^{−1} Ψ_i^T = O(n^{−1}),  n → ∞.

To complete the story note that the conclusion of Lemma 3.5 can be written

    (1/n){K^T K + L^T L} → E{(1/n)(K^T K + L^T L)} a.s.,  n → ∞.

If (1/n) K^T K is bounded, positive definite then, using the orthogonality (3.8),

    (1/n) K^T K {(1/n) K^T K}^{−1} E{(1/n) K^T K} → (1/n) K^T K − (1/n) K^T K {(1/n) K^T K}^{−1} E{(1/n) L^T L} a.s.,  n → ∞.

This shows that (1/n) K^T K tends almost surely to its expectation provided it is bounded, positive definite for n large enough, and so can be cancelled on both sides in the above expression. Note first that the linear parameters cannot upset boundedness:

    α̂ = (Φ^T Φ)^{−1} Φ^T y = α* + (1/n) G^{−1} (I + O(n^{−1})) Φ^T ε = α* + δ_n,  δ_n = o(1),   (6.2)

where α* is the true vector of linear parameters. Positive definiteness follows from

    (1/n) t^T K^T K t = (1/n) α^T Φ′[t]^T P Φ′[t] α = (1/n) ‖P Φ′[t] α‖² ≥ 0.

Equality can hold only if there is t such that Φ′[t] α = γ Φ α. This condition was met also in Lemma 3.3.

REFERENCES

[1] G. Golub and V. Pereyra, The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate, SIAM J. Numer. Anal., 10 (1973), pp. 413–432.
[2] G. Golub and V. Pereyra, Separable nonlinear least squares: the variable projection method and its applications, Inverse Problems, 19 (2003), pp. R1–R26.
[3] M. Kahn, M. Mackisack, M. Osborne, and G. Smyth, On the consistency of Prony's method and related algorithms, J. Comput. Graph. Statist., 1 (1992), pp. 329–350.
[4] L. Kaufman, Variable projection method for solving separable nonlinear least squares problems, BIT, 15 (1975), pp. 49–57.
[5] M. Osborne, Some special nonlinear least squares problems, SIAM J. Numer. Anal., 12 (1975), pp. 119–138.
[6] M. Osborne, Fisher's method of scoring, Internat. Statist. Rev., 86 (1992), pp. 271–286.
[7] M. Osborne, Least squares methods in maximum likelihood problems, Optim. Methods Softw., 21 (2006), pp. 943–959.
[8] M. Osborne and G. Smyth, A modified Prony algorithm for fitting functions defined by difference equations, SIAM J. Sci. and Statist. Comput., 12 (1991), pp. 362–382.
[9] M. Osborne and G. Smyth, A modified Prony algorithm for exponential fitting, SIAM J. Sci. Comput., 16 (1995), pp. 119–138.
[10] B. Quinn and E. Hannan, The Estimation and Tracking of Frequency, Cambridge University Press, Cambridge, United Kingdom, 2001.
[11] A. Ruhe and P. Wedin, Algorithms for separable nonlinear least squares problems, SIAM Rev., 22 (1980), pp. 318–337.
[12] K. Sen and J. Singer, Large Sample Methods in Statistics, Chapman and Hall, New York, 1993.
[13] W. Stout, Almost Sure Convergence, Academic Press, New York, 1974.