7: Estimation: The Least-Squares Method


1 Generalities

1.1 Outline of the method

Suppose that a data sample consists of $N$ pairs of points $\{x_i, y_i\}$; the $x_i$ are known exactly, while the $y_i$ have been measured with resolution $\sigma_i$, with independent errors. Let us first assume that the errors are Gaussian, and that $y$ is given by a theoretical function $y = f(x; a_1, \ldots, a_p)$. Thus the individual probability for point $i$ is

$$P(y_i; a) = \frac{1}{\sigma_i\sqrt{2\pi}}\, e^{-(y_i - f(x_i, a))^2 / 2\sigma_i^2}$$

and the logarithm of the likelihood is:

$$\log L = -\frac{1}{2}\sum_i \left(\frac{y_i - f(x_i, a)}{\sigma_i}\right)^2 - \sum_i \log\!\left(\sqrt{2\pi}\,\sigma_i\right)$$

The maximum likelihood method consists of minimizing the quantity $-\log L$, which corresponds to minimizing

$$\chi^2 = \sum_i \left[\frac{y_i - f(x_i, a)}{\sigma_i}\right]^2$$

with respect to each of the $p$ parameters $a_k$:

$$\frac{\partial}{\partial a_k} \sum_i \frac{\left[y_i - f(x_i, a)\right]^2}{\sigma_i^2} = 0$$

The use of the $\chi^2$ notation in the above definition will be justified further on: it will be shown that, for the case of errors indeed distributed according to a normal law, $\chi^2$ is distributed according to the $\chi^2$ distribution.

1.2 Least Squares Method when errors are not Gaussian

In the general case, where no assumption can be made on the distribution of the errors, the method is still the same, i.e. it consists of minimizing the quantity

$$X^2 = \sum_i \left[\frac{y_i - f(x_i, a)}{\sigma_i}\right]^2$$

The notation $X^2$ instead of $\chi^2$ is a reminder that this sum is not distributed according to the $\chi^2$ distribution. Quite frequently, when the observations are known to be of different accuracy, one cannot claim a precise knowledge of the errors. Instead one must estimate the error from each individual measurement. Suppose for example that the measurement $y_i$ is the number of events in a class. One could use the approximation $\sigma_i^2 = y_i$, equivalent to considering the true $f_i$ as the mean and variance of a Poisson distribution. In such a case, instead of the rigorous formula

$$X^2 = \sum_i \frac{\left[y_i - f(x_i, a)\right]^2}{f_i}$$

one could use the simplified LS estimator which consists of minimizing the quantity

$$X^2 \simeq \sum_i \frac{\left[y_i - f(x_i, a)\right]^2}{y_i}$$

One should also note that the LS estimation method makes no requirement about the distributional properties of the observables. In this sense, it is distribution-free.

1.3 Least Squares Method when measurements are correlated

If the observations are correlated, with errors and covariance terms given in the covariance matrix $V = V(y)$, the Least Squares principle for estimating the parameters is formulated as

$$X^2 = \sum_{i=1}^{N}\sum_{j=1}^{N} (y_i - f_i)\, V^{-1}_{ij}\, (y_j - f_j) = \text{minimum}$$

2 The most classic example: the straight line fit

2.1 Estimators of the slope and the y-intercept

Assume that you want to fit your data points to the function $y = ax + b$. The $\chi^2$ and its partial derivatives with respect to $a$ and $b$ can be written:

$$\chi^2 = \sum_i \frac{(y_i - ax_i - b)^2}{\sigma_i^2}$$

$$\frac{\partial \chi^2}{\partial a} = -2\sum_i \frac{x_i\,(y_i - ax_i - b)}{\sigma_i^2}, \qquad \frac{\partial \chi^2}{\partial b} = -2\sum_i \frac{y_i - ax_i - b}{\sigma_i^2}$$

One has to solve the system of equations:

$$0 = \sum_i \frac{y_i}{\sigma_i^2} - \hat a \sum_i \frac{x_i}{\sigma_i^2} - \hat b \sum_i \frac{1}{\sigma_i^2}$$

$$0 = \sum_i \frac{x_i y_i}{\sigma_i^2} - \hat a \sum_i \frac{x_i^2}{\sigma_i^2} - \hat b \sum_i \frac{x_i}{\sigma_i^2}$$

Introducing the weighted-average ("tilde") notation, with weights $w_i = 1/\sigma_i^2$:

$$\tilde x = \frac{\sum_i x_i/\sigma_i^2}{\sum_i 1/\sigma_i^2} = \frac{\sum_i w_i x_i}{\sum_i w_i}, \qquad \tilde y = \frac{\sum_i y_i/\sigma_i^2}{\sum_i 1/\sigma_i^2} = \frac{\sum_i w_i y_i}{\sum_i w_i}$$

$$\widetilde{x^2} = \frac{\sum_i x_i^2/\sigma_i^2}{\sum_i 1/\sigma_i^2} = \frac{\sum_i w_i x_i^2}{\sum_i w_i}, \qquad \widetilde{xy} = \frac{\sum_i x_i y_i/\sigma_i^2}{\sum_i 1/\sigma_i^2} = \frac{\sum_i w_i x_i y_i}{\sum_i w_i}$$

This can be rewritten as:

$$0 = \tilde y - \hat a\,\tilde x - \hat b$$

$$0 = \widetilde{xy} - \hat a\,\widetilde{x^2} - \hat b\,\tilde x$$

$$\hat a = \frac{\widetilde{xy} - \tilde x\,\tilde y}{\widetilde{x^2} - \tilde x^2}, \qquad \hat b = \tilde y - \hat a\,\tilde x = \frac{\tilde y\,\widetilde{x^2} - \tilde x\,\widetilde{xy}}{\widetilde{x^2} - \tilde x^2}$$

Note that in the simple case where all the $\sigma_i$ are equal to a common uncertainty $\sigma$, the tilde notation reduces to plain averages: $\tilde x = \bar x$, $\tilde y = \bar y$, $\widetilde{y^2} = \overline{y^2}$, $\widetilde{xy} = \overline{xy}$, ...

2.2 Parameter covariances in the case of the straight line fit

Since least squares is nothing but a special case of the Maximum Likelihood method, one can use the formalism of the previous chapter. Recall that the inverse covariance matrix of the parameters $a_i$ is:

$$\mathrm{cov}^{-1}(a_i, a_j) = \left\langle \frac{\partial \log L}{\partial a_i}\,\frac{\partial \log L}{\partial a_j} \right\rangle = -\left\langle \frac{\partial^2 \log L}{\partial a_i\,\partial a_j} \right\rangle$$

Therefore one can write, using the notation $w_i = 1/\sigma_i^2$:

$$\log L = -\frac{1}{2}\sum_i w_i\,(y_i - ax_i - b)^2$$

$$\frac{\partial \log L}{\partial a} = \sum_i w_i\, x_i\,(y_i - ax_i - b), \qquad \frac{\partial \log L}{\partial b} = \sum_i w_i\,(y_i - ax_i - b)$$

$$-\frac{\partial^2 \log L}{\partial a^2} = \sum_i w_i\, x_i^2, \qquad -\frac{\partial^2 \log L}{\partial b^2} = \sum_i w_i, \qquad -\frac{\partial^2 \log L}{\partial a\,\partial b} = \sum_i w_i\, x_i$$

Thus the covariance matrix $V_{ab}$ is given by

$$V_{ab}^{-1} = \left(\sum_i w_i\right)\begin{pmatrix} \widetilde{x^2} & \tilde x \\ \tilde x & 1 \end{pmatrix}, \qquad V_{ab} = \frac{1}{\left(\sum_i w_i\right)\left(\widetilde{x^2} - \tilde x^2\right)}\begin{pmatrix} 1 & -\tilde x \\ -\tilde x & \widetilde{x^2} \end{pmatrix}$$

This can be rewritten as:

$$\sigma^2(\hat a) = \frac{1}{\sum_i w_i\,\left(\widetilde{x^2} - \tilde x^2\right)}$$

$$\sigma^2(\hat b) = \frac{\widetilde{x^2}}{\sum_i w_i\,\left(\widetilde{x^2} - \tilde x^2\right)}, \qquad \mathrm{cov}(\hat a, \hat b) = \frac{-\tilde x}{\sum_i w_i\,\left(\widetilde{x^2} - \tilde x^2\right)}, \qquad \rho_{\hat a \hat b} = \frac{-\tilde x}{\sqrt{\widetilde{x^2}}}$$

And in the case of all the $\sigma_i$ equal to a common $\sigma$, so that $\sum_i w_i = N/\sigma^2$:

$$\sigma^2(\hat a) = \frac{\sigma^2}{N\left(\overline{x^2} - \bar x^2\right)}, \qquad \sigma^2(\hat b) = \frac{\sigma^2\,\overline{x^2}}{N\left(\overline{x^2} - \bar x^2\right)}, \qquad \mathrm{cov}(\hat a, \hat b) = \frac{-\sigma^2\,\bar x}{N\left(\overline{x^2} - \bar x^2\right)}, \qquad \rho_{\hat a \hat b} = \frac{-\bar x}{\sqrt{\overline{x^2}}}$$

2.3 χ² in the case of the straight line fit

In the case of a straight line fit, the actual $\chi^2$ can be evaluated even without actually computing $\hat a$ and $\hat b$. The proof is lengthy, but instructive:

$$\chi^2 = \sum_i w_i\,(y_i - \hat a x_i - \hat b)^2 = \sum_i w_i y_i^2 + \hat a^2 \sum_i w_i x_i^2 + \hat b^2 \sum_i w_i - 2\hat a \sum_i w_i x_i y_i - 2\hat b \sum_i w_i y_i + 2\hat a \hat b \sum_i w_i x_i$$

$$\frac{\chi^2}{\sum_i w_i} = \widetilde{y^2} + \hat a^2\,\widetilde{x^2} + \hat b^2 - 2\hat a\,\widetilde{xy} - 2\hat b\,\tilde y + 2\hat a \hat b\,\tilde x$$

Noting that $\hat b = \tilde y - \hat a\,\tilde x$, this reduces to

$$\frac{\chi^2}{\sum_i w_i} = \widetilde{y^2} - \tilde y^2 + \hat a^2\left(\widetilde{x^2} - \tilde x^2\right) - 2\hat a\left(\widetilde{xy} - \tilde x\,\tilde y\right)$$

and replacing $\hat a$ by its value $\hat a = (\widetilde{xy} - \tilde x\,\tilde y)/(\widetilde{x^2} - \tilde x^2)$:

$$\frac{\chi^2}{\sum_i w_i} = \widetilde{y^2} - \tilde y^2 - \frac{\left(\widetilde{xy} - \tilde x\,\tilde y\right)^2}{\widetilde{x^2} - \tilde x^2} = \tilde V(y)\left(1 - \tilde\rho_{xy}^2\right)$$

In the case where all the $\sigma_i$ are equal to a common $\sigma$, this becomes:

$$\chi^2 = \frac{N\,V(y)}{\sigma^2}\left(1 - \rho_{xy}^2\right)$$

These formulae are sometimes useful when one is interested in a goodness-of-fit test (see next chapter) without having to actually compute the estimators $\hat a$ and $\hat b$.
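The recipes of sections 2.1-2.3 fit in a few lines of code. Here is a minimal sketch (Python with NumPy, my choice of language for these notes; the dataset is invented for illustration) that computes the weighted averages, the estimators, their covariances, and checks the shortcut formula for the χ²:

```python
import numpy as np

# Illustrative data: x known exactly, y measured with resolution sigma_i
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
sigma = np.array([0.2, 0.3, 0.2, 0.4, 0.3])

w = 1.0 / sigma**2                      # weights w_i = 1/sigma_i^2
def wavg(q):                            # the "tilde" (weighted) average
    return np.sum(w * q) / np.sum(w)

xt, yt = wavg(x), wavg(y)
x2t, y2t, xyt = wavg(x**2), wavg(y**2), wavg(x * y)

# Estimators of the slope and intercept (section 2.1)
a_hat = (xyt - xt * yt) / (x2t - xt**2)
b_hat = yt - a_hat * xt

# Covariance matrix of the estimators (section 2.2)
D = np.sum(w) * (x2t - xt**2)
var_a, var_b, cov_ab = 1.0 / D, x2t / D, -xt / D

# chi^2 computed directly, and via the shortcut of section 2.3
chi2_direct = np.sum(w * (y - a_hat * x - b_hat)**2)
Vy = y2t - yt**2
rho2 = (xyt - xt * yt)**2 / ((x2t - xt**2) * Vy)
chi2_shortcut = np.sum(w) * Vy * (1.0 - rho2)

print(a_hat, b_hat, var_a, var_b, cov_ab)
print(chi2_direct, chi2_shortcut)       # the two values agree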

2.4 Straight line fit with uncertainties on both x and y

Let us start with the simple case where both the $x_i$ and the $y_i$ have equal uncertainties $\sigma$. If we apply the maximum likelihood method, the probability that a true point $A = (x_A, y_A)$, lying on the theoretical line $y = ax + b$, is measured as $B = (x_B, y_B)$ is just:

$$P(A \to B) = \frac{1}{2\pi\sigma^2}\; e^{-(x_A - x_B)^2/2\sigma^2}\; e^{-(y_A - y_B)^2/2\sigma^2}$$

which, decomposing the displacement into a component $u$ along the line and a component $h$ perpendicular to it, is just:

$$P(A \to B) = \frac{1}{2\pi\sigma^2}\; e^{-u^2/2\sigma^2}\; e^{-h^2/2\sigma^2}$$

To find the total probability that $B$ could come from any point $A$, one has to integrate $P(A \to B)$ over all $u$. The integral of $e^{-u^2/2\sigma^2}$ just gives a constant, $\sigma\sqrt{2\pi}$, so the probability of a point occurring at $B$ is proportional to $e^{-h^2/2\sigma^2}$, where $h$ is the distance of the point $B$ to the line $y = ax + b$. In forming the $\chi^2$ to be minimized, one sums the squared distances from each measured point to the point on the line from which it is most likely to have been produced. The fact that it may actually come from another point on the line need not be explicitly taken into account. Specifically, the distance between a point $(x_i, y_i)$ and the line $y = ax + b$ is

$$h_i = \frac{y_i - ax_i - b}{\sqrt{1 + a^2}}$$

Thus the $\chi^2$ can be written

$$\chi^2 = \frac{1}{\sigma^2}\sum_i \frac{(y_i - ax_i - b)^2}{1 + a^2}$$

Differentiating with respect to $a$ and $b$ gives, at the cost of a painful calculation:

$$\bar y = \hat a\,\bar x + \hat b, \qquad \hat a = A \pm \sqrt{A^2 + 1}, \qquad A = \frac{V(y) - V(x)}{2\,\mathrm{cov}(x, y)}$$

Of the two solutions, one has to choose the $\hat a$ whose sign is that of $\mathrm{cov}(x, y)$. It must be stressed that this method works because we assumed that the a priori distribution of points is flat and infinite along a straight line. In the case where the $x$ and $y$ accuracies are not the same and/or the $x$ and $y$ values at a point are correlated, a change of scale and/or a rotation can bring us back to the previous case. The conclusion persists: one needs only to consider the point on the line corresponding to the highest probability of producing the data point.

3 The χ² distribution

3.1 Multi-normal distribution with 1, 2 and 3 variables

Let us address the following question: given 1, 2 or 3 independent normally distributed variables $x$, $y$ and $z$, one can form the quantities $r = |x|$, $\sqrt{x^2 + y^2}$, or $\sqrt{x^2 + y^2 + z^2}$. What will be the distributions of $r$ and $r^2$?
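A quick numerical experiment already hints at the answer before we derive it. The sketch below (Python/NumPy; my own illustration, not part of the original notes) draws standard normal variables and looks at the first two moments of $r^2$; they come out close to $n$ and $2n$, the values derived in section 3.3:

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.standard_normal((1_000_000, 3))   # x, y, z ~ N(0, 1), independent

for n in (1, 2, 3):
    r2 = np.sum(samples[:, :n]**2, axis=1)      # r^2 with 1, 2 or 3 variables
    print(n, r2.mean(), r2.var())               # ~ n and ~ 2n respectively
```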

In one dimension: The standard distribution is

$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$$

Under the change of variable $r \to r^2 = h(r)$,

$$g(r^2) = 2\,\frac{f(r)}{h'(r)} = \frac{2}{2r}\,\frac{1}{\sqrt{2\pi}}\, e^{-r^2/2} = \frac{1}{\sqrt{2}\,\Gamma(\frac12)}\,(r^2)^{-\frac12}\, e^{-r^2/2}$$

where the first factor of 2 comes from the fact that the equation $r^2 = x^2$ has two solutions, $x = \pm r$, and we have used the well-known(!) expression of the $\Gamma$ function:

$$\Gamma\!\left(\tfrac12\right) = \sqrt{\pi}$$

In two dimensions: We are now considering the distribution of $r^2 = x^2 + y^2$ on a circle in the $xy$ plane, where $x$ and $y$ are normally distributed. On the circle the distribution is constant, and integrating over $2\pi$ gives:

$$f(r) = 2\pi r\;\frac{1}{2\pi}\, e^{-x^2/2}\, e^{-y^2/2} = r\, e^{-r^2/2}$$

$$g(r^2) = \frac{1}{2r}\, r\, e^{-r^2/2} = \frac{1}{2}\, e^{-r^2/2} = \frac{1}{2\,\Gamma(1)}\,(r^2)^{0}\, e^{-r^2/2}$$

In three dimensions: Now we consider $r^2 = x^2 + y^2 + z^2$; we now integrate over the surface of a sphere and get:

$$f(r) = 4\pi r^2\,\frac{1}{(2\pi)^{3/2}}\, e^{-r^2/2} = \sqrt{\frac{2}{\pi}}\; r^2\, e^{-r^2/2}$$

$$g(r^2) = \frac{f(r)}{2r} = \frac{1}{\sqrt{2\pi}}\; r\, e^{-r^2/2} = \frac{1}{2^{3/2}\,\Gamma(\frac32)}\,(r^2)^{\frac12}\, e^{-r^2/2}$$

using the well-known(!) relation:

$$\Gamma\!\left(\tfrac32\right) = \tfrac12\,\Gamma\!\left(\tfrac12\right) = \sqrt{\pi}/2$$

3.2 Definition of the χ² distribution

By definition, the χ² distribution with $n$ degrees of freedom is the distribution of the variable $\chi^2$ defined as the sum of the squares of $n$ independent normally distributed variables:

$$\chi^2 = \sum_{i=1}^{n} \left[N_i(x; 0, 1)\right]^2 \quad\Longrightarrow\quad f(\chi^2; n) = \frac{1}{2^{n/2}\,\Gamma(\frac n2)}\,(\chi^2)^{\frac n2 - 1}\, e^{-\chi^2/2}$$

Let us prove this by induction:

n = 1, n = 2, n = 3: see above.

From n to n + 2: Assume that the formula is true for $f(\chi^2; n)$; adding two more degrees of freedom amounts to convolving with $f(\chi^2; 2) = \frac12 e^{-x/2}$, so for $n + 2$ the distribution becomes

$$(f * g)(x) = \int_{t=0}^{x} \frac{t^{\frac n2 - 1}\, e^{-t/2}}{2^{n/2}\,\Gamma(\frac n2)}\;\frac{e^{-(x-t)/2}}{2}\, dt = \frac{e^{-x/2}}{2^{\frac n2 + 1}\,\Gamma(\frac n2)}\,\frac{x^{n/2}}{n/2} = \frac{1}{2^{\frac{n+2}{2}}\,\Gamma(\frac{n+2}{2})}\, x^{\frac{n+2}{2} - 1}\, e^{-x/2}$$

Q.E.D. For this, we have used the well-known(!) formula, which derives from integration by parts:

$$\Gamma(\alpha + 1) = \int_{t=0}^{\infty} e^{-t}\, t^{\alpha}\, dt = \int_{t=0}^{\infty} \alpha\, t^{\alpha-1}\, e^{-t}\, dt = \alpha\,\Gamma(\alpha)$$

Note that one can find this result intuitively: the probability density function is

$$P(x_1, x_2, \ldots, x_n) = P(x_1)\,P(x_2)\cdots P(x_n) = A\, e^{-x_1^2/2}\, e^{-x_2^2/2}\cdots e^{-x_n^2/2} = A\, e^{-\chi^2/2}$$

To get the density in $\chi$, one has to integrate over the hyper-sphere of radius $\chi$ in dimension $n$, whose surface contributes a factor $\chi^{n-1}$, and since one is interested in a p.d.f. of $\chi^2$, the change of variable will eat another power of $\chi$. The overall distribution is then $B\,(\chi^2)^{\frac n2 - 1}\, e^{-\chi^2/2}$, where $B$ is found from the normalization condition. This is illustrated in the code chi2.r
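The induction step can also be checked numerically: convolving $f(\chi^2; n)$ with $f(\chi^2; 2)$ must reproduce $f(\chi^2; n+2)$. A short sketch (Python; my own illustration, the course's R file is not reproduced here):

```python
import numpy as np
from math import gamma

def chi2_pdf(x, n):
    # f(chi^2; n) = (chi^2)^(n/2 - 1) exp(-chi^2/2) / (2^(n/2) Gamma(n/2))
    return x**(n / 2 - 1) * np.exp(-x / 2) / (2**(n / 2) * gamma(n / 2))

n, x = 3, 5.0
dt = x / 200_000
t = (np.arange(200_000) + 0.5) * dt             # midpoint rule on (0, x)
conv = np.sum(chi2_pdf(t, n) * chi2_pdf(x - t, 2)) * dt
print(conv, chi2_pdf(x, n + 2))                 # the two numbers agree
```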

3.3 Basic properties of the χ² distribution

The χ² distribution is a decreasing function of $\chi^2$ for $n = 1$ and $n = 2$; for $n > 2$, it starts at zero, reaches a maximum, then decreases. Its characteristic function is

$$\Phi(t) = \langle e^{it\chi^2} \rangle = \frac{1}{2^{n/2}\,\Gamma(\frac n2)} \int_0^{\infty} x^{\frac n2 - 1}\, e^{-\left(\frac x2 - itx\right)}\, dx$$

Performing the change of variable

$$y = \frac x2\,(1 - 2it), \qquad dx = \frac{2\,dy}{1 - 2it}$$

one gets, using the definition $\Gamma(\alpha) = \int_0^{\infty} x^{\alpha-1} e^{-x}\, dx$:

$$\Phi(t) = \frac{1}{2^{n/2}\,\Gamma(\frac n2)} \int_0^{\infty} \left(\frac{2y}{1 - 2it}\right)^{\frac n2 - 1} e^{-y}\,\frac{2\,dy}{1 - 2it} = \frac{2^{n/2}}{2^{n/2}\,(1 - 2it)^{n/2}} = (1 - 2it)^{-n/2}$$

We find again that the χ² distribution is additive (though it was obvious from the definition):

$$\chi^2(n_1) + \chi^2(n_2) = \chi^2(n_1 + n_2)$$

From the characteristic function one can immediately get its mean and variance:

$$\mu = \frac{1}{i}\,\frac{\partial \Phi}{\partial t}\bigg|_{t=0} = \frac{n}{(1 - 2it)^{\frac n2 + 1}}\bigg|_{t=0} = n$$

$$\sigma^2 = \frac{1}{i^2}\,\frac{\partial^2 \Phi}{\partial t^2}\bigg|_{t=0} - \mu^2 = \frac{n(n + 2)}{(1 - 2it)^{\frac n2 + 2}}\bigg|_{t=0} - n^2 = n^2 + 2n - n^2 = 2n$$

The result concerning the mean is straightforward, but the one concerning the variance is more subtle: why a factor 2? The answer can be found by considering the variance of a χ² with 1 degree of freedom: it is just the variance of the second central moment of the normal distribution. It is easy to show that the variance of the $r$-th moment of any distribution is:

$$V(x^r) = \langle (x^r)^2 \rangle - \langle x^r \rangle^2 = \mu_{2r} - \mu_r^2$$

For the second moment of a normal distribution one has $\mu_4 = 3\sigma^4$ and, noting that all moments here also are central moments, this becomes:

$$V(x^2) = \mu_4 - \mu_2^2 = 3\sigma^4 - \sigma^4 = 2\sigma^4$$

Hence the factor 2.

3.4 The χ² distribution as a special case of the Γ distribution

As we saw, the χ² distribution is written:

$$p(\chi^2; n) = \frac{1}{2^{n/2}\,\Gamma(\frac n2)}\,(\chi^2)^{\frac n2 - 1}\, e^{-\chi^2/2}$$

In this form it is the distribution of $\chi^2$ and not of $\chi$. Simply setting $z = \chi^2$, it becomes:

$$p(z; n) = \frac{1}{2^{n/2}\,\Gamma(\frac n2)}\, z^{\frac n2 - 1}\, e^{-z/2} = \Gamma\!\left(z;\ \alpha = \tfrac n2,\ \beta = 2\right)$$

where Γ is the Gamma distribution we already studied. It follows that:

Its mean is $\langle \chi^2 \rangle = \alpha\beta = n$
Its variance is $V(\chi^2) = \alpha\beta^2 = 2n$
Its characteristic function is $\Phi(t) = (1 - 2it)^{-n/2}$
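These properties are easy to confirm against a standard library; a minimal sketch (Python with SciPy, which is an assumption of mine, not a tool used by the notes):

```python
from scipy.stats import chi2, gamma

n = 7
print(chi2.mean(n), chi2.var(n))        # n and 2n
# The same law viewed as a Gamma distribution with alpha = n/2, beta = 2:
print(chi2.pdf(4.0, n), gamma.pdf(4.0, a=n / 2, scale=2))   # identical values
```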

3.5 Generating a variable according to a χ² distribution

A very nice result is that, considering $n$ independent variables $(x_1, x_2, \ldots, x_n)$ uniformly distributed between 0 and 1,

$$u = -2\log(x_1 x_2 \cdots x_n) = \chi^2(2n)$$

This can easily be shown for $n = 1$: with $y = -2\log x$,

$$g(y) = f(x)\left|\frac{dx}{dy}\right| = \frac{x}{2} = \frac{1}{2}\, e^{-y/2} = f(y; 2)$$

and then by induction from $n$ to $n + 1$, using the additivity shown above. The algorithm is thus twofold, depending on whether the number of degrees of freedom one considers is even or odd:

n even: generate n/2 independent uniform variables and apply the above formula
n odd: go back to the previous (even) case and add $z^2$ to the result, where $z$ is a pseudo-random variable distributed according to the normal law.

For large values of $n$ (typically $n > 20$) one can use the fact that $\sqrt{2\chi^2}$ is very close to a Gaussian distribution of mean $\sqrt{2n - 1}$ and unit variance. See a graphical illustration: chisquare.m

3.6 A fantastic property linking the Poisson and χ² distributions

Let us consider a Poisson distribution of the integer variable $r$ and mean $\mu$; a very useful result links the probability that $r \le n$ (or, equivalently, that $r > n$) and the probability that a χ² variable is greater than some cut:

$$P_\mu(r \le n) = P\left[\chi^2(2(n+1)) > 2\mu\right]$$

Proof:

$$P\left[\chi^2(2(n+1)) > 2\mu\right] = \int_{x=2\mu}^{\infty} \frac{(x/2)^n\, e^{-x/2}}{2\; n!}\, dx = \int_{t=\mu}^{\infty} \frac{t^n\, e^{-t}}{n!}\, dt$$

after the change of variable $t = x/2$, $dx = 2\,dt$. Now integrating by parts and noting that

$$\int x^n\, e^{-x}\, dx = \left[-e^{-x}\, x^n\right] + n \int x^{n-1}\, e^{-x}\, dx$$

one gets:

$$\int_{t=\mu}^{\infty} \frac{t^n\, e^{-t}}{n!}\, dt = \int_{t=\mu}^{\infty} \frac{t^{n-1}\, e^{-t}}{(n-1)!}\, dt + \frac{e^{-\mu}\,\mu^n}{n!} = \cdots = \sum_{r=0}^{n} \frac{e^{-\mu}\,\mu^r}{r!}$$

which is exactly the sum of the $n + 1$ individual probabilities of the Poisson distribution for $r = 0, 1, \ldots, n$. This property is very important from a computational point of view, since the cumulative χ² probability is available in the scientific libraries of most computer languages, as illustrated by the following exercise.

Exercise: You are running a high energy particle physics experiment, in search of the elusive Higgs boson; you expect $b = 3.86$ fake background events and you observe $n = 9$ of them. What is the probability that your observation is just a statistical fluctuation? (answer: ≈ 0.0175)
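Both results lend themselves to a few lines of code. The sketch below (Python with NumPy/SciPy; illustrative, not the course's implementation) generates χ² variables by the log-of-uniforms recipe and solves the exercise via the cumulative χ² probability:

```python
import numpy as np
from scipy.stats import chi2, poisson

rng = np.random.default_rng(2)

def chi2_variate(n, size):
    # chi^2(n) from uniforms: -2 log(x1...xk) is chi^2(2k); add z^2 if n is odd
    k = n // 2
    u = -2.0 * np.log(rng.uniform(size=(size, k))).sum(axis=1) if k else np.zeros(size)
    if n % 2:
        u = u + rng.standard_normal(size)**2
    return u

v = chi2_variate(5, 1_000_000)
print(v.mean(), v.var())                # ~ 5 and ~ 10

# Exercise: b = 3.86 expected background events, n = 9 observed.
# P(r >= 9) = 1 - P(r <= 8) = P[chi^2(2*9) <= 2*3.86], by the property above.
print(chi2.cdf(2 * 3.86, 2 * 9))        # ~ 0.0175
print(poisson.sf(8, 3.86))              # direct cross-check, same number
```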

4 Distribution of the χ² of a least squares fit

The key (and well-known!) result is that the distribution of the χ² of a least squares fit, in the case of Gaussian errors, is indeed a χ² distribution with $N - p$ degrees of freedom:

$$f(\chi^2; N - p) = \frac{1}{2^{\frac{N-p}{2}}\,\Gamma(\frac{N-p}{2})}\,(\chi^2)^{\frac{N-p}{2} - 1}\, e^{-\chi^2/2}$$

where $N$ is the number of points and $p$ the number of parameters evaluated by the fit. In the case of a straight line fit, $p = 2$. The χ² is a sum of $N$ terms

$$u_i^2 = \frac{(y_i - f(x_i))^2}{\sigma_i^2}$$

each of them being the square of a normally distributed variable. The points of constant probability lie on an $N$-dimensional hyper-sphere of radius $\chi$. The probability that a result occurs between $\chi$ and $\chi + \delta\chi$ thus depends on $e^{-\chi^2/2}$, multiplied by the volume of the region concerned, so

$$P(\chi) = \alpha\,\chi^{N-1}\, e^{-\chi^2/2}$$

After a change of variable this becomes

$$P(\chi^2) = \beta\,(\chi^2)^{\frac N2 - 1}\, e^{-\chi^2/2}$$

In section 2.1, dealing with the case of the fit to a straight line, we saw that the $u_i$ are constrained by 2 equations:

$$0 = \sum_i u_i, \qquad 0 = \sum_i u_i\, x_i$$

Thus the points $u_i$ lie on an $(N-2)$-dimensional subspace of the original one, reducing the dimension of a region of equiprobability by two powers of $\chi$. Thus the probability density function is proportional to

$$P(\chi^2) = \gamma\,(\chi^2)^{\frac{N-4}{2}}\, e^{-\chi^2/2}$$

which is a χ² distribution for $N - 2$ degrees of freedom. In case the fit involves $p$ parameters instead of 2, the above analysis still holds, one degree of freedom being lost for each extra parameter. Note that in the case $p = N$, the fit is no longer a fit but a system of $N$ equations with $N$ unknowns. Clearly, if the number of parameters is greater than the number of points, the problem is not solvable.

5 Linear least squares fits

A least squares fit is called linear if the theoretical function $f(x; a_1, \ldots, a_p)$ is linear with respect to the parameters $a_i$; for example, estimating $a_1, a_2, a_3, a_4$ in

$$y = a_1 + a_2 \sin x + a_3 \sin(x^2) + a_4 x$$

is indeed a linear problem, while estimating $a$ in

$$y = \cos(ax)$$

is obviously not. The important result we want to show is that the LS method provides an analytic solution for the parameters $a_i$, which can be expressed in a simple closed form. Moreover, the LS estimator in the linear case is unique, unbiased, and is a linear function of the observations $y_i$. Finally, among all estimators which can be expressed as a linear function of the observations, it is the most efficient.
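Before developing the linear formalism, the $N - p$ rule of section 4 is easy to check by simulation. A sketch (Python/NumPy; the dataset and true parameters are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
N, sigma = 10, 0.5
x = np.linspace(0.0, 9.0, N)
C = np.column_stack([x, np.ones(N)])            # design matrix for y = a x + b

chi2_values = []
for _ in range(20_000):
    y = 2.0 * x + 1.0 + sigma * rng.standard_normal(N)   # true line plus noise
    coef, *_ = np.linalg.lstsq(C, y, rcond=None)         # equal sigmas: plain LS
    chi2_values.append(np.sum((y - C @ coef)**2) / sigma**2)

chi2_values = np.array(chi2_values)
print(chi2_values.mean(), chi2_values.var())    # ~ N - 2 = 8 and ~ 2(N - 2) = 16
```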

5.1 Normal equations for least squares fits

If many (> 2) parameters are to be estimated in a linear fit, matrix notation produces elegant results. Let us call $a = (a_1, \ldots, a_p)$ the vector of the $p$ parameters, $y = (y_1, \ldots, y_N)$ the vector of the $N$ measured quantities, and similarly for $x$. The fit will be called linear if the theoretical function $f$ can be expressed as a linear combination of arbitrary functions of $x$:

$$f(x; a) = \sum_{r=1}^{p} c_r(x)\, a_r$$

In the general case the $X^2$ can be expressed as a sum of exponents of the multinormal functions

$$X^2 = \sum_i \sum_j \left[y_i - f(x_i; a)\right] V^{-1}_{ij} \left[y_j - f(x_j; a)\right]$$

which can be expressed in matrix notation:

$$X^2 = (y - f)^T\, V^{-1}\, (y - f)$$

First assuming that the $y_i$ are independent, each of them with an uncertainty $\sigma_i$, we can write

$$X^2 = \sum_{i=1}^{N} \frac{\left[y_i - \sum_{r=1}^{p} a_r\, c_r(x_i)\right]^2}{\sigma_i^2}$$

Differentiating this with respect to $a_r$ gives the so-called normal equations:

$$\sum_i \frac{c_r(x_i)\left[y_i - \sum_s \hat a_s\, c_s(x_i)\right]}{\sigma_i^2} = 0$$

which can be expressed in matrix notation, introducing the matrix $C_{ir} = c_r(x_i)$:

$$f = Ca, \qquad C^T V^{-1} C\,\hat a = C^T V^{-1} y, \qquad \hat a = \left(C^T V^{-1} C\right)^{-1} C^T V^{-1} y$$

In the above, the dimensions of the various matrices are:

$$y:\ N \times 1, \quad a:\ p \times 1, \quad C:\ N \times p, \quad V:\ N \times N, \quad C^T V^{-1}:\ p \times N, \quad C^T V^{-1} y:\ p \times 1, \quad C^T V^{-1} C:\ p \times p$$

In the case where the $y_i$ are not independent, i.e. the matrix $V$ is not diagonal, this result still holds: one has to differentiate the quantity

$$X^2 = (y - Ca)^T\, V^{-1}\, (y - Ca)$$

with respect to each of the $a_k$:

$$0 = \nabla_a\left[(y - C\hat a)^T V^{-1} (y - C\hat a)\right] \implies 0 = -2\, C^T V^{-1} (y - C\hat a) \implies C^T V^{-1} y = C^T V^{-1} C\,\hat a$$
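A direct transcription of $\hat a = (C^T V^{-1} C)^{-1} C^T V^{-1} y$ (Python/NumPy sketch; the basis functions and error model are my own invented example):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 50
x = np.linspace(0.0, 10.0, N)
sigma = 0.1 + 0.02 * x                          # independent, unequal errors
V = np.diag(sigma**2)                           # covariance matrix of the y_i

# Linear model f(x; a) = a1 + a2 sin(x) + a3 x, i.e. columns C_ir = c_r(x_i)
C = np.column_stack([np.ones(N), np.sin(x), x])
a_true = np.array([1.0, 0.5, -0.2])
y = C @ a_true + sigma * rng.standard_normal(N)

Vinv = np.linalg.inv(V)
A = C.T @ Vinv @ C                              # the p x p matrix C^T V^-1 C
a_hat = np.linalg.solve(A, C.T @ Vinv @ y)      # solve the normal equations
cov_a = np.linalg.inv(A)                        # V(a_hat) = (C^T V^-1 C)^-1
print(a_hat)
print(np.sqrt(np.diag(cov_a)))                  # parameter uncertainties
```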

5.2 Properties of Linear Least Squares Fits

The first, rather obvious, property is that the Linear Least Squares method provides a unique solution, which is linear with respect to the observables $y_i$; secondly, this estimator is unbiased because it is linear:

$$\langle \hat a \rangle = \left\langle \left(C^T V^{-1} C\right)^{-1} C^T V^{-1} y \right\rangle = \left(C^T V^{-1} C\right)^{-1} C^T V^{-1} \langle y \rangle = \left(C^T V^{-1} C\right)^{-1} C^T V^{-1} C\, a = a$$

What is the covariance matrix of $\hat a$? The answer is simple: we saw that if $z = Mx$, then $V(z) = M\, V(x)\, M^T$. In our case $M = (C^T V^{-1} C)^{-1} C^T V^{-1}$, so

$$V(\hat a) = \left(C^T V^{-1} C\right)^{-1} C^T V^{-1}\; V\; V^{-1} C \left(C^T V^{-1} C\right)^{-1} = \left(C^T V^{-1} C\right)^{-1}$$

5.3 The Gauss-Markov theorem

An interesting optimum property of the linear Least Squares estimator is contained in the Gauss-Markov theorem: among all unbiased estimators which are linear functions of the observations, the LS estimators have the smallest variance. To prove it, consider any estimator $t$ of the parameters $a$, linear in the observations $y$: $t = Sy$. Its expectation is:

$$\langle t \rangle = S \langle y \rangle = SCa$$

where $a$ is the original vector of parameters. For $t$ to be unbiased, we need $\langle t \rangle = a$ for all $a$; hence we must have $SC = 1$. Now we want to find out when the covariance matrix $V(t) = S\, V(y)\, S^T$ of the new estimator has minimal diagonal terms. To find the minimum we consider the following identity, writing $V(y)$ as $V$, and using the facts that $SC = 1$, $V = V^T$, and $(C^T V^{-1} C)^T = C^T V^{-1} C$:

$$\mathrm{qtty} = \left[\left(C^T V^{-1} C\right)^{-1} C^T V^{-1}\right] V \left[\left(C^T V^{-1} C\right)^{-1} C^T V^{-1}\right]^T + \left[S - \left(C^T V^{-1} C\right)^{-1} C^T V^{-1}\right] V \left[S - \left(C^T V^{-1} C\right)^{-1} C^T V^{-1}\right]^T$$

Expanding the products and using $SC = 1$:

$$\mathrm{qtty} = \left(C^T V^{-1} C\right)^{-1} + S V S^T - \left(C^T V^{-1} C\right)^{-1} - \left(C^T V^{-1} C\right)^{-1} + \left(C^T V^{-1} C\right)^{-1} = S V S^T = V(t)$$

Here each of the two terms on the right-hand side in the definition of qtty is of the quadratic form $U V U^T$, which implies non-negative diagonal elements. Only the second term is a function of $S$, and the sum of the two terms will have strictly minimal diagonal elements when the second term has vanishing elements on the diagonal. This occurs when

$$S = \left(C^T V^{-1} C\right)^{-1} C^T V^{-1}$$

Therefore the minimum variance unbiased estimator for $a$ is

$$t = \left(C^T V^{-1} C\right)^{-1} C^T V^{-1} y = \hat a_{LS}$$

where $\hat a_{LS}$ is the Least-Squares estimator for $a$.
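A toy numerical illustration of the theorem (Python sketch, my own example, not from the notes): when estimating a single constant from two measurements of unequal precision, the LS weights beat any other linear unbiased combination, e.g. the plain average.

```python
import numpy as np

rng = np.random.default_rng(5)
s1, s2 = 1.0, 3.0                       # resolutions of the two measurements
y1 = 10.0 + s1 * rng.standard_normal(1_000_000)
y2 = 10.0 + s2 * rng.standard_normal(1_000_000)

w1 = (1 / s1**2) / (1 / s1**2 + 1 / s2**2)      # the LS (minimum variance) weight
t_ls = w1 * y1 + (1 - w1) * y2          # both estimators are unbiased (SC = 1)
t_avg = 0.5 * (y1 + y2)

print(t_ls.var(), 1 / (1 / s1**2 + 1 / s2**2))  # ~ 0.9, the Gauss-Markov minimum
print(t_avg.var())                              # ~ 2.5, strictly larger
```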

5.4 Linear Least squares fits with constraints

In frequent situations, one is faced with the problem of estimating parameters for which one or more constraints exist. Two approaches can handle such a problem:

1. The elimination method
2. Lagrange multipliers

5.4.1 Elimination method

This method just consists (whenever possible!) in eliminating some of the parameters by algebraic manipulation. A very simple example will illustrate the method: the three angles of a triangle, $\theta_1$, $\theta_2$ and $\theta_3$, have been measured independently with the same uncertainty $\sigma$. What are the constrained ($\sum\theta_i = \pi$) LS estimators of the angles? For simplicity, let us call the measurements $y_1$, $y_2$ and $y_3$ and assume that the measurements are expressed in degrees. One wants to solve:

$$X^2(\theta_1, \theta_2, \theta_3) = \sum_{i=1}^{3} \left(\frac{y_i - \theta_i}{\sigma}\right)^2 = \text{minimum}, \qquad \sum_{i=1}^{3} \theta_i = 180$$

In this case, the algebraic manipulation to eliminate, say, $\theta_3$ is not very difficult:

$$X^2 = \left(\frac{y_1 - \theta_1}{\sigma}\right)^2 + \left(\frac{y_2 - \theta_2}{\sigma}\right)^2 + \left[\frac{y_3 - (180 - \theta_1 - \theta_2)}{\sigma}\right]^2$$

$$0 = \frac{\sigma^2}{2}\,\frac{\partial X^2}{\partial \theta_1} = -(y_1 - \theta_1) + \left[y_3 - (180 - \theta_1 - \theta_2)\right]$$

$$0 = \frac{\sigma^2}{2}\,\frac{\partial X^2}{\partial \theta_2} = -(y_2 - \theta_2) + \left[y_3 - (180 - \theta_1 - \theta_2)\right]$$

$$\hat\theta_1 = \tfrac13\,(180 + 2y_1 - y_2 - y_3), \qquad \hat\theta_2 = \tfrac13\,(180 - y_1 + 2y_2 - y_3), \qquad \hat\theta_3 = \tfrac13\,(180 - y_1 - y_2 + 2y_3)$$

An improved measurement was obtained by subtracting from each of the measurements $y_i$ one third of the excess

$$\Delta = y_1 + y_2 + y_3 - 180$$

This improvement is directly visible in the variance of the estimators:

$$V(\hat\theta_1) = V(\hat\theta_2) = V(\hat\theta_3) = \frac{4 + 1 + 1}{9}\,\sigma^2 = \frac{2}{3}\,\sigma^2$$

This result can be intuitively explained in terms of information: one could have estimated $\theta_1$, $\theta_2$ and $\theta_3$ with two measurements only, say $y_1$ and $y_2$; in this case, $y_3$ would have been computed as $y_3 = 180 - y_1 - y_2$. Adding a third measurement increases the information by 50%, and we saw, in the section devoted to the Cramer-Rao bound (4.3), that the minimum variance is inversely proportional to the amount of information.
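In code, the elimination result is just "subtract one third of the excess" (Python sketch; the measurements are made up for illustration):

```python
import numpy as np

y = np.array([61.2, 58.7, 61.3])        # made-up measured angles, in degrees
theta_hat = y - (y.sum() - 180.0) / 3.0 # subtract one third of the excess
print(theta_hat, theta_hat.sum())       # the estimators now sum to 180 exactly

# Variance check by simulation: V(theta_hat_i) = (2/3) sigma^2
rng = np.random.default_rng(6)
sigma, truth = 0.5, np.array([60.0, 59.0, 61.0])
ym = truth + sigma * rng.standard_normal((1_000_000, 3))
th = ym - (ym.sum(axis=1, keepdims=True) - 180.0) / 3.0
print(th.var(axis=0) / sigma**2)        # ~ 0.667 for each angle
```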

5.4.2 Lagrange Multipliers

Let us assume that we again want to estimate $p$ parameters $a_1, a_2, \ldots, a_p$, linked together by $m$ constraints:

$$D_k(a) = 0, \qquad k = 1, \ldots, m$$

The basic trick of the method is to replace the $p$-component vector of the parameters $a$ by a $(p+m)$-component vector whose first $p$ components are the usual $a_i$ and the remaining ones the $\lambda_k$, $k = 1, \ldots, m$, such that the new $X^2$ is now:

$$X^2 = (y - Ca)^T\, V^{-1}\, (y - Ca) + 2\,\lambda^T D = \text{minimum}$$

If we now equate to zero the derivatives of $X^2$ with respect to $a_1, a_2, \ldots, a_p$ and the $\lambda_k$, $k = 1, \ldots, m$, we get the normal equations:

$$\nabla_a X^2 = -2\left(C^T V^{-1} y - C^T V^{-1} C\,\hat a\right) + 2\left(\nabla_a D\right)^T \lambda = 0$$

$$\nabla_\lambda X^2 = 2D = 0$$

We are now left with a set of $p + m$ equations. The second ones are the constraints, while the first ones are the analog of the normal equations for the unconstrained case, now modified by the $\lambda$ term due to the constraints. In the case where the constraints themselves are linear, they can be expressed as

$$Ba = b$$

and the set of the modified $p + m$ normal equations becomes

$$0 = -\left(C^T V^{-1} y - C^T V^{-1} C\,\hat a\right) + B^T \lambda$$

$$0 = B\hat a - b$$

which is a set of linear equations. After some heavy algebra, one can show that the covariance matrix of the estimator $\hat a$ is:

$$V(\hat a) = \left(C^T V^{-1} C\right)^{-1} - \left[B\left(C^T V^{-1} C\right)^{-1}\right]^T \left[B\left(C^T V^{-1} C\right)^{-1} B^T\right]^{-1} \left[B\left(C^T V^{-1} C\right)^{-1}\right]$$

This expression cannot be simplified any further because neither $C$ nor $B$ are square matrices! Note that the matrix $\left[B\left(C^T V^{-1} C\right)^{-1} B^T\right]^{-1}$ has non-negative diagonal terms, and the same is true for the second term of the above equation. As a result, the constraint equations will always lead to a reduction of the parameter errors (i.e. the diagonal terms) compared to the unconstrained case.

Back to the previous exercise, running the machinery yields:

$$\chi^2 = \sum_{i=1}^{3} \left(\frac{y_i - \theta_i}{\sigma}\right)^2 + 2\lambda\left(\sum_i \theta_i - 180\right)$$

$$0 = \frac{\partial \chi^2}{\partial \theta_1} = -\frac{2}{\sigma^2}\,(y_1 - \theta_1) + 2\lambda$$

$$0 = \frac{\partial \chi^2}{\partial \theta_2} = -\frac{2}{\sigma^2}\,(y_2 - \theta_2) + 2\lambda$$

$$0 = \frac{\partial \chi^2}{\partial \theta_3} = -\frac{2}{\sigma^2}\,(y_3 - \theta_3) + 2\lambda$$

$$0 = \frac{\partial \chi^2}{\partial \lambda} = 2\left(\sum_i \theta_i - 180\right)$$

The first three equations give $\theta_i = y_i - \lambda\sigma^2$; injecting this into the constraint:

$$0 = \sum_i y_i - 180 - 3\lambda\sigma^2 \implies \lambda = \frac{1}{3\sigma^2}\left(\sum_i y_i - 180\right)$$

$$\hat\theta_i = y_i - \lambda\sigma^2 = y_i - \frac{1}{3}\left(\sum_j y_j - 180\right)$$

6 Non-linear least squares fits

If the function $f(x; a)$ is not linear in the $a_r$, one generally has to use an iterative technique to solve the equations. Given a first guess $a^0$, a Taylor expansion around $a^0$ will yield a new set of values $a^1$ satisfying the linearized version of the normal equations $\partial X^2 / \partial a_r = 0$ evaluated around $a = a^0$, and further iterations will lead to the solution. In such iterative methods, having a good first guess is winning 90% of the battle. The best advice is to find a reliable software library rather than being tempted to write one's own package.
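Following that advice, here is a minimal sketch using an existing library (SciPy's curve_fit, my own choice for illustration, applied to the non-linear example $y = \cos(ax)$ of section 5; the data are invented):

```python
import numpy as np
from scipy.optimize import curve_fit

def f(x, a):
    return np.cos(a * x)                # the non-linear example of section 5

rng = np.random.default_rng(7)
x = np.linspace(0.0, 5.0, 60)
sigma = 0.05
y = f(x, 1.3) + sigma * rng.standard_normal(x.size)

# A good first guess is winning 90% of the battle:
a_hat, cov = curve_fit(f, x, y, p0=[1.0],
                       sigma=np.full(x.size, sigma), absolute_sigma=True)
print(a_hat, np.sqrt(np.diag(cov)))     # ~ 1.3 and its uncertainty
```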
