Lecture 3: MLE and Regression


STAT/Q SCI 403: Introduction to Resampling Methods, Spring 2017
Instructor: Yen-Chi Chen

3.1 Parameters and Distributions

Some distributions are indexed by their underlying parameters. Thus, as long as we know the parameter, we know the entire distribution. For instance, for Normal distributions $N(\mu, \sigma^2)$, if we know $\mu$ and $\sigma^2$, the entire distribution is determined. For another example, for Exponential distributions $\mathrm{Exp}(\lambda)$, as long as we know the value of $\lambda$, we know the entire distribution. Because these distributions are determined by their parameters, they are sometimes called parametric distributions.

Because the parameters in a parametric distribution determine the entire distribution, finding these parameters is very important in practice. There are many approaches to finding parameters; here we introduce the most famous and perhaps most important one: the maximum likelihood estimator (MLE).

3.2 MLE: Maximum Likelihood Estimator

Assume that our random sample $X_1, \dots, X_n \sim F_\theta$, where $F_\theta$ is a distribution depending on a parameter $\theta$. For instance, if $F_\theta$ is a Normal distribution, then $\theta = (\mu, \sigma^2)$, the mean and the variance; if $F_\theta$ is an Exponential distribution, then $\theta = \lambda$, the rate; if $F_\theta$ is a Bernoulli distribution, then $\theta = p$, the probability of generating $1$.

The idea of the MLE is to use the PDF or PMF to find the most likely parameter. For simplicity, here we use the PDF as an illustration. Because the CDF $F_\theta$ depends on $\theta$, the PDF (or PMF) $p_\theta$ will also be determined by the parameter. By the independence property, the joint PDF of the random sample $X_1, \dots, X_n$ is
$$p_{X_1,\dots,X_n}(x_1, \dots, x_n) = \prod_{i=1}^n p_\theta(x_i).$$
Because $p_\theta(x)$ also changes when $\theta$ changes, we rewrite it as $p(x; \theta) = p_\theta(x)$. Thus, the joint PDF can be rewritten as
$$p_{X_1,\dots,X_n}(x_1, \dots, x_n) = \prod_{i=1}^n p(x_i; \theta).$$
Having observed $x_1 = X_1, \dots, x_n = X_n$, how can we tell which parameter $\theta$ is most likely? Here is the simple proposal that the MLE uses. Based on the joint PDF and $x_1 = X_1, \dots, x_n = X_n$, we can rewrite the joint PDF as a function of the parameter $\theta$:
$$L(\theta \mid X_1, \dots, X_n) = \prod_{i=1}^n p(X_i; \theta).$$
The function $L$ is called the likelihood function, and the MLE finds the maximizer of the likelihood function.
Namely,
$$\hat\theta_{\mathrm{MLE}} = \operatorname*{argmax}_\theta L(\theta \mid X_1, \dots, X_n).$$
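To make the definition concrete, here is a minimal numerical sketch in Python (the helper names such as `mle_grid_search` are made up for illustration): we evaluate the log of the likelihood function of a $N(\mu, 1)$ sample over a grid of candidate values of $\mu$ and pick the maximizer. This brute-force search is only illustrative; closed forms are derived below.

```python
import math
import random

def log_likelihood(mu, xs):
    """Log-likelihood of N(mu, 1) for the sample xs."""
    return sum(-0.5 * (x - mu) ** 2 - 0.5 * math.log(2 * math.pi) for x in xs)

def mle_grid_search(xs, grid):
    """Return the grid point maximizing the log-likelihood -- a crude numerical MLE."""
    return max(grid, key=lambda mu: log_likelihood(mu, xs))

random.seed(0)
sample = [random.gauss(3.0, 1.0) for _ in range(500)]   # true mu = 3 (arbitrary choice)
grid = [i / 100 for i in range(0, 601)]                  # candidates 0.00, 0.01, ..., 6.00
mu_hat = mle_grid_search(sample, grid)
```

The grid maximizer lands (up to the grid spacing) at the sample mean, matching the closed-form answer of the Normal example below.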

In many cases, maximizing the likelihood function directly might not be easy, so people consider maximizing the log-likelihood function:
$$\hat\theta_{\mathrm{MLE}} = \operatorname*{argmax}_\theta \ell(\theta \mid X_1,\dots,X_n) = \operatorname*{argmax}_\theta \sum_{i=1}^n \log p(X_i;\theta) = \operatorname*{argmax}_\theta \sum_{i=1}^n \ell(\theta \mid X_i),$$
where $\ell(\theta \mid X_i) = \log p(X_i; \theta)$, but now we fix each $X_i$ and view it as a function of $\theta$. It is easy to see that maximizing the likelihood function is the same as maximizing the log-likelihood function.

When the log-likelihood function is differentiable with respect to $\theta$, we can use our knowledge from calculus to find $\hat\theta$. Let
$$s(\theta \mid X_i) = \frac{\partial}{\partial\theta}\ell(\theta \mid X_i)$$
be the derivative of $\ell(\theta \mid X_i)$ with respect to $\theta$ (here for simplicity we assume $\theta$ is one-dimensional). The function $s(\theta \mid X_i)$ is called the score function. Then the MLE is obtained by solving the following likelihood equation:
$$s(\theta \mid X_1,\dots,X_n) = \sum_{i=1}^n s(\theta \mid X_i) = 0. \tag{3.1}$$
Note that in general we need to verify that the solution is a maximum by checking the second derivative, but here we ignore this for simplicity.

Remark: Equation (3.1) is related to the generalized estimating equations, a common approach to obtaining estimators.

Remark: The idea of obtaining an estimator by maximizing a certain criterion (or minimizing some criterion) is called an M-estimator. In linear regression, we have learned that the estimators of the slope/intercept come from minimizing the sum of squared errors (the least squares estimator). Thus, the least squares method is another M-estimator.

Example (Normal distribution). Here is an example of finding the MLE for the Normal distribution. We assume $X_1,\dots,X_n \sim N(\mu, 1)$. The goal is to find an estimator of the mean parameter $\mu$. Because the density of such a Normal is
$$p_\mu(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2}},$$
the log-likelihood function is
$$\ell(\mu \mid X_i) = \log p_\mu(X_i) = -\frac{(X_i - \mu)^2}{2} - \frac{1}{2}\log(2\pi),$$
which further implies that the score function is
$$s(\mu \mid X_i) = \frac{\partial}{\partial\mu}\ell(\mu \mid X_i) = X_i - \mu.$$
Thus, the MLE $\hat\mu$ satisfies
$$0 = \sum_{i=1}^n s(\hat\mu \mid X_i) = \sum_{i=1}^n (X_i - \hat\mu) \;\Longrightarrow\; \hat\mu = \frac{1}{n}\sum_{i=1}^n X_i.$$
Thus, the MLE of the mean parameter is just the sample mean.

Example (Exponential distribution). Assume $X_1,\dots,X_n \sim \mathrm{Exp}(\lambda)$. Now we find an estimator of $\lambda$ using the MLE. By the definition of the exponential distribution, the density is
$$p_\lambda(x) = \lambda e^{-\lambda x}.$$
Thus, the log-likelihood function and the score function are
$$\ell(\lambda \mid X_i) = \log p_\lambda(X_i) = \log\lambda - \lambda X_i, \qquad s(\lambda \mid X_i) = \frac{1}{\lambda} - X_i.$$

As a result, the MLE comes from solving
$$0 = \sum_{i=1}^n s(\hat\lambda \mid X_i) = \sum_{i=1}^n\left(\frac{1}{\hat\lambda} - X_i\right) = \frac{n}{\hat\lambda} - \sum_{i=1}^n X_i \;\Longrightarrow\; \hat\lambda = \frac{n}{\sum_{i=1}^n X_i} = \frac{1}{\bar X_n}.$$
Namely, the MLE is the inverse of the sample average.

Example (Uniform distribution). Here is a case where we cannot use the score function to obtain the MLE, but we can still find the MLE directly. Assume $X_1,\dots,X_n \sim \mathrm{Uni}[0,\theta]$. Namely, the random sample is from a uniform distribution over the interval $[0,\theta]$, where the upper limit $\theta$ is the parameter of interest. Then the density function is
$$p_\theta(x) = \frac{1}{\theta}\, I(0 \le x \le \theta).$$
Here we cannot use the log-likelihood function (think about why), so we use the original likelihood function:
$$L(\theta \mid X_1,\dots,X_n) = \prod_{i=1}^n p_\theta(X_i) = \frac{1}{\theta^n}\, I(0 \le X_{(1)} \le X_{(n)} \le \theta),$$
where $X_{(1)} = \min\{X_1,\dots,X_n\}$ and $X_{(n)} = \max\{X_1,\dots,X_n\}$ are the minimum and maximum values of the sample (here we use the notation of order statistics). Observing from $L(\theta \mid X_1,\dots,X_n) = \theta^{-n}\, I(0 \le X_{(1)} \le X_{(n)} \le \theta)$, we see that a smaller $\theta$ gives a higher value of the likelihood function. However, there is a restriction: the value of $\theta$ cannot be below $X_{(n)}$, otherwise the indicator function outputs $0$. Thus, the maximum value of $L(\theta \mid X_1,\dots,X_n)$ occurs when $\theta = X_{(n)}$, so the MLE is $\hat\theta = X_{(n)}$, the maximum value of the sample.

Example (Bernoulli distribution). Finally, we provide an example of finding the MLE for a discrete random variable. Suppose $X_1,\dots,X_n \sim \mathrm{Ber}(p)$. Then the PMF is $X=1$ with probability $p$ and $X=0$ with probability $1-p$. We can rewrite the PMF in the following succinct form:
$$P(x) = p^x (1-p)^{1-x}.$$
You can verify that $P(1) = P(X=1) = p$ and $P(0) = P(X=0) = 1-p$. The likelihood function will be
$$L(p \mid X_1,\dots,X_n) = \prod_{i=1}^n P(X_i) = p^{\sum_i X_i} (1-p)^{n - \sum_i X_i}.$$
We can then compute the log-likelihood function and the score function:
$$\ell(p \mid X_1,\dots,X_n) = \sum_{i=1}^n \bigl(X_i \log p + (1 - X_i)\log(1-p)\bigr), \qquad s(p \mid X_1,\dots,X_n) = \sum_{i=1}^n \left(\frac{X_i}{p} - \frac{1 - X_i}{1-p}\right).$$
Therefore, the MLE can be obtained by solving
$$0 = s(\hat p \mid X_1,\dots,X_n) = \sum_{i=1}^n \left(\frac{X_i}{\hat p} - \frac{1 - X_i}{1 - \hat p}\right).$$
Multiplying both sides by $\hat p\,(1 - \hat p)$,
$$0 = \sum_{i=1}^n \bigl((1-\hat p)X_i - \hat p\,(1 - X_i)\bigr) = \sum_{i=1}^n (X_i - \hat p) \;\Longrightarrow\; \hat p = \frac{1}{n}\sum_{i=1}^n X_i.$$
Again, the MLE is the sample mean.

Remark: In many problems (such as mixture models), we do not have a closed form for the MLE.
The only way to compute the MLE is via computational methods such as the EM (Expectation-Maximization) algorithm,

which is an iterative ascent approach. However, the EM algorithm may get stuck at a local maximum, so we have to rerun the algorithm many times to find the real MLE (the MLE corresponds to the parameters at the global maximum). In machine learning/data science, how to numerically find the MLE (or approximate it) is an important topic. A common solution is to propose other computationally feasible estimators that are similar to the MLE and switch our target to these new estimators.

3.3 Theory of MLE

The MLE has many appealing properties. Here we focus on one of its most desirable properties: asymptotic normality and the associated asymptotic variance. What is asymptotic normality? It means that the estimator and its target parameter have the following elegant relation:
$$\sqrt{n}\,(\hat\theta - \theta) \overset{D}{\to} N\bigl(0, I^{-1}(\theta)\bigr), \tag{3.2}$$
where $I^{-1}(\theta)$ is called the asymptotic variance; it is a quantity depending only on $\theta$ (and the form of the density function). Simply put, asymptotic normality refers to the case where we have convergence in distribution to a Normal limit centered at the target parameter. Moreover, this asymptotic variance has an elegant form:
$$I(\theta) = E\left(\left(\frac{\partial}{\partial\theta}\log p(X;\theta)\right)^2\right) = E\bigl(s^2(\theta \mid X)\bigr). \tag{3.3}$$
The quantity $I(\theta)$ is called the Fisher information. It plays a key role in both statistical theory and information theory.

Here is a simplified derivation of equations (3.2) and (3.3). Let $X_1,\dots,X_n \sim p_{\theta_0}$, where $\theta_0$ is the parameter generating the random sample. For simplicity, we assume $\theta_0 \in \mathbb{R}$ and that $\theta_0$ satisfies $\theta_0 = \operatorname*{argmax}_\theta E(\ell(\theta \mid X))$. Let $\hat\theta$ be the MLE. Recall that the MLE solves the equation $\sum_{i=1}^n s(\hat\theta \mid X_i) = 0$. Because $\theta_0$ is the maximizer of $E(\ell(\theta \mid X))$, it also satisfies $E(s(\theta_0 \mid X)) = 0$. Now write $f_n(\theta) = \frac{1}{n}\sum_{i=1}^n s(\theta \mid X_i)$, so that $f_n(\hat\theta) = 0$, and consider the following expansion:
$$\frac{1}{n}\sum_{i=1}^n s(\theta_0 \mid X_i) - \underbrace{E(s(\theta_0 \mid X))}_{=0} = f_n(\theta_0) - \underbrace{f_n(\hat\theta)}_{=0} = (\theta_0 - \hat\theta)\, f_n'(\theta^*), \qquad \theta^* \in [\theta_0, \hat\theta] \text{ by the mean value theorem,}$$
$$\approx (\theta_0 - \hat\theta)\, f_n'(\theta_0) = (\theta_0 - \hat\theta)\,\frac{1}{n}\sum_{i=1}^n s'(\theta_0 \mid X_i) \approx (\theta_0 - \hat\theta)\, E\bigl(s'(\theta_0 \mid X_i)\bigr). \tag{3.4}$$
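The expansion above relies on the fact that $E(s(\theta_0 \mid X)) = 0$ at the true parameter. This is easy to sanity-check by Monte Carlo; here is a Python sketch using the Exponential score $s(\lambda \mid x) = 1/\lambda - x$ worked out earlier (the parameter values and sample size are arbitrary):

```python
import random
import statistics

def score_exp(lam, x):
    """Score of Exp(lam) at observation x: d/dlam (log lam - lam*x) = 1/lam - x."""
    return 1.0 / lam - x

random.seed(2)
lam0 = 2.0
xs = [random.expovariate(lam0) for _ in range(100_000)]

# At the true parameter, the average score should be near its expectation 0 ...
mean_score_true = statistics.fmean(score_exp(lam0, x) for x in xs)

# ... while at a wrong parameter it stays bounded away from 0
# (here E[1/1 - X] = 1 - 1/lam0 = 0.5).
mean_score_wrong = statistics.fmean(score_exp(1.0, x) for x in xs)
```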

By the Central Limit Theorem, the score average appearing on the left-hand side of equation (3.4) satisfies
$$\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n s(\theta_0 \mid X_i) - E(s(\theta_0 \mid X))\right) \overset{D}{\to} N\bigl(0, \sigma^2(\theta_0)\bigr), \tag{3.5}$$
where $\sigma^2(\theta_0) = \mathrm{Var}(s(\theta_0 \mid X_i))$. Therefore, rearranging the quantities in equations (3.4) and (3.5),
$$\sqrt{n}\,(\hat\theta - \theta_0) \approx -\frac{1}{E(s'(\theta_0 \mid X_i))}\,\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n s(\theta_0 \mid X_i) - E(s(\theta_0 \mid X))\right) \overset{D}{\to} N\left(0, \frac{\sigma^2(\theta_0)}{\bigl(E(s'(\theta_0 \mid X_i))\bigr)^2}\right). \tag{3.6}$$
Now we have already shown asymptotic normality. The next step is to show that the asymptotic variance satisfies
$$\frac{\sigma^2(\theta_0)}{\bigl(E(s'(\theta_0 \mid X_i))\bigr)^2} = I^{-1}(\theta_0).$$
First, we expand $\sigma^2(\theta_0)$:
$$\sigma^2(\theta_0) = \mathrm{Var}(s(\theta_0 \mid X_i)) = \underbrace{E\bigl(s^2(\theta_0 \mid X_i)\bigr)}_{=I(\theta_0)} - \underbrace{E^2\bigl(s(\theta_0 \mid X_i)\bigr)}_{=0}. \tag{3.7}$$
Second, we expand $E(s'(\theta_0 \mid X_i))$. But first we focus on $s'(\theta_0 \mid X_i)$:
$$s'(\theta_0 \mid X_i) = \frac{\partial^2}{\partial\theta^2}\log p(X_i;\theta_0) = \frac{\frac{\partial^2}{\partial\theta^2}p(X_i;\theta_0)}{p(X_i;\theta_0)} - \left(\frac{\frac{\partial}{\partial\theta}p(X_i;\theta_0)}{p(X_i;\theta_0)}\right)^2 = \frac{\frac{\partial^2}{\partial\theta^2}p(X_i;\theta_0)}{p(X_i;\theta_0)} - s^2(\theta_0 \mid X_i).$$
For the first quantity,
$$E\left(\frac{\frac{\partial^2}{\partial\theta^2}p(X_i;\theta_0)}{p(X_i;\theta_0)}\right) = \int \frac{\frac{\partial^2}{\partial\theta^2}p(x;\theta_0)}{p(x;\theta_0)}\, p(x;\theta_0)\,dx = \frac{\partial^2}{\partial\theta^2}\underbrace{\int p(x;\theta_0)\,dx}_{=1} = 0.$$
Note that because we exchange the positions of the derivative and the integration, we implicitly assume that the support of the density function does not depend on the parameter. Thus,
$$E\bigl(s'(\theta_0 \mid X_i)\bigr) = 0 - E\bigl(s^2(\theta_0 \mid X_i)\bigr) = -I(\theta_0). \tag{3.8}$$
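As a numerical check on the limit in equation (3.6), the following Python sketch simulates many Exponential samples, computes the MLE $\hat\lambda = 1/\bar X_n$ for each, and compares the empirical mean and variance of $\sqrt{n}(\hat\lambda - \lambda_0)$ with the theoretical limit $N(0, \lambda_0^2)$ (for $\mathrm{Exp}(\lambda)$ one has $I(\lambda) = 1/\lambda^2$, so the asymptotic variance is $\lambda_0^2$). The simulation sizes are arbitrary:

```python
import math
import random
import statistics

random.seed(3)
lam0, n, reps = 2.0, 400, 2000

# MLE for Exp(lam) is 1 / sample mean; repeat over many simulated samples.
errors = []
for _ in range(reps):
    xs = [random.expovariate(lam0) for _ in range(n)]
    lam_hat = 1.0 / statistics.fmean(xs)
    errors.append(math.sqrt(n) * (lam_hat - lam0))

# Theory: sqrt(n)(lam_hat - lam0) ~ N(0, lam0^2), so the empirical mean
# should be near 0 and the empirical variance near lam0^2 = 4.
emp_mean = statistics.fmean(errors)
emp_var = statistics.pvariance(errors)
```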

Plugging equations (3.7) and (3.8) into equation (3.6), we conclude
$$\sqrt{n}\,(\hat\theta - \theta_0) \overset{D}{\to} N\left(0, \frac{I(\theta_0)}{I^2(\theta_0)}\right) = N\bigl(0, I^{-1}(\theta_0)\bigr), \tag{3.9}$$
which is the desired result. As a byproduct, we also showed that
$$I(\theta_0) = E\bigl(s^2(\theta_0 \mid X)\bigr) = -E\bigl(s'(\theta_0 \mid X)\bigr).$$
Equation (3.9) provides several useful properties of the MLE:

- The MLE is an asymptotically unbiased estimator of $\theta_0$.
- The mean squared error (MSE) of $\hat\theta$ is
$$\mathrm{MSE}(\hat\theta, \theta_0) = \underbrace{\mathrm{bias}^2(\hat\theta)}_{\approx 0} + \mathrm{Var}(\hat\theta) \approx \frac{1}{n\, I(\theta_0)}.$$
- The estimation error of $\hat\theta$ is asymptotically Normal.
- If we know $I(\theta_0)$, or have an estimate $\hat I(\theta_0)$ of it, we can construct a $1-\alpha$ confidence interval using
$$\left[\hat\theta - \frac{z_{1-\alpha/2}}{\sqrt{n\,\hat I(\theta_0)}},\; \hat\theta + \frac{z_{1-\alpha/2}}{\sqrt{n\,\hat I(\theta_0)}}\right].$$
- If we want to test $H_0: \theta = \theta_0$ versus $H_a: \theta \ne \theta_0$ under significance level $\alpha$, we can first compute $I(\theta_0)$ and then reject the null hypothesis if $\sqrt{n\, I(\theta_0)}\,\bigl|\hat\theta - \theta_0\bigr| \ge z_{1-\alpha/2}$.

Example (Normal distribution). In the example where $X_1,\dots,X_n \sim N(\mu, 1)$, we have seen that $s(\mu \mid X_i) = X_i - \mu$. Thus,
$$I(\mu) = E\bigl(s^2(\mu \mid X_i)\bigr) = E\bigl((X_i - \mu)^2\bigr) = 1,$$
because the variance is $1$. Moreover, $E(s'(\mu \mid X_i)) = E(-1) = -1 = -I(\mu)$, which agrees with the fact that $E(s^2(\mu \mid X_i)) = -E(s'(\mu \mid X_i))$. Another way to check this is to directly compute the variance of the MLE $\hat\mu = \frac{1}{n}\sum_{i=1}^n X_i$, which equals $\frac{1}{n} = \frac{1}{n\, I(\mu)}$. Again, we obtain $I(\mu) = 1$.

Example (Exponential distribution). For the example where $X_1,\dots,X_n \sim \mathrm{Exp}(\lambda)$, the Fisher information is more involved. Because the MLE is $\hat\lambda = 1/\bar X_n$, we cannot directly compute its variance. However, using the Fisher information, we can still obtain its asymptotic variance. Recall that $s(\lambda \mid X_i) = \frac{1}{\lambda} - X_i$. Thus,
$$E\bigl(s^2(\lambda \mid X_i)\bigr) = E\left(\frac{1}{\lambda^2} - \frac{2X_i}{\lambda} + X_i^2\right).$$
For an exponential distribution $\mathrm{Exp}(\lambda)$, the mean is $\frac{1}{\lambda}$ and the variance is $\frac{1}{\lambda^2}$, which implies the second moment $E(X_i^2) = \mathrm{Var}(X_i) + E^2(X_i) = \frac{2}{\lambda^2}$. Putting it all together,
$$I(\lambda) = E\bigl(s^2(\lambda \mid X_i)\bigr) = \frac{1}{\lambda^2} - \frac{2}{\lambda}\cdot\frac{1}{\lambda} + \frac{2}{\lambda^2} = \frac{1}{\lambda^2}.$$
If we instead use $-E(s'(\lambda \mid X_i))$, we obtain $E(s'(\lambda \mid X_i)) = -\frac{1}{\lambda^2} = -I(\lambda),$

which agrees with the previous result.

Remark: We may obtain an approximation of the variance of the MLE $\hat\lambda$ by the delta method.

Remark: Not all MLEs have the above elegant properties. There are MLEs that do not satisfy the assumptions of the theory. For instance, in the example of $X_1,\dots,X_n \sim \mathrm{Uni}[0,\theta]$, the MLE $\hat\theta = X_{(n)}$ does not satisfy the assumptions. One assumption is that the support of the density does not depend on the parameter, but here the support is $[0,\theta]$, which depends on the parameter.

3.4 Simple Linear Regression

Under certain conditions, the least squares estimator (LSE) in linear regression can be framed as an MLE of the regression slope and intercept. Let us recall the least squares estimator of linear regression first. We observe IID bivariate random variables $(X_1,Y_1),\dots,(X_n,Y_n)$ such that
$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i,$$
where $\epsilon_i$ is a mean-$0$ noise independent of $X_i$. We also assume that the marginal distribution of the covariates is $X_1,\dots,X_n \sim p_X$. The LSE finds $\hat\beta_{0,\mathrm{LSE}}, \hat\beta_{1,\mathrm{LSE}}$ by
$$(\hat\beta_{0,\mathrm{LSE}}, \hat\beta_{1,\mathrm{LSE}}) = \operatorname*{argmin}_{\beta_0,\beta_1} \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2. \tag{3.10}$$
In other words, we choose $\hat\beta_{0,\mathrm{LSE}}, \hat\beta_{1,\mathrm{LSE}}$ such that the sum of squared errors is minimized. Now, here is the key assumption that links the MLE and the LSE: we assume that
$$\epsilon_1,\dots,\epsilon_n \sim N(0,\sigma^2). \tag{3.11}$$
Namely, the noises are IID from a mean-$0$ Normal distribution.

Under equation (3.11), what is the MLE of $\beta_0$ and $\beta_1$? Here is a derivation. Let $p_\epsilon(e)$ be the density of the $\epsilon_i$. The joint density of $(X_i, Y_i)$ is
$$p_{XY}(x,y) = p(y \mid x)\, p_X(x) = p_\epsilon(y - \beta_0 - \beta_1 x)\, p_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y - \beta_0 - \beta_1 x)^2}{2\sigma^2}}\, p_X(x).$$
Thus, the log-likelihood function of $\beta_0, \beta_1$ is
$$\ell(\beta_0,\beta_1 \mid X_1,Y_1,\dots,X_n,Y_n) = \sum_{i=1}^n \log p_{XY}(X_i,Y_i) = \underbrace{-\frac{n}{2}\log(2\pi\sigma^2)}_{\text{free of }\beta_0,\beta_1} - \frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2 + \underbrace{\sum_{i=1}^n \log p_X(X_i)}_{\text{free of }\beta_0,\beta_1}.$$
Therefore, only the center term depends on the parameters of interest $\beta_0, \beta_1$. This means that the MLE is also the maximizer of
$$\ell^*(\beta_0,\beta_1 \mid X_1,Y_1,\dots,X_n,Y_n) = -\frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2.$$

Because multiplying $\ell^*(\beta_0,\beta_1 \mid X_1,Y_1,\dots,X_n,Y_n)$ by any positive number will not affect the maximizer, we multiply it by $2\sigma^2$, which leads to the following criterion:
$$\ell^{**}(\beta_0,\beta_1 \mid X_1,Y_1,\dots,X_n,Y_n) = -\sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2.$$
Accordingly, the MLE finds $\hat\beta_{0,\mathrm{MLE}}, \hat\beta_{1,\mathrm{MLE}}$ by maximizing $-\sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2$, which is equivalent to minimizing $\sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2$, the same criterion as equation (3.10). So the LSE and the MLE coincide in this case, and the theory of the MLE applies to the LSE (asymptotic normality, Fisher information, etc.).

Remark: Think about the closed form of $\hat\beta_{0,\mathrm{LSE}}, \hat\beta_{1,\mathrm{LSE}}$ given IID bivariate random variables $(X_1,Y_1),\dots,(X_n,Y_n)$. Also think about whether they converge to the true parameters $\beta_0, \beta_1$ and whether we can derive their asymptotic normality (you may need to use Slutsky's theorem).

3.5 Logistic Regression

Now we consider a special case of the regression problem: the binary response case. We observe IID bivariate random variables $(X_1,Y_1),\dots,(X_n,Y_n)$ such that $Y_i$ takes the values $0$ and $1$. Here the $Y_i$'s are called the response variable and the $X_i$'s are called the covariate. Because $Y_i$ takes the values $0$ and $1$, we can view it as a Bernoulli random variable with parameter $q$, but this parameter $q$ depends on the value of the corresponding covariate $X_i$. Namely, if we observe $X_i = x$, then the probability $P(Y_i = 1 \mid X_i = x) = q(x)$ for some function $q(x)$. Here are some examples of why this model is reasonable.

Example. In graduate school admissions, we may wonder how a student's GPA affects the chance that the applicant receives admission. In this case, each observation is a student and the response variable $Y$ represents whether the student received admission ($Y=1$) or not ($Y=0$). GPA is the covariate $X$. Thus, we can model the probability
$$P(\text{admitted} \mid \text{GPA} = x) = P(Y = 1 \mid X = x) = q(x).$$

Example. In medical research, people often wonder whether the heritability of type-2 diabetes is related to some mutation of a gene. Researchers record whether the subject has type-2 diabetes (response) and measure the mutation signature of genes (covariate $X$).
Thus, the response variable $Y = 1$ if the subject has type-2 diabetes. A statistical model to associate the covariate $X$ and the response $Y$ is through
$$P(\text{subject has type-2 diabetes} \mid \text{mutation signature} = x) = P(Y = 1 \mid X = x) = q(x).$$
Thus, the function $q(x)$ now plays a key role in determining how the response $Y$ and the covariate $X$ are associated. Logistic regression provides a simple and elegant way to characterize the function $q(x)$ in a linear way. Because $q(x)$ represents a probability, it ranges within $[0,1]$, so naively using linear regression will not work. However, consider the following quantity:
$$O(x) = \frac{q(x)}{1 - q(x)} = \frac{P(Y = 1 \mid X = x)}{P(Y = 0 \mid X = x)} \in [0, \infty).$$

The quantity $O(x)$ is called the odds; it measures the contrast between the events $Y = 1$ and $Y = 0$. When the odds is greater than $1$, we have a higher chance of getting $Y = 1$ than $Y = 0$. The odds has an interesting asymmetric form: if $P(Y=1 \mid X=x) = 2P(Y=0 \mid X=x)$, then $O(x) = 2$, but if $P(Y=0 \mid X=x) = 2P(Y=1 \mid X=x)$, then $O(x) = \frac{1}{2}$. To symmetrize the odds, a straightforward approach is to take the (natural) logarithm of it:
$$\log O(x) = \log\frac{q(x)}{1 - q(x)}.$$
This quantity is called the log odds. The log odds has several beautiful properties. For instance, when the two probabilities are the same ($P(Y=1 \mid X=x) = P(Y=0 \mid X=x)$), $\log O(x) = 0$, and
$$P(Y=1 \mid X=x) = 2P(Y=0 \mid X=x) \;\Longrightarrow\; \log O(x) = \log 2,$$
$$P(Y=0 \mid X=x) = 2P(Y=1 \mid X=x) \;\Longrightarrow\; \log O(x) = -\log 2.$$
Logistic regression imposes a linear model on the log odds. Namely, logistic regression models
$$\log O(x) = \log\frac{q(x)}{1 - q(x)} = \beta_0 + \beta_1 x,$$
leading to
$$P(Y = 1 \mid X = x) = q(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}.$$
Thus, the quantity $q(x) = q(x; \beta_0, \beta_1)$ depends on the two parameters $\beta_0, \beta_1$. Here $\beta_0$ behaves like the intercept and $\beta_1$ behaves like the slope (they are the intercept and slope in terms of the log odds).

When we observe data, how can we estimate these two parameters? In general, people use the MLE to estimate them, and here is the likelihood function of logistic regression. Recall that we observe an IID bivariate random sample $(X_1,Y_1),\dots,(X_n,Y_n)$. Let $p_X(x)$ denote the probability density of $X$; note that we will not use it in estimating $\beta_0, \beta_1$. For a given pair $(X_i, Y_i)$, recall that the random variable $Y_i$ given $X_i$ is just a Bernoulli random variable with parameter $q(X_i)$. Thus, the PMF of $Y_i$ given $X_i$ is
$$L(\beta_0,\beta_1 \mid X_i, Y_i) = P(Y = Y_i \mid X_i) = q(X_i)^{Y_i}\bigl(1 - q(X_i)\bigr)^{1 - Y_i} = \left(\frac{e^{\beta_0 + \beta_1 X_i}}{1 + e^{\beta_0 + \beta_1 X_i}}\right)^{Y_i}\left(\frac{1}{1 + e^{\beta_0 + \beta_1 X_i}}\right)^{1 - Y_i} = \frac{e^{\beta_0 Y_i + \beta_1 X_i Y_i}}{1 + e^{\beta_0 + \beta_1 X_i}}.$$
Note that here we construct the likelihood function using only the conditional PMF because, similarly to linear regression, the distribution of the covariate $X$ does not depend on the parameters $\beta_0, \beta_1$.
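The logistic link and the log odds invert each other: $\log O(x) = \beta_0 + \beta_1 x$ exactly when $q(x) = e^{\beta_0 + \beta_1 x}/(1 + e^{\beta_0 + \beta_1 x})$. Here is a tiny Python sketch of this round trip (the function names are made up for illustration):

```python
import math

def log_odds(q):
    """Log odds of a probability q in (0, 1)."""
    return math.log(q / (1.0 - q))

def q_of_x(x, b0, b1):
    """Logistic model: P(Y=1 | X=x) = e^(b0+b1*x) / (1 + e^(b0+b1*x))."""
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))
```

For example, `log_odds(2/3)` recovers $\log 2$, `log_odds(1/3)` recovers $-\log 2$, and `log_odds(q_of_x(x, b0, b1))` recovers $\beta_0 + \beta_1 x$.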
Thus, the log-likelihood function is
$$\ell(\beta_0,\beta_1 \mid X_1,Y_1,\dots,X_n,Y_n) = \sum_{i=1}^n \log L(\beta_0,\beta_1 \mid X_i,Y_i) = \sum_{i=1}^n \log\frac{e^{\beta_0 Y_i + \beta_1 X_i Y_i}}{1 + e^{\beta_0 + \beta_1 X_i}} = \sum_{i=1}^n \Bigl(\beta_0 Y_i + \beta_1 X_i Y_i - \log\bigl(1 + e^{\beta_0 + \beta_1 X_i}\bigr)\Bigr).$$

We can then take derivatives to find the maximizer. However, setting the derivative of the log-likelihood function $\ell(\beta_0,\beta_1 \mid X_1,Y_1,\dots,X_n,Y_n)$ to zero does not yield a closed-form solution, so we cannot write down a simple expression for the estimator. Despite this disadvantage, such a log-likelihood function can be optimized by iterative ascent methods such as Newton-Raphson.
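Although the likelihood equation has no closed-form solution, the Newton-Raphson iteration mentioned above takes only a few lines for a single covariate. Differentiating the log-likelihood gives gradient $\sum_i (Y_i - q(X_i))\,(1, X_i)^\top$ and negative Hessian $\sum_i q(X_i)(1 - q(X_i))\,(1, X_i)(1, X_i)^\top$. The following self-contained Python sketch implements this (the true parameter values and sample size are arbitrary choices for the demonstration; a real analysis would use an established package):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_newton(xs, ys, n_iter=25):
    """Fit (b0, b1) in the logistic model by Newton-Raphson on the log-likelihood."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0            # gradient entries
        h00 = h01 = h11 = 0.0    # negative Hessian entries (2x2, symmetric)
        for x, y in zip(xs, ys):
            q = sigmoid(b0 + b1 * x)
            g0 += y - q
            g1 += (y - q) * x
            w = q * (1.0 - q)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: add (negative Hessian)^{-1} times the gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Simulate from the model with arbitrary true parameters b0 = -1, b1 = 2.
random.seed(4)
xs = [random.gauss(0.0, 1.0) for _ in range(5000)]
ys = [1 if random.random() < sigmoid(-1.0 + 2.0 * x) else 0 for x in xs]
b0_hat, b1_hat = logistic_newton(xs, ys)
```

With a few thousand observations, the fitted $(\hat\beta_0, \hat\beta_1)$ lands close to the generating values, consistent with the MLE theory of Section 3.3.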


1 Models for Matched Pairs 1 Models for Matched Pairs Matched pairs occur whe we aalyse samples such that for each measuremet i oe of the samples there is a measuremet i the other sample that directly relates to the measuremet i

More information

Problem Set 4 Due Oct, 12

Problem Set 4 Due Oct, 12 EE226: Radom Processes i Systems Lecturer: Jea C. Walrad Problem Set 4 Due Oct, 12 Fall 06 GSI: Assae Gueye This problem set essetially reviews detectio theory ad hypothesis testig ad some basic otios

More information

Goodness-of-Fit Tests and Categorical Data Analysis (Devore Chapter Fourteen)

Goodness-of-Fit Tests and Categorical Data Analysis (Devore Chapter Fourteen) Goodess-of-Fit Tests ad Categorical Data Aalysis (Devore Chapter Fourtee) MATH-252-01: Probability ad Statistics II Sprig 2019 Cotets 1 Chi-Squared Tests with Kow Probabilities 1 1.1 Chi-Squared Testig................

More information

Lecture 19: Convergence

Lecture 19: Convergence Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may

More information

Linear Regression Demystified

Linear Regression Demystified Liear Regressio Demystified Liear regressio is a importat subject i statistics. I elemetary statistics courses, formulae related to liear regressio are ofte stated without derivatio. This ote iteds to

More information

Expectation and Variance of a random variable

Expectation and Variance of a random variable Chapter 11 Expectatio ad Variace of a radom variable The aim of this lecture is to defie ad itroduce mathematical Expectatio ad variace of a fuctio of discrete & cotiuous radom variables ad the distributio

More information

Estimation for Complete Data

Estimation for Complete Data Estimatio for Complete Data complete data: there is o loss of iformatio durig study. complete idividual complete data= grouped data A complete idividual data is the oe i which the complete iformatio of

More information

Since X n /n P p, we know that X n (n. Xn (n X n ) Using the asymptotic result above to obtain an approximation for fixed n, we obtain

Since X n /n P p, we know that X n (n. Xn (n X n ) Using the asymptotic result above to obtain an approximation for fixed n, we obtain Assigmet 9 Exercise 5.5 Let X biomial, p, where p 0, 1 is ukow. Obtai cofidece itervals for p i two differet ways: a Sice X / p d N0, p1 p], the variace of the limitig distributio depeds oly o p. Use the

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit Theorems Throughout this sectio we will assume a probability space (, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

Tests of Hypotheses Based on a Single Sample (Devore Chapter Eight)

Tests of Hypotheses Based on a Single Sample (Devore Chapter Eight) Tests of Hypotheses Based o a Sigle Sample Devore Chapter Eight MATH-252-01: Probability ad Statistics II Sprig 2018 Cotets 1 Hypothesis Tests illustrated with z-tests 1 1.1 Overview of Hypothesis Testig..........

More information

Statistical Theory MT 2008 Problems 1: Solution sketches

Statistical Theory MT 2008 Problems 1: Solution sketches Statistical Theory MT 008 Problems : Solutio sketches. Which of the followig desities are withi a expoetial family? Explai your reasoig. a) Let 0 < θ < ad put fx, θ) = θ)θ x ; x = 0,,,... b) c) where α

More information

1 Inferential Methods for Correlation and Regression Analysis

1 Inferential Methods for Correlation and Regression Analysis 1 Iferetial Methods for Correlatio ad Regressio Aalysis I the chapter o Correlatio ad Regressio Aalysis tools for describig bivariate cotiuous data were itroduced. The sample Pearso Correlatio Coefficiet

More information

Section 14. Simple linear regression.

Section 14. Simple linear regression. Sectio 14 Simple liear regressio. Let us look at the cigarette dataset from [1] (available to dowload from joural s website) ad []. The cigarette dataset cotais measuremets of tar, icotie, weight ad carbo

More information

1 Approximating Integrals using Taylor Polynomials

1 Approximating Integrals using Taylor Polynomials Seughee Ye Ma 8: Week 7 Nov Week 7 Summary This week, we will lear how we ca approximate itegrals usig Taylor series ad umerical methods. Topics Page Approximatig Itegrals usig Taylor Polyomials. Defiitios................................................

More information

This section is optional.

This section is optional. 4 Momet Geeratig Fuctios* This sectio is optioal. The momet geeratig fuctio g : R R of a radom variable X is defied as g(t) = E[e tx ]. Propositio 1. We have g () (0) = E[X ] for = 1, 2,... Proof. Therefore

More information

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y

More information

Lecture 3. Properties of Summary Statistics: Sampling Distribution

Lecture 3. Properties of Summary Statistics: Sampling Distribution Lecture 3 Properties of Summary Statistics: Samplig Distributio Mai Theme How ca we use math to justify that our umerical summaries from the sample are good summaries of the populatio? Lecture Summary

More information

Probability and MLE.

Probability and MLE. 10-701 Probability ad MLE http://www.cs.cmu.edu/~pradeepr/701 (brief) itro to probability Basic otatios Radom variable - referrig to a elemet / evet whose status is ukow: A = it will rai tomorrow Domai

More information

Binomial Distribution

Binomial Distribution 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0 1 2 3 4 5 6 7 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Overview Example: coi tossed three times Defiitio Formula Recall that a r.v. is discrete if there are either a fiite umber of possible

More information

Stat 421-SP2012 Interval Estimation Section

Stat 421-SP2012 Interval Estimation Section Stat 41-SP01 Iterval Estimatio Sectio 11.1-11. We ow uderstad (Chapter 10) how to fid poit estimators of a ukow parameter. o However, a poit estimate does ot provide ay iformatio about the ucertaity (possible

More information

The Riemann Zeta Function

The Riemann Zeta Function Physics 6A Witer 6 The Riema Zeta Fuctio I this ote, I will sketch some of the mai properties of the Riema zeta fuctio, ζ(x). For x >, we defie ζ(x) =, x >. () x = For x, this sum diverges. However, we

More information

Hypothesis Testing. Evaluation of Performance of Learned h. Issues. Trade-off Between Bias and Variance

Hypothesis Testing. Evaluation of Performance of Learned h. Issues. Trade-off Between Bias and Variance Hypothesis Testig Empirically evaluatig accuracy of hypotheses: importat activity i ML. Three questios: Give observed accuracy over a sample set, how well does this estimate apply over additioal samples?

More information

Econ 325/327 Notes on Sample Mean, Sample Proportion, Central Limit Theorem, Chi-square Distribution, Student s t distribution 1.

Econ 325/327 Notes on Sample Mean, Sample Proportion, Central Limit Theorem, Chi-square Distribution, Student s t distribution 1. Eco 325/327 Notes o Sample Mea, Sample Proportio, Cetral Limit Theorem, Chi-square Distributio, Studet s t distributio 1 Sample Mea By Hiro Kasahara We cosider a radom sample from a populatio. Defiitio

More information

The standard deviation of the mean

The standard deviation of the mean Physics 6C Fall 20 The stadard deviatio of the mea These otes provide some clarificatio o the distictio betwee the stadard deviatio ad the stadard deviatio of the mea.. The sample mea ad variace Cosider

More information

1.010 Uncertainty in Engineering Fall 2008

1.010 Uncertainty in Engineering Fall 2008 MIT OpeCourseWare http://ocw.mit.edu.00 Ucertaity i Egieerig Fall 2008 For iformatio about citig these materials or our Terms of Use, visit: http://ocw.mit.edu.terms. .00 - Brief Notes # 9 Poit ad Iterval

More information

6.3 Testing Series With Positive Terms

6.3 Testing Series With Positive Terms 6.3. TESTING SERIES WITH POSITIVE TERMS 307 6.3 Testig Series With Positive Terms 6.3. Review of what is kow up to ow I theory, testig a series a i for covergece amouts to fidig the i= sequece of partial

More information

Parameter, Statistic and Random Samples

Parameter, Statistic and Random Samples Parameter, Statistic ad Radom Samples A parameter is a umber that describes the populatio. It is a fixed umber, but i practice we do ot kow its value. A statistic is a fuctio of the sample data, i.e.,

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013 Large Deviatios for i.i.d. Radom Variables Cotet. Cheroff boud usig expoetial momet geeratig fuctios. Properties of a momet

More information

Understanding Samples

Understanding Samples 1 Will Moroe CS 109 Samplig ad Bootstrappig Lecture Notes #17 August 2, 2017 Based o a hadout by Chris Piech I this chapter we are goig to talk about statistics calculated o samples from a populatio. We

More information

10. Comparative Tests among Spatial Regression Models. Here we revisit the example in Section 8.1 of estimating the mean of a normal random

10. Comparative Tests among Spatial Regression Models. Here we revisit the example in Section 8.1 of estimating the mean of a normal random Part III. Areal Data Aalysis 0. Comparative Tests amog Spatial Regressio Models While the otio of relative likelihood values for differet models is somewhat difficult to iterpret directly (as metioed above),

More information

Lecture 12: September 27

Lecture 12: September 27 36-705: Itermediate Statistics Fall 207 Lecturer: Siva Balakrisha Lecture 2: September 27 Today we will discuss sufficiecy i more detail ad the begi to discuss some geeral strategies for costructig estimators.

More information

Discrete Mathematics for CS Spring 2008 David Wagner Note 22

Discrete Mathematics for CS Spring 2008 David Wagner Note 22 CS 70 Discrete Mathematics for CS Sprig 2008 David Wager Note 22 I.I.D. Radom Variables Estimatig the bias of a coi Questio: We wat to estimate the proportio p of Democrats i the US populatio, by takig

More information

Properties and Hypothesis Testing

Properties and Hypothesis Testing Chapter 3 Properties ad Hypothesis Testig 3.1 Types of data The regressio techiques developed i previous chapters ca be applied to three differet kids of data. 1. Cross-sectioal data. 2. Time series data.

More information

Lecture 18: Sampling distributions

Lecture 18: Sampling distributions Lecture 18: Samplig distributios I may applicatios, the populatio is oe or several ormal distributios (or approximately). We ow study properties of some importat statistics based o a radom sample from

More information

HOMEWORK I: PREREQUISITES FROM MATH 727

HOMEWORK I: PREREQUISITES FROM MATH 727 HOMEWORK I: PREREQUISITES FROM MATH 727 Questio. Let X, X 2,... be idepedet expoetial radom variables with mea µ. (a) Show that for Z +, we have EX µ!. (b) Show that almost surely, X + + X (c) Fid the

More information

First Year Quantitative Comp Exam Spring, Part I - 203A. f X (x) = 0 otherwise

First Year Quantitative Comp Exam Spring, Part I - 203A. f X (x) = 0 otherwise First Year Quatitative Comp Exam Sprig, 2012 Istructio: There are three parts. Aswer every questio i every part. Questio I-1 Part I - 203A A radom variable X is distributed with the margial desity: >

More information

Advanced Stochastic Processes.

Advanced Stochastic Processes. Advaced Stochastic Processes. David Gamarik LECTURE 2 Radom variables ad measurable fuctios. Strog Law of Large Numbers (SLLN). Scary stuff cotiued... Outlie of Lecture Radom variables ad measurable fuctios.

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation ECE 645: Estimatio Theory Sprig 2015 Istructor: Prof. Staley H. Cha Maximum Likelihood Estimatio (LaTeX prepared by Shaobo Fag) April 14, 2015 This lecture ote is based o ECE 645(Sprig 2015) by Prof. Staley

More information

AMS570 Lecture Notes #2

AMS570 Lecture Notes #2 AMS570 Lecture Notes # Review of Probability (cotiued) Probability distributios. () Biomial distributio Biomial Experimet: ) It cosists of trials ) Each trial results i of possible outcomes, S or F 3)

More information

The variance of a sum of independent variables is the sum of their variances, since covariances are zero. Therefore. V (xi )= n n 2 σ2 = σ2.

The variance of a sum of independent variables is the sum of their variances, since covariances are zero. Therefore. V (xi )= n n 2 σ2 = σ2. SAMPLE STATISTICS A radom sample x 1,x,,x from a distributio f(x) is a set of idepedetly ad idetically variables with x i f(x) for all i Their joit pdf is f(x 1,x,,x )=f(x 1 )f(x ) f(x )= f(x i ) The sample

More information

CSE 527, Additional notes on MLE & EM

CSE 527, Additional notes on MLE & EM CSE 57 Lecture Notes: MLE & EM CSE 57, Additioal otes o MLE & EM Based o earlier otes by C. Grat & M. Narasimha Itroductio Last lecture we bega a examiatio of model based clusterig. This lecture will be

More information

Convergence of random variables. (telegram style notes) P.J.C. Spreij

Convergence of random variables. (telegram style notes) P.J.C. Spreij Covergece of radom variables (telegram style otes).j.c. Spreij this versio: September 6, 2005 Itroductio As we kow, radom variables are by defiitio measurable fuctios o some uderlyig measurable space

More information

EE 4TM4: Digital Communications II Probability Theory

EE 4TM4: Digital Communications II Probability Theory 1 EE 4TM4: Digital Commuicatios II Probability Theory I. RANDOM VARIABLES A radom variable is a real-valued fuctio defied o the sample space. Example: Suppose that our experimet cosists of tossig two fair

More information

Monte Carlo Integration

Monte Carlo Integration Mote Carlo Itegratio I these otes we first review basic umerical itegratio methods (usig Riema approximatio ad the trapezoidal rule) ad their limitatios for evaluatig multidimesioal itegrals. Next we itroduce

More information

1 of 7 7/16/2009 6:06 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 6. Order Statistics Defiitios Suppose agai that we have a basic radom experimet, ad that X is a real-valued radom variable

More information

Math 525: Lecture 5. January 18, 2018

Math 525: Lecture 5. January 18, 2018 Math 525: Lecture 5 Jauary 18, 2018 1 Series (review) Defiitio 1.1. A sequece (a ) R coverges to a poit L R (writte a L or lim a = L) if for each ǫ > 0, we ca fid N such that a L < ǫ for all N. If the

More information

September 2012 C1 Note. C1 Notes (Edexcel) Copyright - For AS, A2 notes and IGCSE / GCSE worksheets 1

September 2012 C1 Note. C1 Notes (Edexcel) Copyright   - For AS, A2 notes and IGCSE / GCSE worksheets 1 September 0 s (Edecel) Copyright www.pgmaths.co.uk - For AS, A otes ad IGCSE / GCSE worksheets September 0 Copyright www.pgmaths.co.uk - For AS, A otes ad IGCSE / GCSE worksheets September 0 Copyright

More information

Lecture 5: Parametric Hypothesis Testing: Comparing Means. GENOME 560, Spring 2016 Doug Fowler, GS

Lecture 5: Parametric Hypothesis Testing: Comparing Means. GENOME 560, Spring 2016 Doug Fowler, GS Lecture 5: Parametric Hypothesis Testig: Comparig Meas GENOME 560, Sprig 2016 Doug Fowler, GS (dfowler@uw.edu) 1 Review from last week What is a cofidece iterval? 2 Review from last week What is a cofidece

More information

Lecture 9: September 19

Lecture 9: September 19 36-700: Probability ad Mathematical Statistics I Fall 206 Lecturer: Siva Balakrisha Lecture 9: September 9 9. Review ad Outlie Last class we discussed: Statistical estimatio broadly Pot estimatio Bias-Variace

More information

Lecture 15: Learning Theory: Concentration Inequalities

Lecture 15: Learning Theory: Concentration Inequalities STAT 425: Itroductio to Noparametric Statistics Witer 208 Lecture 5: Learig Theory: Cocetratio Iequalities Istructor: Ye-Chi Che 5. Itroductio Recall that i the lecture o classificatio, we have see that

More information

The picture in figure 1.1 helps us to see that the area represents the distance traveled. Figure 1: Area represents distance travelled

The picture in figure 1.1 helps us to see that the area represents the distance traveled. Figure 1: Area represents distance travelled 1 Lecture : Area Area ad distace traveled Approximatig area by rectagles Summatio The area uder a parabola 1.1 Area ad distace Suppose we have the followig iformatio about the velocity of a particle, how

More information

Stat 319 Theory of Statistics (2) Exercises

Stat 319 Theory of Statistics (2) Exercises Kig Saud Uiversity College of Sciece Statistics ad Operatios Research Departmet Stat 39 Theory of Statistics () Exercises Refereces:. Itroductio to Mathematical Statistics, Sixth Editio, by R. Hogg, J.

More information

IIT JAM Mathematical Statistics (MS) 2006 SECTION A

IIT JAM Mathematical Statistics (MS) 2006 SECTION A IIT JAM Mathematical Statistics (MS) 6 SECTION A. If a > for ad lim a / L >, the which of the followig series is ot coverget? (a) (b) (c) (d) (d) = = a = a = a a + / a lim a a / + = lim a / a / + = lim

More information

Output Analysis and Run-Length Control

Output Analysis and Run-Length Control IEOR E4703: Mote Carlo Simulatio Columbia Uiversity c 2017 by Marti Haugh Output Aalysis ad Ru-Legth Cotrol I these otes we describe how the Cetral Limit Theorem ca be used to costruct approximate (1 α%

More information

Naïve Bayes. Naïve Bayes

Naïve Bayes. Naïve Bayes Statistical Data Miig ad Machie Learig Hilary Term 206 Dio Sejdiovic Departmet of Statistics Oxford Slides ad other materials available at: http://www.stats.ox.ac.uk/~sejdiov/sdmml : aother plug-i classifier

More information