Logit regression

Logit regression models the probability of Y = 1 as the cumulative standard logistic distribution function, evaluated at $z = \beta_0 + \beta_1 X$:

$$\Pr(Y = 1 \mid X) = F(\beta_0 + \beta_1 X)$$

F is the cumulative logistic distribution function:

$$F(\beta_0 + \beta_1 X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$$

Logistic regression, ctd.

$$\Pr(Y = 1 \mid X) = F(\beta_0 + \beta_1 X), \quad \text{where } F(\beta_0 + \beta_1 X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$$

Example: $\beta_0 = -3$, $\beta_1 = 2$, $X = .4$,
so $\beta_0 + \beta_1 X = -3 + 2 \times .4 = -2.2$,
so $\Pr(Y = 1 \mid X = .4) = 1/(1 + e^{-(-2.2)}) = .0998$

Why bother with logit if we have probit?
o Historically, numerically convenient
o In practice, very similar to probit
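The arithmetic in the example above can be checked with a few lines of Python. This is only a sketch; the function name `logistic_cdf` is chosen here for illustration and is not from the slides.

```python
import math

def logistic_cdf(z):
    """Cumulative standard logistic distribution function F(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Coefficient and regressor values taken from the example above
beta0, beta1, x = -3.0, 2.0, 0.4

z = beta0 + beta1 * x      # -3 + 2 * 0.4 = -2.2
prob = logistic_cdf(z)     # Pr(Y = 1 | X = 0.4)

print(round(prob, 4))      # 0.0998
```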

Predicted probabilities from estimated probit and logit models usually are very close.
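One way to see why the two models give similar predictions is to compare the two cumulative distribution functions directly. The sketch below assumes a rescaling constant of 1.702, a common rule of thumb that is not taken from the slides, and checks how far the rescaled logistic curve lies from the standard normal curve on a grid. Fitted probit and logit coefficients differ in scale for the same reason.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, Phi(z), written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_cdf(z):
    """Standard logistic CDF, F(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Assumed rescaling constant (rule of thumb, not from the slides)
SCALE = 1.702

# Largest gap between Phi(z) and F(SCALE * z) for z in [-6, 6]
grid = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(normal_cdf(z) - logistic_cdf(SCALE * z)) for z in grid)

print(max_gap < 0.01)
```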

Estimation and Inference in Probit (and Logit) Models (SW Section 9.3)

Probit model: $\Pr(Y = 1 \mid X) = \Phi(\beta_0 + \beta_1 X)$

Estimation and inference
o How to estimate $\beta_0$ and $\beta_1$?
o What is the sampling distribution of the estimators?
o Why can we use the usual methods of inference?

First discuss nonlinear least squares (easier to explain).
Then discuss maximum likelihood estimation (what is actually done in practice).

Probit estimation by nonlinear least squares

Recall OLS:

$$\min_{b_0, b_1} \sum_{i=1}^{n} [Y_i - (b_0 + b_1 X_i)]^2$$

The result is the OLS estimators $\hat{\beta}_0$ and $\hat{\beta}_1$.

In probit, we have a different regression function: the nonlinear probit model. So, we could estimate $\beta_0$ and $\beta_1$ by nonlinear least squares:

$$\min_{b_0, b_1} \sum_{i=1}^{n} [Y_i - \Phi(b_0 + b_1 X_i)]^2$$

Solving this yields the nonlinear least squares estimator of the probit coefficients.

Nonlinear least squares, ctd.

$$\min_{b_0, b_1} \sum_{i=1}^{n} [Y_i - \Phi(b_0 + b_1 X_i)]^2$$

How to solve this minimization problem?
o Calculus doesn't give an explicit solution, since the first-order conditions are nonlinear.
o Must be solved numerically using the computer, e.g. by the "trial and error" method of trying one set of values for $(b_0, b_1)$, then trying another, and another, ...
o Better idea: use specialized minimization algorithms that search more efficiently than trial and error.

In practice, nonlinear least squares isn't used because it isn't efficient; an estimator with a smaller variance is ...
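The "trial and error" idea can be sketched as a brute-force grid search over $(b_0, b_1)$. The data, the grid bounds, and the names `ssr`, `b0_nls`, and `b1_nls` below are all hypothetical, made up for illustration. A real application would use a specialized minimizer rather than a grid.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical data, made up for illustration only
X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

def ssr(b0, b1):
    """Nonlinear least squares objective: sum of [Y_i - Phi(b0 + b1 * X_i)]^2."""
    return sum((y - normal_cdf(b0 + b1 * x)) ** 2 for x, y in zip(X, Y))

# Trial and error: try every (b0, b1) pair on a coarse grid (step 0.1)
candidates = [(i / 10.0, j / 10.0)
              for i in range(-50, 51)      # b0 from -5.0 to 5.0
              for j in range(-50, 101)]    # b1 from -5.0 to 10.0

# Keep the pair with the smallest sum of squared residuals
b0_nls, b1_nls = min(candidates, key=lambda b: ssr(*b))

print(b0_nls, b1_nls, ssr(b0_nls, b1_nls))
```

The grid evaluates the objective more than 15,000 times for a two-parameter problem, which is why the slide recommends algorithms that search more efficiently.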

Probit estimation by maximum likelihood

The likelihood function is the conditional density of $Y_1, \ldots, Y_n$ given $X_1, \ldots, X_n$, treated as a function of the unknown parameters $\beta_0$ and $\beta_1$.
o The maximum likelihood estimator (MLE) is the value of $(\beta_0, \beta_1)$ that maximizes the likelihood function.
o The MLE is the value of $(\beta_0, \beta_1)$ that best describes the full distribution of the data.
o Like the nonlinear least squares estimator, the MLE of $\beta_0$ and $\beta_1$ in the probit and logit models must be found numerically.

In large samples, the MLE is:
o consistent,
o normally distributed, and
o efficient (has the smallest variance of all estimators)

Special case: the probit MLE with no X

Y = 1 with probability p, Y = 0 with probability 1 - p (Bernoulli distribution)

Data: $Y_1, \ldots, Y_n$, i.i.d.

Derivation of the likelihood starts with the density of $Y_1$:

$$\Pr(Y_1 = 1) = p \quad \text{and} \quad \Pr(Y_1 = 0) = 1 - p$$

so

$$\Pr(Y_1 = y_1) = p^{y_1} (1 - p)^{1 - y_1}$$

For $y_1 = 0$: $\Pr(Y_1 = 0) = p^{y_1} (1 - p)^{1 - y_1} = p^0 (1 - p)^{1 - 0} = 1 - p$

For $y_1 = 1$: $\Pr(Y_1 = 1) = p^{y_1} (1 - p)^{1 - y_1} = p^1 (1 - p)^{1 - 1} = p$

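Joint density of $(Y_1, Y_2)$:

Because $Y_1$ and $Y_2$ are independent,

$$\Pr(Y_1 = y_1, Y_2 = y_2) = \Pr(Y_1 = y_1) \times \Pr(Y_2 = y_2) = [p^{y_1} (1 - p)^{1 - y_1}] [p^{y_2} (1 - p)^{1 - y_2}]$$

Joint density of $(Y_1, \ldots, Y_n)$:

$$\Pr(Y_1 = y_1, Y_2 = y_2, \ldots, Y_n = y_n) = [p^{y_1} (1 - p)^{1 - y_1}] [p^{y_2} (1 - p)^{1 - y_2}] \cdots [p^{y_n} (1 - p)^{1 - y_n}] = p^{\sum_{i=1}^{n} y_i} (1 - p)^{n - \sum_{i=1}^{n} y_i}$$

The likelihood is the joint density, treated as a function of the unknown parameters, which here is p:

$$f(p; Y_1, \ldots, Y_n) = p^{\sum_{i=1}^{n} Y_i} (1 - p)^{n - \sum_{i=1}^{n} Y_i}$$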
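As a quick numerical check of the algebra above, the product of the individual Bernoulli densities should equal the closed form $p^{\sum Y_i} (1 - p)^{n - \sum Y_i}$. The sample `Y` and the trial value `p` below are hypothetical.

```python
import math

# Hypothetical i.i.d. Bernoulli draws and a trial parameter value
Y = [1, 0, 1, 1, 0, 1, 0, 1]
p = 0.6

n = len(Y)   # sample size
s = sum(Y)   # number of 1's

# Likelihood as a product of the n individual densities p^y * (1 - p)^(1 - y)
product_form = math.prod(p ** y * (1 - p) ** (1 - y) for y in Y)

# Likelihood in the collapsed form p^s * (1 - p)^(n - s)
closed_form = p ** s * (1 - p) ** (n - s)

print(math.isclose(product_form, closed_form))   # True
```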

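The MLE maximizes the likelihood. It's standard to work with the log likelihood, $\ln[f(p; Y_1, \ldots, Y_n)]$. (The parameters that maximize the likelihood function also maximize the log likelihood function, but the latter is more convenient to work with.)

$$\ln[f(p; Y_1, \ldots, Y_n)] = \left(\sum_{i=1}^{n} Y_i\right) \ln(p) + \left(n - \sum_{i=1}^{n} Y_i\right) \ln(1 - p)$$

$$\frac{d \ln f(p; Y_1, \ldots, Y_n)}{dp} = \left(\sum_{i=1}^{n} Y_i\right) \frac{1}{p} + \left(n - \sum_{i=1}^{n} Y_i\right) \left(\frac{-1}{1 - p}\right) = 0$$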

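Solving for p yields the MLE; that is, $\hat{p}^{MLE}$ satisfies

$$\left(\sum_{i=1}^{n} Y_i\right) \frac{1}{\hat{p}^{MLE}} + \left(n - \sum_{i=1}^{n} Y_i\right) \left(\frac{-1}{1 - \hat{p}^{MLE}}\right) = 0$$

or

$$\left(\sum_{i=1}^{n} Y_i\right) \frac{1}{\hat{p}^{MLE}} = \left(n - \sum_{i=1}^{n} Y_i\right) \frac{1}{1 - \hat{p}^{MLE}}$$

or

$$\frac{\bar{Y}}{1 - \bar{Y}} = \frac{\hat{p}^{MLE}}{1 - \hat{p}^{MLE}}$$

or

$$\hat{p}^{MLE} = \bar{Y} = \text{fraction of 1's}$$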
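The closed-form result can be confirmed numerically: maximizing the Bernoulli log likelihood over a fine grid of p values should return the sample mean. The sample below is hypothetical, and the grid search stands in for the calculus only as an illustration.

```python
import math

# Hypothetical i.i.d. Bernoulli sample
Y = [1, 0, 1, 1, 0, 1, 0, 1]
n = len(Y)
s = sum(Y)

def log_likelihood(p):
    """Bernoulli log likelihood: s * ln(p) + (n - s) * ln(1 - p)."""
    return s * math.log(p) + (n - s) * math.log(1.0 - p)

# Candidate values 0.001, 0.002, ..., 0.999 (endpoints excluded to avoid ln(0))
grid = [k / 1000.0 for k in range(1, 1000)]

p_grid = max(grid, key=log_likelihood)   # numerical maximizer
y_bar = s / n                            # analytical MLE: fraction of 1's

print(p_grid, y_bar)   # 0.625 0.625
```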

The MLE in the "no-X" case (Bernoulli distribution):

$$\hat{p}^{MLE} = \bar{Y} = \text{fraction of 1's}$$

For $Y_i$ i.i.d. Bernoulli, the MLE is the "natural" estimator of p, the fraction of 1's, which is $\bar{Y}$.

We already know the essentials of inference:
o For large n, the sampling distribution of $\hat{p}^{MLE} = \bar{Y}$ is normal.
o Thus inference is "as usual": hypothesis testing via the t-statistic, 95% confidence interval as $\hat{p}^{MLE} \pm 1.96\,SE$.
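A minimal sketch of this "as usual" inference follows. The sample is hypothetical, and the standard error formula $\sqrt{\hat{p}(1 - \hat{p})/n}$ is the standard one for a sample proportion; it is an assumption added here, since the slide does not spell it out. The null value 0.5 is likewise chosen only for illustration.

```python
import math

# Hypothetical sample of n = 200 Bernoulli draws
Y = [1, 0, 1, 1, 0, 1, 0, 1] * 25
n = len(Y)

p_hat = sum(Y) / n   # MLE = sample mean = fraction of 1's

# Assumed standard error of a sample proportion (not stated on the slide)
se = math.sqrt(p_hat * (1.0 - p_hat) / n)

# 95% confidence interval: p_hat +/- 1.96 * SE
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)

# t-statistic for the illustrative null hypothesis H0: p = 0.5
t_stat = (p_hat - 0.5) / se

print(p_hat, se, ci, t_stat)
```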

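The probit likelihood with one X

The derivation starts with the density of $Y_1$, given $X_1$:

$$\Pr(Y_1 = 1 \mid X_1) = \Phi(\beta_0 + \beta_1 X_1)$$

$$\Pr(Y_1 = 0 \mid X_1) = 1 - \Phi(\beta_0 + \beta_1 X_1)$$

so

$$\Pr(Y_1 = y_1 \mid X_1) = \Phi(\beta_0 + \beta_1 X_1)^{y_1} [1 - \Phi(\beta_0 + \beta_1 X_1)]^{1 - y_1}$$

The probit likelihood function is the joint density of $Y_1, \ldots, Y_n$ given $X_1, \ldots, X_n$, treated as a function of $\beta_0$, $\beta_1$: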

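$$f(\beta_0, \beta_1; Y_1, \ldots, Y_n \mid X_1, \ldots, X_n) = \{\Phi(\beta_0 + \beta_1 X_1)^{Y_1} [1 - \Phi(\beta_0 + \beta_1 X_1)]^{1 - Y_1}\} \times \cdots \times \{\Phi(\beta_0 + \beta_1 X_n)^{Y_n} [1 - \Phi(\beta_0 + \beta_1 X_n)]^{1 - Y_n}\}$$

o Can't solve for the maximum explicitly
o Must maximize using numerical methods

As in the case of no X, in large samples:
o $\hat{\beta}_0^{MLE}$, $\hat{\beta}_1^{MLE}$ are consistent
o $\hat{\beta}_0^{MLE}$, $\hat{\beta}_1^{MLE}$ are normally distributed
o Their standard errors can be computed
o Testing, confidence intervals proceed as usual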
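To make "maximize using numerical methods" concrete, here is a rough sketch that evaluates the probit log likelihood on a grid and keeps the best pair. The data, grid bounds, and names (`log_likelihood`, `b0_mle`, `b1_mle`) are hypothetical. A grid search is a crude stand-in for the specialized optimizers that statistical software actually uses.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical data, made up for illustration only
X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

def log_likelihood(b0, b1):
    """Probit log likelihood: sum of Y_i ln(Phi_i) + (1 - Y_i) ln(1 - Phi_i)."""
    total = 0.0
    for x, y in zip(X, Y):
        p = normal_cdf(b0 + b1 * x)
        p = min(max(p, 1e-12), 1.0 - 1e-12)   # guard against ln(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Grid of candidate (b0, b1) pairs with step 0.05
candidates = [(i / 20.0, j / 20.0)
              for i in range(-60, 61)      # b0 from -3.0 to 3.0
              for j in range(-60, 121)]    # b1 from -3.0 to 6.0

# The grid point with the highest log likelihood approximates the MLE
b0_mle, b1_mle = max(candidates, key=lambda b: log_likelihood(*b))

print(b0_mle, b1_mle, log_likelihood(b0_mle, b1_mle))
```

Working with the log likelihood rather than the product itself avoids multiplying many numbers below 1, which becomes numerically fragile as n grows.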

The logit likelihood with one X

The only difference between probit and logit is the functional form used for the probability: $\Phi$ is replaced by the cumulative logistic function. Otherwise, the likelihood is similar; for details see SW App. 9.2.

As with probit,
o $\hat{\beta}_0^{MLE}$, $\hat{\beta}_1^{MLE}$ are consistent
o $\hat{\beta}_0^{MLE}$, $\hat{\beta}_1^{MLE}$ are normally distributed
o Their standard errors can be computed
o Testing, confidence intervals proceed as usual
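The point that only the functional form changes can be sketched by writing one log likelihood function that takes the CDF as an argument. The data and the coefficient values passed in below are hypothetical and are not estimates.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, used by probit."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_cdf(z):
    """Standard logistic CDF, used by logit."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(cdf, b0, b1, X, Y):
    """Binary-response log likelihood; `cdf` is the only model-specific part."""
    total = 0.0
    for x, y in zip(X, Y):
        p = cdf(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Hypothetical data, made up for illustration only
X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Y = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

# Same function, different CDF; the coefficient values are arbitrary trial values
ll_probit = log_likelihood(normal_cdf, -1.0, 2.0, X, Y)
ll_logit = log_likelihood(logistic_cdf, -1.7, 3.4, X, Y)

print(ll_probit, ll_logit)
```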

Measures of fit

The $R^2$ and $\bar{R}^2$ don't make sense here (why?). So, two other specialized measures are used:

1. The fraction correctly predicted = fraction of Y's for which the predicted probability is >50% (if $Y_i = 1$) or is <50% (if $Y_i = 0$).

2. The pseudo-$R^2$ measures the fit using the likelihood function: it measures the improvement in the value of the log likelihood, relative to having no X's (see SW App. 9.2). This simplifies to the $R^2$ in the linear model with normally distributed errors.
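Both measures can be sketched in a few lines. The outcomes `Y` and fitted probabilities `P` below are hypothetical, not output from an estimated model. The pseudo-$R^2$ is computed here as 1 minus the ratio of the model log likelihood to the no-X (Bernoulli) log likelihood, which is one common form (McFadden's); treat that formula as an assumption and see SW App. 9.2 for the textbook definition.

```python
import math

# Hypothetical outcomes and hypothetical fitted values of Pr(Y = 1 | X)
Y = [0, 0, 1, 0, 1, 1, 0, 1]
P = [0.2, 0.3, 0.4, 0.1, 0.7, 0.8, 0.6, 0.9]
n = len(Y)

# 1. Fraction correctly predicted: P > 0.5 when Y = 1, or P < 0.5 when Y = 0
correct = sum(1 for y, p in zip(Y, P)
              if (y == 1 and p > 0.5) or (y == 0 and p < 0.5))
fraction_correct = correct / n

# 2. Pseudo-R^2 (assumed McFadden form): 1 - lnL(model) / lnL(no X's)
ll_model = sum(y * math.log(p) + (1 - y) * math.log(1.0 - p)
               for y, p in zip(Y, P))
y_bar = sum(Y) / n   # Bernoulli MLE when there are no X's
ll_null = sum(y * math.log(y_bar) + (1 - y) * math.log(1.0 - y_bar)
              for y in Y)
pseudo_r2 = 1.0 - ll_model / ll_null

print(fraction_correct, round(pseudo_r2, 3))   # 0.75 0.422
```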

Summary: distribution of the MLE

o The MLE is normally distributed for large n.
o We worked through this result in detail for the probit model with no X's (the Bernoulli distribution).
o For large n, confidence intervals and hypothesis testing proceed as usual.
o If the model is correctly specified, the MLE is efficient, that is, it has a smaller large-n variance than all other estimators (we didn't show this).
o These methods extend to other models with discrete dependent variables, for example count data (# crimes/day); see SW App. 9.2.