Statistics BI/E lecture
Niels Richard Hansen (Univ. Copenhagen)
January 12
1 Example

The logistic regression model is thus a glm model with canonical link function, so that the log-odds equals the linear predictor, that is
\[
\log \frac{p}{1-p} = \beta_0 + \beta_1 f_1(y_1) + \dots + \beta_d f_d(y_d).
\]
The probit link is the inverse of the distribution function for the normal distribution; that is, with
\[
\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left(-\frac{y^2}{2}\right) \, dy,
\]
the success probability is related to the linear predictor via
\[
p = \Phi(\eta),
\]
and the link function is $\ell = \Phi^{-1}$.
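To make the two link functions concrete, here is a minimal Python sketch (my own illustration, not part of the slides) of how a linear predictor $\eta$ is mapped to a success probability under the logistic and probit models. Only the standard library is used, and $\Phi$ is computed via the error function; the helper names are made up.

```python
import math

def logistic(eta):
    """Inverse logit link: p = exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

def probit(eta):
    """Inverse probit link: p = Phi(eta), the standard normal CDF."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def logit(p):
    """Logit link: the log-odds log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

eta = 0.7                            # some value of the linear predictor
print(logistic(eta), probit(eta))    # two different success probabilities
print(logit(logistic(eta)))          # recovers eta, since logit is the inverse link
```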

2 Example

The Poisson point probabilities are
\[
p(x) = e^{-\lambda} \frac{\lambda^x}{x!} = e^{x \log(\lambda) - \lambda - \log(x!)},
\]
thus we have $\theta = \log(\lambda)$, $b(\theta) = e^{\theta}$ and $c(x) = \log(x!)$. The mean and variance are computed as
\[
EX = \frac{db}{d\theta}(\theta) = e^{\theta} = \lambda
\]
and
\[
VX = \frac{d^2 b}{d\theta^2}(\theta) = e^{\theta} = \lambda.
\]
The canonical link function is $\log$, and when the log of the mean is a linear combination of the covariates we often talk about a log-linear model.
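As a quick numerical check of this exponential-family representation (again my own sketch, not from the lecture), the following compares the standard Poisson point probabilities with $\exp(x\theta - b(\theta) - c(x))$ for $\theta = \log\lambda$, $b(\theta) = e^\theta$ and $c(x) = \log(x!)$.

```python
import math

lam = 3.5
theta = math.log(lam)          # canonical parameter
b = math.exp(theta)            # b(theta) = e^theta = lambda

for x in range(6):
    p_direct = math.exp(-lam) * lam**x / math.factorial(x)
    c = math.log(math.factorial(x))                    # c(x) = log(x!)
    p_expfam = math.exp(x * theta - b - c)             # exponential-family form
    print(x, round(p_direct, 6), round(p_expfam, 6))   # the two columns agree
```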

3 Classification problems

Medical diagnosis. Drug tests. Spam detection. Mail sorting based on automatic reading of postal codes. Gene discovery. ...

4 Prediction problems

Stock prices. Prices in general/interest rates. Temperature increase. Technical prediction problems, e.g. ELISA (calibration). Survival times given diagnosis. ...

5 Models behind prediction

Methods based on probabilistic modeling are often called generative modeling: we construct models that can actually generate data, as opposed to pure prediction methods that produce mathematics/algorithms that predict but cannot generate data.

Pure prediction: Fit a least squares regression line to data and use it for prediction. The line cannot generate data.

Model based prediction: Formulate the generative model, including the i.i.d. errors $\varepsilon$, say, and their distribution. We can still use the line for prediction, but given that the model is adequate we can also use the model for addressing questions such as the performance of the method.

6 Prediction and classification

We want to predict the value of an unobserved random variable $Y$ given that $X = x$. For this to be sensible, $X$ and $Y$ should certainly not be independent. What we need is the conditional distribution of $Y$ given $X = x$, which tells us precisely which values $Y$ could take.

If the joint density (or point probabilities) is $f : E_1 \times E_2 \to (0, \infty)$, then the conditional distribution of $Y$ given $X = x$ has conditional density (or conditional point probabilities)
\[
f(y \mid x) = \frac{f(x, y)}{f_1(x)} \qquad (1)
\]
where $f_1(x) = \int_{E_2} f(x, y) \, dy$ is the density for the marginal distribution of $X$.
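A tiny discrete illustration of (1), with invented numbers (my own sketch): build the marginal $f_1(x)$ by summing the joint point probabilities over $y$, then divide.

```python
# Joint point probabilities f(x, y) on E_1 = {0, 1}, E_2 = {0, 1} (made-up numbers).
joint = {(0, 0): 0.10, (0, 1): 0.30,
         (1, 0): 0.40, (1, 1): 0.20}

def marginal_x(x):
    """f_1(x) = sum over y of f(x, y)."""
    return sum(p for (xx, y), p in joint.items() if xx == x)

def conditional(y, x):
    """f(y | x) = f(x, y) / f_1(x)."""
    return joint[(x, y)] / marginal_x(x)

print(conditional(1, 0))   # 0.30 / 0.40 = 0.75
print(conditional(1, 1))   # 0.20 / 0.60 = 0.333...
```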

7 Classification

If $E_2$ is discrete while $E_1$ is continuous or discrete we often talk about classification, in particular if $E_2 = \{0, 1\}$. If the conditional distribution of $X$ given $Y = y$ has density $g(x \mid y)$, the marginal distribution of $X$ is
\[
\begin{aligned}
P(X \in A) &= \sum_{y \in E_2} P(X \in A, Y = y) \\
           &= \sum_{y \in E_2} P(X \in A \mid Y = y) f_2(y) \\
           &= \sum_{y \in E_2} \int_A g(x \mid y) \, dx \, f_2(y) \\
           &= \int_A \sum_{y \in E_2} f_2(y) g(x \mid y) \, dx.
\end{aligned}
\]

8 Classification

The previous computations show that the marginal distribution of $X$ has density
\[
f_1(x) = \sum_{y \in E_2} f_2(y) g(x \mid y).
\]
The conditional distribution of $Y$ given $X = x$ has point probabilities
\[
f(y \mid x) = \frac{f_2(y) g(x \mid y)}{f_1(x)}.
\]
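A minimal sketch of these two formulas for a hypothetical two-class problem with univariate normal class densities (all parameter values are invented for illustration):

```python
import math

def norm_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

prior = {0: 0.7, 1: 0.3}                               # f_2(y)
class_density = {0: lambda x: norm_pdf(x, 0.0, 1.0),   # g(x | 0)
                 1: lambda x: norm_pdf(x, 2.0, 1.0)}   # g(x | 1)

def posterior(y, x):
    """f(y | x) = f_2(y) g(x | y) / f_1(x) with f_1(x) = sum_y f_2(y) g(x | y)."""
    f1 = sum(prior[k] * class_density[k](x) for k in prior)
    return prior[y] * class_density[y](x) / f1

x = 1.2
print(posterior(0, x), posterior(1, x))   # the two posteriors sum to one
```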

9 Maximum a posteriori predictor

If we know the conditional distribution of $Y$ given $X = x$, the maximum a posteriori predictor is the predictor
\[
\hat{y}(x) = \arg\max_y f(y \mid x) = \arg\max_y \frac{f_2(y) g(x \mid y)}{f_1(x)}.
\]
Since $f_1(x)$ does not depend on $y$, the maximum a posteriori predictor also equals
\[
\hat{y}(x) = \arg\max_y f_2(y) g(x \mid y) = \arg\max_y \{\log f_2(y) + \log g(x \mid y)\}.
\]
Mathematically this is very close to interpreting $y$ as a parameter, computing the log-likelihood function $\log g(x \mid y)$, and adding a penalty term $\log f_2(y)$. However, a parameter is fixed when the experiment is repeated, whereas a random variable is not.

10 Example

If $E_2 = \mathbb{R}$ and the conditional distribution of $Y$ given $X = x$ is a normal distribution $N(g(x), \sigma^2)$, then
\[
f(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - g(x))^2}{2\sigma^2}\right).
\]
Thus the maximum a posteriori predictor is
\[
\hat{y}(x) = \arg\max_{y \in \mathbb{R}} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - g(x))^2}{2\sigma^2}\right)
           = \arg\min_{y \in \mathbb{R}} (y - g(x))^2 = g(x).
\]
Thus the maximum a posteriori predictor is the conditional expectation $g(x)$ of $Y$ given $X = x$.

11 Statistical decision theory

With a loss function $L : E_2 \times E_2 \to [0, \infty)$ and $\hat{y} : E_1 \to E_2$ any predictor, $L(y, \hat{y}(x))$ is the loss of predicting $\hat{y}(x)$ when $Y = y$. The distribution of the losses is summarized by the expected prediction error
\[
EPE(\hat{y}) = E(L(Y, \hat{y}(X))).
\]
We define an optimal predictor as a predictor that minimizes $EPE$ over the set of possible predictors.

12 Example

The squared error loss is $L(y, \hat{y}) = (y - \hat{y})^2$. The optimal predictor using the squared error loss is the conditional expectation of $Y$ given $X = x$ (see notes for computations).
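A small Monte Carlo sanity check of this claim (my own sketch), under an invented model with $E(Y \mid X = x) = 2x$ and standard normal noise; the expected squared error loss at a fixed $x$ is smallest when we predict the conditional mean.

```python
import random

random.seed(1)
x = 1.5
cond_mean = 2.0 * x                      # E(Y | X = x) in the invented model
samples = [cond_mean + random.gauss(0.0, 1.0) for _ in range(100_000)]

def expected_loss(c):
    """Monte Carlo estimate of E((Y - c)^2 | X = x) for a constant prediction c."""
    return sum((y - c) ** 2 for y in samples) / len(samples)

for c in (cond_mean - 1.0, cond_mean, cond_mean + 1.0):
    print(c, round(expected_loss(c), 3))   # smallest at c = cond_mean
```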

13 Classification

With $E_2 = \{1, \dots, n\}$ we talk about classification, and the values of the losses, $L(y, \hat{y})$, can be organized as the matrix
\[
L = \begin{pmatrix}
L(1, 1) & L(1, 2) & \dots & L(1, n) \\
\vdots  &         &       & \vdots  \\
L(n, 1) & L(n, 2) & \dots & L(n, n)
\end{pmatrix},
\]
usually with diagonal elements being $0$ and off-diagonal elements $> 0$. Then
\[
EPE(\hat{y}) = E(L(Y, \hat{y}(X)))
             = \int \left( \sum_y L(y, \hat{y}(x)) f(y \mid x) \right) f_1(x) \, dx.
\]
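For a finite $E_2$ the inner sum over $y$ can be evaluated directly. The sketch below (hypothetical loss matrix and posterior, chosen only for illustration) computes the expected loss of each possible prediction at a single $x$; note that with an asymmetric loss the best prediction need not be the most probable class.

```python
# Loss matrix L[y][yhat] for E_2 = {1, 2, 3}: zero on the diagonal, positive off it.
loss = {1: {1: 0, 2: 1, 3: 4},
        2: {1: 1, 2: 0, 3: 2},
        3: {1: 4, 2: 2, 3: 0}}

posterior = {1: 0.5, 2: 0.3, 3: 0.2}     # f(y | x) at one fixed x (made up)

def expected_loss(yhat):
    """sum_y L(y, yhat) f(y | x) -- the inner sum in EPE."""
    return sum(loss[y][yhat] * posterior[y] for y in posterior)

best = min(posterior, key=expected_loss)
print({yh: expected_loss(yh) for yh in (1, 2, 3)}, "best:", best)
# Here the best prediction is 2, even though class 1 has the largest posterior.
```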

14 Zero-one loss function

The zero-one loss function is given by $L(i, j) = 1$ whenever $i \neq j$ and $L(i, i) = 0$. Then $L(Y, \hat{y}(X))$ is a Bernoulli variable, and $EPE(\hat{y})$ equals the probability that the loss equals $1$. Using the zero-one loss function the expected prediction error is the probability of misclassification:
\[
EPE(\hat{y}) = P(Y \neq \hat{y}(X)) = \int \sum_{y \neq \hat{y}(x)} f(y \mid x) \, f_1(x) \, dx.
\]

15 Zero-one loss function

$EPE(\hat{y})$ is minimized by minimizing
\[
\sum_{y \neq \hat{y}(x)} f(y \mid x) = 1 - f(\hat{y}(x) \mid x)
\]
for every $x \in E_1$. The optimal choice of $\hat{y}$ is
\[
\hat{y}(x) = \arg\max_y f(y \mid x),
\]
which is precisely the maximum a posteriori predictor. In this context it is also known as the Bayes classifier, and $EPE(\hat{y}) = P(Y \neq \hat{y}(X))$ (the probability of misclassification) is called the Bayes rate.
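To attach a number to the Bayes rate, here is a sketch (my own, with two invented normal class densities and equal priors) that evaluates the misclassification probability of the Bayes classifier by a simple Riemann sum.

```python
import math

def norm_pdf(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

pi = {0: 0.5, 1: 0.5}                           # prior point probabilities f_2(y)
g = {0: lambda x: norm_pdf(x, 0.0, 1.0),        # g(x | 0)
     1: lambda x: norm_pdf(x, 2.0, 1.0)}        # g(x | 1)

def bayes_classifier(x):
    """Maximum a posteriori prediction: maximize pi_y * g(x | y) over y."""
    return max((0, 1), key=lambda y: pi[y] * g[y](x))

def bayes_rate(lo=-10.0, hi=12.0, n=20_000):
    """Integrate sum_{y != yhat(x)} f_2(y) g(x | y) over x with a midpoint rule."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        yhat = bayes_classifier(x)
        total += sum(pi[y] * g[y](x) for y in (0, 1) if y != yhat) * h
    return total

print(round(bayes_rate(), 4))   # misclassification probability of the Bayes classifier
```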

16 Confusion matrix

With $E_2 = \{0, 1\}$ we denote the marginal distribution of $Y$ by $\pi$ and write $\pi_0$ and $\pi_1$ for the point probabilities. The Bayes classifier divides the sample space $E_1$ into the two sets
\[
G_0 = \left\{ x \in E_1 \;\middle|\; \frac{g(x \mid 0)}{g(x \mid 1)} > \frac{\pi_1}{\pi_0} \right\}
\]
and
\[
G_1 = \left\{ x \in E_1 \;\middle|\; \frac{g(x \mid 0)}{g(x \mid 1)} < \frac{\pi_1}{\pi_0} \right\}.
\]
The confusion matrix is

                        Predicted y
                        0        1
    Observed y    0     p_00     p_01
                  1     p_10     p_11

where $p_{ij} = P(Y = i, \hat{y}(X) = j)$ (Bayes rate $= p_{01} + p_{10}$).

17 Estimation

Problem: Statistical decision theory is based on knowledge of the probability generating mechanism, which we never have.

One solution: Estimate the probability measure and compute the resulting optimal predictor (the plug-in principle again).

Keep in mind: We lose optimality when estimating. The estimated predictor may be good, but essentially we are not able to tell. A central question is how the estimation affects the performance of the resulting predictor.

18 Example

If $E_1 = \mathbb{R}$ and $P(X \leq x, Y = y) = \pi_y G_y(x)$, $y = 0, 1$, then
\[
P(X \leq x) = \pi_0 G_0(x) + \pi_1 G_1(x).
\]
If $g(x \mid y)$ is a density for the conditional distribution, the marginal distribution of $X$ has density
\[
f_1(x) = \pi_0 g(x \mid 0) + \pi_1 g(x \mid 1).
\]
If the conditional distributions are $N(\mu(0), \sigma^2)$ and $N(\mu(1), \sigma^2)$, respectively, then the boundary between $G_0$ and $G_1$ is the point
\[
\frac{2\sigma^2 (\log \pi_0 - \log \pi_1) + \mu(1)^2 - \mu(0)^2}{2(\mu(1) - \mu(0))}.
\]
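A short check of the boundary formula with invented parameter values (my own sketch); at the boundary point the two weighted class densities $\pi_y g(x \mid y)$ coincide, so the posterior probability of each class is $1/2$.

```python
import math

def norm_pdf(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

pi0, pi1 = 0.6, 0.4
mu0, mu1, sigma2 = 0.0, 2.0, 1.0

# Boundary point from the formula on the slide.
boundary = (2 * sigma2 * (math.log(pi0) - math.log(pi1)) + mu1**2 - mu0**2) \
           / (2 * (mu1 - mu0))

print(boundary)
# At the boundary the two terms pi_y * g(x | y) coincide, so the posterior is 1/2.
print(pi0 * norm_pdf(boundary, mu0, sigma2), pi1 * norm_pdf(boundary, mu1, sigma2))
```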

19 Example

Consider the logistic regression model of $Y$ given $X$. We obtain the estimates $\hat{\alpha}$ and $\hat{\beta}$, and the estimated point probability $\hat{p}(x)$ of $Y = 1$ given $X = x$ is
\[
\hat{p}(x) = \frac{\exp(\hat{\alpha} + \hat{\beta} x)}{1 + \exp(\hat{\alpha} + \hat{\beta} x)}.
\]
The logit transformation $p \mapsto \log(p/(1-p))$ is monotonically increasing, and $\hat{p}(x) > 1/2$ if and only if $\hat{\alpha} + \hat{\beta} x > 0$. The Bayes classifier is therefore
\[
\hat{y}(x) = \begin{cases} 1 & \text{if } \hat{\alpha} + \hat{\beta} x > 0 \\ 0 & \text{if } \hat{\alpha} + \hat{\beta} x < 0. \end{cases}
\]
Observe that LD50 is the boundary point that separates the $G_0$ set from the $G_1$ set.
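A minimal sketch of the resulting plug-in classifier with made-up estimates $\hat{\alpha}$ and $\hat{\beta}$ (not from the slides); assuming $\hat{\beta} > 0$, the boundary point $-\hat{\alpha}/\hat{\beta}$ is the LD50, the covariate value at which the estimated probability of $Y = 1$ equals $1/2$.

```python
import math

alpha_hat, beta_hat = -3.0, 1.5      # hypothetical fitted coefficients

def p_hat(x):
    """Estimated probability of Y = 1 given X = x."""
    eta = alpha_hat + beta_hat * x
    return math.exp(eta) / (1.0 + math.exp(eta))

def classify(x):
    """Plug-in classifier: predict 1 when alpha_hat + beta_hat * x > 0."""
    return 1 if alpha_hat + beta_hat * x > 0 else 0

ld50 = -alpha_hat / beta_hat         # boundary point separating G_0 from G_1
print(ld50, p_hat(ld50))             # 2.0, 0.5
for x in (1.0, 2.5):
    print(x, round(p_hat(x), 3), classify(x))
```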

20 Generalization error

The following is phrased in a setup where we have a parameterized model $(P_\theta)_{\theta \in \Theta}$ for the joint distribution of $(X, Y)$ in $E_1 \times E_2$, and we have i.i.d. observations $(X_1, Y_1), \dots, (X_n, Y_n)$.
\[
EPE(\theta) = E_\theta(L(Y, \hat{y}_\theta(X)))
\]
is the minimal expected prediction error as a function of $\theta$. With reference to the plug-in principle, $EPE(\hat{\theta})$ is an estimator of the unknown optimal value of $EPE$. But this is not so interesting; we want to know the expected prediction error of the non-optimal predictor $\hat{y}_{\hat{\theta}}$ under $P_\theta$. The generalization error or test error is
\[
\mathrm{Err}(\theta) = E_\theta(L(Y, \hat{y}_{\hat{\theta}}(X))),
\]
where the expectation $E_\theta$ denotes expectation over the estimator $\hat{\theta}$ as well as an independent copy of $(X, Y)$.

21 What we want to know

In reality we are mostly interested in the quantity
\[
R(\hat{\theta}, \theta) = E_\theta(L(Y, \hat{y}_{\hat{\theta}}(X)) \mid \hat{\theta}) \geq E_\theta(L(Y, \hat{y}_\theta(X))),
\]
which is the conditional expectation of the loss given the estimator. The inequality follows from the fact that $\hat{y}_\theta$ is the optimal predictor under $P_\theta$. But we can't get our hands on $R(\hat{\theta}, \theta)$; the plug-in principle just yields
\[
R(\hat{\vartheta}, \hat{\vartheta}) = EPE(\hat{\vartheta}),
\]
which is still not so interesting.

The generalization error, $\mathrm{Err}(\theta) = E_\theta(R(\hat{\theta}, \theta))$, is the expected performance of the estimated predictor and is thus a measure of the quality of the estimator.

22 Plug-in estimate of Err

We have a dataset $(x, y) = (x_1, \dots, x_n, y_1, \dots, y_n) \in E_1^n \times E_2^n$ and an estimate $\hat{\vartheta} = \hat{\theta}(x, y)$. We want to compute the plug-in estimate $\mathrm{Err}(\hat{\vartheta})$.

Choose $B$ sufficiently large and simulate $B$ new independent, identically distributed data sets, $(x^1, y^1), \dots, (x^B, y^B)$, each simulation being from the probability measure $P_{\hat{\vartheta}}$.

Compute, for each data set $(x^i, y^i)$, $i = 1, \dots, B$, new estimates $\hat{\vartheta}_i = \hat{\theta}(x^i, y^i)$ using the estimator $\hat{\theta}$, and new optimal predictors $\hat{y}_{\hat{\vartheta}_i}$.

Compute $R(\hat{\vartheta}_i, \hat{\vartheta})$ for $i = 1, \dots, B$. This may again be done via simulations.

Compute
\[
\widehat{\mathrm{Err}}(\hat{\vartheta}) = \frac{1}{B} \sum_{i=1}^{B} R(\hat{\vartheta}_i, \hat{\vartheta}).
\]
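The recipe translates almost line for line into code. The sketch below is my own implementation under simplifying assumptions (two classes with equal probabilities, $X \mid Y = y \sim N(\mu_y, 1)$, zero-one loss, and $R$ evaluated by an inner Monte Carlo sample); it is meant only to show the structure of the double simulation, not a definitive implementation.

```python
import random
random.seed(0)

def simulate(mu0, mu1, n):
    """Draw n iid pairs (x, y) with P(Y=0) = P(Y=1) = 1/2 and X | Y = y ~ N(mu_y, 1)."""
    data = []
    for _ in range(n):
        y = random.randint(0, 1)
        x = random.gauss(mu1 if y else mu0, 1.0)
        data.append((x, y))
    return data

def estimate(data):
    """Estimator theta_hat: the two class means (class probabilities kept at 1/2)."""
    m0 = [x for x, y in data if y == 0]
    m1 = [x for x, y in data if y == 1]
    return (sum(m0) / len(m0), sum(m1) / len(m1))

def predictor(theta):
    """Optimal (Bayes) predictor under P_theta: classify to the nearer class mean."""
    mu0, mu1 = theta
    return lambda x: 0 if abs(x - mu0) < abs(x - mu1) else 1

def risk(theta_pred, theta_true, m=5000):
    """Monte Carlo estimate of R: misclassification rate of the predictor fitted at
    theta_pred when data are generated under P_{theta_true}."""
    yhat = predictor(theta_pred)
    test = simulate(*theta_true, m)
    return sum(yhat(x) != y for x, y in test) / m

# Observed data and estimate (the data themselves are simulated here for illustration).
data = simulate(0.0, 2.0, n=50)
theta_obs = estimate(data)

# Plug-in estimate of Err: simulate B new data sets from P_{theta_obs}, re-estimate, average R.
B = 100
err_hat = sum(risk(estimate(simulate(*theta_obs, 50)), theta_obs) for _ in range(B)) / B
print(theta_obs, round(err_hat, 4))
```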
