When is MLE appropriate


1 When is MLE appropriate

As a rule of thumb, the following two assumptions need to be fulfilled for MLE to be the appropriate method of estimation:

- The model is adequate. That is, we trust that one of the probability measures in the parameterized family of probability measures adequately describes the observations.
- There are sufficiently many observations compared to the number of parameters.

Niels Richard Hansen (Univ. Copenhagen) Statistics BI/E lecture January 7

2 What if the model is wrong?

If the model is wrong, MLE can still produce good approximations within the model, approximations that can be used for prediction or discrimination purposes, say. We can live with the approximations, but we might be able to do better if we could come up with better models or different methods of estimation.

What is worse is that all conclusions based on distributional assumptions (that is, distributions of estimators, confidence intervals, statistical tests) are no longer valid.

3 What if there are not sufficiently many observations?

MLE is known to have good large-sample properties, but many challenging current problems must be dealt with without a large data sample.

- The MLE may not exist or may be ambiguous (overfitting). Is it then completely impossible to find an approximation?
- Even if the MLE exists, the estimator may have a large variance, and we might ask whether we can trade some of the variance for a small bias.

4 Other estimation procedures

A general procedure is to introduce a function $R_x : \Theta \to \mathbb{R}$, called the empirical risk function, for a given observation $x \in E$. We estimate $\theta$ by minimizing $R_x$.

Examples include $R_x(\theta) = l_x(\theta)$, the minus-log-likelihood function, and $R_x$ being the sum of squares. The resulting estimators are the MLE and the least squares estimator, respectively.

5 Least squares estimation

Let $x_1, \ldots, x_n$ denote $n$ real observations from the same probability measure $P$ and let $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$. The least squares empirical risk function for the parameter $\mu \in \mathbb{R}$ is given by

$$R_x(\mu) = \sum_{i=1}^n (x_i - \mu)^2.$$

The unique minimum is attained at the average

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i.$$

What we estimate is in general the minimizer of $\theta \mapsto E R_X(\theta)$, which in this case is the minimizer of $\mu \mapsto n E(X - \mu)^2$, where $X$ has distribution $P$. We know that this minimum is attained at $\mu = EX$.
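The claim that the average minimizes the least squares risk can be checked numerically. The following is a minimal sketch in R (the language used in the exercise of this lecture); the simulated data, the seed, and the names risk_sq, mu_hat and x_bar are illustrative assumptions, not part of the original slides.

```r
# Hypothetical data: any numeric vector would do.
set.seed(1)
x <- rnorm(50, mean = 3)

# Least squares empirical risk R_x(mu) = sum_i (x_i - mu)^2.
risk_sq <- function(mu, x) sum((x - mu)^2)

# One-dimensional numerical minimization over an interval covering the data.
fit <- optimize(risk_sq, interval = range(x), x = x)

mu_hat <- fit$minimum  # numerical minimizer of the risk
x_bar  <- mean(x)      # closed-form minimizer (the average)

print(c(numerical = mu_hat, average = x_bar))
```

The two numbers should agree up to the tolerance of optimize.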

6 Another risk function

An alternative risk function is

$$R_x(\mu) = \sum_{i=1}^n |x_i - \mu|.$$

This function also has a minimum, albeit not necessarily a unique one. It can be shown that the minimum is attained at a median of the dataset.

In this case we are estimating the minimizer of $\mu \mapsto n E|X - \mu|$, where $X$ has distribution $P$, and this theoretical minimum is attained at the median of $X$.
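A small numerical illustration of the median property, again as a hedged sketch in R. The data are simulated, the sample size is chosen odd so that the median is unique, and the names risk_abs, mu_hat and med are assumptions made for this example.

```r
# Hypothetical skewed data; n is odd so the sample median is unique.
set.seed(1)
x <- rexp(51)

# Absolute-deviation empirical risk R_x(mu) = sum_i |x_i - mu|.
risk_abs <- function(mu, x) sum(abs(x - mu))

# Numerical minimization over an interval covering the data.
fit <- optimize(risk_abs, interval = range(x), x = x)

mu_hat <- fit$minimum  # numerical minimizer of the risk
med    <- median(x)    # sample median

print(c(numerical = mu_hat, median = med, average = mean(x)))
```

For skewed data like these the minimizer sits near the median rather than the average, which is the practical difference between the two risk functions.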

7 Penalized least squares

Consider the least squares regression setup

$$R_z(\alpha, \beta) = \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2.$$

Ridge regression is defined as minimizing

$$R_z(\alpha, \beta) + \lambda \beta^2.$$

The term $\lambda \beta^2$ is known as the penalty function. The effect is that the minimizer has a $\beta$-value that is smaller in absolute value than without the penalty term. We say that the slope parameter $\beta$ is shrunk towards zero. How much we shrink towards 0 depends on the size of $\lambda$.
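For this simple setup the shrinkage can be made explicit. The derivation below is a sketch added for illustration; the symbols $\bar{x}$, $\bar{y}$, $S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2$ and $S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$ are notation introduced here, and $S_{xx} > 0$ is assumed. Setting the partial derivative with respect to $\alpha$ to zero gives $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$, and inserting this and differentiating with respect to $\beta$ gives

$$\begin{aligned}
0 &= -2 S_{xy} + 2 \hat{\beta}_\lambda S_{xx} + 2 \lambda \hat{\beta}_\lambda, \\
\hat{\beta}_\lambda &= \frac{S_{xy}}{S_{xx} + \lambda} = \frac{S_{xx}}{S_{xx} + \lambda}\, \hat{\beta}_0 .
\end{aligned}$$

Here $\hat{\beta}_0 = S_{xy}/S_{xx}$ is the ordinary least squares slope, so the penalized slope is the unpenalized one multiplied by a factor in $(0, 1]$ that decreases towards 0 as $\lambda$ grows.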

8 Penalized estimation

In general we can take any function $J : \Theta \to \mathbb{R}$ and, instead of minimizing the empirical risk function $R_x$ for a given observation $x \in E$, minimize the penalized risk function

$$R_x(\theta) + \lambda J(\theta)$$

for some $\lambda \geq 0$ as a function of $\theta \in \Theta$. How much the penalty function affects the estimate is determined by $\lambda$.

Note that the term $\lambda J(\theta)$ is completely independent of the observation $x$.

9 Exercise

Simulate a small dataset in R by x <- runif(100) and then y <- 1 + 2*x + rnorm(100).

- Compute the linear regression estimate using lm.
- Write a function in R that computes the sum of squares and use optim in R to compute the (same) least squares estimate.
- Extend your function by adding a penalty term. Try computing the penalized parameters for different values of $\lambda$ and compare the resulting estimated functions.
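One possible way to approach the exercise is sketched below. It is not the lecturer's solution: the seed, the function name pen_ss, the starting point c(0, 0), the BFGS method, and the grid of $\lambda$ values are all choices made for this illustration, and the penalty is assumed to be the ridge penalty $\lambda\beta^2$ from the penalized least squares slide.

```r
set.seed(1)
x <- runif(100)
y <- 1 + 2 * x + rnorm(100)

# Reference fit with lm.
coef_lm <- coef(lm(y ~ x))

# Penalized sum of squares; lambda = 0 gives plain least squares.
pen_ss <- function(par, x, y, lambda = 0) {
  alpha <- par[1]
  beta  <- par[2]
  sum((y - alpha - beta * x)^2) + lambda * beta^2
}

# Minimize with optim for a few (arbitrarily chosen) values of lambda.
lambdas <- c(0, 1, 10, 100)
fits <- sapply(lambdas, function(lambda) {
  optim(c(0, 0), pen_ss, x = x, y = y, lambda = lambda, method = "BFGS")$par
})
rownames(fits) <- c("alpha", "beta")
colnames(fits) <- paste0("lambda=", lambdas)

print(coef_lm)
print(round(fits, 3))
```

The first column of fits should reproduce the lm coefficients up to numerical error, and the slope row should move towards 0 as $\lambda$ increases.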

10 Generalized linear models

In our regression model specification there are two ingredients:

- The specification of how the mean of the observed variable depends on the covariates: the mean is a linear combination of known functions of the covariates.
- The specification of the noise term: the noise is added to the mean and is typically assumed to be Gaussian.

The framework of generalized linear models extends this in two directions at the same time.

11 Generalized linear models

The mean of the observed variable $X$ can now be given as

$$EX = m(\beta_0 + \beta_1 f_1(y_1) + \ldots + \beta_d f_d(y_d)),$$

where $m$ is a known function. The distribution of $X$ is not specified by adding noise to the mean, but rather by specifying that the density (or point probabilities) for the distribution of $X$ must have the form

$$f_{\theta,\phi}(x) = \exp\left(\frac{x\theta - b(\theta)}{a(\phi)} + c(x, \phi)\right)$$

for a two-dimensional parameter $(\theta, \phi) \in \mathbb{R} \times (0, \infty)$.

The basic question remains. We need to link the distribution with this density to the mean value specification above: what is the relation between $(\theta, \phi)$ and the mean value of $X$?

12 The mean in generalized linear models

The fact that

$$\int_E f_{\theta,\phi}(x)\,dx = 1$$

leads (blackboard) to the equality

$$\mu(\theta) = EX = \frac{db}{d\theta}(\theta).$$

One additional differentiation results in

$$VX = a(\phi) \underbrace{\frac{d^2 b}{d\theta^2}(\theta)}_{\text{variance function } V(\theta)},$$

which is $> 0$ unless we consider a degenerate model. As a result, the function $\mu(\theta)$ is continuous and strictly increasing, thus it has an inverse, which tells us that

$$\theta = \mu^{-1}(m(\beta_0 + \beta_1 f_1(y_1) + \ldots + \beta_d f_d(y_d))).$$
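The blackboard step is not reproduced on the slide. A sketch of how it can go is given here, under the assumption that differentiation with respect to $\theta$ may be interchanged with the integral (a sum in the discrete case), and writing $b' = db/d\theta$ and $b'' = d^2b/d\theta^2$. Differentiating the normalization of the density $f_{\theta,\phi}(x) = \exp\left((x\theta - b(\theta))/a(\phi) + c(x,\phi)\right)$ once and twice gives

$$\begin{aligned}
0 &= \frac{d}{d\theta} \int f_{\theta,\phi}(x)\,dx
   = \int \frac{x - b'(\theta)}{a(\phi)}\, f_{\theta,\phi}(x)\,dx
   = \frac{EX - b'(\theta)}{a(\phi)}, \\
0 &= \frac{d^2}{d\theta^2} \int f_{\theta,\phi}(x)\,dx
   = \int \left( \frac{(x - b'(\theta))^2}{a(\phi)^2} - \frac{b''(\theta)}{a(\phi)} \right) f_{\theta,\phi}(x)\,dx
   = \frac{VX}{a(\phi)^2} - \frac{b''(\theta)}{a(\phi)} .
\end{aligned}$$

The first line yields $EX = b'(\theta)$, and the second line, using $EX = b'(\theta)$ to identify $E(X - b'(\theta))^2$ with $VX$, yields $VX = a(\phi)\, b''(\theta)$.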

13 The link function

An alternative way to phrase this equality is that

$$\eta := \beta_0 + \beta_1 f_1(y_1) + \ldots + \beta_d f_d(y_d) = m^{-1}(\mu(\theta)).$$

It is always assumed that $m$ is a smooth (differentiable) and invertible function. We call the function $l = m^{-1}$ the link function and we call $\eta$ the linear predictor.

The canonical link function is defined as

$$l = \mu^{-1} = \left(\frac{db}{d\theta}\right)^{-1},$$

and for the canonical link function we see that

$$\theta = \eta = \beta_0 + \beta_1 f_1(y_1) + \ldots + \beta_d f_d(y_d)$$

and

$$EX = \frac{db}{d\theta}(\eta).$$

14 Example

Consider the normal distribution with location-scale parameters $(\mu, \sigma)$:

$$f(x) = \exp\left(-\frac{x^2 - 2x\mu + \mu^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)\right) = \exp\left(\frac{x\mu - \mu^2/2}{\sigma^2} - \frac{x^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)\right).$$

Taking $\theta = \mu$ and $\phi = \sigma^2$, we see that the density has the required form with $a(\phi) = \phi$,

$$b(\theta) = \theta^2/2 \quad \text{and} \quad c(x, \phi) = -\frac{x^2}{2\phi} - \frac{1}{2}\log(2\pi\phi).$$

Then

$$\frac{db}{d\theta}(\theta) = \theta,$$

and the canonical link function is the identity function. The variance function is in this case constantly equal to 1.

15 Example

For a Bernoulli random variable we showed that the point probabilities have the same general form with $\theta$ the log-odds of the success probability $p = P(X = 1)$. The derivative of $b(\theta) = \log(1 + \exp(\theta))$ is

$$\frac{db}{d\theta}(\theta) = \frac{\exp(\theta)}{1 + \exp(\theta)} = p = EX,$$

and the canonical link function is

$$\log\frac{p}{1 - p},$$

since $p$ is also the mean of $X$. The variance function is

$$V(\theta) = \frac{\exp(\theta)}{(1 + \exp(\theta))^2} = p(1 - p).$$

16 Example

The logistic regression model is thus a generalized linear model with canonical link function, so that the log-odds equals the linear predictor, that is,

$$\log\frac{p}{1 - p} = \beta_0 + \beta_1 f_1(y_1) + \ldots + \beta_d f_d(y_d).$$

The probit link is the inverse of the distribution function for the standard normal distribution. That is, with

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x \exp\left(-\frac{y^2}{2}\right) dy,$$

the success probability is related to the linear predictor via

$$p = \Phi(\eta),$$

and the link function is $l = \Phi^{-1}$.
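In R both links are available through glm with the binomial family. The sketch below uses simulated data; the covariate y1, the true coefficients, the seed and the object names are assumptions made for illustration, and the response is called x to follow the notation of the slides.

```r
set.seed(1)
y1 <- runif(200, -2, 2)                # hypothetical covariate
p  <- plogis(-0.5 + 1.5 * y1)          # true success probabilities (logit scale)
x  <- rbinom(200, size = 1, prob = p)  # Bernoulli responses

# Same linear predictor, two different link functions.
fit_logit  <- glm(x ~ y1, family = binomial(link = "logit"))
fit_probit <- glm(x ~ y1, family = binomial(link = "probit"))

# Success probabilities recovered from the linear predictor eta:
# p = exp(eta) / (1 + exp(eta)) for the logit link, p = Phi(eta) for probit.
p_logit  <- plogis(predict(fit_logit, type = "link"))
p_probit <- pnorm(predict(fit_probit, type = "link"))

print(rbind(logit = coef(fit_logit), probit = coef(fit_probit)))
```

The coefficients of the two fits are on different scales and are not directly comparable, while the fitted probabilities p_logit and p_probit are typically close to each other.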

17 Example

The Poisson point probabilities are

$$p(x) = e^{-\lambda} \frac{\lambda^x}{x!} = e^{x \log(\lambda) - \lambda - \log(x!)},$$

thus we have $\theta = \log(\lambda)$, $b(\theta) = e^\theta$ and $c(x) = -\log(x!)$. The mean and variance are computed as

$$EX = \frac{db}{d\theta}(\theta) = e^\theta = \lambda$$

and

$$VX = \frac{d^2 b}{d\theta^2}(\theta) = e^\theta = \lambda.$$

The canonical link function is log, and when the log of the mean is a linear combination of the covariates we often talk about a log-linear model.
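A log-linear model of this kind can be fitted in R with glm and the poisson family. As before this is only a sketch on simulated data; the covariate y1, the true coefficients, the seed and the object names are assumptions for the example.

```r
set.seed(1)
y1 <- runif(200)                       # hypothetical covariate
lambda_true <- exp(0.5 + 1.0 * y1)     # log of the mean is linear in y1
x <- rpois(200, lambda_true)           # Poisson responses

# Canonical (log) link: log(EX) = beta_0 + beta_1 * y1.
fit <- glm(x ~ y1, family = poisson(link = "log"))

# Estimated means are recovered by applying exp to the linear predictor.
mean_hat <- exp(predict(fit, type = "link"))

print(coef(fit))
```

With the canonical link, exponentiating the linear predictor gives the fitted means directly, which is what fitted(fit) returns.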
