Statistical Estimation

Size: px

Start display at page:

Download "Statistical Estimation"

Darrell Fletcher
5 years ago
Views:

1 Statistical Estimation Use data and a model. The plug-in estimators are based on the simple principle of applying the defining functional to the ECDF. Other methods of estimation: minimize residuals from fitted model, e.g. least squares maximize likelihood (what is likelihood?) These involve optimization. 1

2 Statistical Methods as Optimization Problems Many common statistical methods are developed as solutions to optimization problems. In a common method of statistical estimation, we maximize a likelihood, which is a function proportional to a probability density at the point of the observed data. max θ Θ L(θ; x) In another method of estimation and in standard modeling techniques, we minimize a norm of the residuals. The best fit of a model is often defined in terms of a minimum of a norm, such as least squares. max θ y i E(Y i ; θ, x i ) 2

3 Fitting Statistical Models A common type of model relates one variable to others in a form such as y = f(x; θ) + ɛ, (1) in which θ is a vector of fixed parameters with unknown and unobservable values. Fitting the model is mechanically equivalent to estimating θ. The most familiar form of this model is the linear model y = x T β + ɛ, where x and β are vectors. We also often assume that ɛ is a random variable with a normal distribution. There are two approaches, both optimization problems. In both, the first step is to replace the fixed unknown parameter with a variable. 3

4 Fitted Residuals For a given value of the variable used in place of the parameter and for each pair of observed values (y i, x i ), we from a fitted residual. In the case of the linear model, for example, we form the fitted residual r i (b) = y i x T i b, in terms of a variable b in place of the unknown estimand β. We should note the distinction between the fitted residuals r i (b) and the unobservable residuals or errors, ɛ i = y i x T i β. When ɛ is a random variable, ɛ i is a realization of the random variable, but the fitted residual r i (b) is not a realization of ɛ (as beginning statistics students sometimes think of it). In methods that fit the residuals, there may not be a probability model. Only a model of the expectation. 4

5 Minimizing Residuals The idea of minimizing the residuals from the observed data in the model is intuitively appealing. Because there is a residual at each observation, however, minimizing the residuals is not well-defined without additional statements. When there are several things to be minimized, we must decide on some way of combining them into a single measure. It is then the single measure that we seek to minimize. A useful type of overall measure is a norm of the vector of residuals. There are various norms. The L 2 norm of the vector of residuals n i=1 (y i x T i b)2. 5

6 Least Squares of Residuals Fitting the model so as to minimize the L 2 norm of the vector of residuals is least squares. With n observations, the ordinary least squares estimator of β in the linear model is the solution to the optimization problem min b n i=1 (y i x T i b)2. This optimization problem is relatively simple, and its solution can be expressed in a closed form as a system of linear equations. Lots of variations... weighted LS, nonlinear LS, regularized LS 6

7 Least Squares of Residuals in More Complicated Models If the expectation model is nonlinear in the parameters, for example, y i = θ 0 + e (θ 1t i +θ 2 t 2 i ) + ɛ i, as in Exercise 1.9 in the revised Chapter 1, then we usually cannot get the least squares estimator in closed form. We form r i (θ) = y i θ 0 + e (θ 1t i +θ 2 t 2 i ). Then we minimize r 2 i (θ) = (r(θ)) T r(θ) In any method for least squares, we use the fact that the gradient at the solution is 0. 7

8 Least Squares of Residuals in More Complicated Models Because we cannot get the least squares estimator in closed form, we use an iterative method to find a point at which the gradient is 0. We might use Newton s method, which involves the gradient of (r(θ)) T r(θ) and the Hessian of the same quantity. The gradient involves the Jacobian r(θ). It is s(t) = 2(J r (t)) T r(t). The R function nls computes the least squares estimates in a nonlinear model. 8

9 Properties of Least Squares Estimators in More Complicated Models Anytime we do statistical estimation, we want to know the properties of the estimators. Probability distribution, or at least the expectation (for bias), and if possible the variance. This may require a probability model for the distribution of the observations. 9

10 Variance of Least Squares Estimators The first thing we want is an estimation of the variance of the error term in the model. Under the simple expectation model, that variance is σ 2 = E((Y i E(Y i )) 2 ), which we assume to be the same for all i. A good estimator of this is the sample variance (with a divisor that makes it unbiased). Based on the properties of the objective function at the point of the solution, good estimate of the variance-covariance matrix of the parameter estimators or ((J r ( θ)) T 1 J r ( θ)) σ 2 (H s ( θ)) 1 σ 2. 10

11 Maximum Likelihood Estimation Given a PDF, say p(x; θ), and some (independent) data x 1,..., x n from a distribution with that PDF, the likelihood function is L(θ; x) = n i=1 p(x i ; θ). Note that the likelihood function is nonnegative. Note that this is the likelihood of the model (not likelihood of the observations ) Within a class of PDFs parameterized by θ, we could say it is the likelihood of θ. Considering θ to be a variable, and choosing the value of the variable that maximizes the likelihood with the given data, yields the maximum likelihood estimate of θ. 11

12 Maximum Likelihood Estimation MLE This requires a probability distribution model; not just a model of the expectation. The MLE often has good statistical properties, especially asymptotically. Also, use log-likelihood l(θ; x) = log(l(θ; x)) 12

13 Maximum Likelihood Estimation Another way of fitting the model y = f(x; θ)+ɛ is by MLE. What is the PDF? It comes from the probability distribution of ɛ. Given the data, this is the optimization problem max t n i=1 p(y i f(x i ; t)), where p( ) is the probability function or the probability density function of the random error. This yields a maximum likelihood estimator (MLE). For a given probability density p( ), the maximization problem for MLE is usually different from a minimization problem of fitting residuals. 13

14 EM Methods EM methods are used in maximum likelihood estimation when the given sample consists of two components, one observed and one unobserved or missing. A simple example of missing data occurs in life-testing, when, for example, a number of electrical units are switched on and the time when each fails is recorded. In such an experiment, it is usually necessary to curtail the recordings prior to the failure of all units. The failure times of the units still working are unobserved. Another common example is analysis of a finite mixture model. Each observation comes from an unknown one of an assumed set of distributions. The missing data is the distribution indicator. Finally, EM methods are useful when there are known relationships among the parameters of a model; that is, in a model that is overparametrized. 14

15 Missing Data The missing data can be missing observations on the same random variable that yields the observed sample, as in the case of the censoring example. or The missing data can be from a different random variable that is related in some way to the random variable observed, such as the class membership in a finite mixture model. Sometimes an EM method can be constructed based on an artificial missing random variable to supplement the observable data. We call the actual observed data X, and the unobserved data, U. Thus, we have complete data C = (X, U). 15

16 Likelihoods Let L C (θ; c) be the likelihood of the complete data, and let L X (θ; x) be the likelihood of the observed data. Let l C (θ; c) and l X (θ; x) be the corresponding log-likelihoods. Maximizing log-likelihoods is equivalent to maximizing likelihoods. We will just work with the log-likelihood. Our objective is to maximize the log-likelihood of the observed data, l X (θ; x). We will also use the conditional likelihood or the conditional loglikelihood: l C X (θ; c x) = l C (θ; x, u) l X (θ; x). The log-likelihood of interest is therefore l X (θ; x) = l C (θ; x, u) l C X (θ; c x). (2) 16

17 Notation: We will drop the constants and just write l X (θ), l C (θ), and l C X (θ). 17

18 Conditional Expectation of the Log-Likelihood as a Function of U For a given x, and any fixed value of θ, say θ (k), we consider conditional expectations of the log-likelihoods as functions of the random variable U E U x,θ (k) (l X (θ)) = E U x,θ (k) (l C (θ)) E U x,θ (k) ( lc X (θ) ). *Note the notation: E U x,θ (k), etc. Each term is nonnegative, and the term on the left is constant with respect to U; that is, the left-hand side is just l X (θ). Hence, for any θ and x, we have l X (θ) E U x,θ (k) (l C (θ)). 18

19 Maximization of the Log-Likelihood E U x,θ (k) (l C (θ)), which as we have seen, is an upper bound for l X (θ), plays an important role in the EM method. Let q k (x, θ) = E U x,θ (k) (l C (θ)). We find θ (1), θ (2),... by maximizing q k (x, θ) for k = 0,1,... That is, given θ (k), we determine θ (k+1) as the value of θ that maximizes the conditional expectation E U x,θ (k) (l C (θ)). We then update q k (x, θ) using E U x,θ (k+1). 19

20 The EM Method The EM approach to maximizing l X (θ ; x) has two alternating steps. The steps are iterated until convergence. E step : compute q k (x, θ) = E U x,θ (k)(l C (θ; x, U)). (notice that q k (x, θ) is a function of θ (k) ) M step : determine θ (k+1) to maximize q k (x, θ), or at least to increase it (subject to any constraints on acceptable values of θ). Notice that max l X (θ) max q k (x, θ). The sequence θ (1), θ (2),... converges to a local maximum of the observed-data likelihood l X (θ) under fairly general conditions. (It can be very slow to converge, however.) 20

21 Example 1 One of the simplest examples of the EM method was given by Dempster, Laird, and Rubin (1977) in the pioneering article that first systematically discussed EM methods. See Exercise 1.9 (Exercise 1.13 in revised Chapter 1). Consider the multinomial distribution with four outcomes, that is, the multinomial with probability function, p(x 1, x 2, x 3, x 4 ) = n! x 1!x 2!x 3!x 4! πx 1 1 πx 2 2 πx 3 3 πx 4 4, with x 1 + x 2 + x 3 + x 4 = n and π 1 + π 2 + π 3 + π 4 = 1. Notice that a binomial distribution can be formed from a multinomial distribution in various ways. 21

22 Overparametrized Multinomial Now, suppose the probabilities are related by a single parameter, θ: π 1 = θ π 2 = θ π 3 = θ π 4 = 1 4 θ, where 0 θ 1. This relationship may arise from known facts about the phenomenon giving rise to the individual groups in the multinomial distribution. The multinomial is overparametrized. 22

23 Overparametrized Multinomial Given an observation x = (x 1, x 2, x 3, x 4 ), the log-likelihood function is l(θ) = x 1 log(2 + θ) + (x 2 + x 3 )log(1 θ) + x 4 log(θ) + c, for some c that is constant with respect to θ. The objective is to estimate θ by maximizing l(θ). Notice that for this simple problem, the MLE of θ can be determined by solving a simple polynonial equation, because dl(θ) dθ = x θ x 2 + x 3 1 θ + x 4 θ. Let s proceed, however, with an EM formulation. 23

24 Overparametrized Multinomial We notice that π 2 = π 3, and π 1 = π 4 + 1/2. So to use the EM algorithm on this problem, we can think of a multinomial with five classes, which is formed from the original multinomial by splitting the first class into two with associated probabilities 1/2 and θ/4. The original variable x 1 is now the sum of u 1 and u 2. The vector c = (u 1, u 2, x 2, x 3, x 4 ) is the complete data. 24

25 Overparametrized Multinomial Under this reformulation, we now have a maximum likelihood estimate of θ by considering u 2 + x 4 to be a realization of a binomial with n = u 2 + x 4 + x 2 + x 3 and π = θ. However, we do not know u 2. Proceeding as if we had a five-outcome multinomial observation with two missing elements, we have the log-likelihood for the complete data, l C (θ) = (u 2 + x 4 )log(θ) + (x 2 + x 3 )log(1 θ), which has a maximum at θ = u 2 + x 4 u 2 + x 2 + x 3 + x 4. 25

26 Overparametrized Multinomial The E-step of the iterative EM algorithm fills in the missing or unobservable value with its expected value given a current value of the parameter, θ (k), and the observed data. Because l C (θ) is linear in the data, we have E U2 x,θ (k) (l C (θ)) = E U2 x,θ (k) (U 2 +x 4 )log(θ)+e U2 x,θ (k) (x 2 +x 3 )log(1 θ). Under this setup, with θ = θ (k), we get E U2 x,θ (k) (U 2 ) = 1 ( 4 x 1θ (k) 1 / x 1θ (k) ), and now for the M-step, we take u (k) 2 = E U2 x,θ (k) (U 2 ). 26

27 Overparametrized Multinomial We now maximize E U2 x,θ (k) (l C (θ)). We have a closed-form solution. The maximum occurs at θ (k+1) = (u (k) 2 + x 4)/(u (k) 2 + x 2 + x 3 + x 4 ), and this becomes the new value of θ. The following R statements execute a single iteration, after we initialize t.k u2.k <- x[1]*t.k/(2+t.k) t.kp1 <- (u2.k + x[4])/(sum(x)-x[1]+u2.k) Dempster, Laird, and Rubin used the data n = 197 and x = (125,18,20,34). Beginning with t.k in the range of 0.1 to 0.9, this converges to within 10 7 in less than 10 iterations. 27

28 Testing for Convergence We can put the R statements in a loop so long as we test for convergence and put a limit on the number of iterations. # initialize t.kp1 as the starting value and initialize t.k to any different value # initialize iter = 0. # initialize control values, eps and maxit. while (abs(t.k-t.kp1)>eps & iter <= maxit) { iter <- iter + 1 t.k <- t.kp1 u2.k <- x[1]*t.k/(2+t.k) t.kp1 <- (u2.k + x[4])/(sum(x)-x[1]+u2.k) } 28

29 Example 2 A Life-Testing Experiment Using an Exponential Model The exponential model for failure times is p(t) = 1 θ e t/θ. In the classic life-testing experiment, n items are put on test, and the failure times of n 1 items (usually the first ones to fail) are recorded. All that is known of the remaining n n 1 items is the time at which they were taken from the test (usually the time at which the test is stopped). The complete data would be the failure times for all n items: t 1,..., t n. 29

30 A Life-Testing Experiment Using an Exponential Model We will allow the n n 1 items whose failure times were not measured to be removed from the test (or censored ) at varying times; that is, for each one, all we know is a lower bound on its lifetime. The data that we actually have is the set t 1,..., t n1, c n1 +1,..., c n, where we have indexed the observations so that the first n 1 correspond to those with observed failure times, and c n1 +1,..., c n are the censored times for the remaining items; that is, the times at which they were removed from the test. (In many cases, of course, all of the c s would be the same.) If n j n, then all that is known about t j is that t j c j. 30

31 A Life-Testing Experiment Using an Given the data vector x, Exponential Model x = (x 1,..., x n1, x n1 +1,..., x n ) = (t 1,..., t n1, c n1 +1,..., c n ), the objective is to obtain the MLE of θ. The probability of units n through n exceeding the values x n1 +1,..., x n is e n i=n1 +1 x i/θ, and so the likelihood of the observed data is e n i=n1 +1 x i/θ n 1 i=1 1 θ e x i/θ, and the log-likelihood for the observed data is l(θ) = n 1 log(θ) n i=1 x i /θ. (3) 31

32 A Life-Testing Experiment Using an Exponential Model The maximum of the log-likelihood is easily obtained of course. It occurs at ˆθ = n i=1 x i /n 1, but let s use the EM method to approach the problem from the standpoint of how to deal with the missing data. The log-likelihood for the complete data is l C (θ) = nlog(θ) n i=1 x i θ, where, for j > n 1, x j is the unobserved realization of the random variable corresponding to time of failure. 32

33 A Life-Testing Experiment Using an Exponential Model In the EM formulation for this problem, we first get the E step. Given the observed values of x, and, for a given value of θ, say θ (k), taking the expectation of the complete log-likelihood for random variables X n1 +1,..., X n, we get for the E step, q k (x, θ) = nlog(θ) = nlog(θ) n 1 i=1 n i=1 x i + n i=n 1 +1 ( xi + θ (k)) /θ x i + (n n 1 )θ (k) /θ (4) The value that maximizes this is taken to be θ (k+1) ; that is, θ (k+1) = n i=1 x i + (n n 1 )θ (k) /n. 33

34 Example 3 A two-component normal mixture model can be defined by two normal distributions, N(µ 1, σ1 2) and N(µ 2, σ2 2 ), and the probability that the random variable (the observable) arises from the first distribution is w. The parameter in this model is the vector θ = (w, µ 1, σ 2 1, µ 2, σ 2 2 ). (Note that w and the σs have the obvious constraints.) The PDF of the mixture is p(x; θ) = wp 1 (x; µ 1, σ 2 1 ) + (1 w)p 2(x; µ 2, σ 2 2 ), where p j (x; µ j, σ 2 j ) is the normal PDF with parameters µ j and σ 2 j. (I am just writing them this way for convenience; p 1 and p 2 are actually the same parametrized function of course.) 34

35 The Two-Component Mixture Model In the standard formulation with C = (X, U), X represents the observed data, and the unobserved U represents class membership. For the two-class mixture model, let U = 1 if the observation is from the first distribution and U = 0 if the observation is from the second distribution. The unconditional E(U) is the probability that an observation comes from the first distribution, which of course is w. Suppose we have n observations on X, x = (x 1,..., x n ). In the notation we have been using, q (k) i = q (k) (x i, θ (k) ) is the weight (or estimated probability) that the observation is in the first group. 35

36 Estimation in the Two-Component Normal Mixture Model Given a provisional value of θ, we can compute the conditional expected value E(U x) for any realization of X, x i. It is merely q (k) i = E(U x i, θ (k) ) = p ( w (k) p 1 x i ; µ (k) ) 1, σ2(k) 1 ( ). x i ; w (k), µ (k) 1, σ2(k) 1, µ (k) (5) 2, σ2(k) 2 36

37 The M step is just the familiar MLE of the parameters: w (k+1) = 1 n E (U x i, θ (k) ) σ 2(k+1) 2 = µ (k+1) 1 = 1 q (k) nw (k+1) i σ1 2(k+1) = 1 q (k) nw (k+1) i µ (k+1) 2 = ( 1 n ( ( 1 w (k+1)) 1 n(1 w (k+1) ) ( 1 q (k) i x i x i µ (k+1) 1 ) 1 q (k) ) ( i ) 2 x i x i µ (k+1) 2 ) 2 (6) 37

38 Easy to program in R 38

39 The General Mixture Model A multi-component mixture model can be defined by specifying the individual distributions and and the probability that the random variable (the observable) arises from the each distribution. For a mixture of g distributions, if the PDF of the j th distribution is p j (x; θ j ), the PDF of the mixture is p(x; θ) = where π j 0 and g j=1 π j = 1. g j=1 π j p j (x; θ j ), The g-vector π is the unconditional probabilities for a random variable from the mixture. 39

40 The General Mixture Model: Dummy Variables For observations from different groups, we can use a single indicator variable to designate the group, or we can use dummy variables that take values of 0 or 1, depending on the group of a given observation. In linear models when dummy variables are used to distinguish classes, the number of dummy variables is one less than the number of classes. This yields linearly independent column vectors in the data matrix. In mixture models, it is more convenient to use a dummy variable for each class. For an observation in a given class, only the dummy variable for that class takes a value of 1, and all others take the value 0; thus, the sum of the dummy variables for a given observation is 1. In a general probability mixture model, the expected value of a dummy variable is the probability that an observation comes from the group corresponding to the dummy variable. 40

41 The General Mixture Model In the formulation C = (X, U) where X represents the observed data, the unobserved U represents the g-vector of dummy variables. The unconditional E(U) is vector of probabilities that an observation comes from the various components of the mixture distribution. Suppose we have n observations on X, x = (x 1,..., x n ). In the notation we have been using, q (k) i = q (k) (x i, θ (k) ), the conditional expectation of U, is the estimated conditional probability vector for that observation. 41

42 Given provisional values of π and θ, π (k) and θ (k), we can compute the conditional expected value of the j th dummy variable for any realization of X, x i : q (k) ij = E(U j x i, θ (k) ) = x i ; θ (k) j p (x i ; θ (k)). π (k) ( j p j )

43 Estimation in a General Normal Mixture Model Given E(U x, θ (k) ) as before, we can perform the M step. First, we have, the conditional MLE for the g-vector of probabilities: π (k+1) = 1 n n i=1 E ( U x i, θ (k)). (7) For a mixture of normal distributions, θ j = (µ j, σj 2 ), that is, the mean and variance of the j th distribution, for the conditional MLEs, we have the weighted estimates µ (k+1) j = 1 π (k+1) j σ 2(k+1) j = 1 π (k+1) j ni=1 q (k) ij ni=1 q (k) ij x i ( x i µ (k+1) i ) 2. (8) These equations are similar to equations (6). 42

44 EM in R There are various packages and functions for applications of EM in R. Instead of a general function for EM, the functions depend on the applications; for example, mixtools and mclust use EM to estimate the parameters in mixtures of populations. A very general package that allows regression models instead of just probability models, to distinguish the different populations is flexmix. 43

Review and continuation from last week Properties of MLEs

Review and continuation from last week Properties of MLEs As we have mentioned, MLEs have a nice intuitive property, and as we have seen, they have a certain equivariance property. We will see later that