Statistical Estimation
|
|
- Darrell Fletcher
- 5 years ago
- Views:
Transcription
1 Statistical Estimation Use data and a model. The plug-in estimators are based on the simple principle of applying the defining functional to the ECDF. Other methods of estimation: minimize residuals from fitted model, e.g. least squares maximize likelihood (what is likelihood?) These involve optimization. 1
2 Statistical Methods as Optimization Problems Many common statistical methods are developed as solutions to optimization problems. In a common method of statistical estimation, we maximize a likelihood, which is a function proportional to a probability density at the point of the observed data. max θ Θ L(θ; x) In another method of estimation and in standard modeling techniques, we minimize a norm of the residuals. The best fit of a model is often defined in terms of a minimum of a norm, such as least squares. max θ y i E(Y i ; θ, x i ) 2
3 Fitting Statistical Models A common type of model relates one variable to others in a form such as y = f(x; θ) + ɛ, (1) in which θ is a vector of fixed parameters with unknown and unobservable values. Fitting the model is mechanically equivalent to estimating θ. The most familiar form of this model is the linear model y = x T β + ɛ, where x and β are vectors. We also often assume that ɛ is a random variable with a normal distribution. There are two approaches, both optimization problems. In both, the first step is to replace the fixed unknown parameter with a variable. 3
4 Fitted Residuals For a given value of the variable used in place of the parameter and for each pair of observed values (y i, x i ), we from a fitted residual. In the case of the linear model, for example, we form the fitted residual r i (b) = y i x T i b, in terms of a variable b in place of the unknown estimand β. We should note the distinction between the fitted residuals r i (b) and the unobservable residuals or errors, ɛ i = y i x T i β. When ɛ is a random variable, ɛ i is a realization of the random variable, but the fitted residual r i (b) is not a realization of ɛ (as beginning statistics students sometimes think of it). In methods that fit the residuals, there may not be a probability model. Only a model of the expectation. 4
5 Minimizing Residuals The idea of minimizing the residuals from the observed data in the model is intuitively appealing. Because there is a residual at each observation, however, minimizing the residuals is not well-defined without additional statements. When there are several things to be minimized, we must decide on some way of combining them into a single measure. It is then the single measure that we seek to minimize. A useful type of overall measure is a norm of the vector of residuals. There are various norms. The L 2 norm of the vector of residuals n i=1 (y i x T i b)2. 5
6 Least Squares of Residuals Fitting the model so as to minimize the L 2 norm of the vector of residuals is least squares. With n observations, the ordinary least squares estimator of β in the linear model is the solution to the optimization problem min b n i=1 (y i x T i b)2. This optimization problem is relatively simple, and its solution can be expressed in a closed form as a system of linear equations. Lots of variations... weighted LS, nonlinear LS, regularized LS 6
7 Least Squares of Residuals in More Complicated Models If the expectation model is nonlinear in the parameters, for example, y i = θ 0 + e (θ 1t i +θ 2 t 2 i ) + ɛ i, as in Exercise 1.9 in the revised Chapter 1, then we usually cannot get the least squares estimator in closed form. We form r i (θ) = y i θ 0 + e (θ 1t i +θ 2 t 2 i ). Then we minimize r 2 i (θ) = (r(θ)) T r(θ) In any method for least squares, we use the fact that the gradient at the solution is 0. 7
8 Least Squares of Residuals in More Complicated Models Because we cannot get the least squares estimator in closed form, we use an iterative method to find a point at which the gradient is 0. We might use Newton s method, which involves the gradient of (r(θ)) T r(θ) and the Hessian of the same quantity. The gradient involves the Jacobian r(θ). It is s(t) = 2(J r (t)) T r(t). The R function nls computes the least squares estimates in a nonlinear model. 8
9 Properties of Least Squares Estimators in More Complicated Models Anytime we do statistical estimation, we want to know the properties of the estimators. Probability distribution, or at least the expectation (for bias), and if possible the variance. This may require a probability model for the distribution of the observations. 9
10 Variance of Least Squares Estimators The first thing we want is an estimation of the variance of the error term in the model. Under the simple expectation model, that variance is σ 2 = E((Y i E(Y i )) 2 ), which we assume to be the same for all i. A good estimator of this is the sample variance (with a divisor that makes it unbiased). Based on the properties of the objective function at the point of the solution, good estimate of the variance-covariance matrix of the parameter estimators or ((J r ( θ)) T 1 J r ( θ)) σ 2 (H s ( θ)) 1 σ 2. 10
11 Maximum Likelihood Estimation Given a PDF, say p(x; θ), and some (independent) data x 1,..., x n from a distribution with that PDF, the likelihood function is L(θ; x) = n i=1 p(x i ; θ). Note that the likelihood function is nonnegative. Note that this is the likelihood of the model (not likelihood of the observations ) Within a class of PDFs parameterized by θ, we could say it is the likelihood of θ. Considering θ to be a variable, and choosing the value of the variable that maximizes the likelihood with the given data, yields the maximum likelihood estimate of θ. 11
12 Maximum Likelihood Estimation MLE This requires a probability distribution model; not just a model of the expectation. The MLE often has good statistical properties, especially asymptotically. Also, use log-likelihood l(θ; x) = log(l(θ; x)) 12
13 Maximum Likelihood Estimation Another way of fitting the model y = f(x; θ)+ɛ is by MLE. What is the PDF? It comes from the probability distribution of ɛ. Given the data, this is the optimization problem max t n i=1 p(y i f(x i ; t)), where p( ) is the probability function or the probability density function of the random error. This yields a maximum likelihood estimator (MLE). For a given probability density p( ), the maximization problem for MLE is usually different from a minimization problem of fitting residuals. 13
14 EM Methods EM methods are used in maximum likelihood estimation when the given sample consists of two components, one observed and one unobserved or missing. A simple example of missing data occurs in life-testing, when, for example, a number of electrical units are switched on and the time when each fails is recorded. In such an experiment, it is usually necessary to curtail the recordings prior to the failure of all units. The failure times of the units still working are unobserved. Another common example is analysis of a finite mixture model. Each observation comes from an unknown one of an assumed set of distributions. The missing data is the distribution indicator. Finally, EM methods are useful when there are known relationships among the parameters of a model; that is, in a model that is overparametrized. 14
15 Missing Data The missing data can be missing observations on the same random variable that yields the observed sample, as in the case of the censoring example. or The missing data can be from a different random variable that is related in some way to the random variable observed, such as the class membership in a finite mixture model. Sometimes an EM method can be constructed based on an artificial missing random variable to supplement the observable data. We call the actual observed data X, and the unobserved data, U. Thus, we have complete data C = (X, U). 15
16 Likelihoods Let L C (θ; c) be the likelihood of the complete data, and let L X (θ; x) be the likelihood of the observed data. Let l C (θ; c) and l X (θ; x) be the corresponding log-likelihoods. Maximizing log-likelihoods is equivalent to maximizing likelihoods. We will just work with the log-likelihood. Our objective is to maximize the log-likelihood of the observed data, l X (θ; x). We will also use the conditional likelihood or the conditional loglikelihood: l C X (θ; c x) = l C (θ; x, u) l X (θ; x). The log-likelihood of interest is therefore l X (θ; x) = l C (θ; x, u) l C X (θ; c x). (2) 16
17 Notation: We will drop the constants and just write l X (θ), l C (θ), and l C X (θ). 17
18 Conditional Expectation of the Log-Likelihood as a Function of U For a given x, and any fixed value of θ, say θ (k), we consider conditional expectations of the log-likelihoods as functions of the random variable U E U x,θ (k) (l X (θ)) = E U x,θ (k) (l C (θ)) E U x,θ (k) ( lc X (θ) ). *Note the notation: E U x,θ (k), etc. Each term is nonnegative, and the term on the left is constant with respect to U; that is, the left-hand side is just l X (θ). Hence, for any θ and x, we have l X (θ) E U x,θ (k) (l C (θ)). 18
19 Maximization of the Log-Likelihood E U x,θ (k) (l C (θ)), which as we have seen, is an upper bound for l X (θ), plays an important role in the EM method. Let q k (x, θ) = E U x,θ (k) (l C (θ)). We find θ (1), θ (2),... by maximizing q k (x, θ) for k = 0,1,... That is, given θ (k), we determine θ (k+1) as the value of θ that maximizes the conditional expectation E U x,θ (k) (l C (θ)). We then update q k (x, θ) using E U x,θ (k+1). 19
20 The EM Method The EM approach to maximizing l X (θ ; x) has two alternating steps. The steps are iterated until convergence. E step : compute q k (x, θ) = E U x,θ (k)(l C (θ; x, U)). (notice that q k (x, θ) is a function of θ (k) ) M step : determine θ (k+1) to maximize q k (x, θ), or at least to increase it (subject to any constraints on acceptable values of θ). Notice that max l X (θ) max q k (x, θ). The sequence θ (1), θ (2),... converges to a local maximum of the observed-data likelihood l X (θ) under fairly general conditions. (It can be very slow to converge, however.) 20
21 Example 1 One of the simplest examples of the EM method was given by Dempster, Laird, and Rubin (1977) in the pioneering article that first systematically discussed EM methods. See Exercise 1.9 (Exercise 1.13 in revised Chapter 1). Consider the multinomial distribution with four outcomes, that is, the multinomial with probability function, p(x 1, x 2, x 3, x 4 ) = n! x 1!x 2!x 3!x 4! πx 1 1 πx 2 2 πx 3 3 πx 4 4, with x 1 + x 2 + x 3 + x 4 = n and π 1 + π 2 + π 3 + π 4 = 1. Notice that a binomial distribution can be formed from a multinomial distribution in various ways. 21
22 Overparametrized Multinomial Now, suppose the probabilities are related by a single parameter, θ: π 1 = θ π 2 = θ π 3 = θ π 4 = 1 4 θ, where 0 θ 1. This relationship may arise from known facts about the phenomenon giving rise to the individual groups in the multinomial distribution. The multinomial is overparametrized. 22
23 Overparametrized Multinomial Given an observation x = (x 1, x 2, x 3, x 4 ), the log-likelihood function is l(θ) = x 1 log(2 + θ) + (x 2 + x 3 )log(1 θ) + x 4 log(θ) + c, for some c that is constant with respect to θ. The objective is to estimate θ by maximizing l(θ). Notice that for this simple problem, the MLE of θ can be determined by solving a simple polynonial equation, because dl(θ) dθ = x θ x 2 + x 3 1 θ + x 4 θ. Let s proceed, however, with an EM formulation. 23
24 Overparametrized Multinomial We notice that π 2 = π 3, and π 1 = π 4 + 1/2. So to use the EM algorithm on this problem, we can think of a multinomial with five classes, which is formed from the original multinomial by splitting the first class into two with associated probabilities 1/2 and θ/4. The original variable x 1 is now the sum of u 1 and u 2. The vector c = (u 1, u 2, x 2, x 3, x 4 ) is the complete data. 24
25 Overparametrized Multinomial Under this reformulation, we now have a maximum likelihood estimate of θ by considering u 2 + x 4 to be a realization of a binomial with n = u 2 + x 4 + x 2 + x 3 and π = θ. However, we do not know u 2. Proceeding as if we had a five-outcome multinomial observation with two missing elements, we have the log-likelihood for the complete data, l C (θ) = (u 2 + x 4 )log(θ) + (x 2 + x 3 )log(1 θ), which has a maximum at θ = u 2 + x 4 u 2 + x 2 + x 3 + x 4. 25
26 Overparametrized Multinomial The E-step of the iterative EM algorithm fills in the missing or unobservable value with its expected value given a current value of the parameter, θ (k), and the observed data. Because l C (θ) is linear in the data, we have E U2 x,θ (k) (l C (θ)) = E U2 x,θ (k) (U 2 +x 4 )log(θ)+e U2 x,θ (k) (x 2 +x 3 )log(1 θ). Under this setup, with θ = θ (k), we get E U2 x,θ (k) (U 2 ) = 1 ( 4 x 1θ (k) 1 / x 1θ (k) ), and now for the M-step, we take u (k) 2 = E U2 x,θ (k) (U 2 ). 26
27 Overparametrized Multinomial We now maximize E U2 x,θ (k) (l C (θ)). We have a closed-form solution. The maximum occurs at θ (k+1) = (u (k) 2 + x 4)/(u (k) 2 + x 2 + x 3 + x 4 ), and this becomes the new value of θ. The following R statements execute a single iteration, after we initialize t.k u2.k <- x[1]*t.k/(2+t.k) t.kp1 <- (u2.k + x[4])/(sum(x)-x[1]+u2.k) Dempster, Laird, and Rubin used the data n = 197 and x = (125,18,20,34). Beginning with t.k in the range of 0.1 to 0.9, this converges to within 10 7 in less than 10 iterations. 27
28 Testing for Convergence We can put the R statements in a loop so long as we test for convergence and put a limit on the number of iterations. # initialize t.kp1 as the starting value and initialize t.k to any different value # initialize iter = 0. # initialize control values, eps and maxit. while (abs(t.k-t.kp1)>eps & iter <= maxit) { iter <- iter + 1 t.k <- t.kp1 u2.k <- x[1]*t.k/(2+t.k) t.kp1 <- (u2.k + x[4])/(sum(x)-x[1]+u2.k) } 28
29 Example 2 A Life-Testing Experiment Using an Exponential Model The exponential model for failure times is p(t) = 1 θ e t/θ. In the classic life-testing experiment, n items are put on test, and the failure times of n 1 items (usually the first ones to fail) are recorded. All that is known of the remaining n n 1 items is the time at which they were taken from the test (usually the time at which the test is stopped). The complete data would be the failure times for all n items: t 1,..., t n. 29
30 A Life-Testing Experiment Using an Exponential Model We will allow the n n 1 items whose failure times were not measured to be removed from the test (or censored ) at varying times; that is, for each one, all we know is a lower bound on its lifetime. The data that we actually have is the set t 1,..., t n1, c n1 +1,..., c n, where we have indexed the observations so that the first n 1 correspond to those with observed failure times, and c n1 +1,..., c n are the censored times for the remaining items; that is, the times at which they were removed from the test. (In many cases, of course, all of the c s would be the same.) If n j n, then all that is known about t j is that t j c j. 30
31 A Life-Testing Experiment Using an Given the data vector x, Exponential Model x = (x 1,..., x n1, x n1 +1,..., x n ) = (t 1,..., t n1, c n1 +1,..., c n ), the objective is to obtain the MLE of θ. The probability of units n through n exceeding the values x n1 +1,..., x n is e n i=n1 +1 x i/θ, and so the likelihood of the observed data is e n i=n1 +1 x i/θ n 1 i=1 1 θ e x i/θ, and the log-likelihood for the observed data is l(θ) = n 1 log(θ) n i=1 x i /θ. (3) 31
32 A Life-Testing Experiment Using an Exponential Model The maximum of the log-likelihood is easily obtained of course. It occurs at ˆθ = n i=1 x i /n 1, but let s use the EM method to approach the problem from the standpoint of how to deal with the missing data. The log-likelihood for the complete data is l C (θ) = nlog(θ) n i=1 x i θ, where, for j > n 1, x j is the unobserved realization of the random variable corresponding to time of failure. 32
33 A Life-Testing Experiment Using an Exponential Model In the EM formulation for this problem, we first get the E step. Given the observed values of x, and, for a given value of θ, say θ (k), taking the expectation of the complete log-likelihood for random variables X n1 +1,..., X n, we get for the E step, q k (x, θ) = nlog(θ) = nlog(θ) n 1 i=1 n i=1 x i + n i=n 1 +1 ( xi + θ (k)) /θ x i + (n n 1 )θ (k) /θ (4) The value that maximizes this is taken to be θ (k+1) ; that is, θ (k+1) = n i=1 x i + (n n 1 )θ (k) /n. 33
34 Example 3 A two-component normal mixture model can be defined by two normal distributions, N(µ 1, σ1 2) and N(µ 2, σ2 2 ), and the probability that the random variable (the observable) arises from the first distribution is w. The parameter in this model is the vector θ = (w, µ 1, σ 2 1, µ 2, σ 2 2 ). (Note that w and the σs have the obvious constraints.) The PDF of the mixture is p(x; θ) = wp 1 (x; µ 1, σ 2 1 ) + (1 w)p 2(x; µ 2, σ 2 2 ), where p j (x; µ j, σ 2 j ) is the normal PDF with parameters µ j and σ 2 j. (I am just writing them this way for convenience; p 1 and p 2 are actually the same parametrized function of course.) 34
35 The Two-Component Mixture Model In the standard formulation with C = (X, U), X represents the observed data, and the unobserved U represents class membership. For the two-class mixture model, let U = 1 if the observation is from the first distribution and U = 0 if the observation is from the second distribution. The unconditional E(U) is the probability that an observation comes from the first distribution, which of course is w. Suppose we have n observations on X, x = (x 1,..., x n ). In the notation we have been using, q (k) i = q (k) (x i, θ (k) ) is the weight (or estimated probability) that the observation is in the first group. 35
36 Estimation in the Two-Component Normal Mixture Model Given a provisional value of θ, we can compute the conditional expected value E(U x) for any realization of X, x i. It is merely q (k) i = E(U x i, θ (k) ) = p ( w (k) p 1 x i ; µ (k) ) 1, σ2(k) 1 ( ). x i ; w (k), µ (k) 1, σ2(k) 1, µ (k) (5) 2, σ2(k) 2 36
37 The M step is just the familiar MLE of the parameters: w (k+1) = 1 n E (U x i, θ (k) ) σ 2(k+1) 2 = µ (k+1) 1 = 1 q (k) nw (k+1) i σ1 2(k+1) = 1 q (k) nw (k+1) i µ (k+1) 2 = ( 1 n ( ( 1 w (k+1)) 1 n(1 w (k+1) ) ( 1 q (k) i x i x i µ (k+1) 1 ) 1 q (k) ) ( i ) 2 x i x i µ (k+1) 2 ) 2 (6) 37
38 Easy to program in R 38
39 The General Mixture Model A multi-component mixture model can be defined by specifying the individual distributions and and the probability that the random variable (the observable) arises from the each distribution. For a mixture of g distributions, if the PDF of the j th distribution is p j (x; θ j ), the PDF of the mixture is p(x; θ) = where π j 0 and g j=1 π j = 1. g j=1 π j p j (x; θ j ), The g-vector π is the unconditional probabilities for a random variable from the mixture. 39
40 The General Mixture Model: Dummy Variables For observations from different groups, we can use a single indicator variable to designate the group, or we can use dummy variables that take values of 0 or 1, depending on the group of a given observation. In linear models when dummy variables are used to distinguish classes, the number of dummy variables is one less than the number of classes. This yields linearly independent column vectors in the data matrix. In mixture models, it is more convenient to use a dummy variable for each class. For an observation in a given class, only the dummy variable for that class takes a value of 1, and all others take the value 0; thus, the sum of the dummy variables for a given observation is 1. In a general probability mixture model, the expected value of a dummy variable is the probability that an observation comes from the group corresponding to the dummy variable. 40
41 The General Mixture Model In the formulation C = (X, U) where X represents the observed data, the unobserved U represents the g-vector of dummy variables. The unconditional E(U) is vector of probabilities that an observation comes from the various components of the mixture distribution. Suppose we have n observations on X, x = (x 1,..., x n ). In the notation we have been using, q (k) i = q (k) (x i, θ (k) ), the conditional expectation of U, is the estimated conditional probability vector for that observation. 41
42 Given provisional values of π and θ, π (k) and θ (k), we can compute the conditional expected value of the j th dummy variable for any realization of X, x i : q (k) ij = E(U j x i, θ (k) ) = x i ; θ (k) j p (x i ; θ (k)). π (k) ( j p j )
43 Estimation in a General Normal Mixture Model Given E(U x, θ (k) ) as before, we can perform the M step. First, we have, the conditional MLE for the g-vector of probabilities: π (k+1) = 1 n n i=1 E ( U x i, θ (k)). (7) For a mixture of normal distributions, θ j = (µ j, σj 2 ), that is, the mean and variance of the j th distribution, for the conditional MLEs, we have the weighted estimates µ (k+1) j = 1 π (k+1) j σ 2(k+1) j = 1 π (k+1) j ni=1 q (k) ij ni=1 q (k) ij x i ( x i µ (k+1) i ) 2. (8) These equations are similar to equations (6). 42
44 EM in R There are various packages and functions for applications of EM in R. Instead of a general function for EM, the functions depend on the applications; for example, mixtools and mclust use EM to estimate the parameters in mixtures of populations. A very general package that allows regression models instead of just probability models, to distinguish the different populations is flexmix. 43
Review and continuation from last week Properties of MLEs
Review and continuation from last week Properties of MLEs As we have mentioned, MLEs have a nice intuitive property, and as we have seen, they have a certain equivariance property. We will see later that
More informationMore on Unsupervised Learning
More on Unsupervised Learning Two types of problems are to find association rules for occurrences in common in observations (market basket analysis), and finding the groups of values of observational data
More informationInformation in Data. Sufficiency, Ancillarity, Minimality, and Completeness
Information in Data Sufficiency, Ancillarity, Minimality, and Completeness Important properties of statistics that determine the usefulness of those statistics in statistical inference. These general properties
More informationRegression Clustering
Regression Clustering In regression clustering, we assume a model of the form y = f g (x, θ g ) + ɛ g for observations y and x in the g th group. Usually, of course, we assume linear models of the form
More informationStatistics 3858 : Maximum Likelihood Estimators
Statistics 3858 : Maximum Likelihood Estimators 1 Method of Maximum Likelihood In this method we construct the so called likelihood function, that is L(θ) = L(θ; X 1, X 2,..., X n ) = f n (X 1, X 2,...,
More informationOptimization Problems
Optimization Problems The goal in an optimization problem is to find the point at which the minimum (or maximum) of a real, scalar function f occurs and, usually, to find the value of the function at that
More informationLast lecture 1/35. General optimization problems Newton Raphson Fisher scoring Quasi Newton
EM Algorithm Last lecture 1/35 General optimization problems Newton Raphson Fisher scoring Quasi Newton Nonlinear regression models Gauss-Newton Generalized linear models Iteratively reweighted least squares
More informationEM Algorithm II. September 11, 2018
EM Algorithm II September 11, 2018 Review EM 1/27 (Y obs, Y mis ) f (y obs, y mis θ), we observe Y obs but not Y mis Complete-data log likelihood: l C (θ Y obs, Y mis ) = log { f (Y obs, Y mis θ) Observed-data
More informationMonte Carlo Studies. The response in a Monte Carlo study is a random variable.
Monte Carlo Studies The response in a Monte Carlo study is a random variable. The response in a Monte Carlo study has a variance that comes from the variance of the stochastic elements in the data-generating
More informationExponential Family and Maximum Likelihood, Gaussian Mixture Models and the EM Algorithm. by Korbinian Schwinger
Exponential Family and Maximum Likelihood, Gaussian Mixture Models and the EM Algorithm by Korbinian Schwinger Overview Exponential Family Maximum Likelihood The EM Algorithm Gaussian Mixture Models Exponential
More informationBiostat 2065 Analysis of Incomplete Data
Biostat 2065 Analysis of Incomplete Data Gong Tang Dept of Biostatistics University of Pittsburgh October 20, 2005 1. Large-sample inference based on ML Let θ is the MLE, then the large-sample theory implies
More informationLecture 4 September 15
IFT 6269: Probabilistic Graphical Models Fall 2017 Lecture 4 September 15 Lecturer: Simon Lacoste-Julien Scribe: Philippe Brouillard & Tristan Deleu 4.1 Maximum Likelihood principle Given a parametric
More informationOptimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X.
Optimization Background: Problem: given a function f(x) defined on X, find x such that f(x ) f(x) for all x X. The value x is called a maximizer of f and is written argmax X f. In general, argmax X f may
More informationEM for ML Estimation
Overview EM for ML Estimation An algorithm for Maximum Likelihood (ML) Estimation from incomplete data (Dempster, Laird, and Rubin, 1977) 1. Formulate complete data so that complete-data ML estimation
More informationParametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory
Statistical Inference Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory IP, José Bioucas Dias, IST, 2007
More informationChapter 3 : Likelihood function and inference
Chapter 3 : Likelihood function and inference 4 Likelihood function and inference The likelihood Information and curvature Sufficiency and ancilarity Maximum likelihood estimation Non-regular models EM
More informationStatistical learning. Chapter 20, Sections 1 4 1
Statistical learning Chapter 20, Sections 1 4 Chapter 20, Sections 1 4 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete
More informationPOLI 8501 Introduction to Maximum Likelihood Estimation
POLI 8501 Introduction to Maximum Likelihood Estimation Maximum Likelihood Intuition Consider a model that looks like this: Y i N(µ, σ 2 ) So: E(Y ) = µ V ar(y ) = σ 2 Suppose you have some data on Y,
More informationPh.D. Qualifying Exam Friday Saturday, January 3 4, 2014
Ph.D. Qualifying Exam Friday Saturday, January 3 4, 2014 Put your solution to each problem on a separate sheet of paper. Problem 1. (5166) Assume that two random samples {x i } and {y i } are independently
More informationReview of Maximum Likelihood Estimators
Libby MacKinnon CSE 527 notes Lecture 7, October 7, 2007 MLE and EM Review of Maximum Likelihood Estimators MLE is one of many approaches to parameter estimation. The likelihood of independent observations
More information1. Fisher Information
1. Fisher Information Let f(x θ) be a density function with the property that log f(x θ) is differentiable in θ throughout the open p-dimensional parameter set Θ R p ; then the score statistic (or score
More informationPerformance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project
Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore
More informationFall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.
1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n
More informationMS&E 226: Small Data. Lecture 11: Maximum likelihood (v2) Ramesh Johari
MS&E 226: Small Data Lecture 11: Maximum likelihood (v2) Ramesh Johari ramesh.johari@stanford.edu 1 / 18 The likelihood function 2 / 18 Estimating the parameter This lecture develops the methodology behind
More informationCOM336: Neural Computing
COM336: Neural Computing http://www.dcs.shef.ac.uk/ sjr/com336/ Lecture 2: Density Estimation Steve Renals Department of Computer Science University of Sheffield Sheffield S1 4DP UK email: s.renals@dcs.shef.ac.uk
More informationA Few Notes on Fisher Information (WIP)
A Few Notes on Fisher Information (WIP) David Meyer dmm@{-4-5.net,uoregon.edu} Last update: April 30, 208 Definitions There are so many interesting things about Fisher Information and its theoretical properties
More informationPh.D. Qualifying Exam Friday Saturday, January 6 7, 2017
Ph.D. Qualifying Exam Friday Saturday, January 6 7, 2017 Put your solution to each problem on a separate sheet of paper. Problem 1. (5106) Let X 1, X 2,, X n be a sequence of i.i.d. observations from a
More informationStatistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation
Statistics - Lecture One Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Outline 1. Basic ideas about estimation 2. Method of Moments 3. Maximum Likelihood 4. Confidence
More information2 Nonlinear least squares algorithms
1 Introduction Notes for 2017-05-01 We briefly discussed nonlinear least squares problems in a previous lecture, when we described the historical path leading to trust region methods starting from the
More informationPh.D. Qualifying Exam Monday Tuesday, January 4 5, 2016
Ph.D. Qualifying Exam Monday Tuesday, January 4 5, 2016 Put your solution to each problem on a separate sheet of paper. Problem 1. (5106) Find the maximum likelihood estimate of θ where θ is a parameter
More informationComputing the MLE and the EM Algorithm
ECE 830 Fall 0 Statistical Signal Processing instructor: R. Nowak Computing the MLE and the EM Algorithm If X p(x θ), θ Θ, then the MLE is the solution to the equations logp(x θ) θ 0. Sometimes these equations
More informationMaximum Likelihood Estimation. only training data is available to design a classifier
Introduction to Pattern Recognition [ Part 5 ] Mahdi Vasighi Introduction Bayesian Decision Theory shows that we could design an optimal classifier if we knew: P( i ) : priors p(x i ) : class-conditional
More informationLatent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent
Latent Variable Models for Binary Data Suppose that for a given vector of explanatory variables x, the latent variable, U, has a continuous cumulative distribution function F (u; x) and that the binary
More informationBayesian Methods for Machine Learning
Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationExpectation Maximization (EM) Algorithm. Each has it s own probability of seeing H on any one flip. Let. p 1 = P ( H on Coin 1 )
Expectation Maximization (EM Algorithm Motivating Example: Have two coins: Coin 1 and Coin 2 Each has it s own probability of seeing H on any one flip. Let p 1 = P ( H on Coin 1 p 2 = P ( H on Coin 2 Select
More informationProbability and Estimation. Alan Moses
Probability and Estimation Alan Moses Random variables and probability A random variable is like a variable in algebra (e.g., y=e x ), but where at least part of the variability is taken to be stochastic.
More informationGreene, Econometric Analysis (6th ed, 2008)
EC771: Econometrics, Spring 2010 Greene, Econometric Analysis (6th ed, 2008) Chapter 17: Maximum Likelihood Estimation The preferred estimator in a wide variety of econometric settings is that derived
More informationNew mixture models and algorithms in the mixtools package
New mixture models and algorithms in the mixtools package Didier Chauveau MAPMO - UMR 6628 - Université d Orléans 1ères Rencontres BoRdeaux Juillet 2012 Outline 1 Mixture models and EM algorithm-ology
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationAssociation studies and regression
Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration
More informationLatent Variable Models and EM algorithm
Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic
More informationStatistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach
Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Jae-Kwang Kim Department of Statistics, Iowa State University Outline 1 Introduction 2 Observed likelihood 3 Mean Score
More informationStatistical Methods as Optimization Problems
models ( 1 Statistical Methods as Optimization Problems Optimization problems maximization or imization arise in many areas of statistics. Statistical estimation and modeling both are usually special types
More informationStatistics. Lecture 2 August 7, 2000 Frank Porter Caltech. The Fundamentals; Point Estimation. Maximum Likelihood, Least Squares and All That
Statistics Lecture 2 August 7, 2000 Frank Porter Caltech The plan for these lectures: The Fundamentals; Point Estimation Maximum Likelihood, Least Squares and All That What is a Confidence Interval? Interval
More informationIntroduction to Maximum Likelihood Estimation
Introduction to Maximum Likelihood Estimation Eric Zivot July 26, 2012 The Likelihood Function Let 1 be an iid sample with pdf ( ; ) where is a ( 1) vector of parameters that characterize ( ; ) Example:
More informationLinear Models in Machine Learning
CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,
More informationLecture 3 September 1
STAT 383C: Statistical Modeling I Fall 2016 Lecture 3 September 1 Lecturer: Purnamrita Sarkar Scribe: Giorgio Paulon, Carlos Zanini Disclaimer: These scribe notes have been slightly proofread and may have
More informationExpectation maximization tutorial
Expectation maximization tutorial Octavian Ganea November 18, 2016 1/1 Today Expectation - maximization algorithm Topic modelling 2/1 ML & MAP Observed data: X = {x 1, x 2... x N } 3/1 ML & MAP Observed
More informationMaximum Likelihood Estimation
Chapter 8 Maximum Likelihood Estimation 8. Consistency If X is a random variable (or vector) with density or mass function f θ (x) that depends on a parameter θ, then the function f θ (X) viewed as a function
More informationMixtures of Rasch Models
Mixtures of Rasch Models Hannah Frick, Friedrich Leisch, Achim Zeileis, Carolin Strobl http://www.uibk.ac.at/statistics/ Introduction Rasch model for measuring latent traits Model assumption: Item parameters
More informationLecture 3 - Linear and Logistic Regression
3 - Linear and Logistic Regression-1 Machine Learning Course Lecture 3 - Linear and Logistic Regression Lecturer: Haim Permuter Scribe: Ziv Aharoni Throughout this lecture we talk about how to use regression
More informationAdvanced Quantitative Methods: maximum likelihood
Advanced Quantitative Methods: Maximum Likelihood University College Dublin March 23, 2011 1 Introduction 2 3 4 5 Outline Introduction 1 Introduction 2 3 4 5 Preliminaries Introduction Ordinary least squares
More informationECONOMETRICS I. Assignment 5 Estimation
ECONOMETRICS I Professor William Greene Phone: 212.998.0876 Office: KMC 7-90 Home page: people.stern.nyu.edu/wgreene Email: wgreene@stern.nyu.edu Course web page: people.stern.nyu.edu/wgreene/econometrics/econometrics.htm
More informationRegression Estimation - Least Squares and Maximum Likelihood. Dr. Frank Wood
Regression Estimation - Least Squares and Maximum Likelihood Dr. Frank Wood Least Squares Max(min)imization Function to minimize w.r.t. β 0, β 1 Q = n (Y i (β 0 + β 1 X i )) 2 i=1 Minimize this by maximizing
More informationMathematical statistics
October 4 th, 2018 Lecture 12: Information Where are we? Week 1 Week 2 Week 4 Week 7 Week 10 Week 14 Probability reviews Chapter 6: Statistics and Sampling Distributions Chapter 7: Point Estimation Chapter
More informationRegression Estimation Least Squares and Maximum Likelihood
Regression Estimation Least Squares and Maximum Likelihood Dr. Frank Wood Frank Wood, fwood@stat.columbia.edu Linear Regression Models Lecture 3, Slide 1 Least Squares Max(min)imization Function to minimize
More informationMachine Learning 2017
Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationParameter Estimation
Parameter Estimation Chapters 13-15 Stat 477 - Loss Models Chapters 13-15 (Stat 477) Parameter Estimation Brian Hartman - BYU 1 / 23 Methods for parameter estimation Methods for parameter estimation Methods
More informationIEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm
IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.
More informationMaximum Likelihood Estimation
Maximum Likelihood Estimation Prof. C. F. Jeff Wu ISyE 8813 Section 1 Motivation What is parameter estimation? A modeler proposes a model M(θ) for explaining some observed phenomenon θ are the parameters
More informationAdvanced Quantitative Methods: maximum likelihood
Advanced Quantitative Methods: Maximum Likelihood University College Dublin 4 March 2014 1 2 3 4 5 6 Outline 1 2 3 4 5 6 of straight lines y = 1 2 x + 2 dy dx = 1 2 of curves y = x 2 4x + 5 of curves y
More informationStat 451 Lecture Notes Numerical Integration
Stat 451 Lecture Notes 03 12 Numerical Integration Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapter 5 in Givens & Hoeting, and Chapters 4 & 18 of Lange 2 Updated: February 11, 2016 1 / 29
More information(θ θ ), θ θ = 2 L(θ ) θ θ θ θ θ (θ )= H θθ (θ ) 1 d θ (θ )
Setting RHS to be zero, 0= (θ )+ 2 L(θ ) (θ θ ), θ θ = 2 L(θ ) 1 (θ )= H θθ (θ ) 1 d θ (θ ) O =0 θ 1 θ 3 θ 2 θ Figure 1: The Newton-Raphson Algorithm where H is the Hessian matrix, d θ is the derivative
More informationSTA216: Generalized Linear Models. Lecture 1. Review and Introduction
STA216: Generalized Linear Models Lecture 1. Review and Introduction Let y 1,..., y n denote n independent observations on a response Treat y i as a realization of a random variable Y i In the general
More informationEstimating the parameters of hidden binomial trials by the EM algorithm
Hacettepe Journal of Mathematics and Statistics Volume 43 (5) (2014), 885 890 Estimating the parameters of hidden binomial trials by the EM algorithm Degang Zhu Received 02 : 09 : 2013 : Accepted 02 :
More informationMathematical statistics
October 18 th, 2018 Lecture 16: Midterm review Countdown to mid-term exam: 7 days Week 1 Chapter 1: Probability review Week 2 Week 4 Week 7 Chapter 6: Statistics Chapter 7: Point Estimation Chapter 8:
More informationCS534 Machine Learning - Spring Final Exam
CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the
More informationA note on shaved dice inference
A note on shaved dice inference Rolf Sundberg Department of Mathematics, Stockholm University November 23, 2016 Abstract Two dice are rolled repeatedly, only their sum is registered. Have the two dice
More informationSystem Identification, Lecture 4
System Identification, Lecture 4 Kristiaan Pelckmans (IT/UU, 2338) Course code: 1RT880, Report code: 61800 - Spring 2016 F, FRI Uppsala University, Information Technology 13 April 2016 SI-2016 K. Pelckmans
More informationParametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012
Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood
More informationGeneralized Linear Models for Non-Normal Data
Generalized Linear Models for Non-Normal Data Today s Class: 3 parts of a generalized model Models for binary outcomes Complications for generalized multivariate or multilevel models SPLH 861: Lecture
More informationParametric Techniques Lecture 3
Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to
More informationMax. Likelihood Estimation. Outline. Econometrics II. Ricardo Mora. Notes. Notes
Maximum Likelihood Estimation Econometrics II Department of Economics Universidad Carlos III de Madrid Máster Universitario en Desarrollo y Crecimiento Económico Outline 1 3 4 General Approaches to Parameter
More informationNow consider the case where E(Y) = µ = Xβ and V (Y) = σ 2 G, where G is diagonal, but unknown.
Weighting We have seen that if E(Y) = Xβ and V (Y) = σ 2 G, where G is known, the model can be rewritten as a linear model. This is known as generalized least squares or, if G is diagonal, with trace(g)
More informationClustering by Mixture Models. General background on clustering Example method: k-means Mixture model based clustering Model estimation
Clustering by Mixture Models General bacground on clustering Example method: -means Mixture model based clustering Model estimation 1 Clustering A basic tool in data mining/pattern recognition: Divide
More informationIntroduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf
1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Lior Wolf 2014-15 We know that X ~ B(n,p), but we do not know p. We get a random sample from X, a
More informationLecture 3: More on regularization. Bayesian vs maximum likelihood learning
Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting
More informationLinear Models A linear model is defined by the expression
Linear Models A linear model is defined by the expression x = F β + ɛ. where x = (x 1, x 2,..., x n ) is vector of size n usually known as the response vector. β = (β 1, β 2,..., β p ) is the transpose
More informationMaximum Likelihood Estimation
Chapter 7 Maximum Likelihood Estimation 7. Consistency If X is a random variable (or vector) with density or mass function f θ (x) that depends on a parameter θ, then the function f θ (X) viewed as a function
More informationEM for Spherical Gaussians
EM for Spherical Gaussians Karthekeyan Chandrasekaran Hassan Kingravi December 4, 2007 1 Introduction In this project, we examine two aspects of the behavior of the EM algorithm for mixtures of spherical
More informationMAS223 Statistical Inference and Modelling Exercises
MAS223 Statistical Inference and Modelling Exercises The exercises are grouped into sections, corresponding to chapters of the lecture notes Within each section exercises are divided into warm-up questions,
More informationOutline of GLMs. Definitions
Outline of GLMs Definitions This is a short outline of GLM details, adapted from the book Nonparametric Regression and Generalized Linear Models, by Green and Silverman. The responses Y i have density
More informationAccelerating the EM Algorithm for Mixture Density Estimation
Accelerating the EM Algorithm ICERM Workshop September 4, 2015 Slide 1/18 Accelerating the EM Algorithm for Mixture Density Estimation Homer Walker Mathematical Sciences Department Worcester Polytechnic
More informationChapter 5. Chapter 5 sections
1 / 43 sections Discrete univariate distributions: 5.2 Bernoulli and Binomial distributions Just skim 5.3 Hypergeometric distributions 5.4 Poisson distributions Just skim 5.5 Negative Binomial distributions
More informationLinear model A linear model assumes Y X N(µ(X),σ 2 I), And IE(Y X) = µ(x) = X β, 2/52
Statistics for Applications Chapter 10: Generalized Linear Models (GLMs) 1/52 Linear model A linear model assumes Y X N(µ(X),σ 2 I), And IE(Y X) = µ(x) = X β, 2/52 Components of a linear model The two
More informationTwo hours. To be supplied by the Examinations Office: Mathematical Formula Tables THE UNIVERSITY OF MANCHESTER. 21 June :45 11:45
Two hours MATH20802 To be supplied by the Examinations Office: Mathematical Formula Tables THE UNIVERSITY OF MANCHESTER STATISTICAL METHODS 21 June 2010 9:45 11:45 Answer any FOUR of the questions. University-approved
More informationMaximum Likelihood Estimation
Maximum Likelihood Estimation Guy Lebanon February 19, 2011 Maximum likelihood estimation is the most popular general purpose method for obtaining estimating a distribution from a finite sample. It was
More informationParameter Estimation of Power Lomax Distribution Based on Type-II Progressively Hybrid Censoring Scheme
Applied Mathematical Sciences, Vol. 12, 2018, no. 18, 879-891 HIKARI Ltd, www.m-hikari.com https://doi.org/10.12988/ams.2018.8691 Parameter Estimation of Power Lomax Distribution Based on Type-II Progressively
More informationCOMS 4721: Machine Learning for Data Science Lecture 16, 3/28/2017
COMS 4721: Machine Learning for Data Science Lecture 16, 3/28/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University SOFT CLUSTERING VS HARD CLUSTERING
More informationReview of Discrete Probability (contd.)
Stat 504, Lecture 2 1 Review of Discrete Probability (contd.) Overview of probability and inference Probability Data generating process Observed data Inference The basic problem we study in probability:
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More informationGeneralized Linear Models. Last time: Background & motivation for moving beyond linear
Generalized Linear Models Last time: Background & motivation for moving beyond linear regression - non-normal/non-linear cases, binary, categorical data Today s class: 1. Examples of count and ordered
More informationLecture 7 Introduction to Statistical Decision Theory
Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7
More informationCSE446: Clustering and EM Spring 2017
CSE446: Clustering and EM Spring 2017 Ali Farhadi Slides adapted from Carlos Guestrin, Dan Klein, and Luke Zettlemoyer Clustering systems: Unsupervised learning Clustering Detect patterns in unlabeled
More informationMaximum Likelihood, Logistic Regression, and Stochastic Gradient Training
Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions
More informationBasic Computations in Statistical Inference
1 Basic Computations in Statistical Inference The purpose of an exploration of data may be rather limited and ad hoc, or the purpose may be more general, perhaps to gain understanding of some natural phenomenon.
More information1 One-way analysis of variance
LIST OF FORMULAS (Version from 21. November 2014) STK2120 1 One-way analysis of variance Assume X ij = µ+α i +ɛ ij ; j = 1, 2,..., J i ; i = 1, 2,..., I ; where ɛ ij -s are independent and N(0, σ 2 ) distributed.
More informationPractice Problems Section Problems
Practice Problems Section 4-4-3 4-4 4-5 4-6 4-7 4-8 4-10 Supplemental Problems 4-1 to 4-9 4-13, 14, 15, 17, 19, 0 4-3, 34, 36, 38 4-47, 49, 5, 54, 55 4-59, 60, 63 4-66, 68, 69, 70, 74 4-79, 81, 84 4-85,
More information