A732: Exercise #7 Maximum Likelihood

Due: 29 November 2007

1 Analytic computation of some one-dimensional maximum likelihood estimators

(a) Including the normalization, the exponential distribution function is

    f(x;θ) = θ e^{-θx}.                                        (1)

The likelihood function for n data points is then

    L(θ) = θ^n e^{-θ Σ_{i=1}^{n} x_i}.                         (2)

The maximum likelihood estimator follows easily by determining the zero of dL(θ)/dθ to find

    θ_ML = n / Σ_{i=1}^{n} x_i = 1/µ_x.                        (3)

In words, the maximum likelihood estimator θ_ML is equal to the inverse of the sample mean.
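
A quick numerical check of this result, using an arbitrary illustrative rate, sample size, and random seed (none of which come from the exercise):

    import numpy as np

    # Check of equation (3): for exponential data the ML estimate of the rate
    # is the inverse of the sample mean.  Rate, sample size, and seed are
    # illustrative choices only.
    rng = np.random.default_rng(0)
    theta_true = 2.5
    x = rng.exponential(scale=1.0 / theta_true, size=100_000)

    theta_ml = 1.0 / x.mean()            # theta_ML = n / sum(x_i)
    print(f"true theta = {theta_true}, theta_ML = {theta_ml:.3f}")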

(b) In this case one cannot determine θ by searching for the zero of the first derivative of the likelihood function, because the density function is discontinuous,

    f(x;θ) = 1/θ   if 0 < x < θ,
             0     otherwise,

and the resulting likelihood function,

    L(θ) = θ^{-n},                                             (4)

is monotonic. Note that finding the root of the first derivative of the likelihood is only a mathematical device for finding the extremum, and there is no reason that other arguments cannot be used. In particular, note that θ must be larger than or equal to x_max = max(x_1, x_2, ..., x_n). Since x_max is the smallest of the possible values of θ consistent with the data, it is the one that maximizes L(θ). We have therefore argued that the maximum likelihood estimate of θ is θ_ML = x_max.

Suppose that the true value of θ = Θ is greater than the maximum likelihood estimate (θ_ML = x_max). It is straightforward to calculate the cumulative probability P(x < x_max; Θ) and, from it, the probability that all n values of x are smaller than x_max:

    P(x_1, ..., x_n < x_max; Θ) = (x_max/Θ)^n.                 (5)

As intuitively expected, the larger n, the smaller this probability, i.e. the less likely it is that Θ is much greater than θ_ML. One can also easily see that the maximum-likelihood estimate is biased: the distribution function of x_max = max(x_1, x_2, ..., x_n), with the x_i drawn from the uniform distribution discussed in this exercise, is

    f(x_max) = (n/θ^n) x_max^{n-1}.                            (6)

It is straightforward to show that E(x_max) = n θ/(n + 1). Figure 1 shows the histogram of x_max obtained from 10000 samples of 5 random numbers drawn from a uniform distribution with θ = 2. As expected for n = 5 and θ = 2, the mean value of x_max is in this case equal to 1.666.

Figure 1: Histogram of values of x_max from samples of 5 random numbers drawn from a uniform distribution for 0 < x < 2.
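
The Monte Carlo behind Figure 1 can be sketched in a few lines, using the values quoted above (10000 samples of n = 5 deviates with θ = 2); the random seed is arbitrary:

    import numpy as np

    # Monte Carlo illustration of the bias of theta_ML = x_max for the uniform
    # distribution on (0, theta): 10000 samples of n = 5 deviates with theta = 2.
    rng = np.random.default_rng(1)
    theta, n, n_samples = 2.0, 5, 10_000

    x_max = rng.uniform(0.0, theta, size=(n_samples, n)).max(axis=1)

    print(f"mean of x_max          = {x_max.mean():.3f}")         # ~1.67
    print(f"expected n theta/(n+1) = {n * theta / (n + 1):.3f}")   # 1.667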

2 Bayes theorem and bias

(a) See next part.

(b) Assume that {x} is distributed as f(x; µ, σ²), where µ and σ² describe the mean and variance of the known distribution f(·). The likelihood function is then

    L(µ, σ²) = Π_{k=1}^{N} f(x_k; µ, σ²),

and our ML estimates are the values µ̂, σ̂² that maximize L or, equivalently, log L. Our analytic expressions for these parameters follow from

    ∂ log L/∂µ  = Σ_k (1/f) ∂f/∂µ,
    ∂ log L/∂σ² = Σ_k (1/f) ∂f/∂σ².

Setting these to zero for a Gaussian, f(x; µ, σ²) = exp[-(x - µ)²/2σ²] / √(2πσ²), we find

    µ̂  = (1/N) Σ_k x_k ≈ 1.75,
    σ̂² = (1/N) Σ_k (x_k - µ̂)² ≈ 2.19.

In class we showed that the sample mean and sample variance are unbiased estimators of the mean and variance. Notice that our ML estimate µ̂ is the sample mean, but σ̂² differs from the sample variance by the factor N/(N - 1) and is, therefore, biased. The difference between 2.19 for the biased variance and 2.92 for the unbiased variance is relatively large for our small data set!
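
The factor N/(N - 1) is easy to see numerically. The sketch below uses a hypothetical four-point data set, constructed only so that its mean and ML variance match the estimates quoted above; it is not the actual data set of the exercise, which is not listed here:

    import numpy as np

    # Biased (ML) versus unbiased sample variance.  The four points below are
    # hypothetical, chosen only so that their mean and ML variance match the
    # estimates quoted in the text; they are not the exercise's data.
    x = np.array([-0.18, 0.94, 2.56, 3.68])
    N = len(x)

    mu_hat = x.mean()                    # ML estimate of the mean
    var_ml = np.var(x, ddof=0)           # divides by N (biased)
    var_unbiased = np.var(x, ddof=1)     # divides by N - 1

    print(f"mu_hat   = {mu_hat:.2f}")                                  # 1.75
    print(f"ML var   = {var_ml:.2f}")                                  # ~2.19
    print(f"unbiased = {var_unbiased:.2f}, factor N/(N-1) = {N/(N-1):.3f}")  # ~2.92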

(c) We know that Bayes theorem tells us that the posterior probability of some parameter µ is the product of the prior distribution of the parameter and the likelihood function, appropriately normalized:

    P(µ|D) = P(µ) L(D|µ) / ∫ dµ P(µ) L(D|µ),

where D is the data. If we now assume that the prior distribution of µ is normal with mean µ_0 = 2 and variance σ_0² = 2, we have

    P(µ|D) ∝ P(µ; µ_0, σ_0²) L(D|µ),

where L(D|µ) = Π_{k=1}^{N} f(x_k; µ, σ_e²) with σ_e² = 3. Therefore P(µ|D) is the product of two Gaussians, which is again a Gaussian. After a bit of algebra, one finds that the mean and variance of P(µ|D) are

    µ̄  = [µ_0/σ_0² + µ̂/(σ_e²/N)] / [1/σ_0² + 1/(σ_e²/N)] ≈ 1.82,
    σ̄² = [1/σ_0² + 1/(σ_e²/N)]^{-1} ≈ 0.55.

In other words, the mean of the distribution for µ is the average of the prior mean, inversely weighted by the prior variance, and the sample mean, inversely weighted by the variance of the sample mean. For large N this mean will be dominated by the sample mean, and vice versa for small N. Similarly, the variance of the distribution for µ is the harmonic mean of the prior variance and the variance of the sample mean. Note that the variance of the sample mean follows from the central limit theorem: it is the true population variance divided by the number of data points. Therefore, just as in the case of the mean, for large N the posterior variance will be dominated by the variance of the sample mean, and vice versa for small N.
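
A minimal sketch of this inverse-variance weighting, using the quoted values µ_0 = 2, σ_0² = 2, σ_e² = 3 and the sample mean µ̂ ≈ 1.75 from part (b); the number of data points N is not quoted explicitly, so the value below is only an illustrative assumption:

    # Inverse-variance combination of a Gaussian prior with the sample mean.
    # mu0, sigma0_sq, sigma_e_sq are the quoted values; mu_hat comes from
    # part (b).  N is NOT stated in the text; the value here is illustrative.
    mu0, sigma0_sq = 2.0, 2.0        # prior mean and variance
    sigma_e_sq = 3.0                 # known population variance
    mu_hat = 1.75                    # sample mean from part (b)
    N = 4                            # assumed number of data points

    sigma_mean_sq = sigma_e_sq / N   # variance of the sample mean (CLT)

    post_var = 1.0 / (1.0 / sigma0_sq + 1.0 / sigma_mean_sq)
    post_mean = post_var * (mu0 / sigma0_sq + mu_hat / sigma_mean_sq)

    print(f"posterior mean     = {post_mean:.2f}")   # ~1.82
    print(f"posterior variance = {post_var:.2f}")    # ~0.55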

3 Estimating a power law with a break

(a) We assume that x ∈ [x_min, x_max] to prevent divergence as x → 0 or x → ∞. This is typical in physical applications; for example, if x represents the mass of a galaxy, the distribution has a cutoff at some small and some large galaxy mass. Further, we can assume that b ∈ [x_min, x_max], otherwise we would not have a broken power law. We can integrate

    f(x; p_1, p_2, b) = K (x/b)^{p_1}   if x ≤ b,
                        K (x/b)^{p_2}   if x > b,

and therefore

    1 = ∫_{x_min}^{x_max} dx f(x; p_1, p_2, b)
      = K b [ ∫_{x_min/b}^{1} dy y^{p_1} + ∫_{1}^{x_max/b} dy y^{p_2} ]
      = K b { [1 - (x_min/b)^{p_1+1}]/(p_1 + 1) + [(x_max/b)^{p_2+1} - 1]/(p_2 + 1) }

determines K. This expression demands p_1 > -1 if x_min → 0 and p_2 < -1 if x_max → ∞, further illustrating the care that must be taken in choosing the limits on the range for a power law.

(b) Combined with the next part.

(c) In this problem p_2 = -3/2, so f is a two-parameter family of functions, f(x; p_1, -3/2, b). The likelihood is then

    L(p_1, b) = Π_{k=1}^{N} f(x_k; p_1, -3/2, b),

or

    log L(p_1, b) = Σ_{k=1}^{N} log f(x_k; p_1, -3/2, b).

My ML solution is plotted along with a histogram of the data in Figure 2. As stated in the problem and discussed in class,

    log L = log L_0 - (θ - θ̂)²/(2σ_θ²),                       (7)

so that the one-sigma error is the value for which log L(θ̂ + σ_θ) = log L_0 - 1/2. For higher dimensionality, one uses the fact that near its maximum the likelihood function behaves like a multidimensional Gaussian; from equation (7) it is then clear that the quantity 2(log L_0 - log L) is distributed like χ², where L_0 is the value at maximum likelihood. In our case, one sigma for two degrees of freedom is described by the contour in the plot of log likelihood down from the maximum value by χ²/2 ≈ 1.15. Similarly, the two-sigma (three-sigma) value is the contour containing 95.4% (99.7%) of the probability; for two degrees of freedom, the corresponding values of χ²/2 are 3.09 (5.9).
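
A sketch of the whole procedure under stated assumptions: the data array is a hypothetical stand-in (so the printed ML values will not match Table 1 below), x_min, x_max, and the grid ranges are illustrative choices, and p_2 = -3/2 follows the problem. The χ² quantiles at the end reproduce the Δ log L levels quoted above:

    import numpy as np
    from scipy import stats

    # Normalized broken power law (equation above), a brute-force grid search
    # for the ML peak, and the Delta(log L) contour levels for 2 parameters.
    def log_pdf(x, p1, b, p2=-1.5, x_min=1e-3, x_max=1.0):
        norm = b * ((1.0 - (x_min / b) ** (p1 + 1)) / (p1 + 1)
                    + ((x_max / b) ** (p2 + 1) - 1.0) / (p2 + 1))
        y = np.where(x <= b, (x / b) ** p1, (x / b) ** p2)
        return np.log(y) - np.log(norm)

    def log_like(x, p1, b):
        return np.sum(log_pdf(x, p1, b))

    # Hypothetical stand-in data (the real data set is not listed in the text)
    rng = np.random.default_rng(2)
    x = rng.uniform(1e-3, 1.0, size=200)

    p1_grid = np.linspace(-0.9, 2.0, 120)
    b_grid = np.linspace(0.02, 0.3, 120)
    logL = np.array([[log_like(x, p1, b) for b in b_grid] for p1 in p1_grid])
    i, j = np.unravel_index(np.argmax(logL), logL.shape)
    print(f"ML peak on this stand-in data: p1 ~ {p1_grid[i]:.2f}, b ~ {b_grid[j]:.3f}")

    # Contours below the peak for 1, 2, 3 sigma with two free parameters
    for cl in (0.6827, 0.9545, 0.9973):
        print(f"{cl:.2%}: Delta logL = {stats.chi2.ppf(cl, df=2) / 2:.2f}")
    # ~1.15, 3.09, 5.91, matching the Delta logL values quoted above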

Figure 2: Plot of the broken power law with the ML-estimated values for p_1 and b (red curve) along with the data (green histogram).

Table 1: Parameter confidence limits from the likelihood plot.
Peak: p_1 ≈ -0.27, b ≈ 0.083

    Confidence limit    p_1 min    p_1 max    b min     b max
    68.3%               -0.55      0.45       0.057     0.1
    95.4%               -0.70      3.2        0.028     0.3
    99.7%               -0.85      4.70       0.025     0.8

The value of log L is shown in Figure 3, together with the three contours corresponding to these one-, two-, and three-sigma probability values. The confidence limits read from this plot are given in Table 1.

(d) We have derived that the covariance matrix for the likelihood is determined by

    σ^{-2}_{ij} = -∂² log L / ∂θ_i ∂θ_j.

Because f(x; p_1, p_2, b) has a slightly messy analytic expression, I found it easier to perform numerical partial differentiation of log L rather than use the equivalent expression

    σ^{-2}_{ij} = -E_L[ ∂² log f / ∂θ_i ∂θ_j ].

Figure 3: Plot of log L as a function of the power-law exponent p_1 and break point b, with log L_0 = 0. Top panel: the three curves show the theoretical 68.3%, 95.4%, and 99.7% isovalues. Lower panel: blow-up of the density inside the 68.3% contour.

I recursively used the two-point difference formula. For the diagonal terms one finds

    ∂² log L/∂p_1² ≈ (1/Δp_1²) [log L(p_1 + Δp_1, b) - 2 log L(p_1, b) + log L(p_1 - Δp_1, b)],
    ∂² log L/∂b²   ≈ (1/Δb²)   [log L(p_1, b + Δb) - 2 log L(p_1, b) + log L(p_1, b - Δb)],

and for the mixed term

    ∂² log L/∂p_1∂b ≈ (1/(Δp_1 Δb)) [log L(p_1 + Δp_1/2, b + Δb/2) - log L(p_1 - Δp_1/2, b + Δb/2)
                                     - log L(p_1 + Δp_1/2, b - Δb/2) + log L(p_1 - Δp_1/2, b - Δb/2)].

I chose Δp_1 = 0.01 p̂_1 and Δb = 0.01 b̂. The eigenvectors of the resulting 2×2 matrix σ^{-2}_{ij} describe the principal components (directions of uncorrelated error), and the inverse of each eigenvalue is the variance in that direction. I find

    σ_1 ≈ 6.4 × 10^{-3}   and   σ_2 ≈ 1.7 × 10^{-1},

with corresponding directions

    ê_1 = (0.08, 1.0),   ê_2 = (1.0, -0.08).

In other words, the principal axes are nearly along the p_1 and b directions with a small tilt. This is consistent with our graphical solution depicted in Figure 3. Similarly, the variance estimate is consistent with the overall scale on which the likelihood levels vary but, of course, does not predict the shape of the contours. In particular, it is very important to note that p_1 is unbounded toward small values; in other words, we cannot rule out a value of p_1 near zero. Similarly, the high-confidence boundaries are not elliptical. It is nearly always more revealing to study the explicit likelihood distribution rather than rely on the covariance matrix.
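
A self-contained sketch of this finite-difference error analysis. To keep it runnable on its own, it is applied to a toy quadratic log-likelihood with a known covariance (a hypothetical stand-in for the broken-power-law log L); the 1% step sizes follow the choice above:

    import numpy as np

    # Finite-difference Hessian of log L at its peak (three-point stencils for
    # the diagonal terms, a four-point stencil for the mixed term), followed by
    # an eigen-decomposition whose inverse eigenvalues give the variances along
    # the principal axes.  In the exercise, f would be the broken-power-law
    # log L evaluated at (p1_hat, b_hat).
    def neg_hessian(f, t1, t2, frac=0.01):
        d1, d2 = frac * abs(t1), frac * abs(t2)      # 1% steps, as in the text
        h11 = (f(t1 + d1, t2) - 2 * f(t1, t2) + f(t1 - d1, t2)) / d1**2
        h22 = (f(t1, t2 + d2) - 2 * f(t1, t2) + f(t1, t2 - d2)) / d2**2
        h12 = (f(t1 + d1 / 2, t2 + d2 / 2) - f(t1 - d1 / 2, t2 + d2 / 2)
               - f(t1 + d1 / 2, t2 - d2 / 2) + f(t1 - d1 / 2, t2 - d2 / 2)) / (d1 * d2)
        return -np.array([[h11, h12], [h12, h22]])

    # Toy log L: a correlated 2-D Gaussian peaked at (1.0, 2.0)
    cov_true = np.array([[0.04, 0.01], [0.01, 0.09]])
    prec = np.linalg.inv(cov_true)
    def toy_logL(t1, t2):
        d = np.array([t1 - 1.0, t2 - 2.0])
        return -0.5 * d @ prec @ d

    info = neg_hessian(toy_logL, 1.0, 2.0)
    eigval, eigvec = np.linalg.eigh(info)
    print("recovered covariance:\n", np.linalg.inv(info))   # ~cov_true
    print("principal-axis sigmas:", 1.0 / np.sqrt(eigval))
    print("principal directions:\n", eigvec)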