A732: Exercise #7
Maximum Likelihood
Due: 29 November 2007

1 Analytic computation of some one-dimensional maximum likelihood estimators

(a) Including the normalization, the exponential distribution function is

    f(x;θ) = θ e^{−θx}.                                              (1)

The likelihood function for n data points is then

    L(θ) = θ^n e^{−θ Σ_{i=1}^n x_i}.                                 (2)

The maximum likelihood estimator follows easily by determining the zero of dL(θ)/dθ, which gives

    θ_ML = n / Σ_{i=1}^n x_i = 1/µ_x.                                (3)

In words, the maximum likelihood estimator θ_ML is equal to the inverse of the sample mean.
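As a quick numerical check (added to this write-up, not part of the original solution), one can draw a large exponential sample and compare the estimator of equation (3) with the true rate; the parameter values below are arbitrary:

    import numpy as np

    # Sanity check of eq. (3): the ML estimate of the exponential rate is
    # the inverse of the sample mean. theta_true and the sample size are
    # arbitrary choices for illustration.
    rng = np.random.default_rng(0)
    theta_true = 2.5
    x = rng.exponential(scale=1.0 / theta_true, size=100_000)

    theta_ml = 1.0 / x.mean()          # eq. (3): theta_ML = n / sum(x_i)
    print(f"true theta = {theta_true}, ML estimate = {theta_ml:.4f}")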
(b) In this case one cannot determine θ by searching for the zero of the first derivative of the likelihood function, because the density function is discontinuous,

    f(x;θ) = { 1/θ   if 0 ≤ x < θ,
             { 0     otherwise,

and the resulting likelihood function,

    L(θ) = θ^{−n},                                                   (4)

is monotonic. Note that finding the root of the first derivative of the likelihood is only a mathematical device for finding the extremum, and there is no reason that other arguments cannot be used. In particular, note that θ must be larger than or equal to x_max = max(x_1, x_2, ..., x_n). Since x_max is the smallest of the possible values of θ consistent with the data, it is the one that maximizes L(θ). We have therefore argued that the maximum likelihood estimate of θ is θ_ML = x_max.

Suppose that the true value of θ = Θ is greater than the maximum likelihood estimate (θ_ML = x_max). It is straightforward to calculate the cumulative probability P(x < x_max; Θ) and determine the probability that all n values of x are smaller than x_max:

    P(x < x_max; Θ) = (x_max/Θ)^n.                                   (5)

As intuitively expected, the larger n, the smaller the probability that Θ > θ_ML. One can also easily see that the maximum-likelihood estimate is biased: the distribution function of x_max = max(x_1, x_2, ..., x_n), with the x_i drawn from the uniform distribution discussed in this exercise, is

    f(x_max) = (n/θ^n) x_max^{n−1}.                                  (6)

It is straightforward to show that E(x_max) = n/(n+1) θ. Figure 1 shows the histogram of x_max obtained from 10000 samples of 5 random numbers drawn from a uniform distribution with θ = 2. As expected for n = 5 and θ = 2, the mean value of x_max is in this case equal to 1.666.

[Figure 1: Histogram of values of x_max from 10000 samples of 5 random numbers drawn from a uniform distribution for 0 < x < 2.]
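The bias experiment above is easy to repeat; a minimal sketch using the numbers quoted in the text (10000 samples of 5 uniform draws with θ = 2):

    import numpy as np

    # Reproduce the bias experiment: the mean of x_max over many samples
    # approaches E(x_max) = n/(n+1) * theta, not theta itself.
    rng = np.random.default_rng(1)
    theta, n, n_samples = 2.0, 5, 10_000

    x_max = rng.uniform(0.0, theta, size=(n_samples, n)).max(axis=1)
    print(f"mean of x_max = {x_max.mean():.4f}")  # close to 5/6 * 2 = 1.666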
2 Bayes theorem and bias

(a) See the next part.

(b) Assume that {x} is distributed as f(x;µ,σ²), where µ and σ² describe the mean and variance of the known distribution f(·). The likelihood function is then

    L(µ,σ²) = Π_{k=1}^N f(x_k;µ,σ²)

and our ML estimates are the values µ̂, σ̂² that maximize L, or equivalently log L. Our analytic expressions for these parameters are:

    ∂ log L/∂µ  = Σ_k [1/f(x_k;µ,σ²)] ∂f(x_k;µ,σ²)/∂µ
    ∂ log L/∂σ² = Σ_k [1/f(x_k;µ,σ²)] ∂f(x_k;µ,σ²)/∂σ²

For a Gaussian, f(x;µ,σ²) = exp[−(x−µ)²/2σ²]/√(2πσ²), we find:

    µ̂  = (1/N) Σ_k x_k ≈ 1.75
    σ̂² = (1/N) Σ_k (x_k − µ̂)² ≈ 2.19.

In class, we showed that the sample mean and sample variance are unbiased estimators for the mean and variance. Notice that our ML estimate µ̂ is the sample mean, but σ̂² differs from the sample variance by the factor N/(N−1) and is, therefore, biased. The difference between 2.19 for the biased variance and 2.92 for the unbiased variance is relatively large for our small data set!
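The N/(N−1) distinction corresponds to the ddof argument of numpy's variance routine; a short illustration on a hypothetical Gaussian sample (the exercise's actual data set is not reproduced here, so the printed values will differ from those above):

    import numpy as np

    # Biased (ML) versus unbiased sample variance on a small synthetic sample.
    rng = np.random.default_rng(2)
    x = rng.normal(loc=1.75, scale=np.sqrt(3.0), size=4)  # assumed small N

    mu_hat = x.mean()             # ML estimate of the mean (unbiased)
    var_ml = x.var(ddof=0)        # ML estimate: divides by N (biased low)
    var_unbiased = x.var(ddof=1)  # sample variance: divides by N - 1
    print(mu_hat, var_ml, var_unbiased, var_unbiased / var_ml)  # ratio N/(N-1)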
(c) We know that Bayes theorem tells us that the posterior probability of some parameter µ is the product of the prior distribution of the parameter and the likelihood function, appropriately normalized:

    P(µ|D) = P(µ) L(D|µ) / ∫ dµ P(µ) L(D|µ),

where D is the data. If we now assume that the prior distribution of µ is normal with mean µ_0 = 2 and variance σ_0² = 2, we have

    P(µ|D) ∝ P(µ;µ_0,σ_0²) L(D|µ),

where L(D|µ) = Π_{k=1}^N f(x_k;µ,σ_e²) with σ_e² = 3. Therefore P(µ|D) is the product of two Gaussians, which is again a Gaussian. After a bit of algebra, one finds that the mean and variance of P(µ|D) are:

    µ̄  = [µ_0/σ_0² + µ̂/(σ_e²/N)] / [1/σ_0² + 1/(σ_e²/N)] ≈ 1.82,
    σ̄² = [1/σ_0² + 1/(σ_e²/N)]^{−1} ≈ 0.55.

In other words, the mean of the distribution for µ is the average of the prior mean inversely weighted by the prior variance and the sample mean inversely weighted by the variance of the sample mean. For large N, this mean will be dominated by the sample mean, and vice versa for small N. Similarly, the variance of the distribution for µ is the harmonic mean of the prior variance and the variance of the sample mean. Note that the variance of the sample mean follows from the central limit theorem: it is the true population variance divided by the number of data points. Therefore, just as in the case of the mean, for large N the variance will be dominated by the sample variance, and vice versa for small N.
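The quoted numbers follow directly from these formulas; a minimal sketch, assuming N = 4 data points (a value consistent with the numbers above but not stated explicitly in the text):

    # Posterior mean and variance for a Gaussian prior times a Gaussian
    # likelihood, with the values used in this part.
    mu0, var0 = 2.0, 2.0             # prior mean and variance
    mu_hat, var_e, N = 1.75, 3.0, 4  # sample mean; sigma_e^2 = 3; assumed N

    var_post = 1.0 / (1.0 / var0 + N / var_e)               # harmonic-mean form
    mu_post = var_post * (mu0 / var0 + mu_hat * N / var_e)  # precision-weighted
    print(f"posterior mean = {mu_post:.2f}, variance = {var_post:.2f}")  # 1.82, 0.55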
3 Estimating a power law with a break

(a) We assume that x ∈ [x_min, x_max] to prevent divergence as x → 0 or x → ∞. This is typical in physical applications. For example, if x represents the mass of a galaxy, the distribution has a cutoff at some small and large galaxy mass. Further, we can assume that b ∈ [x_min, x_max]; otherwise we would not have a broken power law. We can integrate

    f(x; p_1, p_2, b) = K { (x/b)^{p_1}   if x ≤ b
                          { (x/b)^{p_2}   if x > b

and therefore

    1 = ∫_{x_min}^{x_max} dx f(x; p_1, p_2, b)
      = K b [ ∫_{x_min/b}^{1} dy y^{p_1} + ∫_{1}^{x_max/b} dy y^{p_2} ]
      = K b [ (1/(p_1+1)) (1 − (x_min/b)^{p_1+1}) + (1/(p_2+1)) ((x_max/b)^{p_2+1} − 1) ]

determines K. This expression demands p_1 > −1 if x_min → 0 and p_2 < −1 if x_max → ∞, further illustrating the care that must be taken in choosing the limits on the range for a power law.

(b) Combined with the next part...

(c) In this problem p_2 = −3/2, so f is a two-parameter family of functions: f(x; p_1, −3/2, b). The likelihood is then

    L(p_1, b) = Π_{k=1}^N f(x_k; p_1, −3/2, b)

or

    log L(p_1, b) = Σ_{k=1}^N log f(x_k; p_1, −3/2, b).

My ML solution is plotted along with a histogram of the data in Figure 2. As stated in the problem and discussed in class,

    log L = log L_o − (θ − θ̂)²/2σ_θ²,                               (7)

so that the one-sigma error is the value at which log L(θ̂ + σ_θ) = log L_o − 1/2. For higher dimensionality, one uses the fact that near its maximum the likelihood function is approximately a multidimensional Gaussian. From equation (7) it is then clear that the quantity 2(log L_o − log L) is distributed like χ², where L_o is the value of the maximum likelihood. In our case, the one-sigma region for two degrees of freedom is described by the contour in the plot of log likelihood down from the maximum value by χ²/2 = 1.15. Similarly, the two-sigma (three-sigma) value is the contour containing 95.4% (99.7%) of the probability. For two degrees of freedom, the corresponding values of χ²/2 are 3.09 (5.9).
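These χ²/2 levels are easy to check with scipy (a check added for this write-up, not part of the original solution):

    from scipy.stats import chi2

    # Drop in log L below the peak that encloses the 1-, 2-, and 3-sigma
    # probability content for 2 degrees of freedom.
    for p in (0.6827, 0.9545, 0.9973):
        print(f"{p:.2%}: chi^2/2 = {chi2.ppf(p, df=2) / 2:.2f}")
    # prints approximately 1.15, 3.09, and 5.91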
[Figure 2: Plot of the broken power law with the ML-estimated values of p_1 and b (red curve) along with the data (green histogram). Axes: bin counts versus values.]

    Table 1: Parameter confidence limits from the likelihood plot.
    Peak: p_1 ≈ 0.27, b ≈ 0.083

    Confidence    p_1 min   p_1 max    b min    b max
    68.3%          -0.55      0.45     0.057    0.1
    95.4%          -0.70      3.2      0.028    0.3
    99.7%          -0.85      4.70     0.025    0.8

The value of log L is shown in Figure 3, with the three contours corresponding to these one-, two-, and three-sigma probability values. The confidence limits read from this plot are given in Table 1.
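A contour plot like Figure 3 can be reproduced by evaluating log L(p_1, b) on a grid and contouring at 1.15, 3.09, and 5.91 below the peak. The sketch below uses the normalization from part (a); the data, the [x_min, x_max] limits, and the grid ranges are stand-ins, since the exercise's data set is not reproduced here:

    import numpy as np

    # Broken power-law log likelihood with p2 = -3/2 fixed, normalized on
    # [x_min, x_max] as in part (a).
    def log_likelihood(x, p1, b, p2=-1.5, x_min=0.01, x_max=1.0):
        # 1 = K*b*[ (1 - (x_min/b)^(p1+1))/(p1+1)
        #         + ((x_max/b)^(p2+1) - 1)/(p2+1) ]  =>  K = 1/norm
        norm = b * ((1.0 - (x_min / b) ** (p1 + 1.0)) / (p1 + 1.0)
                    + ((x_max / b) ** (p2 + 1.0) - 1.0) / (p2 + 1.0))
        slope = np.where(x <= b, p1, p2)
        return np.sum(slope * np.log(x / b) - np.log(norm))

    rng = np.random.default_rng(3)
    x = rng.uniform(0.01, 1.0, size=200)     # stand-in data

    p1_grid = np.linspace(-2.0, 6.0, 80)     # 80 points: avoids p1 = -1 exactly
    b_grid = np.linspace(0.02, 0.3, 57)
    logL = np.array([[log_likelihood(x, p1, b) for p1 in p1_grid]
                     for b in b_grid])
    logL -= logL.max()                       # set log L_o = 0, as in Figure 3
    # contour logL at -1.15, -3.09, -5.91 for the 1-, 2-, 3-sigma regions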
[Figure 3: Plot of log L as a function of the power-law exponent p_1 and break point b, with log L_o = 0. Top panel: the three curves show the theoretical 68.3%, 95.4%, and 99.7% isovalues. Lower panel: blow-up of the density inside the 68.3% contour.]

(d) We have derived that the covariance matrix for the likelihood is the inverse of the matrix

    (σ⁻²)_ij = −∂² log L/∂θ_i ∂θ_j,

evaluated at the maximum. Because f(x; p_1, p_2, b) has a slightly messy analytic expression, I found it easier to perform numerical partial differentiation of log L rather than use the equivalent expectation form

    (σ⁻²)_ij = −N E[∂² log f/∂θ_i ∂θ_j].
I recursively used the two-point difference formula. For the diagonal terms one finds

    ∂² log L/∂p_1² ≈ [log L(p_1+Δp_1, b) − 2 log L(p_1, b) + log L(p_1−Δp_1, b)] / (Δp_1)²

    ∂² log L/∂b² ≈ [log L(p_1, b+Δb) − 2 log L(p_1, b) + log L(p_1, b−Δb)] / (Δb)²

and for the cross term

    ∂² log L/∂p_1∂b ≈ [log L(p_1+Δp_1/2, b+Δb/2) − log L(p_1−Δp_1/2, b+Δb/2)
                       − log L(p_1+Δp_1/2, b−Δb/2) + log L(p_1−Δp_1/2, b−Δb/2)] / (Δp_1 Δb)

I chose Δp_1 = 0.01 p̂_1 and Δb = 0.01 b̂. The eigenvectors of the 2×2 matrix (σ⁻²)_ij describe the principal components (directions of uncorrelated error), and the inverses of its eigenvalues give the variances along these directions. I find

    σ_1 ≈ 6.4×10⁻³ and σ_2 ≈ 2.7×10⁻¹

with corresponding directions

    ê_1 = (0.08, 1.0),   ê_2 = (1.0, −0.08).

In other words, the principal axes are nearly along the p_1 and b directions with a small tilt. This is consistent with our graphical solution depicted in Figure 3. Similarly, the variance estimate is consistent with the overall scale on which the likelihood levels vary but, of course, does not predict the shape of the contours. In particular, it is very important to note that the likelihood is nearly unbounded toward small values of p_1; in other words, we cannot rule out a value of p_1 near zero. Similarly, the high-confidence boundaries are not elliptical. It is nearly always more revealing to study the explicit likelihood distribution rather than rely on the covariance matrix.
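As a self-contained check of this finite-difference recipe (added for this write-up), one can apply it to a toy quadratic log L whose covariance is known; the two-point formulas are exact in that case:

    import numpy as np

    # Toy quadratic log L with a known (hypothetical) covariance matrix; the
    # finite-difference Hessian must recover it exactly.
    cov_true = np.array([[7.0e-2, 4.0e-3],
                         [4.0e-3, 4.0e-4]])
    prec = np.linalg.inv(cov_true)
    logL = lambda p1, b: -0.5 * np.array([p1, b]) @ prec @ np.array([p1, b])

    p1_hat, b_hat = 0.0, 0.0          # peak of the toy likelihood
    dp, db = 0.01, 0.001              # steps at the 1% scale, as in the text

    d2p = (logL(p1_hat + dp, b_hat) - 2 * logL(p1_hat, b_hat)
           + logL(p1_hat - dp, b_hat)) / dp**2
    d2b = (logL(p1_hat, b_hat + db) - 2 * logL(p1_hat, b_hat)
           + logL(p1_hat, b_hat - db)) / db**2
    d2x = (logL(p1_hat + dp/2, b_hat + db/2) - logL(p1_hat - dp/2, b_hat + db/2)
           - logL(p1_hat + dp/2, b_hat - db/2)
           + logL(p1_hat - dp/2, b_hat - db/2)) / (dp * db)

    cov_est = np.linalg.inv(-np.array([[d2p, d2x], [d2x, d2b]]))
    eigval, eigvec = np.linalg.eigh(cov_est)
    print(np.sqrt(eigval))            # sigma along each principal direction
    print(eigvec)                     # columns are the principal directions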