Part 4: Multi-parameter and normal models
- Joshua Ford
1 Part 4: Multi-parameter and normal models
2 The normal model

Perhaps the most useful (or most utilized) probability model for data analysis is the normal distribution. There are several reasons for this, e.g.:
- the CLT
- it is a simple model with separate parameters for the population mean and variance, two quantities that are often of primary interest
3 The normal PDF

A RV Y is said to be normally distributed with mean µ and variance σ² if the density of Y is given by

f(y | µ, σ²) = (1/√(2πσ²)) exp{−(y − µ)² / (2σ²)},  for −∞ < y < ∞

We shall write Y ~ N(µ, σ²) and call the pair (µ, σ²) the parameters
4 Normal properties

Some important things to remember about the normal distribution:
- the distribution is symmetric about µ: the mode, the median and the mean are all equal to µ
- about 95% of the population lies within two (more precisely 1.96) standard deviations of the mean
- if X ~ N(µ_x, σ_x²) and Y ~ N(µ_y, σ_y²) and X and Y are independent, then aX + bY ~ N(aµ_x + bµ_y, a²σ_x² + b²σ_y²)
5 Joint sampling density

Suppose our sampling model is {Y₁, …, Yₙ | µ, σ²} ~ iid N(µ, σ²). Then the joint sampling density is given by

p(y₁, …, yₙ | µ, σ²) = ∏_{i=1}^n p(yᵢ | µ, σ²)
                     = ∏_{i=1}^n (2πσ²)^(−1/2) exp{−(yᵢ − µ)² / (2σ²)}
                     = (2πσ²)^(−n/2) exp{−(1/(2σ²)) Σ_{i=1}^n (yᵢ − µ)²}
6 Joint sampling density

By expanding the quadratic term in the exponent,

Σ_{i=1}^n (yᵢ − µ)² = Σ yᵢ² − 2µ Σ yᵢ + nµ²,

we see that p(y₁, …, yₙ | µ, σ²) depends upon y₁, …, yₙ only through {Σ yᵢ, Σ yᵢ²}, a two-dimensional sufficient statistic. Knowing these quantities is equivalent to knowing

ȳ = (1/n) Σ_{i=1}^n yᵢ  and  s² = (1/(n−1)) Σ_{i=1}^n (yᵢ − ȳ)²,

and so {ȳ, s²} is also a sufficient statistic
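As a quick numerical sketch (with a made-up data vector, not the midge data used later), the pair {Σ yᵢ, Σ yᵢ²} really does determine (ȳ, s²):

```python
import numpy as np

# Hypothetical data vector, for illustration only.
y = np.array([2.1, 1.9, 2.4, 2.0, 2.6])
n = len(y)

# The two-dimensional sufficient statistic {sum y_i, sum y_i^2}.
t1, t2 = y.sum(), (y ** 2).sum()

# Recover (ybar, s^2) from the sufficient statistic alone,
# using s^2 = (sum y_i^2 - n*ybar^2) / (n - 1).
ybar = t1 / n
s2 = (t2 - n * ybar ** 2) / (n - 1)
```

The recovered values agree with computing the mean and sample variance directly from the raw data.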
7 Inference by conditioning

Inference for this two-parameter model can be broken down into two one-parameter problems. We begin with the problem of making inference for µ when σ² is known, and use a conjugate prior for µ. For any (conditional) prior distribution p(µ | σ²), the posterior distribution will satisfy

p(µ | y₁, …, yₙ, σ²) ∝ exp{−(1/(2σ²)) Σ_{i=1}^n (yᵢ − µ)²} p(µ | σ²)
                     ∝ exp{−c₁(µ − c₂)²} p(µ | σ²)
8 Conjugate prior

So we see that if p(µ | σ²) is to be conjugate, then it must include quadratic terms like exp{−c₁(µ − c₂)²}. The simplest such class of probability densities is the normal family, suggesting that if p(µ | σ²) is normal and {y₁, …, yₙ | µ, σ²} ~ iid N(µ, σ²), then p(µ | y₁, …, yₙ, σ²) is also a normal density
9 Posterior derivation

If µ | σ² ~ N(µ₀, τ₀²) then

p(µ | y₁, …, yₙ, σ²) ∝ exp{−a(µ − b/a)² / 2},

where

a = 1/τ₀² + n/σ²  and  b = µ₀/τ₀² + Σ yᵢ/σ²

This function has exactly the same form as a normal density curve, with 1/a playing the role of the variance and b/a playing the role of the mean
10 Normal posterior

Since probability distributions are determined by their shape, this means that p(µ | y₁, …, yₙ, σ²) is indeed a normal density, i.e., {µ | y₁, …, yₙ, σ²} ~ N(µₙ, τₙ²), where

τₙ² = 1/a = 1 / (1/τ₀² + n/σ²)  and  µₙ = b/a = (µ₀/τ₀² + nȳ/σ²) / (1/τ₀² + n/σ²)

So: normal prior + normal sampling model = normal posterior
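A minimal sketch of this update in code (the function and variable names are my own, not from the slides):

```python
def normal_posterior(ybar, n, sigma2, mu0, tau02):
    """Posterior parameters (mu_n, tau_n^2) for mu given known sigma^2,
    under the conjugate prior mu ~ N(mu0, tau0^2)."""
    precision = 1 / tau02 + n / sigma2   # 1/tau_n^2 = 1/tau_0^2 + n/sigma^2
    tau_n2 = 1 / precision
    mu_n = tau_n2 * (mu0 / tau02 + n * ybar / sigma2)
    return mu_n, tau_n2

# Example with arbitrary numbers: a vague prior (large tau_0^2)
# leaves the posterior mean close to the data mean.
mu_n, tau_n2 = normal_posterior(ybar=5.0, n=20, sigma2=4.0, mu0=0.0, tau02=100.0)
```

Note how the posterior precision is just the sum of prior and sampling precisions, exactly as on the next slide.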
11 Combining information

The (conditional) posterior parameters τₙ² and µₙ combine the prior parameters τ₀² and µ₀ with terms from the data. First consider the posterior inverse variance, a.k.a. the precision:

1/τₙ² = 1/τ₀² + n/σ²

So the posterior precision combines the prior precision and the sampling precision, and 1/τₙ² → ∞ as n → ∞
12 Combining means

Notice that the posterior mean decomposes as

µₙ = [(1/τ₀²) / (1/τ₀² + n/σ²)] µ₀ + [(n/σ²) / (1/τ₀² + n/σ²)] ȳ,

so it is a weighted average of the prior mean and the data mean. The weight on the prior mean is 1/τ₀², the prior precision
13 Prior sample size

If the prior mean were based on κ₀ prior observations from the same (or similar) population as Y₁, …, Yₙ, then we might want to set τ₀² = σ²/κ₀, which may be interpreted as the variance of the mean of the prior observations. In this case the formula for the posterior mean reduces to

µₙ = (κ₀/(κ₀ + n)) µ₀ + (n/(κ₀ + n)) ȳ

Finally, µₙ → ȳ as n → ∞
14 Prediction

Consider predicting a new observation Ỹ from the population after having observed {Y₁ = y₁, …, Yₙ = yₙ}. To find the predictive distribution we may take advantage of the following fact:

{Ỹ | µ, σ²} ~ N(µ, σ²)  ⟺  Ỹ = µ + ε,  ε ~ N(0, σ²)

In other words, saying that Ỹ is normal with mean µ is the same as saying that Ỹ is equal to µ plus some mean-zero normally distributed noise
15 Predictive moments

Using this result, let's first compute the posterior mean and variance of Ỹ:

E{Ỹ | y₁, …, yₙ, σ²} = E{µ + ε | y₁, …, yₙ, σ²}
                      = E{µ | y₁, …, yₙ, σ²} + E{ε | y₁, …, yₙ, σ²}
                      = µₙ + 0 = µₙ

Var[Ỹ | y₁, …, yₙ, σ²] = Var[µ + ε | y₁, …, yₙ, σ²]
                        = Var[µ | y₁, …, yₙ, σ²] + Var[ε | y₁, …, yₙ, σ²]
                        = τₙ² + σ²
16 Moments to distribution

Recall that the sum of independent normal random variables is normal. Therefore, since µ and ε, conditional on y₁, …, yₙ and σ², are normally distributed, so is Ỹ = µ + ε. So the predictive distribution is

Ỹ | y₁, …, yₙ, σ² ~ N(µₙ, τₙ² + σ²)

Observe that, as n → ∞, Var[Ỹ | y₁, …, yₙ, σ²] → σ² > 0, i.e., certainty in µ does not translate into certainty about Ỹ
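The decomposition Ỹ = µ + ε also gives a direct recipe for simulating predictive draws. A sketch with illustrative (not derived-here) parameter values, checking the variance formula by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_n, tau_n2, sigma2 = 1.8, 0.002, 0.017   # illustrative values only
S = 200_000

mu = rng.normal(mu_n, np.sqrt(tau_n2), size=S)   # draws of mu | y, sigma^2
eps = rng.normal(0.0, np.sqrt(sigma2), size=S)   # mean-zero noise
ytilde = mu + eps                                # predictive draws

# The sample variance of ytilde should approach tau_n^2 + sigma^2 = 0.019,
# which stays above sigma^2 no matter how concentrated mu becomes.
```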
17 Example: midge wing length

Grogan and Wirth (1981) provide data on the wing length in millimeters of nine members of a species of midge (small, two-winged flies). From these measurements we wish to make inference about the population mean µ
18 Example: prior information

Studies from other populations suggest that wing lengths are usually around 1.9mm, so we set µ₀ = 1.9. We also know that lengths must be positive (µ > 0). We can approximate this restriction with a normal prior distribution for µ as follows: since most of the normal density is within two standard deviations of the mean, we choose τ₀ so that

µ₀ − 2τ₀ > 0  ⇒  τ₀ < 1.9/2 = 0.95

For now we take τ₀ = 0.95, which somewhat overstates our prior uncertainty
19 Example: data and posterior

The observations, in order of increasing magnitude, are

(1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08)

Therefore the posterior is {µ | y₁, …, y₉, σ²} ~ N(µₙ, τₙ²), where

µₙ = (µ₀/τ₀² + nȳ/σ²) / (1/τ₀² + n/σ²)  and  τₙ² = 1 / (1/τ₀² + n/σ²)
20 Example: a choice of σ²

If we take σ² = s² = 0.017, then

{µ | y₁, …, y₉, σ² = 0.017} ~ N(1.805, 0.002)

[Figure: prior and posterior densities for µ]

A 95% CI for µ is (1.72, 1.89)
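The slide's numbers can be reproduced directly from the formulas of slide 10 (a sketch; values agree with the slide up to rounding):

```python
import math

ybar, n = 1.804, 9
sigma2 = 0.017              # plugging in sigma^2 = s^2
mu0, tau02 = 1.9, 0.95 ** 2

precision = 1 / tau02 + n / sigma2
tau_n2 = 1 / precision
mu_n = tau_n2 * (mu0 / tau02 + n * ybar / sigma2)

# 95% credible interval for mu
lo = mu_n - 1.96 * math.sqrt(tau_n2)
hi = mu_n + 1.96 * math.sqrt(tau_n2)
```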
21 Example: full accounting of uncertainty

However, these results assume that we are certain that σ² = s², when in fact s² is only a rough estimate of σ², based upon only nine observations. To get a more accurate representation of our information (and in particular of our posterior uncertainty) we need to account for the fact that σ² is not known
22 Joint Bayesian inference

Bayesian inference for two or more parameters is not conceptually different from the one-parameter case. For any joint prior distribution p(µ, σ²) for µ and σ², posterior inference proceeds using Bayes' rule:

p(µ, σ² | y₁, …, yₙ) = p(y₁, …, yₙ | µ, σ²) p(µ, σ²) / p(y₁, …, yₙ)

As before, we will begin by developing a simple conjugate class of prior distributions which will make posterior calculations easy
23 Prior decomposition

A joint distribution can always be expressed as the product of a conditional probability and a marginal probability:

p(µ, σ²) = p(µ | σ²) p(σ²)

We just saw that if σ² were known, then a conjugate prior distribution for µ was N(µ₀, τ₀²)
24 Prior decomposition

Let's consider the particular case where τ₀² = σ²/κ₀:

p(µ, σ²) = p(µ | σ²) p(σ²) = N(µ; µ₀, τ₀² = σ²/κ₀) p(σ²)

The parameters µ₀ and κ₀ can be interpreted as the mean and sample size from a set of prior observations. For σ² we need a family of prior distributions that has support on (0, ∞)
25 Conjugate prior for σ²

One such family of distributions is the gamma family, as we used for the Poisson sampling model. Unfortunately, this family is not conjugate for the normal variance. However, the gamma family does turn out to be a conjugate class of densities for the precision 1/σ²:

1/σ² ~ G(a, b)  ⇒  σ² ~ IG(a, b)

When using such a prior we say that σ² has an inverse-gamma distribution
26 Inverse-gamma prior

For interpretability, instead of using a and b we will use σ² ~ IG(ν₀/2, ν₀σ₀²/2). Under this parameterization:
- E{σ²} = (ν₀σ₀²/2) / (ν₀/2 − 1)
- mode(σ²) = (ν₀σ₀²/2) / (ν₀/2 + 1), so mode(σ²) < σ₀² < E{σ²}
- Var[σ²] is decreasing in ν₀

We can interpret the prior parameters (σ₀², ν₀) as the prior sample variance and the prior sample size
27 Posterior inference

Suppose our prior model and sampling model are

σ² ~ IG(ν₀/2, ν₀σ₀²/2)
µ | σ² ~ N(µ₀, σ²/κ₀)
Y₁, …, Yₙ | µ, σ² ~ iid N(µ, σ²)

Just as the prior distribution for µ and σ² can be decomposed as p(µ, σ²) = p(µ | σ²) p(σ²), the posterior distribution can be similarly decomposed as

p(µ, σ² | y₁, …, yₙ) = p(µ | σ², y₁, …, yₙ) p(σ² | y₁, …, yₙ)
28 Posterior conditional(s)

We have already derived the posterior conditional distribution of µ given σ². Plugging in τ₀² = σ²/κ₀, we have {µ | y₁, …, yₙ, σ²} ~ N(µₙ, σ²/κₙ), where

κₙ = κ₀ + n  (posterior sample size)

µₙ = [(κ₀/σ²) µ₀ + (n/σ²) ȳ] / (κ₀/σ² + n/σ²) = (κ₀µ₀ + nȳ)/κₙ  (posterior mean)
29 Posterior conditional(s)

The posterior distribution of σ² can be obtained by integrating over the unknown value of µ:

p(σ² | y₁, …, yₙ) ∝ p(y₁, …, yₙ | σ²) p(σ²) = p(σ²) ∫ p(y₁, …, yₙ | µ, σ²) p(µ | σ²) dµ

This integral can be done without much knowledge of calculus, but it is somewhat tedious
30 Posterior conditional(s)

Check that this leads to {σ² | y₁, …, yₙ} ~ IG(νₙ/2, νₙσₙ²/2), where

νₙ = ν₀ + n
σₙ² = (1/νₙ) [ν₀σ₀² + (n − 1)s² + (κ₀n/κₙ)(ȳ − µ₀)²]

These formulae suggest that the interpretation of ν₀ as a prior sample size, from which a prior sample variance σ₀² has been obtained, is reasonable. Similarly, we may think of ν₀σ₀² and νₙσₙ² as the prior and posterior sums of squares: the posterior sum of squares combines the prior and data sums of squares
31 Posterior conditional(s)

σₙ² = (1/νₙ) [ν₀σ₀² + (n − 1)s² + (κ₀n/κₙ)(ȳ − µ₀)²]

However, the third term is a bit harder to understand. It says that a large value of (ȳ − µ₀)² increases the posterior probability of a large σ². This makes sense for our joint prior p(µ, σ²): if we want to think of µ₀ as the sample mean of κ₀ prior observations with variance σ², then this term is also an estimate of σ²
32 Example: the midge data

The studies of other populations suggest that the true mean and standard deviation of our population under study should not be far from 1.9mm and 0.1mm respectively, suggesting µ₀ = 1.9 and σ₀² = 0.01. However, this population may be different from the others in terms of wing length, and so we choose κ₀ = ν₀ = 1, so that our prior distributions are only weakly centered around the estimates from the other populations
33 Example: the midge data

The sample mean and variance of our observed data are ȳ = 1.804 and s² = 0.017. From these values and the prior parameters, we compute

µₙ = (κ₀µ₀ + nȳ)/κₙ = (1 × 1.9 + 9 × 1.804)/10 = 1.814

σₙ² = (1/νₙ) [ν₀σ₀² + (n − 1)s² + (κ₀n/κₙ)(ȳ − µ₀)²]
    ≈ (0.010 + 0.136 + 0.008)/10 = 0.015
34 Example: the full posterior

Our joint posterior distribution is completely determined by these values:

µₙ = 1.814,  κₙ = 10,  σₙ² = 0.015,  νₙ = 10

and can be expressed as

{µ | y₁, …, yₙ, σ²} ~ N(1.814, σ²/10)
{σ² | y₁, …, yₙ} ~ IG(10/2, 10 × 0.015/2)
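These posterior parameters follow mechanically from the update formulas for (κₙ, µₙ, νₙ, σₙ²); a sketch reproducing them:

```python
# Midge data summaries and prior parameters from the example.
ybar, s2, n = 1.804, 0.017, 9
mu0, k0, s02, nu0 = 1.9, 1, 0.01, 1

# Posterior updates: kappa_n, mu_n, nu_n, sigma_n^2.
k_n = k0 + n
mu_n = (k0 * mu0 + n * ybar) / k_n
nu_n = nu0 + n
s_n2 = (nu0 * s02 + (n - 1) * s2 + (k0 * n / k_n) * (ybar - mu0) ** 2) / nu_n
```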
35 Example: the full posterior

[Figure: contour plot of the joint posterior density of (µ, σ²)]

Observe that the contours are more peaked as a function of µ for low values of σ² than for high values
36 Marginal interests

For many data analyses, interest primarily lies in estimating the population mean µ, and so we would like to calculate quantities like

E{µ | y₁, …, yₙ},  Var[µ | y₁, …, yₙ],  P(µ₁ < µ < µ₂ | y₁, …, yₙ)

These quantities are all determined by the marginal posterior distribution of µ given the data. But all we know so far is that the conditional distribution of µ given the data and σ² is normal, and that σ² given the data is inverse-gamma
37 Joint sampling

If we could generate marginal samples of µ, then we could use the MC method to approximate the marginal quantities of interest. It turns out that this is easy to do by generating samples of µ and σ² from their joint posterior by MC:

σ²(1) ~ IG(νₙ/2, νₙσₙ²/2),  µ(1) ~ N(µₙ, σ²(1)/κₙ)
σ²(2) ~ IG(νₙ/2, νₙσₙ²/2),  µ(2) ~ N(µₙ, σ²(2)/κₙ)
⋮
σ²(S) ~ IG(νₙ/2, νₙσₙ²/2),  µ(S) ~ N(µₙ, σ²(S)/κₙ)
38 Monte Carlo marginals

The sequence of pairs {(σ²(1), µ(1)), …, (σ²(S), µ(S))} simulated with this procedure are independent samples from the joint posterior p(µ, σ² | y₁, …, yₙ). Additionally, the simulated sequence {µ(1), …, µ(S)} can be seen as independent samples from the marginal posterior distribution p(µ | y₁, …, yₙ). So these may be used to make MC approximations to posterior expectations of (functions of) µ
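A sketch of this Monte Carlo scheme in code, using the midge posterior parameters. NumPy has no inverse-gamma sampler, so we draw the precision 1/σ² from a gamma distribution and invert it:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_n, k_n, s_n2, nu_n = 1.814, 10, 0.015, 10   # midge posterior parameters
S = 100_000

# sigma^2(s) ~ IG(nu_n/2, nu_n*s_n2/2): equivalently 1/sigma^2 ~ Gamma(shape, rate);
# numpy's gamma takes scale = 1/rate.
precision = rng.gamma(shape=nu_n / 2, scale=2 / (nu_n * s_n2), size=S)
sigma2 = 1 / precision

# mu(s) ~ N(mu_n, sigma^2(s)/k_n), one draw per sigma^2 draw
mu = rng.normal(mu_n, np.sqrt(sigma2 / k_n))

# {mu(s)} are now draws from the marginal posterior p(mu | y_1,...,y_n)
```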
39 Example: joint & marginal samples

[Figure: scatterplot of the joint samples with histograms of the marginals]

The samples µ(s) are obtained conditional on the samples σ²(s), leading to joint samples (µ(s), σ²(s)). We may extract the marginals from the joint
40 Example: credible v. confidence intervals

[Figure: credible and confidence interval densities for µ]

The Bayesian CI is very close to the frequentist CI based on t-statistics, since

p(µ | y₁, …, yₙ) ~ St_{ν₀+n}(µₙ, σₙ²/κₙ)

If κ₀ and ν₀ are small, this will be very close to the sampling distribution of the MLE
41 Objective priors?

How small can κ₀ and ν₀, the prior sample sizes, be? Consider

µₙ = (κ₀µ₀ + nȳ)/κₙ
σₙ² = (1/νₙ) [ν₀σ₀² + (n − 1)s² + (κ₀n/κₙ)(ȳ − µ₀)²]

So as κ₀, ν₀ → 0:

µₙ → ȳ  and  σₙ² → ((n − 1)/n) s² = (1/n) Σ (yᵢ − ȳ)²
42 Improper prior

This leads to the following posterior:

{σ² | y₁, …, yₙ} ~ IG(n/2, ½ Σ (yᵢ − ȳ)²)
{µ | y₁, …, yₙ, σ²} ~ N(ȳ, σ²/n)

More formally, if we take the improper prior p(µ, σ²) ∝ 1/σ², the posterior is proper, and we get the same posterior conditional for µ as above, but

{σ² | y₁, …, yₙ} ~ IG((n − 1)/2, ½ Σ (yᵢ − ȳ)²)
43 Classical comparison

The marginal posterior for µ may then be obtained as

p(µ | y₁, …, yₙ) = ∫ p(µ | σ², y₁, …, yₙ) p(σ² | y₁, …, yₙ) dσ²   (cond. prob., LTP)
                ∝ St_{n−1}(µ; ȳ, s²/n)   (normal conditional, IG marginal)

It is interesting to compare this result to the sampling distribution of the t-statistic, i.e., of the MLE:

t = (Ȳ − µ)/(s/√n) | µ ~ t_{n−1},  i.e.,  µ̂ = Ȳ | µ ~ St_{n−1}(µ, s²/n)
44 Point estimator

A point estimator of an unknown parameter is a function that converts your data into a single element of the parameter space. In the case of the normal sampling model and conjugate prior distribution, the posterior mean estimator of µ is

µ̂_b(y₁, …, yₙ) = E{µ | y₁, …, yₙ} = (n/(κ₀ + n)) ȳ + (κ₀/(κ₀ + n)) µ₀ = wȳ + (1 − w)µ₀
45 Sampling properties

The sampling properties of an estimator such as µ̂_b refer to its behavior under hypothetically repeatable surveys or experiments. Let's compare the sampling properties of µ̂_b to those of µ̂_e(y₁, …, yₙ) = ȳ, the sample mean, when the true population mean is µ_true:

E{µ̂_e | µ = µ_true} = µ_true
E{µ̂_b | µ = µ_true} = wµ_true + (1 − w)µ₀

So we say that µ̂_e is unbiased and, unless µ₀ = µ_true, we say that µ̂_b is biased
46 Bias

Bias refers to how close the center of mass of the sampling distribution of the estimator is to the true value. An unbiased estimator is an estimator with zero bias, which sounds desirable. However, bias does not tell us how far away an estimate might be from the true value. For example, µ̂ = y₁ is an unbiased estimator of the population mean µ_true, but will generally be farther away from µ_true than ȳ
47 Mean squared error

To evaluate how close an estimator θ̂ is likely to be to the true value θ_true, we might use the mean squared error (MSE). Letting m = E{θ̂ | θ_true}, the MSE is

MSE[θ̂ | θ_true] = E{(θ̂ − θ_true)² | θ_true}
                = E{(θ̂ − m)² | θ_true} + (m − θ_true)²
                = Var[θ̂ | θ_true] + Bias²[θ̂ | θ_true]
48 Mean squared error

This means that, before the data are gathered, the expected distance from the estimator to the true value depends on how close the center of the distribution of θ̂ is to θ_true (the bias), and how spread out the distribution is (the variance)
49 Comparing estimators

Getting back to our comparison of µ̂_b to µ̂_e: the bias of µ̂_e is zero, but

Var[µ̂_e | µ = µ_true, σ²] = σ²/n,

whereas

Var[µ̂_b | µ = µ_true, σ²] = w² σ²/n < σ²/n,

and so µ̂_b has lower variability
50 Comparing estimators

Which one is better in terms of MSE?

MSE[µ̂_e | µ = µ_true] = Var[µ̂_e | µ = µ_true] = σ²/n
MSE[µ̂_b | µ = µ_true] = Var[µ̂_b | µ = µ_true] + Bias²[µ̂_b | µ = µ_true]
                      = w² σ²/n + (1 − w)² (µ₀ − µ_true)²

Therefore MSE[µ̂_b | µ = µ_true] < MSE[µ̂_e | µ = µ_true] if

(µ₀ − µ_true)² < (σ²/n) (1 + w)/(1 − w) = σ² (1/n + 2/κ₀)
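A sketch checking this bias-variance trade-off numerically (the numbers are arbitrary illustrations, not from the slides):

```python
def mse_mle(sigma2, n):
    """MSE of the sample mean: just its variance."""
    return sigma2 / n

def mse_bayes(mu0, k0, mu_true, sigma2, n):
    """MSE of the posterior-mean estimator: variance plus squared bias."""
    w = n / (k0 + n)
    return w ** 2 * sigma2 / n + (1 - w) ** 2 * (mu0 - mu_true) ** 2

# The Bayes estimator wins iff (mu0 - mu_true)^2 < sigma2 * (1/n + 2/kappa0).
sigma2, n, k0, mu_true = 1.0, 10, 1, 0.0
threshold = sigma2 * (1 / n + 2 / k0)
```

With these numbers the threshold is 2.1, so a prior guess with squared error 1 beats the MLE while one with squared error 4 does not.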
51 Low MSE Bayesian estimators

So if you know just a little about the population you are about to sample from, you should be able to find values of µ₀ and κ₀ such that the Bayesian estimator has lower average distance to the truth (MSE) than the MLE. E.g., if you are pretty sure that your prior guess µ₀ is within two standard deviations of the true population mean, then if you pick κ₀ = 1/2 you can be pretty sure that the Bayes estimator has a lower MSE, since (µ₀ − µ_true)² < 4σ² < σ²(1/n + 4)
52 Example: IQ scores

Scoring on IQ tests is designed to produce a normal distribution with a mean of 100 and a variance of 225 when applied to the general population. Suppose that we were to sample n individuals from a particular town in the USA and then use ȳ to estimate µ, the town-specific mean IQ score. For Bayesian estimation, if we lack much information about the town in question, a natural choice would be µ₀ = 100
53 Example: IQ scores, MSE

Suppose that, unknown to us, the people in this town are extremely exceptional, and the true mean and variance of IQ scores in the town are µ_true = 112 and σ² = 169. The MSEs of the estimators µ̂_e and µ̂_b are

MSE[µ̂_e | µ = 112, σ² = 169] = Var[µ̂_e | µ = 112, σ² = 169] = 169/n
MSE[µ̂_b | µ = 112, σ² = 169] = Var[µ̂_b | µ = 112, σ² = 169] + Bias²[µ̂_b | µ = 112, σ² = 169]
                             = w² (169/n) + (1 − w)² × 144
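These MSEs can be compared directly; a sketch for n = 10 and a few prior sample sizes κ₀, computing the relative MSE of the Bayes estimator against the MLE:

```python
sigma2, mu0, mu_true, n = 169.0, 100.0, 112.0, 10

ratios = {}
for k0 in (1, 2, 3):
    w = n / (k0 + n)
    # MSE of the Bayes estimator: variance term plus squared-bias term.
    mse_b = w ** 2 * sigma2 / n + (1 - w) ** 2 * (mu0 - mu_true) ** 2
    ratios[k0] = mse_b / (sigma2 / n)   # relative MSE vs. the MLE
```

Even though the prior guess of 100 is 12 points off, the shrinkage estimator still has lower MSE than the MLE for small κ₀; the ratio exceeds 1 only as κ₀ grows.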
54 Example: relative MSE

One way to compare MSEs is through their ratio. Consider MSE[µ̂_b | µ = 112] / MSE[µ̂_e | µ = 112], plotted as a function of n for κ₀ ∈ {1, 2, 3}:

[Figure: relative MSE versus sample size, for κ₀ = 0, 1, 2, 3]

- when κ₀ ∈ {1, 2}, the Bayes estimate has lower MSE than the MLE
- the µ₀ = 100 setting is only bad for κ₀ ≥ 3
- as n increases, the bias of the estimators shrinks to zero
55 Example: sampling densities

Consider the sampling densities of the estimators when n = 10, which highlight the relative contributions of the bias and the variance to the MSE.

[Figure: sampling densities for κ₀ = 0, 1, 2, 3, with the true mean IQ marked]
Statistics - Lecture One Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Outline 1. Basic ideas about estimation 2. Method of Moments 3. Maximum Likelihood 4. Confidence
More informationEco517 Fall 2014 C. Sims FINAL EXAM
Eco517 Fall 2014 C. Sims FINAL EXAM This is a three hour exam. You may refer to books, notes, or computer equipment during the exam. You may not communicate, either electronically or in any other way,
More informationGeneralized Bayesian Inference with Sets of Conjugate Priors for Dealing with Prior-Data Conflict
Generalized Bayesian Inference with Sets of Conjugate Priors for Dealing with Prior-Data Conflict Gero Walter Lund University, 15.12.2015 This document is a step-by-step guide on how sets of priors can
More informationSTAT 830 Bayesian Estimation
STAT 830 Bayesian Estimation Richard Lockhart Simon Fraser University STAT 830 Fall 2011 Richard Lockhart (Simon Fraser University) STAT 830 Bayesian Estimation STAT 830 Fall 2011 1 / 23 Purposes of These
More informationGeneralized Linear Models Introduction
Generalized Linear Models Introduction Statistics 135 Autumn 2005 Copyright c 2005 by Mark E. Irwin Generalized Linear Models For many problems, standard linear regression approaches don t work. Sometimes,
More informationSimple Linear Regression
Simple Linear Regression In simple linear regression we are concerned about the relationship between two variables, X and Y. There are two components to such a relationship. 1. The strength of the relationship.
More informationIntroduction to Bayesian Inference
University of Pennsylvania EABCN Training School May 10, 2016 Bayesian Inference Ingredients of Bayesian Analysis: Likelihood function p(y φ) Prior density p(φ) Marginal data density p(y ) = p(y φ)p(φ)dφ
More informationBrandon C. Kelly (Harvard Smithsonian Center for Astrophysics)
Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics) Probability quantifies randomness and uncertainty How do I estimate the normalization and logarithmic slope of a X ray continuum, assuming
More informationConsider the joint probability, P(x,y), shown as the contours in the figure above. P(x) is given by the integral of P(x,y) over all values of y.
ATMO/OPTI 656b Spring 009 Bayesian Retrievals Note: This follows the discussion in Chapter of Rogers (000) As we have seen, the problem with the nadir viewing emission measurements is they do not contain
More informationUnobservable Parameter. Observed Random Sample. Calculate Posterior. Choosing Prior. Conjugate prior. population proportion, p prior:
Pi Priors Unobservable Parameter population proportion, p prior: π ( p) Conjugate prior π ( p) ~ Beta( a, b) same PDF family exponential family only Posterior π ( p y) ~ Beta( a + y, b + n y) Observed
More informationLecture : Probabilistic Machine Learning
Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning
More informationBayesian Inference for Regression Parameters
Bayesian Inference for Regression Parameters 1 Bayesian inference for simple linear regression parameters follows the usual pattern for all Bayesian analyses: 1. Form a prior distribution over all unknown
More informationMachine Learning CSE546 Carlos Guestrin University of Washington. September 30, 2013
Bayesian Methods Machine Learning CSE546 Carlos Guestrin University of Washington September 30, 2013 1 What about prior n Billionaire says: Wait, I know that the thumbtack is close to 50-50. What can you
More informationMarkov chain Monte Carlo
Markov chain Monte Carlo Karl Oskar Ekvall Galin L. Jones University of Minnesota March 12, 2019 Abstract Practically relevant statistical models often give rise to probability distributions that are analytically
More informationGaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008
Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:
More informationBayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007
Bayesian inference Fredrik Ronquist and Peter Beerli October 3, 2007 1 Introduction The last few decades has seen a growing interest in Bayesian inference, an alternative approach to statistical inference.
More informationSTA414/2104 Statistical Methods for Machine Learning II
STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements
More informationBayesian inference. Rasmus Waagepetersen Department of Mathematics Aalborg University Denmark. April 10, 2017
Bayesian inference Rasmus Waagepetersen Department of Mathematics Aalborg University Denmark April 10, 2017 1 / 22 Outline for today A genetic example Bayes theorem Examples Priors Posterior summaries
More informationSYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions
SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu
More informationBayesian Inference. Chapter 4: Regression and Hierarchical Models
Bayesian Inference Chapter 4: Regression and Hierarchical Models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Advanced Statistics and Data Mining Summer School
More informationWooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics
Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics A short review of the principles of mathematical statistics (or, what you should have learned in EC 151).
More informationBasic Concepts of Inference
Basic Concepts of Inference Corresponds to Chapter 6 of Tamhane and Dunlop Slides prepared by Elizabeth Newton (MIT) with some slides by Jacqueline Telford (Johns Hopkins University) and Roy Welsch (MIT).
More informationSpecial Topic: Bayesian Finite Population Survey Sampling
Special Topic: Bayesian Finite Population Survey Sampling Sudipto Banerjee Division of Biostatistics School of Public Health University of Minnesota April 2, 2008 1 Special Topic Overview Scientific survey
More informationIntroduction to Bayesian Statistics with WinBUGS Part 4 Priors and Hierarchical Models
Introduction to Bayesian Statistics with WinBUGS Part 4 Priors and Hierarchical Models Matthew S. Johnson New York ASA Chapter Workshop CUNY Graduate Center New York, NY hspace1in December 17, 2009 December
More informationClassical and Bayesian inference
Classical and Bayesian inference AMS 132 Claudia Wehrhahn (UCSC) Classical and Bayesian inference January 8 1 / 11 The Prior Distribution Definition Suppose that one has a statistical model with parameter
More informationRemarks on Improper Ignorance Priors
As a limit of proper priors Remarks on Improper Ignorance Priors Two caveats relating to computations with improper priors, based on their relationship with finitely-additive, but not countably-additive
More informationDS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling
DS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling Due: Tuesday, May 10, 2016, at 6pm (Submit via NYU Classes) Instructions: Your answers to the questions below, including
More informationBayesian Inference. Chapter 2: Conjugate models
Bayesian Inference Chapter 2: Conjugate models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Master in Business Administration and Quantitative Methods Master in
More informationExample continued. Math 425 Intro to Probability Lecture 37. Example continued. Example
continued : Coin tossing Math 425 Intro to Probability Lecture 37 Kenneth Harris kaharri@umich.edu Department of Mathematics University of Michigan April 8, 2009 Consider a Bernoulli trials process with
More informationMachine Learning Lecture 2
Machine Perceptual Learning and Sensory Summer Augmented 15 Computing Many slides adapted from B. Schiele Machine Learning Lecture 2 Probability Density Estimation 16.04.2015 Bastian Leibe RWTH Aachen
More informationPrinciples of Bayesian Inference
Principles of Bayesian Inference Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department
More informationLecture 5. G. Cowan Lectures on Statistical Data Analysis Lecture 5 page 1
Lecture 5 1 Probability (90 min.) Definition, Bayes theorem, probability densities and their properties, catalogue of pdfs, Monte Carlo 2 Statistical tests (90 min.) general concepts, test statistics,
More informationUncertainty Quantification for Inverse Problems. November 7, 2011
Uncertainty Quantification for Inverse Problems November 7, 2011 Outline UQ and inverse problems Review: least-squares Review: Gaussian Bayesian linear model Parametric reductions for IP Bias, variance
More informationInferring from data. Theory of estimators
Inferring from data Theory of estimators 1 Estimators Estimator is any function of the data e(x) used to provide an estimate ( a measurement ) of an unknown parameter. Because estimators are functions
More informationPrinciples of Bayesian Inference
Principles of Bayesian Inference Sudipto Banerjee and Andrew O. Finley 2 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department
More informationStat 5101 Lecture Notes
Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random
More informationBayesian Linear Regression [DRAFT - In Progress]
Bayesian Linear Regression [DRAFT - In Progress] David S. Rosenberg Abstract Here we develop some basics of Bayesian linear regression. Most of the calculations for this document come from the basic theory
More informationPrinciples of Bayesian Inference
Principles of Bayesian Inference Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department
More informationBayesian Inference. Chapter 4: Regression and Hierarchical Models
Bayesian Inference Chapter 4: Regression and Hierarchical Models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Master in Business Administration and Quantitative
More informationContents 1. Contents
Contents 1 Contents 6 Distributions of Functions of Random Variables 2 6.1 Transformation of Discrete r.v.s............. 3 6.2 Method of Distribution Functions............. 6 6.3 Method of Transformations................
More information