Multiparameter models (cont.)
Dr. Jarad Niemi
STAT 544 - Iowa State University
February 1, 2018
Outline

- Multinomial
- Multivariate normal
  - Unknown mean
  - Unknown mean and covariance

In the process, we'll introduce the following distributions:

- Multinomial
- Dirichlet
- Multivariate normal
- Inverse Wishart (and Wishart)
- normal-inverse Wishart
Multinomial
Motivating examples

Multivariate count data:

- Item-response (Likert scale)
- Voting
Multinomial
Multinomial distribution

Suppose there are K categories and each individual independently chooses category k with probability π_k such that \sum_{k=1}^K \pi_k = 1. Let y_k be the number of individuals who choose category k, with n = \sum_{k=1}^K y_k being the total number of individuals. Then Y = (Y_1, ..., Y_K) has a multinomial distribution, i.e. Y ~ Mult(n, π), with probability mass function (pmf)

p(y) = n! \prod_{k=1}^K \frac{\pi_k^{y_k}}{y_k!}.
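As a quick sanity check of the pmf formula, the snippet below (with assumed example values for n, π, and y) evaluates it directly and compares against scipy's built-in multinomial:

```python
import numpy as np
from math import factorial
from scipy.stats import multinomial

# Example values (assumed for illustration): n trials over K = 3 categories
n = 10
pi = np.array([0.2, 0.3, 0.5])
y = np.array([2, 3, 5])

# direct evaluation of p(y) = n! * prod_k pi_k^{y_k} / y_k!
pmf_direct = factorial(n) * np.prod(pi**y / np.array([factorial(int(k)) for k in y]))

# scipy's built-in multinomial pmf should agree
pmf_scipy = multinomial.pmf(y, n=n, p=pi)
```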
Multinomial
Properties of the multinomial distribution

The multinomial distribution with pmf p(y) = n! \prod_{k=1}^K \pi_k^{y_k} / y_k! has the following properties:

- E[Y_k] = nπ_k
- V[Y_k] = nπ_k(1 - π_k)
- Cov[Y_k, Y_{k'}] = -nπ_k π_{k'} for k ≠ k'

Marginally, each component of a multinomial distribution is a binomial distribution with Y_k ~ Bin(n, π_k).
Multinomial
Dirichlet distribution

Let π = (π_1, ..., π_K) have a Dirichlet distribution, i.e. π ~ Dir(a), with concentration parameter a = (a_1, ..., a_K) where a_k > 0 for all k. The probability density function (pdf) for π is

p(\pi) = \frac{1}{\mathrm{Beta}(a)} \prod_{k=1}^K \pi_k^{a_k - 1}

with \sum_{k=1}^K \pi_k = 1, where Beta(a) is the multinomial beta function, i.e.

\mathrm{Beta}(a) = \frac{\prod_{k=1}^K \Gamma(a_k)}{\Gamma\left(\sum_{k=1}^K a_k\right)}.
Multinomial
Properties of the Dirichlet distribution

The Dirichlet distribution with pdf p(π) ∝ \prod_{k=1}^K \pi_k^{a_k - 1} has the following properties (where a_0 = \sum_{k=1}^K a_k):

- E[π_k] = a_k / a_0
- V[π_k] = \frac{a_k(a_0 - a_k)}{a_0^2(a_0 + 1)}
- Cov[π_k, π_{k'}] = \frac{-a_k a_{k'}}{a_0^2(a_0 + 1)} for k ≠ k'

Marginally, each component of a Dirichlet distribution is a beta distribution with π_k ~ Be(a_k, a_0 - a_k).
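These moment formulas can be checked numerically. The sketch below (concentration vector a is an assumed example) builds the full covariance matrix and verifies two consequences: since π sums to 1, the mean sums to 1 and each row of the covariance matrix sums to 0; and the marginal variance matches the Be(a_k, a_0 - a_k) variance:

```python
import numpy as np
from scipy.stats import beta

# Assumed example concentration parameter
a = np.array([2.0, 3.0, 5.0])
a0 = a.sum()

mean = a / a0
# Cov matrix: diagonal a_k(a0 - a_k), off-diagonal -a_k a_k', common denominator
cov = (np.diag(a) * a0 - np.outer(a, a)) / (a0**2 * (a0 + 1))
```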
Multinomial
Bayesian inference

The conjugate prior for a multinomial distribution, i.e. Y ~ Mult(n, π), with unknown probability vector π is a Dirichlet distribution. The Jeffreys prior is a Dirichlet distribution with a_k = 0.5 for all k. Some argue that, for large K, this prior puts too much mass on rare categories and suggest instead the Dirichlet prior with a_k = 1/K for all k.

The posterior under a Dirichlet prior is

p(\pi|y) \propto p(y|\pi)p(\pi) \propto \left[\prod_{k=1}^K \pi_k^{y_k}\right]\left[\prod_{k=1}^K \pi_k^{a_k - 1}\right] = \prod_{k=1}^K \pi_k^{a_k + y_k - 1}.

Thus π|y ~ Dir(a + y).
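The conjugate update is a one-liner: add the observed counts to the prior concentration. A minimal sketch with assumed example values:

```python
import numpy as np

# Prior pi ~ Dir(a); observed counts y ~ Mult(n, pi)  =>  posterior pi|y ~ Dir(a + y)
# (prior and counts below are assumed example values)
a = np.ones(3)                 # Dir(1,1,1): uniform over the simplex
y = np.array([10, 25, 15])     # observed counts, n = 50

a_post = a + y
post_mean = a_post / a_post.sum()   # E[pi_k | y] = (a_k + y_k) / (a_0 + n)
```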
Multivariate normal distribution

Let Y = (Y_1, ..., Y_K) have a multivariate normal distribution, i.e. Y ~ N_K(µ, Σ), with mean µ and variance-covariance matrix Σ. The probability density function (pdf) for Y is

p(y) = (2\pi)^{-K/2} |\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(y - \mu)^\top \Sigma^{-1} (y - \mu)\right).
Bivariate normal contours

[Figure: contours of a bivariate normal density with correlation 0.8.]
Properties of the multivariate normal distribution

The multivariate normal distribution with pdf

p(y) = (2\pi)^{-K/2} |\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(y - \mu)^\top \Sigma^{-1} (y - \mu)\right)

has the following properties:

- E[Y_k] = µ_k
- V[Y_k] = Σ_{kk}
- Cov[Y_k, Y_{k'}] = Σ_{k,k'}

Marginally, each component of a multivariate normal distribution is a normal distribution with Y_k ~ N(µ_k, Σ_{kk}). Conditional distributions are also normal, i.e. if

\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} \sim N\left(\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}\right)

then

Y_1 \mid Y_2 = y_2 \sim N\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(y_2 - \mu_2),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right).
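The conditioning formula can be implemented directly from the block decomposition. A minimal sketch (the helper name and example values are assumptions), checked against the known bivariate closed form where the conditional mean is ρ·y_2 and the conditional variance is 1 − ρ²:

```python
import numpy as np

def mvn_conditional(mu, Sigma, idx1, idx2, y2):
    """Mean and covariance of Y[idx1] | Y[idx2] = y2 for Y ~ N(mu, Sigma)."""
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    cond_mean = mu1 + S12 @ np.linalg.solve(S22, y2 - mu2)
    cond_cov = S11 - S12 @ np.linalg.solve(S22, S12.T)
    return cond_mean, cond_cov

# Bivariate check: standard normals with correlation 0.8
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
m, C = mvn_conditional(mu, Sigma, [0], [1], np.array([1.5]))
```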
Representing independence in a multivariate normal

Let Y ~ N(µ, Σ) with precision matrix Ω = Σ^{-1}.

- If Σ_{k,k'} = 0, then Y_k and Y_{k'} are independent of each other.
- If Ω_{k,k'} = 0, then Y_k and Y_{k'} are conditionally independent of each other given Y_j for all j ≠ k, k'.
Unknown mean
Default inference with an unknown mean

Let Y_i \stackrel{ind}{\sim} N_K(µ, S) with default prior p(µ) ∝ 1, where Y_i = (Y_{i1}, ..., Y_{iK}). Then

p(\mu|y) \propto p(y|\mu)p(\mu) \propto \exp\left(-\frac{1}{2}\sum_{i=1}^n (y_i - \mu)^\top S^{-1} (y_i - \mu)\right) = \exp\left(-\frac{1}{2}\mathrm{tr}(S^{-1}S_0)\right)

where

S_0 = \sum_{i=1}^n (y_i - \mu)(y_i - \mu)^\top.

This posterior is proper if n ≥ K and, in that case, is

\mu|y \sim N_K\left(\bar{y}, \frac{1}{n}S\right)

where \bar{y} = (\bar{y}_1, ..., \bar{y}_K) has elements \bar{y}_k = \frac{1}{n}\sum_{i=1}^n y_{ik}.
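Under the flat prior, the posterior only requires the sample mean and the (known) covariance scaled by n. A minimal sketch with simulated data (the dimensions, covariance, and true mean are assumed example values):

```python
import numpy as np

# Posterior for mu under p(mu) ∝ 1 with known covariance S: mu|y ~ N(ybar, S/n)
rng = np.random.default_rng(0)
K, n = 3, 50                           # n >= K, so the posterior is proper
S = np.diag([1.0, 2.0, 0.5])           # known covariance (assumed)
y = rng.multivariate_normal([1.0, -1.0, 0.0], S, size=n)

ybar = y.mean(axis=0)                  # posterior mean
post_cov = S / n                       # posterior covariance
```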
Unknown mean
Conjugate inference with an unknown mean

Let Y_i \stackrel{ind}{\sim} N(µ, S) with conjugate prior µ ~ N_K(m, C). Then

p(\mu|y) \propto p(y|\mu)p(\mu) \propto \exp\left(-\frac{1}{2}\sum_{i=1}^n (y_i - \mu)^\top S^{-1} (y_i - \mu)\right)\exp\left(-\frac{1}{2}(\mu - m)^\top C^{-1} (\mu - m)\right) = \exp\left(-\frac{1}{2}(\mu - m')^\top C'^{-1} (\mu - m')\right)

and thus µ|y ~ N(m', C') where

C' = \left[C^{-1} + nS^{-1}\right]^{-1}
m' = C'\left[C^{-1}m + nS^{-1}\bar{y}\right].
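The update formulas translate directly into code. A minimal sketch (function name and example values are assumptions); as a sanity check, with a very diffuse prior the posterior mean should approach the sample mean:

```python
import numpy as np

def normal_mean_update(m, C, S, ybar, n):
    """Posterior mu|y ~ N(m_star, C_star) for n obs with known data covariance S."""
    Cinv = np.linalg.inv(C)
    Sinv = np.linalg.inv(S)
    C_star = np.linalg.inv(Cinv + n * Sinv)
    m_star = C_star @ (Cinv @ m + n * Sinv @ ybar)
    return m_star, C_star

# Diffuse-prior check: C huge => m_star ≈ ybar, C_star ≈ S/n
m = np.zeros(2)
C = 1e8 * np.eye(2)
S = np.eye(2)
ybar = np.array([1.0, -2.0])
m_star, C_star = normal_mean_update(m, C, S, ybar, n=10)
```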
Unknown mean
Inverse Wishart distribution

Let the K × K matrix Σ have an inverse Wishart distribution, i.e. Σ ~ IW(v, W^{-1}), with degrees of freedom v > K - 1 and positive definite scale matrix W. The pdf for Σ is

p(\Sigma) \propto |\Sigma|^{-(v+K+1)/2}\exp\left(-\frac{1}{2}\mathrm{tr}\left(W\Sigma^{-1}\right)\right).
Unknown mean
Properties of the inverse Wishart distribution

The inverse Wishart distribution with pdf

p(\Sigma) \propto |\Sigma|^{-(v+K+1)/2}\exp\left(-\frac{1}{2}\mathrm{tr}\left(W\Sigma^{-1}\right)\right)

has the following properties:

- E[Σ] = (v - K - 1)^{-1} W for v > K + 1.
- Marginally, σ²_k = Σ_{kk} ~ Inv-χ²(v, W_{kk}).
- If a K × K matrix W has a Wishart distribution, i.e. W ~ Wishart(v, S), then W^{-1} ~ IW(v, S^{-1}).
Unknown mean
Normal-inverse Wishart distribution

A multivariate generalization of the normal-scaled-inverse-χ² distribution is the normal-inverse Wishart distribution. For a vector µ ∈ R^K and K × K matrix Σ, the normal-inverse Wishart distribution is

µ|Σ ~ N(m, Σ/c)
Σ ~ IW(v, W^{-1}).

The marginal distribution for µ, i.e. p(\mu) = \int p(\mu|\Sigma)p(\Sigma)\,d\Sigma, is a multivariate t-distribution, i.e.

\mu \sim t_{v-K+1}\left(m,\ W/[c(v - K + 1)]\right).
Unknown mean and covariance
Conjugate inference with unknown mean and covariance

Let Y_i \stackrel{ind}{\sim} N(µ, Σ) with conjugate prior

µ|Σ ~ N(m, Σ/c)
Σ ~ IW(v, W^{-1})

which has pdf

p(\mu, \Sigma) \propto |\Sigma|^{-((v+K)/2+1)}\exp\left(-\frac{1}{2}\mathrm{tr}(W\Sigma^{-1}) - \frac{c}{2}(\mu - m)^\top\Sigma^{-1}(\mu - m)\right).

The posterior is a normal-inverse Wishart with parameters

c' = c + n
v' = v + n
m' = \frac{c}{c+n}m + \frac{n}{c+n}\bar{y}
W' = W + S + \frac{cn}{c+n}(\bar{y} - m)(\bar{y} - m)^\top

where

S = \sum_{i=1}^n (y_i - \bar{y})(y_i - \bar{y})^\top.
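The four parameter updates above can be sketched as a single function (function name and example data are assumptions for illustration):

```python
import numpy as np

def niw_update(m, c, v, W, y):
    """Posterior normal-inverse Wishart parameters from data rows y (n x K)."""
    n = y.shape[0]
    ybar = y.mean(axis=0)
    resid = y - ybar
    S = resid.T @ resid                                   # sum-of-squares matrix
    c_new = c + n
    v_new = v + n
    m_new = (c * m + n * ybar) / (c + n)                  # precision-weighted mean
    W_new = W + S + (c * n / (c + n)) * np.outer(ybar - m, ybar - m)
    return m_new, c_new, v_new, W_new

# Assumed example: K = 2, weak prior, three observations
m0, c0, v0, W0 = np.zeros(2), 1.0, 4.0, np.eye(2)
y = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, -1.0]])
m1, c1, v1, W1 = niw_update(m0, c0, v0, W0, y)
```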
Unknown mean and covariance
Default inference with unknown mean and covariance

The prior Σ ~ IW(K+1, I) is non-informative in the sense that, marginally, each correlation has a uniform distribution on (-1, 1). The prior p(µ, Σ) ∝ |Σ|^{-(K+1)/2}, which can be thought of as a normal-inverse Wishart distribution with c → 0, v → -1, and W → 0, results in the posterior distribution

µ|Σ, y ~ N(\bar{y}, Σ/n)
Σ|y ~ IW(n - 1, S^{-1}).
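Sampling from this posterior proceeds by composition: draw Σ, then draw µ given Σ. A minimal sketch with simulated data; note that scipy's `invwishart` takes the scale matrix directly (here S), which in the slide's IW(v, W^{-1}) notation corresponds to W = S:

```python
import numpy as np
from scipy.stats import invwishart

# Simulated data for illustration (assumed values)
rng = np.random.default_rng(1)
n = 40
y = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=n)
ybar = y.mean(axis=0)
S = (y - ybar).T @ (y - ybar)

# Composition: Sigma|y ~ IW(n-1, S^{-1}), then mu|Sigma,y ~ N(ybar, Sigma/n)
Sigma_draw = invwishart.rvs(df=n - 1, scale=S, random_state=rng)
mu_draw = rng.multivariate_normal(ybar, Sigma_draw / n)
```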
Unknown mean and covariance
Issues with the inverse Wishart distribution

Marginals of the IW have an IG (or scaled-inverse-χ²) distribution and therefore inherit the low density near zero, resulting in a (possible) bias of small variances toward larger values. Due to this issue, and the relationship between the variances and the correlations (see http://www.themattsimpson.com/2012/08/20/prior-distributions-for-covariance-matrices-the-scaled-inverse-wishart-prior/), the correlations can be biased:

- small variances imply small correlations
- large variances imply large correlations

Remedies:

- Don't blindly use I for the scale matrix in an IW; instead, use a reasonable diagonal matrix for your data set.
- Use the scaled inverse Wishart distribution (see pg 74).
- Use the separation strategy, i.e. Σ = DΛD where D is diagonal and Λ is a correlation matrix, and specify the standard deviations (or variances) and correlations separately. In this case, Gelman recommends putting the LKJ prior (see page 582) on the correlation matrix.
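The separation strategy's decomposition is easy to see in code. A minimal sketch (the standard deviations and correlation matrix below are assumed example values) builds Σ = DΛD and checks that the diagonal recovers the variances:

```python
import numpy as np

# Separation strategy: Sigma = D @ Lambda @ D with D = diag(standard deviations)
sds = np.array([1.0, 2.0])             # assumed standard deviations
Lambda = np.array([[1.0, 0.3],
                   [0.3, 1.0]])        # assumed correlation matrix
D = np.diag(sds)
Sigma = D @ Lambda @ D
```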