Multinomial Data

The multinomial distribution is a generalization of the binomial for the situation in which each trial results in one and only one of several categories, as opposed to just two, as in the case of the binomial experiment. Let $Y = (Y_1, \ldots, Y_k)$, where $Y_i$ is the number of the $n$ independent trials that result in category $i$, $i = 1, \ldots, k$. The likelihood function is such that
$$f(y \mid \theta) \propto \prod_{i=1}^k \theta_i^{y_i},$$
where $\theta_i$ is the probability that a given trial results in category $i$, $i = 1, \ldots, k$. The parameter space is
$$\Theta = \Big\{ \theta : \theta_i \ge 0,\ i = 1, \ldots, k;\ \textstyle\sum_{j=1}^k \theta_j = 1 \Big\}.$$

122

Of course, the vector of observations satisfies $y_1 + \cdots + y_k = n$.

Conjugate prior for multinomial data

The so-called Dirichlet distribution is the conjugate family of priors for the multinomial distribution. The Dirichlet distribution is such that
$$\pi(\theta; \alpha) = \frac{\Gamma\!\left(\sum_{i=1}^k \alpha_i\right)}{\prod_{i=1}^k \Gamma(\alpha_i)} \prod_{i=1}^k \theta_i^{\alpha_i - 1}\, I_\Theta(\theta),$$
where $\alpha_i > 0$, $i = 1, \ldots, k$.

Using this prior in the multinomial experiment yields a Dirichlet posterior with parameters $y_i + \alpha_i$, $i = 1, \ldots, k$.

123
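As a quick illustration of the conjugate update just described, here is a minimal Python sketch; the counts, the prior parameters, and the use of scipy.stats.dirichlet are illustrative choices and not part of the original notes.

```python
import numpy as np
from scipy.stats import dirichlet

y = np.array([12, 7, 3])           # observed category counts, y_1 + ... + y_k = n
alpha = np.array([1.0, 1.0, 1.0])  # Dirichlet prior parameters (uniform over the simplex)

alpha_post = alpha + y             # posterior is Dirichlet with parameters y_i + alpha_i
post_mean = alpha_post / alpha_post.sum()   # posterior mean of theta

theta_draws = dirichlet.rvs(alpha_post, size=5000)  # draws for Monte Carlo summaries
print(post_mean, theta_draws.mean(axis=0))
```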

The parameters of the Dirichlet prior have the same sort of interpretation as those of a beta prior, which of course is a special case of the Dirichlet. The information in a prior with parameters $\alpha_1, \ldots, \alpha_k$ is equivalent to that in a multinomial experiment with $\alpha_1 + \cdots + \alpha_k$ trials and $\alpha_i$ outcomes in category $i$, $i = 1, \ldots, k$.

A natural noninformative prior is to take $\alpha_i = 1$, $i = 1, \ldots, k$, which is uniform over $\Theta$.

What is the Jeffreys prior?
$$\log f(y \mid \theta) = C(y) + \sum_{i=1}^k y_i \log \theta_i.$$

124

$$\frac{\partial}{\partial \theta_j} \log f(y \mid \theta) = \frac{y_j}{\theta_j},
\qquad
\frac{\partial^2}{\partial \theta_i \, \partial \theta_j} \log f(y \mid \theta) =
\begin{cases} -y_i/\theta_i^2, & i = j, \\ 0, & i \ne j. \end{cases}$$

The information matrix is thus diagonal with diagonal entries equal to
$$\frac{1}{\theta_i^2} E(Y_i) = \frac{n}{\theta_i}, \qquad i = 1, \ldots, k.$$

So, the Jeffreys prior is Dirichlet with $\alpha_i = 1/2$, $i = 1, \ldots, k$, which is a proper prior.

One can verify that the marginal distributions of a Dirichlet are also Dirichlet.

125
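The marginal claim can be checked by simulation. The sketch below (not from the notes) draws from a Dirichlet with the Jeffreys parameters $\alpha_i = 1/2$ and compares a single coordinate against the corresponding Beta distribution, i.e., the two-category Dirichlet.

```python
import numpy as np
from scipy.stats import dirichlet, beta, kstest

alpha = np.full(4, 0.5)                    # Jeffreys prior for k = 4 categories
draws = dirichlet.rvs(alpha, size=20000)   # draws of theta from the Dirichlet

alpha0 = alpha.sum()
# The first coordinate should be Beta(alpha_1, alpha_0 - alpha_1)
stat, pval = kstest(draws[:, 0], beta(alpha[0], alpha0 - alpha[0]).cdf)
print(stat, pval)   # a large p-value is consistent with the stated marginal
```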

Multivariate Normal Distribution

Suppose we have a random sample of size $n$ from the $d$-variate normal distribution. Here the data $Y$ are an $n$ by $d$ matrix. The $i$th row of this matrix is $Y_i^T$, where $Y_i^T = (Y_{i1}, \ldots, Y_{id})$, $i = 1, \ldots, n$.

The parameters of the $d$-variate normal are the mean vector $\mu$ and the covariance matrix $\Sigma$. These are defined by
$$\mu = (\mu_1, \ldots, \mu_d)^T = E(Y_i)$$
and
$$\Sigma_{ij} = \mathrm{Cov}(Y_{ri}, Y_{rj}), \qquad i = 1, \ldots, d,\ j = 1, \ldots, d.$$

126

The likelihood function is
$$f(y \mid \mu, \Sigma) \propto |\Sigma|^{-n/2} \exp\left( -\frac{1}{2} \sum_{i=1}^n (y_i - \mu)^T \Sigma^{-1} (y_i - \mu) \right).$$

In the (unlikely) event that $\Sigma$ is known, we only need a prior for $\mu$. It can be verified that the multivariate normal is a conjugate prior for $\mu$ in this case. Suppose that a priori $\mu \sim N(\eta, \Lambda)$. Proceeding analogously to the univariate case, it can be shown that the posterior distribution is normal with mean vector $\mu_n$ and covariance matrix $\Lambda_n$, where
$$\mu_n = (\Lambda^{-1} + n\Sigma^{-1})^{-1} (\Lambda^{-1}\eta + n\Sigma^{-1}\bar{y})$$
and
$$\Lambda_n^{-1} = \Lambda^{-1} + n\Sigma^{-1}.$$

127
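Here is a minimal numerical sketch of this conjugate update; the particular prior $(\eta, \Lambda)$, the known $\Sigma$, and the simulated data are made-up illustrative values, not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 50
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.0]])          # known covariance matrix
mu_true = np.array([1.0, -0.5, 2.0])
y = rng.multivariate_normal(mu_true, Sigma, size=n)
ybar = y.mean(axis=0)

eta = np.zeros(d)                            # prior mean
Lam = 10.0 * np.eye(d)                       # prior covariance

Sigma_inv = np.linalg.inv(Sigma)
Lam_inv = np.linalg.inv(Lam)

Lam_n = np.linalg.inv(Lam_inv + n * Sigma_inv)          # posterior covariance
mu_n = Lam_n @ (Lam_inv @ eta + n * Sigma_inv @ ybar)   # posterior mean
print(mu_n, Lam_n)
```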

Let $\mu_1$ and $\mu_2$ contain the first $k$ and the last $d - k$ elements of $\mu$, respectively. Similarly, define $\mu_{n1}$ and $\mu_{n2}$ in terms of the elements of $\mu_n$. Partition $\Lambda_n$ as
$$\Lambda_n = \begin{bmatrix} \Lambda_n^{11} & \Lambda_n^{12} \\ \Lambda_n^{21} & \Lambda_n^{22} \end{bmatrix},$$
where $\Lambda_n^{11}$ is $k \times k$ and $\Lambda_n^{22}$ is $(d - k) \times (d - k)$.

It follows that the conditional distribution of $\mu_1$ given $\mu_2$ is normal with mean vector
$$\mu_{n1} + \Lambda_n^{12} (\Lambda_n^{22})^{-1} (\mu_2 - \mu_{n2})$$
and covariance matrix
$$\Lambda_n^{11} - \Lambda_n^{12} (\Lambda_n^{22})^{-1} \Lambda_n^{21}.$$

Of course, the marginal of, for example, $\mu_1$ is normal with mean vector $\mu_{n1}$ and covariance matrix $\Lambda_n^{11}$.

128
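These conditional moments are straightforward to compute from a partitioned $\Lambda_n$; the sketch below uses illustrative values for $\mu_n$, $\Lambda_n$, and the conditioning value of $\mu_2$.

```python
import numpy as np

# Illustrative posterior mean and covariance for d = 3, with k = 1
mu_n = np.array([1.0, -0.4, 1.9])
Lam_n = np.array([[0.020, 0.005, 0.000],
                  [0.005, 0.020, 0.004],
                  [0.000, 0.004, 0.020]])
k = 1
L11, L12 = Lam_n[:k, :k], Lam_n[:k, k:]
L21, L22 = Lam_n[k:, :k], Lam_n[k:, k:]

mu2 = np.array([-0.2, 2.1])     # hypothetical value of mu_2 to condition on
cond_mean = mu_n[:k] + L12 @ np.linalg.solve(L22, mu2 - mu_n[k:])
cond_cov = L11 - L12 @ np.linalg.solve(L22, L21)
print(cond_mean, cond_cov)
```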

By letting $\Lambda^{-1} \to 0$, we can obtain a noninformative prior in the limit. The resulting prior is uniform over all of $d$-dimensional real space, and of course is improper. If $n \ge d$, the posterior corresponding to the uniform prior for $\mu$ is $N(\bar{y}, \Sigma/n)$.

Inadmissibility of a Bayes estimator: James-Stein theory

Suppose we observe $Y$ that has a $d$-variate normal distribution with unknown mean vector $\mu$ and known covariance matrix $I_d$, the $d \times d$ identity. This problem is equivalent to one where we simultaneously estimate means from $d$ independent experiments.

129

If we use the noninformative, uniform prior for $\mu$, and the squared error loss
$$L(\mu, a) = \sum_{i=1}^d (\mu_i - a_i)^2,$$
then the Bayes estimator of $\mu$ is, not surprisingly, $Y$.

The surprising thing is that this natural estimator is inadmissible for $d \ge 3$. (It is admissible for $d = 1$ or $2$.) This result was proved by Stein (1955), Proceedings of the Third Berkeley Symposium. James and Stein (1960), Proceedings of the Fourth Berkeley Symposium, produced an estimator that has uniformly smaller risk than $Y$. The estimator is
$$\delta^{JS}(Y) = \left( 1 - \frac{d - 2}{\sum_{i=1}^d Y_i^2} \right) Y.$$

130
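The risk comparison can be seen in a small Monte Carlo experiment; the sketch below (illustrative, not from the notes) takes $d = 10$ and puts $\mu$ at the origin, where the improvement is largest.

```python
import numpy as np

rng = np.random.default_rng(1)
d, reps = 10, 20000
mu = np.zeros(d)                     # true mean vector

Y = mu + rng.standard_normal((reps, d))
shrink = 1.0 - (d - 2) / np.sum(Y**2, axis=1)      # James-Stein shrinkage factor
delta_js = shrink[:, None] * Y

risk_Y = np.mean(np.sum((Y - mu)**2, axis=1))          # approximately d
risk_js = np.mean(np.sum((delta_js - mu)**2, axis=1))  # close to 2 at the origin
print(risk_Y, risk_js)
```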

It turns out that the ratio $R(\mu, \delta^{JS})/R(\mu, Y)$ is very close to 1 over most of the parameter space. Only near $\mu^T = (0, \ldots, 0)$ is the ratio of risks substantially smaller than 1. One way of seeing why is to first prove the following fact:

For a set of $\mu_i$'s that are all bounded in absolute value by the same constant, and when $d$ is large, the statistic $T_d = \sum_{i=1}^d Y_i^2/(d - 2)$ is very close to $\theta_d = 1 + \sum_{i=1}^d \mu_i^2/d$.

Proof. We have, for each $i$,
$$E(Y_i^2) = 1 + \mu_i^2$$
and
$$\mathrm{Var}(Y_i^2) = 2(1 + 2\mu_i^2).$$

131

For an arbitrarily small, positive $\epsilon$, Markov's inequality says that
$$P(|T_d - \theta_d| < \epsilon) \ge 1 - E(T_d - \theta_d)^2/\epsilon^2.$$

Now,
$$E(T_d - \theta_d)^2 = \mathrm{Var}(T_d) + [E(T_d) - \theta_d]^2 = \mathrm{Var}(T_d) + \frac{4\theta_d^2}{(d - 2)^2}.$$

Since the $Y_i$'s are independent,
$$\mathrm{Var}(T_d) = \frac{2}{(d - 2)^2} \sum_{i=1}^d (1 + 2\mu_i^2).$$

Using the fact that $\mu_1, \ldots, \mu_d$ are all bounded in absolute value by the same constant, we have
$$E(T_d - \theta_d)^2 \le \frac{C}{d}$$
for some positive constant $C$. It follows that when $d$ is sufficiently big, $P(|T_d - \theta_d| < \epsilon)$ is arbitrarily close to 1. Q.E.D.

132
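A quick simulation (illustrative only) of the concentration just proved: as $d$ grows, $T_d$ settles down near $\theta_d$ when the $\mu_i$'s stay bounded.

```python
import numpy as np

rng = np.random.default_rng(2)
for d in (10, 100, 10000):
    mu = rng.uniform(-1.0, 1.0, size=d)    # means bounded in absolute value by 1
    Y = mu + rng.standard_normal(d)        # Y_i ~ N(mu_i, 1), independent
    T_d = np.sum(Y**2) / (d - 2)
    theta_d = 1.0 + np.sum(mu**2) / d
    print(d, T_d, theta_d)                 # T_d and theta_d agree closely for large d
```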

To get a better understanding of the James-Stein estimator, we now consider
$$\hat{\delta}^{JS}(Y) = \left( 1 - \frac{1}{\theta_d} \right) Y = \left( \frac{\overline{\mu^2_d}}{1 + \overline{\mu^2_d}} \right) Y,$$
where $\overline{\mu^2_d} = \sum_{i=1}^d \mu_i^2/d$.

The result proved on the previous pages shows that, for large $d$, $\delta^{JS} \approx \hat{\delta}^{JS}$. The random variable $\hat{\delta}^{JS}$ provides us with some intuition about the James-Stein estimator.

If the vector $\mu$ is close to the origin, i.e., $0 = (0, \ldots, 0)^T$, then $\overline{\mu^2_d}$ is close to 0, and hence $\hat{\delta}^{JS} \approx 0$. This is good!!

133

On the other hand, if $\mu$ is far from the origin, then
$$\frac{\overline{\mu^2_d}}{1 + \overline{\mu^2_d}} \approx 1,$$
and $\hat{\delta}^{JS} \approx Y$. This is good, because if $\mu$ is not close to the origin, then there's no rationale for shrinking the estimate towards the origin.

Shrinkage towards 0 is arbitrary

Suppose we have a rationale for shrinking the estimate towards a point $\alpha$ in $d$-space. For example, some theory may suggest that $\mu = \alpha$. We may define an estimator
$$\delta^{JS}(Y; \alpha) = Y - \frac{d - 2}{\sum_{i=1}^d (Y_i - \alpha_i)^2} (Y - \alpha).$$

134
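A short sketch of this shifted estimator; the target $\alpha$, the simulated data, and the helper function name are hypothetical choices for illustration.

```python
import numpy as np

def james_stein_toward(y, alpha):
    """Shrink y toward alpha: y - (d - 2)/||y - alpha||^2 * (y - alpha)."""
    d = y.size
    resid = y - alpha
    return y - (d - 2) / np.sum(resid**2) * resid

rng = np.random.default_rng(3)
d = 10
alpha = np.full(d, 5.0)                       # point suggested by theory, e.g. mu = alpha
mu = alpha + 0.1 * rng.standard_normal(d)     # true mean close to alpha
y = mu + rng.standard_normal(d)               # Y ~ N_d(mu, I_d)
print(james_stein_toward(y, alpha))
```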

Using the squared error loss on p. 130, verify that
$$R(\mu, \delta^{JS}(Y; \alpha)) = R(\mu - \alpha, \delta^{JS}(Y)) \qquad (*)$$
for all $\mu$. Because $Y$ has constant risk and because $R(\mu, \delta^{JS}(Y)) \le R(\mu, Y)$ for all $\mu$, $(*)$ implies that $\delta^{JS}(Y; \alpha)$ has risk no larger than that of $Y$ for all $\mu$.

The part of the parameter space where $\delta^{JS}(Y; \alpha)$ has substantially smaller risk is near $\alpha$.

Read Example 46, p. 256 of Berger (2nd edition) for more details on this problem.

135