Studentization and Prediction in a Multivariate Normal Setting

Morris L. Eaton
University of Minnesota, School of Statistics, 313 Ford Hall, 224 Church Street S.E., Minneapolis, MN 55455, USA. eaton@stat.umn.edu

D. A. S. Fraser
University of Toronto, Department of Statistics, 100 St. George Street, Toronto, Canada M5S 3G3. dfraser@utstat.toronto.edu

Abstract

In a simple multivariate normal prediction setting, derivation of a predictive distribution can flow from formal Bayes arguments as well as from pivoting arguments. We look at two special cases and show that the classical invariant predictive distribution is based on a pivot that does not have a parameter-free distribution, that is, the pivot is not an ancillary statistic. In contrast, a predictive distribution derived by a structural argument is based on a pivot with a fixed distribution (an ancillary statistic). The classical procedure is formal Bayes for the Jeffreys prior; our results show that this procedure does not have a structural or fiducial interpretation.

Keywords and phrases: pivotal quantities, ancillarity, formal Bayes predictive distribution

Running head: Studentization and multivariate prediction

1 Introduction

There are a variety of methods for constructing predictive distributions for future observations. Even in the case of a multivariate normal model, a host of proposals have been made. The survey article of Keyes and Levy (1996) is a good review of methodology in a multivariate analysis of variance setting. Popular approaches include formal Bayes methods, arguments based on likelihood, and appeals to invariance considerations in decision-theoretic settings. Relevant references include Geisser (1993), the review article of Bjørnstad (1990), and Eaton and Sudderth (2001).

This paper deals with simple random sampling from a multivariate normal distribution, where the problem is to predict the next observation. In this setting, the current evidence supports the contention that among all fully invariant predictive distributions there is one that is best. By fully invariant, we mean predictive distributions that are invariant under all nonsingular affine transformations. The notion of best invariant comes from decision theory and is explained in Eaton and Sudderth (2001). This best fully invariant predictive distribution is a formal Bayes rule and is specified in the next section. In what follows, Q_0 denotes this fully invariant rule.

An alternative approach is to look at predictive distributions with less than full invariance. This leads naturally to an alternative to the best fully invariant procedure and indeed yields a pivotal quantity that allows a fiducial construction of a predictive distribution, denoted by Q_1 in the sequel. The derivation of Q_1 relies on the ancillarity of the pivotal quantity suggested by invariance and is specified in the sequel. The main result of this paper shows that in the case of Q_0, a natural pivotal quantity obtained from an appealing Studentization is in fact not ancillary, so that a classical fiducial interpretation for Q_0 is unavailable. Thus the formal Bayes derivation described in Geisser (1993, Chapter 9) yields the best fully invariant predictive distribution, but the usual fiducial interpretation fails. This observation helps explain the dominance of Q_1 over Q_0 in a variety of settings; see Eaton, Muirhead and Pickering (2004) for a discussion.

In brief, this paper is organized as follows. Section 2 contains our basic assumptions and a description of the best fully invariant predictive distribution Q_0. The alternative proposal Q_1 is derived from invariance considerations in Section 3. In Section 4, the main result is established, and this is followed by some discussion in Section 5.

2 Background

Consider independent and identically distributed X_1, ..., X_n where each X_i has a p-dimensional normal distribution with mean vector \mu and positive definite covariance matrix \Sigma, so each column vector X_i is N_p(\mu, \Sigma). As usual, the sample mean n^{-1} \sum_{i=1}^n X_i is denoted by \bar{X}, and

    S = \sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})'

is the un-normalized sample covariance matrix. It is assumed that n \ge p + 1, so S is positive definite with probability one. Now consider Z in R^p which is also N_p(\mu, \Sigma) and is independent of the X_i's. Suppose one is interested in constructing a predictive distribution for Z that is allowed to depend on the data X = (X_1, ..., X_n).

One method for obtaining such a predictive distribution is to use a formal Bayes argument with the so-called Jeffreys or uninformative prior distribution given by

    \nu(d\mu, d\Sigma) = d\mu \, \frac{d\Sigma}{|\Sigma|^{(p+1)/2}}.

See Box and Tiao (1973, Chapter 8) for a Fisher information justification of this particular choice. The given model for (X, Z), together with the prior \nu and the use of Bayes theorem, yields a predictive distribution that can be described as follows. Consider the density function

    h_0(w) = \frac{\Gamma(n/2)}{\pi^{p/2} \Gamma((n-p)/2)} (1 + w'w)^{-n/2}, \quad w \in R^p,    (1)

and let W \in R^p be a random vector with the density h_0. Next, define the statistic W_0 \in R^p by

    W_0 = (1 + 1/n)^{-1/2} S^{-1/2} (Z - \bar{X})    (2)

where S^{-1/2} denotes the inverse of the unique positive definite square root of S. For any \mu \in R^p and the special \Sigma = \sigma^2 I_p with \sigma^2 > 0, the sampling distribution of W_0 under the given model has the density h_0; the proof of

this is a routine multivariate calculation. Now, fixing the data X and solving for Z yields

    Z = (1 + 1/n)^{1/2} S^{1/2} W_0 + \bar{X}.

Assume W_0 has the density h_0. Then, for X fixed, Z has the density

    q_0(z | X) = (1 + 1/n)^{-p/2} |S|^{-1/2} h_0\left( (1 + 1/n)^{-1/2} S^{-1/2} (z - \bar{X}) \right)

where |\cdot| denotes determinant. The resulting predictive distribution is Q_0(dz | X) = q_0(z | X) dz. For a direct formal Bayes derivation of Q_0 using the Jeffreys prior, see Geisser (1993, Chapter 9). A structural derivation may be found in Fraser and Haq (1969).

The reader will recognize the above derivation as the well-known fiducial inversion, assuming that W_0 has the density h_0. The purpose of this paper is to show that in fact this inversion is not valid, as the distribution of W_0 actually depends on the value of the parameter \Sigma when p > 1. In other words, the natural Studentized statistic W_0 is not an ancillary statistic, so the fiducial inversion is not applicable. For later reference, note that if W has density h_0, then the first coordinate of W, say W^{(1)}, has a distribution R_0 that is the distribution of (n-p)^{-1/2} t_{n-p}, where t_m denotes a Student-t variable with m degrees of freedom.

3 Invariance Considerations

The problem of predicting the future observation Z, given the data X and the model in Section 2, is invariant under the following transformations:

    X_i -> A X_i + b,  i = 1, ..., n
    Z -> A Z + b
    \mu -> A \mu + b
    \Sigma -> A \Sigma A'

where A is a p x p non-singular matrix and b is a p-vector. For more details on this invariance, see Eaton and Sudderth (1998, 1999). In brief, a predictive

distribution for Z given the data X is a conditional distribution Q(dz | X) for Z with X fixed. Such a Q is fully invariant if for all sets B \subseteq R^p and for all affine transformations g = (A, b) as above, Q satisfies

    Q(gB | gX) = Q(B | X)

where

    gX = (gX_1, ..., gX_n)  and  gX_i = A X_i + b.

Of course, gB = {w : w = gz, z \in B} and gz = Az + b. It is not hard to show that Q_0 specified in Section 2 is fully invariant. The arguments in Eaton and Sudderth (2001) show that Q_0 is a best fully invariant predictive distribution for a variety of invariant criteria.

In the multivariate normal setting here, Eaton and Sudderth (1998) used the invariance to propose an alternative to Q_0. Let G_T^+ denote the group of all p x p lower triangular matrices with positive diagonal elements. Since S is positive definite, there is a unique element of G_T^+, say L(S), that satisfies

    S = L(S) L'(S).    (3)

The matrix L(S) can be thought of as an alternative to S^{1/2} as a square root of S. Now, consider the Studentized statistic

    W_1 = (1 + 1/n)^{-1/2} L^{-1}(S) (Z - \bar{X}).    (4)

That W_1 is an ancillary statistic (the sampling distribution of W_1 does not depend on the parameter (\mu, \Sigma)) is well known; see Fraser (1968, p. 50) or Eaton and Sudderth (1998). Standard multivariate calculations show that the sampling distribution of W_1 has a density function on R^p given by

    h_1(w) = h_0(w) \psi_p(w),  w \in R^p,    (5)

where h_0 is given in (1) and

    \psi_p(w) = \frac{(1 + w'w)^{(p-1)/2}}{\prod_{i=1}^{p-1} (1 + w_1^2 + ... + w_i^2)}.
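The ancillarity claimed for the Studentized statistic in (4) can be checked by simulation. The sketch below is our own illustration, not part of the paper: it assumes NumPy is available, the function names are ours, and it draws data (with mean zero, which costs no generality since the statistic is location invariant) under two quite different covariance matrices. The spread of the first coordinate should agree in both cases.

```python
# Numerical sketch (ours): the statistic of eq. (4) has the same sampling
# distribution under different covariance matrices, as ancillarity requires.
import numpy as np

def w1_statistic(X, Z):
    """W_1 = (1 + 1/n)^(-1/2) L^(-1)(S) (Z - Xbar), with S = L L' (eq. (4))."""
    n = X.shape[0]
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)       # un-normalized sample covariance matrix
    L = np.linalg.cholesky(S)           # unique lower triangular square root of S
    return np.linalg.solve(L, Z - xbar) / np.sqrt(1.0 + 1.0 / n)

def first_coords(Sigma, n=10, reps=20000, seed=0):
    """Sample the first coordinate of W_1 for i.i.d. N_p(0, Sigma) data."""
    rng = np.random.default_rng(seed)
    p = Sigma.shape[0]
    A = np.linalg.cholesky(Sigma)
    out = np.empty(reps)
    for r in range(reps):
        X = rng.standard_normal((n, p)) @ A.T
        Z = rng.standard_normal(p) @ A.T
        out[r] = w1_statistic(X, Z)[0]
    return out

n, p = 10, 3
v_id = first_coords(np.eye(p), n=n)
Sigma = np.array([[4.0, 1.0, 0.0],
                  [1.0, 2.0, 0.5],
                  [0.0, 0.5, 1.0]])      # an arbitrary positive definite choice
v_gen = first_coords(Sigma, n=n, seed=1)

# The first coordinate of W_1 is distributed as (n-1)^(-1/2) t_{n-1}
# (see Section 3), so its variance should be 1/(n-3) in both runs.
print(v_id.var(), v_gen.var(), 1.0 / (n - 3))
```

Both empirical variances should be close to 1/(n-3), whereas Section 4 shows that the analogous experiment for W_0 of (2) would give answers that depend on the covariance matrix.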

It also follows easily that the first coordinate of W_1 with density (5) is distributed as (n-1)^{-1/2} t_{n-1}. This distribution is denoted by R_1. The fact that R_0 and R_1 are different when p > 1 is crucial to our arguments.

The difference between W_0 and W_1 is simply the choice of square root of S. The unique positive definite choice yields the natural Studentized statistic W_0, while the choice of L as a square root yields the not-so-natural Studentized statistic W_1. Some arguments suggesting that W_1 is the better choice can be found in Eaton and Sudderth (1999) and Eaton, Muirhead and Pickering (2004). When W_1 given by (4) has the density h_1 in (5), a direct pivoting argument (thinking of X as fixed and Z as random) yields the predictive density

    q_1(z | X) = (1 + 1/n)^{-p/2} |L^{-1}(S)| h_1\left( (1 + 1/n)^{-1/2} L^{-1}(S) (z - \bar{X}) \right)

with respect to Lebesgue measure dz on R^p. This gives the predictive distribution Q_1(dz | X) = q_1(z | X) dz. This predictive distribution is not fully invariant but does have optimality properties that show, within certain frameworks, it is superior to Q_0; see Eaton and Sudderth (2001) and Eaton, Muirhead and Pickering (2004). It does, however, depend on the order in which the coordinates are presented.

4 The Main Result

In this section, we prove that the statistic W_0 is not a pivotal quantity by showing that the sampling distribution of W_0 depends on the parameters. In the case of p = 2, techniques described in Olkin and Rubin (1964) can be used to convince oneself of the above assertion, but a rigorous proof for p > 2 based on these methods seems elusive. Our proof is by contradiction and is not constructive.

Before giving the formal proof, which is somewhat involved, we first sketch the main idea. Given the covariance matrix \Sigma, write \Sigma = \theta\theta' with \theta \in G_T^+, the parameter analog of (3). Then let P(\cdot | \theta) denote the distribution of the statistic W_0 defined in (2). Our main result is that P(\cdot | \theta) is not constant in \theta as \theta ranges over G_T^+ and, correspondingly, \Sigma ranges over all p x p non-singular covariance matrices. The argument is by contradiction. Let \epsilon_1 denote the

vector in R^p with a one as the first coordinate and all other coordinates zero. Then \epsilon_1' W_0 is the first coordinate of W_0, and \epsilon_1' W_0 has the distribution R_0 when \theta = I_p. Thus, if P(\cdot | \theta) is constant in \theta and P^{(1)}(\cdot | \theta) denotes the distribution of \epsilon_1' W_0 when W_0 has distribution P(\cdot | \theta), then P^{(1)}(\cdot | \theta) must be R_0 for all \theta \in G_T^+. We will obtain a contradiction by showing there is a sequence {\theta_n} with \theta_n \in G_T^+ so that P^{(1)}(\cdot | \theta_n) converges to R_1 given in Section 3. As R_0 \ne R_1, P(\cdot | \theta) cannot be constant in \theta. The formal details of the above follow.

Theorem 4.1. The distribution P(\cdot | \theta), \theta \in G_T^+, is not constant in \theta.

Proof. The p x p matrix \theta has the form

    \theta = \begin{pmatrix} \theta_{11} & 0 & \cdots & 0 \\ \theta_{21} & \theta_{22} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ \theta_{p1} & \theta_{p2} & \cdots & \theta_{pp} \end{pmatrix}

with \theta_{ii} > 0 for i = 1, ..., p. Fix all the elements of \theta except \theta_{11}, and consider a sequence of \theta_{11} values increasing to +\infty. This gives the sequence {\theta_n} with \theta_n \in G_T^+. As argued above, it suffices to show that P^{(1)}(\cdot | \theta_n) converges to R_1.

To this end, let V = (1 + 1/n)^{-1/2} (Z - \bar{X}), so V is N_p(0, \Sigma) and is independent of S, which is Wishart, W(\Sigma, p, n-1), in the notation of Eaton (1983). Next, let V_0 be N_p(0, I_p) and S_0 be W(I_p, p, n-1), with V_0 and S_0 independent. With =_d denoting equality in distribution, it follows that

    V =_d \theta V_0,  S =_d \theta S_0 \theta'    (6)

when \Sigma = \theta\theta'. From the definition of W_0, we have

    W_0 = W_0(\theta) = S^{-1/2} V = S^{-1/2} L(S) L^{-1}(S) V = \tau(S) W_1    (7)

where W_1 is defined in (4) and \tau(S) = S^{-1/2} L(S). The p x p matrix \tau(S) is orthogonal since

    \tau(S) \tau'(S) = S^{-1/2} L(S) L'(S) S^{-1/2} = S^{-1/2} S S^{-1/2} = I_p.
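The two algebraic facts just used, the orthogonality of \tau(S) and the factorization W_0 = \tau(S) W_1 of (7), are easy to confirm numerically. The sketch below is our own illustration (not from the paper) and assumes NumPy; it builds both square roots of S from one simulated sample.

```python
# Numerical check (ours): tau(S) = S^(-1/2) L(S) is orthogonal, and
# W_0 = tau(S) W_1 links the two Studentizations (eq. (7)).
import numpy as np

rng = np.random.default_rng(42)
n, p = 10, 3
X = rng.standard_normal((n, p))
Z = rng.standard_normal(p)

xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar)            # un-normalized sample covariance

# Symmetric square root of S via its spectral decomposition.
evals, evecs = np.linalg.eigh(S)
S_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
S_inv_half = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T

L = np.linalg.cholesky(S)                # lower triangular, S = L L'
tau = S_inv_half @ L                     # orthogonal by the identity above

c = 1.0 / np.sqrt(1.0 + 1.0 / n)
W0 = c * S_inv_half @ (Z - xbar)             # Studentization by S^(-1/2), eq. (2)
W1 = c * np.linalg.solve(L, Z - xbar)        # Studentization by L^(-1),   eq. (4)

print(np.allclose(tau @ tau.T, np.eye(p)))   # tau is orthogonal: True
print(np.allclose(W0, tau @ W1))             # eq. (7) holds:     True
```

The non-ancillarity of W_0 thus rests entirely on how the random rotation \tau(S) behaves as \Sigma varies, which is what the remainder of the proof exploits.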

Combining (6) and (7) and using the ancillarity of W_1,

    \epsilon_1' W_0 = \epsilon_1' W_0(\theta_n) =_d \epsilon_1' \tau(\theta_n S_0 \theta_n') W_1    (8)

where W_1 = L^{-1}(S_0) V_0. It follows from Lemma 4.1 below that

    \lim_{n \to \infty} \epsilon_1' \tau(\theta_n S_0 \theta_n') = \epsilon_1'    (9)

for each positive definite S_0. Hence the random variable in (8) converges in distribution to \epsilon_1' W_1. Since \epsilon_1' W_1 has R_1 as its distribution, the proof is complete.

Lemma 4.1. Equation (9) holds.

Proof. Let ||\cdot|| denote the usual norm on R^p. It must be shown that

    \lim_{n \to \infty} ||\epsilon_1' \tau(\theta_n S_0 \theta_n') - \epsilon_1'|| = 0

for each positive definite S_0. Because \tau(\theta S_0 \theta') is an orthogonal matrix, we can equivalently show

    \lim_{n \to \infty} ||\epsilon_1' - \epsilon_1' \tau'(\theta_n S_0 \theta_n')|| = 0.

But, for any \theta \in G_T^+,

    \tau'(\theta S_0 \theta') = \tau^{-1}(\theta S_0 \theta') = L^{-1}(\theta S_0 \theta') (\theta S_0 \theta')^{1/2}.

Because L(\theta S_0 \theta') is lower triangular, its (1,1) element is \theta_{11} s_{0,11}^{1/2}, where s_{0,11} is the (1,1) element of S_0. Therefore

    \epsilon_1' L^{-1}(\theta S_0 \theta') = (\theta_{11} s_{0,11}^{1/2})^{-1} \epsilon_1'.    (10)

By the continuity of the square root mapping on non-negative definite matrices, we have

    \lim_{\theta_{11} \to \infty} \theta_{11}^{-1} (\theta S_0 \theta')^{1/2} = \lim_{\theta_{11} \to \infty} (\theta_{11}^{-2} \theta S_0 \theta')^{1/2} = (s_{0,11} \epsilon_1 \epsilon_1')^{1/2} = s_{0,11}^{1/2} \epsilon_1 \epsilon_1'.    (11)

Combining (10) and (11) yields

    \lim_{\theta_{11} \to \infty} \epsilon_1' L^{-1}(\theta S_0 \theta') (\theta S_0 \theta')^{1/2} = \lim_{\theta_{11} \to \infty} s_{0,11}^{-1/2} \epsilon_1' \theta_{11}^{-1} (\theta S_0 \theta')^{1/2} = s_{0,11}^{-1/2} \epsilon_1' (s_{0,11}^{1/2} \epsilon_1 \epsilon_1') = \epsilon_1' \epsilon_1 \epsilon_1' = \epsilon_1'.

Thus (9) holds and the proof is complete.

5 Discussion

The implications of Theorem 4.1 are far from clear. The results do suggest that the predictive distribution Q_0 is somehow suspect, in spite of its status as a best fully invariant decision rule. Other examples of questionable best fully invariant procedures include the covariance estimation results of Stein in James and Stein (1961).

The validity of Theorem 4.1 continues to hold in a confidence set situation. That is, the suggestive pivotal S^{-1/2}(\bar{X} - \theta) is, in fact, not an ancillary statistic, so it cannot be used to construct confidence regions for \theta of the form S^{1/2}(C) + \bar{X} for sets C \subseteq R^p. However, the quantity (\bar{X} - \theta)' S^{-1} (\bar{X} - \theta) is ancillary and is commonly used to justify the classical confidence ellipses for \theta. Also, the pivotal quantity L^{-1}(S)(\bar{X} - \theta) is ancillary, so standard confidence set arguments based on it are valid. These results extend routinely to the multivariate regression context.

Theorem 4.1 suggests that the appealing Studentization using S^{-1/2} should perhaps be replaced by one using L^{-1}(S) with L(S) \in G_T^+. Such L^{-1}-Studentizations are not invariant under the transformations given in Section 3, although they are invariant when the linear part A of the affine transformation is in G_T^+. But different choices of coordinate system (e.g., permuting the coordinates of the data) do produce different predictive distributions, that is, different Q_1's. This suggests that thoughts about the right coordinate system need to precede the data analysis.
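The limit in Lemma 4.1 can also be watched numerically. The following sketch is our own illustration (assuming NumPy; the matrix choices and tolerances are ours): the (1,1) entry of \theta is pushed upward while the rest of \theta and a fixed S_0 are held constant, and the first row of \tau(\theta S_0 \theta') approaches \epsilon_1'.

```python
# Numerical illustration (ours) of Lemma 4.1: as theta_11 grows, the first
# row of tau(theta S0 theta') converges to eps1', so the first coordinate of
# W_0 behaves in the limit like that of the ancillary statistic W_1.
import numpy as np

def tau(M):
    """tau(M) = M^(-1/2) L(M) for positive definite M, where M = L L'."""
    evals, evecs = np.linalg.eigh(M)
    M_inv_half = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    return M_inv_half @ np.linalg.cholesky(M)

rng = np.random.default_rng(7)
p = 3
A = rng.standard_normal((p, p))
S0 = A @ A.T + p * np.eye(p)             # a fixed positive definite S0
eps1 = np.eye(p)[0]                      # first standard basis vector

theta = np.tril(rng.standard_normal((p, p)))
np.fill_diagonal(theta, np.abs(np.diagonal(theta)) + 1.0)   # theta in G_T^+

errs = []
for t11 in (1.0, 10.0, 1e3, 1e6):
    th = theta.copy()
    th[0, 0] = t11                       # push theta_11 toward infinity
    row = eps1 @ tau(th @ S0 @ th.T)     # first row of tau(theta S0 theta')
    errs.append(np.linalg.norm(row - eps1))
    print(f"theta_11 = {t11:g}: ||eps1' tau - eps1'|| = {errs[-1]:.2e}")
```

The printed deviations shrink as theta_11 grows, in line with equation (9).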

References

Bjørnstad, J. F. (1990), Predictive likelihood: A review, Statistical Science 5, 242-265.

Box, G. E. P. and G. C. Tiao (1973), Bayesian inference in statistical analysis, Addison-Wesley, Reading, Massachusetts.

Eaton, M. L. (1983), Multivariate statistics: A vector space approach, John Wiley, New York.

Eaton, M. L. and W. D. Sudderth (1998), A new predictive distribution for normal multivariate linear models, Sankhya 60, Series A, part 3, 363-382.

Eaton, M. L. and W. D. Sudderth (1999), Consistency and strong inconsistency of group invariant predictive inference, Bernoulli 5, 833-854.

Eaton, M. L. and W. D. Sudderth (2001), Best invariant predictive distributions, in: M. Viana and D. Richards (eds.), Contemporary Mathematics Series of the American Mathematical Society, Vol. 287, Providence, Rhode Island, 49-62.

Eaton, M. L., R. J. Muirhead, and E. H. Pickering (2004), Assessing a vector of clinical observations. Submitted for publication.

Fraser, D. A. S. (1968), The structure of inference, John Wiley, New York.

Fraser, D. A. S. and M. S. Haq (1969), Structural probability and prediction for the multivariate model, Journal of the Royal Statistical Society B 31, 317-331.

Geisser, S. (1993), Predictive inference: An introduction, Chapman & Hall, New York.

James, W. and C. Stein (1961), Estimation with quadratic loss, in: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, CA, 361-380.

Keyes, T. and M. Levy (1996), Goodness of fit prediction for multivariate linear models, Journal of the American Statistical Association 91, 191-197.

Olkin, I. and H. Rubin (1964), Multivariate beta distributions and independence properties of the Wishart distribution, Annals of Mathematical Statistics 35, 261-269.