Stat 206: Sampling theory, sample moments, Mahalanobis topology
James Johndrow (adapted from Iain Johnstone's notes)
2016-11-02

Notation

My notation is different from the book's. This is partly because I am going to be writing on the board, and having less complicated notation makes that easier. I am also just a minimalist when it comes to notation. My notation is:

symbol   description          what it is
X        upper case letter    matrix, random variable
x        lower case letter    vector

Compare this to the book's notation:

symbol   description                      what it is
X        upper case bold letter           matrix (except data)
X        upper case bold letter           vector random variable
x        lower case bold letter           vector
x        lower case non-bold letter       scalar
X        bigger upper case bold letter    data matrix

There are clearly pros and cons. Here are possible points where confusion may arise using my notation:

1. Random variables. I'll use X to refer to both random variables and the data matrix. It will usually be obvious from the context which I am talking about. In particular, if I write X ~ f(x), E[X], et cetera, i.e. anytime I make a probability statement, I am referring to the random variable.

2. Subscripting. We will sometimes talk about a collection of random vectors X_1, ..., X_n where each X_i is a vector. We will also talk about the data matrix entries X_ij, which might look like the jth entry of the ith random vector. Again, it will hopefully be clear from context which I mean, and if not, I'll make an effort to point it out. As a rule of thumb, something with a single subscript i will usually be the ith vector in a collection, and something with a single subscript j will (usually) refer to the jth component of a vector (also see the next point).

3. Indexing. The book uses j to index observations and k to index variables. I will use i to index observations and j to index variables. The notation I use is more common in statistics, so if I tried to switch to the book's notation I would inevitably fall back into my usual habit, causing even more confusion. I apologize in advance for having to keep track of different notation.

Random sampling

By and large we will assume that our data x = (x_1, ..., x_p)′ [1] are independent [2] realizations of a vector random variable X with a density f : R^p → R on R^p, that is, X ~ f. When we write X ~ f, we mean the data distribution has a density satisfying ∫_{R^p} f(x) dx = 1.

[1] The prime (′) will refer to the transpose of a vector or matrix, i.e. the object with row and column indices switched. By default, vectors are column vectors, so their transposes are row vectors.

[2] A common situation in which independence is violated is in time series applications or longitudinal studies, but the principles we learn by studying independence can be applied to develop methods for non-independent samples.

We commonly need to partition X as

$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix},$$

where X_1, X_2 are random vectors of dimension p_1, p_2 with p_1 + p_2 = p. Then the marginal density of X_1 is

$$f_1(x_1) = \int_{\mathbb{R}^{p_2}} f(x_1, x_2) \, dx_2$$

and the conditional density of X_2 given X_1 = x_1^0 is

$$f(x_2 \mid x_1^0) = \frac{f(x_1^0, x_2)}{f_1(x_1^0)}.$$

Statistical independence occurs when f(x_2 | x_1^0) = f(x_2) for all x_1^0 ∈ R^{p_1}. When X_1 is independent of X_2 we write X_1 ⊥ X_2.

Theorem 1. If X_1 ⊥ X_2, then f(x) = f_1(x_1) f_2(x_2).
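
The marginal and conditional definitions can be checked numerically. Below is a minimal sketch (my own illustration, not from the book) that evaluates an arbitrary bivariate density on a grid, obtains the marginal of X_1 by summing out x_2, and forms the conditional density of X_2 given X_1 = 0 by renormalizing the corresponding slice; integrals are approximated by sums times the grid spacing.

# Hypothetical density f(x1, x2) proportional to exp(-(x1^2 + x1 x2 + x2^2)), chosen arbitrarily.
x1 <- seq(-4, 4, by = 0.01)
x2 <- seq(-4, 4, by = 0.01)
h  <- 0.01                                    # grid spacing
fgrid <- outer(x1, x2, function(a, b) exp(-(a^2 + a * b + b^2)))
fgrid <- fgrid / (sum(fgrid) * h^2)           # normalize so the grid density integrates to 1

f1 <- rowSums(fgrid) * h                      # marginal density of X1: integrate out x2
i0 <- which.min(abs(x1 - 0))                  # grid index closest to x1 = 0
f_cond <- fgrid[i0, ] / f1[i0]                # conditional density of X2 given X1 = 0
c(sum(f1) * h, sum(f_cond) * h)               # both approximately 1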

Bayes' theorem allows us to reverse conditional probabilities. Suppose we have random variables (Θ, X), where Θ ~ f(θ) is the prior and X | Θ ~ f(x | θ) is the likelihood or sampling model. Then the joint density of (Θ, X) is f(θ) f(x | θ) and the marginal density of X is

$$f(x) = \int f(\theta) f(x \mid \theta) \, d\theta.$$

Then we have the following result.

Theorem 2 (Bayes). The posterior density of Θ given X is the conditional distribution of parameters given observables, and is given by

$$f(\theta \mid x) = \frac{f(x \mid \theta) f(\theta)}{f(x)}.$$

Note: each of Θ and X could be a multivariate vector or a discrete quantity (though in the latter case, we would replace densities with pmfs).

The mean µ and variance of the vector variable X (when they exist) [3] are defined analogously to the univariate case.

1. The population mean vector µ = E[X] has components

$$\mu_j = \int x_j f(x) \, dx.$$

2. The population covariance matrix Σ = cov(X) = E[(X − µ)(X − µ)′].

[3] In general we assume both the mean and variance exist and are finite.

The matrix Σ is p × p and has entries

$$\sigma_{jk} = \mathrm{cov}(X_j, X_k) = E[(X_j - \mu_j)(X_k - \mu_k)] = \int (x_j - \mu_j)(x_k - \mu_k) f(x) \, dx.$$

It follows that Σ is symmetric, i.e. Σ = Σ′, and non-negative definite (defined formally in the next lecture). If we only wish to specify the first and second order moments of a random vector X, it is convenient to write X ~ (µ, Σ), keeping in mind that this does not specify a particular distribution for X. Some key properties of means and covariances that we use frequently are the following.

Remark 1. Σ = E[XX′] − µµ′.

Proof. Expanding,

$$(X - \mu)(X - \mu)' = XX' - \mu X' - X\mu' + \mu\mu'.$$

Taking the expectation,

$$\Sigma = E[XX'] - \mu E[X'] - E[X]\mu' + \mu\mu' = E[XX'] - \mu\mu'.$$
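
As a quick numerical sanity check of Remark 1 (my addition, not from the notes), the following sketch estimates E[XX′] and µ by Monte Carlo for an arbitrary three-dimensional distribution and compares E[XX′] − µµ′ with the sample covariance.

# Hypothetical Monte Carlo check of Remark 1: cov(X) = E[X X'] - mu mu'.
set.seed(1)
n <- 1e5
Z <- matrix(rexp(n * 3), n, 3)                # arbitrary iid draws with finite moments
X <- cbind(Z[, 1], Z[, 1] + Z[, 2], Z[, 3])   # introduce some correlation between components
EXX_hat <- crossprod(X) / n                   # Monte Carlo estimate of E[X X']
mu_hat  <- colMeans(X)                        # Monte Carlo estimate of mu
EXX_hat - tcrossprod(mu_hat)                  # approximately Sigma
cov(X)                                        # agrees up to the n/(n - 1) factor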

Another property is linearity.

Theorem 3 (linearity of expectation (vector)).

$$E[AX + b] = A E[X] + b = A\mu + b, \qquad \mathrm{cov}(AX) = A \, \mathrm{cov}(X) \, A' = A \Sigma A'.$$

Proof.

$$E[(AX)_j] = E\Big[\sum_k a_{jk} X_k\Big] = \sum_k a_{jk} E[X_k] \qquad (1)$$

$$= (A E[X])_j = (A\mu)_j. \qquad (2)$$

Now

$$E[(AX - A\mu)(AX - A\mu)'] = E[A(X - \mu)(X - \mu)'A'] = A \, E[(X - \mu)(X - \mu)'] \, A' = A \Sigma A',$$

where the next-to-last step, if written out fully, would involve repeated use of linearity of expectation as in (1).

Linear combinations are just a special case. If a ∈ R^p is a constant vector, then a′X has moments

$$E[a'X] = a'\mu, \qquad \mathrm{var}(a'X) = a'\Sigma a = \sum_{j,k} a_j \sigma_{jk} a_k.$$
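
The identities in Theorem 3 are easy to verify by simulation. Below is a small sketch (my addition) with an arbitrary matrix A and offset b; the sample mean and covariance of AX + b are compared with Aµ + b and AΣA′.

# Hypothetical check of linearity: E[AX + b] = A mu + b and cov(AX + b) = A Sigma A'.
set.seed(2)
n <- 1e5
X <- cbind(rpois(n, 4), rexp(n, 1), runif(n))          # arbitrary 3-dimensional draws
A <- matrix(c(1, 0,  2,
              0, 1, -1), nrow = 2, byrow = TRUE)        # fixed 2 x 3 matrix
b <- c(1, -3)
Y <- t(A %*% t(X) + b)                                  # row i of Y is A x_i + b
rbind(colMeans(Y), drop(A %*% colMeans(X) + b))         # the two rows agree
cov(Y)
A %*% cov(X) %*% t(A)                                   # agrees with cov(Y)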

Partitioning vectors and matrices

We often want to partition vectors and matrices in a similar fashion to what we did for random variables. If we partition X as before, e.g.

$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix},$$

then the mean of X and the covariance matrix are partitioned conformably:

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$

Writing this out in a little more detail, we have

$$\mu = E[X] = \begin{pmatrix} E[X_1] \\ E[X_2] \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$$

and

$$\Sigma = E[(X - \mu)(X - \mu)'] = E\left[ \begin{pmatrix} X_1 - \mu_1 \\ X_2 - \mu_2 \end{pmatrix} \begin{pmatrix} (X_1 - \mu_1)' & (X_2 - \mu_2)' \end{pmatrix} \right]
= \begin{pmatrix} E[(X_1 - \mu_1)(X_1 - \mu_1)'] & E[(X_1 - \mu_1)(X_2 - \mu_2)'] \\ E[(X_2 - \mu_2)(X_1 - \mu_1)'] & E[(X_2 - \mu_2)(X_2 - \mu_2)'] \end{pmatrix}
= \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$

Notice that, by symmetry, Σ_12 = Σ_21′.

It is sometimes useful to consider instead the correlation between components of X, which has the same interpretation regardless of the marginal variances of X:

$$\rho_{jk} = \mathrm{cor}(X_j, X_k) = \frac{\mathrm{cov}(X_j, X_k)}{\sqrt{\mathrm{var}(X_j)}\,\sqrt{\mathrm{var}(X_k)}} \in [-1, 1],$$

and the correlation matrix, often denoted P, the p × p matrix with entries P_jk = ρ_jk. If V = diag(σ_11, σ_22, ..., σ_pp), [4] where σ_jj are the diagonal entries of Σ (the marginal variances), then we can express P as

$$P = V^{-1/2} \Sigma V^{-1/2},$$

where V^{-1/2} = diag(σ_11^{-1/2}, ..., σ_pp^{-1/2}).

[4] This notation means the diagonal entries are given by the values inside the parentheses and all the off-diagonal entries are zero.
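
As a concrete illustration (mine, not the book's), the relation P = V^{-1/2} Σ V^{-1/2} can be checked against base R's cov2cor() for an arbitrary covariance matrix.

# Hypothetical illustration: P = V^{-1/2} Sigma V^{-1/2} for an arbitrary covariance matrix.
Sigma <- matrix(c(4.0,  1.2,  0.5,
                  1.2,  1.0, -0.3,
                  0.5, -0.3,  2.0), 3, 3)
Vinvhalf <- diag(1 / sqrt(diag(Sigma)))   # V^{-1/2} = diag(sigma_jj^{-1/2})
P <- Vinvhalf %*% Sigma %*% Vinvhalf
P
cov2cor(Sigma)                            # built-in equivalent; matches P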

Sample moments

We can now give some basic properties of the sample mean and sample covariance. The sample mean x̄ is given by

$$\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i = \left( \frac{1}{n} \sum_{i=1}^n x_{i1}, \ldots, \frac{1}{n} \sum_{i=1}^n x_{ip} \right)' = (\bar{x}_1, \ldots, \bar{x}_p)'.$$

Since the x_i are iid realizations of a random variable X ~ f,

$$E[\bar{X}] = E\left[ \frac{1}{n} \sum_{i=1}^n X_i \right] = \frac{1}{n} \sum_{i=1}^n E[X_i] = \mu = (\mu_1, \ldots, \mu_p)',$$

so the expectation of the sample mean is the mean µ of the random vector X with density f.

The sample covariance matrix S_n is defined entrywise as

$$(S_n)_{jk} = \frac{1}{n} \sum_{i=1}^n (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k),$$

so we can express the sample covariance matrix as a sum of matrices:

$$S_n = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})'.$$

Now we'll state an important result about the sample moments and prove part of it (see the book for the rest of the proof).

Theorem 4. The covariance of the sample mean is cov(X̄) = (1/n) Σ, and the expectation of the sample covariance is E[S_n] = ((n − 1)/n) Σ, so n(n − 1)^{-1} S_n = S is an unbiased estimator of Σ.

Proof. We prove the second part; see the book for the first part. We have

$$\sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})' = \sum_{i=1}^n (X_i - \bar{X}) X_i' - \sum_{i=1}^n (X_i - \bar{X}) \bar{X}' = \sum_{i=1}^n X_i X_i' - n \bar{X}\bar{X}',$$

since ∑_{i=1}^n (X_i − X̄) = 0 and X̄′ = n^{-1} ∑_{i=1}^n X_i′. So then

$$E\left[ \sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})' \right] = \sum_{i=1}^n E[X_i X_i'] - n E[\bar{X}\bar{X}'].$$

Now applying Remark 1, we have

$$E[S_n] = \frac{1}{n} \sum_{i=1}^n E[X_i X_i'] - E[\bar{X}\bar{X}'] = \Sigma + \mu\mu' - \left( \frac{1}{n}\Sigma + \mu\mu' \right) = \frac{n-1}{n} \Sigma.$$
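
A short sketch (my addition) computing S_n and the unbiased S directly from a data matrix, and checking that R's cov() uses the n − 1 denominator, as in Theorem 4.

# Hypothetical illustration: S_n (denominator n) versus the unbiased S (denominator n - 1).
set.seed(3)
n <- 50
x <- cbind(rnorm(n), rnorm(n) + 1, rexp(n))   # an arbitrary n x p data matrix
xc  <- sweep(x, 2, colMeans(x))               # center each column: x_i - xbar
S_n <- crossprod(xc) / n                      # (1/n) sum_i (x_i - xbar)(x_i - xbar)'
S   <- crossprod(xc) / (n - 1)                # the unbiased version
all.equal(S, cov(x))                          # TRUE: cov() divides by n - 1
all.equal(n / (n - 1) * S_n, S)               # the rescaling in Theorem 4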

We can also define the sample correlation matrix R by

$$R_{jk} = \frac{s_{jk}}{\sqrt{s_{jj}}\,\sqrt{s_{kk}}}, \qquad \text{with } s_{jk} = (S)_{jk}.$$

If we put D^{1/2} = diag(s_11^{1/2}, ..., s_pp^{1/2}), then

$$R = D^{-1/2} S D^{-1/2}.$$

Finally, note that the law of large numbers and the central limit theorem also work for vectors. We will give these results without proof; if you are interested, there are many references. For our purposes, it is important just to know that these key asymptotic results also hold for vectors.

Theorem 5 (Multivariate weak law of large numbers). Let X_1, ..., X_n be a sequence of iid length-p random vectors with finite mean µ. Let X̄_n = n^{-1} ∑_{i=1}^n X_i. Then

$$P\big[ |\bar{X}_n - \mu| \geq \epsilon \big] \to 0 \quad \text{as } n \to \infty$$

for all ε > 0, where |x| = ∑_{j=1}^p |x_j| is the L_1 norm.

Theorem 6 (Multivariate central limit theorem). Let X_1, ..., X_n be a sequence of iid length-p random vectors with finite mean µ and finite covariance Σ. Then

$$\sqrt{n}\,(\bar{X}_n - \mu) \overset{D}{\to} \mathrm{No}(0, \Sigma).$$

In Theorem 6, No(0, Σ) is the multivariate normal distribution with mean 0 and covariance Σ, which we will soon characterize.

Finally, we will briefly mention the notion of generalized variance, which the book (and numerous other sources) defines as |S|, the determinant of S. This is a sensible way to summarize the variability of the sample in a single number, and we will revisit it after we have reviewed a bit more linear algebra, which will make its properties easier to understand.
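
As a quick check (mine, not the book's), R = D^{-1/2} S D^{-1/2} can be compared with the built-in cor(), and the generalized variance is simply det(S).

# Hypothetical illustration: sample correlation R = D^{-1/2} S D^{-1/2} and |S|.
set.seed(4)
x <- matrix(rnorm(200 * 3), 200, 3)
x[, 2] <- x[, 1] + x[, 2]            # induce some correlation
S <- cov(x)
Dinvhalf <- diag(1 / sqrt(diag(S)))  # D^{-1/2} = diag(s_jj^{-1/2})
R <- Dinvhalf %*% S %*% Dinvhalf
all.equal(R, cor(x))                 # matches the built-in sample correlation
det(S)                               # the generalized variance |S|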

Vector norms and Mahalanobis topology

When dealing with univariate random variables, the notion of magnitude is relatively straightforward. We generally take the magnitude of a real number to be its absolute value, which is of course equal to the Euclidean or L_2 norm √(x²) when x is unidimensional. However, for vectors the definition of magnitude is more subtle. One notion we have already mentioned is the L_1 norm, |x| = ∑_j |x_j|, the sum of the absolute values of the entries. But this isn't equivalent to the L_2 norm for vectors:

$$\|x\|_2 = \langle x, x \rangle^{1/2} = \sqrt{\sum_j x_j^2} \leq \sum_j \sqrt{x_j^2} = \sum_j |x_j|$$

by the triangle inequality, where ⟨x, y⟩ = x′y is the inner (or dot) product. The L_2 norm is arguably the default way to measure the magnitude of vectors, and it induces a metric on R^p via

$$d(x, y) = \|x - y\|_2 = \sqrt{\sum_j (x_j - y_j)^2}. \qquad (3)$$

This is referred to as the Euclidean metric, which corresponds to the familiar straight-line distance in R^p.

When considering distances between data points, it may not make sense to use the Euclidean metric. To understand why, it helps to know a little about quadratic forms.

Definition 1. For a p-vector x and a p × p symmetric matrix Λ, a quadratic form is given by the matrix product x′Λx.

The expectation of a quadratic form is simple.

Theorem 7. Let X be a random vector with finite mean µ and finite covariance Σ. Then E[X′ΛX] = tr(ΛΣ) + µ′Λµ.

Proof. Note: we use properties of the trace of a matrix that will be discussed in the next lecture. Since X′ΛX is a scalar, it equals its own trace, so

$$E[X'\Lambda X] = \mathrm{tr}(E[X'\Lambda X]) = E[\mathrm{tr}(X'\Lambda X)] = E[\mathrm{tr}(\Lambda X X')] = \mathrm{tr}(E[\Lambda X X']) = \mathrm{tr}(\Lambda E[XX']) = \mathrm{tr}(\Lambda(\Sigma + \mu\mu')) = \mathrm{tr}(\Lambda\Sigma) + \mathrm{tr}(\Lambda\mu\mu') = \mathrm{tr}(\Lambda\Sigma) + \mu'\Lambda\mu.$$

In Theorem 7, tr is the trace of a matrix, which is the sum of its diagonal elements.
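
Theorem 7 is easy to spot-check by simulation. The sketch below (my addition) compares the Monte Carlo average of X′ΛX with tr(ΛΣ) + µ′Λµ, where Σ and µ are themselves estimated from the draws; the two sides agree up to Monte Carlo error.

# Hypothetical Monte Carlo check of E[X' Lambda X] = tr(Lambda Sigma) + mu' Lambda mu.
set.seed(5)
n <- 1e5
X <- cbind(rexp(n, 1), rexp(n, 2) + rnorm(n))   # arbitrary 2-dimensional draws
Lambda <- matrix(c(2.0, 0.5,
                   0.5, 1.0), 2, 2)             # a fixed symmetric matrix
qf <- rowSums((X %*% Lambda) * X)               # x_i' Lambda x_i for each row of X
mean(qf)                                        # Monte Carlo estimate of E[X' Lambda X]
mu <- colMeans(X); Sigma <- cov(X)
sum(diag(Lambda %*% Sigma)) + drop(t(mu) %*% Lambda %*% mu)   # close to mean(qf)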

Using this result, we can understand the average distance between points in an iid sample.

Theorem 8 (Expected Euclidean distance). Suppose X, Y are independent, identically distributed random vectors with mean µ and covariance Σ. Then

$$E\big[\|X - Y\|_2^2\big] = 2 \sum_j \sigma_{jj}.$$

Proof. Since X and Y are independent, cov(X − Y) = 2Σ and E[X − Y] = 0, so applying Theorem 7 to X − Y with Λ = I,

$$E\big[\|X - Y\|_2^2\big] = E[(X - Y)' I (X - Y)] = \mathrm{tr}(I \cdot 2\Sigma) + (E[X - Y])' I (E[X - Y]) = 2\,\mathrm{tr}(\Sigma) = 2 \sum_j \sigma_{jj}.$$

Thus, the Euclidean distance between sample points will depend on the variances. This is an undesirable property, since we'd like to be able to interpret distances between sample points in the same way for all samples; we'd like to have a common scale that means something similar no matter how our data were generated. So it makes sense to have a distance metric that scales by the inverse of the variances. We'll actually do a bit more than that, and focus on Mahalanobis distances.

Definition 2 (Mahalanobis distance). Given a p × p symmetric, positive-definite matrix Λ, [5] the Mahalanobis distance ‖x − y‖_Λ between p-vectors x and y with respect to Λ is given by

$$\|x - y\|_\Lambda = \sqrt{(x - y)' \Lambda^{-1} (x - y)} = d_\Lambda(x, y).$$

[5] Don't worry about the terms symmetric and positive definite for now; we'll define them soon.

If we put Λ = Σ in the definition above, and measure distances using ‖x − y‖_Σ, we have

$$E\big[\|X - Y\|_\Sigma^2\big] = E[(X - Y)' \Sigma^{-1} (X - Y)] = \mathrm{tr}(\Sigma^{-1} \cdot 2\Sigma) = 2p,$$

no matter the value of Σ. [6] Additional motivation for using ‖x − y‖_Σ will be offered when we study the multivariate normal distribution.

[6] The notation Σ^{-1} refers to the inverse of the matrix Σ, that is, the matrix which, when multiplied by Σ, gives the identity.

How do distances in the Mahalanobis metric compare to the straight-line distances that we are used to? The best way to answer this is geometric. In the usual Euclidean metric, the set of all points x equidistant from a single point y lies on the perimeter of a circle centered at y, and the set of points at distance m from y is given by the equation of the circle (x − y)′(x − y) = m². In the Mahalanobis metric, the set of all points equidistant from y is an ellipse with axis lengths proportional to the inverse variances and orientation determined by the off-diagonal entries of Σ^{-1}.
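
To connect Definition 2 to code (my addition), base R's mahalanobis() returns the squared distance (x − y)′Σ^{-1}(x − y); the sketch below compares it with the squared Euclidean distance using the correlation-0.9 matrix that appears in Figure 1.

# Hypothetical illustration: squared Euclidean versus squared Mahalanobis distance.
Sigma <- matrix(c(1.0, 0.9,
                  0.9, 1.0), 2, 2)            # the correlation-0.9 matrix from Figure 1
x <- c(1, 1); y <- c(0, 0)
sum((x - y)^2)                                # squared Euclidean distance
mahalanobis(x, center = y, cov = Sigma)       # squared Mahalanobis distance
drop(t(x - y) %*% solve(Sigma) %*% (x - y))   # the same quantity computed by hand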

Another way to think about this is that the action of a matrix Σ on a vector x via the matrix product Σ^{1/2} x is to rotate and stretch the original vector. [7] You can think equivalently in terms of rotating and scaling the coordinate axes in p-dimensional space. Figure 1 shows a set of points equidistant from the origin in the Euclidean metric (a circle) and a set of points equidistant from the origin in the Mahalanobis metric d_Λ for

$$\Lambda = \begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix}.$$

[7] Don't know what Σ^{1/2} is? Don't worry, we'll get to that shortly.

library(ellipse)   # provides ellipse(): points on a contour of constant Mahalanobis distance
library(ggplot2)

pts <- ellipse(matrix(c(1, .9, .9, 1), 2, 2), centre = c(0, 0))   # contour for the correlation-0.9 matrix
df <- data.frame(x = pts[, 1], y = pts[, 2])
pts2 <- ellipse(matrix(c(1, 0, 0, 1), 2, 2), centre = c(0, 0))    # reference contour: identity covariance (a circle)
df2 <- data.frame(x = pts2[, 1], y = pts2[, 2])
df$cor <- .9
df2$cor <- 0
df <- rbind(df, df2)
df$cor <- as.factor(df$cor)
ggplot(df, aes(x = x, y = y, col = cor)) + geom_path()

Figure 1: the set of points equidistant from the origin in the Euclidean metric (cor = 0) and the Mahalanobis metric defined in the text (cor = 0.9).

Summary

We have covered a number of properties of random vectors and multivariate samples. Importantly, what we have done so far required only (1) iid observations of a random variable X with a density f, and (2) that X has finite mean and covariance. We made no other assumptions about X or f. We will soon shift focus to the study of the multivariate normal distribution. Because µ and Σ play an important role in understanding the multivariate normal, it is easy to lose sight of the fact that the sample mean and sample covariance have meaning and certain statistical properties regardless of whether f is the density of a multivariate normal. Keep this in mind as we move along.