Advanced Statistics I: Gaussian Linear Model (and beyond)
Aurélien Garivier, CNRS / Telecom ParisTech, Centrale
Outline
- One and Two-Sample Statistics
- Linear Gaussian Model
- Model Reduction and Model Selection
- Exercises
Discrete and continuous distributions
- Discrete distribution: $P = \sum_{i=1}^n p_i \delta_{x_i}$. Ex: Binomial, Poisson distributions.
- Continuous distribution: $Q(dx) = f(x)\,dx$. Ex: Exponential distribution.
- A distribution can be neither purely discrete nor purely continuous! Ex: $Z = \min\{X, 1\}$, where $X \sim \mathcal{E}(\lambda)$.
Descriptive properties
- Expectation: $\mu = E[X]$
- Variance: $\sigma^2 = E\left[(X - \mu)^2\right] = E\left[X^2\right] - \mu^2$
- Skewness: $\gamma = \frac{E\left[(X - \mu)^3\right]}{\sigma^3}$
- Kurtosis: $\kappa = \frac{E\left[(X - \mu)^4\right]}{\sigma^4} - 3$
Some remarkable distributions
- Normal: scale- and shift-stable family
- Chi-squared: if $X_1, \dots, X_n$ is an $\mathcal{N}(0,1)$ sample, then $Z = X_1^2 + \cdots + X_n^2 \sim \chi^2(n)$
- Student: if $X \sim \mathcal{N}(0,1)$ is independent of $Z \sim \chi^2(n)$, then $T = \frac{X}{\sqrt{Z/n}} \sim \mathcal{T}(n)$
- Fisher: if $X \sim \chi^2(n)$ is independent of $Y \sim \chi^2(m)$, then $F = \frac{X/n}{Y/m} \sim F(n,m)$
Empirical distribution and statistics
Let $X_1, \dots, X_n$ be a $P$-sample.
- Empirical mean: $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$
- Empirical variance: $\Sigma_n^2 = \frac{1}{n} \sum_{i=1}^n \left( X_i - \bar{X}_n \right)^2 = \frac{1}{n} \sum_{i=1}^n X_i^2 - \bar{X}_n^2$
- Empirical distribution: $P_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}$
- Unbiased variance estimator: $\hat{\sigma}_n^2 = \frac{1}{n-1} \sum_{i=1}^n \left( X_i - \bar{X}_n \right)^2 = \frac{1}{n-1} \left( \sum_{i=1}^n X_i^2 - n \bar{X}_n^2 \right)$
If $P = \mathcal{N}(0, \sigma^2)$, using Cochran's Theorem we get $\bar{X}_n \sim \mathcal{N}(0, \sigma^2/n)$ and $\sum_{i=1}^n \left( X_i - \bar{X}_n \right)^2 / \sigma^2 \sim \chi^2(n-1)$.
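A minimal R sketch of these estimators on a simulated Gaussian sample (the sample size and σ below are arbitrary choices):

```r
set.seed(1)
n <- 100; sigma <- 2
x <- rnorm(n, mean = 0, sd = sigma)

xbar   <- mean(x)                  # empirical mean
Sigma2 <- mean(x^2) - xbar^2       # empirical variance (1/n normalization)
s2     <- var(x)                   # unbiased estimator (1/(n-1) normalization)
c(xbar = xbar, Sigma2 = Sigma2, s2 = s2)

(n - 1) * s2 / sigma^2             # ~ chi^2(n - 1) by Cochran's Theorem
```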
Convergence properties
Theorem (LLN). If $E[|X_i|] < \infty$, then $\bar{X}_n \to \mu$ (in probability, almost surely).
- Application to $\Sigma_n^2$ and $\hat{\sigma}_n^2$, etc.
- Convergence of the empirical distribution $P_n \to P$ under appropriate hypotheses.
Central Limit Theorem
Theorem (CLT). If $E[X_i^2] < \infty$, then $\frac{\sqrt{n} \left( \bar{X}_n - \mu \right)}{\sigma} \Rightarrow \mathcal{N}(0,1)$.
By Slutsky's Lemma, $\frac{\sqrt{n} \left( \bar{X}_n - \mu \right)}{\hat{\sigma}_n} \Rightarrow \mathcal{N}(0,1)$.
Student statistic: if $X_i \sim \mathcal{N}(\mu, \sigma^2)$, then $\frac{\sqrt{n} \left( \bar{X}_n - \mu \right)}{\hat{\sigma}_n} \sim \mathcal{T}(n-1)$.
Confidence interval for the mean
- if $\sigma$ is known: $I_\alpha(\mu) = \left[ \bar{X}_n \pm \frac{\sigma \, \phi_{1-\alpha/2}}{\sqrt{n}} \right]$
- if $\sigma$ is unknown: $I_\alpha(\mu) = \left[ \bar{X}_n \pm \frac{\hat{\sigma}_n \, t^{n-1}_{1-\alpha/2}}{\sqrt{n}} \right]$
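A minimal R sketch of both intervals, checked against t.test (the sample and the level are arbitrary choices):

```r
set.seed(1)
n <- 50; mu <- 3; sigma <- 2; alpha <- 0.05
x <- rnorm(n, mu, sigma)

# sigma known:
mean(x) + c(-1, 1) * sigma * qnorm(1 - alpha / 2) / sqrt(n)
# sigma unknown (sd() uses the unbiased 1/(n-1) estimator):
mean(x) + c(-1, 1) * sd(x) * qt(1 - alpha / 2, df = n - 1) / sqrt(n)
# the second interval is what t.test returns:
t.test(x, conf.level = 1 - alpha)$conf.int
```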
Confidence interval for the variance
- if $\mu$ is known: since $\sum_{i=1}^n \left( \frac{X_i - \mu}{\sigma} \right)^2 \sim \chi^2(n)$, $I_\alpha(\sigma^2) = \left[ \frac{\sum_{i=1}^n (X_i - \mu)^2}{\chi^n_{1-\alpha/2}}, \frac{\sum_{i=1}^n (X_i - \mu)^2}{\chi^n_{\alpha/2}} \right]$
- if $\mu$ is unknown: since $\sum_{i=1}^n \left( \frac{X_i - \bar{X}_n}{\sigma} \right)^2 \sim \chi^2(n-1)$, $I_\alpha(\sigma^2) = \left[ \frac{(n-1)\,\hat{\sigma}_n^2}{\chi^{n-1}_{1-\alpha/2}}, \frac{(n-1)\,\hat{\sigma}_n^2}{\chi^{n-1}_{\alpha/2}} \right]$
Comparison of two variances
- let $X_{1,1}, \dots, X_{1,n_1}$ be an $\mathcal{N}(\mu_1, \sigma_1^2)$ sample, and $X_{2,1}, \dots, X_{2,n_2}$ an independent $\mathcal{N}(\mu_2, \sigma_2^2)$ sample;
- in order to test $H_0: \sigma_1 = \sigma_2$ versus $H_1: \sigma_1 \neq \sigma_2$, use the statistic $F = \frac{\hat{\sigma}_1^2}{\hat{\sigma}_2^2} \underset{H_0}{\sim} F(n_1 - 1, n_2 - 1)$
Comparison of two means
Theorem. Let $X_{1,1}, \dots, X_{1,n_1}$ be an $\mathcal{N}(\mu_1, \sigma^2)$ sample, and $X_{2,1}, \dots, X_{2,n_2}$ an independent $\mathcal{N}(\mu_2, \sigma^2)$ sample.
- To estimate the common variance, use $\hat{\sigma}_{12}^2 = \frac{1}{n_1 + n_2 - 2} \left( \sum_{i=1}^{n_1} \left( X_{1,i} - \bar{X}_1 \right)^2 + \sum_{i=1}^{n_2} \left( X_{2,i} - \bar{X}_2 \right)^2 \right)$, which satisfies $\frac{(n_1 + n_2 - 2)\,\hat{\sigma}_{12}^2}{\sigma^2} \sim \chi^2(n_1 + n_2 - 2)$.
- In order to test $H_0: \mu_1 = \mu_2$ versus $H_1: \mu_1 \neq \mu_2$, use the statistic $T = \sqrt{\frac{n_1 n_2}{n_1 + n_2}} \; \frac{\bar{X}_1 - \bar{X}_2}{\hat{\sigma}_{12}} \underset{H_0}{\sim} \mathcal{T}(n_1 + n_2 - 2)$.
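In R, these two tests are available as var.test and t.test; a minimal sketch on simulated data (all parameters are arbitrary choices):

```r
set.seed(1)
x1 <- rnorm(30, mean = 0, sd = 1)
x2 <- rnorm(40, mean = 0.5, sd = 1)

var.test(x1, x2)                  # F = sigma1_hat^2 / sigma2_hat^2 ~ F(n1 - 1, n2 - 1)
t.test(x1, x2, var.equal = TRUE)  # pooled-variance Student test, T(n1 + n2 - 2) under H0
```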
Linear Gaussian Model
Generic formulation
$Y_i = \alpha_1 x_i^1 + \cdots + \alpha_p x_i^p + \sigma Z_i$, where $Z_i \sim \mathcal{N}(0,1)$.
Matrix form: $Y = X\theta + \sigma Z$, $Z \sim \mathcal{N}(0_n, I_n)$.
Ex: ANOVA, regression, rupture (change-point) in time series.
Theorem (Cochran's Theorem)
- let $X = (X_1, \dots, X_n)$ be a standard centered normal sample;
- let $E_1, \dots, E_p$ be a decomposition of $\mathbb{R}^n$ into pairwise orthogonal subspaces of dimensions $\dim E_j = d_j$;
- for $1 \le j \le p$, let $v^j_1, \dots, v^j_{d_j}$ be an orthonormal basis of $E_j$.
Then:
- the components of $X$ in the basis $(v_1, \dots, v_n)$ form another standard centered normal sample;
- the random vectors $X_{E_1}, \dots, X_{E_p}$ obtained by projecting $X$ onto $E_1, \dots, E_p$ are independent;
- so are $\|X_{E_1}\|, \dots, \|X_{E_p}\|$, and they satisfy $\|X_{E_j}\|^2 \sim \chi^2(d_j)$.
Theorem (Generic solution)
- The maximum-likelihood estimator and the least-squares estimator coincide and are given by: $\hat{\theta} = ({}^t X X)^{-1} \, {}^t X Y \sim \mathcal{N}\left( \theta, \sigma^2 ({}^t X X)^{-1} \right)$
- The variance $\sigma^2$ is estimated (without bias) by $\hat{\sigma}^2 = \frac{\|Y - X\hat{\theta}\|^2}{n - p}$, which satisfies $\frac{(n-p)\,\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n-p)$
- $\hat{\theta}$ and $\hat{\sigma}^2$ are independent
- Incremental Gram-Schmidt procedure
Theorem (Gauss-Markov). $\hat{\theta}$ has minimal variance among all linear unbiased estimators of $\theta$.
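A minimal R sketch computing these closed-form estimators by hand and checking them against lm on simulated data (the design and parameters are arbitrary choices):

```r
set.seed(1)
n <- 100; p <- 3
X <- cbind(1, rnorm(n), rnorm(n))          # n x p design matrix
theta <- c(1, 2, -0.5); sigma <- 0.3
Y <- X %*% theta + sigma * rnorm(n)

theta_hat  <- solve(t(X) %*% X, t(X) %*% Y)     # (tXX)^{-1} tXY
sigma2_hat <- sum((Y - X %*% theta_hat)^2) / (n - p)

fit <- lm(Y ~ X - 1)    # same estimates ('-1': the intercept is already in X)
cbind(theta_hat, coef(fit))
c(sigma2_hat, summary(fit)$sigma^2)
```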
Y = Xθ + σz Xθ M 0 M
Y = Xθ + σz Xθ M Y M = Xˆθ M 0 M
Y = Xθ + σz Y Xθ 2 σ 2 χ 2 (n) Y Y M 2 σ 2 χ 2 (n p) Y M = Xˆθ M Y M Xθ 2 σ 2 χ 2 (p) Xθ 0 M
Theorem (Simple regression: $Y_i = \alpha + \beta x_i + \sigma Z_i$)
The ML estimators are given by:
- $\hat{\alpha} = \bar{Y} - \hat{\beta} \bar{x} \sim \mathcal{N}\left( \alpha, \frac{\sigma^2 E[x^2]}{n \operatorname{Var}(x)} \right)$
- $\hat{\beta} = \frac{\operatorname{Cov}(x, Y)}{\operatorname{Var}(x)} \sim \mathcal{N}\left( \beta, \frac{\sigma^2}{n \operatorname{Var}(x)} \right)$
- They are correlated: $\operatorname{Cov}\left( \hat{\alpha}, \hat{\beta} \right) = -\frac{\sigma^2 \bar{x}}{n \operatorname{Var}(x)}$
- The variance can be estimated by $\hat{\sigma}_n^2 = \frac{1}{n-2} \sum \left( Y_i - \hat{\alpha} - \hat{\beta} x_i \right)^2$, which satisfies $\frac{(n-2)\,\hat{\sigma}_n^2}{\sigma^2} \sim \chi^2(n-2)$
Smart reparameterization: $Y_i = \delta + \beta (x_i - \bar{x}) + \sigma Z_i$
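A minimal R sketch of simple regression with lm (data simulated with arbitrary parameters):

```r
set.seed(1)
n <- 50
x <- runif(n)
Y <- 1 + 2 * x + 0.2 * rnorm(n)

fit <- lm(Y ~ x)
coef(fit)               # (alpha_hat, beta_hat)
summary(fit)$sigma^2    # sigma_hat^2, with the 1/(n - 2) normalization
vcov(fit)               # estimated covariance matrix of (alpha_hat, beta_hat);
                        # the off-diagonal term is negative here since mean(x) > 0
```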
Polynomial regression
$Y = \begin{pmatrix} 1 & x & x^2 & \dots & x^p \end{pmatrix} \begin{pmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_p \end{pmatrix} + \sigma Z$
Can also be used for exponential growth models $y_i = \exp(a x_i + b + \epsilon_i)$, to determine $\beta$ such that $E[Y] = \alpha x^\beta$, ...
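A minimal R sketch of both uses (the degree, coefficients, and noise level are arbitrary choices):

```r
set.seed(1)
x <- seq(0, 1, length.out = 100)
y <- 1 - 2 * x + 3 * x^3 + 0.1 * rnorm(100)
fit_poly <- lm(y ~ poly(x, degree = 3, raw = TRUE))   # columns x, x^2, x^3
coef(fit_poly)

z <- exp(0.8 * x + 1 + 0.05 * rnorm(100))   # exponential growth data
fit_exp <- lm(log(z) ~ x)                   # linear model on the log scale
coef(fit_exp)                               # estimates of (b, a)
```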
Model Reduction and Model Selection
Student test on a regressor
Theorem. In order to test $H_0: \theta_k = a$ versus $H_1: \theta_k \neq a$:
- estimate the variance of $\hat{\theta}_k$ by $\hat{\sigma}^2\left( \hat{\theta}_k \right) = \hat{\sigma}^2 \left\{ ({}^t X X)^{-1} \right\}_{k,k}$
- use the statistic $T = \frac{\hat{\theta}_k - a}{\hat{\sigma}\left( \hat{\theta}_k \right)} \underset{H_0}{\sim} \mathcal{T}(n-p)$
Generalization: to test $H_0: {}^t b \, \theta = a$ versus $H_1: {}^t b \, \theta \neq a$, use $T = \frac{{}^t b \, \hat{\theta} - a}{\hat{\sigma} \sqrt{{}^t b \, ({}^t X X)^{-1} b}} \underset{H_0}{\sim} \mathcal{T}(n-p)$
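In R, summary(lm) reports exactly these t-statistics for $a = 0$; for $a \neq 0$ the statistic is easy to compute by hand. A minimal sketch on simulated data (parameters are arbitrary choices):

```r
set.seed(1)
n <- 100
x <- rnorm(n)
Y <- 1 + 2 * x + 0.5 * rnorm(n)
fit <- lm(Y ~ x)

summary(fit)$coefficients    # estimate, std. error, t value, p-value (for a = 0)

a <- 2                                               # test H0: beta = 2
se <- summary(fit)$coefficients["x", "Std. Error"]
Tstat <- (coef(fit)[["x"]] - a) / se
2 * pt(-abs(Tstat), df = n - 2)                      # two-sided p-value, T(n - p)
```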
Fisher test: model vs submodel
Theorem. Let $H \subset E \subset \mathbb{R}^n$, with $\dim H = q$ and $\dim E = p$. To test $H_0: \theta \in H$ versus $H_1: \theta \in E \setminus H$, use the statistic
$F = \frac{\|Y_E - Y_H\|^2 / (p - q)}{\|Y - Y_E\|^2 / (n - p)} \underset{H_0}{\sim} F(p - q, n - p)$
and reject if $F > F^{p-q, n-p}_{1-\alpha}$.
Y = Xθ + σz 0 E Xθ H H Y E = Xˆθ E
Y = Xθ + σz 0 E Xθ H Y H = X ˆθ H H Y E = Xˆθ E
Y = Xθ + σz Y Y E 2 σ 2 χ 2 (n p) 0 E Y H Xθ 2 Y E = Xˆθ E σ 2 χ 2 (q) Y E Y H 2 σ 2 χ 2 (p q) Xθ H Y H = X ˆθ H H
SS-notations and $R^2$
For a model $M$ (relative to a matrix $X$), define:
- total variance: $SSY = \|Y - \bar{Y} \mathbf{1}_n\|^2 = SSE(\mathbf{1}_n)$
- residual variance: $SSE(M) = \|Y - X\hat{\theta}\|^2$
- explained variance: $SSR(M) = \|X\hat{\theta} - \bar{Y} \mathbf{1}_n\|^2$
Then $SSY = SSE(M) + SSR(M)$, and the quality of fit is quantified by $R^2(M) = \frac{SSR(M)}{SSY}$.
The Fisher statistic can be written:
$F = \frac{(SSE(H) - SSE(E)) / (p - q)}{SSE(E) / (n - p)} = \frac{n - \dim(E)}{\dim(E) - \dim(H)} \cdot \frac{R^2(E) - R^2(H)}{1 - R^2(E)}$
ANOVA
The model can be written: $Y_{i,k} = \theta_i + \sigma \epsilon_{i,k}$, $1 \le i \le p$, $1 \le k \le n_i$.
Let $Y_{i,\cdot} = \frac{1}{n_i} \sum_k Y_{i,k}$ and $Y_{\cdot,\cdot} = \frac{1}{n} \sum_{i,k} Y_{i,k}$.
The variance can be decomposed as: $SSY = SSR(M) + SSE(M) = \sum_i n_i \left( Y_{i,\cdot} - Y_{\cdot,\cdot} \right)^2 + \sum_{i,k} \left( Y_{i,k} - Y_{i,\cdot} \right)^2$
To test $H_0: \theta_1 = \cdots = \theta_p$ versus $H_1 = \neg H_0$, the Fisher statistic is:
$F = \frac{n - p}{p - 1} \cdot \frac{\sum_i n_i \left( Y_{i,\cdot} - Y_{\cdot,\cdot} \right)^2}{\sum_{i,k} \left( Y_{i,k} - Y_{i,\cdot} \right)^2} \sim F(p - 1, n - p)$
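A minimal R sketch of this one-way ANOVA F-test using aov (group sizes and means are arbitrary choices):

```r
set.seed(1)
g <- factor(rep(c("A", "B", "C"), times = c(20, 25, 30)))
Y <- rnorm(75, mean = c(0, 0.5, 1)[as.integer(g)])

summary(aov(Y ~ g))    # F ~ F(p - 1, n - p) under H0: theta_1 = ... = theta_p
```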
Exhaustive, Forward, Backward and Stepwise selection
- Exhaustive search: for each size $1 \le k \le p$, find the combination of directions with highest $R^2$.
- Forward selection: at each step, add the direction most correlated with $Y$; stop when the Fisher test for this direction is not rejected.
- Backward selection: start with the full model, and remove the direction with the smallest t-statistic; stop when all remaining t-statistics are significant.
- Stepwise selection: like forward selection, but after each inclusion remove all directions with insignificant F-statistic.
Note: unless specified, $\mathbf{1}_n$ is always included in the models.
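In R, the function step implements forward, backward, and stepwise selection, by default with the AIC criterion rather than the F- and t-tests above; a minimal sketch on simulated data (regressors and parameters are arbitrary choices):

```r
set.seed(1)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$Y <- 1 + 2 * d$x1 + rnorm(n)     # only x1 matters in the true model

full <- lm(Y ~ x1 + x2 + x3, data = d)
null <- lm(Y ~ 1, data = d)        # intercept-only model (1_n)

step(null, scope = formula(full), direction = "forward")
step(full, direction = "backward")
step(null, scope = formula(full), direction = "both")   # stepwise
```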
Quadratic risk: bias-variance decomposition
To simplify the discussion, we consider the model $Y = \theta + \sigma Z$, where $\theta$ is arbitrary but is to be approximated within the family of models $\mathcal{M}$.
The quadratic risk of a model $M \in \mathcal{M}$ is defined as $r(M) = E\left[ \|\theta - \hat{\theta}_M\|^2 \right]$.
It can be decomposed as: $r(M) = \|\theta - \theta_M\|^2 + \sigma^2 \dim(M)$
Risk estimation and Mallows' criterion
- Goal: choose the model $M \in \mathcal{M}$ with minimal quadratic risk $r(M)$.
- Problem: the bias $\|\theta - \theta_M\|^2$ is unknown.
- Idea: penalize the complexity $\dim(M)$.
- Mallows' criterion: choose the model $M$ minimizing $C_p(M) = SSE(M) + 2\sigma^2 \dim(M)$
Heuristic: $r(M) = \|\theta\|^2 - \|\theta_M\|^2 + \sigma^2 \dim(M)$, but $E\left[ \|\hat{\theta}_M\|^2 \right] = \|\theta_M\|^2 + \sigma^2 \dim(M)$; hence $\hat{r}(M) = \|\theta\|^2 - \left( \|\hat{\theta}_M\|^2 - \sigma^2 \dim(M) \right) + \sigma^2 \dim(M)$ has expectation $r(M)$, and minimizing $\hat{r}(M)$ over $\mathcal{M}$ is equivalent to minimizing $\hat{r}(M) - \|\theta\|^2 + \|Y\|^2 = \|Y - \hat{\theta}_M\|^2 + 2\sigma^2 \dim(M)$
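A minimal R sketch computing $C_p(M)$ by hand over a few nested models, assuming $\sigma^2$ is known (data and models are arbitrary choices):

```r
set.seed(1)
n <- 100; sigma <- 1
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
Y <- 1 + 2 * x1 + sigma * rnorm(n)

models <- list(lm(Y ~ 1), lm(Y ~ x1), lm(Y ~ x1 + x2), lm(Y ~ x1 + x2 + x3))
Cp <- sapply(models, function(m) sum(resid(m)^2) + 2 * sigma^2 * length(coef(m)))
which.min(Cp)    # typically selects the second model, Y ~ x1
```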
Y = θ + σz θ 0 M
Y = θ + σz θ θ M = Π M (θ) ˆθ M = Π M (Y ) 0 M
Y = θ + σz θ ˆθ 2 ˆθ M = Π M (Y ) θ 2 θ M = Π M (θ) 0 M θ 1 ˆθ 1
Y = θ + σz θ 2 ˆθ 2 θ Bias Risk Variance θ M = Π M (θ) ˆθ M = Π M (Y ) 0 M θ 1 ˆθ 1
Y = θ + σz θ Bias Risk 0 M Variance θ 1 ˆθ 1
Y = θ + σz θ 2 ˆθ 2 Variance Risk Bias θ 0 M
Y = θ + σz θ 2 ˆθ 2 Variance Bias Risk Bias Risk θ Bias Risk Variance θ M = Π M (θ) ˆθ M = Π M (Y ) 0 M Variance θ 1 ˆθ 1
Other Criteria
- Adjusted $R^2$: $R_a^2(M) = 1 - \frac{n-1}{n - \dim(M)} \left( 1 - R^2(M) \right) = 1 - \frac{n-1}{n - \dim(M)} \cdot \frac{SSE(M)}{SSY}$
- Bayesian Information Criterion: $BIC(M) = SSE(M) + \sigma^2 \dim(M) \log n$
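A minimal R sketch: the adjusted $R^2$ is reported by summary(lm), and the slide's BIC variant (with known $\sigma^2$; note it differs in scaling from R's built-in BIC function, which uses the log-likelihood) is easy to compute by hand:

```r
set.seed(1)
n <- 100; sigma <- 1
x <- rnorm(n)
Y <- 1 + 2 * x + sigma * rnorm(n)
fit <- lm(Y ~ x)

summary(fit)$adj.r.squared                                # adjusted R^2
sum(resid(fit)^2) + sigma^2 * length(coef(fit)) * log(n)  # BIC(M) as on the slide
```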
Application: denoising a signal
- discretized and noisy version of $f : [0, 1] \to \mathbb{R}$: $Y_k = f(k/n) + \sigma Z_k$, $0 \le k \le n-1$
- choice of an orthogonal basis of $\mathbb{R}^n$, the Fourier basis: $\Omega_n = \left\{ \left[ \sin\left( \tfrac{2\pi k l}{n} \right) \right]_{0 \le l \le n-1}, \; 1 \le k \le \tfrac{n-1}{2} \right\} \cup \left\{ \left[ \cos\left( \tfrac{2\pi k l}{n} \right) \right]_{0 \le l \le n-1}, \; 0 \le k \le \tfrac{n}{2} \right\}$
- nested models with increasing number of non-zero Fourier coefficients
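A minimal R sketch of the whole scheme using the fast Fourier transform, selecting the number of kept frequencies with Mallows' criterion ($f$, $n$, and $\sigma$ below are arbitrary choices, and $\sigma$ is assumed known):

```r
set.seed(1)
n <- 256; sigma <- 0.5
f <- function(t) sin(2 * pi * t) + 0.5 * cos(6 * pi * t)
Y <- f((0:(n - 1)) / n) + sigma * rnorm(n)

co <- fft(Y) / n                            # discrete Fourier coefficients of Y
Cp <- sapply(1:(n / 2), function(K) {       # model M_K: keep frequencies 0 .. K-1
  kept <- co
  kept[(K + 1):(n - K + 1)] <- 0            # zero out all higher frequencies
  fhat <- Re(fft(kept, inverse = TRUE))     # projection of Y onto the model
  sum((Y - fhat)^2) + 2 * sigma^2 * (2 * K - 1)   # SSE + 2 sigma^2 dim(M_K)
})
K_star <- which.min(Cp)                     # selected number of frequencies
```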
Logistic Regression
The Gaussian model obviously does not apply everywhere; think e.g. of a regression of heart disease on age.
Logistic model: $Y_i \sim \mathcal{B}\left( \mu({}^t X_i \theta) \right)$, where $\mu(\eta) = \frac{\exp(\eta)}{1 + \exp(\eta)}$ is the inverse logit function.
Maximum-likelihood estimation is possible numerically (Newton-Raphson method).
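In R, this model is fitted by glm with family = binomial, which maximizes the likelihood by iteratively reweighted least squares, a Newton-Raphson variant; a minimal sketch on simulated data (the age/disease setup mirrors the example above, with arbitrary parameters):

```r
set.seed(1)
n <- 200
age <- runif(n, 20, 80)
p <- plogis(-6 + 0.1 * age)               # inverse logit of tX_i theta
disease <- rbinom(n, size = 1, prob = p)

fit <- glm(disease ~ age, family = binomial)
summary(fit)                                             # Wald tests on theta
predict(fit, newdata = data.frame(age = 50), type = "response")
```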
Exercises
Discovery of R
- Understand and modify the source codes available on the website.
- The data frame called cars contains two arrays, cars$dist and cars$speed. It gives the speed of cars and the distances taken to stop (recorded in the 1920s). A relation $dist = A \cdot speed^B$ is expected. How can $A$ and $B$ be estimated? Test whether $B = 0$, and then whether $B = 1$. A possible starting point is sketched below.
- Find out how logistic regression can be done with R. Illustrate it on some data of your choice.
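A possible starting point for the cars question (one approach among several, not the required solution): taking logarithms turns the expected relation into a linear model.

```r
data(cars)
fit <- lm(log(dist) ~ log(speed), data = cars)   # log(dist) = log(A) + B log(speed)
summary(fit)   # the reported t-test on log(speed) addresses H0: B = 0
# For H0: B = 1, use T = (B_hat - 1) / se(B_hat), compared to T(n - 2):
se <- summary(fit)$coefficients["log(speed)", "Std. Error"]
Tstat <- (coef(fit)[["log(speed)"]] - 1) / se
2 * pt(-abs(Tstat), df = nrow(cars) - 2)
```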
Simple Exercises
- Show that if $X_1, \dots, X_n$ is an $\mathcal{N}(0,1)$ sample, then $\bar{X}_n \sim \mathcal{N}(0, 1/n)$ and $\sum_{i=1}^n \left( X_i - \bar{X}_n \right)^2 \sim \chi^2(n-1)$.
- Re-derive the formulas giving $\hat{\alpha}$ and $\hat{\beta}$ in the simple regression model by analytic minimization of the total squared error $\sum_{i=1}^n (y_i - \alpha - \beta x_i)^2$.
- Compute the squared prediction error $E\left[ \left( \hat{y} - \alpha - \beta x \right)^2 \right]$ for a new observation at point $x$ in the simple regression model.
- Same exercises for the general Gaussian linear model.
Exercise: weighing methods
A two-plate weighing machine is called unbiased with precision $\sigma^2$ if an object of true weight $m$ on the left plate is balanced by a random weight $y = m + \sigma \epsilon$ on the right plate, where $\epsilon$ is a centered standard normal variable. Mister M. has three objects of masses $a$, $b$ and $c$ to weigh with such a machine, and he is allowed to perform three measurements. He thinks of three possibilities:
- weighing each object separately: $(a|\cdot)$, $(b|\cdot)$, $(c|\cdot)$;
- weighing the objects two at a time: $(ab|\cdot)$, $(ac|\cdot)$ and $(bc|\cdot)$;
- putting each object once alone on the right plate and twice with another object on the left plate: $(ab|c)$, $(ac|b)$, $(bc|a)$.
What would you advise him?
More precisely: compute the variance of each individual weight estimate for each possibility and give a first conclusion. Does it still hold if one is interested in linear combinations of the weights?
Exercise: multi-intercept regression
Botanists want to quantify the average difference in height between the trees of two forests A and B. In their model, the height of a tree is the sum of three terms:
- a term $q$ depending on the quality of the ground, which is assumed to be constant within each forest: $q_A$ for the trees of forest A, $q_B$ for the trees of forest B;
- an unknown biological constant times the quantity of humus around the tree;
- a random term proper to each tree.
Precisely, they want to estimate the difference $D = q_A - q_B$. For their study, they have collected the heights of $n_A$ trees in forest A and $n_B$ trees in forest B, as well as the quantities $(h^A_j)_{1 \le j \le n_A}$ and $(h^B_j)_{1 \le j \le n_B}$ of humus at the base of all those trees. Tell them how to do it.