Advanced Statistics I : Gaussian Linear Model (and beyond)

Aurélien Garivier, CNRS / Telecom ParisTech / Centrale

Outline
- One and Two-Sample Statistics
- Linear Gaussian Model
- Model Reduction and Model Selection
- Exercises

Discrete and continuous distributions
- Discrete distribution: $P = \sum_{i=1}^n p_i \delta_{x_i}$. Ex: Binomial, Poisson distributions.
- Continuous distribution: $Q(dx) = f(x)\,dx$. Ex: Exponential distribution.
- A distribution can be neither purely discrete nor purely continuous! Ex: $Z = \min\{X, 1\}$, where $X \sim \mathcal{E}(\lambda)$.

Descriptive properties
- Expectation: $\mu = E[X]$
- Variance: $\sigma^2 = E[(X-\mu)^2] = E[X^2] - \mu^2$
- Skewness: $\gamma = E[(X-\mu)^3] / \sigma^3$
- Kurtosis: $\kappa = E[(X-\mu)^4] / \sigma^4 - 3$

Some remarkable distributions
- Normal: scale- and shift-stable family.
- Chi-squared: if $X_1,\dots,X_n$ is an $N(0,1)$-sample, then $Z = X_1^2 + \dots + X_n^2 \sim \chi^2(n)$.
- Student: if $X \sim N(0,1)$ is independent of $Z \sim \chi^2(n)$, then $T = X / \sqrt{Z/n} \sim T(n)$.
- Fisher: if $X \sim \chi^2(n)$ is independent of $Y \sim \chi^2(m)$, then $F = (X/n) / (Y/m) \sim F(n,m)$.

Empirical distribution and statistics
Let $X_1,\dots,X_n$ be a $P$-sample.
- Empirical mean: $\bar X_n = \frac{1}{n} \sum_{i=1}^n X_i$
- Empirical variance: $\Sigma_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar X_n)^2 = \frac{1}{n} \sum_{i=1}^n X_i^2 - (\bar X_n)^2$
- Empirical distribution: $P_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}$
- Unbiased variance estimator: $\hat\sigma_n^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X_n)^2 = \frac{1}{n-1} \left( \sum_{i=1}^n X_i^2 - n (\bar X_n)^2 \right)$
If $P = N(0,\sigma^2)$, using Cochran's Theorem we get $\bar X_n \sim N(0, \sigma^2/n)$ and $\sum_{i=1}^n (X_i - \bar X_n)^2 / \sigma^2 \sim \chi^2(n-1)$.
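These estimators are one line each in R; a minimal sketch on simulated data (the sample size and σ are illustrative):

```r
set.seed(1)
n <- 1000
x <- rnorm(n, mean = 0, sd = 2)          # P-sample from N(0, sigma^2) with sigma = 2

xbar   <- mean(x)                         # empirical mean
sig2n  <- mean((x - xbar)^2)              # empirical variance Sigma_n^2 (divides by n)
sighat <- sum((x - xbar)^2) / (n - 1)     # unbiased estimator (divides by n-1)
c(sig2n, sighat, var(x))                  # note: R's var() is the unbiased version
```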

Convergence properties
Theorem (LLN). If $E[|X_i|] < \infty$, then $\bar X_n \to \mu$ (in probability, almost surely).
- Application to $\Sigma_n^2$, $\hat\sigma_n^2$, etc.
- Convergence of the empirical distribution $P_n \to P$ under appropriate hypotheses.

Central Limit Theorem
Theorem (CLT). If $E[X_i^2] < \infty$, then $\frac{\sqrt{n}\,(\bar X_n - \mu)}{\sigma} \to N(0,1)$.
By Slutsky's Lemma, $\frac{\sqrt{n}\,(\bar X_n - \mu)}{\hat\sigma_n} \to N(0,1)$.
Student statistic: if $X_i \sim N(\mu, \sigma^2)$, then $\frac{\sqrt{n}\,(\bar X_n - \mu)}{\hat\sigma_n} \sim T(n-1)$.

Confidence interval for the mean
- If $\sigma$ is known: $I_\alpha(\mu) = \left[ \bar X_n \pm \frac{\sigma\, \phi_{1-\alpha/2}}{\sqrt{n}} \right]$
- If $\sigma$ is unknown: $I_\alpha(\mu) = \left[ \bar X_n \pm \frac{\hat\sigma_n\, t^{n-1}_{1-\alpha/2}}{\sqrt{n}} \right]$
where $\phi_{1-\alpha/2}$ and $t^{n-1}_{1-\alpha/2}$ denote the $(1-\alpha/2)$-quantiles of the $N(0,1)$ and $T(n-1)$ distributions.
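The unknown-σ interval is exactly what R's t.test reports; a minimal sketch (data simulated, parameters illustrative):

```r
x <- rnorm(50, mean = 3, sd = 2)
n <- length(x); alpha <- 0.05

# sigma unknown: Student quantile with n-1 degrees of freedom
mean(x) + c(-1, 1) * sd(x) * qt(1 - alpha/2, df = n - 1) / sqrt(n)
t.test(x, conf.level = 1 - alpha)$conf.int   # same interval

# sigma known (here sigma = 2): Gaussian quantile instead
mean(x) + c(-1, 1) * 2 * qnorm(1 - alpha/2) / sqrt(n)
```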

Confidence interval for the variance
- If $\mu$ is known: since $\sum_{i=1}^n \left( \frac{X_i - \mu}{\sigma} \right)^2 \sim \chi^2(n)$,
$I_\alpha(\sigma^2) = \left[ \frac{\sum_{i=1}^n (X_i - \mu)^2}{\chi^n_{1-\alpha/2}}, \frac{\sum_{i=1}^n (X_i - \mu)^2}{\chi^n_{\alpha/2}} \right]$
- If $\mu$ is unknown: since $\sum_{i=1}^n \left( \frac{X_i - \bar X_n}{\sigma} \right)^2 \sim \chi^2(n-1)$,
$I_\alpha(\sigma^2) = \left[ \frac{\hat\sigma_n^2}{\chi^{n-1}_{1-\alpha/2} / (n-1)}, \frac{\hat\sigma_n^2}{\chi^{n-1}_{\alpha/2} / (n-1)} \right]$
where $\chi^n_\beta$ denotes the $\beta$-quantile of the $\chi^2(n)$ distribution.

Comparison of two variances
Let $X_{1,1},\dots,X_{1,n_1}$ be a sample of $N(\mu_1, \sigma_1^2)$, and $X_{2,1},\dots,X_{2,n_2}$ an independent sample of $N(\mu_2, \sigma_2^2)$. In order to test $H_0: \sigma_1 = \sigma_2$ versus $H_1: \sigma_1 \neq \sigma_2$, use the statistic
$F = \frac{\hat\sigma_1^2}{\hat\sigma_2^2} \overset{H_0}{\sim} F(n_1 - 1, n_2 - 1)$

Theorem (Comparison of two means). Let $X_{1,1},\dots,X_{1,n_1}$ be a sample of $N(\mu_1, \sigma^2)$, and $X_{2,1},\dots,X_{2,n_2}$ an independent sample of $N(\mu_2, \sigma^2)$. To estimate the common variance, use
$\hat\sigma_{12}^2 = \frac{1}{n_1 + n_2 - 2} \left( \sum_{i=1}^{n_1} (X_{1,i} - \bar X_1)^2 + \sum_{i=1}^{n_2} (X_{2,i} - \bar X_2)^2 \right), \quad \frac{(n_1 + n_2 - 2)\, \hat\sigma_{12}^2}{\sigma^2} \sim \chi^2(n_1 + n_2 - 2)$
In order to test $H_0: \mu_1 = \mu_2$ versus $H_1: \mu_1 \neq \mu_2$, use the statistic
$T = \sqrt{\frac{n_1 n_2}{n_1 + n_2}}\, \frac{\bar X_1 - \bar X_2}{\hat\sigma_{12}} \overset{H_0}{\sim} T(n_1 + n_2 - 2)$
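Both tests are built into R; a minimal sketch on simulated data (sample sizes and parameters illustrative):

```r
x1 <- rnorm(30, mean = 5, sd = 1.5)
x2 <- rnorm(40, mean = 5, sd = 1.5)

var.test(x1, x2)                   # F-test of sigma_1 = sigma_2, F(n1-1, n2-1)
t.test(x1, x2, var.equal = TRUE)   # two-sample t-test with pooled variance, T(n1+n2-2)
```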

Outline
- One and Two-Sample Statistics
- Linear Gaussian Model
- Model Reduction and Model Selection
- Exercises

Generic formulation
$Y_i = \alpha_1 x_i^1 + \dots + \alpha_p x_i^p + \sigma Z_i, \quad Z_i \sim N(0,1)$
Matrix form: $Y = X\theta + \sigma Z$, $Z \sim N(0_n, I_n)$
Ex: ANOVA, regression, change-point detection in time series.

Theorem (Cochran's Theorem). Let $X = (X_1,\dots,X_n)$ be a standard centered normal sample, and let $E_1,\dots,E_p$ be a decomposition of $\mathbb{R}^n$ into pairwise orthogonal subspaces of dimensions $\dim E_j = d_j$. For $1 \le j \le p$, let $v^j_1,\dots,v^j_{d_j}$ be an orthonormal basis of $E_j$. Then:
- the components of $X$ in the basis $(v^1_1,\dots,v^p_{d_p})$ form another standard centered normal sample;
- the random vectors $X_{E_1},\dots,X_{E_p}$ obtained by projecting $X$ on $E_1,\dots,E_p$ are independent;
- so are $\|X_{E_1}\|,\dots,\|X_{E_p}\|$, and they satisfy $\|X_{E_i}\|^2 \sim \chi^2(d_i)$.

Theorem (Generic solution). The maximum-likelihood estimator and the least-squares estimator coincide and are given by:
$\hat\theta = (X^\top X)^{-1} X^\top Y \sim N\!\left(\theta, \sigma^2 (X^\top X)^{-1}\right)$
The variance $\sigma^2$ is estimated (without bias) by:
$\hat\sigma^2 = \frac{\|Y - X\hat\theta\|^2}{n - p}, \quad \frac{(n-p)\, \hat\sigma^2}{\sigma^2} \sim \chi^2(n-p)$
$\hat\theta$ and $\hat\sigma^2$ are independent.
Incremental Gram-Schmidt procedure.
Theorem (Gauss-Markov). $\hat\theta$ has minimal variance among all linear unbiased estimators of $\theta$.
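The closed form can be checked directly against R's lm; a minimal sketch on simulated data (design, θ and σ illustrative):

```r
set.seed(1)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n))   # design with an intercept column
theta <- c(1, 2, -1)
y <- X %*% theta + 0.5 * rnorm(n)

theta_hat <- solve(t(X) %*% X, t(X) %*% y)     # (X'X)^{-1} X'Y
sig2_hat  <- sum((y - X %*% theta_hat)^2) / (n - p)

fit <- lm(y ~ X - 1)                           # same estimates via lm (no extra intercept)
cbind(theta_hat, coef(fit))
c(sig2_hat, summary(fit)$sigma^2)
```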

Y = Xθ + σz Xθ M 0 M

Y = Xθ + σz Xθ M Y M = Xˆθ M 0 M

Y = Xθ + σz Y Xθ 2 σ 2 χ 2 (n) Y Y M 2 σ 2 χ 2 (n p) Y M = Xˆθ M Y M Xθ 2 σ 2 χ 2 (p) Xθ 0 M

Theorem (Simple regression: $Y_i = \alpha + \beta x_i + \sigma Z_i$). The ML-estimators are given by:
$\hat\alpha = \bar Y - \hat\beta \bar x \sim N\!\left(\alpha, \frac{\sigma^2 E[x^2]}{n \operatorname{Var}(x)}\right), \quad \hat\beta = \frac{\operatorname{Cov}(x, Y)}{\operatorname{Var}(x)} \sim N\!\left(\beta, \frac{\sigma^2}{n \operatorname{Var}(x)}\right)$
(here $E[x^2]$, $\operatorname{Var}(x)$ and $\operatorname{Cov}(x,Y)$ denote empirical moments of the design). They are correlated:
$\operatorname{Cov}(\hat\alpha, \hat\beta) = -\frac{\sigma^2 \bar x}{n \operatorname{Var}(x)}$
The variance can be estimated by:
$\hat\sigma_n^2 = \frac{1}{n-2} \sum (Y_i - \hat\alpha - \hat\beta x_i)^2, \quad \frac{(n-2)\, \hat\sigma_n^2}{\sigma^2} \sim \chi^2(n-2)$
Smart reparameterization: $Y_i = \delta + \beta (x_i - \bar x) + \sigma Z_i$.
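The closed forms are easy to reproduce in R (a minimal sketch; data simulated):

```r
set.seed(2)
x <- runif(50); y <- 1 + 2 * x + 0.3 * rnorm(50)

beta_hat  <- cov(x, y) / var(x)     # the 1/(n-1) conventions cancel in the ratio
alpha_hat <- mean(y) - beta_hat * mean(x)

coef(lm(y ~ x))                     # same values
c(alpha_hat, beta_hat)
```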

Polynomial regression
$Y = \begin{pmatrix} 1 & x & x^2 & \cdots & x^p \end{pmatrix} \begin{pmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_p \end{pmatrix} + \sigma Z$
After a log transformation, the same machinery can also be used for exponential growth models $y_i = \exp(a x_i + b + \epsilon_i)$, or to determine $\beta$ such that $E[Y] = \alpha x^\beta$, ...
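In R, polynomial designs are conveniently built with poly; a minimal sketch on simulated data:

```r
x <- seq(0, 1, length.out = 100)
y <- 1 + x - 3 * x^2 + 0.1 * rnorm(100)

fit <- lm(y ~ poly(x, degree = 2, raw = TRUE))   # columns x, x^2 plus the intercept
coef(fit)

# log-log fit for E[Y] = alpha * x^beta (requires positive x and y):
# lm(log(y) ~ log(x))
```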

Outline
- One and Two-Sample Statistics
- Linear Gaussian Model
- Model Reduction and Model Selection
- Exercises

Student test on a regressor
Theorem. In order to test $H_0: \theta_k = a$ versus $H_1: \theta_k \neq a$, estimate the variance of $\hat\theta_k$ by
$\hat\sigma^2(\hat\theta_k) = \hat\sigma^2 \left\{ (X^\top X)^{-1} \right\}_{k,k}$
and use the statistic
$T = \frac{\hat\theta_k - a}{\hat\sigma(\hat\theta_k)} \overset{H_0}{\sim} T(n-p)$
Generalization: to test $H_0: b^\top \theta = a$ versus $H_1: b^\top \theta \neq a$, use
$T = \frac{b^\top \hat\theta - a}{\hat\sigma \sqrt{b^\top (X^\top X)^{-1} b}} \overset{H_0}{\sim} T(n-p)$
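For $a = 0$, this is exactly the t value column printed by summary(lm); a minimal self-contained sketch:

```r
set.seed(3)
x <- rnorm(40); y <- 2 + 0 * x + rnorm(40)   # true slope is 0
fit <- lm(y ~ x)
summary(fit)$coefficients                    # t value = estimate / std. error, null a = 0
```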

Fisher test: model vs. submodel
Theorem. Let $H \subset E \subset \mathbb{R}^n$, $\dim H = q$, $\dim E = p$. To test $H_0: \theta \in H$ versus $H_1: \theta \in E \setminus H$, use the statistic
$F = \frac{\|Y_E - Y_H\|^2 / (p-q)}{\|Y - Y_E\|^2 / (n-p)} \overset{H_0}{\sim} F(p-q, n-p)$
Reject if $F > F^{p-q,\,n-p}_{1-\alpha}$.
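In R, this is the F-test reported by anova applied to two nested fits; a minimal sketch on simulated data:

```r
set.seed(4)
x1 <- rnorm(60); x2 <- rnorm(60)
y <- 1 + 2 * x1 + rnorm(60)    # x2 is irrelevant, so H0 holds

small <- lm(y ~ x1)            # submodel H
big   <- lm(y ~ x1 + x2)       # model E
anova(small, big)              # F ~ F(p-q, n-p) under H0
```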

Y = Xθ + σz 0 E Xθ H H Y E = Xˆθ E

Y = Xθ + σz 0 E Xθ H Y H = X ˆθ H H Y E = Xˆθ E

Y = Xθ + σz Y Y E 2 σ 2 χ 2 (n p) 0 E Y H Xθ 2 Y E = Xˆθ E σ 2 χ 2 (q) Y E Y H 2 σ 2 χ 2 (p q) Xθ H Y H = X ˆθ H H

Sum-of-squares notations and $R^2$
For a model $M$ (relative to a matrix $X$), define:
- total variance: $SSY = \|Y - \bar Y 1_n\|^2 = SSE(1_n)$
- residual variance: $SSE(M) = \|Y - X\hat\theta\|^2$
- explained variance: $SSR(M) = \|X\hat\theta - \bar Y 1_n\|^2$
Then $SSY = SSE(M) + SSR(M)$. The quality of fit is quantified by
$R^2(M) = \frac{SSR(M)}{SSY}$
The Fisher statistic can be written:
$F = \frac{(SSE(H) - SSE(E)) / (p-q)}{SSE(E) / (n-p)} = \frac{n - \dim(E)}{\dim(E) - \dim(H)} \cdot \frac{R^2(E) - R^2(H)}{1 - R^2(E)}$

ANOVA
The model can be written:
$Y_{i,k} = \theta_i + \sigma \epsilon_{i,k}, \quad 1 \le i \le p, \ 1 \le k \le n_i$
Let $\bar Y_{i,\cdot} = \frac{1}{n_i} \sum_k Y_{i,k}$ and $\bar Y_{\cdot,\cdot} = \frac{1}{n} \sum_{i,k} Y_{i,k}$. The variance can be decomposed as:
$SSY = SSR(M) + SSE(M) = \sum_i n_i (\bar Y_{i,\cdot} - \bar Y_{\cdot,\cdot})^2 + \sum_{i,k} (Y_{i,k} - \bar Y_{i,\cdot})^2$
To test $H_0: \theta_1 = \dots = \theta_p$ versus $H_1 = \neg H_0$, the Fisher statistic is:
$F = \frac{n-p}{p-1} \cdot \frac{\sum_i n_i (\bar Y_{i,\cdot} - \bar Y_{\cdot,\cdot})^2}{\sum_{i,k} (Y_{i,k} - \bar Y_{i,\cdot})^2} \sim F(p-1, n-p)$
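One-way ANOVA in R; a minimal sketch with groups simulated under a common mean, so $H_0$ holds:

```r
set.seed(5)
g <- factor(rep(1:3, times = c(20, 25, 30)))   # p = 3 groups, unbalanced sizes
y <- rnorm(75, mean = 10)                      # common mean: H0 true

anova(lm(y ~ g))                               # F ~ F(p-1, n-p) under H0
```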

Exhaustive, Forward, Backward and Stepwise selection
- Exhaustive search: for each size $1 \le k \le p$, find the combination of directions with highest $R^2$.
- Forward selection: at each step, add the direction most correlated with $Y$; stop when the Fisher test for this direction does not reject.
- Backward selection: start with the full model, and remove the direction with smallest t-statistic; stop when all remaining t-statistics are significant.
- Stepwise selection: like forward selection, but after each inclusion remove all directions with non-significant F-statistic.
Note: unless specified otherwise, $1_n$ is always included in the models.
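R's step function implements such a stepwise search, though with an AIC criterion rather than the F/t stopping rules above; a minimal sketch on simulated data:

```r
set.seed(6)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 + rnorm(n)                 # only x1 matters

full <- lm(y ~ x1 + x2 + x3, data = d)
step(lm(y ~ 1, data = d), scope = formula(full), direction = "both")  # AIC-based stepwise
```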

Quadratic risk: bias-variance decomposition
To simplify the discussion, we consider the model $Y = \theta + \sigma Z$, where $\theta$ is arbitrary and is to be approximated within a family of models $\mathcal{M}$. The quadratic risk of a model $M \in \mathcal{M}$ is defined as
$r(M) = E\left[ \|\theta - \hat\theta_M\|^2 \right]$
It can be decomposed as:
$r(M) = \|\theta - \theta_M\|^2 + \sigma^2 \dim(M)$

Risk estimation and Mallows' criterion
Goal: choose the model $M \in \mathcal{M}$ with minimal quadratic risk $r(M)$. Problem: the bias $\|\theta - \theta_M\|^2$ is unknown. Idea: penalize the complexity $\dim(M)$.
Mallows' criterion: choose the model $M$ minimizing
$C_p(M) = SSE(M) + 2 \sigma^2 \dim(M)$
Heuristic: $r(M) = \|\theta\|^2 - \|\theta_M\|^2 + \sigma^2 \dim(M)$, but $E\left[ \|\hat\theta_M\|^2 \right] = \|\theta_M\|^2 + \sigma^2 \dim(M)$, hence
$\tilde r(M) = \|\theta\|^2 - \left( \|\hat\theta_M\|^2 - \sigma^2 \dim(M) \right) + \sigma^2 \dim(M)$
has expectation $r(M)$, and minimizing $\tilde r(M)$ over $\mathcal{M}$ is equivalent to minimizing
$\tilde r(M) - \|\theta\|^2 + \|Y\|^2 = \|Y - \hat\theta_M\|^2 + 2 \sigma^2 \dim(M)$
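A minimal sketch computing $C_p$ over nested coordinate models (the sparse signal and known σ are illustrative):

```r
set.seed(7)
n <- 50; sigma <- 1
theta <- c(3, -2, 1, rep(0, n - 3))            # sparse mean vector
y <- theta + sigma * rnorm(n)

# nested models M_k: keep the first k coordinates, set the rest to 0
cp <- sapply(1:10, function(k) {
  theta_hat <- c(y[1:k], rep(0, n - k))
  sum((y - theta_hat)^2) + 2 * sigma^2 * k     # SSE(M) + 2 sigma^2 dim(M)
})
which.min(cp)                                  # typically selects k = 3
```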

Y = θ + σz θ 0 M

Y = θ + σz θ θ M = Π M (θ) ˆθ M = Π M (Y ) 0 M

Y = θ + σz θ ˆθ 2 ˆθ M = Π M (Y ) θ 2 θ M = Π M (θ) 0 M θ 1 ˆθ 1

Y = θ + σz θ 2 ˆθ 2 θ Bias Risk Variance θ M = Π M (θ) ˆθ M = Π M (Y ) 0 M θ 1 ˆθ 1

Y = θ + σz θ Bias Risk 0 M Variance θ 1 ˆθ 1

Y = θ + σz θ 2 ˆθ 2 Variance Risk Bias θ 0 M

Y = θ + σz θ 2 ˆθ 2 Variance Bias Risk Bias Risk θ Bias Risk Variance θ M = Π M (θ) ˆθ M = Π M (Y ) 0 M Variance θ 1 ˆθ 1

Other criteria
- Adjusted $R^2$: $R^2_a(M) = 1 - \frac{n-1}{n - \dim(M)} \left( 1 - R^2(M) \right) = 1 - \frac{n-1}{n - \dim(M)} \cdot \frac{SSE(M)}{SSY}$
- Bayesian Information Criterion: $BIC(M) = SSE(M) + \sigma^2 \dim(M) \log n$

Application: denoising a signal
- Discretized and noisy version of $f : [0,1] \to \mathbb{R}$: $Y_k = f(k/n) + \sigma Z_k$, $0 \le k \le n-1$.
- Choice of an orthogonal basis of $\mathbb{R}^n$: the Fourier basis
$\Omega_n = \left\{ \sqrt{\tfrac{2}{n}} \left[ \sin\!\left( \tfrac{2\pi k l}{n} \right) \right]_{0 \le l \le n-1}, \ \sqrt{\tfrac{2}{n}} \left[ \cos\!\left( \tfrac{2\pi k l}{n} \right) \right]_{0 \le l \le n-1} : k \ge 1 \right\}$
- Nested models with an increasing number of non-zero Fourier coefficients.
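A minimal R sketch of the scheme, with an illustrative signal, noise level, and range of model sizes, selecting the number of frequencies by Mallows' $C_p$ with σ assumed known:

```r
set.seed(8)
n <- 128; sigma <- 0.3
t <- (0:(n - 1)) / n
y <- sin(2 * pi * t) + 0.5 * cos(6 * pi * t) + sigma * rnorm(n)

# design with K frequencies: intercept, then sin/cos pairs
design <- function(K) cbind(1, do.call(cbind,
  lapply(1:K, function(k) cbind(sin(2 * pi * k * t), cos(2 * pi * k * t)))))

cp <- sapply(1:15, function(K) {
  fit <- lm.fit(design(K), y)
  sum(fit$residuals^2) + 2 * sigma^2 * (2 * K + 1)   # SSE + 2 sigma^2 dim(M)
})
which.min(cp)   # a small K suffices: the signal has only two frequencies
```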

Logistic regression
The Gaussian model obviously does not apply everywhere; think e.g. of regressing heart disease on age. Logistic model: $Y_i \sim \mathcal{B}\left( \mu(X_i^\top \theta) \right)$, where $\mu(\eta) = \frac{\exp(\eta)}{1 + \exp(\eta)}$ is the inverse logit function. Maximum-likelihood estimation is possible numerically (Newton-Raphson method).
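In R, the Newton-Raphson fit (iteratively reweighted least squares) is what glm performs; a minimal sketch on simulated age/disease data:

```r
set.seed(9)
age <- runif(200, 20, 80)
p <- plogis(-6 + 0.1 * age)              # inverse logit of a linear predictor
disease <- rbinom(200, size = 1, prob = p)

glm(disease ~ age, family = binomial)    # logistic regression via IRLS
```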

Outline
- One and Two-Sample Statistics
- Linear Gaussian Model
- Model Reduction and Model Selection
- Exercises

Discovery of R
Understand and modify the source codes available on the website. The data frame called cars contains two arrays, cars$dist and cars$speed. It gives the speed of cars and the distances taken to stop (recorded in the 1920s). A relation $\text{dist} = A \cdot \text{speed}^B$ is expected. How would you estimate $A$ and $B$? Test whether $B = 0$, and then whether $B = 1$. Find out how logistic regression can be done with R, and illustrate it on data of your choice.
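One possible starting point (a sketch, not the full exercise): the power law becomes linear after taking logs, $\log(\text{dist}) = \log A + B \log(\text{speed})$.

```r
data(cars)
fit <- lm(log(dist) ~ log(speed), data = cars)
summary(fit)          # t-tests of the coefficients against 0
exp(coef(fit)[1])     # estimate of A; coef(fit)[2] estimates B
# for B = 1, shift the null: t = (coef - 1) / std.error, df = n - 2
```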

Simple exercises
- Show that if $X_1,\dots,X_n$ is an $N(0,1)$-sample, then $\bar X_n \sim N(0, 1/n)$ and $\sum_{i=1}^n (X_i - \bar X_n)^2 \sim \chi^2(n-1)$.
- Re-derive the formulas giving $\hat\alpha$ and $\hat\beta$ in the simple regression model by analytic minimization of the total squared error $\sum_{i=1}^n (y_i - \alpha - \beta x_i)^2$.
- Compute the squared prediction error $E\left[ (\hat y - \alpha - \beta x)^2 \right]$ for a new observation at point $x$ in the simple regression model.
- Same exercises for the general Gaussian linear model.

Exercise: weighing methods
A two-plate weighing machine is called unbiased with precision $\sigma^2$ if an object of true weight $m$ on the left plate is balanced by a random weight $y = m + \sigma\epsilon$ on the right plate, where $\epsilon$ is a centered standard normal variable. Mister M. has three objects of masses $a$, $b$ and $c$ to weigh with such a machine, and he is allowed three measurements. He thinks of three possibilities:
- weighing each object separately: $(a)$, $(b)$, $(c)$;
- weighing the objects two at a time: $(ab)$, $(ac)$ and $(bc)$;
- putting each object once alone on one plate and twice paired with another object on the other plate: $(ab|c)$, $(ac|b)$, $(bc|a)$.
What would you advise him? More precisely: compute the variance of each individual weight estimate for each possibility and give a first conclusion. Does it still hold if one is interested in linear combinations of the weights?

Exercise: multi-intercept regression
Botanists want to quantify the average difference in height between the trees of two forests A and B. In their model, the height of a tree is the sum of three terms:
- a term $q$ depending on the quality of the ground, assumed constant within each forest: $q_A$ for the trees of forest A, $q_B$ for the trees of forest B;
- an unknown biological constant times the quantity of humus around the tree;
- a random term proper to each tree.
Precisely, they want to estimate the difference $D = q_A - q_B$. For their study, they have collected the heights of $n_A$ trees in forest A and $n_B$ trees in forest B, as well as the quantities $(h^A_j)_{1 \le j \le n_A}$ and $(h^B_j)_{1 \le j \le n_B}$ of humus at the base of all those trees. Tell them how to do it.