Statistics for Applications. Chapter 7: Regression


Heuristics of the linear regression (1)
Consider a cloud of i.i.d. random points (X_i, Y_i), i = 1, ..., n.

Heuristics of the linear regression (2)
- Idea: fit the best line to the data.
- Approximation: Y_i ≈ a + b X_i, i = 1, ..., n, for some (unknown) a, b ∈ ℝ.
- Find estimators â, b̂ that approach a and b.
- More generally: Y_i ∈ ℝ, X_i ∈ ℝ^d, Y_i ≈ a + X_i^T b, a ∈ ℝ, b ∈ ℝ^d.
- Goal: write a rigorous model and estimate a and b.

Heuristics of the linear regression (3)
Examples:
- Economics: demand and price, D_i ≈ a + b p_i, i = 1, ..., n.
- Ideal gas law: PV = nRT, so log P_i ≈ a + b log V_i + c log T_i, i = 1, ..., n.

Linear regression of a r.v. Y on a r.v. X (1)
Let X and Y be two real random variables (not necessarily independent) with finite second moments and Var(X) ≠ 0. The theoretical linear regression of Y on X is the best approximation in quadratic mean of Y by an affine function of X, i.e., the r.v. a + bX, where a and b are the two real numbers minimizing E[(Y − a − bX)²]. By some simple algebra:
- b = Cov(X, Y) / Var(X),
- a = E[Y] − b E[X] = E[Y] − (Cov(X, Y) / Var(X)) E[X].
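As an aside (not in the original slides), these population formulas suggest the obvious plug-in estimator. A minimal numpy sketch, with invented data, seed, and parameter values:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    X = rng.normal(size=n)
    Y = 2.0 + 3.0 * X + rng.normal(size=n)      # true a = 2, b = 3

    # Plug sample moments into b = Cov(X, Y) / Var(X), a = E[Y] - b E[X]
    b_hat = np.cov(X, Y, bias=True)[0, 1] / np.var(X)
    a_hat = Y.mean() - b_hat * X.mean()
    print(a_hat, b_hat)                         # close to 2 and 3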

Linear regression of a r.v. Y on a r.v. X (2)
If ε = Y − (a + bX), then Y = a + bX + ε, with E[ε] = 0 and Cov(X, ε) = 0.
Conversely, assume that Y = a + bX + ε for some a, b ∈ ℝ and some centered r.v. ε that satisfies Cov(X, ε) = 0 (e.g., if X ⊥ ε or if E[ε | X] = 0, then Cov(X, ε) = 0). Then a + bX is the theoretical linear regression of Y on X.

Linear regression of a r.v. Y on a r.v. X (3)
A sample of n i.i.d. random pairs (X_1, Y_1), ..., (X_n, Y_n) with the same distribution as (X, Y) is available. We want to estimate a and b.

Linear regression of a r.v. Y on a r.v. X (4)
Definition: The least squared error (LSE) estimator of (a, b) is the minimizer of the sum of squared errors:
Σ_{i=1}^n (Y_i − a − b X_i)².
(â, b̂) is given by
- b̂ = ( (1/n) Σ_i X_i Y_i − X̄ Ȳ ) / ( (1/n) Σ_i X_i² − X̄² ),
- â = Ȳ − b̂ X̄,
where X̄ = (1/n) Σ_i X_i and Ȳ = (1/n) Σ_i Y_i.
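A minimal numpy check of this closed form (not from the slides; data and seed are invented), cross-checked against numpy's own least squares fit:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 50
    X = rng.uniform(size=n)
    Y = 1.0 - 2.0 * X + 0.1 * rng.normal(size=n)

    # Closed-form LSE from the slide
    b_hat = ((X * Y).mean() - X.mean() * Y.mean()) / ((X**2).mean() - X.mean()**2)
    a_hat = Y.mean() - b_hat * X.mean()

    # Cross-check: np.polyfit returns [slope, intercept] for deg=1
    b_np, a_np = np.polyfit(X, Y, deg=1)
    assert np.allclose([a_hat, b_hat], [a_np, b_np])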

Linear regression of a r.v. Y on a r.v. X (5)
[figure]

Multivariate case (1)
Y_i = X_i^T β + ε_i, i = 1, ..., n.
- Vector of explanatory variables, or covariates: X_i ∈ ℝ^p (w.l.o.g., assume its first coordinate is 1).
- Dependent variable: Y_i.
- β = (a, b^T)^T; β_1 (= a) is called the intercept.
- {ε_i}_{i=1,...,n}: noise terms satisfying Cov(X_i, ε_i) = 0.
Definition: The least squared error (LSE) estimator of β is the minimizer of the sum of squared errors:
β̂ = argmin_{t ∈ ℝ^p} Σ_{i=1}^n (Y_i − X_i^T t)².

Multivariate case (2)
LSE in matrix form:
- Let Y = (Y_1, ..., Y_n)^T ∈ ℝ^n.
- Let X be the n × p matrix whose rows are X_1^T, ..., X_n^T (X is called the design matrix).
- Let ε = (ε_1, ..., ε_n)^T ∈ ℝ^n (unobserved noise), so that Y = Xβ + ε.
- The LSE β̂ satisfies β̂ = argmin_{t ∈ ℝ^p} ‖Y − Xt‖₂².

Multivariate case (3)
Assume that rank(X) = p. Analytic computation of the LSE:
β̂ = (X^T X)^{−1} X^T Y.
Geometric interpretation of the LSE: Xβ̂ is the orthogonal projection of Y onto the subspace spanned by the columns of X:
Xβ̂ = P Y, where P = X (X^T X)^{−1} X^T.
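A minimal numpy sketch of both formulas (not from the slides; data and seed are invented): the normal-equations solution and the projection P Y agree with numpy's least squares routine.

    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 100, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # first covariate = 1
    Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

    # Normal equations (valid when rank(X) = p)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

    # Numerically preferable equivalent
    beta_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
    assert np.allclose(beta_hat, beta_ls)

    # Orthogonal projection: X beta_hat = P Y with P = X (X^T X)^{-1} X^T
    P = X @ np.linalg.solve(X.T @ X, X.T)
    assert np.allclose(X @ beta_hat, P @ Y)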

Linear regression with deterministic design and Gaussian noise (1)
Assumptions:
- The design matrix X is deterministic and rank(X) = p.
- The model is homoscedastic: ε_1, ..., ε_n are i.i.d.
- The noise vector ε is Gaussian: ε ~ N_n(0, σ² I_n), for some known or unknown σ² > 0.

Linear regression with deterministic design and Gaussian noise (2)
- LSE = MLE: β̂ ~ N_p(β, σ² (X^T X)^{−1}).
- Quadratic risk of β̂: E[‖β̂ − β‖₂²] = σ² tr((X^T X)^{−1}).
- Prediction error: E[‖Y − Xβ̂‖₂²] = σ² (n − p).
- Unbiased estimator of σ²: σ̂² = ‖Y − Xβ̂‖₂² / (n − p).
Theorem:
- (n − p) σ̂² / σ² ~ χ²_{n−p};
- β̂ ⊥ σ̂².
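A short simulation (not from the slides; data, seed, and σ are invented) of the unbiased variance estimator σ̂² = ‖Y − Xβ̂‖₂² / (n − p):

    import numpy as np

    rng = np.random.default_rng(3)
    n, p, sigma = 200, 4, 0.5
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    Y = X @ np.arange(1.0, p + 1) + sigma * rng.normal(size=n)

    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)    # unbiased for sigma^2 = 0.25
    print(sigma2_hat)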

Significance tests (1)
Test whether the j-th explanatory variable is significant in the linear regression (1 ≤ j ≤ p):
H_0: β_j = 0 vs. H_1: β_j ≠ 0.
Let γ_j be the j-th diagonal coefficient of (X^T X)^{−1} (γ_j > 0). Then
(β̂_j − β_j) / sqrt(σ̂² γ_j) ~ t_{n−p}.
Let T_n^{(j)} = β̂_j / sqrt(σ̂² γ_j). Test with non-asymptotic level α ∈ (0, 1):
δ_α^{(j)} = 1{|T_n^{(j)}| > q_{α/2}(t_{n−p})},
where q_{α/2}(t_{n−p}) is the (1 − α/2)-quantile of t_{n−p}.
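A sketch of this t-test in Python (not from the slides; it assumes scipy is available and uses invented data with the second coefficient equal to 0):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    n, p = 100, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    Y = X @ np.array([1.0, 0.0, 2.0]) + rng.normal(size=n)  # second coefficient is 0

    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    sigma2_hat = np.sum((Y - X @ beta_hat) ** 2) / (n - p)

    alpha, j = 0.05, 1                      # H0: beta_j = 0 (0-based index)
    T = beta_hat[j] / np.sqrt(sigma2_hat * XtX_inv[j, j])
    reject = abs(T) > stats.t.ppf(1 - alpha / 2, df=n - p)  # usually False here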

Significance tests (2)
Test whether a group of explanatory variables is significant in the linear regression:
H_0: β_j = 0 for all j ∈ S vs. H_1: β_j ≠ 0 for some j ∈ S,
where S ⊆ {1, ..., p}.
Bonferroni's test: δ_α^B = max_{j ∈ S} δ_{α/k}^{(j)}, where k = |S|.
δ_α^B has non-asymptotic level at most α.

More tests (1)
Let G be a k × p matrix with rank(G) = k (k ≤ p) and λ ∈ ℝ^k. Consider the hypotheses:
H_0: Gβ = λ vs. H_1: Gβ ≠ λ.
The setup of the previous slide is a particular case. If H_0 is true, then:
Gβ̂ − λ ~ N_k(0, σ² G (X^T X)^{−1} G^T),
and
(1/σ²) (Gβ̂ − λ)^T (G (X^T X)^{−1} G^T)^{−1} (Gβ̂ − λ) ~ χ²_k.

More tests (2)
Let S_n = (1/(k σ̂²)) (Gβ̂ − λ)^T (G (X^T X)^{−1} G^T)^{−1} (Gβ̂ − λ).
If H_0 is true, then S_n ~ F_{k, n−p}.
Test with non-asymptotic level α ∈ (0, 1):
δ_α = 1{S_n > q_α(F_{k, n−p})},
where q_α(F_{k, n−p}) is the (1 − α)-quantile of F_{k, n−p}.
Definition: The Fisher distribution with p and q degrees of freedom, denoted F_{p,q}, is the distribution of (U/p) / (V/q), where U ~ χ²_p, V ~ χ²_q, and U ⊥ V.
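A hedged Python sketch of this F-test for H_0: Gβ = λ (not from the slides; assumes scipy; G, λ, and the data are invented):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    n, p = 100, 4
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    Y = X @ np.array([1.0, 0.0, 0.0, 2.0]) + rng.normal(size=n)

    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    sigma2_hat = np.sum((Y - X @ beta_hat) ** 2) / (n - p)

    # H0: G beta = lam, here "beta_2 = beta_3 = 0" (k = 2 linear constraints)
    G = np.array([[0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
    lam = np.zeros(2)
    k = G.shape[0]
    V = G @ np.linalg.inv(X.T @ X) @ G.T
    d = G @ beta_hat - lam
    S_n = d @ np.linalg.solve(V, d) / (k * sigma2_hat)

    alpha = 0.05
    reject = S_n > stats.f.ppf(1 - alpha, dfn=k, dfd=n - p)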

Concluding remarks
- Linear regression exhibits correlations, NOT causality.
- Normality of the noise: one can use goodness-of-fit tests to test whether the residuals ε̂_i = Y_i − X_i^T β̂ are Gaussian.
- Deterministic design: if X is not deterministic, all of the above can be understood conditionally on X, provided the noise is assumed to be Gaussian conditionally on X.

Linear regression and lack of identifiability (1)
Consider the following model: Y = Xβ + ε, with:
1. Y ∈ ℝ^n (dependent variables), X ∈ ℝ^{n×p} (deterministic design);
2. β ∈ ℝ^p, unknown;
3. ε ~ N_n(0, σ² I_n).
Previously, we assumed that X had rank p, so we could invert X^T X. What if X is not of rank p, e.g., if p > n? Then β is no longer identified: estimation of β is vain (unless we add more structure).

Linear regression and lack of identifiability (2)
What about prediction? Xβ is still identified.
Ŷ: orthogonal projection of Y onto the linear span of the columns of X:
Ŷ = Xβ̂ = X (X^T X)^† X^T Y,
where A^† stands for the (Moore–Penrose) pseudo-inverse of a matrix A.
Similarly as before, if k = rank(X):
‖Ŷ − Y‖₂² / σ² ~ χ²_{n−k}, and ‖Ŷ − Y‖₂² ⊥ Ŷ.

Linear regression and lack of identifiability (3)
In particular: E[‖Ŷ − Y‖₂²] = (n − k) σ².
Unbiased estimator of the variance: σ̂² = ‖Ŷ − Y‖₂² / (n − k).
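A small numpy sketch (invented rank-deficient design and data) of prediction via the pseudo-inverse; here X (X^T X)^† X^T coincides with the projector X X^†:

    import numpy as np

    rng = np.random.default_rng(6)
    n = 20
    X = rng.normal(size=(n, 3)) @ rng.normal(size=(3, 10))  # n x 10 design of rank 3
    Y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=n)

    Y_hat = X @ np.linalg.pinv(X) @ Y          # orthogonal projection onto col(X)
    k = np.linalg.matrix_rank(X)               # here k = 3
    sigma2_hat = np.sum((Y - Y_hat) ** 2) / (n - k)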

Linear regression in high dimension (1)
Consider again the following model: Y = Xβ + ε, with:
1. Y ∈ ℝ^n (dependent variables), X ∈ ℝ^{n×p} (deterministic design);
2. β ∈ ℝ^p, unknown: to be estimated;
3. ε ~ N_n(0, σ² I_n).
For each i, X_i ∈ ℝ^p is the vector of covariates of the i-th individual. If p is too large (p > n), there are too many parameters to estimate (overfitting), although some covariates may be irrelevant. Solution: reduction of the dimension.

Linear regression in high dimension (2)
Idea: assume that only a few coordinates of β are nonzero (but we do not know which ones). Based on the sample, select a subset of covariates and estimate the corresponding coordinates of β.
For S ⊆ {1, ..., p}, let
β̂_S ∈ argmin_{t ∈ ℝ^S} ‖Y − X_S t‖₂²,
where X_S is the submatrix of X obtained by keeping only the covariates indexed in S.

Linear regression in high dimension (3)
Select a subset S that minimizes the prediction error penalized by the complexity (or size) of the model:
‖Y − X_S β̂_S‖₂² + λ|S|,
where λ > 0 is a tuning parameter.
- If λ = 2σ̂², this is Mallows' C_p, or the AIC criterion.
- If λ = σ̂² log n, this is the BIC criterion.
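A brute-force sketch of this penalized subset selection (not from the slides; the data and the pilot σ̂² taken from the full model are invented choices). It is only feasible for small p, which is exactly the 2^p explosion discussed on the next slide:

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(7)
    n, p = 50, 6
    X = rng.normal(size=(n, p))
    Y = X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=n)   # only covariates 0 and 2 matter

    beta_full, *_ = np.linalg.lstsq(X, Y, rcond=None)
    sigma2_hat = np.sum((Y - X @ beta_full) ** 2) / (n - p)
    lam = 2 * sigma2_hat          # Mallows' C_p / AIC; use sigma2_hat * np.log(n) for BIC

    best_crit, best_S = np.inf, ()
    for k in range(p + 1):
        for S in combinations(range(p), k):
            if S:
                XS = X[:, list(S)]
                b, *_ = np.linalg.lstsq(XS, Y, rcond=None)
                rss = np.sum((Y - XS @ b) ** 2)
            else:
                rss = np.sum(Y ** 2)
            crit = rss + lam * len(S)
            if crit < best_crit:
                best_crit, best_S = crit, S
    print(best_S)                 # typically (0, 2)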

Linear regression in high dimension (4)
Each of these criteria is equivalent to finding b ∈ ℝ^p that minimizes
‖Y − Xb‖₂² + λ‖b‖₀,
where ‖b‖₀ = Σ_{j=1}^p 1{b_j ≠ 0} is the number of nonzero coefficients of b.
This is a computationally hard problem: it is nonconvex and requires computing 2^p estimators (all the β̂_S, for S ⊆ {1, ..., p}).
Lasso estimator: replace ‖b‖₀ with ‖b‖₁ = Σ_{j=1}^p |b_j|, and the problem becomes convex:
β̂^L ∈ argmin_{b ∈ ℝ^p} ‖Y − Xb‖₂² + λ‖b‖₁,
where λ > 0 is a tuning parameter.
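In practice the Lasso is rarely coded by hand; a sketch using scikit-learn (an assumption: the slides do not prescribe a solver). Note that sklearn's Lasso minimizes ‖Y − Xb‖₂² / (2n) + α‖b‖₁, i.e., the slide's λ up to a 1/(2n) rescaling:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(8)
    n, p = 50, 200                           # high dimension: p > n
    X = rng.normal(size=(n, p))
    beta = np.zeros(p)
    beta[:3] = [3.0, -2.0, 1.5]              # sparse truth
    Y = X @ beta + 0.5 * rng.normal(size=n)

    model = Lasso(alpha=0.1, fit_intercept=False).fit(X, Y)
    print(np.flatnonzero(model.coef_))       # nonzero coordinates, ideally [0 1 2]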

Linear regression in high dimension (5)
How to choose λ? This is a difficult question (see the grad course High-dimensional statistics, Spring 2017). A good choice of λ will lead to an estimator β̂ that is very close to β and will allow us to recover, with high probability, the subset S of all j ∈ {1, ..., p} for which β_j ≠ 0.

Linear regression in high dimension (6)
[figure]

Nonparametric regression (1)
In the linear setup, we assumed that Y_i = X_i^T β + ε_i, where the X_i are deterministic. This has to be understood as working conditionally on the design. It amounts to assuming that E[Y_i | X_i] is a linear function of X_i, which is not true in general.
Let f(x) = E[Y_i | X_i = x], x ∈ ℝ^p. How can we estimate the function f?

Nonparametric regression (2)
Let p = 1 in the sequel. One can make a parametric assumption on f, e.g., f(x) = a + bx, f(x) = a + bx + cx², f(x) = e^{a+bx}, ...
The problem then reduces to the estimation of a finite number of parameters; LSE, MLE, and all the previous theory for the linear case can be adapted.
What if we do not make any such parametric assumption on f?

Nonparametric regression (3)
Assume f is smooth enough: f can be well approximated by a piecewise constant function.
Idea: local averages. For x ∈ ℝ, f(t) ≈ f(x) for t close to x, so for all i such that X_i is close enough to x, Y_i ≈ f(x) + ε_i.
Estimate f(x) by the average of all Y_i's for which X_i is close enough to x.

Nonparametric regression (4)
Let h > 0 be the window size (or bandwidth), and let I_x = {i = 1, ..., n : |X_i − x| < h}. Define f̂_{n,h}(x) as the average of {Y_i : i ∈ I_x}:
f̂_{n,h}(x) = (1/|I_x|) Σ_{i ∈ I_x} Y_i if I_x ≠ ∅, and f̂_{n,h}(x) = 0 otherwise.
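A direct Python transcription of this estimator; only the data generation is invented, echoing the slides' later example f(x) = x(1 − x):

    import numpy as np

    def f_hat(x, X, Y, h):
        """Local average of the Y_i with |X_i - x| < h; 0 if the window is empty."""
        mask = np.abs(X - x) < h
        return Y[mask].mean() if mask.any() else 0.0

    rng = np.random.default_rng(9)
    n = 100
    X = rng.uniform(size=n)
    Y = X * (1 - X) + 0.05 * rng.normal(size=n)
    print(f_hat(0.6, X, Y, h=0.1))           # compare with f(0.6) = 0.24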

Nonparametric regression (5)
[figure: scatter plot of Y against X]

Nonparametric regression (6)
[figure: the same scatter plot, with the local average f̂(x) at x = 0.6 for bandwidth h = 0.1]

Nonparametric regression (7)
How to choose h?
- If h → 0: overfitting the data.
- If h → ∞: underfitting; f̂_{n,h}(x) = Ȳ_n.

Nonparametric regression (8)
Example: n = 100, f(x) = x(1 − x), h = …
[figure: data (X, Y) and estimate f̂_{n,h}]

Nonparametric regression (9)
Example: n = 100, f(x) = x(1 − x), h = …
[figure: data (X, Y) and estimate f̂_{n,h}]

Nonparametric regression (10)
Example: n = 100, f(x) = x(1 − x), h = …
[figure: data (X, Y) and estimate f̂_{n,h}]

Nonparametric regression (11)
Choice of h?
- If the smoothness of f is known (i.e., the quality of the local approximation of f by piecewise constant functions): there is a good choice of h depending on that smoothness.
- If the smoothness of f is unknown: other techniques, e.g., cross-validation.
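One standard way to implement that last suggestion is leave-one-out cross-validation over a grid of bandwidths; a minimal sketch (invented grid and data, reusing the local-average estimator from above):

    import numpy as np

    def f_hat(x, X, Y, h):
        mask = np.abs(X - x) < h
        return Y[mask].mean() if mask.any() else 0.0

    def loo_cv_score(X, Y, h):
        """Leave-one-out prediction error of the local-average estimator."""
        n = len(X)
        preds = [f_hat(X[i], np.delete(X, i), np.delete(Y, i), h) for i in range(n)]
        return np.mean((Y - np.array(preds)) ** 2)

    rng = np.random.default_rng(10)
    X = rng.uniform(size=100)
    Y = X * (1 - X) + 0.05 * rng.normal(size=100)
    grid = [0.01, 0.05, 0.1, 0.2, 0.5]
    h_best = min(grid, key=lambda h: loo_cv_score(X, Y, h))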
