Computational Statistics


Spring 2008

Peter Bühlmann and Martin Mächler

Seminar für Statistik, ETH Zürich

February 2008 (February 23, 2011)


Contents

1 Multiple Linear Regression
   1.1 Introduction
   1.2 The Linear Model
       1.2.1 Stochastic Models
       1.2.2 Examples
       1.2.3 Goals of a linear regression analysis
   1.3 Least Squares Method
       1.3.1 The normal equations
       1.3.2 Assumptions for the Linear Model
       1.3.3 Geometrical Interpretation
       1.3.4 Don't do many regressions on single variables!
       1.3.5 Computer-Output from R: Part I
   1.4 Properties of Least Squares Estimates
       1.4.1 Moments of least squares estimates
       1.4.2 Distribution of least squares estimates assuming Gaussian errors
   1.5 Tests and Confidence Regions
       1.5.1 Computer-Output from R: Part II
   1.6 Analysis of residuals and checking of model assumptions
       1.6.1 The Tukey-Anscombe Plot
       1.6.2 The Normal Plot
       1.6.3 Plot for detecting serial correlation
       1.6.4 Generalized least squares and weighted regression
   1.7 Model Selection
       1.7.1 Mallows Cp statistic

2 Nonparametric Density Estimation
   2.1 Introduction
   2.2 Estimation of a density
       2.2.1 Histogram
       2.2.2 Kernel estimator
   2.3 The role of the bandwidth
       2.3.1 Variable bandwidths: k nearest neighbors
       2.3.2 The bias-variance trade-off
       2.3.3 Asymptotic bias and variance
       2.3.4 Estimating the bandwidth
   2.4 Higher dimensions
       2.4.1 The curse of dimensionality

3 Nonparametric Regression
   3.1 Introduction
   3.2 The kernel regression estimator
       3.2.1 The role of the bandwidth
       3.2.2 Inference for the underlying regression curve
   3.3 Local polynomial nonparametric regression estimator
   3.4 Smoothing splines and penalized regression
       3.4.1 Penalized sum of squares
       3.4.2 The smoothing spline solution
       3.4.3 Shrinking towards zero
       3.4.4 Relation to equivalent kernels

4 Cross-Validation
   4.1 Introduction
   4.2 Training and Test Set
   4.3 Constructing training-, test-data and cross-validation
       4.3.1 Leave-one-out cross-validation
       4.3.2 K-fold Cross-Validation
       4.3.3 Random divisions into test- and training-data
   4.4 Properties of different CV-schemes
       4.4.1 Leave-one-out CV
       4.4.2 Leave-d-out CV
       4.4.3 K-fold CV; stochastic approximations
   4.5 Computational shortcut for some linear fitting operators

5 Bootstrap
   5.1 Introduction
   5.2 Efron's nonparametric bootstrap
       5.2.1 The bootstrap algorithm
       5.2.2 The bootstrap distribution
       5.2.3 Bootstrap confidence interval: a first approach
       5.2.4 Bootstrap estimate of the generalization error
       5.2.5 Out-of-bootstrap sample for estimation of the generalization error
   5.3 Double bootstrap
   5.4 Model-based bootstrap
       5.4.1 Parametric bootstrap
       5.4.2 Model structures beyond i.i.d. and the parametric bootstrap
       5.4.3 The model-based bootstrap for regression

6 Classification
   6.1 Introduction
   6.2 The Bayes classifier
   6.3 The view of discriminant analysis
       6.3.1 Linear discriminant analysis
       6.3.2 Quadratic discriminant analysis
   6.4 The view of logistic regression
       6.4.1 Binary classification
       6.4.2 Multiclass case, J > 2

7 Flexible regression and classification methods
   7.1 Introduction
   7.2 Additive models
       7.2.1 Backfitting for additive regression models
       7.2.2 Additive model fitting in R
   7.3 MARS
       7.3.1 Hierarchical interactions and constraints
       7.3.2 MARS in R
   7.4 Neural Networks
       7.4.1 Fitting neural networks in R
   7.5 Projection pursuit regression
       7.5.1 Projection pursuit regression in R
   7.6 Classification and Regression Trees (CART)
       7.6.1 Tree-structured estimation and tree representation
       7.6.2 Tree-structured search algorithm and tree interpretation
       7.6.3 Pros and cons of trees
       7.6.4 CART in R

8 Variable Selection, Regularization, Ridging and the Lasso
   8.1 Introduction
   8.2 Ridge Regression
   8.3 The Lasso
   8.4 Lasso extensions

9 Bagging and Boosting
   9.1 Introduction
   9.2 Bagging
       9.2.1 The bagging algorithm
       9.2.2 Bagging for trees
       9.2.3 Subagging
   9.3 Boosting
       9.3.1 L2Boosting


Chapter 1

Multiple Linear Regression

1.1 Introduction

Linear regression is a widely used statistical model in a broad variety of applications. It is one of the easiest examples to demonstrate important aspects of statistical modelling.

1.2 The Linear Model

Multiple Regression Model: Given is one response variable which, up to some random errors, is a linear function of several predictors (or covariables). The linear function involves unknown parameters. The goal is to estimate these parameters, to study their relevance and to estimate the error variance.

Model formula:

$$Y_i = \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i \quad (i = 1, \dots, n) \qquad (1.1)$$

Usually we assume that ε_1, ..., ε_n are i.i.d. (independent, identically distributed) with E[ε_i] = 0, Var(ε_i) = σ².

Notations:
Y = {Y_i; i = 1, ..., n} is the vector of the response variables
x^(j) = {x_ij; i = 1, ..., n} is the vector of the jth predictor (covariable) (j = 1, ..., p)
x_i = {x_ij; j = 1, ..., p} is the vector of predictors for the ith observation (i = 1, ..., n)
β = {β_j; j = 1, ..., p} is the vector of the unknown parameters
ε = {ε_i; i = 1, ..., n} is the vector of the unknown random errors
n is the sample size, p is the number of predictors

The parameters β_j and σ² are unknown and the errors ε_i are unobservable. On the other hand, the response variables Y_i and the predictors x_ij have been observed.

Model in vector notation:

$$Y_i = x_i^\top \beta + \varepsilon_i \quad (i = 1, \dots, n)$$

Model in matrix form:

$$\underset{n\times 1}{Y} = \underset{n\times p}{X}\,\underset{p\times 1}{\beta} + \underset{n\times 1}{\varepsilon} \qquad (1.2)$$

where X is an (n × p)-matrix with rows x_i^⊤ and columns x^(j).

The first predictor variable is often a constant, i.e., x_i1 ≡ 1 for all i. We then get an intercept in the model:

$$Y_i = \beta_1 + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i.$$

We typically assume that the sample size n is larger than the number of predictors p, n > p, and moreover that the matrix X has full rank p, i.e., the p column vectors x^(1), ..., x^(p) are linearly independent.

1.2.1 Stochastic Models

The linear model in (1.1) involves some stochastic (random) components: the error terms ε_i are random variables and hence the response variables Y_i as well. The predictor variables x_ij are here assumed to be non-random. In some applications, however, it is more appropriate to treat the predictor variables as random.

The stochastic nature of the error terms ε_i can be assigned to various sources: for example, measurement errors, or the inability to capture all underlying non-systematic effects, which are then summarized by a random variable with expectation zero. The stochastic modelling approach allows us to quantify uncertainty, to assign significance to various components, e.g. significance of predictor variables in model (1.1), and to find a good compromise between the size of a model and the ability to describe the data (see section 1.7).

The observed response in the data is always assumed to be realizations of the random variables Y_1, ..., Y_n; the x_ij's are non-random and equal to the observed predictors in the data.

1.2.2 Examples

Two-sample model:

$$p = 2,\quad X = \begin{pmatrix} 1 & 0\\ \vdots & \vdots\\ 1 & 0\\ 0 & 1\\ \vdots & \vdots\\ 0 & 1 \end{pmatrix},\quad \beta = \begin{pmatrix}\mu_1\\ \mu_2\end{pmatrix}.$$

Main questions: Is μ_1 = μ_2? What is the quantitative difference between μ_1 and μ_2? From introductory courses we know that one could use the two-sample t-test or the two-sample Wilcoxon test.

Regression through the origin: Y_i = βx_i + ε_i (i = 1, ..., n).

$$p = 1,\quad X = \begin{pmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{pmatrix},\quad \beta = (\beta).$$

Simple linear regression: Y_i = β_1 + β_2 x_i + ε_i (i = 1, ..., n).

$$p = 2,\quad X = \begin{pmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_n \end{pmatrix},\quad \beta = \begin{pmatrix}\beta_1\\ \beta_2\end{pmatrix}.$$

Quadratic regression: Y_i = β_1 + β_2 x_i + β_3 x_i² + ε_i (i = 1, ..., n).

$$p = 3,\quad X = \begin{pmatrix} 1 & x_1 & x_1^2\\ 1 & x_2 & x_2^2\\ \vdots & \vdots & \vdots\\ 1 & x_n & x_n^2 \end{pmatrix},\quad \beta = \begin{pmatrix}\beta_1\\ \beta_2\\ \beta_3\end{pmatrix}.$$

Note that the fitted function is quadratic in the x_i's but linear in the coefficients β_j and therefore a special case of the linear model (1.1).

Regression with transformed predictor variables: Y_i = β_1 + β_2 log(x_i2) + β_3 sin(πx_i3) + ε_i (i = 1, ..., n).

$$p = 3,\quad X = \begin{pmatrix} 1 & \log(x_{12}) & \sin(\pi x_{13})\\ 1 & \log(x_{22}) & \sin(\pi x_{23})\\ \vdots & \vdots & \vdots\\ 1 & \log(x_{n2}) & \sin(\pi x_{n3}) \end{pmatrix},\quad \beta = \begin{pmatrix}\beta_1\\ \beta_2\\ \beta_3\end{pmatrix}.$$

Again, the model is linear in the coefficients β_j but nonlinear in the x_ij's.

In summary: The model in (1.1) is called linear because it is linear in the coefficients β_j. The predictor (and also the response) variables can be transformed versions of the original predictor and/or response variables.

1.2.3 Goals of a linear regression analysis

- A good "fit". Fitting or estimating a (hyper-)plane over the predictor variables to explain the response variables such that the errors are small. The standard tool for this is the method of least squares (see section 1.3).
- Good parameter estimates. This is useful to describe the change of the response when varying some predictor variable(s).
- Good prediction. This is useful to predict a new response as a function of new predictor variables.
- Uncertainties and significance for the three goals above. Confidence intervals and statistical tests are useful tools for this goal.
- Development of a good model. In an interactive process, using methods for the goals mentioned above, we may change parts of an initial model to come up with a better model.

The first and third goal can become conflicting, see section 1.7.

1.3 Least Squares Method

We assume the linear model Y = Xβ + ε. We are looking for a good estimate of β. The least squares estimator β̂ is defined as

$$\hat\beta = \arg\min_{\beta} \|Y - X\beta\|^2, \qquad (1.3)$$

where ‖·‖ denotes the Euclidean norm in R^n.

1.3.1 The normal equations

The minimizer in (1.3) can be computed explicitly (assuming that X has rank p). Computing partial derivatives of ‖Y − Xβ‖² with respect to β (a p-dimensional gradient vector), evaluating them at β̂, and setting them to zero yields

$$(-2)\,X^\top (Y - X\hat\beta) = 0 \quad ((p\times 1)\ \text{null-vector}).$$

Thus, we get the normal equations

$$X^\top X\hat\beta = X^\top Y. \qquad (1.4)$$

These are p linear equations for the p unknowns (the components of β̂).

Assuming that the matrix X has full rank p, the p × p matrix X^⊤X is invertible, the least squares estimator is unique and can be represented as

$$\hat\beta = (X^\top X)^{-1}X^\top Y.$$

This formula is useful for theoretical purposes. For numerical computation it is much more stable to use the QR decomposition instead of inverting the matrix X^⊤X. (Footnote: Let X = QR with orthogonal (n × p) matrix Q and upper (Right) triangular (p × p) matrix R. Because of X^⊤X = R^⊤Q^⊤QR = R^⊤R, computing β̂ only needs the subsequent solution of two triangular systems: first solve R^⊤c = X^⊤Y for c, and then solve Rβ̂ = c. Further, when Cov(β̂) requires (X^⊤X)^{-1}, the latter is R^{-1}(R^{-1})^⊤.)

From the residuals r_i = Y_i − x_i^⊤β̂, the usual estimate for σ² is

$$\hat\sigma^2 = \frac{1}{n-p}\sum_{i=1}^n r_i^2.$$

Note that the r_i's are estimates of the ε_i's; hence the estimator is plausible, up to the somewhat unusual factor 1/(n − p). It will be shown in section 1.4.1 that, due to this factor, E[σ̂²] = σ².
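A minimal sketch in R on our own simulated data (all names and values here are illustrative, not from the notes): least squares via lm(), via the normal equations (1.4), and via the QR decomposition of the footnote.

    set.seed(1)
    n <- 100; p <- 3
    X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # first column: intercept
    Y <- drop(X %*% c(1, 2, -1) + rnorm(n, sd = 0.5))

    fit0 <- lm(Y ~ X - 1)          # lm() uses the QR decomposition internally

    beta.ne <- solve(crossprod(X), crossprod(X, Y))      # normal equations (1.4)

    R <- qr.R(qr(X))               # X = QR; solve the two triangular systems
    beta.qr <- backsolve(R, forwardsolve(t(R), crossprod(X, Y)))

    r <- Y - drop(X %*% beta.qr)   # residuals
    sum(r^2) / (n - p)             # sigma^2 estimate with the 1/(n - p) factor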

[Figure 1.1: The "pill kink". Annual number of births in Switzerland (in 1000) versus year.]

1.3.2 Assumptions for the Linear Model

We emphasize here that we do not make any assumptions on the predictor variables, except that the matrix X has full rank p < n. In particular, the predictor variables can be continuous or discrete (e.g. binary).

We need some assumptions so that fitting a linear model by least squares is reasonable and so that tests and confidence intervals (see 1.5) are approximately valid.

1. The linear regression equation is correct. This means: E[ε_i] = 0 for all i.
2. All x_i's are exact. This means that we can observe them perfectly.
3. The variance of the errors is constant ("homoscedasticity"). This means: Var(ε_i) = σ² for all i.
4. The errors are uncorrelated. This means: Cov(ε_i, ε_j) = 0 for all i ≠ j.
5. The errors {ε_i; i = 1, ..., n} are jointly normally distributed. This implies that also {Y_i; i = 1, ..., n} are jointly normally distributed.

In case of violations of item 3, we can use weighted least squares instead of least squares. Similarly, if item 4 is violated, we can use generalized least squares. If the normality assumption in 5 does not hold, we can use robust methods instead of least squares. If assumption 2 fails to be true, we need corrections known from "errors in variables" methods. If the crucial assumption in 1 fails, we need other models than the linear model.

The following example shows violations of assumptions 1 and 4. The response variable is the annual number of births in Switzerland since 1930, and the predictor variable is the time (year). We see in Figure 1.1 that the data can be approximately described by a linear relation until the "pill kink" in 1964. We also see that the errors seem to be correlated: they are all positive or negative during periods of years. Finally, the linear model is not representative after the pill kink in 1964. In general, it is dangerous to use a fitted model for extrapolation where no predictor variables have been observed (for example: if we had fitted the linear model in 1964 for prediction of the number of births in the future until 2005).

[Figure 1.2: The residual vector r is orthogonal to the plane X spanned by the columns of X.]

1.3.3 Geometrical Interpretation

The response variable Y is a vector in R^n. Also, Xβ describes a p-dimensional subspace X in R^n (through the origin) when varying β ∈ R^p (assuming that X has full rank p). The least squares estimator β̂ is then such that Xβ̂ is closest to Y with respect to the Euclidean distance. But this means geometrically that Xβ̂ is the orthogonal projection of Y onto X.

We denote the (vector of) fitted values by

$$\hat Y = X\hat\beta.$$

They can be viewed as an estimate of Xβ. The (vector of) residuals is defined by

$$r = Y - \hat Y.$$

Geometrically, it is evident that the residuals are orthogonal to X, because Ŷ is the orthogonal projection of Y onto X. This means that

$$r^\top x^{(j)} = 0 \quad \text{for all } j = 1, \dots, p,$$

where x^(j) is the jth column of X.

We can formally see why the map Y ↦ Ŷ is an orthogonal projection. Since Ŷ = Xβ̂ = X(X^⊤X)^{-1}X^⊤Y, the map can be represented by the matrix

$$P = X(X^\top X)^{-1}X^\top. \qquad (1.5)$$
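The projection can be checked numerically; a minimal sketch, reusing X, Y and fit0 from the simulated example above:

    P <- X %*% solve(crossprod(X)) %*% t(X)   # the matrix (1.5)
    Yhat <- drop(P %*% Y)                     # orthogonal projection of Y
    all.equal(Yhat, unname(fitted(fit0)))     # same as lm()'s fitted values
    crossprod(X, Y - Yhat)                    # numerically zero: r orthogonal to X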

It is evident that P is symmetric (P^⊤ = P) and that P is idempotent (P² = P). Furthermore,

$$\sum_i P_{ii} = \operatorname{tr}(P) = \operatorname{tr}\big(X(X^\top X)^{-1}X^\top\big) = \operatorname{tr}\big((X^\top X)^{-1}X^\top X\big) = \operatorname{tr}(I_{p\times p}) = p.$$

But these 3 properties characterize P as an orthogonal projection from R^n onto a p-dimensional subspace, here X. The residuals r can be represented as

$$r = (I - P)Y,$$

where I − P is now also an orthogonal projection, onto the orthogonal complement of X, X^⊥ = R^n \ X, which is (n − p)-dimensional. In particular, the residuals are elements of X^⊥.

1.3.4 Don't do many regressions on single variables!

In general, it is not appropriate to replace multiple regression by many single regressions (on single predictor variables). The following (synthetic) example should help to demonstrate this point. Consider two predictor variables x^(1), x^(2) and a response variable Y. Multiple regression yields the least squares solution which describes the data points exactly,

$$Y_i = \hat Y_i = 2x_{i1} - x_{i2}\quad \text{for all } i\ \ (\hat\sigma^2 = 0). \qquad (1.6)$$

The coefficients 2 and −1, respectively, describe how y changes when varying either x^(1) or x^(2) while keeping the other predictor variable constant. In particular, we see that Y decreases when x^(2) increases.

On the other hand, if we do a simple regression of Y onto x^(2) (while ignoring the values of x^(1), and thus not keeping them constant), we obtain the least squares estimate

$$\hat Y_i = \tfrac{1}{9}x_{i2}\quad\text{for all } i\ \ (\hat\sigma^2 = 1.72).$$

This least squares regression line describes how Y changes when varying x^(2) while ignoring x^(1). In particular, Ŷ increases when x^(2) increases, in contrast to multiple regression!

The reason for this phenomenon is that x^(1) and x^(2) are strongly correlated: if x^(2) increases, then also x^(1) increases. Note that in the multiple regression solution, x^(1) has a larger coefficient in absolute value than x^(2), and hence an increase in x^(1) has a stronger influence on changing y than x^(2). The correlation among the predictors in general also makes the interpretation of the regression coefficients more subtle: in the current setting, the coefficient β_1 quantifies the influence of x^(1) on Y after having subtracted the effect of x^(2) on Y, see also section 1.5.

Summarizing: Simple least squares regressions on single predictor variables yield the multiple regression least squares solution only if the predictor variables are orthogonal. In general, multiple regression is the appropriate tool to include effects of more than one predictor variable simultaneously.
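A minimal sketch of this phenomenon with our own synthetic numbers (the data table of the original example is not reproduced here):

    set.seed(2)
    x1 <- rnorm(50)
    x2 <- x1 + rnorm(50, sd = 0.3)       # x2 strongly correlated with x1
    y  <- 2 * x1 - x2 + rnorm(50, sd = 0.1)
    fit <- lm(y ~ x1 + x2)
    coef(fit)                            # approx. 2 and -1: negative sign for x2
    coef(lm(y ~ x2))                     # positive slope: the sign flips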

The equivalence in case of orthogonal predictors is easy to see algebraically. Orthogonality of predictors means X^⊤X = diag(Σ_{i=1}^n x_i1², ..., Σ_{i=1}^n x_ip²), and hence the least squares estimator is

$$\hat\beta_j = \sum_{i=1}^n x_{ij}Y_i \Big/ \sum_{i=1}^n x_{ij}^2 \quad (j = 1, \dots, p),$$

i.e., β̂_j depends only on the response variable Y_i and the jth predictor variable x_ij.

1.3.5 Computer-Output from R: Part I

We show here parts of the computer output (from R) when fitting a linear model to data about the quality of asphalt.

y  = LOGRUT: log("rate of rutting") = log(change of rut depth in inches per million wheel passes) ["rut": a wheel track, a worn-in groove]
x1 = LOGVISC: log(viscosity of asphalt)
x2 = ASPH: percentage of asphalt in surface course
x3 = BASE: percentage of asphalt in base course
x4 = RUN: 0/1 indicator for two sets of runs
x5 = FINES: 10 × percentage of fines in surface course
x6 = VOIDS: percentage of voids in surface course

The following table shows the least squares estimates β̂_j (j = 1, ..., 7), some empirical quantiles of the residuals r_i (i = 1, ..., n), the estimated standard deviation of the errors (the term "residual standard error" is a misnomer with a long tradition, since "standard error" usually means sqrt(Var(θ̂)) for an estimated parameter θ), and the so-called degrees of freedom n − p.

    Call:
    lm(formula = LOGRUT ~ ., data = asphalt1)

    Residuals:
         Min       1Q   Median       3Q      Max
         ...      ...      ...      ...      ...

    Coefficients:
                Estimate
    (Intercept)      ...
    LOGVISC          ...
    ASPH             ...
    BASE             ...
    RUN              ...
    FINES            ...
    VOIDS            ...

    Residual standard error: ... on 24 degrees of freedom

1.4 Properties of Least Squares Estimates

As an introductory remark, we point out that the least squares estimates are random variables: for new data from the same data-generating mechanism, the data would look different every time, and hence also the least squares estimates. Figure 1.3 displays three least squares regression lines which are based on three different realizations from the same data-generating model (i.e., three simulations from a model). We see that the estimates are varying: they are random themselves!

[Figure 1.3: Three least squares estimated regression lines for three different data realizations from the same model (true line versus least squares lines).]

1.4.1 Moments of least squares estimates

We assume here the usual linear model

$$Y = X\beta + \varepsilon,\quad E[\varepsilon] = 0,\quad \operatorname{Cov}(\varepsilon) = E[\varepsilon\varepsilon^\top] = \sigma^2 I_{n\times n}. \qquad (1.7)$$

This means that assumptions 1-4 from section 1.3.2 are satisfied. It can then be shown that:

(i) E[β̂] = β: that is, β̂ is unbiased
(ii) E[Ŷ] = E[Y] = Xβ, which follows from (i). Moreover, E[r] = 0.
(iii) Cov(β̂) = σ²(X^⊤X)^{-1}
(iv) Cov(Ŷ) = σ²P, Cov(r) = σ²(I − P)

The residuals (which are estimates of the unknown errors ε_i) also have expectation zero, but they are not uncorrelated:

$$\operatorname{Var}(r_i) = \sigma^2(1 - P_{ii}).$$

From this, we obtain

$$E\Big[\sum_{i=1}^n r_i^2\Big] = \sum_{i=1}^n E[r_i^2] = \sum_{i=1}^n \operatorname{Var}(r_i) = \sigma^2\sum_{i=1}^n (1 - P_{ii}) = \sigma^2\big(n - \operatorname{tr}(P)\big) = \sigma^2(n-p).$$

Therefore, E[σ̂²] = E[Σ_{i=1}^n r_i²/(n − p)] = σ², i.e., σ̂² is unbiased.

1.4.2 Distribution of least squares estimates assuming Gaussian errors

We assume the linear model as in (1.7), but require in addition that ε_1, ..., ε_n i.i.d. ~ N(0, σ²). It can then be shown that:

(i) β̂ ~ N_p(β, σ²(X^⊤X)^{-1})
(ii) Ŷ ~ N_n(Xβ, σ²P), r ~ N_n(0, σ²(I − P))
(iii) σ̂² ~ (σ²/(n − p)) χ²_{n−p}.

The normality assumption for the errors ε_i is often not (approximately) fulfilled in practice. We can then rely on the central limit theorem, which implies that for large sample size n, the properties (i)-(iii) above are still approximately true. This is the usual justification in practice for using these properties to construct confidence intervals and tests for the linear model parameters. However, it is often much better to use robust methods in case of non-Gaussian errors, which we are not discussing here.

1.5 Tests and Confidence Regions

We assume the linear model as in (1.7) with ε_1, ..., ε_n i.i.d. ~ N(0, σ²) (or with ε_i's i.i.d. and large sample size n). As we have seen above, the parameter estimates β̂ are normally distributed. If we are interested in whether the jth predictor variable is relevant, we can test the null-hypothesis H_{0,j}: β_j = 0 against the alternative H_{A,j}: β_j ≠ 0. We can then easily derive from the normal distribution of β̂_j that

$$\frac{\hat\beta_j}{\sqrt{\sigma^2 (X^\top X)^{-1}_{jj}}} \sim N(0,1)\quad\text{under the null-hypothesis } H_{0,j}.$$

Since σ² is unknown, this quantity is not useful, but if we substitute it with the estimate σ̂² we obtain the so-called t-test statistic

$$T_j = \frac{\hat\beta_j}{\sqrt{\hat\sigma^2 (X^\top X)^{-1}_{jj}}} \sim t_{n-p}\quad\text{under the null-hypothesis } H_{0,j}, \qquad (1.8)$$

which has a slightly different distribution than the standard Normal N(0,1). The corresponding test is then called the t-test. In practice, we can thus quantify the relevance of individual predictor variables by looking at the size of the test-statistics T_j (j = 1, ..., p) or at the corresponding P-values, which may be more informative.

The problem with looking at individual tests H_{0,j} is (besides the multiple testing problem in general) that it can happen that all individual tests do not reject the null-hypotheses (say at the 5% significance level), although it is true that some predictor variables have a significant effect. This "paradox" can occur because of correlation among the predictor variables.

An individual t-test for H_{0,j} should be interpreted as quantifying the effect of the jth predictor variable after having subtracted the linear effect of all other predictor variables on Y.
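A minimal sketch, continuing the small simulated example from section 1.3.4: the t-statistic (1.8) computed by hand agrees with the summary() output of lm().

    Xm <- model.matrix(fit)
    df <- nrow(Xm) - ncol(Xm)
    sigma2.hat <- sum(residuals(fit)^2) / df
    se <- sqrt(sigma2.hat * diag(solve(crossprod(Xm))))
    Tj <- coef(fit) / se
    cbind(Tj, summary(fit)$coefficients[, "t value"])   # identical columns
    2 * pt(-abs(Tj), df = df)                           # two-sided P-values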

To test whether there exists any effect from the predictor variables, we can look at the simultaneous null-hypothesis H_0: β_2 = ... = β_p = 0 versus the alternative H_A: β_j ≠ 0 for at least one j ∈ {2, ..., p}; we assume here that the first predictor variable is the constant x_{i,1} ≡ 1 (there are p − 1 (non-trivial) predictor variables). Such a test can be developed with an analysis of variance (anova) decomposition, which takes a simple form for this special case:

$$\|Y - \bar Y\|^2 = \|\hat Y - \bar Y\|^2 + \|Y - \hat Y\|^2.$$

This decomposes the total squared error ‖Y − Ȳ‖² around the mean Ȳ = n^{-1} Σ_{i=1}^n Y_i · 1 into the squared error due to the regression, ‖Ŷ − Ȳ‖² (the amount that the fitted values vary around the global arithmetic mean), and the squared residual error ‖r‖² = ‖Y − Ŷ‖². (The equality can be seen most easily from a geometrical point of view: the residuals r are orthogonal to X and hence to Ŷ − Ȳ.) Such a decomposition is usually summarized by an ANOVA table (ANalysis Of VAriance):

                              sum of squares   degrees of freedom   mean square        E[mean square]
    regression                ‖Ŷ − Ȳ‖²         p − 1                ‖Ŷ − Ȳ‖²/(p − 1)   σ² + ‖E[Y] − E[Ȳ]‖²/(p − 1)
    error                     ‖Y − Ŷ‖²         n − p                ‖Y − Ŷ‖²/(n − p)   σ²
    total around global mean  ‖Y − Ȳ‖²         n − 1

In case of the global null-hypothesis, there is no effect of any predictor variable and hence E[Y] ≡ const. = E[Ȳ]: therefore, the expected mean square equals σ² under H_0. The idea is now to divide the mean square by the estimate σ̂² to obtain a scale-free quantity: this leads to the so-called F-statistic

$$F = \frac{\|\hat Y - \bar Y\|^2/(p-1)}{\|Y - \hat Y\|^2/(n-p)} \sim F_{p-1,\,n-p}\quad\text{under the global null-hypothesis } H_0.$$

This test is called the F-test (it is one among several other F-tests in regression).

Besides performing a global F-test to quantify the statistical significance of the predictor variables, we often want to describe the goodness of fit of the linear model for explaining the data. A meaningful quantity is the coefficient of determination, abbreviated R²,

$$R^2 = \frac{\|\hat Y - \bar Y\|^2}{\|Y - \bar Y\|^2}.$$

This is the proportion of the total variation of Y around Ȳ which is explained by the regression (see the ANOVA decomposition and table above).

Similarly to the t-tests in (1.8), one can derive confidence intervals for the unknown parameters β_j:

$$\hat\beta_j \pm \sqrt{\hat\sigma^2 (X^\top X)^{-1}_{jj}}\; t_{n-p;\,1-\alpha/2}$$

is a two-sided confidence interval which covers the true β_j with probability 1 − α; here, t_{n−p; 1−α/2} denotes the 1 − α/2 quantile of a t_{n−p} distribution.

1.5.1 Computer-Output from R: Part II

We consider again the dataset from section 1.3.5. We now give the complete list of summary statistics from a linear model fit to the data.
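A minimal sketch (same simulated fit as above): the global F-statistic and R² computed by hand, matching the summary(fit) output.

    Yhat <- fitted(fit); ybar <- mean(y)
    n <- length(y); p <- length(coef(fit))
    Fstat <- (sum((Yhat - ybar)^2) / (p - 1)) / (sum((y - Yhat)^2) / (n - p))
    R2 <- sum((Yhat - ybar)^2) / sum((y - ybar)^2)
    c(F = Fstat, R2 = R2, P = 1 - pf(Fstat, p - 1, n - p))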

    Call:
    lm(formula = LOGRUT ~ ., data = asphalt1)

    Residuals:
         Min       1Q   Median       3Q      Max
         ...      ...      ...      ...      ...

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)      ...        ...     ...      ...
    LOGVISC          ...        ...     ...  ...e-07
    ASPH             ...        ...     ...      ...
    BASE             ...        ...     ...      ...
    RUN              ...        ...     ...      ...
    FINES            ...        ...     ...      ...
    VOIDS            ...        ...     ...      ...
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: ... on 24 degrees of freedom
    Multiple R-Squared: ..., Adjusted R-squared: ...
    F-statistic: ... on 6 and 24 DF, p-value: < 2.2e-16

The table displays the standard errors of the estimates sqrt(Var(β̂_j)) = sqrt(σ̂²(X^⊤X)^{-1}_{jj}), the t-test statistics for the null-hypotheses H_{0,j}: β_j = 0 and their corresponding two-sided P-values, with some abbreviation about the strength of significance. Moreover, the R² and adjusted R² are given, and finally also the F-test statistic for the null-hypothesis H_0: β_2 = ... = β_p = 0 (with the degrees of freedom) and its corresponding P-value.

1.6 Analysis of residuals and checking of model assumptions

The residuals r_i = Y_i − Ŷ_i can serve as an approximation of the unobservable error terms ε_i and for checking whether the linear model is appropriate.

1.6.1 The Tukey-Anscombe Plot

The Tukey-Anscombe plot is a graphical tool: we plot the residuals r_i (on the y-axis) versus the fitted values Ŷ_i (on the x-axis). A reason to plot against the fitted values Ŷ_i is that the sample correlation between r_i and Ŷ_i is always zero.

In the ideal case, the points in the Tukey-Anscombe plot "fluctuate randomly" around the horizontal line through zero: see also Figure 1.4. An often encountered deviation is non-constant variability of the residuals, i.e., an indication that the variance of ε_i increases with the response variable Y_i: this is shown in Figure 1.5 a)-c). If the Tukey-Anscombe plot shows a trend, there is some evidence that the linear model assumption is not correct (the expectation of the error is not zero, which indicates a systematic error): Figure 1.5 d) is a typical example.

In case the Tukey-Anscombe plot exhibits a systematic relation of the variability on the fitted values Ŷ_i, we should either transform the response variable or perform a weighted regression (see Section 1.6.4). If the standard deviation grows linearly with the fitted values (as in Figure 1.5 a)), the log-transform Y ↦ log(Y) stabilizes the variance; if the standard deviation grows as the square root of the values Ŷ_i (as in Figure 1.5 b)), the square root transformation Y ↦ √Y stabilizes the variance.

[Figure 1.4: Ideal Tukey-Anscombe plot: no violations of model assumptions.]

[Figure 1.5: a) linear increase of standard deviation, b) nonlinear increase of standard deviation, c) 2 groups with different variances, d) missing quadratic term in the model.]

1.6.2 The Normal Plot

Assumptions for the distribution of random variables can be graphically checked with the QQ (quantile-quantile) plot. In the special case of checking for the normal distribution, the QQ plot is also referred to as a normal plot. In the linear model application, we plot the empirical quantiles of the residuals (on the y-axis) versus the theoretical quantiles of a N(0,1) distribution (on the x-axis). If the residuals were normally distributed with expectation μ and variance σ², the normal plot would exhibit an approximate straight line with intercept μ and slope σ. Figures 1.6 and 1.7 show some normal plots with exactly normally and non-normally distributed observations.

1.6.3 Plot for detecting serial correlation

For checking independence of the errors we plot the residuals r_i versus the observation number i (or, if available, the time t_i of recording the ith observation). If the residuals vary randomly around the zero line, there are no indications of serial correlation among the errors ε_i. On the other hand, if neighboring (with respect to the x-axis) residuals look similar, the independence assumption for the errors seems violated.
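A minimal sketch (same fitted lm() object as before): the three residual plots of this section.

    par(mfrow = c(1, 3))
    plot(fitted(fit), residuals(fit), xlab = "fitted values",
         ylab = "residuals", main = "Tukey-Anscombe plot")
    abline(h = 0, lty = 2)
    qqnorm(residuals(fit)); qqline(residuals(fit))        # normal plot
    plot(residuals(fit), type = "b", xlab = "observation number",
         ylab = "residuals", main = "serial correlation")
    abline(h = 0, lty = 2)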

[Figure 1.6: QQ-plots for i.i.d. normally distributed random variables. Two plots for each of the sample sizes a) n = 20, b) n = 100, c) n = 1000.]

[Figure 1.7: QQ-plots for a) long-tailed distribution, b) skewed distribution, c) dataset with outlier.]

1.6.4 Generalized least squares and weighted regression

In a more general situation, the errors are correlated with known covariance matrix,

$$Y = X\beta + \varepsilon,\quad \varepsilon \sim N_n(0, \Sigma).$$

When Σ is known (and also in the case where Σ = σ²G with unknown σ²), this case can be transformed to the i.i.d. one, using a "square root" C such that Σ = CC^⊤ (defined, e.g., via the Cholesky factorization Σ = LL^⊤, where L is a uniquely determined lower triangular matrix): if Ỹ := C^{-1}Y and X̃ := C^{-1}X, we have

$$\tilde Y = \tilde X\beta + \tilde\varepsilon,\quad\text{where } \tilde\varepsilon \sim N(0, I).$$

This leads to the generalized least squares solution β̂ = (X^⊤Σ^{-1}X)^{-1}X^⊤Σ^{-1}Y with Cov(β̂) = (X^⊤Σ^{-1}X)^{-1}.

A special case is where Σ is diagonal, Σ = σ² diag(z_1, z_2, ..., z_n) (with trivial inverse): this is the weighted least squares problem

$$\min_\beta \sum_{i=1}^n w_i (Y_i - x_i^\top\beta)^2, \quad\text{with weights } w_i \propto 1/z_i.$$
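A minimal sketch of weighted least squares in R via the weights argument of lm(), reusing x1, x2 from the earlier simulated example; the variance factors z_i here are hypothetical:

    z <- runif(50, 0.5, 2)                    # hypothetical variance factors z_i
    yw <- 2 * x1 - x2 + rnorm(50, sd = sqrt(z))
    fit.w <- lm(yw ~ x1 + x2, weights = 1 / z)
    summary(fit.w)                            # weights proportional to 1/z_i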

1.7 Model Selection

We assume the linear model

$$Y_i = \sum_{j=1}^p \beta_j x_{ij} + \varepsilon_i\quad (i = 1, \dots, n),$$

where ε_1, ..., ε_n are i.i.d. with E[ε_i] = 0, Var(ε_i) = σ².

Problem: Which of the predictor variables should be used in the linear model?

It may be that not all of the p predictor variables are relevant. In addition, every coefficient has to be estimated and thus is afflicted with variability: the individual variabilities of the coefficients sum up, and the variability of the estimated hyper-plane increases the more predictors are entered into the model, whether they are relevant or not. The aim is often to look for the optimal or "best" model, not the true one.

What we just explained in words can be formalized a bit more. Suppose we are looking to optimize the prediction

$$\sum_{r=1}^q \hat\beta_{j_r} x_{i j_r},$$

which includes q predictor variables with indices j_1, ..., j_q ∈ {1, ..., p}. The average mean squared error of this prediction is

$$n^{-1}\sum_{i=1}^n E\Big[\Big(m(x_i) - \sum_{r=1}^q \hat\beta_{j_r}x_{ij_r}\Big)^2\Big]
= n^{-1}\sum_{i=1}^n \Big(E\Big[\sum_{r=1}^q \hat\beta_{j_r}x_{ij_r}\Big] - m(x_i)\Big)^2
+ \underbrace{n^{-1}\sum_{i=1}^n \operatorname{Var}\Big(\sum_{r=1}^q \hat\beta_{j_r}x_{ij_r}\Big)}_{=\,\frac{q}{n}\sigma^2}, \qquad (1.9)$$

where m(·) denotes the regression function in the full model with all the predictor variables. It is plausible that the systematic error (squared bias) n^{-1} Σ_{i=1}^n (E[Σ_{r=1}^q β̂_{j_r} x_{ij_r}] − m(x_i))² decreases as the number of predictor variables q increases (i.e., with respect to bias, we have nothing to lose by using as many predictors as we can), but the variance term increases linearly in the number of predictors q (the variance term equals q/n · σ², which is not too difficult to derive). This is the so-called bias-variance trade-off, which is present in very many other situations and applications in statistics. Finding the best model thus means optimizing the bias-variance trade-off: this is sometimes also referred to as "regularization" (for avoiding an overly complex model).

1.7.1 Mallows Cp statistic

The mean squared error in (1.9) is unknown: we do not know the magnitude of the bias term, but fortunately, we can estimate the mean squared error.

Denote by SSE(M) the residual sum of squares in a model M: it is overly optimistic and not a good measure for estimating the mean squared error in (1.9). For example, SSE(M) becomes smaller the bigger the model M, and the biggest model under consideration has the lowest SSE (which generally contradicts the equation in (1.9)).

For any (sub-)model M which involves some (or all) of the predictor variables, the mean squared error in (1.9) can be estimated by

$$n^{-1}\,\mathrm{SSE}(M) - \hat\sigma^2 + 2\hat\sigma^2 |M|/n,$$

where σ̂² is the error variance estimate in the full model and SSE(M) is the residual sum of squares in the submodel M. (A justification can be found in the literature.) Thus, in order to estimate the best model, we could search for the sub-model M minimizing the above quantity.

Since σ̂² and n are constants with respect to submodels M, we can also consider the well-known Cp statistic

$$C_p(M) = \frac{\mathrm{SSE}(M)}{\hat\sigma^2} - n + 2|M|$$

and search for the sub-model M minimizing the Cp statistic. Other popular criteria for estimating the predictive potential of an estimated model are Akaike's information criterion (AIC) and the Bayesian information criterion (BIC).

Searching for the best model with respect to Cp

If the full model has p predictor variables, there are 2^p − 1 sub-models (every predictor can be in or out, but we exclude the sub-model M which corresponds to the empty set). Therefore, an exhaustive search for the sub-model M minimizing the Cp statistic is only feasible if p is less than, say, 16 (2^16 − 1 = 65535, which is already fairly large). If p is large, we can proceed with stepwise algorithms, as sketched below.

Forward selection.
1. Start with the smallest model M_0 (location model) as the current model.
2. Include the predictor variable in the current model which reduces the residual sum of squares most.
3. Continue with step 2 until all predictor variables have been chosen or until a large number of predictor variables has been selected. This produces a sequence of sub-models M_0 ⊆ M_1 ⊆ M_2 ⊆ ...
4. Choose the model in the sequence M_0 ⊆ M_1 ⊆ M_2 ⊆ ... which has the smallest Cp statistic.

Backward selection.
1. Start with the full model M_0 as the current model.
2. Exclude the predictor variable from the current model which increases the residual sum of squares the least.
3. Continue with step 2 until all predictor variables have been deleted (or a large number of predictor variables). This produces a sequence of sub-models M_0 ⊇ M_1 ⊇ M_2 ⊇ ...
4. Choose the model in the sequence M_0 ⊇ M_1 ⊇ M_2 ⊇ ... which has the smallest Cp statistic.

Backward selection is typically a bit better than forward selection, but it is computationally more expensive. Also, in the case where p ≫ n, we do not want to fit the full model, and forward selection is an appropriate way to proceed.
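A minimal sketch of stepwise search in R with step(), which uses AIC; for a Gaussian linear model, AIC ranks submodels similarly to Mallows' Cp. This assumes the asphalt1 data frame of section 1.3.5 is available.

    full  <- lm(LOGRUT ~ ., data = asphalt1)
    empty <- lm(LOGRUT ~ 1, data = asphalt1)     # the location model
    step(empty, scope = formula(full), direction = "forward")
    step(full, direction = "backward")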

Chapter 2

Nonparametric Density Estimation

2.1 Introduction

For a moment, we will go back to simple data structures: we have observations which are realizations of univariate random variables,

$$X_1, \dots, X_n\ \text{i.i.d.} \sim F,$$

where F denotes an unknown cumulative distribution function. The goal is to estimate the distribution F. In particular, we are interested in estimating the density f = F', assuming that it exists.

Instead of assuming a parametric model for the distribution (e.g. a Normal distribution with unknown expectation and variance), we rather want to be "as general as possible": that is, we only assume that the density exists and is suitably smooth (e.g. differentiable). It is then possible to estimate the unknown density function f(·). Mathematically, a function is an infinite-dimensional object. Density estimation will become a basic principle of how to do estimation for infinite-dimensional objects. We will make use of such a principle in many other settings, such as nonparametric regression with one predictor variable (Chapter 3) and flexible regression and classification methods with many predictor variables (Chapter 7).

2.2 Estimation of a density

We consider the data which records the duration of eruptions of "Old Faithful", a famous geyser in Yellowstone National Park (Wyoming, USA). You can watch it via web-cam.

2.2.1 Histogram

The histogram is the oldest and most popular density estimator. We need to specify an "origin" x_0 and the class width h for the specification of the intervals

$$I_j = (x_0 + j\,h,\ x_0 + (j+1)h] \quad (j = \dots, -1, 0, 1, \dots),$$

for which the histogram counts the number of observations falling into each I_j: we then plot the histogram such that the area of each bar is proportional to the number of observations falling into the corresponding class (interval I_j).

[Figure 2.1: Histograms (different class widths) for the durations ("length [min]") of eruptions of the Old Faithful geyser in Yellowstone Park (n = 272, data(faithful)).]

The choice of the origin x_0 is highly arbitrary, whereas the role of the class width is immediately clear for the user. The form of the histogram depends very much on these two tuning parameters.

2.2.2 Kernel estimator

The naive estimator

Similar to the histogram, we can compute the relative frequency of observations falling into a small region. The density function f(·) at a point x can be represented as

$$f(x) = \lim_{h\to 0}\frac{1}{2h}\,P[x - h < X \le x + h]. \qquad (2.1)$$

The naive estimator is then constructed without taking the limit in (2.1) and by replacing probabilities with relative frequencies:

$$\hat f(x) = \frac{1}{2hn}\,\#\{i;\ X_i \in (x-h,\, x+h]\}. \qquad (2.2)$$

This naive estimator is only piecewise constant, since every X_i is either in or out of the interval (x − h, x + h]. As for histograms, we also need to specify the so-called bandwidth h; but in contrast to the histogram, we do not need to specify an origin x_0.

An alternative representation of the naive estimator (2.2) is as follows. Define the weight function

$$w(x) = \begin{cases} 1/2 & \text{if } |x| \le 1,\\ 0 & \text{otherwise.} \end{cases}$$

Then,

$$\hat f(x) = \frac{1}{nh}\sum_{i=1}^n w\Big(\frac{x - X_i}{h}\Big).$$
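A minimal sketch: the naive estimator (2.2) coded directly and evaluated on the Old Faithful eruption durations (grid and bandwidth are our own choices).

    naive.dens <- function(x, X, h)     # evaluate (2.2) at each point of x
      sapply(x, function(u) sum(X > u - h & X <= u + h) / (2 * h * length(X)))
    data(faithful)
    xs <- seq(1, 6, length.out = 200)
    plot(xs, naive.dens(xs, faithful$eruptions, h = 0.2), type = "s",
         xlab = "length [min]", ylab = "density estimate")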

If we choose, instead of the rectangle weight function w(·), a general, typically smoother kernel function K(·), we have the definition of the kernel density estimator

$$\hat f(x) = \frac{1}{nh}\sum_{i=1}^n K\Big(\frac{x - X_i}{h}\Big),\qquad K(x)\ge 0,\quad \int K(x)\,dx = 1,\quad K(x) = K(-x). \qquad (2.3)$$

The estimator depends on the bandwidth h > 0, which acts as a tuning parameter. For a large bandwidth h, the estimate f̂(x) tends to be very slowly varying as a function of x, while small bandwidths will produce a more wiggly function estimate. The positivity of the kernel function K(·) guarantees a positive density estimate f̂(·), and the normalization ∫K(x)dx = 1 implies that ∫f̂(x)dx = 1, which is necessary for f̂(·) to be a density. Typically, the kernel function K(·) is chosen as a probability density which is symmetric around 0.

The smoothness of f̂(·) is inherited from the smoothness of the kernel: if the rth derivative K^(r)(x) exists for all x, then f̂^(r)(x) exists as well for all x (easy to verify using the chain rule of differentiation).

Popular kernels are the Gaussian kernel

$$K(x) = \varphi(x) = (2\pi)^{-1/2}e^{-x^2/2}$$

(the density of the N(0,1) distribution) or a kernel with finite support such as K(x) = (π/4) cos((π/2)x) 1(|x| ≤ 1). The Epanechnikov kernel, which is optimal with respect to mean squared error, is

$$K(x) = \frac{3}{4}\big(1 - |x|^2\big)\,1(|x| \le 1).$$

But far more important than the kernel is the bandwidth h, see Figure 2.2: its role and how to choose it are discussed below.

[Figure 2.2: Kernel density estimates of the Old Faithful eruption lengths; Gaussian kernel and several bandwidths h.]

2.3 The role of the bandwidth

The bandwidth h is often also called the "smoothing parameter": a moment of thought will reveal that for h → 0, we will have δ-spikes at every observation X_i, whereas f̂(·) = f̂_h(·) becomes smoother as h increases.
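A minimal sketch of the bandwidth's effect with density() (Gaussian kernel) on the Old Faithful eruption lengths; the bandwidth values are our own choices.

    data(faithful)
    x <- faithful$eruptions
    plot(density(x, bw = 0.04), main = "h = 0.04, 0.1, 0.3")  # wiggly estimate
    lines(density(x, bw = 0.1), col = 2)
    lines(density(x, bw = 0.3), col = 3)                      # larger h: smoother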

2.3.1 Variable bandwidths: k nearest neighbors

Instead of using a global bandwidth, we can use locally changing bandwidths h(x). The general idea is to use a large bandwidth for regions where the data is sparse. The k-nearest neighbor idea is to choose

$$h(x) = \text{Euclidean distance of } x \text{ to the } k\text{th nearest observation},$$

where k regulates the magnitude of the bandwidth. Note that generally, f̂_{h(x)}(·) will not be a density anymore, since the integral is not necessarily equal to one.

2.3.2 The bias-variance trade-off

We can formalize the behavior of f̂(·) when varying the bandwidth h in terms of bias and variance of the estimator. It is important to understand heuristically that the (absolute value of the) bias of f̂ increases and the variance of f̂ decreases as h increases. Therefore, if we want to minimize the mean squared error MSE(x) at a point x,

$$\mathrm{MSE}(x) = E\Big[\big(\hat f(x) - f(x)\big)^2\Big] = \big(E[\hat f(x)] - f(x)\big)^2 + \operatorname{Var}(\hat f(x)),$$

we are confronted with a bias-variance trade-off. As a consequence, this allows us, at least conceptually, to optimize the bandwidth parameter (namely, to minimize the mean squared error) in a well-defined, coherent way.

Instead of optimizing the mean squared error at a point x, one may want to optimize the integrated mean squared error (IMSE)

$$\mathrm{IMSE} = \int \mathrm{MSE}(x)\,dx,$$

which yields an integrated decomposition of squared bias and variance (integration is over the support of X). Since the integrand is non-negative, the order of integration (over the support of X and over the probability space of X) can be reversed, denoted as MISE (mean integrated squared error) and written as

$$\mathrm{MISE} = E\Big[\int\big(\hat f(x) - f(x)\big)^2\,dx\Big] = E[\mathrm{ISE}], \qquad (2.4)$$

where ISE = ∫(f̂(x) − f(x))² dx.

2.3.3 Asymptotic bias and variance

It is straightforward (using definitions) to give an expression for the exact bias and variance:

$$E[\hat f(x)] = \frac{1}{h}\int K\Big(\frac{x-y}{h}\Big)f(y)\,dy,$$

$$\operatorname{Var}(\hat f(x)) = \frac{1}{nh^2}\operatorname{Var}\Big(K\Big(\frac{x-X_i}{h}\Big)\Big)
= \frac{1}{nh^2}E\Big[K\Big(\frac{x-X_i}{h}\Big)^2\Big] - \frac{1}{nh^2}E\Big[K\Big(\frac{x-X_i}{h}\Big)\Big]^2$$
$$= n^{-1}h^{-2}\int K\Big(\frac{x-y}{h}\Big)^2 f(y)\,dy - n^{-1}\Big(h^{-1}\int K\Big(\frac{x-y}{h}\Big)f(y)\,dy\Big)^2. \qquad (2.5)$$

For the bias we therefore get (by a change of variable and K(−z) = K(z))

$$\mathrm{Bias}(x) = \frac{1}{h}\int K\Big(\frac{x-y}{h}\Big)f(y)\,dy - f(x)
\underset{z=(y-x)/h}{=} \int K(z)f(x+hz)\,dz - f(x) = \int K(z)\big(f(x+hz) - f(x)\big)\,dz. \qquad (2.6)$$

To approximate this expression in general, we invoke an asymptotic argument. We assume that h → 0 as the sample size n → ∞, that is: h = h_n → 0 with n h_n → ∞. This will imply that the bias goes to zero, since h_n → 0; the second condition requires that h_n goes to zero more slowly than 1/n, which turns out to imply that also the variance of the estimator will go to zero as n → ∞. To see this, we use a Taylor expansion of f, assuming that f is sufficiently smooth:

$$f(x+hz) = f(x) + hzf'(x) + \tfrac{1}{2}h^2z^2f''(x) + \dots$$

Plugging this into (2.6) yields

$$\mathrm{Bias}(x) = hf'(x)\underbrace{\int zK(z)\,dz}_{=\,0} + \tfrac{1}{2}h^2f''(x)\int z^2K(z)\,dz + \text{higher order terms in } h.$$

For the variance, we get from (2.5)

$$\operatorname{Var}(\hat f(x)) = n^{-1}h^{-2}\int K\Big(\frac{x-y}{h}\Big)^2 f(y)\,dy - \underbrace{n^{-1}\big(f(x) + \mathrm{Bias}(x)\big)^2}_{=\,O(n^{-1})}
= n^{-1}h^{-1}\int f(x-hz)K(z)^2\,dz + O(n^{-1}) = n^{-1}h^{-1}f(x)\int K(z)^2\,dz + o(n^{-1}h^{-1}),$$

assuming that f is smooth and hence f(x − hz) → f(x) as h_n → 0. In summary: for h = h_n → 0 and h_n n → ∞ as n → ∞,

$$\mathrm{Bias}(x) = h^2 f''(x)\int z^2K(z)\,dz/2 + o(h^2) \quad (n\to\infty),$$
$$\operatorname{Var}(\hat f(x)) = (nh)^{-1}f(x)\int K(z)^2\,dz + o((nh)^{-1}) \quad (n\to\infty).$$

The optimal bandwidth h = h_n which minimizes the leading term in the asymptotic MSE(x) can be calculated straightforwardly by solving ∂/∂h MSE(x) = 0:

$$h_{\mathrm{opt}}(x) = n^{-1/5}\Big(\frac{f(x)\int K^2(z)\,dz}{(f''(x))^2\,(\int z^2K(z)\,dz)^2}\Big)^{1/5}. \qquad (2.7)$$

Since it is not straightforward to estimate and use a local bandwidth h(x), one rather considers minimizing the MISE, i.e., ∫MSE(x)dx, which is asymptotically

$$\text{asympt. MISE} = \int \mathrm{Bias}(x)^2 + \operatorname{Var}(\hat f(x))\,dx = \frac{1}{4}h^4 R(f'')\,\sigma_K^4 + R(K)/(nh), \qquad (2.8)$$

where R(g) = ∫g²(x)dx and σ²_K = ∫x²K(x)dx, and the globally asymptotically optimal bandwidth becomes

$$h_{\mathrm{opt}} = n^{-1/5}\Big(\frac{R(K)/\sigma_K^4}{R(f'')}\Big)^{1/5}. \qquad (2.9)$$

By replacing h with h_opt, e.g., in (2.8), we see that both the variance and bias terms are of order O(n^{-4/5}), the optimal rate for the MISE and MSE(x). From section 2.4.1, this rate is also optimal for a much larger class of density estimators.

2.3.4 Estimating the bandwidth

As seen from (2.9), the asymptotically best bandwidth depends on R(f'') = ∫f''(x)²dx, which is unknown (whereas R(K) and σ²_K are known). It is possible to estimate f'' again by a kernel estimator with an "initial" bandwidth h_init (sometimes called a pilot bandwidth), yielding f̂''_{h_init}. Plugging this estimate into (2.9) yields an estimated bandwidth ĥ for the density estimator f̂(·) (the original problem): of course, ĥ depends on the initial bandwidth h_init, but choosing h_init in an ad-hoc way is less critical for the density estimator than choosing the bandwidth h itself. Furthermore, methods have been devised to determine h_init and h simultaneously (e.g., Sheather-Jones; in R using density(*, bw="SJ")).

Estimating local bandwidths

Note that the local bandwidth selection h_opt(x) in (2.7) is more problematic, mainly because f̂_{ĥ_opt(x)}(x) will not integrate to one without further normalization. On the other hand, it can be important to use locally varying bandwidths instead of a single global one in a kernel estimator, at the expense of being more difficult. The plug-in procedure outlined above can be applied locally, i.e., conceptually for each x, and hence describes how to estimate local bandwidths from data and how to implement a kernel estimator with locally varying bandwidths. In the related area of nonparametric regression, in section 3.2.1, we will show an example of locally changing bandwidths which are estimated based on an iterative version of the (local) plug-in idea above.

Other density estimators

There are quite a few other approaches to density estimation than the kernel estimators above (whereas in practice, the fixed-bandwidth kernel estimators are used predominantly because of their simplicity). One important approach in particular aims to estimate the log density log f(x) (setting f̂ = exp(log f)-hat), which has no positivity constraints and whose normal limit is a simple quadratic. One good implementation is in Kooperberg's R package logspline, where spline knots are placed by a stepwise algorithm minimizing an approximate BIC (or AIC). This can be seen as another version of locally varying bandwidths.
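A minimal sketch: the Sheather-Jones plug-in bandwidth in R, determined from the data itself.

    data(faithful)
    x <- faithful$eruptions
    bw.SJ(x)                       # Sheather-Jones plug-in bandwidth
    plot(density(x, bw = "SJ"))    # kernel estimate using it directly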

2.4 Higher dimensions

Quite many applications involve multivariate data. For simplicity, consider data which are i.i.d. realizations of d-dimensional random variables

$$X_1, \dots, X_n\ \text{i.i.d.} \sim f(x_1, \dots, x_d)\,dx_1\cdots dx_d,$$

where f(·) denotes the multivariate density. The multivariate kernel density estimator is, in its simplest form, defined as

$$\hat f(x) = \frac{1}{nh^d}\sum_{i=1}^n K\Big(\frac{x - X_i}{h}\Big),$$

where the kernel K(·) is now a function, defined for d-dimensional x, satisfying

$$K(u) \ge 0,\quad \int_{\mathbb{R}^d} K(u)\,du = 1,\quad \int_{\mathbb{R}^d} u\,K(u)\,du = 0,\quad \int_{\mathbb{R}^d} uu^\top K(u)\,du = I_d.$$

Usually, the kernel K(·) is chosen as a product of a kernel K_univ for univariate density estimation:

$$K(u) = \prod_{j=1}^d K_{\mathrm{univ}}(u_j).$$

If one additionally desires the multivariate kernel K(u) to be radially symmetric, it can be shown that K must be the multivariate normal (Gaussian) density, K(u) = c_d exp(−u^⊤u/2).

2.4.1 The curse of dimensionality

In practice, multivariate kernel density estimation is often restricted to dimension d = 2. The reason is that a higher-dimensional space (with d of medium size or large) will be only very sparsely populated by data points. In other words, there will be only very few neighboring data points to any value x in a higher-dimensional space, unless the sample size is extremely large. This phenomenon is also called the curse of dimensionality.

An implication of the curse of dimensionality is the following lower bound for the best mean squared error of nonparametric density estimators (assuming that the underlying density is twice differentiable): it has been shown that the best possible MSE rate is O(n^{-4/(4+d)}). The following table evaluates n^{-4/(4+d)} for various n and d:

                  d = 1    d = 2    d = 3    d = 5    d = 10
    n = 100      0.0251   0.0464   0.0720   0.1292   0.2683
    n = 1000     0.0040   0.0100   0.0193   0.0464   0.1389
    n = 100000   0.0001   0.0005   0.0014   0.0060   0.0373

Thus, for d = 10, the rate with n = 100,000 is still 1.5 times worse than for d = 1 and n = 100.
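The table entries follow directly from the rate formula; a minimal sketch reproducing them:

    n <- c(100, 1000, 100000)
    d <- c(1, 2, 3, 5, 10)
    rates <- outer(n, d, function(n, d) n^(-4 / (4 + d)))
    dimnames(rates) <- list(paste("n =", n), paste("d =", d))
    round(rates, 4)
    rates[3, 5] / rates[1, 1]    # approx. 1.5, as stated in the text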


Chapter 3

Nonparametric Regression

3.1 Introduction

We consider here nonparametric regression with one predictor variable. Practically relevant generalizations to more than one or two predictor variables are not so easy due to the curse of dimensionality mentioned in section 2.4.1, and they often require different approaches, as will be discussed later in Chapter 7.

Figure 3.1 shows (several identical) scatter plots of (x_i, Y_i) (i = 1, ..., n). We can model such data as

$$Y_i = m(x_i) + \varepsilon_i, \qquad (3.1)$$

where ε_1, ..., ε_n are i.i.d. with E[ε_i] = 0 and m: R → R is an "arbitrary" function. The function m(·) is called the nonparametric regression function, and it satisfies m(x) = E[Y|x]. The restriction we make for m(·) is that it fulfills some kind of smoothness conditions. The regression function in Figure 3.1 does not appear to be linear in x, and linear regression is not a good model. The flexibility to allow for an "arbitrary" regression function is very desirable; but of course, such flexibility has its price, namely an inferior estimation accuracy than for linear regression.

3.2 The kernel regression estimator

We can view the regression function in (3.1) as

$$m(x) = E[Y \mid X = x]$$

(assuming that X is random and X_i = x_i are realized values of the random variables). We can express this conditional expectation as

$$\int_{\mathbb{R}} y\,f_{Y|X}(y \mid x)\,dy = \frac{\int_{\mathbb{R}} y\,f_{X,Y}(x,y)\,dy}{f_X(x)},$$

where f_{Y|X}, f_{X,Y}, f_X denote the conditional, joint and marginal densities. We can now plug in the univariate and bivariate kernel density estimates (all with the same univariate kernel K)

$$\hat f_X(x) = \frac{\sum_{i=1}^n K\big(\frac{x-x_i}{h}\big)}{nh},\qquad
\hat f_{X,Y}(x,y) = \frac{\sum_{i=1}^n K\big(\frac{x-x_i}{h}\big)\,K\big(\frac{y-Y_i}{h}\big)}{nh^2}.$$
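Carrying out this plug-in (the kernel in y integrates to one and has mean Y_i, so integrating out y) yields the Nadaraya-Watson kernel regression estimator m̂(x) = Σ_{i=1}^n K((x − x_i)/h) Y_i / Σ_{i=1}^n K((x − x_i)/h). A minimal sketch on simulated data of our own choosing; ksmooth() in base R implements this estimator (its bandwidth argument is on a different scale than h).

    set.seed(3)
    x <- sort(runif(100, 0, 3))
    y <- sin(2 * x) + rnorm(100, sd = 0.3)   # a nonlinear m(x), our choice
    plot(x, y)
    lines(ksmooth(x, y, kernel = "normal", bandwidth = 0.5), col = 2)

    m.hat <- function(u, h = 0.2)            # the estimator coded directly
      sapply(u, function(u0) { w <- dnorm((u0 - x) / h); sum(w * y) / sum(w) })
    lines(x, m.hat(x), col = 4, lty = 2)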


More information

IN a recent article, Geary [1972] discussed the merit of taking first differences

IN a recent article, Geary [1972] discussed the merit of taking first differences The Efficiency f Taking First Differences in Regressin Analysis: A Nte J. A. TILLMAN IN a recent article, Geary [1972] discussed the merit f taking first differences t deal with the prblems that trends

More information

MATHEMATICS SYLLABUS SECONDARY 5th YEAR

MATHEMATICS SYLLABUS SECONDARY 5th YEAR Eurpean Schls Office f the Secretary-General Pedaggical Develpment Unit Ref. : 011-01-D-8-en- Orig. : EN MATHEMATICS SYLLABUS SECONDARY 5th YEAR 6 perid/week curse APPROVED BY THE JOINT TEACHING COMMITTEE

More information

AP Statistics Practice Test Unit Three Exploring Relationships Between Variables. Name Period Date

AP Statistics Practice Test Unit Three Exploring Relationships Between Variables. Name Period Date AP Statistics Practice Test Unit Three Explring Relatinships Between Variables Name Perid Date True r False: 1. Crrelatin and regressin require explanatry and respnse variables. 1. 2. Every least squares

More information

LHS Mathematics Department Honors Pre-Calculus Final Exam 2002 Answers

LHS Mathematics Department Honors Pre-Calculus Final Exam 2002 Answers LHS Mathematics Department Hnrs Pre-alculus Final Eam nswers Part Shrt Prblems The table at the right gives the ppulatin f Massachusetts ver the past several decades Using an epnential mdel, predict the

More information

CS 477/677 Analysis of Algorithms Fall 2007 Dr. George Bebis Course Project Due Date: 11/29/2007

CS 477/677 Analysis of Algorithms Fall 2007 Dr. George Bebis Course Project Due Date: 11/29/2007 CS 477/677 Analysis f Algrithms Fall 2007 Dr. Gerge Bebis Curse Prject Due Date: 11/29/2007 Part1: Cmparisn f Srting Algrithms (70% f the prject grade) The bjective f the first part f the assignment is

More information

Computational modeling techniques

Computational modeling techniques Cmputatinal mdeling techniques Lecture 4: Mdel checing fr ODE mdels In Petre Department f IT, Åb Aademi http://www.users.ab.fi/ipetre/cmpmd/ Cntent Stichimetric matrix Calculating the mass cnservatin relatins

More information

Computational modeling techniques

Computational modeling techniques Cmputatinal mdeling techniques Lecture 2: Mdeling change. In Petre Department f IT, Åb Akademi http://users.ab.fi/ipetre/cmpmd/ Cntent f the lecture Basic paradigm f mdeling change Examples Linear dynamical

More information

Discussion on Regularized Regression for Categorical Data (Tutz and Gertheiss)

Discussion on Regularized Regression for Categorical Data (Tutz and Gertheiss) Discussin n Regularized Regressin fr Categrical Data (Tutz and Gertheiss) Peter Bühlmann, Ruben Dezeure Seminar fr Statistics, Department f Mathematics, ETH Zürich, Switzerland Address fr crrespndence:

More information

Least Squares Optimal Filtering with Multirate Observations

Least Squares Optimal Filtering with Multirate Observations Prc. 36th Asilmar Cnf. n Signals, Systems, and Cmputers, Pacific Grve, CA, Nvember 2002 Least Squares Optimal Filtering with Multirate Observatins Charles W. herrien and Anthny H. Hawes Department f Electrical

More information

In SMV I. IAML: Support Vector Machines II. This Time. The SVM optimization problem. We saw:

In SMV I. IAML: Support Vector Machines II. This Time. The SVM optimization problem. We saw: In SMV I IAML: Supprt Vectr Machines II Nigel Gddard Schl f Infrmatics Semester 1 We sa: Ma margin trick Gemetry f the margin and h t cmpute it Finding the ma margin hyperplane using a cnstrained ptimizatin

More information

1 The limitations of Hartree Fock approximation

1 The limitations of Hartree Fock approximation Chapter: Pst-Hartree Fck Methds - I The limitatins f Hartree Fck apprximatin The n electrn single determinant Hartree Fck wave functin is the variatinal best amng all pssible n electrn single determinants

More information

MATCHING TECHNIQUES. Technical Track Session VI. Emanuela Galasso. The World Bank

MATCHING TECHNIQUES. Technical Track Session VI. Emanuela Galasso. The World Bank MATCHING TECHNIQUES Technical Track Sessin VI Emanuela Galass The Wrld Bank These slides were develped by Christel Vermeersch and mdified by Emanuela Galass fr the purpse f this wrkshp When can we use

More information

Math Foundations 20 Work Plan

Math Foundations 20 Work Plan Math Fundatins 20 Wrk Plan Units / Tpics 20.8 Demnstrate understanding f systems f linear inequalities in tw variables. Time Frame December 1-3 weeks 6-10 Majr Learning Indicatrs Identify situatins relevant

More information

Lead/Lag Compensator Frequency Domain Properties and Design Methods

Lead/Lag Compensator Frequency Domain Properties and Design Methods Lectures 6 and 7 Lead/Lag Cmpensatr Frequency Dmain Prperties and Design Methds Definitin Cnsider the cmpensatr (ie cntrller Fr, it is called a lag cmpensatr s K Fr s, it is called a lead cmpensatr Ntatin

More information

Part 3 Introduction to statistical classification techniques

Part 3 Introduction to statistical classification techniques Part 3 Intrductin t statistical classificatin techniques Machine Learning, Part 3, March 07 Fabi Rli Preamble ØIn Part we have seen that if we knw: Psterir prbabilities P(ω i / ) Or the equivalent terms

More information

x 1 Outline IAML: Logistic Regression Decision Boundaries Example Data

x 1 Outline IAML: Logistic Regression Decision Boundaries Example Data Outline IAML: Lgistic Regressin Charles Suttn and Victr Lavrenk Schl f Infrmatics Semester Lgistic functin Lgistic regressin Learning lgistic regressin Optimizatin The pwer f nn-linear basis functins Least-squares

More information

Linear programming III

Linear programming III Linear prgramming III Review 1/33 What have cvered in previus tw classes LP prblem setup: linear bjective functin, linear cnstraints. exist extreme pint ptimal slutin. Simplex methd: g thrugh extreme pint

More information

Lecture 3: Principal Components Analysis (PCA)

Lecture 3: Principal Components Analysis (PCA) Lecture 3: Principal Cmpnents Analysis (PCA) Reading: Sectins 6.3.1, 10.1, 10.2, 10.4 STATS 202: Data mining and analysis Jnathan Taylr, 9/28 Slide credits: Sergi Bacallad 1 / 24 The bias variance decmpsitin

More information

Support Vector Machines and Flexible Discriminants

Support Vector Machines and Flexible Discriminants 12 Supprt Vectr Machines and Flexible Discriminants This is page 417 Printer: Opaque this 12.1 Intrductin In this chapter we describe generalizatins f linear decisin bundaries fr classificatin. Optimal

More information

CHAPTER 4 DIAGNOSTICS FOR INFLUENTIAL OBSERVATIONS

CHAPTER 4 DIAGNOSTICS FOR INFLUENTIAL OBSERVATIONS CHAPTER 4 DIAGNOSTICS FOR INFLUENTIAL OBSERVATIONS 1 Influential bservatins are bservatins whse presence in the data can have a distrting effect n the parameter estimates and pssibly the entire analysis,

More information

Determining the Accuracy of Modal Parameter Estimation Methods

Determining the Accuracy of Modal Parameter Estimation Methods Determining the Accuracy f Mdal Parameter Estimatin Methds by Michael Lee Ph.D., P.E. & Mar Richardsn Ph.D. Structural Measurement Systems Milpitas, CA Abstract The mst cmmn type f mdal testing system

More information

STATS216v Introduction to Statistical Learning Stanford University, Summer Practice Final (Solutions) Duration: 3 hours

STATS216v Introduction to Statistical Learning Stanford University, Summer Practice Final (Solutions) Duration: 3 hours STATS216v Intrductin t Statistical Learning Stanfrd University, Summer 2016 Practice Final (Slutins) Duratin: 3 hurs Instructins: (This is a practice final and will nt be graded.) Remember the university

More information

February 28, 2013 COMMENTS ON DIFFUSION, DIFFUSIVITY AND DERIVATION OF HYPERBOLIC EQUATIONS DESCRIBING THE DIFFUSION PHENOMENA

February 28, 2013 COMMENTS ON DIFFUSION, DIFFUSIVITY AND DERIVATION OF HYPERBOLIC EQUATIONS DESCRIBING THE DIFFUSION PHENOMENA February 28, 2013 COMMENTS ON DIFFUSION, DIFFUSIVITY AND DERIVATION OF HYPERBOLIC EQUATIONS DESCRIBING THE DIFFUSION PHENOMENA Mental Experiment regarding 1D randm walk Cnsider a cntainer f gas in thermal

More information

Preparation work for A2 Mathematics [2017]

Preparation work for A2 Mathematics [2017] Preparatin wrk fr A2 Mathematics [2017] The wrk studied in Y12 after the return frm study leave is frm the Cre 3 mdule f the A2 Mathematics curse. This wrk will nly be reviewed during Year 13, it will

More information

This section is primarily focused on tools to aid us in finding roots/zeros/ -intercepts of polynomials. Essentially, our focus turns to solving.

This section is primarily focused on tools to aid us in finding roots/zeros/ -intercepts of polynomials. Essentially, our focus turns to solving. Sectin 3.2: Many f yu WILL need t watch the crrespnding vides fr this sectin n MyOpenMath! This sectin is primarily fcused n tls t aid us in finding rts/zers/ -intercepts f plynmials. Essentially, ur fcus

More information

Lecture 10, Principal Component Analysis

Lecture 10, Principal Component Analysis Principal Cmpnent Analysis Lecture 10, Principal Cmpnent Analysis Ha Helen Zhang Fall 2017 Ha Helen Zhang Lecture 10, Principal Cmpnent Analysis 1 / 16 Principal Cmpnent Analysis Lecture 10, Principal

More information

Hypothesis Tests for One Population Mean

Hypothesis Tests for One Population Mean Hypthesis Tests fr One Ppulatin Mean Chapter 9 Ala Abdelbaki Objective Objective: T estimate the value f ne ppulatin mean Inferential statistics using statistics in rder t estimate parameters We will be

More information

NUMBERS, MATHEMATICS AND EQUATIONS

NUMBERS, MATHEMATICS AND EQUATIONS AUSTRALIAN CURRICULUM PHYSICS GETTING STARTED WITH PHYSICS NUMBERS, MATHEMATICS AND EQUATIONS An integral part t the understanding f ur physical wrld is the use f mathematical mdels which can be used t

More information

ENSC Discrete Time Systems. Project Outline. Semester

ENSC Discrete Time Systems. Project Outline. Semester ENSC 49 - iscrete Time Systems Prject Outline Semester 006-1. Objectives The gal f the prject is t design a channel fading simulatr. Upn successful cmpletin f the prject, yu will reinfrce yur understanding

More information

Checking the resolved resonance region in EXFOR database

Checking the resolved resonance region in EXFOR database Checking the reslved resnance regin in EXFOR database Gttfried Bertn Sciété de Calcul Mathématique (SCM) Oscar Cabells OECD/NEA Data Bank JEFF Meetings - Sessin JEFF Experiments Nvember 0-4, 017 Bulgne-Billancurt,

More information

MATCHING TECHNIQUES Technical Track Session VI Céline Ferré The World Bank

MATCHING TECHNIQUES Technical Track Session VI Céline Ferré The World Bank MATCHING TECHNIQUES Technical Track Sessin VI Céline Ferré The Wrld Bank When can we use matching? What if the assignment t the treatment is nt dne randmly r based n an eligibility index, but n the basis

More information

Modelling of Clock Behaviour. Don Percival. Applied Physics Laboratory University of Washington Seattle, Washington, USA

Modelling of Clock Behaviour. Don Percival. Applied Physics Laboratory University of Washington Seattle, Washington, USA Mdelling f Clck Behaviur Dn Percival Applied Physics Labratry University f Washingtn Seattle, Washingtn, USA verheads and paper fr talk available at http://faculty.washingtn.edu/dbp/talks.html 1 Overview

More information

Linear Methods for Regression

Linear Methods for Regression 3 Linear Methds fr Regressin This is page 43 Printer: Opaque this 3.1 Intrductin A linear regressin mdel assumes that the regressin functin E(Y X) is linear in the inputs X 1,...,X p. Linear mdels were

More information

Fall 2013 Physics 172 Recitation 3 Momentum and Springs

Fall 2013 Physics 172 Recitation 3 Momentum and Springs Fall 03 Physics 7 Recitatin 3 Mmentum and Springs Purpse: The purpse f this recitatin is t give yu experience wrking with mmentum and the mmentum update frmula. Readings: Chapter.3-.5 Learning Objectives:.3.

More information

MODULE FOUR. This module addresses functions. SC Academic Elementary Algebra Standards:

MODULE FOUR. This module addresses functions. SC Academic Elementary Algebra Standards: MODULE FOUR This mdule addresses functins SC Academic Standards: EA-3.1 Classify a relatinship as being either a functin r nt a functin when given data as a table, set f rdered pairs, r graph. EA-3.2 Use

More information

the results to larger systems due to prop'erties of the projection algorithm. First, the number of hidden nodes must

the results to larger systems due to prop'erties of the projection algorithm. First, the number of hidden nodes must M.E. Aggune, M.J. Dambrg, M.A. El-Sharkawi, R.J. Marks II and L.E. Atlas, "Dynamic and static security assessment f pwer systems using artificial neural netwrks", Prceedings f the NSF Wrkshp n Applicatins

More information

ENGI 4430 Parametric Vector Functions Page 2-01

ENGI 4430 Parametric Vector Functions Page 2-01 ENGI 4430 Parametric Vectr Functins Page -01. Parametric Vectr Functins (cntinued) Any nn-zer vectr r can be decmpsed int its magnitude r and its directin: r rrˆ, where r r 0 Tangent Vectr: dx dy dz dr

More information

Differentiation Applications 1: Related Rates

Differentiation Applications 1: Related Rates Differentiatin Applicatins 1: Related Rates 151 Differentiatin Applicatins 1: Related Rates Mdel 1: Sliding Ladder 10 ladder y 10 ladder 10 ladder A 10 ft ladder is leaning against a wall when the bttm

More information

Computational modeling techniques

Computational modeling techniques Cmputatinal mdeling techniques Lecture 11: Mdeling with systems f ODEs In Petre Department f IT, Ab Akademi http://www.users.ab.fi/ipetre/cmpmd/ Mdeling with differential equatins Mdeling strategy Fcus

More information

Methods for Determination of Mean Speckle Size in Simulated Speckle Pattern

Methods for Determination of Mean Speckle Size in Simulated Speckle Pattern 0.478/msr-04-004 MEASUREMENT SCENCE REVEW, Vlume 4, N. 3, 04 Methds fr Determinatin f Mean Speckle Size in Simulated Speckle Pattern. Hamarvá, P. Šmíd, P. Hrváth, M. Hrabvský nstitute f Physics f the Academy

More information

Building to Transformations on Coordinate Axis Grade 5: Geometry Graph points on the coordinate plane to solve real-world and mathematical problems.

Building to Transformations on Coordinate Axis Grade 5: Geometry Graph points on the coordinate plane to solve real-world and mathematical problems. Building t Transfrmatins n Crdinate Axis Grade 5: Gemetry Graph pints n the crdinate plane t slve real-wrld and mathematical prblems. 5.G.1. Use a pair f perpendicular number lines, called axes, t define

More information

Admissibility Conditions and Asymptotic Behavior of Strongly Regular Graphs

Admissibility Conditions and Asymptotic Behavior of Strongly Regular Graphs Admissibility Cnditins and Asympttic Behavir f Strngly Regular Graphs VASCO MOÇO MANO Department f Mathematics University f Prt Oprt PORTUGAL vascmcman@gmailcm LUÍS ANTÓNIO DE ALMEIDA VIEIRA Department

More information

CAUSAL INFERENCE. Technical Track Session I. Phillippe Leite. The World Bank

CAUSAL INFERENCE. Technical Track Session I. Phillippe Leite. The World Bank CAUSAL INFERENCE Technical Track Sessin I Phillippe Leite The Wrld Bank These slides were develped by Christel Vermeersch and mdified by Phillippe Leite fr the purpse f this wrkshp Plicy questins are causal

More information

The standards are taught in the following sequence.

The standards are taught in the following sequence. B L U E V A L L E Y D I S T R I C T C U R R I C U L U M MATHEMATICS Third Grade In grade 3, instructinal time shuld fcus n fur critical areas: (1) develping understanding f multiplicatin and divisin and

More information

You need to be able to define the following terms and answer basic questions about them:

You need to be able to define the following terms and answer basic questions about them: CS440/ECE448 Sectin Q Fall 2017 Midterm Review Yu need t be able t define the fllwing terms and answer basic questins abut them: Intr t AI, agents and envirnments Pssible definitins f AI, prs and cns f

More information

SAMPLING DYNAMICAL SYSTEMS

SAMPLING DYNAMICAL SYSTEMS SAMPLING DYNAMICAL SYSTEMS Melvin J. Hinich Applied Research Labratries The University f Texas at Austin Austin, TX 78713-8029, USA (512) 835-3278 (Vice) 835-3259 (Fax) hinich@mail.la.utexas.edu ABSTRACT

More information

Statistical Learning. 2.1 What Is Statistical Learning?

Statistical Learning. 2.1 What Is Statistical Learning? 2 Statistical Learning 2.1 What Is Statistical Learning? In rder t mtivate ur study f statistical learning, we begin with a simple example. Suppse that we are statistical cnsultants hired by a client t

More information

Stats Classification Ji Zhu, Michigan Statistics 1. Classification. Ji Zhu 445C West Hall

Stats Classification Ji Zhu, Michigan Statistics 1. Classification. Ji Zhu 445C West Hall Stats 415 - Classificatin Ji Zhu, Michigan Statistics 1 Classificatin Ji Zhu 445C West Hall 734-936-2577 jizhu@umich.edu Stats 415 - Classificatin Ji Zhu, Michigan Statistics 2 Examples f Classificatin

More information

Lyapunov Stability Stability of Equilibrium Points

Lyapunov Stability Stability of Equilibrium Points Lyapunv Stability Stability f Equilibrium Pints 1. Stability f Equilibrium Pints - Definitins In this sectin we cnsider n-th rder nnlinear time varying cntinuus time (C) systems f the frm x = f ( t, x),

More information

Preparation work for A2 Mathematics [2018]

Preparation work for A2 Mathematics [2018] Preparatin wrk fr A Mathematics [018] The wrk studied in Y1 will frm the fundatins n which will build upn in Year 13. It will nly be reviewed during Year 13, it will nt be retaught. This is t allw time

More information

Function notation & composite functions Factoring Dividing polynomials Remainder theorem & factor property

Function notation & composite functions Factoring Dividing polynomials Remainder theorem & factor property Functin ntatin & cmpsite functins Factring Dividing plynmials Remainder therem & factr prperty Can d s by gruping r by: Always lk fr a cmmn factr first 2 numbers that ADD t give yu middle term and MULTIPLY

More information

We can see from the graph above that the intersection is, i.e., [ ).

We can see from the graph above that the intersection is, i.e., [ ). MTH 111 Cllege Algebra Lecture Ntes July 2, 2014 Functin Arithmetic: With nt t much difficulty, we ntice that inputs f functins are numbers, and utputs f functins are numbers. S whatever we can d with

More information

Revision: August 19, E Main Suite D Pullman, WA (509) Voice and Fax

Revision: August 19, E Main Suite D Pullman, WA (509) Voice and Fax .7.4: Direct frequency dmain circuit analysis Revisin: August 9, 00 5 E Main Suite D Pullman, WA 9963 (509) 334 6306 ice and Fax Overview n chapter.7., we determined the steadystate respnse f electrical

More information

A New Evaluation Measure. J. Joiner and L. Werner. The problems of evaluation and the needed criteria of evaluation

A New Evaluation Measure. J. Joiner and L. Werner. The problems of evaluation and the needed criteria of evaluation III-l III. A New Evaluatin Measure J. Jiner and L. Werner Abstract The prblems f evaluatin and the needed criteria f evaluatin measures in the SMART system f infrmatin retrieval are reviewed and discussed.

More information

CHAPTER 3 INEQUALITIES. Copyright -The Institute of Chartered Accountants of India

CHAPTER 3 INEQUALITIES. Copyright -The Institute of Chartered Accountants of India CHAPTER 3 INEQUALITIES Cpyright -The Institute f Chartered Accuntants f India INEQUALITIES LEARNING OBJECTIVES One f the widely used decisin making prblems, nwadays, is t decide n the ptimal mix f scarce

More information

Homology groups of disks with holes

Homology groups of disks with holes Hmlgy grups f disks with hles THEOREM. Let p 1,, p k } be a sequence f distinct pints in the interir unit disk D n where n 2, and suppse that fr all j the sets E j Int D n are clsed, pairwise disjint subdisks.

More information

1996 Engineering Systems Design and Analysis Conference, Montpellier, France, July 1-4, 1996, Vol. 7, pp

1996 Engineering Systems Design and Analysis Conference, Montpellier, France, July 1-4, 1996, Vol. 7, pp THE POWER AND LIMIT OF NEURAL NETWORKS T. Y. Lin Department f Mathematics and Cmputer Science San Jse State University San Jse, Califrnia 959-003 tylin@cs.ssu.edu and Bereley Initiative in Sft Cmputing*

More information

Enhancing Performance of MLP/RBF Neural Classifiers via an Multivariate Data Distribution Scheme

Enhancing Performance of MLP/RBF Neural Classifiers via an Multivariate Data Distribution Scheme Enhancing Perfrmance f / Neural Classifiers via an Multivariate Data Distributin Scheme Halis Altun, Gökhan Gelen Nigde University, Electrical and Electrnics Engineering Department Nigde, Turkey haltun@nigde.edu.tr

More information

Mathematics Methods Units 1 and 2

Mathematics Methods Units 1 and 2 Mathematics Methds Units 1 and 2 Mathematics Methds is an ATAR curse which fcuses n the use f calculus and statistical analysis. The study f calculus prvides a basis fr understanding rates f change in

More information

More Tutorial at

More Tutorial at Answer each questin in the space prvided; use back f page if extra space is needed. Answer questins s the grader can READILY understand yur wrk; nly wrk n the exam sheet will be cnsidered. Write answers,

More information