Introduction to Regression


1 Introduction to Regression p. 1/97 Introduction to Regression Chad Schafer Carnegie Mellon University

2 Introduction to Regression p. 1/97 Acknowledgement Larry Wasserman, All of Nonparametric Statistics, Springer, 2006

3 Introduction to Regression p. 2/97 Outline General Concepts of Regression, Bias-Variance Tradeoff Linear Smoothers Cross Validation Local Polynomial Regression Confidence Bands Basis Methods: Splines Multiple Regression

4 Introduction to Regression p. 3/97 Basic Concepts in Regression The Regression Problem: Observe (X_1, Y_1), ..., (X_n, Y_n); estimate f(x) = E(Y | X = x). Equivalently, Y_i = f(X_i) + ε_i, where E(ε_i) = 0. Simple estimators: f̂(x) = β̂_0 + β̂_1 x (parametric); f̂(x) = mean{Y_i : |X_i − x| ≤ h} (nonparametric)
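
To make the two simple estimators concrete, here is a minimal R sketch (not from the slides) that simulates data from a hypothetical regression function and evaluates both estimators; the curve, sample size, and bandwidth are arbitrary illustrative choices.

set.seed(1)
n <- 200
x <- runif(n)
f <- function(x) sin(2 * pi * x)              # hypothetical true regression function
y <- f(x) + rnorm(n, sd = 0.3)
# Parametric estimator: least-squares line
beta <- coef(lm(y ~ x))
fhat.lin <- function(x0) unname(beta[1] + beta[2] * x0)
# Nonparametric estimator: local average over the window |X_i - x0| <= h
h <- 0.1
fhat.loc <- function(x0) mean(y[abs(x - x0) <= h])
c(true = f(0.25), linear = fhat.lin(0.25), local = fhat.loc(0.25))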

5 Introduction to Regression p. 3/97 Next... We will first quickly look at some simple examples from astronomy

6 Introduction to Regression p. 4/97 Example: Type Ia Supernovae [Figure: distance modulus versus redshift for the 182 Type Ia supernovae of Riess, et al. (2007).]

7 Introduction to Regression p. 5/97 Example: Type Ia Supernovae Assumption: Observed pairs (z_i, Y_i) are realizations of Y_i = f(z_i) + σ_i ε_i, where the ε_i are i.i.d. standard normal. The error standard deviations σ_i are assumed known. Objective: Estimate f(·).

8 Introduction to Regression p. 6/97 Example: Type Ia Supernovae [Figure: fourth-order polynomial fit of distance modulus versus redshift.]

9 Introduction to Regression p. 7/97 Example: Type Ia Supernovae [Figure: fifth-order polynomial fit of distance modulus versus redshift.]

10 Introduction to Regression p. 8/97 Example: Type Ia Supernovae [Figure: nonparametric estimate of distance modulus versus redshift.]

11 Introduction to Regression p. 9/97 Example: Type Ia Supernovae ΛCDM, two-parameter model: f(z) = 5 log_10 [ (c(1 + z)/H_0) ∫_0^z du / √(Ω_m (1 + u)^3 + (1 − Ω_m)) ]

12 Introduction to Regression p. 10/97 Example: Type Ia Supernovae [Figure: the fitted model overlaid on the distance modulus versus redshift data.] Estimate with H_0 and Ω_m set at their maximum likelihood estimates.

13 Introduction to Regression p. 11/97 Example: Dark Energy Y_i = f(z_i) + ε_i, i = 1, ..., n, where Y_i is the luminosity of the i-th supernova and z_i is its redshift. Want to estimate the equation of state w(z), a transformation w = T(f, f', f'') of the regression function and its first two derivatives: w(z) = −1 + ((1 + z)/3) [3 H_0^2 Ω_M (1 + z)^2 + 2 f''(z)/(f'(z))^3] / [H_0^2 Ω_M (1 + z)^3 − 1/(f'(z))^2].

14 Introduction to Regression p. 11/97 Next... We will now discuss the fundamental challenge of constructing a regression model: How complex should it be?

15 Introduction to Regression p. 12/97 The Bias-Variance Tradeoff Must choose the model to achieve a balance between two extremes. Too simple: high bias, i.e., E(f̂(x)) not close to f(x); precise, but not accurate. Too complex: high variance, i.e., Var(f̂(x)) large; accurate, but not precise. This is the classic bias-variance tradeoff; it could also be called the accuracy-precision tradeoff.

16 Introduction to Regression p. 13/97 The Bias-Variance Tradeoff Linear Regression Model: Fit models of the form f(x) = β_0 + Σ_{i=1}^p β_i g_i(x), where the g_i(·) are fixed, specified functions. Increasing p yields a more complex model.

17 Introduction to Regression p. 14/97 The Bias-Variance Tradeoff Nonparametric Case: Every nonparametric procedure requires a smoothing parameter h. Consider the regression estimator based on local averaging: f̂(x) = mean{Y_i : |X_i − x| ≤ h}. Increasing h yields a smoother estimate f̂.

18 Introduction to Regression p. 15/97 The Bias-Variance Tradeoff Can formalize this via an appropriate measure of error in the estimator. Squared error loss: L(f(x), f̂(x)) = (f(x) − f̂(x))^2. Mean squared error (MSE, or risk): MSE = R(f(x), f̂(x)) = E(L(f(x), f̂(x))).

19 Introduction to Regression p. 16/97 The Bias-Variance Tradeoff The key result: R(f(x), f̂(x)) = bias_x^2 + variance_x, where bias_x = E(f̂(x)) − f(x) and variance_x = Var(f̂(x)). Often written as MSE = BIAS^2 + VARIANCE.

20 Introduction to Regression p. 17/97 The Bias-Variance Tradeoff Mean Integrated Squared Error: MISE = ∫ R(f(x), f̂(x)) dx = ∫ bias_x^2 dx + ∫ variance_x dx. Empirical version: (1/n) Σ_{i=1}^n R(f(x_i), f̂(x_i)).

21 Introduction to Regression p. 18/97 The Bias-Variance Tradeoff [Figure: risk, squared bias, and variance as functions of the amount of smoothing; squared bias increases and variance decreases as smoothing goes from less to more, and the risk is minimized at an intermediate, optimal amount of smoothing.]

22 Introduction to Regression p. 19/97 The Bias-Variance Tradeoff For nonparametric procedures, when x is one-dimensional, MISE ≈ c_1 h^4 + c_2/(nh), which is minimized at h = O(1/n^{1/5}). Hence MISE = O(1/n^{4/5}), whereas for parametric problems MISE = O(1/n).

23 Introduction to Regression p. 19/97 Next... The term nonparametric is confusing. People ask: Is this not just a parametric model where there are many more parameters?

24 Introduction to Regression p. 20/97 What is nonparametric? [Diagram: Data and Assumptions both feed into the Estimate.] In the parametric case, the influence of the assumptions is fixed.

25 Introduction to Regression p. 21/97 What is nonparametric? [Diagram: Data and Assumptions feed into the Estimate, with the weight given to the assumptions controlled by h.] In the nonparametric case, the influence of the assumptions is controlled by h, where h = O(1/n^{1/5}).

26 Introduction to Regression p. 22/97 Regression Parametric Regression: Y = β_0 + β_1 X + ε; Y = β_0 + β_1 X_1 + ··· + β_d X_d + ε; Y = β_0 e^{β_1 X_1} + β_2 X_2 + ε. Nonparametric Regression: Y = f(X) + ε; Y = f_1(X_1) + ··· + f_d(X_d) + ε; Y = f(X_1, X_2) + ε.

27 Introduction to Regression p. 22/97 Next... Now we will begin to discuss different procedures for fitting regression models, and some elements they have in common. Namely, they are all linear smoothers.

28 Introduction to Regression p. 23/97 Regression Methods: Linear Regression Binning Local averaging Kernels Local polynomials Splines

29 Introduction to Regression p. 24/97 Linear Smoothers All are linear smoothers: f̂(x) = Σ_i Y_i l_i(x) for some weights l_1(x), ..., l_n(x). The vector of weights depends on the target point x. Each method involves specification of a parameter that controls model smoothness/complexity (either p or h).

30 Introduction to Regression p. 25/97 Linear Smoothers If f̂(x) = Σ_{i=1}^n l_i(x) Y_i, then Ŷ = L Y, where Ŷ = (f̂(X_1), ..., f̂(X_n))^T, Y = (Y_1, ..., Y_n)^T, and L is the n × n smoothing matrix whose (i, j) entry is l_j(X_i), so row i holds the weights l_1(X_i), ..., l_n(X_i). The effective degrees of freedom is ν = trace(L) = Σ_{i=1}^n L_ii. It can be interpreted as the number of parameters in the model.
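
As an illustration (not from the slides), the following R sketch builds the smoothing matrix L for a simple kernel smoother, with a Gaussian kernel, simulated data, and an arbitrary bandwidth, and then computes the effective degrees of freedom.

set.seed(1)
n <- 300
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
h <- 0.1
K <- function(u) dnorm(u)                        # Gaussian kernel (illustrative choice)
W <- outer(x, x, function(a, b) K((a - b) / h))  # W[i, j] = K((X_i - X_j) / h)
L <- W / rowSums(W)                              # row i holds l_1(X_i), ..., l_n(X_i)
yhat <- as.vector(L %*% y)                       # fitted values, Yhat = L Y
nu <- sum(diag(L))                               # effective degrees of freedom, trace(L)
nu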

31 Introduction to Regression p. 26/97 Linear Smoothers Consider the extreme cases: When l_i(X_j) = 1/n for all i, j, then f̂(X_j) = Ȳ for all j, and ν = 1. When l_i(X_j) = δ_ij for all i, j, then f̂(X_j) = Y_j, and ν = n.

32 Introduction to Regression p. 26/97 Next... Linear regression models are the classic approach to regression. Here, the response is modelled as a linear combination of p selected predictors.

33 Introduction to Regression p. 27/97 Linear Regression Assume that Y = β_0 + Σ_{i=1}^p β_i g_i(x) + ε, where the g_i(·) are specified functions, not estimated. Assume that E(ε) = 0 and Var(ε) = σ^2. Often ε is assumed to be Gaussian, but this is not essential.

34 Introduction to Regression p. 28/97 Linear Regression The Design Matrix: D is the n × (p + 1) matrix whose i-th row is (1, g_1(X_i), g_2(X_i), ..., g_p(X_i)). Then L = D(D^T D)^{-1} D^T is the projection matrix. Note that ν = trace(L) = p + 1.

35 Introduction to Regression p. 29/97 Linear Regression Let Y be an n-vector filled with Y_1, Y_2, ..., Y_n. The Least Squares Estimate of β: β̂ = (D^T D)^{-1} D^T Y. The Fitted Values: Ŷ = LY = Dβ̂. The Residuals: ε̂ = Y − Ŷ = (I − L)Y.

36 Introduction to Regression p. 30/97 Linear Regression The estimates of β given above minimize the sum of squared errors: SSE = Σ_{i=1}^n ε̂_i^2 = Σ_{i=1}^n (Y_i − Ŷ_i)^2. If the ε_i are Gaussian, then β̂ is Gaussian, leading to confidence intervals and hypothesis tests.
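
A small R check (not from the slides) of these matrix formulas on simulated data with a single predictor; the estimate from the explicit formula agrees with lm().

set.seed(1)
n <- 100
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
D <- cbind(1, x)                             # design matrix: intercept column and x
betahat <- solve(t(D) %*% D, t(D) %*% y)     # (D^T D)^{-1} D^T Y
L <- D %*% solve(t(D) %*% D) %*% t(D)        # projection matrix
yhat <- L %*% y                              # fitted values
ehat <- y - yhat                             # residuals
sum(diag(L))                                 # nu = trace(L) = p + 1 = 2
cbind(betahat, coef(lm(y ~ x)))              # the two sets of estimates agree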

37 Introduction to Regression p. 31/97 Linear Regression in R Linear regression in R is done via the command lm(): > holdlm = lm(y ~ x) The default is to include an intercept term. To exclude it, use > holdlm = lm(y ~ x - 1)

38 Introduction to Regression p. 32/97 Linear Regression in R [Figure: scatterplot of y versus x.]

39 Introduction to Regression p. 33/97 Linear Regression in R Fitting the model Y = β_0 + β_1 x + ε. The summary() shows the key output: > holdlm = lm(y ~ x) > summary(holdlm) < CUT SOME HERE > Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) ... * x ... e-10 *** < CUT SOME HERE > Note that β̂_1 = 2.07, and the SE for β̂_1 is approximately 0.17.

40 Introduction to Regression p. 34/97 Linear Regression in R Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) ... * x ... e-10 *** If the ε are Gaussian, then the last column gives the p-value for the test of H_0: β_i = 0 versus H_1: β_i ≠ 0. Also, if the ε are Gaussian, β̂_i ± 2 ŜE(β̂_i) is an approximate 95% confidence interval for β_i.

41 Introduction to Regression p. 35/97 Linear Regression in R [Figure: residuals versus fitted values.] It is always a good idea to look at a plot of the residuals versus the fitted values; there should be no apparent pattern. > plot(holdlm$fitted.values, holdlm$residuals)

42 Introduction to Regression p. 36/97 Linear Regression in R [Figure: normal probability plot of the residuals, sample quantiles versus theoretical quantiles.] Is it reasonable to assume the ε are Gaussian? Check the normal probability plot of the residuals. > qqnorm(holdlm$residuals) > qqline(holdlm$residuals)

43 Introduction to Regression p. 37/97 Linear Regression in R Simple to include more terms in the model: > holdlm = lm(y ~ x + z) Unfortunately, this does not work: > holdlm = lm(y ~ x + x^2) Instead, create a new variable and use that: > x2 = x^2 > holdlm = lm(y ~ x + x2)
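
A standard alternative, not shown on the slide, is to protect the power with I() inside the formula, or to use poly(); both give the quadratic fit without creating a new variable.

holdlm = lm(y ~ x + I(x^2))
holdlm = lm(y ~ poly(x, 2, raw = TRUE))   # same fitted values as the I(x^2) version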

44 Introduction to Regression p. 37/97 Next... Now we will begin our discussion of nonparametric regression. To motivate the discussion, we start with a simple, but naive, approach: binning.

45 Introduction to Regression p. 38/97 Binning Divide the x-axis into bins B_1, B_2, ... of width h. f̂ is a step function based on averaging the Y_i in each bin: for x ∈ B_j, f̂(x) = mean{Y_i : X_i ∈ B_j}. The (arbitrary) choice of the bin boundaries can affect inference, especially when h is large.
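
A minimal sketch of the binned estimator in R on simulated data; the true curve, sample size, and bin width h are illustrative choices.

set.seed(1)
x <- runif(300)
y <- sin(2 * pi * x) + rnorm(300, sd = 0.3)
h <- 0.1
breaks <- seq(min(x), max(x) + h, by = h)                # bin boundaries B_1, B_2, ...
bins <- cut(x, breaks, include.lowest = TRUE)
binmeans <- tapply(y, bins, mean)                        # mean of the Y_i in each bin
fhat.bin <- function(x0) binmeans[as.integer(cut(x0, breaks, include.lowest = TRUE))]
fhat.bin(c(0.05, 0.55))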

46 Introduction to Regression p. 39/97 Binning WMAP data, binned estimate with h = 15. [Figure: Cl versus l.]

47 Introduction to Regression p. 40/97 Binning WMAP data, binned estimate with h = 50. [Figure: Cl versus l.]

48 Introduction to Regression p. 41/97 Binning WMAP data, binned estimate with h = 75. [Figure: Cl versus l.]

49 Introduction to Regression p. 41/97 Next... A step beyond binning is to take local averages to estimate the regression function.

50 Introduction to Regression p. 42/97 Local Averaging For each x let N_x = [x − h/2, x + h/2]. This is a moving window of length h, centered at x. Define f̂(x) = mean{Y_i : X_i ∈ N_x}. This is like binning but removes the arbitrary boundaries.

51 Introduction to Regression p. 43/97 Local Averaging WMAP data, local average estimate with h = 15. [Figure: Cl versus l, binned and local averaging estimates overlaid.]

52 Introduction to Regression p. 44/97 Local Averaging WMAP data, local average estimate with h = 50. [Figure: Cl versus l, binned and local averaging estimates overlaid.]

53 Introduction to Regression p. 45/97 Local Averaging WMAP data, local average estimate with h = 75. [Figure: Cl versus l, binned and local averaging estimates overlaid.]

54 Introduction to Regression p. 45/97 Next... Local averaging is a special case of kernel regression.

55 Introduction to Regression p. 46/97 Kernel Regression The local average estimator can be written f̂(x) = Σ_{i=1}^n Y_i K((x − X_i)/h) / Σ_{i=1}^n K((x − X_i)/h), where K(x) = 1 if |x| < 1/2 and 0 otherwise. Can improve this by using a function K which is smoother.

56 Introduction to Regression p. 47/97 Kernels A kernel is any smooth function K such that K(x) ≥ 0, ∫ K(x) dx = 1, ∫ x K(x) dx = 0, and σ_K^2 ≡ ∫ x^2 K(x) dx > 0. Some commonly used kernels are the following: the boxcar kernel: K(x) = (1/2) I(|x| < 1); the Gaussian kernel: K(x) = (1/√(2π)) e^{−x^2/2}; the Epanechnikov kernel: K(x) = (3/4)(1 − x^2) I(|x| < 1); the tricube kernel: K(x) = (70/81)(1 − |x|^3)^3 I(|x| < 1).
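
Written as R functions, these are a direct transcription of the definitions above (with normalizing constants so that each integrates to one):

boxcar  <- function(x) 0.5 * (abs(x) < 1)
gauss   <- function(x) dnorm(x)                              # (1/sqrt(2*pi)) exp(-x^2/2)
epan    <- function(x) 0.75 * (1 - x^2) * (abs(x) < 1)
tricube <- function(x) (70 / 81) * (1 - abs(x)^3)^3 * (abs(x) < 1)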

57 Introduction to Regression p. 48/97 Kernels

58 Introduction to Regression p. 49/97 Kernel Regression We can write this as f̂(x) = Σ_i Y_i l_i(x), where l_i(x) = K((x − X_i)/h) / Σ_{j=1}^n K((x − X_j)/h).
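
A sketch of the kernel regression estimator in R with the Gaussian kernel; the simulated data and bandwidth are illustrative.

set.seed(1)
x <- runif(300)
y <- sin(2 * pi * x) + rnorm(300, sd = 0.3)
h <- 0.05
fhat.nw <- function(x0) {
  w <- dnorm((x0 - x) / h)      # unnormalized weights K((x0 - X_i) / h)
  sum(w * y) / sum(w)           # kernel-weighted average of the Y_i
}
grid <- seq(0, 1, length.out = 200)
fhat <- sapply(grid, fhat.nw)
# plot(x, y); lines(grid, fhat)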

59 Introduction to Regression p. 50/97 Kernel Regression WMAP data, kernel regression estimates, h = 15. [Figure: Cl versus l, with boxcar, Gaussian, and tricube kernel estimates overlaid.]

60 Introduction to Regression p. 51/97 Kernel Regression WMAP data, kernel regression estimates, h = 50. [Figure: Cl versus l, with boxcar, Gaussian, and tricube kernel estimates overlaid.]

61 Introduction to Regression p. 52/97 Kernel Regression WMAP data, kernel regression estimates, h = 75. [Figure: Cl versus l, with boxcar, Gaussian, and tricube kernel estimates overlaid.]

62 Introduction to Regression p. 53/97 Kernel Regression MISE ≈ (h^4/4) (∫ x^2 K(x) dx)^2 ∫ (f''(x) + 2 f'(x) g'(x)/g(x))^2 dx + (σ^2 ∫ K^2(x) dx)/(nh) ∫ (1/g(x)) dx, where g(x) is the density of X. What is especially notable is the presence of the term 2 f'(x) g'(x)/g(x), the design bias. Also, the bias is large near the boundary. We can reduce these biases using local polynomials.

63 Introduction to Regression p. 54/97 Kernel Regression WMAP data, h = 50. Note the boundary bias. [Figure: Cl versus l, with boxcar, Gaussian, and tricube kernel estimates overlaid.]

64 Introduction to Regression p. 54/97 Next... The deficiencies of kernel regression can be addressed by considering models which, on small scales, model the regression function as a low order polynomial.

65 Introduction to Regression p. 55/97 Local Polynomial Regression Recall polynomial regression: f̂(x) = β̂_0 + β̂_1 x + β̂_2 x^2 + ··· + β̂_p x^p, where β̂ = (β̂_0, ..., β̂_p) is obtained by least squares: minimize Σ_{i=1}^n (Y_i − [β_0 + β_1 X_i + β_2 X_i^2 + ··· + β_p X_i^p])^2.

66 Introduction to Regression p. 56/97 Local Polynomial Regression Local polynomial regression: approximate f(x) locally by a different polynomial for every x: f(u) ≈ β_0(x) + β_1(x)(u − x) + β_2(x)(u − x)^2 + ··· + β_p(x)(u − x)^p for u near x. Estimate (β̂_0(x), ..., β̂_p(x)) by local least squares: minimize Σ_{i=1}^n (Y_i − [β_0(x) + β_1(x)(X_i − x) + ··· + β_p(x)(X_i − x)^p])^2 K((X_i − x)/h), where the last factor is the kernel weight. Then f̂(x) = β̂_0(x).

67 Introduction to Regression p. 57/97 Local Polynomial Regression Taking p = 0 yields the kernel regression estimator: f̂_n(x) = Σ_{i=1}^n l_i(x) Y_i with l_i(x) = K((x − X_i)/h) / Σ_{j=1}^n K((x − X_j)/h). Taking p = 1 yields the local linear estimator. This is the best all-purpose smoother. Choice of kernel K: not important. Choice of bandwidth h: crucial.

68 Introduction to Regression p. 58/97 Local Polynomial Regression The local polynomial regression estimate is f̂_n(x) = Σ_{i=1}^n l_i(x) Y_i, where l(x)^T = (l_1(x), ..., l_n(x)) = e_1^T (X_x^T W_x X_x)^{-1} X_x^T W_x, e_1 = (1, 0, ..., 0)^T, X_x is the n × (p + 1) matrix whose i-th row is (1, X_i − x, (X_i − x)^2/2!, ..., (X_i − x)^p/p!), and W_x is the n × n diagonal matrix with i-th diagonal entry K((x − X_i)/h).
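
A sketch of the local linear (p = 1) estimate at a single target point, following the weight formula above; the Gaussian kernel, bandwidth, and simulated data are illustrative choices.

set.seed(1)
x <- runif(300)
y <- sin(2 * pi * x) + rnorm(300, sd = 0.3)
h <- 0.05
fhat.loclin <- function(x0) {
  Xx <- cbind(1, x - x0)                                # columns 1 and (X_i - x0)
  Wx <- diag(dnorm((x - x0) / h))                       # kernel weights on the diagonal
  l <- solve(t(Xx) %*% Wx %*% Xx, t(Xx) %*% Wx)[1, ]    # first row is l(x0)^T
  sum(l * y)                                            # fhat(x0) = beta0hat(x0)
}
fhat.loclin(0.5)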

69 Introduction to Regression p. 59/97 Local Polynomial Regression Note that f̂(x) = Σ_{i=1}^n l_i(x) Y_i is a linear smoother. Define Ŷ = (Ŷ_1, ..., Ŷ_n), where Ŷ_i = f̂(X_i). Then Ŷ = LY, where L is the smoothing matrix whose (i, j) entry is l_j(X_i). The effective degrees of freedom is ν = trace(L) = Σ_{i=1}^n L_ii.

70 Introduction to Regression p. 59/97 Next... Nonparametric procedures require a choice of the smoothing parameter h. Next we will discuss how this can be selected in a data-driven manner via cross-validation.

71 Introduction to Regression p. 60/97 Choosing the Bandwidth Estimate the risk (1/n) Σ_{i=1}^n E(f̂(x_i) − f(x_i))^2 with the leave-one-out cross-validation score: CV = (1/n) Σ_{i=1}^n (Y_i − f̂_{(−i)}(X_i))^2, where f̂_{(−i)} is the estimator obtained by omitting the i-th pair (X_i, Y_i).

72 Introduction to Regression p. 61/97 Choosing the Bandwidth Amazing shortcut formula: CV = (1/n) Σ_{i=1}^n [(Y_i − f̂_n(x_i)) / (1 − L_ii)]^2. A commonly used approximation is GCV (generalized cross-validation): GCV = (1/n) Σ_{i=1}^n (Y_i − f̂(x_i))^2 / (1 − ν/n)^2, with ν = trace(L).
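
A sketch in R computing both scores for a kernel smoother from its smoothing matrix over a grid of bandwidths; the simulated data and the grid of h values are illustrative.

set.seed(1)
n <- 300
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
cv.scores <- function(h) {
  W <- outer(x, x, function(a, b) dnorm((a - b) / h))
  L <- W / rowSums(W)
  yhat <- as.vector(L %*% y)
  nu <- sum(diag(L))
  cv <- mean(((y - yhat) / (1 - diag(L)))^2)    # leave-one-out shortcut formula
  gcv <- mean((y - yhat)^2) / (1 - nu / n)^2    # GCV approximation
  c(cv = cv, gcv = gcv)
}
hs <- seq(0.02, 0.20, by = 0.02)
sapply(hs, cv.scores)                           # choose the h minimizing CV or GCV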

73 Introduction to Regression p. 62/97 Theoretical Aside Why local linear (p = 1) is better than kernel (p = 0): Both have (approximate) variance σ^2(x) ∫ K^2(u) du / (g(x) nh). The kernel estimator has bias h^2 [ (1/2) f''(x) + f'(x) g'(x)/g(x) ] ∫ u^2 K(u) du, whereas the local linear estimator has asymptotic bias (h^2/2) f''(x) ∫ u^2 K(u) du. The local linear estimator is free from design bias. At the boundary points, the kernel estimator has asymptotic bias of O(h) while the local linear estimator has bias O(h^2).

74 Introduction to Regression p. 62/97 Next... Next, we will go over a bit of how to perform local polynomial regression in R.

75 Introduction to Regression p. 63/97 Using locfit() Need to include the locfit library: > install.packages("locfit") > library(locfit) > result = locfit(y ~ x, alpha=c(0, 1.5), deg=1) y and x are vectors; the second argument to alpha gives the bandwidth (h); the first argument to alpha specifies the nearest neighbor fraction, an alternative to the bandwidth; fitted(result) gives the fitted values f̂(x_i); residuals(result) gives the residuals Y_i − f̂(x_i).
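
A short usage sketch, assuming x, y, and result from the call above: extract the fitted values and overlay the fit on the data.

plot(x, y, pch = 16, cex = 0.4)
ord <- order(x)
lines(x[ord], fitted(result)[ord], col = "red")   # fitted values, sorted by x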

76 Introduction to Regression p. 64/97 Using locfit() [Figure: simulated data, degree=1, n=1000, sigma=0.1; shown are the true curve x(1 − x) sin(2.1π/(x + 0.2)), an estimate using a fixed h, and the estimate using the GCV-optimal h.]

77 Introduction to Regression p. 65/97 Using locfit() [Figure: same simulated data, degree=1, n=1000, sigma=0.1; shown are the true curve x(1 − x) sin(2.1π/(x + 0.2)), the estimate using h = 0.1, and the estimate using the GCV-optimal h.]

78 Introduction to Regression p. 66/97 Using locfit() [Figure: simulated data, degree=1, n=1000, sigma=0.5; shown are the true curve x + 5 exp(−4x^2), the estimate using h = 0.05, and the estimate using the GCV-optimal h.]

79 Introduction to Regression p. 67/97 Using locfit() [Figure: same simulated data, degree=1, n=1000, sigma=0.5; shown are the true curve x + 5 exp(−4x^2), the estimate using h = 0.5, and the estimate using the GCV-optimal h.]

80 Introduction to Regression p. 68/97 Example WMAP data, local linear fit, h = 15. [Figure: the estimate using h = 15 and the estimate using the GCV-optimal h.]

81 Introduction to Regression p. 69/97 Example WMAP data, local linear fit, h = 125. [Figure: the estimate using h = 125 and the estimate using the GCV-optimal h.]

82 Introduction to Regression p. 69/97 Next... We should quantify the amount of error in our estimator for the regression function. Next we will look at how this can be done using local polynomial regression.

83 Introduction to Regression p. 70/97 Variance Estimation Let σ̂^2 = Σ_{i=1}^n (Y_i − f̂(x_i))^2 / (n − 2ν + ν̃), where ν = tr(L) and ν̃ = tr(L^T L) = Σ_{i=1}^n ‖l(x_i)‖^2. If f is sufficiently smooth, then σ̂^2 is a consistent estimator of σ^2.

84 Introduction to Regression p. 71/97 Variance Estimation For the WMAP data, using the local linear fit: > wmap = read.table("wmap.dat", header=T) > opth = 63.0 > locfitwmap = locfit(wmap$cl[1:700] ~ wmap$ell[1:700], alpha=c(0,opth), deg=1) > nu = as.numeric(locfitwmap$dp[6]) > nutilde = as.numeric(locfitwmap$dp[7]) > sigmasqrhat = sum(residuals(locfitwmap)^2)/(700-2*nu+nutilde) > sigmasqrhat But it does not seem reasonable to assume homoscedasticity...

85 Introduction to Regression p. 72/97 Variance Estimation Allow σ to be a function of x: Y_i = f(x_i) + σ(x_i) ε_i. Let Z_i = log(Y_i − f(x_i))^2 and δ_i = log ε_i^2. Then Z_i = log(σ^2(x_i)) + δ_i. This suggests estimating log σ^2(x) by regressing the log squared residuals on x.

86 Introduction to Regression p. 73/97 Variance Estimation 1. Estimate f(x) with any nonparametric method to get an estimate f̂_n(x). 2. Define Z_i = log(Y_i − f̂_n(x_i))^2. 3. Regress the Z_i on the x_i (again using any nonparametric method) to get an estimate q̂(x) of log σ^2(x), and let σ̂^2(x) = e^{q̂(x)}.
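
A sketch of these three steps using locfit(), in the same style as the earlier calls; the simulated heteroscedastic data and the two bandwidths are illustrative choices.

library(locfit)
set.seed(1)
x <- sort(runif(300))
y <- sin(2 * pi * x) + (0.1 + 0.4 * x) * rnorm(300)   # nonconstant sigma(x)
fit.f <- locfit(y ~ x, alpha = c(0, 0.1), deg = 1)    # step 1: estimate f
z <- log(residuals(fit.f)^2)                          # step 2: Z_i = log((Y_i - fhat(x_i))^2)
fit.q <- locfit(z ~ x, alpha = c(0, 0.3), deg = 1)    # step 3: regress the Z_i on the x_i
sigmahat <- sqrt(exp(fitted(fit.q)))                  # sigmahat(x_i) = exp(qhat(x_i) / 2)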

87 Introduction to Regression p. 74/97 Example WMAP data, log squared residuals, local linear fit. [Figure: log squared residual versus l.]

88 Introduction to Regression p. 75/97 Example With local linear fit, h = 130, chosen via GCV. [Figure: log squared residual versus l, with the fitted curve.]

89 Introduction to Regression p. 76/97 Example Estimating σ(x): [Figure: σ̂ versus l.]

90 Introduction to Regression p. 77/97 Confidence Bands Recall that f̂(x) = Σ_i Y_i l_i(x), so Var(f̂(x)) = Σ_i σ^2(X_i) l_i^2(x). An approximate 1 − α confidence interval for f(x) is f̂(x) ± z_{α/2} √(Σ_i σ̂^2(X_i) l_i^2(x)). When σ(x) is smooth, we can approximate √(Σ_i σ^2(X_i) l_i^2(x)) ≈ σ(x) ‖l(x)‖.
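
A sketch of a pointwise band built directly from the weights of a kernel smoother, taking σ(x) to be a known constant for simplicity; the kernel, bandwidth, simulated data, and the value of σ are all illustrative.

set.seed(1)
n <- 300
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
h <- 0.05
band <- function(x0, sigma = 0.3, z = 1.96) {
  l <- dnorm((x0 - x) / h)
  l <- l / sum(l)                        # weights l_i(x0)
  fhat <- sum(l * y)
  se <- sqrt(sum(sigma^2 * l^2))         # sqrt of sum_i sigma^2(X_i) l_i(x0)^2
  c(lower = fhat - z * se, estimate = fhat, upper = fhat + z * se)
}
band(0.5)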

91 Introduction to Regression p. 78/97 Confidence Bands Two caveats: 1. f̂ is biased, so this is really an interval for E(f̂(x)). Result: the bands can miss sharp peaks in the function. 2. Pointwise coverage does not imply simultaneous coverage for all x. The solution for 2 is to replace z_{α/2} with a larger number (Sun and Loader 1994). locfit() does this for you.

92 Introduction to Regression p. 79/97 More on locfit() > diaghat = predict.locfit(locfitwmap, where="data", what="infl") > normell = predict.locfit(locfitwmap, where="data", what="vari") diaghat will be L_ii, i = 1, 2, ..., n; normell will be ‖l(x_i)‖, i = 1, 2, ..., n. The Sun and Loader replacement for z_{α/2} is found using kappa0(locfitwmap)$crit.val.

93 Introduction to Regression p. 80/97 Example 95% pointwise confidence bands: [Figure: Cl versus l, showing the estimate, the 95% pointwise band assuming constant variance, and the 95% pointwise band using nonconstant variance.]

94 Introduction to Regression p. 81/97 Example 95% simultaneous confidence bands: [Figure: Cl versus l, showing the estimate, the 95% simultaneous band assuming constant variance, and the 95% simultaneous band using nonconstant variance.]

95 Introduction to Regression p. 81/97 Next... We discuss the use of regression splines. These are a well-motivated choice of basis for constructing regression functions.

96 Introduction to Regression p. 82/97 Basis Methods Idea: expand f as f(x) = Σ_j β_j ψ_j(x), where ψ_1(x), ψ_2(x), ... are specially chosen, known functions. Then estimate the β_j and set f̂(x) = Σ_j β̂_j ψ_j(x). Here we consider a popular version, splines.

97 Introduction to Regression p. 83/97 Splines and Penalization Define f̂_n to be the function that minimizes M(λ) = Σ_i (Y_i − f̂_n(x_i))^2 + λ ∫ (f''(x))^2 dx. λ = 0 ⇒ f̂_n(X_i) = Y_i (no smoothing); λ = ∞ ⇒ f̂_n(x) = β̂_0 + β̂_1 x (linear); 0 < λ < ∞ ⇒ f̂_n is a cubic spline with knots at the X_i. A cubic spline is a continuous function f such that 1. f is a cubic polynomial between the X_i and 2. f has continuous first and second derivatives at the X_i.

98 Introduction to Regression p. 84/97 Basis For Splines Define (z)_+ = max{z, 0} and N = n + 4, and let ψ_1(x) = 1, ψ_2(x) = x, ψ_3(x) = x^2, ψ_4(x) = x^3, ψ_5(x) = (x − X_1)^3_+, ψ_6(x) = (x − X_2)^3_+, ..., ψ_N(x) = (x − X_n)^3_+. These functions form a basis for the splines: we can write f(x) = Σ_{j=1}^N β_j ψ_j(x). (For numerical calculations it is actually more efficient to use other spline bases.) We can thus write f̂_n(x) = Σ_{j=1}^N β̂_j ψ_j(x).

99 Introduction to Regression p. 85/97 Basis For Splines We can now rewrite the minimization as follows: minimize (Y − Ψβ)^T (Y − Ψβ) + λ β^T Ω β, where Ψ_ij = ψ_j(X_i) and Ω_jk = ∫ ψ_j''(x) ψ_k''(x) dx. The value of β that minimizes this is β̂ = (Ψ^T Ψ + λΩ)^{-1} Ψ^T Y. The smoothing spline f̂_n(x) is a linear smoother; that is, there exist weights l(x) such that f̂_n(x) = Σ_{i=1}^n Y_i l_i(x).

100 Introduction to Regression p. 86/97 Basis For Splines [Figure: the basis functions ψ_j with 5 knots, plotted against x.]

101 Introduction to Regression p. 87/97 Basis For Splines [Figure: same span as the previous slide, the B-spline basis with 5 knots.]

102 Introduction to Regression p. 88/97 Smoothing Splines in R > smosplresult = smooth.spline(x, y, cv=FALSE, all.knots=TRUE) If cv=TRUE, then cross-validation is used to choose λ; if cv=FALSE, GCV is used. If all.knots=TRUE, then knots are placed at all data points; if all.knots=FALSE, then set nknots to specify the number of knots to be used. Using fewer knots eases the computational cost. predict(smosplresult) gives fitted values.
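
A usage sketch on simulated data: fit, then evaluate the spline on a grid with predict().

set.seed(1)
x <- runif(300)
y <- sin(2 * pi * x) + rnorm(300, sd = 0.3)
smosplresult <- smooth.spline(x, y, cv = FALSE, all.knots = TRUE)
grid <- seq(min(x), max(x), length.out = 200)
pred <- predict(smosplresult, x = grid)     # pred$x and pred$y trace out the fitted curve
# plot(x, y); lines(pred$x, pred$y)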

103 Introduction to Regression p. 89/97 Example WMAP data, λ chosen using GCV. [Figure: Cl versus l, with the smoothing spline fit.]

104 Introduction to Regression p. 89/97 Next... Finally, we address the challenges of fitting regression models when the predictor x is of large dimension.

105 Introduction to Regression p. 90/97 Recall: The Bias-Variance Tradeoff For nonparametric procedures, when x is one-dimensional, MISE ≈ c_1 h^4 + c_2/(nh), which is minimized at h = O(1/n^{1/5}). Hence MISE = O(1/n^{4/5}), whereas for parametric problems MISE = O(1/n).

106 Introduction to Regression p. 91/97 The Curse of Dimensionality For nonparametric procedures, when x is d-dimensional, MISE ≈ c_1 h^4 + c_2/(n h^d), which is minimized at h = O(1/n^{1/(4+d)}). Hence MISE = O(1/n^{4/(4+d)}), whereas for parametric problems, still, MISE = O(1/n).

107 Introduction to Regression p. 92/97 Multiple Regression Y = f(X_1, X_2, ..., X_d) + ε. The curse of dimensionality: The optimal rate of convergence for d = 1 is n^{−4/5}. In d dimensions the optimal rate of convergence is n^{−4/(4+d)}. Thus, the sample size m required for a d-dimensional problem to have the same accuracy as a sample size n in a one-dimensional problem is m ∝ n^{cd}, where c = (4 + d)/(5d) > 0. To maintain a given degree of accuracy of an estimator, the sample size must increase exponentially with the dimension d. Put another way, confidence bands get very large as the dimension d increases.

108 Introduction to Regression p. 93/97 Multiple Local Linear Given a nonsingular positive definite d × d bandwidth matrix H, we define K_H(x) = |H|^{−1/2} K(H^{−1/2} x). Often, one scales each covariate to have the same mean and variance and then uses the kernel h^{−d} K(‖x‖/h), where K is any one-dimensional kernel. Then there is a single bandwidth parameter h.

109 Introduction to Regression p. 94/97 Multiple Local Linear At a target value x = (x_1, ..., x_d)^T, the local sum of squares is given by Σ_{i=1}^n w_i(x) (Y_i − a_0 − Σ_{j=1}^d a_j (x_{ij} − x_j))^2, where w_i(x) = K(‖x_i − x‖/h). The estimator is f̂_n(x) = â_0, where â = (â_0, ..., â_d)^T is the value of a = (a_0, ..., a_d)^T that minimizes the weighted sum of squares.

110 Introduction to Regression p. 95/97 Multiple Local Linear The solution â is â = (X_x^T W_x X_x)^{-1} X_x^T W_x Y, where X_x is the n × (d + 1) matrix whose i-th row is (1, x_{i1} − x_1, ..., x_{id} − x_d) and W_x is the diagonal matrix whose (i, i) element is w_i(x).

111 Introduction to Regression p. 96/97 A Compromise: Additive Models Y = α + Σ_{j=1}^d f_j(x_j) + ε. Usually take α̂ = Ȳ. Then estimate the f_j by backfitting. 1. Set α̂ = Ȳ and f̂_1 = ··· = f̂_d = 0. 2. Iterate until convergence: for j = 1, ..., d: compute Ỹ_i = Y_i − α̂ − Σ_{k≠j} f̂_k(X_i), i = 1, ..., n; apply a smoother to the Ỹ_i on X_j to obtain f̂_j; set f̂_j(x) equal to f̂_j(x) − n^{-1} Σ_{i=1}^n f̂_j(x_i). The last step ensures that Σ_i f̂_j(X_i) = 0 (identifiability).
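
A sketch of backfitting in R for d = 2, using a simple kernel smoother for each component and a fixed number of sweeps in place of a formal convergence check; the data, bandwidth, and number of iterations are illustrative.

set.seed(1)
n <- 400
x1 <- runif(n)
x2 <- runif(n)
y <- 1 + sin(2 * pi * x1) + 4 * (x2 - 0.5)^2 + rnorm(n, sd = 0.3)
smooth1d <- function(xj, r, h = 0.1) {     # kernel-smooth the residuals r against xj
  W <- outer(xj, xj, function(a, b) dnorm((a - b) / h))
  as.vector((W / rowSums(W)) %*% r)
}
alphahat <- mean(y)
f1 <- f2 <- rep(0, n)
for (sweep in 1:20) {
  f1 <- smooth1d(x1, y - alphahat - f2); f1 <- f1 - mean(f1)   # update and center f1
  f2 <- smooth1d(x2, y - alphahat - f1); f2 <- f2 - mean(f2)   # update and center f2
}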

112 Introduction to Regression p. 97/97 To Sum Up... Classic linear regression is computationally and theoretically simple. Nonparametric procedures for regression offer increased flexibility. Careful consideration of MISE suggests use of local polynomial fits. Smoothing parameters can be chosen objectively. High-dimensional data present a challenge.
