Introduction to Regression


1 Introduction to Regression David E Jones (slides mostly by Chad M Schafer) June 1, / 102

2 Outline General Concepts of Regression, Bias-Variance Tradeoff Linear Regression Nonparametric Procedures Cross Validation Local Polynomial Regression Confidence Bands Basis Methods: Splines Multiple Regression Notes based on All of Nonparametric Statistics by Larry Wasserman 2 / 102

3 Basic Concepts in Regression The Regression Problem: Objective is to construct a model that can be used to predict the response variable Y using observations of the predictor variables X Note: In this framework, Y is real-valued, while X could be a vector Example: Predicting the distance modulus from redshift Y = distance modulus X = redshift 3 / 102

4 Basic Concepts in Regression The Regression Problem: Observe a training sample (X_1, Y_1), ..., (X_n, Y_n), use this to estimate f(x) = E(Y | X = x) Equivalently: Estimate f(x) when Y_i = f(x_i) + ɛ_i, i = 1, 2, ..., n where E(ɛ_i) = 0 Simple estimators when X is real-valued: f̂(x) = β̂_0 + β̂_1 x (parametric) f̂(x) = mean{Y_i : |X_i − x| ≤ h} (nonparametric) 4 / 102

5 Basic Concepts of Regression Parametric Regression: Y = β_0 + β_1 X + ɛ Y = β_0 + β_1 X_1 + ... + β_d X_d + ɛ Y = β_0 e^{β_1 X_1} + β_2 X_2 + ɛ Nonparametric Regression: Y = f(X) + ɛ Y = f_1(X_1) + ... + f_d(X_d) + ɛ Y = f(X_1, X_2) + ɛ 5 / 102

6 Next We first look at a simple example from astronomy 5 / 102

7 Example: Type Ia Supernovae [Scatter plot: distance modulus vs. redshift] From Riess et al. (2007), 182 Type Ia Supernovae 6 / 102

8 Example: Type Ia Supernovae Assumptions: Observed (redshift, distance modulus) pairs (z_i, Y_i) are related by Y_i = f(z_i) + σ_i ɛ_i, where the ɛ_i are iid standard normal Error standard deviations σ_i assumed known Goal: Estimate f(·) 7 / 102

9 Example: Type Ia Supernovae ΛCDM, two parameter model: f(z) = 5 log[ (c(1 + z)/H_0) ∫_0^z du / sqrt(Ω_m (1 + u)^3 + (1 − Ω_m)) ] H_0 = Hubble constant Ω_m = dimensionless matter density 8 / 102

10 Example: Type Ia Supernovae [Plot: fitted model over the (redshift, distance modulus) data] Estimate with H_0 = 72.76 and Ω_m = 0.341 (the MLE) 9 / 102

11 Example: Type Ia Supernovae [Plot: fitted curve over the (redshift, distance modulus) data] Fourth order polynomial fit: Y = β_0 + β_1 z + β_2 z^2 + β_3 z^3 + β_4 z^4 + ɛ 10 / 102

12 Example: Type Ia Supernovae [Plot: fitted curve over the (redshift, distance modulus) data] Fifth order polynomial fit: Y = β_0 + β_1 z + β_2 z^2 + β_3 z^3 + β_4 z^4 + β_5 z^5 + ɛ 11 / 102

13 Next The previous two frames illustrate the fundamental challenge of constructing a regression model: How complex should it be? 11 / 102

14 The Bias Variance Tradeoff Must choose model to achieve balance between too simple: high bias, i.e., E(f̂(x)) not close to f(x) (precise, but not accurate) too complex: high variance, i.e., Var(f̂(x)) large (accurate, but not precise) This is the classic bias variance tradeoff Could be called the accuracy precision tradeoff 12 / 102

15 The Bias Variance Tradeoff An Example: Consider the Linear Regression Model These models are of the form f(x) = β_0 + Σ_{i=1}^p β_i g_i(x) where g_i(·) are fixed, specified functions (For example, g_i(x) = x^i) Increasing p yields more complex models 13 / 102

16 Next Linear regression models are the classic approach to regression Here, the response is modelled as a linear combination of p selected predictors 13 / 102

17 Linear Regression Assume that Y = β_0 + Σ_{i=1}^p β_i g_i(x) + ɛ where g_i(·) are specified functions, not estimated Assume that E(ɛ) = 0, Var(ɛ) = σ^2 Often assume ɛ is Gaussian, but this is not essential Start with the Simple Linear Regression model: Y = β_0 + β_1 x + ɛ 14 / 102

18 Linear Regression [Scatter plot: redshift vs. g − r] LRG data from 2SLAQ, www.2slaq.info 15 / 102

19 Linear Regression [Plot: fitted line through the (g − r, redshift) data, showing one of the fitted values, Ŷ_i] 16 / 102

20 Linear Regression [Plot: fitted line through the (g − r, redshift) data, showing one of the residuals, ɛ̂_i] 17 / 102

21 Linear Regression The line fit here is Ŷ_i = β̂_0 + β̂_1 x_i with β̂_0 = 0.493 and β̂_1 = ... These estimates of β_0 and β_1 minimize the sum of squared errors: SSE = Σ_{i=1}^n ɛ̂_i^2 = Σ_{i=1}^n (Y_i − Ŷ_i)^2 This is least squares regression 18 / 102

22 Linear Regression Least squares has a natural geometric interpretation 19 / 102

23 Linear Regression The Design Matrix: D = [ 1 g_1(X_1) g_2(X_1) ... g_p(X_1) ; 1 g_1(X_2) g_2(X_2) ... g_p(X_2) ; ... ; 1 g_1(X_n) g_2(X_n) ... g_p(X_n) ] Then, L = D(D^T D)^{-1} D^T is the projection matrix Note that ν = trace(L) = p + 1 = complexity of model 20 / 102

24 Linear Regression Let Y be an n-vector filled with Y_1, Y_2, ..., Y_n The Fitted Values: Ŷ = LY = D β̂ The Least Squares Estimates of β: β̂ = (D^T D)^{-1} D^T Y The Residuals: ɛ̂ = Y − Ŷ = (I − L)Y 21 / 102
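As a quick numerical check of these matrix formulas, here is a small R sketch (not from the slides) that builds the design matrix by hand with the illustrative choice g_1(x) = x, g_2(x) = x^2, computes β̂ = (D^T D)^{-1} D^T Y, and compares with lm(); the simulated data are made up.

# Simulated data (illustrative only)
set.seed(1)
n <- 100
x <- runif(n)
y <- 1 + 2*x - 3*x^2 + rnorm(n, sd = 0.2)

# Design matrix with g1(x) = x, g2(x) = x^2
D <- cbind(1, x, x^2)

# Least squares estimates, projection matrix, fitted values, residuals
betahat <- solve(t(D) %*% D, t(D) %*% y)
L <- D %*% solve(t(D) %*% D) %*% t(D)
yhat <- L %*% y
resid <- y - yhat

sum(diag(L))                      # nu = trace(L) = p + 1 = 3
coef(lm(y ~ x + I(x^2)))          # should match betahat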

25 Linear Regression in R Linear regression in R is done via the command lm() > holdlm = lm(y ~ x) Note: That is a tilde (~) between y and x Default is to include intercept term If want to exclude it, use > holdlm = lm(y ~ x - 1) 22 / 102

26 Linear Regression in R [Scatter plot: y vs. x] 23 / 102

27 Linear Regression in R Fitting the model Y = β_0 + β_1 x + ɛ The summary() shows the key output > holdlm = lm(y ~ x) > summary(holdlm) < CUT SOME HERE > Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) * x e-10 *** < CUT SOME HERE > Note that β̂_1 = 207, the SE for β̂_1 is approximately ... 24 / 102

28 Linear Regression in R Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) * x e-10 *** If the ɛ are Gaussian, then the last column gives the p-value for the test of H_0: β_i = 0 versus H_1: β_i ≠ 0 Also, if the ɛ are Gaussian, β̂_i ± 2 ŜE(β̂_i) is an approximate 95% confidence interval for β_i 25 / 102

29 Linear Regression in R [Plot: residuals versus fitted values] Always a good idea to look at plot of residuals versus fitted values There should be no apparent pattern > plot(holdlm$fitted.values, holdlm$residuals) 26 / 102

30 Linear Regression in R [Normal probability plot: sample quantiles vs. theoretical quantiles] Is it reasonable to assume ɛ are Gaussian? Check the normal probability plot of the residuals > qqnorm(holdlm$residuals) > qqline(holdlm$residuals) 27 / 102

31 Linear Regression in R Simple to include more terms in the model: > holdlm = lm(y ~ x + z) Unfortunately, this does not work: > holdlm = lm(y ~ x + x^2) Instead, can use: > holdlm = lm(y ~ x + I(x^2)) 28 / 102

32 The Correlation Coefficient A standard way of quantifying the strength of the linear relationship between two variables is via the (Pearson) correlation coefficient Sample Version: r = (1/(n−1)) Σ_{i=1}^n X̃_i Ỹ_i where X̃_i and Ỹ_i are the standardized versions of X_i and Y_i, i.e., X̃_i = (X_i − X̄)/s_X 29 / 102

33 The Correlation Coefficient Note that −1 ≤ r ≤ 1 r = 1 indicates perfect linear, increasing relationship r = −1 indicates perfect linear, decreasing relationship Also, note that β̂_1 = r (s_Y / s_X) In R: use cor() 30 / 102
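A small sketch (not from the slides) verifying these facts numerically; the data are simulated purely for illustration.

set.seed(2)
x <- rnorm(50)
y <- 3 + 0.8*x + rnorm(50, sd = 0.5)

# Correlation as the average product of standardized variables
xs <- (x - mean(x)) / sd(x)
ys <- (y - mean(y)) / sd(y)
r  <- sum(xs * ys) / (length(x) - 1)

r
cor(x, y)                         # same value
r * sd(y) / sd(x)                 # equals the least squares slope:
coef(lm(y ~ x))[2]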

34 The Correlation Coefficient Examples of different scatter plots and values for r (Wikipedia) 31 / 102

35 The Coefficient of Determination A very popular, but overused, regression diagnostic is the coefficient of determination, denoted R^2 R^2 = Σ_{i=1}^n (Ŷ_i − Ȳ)^2 / Σ_{i=1}^n (Y_i − Ȳ)^2 = 1 − Σ_{i=1}^n (Y_i − Ŷ_i)^2 / Σ_{i=1}^n (Y_i − Ȳ)^2 The proportion of the variation explained by the model Note that 0 ≤ R^2 ≤ 1, and R^2 = r^2 But: Large R^2 does not imply that the model is a good fit 32 / 102
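Another quick check (not from the slides): computing R^2 directly from its definition and confirming R^2 = r^2 for a simple linear fit, using made-up data.

set.seed(3)
x <- runif(80)
y <- 2 + 5*x + rnorm(80)

fit  <- lm(y ~ x)
yhat <- fitted(fit)

R2 <- 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
R2
cor(x, y)^2                        # same value
summary(fit)$r.squared             # same value again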

36 Anscombe's Quartet (1973) 33 / 102

37 Next Now we will discuss nonparametric regression Reconsider the SNe example from the start of the lecture 33 / 102

38 Example: Type Ia Supernovae [Plot: nonparametric estimate over the (redshift, distance modulus) data] 34 / 102

39 The Bias Variance Tradeoff Nonparametric Case: Every nonparametric procedure requires a smoothing parameter h Consider the regression estimator based on local averaging: f̂(x) = mean{Y_i : |X_i − x| ≤ h} Increasing h yields smoother estimate f̂ 35 / 102

40 What is nonparametric? [Diagram: Data and Assumptions both feed into the Estimate] In the parametric case, the influence of assumptions is fixed 36 / 102

41 What is nonparametric? [Diagram: Data and Assumptions feed into the Estimate, with the influence of Assumptions controlled by h] In the nonparametric case, the influence of assumptions is controlled by h 37 / 102

42 The Bias Variance Tradeoff Squared error loss: L(f(x), f̂(x)) = (f(x) − f̂(x))^2 Mean squared error (MSE): MSE = E(L(f(x), f̂(x))) = bias_x^2 + variance_x where bias_x = E(f̂(x)) − f(x) and variance_x = Var(f̂(x)) 38 / 102

43 The Bias Variance Tradeoff [Plot: risk, squared bias, and variance as functions of the amount of smoothing; squared bias grows and variance shrinks with more smoothing, and the risk is minimized at an intermediate, optimal amount of smoothing] 39 / 102

44 Bias Variance Tradeoff: choosing h Want to minimize the average MSE (risk): (1/n) Σ_{i=1}^n E((f(x_i) − f̂(x_i))^2) Estimate with the leave-one-out cross-validation score: CV = (1/n) Σ_{i=1}^n (Y_i − f̂_{(−i)}(x_i))^2 where f̂_{(−i)} is the estimator obtained by omitting the i-th pair (x_i, Y_i) 40 / 102
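To make the idea concrete, here is a rough sketch (not part of the slides) of leave-one-out cross-validation for the local-averaging estimator f̂(x) = mean{Y_i : |X_i − x| ≤ h}; the simulated data and the grid of h values are invented for illustration.

set.seed(4)
n <- 200
x <- runif(n)
y <- sin(4 * pi * x) + rnorm(n, sd = 0.3)

# Local average estimate at x0, optionally leaving out observation "omit"
locavg <- function(x0, h, omit = 0) {
  keep <- abs(x - x0) <= h
  if (omit > 0) keep[omit] <- FALSE
  if (!any(keep)) return(NA)
  mean(y[keep])
}

# Leave-one-out CV score for each candidate bandwidth
hgrid <- seq(0.02, 0.3, by = 0.02)
cv <- sapply(hgrid, function(h) {
  pred <- sapply(1:n, function(i) locavg(x[i], h, omit = i))
  mean((y - pred)^2, na.rm = TRUE)
})

hgrid[which.min(cv)]               # bandwidth minimizing the CV score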

45 The Bias Variance Tradeoff: choosing h [Plot: risk estimated via cross validation, as a function of h] 41 / 102

46 Bias Variance Tradeoff: what the theory says In certain settings, the optimal choice of h is h = O(n^{−1/5}) and the corresponding minimum risk is Risk = O(n^{−4/5}) However, for parametric problems, when the model is correct, Risk = O(1/n) Message: models are helpful if we have a good reason to believe they are approximately correct 42 / 102

47 Next Now we will take a closer look at a class of nonparametric regression procedures that share a common structure with linear regression Namely, they are all linear smoothers 42 / 102

48 Linear Smoothers For a linear smoother: f̂(x) = Σ_i l_i(x) Y_i for some weights l_1(x), ..., l_n(x) The vector of weights depends on the target point x For example, in linear regression, we saw Ŷ = LY, and therefore l_1(X_i), ..., l_n(X_i) is given by the i-th row of the projection matrix L = D(D^T D)^{-1} D^T 43 / 102

49 Linear Smoothers If f̂(x) = Σ_{i=1}^n l_i(x) Y_i then Ŷ = LY, where Ŷ = (Ŷ_1, ..., Ŷ_n)^T = (f̂(X_1), ..., f̂(X_n))^T and L is the n × n matrix with (i, j) entry l_j(X_i) The effective degrees of freedom is: ν = trace(L) = Σ_{i=1}^n L_ii Can be interpreted as the number of parameters in the model Recall that ν = p + 1 for linear regression 44 / 102

50 Linear Smoothers Consider the Extreme Cases: When l_i(X_j) = 1/n for all i, j, f̂(X_j) = Ȳ for all j and ν = 1 When l_i(X_j) = δ_ij for all i, j, f̂(X_j) = Y_j and ν = n 45 / 102

51 Next One of the simplest examples is binning 45 / 102

52 Binning Divide the x-axis into bins B_1, B_2, ... of width h f̂ is a step function based on averaging the Y_i's in each bin: for x ∈ B_j: f̂(x) = mean{Y_i : X_i ∈ B_j} = (1/n_j) Σ_{i=1}^n 1{X_i ∈ B_j} Y_i where n_j is the number of observations in B_j The (arbitrary) choice of the boundaries of the bins can affect inference, especially when h is large 46 / 102
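A minimal sketch (not from the slides) of the binned estimator, assuming equally spaced bins of width h over [0, 1]; the data are simulated for illustration.

set.seed(5)
x <- runif(300)
y <- cos(2 * pi * x) + rnorm(300, sd = 0.4)

h      <- 0.1
breaks <- seq(0, 1, by = h)                 # bin boundaries B_1, B_2, ...
bin    <- cut(x, breaks, include.lowest = TRUE)

binmeans <- tapply(y, bin, mean)            # average of the Y_i in each bin

# Step-function estimate: f.hat(x0) is the mean of the bin containing x0
fhat <- function(x0) binmeans[cut(x0, breaks, include.lowest = TRUE)]
fhat(c(0.05, 0.55, 0.95))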

53 Binning [Plot: WMAP data, binned estimate with h = 15; Cl vs. l] 47 / 102

54 Binning [Plot: WMAP data, binned estimate with h = 50; Cl vs. l] 48 / 102

55 Binning [Plot: WMAP data, binned estimate with h = 75; Cl vs. l] 49 / 102

56 Next A step beyond binning is to take local averages to estimate the regression function 49 / 102

57 Local Averaging For each x let N_x = [x − h/2, x + h/2] This is a moving window of length h, centered at x Define f̂(x) = mean{Y_i : X_i ∈ N_x} This is like binning but removes the arbitrary boundaries 50 / 102

58 Local Averaging [Plot: WMAP data, binned and local average estimates with h = 15; Cl vs. l] 51 / 102

59 Local Averaging [Plot: WMAP data, binned and local average estimates with h = 50; Cl vs. l] 52 / 102

60 Local Averaging [Plot: WMAP data, binned and local average estimates with h = 75; Cl vs. l] 53 / 102

61 Next Local averaging is a special case of kernel regression 53 / 102

62 Kernel Regression The local average estimator can be written: f̂(x) = Σ_{i=1}^n Y_i K((x − X_i)/h) / Σ_{i=1}^n K((x − X_i)/h) where K(x) = 1 if |x| < 1/2, and 0 otherwise Can improve this by using a function K which is smoother 54 / 102

63 Kernels A kernel is any smooth function K such that K(x) ≥ 0 and ∫ K(x) dx = 1, ∫ x K(x) dx = 0 and σ_K^2 ≡ ∫ x^2 K(x) dx > 0 Some commonly used kernels are the following: the boxcar kernel: K(x) = (1/2) I(|x| < 1), the Gaussian kernel: K(x) = (1/√(2π)) e^{−x^2/2}, the Epanechnikov kernel: K(x) = (3/4)(1 − x^2) I(|x| < 1), the tricube kernel: K(x) = (70/81)(1 − |x|^3)^3 I(|x| < 1) 55 / 102

64 Kernels [Plot: shapes of the boxcar, Gaussian, Epanechnikov, and tricube kernels] 56 / 102

65 Kernel Regression We can write this as f̂(x) = Σ_{i=1}^n Y_i K((x − X_i)/h) / Σ_{i=1}^n K((x − X_i)/h) = Σ_i Y_i l_i(x) where l_i(x) = K((x − X_i)/h) / Σ_{j=1}^n K((x − X_j)/h) 57 / 102
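The following sketch (not from the slides) implements this estimator directly with a Gaussian or boxcar kernel, builds the smoother matrix L, and reports the effective degrees of freedom ν = trace(L); the data and the bandwidth are invented for illustration.

set.seed(6)
n <- 150
x <- sort(runif(n))
y <- sin(3 * pi * x) + rnorm(n, sd = 0.3)

Kgauss  <- function(u) dnorm(u)
Kboxcar <- function(u) 0.5 * (abs(u) < 1)

# Weights l_i(x0) and the kernel regression estimate at a point x0
kernreg <- function(x0, h, K = Kgauss) {
  w <- K((x0 - x) / h)
  l <- w / sum(w)
  sum(l * y)
}
kernreg(0.5, h = 0.05)            # estimate at x = 0.5

# Smoother matrix: row i contains l_1(X_i), ..., l_n(X_i)
h <- 0.05
L <- t(sapply(x, function(x0) {
  w <- Kgauss((x0 - x) / h)
  w / sum(w)
}))
yhat <- L %*% y
sum(diag(L))                      # effective degrees of freedom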

66 Kernel Regression [Plot: WMAP data, kernel regression estimates with h = 15, using the boxcar, Gaussian, and tricube kernels; Cl vs. l] 58 / 102

67 Kernel Regression [Plot: WMAP data, kernel regression estimates with h = 50, using the boxcar, Gaussian, and tricube kernels; Cl vs. l] 59 / 102

68 Kernel Regression [Plot: WMAP data, kernel regression estimates with h = 75, using the boxcar, Gaussian, and tricube kernels; Cl vs. l] 60 / 102

69 Kernel Regression: what the theory says Suppose the x values are distributed according to the density g Then for estimating f(x), kernel regression has variance (σ^2 / (g(x) n h)) ∫ K^2(u) du and bias (h^2/2) ( f''(x) + 2 f'(x) g'(x)/g(x) ) ∫ u^2 K(u) du where g(x) is the density for X Problem: the presence of the term 2 f'(x) g'(x)/g(x) = design bias Another problem is that there is particularly large bias near the boundaries 61 / 102

70 Kernel Regression [Plot: WMAP data, h = 50 (zoomed in), boxcar, Gaussian, and tricube kernel estimates; Cl vs. l] Note the boundary bias 62 / 102

71 Next The deficiencies of kernel regression can be addressed by considering models which, on small scales, model the regression function as a low order polynomial 62 / 102

72 Local Polynomial Regression Recall polynomial regression: f̂(x) = β̂_0 + β̂_1 x + β̂_2 x^2 + ... + β̂_p x^p where β̂ = (β̂_0, ..., β̂_p) are obtained by least squares: minimize Σ_{i=1}^n (Y_i − [β_0 + β_1 x_i + β_2 x_i^2 + ... + β_p x_i^p])^2 63 / 102

73 Local Polynomial Regression Local polynomial regression: approximate f(x) locally by a different polynomial for every x: f(u) ≈ β_0(x) + β_1(x)(u − x) + β_2(x)(u − x)^2 + ... + β_p(x)(u − x)^p for u near x Estimate (β̂_0(x), ..., β̂_p(x)) by local least squares: minimize Σ_{i=1}^n K((x − X_i)/h) (Y_i − [β_0(x) + β_1(x)(X_i − x) + ... + β_p(x)(X_i − x)^p])^2 where K((x − X_i)/h) is the kernel weight Then f̂(x) = β̂_0(x) 64 / 102

74 Local Polynomial Regression The local polynomial regression estimate is f̂(x) = Σ_{i=1}^n l_i(x) Y_i = β̂_0(x) where l(x)^T = (l_1(x), ..., l_n(x)), l(x)^T = e_1^T (D_x^T W_x D_x)^{-1} D_x^T W_x, e_1 = (1, 0, ..., 0)^T and D_x and W_x are defined by D_x = [ 1 (X_1 − x) ... (X_1 − x)^p ; 1 (X_2 − x) ... (X_2 − x)^p ; ... ; 1 (X_n − x) ... (X_n − x)^p ] and W_x = diag( K((x − X_1)/h), ..., K((x − X_n)/h) ) Compare to standard least squares which gives l(X_i)^T as the i-th row of L = D(D^T D)^{-1} D^T 65 / 102
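A bare-bones sketch (not from the slides) of the local linear (p = 1) estimator at a single point, following the weighted least squares formula above with a Gaussian kernel; the data and bandwidth are simulated and chosen purely for illustration.

set.seed(7)
n <- 200
x <- runif(n)
y <- exp(-2 * x) * sin(6 * x) + rnorm(n, sd = 0.1)

# Local linear estimate f.hat(x0) = beta0.hat(x0) with bandwidth h
loclin <- function(x0, h) {
  Dx <- cbind(1, x - x0)                        # local design matrix
  Wx <- diag(dnorm((x - x0) / h))               # kernel weights
  beta <- solve(t(Dx) %*% Wx %*% Dx, t(Dx) %*% Wx %*% y)
  beta[1]                                       # intercept = estimate at x0
}

xgrid <- seq(0, 1, length.out = 50)
fhat  <- sapply(xgrid, loclin, h = 0.1)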

75 Local Polynomial Regression So for Ŷ = (Ŷ_1, ..., Ŷ_n) and Ŷ_i = f̂(X_i), we have Ŷ = LY where L is the smoothing matrix with (i, j) entry l_j(X_i) The effective degrees of freedom is: ν = trace(L) = Σ_{i=1}^n L_ii 66 / 102

76 Local Polynomial Regression Taking p = 0, f̂(x) minimizes Σ_{i=1}^n K((x − X_i)/h) (Y_i − β_0(x))^2 which exactly gives the kernel regression estimator f̂(x) = Σ_{i=1}^n l_i(x) Y_i with l_i(x) = K((x − X_i)/h) / Σ_{j=1}^n K((x − X_j)/h) Taking p = 1 yields the local linear estimator, with f̂(x) minimizing Σ_{i=1}^n K((x − X_i)/h) (Y_i − [β_0(x) + β_1(x)(X_i − x)])^2 This is a very good all-purpose smoother Choice of Kernel K: not important Choice of bandwidth h: crucial 67 / 102

77 Local Polynomial Regression: what the theory says Local linear (p = 1) is better than kernel (p = 0): 1 Both have similar variance, but the kernel estimator has bias (h^2/2) ( f''(x) + 2 f'(x) g'(x)/g(x) ) ∫ u^2 K(u) du whereas the local linear estimator has bias (h^2/2) f''(x) ∫ u^2 K(u) du Thus, the local linear estimator is free from design bias 2 At the boundary points, the kernel estimator has bias of O(h) while the local linear estimator has bias O(h^2) 68 / 102

78 Local Polynomial Regression: choosing h (and p) Cross-validation revisited: Minimize the leave-one-out cross-validation score: CV = (1/n) Σ_{i=1}^n (Y_i − f̂_{(−i)}(x_i))^2 Amazing shortcut formula: CV = (1/n) Σ_{i=1}^n ( (Y_i − f̂(x_i)) / (1 − L_ii) )^2 A commonly used approximation is GCV (generalized cross-validation): GCV = (1/n) Σ_{i=1}^n (Y_i − f̂(x_i))^2 / (1 − ν/n)^2 where ν = trace(L) 69 / 102
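The shortcut can be checked numerically for any linear smoother. The sketch below (not from the slides) does this for ordinary least squares, where the diagonal of L is available as hatvalues(); the data are made up for illustration.

set.seed(8)
n <- 60
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.4)

fit <- lm(y ~ x)

# Brute-force leave-one-out CV
cv.loo <- mean(sapply(1:n, function(i) {
  fi <- lm(y[-i] ~ x[-i])
  (y[i] - (coef(fi)[1] + coef(fi)[2] * x[i]))^2
}))

# Shortcut formula using the diagonal of the smoother matrix
Lii <- hatvalues(fit)
cv.short <- mean(((y - fitted(fit)) / (1 - Lii))^2)

c(cv.loo, cv.short)               # the two values agree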

79 Next Next, we will go over a bit of how to perform local polynomial regression in R 69 / 102

80 Using locfit() Need to include the locfit library: > install.packages("locfit") > library(locfit) > result = locfit(y ~ x, alpha=c(0, 15), deg=1) y and x are vectors the second argument to alpha gives the bandwidth (h) the first argument to alpha specifies the nearest neighbor fraction, an alternative to the bandwidth fitted(result) gives the fitted values, f̂(x_i) residuals(result) gives the residuals, Y_i − f̂(x_i) 70 / 102
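A usage sketch (not from the slides) of choosing h by minimizing the GCV score from the previous slide over a grid of bandwidths; the simulated data, the grid, and the use of fit$dp components for the degrees of freedom (following the WMAP example later in these slides) are illustrative assumptions.

library(locfit)

set.seed(9)
n <- 500
x <- runif(n)
y <- x * (1 - x) * sin(2.1 * pi / (x + 0.2)) + rnorm(n, sd = 0.1)

hgrid <- seq(0.01, 0.2, by = 0.01)
gcv.score <- sapply(hgrid, function(h) {
  fit <- locfit(y ~ x, alpha = c(0, h), deg = 1)
  nu  <- as.numeric(fit$dp[6])                 # trace(L), as in the WMAP example
  n * sum(residuals(fit)^2) / (n - nu)^2       # GCV = (1/n) RSS / (1 - nu/n)^2
})

h.opt <- hgrid[which.min(gcv.score)]
fit   <- locfit(y ~ x, alpha = c(0, h.opt), deg = 1)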

81 Using locfit() [Plot: data and fits for the function x(1 − x) sin(2.1π/(x + 0.2)); estimate using h = 0.005 and estimate using the GCV optimal h = 0.015; degree=1, n=1000, sigma=0.1] 71 / 102

82 Using locfit() [Plot: same function, x(1 − x) sin(2.1π/(x + 0.2)); estimate using h = 0.1 and estimate using the GCV optimal h = 0.015; degree=1, n=1000, sigma=0.1] 72 / 102

83 Using locfit() [Plot: data and fits for the function x + 5 exp(−4x^2); estimate using h = 0.05 and estimate using the GCV optimal h = 0.174; degree=1, n=1000, sigma=0.5] 73 / 102

84 Using locfit() [Plot: same function, x + 5 exp(−4x^2); estimate using h = 0.5 and estimate using the GCV optimal h = 0.174; degree=1, n=1000, sigma=0.5] 74 / 102

85 Example [Plot: WMAP data, local linear fit; estimate using h = 15 and estimate using the GCV optimal h] 75 / 102

86 Example [Plot: WMAP data, local linear fit; estimate using h = 125 and estimate using the GCV optimal h] 76 / 102

87 Next We should quantify the amount of error in our estimator for the regression function Next we will look at how this can be done using local polynomial regression 76 / 102

88 Variance Estimation Let σ̂^2 = Σ_{i=1}^n (Y_i − f̂(x_i))^2 / (n − 2ν + ν̃) where ν = tr(L), ν̃ = tr(L^T L) = Σ_{i=1}^n ||l(x_i)||^2 If f is sufficiently smooth, then σ̂^2 is a consistent estimator of σ^2 77 / 102

89 Variance Estimation For the WMAP data, using local linear fit > wmap = read.table("wmap.dat", header=T) > opth = 630 > locfitwmap = locfit(wmap$cl[1:700]~wmap$ell[1:700], alpha=c(0,opth), deg=1) > nu = as.numeric(locfitwmap$dp[6]) > nutilde = as.numeric(locfitwmap$dp[7]) > sigmasqrhat = sum(residuals(locfitwmap)^2)/(700-2*nu+nutilde) > sigmasqrhat [1] But, does not seem reasonable to assume homoscedasticity 78 / 102

90 Variance Estimation Allow σ to be a function of x: Y_i = f(x_i) + σ(x_i) ɛ_i Let Z_i = log (Y_i − f(x_i))^2 and δ_i = log ɛ_i^2 Then, Z_i = log(σ^2(x_i)) + δ_i This suggests estimating log σ^2(x) by regressing the log squared residuals on x 79 / 102

91 Variance Estimation 1 Estimate f(x) with any nonparametric method to get an estimate f̂(x) 2 Define Z_i = log (Y_i − f̂(x_i))^2 3 Regress the Z_i's on the x_i's (again using any nonparametric method) to get an estimate q̂(x) of log σ^2(x) and let σ̂^2(x) = e^{q̂(x)} 80 / 102
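A sketch (not from the slides) of this three-step recipe using locfit for both smooths; the data, the true variance function, and the bandwidths are invented for illustration.

library(locfit)

set.seed(10)
n <- 500
x <- runif(n)
sigma.true <- 0.1 + 0.4 * x                    # heteroscedastic noise
y <- sin(2 * pi * x) + sigma.true * rnorm(n)

# Step 1: estimate f(x) nonparametrically
fit.f <- locfit(y ~ x, alpha = c(0, 0.1), deg = 1)

# Step 2: log squared residuals
z <- log((y - fitted(fit.f))^2)

# Step 3: regress Z on x and exponentiate to get sigma.hat^2(x)
fit.q <- locfit(z ~ x, alpha = c(0, 0.2), deg = 1)
sigma2.hat <- exp(fitted(fit.q))
sigma.hat  <- sqrt(sigma2.hat)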

92 Example [Plot: WMAP data, log squared residuals from the local linear fit with h = 630, vs. l] 81 / 102

93 Example [Plot: log squared residuals with a local linear fit, h = 130, chosen via GCV, vs. l] 82 / 102

94 Example [Plot: the resulting estimate σ̂(x), vs. l] 83 / 102

95 Confidence Bands Recall that f̂(x) = Σ_i Y_i l_i(x) so Var(f̂(x)) = Σ_i σ^2(X_i) l_i^2(x) An approximate 1 − α confidence interval for f(x) is f̂(x) ± z_{α/2} sqrt( Σ_i σ̂^2(X_i) l_i^2(x) ) When σ(x) is smooth, we can approximate sqrt( Σ_i σ^2(X_i) l_i^2(x) ) ≈ σ(x) ||l(x)|| 84 / 102

96 Confidence Bands Two caveats: 1 f̂ is biased, so this is really an interval for E(f̂(x)) Result: bands can miss sharp peaks in the function 2 Pointwise coverage does not imply simultaneous coverage for all x Solution for 2 is to replace z_{α/2} with a larger number (Sun and Loader 1994) locfit() does this for you 85 / 102

97 More on locfit() > normell = predict.locfit(locfitwmap, where="data", what="vari") normell will be ||l(x_i)||^2, i = 1, 2, ..., n (needed for variance calculation) The Sun and Loader replacement for z_{α/2} is found using kappa0(locfitwmap)$crit.val 86 / 102
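Putting the pieces together, here is a hedged sketch (not verbatim from the slides) of pointwise and simultaneous bands for the constant-variance case, using what="vari" for ||l(x_i)||^2 and kappa0() for the Sun and Loader critical value; the data are simulated, and the component names (dp[6], dp[7], $crit.val) simply follow the slides and may differ across locfit versions.

library(locfit)

set.seed(11)
n <- 400
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)

fit <- locfit(y ~ x, alpha = c(0, 0.1), deg = 1)

# sigma^2 estimate from the residuals (constant-variance case), as on slide 88
nu      <- as.numeric(fit$dp[6])
nutilde <- as.numeric(fit$dp[7])
sigma2  <- sum(residuals(fit)^2) / (n - 2 * nu + nutilde)

normel <- predict(fit, where = "data", what = "vari")   # ||l(x_i)||^2
se     <- sqrt(sigma2 * normel)
fhat   <- fitted(fit)

lower.pt <- fhat - qnorm(0.975) * se          # 95% pointwise band
upper.pt <- fhat + qnorm(0.975) * se

crit <- kappa0(fit)$crit.val                  # Sun-Loader critical value (name as on the slide)
lower.sim <- fhat - crit * se                 # 95% simultaneous band
upper.sim <- fhat + crit * se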

98 Example [Plot: 95% pointwise confidence bands for the WMAP data: the estimate, the band assuming constant variance, and the band using nonconstant variance; Cl vs. l] 87 / 102

99 Example [Plot: 95% simultaneous confidence bands for the WMAP data: the estimate, the band assuming constant variance, and the band using nonconstant variance; Cl vs. l] 88 / 102

100 Next We discuss the use of regression splines These are a well-motivated choice of basis for constructing regression functions 88 / 102

101 Basis Methods Idea: expand f as f(x) = Σ_j β_j ψ_j(x) where ψ_1(x), ψ_2(x), ... are specially chosen, known functions Then estimate the β_j and set f̂(x) = Σ_j β̂_j ψ_j(x) Here we consider a popular version, splines 89 / 102

102 Splines and Penalization Define f̂ to be the function that minimizes M(λ) = Σ_i (Y_i − f(x_i))^2 + λ ∫ (f''(x))^2 dx λ = 0: f̂(x_i) = Y_i (no smoothing) λ = ∞: f̂(x) = β̂_0 + β̂_1 x (linear) 0 < λ < ∞: f̂(x) = cubic spline with knots at the X_i A cubic spline is a continuous function f such that 1 f is a cubic polynomial between the X_i's 2 f has continuous first and second derivatives at the X_i's 90 / 102

103 Basis For Splines Define (z)_+ = max{z, 0} and N = n + 4, and let ψ_1(x) = 1, ψ_2(x) = x, ψ_3(x) = x^2, ψ_4(x) = x^3, ψ_5(x) = (x − X_1)_+^3, ψ_6(x) = (x − X_2)_+^3, ..., ψ_N(x) = (x − X_n)_+^3 These functions form a basis for the splines: we can write f(x) = Σ_{j=1}^N β_j ψ_j(x) (For numerical calculations it is actually more efficient to use other spline bases) 91 / 102
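To see the basis in action, here is a small sketch (not from the slides) that builds the truncated power basis with a handful of knots and fits an unpenalized regression spline by ordinary least squares; the simulated data and the choice of 5 interior knots (rather than knots at every X_i) are illustrative assumptions.

set.seed(12)
n <- 200
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

pos3 <- function(z) pmax(z, 0)^3               # (z)_+^3

knots <- seq(0.1, 0.9, length.out = 5)         # a few knots instead of all X_i
Psi <- cbind(1, x, x^2, x^3,
             sapply(knots, function(k) pos3(x - k)))

# Unpenalized least squares in this basis (lambda = 0)
betahat <- solve(t(Psi) %*% Psi, t(Psi) %*% y)
fhat <- Psi %*% betahat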

104 Basis For Splines We can now rewrite the minimization as follows: minimize: (Y − Ψβ)^T (Y − Ψβ) + λ β^T Ω β where Ψ_ij = ψ_j(X_i) and Ω_jk = ∫ ψ_j''(x) ψ_k''(x) dx The value of β that minimizes this is β̂ = (Ψ^T Ψ + λΩ)^{-1} Ψ^T Y The smoothing spline f̂(x) is a linear smoother, that is, there exist weights l(x) such that f̂(x) = Σ_{i=1}^n Y_i l_i(x) 92 / 102

105 Basis For Splines [Plot: the truncated power basis functions ψ_j with 5 knots] 93 / 102

106 Basis For Splines [Plot: same span as previous frame, the B-spline basis with 5 knots] 94 / 102

107 Smoothing Splines in R > smosplresult = smooth.spline(x, y, cv=FALSE, all.knots=TRUE) If cv=TRUE, then cross-validation is used to choose λ; if cv=FALSE, GCV is used If all.knots=TRUE, then knots are placed at all data points; if all.knots=FALSE, then set nknots to specify the number of knots to be used Using fewer knots eases the computational cost predict(smosplresult) gives fitted values 95 / 102

108 Example [Plot: WMAP data with smoothing spline fit, λ chosen using GCV; Cl vs. l] 96 / 102

109 Next Finally, we consider the challenges of fitting regression models when the predictor x is of large dimension 96 / 102

110 Multiple Regression Y = f(X_1, X_2, ..., X_d) + ɛ The curse of dimensionality: Optimal rate of convergence for d = 1 is n^{−4/5} In d dimensions the optimal rate of convergence is n^{−4/(4+d)} Thus, the sample size m required for a d-dimensional problem to have the same accuracy as a sample size n in a one-dimensional problem is m ∝ n^{cd} where c = (4 + d)/(5d) > 0 To maintain a given degree of accuracy of an estimator, the sample size must increase exponentially with the dimension d Put another way, confidence bands get very large as the dimension d increases 97 / 102

111 Multiple Local Linear Regression There is now a d × d bandwidth matrix H (nonsingular positive definite) and we use K(H^{−1/2} x) Often, we instead scale each covariate to have the same mean and variance and then we use K(||x||/h) where K is any one-dimensional kernel Then there is a single bandwidth parameter h 98 / 102

112 Multiple Local Linear Regression At a target value x = (x_1, ..., x_d)^T, the local sum of squares is given by Σ_{i=1}^n w_i(x) (Y_i − β_0 − Σ_{j=1}^d β_j (X_ij − x_j))^2 where w_i(x) = K(||X_i − x||/h) The estimator is f̂(x) = β̂_0, where β̂ = (β̂_0, ..., β̂_d)^T is the value of β = (β_0, ..., β_d)^T that minimizes the weighted sums of squares 99 / 102

113 Multiple Local Linear Regression The solution β̂ is β̂ = (D_x^T W_x D_x)^{-1} D_x^T W_x Y where D_x = [ 1 (X_11 − x_1) ... (X_1d − x_d) ; 1 (X_21 − x_1) ... (X_2d − x_d) ; ... ; 1 (X_n1 − x_1) ... (X_nd − x_d) ] and W_x is the diagonal matrix whose (i, i) element is w_i(x) 100 / 102

114 A Compromise: Additive Models Y = α + Σ_{j=1}^d f_j(x_j) + ɛ Usually take α̂ = Ȳ Then estimate the f_j's by backfitting: 1 set α̂ = Ȳ, f̂_1 = ... = f̂_d = 0 2 Iterate until convergence: for j = 1, ..., d: Compute Ỹ_i = Y_i − α̂ − Σ_{k ≠ j} f̂_k(X_ik), i = 1, ..., n Apply a smoother to Ỹ on (X_1j, ..., X_nj)^T to obtain f̂_j Set f̂_j(x) equal to f̂_j(x) − n^{-1} Σ_{i=1}^n f̂_j(X_ij) The last step ensures that Σ_{i=1}^n f̂_j(X_ij) = 0 (identifiability) 101 / 102
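A bare-bones sketch (not from the slides) of the backfitting loop for d = 2, using smooth.spline as the smoother in step 2; the simulated data and the fixed number of sweeps (in place of a formal convergence check) are illustrative assumptions.

set.seed(13)
n <- 300
x1 <- runif(n); x2 <- runif(n)
y  <- 2 + sin(2 * pi * x1) + 4 * (x2 - 0.5)^2 + rnorm(n, sd = 0.3)

X <- cbind(x1, x2)
d <- ncol(X)
alpha <- mean(y)
fhat  <- matrix(0, n, d)                       # column j holds f.hat_j(X_ij)

for (it in 1:20) {                             # iterate until (approximate) convergence
  for (j in 1:d) {
    ytilde <- y - alpha - rowSums(fhat[, -j, drop = FALSE])
    sm <- smooth.spline(X[, j], ytilde)
    fj <- predict(sm, X[, j])$y
    fhat[, j] <- fj - mean(fj)                 # center for identifiability
  }
}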

115 To Sum Up Classic linear regression is computationally and theoretically simple Nonparametric procedures for regression offer increased flexibility Local polynomial regression improves on kernel regression Bandwidth can be chosen using cross validation High-dimensional data present a challenge Not covered: Bayesian methods 102 / 102
