Introduction

Linear Regression

E(Y | X) = X_1 β_1 + ... + X_d β_d = X^T β

Example: Wage equation
Y = log wages, X = schooling (measured in years), labor market experience (measured as AGE - SCHOOL - 6) and experience squared:

E(Y | SCHOOL, EXP) = β_1 + β_2 SCHOOL + β_3 EXP + β_4 EXP^2

Coefficient estimates for the wage equation:

Dependent Variable: Log Wages
Variable    Coefficients   S.E.     t-values
SCHOOL       0.0898        0.0083   10.788
EXP          0.0349        0.0056    6.185
EXP^2       -0.0005        0.0001   -4.307
constant     0.5202        0.1236    4.209
R^2 = 0.24, sample size n = 534

Table 1: Results from OLS estimation, CPS 1985 (n = 534), Berndt (1991)

Figure 1: Wage-schooling profile and wage-experience profile  R: SPMcps85lin

Figure 2: Parametrically estimated regression function (wage as a function of schooling and experience)  R: SPMcps85lin
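To make the parametric benchmark concrete, here is a minimal R sketch of the OLS wage equation. It assumes a data frame cps85 with columns wage, school and exp (hypothetical names; the figures above come from the book's SPMcps85lin script, which is not reproduced here).

    cps85$lnw <- log(cps85$wage)                    # Y = log wages
    ols <- lm(lnw ~ school + exp + I(exp^2), data = cps85)
    summary(ols)                                    # coefficients, S.E. and t-values as in Table 1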
Wage <-- Schooling, Experience

Linear regression:
E(Y | SCHOOL, EXP) = const + β_1 SCHOOL + β_2 EXP

Nonparametric Regression

m(·) is a smooth function:
E(Y | SCHOOL, EXP) = m(SCHOOL, EXP)

Figure 3: Nonparametrically estimated regression function  R: SPMcps85reg

Engel Curve

Example: Engel curve
Engel (1857): "... daß je ärmer eine Familie ist, einen desto größeren Anteil von der Gesammtausgabe muß zur Beschaffung der Nahrung aufgewendet werden."
(The poorer a family, the bigger the share of total expenditures that has to be used for food.)

Figure 4: Engel curve, U.K. Family Expenditure Survey 1973  R: SPMengelcurve2
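As a quick illustration of such a nonparametric fit, the following R sketch smooths simulated Engel-type data with a kernel smoother; the data are synthetic stand-ins, since the 1973 FES data are not reproduced here.

    set.seed(1)
    netincome <- runif(500, 0.1, 3)                      # synthetic stand-in for net income
    foodshare <- 0.4 - 0.1 * log(netincome) + rnorm(500, sd = 0.05)
    fit <- ksmooth(netincome, foodshare, kernel = "normal", bandwidth = 0.2)
    plot(netincome, foodshare, col = "grey")
    lines(fit, lwd = 2)                                  # smooth Engel-type curve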
Example: Wage equation

Nonparametric regression:
E(Y | SCHOOL, EXP) = m(SCHOOL, EXP)

Semiparametric Regression

E(Y | SCHOOL, EXP) = α + g_1(SCHOOL) + g_2(EXP)

g_1(·), g_2(·) are smooth functions.

Figure 8: Additive model fit vs. parametric fit, wage-schooling (left) and wage-experience (right) profiles  R: SPMcps85add

Wage <-- Schooling, Experience
Figure 9: Surface plot for the additive model  R: SPMcps85add

Example: Binary choice model

Y = 1 if the person can imagine moving to the west, Y = 0 otherwise.

E(Y | x) = P(Y = 1 | x) = G(β^T x)

Typically a logistic link function is used (logit model):

E(Y | x) = P(Y = 1 | x) = 1 / {1 + exp(-β^T x)}
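A hedged sketch of the additive model fit using the mgcv package; gam() with spline smooths is a substitute for, not a copy of, the book's SPMcps85add code. The data frame cps85 with columns lnw, school, exp is assumed as before.

    library(mgcv)
    add_fit <- gam(lnw ~ s(school) + s(exp), data = cps85)   # alpha + g1(SCHOOL) + g2(EXP)
    summary(add_fit)
    plot(add_fit, pages = 1)                                 # estimated g1 and g2, cf. Figure 8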
Logit Model

Figure 10: Logit model for migration (link function and responses vs. index)  R: SPMlogit

Heteroscedasticity problems

Binary choice model with

Var(ε) = (1/4) {1 + (β^T x)^2}^2 Var(u),

where u has a (standard) logistic distribution.

Wrong link consequences

Figure 11: The link function of a homoscedastic logit model (thin) vs. a heteroscedastic model (solid); true versus logit link  R: SPMtruelogit

Figure 12: Sampling distribution of the ratio of estimated coefficients and the ratio's true value  R: SPMsimulogit
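The logit model itself is a one-liner in R via glm(); migration is a hypothetical data frame with binary response y and regressors x1, x2.

    logit_fit <- glm(y ~ x1 + x2, data = migration, family = binomial(link = "logit"))
    summary(logit_fit)
    head(predict(logit_fit, type = "link"))      # fitted index beta'x
    head(predict(logit_fit, type = "response"))  # fitted P(Y = 1 | x)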
Single Index Model

Figure 13: Single index model for migration (link function vs. index, with responses)  R: SPMsim

Summary: Introduction

Parametric models are fully determined up to a parameter (vector). They lead to an easy interpretation of the resulting fits.

Nonparametric models impose only qualitative restrictions, such as a smooth density function or a smooth regression function m. They may support known parametric models, and their flexibility opens the way to new models.

Semiparametric models combine parametric and nonparametric parts. This keeps the easy interpretation of the results but gives more flexibility in some aspects of the model.

Nonparametric Regression

{(X_i, Y_i)}, i = 1, ..., n;  X in R^d, Y in R

Engel curve: X = net income, Y = expenditure

Y = m(X) + ε

CHARN model: time series of the form Y_t = m(Y_{t-1}) + σ(Y_{t-1}) ξ_t

Univariate Kernel Regression

Model: Y_i = m(X_i) + ε_i, i = 1, ..., n, with m(·) a smooth regression function and ε_i i.i.d. error terms with E(ε_i) = 0.

We aim to estimate the conditional expectation of Y given X = x:

m(x) = E(Y | X = x) = ∫ y f(y|x) dy = ∫ y f(x,y)/f_X(x) dy,

where f(x,y) denotes the joint density of (X,Y) and f_X(x) the marginal density of X.

Example: normal variables (X,Y) ~ N(µ, Σ)  ⟹  m(x) is linear.
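A toy simulation of a CHARN-type series, with illustrative (made-up) choices of m and σ, useful for experimenting with the regression estimators below:

    set.seed(1)
    n <- 500; y <- numeric(n)
    m_fun <- function(y) 0.6 * tanh(y)             # illustrative conditional mean
    s_fun <- function(y) sqrt(0.1 + 0.2 * y^2)     # illustrative conditional volatility
    for (t in 2:n) y[t] <- m_fun(y[t - 1]) + s_fun(y[t - 1]) * rnorm(1)
    plot(y[-n], y[-1], xlab = "Y_{t-1}", ylab = "Y_t")   # lag plot: the target of smoothing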
Nadaraya-Watson Estimator

Idea: (X_i, Y_i) have a joint pdf, so we can estimate m(·) via a bivariate kernel density estimator

f̂_{h,g}(x,y) = n^{-1} Σ_{i=1}^n K_h(x - X_i) K_g(y - Y_i).

Integrating out y gives

∫ y f̂_{h,g}(x,y) dy = n^{-1} Σ_{i=1}^n K_h(x - X_i) Y_i,

and the resulting estimator is

m̂_h(x) = [n^{-1} Σ_{i=1}^n K_h(x - X_i) Y_i] / [n^{-1} Σ_{j=1}^n K_h(x - X_j)] = r̂_h(x) / f̂_h(x).

Engel Curve

Figure 44: Nadaraya-Watson kernel regression, h = 0.2, U.K. Family Expenditure Survey 1973  R: SPMengelcurve

Statistical properties of the Nadaraya-Watson estimator

m̂_h(x) - m(x) = {r̂_h(x) - m(x) f̂_h(x)}/f_X(x) + {m̂_h(x) - m(x)} {f_X(x) - f̂_h(x)}/f_X(x)

Calculate now bias and variance in the same way as for the KDE:

AMSE{m̂_h(x)} = σ^2(x) ‖K‖_2^2 / {nh f_X(x)}                                  (variance part)
               + (h^4/4) {m''(x) + 2 m'(x) f_X'(x)/f_X(x)}^2 µ_2^2(K)          (bias part)

Figure 45: Four kernel regression estimates for the 1973 U.K. Family Expenditure data with bandwidths h = 0.05, h = 0.1, h = 0.2, and h = 0.5  R: SPMregress
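A from-scratch Nadaraya-Watson estimator with a Gaussian kernel, mirroring the ratio formula above (the common 1/(nh) factors cancel); a minimal sketch on simulated data, not the SPMengelcurve script.

    nw <- function(x, X, Y, h) {
      sapply(x, function(x0) {
        w <- dnorm((x0 - X) / h)   # K_h(x0 - X_i) up to the common factor 1/h
        sum(w * Y) / sum(w)        # r_hat / f_hat
      })
    }
    set.seed(1)
    X <- runif(200); Y <- sin(2 * pi * X) + rnorm(200, sd = 0.3)
    grid <- seq(0, 1, length.out = 101)
    plot(X, Y, col = "grey"); lines(grid, nw(grid, X, Y, h = 0.05), lwd = 2)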
Asymptotic normal distribution

Under some regularity conditions and h = c n^{-1/5}:

n^{2/5} {m̂_h(x) - m(x)} →L N(b_x, v_x^2)

with bias term b_x = (c^2 µ_2(K)/2) {m''(x) + 2 m'(x) f_X'(x)/f_X(x)} and variance term v_x^2 = σ^2(x) ‖K‖_2^2 / {c f_X(x)}.

Remarks: the bias is a function of m', m'' and f_X'; the variance is a function of σ^2 and f_X.

Pointwise Confidence Intervals

[ m̂_h(x) - z_{1-α/2} √{‖K‖_2^2 σ̂^2(x) / (nh f̂_h(x))},  m̂_h(x) + z_{1-α/2} √{‖K‖_2^2 σ̂^2(x) / (nh f̂_h(x))} ]

with σ̂^2(x) = n^{-1} Σ_{i=1}^n W_{hi}(x) {Y_i - m̂_h(x)}^2.

σ^2 and f_X both influence the precision of the confidence interval. Correction for bias!? Analogous to the KDE case: uniform confidence bands (Bickel and Rosenblatt, 1973).

Figure 46: Nadaraya-Watson kernel regression and 95% confidence intervals, h = 0.2, U.K. Family Expenditure Survey 1973  R: SPMengelconf

Local Polynomial Estimation

Taylor expansion for sufficiently smooth functions:

m(t) ≈ m(x) + m'(x)(t-x) + m''(x)(t-x)^2/2! + ... + m^{(p)}(x)(t-x)^p/p!

Consider a weighted least squares problem:

min_β Σ_{i=1}^n {Y_i - β_0 - β_1(X_i - x) - β_2(X_i - x)^2 - ... - β_p(X_i - x)^p}^2 K_h(x - X_i)

⟹ the resulting estimate of β provides estimates for m^{(ν)}(x), ν = 0, 1, ..., p.
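A minimal sketch of the pointwise interval formula, reusing nw() from the previous block (illustrative only, not the SPMengelconf code):

    ci_nw <- function(grid, X, Y, h, alpha = 0.05) {
      mh <- nw(grid, X, Y, h)
      fh <- sapply(grid, function(x0) mean(dnorm((x0 - X) / h)) / h)   # KDE of f_X
      s2 <- sapply(seq_along(grid), function(j) {                      # sigma^2(x) via NW weights
        w <- dnorm((grid[j] - X) / h); w <- w / sum(w)
        sum(w * (Y - mh[j])^2)
      })
      K22 <- 1 / (2 * sqrt(pi))                    # ||K||_2^2 for the Gaussian kernel
      half <- qnorm(1 - alpha / 2) * sqrt(K22 * s2 / (length(X) * h * fh))
      cbind(lower = mh - half, upper = mh + half)
    }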
Notation

X = ( 1  X_1 - x  (X_1 - x)^2 ... (X_1 - x)^p
      1  X_2 - x  (X_2 - x)^2 ... (X_2 - x)^p
      ...
      1  X_n - x  (X_n - x)^2 ... (X_n - x)^p ),

Y = (Y_1, ..., Y_n)^T,  W = diag({K_h(x - X_i)}_{i=1}^n)

Local polynomial estimator:
β̂(x) = (X^T W X)^{-1} X^T W Y

Local polynomial regression estimator:
m̂_{h,p}(x) = β̂_0(x)

(Asymptotic) Statistical Properties

Under regularity conditions, h = c n^{-1/5} and nh → ∞:

n^{2/5} {m̂_{h,1}(x) - m(x)} →L N( (c^2 µ_2(K)/2) m''(x),  σ^2(x) ‖K‖_2^2 / {c f_X(x)} )

Remarks: asymptotically equivalent to using a higher order kernel; an analogous theorem can be stated for derivative estimation.

Local Polynomial Regression / Derivative Estimation

Figure 47: Local polynomial regression, p = 1, h = 0.2, U.K. Family Expenditure Survey 1973  R: SPMlocpolyreg

Figure 48: Local polynomial derivative estimation, p = 2, h by rule of thumb, U.K. Family Expenditure Survey 1973  R: SPMderivest
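In practice the weighted least squares fit is available in the KernSmooth package; a sketch on simulated data (locpoly is a standard implementation, though not necessarily the code behind the figures):

    library(KernSmooth)
    set.seed(1)
    X <- runif(300); Y <- sin(2 * pi * X) + rnorm(300, sd = 0.3)
    fit  <- locpoly(X, Y, degree = 1, bandwidth = 0.05)           # m(x),  p = 1
    dfit <- locpoly(X, Y, drv = 1, degree = 2, bandwidth = 0.1)   # m'(x), p = 2
    plot(X, Y, col = "grey"); lines(fit, lwd = 2)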
Different Kernel Functions

Required properties of kernels:
K(·) is a density function: ∫ K(u) du = 1 and K(u) ≥ 0
K(·) is symmetric: ∫ u K(u) du = 0

Kernel        K(u)
Uniform       (1/2) 1(|u| ≤ 1)
Triangle      (1 - |u|) 1(|u| ≤ 1)
Epanechnikov  (3/4)(1 - u^2) 1(|u| ≤ 1)
Quartic       (15/16)(1 - u^2)^2 1(|u| ≤ 1)
Triweight     (35/32)(1 - u^2)^3 1(|u| ≤ 1)
Gaussian      (1/√(2π)) exp(-u^2/2)
Cosine        (π/4) cos(πu/2) 1(|u| ≤ 1)

Table 2: Kernel functions

Figure 27: Some kernel functions: Uniform (top left), Triangle (bottom left), Epanechnikov (top right), Quartic (bottom right)  R: SPMkernel

Example: Construction of the KDE

Consider the KDE using a Gaussian kernel:

f̂_h(x) = (1/n) Σ_{i=1}^n K_h(x - X_i) = 1/(nh) Σ_{i=1}^n (1/√(2π)) exp(-u^2/2),  u = (x - X_i)/h.
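Constructing the Gaussian KDE by hand and checking it against R's built-in density() (which uses a binned FFT approximation, so agreement is close but not exact):

    kde <- function(x, X, h) sapply(x, function(x0) mean(dnorm((x0 - X) / h)) / h)
    set.seed(1)
    X <- rnorm(100)
    grid <- seq(-3, 3, length.out = 101)
    max(abs(kde(grid, X, h = 0.4) -
            density(X, bw = 0.4, from = -3, to = 3, n = 101)$y))   # small binning error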
How to get a Multivariate Kernel?

For u = (u_1, ..., u_d)^T:

product kernel:  K(u) = K(u_1) · ... · K(u_d)

radially symmetric or spherical kernel:  K(u) = K(‖u‖) / ∫_{R^d} K(‖u‖) du,  with ‖u‖ = √(u^T u)

Example: bandwidth matrix H = ( 1 0; 0 1 )
Figure 37: Contours from bivariate product (left) and bivariate radially symmetric (right) Epanechnikov kernel  R: SPMkernelcontours

Example: bandwidth matrix H = ( 1 0; 0 0.5 )
Figure 38: Contours from bivariate product (left) and bivariate radially symmetric (right) Epanechnikov kernel  R: SPMkernelcontours

Example: bandwidth matrix H = ( 1 0.5; 0.5 1 )^{1/2}
Figure 39: Contours from bivariate product (left) and bivariate radially symmetric (right) Epanechnikov kernel  R: SPMkernelcontours
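The two constructions for the bivariate Epanechnikov kernel, as compared in the contour figures; 2/π is the normalizing constant of the spherical version on the unit disc (a sketch, not the SPMkernelcontours script):

    epan <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)
    K_prod <- function(u1, u2) epan(u1) * epan(u2)            # product kernel
    K_sph  <- function(u1, u2) {                              # spherical kernel
      r2 <- u1^2 + u2^2
      (2 / pi) * (1 - r2) * (r2 <= 1)
    }
    g <- seq(-1.2, 1.2, length.out = 81)
    contour(g, g, outer(g, g, K_prod))                        # cf. Figure 37, left
    contour(g, g, outer(g, g, K_sph))                         # cf. Figure 37, right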
Kernel properties

K is a density function:  ∫_{R^d} K(u) du = 1,  K(u) ≥ 0
K is symmetric:  ∫_{R^d} u K(u) du = 0_d
K has a second moment (matrix):  ∫_{R^d} u u^T K(u) du = µ_2(K) I_d, where I_d denotes the d×d identity matrix
K has a kernel norm:  ‖K‖_2^2 = ∫ K^2(u) du

Higher Order Kernels

A kernel K is of order p if

∫ u^j K(u) du = 0 for j = 1, ..., p-1,  and  0 ≠ ∫ u^p K(u) du < ∞.

Figure 49: Optimal p-th order kernels for ν-th derivative estimation: K_opt(ν=0, p=4,6) (left) and K_opt(ν=1,2, p=5,6) (right)  R: SPMhokernel

Optimal and Gauss-type Higher Order Kernels

Order (ν, p)   K(u)
Opt (0,4)      (15/32)(3 - 10u^2 + 7u^4) 1(|u| ≤ 1)
Opt (0,6)      (35/256)(15 - 105u^2 + 189u^4 - 99u^6) 1(|u| ≤ 1)
Opt (1,3)      (15/4)(u^3 - u) 1(|u| ≤ 1)
Opt (1,5)      (105/32)(-5u + 14u^3 - 9u^5) 1(|u| ≤ 1)
Opt (2,4)      (105/16)(-1 + 6u^2 - 5u^4) 1(|u| ≤ 1)
Opt (2,6)      (315/64)(-5 + 63u^2 - 135u^4 + 77u^6) 1(|u| ≤ 1)
Gauss (0,4)    (1/2)(3 - u^2) φ(u)
Gauss (0,6)    (1/8)(15 - 10u^2 + u^4) φ(u)
Gauss (0,8)    (1/48)(105 - 105u^2 + 21u^4 - u^6) φ(u)
Gauss (0,10)   (1/384)(945 - 1260u^2 + 378u^4 - 36u^6 + u^8) φ(u)

Estimators using Higher Order Kernels

Kernel estimators with higher order (p > 2) kernels achieve better bias rates (typically: K of order p gives bias of order h^p). Therefore, their asymptotic statistical properties are equivalent to those of local polynomials of the corresponding order.

As for local polynomials, the optimal choice is to have p - ν > 0 and even.

Note that this is an asymptotic statement. In practice, performance (especially graphically) may be poor unless the sample size is huge.

Higher order kernels can also be applied to density estimation, but may give negative estimates for some x-values, especially where data are sparse or the sample is small.
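The last point is easy to see numerically: plugging the Gauss-type (0,4) kernel from the table into a KDE can produce negative values in sparse regions.

    K4 <- function(u) 0.5 * (3 - u^2) * dnorm(u)                 # Gauss (0,4) kernel
    kde4 <- function(x, X, h) sapply(x, function(x0) mean(K4((x0 - X) / h)) / h)
    set.seed(2)
    X <- rnorm(50)
    grid <- seq(-4, 4, length.out = 161)
    plot(grid, kde4(grid, X, h = 0.5), type = "l")
    abline(h = 0, lty = 2)   # the estimate can dip below zero in the tails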
Bandwidth Choice in Kernel Regression

Figure 50: MASE (thick line), squared bias part (thin solid line) and variance part (thin dashed line) for simulated data  R: SPMsimulmase

Example: simulated data for the previous figure
Figure 51: Simulated data with curve m(x) = {sin(2πx^3)}^3, Y_i = m(X_i) + ε_i, X_i ~ U[0,1], ε_i ~ N(0, 0.1)  R: SPMsimulmase

Convergence rates (univariate case)

For a kernel K of order p:

AMSE(h) = C_1/(nh) + h^{2p} C_2

h_opt = argmin_h AMISE(h)  ⟹  h_opt ~ n^{-1/(2p+1)},  AMISE(h_opt) ~ n^{-2p/(2p+1)}

Plug-In Methods

Idea: calculate pre-estimators of all unknown parts in C_1 and C_2. Most practical: rough approximations should do, e.g. use some parametric higher order polynomials (at least of order two, to obtain also the second derivatives).

Cross Validation

ASE = n^{-1} Σ_{j=1}^n {m̂_h(X_j) - m(X_j)}^2 w(X_j)
MASE = E{ASE | X_1 = x_1, ..., X_n = x_n}

Estimate the MASE by the resubstitution function (w is a weight function):

p(h) = n^{-1} Σ_{i=1}^n {Y_i - m̂_h(X_i)}^2 w(X_i)

Separate estimation and validation by using leave-one-out estimators:

CV(h) = n^{-1} Σ_{i=1}^n {Y_i - m̂_{h,-i}(X_i)}^2 w(X_i)

Minimizing gives ĥ_CV.
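Leave-one-out CV for the Nadaraya-Watson estimator is short to implement directly (weight function w ≡ 1 here):

    cv_nw <- function(h, X, Y) {
      mean(sapply(seq_along(X), function(i) {
        w <- dnorm((X[i] - X[-i]) / h)          # leave observation i out
        (Y[i] - sum(w * Y[-i]) / sum(w))^2
      }))
    }
    set.seed(1)
    X <- runif(200); Y <- sin(2 * pi * X) + rnorm(200, sd = 0.3)
    hs <- seq(0.02, 0.3, by = 0.01)
    h_cv <- hs[which.min(sapply(hs, cv_nw, X = X, Y = Y))]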
Generalized Cross Validation

Because CV easily breaks down or becomes computationally too intensive for large data sets, generalized cross validation (GCV) has been proposed. It minimizes the residual sum of squares corrected for the degrees of freedom (dof). Unfortunately, there exist several definitions, e.g.

n σ̂^2/(n - dof) (=) RSS/(n - dof),   n RSS/(n - dof)^2

Also for the dof we find several definitions. Considering only estimators linear in Y, i.e. m̂(X) = AY, typical proposals are

trace(A),  trace(A A^T),  n - trace(2A - A A^T)

Any of these definitions makes sense only for symmetric A. Minimizing gives ĥ_GCV.

Nadaraya-Watson Estimate

Figure 52: Nadaraya-Watson kernel regression, cross-validated bandwidth ĥ_CV = 0.15, U.K. Family Expenditure Survey 1973  R: SPMnadwaest

Local Polynomial Estimate

Figure 53: Local polynomial regression, p = 1, cross-validated bandwidth ĥ_CV = 0.56, U.K. Family Expenditure Survey 1973  R: SPMlocpolyest
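For a linear smoother the GCV criterion can be computed from the smoother matrix A; a sketch for the NW estimator with dof = trace(A), reusing X, Y and hs from the CV block above. (Note the caveat: the NW smoother matrix is not symmetric, so this is purely illustrative.)

    gcv_nw <- function(h, X, Y) {
      A <- outer(X, X, function(a, b) dnorm((a - b) / h))
      A <- A / rowSums(A)                              # rows = NW weights, m_hat = A %*% Y
      rss <- sum((Y - A %*% Y)^2)
      length(Y) * rss / (length(Y) - sum(diag(A)))^2   # n * RSS / (n - dof)^2
    }
    h_gcv <- hs[which.min(sapply(hs, gcv_nw, X = X, Y = Y))]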