41903: Introduction to Nonparametrics
- William Gilbert
- 5 years ago
1 41903: Notes 5
2 Introduction
Nonparametrics is fundamentally about fitting flexible models: we want a model that is flexible enough to accommodate important patterns but not so flexible that it overspecializes to the specific data set.
- I.e., we are concerned about overfitting/false discovery.
- I.e., we are concerned about the bias/variance tradeoff.
Data-mining/big-data methods are naturally viewed as nonparametric procedures.
3 Some ML/Data-Mining Jargon
Data-mining terminology: Supervised Learning
- Want to predict a target variable Y with input variables X.
- AKA Predictive Analytics.
- This is our goal in this set of notes. Focus on classic methods here.
Data-mining terminology: Unsupervised Learning
- Want to find structure within a set of variables X. No specific target.
- Exploratory data analysis (EDA). Essentially fancy descriptive statistics.
4 Predictive Models
Useful to think about the relation between target (Y) and input (X) as

Y_i = g(x_i) + ε_i,

where Y_i is the target, g(x_i) the signal, and ε_i the noise.
Goal: Learn g(·) from the data in a way that yields generalizable forecasts, OR get a forecast rule that minimizes expected forecast loss.
5 Squared Error Loss
Loss function: L(Y, g(X)) = (Y − g(X))²
Expected loss of g(·): E_X[ E_{Y|X}[ L(Y, g(X)) | X ] ]
Clear that it is sufficient to minimize pointwise. Forecast rule:

g(x) = arg min_c E_{Y|X}[(Y − c)² | X = x]
     = arg min_c { E_{Y|X}[Y² | X = x] − 2c E_{Y|X}[Y | X = x] + c² }
     = E[Y | X = x]

Could also use L(Y, g(X)) = |Y − g(X)|, which gives g(x) = median(Y | X = x).
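The pointwise minimization above can be checked numerically: for a fixed x, grid-search the constant c that minimizes the sample analogue of the expected loss. A minimal sketch (the simulated distribution and all names are illustrative, not from the notes):

```python
import numpy as np

# Draws of Y | X = x from an arbitrary distribution; check that squared-error
# loss is minimized at the sample mean and absolute loss near the sample median.
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=10_000)

grid = np.linspace(-1.0, 5.0, 601)                      # candidate constants c
sq_loss = [np.mean((y - c) ** 2) for c in grid]
abs_loss = [np.mean(np.abs(y - c)) for c in grid]

c_sq = grid[np.argmin(sq_loss)]                         # should sit at the mean
c_abs = grid[np.argmin(abs_loss)]                       # should sit at the median
print(c_sq, y.mean())
print(c_abs, np.median(y))
```

The grid minimizer under squared loss lands on the grid point nearest the sample mean, and under absolute loss on the point nearest the sample median, matching the derivation.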
6 Classification/Discrete Outcome
Discrete Y: Y ∈ {1, ..., R}
Common loss function, 0-1 loss:

L(k, l) = L(Y = k, g(x) = l) = 0 if k = l, 1 otherwise

Expected loss of g(·):

E_X[ E_{Y|X}[ L(Y, g(X)) | X ] ] = E_X[ Σ_{r=1}^R L(r, g(X)) P(Y = r | X) ]
7 Classification/Discrete Outcome
Again, sufficient to minimize pointwise. That is,

g(x) = arg min_c Σ_{r=1}^R L(r, c) P(Y = r | X = x)
     = arg min_c Σ_{r ≠ c} P(Y = r | X = x)
     = arg min_c (1 − P(Y = c | X = x))

so g(x) = r such that P(Y = r | X = x) = max_c P(Y = c | X = x).
This defines the Bayes classifier: simply forecast the outcome with the highest conditional probability.
Key input: P(Y = r | X = x) = E[1(Y = r) | X = x]
Common structure: want a good estimate of E[Y | X] or another feature of the conditional distribution (for suitable Y).
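When X is discrete, the key input P(Y = r | X = x) can be estimated by cell frequencies, after which the Bayes rule is just an argmax. A small sketch (the data-generating process and function names are our own, chosen for illustration):

```python
import numpy as np

# Estimate P(Y = r | X = x) by frequencies within each cell of a binary X,
# then classify by the Bayes rule "predict the most probable class".
rng = np.random.default_rng(1)
n = 20_000
x = rng.integers(0, 2, size=n)               # binary regressor
p1 = np.where(x == 0, 0.2, 0.8)              # true P(Y = 1 | X)
y = (rng.random(n) < p1).astype(int)

def bayes_classify(x0):
    mask = x == x0
    p_hat = np.array([np.mean(y[mask] == r) for r in (0, 1)])
    return int(np.argmax(p_hat))             # class with highest cond. probability

print(bayes_classify(0), bayes_classify(1))  # predicts 0 for x=0 and 1 for x=1
```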
8 Coarsely Discrete Regressors
Suppose that X can take on R values, {x_1, ..., x_R}. E.g.:
- Gender, R = 2.
- Years of Schooling, R = 20ish.
- Gender × Years of Schooling, R = 2 × 20ish.
Estimation of E[Y | X = x_r] is easy!! Find all observations with x_i = x_r and calculate the sample mean within this subsample.
- No assumptions about E[Y | X]: completely flexible.
- Will have the usual properties as long as R is finite (just learning about R expectations).
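The cell-mean estimator above is a few lines of code; a minimal sketch with made-up category labels and values:

```python
from collections import defaultdict

# Cell-mean estimator of E[Y | X = x_r] for a coarsely discrete regressor:
# average y over the subsample with x_i = x_r, one mean per observed cell.
def cell_means(x, y):
    sums = defaultdict(float)
    counts = defaultdict(int)
    for xi, yi in zip(x, y):
        sums[xi] += yi
        counts[xi] += 1
    return {xr: sums[xr] / counts[xr] for xr in sums}

x = ["HS", "HS", "College", "College", "College"]
y = [10.0, 12.0, 20.0, 22.0, 24.0]
print(cell_means(x, y))  # {'HS': 11.0, 'College': 22.0}
```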
9 Example: Conditional Wages
Data: 329,505 men from the 1980 U.S. Census
- age
- years of schooling
- race (black, white)
- married (married, non-married)
Condition on schooling or age:
- Schooling: 21 categories; average of 15,691 observations per category. Range: ,934
- Age: 40 categories; average of 8,238 observations per category. Range:
10 Example: Conditional Wages (a) Schooling (b) Age
11 Example: Conditional Wages
Condition on schooling and age:
- 840 categories
- average number of observations per category: 392 (large range: )
12 Example: Conditional Wages
Things only get worse as we condition on Race and Marital Status:
- 3,360 categories; average of 98 observations per category. Range:
- empty categories (7.7%)
- 670 categories with 0-2 observations (19.9%)
13 Example: Conditional Wages
14 Example: Conditional Wages
Some questions:
1. Do we really think the conditional expectation function is that bumpy?
2. What do we do about categories with 0 observations?
3. Estimates cell by cell are unbiased. What happens to variance as the number of cells increases?
4. Suppose we conditioned on State of Birth too? (171,360 categories) [Curse of dimensionality]
Fundamental statistical learning problem: the need for regularization, i.e., structure and estimators that trade off bias and variance to produce reasonable forecasts/models.
15 Local Averaging - Kernels
Literally averaging for each separate x value is only feasible when X is coarsely discrete; we need beliefs/regularization.
Smoothness: e.g., E[Y | X] is a smooth (continuous, differentiable, etc.) function of X.
- The function shouldn't change much across values of X that are close.
- Estimate E[Y | X = x*] by averaging y's over values of x close to x*.
Kernel Regression:

E[Y | X = x*] ≈ ĝ(x*) = Σ_{i=1}^n y_i K_h(x_i − x*) / Σ_{i=1}^n K_h(x_i − x*)

where K_h(·) is a kernel function and h is a bandwidth.
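The kernel regression (Nadaraya-Watson) formula is a weighted local average and can be sketched in a few lines. The Gaussian kernel, function names, and the simulated sine-curve data are our own choices for illustration:

```python
import numpy as np

# Nadaraya-Watson kernel regression estimator with a Gaussian kernel:
# a weighted average of y_i with weights K_h(x_i - x*).
def gaussian_kernel(u, h):
    return np.exp(-(u / h) ** 2 / 2.0) / (np.sqrt(2.0 * np.pi) * h)

def nw_estimate(x_star, x, y, h):
    w = gaussian_kernel(x - x_star, h)   # kernel weights K_h(x_i - x*)
    return np.sum(w * y) / np.sum(w)     # normalized local average

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 2_000)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
print(nw_estimate(0.25, x, y, h=0.05))   # roughly sin(pi/2) = 1
```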
16 Some kernel details
Common univariate kernels:
- Uniform: K_h(u) = (1/(2h)) 1(|u| < h)
- Gaussian: K_h(u) = (1/(√(2π) h)) exp{−u²/(2h²)}
- Epanechnikov: K_h(u) = (3/(4h)) (1 − (u/h)²)_+
- Triangular: K_h(u) = (1/h) (1 − |u/h|)_+
Multivariate kernels:
- Most common to just take the product of univariate kernels ("product kernel").
- Any multivariate density would also work, e.g., a q-dimensional multivariate normal with q × q bandwidth matrix H.
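The four kernels above are densities in u for any bandwidth h, which a quick numerical integration confirms (the function names are our own):

```python
import numpy as np

# The four univariate kernels written as K_h(u), plus a numerical check that
# each integrates to one in u, so kernel weights behave like densities.
def uniform(u, h):
    return np.where(np.abs(u) < h, 1.0 / (2.0 * h), 0.0)

def gaussian(u, h):
    return np.exp(-u**2 / (2.0 * h**2)) / (np.sqrt(2.0 * np.pi) * h)

def epanechnikov(u, h):
    return np.maximum((3.0 / (4.0 * h)) * (1.0 - (u / h) ** 2), 0.0)

def triangular(u, h):
    return np.maximum((1.0 - np.abs(u / h)) / h, 0.0)

u = np.linspace(-10.0, 10.0, 200_001)
du = u[1] - u[0]
masses = {K.__name__: K(u, 0.7).sum() * du
          for K in (uniform, gaussian, epanechnikov, triangular)}
print(masses)  # each mass is ~1.0
```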
17 Intuition using uniform kernel
With the uniform kernel,

ĝ(x*) = Σ_{i=1}^n y_i 1(|x_i − x*| < h) / Σ_{i=1}^n 1(|x_i − x*| < h) = (1/n_{x*,h}) Σ_{i: |x_i − x*| < h} y_i

where n_{x*,h} is the number of observations such that |x_i − x*| < h.
I.e., the estimator is just the sample average of the y_i across all points where |x_i − x*| < h.
18 Local averaging picture
Figure: local averaging with a uniform kernel over the window (x* − h, x* + h); with x* = .3 and h = .1, n_{x*,h} = 18 and ĝ(.3) = (1/18) Σ_{i: x_i ∈ (.2, .4)} y_i.
19 (Conditional) MSE
Heuristic MSE derivation:
- Condition on {x_i}_{i=1}^n and h.
- Assume (i) E[y_i | X] = E[y_i | x_i] and (ii) Var(y_i | X) = Var(y_i | x_i) exist.
- Assume independence across i.

E[(ĝ(x*) − g(x*))² | X]
  = E[ ( (1/n_{x*,h}) Σ_{i: |x_i − x*| < h} (y_i − g(x_i) + g(x_i) − g(x*)) )² | X ]
  = (1/n_{x*,h}²) Σ_{i: |x_i − x*| < h} E[(y_i − g(x_i))² | X] + ( (1/n_{x*,h}) Σ_{i: |x_i − x*| < h} (g(x_i) − g(x*)) )²
  = (1/n_{x*,h}²) Σ_{i: |x_i − x*| < h} Var(y_i | x_i) + ( (1/n_{x*,h}) Σ_{i: |x_i − x*| < h} (g(x* + (x_i − x*)) − g(x*)) )²

where the first term is the variance and the second is the squared bias.
20 Dependence of MSE on h (1)
We can see how h relates to MSE:
- larger h ⇒ larger n_{x*,h} ⇒ the first (variance) term is smaller
- larger h ⇒ more points farther away from x* ⇒ higher bias (in general)
Heuristic derivation of the MSE bound in the univariate case:
1. n_{x*,h}/n = n^{−1} Σ_{i=1}^n 1(|x_i − x*| < h) ≈ F(x* + h) − F(x* − h) ≈ 2 f(x̄) h,
   where we've assumed the x_i are iid with CDF F(x) and pdf f(x), and x̄ is an intermediate value in the interval [x* − h, x* + h], which has length 2h.
2. Assume sup_{x∈X} f(x), sup_{x∈X} |df(x)/dx|, sup_{x∈X} |g(x)|, sup_{x∈X} |dg(x)/dx|, sup_{x∈X} |d²g(x)/dx²|, and sup_i Var(y_i | x_i) are bounded.
3. Using 1,

(1/n_{x*,h}) Σ_{i: |x_i − x*| < h} (dg(x*)/dx)(x_i − x*)
  ≈ (dg(x*)/dx) ∫_{x*−h}^{x*+h} (x − x*) f(x) dx / (2 f(x̄) h)
  = (h/(2 f(x̄))) (dg(x*)/dx) ∫_{−1}^{1} u f(x* + hu) du
  = (h/(2 f(x̄))) (dg(x*)/dx) [ f(x*) ∫_{−1}^{1} u du + h ∫_{−1}^{1} u² (df(ū)/dx) du ]
  = (h²/(2 f(x̄))) (dg(x*)/dx) ∫_{−1}^{1} u² (df(ū)/dx) du,

since ∫_{−1}^{1} u du = 0; so this term is O(h²).
21 Dependence of MSE on h (2)
MSE Bound: Using 1-3 and letting ≲ mean "approximately less than or equal to" gives

E[(ĝ(x*) − g(x*))² | X]
  ≲ (1/n_{x*,h}) sup_i Var(y_i | x_i) + ( (1/n_{x*,h}) Σ_{i: |x_i − x*| < h} [ (dg(x*)/dx)(x_i − x*) + (1/2)(d²g(x̃)/dx²)(x_i − x*)² ] )²
  ≲ M ( 1/(hn) + (h²)² )
  = M ( 1/(hn) + h⁴ )

MSE goes to zero if h → 0 and hn → ∞.
The optimal rate equates the two terms and gives h_n ∝ n^{−1/5}.
22 Formal statement of results
Asymptotic distribution of the kernel regression estimator:
Theorem: Assume that x ∈ Interior(X) where X ⊂ R^q, g(x) and f(x) are three times continuously differentiable, and f(x) > 0. Then as n → ∞, with h_s → 0 for s = 1, ..., q, n Π_{s=1}^q h_s → ∞, and (n Π_{s=1}^q h_s) Σ_{s=1}^q h_s^6 → 0,

√(n Π_{s=1}^q h_s) ( ĝ(x) − g(x) − Σ_{s=1}^q h_s² B_s(x) ) →_d N(0, κ^q σ²(x)/f(x))

for σ²(x) = Var(y | X = x), κ = ∫ K²(v) dv, and

B_s(x) = (κ₂/2) [ 2 (∂f(x)/∂x_s)(∂g(x)/∂x_s) + f(x) ∂²g(x)/∂x_s² ] / f(x)

for κ₂ = ∫ v² K(v) dv.
Operationalizing:
- Need to estimate the density f(x).
- Need to estimate the bias B_s(x) for s = 1, ..., q, which requires estimating the derivative of the density and the first and second derivatives of the regression function.
23 Curse of dimensionality
Note that the asymptotic normality result implies that (approximately)
- bias = O(Σ_{s=1}^q h_s²)
- variance = O((n Π_{s=1}^q h_s)^{−1})
Setting each bandwidth = O(h), equating the bias² and variance rates, and ignoring multiplicative constants gives

h⁴ = (n h^q)^{−1}  ⇒  h = O(n^{−1/(q+4)})  and  MSE = O(n^{−4/(q+4)})

Increasing q really slows the rate of convergence: the curse of dimensionality.
Intuition:
- Think about data uniformly distributed on the q-dimensional unit cube.
- To get a fraction b of the observations, you need a fraction b of the volume.
- On average, you will need a cube with edge length b^{1/q}.
- Neighbors aren't so local (e.g., q = 10, b = .01: need to cover 63% of the support of each input).
- Averaging points that are very far from each other ⇒ bias.
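The edge-length calculation behind the intuition is one line; the numbers in the example above follow directly:

```python
# To capture a fraction b of uniformly distributed data on the q-dimensional
# unit cube, you need a sub-cube of volume b, i.e. edge length b^(1/q).
def edge_length(b, q):
    return b ** (1.0 / q)

print(edge_length(0.01, 1))    # 0.01: genuinely local in one dimension
print(edge_length(0.01, 10))   # ~0.63: 63% of each axis when q = 10
```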
24 Bandwidth selection
The key input to nonparametrics is the choice of tuning parameter (the bandwidth in this case). Simple options:
1. Eyeball Method: Choose a bandwidth. Estimate the regression function. Look at the result. If it looks more wiggly than you'd like, increase the bandwidth. If it looks smoother than you'd like, decrease the bandwidth.
2. Rule-of-Thumb: The most common rule-of-thumb is approximately optimal for estimating a Gaussian density with a Gaussian kernel.
   Silverman's Rule-of-Thumb: h_n = σ̂_x (4/3)^{1/5} n^{−1/5}, where σ̂_x is the sample standard deviation of x.
   Generalizations to multiple dimensions: (i) h_{n,s} = σ̂_s n^{−1/(4+q)}, where σ̂_s is the sample standard deviation of x_s and q is the dimension of X, or (ii) H_n = Σ̂^{1/2} n^{−1/(4+q)}, where Σ̂ is the q × q sample covariance matrix of X.
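Silverman's rule as stated above is a one-liner (the function name and simulated data are our own):

```python
import numpy as np

# Silverman's rule of thumb: h_n = sigma_hat_x * (4/3)^(1/5) * n^(-1/5).
def silverman_bandwidth(x):
    n = len(x)
    return np.std(x, ddof=1) * (4.0 / 3.0) ** 0.2 * n ** (-0.2)

rng = np.random.default_rng(3)
x = rng.normal(size=1_000)
print(silverman_bandwidth(x))  # roughly 0.266 * sigma_hat for n = 1000
```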
25 Cross-validation (CV)
Basic idea of cross-validation: evaluate the quality of the bandwidth by seeing how well the resulting estimator forecasts out-of-sample.
Leave-one-out CV bandwidth:

ĥ = arg min_h CV(h) = arg min_h Σ_{i=1}^n (y_i − ŷ_{−i,h})²

where ŷ_{−i,h} = ĝ_h(x_i) is the estimate of the conditional expectation at x_i using bandwidth h and all observations EXCEPT observation i. Could use a different loss function, such as absolute value, etc.
To calculate CV(h): Choose h. For each observation i = 1, ..., n,
- Calculate ŷ_{−i,h} = Σ_{j≠i} y_j K_h(x_j − x_i) / Σ_{j≠i} K_h(x_j − x_i)
- Calculate e²_{i,h} = (y_i − ŷ_{−i,h})²
Then CV(h) = Σ_{i=1}^n e²_{i,h}.
Common to use K-fold CV rather than leave-one-out CV.
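The leave-one-out recipe can be vectorized by zeroing the diagonal of the kernel weight matrix. A sketch with a Gaussian kernel, a hand-picked grid of candidate bandwidths, and simulated data (all our own choices):

```python
import numpy as np

# Leave-one-out CV criterion for the kernel regression bandwidth: compute the
# n x n kernel weight matrix, drop the diagonal (observation i itself), and
# sum the squared leave-one-out prediction errors.
def loo_cv(h, x, y):
    d = x[:, None] - x[None, :]            # pairwise differences x_j - x_i
    w = np.exp(-(d / h) ** 2 / 2.0)        # Gaussian weights (constants cancel)
    np.fill_diagonal(w, 0.0)               # exclude observation i
    y_hat = w @ y / w.sum(axis=1)          # leave-one-out fitted values
    return np.sum((y - y_hat) ** 2)

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 300)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

grid = np.array([0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5])
h_cv = grid[np.argmin([loo_cv(h, x, y) for h in grid])]
print(h_cv)  # a moderate bandwidth beats both extremes
```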
26 K-NN
K Nearest Neighbors is closely related to kernels:

ĝ(x) = (1/K) Σ_{i: d(x_i, x) ≤ d(x_{(K)}, x)} y_i

- K: number of neighbors to use
- d(x_1, x_2): distance from point x_1 to point x_2, usually Euclidean
- x_{(K)}: the observation ranked Kth in distance from the target point x
Can be viewed as a kernel with varying bandwidth.
Can choose K (the number of neighbors) by CV.
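A minimal univariate k-NN regression sketch (function names and simulated data are illustrative):

```python
import numpy as np

# k-nearest-neighbor regression: average y over the K observations closest to
# the target point; effectively a kernel with a data-driven bandwidth.
def knn_estimate(x_star, x, y, k):
    idx = np.argsort(np.abs(x - x_star))[:k]   # indices of the K nearest x_i
    return y[idx].mean()

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 1_000)
y = 2.0 * x + rng.normal(scale=0.1, size=x.size)
print(knn_estimate(0.5, x, y, k=50))  # close to 2 * 0.5 = 1
```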
27 Kernels and K-NN in Schooling Example (1) Figure: Uniform Kernel, ha = 3, he = 1, average 7766 obs per cell (range )
28 Kernels and K-NN in Schooling Example (2) Figure: KNN, between 2200 and 4500 obs per cell
29 Kernels and K-NN in Schooling Example (3) Figure: Uniform kernel; condition on schooling, age, non-married, black; ha = 12, he = 3
30 Kernels and K-NN in Schooling Example (4) Figure: Cross-Validation function
31 Kernels and K-NN in Schooling Example (5) Figure: Uniform kernel, CV bandwidths, ha = 3, he = 4
32 Series
Series: Model g(x) = Σ_{j=0}^p β_j φ_j(x) + r_p(x)
- E.g., if g(x) is infinitely differentiable, we have a Taylor series: g(x) = Σ_{j=0}^∞ a_j x^j where a_j = (1/j!) d^j g(0)/dx^j
- The φ_j(x) are series/basis terms
- E.g., {φ_j(x)} = 1, x, x², x³, ... (or orthogonal polynomials)
- E.g., {φ_j(x)} = 1, x, x², x³, (x − k₄)³_+, (x − k₅)³_+, ..., where k₄, k₅, ... are knots (cubic spline)
- E.g., b-splines, Fourier series, ...
Obtain ĝ(x) by LS regression of Y on {φ_j(X)}_{j=1}^p: a global method.
Regularization comes in through the choice of p:
- Higher p means less bias, since we are leaving out fewer terms from the infinite sum.
- Higher p means higher variance, since we are estimating more regression coefficients from the same amount of data.
33 Series Estimation
Operationally, series are extremely easy:
- Define φ^n(x) = (φ_{n,1}(x), ..., φ_{n,p}(x))′
- Define Z_n = [φ^n(x_1), ..., φ^n(x_n)]′ (an n × p matrix)
- The series estimator of E[Y | X = x]: ĝ(x) = φ^n(x)′ β̂ where β̂ = (Z_n′ Z_n)^{−1} (Z_n′ Y)
I.e., estimate the coefficients by OLS of Y on Z_n. Can also do inference using conventional OLS output (e.g., Newey (1997)).
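The matrix recipe above is a short script with a power-series basis; `lstsq` plays the role of (Z′Z)^{−1}Z′Y (the basis choice, function names, and simulated data are our own):

```python
import numpy as np

# Series estimator: OLS of Y on basis terms phi_j(X), here the power series
# 1, x, x^2, ..., x^p built with np.vander.
def series_fit(x, y, p):
    Z = np.vander(x, p + 1, increasing=True)       # n x (p+1) basis matrix Z_n
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)   # OLS coefficients beta_hat
    return beta

def series_predict(x_new, beta):
    p = beta.size - 1
    return np.vander(np.atleast_1d(x_new), p + 1, increasing=True) @ beta

rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, 2_000)
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=x.size)
beta = series_fit(x, y, p=7)
print(series_predict(0.5, beta))  # approximately sin(pi/2) = 1
```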
34 Series Asymptotics
Example sufficient conditions for establishing asymptotic properties (p → ∞, n → ∞, p/n → 0) of the series estimator (simplified version of Newey (1997)):
- Var(Y | X) bounded
- φ^n(x), g(x), and φ^n(x)′β_n are uniformly bounded; sup_x ||φ^n(x)|| = ζ(p)
- λ_min(Z′Z/n) bounded away from zero: the p × p design matrix remains well-behaved as n → ∞ (recall that p → ∞)
- There exists a β_n such that sup_x |φ^n(x)′β_n − g(x)| = O(p^{−α})
  - Requires smoothness of the function g(x)
  - For splines or power series, sup_x |φ^n(x)′β_n − g(x)| = O(p^{−s/d}), where s is the number of continuous derivatives of g(x) and d is the dimension of x
Under these conditions, can obtain rates of convergence:

∫ [g(x) − ĝ(x)]² dF₀(x) = O_p(p/n + p^{−2α})
sup_x |g(x) − ĝ(x)| = O_p(ζ(p)(√(p/n) + p^{−α}))
35 Asymptotic Distribution
Essentially Theorem 2 from Newey (1997): Under regularity conditions and assuming √n p^{−α} → 0,

ĝ(x) = g(x) + O_p(ζ(p)/√n)

and

√n V^{−1/2} (ĝ(x) − g(x)) →_d N(0, 1)
V = φ^n(x)′ Q^{−1} Ω̂ Q^{−1} φ^n(x)
Q = Z′Z/n
Ω̂ = (1/n) Σ_i φ^n(x_i) φ^n(x_i)′ [y_i − φ^n(x_i)′ β̂]²

Ignoring technicalities, this result states that you can do inference for results based on a series estimator exactly like you would for results based on OLS estimates of the linear model.
36 Choosing the number of terms
The usual choice of the number of terms is given by cross-validation.
Closed-form expression for the leave-one-out CV function for least-squares series estimators:

CV(K) = Σ_{i=1}^n ( e_{i,K} / (1 − p_{i,K}) )²

- e_{i,K} = y_i − Σ_{j=1}^K â_j φ_j(x_i) is the regression residual from regressing y on φ_1(x), ..., φ_K(x) using all the observations; you only have to run the regression once.
- p_{i,K} is the (i, i) element of the matrix P_K (P_K′ P_K)^{−1} P_K′, where P_K is the n × K matrix formed by stacking the 1 × K vectors (φ_1(x_i), ..., φ_K(x_i)) for each i.
One can then use the CV number of series terms: K̂ = arg min_K CV(K).
Could also do K-fold cross-validation.
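The closed-form CV(K) can be verified against brute-force leave-one-out refitting; they agree exactly, which is why the one-regression shortcut works. A sketch with a cubic power basis and simulated data (our own choices):

```python
import numpy as np

# Verify the closed-form LOOCV identity for least squares: the leave-one-out
# residual equals e_i / (1 - p_ii), with p_ii the hat-matrix diagonal.
rng = np.random.default_rng(7)
n, K = 200, 4
x = rng.uniform(-1, 1, n)
P = np.vander(x, K, increasing=True)              # n x K basis matrix P_K
y = np.sin(2 * x) + rng.normal(scale=0.3, size=n)

beta, *_ = np.linalg.lstsq(P, y, rcond=None)
e = y - P @ beta                                  # full-sample residuals e_{i,K}
p_ii = np.einsum('ij,jk,ik->i', P, np.linalg.inv(P.T @ P), P)  # hat diagonal
cv_closed = np.sum((e / (1.0 - p_ii)) ** 2)

# Brute-force leave-one-out for comparison: refit n times.
cv_loo = 0.0
for i in range(n):
    mask = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(P[mask], y[mask], rcond=None)
    cv_loo += (y[i] - P[i] @ b_i) ** 2

print(cv_closed, cv_loo)  # the two criteria agree
```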
37 Series in Schooling Example (1)
Estimate E[log(wage) | schooling, age, non-married, black]
- Cubic splines with equally spaced knots: 4 knots in age and education; fully interact the marginal splines.
- Simple polynomial (monomial): 7th order in age and education; fully interact the marginal polynomials.
38 Series in Schooling Example (2) (a) Cubic Spline (b) Polynomial Figure: Series Estimates of Conditional Expectation of log(wage) Given Age, Schooling, Non-Married, and Black
39 Series in Schooling Example (3)
Cross-Validation:
- Cubic spline: 13 knots in age and 1 in education. CV =
- Polynomial: 9th order in age and 2nd order in education. CV =
40 Series in Schooling Example (4) (a) Cubic Spline (b) Polynomial Figure: Series Estimates of Conditional Expectation of log(wage) Given Age, Schooling, Non-Married, and Black with Terms Chosen by CV
More informationRegression #8: Loose Ends
Regression #8: Loose Ends Econ 671 Purdue University Justin L. Tobias (Purdue) Regression #8 1 / 30 In this lecture we investigate a variety of topics that you are probably familiar with, but need to touch
More informationAUTOMATIC CONTROL COMMUNICATION SYSTEMS LINKÖPINGS UNIVERSITET. Questions AUTOMATIC CONTROL COMMUNICATION SYSTEMS LINKÖPINGS UNIVERSITET
The Problem Identification of Linear and onlinear Dynamical Systems Theme : Curve Fitting Division of Automatic Control Linköping University Sweden Data from Gripen Questions How do the control surface
More informationPart 6: Multivariate Normal and Linear Models
Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of
More informationCOMS 4771 Introduction to Machine Learning. James McInerney Adapted from slides by Nakul Verma
COMS 4771 Introduction to Machine Learning James McInerney Adapted from slides by Nakul Verma Announcements HW1: Please submit as a group Watch out for zero variance features (Q5) HW2 will be released
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University February 4, 2015 Today: Generative discriminative classifiers Linear regression Decomposition of error into
More informationLinear Regression and Its Applications
Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start
More informationSpatially Smoothed Kernel Density Estimation via Generalized Empirical Likelihood
Spatially Smoothed Kernel Density Estimation via Generalized Empirical Likelihood Kuangyu Wen & Ximing Wu Texas A&M University Info-Metrics Institute Conference: Recent Innovations in Info-Metrics October
More informationModel-free prediction intervals for regression and autoregression. Dimitris N. Politis University of California, San Diego
Model-free prediction intervals for regression and autoregression Dimitris N. Politis University of California, San Diego To explain or to predict? Models are indispensable for exploring/utilizing relationships
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V
More informationLocal Polynomial Regression
VI Local Polynomial Regression (1) Global polynomial regression We observe random pairs (X 1, Y 1 ),, (X n, Y n ) where (X 1, Y 1 ),, (X n, Y n ) iid (X, Y ). We want to estimate m(x) = E(Y X = x) based
More informationCOMS 4771 Regression. Nakul Verma
COMS 4771 Regression Nakul Verma Last time Support Vector Machines Maximum Margin formulation Constrained Optimization Lagrange Duality Theory Convex Optimization SVM dual and Interpretation How get the
More informationA Modern Look at Classical Multivariate Techniques
A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico
More informationVariance Reduction and Ensemble Methods
Variance Reduction and Ensemble Methods Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Last Time PAC learning Bias/variance tradeoff small hypothesis
More informationBusiness Statistics. Lecture 10: Course Review
Business Statistics Lecture 10: Course Review 1 Descriptive Statistics for Continuous Data Numerical Summaries Location: mean, median Spread or variability: variance, standard deviation, range, percentiles,
More informationJoint Probability Distributions and Random Samples (Devore Chapter Five)
Joint Probability Distributions and Random Samples (Devore Chapter Five) 1016-345-01: Probability and Statistics for Engineers Spring 2013 Contents 1 Joint Probability Distributions 2 1.1 Two Discrete
More informationAsymptotic Statistics-III. Changliang Zou
Asymptotic Statistics-III Changliang Zou The multivariate central limit theorem Theorem (Multivariate CLT for iid case) Let X i be iid random p-vectors with mean µ and and covariance matrix Σ. Then n (
More informationELEG 3143 Probability & Stochastic Process Ch. 6 Stochastic Process
Department of Electrical Engineering University of Arkansas ELEG 3143 Probability & Stochastic Process Ch. 6 Stochastic Process Dr. Jingxian Wu wuj@uark.edu OUTLINE 2 Definition of stochastic process (random
More informationRestricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model
Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model Xiuming Zhang zhangxiuming@u.nus.edu A*STAR-NUS Clinical Imaging Research Center October, 015 Summary This report derives
More informationIntroduction to Smoothing spline ANOVA models (metamodelling)
Introduction to Smoothing spline ANOVA models (metamodelling) M. Ratto DYNARE Summer School, Paris, June 215. Joint Research Centre www.jrc.ec.europa.eu Serving society Stimulating innovation Supporting
More informationECON 721: Lecture Notes on Nonparametric Density and Regression Estimation. Petra E. Todd
ECON 721: Lecture Notes on Nonparametric Density and Regression Estimation Petra E. Todd Fall, 2014 2 Contents 1 Review of Stochastic Order Symbols 1 2 Nonparametric Density Estimation 3 2.1 Histogram
More informationChapter 7: Model Assessment and Selection
Chapter 7: Model Assessment and Selection DD3364 April 20, 2012 Introduction Regression: Review of our problem Have target variable Y to estimate from a vector of inputs X. A prediction model ˆf(X) has
More informationRegression I: Mean Squared Error and Measuring Quality of Fit
Regression I: Mean Squared Error and Measuring Quality of Fit -Applied Multivariate Analysis- Lecturer: Darren Homrighausen, PhD 1 The Setup Suppose there is a scientific problem we are interested in solving
More informationNumerical Methods I Monte Carlo Methods
Numerical Methods I Monte Carlo Methods Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 Course G63.2010.001 / G22.2420-001, Fall 2010 Dec. 9th, 2010 A. Donev (Courant Institute) Lecture
More informationProblem Set 7. Ideally, these would be the same observations left out when you
Business 4903 Instructor: Christian Hansen Problem Set 7. Use the data in MROZ.raw to answer this question. The data consist of 753 observations. Before answering any of parts a.-b., remove 253 observations
More informationWhy experimenters should not randomize, and what they should do instead
Why experimenters should not randomize, and what they should do instead Maximilian Kasy Department of Economics, Harvard University Maximilian Kasy (Harvard) Experimental design 1 / 42 project STAR Introduction
More informationECE 4400:693 - Information Theory
ECE 4400:693 - Information Theory Dr. Nghi Tran Lecture 8: Differential Entropy Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 1 / 43 Outline 1 Review: Entropy of discrete RVs 2 Differential
More informationIntroduction to Simple Linear Regression
Introduction to Simple Linear Regression Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Introduction to Simple Linear Regression 1 / 68 About me Faculty in the Department
More informationDay 4A Nonparametrics
Day 4A Nonparametrics A. Colin Cameron Univ. of Calif. - Davis... for Center of Labor Economics Norwegian School of Economics Advanced Microeconometrics Aug 28 - Sep 2, 2017. Colin Cameron Univ. of Calif.
More informationData Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods.
TheThalesians Itiseasyforphilosopherstoberichiftheychoose Data Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods Ivan Zhdankin
More information