41903: Introduction to Nonparametrics


1 41903: Notes 5

2 Introduction

Nonparametrics is fundamentally about fitting flexible models: we want a model flexible enough to accommodate important patterns, but not so flexible that it overspecializes to the specific data set. That is, we are concerned about overfitting/false discovery; equivalently, about the bias/variance tradeoff. Data-mining/"big data" methods are naturally viewed as nonparametric procedures.

3 Some ML/Data-Mining Jargon

Data-mining terminology: Supervised Learning. Want to predict a target variable Y with input variables X. AKA "predictive analytics." This is our goal in this set of notes; we focus on classic methods here.

Data-mining terminology: Unsupervised Learning. Want to find structure within a set of variables X, with no specific target. Exploratory data analysis (EDA); essentially fancy descriptive statistics.

4 Predictive Models

Useful to think about the relation between target (Y) and input (X) as

$\underbrace{Y_i}_{\text{target}} = \underbrace{g(x_i)}_{\text{signal}} + \underbrace{\varepsilon_i}_{\text{noise}}$

Goal: learn $g(\cdot)$ from the data in a way that yields generalizable forecasts, OR get a forecast rule that minimizes expected forecast loss.

5 Squared Error Loss

Loss function: $L(Y, g(X)) = (Y - g(X))^2$

Expected loss of $g(\cdot)$: $E_X\left[E_{Y|X}[L(Y, g(X)) \mid X]\right]$

Clear that it is sufficient to minimize pointwise. Forecast rule:

$g(x) = \arg\min_c E_{Y|X}[(Y - c)^2 \mid X = x] = \arg\min_c \left\{ E_{Y|X}[Y^2 \mid X = x] - 2c\,E_{Y|X}[Y \mid X = x] + c^2 \right\} = E[Y \mid X = x]$

Could also use $L(Y, g(X)) = |Y - g(X)|$, which gives $g(x) = \mathrm{median}(Y \mid X = x)$.

6 Classification/Discrete Outcome

Discrete $Y$: $Y \in \{1, \ldots, R\}$

Common loss function (0-1 loss): $L(k, l) = L(Y = k, g(X) = l) = \begin{cases} 0 & k = l \\ 1 & \text{else} \end{cases}$

Expected loss of $g(\cdot)$: $E_X\left[E_{Y|X}[L(Y, g(X)) \mid X]\right] = E_X\left[\sum_{r=1}^R L(r, g(X))\, P(Y = r \mid X)\right]$

7 Classification/Discrete Outcome

Again, it is sufficient to minimize pointwise:

$g(x) = \arg\min_c \sum_{r=1}^R L(r, c)\, P(Y = r \mid X = x) = \arg\min_c \sum_{r \neq c} P(Y = r \mid X = x) = \arg\min_c \left(1 - P(Y = c \mid X = x)\right)$

That is, $g(x) = r$ such that $P(Y = r \mid X = x) = \max_c P(Y = c \mid X = x)$.

This defines the Bayes classifier: simply forecast the outcome to be the outcome with the highest conditional probability.

Key input: $P(Y = r \mid X = x) = E[1(Y = r) \mid X = x]$

Common structure: want a good estimate of $E[Y \mid X]$ or another feature of the conditional distribution (for suitable $Y$).
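
As a concrete illustration, a minimal Python sketch of the Bayes classifier applied to estimated conditional probabilities (the array p_hat and its values are hypothetical, standing in for whatever estimator supplies P(Y = r | X = x)):

    import numpy as np

    # Bayes classifier under 0-1 loss: forecast the class with the
    # highest estimated conditional probability.
    # p_hat[i, r] estimates P(Y = r | X = x_i).
    def bayes_classify(p_hat):
        return np.argmax(p_hat, axis=1)

    # Hypothetical estimated probabilities for two points, three classes:
    p_hat = np.array([[0.2, 0.5, 0.3],
                      [0.7, 0.1, 0.2]])
    print(bayes_classify(p_hat))  # [1 0]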

8 Coarsely Discrete Regressors

Suppose that $X$ can take on $R$ values, $\{x^1, \ldots, x^R\}$. E.g. gender, $R = 2$; years of schooling, $R = $ 20ish; gender x years of schooling, $R = $ 2*20ish.

Estimation of $E[Y \mid X = x^r]$ is easy!! Find all observations with $x_i = x^r$ and calculate the sample mean within this subsample.

No assumptions about $E[Y \mid X]$: completely flexible. Will have the usual properties as long as $R$ is finite (we are just learning about $R$ expectations).
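
A minimal sketch of cell-by-cell estimation on simulated data (the data-generating process below is made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    schooling = rng.integers(0, 21, size=1000)           # R = 21 cells
    y = 1.0 + 0.08 * schooling + rng.normal(size=1000)   # e.g. log wages

    # E[Y | X = x^r] estimated by the sample mean within each cell:
    cell_means = {r: y[schooling == r].mean() for r in np.unique(schooling)}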

9 Example: Conditional Wages

Data: 329,505 men from the 1980 U.S. Census. Variables: age, years of schooling, race (black, white), married (married, non-married).

Condition on schooling or age:
Schooling: 21 categories; average of 15,691 observations per category
Age: 40 categories; average of 8,238 observations per category

10 Example: Conditional Wages

Figure: (a) Schooling; (b) Age

11 Example: Conditional Wages

Condition on schooling and age: 840 categories; average number of observations per category: 392 (large range).

12 Example: Conditional Wages

Things only get worse as we condition on race and marital status:
3,360 categories; average of 98 observations per category
empty categories (7.7%)
670 categories with 0-2 observations (19.9%)

13 Example: Conditional Wages

14 Example: Conditional Wages

Some questions:
1. Do we really think the conditional expectation function is that bumpy?
2. What do we do about categories with 0 observations?
3. Cell-by-cell estimates are unbiased. What happens to the variance as the number of cells increases?
4. Suppose we conditioned on state of birth too? (171,360 categories) [Curse of dimensionality]

Fundamental statistical learning problem: the need for regularization, i.e. structure and estimators that trade off bias and variance to produce reasonable forecasts/models.

15 Local Averaging - Kernels

Literally averaging for each separate x value is only feasible when X is coarsely discrete; otherwise we need beliefs/regularization.

Smoothness: e.g. $E[Y \mid X]$ is a smooth (e.g. continuous, differentiable, etc.) function of $X$. The function shouldn't change much across values of $X$ that are close. Estimate $E[Y \mid X = x^*]$ by averaging the $y$'s over values of $x$ close to $x^*$.

Kernel regression:

$E[Y \mid X = x^*] \approx \hat{g}(x^*) = \frac{\sum_{i=1}^n y_i K_h(x_i - x^*)}{\sum_{i=1}^n K_h(x_i - x^*)}$

where $K_h(\cdot)$ is a kernel function and $h$ is a bandwidth.
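
A minimal sketch of this estimator in Python with a Gaussian kernel (the function name and interface are ours, not from any particular library):

    import numpy as np

    def kernel_regression(x, y, x0, h):
        """Nadaraya-Watson estimate of E[Y | X = x0] at each point of x0."""
        x, y, x0 = map(lambda a: np.asarray(a, dtype=float), (x, y, x0))
        u = (x0[:, None] - x[None, :]) / h
        w = np.exp(-0.5 * u**2)          # Gaussian kernel weights
        return (w @ y) / w.sum(axis=1)   # weighted average at each x0

For example, kernel_regression(x, y, np.linspace(x.min(), x.max(), 50), h=0.5) traces out the fitted curve on a grid.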

16 Some kernel details

Common univariate kernels:

Uniform: $K_h(u) = \frac{1}{2h}\,1(|u| < h)$
Gaussian: $K_h(u) = \frac{1}{\sqrt{2\pi h^2}}\exp\left\{-\frac{u^2}{2h^2}\right\}$
Epanechnikov: $K_h(u) = \frac{3}{4h}\left(1 - \left(\frac{u}{h}\right)^2\right)_+$
Triangular: $K_h(u) = \frac{1}{h}\left(1 - \left|\frac{u}{h}\right|\right)_+$

Multivariate kernels: most common to just take the product of univariate kernels (a "product kernel"). Any multivariate density would also work, e.g. a $q$-dimensional multivariate normal with $q \times q$ bandwidth matrix $H$.
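
The same kernels written out in Python, with $(\cdot)_+$ implemented as np.maximum(., 0), and a Gaussian product kernel for the multivariate case (a sketch; function names are ours):

    import numpy as np

    def k_uniform(u, h):
        return (np.abs(u) < h) / (2 * h)

    def k_gaussian(u, h):
        return np.exp(-u**2 / (2 * h**2)) / np.sqrt(2 * np.pi * h**2)

    def k_epanechnikov(u, h):
        return (3 / (4 * h)) * np.maximum(1 - (u / h)**2, 0)

    def k_triangular(u, h):
        return (1 / h) * np.maximum(1 - np.abs(u) / h, 0)

    def k_product(u, h):
        # u: (q,) vector of coordinates, h: (q,) vector of bandwidths
        return np.prod(k_gaussian(u, h))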

17 Intuition using uniform kernel

With the uniform kernel,

$\hat{g}(x^*) = \frac{\sum_{i=1}^n y_i\,1(|x_i - x^*| < h)}{\sum_{i=1}^n 1(|x_i - x^*| < h)} = \frac{1}{n_{x^*,h}}\sum_{i:\,|x_i - x^*| < h} y_i$

where $n_{x^*,h}$ is the number of observations such that $|x_i - x^*| < h$. I.e., the estimator is just the sample average of the $y_i$ across all points where $|x_i - x^*| < h$.

18 Local averaging picture

Figure: local averaging with a uniform kernel over the window $(x^* - h, x^* + h)$. At $x^* = .3$ with $h = .1$, $n_{x^*,h} = 18$ observations have $x_i \in (.2, .4)$, so $\hat{g}(.3) = \frac{1}{18}\sum_{i:\,x_i \in (.2,.4)} y_i$.

19 (Conditional) MSE

Heuristic MSE derivation: condition on $\{x_i\}_{i=1}^n$ and $h$. Assume (i) $E[y_i \mid X] = E[y_i \mid x_i]$ and (ii) $\mathrm{Var}(y_i \mid X) = \mathrm{Var}(y_i \mid x_i)$ exist, and assume independence across $i$. Then

$E\left[(\hat{g}(x^*) - g(x^*))^2 \mid X\right] = E\left[\left(\frac{1}{n_{x^*,h}}\sum_{i:\,|x_i - x^*| < h}(y_i - g(x_i) + g(x_i) - g(x^*))\right)^2 \Big| X\right]$

$= \frac{1}{n_{x^*,h}^2}\sum_{i:\,|x_i - x^*| < h} E[(y_i - g(x_i))^2 \mid X] + \left(\frac{1}{n_{x^*,h}}\sum_{i:\,|x_i - x^*| < h}(g(x_i) - g(x^*))\right)^2$

$= \frac{1}{n_{x^*,h}^2}\sum_{i:\,|x_i - x^*| < h} \mathrm{Var}(y_i \mid x_i) + \left(\frac{1}{n_{x^*,h}}\sum_{i:\,|x_i - x^*| < h}\left(g(x^* + (x_i - x^*)) - g(x^*)\right)\right)^2$

where the first term is the variance and the second is the squared bias.

20 Dependence of MSE on h (1)

We can see how $h$ relates to MSE: a larger $h$ gives a larger $n_{x^*,h}$, so the first (variance) term is smaller; a larger $h$ also includes more points farther from $x^*$, so the bias is higher (in general).

Heuristic derivation of the MSE bound in the univariate case:

1. $n_{x^*,h}/n = n^{-1}\sum_{i=1}^n 1(|x_i - x^*| < h) \approx F(x^* + h) - F(x^* - h) = 2f(\bar{x})h$, where we've assumed the $x_i$ are iid with CDF $F(x)$ and pdf $f(x)$, and $\bar{x}$ is an intermediate value satisfying $x^* - h \le \bar{x} \le x^* + h$.

2. Assume $\sup_{x \in \mathcal{X}} f(x)$, $\sup_{x \in \mathcal{X}} \left|\frac{df(x)}{dx}\right|$, $\sup_{x \in \mathcal{X}} |g(x)|$, $\sup_{x \in \mathcal{X}} \left|\frac{dg(x)}{dx}\right|$, $\sup_{x \in \mathcal{X}} \left|\frac{d^2 g(x)}{dx^2}\right|$, and $\sup_i \mathrm{Var}(y_i \mid x_i)$ are bounded.

3. $\frac{1}{n_{x^*,h}}\sum_{i:\,|x_i - x^*| < h} \frac{dg(x^*)}{dx}(x_i - x^*) \approx \frac{dg(x^*)}{dx}\frac{1}{2f(\bar{x})h}\int_{x^*-h}^{x^*+h}(x - x^*)f(x)\,dx = \frac{dg(x^*)}{dx}\frac{h}{2f(\bar{x})}\int_{-1}^{1} u\,f(x^* + hu)\,du = \frac{dg(x^*)}{dx}\frac{h}{2f(\bar{x})}\left(f(x^*)\int_{-1}^{1} u\,du + h\int_{-1}^{1} u^2\,\frac{df(\bar{u})}{dx}\,du\right) = O(h^2),$ since $\int_{-1}^{1} u\,du = 0$.

21 Dependence of MSE on h (2)

MSE bound: using 1-3 above, and letting $\lesssim$ mean "approximately less than or equal to",

$E\left[(\hat{g}(x^*) - g(x^*))^2 \mid X\right] \lesssim \frac{1}{n_{x^*,h}}\sup_i \mathrm{Var}(y_i \mid x_i) + \left(\frac{1}{n_{x^*,h}}\sum_{i:\,|x_i - x^*| < h}\left(\frac{dg(x^*)}{dx}(x_i - x^*) + \frac{1}{2}\frac{d^2 g(\bar{x})}{dx^2}(x_i - x^*)^2\right)\right)^2 \lesssim M\left(\frac{1}{hn} + \left(h^2\right)^2\right) = M\left(\frac{1}{hn} + h^4\right)$

MSE goes to zero if $h \to 0$ and $hn \to \infty$. The optimal rate equates the two terms and gives $h_n \propto n^{-1/5}$.

22 Formal statement of results

Asymptotic distribution of the kernel regression estimator.

Theorem: Assume that $x \in \mathrm{Interior}(\mathcal{X})$ where $\mathcal{X} \subseteq \mathbb{R}^q$, $g(x)$ and $f(x)$ are three times continuously differentiable, and $f(x) > 0$. Let $n \to \infty$ and $h_s \to 0$ for $s = 1, \ldots, q$ with $n\prod_{s=1}^q h_s \to \infty$ and $\left(n\prod_{s=1}^q h_s\right)\sum_{s=1}^q h_s^6 \to 0$. Then

$\sqrt{n\prod_{s=1}^q h_s}\left(\hat{g}(x) - g(x) - \sum_{s=1}^q h_s^2 B_s(x)\right) \stackrel{d}{\to} N\left(0, \kappa^q \sigma^2(x)/f(x)\right)$

for $\sigma^2(x) = \mathrm{Var}(y \mid X = x)$, $\kappa = \int K^2(v)\,dv$, and

$B_s(x) = \frac{\kappa_2}{2} \cdot \frac{2\frac{\partial f(x)}{\partial x_s}\frac{\partial g(x)}{\partial x_s} + f(x)\frac{\partial^2 g(x)}{\partial x_s^2}}{f(x)}$

for $\kappa_2 = \int v^2 K(v)\,dv$.

Operationalizing:
Need to estimate the density $f(x)$.
Need to estimate the bias $B_s(x)$ for $s = 1, \ldots, q$, which requires estimating the derivative of the density and the first and second derivatives of the regression function.

23 Curse of dimensionality

Note that the asymptotic normality result implies that (approximately) bias $= O\left(\sum_{s=1}^q h_s^2\right)$ and variance $= O\left(\left(n\prod_{s=1}^q h_s\right)^{-1}\right)$.

Setting each bandwidth $= O(h)$, equating the squared-bias and variance rates, and ignoring multiplicative constants gives

$h^4 = (nh^q)^{-1} \Rightarrow h = O\left(n^{-\frac{1}{q+4}}\right)$ and MSE $= O\left(n^{-\frac{4}{q+4}}\right)$

Increasing $q$ really slows the rate of convergence: the curse of dimensionality.

Intuition: think about data uniformly distributed on the $q$-dimensional unit cube. To get a fraction $b$ of the observations, we need a fraction $b$ of the volume, so on average we need a cube with edge length $b^{1/q}$. Neighbors aren't so local (e.g. with $q = 10$ and $b = .01$, we need to cover 63% of the support of each input). Averaging points that are very far from each other produces bias.
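
The 63% figure is easy to verify numerically:

    # Edge length of the cube needed to capture a fraction b of data
    # uniformly distributed on the q-dimensional unit cube: b**(1/q).
    b, q = 0.01, 10
    print(b ** (1 / q))  # ~0.63: reaching 1% of the data takes 63% of each axis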

24 Bandwidth selection

A key input to nonparametrics is the choice of tuning parameter (the bandwidth in this case). Simple options:

1. Eyeball method: choose a bandwidth, estimate the regression function, and look at the result. If it looks more wiggly than you'd like, increase the bandwidth. If it looks smoother than you'd like, decrease the bandwidth.

2. Rule-of-thumb: the most common rule-of-thumb is approximately optimal for estimating a Gaussian density with a Gaussian kernel. Silverman's rule-of-thumb: $h_n = \sigma_x (4/3)^{1/5} n^{-1/5}$, where $\sigma_x$ is the sample standard deviation of $x$. Generalizations to multiple dimensions: (i) $h_{n,s} = \sigma_s n^{-1/(4+q)}$, where $\sigma_s$ is the sample standard deviation of $x_s$ and $q$ is the dimension of $X$, or (ii) $H_n = \hat{\Sigma}^{1/2} n^{-1/(4+q)}$, where $\hat{\Sigma}$ is the $q \times q$ sample covariance matrix of $X$.
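
A sketch of the rule-of-thumb bandwidths in Python (function names are ours; generalization (i) from above):

    import numpy as np

    def silverman_h(x):
        """Silverman's rule of thumb for a univariate sample x."""
        x = np.asarray(x, dtype=float)
        return np.std(x, ddof=1) * (4 / 3) ** (1 / 5) * len(x) ** (-1 / 5)

    def rot_h_multi(X):
        """One bandwidth per coordinate; X is an (n, q) array."""
        n, q = X.shape
        return np.std(X, axis=0, ddof=1) * n ** (-1 / (4 + q))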

25 Cross-validation (CV)

Basic idea of cross-validation: evaluate the quality of a bandwidth by seeing how well the resulting estimator forecasts out-of-sample.

Leave-one-out CV bandwidth:

$\hat{h} = \arg\min_h CV(h) = \arg\min_h \sum_{i=1}^n (y_i - \hat{y}_{-i,h})^2$

where $\hat{y}_{-i,h} = \hat{g}_h(x_i)$ is the estimate of the conditional expectation at $x_i$ using bandwidth $h$ and all observations EXCEPT observation $i$. Could use a different loss function, such as absolute value, etc.

To calculate $CV(h)$: choose $h$; for each observation $i = 1, \ldots, n$, calculate

$\hat{y}_{-i,h} = \frac{\sum_{j \neq i} y_j K_h(x_j - x_i)}{\sum_{j \neq i} K_h(x_j - x_i)}$

and $e_{i,h}^2 = (y_i - \hat{y}_{-i,h})^2$; then calculate $CV(h) = \sum_{i=1}^n e_{i,h}^2$.

Common to use K-fold CV rather than leave-one-out CV.
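
A minimal sketch of this recipe in Python with a Gaussian kernel, searching over a user-supplied grid of bandwidths (function names and the grid are ours, not from a library):

    import numpy as np

    def cv_criterion(x, y, h):
        """Leave-one-out CV criterion CV(h) for kernel regression."""
        x = np.asarray(x, dtype=float); y = np.asarray(y, dtype=float)
        u = (x[:, None] - x[None, :]) / h
        w = np.exp(-0.5 * u**2)            # Gaussian kernel weights
        np.fill_diagonal(w, 0.0)           # leave observation i out
        y_hat = (w @ y) / w.sum(axis=1)
        return np.sum((y - y_hat) ** 2)

    def cv_bandwidth(x, y, grid):
        """Pick the bandwidth on the grid that minimizes CV(h)."""
        return min(grid, key=lambda h: cv_criterion(x, y, h))

Each evaluation is O(n^2), which is fine for moderate n; K-fold CV reduces the cost for large samples.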

26 K-NN

K nearest neighbors is closely related to kernels:

$\hat{f}(x) = \frac{1}{K}\sum_{i:\,d(x_i, x) \le d(x_{(K)}, x)} y_i$

$K$: number of neighbors to use
$d(x_1, x_2)$: distance from point $x_1$ to point $x_2$, usually Euclidean
$x_{(K)}$: the observation ranked $K$th in distance from the target point $x$

Can be viewed as a kernel with a varying bandwidth. Can choose $K$ (the number of neighbors) by CV.
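
A minimal k-NN regression sketch for a scalar regressor (the function name is ours):

    import numpy as np

    def knn_regress(x, y, x0, K):
        """Average y over the K observations closest to x0."""
        x = np.asarray(x, dtype=float); y = np.asarray(y, dtype=float)
        nearest = np.argsort(np.abs(x - x0))[:K]   # indices of K nearest
        return y[nearest].mean()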

27 Kernels and K-NN in Schooling Example (1)

Figure: Uniform kernel, $h_{\text{age}} = 3$, $h_{\text{educ}} = 1$; average of 7,766 obs per cell.

28 Kernels and K-NN in Schooling Example (2)

Figure: KNN; between 2,200 and 4,500 obs per cell.

29 Kernels and K-NN in Schooling Example (3)

Figure: Uniform kernel; condition on schooling, age, non-married, black; $h_{\text{age}} = 12$, $h_{\text{educ}} = 3$.

30 Kernels and K-NN in Schooling Example (4)

Figure: Cross-validation function.

31 Kernels and K-NN in Schooling Example (5)

Figure: Uniform kernel with CV bandwidths, $h_{\text{age}} = 3$, $h_{\text{educ}} = 4$.

32 Series

Series: model $g(x) = \sum_{j=0}^p \beta_j \varphi_j(x) + r_p(x)$

E.g. if $g(x)$ is infinitely differentiable, we have a Taylor series: $g(x) = \sum_{j=0}^\infty a_j x^j$ where $a_j = \frac{1}{j!}\frac{d^j g(0)}{dx^j}$

The $\varphi_j(x)$ are series/basis terms:
E.g. $\{\varphi_j(x)\} = 1, x, x^2, x^3, \ldots$ (or orthogonal polynomials)
E.g. $\{\varphi_j(x)\} = 1, x, x^2, x^3, (x - k_4)_+^3, (x - k_5)_+^3, \ldots$ where $k_4, k_5, \ldots$ are knots (cubic spline)
E.g. b-splines, Fourier series, ...

Obtain $\hat{g}(x)$ by LS regression of $Y$ on $\{\varphi_j(X)\}_{j=1}^p$. A global method.

Regularization comes in through the choice of $p$: higher $p$ means less bias, since we are leaving fewer terms out of the infinite sum; higher $p$ means higher variance, since we are estimating more regression coefficients from the same amount of data.

33 Series Estimation

Operationally, series are extremely easy:
Define $\varphi^n(x) = (\varphi_{n,1}(x), \ldots, \varphi_{n,p}(x))'$
Define $Z_n = [\varphi^n(x_1), \ldots, \varphi^n(x_n)]'$ (an $n \times p$ matrix)
The series estimator of $E[Y \mid X = x]$: $\hat{g}(x) = \varphi^n(x)'\hat{\beta}$ where $\hat{\beta} = (Z_n'Z_n)^{-1}(Z_n'Y)$

I.e., estimate the coefficients by OLS of $Y$ on $Z_n$. Can also do inference using conventional OLS output (e.g. Newey (1997)).
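
A sketch of these steps in Python with a simple power-series basis (illustrative only; orthogonal polynomials or splines are better conditioned numerically):

    import numpy as np

    def power_basis(x, p):
        """Basis terms 1, x, ..., x^p evaluated at each point of x."""
        x = np.asarray(x, dtype=float)
        return np.column_stack([x**j for j in range(p + 1)])

    def series_fit(x, y, p):
        Z = power_basis(x, p)                          # n x (p+1) matrix
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)   # OLS coefficients
        return beta

    def series_predict(beta, x0):
        return power_basis(x0, len(beta) - 1) @ beta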

34 Series Asymptotics

Example sufficient conditions for establishing asymptotic properties ($p \to \infty$, $n \to \infty$, $p/n \to 0$) of the series estimator (simplified version of Newey (1997)):
$\mathrm{Var}(Y \mid X)$ bounded
$\varphi^n(x)$, $g(x)$, and $\varphi^n(x)'\beta_n$ are uniformly bounded, with $\sup_x \|\varphi^n(x)\| = \zeta(p)$
$\lambda_{\min}(Z'Z)/p$: the $p \times p$ design matrix remains well-behaved as $n \to \infty$ (recall that $p \to \infty$)
There exists a $\beta_n$ such that $\sup_x |\varphi^n(x)'\beta_n - g(x)| = O(p^{-\alpha})$. This requires smoothness of the function $g(x)$. For splines or power series, $\sup_x |\varphi^n(x)'\beta_n - g(x)| = O(p^{-s/d})$, where $s$ is the number of continuous derivatives of $g(x)$ and $d$ is the dimension of $x$.

Under these conditions, can obtain rates of convergence:

$\int [g(x) - \hat{g}(x)]^2\,dF_0(x) = O_p(p/n + p^{-2\alpha})$
$\sup_x |g(x) - \hat{g}(x)| = O_p\left(\zeta(p)\left(\sqrt{p/n} + p^{-\alpha}\right)\right)$

35 Asymptotic Distribution

Essentially Theorem 2 from Newey (1997): under regularity and assuming $\sqrt{n}\,p^{-\alpha} \to 0$,

$\hat{g}(x) = g(x) + O_p(\zeta(p)/\sqrt{n})$

and

$\sqrt{n}\,V^{-1/2}(\hat{g}(x) - g(x)) \stackrel{d}{\to} N(0, 1)$

$V = \varphi^n(x)'\hat{Q}^{-1}\hat{\Omega}\hat{Q}^{-1}\varphi^n(x)$
$\hat{Q} = Z'Z/n$
$\hat{\Omega} = \frac{1}{n}\sum_i \varphi^n(x_i)\varphi^n(x_i)'\,[y_i - \varphi^n(x_i)'\hat{\beta}]^2$

Ignoring technicalities, this result states that you can do inference for results based on a series estimator exactly like you would for results based on OLS estimates of the linear model.

36 Choosing the number of terms

The usual choice of the number of terms is given by cross-validation. There is a closed-form expression for the leave-one-out CV function for least-squares series estimators:

$CV(K) = \sum_{i=1}^n \left(\frac{e_{i,K}}{1 - p_{i,K}}\right)^2$

$e_{i,K} = y_i - \sum_{j=1}^K \hat{a}_j \varphi_j(x_i)$ = regression residual from regressing $y$ on $\varphi_1(x), \ldots, \varphi_K(x)$ using all the observations (you only have to run the regression once)
$p_{i,K}$ is the $(i,i)$ element of the matrix $P_K(P_K'P_K)^{-1}P_K'$
$P_K$ is the $n \times K$ matrix formed by stacking the $1 \times K$ vectors $(\varphi_1(x_i), \ldots, \varphi_K(x_i))$ for each $i$

One can then use the CV number of series terms: $\hat{K} = \arg\min_K CV(K)$. Could also do K-fold cross-validation.
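
A sketch of this closed-form CV in Python, again with a power-series basis for illustration (one regression per K, using the diagonal of the hat matrix):

    import numpy as np

    def series_cv(x, y, K_max):
        """Closed-form leave-one-out CV(K) for least-squares series fits."""
        x = np.asarray(x, dtype=float); y = np.asarray(y, dtype=float)
        cv = {}
        for K in range(1, K_max + 1):
            P = np.column_stack([x**j for j in range(K)])  # n x K basis
            H = P @ np.linalg.solve(P.T @ P, P.T)          # hat matrix
            e = y - H @ y                                  # LS residuals
            cv[K] = np.sum((e / (1 - np.diag(H))) ** 2)
        return min(cv, key=cv.get), cv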

37 Series in Schooling Example (1)

Estimate $E[\log(\text{wage}) \mid \text{schooling}, \text{age}, \text{non-married}, \text{black}]$

Cubic splines with equally spaced knots: 4 knots in age and education; fully interact the marginal splines.
Simple polynomial (monomial): 7th order in age and education; fully interact the marginal polynomials.

38 Series in Schooling Example (2)

Figure: Series estimates of the conditional expectation of log(wage) given age, schooling, non-married, and black. (a) Cubic spline; (b) Polynomial.

39 Series in Schooling Example (3)

Cross-validation:
Cubic spline: 13 knots in age and 1 in education, CV =
Polynomial: 9th order in age and 2nd order in education, CV =

40 Series in Schooling Example (4)

Figure: Series estimates of the conditional expectation of log(wage) given age, schooling, non-married, and black, with terms chosen by CV. (a) Cubic spline; (b) Polynomial.
