41903: Introduction to Nonparametrics
- William Gilbert
- 5 years ago
1 41903: Notes 5
2 Introduction
Nonparametrics is fundamentally about fitting flexible models: we want a model that is flexible enough to accommodate important patterns but not so flexible that it overspecializes to the specific data set.
- I.e., we are concerned about overfitting/false discovery.
- I.e., we are concerned about the bias/variance tradeoff.
Data-mining/big-data methods are naturally viewed as nonparametric procedures.
3 Some ML/Data-Mining Jargon
Data-mining terminology: Supervised Learning
- Want to predict a target variable Y with input variables X.
- AKA Predictive Analytics.
- This is our goal in this set of notes. Focus on classic methods here.
Data-mining terminology: Unsupervised Learning
- Want to find structure within a set of variables X. No specific target.
- Exploratory data analysis (EDA). Essentially fancy descriptive statistics.
4 Predictive Models
Useful to think about the relation between target (Y) and input (X) as

Y_i = g(x_i) + ε_i,

where Y_i is the target, g(x_i) the signal, and ε_i the noise.
Goal: Learn g(·) from the data in a way that yields generalizable forecasts, OR get a forecast rule that minimizes expected forecast loss.
5 Squared Error Loss
Loss function: L(Y, g(X)) = (Y − g(X))²
Expected loss of g(·): E_X[ E_{Y|X}[ L(Y, g(X)) | X ] ]
Clear that it is sufficient to minimize pointwise. Forecast rule:

g(x) = arg min_c E_{Y|X}[(Y − c)² | X = x]
     = arg min_c { E_{Y|X}[Y² | X = x] − 2c E_{Y|X}[Y | X = x] + c² }
     = E[Y | X = x]

Could also use L(Y, g(X)) = |Y − g(X)|, which gives g(x) = median(Y | X = x).
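The pointwise minimization above can be checked numerically: for a fixed x, grid-search the constant c that minimizes the sample analogue of the expected loss. A minimal sketch (the simulated distribution and all names are illustrative, not from the notes):

```python
import numpy as np

# Draws of Y | X = x from an arbitrary distribution; check that squared-error
# loss is minimized at the sample mean and absolute loss near the sample median.
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=10_000)

grid = np.linspace(-1.0, 5.0, 601)                      # candidate constants c
sq_loss = [np.mean((y - c) ** 2) for c in grid]
abs_loss = [np.mean(np.abs(y - c)) for c in grid]

c_sq = grid[np.argmin(sq_loss)]                         # should sit at the mean
c_abs = grid[np.argmin(abs_loss)]                       # should sit at the median
print(c_sq, y.mean())
print(c_abs, np.median(y))
```

The grid minimizer under squared loss lands on the grid point nearest the sample mean, and under absolute loss on the point nearest the sample median, matching the derivation.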
6 Classification/Discrete Outcome
Discrete Y: Y ∈ {1, ..., R}
Common loss function, 0-1 loss:

L(k, l) = L(Y = k, g(x) = l) = 0 if k = l, 1 otherwise

Expected loss of g(·):

E_X[ E_{Y|X}[ L(Y, g(X)) | X ] ] = E_X[ Σ_{r=1}^R L(r, g(X)) P(Y = r | X) ]
7 Classification/Discrete Outcome
Again, sufficient to minimize pointwise. That is,

g(x) = arg min_c Σ_{r=1}^R L(r, c) P(Y = r | X = x)
     = arg min_c Σ_{r ≠ c} P(Y = r | X = x)
     = arg min_c (1 − P(Y = c | X = x))

so g(x) = r such that P(Y = r | X = x) = max_c P(Y = c | X = x).
This defines the Bayes classifier: simply forecast the outcome with the highest conditional probability.
Key input: P(Y = r | X = x) = E[1(Y = r) | X = x]
Common structure: want a good estimate of E[Y | X] or another feature of the conditional distribution (for suitable Y).
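When X is discrete, the key input P(Y = r | X = x) can be estimated by cell frequencies, after which the Bayes rule is just an argmax. A small sketch (the data-generating process and function names are our own, chosen for illustration):

```python
import numpy as np

# Estimate P(Y = r | X = x) by frequencies within each cell of a binary X,
# then classify by the Bayes rule "predict the most probable class".
rng = np.random.default_rng(1)
n = 20_000
x = rng.integers(0, 2, size=n)               # binary regressor
p1 = np.where(x == 0, 0.2, 0.8)              # true P(Y = 1 | X)
y = (rng.random(n) < p1).astype(int)

def bayes_classify(x0):
    mask = x == x0
    p_hat = np.array([np.mean(y[mask] == r) for r in (0, 1)])
    return int(np.argmax(p_hat))             # class with highest cond. probability

print(bayes_classify(0), bayes_classify(1))  # predicts 0 for x=0 and 1 for x=1
```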
8 Coarsely Discrete Regressors
Suppose that X can take on R values, {x_1, ..., x_R}. E.g.:
- Gender, R = 2.
- Years of Schooling, R = 20ish.
- Gender × Years of Schooling, R = 2 × 20ish.
Estimation of E[Y | X = x_r] is easy!! Find all observations with x_i = x_r and calculate the sample mean within this subsample.
- No assumptions about E[Y | X]: completely flexible.
- Will have the usual properties as long as R is finite (just learning about R expectations).
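The cell-mean estimator above is a few lines of code; a minimal sketch with made-up category labels and values:

```python
from collections import defaultdict

# Cell-mean estimator of E[Y | X = x_r] for a coarsely discrete regressor:
# average y over the subsample with x_i = x_r, one mean per observed cell.
def cell_means(x, y):
    sums = defaultdict(float)
    counts = defaultdict(int)
    for xi, yi in zip(x, y):
        sums[xi] += yi
        counts[xi] += 1
    return {xr: sums[xr] / counts[xr] for xr in sums}

x = ["HS", "HS", "College", "College", "College"]
y = [10.0, 12.0, 20.0, 22.0, 24.0]
print(cell_means(x, y))  # {'HS': 11.0, 'College': 22.0}
```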
9 Example: Conditional Wages
Data: 329,505 men from the 1980 U.S. Census
- age
- years of schooling
- race (black, white)
- married (married, non-married)
Condition on schooling or age:
- Schooling: 21 categories; average of 15,691 observations per category. Range: ,934
- Age: 40 categories; average of 8,238 observations per category. Range:
10 Example: Conditional Wages (a) Schooling (b) Age
11 Example: Conditional Wages
Condition on schooling and age:
- 840 categories
- average number of observations per category: 392 (large range: )
12 Example: Conditional Wages
Things only get worse as we condition on Race and Marital Status:
- 3,360 categories; average of 98 observations per category. Range:
- empty categories (7.7%)
- 670 categories with 0-2 observations (19.9%)
13 Example: Conditional Wages
14 Example: Conditional Wages
Some questions:
1. Do we really think the conditional expectation function is that bumpy?
2. What do we do about categories with 0 observations?
3. Estimates cell by cell are unbiased. What happens to variance as the number of cells increases?
4. Suppose we conditioned on State of Birth too? (171,360 categories) [Curse of dimensionality]
Fundamental statistical learning problem: the need for regularization, i.e., structure and estimators that trade off bias and variance to produce reasonable forecasts/models.
15 Local Averaging - Kernels
Literally averaging for each separate x value is only feasible when X is coarsely discrete; we need beliefs/regularization.
Smoothness: e.g., E[Y | X] is a smooth (continuous, differentiable, etc.) function of X.
- The function shouldn't change much across values of X that are close.
- Estimate E[Y | X = x*] by averaging y's over values of x close to x*.
Kernel Regression:

E[Y | X = x*] ≈ ĝ(x*) = Σ_{i=1}^n y_i K_h(x_i − x*) / Σ_{i=1}^n K_h(x_i − x*)

where K_h(·) is a kernel function and h is a bandwidth.
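The kernel regression (Nadaraya-Watson) formula is a weighted local average and can be sketched in a few lines. The Gaussian kernel, function names, and the simulated sine-curve data are our own choices for illustration:

```python
import numpy as np

# Nadaraya-Watson kernel regression estimator with a Gaussian kernel:
# a weighted average of y_i with weights K_h(x_i - x*).
def gaussian_kernel(u, h):
    return np.exp(-(u / h) ** 2 / 2.0) / (np.sqrt(2.0 * np.pi) * h)

def nw_estimate(x_star, x, y, h):
    w = gaussian_kernel(x - x_star, h)   # kernel weights K_h(x_i - x*)
    return np.sum(w * y) / np.sum(w)     # normalized local average

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 2_000)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
print(nw_estimate(0.25, x, y, h=0.05))   # roughly sin(pi/2) = 1
```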
16 Some kernel details
Common univariate kernels:
- Uniform: K_h(u) = (1/(2h)) 1(|u| < h)
- Gaussian: K_h(u) = (1/(√(2π) h)) exp{−u²/(2h²)}
- Epanechnikov: K_h(u) = (3/(4h)) (1 − (u/h)²)_+
- Triangular: K_h(u) = (1/h) (1 − |u/h|)_+
Multivariate kernels:
- Most common to just take the product of univariate kernels ("product kernel").
- Any multivariate density would also work, e.g., a q-dimensional multivariate normal with q × q bandwidth matrix H.
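The four kernels above are densities in u for any bandwidth h, which a quick numerical integration confirms (the function names are our own):

```python
import numpy as np

# The four univariate kernels written as K_h(u), plus a numerical check that
# each integrates to one in u, so kernel weights behave like densities.
def uniform(u, h):
    return np.where(np.abs(u) < h, 1.0 / (2.0 * h), 0.0)

def gaussian(u, h):
    return np.exp(-u**2 / (2.0 * h**2)) / (np.sqrt(2.0 * np.pi) * h)

def epanechnikov(u, h):
    return np.maximum((3.0 / (4.0 * h)) * (1.0 - (u / h) ** 2), 0.0)

def triangular(u, h):
    return np.maximum((1.0 - np.abs(u / h)) / h, 0.0)

u = np.linspace(-10.0, 10.0, 200_001)
du = u[1] - u[0]
masses = {K.__name__: K(u, 0.7).sum() * du
          for K in (uniform, gaussian, epanechnikov, triangular)}
print(masses)  # each mass is ~1.0
```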
17 Intuition using uniform kernel
With the uniform kernel,

ĝ(x*) = Σ_{i=1}^n y_i 1(|x_i − x*| < h) / Σ_{i=1}^n 1(|x_i − x*| < h) = (1/n_{x*,h}) Σ_{i: |x_i − x*| < h} y_i

where n_{x*,h} is the number of observations such that |x_i − x*| < h.
I.e., the estimator is just the sample average of the y_i across all points where |x_i − x*| < h.
18 Local averaging picture
Figure: local averaging with a uniform kernel over the window (x* − h, x* + h); with x* = .3 and h = .1, n_{x*,h} = 18 and ĝ(.3) = (1/18) Σ_{i: x_i ∈ (.2, .4)} y_i.
19 (Conditional) MSE
Heuristic MSE derivation:
- Condition on {x_i}_{i=1}^n and h.
- Assume (i) E[y_i | X] = E[y_i | x_i] and (ii) Var(y_i | X) = Var(y_i | x_i) exist.
- Assume independence across i.

E[(ĝ(x*) − g(x*))² | X]
  = E[ ( (1/n_{x*,h}) Σ_{i: |x_i − x*| < h} (y_i − g(x_i) + g(x_i) − g(x*)) )² | X ]
  = (1/n_{x*,h}²) Σ_{i: |x_i − x*| < h} E[(y_i − g(x_i))² | X] + ( (1/n_{x*,h}) Σ_{i: |x_i − x*| < h} (g(x_i) − g(x*)) )²
  = (1/n_{x*,h}²) Σ_{i: |x_i − x*| < h} Var(y_i | x_i) + ( (1/n_{x*,h}) Σ_{i: |x_i − x*| < h} (g(x* + (x_i − x*)) − g(x*)) )²

where the first term is the variance and the second is the squared bias.
20 Dependence of MSE on h (1)
We can see how h relates to MSE:
- larger h ⇒ larger n_{x*,h} ⇒ the first (variance) term is smaller
- larger h ⇒ more points farther away from x* ⇒ higher bias (in general)
Heuristic derivation of the MSE bound in the univariate case:
1. n_{x*,h}/n = n^{−1} Σ_{i=1}^n 1(|x_i − x*| < h) ≈ F(x* + h) − F(x* − h) ≈ 2 f(x̄) h,
   where we've assumed the x_i are iid with CDF F(x) and pdf f(x), and x̄ is an intermediate value in the interval [x* − h, x* + h], which has length 2h.
2. Assume sup_{x∈X} f(x), sup_{x∈X} |df(x)/dx|, sup_{x∈X} |g(x)|, sup_{x∈X} |dg(x)/dx|, sup_{x∈X} |d²g(x)/dx²|, and sup_i Var(y_i | x_i) are bounded.
3. Using 1,

(1/n_{x*,h}) Σ_{i: |x_i − x*| < h} (dg(x*)/dx)(x_i − x*)
  ≈ (dg(x*)/dx) ∫_{x*−h}^{x*+h} (x − x*) f(x) dx / (2 f(x̄) h)
  = (h/(2 f(x̄))) (dg(x*)/dx) ∫_{−1}^{1} u f(x* + hu) du
  = (h/(2 f(x̄))) (dg(x*)/dx) [ f(x*) ∫_{−1}^{1} u du + h ∫_{−1}^{1} u² (df(ū)/dx) du ]
  = (h²/(2 f(x̄))) (dg(x*)/dx) ∫_{−1}^{1} u² (df(ū)/dx) du,

since ∫_{−1}^{1} u du = 0; so this term is O(h²).
21 Dependence of MSE on h (2)
MSE Bound: Using 1-3 and letting ≲ mean "approximately less than or equal to" gives

E[(ĝ(x*) − g(x*))² | X]
  ≲ (1/n_{x*,h}) sup_i Var(y_i | x_i) + ( (1/n_{x*,h}) Σ_{i: |x_i − x*| < h} [ (dg(x*)/dx)(x_i − x*) + (1/2)(d²g(x̃)/dx²)(x_i − x*)² ] )²
  ≲ M ( 1/(hn) + (h²)² )
  = M ( 1/(hn) + h⁴ )

MSE goes to zero if h → 0 and hn → ∞.
The optimal rate equates the two terms and gives h_n ∝ n^{−1/5}.
22 Formal statement of results
Asymptotic distribution of the kernel regression estimator:
Theorem: Assume that x ∈ Interior(X) where X ⊂ R^q, g(x) and f(x) are three times continuously differentiable, and f(x) > 0. Then as n → ∞, with h_s → 0 for s = 1, ..., q, n Π_{s=1}^q h_s → ∞, and (n Π_{s=1}^q h_s) Σ_{s=1}^q h_s^6 → 0,

√(n Π_{s=1}^q h_s) ( ĝ(x) − g(x) − Σ_{s=1}^q h_s² B_s(x) ) →_d N(0, κ^q σ²(x)/f(x))

for σ²(x) = Var(y | X = x), κ = ∫ K²(v) dv, and

B_s(x) = (κ₂/2) [ 2 (∂f(x)/∂x_s)(∂g(x)/∂x_s) + f(x) ∂²g(x)/∂x_s² ] / f(x)

for κ₂ = ∫ v² K(v) dv.
Operationalizing:
- Need to estimate the density f(x).
- Need to estimate the bias B_s(x) for s = 1, ..., q, which requires estimating the derivative of the density and the first and second derivatives of the regression function.
23 Curse of dimensionality
Note that the asymptotic normality result implies that (approximately)
- bias = O(Σ_{s=1}^q h_s²)
- variance = O((n Π_{s=1}^q h_s)^{−1})
Setting each bandwidth = O(h), equating the bias² and variance rates, and ignoring multiplicative constants gives

h⁴ = (n h^q)^{−1}  ⇒  h = O(n^{−1/(q+4)})  and  MSE = O(n^{−4/(q+4)})

Increasing q really slows the rate of convergence: the curse of dimensionality.
Intuition:
- Think about data uniformly distributed on the q-dimensional unit cube.
- To get a fraction b of the observations, you need a fraction b of the volume.
- On average, you will need a cube with edge length b^{1/q}.
- Neighbors aren't so local (e.g., q = 10, b = .01: need to cover 63% of the support of each input).
- Averaging points that are very far from each other ⇒ bias.
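The edge-length calculation behind the intuition is one line; the numbers in the example above follow directly:

```python
# To capture a fraction b of uniformly distributed data on the q-dimensional
# unit cube, you need a sub-cube of volume b, i.e. edge length b^(1/q).
def edge_length(b, q):
    return b ** (1.0 / q)

print(edge_length(0.01, 1))    # 0.01: genuinely local in one dimension
print(edge_length(0.01, 10))   # ~0.63: 63% of each axis when q = 10
```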
24 Bandwidth selection
The key input to nonparametrics is the choice of tuning parameter (the bandwidth in this case). Simple options:
1. Eyeball Method: Choose a bandwidth. Estimate the regression function. Look at the result. If it looks more wiggly than you'd like, increase the bandwidth. If it looks smoother than you'd like, decrease the bandwidth.
2. Rule-of-Thumb: The most common rule-of-thumb is approximately optimal for estimating a Gaussian density with a Gaussian kernel.
   Silverman's Rule-of-Thumb: h_n = σ̂_x (4/3)^{1/5} n^{−1/5}, where σ̂_x is the sample standard deviation of x.
   Generalizations to multiple dimensions: (i) h_{n,s} = σ̂_s n^{−1/(4+q)}, where σ̂_s is the sample standard deviation of x_s and q is the dimension of X, or (ii) H_n = Σ̂^{1/2} n^{−1/(4+q)}, where Σ̂ is the q × q sample covariance matrix of X.
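Silverman's rule as stated above is a one-liner (the function name and simulated data are our own):

```python
import numpy as np

# Silverman's rule of thumb: h_n = sigma_hat_x * (4/3)^(1/5) * n^(-1/5).
def silverman_bandwidth(x):
    n = len(x)
    return np.std(x, ddof=1) * (4.0 / 3.0) ** 0.2 * n ** (-0.2)

rng = np.random.default_rng(3)
x = rng.normal(size=1_000)
print(silverman_bandwidth(x))  # roughly 0.266 * sigma_hat for n = 1000
```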
25 Cross-validation (CV)
Basic idea of cross-validation: evaluate the quality of the bandwidth by seeing how well the resulting estimator forecasts out-of-sample.
Leave-one-out CV bandwidth:

ĥ = arg min_h CV(h) = arg min_h Σ_{i=1}^n (y_i − ŷ_{−i,h})²

where ŷ_{−i,h} = ĝ_h(x_i) is the estimate of the conditional expectation at x_i using bandwidth h and all observations EXCEPT observation i. Could use a different loss function, such as absolute value, etc.
To calculate CV(h): Choose h. For each observation i = 1, ..., n,
- Calculate ŷ_{−i,h} = Σ_{j≠i} y_j K_h(x_j − x_i) / Σ_{j≠i} K_h(x_j − x_i)
- Calculate e²_{i,h} = (y_i − ŷ_{−i,h})²
Then CV(h) = Σ_{i=1}^n e²_{i,h}.
Common to use K-fold CV rather than leave-one-out CV.
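The leave-one-out recipe can be vectorized by zeroing the diagonal of the kernel weight matrix. A sketch with a Gaussian kernel, a hand-picked grid of candidate bandwidths, and simulated data (all our own choices):

```python
import numpy as np

# Leave-one-out CV criterion for the kernel regression bandwidth: compute the
# n x n kernel weight matrix, drop the diagonal (observation i itself), and
# sum the squared leave-one-out prediction errors.
def loo_cv(h, x, y):
    d = x[:, None] - x[None, :]            # pairwise differences x_j - x_i
    w = np.exp(-(d / h) ** 2 / 2.0)        # Gaussian weights (constants cancel)
    np.fill_diagonal(w, 0.0)               # exclude observation i
    y_hat = w @ y / w.sum(axis=1)          # leave-one-out fitted values
    return np.sum((y - y_hat) ** 2)

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 300)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

grid = np.array([0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5])
h_cv = grid[np.argmin([loo_cv(h, x, y) for h in grid])]
print(h_cv)  # a moderate bandwidth beats both extremes
```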
26 K-NN
K Nearest Neighbors is closely related to kernels:

ĝ(x) = (1/K) Σ_{i: d(x_i, x) ≤ d(x_{(K)}, x)} y_i

- K: number of neighbors to use
- d(x_1, x_2): distance from point x_1 to point x_2, usually Euclidean
- x_{(K)}: the observation ranked Kth in distance from the target point x
Can be viewed as a kernel with varying bandwidth.
Can choose K (the number of neighbors) by CV.
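A minimal univariate k-NN regression sketch (function names and simulated data are illustrative):

```python
import numpy as np

# k-nearest-neighbor regression: average y over the K observations closest to
# the target point; effectively a kernel with a data-driven bandwidth.
def knn_estimate(x_star, x, y, k):
    idx = np.argsort(np.abs(x - x_star))[:k]   # indices of the K nearest x_i
    return y[idx].mean()

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 1_000)
y = 2.0 * x + rng.normal(scale=0.1, size=x.size)
print(knn_estimate(0.5, x, y, k=50))  # close to 2 * 0.5 = 1
```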
27 Kernels and K-NN in Schooling Example (1) Figure: Uniform Kernel, ha = 3, he = 1, average 7766 obs per cell (range )
28 Kernels and K-NN in Schooling Example (2) Figure: KNN, between 2200 and 4500 obs per cell
29 Kernels and K-NN in Schooling Example (3) Figure: Uniform kernel; condition on schooling, age, non-married, black; ha = 12, he = 3
30 Kernels and K-NN in Schooling Example (4) Figure: Cross-Validation function
31 Kernels and K-NN in Schooling Example (5) Figure: Uniform kernel, CV bandwidths, ha = 3, he = 4
32 Series
Series: Model g(x) = Σ_{j=0}^p β_j φ_j(x) + r_p(x)
- E.g., if g(x) is infinitely differentiable, we have a Taylor series: g(x) = Σ_{j=0}^∞ a_j x^j where a_j = (1/j!) d^j g(0)/dx^j
- The φ_j(x) are series/basis terms
- E.g., {φ_j(x)} = 1, x, x², x³, ... (or orthogonal polynomials)
- E.g., {φ_j(x)} = 1, x, x², x³, (x − k₄)³_+, (x − k₅)³_+, ..., where k₄, k₅, ... are knots (cubic spline)
- E.g., b-splines, Fourier series, ...
Obtain ĝ(x) by LS regression of Y on {φ_j(X)}_{j=1}^p: a global method.
Regularization comes in through the choice of p:
- Higher p means less bias, since we are leaving out fewer terms from the infinite sum.
- Higher p means higher variance, since we are estimating more regression coefficients from the same amount of data.
33 Series Estimation
Operationally, series are extremely easy:
- Define φ^n(x) = (φ_{n,1}(x), ..., φ_{n,p}(x))′
- Define Z_n = [φ^n(x_1), ..., φ^n(x_n)]′ (an n × p matrix)
- The series estimator of E[Y | X = x]: ĝ(x) = φ^n(x)′ β̂ where β̂ = (Z_n′ Z_n)^{−1} (Z_n′ Y)
I.e., estimate the coefficients by OLS of Y on Z_n. Can also do inference using conventional OLS output (e.g., Newey (1997)).
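The matrix recipe above is a short script with a power-series basis; `lstsq` plays the role of (Z′Z)^{−1}Z′Y (the basis choice, function names, and simulated data are our own):

```python
import numpy as np

# Series estimator: OLS of Y on basis terms phi_j(X), here the power series
# 1, x, x^2, ..., x^p built with np.vander.
def series_fit(x, y, p):
    Z = np.vander(x, p + 1, increasing=True)       # n x (p+1) basis matrix Z_n
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)   # OLS coefficients beta_hat
    return beta

def series_predict(x_new, beta):
    p = beta.size - 1
    return np.vander(np.atleast_1d(x_new), p + 1, increasing=True) @ beta

rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, 2_000)
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=x.size)
beta = series_fit(x, y, p=7)
print(series_predict(0.5, beta))  # approximately sin(pi/2) = 1
```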
34 Series Asymptotics
Example sufficient conditions for establishing asymptotic properties (p → ∞, n → ∞, p/n → 0) of the series estimator (simplified version of Newey (1997)):
- Var(Y | X) bounded
- φ^n(x), g(x), and φ^n(x)′β_n are uniformly bounded; sup_x ||φ^n(x)|| = ζ(p)
- λ_min(Z′Z/n) bounded away from zero: the p × p design matrix remains well-behaved as n → ∞ (recall that p → ∞)
- There exists a β_n such that sup_x |φ^n(x)′β_n − g(x)| = O(p^{−α})
  - Requires smoothness of the function g(x)
  - For splines or power series, sup_x |φ^n(x)′β_n − g(x)| = O(p^{−s/d}), where s is the number of continuous derivatives of g(x) and d is the dimension of x
Under these conditions, can obtain rates of convergence:

∫ [g(x) − ĝ(x)]² dF₀(x) = O_p(p/n + p^{−2α})
sup_x |g(x) − ĝ(x)| = O_p(ζ(p)(√(p/n) + p^{−α}))
35 Asymptotic Distribution
Essentially Theorem 2 from Newey (1997): Under regularity conditions and assuming √n p^{−α} → 0,

ĝ(x) = g(x) + O_p(ζ(p)/√n)

and

√n V^{−1/2} (ĝ(x) − g(x)) →_d N(0, 1)
V = φ^n(x)′ Q^{−1} Ω̂ Q^{−1} φ^n(x)
Q = Z′Z/n
Ω̂ = (1/n) Σ_i φ^n(x_i) φ^n(x_i)′ [y_i − φ^n(x_i)′ β̂]²

Ignoring technicalities, this result states that you can do inference for results based on a series estimator exactly like you would for results based on OLS estimates of the linear model.
36 Choosing the number of terms
The usual choice of the number of terms is given by cross-validation.
Closed-form expression for the leave-one-out CV function for least-squares series estimators:

CV(K) = Σ_{i=1}^n ( e_{i,K} / (1 − p_{i,K}) )²

- e_{i,K} = y_i − Σ_{j=1}^K â_j φ_j(x_i) is the regression residual from regressing y on φ_1(x), ..., φ_K(x) using all the observations; you only have to run the regression once.
- p_{i,K} is the (i, i) element of the matrix P_K (P_K′ P_K)^{−1} P_K′, where P_K is the n × K matrix formed by stacking the 1 × K vectors (φ_1(x_i), ..., φ_K(x_i)) for each i.
One can then use the CV number of series terms: K̂ = arg min_K CV(K).
Could also do K-fold cross-validation.
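The closed-form CV(K) can be verified against brute-force leave-one-out refitting; they agree exactly, which is why the one-regression shortcut works. A sketch with a cubic power basis and simulated data (our own choices):

```python
import numpy as np

# Verify the closed-form LOOCV identity for least squares: the leave-one-out
# residual equals e_i / (1 - p_ii), with p_ii the hat-matrix diagonal.
rng = np.random.default_rng(7)
n, K = 200, 4
x = rng.uniform(-1, 1, n)
P = np.vander(x, K, increasing=True)              # n x K basis matrix P_K
y = np.sin(2 * x) + rng.normal(scale=0.3, size=n)

beta, *_ = np.linalg.lstsq(P, y, rcond=None)
e = y - P @ beta                                  # full-sample residuals e_{i,K}
p_ii = np.einsum('ij,jk,ik->i', P, np.linalg.inv(P.T @ P), P)  # hat diagonal
cv_closed = np.sum((e / (1.0 - p_ii)) ** 2)

# Brute-force leave-one-out for comparison: refit n times.
cv_loo = 0.0
for i in range(n):
    mask = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(P[mask], y[mask], rcond=None)
    cv_loo += (y[i] - P[i] @ b_i) ** 2

print(cv_closed, cv_loo)  # the two criteria agree
```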
37 Series in Schooling Example (1)
Estimate E[log(wage) | schooling, age, non-married, black]
- Cubic splines with equally spaced knots: 4 knots in age and education; fully interact the marginal splines.
- Simple polynomial (monomial): 7th order in age and education; fully interact the marginal polynomials.
38 Series in Schooling Example (2) (a) Cubic Spline (b) Polynomial Figure: Series Estimates of Conditional Expectation of log(wage) Given Age, Schooling, Non-Married, and Black
39 Series in Schooling Example (3)
Cross-Validation:
- Cubic spline: 13 knots in age and 1 in education. CV =
- Polynomial: 9th order in age and 2nd order in education. CV =
40 Series in Schooling Example (4) (a) Cubic Spline (b) Polynomial Figure: Series Estimates of Conditional Expectation of log(wage) Given Age, Schooling, Non-Married, and Black with Terms Chosen by CV
More informationRegression #8: Loose Ends
Regression #8: Loose Ends Econ 671 Purdue University Justin L. Tobias (Purdue) Regression #8 1 / 30 In this lecture we investigate a variety of topics that you are probably familiar with, but need to touch
More informationAUTOMATIC CONTROL COMMUNICATION SYSTEMS LINKÖPINGS UNIVERSITET. Questions AUTOMATIC CONTROL COMMUNICATION SYSTEMS LINKÖPINGS UNIVERSITET
The Problem Identification of Linear and onlinear Dynamical Systems Theme : Curve Fitting Division of Automatic Control Linköping University Sweden Data from Gripen Questions How do the control surface
More informationPart 6: Multivariate Normal and Linear Models
Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of
More informationCOMS 4771 Introduction to Machine Learning. James McInerney Adapted from slides by Nakul Verma
COMS 4771 Introduction to Machine Learning James McInerney Adapted from slides by Nakul Verma Announcements HW1: Please submit as a group Watch out for zero variance features (Q5) HW2 will be released
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University February 4, 2015 Today: Generative discriminative classifiers Linear regression Decomposition of error into
More informationLinear Regression and Its Applications
Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start
More informationSpatially Smoothed Kernel Density Estimation via Generalized Empirical Likelihood
Spatially Smoothed Kernel Density Estimation via Generalized Empirical Likelihood Kuangyu Wen & Ximing Wu Texas A&M University Info-Metrics Institute Conference: Recent Innovations in Info-Metrics October
More informationModel-free prediction intervals for regression and autoregression. Dimitris N. Politis University of California, San Diego
Model-free prediction intervals for regression and autoregression Dimitris N. Politis University of California, San Diego To explain or to predict? Models are indispensable for exploring/utilizing relationships
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V
More informationLocal Polynomial Regression
VI Local Polynomial Regression (1) Global polynomial regression We observe random pairs (X 1, Y 1 ),, (X n, Y n ) where (X 1, Y 1 ),, (X n, Y n ) iid (X, Y ). We want to estimate m(x) = E(Y X = x) based
More informationCOMS 4771 Regression. Nakul Verma
COMS 4771 Regression Nakul Verma Last time Support Vector Machines Maximum Margin formulation Constrained Optimization Lagrange Duality Theory Convex Optimization SVM dual and Interpretation How get the
More informationA Modern Look at Classical Multivariate Techniques
A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico
More informationVariance Reduction and Ensemble Methods
Variance Reduction and Ensemble Methods Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Last Time PAC learning Bias/variance tradeoff small hypothesis
More informationBusiness Statistics. Lecture 10: Course Review
Business Statistics Lecture 10: Course Review 1 Descriptive Statistics for Continuous Data Numerical Summaries Location: mean, median Spread or variability: variance, standard deviation, range, percentiles,
More informationJoint Probability Distributions and Random Samples (Devore Chapter Five)
Joint Probability Distributions and Random Samples (Devore Chapter Five) 1016-345-01: Probability and Statistics for Engineers Spring 2013 Contents 1 Joint Probability Distributions 2 1.1 Two Discrete
More informationAsymptotic Statistics-III. Changliang Zou
Asymptotic Statistics-III Changliang Zou The multivariate central limit theorem Theorem (Multivariate CLT for iid case) Let X i be iid random p-vectors with mean µ and and covariance matrix Σ. Then n (
More informationELEG 3143 Probability & Stochastic Process Ch. 6 Stochastic Process
Department of Electrical Engineering University of Arkansas ELEG 3143 Probability & Stochastic Process Ch. 6 Stochastic Process Dr. Jingxian Wu wuj@uark.edu OUTLINE 2 Definition of stochastic process (random
More informationRestricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model
Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model Xiuming Zhang zhangxiuming@u.nus.edu A*STAR-NUS Clinical Imaging Research Center October, 015 Summary This report derives
More informationIntroduction to Smoothing spline ANOVA models (metamodelling)
Introduction to Smoothing spline ANOVA models (metamodelling) M. Ratto DYNARE Summer School, Paris, June 215. Joint Research Centre www.jrc.ec.europa.eu Serving society Stimulating innovation Supporting
More informationECON 721: Lecture Notes on Nonparametric Density and Regression Estimation. Petra E. Todd
ECON 721: Lecture Notes on Nonparametric Density and Regression Estimation Petra E. Todd Fall, 2014 2 Contents 1 Review of Stochastic Order Symbols 1 2 Nonparametric Density Estimation 3 2.1 Histogram
More informationChapter 7: Model Assessment and Selection
Chapter 7: Model Assessment and Selection DD3364 April 20, 2012 Introduction Regression: Review of our problem Have target variable Y to estimate from a vector of inputs X. A prediction model ˆf(X) has
More informationRegression I: Mean Squared Error and Measuring Quality of Fit
Regression I: Mean Squared Error and Measuring Quality of Fit -Applied Multivariate Analysis- Lecturer: Darren Homrighausen, PhD 1 The Setup Suppose there is a scientific problem we are interested in solving
More informationNumerical Methods I Monte Carlo Methods
Numerical Methods I Monte Carlo Methods Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 Course G63.2010.001 / G22.2420-001, Fall 2010 Dec. 9th, 2010 A. Donev (Courant Institute) Lecture
More informationProblem Set 7. Ideally, these would be the same observations left out when you
Business 4903 Instructor: Christian Hansen Problem Set 7. Use the data in MROZ.raw to answer this question. The data consist of 753 observations. Before answering any of parts a.-b., remove 253 observations
More informationWhy experimenters should not randomize, and what they should do instead
Why experimenters should not randomize, and what they should do instead Maximilian Kasy Department of Economics, Harvard University Maximilian Kasy (Harvard) Experimental design 1 / 42 project STAR Introduction
More informationECE 4400:693 - Information Theory
ECE 4400:693 - Information Theory Dr. Nghi Tran Lecture 8: Differential Entropy Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 1 / 43 Outline 1 Review: Entropy of discrete RVs 2 Differential
More informationIntroduction to Simple Linear Regression
Introduction to Simple Linear Regression Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Introduction to Simple Linear Regression 1 / 68 About me Faculty in the Department
More informationDay 4A Nonparametrics
Day 4A Nonparametrics A. Colin Cameron Univ. of Calif. - Davis... for Center of Labor Economics Norwegian School of Economics Advanced Microeconometrics Aug 28 - Sep 2, 2017. Colin Cameron Univ. of Calif.
More informationData Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods.
TheThalesians Itiseasyforphilosopherstoberichiftheychoose Data Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods Ivan Zhdankin
More information