Econometrics I. Lecture 10: Nonparametric Estimation with Kernels. Paul T. Scott NYU Stern. Fall 2018


Nonparametric Regression: Intuition

Let's get back to conditional means and consider a general functional form:

$y = \mu(x) + \varepsilon$

The basic idea behind nonparametric estimation is to avoid imposing any functional form on the relationship between variables. How do we go about this in practice? On board: what if $x$ is discretely distributed?
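To make the boarded discrete case concrete: when $x$ takes finitely many values, the nonparametric estimate of $\mu(x)$ is just the sample mean of $y$ within each cell. A minimal sketch in Python (the data are invented for illustration):

```python
import numpy as np

# Hypothetical data: x takes a few discrete values
x = np.array([1, 1, 2, 2, 2, 3, 3])
y = np.array([2.0, 2.4, 3.1, 2.9, 3.0, 5.2, 4.8])

# With discrete x, the nonparametric estimate of mu(x) is just
# the within-cell sample mean of y.
mu_hat = {val: y[x == val].mean() for val in np.unique(x)}
print(mu_hat)  # {1: 2.2, 2: 3.0, 3: 5.0}
```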

Nearest Neighbor Interpolation

[Figure: nearest-neighbor interpolation of example data; x runs 0 to 5, y runs 0 to 6.]

Nearest neighbor interpolation: $\mu(x)$ is the value of $y$ associated with the nearest observed value of $x$.
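A minimal sketch of nearest-neighbor interpolation, with invented data and illustrative names (not code from the course):

```python
import numpy as np

def nn_interpolate(x_eval, x_obs, y_obs):
    """Return the y of the observation whose x is closest to each evaluation point."""
    x_eval = np.atleast_1d(x_eval)
    # Index of the nearest observed x for each evaluation point
    idx = np.abs(x_obs[None, :] - x_eval[:, None]).argmin(axis=1)
    return y_obs[idx]

x_obs = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y_obs = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0])
print(nn_interpolate([0.4, 2.6], x_obs, y_obs))  # [1. 5.]
```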

Linear Interpolation

[Figure: linear interpolation of the same example data; x runs 0 to 5, y runs 0 to 6.]

Linear interpolation: $\mu(x)$ is a weighted average of the two nearest observed points. There are more sophisticated interpolation techniques, especially when it comes to multidimensional $x$; see kriging.
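For one-dimensional $x$, NumPy's np.interp performs exactly this bracketing linear interpolation; the data below are the same invented points as above:

```python
import numpy as np

x_obs = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y_obs = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0])

# np.interp linearly interpolates between the two bracketing observations
print(np.interp(2.5, x_obs, y_obs))  # 3.5, halfway between y(2)=2 and y(3)=5
```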

Local Averaging: Intuition

$y = \mu(x) + \varepsilon$

The interpolation techniques above were just about filling in the value of the function between observations. When we have lots of data, interpolation makes less sense: when observations are close together, the difference between them might have more to do with noise in $\varepsilon$ than with an actual change in $\mu(x)$, so the interpolated $\hat\mu(x)$ is likely to look very jagged if we connect many close-together observations. This brings us to nonparametric estimators that still fit $\mu(x)$ based on nearby observations, but average over observations to some extent.

Local Averaging

Our estimate of $\mu(\cdot)$ can be thought of as a weighted mean at each point:

$\hat\mu(x) = \sum_{i=1}^{n} w_i(x)\, y_i$ with $\sum_{i=1}^{n} w_i(x) = 1$.

Note that this is not just a single weighted average, but a function that takes a different weighted average at different points: the weights depend on the value of $x$ at which we are evaluating the function. What's the weight function for the linear interpolation example above?
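One simple instance of this weighting scheme, sketched here for illustration, is a uniform (boxcar) local average: every observation within a window around $x$ gets equal weight and all others get zero. The half-width convention for h below is an assumption, not from the slides:

```python
import numpy as np

def boxcar_average(x, x_obs, y_obs, h):
    """Local average of y over observations with |x_obs - x| <= h.

    Assumes at least one observation falls inside the window.
    """
    in_window = np.abs(x_obs - x) <= h
    w = in_window / in_window.sum()  # equal weights that sum to one
    return w @ y_obs
```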

Kernel

Kernel estimation starts with a kernel function $K(x - x_i, h)$ that allows you to generate the weights from the data. Note: the kernel depends only on the evaluation point $x$ and a single observation $x_i$, but the weights ultimately depend on all the observations. Kernels involve a bandwidth $h$ that, roughly speaking, determines how close $x_i$ must be to $x$ for $x_i$ to receive some weight in $w_i(x)$. The bandwidth can be adjusted (and optimally selected).

Logistic Kernel

The logistic kernel function is

$K(x - x_i, h) = \Lambda(v_i)\,(1 - \Lambda(v_i))$

where

$v_i = \dfrac{x_i - x}{h}$ and $\Lambda(v) = \dfrac{\exp(v)}{1 + \exp(v)}$.

Note that as $x_i$ moves far from $x$, either $\Lambda(v_i) \to 0$ or $1 - \Lambda(v_i) \to 0$; either way, the kernel goes to zero as the observations become far apart. Other kernels have $K(x - x_i, h) = 0$ whenever $|x_i - x| > h/2$.
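A direct transcription of the logistic kernel formula into code (a minimal sketch; the function name is mine):

```python
import numpy as np

def logistic_kernel(x, x_i, h):
    """K(x - x_i, h) = Lambda(v_i) * (1 - Lambda(v_i)), with v_i = (x_i - x) / h."""
    v = (x_i - x) / h
    lam = 1.0 / (1.0 + np.exp(-v))  # logistic CDF Lambda(v)
    return lam * (1.0 - lam)
```

The kernel peaks at $v_i = 0$ (where it equals 1/4) and decays to zero in both tails, so nearby observations receive the most weight.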

From Kernel to Weights

$\hat\mu(x) = \sum_{i=1}^{n} w_i(x)\, y_i$

Given a kernel function $K(x - x_i, h)$, the weights are

$w_i(x; h) = \dfrac{K(x - x_i, h)}{\sum_{j=1}^{n} K(x - x_j, h)}$
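Putting kernel and weights together yields the kernel regression (Nadaraya-Watson) estimator; this sketch reuses the hypothetical logistic_kernel above:

```python
import numpy as np

def kernel_regression(x, x_obs, y_obs, h, kernel=logistic_kernel):
    """Kernel estimate mu_hat(x) = sum_i w_i(x) y_i."""
    k = kernel(x, x_obs, h)  # kernel value for each observation
    w = k / k.sum()          # normalize so the weights sum to one
    return w @ y_obs
```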

[Figure not transcribed. Source: Wikipedia]

Bandwidth Selection

The bandwidth $h$ needs to be set somehow, and the choice matters. If the bandwidth is too large, we risk estimating $\mu(x)$ using observations $x_i$ where $\mu(x_i)$ differs substantially from $\mu(x)$: BIAS. If the bandwidth is too small, we risk averaging over very few observations, in which case our estimate $\hat\mu(x)$ will be very imprecise: VARIANCE.

There is a large literature on bandwidth selection. The main idea is to minimize the expected gap between the fitted function and the true function (mean squared error). This takes some work to do formally, but in some cases there are straightforward formulas for the optimal bandwidth. In practice, the eyeball test can provide a good starting point: when you plot your data along with $\hat\mu(x)$, does the fit seem to wiggle around to chase individual observations? Are there patterns in the data that $\hat\mu(x)$ fails to capture because it smooths them out?
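One standard way to implement the MSE criterion is leave-one-out cross-validation: for each candidate h, predict each $y_i$ from the remaining observations and pick the h with the smallest average squared prediction error. A sketch reusing the hypothetical kernel_regression above:

```python
import numpy as np

def loo_cv_bandwidth(x_obs, y_obs, h_grid, kernel=logistic_kernel):
    """Pick the h in h_grid minimizing leave-one-out squared prediction error."""
    n = len(x_obs)
    scores = []
    for h in h_grid:
        errs = [
            (y_obs[i] - kernel_regression(x_obs[i],
                                          np.delete(x_obs, i),  # drop obs. i
                                          np.delete(y_obs, i),
                                          h, kernel)) ** 2
            for i in range(n)
        ]
        scores.append(np.mean(errs))
    return h_grid[int(np.argmin(scores))]
```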

Further Comments

Kernels become difficult or impossible to implement with high-dimensional data; other nonparametric techniques are better suited to that setting (splines, LASSO). Any version of nonparametric estimation should do something to balance bias and variance. The broader issue is model selection, which is important for parametric as well as nonparametric estimation.