Opening Theme: Flexibility vs. Stability

Patrick Breheny, BST 764: Applied Statistical Modeling, August 25

Introduction
We begin this course with a contrast of two simple, but very different, methods: the ordinary least squares regression model and the k-nearest-neighbor prediction rule. The linear model makes huge assumptions about the structure of the problem, but is quite stable. Nearest neighbors is virtually assumption-free, but its results can be quite unstable. Each method can be quite powerful in different settings and for different reasons.

Simulation settings
To examine which method is better in which setting, we will simulate data from a simple model in which y can take on one of two values: −1 or 1. The corresponding x values are drawn from one of two settings:
Setting 1: x values are drawn from a bivariate normal distribution with different means for y = −1 and y = 1
Setting 2: a mixture in which 10 sets of means for each class (−1 and 1) are drawn; x values are then drawn by randomly selecting a mean from the appropriate class and then generating a random bivariate normal observation with that mean
A fair competition between the two methods is then how well they do at predicting whether a future observation is −1 or 1 given its x values.
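To make the setup concrete, here is a minimal sketch of how data along these lines might be simulated, assuming classes coded −1/1; the specific means, variances, and mixture scale are illustrative assumptions, not the values behind the figures in these slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def setting1(n):
    """Setting 1: one bivariate normal per class (illustrative means at +/-1)."""
    y = rng.choice([-1, 1], size=n)
    means = np.where(y[:, None] == 1, [1.0, 1.0], [-1.0, -1.0])
    x = means + rng.normal(size=(n, 2))
    return x, y

def setting2(n, n_centers=10):
    """Setting 2: draw 10 centers per class, then sample each point from a
    bivariate normal around a randomly chosen center of its class."""
    centers = {c: rng.normal(loc=c, scale=1.0, size=(n_centers, 2)) for c in (-1, 1)}
    y = rng.choice([-1, 1], size=n)
    picks = rng.integers(n_centers, size=n)
    x = np.array([centers[c][i] for c, i in zip(y, picks)]) + rng.normal(scale=0.5, size=(n, 2))
    return x, y

X1, y1 = setting1(200)
X2, y2 = setting2(200)
```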

The linear model: A review
You are all familiar with the linear model, in which y is predicted via $\hat{y} = X\beta$. To review, the linear model is fit (i.e., $\beta$ is estimated) by minimizing the residual sum of squares criterion: $\mathrm{RSS} = (y - X\beta)^T(y - X\beta)$. Differentiating with respect to $\beta$, we obtain the so-called normal equations: $X^T X \beta = X^T y$. Provided that X is of full rank, the unique solution is $\hat{\beta} = (X^T X)^{-1} X^T y$.
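As a sketch of the algebra above (not the code used for the slides), the least squares fit can be computed directly from the normal equations; ols_fit is a hypothetical helper name, and the final lines assume the −1/1 coding and the setting 1 sample (X1, y1) from the earlier sketch.

```python
import numpy as np

def ols_fit(X, y):
    """Solve the normal equations X'X beta = X'y (assumes X has full column rank).
    np.linalg.solve is preferred over forming the inverse explicitly."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Fit on the setting 1 data with an intercept column, then classify by sign(X beta)
Xd = np.column_stack([np.ones(len(X1)), X1])
beta_hat = ols_fit(Xd, y1)
y_hat = np.sign(Xd @ beta_hat)
```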

Linear model results
[Figure: linear model classification results in settings 1 and 2]

Linear model remarks
The linear model seems to classify points reasonably in setting 1. In setting 2, on the other hand, there are some regions which seem questionable. For example, in the lower left-hand corner of the plot, does it really make sense to predict blue given that all of the nearby points are red?

Nearest neighbors
Consider then a completely different approach in which we don't assume a model, a distribution, a likelihood, or anything about the problem: we just look at nearby points and base our prediction on the average of those points. This approach is called the nearest-neighbor method, and is defined formally as $\hat{y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$, where $N_k(x)$ is the neighborhood of x defined by its k closest points in the sample.
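A minimal sketch of this rule, assuming Euclidean distance and the −1/1 class coding: average the responses of the k closest training points and classify by the sign of that average. knn_predict is a hypothetical name, not a function from any particular library.

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k):
    """Average the responses of the k training points nearest to x0 (Euclidean
    distance) and classify by the sign of that average (classes coded -1/1)."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.sign(y_train[nearest].mean())

# e.g., predict the class of a new point using the setting 1 sample from above
# knn_predict(X1, y1, np.array([0.5, -0.5]), k=15)
```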

Nearest neighbor results
[Figure: nearest neighbor classification results in settings 1 and 2]

Nearest neighbor remarks
Nearest neighbors seems not to perform terribly well in setting 1, as its classification boundaries are unnecessarily complex and unstable. On the other hand, the method seemed perhaps better than the linear model in setting 2, where a complex and curved boundary seems to fit the data better. Furthermore, the choice of k plays a big role in the fit, and the optimal k might not be the same in settings 1 and 2.

Least squares vs. nearest neighbors
Of course, it is potentially misleading to judge whether a method is better simply because it fits the sample better. What matters is how well its predictions generalize to new samples. Thus, consider generating new data sets of size 10,000 and measuring how well each method does at predicting these 10,000 new observations. By repeatedly simulating sample and prediction data sets, we can estimate the long-run prediction accuracy of each method in each of the two settings.
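A rough sketch of such a simulation loop; sampler, fit, and predict are hypothetical placeholders for a data-generating function (e.g., one of the settings above) and whichever fitting and prediction routines are being compared.

```python
import numpy as np

def long_run_error(sampler, fit, predict, n_train=200, n_test=10_000, reps=50):
    """Estimate long-run misclassification rate: refit on a fresh training sample
    each repetition and evaluate on a large independent test sample."""
    errors = []
    for _ in range(reps):
        X_tr, y_tr = sampler(n_train)
        X_te, y_te = sampler(n_test)
        model = fit(X_tr, y_tr)
        errors.append(np.mean(predict(model, X_te) != y_te))
    return float(np.mean(errors))
```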

Simulation results
[Figure: misclassification rate (roughly 0.24 to 0.32) as a function of k (0 to 150) for setting 1 and setting 2; black line = least squares, blue line = nearest neighbors]

Remarks
In setting 1, linear regression was always better than nearest neighbors. In setting 2, nearest neighbors was usually better than linear regression. However, it wasn't always better: when k was too big or too small, the nearest neighbors method performed poorly. In setting 1, the bigger k was, the better; in setting 2, there was a Goldilocks value of k (about 25) that proved optimal.

The regression function
Let us now develop a little theory in order to provide insight into the reason for our simulation results. The nearest neighbors and least squares methods have the common goal of estimating the regression function $f(x) = E(Y \mid x)$, although they go about it very differently:
Nearest neighbors conditions on those observations closest to x and then estimates the expected value by taking the average
Least squares starts by making a strong assumption about the form of f and then uses all the data to estimate f

Expected prediction error
If our model is asked to predict y given x, we can calculate the prediction error. As we will see later on in the course, prediction error can be measured in many ways; for now, it is most convenient to consider squared error loss, $(y - \hat{f}(x))^2$. A very reasonable criterion by which to judge a model is therefore the expected value of the prediction error; with squared error loss, $\mathrm{EPE} = E\{(Y - \hat{f}(x))^2 \mid x\}$.

The bias-variance decomposition
Theorem: At a given point $x_0$, $\mathrm{EPE} = \sigma^2 + \mathrm{Bias}^2(\hat{f}) + \mathrm{Var}(\hat{f})$, where $\sigma^2$ is the variance of $Y \mid x_0$. Thus, expected prediction error consists of three parts:
Irreducible error; this is beyond our control and would remain even if we were able to estimate f perfectly
Bias (squared); the difference between $E\{\hat{f}(x_0)\}$ and the true value $f(x_0)$, squared
Variance; the variance of the estimate $\hat{f}(x_0)$
The second and third terms make up the mean squared error.
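A sketch of the algebra behind the theorem, in the notation above: add and subtract $f(x_0)$, note that the cross term vanishes because the new observation's noise has mean zero given $x_0$ and is independent of the training sample used to construct $\hat{f}$, and then decompose the remaining mean squared error about $E\hat{f}(x_0)$.

$$
\begin{aligned}
\mathrm{EPE}
 &= E\bigl\{(Y - \hat f(x_0))^2 \mid x_0\bigr\} \\
 &= E\bigl\{(Y - f(x_0))^2 \mid x_0\bigr\}
  + 2\,E\bigl\{(Y - f(x_0))\bigl(f(x_0) - \hat f(x_0)\bigr) \mid x_0\bigr\}
  + E\bigl\{(f(x_0) - \hat f(x_0))^2\bigr\} \\
 &= \sigma^2 + 0 + \bigl[f(x_0) - E\hat f(x_0)\bigr]^2
  + E\bigl\{(\hat f(x_0) - E\hat f(x_0))^2\bigr\} \\
 &= \sigma^2 + \mathrm{Bias}^2(\hat f) + \mathrm{Var}(\hat f).
\end{aligned}
$$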

The bias-variance decomposition: An illustration
An illustration from our textbook (Figure 2.11): [Figure: test and training error as a function of model complexity; training-sample error decreases as complexity increases, while test-sample error is U-shaped; the low-complexity end corresponds to high bias and low variance, the high-complexity end to low bias and high variance]

Limitations of nearest neighbor methods
Our earlier simulation results seem to suggest that if we choose k appropriately, nearest neighbor methods can always do at least roughly as well as linear regression, and sometimes much better. To some extent this holds true, provided that the dimension of x is small. However, the statistical properties of nearest neighbor methods worsen rapidly as the dimension p grows.

The curse of dimensionality
For example, suppose that x follows a uniform distribution on the unit p-cube; how many points will be within a ball of radius 0.2 centered at $x_0$?
When p = 2 and n = 120, we can expect 15 points in the neighborhood of $x_0$
When p = 3, we need 448 observations to get 15 observations in a neighborhood of the same size
When p = 10, we need over 57 million observations
This phenomenon is commonly referred to as the curse of dimensionality, and we will return to it again in the course.
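These counts come from noting that, for a uniform distribution on the unit p-cube, the expected number of points in a ball of radius r is roughly n times the ball's volume (ignoring edge effects at the cube's boundary). A small sketch of the calculation; n_needed is a hypothetical helper:

```python
import math

def n_needed(p, r=0.2, target=15):
    """Expected points in a radius-r ball under a uniform distribution on the
    unit p-cube is about n * volume(ball); solve for the n giving `target`
    expected points (edge effects near the cube's boundary are ignored)."""
    ball_volume = math.pi ** (p / 2) / math.gamma(p / 2 + 1) * r ** p
    return math.ceil(target / ball_volume)

for p in (2, 3, 10):
    print(p, n_needed(p))   # roughly 120, 448, and about 57 million
```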

Conclusions
So where do we stand? Fitting an ordinary linear model is rarely the best we can do. On the other hand, nearest neighbors is rarely stable enough to be used, even in modest dimensions, unless our sample size is very large.

Conclusions (cont'd)
These two methods stand on opposite sides of the methodological spectrum with regard to assumptions and structure. Many of the methods we will discuss in this course involve bridging the gap between these two extremes: making linear regression more flexible, adding structure and stability to nearest neighbor ideas, or combining concepts from both. One of the main themes of this course will be the need to find, develop, and apply methods that bring a mix of flexibility and stability appropriate for the data.