Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9)
Yongdai Kim, Seoul National University

0. Learning vs. statistical learning

Learning procedure:
- Construct a claim by observing data or by using logic.
- Perform experiments.
- Draw a conclusion.

Statistical learning procedure:
- Collect data.
- Analyze the data.
- Find new rules.
- Let the data tell us something.

Why is statistical learning necessary?
- We already know most of the rules that our brains can imagine on their own.
- Life (nature, socio-economic status, human behavior, biology, etc.) is more complex than we had thought.
- Our world is changing too fast for us to keep up using logic alone.
- Due to digitalization, the amount of data is growing very quickly, and most of the information in these huge data sets remains undiscovered.

Sample questions:
- What are the risk factors for heart failure?
- Are there genes that characterize differences between various races?
- How does the stock market behave?
- Which chemical compounds are effective against a specific disease?

- Who are the valuable customers for our company?
- What are the influential factors driving changes in ozone levels?
- Are there patterns in the content of spam emails?

In statistical learning, the common objective is to find causes for a given phenomenon. A common feature of these problems is that the set of possible causes we can think of is very large, so a purely logic-driven learning procedure runs into time limitations unless we are lucky.

Machine learning vs. statistical learning (personal view)

Machine learning is a method for teaching a machine (computer). There are two kinds of tasks:
- tasks without errors (e.g., rule-based learning), and
- tasks with errors.

Statistical learning is the subset of machine learning that deals with tasks involving errors.

Statistical view of statistical learning
- Analysis of ultra-high-dimensional data
- Methods to overcome the curse of dimensionality

Supervised and unsupervised learning

Supervised learning:
- Use the inputs to predict the values of the outputs.
- Examples: regression and classification.

Unsupervised learning:
- Use only the inputs to describe the data.
- Examples: clustering, PCA.

1. Basic set-up of supervised learning
- Input (covariate): $x \in \mathbb{R}^p$.
- Output (response): $y \in \mathcal{Y}$.
- System (model): $y = \phi(x, \epsilon)$.
- Loss function: $l(y, a)$.
- Assumption: $f$ belongs to a family of functions $\mathcal{F}$.
- Learning set (data): $\mathcal{L} = \{(y_i, x_i),\ i = 1, \ldots, n\}$, assumed to be a random sample of $(Y, X) \sim P$.
- Objective: find $f_0 = \arg\min_{f \in \mathcal{F}} E_{(Y,X)}\, l(Y, f(X))$.
- Predictor (estimator): $\hat{f}(x) = f(x, \mathcal{L})$.
- Prediction: for a new input $x$, predict the unknown $y$ by $\hat{f}(x)$.
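The toy sketch below (assuming Python with numpy; the data-generating model, the one-parameter family of lines, and all names are made-up illustrations, not part of the slides) simply mirrors this set-up: a learning set $\mathcal{L}$, a loss $l(y, a)$, and a predictor chosen to minimize the empirical analogue of $E\, l(Y, f(X))$ over a family $\mathcal{F}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Learning set L = {(y_i, x_i)}: a hypothetical system y = phi(x, eps) = 2*x + eps
n = 100
x = rng.uniform(-1, 1, size=n)
y = 2.0 * x + rng.normal(size=n)

def loss(y, a):
    """Squared-error loss l(y, a)."""
    return (y - a) ** 2

def empirical_risk(f, x, y):
    """Sample analogue of E l(Y, f(X)) over the learning set."""
    return np.mean(loss(y, f(x)))

# Toy family F: lines f_b(x) = b * x, indexed by a single parameter b.
candidates = np.linspace(-5, 5, 201)
risks = [empirical_risk(lambda x, b=b: b * x, x, y) for b in candidates]
b_hat = candidates[int(np.argmin(risks))]   # estimator f_hat(x) = b_hat * x

x_new = 0.3
print("prediction at x_new:", b_hat * x_new)
```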

If $y$ is categorical, the problem is called classification; if $y$ is continuous, it is called regression.

2. From least squares to nearest neighbors (for regression)

Least squares
- Assumption: $f(x) \in \{\beta_0 + \sum_{i=1}^p x_i \beta_i\}$.
- Estimate $\beta = (\beta_0, \beta_1, \ldots, \beta_p)$ by the $\hat{\beta}$ that minimizes the residual sum of squares
$$RSS(\beta) = \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{k=1}^p x_{ki}\beta_k\Big)^2.$$
- $f(x, \mathcal{L}) = \hat{\beta}_0 + \sum_{i=1}^p x_i \hat{\beta}_i$ (a small numpy sketch follows).
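As a concrete illustration (a minimal numpy sketch under the assumptions above, not code from the course), the RSS minimizer can be obtained by appending an intercept column and calling np.linalg.lstsq:

```python
import numpy as np

def least_squares_fit(X, y):
    """Return beta_hat = (beta_0, ..., beta_p) minimizing RSS(beta).

    X is an (n, p) matrix of inputs; a column of ones is appended for beta_0.
    """
    X1 = np.column_stack([np.ones(len(X)), X])
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta_hat

def least_squares_predict(beta_hat, X_new):
    """Evaluate f(x, L) = beta_0_hat + sum_i x_i * beta_i_hat at the rows of X_new."""
    X1 = np.column_stack([np.ones(len(X_new)), X_new])
    return X1 @ beta_hat
```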

Nearest neighbors (NN)
- $N_k(x)$: the neighborhood of $x$ defined by the $k$ closest points $x_i$ in the training sample.
- $f(x, \mathcal{L}) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$.
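A minimal implementation of this estimator (a sketch assuming numpy and Euclidean distance, which the slide does not spell out):

```python
import numpy as np

def knn_regress(x0, X_train, y_train, k):
    """k-NN regression estimate f(x0, L): the average of y_i over N_k(x0).

    X_train: (n, p) training inputs, y_train: (n,) responses, x0: (p,) query point.
    """
    dist = np.linalg.norm(X_train - x0, axis=1)   # Euclidean distances to x0
    idx = np.argsort(dist)[:k]                    # indices of the k closest x_i
    return y_train[idx].mean()                    # average their responses
```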

Simulation 1
- Model: $y = x + \epsilon$ with $\epsilon \sim N(0, 1)$.
- Training sample size is 100; the test error is computed on a test sample of size 5000.

Result:

Method   Training error   Test error
Linear   0.8247196        3.395535
1-NN     0.0000000        3.915410
5-NN     0.7080551        3.434624
15-NN    0.8412333        3.400420
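A sketch of how such a comparison can be run (assuming Python with numpy; the input distribution, random seed, and exact settings of the original simulation are not given, so the numbers will not match the table):

```python
import numpy as np

rng = np.random.default_rng(1)

def knn_predict(X_query, X_train, y_train, k):
    """k-NN regression predictions for each entry of a 1-D X_query."""
    preds = np.empty(len(X_query))
    for i, x0 in enumerate(X_query):
        idx = np.argsort(np.abs(X_train - x0))[:k]
        preds[i] = y_train[idx].mean()
    return preds

# Training and test samples from the linear model y = x + eps
n_train, n_test = 100, 5000
x_tr = rng.normal(size=n_train); y_tr = x_tr + rng.normal(size=n_train)
x_te = rng.normal(size=n_test);  y_te = x_te + rng.normal(size=n_test)

# Least squares fit of y = b0 + b1 * x
b1, b0 = np.polyfit(x_tr, y_tr, 1)
print("Linear:", np.mean((y_tr - (b0 + b1 * x_tr)) ** 2),
      np.mean((y_te - (b0 + b1 * x_te)) ** 2))

# k-NN fits for several k: training and test errors
for k in (1, 5, 15):
    print(f"{k}-NN:", np.mean((y_tr - knn_predict(x_tr, x_tr, y_tr, k)) ** 2),
          np.mean((y_te - knn_predict(x_te, x_tr, y_tr, k)) ** 2))
```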

[Figure: fitted curves for Simulation 1. Four panels plotting y against x: linear regression and nearest neighbors with k = 1, 5, 15.]

Simulation 2
- Model: $y = x(1 - x) + \epsilon$ with $\epsilon \sim N(0, 1)$.
- Training sample size is 100; the test error is computed on a test sample of size 5000.

Result:

Method   Training error   Test error
Linear   3.3307623        3.051589
1-NN     0.0000000        1.892876
5-NN     0.9872481        1.387429
15-NN    2.1303585        2.069501

[Figure: fitted curves for Simulation 2. Four panels plotting y against x: linear regression and nearest neighbors with k = 1, 5, 15.]

Comments
- The linear model is best when the true model is linear and worst when the true model is nonlinear.
- NN performs reasonably well regardless of what the true function is.
- The training error is not a good estimate of the test error.
- More complicated models do not always perform better.
- The number of neighbors $k$ controls the complexity of the predictor.

LS vs. NN

                   LS                              NN
Assumption         linear                          none
Data size          small to medium                 large
Interpretation     easy                            almost impossible
Predictability     good when the truth is simple   stable regardless of the truth
Tuning parameter   none                            the size of the neighborhood

3. Statistical decision theory

Regression
- The training sample $\mathcal{L}$ is a random sample from the joint distribution $P(y, x)$.
- Let $l(y, f(x))$ be a loss function for penalizing errors in prediction. The most popular loss function is the squared error loss $l(y, f(x)) = (y - f(x))^2$.
- The expected prediction error of $f$ is defined as
$$EPE(f) = E\big(Y - f(X)\big)^2, \quad \text{where } (Y, X) \sim P(y, x).$$
- Theorem: $f_0(x) = E(Y \mid X = x)$ minimizes $EPE(f)$ (a short derivation is given below). $E(Y \mid X = x)$ is called the regression function.
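A standard justification of the theorem, added here for completeness (it is not on the original slide): conditioning on $X$,
$$EPE(f) = E_X\, E_{Y \mid X}\big[(Y - f(X))^2 \mid X\big],$$
so it suffices to minimize $E[(Y - c)^2 \mid X = x]$ over $c$ for each $x$. Since
$$E[(Y - c)^2 \mid X = x] = \operatorname{Var}(Y \mid X = x) + \big(E(Y \mid X = x) - c\big)^2,$$
the pointwise minimizer is $c = E(Y \mid X = x)$, which gives $f_0(x) = E(Y \mid X = x)$.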

For the NN method, $f$ is estimated by
$$\hat{f}(x) = \mathrm{Ave}\big(y_i \mid x_i \in N_k(x)\big).$$
Two approximations are involved:
- the expectation is approximated by averaging over sample data;
- conditioning at a point is relaxed to conditioning on some region close to the target point.

Theorem: Under regularity conditions, $\hat{f}(x) \to f_0(x)$ for all $x \in \mathbb{R}^p$ as $n \to \infty$, $k \to \infty$ and $k/n \to 0$. The condition $k/n \to 0$ means that the model complexity should grow more slowly than the sample size.

For LS, $f$ is assumed to be a linear function:
$$f(x) = \beta_0 + \sum_{i=1}^p x_i \beta_i.$$
The $f$ with $\beta = \big(E(XX^T)\big)^{-1} E(XY)$ minimizes the EPE. The LS estimator replaces these expectations by averages over the training sample.
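Explicitly (this sample version is added for clarity and is not on the original slide), replacing $E(XX^T)$ and $E(XY)$ by their sample averages gives the familiar least-squares estimator:
$$\hat{\beta} = \Big(\tfrac{1}{n}\sum_{i=1}^n x_i x_i^T\Big)^{-1} \Big(\tfrac{1}{n}\sum_{i=1}^n x_i y_i\Big) = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y},$$
where the rows of $\mathbf{X}$ are the $x_i$ (with a leading 1 for the intercept) and $\mathbf{y} = (y_1, \ldots, y_n)^T$.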

Classification
- $y \in \{1, \ldots, J\}$. For a given loss function $l$, the EPE is defined as $E\big(l(Y, f(X))\big)$.
- Since
$$EPE(f) = E_X \sum_{j=1}^J L(j, f(X))\, P(Y = j \mid X),$$
the rule
$$f(x) = \arg\min_{k = 1, \ldots, J} \sum_{j=1}^J L(j, k)\, P(Y = j \mid X = x)$$
minimizes the EPE.
- If $l(y, f(x)) = I(y \neq f(x))$, then $f(x)$ becomes
$$f(x) = \arg\max_{j = 1, \ldots, J} P(Y = j \mid X = x). \quad (1)$$
- This predictor is called the Bayes rule (Bayes classifier), and its EPE is called the Bayes rate.
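The step from the general rule to (1) under 0-1 loss, added here for completeness: with $L(j, k) = I(j \neq k)$,
$$\sum_{j=1}^J I(j \neq k)\, P(Y = j \mid X = x) = 1 - P(Y = k \mid X = x),$$
which is minimized over $k$ by choosing the class with the largest posterior probability, giving (1).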

Estimating the Bayes classifier via function estimation
- First, estimate $\phi_j(x) = P(Y = j \mid X = x)$, and then estimate the Bayes classifier by replacing $P(Y = j \mid X = x)$ with $\hat{\phi}_j(x)$ in (1).
- The NN estimator of $\phi_j$ is
$$\hat{\phi}_j(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i = j).$$
- Linear models are not well suited to estimating $\phi_j$, since $\phi_j$ must take values between 0 and 1. Logistic regression is a promising alternative.
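A minimal sketch of this plug-in classifier (assuming numpy and Euclidean distance; not code from the course):

```python
import numpy as np

def knn_class_probs(x0, X_train, y_train, k, classes):
    """Estimate phi_j(x0) = P(Y = j | X = x0) by the class proportions in N_k(x0)."""
    dist = np.linalg.norm(X_train - x0, axis=1)
    neighbors = y_train[np.argsort(dist)[:k]]
    return np.array([np.mean(neighbors == j) for j in classes])

def knn_classify(x0, X_train, y_train, k):
    """Plug-in Bayes rule: predict the class with the largest estimated phi_j(x0)."""
    classes = np.unique(y_train)
    probs = knn_class_probs(x0, X_train, y_train, k, classes)
    return classes[int(np.argmax(probs))]
```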

4. Curse of dimensionality

When $p$ is large, the concept of a neighborhood does not work for local averaging.

Phenomenon 1
- $X = (X_1, \ldots, X_p) \sim \mathrm{Uniform}[0, 1]^p$.
- Consider a hypercubical neighborhood about a target point that captures a fraction $r$ of the sample. The expected edge length is $e_p(r) = r^{1/p}$.
- $e_{10}(0.01) = 0.63$ and $e_{10}(0.1) = 0.80$: to capture 1% or 10% of the data for a local average, we must cover 63% or 80% of the range of each input variable. Such neighborhoods are no longer local (the quick check below reproduces these numbers).
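A quick numerical check of the edge-length formula (a two-line sketch assuming numpy-free Python would also work; numpy is imported only for consistency with the other sketches):

```python
def edge_length(r, p):
    """Expected edge length of a hypercube capturing a fraction r of Uniform[0,1]^p data."""
    return r ** (1.0 / p)

for r in (0.01, 0.1):
    print(f"e_10({r}) = {edge_length(r, 10):.2f}")   # prints 0.63 and 0.79 (~0.80 on the slide)
```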

Phenomenon 2
- $X = (X_1, \ldots, X_p)$ is uniform in a $p$-dimensional unit ball centered at the origin.
- For a sample of size $n$, let $R_i = \big(\sum_{k=1}^p X_{ki}^2\big)^{1/2}$, $i = 1, \ldots, n$, be the distance of the $i$-th point from the origin, and let $R_{(1)} = \min_i R_i$. Then the median of $R_{(1)}$ is $\big(1 - (1/2)^{1/n}\big)^{1/p}$.
- For $n = 500$ and $p = 10$, the median is approximately 0.52, more than halfway to the boundary (see the numerical check below).
- Most data points are closer to the boundary of the sample space than to the origin. Prediction is much more difficult near the edges, since one must extrapolate rather than interpolate.
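A numerical check of this median (a sketch assuming numpy; the Monte Carlo part is an added sanity check, not part of the slide):

```python
import numpy as np

def median_nearest_distance(n, p):
    """Median distance from the origin to the closest of n uniform points in the unit p-ball."""
    return (1.0 - 0.5 ** (1.0 / n)) ** (1.0 / p)

print(median_nearest_distance(500, 10))   # about 0.52

# Monte Carlo check: sample uniform points in the p-ball and look at the minimum distance.
rng = np.random.default_rng(0)
n, p, reps = 500, 10, 200
mins = []
for _ in range(reps):
    # Uniform in the unit ball: uniform direction (normalized normals) times radius U^(1/p).
    z = rng.normal(size=(n, p))
    u = z / np.linalg.norm(z, axis=1, keepdims=True)
    r = rng.uniform(size=(n, 1)) ** (1.0 / p)
    X = u * r
    mins.append(np.linalg.norm(X, axis=1).min())
print(np.median(mins))                     # should be close to 0.52
```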

Phenomenon 3
- Suppose $X \sim \mathrm{Uniform}[-1, 1]^p$ and the true relation is $Y = f(X) = \exp(-8\|X\|^2)$.
- Consider the 1-NN estimate at $x = 0$. The bias of the estimator is $1 - \exp(-8\|x_{(1)}\|^2)$, where $x_{(1)}$ is the training point with the smallest norm.
- Since $\|X\|^2 = \sum_{i=1}^p X_i^2 \geq X_{(p)}^2$ and $X_{(p)}^2 \to 1$ as $p \to \infty$, the bias tends to increase as $p$ increases.
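A small simulation illustrating this growth of the 1-NN bias with dimension (a sketch assuming numpy; the training sample size of 1000 and the number of replications are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def one_nn_bias_at_origin(p, n=1000, reps=200):
    """Monte Carlo estimate of the 1-NN bias at x = 0 for f(x) = exp(-8 ||x||^2)."""
    biases = []
    for _ in range(reps):
        X = rng.uniform(-1, 1, size=(n, p))
        r2_min = (X ** 2).sum(axis=1).min()        # squared norm of the nearest point to 0
        biases.append(1.0 - np.exp(-8.0 * r2_min))  # bias = f(0) - f(x_(1)) = 1 - exp(-8||x_(1)||^2)
    return np.mean(biases)

for p in (1, 2, 5, 10):
    print(p, round(one_nn_bias_at_origin(p), 3))    # the bias grows toward 1 as p increases
```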

5. Overfitting and the bias-variance tradeoff
- As we have seen, in the NN method the neighborhood size $k$ controls the complexity of the predictor. The question is how to choose $k$.
- If we knew $P(y, x)$, we could choose $k$ by minimizing the EPE (test error)
$$EPE(\hat{f}_k) = E\big(Y - \hat{f}_k(X)\big)^2,$$
where $\hat{f}_k$ is the $k$-NN estimate of $f$. Unfortunately, we do not know $P(y, x)$.
- One naive answer is to estimate the EPE of $\hat{f}_k$ by the residual sum of squares (training error)
$$\sum_{i=1}^n \big(y_i - \hat{f}_k(x_i)\big)^2.$$

The training error is a downward-biased estimator of the test error, since the data set is used twice (once for constructing $\hat{f}$ and once for calculating the training error). Moreover, the training error keeps decreasing as $k$ gets smaller, while the test error decreases initially and then increases. This means that overly complicated models (models fitting the training data too closely, i.e. overfitted models) show poor performance. This seemingly mysterious phenomenon can be explained by the bias-variance decomposition. Several ways of choosing the model complexity (i.e. $k$ in the NN method) will be explained later.

Bias-variance tradeoff (for regression)

Suppose $Y = f(X) + \epsilon$ with $E(\epsilon) = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2$. For a training sample $\mathcal{L}$, the test error of $f(x, \mathcal{L})$ is
$$TE = E_{\mathcal{L}} E_{(Y,X)}\big((Y - f(X, \mathcal{L}))^2\big),$$
which is decomposed as
$$TE = E_{(Y,X)}\big((Y - f(X))^2\big) + E_X\big((f(X) - E_{\mathcal{L}} f(X, \mathcal{L}))^2\big) + E_X\, E_{\mathcal{L}}\big((f(X, \mathcal{L}) - E_{\mathcal{L}} f(X, \mathcal{L}))^2\big) = \sigma^2 + E_X\big(\mathrm{Bias}_{\mathcal{L}}(X)^2 + \mathrm{Variance}_{\mathcal{L}}(X)\big).$$
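A note on why the cross terms vanish (added here; it is not spelled out on the slide). Write
$$Y - f(X, \mathcal{L}) = \big(Y - f(X)\big) + \big(f(X) - E_{\mathcal{L}} f(X, \mathcal{L})\big) + \big(E_{\mathcal{L}} f(X, \mathcal{L}) - f(X, \mathcal{L})\big).$$
The first term is the noise $\epsilon$, which has conditional mean zero given $X$ and is independent of $\mathcal{L}$; and, conditionally on $X$, the second term is a constant while the third has mean zero over $\mathcal{L}$. Hence all cross products have expectation zero and only the three squared terms remain.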

In general, as the model becomes more complicated, the bias decreases and the variance increases.

Example: $k$-NN method
$$f(x, \mathcal{L}) = \frac{1}{k} \sum_{l=1}^k \big(f(x_{(l)}) + \epsilon_{(l)}\big),$$
where the subscript $(l)$ indicates the sequence of nearest neighbors to $x$. Then
$$\mathrm{Bias}_{\mathcal{L}}(x) = f(x) - \frac{1}{k} \sum_{l=1}^k f(x_{(l)}) \quad \text{and} \quad \mathrm{Variance}_{\mathcal{L}}(x) = \frac{\sigma^2}{k}.$$
For $k = 1$ the bias is smallest and the variance is largest, while for $k = n$ the bias is largest and the variance is smallest (the sketch below illustrates this numerically).
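A Monte Carlo illustration of the tradeoff for $k$-NN at a fixed point (a sketch assuming numpy; the true function, the query point, and all settings are illustrative choices, loosely echoing Simulation 2):

```python
import numpy as np

rng = np.random.default_rng(2)

def f_true(x):
    return x * (1.0 - x)             # an illustrative nonlinear truth

def knn_at(x0, x_tr, y_tr, k):
    idx = np.argsort(np.abs(x_tr - x0))[:k]
    return y_tr[idx].mean()

x0, n, sigma, reps = 0.5, 100, 1.0, 2000
for k in (1, 5, 15, 50):
    fits = np.empty(reps)
    for r in range(reps):
        x_tr = rng.uniform(-3, 3, size=n)
        y_tr = f_true(x_tr) + sigma * rng.normal(size=n)
        fits[r] = knn_at(x0, x_tr, y_tr, k)
    bias2 = (fits.mean() - f_true(x0)) ** 2
    var = fits.var()
    # The variance is roughly sigma^2/k plus a small contribution from which neighbors are chosen.
    print(f"k={k:>2}: bias^2={bias2:.4f}  variance={var:.4f}  (sigma^2/k={sigma**2 / k:.4f})")
```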

[Figure: test error and training error versus model complexity. Low complexity corresponds to high bias / low variance, high complexity to low bias / high variance; the training error decreases with complexity while the test error is U-shaped.]

6. Four situations in supervised learning

1. $p$ is small and $\mathcal{F}$ is parametric.
   - Standard regression and classification problems.
   - MLE, least squares, robust estimators, etc.
2. $p$ is large and $\mathcal{F}$ is parametric.
   - Develop efficient methods for small and moderate samples.
   - Variable selection, shrinkage, Bayesian methods, etc.
3. $p$ is small and $\mathcal{F}$ is nonparametric.
   - Nonparametric regression.
   - Kernels, splines, wavelets, mixture models, etc.
4. $p$ is large and $\mathcal{F}$ is nonparametric.
   - The main playground of data mining.
   - Decision trees, projection pursuit, MARS, neural networks, etc.