What is Statistical Learning?

What is Statistical Learning? [Figure: scatterplots of Sales against TV, Radio and Newspaper budgets.] Shown are Sales vs TV, Radio and Newspaper, with a blue linear-regression line fit separately to each. Can we predict Sales using these three? Perhaps we can do better using a model Sales ≈ f(TV, Radio, Newspaper). 1/30

Notation: Here Sales is a response or target that we wish to predict. We generically refer to the response as Y. TV is a feature, or input, or predictor; we name it X_1. Likewise we name Radio as X_2, and so on. We can refer to the input vector collectively as X = (X_1, X_2, X_3)^T. Now we write our model as Y = f(X) + ε, where ε captures measurement errors and other discrepancies. 2/30

What is f(X) good for? With a good f we can make predictions of Y at new points X = x. We can understand which components of X = (X_1, X_2, ..., X_p) are important in explaining Y, and which are irrelevant. E.g. Seniority and Years of Education have a big impact on Income, but Marital Status typically does not. Depending on the complexity of f, we may be able to understand how each component X_j of X affects Y. 3/30

[Figure: scatterplot of y against x, with many y values at each x.] Is there an ideal f(X)? In particular, what is a good value for f(X) at any selected value of X, say X = 4? There can be many Y values at X = 4. A good value is f(4) = E(Y | X = 4), where E(Y | X = 4) means the expected value (average) of Y given X = 4. This ideal f(x) = E(Y | X = x) is called the regression function. 4/30

The regression function f(x): It is also defined for vector X; e.g. f(x) = f(x_1, x_2, x_3) = E(Y | X_1 = x_1, X_2 = x_2, X_3 = x_3). It is the ideal or optimal predictor of Y with regard to mean-squared prediction error: f(x) = E(Y | X = x) is the function that minimizes E[(Y - g(X))^2 | X = x] over all functions g at all points X = x. ε = Y - f(x) is the irreducible error; i.e. even if we knew f(x), we would still make errors in prediction, since at each X = x there is typically a distribution of possible Y values. For any estimate f̂(x) of f(x), we have E[(Y - f̂(X))^2 | X = x] = [f(x) - f̂(x)]^2 + Var(ε), where the first term is the reducible error and the second is the irreducible error. 5/30

How to estimate f: Typically we have few if any data points with X = 4 exactly, so we cannot compute E(Y | X = x)! Relax the definition and let f̂(x) = Ave(Y | X ∈ N(x)), where N(x) is some neighborhood of x. [Figure: scatterplot of y against x with a neighborhood of the target point highlighted.] 6/30
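As a concrete illustration, here is a minimal Python sketch of this local-averaging idea; the simulated data and the neighborhood half-width h are illustrative assumptions, not part of the slides:

```python
import numpy as np

def local_average(x_train, y_train, x0, h=0.5):
    """Estimate f(x0) = E(Y | X = x0) by averaging the y_i whose x_i
    fall in the neighborhood N(x0) = [x0 - h, x0 + h]."""
    in_neighborhood = np.abs(x_train - x0) <= h
    if not in_neighborhood.any():
        return np.nan          # no training points near x0
    return y_train[in_neighborhood].mean()

# Toy data: y = sin(x) plus noise (assumed, for illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(1, 6, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)
print(local_average(x, y, x0=4.0))   # rough estimate of E(Y | X = 4)
```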

Nearest neighbor averaging can be pretty good for small p, i.e. p ≤ 4, and large-ish N. We will discuss smoother versions, such as kernel and spline smoothing, later in the course. Nearest neighbor methods can be lousy when p is large. Reason: the curse of dimensionality; nearest neighbors tend to be far away in high dimensions. We need to average a reasonable fraction of the N values of y_i (e.g. 10%) to bring the variance down, but a 10% neighborhood in high dimensions need no longer be local, so we lose the spirit of estimating E(Y | X = x) by local averaging. 7/30

The curse of dimensionality. [Figure: left panel shows a 10% neighborhood of a point in two dimensions (x1, x2); right panel plots the radius needed against the fraction of the volume captured, for p = 1, 2, 3, 5, 10. The radius required for a 10% neighborhood grows rapidly with p.] 8/30
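A quick way to see the effect, under the simplifying assumption of data uniform on the unit hypercube (an illustrative setup, not the exact geometry in the slide's figure): the edge length of a sub-cube capturing a fraction r of the data is r^(1/p), which approaches 1 as p grows.

```python
# Edge length of a hypercube neighborhood capturing 10% of points
# uniformly distributed on [0, 1]^p: edge = 0.10 ** (1 / p).
for p in (1, 2, 3, 5, 10, 100):
    edge = 0.10 ** (1 / p)
    print(f"p = {p:>3}: edge length of a 10% neighborhood = {edge:.3f}")
# p = 1 needs 10% of the axis; p = 10 already needs ~79% of every axis,
# so the "neighborhood" is no longer local.
```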

Parametric and structured models. The linear model is an important example of a parametric model: f_L(X) = β_0 + β_1 X_1 + β_2 X_2 + ... + β_p X_p. A linear model is specified in terms of p + 1 parameters β_0, β_1, ..., β_p. We estimate the parameters by fitting the model to training data. Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function f(X). 9/30

A linear model f̂_L(X) = β̂_0 + β̂_1 X gives a reasonable fit here. A quadratic model f̂_Q(X) = β̂_0 + β̂_1 X + β̂_2 X^2 fits slightly better. [Figure: the same scatterplot fit with a straight line and with a quadratic curve.] 10/30
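A hedged sketch of this comparison, fitting both models by least squares with numpy on simulated data (the data-generating curve is an assumption for illustration, not the data in the figure):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 6, size=100)
y = np.sin(x) + rng.normal(scale=0.3, size=100)   # unknown "truth" + noise

# Fit f_L(X) = b0 + b1*X and f_Q(X) = b0 + b1*X + b2*X^2 by least squares.
linear = np.polynomial.Polynomial.fit(x, y, deg=1)
quadratic = np.polynomial.Polynomial.fit(x, y, deg=2)

mse = lambda model: np.mean((y - model(x)) ** 2)
print("training MSE, linear:   ", mse(linear))
print("training MSE, quadratic:", mse(quadratic))   # slightly lower
```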

Simulated example. Red points are simulated values of income from the model income = f(education, seniority) + ε, where f is the blue surface. [Figure: Income plotted against Years of Education and Seniority.] 11/30

Linear regression model fit to the simulated data: f̂_L(education, seniority) = β̂_0 + β̂_1 × education + β̂_2 × seniority. [Figure: fitted plane over Years of Education and Seniority.] 12/30

More flexible regression model f̂_S(education, seniority) fit to the simulated data. Here we use a technique called a thin-plate spline to fit a flexible surface. We control the roughness of the fit (Chapter 7). [Figure: smooth fitted surface over Years of Education and Seniority.] 13/30

Even more flexible spline regression model f̂_S(education, seniority) fit to the simulated data. Here the fitted model makes no errors on the training data! Also known as overfitting. [Figure: wiggly fitted surface that interpolates the training points.] 14/30

Some trade-offs: Prediction accuracy versus interpretability: linear models are easy to interpret; thin-plate splines are not. Good fit versus over-fit or under-fit: how do we know when the fit is just right? Parsimony versus black-box: we often prefer a simpler model involving fewer variables over a black-box predictor involving them all. 15/30

[Figure: methods arranged by flexibility (low to high) against interpretability (high to low): Subset Selection and Lasso; Least Squares; Generalized Additive Models and Trees; Bagging and Boosting; Support Vector Machines.] 16/30

Assessing Model Accuracy. Suppose we fit a model f̂(x) to some training data Tr = {x_i, y_i}_1^N, and we wish to see how well it performs. We could compute the average squared prediction error over Tr: MSE_Tr = Ave_{i∈Tr}[y_i - f̂(x_i)]^2. This may be biased toward more overfit models. Instead we should, if possible, compute it using fresh test data Te = {x_i, y_i}_1^M: MSE_Te = Ave_{i∈Te}[y_i - f̂(x_i)]^2. 17/30
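A small sketch of the idea, comparing training and test MSE as model flexibility grows; polynomial degree stands in for flexibility, and the simulated data are an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(n):
    x = rng.uniform(0, 10, size=n)
    return x, np.sin(x) + rng.normal(scale=0.4, size=n)

x_tr, y_tr = simulate(50)    # training set Tr
x_te, y_te = simulate(200)   # fresh test set Te

for degree in (1, 3, 10, 20):
    fit = np.polynomial.Polynomial.fit(x_tr, y_tr, deg=degree)
    mse_tr = np.mean((y_tr - fit(x_tr)) ** 2)
    mse_te = np.mean((y_te - fit(x_te)) ** 2)
    print(f"degree {degree:>2}: MSE_Tr = {mse_tr:.3f}, MSE_Te = {mse_te:.3f}")
# MSE_Tr keeps falling with flexibility; MSE_Te eventually rises (overfitting).
```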

[Figure: simulated data with three fits of different flexibility (left); training and test MSE versus flexibility (right).] Black curve is the truth. Red curve on the right is MSE_Te, grey curve is MSE_Tr. Orange, blue and green curves/squares correspond to fits of different flexibility. 18/30

[Figure: as above, for a second example.] Here the truth is smoother, so the smoother fit and the linear model do really well. 19/30

[Figure: as above, for a third example.] Here the truth is wiggly and the noise is low, so the more flexible fits do the best. 20/30

Bias-Variance Trade-off. Suppose we have fit a model f̂(x) to some training data Tr, and let (x_0, y_0) be a test observation drawn from the population. If the true model is Y = f(X) + ε (with f(x) = E(Y | X = x)), then E[(y_0 - f̂(x_0))^2] = Var(f̂(x_0)) + [Bias(f̂(x_0))]^2 + Var(ε). The expectation averages over the variability of y_0 as well as the variability in Tr. Note that Bias(f̂(x_0)) = E[f̂(x_0)] - f(x_0). Typically, as the flexibility of f̂ increases, its variance increases and its bias decreases. So choosing the flexibility based on average test error amounts to a bias-variance trade-off. 21/30
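A minimal simulation sketch of this decomposition (the data-generating model, the test point x_0 and the number of repetitions are assumptions for illustration): refit the same class of model on many training sets and look at the systematic error and spread of f̂(x_0).

```python
import numpy as np

rng = np.random.default_rng(3)
f = np.sin                      # assumed true regression function f(x) = E(Y | X = x)
sigma = 0.4                     # sd of the irreducible error epsilon
x0 = 2.5                        # test point

def fit_once(degree, n=50):
    """Draw one training set Tr and return the fitted value fhat(x0)."""
    x = rng.uniform(0, 10, size=n)
    y = f(x) + rng.normal(scale=sigma, size=n)
    model = np.polynomial.Polynomial.fit(x, y, deg=degree)
    return model(x0)

for degree in (1, 3, 10):
    fhats = np.array([fit_once(degree) for _ in range(2000)])
    bias2 = (fhats.mean() - f(x0)) ** 2
    var = fhats.var()
    print(f"degree {degree:>2}: Bias^2 = {bias2:.4f}, Var = {var:.4f}, "
          f"Bias^2 + Var + sigma^2 = {bias2 + var + sigma**2:.4f}")
# Expected test MSE at x0 decomposes as Bias^2 + Var + irreducible sigma^2.
```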

Bias-variance trade-off for the three examples. [Figure: three panels, one per example, plotting MSE, squared bias and variance against flexibility.] 22/30

Classification Problems. Here the response variable Y is qualitative; e.g. email is one of C = (spam, ham) (ham = good email), and digit class is one of C = {0, 1, ..., 9}. Our goals are to: Build a classifier C(X) that assigns a class label from C to a future unlabeled observation X. Assess the uncertainty in each classification. Understand the roles of the different predictors among X = (X_1, X_2, ..., X_p). 23/30

Is there an ideal C(X)? Suppose the K elements in C are numbered 1, 2, ..., K. Let p_k(x) = Pr(Y = k | X = x), k = 1, 2, ..., K. These are the conditional class probabilities at x; e.g. see the little barplot at x = 5 in the figure. Then the Bayes optimal classifier at x is C(x) = j if p_j(x) = max{p_1(x), p_2(x), ..., p_K(x)}. [Figure: conditional class probabilities plotted against x, with a barplot at x = 5.] 24/30
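A minimal sketch of the Bayes rule, assuming the true conditional class probabilities p_k(x) were known; the two-class example and the particular probability function are made up for illustration:

```python
import numpy as np

def p_class1(x):
    """Assumed true conditional probability Pr(Y = 1 | X = x) for a toy
    two-class problem; in practice this function is unknown."""
    return 1.0 / (1.0 + np.exp(-(x - 5.0)))

def bayes_classifier(x):
    """Assign the class with the largest conditional probability at x."""
    p1 = p_class1(x)
    probs = np.array([1.0 - p1, p1])     # [p_0(x), p_1(x)]
    return probs.argmax()                # class 0 or class 1

for x in (2.0, 5.0, 6.5):
    print(x, "->", bayes_classifier(x))
```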

[Figure: conditional class probabilities estimated by local averaging.] Nearest-neighbor averaging can be used as before. It also breaks down as the dimension grows. However, the impact on Ĉ(x) is less than on p̂_k(x), k = 1, ..., K. 25/30

Classification: some details. Typically we measure the performance of Ĉ(x) using the misclassification error rate: Err_Te = Ave_{i∈Te} I[y_i ≠ Ĉ(x_i)]. The Bayes classifier (using the true p_k(x)) has the smallest error (in the population). Support vector machines build structured models for C(x). We will also build structured models for representing the p_k(x), e.g. logistic regression and generalized additive models. 26/30
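For completeness, a small sketch of the error-rate computation; the label arrays below are placeholders, not data from the slides:

```python
import numpy as np

def misclassification_rate(y_true, y_pred):
    """Err = Ave I[y_i != C_hat(x_i)] over the supplied observations."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.mean(y_true != y_pred)

# Hypothetical test labels and classifier predictions
print(misclassification_rate([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]))  # 0.2
```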

Example: K-nearest neighbors in two dimensions. [Figure: simulated two-class data in (X_1, X_2).] 27/30

KNN: K = 10. [Figure 2.15: KNN decision boundary for K = 10 on the data from Figure 2.13; see the full caption on the next slide.] 28/30

FIGURE 2.15. The black curve indicates the KNN decision boundary on the data from Figure 2.13, using K = 10. The Bayes decision boundary is shown as a purple dashed line. The KNN and Bayes decision boundaries are very similar. KNN: K = 1 and KNN: K = 100. FIGURE 2.16. A comparison of the KNN decision boundaries (solid black curves) obtained using K = 1 and K = 100 on the data from Figure 2.13. With K = 1, the decision boundary is overly flexible, while with K = 100 it is not sufficiently flexible. The Bayes decision boundary is shown as a purple dashed line. 29/30
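A hedged sketch of fitting KNN at several values of K with scikit-learn; the simulated two-class data below are an assumption standing in for the Figure 2.13 data, which are not reproduced here:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)

# Simulated two-class data in two dimensions (illustrative stand-in)
n = 200
X = rng.normal(size=(n, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.7, size=n) > 0).astype(int)

X_test = rng.normal(size=(1000, 2))
y_test = (X_test[:, 0] + X_test[:, 1]
          + rng.normal(scale=0.7, size=1000) > 0).astype(int)

for k in (1, 10, 100):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    train_err = np.mean(knn.predict(X) != y)
    test_err = np.mean(knn.predict(X_test) != y_test)
    print(f"K = {k:>3}: training error = {train_err:.3f}, "
          f"test error = {test_err:.3f}")
# K = 1 drives the training error to 0 (overly flexible); very large K underfits.
```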

[Figure: training and test error rates for KNN plotted against 1/K.] 30/30