Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeoff


Reading: Chapter 2. STATS 202: Data mining and analysis, September 27, 2017.

Supervised vs. unsupervised learning

In unsupervised learning we seek to understand the relationships between the variables in a data matrix: the rows are samples or observations and the columns are variables or factors. The variables may be quantitative (e.g. weight, height, number of children) or qualitative (e.g. college major, profession, gender).

Our goal is to:

- Find meaningful relationships between the variables: correlation analysis.
- Find low-dimensional representations of the data which make it easy to visualize the variables: PCA, ICA, isomap, locally linear embeddings, etc.
- Find meaningful groupings of the data: clustering.
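To make the three goals concrete, here is a minimal sketch (my own illustration, not from the slides) on a synthetic data matrix, using numpy and scikit-learn; the data and the choice of two components and two clusters are assumptions made for the example.

```python
# A minimal sketch of the three unsupervised goals on a synthetic data
# matrix: rows are samples or observations, columns are variables.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 200 samples, 5 variables, with two latent groups shifted apart.
X = rng.normal(size=(200, 5))
X[100:] += 3.0

# 1. Relationships between the variables: the correlation matrix.
corr = np.corrcoef(X, rowvar=False)

# 2. Low-dimensional representation: project onto 2 principal components.
Z = PCA(n_components=2).fit_transform(X)

# 3. Groupings of the data: k-means clustering with 2 clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(corr.shape, Z.shape, np.bincount(labels))
```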

Supervised vs. unsupervised learning

In supervised learning, there are input variables and output variables: each row of the data matrix records the input variables for one sample, together with an output variable. If the output is quantitative, we say this is a regression problem; if it is qualitative (categorical), we say this is a classification problem.

If X is the vector of inputs for a particular sample, the output variable is modeled by:

Y = f(X) + ε,

where ε is a random error.

- The function f captures the systematic relationship between X and Y; f is fixed and unknown.
- ε represents the unpredictable "noise" in the problem.
- Our goal is to learn the function f, using a set of training samples.
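A small simulation shows what the model means in practice (a sketch with an f and a noise level made up for illustration): the learner never sees f or ε, only the noisy pairs.

```python
# Illustration of the model Y = f(X) + ε: f is fixed but unknown to the
# learner; we only observe noisy training pairs (x_i, y_i).
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """The true systematic relationship (unknown in practice)."""
    return np.sin(2 * x) + 0.5 * x

n = 100
x = rng.uniform(0, 5, size=n)        # inputs
eps = rng.normal(0, 0.3, size=n)     # unpredictable noise ε
y = f(x) + eps                       # observed outputs

# A learner sees only (x, y) and tries to recover f from these samples.
```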

Supervised vs. unsupervised learning

Motivations for learning f in the model Y = f(X) + ε:

- Prediction: useful when the input variables are readily available, but the output variable is not. Example: predict stock prices next month using data from last year.
- Inference: a model for f can help us understand the structure of the data. Which variables influence the output, and which don't? What is the relationship between each variable and the output, e.g. linear or non-linear? Example: what is the influence of genetic variations on the incidence of heart disease?

Parametric and non-parametric methods

Most supervised learning methods fall into one of two classes:

- Parametric methods: we assume that f takes a specific form, for example a linear form: f(X) = β₁X₁ + ... + βₚXₚ, with parameters β₁, ..., βₚ. Using the training data, we try to fit the parameters.
- Non-parametric methods: we don't make any assumptions on the form of f, but we restrict how wiggly or rough the function can be.
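The contrast fits in a few lines of code; a sketch (my own example, with scikit-learn's LinearRegression standing in for a parametric method and KNeighborsRegressor for a non-parametric one):

```python
# Parametric vs. non-parametric fits of the same simulated data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=(100, 1))
y = np.sin(2 * x[:, 0]) + 0.5 * x[:, 0] + rng.normal(0, 0.3, size=100)

# Parametric: assume f(X) = β0 + β1·X and fit the two parameters.
lin = LinearRegression().fit(x, y)

# Non-parametric: no assumed form; k controls how wiggly the fit can be.
knn = KNeighborsRegressor(n_neighbors=10).fit(x, y)

x_grid = np.linspace(0, 5, 9).reshape(-1, 1)
print(lin.predict(x_grid))
print(knn.predict(x_grid))
```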

Parametric vs. non-parametric prediction

[Figures 2.4 and 2.5: fits of Income as a function of Years of Education and Seniority.]

- Parametric methods have a limit of fit quality.
- Non-parametric methods keep improving as we add more data to the fit.
- Parametric methods are often simpler to interpret.

Prediction error

Training data: (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ). Predicted function: f̂, which is based on these data. What is a good choice for f̂?

Our goal in supervised learning is to minimize the prediction error: for a new data point (x₀, y₀) (not used in training) we want the squared error (y₀ − f̂(x₀))² to be small.

Given many test data {(xᵢ, yᵢ); i = 1, ..., m} which were not used to fit the model, a common measure of the quality of f̂ is the test mean squared error (MSE):

MSE_test(f̂) = (1/m) Σᵢ (yᵢ − f̂(xᵢ))², summing over the m test points.

Prediction error

What if we don't have any test data? It is tempting to use instead the training MSE:

MSE_training(f̂) = (1/n) Σᵢ (yᵢ − f̂(xᵢ))², summing over the n training points.

The main challenge of statistical learning is that a low training MSE does not imply a low test MSE.
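A quick numerical sketch of this gap (an assumed setup, in the spirit of the figures that follow): polynomial fits of increasing degree drive the training MSE down monotonically, while the test MSE follows a U shape.

```python
# Training MSE vs. test MSE as flexibility (polynomial degree) grows.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x) + 0.5 * x

x_tr = rng.uniform(0, 5, 50);  y_tr = f(x_tr) + rng.normal(0, 0.5, 50)
x_te = rng.uniform(0, 5, 500); y_te = f(x_te) + rng.normal(0, 0.5, 500)

for degree in [1, 3, 10]:
    coefs = np.polyfit(x_tr, y_tr, degree)          # fit on training data only
    mse_tr = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    print(f"degree {degree:2d}: training MSE {mse_tr:.3f}, test MSE {mse_te:.3f}")
```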

Figure 2.9

[Left panel: simulated data (circles) from a smooth black curve f, with three estimates f̂ overlaid. Right panel: mean squared error as a function of flexibility.]

- The circles are simulated data from the black curve; in this artificial example, we know what f is.
- Three estimates f̂ are shown: (1) linear regression, (2) a very smooth spline, (3) a quite rough spline.
- Red line: test MSE. Gray line: training MSE.

Figure 2.10

[The same setup, but the function f is now almost linear.]

Figure 2.11

[The same setup, with a different f and noise level.] When the noise ε has small variance, the third (rough) method does well.

The bias-variance decomposition

Let x₀ be a fixed test point, with y₀ = f(x₀) + ε₀, and let f̂ be estimated from n training samples (x₁, y₁), ..., (xₙ, yₙ). Let E denote the expectation over y₀ and the training data. Then the expected test MSE at x₀ can be decomposed:

E(y₀ − f̂(x₀))² = Var(f̂(x₀)) + [Bias(f̂(x₀))]² + Var(ε₀).

- Var(ε₀) is the irreducible error.
- Var(f̂(x₀)) = E[f̂(x₀) − E(f̂(x₀))]² is the variance of the estimate: it measures how much the estimate f̂(x₀) changes when we sample new training data.
- [Bias(f̂(x₀))]² = [E(f̂(x₀)) − f(x₀)]² is the squared bias: it measures the deviation of the average prediction E(f̂(x₀)) from the truth f(x₀).
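The decomposition can be checked empirically. A Monte Carlo sketch (my own illustration, with an assumed f and noise level) redraws the training set many times, refits, and estimates the variance and squared bias of f̂(x₀) for a rigid and a flexible method:

```python
# Monte Carlo estimate of Var(f̂(x0)) and Bias(f̂(x0))^2 at a fixed test point.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x) + 0.5 * x
x0, sigma, n, reps = 2.5, 0.5, 50, 2000

for degree in [1, 10]:                       # low vs. high flexibility
    preds = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0, 5, n)
        y = f(x) + rng.normal(0, sigma, n)   # a fresh training sample
        preds[r] = np.polyval(np.polyfit(x, y, degree), x0)
    var = preds.var()                        # variance of the estimate
    bias2 = (preds.mean() - f(x0)) ** 2      # squared bias of the estimate
    print(f"degree {degree:2d}: Var {var:.4f}, Bias^2 {bias2:.4f}, "
          f"irreducible {sigma**2:.4f}")
```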


Implications of the bias-variance decomposition

E(y₀ − f̂(x₀))² = Var(f̂(x₀)) + [Bias(f̂(x₀))]² + Var(ε).

- The expected MSE is never smaller than the irreducible error.
- Both small bias and small variance are needed for a small expected MSE.
- It is easy to find a zero-variance procedure with high bias (predict a constant value) and a very low-bias procedure with high variance (choose an f̂ passing through every training point).
- In practice, the best expected MSE is achieved by incurring some bias to decrease the variance, and vice versa: this is the bias-variance trade-off.
- Rule of thumb: more flexible methods have higher variance and lower bias. Example: a linear fit may not change substantially with new training data, but it has high bias if f is very non-linear.

Figure 2.12

[Three panels, titled "Squiggly f, high noise", "Linear f, high noise", and "Squiggly f, low noise", each plotting MSE, bias, and variance as functions of flexibility.]

Classification problems

In a classification setting, the output takes values in a discrete set. For example, if we are predicting the brand of a car based on a number of variables, the function f takes values in the set {Ford, Toyota, Mercedes-Benz, ...}.

We adopt some additional notation:

- P(X, Y): the joint distribution of (X, Y).
- P(Y | X): the conditional distribution of Y given X.
- ŷᵢ: the prediction for xᵢ.

Loss function for classification

There are many ways to measure the error of a classification prediction. One of the most common is the 0-1 loss: 1(y₀ ≠ ŷ₀).

As with the squared error, we can compute the average test prediction error (called the test error rate under the 0-1 loss) using previously unseen test data {(xᵢ, yᵢ); i = 1, ..., m}:

(1/m) Σᵢ 1(yᵢ ≠ ŷᵢ), summed over the m test points.

Similarly, we can compute the (usually optimistic) training error rate:

(1/n) Σᵢ 1(yᵢ ≠ ŷᵢ), summed over the n training points.
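In code the 0-1 loss is just a mismatch indicator; a tiny sketch with hypothetical labels:

```python
# The test error rate under the 0-1 loss is the fraction of mismatches
# between the predicted and the true classes.
import numpy as np

y_test = np.array(["Ford", "Toyota", "Ford", "Mercedes-Benz", "Toyota"])
y_hat  = np.array(["Ford", "Ford",   "Ford", "Mercedes-Benz", "Toyota"])

test_error_rate = np.mean(y_test != y_hat)   # average of 1(y_i != ŷ_i)
print(test_error_rate)                       # 0.2
```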

Bayes classifier

[Figure 2.13: a simulated two-class example in the (X1, X2) plane.]

In practice, we never know the joint probability P; however, we can assume that it exists. The Bayes classifier assigns:

ŷᵢ = argmax_j P(Y = j | X = xᵢ).

It can be shown that this is the best classifier under the 0-1 loss.
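A sketch of the Bayes rule on a toy problem where the joint distribution is known by construction (an assumption made only for illustration; in practice P is unknown): two equally likely classes, each with a Gaussian distribution of X.

```python
# Bayes classifier when P is known by construction:
# P(Y=0) = P(Y=1) = 1/2, X | Y=0 ~ N(0, 1), X | Y=1 ~ N(2, 1).
import numpy as np

def gaussian_pdf(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def bayes_classify(x):
    # Posteriors up to a common constant: prior times class-conditional density.
    post0 = 0.5 * gaussian_pdf(x, mu=0.0)
    post1 = 0.5 * gaussian_pdf(x, mu=2.0)
    return np.where(post1 > post0, 1, 0)    # argmax over the two classes

x = np.array([-1.0, 0.5, 1.5, 3.0])
print(bayes_classify(x))    # [0 0 1 1]; the decision boundary sits at x = 1
```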