Midwest Big Data Summer School: Machine Learning I: Introduction. Kris De Brabanter


Midwest Big Data Summer School: Machine Learning I: Introduction. Kris De Brabanter (kbrabant@iastate.edu), Iowa State University, Department of Statistics and Department of Computer Science. June 24, 2016

Outline
1 Introduction to Pattern Recognition
2 Bayes classifier and variants
3 k-Nearest Neighbor Classifier
4 Cross-validation: Leave-one-out Cross-validation, v-fold Cross-Validation, Drawbacks & other methods
5 Beyond simple linear regression: Shrinkage methods: Ridge regression, LASSO and variable selection

What is Pattern Recognition? Devroye, Györfi & Lugosi (1996): Pattern recognition or discrimination is about guessing or predicting the unknown nature of an observation, a discrete quantity such as black or white, one or zero, sick or healthy, real or fake.

Two important concepts: Bias and Variance

Bayes classifier and variants

Bayes rule: An intuitive classifier

Key idea: Assign an observation x to class k such that P[Y = k | X = x] is largest.

Problem I: How to determine P[Y = k | X = x]? This probability is usually not known/not readily available in practice.

Bayes rule (theorem):
\[
\underbrace{P[Y=k \mid X=x]}_{\text{posterior probability}}
= \frac{P[Y=k]\,P[X=x \mid Y=k]}{P[X=x]}
= \frac{P[Y=k]\,P[X=x \mid Y=k]}{\sum_{i=1}^{K} P[Y=i]\,P[X=x \mid Y=i]}
\]
or, for continuous random variables X,
\[
P[Y=k \mid X=x] = \frac{P[Y=k]\,f(X=x \mid Y=k)}{\sum_{i=1}^{K} P[Y=i]\,f(X=x \mid Y=i)}
\]

Bayes rule: An intuitive classifier

Since log is a strictly monotonic function and the denominator does not depend on k, assign class k such that the posterior probability is maximized:
\[
\hat{Y}(x) = \operatorname*{argmax}_{k} \log P[Y=k \mid X=x]
= \operatorname*{argmax}_{k} \log\big(P[Y=k]\,f(X=x \mid Y=k)\big)
= \operatorname*{argmax}_{k} \big(\log P[Y=k] + \log f(X=x \mid Y=k)\big)
\]

Problem II: Still 2 unknown quantities, P[Y = k] and f(X = x | Y = k).

Estimate P[Y = k] = n_k / n, and for f(X = x | Y = k), assume a density (usually normal) or estimate it using kernel density estimation (or a histogram).
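
To make the plug-in recipe concrete, here is a minimal sketch (not the lecture's own code; the 1-D simulated data, variable names and normal fits are illustrative assumptions) of a Bayes classifier that estimates P[Y = k] by n_k/n, fits a normal density per class, and takes the argmax of log-prior plus log-density:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Simulated 1-D training data: two classes with different means
x0 = rng.normal(loc=-1.0, scale=1.0, size=60)   # class 0
x1 = rng.normal(loc=+1.5, scale=1.0, size=40)   # class 1
x_train = np.concatenate([x0, x1])
y_train = np.concatenate([np.zeros(60, dtype=int), np.ones(40, dtype=int)])

classes = np.unique(y_train)
priors = {k: np.mean(y_train == k) for k in classes}            # P[Y = k] = n_k / n
params = {k: (x_train[y_train == k].mean(),                     # per-class normal fit
              x_train[y_train == k].std(ddof=1)) for k in classes}

def predict(x):
    """Assign x to the class maximizing log P[Y=k] + log f(x | Y=k)."""
    scores = {k: np.log(priors[k]) + norm.logpdf(x, *params[k]) for k in classes}
    return max(scores, key=scores.get)

print(predict(-2.0), predict(2.0))   # expected: 0 1
```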

Bayes rule: An intuitive classifier

Assume:
f(X = x | Y = k) multivariate normal with equal variance-covariance matrix for each class → LDA
f(X = x | Y = k) multivariate normal with unequal variance-covariance matrix for each class → QDA
feature X_i conditionally independent of every other feature X_j, i ≠ j, given class k → Naive Bayes Classifier
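
All three special cases are available as standard implementations; the sketch below (assuming scikit-learn and simulated 2-D data, not the lecture's example) fits LDA, QDA and Gaussian naive Bayes to the same sample so their different covariance assumptions can be compared:

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
# Two classes in 2-D with different covariance structures
X = np.vstack([rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], 100),
               rng.multivariate_normal([2, 2], [[1.0, -0.4], [-0.4, 0.5]], 100)])
y = np.repeat([0, 1], 100)

for model in (LinearDiscriminantAnalysis(),     # equal covariance matrices
              QuadraticDiscriminantAnalysis(),  # class-specific covariance matrices
              GaussianNB()):                    # conditionally independent features
    model.fit(X, y)
    print(type(model).__name__, "training accuracy:", model.score(X, y))
```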

Example: QDA vs. naive Bayes

[Figure: simulated two-class data in the (X1, X2) plane with fitted decision boundaries; panel (a) QDA, panel (b) Naive Bayes.]

k-Nearest Neighbor Classifier

k-Nearest Neighbor Classifier

[Figure: Illustration of the k-Nearest Neighbor Classifier. Taken from James, Witten, Hastie & Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer, 2013]

Effect of k

[Figure: Effect of k in the Nearest Neighbor Classifier; panels show KNN with K = 1 and K = 100. Taken from James, Witten, Hastie & Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer, 2013]

How to choose k?
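
As a rough illustration of the effect of k (a scikit-learn sketch on simulated data, not the book's figure), k = 1 reproduces the training labels exactly while a large k gives a much smoother, more biased rule:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([2, 2], 1.0, size=(100, 2))])
y = np.repeat([0, 1], 100)

for k in (1, 100):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    # k = 1 interpolates the training data; k = 100 averages over half of it
    print(f"K={k:3d}  training accuracy = {knn.score(X, y):.2f}")
```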

Cross-validation

Leave-one-out Cross-validation: low bias, high variance

[Figure: Leave-one-out Cross-Validation idea. Taken from James, Witten, Hastie & Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer, 2013]

v-fold Cross-Validation: higher bias, low variance

[Figure: v-fold Cross-Validation idea. Taken from James, Witten, Hastie & Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer, 2013]

Cross-validation

Cross-validation gives an ESTIMATE of the prediction error.

[Figure: 10-fold CV error (black), test error (brown) and training error (blue). Taken from James, Witten, Hastie & Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer, 2013]
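
Tying this back to the earlier question of how to choose k: a sketch (scikit-learn utilities and simulated data assumed, not the lecture's example) that estimates the prediction error of the k-nearest neighbor classifier by leave-one-out and 10-fold CV and picks the k with the smallest estimated error:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([1.5, 1.5], 1.0, size=(100, 2))])
y = np.repeat([0, 1], 100)

candidate_k = [1, 5, 15, 51, 101]
for cv, name in [(LeaveOneOut(), "LOO CV"),
                 (KFold(n_splits=10, shuffle=True, random_state=0), "10-fold CV")]:
    # estimated misclassification error = 1 - mean CV accuracy
    errors = [1 - cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=cv).mean()
              for k in candidate_k]
    best = candidate_k[int(np.argmin(errors))]
    print(f"{name}: estimated errors = {np.round(errors, 3)}, chosen k = {best}")
```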

Drawbacks & other methods

Advantages: easy to implement; can be applied to many situations; data-driven; no explicit variance estimate of the errors needed.

Drawbacks: computationally expensive; slow convergence.

There exist other methods, e.g. complexity criteria (AIC, BIC, Mallow's C_p, ...), generalized CV, etc.
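
For comparison, one such complexity criterion can be computed without any resampling; the sketch below (a numpy illustration using the standard Gaussian-likelihood forms of AIC and BIC, not code from the lecture) scores polynomial fits of increasing degree:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = np.linspace(-3, 3, n)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 1.0, n)   # true model: quadratic

for degree in range(1, 7):
    X = np.vander(x, degree + 1)                  # polynomial design matrix
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS fit
    rss = np.sum((y - X @ beta) ** 2)
    p = degree + 1                                # number of fitted coefficients
    aic = n * np.log(rss / n) + 2 * p             # Gaussian log-likelihood up to a constant
    bic = n * np.log(rss / n) + np.log(n) * p
    print(f"degree {degree}: AIC = {aic:7.1f}, BIC = {bic:7.1f}")
```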

Beyond simple linear regression: Shrinkage methods

Shrinkage methods

When p > n, OLS does not have a unique solution. Remember OLS:
\[
(\hat{\beta}_0,\ldots,\hat{\beta}_p) = \operatorname*{argmin}_{\beta_0,\ldots,\beta_p \in \mathbb{R}^{p+1}} \sum_{i=1}^{n}\Big(Y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^{2}.
\]
In matrix form: \(\hat{\beta} = (X^{T}X)^{-1}X^{T}y\), with \(y = (Y_1,\ldots,Y_n)^{T}\), \(\beta = (\beta_0,\ldots,\beta_p)^{T}\) and
\[
X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p}\\ \vdots & \vdots & & \vdots\\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}.
\]
When p > n, X^T X will be (close to) singular and hence not invertible.
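
A quick numerical check of this point (a numpy sketch with an arbitrary simulated design; the sizes n = 20, p = 50 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, 50                       # more predictors than observations
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix with intercept

XtX = X.T @ X
print("rank of X^T X :", np.linalg.matrix_rank(XtX), "out of", XtX.shape[0])
print("X^T X singular:", np.linalg.matrix_rank(XtX) < XtX.shape[0])
# Inverting X^T X here fails (or is numerically meaningless),
# so the OLS formula (X^T X)^{-1} X^T y has no unique solution.
```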

Ridge regression

Solution to the previous problem: make the matrix X^T X invertible by adding a little noise on the main diagonal:
\[
\hat{\beta}^{\mathrm{ridge}} = (X^{T}X + \lambda I_p)^{-1}X^{T}y, \quad \lambda \ge 0.
\]
This corresponds to solving the following penalized residual sum of squares:
\[
\hat{\beta}^{\mathrm{ridge}} = \operatorname*{argmin}_{\beta_0,\ldots,\beta_p} \sum_{i=1}^{n}\Big(Y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\Big)^{2} + \lambda\sum_{j=1}^{p}\beta_j^{2}.
\]
As λ → 0, \(\hat{\beta}^{\mathrm{ridge}} \to \hat{\beta}^{\mathrm{OLS}}\); as λ → ∞, \(\hat{\beta}^{\mathrm{ridge}} \to 0\). Ridge shrinks the column space of X having small variance the most. The optimal λ depends on σ_e², p and β. Find λ via cross-validation.
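
A minimal numpy sketch of the ridge formula above on the same kind of p > n design (illustrative data; for brevity the intercept is penalized along with the other coefficients):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 20, 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.zeros(p + 1); beta_true[:4] = [1.0, 2.0, -1.5, 0.5]
y = X @ beta_true + rng.normal(0, 0.5, n)

def ridge(X, y, lam):
    """Closed-form ridge estimate (X^T X + lambda I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for lam in (1e-6, 0.1, 10.0, 1e6):
    b = ridge(X, y, lam)
    print(f"lambda = {lam:8g}  ||beta_ridge|| = {np.linalg.norm(b):10.3f}")
# As lambda grows the coefficients are shrunk toward zero;
# in practice lambda is chosen by cross-validation (e.g. sklearn.linear_model.RidgeCV).
```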

Why Ridge regression?

Theorem (Existence Theorem, Hoerl & Kennard, 1970): There always exists a λ such that TMSE(\(\hat{\beta}^{\mathrm{ridge}}\)) < TMSE(\(\hat{\beta}^{\mathrm{OLS}}\)).

[Figure: bias^T bias and trace(variance) of ridge regression.]

The existence theorem in practice

[Figure: 10th-order polynomial regression of Y on x; scatter of the data with the OLS fit and the Ridge fit (λ = 0.1).]

Estimated coefficients:

OLS coeff.   Ridge coeff.
 1.1660       1.3379
 4.2225       1.7752
 2.3333       1.5521
-5.3965      -0.9540
-0.8235      -0.3379
 2.5254       0.3483
 0.1503       0.0566
-0.4294      -0.0498
-0.0085      -0.0030
 0.0235       0.0024

LASSO and variable selection (Tibshirani, 1996)
\[
\hat{\beta}^{\mathrm{lasso}} = \operatorname*{argmin}_{\beta_0,\ldots,\beta_p} \sum_{i=1}^{n}\Big(Y_i - \beta_0 - \sum_{j=1}^{p}x_{ij}\beta_j\Big)^{2} + \lambda\sum_{j=1}^{p}|\beta_j|
\]
No closed form, numerical optimization needed (convex problem). Sets certain coefficients β_j exactly equal to zero. λ can be determined via cross-validation.
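
A sketch of the lasso in action (scikit-learn's LassoCV assumed; the sparse simulated data are illustrative), showing that some coefficients are set exactly to zero and that λ, called alpha in scikit-learn, is chosen by cross-validation:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(7)
n, p = 100, 30
X = rng.normal(size=(n, p))
beta_true = np.zeros(p); beta_true[:3] = [2.0, -1.5, 1.0]   # only 3 relevant predictors
y = X @ beta_true + rng.normal(0, 0.5, n)

lasso = LassoCV(cv=10).fit(X, y)          # lambda (alpha) chosen by 10-fold CV
print("chosen alpha:", round(lasso.alpha_, 4))
print("non-zero coefficients:", np.sum(lasso.coef_ != 0), "out of", p)
print("first five coefficients:", np.round(lasso.coef_[:5], 3))
```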

References

Bühlmann P. & van de Geer S. (2011), Statistics for High-Dimensional Data: Methods, Theory and Applications, Springer
Devroye L., Györfi L. & Lugosi G. (1996), A Probabilistic Theory of Pattern Recognition, Springer
James G., Witten D., Hastie T. & Tibshirani R. (2013), An Introduction to Statistical Learning with Applications in R, Springer
Hastie T., Tibshirani R. & Friedman J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd Ed.), Springer
Hoerl A.E. & Kennard R. (1970), Ridge regression: biased estimation for nonorthogonal problems, Technometrics, 12: 55-67
Györfi L., Kohler M., Krzyżak A. & Walk H. (2002), A Distribution-Free Theory of Nonparametric Regression, Springer
Hampel F.R., Ronchetti E.M., Rousseeuw P.J. & Stahel W.A. (1986), Robust Statistics: The Approach Based on Influence Functions, John Wiley & Sons
Silverman B.W. (1986), Density Estimation for Statistics and Data Analysis, Chapman & Hall
Simonoff J.S. (1996), Smoothing Methods in Statistics, Springer
Tibshirani R. (1996), Regression shrinkage and selection via the lasso, J. Royal Statist. Soc. B, 58(1): 267-288