Midwest Big Data Summer School: Machine Learning I: Introduction
Kris De Brabanter (kbrabant@iastate.edu)
Iowa State University, Department of Statistics, Department of Computer Science
June 24, 2016
Outline
1 Introduction to Pattern Recognition
2 Bayes classifier and variants
3 k-nearest Neighbor Classifier
4 Cross-validation
  Leave-one-out Cross-validation
  v-fold Cross-Validation
  Drawbacks & other methods
5 Beyond simple linear regression: Shrinkage methods
  Ridge regression
  LASSO and variable selection
What is Pattern Recognition?

Devroye, Györfi & Lugosi (1996): Pattern recognition or discrimination is about guessing or predicting the unknown nature of an observation, a discrete quantity such as black or white, one or zero, sick or healthy, real or fake.
Two important concepts: Bias and Variance
Bayes classifier and variants
Bayes rule: An intuitive classifier

Key idea: Assign an observation x to class k such that P[Y = k | X = x] is largest.

Problem I: How to determine P[Y = k | X = x]? This probability is usually not known/not readily available in practice.

Bayes rule (theorem):

P[Y = k | X = x]   (the posterior probability)
    = P[Y = k] P[X = x | Y = k] / P[X = x]
    = P[Y = k] P[X = x | Y = k] / Σ_{i=1}^{K} P[Y = i] P[X = x | Y = i]

or, for a continuous random variable X with class-conditional density f,

P[Y = k | X = x] = P[Y = k] f(X = x | Y = k) / Σ_{i=1}^{K} P[Y = i] f(X = x | Y = i)
Bayes rule: An intuitive classifier

Since log is a strictly monotonic function and the denominator does not depend on k, assign the class k such that the posterior probability is maximized:

Ŷ(x) = argmax_k log P[Y = k | X = x]
     = argmax_k log( P[Y = k] f(X = x | Y = k) )
     = argmax_k [ log P[Y = k] + log f(X = x | Y = k) ]

Problem II: Still two unknown quantities: P[Y = k] and f(X = x | Y = k).

Estimate P̂[Y = k] = n_k / n and, for f(X = x | Y = k), either assume a density (usually normal) or estimate it using kernel density estimation (or a histogram).
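The plug-in rule above fits in a few lines of code. The sketch below is illustrative only (all names are my own): it estimates P[Y = k] by n_k/n, models f(x | Y = k) with an independent Gaussian density per feature, and classifies by the maximum log posterior.

```python
import numpy as np

def fit_plugin_classifier(X, y):
    """Estimate P[Y=k] = n_k/n and a per-class Gaussian density for f(x | Y=k)."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (len(Xk) / len(X),        # prior estimate n_k / n
                     Xk.mean(axis=0),          # class mean
                     Xk.var(axis=0) + 1e-9)    # class variance (regularized)
    return params

def predict(params, x):
    """Assign the class maximizing log P[Y=k] + log f(x | Y=k)."""
    def log_post(prior, mu, var):
        # log of an independent-Gaussian density (one variance per feature)
        return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var)
                                            + (x - mu) ** 2 / var)
    return max(params, key=lambda k: log_post(*params[k]))
```

Because the features are modeled independently given the class, this particular density choice is already a naive Bayes classifier; swapping in another density estimate changes the variant but not the decision rule.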
Bayes rule: An intuitive classifier

Assume:
- f(X = x | Y = k) multivariate normal with an equal variance-covariance matrix for each class → LDA
- f(X = x | Y = k) multivariate normal with unequal variance-covariance matrices across classes → QDA
- each feature X_i conditionally independent of every other feature X_j (i ≠ j) given the class → Naive Bayes Classifier
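The LDA variant can be sketched directly from the shared-covariance assumption: pool the within-class scatter into one covariance estimate and score each class with the linear discriminant δ_k(x) = xᵀΣ⁻¹μ_k − μ_kᵀΣ⁻¹μ_k/2 + log π_k. A minimal sketch (function names are my own, not a reference implementation):

```python
import numpy as np

def lda_fit(X, y):
    """LDA: Gaussian class densities with one shared (pooled) covariance matrix."""
    classes = np.unique(y)
    n, p = X.shape
    priors, means = {}, {}
    Sigma = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        priors[k] = len(Xk) / n
        means[k] = Xk.mean(axis=0)
        Sigma += (Xk - means[k]).T @ (Xk - means[k])  # within-class scatter
    Sigma /= (n - len(classes))                        # pooled covariance
    return priors, means, np.linalg.inv(Sigma)

def lda_predict(model, x):
    """Pick the class with the largest linear discriminant delta_k(x)."""
    priors, means, Sinv = model
    def delta(k):
        mu = means[k]
        return x @ Sinv @ mu - 0.5 * mu @ Sinv @ mu + np.log(priors[k])
    return max(priors, key=delta)
```

QDA follows the same pattern with one covariance (and one extra log-determinant term) per class, which makes the discriminant quadratic in x.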
Example: QDA vs. naive Bayes

[Figure: simulated two-class data in the (X1, X2) plane; decision boundaries of (a) QDA and (b) Naive Bayes.]
k-nearest Neighbor Classifier
k-nearest Neighbor Classifier

[Figure: Illustration of the k-nearest Neighbor Classifier. Taken from James, Witten, Hastie & Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer, 2013]
Effect of k

[Figure: KNN decision boundaries for K = 1 and K = 100. Taken from James, Witten, Hastie & Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer, 2013]

How to choose k?
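The classifier itself is simple enough to state as code. A minimal sketch (illustrative names; Euclidean distance and majority vote assumed, with no tie-breaking rule):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))   # Euclidean distances
    nearest = y_train[np.argsort(dists)[:k]]            # labels of k nearest
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]                    # most frequent label
```

Small k (e.g. K = 1) gives a flexible, low-bias/high-variance boundary; large k (e.g. K = 100) gives a smooth, high-bias/low-variance one, which is exactly the trade-off the figure illustrates.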
Cross-validation
Leave-one-out Cross-validation

Low bias, high variance

[Figure: Leave-one-out Cross-Validation idea. Taken from James, Witten, Hastie & Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer, 2013]
v-fold Cross-Validation

Higher bias, low variance

[Figure: v-fold Cross-Validation idea. Taken from James, Witten, Hastie & Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer, 2013]
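The v-fold procedure can be sketched generically: shuffle the indices, split them into v folds, and average the held-out misclassification rate. The sketch below is illustrative (names and the `fit_predict` callback interface are my own):

```python
import numpy as np

def vfold_cv_error(X, y, fit_predict, v=10, seed=0):
    """Estimate the prediction error by v-fold cross-validation.

    fit_predict(X_tr, y_tr, X_te) must return class predictions for X_te.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), v)  # v disjoint folds
    errs = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
        y_hat = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        errs.append(np.mean(y_hat != y[test_idx]))      # held-out error
    return np.mean(errs)
```

Choosing k for the nearest-neighbor classifier then amounts to calling this for each candidate k and keeping the minimizer; leave-one-out is the special case v = n.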
Cross-validation

Cross-validation gives an ESTIMATE of the prediction error.

[Figure: 10-fold CV error (black), test error (brown) and training error (blue). Taken from James, Witten, Hastie & Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer, 2013]
Drawbacks & other methods

Advantages:
- Easy to implement
- Can be applied in many situations
- Data-driven
- No explicit variance estimate of the errors needed

Drawbacks:
- Computationally expensive
- Slow convergence

Other methods exist, e.g. complexity criteria (AIC, BIC, Mallows' C_p, ...), generalized CV, etc.
Beyond simple linear regression: Shrinkage methods
Shrinkage methods

When p > n, OLS does not have a unique solution. Recall OLS:

(β̂_0, ..., β̂_p) = argmin_{β_0,...,β_p ∈ R^{p+1}} Σ_{i=1}^{n} ( Y_i − β_0 − Σ_{j=1}^{p} β_j x_{ij} )².

In matrix form: β̂ = (XᵀX)⁻¹ Xᵀ y, with y = (Y_1, ..., Y_n)ᵀ, β = (β_0, ..., β_p)ᵀ and

X = ( 1  x_{11}  ⋯  x_{1p}
      ⋮    ⋮          ⋮
      1  x_{n1}  ⋯  x_{np} )

When p > n, XᵀX will be (close to) singular and hence not invertible.
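The singularity is easy to see numerically: XᵀX is (p+1) × (p+1) but its rank can be at most n. A small demonstration with simulated data (dimensions chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# n = 5 observations, p = 10 predictors plus an intercept column
X = np.column_stack([np.ones(5), rng.standard_normal((5, 10))])

# X^T X is 11 x 11 but has rank at most n = 5, so it is singular
gram = X.T @ X
rank = np.linalg.matrix_rank(gram)
print(gram.shape, rank)   # rank stays below the 11 needed for invertibility
```

Any β in the null space of X can be added to a solution without changing the fit, which is why OLS is not unique here.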
Ridge regression

Solution to the previous problem: make XᵀX invertible by adding a small positive constant to the main diagonal:

β̂_ridge = (XᵀX + λ I_p)⁻¹ Xᵀ y,  λ ≥ 0.

This corresponds to solving the following penalized residual sum of squares:

β̂_ridge = argmin_{β_0,...,β_p} Σ_{i=1}^{n} ( Y_i − β_0 − Σ_{j=1}^{p} x_{ij} β_j )² + λ Σ_{j=1}^{p} β_j².

- As λ → 0, β̂_ridge → β̂_OLS
- As λ → ∞, β̂_ridge → 0
- Shrinks the directions of the column space of X having small variance the most
- The optimal λ depends on σ²_e, p and β
- Find λ via cross-validation
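The closed form translates directly to code. A minimal sketch (illustrative; for simplicity it penalizes every coefficient, so an intercept, if wanted, should be handled by centering or included knowingly):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (X^T X + lam * I) beta = X^T y."""
    p = X.shape[1]
    # np.linalg.solve is preferred over forming the explicit inverse
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

For λ > 0 the system is always solvable even when p > n, and as λ → 0 it reproduces OLS whenever XᵀX is invertible.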
Why Ridge regression?

Theorem (Existence Theorem; Hoerl & Kennard, 1970). There always exists a λ > 0 such that TMSE(β̂_ridge) < TMSE(β̂_OLS).

[Figure: biasᵀbias and trace(variance) of ridge regression]
The existence theorem in practice

10th order polynomial regression, fit by OLS and by Ridge with λ = 0.1:

[Figure: data on x ∈ [−3, 3] with the OLS and Ridge fits.]

OLS coeff. | Ridge coeff.
  1.1660   |  1.3379
  4.2225   |  1.7752
  2.3333   |  1.5521
 −5.3965   | −0.9540
 −0.8235   | −0.3379
  2.5254   |  0.3483
  0.1503   |  0.0566
 −0.4294   | −0.0498
 −0.0085   | −0.0030
  0.0235   |  0.0024
LASSO and variable selection (Tibshirani, 1996)

β̂_lasso = argmin_{β_0,...,β_p} Σ_{i=1}^{n} ( Y_i − β_0 − Σ_{j=1}^{p} x_{ij} β_j )² + λ Σ_{j=1}^{p} |β_j|

- No closed form; numerical optimization needed (convex problem)
- Sets certain coefficients β_j exactly equal to zero → variable selection
- λ can be determined via cross-validation
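One standard way to do the numerical optimization is coordinate descent: cycle over the coefficients, and update each by soft-thresholding its partial-residual correlation, which is where the exact zeros come from. The sketch below is illustrative (no intercept, fixed iteration count, names my own), not a production solver:

```python
import numpy as np

def soft_threshold(rho, t):
    """Soft-thresholding operator: the scalar lasso solution."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for min_b ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: remove all predictors except column j
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r
            beta[j] = soft_threshold(rho, lam / 2) / (X[:, j] @ X[:, j])
    return beta
```

With orthonormal columns each update is exact in one pass, and coefficients whose correlation with the residual falls below λ/2 are set exactly to zero, i.e. dropped from the model.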
References

Bühlmann P. & van de Geer S. (2011), Statistics for High-Dimensional Data: Methods, Theory and Applications, Springer
Devroye L., Györfi L. & Lugosi G. (1996), A Probabilistic Theory of Pattern Recognition, Springer
James G., Witten D., Hastie T. & Tibshirani R. (2013), An Introduction to Statistical Learning with Applications in R, Springer
Hastie T., Tibshirani R. & Friedman J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd Ed.), Springer
Hoerl A.E. & Kennard R.W. (1970), Ridge regression: biased estimation for nonorthogonal problems, Technometrics, 12(1): 55-67
Györfi L., Kohler M., Krzyżak A. & Walk H. (2002), A Distribution-Free Theory of Nonparametric Regression, Springer
Hampel F.R., Ronchetti E.M., Rousseeuw P.J. & Stahel W.A. (1986), Robust Statistics: The Approach Based on Influence Functions, John Wiley & Sons
Silverman B.W. (1986), Density Estimation for Statistics and Data Analysis, Chapman & Hall
Simonoff J.S. (1996), Smoothing Methods in Statistics, Springer
Tibshirani R. (1996), Regression shrinkage and selection via the lasso, J. Royal Statist. Soc. B, 58(1): 267-288