Midwest Big Data Summer School: Machine Learning I: Introduction. Kris De Brabanter

Size: px

Start display at page:

Download "Midwest Big Data Summer School: Machine Learning I: Introduction. Kris De Brabanter"

Debra Sparks
5 years ago
Views:

1 Midwest Big Data Summer Schl: Machine Learning I: Intrductin Kris De Brabanter Iwa State University Department f Statistics Department f Cmputer Science June 24, /24

2 Outline 1 Intrductin t Pattern Recgnitin 2 Bayes classifier and variants 3 k-nearest Neighbr Classifier 4 Crss-validatin Leave-ne-ut Crss-validatin v-fld Crss-Validatin Drawbacks & ther methds 5 Beynd simple linear regressin: Shrinkage methds Ridge regressin LASSO and variable selectin 2/24

3 What is Pattern Recgnitin? Devrye, Györfy & Lugsi (1996): Pattern recgnitin r discriminatin is abut guessing r predicting the unknwn nature f an bservatin, a discrete quantity such as black r white, ne r zer, sick r healthy, real r fake. 3/24

4 Tw imprtant cncepts: Bias and Variance 4/24

5 Bayes classifier and variants 1 Intrductin t Pattern Recgnitin 2 Bayes classifier and variants 3 k-nearest Neighbr Classifier 4 Crss-validatin Leave-ne-ut Crss-validatin v-fld Crss-Validatin Drawbacks & ther methds 5 Beynd simple linear regressin: Shrinkage methds Ridge regressin LASSO and variable selectin 5/24

6 Bayes rule: An intuitive classifier Key idea: Assign an bservatin x t class k such that P[Y = k X = x] is largest. 6/24

7 Bayes rule: An intuitive classifier Key idea: Assign an bservatin x t class k such that P[Y = k X = x] is largest. Prblem I: Hw t determine P[Y = k X = x]? 6/24

8 Bayes rule: An intuitive classifier Key idea: Assign an bservatin x t class k such that P[Y = k X = x] is largest. Prblem I: Hw t determine P[Y = k X = x]? This prbability is usually nt knwn/nt readily available in practice. 6/24

9 Bayes rule: An intuitive classifier Key idea: Assign an bservatin x t class k such that P[Y = k X = x] is largest. Prblem I: Hw t determine P[Y = k X = x]? This prbability is usually nt knwn/nt readily available in practice. Bayes rule (therem): P[Y = k X = x] }{{} psterir prbability = = P[Y = k]p[x = x Y = k] P[X = x] P[Y = k]p[x = x Y = k] K i=1p[y = i]p[x = x Y = i] 6/24

10 Bayes rule: An intuitive classifier Key idea: Assign an bservatin x t class k such that P[Y = k X = x] is largest. Prblem I: Hw t determine P[Y = k X = x]? This prbability is usually nt knwn/nt readily available in practice. Bayes rule (therem): P[Y = k X = x] }{{} psterir prbability r fr cntinuus randm variables X = = P[Y = k X = x] = P[Y = k]p[x = x Y = k] P[X = x] P[Y = k]p[x = x Y = k] K i=1p[y = i]p[x = x Y = i] P[Y = k]f(x = x Y = k) K i=1 P[Y = i]f(x = x Y = i) 6/24

11 Bayes rule: An intuitive classifier Since lg is a strictly mntnic functin and the denminatr des nt depend n k, assign class k s.t. the psterir prbability is maximized 7/24

12 Bayes rule: An intuitive classifier Since lg is a strictly mntnic functin and the denminatr des nt depend n k, assign class k s.t. the psterir prbability is maximized Ŷ(x) = argmaxlgp[y = k X = x] k = argmaxlg[p[y = k]f(x = x Y = k)] k = argmax[lgp[y = k]+lgf(x = x Y = k)] k 7/24

13 Bayes rule: An intuitive classifier Since lg is a strictly mntnic functin and the denminatr des nt depend n k, assign class k s.t. the psterir prbability is maximized Ŷ(x) = argmaxlgp[y = k X = x] k = argmaxlg[p[y = k]f(x = x Y = k)] k = argmax[lgp[y = k]+lgf(x = x Y = k)] k Prblem II: Still 2 unknwn quantities (P[Y = k] and f(x = x Y = k)) 7/24

14 Bayes rule: An intuitive classifier Since lg is a strictly mntnic functin and the denminatr des nt depend n k, assign class k s.t. the psterir prbability is maximized Ŷ(x) = argmaxlgp[y = k X = x] k = argmaxlg[p[y = k]f(x = x Y = k)] k = argmax[lgp[y = k]+lgf(x = x Y = k)] k Prblem II: Still 2 unknwn quantities (P[Y = k] and f(x = x Y = k)) P[Y = k] = n k n and fr f(x = x Y = k), assume a density (usually nrmal) r estimate using kernel density estimatin (r histgram) 7/24

15 Bayes rule: An intuitive classifier Assume f(x = x Y = k) multivariate nrmal with equal variance-cvariance matrix fr each class LDA 8/24

16 Bayes rule: An intuitive classifier Assume f(x = x Y = k) multivariate nrmal with equal variance-cvariance matrix fr each class LDA f(x = x Y = k) multivariate nrmal with unequal variance-cvariance matrix fr each class QDA 8/24

17 Bayes rule: An intuitive classifier Assume f(x = x Y = k) multivariate nrmal with equal variance-cvariance matrix fr each class LDA f(x = x Y = k) multivariate nrmal with unequal variance-cvariance matrix fr each class QDA feature X i is cnditinally independent f every ther feature X j fr i j given class k Naive Bayes Classifier 8/24

18 Example: QDA vs. naive Bayes X X X1 X1 (a) QDA (b) Naive Bayes 9/24

19 k-nearest Neighbr Classifier 1 Intrductin t Pattern Recgnitin 2 Bayes classifier and variants 3 k-nearest Neighbr Classifier 4 Crss-validatin Leave-ne-ut Crss-validatin v-fld Crss-Validatin Drawbacks & ther methds 5 Beynd simple linear regressin: Shrinkage methds Ridge regressin LASSO and variable selectin 10/24

20 k-nearest Neighbr Classifier Figure: Illustratin f k-nearest Neighbr Classifier. Taken frm Gareth, Witten, Hastie & Tibshirani, An Intrductin t Statistical Learning with Applicatins in R, Springer, /24

21 Effect f k KNN: K=1 KNN: K=100 Figure: Effect f k in the Nearest Neighbr Classifier. Taken frm Gareth, Witten, Hastie & Tibshirani, An Intrductin t Statistical Learning with Applicatins in R, Springer, /24

22 Effect f k KNN: K=1 KNN: K=100 Figure: Effect f k in the Nearest Neighbr Classifier. Taken frm Gareth, Witten, Hastie & Tibshirani, An Intrductin t Statistical Learning with Applicatins in R, Springer, 2013 Hw t chse k? 12/24

23 Crss-validatin 1 Intrductin t Pattern Recgnitin 2 Bayes classifier and variants 3 k-nearest Neighbr Classifier 4 Crss-validatin Leave-ne-ut Crss-validatin v-fld Crss-Validatin Drawbacks & ther methds 5 Beynd simple linear regressin: Shrinkage methds Ridge regressin LASSO and variable selectin 13/24

24 Leave-ne-ut Crss-validatin Figure: Leave-ne-ut Crss-Validatin idea. Taken frm Gareth, Witten, Hastie & Tibshirani, An Intrductin t Statistical Learning with Applicatins in R, Springer, /24

25 Leave-ne-ut Crss-validatin Lw bias, high variance Figure: Leave-ne-ut Crss-Validatin idea. Taken frm Gareth, Witten, Hastie & Tibshirani, An Intrductin t Statistical Learning with Applicatins in R, Springer, /24

26 v-fld Crss-Validatin Figure: v-fld Crss-Validatin idea. Taken frm Gareth, Witten, Hastie & Tibshirani, An Intrductin t Statistical Learning with Applicatins in R, Springer, /24

27 v-fld Crss-Validatin higher bias, lw variance Figure: v-fld Crss-Validatin idea. Taken frm Gareth, Witten, Hastie & Tibshirani, An Intrductin t Statistical Learning with Applicatins in R, Springer, /24

28 Crss-validatin Crss-validatin gives an ESTIMATE f the predictin errr 16/24

29 Crss-validatin Crss-validatin gives an ESTIMATE f the predictin errr Figure: 10-fld CV errr (black), test errr (brwn) and training errr (blue). Taken frm Gareth, Witten, Hastie & Tibshirani, An Intrductin t Statistical Learning with Applicatins in R, Springer, /24

30 Drawbacks & ther methds Easy t implement Can be applied t many situatins Data-driven N explicit variance estimate f the errrs needed 17/24

31 Drawbacks & ther methds Easy t implement Can be applied t many situatins Data-driven N explicit variance estimate f the errrs needed Cmputatinally expensive slw cnvergence 17/24

32 Drawbacks & ther methds Easy t implement Can be applied t many situatins Data-driven N explicit variance estimate f the errrs needed Cmputatinally expensive slw cnvergence There exist ther methds e.g. cmplexity criteria (AIC, BIC, Mallw s C p,...), generalized CV, etc. 17/24

33 Beynd simple linear regressin: Shrinkage methds 1 Intrductin t Pattern Recgnitin 2 Bayes classifier and variants 3 k-nearest Neighbr Classifier 4 Crss-validatin Leave-ne-ut Crss-validatin v-fld Crss-Validatin Drawbacks & ther methds 5 Beynd simple linear regressin: Shrinkage methds Ridge regressin LASSO and variable selectin 18/24

34 Shrinkage methds When p > n, OLS des nt have a unique slutin. Remember OLS (ˆβ 0,..., ˆβ p ) = argmin β 0,...,β p R p+1 n Y i β 0 i=1 p β i x ij j= /24

35 Shrinkage methds When p > n, OLS des nt have a unique slutin. Remember OLS (ˆβ 0,..., ˆβ p ) = argmin β 0,...,β p R p+1 In matrix frm: β = (X T X) 1 X T y n Y i β 0 i=1 p β i x ij j=1 y = (Y 1,...,Y n ) T,β = (β 0,...,β p ) T 2. and 1 x 11 x 1p X =... 1 x n1 x np 19/24

36 Shrinkage methds When p > n, OLS des nt have a unique slutin. Remember OLS (ˆβ 0,..., ˆβ p ) = argmin β 0,...,β p R p+1 In matrix frm: β = (X T X) 1 X T y n Y i β 0 i=1 p β i x ij j=1 y = (Y 1,...,Y n ) T,β = (β 0,...,β p ) T 2. and 1 x 11 x 1p X =... 1 x n1 x np X T X will be (clse t) singular and hence nt invertible 19/24

37 Ridge regressin Slutin t previus prblem: make matrix X T X invertible by adding a little nise n the main diagnal 20/24

38 Ridge regressin Slutin t previus prblem: make matrix X T X invertible by adding a little nise n the main diagnal ˆβ ridge = (X T X+λI p ) 1 X T y, λ 0. 20/24

39 Ridge regressin Slutin t previus prblem: make matrix X T X invertible by adding a little nise n the main diagnal ˆβ ridge = (X T X+λI p ) 1 X T y, λ 0. This crrespnds t slving the fllwing penalized residual sums f squares n ( p ) 2 p ˆβ ridge = argmin Y β 0,...,β p i β 0 x ij β j +λ βj 2. i=1 j=1 j=1 20/24

40 Ridge regressin Slutin t previus prblem: make matrix X T X invertible by adding a little nise n the main diagnal ˆβ ridge = (X T X+λI p ) 1 X T y, λ 0. This crrespnds t slving the fllwing penalized residual sums f squares n ( p ) 2 p ˆβ ridge = argmin Y β 0,...,β p i β 0 x ij β j +λ βj 2. i=1 j=1 j=1 as λ 0, ˆβ ridge ˆβ OLS as λ, ˆβ ridge 0 shrinks the clumn space f X having small variance the mst λ depends n σ 2 e,p and β 20/24

41 Ridge regressin Slutin t previus prblem: make matrix X T X invertible by adding a little nise n the main diagnal ˆβ ridge = (X T X+λI p ) 1 X T y, λ 0. This crrespnds t slving the fllwing penalized residual sums f squares n ( p ) 2 p ˆβ ridge = argmin Y β 0,...,β p i β 0 x ij β j +λ βj 2. i=1 j=1 j=1 as λ 0, ˆβ ridge ˆβ OLS as λ, ˆβ ridge 0 shrinks the clumn space f X having small variance the mst λ depends n σ 2 e,p and β Find λ via crss-validatin 20/24

42 Why Ridge regressin? Therem (Existence Therem (Herl & Kennard, 1970)) There always exist a λ s.t. that TMSE(ˆβ ridge ) < TMSE(ˆβ OLS ) 21/24

43 Why Ridge regressin? Therem (Existence Therem (Herl & Kennard, 1970)) There always exist a λ s.t. that TMSE(ˆβ ridge ) < TMSE(ˆβ OLS ) Figure: bias T bias and trace(variance) f ridge regressin 21/24

44 The existence therem in practice Y th rder plynmial regressin x data OLS Ridge, λ = 0.1 OLS ceff. Ridge ceff /24

45 LASSO and variable selectin (Tibshirani, 1996) n ( ˆβ lass = argmin Y β 0,...,β p i β 0 i=1 p ) 2 x ij β j +λ j=1 p β j j=1 23/24

46 LASSO and variable selectin (Tibshirani, 1996) n ( ˆβ lass = argmin Y β 0,...,β p i β 0 i=1 p ) 2 x ij β j +λ j=1 N clsed frm, numerical ptimizatin needed cnvex prblem Sets certain cefficients β exactly equal t zer λ can be determined via crss-validatin p β j j=1 23/24

47 LASSO and variable selectin (Tibshirani, 1996) n ( ˆβ lass = argmin Y β 0,...,β p i β 0 i=1 p ) 2 x ij β j +λ j=1 N clsed frm, numerical ptimizatin needed cnvex prblem Sets certain cefficients β exactly equal t zer λ can be determined via crss-validatin p β j j=1 23/24

48 References Bühlmann, P & van de Geer S. (2011), Statistics fr High-Dimensinal Data: Methds, Thery and Applicatins, Springer Devrye L., Györfi L. & Lugsi G. (1996). A Prbabilistic Thery f Pattern Recgnitin, Springer Gareth J., Witten D., Hastie T. & Tibshirani R. (2013). An Intrductin t Statistical Learning with Applicatins in R, Springer Hastie T., Tibshirani R. & Friedman J. (2009), The Elements f Statistical Learning: Data Mining, Inference, and Predictin (2nd Ed.), Springer Herl A.E. & Kennard R. (1974), Ridge regressin: biased estimatin fr nnrthgnal prblems, Technmetrics, 12: Györfi L, Khler M., Krzyżak A. & Walk H. (2002). A Distributin-Free Thery f Nnparametric Regressin, Springer Hampel F. R., Rnchetti E. M., Russeeuw P. J. & Werner A. (1986), Rbust Statistics: The Apprach Based n Influence Functins, Jhn Wiley & Sns Silverman B. W., Density Estimatin fr Statistics and Data Analysis, Chapman & Hall, 1986 Simnff J.S. (1996), Smthing Methds in Statistics, Springer R. Tibshirani (1996), Regressin shrinkage and selectin via the lass, J. Ryal. Statist. Sc B., 58(1): /24

Resampling Methods. Cross-validation, Bootstrapping. Marek Petrik 2/21/2017

Resampling Methods. Cross-validation, Bootstrapping. Marek Petrik 2/21/2017 Resampling Methds Crss-validatin, Btstrapping Marek Petrik 2/21/2017 Sme f the figures in this presentatin are taken frm An Intrductin t Statistical Learning, with applicatins in R (Springer, 2013) with