An Introduction. Statistical Learning. The Elements of Statistical Learning. Data Mining, Inference, and Prediction.



CS 189 / 289A [Spring 2017] Machine Learning
Jonathan Shewchuk
http://www.cs.berkeley.edu/~jrs/189

1 Introduction

TAs: Daylen Yang, Ting-Chun Wang, Moulos Vrettos, Mostafa Rohaninejad, Michael Zhang, Anurag Ajay, Alvin Wan, Soroush Nasiriany, Garrett Thomas, Noah Golmant, Adam Villaflor, Raul Puri, Alex Francis

Questions: Please use Piazza, not email. [Piazza has an option for private questions, but please use public for most questions so other people can benefit.] For personal matters only, jrs@cory.eecs.berkeley.edu

Discussion sections: 12 now; more will be added. Attend any section. If the room is too full, please go to another one. [However, to get into the course, you have to pick some section with space. Doesn't matter which one!] No sections this week.

[Enrollment: We're trying to raise it to 540. After enough students drop, it's possible that everyone might get in. Concurrent enrollment students have the lowest priority; non-CS grad students the second-lowest.]

[Textbooks: Available free online. Linked from class web page.]

[Figure: textbook covers. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, An Introduction to Statistical Learning with Applications in R (Springer Texts in Statistics). Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics).]

Prerequisites

Math 53 (vector calculus)
Math 54 or 110 (linear algebra)
CS 70 (probability)
NOT CS 188 [Might still be listed as a prerequisite, but we're having it removed.]
[BUT be aware that 189 midterm starts 10 minutes after 188 midterm ends.]

Grading: 189
40% 7 Homeworks. Late policy: 5 slip days total
20% Midterm: Wednesday, March 15, in class (6:30-8 pm)
40% Final Exam: MOVED to Monday, May 8, 3-6 PM (Exam group 3) [Good news for some of you who had final exam conflicts.]

Grading: 289A
40% HW
20% Midterm
20% Final
20% Project

Cheating

- Discussion of HW problems is encouraged.
- All homeworks, including programming, must be written individually.
- We will actively check for plagiarism.
- Typical penalty is a large NEGATIVE score, but I reserve right to give an instant F for even one violation, and will always give an F for two.

[Last time I taught CS 61B, we had to punish roughly 100 people for cheating. It was very painful. Please don't put me through that again.]

CORE MATERIAL

- Finding patterns in data; using them to make predictions.
- Models and statistics help us understand patterns.
- Optimization algorithms "learn" the patterns.

[The most important part of this is the data. Data drives everything else. You cannot learn much if you don't have enough data. You cannot learn much if your data sucks. But it's amazing what you can do if you have lots of good data. Machine learning has changed a lot in the last decade because the internet has made truly vast quantities of data available. For instance, with a little patience you can download tens of millions of photographs. Then you can build a 3D model of Paris. Some techniques that had fallen out of favor, like neural nets, have come back big in the last few years because researchers found that they work so much better when you have vast quantities of data.]

CLASSIFICATION

creditcards.pdf (ISL, Figure 4.1)
FIGURE 4.1. The Default data set. Left: The annual incomes and monthly credit card balances of a number of individuals. The individuals who defaulted on their credit card payments are shown in orange, and those who did not are shown in blue. Center: Boxplots of balance as a function of default status. Right: Boxplots of income as a function of default status.

[The problem of classification. We are given data points, each belonging to one of two classes. Then we are given additional points whose class is unknown, and we are asked to predict what class each new point is in. Given the credit card balance and annual income of a cardholder, predict whether they will default on their debt.]

- Collect training data: reliable debtors & defaulted debtors
- Evaluate new applicants (prediction)

classify.pdf (label in figure: decision boundary) [Draw this figure by hand.]
[Draw 2 colors of dots, almost but not quite linearly separable.]
["How do we classify a new point?" Draw a point in a third color.]
[One possibility: look at its nearest neighbor.]
[Another possibility: draw a linear decision boundary; label it.]
[Those are two different models for the data.]
[We'll learn some ways to draw these linear decision boundaries in the next several lectures. But for now, let's compare this method with another method.]

ISL, Section 4.2, Why Not Linear Regression?

We have stated that linear regression is not appropriate in the case of a qualitative response. Why not? Suppose that we are trying to predict the medical condition of a patient in the emergency room on the basis of her symptoms. In this simplified example, there are three possible diagnoses: stroke, drug overdose, and epileptic seizure. We could consider encoding these values as a quantitative response variable, Y, as follows:

$$
Y = \begin{cases} 1 & \text{if stroke;} \\ 2 & \text{if drug overdose;} \\ 3 & \text{if epileptic seizure.} \end{cases}
$$

Using this coding, least squares could be used to fit a linear regression model to predict Y on the basis of a set of predictors $X_1, \ldots, X_p$. Unfortunately, this coding implies an ordering on the outcomes, putting drug overdose in between stroke and epileptic seizure, and insisting that the difference between stroke and drug overdose is the same as the difference between drug overdose and epileptic seizure. In practice there is no particular reason that this needs to be the case. For instance, one could choose an equally reasonable coding,

$$
Y = \begin{cases} 1 & \text{if epileptic seizure;} \\ 2 & \text{if stroke;} \\ 3 & \text{if drug overdose.} \end{cases}
$$
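The objection to coding a three-way diagnosis as 1, 2, 3 can be checked numerically. The sketch below is only an illustration: the single feature `x` and its nine values are made-up data, and rounding the fitted value to the nearest code is an assumed way of turning a regression output into a label. It fits the same least-squares line under both codings and compares the label predicted for one new patient.

```python
import numpy as np

# Hypothetical one-feature data set: three patients per diagnosis.
x = np.array([1.0, 1.2, 0.8, 3.0, 3.1, 2.9, 5.0, 5.2, 4.8])
diagnosis = ["stroke"] * 3 + ["drug overdose"] * 3 + ["epileptic seizure"] * 3

# The two codings of Y discussed in the text.
coding_a = {"stroke": 1, "drug overdose": 2, "epileptic seizure": 3}
coding_b = {"epileptic seizure": 1, "stroke": 2, "drug overdose": 3}

def predict_label(coding, x_new):
    """Fit Y ~ b0 + b1*x by least squares, then pick the nearest code."""
    y = np.array([coding[d] for d in diagnosis], dtype=float)
    A = np.column_stack([np.ones_like(x), x])      # intercept column + feature
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
    y_hat = beta[0] + beta[1] * x_new
    label_of = {code: name for name, code in coding.items()}
    nearest_code = min(label_of, key=lambda code: abs(code - y_hat))
    return label_of[nearest_code]

label_a = predict_label(coding_a, 3.0)
label_b = predict_label(coding_b, 3.0)
print(label_a, "|", label_b)
```

The new patient sits at the mean of `x`, so both fitted lines return the mean code, 2. That code means drug overdose under the first coding and stroke under the second, so the predicted diagnosis depends on an arbitrary labeling choice.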


classnear.pdf, classlinear.pdf (ESL, Figures 2.3 & 2.1)
[Here are two examples of classifiers for the same data. At left we have a nearest neighbor classifier, which classifies a point by finding the nearest point in the input data, and assigning it the same class. At right we have a linear classifier, which guesses that everything above the line is brown, and everything below the line is blue. The decision boundaries are in black.]

FIGURE 2.1. A classification example in two dimensions. The classes are coded as a binary variable (BLUE = 0, ORANGE = 1), and then fit by linear regression. The line is the decision boundary defined by $x^T \hat{\beta} = 0.5$. The orange shaded region denotes that part of input space classified as ORANGE, while the blue region is classified as BLUE.

FIGURE 2.3. The same classification example in two dimensions as in Figure 2.1. The classes are coded as a binary variable (BLUE = 0, ORANGE = 1), and then predicted by 1-nearest-neighbor classification.

classnear.pdf, classnear15.pdf (ESL, Figures 2.3 & 2.2)
[At right we have a 15-nearest neighbor classifier. Instead of looking at the nearest neighbor of a new point, it looks at the 15 nearest neighbors and lets them vote for the correct class. The 1-nearest neighbor classifier at left has a big advantage: it classifies all the training data correctly, whereas the 15-nearest neighbor classifier at right does not. But the right figure has an advantage too. Somebody please tell me what.]

FIGURE 2.2. The same classification example in two dimensions as in Figure 2.1. The classes are coded as a binary variable (BLUE = 0, ORANGE = 1) and then fit by 15-nearest-neighbor averaging as in (2.8). The predicted class is hence chosen by majority vote amongst the 15-nearest neighbors.

[The left figure is an example of what's called overfitting. In the left figure, observe how intricate the decision boundary is that separates the positive examples from the negative examples. It's a bit too intricate to reflect reality. In the right figure, the decision boundary is smoother. Intuitively, that smoothness is probably more likely to correspond to reality.]

ESL, Chapter 2 (Overview of Supervised Learning), Section 2.3 (Least Squares and Nearest Neighbors), text accompanying Figures 2.1 to 2.3:

The set of points in $\mathbb{R}^2$ classified as ORANGE corresponds to $\{x : x^T \hat{\beta} > 0.5\}$, indicated in Figure 2.1, and the two predicted classes are separated by the decision boundary $\{x : x^T \hat{\beta} = 0.5\}$, which is linear in this case. We see that for these data there are several misclassifications on both sides of the decision boundary. Perhaps our linear model is too rigid, or are such errors unavoidable? Remember that these are errors on the training data itself, and we have not said where the constructed data came from. Consider the two possible scenarios:

Scenario 1: The training data in each class were generated from bivariate Gaussian distributions with uncorrelated components and different means.

Scenario 2: The training data in each class came from a mixture of 10 low-variance Gaussian distributions, with individual means themselves distributed as Gaussian.

A mixture of Gaussians is best described in terms of the generative model. One first generates a discrete variable that determines which of ...

In Figure 2.2 we see that far fewer training observations are misclassified than in Figure 2.1. This should not give us too much comfort, though, since in Figure 2.3 none of the training data are misclassified. A little thought suggests that for k-nearest-neighbor fits, the error on the training data should be approximately an increasing function of k, and will always be 0 for k = 1. An independent test set would give us a more satisfactory means for comparing the different methods.

It appears that k-nearest-neighbor fits have a single parameter, the number of neighbors k, compared to the p parameters in least-squares fits. Although this is the case, we will see that the effective number of parameters of k-nearest neighbors is N/k and is generally bigger than p, and decreases with increasing k. To get an idea of why, note that if the neighborhoods were nonoverlapping, there would be N/k neighborhoods and we would fit one parameter (a mean) in each neighborhood.

It is also clear that we cannot use sum-of-squared errors on the training set as a criterion for picking k, since we would always pick k = 1! It would seem that k-nearest-neighbor methods would be more appropriate for the mixture Scenario 2 described above, while for Gaussian data the decision boundaries of k-nearest neighbors would be unnecessarily noisy.

2.3.3 From Least Squares to Nearest Neighbors

The linear decision boundary from least squares is very smooth, and apparently stable to fit. It does appear to rely heavily on the assumption that a linear decision boundary is appropriate. In language we will develop shortly, it has low variance and potentially high bias.

On the other hand, the k-nearest-neighbor procedures do not appear to rely on any stringent assumptions about the underlying data, and can adapt to any situation. However, any particular subregion of the decision boundary depends on a handful of input points and their particular positions, and is thus wiggly and unstable: high variance and low bias.

Each method has its own situations for which it works best; in particular linear regression is more appropriate for Scenario 1 above, while nearest neighbors are more suitable for Scenario 2. The time has come to expose the oracle! The data in fact were simulated from a model somewhere between the two, but closer to Scenario 2. First we generated 10 means $m_k$ from a bivariate Gaussian distribution $N((1, 0)^T, I)$ and labeled this class BLUE. Similarly, 10 more were drawn from $N((0, 1)^T, I)$ and labeled class ORANGE. Then for each class we generated 100 observations as follows: for each observation, we picked an $m_k$ at random with probability 1/10, and ...

Classifying Digits

sevensones.pdf [In this simplified digit recognition problem, we are given handwritten 7's and 1's, and we are asked to learn to distinguish the 7's from the 1's.]

[Figure: Classification Pipeline. Collect Training Images (Positive, Negative). Training Time: compute feature vectors for positive and negative example images; train a classifier.]
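Returning to the three classifiers compared in ESL Figures 2.1 to 2.3, the following is a minimal sketch of the comparison on simulated data. It follows the mixture recipe quoted from ESL as far as that text goes; the per-observation noise variance `noise_var`, the random seed, and the function names are assumptions made for illustration, because the quoted description is cut off before it states how each observation is drawn around its chosen mean.

```python
import numpy as np

def simulate(rng, n_means=10, n_per_class=100, noise_var=0.2):
    """Two-class mixture: BLUE = 0 centred near (1, 0), ORANGE = 1 near (0, 1)."""
    X, y = [], []
    for label, center in enumerate([(1.0, 0.0), (0.0, 1.0)]):
        means = rng.normal(loc=center, scale=1.0, size=(n_means, 2))  # the m_k
        picks = rng.integers(0, n_means, size=n_per_class)  # each m_k w.p. 1/10
        # noise_var is an ASSUMED value; the source text is cut off here.
        noise = rng.normal(scale=np.sqrt(noise_var), size=(n_per_class, 2))
        X.append(means[picks] + noise)
        y.append(np.full(n_per_class, label))
    return np.vstack(X), np.concatenate(y)

def knn_predict(X_train, y_train, X_query, k):
    """Majority vote among the k nearest training points (k odd avoids ties)."""
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

def linear_predict(X_train, y_train, X_query):
    """Least squares on the 0/1 response; classify as 1 where x^T beta > 0.5."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    beta, *_ = np.linalg.lstsq(A, y_train.astype(float), rcond=None)
    scores = np.column_stack([np.ones(len(X_query)), X_query]) @ beta
    return (scores > 0.5).astype(int)

rng = np.random.default_rng(0)
X, y = simulate(rng)

# Training set error: fraction of training points classified incorrectly.
train_error = {
    "linear": float(np.mean(linear_predict(X, y, X) != y)),
    "1-NN": float(np.mean(knn_predict(X, y, X, k=1) != y)),
    "15-NN": float(np.mean(knn_predict(X, y, X, k=15) != y)),
}
print(train_error)
```

The 1-nearest-neighbor training error is zero by construction, since every training point is its own nearest neighbor. That is why training error alone cannot be used to choose between these classifiers.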

Express these images as vectors. Images are points in 16-dimensional space. Linear decision boundary is a hyperplane.

Validation

- Train a classifier: it learns to distinguish 7 from not 7
- Test the classifier on NEW images

2 kinds of error:

- Training set error: fraction of training images not classified correctly [This is zero with the 1-nearest neighbor classifier, but nonzero with the 15-nearest neighbor and linear classifiers we've just seen.]
- Test set error: fraction of misclassified NEW images, not seen during training.

[When I underline a word or phrase, that usually means it's a definition. If you want to do well in this course, my advice to you is to memorize the definitions I cover in class.]

outliers: points whose labels are atypical (e.g. solvent borrower who defaulted anyway).
overfitting: when the test error deteriorates because the classifier becomes too sensitive to outliers or other spurious patterns.

[In machine learning, the goal is to create a classifier that generalizes to new examples we haven't seen yet. Overfitting is counterproductive to that goal. So we're always seeking a compromise: we want decision boundaries that make fine distinctions without being downright superstitious.]
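The two error definitions above can be written down directly. The sketch below is illustrative only: the 4-by-4 image size is an assumption chosen to match the 16 dimensions mentioned in the notes, the noisy "images" are synthetic stand-ins rather than real digits, and the helper names are hypothetical.

```python
import numpy as np

def to_vectors(images):
    """Flatten a stack of h-by-w images into points in (h*w)-dimensional space."""
    return images.reshape(len(images), -1)

def error_rate(predicted, actual):
    """Fraction of examples whose predicted label differs from the true label."""
    return float(np.mean(predicted != actual))

def nearest_neighbor(X_train, y_train, X_query):
    """1-nearest-neighbor: copy the label of the closest training point."""
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d2.argmin(axis=1)]

# Synthetic stand-in for the digit data (NOT real images):
# label 1 = "7", label 0 = "not 7"; a "7" gets a brighter top row.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=60)
pattern = np.zeros((4, 4))
pattern[0, :] = 1.0
images = rng.normal(scale=0.8, size=(60, 4, 4)) + labels[:, None, None] * pattern

X = to_vectors(images)                 # shape (60, 16)
X_train, y_train = X[:40], labels[:40]
X_test, y_test = X[40:], labels[40:]   # NEW images, not seen during training

training_error = error_rate(nearest_neighbor(X_train, y_train, X_train), y_train)
test_error = error_rate(nearest_neighbor(X_train, y_train, X_test), y_test)
print(training_error, test_error)
```

With 1-nearest-neighbor the training error is exactly zero, since each training image is its own closest match. Only the test error says anything about new images.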

Most ML algorithms have a few hyperparameters that control over/underfitting, e.g. k in k-nearest neighbors.

overfitlabeled.pdf (modified from ESL, Figure 2.4)
[Figure: error rate vs. k, the # of nearest neighbors. Curves: Train, Test, Bayes, Linear. Labels: training error, test error, overfit!, best (7), underfit.]

We select them by validation:

- Hold back a subset of training data, called the validation set.
- Train the classifier multiple times with different hyperparameter settings.
- Choose the settings that work best on validation set.

Now we have 3 sets:

- training set: used to learn model weights
- validation set: used to tune hyperparameters, choose among different models
- test set: used as FINAL evaluation of model. Keep in a vault. Run ONCE, at the very end.

[It's very bad when researchers in medicine or pharmaceuticals peek into the test set prematurely!]

Kaggle.com: Runs ML competitions, including our HWs. We use 2 data sets:

- "public" set: results available during competition
- "private" set: revealed only after due date

[If your public results are a lot better than your private results, we will know that you overfitted.]

Techniques [taught in this class, NOT a complete list]

Supervised learning:
- Classification: is this spam?
- Regression: how likely does this patient have cancer?

Unsupervised learning:
- Clustering: which DNA sequences are similar to each other?
- Dimensionality reduction: what are common features of faces? common differences?
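The validation procedure described in these notes, with its three data sets, can be sketched for k-nearest neighbors as follows. The data, the split sizes, the candidate values of k, and the function name `knn_error` are all assumptions made for illustration; they are not taken from the course materials.

```python
import numpy as np

def knn_error(X_train, y_train, X_eval, y_eval, k):
    """Error rate of a k-nearest-neighbor majority vote on (X_eval, y_eval)."""
    d2 = ((X_eval[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    predicted = (y_train[nearest].mean(axis=1) > 0.5).astype(int)
    return float(np.mean(predicted != y_eval))

# Hypothetical two-class data: class 1 is shifted by 1.5 in both coordinates.
rng = np.random.default_rng(2)
n = 600
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 2)) + 1.5 * y[:, None]

# Three sets: training, validation, and a test set kept "in a vault".
X_train, y_train = X[:300], y[:300]
X_val, y_val = X[300:450], y[300:450]
X_test, y_test = X[450:], y[450:]

# Try several hyperparameter settings; odd k avoids tied votes.
candidate_ks = [1, 3, 7, 15, 31, 61]
val_errors = {k: knn_error(X_train, y_train, X_val, y_val, k) for k in candidate_ks}

# Choose the setting that works best on the validation set.
best_k = min(val_errors, key=val_errors.get)

# FINAL evaluation: touch the test set once, after k has been fixed.
test_error = knn_error(X_train, y_train, X_test, y_test, best_k)
print(best_k, val_errors[best_k], test_error)
```

The test set plays no part in choosing `best_k`, so `test_error` remains an honest estimate of performance on unseen data.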

What is Statistical Learning?

[Figure: three scatter plots of Sales against TV, Radio, and Newspaper.] Shown are Sales vs TV, Radio and Newspaper,