Springer Texts in Statistics


Springer Texts in Statistics
Series Editors: G. Casella, S. Fienberg, I. Olkin
For further volumes:


Gareth James
Daniela Witten
Trevor Hastie
Robert Tibshirani

An Introduction to Statistical Learning
with Applications in R

Gareth James, Department of Information and Operations Management, University of Southern California, Los Angeles, CA, USA
Trevor Hastie, Department of Statistics, Stanford University, Stanford, CA, USA
Daniela Witten, Department of Biostatistics, University of Washington, Seattle, WA, USA
Robert Tibshirani, Department of Statistics, Stanford University, Stanford, CA, USA

ISSN   ISBN   ISBN (eBook)   DOI
Springer New York Heidelberg Dordrecht London
Library of Congress Control Number:
© Springer Science+Business Media New York 2013 (Corrected at 4th printing 2014)

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

To our parents:
Alison and Michael James
Chiara Nappi and Edward Witten
Valerie and Patrick Hastie
Vera and Sami Tibshirani

and to our families:
Michael, Daniel, and Catherine
Ari
Samantha, Timothy, and Lynda
Charlie, Ryan, Julie, and Cheryl


Preface

Statistical learning refers to a set of tools for modeling and understanding complex datasets. It is a recently developed area in statistics and blends with parallel developments in computer science and, in particular, machine learning. The field encompasses many methods such as the lasso and sparse regression, classification and regression trees, and boosting and support vector machines.

With the explosion of Big Data problems, statistical learning has become a very hot field in many scientific areas as well as marketing, finance, and other business disciplines. People with statistical learning skills are in high demand.

One of the first books in this area, The Elements of Statistical Learning (ESL) (Hastie, Tibshirani, and Friedman), was published in 2001, with a second edition in 2009. ESL has become a popular text not only in statistics but also in related fields. One of the reasons for ESL's popularity is its relatively accessible style. But ESL is intended for individuals with advanced training in the mathematical sciences. An Introduction to Statistical Learning (ISL) arose from the perceived need for a broader and less technical treatment of these topics. In this new book, we cover many of the same topics as ESL, but we concentrate more on the applications of the methods and less on the mathematical details. We have created labs illustrating how to implement each of the statistical learning methods using the popular statistical software package R. These labs provide the reader with valuable hands-on experience.

This book is appropriate for advanced undergraduates or master's students in statistics or related quantitative fields or for individuals in other disciplines who wish to use statistical learning tools to analyze their data. It can be used as a textbook for a course spanning one or two semesters.

We would like to thank several readers for valuable comments on preliminary drafts of this book: Pallavi Basu, Alexandra Chouldechova, Patrick Danaher, Will Fithian, Luella Fu, Sam Gross, Max Grazier G'Sell, Courtney Paulson, Xinghao Qiao, Elisa Sheng, Noah Simon, Kean Ming Tan, and Xin Lu Tan.

"It's tough to make predictions, especially about the future."
-Yogi Berra

Los Angeles, USA   Gareth James
Seattle, USA       Daniela Witten
Palo Alto, USA     Trevor Hastie
Palo Alto, USA     Robert Tibshirani

Contents

Preface

1 Introduction

2 Statistical Learning
   What Is Statistical Learning?
      Why Estimate f?
      How Do We Estimate f?
      The Trade-Off Between Prediction Accuracy and Model Interpretability
      Supervised Versus Unsupervised Learning
      Regression Versus Classification Problems
   Assessing Model Accuracy
      Measuring the Quality of Fit
      The Bias-Variance Trade-Off
      The Classification Setting
   Lab: Introduction to R
      Basic Commands
      Graphics
      Indexing Data
      Loading Data
      Additional Graphical and Numerical Summaries
   Exercises

3 Linear Regression
   Simple Linear Regression
      Estimating the Coefficients
      Assessing the Accuracy of the Coefficient Estimates
      Assessing the Accuracy of the Model
   Multiple Linear Regression
      Estimating the Regression Coefficients
      Some Important Questions
   Other Considerations in the Regression Model
      Qualitative Predictors
      Extensions of the Linear Model
      Potential Problems
   The Marketing Plan
   Comparison of Linear Regression with K-Nearest Neighbors
   Lab: Linear Regression
      Libraries
      Simple Linear Regression
      Multiple Linear Regression
      Interaction Terms
      Non-linear Transformations of the Predictors
      Qualitative Predictors
      Writing Functions
   Exercises

4 Classification
   An Overview of Classification
   Why Not Linear Regression?
   Logistic Regression
      The Logistic Model
      Estimating the Regression Coefficients
      Making Predictions
      Multiple Logistic Regression
      Logistic Regression for >2 Response Classes
   Linear Discriminant Analysis
      Using Bayes' Theorem for Classification
      Linear Discriminant Analysis for p = 1
      Linear Discriminant Analysis for p > 1
      Quadratic Discriminant Analysis
   A Comparison of Classification Methods
   Lab: Logistic Regression, LDA, QDA, and KNN
      The Stock Market Data
      Logistic Regression
      Linear Discriminant Analysis
      Quadratic Discriminant Analysis
      K-Nearest Neighbors
      An Application to Caravan Insurance Data
   Exercises

5 Resampling Methods
   Cross-Validation
      The Validation Set Approach
      Leave-One-Out Cross-Validation
      k-Fold Cross-Validation
      Bias-Variance Trade-Off for k-Fold Cross-Validation
      Cross-Validation on Classification Problems
   The Bootstrap
   Lab: Cross-Validation and the Bootstrap
      The Validation Set Approach
      Leave-One-Out Cross-Validation
      k-Fold Cross-Validation
      The Bootstrap
   Exercises

6 Linear Model Selection and Regularization
   Subset Selection
      Best Subset Selection
      Stepwise Selection
      Choosing the Optimal Model
   Shrinkage Methods
      Ridge Regression
      The Lasso
      Selecting the Tuning Parameter
   Dimension Reduction Methods
      Principal Components Regression
      Partial Least Squares
   Considerations in High Dimensions
      High-Dimensional Data
      What Goes Wrong in High Dimensions?
      Regression in High Dimensions
      Interpreting Results in High Dimensions
   Lab 1: Subset Selection Methods
      Best Subset Selection
      Forward and Backward Stepwise Selection
      Choosing Among Models Using the Validation Set Approach and Cross-Validation
   Lab 2: Ridge Regression and the Lasso
      Ridge Regression
      The Lasso
   Lab 3: PCR and PLS Regression
      Principal Components Regression
      Partial Least Squares
   Exercises

7 Moving Beyond Linearity
   Polynomial Regression
   Step Functions
   Basis Functions
   Regression Splines
      Piecewise Polynomials
      Constraints and Splines
      The Spline Basis Representation
      Choosing the Number and Locations of the Knots
      Comparison to Polynomial Regression
   Smoothing Splines
      An Overview of Smoothing Splines
      Choosing the Smoothing Parameter λ
   Local Regression
   Generalized Additive Models
      GAMs for Regression Problems
      GAMs for Classification Problems
   Lab: Non-linear Modeling
      Polynomial Regression and Step Functions
      Splines
      GAMs
   Exercises

8 Tree-Based Methods
   The Basics of Decision Trees
      Regression Trees
      Classification Trees
      Trees Versus Linear Models
      Advantages and Disadvantages of Trees
   Bagging, Random Forests, Boosting
      Bagging
      Random Forests
      Boosting
   Lab: Decision Trees
      Fitting Classification Trees
      Fitting Regression Trees
      Bagging and Random Forests
      Boosting
   Exercises

9 Support Vector Machines
   Maximal Margin Classifier
      What Is a Hyperplane?
      Classification Using a Separating Hyperplane
      The Maximal Margin Classifier
      Construction of the Maximal Margin Classifier
      The Non-separable Case
   Support Vector Classifiers
      Overview of the Support Vector Classifier
      Details of the Support Vector Classifier
   Support Vector Machines
      Classification with Non-linear Decision Boundaries
      The Support Vector Machine
      An Application to the Heart Disease Data
   SVMs with More than Two Classes
      One-Versus-One Classification
      One-Versus-All Classification
   Relationship to Logistic Regression
   Lab: Support Vector Machines
      Support Vector Classifier
      Support Vector Machine
      ROC Curves
      SVM with Multiple Classes
      Application to Gene Expression Data
   Exercises

10 Unsupervised Learning
   The Challenge of Unsupervised Learning
   Principal Components Analysis
      What Are Principal Components?
      Another Interpretation of Principal Components
      More on PCA
      Other Uses for Principal Components
   Clustering Methods
      K-Means Clustering
      Hierarchical Clustering
      Practical Issues in Clustering
   Lab 1: Principal Components Analysis
   Lab 2: Clustering
      K-Means Clustering
      Hierarchical Clustering
   Lab 3: NCI60 Data Example
      PCA on the NCI60 Data
      Clustering the Observations of the NCI60 Data
   Exercises

Index

1 Introduction

An Overview of Statistical Learning

Statistical learning refers to a vast set of tools for understanding data. These tools can be classified as supervised or unsupervised. Broadly speaking, supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs. Problems of this nature occur in fields as diverse as business, medicine, astrophysics, and public policy. With unsupervised statistical learning, there are inputs but no supervising output; nevertheless we can learn relationships and structure from such data. To provide an illustration of some applications of statistical learning, we briefly discuss three real-world data sets that are considered in this book.

Wage Data

In this application (which we refer to as the Wage data set throughout this book), we examine a number of factors that relate to wages for a group of males from the Atlantic region of the United States. In particular, we wish to understand the association between an employee's age and education, as well as the calendar year, on his wage. Consider, for example, the left-hand panel of Figure 1.1, which displays wage versus age for each of the individuals in the data set. There is evidence that wage increases with age but then decreases again after approximately age 60. The blue line, which provides an estimate of the average wage for a given age, makes this trend clearer.
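As a concrete illustration of this first plot, the following is a minimal R sketch that loads the Wage data from the ISLR package (assumed to be installed) and overlays a smoother on the scatterplot of wage against age. The use of smooth.spline() is our own choice; the book does not specify which smoother produced the blue line in Figure 1.1.

library(ISLR)                               # provides the Wage data set
dim(Wage)                                   # number of observations and variables
plot(Wage$age, Wage$wage, col = "darkgrey",
     xlab = "Age", ylab = "Wage")
fit <- smooth.spline(Wage$age, Wage$wage)   # one reasonable estimate of the average wage at each age
lines(fit, col = "blue", lwd = 2)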

FIGURE 1.1. Wage data, which contains income survey information for males from the central Atlantic region of the United States. Left: wage as a function of age. On average, wage increases with age until about 60 years of age, at which point it begins to decline. Center: wage as a function of year. There is a slow but steady increase of approximately $10,000 in the average wage between 2003 and 2009. Right: Boxplots displaying wage as a function of education, with 1 indicating the lowest level (no high school diploma) and 5 the highest level (an advanced graduate degree). On average, wage increases with the level of education.

Given an employee's age, we can use this curve to predict his wage. However, it is also clear from Figure 1.1 that there is a significant amount of variability associated with this average value, and so age alone is unlikely to provide an accurate prediction of a particular man's wage.

We also have information regarding each employee's education level and the year in which the wage was earned. The center and right-hand panels of Figure 1.1, which display wage as a function of both year and education, indicate that both of these factors are associated with wage. Wages increase by approximately $10,000, in a roughly linear (or straight-line) fashion, between 2003 and 2009, though this rise is very slight relative to the variability in the data. Wages are also typically greater for individuals with higher education levels: men with the lowest education level (1) tend to have substantially lower wages than those with the highest education level (5). Clearly, the most accurate prediction of a given man's wage will be obtained by combining his age, his education, and the year. In Chapter 3, we discuss linear regression, which can be used to predict wage from this data set. Ideally, we should predict wage in a way that accounts for the non-linear relationship between wage and age. In Chapter 7, we discuss a class of approaches for addressing this problem.

Stock Market Data

The Wage data involves predicting a continuous or quantitative output value. This is often referred to as a regression problem. However, in certain cases we may instead wish to predict a non-numerical value, that is, a categorical or qualitative output.

FIGURE 1.2. Left: Boxplots of the previous day's percentage change in the S&P index for the days for which the market increased or decreased, obtained from the Smarket data. Center and Right: Same as left panel, but the percentage changes for 2 and 3 days previous are shown.

For example, in Chapter 4 we examine a stock market data set that contains the daily movements in the Standard & Poor's 500 (S&P) stock index over a 5-year period between 2001 and 2005. We refer to this as the Smarket data. The goal is to predict whether the index will increase or decrease on a given day using the past 5 days' percentage changes in the index. Here the statistical learning problem does not involve predicting a numerical value. Instead it involves predicting whether a given day's stock market performance will fall into the Up bucket or the Down bucket. This is known as a classification problem. A model that could accurately predict the direction in which the market will move would be very useful!

The left-hand panel of Figure 1.2 displays two boxplots of the previous day's percentage changes in the stock index: one for the 648 days for which the market increased on the subsequent day, and one for the 602 days for which the market decreased. The two plots look almost identical, suggesting that there is no simple strategy for using yesterday's movement in the S&P to predict today's returns. The remaining panels, which display boxplots for the percentage changes 2 and 3 days previous to today, similarly indicate little association between past and present returns. Of course, this lack of pattern is to be expected: in the presence of strong correlations between successive days' returns, one could adopt a simple trading strategy to generate profits from the market. Nevertheless, in Chapter 4, we explore these data using several different statistical learning methods. Interestingly, there are hints of some weak trends in the data that suggest that, at least for this 5-year period, it is possible to correctly predict the direction of movement in the market approximately 60% of the time (Figure 1.3).
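The kind of analysis summarized in Figure 1.3, and developed in the Chapter 4 lab, can be sketched in a few lines of R. The sketch below assumes the ISLR and MASS packages are available; the choice of Lag1 and Lag2 as predictors is ours for illustration, since the exact predictors behind the figure are not specified here.

library(ISLR)    # Smarket: daily S&P 500 movements, 2001-2005
library(MASS)    # qda()
train <- Smarket$Year < 2005                        # fit on the earlier years
qda.fit <- qda(Direction ~ Lag1 + Lag2,             # illustrative choice of predictors
               data = Smarket, subset = train)
qda.pred <- predict(qda.fit, Smarket[!train, ])     # predict the held-out 2005 days
mean(qda.pred$class == Smarket$Direction[!train])   # fraction of days classified correctly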

FIGURE 1.3. We fit a quadratic discriminant analysis model to the subset of the Smarket data corresponding to the 2001-2004 time period, and predicted the probability of a stock market decrease using the 2005 data. On average, the predicted probability of decrease is higher for the days in which the market does decrease. Based on these results, we are able to correctly predict the direction of movement in the market 60% of the time.

Gene Expression Data

The previous two applications illustrate data sets with both input and output variables. However, another important class of problems involves situations in which we only observe input variables, with no corresponding output. For example, in a marketing setting, we might have demographic information for a number of current or potential customers. We may wish to understand which types of customers are similar to each other by grouping individuals according to their observed characteristics. This is known as a clustering problem. Unlike in the previous examples, here we are not trying to predict an output variable.

We devote Chapter 10 to a discussion of statistical learning methods for problems in which no natural output variable is available. We consider the NCI60 data set, which consists of 6,830 gene expression measurements for each of 64 cancer cell lines. Instead of predicting a particular output variable, we are interested in determining whether there are groups, or clusters, among the cell lines based on their gene expression measurements. This is a difficult question to address, in part because there are thousands of gene expression measurements per cell line, making it hard to visualize the data.

The left-hand panel of Figure 1.4 addresses this problem by representing each of the 64 cell lines using just two numbers, Z1 and Z2. These are the first two principal components of the data, which summarize the 6,830 expression measurements for each cell line down to two numbers or dimensions. While it is likely that this dimension reduction has resulted in some loss of information, it is now possible to visually examine the data for evidence of clustering.
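A minimal sketch of this dimension reduction, assuming the ISLR package (where NCI60 is stored as a list containing a 64-by-6,830 expression matrix and a vector of cancer-type labels), is:

library(ISLR)
nci.data <- NCI60$data                      # 64 cell lines x 6,830 expression measurements
nci.labs <- NCI60$labs                      # cancer type of each cell line (not used below)
pr.out <- prcomp(nci.data, scale = TRUE)    # principal components of the scaled data
plot(pr.out$x[, 1], pr.out$x[, 2],
     xlab = "Z1", ylab = "Z2", pch = 19)    # each point is one cell line

Coloring the points by nci.labs, for example with col = as.numeric(as.factor(nci.labs)), would give a plot in the spirit of the right-hand panel of Figure 1.4.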

FIGURE 1.4. Left: Representation of the NCI60 gene expression data set in a two-dimensional space, Z1 and Z2. Each point corresponds to one of the 64 cell lines. There appear to be four groups of cell lines, which we have represented using different colors. Right: Same as left panel except that we have represented each of the 14 different types of cancer using a different colored symbol. Cell lines corresponding to the same cancer type tend to be nearby in the two-dimensional space.

Deciding on the number of clusters is often a difficult problem. But the left-hand panel of Figure 1.4 suggests at least four groups of cell lines, which we have represented using separate colors. We can now examine the cell lines within each cluster for similarities in their types of cancer, in order to better understand the relationship between gene expression levels and cancer. In this particular data set, it turns out that the cell lines correspond to 14 different types of cancer. (However, this information was not used to create the left-hand panel of Figure 1.4.) The right-hand panel of Figure 1.4 is identical to the left-hand panel, except that the 14 cancer types are shown using distinct colored symbols. There is clear evidence that cell lines with the same cancer type tend to be located near each other in this two-dimensional representation. In addition, even though the cancer information was not used to produce the left-hand panel, the clustering obtained does bear some resemblance to some of the actual cancer types observed in the right-hand panel. This provides some independent verification of the accuracy of our clustering analysis.

A Brief History of Statistical Learning

Though the term statistical learning is fairly new, many of the concepts that underlie the field were developed long ago. At the beginning of the nineteenth century, Legendre and Gauss published papers on the method of least squares, which implemented the earliest form of what is now known as linear regression.

The approach was first successfully applied to problems in astronomy. Linear regression is used for predicting quantitative values, such as an individual's salary. In order to predict qualitative values, such as whether a patient survives or dies, or whether the stock market increases or decreases, Fisher proposed linear discriminant analysis in 1936. In the 1940s, various authors put forth an alternative approach, logistic regression. In the early 1970s, Nelder and Wedderburn coined the term generalized linear models for an entire class of statistical learning methods that include both linear and logistic regression as special cases.

By the end of the 1970s, many more techniques for learning from data were available. However, they were almost exclusively linear methods, because fitting non-linear relationships was computationally infeasible at the time. By the 1980s, computing technology had finally improved sufficiently that non-linear methods were no longer computationally prohibitive. In the mid-1980s, Breiman, Friedman, Olshen and Stone introduced classification and regression trees, and were among the first to demonstrate the power of a detailed practical implementation of a method, including cross-validation for model selection. Hastie and Tibshirani coined the term generalized additive models in 1986 for a class of non-linear extensions to generalized linear models, and also provided a practical software implementation.

Since that time, inspired by the advent of machine learning and other disciplines, statistical learning has emerged as a new subfield in statistics, focused on supervised and unsupervised modeling and prediction. In recent years, progress in statistical learning has been marked by the increasing availability of powerful and relatively user-friendly software, such as the popular and freely available R system. This has the potential to continue the transformation of the field from a set of techniques used and developed by statisticians and computer scientists to an essential toolkit for a much broader community.

This Book

The Elements of Statistical Learning (ESL) by Hastie, Tibshirani, and Friedman was first published in 2001. Since that time, it has become an important reference on the fundamentals of statistical machine learning. Its success derives from its comprehensive and detailed treatment of many important topics in statistical learning, as well as the fact that (relative to many upper-level statistics textbooks) it is accessible to a wide audience. However, the greatest factor behind the success of ESL has been its topical nature. At the time of its publication, interest in the field of statistical learning was starting to explode.

ESL provided one of the first accessible and comprehensive introductions to the topic. Since ESL was first published, the field of statistical learning has continued to flourish. The field's expansion has taken two forms. The most obvious growth has involved the development of new and improved statistical learning approaches aimed at answering a range of scientific questions across a number of fields. However, the field of statistical learning has also expanded its audience. In the 1990s, increases in computational power generated a surge of interest in the field from non-statisticians who were eager to use cutting-edge statistical tools to analyze their data. Unfortunately, the highly technical nature of these approaches meant that the user community remained primarily restricted to experts in statistics, computer science, and related fields with the training (and time) to understand and implement them.

In recent years, new and improved software packages have significantly eased the implementation burden for many statistical learning methods. At the same time, there has been growing recognition across a number of fields, from business to health care to genetics to the social sciences and beyond, that statistical learning is a powerful tool with important practical applications. As a result, the field has moved from one of primarily academic interest to a mainstream discipline, with an enormous potential audience. This trend will surely continue with the increasing availability of enormous quantities of data and the software to analyze it.

The purpose of An Introduction to Statistical Learning (ISL) is to facilitate the transition of statistical learning from an academic to a mainstream field. ISL is not intended to replace ESL, which is a far more comprehensive text both in terms of the number of approaches considered and the depth to which they are explored. We consider ESL to be an important companion for professionals (with graduate degrees in statistics, machine learning, or related fields) who need to understand the technical details behind statistical learning approaches. However, the community of users of statistical learning techniques has expanded to include individuals with a wider range of interests and backgrounds. Therefore, we believe that there is now a place for a less technical and more accessible version of ESL.

In teaching these topics over the years, we have discovered that they are of interest to master's and PhD students in fields as disparate as business administration, biology, and computer science, as well as to quantitatively oriented upper-division undergraduates. It is important for this diverse group to be able to understand the models, intuitions, and strengths and weaknesses of the various approaches. But for this audience, many of the technical details behind statistical learning methods, such as optimization algorithms and theoretical properties, are not of primary interest. We believe that these students do not need a deep understanding of these aspects in order to become informed users of the various methodologies, and in order to contribute to their chosen fields through the use of statistical learning tools.

ISL is based on the following four premises.

1. Many statistical learning methods are relevant and useful in a wide range of academic and non-academic disciplines, beyond just the statistical sciences. We believe that many contemporary statistical learning procedures should, and will, become as widely available and used as is currently the case for classical methods such as linear regression. As a result, rather than attempting to consider every possible approach (an impossible task), we have concentrated on presenting the methods that we believe are most widely applicable.

2. Statistical learning should not be viewed as a series of black boxes. No single approach will perform well in all possible applications. Without understanding all of the cogs inside the box, or the interaction between those cogs, it is impossible to select the best box. Hence, we have attempted to carefully describe the model, intuition, assumptions, and trade-offs behind each of the methods that we consider.

3. While it is important to know what job is performed by each cog, it is not necessary to have the skills to construct the machine inside the box! Thus, we have minimized discussion of technical details related to fitting procedures and theoretical properties. We assume that the reader is comfortable with basic mathematical concepts, but we do not assume a graduate degree in the mathematical sciences. For instance, we have almost completely avoided the use of matrix algebra, and it is possible to understand the entire book without a detailed knowledge of matrices and vectors.

4. We presume that the reader is interested in applying statistical learning methods to real-world problems. In order to facilitate this, as well as to motivate the techniques discussed, we have devoted a section within each chapter to R computer labs. In each lab, we walk the reader through a realistic application of the methods considered in that chapter. When we have taught this material in our courses, we have allocated roughly one-third of classroom time to working through the labs, and we have found them to be extremely useful. Many of the less computationally-oriented students who were initially intimidated by R's command level interface got the hang of things over the course of the quarter or semester. We have used R because it is freely available and is powerful enough to implement all of the methods discussed in the book. It also has optional packages that can be downloaded to implement literally thousands of additional methods. Most importantly, R is the language of choice for academic statisticians, and new approaches often become available in R years before they are implemented in commercial packages.

However, the labs in ISL are self-contained, and can be skipped if the reader wishes to use a different software package or does not wish to apply the methods discussed to real-world problems.

Who Should Read This Book?

This book is intended for anyone who is interested in using modern statistical methods for modeling and prediction from data. This group includes scientists, engineers, data analysts, or quants, but also less technical individuals with degrees in non-quantitative fields such as the social sciences or business. We expect that the reader will have had at least one elementary course in statistics. Background in linear regression is also useful, though not required, since we review the key concepts behind linear regression in Chapter 3. The mathematical level of this book is modest, and a detailed knowledge of matrix operations is not required. This book provides an introduction to the statistical programming language R. Previous exposure to a programming language, such as MATLAB or Python, is useful but not required.

We have successfully taught material at this level to master's and PhD students in business, computer science, biology, earth sciences, psychology, and many other areas of the physical and social sciences. This book could also be appropriate for advanced undergraduates who have already taken a course on linear regression. In the context of a more mathematically rigorous course in which ESL serves as the primary textbook, ISL could be used as a supplementary text for teaching computational aspects of the various approaches.

Notation and Simple Matrix Algebra

Choosing notation for a textbook is always a difficult task. For the most part we adopt the same notational conventions as ESL. We will use n to represent the number of distinct data points, or observations, in our sample. We will let p denote the number of variables that are available for use in making predictions. For example, the Wage data set consists of 12 variables for 3,000 people, so we have n = 3,000 observations and p = 12 variables (such as year, age, wage, and more). Note that throughout this book, we indicate variable names using colored font: Variable Name. In some examples, p might be quite large, such as on the order of thousands or even millions; this situation arises quite often, for example, in the analysis of modern biological data or web-based advertising data.

In general, we will let x_ij represent the value of the jth variable for the ith observation, where i = 1, 2, ..., n and j = 1, 2, ..., p. Throughout this book, i will be used to index the samples or observations (from 1 to n) and j will be used to index the variables (from 1 to p). We let X denote an n × p matrix whose (i, j)th element is x_ij. That is,

\mathbf{X} = \begin{pmatrix}
  x_{11} & x_{12} & \cdots & x_{1p} \\
  x_{21} & x_{22} & \cdots & x_{2p} \\
  \vdots & \vdots & \ddots & \vdots \\
  x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}.

For readers who are unfamiliar with matrices, it is useful to visualize X as a spreadsheet of numbers with n rows and p columns. At times we will be interested in the rows of X, which we write as x_1, x_2, ..., x_n. Here x_i is a vector of length p, containing the p variable measurements for the ith observation. That is,

x_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{pmatrix}.    (1.1)

(Vectors are by default represented as columns.) For example, for the Wage data, x_i is a vector of length 12, consisting of year, age, wage, and other values for the ith individual. At other times we will instead be interested in the columns of X, which we write as \mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_p. Each is a vector of length n. That is,

\mathbf{x}_j = \begin{pmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{pmatrix}.

For example, for the Wage data, \mathbf{x}_1 contains the n = 3,000 values for year. Using this notation, the matrix X can be written as

\mathbf{X} = \begin{pmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_p \end{pmatrix},

or

\mathbf{X} = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix}.
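This notation maps directly onto how matrices are indexed in R. The following sketch uses a small matrix with made-up entries, purely for illustration; it is not one of the book's data sets.

X <- matrix(c(1, 2,
              3, 4,
              5, 6), nrow = 3, ncol = 2, byrow = TRUE)   # n = 3 observations, p = 2 variables
X[2, ]     # the 2nd row: observation x_2, a vector of length p
X[, 1]     # the 1st column: variable x_1, a vector of length n
X[2, 1]    # the single entry x_21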

The ^T notation denotes the transpose of a matrix or vector. So, for example,

\mathbf{X}^T = \begin{pmatrix}
  x_{11} & x_{21} & \cdots & x_{n1} \\
  x_{12} & x_{22} & \cdots & x_{n2} \\
  \vdots & \vdots & \ddots & \vdots \\
  x_{1p} & x_{2p} & \cdots & x_{np}
\end{pmatrix},

while

x_i^T = \begin{pmatrix} x_{i1} & x_{i2} & \cdots & x_{ip} \end{pmatrix}.

We use y_i to denote the ith observation of the variable on which we wish to make predictions, such as wage. Hence, we write the set of all n observations in vector form as

\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}.

Then our observed data consists of {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where each x_i is a vector of length p. (If p = 1, then x_i is simply a scalar.)

In this text, a vector of length n will always be denoted in lower case bold; e.g.

\mathbf{a} = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix}.

However, vectors that are not of length n (such as feature vectors of length p, as in (1.1)) will be denoted in lower case normal font, e.g. a. Scalars will also be denoted in lower case normal font, e.g. a. In the rare cases in which these two uses for lower case normal font lead to ambiguity, we will clarify which use is intended. Matrices will be denoted using bold capitals, such as A. Random variables will be denoted using capital normal font, e.g. A, regardless of their dimensions.

Occasionally we will want to indicate the dimension of a particular object. To indicate that an object is a scalar, we will use the notation a ∈ R. To indicate that it is a vector of length k, we will use a ∈ R^k (or a ∈ R^n if it is of length n). We will indicate that an object is an r × s matrix using A ∈ R^{r×s}.

We have avoided using matrix algebra whenever possible. However, in a few instances it becomes too cumbersome to avoid it entirely. In these rare instances it is important to understand the concept of multiplying two matrices. Suppose that A ∈ R^{r×d} and B ∈ R^{d×s}. Then the product of A and B is denoted AB.

The (i, j)th element of AB is computed by multiplying each element of the ith row of A by the corresponding element of the jth column of B. That is, (AB)_{ij} = \sum_{k=1}^{d} a_{ik} b_{kj}. As an example, consider

\mathbf{A} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \quad \text{and} \quad \mathbf{B} = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}.

Then

\mathbf{AB} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}
            = \begin{pmatrix} 1\times 5 + 2\times 7 & 1\times 6 + 2\times 8 \\ 3\times 5 + 4\times 7 & 3\times 6 + 4\times 8 \end{pmatrix}
            = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}.

Note that this operation produces an r × s matrix. It is only possible to compute AB if the number of columns of A is the same as the number of rows of B.
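In R, the same product can be checked with the %*% operator; this short sketch simply connects the notation above to code.

A <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)
B <- matrix(c(5, 6, 7, 8), nrow = 2, byrow = TRUE)
A %*% B      # matrix product: a 2 x 2 matrix with rows (19, 22) and (43, 50)
t(A)         # the transpose of A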

Organization of This Book

Chapter 2 introduces the basic terminology and concepts behind statistical learning. This chapter also presents the K-nearest neighbor classifier, a very simple method that works surprisingly well on many problems. Chapters 3 and 4 cover classical linear methods for regression and classification. In particular, Chapter 3 reviews linear regression, the fundamental starting point for all regression methods. In Chapter 4 we discuss two of the most important classical classification methods, logistic regression and linear discriminant analysis. A central problem in all statistical learning situations involves choosing the best method for a given application. Hence, in Chapter 5 we introduce cross-validation and the bootstrap, which can be used to estimate the accuracy of a number of different methods in order to choose the best one.

Much of the recent research in statistical learning has concentrated on non-linear methods. However, linear methods often have advantages over their non-linear competitors in terms of interpretability and sometimes also accuracy. Hence, in Chapter 6 we consider a host of linear methods, both classical and more modern, which offer potential improvements over standard linear regression. These include stepwise selection, ridge regression, principal components regression, partial least squares, and the lasso.

The remaining chapters move into the world of non-linear statistical learning. We first introduce in Chapter 7 a number of non-linear methods that work well for problems with a single input variable. We then show how these methods can be used to fit non-linear additive models for which there is more than one input. In Chapter 8, we investigate tree-based methods, including bagging, boosting, and random forests. Support vector machines, a set of approaches for performing both linear and non-linear classification, are discussed in Chapter 9. Finally, in Chapter 10, we consider a setting in which we have input variables but no output variable. In particular, we present principal components analysis, K-means clustering, and hierarchical clustering.

At the end of each chapter, we present one or more R lab sections in which we systematically work through applications of the various methods discussed in that chapter. These labs demonstrate the strengths and weaknesses of the various approaches, and also provide a useful reference for the syntax required to implement the various methods. The reader may choose to work through the labs at his or her own pace, or the labs may be the focus of group sessions as part of a classroom environment. Within each R lab, we present the results that we obtained when we performed the lab at the time of writing this book. However, new versions of R are continuously released, and over time, the packages called in the labs will be updated. Therefore, in the future, it is possible that the results shown in the lab sections may no longer correspond precisely to the results obtained by the reader who performs the labs. As necessary, we will post updates to the labs on the book website.

We use a symbol to denote sections or exercises that contain more challenging concepts. These can be easily skipped by readers who do not wish to delve as deeply into the material, or who lack the mathematical background.

Data Sets Used in Labs and Exercises

In this textbook, we illustrate statistical learning methods using applications from marketing, finance, biology, and other areas. The ISLR package available on the book website contains a number of data sets that are required in order to perform the labs and exercises associated with this book. One other data set is contained in the MASS library, and yet another is part of the base R distribution. Table 1.1 contains a summary of the data sets required to perform the labs and exercises. A couple of these data sets are also available as text files on the book website, for use in Chapter 2.

Book Website

The website for this book is located at www.statlearning.com. It contains a number of resources, including the R package associated with this book, and some additional data sets.
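As a practical note, the data sets listed in Table 1.1 below can be obtained with a few commands; this is a minimal sketch assuming an internet connection for the one-time install.

install.packages("ISLR")     # one-time install from CRAN
library(ISLR)                # most of the data sets in Table 1.1
library(MASS)                # Boston
data(USArrests)              # part of the base R distribution
dim(Wage)                    # quick check that the data are available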

Name        Description
Auto        Gas mileage, horsepower, and other information for cars.
Boston      Housing values and other information about Boston suburbs.
Caravan     Information about individuals offered caravan insurance.
Carseats    Information about car seat sales in 400 stores.
College     Demographic characteristics, tuition, and more for USA colleges.
Default     Customer default records for a credit card company.
Hitters     Records and salaries for baseball players.
Khan        Gene expression measurements for four cancer types.
NCI60       Gene expression measurements for 64 cancer cell lines.
OJ          Sales information for Citrus Hill and Minute Maid orange juice.
Portfolio   Past values of financial assets, for use in portfolio allocation.
Smarket     Daily percentage returns for S&P 500 over a 5-year period.
USArrests   Crime statistics per 100,000 residents in 50 states of USA.
Wage        Income survey data for males in central Atlantic region of USA.
Weekly      1,089 weekly stock market returns for 21 years.

TABLE 1.1. A list of data sets needed to perform the labs and exercises in this textbook. All data sets are available in the ISLR library, with the exception of Boston (part of MASS) and USArrests (part of the base R distribution).

Acknowledgements

A few of the plots in this book were taken from ESL: Figures 6.7, 8.3, and 10.12. All other plots are new to this book.

2 Statistical Learning

2.1 What Is Statistical Learning?

In order to motivate our study of statistical learning, we begin with a simple example. Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product. The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper. The data are displayed in Figure 2.1. It is not possible for our client to directly increase sales of the product. On the other hand, they can control the advertising expenditure in each of the three media. Therefore, if we determine that there is an association between advertising and sales, then we can instruct our client to adjust advertising budgets, thereby indirectly increasing sales. In other words, our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets.

In this setting, the advertising budgets are input variables while sales is an output variable. The input variables are typically denoted using the symbol X, with a subscript to distinguish them. So X_1 might be the TV budget, X_2 the radio budget, and X_3 the newspaper budget. The inputs go by different names, such as predictors, independent variables, features, or sometimes just variables. The output variable, in this case sales, is often called the response or dependent variable, and is typically denoted using the symbol Y. Throughout this book, we will use all of these terms interchangeably.
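A minimal sketch of a first look at these data follows, assuming Advertising.csv has been downloaded from the book website and that its columns are named TV, radio, newspaper, and sales (the exact column names are an assumption here).

Advertising <- read.csv("Advertising.csv")
fit.tv <- lm(sales ~ TV, data = Advertising)   # simple least squares fit of sales on TV
coef(fit.tv)
plot(Advertising$TV, Advertising$sales, col = "red",
     xlab = "TV", ylab = "Sales")
abline(fit.tv, col = "blue", lwd = 2)          # a line like the one in the left panel of Figure 2.1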

FIGURE 2.1. The Advertising data set. The plot displays sales, in thousands of units, as a function of TV, radio, and newspaper budgets, in thousands of dollars, for 200 different markets. In each plot we show the simple least squares fit of sales to that variable, as described in Chapter 3. In other words, each blue line represents a simple model that can be used to predict sales using TV, radio, and newspaper, respectively.

More generally, suppose that we observe a quantitative response Y and p different predictors, X_1, X_2, ..., X_p. We assume that there is some relationship between Y and X = (X_1, X_2, ..., X_p), which can be written in the very general form

Y = f(X) + ε.    (2.1)

Here f is some fixed but unknown function of X_1, ..., X_p, and ε is a random error term, which is independent of X and has mean zero. In this formulation, f represents the systematic information that X provides about Y.

As another example, consider the left-hand panel of Figure 2.2, a plot of income versus years of education for 30 individuals in the Income data set. The plot suggests that one might be able to predict income using years of education. However, the function f that connects the input variable to the output variable is in general unknown. In this situation one must estimate f based on the observed points. Since Income is a simulated data set, f is known and is shown by the blue curve in the right-hand panel of Figure 2.2. The vertical lines represent the error terms ε. We note that some of the 30 observations lie above the blue curve and some lie below it; overall, the errors have approximately mean zero.

In general, the function f may involve more than one input variable. In Figure 2.3 we plot income as a function of years of education and seniority. Here f is a two-dimensional surface that must be estimated based on the observed data.
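Because the Income data are simulated, the true f is known to the authors; we can mimic that setup with a small simulation of our own. Everything below (the particular f, the error standard deviation, the range of x) is an arbitrary illustration, not the book's Income data.

set.seed(1)
f <- function(x) 20 + 60 / (1 + exp(-(x - 16)))   # an arbitrary smooth "true" f
x <- sort(runif(30, 10, 22))                      # 30 values playing the role of years of education
eps <- rnorm(30, mean = 0, sd = 8)                # error term with mean zero
y <- f(x) + eps                                   # Y = f(X) + epsilon
plot(x, y, col = "red", xlab = "Years of Education", ylab = "Income")
curve(f, add = TRUE, col = "blue", lwd = 2)       # the true f, known only because we simulated it
segments(x, f(x), x, y)                           # vertical lines showing the errors epsilon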

FIGURE 2.2. The Income data set. Left: The red dots are the observed values of income (in tens of thousands of dollars) and years of education for 30 individuals. Right: The blue curve represents the true underlying relationship between income and years of education, which is generally unknown (but is known in this case because the data were simulated). The black lines represent the error associated with each observation. Note that some errors are positive (if an observation lies above the blue curve) and some are negative (if an observation lies below the curve). Overall, these errors have approximately mean zero.

In essence, statistical learning refers to a set of approaches for estimating f. In this chapter we outline some of the key theoretical concepts that arise in estimating f, as well as tools for evaluating the estimates obtained.

Why Estimate f?

There are two main reasons that we may wish to estimate f: prediction and inference. We discuss each in turn.

Prediction

In many situations, a set of inputs X are readily available, but the output Y cannot be easily obtained. In this setting, since the error term averages to zero, we can predict Y using

Ŷ = f̂(X),    (2.2)

where f̂ represents our estimate for f, and Ŷ represents the resulting prediction for Y. In this setting, f̂ is often treated as a black box, in the sense that one is not typically concerned with the exact form of f̂, provided that it yields accurate predictions for Y.

FIGURE 2.3. The plot displays income as a function of years of education and seniority in the Income data set. The blue surface represents the true underlying relationship between income and years of education and seniority, which is known since the data are simulated. The red dots indicate the observed values of these quantities for 30 individuals.

As an example, suppose that X_1, ..., X_p are characteristics of a patient's blood sample that can be easily measured in a lab, and Y is a variable encoding the patient's risk for a severe adverse reaction to a particular drug. It is natural to seek to predict Y using X, since we can then avoid giving the drug in question to patients who are at high risk of an adverse reaction, that is, patients for whom the estimate of Y is high.

The accuracy of Ŷ as a prediction for Y depends on two quantities, which we will call the reducible error and the irreducible error. In general, f̂ will not be a perfect estimate for f, and this inaccuracy will introduce some error. This error is reducible because we can potentially improve the accuracy of f̂ by using the most appropriate statistical learning technique to estimate f. However, even if it were possible to form a perfect estimate for f, so that our estimated response took the form Ŷ = f(X), our prediction would still have some error in it! This is because Y is also a function of ε, which, by definition, cannot be predicted using X. Therefore, variability associated with ε also affects the accuracy of our predictions. This is known as the irreducible error, because no matter how well we estimate f, we cannot reduce the error introduced by ε.

Why is the irreducible error larger than zero? The quantity ε may contain unmeasured variables that are useful in predicting Y: since we don't measure them, f cannot use them for its prediction. The quantity ε may also contain unmeasurable variation. For example, the risk of an adverse reaction might vary for a given patient on a given day, depending on


More information

COMP 551 Applied Machine Learning Lecture 5: Generative models for linear classification

COMP 551 Applied Machine Learning Lecture 5: Generative models for linear classification COMP 551 Applied Machine Learning Lecture 5: Generative mdels fr linear classificatin Instructr: Herke van Hf (herke.vanhf@mail.mcgill.ca) Slides mstly by: Jelle Pineau Class web page: www.cs.mcgill.ca/~hvanh2/cmp551

More information

Computational modeling techniques

Computational modeling techniques Cmputatinal mdeling techniques Lecture 4: Mdel checing fr ODE mdels In Petre Department f IT, Åb Aademi http://www.users.ab.fi/ipetre/cmpmd/ Cntent Stichimetric matrix Calculating the mass cnservatin relatins

More information

Lesson Plan. Recode: They will do a graphic organizer to sequence the steps of scientific method.

Lesson Plan. Recode: They will do a graphic organizer to sequence the steps of scientific method. Lessn Plan Reach: Ask the students if they ever ppped a bag f micrwave ppcrn and nticed hw many kernels were unppped at the bttm f the bag which made yu wnder if ther brands pp better than the ne yu are

More information

Modelling of Clock Behaviour. Don Percival. Applied Physics Laboratory University of Washington Seattle, Washington, USA

Modelling of Clock Behaviour. Don Percival. Applied Physics Laboratory University of Washington Seattle, Washington, USA Mdelling f Clck Behaviur Dn Percival Applied Physics Labratry University f Washingtn Seattle, Washingtn, USA verheads and paper fr talk available at http://faculty.washingtn.edu/dbp/talks.html 1 Overview

More information

STATS216v Introduction to Statistical Learning Stanford University, Summer Practice Final (Solutions) Duration: 3 hours

STATS216v Introduction to Statistical Learning Stanford University, Summer Practice Final (Solutions) Duration: 3 hours STATS216v Intrductin t Statistical Learning Stanfrd University, Summer 2016 Practice Final (Slutins) Duratin: 3 hurs Instructins: (This is a practice final and will nt be graded.) Remember the university

More information

A Matrix Representation of Panel Data

A Matrix Representation of Panel Data web Extensin 6 Appendix 6.A A Matrix Representatin f Panel Data Panel data mdels cme in tw brad varieties, distinct intercept DGPs and errr cmpnent DGPs. his appendix presents matrix algebra representatins

More information

Biplots in Practice MICHAEL GREENACRE. Professor of Statistics at the Pompeu Fabra University. Chapter 13 Offprint

Biplots in Practice MICHAEL GREENACRE. Professor of Statistics at the Pompeu Fabra University. Chapter 13 Offprint Biplts in Practice MICHAEL GREENACRE Prfessr f Statistics at the Pmpeu Fabra University Chapter 13 Offprint CASE STUDY BIOMEDICINE Cmparing Cancer Types Accrding t Gene Epressin Arrays First published:

More information

Sections 15.1 to 15.12, 16.1 and 16.2 of the textbook (Robbins-Miller) cover the materials required for this topic.

Sections 15.1 to 15.12, 16.1 and 16.2 of the textbook (Robbins-Miller) cover the materials required for this topic. Tpic : AC Fundamentals, Sinusidal Wavefrm, and Phasrs Sectins 5. t 5., 6. and 6. f the textbk (Rbbins-Miller) cver the materials required fr this tpic.. Wavefrms in electrical systems are current r vltage

More information

AP Statistics Practice Test Unit Three Exploring Relationships Between Variables. Name Period Date

AP Statistics Practice Test Unit Three Exploring Relationships Between Variables. Name Period Date AP Statistics Practice Test Unit Three Explring Relatinships Between Variables Name Perid Date True r False: 1. Crrelatin and regressin require explanatry and respnse variables. 1. 2. Every least squares

More information

English 10 Pacing Guide : Quarter 2

English 10 Pacing Guide : Quarter 2 Implementatin Ntes Embedded Standards: Standards nted as embedded n this page are t be cntinuusly spiraled thrughut the quarter. This des nt mean that nging explicit instructin n these standards is t take

More information

Activity Guide Loops and Random Numbers

Activity Guide Loops and Random Numbers Unit 3 Lessn 7 Name(s) Perid Date Activity Guide Lps and Randm Numbers CS Cntent Lps are a relatively straightfrward idea in prgramming - yu want a certain chunk f cde t run repeatedly - but it takes a

More information

Five Whys How To Do It Better

Five Whys How To Do It Better Five Whys Definitin. As explained in the previus article, we define rt cause as simply the uncvering f hw the current prblem came int being. Fr a simple causal chain, it is the entire chain. Fr a cmplex

More information

COMP 551 Applied Machine Learning Lecture 4: Linear classification

COMP 551 Applied Machine Learning Lecture 4: Linear classification COMP 551 Applied Machine Learning Lecture 4: Linear classificatin Instructr: Jelle Pineau (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/cmp551 Unless therwise nted, all material psted

More information

I. Analytical Potential and Field of a Uniform Rod. V E d. The definition of electric potential difference is

I. Analytical Potential and Field of a Uniform Rod. V E d. The definition of electric potential difference is Length L>>a,b,c Phys 232 Lab 4 Ch 17 Electric Ptential Difference Materials: whitebards & pens, cmputers with VPythn, pwer supply & cables, multimeter, crkbard, thumbtacks, individual prbes and jined prbes,

More information

SPH3U1 Lesson 06 Kinematics

SPH3U1 Lesson 06 Kinematics PROJECTILE MOTION LEARNING GOALS Students will: Describe the mtin f an bject thrwn at arbitrary angles thrugh the air. Describe the hrizntal and vertical mtins f a prjectile. Slve prjectile mtin prblems.

More information

Mathematics and Computer Sciences Department. o Work Experience, General. o Open Entry/Exit. Distance (Hybrid Online) for online supported courses

Mathematics and Computer Sciences Department. o Work Experience, General. o Open Entry/Exit. Distance (Hybrid Online) for online supported courses SECTION A - Curse Infrmatin 1. Curse ID: 2. Curse Title: 3. Divisin: 4. Department: 5. Subject: 6. Shrt Curse Title: 7. Effective Term:: MATH 70S Integrated Intermediate Algebra Natural Sciences Divisin

More information

A Quick Overview of the. Framework for K 12 Science Education

A Quick Overview of the. Framework for K 12 Science Education A Quick Overview f the NGSS EQuIP MODULE 1 Framewrk fr K 12 Science Educatin Mdule 1: A Quick Overview f the Framewrk fr K 12 Science Educatin This mdule prvides a brief backgrund n the Framewrk fr K-12

More information

Emphases in Common Core Standards for Mathematical Content Kindergarten High School

Emphases in Common Core Standards for Mathematical Content Kindergarten High School Emphases in Cmmn Cre Standards fr Mathematical Cntent Kindergarten High Schl Cntent Emphases by Cluster March 12, 2012 Describes cntent emphases in the standards at the cluster level fr each grade. These

More information

Perfrmance f Sensitizing Rules n Shewhart Cntrl Charts with Autcrrelated Data Key Wrds: Autregressive, Mving Average, Runs Tests, Shewhart Cntrl Chart

Perfrmance f Sensitizing Rules n Shewhart Cntrl Charts with Autcrrelated Data Key Wrds: Autregressive, Mving Average, Runs Tests, Shewhart Cntrl Chart Perfrmance f Sensitizing Rules n Shewhart Cntrl Charts with Autcrrelated Data Sandy D. Balkin Dennis K. J. Lin y Pennsylvania State University, University Park, PA 16802 Sandy Balkin is a graduate student

More information

NUROP CONGRESS PAPER CHINESE PINYIN TO CHINESE CHARACTER CONVERSION

NUROP CONGRESS PAPER CHINESE PINYIN TO CHINESE CHARACTER CONVERSION NUROP Chinese Pinyin T Chinese Character Cnversin NUROP CONGRESS PAPER CHINESE PINYIN TO CHINESE CHARACTER CONVERSION CHIA LI SHI 1 AND LUA KIM TENG 2 Schl f Cmputing, Natinal University f Singapre 3 Science

More information

COMP 551 Applied Machine Learning Lecture 9: Support Vector Machines (cont d)

COMP 551 Applied Machine Learning Lecture 9: Support Vector Machines (cont d) COMP 551 Applied Machine Learning Lecture 9: Supprt Vectr Machines (cnt d) Instructr: Herke van Hf (herke.vanhf@mail.mcgill.ca) Slides mstly by: Class web page: www.cs.mcgill.ca/~hvanh2/cmp551 Unless therwise

More information

Biocomputers. [edit]scientific Background

Biocomputers. [edit]scientific Background Bicmputers Frm Wikipedia, the free encyclpedia Bicmputers use systems f bilgically derived mlecules, such as DNA and prteins, t perfrm cmputatinal calculatins invlving string, retrieving, and prcessing

More information

City of Angels School Independent Study Los Angeles Unified School District

City of Angels School Independent Study Los Angeles Unified School District City f Angels Schl Independent Study Ls Angeles Unified Schl District INSTRUCTIONAL GUIDE Algebra 1B Curse ID #310302 (CCSS Versin- 06/15) This curse is the secnd semester f Algebra 1, fulfills ne half

More information

INSTRUMENTAL VARIABLES

INSTRUMENTAL VARIABLES INSTRUMENTAL VARIABLES Technical Track Sessin IV Sergi Urzua University f Maryland Instrumental Variables and IE Tw main uses f IV in impact evaluatin: 1. Crrect fr difference between assignment f treatment

More information

CS 477/677 Analysis of Algorithms Fall 2007 Dr. George Bebis Course Project Due Date: 11/29/2007

CS 477/677 Analysis of Algorithms Fall 2007 Dr. George Bebis Course Project Due Date: 11/29/2007 CS 477/677 Analysis f Algrithms Fall 2007 Dr. Gerge Bebis Curse Prject Due Date: 11/29/2007 Part1: Cmparisn f Srting Algrithms (70% f the prject grade) The bjective f the first part f the assignment is

More information

k-nearest Neighbor How to choose k Average of k points more reliable when: Large k: noise in attributes +o o noise in class labels

k-nearest Neighbor How to choose k Average of k points more reliable when: Large k: noise in attributes +o o noise in class labels Mtivating Example Memry-Based Learning Instance-Based Learning K-earest eighbr Inductive Assumptin Similar inputs map t similar utputs If nt true => learning is impssible If true => learning reduces t

More information

3.4 Shrinkage Methods Prostate Cancer Data Example (Continued) Ridge Regression

3.4 Shrinkage Methods Prostate Cancer Data Example (Continued) Ridge Regression 3.3.4 Prstate Cancer Data Example (Cntinued) 3.4 Shrinkage Methds 61 Table 3.3 shws the cefficients frm a number f different selectin and shrinkage methds. They are best-subset selectin using an all-subsets

More information

2004 AP CHEMISTRY FREE-RESPONSE QUESTIONS

2004 AP CHEMISTRY FREE-RESPONSE QUESTIONS 2004 AP CHEMISTRY FREE-RESPONSE QUESTIONS 6. An electrchemical cell is cnstructed with an pen switch, as shwn in the diagram abve. A strip f Sn and a strip f an unknwn metal, X, are used as electrdes.

More information

[COLLEGE ALGEBRA EXAM I REVIEW TOPICS] ( u s e t h i s t o m a k e s u r e y o u a r e r e a d y )

[COLLEGE ALGEBRA EXAM I REVIEW TOPICS] ( u s e t h i s t o m a k e s u r e y o u a r e r e a d y ) (Abut the final) [COLLEGE ALGEBRA EXAM I REVIEW TOPICS] ( u s e t h i s t m a k e s u r e y u a r e r e a d y ) The department writes the final exam s I dn't really knw what's n it and I can't very well

More information

Assessment Primer: Writing Instructional Objectives

Assessment Primer: Writing Instructional Objectives Assessment Primer: Writing Instructinal Objectives (Based n Preparing Instructinal Objectives by Mager 1962 and Preparing Instructinal Objectives: A critical tl in the develpment f effective instructin

More information

BASD HIGH SCHOOL FORMAL LAB REPORT

BASD HIGH SCHOOL FORMAL LAB REPORT BASD HIGH SCHOOL FORMAL LAB REPORT *WARNING: After an explanatin f what t include in each sectin, there is an example f hw the sectin might lk using a sample experiment Keep in mind, the sample lab used

More information

Building to Transformations on Coordinate Axis Grade 5: Geometry Graph points on the coordinate plane to solve real-world and mathematical problems.

Building to Transformations on Coordinate Axis Grade 5: Geometry Graph points on the coordinate plane to solve real-world and mathematical problems. Building t Transfrmatins n Crdinate Axis Grade 5: Gemetry Graph pints n the crdinate plane t slve real-wrld and mathematical prblems. 5.G.1. Use a pair f perpendicular number lines, called axes, t define

More information

Lead/Lag Compensator Frequency Domain Properties and Design Methods

Lead/Lag Compensator Frequency Domain Properties and Design Methods Lectures 6 and 7 Lead/Lag Cmpensatr Frequency Dmain Prperties and Design Methds Definitin Cnsider the cmpensatr (ie cntrller Fr, it is called a lag cmpensatr s K Fr s, it is called a lead cmpensatr Ntatin

More information

Land Information New Zealand Topographic Strategy DRAFT (for discussion)

Land Information New Zealand Topographic Strategy DRAFT (for discussion) Land Infrmatin New Zealand Tpgraphic Strategy DRAFT (fr discussin) Natinal Tpgraphic Office Intrductin The Land Infrmatin New Zealand Tpgraphic Strategy will prvide directin fr the cllectin and maintenance

More information

Part 3 Introduction to statistical classification techniques

Part 3 Introduction to statistical classification techniques Part 3 Intrductin t statistical classificatin techniques Machine Learning, Part 3, March 07 Fabi Rli Preamble ØIn Part we have seen that if we knw: Psterir prbabilities P(ω i / ) Or the equivalent terms

More information

A Correlation of. to the. South Carolina Academic Standards for Mathematics Precalculus

A Correlation of. to the. South Carolina Academic Standards for Mathematics Precalculus A Crrelatin f Suth Carlina Academic Standards fr Mathematics Precalculus INTRODUCTION This dcument demnstrates hw Precalculus (Blitzer), 4 th Editin 010, meets the indicatrs f the. Crrelatin page references

More information

Distributions, spatial statistics and a Bayesian perspective

Distributions, spatial statistics and a Bayesian perspective Distributins, spatial statistics and a Bayesian perspective Dug Nychka Natinal Center fr Atmspheric Research Distributins and densities Cnditinal distributins and Bayes Thm Bivariate nrmal Spatial statistics

More information

Lab #3: Pendulum Period and Proportionalities

Lab #3: Pendulum Period and Proportionalities Physics 144 Chwdary Hw Things Wrk Spring 2006 Name: Partners Name(s): Intrductin Lab #3: Pendulum Perid and Prprtinalities Smetimes, it is useful t knw the dependence f ne quantity n anther, like hw the

More information

Writing Guidelines. (Updated: November 25, 2009) Forwards

Writing Guidelines. (Updated: November 25, 2009) Forwards Writing Guidelines (Updated: Nvember 25, 2009) Frwards I have fund in my review f the manuscripts frm ur students and research assciates, as well as thse submitted t varius jurnals by thers that the majr

More information

Pipetting 101 Developed by BSU CityLab

Pipetting 101 Developed by BSU CityLab Discver the Micrbes Within: The Wlbachia Prject Pipetting 101 Develped by BSU CityLab Clr Cmparisns Pipetting Exercise #1 STUDENT OBJECTIVES Students will be able t: Chse the crrect size micrpipette fr

More information

We say that y is a linear function of x if. Chapter 13: The Correlation Coefficient and the Regression Line

We say that y is a linear function of x if. Chapter 13: The Correlation Coefficient and the Regression Line Chapter 13: The Crrelatin Cefficient and the Regressin Line We begin with a sme useful facts abut straight lines. Recall the x, y crdinate system, as pictured belw. 3 2 1 y = 2.5 y = 0.5x 3 2 1 1 2 3 1

More information

Department: MATHEMATICS

Department: MATHEMATICS Cde: MATH 022 Title: ALGEBRA SKILLS Institute: STEM Department: MATHEMATICS Curse Descriptin: This curse prvides students wh have cmpleted MATH 021 with the necessary skills and cncepts t cntinue the study

More information

We can see from the graph above that the intersection is, i.e., [ ).

We can see from the graph above that the intersection is, i.e., [ ). MTH 111 Cllege Algebra Lecture Ntes July 2, 2014 Functin Arithmetic: With nt t much difficulty, we ntice that inputs f functins are numbers, and utputs f functins are numbers. S whatever we can d with

More information

Fall 2013 Physics 172 Recitation 3 Momentum and Springs

Fall 2013 Physics 172 Recitation 3 Momentum and Springs Fall 03 Physics 7 Recitatin 3 Mmentum and Springs Purpse: The purpse f this recitatin is t give yu experience wrking with mmentum and the mmentum update frmula. Readings: Chapter.3-.5 Learning Objectives:.3.

More information

Department of Electrical Engineering, University of Waterloo. Introduction

Department of Electrical Engineering, University of Waterloo. Introduction Sectin 4: Sequential Circuits Majr Tpics Types f sequential circuits Flip-flps Analysis f clcked sequential circuits Mre and Mealy machines Design f clcked sequential circuits State transitin design methd

More information

CHAPTER 4 DIAGNOSTICS FOR INFLUENTIAL OBSERVATIONS

CHAPTER 4 DIAGNOSTICS FOR INFLUENTIAL OBSERVATIONS CHAPTER 4 DIAGNOSTICS FOR INFLUENTIAL OBSERVATIONS 1 Influential bservatins are bservatins whse presence in the data can have a distrting effect n the parameter estimates and pssibly the entire analysis,

More information

Basics. Primary School learning about place value is often forgotten and can be reinforced at home.

Basics. Primary School learning about place value is often forgotten and can be reinforced at home. Basics When pupils cme t secndary schl they start a lt f different subjects and have a lt f new interests but it is still imprtant that they practise their basic number wrk which may nt be reinfrced as

More information

Module 4: General Formulation of Electric Circuit Theory

Module 4: General Formulation of Electric Circuit Theory Mdule 4: General Frmulatin f Electric Circuit Thery 4. General Frmulatin f Electric Circuit Thery All electrmagnetic phenmena are described at a fundamental level by Maxwell's equatins and the assciated

More information

CHM112 Lab Graphing with Excel Grading Rubric

CHM112 Lab Graphing with Excel Grading Rubric Name CHM112 Lab Graphing with Excel Grading Rubric Criteria Pints pssible Pints earned Graphs crrectly pltted and adhere t all guidelines (including descriptive title, prperly frmatted axes, trendline

More information

MATHEMATICS SYLLABUS SECONDARY 5th YEAR

MATHEMATICS SYLLABUS SECONDARY 5th YEAR Eurpean Schls Office f the Secretary-General Pedaggical Develpment Unit Ref. : 011-01-D-8-en- Orig. : EN MATHEMATICS SYLLABUS SECONDARY 5th YEAR 6 perid/week curse APPROVED BY THE JOINT TEACHING COMMITTEE

More information

Revision: August 19, E Main Suite D Pullman, WA (509) Voice and Fax

Revision: August 19, E Main Suite D Pullman, WA (509) Voice and Fax .7.4: Direct frequency dmain circuit analysis Revisin: August 9, 00 5 E Main Suite D Pullman, WA 9963 (509) 334 6306 ice and Fax Overview n chapter.7., we determined the steadystate respnse f electrical

More information

Experiment #3. Graphing with Excel

Experiment #3. Graphing with Excel Experiment #3. Graphing with Excel Study the "Graphing with Excel" instructins that have been prvided. Additinal help with learning t use Excel can be fund n several web sites, including http://www.ncsu.edu/labwrite/res/gt/gt-

More information

7 TH GRADE MATH STANDARDS

7 TH GRADE MATH STANDARDS ALGEBRA STANDARDS Gal 1: Students will use the language f algebra t explre, describe, represent, and analyze number expressins and relatins 7 TH GRADE MATH STANDARDS 7.M.1.1: (Cmprehensin) Select, use,

More information

Internal vs. external validity. External validity. This section is based on Stock and Watson s Chapter 9.

Internal vs. external validity. External validity. This section is based on Stock and Watson s Chapter 9. Sectin 7 Mdel Assessment This sectin is based n Stck and Watsn s Chapter 9. Internal vs. external validity Internal validity refers t whether the analysis is valid fr the ppulatin and sample being studied.

More information

West Deptford Middle School 8th Grade Curriculum Unit 4 Investigate Bivariate Data

West Deptford Middle School 8th Grade Curriculum Unit 4 Investigate Bivariate Data West Deptfrd Middle Schl 8th Grade Curriculum Unit 4 Investigate Bivariate Data Office f Curriculum and Instructin West Deptfrd Middle Schl 675 Grve Rd, Paulsbr, NJ 08066 wdeptfrd.k12.nj.us (856) 848-1200

More information

February 28, 2013 COMMENTS ON DIFFUSION, DIFFUSIVITY AND DERIVATION OF HYPERBOLIC EQUATIONS DESCRIBING THE DIFFUSION PHENOMENA

February 28, 2013 COMMENTS ON DIFFUSION, DIFFUSIVITY AND DERIVATION OF HYPERBOLIC EQUATIONS DESCRIBING THE DIFFUSION PHENOMENA February 28, 2013 COMMENTS ON DIFFUSION, DIFFUSIVITY AND DERIVATION OF HYPERBOLIC EQUATIONS DESCRIBING THE DIFFUSION PHENOMENA Mental Experiment regarding 1D randm walk Cnsider a cntainer f gas in thermal

More information

Turing Machines. Human-aware Robotics. 2017/10/17 & 19 Chapter 3.2 & 3.3 in Sipser Ø Announcement:

Turing Machines. Human-aware Robotics. 2017/10/17 & 19 Chapter 3.2 & 3.3 in Sipser Ø Announcement: Turing Machines Human-aware Rbtics 2017/10/17 & 19 Chapter 3.2 & 3.3 in Sipser Ø Annuncement: q q q q Slides fr this lecture are here: http://www.public.asu.edu/~yzhan442/teaching/cse355/lectures/tm-ii.pdf

More information

Methods for Determination of Mean Speckle Size in Simulated Speckle Pattern

Methods for Determination of Mean Speckle Size in Simulated Speckle Pattern 0.478/msr-04-004 MEASUREMENT SCENCE REVEW, Vlume 4, N. 3, 04 Methds fr Determinatin f Mean Speckle Size in Simulated Speckle Pattern. Hamarvá, P. Šmíd, P. Hrváth, M. Hrabvský nstitute f Physics f the Academy

More information

Sample questions to support inquiry with students:

Sample questions to support inquiry with students: Area f Learning: Mathematics Calculus 12 Big Ideas Elabratins The cncept f a limit is fundatinal t calculus. cncept f a limit: Differentiatin and integratin are defined using limits. Sample questins t

More information