Statistical Learning. 2.1 What Is Statistical Learning?


2 Statistical Learning

2.1 What Is Statistical Learning?

In order to motivate our study of statistical learning, we begin with a simple example. Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product. The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper. The data are displayed in Figure 2.1. It is not possible for our client to directly increase sales of the product. On the other hand, they can control the advertising expenditure in each of the three media. Therefore, if we determine that there is an association between advertising and sales, then we can instruct our client to adjust advertising budgets, thereby indirectly increasing sales. In other words, our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets.

In this setting, the advertising budgets are input variables while sales is an output variable. The input variables are typically denoted using the symbol X, with a subscript to distinguish them. So X₁ might be the TV budget, X₂ the radio budget, and X₃ the newspaper budget. The inputs go by different names, such as predictors, independent variables, features, or sometimes just variables. The output variable, in this case sales, is often called the response or dependent variable, and is typically denoted using the symbol Y. Throughout this book, we will use all of these terms interchangeably.
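To make the setting concrete, the sketch below loads the Advertising data into R. The file name, its availability as a CSV, and the column names (TV, radio, newspaper, sales) are assumptions based on the description above, not something this excerpt specifies.

    # A minimal sketch: read the Advertising data and inspect the inputs
    # (TV, radio, newspaper budgets) and the output (sales).
    # Assumes a local file "Advertising.csv" with those column names.
    Advertising <- read.csv("Advertising.csv")
    dim(Advertising)    # expect 200 markets (rows)
    head(Advertising)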

FIGURE 2.1. The Advertising data set. The plot displays sales, in thousands of units, as a function of TV, radio, and newspaper budgets, in thousands of dollars, for 200 different markets. In each plot we show the simple least squares fit of sales to that variable, as described in Chapter 3. In other words, each blue line represents a simple model that can be used to predict sales using TV, radio, and newspaper, respectively.

More generally, suppose that we observe a quantitative response Y and p different predictors, X₁, X₂, ..., Xₚ. We assume that there is some relationship between Y and X = (X₁, X₂, ..., Xₚ), which can be written in the very general form

Y = f(X) + ε.   (2.1)

Here f is some fixed but unknown function of X₁, ..., Xₚ, and ε is a random error term, which is independent of X and has mean zero. In this formulation, f represents the systematic information that X provides about Y.

As another example, consider the left-hand panel of Figure 2.2, a plot of income versus years of education for 30 individuals in the Income data set. The plot suggests that one might be able to predict income using years of education. However, the function f that connects the input variable to the output variable is in general unknown. In this situation one must estimate f based on the observed points. Since Income is a simulated data set, f is known and is shown by the blue curve in the right-hand panel of Figure 2.2. The vertical lines represent the error terms ε. We note that some of the 30 observations lie above the blue curve and some lie below it; overall, the errors have approximately mean zero.

In general, the function f may involve more than one input variable. In Figure 2.3 we plot income as a function of years of education and seniority. Here f is a two-dimensional surface that must be estimated based on the observed data.
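Because Income is simulated, its construction follows Equation (2.1) directly: pick an f, add mean-zero noise. A minimal sketch of that recipe (the particular f, sample size, and noise level below are arbitrary choices, not the ones used for the book's figures):

    set.seed(1)
    # Simulate n = 30 observations from Y = f(X) + epsilon.
    n   <- 30
    x   <- runif(n, min = 10, max = 22)      # e.g., years of education
    f   <- function(x) 20 + 3 * sin(x / 3)   # an assumed "true" f
    eps <- rnorm(n, mean = 0, sd = 2)        # error: mean zero, independent of X
    y   <- f(x) + eps
    plot(x, y)                               # the observed points
    curve(f, add = TRUE, col = "blue")       # the true relationship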

FIGURE 2.2. The Income data set. Left: The red dots are the observed values of income (in tens of thousands of dollars) and years of education for 30 individuals. Right: The blue curve represents the true underlying relationship between income and years of education, which is generally unknown (but is known in this case because the data were simulated). The black lines represent the error associated with each observation. Note that some errors are positive (if an observation lies above the blue curve) and some are negative (if an observation lies below the curve). Overall, these errors have approximately mean zero.

In essence, statistical learning refers to a set of approaches for estimating f. In this chapter we outline some of the key theoretical concepts that arise in estimating f, as well as tools for evaluating the estimates obtained.

2.1.1 Why Estimate f?

There are two main reasons that we may wish to estimate f: prediction and inference. We discuss each in turn.

Prediction

In many situations, a set of inputs X are readily available, but the output Y cannot be easily obtained. In this setting, since the error term averages to zero, we can predict Y using

Ŷ = f̂(X),   (2.2)

where f̂ represents our estimate for f, and Ŷ represents the resulting prediction for Y. In this setting, f̂ is often treated as a black box, in the sense that one is not typically concerned with the exact form of f̂, provided that it yields accurate predictions for Y.
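Any fitted model can play the role of the black box f̂ in (2.2). Continuing the simulation above, a sketch (the cubic polynomial is an arbitrary choice of estimator):

    # Fit a model fhat to the training data, then form Yhat = fhat(X)
    # at new input values, treating the fitted object as a black box.
    fit   <- lm(y ~ poly(x, 3))              # one possible fhat
    x_new <- data.frame(x = c(12, 16, 20))   # new inputs X
    y_hat <- predict(fit, newdata = x_new)   # predictions Yhat
    y_hat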

FIGURE 2.3. The plot displays income as a function of years of education and seniority in the Income data set. The blue surface represents the true underlying relationship between income and years of education and seniority, which is known since the data are simulated. The red dots indicate the observed values of these quantities for 30 individuals.

As an example, suppose that X₁, ..., Xₚ are characteristics of a patient's blood sample that can be easily measured in a lab, and Y is a variable encoding the patient's risk for a severe adverse reaction to a particular drug. It is natural to seek to predict Y using X, since we can then avoid giving the drug in question to patients who are at high risk of an adverse reaction, that is, patients for whom the estimate of Y is high.

The accuracy of Ŷ as a prediction for Y depends on two quantities, which we will call the reducible error and the irreducible error. In general, f̂ will not be a perfect estimate for f, and this inaccuracy will introduce some error. This error is reducible because we can potentially improve the accuracy of f̂ by using the most appropriate statistical learning technique to estimate f. However, even if it were possible to form a perfect estimate for f, so that our estimated response took the form Ŷ = f(X), our prediction would still have some error in it! This is because Y is also a function of ε, which, by definition, cannot be predicted using X. Therefore, variability associated with ε also affects the accuracy of our predictions. This is known as the irreducible error, because no matter how well we estimate f, we cannot reduce the error introduced by ε.

Why is the irreducible error larger than zero? The quantity ε may contain unmeasured variables that are useful in predicting Y: since we don't measure them, f cannot use them for its prediction. The quantity ε may also contain unmeasurable variation. For example, the risk of an adverse reaction might vary for a given patient on a given day, depending on manufacturing variation in the drug itself or the patient's general feeling of well-being on that day.

Consider a given estimate f̂ and a set of predictors X, which yields the prediction Ŷ = f̂(X). Assume for a moment that both f̂ and X are fixed. Then, it is easy to show that

E(Y − Ŷ)² = E[f(X) + ε − f̂(X)]² = [f(X) − f̂(X)]² + Var(ε),   (2.3)

where the first term is the reducible error and Var(ε) is the irreducible error. Here E(Y − Ŷ)² represents the average, or expected value, of the squared difference between the predicted and actual value of Y, and Var(ε) represents the variance associated with the error term ε.

The focus of this book is on techniques for estimating f with the aim of minimizing the reducible error. It is important to keep in mind that the irreducible error will always provide an upper bound on the accuracy of our prediction for Y. This bound is almost always unknown in practice.

Inference

We are often interested in understanding the way that Y is affected as X₁, ..., Xₚ change. In this situation we wish to estimate f, but our goal is not necessarily to make predictions for Y. We instead want to understand the relationship between X and Y, or more specifically, to understand how Y changes as a function of X₁, ..., Xₚ. Now f̂ cannot be treated as a black box, because we need to know its exact form. In this setting, one may be interested in answering the following questions:

Which predictors are associated with the response? It is often the case that only a small fraction of the available predictors are substantially associated with Y. Identifying the few important predictors among a large set of possible variables can be extremely useful, depending on the application.

What is the relationship between the response and each predictor? Some predictors may have a positive relationship with Y, in the sense that increasing the predictor is associated with increasing values of Y. Other predictors may have the opposite relationship. Depending on the complexity of f, the relationship between the response and a given predictor may also depend on the values of the other predictors.

Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated? Historically, most methods for estimating f have taken a linear form. In some situations, such an assumption is reasonable or even desirable. But often the true relationship is more complicated, in which case a linear model may not provide an accurate representation of the relationship between the input and output variables.
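Returning to the prediction setting for a moment, the decomposition in (2.3) is easy to verify numerically. The sketch below (a simulation under an assumed f, an assumed imperfect f̂, and Gaussian ε) compares the average squared prediction error at a fixed x₀ to the sum of the reducible and irreducible parts:

    set.seed(2)
    # Numerical check of (2.3) at a fixed x0, holding fhat and X fixed.
    f     <- function(x) 20 + 3 * sin(x / 3)   # assumed true f
    fhat  <- function(x) 22 + 0.1 * x          # an assumed, imperfect estimate
    x0    <- 15
    sigma <- 2
    y0 <- f(x0) + rnorm(1e5, mean = 0, sd = sigma)  # many draws of Y at x0
    mean((y0 - fhat(x0))^2)            # left-hand side: E(Y - Yhat)^2
    (f(x0) - fhat(x0))^2 + sigma^2     # reducible + irreducible; nearly equal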

In this book, we will see a number of examples that fall into the prediction setting, the inference setting, or a combination of the two.

For instance, consider a company that is interested in conducting a direct-marketing campaign. The goal is to identify individuals who will respond positively to a mailing, based on observations of demographic variables measured on each individual. In this case, the demographic variables serve as predictors, and response to the marketing campaign (either positive or negative) serves as the outcome. The company is not interested in obtaining a deep understanding of the relationships between each individual predictor and the response; instead, the company simply wants an accurate model to predict the response using the predictors. This is an example of modeling for prediction.

In contrast, consider the Advertising data illustrated in Figure 2.1. One may be interested in answering questions such as: Which media contribute to sales? Which media generate the biggest boost in sales? or How much increase in sales is associated with a given increase in TV advertising? This situation falls into the inference paradigm. Another example involves modeling the brand of a product that a customer might purchase based on variables such as price, store location, discount levels, competition price, and so forth. In this situation one might really be most interested in how each of the individual variables affects the probability of purchase. For instance, what effect will changing the price of a product have on sales? This is an example of modeling for inference.

Finally, some modeling could be conducted both for prediction and inference. For example, in a real estate setting, one may seek to relate values of homes to inputs such as crime rate, zoning, distance from a river, air quality, schools, income level of community, size of houses, and so forth. In this case one might be interested in how the individual input variables affect the prices, that is, how much extra will a house be worth if it has a view of the river? This is an inference problem. Alternatively, one may simply be interested in predicting the value of a home given its characteristics: is this house under- or over-valued? This is a prediction problem.

Depending on whether our ultimate goal is prediction, inference, or a combination of the two, different methods for estimating f may be appropriate. For example, linear models allow for relatively simple and interpretable inference, but may not yield as accurate predictions as some other approaches. In contrast, some of the highly non-linear approaches that we discuss in the later chapters of this book can potentially provide quite accurate predictions for Y, but this comes at the expense of a less interpretable model for which inference is more challenging.

2.1.2 How Do We Estimate f?

Throughout this book, we explore many linear and non-linear approaches for estimating f. However, these methods generally share certain characteristics. We provide an overview of these shared characteristics in this section. We will always assume that we have observed a set of n different data points. For example, in Figure 2.2 we observed n = 30 data points. These observations are called the training data because we will use these observations to train, or teach, our method how to estimate f. Let xᵢⱼ represent the value of the jth predictor, or input, for observation i, where i = 1, 2, ..., n and j = 1, 2, ..., p. Correspondingly, let yᵢ represent the response variable for the ith observation. Then our training data consist of {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)} where xᵢ = (xᵢ₁, xᵢ₂, ..., xᵢₚ)ᵀ.

Our goal is to apply a statistical learning method to the training data in order to estimate the unknown function f. In other words, we want to find a function f̂ such that Y ≈ f̂(X) for any observation (X, Y). Broadly speaking, most statistical learning methods for this task can be characterized as either parametric or non-parametric. We now briefly discuss these two types of approaches.

Parametric Methods

Parametric methods involve a two-step model-based approach.

1. First, we make an assumption about the functional form, or shape, of f. For example, one very simple assumption is that f is linear in X:

f(X) = β₀ + β₁X₁ + β₂X₂ + ··· + βₚXₚ.   (2.4)

This is a linear model, which will be discussed extensively in Chapter 3. Once we have assumed that f is linear, the problem of estimating f is greatly simplified. Instead of having to estimate an entirely arbitrary p-dimensional function f(X), one only needs to estimate the p + 1 coefficients β₀, β₁, ..., βₚ.

2. After a model has been selected, we need a procedure that uses the training data to fit or train the model. In the case of the linear model (2.4), we need to estimate the parameters β₀, β₁, ..., βₚ. That is, we want to find values of these parameters such that

Y ≈ β₀ + β₁X₁ + β₂X₂ + ··· + βₚXₚ.

The most common approach to fitting the model (2.4) is referred to as (ordinary) least squares, which we discuss in Chapter 3. However, least squares is one of many possible ways to fit the linear model. In Chapter 6, we discuss other approaches for estimating the parameters in (2.4).

The model-based approach just described is referred to as parametric; it reduces the problem of estimating f down to one of estimating a set of parameters.

FIGURE 2.4. A linear model fit by least squares to the Income data from Figure 2.3. The observations are shown in red, and the yellow plane indicates the least squares fit to the data.

Assuming a parametric form for f simplifies the problem of estimating f because it is generally much easier to estimate a set of parameters, such as β₀, β₁, ..., βₚ in the linear model (2.4), than it is to fit an entirely arbitrary function f. The potential disadvantage of a parametric approach is that the model we choose will usually not match the true unknown form of f. If the chosen model is too far from the true f, then our estimate will be poor. We can try to address this problem by choosing flexible models that can fit many different possible functional forms for f. But in general, fitting a more flexible model requires estimating a greater number of parameters. These more complex models can lead to a phenomenon known as overfitting the data, which essentially means they follow the errors, or noise, too closely. These issues are discussed throughout this book.

Figure 2.4 shows an example of the parametric approach applied to the Income data from Figure 2.3. We have fit a linear model of the form

income ≈ β₀ + β₁ × education + β₂ × seniority.

Since we have assumed a linear relationship between the response and the two predictors, the entire fitting problem reduces to estimating β₀, β₁, and β₂, which we do using least squares linear regression. Comparing Figure 2.3 to Figure 2.4, we can see that the linear fit given in Figure 2.4 is not quite right: the true f has some curvature that is not captured in the linear fit. However, the linear fit still appears to do a reasonable job of capturing the positive relationship between years of education and income, as well as the slightly less positive relationship between seniority and income.
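In R, the least squares fit above is a one-liner. A minimal sketch, assuming an Income data frame whose columns are named income, education, and seniority (the column names are assumptions):

    # Least squares fit of the parametric model
    #   income ~ beta0 + beta1*education + beta2*seniority
    fit <- lm(income ~ education + seniority, data = Income)
    coef(fit)   # the three estimated parameters beta0, beta1, beta2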

FIGURE 2.5. A smooth thin-plate spline fit to the Income data from Figure 2.3 is shown in yellow; the observations are displayed in red. Splines are discussed in Chapter 7.

It may be that with such a small number of observations, this is the best we can do.

Non-parametric Methods

Non-parametric methods do not make explicit assumptions about the functional form of f. Instead they seek an estimate of f that gets as close to the data points as possible without being too rough or wiggly. Such approaches can have a major advantage over parametric approaches: by avoiding the assumption of a particular functional form for f, they have the potential to accurately fit a wider range of possible shapes for f. Any parametric approach brings with it the possibility that the functional form used to estimate f is very different from the true f, in which case the resulting model will not fit the data well. In contrast, non-parametric approaches completely avoid this danger, since essentially no assumption about the form of f is made. But non-parametric approaches do suffer from a major disadvantage: since they do not reduce the problem of estimating f to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for f.

An example of a non-parametric approach to fitting the Income data is shown in Figure 2.5. A thin-plate spline is used to estimate f. This approach does not impose any pre-specified model on f. It instead attempts to produce an estimate for f that is as close as possible to the observed data, subject to the fit, that is, the yellow surface in Figure 2.5, being smooth.
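One way to produce a fit of this kind in R is a thin plate regression spline from the mgcv package. This is an illustrative substitute under the same column-name assumptions as before, not necessarily the implementation behind Figure 2.5:

    library(mgcv)
    # Non-parametric fit: no pre-specified form for f, only a smoothness
    # constraint on the estimated surface (bs = "tp" is a thin plate basis).
    fit <- gam(income ~ s(education, seniority, bs = "tp"), data = Income)
    vis.gam(fit, view = c("education", "seniority"))  # plot the fitted surface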

FIGURE 2.6. A rough thin-plate spline fit to the Income data from Figure 2.3. This fit makes zero errors on the training data.

In this case, the non-parametric fit has produced a remarkably accurate estimate of the true f shown in Figure 2.3. In order to fit a thin-plate spline, the data analyst must select a level of smoothness. Figure 2.6 shows the same thin-plate spline fit using a lower level of smoothness, allowing for a rougher fit. The resulting estimate fits the observed data perfectly! However, the spline fit shown in Figure 2.6 is far more variable than the true function f, from Figure 2.3. This is an example of overfitting the data, which we discussed previously. It is an undesirable situation because the fit obtained will not yield accurate estimates of the response on new observations that were not part of the original training data set. We discuss methods for choosing the correct amount of smoothness in Chapter 5. Splines are discussed in Chapter 7.

As we have seen, there are advantages and disadvantages to parametric and non-parametric methods for statistical learning. We explore both types of methods throughout this book.

2.1.3 The Trade-Off Between Prediction Accuracy and Model Interpretability

Of the many methods that we examine in this book, some are less flexible, or more restrictive, in the sense that they can produce just a relatively small range of shapes to estimate f. For example, linear regression is a relatively inflexible approach, because it can only generate linear functions such as the lines shown in Figure 2.1 or the plane shown in Figure 2.3.

FIGURE 2.7. A representation of the trade-off between flexibility and interpretability, using different statistical learning methods. In general, as the flexibility of a method increases, its interpretability decreases. (The methods pictured range from subset selection and the lasso at the most interpretable, least flexible end, through least squares, generalized additive models, and trees, to bagging, boosting, and support vector machines at the most flexible, least interpretable end.)

Other methods, such as the thin plate splines shown in Figures 2.5 and 2.6, are considerably more flexible because they can generate a much wider range of possible shapes to estimate f.

One might reasonably ask the following question: why would we ever choose to use a more restrictive method instead of a very flexible approach? There are several reasons that we might prefer a more restrictive model. If we are mainly interested in inference, then restrictive models are much more interpretable. For instance, when inference is the goal, the linear model may be a good choice since it will be quite easy to understand the relationship between Y and X₁, X₂, ..., Xₚ. In contrast, very flexible approaches, such as the splines discussed in Chapter 7 and displayed in Figures 2.5 and 2.6, and the boosting methods discussed in Chapter 8, can lead to such complicated estimates of f that it is difficult to understand how any individual predictor is associated with the response.

Figure 2.7 provides an illustration of the trade-off between flexibility and interpretability for some of the methods that we cover in this book. Least squares linear regression, discussed in Chapter 3, is relatively inflexible but is quite interpretable. The lasso, discussed in Chapter 6, relies upon the linear model (2.4) but uses an alternative fitting procedure for estimating the coefficients β₀, β₁, ..., βₚ. The new procedure is more restrictive in estimating the coefficients, and sets a number of them to exactly zero. Hence in this sense the lasso is a less flexible approach than linear regression. It is also more interpretable than linear regression, because in the final model the response variable will only be related to a small subset of the predictors, namely, those with nonzero coefficient estimates.
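For readers who want to see what such a fit looks like in code, the lasso is available in R through, for example, the glmnet package. This is a sketch only; the lasso itself is covered in Chapter 6, and the x matrix and y vector below are invented toy data:

    library(glmnet)
    # Lasso: least squares plus a penalty that forces some coefficients
    # to be exactly zero, yielding a sparser, more interpretable model.
    x <- matrix(rnorm(100 * 10), ncol = 10)  # toy predictor matrix
    y <- x[, 1] - 2 * x[, 2] + rnorm(100)    # response uses only 2 predictors
    fit <- glmnet(x, y, alpha = 1)           # alpha = 1 selects the lasso
    coef(fit, s = 0.1)  # coefficients at one penalty value; many are exactly 0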

Generalized additive models (GAMs), discussed in Chapter 7, instead extend the linear model (2.4) to allow for certain non-linear relationships. Consequently, GAMs are more flexible than linear regression. They are also somewhat less interpretable than linear regression, because the relationship between each predictor and the response is now modeled using a curve. Finally, fully non-linear methods such as bagging, boosting, and support vector machines with non-linear kernels, discussed in Chapters 8 and 9, are highly flexible approaches that are harder to interpret.

We have established that when inference is the goal, there are clear advantages to using simple and relatively inflexible statistical learning methods. In some settings, however, we are only interested in prediction, and the interpretability of the predictive model is simply not of interest. For instance, if we seek to develop an algorithm to predict the price of a stock, our sole requirement for the algorithm is that it predict accurately; interpretability is not a concern. In this setting, we might expect that it will be best to use the most flexible model available. Surprisingly, this is not always the case! We will often obtain more accurate predictions using a less flexible method. This phenomenon, which may seem counterintuitive at first glance, has to do with the potential for overfitting in highly flexible methods. We saw an example of overfitting in Figure 2.6. We will discuss this very important concept further in Section 2.2 and throughout this book.

2.1.4 Supervised Versus Unsupervised Learning

Most statistical learning problems fall into one of two categories: supervised or unsupervised. The examples that we have discussed so far in this chapter all fall into the supervised learning domain. For each observation of the predictor measurement(s) xᵢ, i = 1, ..., n there is an associated response measurement yᵢ. We wish to fit a model that relates the response to the predictors, with the aim of accurately predicting the response for future observations (prediction) or better understanding the relationship between the response and the predictors (inference). Many classical statistical learning methods such as linear regression and logistic regression (Chapter 4), as well as more modern approaches such as GAMs, boosting, and support vector machines, operate in the supervised learning domain. The vast majority of this book is devoted to this setting.

In contrast, unsupervised learning describes the somewhat more challenging situation in which for every observation i = 1, ..., n, we observe a vector of measurements xᵢ but no associated response yᵢ. It is not possible to fit a linear regression model, since there is no response variable to predict. In this setting, we are in some sense working blind; the situation is referred to as unsupervised because we lack a response variable that can supervise our analysis. What sort of statistical analysis is possible?

FIGURE 2.8. A clustering data set involving three groups. Each group is shown using a different colored symbol. Left: The three groups are well-separated. In this setting, a clustering approach should successfully identify the three groups. Right: There is some overlap among the groups. Now the clustering task is more challenging.

We can seek to understand the relationships between the variables or between the observations. One statistical learning tool that we may use in this setting is cluster analysis, or clustering. The goal of cluster analysis is to ascertain, on the basis of x₁, ..., xₙ, whether the observations fall into relatively distinct groups. For example, in a market segmentation study we might observe multiple characteristics (variables) for potential customers, such as zip code, family income, and shopping habits. We might believe that the customers fall into different groups, such as big spenders versus low spenders. If the information about each customer's spending patterns were available, then a supervised analysis would be possible. However, this information is not available, that is, we do not know whether each potential customer is a big spender or not. In this setting, we can try to cluster the customers on the basis of the variables measured, in order to identify distinct groups of potential customers. Identifying such groups can be of interest because it might be that the groups differ with respect to some property of interest, such as spending habits.

Figure 2.8 provides a simple illustration of the clustering problem. We have plotted 150 observations with measurements on two variables, X₁ and X₂. Each observation corresponds to one of three distinct groups. For illustrative purposes, we have plotted the members of each group using different colors and symbols. However, in practice the group memberships are unknown, and the goal is to determine the group to which each observation belongs. In the left-hand panel of Figure 2.8, this is a relatively easy task because the groups are well-separated. In contrast, the right-hand panel illustrates a more challenging problem in which there is some overlap between the groups.
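A setting like Figure 2.8 is easy to recreate: simulate three groups, then cluster without using the group labels. The sketch below uses K-means clustering, one of the unsupervised methods discussed in Chapter 10 (the group centers and sizes are arbitrary choices):

    set.seed(3)
    # 150 observations on two variables, in three groups of 50.
    X <- rbind(cbind(rnorm(50, mean = 0), rnorm(50, mean = 0)),
               cbind(rnorm(50, mean = 3), rnorm(50, mean = 3)),
               cbind(rnorm(50, mean = 0), rnorm(50, mean = 3)))
    # Cluster without the labels; in practice the memberships are unknown.
    km <- kmeans(X, centers = 3, nstart = 20)
    plot(X, col = km$cluster)   # the groups recovered by clustering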

A clustering method could not be expected to assign all of the overlapping points to their correct group (blue, green, or orange). In the examples shown in Figure 2.8, there are only two variables, and so one can simply visually inspect the scatterplots of the observations in order to identify clusters. However, in practice, we often encounter data sets that contain many more than two variables. In this case, we cannot easily plot the observations. For instance, if there are p variables in our data set, then p(p − 1)/2 distinct scatterplots can be made, and visual inspection is simply not a viable way to identify clusters. For this reason, automated clustering methods are important. We discuss clustering and other unsupervised learning approaches in Chapter 10.

Many problems fall naturally into the supervised or unsupervised learning paradigms. However, sometimes the question of whether an analysis should be considered supervised or unsupervised is less clear-cut. For instance, suppose that we have a set of n observations. For m of the observations, where m < n, we have both predictor measurements and a response measurement. For the remaining n − m observations, we have predictor measurements but no response measurement. Such a scenario can arise if the predictors can be measured relatively cheaply but the corresponding responses are much more expensive to collect. We refer to this setting as a semi-supervised learning problem. In this setting, we wish to use a statistical learning method that can incorporate the m observations for which response measurements are available as well as the n − m observations for which they are not. Although this is an interesting topic, it is beyond the scope of this book.

2.1.5 Regression Versus Classification Problems

Variables can be characterized as either quantitative or qualitative (also known as categorical). Quantitative variables take on numerical values. Examples include a person's age, height, or income, the value of a house, and the price of a stock. In contrast, qualitative variables take on values in one of K different classes, or categories. Examples of qualitative variables include a person's gender (male or female), the brand of product purchased (brand A, B, or C), whether a person defaults on a debt (yes or no), or a cancer diagnosis (Acute Myelogenous Leukemia, Acute Lymphoblastic Leukemia, or No Leukemia). We tend to refer to problems with a quantitative response as regression problems, while those involving a qualitative response are often referred to as classification problems. However, the distinction is not always that crisp. Least squares linear regression (Chapter 3) is used with a quantitative response, whereas logistic regression (Chapter 4) is typically used with a qualitative (two-class, or binary) response. As such it is often used as a classification method. But since it estimates class probabilities, it can be thought of as a regression method as well.

Some statistical methods, such as K-nearest neighbors (Chapters 2 and 4) and boosting (Chapter 8), can be used in the case of either quantitative or qualitative responses.

We tend to select statistical learning methods on the basis of whether the response is quantitative or qualitative; i.e. we might use linear regression when quantitative and logistic regression when qualitative. However, whether the predictors are qualitative or quantitative is generally considered less important. Most of the statistical learning methods discussed in this book can be applied regardless of the predictor variable type, provided that any qualitative predictors are properly coded before the analysis is performed. This is discussed in Chapter 3.

2.2 Assessing Model Accuracy

One of the key aims of this book is to introduce the reader to a wide range of statistical learning methods that extend far beyond the standard linear regression approach. Why is it necessary to introduce so many different statistical learning approaches, rather than just a single best method? There is no free lunch in statistics: no one method dominates all others over all possible data sets. On a particular data set, one specific method may work best, but some other method may work better on a similar but different data set. Hence it is an important task to decide for any given set of data which method produces the best results. Selecting the best approach can be one of the most challenging parts of performing statistical learning in practice.

In this section, we discuss some of the most important concepts that arise in selecting a statistical learning procedure for a specific data set. As the book progresses, we will explain how the concepts presented here can be applied in practice.

2.2.1 Measuring the Quality of Fit

In order to evaluate the performance of a statistical learning method on a given data set, we need some way to measure how well its predictions actually match the observed data. That is, we need to quantify the extent to which the predicted response value for a given observation is close to the true response value for that observation. In the regression setting, the most commonly-used measure is the mean squared error (MSE), given by

MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − f̂(xᵢ))²,   (2.5)

where f̂(xᵢ) is the prediction that f̂ gives for the ith observation.
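Equation (2.5) translates directly into a couple of lines of R; a self-contained sketch on simulated data (the data-generating choices are arbitrary):

    set.seed(4)
    # Simulate training data, fit a model, and compute the training MSE (2.5).
    x <- runif(100)
    y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
    fit <- lm(y ~ poly(x, 4))
    mean((y - fitted(fit))^2)   # (1/n) * sum of (y_i - fhat(x_i))^2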

The MSE will be small if the predicted responses are very close to the true responses, and will be large if for some of the observations, the predicted and true responses differ substantially.

The MSE in (2.5) is computed using the training data that was used to fit the model, and so should more accurately be referred to as the training MSE. But in general, we do not really care how well the method works on the training data. Rather, we are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data. Why is this what we care about? Suppose that we are interested in developing an algorithm to predict a stock's price based on previous stock returns. We can train the method using stock returns from the past 6 months. But we don't really care how well our method predicts last week's stock price. We instead care about how well it will predict tomorrow's price or next month's price. On a similar note, suppose that we have clinical measurements (e.g. weight, blood pressure, height, age, family history of disease) for a number of patients, as well as information about whether each patient has diabetes. We can use these patients to train a statistical learning method to predict risk of diabetes based on clinical measurements. In practice, we want this method to accurately predict diabetes risk for future patients based on their clinical measurements. We are not very interested in whether or not the method accurately predicts diabetes risk for patients used to train the model, since we already know which of those patients have diabetes.

To state it more mathematically, suppose that we fit our statistical learning method on our training observations {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}, and we obtain the estimate f̂. We can then compute f̂(x₁), f̂(x₂), ..., f̂(xₙ). If these are approximately equal to y₁, y₂, ..., yₙ, then the training MSE given by (2.5) is small. However, we are really not interested in whether f̂(xᵢ) ≈ yᵢ; instead, we want to know whether f̂(x₀) is approximately equal to y₀, where (x₀, y₀) is a previously unseen test observation not used to train the statistical learning method. We want to choose the method that gives the lowest test MSE, as opposed to the lowest training MSE. In other words, if we had a large number of test observations, we could compute

Ave(f̂(x₀) − y₀)²,   (2.6)

the average squared prediction error for these test observations (x₀, y₀). We'd like to select the model for which the average of this quantity, the test MSE, is as small as possible.

How can we go about trying to select a method that minimizes the test MSE? In some settings, we may have a test data set available, that is, we may have access to a set of observations that were not used to train the statistical learning method. We can then simply evaluate (2.6) on the test observations, and select the learning method for which the test MSE is smallest.

FIGURE 2.9. Left: Data simulated from f, shown in black. Three estimates of f are shown: the linear regression line (orange curve), and two smoothing spline fits (blue and green curves). Right: Training MSE (grey curve), test MSE (red curve), and minimum possible test MSE over all methods (dashed line). Squares represent the training and test MSEs for the three fits shown in the left-hand panel.

But what if no test observations are available? In that case, one might imagine simply selecting a statistical learning method that minimizes the training MSE (2.5). This seems like it might be a sensible approach, since the training MSE and the test MSE appear to be closely related. Unfortunately, there is a fundamental problem with this strategy: there is no guarantee that the method with the lowest training MSE will also have the lowest test MSE. Roughly speaking, the problem is that many statistical methods specifically estimate coefficients so as to minimize the training set MSE. For these methods, the training set MSE can be quite small, but the test MSE is often much larger.

Figure 2.9 illustrates this phenomenon on a simple example. In the left-hand panel of Figure 2.9, we have generated observations from (2.1) with the true f given by the black curve. The orange, blue and green curves illustrate three possible estimates for f obtained using methods with increasing levels of flexibility. The orange line is the linear regression fit, which is relatively inflexible. The blue and green curves were produced using smoothing splines, discussed in Chapter 7, with different levels of smoothness. It is clear that as the level of flexibility increases, the curves fit the observed data more closely. The green curve is the most flexible and matches the data very well; however, we observe that it fits the true f (shown in black) poorly because it is too wiggly. By adjusting the level of flexibility of the smoothing spline fit, we can produce many different fits to this data.

We now move on to the right-hand panel of Figure 2.9. The grey curve displays the average training MSE as a function of flexibility, or more formally the degrees of freedom, for a number of smoothing splines. The degrees of freedom is a quantity that summarizes the flexibility of a curve; it is discussed more fully in Chapter 7. The orange, blue and green squares indicate the MSEs associated with the corresponding curves in the left-hand panel. A more restricted and hence smoother curve has fewer degrees of freedom than a wiggly curve; note that in Figure 2.9, linear regression is at the most restrictive end, with two degrees of freedom. The training MSE declines monotonically as flexibility increases. In this example the true f is non-linear, and so the orange linear fit is not flexible enough to estimate f well. The green curve has the lowest training MSE of all three methods, since it corresponds to the most flexible of the three curves fit in the left-hand panel.

In this example, we know the true function f, and so we can also compute the test MSE over a very large test set, as a function of flexibility. (Of course, in general f is unknown, so this will not be possible.) The test MSE is displayed using the red curve in the right-hand panel of Figure 2.9. As with the training MSE, the test MSE initially declines as the level of flexibility increases. However, at some point the test MSE levels off and then starts to increase again. Consequently, the orange and green curves both have high test MSE. The blue curve minimizes the test MSE, which should not be surprising given that visually it appears to estimate f the best in the left-hand panel of Figure 2.9. The horizontal dashed line indicates Var(ε), the irreducible error in (2.3), which corresponds to the lowest achievable test MSE among all possible methods. Hence, the smoothing spline represented by the blue curve is close to optimal.

In the right-hand panel of Figure 2.9, as the flexibility of the statistical learning method increases, we observe a monotone decrease in the training MSE and a U-shape in the test MSE. This is a fundamental property of statistical learning that holds regardless of the particular data set at hand and regardless of the statistical method being used. As model flexibility increases, training MSE will decrease, but the test MSE may not. When a given method yields a small training MSE but a large test MSE, we are said to be overfitting the data. This happens because our statistical learning procedure is working too hard to find patterns in the training data, and may be picking up some patterns that are just caused by random chance rather than by true properties of the unknown function f. When we overfit the training data, the test MSE will be very large because the supposed patterns that the method found in the training data simply don't exist in the test data. Note that regardless of whether or not overfitting has occurred, we almost always expect the training MSE to be smaller than the test MSE because most statistical learning methods either directly or indirectly seek to minimize the training MSE. Overfitting refers specifically to the case in which a less flexible model would have yielded a smaller test MSE.
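The monotone decrease in training MSE and the U-shape in test MSE are easy to reproduce on simulated data, where the true f is known. A sketch fitting smoothing splines of increasing degrees of freedom and evaluating both (2.5) on the training set and (2.6) on a large test set (all simulation choices are arbitrary):

    set.seed(5)
    f <- function(x) sin(2 * pi * x)                  # assumed true f
    x  <- runif(200);   y  <- f(x)  + rnorm(200,   sd = 0.3)  # training data
    x0 <- runif(10000); y0 <- f(x0) + rnorm(10000, sd = 0.3)  # test data
    for (df in c(2, 5, 10, 25)) {                     # increasing flexibility
      fit <- smooth.spline(x, y, df = df)
      cat("df =", df,
          " train MSE:", mean((y  - predict(fit, x)$y)^2),
          " test MSE:",  mean((y0 - predict(fit, x0)$y)^2), "\n")
    }
    # Training MSE falls steadily with df; test MSE traces out a U-shape.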

FIGURE 2.10. Details are as in Figure 2.9, using a different true f that is much closer to linear. In this setting, linear regression provides a very good fit to the data.

Figure 2.10 provides another example in which the true f is approximately linear. Again we observe that the training MSE decreases monotonically as the model flexibility increases, and that there is a U-shape in the test MSE. However, because the truth is close to linear, the test MSE only decreases slightly before increasing again, so that the orange least squares fit is substantially better than the highly flexible green curve. Finally, Figure 2.11 displays an example in which f is highly non-linear. The training and test MSE curves still exhibit the same general patterns, but now there is a rapid decrease in both curves before the test MSE starts to increase slowly.

In practice, one can usually compute the training MSE with relative ease, but estimating test MSE is considerably more difficult because usually no test data are available. As the previous three examples illustrate, the flexibility level corresponding to the model with the minimal test MSE can vary considerably among data sets. Throughout this book, we discuss a variety of approaches that can be used in practice to estimate this minimum point. One important method is cross-validation (Chapter 5), which is a method for estimating test MSE using the training data.

2.2.2 The Bias-Variance Trade-Off

The U-shape observed in the test MSE curves (Figures 2.9–2.11) turns out to be the result of two competing properties of statistical learning methods. Though the mathematical proof is beyond the scope of this book, it is possible to show that the expected test MSE, for a given value x₀, can always be decomposed into the sum of three fundamental quantities: the variance of f̂(x₀), the squared bias of f̂(x₀), and the variance of the error terms ε.

FIGURE 2.11. Details are as in Figure 2.9, using a different f that is far from linear. In this setting, linear regression provides a very poor fit to the data.

That is,

E(y₀ − f̂(x₀))² = Var(f̂(x₀)) + [Bias(f̂(x₀))]² + Var(ε).   (2.7)

Here the notation E(y₀ − f̂(x₀))² defines the expected test MSE, and refers to the average test MSE that we would obtain if we repeatedly estimated f using a large number of training sets, and tested each at x₀. The overall expected test MSE can be computed by averaging E(y₀ − f̂(x₀))² over all possible values of x₀ in the test set.

Equation 2.7 tells us that in order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves low variance and low bias. Note that variance is inherently a nonnegative quantity, and squared bias is also nonnegative. Hence, we see that the expected test MSE can never lie below Var(ε), the irreducible error from (2.3).

What do we mean by the variance and bias of a statistical learning method? Variance refers to the amount by which f̂ would change if we estimated it using a different training data set. Since the training data are used to fit the statistical learning method, different training data sets will result in a different f̂. But ideally the estimate for f should not vary too much between training sets. However, if a method has high variance then small changes in the training data can result in large changes in f̂. In general, more flexible statistical methods have higher variance. Consider the green and orange curves in Figure 2.9.

The flexible green curve is following the observations very closely. It has high variance because changing any one of these data points may cause the estimate f̂ to change considerably. In contrast, the orange least squares line is relatively inflexible and has low variance, because moving any single observation will likely cause only a small shift in the position of the line.

On the other hand, bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. For example, linear regression assumes that there is a linear relationship between Y and X₁, X₂, ..., Xₚ. It is unlikely that any real-life problem truly has such a simple linear relationship, and so performing linear regression will undoubtedly result in some bias in the estimate of f. In Figure 2.11, the true f is substantially non-linear, so no matter how many training observations we are given, it will not be possible to produce an accurate estimate using linear regression. In other words, linear regression results in high bias in this example. However, in Figure 2.10 the true f is very close to linear, and so given enough data, it should be possible for linear regression to produce an accurate estimate. Generally, more flexible methods result in less bias.

As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease. The relative rate of change of these two quantities determines whether the test MSE increases or decreases. As we increase the flexibility of a class of methods, the bias tends to initially decrease faster than the variance increases. Consequently, the expected test MSE declines. However, at some point increasing flexibility has little impact on the bias but starts to significantly increase the variance. When this happens the test MSE increases. Note that we observed this pattern of decreasing test MSE followed by increasing test MSE in the right-hand panels of Figures 2.9–2.11.

The three plots in Figure 2.12 illustrate Equation 2.7 for the examples in Figures 2.9–2.11. In each case the blue solid curve represents the squared bias, for different levels of flexibility, while the orange curve corresponds to the variance. The horizontal dashed line represents Var(ε), the irreducible error. Finally, the red curve, corresponding to the test set MSE, is the sum of these three quantities. In all three cases, the variance increases and the bias decreases as the method's flexibility increases. However, the flexibility level corresponding to the optimal test MSE differs considerably among the three data sets, because the squared bias and variance change at different rates in each of the data sets. In the left-hand panel of Figure 2.12, the bias initially decreases rapidly, resulting in an initial sharp decrease in the expected test MSE. On the other hand, in the center panel of Figure 2.12 the true f is close to linear, so there is only a small decrease in bias as flexibility increases, and the test MSE only declines slightly before increasing rapidly as the variance increases. Finally, in the right-hand panel of Figure 2.12, as flexibility increases, there is a dramatic decline in bias because the true f is very non-linear.

FIGURE 2.12. Squared bias (blue curve), variance (orange curve), Var(ε) (dashed line), and test MSE (red curve) for the three data sets in Figures 2.9–2.11. The vertical dotted line indicates the flexibility level corresponding to the smallest test MSE.

There is also very little increase in variance as flexibility increases. Consequently, the test MSE declines substantially before experiencing a small increase as model flexibility increases.

The relationship between bias, variance, and test set MSE given in Equation 2.7 and displayed in Figure 2.12 is referred to as the bias-variance trade-off. Good test set performance of a statistical learning method requires low variance as well as low squared bias. This is referred to as a trade-off because it is easy to obtain a method with extremely low bias but high variance (for instance, by drawing a curve that passes through every single training observation) or a method with very low variance but high bias (by fitting a horizontal line to the data). The challenge lies in finding a method for which both the variance and the squared bias are low. This trade-off is one of the most important recurring themes in this book.

In a real-life situation in which f is unobserved, it is generally not possible to explicitly compute the test MSE, bias, or variance for a statistical learning method. Nevertheless, one should always keep the bias-variance trade-off in mind. In this book we explore methods that are extremely flexible and hence can essentially eliminate bias. However, this does not guarantee that they will outperform a much simpler method such as linear regression. To take an extreme example, suppose that the true f is linear. In this situation linear regression will have no bias, making it very hard for a more flexible method to compete. In contrast, if the true f is highly non-linear and we have an ample number of training observations, then we may do better using a highly flexible approach, as in Figure 2.11. In Chapter 5 we discuss cross-validation, which is a way to estimate the test MSE using the training data.
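In a simulation, where f is known, the three terms in (2.7) can be estimated directly: draw many training sets, fit on each, and record f̂(x₀). A sketch (again with arbitrary simulation choices):

    set.seed(6)
    f <- function(x) sin(2 * pi * x)   # assumed true f
    sigma <- 0.3; x0 <- 0.5            # evaluate the decomposition at x0
    fhat_x0 <- replicate(1000, {
      x <- runif(100)                  # a fresh training set each time
      y <- f(x) + rnorm(100, sd = sigma)
      predict(smooth.spline(x, y, df = 5), x0)$y
    })
    var(fhat_x0)                       # Var(fhat(x0))
    (mean(fhat_x0) - f(x0))^2          # squared bias of fhat(x0)
    var(fhat_x0) + (mean(fhat_x0) - f(x0))^2 + sigma^2  # approximates (2.7)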

2.2.3 The Classification Setting

Thus far, our discussion of model accuracy has been focused on the regression setting. But many of the concepts that we have encountered, such as the bias-variance trade-off, transfer over to the classification setting with only some modifications due to the fact that yᵢ is no longer numerical. Suppose that we seek to estimate f on the basis of training observations {(x₁, y₁), ..., (xₙ, yₙ)}, where now y₁, ..., yₙ are qualitative. The most common approach for quantifying the accuracy of our estimate f̂ is the training error rate, the proportion of mistakes that are made if we apply our estimate f̂ to the training observations:

(1/n) Σᵢ₌₁ⁿ I(yᵢ ≠ ŷᵢ).   (2.8)

Here ŷᵢ is the predicted class label for the ith observation using f̂. And I(yᵢ ≠ ŷᵢ) is an indicator variable that equals 1 if yᵢ ≠ ŷᵢ and zero if yᵢ = ŷᵢ. If I(yᵢ ≠ ŷᵢ) = 0 then the ith observation was classified correctly by our classification method; otherwise it was misclassified. Hence Equation 2.8 computes the fraction of incorrect classifications.

Equation 2.8 is referred to as the training error rate because it is computed based on the data that was used to train our classifier. As in the regression setting, we are most interested in the error rates that result from applying our classifier to test observations that were not used in training. The test error rate associated with a set of test observations of the form (x₀, y₀) is given by

Ave(I(y₀ ≠ ŷ₀)),   (2.9)

where ŷ₀ is the predicted class label that results from applying the classifier to the test observation with predictor x₀. A good classifier is one for which the test error (2.9) is smallest.

The Bayes Classifier

It is possible to show (though the proof is outside of the scope of this book) that the test error rate given in (2.9) is minimized, on average, by a very simple classifier that assigns each observation to the most likely class, given its predictor values. In other words, we should simply assign a test observation with predictor vector x₀ to the class j for which

Pr(Y = j | X = x₀)   (2.10)

is largest. Note that (2.10) is a conditional probability: it is the probability that Y = j, given the observed predictor vector x₀. This very simple classifier is called the Bayes classifier. In a two-class problem where there are only two possible response values, say class 1 or class 2, the Bayes classifier
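The error rates in (2.8) and (2.9) above amount to counting label disagreements; a minimal sketch with toy class labels (all values invented for illustration):

    # Training error rate (2.8): fraction of observations whose predicted
    # class label differs from the true label.
    y    <- c("class1", "class2", "class1", "class1", "class2")  # true labels
    yhat <- c("class1", "class2", "class2", "class1", "class2")  # predictions
    mean(y != yhat)   # indicator average: here 1 of 5 misclassified, 0.2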

Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeoff

Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeoff Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeff Reading: Chapter 2 STATS 202: Data mining and analysis September 27, 2017 1 / 20 Supervised vs. unsupervised learning In unsupervised

More information

What is Statistical Learning?

What is Statistical Learning? What is Statistical Learning? Sales 5 10 15 20 25 Sales 5 10 15 20 25 Sales 5 10 15 20 25 0 50 100 200 300 TV 0 10 20 30 40 50 Radi 0 20 40 60 80 100 Newspaper Shwn are Sales vs TV, Radi and Newspaper,

More information

Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeoff

Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeoff Lecture 2: Supervised vs. unsupervised learning, bias-variance tradeff Reading: Chapter 2 STATS 202: Data mining and analysis September 27, 2017 1 / 20 Supervised vs. unsupervised learning In unsupervised

More information

Simple Linear Regression (single variable)

Simple Linear Regression (single variable) Simple Linear Regressin (single variable) Intrductin t Machine Learning Marek Petrik January 31, 2017 Sme f the figures in this presentatin are taken frm An Intrductin t Statistical Learning, with applicatins

More information

Resampling Methods. Chapter 5. Chapter 5 1 / 52

Resampling Methods. Chapter 5. Chapter 5 1 / 52 Resampling Methds Chapter 5 Chapter 5 1 / 52 1 51 Validatin set apprach 2 52 Crss validatin 3 53 Btstrap Chapter 5 2 / 52 Abut Resampling An imprtant statistical tl Pretending the data as ppulatin and

More information

Bootstrap Method > # Purpose: understand how bootstrap method works > obs=c(11.96, 5.03, 67.40, 16.07, 31.50, 7.73, 11.10, 22.38) > n=length(obs) >

Bootstrap Method > # Purpose: understand how bootstrap method works > obs=c(11.96, 5.03, 67.40, 16.07, 31.50, 7.73, 11.10, 22.38) > n=length(obs) > Btstrap Methd > # Purpse: understand hw btstrap methd wrks > bs=c(11.96, 5.03, 67.40, 16.07, 31.50, 7.73, 11.10, 22.38) > n=length(bs) > mean(bs) [1] 21.64625 > # estimate f lambda > lambda = 1/mean(bs);

More information

Pattern Recognition 2014 Support Vector Machines

Pattern Recognition 2014 Support Vector Machines Pattern Recgnitin 2014 Supprt Vectr Machines Ad Feelders Universiteit Utrecht Ad Feelders ( Universiteit Utrecht ) Pattern Recgnitin 1 / 55 Overview 1 Separable Case 2 Kernel Functins 3 Allwing Errrs (Sft

More information

Resampling Methods. Cross-validation, Bootstrapping. Marek Petrik 2/21/2017

Resampling Methods. Cross-validation, Bootstrapping. Marek Petrik 2/21/2017 Resampling Methds Crss-validatin, Btstrapping Marek Petrik 2/21/2017 Sme f the figures in this presentatin are taken frm An Intrductin t Statistical Learning, with applicatins in R (Springer, 2013) with

More information

, which yields. where z1. and z2

, which yields. where z1. and z2 The Gaussian r Nrmal PDF, Page 1 The Gaussian r Nrmal Prbability Density Functin Authr: Jhn M Cimbala, Penn State University Latest revisin: 11 September 13 The Gaussian r Nrmal Prbability Density Functin

More information

4th Indian Institute of Astrophysics - PennState Astrostatistics School July, 2013 Vainu Bappu Observatory, Kavalur. Correlation and Regression

4th Indian Institute of Astrophysics - PennState Astrostatistics School July, 2013 Vainu Bappu Observatory, Kavalur. Correlation and Regression 4th Indian Institute f Astrphysics - PennState Astrstatistics Schl July, 2013 Vainu Bappu Observatry, Kavalur Crrelatin and Regressin Rahul Ry Indian Statistical Institute, Delhi. Crrelatin Cnsider a tw

More information

SUPPLEMENTARY MATERIAL GaGa: a simple and flexible hierarchical model for microarray data analysis

SUPPLEMENTARY MATERIAL GaGa: a simple and flexible hierarchical model for microarray data analysis SUPPLEMENTARY MATERIAL GaGa: a simple and flexible hierarchical mdel fr micrarray data analysis David Rssell Department f Bistatistics M.D. Andersn Cancer Center, Hustn, TX 77030, USA rsselldavid@gmail.cm

More information

Differentiation Applications 1: Related Rates

Differentiation Applications 1: Related Rates Differentiatin Applicatins 1: Related Rates 151 Differentiatin Applicatins 1: Related Rates Mdel 1: Sliding Ladder 10 ladder y 10 ladder 10 ladder A 10 ft ladder is leaning against a wall when the bttm

More information

CAUSAL INFERENCE. Technical Track Session I. Phillippe Leite. The World Bank

CAUSAL INFERENCE. Technical Track Session I. Phillippe Leite. The World Bank CAUSAL INFERENCE Technical Track Sessin I Phillippe Leite The Wrld Bank These slides were develped by Christel Vermeersch and mdified by Phillippe Leite fr the purpse f this wrkshp Plicy questins are causal

More information

CHAPTER 3 INEQUALITIES. Copyright -The Institute of Chartered Accountants of India

CHAPTER 3 INEQUALITIES. Copyright -The Institute of Chartered Accountants of India CHAPTER 3 INEQUALITIES Cpyright -The Institute f Chartered Accuntants f India INEQUALITIES LEARNING OBJECTIVES One f the widely used decisin making prblems, nwadays, is t decide n the ptimal mix f scarce

More information

Internal vs. external validity. External validity. This section is based on Stock and Watson s Chapter 9.

Internal vs. external validity. External validity. This section is based on Stock and Watson s Chapter 9. Sectin 7 Mdel Assessment This sectin is based n Stck and Watsn s Chapter 9. Internal vs. external validity Internal validity refers t whether the analysis is valid fr the ppulatin and sample being studied.

More information

Tree Structured Classifier

Tree Structured Classifier Tree Structured Classifier Reference: Classificatin and Regressin Trees by L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stne, Chapman & Hall, 98. A Medical Eample (CART): Predict high risk patients

More information

How do scientists measure trees? What is DBH?

How do scientists measure trees? What is DBH? Hw d scientists measure trees? What is DBH? Purpse Students develp an understanding f tree size and hw scientists measure trees. Students bserve and measure tree ckies and explre the relatinship between

More information

Kinetic Model Completeness

Kinetic Model Completeness 5.68J/10.652J Spring 2003 Lecture Ntes Tuesday April 15, 2003 Kinetic Mdel Cmpleteness We say a chemical kinetic mdel is cmplete fr a particular reactin cnditin when it cntains all the species and reactins

More information

CHAPTER 24: INFERENCE IN REGRESSION. Chapter 24: Make inferences about the population from which the sample data came.

CHAPTER 24: INFERENCE IN REGRESSION. Chapter 24: Make inferences about the population from which the sample data came. MATH 1342 Ch. 24 April 25 and 27, 2013 Page 1 f 5 CHAPTER 24: INFERENCE IN REGRESSION Chapters 4 and 5: Relatinships between tw quantitative variables. Be able t Make a graph (scatterplt) Summarize the

More information

AP Statistics Notes Unit Two: The Normal Distributions

AP Statistics Notes Unit Two: The Normal Distributions AP Statistics Ntes Unit Tw: The Nrmal Distributins Syllabus Objectives: 1.5 The student will summarize distributins f data measuring the psitin using quartiles, percentiles, and standardized scres (z-scres).

More information

The blessing of dimensionality for kernel methods

The blessing of dimensionality for kernel methods fr kernel methds Building classifiers in high dimensinal space Pierre Dupnt Pierre.Dupnt@ucluvain.be Classifiers define decisin surfaces in sme feature space where the data is either initially represented

More information

Part 3 Introduction to statistical classification techniques

Part 3 Introduction to statistical classification techniques Part 3 Intrductin t statistical classificatin techniques Machine Learning, Part 3, March 07 Fabi Rli Preamble ØIn Part we have seen that if we knw: Psterir prbabilities P(ω i / ) Or the equivalent terms

More information

IAML: Support Vector Machines

IAML: Support Vector Machines 1 / 22 IAML: Supprt Vectr Machines Charles Suttn and Victr Lavrenk Schl f Infrmatics Semester 1 2 / 22 Outline Separating hyperplane with maimum margin Nn-separable training data Epanding the input int

More information

Department of Economics, University of California, Davis Ecn 200C Micro Theory Professor Giacomo Bonanno. Insurance Markets

Department of Economics, University of California, Davis Ecn 200C Micro Theory Professor Giacomo Bonanno. Insurance Markets Department f Ecnmics, University f alifrnia, Davis Ecn 200 Micr Thery Prfessr Giacm Bnann Insurance Markets nsider an individual wh has an initial wealth f. ith sme prbability p he faces a lss f x (0

More information

MODULE FOUR. This module addresses functions. SC Academic Elementary Algebra Standards:

MODULE FOUR. This module addresses functions. SC Academic Elementary Algebra Standards: MODULE FOUR This mdule addresses functins SC Academic Standards: EA-3.1 Classify a relatinship as being either a functin r nt a functin when given data as a table, set f rdered pairs, r graph. EA-3.2 Use

More information

Comparing Several Means: ANOVA. Group Means and Grand Mean

Comparing Several Means: ANOVA. Group Means and Grand Mean STAT 511 ANOVA and Regressin 1 Cmparing Several Means: ANOVA Slide 1 Blue Lake snap beans were grwn in 12 pen-tp chambers which are subject t 4 treatments 3 each with O 3 and SO 2 present/absent. The ttal

More information

Chapter 3: Cluster Analysis

Chapter 3: Cluster Analysis Chapter 3: Cluster Analysis } 3.1 Basic Cncepts f Clustering 3.1.1 Cluster Analysis 3.1. Clustering Categries } 3. Partitining Methds 3..1 The principle 3.. K-Means Methd 3..3 K-Medids Methd 3..4 CLARA

More information

Weathering. Title: Chemical and Mechanical Weathering. Grade Level: Subject/Content: Earth and Space Science

Weathering. Title: Chemical and Mechanical Weathering. Grade Level: Subject/Content: Earth and Space Science Weathering Title: Chemical and Mechanical Weathering Grade Level: 9-12 Subject/Cntent: Earth and Space Science Summary f Lessn: Students will test hw chemical and mechanical weathering can affect a rck

More information

Math Foundations 20 Work Plan

Math Foundations 20 Work Plan Math Fundatins 20 Wrk Plan Units / Tpics 20.8 Demnstrate understanding f systems f linear inequalities in tw variables. Time Frame December 1-3 weeks 6-10 Majr Learning Indicatrs Identify situatins relevant

More information

COMP 551 Applied Machine Learning Lecture 11: Support Vector Machines

COMP 551 Applied Machine Learning Lecture 11: Support Vector Machines COMP 551 Applied Machine Learning Lecture 11: Supprt Vectr Machines Instructr: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/cmp551 Unless therwise nted, all material psted fr this curse

More information

Lead/Lag Compensator Frequency Domain Properties and Design Methods

Lead/Lag Compensator Frequency Domain Properties and Design Methods Lectures 6 and 7 Lead/Lag Cmpensatr Frequency Dmain Prperties and Design Methds Definitin Cnsider the cmpensatr (ie cntrller Fr, it is called a lag cmpensatr s K Fr s, it is called a lead cmpensatr Ntatin

More information

AP Statistics Practice Test Unit Three Exploring Relationships Between Variables. Name Period Date

AP Statistics Practice Test Unit Three Exploring Relationships Between Variables. Name Period Date AP Statistics Practice Test Unit Three Explring Relatinships Between Variables Name Perid Date True r False: 1. Crrelatin and regressin require explanatry and respnse variables. 1. 2. Every least squares

More information

3.4 Shrinkage Methods Prostate Cancer Data Example (Continued) Ridge Regression

3.4 Shrinkage Methods Prostate Cancer Data Example (Continued) Ridge Regression 3.3.4 Prstate Cancer Data Example (Cntinued) 3.4 Shrinkage Methds 61 Table 3.3 shws the cefficients frm a number f different selectin and shrinkage methds. They are best-subset selectin using an all-subsets

More information

Five Whys How To Do It Better

Five Whys How To Do It Better Five Whys Definitin. As explained in the previus article, we define rt cause as simply the uncvering f hw the current prblem came int being. Fr a simple causal chain, it is the entire chain. Fr a cmplex

More information

Activity Guide Loops and Random Numbers

Activity Guide Loops and Random Numbers Unit 3 Lessn 7 Name(s) Perid Date Activity Guide Lps and Randm Numbers CS Cntent Lps are a relatively straightfrward idea in prgramming - yu want a certain chunk f cde t run repeatedly - but it takes a

More information

k-nearest Neighbor How to choose k Average of k points more reliable when: Large k: noise in attributes +o o noise in class labels

k-nearest Neighbor How to choose k Average of k points more reliable when: Large k: noise in attributes +o o noise in class labels Mtivating Example Memry-Based Learning Instance-Based Learning K-earest eighbr Inductive Assumptin Similar inputs map t similar utputs If nt true => learning is impssible If true => learning reduces t

More information

5 th grade Common Core Standards

5 th grade Common Core Standards 5 th grade Cmmn Cre Standards In Grade 5, instructinal time shuld fcus n three critical areas: (1) develping fluency with additin and subtractin f fractins, and develping understanding f the multiplicatin

More information

COMP 551 Applied Machine Learning Lecture 5: Generative models for linear classification

COMP 551 Applied Machine Learning Lecture 5: Generative models for linear classification COMP 551 Applied Machine Learning Lecture 5: Generative mdels fr linear classificatin Instructr: Herke van Hf (herke.vanhf@mail.mcgill.ca) Slides mstly by: Jelle Pineau Class web page: www.cs.mcgill.ca/~hvanh2/cmp551

More information

CHM112 Lab Graphing with Excel Grading Rubric

CHM112 Lab Graphing with Excel Grading Rubric Name CHM112 Lab Graphing with Excel Grading Rubric Criteria Pints pssible Pints earned Graphs crrectly pltted and adhere t all guidelines (including descriptive title, prperly frmatted axes, trendline

More information

CS 477/677 Analysis of Algorithms Fall 2007 Dr. George Bebis Course Project Due Date: 11/29/2007

CS 477/677 Analysis of Algorithms Fall 2007 Dr. George Bebis Course Project Due Date: 11/29/2007 CS 477/677 Analysis f Algrithms Fall 2007 Dr. Gerge Bebis Curse Prject Due Date: 11/29/2007 Part1: Cmparisn f Srting Algrithms (70% f the prject grade) The bjective f the first part f the assignment is

More information

We say that y is a linear function of x if. Chapter 13: The Correlation Coefficient and the Regression Line

We say that y is a linear function of x if. Chapter 13: The Correlation Coefficient and the Regression Line Chapter 13: The Crrelatin Cefficient and the Regressin Line We begin with a sme useful facts abut straight lines. Recall the x, y crdinate system, as pictured belw. 3 2 1 y = 2.5 y = 0.5x 3 2 1 1 2 3 1

More information

B. Definition of an exponential

B. Definition of an exponential Expnents and Lgarithms Chapter IV - Expnents and Lgarithms A. Intrductin Starting with additin and defining the ntatins fr subtractin, multiplicatin and divisin, we discvered negative numbers and fractins.

More information

A Matrix Representation of Panel Data

A Matrix Representation of Panel Data web Extensin 6 Appendix 6.A A Matrix Representatin f Panel Data Panel data mdels cme in tw brad varieties, distinct intercept DGPs and errr cmpnent DGPs. his appendix presents matrix algebra representatins

More information

Experiment #3. Graphing with Excel

Experiment #3. Graphing with Excel Experiment #3. Graphing with Excel Study the "Graphing with Excel" instructins that have been prvided. Additinal help with learning t use Excel can be fund n several web sites, including http://www.ncsu.edu/labwrite/res/gt/gt-

More information

Lab 1 The Scientific Method

Lab 1 The Scientific Method INTRODUCTION The fllwing labratry exercise is designed t give yu, the student, an pprtunity t explre unknwn systems, r universes, and hypthesize pssible rules which may gvern the behavir within them. Scientific

More information

Assessment Primer: Writing Instructional Objectives

Assessment Primer: Writing Instructional Objectives Assessment Primer: Writing Instructinal Objectives (Based n Preparing Instructinal Objectives by Mager 1962 and Preparing Instructinal Objectives: A critical tl in the develpment f effective instructin

More information

Revision: August 19, E Main Suite D Pullman, WA (509) Voice and Fax

Revision: August 19, E Main Suite D Pullman, WA (509) Voice and Fax .7.4: Direct frequency dmain circuit analysis Revisin: August 9, 00 5 E Main Suite D Pullman, WA 9963 (509) 334 6306 ice and Fax Overview n chapter.7., we determined the steadystate respnse f electrical

More information

COMP 551 Applied Machine Learning Lecture 9: Support Vector Machines (cont d)

COMP 551 Applied Machine Learning Lecture 9: Support Vector Machines (cont d) COMP 551 Applied Machine Learning Lecture 9: Supprt Vectr Machines (cnt d) Instructr: Herke van Hf (herke.vanhf@mail.mcgill.ca) Slides mstly by: Class web page: www.cs.mcgill.ca/~hvanh2/cmp551 Unless therwise

More information

Medium Scale Integrated (MSI) devices [Sections 2.9 and 2.10]

Medium Scale Integrated (MSI) devices [Sections 2.9 and 2.10] EECS 270, Winter 2017, Lecture 3 Page 1 f 6 Medium Scale Integrated (MSI) devices [Sectins 2.9 and 2.10] As we ve seen, it s smetimes nt reasnable t d all the design wrk at the gate-level smetimes we just

More information

Distributions, spatial statistics and a Bayesian perspective

Distributions, spatial statistics and a Bayesian perspective Distributins, spatial statistics and a Bayesian perspective Dug Nychka Natinal Center fr Atmspheric Research Distributins and densities Cnditinal distributins and Bayes Thm Bivariate nrmal Spatial statistics

More information

Springer Texts in Statistics

Springer Texts in Statistics Springer Texts in Statistics Series Editrs: G. Casella S. Fienberg I. Olkin Fr further vlumes: http://www.springer.cm/series/417 Gareth James Daniela Witten Trevr Hastie Rbert Tibshirani An Intrductin

More information

2004 AP CHEMISTRY FREE-RESPONSE QUESTIONS

2004 AP CHEMISTRY FREE-RESPONSE QUESTIONS 2004 AP CHEMISTRY FREE-RESPONSE QUESTIONS 6. An electrchemical cell is cnstructed with an pen switch, as shwn in the diagram abve. A strip f Sn and a strip f an unknwn metal, X, are used as electrdes.

More information

In SMV I. IAML: Support Vector Machines II. This Time. The SVM optimization problem. We saw:

In SMV I. IAML: Support Vector Machines II. This Time. The SVM optimization problem. We saw: In SMV I IAML: Supprt Vectr Machines II Nigel Gddard Schl f Infrmatics Semester 1 We sa: Ma margin trick Gemetry f the margin and h t cmpute it Finding the ma margin hyperplane using a cnstrained ptimizatin

More information

Computational modeling techniques

Computational modeling techniques Cmputatinal mdeling techniques Lecture 4: Mdel checing fr ODE mdels In Petre Department f IT, Åb Aademi http://www.users.ab.fi/ipetre/cmpmd/ Cntent Stichimetric matrix Calculating the mass cnservatin relatins

More information

Department of Electrical Engineering, University of Waterloo. Introduction

Department of Electrical Engineering, University of Waterloo. Introduction Sectin 4: Sequential Circuits Majr Tpics Types f sequential circuits Flip-flps Analysis f clcked sequential circuits Mre and Mealy machines Design f clcked sequential circuits State transitin design methd

More information

Technical Bulletin. Generation Interconnection Procedures. Revisions to Cluster 4, Phase 1 Study Methodology

Technical Bulletin. Generation Interconnection Procedures. Revisions to Cluster 4, Phase 1 Study Methodology Technical Bulletin Generatin Intercnnectin Prcedures Revisins t Cluster 4, Phase 1 Study Methdlgy Release Date: Octber 20, 2011 (Finalizatin f the Draft Technical Bulletin released n September 19, 2011)

More information

A New Evaluation Measure. J. Joiner and L. Werner. The problems of evaluation and the needed criteria of evaluation

A New Evaluation Measure. J. Joiner and L. Werner. The problems of evaluation and the needed criteria of evaluation III-l III. A New Evaluatin Measure J. Jiner and L. Werner Abstract The prblems f evaluatin and the needed criteria f evaluatin measures in the SMART system f infrmatin retrieval are reviewed and discussed.

More information

READING STATECHART DIAGRAMS

READING STATECHART DIAGRAMS READING STATECHART DIAGRAMS Figure 4.48 A Statechart diagram with events The diagram in Figure 4.48 shws all states that the bject plane can be in during the curse f its life. Furthermre, it shws the pssible

More information

Pipetting 101 Developed by BSU CityLab

Pipetting 101 Developed by BSU CityLab Discver the Micrbes Within: The Wlbachia Prject Pipetting 101 Develped by BSU CityLab Clr Cmparisns Pipetting Exercise #1 STUDENT OBJECTIVES Students will be able t: Chse the crrect size micrpipette fr

More information

Biplots in Practice MICHAEL GREENACRE. Professor of Statistics at the Pompeu Fabra University. Chapter 13 Offprint

Biplots in Practice MICHAEL GREENACRE. Professor of Statistics at the Pompeu Fabra University. Chapter 13 Offprint Biplts in Practice MICHAEL GREENACRE Prfessr f Statistics at the Pmpeu Fabra University Chapter 13 Offprint CASE STUDY BIOMEDICINE Cmparing Cancer Types Accrding t Gene Epressin Arrays First published:

More information

Modelling of Clock Behaviour. Don Percival. Applied Physics Laboratory University of Washington Seattle, Washington, USA

Modelling of Clock Behaviour. Don Percival. Applied Physics Laboratory University of Washington Seattle, Washington, USA Mdelling f Clck Behaviur Dn Percival Applied Physics Labratry University f Washingtn Seattle, Washingtn, USA verheads and paper fr talk available at http://faculty.washingtn.edu/dbp/talks.html 1 Overview

More information

COMP 551 Applied Machine Learning Lecture 4: Linear classification

COMP 551 Applied Machine Learning Lecture 4: Linear classification COMP 551 Applied Machine Learning Lecture 4: Linear classificatin Instructr: Jelle Pineau (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/cmp551 Unless therwise nted, all material psted

More information

Admissibility Conditions and Asymptotic Behavior of Strongly Regular Graphs

Admissibility Conditions and Asymptotic Behavior of Strongly Regular Graphs Admissibility Cnditins and Asympttic Behavir f Strngly Regular Graphs VASCO MOÇO MANO Department f Mathematics University f Prt Oprt PORTUGAL vascmcman@gmailcm LUÍS ANTÓNIO DE ALMEIDA VIEIRA Department

More information

Smoothing, penalized least squares and splines

Smoothing, penalized least squares and splines Smthing, penalized least squares and splines Duglas Nychka, www.image.ucar.edu/~nychka Lcally weighted averages Penalized least squares smthers Prperties f smthers Splines and Reprducing Kernels The interplatin

More information

BOUNDED UNCERTAINTY AND CLIMATE CHANGE ECONOMICS. Christopher Costello, Andrew Solow, Michael Neubert, and Stephen Polasky

BOUNDED UNCERTAINTY AND CLIMATE CHANGE ECONOMICS. Christopher Costello, Andrew Solow, Michael Neubert, and Stephen Polasky BOUNDED UNCERTAINTY AND CLIMATE CHANGE ECONOMICS Christpher Cstell, Andrew Slw, Michael Neubert, and Stephen Plasky Intrductin The central questin in the ecnmic analysis f climate change plicy cncerns

More information

Perfrmance f Sensitizing Rules n Shewhart Cntrl Charts with Autcrrelated Data Key Wrds: Autregressive, Mving Average, Runs Tests, Shewhart Cntrl Chart

Perfrmance f Sensitizing Rules n Shewhart Cntrl Charts with Autcrrelated Data Key Wrds: Autregressive, Mving Average, Runs Tests, Shewhart Cntrl Chart Perfrmance f Sensitizing Rules n Shewhart Cntrl Charts with Autcrrelated Data Sandy D. Balkin Dennis K. J. Lin y Pennsylvania State University, University Park, PA 16802 Sandy Balkin is a graduate student

More information

Admin. MDP Search Trees. Optimal Quantities. Reinforcement Learning

Admin. MDP Search Trees. Optimal Quantities. Reinforcement Learning Admin Reinfrcement Learning Cntent adapted frm Berkeley CS188 MDP Search Trees Each MDP state prjects an expectimax-like search tree Optimal Quantities The value (utility) f a state s: V*(s) = expected

More information

Getting Involved O. Responsibilities of a Member. People Are Depending On You. Participation Is Important. Think It Through

Getting Involved O. Responsibilities of a Member. People Are Depending On You. Participation Is Important. Think It Through f Getting Invlved O Literature Circles can be fun. It is exciting t be part f a grup that shares smething. S get invlved, read, think, and talk abut bks! Respnsibilities f a Member Remember a Literature

More information

BASD HIGH SCHOOL FORMAL LAB REPORT

BASD HIGH SCHOOL FORMAL LAB REPORT BASD HIGH SCHOOL FORMAL LAB REPORT *WARNING: After an explanatin f what t include in each sectin, there is an example f hw the sectin might lk using a sample experiment Keep in mind, the sample lab used

More information

Subject description processes

Subject description processes Subject representatin 6.1.2. Subject descriptin prcesses Overview Fur majr prcesses r areas f practice fr representing subjects are classificatin, subject catalging, indexing, and abstracting. The prcesses

More information

Tutorial 4: Parameter optimization

Tutorial 4: Parameter optimization SRM Curse 2013 Tutrial 4 Parameters Tutrial 4: Parameter ptimizatin The aim f this tutrial is t prvide yu with a feeling f hw a few f the parameters that can be set n a QQQ instrument affect SRM results.

More information

Hypothesis Tests for One Population Mean

Hypothesis Tests for One Population Mean Hypthesis Tests fr One Ppulatin Mean Chapter 9 Ala Abdelbaki Objective Objective: T estimate the value f ne ppulatin mean Inferential statistics using statistics in rder t estimate parameters We will be

More information

Physics 2010 Motion with Constant Acceleration Experiment 1

Physics 2010 Motion with Constant Acceleration Experiment 1 . Physics 00 Mtin with Cnstant Acceleratin Experiment In this lab, we will study the mtin f a glider as it accelerates dwnhill n a tilted air track. The glider is supprted ver the air track by a cushin

More information

WRITING THE REPORT. Organizing the report. Title Page. Table of Contents

WRITING THE REPORT. Organizing the report. Title Page. Table of Contents WRITING THE REPORT Organizing the reprt Mst reprts shuld be rganized in the fllwing manner. Smetime there is a valid reasn t include extra chapters in within the bdy f the reprt. 1. Title page 2. Executive

More information

x 1 Outline IAML: Logistic Regression Decision Boundaries Example Data

x 1 Outline IAML: Logistic Regression Decision Boundaries Example Data Outline IAML: Lgistic Regressin Charles Suttn and Victr Lavrenk Schl f Infrmatics Semester Lgistic functin Lgistic regressin Learning lgistic regressin Optimizatin The pwer f nn-linear basis functins Least-squares

More information

We can see from the graph above that the intersection is, i.e., [ ).

We can see from the graph above that the intersection is, i.e., [ ). MTH 111 Cllege Algebra Lecture Ntes July 2, 2014 Functin Arithmetic: With nt t much difficulty, we ntice that inputs f functins are numbers, and utputs f functins are numbers. S whatever we can d with

More information

Eric Klein and Ning Sa

Eric Klein and Ning Sa Week 12. Statistical Appraches t Netwrks: p1 and p* Wasserman and Faust Chapter 15: Statistical Analysis f Single Relatinal Netwrks There are fur tasks in psitinal analysis: 1) Define Equivalence 2) Measure

More information

CHAPTER 4 DIAGNOSTICS FOR INFLUENTIAL OBSERVATIONS

CHAPTER 4 DIAGNOSTICS FOR INFLUENTIAL OBSERVATIONS CHAPTER 4 DIAGNOSTICS FOR INFLUENTIAL OBSERVATIONS 1 Influential bservatins are bservatins whse presence in the data can have a distrting effect n the parameter estimates and pssibly the entire analysis,

More information

Chapter Summary. Mathematical Induction Strong Induction Recursive Definitions Structural Induction Recursive Algorithms

Chapter Summary. Mathematical Induction Strong Induction Recursive Definitions Structural Induction Recursive Algorithms Chapter 5 1 Chapter Summary Mathematical Inductin Strng Inductin Recursive Definitins Structural Inductin Recursive Algrithms Sectin 5.1 3 Sectin Summary Mathematical Inductin Examples f Prf by Mathematical

More information

ChE 471: LECTURE 4 Fall 2003

ChE 471: LECTURE 4 Fall 2003 ChE 47: LECTURE 4 Fall 003 IDEL RECTORS One f the key gals f chemical reactin engineering is t quantify the relatinship between prductin rate, reactr size, reactin kinetics and selected perating cnditins.

More information

Fall 2013 Physics 172 Recitation 3 Momentum and Springs

Fall 2013 Physics 172 Recitation 3 Momentum and Springs Fall 03 Physics 7 Recitatin 3 Mmentum and Springs Purpse: The purpse f this recitatin is t give yu experience wrking with mmentum and the mmentum update frmula. Readings: Chapter.3-.5 Learning Objectives:.3.

More information

Physical Layer: Outline

Physical Layer: Outline 18-: Intrductin t Telecmmunicatin Netwrks Lectures : Physical Layer Peter Steenkiste Spring 01 www.cs.cmu.edu/~prs/nets-ece Physical Layer: Outline Digital Representatin f Infrmatin Characterizatin f Cmmunicatin

More information

NUMBERS, MATHEMATICS AND EQUATIONS

NUMBERS, MATHEMATICS AND EQUATIONS AUSTRALIAN CURRICULUM PHYSICS GETTING STARTED WITH PHYSICS NUMBERS, MATHEMATICS AND EQUATIONS An integral part t the understanding f ur physical wrld is the use f mathematical mdels which can be used t

More information

LHS Mathematics Department Honors Pre-Calculus Final Exam 2002 Answers

LHS Mathematics Department Honors Pre-Calculus Final Exam 2002 Answers LHS Mathematics Department Hnrs Pre-alculus Final Eam nswers Part Shrt Prblems The table at the right gives the ppulatin f Massachusetts ver the past several decades Using an epnential mdel, predict the

More information

Least Squares Optimal Filtering with Multirate Observations

Least Squares Optimal Filtering with Multirate Observations Prc. 36th Asilmar Cnf. n Signals, Systems, and Cmputers, Pacific Grve, CA, Nvember 2002 Least Squares Optimal Filtering with Multirate Observatins Charles W. herrien and Anthny H. Hawes Department f Electrical

More information

8 th Grade Math: Pre-Algebra

8 th Grade Math: Pre-Algebra Hardin Cunty Middle Schl (2013-2014) 1 8 th Grade Math: Pre-Algebra Curse Descriptin The purpse f this curse is t enhance student understanding, participatin, and real-life applicatin f middle-schl mathematics

More information

Interference is when two (or more) sets of waves meet and combine to produce a new pattern.

Interference is when two (or more) sets of waves meet and combine to produce a new pattern. Interference Interference is when tw (r mre) sets f waves meet and cmbine t prduce a new pattern. This pattern can vary depending n the riginal wave directin, wavelength, amplitude, etc. The tw mst extreme

More information

STATS216v Introduction to Statistical Learning Stanford University, Summer Practice Final (Solutions) Duration: 3 hours

STATS216v Introduction to Statistical Learning Stanford University, Summer Practice Final (Solutions) Duration: 3 hours STATS216v Intrductin t Statistical Learning Stanfrd University, Summer 2016 Practice Final (Slutins) Duratin: 3 hurs Instructins: (This is a practice final and will nt be graded.) Remember the university

More information

Chemistry 20 Lesson 11 Electronegativity, Polarity and Shapes

Chemistry 20 Lesson 11 Electronegativity, Polarity and Shapes Chemistry 20 Lessn 11 Electrnegativity, Plarity and Shapes In ur previus wrk we learned why atms frm cvalent bnds and hw t draw the resulting rganizatin f atms. In this lessn we will learn (a) hw the cmbinatin

More information

Determining the Accuracy of Modal Parameter Estimation Methods

Determining the Accuracy of Modal Parameter Estimation Methods Determining the Accuracy f Mdal Parameter Estimatin Methds by Michael Lee Ph.D., P.E. & Mar Richardsn Ph.D. Structural Measurement Systems Milpitas, CA Abstract The mst cmmn type f mdal testing system

More information

Localized Model Selection for Regression

Localized Model Selection for Regression Lcalized Mdel Selectin fr Regressin Yuhng Yang Schl f Statistics University f Minnesta Church Street S.E. Minneaplis, MN 5555 May 7, 007 Abstract Research n mdel/prcedure selectin has fcused n selecting

More information

Sequential Allocation with Minimal Switching

Sequential Allocation with Minimal Switching In Cmputing Science and Statistics 28 (1996), pp. 567 572 Sequential Allcatin with Minimal Switching Quentin F. Stut 1 Janis Hardwick 1 EECS Dept., University f Michigan Statistics Dept., Purdue University

More information

MATCHING TECHNIQUES. Technical Track Session VI. Emanuela Galasso. The World Bank

MATCHING TECHNIQUES. Technical Track Session VI. Emanuela Galasso. The World Bank MATCHING TECHNIQUES Technical Track Sessin VI Emanuela Galass The Wrld Bank These slides were develped by Christel Vermeersch and mdified by Emanuela Galass fr the purpse f this wrkshp When can we use

More information

Cells though to send feedback signals from the medulla back to the lamina o L: Lamina Monopolar cells

Cells though to send feedback signals from the medulla back to the lamina o L: Lamina Monopolar cells Classificatin Rules (and Exceptins) Name: Cell type fllwed by either a clumn ID (determined by the visual lcatin f the cell) r a numeric identifier t separate ut different examples f a given cell type

More information

1 The limitations of Hartree Fock approximation

1 The limitations of Hartree Fock approximation Chapter: Pst-Hartree Fck Methds - I The limitatins f Hartree Fck apprximatin The n electrn single determinant Hartree Fck wave functin is the variatinal best amng all pssible n electrn single determinants

More information

INSTRUMENTAL VARIABLES

INSTRUMENTAL VARIABLES INSTRUMENTAL VARIABLES Technical Track Sessin IV Sergi Urzua University f Maryland Instrumental Variables and IE Tw main uses f IV in impact evaluatin: 1. Crrect fr difference between assignment f treatment

More information

I. Analytical Potential and Field of a Uniform Rod. V E d. The definition of electric potential difference is

I. Analytical Potential and Field of a Uniform Rod. V E d. The definition of electric potential difference is Length L>>a,b,c Phys 232 Lab 4 Ch 17 Electric Ptential Difference Materials: whitebards & pens, cmputers with VPythn, pwer supply & cables, multimeter, crkbard, thumbtacks, individual prbes and jined prbes,

More information

Lesson Plan. Recode: They will do a graphic organizer to sequence the steps of scientific method.

Lesson Plan. Recode: They will do a graphic organizer to sequence the steps of scientific method. Lessn Plan Reach: Ask the students if they ever ppped a bag f micrwave ppcrn and nticed hw many kernels were unppped at the bttm f the bag which made yu wnder if ther brands pp better than the ne yu are

More information

Computational modeling techniques

Computational modeling techniques Cmputatinal mdeling techniques Lecture 2: Mdeling change. In Petre Department f IT, Åb Akademi http://users.ab.fi/ipetre/cmpmd/ Cntent f the lecture Basic paradigm f mdeling change Examples Linear dynamical

More information

Phys. 344 Ch 7 Lecture 8 Fri., April. 10 th,

Phys. 344 Ch 7 Lecture 8 Fri., April. 10 th, Phys. 344 Ch 7 Lecture 8 Fri., April. 0 th, 009 Fri. 4/0 8. Ising Mdel f Ferrmagnets HW30 66, 74 Mn. 4/3 Review Sat. 4/8 3pm Exam 3 HW Mnday: Review fr est 3. See n-line practice test lecture-prep is t

More information