What is Statistical Learning?

[Figure: scatterplots of Sales vs. TV, Radio, and Newspaper.]

Shown are Sales vs. TV, Radio and Newspaper, with a blue linear-regression line fit separately to each. Can we predict Sales using these three? Perhaps we can do better using a model

Sales ≈ f(TV, Radio, Newspaper)
Notation

Here Sales is a response or target that we wish to predict. We generically refer to the response as Y. TV is a feature, or input, or predictor; we name it X_1. Likewise name Radio as X_2, and so on. We can refer to the input vector collectively as

X = (X_1, X_2, X_3)^T

Now we write our model as

Y = f(X) + ε

where ε captures measurement errors and other discrepancies.
What is f(X) good for?

With a good f we can make predictions of Y at new points X = x.

We can understand which components of X = (X_1, X_2, ..., X_p) are important in explaining Y, and which are irrelevant. E.g. Seniority and Years of Education have a big impact on Income, but Marital Status typically does not.

Depending on the complexity of f, we may be able to understand how each component X_j of X affects Y.
Is there an ideal f(X)?

[Figure: scatterplot of y vs. x, with the distribution of Y values at X = 4 highlighted.]

In particular, what is a good value for f(X) at any selected value of X, say X = 4? There can be many Y values at X = 4. A good value is

f(4) = E(Y | X = 4)

E(Y | X = 4) means expected value (average) of Y given X = 4. This ideal f(x) = E(Y | X = x) is called the regression function.
The regression function f(x)

Is also defined for vector X; e.g.

f(x) = f(x_1, x_2, x_3) = E(Y | X_1 = x_1, X_2 = x_2, X_3 = x_3)

Is the ideal or optimal predictor of Y with regard to mean-squared prediction error: f(x) = E(Y | X = x) is the function that minimizes E[(Y − g(X))^2 | X = x] over all functions g at all points X = x.

ε = Y − f(x) is the irreducible error: even if we knew f(x), we would still make errors in prediction, since at each X = x there is typically a distribution of possible Y values.

For any estimate f̂(x) of f(x), we have

E[(Y − f̂(X))^2 | X = x] = [f(x) − f̂(x)]^2 + Var(ε)

where [f(x) − f̂(x)]^2 is the reducible part of the error and Var(ε) the irreducible part.
How to estimate f

Typically we have few if any data points with X = 4 exactly. So we cannot compute E(Y | X = x)! Relax the definition and let

f̂(x) = Ave(Y | X ∈ N(x))

where N(x) is some neighborhood of x.

[Figure: scatterplot of y vs. x with a neighborhood N(x) marked around a target point.]
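As a small illustration (not from the slides; the simulated data, the true function sin(x), and the window half-width are all hypothetical choices), the relaxed definition f̂(x) = Ave(Y | X ∈ N(x)) might be sketched as:

```python
import numpy as np

def local_average(x_train, y_train, x, width=0.5):
    """Estimate f(x) = E(Y | X = x) by averaging the y_i whose x_i
    fall in the neighborhood N(x) = [x - width, x + width]."""
    mask = np.abs(x_train - x) <= width
    if not mask.any():
        return float("nan")   # no training points near x
    return y_train[mask].mean()

# Hypothetical data: true f(x) = sin(x), noise sd 0.3
rng = np.random.default_rng(0)
x_train = rng.uniform(1, 6, size=200)
y_train = np.sin(x_train) + rng.normal(scale=0.3, size=200)

fhat = local_average(x_train, y_train, x=4.0)   # roughly sin(4)
```

With 200 training points, roughly 40 fall in the window around x = 4, so the local average tracks the true curve while the noise is averaged down.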
Nearest neighbor averaging can be pretty good for small p, i.e. p ≤ 4, and large-ish N. We will discuss smoother versions, such as kernel and spline smoothing, later in the course.

Nearest neighbor methods can be lousy when p is large. Reason: the curse of dimensionality. Nearest neighbors tend to be far away in high dimensions.

We need to get a reasonable fraction of the N values of y_i to average to bring the variance down, e.g. 10%.

A 10% neighborhood in high dimensions need no longer be local, so we lose the spirit of estimating E(Y | X = x) by local averaging.
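The 10% claim is easy to check numerically. A simplified, cube-based version of the calculation (a sketch using sub-cubes of the unit hypercube rather than the spherical neighborhoods in the figure):

```python
# A sub-cube of the unit hypercube [0, 1]^p containing a fraction `frac`
# of the volume must satisfy frac = edge**p, so edge = frac**(1/p).
def edge_for_fraction(frac, p):
    return frac ** (1.0 / p)

for p in (1, 2, 3, 5, 10):
    print(f"p = {p:2d}: 10% neighborhood needs edge {edge_for_fraction(0.10, p):.3f}")
```

Already at p = 10, a sub-cube must span about 79% of each axis to contain 10% of the volume, so the "neighborhood" is no longer local.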
The curse of dimensionality

[Figure: left panel shows a 10% neighborhood in one and two dimensions (x1, x2); right panel plots the radius needed to capture a given fraction of the volume, for p = 1, 2, 3, 5 and 10.]
Parametric and structured models

The linear model is an important example of a parametric model:

f_L(X) = β_0 + β_1 X_1 + β_2 X_2 + ... + β_p X_p.

A linear model is specified in terms of p + 1 parameters β_0, β_1, ..., β_p. We estimate the parameters by fitting the model to training data. Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function f(X).
A linear model f̂_L(X) = β̂_0 + β̂_1 X gives a reasonable fit here. A quadratic model f̂_Q(X) = β̂_0 + β̂_1 X + β̂_2 X^2 fits slightly better.

[Figure: the same scatterplot with a fitted straight line (left) and a fitted quadratic curve (right).]
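Both fits can be reproduced in a few lines (a sketch on hypothetical simulated data; the coefficients and noise level are arbitrary):

```python
import numpy as np

# Hypothetical data with a mildly curved truth
rng = np.random.default_rng(1)
x = rng.uniform(1, 6, size=100)
y = 1.0 + 0.4 * x - 0.06 * x**2 + rng.normal(scale=0.2, size=100)

lin = np.polyfit(x, y, deg=1)    # fits beta0_hat + beta1_hat * X
quad = np.polyfit(x, y, deg=2)   # adds beta2_hat * X^2

mse_lin = np.mean((y - np.polyval(lin, x)) ** 2)
mse_quad = np.mean((y - np.polyval(quad, x)) ** 2)
```

Because the linear model is nested inside the quadratic one, the quadratic fit's training MSE can never be larger; whether it predicts better on fresh data is a separate question.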
Simulated example. Red points are simulated values of income from the model

income = f(education, seniority) + ε

f is the blue surface.

[Figure: 3-D plot of Income vs. Years of Education and Seniority.]
Linear regression model fit to the simulated data:

f̂_L(education, seniority) = β̂_0 + β̂_1 × education + β̂_2 × seniority

[Figure: fitted regression plane for Income vs. Years of Education and Seniority.]
More flexible regression model f̂_S(education, seniority) fit to the simulated data. Here we use a technique called a thin-plate spline to fit a flexible surface. We control the roughness of the fit (Chapter 7).

[Figure: smooth fitted spline surface for Income vs. Years of Education and Seniority.]
Even more flexible spline regression model f̂_S(education, seniority) fit to the simulated data. Here the fitted model makes no errors on the training data! Also known as overfitting.

[Figure: rough interpolating spline surface for Income vs. Years of Education and Seniority.]
Some trade-offs

Prediction accuracy versus interpretability. Linear models are easy to interpret; thin-plate splines are not.

Good fit versus over-fit or under-fit. How do we know when the fit is just right?

Parsimony versus black-box. We often prefer a simpler model involving fewer variables over a black-box predictor involving them all.
[Figure: methods arranged from high interpretability / low flexibility to low interpretability / high flexibility: Subset Selection and Lasso; Least Squares; Generalized Additive Models and Trees; Bagging and Boosting; Support Vector Machines.]
Assessing Model Accuracy

Suppose we fit a model f̂(x) to some training data Tr = {x_i, y_i}_1^N, and we wish to see how well it performs. We could compute the average squared prediction error over Tr:

MSE_Tr = Ave_{i∈Tr}[y_i − f̂(x_i)]^2

This may be biased toward more overfit models. Instead we should, if possible, compute it using fresh test data Te = {x_i, y_i}_1^M:

MSE_Te = Ave_{i∈Te}[y_i − f̂(x_i)]^2
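The gap between the two quantities is easy to see in simulation (a sketch with hypothetical data; 1-nearest-neighbor regression stands in for a maximally flexible f̂):

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(n):
    x = rng.uniform(0, 3, size=n)
    return x, np.sin(2 * x) + rng.normal(scale=0.5, size=n)

x_tr, y_tr = simulate(50)      # training data Tr
x_te, y_te = simulate(1000)    # fresh test data Te

def predict_1nn(x_tr, y_tr, x):
    """A maximally flexible fit: 1-nearest-neighbor regression,
    which interpolates the training data exactly."""
    idx = np.abs(x_tr[None, :] - np.asarray(x)[:, None]).argmin(axis=1)
    return y_tr[idx]

mse_tr = np.mean((y_tr - predict_1nn(x_tr, y_tr, x_tr)) ** 2)   # 0: overfit
mse_te = np.mean((y_te - predict_1nn(x_tr, y_tr, x_te)) ** 2)
```

The interpolating fit drives MSE_Tr to zero while MSE_Te stays above the noise level, which is exactly the optimism we want fresh test data to expose.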
[Figure: left panel shows the data and three fits of different flexibility; right panel shows Mean Squared Error vs. Flexibility.]

Black curve is truth. Red curve on right is MSE_Te, grey curve is MSE_Tr. Orange, blue and green curves/squares correspond to fits of different flexibility.
Here the truth is smoother, so the smoother fit and linear model do really well.

[Figure: as on the previous slide, for a smoother truth.]
Here the truth is wiggly and the noise is low, so the more flexible fits do the best.

[Figure: as on the previous slides, for a wiggly, low-noise truth.]
Bias-Variance Trade-off

Suppose we have fit a model f̂(x) to some training data Tr, and let (x_0, y_0) be a test observation drawn from the population. If the true model is Y = f(X) + ε (with f(x) = E(Y | X = x)), then

E[(y_0 − f̂(x_0))^2] = Var(f̂(x_0)) + [Bias(f̂(x_0))]^2 + Var(ε).

The expectation averages over the variability of y_0 as well as the variability in Tr. Note that Bias(f̂(x_0)) = E[f̂(x_0)] − f(x_0).

Typically as the flexibility of f̂ increases, its variance increases, and its bias decreases. So choosing the flexibility based on average test error amounts to a bias-variance trade-off.
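The decomposition can be verified by Monte Carlo (a sketch; the true function, noise level, test point x_0, and the choice of a deliberately misspecified linear fit are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    return x ** 2             # hypothetical true regression function

sigma = 0.5                   # noise sd, so Var(eps) = 0.25
x0 = 0.9                      # the test point
n, reps = 30, 5000

preds = np.empty(reps)        # f_hat(x0) over many training sets
sq_err = np.empty(reps)       # (y0 - f_hat(x0))^2 over the same draws
for r in range(reps):
    x = rng.uniform(0, 1, size=n)              # fresh training set Tr
    y = f(x) + rng.normal(scale=sigma, size=n)
    b1, b0 = np.polyfit(x, y, deg=1)           # misspecified linear fit
    preds[r] = b0 + b1 * x0
    y0 = f(x0) + rng.normal(scale=sigma)       # fresh test observation at x0
    sq_err[r] = (y0 - preds[r]) ** 2

variance = preds.var()                        # Var(f_hat(x0))
bias2 = (preds.mean() - f(x0)) ** 2           # [Bias(f_hat(x0))]^2
decomposition = variance + bias2 + sigma**2   # ... + Var(eps)
```

Averaged over many training sets and test draws, sq_err.mean() matches `decomposition` up to Monte Carlo error.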
Bias-variance trade-off for the three examples

[Figure: for each of the three examples, MSE, squared bias (Bias), and variance (Var) plotted against flexibility.]
Classification Problems

Here the response variable Y is qualitative; e.g. email is one of C = (spam, ham) (ham = good email), digit class is one of C = {0, 1, ..., 9}. Our goals are to:

Build a classifier C(X) that assigns a class label from C to a future unlabeled observation X.

Assess the uncertainty in each classification.

Understand the roles of the different predictors among X = (X_1, X_2, ..., X_p).
Is there an ideal C(X)?

Suppose the K elements in C are numbered 1, 2, ..., K. Let

p_k(x) = Pr(Y = k | X = x), k = 1, 2, ..., K.

These are the conditional class probabilities at x; e.g. see the little barplot at x = 5 in the figure. Then the Bayes optimal classifier at x is

C(x) = j if p_j(x) = max{p_1(x), p_2(x), ..., p_K(x)}

[Figure: conditional class probabilities plotted against x, with a barplot of the probabilities at x = 5.]
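Given the conditional class probabilities, the Bayes rule is just an argmax (a minimal sketch; the probabilities are made up):

```python
import numpy as np

def bayes_classifier(p):
    """Given conditional class probabilities p_k(x), k = 1..K,
    return the Bayes-optimal class label (1-indexed)."""
    return int(np.argmax(np.asarray(p))) + 1

# e.g. at some x the conditional probabilities might be:
p_at_x = [0.2, 0.7, 0.1]
label = bayes_classifier(p_at_x)   # class 2
```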
Nearest-neighbor averaging can be used as before. It also breaks down as dimension grows. However, the impact on Ĉ(x) is less than on p̂_k(x), k = 1, ..., K.

[Figure: estimated conditional class probabilities plotted against x.]
Classification: some details

Typically we measure the performance of Ĉ(x) using the misclassification error rate:

Err_Te = Ave_{i∈Te} I[y_i ≠ Ĉ(x_i)]

The Bayes classifier (using the true p_k(x)) has smallest error (in the population).

Support-vector machines build structured models for C(x). We will also build structured models for representing the p_k(x), e.g. logistic regression and generalized additive models.
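Computing Err_Te is a one-liner (a sketch with made-up test labels):

```python
import numpy as np

def err_rate(y, yhat):
    """Err = Ave I[y_i != C_hat(x_i)]: the fraction misclassified."""
    return float(np.mean(np.asarray(y) != np.asarray(yhat)))

y_te    = ["spam", "ham", "ham", "spam", "ham"]
yhat_te = ["spam", "ham", "spam", "spam", "ham"]
rate = err_rate(y_te, yhat_te)   # 1 of 5 wrong -> 0.2
```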
Example: K-nearest neighbors in two dimensions

[Figure: two-class training data plotted in (X_1, X_2).]
KNN: K = 10

FIGURE 2.15. The black curve indicates the KNN decision boundary on the data from Figure 2.13, using K = 10. The Bayes decision boundary is shown as a purple dashed line. The KNN and Bayes decision boundaries are very similar.

KNN: K = 1 and K = 100

FIGURE 2.16. A comparison of the KNN decision boundaries (solid black curves) obtained using K = 1 and K = 100 on the data from Figure 2.13. With K = 1, the decision boundary is overly flexible, while with K = 100 it is not sufficiently flexible. The Bayes decision boundary is shown as a purple dashed line.
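A bare-bones KNN classifier in the spirit of these figures (a sketch; the toy training points and labels are hypothetical):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k):
    """Classify x by majority vote among its k nearest training
    points, using Euclidean distance."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = ["blue", "blue", "orange", "orange", "orange"]
label = knn_classify(X_train, y_train, np.array([0.95, 1.0]), k=3)   # "orange"
```

With K = 1 the boundary would interpolate the training labels; larger K smooths the vote, as in Figure 2.16.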
[Figure: KNN training and test error rates plotted against 1/K, for 1/K ranging from 0.01 to 1.00.]