What is Statistical Learning?

1 What is Statistical Learning?
Shown are scatterplots of Sales vs TV, Radio and Newspaper, with a blue linear-regression line fit separately to each. Can we predict Sales using these three? Perhaps we can do better using a model
Sales ≈ f(TV, Radio, Newspaper)

2 Notation
Here Sales is a response or target that we wish to predict. We generically refer to the response as Y. TV is a feature, or input, or predictor; we name it X_1. Likewise we name Radio X_2, and so on. We can refer to the input vector collectively as
X = (X_1, X_2, X_3)^T
Now we write our model as
Y = f(X) + ε
where ε captures measurement errors and other discrepancies.
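As a concrete sketch of fitting such a model, the snippet below estimates a linear f by least squares on the Advertising data. It assumes the data set has been saved locally as Advertising.csv with columns TV, radio, newspaper and sales; the file name and column names are assumptions, not something fixed by the slides.

```python
# Minimal sketch: fit a linear approximation to f(TV, Radio, Newspaper).
# Assumes Advertising.csv (columns TV, radio, newspaper, sales) is in the
# working directory; file/column names are assumptions, not from the slides.
import numpy as np
import pandas as pd

ads = pd.read_csv("Advertising.csv")
X = ads[["TV", "radio", "newspaper"]].to_numpy()   # inputs X1, X2, X3
y = ads["sales"].to_numpy()                        # response Y

# Least-squares fit of Y = b0 + b1*TV + b2*radio + b3*newspaper + error.
A = np.column_stack([np.ones(len(y)), X])          # add intercept column
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print("estimated coefficients (b0..b3):", beta)
```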

3 What is f(X) good for?
With a good f we can make predictions of Y at new points X = x. We can understand which components of X = (X_1, X_2, ..., X_p) are important in explaining Y, and which are irrelevant. E.g. Seniority and Years of Education have a big impact on Income, but Marital Status typically does not. Depending on the complexity of f, we may be able to understand how each component X_j of X affects Y.

4 Is there an ideal f(X)?
In particular, what is a good value for f(X) at any selected value of X, say X = 4? There can be many Y values at X = 4. A good value is f(4) = E(Y | X = 4). Here E(Y | X = 4) means the expected value (average) of Y given X = 4. This ideal f(x) = E(Y | X = x) is called the regression function.

5 The regression function f(x)
Is also defined for vector X; e.g. f(x) = f(x_1, x_2, x_3) = E(Y | X_1 = x_1, X_2 = x_2, X_3 = x_3).
Is the ideal or optimal predictor of Y with regard to mean-squared prediction error: f(x) = E(Y | X = x) is the function that minimizes E[(Y - g(X))^2 | X = x] over all functions g at all points X = x.
ε = Y - f(x) is the irreducible error: even if we knew f(x), we would still make errors in prediction, since at each X = x there is typically a distribution of possible Y values.
For any estimate f̂(x) of f(x), we have
E[(Y - f̂(X))^2 | X = x] = [f(x) - f̂(x)]^2 + Var(ε)
where the first term is the reducible error and the second the irreducible error.
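The decomposition is easy to verify numerically. In the sketch below, the true f, the noise level and the fixed estimate f̂(x_0) are all made-up assumptions; it draws many Y values at a single point x_0 and checks that the average squared error of f̂ equals reducible plus irreducible error.

```python
# Sketch: verify E[(Y - fhat(x))^2 | X = x] = [f(x) - fhat(x)]^2 + Var(eps)
# at a single point x0. The true f and noise level are assumptions.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)      # assumed true regression function
sigma = 0.5                      # assumed noise sd, so Var(eps) = 0.25

x0 = 1.0
fhat_x0 = 0.7                    # some fixed (imperfect) estimate of f(x0)

y = f(x0) + sigma * rng.standard_normal(1_000_000)  # draws of Y given X = x0
mse = np.mean((y - fhat_x0) ** 2)

reducible = (f(x0) - fhat_x0) ** 2
irreducible = sigma ** 2
print(mse, reducible + irreducible)   # the two numbers nearly agree
```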

9 How to estimate f
Typically we have few if any data points with X = 4 exactly. So we cannot compute E(Y | X = x)! Relax the definition and let
f̂(x) = Ave(Y | X ∈ N(x))
where N(x) is some neighborhood of x.
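The relaxed definition translates directly into code. Below is a minimal sketch on toy data, with an assumed window half-width h standing in for the neighborhood N(x).

```python
# Sketch of fhat(x) = Ave(y_i : x_i in N(x)), where N(x) is a window of
# half-width h around x. The data and h are toy assumptions.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + 0.3 * rng.standard_normal(200)

def fhat(x0, h=0.25):
    """Local average of the responses whose x_i satisfy |x_i - x0| < h."""
    in_nbhd = np.abs(x - x0) < h
    return y[in_nbhd].mean() if in_nbhd.any() else np.nan

print(fhat(4.0), np.sin(4.0))   # local average vs the true E(Y | X = 4)
```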

10 Nearest neighbor averaging can be pretty good for small p, i.e. p ≤ 4, and large-ish N. We will discuss smoother versions, such as kernel and spline smoothing, later in the course.
Nearest neighbor methods can be lousy when p is large. Reason: the curse of dimensionality. Nearest neighbors tend to be far away in high dimensions. We need to get a reasonable fraction of the N values of y_i to average in order to bring the variance down, e.g. 10%. A 10% neighborhood in high dimensions need no longer be local, so we lose the spirit of estimating E(Y | X = x) by local averaging.

12 The curse of dimensionality
[Figure: left panel, a 10% neighborhood of a point in (x_1, x_2); right panel, the radius of a 10% neighborhood plotted against the fraction of volume, for p = 1, 2, 3, 5, 10.]
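The figure's message reduces to one line of arithmetic: for data uniform on the unit cube in p dimensions, a cube-shaped neighborhood capturing 10% of the data must have side length 0.1^(1/p). The cubical neighborhood is an assumption made to keep the computation simple.

```python
# Side length of a cube capturing 10% of uniform data on [0,1]^p.
# A cube-shaped neighborhood is assumed for simplicity.
for p in (1, 2, 3, 5, 10):
    print(p, round(0.10 ** (1 / p), 3))
# p=1 -> 0.1, but p=10 -> 0.794: a "10% neighborhood" spans almost 80% of
# each axis, so it is no longer local.
```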

13 Parametric and structured models
The linear model is an important example of a parametric model:
f_L(X) = β_0 + β_1 X_1 + β_2 X_2 + ... + β_p X_p
A linear model is specified in terms of p + 1 parameters β_0, β_1, ..., β_p. We estimate the parameters by fitting the model to training data. Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function f(X).

14 A linear model f̂_L(X) = β̂_0 + β̂_1 X gives a reasonable fit here. A quadratic model f̂_Q(X) = β̂_0 + β̂_1 X + β̂_2 X^2 fits slightly better.
[Figure: the two fits overlaid on the same scatterplot of y vs x.]
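The two fits can be imitated with polynomial least squares. The sketch below uses toy data with an assumed quadratic truth, since the slide's data is shown only graphically.

```python
# Sketch: compare a linear and a quadratic least-squares fit on toy data.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-2, 2, 100)
y = 1.0 + 0.5 * x + 0.4 * x**2 + 0.3 * rng.standard_normal(100)  # assumed truth

for degree in (1, 2):                    # fhat_L vs fhat_Q
    coefs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coefs, x)
    print(degree, np.mean(resid ** 2))   # the quadratic fits slightly better
```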

15 Simulated example. Red points are simulated values for income from the model
income = f(education, seniority) + ε
f is the blue surface.
[Figure: Income plotted as a surface over Years of Education and Seniority.]

16 Linear regression model fit to the simulated data:
f̂_L(education, seniority) = β̂_0 + β̂_1 × education + β̂_2 × seniority
[Figure: the fitted plane over Years of Education and Seniority.]

17 More flexible regression model f̂_S(education, seniority) fit to the simulated data. Here we use a technique called a thin-plate spline to fit a flexible surface. We control the roughness of the fit (Chapter 7).
[Figure: the fitted spline surface over Years of Education and Seniority.]

18 Even more flexible spline regression model f̂_S(education, seniority) fit to the simulated data. Here the fitted model makes no errors on the training data! Also known as overfitting.
[Figure: the fitted surface over Years of Education and Seniority.]

19 Some trade-offs
Prediction accuracy versus interpretability: linear models are easy to interpret; thin-plate splines are not.
Good fit versus over-fit or under-fit: how do we know when the fit is just right?
Parsimony versus black-box: we often prefer a simpler model involving fewer variables over a black-box predictor involving them all.

22 [Figure: methods arranged by flexibility (low to high) against interpretability (high to low): Subset Selection and Lasso; Least Squares; Generalized Additive Models and Trees; Bagging and Boosting; Support Vector Machines.]

23 Assessing Model Accuracy
Suppose we fit a model f̂(x) to some training data Tr = {x_i, y_i}_1^N, and we wish to see how well it performs. We could compute the average squared prediction error over Tr:
MSE_Tr = Ave_{i ∈ Tr}[y_i - f̂(x_i)]^2
This may be biased toward more overfit models. Instead we should, if possible, compute it using fresh test data Te = {x_i, y_i}_1^M:
MSE_Te = Ave_{i ∈ Te}[y_i - f̂(x_i)]^2
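A minimal sketch of the two quantities, with polynomial degree standing in for flexibility and an assumed true curve (the slides' data is graphical only):

```python
# Sketch: MSE_Tr vs MSE_Te as model flexibility (polynomial degree) grows.
import numpy as np

rng = np.random.default_rng(3)
def sample(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(4 * x) + 0.3 * rng.standard_normal(n)  # assumed truth

x_tr, y_tr = sample(50)     # training set Tr
x_te, y_te = sample(500)    # fresh test set Te

for degree in (1, 3, 6, 10):
    coefs = np.polyfit(x_tr, y_tr, degree)
    mse_tr = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    print(degree, round(mse_tr, 3), round(mse_te, 3))
# MSE_Tr falls as degree grows; MSE_Te stops improving and eventually rises.
```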

24 [Figure: left panel, Y vs X; right panel, Mean Squared Error vs Flexibility.] Black curve is truth. Red curve on right is MSE_Te, grey curve is MSE_Tr. Orange, blue and green curves/squares correspond to fits of different flexibility.

25 [Figure: same layout as the previous slide.] Here the truth is smoother, so the smoother fit and linear model do really well.

26 [Figure: same layout as the previous slide.] Here the truth is wiggly and the noise is low, so the more flexible fits do the best.

27 Bias-Variance Trade-off
Suppose we have fit a model f̂(x) to some training data Tr, and let (x_0, y_0) be a test observation drawn from the population. If the true model is Y = f(X) + ε (with f(x) = E(Y | X = x)), then
E[(y_0 - f̂(x_0))^2] = Var(f̂(x_0)) + [Bias(f̂(x_0))]^2 + Var(ε)
The expectation averages over the variability of y_0 as well as the variability in Tr. Note that Bias(f̂(x_0)) = E[f̂(x_0)] - f(x_0). Typically as the flexibility of f̂ increases, its variance increases and its bias decreases. So choosing the flexibility based on average test error amounts to a bias-variance trade-off.
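The identity can be checked by Monte Carlo: refit f̂ on many training sets, then compare the left-hand side to variance plus squared bias plus noise at x_0. Everything below (the true f, the noise level, and a deliberately biased linear fit standing in for f̂) is an assumption for illustration.

```python
# Sketch: Monte Carlo check of E[(y0 - fhat(x0))^2]
#         = Var(fhat(x0)) + Bias(fhat(x0))^2 + Var(eps).
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(3 * x)        # assumed truth
sigma, x0, n = 0.3, 0.5, 30

preds, errs = [], []
for _ in range(5000):              # many training sets Tr
    x = rng.uniform(0, 1, n)
    y = f(x) + sigma * rng.standard_normal(n)
    b1, b0 = np.polyfit(x, y, 1)   # a linear fhat (deliberately biased)
    fhat_x0 = b0 + b1 * x0
    y0 = f(x0) + sigma * rng.standard_normal()   # test observation at x0
    preds.append(fhat_x0)
    errs.append((y0 - fhat_x0) ** 2)

preds = np.array(preds)
var = preds.var()
bias2 = (preds.mean() - f(x0)) ** 2
print(np.mean(errs), var + bias2 + sigma ** 2)   # nearly equal
```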

28 Bias-variance trade-off for the three examples
[Figure: three panels, one per example, each plotting MSE, squared Bias and Variance against Flexibility.]

29 Classification Problems
Here the response variable Y is qualitative; e.g. email is one of C = (spam, ham) (ham = good email), digit class is one of C = {0, 1, ..., 9}. Our goals are to:
Build a classifier C(X) that assigns a class label from C to a future unlabeled observation X.
Assess the uncertainty in each classification.
Understand the roles of the different predictors among X = (X_1, X_2, ..., X_p).

30 Is there an ideal C(X)?
Suppose the K elements in C are numbered 1, 2, ..., K. Let
p_k(x) = Pr(Y = k | X = x), k = 1, 2, ..., K.
These are the conditional class probabilities at x; e.g. see the little barplot at x = 5. Then the Bayes optimal classifier at x is
C(x) = j if p_j(x) = max{p_1(x), p_2(x), ..., p_K(x)}
[Figure: y vs x, with a barplot of the conditional class probabilities at x = 5.]
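When the p_k(x) are known, the Bayes classifier is a one-liner. The sketch below computes the conditional class probabilities by Bayes' rule for two assumed Gaussian classes with equal priors; the densities are illustrative, not from the slides.

```python
# Sketch: Bayes-optimal classifier C(x) = argmax_k p_k(x) for two assumed
# Gaussian classes; priors and densities are illustrative assumptions.
import numpy as np
from scipy.stats import norm

priors = np.array([0.5, 0.5])                        # Pr(Y = k)
dens = [norm(loc=2, scale=1), norm(loc=5, scale=1)]  # X given Y = k

def p_k(x):
    """Conditional class probabilities p_k(x) = Pr(Y = k | X = x)."""
    joint = np.array([pi * d.pdf(x) for pi, d in zip(priors, dens)])
    return joint / joint.sum()

def bayes_classifier(x):
    return int(np.argmax(p_k(x)))

print(p_k(3.0), bayes_classifier(3.0))   # class with the largest p_k(x)
```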

31 Nearest-neighbor averaging can be used as before. It also breaks down as dimension grows. However, the impact on Ĉ(x) is less than on p̂_k(x), k = 1, ..., K.

32 Classification: some details
Typically we measure the performance of Ĉ(x) using the misclassification error rate:
Err_Te = Ave_{i ∈ Te} I[y_i ≠ Ĉ(x_i)]
The Bayes classifier (using the true p_k(x)) has the smallest error (in the population).
Support-vector machines build structured models for C(x). We will also build structured models for representing the p_k(x); e.g. logistic regression, generalized additive models.
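Err_Te is just the fraction of test points the classifier gets wrong. A minimal sketch on simulated two-class data (the Gaussian mixture and the threshold rule are assumptions; with equal priors and equal variances the midpoint threshold is in fact the Bayes rule for this toy setup):

```python
# Sketch: Err_Te = Ave I[y_i != Chat(x_i)] for an assumed classifier on
# simulated two-class Gaussian data.
import numpy as np

rng = np.random.default_rng(5)
n = 10_000
y = rng.integers(0, 2, n)                        # classes 0 and 1
x = np.where(y == 0, rng.normal(2, 1, n), rng.normal(5, 1, n))

chat = (x > 3.5).astype(int)    # midpoint of the two means: the Bayes rule here
err_te = np.mean(chat != y)
print(err_te)                   # estimates the (smallest possible) Bayes error
```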

34 Example: K-nearest neighbors in two dimensions
[Figure: simulated two-class data in (X_1, X_2).]

35 KNN: K = 10
[Figure: the KNN decision boundary with K = 10 on the two-dimensional data.]

36 FIGURE: The black curve indicates the KNN decision boundary on the data from Figure 2.13, using K = 10. The Bayes decision boundary is shown as a purple dashed line. The KNN and Bayes decision boundaries are very similar.
KNN: K = 1 and KNN: K = 100
FIGURE: A comparison of the KNN decision boundaries (solid black curves) obtained using K = 1 and K = 100 on the data from Figure 2.13. With K = 1, the decision boundary is overly flexible, while with K = 100 it is not sufficiently flexible. The Bayes decision boundary is shown as a purple dashed line.

37 [Figure: Error Rate vs 1/K, showing training errors and test errors.]
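The training and test curves can be reproduced with scikit-learn's KNeighborsClassifier on simulated data; the two-class Gaussian mixture below is an assumption, since the slides' Figure 2.13 data is shown only graphically.

```python
# Sketch: KNN training vs test error as K varies, on simulated 2-D data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(6)
def sample(n):
    y = rng.integers(0, 2, n)
    X = rng.normal(0, 1, (n, 2)) + np.where(y[:, None] == 1, 1.2, 0.0)
    return X, y

X_tr, y_tr = sample(200)
X_te, y_te = sample(2000)

for K in (1, 10, 100):
    knn = KNeighborsClassifier(n_neighbors=K).fit(X_tr, y_tr)
    err_tr = np.mean(knn.predict(X_tr) != y_tr)
    err_te = np.mean(knn.predict(X_te) != y_te)
    print(K, round(err_tr, 3), round(err_te, 3))
# K = 1 gives zero training error but a worse test error (overly flexible);
# very large K underfits.
```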
