Simple Linear Regression (single variable)

Size: px

Start display at page:

Download "Simple Linear Regression (single variable)"

Norman Porter
5 years ago
Views:

1 Simple Linear Regressin (single variable) Intrductin t Machine Learning Marek Petrik January 31, 2017 Sme f the figures in this presentatin are taken frm An Intrductin t Statistical Learning, with applicatins in R (Springer, 2013) with permissin frm the authrs: G. James, D. Witten, T. Hastie and R. Tibshirani

2 Last Class 1. Basic machine learning framewrk Y = f(x) 2. Predictin vs inference: predict Y vs understand f 3. Parametric vs nn-parametric: linear regressin vs k-nn 4. Classificatin vs regressins: k-nn vs linear regressin 5. Why we need t have a test set: verfitting

3 What is Machine Learning Discver unknwn functin f: X = set f features, r inputs Y = target, r respnse Y = f(x) Sales Sales Sales TV Radi Newspaper Sales = f(tv, Radi, Newspaper)

4 Errrs in Machine Learning: Wrld is Nisy Wrld is t cmplex t mdel precisely Many features are nt captured in data sets Need t allw fr errrs ɛ in f: Y = f(x) + ɛ

5 Hw Gd are Predictins? Learned functin ˆf Test data: (x 1, y 1 ), (x 2, y 2 ),... Mean Squared Errr (MSE): MSE = 1 n n (y i ˆf(x i )) 2 i=1 This is the estimate f: MSE = E[(Y ˆf(X)) 2 ] = 1 Ω Imprtant: Samples x i are i.i.d. (Y (ω) ˆf(X(ω))) 2 ω Ω

6 D We Need Test Data? Why nt just test n the training data? Y Mean Squared Errr X Flexibility Flexibility is the degree f plynmial being fit Gray line: training errr, red line: testing errr

7 Types f Functin f Regressin: cntinuus target f : X R Years f Educatin Senirity Incme Classificatin: discrete target f : X {1, 2, 3,..., k} X1 X2

8 Bias-Variance Decmpsitin Y = f(x) + ɛ Mean Squared Errr can be decmpsed as: MSE = E(Y ˆf(X)) 2 = Var( }{{ ˆf(X)) + (E( } ˆf(X))) 2 + Var(ɛ) }{{} Variance Bias Bias: Hw well wuld methd wrk with infinite data Variance: Hw much des utput change with different data sets

9 Tday Basics f linear regressin Why linear regressin Hw t cmpute it Why cmpute it

10 Simple Linear Regressin We have nly ne feature Y β 0 + β 1 X Y = β 0 + β 1 X + ɛ Example: TV Sales Sales β 0 + β 1 TV

11 Hw T Estimate Cefficients N line that will have n errrs n data x i Predictin: ŷ i = ˆβ 0 + ˆβ 1 x i Errrs (y i are true values): e i = y i ŷ i Sales TV

12 Residual Sum f Squares Residual Sum f Squares RSS = e e e e 2 n = n i=1 e 2 i Equivalently: RSS = n (y i ˆβ 0 ˆβ 1 x i ) 2 i=1

13 Minimizing Residual Sum f Squares min β 0,β 1 RSS = min β 0,β 1 n i=1 e 2 i = min β 0,β 1 n (y i β 0 β 1 x i ) 2 i=1 RSS β 1 β 0

14 Minimizing Residual Sum f Squares min β 0,β 1 RSS = min β 0,β 1 n i=1 e 2 i = min β 0,β 1 n (y i β 0 β 1 x i ) 2 i= β β 0

15 Slving fr Minimal RSS min β 0,β 1 n (y i β 0 β 1 x i ) 2 i=1 RSS is a cnvex functin f β 0, β 1 Minimum achieved when (recall the chain rule): RSS β 0 RSS β 1 = 2 = 2 n (y i β 0 β 1 x i ) = 0 i=1 n x i (y i β 0 β 1 x i ) = 0 i=1

16 Linear Regressin Cefficients Slutin: min β 0,β 1 n (y i β 0 β 1 x i ) 2 i=1 β 0 = ȳ β 1 x β 1 = n i=1 (x i x)(y i ȳ) n i=1 (x i x) 2 = n i=1 x i(y i ȳ) n i=1 x i(x i x) where n n x = 1 n i=1 x i ȳ = 1 n i=1 y i

17 Why Minimize RSS

18 Why Minimize RSS 1. Maximize likelihd when Y = β 0 + β 1 X + ɛ when ɛ N (0, σ 2 )

19 Why Minimize RSS 1. Maximize likelihd when Y = β 0 + β 1 X + ɛ when ɛ N (0, σ 2 ) 2. Best Linear Unbiased Estimatr (BLUE): Gauss-Markv Therem (ESL 3.2.2)

20 Why Minimize RSS 1. Maximize likelihd when Y = β 0 + β 1 X + ɛ when ɛ N (0, σ 2 ) 2. Best Linear Unbiased Estimatr (BLUE): Gauss-Markv Therem (ESL 3.2.2) 3. It is cnvenient: can be slved in clsed frm

21 Bias in Estimatin Assume a true value µ Estimate µ is unbiased when E[µ] = µ Standard mean estimate is unbiased (e.g. X N (0, 1)): E [ 1 n ] n X i = 0 i=1 Standard variance estimate is biased (e.g. X N (0, 1)): E [ 1 n ] n (X i X) 2 1 i=1

22 Linear Regressin is Unbiased Y Y X X Gauss-Markv Therem (ESL 3.2.2)

23 Slutin f Linear Regressin TV Sales

24 Hw Gd is the Fit Hw well is linear regressin predicting the training data? Can we be sure that TV advertising really influences the sales? What is the prbability that we just gt lucky?

25 R 2 Statistic R 2 = 1 RSS n TSS = 1 i=1 (y i ŷ i ) 2 n i=1 (y i ȳ) 2 RSS - residual sum f squares, TSS - ttal sum f squares R 2 measures the gdness f the fit as a prprtin Prprtin f data variance explained by the mdel Extreme values: 0: Mdel des nt explain data 1: Mdel explains data perfectly

26 Example: TV Impact n Sales TV Sales

27 Example: TV Impact n Sales TV Sales R 2 = 0.61

28 Example: Radi Impact n Sales Radi Sales

29 Example: Radi Impact n Sales Radi Sales R 2 = 0.33

30 Example: Newspaper Impact n Sales Newspaper Sales

31 Example: Newspaper Impact n Sales Newspaper Sales R 2 = 0.05

32 Crrelatin Cefficient Measures dependence between tw randm variables X and Y r = Cv(X, Y ) Var(X) Var(Y ) Crrelatin cefficient r is between [ 1, 1] 0: Variables are nt related 1: Variables are perfectly related (same) 1: Variables are negatively related (different)

33 Crrelatin Cefficient Measures dependence between tw randm variables X and Y r = Cv(X, Y ) Var(X) Var(Y ) Crrelatin cefficient r is between [ 1, 1] 0: Variables are nt related 1: Variables are perfectly related (same) 1: Variables are negatively related (different) R 2 = r 2

34 Hypthesis Testing Null hypthesis H 0 : There is n relatinship between X and Y β 1 = 0 Alternative hypthesis H 1 : There is sme relatinship between X and Y β 1 0 Seek t reject hypthesis H 0 with small prbability (p-value) f making a mistake Imprtant tpic, but beynd the scpe f the curse

35 Multiple Linear Regressin Usually mre than ne feature is available sales = β 0 + β 1 TV + β 2 radi + β 3 newspaper + ɛ In general: p Y = β 0 + β j X j j=1

36 Multiple Linear Regressin Y X 2 X 1

37 Estimating Cefficients Predictin: p ŷ i = ˆβ 0 + ˆβ j x ij Errrs (y i are true values): j=1 Residual Sum f Squares e i = y i ŷ i RSS = e e e e 2 n = n i=1 e 2 i Hw t minimize RSS? Linear algebra!

38 Inference frm Linear Regressin 1. Are predictrs X 1, X 2,..., X p really predicting Y? 2. Is nly a subset f predictrs useful? 3. Hw well des linear mdel fit data? 4. What Y shuld be predict and hw accurate is it?

39 Inference 1 Are predictrs X 1, X 2,..., X p really predicting Y? Null hypthesis H 0 : There is n relatinship between X and Y β 1 = 0 Alternative hypthesis H 1 : There is sme relatinship between X and Y β 1 0 Seek t reject hypthesis H 0 with small prbability (p-value) f making a mistake See ISL n hw t cmpute F-statistic and reject H 0

40 Inference 2 Is nly a subset f predictrs useful? Cmpare predictin accuracy with nly a subset f features

41 Inference 2 Is nly a subset f predictrs useful? Cmpare predictin accuracy with nly a subset f features RSS always decreases with mre features!

42 Inference 2 Is nly a subset f predictrs useful? Cmpare predictin accuracy with nly a subset f features RSS always decreases with mre features! Other measures cntrl fr number f variables: 1. Mallws C p 2. Akaike infrmatin criterin 3. Bayesian infrmatin criterin 4. Adjusted R 2

43 Inference 2 Is nly a subset f predictrs useful? Cmpare predictin accuracy with nly a subset f features RSS always decreases with mre features! Other measures cntrl fr number f variables: 1. Mallws C p 2. Akaike infrmatin criterin 3. Bayesian infrmatin criterin 4. Adjusted R 2 Testing all subsets f features is impractical: 2 p ptins!

44 Inference 2 Is nly a subset f predictrs useful? Cmpare predictin accuracy with nly a subset f features RSS always decreases with mre features! Other measures cntrl fr number f variables: 1. Mallws C p 2. Akaike infrmatin criterin 3. Bayesian infrmatin criterin 4. Adjusted R 2 Testing all subsets f features is impractical: 2 p ptins! Mre n hw t d this later

45 Inference 3 Hw well des linear mdel fit data? R 2 als always increases with mre features (like RSS) Is the mdel linear? Plt it: Sales TV Radi Mre n this later

46 Inference 4 What Y shuld be predict and hw accurate is it? The linear mdel is used t make predictins: y predicted = ˆβ 0 + ˆβ 1 x new Can als predict a cnfidence interval (based n estimate n ɛ):

47 Inference 4 What Y shuld be predict and hw accurate is it? The linear mdel is used t make predictins: y predicted = ˆβ 0 + ˆβ 1 x new Can als predict a cnfidence interval (based n estimate n ɛ): Example: Spent $ n TV and $ n Radi advertising Cnfidence interval: predict f(x) (the average respnse): f(x) [10.985, 11, 528] Predictin interval: predict f(x) + ɛ (respnse + pssible nise) f(x) [7.930, ]

48 R ntebk

Resampling Methods. Cross-validation, Bootstrapping. Marek Petrik 2/21/2017

Resampling Methods. Cross-validation, Bootstrapping. Marek Petrik 2/21/2017 Resampling Methds Crss-validatin, Btstrapping Marek Petrik 2/21/2017 Sme f the figures in this presentatin are taken frm An Intrductin t Statistical Learning, with applicatins in R (Springer, 2013) with