Simple Linear Regression (single variable)
Introduction to Machine Learning
Marek Petrik, January 31, 2017
Some of the figures in this presentation are taken from An Introduction to Statistical Learning, with Applications in R (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
Last Class
1. Basic machine learning framework: Y = f(X)
2. Prediction vs. inference: predict Y vs. understand f
3. Parametric vs. non-parametric: linear regression vs. k-NN
4. Classification vs. regression: k-NN vs. linear regression
5. Why we need a test set: overfitting
What is Machine Learning?
Discover an unknown function f: X = set of features, or inputs; Y = target, or response; Y = f(X)
[Figure: Sales plotted against TV, Radio, and Newspaper advertising budgets]
Sales = f(TV, Radio, Newspaper)
Errors in Machine Learning: The World is Noisy
The world is too complex to model precisely.
Many features are not captured in data sets.
We need to allow for errors $\epsilon$ in f: $Y = f(X) + \epsilon$
How Good are Predictions?
Learned function $\hat{f}$
Test data: $(x_1, y_1), (x_2, y_2), \ldots$
Mean Squared Error (MSE): $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2$
This is an estimate of: $\mathrm{MSE} = \mathbb{E}\bigl[(Y - \hat{f}(X))^2\bigr] = \frac{1}{|\Omega|}\sum_{\omega \in \Omega} \bigl(Y(\omega) - \hat{f}(X(\omega))\bigr)^2$
Important: the samples $x_i$ are i.i.d.
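As a concrete illustration (not part of the original slides), the test MSE can be computed in R roughly as follows; the data frames train and test and the column names Sales and TV are assumptions for this sketch.

```r
# Fit on training data, evaluate on held-out test data (illustrative sketch)
fit  <- lm(Sales ~ TV, data = train)      # learned function f-hat
pred <- predict(fit, newdata = test)      # predictions f-hat(x_i) on the test inputs
mse  <- mean((test$Sales - pred)^2)       # MSE = (1/n) * sum of squared errors
mse
```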
Do We Need Test Data?
Why not just test on the training data?
[Figure: Y vs. X with fits of varying flexibility (left); Mean Squared Error vs. Flexibility (right)]
Flexibility is the degree of the polynomial being fit.
Gray line: training error; red line: test error.
Types of Function f
Regression: continuous target, $f : X \to \mathbb{R}$
[Figure: Income as a function of Years of Education and Seniority]
Classification: discrete target, $f : X \to \{1, 2, 3, \ldots, k\}$
[Figure: classification example with features X1 and X2]
Bias-Variance Decomposition
$Y = f(X) + \epsilon$
The Mean Squared Error can be decomposed as:
$\mathrm{MSE} = \mathbb{E}\bigl[(Y - \hat{f}(X))^2\bigr] = \underbrace{\mathrm{Var}(\hat{f}(X))}_{\text{Variance}} + \underbrace{\bigl(\mathbb{E}[\hat{f}(X)] - f(X)\bigr)^2}_{\text{Bias}^2} + \mathrm{Var}(\epsilon)$
Bias: how well the method would work with infinite data.
Variance: how much the output changes with different data sets.
Today
Basics of linear regression
Why linear regression
How to compute it
Why compute it
Simple Linear Regression
We have only one feature:
$Y \approx \beta_0 + \beta_1 X$, i.e., $Y = \beta_0 + \beta_1 X + \epsilon$
Example: [Figure: Sales vs. TV with a fitted line]
$\text{Sales} \approx \beta_0 + \beta_1 \cdot \text{TV}$
How To Estimate Coefficients
No line fits the data points $x_i$ with zero error.
Prediction: $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$
Errors ($y_i$ are the true values): $e_i = y_i - \hat{y}_i$
[Figure: Sales vs. TV]
Residual Sum of Squares
$\mathrm{RSS} = e_1^2 + e_2^2 + e_3^2 + \cdots + e_n^2 = \sum_{i=1}^{n} e_i^2$
Equivalently: $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2$
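In R, the RSS of a fitted model can be read off directly from its residuals; a minimal sketch, assuming the Advertising data frame with columns Sales and TV.

```r
# RSS from the residuals of a fitted simple linear regression (illustrative sketch)
fit <- lm(Sales ~ TV, data = Advertising)
sum(residuals(fit)^2)    # RSS = sum of squared residuals
deviance(fit)            # for lm objects, the reported deviance is the same RSS
```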
Minimizing Residual Sum of Squares
$\min_{\beta_0,\beta_1} \mathrm{RSS} = \min_{\beta_0,\beta_1} \sum_{i=1}^{n} e_i^2 = \min_{\beta_0,\beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$
[Figure: RSS as a function of β₀ and β₁ (surface and contour plots)]
Solving for Minimal RSS
$\min_{\beta_0,\beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$
RSS is a convex function of $\beta_0, \beta_1$.
The minimum is achieved when (recall the chain rule):
$\frac{\partial \mathrm{RSS}}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0$
$\frac{\partial \mathrm{RSS}}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0$
Linear Regression Coefficients
The solution to $\min_{\beta_0,\beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$ is:
$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$
$\hat\beta_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \dfrac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sum_{i=1}^{n} x_i (x_i - \bar{x})}$
where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$.
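For concreteness, a minimal R sketch (not from the original slides) that computes these closed-form coefficients and checks them against lm; the vectors x and y below are made-up example values.

```r
# Closed-form simple linear regression coefficients (illustrative sketch)
x <- c(1, 2, 3, 4, 5)                 # assumed feature values
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)       # assumed target values

beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0 <- mean(y) - beta1 * mean(x)

# The same estimates from R's built-in least squares fit
coef(lm(y ~ x))                       # should match c(beta0, beta1)
```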
Why Minimize RSS?
1. It maximizes the likelihood when $Y = \beta_0 + \beta_1 X + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$
2. Least squares is the Best Linear Unbiased Estimator (BLUE): Gauss-Markov Theorem (ESL 3.2.2)
3. It is convenient: the minimizer can be computed in closed form
Bias in Estimation
Assume a true value $\mu$.
An estimate $\hat\mu$ is unbiased when $\mathbb{E}[\hat\mu] = \mu$.
The standard mean estimate is unbiased (e.g., for $X \sim \mathcal{N}(0,1)$): $\mathbb{E}\bigl[\frac{1}{n}\sum_{i=1}^{n} X_i\bigr] = 0$
The standard variance estimate that divides by n is biased (e.g., for $X \sim \mathcal{N}(0,1)$): $\mathbb{E}\bigl[\frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2\bigr] \neq 1$ (it equals $(n-1)/n$)
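A quick simulation sketch in R (not from the slides) that illustrates the bias of the divide-by-n variance estimator; the sample size and number of replications are arbitrary choices.

```r
# Simulate the divide-by-n variance estimate many times and average it
set.seed(1)                               # for reproducibility
n    <- 10                                # small sample size makes the bias visible
reps <- 100000                            # number of simulated data sets
var_hat <- replicate(reps, {
  x <- rnorm(n)                           # X ~ N(0, 1), true variance = 1
  mean((x - mean(x))^2)                   # divide-by-n variance estimate
})
mean(var_hat)                             # close to (n - 1) / n = 0.9, not 1
```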
Linear Regression is Unbiased
[Figure: two panels of Y vs. X comparing true and estimated regression lines]
Gauss-Markov Theorem (ESL 3.2.2)
Solution of Linear Regression
[Figure: Sales vs. TV with the least squares fit]
How Good is the Fit?
How well is linear regression predicting the training data?
Can we be sure that TV advertising really influences the sales?
What is the probability that we just got lucky?
R² Statistic
$R^2 = 1 - \dfrac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$
RSS: residual sum of squares; TSS: total sum of squares
R² measures the goodness of fit as a proportion: the proportion of the data variance explained by the model.
Extreme values:
0: the model does not explain the data
1: the model explains the data perfectly
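As an illustration (not part of the slides), R² can be computed directly from RSS and TSS in R and compared with the value lm reports; the Advertising data frame and its column names are assumptions.

```r
# R^2 from its definition, compared with the value reported by lm (illustrative sketch)
fit <- lm(Sales ~ TV, data = Advertising)
rss <- sum(residuals(fit)^2)                                  # residual sum of squares
tss <- sum((Advertising$Sales - mean(Advertising$Sales))^2)   # total sum of squares
1 - rss / tss                                                 # R^2 by definition
summary(fit)$r.squared                                        # same value from the model summary
```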
Example: TV Impact on Sales
[Figure: Sales vs. TV with fitted line]
$R^2 = 0.61$
Example: Radio Impact on Sales
[Figure: Sales vs. Radio with fitted line]
$R^2 = 0.33$
Example: Newspaper Impact on Sales
[Figure: Sales vs. Newspaper with fitted line]
$R^2 = 0.05$
Correlation Coefficient
Measures the dependence between two random variables X and Y:
$r = \dfrac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)}\sqrt{\mathrm{Var}(Y)}}$
The correlation coefficient r lies in $[-1, 1]$:
0: the variables are not linearly related
1: the variables are perfectly positively related (same)
-1: the variables are perfectly negatively related (opposite)
In simple linear regression: $R^2 = r^2$
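A quick check of the identity R² = r² in R; a sketch, again assuming the Advertising data frame.

```r
# Squared sample correlation equals R^2 in simple linear regression (illustrative sketch)
r <- cor(Advertising$TV, Advertising$Sales)             # sample correlation coefficient
r^2                                                     # squared correlation
summary(lm(Sales ~ TV, data = Advertising))$r.squared   # matches R^2 of the fit
```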
Hypothesis Testing
Null hypothesis H₀: there is no relationship between X and Y ($\beta_1 = 0$).
Alternative hypothesis H₁: there is some relationship between X and Y ($\beta_1 \neq 0$).
We seek to reject H₀ with a small probability (p-value) of making a mistake.
Important topic, but beyond the scope of the course.
Multiple Linear Regression
Usually more than one feature is available:
$\text{sales} = \beta_0 + \beta_1 \cdot \text{TV} + \beta_2 \cdot \text{radio} + \beta_3 \cdot \text{newspaper} + \epsilon$
In general: $Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \epsilon$
Multiple Linear Regression
[Figure: Y as a function of X1 and X2 (fitted regression plane)]
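In R the multiple regression model is fit the same way as the single-variable case, just with more terms in the formula; a sketch assuming the Advertising data frame.

```r
# Multiple linear regression on all three advertising channels (illustrative sketch)
fit_all <- lm(Sales ~ TV + Radio + Newspaper, data = Advertising)
coef(fit_all)      # estimates of beta_0, beta_1, beta_2, beta_3
```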
Estimating Coefficients
Prediction: $\hat{y}_i = \hat\beta_0 + \sum_{j=1}^{p} \hat\beta_j x_{ij}$
Errors ($y_i$ are the true values): $e_i = y_i - \hat{y}_i$
Residual Sum of Squares: $\mathrm{RSS} = e_1^2 + e_2^2 + e_3^2 + \cdots + e_n^2 = \sum_{i=1}^{n} e_i^2$
How to minimize RSS? Linear algebra!
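To make the "linear algebra" remark concrete, here is a sketch (not from the slides) of the normal-equations solution $\hat\beta = (X^\top X)^{-1} X^\top y$ in R; the design matrix construction and the Advertising data are assumptions, and note that lm itself uses a more numerically stable QR factorization internally.

```r
# Least squares via the normal equations: solve (X^T X) beta = X^T y (illustrative sketch)
X <- cbind(1, Advertising$TV, Advertising$Radio, Advertising$Newspaper)  # design matrix with intercept column
y <- Advertising$Sales
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # coefficient estimates
beta_hat    # should match coef(lm(Sales ~ TV + Radio + Newspaper, data = Advertising))
```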
Inference from Linear Regression
1. Are the predictors X₁, X₂, ..., Xₚ really predicting Y?
2. Is only a subset of the predictors useful?
3. How well does the linear model fit the data?
4. What Y should we predict, and how accurate is the prediction?
Inference 1
Are the predictors X₁, X₂, ..., Xₚ really predicting Y?
Null hypothesis H₀: there is no relationship between X and Y ($\beta_1 = \beta_2 = \cdots = \beta_p = 0$).
Alternative hypothesis H₁: there is some relationship between X and Y (at least one $\beta_j \neq 0$).
We seek to reject H₀ with a small probability (p-value) of making a mistake.
See ISL 3.2.2 for how to compute the F-statistic and reject H₀.
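In R the F-statistic and its p-value appear in the model summary; a sketch, again assuming the Advertising data.

```r
# The overall F-test (H0: all slope coefficients are zero) is part of the model summary
fit_all <- lm(Sales ~ TV + Radio + Newspaper, data = Advertising)
summary(fit_all)                 # last line reports the F-statistic and its p-value
summary(fit_all)$fstatistic      # F-statistic value and its degrees of freedom
```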
Inference 2
Is only a subset of the predictors useful?
Compare prediction accuracy with only a subset of the features.
RSS always decreases with more features!
Other measures control for the number of variables (see the sketch below):
1. Mallows' Cp
2. Akaike information criterion (AIC)
3. Bayesian information criterion (BIC)
4. Adjusted R²
Testing all subsets of features is impractical: $2^p$ options!
More on how to do this later.
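As an illustration (not from the slides), these criteria can be compared across candidate models of different sizes in R; the Advertising data frame is an assumption.

```r
# Comparing models of different sizes with criteria that penalize extra variables
fit_tv   <- lm(Sales ~ TV, data = Advertising)
fit_two  <- lm(Sales ~ TV + Radio, data = Advertising)
fit_full <- lm(Sales ~ TV + Radio + Newspaper, data = Advertising)

AIC(fit_tv, fit_two, fit_full)        # Akaike information criterion (lower is better)
BIC(fit_tv, fit_two, fit_full)        # Bayesian information criterion (lower is better)
summary(fit_full)$adj.r.squared       # adjusted R^2 accounts for the number of predictors
```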
Inference 3
How well does the linear model fit the data?
R² also always increases with more features (just as RSS always decreases).
Is the model linear? Plot it:
[Figure: Sales as a function of TV and Radio]
More on this later.
Inference 4
What Y should we predict, and how accurate is the prediction?
The linear model is used to make predictions: $y_{\text{predicted}} = \hat\beta_0 + \hat\beta_1 x_{\text{new}}$
We can also predict a confidence interval (based on an estimate of $\epsilon$).
Example: spend $100,000 on TV and $20,000 on Radio advertising.
Confidence interval: predict $f(X)$ (the average response): $f(X) \in [10.985, 11.528]$
Prediction interval: predict $f(X) + \epsilon$ (response plus possible noise): $f(X) + \epsilon \in [7.930, 14.580]$
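These two intervals correspond to the interval argument of R's predict function; a sketch assuming the Advertising data and the two-variable model, with budgets in thousands of dollars (an assumption about the data's units).

```r
# Confidence interval (average response) vs. prediction interval (single new response)
fit_two <- lm(Sales ~ TV + Radio, data = Advertising)
new_obs <- data.frame(TV = 100, Radio = 20)                    # assumed units: thousands of dollars
predict(fit_two, newdata = new_obs, interval = "confidence")   # interval for f(x)
predict(fit_two, newdata = new_obs, interval = "prediction")   # wider interval for f(x) + epsilon
```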
R notebook