Machine Learning. 1. Linear Regression


Machine Learning, 1. Linear Regression. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), Institute for Business Economics and Information Systems & Institute for Computer Science, University of Hildesheim. Course on Machine Learning, winter term 2009/2010.

Outline: 1. The Regression Problem, 2. Simple Linear Regression, 3. Multiple Regression, 4. Variable Interactions, 5. Model Selection, 6. Case Weights.

1. The Regression Problem / Example
Example: how does gas consumption depend on external temperature? (Whiteside, 1960s.) Weekly measurements of average external temperature and total gas consumption (in 1000 cubic feet). A third variable encodes two heating seasons, before and after wall insulation. How does gas consumption depend on external temperature? How much gas is needed for a given temperature?

[Figure: gas consumption (1000 cubic feet) vs. average external temperature (deg. C), with a fitted linear model.]

1. The Regression Problem / Example
[Figure: gas consumption (1000 cubic feet) vs. average external temperature (deg. C), fitted once with a linear model and once with a more flexible model.]

1. The Regression Problem / Variable Types and Coding
The most common variable types:
numerical / interval-scaled / quantitative: differences and quotients etc. are meaningful, usually with domain X := R, e.g., temperature, size, weight.
nominal / discrete / categorical / qualitative / factor: differences and quotients are not defined, usually with a finite, enumerated domain, e.g., X := {red, green, blue} or X := {a, b, c, ..., z}.
ordinal / ordered categorical: levels are ordered, but differences and quotients are not defined, usually with a finite, enumerated domain, e.g., X := {small, medium, large}.

1. The Regression Problem / Variable Types and Coding
Nominals are usually encoded as binary dummy variables:
δ_x0(X) := 1, if X = x0; 0, else,
one for each x0 ∈ X (but one). Example: X := {red, green, blue}. Replace the one variable X with 3 levels (red, green, blue) by two variables δ_red(X) and δ_green(X) with 2 levels each (0, 1):
X = red: δ_red(X) = 1, δ_green(X) = 0; X = green: δ_red(X) = 0, δ_green(X) = 1; X = blue: δ_red(X) = 0, δ_green(X) = 0.

1. The Regression Problem / The Regression Problem Formally
Let X_1, X_2, ..., X_p be random variables called predictors (or inputs, covariates). Let 𝒳_1, 𝒳_2, ..., 𝒳_p be their domains. We write shortly X := (X_1, X_2, ..., X_p) for the vector of random predictor variables and 𝒳 := 𝒳_1 × 𝒳_2 × ... × 𝒳_p for its domain.
Let Y be a random variable called target (or output, response). Let 𝒴 be its domain.
Let D ⊆ 𝒳 × 𝒴 be a (multi)set of instances of the unknown joint distribution p(X, Y) of predictors and target, called data. D is often written as an enumeration D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}.
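Returning to the dummy coding of nominals above, a minimal R sketch (the colour values are made up for illustration); R's model.matrix() produces exactly this one-dummy-per-level-but-one encoding:

X <- factor(c("red", "green", "blue", "red"), levels = c("blue", "red", "green"))
# one binary dummy per level except the reference level ("blue" here):
model.matrix(~ X)
# columns: (Intercept), Xred, Xgreen; the row for "blue" has 0 in both dummies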

1. The Regression Problem / The Regression Problem Formally
The task of regression and classification is to predict Y based on X, i.e., to estimate
r(x) := E(Y | X = x) = ∫ y p(y | x) dy
based on the data; r is called the regression function. If Y is numerical, the task is called regression. If Y is nominal, the task is called classification.

2. Simple Linear Regression / Simple Examples: Single Predictor vs. Multiple Predictors
[Figure: a regression with a single predictor x vs. a regression with multiple predictors.]

2. Simple Linear Regression / Simple Examples: Regression Function
[Figure: observations, their average and the truth for a linear regression function vs. a non-linear regression function.]

2. Simple Linear Regression / Simple Examples: Size of Errors
[Figures: observations with average and truth, and a density estimate of the errors (N = 100): small errors vs. large errors.]

2. Simple Linear Regression / Simple Examples: Distribution of Errors
[Figures: observations and error densities (N = 100): normally distributed errors vs. uniformly distributed errors.]

2. Simple Linear Regression / Simple Examples: Homoscedastic vs. Heteroscedastic Errors
[Figures: observations and errors.] Errors do not depend on the predictors (homoscedastic) vs. errors do depend on the predictors (heteroscedastic).

2. Simple Linear Regression / Simple Examples: Distribution of Predictors
[Figures: observations and predictor densities (N = 100): predictors are uniformly distributed vs. predictors are normally distributed.]

2. Simple Linear Regression / Simple Linear Regression Model
Make it simple:
the predictor X is simple, i.e., one-dimensional (X = X_1);
r(x) is assumed to be linear: r(x) = β_0 + β_1 x;
the variance is assumed not to depend on X:
Y = β_0 + β_1 X + ε, E(ε | X) = 0, V(ε | X) = σ².
3 parameters: β_0 intercept (sometimes also called bias), β_1 slope, σ² variance.

2. Simple Linear Regression / Simple Linear Regression Model
Parameter estimates: β̂_0, β̂_1, σ̂².
Fitted line: r̂(x) := β̂_0 + β̂_1 x.
Predicted / fitted values: ŷ_i := r̂(x_i).
Residuals: ε̂_i := y_i - ŷ_i = y_i - (β̂_0 + β̂_1 x_i).
Residual sum of squares (RSS) / square loss / L2 loss: RSS = Σ_{i=1}^n ε̂_i².

2. Simple Linear Regression / How to estimate the parameters?
Example: Given the data D := {(1, 2), (2, 3), (4, 6)}, predict a value for x = 3.
[Figure: the three data points.]

2. Simple Linear Regression / How to estimate the parameters?
Example: Given the data D := {(1, 2), (2, 3), (4, 6)}, predict a value for x = 3.
Line through the first two points:
β̂_1 = (3 - 2)/(2 - 1) = 1, β̂_0 = y_1 - β̂_1 x_1 = 1, hence r̂(3) = 4.
RSS = 0 + 0 + (6 - 5)² = 1.
[Figure: data and fitted line.]

2. Simple Linear Regression / How to estimate the parameters?
Example: Given the data D := {(1, 2), (2, 3), (4, 6)}, predict a value for x = 3.
Line through the first and the last point:
β̂_1 = (6 - 2)/(4 - 1) = 4/3 ≈ 1.333, β̂_0 = y_1 - β̂_1 x_1 = 2/3 ≈ 0.667, hence r̂(3) = 2/3 + (4/3)·3 ≈ 4.67.
RSS = (3 - 10/3)² = 1/9 ≈ 0.111.
[Figure: data and fitted line.]

2. Simple Linear Regression / Least Squares Estimates / Definition
In principle, there are many different methods to estimate the parameters β̂_0, β̂_1 and σ̂² from data, depending on the properties the solution should have. The least squares estimates are those parameters that minimize
RSS = Σ_{i=1}^n ε̂_i² = Σ_{i=1}^n (y_i - ŷ_i)² = Σ_{i=1}^n (y_i - (β̂_0 + β̂_1 x_i))².
They can be written in closed form as follows:
β̂_1 = Σ_{i=1}^n (x_i - x̄)(y_i - ȳ) / Σ_{i=1}^n (x_i - x̄)²,
β̂_0 = ȳ - β̂_1 x̄,
σ̂² = (1/(n - 2)) Σ_{i=1}^n ε̂_i².

2. Simple Linear Regression / Least Squares Estimates / Proof
Proof (1/2):
RSS = Σ_{i=1}^n (y_i - (β̂_0 + β̂_1 x_i))²
∂RSS/∂β̂_0 = Σ_{i=1}^n 2 (y_i - (β̂_0 + β̂_1 x_i)) (-1) =! 0
⟹ n β̂_0 = Σ_{i=1}^n y_i - β̂_1 Σ_{i=1}^n x_i, i.e., β̂_0 = ȳ - β̂_1 x̄.

2. Simple Linear Regression / Least Squares Estimates / Proof
Proof (2/2):
RSS = Σ_{i=1}^n (y_i - (β̂_0 + β̂_1 x_i))² = Σ_{i=1}^n (y_i - (ȳ - β̂_1 x̄) - β̂_1 x_i)² = Σ_{i=1}^n (y_i - ȳ - β̂_1 (x_i - x̄))²
∂RSS/∂β̂_1 = Σ_{i=1}^n 2 (y_i - ȳ - β̂_1 (x_i - x̄)) (-1) (x_i - x̄) =! 0
⟹ β̂_1 = Σ_{i=1}^n (y_i - ȳ)(x_i - x̄) / Σ_{i=1}^n (x_i - x̄)².

2. Simple Linear Regression / Least Squares Estimates / Example
Example: Given the data D := {(1, 2), (2, 3), (4, 6)}, predict a value for x = 3. Assume a simple linear model. x̄ = 7/3, ȳ = 11/3.
i    x_i - x̄    y_i - ȳ    (x_i - x̄)²    (x_i - x̄)(y_i - ȳ)
1    -4/3       -5/3       16/9          20/9
2    -1/3       -2/3       1/9           2/9
3    5/3        7/3        25/9          35/9
sum                         42/9          57/9
β̂_1 = Σ_{i=1}^n (x_i - x̄)(y_i - ȳ) / Σ_{i=1}^n (x_i - x̄)² = 57/42 ≈ 1.357
β̂_0 = ȳ - β̂_1 x̄ = 11/3 - (57/42)(7/3) = 0.5
[Figure: data and fitted line.]

2. Simple Linear Regression / Least Squares Estimates / Example
With β̂_0 = 0.5 and β̂_1 ≈ 1.357: r̂(3) = 0.5 + 1.357·3 ≈ 4.57.
RSS = (2 - 1.857)² + (3 - 3.214)² + (6 - 5.929)² ≈ 0.071.
[Figure: data and fitted line.]
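A minimal R sketch of this worked example (the data are the three points from the slide; the variable names are mine):

x <- c(1, 2, 4); y <- c(2, 3, 6)
beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # 57/42, about 1.357
beta0 <- mean(y) - beta1 * mean(x)                                   # 0.5
c(beta0, beta1)
coef(lm(y ~ x))       # the same estimates via lm()
beta0 + beta1 * 3     # prediction r^(3), about 4.57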

2. Simple Linear Regression / A Generative Model
So far we assumed the model
Y = β_0 + β_1 X + ε, E(ε | X) = 0, V(ε | X) = σ²,
where we required some properties of the errors, but not their exact distribution. If we make assumptions about their distribution, e.g.,
ε | X ~ N(0, σ²) and thus Y | X ~ N(β_0 + β_1 X, σ²),
we can sample from this model.

2. Simple Linear Regression / Maximum Likelihood Estimates (MLE)
Let p(X, Y | θ) be a joint probability density function for X and Y with parameters θ. Likelihood:
L_D(θ) := Π_{i=1}^n p(x_i, y_i | θ).
The likelihood describes the probability of the data. The maximum likelihood estimates (MLE) are those parameters that maximize the likelihood.
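A minimal R sketch of sampling from the generative model above (the parameter values are arbitrary choices of mine, not from the slides):

set.seed(1)
beta0 <- 1; beta1 <- 2; sigma <- 0.5
x <- runif(100, 0, 10)
y <- beta0 + beta1 * x + rnorm(100, sd = sigma)   # eps | X ~ N(0, sigma^2)
coef(lm(y ~ x))   # least squares estimates recover beta0 and beta1 approximately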

2. Simple Linear Regression / Least Squares Estimates and Maximum Likelihood Estimates
Likelihood:
L_D(β̂_0, β̂_1, σ̂²) := Π_{i=1}^n p̂(x_i, y_i) = Π_{i=1}^n p̂(y_i | x_i) p(x_i) = (Π_{i=1}^n p̂(y_i | x_i)) (Π_{i=1}^n p(x_i)).
Conditional likelihood:
L^cond_D(β̂_0, β̂_1, σ̂²) := Π_{i=1}^n p̂(y_i | x_i) = Π_{i=1}^n (1/(√(2π) σ̂)) e^(-(y_i - ŷ_i)²/(2σ̂²)) = (2π)^(-n/2) σ̂^(-n) e^(-(1/(2σ̂²)) Σ_{i=1}^n (y_i - ŷ_i)²).
Conditional log-likelihood:
log L^cond_D(β̂_0, β̂_1, σ̂²) ∝ -n log σ̂ - (1/(2σ̂²)) Σ_{i=1}^n (y_i - ŷ_i)².
⟹ If we assume normality, the maximum likelihood estimates are just the least squares estimates.

2. Simple Linear Regression / Implementation Details
simple-regression(D):
    sx := 0, sy := 0
    for i = 1, ..., n do
        sx := sx + x_i
        sy := sy + y_i
    od
    x̄ := sx / n, ȳ := sy / n
    a := 0, b := 0
    for i = 1, ..., n do
        a := a + (x_i - x̄)(y_i - ȳ)
        b := b + (x_i - x̄)²
    od
    β_1 := a / b
    β_0 := ȳ - β_1 x̄
    return (β_0, β_1)

2. Simple Linear Regression / Implementation Details
The naive version above needs two passes over the data. With a single loop:
simple-regression(D):
    sx := 0, sy := 0, sxx := 0, syy := 0, sxy := 0
    for i = 1, ..., n do
        sx := sx + x_i
        sy := sy + y_i
        sxx := sxx + x_i²
        syy := syy + y_i²
        sxy := sxy + x_i y_i
    od
    β_1 := (n sxy - sx sy) / (n sxx - sx sx)
    β_0 := (sy - β_1 sx) / n
    return (β_0, β_1)
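A minimal R translation of the single-loop variant (my own sketch; in R the loop is naturally replaced by vectorised sums):

simple.regression <- function(x, y) {
  n <- length(x)
  sx <- sum(x); sy <- sum(y)
  sxx <- sum(x^2); sxy <- sum(x * y)
  beta1 <- (n * sxy - sx * sy) / (n * sxx - sx * sx)
  beta0 <- (sy - beta1 * sx) / n
  c(beta0 = beta0, beta1 = beta1)
}
simple.regression(c(1, 2, 4), c(2, 3, 6))   # 0.5 and about 1.357, as in the example above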

3. Multiple Regression / Several predictors
Several predictor variables X_1, X_2, ..., X_p:
Y = β_0 + β_1 X_1 + β_2 X_2 + ... + β_p X_p + ε = β_0 + Σ_{i=1}^p β_i X_i + ε
with p + 1 parameters β_0, β_1, ..., β_p.

3. Multiple Regression / Linear form
Several predictor variables X_1, X_2, ..., X_p:
Y = β_0 + Σ_{i=1}^p β_i X_i + ε = ⟨β, X⟩ + ε
where β := (β_0, β_1, ..., β_p)^T and X := (1, X_1, ..., X_p)^T.
Thus, the intercept is handled like any other parameter, for the artificial constant variable X_0 ≡ 1.

3. Multiple Regression / Simultaneous equations for the whole dataset
For the whole dataset (x_1, y_1), ..., (x_n, y_n):
Y = Xβ + ε
where Y := (y_1, ..., y_n)^T, ε := (ε_1, ..., ε_n)^T, and X is the n × (p + 1) matrix whose i-th row is (1, x_{i,1}, x_{i,2}, ..., x_{i,p}).

3. Multiple Regression / Least squares estimates
Least squares estimates β̂ minimize
‖Y - Ŷ‖² = ‖Y - Xβ̂‖².
The least squares estimates β̂ are computed via
X^T X β̂ = X^T Y.
Proof:
∂/∂β̂ ‖Y - Xβ̂‖² = ∂/∂β̂ ⟨Y - Xβ̂, Y - Xβ̂⟩ = -2 ⟨X, Y - Xβ̂⟩ = -2 (X^T Y - X^T X β̂) =! 0.

3. Multiple Regression / How to compute least squares estimates β̂
Solve the p × p system of linear equations
X^T X β̂ = X^T Y,
i.e., Ax = b (with A := X^T X, b := X^T Y, x := β̂). There are several numerical methods available:
1. Gaussian elimination
2. Cholesky decomposition
3. QR decomposition

3. Multiple Regression / How to compute least squares estimates β̂ / Example
Given is the following data: predict a value for x_1 = 3, x_2 = 4.
[Table: the example data points (x_1, x_2, y).]
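Regarding the three numerical routes above, a minimal R sketch on made-up data (the data and dimensions are assumptions of mine, not the slide's example):

set.seed(2)
X <- cbind(1, matrix(rnorm(20), ncol = 2))       # n = 10 cases, intercept column plus 2 predictors
y <- X %*% c(1, 2, -1) + rnorm(10, sd = 0.1)
A <- t(X) %*% X; b <- t(X) %*% y
solve(A, b)                                      # 1. Gaussian elimination on the normal equations
R <- chol(A)                                     # 2. Cholesky decomposition: A = R'R
backsolve(R, forwardsolve(t(R), b))
qr.coef(qr(X), y)                                # 3. QR decomposition of X (numerically preferred)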

3. Multiple Regression / How to compute least squares estimates β̂ / Example
Fitting each predictor separately:
Y = β_0 + β_1 X_1 + ε and Y = β_0 + β_2 X_2 + ε.
[Figures: data and fitted line for each of the two simple regressions, with the resulting predictions ŷ(x_1 = 3) and ŷ(x_2 = 4).]

3. Multiple Regression / How to compute least squares estimates β̂ / Example
Now fit
Y = β_0 + β_1 X_1 + β_2 X_2 + ε
to the data: X = ..., Y = ..., giving X^T X = ... and X^T Y = ...

3. Multiple Regression / How to compute least squares estimates β̂ / Example
Solving X^T X β̂ = X^T Y yields β̂ = (..., ..., ...).

3. Multiple Regression / How to compute least squares estimates β̂ / Example
[Figure: the data and the fitted model.]

3. Multiple Regression / How to compute least squares estimates β̂ / Example
To visually assess the model fit, the residuals ε̂ = y - ŷ can be plotted against the true values y.
[Figure: residual plot.]

3. Multiple Regression / The Normal Distribution (also Gaussian)
Written as: X ~ N(µ, σ²), with parameters µ (mean) and σ (standard deviation).
Probability density function (pdf):
φ(x) := (1/(√(2π) σ)) e^(-(x - µ)²/(2σ²)).
Cumulative distribution function (cdf):
Φ(x) := ∫_{-∞}^x φ(t) dt.
Φ^(-1) is called the quantile function. Φ and Φ^(-1) have no analytical form, but have to be computed numerically.
[Figures: the pdf φ(x) and the cdf Φ(x).]
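For instance, in R these numerical evaluations are available as pnorm() and qnorm():

pnorm(0)        # Phi(0) = 0.5
pnorm(1.96)     # approximately 0.975
qnorm(0.975)    # approximately 1.96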

3. Multiple Regression / The t Distribution
Written as: X ~ t_p, with parameter p (degrees of freedom). Probability density function (pdf):
p(x) := Γ((p + 1)/2) / (√(pπ) Γ(p/2)) · (1 + x²/p)^(-(p+1)/2).
For p → ∞, t_p → N(0, 1).
[Figure: pdf and cdf of t_p for p = 5, 7, 10.]

3. Multiple Regression / The χ² Distribution
Written as: X ~ χ²_p, with parameter p (degrees of freedom). Probability density function (pdf):
p(x) := (1/(Γ(p/2) 2^(p/2))) x^(p/2 - 1) e^(-x/2).
If X_1, ..., X_p ~ N(0, 1), then Y := Σ_{i=1}^p X_i² ~ χ²_p.
[Figure: pdf and cdf of χ²_p for p = 5, 10, 50.]

3. Multiple Regression / Parameter Variance
β̂ = (X^T X)^(-1) X^T Y is an unbiased estimator for β (i.e., E(β̂) = β). Its variance is
V(β̂) = (X^T X)^(-1) σ².
Proof:
β̂ = (X^T X)^(-1) X^T Y = (X^T X)^(-1) X^T (Xβ + ε) = β + (X^T X)^(-1) X^T ε.
As E(ε) = 0: E(β̂) = β.
V(β̂) = E((β̂ - E(β̂))(β̂ - E(β̂))^T) = E((X^T X)^(-1) X^T ε ε^T X (X^T X)^(-1)) = (X^T X)^(-1) σ².

3. Multiple Regression / Parameter Variance
An unbiased estimator for σ² is
σ̂² = (1/(n - p)) Σ_{i=1}^n ε̂_i² = (1/(n - p)) Σ_{i=1}^n (y_i - ŷ_i)².
If ε ~ N(0, σ²), then
β̂ ~ N(β, (X^T X)^(-1) σ²).
Furthermore
(n - p) σ̂² / σ² ~ χ²_{n-p}.

3. Multiple Regression / Parameter Variance / Standardized coefficient
Standardized coefficient ("z-score"):
z_i := β̂_i / ŝe(β̂_i), with ŝe²(β̂_i) the i-th diagonal element of (X^T X)^(-1) σ̂².
z_i would be distributed N(0, 1) if σ were known (under H_0: β_i = 0). With the estimated σ̂ it is z_i ~ t_{n-p}.
The Wald test for H_0: β_i = 0 with size α is:
reject H_0 if |z_i| = |β̂_i| / ŝe(β̂_i) > F^(-1)_{t_{n-p}}(1 - α/2),
i.e., its p-value is
p-value(H_0: β_i = 0) = 2 (1 - F_{t_{n-p}}(|z_i|)) = 2 (1 - F_{t_{n-p}}(|β̂_i| / ŝe(β̂_i))),
and small p-values such as 0.01 and 0.05 are good.

3. Multiple Regression / Parameter Variance / Confidence interval
The 1 - α confidence interval for β_i:
β̂_i ± F^(-1)_{t_{n-p}}(1 - α/2) ŝe(β̂_i).
For large n, F_{t_{n-p}} converges to the standard normal cdf Φ. As Φ^(-1)(0.975) ≈ 1.96 ≈ 2, the rule of thumb for a 95% confidence interval (α = 5%) is
β̂_i ± 2 ŝe(β̂_i).
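A minimal R sketch of these quantities on simulated data (the data are an assumption of mine); the resulting table agrees with the coefficient table of summary(lm(y ~ x1 + x2)):

set.seed(3)
n <- 50; x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)                        # x2 has true coefficient 0
X <- cbind(1, x1, x2); p <- ncol(X)
beta.hat   <- solve(t(X) %*% X, t(X) %*% y)
sigma2.hat <- sum((y - X %*% beta.hat)^2) / (n - p)
se.hat     <- sqrt(diag(solve(t(X) %*% X)) * sigma2.hat)
z          <- beta.hat / se.hat                   # z-scores
p.value    <- 2 * (1 - pt(abs(z), df = n - p))    # Wald test p-values
ci         <- cbind(beta.hat - 2 * se.hat, beta.hat + 2 * se.hat)   # rule-of-thumb 95% intervals
cbind(beta.hat, se.hat, z, p.value)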

3. Multiple Regression / Parameter Variance / Example
We have already fitted Ŷ = β̂_0 + β̂_1 X_1 + β̂_2 X_2 to the data. From the table of x_1, x_2, y, ŷ and ε̂² = (y - ŷ)² we obtain RSS and
σ̂² = (1/(n - p)) Σ_{i=1}^n ε̂_i²,
and from (X^T X)^(-1) σ̂² the standard errors, so that for each covariate (intercept, X_1, X_2) the estimate β̂_i, its standard error ŝe(β̂_i), the z-score and the p-value can be tabulated.

3. Multiple Regression / Parameter Variance / Example 2
Example: sociographic data of the 50 US states in the state dataset:
income (per capita, 1974),
illiteracy (percent of population, 1970),
life expectancy (in years, 1969-71),
percent high-school graduates (1970),
population (July 1, 1975),
murder rate per 100,000 population (1976),
mean number of days with minimum temperature below freezing (1931-1960) in capital or large city,
land area in square miles.
[Figure: scatterplot matrix of Income, Illiteracy, Life Exp, HS Grad.]

3. Multiple Regression / Parameter Variance / Example 2
Murder = β_0 + β_1 Population + β_2 Income + β_3 Illiteracy + β_4 LifeExp + β_5 HSGrad + β_6 Frost + β_7 Area
n = 50 states, p = 8 parameters, n - p = 42 degrees of freedom.
Least squares estimators: in the coefficient table (Estimate, Std. Error, t value, Pr(>|t|)), the intercept and Life Exp are highly significant (p-values of order 1e-08, ***), Population is significant (**), while Income, Illiteracy, HS Grad, Frost and Area are not significant.

4. Variable Interactions / Need for higher orders
Assume a target variable does not depend linearly on a predictor variable, but, say, quadratically. Example: distance travelled vs. duration of a moving object with constant acceleration a:
s(t) = (1/2) a t² + ε.
Can we catch such a dependency? Can we catch it with a linear model?

4. Variable Interactions / Need for general transformations
To describe many phenomena, even more complex functions of the input variables are needed. Example: the number of cells n vs. duration of growth t:
n = β e^(αt) + ε.
n does not depend on t directly, but on e^(αt) (with a known α).

4. Variable Interactions / Need for variable interactions
In a linear model with two predictors
Y = β_0 + β_1 X_1 + β_2 X_2 + ε,
Y depends on both X_1 and X_2. But changes in X_1 will affect Y the same way, regardless of X_2. There are problems where X_2 mediates or influences the way X_1 affects Y, e.g., the distance travelled s of a moving object vs. its constant velocity v and duration t:
s = v t + ε.
Then an additional 1 s of duration will increase the distance not in a uniform way (regardless of the velocity), but a little for small velocities and a lot for large velocities. v and t are said to interact: Y does not depend only on each predictor separately, but also on their product.

4. Variable Interactions / Derived variables
All these cases can be handled by looking at derived variables, i.e., instead of
Y = β_0 + β_1 X_1² + ε,
Y = β_0 + β_1 e^(αX_1) + ε,
Y = β_0 + β_1 X_1 X_2 + ε,
one looks at
Y = β_0 + β_1 X̃_1 + ε
with
X̃_1 := X_1², X̃_1 := e^(αX_1), X̃_1 := X_1 X_2, respectively.
Derived variables are computed before the fitting process and taken into account either in addition to the original variables or instead of them.
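A minimal R sketch of derived variables and interactions via the formula interface (data and coefficient values are made up for illustration):

set.seed(4)
dur <- runif(50, 0, 10); vel <- runif(50, 1, 5)
s <- 0.5 * 2 * dur^2 + rnorm(50, sd = 0.5)    # distance under constant acceleration a = 2
lm(s ~ I(dur^2))                              # derived variable dur^2; the model stays linear in the parameters
s2 <- vel * dur + rnorm(50, sd = 0.5)         # distance = velocity * duration
lm(s2 ~ vel:dur)                              # product (interaction) term only
lm(s2 ~ vel * dur)                            # main effects plus the interaction vel:dur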

5. Model Selection / Underfitting
[Figure: data and a fitted linear model.]
If a model does not explain the data well, e.g., if the true model is quadratic but we try to fit a linear model, one says the model underfits.

5. Model Selection / Overfitting / Fitting Polynomials of High Degree
[Figures: the same data fitted with polynomials of increasing degree.]


5. Model Selection / Overfitting / Fitting Polynomials of High Degree
If to data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) consisting of n points we fit
Y = β_0 + β_1 X_1 + β_2 X_1² + ... + β_{n-1} X_1^(n-1),
i.e., a polynomial with degree n - 1, then this results in an interpolation of the data points (if there are no repeated measurements, i.e., points with the same X_1), as the polynomial
r(X) = Σ_{i=1}^n y_i Π_{j ≠ i} (X - x_j) / (x_i - x_j)
is of this type and has minimal RSS = 0.

5. Model Selection / Model Selection Measures
Model selection means: we have a set of models, e.g.,
Y = Σ_{i=0}^{p-1} β_i X_i,
indexed by p (i.e., one model for each value of p), and we make a choice which model describes the data best. If we just look at losses / fit measures such as RSS, then
the larger p, the better the fit, or equivalently the larger p, the lower the loss,
as the model with p parameters can be reparametrized as a model with p' > p parameters by setting
β'_i = β_i for i ≤ p, and β'_i = 0 for i > p.
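A minimal R sketch of the interpolation property described above (random data, arbitrary n):

set.seed(5)
n <- 6
x <- sort(runif(n)); y <- rnorm(n)
fit <- lm(y ~ poly(x, degree = n - 1))   # polynomial of degree n - 1
sum(residuals(fit)^2)                    # RSS is (numerically) 0: the fit interpolates all n points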

5. Model Selection / Model Selection Measures
One uses model selection measures of the type
model selection measure = fit - complexity,
or equivalently
model selection measure = loss + complexity.
The smaller the loss (= lack of fit), the better the model. The smaller the complexity, the simpler and thus better the model. The model selection measure tries to find a trade-off between fit/loss and complexity.

5. Model Selection / Model Selection Measures
Akaike Information Criterion (AIC):
(maximize) AIC := log L - p
or (minimize) AIC := -2 log L + 2p, which for linear regression with Gaussian errors is, up to an additive constant, n log(RSS/n) + 2p.
Bayes Information Criterion (BIC) / Bayes-Schwarz Information Criterion:
(maximize) BIC := log L - (p/2) log n.
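A minimal R sketch of the RSS-based AIC variant above (my own helper function; its constant term differs from R's built-in AIC(), but differences between models fitted to the same data agree):

aic.rss <- function(fit) {
  n <- length(residuals(fit))
  p <- length(coef(fit))
  n * log(sum(residuals(fit)^2) / n) + 2 * p
}
# usage (placeholder variable names): compare nested models and prefer the smaller value
# aic.rss(lm(y ~ x1)); aic.rss(lm(y ~ x1 + x2))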

5. Model Selection / Variable Backward Selection
[Diagram: backward selection tree; each candidate model, with one variable removed (marked X), is annotated with its AIC, and the removal that most improves AIC is chosen at each step.]


5. Model Selection / Variable Backward Selection
Full model: the coefficient table (Estimate, Std. Error, t value, Pr(>|t|)) as on the previous example slide, with the intercept and Life Exp highly significant (***) and Population significant (**).
AIC-optimal model by backward selection: Murder ~ Population + Illiteracy + Life Exp + Frost + Area, i.e., Income and HS Grad are dropped. In the optimal model the intercept and Life Exp remain highly significant (***), Population significant (**), and Area significant (*).

5. Model Selection / How to do it in R
library(datasets); library(MASS);
st = as.data.frame(state.x77);
mod.full = lm(Murder ~ ., data = st);
summary(mod.full);
mod.opt = stepAIC(mod.full);
summary(mod.opt);

5. Model Selection / Shrinkage
Model selection operates by fitting models for a set of models with varying complexity and then picking the best one ex post, omitting some parameters completely (i.e., forcing them to be 0). Shrinkage operates by including a penalty term directly in the model equation, favoring small parameter values in general.

5. Model Selection / Shrinkage / Ridge Regression [Hoe62]
Ridge regression: minimize
RSS_λ(β̂) = RSS(β̂) + λ Σ_{j=1}^p β̂_j² = ⟨Y - Xβ̂, Y - Xβ̂⟩ + λ Σ_{j=1}^p β̂_j²,
which is solved by
β̂ = (X^T X + λI)^(-1) X^T Y,
with λ ≥ 0 a complexity parameter / regularization parameter. As solutions of ridge regression are not equivariant under scaling of the predictors, and as it does not make sense to include a constraint for the parameter of the intercept, the data is normalized before ridge regression:
x̃_{i,j} := (x_{i,j} - x̄_{·,j}) / σ̂(x_{·,j}).
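A minimal R sketch of ridge regression as defined above (made-up data; predictors standardised and the target centred so that the intercept is not penalised):

set.seed(6)
n <- 40
X <- matrix(rnorm(n * 3), ncol = 3)
y <- X %*% c(2, -1, 0) + rnorm(n)
Xs <- scale(X)                 # normalise each predictor: (x - mean) / sd
yc <- y - mean(y)              # centre the target
lambda <- 5
beta.ridge <- solve(t(Xs) %*% Xs + lambda * diag(ncol(Xs)), t(Xs) %*% yc)
beta.ridge                     # shrunken coefficients; lambda = 0 reproduces ordinary least squares
# the MASS function lm.ridge() offers a packaged alternative (it applies its own internal scaling)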

5. Model Selection / Shrinkage / Ridge Regression (2/3)
Ridge regression is a combination of
Σ_{i=1}^n (y_i - ŷ_i)² + λ Σ_{j=1}^p β_j²,
i.e., L2 loss + λ · L2 regularization.

5. Model Selection / Shrinkage / Ridge Regression (3/3) / Tikhonov Regularization (1/2)
L2 regularization / Tikhonov regularization can be derived for linear regression as follows. Treat the true parameters θ_j as random variables Θ_j with the following distribution (prior):
Θ_j ~ N(0, σ_Θ), j = 1, ..., p.
Then the joint likelihood of the data and the parameters is
L_{D,Θ}(θ) := (Π_{i=1}^n p(x_i, y_i | θ)) Π_{j=1}^p p(Θ_j = θ_j)
and the conditional joint log-likelihood of the data and the parameters accordingly
log L^cond_{D,Θ}(θ) := (Σ_{i=1}^n log p(y_i | x_i, θ)) + Σ_{j=1}^p log p(Θ_j = θ_j),
and
log p(Θ_j = θ_j) = log((1/(√(2π) σ_Θ)) e^(-θ_j²/(2σ_Θ²))) = -log(√(2π) σ_Θ) - θ_j²/(2σ_Θ²).

5. Model Selection / Shrinkage / Ridge Regression (3/3) / Tikhonov Regularization (2/2)
Dropping the terms that do not depend on θ_j yields:
log L^cond_{D,Θ}(θ) := (Σ_{i=1}^n log p(y_i | x_i, θ)) + Σ_{j=1}^p log p(Θ_j = θ_j) ∝ (Σ_{i=1}^n log p(y_i | x_i, θ)) - (1/(2σ_Θ²)) Σ_{j=1}^p θ_j².
This also gives a semantics to the complexity / regularization parameter λ:
λ = 1/(2σ_Θ²),
but σ_Θ² is unknown. (We will see methods to estimate λ later on.)
The parameters θ that maximize the joint likelihood of the data and the parameters are called Maximum A Posteriori estimators (MAP estimators). Putting a prior on the parameters is called the Bayesian approach.

5. Model Selection / How to compute ridge regression / Example
Fit Y = β_0 + β_1 X_1 + β_2 X_2 + ε to the data, with λ = 5:
X = ..., Y = ..., I the identity matrix, X^T X = ..., X^T X + 5I = ..., X^T Y = ...

6. Case Weights / Cases of Different Importance
Sometimes different cases are of different importance, e.g., if their measurements are of different accuracy or reliability. Example: assume the leftmost point is known to be measured with lower reliability. Then the model does not need to fit this point as well as it needs to fit the other points, i.e., residuals of this point should get a lower weight than the others.
[Figure: data and fitted model.]

6. Case Weights / Case Weights
In such situations, each case (x_i, y_i) is assigned a case weight w_i ≥ 0:
the higher the weight, the more important the case;
cases with weight 0 should be treated as if they had been discarded from the data set.
Case weights can be managed as an additional pseudo-variable w in applications.

6. Case Weights / Weighted Least Squares Estimates
Formally, one tries to minimize the weighted residual sum of squares
Σ_{i=1}^n w_i (y_i - ŷ_i)² = ‖W^(1/2) (y - ŷ)‖²
with
W := diag(w_1, w_2, ..., w_n).
The same argument as for the unweighted case results in the weighted least squares estimates
X^T W X β̂ = X^T W Y.
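A minimal R sketch of weighted least squares (the data and weights are made up); lm() accepts case weights directly:

x <- c(1, 2, 4, 6); y <- c(2, 3, 6, 4)
w <- c(0.1, 1, 1, 1)                           # downweight the first (least reliable) case
X <- cbind(1, x); W <- diag(w)
solve(t(X) %*% W %*% X, t(X) %*% W %*% y)      # weighted normal equations X'WX beta = X'WY
coef(lm(y ~ x, weights = w))                   # the same estimates via lm()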

6. Case Weights / Weighted Least Squares Estimates / Example
To downweight the leftmost point, we assign it a small case weight w and refit.
[Figure: data, model with weights, and model without weights.]

6. Case Weights / Summary
For regression, linear models of type Y = ⟨X, β⟩ + ε can be used to predict a quantitative Y based on several (quantitative) X.
The ordinary least squares estimates (OLS) are the parameters with minimal residual sum of squares (RSS). They coincide with the maximum likelihood estimates (MLE) under normally distributed errors.
OLS estimates can be computed by solving the system of linear equations X^T X β̂ = X^T Y. The variance of the OLS estimates can be computed likewise ((X^T X)^(-1) σ̂²).
For deciding about the inclusion of predictors, as well as of powers and interactions of predictors, in a model, model selection measures (AIC, BIC) and different search strategies such as forward and backward search are available.

References
[Hoe62] A. E. Hoerl. Application of ridge analysis to regression problems. Chemical Engineering Progress, 58:54-59, 1962.


More information

11. Regression and Least Squares

11. Regression and Least Squares 11. Regression and Least Squares Prof. Tesler Math 186 Winter 2016 Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter 2016 1 / 23 Regression Given n points ( 1, 1 ), ( 2, 2 ),..., we want to determine

More information

Linear regression methods

Linear regression methods Linear regression methods Most of our intuition about statistical methods stem from linear regression. For observations i = 1,..., n, the model is Y i = p X ij β j + ε i, j=1 where Y i is the response

More information

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models SCHOOL OF MATHEMATICS AND STATISTICS Linear and Generalised Linear Models Autumn Semester 2017 18 2 hours Attempt all the questions. The allocation of marks is shown in brackets. RESTRICTED OPEN BOOK EXAMINATION

More information

9/26/17. Ridge regression. What our model needs to do. Ridge Regression: L2 penalty. Ridge coefficients. Ridge coefficients

9/26/17. Ridge regression. What our model needs to do. Ridge Regression: L2 penalty. Ridge coefficients. Ridge coefficients What our model needs to do regression Usually, we are not just trying to explain observed data We want to uncover meaningful trends And predict future observations Our questions then are Is β" a good estimate

More information

Lecture 6 Multiple Linear Regression, cont.

Lecture 6 Multiple Linear Regression, cont. Lecture 6 Multiple Linear Regression, cont. BIOST 515 January 22, 2004 BIOST 515, Lecture 6 Testing general linear hypotheses Suppose we are interested in testing linear combinations of the regression

More information

Linear Regression (9/11/13)

Linear Regression (9/11/13) STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter

More information

REGRESSION WITH SPATIALLY MISALIGNED DATA. Lisa Madsen Oregon State University David Ruppert Cornell University

REGRESSION WITH SPATIALLY MISALIGNED DATA. Lisa Madsen Oregon State University David Ruppert Cornell University REGRESSION ITH SPATIALL MISALIGNED DATA Lisa Madsen Oregon State University David Ruppert Cornell University SPATIALL MISALIGNED DATA 10 X X X X X X X X 5 X X X X X 0 X 0 5 10 OUTLINE 1. Introduction 2.

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression ST 430/514 Recall: A regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates)

More information

Geographically weighted regression approach for origin-destination flows

Geographically weighted regression approach for origin-destination flows Geographically weighted regression approach for origin-destination flows Kazuki Tamesue 1 and Morito Tsutsumi 2 1 Graduate School of Information and Engineering, University of Tsukuba 1-1-1 Tennodai, Tsukuba,

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Lecture 3. Hypothesis testing. Goodness of Fit. Model diagnostics GLM (Spring, 2018) Lecture 3 1 / 34 Models Let M(X r ) be a model with design matrix X r (with r columns) r n

More information

Regression Review. Statistics 149. Spring Copyright c 2006 by Mark E. Irwin

Regression Review. Statistics 149. Spring Copyright c 2006 by Mark E. Irwin Regression Review Statistics 149 Spring 2006 Copyright c 2006 by Mark E. Irwin Matrix Approach to Regression Linear Model: Y i = β 0 + β 1 X i1 +... + β p X ip + ɛ i ; ɛ i iid N(0, σ 2 ), i = 1,..., n

More information

Multivariate Regression

Multivariate Regression Multivariate Regression The so-called supervised learning problem is the following: we want to approximate the random variable Y with an appropriate function of the random variables X 1,..., X p with the

More information

5. Erroneous Selection of Exogenous Variables (Violation of Assumption #A1)

5. Erroneous Selection of Exogenous Variables (Violation of Assumption #A1) 5. Erroneous Selection of Exogenous Variables (Violation of Assumption #A1) Assumption #A1: Our regression model does not lack of any further relevant exogenous variables beyond x 1i, x 2i,..., x Ki and

More information

Biostatistics in Research Practice - Regression I

Biostatistics in Research Practice - Regression I Biostatistics in Research Practice - Regression I Simon Crouch 30th Januar 2007 In scientific studies, we often wish to model the relationships between observed variables over a sample of different subjects.

More information

Linear Models in Machine Learning

Linear Models in Machine Learning CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,

More information

High-dimensional regression

High-dimensional regression High-dimensional regression Advanced Methods for Data Analysis 36-402/36-608) Spring 2014 1 Back to linear regression 1.1 Shortcomings Suppose that we are given outcome measurements y 1,... y n R, and

More information