Machine Learning. 1. Linear Regression


Machine Learning, 1. Linear Regression. Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), Institute for Business Economics and Information Systems & Institute for Computer Science, University of Hildesheim. Course on Machine Learning, winter term 2009/2010.

Outline: 1. The Regression Problem, 2. Simple Linear Regression, 3. Multiple Regression, 4. Variable Interactions, 5. Model Selection, 6. Case Weights.

1. The Regression Problem / Example
Example: how does gas consumption depend on external temperature? (Whiteside, 1960s.) Weekly measurements of average external temperature and total gas consumption (in 1000 cubic feet). A third variable encodes two heating seasons, before and after wall insulation. How does gas consumption depend on external temperature? How much gas is needed for a given temperature?

[Figure: gas consumption (1000 cubic feet) vs. average external temperature (deg. C), with a fitted linear model.]

1. The Regression Problem / Example
[Figure: gas consumption (1000 cubic feet) vs. average external temperature (deg. C), fitted once with a linear model and once with a more flexible model.]

1. The Regression Problem / Variable Types and Coding
The most common variable types:
numerical / interval-scaled / quantitative: differences and quotients etc. are meaningful, usually with domain X := R, e.g., temperature, size, weight.
nominal / discrete / categorical / qualitative / factor: differences and quotients are not defined, usually with a finite, enumerated domain, e.g., X := {red, green, blue} or X := {a, b, c, ..., z}.
ordinal / ordered categorical: levels are ordered, but differences and quotients are not defined, usually with a finite, enumerated domain, e.g., X := {small, medium, large}.

1. The Regression Problem / Variable Types and Coding
Nominals are usually encoded as binary dummy variables:
δ_x0(X) := 1, if X = x0; 0, else,
one for each x0 ∈ X (but one). Example: X := {red, green, blue}. Replace the one variable X with 3 levels (red, green, blue) by two variables δ_red(X) and δ_green(X) with 2 levels each (0, 1):
X = red: δ_red(X) = 1, δ_green(X) = 0; X = green: δ_red(X) = 0, δ_green(X) = 1; X = blue: δ_red(X) = 0, δ_green(X) = 0.

1. The Regression Problem / The Regression Problem Formally
Let X_1, X_2, ..., X_p be random variables called predictors (or inputs, covariates). Let 𝒳_1, 𝒳_2, ..., 𝒳_p be their domains. We write shortly X := (X_1, X_2, ..., X_p) for the vector of random predictor variables and 𝒳 := 𝒳_1 × 𝒳_2 × ... × 𝒳_p for its domain.
Let Y be a random variable called target (or output, response). Let 𝒴 be its domain.
Let D ⊆ 𝒳 × 𝒴 be a (multi)set of instances of the unknown joint distribution p(X, Y) of predictors and target, called data. D is often written as an enumeration D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}.
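Returning to the dummy coding of nominals above, a minimal R sketch (the colour values are made up for illustration); R's model.matrix() produces exactly this one-dummy-per-level-but-one encoding:

X <- factor(c("red", "green", "blue", "red"), levels = c("blue", "red", "green"))
# one binary dummy per level except the reference level ("blue" here):
model.matrix(~ X)
# columns: (Intercept), Xred, Xgreen; the row for "blue" has 0 in both dummies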

1. The Regression Problem / The Regression Problem Formally
The task of regression and classification is to predict Y based on X, i.e., to estimate
r(x) := E(Y | X = x) = ∫ y p(y | x) dy
based on the data; r is called the regression function. If Y is numerical, the task is called regression. If Y is nominal, the task is called classification.

2. Simple Linear Regression / Simple Examples: Single Predictor vs. Multiple Predictors
[Figure: a regression with a single predictor x vs. a regression with multiple predictors.]

2. Simple Linear Regression / Simple Examples: Regression Function
[Figure: observations, their average and the truth for a linear regression function vs. a non-linear regression function.]

2. Simple Linear Regression / Simple Examples: Size of Errors
[Figures: observations with average and truth, and a density estimate of the errors (N = 100): small errors vs. large errors.]

2. Simple Linear Regression / Simple Examples: Distribution of Errors
[Figures: observations and error densities (N = 100): normally distributed errors vs. uniformly distributed errors.]

2. Simple Linear Regression / Simple Examples: Homoscedastic vs. Heteroscedastic Errors
[Figures: observations and errors.] Errors do not depend on the predictors (homoscedastic) vs. errors do depend on the predictors (heteroscedastic).

2. Simple Linear Regression / Simple Examples: Distribution of Predictors
[Figures: observations and predictor densities (N = 100): predictors are uniformly distributed vs. predictors are normally distributed.]

2. Simple Linear Regression / Simple Linear Regression Model
Make it simple:
the predictor X is simple, i.e., one-dimensional (X = X_1);
r(x) is assumed to be linear: r(x) = β_0 + β_1 x;
the variance is assumed not to depend on X:
Y = β_0 + β_1 X + ε, E(ε | X) = 0, V(ε | X) = σ².
3 parameters: β_0 intercept (sometimes also called bias), β_1 slope, σ² variance.

2. Simple Linear Regression / Simple Linear Regression Model
Parameter estimates: β̂_0, β̂_1, σ̂².
Fitted line: r̂(x) := β̂_0 + β̂_1 x.
Predicted / fitted values: ŷ_i := r̂(x_i).
Residuals: ε̂_i := y_i - ŷ_i = y_i - (β̂_0 + β̂_1 x_i).
Residual sum of squares (RSS) / square loss / L2 loss: RSS = Σ_{i=1}^n ε̂_i².

2. Simple Linear Regression / How to estimate the parameters?
Example: Given the data D := {(1, 2), (2, 3), (4, 6)}, predict a value for x = 3.
[Figure: the three data points.]

2. Simple Linear Regression / How to estimate the parameters?
Example: Given the data D := {(1, 2), (2, 3), (4, 6)}, predict a value for x = 3.
Line through the first two points:
β̂_1 = (3 - 2)/(2 - 1) = 1, β̂_0 = y_1 - β̂_1 x_1 = 1, hence r̂(3) = 4.
RSS = 0 + 0 + (6 - 5)² = 1.
[Figure: data and fitted line.]

2. Simple Linear Regression / How to estimate the parameters?
Example: Given the data D := {(1, 2), (2, 3), (4, 6)}, predict a value for x = 3.
Line through the first and the last point:
β̂_1 = (6 - 2)/(4 - 1) = 4/3 ≈ 1.333, β̂_0 = y_1 - β̂_1 x_1 = 2/3 ≈ 0.667, hence r̂(3) = 2/3 + (4/3)·3 ≈ 4.67.
RSS = (3 - 10/3)² = 1/9 ≈ 0.111.
[Figure: data and fitted line.]

2. Simple Linear Regression / Least Squares Estimates / Definition
In principle, there are many different methods to estimate the parameters β̂_0, β̂_1 and σ̂² from data, depending on the properties the solution should have. The least squares estimates are those parameters that minimize
RSS = Σ_{i=1}^n ε̂_i² = Σ_{i=1}^n (y_i - ŷ_i)² = Σ_{i=1}^n (y_i - (β̂_0 + β̂_1 x_i))².
They can be written in closed form as follows:
β̂_1 = Σ_{i=1}^n (x_i - x̄)(y_i - ȳ) / Σ_{i=1}^n (x_i - x̄)²,
β̂_0 = ȳ - β̂_1 x̄,
σ̂² = (1/(n - 2)) Σ_{i=1}^n ε̂_i².

2. Simple Linear Regression / Least Squares Estimates / Proof
Proof (1/2):
RSS = Σ_{i=1}^n (y_i - (β̂_0 + β̂_1 x_i))²
∂RSS/∂β̂_0 = Σ_{i=1}^n 2 (y_i - (β̂_0 + β̂_1 x_i)) (-1) =! 0
⟹ n β̂_0 = Σ_{i=1}^n y_i - β̂_1 Σ_{i=1}^n x_i, i.e., β̂_0 = ȳ - β̂_1 x̄.

2. Simple Linear Regression / Least Squares Estimates / Proof
Proof (2/2):
RSS = Σ_{i=1}^n (y_i - (β̂_0 + β̂_1 x_i))² = Σ_{i=1}^n (y_i - (ȳ - β̂_1 x̄) - β̂_1 x_i)² = Σ_{i=1}^n (y_i - ȳ - β̂_1 (x_i - x̄))²
∂RSS/∂β̂_1 = Σ_{i=1}^n 2 (y_i - ȳ - β̂_1 (x_i - x̄)) (-1) (x_i - x̄) =! 0
⟹ β̂_1 = Σ_{i=1}^n (y_i - ȳ)(x_i - x̄) / Σ_{i=1}^n (x_i - x̄)².

2. Simple Linear Regression / Least Squares Estimates / Example
Example: Given the data D := {(1, 2), (2, 3), (4, 6)}, predict a value for x = 3. Assume a simple linear model. x̄ = 7/3, ȳ = 11/3.
i    x_i - x̄    y_i - ȳ    (x_i - x̄)²    (x_i - x̄)(y_i - ȳ)
1    -4/3       -5/3       16/9          20/9
2    -1/3       -2/3       1/9           2/9
3    5/3        7/3        25/9          35/9
sum                         42/9          57/9
β̂_1 = Σ_{i=1}^n (x_i - x̄)(y_i - ȳ) / Σ_{i=1}^n (x_i - x̄)² = 57/42 ≈ 1.357
β̂_0 = ȳ - β̂_1 x̄ = 11/3 - (57/42)(7/3) = 0.5
[Figure: data and fitted line.]

2. Simple Linear Regression / Least Squares Estimates / Example
With β̂_0 = 0.5 and β̂_1 ≈ 1.357: r̂(3) = 0.5 + 1.357·3 ≈ 4.57.
RSS = (2 - 1.857)² + (3 - 3.214)² + (6 - 5.929)² ≈ 0.071.
[Figure: data and fitted line.]
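A minimal R sketch of this worked example (the data are the three points from the slide; the variable names are mine):

x <- c(1, 2, 4); y <- c(2, 3, 6)
beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # 57/42, about 1.357
beta0 <- mean(y) - beta1 * mean(x)                                   # 0.5
c(beta0, beta1)
coef(lm(y ~ x))       # the same estimates via lm()
beta0 + beta1 * 3     # prediction r^(3), about 4.57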

2. Simple Linear Regression / A Generative Model
So far we assumed the model
Y = β_0 + β_1 X + ε, E(ε | X) = 0, V(ε | X) = σ²,
where we required some properties of the errors, but not their exact distribution. If we make assumptions about their distribution, e.g.,
ε | X ~ N(0, σ²) and thus Y | X ~ N(β_0 + β_1 X, σ²),
we can sample from this model.

2. Simple Linear Regression / Maximum Likelihood Estimates (MLE)
Let p(X, Y | θ) be a joint probability density function for X and Y with parameters θ. Likelihood:
L_D(θ) := Π_{i=1}^n p(x_i, y_i | θ).
The likelihood describes the probability of the data. The maximum likelihood estimates (MLE) are those parameters that maximize the likelihood.
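A minimal R sketch of sampling from the generative model above (the parameter values are arbitrary choices of mine, not from the slides):

set.seed(1)
beta0 <- 1; beta1 <- 2; sigma <- 0.5
x <- runif(100, 0, 10)
y <- beta0 + beta1 * x + rnorm(100, sd = sigma)   # eps | X ~ N(0, sigma^2)
coef(lm(y ~ x))   # least squares estimates recover beta0 and beta1 approximately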

2. Simple Linear Regression / Least Squares Estimates and Maximum Likelihood Estimates
Likelihood:
L_D(β̂_0, β̂_1, σ̂²) := Π_{i=1}^n p̂(x_i, y_i) = Π_{i=1}^n p̂(y_i | x_i) p(x_i) = (Π_{i=1}^n p̂(y_i | x_i)) (Π_{i=1}^n p(x_i)).
Conditional likelihood:
L^cond_D(β̂_0, β̂_1, σ̂²) := Π_{i=1}^n p̂(y_i | x_i) = Π_{i=1}^n (1/(√(2π) σ̂)) e^(-(y_i - ŷ_i)²/(2σ̂²)) = (2π)^(-n/2) σ̂^(-n) e^(-(1/(2σ̂²)) Σ_{i=1}^n (y_i - ŷ_i)²).
Conditional log-likelihood:
log L^cond_D(β̂_0, β̂_1, σ̂²) ∝ -n log σ̂ - (1/(2σ̂²)) Σ_{i=1}^n (y_i - ŷ_i)².
⟹ If we assume normality, the maximum likelihood estimates are just the least squares estimates.

2. Simple Linear Regression / Implementation Details
simple-regression(D):
    sx := 0, sy := 0
    for i = 1, ..., n do
        sx := sx + x_i
        sy := sy + y_i
    od
    x̄ := sx / n, ȳ := sy / n
    a := 0, b := 0
    for i = 1, ..., n do
        a := a + (x_i - x̄)(y_i - ȳ)
        b := b + (x_i - x̄)²
    od
    β_1 := a / b
    β_0 := ȳ - β_1 x̄
    return (β_0, β_1)

2. Simple Linear Regression / Implementation Details
The naive version above needs two passes over the data. With a single loop:
simple-regression(D):
    sx := 0, sy := 0, sxx := 0, syy := 0, sxy := 0
    for i = 1, ..., n do
        sx := sx + x_i
        sy := sy + y_i
        sxx := sxx + x_i²
        syy := syy + y_i²
        sxy := sxy + x_i y_i
    od
    β_1 := (n sxy - sx sy) / (n sxx - sx sx)
    β_0 := (sy - β_1 sx) / n
    return (β_0, β_1)
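A minimal R translation of the single-loop variant (my own sketch; in R the loop is naturally replaced by vectorised sums):

simple.regression <- function(x, y) {
  n <- length(x)
  sx <- sum(x); sy <- sum(y)
  sxx <- sum(x^2); sxy <- sum(x * y)
  beta1 <- (n * sxy - sx * sy) / (n * sxx - sx * sx)
  beta0 <- (sy - beta1 * sx) / n
  c(beta0 = beta0, beta1 = beta1)
}
simple.regression(c(1, 2, 4), c(2, 3, 6))   # 0.5 and about 1.357, as in the example above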

3. Multiple Regression / Several predictors
Several predictor variables X_1, X_2, ..., X_p:
Y = β_0 + β_1 X_1 + β_2 X_2 + ... + β_p X_p + ε = β_0 + Σ_{i=1}^p β_i X_i + ε
with p + 1 parameters β_0, β_1, ..., β_p.

3. Multiple Regression / Linear form
Several predictor variables X_1, X_2, ..., X_p:
Y = β_0 + Σ_{i=1}^p β_i X_i + ε = ⟨β, X⟩ + ε
where β := (β_0, β_1, ..., β_p)^T and X := (1, X_1, ..., X_p)^T.
Thus, the intercept is handled like any other parameter, for the artificial constant variable X_0 ≡ 1.

3. Multiple Regression / Simultaneous equations for the whole dataset
For the whole dataset (x_1, y_1), ..., (x_n, y_n):
Y = Xβ + ε
where Y := (y_1, ..., y_n)^T, ε := (ε_1, ..., ε_n)^T, and X is the n × (p + 1) matrix whose i-th row is (1, x_{i,1}, x_{i,2}, ..., x_{i,p}).

3. Multiple Regression / Least squares estimates
Least squares estimates β̂ minimize
‖Y - Ŷ‖² = ‖Y - Xβ̂‖².
The least squares estimates β̂ are computed via
X^T X β̂ = X^T Y.
Proof:
∂/∂β̂ ‖Y - Xβ̂‖² = ∂/∂β̂ ⟨Y - Xβ̂, Y - Xβ̂⟩ = -2 ⟨X, Y - Xβ̂⟩ = -2 (X^T Y - X^T X β̂) =! 0.

3. Multiple Regression / How to compute least squares estimates β̂
Solve the p × p system of linear equations
X^T X β̂ = X^T Y,
i.e., Ax = b (with A := X^T X, b := X^T Y, x := β̂). There are several numerical methods available:
1. Gaussian elimination
2. Cholesky decomposition
3. QR decomposition

3. Multiple Regression / How to compute least squares estimates β̂ / Example
Given is the following data: predict a value for x_1 = 3, x_2 = 4.
[Table: the example data points (x_1, x_2, y).]
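Regarding the three numerical routes above, a minimal R sketch on made-up data (the data and dimensions are assumptions of mine, not the slide's example):

set.seed(2)
X <- cbind(1, matrix(rnorm(20), ncol = 2))       # n = 10 cases, intercept column plus 2 predictors
y <- X %*% c(1, 2, -1) + rnorm(10, sd = 0.1)
A <- t(X) %*% X; b <- t(X) %*% y
solve(A, b)                                      # 1. Gaussian elimination on the normal equations
R <- chol(A)                                     # 2. Cholesky decomposition: A = R'R
backsolve(R, forwardsolve(t(R), b))
qr.coef(qr(X), y)                                # 3. QR decomposition of X (numerically preferred)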

3. Multiple Regression / How to compute least squares estimates β̂ / Example
Fitting each predictor separately:
Y = β_0 + β_1 X_1 + ε and Y = β_0 + β_2 X_2 + ε.
[Figures: data and fitted line for each of the two simple regressions, with the resulting predictions ŷ(x_1 = 3) and ŷ(x_2 = 4).]

3. Multiple Regression / How to compute least squares estimates β̂ / Example
Now fit
Y = β_0 + β_1 X_1 + β_2 X_2 + ε
to the data: X = ..., Y = ..., giving X^T X = ... and X^T Y = ...

3. Multiple Regression / How to compute least squares estimates β̂ / Example
Solving X^T X β̂ = X^T Y yields β̂ = (..., ..., ...).

3. Multiple Regression / How to compute least squares estimates β̂ / Example
[Figure: the data and the fitted model.]

3. Multiple Regression / How to compute least squares estimates β̂ / Example
To visually assess the model fit, the residuals ε̂ = y - ŷ can be plotted against the true values y.
[Figure: residual plot.]

3. Multiple Regression / The Normal Distribution (also Gaussian)
Written as: X ~ N(µ, σ²), with parameters µ (mean) and σ (standard deviation).
Probability density function (pdf):
φ(x) := (1/(√(2π) σ)) e^(-(x - µ)²/(2σ²)).
Cumulative distribution function (cdf):
Φ(x) := ∫_{-∞}^x φ(t) dt.
Φ^(-1) is called the quantile function. Φ and Φ^(-1) have no analytical form, but have to be computed numerically.
[Figures: the pdf φ(x) and the cdf Φ(x).]
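For instance, in R these numerical evaluations are available as pnorm() and qnorm():

pnorm(0)        # Phi(0) = 0.5
pnorm(1.96)     # approximately 0.975
qnorm(0.975)    # approximately 1.96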

3. Multiple Regression / The t Distribution
Written as: X ~ t_p, with parameter p (degrees of freedom). Probability density function (pdf):
p(x) := Γ((p + 1)/2) / (√(pπ) Γ(p/2)) · (1 + x²/p)^(-(p+1)/2).
For p → ∞, t_p → N(0, 1).
[Figure: pdf and cdf of t_p for p = 5, 7, 10.]

3. Multiple Regression / The χ² Distribution
Written as: X ~ χ²_p, with parameter p (degrees of freedom). Probability density function (pdf):
p(x) := (1/(Γ(p/2) 2^(p/2))) x^(p/2 - 1) e^(-x/2).
If X_1, ..., X_p ~ N(0, 1), then Y := Σ_{i=1}^p X_i² ~ χ²_p.
[Figure: pdf and cdf of χ²_p for p = 5, 10, 50.]

3. Multiple Regression / Parameter Variance
β̂ = (X^T X)^(-1) X^T Y is an unbiased estimator for β (i.e., E(β̂) = β). Its variance is
V(β̂) = (X^T X)^(-1) σ².
Proof:
β̂ = (X^T X)^(-1) X^T Y = (X^T X)^(-1) X^T (Xβ + ε) = β + (X^T X)^(-1) X^T ε.
As E(ε) = 0: E(β̂) = β.
V(β̂) = E((β̂ - E(β̂))(β̂ - E(β̂))^T) = E((X^T X)^(-1) X^T ε ε^T X (X^T X)^(-1)) = (X^T X)^(-1) σ².

3. Multiple Regression / Parameter Variance
An unbiased estimator for σ² is
σ̂² = (1/(n - p)) Σ_{i=1}^n ε̂_i² = (1/(n - p)) Σ_{i=1}^n (y_i - ŷ_i)².
If ε ~ N(0, σ²), then
β̂ ~ N(β, (X^T X)^(-1) σ²).
Furthermore
(n - p) σ̂² / σ² ~ χ²_{n-p}.

3. Multiple Regression / Parameter Variance / Standardized coefficient
Standardized coefficient ("z-score"):
z_i := β̂_i / ŝe(β̂_i), with ŝe²(β̂_i) the i-th diagonal element of (X^T X)^(-1) σ̂².
z_i would be distributed N(0, 1) if σ were known (under H_0: β_i = 0). With the estimated σ̂ it is z_i ~ t_{n-p}.
The Wald test for H_0: β_i = 0 with size α is:
reject H_0 if |z_i| = |β̂_i| / ŝe(β̂_i) > F^(-1)_{t_{n-p}}(1 - α/2),
i.e., its p-value is
p-value(H_0: β_i = 0) = 2 (1 - F_{t_{n-p}}(|z_i|)) = 2 (1 - F_{t_{n-p}}(|β̂_i| / ŝe(β̂_i))),
and small p-values such as 0.01 and 0.05 are good.

3. Multiple Regression / Parameter Variance / Confidence interval
The 1 - α confidence interval for β_i:
β̂_i ± F^(-1)_{t_{n-p}}(1 - α/2) ŝe(β̂_i).
For large n, F_{t_{n-p}} converges to the standard normal cdf Φ. As Φ^(-1)(0.975) ≈ 1.96 ≈ 2, the rule of thumb for a 95% confidence interval (α = 5%) is
β̂_i ± 2 ŝe(β̂_i).
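A minimal R sketch of these quantities on simulated data (the data are an assumption of mine); the resulting table agrees with the coefficient table of summary(lm(y ~ x1 + x2)):

set.seed(3)
n <- 50; x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)                        # x2 has true coefficient 0
X <- cbind(1, x1, x2); p <- ncol(X)
beta.hat   <- solve(t(X) %*% X, t(X) %*% y)
sigma2.hat <- sum((y - X %*% beta.hat)^2) / (n - p)
se.hat     <- sqrt(diag(solve(t(X) %*% X)) * sigma2.hat)
z          <- beta.hat / se.hat                   # z-scores
p.value    <- 2 * (1 - pt(abs(z), df = n - p))    # Wald test p-values
ci         <- cbind(beta.hat - 2 * se.hat, beta.hat + 2 * se.hat)   # rule-of-thumb 95% intervals
cbind(beta.hat, se.hat, z, p.value)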

3. Multiple Regression / Parameter Variance / Example
We have already fitted Ŷ = β̂_0 + β̂_1 X_1 + β̂_2 X_2 to the data. From the table of x_1, x_2, y, ŷ and ε̂² = (y - ŷ)² we obtain RSS and
σ̂² = (1/(n - p)) Σ_{i=1}^n ε̂_i²,
and from (X^T X)^(-1) σ̂² the standard errors, so that for each covariate (intercept, X_1, X_2) the estimate β̂_i, its standard error ŝe(β̂_i), the z-score and the p-value can be tabulated.

3. Multiple Regression / Parameter Variance / Example 2
Example: sociographic data of the 50 US states in the state dataset:
income (per capita, 1974),
illiteracy (percent of population, 1970),
life expectancy (in years, 1969-71),
percent high-school graduates (1970),
population (July 1, 1975),
murder rate per 100,000 population (1976),
mean number of days with minimum temperature below freezing (1931-1960) in capital or large city,
land area in square miles.
[Figure: scatterplot matrix of Income, Illiteracy, Life Exp, HS Grad.]

3. Multiple Regression / Parameter Variance / Example 2
Murder = β_0 + β_1 Population + β_2 Income + β_3 Illiteracy + β_4 LifeExp + β_5 HSGrad + β_6 Frost + β_7 Area
n = 50 states, p = 8 parameters, n - p = 42 degrees of freedom.
Least squares estimators: in the coefficient table (Estimate, Std. Error, t value, Pr(>|t|)), the intercept and Life Exp are highly significant (p-values of order 1e-08, ***), Population is significant (**), while Income, Illiteracy, HS Grad, Frost and Area are not significant.

4. Variable Interactions / Need for higher orders
Assume a target variable does not depend linearly on a predictor variable, but, say, quadratically. Example: distance travelled vs. duration of a moving object with constant acceleration a:
s(t) = (1/2) a t² + ε.
Can we catch such a dependency? Can we catch it with a linear model?

4. Variable Interactions / Need for general transformations
To describe many phenomena, even more complex functions of the input variables are needed. Example: the number of cells n vs. duration of growth t:
n = β e^(αt) + ε.
n does not depend on t directly, but on e^(αt) (with a known α).

4. Variable Interactions / Need for variable interactions
In a linear model with two predictors
Y = β_0 + β_1 X_1 + β_2 X_2 + ε,
Y depends on both X_1 and X_2. But changes in X_1 will affect Y the same way, regardless of X_2. There are problems where X_2 mediates or influences the way X_1 affects Y, e.g., the distance travelled s of a moving object vs. its constant velocity v and duration t:
s = v t + ε.
Then an additional 1 s of duration will increase the distance not in a uniform way (regardless of the velocity), but a little for small velocities and a lot for large velocities. v and t are said to interact: Y does not depend only on each predictor separately, but also on their product.

4. Variable Interactions / Derived variables
All these cases can be handled by looking at derived variables, i.e., instead of
Y = β_0 + β_1 X_1² + ε,
Y = β_0 + β_1 e^(αX_1) + ε,
Y = β_0 + β_1 X_1 X_2 + ε,
one looks at
Y = β_0 + β_1 X̃_1 + ε
with
X̃_1 := X_1², X̃_1 := e^(αX_1), X̃_1 := X_1 X_2, respectively.
Derived variables are computed before the fitting process and taken into account either in addition to the original variables or instead of them.
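A minimal R sketch of derived variables and interactions via the formula interface (data and coefficient values are made up for illustration):

set.seed(4)
dur <- runif(50, 0, 10); vel <- runif(50, 1, 5)
s <- 0.5 * 2 * dur^2 + rnorm(50, sd = 0.5)    # distance under constant acceleration a = 2
lm(s ~ I(dur^2))                              # derived variable dur^2; the model stays linear in the parameters
s2 <- vel * dur + rnorm(50, sd = 0.5)         # distance = velocity * duration
lm(s2 ~ vel:dur)                              # product (interaction) term only
lm(s2 ~ vel * dur)                            # main effects plus the interaction vel:dur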

5. Model Selection / Underfitting
[Figure: data and a fitted linear model.]
If a model does not explain the data well, e.g., if the true model is quadratic but we try to fit a linear model, one says the model underfits.

5. Model Selection / Overfitting / Fitting Polynomials of High Degree
[Figures: the same data fitted with polynomials of increasing degree.]


5. Model Selection / Overfitting / Fitting Polynomials of High Degree
If to data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) consisting of n points we fit
Y = β_0 + β_1 X_1 + β_2 X_1² + ... + β_{n-1} X_1^(n-1),
i.e., a polynomial with degree n - 1, then this results in an interpolation of the data points (if there are no repeated measurements, i.e., points with the same X_1), as the polynomial
r(X) = Σ_{i=1}^n y_i Π_{j ≠ i} (X - x_j) / (x_i - x_j)
is of this type and has minimal RSS = 0.

5. Model Selection / Model Selection Measures
Model selection means: we have a set of models, e.g.,
Y = Σ_{i=0}^{p-1} β_i X_i,
indexed by p (i.e., one model for each value of p), and we make a choice which model describes the data best. If we just look at losses / fit measures such as RSS, then
the larger p, the better the fit, or equivalently the larger p, the lower the loss,
as the model with p parameters can be reparametrized as a model with p' > p parameters by setting
β'_i = β_i for i ≤ p, and β'_i = 0 for i > p.
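A minimal R sketch of the interpolation property described above (random data, arbitrary n):

set.seed(5)
n <- 6
x <- sort(runif(n)); y <- rnorm(n)
fit <- lm(y ~ poly(x, degree = n - 1))   # polynomial of degree n - 1
sum(residuals(fit)^2)                    # RSS is (numerically) 0: the fit interpolates all n points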

5. Model Selection / Model Selection Measures
One uses model selection measures of the type
model selection measure = fit - complexity,
or equivalently
model selection measure = loss + complexity.
The smaller the loss (= lack of fit), the better the model. The smaller the complexity, the simpler and thus better the model. The model selection measure tries to find a trade-off between fit/loss and complexity.

5. Model Selection / Model Selection Measures
Akaike Information Criterion (AIC):
(maximize) AIC := log L - p
or (minimize) AIC := -2 log L + 2p, which for linear regression with Gaussian errors is, up to an additive constant, n log(RSS/n) + 2p.
Bayes Information Criterion (BIC) / Bayes-Schwarz Information Criterion:
(maximize) BIC := log L - (p/2) log n.
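A minimal R sketch of the RSS-based AIC variant above (my own helper function; its constant term differs from R's built-in AIC(), but differences between models fitted to the same data agree):

aic.rss <- function(fit) {
  n <- length(residuals(fit))
  p <- length(coef(fit))
  n * log(sum(residuals(fit)^2) / n) + 2 * p
}
# usage (placeholder variable names): compare nested models and prefer the smaller value
# aic.rss(lm(y ~ x1)); aic.rss(lm(y ~ x1 + x2))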

5. Model Selection / Variable Backward Selection
[Diagram: backward selection tree; each candidate model, with one variable removed (marked X), is annotated with its AIC, and the removal that most improves AIC is chosen at each step.]


5. Model Selection / Variable Backward Selection
Full model: the coefficient table (Estimate, Std. Error, t value, Pr(>|t|)) as on the previous example slide, with the intercept and Life Exp highly significant (***) and Population significant (**).
AIC-optimal model by backward selection: Murder ~ Population + Illiteracy + Life Exp + Frost + Area, i.e., Income and HS Grad are dropped. In the optimal model the intercept and Life Exp remain highly significant (***), Population significant (**), and Area significant (*).

5. Model Selection / How to do it in R
library(datasets); library(MASS);
st = as.data.frame(state.x77);
mod.full = lm(Murder ~ ., data = st);
summary(mod.full);
mod.opt = stepAIC(mod.full);
summary(mod.opt);

5. Model Selection / Shrinkage
Model selection operates by fitting models for a set of models with varying complexity and then picking the best one ex post, omitting some parameters completely (i.e., forcing them to be 0). Shrinkage operates by including a penalty term directly in the model equation, favoring small parameter values in general.

5. Model Selection / Shrinkage / Ridge Regression [Hoe62]
Ridge regression: minimize
RSS_λ(β̂) = RSS(β̂) + λ Σ_{j=1}^p β̂_j² = ⟨Y - Xβ̂, Y - Xβ̂⟩ + λ Σ_{j=1}^p β̂_j²,
which is solved by
β̂ = (X^T X + λI)^(-1) X^T Y,
with λ ≥ 0 a complexity parameter / regularization parameter. As solutions of ridge regression are not equivariant under scaling of the predictors, and as it does not make sense to include a constraint for the parameter of the intercept, the data is normalized before ridge regression:
x̃_{i,j} := (x_{i,j} - x̄_{·,j}) / σ̂(x_{·,j}).
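A minimal R sketch of ridge regression as defined above (made-up data; predictors standardised and the target centred so that the intercept is not penalised):

set.seed(6)
n <- 40
X <- matrix(rnorm(n * 3), ncol = 3)
y <- X %*% c(2, -1, 0) + rnorm(n)
Xs <- scale(X)                 # normalise each predictor: (x - mean) / sd
yc <- y - mean(y)              # centre the target
lambda <- 5
beta.ridge <- solve(t(Xs) %*% Xs + lambda * diag(ncol(Xs)), t(Xs) %*% yc)
beta.ridge                     # shrunken coefficients; lambda = 0 reproduces ordinary least squares
# the MASS function lm.ridge() offers a packaged alternative (it applies its own internal scaling)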

5. Model Selection / Shrinkage / Ridge Regression (2/3)
Ridge regression is a combination of
Σ_{i=1}^n (y_i - ŷ_i)² + λ Σ_{j=1}^p β_j²,
i.e., L2 loss + λ · L2 regularization.

5. Model Selection / Shrinkage / Ridge Regression (3/3) / Tikhonov Regularization (1/2)
L2 regularization / Tikhonov regularization can be derived for linear regression as follows. Treat the true parameters θ_j as random variables Θ_j with the following distribution (prior):
Θ_j ~ N(0, σ_Θ), j = 1, ..., p.
Then the joint likelihood of the data and the parameters is
L_{D,Θ}(θ) := (Π_{i=1}^n p(x_i, y_i | θ)) Π_{j=1}^p p(Θ_j = θ_j)
and the conditional joint log-likelihood of the data and the parameters accordingly
log L^cond_{D,Θ}(θ) := (Σ_{i=1}^n log p(y_i | x_i, θ)) + Σ_{j=1}^p log p(Θ_j = θ_j),
and
log p(Θ_j = θ_j) = log((1/(√(2π) σ_Θ)) e^(-θ_j²/(2σ_Θ²))) = -log(√(2π) σ_Θ) - θ_j²/(2σ_Θ²).

5. Model Selection / Shrinkage / Ridge Regression (3/3) / Tikhonov Regularization (2/2)
Dropping the terms that do not depend on θ_j yields:
log L^cond_{D,Θ}(θ) := (Σ_{i=1}^n log p(y_i | x_i, θ)) + Σ_{j=1}^p log p(Θ_j = θ_j) ∝ (Σ_{i=1}^n log p(y_i | x_i, θ)) - (1/(2σ_Θ²)) Σ_{j=1}^p θ_j².
This also gives a semantics to the complexity / regularization parameter λ:
λ = 1/(2σ_Θ²),
but σ_Θ² is unknown. (We will see methods to estimate λ later on.)
The parameters θ that maximize the joint likelihood of the data and the parameters are called Maximum A Posteriori estimators (MAP estimators). Putting a prior on the parameters is called the Bayesian approach.

5. Model Selection / How to compute ridge regression / Example
Fit Y = β_0 + β_1 X_1 + β_2 X_2 + ε to the data, with λ = 5:
X = ..., Y = ..., I the identity matrix, X^T X = ..., X^T X + 5I = ..., X^T Y = ...

6. Case Weights / Cases of Different Importance
Sometimes different cases are of different importance, e.g., if their measurements are of different accuracy or reliability. Example: assume the leftmost point is known to be measured with lower reliability. Then the model does not need to fit this point as well as it needs to fit the other points, i.e., residuals of this point should get a lower weight than the others.
[Figure: data and fitted model.]

6. Case Weights / Case Weights
In such situations, each case (x_i, y_i) is assigned a case weight w_i ≥ 0:
the higher the weight, the more important the case;
cases with weight 0 should be treated as if they had been discarded from the data set.
Case weights can be managed as an additional pseudo-variable w in applications.

6. Case Weights / Weighted Least Squares Estimates
Formally, one tries to minimize the weighted residual sum of squares
Σ_{i=1}^n w_i (y_i - ŷ_i)² = ‖W^(1/2) (y - ŷ)‖²
with
W := diag(w_1, w_2, ..., w_n).
The same argument as for the unweighted case results in the weighted least squares estimates
X^T W X β̂ = X^T W Y.
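A minimal R sketch of weighted least squares (the data and weights are made up); lm() accepts case weights directly:

x <- c(1, 2, 4, 6); y <- c(2, 3, 6, 4)
w <- c(0.1, 1, 1, 1)                           # downweight the first (least reliable) case
X <- cbind(1, x); W <- diag(w)
solve(t(X) %*% W %*% X, t(X) %*% W %*% y)      # weighted normal equations X'WX beta = X'WY
coef(lm(y ~ x, weights = w))                   # the same estimates via lm()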

6. Case Weights / Weighted Least Squares Estimates / Example
To downweight the leftmost point, we assign it a small case weight w and refit.
[Figure: data, model with weights, and model without weights.]

6. Case Weights / Summary
For regression, linear models of type Y = ⟨X, β⟩ + ε can be used to predict a quantitative Y based on several (quantitative) X.
The ordinary least squares estimates (OLS) are the parameters with minimal residual sum of squares (RSS). They coincide with the maximum likelihood estimates (MLE) under normally distributed errors.
OLS estimates can be computed by solving the system of linear equations X^T X β̂ = X^T Y. The variance of the OLS estimates can be computed likewise ((X^T X)^(-1) σ̂²).
For deciding about the inclusion of predictors, as well as of powers and interactions of predictors, in a model, model selection measures (AIC, BIC) and different search strategies such as forward and backward search are available.

References
[Hoe62] A. E. Hoerl. Application of ridge analysis to regression problems. Chemical Engineering Progress, 58:54-59, 1962.


More information

11. Regression and Least Squares

11. Regression and Least Squares 11. Regression and Least Squares Prof. Tesler Math 186 Winter 2016 Prof. Tesler Ch. 11: Linear Regression Math 186 / Winter 2016 1 / 23 Regression Given n points ( 1, 1 ), ( 2, 2 ),..., we want to determine

More information

Linear regression methods

Linear regression methods Linear regression methods Most of our intuition about statistical methods stem from linear regression. For observations i = 1,..., n, the model is Y i = p X ij β j + ε i, j=1 where Y i is the response

More information

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models SCHOOL OF MATHEMATICS AND STATISTICS Linear and Generalised Linear Models Autumn Semester 2017 18 2 hours Attempt all the questions. The allocation of marks is shown in brackets. RESTRICTED OPEN BOOK EXAMINATION

More information

9/26/17. Ridge regression. What our model needs to do. Ridge Regression: L2 penalty. Ridge coefficients. Ridge coefficients

9/26/17. Ridge regression. What our model needs to do. Ridge Regression: L2 penalty. Ridge coefficients. Ridge coefficients What our model needs to do regression Usually, we are not just trying to explain observed data We want to uncover meaningful trends And predict future observations Our questions then are Is β" a good estimate

More information

Lecture 6 Multiple Linear Regression, cont.

Lecture 6 Multiple Linear Regression, cont. Lecture 6 Multiple Linear Regression, cont. BIOST 515 January 22, 2004 BIOST 515, Lecture 6 Testing general linear hypotheses Suppose we are interested in testing linear combinations of the regression

More information

Linear Regression (9/11/13)

Linear Regression (9/11/13) STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter

More information

REGRESSION WITH SPATIALLY MISALIGNED DATA. Lisa Madsen Oregon State University David Ruppert Cornell University

REGRESSION WITH SPATIALLY MISALIGNED DATA. Lisa Madsen Oregon State University David Ruppert Cornell University REGRESSION ITH SPATIALL MISALIGNED DATA Lisa Madsen Oregon State University David Ruppert Cornell University SPATIALL MISALIGNED DATA 10 X X X X X X X X 5 X X X X X 0 X 0 5 10 OUTLINE 1. Introduction 2.

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression ST 430/514 Recall: A regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates)

More information

Geographically weighted regression approach for origin-destination flows

Geographically weighted regression approach for origin-destination flows Geographically weighted regression approach for origin-destination flows Kazuki Tamesue 1 and Morito Tsutsumi 2 1 Graduate School of Information and Engineering, University of Tsukuba 1-1-1 Tennodai, Tsukuba,

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Lecture 3. Hypothesis testing. Goodness of Fit. Model diagnostics GLM (Spring, 2018) Lecture 3 1 / 34 Models Let M(X r ) be a model with design matrix X r (with r columns) r n

More information

Regression Review. Statistics 149. Spring Copyright c 2006 by Mark E. Irwin

Regression Review. Statistics 149. Spring Copyright c 2006 by Mark E. Irwin Regression Review Statistics 149 Spring 2006 Copyright c 2006 by Mark E. Irwin Matrix Approach to Regression Linear Model: Y i = β 0 + β 1 X i1 +... + β p X ip + ɛ i ; ɛ i iid N(0, σ 2 ), i = 1,..., n

More information

Multivariate Regression

Multivariate Regression Multivariate Regression The so-called supervised learning problem is the following: we want to approximate the random variable Y with an appropriate function of the random variables X 1,..., X p with the

More information

5. Erroneous Selection of Exogenous Variables (Violation of Assumption #A1)

5. Erroneous Selection of Exogenous Variables (Violation of Assumption #A1) 5. Erroneous Selection of Exogenous Variables (Violation of Assumption #A1) Assumption #A1: Our regression model does not lack of any further relevant exogenous variables beyond x 1i, x 2i,..., x Ki and

More information

Biostatistics in Research Practice - Regression I

Biostatistics in Research Practice - Regression I Biostatistics in Research Practice - Regression I Simon Crouch 30th Januar 2007 In scientific studies, we often wish to model the relationships between observed variables over a sample of different subjects.

More information

Linear Models in Machine Learning

Linear Models in Machine Learning CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,

More information

High-dimensional regression

High-dimensional regression High-dimensional regression Advanced Methods for Data Analysis 36-402/36-608) Spring 2014 1 Back to linear regression 1.1 Shortcomings Suppose that we are given outcome measurements y 1,... y n R, and

More information