MATH 567: Mathematical Techniques in Data Science
Linear Regression: old and new


Dominique Guillot
Department of Mathematical Sciences, University of Delaware
February 13, 2017

Linear Regression: old and new

Typical problem: we are given n observations of variables X_1, ..., X_p and Y.

Goal: use X_1, ..., X_p to try to predict Y.

Example: Cars data compiled using Kelley Blue Book (n = 805, p = 11).

Find a linear model

    Y = β_1 X_1 + β_2 X_2 + ... + β_p X_p.

In the example, we want:

    price = β_1 · mileage + β_2 · cylinder + ...
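In R, such a model would be fit with lm. A sketch only: the data frame kbb and the column names Price, Mileage, and Cylinder below are hypothetical stand-ins for the actual Kelley Blue Book data.

# Hypothetical sketch: `kbb` and its column names are assumed, not the
# actual Kelley Blue Book data set.
model <- lm(Price ~ Mileage + Cylinder, data = kbb)
coef(model)   # estimated intercept and coefficients for Mileage and Cylinder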

Linear regression: classical setting

p = number of variables, n = number of observations.

Classical setting: n ≫ p (n much larger than p). With enough observations, we hope to be able to build a good model.

Note: even if the true relationship between the variables is not linear, we can include transformations of variables, e.g.

    X_{p+1} = X_1^2, X_{p+2} = X_2^2, ...

(see the R sketch below).

Note: adding transformed variables can increase p significantly. A complex model requires a lot of observations. There is a trade-off between complexity and interpretability.

Modern setting: in modern problems, it is often the case that n ≪ p. This requires supplementary assumptions (e.g. sparsity), but one can still build good models with very few observations.
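In R, transformed variables can be added directly in the model formula. A minimal sketch, assuming a data frame df with predictors X1, X2 and response Y (hypothetical names):

# Minimal sketch (hypothetical data frame `df`): I() adds the squared
# terms as extra predictors, increasing p from 2 to 4.
model <- lm(Y ~ X1 + X2 + I(X1^2) + I(X2^2), data = df)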

Classical setting

Idea: stack the data into Y ∈ R^{n×1} and X ∈ R^{n×p}:

    Y = (y_1, y_2, ..., y_n)^T,    X = (x_1 | x_2 | ... | x_p),

where x_1, ..., x_p ∈ R^{n×1} are the observations of X_1, ..., X_p (one column per variable).

We want Y = β_1 X_1 + ... + β_p X_p. This is equivalent to solving

    Y = Xβ,    where β = (β_1, β_2, ..., β_p)^T.

Classical setting (cont.)

We need to solve Y = Xβ. In general, the system has no solution (n ≫ p) or infinitely many solutions (n ≪ p).

A popular approach is to solve the system in the least squares sense:

    β̂ = argmin_{β ∈ R^p} ||Y − Xβ||².

How do we compute the solution? Calculus approach: set the partial derivatives to zero,

    0 = ∂/∂β_i ||Y − Xβ||²
      = ∂/∂β_i Σ_{k=1}^n (y_k − X_{k1}β_1 − X_{k2}β_2 − ... − X_{kp}β_p)²
      = Σ_{k=1}^n 2 (y_k − X_{k1}β_1 − X_{k2}β_2 − ... − X_{kp}β_p)(−X_{ki}).

Therefore,

    Σ_{k=1}^n X_{ki}(X_{k1}β_1 + X_{k2}β_2 + ... + X_{kp}β_p) = Σ_{k=1}^n X_{ki} y_k.

Calculus approach (cont.)

Now, the system

    Σ_{k=1}^n X_{ki}(X_{k1}β_1 + X_{k2}β_2 + ... + X_{kp}β_p) = Σ_{k=1}^n X_{ki} y_k,    i = 1, ..., p,

is equivalent to

    X^T Xβ = X^T Y    (the normal equations).

If X^T X is invertible, then

    β̂ = (X^T X)^{-1} X^T Y

is the unique minimum of ||Y − Xβ||². This is proved by computing the Hessian matrix,

    ∂²/∂β_i ∂β_j ||Y − Xβ||² = 2 X^T X,

which is positive definite when X^T X is invertible.
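The closed form β̂ = (X^T X)^{-1} X^T Y is easy to verify numerically. A minimal sketch with simulated data (all names below are invented for illustration):

# Sketch: compare the normal-equation solution with lm() on simulated data.
set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)   # n observations of p predictors
beta <- c(2, -1, 0.5)             # true coefficients
Y <- X %*% beta + rnorm(n)        # response with noise
# Solve the normal equations X^T X beta = X^T Y directly.
beta_hat <- solve(crossprod(X), crossprod(X, Y))
# Same fit via lm (no intercept, to match the model above).
coef(lm(Y ~ X - 1))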

Linear algebra approach

We want to solve Y = Xβ.

Recall: if V ⊆ R^n is a subspace and w ∉ V, then the best approximation of w by a vector in V is proj_V(w), best in the sense that

    ||w − proj_V(w)|| ≤ ||w − v||    for all v ∈ V.

Note: Xβ ∈ col(X) = span(x_1, ..., x_p). If Y ∉ col(X), then the best approximation of Y by a vector in col(X) is proj_{col(X)}(Y).

Linear algebra approach (cont.)

So

    ||Y − proj_{col(X)}(Y)|| ≤ ||Y − Xβ||    for all β ∈ R^p.

Therefore, to find β̂, we solve

    Xβ̂ = proj_{col(X)}(Y).

(Note: this system always has a solution.)

With a little more work, we can find an explicit solution. First,

    Y − Xβ̂ = Y − proj_{col(X)}(Y) = proj_{col(X)^⊥}(Y).

Recall that

    col(X)^⊥ = null(X^T).

Thus,

    Y − Xβ̂ = proj_{null(X^T)}(Y) ∈ null(X^T).

That implies

    X^T (Y − Xβ̂) = 0.

Equivalently,

    X^T X β̂ = X^T Y    (the normal equations).
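The normal equations say exactly that the residual vector is orthogonal to every column of X; continuing the simulated sketch above:

# Sketch: X^T (Y - X beta_hat) is numerically zero (continuing the
# simulated X, Y, beta_hat from the previous sketch).
res <- Y - X %*% beta_hat
crossprod(X, res)   # entries are zero up to floating-point error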

The least squares theorem

Theorem (Least squares theorem). Let A ∈ R^{n×m} and b ∈ R^n. Then:

1. Ax = b always has a least squares solution x̂.
2. A vector x̂ is a least squares solution if and only if it satisfies the normal equations A^T A x̂ = A^T b.
3. x̂ is unique ⟺ the columns of A are linearly independent ⟺ A^T A is invertible. In that case, the unique least squares solution is given by x̂ = (A^T A)^{-1} A^T b.

In R: model <- lm(Y ~ X1 + X2 + ... + Xp).

Measuring the fit of a linear model

How good is our linear model? We examine the mean squared error:

    MSE(β̂) = (1/n) ||Y − Xβ̂||² = (1/n) Σ_{i=1}^n (y_i − ŷ_i)².

Example:

model <- lm(Auto$mpg ~ Auto$horsepower + Auto$weight)
sm <- summary(model)
mean(sm$residuals^2)   # the MSE

The coefficient of determination

The coefficient of determination, called "R squared" and denoted R²:

    R² = 1 − [Σ_{i=1}^n (y_i − ŷ_i)²] / [Σ_{i=1}^n (y_i − ȳ)²].

It is often used to measure the quality of a linear model. In some sense, R² measures how much better the prediction is compared to a constant prediction equal to the average ȳ of the y_i's.

In R: sm$r.squared. (As above, sm <- summary(model).)

In a linear model with an intercept, R² equals the square of the correlation coefficient between the observed values Y and the predicted values Ŷ (see the check below).

A model with an R² close to 1 fits the data well.
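This last fact is easy to check numerically, reusing model and sm from the Auto example above:

# Sketch: with an intercept, R^2 equals the squared correlation between
# observed and fitted values (reusing `model` and `sm` from above).
sm$r.squared
cor(Auto$mpg, fitted(model))^2   # agrees with sm$r.squared up to rounding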

Measuring the fit of a linear model (cont.)

We can examine the distribution of the residuals:

    hist(sm$residuals)

Desirable properties: symmetry and light tails.

A heavy tail suggests there may be outliers. We can use transformations such as log x, √x, or 1/x to improve the fit.
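For example, a log transform of the response is a one-line change. A sketch on the same Auto data (this code is not from the original slides):

# Sketch (assumed, not from the slides): refit with a log-transformed
# response and inspect the new residuals.
model_log <- lm(log(Auto$mpg) ~ Auto$horsepower + Auto$weight)
hist(summary(model_log)$residuals)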

Measuring the fit of a linear model (cont.)

Plotting the residuals as a function of the mpg (or of the fitted values), we immediately observe some patterns. Outliers? Separate categories of cars?

Improving the model

Add more variables to the model.
Select the best variables to include.
Use transformations.
Separate cars into categories.
Etc.

For example, let us fit a model only for cars with mpg less than 25:
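The original slide's code is not part of this transcription; here is a minimal sketch of such a subset fit, assuming the Auto data used above:

# Sketch (assumed): refit using only the cars with mpg below 25.
low <- subset(Auto, mpg < 25)
model_low <- lm(mpg ~ horsepower + weight, data = low)
summary(model_low)$r.squared   # fit quality on the restricted subset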
