MATH 829: Introduction to Data Mining and Analysis Linear Regression: old and new

MATH 829: Introduction to Data Mining and Analysis
Linear Regression: old and new
Dominique Guillot
Department of Mathematical Sciences, University of Delaware
February 10, 2016

Linear Regression: old and new

Typical problem: we are given n observations of variables X_1, ..., X_p and Y.
Goal: use X_1, ..., X_p to try to predict Y.
Example: cars data compiled using Kelley Blue Book (n = 805, p = 11). Find a linear model
\[
Y = \beta_1 X_1 + \cdots + \beta_p X_p.
\]
In the example, we want: price = β_1 · mileage + β_2 · cylinder + ...

Linear regression: classical setting

p = number of variables, n = number of observations.
Classical setting: n ≫ p (n much larger than p). With enough observations, we hope to be able to build a good model.
Note: even if the true relationship between the variables is not linear, we can include transformations of variables, e.g. X_{p+1} = X_1^2, X_{p+2} = X_2^2, ... (see the sketch below).
Note: adding transformed variables can increase p significantly. A complex model requires a lot of observations.
Modern setting: in modern problems, it is often the case that n ≪ p. This requires supplementary assumptions (e.g. sparsity). One can still build good models with very few observations.
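
The following is a minimal sketch (not from the slides; the data are simulated) of how squared features can be appended to a design matrix, as in the note above.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))   # n = 100 observations, p = 3 variables

    X_aug = np.hstack([X, X**2])    # append the squared columns X_1^2, X_2^2, X_3^2
    print(X_aug.shape)              # (100, 6): p grows from 3 to 6

Each added column is treated as a new variable, so the model remains linear in the parameters even though it is no longer linear in the original X_j.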

Classical setting

Idea: Y ∈ R^{n×1}, X ∈ R^{n×p},
\[
Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix},
\qquad
X = \begin{pmatrix} x_1 & x_2 & \cdots & x_p \end{pmatrix},
\]
where x_1, ..., x_p ∈ R^{n×1} are the observations of X_1, ..., X_p.
We want Y = β_1 X_1 + ⋯ + β_p X_p. This is equivalent to solving
\[
Y = X\beta, \qquad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix}.
\]
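
As a concrete illustration (an assumed toy example, not from the slides), the observation vectors x_1, ..., x_p can be stacked as the columns of X with NumPy:

    import numpy as np

    n = 5
    x1 = np.arange(n, dtype=float)     # observations of X_1
    x2 = np.ones(n)                    # observations of X_2
    X = np.column_stack([x1, x2])      # X is n x p with p = 2
    Y = 2.0 * x1 + 3.0 * x2            # here Y = X beta holds exactly with beta = (2, 3)
    print(X.shape, Y.shape)            # (5, 2) (5,)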

Classical setting (cont.)

We need to solve Y = Xβ. Obviously, in general, the system has no solution.
A popular approach is to solve the system in the least squares sense:
\[
\hat\beta = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \|Y - X\beta\|^2.
\]
How do we compute the solution? Calculus approach:
\[
\frac{\partial}{\partial \beta_i} \|Y - X\beta\|^2
= \frac{\partial}{\partial \beta_i} \sum_{k=1}^{n} \left(y_k - X_{k1}\beta_1 - X_{k2}\beta_2 - \cdots - X_{kp}\beta_p\right)^2
= 2 \sum_{k=1}^{n} \left(y_k - X_{k1}\beta_1 - X_{k2}\beta_2 - \cdots - X_{kp}\beta_p\right)(-X_{ki}) = 0.
\]
Therefore,
\[
\sum_{k=1}^{n} X_{ki}\left(X_{k1}\beta_1 + X_{k2}\beta_2 + \cdots + X_{kp}\beta_p\right) = \sum_{k=1}^{n} X_{ki}\, y_k.
\]

Calculus approach (cont.)

Now
\[
\sum_{k=1}^{n} X_{ki}\left(X_{k1}\beta_1 + X_{k2}\beta_2 + \cdots + X_{kp}\beta_p\right) = \sum_{k=1}^{n} X_{ki}\, y_k, \qquad i = 1, \dots, p,
\]
is equivalent to
\[
X^T X \beta = X^T Y \qquad \text{(normal equations)}.
\]
We compute the Hessian:
\[
\left(\frac{\partial^2}{\partial \beta_i \, \partial \beta_j} \|Y - X\beta\|^2\right)_{i,j} = 2\, X^T X.
\]
If X^T X is invertible, then X^T X is positive definite and
\[
\hat\beta = (X^T X)^{-1} X^T Y
\]
is the unique minimum of ‖Y − Xβ‖².
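
A minimal numerical sketch (an assumed example with simulated data) of computing β̂ from the normal equations, checked against NumPy's least squares solver:

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 50, 3
    X = rng.normal(size=(n, p))
    beta_true = np.array([1.0, -2.0, 0.5])
    Y = X @ beta_true + 0.1 * rng.normal(size=n)   # noisy linear model

    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # solve the normal equations
    beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
    print(np.allclose(beta_hat, beta_lstsq))       # True: both give the same estimate

In practice, np.linalg.lstsq (based on the SVD) is preferred over explicitly forming X^T X, which can be numerically ill-conditioned.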

Linear algebra approach

We want to solve Y = Xβ.
Linear algebra approach: recall that if V ⊆ R^n is a subspace and w ∉ V, then the best approximation of w by a vector in V is proj_V(w).
"Best" in the sense that
\[
\|w - \operatorname{proj}_V(w)\| \le \|w - v\| \qquad \forall v \in V.
\]
Here Xβ ∈ col(X) = span(x_1, ..., x_p). If Y ∉ col(X), then the best approximation of Y by a vector in col(X) is proj_{col(X)}(Y).

Linear algebra approach (cont.)

So
\[
\|Y - \operatorname{proj}_{\operatorname{col}(X)}(Y)\| \le \|Y - X\beta\| \qquad \forall \beta \in \mathbb{R}^p.
\]
Therefore, to find β̂, we solve
\[
X\hat\beta = \operatorname{proj}_{\operatorname{col}(X)}(Y).
\]
(Note: this system always has a solution.)
With a little more work, we can find an explicit solution:
\[
Y - X\hat\beta = Y - \operatorname{proj}_{\operatorname{col}(X)}(Y) = \operatorname{proj}_{\operatorname{col}(X)^\perp}(Y).
\]
Recall
\[
\operatorname{col}(X)^\perp = \operatorname{null}(X^T).
\]
Thus,
\[
Y - X\hat\beta = \operatorname{proj}_{\operatorname{null}(X^T)}(Y) \in \operatorname{null}(X^T).
\]
That implies
\[
X^T (Y - X\hat\beta) = 0.
\]
Equivalently,
\[
X^T X \hat\beta = X^T Y \qquad \text{(normal equations)}.
\]
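
A quick numerical check (an assumed example with simulated data) that the least squares residual lies in null(X^T), i.e. is orthogonal to every column of X:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(30, 4))
    Y = rng.normal(size=30)

    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    residual = Y - X @ beta_hat
    print(np.allclose(X.T @ residual, 0.0))   # True: X^T (Y - X beta_hat) = 0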

The least squares theorem

Theorem (Least squares theorem). Let A ∈ R^{n×m} and b ∈ R^n. Then:
1. Ax = b always has a least squares solution x̂.
2. A vector x̂ is a least squares solution if and only if it satisfies the normal equations A^T A x̂ = A^T b.
3. x̂ is unique ⟺ the columns of A are linearly independent ⟺ A^T A is invertible. In that case, the unique least squares solution is given by x̂ = (A^T A)^{-1} A^T b.
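
The non-uniqueness in part 3 can be seen numerically. The following sketch (an assumed example, not from the slides) uses a matrix with linearly dependent columns, so A^T A is singular and infinitely many least squares solutions exist:

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [2.0, 4.0],
                  [3.0, 6.0]])       # second column = 2 * first column
    b = np.array([1.0, 2.0, 2.0])

    print(np.linalg.matrix_rank(A.T @ A))           # 1 < 2, so A^T A is not invertible

    x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)   # returns the minimum-norm solution
    print(x_hat)
    # x_hat + t * (2, -1) is also a least squares solution for every t,
    # since (2, -1) is in the null space of A.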

Building a simple linear model with Python

The data are in the file JSE_Car_Lab.csv. Loading the data with the headers using Pandas:

    import numpy as np
    import pandas as pd

    data = pd.read_csv('./data/jse_car_lab.csv', delimiter=',')

We extract the numerical columns:

    y = np.array(data['price'])
    x = np.array(data['mileage'])
    x = x.reshape(len(x), 1)
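
Before fitting anything, a quick sanity check of the loaded data can be useful. This is a sketch continuing the snippet above (the file path and column names are taken from that snippet):

    print(data.shape)                         # number of rows (n) and columns
    print(data.columns.tolist())              # available variables
    print(data[['price', 'mileage']].head())  # peek at the two columns used below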

Building a simple linear model with Python (cont.)

The scikit-learn package provides a lot of very powerful functions/objects to analyse datasets. Typical syntax:
1. Create an object representing the model.
2. Call the fit method of the model with the data as arguments.
3. Use the predict method to make predictions.

    from sklearn.linear_model import LinearRegression

    lin_model = LinearRegression(fit_intercept=True)
    lin_model.fit(x, y)
    print(lin_model.coef_)
    print(lin_model.intercept_)

We obtain price ≈ −0.17 · mileage + 24764.5.
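
A short usage sketch of step 3 (an assumed example continuing from the fitted lin_model above): predicting the price of a car with a given mileage.

    import numpy as np

    new_mileage = np.array([[15000.0]])              # one car with 15,000 miles
    predicted_price = lin_model.predict(new_mileage)
    print(predicted_price)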

Measuring the fit of a linear model

How good is our linear model? We examine the residual sum of squares:
\[
\operatorname{RSS}(\hat\beta) = \|y - X\hat\beta\|^2 = \sum_{i=1}^{n} (y_i - \hat y_i)^2.
\]

    ((y - lin_model.predict(x))**2).sum()

We find: 76855792485.91. Quite a large error...
The average absolute error:

    (abs(y - lin_model.predict(x))).mean()

is 7596.28. Not so good...
We examine the distribution of the residuals:

    import matplotlib.pyplot as plt
    plt.hist(y - lin_model.predict(x))
    plt.show()
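
The same quantities can also be computed with scikit-learn's metrics module. This is a sketch assuming the fitted lin_model, x and y from the previous slides:

    from sklearn.metrics import mean_absolute_error

    y_pred = lin_model.predict(x)
    rss = ((y - y_pred) ** 2).sum()        # residual sum of squares
    mae = mean_absolute_error(y, y_pred)   # average absolute error
    print(rss, mae)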

Measuring the fit of a linear model (cont.)

Histogram of the residuals: non-symmetric, with a heavy tail.
The heavy tail suggests there may be outliers. It also suggests transforming the response variable using a transformation such as log, square root, or 1/x.
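
A hedged sketch (an assumed example, reusing x and y from the earlier slides) of refitting the model on a log-transformed response, as suggested above:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    log_model = LinearRegression(fit_intercept=True)
    log_model.fit(x, np.log(y))                # model log(price) instead of price

    residuals = np.log(y) - log_model.predict(x)
    print(residuals.mean(), residuals.std())   # quick look at the new residuals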

Measuring the fit of a linear model (cont.)

Plotting the residuals as a function of the fitted values, we immediately observe some patterns. Outliers? Separate categories of cars?
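
A sketch of the residuals-versus-fitted-values plot described above (assuming the fitted lin_model, x and y from the previous slides):

    import matplotlib.pyplot as plt

    fitted = lin_model.predict(x)
    plt.scatter(fitted, y - fitted, s=5)
    plt.xlabel('Fitted values')
    plt.ylabel('Residuals')
    plt.show()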

Improving the model

Add more variables to the model.
Select the best variables to include.
Use transformations.
Separate cars into categories (e.g. exclude expensive cars).
Etc.
For example, let us use all the variables and exclude Cadillacs from the dataset (a sketch follows below). The resulting residual histogram is much more symmetric and closer to a Gaussian distribution. The average absolute error drops to 4241.21.
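
A hedged sketch of the final model described above: use all variables and drop Cadillacs. The column name 'make', the 'Cadillac' label, and the use of one-hot encoding are assumptions; the original slides do not show this code.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    data = pd.read_csv('./data/jse_car_lab.csv', delimiter=',')
    subset = data[data['make'] != 'Cadillac']    # exclude Cadillacs (assumed column/label)

    y = subset['price'].values
    X = pd.get_dummies(subset.drop(columns=['price']), drop_first=True)  # encode categorical variables

    model = LinearRegression(fit_intercept=True)
    model.fit(X, y)
    print(np.abs(y - model.predict(X)).mean())   # average absolute error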