MATH 829: Introduction to Data Mining and Analysis Linear Regression: old and new

MATH 829: Introduction to Data Mining and Analysis
Linear Regression: old and new
Dominique Guillot
Department of Mathematical Sciences, University of Delaware
February 10, 2016

Linear Regression: old and new

Typical problem: we are given n observations of variables X_1, ..., X_p and Y.
Goal: use X_1, ..., X_p to try to predict Y.
Example: cars data compiled using Kelley Blue Book (n = 805, p = 11). Find a linear model
\[
Y = \beta_1 X_1 + \cdots + \beta_p X_p.
\]
In the example, we want: price = β_1 · mileage + β_2 · cylinder + ...

Linear regression: classical setting

p = number of variables, n = number of observations.
Classical setting: n ≫ p (n much larger than p). With enough observations, we hope to be able to build a good model.
Note: even if the true relationship between the variables is not linear, we can include transformations of variables, e.g. X_{p+1} = X_1^2, X_{p+2} = X_2^2, ... (see the sketch below).
Note: adding transformed variables can increase p significantly. A complex model requires a lot of observations.
Modern setting: in modern problems, it is often the case that n ≪ p. This requires supplementary assumptions (e.g. sparsity). One can still build good models with very few observations.
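
The following is a minimal sketch (not from the slides; the data are simulated) of how squared features can be appended to a design matrix, as in the note above.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))   # n = 100 observations, p = 3 variables

    X_aug = np.hstack([X, X**2])    # append the squared columns X_1^2, X_2^2, X_3^2
    print(X_aug.shape)              # (100, 6): p grows from 3 to 6

Each added column is treated as a new variable, so the model remains linear in the parameters even though it is no longer linear in the original X_j.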

Classical setting

Idea: Y ∈ R^{n×1}, X ∈ R^{n×p},
\[
Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix},
\qquad
X = \begin{pmatrix} x_1 & x_2 & \cdots & x_p \end{pmatrix},
\]
where x_1, ..., x_p ∈ R^{n×1} are the observations of X_1, ..., X_p.
We want Y = β_1 X_1 + ⋯ + β_p X_p. This is equivalent to solving
\[
Y = X\beta, \qquad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix}.
\]
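
As a concrete illustration (an assumed toy example, not from the slides), the observation vectors x_1, ..., x_p can be stacked as the columns of X with NumPy:

    import numpy as np

    n = 5
    x1 = np.arange(n, dtype=float)     # observations of X_1
    x2 = np.ones(n)                    # observations of X_2
    X = np.column_stack([x1, x2])      # X is n x p with p = 2
    Y = 2.0 * x1 + 3.0 * x2            # here Y = X beta holds exactly with beta = (2, 3)
    print(X.shape, Y.shape)            # (5, 2) (5,)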

Classical setting (cont.)

We need to solve Y = Xβ. Obviously, in general, the system has no solution.
A popular approach is to solve the system in the least squares sense:
\[
\hat\beta = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \|Y - X\beta\|^2.
\]
How do we compute the solution? Calculus approach:
\[
\frac{\partial}{\partial \beta_i} \|Y - X\beta\|^2
= \frac{\partial}{\partial \beta_i} \sum_{k=1}^{n} \left(y_k - X_{k1}\beta_1 - X_{k2}\beta_2 - \cdots - X_{kp}\beta_p\right)^2
= 2 \sum_{k=1}^{n} \left(y_k - X_{k1}\beta_1 - X_{k2}\beta_2 - \cdots - X_{kp}\beta_p\right)(-X_{ki}) = 0.
\]
Therefore,
\[
\sum_{k=1}^{n} X_{ki}\left(X_{k1}\beta_1 + X_{k2}\beta_2 + \cdots + X_{kp}\beta_p\right) = \sum_{k=1}^{n} X_{ki}\, y_k.
\]

Calculus approach (cont.)

Now
\[
\sum_{k=1}^{n} X_{ki}\left(X_{k1}\beta_1 + X_{k2}\beta_2 + \cdots + X_{kp}\beta_p\right) = \sum_{k=1}^{n} X_{ki}\, y_k, \qquad i = 1, \dots, p,
\]
is equivalent to
\[
X^T X \beta = X^T Y \qquad \text{(normal equations)}.
\]
We compute the Hessian:
\[
\left(\frac{\partial^2}{\partial \beta_i \, \partial \beta_j} \|Y - X\beta\|^2\right)_{i,j} = 2\, X^T X.
\]
If X^T X is invertible, then X^T X is positive definite and
\[
\hat\beta = (X^T X)^{-1} X^T Y
\]
is the unique minimum of ‖Y − Xβ‖².
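
A minimal numerical sketch (an assumed example with simulated data) of computing β̂ from the normal equations, checked against NumPy's least squares solver:

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 50, 3
    X = rng.normal(size=(n, p))
    beta_true = np.array([1.0, -2.0, 0.5])
    Y = X @ beta_true + 0.1 * rng.normal(size=n)   # noisy linear model

    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # solve the normal equations
    beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
    print(np.allclose(beta_hat, beta_lstsq))       # True: both give the same estimate

In practice, np.linalg.lstsq (based on the SVD) is preferred over explicitly forming X^T X, which can be numerically ill-conditioned.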

Linear algebra approach

We want to solve Y = Xβ.
Linear algebra approach: recall that if V ⊆ R^n is a subspace and w ∉ V, then the best approximation of w by a vector in V is proj_V(w).
"Best" in the sense that
\[
\|w - \operatorname{proj}_V(w)\| \le \|w - v\| \qquad \forall v \in V.
\]
Here Xβ ∈ col(X) = span(x_1, ..., x_p). If Y ∉ col(X), then the best approximation of Y by a vector in col(X) is proj_{col(X)}(Y).

Linear algebra approach (cont.)

So
\[
\|Y - \operatorname{proj}_{\operatorname{col}(X)}(Y)\| \le \|Y - X\beta\| \qquad \forall \beta \in \mathbb{R}^p.
\]
Therefore, to find β̂, we solve
\[
X\hat\beta = \operatorname{proj}_{\operatorname{col}(X)}(Y).
\]
(Note: this system always has a solution.)
With a little more work, we can find an explicit solution:
\[
Y - X\hat\beta = Y - \operatorname{proj}_{\operatorname{col}(X)}(Y) = \operatorname{proj}_{\operatorname{col}(X)^\perp}(Y).
\]
Recall
\[
\operatorname{col}(X)^\perp = \operatorname{null}(X^T).
\]
Thus,
\[
Y - X\hat\beta = \operatorname{proj}_{\operatorname{null}(X^T)}(Y) \in \operatorname{null}(X^T).
\]
That implies
\[
X^T (Y - X\hat\beta) = 0.
\]
Equivalently,
\[
X^T X \hat\beta = X^T Y \qquad \text{(normal equations)}.
\]
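
A quick numerical check (an assumed example with simulated data) that the least squares residual lies in null(X^T), i.e. is orthogonal to every column of X:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(30, 4))
    Y = rng.normal(size=30)

    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    residual = Y - X @ beta_hat
    print(np.allclose(X.T @ residual, 0.0))   # True: X^T (Y - X beta_hat) = 0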

The least squares theorem

Theorem (Least squares theorem). Let A ∈ R^{n×m} and b ∈ R^n. Then:
1. Ax = b always has a least squares solution x̂.
2. A vector x̂ is a least squares solution if and only if it satisfies the normal equations A^T A x̂ = A^T b.
3. x̂ is unique ⟺ the columns of A are linearly independent ⟺ A^T A is invertible. In that case, the unique least squares solution is given by x̂ = (A^T A)^{-1} A^T b.
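
The non-uniqueness in part 3 can be seen numerically. The following sketch (an assumed example, not from the slides) uses a matrix with linearly dependent columns, so A^T A is singular and infinitely many least squares solutions exist:

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [2.0, 4.0],
                  [3.0, 6.0]])       # second column = 2 * first column
    b = np.array([1.0, 2.0, 2.0])

    print(np.linalg.matrix_rank(A.T @ A))           # 1 < 2, so A^T A is not invertible

    x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)   # returns the minimum-norm solution
    print(x_hat)
    # x_hat + t * (2, -1) is also a least squares solution for every t,
    # since (2, -1) is in the null space of A.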

Building a simple linear model with Python

The data are in the file JSE_Car_Lab.csv. Loading the data with the headers using Pandas:

    import numpy as np
    import pandas as pd

    data = pd.read_csv('./data/jse_car_lab.csv', delimiter=',')

We extract the numerical columns:

    y = np.array(data['price'])
    x = np.array(data['mileage'])
    x = x.reshape(len(x), 1)
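
Before fitting anything, a quick sanity check of the loaded data can be useful. This is a sketch continuing the snippet above (the file path and column names are taken from that snippet):

    print(data.shape)                         # number of rows (n) and columns
    print(data.columns.tolist())              # available variables
    print(data[['price', 'mileage']].head())  # peek at the two columns used below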

Building a simple linear model with Python (cont.)

The scikit-learn package provides a lot of very powerful functions/objects to analyse datasets. Typical syntax:
1. Create an object representing the model.
2. Call the fit method of the model with the data as arguments.
3. Use the predict method to make predictions.

    from sklearn.linear_model import LinearRegression

    lin_model = LinearRegression(fit_intercept=True)
    lin_model.fit(x, y)
    print(lin_model.coef_)
    print(lin_model.intercept_)

We obtain price ≈ −0.17 · mileage + 24764.5.
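
A short usage sketch of step 3 (an assumed example continuing from the fitted lin_model above): predicting the price of a car with a given mileage.

    import numpy as np

    new_mileage = np.array([[15000.0]])              # one car with 15,000 miles
    predicted_price = lin_model.predict(new_mileage)
    print(predicted_price)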

Measuring the fit of a linear model

How good is our linear model? We examine the residual sum of squares:
\[
\operatorname{RSS}(\hat\beta) = \|y - X\hat\beta\|^2 = \sum_{i=1}^{n} (y_i - \hat y_i)^2.
\]

    ((y - lin_model.predict(x))**2).sum()

We find: 76855792485.91. Quite a large error...
The average absolute error:

    (abs(y - lin_model.predict(x))).mean()

is 7596.28. Not so good...
We examine the distribution of the residuals:

    import matplotlib.pyplot as plt
    plt.hist(y - lin_model.predict(x))
    plt.show()
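
The same quantities can also be computed with scikit-learn's metrics module. This is a sketch assuming the fitted lin_model, x and y from the previous slides:

    from sklearn.metrics import mean_absolute_error

    y_pred = lin_model.predict(x)
    rss = ((y - y_pred) ** 2).sum()        # residual sum of squares
    mae = mean_absolute_error(y, y_pred)   # average absolute error
    print(rss, mae)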

Measuring the fit of a linear model (cont.)

Histogram of the residuals: non-symmetric, with a heavy tail.
The heavy tail suggests there may be outliers. It also suggests transforming the response variable using a transformation such as log, square root, or 1/x.
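
A hedged sketch (an assumed example, reusing x and y from the earlier slides) of refitting the model on a log-transformed response, as suggested above:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    log_model = LinearRegression(fit_intercept=True)
    log_model.fit(x, np.log(y))                # model log(price) instead of price

    residuals = np.log(y) - log_model.predict(x)
    print(residuals.mean(), residuals.std())   # quick look at the new residuals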

Measuring the fit of a linear model (cont.)

Plotting the residuals as a function of the fitted values, we immediately observe some patterns. Outliers? Separate categories of cars?
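
A sketch of the residuals-versus-fitted-values plot described above (assuming the fitted lin_model, x and y from the previous slides):

    import matplotlib.pyplot as plt

    fitted = lin_model.predict(x)
    plt.scatter(fitted, y - fitted, s=5)
    plt.xlabel('Fitted values')
    plt.ylabel('Residuals')
    plt.show()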

Improving the model

Add more variables to the model.
Select the best variables to include.
Use transformations.
Separate cars into categories (e.g. exclude expensive cars).
Etc.
For example, let us use all the variables and exclude Cadillacs from the dataset (a sketch follows below). The resulting residual histogram is much more symmetric and closer to a Gaussian distribution. The average absolute error drops to 4241.21.
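
A hedged sketch of the final model described above: use all variables and drop Cadillacs. The column name 'make', the 'Cadillac' label, and the use of one-hot encoding are assumptions; the original slides do not show this code.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    data = pd.read_csv('./data/jse_car_lab.csv', delimiter=',')
    subset = data[data['make'] != 'Cadillac']    # exclude Cadillacs (assumed column/label)

    y = subset['price'].values
    X = pd.get_dummies(subset.drop(columns=['price']), drop_first=True)  # encode categorical variables

    model = LinearRegression(fit_intercept=True)
    model.fit(X, y)
    print(np.abs(y - model.predict(X)).mean())   # average absolute error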