Machine Learning. Part 1. Linear Regression
Machine Learning: Regression Case

Dennis Sun and Alex Dekhtyar
DATA 401: Data Science

Dataset. Consider a collection of features X = {X_1, ..., X_n}, such that dom(X_i) ⊆ R for all i = 1, ..., n (later, we will relax this condition). Consider also an additional feature Y, such that dom(Y) ⊆ R. Let D = {x_1, ..., x_m} be a collection of data points, such that (∀ j ∈ 1...m)(x_j ∈ dom(X_1) × ... × dom(X_n) × dom(Y)). We also write x_i = (x_i1, ..., x_in). We write D as

          X_1   X_2   ...   X_n  |  Y
          x_11  x_12  ...   x_1n |  y_1
    D =   x_21  x_22  ...   x_2n |  y_2
          ...   ...   ...   ...  |  ...
          x_m1  x_m2  ...   x_mn |  y_m

Machine Learning Question. We treat X as the independent or observed variables and Y as the dependent or target variable. Our goal can be expressed as follows: given the dataset D of data points, find a relationship between the vectors x_i of observed data and the values y_i of the dependent variable.

Regression. Representing the relationship between the independent variables X_1, ..., X_n and the target variable Y (based on the observations D) as a function

    f : dom(X_1) × ... × dom(X_n) → dom(Y),

or, more generally, as

    f : R^n → R,

is called regression modeling, and the function f that represents this relationship is called the regression function.

Linear Regression (Multivariate Case)

To save space, we discuss the multivariate regression case directly.

Linear Regression. A regression function f(x_1, ..., x_n) : R^n → R is called a linear regression function iff it has the form

    y = f(x_1, ..., x_n) = β_0 + β_1 x_1 + β_2 x_2 + ... + β_n x_n + ε.

The values β_1, ..., β_n are the linear coefficients of the regression function, otherwise known as feature loadings. The value β_0, otherwise known as the intercept, is the value of the regression function on the vector x_0 = (0, ..., 0): f(0, 0, ..., 0) = β_0.

The value ε represents error. The error is assumed to be independent of our data (the vector (x_1, ..., x_n)) and normally distributed with mean 0 and some unknown variance σ²:

    ε ~ N(0, σ²).

The linear regression equation may be rewritten as

    y = x^T β + ε,

where x^T = (1, x_1, ..., x_n) and β = (β_0, β_1, ..., β_n).

Finding the best regression function. We can use the dataset D = {x_1, ..., x_m} for guidance. How do we find the right f(x)? Assume for a moment that we already know the values of the coefficients β_0, β_1, ..., β_n for some linear regression function f(). Then this function can be used to predict the values y′_1, ..., y′_m of the target variable Y on the inputs x_1, ..., x_m respectively:

    y′_1 = x_1^T β = β_1 x_11 + β_2 x_12 + ... + β_n x_1n + β_0
    y′_2 = x_2^T β = β_1 x_21 + β_2 x_22 + ... + β_n x_2n + β_0
    ...
    y′_m = x_m^T β = β_1 x_m1 + β_2 x_m2 + ... + β_n x_mn + β_0
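To make this concrete, here is a minimal NumPy sketch (my own illustration, not part of the original handout) that fixes an assumed coefficient vector and computes the predictions y′_i = β_0 + β_1 x_i1 + ... + β_n x_in for a toy dataset; all numbers, including the noise level, are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 5, 3                          # m data points, n features (toy sizes)
X = rng.normal(size=(m, n))          # rows are the observed vectors x_i
beta0 = 2.0                          # assumed intercept
beta = np.array([1.0, -0.5, 3.0])    # assumed coefficients beta_1..beta_n

# Predictions y'_i = beta_0 + beta_1 x_i1 + ... + beta_n x_in
y_pred = beta0 + X @ beta

# The model also assumes additive Gaussian noise eps ~ N(0, sigma^2);
# simulated targets would then be y_i = y'_i + eps_i.
sigma = 0.1
y = y_pred + rng.normal(scale=sigma, size=m)
print(y_pred)
print(y)
```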

If we, without loss of generality, assume that the matrix D is

          1  x_11  x_12  ...  x_1n
          1  x_21  x_22  ...  x_2n
    D =   .   .     .    ...   .
          1  x_m1  x_m2  ...  x_mn

then we can rewrite the expressions above using linear algebra notation:

    y′ = Dβ.

Thus, for each vector x_i we have the regression prediction y′_i and the true value y_i. Our goal is to minimize the error of prediction, i.e., to minimize the overall difference between the predicted and the observed values.

Error. Given a vector x_i, the prediction error error(x_i) is defined as:

    error(x_i) = y_i − y′_i = y_i − x_i^T β = ε_i.

Maximum Likelihood Estimation. Under our assumptions, the ε_i are independently and identically distributed according to a normal distribution, ε_i ~ N(0, σ²), with density

    p(ε_i) = (1 / sqrt(2πσ²)) · exp(−(y_i − x_i^T β)² / (2σ²)).

The probability of observing the vector ε = (ε_1, ..., ε_m) = ((y_1 − x_1^T β), ..., (y_m − x_m^T β)) can therefore be expressed as follows:

    P(ε | β, σ²) = ∏_{i=1}^m (1 / sqrt(2πσ²)) · exp(−(y_i − x_i^T β)² / (2σ²)).

The log-likelihood function can be represented as follows:

    l(β, σ) = −(m/2) log(2π) − (m/2) log(σ²) − (1/(2σ²)) Σ_{i=1}^m (y_i − x_i^T β)²
            = −(m/2) log(2π) − (m/2) log(σ²) − (1/(2σ²)) (y − Dβ)^T (y − Dβ).

We can maximize the log-likelihood function by taking the partial derivatives of l(β, σ) and setting the derivatives to 0.
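The log-likelihood is straightforward to evaluate numerically. Below is a small illustrative sketch (the function name and the toy numbers are mine) that computes l(β, σ) in the matrix form above, assuming D already contains the leading column of ones.

```python
import numpy as np

def log_likelihood(D, y, beta, sigma2):
    """Gaussian log-likelihood l(beta, sigma) of the model y = D beta + eps,
    eps_i ~ N(0, sigma2), where D includes the column of ones."""
    m = len(y)
    resid = y - D @ beta                       # (y - D beta)
    return (-0.5 * m * np.log(2 * np.pi)
            - 0.5 * m * np.log(sigma2)
            - 0.5 * resid @ resid / sigma2)

# Toy usage (numbers are made up):
D = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
y = np.array([1.1, 2.9, 5.2, 7.1])
print(log_likelihood(D, y, beta=np.array([1.0, 2.0]), sigma2=0.05))
```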

For the purpose of prediction, we need to estimate the values β = (β_0, β_1, ..., β_n), but we actually do not need an estimate of the variance σ². We do need σ² estimated, though, if we want to understand how well our linear regression model matches (fits) the observed data.

Estimating Regression Coefficients. The partial derivatives of l(β, σ) with respect to the parameters β_j are:

    ∂l/∂β_j = (1/σ²) Σ_{i=1}^m x_ij (y_i − x_i^T β) = (1/σ²) Σ_{i=1}^m x_ij (y_i − (β_0 + β_1 x_i1 + ... + β_n x_in)).

If we set this to zero, we get

    Σ_{i=1}^m x_ij (y_i − x_i^T β) = 0.

Generalizing to the entire vector β, we get

    ∂l/∂β = (1/σ²) D^T (y − Dβ),

which, when set to the zero vector, gives us:

    D^T (y − Dβ) = 0.

Solving for β, we get

    D^T y = D^T D β,

which yields the closed-form solution for β:

    β̂ = (D^T D)^{-1} D^T y.
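As a quick numerical check of this derivation (an illustration with made-up numbers, not part of the original notes), the sketch below computes β̂ via the closed form and verifies that the gradient D^T(y − Dβ̂) is zero up to floating-point error.

```python
import numpy as np

# Toy design matrix with the leading column of ones, and observed targets.
D = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 2.0, 3.0],
              [1.0, 3.0, 5.0]])
y = np.array([1.0, 2.0, 8.0, 12.0])

# Closed-form maximum likelihood / least squares estimate.
beta_hat = np.linalg.inv(D.T @ D) @ D.T @ y

# First-order condition: D^T (y - D beta_hat) should be (numerically) zero.
gradient = D.T @ (y - D @ beta_hat)
print(beta_hat)
print(np.allclose(gradient, 0.0))   # True up to floating-point error
```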

Least Squares Approximation. What type of an estimator did we just obtain? Let us consider a numeric-analytical approach to predicting the coefficients β.

In predicting β, we want to minimize the error that results from our prediction. There is a wide range of ways in which the error can be considered minimized. The traditional way of doing so is called the least squares method. In this method we minimize the sum of squares of the individual errors of prediction. That is, we define the overall error of prediction Error(D) as follows:

    Error(D) = Error(x_1, ..., x_m) = Σ_{i=1}^m (y_i − y′_i)² = Σ_{i=1}^m error²(x_i) = Σ_{i=1}^m (y_i − (x_i^T β + ε))².

Among all β vectors, we want to find the one that minimizes Error(D). If we fix the dataset D and instead let β and ε vary, we can view the error function as a function of β and ε:

    L(β, ε) = Σ_{i=1}^m (y_i − (x_i^T β + ε))².

We want to minimize L with respect to β and ε:

    min_{β,ε} L(β, ε).

Solving the minimization problem. L(β, ε) is minimized when the derivative of L() is equal to zero. This can be expressed as the following n + 1 equations:

    ∂L/∂β_1 = 0,  ∂L/∂β_2 = 0,  ...,  ∂L/∂β_n = 0,  ∂L/∂ε = 0.

For each β_j,

    ∂L/∂β_j = ∂/∂β_j ( Σ_{i=1}^m (y_i − x_i^T β)² ) = −2 Σ_{i=1}^m x_ij (y_i − (β_0 x_i0 + ... + β_n x_in)) = 0.

The system of linear equations generated this way can be described as:

    D^T (y − Dβ) = 0.

Its solution is

    β̂ = (D^T D)^{-1} D^T y.

This can be rewritten as

    β̂ = (D^T D)^{-1} D^T y = ( Σ_{i=1}^m x_i x_i^T )^{-1} ( Σ_{i=1}^m x_i y_i ).

We can then estimate the values of y for x_1, ..., x_m:

    ŷ_i = x_i^T β̂ = x_i^T (D^T D)^{-1} D^T y,

or:

    ŷ = D β̂ = D (D^T D)^{-1} D^T y = P y,

for P = D (D^T D)^{-1} D^T.

Conclusion. If we look at the maximum likelihood estimator we obtained and at the least squares estimator, we will notice an important fact: it is the same estimator.

Goodness-of-fit. You can test how good your regression model is by asking what percentage of the variance the regression is responsible for. This is called a goodness-of-fit test, and the metric estimating the goodness of fit is the R² metric, computed as follows:

    R² = Σ_{i=1}^m (ŷ_i − µ_y)² / Σ_{i=1}^m (y_i − µ_y)²,

where µ_y is the mean of the observed values y_1, ..., y_m.

Tricks. We have two equations describing the solution for the estimate of β. The first equation is

    D^T D β = D^T y.

The second equation is

    β̂ = (D^T D)^{-1} D^T y.

The advantage of the second equation is that it provides a direct way of computing β̂. The disadvantage is that the computation involves taking the inverse of a matrix. The first equation is a system of linear equations. While it does not give a closed-form solution, it can be leveraged in actual computations if you are relying on a linear algebra package (such as NumPy) for your linear algebra computations. In such a case, using the solver for a system of linear equations rather than a function that inverts a matrix is preferable, as sketched below.
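A minimal sketch of this point, using a randomly generated toy design matrix (the data and variable names are made up): all three routes below recover the same β̂, but np.linalg.solve on the normal equations, or np.linalg.lstsq applied directly to D and y, avoid forming an explicit matrix inverse.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 100, 4
D = np.column_stack([np.ones(m), rng.normal(size=(m, n))])  # design matrix with intercept column
y = rng.normal(size=m)

# Route 1: solve the normal equations  D^T D beta = D^T y  (preferred).
beta_solve = np.linalg.solve(D.T @ D, D.T @ y)

# Route 2: explicit inverse  beta = (D^T D)^{-1} D^T y  (works, but less stable).
beta_inv = np.linalg.inv(D.T @ D) @ D.T @ y

# Route 3: least-squares solver applied to D and y directly.
beta_lstsq, *_ = np.linalg.lstsq(D, y, rcond=None)

print(np.allclose(beta_solve, beta_inv), np.allclose(beta_solve, beta_lstsq))
```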

Linear Regression with Categorical Variables

The specification for linear regression assumes that all variables X_1, ..., X_n are numeric, i.e., dom(X_i) ⊆ R. Sometimes, however, some of the variables in the set X are categorical.

Nominal Variables Case. Let X_i be a nominal variable with the domain dom(X_i) = {a_1, ..., a_k}. We proceed as follows:

- Select one value (without loss of generality, a_k) as the baseline.
- For each value a ∈ {a_1, ..., a_{k-1}}, create a new numeric variable X_a. Let the new set of variables be X′ = (X \ {X_i}) ∪ {X_{a_1}, ..., X_{a_{k-1}}}.
- Represent the values a_1, ..., a_{k-1} as the vectors
      a_1 = (1, 0, 0, ..., 0)
      a_2 = (0, 1, 0, ..., 0)
      ...
      a_{k-1} = (0, 0, 0, ..., 1),
  and represent the value a_k as the vector a_k = (0, ..., 0).
- Replace each data point x_j = (x_j1, ..., x_j,i-1, a_l, x_j,i+1, ..., x_jn) with the vector x′_j = (x_j1, ..., x_j,i-1, a_l, x_j,i+1, ..., x_jn), where a_l now stands for its vector representation above.
- Perform linear regression from X′ to Y.

Ordinal Variables Case. Let X_i be an ordinal variable with domain dom(X_i) = {1, 2, ..., k} (without loss of generality, any ordinal domain can be represented this way). We replace X_i as follows:

- Select one value (without loss of generality, the value k) as the baseline.
- For each value j ∈ {1, ..., k-1}, create a new numeric variable X_ij. Let the new set of variables be X′ = (X \ {X_i}) ∪ {X_i1, ..., X_i,k-1}.
- Represent the values 1, ..., k-1 of X_i as the vectors
      a_1 = (1, 0, 0, ..., 0)
      a_2 = (1, 1, 0, ..., 0)
      ...
      a_{k-1} = (1, 1, 1, ..., 1),
  and represent the value X_i = k as the vector a_k = (0, 0, 0, ..., 0).
- Replace each data point x_j = (x_j1, ..., x_j,i-1, l, x_j,i+1, ..., x_jn) with the vector x′_j = (x_j1, ..., x_j,i-1, a_l, x_j,i+1, ..., x_jn).
- Perform linear regression from X′ to Y.
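The two encodings above amount to adding indicator columns (nominal case) or cumulative columns (ordinal case) to the data matrix. The sketch below is my own illustration; the helper names encode_nominal and encode_ordinal and the toy data are made up, and the last value serves as the baseline in both cases.

```python
import numpy as np

def encode_nominal(column, values):
    """Nominal case: each of values[:-1] gets an indicator column;
    the last value is the baseline and maps to the all-zeros vector."""
    return np.column_stack([(column == a) for a in values[:-1]]).astype(float)

def encode_ordinal(column, k):
    """Ordinal case with dom(X_i) = {1, ..., k}: value j < k maps to a vector with
    ones in the first j positions; the baseline value k maps to all zeros."""
    return np.column_stack([(column >= j) & (column < k) for j in range(1, k)]).astype(float)

# Toy usage (data made up): "blue" and rating 3 are the baselines.
color = np.array(["red", "green", "blue", "green", "blue"])
rating = np.array([1, 3, 2, 3, 1])
print(encode_nominal(color, values=["red", "green", "blue"]))
print(encode_ordinal(rating, k=3))
```

The resulting columns are appended to the remaining numeric features, and the least-squares fit then proceeds exactly as before.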

Basis Expansion

Given a set of features X = {X_1, ..., X_n} and a dataset {(x_i, y_i)}, we can construct a least-squares linear regression using the features X_1, ..., X_n. However, we can also transform the features.

Polynomial basis expansion. Given a feature X_i, add the features X_i², ..., X_i^d to the list of features, and represent the output as

    y = β_0 + β_1 x_1 + ... + β_{i-1} x_{i-1} + β_i1 x_i + β_i2 x_i² + ... + β_id x_i^d + β_{i+1} x_{i+1} + ... + β_n x_n.

This can be applied to any number of features from X.

Interactions. Let X_i and X_j be two independent variables in X. We can add a feature X_ij = X_i · X_j to the model. If we want to build a linear regression model that accounts for interactions between all pairs of features, the expression will look as follows:

    y = β_0 + β_1 x_1 + ... + β_n x_n + β_12 x_1 x_2 + ... + β_{(n-1)n} x_{n-1} x_n = β_0 + Σ_{i=1}^n β_i x_i + Σ_{i=1}^n Σ_{j=i+1}^n β_ij x_i x_j.

General Expansion. Let f_1(x), ..., f_s(x) be efficiently computable functions over dom(X_1) × ... × dom(X_n). We can consider a regression of the form:

    y = β_0 + β_1 f_1(x) + β_2 f_2(x) + ... + β_s f_s(x).

Note. These regressions are non-linear in X, but they are linear in β. Therefore, the standard least squares regression techniques will still work.
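Because these expansions just add columns to the design matrix, the least-squares machinery is unchanged. The following sketch (illustrative only; the data-generating coefficients are made up) adds polynomial terms in one feature and a pairwise interaction, then fits by ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 50
x1 = rng.uniform(-2, 2, size=m)
x2 = rng.uniform(-2, 2, size=m)
y = 1.0 + 2.0 * x1 - 0.5 * x1**2 + 3.0 * x1 * x2 + rng.normal(scale=0.1, size=m)

# Expanded design matrix: intercept, x1, x2, polynomial terms in x1, and the interaction x1*x2.
D = np.column_stack([np.ones(m), x1, x2, x1**2, x1**3, x1 * x2])

# Non-linear in the features, but linear in beta, so ordinary least squares still applies.
beta_hat, *_ = np.linalg.lstsq(D, y, rcond=None)
print(np.round(beta_hat, 2))
```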

Evaluation

How accurate is the regression model? This can be measured in a number of ways.

Training Error. The training error, or SSE, is the target function we are optimizing:

    SSE = Σ_{i=1}^m (y_i − x_i^T β̂)².

Problem: overfitting for more complex models. Solution: look for ways to penalize complexity.

Akaike Information Criterion (AIC). AIC chooses the model with the lowest value of

    AIC = Σ_{i=1}^m (y_i − (β̂_0 + β̂_1 x_i1 + ... + β̂_p x_ip))² + 2p.

AIC adds a penalty for the number of parameters p used in the model.

Bayesian Information Criterion (BIC). BIC chooses the model with the lowest value of

    BIC = Σ_{i=1}^m (y_i − (β̂_0 + β̂_1 x_i1 + ... + β̂_p x_ip))² + p log(m).

Like AIC, BIC adds a penalty for the number of parameters p used in the model, but it scales the penalty by the log of the size of the dataset.
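These criteria, as defined in these notes (a residual sum of squares plus a complexity penalty; other references define AIC and BIC via the full likelihood), can be computed directly from the fitted residuals. A minimal sketch, with made-up data and a hypothetical helper name:

```python
import numpy as np

def aic_bic(D, y, beta_hat):
    """AIC and BIC as defined above: residual sum of squares plus a penalty on the
    number of non-intercept parameters p (D is assumed to include a ones column)."""
    m = len(y)
    p = D.shape[1] - 1
    sse = np.sum((y - D @ beta_hat) ** 2)
    return sse + 2 * p, sse + p * np.log(m)

# Compare a linear and a quadratic model on the same toy data:
rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=30)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=30)

for cols in ([np.ones(30), x], [np.ones(30), x, x**2]):
    D = np.column_stack(cols)
    beta_hat, *_ = np.linalg.lstsq(D, y, rcond=None)
    print(D.shape[1] - 1, aic_bic(D, y, beta_hat))
```

Here the model with the lower AIC (or BIC) would be preferred, since the extra quadratic term buys little reduction in SSE relative to its penalty.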