Machine Learning. Regression Case Part 1. Linear Regression. Machine Learning: Regression Case.



Cal Poly CSC 566 Advanced Data Mining, Alexander Dekhtyar

Machine Learning: Regression Case

Dataset. Consider a collection of features $X = \{X_1, \ldots, X_n\}$, such that $dom(X_i) \subseteq \mathbb{R}$ for all $i = 1, \ldots, n$ (later, we will relax this condition). Consider also an additional feature $Y$, such that $dom(Y) \subseteq \mathbb{R}$.

Let $D = \{x_1, \ldots, x_m\}$ be a collection of data points, such that $(\forall j \in 1 \ldots m)(x_j \in dom(X_1) \times \cdots \times dom(X_n) \times dom(Y))$. We also write $x_i = (x_{i1}, \ldots, x_{in})$ for the observed values of the $i$-th data point and $y_i$ for its value of $Y$. We write $D$ as

$$D = \begin{array}{cccc|c} X_1 & X_2 & \cdots & X_n & Y \\ \hline x_{11} & x_{12} & \cdots & x_{1n} & y_1 \\ x_{21} & x_{22} & \cdots & x_{2n} & y_2 \\ \vdots & \vdots & & \vdots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} & y_m \end{array}$$

Machine Learning Question. We treat $X$ as the independent or observed variables and $Y$ as the dependent or target variable. Our goal can be expressed as follows: given the dataset $D$ of data points, find a relationship between the vectors $x_i$ of observed data and the values $y_i$ of the dependent variable.

Regression. Representing the relationship between the independent variables $X_1, \ldots, X_n$ and the target variable $Y$ (based on observations $D$) as a function

$$f : dom(X_1) \times \cdots \times dom(X_n) \to dom(Y)$$

or, more generally, as $f : \mathbb{R}^n \to \mathbb{R}$, is called regression modeling, and the function $f$ that represents this relationship is called the regression function.

Simple Linear Regression (Multivariate Case)

To save space, we discuss the multivariate regression case directly.

Linear Regression. A regression function $f(x_1, \ldots, x_n) : \mathbb{R}^n \to \mathbb{R}$ is called a linear regression function iff it has the form

$$y = f(x_1, \ldots, x_n) = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \epsilon.$$

The values $\beta_1, \ldots, \beta_n$ are the linear coefficients of the regression function, otherwise known as feature loadings. The value $\epsilon$, otherwise known as the intercept, is the value of the regression function on the vector $x_0 = (0, \ldots, 0)$: $f(0, 0, \ldots, 0) = \epsilon$.

The linear regression equation may be rewritten as

$$y = x^T \beta + \epsilon.$$

Finding the best regression function. How do we find the right $f(x)$? We can use the dataset $D = \{x_1, \ldots, x_m\}$ for guidance. Assume for a moment that we already know the values of the coefficients $\beta_1, \ldots, \beta_n$ and $\epsilon$ for some linear regression function $f()$. Then this function can be used to predict the values $y'_1, \ldots, y'_m$ of the target variable $Y$ on inputs $x_1, \ldots, x_m$ respectively:

$$\begin{aligned} y'_1 &= x_1^T \beta + \epsilon = \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_n x_{1n} + \epsilon \\ y'_2 &= x_2^T \beta + \epsilon = \beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_n x_{2n} + \epsilon \\ &\;\;\vdots \\ y'_m &= x_m^T \beta + \epsilon = \beta_1 x_{m1} + \beta_2 x_{m2} + \cdots + \beta_n x_{mn} + \epsilon \end{aligned}$$

or, in matrix form, $y' = D\beta + \mathbf{1}\epsilon$, where $D$ here stands for the $m \times n$ matrix whose $i$-th row is $x_i^T$ and $\mathbf{1}$ is the vector of $m$ ones.

Thus, for each vector $x_i$ we have $y'_i$, the regression prediction, and $y_i$, the true value of $Y$. Our goal is to minimize the error of prediction, i.e., to minimize the overall differences between the predicted and the observed values.
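As a small illustration of the prediction step, the following Python sketch computes all $m$ predictions at once with NumPy. The data, the coefficient values, and the variable names (`X`, `beta`, `eps`, `y_true`) are made up for this example and do not come from the notes.

```python
import numpy as np

# Hypothetical toy data: m = 3 data points, n = 2 features (one row per point).
X = np.array([[1.0, 2.0],
              [3.0, 0.0],
              [0.0, 1.0]])
y_true = np.array([1.0, 6.0, -1.0])   # made-up observed values of Y

# Assumed (not fitted) coefficients beta_1, beta_2 and intercept epsilon.
beta = np.array([2.0, -1.0])
eps = 0.5

# Matrix form of the m prediction equations: y' = X beta + 1 * eps.
y_pred = X @ beta + eps

# Differences between observed and predicted values, one per data point.
errors = y_true - y_pred
```

NumPy broadcasting adds the scalar `eps` to every entry, which plays the role of the vector of ones multiplied by the intercept.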

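Error. Given a vector $x_i$, the prediction error $error(x_i)$ is defined as the difference between the true value $y_i$ and the prediction $y'_i = x_i^T \beta + \epsilon$:

$$error(x_i) = y_i - y'_i = y_i - (x_i^T \beta + \epsilon)$$

Minimizing the error. There is a wide range of ways in which error can be considered minimized. The traditional way of doing so is called the least squares method. In this method, we minimize the sum of squares of the individual errors of prediction. That is, we define the overall error of prediction $Error(D)$ as follows:

$$Error(D) = Error(x_1, \ldots, x_m) = \sum_{i=1}^{m} (y_i - y'_i)^2 = \sum_{i=1}^{m} error^2(x_i) = \sum_{i=1}^{m} \left(y_i - (x_i^T \beta + \epsilon)\right)^2$$

Among all $\beta$ vectors, we want to find the one that minimizes $Error(D)$. If we fix the dataset $D$ and instead let $\beta$ and $\epsilon$ vary, we can view the error as a function of $\beta, \epsilon$:

$$L(\beta, \epsilon) = \sum_{i=1}^{m} \left(y_i - (x_i^T \beta + \epsilon)\right)^2$$

We want to minimize $L$ with respect to $\beta, \epsilon$:

$$\min_{\beta, \epsilon} L(\beta, \epsilon)$$

Solving the minimization problem. $L$ is a convex quadratic function of $\beta$ and $\epsilon$, so it is minimized where all of its partial derivatives are equal to zero. This can be expressed as the following $n + 1$ equations:

$$\frac{\partial L}{\partial \beta_1} = 0, \qquad \frac{\partial L}{\partial \beta_2} = 0, \qquad \ldots, \qquad \frac{\partial L}{\partial \beta_n} = 0,$$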

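$$\frac{\partial L}{\partial \epsilon} = 0$$

To simplify notation, let $\beta_0 = \epsilon$ and let $\beta = (\beta_0, \beta_1, \beta_2, \ldots, \beta_n)$. Further, let $x_i = (1, x_{i1}, \ldots, x_{in})^T$, so that $x_{i0} = 1$. Then, for each $j = 0, \ldots, n$:

$$\frac{\partial L}{\partial \beta_j} = \frac{\partial}{\partial \beta_j} \left( \sum_{i=1}^{m} (y_i - x_i^T \beta)^2 \right) = -2 \sum_{i=1}^{m} x_{ij} \left(y_i - (\beta_0 x_{i0} + \cdots + \beta_n x_{in})\right) = 0$$

With $D$ now denoting the $m \times (n+1)$ matrix whose $i$-th row is $x_i^T$, and $y = (y_1, \ldots, y_m)^T$, the system of linear equations generated can be described as:

$$D^T (y - D\beta) = 0$$

Its solution (assuming $D^T D$ is invertible) is

$$\hat{\beta} = (D^T D)^{-1} D^T y$$

This can be rewritten as

$$\hat{\beta} = (D^T D)^{-1} D^T y = \left( \sum_{i=1}^{m} x_i x_i^T \right)^{-1} \sum_{i=1}^{m} x_i y_i$$

We can then estimate the values of $y$ for $x_1, \ldots, x_m$:

$$\hat{y}_i = x_i^T \hat{\beta} = x_i^T (D^T D)^{-1} D^T y,$$

or:

$$\hat{y} = D\hat{\beta} = D (D^T D)^{-1} D^T y = P y, \quad \text{for } P = D (D^T D)^{-1} D^T.$$

Goodness-of-fit. You can test how good your regression model is by asking what percentage of variance the regression is responsible for. This is called a goodness-of-fit test, and the metric estimating the goodness of fit is called the $R^2$ metric. It is computed as follows, where $\mu_y$ is the mean of $y_1, \ldots, y_m$:

$$R^2 = \frac{\sum_{i=1}^{m} (\hat{y}_i - \mu_y)^2}{\sum_{i=1}^{m} (y_i - \mu_y)^2}$$

Linear Regression with Categorical Variables

The specification for linear regression assumes that all variables $X_1, \ldots, X_n$ are numeric, i.e., $dom(X_i) \subseteq \mathbb{R}$. Sometimes, however, some of the variables in the set $X$ are categorical.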
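Before turning to categorical variables, the closed-form least squares solution and the $R^2$ metric can be sketched numerically. This is a minimal Python/NumPy sketch, not a production implementation: the function names `fit_least_squares` and `r_squared` and the toy data are assumptions made for this example, and the code assumes $D^T D$ is invertible.

```python
import numpy as np

def fit_least_squares(X, y):
    """Return (D, beta_hat), where D is X with a leading column of ones
    and beta_hat solves the normal equations D^T D beta = D^T y."""
    D = np.column_stack([np.ones(len(X)), X])
    # Solving the linear system avoids forming the inverse explicitly.
    beta_hat = np.linalg.solve(D.T @ D, D.T @ y)
    return D, beta_hat

def r_squared(y, y_hat):
    """Ratio of explained to total variation around the mean of y."""
    mu = y.mean()
    return np.sum((y_hat - mu) ** 2) / np.sum((y - mu) ** 2)

# Hypothetical noise-free data generated from y = 1 + 2x.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

D, beta_hat = fit_least_squares(X, y)   # beta_hat[0] is the intercept
y_hat = D @ beta_hat                    # fitted values, y_hat = P y
```

Because the toy data lie exactly on a line, the fit recovers the intercept and slope up to floating-point error and the $R^2$ value is essentially 1. On noisy data the same functions apply unchanged.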

Nominal Variables Case. Let $X_i$ be a nominal variable with the domain $dom(X_i) = \{a_1, \ldots, a_k\}$. We proceed as follows.

1. Select one value (w/o loss of generality, $a_k$) as the baseline.
2. For each value $a \in \{a_1, \ldots, a_{k-1}\}$ create a new numeric variable $X_a$. Let the new set of variables be $X' = (X - \{X_i\}) \cup \{X_{a_1}, \ldots, X_{a_{k-1}}\}$.
3. Represent values $a_1, \ldots, a_{k-1}$ as vectors $\mathbf{a}_1 = (1, 0, 0, \ldots, 0)$, $\mathbf{a}_2 = (0, 1, 0, \ldots, 0)$, ..., $\mathbf{a}_{k-1} = (0, 0, 0, \ldots, 1)$, and represent value $a_k$ as the vector $\mathbf{a}_k = (0, \ldots, 0)$.
4. Replace each vector $x_j = (x_1, \ldots, x_{i-1}, a_l, \ldots, x_n)$ with the vector $x'_j = (x_1, \ldots, x_{i-1}, \mathbf{a}_l, \ldots, x_n)$.
5. Perform linear regression from $X'$ to $Y$.

Ordinal Variables Case. Let $X_i$ be an ordinal variable with domain $dom(X_i) = \{1, 2, \ldots, k\}$ (without loss of generality, we can represent all ordinal domains this way). We replace $X_i$ as follows.

1. Select one value (w/o loss of generality, $k$) as the baseline.
2. For each value $j \in \{1, \ldots, k-1\}$ create a new numeric variable $X_{ij}$. Let the new set of variables be $X' = (X - \{X_i\}) \cup \{X_{i1}, \ldots, X_{i(k-1)}\}$.
3. Represent values $1, \ldots, k-1$ of $X_i$ as vectors $\mathbf{a}_1 = (1, 0, 0, \ldots, 0)$, $\mathbf{a}_2 = (1, 1, 0, \ldots, 0)$, ..., $\mathbf{a}_{k-1} = (1, 1, 1, \ldots, 1)$, and represent the value $X_i = k$ as the vector $\mathbf{a}_k = (0, 0, 0, \ldots, 0)$.
4. Replace each vector $x_j = (x_1, \ldots, x_{i-1}, l, \ldots, x_n)$ with the vector $x'_j = (x_1, \ldots, x_{i-1}, \mathbf{a}_l, \ldots, x_n)$.
5. Perform linear regression from $X'$ to $Y$.
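The two encodings above can be sketched in Python as follows. The helper names `encode_nominal` and `encode_ordinal` and the example values are hypothetical. The ordinal helper follows the scheme exactly as stated in these notes (level $l < k$ becomes $l$ leading ones, level $k$ becomes the zero vector), which may differ from ordinal codings used elsewhere.

```python
import numpy as np

def encode_nominal(values, categories):
    """Dummy-code a nominal column. The last entry of `categories` is the
    baseline and maps to the all-zero vector."""
    non_baseline = categories[:-1]
    out = np.zeros((len(values), len(non_baseline)))
    for row, value in enumerate(values):
        if value in non_baseline:
            out[row, non_baseline.index(value)] = 1.0
    return out

def encode_ordinal(values, k):
    """Code ordinal levels 1..k as in the notes: level l < k gets l leading
    ones followed by zeros; the baseline level k gets all zeros."""
    out = np.zeros((len(values), k - 1))
    for row, level in enumerate(values):
        if level < k:
            out[row, :level] = 1.0
    return out

# Hypothetical columns: a color with baseline "blue", and a size on a 1..4 scale.
colors = encode_nominal(["red", "blue", "green"], ["red", "green", "blue"])
sizes = encode_ordinal([1, 2, 3, 4], k=4)
```

The resulting columns replace the original categorical column in the data matrix before the regression is fitted.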

Basis Expansion

Given a set of features $X = \{X_1, \ldots, X_n\}$ and a dataset $\{(x_i, y_i)\}$, we can construct a least-squares linear regression using the features $X_1, \ldots, X_n$. However, we can also transform features.

Polynomial basis expansion. Given a feature $X_i$, add features $X_i^2, \ldots, X_i^d$ to the list of features, and represent the output as

$$y = \beta_0 + x_1 \beta_1 + \cdots + \beta_{i1} x_i + \beta_{i2} x_i^2 + \cdots + \beta_{id} x_i^d + \beta_{i+1} x_{i+1} + \cdots + \beta_n x_n$$

This can be applied to any number of features from $X$.

Interactions. Let $X_i$ and $X_j$ be two independent variables in $X$. We can add a feature $X_{ij} = X_i \cdot X_j$ to the model. If we want to build a linear regression model that accounts for interactions between all pairs of features, the expression will look as follows:

$$y = \beta_0 + x_1 \beta_1 + \cdots + x_n \beta_n + x_1 x_2 \beta_{12} + \cdots + x_{n-1} x_n \beta_{(n-1)n} = \beta_0 + \sum_{i=1}^{n} x_i \beta_i + \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} x_i x_j \beta_{ij}$$

General Expansion. Let $f_1(x), \ldots, f_s(x)$ be efficiently computable functions over $dom(X_1) \times \cdots \times dom(X_n)$. We can consider the regression of the form:

$$y = \beta_0 + f_1(x) \beta_1 + f_2(x) \beta_2 + \cdots + f_s(x) \beta_s.$$

Note. These regressions are non-linear in $X$, but they are linear in $\beta$. Therefore, the standard least squares regression techniques will work.
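To illustrate the closing note, the sketch below builds an expanded feature matrix (polynomial terms plus pairwise interactions) and fits it with ordinary least squares. The helper name `expand`, the column ordering, and the synthetic data are assumptions made for this example. `np.linalg.lstsq` is used here as one convenient least squares solver.

```python
import numpy as np

def expand(X, degree=2):
    """Build the expanded design matrix with columns, in order:
    1, each original feature, powers 2..degree of each feature,
    and the products x_i * x_j for all pairs i < j."""
    m, n = X.shape
    cols = [np.ones(m)]
    cols += [X[:, i] for i in range(n)]
    cols += [X[:, i] ** d for i in range(n) for d in range(2, degree + 1)]
    cols += [X[:, i] * X[:, j] for i in range(n) for j in range(i + 1, n)]
    return np.column_stack(cols)

# Synthetic, noise-free target that is non-linear in x but linear in beta.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 0] ** 2 + 0.5 * X[:, 0] * X[:, 1]

Phi = expand(X, degree=2)   # columns: 1, x1, x2, x1^2, x2^2, x1*x2
beta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```

Since the target was generated without noise from terms that are all present in the expansion, the fitted coefficients match the generating ones up to floating-point error. The fit itself is the same linear least squares problem as before, only with more columns.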
