Linear and Logistic Regression

1 Linear and Logistic Regression
Marta Arias, Dept. LSI, UPC, Fall 2012

2 Linear regression. Simple case: $\mathbb{R}^2$

Here is the idea:

1. We have a set of points in $\mathbb{R}^2$, $\{(x_i, y_i)\}$.
2. We want to fit a line $y = ax + b$ that describes the trend.
3. We define a cost function that computes the total squared error of our predictions with respect to the observed values $y_i$,
   $$J(a, b) = \sum_i (a x_i + b - y_i)^2,$$
   which we want to minimize.
4. See it as a function of $a$ and $b$: compute both partial derivatives, set them equal to zero, and solve for $a$ and $b$.
5. The coefficients you get give you the minimum squared error.
6. This can be done for specific points, or in general to find the formulas.
7. There is a more general version in $\mathbb{R}^n$.

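3 Linear regression. Simple case: $\mathbb{R}^2$

Let $h(x) = ax + b$, and $J(a, b) = \sum_i (h(x_i) - y_i)^2$. Then

$$
\begin{aligned}
\frac{\partial J(a, b)}{\partial a}
  &= \sum_i \frac{\partial (h(x_i) - y_i)^2}{\partial a} \\
  &= \sum_i \frac{\partial (a x_i + b - y_i)^2}{\partial a} \\
  &= \sum_i 2 (a x_i + b - y_i) \frac{\partial (a x_i + b - y_i)}{\partial a} \\
  &= 2 \sum_i (a x_i + b - y_i) \frac{\partial (a x_i)}{\partial a} \\
  &= 2 \sum_i (a x_i + b - y_i) x_i
\end{aligned}
$$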

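4 Linear regression. Simple case: $\mathbb{R}^2$

Let $h(x) = ax + b$, and $J(a, b) = \sum_i (h(x_i) - y_i)^2$. Then

$$
\begin{aligned}
\frac{\partial J(a, b)}{\partial b}
  &= \sum_i \frac{\partial (h(x_i) - y_i)^2}{\partial b} \\
  &= \sum_i \frac{\partial (a x_i + b - y_i)^2}{\partial b} \\
  &= \sum_i 2 (a x_i + b - y_i) \frac{\partial (a x_i + b - y_i)}{\partial b} \\
  &= 2 \sum_i (a x_i + b - y_i) \frac{\partial b}{\partial b} \\
  &= 2 \sum_i (a x_i + b - y_i)
\end{aligned}
$$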

5 Linear regression. Simple case: $\mathbb{R}^2$

Normal equations. Given $\{(x_i, y_i)\}_i$, solve for $a$, $b$:

$$\sum_i (a x_i + b) x_i = \sum_i x_i y_i$$

$$\sum_i (a x_i + b) = \sum_i y_i$$
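The two normal equations form a 2 x 2 linear system in $a$ and $b$. The following is a minimal Octave sketch of solving that system directly; the four data points are made up for illustration and do not come from the slides.

```octave
% Illustrative data lying exactly on y = 2x + 1 (an assumption for this sketch)
x = [1; 2; 3; 4];
y = [3; 5; 7; 9];
m = numel(x);

% Normal equations written as a linear system in [a; b]:
%   a*sum(x.^2) + b*sum(x) = sum(x.*y)
%   a*sum(x)    + b*m      = sum(y)
A = [sum(x.^2), sum(x); sum(x), m];
r = [sum(x .* y); sum(y)];

sol = A \ r;   % solve the 2 x 2 system
a = sol(1);
b = sol(2);
printf("a = %g, b = %g\n", a, b);
```

Because the points lie exactly on a line, the fit recovers its slope and intercept up to rounding.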

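6 Linear regression. General case: $\mathbb{R}^n$

Now, each $x^i = \langle x^i_0, x^i_1, x^i_2, \ldots, x^i_n \rangle$, where $x^i_0 = 1$ for all $i$. The parameters to estimate are $a = \langle a_0, \ldots, a_n \rangle^T$ [1].

For $j = 0, \ldots, n$, we have

$$\frac{\partial J(a)}{\partial a_j} = 2 \sum_i \Big( \sum_{k=0}^{n} a_k x^i_k - y^i \Big) x^i_j$$

Normal equations. Given $\{(x^i, y^i)\}_i$, solve for $a_0, a_1, \ldots, a_n$:

$$\sum_i \Big( \sum_{k=0}^{n} a_k x^i_k \Big) x^i_j = \sum_i x^i_j y^i \quad \text{(for each } j = 0, \ldots, n\text{)}$$

[1] Notice $a$ is defined as a column vector.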

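7 Linear regression. General case: $\mathbb{R}^n$

Remember $a = \langle a_0, a_1, a_2, \ldots, a_n \rangle^T$. Let $y = \langle y^1, y^2, \ldots, y^m \rangle^T$ [2]. Let

$$
X = \begin{pmatrix} x^1 \\ x^2 \\ \vdots \\ x^m \end{pmatrix}
  = \begin{pmatrix}
      x^1_0 & x^1_1 & \cdots & x^1_n \\
      x^2_0 & x^2_1 & \cdots & x^2_n \\
      \vdots & \vdots & & \vdots \\
      x^m_0 & x^m_1 & \cdots & x^m_n
    \end{pmatrix}
$$

where all $x^i_0 = 1$.

Now, the normal equation $\sum_i \big( \sum_{k=0}^{n} a_k x^i_k \big) x^i_j = \sum_i x^i_j y^i$ can be rewritten as:

$$\sum_i x^i_j \Big( \sum_{k=0}^{n} a_k x^i_k \Big) = \sum_i x^i_j (x^i a) = X_j^T y$$

where $X_j$ is the $j$-th column of $X$.

[2] Notice $y$ is defined as a column vector.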

8 Linear regression. General case: $\mathbb{R}^n$

We have $\sum_i x^i_j (x^i a) = X_j^T y$ for each $j = 0, \ldots, n$. Compactly:

$$X^T X a = X^T y$$

which can be solved as

$$a = (X^T X)^{-1} X^T y$$

How to compute parameters in GNU Octave

Given $X$ of size $m \times (n + 1)$ [4] and given the label vector $y$, you can solve the least squares regression problem with the single command [5]

    pinv(X' * X) * X' * y

[4] Assuming the original data matrix has been prepended with an all-1 column.
[5] Equivalent to X \ y using the built-in operator \.
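To make the equivalence in the footnote concrete, here is a small Octave sketch that solves the same least squares problem both ways. The data matrix `D` and labels `y` are synthetic values chosen for this illustration, not the data used in the slides.

```octave
% Synthetic data (an assumption for this sketch): m = 6 examples, n = 2 features
D = [1 2; 2 1; 3 5; 4 3; 5 8; 6 6];
y = [4; 5; 12; 11; 19; 18];
m = size(D, 1);

X = [ones(m, 1) D];                 % prepend the all-1 column

a_pinv = pinv(X' * X) * X' * y;     % normal equations via the pseudo-inverse
a_bs   = X \ y;                     % built-in least squares solve

% The two solutions should agree up to rounding error
disp(norm(a_pinv - a_bs));
```

The backslash form avoids forming $X^T X$ explicitly, which is generally the better-conditioned choice when $X$ has full column rank.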

9 Linear regression. Practical example with Octave

We have a dataset with data for 20 cities; for each city we have information on:

- Number of inhabitants
- Percentage of families with incomes below 5000 USD
- Percentage of unemployed
- Number of murders per $10^6$ inhabitants per annum

We wish to perform regression analysis on the number of murders based on the other 3 features.

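10 Linear regression. Practical example with Octave

Octave code:

    load data.txt
    n = size(data, 2)                  % number of columns (3 features + target)
    m = size(data, 1)                  % number of cities
    X = [ ones(m, 1) data(:,1:n-1) ]   % features, with all-1 column prepended
    y = data(:,n)                      % target: last column
    a = pinv(X'*X) * X' * y

Result: a = ⟨…, …, …, …⟩ (four coefficients: the intercept and one per feature).

So, we see that the variable that has the most impact is the percentage of unemployed.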

11 Linear regression. What if n is too large?

Computing $a = (X^T X)^{-1} X^T y$ may not be feasible if $n$ is large, since it involves the inverse of a matrix of size $n \times n$ (or $(n + 1) \times (n + 1)$ if we added the extra all-1 column).

Gradient descent: an iterative optimization solution

Start with any parameters $a$, and update $a$ iteratively in order to minimize $J(a)$. Gradient descent tells us that $J(a)$ should decrease fastest if we follow the direction of the negative gradient of the cost function $J(a)$:

$$a = a - \alpha \nabla J(a)$$

where $\alpha$ is a positive, real-valued parameter dictating how large each step is, and

$$\nabla J(a) = \Big\langle \frac{\partial J(a)}{\partial a_0}, \frac{\partial J(a)}{\partial a_1}, \ldots, \frac{\partial J(a)}{\partial a_n} \Big\rangle^T.$$

12 Gradient descent

13 Gradient descent. Algorithm, I

Pseudocode: given $J$, $\alpha$

    Initialize a to a random non-zero vector
    Repeat until convergence
        for all j = 0,..,n, do a'_j = a_j - α ∂J(a)/∂a_j
        for all j = 0,..,n, do a_j = a'_j
    Output a

We should be careful with:

- setting $\alpha$ small enough so that the algorithm converges, but not too small, because it may then need unnecessarily many iterations
- performing feature scaling so that all features are on the same range (this is necessary because they share the same $\alpha$ in the updates)
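The pseudocode can be sketched in Octave on a cost function whose minimum is known in advance. The quadratic cost, the step size, and the stopping tolerance below are all assumptions made for this illustration; they are not taken from the slides.

```octave
% Hypothetical cost for illustration: J(a) = (a1 - 3)^2 + (a2 + 1)^2,
% whose minimum is at a = [3; -1]. Its gradient is 2 * (a - [3; -1]).
gradJ = @(a) 2 * (a - [3; -1]);

alpha = 0.1;     % step size (assumed value)
a = [1; 1];      % initial non-zero vector
for t = 1:1000
  a_new = a - alpha * gradJ(a);   % all components updated simultaneously
  if norm(a_new - a) < 1e-10      % simple convergence test (assumed tolerance)
    a = a_new;
    break;
  end
  a = a_new;
end
disp(a');
```

Computing `a_new` from the old `a` before overwriting it plays the role of the temporary values $a'_j$ in the pseudocode: every partial derivative is evaluated at the same point.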

14 Gradient descent. Algorithm, II

$m$ examples $\{(x^i, y^i)\}_i$; an example is $x = \langle x_0, x_1, \ldots, x_n \rangle$.

$$h_a(x) = a_0 x_0 + a_1 x_1 + \ldots + a_n x_n = \sum_{j=0}^{n} a_j x_j = x a$$

$$J(a) = \frac{1}{2m} \sum_{i=1}^{m} (h_a(x^i) - y^i)^2$$

$$\frac{\partial J(a)}{\partial a_j} = \frac{1}{m} \sum_{i=1}^{m} x^i_j (h_a(x^i) - y^i) = \frac{1}{m} X_j^T (Xa - y)$$

$$\nabla J(a) = \frac{1}{m} X^T (Xa - y)$$

Pseudocode: given $\alpha$, $X$, $y$

    Initialize a = ⟨1,..,1⟩^T
    Normalize X
    Repeat until convergence
        a = a - (α/m) X^T (Xa - y)
    Output a

16 Linear regression. Practical example with Octave

Octave code:

    % X is original m x n matrix
    a = ones(n + 1, 1);        % initial value for parameter vector (n features + intercept)
    X = studentize(X);         % normalize X (studentize in older Octave; zscore is the usual replacement)
    X = [ones(m, 1) X];        % prepend all 1s column
    for t = 1:100              % repeat 100 times
      D = X*a - y;             % residuals for the current parameters
      a = a - alpha / m * X' * D;
      % we store consecutive values of J over time t
      J(t) = 1/2/m * D' * D;
    end
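A self-contained variant of this loop can be run on synthetic data and compared against the exact least squares solution. The data, the step size `alpha = 0.5`, and the manual normalization (used here instead of `studentize`) are assumptions for this sketch.

```octave
% Synthetic one-feature data on the line y = 2x + 1 (an assumption for this sketch)
x = (1:10)';
y = 2 * x + 1;
m = numel(y);

% Normalize the feature by hand, then prepend the all-1 column
X = [ones(m, 1), (x - mean(x)) / std(x)];

a = ones(2, 1);      % initial parameter vector
alpha = 0.5;         % step size (assumed value)
J = zeros(100, 1);
for t = 1:100
  D = X * a - y;                  % residuals before the update
  J(t) = D' * D / (2 * m);        % cost at the current parameters
  a = a - alpha / m * X' * D;     % gradient descent step
end

a_exact = X \ y;     % least squares solution on the same normalized X
disp([a a_exact]);
```

Plotting `J` against `t` is a quick way to judge whether `alpha` is reasonable: the curve should decrease steadily and flatten out.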

17 Logistic regression. What if $y^i \in \{0, 1\}$ instead of a continuous real value?

Binary classification

Now, datasets are of the form $\{(x^1, 1), (x^2, 0), \ldots\}$. In this case, linear regression will not do a good job of classifying examples as positive ($y^i = 1$) or negative ($y^i = 0$).

18 Logistic regression. Hypothesis space

$$h_a(x) = g\Big( \sum_{j=0}^{n} a_j x_j \Big) = g(xa)$$

$g(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function (a.k.a. logistic function):

- $0 \le g(z) \le 1$, for all $z \in \mathbb{R}$
- $\lim_{z \to -\infty} g(z) = 0$ and $\lim_{z \to +\infty} g(z) = 1$
- $g(z) \ge 0.5$ iff $z \ge 0$

Given an example $x$: predict positive iff $h_a(x) \ge 0.5$ iff $g(xa) \ge 0.5$ iff $xa \ge 0$.
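The sigmoid and the decision rule are short enough to sketch in Octave. The function handle name `sigmoid` and the sample inputs `z` are choices made for this illustration; Octave does not provide `sigmoid` as a built-in.

```octave
% Element-wise sigmoid (logistic) function
sigmoid = @(z) 1 ./ (1 + exp(-z));

g0 = sigmoid(0);            % value at z = 0

z = [-10; -1; 0; 1; 10];    % sample inputs, standing in for values of x*a
g = sigmoid(z);
pred = g >= 0.5;            % predict positive iff g(z) >= 0.5

disp([z g pred]);
```

The last column matches the sign test $z \ge 0$, which is why prediction does not require evaluating the exponential at all.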

19 Logistic regression. Least square minimization for logistic regression

Let us assume that $P(y = 1 \mid x; a) = h_a(x)$, and so $P(y = 0 \mid x; a) = 1 - h_a(x)$.

Given $m$ training examples $\{(x^i, y^i)\}_i$ where $y^i \in \{0, 1\}$, we compute the likelihood (assuming independence of training examples):

$$L(a) = \prod_i p(y^i \mid x^i; a) = \prod_i h_a(x^i)^{y^i} (1 - h_a(x^i))^{1 - y^i}$$

Our strategy will be to maximize the log likelihood.

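20 Logistic regression

We will run gradient ascent to maximize the log likelihood, using:

for any function $f(x)$,

$$\frac{\partial \log f(x)}{\partial x} = \frac{1}{f(x)} \frac{\partial f(x)}{\partial x}$$

for the sigmoid function $g(x)$,

$$
\begin{aligned}
\frac{\partial g(x)}{\partial x}
  &= \frac{\partial}{\partial x} \frac{1}{1 + e^{-x}} \\
  &= -\frac{1}{(1 + e^{-x})^2} \frac{\partial e^{-x}}{\partial x} \\
  &= \frac{1}{(1 + e^{-x})^2} e^{-x} \\
  &= \Big( \frac{1}{1 + e^{-x}} \Big) \Big( \frac{e^{-x}}{1 + e^{-x}} \Big) \\
  &= g(x)(1 - g(x))
\end{aligned}
$$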
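The identity $g'(x) = g(x)(1 - g(x))$ can be checked numerically. The sketch below compares it against a central finite difference; the grid of test points and the step `h` are arbitrary choices for this illustration.

```octave
g = @(z) 1 ./ (1 + exp(-z));     % sigmoid

x = linspace(-5, 5, 11)';        % test points (arbitrary choice)
h = 1e-5;                        % finite-difference step (assumed value)

numeric  = (g(x + h) - g(x - h)) / (2 * h);   % central difference
analytic = g(x) .* (1 - g(x));                % closed form from the derivation

err = max(abs(numeric - analytic));
disp(err);
```

The two agree to many decimal places, with the remaining gap coming from finite-difference and rounding error.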

21 Logistic regression. Maximizing the log likelihood

$$
\begin{aligned}
\log L(a)
  &= \log \prod_i p(y^i \mid x^i; a) = \sum_i \log p(y^i \mid x^i; a) \\
  &= \sum_i \log \Big( h_a(x^i)^{y^i} (1 - h_a(x^i))^{1 - y^i} \Big) \\
  &= \sum_i y^i \log h_a(x^i) + (1 - y^i) \log(1 - h_a(x^i))
\end{aligned}
$$

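22 Logistic regression. Computing partial derivatives

$$
\begin{aligned}
\frac{\partial \log L(a)}{\partial a_j}
  &= \sum_i y^i \frac{\partial \log h_a(x^i)}{\partial a_j} + (1 - y^i) \frac{\partial \log(1 - h_a(x^i))}{\partial a_j} \\
  &= \sum_i y^i \frac{\partial \log g(x^i a)}{\partial a_j} + (1 - y^i) \frac{\partial \log(1 - g(x^i a))}{\partial a_j} \\
  &= \sum_i \frac{y^i}{g(x^i a)} \frac{\partial g(x^i a)}{\partial a_j} - \frac{1 - y^i}{1 - g(x^i a)} \frac{\partial g(x^i a)}{\partial a_j} \\
  &= \sum_i \Big( \frac{y^i}{g(x^i a)} - \frac{1 - y^i}{1 - g(x^i a)} \Big) \frac{\partial g(x^i a)}{\partial a_j} \\
  &= \sum_i \Big( \frac{y^i}{g(x^i a)} - \frac{1 - y^i}{1 - g(x^i a)} \Big) g(x^i a)(1 - g(x^i a)) \frac{\partial (x^i a)}{\partial a_j} \\
  &= \sum_i \Big( \frac{y^i}{g(x^i a)} - \frac{1 - y^i}{1 - g(x^i a)} \Big) g(x^i a)(1 - g(x^i a)) x^i_j \\
  &= \sum_i (y^i - g(x^i a)) x^i_j = \sum_i (y^i - h_a(x^i)) x^i_j
\end{aligned}
$$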
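The final expression, $\sum_i (y^i - h_a(x^i)) x^i_j$, can be sanity-checked against a finite-difference gradient of the log likelihood. The small data set, the parameter vector `a`, and the step `h` below are made-up values for this sketch.

```octave
g = @(z) 1 ./ (1 + exp(-z));     % sigmoid

% Made-up data: 4 examples, all-1 column plus one feature
X = [1 -1.0; 1 -0.5; 1 0.3; 1 1.2];
y = [0; 1; 0; 1];
a = [0.2; -0.4];                 % arbitrary point at which to check the gradient

% Log likelihood as a function of the parameters
logL = @(a) sum(y .* log(g(X * a)) + (1 - y) .* log(1 - g(X * a)));

analytic = X' * (y - g(X * a));  % sum_i (y^i - h_a(x^i)) x^i_j for every j

h = 1e-6;                        % finite-difference step (assumed value)
numeric = zeros(size(a));
for j = 1:numel(a)
  e = zeros(size(a));
  e(j) = h;
  numeric(j) = (logL(a + e) - logL(a - e)) / (2 * h);
end

err = norm(numeric - analytic);
disp(err);
```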

23 Gradient ascent for logistic regression. Algorithm, I

Pseudocode: given $\alpha$, $\{(x^i, y^i)\}_{i=1}^{m}$

    Initialize a = ⟨1,..,1⟩^T
    Perform feature scaling on the examples' attributes
    Repeat until convergence
        for each j = 0,..,n: a'_j = a_j + α Σ_i (y^i - h_a(x^i)) x^i_j
        for each j = 0,..,n: a_j = a'_j
    Output a

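24 Gradient ascent for logistic regression. Algorithm, II

$m$ examples $\{(x^i, y^i)\}_i$; $g$ is the sigmoid function, and $\vec{g}$ is its generalization to vectors: $\vec{g}(\langle z_1, \ldots, z_k \rangle) = \langle g(z_1), \ldots, g(z_k) \rangle$.

$$h_a(x) = g\Big( \sum_{j=0}^{n} a_j x_j \Big) = g(xa)$$

$$J(a) = \frac{1}{m} \sum_i y^i \log h_a(x^i) + (1 - y^i) \log(1 - h_a(x^i))$$

$$\frac{\partial J(a)}{\partial a_j} = \frac{1}{m} \sum_{i=1}^{m} x^i_j (y^i - h_a(x^i)) = \frac{1}{m} X_j^T (y - \vec{g}(Xa))$$

$$\nabla J(a) = \frac{1}{m} X^T (y - \vec{g}(Xa))$$

Here $J(a)$ is the average log likelihood, so it is maximized and the update adds the gradient.

Pseudocode: given $\alpha$, $X$, $y$

    Initialize a = ⟨1,..,1⟩^T
    Normalize X
    Repeat until convergence
        a = a + (α/m) X^T (y - g(Xa))
    Output a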

25 Logistic regression. Practical example with Octave

Octave code:

    % X is original m x n matrix
    sigmoid = @(z) 1 ./ (1 + exp(-z));   % sigmoid is not built in; defined here
    a = ones(n + 1, 1);        % initial value for parameter vector (n features + intercept)
    X = studentize(X);         % normalize X (studentize in older Octave; zscore is the usual replacement)
    X = [ones(m, 1) X];        % prepend all 1s column
    for t = 1:100              % repeat 100 times
      D = y - sigmoid(X*a);
      a = a + alpha / m * X' * D;
      % we store consecutive values of J over time t
      G = sigmoid(X*a);
      J(t) = 1/m * (log(G)'*y + log(1-G)'*(1-y));
    end
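A self-contained variant of this loop can be run end to end on toy data. The eight labelled points, the step size, the iteration count, and the manual normalization are all assumptions made for this sketch; the labels deliberately overlap so that the classes are not perfectly separable.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));

% Toy data (an assumption for this sketch): one feature, slightly overlapping labels
x = (1:8)';
y = [0; 0; 0; 1; 0; 1; 1; 1];
m = numel(y);

% Normalize the feature by hand, then prepend the all-1 column
X = [ones(m, 1), (x - mean(x)) / std(x)];

a = ones(2, 1);      % initial parameter vector
alpha = 0.5;         % step size (assumed value)
J = zeros(200, 1);
for t = 1:200
  a = a + alpha / m * X' * (y - sigmoid(X * a));           % gradient ascent step
  G = sigmoid(X * a);
  J(t) = 1 / m * (log(G)' * y + log(1 - G)' * (1 - y));    % average log likelihood
end

pred = sigmoid(X * a) >= 0.5;    % predict positive iff h_a(x) >= 0.5
acc = mean(pred == y);           % fraction of training examples classified correctly
disp(acc);
```

Since `J` is a log likelihood, it is negative and should rise towards zero over the iterations; a falling curve would suggest that `alpha` is too large.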
