ECS171: Machine Learning

1 ECS171: Machine Learning
Lecture 3: Linear Models I (LFD 3.2, 3.3)
Cho-Jui Hsieh, UC Davis
Jan 17, 2018

2 Linear Regression (LFD 3.2)

4 Regression
Classification: customer record $\to$ Yes/No
Regression (predicting credit limit): customer record $\to$ dollar amount
Linear regression: $h(x) = \sum_{i=0}^{d} w_i x_i = w^T x$

7 The data set
Dataset (historical decisions by credit officers): $(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)$
$y_n \in \mathbb{R}$: credit limit for customer $x_n$
Linear regression: find a function $h(x) = w^T x$ to approximate $y$ $(= f(x))$
Use the squared error $(h(x) - f(x))^2$
In-sample error: $E_{in}(h) = \frac{1}{N} \sum_{n=1}^{N} (h(x_n) - y_n)^2$
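The in-sample error above can be sketched in a few lines. This is a minimal illustration assuming NumPy; the function name `in_sample_error` and the toy data are hypothetical and not part of the lecture.

```python
import numpy as np

def in_sample_error(w, X, y):
    """Mean squared error E_in(h) for the linear hypothesis h(x) = w^T x."""
    predictions = X @ w            # h(x_n) for every row x_n of X
    return np.mean((predictions - y) ** 2)

# Hypothetical toy data: the first column is the constant feature x_0 = 1.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
w_exact = np.array([1.0, 2.0])     # y = 1 + 2x fits the toy data exactly
w_off = np.array([0.0, 2.0])       # every prediction is off by exactly 1
print(in_sample_error(w_exact, X, y), in_sample_error(w_off, X, y))
```

Here each row of `X` is one $x_n^T$, so `X @ w` evaluates the hypothesis on the whole data set at once.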

8 Illustration
Linear regression: find a linear function with small residuals

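9 Matrix form of $E_{in}$
$$E_{in}(w) = \frac{1}{N} \sum_{n=1}^{N} (x_n^T w - y_n)^2 = \frac{1}{N} \left\| \begin{bmatrix} x_1^T w - y_1 \\ x_2^T w - y_2 \\ \vdots \\ x_N^T w - y_N \end{bmatrix} \right\|^2 = \frac{1}{N} \left\| \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} w - \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \right\|^2 = \frac{1}{N} \| X w - y \|^2$$
where $X$ is $N \times (d+1)$ and $y$ is $N \times 1$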

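10 Minimize $E_{in}$
$$\min_w E_{in}(w) = \frac{1}{N} \| X w - y \|^2$$
$E_{in}$: continuous, differentiable, convex
Necessary condition for an optimal $w^*$:
$$\nabla E_{in}(w^*) = \begin{bmatrix} \frac{\partial E_{in}}{\partial w_0}(w^*) \\ \vdots \\ \frac{\partial E_{in}}{\partial w_d}(w^*) \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}$$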

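11 Minimizing $E_{in}$
$$E_{in}(w) = \frac{1}{N} \| X w - y \|^2 = \frac{1}{N} \left( w^T X^T X w - 2 w^T X^T y + y^T y \right)$$
$$\nabla E_{in}(w) = \frac{2}{N} \left( X^T X w - X^T y \right)$$
$$\nabla E_{in}(w^*) = 0 \;\Rightarrow\; X^T X w^* = X^T y \;\Rightarrow\; w^* = (X^T X)^{-1} X^T y = X^\dagger y$$
$X^\dagger = (X^T X)^{-1} X^T$: pseudo-inverse of $X$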
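As a rough numerical check of the derivation above, the sketch below solves the normal equations $X^T X w = X^T y$ on synthetic data and confirms that the gradient vanishes at the solution. It assumes NumPy; the data, the seed, and names such as `w_true` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # N x (d+1), x_0 = 1
w_true = np.array([0.5, -1.0, 2.0, 3.0])                   # hypothetical target
y = X @ w_true + 0.1 * rng.normal(size=N)

# Solve X^T X w = X^T y; solving the system avoids forming the inverse explicitly.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# The gradient (2/N)(X^T X w - X^T y) should vanish at the minimizer.
grad = (2.0 / N) * (X.T @ X @ w_star - X.T @ y)
print(w_star, np.linalg.norm(grad))
```

Using `np.linalg.solve` rather than inverting $X^T X$ is a common numerical choice; it computes the same $w^*$ when $X^T X$ is invertible.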

12 More on Linear Regression Solutions
Case I: $X^T X$ is invertible $\Rightarrow$ unique solution
Often when $N > d + 1$
Case II: $X^T X$ is non-invertible $\Rightarrow$ many solutions
Minimal-norm solution: $w^* = X^\dagger y$ ($X^\dagger$ is defined in another way)
Often when $d + 1 > N$
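Case II can be illustrated with a small sketch, assuming NumPy. `np.linalg.pinv` computes the Moore-Penrose pseudo-inverse through the SVD, which is one standard way to define $X^\dagger$ when $X^T X$ is singular; the slide does not say which definition it has in mind, so treat this as an assumption. The data are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 3, 5                       # fewer examples than parameters (d + 1 > N)
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.normal(size=N)

# X^T X is (d+1) x (d+1) but has rank at most N, so it is singular.
rank = np.linalg.matrix_rank(X.T @ X)

# Pseudo-inverse solution: fits the data exactly and has the smallest norm.
w_min = np.linalg.pinv(X) @ y

# Adding a null-space direction of X gives another exact fit with a larger norm.
_, _, Vt = np.linalg.svd(X)
null_dir = Vt[-1]                 # X @ null_dir is (numerically) zero
w_other = w_min + null_dir
print(rank, np.linalg.norm(w_min), np.linalg.norm(w_other))
```

Both `w_min` and `w_other` reproduce `y` exactly, which is the "many solutions" situation; the pseudo-inverse picks the one with minimal norm.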

15 Linear Regression
Linear regression algorithm:
1. Given data matrix $X$ and label vector $y$
2. Compute $w^* = X^\dagger y$
$E_{in}(w)$ is minimized
We will show that $E_{out}$ will also be small using the VC-dimension bound

16 Logistic Regression (LFD 3.3)

18 Soft binary classification
Example: heart attack problem
Soft binary classification: $f(x) = P(+1 \mid x) \in [0, 1]$

19 Soft binary classification
Same data as hard binary classification, but a different target function

20 Linear Models
Linear scoring function: $s = w^T x$

21 Logistic Hypothesis
Logistic function: $\theta(s) = \frac{e^s}{1 + e^s} = \frac{1}{1 + e^{-s}}$
It converts the score to an estimated probability
Logistic regression hypothesis: $h(x) = \theta(w^T x)$
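A minimal sketch of the logistic function and hypothesis, assuming NumPy. The overflow-safe evaluation is an implementation choice of this sketch, not something the slides prescribe; the names `theta` and `h` simply mirror the slide notation.

```python
import numpy as np

def theta(s):
    """Logistic function 1 / (1 + e^{-s}), evaluated without overflow."""
    s = np.asarray(s, dtype=float)
    # exp(-|s|) never overflows; pick the algebraically equal form for each sign.
    e = np.exp(-np.abs(s))
    return np.where(s >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def h(x, w):
    """Logistic regression hypothesis h(x) = theta(w^T x)."""
    return theta(np.dot(w, x))

scores = np.array([-3.0, 0.0, 3.0])
print(theta(scores), theta(-scores) + theta(scores))
```

The second printed array is all ones, reflecting the identity $1 - \theta(s) = \theta(-s)$.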

22 Genuine Probability
Data $(x, y)$ with binary $y$ generated by a noisy process:
$$P(y \mid x) = \begin{cases} f(x) & \text{for } y = +1 \\ 1 - f(x) & \text{for } y = -1 \end{cases}$$
The target $f: \mathbb{R}^d \to [0, 1]$ (a probability)
Goal: learn $g(x) = \theta(w^T x) \approx f(x)$
How to measure the error?

25 Error measure: likelihood
Likelihood: how likely is it to get $y$ based on $h$?
Given $h$,
$$P(y \mid x) = \begin{cases} h(x) & \text{for } y = +1 \\ 1 - h(x) & \text{for } y = -1 \end{cases}$$
If $h \approx f$, the likelihood should be large!

30 Error measure: likelihood
Likelihood of $D = (x_1, y_1), \dots, (x_N, y_N)$: $\prod_{n=1}^{N} P(y_n \mid x_n)$
Given $h$,
$$P(y \mid x) = \begin{cases} h(x) & \text{for } y = +1 \\ 1 - h(x) & \text{for } y = -1 \end{cases}$$
Substitute $h(x) = \theta(w^T x)$ and $1 - h(x) = 1 - \theta(w^T x) = \theta(-w^T x)$
$\Rightarrow P(y \mid x) = \theta(y w^T x)$
Likelihood: $\prod_{n=1}^{N} P(y_n \mid x_n) = \prod_{n=1}^{N} \theta(y_n w^T x_n)$

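31 Maximizing the likelihood
Find $w$ to maximize the likelihood!
$$\begin{aligned} & \max_w \prod_{n=1}^{N} \theta(y_n w^T x_n) \\ \Leftrightarrow\ & \max_w \log\left( \prod_{n=1}^{N} \theta(y_n w^T x_n) \right) \\ \Leftrightarrow\ & \min_w -\log\left( \prod_{n=1}^{N} \theta(y_n w^T x_n) \right) \\ \Leftrightarrow\ & \min_w -\sum_{n=1}^{N} \log \theta(y_n w^T x_n) \\ \Leftrightarrow\ & \min_w \sum_{n=1}^{N} \log \frac{1}{\theta(y_n w^T x_n)} \\ \Leftrightarrow\ & \min_w \sum_{n=1}^{N} \log\left( 1 + e^{-y_n w^T x_n} \right) \end{aligned}$$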
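The chain of equivalences above can be checked numerically: the negative log of the likelihood product should equal the final sum. The sketch assumes NumPy and labels in $\{-1, +1\}$; the toy inputs and weights are hypothetical.

```python
import numpy as np

def neg_log_likelihood(w, X, y):
    """Sum over n of log(1 + exp(-y_n w^T x_n)), with labels y_n in {-1, +1}."""
    margins = y * (X @ w)
    # logaddexp(0, -m) = log(e^0 + e^{-m}) = log(1 + e^{-m}), computed stably.
    return np.sum(np.logaddexp(0.0, -margins))

def likelihood(w, X, y):
    """Product over n of theta(y_n w^T x_n); numerically fine only for small N."""
    margins = y * (X @ w)
    return np.prod(1.0 / (1.0 + np.exp(-margins)))

X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])   # hypothetical toy inputs
y = np.array([1.0, -1.0, 1.0])
w = np.array([0.1, 0.7])
print(neg_log_likelihood(w, X, y), -np.log(likelihood(w, X, y)))
```

Working with the sum of logs rather than the raw product is also what keeps the computation stable when $N$ is large, since a product of many probabilities underflows quickly.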

32 Empirical Risk Minimization
Most linear ML algorithms follow
$$\min_w \frac{1}{N} \sum_{n=1}^{N} \text{loss}(w^T x_n, y_n)$$
Linear regression: $\text{loss}(h(x_n), y_n) = (w^T x_n - y_n)^2$
Logistic regression: $\text{loss}(h(x_n), y_n) = \log(1 + e^{-y_n w^T x_n})$

33 Gradient descent and SGD

34 Optimization
Goal: find the minimizer of a function: $\min_w f(w)$
For now we assume $f$ is twice differentiable
Machine learning algorithm: find the hypothesis that minimizes $E_{in}$

35 Convex vs Nonconvex
Convex function:
$\nabla f(x) = 0 \Leftrightarrow$ global minimum
A twice-differentiable function is convex if $\nabla^2 f(x)$ is positive semidefinite for all $x$
Examples: linear regression, logistic regression, ...
Non-convex function:
A point with $\nabla f(x) = 0$ may be a global min, a local min, or a saddle point
Most algorithms only converge to a point where the gradient is 0
Example: neural networks, ...

36 Gradient Descent
Gradient descent: repeatedly do $w^{t+1} \leftarrow w^t - \alpha \nabla f(w^t)$
$\alpha > 0$ is the step size
Step size too large $\Rightarrow$ diverge; too small $\Rightarrow$ slow convergence
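The update rule and the effect of the step size can be sketched on a simple quadratic. This assumes NumPy; the objective, the starting point, and the three step sizes are arbitrary choices made for illustration.

```python
import numpy as np

def gradient_descent(grad_f, w0, alpha, num_iters):
    """Repeat w <- w - alpha * grad_f(w) for a fixed number of iterations."""
    w = np.array(w0, dtype=float)
    for _ in range(num_iters):
        w = w - alpha * grad_f(w)
    return w

# Hypothetical objective f(w) = (w_0 - 3)^2 + 2 (w_1 + 1)^2, minimized at (3, -1).
def grad_f(w):
    return np.array([2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)])

w_good = gradient_descent(grad_f, [0.0, 0.0], alpha=0.1, num_iters=200)
w_slow = gradient_descent(grad_f, [0.0, 0.0], alpha=1e-4, num_iters=200)
w_bad = gradient_descent(grad_f, [0.0, 0.0], alpha=0.6, num_iters=200)
print(w_good, w_slow, w_bad)
```

With the moderate step the iterate reaches the minimizer, with the tiny step it has barely moved after 200 iterations, and with the large step the second coordinate oscillates with growing magnitude.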

38 Why gradient descent?
Reason I: the negative gradient is the steepest direction to decrease the objective function locally
Reason II: successive approximation view
At each iteration, form an approximation function of $f(\cdot)$:
$$f(w^t + d) \approx g(d) := f(w^t) + \nabla f(w^t)^T d + \frac{1}{2\alpha} \| d \|^2$$
Update the solution by $w^{t+1} \leftarrow w^t + d^*$, where $d^* = \arg\min_d g(d)$
$$\nabla g(d^*) = 0 \;\Rightarrow\; \nabla f(w^t) + \frac{1}{\alpha} d^* = 0 \;\Rightarrow\; d^* = -\alpha \nabla f(w^t)$$
$d^*$ will decrease $f(\cdot)$ if $\alpha$ (the step size) is sufficiently small

42 Illustration of gradient descent
Form a quadratic approximation:
$$f(w^t + d) \approx g(d) = f(w^t) + \nabla f(w^t)^T d + \frac{1}{2\alpha} \| d \|^2$$
Minimize $g(d)$:
$$\nabla g(d^*) = 0 \;\Rightarrow\; \nabla f(w^t) + \frac{1}{\alpha} d^* = 0 \;\Rightarrow\; d^* = -\alpha \nabla f(w^t)$$
Update $w$:
$$w^{t+1} = w^t + d^* = w^t - \alpha \nabla f(w^t)$$

45 Convergence
Let $L$ be a constant such that $\nabla^2 f(x) \preceq L I$ for all $x$
Theorem: gradient descent converges if $\alpha < \frac{1}{L}$
In practice, we do not know $L$ $\Rightarrow$ need to tune the step size when running gradient descent
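Linear regression is one case where $L$ can be computed, because the Hessian of $E_{in}(w) = \frac{1}{N}\|Xw - y\|^2$ is the constant matrix $\frac{2}{N} X^T X$. The sketch below, assuming NumPy and random illustrative data, takes $L$ as its largest eigenvalue, runs gradient descent with $\alpha = 0.9 / L$, and compares the result with the closed-form solution.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 40
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
y = rng.normal(size=N)

# Hessian of E_in is (2/N) X^T X; its largest eigenvalue is the smallest valid L.
hessian = (2.0 / N) * X.T @ X
L = np.linalg.eigvalsh(hessian).max()

def e_in(w):
    return np.mean((X @ w - y) ** 2)

alpha = 0.9 / L                    # satisfies alpha < 1/L
w = np.zeros(X.shape[1])
errors = [e_in(w)]
for _ in range(500):
    w = w - alpha * (2.0 / N) * (X.T @ (X @ w - y))
    errors.append(e_in(w))

w_closed = np.linalg.pinv(X) @ y   # closed-form minimizer for comparison
print(L, errors[0], errors[-1], np.linalg.norm(w - w_closed))
```

For most models the Hessian depends on $w$ and no such bound is available up front, which is why the step size is tuned in practice.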

47 Applying to Logistic regression
Gradient descent for logistic regression:
Initialize the weights $w^0$
For $t = 1, 2, \dots$
Compute the gradient
$$\nabla E_{in} = -\frac{1}{N} \sum_{n=1}^{N} \frac{y_n x_n}{1 + e^{y_n w^T x_n}}$$
Update the weights: $w \leftarrow w - \eta \nabla E_{in}$
Return the final weights $w$
When to stop? After a fixed number of iterations, or when $\| \nabla E_{in} \| < \epsilon$
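The procedure above can be sketched as follows, assuming NumPy and labels in $\{-1, +1\}$. The default values for `eta`, `max_iters`, and `eps`, the synthetic data, and the function names are assumptions of this sketch rather than values from the lecture.

```python
import numpy as np

def e_in(w, X, y):
    """In-sample logistic loss (1/N) sum log(1 + exp(-y_n w^T x_n))."""
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def grad_e_in(w, X, y):
    """Gradient -(1/N) sum y_n x_n / (1 + exp(y_n w^T x_n))."""
    margins = y * (X @ w)
    # 1 / (1 + e^m) = exp(-log(1 + e^m)), which avoids overflow for large m.
    coeff = np.exp(-np.logaddexp(0.0, margins))
    return -(X.T @ (y * coeff)) / len(y)

def logistic_regression_gd(X, y, eta=0.5, max_iters=2000, eps=1e-3):
    w = np.zeros(X.shape[1])                 # initialize the weights
    for _ in range(max_iters):               # fixed iteration budget
        g = grad_e_in(w, X, y)
        if np.linalg.norm(g) < eps:          # stop once the gradient is small
            break
        w = w - eta * g                      # update the weights
    return w

# Hypothetical synthetic data: labels in {-1, +1} drawn from a logistic model.
rng = np.random.default_rng(3)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
w_true = np.array([-0.5, 2.0, -1.0])
p = 1.0 / (1.0 + np.exp(-(X @ w_true)))
y = np.where(rng.random(200) < p, 1.0, -1.0)
w_hat = logistic_regression_gd(X, y)
print(w_hat, e_in(np.zeros(3), X, y), e_in(w_hat, X, y))
```

The sketch combines both stopping rules: it halts early when the gradient norm drops below `eps` and otherwise stops after `max_iters` iterations.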

48 Conclusions
Linear regression:
The target is a real number
Square loss
$E_{in}$ can be minimized with a closed-form solution
Logistic regression:
A classification model
Based on a probability assumption
Minimized with gradient descent
Next class: SGD
Questions?
