ECS171: Machine Learning Lecture 3: Linear Models I (LFD 3.2, 3.3) Cho-Jui Hsieh UC Davis Jan 17, 2018
Linear Regression (LFD 3.2)
Regression
Classification: customer record → Yes/No
Regression (predicting a credit limit): customer record → dollar amount
Linear regression: h(x) = Σ_{i=0}^{d} w_i x_i = w^T x
The data set
Dataset (historical decisions by credit officers): (x_1, y_1), (x_2, y_2), …, (x_N, y_N)
y_n ∈ R: credit limit for customer x_n
Linear regression: find a function h(x) = w^T x to approximate y (= f(x))
Use the squared error (h(x) − f(x))^2; in-sample error: E_in(h) = (1/N) Σ_{n=1}^{N} (h(x_n) − y_n)^2
Illustration
Linear regression: find a linear function with small residuals
Matrix form of E_in
E_in(w) = (1/N) Σ_{n=1}^{N} (x_n^T w − y_n)^2
        = (1/N) ‖ [x_1^T w − y_1, x_2^T w − y_2, …, x_N^T w − y_N]^T ‖^2
        = (1/N) ‖Xw − y‖^2
where X is the N × (d+1) matrix with rows x_n^T and y is the N × 1 vector of labels
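A quick NumPy check of this identity on random synthetic data (the data, sizes, and variable names here are purely illustrative):

```python
import numpy as np

# Check that the sum-of-squares form of E_in equals the matrix form (1/N)*||Xw - y||^2.
rng = np.random.default_rng(0)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])  # N x (d+1), first column is the bias x_0 = 1
y = rng.standard_normal(N)
w = rng.standard_normal(d + 1)

E_sum = np.mean((X @ w - y) ** 2)             # (1/N) * sum_n (x_n^T w - y_n)^2
E_mat = np.linalg.norm(X @ w - y) ** 2 / N    # (1/N) * ||Xw - y||^2
assert np.isclose(E_sum, E_mat)
```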
Minimize E_in
min_w E_in(w) = (1/N) ‖Xw − y‖^2
E_in: continuous, differentiable, convex
Necessary condition for the optimal w*: ∇E_in(w*) = [∂E_in/∂w_0 (w*), …, ∂E_in/∂w_d (w*)]^T = 0
Minimizing E_in
E_in(w) = (1/N) ‖Xw − y‖^2 = (1/N) (w^T X^T X w − 2 w^T X^T y + y^T y)
∇E_in(w) = (2/N) (X^T X w − X^T y)
∇E_in(w*) = 0 ⇒ X^T X w* = X^T y ⇒ w* = (X^T X)^{-1} X^T y = X† y
X† = (X^T X)^{-1} X^T: the pseudo-inverse of X
More on linear regression solutions
Case I: X^T X is invertible — unique solution; usually when N > d + 1
Case II: X^T X is not invertible — many solutions; take the minimal-norm solution w* = X† y (with X† defined differently, e.g., via the SVD); usually when d + 1 > N
Linear regression
Linear regression algorithm:
1. Given the data matrix X and labels y
2. Compute w = X† y
E_in(w) is minimized
We will show that E_out is also small, using a VC-dimension bound
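A minimal NumPy sketch of this two-step algorithm, assuming the first column of X is the constant feature x_0 = 1 (the synthetic data and names are illustrative only); np.linalg.pinv returns the unique least-squares solution in Case I and the minimal-norm solution in Case II:

```python
import numpy as np

def linear_regression(X, y):
    """One-shot linear regression: w = X_dagger @ y, with X_dagger the pseudo-inverse of X."""
    return np.linalg.pinv(X) @ y

# Illustrative usage on synthetic data: recover known weights.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((50, 1)), rng.standard_normal((50, 2))])
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(50)
w = linear_regression(X, y)   # close to w_true
```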
Logistic Regression (LFD 3.3)
Soft binary classification
Example: heart attack problem
Soft binary classification: f(x) = P(+1 | x) ∈ [0, 1]
Soft binary classification Same data as hard binary classification, but different target function
Linear models
Linear scoring function: s = w^T x
Logistic hypothesis
Logistic function: θ(s) = e^s / (1 + e^s) = 1 / (1 + e^{−s}); it converts the score to an estimated probability
Logistic regression hypothesis: h(x) = θ(w^T x)
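A small sketch of this hypothesis in NumPy (the function names theta and h are just illustrative, not a library API):

```python
import numpy as np

def theta(s):
    """Logistic function theta(s) = 1 / (1 + e^{-s}); maps any score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s))

def h(x, w):
    """Logistic regression hypothesis h(x) = theta(w^T x)."""
    return theta(np.dot(w, x))
```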
Genuine probability
Data (x, y) with binary y generated by a noisy process:
P(y | x) = f(x) for y = +1, and 1 − f(x) for y = −1
The target f: R^d → [0, 1] is a probability
Goal: learn g(x) = θ(w^T x) ≈ f(x)
How to measure the error?
Error measure: likelihood
Likelihood: how likely is it to get y based on h?
Given h, P(y | x) = h(x) for y = +1, and 1 − h(x) for y = −1
If h ≈ f, the likelihood should be large!
Error measure: likelihood
Likelihood of D = (x_1, y_1), …, (x_N, y_N): Π_{n=1}^{N} P(y_n | x_n)
Given h, P(y | x) = h(x) for y = +1, and 1 − h(x) for y = −1
Substitute h(x) = θ(w^T x) and 1 − h(x) = 1 − θ(w^T x) = θ(−w^T x), so P(y | x) = θ(y w^T x)
Likelihood: Π_{n=1}^{N} P(y_n | x_n) = Π_{n=1}^{N} θ(y_n w^T x_n)
Maximizing the likelihood
Find w to maximize the likelihood!
max_w Π_{n=1}^{N} θ(y_n w^T x_n)
⇔ max_w log(Π_{n=1}^{N} θ(y_n w^T x_n))
⇔ min_w −Σ_{n=1}^{N} log θ(y_n w^T x_n)
⇔ min_w Σ_{n=1}^{N} log(1 / θ(y_n w^T x_n))
⇔ min_w Σ_{n=1}^{N} log(1 + e^{−y_n w^T x_n})
Empirical risk minimization
Most linear ML algorithms follow
min_w (1/N) Σ_{n=1}^{N} loss(w^T x_n, y_n)
Linear regression: loss(w^T x_n, y_n) = (w^T x_n − y_n)^2
Logistic regression: loss(w^T x_n, y_n) = log(1 + e^{−y_n w^T x_n})
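The two point-wise losses and the shared empirical-risk form, sketched in NumPy (function names are illustrative; np.log1p is used only for numerical convenience):

```python
import numpy as np

def squared_loss(w, x, y):
    """Linear regression point-wise loss: (w^T x - y)^2, with real-valued y."""
    return (np.dot(w, x) - y) ** 2

def logistic_loss(w, x, y):
    """Logistic regression point-wise loss: log(1 + e^{-y w^T x}), with y in {-1, +1}."""
    return np.log1p(np.exp(-y * np.dot(w, x)))

def empirical_risk(loss, w, X, Y):
    """E_in(w) = (1/N) * sum_n loss(w^T x_n, y_n)."""
    return np.mean([loss(w, x, y) for x, y in zip(X, Y)])
```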
Gradient descent and SGD
Optimization
Goal: find the minimizer of a function, min_w f(w)
For now we assume f is twice differentiable
Machine learning algorithm: find the hypothesis that minimizes E_in
Convex vs nonconvex
Convex function: ∇f(x) = 0 ⇒ global minimum
A twice-differentiable function is convex if ∇²f(x) is positive semidefinite everywhere
Examples: linear regression, logistic regression, …
Non-convex function: ∇f(x) = 0 ⇒ global minimum, local minimum, or saddle point
Most algorithms only converge to a point with gradient = 0
Examples: neural networks, …
Gradient descent
Gradient descent: repeatedly do w_{t+1} ← w_t − α ∇f(w_t)
α > 0 is the step size
Step size too large ⇒ divergence; too small ⇒ slow convergence
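A bare-bones sketch of this update rule (the function and argument names are illustrative, not a library API):

```python
import numpy as np

def gradient_descent(grad_f, w0, alpha, num_iters=1000):
    """Plain gradient descent: w_{t+1} = w_t - alpha * grad_f(w_t).

    alpha is the step size: too large may diverge, too small converges slowly.
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        w = w - alpha * grad_f(w)
    return w

# Illustrative usage: minimize f(w) = ||w - 3||^2, whose gradient is 2*(w - 3).
w_min = gradient_descent(lambda w: 2 * (w - 3.0), w0=np.zeros(2), alpha=0.1)
# w_min is close to [3, 3]
```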
Why gradient descent?
Reason I: the negative gradient is the locally steepest direction for decreasing the objective function
Reason II: successive-approximation view
At each iteration, form an approximation of f(·) around w_t:
f(w_t + d) ≈ g(d) := f(w_t) + ∇f(w_t)^T d + (1/(2α)) ‖d‖^2
Update the solution by w_{t+1} ← w_t + d*, where d* = argmin_d g(d)
∇g(d*) = 0 ⇒ ∇f(w_t) + (1/α) d* = 0 ⇒ d* = −α ∇f(w_t)
d* will decrease f(·) if α (the step size) is sufficiently small
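A small numerical sanity check of this view on a toy quadratic (everything here is illustrative): the closed-form step d* = −α∇f(w_t) should give a local-model value g(d*) no larger than any other candidate step.

```python
import numpy as np

alpha = 0.1
target = np.array([1.0, -2.0])
f = lambda w: float(np.sum((w - target) ** 2))
grad_f = lambda w: 2.0 * (w - target)

w_t = np.array([0.0, 0.0])
# Local model g(d) = f(w_t) + grad_f(w_t)^T d + (1/(2*alpha)) * ||d||^2
g = lambda d: f(w_t) + grad_f(w_t) @ d + (d @ d) / (2.0 * alpha)

d_star = -alpha * grad_f(w_t)          # closed-form minimizer of g
for _ in range(5):
    d_other = d_star + 0.1 * np.random.randn(2)
    assert g(d_star) <= g(d_other)     # any perturbed step gives a larger model value
```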
Illustration of gradient descent
Form a quadratic approximation: f(w_t + d) ≈ g(d) = f(w_t) + ∇f(w_t)^T d + (1/(2α)) ‖d‖^2
Minimize g(d): ∇g(d*) = 0 ⇒ ∇f(w_t) + (1/α) d* = 0 ⇒ d* = −α ∇f(w_t)
Update w: w_{t+1} = w_t + d* = w_t − α ∇f(w_t)
Convergence
Let L be a constant such that ∇²f(x) ⪯ LI for all x
Theorem: gradient descent converges if α < 1/L
In practice we do not know L, so the step size needs to be tuned when running gradient descent
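Since L is typically unknown, one crude tuning heuristic (an assumption on my part, not from the slides) is to try a few candidate step sizes and keep the largest one that actually decreases the objective after a single gradient step:

```python
import numpy as np

def pick_step_size(f, grad_f, w, candidates=(1.0, 0.1, 0.01, 0.001)):
    """Try candidate step sizes from large to small; return the first one
    for which a single gradient step decreases f."""
    g = grad_f(w)
    for alpha in candidates:
        if f(w - alpha * g) < f(w):
            return alpha
    return candidates[-1]
```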
Applying to logistic regression
Gradient descent for logistic regression:
Initialize the weights w_0
For t = 1, 2, …
  Compute the gradient ∇E_in = −(1/N) Σ_{n=1}^{N} y_n x_n / (1 + e^{y_n w^T x_n})
  Update the weights: w ← w − η ∇E_in
Return the final weights w
When to stop? A fixed number of iterations, or stop when ‖∇E_in‖ < ε
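Putting the pieces together, a sketch of the algorithm above in NumPy (vectorized over the data set; labels assumed to be ±1; stopping on either the iteration budget or a small gradient norm):

```python
import numpy as np

def logistic_regression_gd(X, y, eta=0.1, max_iters=1000, eps=1e-6):
    """Gradient descent for logistic regression.

    X: N x (d+1) data matrix (first column all ones), y: labels in {-1, +1}.
    """
    N, D = X.shape
    w = np.zeros(D)                                      # initialize the weights
    for _ in range(max_iters):
        # grad E_in(w) = -(1/N) * sum_n y_n x_n / (1 + e^{y_n w^T x_n})
        s = y * (X @ w)
        grad = -(X * (y / (1.0 + np.exp(s)))[:, None]).mean(axis=0)
        if np.linalg.norm(grad) < eps:                   # stop when the gradient is small
            break
        w = w - eta * grad                               # update the weights
    return w
```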
Conclusions
Linear regression: target is a real number; squared loss; E_in can be minimized with a closed-form solution
Logistic regression: a classification model; based on a probabilistic assumption; trained with gradient descent
Next class: SGD
Questions?