Machine Learning: Linear Models
Fabio Vandin
October 10, 2017
Linear Predictors and Affine Functions

Consider $X = \mathbb{R}^d$.

Affine functions: $L_d = \{h_{w,b} : w \in \mathbb{R}^d, b \in \mathbb{R}\}$, where
$$h_{w,b}(x) = \langle w, x \rangle + b = \left( \sum_{i=1}^{d} w_i x_i \right) + b$$

Note: each member of $L_d$ is a function $x \mapsto \langle w, x \rangle + b$; $b$ is called the bias.
Linear Models

Hypothesis class $H$: $\phi \circ L_d$, where $\phi : \mathbb{R} \to Y$.

Each $h \in H$ is a function $h : \mathbb{R}^d \to Y$; $\phi$ depends on the learning problem.

Examples:
- binary classification, $Y = \{-1, 1\}$: $\phi(z) = \mathrm{sign}(z)$
- regression, $Y = \mathbb{R}$: $\phi(z) = z$
Equivalent Notation

Given $x \in X$, $w \in \mathbb{R}^d$, $b \in \mathbb{R}$, define:
$$w' = (b, w_1, w_2, \ldots, w_d) \in \mathbb{R}^{d+1}, \qquad x' = (1, x_1, x_2, \ldots, x_d) \in \mathbb{R}^{d+1}$$
Then:
$$h_{w,b}(x) = \langle w, x \rangle + b = \langle w', x' \rangle \qquad (1)$$
From now on we consider the bias term as part of $w$ and assume $x = (1, x_1, x_2, \ldots, x_d)$ when needed, with $h_w(x) = \langle w, x \rangle$.
Linear Classification

$X = \mathbb{R}^d$, $Y = \{-1, 1\}$, 0-1 loss.

Hypothesis class = halfspaces:
$$HS_d = \mathrm{sign} \circ L_d = \{x \mapsto \mathrm{sign}(h_{w,b}(x)) : h_{w,b} \in L_d\} \qquad (2)$$
Example: $X = \mathbb{R}^2$, where each hypothesis labels the points on one side of a line with $+1$ and those on the other side with $-1$.
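A minimal sketch (not part of the original slides) of how such a halfspace predictor can be evaluated; the weight vector, bias and points below are arbitrary illustrative values:

```python
import numpy as np

# Halfspace predictor h(x) = sign(<w, x> + b) on a few points of R^2.
w = np.array([2.0, -1.0])   # example weight vector (arbitrary)
b = 0.5                     # example bias (arbitrary)

X = np.array([[1.0, 1.0],
              [-1.0, 2.0],
              [0.0, 0.0]])

scores = X @ w + b          # <w, x_i> + b for each row x_i
labels = np.sign(scores)    # note: np.sign(0.0) == 0, a point exactly on the boundary
print(labels)
```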
Finding a Good Hypothesis

Linear classification with hypothesis set $H$ = halfspaces. How do we find a good hypothesis? Good = ERM rule.

Perceptron Algorithm (Rosenblatt, 1958)

Note: if $y_i \langle w, x_i \rangle > 0$ for all $i = 1, \ldots, m$, then all points are classified correctly by the model $w$.
Perceptron

Input: training set $(x_1, y_1), \ldots, (x_m, y_m)$
  initialize $w^{(1)} = (0, \ldots, 0)$;
  for $t = 1, 2, \ldots$ do
    if $\exists i$ s.t. $y_i \langle w^{(t)}, x_i \rangle \le 0$ then
      $w^{(t+1)} \leftarrow w^{(t)} + y_i x_i$;
    else
      return $w^{(t)}$;

Interpretation of the update. Note that:
$$y_i \langle w^{(t+1)}, x_i \rangle = y_i \langle w^{(t)} + y_i x_i, x_i \rangle = y_i \langle w^{(t)}, x_i \rangle + \|x_i\|^2,$$
so the update guides $w$ to be "more correct" on $(x_i, y_i)$.

Termination? Depends on the realizability assumption!
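A possible Python rendering of the pseudocode above (the iteration cap is an extra safeguard, not part of the algorithm as stated):

```python
import numpy as np

def perceptron(X, y, max_iters=10_000):
    """Batch Perceptron. X: (m, d) array of examples (with a constant-1
    column for the bias), y: (m,) array of labels in {-1, +1}.
    Returns w such that y_i * <w, x_i> > 0 for all i, if found."""
    m, d = X.shape
    w = np.zeros(d)                       # w^(1) = (0, ..., 0)
    for _ in range(max_iters):
        margins = y * (X @ w)             # y_i * <w^(t), x_i> for every i
        mistakes = np.where(margins <= 0)[0]
        if mistakes.size == 0:            # all points classified correctly
            return w
        i = mistakes[0]                   # any misclassified point works
        w = w + y[i] * X[i]               # w^(t+1) = w^(t) + y_i * x_i
    return w                              # reached the safeguard cap
```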
Perceptron with Linearly Separable Data

Realizability assumption for halfspaces = linearly separable data.

Proposition. Assume that $(x_1, y_1), \ldots, (x_m, y_m)$ is separable, and let
$$B = \min\{\|w\| : \forall i = 1, \ldots, m,\ y_i \langle w, x_i \rangle \ge 1\}, \qquad R = \max_i \|x_i\|.$$
Then the Perceptron algorithm stops after at most $(RB)^2$ iterations, and when it stops it holds that $\forall i \in \{1, \ldots, m\}: y_i \langle w^{(t)}, x_i \rangle > 0$.
Perceptron: Notes
- simple to implement
- for separable data, convergence is guaranteed
- the convergence time depends on $B$, which can be exponential in $d$... an ILP approach may be better for finding an ERM solution in such cases
- there are potentially multiple solutions; which one is picked depends on the starting values
- non-separable data? run for some time and keep the best solution found up to that point (pocket algorithm; see the sketch below)
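A minimal sketch of the pocket idea (the random choice of the misclassified point and the iteration budget are arbitrary choices, not prescribed by the slides):

```python
import numpy as np

def pocket_perceptron(X, y, n_iters=1_000, seed=0):
    """Run Perceptron updates for a fixed budget and keep "in the pocket"
    the weight vector with the fewest training mistakes seen so far.
    Useful when the data is not linearly separable."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    best_w, best_errors = w.copy(), np.inf
    for _ in range(n_iters):
        mistakes = np.where(y * (X @ w) <= 0)[0]
        if mistakes.size < best_errors:        # better than the pocketed w?
            best_w, best_errors = w.copy(), mistakes.size
        if mistakes.size == 0:                 # separable: done early
            break
        i = rng.choice(mistakes)               # pick a misclassified point
        w = w + y[i] * X[i]
    return best_w
```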
Linear Regression

$X = \mathbb{R}^d$, $Y = \mathbb{R}$.

Hypothesis class:
$$H_{reg} = L_d = \{x \mapsto \langle w, x \rangle + b : w \in \mathbb{R}^d, b \in \mathbb{R}\} \qquad (3)$$
Note: each $h \in H_{reg}$ is a function $h : \mathbb{R}^d \to \mathbb{R}$.

Commonly used loss function: squared loss
$$\ell(h, (x, y)) \stackrel{\mathrm{def}}{=} (h(x) - y)^2$$
Empirical risk function (training error): Mean Squared Error
$$L_S(h) = \frac{1}{m} \sum_{i=1}^{m} (h(x_i) - y_i)^2$$
Least Squares

How do we find a good hypothesis? Least Squares algorithm.

Best hypothesis:
$$\arg\min_w L_S(h_w) = \arg\min_w \frac{1}{m} \sum_{i=1}^{m} (h_w(x_i) - y_i)^2$$
Equivalent formulation: the $w$ minimizing the Residual Sum of Squares (RSS), i.e.
$$\arg\min_w \sum_{i=1}^{m} (h_w(x_i) - y_i)^2$$
RSS: Matrix Form

Let
$$X = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}$$
where $X$ is the design matrix (one row per example). Then the RSS is
$$\sum_{i=1}^{m} (h_w(x_i) - y_i)^2 = (y - Xw)^T (y - Xw)$$
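A small numerical sketch (arbitrary random data) checking that the matrix form of the RSS matches the explicit sum of squared residuals:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 2
# Design matrix: one row per example, leading 1 for the bias term.
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, d))])
y = rng.normal(size=m)
w = rng.normal(size=d + 1)

residuals = y - X @ w
rss_matrix = residuals @ residuals                        # (y - Xw)^T (y - Xw)
rss_sum = sum((X[i] @ w - y[i]) ** 2 for i in range(m))   # sum_i (h_w(x_i) - y_i)^2
assert np.isclose(rss_matrix, rss_sum)
```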
We want the $w$ that minimizes the RSS (= objective function):
$$\hat{w} = \arg\min_w (y - Xw)^T (y - Xw)$$
How? Compute the gradient $\frac{\partial RSS(w)}{\partial w}$ of the objective function w.r.t. $w$ and compare it to 0:
$$\frac{\partial RSS(w)}{\partial w} = -2 X^T (y - Xw)$$
Then we need to find $w$ such that $-2 X^T (y - Xw) = 0$.
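A quick sanity check of the gradient formula against finite differences (random data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))
y = rng.normal(size=6)
w = rng.normal(size=3)

def rss(w):
    r = y - X @ w
    return r @ r

analytic = -2 * X.T @ (y - X @ w)
eps = 1e-6
numeric = np.array([(rss(w + eps * e) - rss(w - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
assert np.allclose(analytic, numeric, atol=1e-4)
```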
$-2 X^T (y - Xw) = 0$ is equivalent to
$$X^T X w = X^T y$$
If $X^T X$ is invertible, the solution to the ERM problem is:
$$\hat{w} = (X^T X)^{-1} X^T y$$
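A minimal sketch of least squares via the normal equations on synthetic data (solving the linear system rather than forming the inverse explicitly, which is numerically preferable but equivalent in exact arithmetic):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 100, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, d))])  # bias column included
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=m)                  # noisy targets

w_hat = np.linalg.solve(X.T @ X, X.T @ y)                  # solves X^T X w = X^T y
print(w_hat)                                               # close to true_w
```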
Complexity Considerations

We need to compute $(X^T X)^{-1} X^T y$.

Algorithm:
1. compute $X^T X$: product of a $(d+1) \times m$ matrix and an $m \times (d+1)$ matrix
2. compute $(X^T X)^{-1}$: inversion of a $(d+1) \times (d+1)$ matrix
3. compute $(X^T X)^{-1} X^T$: product of a $(d+1) \times (d+1)$ matrix and a $(d+1) \times m$ matrix
4. compute $(X^T X)^{-1} X^T y$: product of a $(d+1) \times m$ matrix and an $m \times 1$ vector

Most expensive operation? Inversion! But it is done on a $(d+1) \times (d+1)$ matrix.
$X^T X$ Not Invertible?

How do we get $w$ such that $X^T X w = X^T y$ if $X^T X$ is not invertible?

Let $A = X^T X$ and let $A^+$ be the generalized inverse of $A$, i.e.:
$$A A^+ A = A$$

Proposition. If $A = X^T X$ is not invertible, then $\hat{w} = A^+ X^T y$ is a solution to $X^T X w = X^T y$.

Proof: [UML] 9.2.1
Generalized Inverse of A

Note: $A = X^T X$ is symmetric, so it admits an eigenvalue decomposition $A = V D V^T$ with
- $D$: diagonal matrix (entries = eigenvalues of $A$)
- $V$: orthonormal matrix ($V^T V = I_{d \times d}$)

Define $D^+$ as the diagonal matrix such that:
$$D^+_{i,i} = \begin{cases} 0 & \text{if } D_{i,i} = 0 \\ \frac{1}{D_{i,i}} & \text{otherwise} \end{cases}$$
Let $A^+ = V D^+ V^T$. Then
$$A A^+ A = V D V^T V D^+ V^T V D V^T = V D D^+ D V^T = V D V^T = A,$$
so $A^+$ is the generalized inverse of $A$.
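A sketch of the construction above on a deliberately singular $X^T X$ (duplicated column); numpy's `np.linalg.pinv` computes an equivalent pseudoinverse directly, and the tolerance used to treat an eigenvalue as zero is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 4))
X = np.hstack([X, X[:, [0]]])            # duplicate a column: X^T X is singular
y = rng.normal(size=10)
A = X.T @ X

eigvals, V = np.linalg.eigh(A)           # A symmetric: A = V D V^T
d_plus = np.zeros_like(eigvals)
nonzero = np.abs(eigvals) > 1e-10
d_plus[nonzero] = 1.0 / eigvals[nonzero] # D^+ as defined above
A_plus = V @ np.diag(d_plus) @ V.T       # A^+ = V D^+ V^T

w_hat = A_plus @ X.T @ y
assert np.allclose(A @ w_hat, X.T @ y)   # w_hat solves the normal equations
```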
Logistic Regression

Learn a function $h$ from $\mathbb{R}^d$ to $[0, 1]$. What can this be used for? Classification!

Example: binary classification ($Y = \{-1, 1\}$): $h(x)$ = probability that the label of $x$ is 1.

For simplicity of presentation, we consider binary classification with $Y = \{-1, 1\}$, but similar considerations apply to multiclass classification.
Logistic Regression: Model

Hypothesis class $H$: $\phi_{sig} \circ L_d$, where $\phi_{sig} : \mathbb{R} \to [0, 1]$ is a sigmoid function.

Sigmoid function = S-shaped function. For logistic regression, the sigmoid $\phi_{sig}$ used is the logistic function:
$$\phi_{sig}(z) = \frac{1}{1 + e^{-z}}$$
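A tiny illustration of the logistic function's behaviour (the printed values are approximate):

```python
import numpy as np

def phi_sig(z):
    """Logistic function: maps R to (0, 1), with phi_sig(0) = 0.5."""
    return 1.0 / (1.0 + np.exp(-z))

print(phi_sig(np.array([-5.0, 0.0, 5.0])))   # approx [0.0067, 0.5, 0.9933]
```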
Therefore
$$H_{sig} = \phi_{sig} \circ L_d = \{x \mapsto \phi_{sig}(\langle w, x \rangle) : w \in \mathbb{R}^d\}$$
Main difference with binary classification with halfspaces: when $\langle w, x \rangle \approx 0$,
- the halfspace prediction is deterministically $1$ or $-1$;
- $\phi_{sig}(\langle w, x \rangle) \approx 1/2$, i.e. there is uncertainty in the predicted label.
Loss Function

We need to define how bad it is to predict $h_w(x) \in [0, 1]$ given that the true label is $y = \pm 1$.

Desiderata:
- $h_w(x)$ large if $y = 1$
- $1 - h_w(x)$ large if $y = -1$

Note that
$$1 - h_w(x) = 1 - \frac{1}{1 + e^{-\langle w, x \rangle}} = \frac{e^{-\langle w, x \rangle}}{1 + e^{-\langle w, x \rangle}} = \frac{1}{1 + e^{\langle w, x \rangle}}$$
Then a reasonable loss function:
- increases monotonically with $\frac{1}{1 + e^{y \langle w, x \rangle}}$, or equivalently
- increases monotonically with $1 + e^{-y \langle w, x \rangle}$.

Loss function for logistic regression:
$$\ell(h_w, (x, y)) = \log\left(1 + e^{-y \langle w, x \rangle}\right)$$
Therefore, given a training set $S = ((x_1, y_1), \ldots, (x_m, y_m))$, the ERM problem for logistic regression is:
$$\arg\min_{w \in \mathbb{R}^d} \frac{1}{m} \sum_{i=1}^{m} \log\left(1 + e^{-y_i \langle w, x_i \rangle}\right)$$
Notes:
- the logistic loss function is a convex function
- the ERM problem can be solved efficiently (a gradient-descent sketch follows)
- the definition may look a bit arbitrary: actually, the ERM formulation is the same as the one arising from Maximum Likelihood Estimation
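A minimal sketch of solving this ERM problem with plain gradient descent (the slides do not prescribe a specific solver; the learning rate and iteration count below are arbitrary choices):

```python
import numpy as np

def logistic_erm(X, y, lr=0.1, n_iters=1_000):
    """Minimize (1/m) * sum_i log(1 + exp(-y_i <w, x_i>)) by gradient descent.
    X: (m, d) array (bias column included), y: (m,) labels in {-1, +1}."""
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        margins = y * (X @ w)
        # Gradient of the empirical risk:
        #   -(1/m) * sum_i y_i * x_i / (1 + exp(y_i <w, x_i>))
        grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / m
        w -= lr * grad
    return w
```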
Maximum Likelihood Estimation (MLE) [UML, 24.1]

MLE is a statistical approach for finding the parameters that maximize the joint probability of a given dataset, assuming a specific parametric probability function. Note: MLE essentially assumes a generative model for the data.

General approach:
1. given a training set $S = ((x_1, y_1), \ldots, (x_m, y_m))$, assume each $(x_i, y_i)$ is drawn i.i.d. from some probability distribution with parameters $\theta$
2. consider $P[S \mid \theta]$ (the likelihood of the data given the parameters)
3. log likelihood: $L(S; \theta) = \log(P[S \mid \theta])$
4. maximum likelihood estimator: $\hat{\theta} = \arg\max_\theta L(S; \theta)$
Logistic Regression and MLE

Assuming $x_1, \ldots, x_m$ are fixed, the probability that $x_i$ has label $y_i = 1$ is
$$h_w(x_i) = \frac{1}{1 + e^{-\langle w, x_i \rangle}}$$
while the probability that $x_i$ has label $y_i = -1$ is
$$1 - h_w(x_i) = \frac{1}{1 + e^{\langle w, x_i \rangle}}$$
Then the likelihood of the training set $S$ is:
$$\prod_{i=1}^{m} \frac{1}{1 + e^{-y_i \langle w, x_i \rangle}}$$
Therefore the log likelihood is:
$$-\sum_{i=1}^{m} \log\left(1 + e^{-y_i \langle w, x_i \rangle}\right)$$
and note that the maximum likelihood estimator for $w$ is:
$$\arg\max_w \left( -\sum_{i=1}^{m} \log\left(1 + e^{-y_i \langle w, x_i \rangle}\right) \right) = \arg\min_w \sum_{i=1}^{m} \log\left(1 + e^{-y_i \langle w, x_i \rangle}\right)$$
The MLE solution is equivalent to the ERM solution!
Bibliography

[UML] S. Shalev-Shwartz, S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Chapter 9 (excluding Section 9.1.1).