Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013
Naïve Bayes classifier
Still use the Bayes decision rule for classification: $P(y|x) = \frac{P(x|y)P(y)}{P(x)}$.
But assume $p(x|y=1)$ is fully factorized: $p(x|y=1) = \prod_{i=1}^{d} p(x_i|y=1)$.
In other words, the variables corresponding to each dimension of the data are independent given the label.
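For concreteness, a minimal sketch of the factorized class-conditional likelihood (Python; the Gaussian form for each $p(x_i|y=1)$, the function name, and the toy numbers are illustrative assumptions, not part of the slides):

    import numpy as np
    from scipy.stats import norm

    def naive_bayes_log_likelihood(x, means, stds):
        # log p(x | y=1) = sum_i log p(x_i | y=1), i.e. the fully factorized model,
        # here with one Gaussian per dimension (an illustrative assumption).
        return np.sum(norm.logpdf(x, loc=means, scale=stds))

    # Toy numbers, d = 3 dimensions.
    x = np.array([0.5, -1.2, 2.0])
    means = np.array([0.0, -1.0, 1.5])
    stds = np.array([1.0, 0.5, 2.0])
    print(naive_bayes_log_likelihood(x, means, stds))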
Discriminative classifier
Directly estimate the decision boundary $h(x) = \ln \frac{q_i(x)}{q_j(x)}$ or the posterior distribution $p(y|x)$.
Examples: logistic regression, support vector machines, neural networks.
Do not estimate $p(x|y)$ and $p(y)$.
$h(x)$ or $f(x) := p(y=1|x)$ is a function of $x$; there is no probabilistic generative model for $x$, hence it cannot be used to sample data points.
Why a discriminative classifier? Avoid the difficult density estimation problem; empirically achieve better classification results.
What is the logistic regression model?
Assume that the posterior distribution $p(y=1|x)$ takes a particular form: $p(y=1|x,\theta) = \frac{1}{1 + \exp(-\theta^\top x)}$.
Logistic function: $f(u) = \frac{1}{1 + \exp(-u)}$.
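A minimal numerical sketch of this model (Python; the function and variable names and the toy numbers are illustrative):

    import numpy as np

    def logistic(u):
        # f(u) = 1 / (1 + exp(-u))
        return 1.0 / (1.0 + np.exp(-u))

    def p_y1_given_x(x, theta):
        # p(y = 1 | x, theta) = 1 / (1 + exp(-theta^T x))
        return logistic(theta @ x)

    theta = np.array([1.0, -2.0, 0.5])
    x = np.array([0.3, 0.1, 1.0])
    print(p_y1_given_x(x, theta))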
Learning parameters in logistic regression
Find $\theta$ such that the conditional likelihood of the labels is maximized: $\max_\theta \; l(\theta) := \sum_{i=1}^{m} \log P(y^i | x^i, \theta)$.
Good news: $l(\theta)$ is a concave function of $\theta$, and there is a single global optimum.
Bad news: no closed-form solution (resort to numerical methods).
The objective function $l(\theta)$
Logistic regression model: $p(y=1|x,\theta) = \frac{1}{1 + \exp(-\theta^\top x)}$.
Note that $p(y=0|x,\theta) = 1 - \frac{1}{1 + \exp(-\theta^\top x)} = \frac{\exp(-\theta^\top x)}{1 + \exp(-\theta^\top x)}$.
Plug in: $l(\theta) := \sum_{i=1}^{m} \log P(y^i | x^i, \theta) = \sum_i \left[ (y^i - 1)\, \theta^\top x^i - \log\!\left(1 + \exp(-\theta^\top x^i)\right) \right]$.
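The last line translates directly into code; a small sketch (Python, assuming labels $y^i \in \{0,1\}$ and that the rows of X are the $x^i$):

    import numpy as np

    def log_likelihood(theta, X, y):
        # l(theta) = sum_i [ (y_i - 1) theta^T x_i - log(1 + exp(-theta^T x_i)) ]
        z = X @ theta                      # vector of theta^T x_i
        return np.sum((y - 1.0) * z - np.log1p(np.exp(-z)))

    # Sanity check: theta = np.zeros(X.shape[1]) gives l(theta) = m * log(1/2).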
The gradient of $l(\theta)$
$l(\theta) := \sum_{i=1}^{m} \log P(y^i | x^i, \theta) = \sum_i \left[ (y^i - 1)\, \theta^\top x^i - \log\!\left(1 + \exp(-\theta^\top x^i)\right) \right]$
Gradient: $\frac{\partial l(\theta)}{\partial \theta} = \sum_i \left[ (y^i - 1)\, x^i + \frac{\exp(-\theta^\top x^i)}{1 + \exp(-\theta^\top x^i)}\, x^i \right]$
Setting it to 0 does not lead to a closed-form solution.
Gradient ascent/descent algorithm
Initialize parameter $\theta^0$.
Do: $\theta^{t+1} \leftarrow \theta^t + \eta \sum_i \left[ (y^i - 1)\, x^i + \frac{\exp(-\theta^{t\top} x^i)}{1 + \exp(-\theta^{t\top} x^i)}\, x^i \right]$
While $\|\theta^{t+1} - \theta^t\| > \epsilon$.
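A minimal sketch of this update rule (Python; the step size eta, tolerance eps, and iteration cap are illustrative choices, not values from the slides):

    import numpy as np

    def gradient(theta, X, y):
        # dl/dtheta = sum_i [ (y_i - 1) x_i + exp(-theta^T x_i) / (1 + exp(-theta^T x_i)) * x_i ]
        z = X @ theta
        coef = (y - 1.0) + np.exp(-z) / (1.0 + np.exp(-z))    # one scalar per example
        return X.T @ coef

    def gradient_ascent(X, y, eta=0.1, eps=1e-6, max_iter=10000):
        theta = np.zeros(X.shape[1])                          # theta^0
        for _ in range(max_iter):
            theta_next = theta + eta * gradient(theta, X, y)  # theta^{t+1} <- theta^t + eta * gradient
            if np.linalg.norm(theta_next - theta) <= eps:     # stop once ||theta^{t+1} - theta^t|| <= eps
                return theta_next
            theta = theta_next
        return theta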
Naïve Bayes vs. logistic regression
Consider $y \in \{-1, 1\}$, $x \in \mathbb{R}^n$.
Number of parameters:
Naïve Bayes: $2n + 1$ ($n$ means, $n$ variances, and 1 for the prior).
Logistic regression: $n + 1$ ($\theta_0, \theta_1, \theta_2, \ldots, \theta_n$).
Naïve Bayes vs. logistic regression II
Asymptotic comparison (# training examples → infinity):
When model assumptions are correct: Naïve Bayes and logistic regression produce identical classifiers.
When model assumptions are incorrect: logistic regression is less biased (it does not assume conditional independence) and has fewer parameters; it is therefore expected to outperform Naïve Bayes.
Naïve Bayes vs. logistic regression III
Estimation method:
Naïve Bayes parameter estimates are decoupled (super easy).
Logistic regression parameter estimates are coupled (less easy).
How to estimate the parameters in logistic regression? Maximum likelihood estimation; more specifically, maximize the conditional likelihood of the labels.
Experiments with handwritten digits
(Figure: handwritten digit examples.)
Which decision boundary is better?
Suppose the training samples are linearly separable. We can find a decision boundary which gives zero training error. But there are many such decision boundaries. Which one is better?
(Figure: two linearly separable classes, Class 1 and Class 2, with a candidate decision boundary between them.)
Compare two decision boundaries
Suppose we perturb the data; which boundary is more susceptible to error?
Geometric interpretation of a classifier
Parameterize the decision boundary as $w^\top x + b = 0$.
$w$ denotes a vector orthogonal to the decision boundary.
$b$ is a scalar offset term.
The dashed lines are parallel to the decision boundary and just touch the data points.
(Figure: decision boundary $w^\top x + b = 0$ with normal vector $w$, and dashed lines at signed distances $c$ and $-c$ separating Class 1 and Class 2.)
Constraints on data points
For all $x$ in class 2, $y = 1$ and $w^\top x + b \ge c$.
For all $x$ in class 1, $y = -1$ and $w^\top x + b \le -c$.
Or more compactly: $(w^\top x + b)\, y \ge c$.
(Figure: the same decision boundary, with the two classes on either side of the dashed lines at $\pm c$.)
Classifier margin
Pick two data points $x^1$ and $x^2$ which lie on each dashed line respectively.
The margin is $\gamma = \frac{1}{\|w\|} w^\top (x^1 - x^2) = \frac{2c}{\|w\|}$.
(Figure: $x^1$ and $x^2$ on the two dashed lines, with the margin measured along $w$.)
Maximum margin classifier
Find the decision boundary $w$ as far from the data points as possible:
$\max_{w,b} \; \frac{2c}{\|w\|} \quad \text{s.t.} \quad y^i (w^\top x^i + b) \ge c, \; \forall i$
Equivalent form
$\max_{w,b} \; \frac{2c}{\|w\|} \quad \text{s.t.} \quad y^i (w^\top x^i + b) \ge c, \; \forall i$
Note that the magnitude of $c$ merely scales $w$ and $b$, and does not change the relative goodness of different classifiers.
Set $c = 1$ (and drop the 2) to get a cleaner problem:
$\max_{w,b} \; \frac{1}{\|w\|} \quad \text{s.t.} \quad y^i (w^\top x^i + b) \ge 1, \; \forall i$
Support vector machine
A constrained convex quadratic programming problem:
$\min_{w,b} \; \|w\|^2 \quad \text{s.t.} \quad y^i (w^\top x^i + b) \ge 1, \; \forall i$
After optimization, the margin is given by $\frac{2}{\|w\|}$.
Only a few of the constraints are relevant: these correspond to the support vectors.
Kernel methods are introduced for nonlinear classification problems.
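For intuition, the hard-margin problem above can be handed to an off-the-shelf solver; a minimal sketch (Python with scikit-learn on made-up separable toy data; the very large C only approximates the hard-margin constraint and is not part of the formulation above):

    import numpy as np
    from sklearn.svm import SVC

    # Made-up linearly separable toy data: class -1 on the left, class +1 on the right.
    X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.5], [3.0, 1.5]])
    y = np.array([-1, -1, 1, 1])

    clf = SVC(kernel="linear", C=1e6)     # very large C approximates the hard-margin problem
    clf.fit(X, y)

    w, b = clf.coef_[0], clf.intercept_[0]
    print("w =", w, "b =", b)
    print("margin =", 2.0 / np.linalg.norm(w))        # margin = 2 / ||w||
    print("support vectors:\n", clf.support_vectors_)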
Lagrangian Duality
The primal problem:
$\min_w \; f(w) \quad \text{s.t.} \quad g_i(w) \le 0, \; i = 1, \ldots, k; \qquad h_i(w) = 0, \; i = 1, \ldots, l$
The Lagrangian function:
$L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)$
$\alpha_i \ge 0$; the $\alpha_i$ and $\beta_i$ are called the Lagrangian multipliers.
The KKT conditions
If there exists some saddle point of $L$, then the saddle point satisfies the following Karush-Kuhn-Tucker (KKT) conditions:
$\frac{\partial L}{\partial w} = 0, \qquad \frac{\partial L}{\partial \beta} = 0$
$g_i(w) \le 0, \qquad h_i(w) = 0$
$\alpha_i \ge 0, \qquad \alpha_i\, g_i(w) = 0$
Dual problem of support vector machines
$\min_{w,b} \; \|w\|^2 \quad \text{s.t.} \quad y^i (w^\top x^i + b) \ge 1, \; \forall i$
Convert to standard form:
$\min_{w,b} \; \frac{1}{2} w^\top w \quad \text{s.t.} \quad 1 - y^i (w^\top x^i + b) \le 0, \; \forall i$
The Lagrangian function:
$L(w, b, \alpha) = \frac{1}{2} w^\top w + \sum_{i=1}^{m} \alpha_i \left( 1 - y^i (w^\top x^i + b) \right)$
Deriving the dual problem
$L(w, b, \alpha) = \frac{1}{2} w^\top w + \sum_{i=1}^{m} \alpha_i \left( 1 - y^i (w^\top x^i + b) \right)$
Taking derivatives and setting them to zero:
$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{m} \alpha_i y^i x^i = 0$
$\frac{\partial L}{\partial b} = -\sum_{i=1}^{m} \alpha_i y^i = 0$
Plug the relation for $w$ and $b$ back in
$L(w, b, \alpha) = \frac{1}{2} \left( \sum_{i=1}^{m} \alpha_i y^i x^i \right)^{\!\top} \left( \sum_{j=1}^{m} \alpha_j y^j x^j \right) + \sum_{i=1}^{m} \alpha_i \left( 1 - y^i \left( \left( \sum_{j=1}^{m} \alpha_j y^j x^j \right)^{\!\top} x^i + b \right) \right)$
After simplification:
$L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y^i y^j\, (x^i)^\top x^j$
The dual problem
$\max_\alpha \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y^i y^j\, (x^i)^\top x^j$
$\text{s.t.} \quad \alpha_i \ge 0, \; i = 1, \ldots, m; \qquad \sum_{i=1}^{m} \alpha_i y^i = 0$
This is a constrained quadratic programming problem. It is nice and convex, and the global maximum can be found.
$w$ can then be recovered as $w = \sum_{i=1}^{m} \alpha_i y^i x^i$. How about $b$?
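A minimal sketch of solving this dual with a generic QP solver (Python with cvxopt, which minimizes $\frac{1}{2}\alpha^\top P \alpha + q^\top \alpha$ subject to $G\alpha \le h$ and $A\alpha = b$, so the signs below turn the maximization into a minimization; the helper name svm_dual is an illustrative assumption):

    import numpy as np
    from cvxopt import matrix, solvers

    def svm_dual(X, y):
        # Solve: max_alpha sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i.x_j
        #        s.t. alpha_i >= 0 and sum_i alpha_i y_i = 0
        m = X.shape[0]
        K = X @ X.T                                          # inner products x_i . x_j
        P = matrix((np.outer(y, y) * K).astype(float))       # P_ij = y_i y_j x_i . x_j
        q = matrix(-np.ones(m))                              # minimize -sum_i alpha_i
        G = matrix(-np.eye(m))                               # -alpha_i <= 0  <=>  alpha_i >= 0
        h = matrix(np.zeros(m))
        A = matrix(y.reshape(1, -1).astype(float))           # sum_i alpha_i y_i = 0
        b = matrix(0.0)
        sol = solvers.qp(P, q, G, h, A, b)
        alpha = np.ravel(sol["x"])
        w = (alpha * y) @ X                                  # w = sum_i alpha_i y_i x_i
        return alpha, w

    # Usage on the toy data from the earlier sketch: alpha, w = svm_dual(X, y)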
Support vectors
Note the KKT condition $\alpha_i\, g_i(w) = 0$, i.e. $\alpha_i \left( 1 - y^i (w^\top x^i + b) \right) = 0$.
For data points with $1 - y^i (w^\top x^i + b) < 0$, $\alpha_i = 0$.
For data points with $1 - y^i (w^\top x^i + b) = 0$, $\alpha_i > 0$.
The training data points whose $\alpha_i$'s are nonzero are called the support vectors (SV).
(Figure: data points labeled with their $\alpha$ values; only the points on the margin have nonzero values, e.g. $\alpha_1 = 0.8$, $\alpha_6 = 1.4$, $\alpha_8 = 0.6$, and all others are 0.)
Computing $b$ and obtaining the classifier
Pick any data point with $\alpha_i > 0$ and solve for $b$ using $1 - y^i (w^\top x^i + b) = 0$.
For a new test point $z$:
Compute $w^\top z + b = \sum_{i \in \text{support vectors}} \alpha_i y^i (x^i)^\top z + b$.
Classify $z$ as class 1 if the result is positive, and class 2 otherwise.
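Continuing the illustrative dual sketch from above (alpha and w as returned there; the tolerance tol for deciding which $\alpha_i$ count as nonzero is an assumption):

    import numpy as np

    def recover_b(alpha, w, X, y, tol=1e-6):
        # Any support vector (alpha_i > 0) satisfies 1 - y_i (w.x_i + b) = 0,
        # and since y_i is +1 or -1 this gives b = y_i - w.x_i.
        i = int(np.argmax(alpha > tol))
        return y[i] - w @ X[i]

    def predict(z, alpha, b, X, y, tol=1e-6):
        # w.z + b = sum_{i in support vectors} alpha_i y_i (x_i . z) + b
        sv = alpha > tol
        score = np.sum(alpha[sv] * y[sv] * (X[sv] @ z)) + b
        return +1 if score > 0 else -1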
Interpretation of support vector machines
The optimal $w$ is a linear combination of a small number of data points. This sparse representation can be viewed as data compression.
To compute the weights $\alpha_i$, and to use support vector machines, we need to specify only the inner products (or kernel) between the examples: $(x^i)^\top x^j$.
We make decisions by comparing each new example $z$ with only the support vectors: $y = \text{sign}\!\left( \sum_{i \in \text{support vectors}} \alpha_i y^i (x^i)^\top z + b \right)$