Announcements

Tentative presentation order is out.

Remember: the Monday before the week of your presentation, you must send me the final paper list (for posting on the class website) and draft slides.

Exception: Rahul and Brendan should send draft slides by next Friday.
A few more presentation tips

Make your presentation interesting and accessible to everybody in the class:
Define your problem so it makes sense to people outside of your area.
Clearly explain where machine learning techniques come in.
Emphasize high-level and conceptual content.
Make sure you understand everything on your slides; don't put in any equations you can't explain.
Discuss connections to previous topics covered in class.

Best presentation contest! Students who are present in class will score each presentation. The scores will not be publicly announced and will not affect the presenter's grade. The popular favorite and runner-up(s) will receive prizes at the end of the course!
Review: Bias-variance tradeoff
Review: Classifiers

Bayes classifier:
f(x) = \arg\max_y \Pr[Y = y \mid x].
This is the optimal classifier for 0-1 loss.

Nearest-neighbor classifier.

Linear classifiers.

Logistic regression: assume that the regression function \eta(x) = \Pr[Y = 1 \mid x] satisfies
\log \frac{\Pr[1 \mid x]}{1 - \Pr[1 \mid x]} = w_0 + w^T x.
Then \eta(x) = \frac{1}{1 + e^{-(w_0 + w^T x)}}.

Perceptron (Rosenblatt, 1957): find parameters w_0, w that minimize the error over misclassified examples:
-\sum_{i\ \text{misclassified}} y_i (w_0 + w^T x_i).
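As a small illustration of the logistic model above, here is a minimal sketch of computing the posterior \eta(x); the weight values are made up purely for the example:

```python
import numpy as np

def eta(x, w0, w):
    """Logistic model: Pr[Y = 1 | x] = 1 / (1 + exp(-(w0 + w^T x)))."""
    return 1.0 / (1.0 + np.exp(-(w0 + w @ x)))

# Made-up parameters, just to show the call.
print(eta(np.array([1.0, 2.0]), w0=-0.5, w=np.array([0.3, -0.2])))
```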
Review: Perceptron

Optimization: starting with some initial values of w_0 and w, iterate over misclassified examples and update the parameter values to reduce the error.

Problems:
When the data are separable, the solution depends on the starting parameter values.
It may take a long time to converge (depending on the learning rate).
When the data are not separable, it does not converge at all!

Historical note: because of the problems with perceptrons (as described by Minsky & Papert, 1969), the field of neural networks fell out of favor for over ten years.
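To make the update rule concrete, here is a minimal NumPy sketch of the perceptron training loop; the learning rate eta and the epoch cap are illustrative choices, not part of the slides:

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=100):
    """Toy perceptron. X is (n, d); y is (n,) with labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)   # weight vector
    w0 = 0.0          # bias
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            # x_i is misclassified when y_i (w0 + w^T x_i) <= 0.
            if yi * (w0 + w @ xi) <= 0:
                w += eta * yi * xi    # nudge the boundary toward classifying x_i correctly
                w0 += eta * yi
                mistakes += 1
        if mistakes == 0:             # converged; only happens if the data are separable
            break
    return w0, w
```

If the data are not linearly separable, the loop above simply stops at max_epochs without converging, which is exactly the failure mode noted on the slide.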
Recall: Geometry of hyperplanes

A hyperplane is defined by an equation w^T x + w_0 = 0.
The unit vector w / \|w\| is normal to the hyperplane.
The signed distance of any point x_i to the hyperplane is given by
\frac{1}{\|w\|} (w^T x_i + w_0).
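A one-line sketch of that signed-distance formula (NumPy, same conventions as above):

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed distance from x to the hyperplane w^T x + w0 = 0."""
    return (w @ x + w0) / np.linalg.norm(w)
```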
Maximum-margin separating hyperplane

Margin maximization (for linearly separable data) is formulated as follows:

\max_{(w, w_0)} M \quad \text{subject to} \quad y_i (w^T x_i + w_0) \ge M \|w\|, \quad i = 1, \dots, n.

Explanation: \frac{1}{\|w\|} (w^T x_i + w_0) is the signed distance between x_i and the hyperplane w^T x + w_0 = 0. The constraints require that each training point is on the correct side of the decision boundary and at least an unsigned distance M from it. The goal is to find the hyperplane, with parameters w and w_0, that has the largest such M.
Maximum-margin separating hyperplane

Constrained optimization problem:

\max_{(w, w_0)} M \quad \text{subject to} \quad y_i (w^T x_i + w_0) \ge M \|w\|, \quad i = 1, \dots, n.

We can choose M = 1 / \|w\| and instead solve

\min_{(w, w_0)} \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w^T x_i + w_0) \ge 1, \quad i = 1, \dots, n.
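This is a standard quadratic program, so a generic convex solver can handle it directly. Below is a minimal sketch using the cvxpy package (a tooling choice not made on the slides), assuming linearly separable data:

```python
import numpy as np
import cvxpy as cp

def hard_margin_svm(X, y):
    """Solve min 1/2 ||w||^2  s.t.  y_i (w^T x_i + w0) >= 1.
    X is (n, d); y is (n,) with labels in {-1, +1}; data must be separable."""
    n, d = X.shape
    w = cp.Variable(d)
    w0 = cp.Variable()
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(y, X @ w + w0) >= 1]
    cp.Problem(objective, constraints).solve()
    return w0.value, w.value
```

If the data are not separable, the problem is infeasible and the solver reports it, which motivates the slack-variable version later in this section.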
Support vectors

\min_{(w, w_0)} \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w^T x_i + w_0) \ge 1, \quad i = 1, \dots, n.

The quantity y_i (w^T x_i + w_0) is the (functional) margin of x_i. Points for which y_i (w^T x_i + w_0) = 1 are support vectors.
Lagrange multipliers (Source: G. Shakhnarovich)

\min_{(w, w_0)} \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w_0 + w^T x_i) - 1 \ge 0, \quad i = 1, \dots, n.

We want to transform this constrained problem into an unconstrained problem. We associate with each constraint the loss

\max_{\alpha_i \ge 0} \alpha_i \left[ 1 - y_i (w_0 + w^T x_i) \right] =
\begin{cases} 0, & \text{if } y_i (w_0 + w^T x_i) \ge 1 \\ \infty, & \text{if the constraint is violated.} \end{cases}

We can now reformulate our problem:

\min_{(w, w_0)} \left\{ \frac{1}{2} \|w\|^2 + \sum_i \max_{\alpha_i \ge 0} \alpha_i \left[ 1 - y_i (w_0 + w^T x_i) \right] \right\}.
Optimization (Source: G. Shakhnarovich)

We want all the constraint terms to be zero:

\min_{(w, w_0)} \left\{ \frac{1}{2} \|w\|^2 + \sum_i \max_{\alpha_i \ge 0} \alpha_i \left[ 1 - y_i (w_0 + w^T x_i) \right] \right\}
= \min_{(w, w_0)} \max_{\{\alpha_i \ge 0\}} \left\{ \frac{1}{2} \|w\|^2 + \sum_i \alpha_i \left[ 1 - y_i (w_0 + w^T x_i) \right] \right\}
= \max_{\{\alpha_i \ge 0\}} \min_{(w, w_0)} \underbrace{\left\{ \frac{1}{2} \|w\|^2 + \sum_i \alpha_i \left[ 1 - y_i (w_0 + w^T x_i) \right] \right\}}_{L(w, w_0; \alpha)}.

(Note: in general, it is not always valid to exchange min and max.)
Strategy for optimization (Source: G. Shakhnarovich)

We need to find \max_{\{\alpha_i \ge 0\}} \min_{(w, w_0)} L(w, w_0; \alpha).

We first fix \alpha = [\alpha_1, \dots, \alpha_n] and treat L(w, w_0; \alpha) as a function of w, w_0; we find the functions w(\alpha), w_0(\alpha) that attain the minimum. Next, we treat L(w(\alpha), w_0(\alpha); \alpha) as a function of \alpha and find the \alpha^* that attains the maximum. In the end, the solution is given by \alpha^*, w(\alpha^*), and w_0(\alpha^*).
Minimizing L(w, w_0; \alpha) w.r.t. w, w_0 (Source: G. Shakhnarovich)

For fixed \alpha we can minimize

L(w, w_0; \alpha) = \frac{1}{2} \|w\|^2 + \sum_i \alpha_i \left[ 1 - y_i (w_0 + w^T x_i) \right]

by setting the derivatives w.r.t. w and w_0 to zero:

\nabla_w L(w, w_0; \alpha) = w - \sum_i \alpha_i y_i x_i = 0, \qquad
\frac{\partial}{\partial w_0} L(w, w_0; \alpha) = -\sum_i \alpha_i y_i = 0.

Note that the bias term w_0 has dropped out, but has produced a global constraint on \alpha.
Solving for \alpha (Source: G. Shakhnarovich)

w(\alpha) = \sum_i \alpha_i y_i x_i, \qquad \sum_i \alpha_i y_i = 0.

Now we can substitute this solution into

\max_{\{\alpha_i \ge 0,\ \sum_i \alpha_i y_i = 0\}} \left\{ \frac{1}{2} \|w(\alpha)\|^2 + \sum_i \alpha_i \left[ 1 - y_i (w_0(\alpha) + w(\alpha)^T x_i) \right] \right\}
= \max_{\{\alpha_i \ge 0,\ \sum_i \alpha_i y_i = 0\}} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j.
Max-margin and quadratic programming (Source: G. Shakhnarovich)

We started by writing down the max-margin problem and arrived at the dual problem in \alpha:

\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j
\quad \text{subject to} \quad \sum_i \alpha_i y_i = 0, \quad \alpha_i \ge 0 \text{ for all } i = 1, \dots, n.

Solving this quadratic program yields \alpha^*. We substitute \alpha^* back to get w:

\hat{w} = w(\alpha^*) = \sum_i \alpha_i^* y_i x_i.
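A minimal sketch of solving this dual QP, again with cvxpy (an illustrative tooling choice), using the identity \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j = \|\sum_i \alpha_i y_i x_i\|^2 to keep the objective in a solver-friendly form; separable data are assumed:

```python
import numpy as np
import cvxpy as cp

def svm_dual(X, y):
    """Hard-margin SVM via the dual QP. X is (n, d); y is (n,) in {-1, +1}."""
    n, d = X.shape
    a = cp.Variable(n)
    # Objective: sum_i alpha_i - 1/2 || sum_i alpha_i y_i x_i ||^2
    objective = cp.Maximize(cp.sum(a) - 0.5 * cp.sum_squares(cp.multiply(a, y) @ X))
    constraints = [a >= 0, cp.sum(cp.multiply(a, y)) == 0]
    cp.Problem(objective, constraints).solve()
    alpha = a.value
    w = X.T @ (alpha * y)          # w(alpha*) = sum_i alpha_i* y_i x_i
    sv = np.argmax(alpha)          # any point with alpha_i* > 0 is a support vector
    w0 = y[sv] - w @ X[sv]         # a support vector has margin exactly 1
    return alpha, w0, w
```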
Maximum margin decision boundary (Source: G. Shakhnarovich)

\hat{w} = w(\alpha^*) = \sum_i \alpha_i^* y_i x_i.

Recall that, at the optimal solution, we must have \alpha_i^* \left[ 1 - y_i (\hat{w}_0 + \hat{w}^T x_i) \right] = 0.

Suppose that, under the optimal solution, the margin of x_i is y_i (\hat{w}_0 + \hat{w}^T x_i) > 1 (x_i is not a support vector). Then, necessarily, \alpha_i^* = 0. Thus, we can express the direction of the max-margin decision boundary as a function of the support vectors alone:

\hat{w} = \sum_{\alpha_i^* > 0} \alpha_i^* y_i x_i.

We have \hat{w}_0 = y_i - \hat{w}^T x_i for any support vector x_i. Or, we can compute \hat{w}_0 by making sure the margin is balanced between the two classes.
Support vectors (Source: G. Shakhnarovich)

\hat{w} = \sum_{\alpha_i^* > 0} \alpha_i^* y_i x_i.

Given a test example x, it is classified by

\hat{y} = \mathrm{sign}\left( \hat{w}_0 + \hat{w}^T x \right)
= \mathrm{sign}\left( \hat{w}_0 + \Big( \sum_{\alpha_i^* > 0} \alpha_i^* y_i x_i \Big)^T x \right)
= \mathrm{sign}\left( \hat{w}_0 + \sum_{\alpha_i^* > 0} \alpha_i^* y_i\, x_i^T x \right).

The classifier is based on the expansion in terms of dot products of x with the support vectors.
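The dot-product form of the decision rule is short enough to write out directly; this sketch assumes we already have the support vectors, their labels, their multipliers, and \hat{w}_0 (for example from the hypothetical svm_dual helper above):

```python
import numpy as np

def svm_predict(x, X_sv, y_sv, alpha_sv, w0):
    """Classify x as sign(w0 + sum_i alpha_i y_i <x_i, x>), over support vectors only."""
    return np.sign(w0 + np.sum(alpha_sv * y_sv * (X_sv @ x)))
```

Only the support vectors enter the sum, so the prediction cost depends on their number rather than on n.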
Non-separable case

What if the training data are not linearly separable? We can no longer require exact margin constraints. One idea: minimize

\min_w \frac{1}{2} \|w\|^2 + C \cdot (\#\text{mistakes}).

This is the 0-1 loss. The parameter C determines the penalty paid for violating margin constraints (a tradeoff between the number of mistakes and the margin).

Problem: this is not a QP anymore, and it also does not distinguish between near misses and bad mistakes.
Non-separable case

Another idea: rewrite the constraints with slack variables \xi_i \ge 0:

\min_{(w, w_0)} \frac{1}{2} \|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i (w_0 + w^T x_i) - 1 + \xi_i \ge 0.

Whenever the margin is \ge 1 (the original constraint is satisfied), \xi_i = 0. Whenever the margin is < 1 (the constraint is violated), we pay a linear penalty. This is called the hinge loss:

\max\left( 0,\ 1 - y_i (w_0 + w^T x_i) \right).
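A small NumPy sketch of the hinge loss just defined, averaged over a training set:

```python
import numpy as np

def hinge_loss(w0, w, X, y):
    """Mean of max(0, 1 - y_i (w0 + w^T x_i)) over all training points."""
    margins = y * (X @ w + w0)
    return np.mean(np.maximum(0.0, 1.0 - margins))
```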
Connection between SVMs and logistic regression

Support vector machines use the hinge loss: \max\left( 0,\ 1 - y_i (w_0 + w^T x_i) \right).

Logistic regression models P(y_i \mid x_i; w, w_0) = \frac{1}{1 + e^{-y_i (w_0 + w^T x_i)}}, which corresponds to the log loss: \log\left( 1 + e^{-y_i (w_0 + w^T x_i)} \right).
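A quick numerical comparison of the two losses as functions of the margin m = y_i (w_0 + w^T x_i); the grid of margin values is chosen only for illustration:

```python
import numpy as np

m = np.linspace(-3, 3, 7)                 # signed margins y (w0 + w^T x)
hinge = np.maximum(0.0, 1.0 - m)          # SVM: hinge loss
log_loss = np.log(1.0 + np.exp(-m))       # logistic regression: log loss
for mi, h, l in zip(m, hinge, log_loss):
    print(f"margin {mi:+.1f}: hinge {h:.3f}, log {l:.3f}")
```

Both losses penalize small or negative margins, but the hinge loss is exactly zero for margins of at least 1, whereas the log loss is always positive.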
Non-separable case: solution (Source: G. Shakhnarovich)

\min_{(w, w_0)} \frac{1}{2} \|w\|^2 + C \sum_i \xi_i.

We can solve this using Lagrange multipliers, introducing additional multipliers for the \xi_i. The resulting dual problem:

\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j
\quad \text{subject to} \quad \sum_i \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C \text{ for all } i = 1, \dots, n.
SVM with slack variables (Source: G. Shakhnarovich)

[Figure: training points labeled by their multipliers and slacks, e.g. 0 < \alpha < C, \xi = 0 for support vectors on the margin; \alpha = C, 0 < \xi < 1 for margin violations; \alpha = C, \xi > 1 for misclassified points.]

Support vectors: points with \alpha > 0.
If 0 < \alpha < C: support vectors on the margin, \xi = 0.
If \alpha = C: over the margin, either misclassified (\xi > 1) or not (0 < \xi \le 1).
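In practice one rarely writes this QP by hand; a minimal scikit-learn sketch of fitting a soft-margin linear SVM is below. The synthetic data and the choice C = 1.0 are illustrative only:

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian classes, so the data are not linearly separable.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# C trades off a large margin against the total slack (margin violations).
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors per class:", clf.n_support_)
print("w =", clf.coef_[0], " w0 =", clf.intercept_[0])
```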