Support Vector Machines. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington


1 Support Vector Machines CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1

2 A Linearly Separable Problem Consider the binary classification problem on the figure. The blue points belong to one class, with label +1. The orange points belong to the other class, with label -1. These two classes are linearly separable. Infinitely many lines separate them. Are any of those infinitely many lines preferable? 2

3 A Linearly Separable Problem Do we prefer the blue line or the red line, as decision boundary? What criterion can we use? Both decision boundaries classify the training data with 100% accuracy. 3

4 Margin of a Decision Boundary The margin of a decision boundary is defined as the smallest distance between the boundary and any of the samples. 4

5 Margin of a Decision Boundary One way to visualize the margin is this: For each class, draw a line that: is parallel to the decision boundary. touches the class point that is the closest to the decision boundary. The margin is the smallest distance between the decision boundary and one of those two parallel lines. In this example, the decision boundary is equally far from both lines. margin margin 5

6 Support Vector Machines One way to visualize the margin is this: For each class, draw a line that: is parallel to the decision boundary. touches the class point that is the closest to the decision boundary. The margin is the smallest distance between the decision boundary and one of those two parallel lines. margin 6

7 Support Vector Machines Support Vector Machines (SVMs) are a classification method, whose goal is to find the decision boundary with the maximum margin. The idea is that, even if multiple decision boundaries give 100% accuracy on the training data, larger margins lead to less overfitting. Larger margins can tolerate more perturbations of the data. margin margin 7

8 Support Vector Machines Note: so far, we are only discussing cases where the training data is linearly separable. First, we will see how to maximize the margin for such data. Second, we will deal with data that are not linearly separable. We will define SVMs that classify such training data imperfectly. Third, we will see how to define nonlinear SVMs, which can define non-linear decision boundaries. margin margin 8

9 Support Vector Machines Note: so far, we are only discussing cases where the training data is linearly separable. First, we will see how to maximize the margin for such data. Second, we will deal with data that are not linearly separable. We will define SVMs that classify such training data imperfectly. Third, we will see how to define nonlinear SVMs, which can define non-linear decision boundaries. An example of a nonlinear decision boundary produced by a nonlinear SVM. 9

10 Support Vectors In the figure, the red line is the maximum margin decision boundary. One of the parallel lines touches a single orange point. If that orange point moves closer to or farther from the red line, the optimal boundary changes. If other orange points move, the optimal boundary does not change, unless those points move to the right of the blue line. margin margin 10

11 Support Vectors In the figure, the red line is the maximum margin decision boundary. One of the parallel lines touches two blue points. If either of those points moves closer to or farther from the red line, the optimal boundary changes. If other blue points move, the optimal boundary does not change, unless those points move to the left of the blue line. margin margin 11

12 Support Vectors In summary, in this example, the maximum margin is defined by only three points: One orange point. Two blue points. These points are called support vectors. They are indicated by a black circle around them. margin margin 12

13 Distances to the Boundary The decision boundary consists of all points x that are solutions to the equation w^T x + b = 0. w is a column vector of parameters (weights). x is an input vector. b is a scalar value (a real number). If x_n is a training point, its distance to the boundary is computed using this equation: D(x_n, w) = |w^T x_n + b| / ‖w‖ 13

14 Distances to the Boundary If x_n is a training point, its distance to the boundary is computed using this equation: D(x_n, w) = |w^T x_n + b| / ‖w‖. Since the training data are linearly separable, the data from each class fall on opposite sides of the boundary. Suppose that t_n = −1 for points of one class, and t_n = +1 for points of the other class. Then, we can rewrite the distance as: D(x_n, w) = t_n (w^T x_n + b) / ‖w‖ 14
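As a small illustration of these distance formulas, here is a minimal numpy sketch; the arrays w, b, X, t below are made-up example values, not taken from the slides.

```python
import numpy as np

# Hypothetical 2-D example: weights w, bias b, training inputs X (one row per point),
# and labels t in {-1, +1}.
w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])
t = np.array([+1, -1, -1])

# Unsigned distance of each x_n to the boundary: |w^T x_n + b| / ||w||
unsigned = np.abs(X @ w + b) / np.linalg.norm(w)

# For correctly classified points, t_n (w^T x_n + b) > 0, so the same distance
# can be written without the absolute value:
signed = t * (X @ w + b) / np.linalg.norm(w)
print(unsigned, signed)
```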

15 Distances to the Boundary So, given a decision boundary defined by w and b, and given a training input x_n, the distance of x_n to the boundary is: D(x_n, w) = t_n (w^T x_n + b) / ‖w‖. If t_n = −1, then w^T x_n + b < 0, so t_n (w^T x_n + b) > 0. If t_n = +1, then w^T x_n + b > 0, so t_n (w^T x_n + b) > 0. So, in all cases, t_n (w^T x_n + b) is positive. 15

16 Optimization Criterion If x_n is a training point, its distance to the boundary is computed using this equation: D(x_n, w) = t_n (w^T x_n + b) / ‖w‖. Therefore, the optimal boundary (w_opt, b_opt) is defined as: (w_opt, b_opt) = argmax_{w,b} { min_n [ t_n (w^T x_n + b) / ‖w‖ ] } In words: find the w and b that maximize the minimum distance of any training input from the boundary. 16

17 Optimization Criterion The optimal boundary is defined as: (w_opt, b_opt) = argmax_{w,b} { min_n [ t_n (w^T x_n + b) / ‖w‖ ] } Suppose that, for some values w and b, the decision boundary defined by w^T x + b = 0 misclassifies some objects. Can those values of w and b be selected as w_opt, b_opt? 17

18 Optimization Criterion The optimal boundary is defined as: (w_opt, b_opt) = argmax_{w,b} { min_n [ t_n (w^T x_n + b) / ‖w‖ ] } Suppose that, for some values w and b, the decision boundary defined by w^T x + b = 0 misclassifies some objects. Can those values of w and b be selected as w_opt, b_opt? No. If some objects get misclassified, then for some x_n it holds that t_n (w^T x_n + b) / ‖w‖ < 0. Thus, for such w and b, the expression in red will be negative. Since the data is linearly separable, we can find better values for w and b, for which the expression in red will be greater than 0. 18

19 Scale of w The optimal boundary is defined as: (w_opt, b_opt) = argmax_{w,b} { min_n [ t_n (w^T x_n + b) / ‖w‖ ] } Suppose that g is a real number, and g > 0. If w_opt and b_opt define an optimal boundary, then g·w_opt and g·b_opt also define an optimal boundary. We constrain the scale of w_opt to a single value, by requiring that: min_n [ t_n (w^T x_n + b) ] = 1 19

20 Optimization Criterion We introduced the requirement that: min_n [ t_n (w^T x_n + b) ] = 1 Therefore, for any x_n, it holds that: t_n (w^T x_n + b) ≥ 1 The original optimization criterion becomes: (w_opt, b_opt) = argmax_{w,b} { min_n [ t_n (w^T x_n + b) / ‖w‖ ] } = argmax_{w,b} (1 / ‖w‖) = argmin_{w,b} ‖w‖ = argmin_{w,b} (1/2) ‖w‖² These are equivalent formulations. The textbook uses the last one because it simplifies subsequent calculations. 20

21 Constrained Optimization Summarizing the previous slides, we want to find: w_opt = argmin_w (1/2) ‖w‖² subject to the following constraints: ∀n ∈ {1, ..., N}: t_n (w^T x_n + b) ≥ 1 This is a different optimization problem than what we have seen before. We need to minimize a quantity while satisfying a set of inequalities. This type of problem is a constrained optimization problem. 21

22 Quadratic Programming Our constrained optimization problem can be solved using a method called quadratic programming. Describing quadratic programming in depth is outside the scope of this course. Our goal is simply to understand how to use quadratic programming as a black box, to solve our optimization problem. This way, you can use any quadratic programming toolkit (Matlab includes one). 22

23 Quadratic Programming The quadratic programming problem is defined as follows: Inputs: s: an R-dimensional column vector. Q: an R×R symmetric matrix. H: a matrix with R columns and one row per constraint. z: a column vector with one row per constraint. Output: u_opt: an R-dimensional column vector, such that: u_opt = argmin_u [ (1/2) u^T Q u + s^T u ] subject to the constraint: Hu ≤ z 23
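As an illustration of treating a quadratic programming solver as a black box, here is a minimal sketch that assumes the third-party cvxopt package is available; its solvers.qp routine minimizes (1/2)uᵀPu + qᵀu subject to Gu ≤ h, which matches the form above with P = Q, q = s, G = H, h = z. The tiny problem at the bottom is a made-up example, not from the slides.

```python
import numpy as np
from cvxopt import matrix, solvers   # assumes the cvxopt package is installed

def solve_qp(Q, s, H, z):
    """Minimize (1/2) u^T Q u + s^T u subject to H u <= z, using cvxopt as a black box."""
    sol = solvers.qp(matrix(Q), matrix(s), matrix(H), matrix(z))
    return np.array(sol['x']).ravel()

# Tiny dummy problem: minimize (1/2)(u1^2 + u2^2) - u1 subject to u1 <= 2, u2 <= 3.
Q = np.eye(2)
s = np.array([-1.0, 0.0])
H = np.eye(2)
z = np.array([2.0, 3.0])
print(solve_qp(Q, s, H, z))   # expect approximately [1, 0]
```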

24 Quadratic Programming for SVMs Quadratic Programming: u_opt = argmin_u [ (1/2) u^T Q u + s^T u ] subject to the constraint: Hu ≤ z SVM goal: w_opt = argmin_w (1/2) ‖w‖² subject to constraints: ∀n ∈ {1, ..., N}: t_n (w^T x_n + b) ≥ 1 We need to define appropriate values of Q, s, H, z, u so that quadratic programming computes w_opt and b_opt. 24

25 Quadratic Programming for SVMs Quadratic Programming constraint: Hu ≤ z SVM constraints: ∀n ∈ {1, ..., N}: t_n (w^T x_n + b) ≥ 1, or equivalently: ∀n ∈ {1, ..., N}: −t_n (w^T x_n + b) ≤ −1 25

26 Quadratic Programming for SVMs SVM constraints: ∀n ∈ {1, ..., N}: −t_n (w^T x_n + b) ≤ −1 Define:
H = [ −t_1, −t_1 x_11, ..., −t_1 x_1D
      −t_2, −t_2 x_21, ..., −t_2 x_2D
      ...
      −t_N, −t_N x_N1, ..., −t_N x_ND ],  u = (b, w_1, w_2, ..., w_D)^T,  z = (−1, −1, ..., −1)^T
Matrix H is N×(D+1), vector u has (D+1) rows, vector z has N rows.
Hu = ( −t_1 (b + w^T x_1), ..., −t_N (b + w^T x_N) )^T
The n-th row of Hu is −t_n (w^T x_n + b), which should be ≤ −1. So the SVM constraints are equivalent to the quadratic programming constraint Hu ≤ z. 26

27 Quadratic Programming for SVMs Quadratic programming: u_opt = argmin_u [ (1/2) u^T Q u + s^T u ] SVM: w_opt = argmin_w (1/2) ‖w‖² Define:
Q = [ 0, 0, 0, ..., 0
      0, 1, 0, ..., 0
      0, 0, 1, ..., 0
      ...
      0, 0, 0, ..., 1 ],  u = (b, w_1, w_2, ..., w_D)^T,  s = (0, 0, ..., 0)^T
Q is like the (D+1)×(D+1) identity matrix, except that Q_11 = 0. Then, (1/2) u^T Q u + s^T u = (1/2) ‖w‖². u was already defined in the previous slides. s is a (D+1)-dimensional column vector of zeros. 27

28 Quadratic Programming for SVMs Quadratic programming: u_opt = argmin_u [ (1/2) u^T Q u + s^T u ] SVM: w_opt = argmin_w (1/2) ‖w‖² Alternative definitions that would NOT work: Define: Q = the D×D identity matrix, u = (w_1, ..., w_D)^T = w, s = the D-dimensional zero vector. It still holds that (1/2) u^T Q u + s^T u = (1/2) ‖w‖². Why would these definitions not work? 28

29 Quadratic Programming for SVMs Quadratic programming: u_opt = argmin_u [ (1/2) u^T Q u + s^T u ] SVM: w_opt = argmin_w (1/2) ‖w‖² Alternative definitions that would NOT work: Define: Q = the D×D identity matrix, u = (w_1, ..., w_D)^T = w, s = the D-dimensional zero vector. It still holds that (1/2) u^T Q u + s^T u = (1/2) ‖w‖². Why would these definitions not work? Vector u must also make Hu ≤ z match the SVM constraints. With this definition of u, no appropriate H and z can be found, because the constraints also involve b, which does not appear in u. 29

30 Quadratic Programming for SVMs Quadratic programming: u_opt = argmin_u [ (1/2) u^T Q u + s^T u ] subject to the constraint: Hu ≤ z SVM goal: w_opt = argmin_w (1/2) ‖w‖² subject to constraints: ∀n ∈ {1, ..., N}: t_n (w^T x_n + b) ≥ 1 Task: define Q, s, H, z, u so that quadratic programming computes w_opt and b_opt.
Q: like the (D+1)×(D+1) identity matrix, except that Q_11 = 0.
s: the zero vector, with D+1 rows.
H: the N×(D+1) matrix whose n-th row is −(t_n, t_n x_n1, ..., t_n x_nD).
z: the vector (−1, ..., −1)^T, with N rows.
u = (b, w_1, w_2, ..., w_D)^T, with D+1 rows. 30

31 Quadratic Programming for SVMs Quadratic programming: u_opt = argmin_u [ (1/2) u^T Q u + s^T u ] subject to the constraint: Hu ≤ z SVM goal: w_opt = argmin_w (1/2) ‖w‖² subject to constraints: ∀n ∈ {1, ..., N}: t_n (w^T x_n + b) ≥ 1 Task: define Q, s, H, z, u so that quadratic programming computes w_opt and b_opt. With Q, s, H, z, u defined as in the previous slide, quadratic programming takes as inputs Q, s, H, z, and outputs u_opt, from which we get the w_opt, b_opt values for our SVM. 31

32 Quadratic Programming for SVMs Quadratic programming: u_opt = argmin_u [ (1/2) u^T Q u + s^T u ] subject to the constraint: Hu ≤ z SVM goal: w_opt = argmin_w (1/2) ‖w‖² subject to constraints: ∀n ∈ {1, ..., N}: t_n (w^T x_n + b) ≥ 1 Task: define Q, s, H, z, u so that quadratic programming computes w_opt and b_opt. With Q, s, H, z, u defined as in the previous slides: b_opt is the value at dimension 1 of u_opt, and w_opt is the vector of values at dimensions 2, ..., D+1 of u_opt. 32
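The mapping just described can be written down directly in code. The following is a minimal numpy sketch (not from the slides), where X is a hypothetical N×D array of training inputs and t the vector of ±1 labels; the returned matrices would be handed to a QP solver such as the one sketched earlier.

```python
import numpy as np

def primal_svm_qp_matrices(X, t):
    """Build Q, s, H, z so that minimizing (1/2) u^T Q u + s^T u subject to H u <= z
    over u = (b, w_1, ..., w_D) solves the hard-margin SVM primal problem."""
    N, D = X.shape
    Q = np.eye(D + 1)
    Q[0, 0] = 0.0                      # the bias b is not penalized
    s = np.zeros(D + 1)
    # n-th constraint row: -t_n (b + w^T x_n) <= -1, i.e. t_n (w^T x_n + b) >= 1
    H = -t[:, None] * np.hstack([np.ones((N, 1)), X])
    z = -np.ones(N)
    return Q, s, H, z
```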

33 Solving the Same Problem, Again So far, we have solved the problem of defining an SVM (i.e., defining w opt and b opt ), so as to maximize the margin between linearly separable data. If this were all that SVMs can do, SVMs would not be that important. Linearly separable data are a rare case, and a very easy case to deal with. margin margin 33

34 Solving the Same Problem, Again So far, we have solved the problem of defining an SVM (i.e., defining w opt and b opt ), so as to maximize the margin between linearly separable data. We will see two extensions that make SVMs much more powerful. The extensions will allow SVMs to define highly non-linear decision boundaries, as in this figure. However, first we need to solve the same problem again. Maximize the margin between linearly separable data. We will get a more complicated solution, but that solution will be easier to improve upon. 34

35 Lagrange Multipliers Our new solutions are derived using Lagrange multipliers. Here is a quick review from multivariate calculus. Let x be a D-dimensional vector. Let f(x) and g(x) be functions from R^D to R. Functions f and g map D-dimensional vectors to real numbers. Suppose that we want to minimize f(x), subject to the constraint that g(x) ≥ 0. Then, we can solve this problem using a Lagrange multiplier to define a Lagrangian function. 35

36 Lagrange Multipliers To minimize f(x), subject to the constraint g(x) ≥ 0: We define the Lagrangian function: L(x, λ) = f(x) − λ g(x) λ is called a Lagrange multiplier, λ ≥ 0. We find x_opt = argmin_x L(x, λ), and a corresponding value for λ, subject to the following constraints: 1. g(x) ≥ 0 2. λ ≥ 0 3. λ g(x) = 0 If g(x_opt) > 0, the third constraint implies that λ = 0. Then, the constraint g(x) ≥ 0 is called inactive. If g(x_opt) = 0, then λ can be greater than 0. Then, the constraint g(x) ≥ 0 is called active. 36

37 Multiple Constraints Suppose that we have N constraints: ∀n ∈ {1, ..., N}: g_n(x) ≥ 0 We want to minimize f(x), subject to those constraints. Define vector λ = (λ_1, ..., λ_N). Define the Lagrangian function as: L(x, λ) = f(x) − Σ_{n=1}^N λ_n g_n(x) We find x_opt = argmin_x L(x, λ), and a value for λ, subject to: ∀n: g_n(x) ≥ 0, ∀n: λ_n ≥ 0, ∀n: λ_n g_n(x) = 0 37

38 Lagrange Dual Problems We have N constraints: ∀n ∈ {1, ..., N}: g_n(x) ≥ 0 We want to minimize f(x), subject to those constraints. Under some conditions (which are satisfied in our SVM problem), we can solve an alternative dual problem. Define the Lagrangian function as before: L(x, λ) = f(x) − Σ_{n=1}^N λ_n g_n(x) We find x_opt, and the best value for λ, denoted as λ_opt, by solving: λ_opt = argmax_λ [ min_x L(x, λ) ], x_opt = argmin_x L(x, λ_opt), subject to constraints: ∀n: λ_n ≥ 0 38

39 Lagrange Dual Problems Lagrangian dual problem: Solve: λ_opt = argmax_λ [ min_x L(x, λ) ], x_opt = argmin_x L(x, λ_opt), subject to constraints: ∀n: λ_n ≥ 0 This dual problem formulation will be used in training SVMs. The key thing to remember is: We minimize the Lagrangian L with respect to x. We maximize L with respect to the Lagrange multipliers λ_n. 39

40 Lagrange Multipliers and SVMs SVM goal: w_opt = argmin_w (1/2) ‖w‖² subject to constraints: ∀n ∈ {1, ..., N}: t_n (w^T x_n + b) ≥ 1 To make the constraints more amenable to Lagrange multipliers, we rewrite them as: ∀n ∈ {1, ..., N}: t_n (w^T x_n + b) − 1 ≥ 0 Define a = (a_1, ..., a_N) to be a vector of Lagrange multipliers. Define the Lagrangian function: L(w, b, a) = (1/2) ‖w‖² − Σ_{n=1}^N a_n [ t_n (w^T x_n + b) − 1 ] Remember from the previous slides: we minimize L with respect to w and b, and maximize L with respect to a. 40

41 Lagrange Multipliers and SVMs The w and b that minimize L(w, b, a) must satisfy: ∂L/∂w = 0, ∂L/∂b = 0. ∂L/∂w = ∂/∂w [ (1/2) ‖w‖² − Σ_{n=1}^N a_n ( t_n (w^T x_n + b) − 1 ) ] = w − Σ_{n=1}^N a_n t_n x_n = 0 ⟹ w = Σ_{n=1}^N a_n t_n x_n 41

42 Lagrange Multipliers and SVMs The w and b that minimize L(w, b, a) must satisfy: ∂L/∂w = 0, ∂L/∂b = 0. ∂L/∂b = ∂/∂b [ (1/2) ‖w‖² − Σ_{n=1}^N a_n ( t_n (w^T x_n + b) − 1 ) ] = −Σ_{n=1}^N a_n t_n = 0 ⟹ Σ_{n=1}^N a_n t_n = 0 42

43 Lagrange Multipliers and SVMs Our Lagrangian function is: L(w, b, a) = (1/2) ‖w‖² − Σ_{n=1}^N a_n [ t_n (w^T x_n + b) − 1 ] We showed that w = Σ_{n=1}^N a_n t_n x_n. Using that, we get: (1/2) ‖w‖² = (1/2) w^T w = (1/2) ( Σ_{n=1}^N a_n t_n x_n )^T ( Σ_{m=1}^N a_m t_m x_m ) = (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m x_n^T x_m 43

44 Lagrange Multipliers and SVMs We showed that: (1/2) ‖w‖² = (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m x_n^T x_m Define an N×N matrix Q such that Q_mn = t_n t_m x_n^T x_m. Remember that we have defined a = (a_1, ..., a_N). Then, it follows that: (1/2) ‖w‖² = (1/2) a^T Q a 44

45 Lagrange Multipliers and SVMs L(w, b, a) = (1/2) a^T Q a − Σ_{n=1}^N a_n [ t_n (w^T x_n + b) − 1 ] We showed that Σ_{n=1}^N a_n t_n = 0. Using that, we get: Σ_{n=1}^N a_n [ t_n (w^T x_n + b) − 1 ] = Σ_{n=1}^N a_n t_n w^T x_n + b Σ_{n=1}^N a_n t_n − Σ_{n=1}^N a_n = Σ_{n=1}^N a_n t_n w^T x_n − Σ_{n=1}^N a_n (the part shown in red on the slide, b Σ_{n=1}^N a_n t_n, equals 0, as shown before). 45

46 Lagrange Multipliers and SVMs L(w, a) = (1/2) a^T Q a − Σ_{n=1}^N a_n t_n w^T x_n + Σ_{n=1}^N a_n Function L now does not depend on b. We simplify more, using again the fact that w = Σ_{m=1}^N a_m t_m x_m: Σ_{n=1}^N a_n t_n w^T x_n = Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m x_m^T x_n = a^T Q a 46

47 Lagrange Multipliers and SVMs L(w, a) = (1/2) a^T Q a − a^T Q a + Σ_{n=1}^N a_n = Σ_{n=1}^N a_n − (1/2) a^T Q a Function L does not depend on w anymore. We can rewrite L as a function whose only input is a: L(a) = Σ_{n=1}^N a_n − (1/2) a^T Q a Remember, we want to maximize L(a) with respect to a. 47

48 Lagrange Multipliers and SVMs By combining the results from the last few slides, our optimization problem becomes: Maximize L(a) = Σ_{n=1}^N a_n − (1/2) a^T Q a subject to these constraints: ∀n: a_n ≥ 0, Σ_{n=1}^N a_n t_n = 0 48

49 Lagrange Multipliers and SVMs L(a) = Σ_{n=1}^N a_n − (1/2) a^T Q a We want to maximize L(a) subject to some constraints. Therefore, we want to find an a_opt such that: a_opt = argmax_a [ Σ_{n=1}^N a_n − (1/2) a^T Q a ] = argmin_a [ (1/2) a^T Q a − Σ_{n=1}^N a_n ] subject to those constraints. 49

50 SVM Optimization Problem Our SVM optimization problem now is to find an a_opt such that: a_opt = argmin_a [ (1/2) a^T Q a − Σ_{n=1}^N a_n ] subject to these constraints: ∀n: a_n ≥ 0, Σ_{n=1}^N a_n t_n = 0 This problem can be solved again using quadratic programming. 50

51 Using Quadratic Programming Quadratic programming: u_opt = argmin_u [ (1/2) u^T Q u + s^T u ] subject to the constraint: Hu ≤ z SVM problem: find a_opt = argmin_a [ (1/2) a^T Q a − Σ_{n=1}^N a_n ] subject to constraints: a_n ≥ 0, Σ_{n=1}^N a_n t_n = 0 Again, we must find values for Q, s, H, z that convert the SVM problem into a quadratic programming problem. Note that we already have a matrix Q in the Lagrangian. It is an N×N matrix Q such that Q_mn = t_n t_m x_n^T x_m. We will use the same Q for quadratic programming. 51

52 Using Quadratic Programming Quadratic programming: u_opt = argmin_u [ (1/2) u^T Q u + s^T u ] subject to the constraint: Hu ≤ z SVM problem: find a_opt = argmin_a [ (1/2) a^T Q a − Σ_{n=1}^N a_n ] subject to constraints: a_n ≥ 0, Σ_{n=1}^N a_n t_n = 0 Define: u = a, and define s to be the N-dimensional column vector whose every entry equals −1. Then, (1/2) u^T Q u + s^T u = (1/2) a^T Q a − Σ_{n=1}^N a_n We have mapped the SVM minimization goal of finding a_opt to the quadratic programming minimization goal. 52

53 Using Quadratic Programming Quadratic programming constraint: Hu ≤ z SVM problem constraints: a_n ≥ 0, Σ_{n=1}^N a_n t_n = 0 Define:
H = [ −1,  0,  0, ...,  0
       0, −1,  0, ...,  0
       ...
       0,  0,  0, ..., −1
      t_1, t_2, t_3, ..., t_N
     −t_1, −t_2, −t_3, ..., −t_N ],  z = the (N+2)-dimensional column vector of zeros.
The first N rows of H are the negation of the identity matrix. Row N+1 of H is the transpose of vector t of target outputs. Row N+2 of H is the negation of the previous row. 53

54 Using Quadratic Programming Quadratic programming constraint: Hu ≤ z SVM problem constraints: a_n ≥ 0, Σ_{n=1}^N a_n t_n = 0 H and z are defined as in the previous slide. Since we defined u = a: For n ≤ N, the n-th row of Hu equals −a_n. For n ≤ N, the n-th rows of Hu and z specify that −a_n ≤ 0, i.e., a_n ≥ 0. The (N+1)-th rows of Hu and z specify that Σ_{n=1}^N a_n t_n ≤ 0. The (N+2)-th rows of Hu and z specify that −Σ_{n=1}^N a_n t_n ≤ 0, i.e., Σ_{n=1}^N a_n t_n ≥ 0. 54

55 Using Quadratic Programming Quadratic programming constraint: Hu ≤ z SVM problem constraints: a_n ≥ 0, Σ_{n=1}^N a_n t_n = 0 H and z are defined as in the previous slides. Since we defined u = a: For n ≤ N, the n-th rows of Hu and z specify that a_n ≥ 0. The last two rows of Hu and z specify that Σ_{n=1}^N a_n t_n = 0. We have mapped the SVM constraints to the quadratic programming constraint Hu ≤ z. 55
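Here is a corresponding minimal numpy sketch (again with hypothetical names X and t) that builds the dual problem's QP inputs under the same conventions; the N+2 rows of H encode a_n ≥ 0 and the two opposite inequalities that together force Σ a_n t_n = 0.

```python
import numpy as np

def dual_svm_qp_matrices(X, t):
    """Build Q, s, H, z so that minimizing (1/2) a^T Q a + s^T a subject to H a <= z
    solves the hard-margin dual problem over a = (a_1, ..., a_N)."""
    N = X.shape[0]
    G = X @ X.T                        # Gram matrix of dot products x_n^T x_m
    Q = (t[:, None] * t[None, :]) * G  # Q[m, n] = t_n t_m x_n^T x_m
    s = -np.ones(N)                    # s^T a = -sum_n a_n
    H = np.vstack([-np.eye(N),         # rows 1..N:        -a_n <= 0
                   t[None, :],         # row  N+1:   sum a_n t_n <= 0
                   -t[None, :]])       # row  N+2:  -sum a_n t_n <= 0
    z = np.zeros(N + 2)
    return Q, s, H, z
```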

56 Interpretation of the Solution Quadratic programming, given the inputs Q, s, H, z defined in the previous slides, outputs the optimal value for vector a. We used this Lagrangian function: L(w, b, a) = (1/2) ‖w‖² − Σ_{n=1}^N a_n [ t_n (w^T x_n + b) − 1 ] Remember, when we find x_opt to minimize any Lagrangian L(x, λ) = f(x) − Σ_{n=1}^N λ_n g_n(x), one of the constraints we enforce is that ∀n: λ_n g_n(x) = 0. In the SVM case, this means: ∀n: a_n [ t_n (w^T x_n + b) − 1 ] = 0 56

57 Interpretation of the Solution In the SVM case, it holds that: ∀n: a_n [ t_n (w^T x_n + b) − 1 ] = 0 What does this mean? Mathematically, for every n: Either a_n = 0, or t_n (w^T x_n + b) = 1. If t_n (w^T x_n + b) = 1, what does that imply for x_n? 57

58 Interpretation of the Solution In the SVM case, it holds that: ∀n: a_n [ t_n (w^T x_n + b) − 1 ] = 0 What does this mean? Mathematically, for every n: Either a_n = 0, or t_n (w^T x_n + b) = 1. If t_n (w^T x_n + b) = 1, what does that imply for x_n? Remember, many slides back, we imposed the constraint that ∀n: t_n (w^T x_n + b) ≥ 1. Equality t_n (w^T x_n + b) = 1 holds only for the support vectors. Therefore, a_n > 0 only for the support vectors. 58

59 Support Vectors This is an example we have seen before. The three support vectors have a black circle around them. When we find the optimal values a_1, a_2, ..., a_N for this problem, only three of those values will be non-zero. If x_n is not a support vector, then a_n = 0. margin margin 59

60 Interpretation of the Solution We showed before that: w = Σ_{n=1}^N a_n t_n x_n This means that w is a linear combination of the training data. However, since a_n > 0 only for the support vectors, only the support vectors influence w. margin margin 60

61 Computing b w = Σ_{n=1}^N a_n t_n x_n Define set S = { n : x_n is a support vector }. Since a_n > 0 only for the support vectors, we get: w = Σ_{n∈S} a_n t_n x_n If x_n is a support vector, then t_n (w^T x_n + b) = 1. Substituting w = Σ_{m∈S} a_m t_m x_m, if x_n is a support vector: t_n ( Σ_{m∈S} a_m t_m x_m^T x_n + b ) = 1 61

62 Computing b t_n ( Σ_{m∈S} a_m t_m x_m^T x_n + b ) = 1 Remember that t_n can only take values −1 and 1. Therefore, t_n² = 1. Multiplying both sides of the equation by t_n we get: t_n t_n ( Σ_{m∈S} a_m t_m x_m^T x_n + b ) = t_n ⟹ Σ_{m∈S} a_m t_m x_m^T x_n + b = t_n ⟹ b = t_n − Σ_{m∈S} a_m t_m x_m^T x_n 62

63 Computing b Thus, if x_n is a support vector, we can compute b with the formula: b = t_n − Σ_{m∈S} a_m t_m x_m^T x_n To avoid numerical problems, instead of using a single support vector to calculate b, we can use all support vectors (and take the average of the computed values for b). If |S| is the number of support vectors, then: b = (1/|S|) Σ_{n∈S} ( t_n − Σ_{m∈S} a_m t_m x_m^T x_n ) 63
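A minimal sketch of this averaged computation of b, with hypothetical names (a is the multiplier vector returned by the QP solver, X and t the training data); a small tolerance decides which a_n count as non-zero.

```python
import numpy as np

def compute_b(a, X, t, tol=1e-8):
    """Average b = t_n - sum_{m in S} a_m t_m x_m^T x_n over all support vectors n in S."""
    S = np.where(a > tol)[0]                  # indices of support vectors
    K = X[S] @ X[S].T                         # dot products x_m^T x_n for m, n in S
    b_values = t[S] - K @ (a[S] * t[S])       # one estimate of b per support vector
    return b_values.mean()
```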

64 Classification Using a To classify a test object x, we can use the original formula y(x) = w^T x + b. However, since w = Σ_{n=1}^N a_n t_n x_n, we can substitute that formula for w, and classify x using: y(x) = Σ_{n∈S} a_n t_n x_n^T x + b This formula will be our only choice when we use SVMs that produce nonlinear boundaries. Details on that will be coming later in this presentation. 64
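And a matching sketch of the classification formula, using only the support vectors (same hypothetical names; the predicted class is the sign of the returned value).

```python
import numpy as np

def decision_function(x, a, X, t, b, tol=1e-8):
    """y(x) = sum_{n in S} a_n t_n x_n^T x + b, using only the support vectors."""
    S = np.where(a > tol)[0]
    return np.sum(a[S] * t[S] * (X[S] @ x)) + b

# Predicted label for a test input x: +1 if decision_function(...) >= 0, else -1.
```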

65 Recap of Lagrangian-Based Solution We defined the Lagrangian function, which became (after simplifications): L(a) = Σ_{n=1}^N a_n − (1/2) a^T Q a Using quadratic programming, we find the value of a that maximizes L(a), subject to constraints: a_n ≥ 0, Σ_{n=1}^N a_n t_n = 0 Then: w = Σ_{n∈S} a_n t_n x_n, b = (1/|S|) Σ_{n∈S} ( t_n − Σ_{m∈S} a_m t_m x_m^T x_n ) 65

66 Why Did We Do All This? The Lagrangian-based solution solves a problem that we had already solved, in a more complicated way. Typically, we prefer simpler solutions. However, this more complicated solution can be tweaked relatively easily, to produce more powerful SVMs that: Can be trained even if the training data is not linearly separable. Can produce nonlinear decision boundaries. 66

67 The Not Linearly Separable Case Our previous formulation required that the training examples are linearly separable. This requirement was encoded in the constraint: ∀n ∈ {1, ..., N}: t_n (w^T x_n + b) ≥ 1 According to that constraint, every x_n has to be on the correct side of the boundary. To handle data that is not linearly separable, we introduce variables ξ_n ≥ 0, to define a modified constraint: ∀n ∈ {1, ..., N}: t_n (w^T x_n + b) ≥ 1 − ξ_n These variables ξ_n are called slack variables. The values for ξ_n will be computed during optimization. 67

68 The Meaning of Slack Variables To handle data that is not linearly separable, we introduce slack variables ξ_n ≥ 0, to define a modified constraint: ∀n ∈ {1, ..., N}: t_n (w^T x_n + b) ≥ 1 − ξ_n If ξ_n = 0, then x_n is where it should be: either on the blue line for objects of its class, or on the correct side of that blue line. If 0 < ξ_n < 1, then x_n is too close to the decision boundary: between the red line and the blue line for objects of its class. If ξ_n = 1, x_n is on the decision boundary (the red line). If ξ_n > 1, x_n is misclassified (on the wrong side of the red line). 68

69 The Meaning of Slack Variables If the training data is not linearly separable, we introduce slack variables ξ_n ≥ 0, to define a modified constraint: ∀n ∈ {1, ..., N}: t_n (w^T x_n + b) ≥ 1 − ξ_n If ξ_n = 0, then x_n is where it should be: either on the blue line for objects of its class, or on the correct side of that blue line. If 0 < ξ_n < 1, then x_n is too close to the decision boundary: between the red line and the blue line for objects of its class. If ξ_n = 1, x_n is on the decision boundary (the red line). If ξ_n > 1, x_n is misclassified (on the wrong side of the red line). 69

70 Optimization Criterion Before, we minimized (1/2) ‖w‖² subject to constraints: ∀n ∈ {1, ..., N}: t_n (w^T x_n + b) ≥ 1 Now, we want to minimize the error function: C Σ_{n=1}^N ξ_n + (1/2) ‖w‖² subject to constraints: ∀n ∈ {1, ..., N}: t_n (w^T x_n + b) ≥ 1 − ξ_n, ξ_n ≥ 0 C is a parameter that we pick manually, C ≥ 0. C controls the trade-off between maximizing the margin and penalizing training examples that violate the margin. The higher ξ_n is, the farther away x_n is from where it should be. If x_n is on the correct side of the decision boundary, and the distance of x_n to the boundary is greater than or equal to the margin, then ξ_n = 0, and x_n does not contribute to the error function. 70

71 Lagrangian Function To do our constrained optimization, we define the Lagrangian: L(w, b, ξ, a, μ) = (1/2) ‖w‖² + C Σ_{n=1}^N ξ_n − Σ_{n=1}^N a_n [ t_n y(x_n) − 1 + ξ_n ] − Σ_{n=1}^N μ_n ξ_n The Lagrange multipliers are now a_n and μ_n. The term a_n [ t_n y(x_n) − 1 + ξ_n ] in the Lagrangian corresponds to constraint t_n y(x_n) − 1 + ξ_n ≥ 0. The term μ_n ξ_n in the Lagrangian corresponds to constraint ξ_n ≥ 0. 71

72 Lagrangian Function Lagrangian L(w, b, ξ, a, μ): (1/2) ‖w‖² + C Σ_{n=1}^N ξ_n − Σ_{n=1}^N a_n [ t_n y(x_n) − 1 + ξ_n ] − Σ_{n=1}^N μ_n ξ_n Constraints: a_n ≥ 0, μ_n ≥ 0, ξ_n ≥ 0, t_n y(x_n) − 1 + ξ_n ≥ 0, a_n [ t_n y(x_n) − 1 + ξ_n ] = 0, μ_n ξ_n = 0 72

73 Lagrangian Function Lagrangian L(w, b, ξ, a, μ): (1/2) ‖w‖² + C Σ_{n=1}^N ξ_n − Σ_{n=1}^N a_n [ t_n y(x_n) − 1 + ξ_n ] − Σ_{n=1}^N μ_n ξ_n As before, we can simplify the Lagrangian significantly: ∂L/∂w = 0 ⟹ w = Σ_{n=1}^N a_n t_n x_n, ∂L/∂b = 0 ⟹ Σ_{n=1}^N a_n t_n = 0, ∂L/∂ξ_n = 0 ⟹ C − a_n − μ_n = 0 ⟹ a_n = C − μ_n 73

74 Lagrangian Function Lagrangian L(w, b, ξ, a, μ): (1/2) ‖w‖² + C Σ_{n=1}^N ξ_n − Σ_{n=1}^N a_n [ t_n y(x_n) − 1 + ξ_n ] − Σ_{n=1}^N μ_n ξ_n In computing L(w, b, ξ, a, μ), ξ_n contributes the following value: C ξ_n − a_n ξ_n − μ_n ξ_n = ξ_n (C − a_n − μ_n) As we saw in the previous slide, C − a_n − μ_n = 0. Therefore, we can eliminate all those occurrences of ξ_n. 74

75 Lagrangian Function Lagrangian L(w, b, ξ, a, μ): (1/2) ‖w‖² + C Σ_{n=1}^N ξ_n − Σ_{n=1}^N a_n [ t_n y(x_n) − 1 + ξ_n ] − Σ_{n=1}^N μ_n ξ_n In computing L(w, b, ξ, a, μ), ξ_n contributes the following value: C ξ_n − a_n ξ_n − μ_n ξ_n = ξ_n (C − a_n − μ_n) As we saw in the previous slide, C − a_n − μ_n = 0. Therefore, we can eliminate all those occurrences of ξ_n. By eliminating ξ_n we also eliminate all appearances of C and μ_n. 75

76 Lagrangian Function By eliminating ξ_n and μ_n, we get: L(w, b, a) = (1/2) ‖w‖² − Σ_{n=1}^N a_n [ t_n (w^T x_n + b) − 1 ] This is exactly the Lagrangian we had for the linearly separable case. Following the same steps as in the linearly separable case, we can simplify even more to: L(a) = Σ_{n=1}^N a_n − (1/2) a^T Q a 76

77 Lagrangian Function So, we have ended up with the same Lagrangian as in the linearly separable case: L(a) = Σ_{n=1}^N a_n − (1/2) a^T Q a There is a small difference, however, in the constraints. Linearly separable case: 0 ≤ a_n, Σ_{n=1}^N a_n t_n = 0. Linearly inseparable case: 0 ≤ a_n ≤ C, Σ_{n=1}^N a_n t_n = 0. 77

78 Lagrangian Function So, we have ended up with the same Lagrangian as in the linearly separable case: L(a) = Σ_{n=1}^N a_n − (1/2) a^T Q a Where does constraint 0 ≤ a_n ≤ C come from? 0 ≤ a_n because a_n is a Lagrange multiplier. a_n ≤ C comes from the fact that we showed earlier that a_n = C − μ_n. Since μ_n is also a Lagrange multiplier, μ_n ≥ 0, and thus a_n ≤ C. 78

79 Using Quadratic Programming Quadratic programming: u_opt = argmin_u [ (1/2) u^T Q u + s^T u ] subject to the constraint: Hu ≤ z SVM problem: find a_opt = argmin_a [ (1/2) a^T Q a − Σ_{n=1}^N a_n ] subject to constraints: 0 ≤ a_n ≤ C, Σ_{n=1}^N a_n t_n = 0 Again, we must find values for Q, s, H, z that convert the SVM problem into a quadratic programming problem. The values for Q and s are the same as in the linearly separable case, since in both cases we minimize (1/2) a^T Q a − Σ_{n=1}^N a_n. 79

80 Using Quadratic Programming Quadratic programming: u_opt = argmin_u [ (1/2) u^T Q u + s^T u ] subject to the constraint: Hu ≤ z SVM problem: find a_opt = argmin_a [ (1/2) a^T Q a − Σ_{n=1}^N a_n ] subject to constraints: 0 ≤ a_n ≤ C, Σ_{n=1}^N a_n t_n = 0 Q is an N×N matrix such that Q_mn = t_n t_m x_n^T x_m. u = a, and s is the N-dimensional column vector whose every entry equals −1. 80

81 Using Quadratic Programming Quadratic programming constraint: Hu ≤ z SVM problem constraints: 0 ≤ a_n ≤ C, Σ_{n=1}^N a_n t_n = 0 Define:
H = [ −1,  0,  0, ...,  0
       0, −1,  0, ...,  0
       ...
       0,  0,  0, ..., −1      (rows 1 to N)
       1,  0,  0, ...,  0
       0,  1,  0, ...,  0
       ...
       0,  0,  0, ...,  1      (rows N+1 to 2N)
      t_1, t_2, t_3, ..., t_N
     −t_1, −t_2, −t_3, ..., −t_N ]   (rows 2N+1 and 2N+2)
z = (0, ..., 0, C, ..., C, 0, 0)^T: rows 1 to N set to 0, rows N+1 to 2N set to C, rows 2N+1 and 2N+2 set to 0.
The top N rows of H are the negation of the identity matrix. Rows N+1 to 2N of H are the identity matrix. Row 2N+1 of H is the transpose of vector t of target outputs. Row 2N+2 of H is the negation of the previous row. 81

82 Using Quadratic Programming Quadratic programming constraint: Hu ≤ z SVM problem constraints: 0 ≤ a_n ≤ C, Σ_{n=1}^N a_n t_n = 0 H and z are defined as in the previous slide. If 1 ≤ n ≤ N, the n-th row of Hu is −a_n, and the n-th row of z is 0. Thus, the n-th rows of Hu and z capture constraint −a_n ≤ 0, i.e., a_n ≥ 0. 82

83 Using Quadratic Programming Quadratic programming constraint: Hu ≤ z SVM problem constraints: 0 ≤ a_n ≤ C, Σ_{n=1}^N a_n t_n = 0 H and z are defined as in the previous slides. If (N+1) ≤ n ≤ 2N, the n-th row of Hu is a_{n−N}, and the n-th row of z is C. Thus, the n-th rows of Hu and z capture constraint a_{n−N} ≤ C. 83

84 Using Quadratic Programming Quadratic programming constraint: Hu ≤ z SVM problem constraints: 0 ≤ a_n ≤ C, Σ_{n=1}^N a_n t_n = 0 H and z are defined as in the previous slides. If n = 2N+1, the n-th row of Hu is Σ_{m=1}^N a_m t_m, and the n-th row of z is 0. Thus, the n-th rows of Hu and z capture constraint Σ_{m=1}^N a_m t_m ≤ 0. 84

85 Using Quadratic Programming Quadratic programming constraint: Hu ≤ z SVM problem constraints: 0 ≤ a_n ≤ C, Σ_{n=1}^N a_n t_n = 0 H and z are defined as in the previous slides. If n = 2N+2, the n-th row of Hu is −Σ_{m=1}^N a_m t_m, and the n-th row of z is 0. Thus, the n-th rows of Hu and z capture constraint −Σ_{m=1}^N a_m t_m ≤ 0, i.e., Σ_{m=1}^N a_m t_m ≥ 0. 85

86 Using Quadratic Programming Quadratic programming constraint: Hu ≤ z SVM problem constraints: 0 ≤ a_n ≤ C, Σ_{n=1}^N a_n t_n = 0 H and z are defined as in the previous slides. Thus, the last two rows of Hu and z in combination capture constraint Σ_{n=1}^N a_n t_n = 0. 86
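For completeness, a minimal numpy sketch of the soft-margin H and z just described (hypothetical names; C is the manually chosen parameter). Q and s are unchanged from the hard-margin dual sketch shown earlier.

```python
import numpy as np

def soft_margin_constraints(t, C):
    """Build H ((2N+2) x N) and z so that H a <= z encodes 0 <= a_n <= C and sum_n a_n t_n = 0."""
    N = t.shape[0]
    H = np.vstack([-np.eye(N),        # rows 1..N:         -a_n <= 0
                   np.eye(N),         # rows N+1..2N:       a_n <= C
                   t[None, :],        # row 2N+1:    sum a_n t_n <= 0
                   -t[None, :]])      # row 2N+2:   -sum a_n t_n <= 0
    z = np.concatenate([np.zeros(N), C * np.ones(N), np.zeros(2)])
    return H, z
```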

87 Using the Solution Quadratic programming, given the inputs Q, s, H, z defined in the previous slides, outputs the optimal value for vector a. The value of b is computed as in the linearly separable case: b = (1/|S|) Σ_{n∈S} ( t_n − Σ_{m∈S} a_m t_m x_m^T x_n ) Classification of input x is done as in the linearly separable case: y(x) = Σ_{n∈S} a_n t_n x_n^T x + b 87

88 The Disappearing w We have used vector w to define the decision boundary y(x) = w^T x + b. However, w does not need to be either computed or used. Quadratic programming computes values a_n. Using those values a_n, we compute b: b = (1/|S|) Σ_{n∈S} ( t_n − Σ_{m∈S} a_m t_m x_m^T x_n ) Using those values for a_n and b, we can classify test objects x: y(x) = Σ_{n∈S} a_n t_n x_n^T x + b If we know the values for a_n and b, we do not need w. 88

89 The Role of Training Inputs Where, during training, do we use input vectors? Where, during classification, do we use input vectors? Overall, input vectors are only used in two formulas: 1. During training, we use vectors x_n to define matrix Q, where: Q_mn = t_n t_m x_n^T x_m 2. During classification, we use training vectors x_n and test input x in this formula: y(x) = Σ_{n∈S} a_n t_n x_n^T x + b In both formulas, input vectors are only used by taking their dot products. 89

90 The Role of Training Inputs To make this more clear, define a kernel function k(x, x′) as: k(x, x′) = x^T x′ During training, we use vectors x_n to define matrix Q, where: Q_mn = t_n t_m k(x_n, x_m) During classification, we use training vectors x_n and test input x in this formula: y(x) = Σ_{n∈S} a_n t_n k(x_n, x) + b In both formulas, input vectors are used only through function k. 90

91 The Kernel Trick We have defined kernel function k(x, x′) = x^T x′. During training, we use vectors x_n to define matrix Q, where: Q_mn = t_n t_m k(x_n, x_m) During classification, we use this formula: y(x) = Σ_{n∈S} a_n t_n k(x_n, x) + b What if we defined k(x, x′) differently? The SVM formulation (both for training and for classification) would remain exactly the same. In the SVM formulation, the kernel k(x, x′) is a black box. You can define k(x, x′) any way you like. 91
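A minimal sketch of what the kernel trick looks like in code (hypothetical names as before): the kernel k is passed in as a black box, and only the two formulas above change.

```python
import numpy as np

def kernel_Q(X, t, k):
    """Q[m, n] = t_n t_m k(x_n, x_m), for any kernel function k."""
    N = X.shape[0]
    K = np.array([[k(X[n], X[m]) for m in range(N)] for n in range(N)])
    return (t[:, None] * t[None, :]) * K

def kernel_decision_function(x, a, X, t, b, k, tol=1e-8):
    """y(x) = sum_{n in S} a_n t_n k(x_n, x) + b, using only the support vectors."""
    S = np.where(a > tol)[0]
    return sum(a[n] * t[n] * k(X[n], x) for n in S) + b

linear_kernel = lambda x, z: float(x @ z)   # the ordinary dot product as a kernel
```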

92 A Different Kernel Let x = (x_1, x_2) and z = (z_1, z_2) be 2-dimensional vectors. Consider this alternative definition for the kernel: k(x, z) = (1 + x^T z)² Then: k(x, z) = (1 + x^T z)² = (1 + x_1 z_1 + x_2 z_2)² = 1 + 2 x_1 z_1 + 2 x_2 z_2 + (x_1)² (z_1)² + 2 x_1 z_1 x_2 z_2 + (x_2)² (z_2)² Suppose we define a basis function φ(x) as: φ(x) = (1, √2 x_1, √2 x_2, (x_1)², √2 x_1 x_2, (x_2)²) It is easy to verify that k(x, z) = φ(x)^T φ(z). 92
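This identity is easy to check numerically; a small sketch with made-up example vectors:

```python
import numpy as np

def phi(x):
    """Basis function for the kernel k(x, z) = (1 + x^T z)^2 in two dimensions."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print((1 + x @ z) ** 2)         # kernel value computed directly
print(phi(x) @ phi(z))          # same value via the 6-dimensional feature space
```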

93 Kernels and Basis Functions In general, kernels make it easy to incorporate basis functions into SVMs: Define φ(x) any way you like. Define k(x, z) = φ(x)^T φ(z). The kernel function represents a dot product, but in a (typically) higher-dimensional feature space compared to the original space of x and z. 93

94 Polynomial Kernels Let x and z be D-dimensional vectors. A polynomial kernel of degree d is defined as: k(x, z) = (c + x^T z)^d The kernel k(x, z) = (1 + x^T z)² that we saw a couple of slides back was a quadratic kernel. Parameter c controls the trade-off between the influence of higher-order and lower-order terms. Increasing values of c give increasing influence to lower-order terms. 94
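A minimal sketch of the polynomial kernel (hypothetical function name), which can be plugged into the kernelized functions sketched earlier:

```python
import numpy as np

def polynomial_kernel(x, z, c=1.0, d=2):
    """Polynomial kernel of degree d: (c + x^T z)^d."""
    return (c + float(np.dot(x, z))) ** d
```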

95 Polynomial Kernels An Easy Case Decision boundary with polynomial kernel of degree 1. This is identical to the result using the standard dot product as kernel. 95

96 Polynomial Kernels An Easy Case Decision boundary with polynomial kernel of degree 2. The decision boundary is not linear anymore. 96

97 Polynomial Kernels An Easy Case Decision boundary with polynomial kernel of degree 3. 97

98 Polynomial Kernels A Harder Case Decision boundary with polynomial kernel of degree 1. 98

99 Polynomial Kernels A Harder Case Decision boundary with polynomial kernel of degree 2. 99

100 Polynomial Kernels A Harder Case Decision boundary with polynomial kernel of degree

101 Polynomial Kernels A Harder Case Decision boundary with polynomial kernel of degree

102 Polynomial Kernels A Harder Case Decision boundary with polynomial kernel of degree

103 Polynomial Kernels A Harder Case Decision boundary with polynomial kernel of degree

104 Polynomial Kernels A Harder Case Decision boundary with polynomial kernel of degree

105 Polynomial Kernels A Harder Case Decision boundary with polynomial kernel of degree

106 Polynomial Kernels A Harder Case Decision boundary with polynomial kernel of degree

107 Polynomial Kernels A Harder Case Decision boundary with polynomial kernel of degree

108 Polynomial Kernels A Harder Case Decision boundary with polynomial kernel of degree

109 Polynomial Kernels A Harder Case Decision boundary with polynomial kernel of degree

110 RBF/Gaussian Kernels The Radial Basis Function (RBF) kernel, also known as the Gaussian kernel, is defined as: k_σ(x, z) = exp( −‖x − z‖² / (2σ²) ) Given σ, the value of k_σ(x, z) only depends on the distance between x and z. k_σ(x, z) decreases exponentially with the distance between x and z. Parameter σ is chosen manually. Parameter σ specifies how fast k_σ(x, z) decreases as x moves away from z. 110
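A minimal sketch of the RBF kernel (hypothetical function name), which likewise can be plugged into the kernelized training and classification functions sketched earlier:

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """RBF / Gaussian kernel: exp(-||x - z||^2 / (2 sigma^2))."""
    return float(np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2)))
```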

111 RBF Output Vs. Distance X axis: distance between x and z. Y axis: k_σ(x, z), with σ =

112 RBF Output Vs. Distance X axis: distance between x and z. Y axis: k_σ(x, z), with σ =

113 RBF Output Vs. Distance X axis: distance between x and z. Y axis: k_σ(x, z), with σ = 1. 113

114 RBF Output Vs. Distance X axis: distance between x and z. Y axis: k_σ(x, z), with σ =

115 RBF Kernels An Easier Dataset Decision boundary with a linear kernel. 115

116 RBF Kernels An Easier Dataset Decision boundary with an RBF kernel. For this dataset, this is a relatively large value for σ, and it produces a boundary that is almost linear. 116

117 RBF Kernels An Easier Dataset Decision boundary with an RBF kernel. Decreasing the value of σ leads to less linear boundaries. 117

118 RBF Kernels An Easier Dataset Decision boundary with an RBF kernel. Decreasing the value of σ leads to less linear boundaries. 118

119 RBF Kernels An Easier Dataset Decision boundary with an RBF kernel. Decreasing the value of σ leads to less linear boundaries. 119

120 RBF Kernels An Easier Dataset Decision boundary with an RBF kernel. Decreasing the value of σ leads to less linear boundaries. 120

121 RBF Kernels An Easier Dataset Decision boundary with an RBF kernel. Decreasing the value of σ leads to less linear boundaries. 121

122 RBF Kernels An Easier Dataset Decision boundary with an RBF kernel. Note that smaller values of σ increase the danger of overfitting. 122

123 RBF Kernels An Easier Dataset Decision boundary with an RBF kernel. Note that smaller values of σ increase the danger of overfitting. 123

124 RBF Kernels A Harder Dataset Decision boundary with a linear kernel. 124

125 RBF Kernels A Harder Dataset Decision boundary with an RBF kernel. The boundary is almost linear. 125

126 RBF Kernels A Harder Dataset Decision boundary with an RBF kernel. The boundary now is clearly nonlinear. 126

127 RBF Kernels A Harder Dataset Decision boundary with an RBF kernel. 127

128 RBF Kernels A Harder Dataset Decision boundary with an RBF kernel. 128

129 RBF Kernels A Harder Dataset Decision boundary with an RBF kernel. Again, smaller values of σ increase dangers of overfitting. 129

130 RBF Kernels A Harder Dataset Decision boundary with an RBF kernel. Again, smaller values of σ increase dangers of overfitting. 130

131 RBF Kernels and Basis Functions The RBF kernel is defined as: k_σ(x, z) = exp( −‖x − z‖² / (2σ²) ) Is the RBF kernel equivalent to taking the dot product in some feature space? In other words, is there any basis function φ such that k_σ(x, z) = φ(x)^T φ(z)? The answer is "yes": there exists such a function φ, but its output is infinite-dimensional. We will not get into more details here. 131

132 Kernels for Non-Vector Data So far, all our methods have been taking real-valued vectors as input. The inputs have been elements of R^D, for some D ≥ 1. However, there are many interesting problems where the input is not such real-valued vectors. Examples??? 132

133 Kernels for Non-Vector Data So far, all our methods have been taking real-valued vectors as input. The inputs have been elements of R^D, for some D ≥ 1. However, there are many interesting problems where the input is not such real-valued vectors. The inputs can be strings, like "cat", "elephant", "dog". The inputs can be sets, such as {1, 5, 3}, {5, 1, 2, 7}. Sets are NOT vectors. They have very different properties. As sets, {1, 5, 3} = {5, 3, 1}. As vectors, (1, 5, 3) ≠ (5, 3, 1). There are many other types of non-vector data. SVMs can be applied to such data, as long as we define an appropriate kernel function. 133

134 Training Time Complexity Solving the quadratic programming problem takes O(d³) time, where d is the dimensionality of vector u. In other words, d is the number of values we want to estimate. If we use quadratic programming to compute w and b directly, then we estimate D + 1 values. The time complexity is O(D³), where D is the dimensionality of the input vectors x (or the dimensionality of φ(x), if we use a basis function). If we use quadratic programming to compute vector a = (a_1, ..., a_N), then we estimate N values. The time complexity is O(N³), where N is the number of training inputs. Which one is faster? 134

135 Training Time Complexity If we use quadratic programming to compute w and b directly, then the time complexity is O(D³). If we use no basis function, D is the dimensionality of the input vectors x. If we use a basis function φ, D is the dimensionality of φ(x). If we use quadratic programming to compute vector a = (a_1, ..., a_N), the time complexity is O(N³). For linear SVMs (i.e., SVMs with linear kernels, that use the regular dot product), usually D is much smaller than N. If we use RBF kernels, φ(x) is infinite-dimensional, so w cannot be computed directly, but computing vector a still takes O(N³) time. If you want to use the kernel trick, then there is no choice: it takes O(N³) time to do training. 135

136 SVMs for Multiclass Problems As usual, you can always train one-vs.-all SVMs if there are more than two classes. Other, more complicated methods are also available. You can also train what is called "all-pairs" classifiers: Each SVM is trained to discriminate between two classes. The number of SVMs is quadratic to the number of classes. All-pairs classifiers can be used in different ways to classify an input object. Each pairwise classifier votes for one of the two classes it was trained on. Classification time is quadratic to the number of classes. There is an alternative method, called DAGSVM, where the all-pairs classifiers are organized in a directed acyclic graph, and classification time is linear to the number of classes. 136
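A minimal sketch of the one-vs.-all scheme (hypothetical names; train_svm and decision_fn stand for training and decision routines like the ones sketched earlier):

```python
import numpy as np

def train_one_vs_all(X, labels, classes, train_svm):
    """Train one binary SVM per class: class c vs. everything else."""
    models = {}
    for c in classes:
        t = np.where(labels == c, 1.0, -1.0)   # relabel: +1 for class c, -1 for the rest
        models[c] = train_svm(X, t)            # train_svm returns whatever decision_fn needs
    return models

def classify_one_vs_all(x, models, decision_fn):
    """Pick the class whose SVM gives the largest decision value y(x)."""
    return max(models, key=lambda c: decision_fn(x, models[c]))
```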

137 SVMs: Recap Advantages: Training finds the globally best solution, so there is no need to worry about local optima or iterations. SVMs can define complex decision boundaries. Disadvantages: Training time is cubic to the number of training data. This makes it hard to apply SVMs to large datasets. High-dimensional kernels increase the risk of overfitting. Usually larger training sets help reduce overfitting, but SVMs cannot be applied to large training sets due to cubic time complexity. Some choices must still be made manually: the choice of kernel function, and the choice of parameter C in the formula C Σ_{n=1}^N ξ_n + (1/2) ‖w‖². 137


More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

Introduction to SVM and RVM

Introduction to SVM and RVM Introduction to SVM and RVM Machine Learning Seminar HUS HVL UIB Yushu Li, UIB Overview Support vector machine SVM First introduced by Vapnik, et al. 1992 Several literature and wide applications Relevance

More information

Introduction to Machine Learning

Introduction to Machine Learning 1, DATA11002 Introduction to Machine Learning Lecturer: Teemu Roos TAs: Ville Hyvönen and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer

More information

TDT4173 Machine Learning

TDT4173 Machine Learning TDT4173 Machine Learning Lecture 3 Bagging & Boosting + SVMs Norwegian University of Science and Technology Helge Langseth IT-VEST 310 helgel@idi.ntnu.no 1 TDT4173 Machine Learning Outline 1 Ensemble-methods

More information

SVM optimization and Kernel methods

SVM optimization and Kernel methods Announcements SVM optimization and Kernel methods w 4 is up. Due in a week. Kaggle is up 4/13/17 1 4/13/17 2 Outline Review SVM optimization Non-linear transformations in SVM Soft-margin SVM Goal: Find

More information

Support Vector Machine. Industrial AI Lab.

Support Vector Machine. Industrial AI Lab. Support Vector Machine Industrial AI Lab. Classification (Linear) Autonomously figure out which category (or class) an unknown item should be categorized into Number of categories / classes Binary: 2 different

More information

Neural Networks. Prof. Dr. Rudolf Kruse. Computational Intelligence Group Faculty for Computer Science

Neural Networks. Prof. Dr. Rudolf Kruse. Computational Intelligence Group Faculty for Computer Science Neural Networks Prof. Dr. Rudolf Kruse Computational Intelligence Group Faculty for Computer Science kruse@iws.cs.uni-magdeburg.de Rudolf Kruse Neural Networks 1 Supervised Learning / Support Vector Machines

More information

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM 1 Support Vector Machines (SVM) in bioinformatics Day 1: Introduction to SVM Jean-Philippe Vert Bioinformatics Center, Kyoto University, Japan Jean-Philippe.Vert@mines.org Human Genome Center, University

More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

Multiclass Classification-1

Multiclass Classification-1 CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass

More information

Support Vector Machines

Support Vector Machines Two SVM tutorials linked in class website (please, read both): High-level presentation with applications (Hearst 1998) Detailed tutorial (Burges 1998) Support Vector Machines Machine Learning 10701/15781

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Hypothesis Space variable size deterministic continuous parameters Learning Algorithm linear and quadratic programming eager batch SVMs combine three important ideas Apply optimization

More information

A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie

A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie Computational Biology Program Memorial Sloan-Kettering Cancer Center http://cbio.mskcc.org/leslielab

More information

Machine Learning And Applications: Supervised Learning-SVM

Machine Learning And Applications: Supervised Learning-SVM Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine

More information

Machine Learning : Support Vector Machines

Machine Learning : Support Vector Machines Machine Learning Support Vector Machines 05/01/2014 Machine Learning : Support Vector Machines Linear Classifiers (recap) A building block for almost all a mapping, a partitioning of the input space into

More information

Support Vector Machines

Support Vector Machines Support Vector Machines INFO-4604, Applied Machine Learning University of Colorado Boulder September 28, 2017 Prof. Michael Paul Today Two important concepts: Margins Kernels Large Margin Classification

More information

Lecture 10: Support Vector Machine and Large Margin Classifier

Lecture 10: Support Vector Machine and Large Margin Classifier Lecture 10: Support Vector Machine and Large Margin Classifier Applied Multivariate Analysis Math 570, Fall 2014 Xingye Qiao Department of Mathematical Sciences Binghamton University E-mail: qiao@math.binghamton.edu

More information

Support Vector Machines

Support Vector Machines Wien, June, 2010 Paul Hofmarcher, Stefan Theussl, WU Wien Hofmarcher/Theussl SVM 1/21 Linear Separable Separating Hyperplanes Non-Linear Separable Soft-Margin Hyperplanes Hofmarcher/Theussl SVM 2/21 (SVM)

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

Support'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan

Support'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan Support'Vector'Machines Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan kasthuri.kannan@nyumc.org Overview Support Vector Machines for Classification Linear Discrimination Nonlinear Discrimination

More information

Kernel methods, kernel SVM and ridge regression

Kernel methods, kernel SVM and ridge regression Kernel methods, kernel SVM and ridge regression Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Collaborative Filtering 2 Collaborative Filtering R: rating matrix; U: user factor;

More information

LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition LINEAR CLASSIFIERS Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification, the input

More information

Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington Linear Classification CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Example of Linear Classification Red points: patterns belonging

More information

CS4495/6495 Introduction to Computer Vision. 8C-L3 Support Vector Machines

CS4495/6495 Introduction to Computer Vision. 8C-L3 Support Vector Machines CS4495/6495 Introduction to Computer Vision 8C-L3 Support Vector Machines Discriminative classifiers Discriminative classifiers find a division (surface) in feature space that separates the classes Several

More information

About this class. Maximizing the Margin. Maximum margin classifiers. Picture of large and small margin hyperplanes

About this class. Maximizing the Margin. Maximum margin classifiers. Picture of large and small margin hyperplanes About this class Maximum margin classifiers SVMs: geometric derivation of the primal problem Statement of the dual problem The kernel trick SVMs as the solution to a regularization problem Maximizing the

More information

Introduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Introduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Introduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Module - 5 Lecture - 22 SVM: The Dual Formulation Good morning.

More information

SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels

SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels Karl Stratos June 21, 2018 1 / 33 Tangent: Some Loose Ends in Logistic Regression Polynomial feature expansion in logistic

More information

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Perceptrons Definition Perceptron learning rule Convergence Margin & max margin classifiers (Linear) support vector machines Formulation

More information

Outline. Basic concepts: SVM and kernels SVM primal/dual problems. Chih-Jen Lin (National Taiwan Univ.) 1 / 22

Outline. Basic concepts: SVM and kernels SVM primal/dual problems. Chih-Jen Lin (National Taiwan Univ.) 1 / 22 Outline Basic concepts: SVM and kernels SVM primal/dual problems Chih-Jen Lin (National Taiwan Univ.) 1 / 22 Outline Basic concepts: SVM and kernels Basic concepts: SVM and kernels SVM primal/dual problems

More information

Nonlinear Support Vector Machines through Iterative Majorization and I-Splines

Nonlinear Support Vector Machines through Iterative Majorization and I-Splines Nonlinear Support Vector Machines through Iterative Majorization and I-Splines P.J.F. Groenen G. Nalbantov J.C. Bioch July 9, 26 Econometric Institute Report EI 26-25 Abstract To minimize the primal support

More information

Support Vector and Kernel Methods

Support Vector and Kernel Methods SIGIR 2003 Tutorial Support Vector and Kernel Methods Thorsten Joachims Cornell University Computer Science Department tj@cs.cornell.edu http://www.joachims.org 0 Linear Classifiers Rules of the Form:

More information

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer

More information

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection

More information

Lecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University

Lecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University Lecture 18: Kernels Risk and Loss Support Vector Regression Aykut Erdem December 2016 Hacettepe University Administrative We will have a make-up lecture on next Saturday December 24, 2016 Presentations

More information

Support Vector Machines: Maximum Margin Classifiers

Support Vector Machines: Maximum Margin Classifiers Support Vector Machines: Maximum Margin Classifiers Machine Learning and Pattern Recognition: September 16, 2008 Piotr Mirowski Based on slides by Sumit Chopra and Fu-Jie Huang 1 Outline What is behind

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Reading: Ben-Hur & Weston, A User s Guide to Support Vector Machines (linked from class web page) Notation Assume a binary classification problem. Instances are represented by vector

More information

Support Vector Machine. Industrial AI Lab. Prof. Seungchul Lee

Support Vector Machine. Industrial AI Lab. Prof. Seungchul Lee Support Vector Machine Industrial AI Lab. Prof. Seungchul Lee Classification (Linear) Autonomously figure out which category (or class) an unknown item should be categorized into Number of categories /

More information

Stat542 (F11) Statistical Learning. First consider the scenario where the two classes of points are separable.

Stat542 (F11) Statistical Learning. First consider the scenario where the two classes of points are separable. Linear SVM (separable case) First consider the scenario where the two classes of points are separable. It s desirable to have the width (called margin) between the two dashed lines to be large, i.e., have

More information

Linear, threshold units. Linear Discriminant Functions and Support Vector Machines. Biometrics CSE 190 Lecture 11. X i : inputs W i : weights

Linear, threshold units. Linear Discriminant Functions and Support Vector Machines. Biometrics CSE 190 Lecture 11. X i : inputs W i : weights Linear Discriminant Functions and Support Vector Machines Linear, threshold units CSE19, Winter 11 Biometrics CSE 19 Lecture 11 1 X i : inputs W i : weights θ : threshold 3 4 5 1 6 7 Courtesy of University

More information

Pattern Recognition and Machine Learning. Perceptrons and Support Vector machines

Pattern Recognition and Machine Learning. Perceptrons and Support Vector machines Pattern Recognition and Machine Learning James L. Crowley ENSIMAG 3 - MMIS Fall Semester 2016 Lessons 6 10 Jan 2017 Outline Perceptrons and Support Vector machines Notation... 2 Perceptrons... 3 History...3

More information

Data Mining - SVM. Dr. Jean-Michel RICHER Dr. Jean-Michel RICHER Data Mining - SVM 1 / 55

Data Mining - SVM. Dr. Jean-Michel RICHER Dr. Jean-Michel RICHER Data Mining - SVM 1 / 55 Data Mining - SVM Dr. Jean-Michel RICHER 2018 jean-michel.richer@univ-angers.fr Dr. Jean-Michel RICHER Data Mining - SVM 1 / 55 Outline 1. Introduction 2. Linear regression 3. Support Vector Machine 4.

More information

Modelli Lineari (Generalizzati) e SVM

Modelli Lineari (Generalizzati) e SVM Modelli Lineari (Generalizzati) e SVM Corso di AA, anno 2018/19, Padova Fabio Aiolli 19/26 Novembre 2018 Fabio Aiolli Modelli Lineari (Generalizzati) e SVM 19/26 Novembre 2018 1 / 36 Outline Linear methods

More information

SUPPORT VECTOR MACHINE

SUPPORT VECTOR MACHINE SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition

More information

Learning with kernels and SVM

Learning with kernels and SVM Learning with kernels and SVM Šámalova chata, 23. května, 2006 Petra Kudová Outline Introduction Binary classification Learning with Kernels Support Vector Machines Demo Conclusion Learning from data find

More information

Support Vector Machines II. CAP 5610: Machine Learning Instructor: Guo-Jun QI

Support Vector Machines II. CAP 5610: Machine Learning Instructor: Guo-Jun QI Support Vector Machines II CAP 5610: Machine Learning Instructor: Guo-Jun QI 1 Outline Linear SVM hard margin Linear SVM soft margin Non-linear SVM Application Linear Support Vector Machine An optimization

More information

Support Vector Machines. Machine Learning Fall 2017

Support Vector Machines. Machine Learning Fall 2017 Support Vector Machines Machine Learning Fall 2017 1 Where are we? Learning algorithms Decision Trees Perceptron AdaBoost 2 Where are we? Learning algorithms Decision Trees Perceptron AdaBoost Produce

More information

Linear Support Vector Machine. Classification. Linear SVM. Huiping Cao. Huiping Cao, Slide 1/26

Linear Support Vector Machine. Classification. Linear SVM. Huiping Cao. Huiping Cao, Slide 1/26 Huiping Cao, Slide 1/26 Classification Linear SVM Huiping Cao linear hyperplane (decision boundary) that will separate the data Huiping Cao, Slide 2/26 Support Vector Machines rt Vector Find a linear Machines

More information

Support vector machines

Support vector machines Support vector machines Jianxin Wu LAMDA Group National Key Lab for Novel Software Technology Nanjing University, China wujx2001@gmail.com May 10, 2018 Contents 1 The key SVM idea 2 1.1 Simplify it, simplify

More information

Machine Learning. Kernels. Fall (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang. (Chap. 12 of CIML)

Machine Learning. Kernels. Fall (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang. (Chap. 12 of CIML) Machine Learning Fall 2017 Kernels (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang (Chap. 12 of CIML) Nonlinear Features x4: -1 x1: +1 x3: +1 x2: -1 Concatenated (combined) features XOR:

More information