Support Vector Machines. CSE 4309 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington


1 Support Vector Machines CSE 4309 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1

2 A Linearly Separable Problem Consider the binary classification problem on the figure. The blue points belong to one class, with label +1. The orange points belong to the other class, with label -1. These two classes are linearly separable. Infinitely many lines separate them. Are any of those infinitely many lines preferable? 2

3 A Linearly Separable Problem Do we prefer the blue line or the red line, as decision boundary? What criterion can we use? Both decision boundaries classify the training data with 100% accuracy. 3

4 Margin of a Decision Boundary The margin of a decision boundary is defined as the smallest distance between the boundary and any of the samples. 4

5 Margin of a Decision Boundary One way to visualize the margin is this: For each class, draw a line that: is parallel to the decision boundary. touches the class point that is the closest to the decision boundary. The margin is the smallest distance between the decision boundary and one of those two parallel lines. In this example, the decision boundary is equally far from both lines. margin margin 5

6 Support Vector Machines One way to visualize the margin is this: For each class, draw a line that: is parallel to the decision boundary. touches the class point that is the closest to the decision boundary. The margin is the smallest distance between the decision boundary and one of those two parallel lines. margin 6

7 Support Vector Machines Support Vector Machines (SVMs) are a classification method, whose goal is to find the decision boundary with the maximum margin. The idea is that, even if multiple decision boundaries give 100% accuracy on the training data, larger margins lead to less overfitting. Larger margins can tolerate more perturbations of the data. margin margin 7

8 Support Vector Machines Note: so far, we are only discussing cases where the training data is linearly separable. First, we will see how to maximize the margin for such data. Second, we will deal with data that are not linearly separable. We will define SVMs that classify such training data imperfectly. Third, we will see how to define nonlinear SVMs, which can define non-linear decision boundaries. margin margin 8

9 Support Vector Machines Note: so far, we are only discussing cases where the training data is linearly separable. First, we will see how to maximize the margin for such data. Second, we will deal with data that are not linearly separable. We will define SVMs that classify such training data imperfectly. Third, we will see how to define nonlinear SVMs, which can define non-linear decision boundaries. An example of a nonlinear decision boundary produced by a nonlinear SVM. 9

10 Support Vectors In the figure, the red line is the maximum margin decision boundary. One of the parallel lines touches a single orange point. If that orange point moves closer to or farther from the red line, the optimal boundary changes. If other orange points move, the optimal boundary does not change, unless those points move to the right of the blue line. margin margin 10

11 Support Vectors In the figure, the red line is the maximum margin decision boundary. One of the parallel lines touches two blue points. If either of those points moves closer to or farther from the red line, the optimal boundary changes. If other blue points move, the optimal boundary does not change, unless those points move to the left of the blue line. margin margin 11

12 Support Vectors In summary, in this example, the maximum margin is defined by only three points: One orange point. Two blue points. These points are called support vectors. They are indicated by a black circle around them. margin margin 12

13 Distances to the Boundary. The decision boundary consists of all points x that are solutions to the equation w^T x + b = 0, where w is a column vector of parameters (weights), x is an input vector, and b is a scalar value (a real number). If x_n is a training point, its distance to the boundary is computed using this equation: D(x_n, w) = |w^T x_n + b| / ||w||.

14 Distances to the Boundary. If x_n is a training point, its distance to the boundary is computed using this equation: D(x_n, w) = |w^T x_n + b| / ||w||. Since the training data are linearly separable, the data from each class fall on opposite sides of the boundary. Suppose that t_n = -1 for points of one class, and t_n = +1 for points of the other class. Then, we can rewrite the distance as: D(x_n, w) = t_n (w^T x_n + b) / ||w||.

15 Distances to the Boundary. So, given a decision boundary defined by w and b, and given a training input x_n, the distance of x_n to the boundary is: D(x_n, w) = t_n (w^T x_n + b) / ||w||. If t_n = -1, then w^T x_n + b < 0, so t_n (w^T x_n + b) > 0. If t_n = +1, then w^T x_n + b > 0, so again t_n (w^T x_n + b) > 0. So, in all cases, t_n (w^T x_n + b) is positive.
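
The formula above translates directly into a few lines of code. The following is a small illustrative sketch (not part of the slides), assuming X is an N×D array of training inputs and t a vector of ±1 labels:

```python
import numpy as np

def distances_to_boundary(X, t, w, b):
    """D(x_n, w) = t_n (w^T x_n + b) / ||w|| for every training point.
    The margin of the boundary is the smallest of these distances."""
    d = t * (X @ w + b) / np.linalg.norm(w)
    return d, d.min()
```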

16 Optimization Criterion. If x_n is a training point, its distance to the boundary is computed using this equation: D(x_n, w) = t_n (w^T x_n + b) / ||w||. Therefore, the optimal boundary (w_opt, b_opt) is defined as: (w_opt, b_opt) = argmax_{w,b} { min_n [ t_n (w^T x_n + b) / ||w|| ] }. In words: find the w and b that maximize the minimum distance of any training input from the boundary.

17 Optimization Criterion. The optimal boundary is defined as: (w_opt, b_opt) = argmax_{w,b} { min_n [ t_n (w^T x_n + b) / ||w|| ] }. Suppose that, for some values w and b, the decision boundary defined by w^T x + b = 0 misclassifies some objects. Can those values of w and b be selected as w_opt, b_opt?

18 Optimization Criterion. The optimal boundary is defined as: (w_opt, b_opt) = argmax_{w,b} { min_n [ t_n (w^T x_n + b) / ||w|| ] }. Suppose that, for some values w and b, the decision boundary defined by w^T x + b = 0 misclassifies some objects. Can those values of w and b be selected as w_opt, b_opt? No. If some objects get misclassified, then for some x_n it holds that t_n (w^T x_n + b) / ||w|| < 0. Thus, for such w and b, the minimum in the expression above will be negative. Since the data is linearly separable, we can find better values for w and b, for which that minimum will be greater than 0.

19 Scale of w. The optimal boundary is defined as: (w_opt, b_opt) = argmax_{w,b} { min_n [ t_n (w^T x_n + b) / ||w|| ] }. Suppose that g is a real number, and g > 0. If w_opt and b_opt define an optimal boundary, then g·w_opt and g·b_opt also define an optimal boundary. We constrain the scale of w_opt to a single value, by requiring that: min_n [ t_n (w^T x_n + b) ] = 1.

20 Optimization Criterion. We introduced the requirement that: min_n [ t_n (w^T x_n + b) ] = 1. Therefore, for any x_n, it holds that: t_n (w^T x_n + b) ≥ 1. The original optimization criterion becomes: (w_opt, b_opt) = argmax_{w,b} { min_n [ t_n (w^T x_n + b) / ||w|| ] } ⟹ w_opt = argmax_w (1 / ||w||) = argmin_w ||w|| ⟹ w_opt = argmin_w (1/2) ||w||^2. These are equivalent formulations. The textbook uses the last one, because it simplifies subsequent calculations.

21 Constrained Optimization. Summarizing the previous slides, we want to find: w_opt = argmin_w (1/2) ||w||^2, subject to the following constraints: ∀n ∈ {1, …, N}: t_n (w^T x_n + b) ≥ 1. This is a different optimization problem than what we have seen before. We need to minimize a quantity while satisfying a set of inequalities. This type of problem is a constrained optimization problem.

22 Quadratic Programming Our constrained optimization problem can be solved using a method called quadratic programming. Describing quadratic programming in depth is outside the scope of this course. Our goal is simply to understand how to use quadratic programming as a black box, to solve our optimization problem. This way, you can use any quadratic programming toolkit (Matlab includes one). 22

23 Quadratic Programming. The quadratic programming problem is defined as follows. Inputs: s: an R-dimensional column vector. Q: an R×R symmetric matrix. H: a matrix with R columns and one row per linear constraint. z: a column vector with one row per linear constraint. Output: u_opt: an R-dimensional column vector, such that: u_opt = argmin_u [ (1/2) u^T Q u + s^T u ], subject to the constraint: Hu ≤ z.
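
To make the "black box" concrete, here is a minimal sketch of wrapping an off-the-shelf solver behind exactly this interface. It assumes the Python cvxopt package is available (Matlab's quadprog, or any other QP toolkit, plays the same role); cvxopt's qp routine minimizes (1/2) u^T Q u + s^T u subject to Gu ≤ h, which matches the constraint form used here.

```python
# Illustrative sketch (not from the slides); assumes cvxopt is installed.
import numpy as np
from cvxopt import matrix, solvers

def solve_qp(Q, s, H, z):
    """Return u_opt = argmin_u (1/2) u^T Q u + s^T u  subject to  H u <= z."""
    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(Q.astype(float)), matrix(s.astype(float)),
                     matrix(H.astype(float)), matrix(z.astype(float)))
    return np.array(sol['x']).ravel()
```

The later slides only need this one call; everything SVM-specific goes into how Q, s, H, and z are built.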

24 Quadratic Programming for SVMs. Quadratic programming: u_opt = argmin_u [ (1/2) u^T Q u + s^T u ], subject to the constraint Hu ≤ z. SVM goal: w_opt = argmin_w (1/2) ||w||^2, subject to the constraints: ∀n ∈ {1, …, N}: t_n (w^T x_n + b) ≥ 1. We need to define appropriate values of Q, s, H, z, u so that quadratic programming computes w_opt and b_opt.

25 Quadratic Programming for SVMs. Quadratic programming constraint: Hu ≤ z. SVM constraints: ∀n ∈ {1, …, N}: t_n (w^T x_n + b) ≥ 1, or equivalently: ∀n ∈ {1, …, N}: -t_n (w^T x_n + b) ≤ -1.

26 Quadratic Programming for SVMs. SVM constraints: ∀n ∈ {1, …, N}: -t_n (w^T x_n + b) ≤ -1. Define: H = the N×(D+1) matrix whose n-th row is (-t_n, -t_n x_n1, …, -t_n x_nD); u = (b, w_1, w_2, …, w_D)^T; z = the N-dimensional column vector whose entries are all -1. Matrix H is N×(D+1), vector u has (D+1) rows, vector z has N rows. Then Hu = ( -t_1 (b + w^T x_1), …, -t_N (b + w^T x_N) )^T. The n-th row of Hu is -t_n (w^T x_n + b), which should be ≤ -1. So the SVM constraints are equivalent to the quadratic programming constraint Hu ≤ z.

27 Quadratic Programming for SVMs. Quadratic programming: u_opt = argmin_u [ (1/2) u^T Q u + s^T u ]. SVM: w_opt = argmin_w (1/2) ||w||^2. Define: Q = the (D+1)×(D+1) matrix that is like the identity matrix, except that Q_11 = 0; u = (b, w_1, w_2, …, w_D)^T (already defined in the previous slides); s = the (D+1)-dimensional column vector of zeros. Then, (1/2) u^T Q u + s^T u = (1/2) ||w||^2.

28 Quadratic Programming for SVMs. Quadratic programming: u_opt = argmin_u [ (1/2) u^T Q u + s^T u ]. SVM: w_opt = argmin_w (1/2) ||w||^2. Alternative definitions that would NOT work: define Q = the D×D identity matrix, u = w = (w_1, …, w_D)^T, s = the D-dimensional zero vector. It still holds that (1/2) u^T Q u + s^T u = (1/2) ||w||^2. Why would these definitions not work?

29 Quadratic Programming for SVMs. Quadratic programming: u_opt = argmin_u [ (1/2) u^T Q u + s^T u ]. SVM: w_opt = argmin_w (1/2) ||w||^2. Alternative definitions that would NOT work: define Q = the D×D identity matrix, u = w = (w_1, …, w_D)^T, s = the D-dimensional zero vector. It still holds that (1/2) u^T Q u + s^T u = (1/2) ||w||^2. Why would these definitions not work? Vector u must also make Hu ≤ z match the SVM constraints. With this definition of u (which does not include b), no appropriate H and z can be found.

30 Quadratic Programming for SVMs. Quadratic programming: u_opt = argmin_u [ (1/2) u^T Q u + s^T u ], subject to the constraint Hu ≤ z. SVM goal: w_opt = argmin_w (1/2) ||w||^2, subject to the constraints: ∀n ∈ {1, …, N}: t_n (w^T x_n + b) ≥ 1. Task: define Q, s, H, z, u so that quadratic programming computes w_opt and b_opt. Q = the (D+1)×(D+1) matrix that is like the identity matrix, except that Q_11 = 0. s = the (D+1)-dimensional column vector of zeros. H = the N×(D+1) matrix whose n-th row is (-t_n, -t_n x_n1, …, -t_n x_nD). z = the N-dimensional column vector whose entries are all -1. u = (b, w_1, w_2, …, w_D)^T, with D+1 rows.

31 Quadratic Programming for SVMs. With Q, s, H, z, u defined as in the previous slide, quadratic programming takes as inputs Q, s, H, z, and outputs u_opt, from which we get the w_opt, b_opt values for our SVM.

32 Quadratic Programming for SVMs. With Q, s, H, z, u defined as above: w_opt is the vector of values at dimensions 2, …, D+1 of u_opt, and b_opt is the value at dimension 1 of u_opt.
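
Putting the last few slides together, here is a short illustrative construction of Q, s, H, z for the hard-margin primal problem (building on the hypothetical solve_qp helper sketched earlier), with w_opt and b_opt read off from u_opt:

```python
import numpy as np

def hard_margin_primal_qp(X, t):
    """Q, s, H, z of the slides for the hard-margin primal SVM,
    with u = (b, w_1, ..., w_D) and the QP constraint H u <= z."""
    N, D = X.shape
    Q = np.eye(D + 1)
    Q[0, 0] = 0.0                                       # do not penalize b
    s = np.zeros(D + 1)
    H = -t[:, None] * np.hstack([np.ones((N, 1)), X])   # row n: -t_n * (1, x_n)
    z = -np.ones(N)                                     # -t_n (w^T x_n + b) <= -1
    return Q, s, H, z

# u_opt = solve_qp(*hard_margin_primal_qp(X, t))
# b_opt, w_opt = u_opt[0], u_opt[1:]
```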

33 Solving the Same Problem, Again. So far, we have solved the problem of defining an SVM (i.e., defining w_opt and b_opt), so as to maximize the margin between linearly separable data. If this were all that SVMs could do, SVMs would not be that important. Linearly separable data are a rare case, and a very easy case to deal with.

34 Solving the Same Problem, Again. So far, we have solved the problem of defining an SVM (i.e., defining w_opt and b_opt), so as to maximize the margin between linearly separable data. We will see two extensions that make SVMs much more powerful. The extensions will allow SVMs to define highly non-linear decision boundaries, as in this figure. However, first we need to solve the same problem again: maximize the margin between linearly separable data. We will get a more complicated solution, but that solution will be easier to improve upon.

35 Lagrange Multipliers. Our new solutions are derived using Lagrange multipliers. Here is a quick review from multivariate calculus. Let x be a D-dimensional vector. Let f(x) and g(x) be functions from R^D to R; functions f and g map D-dimensional vectors to real numbers. Suppose that we want to minimize f(x), subject to the constraint that g(x) ≥ 0. Then, we can solve this problem using a Lagrange multiplier to define a Lagrangian function.

36 Lagrange Multipliers. To minimize f(x), subject to the constraint g(x) ≥ 0: We define the Lagrangian function: L(x, λ) = f(x) - λ g(x). λ is called a Lagrange multiplier, λ ≥ 0. We find x_opt = argmin_x L(x, λ), and a corresponding value for λ, subject to the following constraints: 1. g(x) ≥ 0, 2. λ ≥ 0, 3. λ g(x) = 0. If g(x_opt) > 0, the third constraint implies that λ = 0; then the constraint g(x) ≥ 0 is called inactive. If g(x_opt) = 0, then λ > 0; then the constraint g(x) ≥ 0 is called active.

37 Multiple Constraints. Suppose that we have N constraints: ∀n ∈ {1, …, N}: g_n(x) ≥ 0. We want to minimize f(x), subject to those constraints. Define vector λ = (λ_1, …, λ_N). Define the Lagrangian function as: L(x, λ) = f(x) - Σ_{n=1}^N λ_n g_n(x). We find x_opt = argmin_x L(x, λ), and a value for λ, subject to: ∀n: g_n(x) ≥ 0; ∀n: λ_n ≥ 0; ∀n: λ_n g_n(x) = 0.

38 Lagrange Dual Problems. We have N constraints: ∀n ∈ {1, …, N}: g_n(x) ≥ 0. We want to minimize f(x), subject to those constraints. Under some conditions (which are satisfied in our SVM problem), we can solve an alternative dual problem. Define the Lagrangian function as before: L(x, λ) = f(x) - Σ_{n=1}^N λ_n g_n(x). We find x_opt, and the best value for λ, denoted as λ_opt, by solving: λ_opt = argmax_λ [ min_x L(x, λ) ], x_opt = argmin_x L(x, λ_opt), subject to constraints: λ_n ≥ 0.

39 Lagrange Dual Problems. Lagrangian dual problem: solve λ_opt = argmax_λ [ min_x L(x, λ) ] and x_opt = argmin_x L(x, λ_opt), subject to constraints λ_n ≥ 0. This dual problem formulation will be used in training SVMs. The key thing to remember is: we minimize the Lagrangian L with respect to x, and we maximize L with respect to the Lagrange multipliers λ_n.

40 Lagrange Multipliers and SVMs. SVM goal: w_opt = argmin_w (1/2) ||w||^2, subject to constraints: ∀n ∈ {1, …, N}: t_n (w^T x_n + b) ≥ 1. To make the constraints more amenable to Lagrange multipliers, we rewrite them as: ∀n ∈ {1, …, N}: t_n (w^T x_n + b) - 1 ≥ 0. Define a = (a_1, …, a_N) to be a vector of Lagrange multipliers. Define the Lagrangian function: L(w, b, a) = (1/2) ||w||^2 - Σ_{n=1}^N a_n [ t_n (w^T x_n + b) - 1 ]. Remember from the previous slides: we minimize L with respect to w and b, and maximize L with respect to a.

41 Lagrange Multipliers and SVMs. The w and b that minimize L(w, b, a) must satisfy: ∂L/∂w = 0, ∂L/∂b = 0. ∂L/∂w = ∂/∂w [ (1/2) ||w||^2 - Σ_{n=1}^N a_n ( t_n (w^T x_n + b) - 1 ) ] = 0 ⟹ w - Σ_{n=1}^N a_n t_n x_n = 0 ⟹ w = Σ_{n=1}^N a_n t_n x_n.

42 Lagrange Multipliers and SVMs. The w and b that minimize L(w, b, a) must satisfy: ∂L/∂w = 0, ∂L/∂b = 0. ∂L/∂b = ∂/∂b [ (1/2) ||w||^2 - Σ_{n=1}^N a_n ( t_n (w^T x_n + b) - 1 ) ] = 0 ⟹ Σ_{n=1}^N a_n t_n = 0.

43 Lagrange Multipliers and SVMs. Our Lagrangian function is: L(w, b, a) = (1/2) ||w||^2 - Σ_{n=1}^N a_n [ t_n (w^T x_n + b) - 1 ]. We showed that w = Σ_{n=1}^N a_n t_n x_n. Using that, we get: (1/2) ||w||^2 = (1/2) w^T w = (1/2) ( Σ_{n=1}^N a_n t_n x_n )^T ( Σ_{m=1}^N a_m t_m x_m ) = (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m x_n^T x_m.

44 Lagrange Multipliers and SVMs. We showed that: (1/2) ||w||^2 = (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m x_n^T x_m. Define an N×N matrix Q such that Q_mn = t_n t_m x_n^T x_m. Remember that we have defined a = (a_1, …, a_N). Then, it follows that: (1/2) ||w||^2 = (1/2) a^T Q a.

45 Lagrange Multipliers and SVMs. L(w, b, a) = (1/2) a^T Q a - Σ_{n=1}^N a_n [ t_n (w^T x_n + b) - 1 ]. We showed that Σ_{n=1}^N a_n t_n = 0. Using that, we get: Σ_{n=1}^N a_n [ t_n (w^T x_n + b) - 1 ] = Σ_{n=1}^N a_n t_n w^T x_n + b Σ_{n=1}^N a_n t_n - Σ_{n=1}^N a_n = Σ_{n=1}^N a_n t_n w^T x_n - Σ_{n=1}^N a_n, since the middle term (shown in red on the slide) equals 0.

46 Lagrange Multipliers and SVMs. L(w, a) = (1/2) a^T Q a - Σ_{n=1}^N a_n t_n w^T x_n + Σ_{n=1}^N a_n. Function L now does not depend on b. We simplify more, using again the fact that w = Σ_{n=1}^N a_n t_n x_n: Σ_{n=1}^N a_n t_n w^T x_n = Σ_{n=1}^N a_n t_n ( Σ_{m=1}^N a_m t_m x_m )^T x_n = Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m x_m^T x_n = a^T Q a.

47 Lagrange Multipliers and SVMs. L(w, a) = (1/2) a^T Q a - a^T Q a + Σ_{n=1}^N a_n = Σ_{n=1}^N a_n - (1/2) a^T Q a. Function L does not depend on w anymore. We can rewrite L as a function whose only input is a: L(a) = Σ_{n=1}^N a_n - (1/2) a^T Q a. Remember, we want to maximize L(a) with respect to a.

48 Lagrange Multipliers and SVMs. By combining the results from the last few slides, our optimization problem becomes: Maximize L(a) = Σ_{n=1}^N a_n - (1/2) a^T Q a, subject to these constraints: a_n ≥ 0, Σ_{n=1}^N a_n t_n = 0.

49 Lagrange Multipliers and SVMs. L(a) = Σ_{n=1}^N a_n - (1/2) a^T Q a. We want to maximize L(a) subject to some constraints. Therefore, we want to find an a_opt such that: a_opt = argmax_a [ Σ_{n=1}^N a_n - (1/2) a^T Q a ] = argmin_a [ (1/2) a^T Q a - Σ_{n=1}^N a_n ], subject to those constraints.

50 SVM Optimization Problem. Our SVM optimization problem now is to find an a_opt such that: a_opt = argmin_a [ (1/2) a^T Q a - Σ_{n=1}^N a_n ], subject to these constraints: a_n ≥ 0, Σ_{n=1}^N a_n t_n = 0. This problem can be solved again using quadratic programming.

51 Using Quadratic Programming. Quadratic programming: u_opt = argmin_u [ (1/2) u^T Q u + s^T u ], subject to the constraint Hu ≤ z. SVM problem: find a_opt = argmin_a [ (1/2) a^T Q a - Σ_{n=1}^N a_n ], subject to constraints: a_n ≥ 0, Σ_{n=1}^N a_n t_n = 0. Again, we must find values for Q, s, H, z that convert the SVM problem into a quadratic programming problem. Note that we already have a matrix Q in the Lagrangian: it is the N×N matrix such that Q_mn = t_n t_m x_n^T x_m. We will use the same Q for quadratic programming.

52 Using Quadratic Programming. Quadratic programming: u_opt = argmin_u [ (1/2) u^T Q u + s^T u ], subject to the constraint Hu ≤ z. SVM problem: find a_opt = argmin_a [ (1/2) a^T Q a - Σ_{n=1}^N a_n ], subject to constraints: a_n ≥ 0, Σ_{n=1}^N a_n t_n = 0. Define: u = a, and define the N-dimensional vector s = (-1, -1, …, -1)^T. Then, (1/2) u^T Q u + s^T u = (1/2) a^T Q a - Σ_{n=1}^N a_n. We have mapped the SVM minimization goal of finding a_opt to the quadratic programming minimization goal.

53 Using Quadratic Programming. Quadratic programming constraint: Hu ≤ z. SVM problem constraints: a_n ≥ 0, Σ_{n=1}^N a_n t_n = 0. Define: H = the (N+2)×N matrix whose first N rows are the negation of the identity matrix, whose row N+1 is the transpose of the vector t of target outputs, and whose row N+2 is the negation of the previous row. z = the (N+2)-dimensional column vector of zeros.

54 Using Quadratic Programming. Quadratic programming constraint: Hu ≤ z. SVM problem constraints: a_n ≥ 0, Σ_{n=1}^N a_n t_n = 0. With H and z defined as in the previous slide, and since we defined u = a: for n ≤ N, the n-th row of Hu equals -a_n, so the n-th rows of Hu and z specify that -a_n ≤ 0, i.e., a_n ≥ 0. The (N+1)-th row of Hu and z specifies that Σ_{n=1}^N a_n t_n ≤ 0. The (N+2)-th row of Hu and z specifies that -Σ_{n=1}^N a_n t_n ≤ 0, i.e., Σ_{n=1}^N a_n t_n ≥ 0.

55 Using Quadratic Programming. Quadratic programming constraint: Hu ≤ z. SVM problem constraints: a_n ≥ 0, Σ_{n=1}^N a_n t_n = 0. With H and z defined as above, and since we defined u = a: for n ≤ N, the n-th rows of Hu and z specify that a_n ≥ 0, and the last two rows of Hu and z specify that Σ_{n=1}^N a_n t_n = 0. We have mapped the SVM constraints to the quadratic programming constraint Hu ≤ z.
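
As a companion to the primal sketch earlier, here is an illustrative construction of Q, s, H, z for the dual problem (again using the hypothetical solve_qp helper and the Hu ≤ z convention):

```python
import numpy as np

def hard_margin_dual_qp(X, t):
    """Q, s, H, z of the slides for the hard-margin dual SVM,
    with u = a (the Lagrange multipliers) and the QP constraint H u <= z."""
    N = X.shape[0]
    Q = (t[:, None] * t[None, :]) * (X @ X.T)   # Q_mn = t_n t_m x_n^T x_m
    s = -np.ones(N)                             # yields -sum_n a_n in the objective
    H = np.vstack([-np.eye(N),                  # -a_n <= 0, i.e., a_n >= 0
                   t.reshape(1, N),             #  sum_n a_n t_n <= 0
                   -t.reshape(1, N)])           # -sum_n a_n t_n <= 0
    z = np.zeros(N + 2)
    return Q, s, H, z

# a = solve_qp(*hard_margin_dual_qp(X, t))
```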

56 Interpretation of the Solution. Quadratic programming, given the inputs Q, s, H, z defined in the previous slides, outputs the optimal value for vector a. We used this Lagrangian function: L(w, b, a) = (1/2) ||w||^2 - Σ_{n=1}^N a_n [ t_n (w^T x_n + b) - 1 ]. Remember, when we find x_opt to minimize any Lagrangian L(x, λ) = f(x) - Σ_{n=1}^N λ_n g_n(x), one of the constraints we enforce is that ∀n: λ_n g_n(x) = 0. In the SVM case, this means: ∀n: a_n [ t_n (w^T x_n + b) - 1 ] = 0.

57 Interpretation of the Solution. In the SVM case, it holds that: ∀n: a_n [ t_n (w^T x_n + b) - 1 ] = 0. What does this mean? Mathematically, for every n: either a_n = 0, or t_n (w^T x_n + b) = 1. If t_n (w^T x_n + b) = 1, what does that imply for x_n?

58 Interpretation of the Solution. In the SVM case, it holds that: ∀n: a_n [ t_n (w^T x_n + b) - 1 ] = 0. What does this mean? Mathematically, for every n: either a_n = 0, or t_n (w^T x_n + b) = 1. If t_n (w^T x_n + b) = 1, what does that imply for x_n? Remember, many slides back, we imposed the constraint that ∀n: t_n (w^T x_n + b) ≥ 1. Equality t_n (w^T x_n + b) = 1 holds only for the support vectors. Therefore, a_n > 0 only for the support vectors.

59 Support Vectors. This is an example we have seen before. The three support vectors have a black circle around them. When we find the optimal values a_1, a_2, …, a_N for this problem, only three of those values will be non-zero. If x_n is not a support vector, then a_n = 0.

60 Interpretation of the Solution. We showed before that: w = Σ_{n=1}^N a_n t_n x_n. This means that w is a linear combination of the training data. However, since a_n > 0 only for the support vectors, obviously only the support vectors influence w.

61 Computing b. We have w = Σ_{n=1}^N a_n t_n x_n. Define the set S = { n | x_n is a support vector }. Since a_n > 0 only for the support vectors, we get: w = Σ_{n∈S} a_n t_n x_n. If x_n is a support vector, then t_n (w^T x_n + b) = 1. Substituting w = Σ_{m∈S} a_m t_m x_m, if x_n is a support vector: t_n ( Σ_{m∈S} a_m t_m x_m^T x_n + b ) = 1.

62 Computing b. t_n ( Σ_{m∈S} a_m t_m x_m^T x_n + b ) = 1. Remember that t_n can only take the values -1 and 1. Therefore, t_n^2 = 1. Multiplying both sides of the equation with t_n, we get: t_n t_n ( Σ_{m∈S} a_m t_m x_m^T x_n + b ) = t_n ⟹ Σ_{m∈S} a_m t_m x_m^T x_n + b = t_n ⟹ b = t_n - Σ_{m∈S} a_m t_m x_m^T x_n.

63 Computing b. Thus, if x_n is a support vector, we can compute b with the formula: b = t_n - Σ_{m∈S} a_m t_m x_m^T x_n. To avoid numerical problems, instead of using a single support vector to calculate b, we can use all support vectors (and take the average of the computed values for b). If |S| is the number of support vectors, then: b = (1/|S|) Σ_{n∈S} [ t_n - Σ_{m∈S} a_m t_m x_m^T x_n ].
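
A small illustrative sketch of this averaged computation of b (the threshold tol is an assumption, used to decide which a_n count as non-zero):

```python
import numpy as np

def compute_b(X, t, a, tol=1e-8):
    """b = (1/|S|) * sum_{n in S} ( t_n - sum_{m in S} a_m t_m x_m^T x_n )."""
    S = np.where(a > tol)[0]          # indices of the support vectors
    K = X[S] @ X[S].T                 # dot products x_m^T x_n among support vectors
    return float(np.mean(t[S] - K @ (a[S] * t[S])))
```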

64 Classification Using a. To classify a test object x, we can use the original formula y(x) = w^T x + b. However, since w = Σ_{n=1}^N a_n t_n x_n, we can substitute that formula for w, and classify x using: y(x) = Σ_{n∈S} a_n t_n x_n^T x + b. This formula will be our only choice when we use SVMs that produce nonlinear boundaries. Details on that will be coming later in this presentation.
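
And a matching sketch of the decision function; the predicted class is the sign of y(x):

```python
import numpy as np

def decision_value(x, X, t, a, b, tol=1e-8):
    """y(x) = sum_{n in S} a_n t_n x_n^T x + b."""
    S = np.where(a > tol)[0]
    return float((a[S] * t[S]) @ (X[S] @ x)) + b

# predicted label: +1 if decision_value(...) > 0, else -1
```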

65 Recap of Lagrangian-Based Solution. We defined the Lagrangian function, which became (after simplifications): L(a) = Σ_{n=1}^N a_n - (1/2) a^T Q a. Using quadratic programming, we find the value of a that maximizes L(a), subject to constraints: a_n ≥ 0, Σ_{n=1}^N a_n t_n = 0. Then: w = Σ_{n∈S} a_n t_n x_n, and b = (1/|S|) Σ_{n∈S} [ t_n - Σ_{m∈S} a_m t_m x_m^T x_n ].

66 Why Did We Do All This? The Lagrangian-based solution solves a problem that we had already solved, in a more complicated way. Typically, we prefer simpler solutions. However, this more complicated solution can be tweaked relatively easily, to produce more powerful SVMs that: Can be trained even if the training data is not linearly separable. Can produce nonlinear decision boundaries. 66

67 The Not Linearly Separable Case. Our previous formulation required that the training examples are linearly separable. This requirement was encoded in the constraint: ∀n ∈ {1, …, N}: t_n (w^T x_n + b) ≥ 1. According to that constraint, every x_n has to be on the correct side of the boundary. To handle data that is not linearly separable, we introduce variables ξ_n ≥ 0, to define a modified constraint: ∀n ∈ {1, …, N}: t_n (w^T x_n + b) ≥ 1 - ξ_n. These variables ξ_n are called slack variables. The values for ξ_n will be computed during optimization.

68 The Meaning of Slack Variables. To handle data that is not linearly separable, we introduce slack variables ξ_n ≥ 0, to define a modified constraint: ∀n ∈ {1, …, N}: t_n (w^T x_n + b) ≥ 1 - ξ_n. If ξ_n = 0, then x_n is where it should be: either on the blue line for objects of its class, or on the correct side of that blue line. If 0 < ξ_n < 1, then x_n is too close to the decision boundary: between the red line and the blue line for objects of its class. If ξ_n = 1, x_n is on the red line. If ξ_n > 1, x_n is misclassified (on the wrong side of the red line).

69 The Meaning of Slack Variables. If the training data is not linearly separable, we introduce slack variables ξ_n ≥ 0, to define a modified constraint: ∀n ∈ {1, …, N}: t_n (w^T x_n + b) ≥ 1 - ξ_n. If ξ_n = 0, then x_n is where it should be: either on the blue line for objects of its class, or on the correct side of that blue line. If 0 < ξ_n < 1, then x_n is too close to the decision boundary: between the red line and the blue line for objects of its class. If ξ_n = 1, x_n is on the decision boundary (the red line). If ξ_n > 1, x_n is misclassified (on the wrong side of the red line).

70 Optimization Criterion. Before, we minimized (1/2) ||w||^2, subject to constraints: ∀n ∈ {1, …, N}: t_n (w^T x_n + b) ≥ 1. Now, we want to minimize the error function: C Σ_{n=1}^N ξ_n + (1/2) ||w||^2, subject to constraints: ∀n ∈ {1, …, N}: t_n (w^T x_n + b) ≥ 1 - ξ_n and ξ_n ≥ 0. C is a parameter that we pick manually, C ≥ 0. C controls the trade-off between maximizing the margin and penalizing training examples that violate the margin. The higher ξ_n is, the farther away x_n is from where it should be. If x_n is on the correct side of the decision boundary, and the distance of x_n to the boundary is greater than or equal to the margin, then ξ_n = 0, and x_n does not contribute to the error function.

71 Lagrangian Function. To do our constrained optimization, we define the Lagrangian: L(w, b, ξ, a, μ) = (1/2) ||w||^2 + C Σ_{n=1}^N ξ_n - Σ_{n=1}^N a_n [ t_n y(x_n) - 1 + ξ_n ] - Σ_{n=1}^N μ_n ξ_n. The Lagrange multipliers are now a_n and μ_n. The term a_n [ t_n y(x_n) - 1 + ξ_n ] in the Lagrangian corresponds to the constraint t_n y(x_n) - 1 + ξ_n ≥ 0. The term μ_n ξ_n in the Lagrangian corresponds to the constraint ξ_n ≥ 0.

72 Lagrangian Function. Lagrangian L(w, b, ξ, a, μ) = (1/2) ||w||^2 + C Σ_{n=1}^N ξ_n - Σ_{n=1}^N a_n [ t_n y(x_n) - 1 + ξ_n ] - Σ_{n=1}^N μ_n ξ_n. Constraints: a_n ≥ 0, μ_n ≥ 0, ξ_n ≥ 0, t_n y(x_n) - 1 + ξ_n ≥ 0, a_n [ t_n y(x_n) - 1 + ξ_n ] = 0, μ_n ξ_n = 0.

73 Lagrangian Function. Lagrangian L(w, b, ξ, a, μ) = (1/2) ||w||^2 + C Σ_{n=1}^N ξ_n - Σ_{n=1}^N a_n [ t_n y(x_n) - 1 + ξ_n ] - Σ_{n=1}^N μ_n ξ_n. As before, we can simplify the Lagrangian significantly: ∂L/∂w = 0 ⟹ w = Σ_{n=1}^N a_n t_n x_n; ∂L/∂ξ_n = 0 ⟹ C - a_n - μ_n = 0 ⟹ a_n = C - μ_n; ∂L/∂b = 0 ⟹ Σ_{n=1}^N a_n t_n = 0.

74 Lagrangian Function. Lagrangian L(w, b, ξ, a, μ) = (1/2) ||w||^2 + C Σ_{n=1}^N ξ_n - Σ_{n=1}^N a_n [ t_n y(x_n) - 1 + ξ_n ] - Σ_{n=1}^N μ_n ξ_n. In computing L(w, b, ξ, a, μ), ξ_n contributes the following value: C ξ_n - a_n ξ_n - μ_n ξ_n = ξ_n (C - a_n - μ_n). As we saw in the previous slide, C - a_n - μ_n = 0. Therefore, we can eliminate all those occurrences of ξ_n.

75 Lagrangian Function. Lagrangian L(w, b, ξ, a, μ) = (1/2) ||w||^2 + C Σ_{n=1}^N ξ_n - Σ_{n=1}^N a_n [ t_n y(x_n) - 1 + ξ_n ] - Σ_{n=1}^N μ_n ξ_n. In computing L(w, b, ξ, a, μ), ξ_n contributes the following value: C ξ_n - a_n ξ_n - μ_n ξ_n = ξ_n (C - a_n - μ_n). As we saw in the previous slide, C - a_n - μ_n = 0. Therefore, we can eliminate all those occurrences of ξ_n. By eliminating ξ_n we also eliminate all appearances of C and μ_n.

76 Lagrangian Function. By eliminating ξ_n and μ_n, we get: L(w, b, a) = (1/2) ||w||^2 - Σ_{n=1}^N a_n [ t_n (w^T x_n + b) - 1 ]. This is exactly the Lagrangian we had for the linearly separable case. Following the same steps as in the linearly separable case, we can simplify even more, to: L(a) = Σ_{n=1}^N a_n - (1/2) a^T Q a.

77 Lagrangian Function. So, we have ended up with the same Lagrangian as in the linearly separable case: L(a) = Σ_{n=1}^N a_n - (1/2) a^T Q a. There is a small difference, however, in the constraints. Linearly separable case: 0 ≤ a_n, Σ_{n=1}^N a_n t_n = 0. Linearly inseparable case: 0 ≤ a_n ≤ C, Σ_{n=1}^N a_n t_n = 0.

78 Lagrangian Function. So, we have ended up with the same Lagrangian as in the linearly separable case: L(a) = Σ_{n=1}^N a_n - (1/2) a^T Q a. Where does the constraint 0 ≤ a_n ≤ C come from? 0 ≤ a_n because a_n is a Lagrange multiplier. a_n ≤ C comes from the fact, shown earlier, that a_n = C - μ_n. Since μ_n is also a Lagrange multiplier, μ_n ≥ 0 and thus a_n ≤ C.

79 Using Quadratic Programming. Quadratic programming: u_opt = argmin_u [ (1/2) u^T Q u + s^T u ], subject to the constraint Hu ≤ z. SVM problem: find a_opt = argmin_a [ (1/2) a^T Q a - Σ_{n=1}^N a_n ], subject to constraints: 0 ≤ a_n ≤ C, Σ_{n=1}^N a_n t_n = 0. Again, we must find values for Q, s, H, z that convert the SVM problem into a quadratic programming problem. The values for Q and s are the same as in the linearly separable case, since in both cases we minimize (1/2) a^T Q a - Σ_{n=1}^N a_n.

80 Using Quadratic Programming. Quadratic programming: u_opt = argmin_u [ (1/2) u^T Q u + s^T u ], subject to the constraint Hu ≤ z. SVM problem: find a_opt = argmin_a [ (1/2) a^T Q a - Σ_{n=1}^N a_n ], subject to constraints: 0 ≤ a_n ≤ C, Σ_{n=1}^N a_n t_n = 0. Q is the N×N matrix such that Q_mn = t_n t_m x_n^T x_m, u = a, and s = (-1, -1, …, -1)^T.

81 Using Quadratic Programming. Quadratic programming constraint: Hu ≤ z. SVM problem constraints: 0 ≤ a_n ≤ C, Σ_{n=1}^N a_n t_n = 0. Define: H = the (2N+2)×N matrix whose rows 1 to N are the negation of the identity matrix, whose rows N+1 to 2N are the identity matrix, whose row 2N+1 is the transpose of the vector t of target outputs, and whose row 2N+2 is the negation of the previous row. z = the (2N+2)-dimensional column vector whose rows 1 to N are set to 0, rows N+1 to 2N are set to C, and rows 2N+1 and 2N+2 are set to 0.

82 Using Quadratic Programming. Quadratic programming constraint: Hu ≤ z. SVM problem constraints: 0 ≤ a_n ≤ C, Σ_{n=1}^N a_n t_n = 0. With H and z defined as in the previous slide: if 1 ≤ n ≤ N, the n-th row of Hu is -a_n, and the n-th row of z is 0. Thus, the n-th rows of Hu and z capture the constraint -a_n ≤ 0, i.e., a_n ≥ 0.

83 Using Quadratic Programming. Quadratic programming constraint: Hu ≤ z. SVM problem constraints: 0 ≤ a_n ≤ C, Σ_{n=1}^N a_n t_n = 0. With H and z defined as above: if N+1 ≤ n ≤ 2N, the n-th row of Hu is a_{n-N}, and the n-th row of z is C. Thus, the n-th rows of Hu and z capture the constraint a_{n-N} ≤ C.

84 Using Quadratic Programming. Quadratic programming constraint: Hu ≤ z. SVM problem constraints: 0 ≤ a_n ≤ C, Σ_{n=1}^N a_n t_n = 0. With H and z defined as above: if n = 2N+1, the n-th row of Hu is Σ_{m=1}^N a_m t_m, and the n-th row of z is 0. Thus, the n-th rows of Hu and z capture the constraint Σ_{n=1}^N a_n t_n ≤ 0.

85 Using Quadratic Programming. Quadratic programming constraint: Hu ≤ z. SVM problem constraints: 0 ≤ a_n ≤ C, Σ_{n=1}^N a_n t_n = 0. With H and z defined as above: if n = 2N+2, the n-th row of Hu is -Σ_{m=1}^N a_m t_m, and the n-th row of z is 0. Thus, the n-th rows of Hu and z capture the constraint -Σ_{n=1}^N a_n t_n ≤ 0, i.e., Σ_{n=1}^N a_n t_n ≥ 0.

86 Using Quadratic Programming. Quadratic programming constraint: Hu ≤ z. SVM problem constraints: 0 ≤ a_n ≤ C, Σ_{n=1}^N a_n t_n = 0. With H and z defined as above, the last two rows of Hu and z in combination capture the constraint Σ_{n=1}^N a_n t_n = 0.
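
The only change relative to the hard-margin dual sketch is the extra block of rows implementing a_n ≤ C. An illustrative construction, under the same assumptions as before:

```python
import numpy as np

def soft_margin_dual_qp(X, t, C):
    """Q, s, H, z for the soft-margin dual (box constraints 0 <= a_n <= C),
    with u = a and the QP constraint H u <= z."""
    N = X.shape[0]
    Q = (t[:, None] * t[None, :]) * (X @ X.T)
    s = -np.ones(N)
    H = np.vstack([-np.eye(N),          # -a_n <= 0, i.e., a_n >= 0
                   np.eye(N),           #  a_n <= C
                   t.reshape(1, N),     #  sum_n a_n t_n <= 0
                   -t.reshape(1, N)])   # -sum_n a_n t_n <= 0
    z = np.concatenate([np.zeros(N), C * np.ones(N), np.zeros(2)])
    return Q, s, H, z
```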

87 Using the Solution. Quadratic programming, given the inputs Q, s, H, z defined in the previous slides, outputs the optimal value for vector a. The value of b is computed as in the linearly separable case: b = (1/|S|) Σ_{n∈S} [ t_n - Σ_{m∈S} a_m t_m x_m^T x_n ]. Classification of an input x is done as in the linearly separable case: y(x) = Σ_{n∈S} a_n t_n x_n^T x + b.

88 The Disappearing w. We have used vector w to define the decision boundary y(x) = w^T x + b. However, w does not need to be either computed or used. Quadratic programming computes the values a_n. Using those values a_n, we compute b: b = (1/|S|) Σ_{n∈S} [ t_n - Σ_{m∈S} a_m t_m x_m^T x_n ]. Using those values for a_n and b, we can classify test objects x: y(x) = Σ_{n∈S} a_n t_n x_n^T x + b. If we know the values for a_n and b, we do not need w.

89 The Role of Training Inputs. Where, during training, do we use input vectors? Where, during classification, do we use input vectors? Overall, input vectors are only used in two formulas: 1. During training, we use vectors x_n to define matrix Q, where: Q_mn = t_n t_m x_n^T x_m. 2. During classification, we use training vectors x_n and the test input x in this formula: y(x) = Σ_{n∈S} a_n t_n x_n^T x + b. In both formulas, input vectors are only used by taking their dot products.

90 The Role of Training Inputs. To make this more clear, define a kernel function k(x, x′) as: k(x, x′) = x^T x′. During training, we use vectors x_n to define matrix Q, where: Q_mn = t_n t_m k(x_n, x_m). During classification, we use training vectors x_n and the test input x in this formula: y(x) = Σ_{n∈S} a_n t_n k(x_n, x) + b. In both formulas, input vectors are used only through function k.

91 The Kernel Trick. We have defined the kernel function k(x, x′) = x^T x′. During training, we use vectors x_n to define matrix Q, where: Q_mn = t_n t_m k(x_n, x_m). During classification, we use this formula: y(x) = Σ_{n∈S} a_n t_n k(x_n, x) + b. What if we defined k(x, x′) differently? The SVM formulation (both for training and for classification) would remain exactly the same. In the SVM formulation, the kernel k(x, x′) is a black box. You can define k(x, x′) any way you like.
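
A minimal sketch of how the kernel enters both formulas; the helper names are illustrative, and any function of two inputs can be passed as the kernel:

```python
import numpy as np

def linear_kernel(x, z):
    """k(x, z) = x^T z; swapping this function changes the decision boundary."""
    return float(x @ z)

def build_Q(X, t, kernel):
    """Q_mn = t_n t_m k(x_n, x_m), used when training the dual SVM."""
    N = X.shape[0]
    K = np.array([[kernel(X[n], X[m]) for m in range(N)] for n in range(N)])
    return (t[:, None] * t[None, :]) * K

def kernel_decision_value(x, X, t, a, b, kernel, tol=1e-8):
    """y(x) = sum_{n in S} a_n t_n k(x_n, x) + b."""
    S = np.where(a > tol)[0]
    return sum(a[n] * t[n] * kernel(X[n], x) for n in S) + b
```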

92 A Different Kernel. Let x = (x_1, x_2) and z = (z_1, z_2) be 2-dimensional vectors. Consider this alternative definition for the kernel: k(x, z) = (1 + x^T z)^2. Then: k(x, z) = (1 + x^T z)^2 = (1 + x_1 z_1 + x_2 z_2)^2 = 1 + 2 x_1 z_1 + 2 x_2 z_2 + (x_1)^2 (z_1)^2 + 2 x_1 z_1 x_2 z_2 + (x_2)^2 (z_2)^2. Suppose we define a basis function φ(x) as: φ(x) = (1, √2 x_1, √2 x_2, (x_1)^2, √2 x_1 x_2, (x_2)^2). It is easy to verify that k(x, z) = φ(x)^T φ(z).
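
The "easy to verify" step can also be checked numerically. A small illustrative check (the point values are arbitrary):

```python
import numpy as np

def phi(x):
    """Basis function matching the quadratic kernel (1 + x^T z)^2 for 2-D inputs."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x, z = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(np.isclose((1 + x @ z) ** 2, phi(x) @ phi(z)))   # True
```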

93 Kernels and Basis Functions. In general, kernels make it easy to incorporate basis functions into SVMs: Define φ(x) any way you like. Define k(x, z) = φ(x)^T φ(z). The kernel function represents a dot product, but in a (typically) higher-dimensional feature space compared to the original space of x and z.

94 Polynomial Kernels. Let x and z be D-dimensional vectors. A polynomial kernel of degree d is defined as: k(x, z) = (c + x^T z)^d. The kernel k(x, z) = (1 + x^T z)^2 that we saw a couple of slides back was a quadratic kernel. Parameter c controls the trade-off between the influence of higher-order and lower-order terms: increasing values of c give increasing influence to lower-order terms.

95 Polynomial Kernels An Easy Case Decision boundary with polynomial kernel of degree 1. This is identical to the result using the standard dot product as kernel. 95

96 Polynomial Kernels An Easy Case Decision boundary with polynomial kernel of degree 2. The decision boundary is not linear anymore. 96

97 Polynomial Kernels An Easy Case Decision boundary with polynomial kernel of degree 3. 97

98 Polynomial Kernels A Harder Case Decision boundary with polynomial kernel of degree 1. 98

99 Polynomial Kernels A Harder Case Decision boundary with polynomial kernel of degree 2. 99

100 Polynomial Kernels A Harder Case Decision boundary with polynomial kernel of degree

101 Polynomial Kernels A Harder Case Decision boundary with polynomial kernel of degree

102 Polynomial Kernels A Harder Case Decision boundary with polynomial kernel of degree

103 Polynomial Kernels A Harder Case Decision boundary with polynomial kernel of degree

104 Polynomial Kernels A Harder Case Decision boundary with polynomial kernel of degree

105 Polynomial Kernels A Harder Case Decision boundary with polynomial kernel of degree

106 Polynomial Kernels A Harder Case Decision boundary with polynomial kernel of degree

107 Polynomial Kernels A Harder Case Decision boundary with polynomial kernel of degree

108 Polynomial Kernels A Harder Case Decision boundary with polynomial kernel of degree

109 Polynomial Kernels A Harder Case Decision boundary with polynomial kernel of degree

110 RBF/Gaussian Kernels. The Radial Basis Function (RBF) kernel, also known as the Gaussian kernel, is defined as: k_σ(x, z) = exp( -||x - z||^2 / (2σ^2) ). Given σ, the value of k_σ(x, z) only depends on the distance between x and z. k_σ(x, z) decreases exponentially with the distance between x and z. Parameter σ is chosen manually; it specifies how fast k_σ(x, z) decreases as x moves away from z.
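
As a sketch, the RBF kernel is simply one more function that can be plugged into the kernel-based training and classification formulas above:

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """k_sigma(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    d = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-(d @ d) / (2.0 * sigma ** 2)))

# e.g., Q = build_Q(X, t, lambda x, z: rbf_kernel(x, z, sigma=0.5))
```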

111 RBF Output Vs. Distance. X axis: distance between x and z. Y axis: k_σ(x, z), with σ =

112 RBF Output Vs. Distance. X axis: distance between x and z. Y axis: k_σ(x, z), with σ =

113 RBF Output Vs. Distance. X axis: distance between x and z. Y axis: k_σ(x, z), with σ = 1.

114 RBF Output Vs. Distance. X axis: distance between x and z. Y axis: k_σ(x, z), with σ =

115 RBF Kernels An Easier Dataset Decision boundary with a linear kernel. 115

116 RBF Kernels An Easier Dataset. Decision boundary with an RBF kernel. For this dataset, this is a relatively large value for σ, and it produces a boundary that is almost linear.

117 RBF Kernels An Easier Dataset. Decision boundary with an RBF kernel. Decreasing the value of σ leads to less linear boundaries.

118 RBF Kernels An Easier Dataset. Decision boundary with an RBF kernel. Decreasing the value of σ leads to less linear boundaries.

119 RBF Kernels An Easier Dataset. Decision boundary with an RBF kernel. Decreasing the value of σ leads to less linear boundaries.

120 RBF Kernels An Easier Dataset. Decision boundary with an RBF kernel. Decreasing the value of σ leads to less linear boundaries.

121 RBF Kernels An Easier Dataset. Decision boundary with an RBF kernel. Decreasing the value of σ leads to less linear boundaries.

122 RBF Kernels An Easier Dataset. Decision boundary with an RBF kernel. Note that smaller values of σ increase dangers of overfitting.

123 RBF Kernels An Easier Dataset. Decision boundary with an RBF kernel. Note that smaller values of σ increase dangers of overfitting.

124 RBF Kernels A Harder Dataset Decision boundary with a linear kernel. 124

125 RBF Kernels A Harder Dataset Decision boundary with an RBF kernel. The boundary is almost linear. 125

126 RBF Kernels A Harder Dataset Decision boundary with an RBF kernel. The boundary now is clearly nonlinear. 126

127 RBF Kernels A Harder Dataset Decision boundary with an RBF kernel. 127

128 RBF Kernels A Harder Dataset Decision boundary with an RBF kernel. 128

129 RBF Kernels A Harder Dataset. Decision boundary with an RBF kernel. Again, smaller values of σ increase dangers of overfitting.

130 RBF Kernels A Harder Dataset. Decision boundary with an RBF kernel. Again, smaller values of σ increase dangers of overfitting.

131 RBF Kernels and Basis Functions. The RBF kernel is defined as: k_σ(x, z) = exp( -||x - z||^2 / (2σ^2) ). Is the RBF kernel equivalent to taking the dot product in some feature space? In other words, is there any basis function φ such that k_σ(x, z) = φ(x)^T φ(z)? The answer is "yes": there exists such a function φ, but its output is infinite-dimensional. We will not get into more details here.

132 Kernels for Non-Vector Data. So far, all our methods have been taking real-valued vectors as input. The inputs have been elements of R^D, for some D ≥ 1. However, there are many interesting problems where the input is not such real-valued vectors. Examples???

133 Kernels for Non-Vector Data. So far, all our methods have been taking real-valued vectors as input. The inputs have been elements of R^D, for some D ≥ 1. However, there are many interesting problems where the input is not such real-valued vectors. The inputs can be strings, like "cat", "elephant", "dog". The inputs can be sets, such as {1, 5, 3}, {5, 1, 2, 7}. Sets are NOT vectors; they have very different properties. As sets, {1, 5, 3} = {5, 3, 1}. As vectors, (1, 5, 3) ≠ (5, 3, 1). There are many other types of non-vector data. SVMs can be applied to such data, as long as we define an appropriate kernel function.

134 Training Time Complexity. Solving the quadratic programming problem takes O(d^3) time, where d is the dimensionality of vector u. In other words, d is the number of values we want to estimate. If we directly estimate w and b, then we estimate D+1 values. The time complexity is O(D^3), where D is the dimensionality of input vectors x (or the dimensionality of φ(x), if we use a basis function). If we use quadratic programming to compute vector a = (a_1, …, a_N), then we estimate N values. The time complexity is O(N^3), where N is the number of training inputs. Which one is faster?

135 Training Time Complexity. If we use quadratic programming to compute directly w and b, then the time complexity is O(D^3). If we use no basis function, D is the dimensionality of input vectors x. If we use a basis function φ, D is the dimensionality of φ(x). If we use quadratic programming to compute vector a = (a_1, …, a_N), the time complexity is O(N^3). For linear SVMs (i.e., SVMs with linear kernels, that use the regular dot product), usually D is much smaller than N. If we use RBF kernels, φ(x) is infinite-dimensional, so computing w directly is not an option, but computing vector a still takes O(N^3) time. If you want to use the kernel trick, then there is no choice: it takes O(N^3) time to do training.

136 SVMs for Multiclass Problems. As usual, you can always train one-vs.-all SVMs if there are more than two classes. Other, more complicated methods are also available. You can also train what is called "all-pairs" classifiers: each SVM is trained to discriminate between two classes, and the number of SVMs is quadratic to the number of classes. All-pairs classifiers can be used in different ways to classify an input object. Each pairwise classifier votes for one of the two classes it was trained on; classification time is then quadratic to the number of classes. There is an alternative method, called DAGSVM, where the all-pairs classifiers are organized in a directed acyclic graph, and classification time is linear to the number of classes.
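
For completeness, here is a small illustrative sketch of the two standard strategies, assuming the scikit-learn library is available (an assumption, not something the slides rely on); its SVC class trains all-pairs classifiers internally for multiclass data, and OneVsRestClassifier wraps a binary SVM into a one-vs.-all scheme:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = rng.integers(0, 3, size=60)               # three toy classes

# One-vs.-all: one binary SVM per class; the highest decision value wins.
ova = OneVsRestClassifier(SVC(kernel='rbf', C=1.0, gamma=0.5)).fit(X, y)

# All-pairs (one-vs.-one) is what SVC uses on its own for multiclass data.
ovo = SVC(kernel='rbf', C=1.0, gamma=0.5).fit(X, y)
print(ova.predict(X[:5]), ovo.predict(X[:5]))
```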

137 SVMs: Recap. Advantages: Training finds the globally best solution, so there is no need to worry about local optima or iterations. SVMs can define complex decision boundaries. Disadvantages: Training time is cubic to the number of training data, which makes it hard to apply SVMs to large datasets. High-dimensional kernels increase the risk of overfitting; usually larger training sets help reduce overfitting, but SVMs cannot be applied to large training sets due to the cubic time complexity. Some choices must still be made manually: the choice of kernel function, and the choice of parameter C in the formula C Σ_{n=1}^N ξ_n + (1/2) ||w||^2.


More information

Introduction to Machine Learning

Introduction to Machine Learning 1, DATA11002 Introduction to Machine Learning Lecturer: Teemu Roos TAs: Ville Hyvönen and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer

More information

Support Vector and Kernel Methods

Support Vector and Kernel Methods SIGIR 2003 Tutorial Support Vector and Kernel Methods Thorsten Joachims Cornell University Computer Science Department tj@cs.cornell.edu http://www.joachims.org 0 Linear Classifiers Rules of the Form:

More information

Support Vector Machines

Support Vector Machines Two SVM tutorials linked in class website (please, read both): High-level presentation with applications (Hearst 1998) Detailed tutorial (Burges 1998) Support Vector Machines Machine Learning 10701/15781

More information

Support Vector Machines.

Support Vector Machines. Support Vector Machines www.cs.wisc.edu/~dpage 1 Goals for the lecture you should understand the following concepts the margin slack variables the linear support vector machine nonlinear SVMs the kernel

More information

Introduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Introduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Introduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Module - 5 Lecture - 22 SVM: The Dual Formulation Good morning.

More information

Lecture No. 1 Introduction to Method of Weighted Residuals. Solve the differential equation L (u) = p(x) in V where L is a differential operator

Lecture No. 1 Introduction to Method of Weighted Residuals. Solve the differential equation L (u) = p(x) in V where L is a differential operator Lecture No. 1 Introduction to Method of Weighted Residuals Solve the differential equation L (u) = p(x) in V where L is a differential operator with boundary conditions S(u) = g(x) on Γ where S is a differential

More information

Linear Models for Regression

Linear Models for Regression Linear Models for Regression CSE 4309 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 The Regression Problem Training data: A set of input-output

More information

2.4 Error Analysis for Iterative Methods

2.4 Error Analysis for Iterative Methods 2.4 Error Analysis for Iterative Methods 1 Definition 2.7. Order of Convergence Suppose {pp nn } nn=0 is a sequence that converges to pp with pp nn pp for all nn. If positive constants λλ and αα exist

More information

Lecture 6. Notes on Linear Algebra. Perceptron

Lecture 6. Notes on Linear Algebra. Perceptron Lecture 6. Notes on Linear Algebra. Perceptron COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Andrey Kan Copyright: University of Melbourne This lecture Notes on linear algebra Vectors

More information

Announcements - Homework

Announcements - Homework Announcements - Homework Homework 1 is graded, please collect at end of lecture Homework 2 due today Homework 3 out soon (watch email) Ques 1 midterm review HW1 score distribution 40 HW1 total score 35

More information

Multiclass Classification-1

Multiclass Classification-1 CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass

More information

Support Vector Machine. Industrial AI Lab. Prof. Seungchul Lee

Support Vector Machine. Industrial AI Lab. Prof. Seungchul Lee Support Vector Machine Industrial AI Lab. Prof. Seungchul Lee Classification (Linear) Autonomously figure out which category (or class) an unknown item should be categorized into Number of categories /

More information

Perceptron Revisited: Linear Separators. Support Vector Machines

Perceptron Revisited: Linear Separators. Support Vector Machines Support Vector Machines Perceptron Revisited: Linear Separators Binary classification can be viewed as the task of separating classes in feature space: w T x + b > 0 w T x + b = 0 w T x + b < 0 Department

More information

CS4495/6495 Introduction to Computer Vision. 8C-L3 Support Vector Machines

CS4495/6495 Introduction to Computer Vision. 8C-L3 Support Vector Machines CS4495/6495 Introduction to Computer Vision 8C-L3 Support Vector Machines Discriminative classifiers Discriminative classifiers find a division (surface) in feature space that separates the classes Several

More information

Grover s algorithm. We want to find aa. Search in an unordered database. QC oracle (as usual) Usual trick

Grover s algorithm. We want to find aa. Search in an unordered database. QC oracle (as usual) Usual trick Grover s algorithm Search in an unordered database Example: phonebook, need to find a person from a phone number Actually, something else, like hard (e.g., NP-complete) problem 0, xx aa Black box ff xx

More information

Quadratic Equations and Functions

Quadratic Equations and Functions 50 Quadratic Equations and Functions In this chapter, we discuss various ways of solving quadratic equations, aaxx 2 + bbbb + cc 0, including equations quadratic in form, such as xx 2 + xx 1 20 0, and

More information

Max Margin-Classifier

Max Margin-Classifier Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization

More information

Work, Energy, and Power. Chapter 6 of Essential University Physics, Richard Wolfson, 3 rd Edition

Work, Energy, and Power. Chapter 6 of Essential University Physics, Richard Wolfson, 3 rd Edition Work, Energy, and Power Chapter 6 of Essential University Physics, Richard Wolfson, 3 rd Edition 1 With the knowledge we got so far, we can handle the situation on the left but not the one on the right.

More information

Machine Learning. Support Vector Machines. Manfred Huber

Machine Learning. Support Vector Machines. Manfred Huber Machine Learning Support Vector Machines Manfred Huber 2015 1 Support Vector Machines Both logistic regression and linear discriminant analysis learn a linear discriminant function to separate the data

More information

Support Vector Machines Explained

Support Vector Machines Explained December 23, 2008 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),

More information

SVMs, Duality and the Kernel Trick

SVMs, Duality and the Kernel Trick SVMs, Duality and the Kernel Trick Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February 26 th, 2007 2005-2007 Carlos Guestrin 1 SVMs reminder 2005-2007 Carlos Guestrin 2 Today

More information

Linear vs Non-linear classifier. CS789: Machine Learning and Neural Network. Introduction

Linear vs Non-linear classifier. CS789: Machine Learning and Neural Network. Introduction Linear vs Non-linear classifier CS789: Machine Learning and Neural Network Support Vector Machine Jakramate Bootkrajang Department of Computer Science Chiang Mai University Linear classifier is in the

More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

From Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018

From Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018 From Binary to Multiclass Classification CS 6961: Structured Prediction Spring 2018 1 So far: Binary Classification We have seen linear models Learning algorithms Perceptron SVM Logistic Regression Prediction

More information

Kernel Machines. Pradeep Ravikumar Co-instructor: Manuela Veloso. Machine Learning

Kernel Machines. Pradeep Ravikumar Co-instructor: Manuela Veloso. Machine Learning Kernel Machines Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 SVM linearly separable case n training points (x 1,, x n ) d features x j is a d-dimensional vector Primal problem:

More information

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Perceptrons Definition Perceptron learning rule Convergence Margin & max margin classifiers (Linear) support vector machines Formulation

More information

Nonlinear Support Vector Machines through Iterative Majorization and I-Splines

Nonlinear Support Vector Machines through Iterative Majorization and I-Splines Nonlinear Support Vector Machines through Iterative Majorization and I-Splines P.J.F. Groenen G. Nalbantov J.C. Bioch July 9, 26 Econometric Institute Report EI 26-25 Abstract To minimize the primal support

More information

CSE546: SVMs, Dual Formula5on, and Kernels Winter 2012

CSE546: SVMs, Dual Formula5on, and Kernels Winter 2012 CSE546: SVMs, Dual Formula5on, and Kernels Winter 2012 Luke ZeClemoyer Slides adapted from Carlos Guestrin Linear classifiers Which line is becer? w. = j w (j) x (j) Data Example i Pick the one with the

More information

CS Lecture 8 & 9. Lagrange Multipliers & Varitional Bounds

CS Lecture 8 & 9. Lagrange Multipliers & Varitional Bounds CS 6347 Lecture 8 & 9 Lagrange Multipliers & Varitional Bounds General Optimization subject to: min ff 0() R nn ff ii 0, h ii = 0, ii = 1,, mm ii = 1,, pp 2 General Optimization subject to: min ff 0()

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Support Vector Machines Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique

More information

Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington Linear Classification CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Example of Linear Classification Red points: patterns belonging

More information

Support Vector Machines. Machine Learning Fall 2017

Support Vector Machines. Machine Learning Fall 2017 Support Vector Machines Machine Learning Fall 2017 1 Where are we? Learning algorithms Decision Trees Perceptron AdaBoost 2 Where are we? Learning algorithms Decision Trees Perceptron AdaBoost Produce

More information

Lecture 3. Linear Regression

Lecture 3. Linear Regression Lecture 3. Linear Regression COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Andrey Kan Copyright: University of Melbourne Weeks 2 to 8 inclusive Lecturer: Andrey Kan MS: Moscow, PhD:

More information

Machine Learning And Applications: Supervised Learning-SVM

Machine Learning And Applications: Supervised Learning-SVM Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Hypothesis Space variable size deterministic continuous parameters Learning Algorithm linear and quadratic programming eager batch SVMs combine three important ideas Apply optimization

More information

CSC 411 Lecture 17: Support Vector Machine

CSC 411 Lecture 17: Support Vector Machine CSC 411 Lecture 17: Support Vector Machine Ethan Fetaya, James Lucas and Emad Andrews University of Toronto CSC411 Lec17 1 / 1 Today Max-margin classification SVM Hard SVM Duality Soft SVM CSC411 Lec17

More information

Applied Machine Learning Annalisa Marsico

Applied Machine Learning Annalisa Marsico Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 29 April, SoSe 2015 Support Vector Machines (SVMs) 1. One of

More information

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

More information

COMP 652: Machine Learning. Lecture 12. COMP Lecture 12 1 / 37

COMP 652: Machine Learning. Lecture 12. COMP Lecture 12 1 / 37 COMP 652: Machine Learning Lecture 12 COMP 652 Lecture 12 1 / 37 Today Perceptrons Definition Perceptron learning rule Convergence (Linear) support vector machines Margin & max margin classifier Formulation

More information

Neural Networks. Prof. Dr. Rudolf Kruse. Computational Intelligence Group Faculty for Computer Science

Neural Networks. Prof. Dr. Rudolf Kruse. Computational Intelligence Group Faculty for Computer Science Neural Networks Prof. Dr. Rudolf Kruse Computational Intelligence Group Faculty for Computer Science kruse@iws.cs.uni-magdeburg.de Rudolf Kruse Neural Networks 1 Supervised Learning / Support Vector Machines

More information

CS798: Selected topics in Machine Learning

CS798: Selected topics in Machine Learning CS798: Selected topics in Machine Learning Support Vector Machine Jakramate Bootkrajang Department of Computer Science Chiang Mai University Jakramate Bootkrajang CS798: Selected topics in Machine Learning

More information

Useful Equations for solving SVM questions

Useful Equations for solving SVM questions Support Vector Machines (send corrections to yks at csail.mit.edu) In SVMs we are trying to find a decision boundary that maximizes the "margin" or the "width of the road" separating the positives from

More information

Lesson 24: True and False Number Sentences

Lesson 24: True and False Number Sentences NYS COMMON CE MATHEMATICS CURRICULUM Lesson 24 6 4 Student Outcomes Students identify values for the variables in equations and inequalities that result in true number sentences. Students identify values

More information

TDT4173 Machine Learning

TDT4173 Machine Learning TDT4173 Machine Learning Lecture 3 Bagging & Boosting + SVMs Norwegian University of Science and Technology Helge Langseth IT-VEST 310 helgel@idi.ntnu.no 1 TDT4173 Machine Learning Outline 1 Ensemble-methods

More information

Kernel Methods. Machine Learning A W VO

Kernel Methods. Machine Learning A W VO Kernel Methods Machine Learning A 708.063 07W VO Outline 1. Dual representation 2. The kernel concept 3. Properties of kernels 4. Examples of kernel machines Kernel PCA Support vector regression (Relevance

More information

Kernel methods, kernel SVM and ridge regression

Kernel methods, kernel SVM and ridge regression Kernel methods, kernel SVM and ridge regression Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Collaborative Filtering 2 Collaborative Filtering R: rating matrix; U: user factor;

More information

Secondary 3H Unit = 1 = 7. Lesson 3.3 Worksheet. Simplify: Lesson 3.6 Worksheet

Secondary 3H Unit = 1 = 7. Lesson 3.3 Worksheet. Simplify: Lesson 3.6 Worksheet Secondary H Unit Lesson Worksheet Simplify: mm + 2 mm 2 4 mm+6 mm + 2 mm 2 mm 20 mm+4 5 2 9+20 2 0+25 4 +2 2 + 2 8 2 6 5. 2 yy 2 + yy 6. +2 + 5 2 2 2 0 Lesson 6 Worksheet List all asymptotes, holes and

More information

Linear Support Vector Machine. Classification. Linear SVM. Huiping Cao. Huiping Cao, Slide 1/26

Linear Support Vector Machine. Classification. Linear SVM. Huiping Cao. Huiping Cao, Slide 1/26 Huiping Cao, Slide 1/26 Classification Linear SVM Huiping Cao linear hyperplane (decision boundary) that will separate the data Huiping Cao, Slide 2/26 Support Vector Machines rt Vector Find a linear Machines

More information

About this class. Maximizing the Margin. Maximum margin classifiers. Picture of large and small margin hyperplanes

About this class. Maximizing the Margin. Maximum margin classifiers. Picture of large and small margin hyperplanes About this class Maximum margin classifiers SVMs: geometric derivation of the primal problem Statement of the dual problem The kernel trick SVMs as the solution to a regularization problem Maximizing the

More information

Support'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan

Support'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan Support'Vector'Machines Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan kasthuri.kannan@nyumc.org Overview Support Vector Machines for Classification Linear Discrimination Nonlinear Discrimination

More information

Support Vector Machines, Kernel SVM

Support Vector Machines, Kernel SVM Support Vector Machines, Kernel SVM Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, 2017 1 / 40 Outline 1 Administration 2 Review of last lecture 3 SVM

More information

SVM optimization and Kernel methods

SVM optimization and Kernel methods Announcements SVM optimization and Kernel methods w 4 is up. Due in a week. Kaggle is up 4/13/17 1 4/13/17 2 Outline Review SVM optimization Non-linear transformations in SVM Soft-margin SVM Goal: Find

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 5: Vector Data: Support Vector Machine Instructor: Yizhou Sun yzsun@cs.ucla.edu October 18, 2017 Homework 1 Announcements Due end of the day of this Thursday (11:59pm)

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Constrained Optimization and Support Vector Machines

Constrained Optimization and Support Vector Machines Constrained Optimization and Support Vector Machines Man-Wai MAK Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University enmwmak@polyu.edu.hk http://www.eie.polyu.edu.hk/

More information

General Strong Polarization

General Strong Polarization General Strong Polarization Madhu Sudan Harvard University Joint work with Jaroslaw Blasiok (Harvard), Venkatesan Gurswami (CMU), Preetum Nakkiran (Harvard) and Atri Rudra (Buffalo) December 4, 2017 IAS:

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Support vector machines (SVMs) are one of the central concepts in all of machine learning. They are simply a combination of two ideas: linear classification via maximum (or optimal

More information

Math 171 Spring 2017 Final Exam. Problem Worth

Math 171 Spring 2017 Final Exam. Problem Worth Math 171 Spring 2017 Final Exam Problem 1 2 3 4 5 6 7 8 9 10 11 Worth 9 6 6 5 9 8 5 8 8 8 10 12 13 14 15 16 17 18 19 20 21 22 Total 8 5 5 6 6 8 6 6 6 6 6 150 Last Name: First Name: Student ID: Section:

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information