Optimization for Kernel Methods S. Sathiya Keerthi


1 Optimization for Kernel Methods
S. Sathiya Keerthi, Yahoo! Research, Burbank, CA, USA
Kernel methods: Support Vector Machines (SVMs), Kernel Logistic Regression (KLR)
Aim: To introduce a variety of optimization problems that arise in the solution of classification problems by kernel methods, briefly review relevant optimization algorithms, and point out which optimization methods are suited for these problems.
The lectures in this topic will be divided into 6 parts:
1. Optimization problems arising in kernel methods
2. A review of optimization algorithms
3. Decomposition methods
4. Quadratic programming methods
5. Path tracking methods
6. Finite Newton method
The first two topics form an introduction, the next two topics cover dual methods, and the last two topics cover primal methods.

2 Part I: Optimization Problems Arising in Kernel Methods
References:
1. B. Schölkopf and A. Smola, Learning with Kernels, MIT Press, 2002, Chapter 7, Pattern Recognition.

3 Kernel methods for Classification problems
Although kernel methods are used for a range of problems such as classification (binary/multiclass), regression, ordinal regression, ranking and unsupervised learning, we will focus only on binary classification problems.
Training data: {x_i, t_i}, i = 1, ..., n
x_i ∈ R^m is the i-th input vector. t_i ∈ {1, −1} is the target for x_i, denoting the class to which the i-th example belongs; 1 denotes class 1 and −1 denotes class 2.
Kernel methods transform x to a Reproducing Kernel Hilbert Space, H, via φ : R^m → H and then develop a linear classifier in that space:
y(x) = w·φ(x) + b
y(x) > 0 ⇒ x ∈ Class 1; y(x) < 0 ⇒ x ∈ Class 2
The dot product in H, i.e., k(x_i, x_j) = φ(x_i)·φ(x_j), is called the kernel function. All computations are done using k only.
Example: φ(x) is the vector of all monomials up to degree d on the components of x. For this example, k(x_i, x_j) = (1 + x_i·x_j)^d. This is the polynomial kernel function. The larger the value of d, the more flexible and powerful the classifier function.
RBF kernel: k(x_i, x_j) = exp(−γ ||x_i − x_j||^2) is another popular kernel function. The larger the value of γ, the more flexible and powerful the classifier function.
Training problem: (w, b), which define the classifier, are obtained by solving the following optimization problem:
min_{w,b} E = R + C L
L is the empirical error, defined as L = Σ_i l(y(x_i), t_i)
l is a loss function that describes the discrepancy between the classifier output y(x_i) and the target t_i.
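
To make the two kernels concrete, here is a small NumPy sketch that builds the corresponding Gram (kernel) matrices; the function names and default parameter values are illustrative, not part of the lectures.

import numpy as np

def polynomial_kernel(X1, X2, d=3):
    # Polynomial kernel: k(x, z) = (1 + x.z)^d
    return (1.0 + X1 @ X2.T) ** d

def rbf_kernel(X1, X2, gamma=1.0):
    # RBF kernel: k(x, z) = exp(-gamma * ||x - z||^2)
    sq_dist = (np.sum(X1 ** 2, axis=1)[:, None]
               + np.sum(X2 ** 2, axis=1)[None, :]
               - 2.0 * X1 @ X2.T)
    return np.exp(-gamma * sq_dist)

# Usage: K = rbf_kernel(X, X, gamma=0.5) gives the n x n matrix with K[i, j] = k(x_i, x_j).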

4 Minimizing only L can lead to overfitting on the training data. The regularizer function R prefers simpler models and helps prevent overfitting.
The parameter C helps to establish a trade-off between R and L. C is a hyperparameter. (Other parameters such as d in the polynomial kernel and γ in the RBF kernel are also hyperparameters.) All hyperparameters need to be tuned at a higher level.
Some commonly used loss functions:
SVM (Hinge) loss: l(y, t) = 1 − ty if ty < 1; 0 otherwise.
KLR (Logistic) loss: l(y, t) = log(1 + exp(−ty))
L2-SVM loss: l(y, t) = (1 − ty)^2 / 2 if ty < 1; 0 otherwise.
Modified Huber loss: l(y, t) is: 0 if ξ ≤ 0; ξ^2 / 2 if 0 < ξ < 2; and 2(ξ − 1) if ξ ≥ 2, where ξ = 1 − ty.
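
For reference, the four losses can be coded directly from the definitions above; a minimal NumPy sketch (illustrative function names), with y the classifier output and t = ±1 the target:

import numpy as np

def hinge_loss(y, t):
    # SVM (hinge) loss: 1 - t*y if t*y < 1, else 0
    return np.maximum(0.0, 1.0 - t * y)

def logistic_loss(y, t):
    # KLR (logistic) loss: log(1 + exp(-t*y))
    return np.log1p(np.exp(-t * y))

def l2_svm_loss(y, t):
    # L2-SVM loss: (1 - t*y)^2 / 2 if t*y < 1, else 0
    return 0.5 * np.maximum(0.0, 1.0 - t * y) ** 2

def modified_huber_loss(y, t):
    # Modified Huber loss, written in terms of xi = 1 - t*y
    xi = 1.0 - t * y
    return np.where(xi <= 0, 0.0,
                    np.where(xi < 2, 0.5 * xi ** 2, 2.0 * (xi - 1.0)))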

5 Margin based regularization
The margin between the planes defined by y(x) = ±1 is 2/||w||. Making the margin big is equivalent to making the function R = (1/2)||w||^2 small. This function is a very effective regularizing function. It is the natural regularizer associated with the RKHS.
Although there are other regularizers that have been considered in the literature, in these lectures I will restrict attention to only the optimization problems directly related to the above mentioned natural regularizer.
Primal problem: min (1/2)||w||^2 + C Σ_i l(y(x_i), t_i)
Sometimes the term (1/2)b^2 is also added in order to handle w and b uniformly. (This is also equivalent to ignoring b and instead adding a constant to the kernel function.)

6 Solution via Wolfe dual
w and y(·) have the representation: w = Σ_i α_i t_i φ(x_i), y(x) = Σ_i α_i t_i k(x, x_i)
w could reside in an infinite dimensional space (e.g., in the case of the RBF kernel) and so we have to necessarily handle the solution via finite dimensional quantities such as the α_i's. This is effectively done via the Wolfe dual (details will be covered in lectures on kernel methods by other speakers).
SVM dual (convex quadratic program):
min E(α) = (1/2) Σ_{i,j} t_i t_j α_i α_j k(x_i, x_j) − Σ_i α_i
s.t. 0 ≤ α_i ≤ C, Σ_i t_i α_i = 0
KLR dual (convex program):
min E(α) = (1/2) Σ_{i,j} t_i t_j α_i α_j k(x_i, x_j) + C Σ_i g(α_i / C)
s.t. Σ_i t_i α_i = 0, where g(δ) = δ log δ + (1 − δ) log(1 − δ).
L2-SVM dual (convex quadratic program):
min E(α) = (1/2) Σ_{i,j} t_i t_j α_i α_j k̃(x_i, x_j) − Σ_i α_i
s.t. α_i ≥ 0, Σ_i t_i α_i = 0, where k̃(x_i, x_j) = k(x_i, x_j) + (1/C) δ_ij.
Modified Huber: the dual can be written down, but it is a bit more complex.
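
Given a precomputed kernel matrix, the SVM dual objective E(α) and its gradient are straightforward to evaluate; this small NumPy sketch (dense kernel matrix assumed, names illustrative) is used implicitly by the dual methods discussed later.

import numpy as np

def svm_dual_objective(alpha, t, K):
    # E(alpha) = 0.5 * sum_ij t_i t_j alpha_i alpha_j k(x_i, x_j) - sum_i alpha_i
    Q = (t[:, None] * t[None, :]) * K      # Q_ij = t_i t_j k(x_i, x_j)
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

def svm_dual_gradient(alpha, t, K):
    # grad E(alpha) = Q alpha - e
    Q = (t[:, None] * t[None, :]) * K
    return Q @ alpha - np.ones_like(alpha)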

7 Ordinal regression
All the ideas for binary classification can be easily extended to ordinal regression. There are several ways of defining losses for ordinal regression. One way is to define a threshold for each successive class and include a loss term for each pair of classes.

8 An Alternative: Direct primal design
Primal problem: min (1/2)||w||^2 + C Σ_i l(y(x_i), t_i)   (1)
Plug into (1) the representation w = Σ_i β_i t_i φ(x_i), y(x) = Σ_i β_i t_i k(x, x_i) to get the problem
min (1/2) Σ_{i,j} t_i t_j β_i β_j k(x_i, x_j) + C Σ_i l(y(x_i), t_i)   (2)
We can attempt to directly solve (2) to get the β vector. Such an approach can be particularly attractive when the loss function l is differentiable, such as in the cases of KLR, L2-SVM and Modified Huber loss SVM, since the optimization problem is an unconstrained one.
Sparse formulations (minimizing the number of nonzero α_i):
Approach 1: Replace the regularizer in (2) by the sparsity-inducing regularizer Σ_i |β_i| to get the optimization problem:
min Σ_i |β_i| + C Σ_i l(y(x_i), t_i)   (3)
Approach 2: Include the sparsity regularizer Σ_i |β_i| in a graded fashion:
min λ Σ_i |β_i| + (1/2) Σ_{i,j} t_i t_j β_i β_j k(x_i, x_j) + C Σ_i l(y(x_i), t_i)   (4)
Large λ will force sparse solutions while small λ will get us back to the original kernel solution.

9 Semi-supervised Learning
In many problems a set of unlabeled examples, {x̃_k}, is available, and E is an edge relation on that set with weights ρ_kl. Then (1/2) Σ_{(k,l)∈E} ρ_kl (y(x̃_k) − y(x̃_l))^2 can be included as an additional regularizer. (Nearby input vectors should have near y values.)
Transductive design: solve the problem
min (1/2)||w||^2 + C Σ_i l(y(x_i), t_i) + C Σ_k l(y(x̃_k), t̃_k)
where the t̃_k ∈ {1, −1} are also variables. This is a combinatorial optimization problem. There exist good special techniques for solving it, but we will not go into any details in these lectures.

10 Part II: A Review of Optimization Algorithms
References:
1. B. Schölkopf and A. Smola, Learning with Kernels, MIT Press, 2002, Chapter 6, Optimization.
2. D. P. Bertsekas, Nonlinear Programming, Athena Scientific.

11 Types of optimization problems
min E(θ) s.t. θ ∈ Z
E : Z → R is continuously differentiable, Z ⊂ R^n
Z = R^n: Unconstrained
E linear, Z polyhedral: Linear Programming
E quadratic, Z polyhedral: Quadratic Programming (example: SVM dual)
Else: Nonlinear Programming
These problems have been traditionally treated separately. Their methodologies have come closer in later years.
Unconstrained: optimality conditions at a minimum:
Stationarity: ∇E = 0
Non-negative curvature: ∇^2 E is positive semi-definite
E convex ⇒ a local minimum is a global minimum.

12 Geometry of descent (figure): a descent direction d satisfies ∇E(θ)ᵀd < 0.

13 A sketch of a descent algorithm (figure)

14 Exact line search: η* = arg min_η φ(η), where φ(η) = E(θ + ηd)
Inexact line search: Armijo condition

15 Global convergence theorem
Assume ∇E is Lipschitz continuous.
Sufficient angle of descent condition: −∇E(θ_k)ᵀd_k ≥ δ ||∇E(θ_k)|| ||d_k||, δ > 0
Armijo line search condition: for some 0 < µ < 0.5,
−(1 − µ) η ∇E(θ_k)ᵀd_k ≥ E(θ_k) − E(θ_k + ηd_k) ≥ −µ η ∇E(θ_k)ᵀd_k
Then either E → −∞ or θ_k converges to a stationary point θ*: ∇E(θ*) = 0.
Rate of convergence: let ε_k denote the error at step k (e.g., E(θ_k) − E(θ*)). Then ε_{k+1} = ρ ε_k^r in the limit as k → ∞.
r = rate of convergence, a key factor for the speed of convergence of optimization algorithms.
Linear convergence (r = 1) is quite a bit slower than quadratic convergence (r = 2). Many optimization algorithms have superlinear convergence (1 < r < 2), which is pretty good.

16 Gradient descent method
d = −∇E
Linear convergence
Very simple; locally good; but often very slow; rarely used in practice
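
A minimal NumPy sketch of such a descent loop, combining the steepest-descent direction with a backtracking line search; it enforces only the sufficient-decrease half of the Armijo condition from the previous slide, and all names and constants are illustrative.

import numpy as np

def armijo_backtracking(E, grad_E, theta, d, eta0=1.0, mu=0.1, shrink=0.5, max_tries=50):
    # Backtrack until E(theta) - E(theta + eta*d) >= -mu * eta * grad_E(theta).d
    g_dot_d = grad_E(theta) @ d
    f0 = E(theta)
    eta = eta0
    for _ in range(max_tries):
        if f0 - E(theta + eta * d) >= -mu * eta * g_dot_d:
            return eta
        eta *= shrink
    return eta

def gradient_descent(E, grad_E, theta0, tol=1e-6, max_iter=1000):
    # Steepest descent: d = -grad E, with the step chosen by Armijo backtracking
    theta = theta0.copy()
    for _ in range(max_iter):
        g = grad_E(theta)
        if np.linalg.norm(g) < tol:
            break
        d = -g
        theta = theta + armijo_backtracking(E, grad_E, theta, d) * d
    return theta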

17 Newton method
Interpretations: d = −[∇^2 E]^{-1} ∇E, η = 1
θ + d minimizes the second order approximation Ê(θ + d) = E(θ) + ∇E(θ)ᵀd + (1/2) dᵀ∇^2 E(θ) d
θ + d solves the linearized optimality condition ∇E(θ + d) ≈ ∇Ê(θ + d) = ∇E(θ) + ∇^2 E(θ) d = 0
Quadratic rate of convergence
Implementation: compute H = ∇^2 E(θ), g = ∇E(θ) and solve Hd = −g
Use the Cholesky factorization H = LLᵀ
The Newton method may not converge (or worse, if H is not positive definite, it may not even be properly defined) when started far from a minimum.
Modify the method in 2 ways: (a) change H to a nearby positive definite matrix whenever it is not; and (b) add line search.
Quasi-Newton methods: Instead of calculating the Hessian and inverting it, QN methods build an approximation to the inverse Hessian over many steps using gradient information. Several update methods for the inverse Hessian exist; the BFGS method is popularly used. Applied to a convex quadratic function with exact line search they find the minimizer within n steps. With inexact line search the QN methods can be used for minimizing nonlinear functions too. They work well even with loose line search.
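
The two modifications above can be sketched as follows: shift H toward positive definiteness when the Cholesky factorization fails, then solve Hd = −g by two triangular solves. The diagonal-shift strategy shown is just one simple choice, and the names are illustrative.

import numpy as np

def modified_newton_direction(g, H, tau0=1e-6):
    # Compute d solving (H + tau*I) d = -g, increasing tau until the
    # Cholesky factorization H + tau*I = L L^T succeeds (i.e., the matrix is pd).
    n = H.shape[0]
    tau = 0.0
    while True:
        try:
            L = np.linalg.cholesky(H + tau * np.eye(n))
            break
        except np.linalg.LinAlgError:
            tau = tau0 if tau == 0.0 else 10.0 * tau
    y = np.linalg.solve(L, -g)        # forward solve:  L y = -g
    return np.linalg.solve(L.T, y)    # backward solve: L^T d = y

# The resulting direction d would then be combined with a line search, as in the descent sketch above.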

18 Method of conjugate directions
Originally developed for convex quadratic minimization: P positive definite,
min E(θ) = (1/2) θᵀPθ − qᵀθ
Equivalently, solve Pθ = q
Define a set of P-conjugate search directions {d_0, d_1, ..., d_{n−1}} such that d_iᵀP d_j = 0 for i ≠ j
Do exact line search along each direction
Main result: the minimum of E will be reached in exactly n steps.
Conjugate gradient method
Choose any initial point θ_0, set d_0 = −∇E(θ_0)
θ_{k+1} = θ_k + η_k d_k where η_k = arg min_η E(θ_k + η d_k)
d_{k+1} = −∇E(θ_{k+1}) + β_k d_k
Simply choosing β_k so that d_{k+1}ᵀP d_k = 0 is sufficient to ensure P-conjugacy of d_{k+1} with all previous directions.
β_k = ||∇E(θ_{k+1})||^2 / ||∇E(θ_k)||^2   (Fletcher-Reeves formula)
β_k = ∇E(θ_{k+1})ᵀ[∇E(θ_{k+1}) − ∇E(θ_k)] / ||∇E(θ_k)||^2   (Polak-Ribière formula)
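
A compact NumPy sketch of the linear conjugate gradient method for the quadratic problem above (equivalently, solving Pθ = q with P symmetric positive definite); the function name and tolerance are illustrative.

import numpy as np

def conjugate_gradient(P, q, theta0=None, tol=1e-8, max_iter=None):
    # Minimize 0.5*theta'P theta - q'theta, i.e. solve P theta = q
    n = q.shape[0]
    theta = np.zeros(n) if theta0 is None else theta0.copy()
    r = q - P @ theta                 # residual = -gradient
    d = r.copy()
    rs = r @ r
    if rs == 0.0:
        return theta
    for _ in range(max_iter or n):
        Pd = P @ d
        eta = rs / (d @ Pd)           # exact line search step along d
        theta += eta * d
        r -= eta * Pd
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d     # keeps the new direction P-conjugate to the old ones
        rs = rs_new
    return theta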

19 Nonlinear Conjugate gradient method
Simply use the nonlinear gradient ∇E to generate the directions
Replace exact line search by inexact line search
The Armijo condition needs to be replaced by a more stringent condition called the Wolfe condition
Convergence in n steps is no longer possible
Practicalities: FR and PR usually behave very differently; PR is usually better; the methods are very sensitive to the line search

20 Overall comparison of the methods
Quasi-Newton methods are robust. But they require O(n^2) memory space to store the approximate Hessian inverse, and so they are not directly suited for large scale problems. Modifications of these methods called Limited Memory Quasi-Newton methods use O(n) memory and are suited for large scale problems.
Conjugate gradient methods also work well and are well suited for large scale problems. However, they need to be implemented carefully, with a carefully set line search.
In some situations block coordinate descent methods (optimizing a selected subset of variables at a time) can be much better suited than the above methods. We will say more about this later.

21 Linear Programming
min E(θ) = cᵀθ subject to Aθ ≤ b, θ ≥ 0
The solution occurs at a vertex of the feasible polyhedral region.
Simplex algorithm: starts at a vertex and moves along descending edges until an optimal vertex is found. In the worst case the algorithm takes a number of steps that is exponential in n; but in practice it is very efficient.
Interior point methods are alternative methods that are provably polynomial time. They are also very efficient when implemented via certain predictor-corrector ideas.
Quadratic Programming
The (Wolfe) SVM dual is a good example. Many traditional QP methods exist. For instance, the active set method, which solves a sequence of equality constrained problems, is a good traditional QP method. We will talk about this method in detail in Part IV.

22 Part III: Decomposition Methods
References:
1. B. Schölkopf and A. Smola, Learning with Kernels, MIT Press, 2002, Chapter 10, Implementation.
2. T. Joachims, Making large-scale SVM learning practical, In B. Schölkopf, C. J. C. Burges and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, MIT Press. publications/joachims_98c.pdf
3. J. C. Platt, Fast training of support vector machines using sequential minimal optimization, In B. Schölkopf, C. J. C. Burges and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, MIT Press. research.microsoft.com/~jplatt/smo.html
4. S. S. Keerthi et al., Improvements to Platt's SMO algorithm for SVM classifier design, Neural Computation, Vol. 13, March 2001. svm/svm.shtml
5. LIBSVM
6. SVMlight

23 SVM Dual Problem
We will take up the details only for SVM (hinge loss). The ideas are quite similar for optimization problems arising from other loss functions such as L2-SVM, KLR, Modified Huber etc.
Recall the SVM dual convex quadratic program
min (1/2) Σ_{i,j} t_i t_j α_i α_j k(x_i, x_j) − Σ_i α_i
subject to 0 ≤ α_i ≤ C, Σ_i t_i α_i = 0
In matrix-vector notation:
min f(α) = (1/2) αᵀQα − eᵀα
subject to 0 ≤ α_i ≤ C, i = 1, ..., n, and tᵀα = 0,
where Q_ij = t_i t_j k(x_i, x_j) and e = [1, ..., 1]ᵀ.
Large Dense Quadratic Programming
Q_ij ≠ 0 in general; Q is an n by n fully dense matrix.
30,000 training points means 30,000 variables; storing even half of Q (30,000^2 / 2 entries) takes about 3GB of RAM: still difficult.
Traditional optimization methods (Newton, Quasi-Newton) cannot be directly applied since they involve O(n^2) storage. Even methods such as CG which require O(n) storage are unsuitable since ∇f = Qα − e requires the entire kernel matrix.
Decomposition (block coordinate descent) methods are best suited.

24 Decomposition Methods
Work on a few variables at a time; similar to coordinate-wise minimization; also referred to as chunking.
Working set B; N = {1, ..., n} \ B is kept fixed. The size of B is usually ≤ 100.
Sub-problem in each iteration:
min_{α_B} (1/2) [α_Bᵀ (α_N^k)ᵀ] [Q_BB Q_BN; Q_NB Q_NN] [α_B; α_N^k] − [e_Bᵀ e_Nᵀ] [α_B; α_N^k]
subject to 0 ≤ α_l ≤ C, l ∈ B, t_Bᵀα_B = −t_Nᵀα_N^k
Avoid memory problems: the new objective function is
(1/2) α_BᵀQ_BB α_B + (−e_B + Q_BN α_N^k)ᵀα_B + constant
Only B columns of Q are needed, calculated as and when needed.
Decomposition Method: the Algorithm
1. Find an initial feasible α^1. Set k = 1.
2. If α^k satisfies the optimality conditions, stop. Otherwise, find a working set B. Define N ≡ {1, ..., n} \ B.

25 3. Solve the sub-problem in α_B:
min_{α_B} (1/2) α_BᵀQ_BB α_B + (−e_B + Q_BN α_N^k)ᵀα_B
subject to 0 ≤ α_l ≤ C, l ∈ B, t_Bᵀα_B = −t_Nᵀα_N^k
4. Set α_N^{k+1} = α_N^k. Set k ← k + 1; go to Step 2.
Do these methods really work?
Compared to Newton and Quasi-Newton: slow convergence (a lot more steps to come close to the solution). However, there is no need to have a very accurate α:
decision function = sgn( Σ_{i=1}^{n} α_i y_i K(x_i, x) + b )
Prediction is not affected much after a certain level of optimality is reached.
In some situations, # support vectors is much smaller than # training points; with the initial α^1 = 0, many elements are never used.
This is an example where machine learning knowledge affects optimization.
An example of solving 50,000 training instances by LIBSVM:
$ ./svm-train -m 200 -c 16 -g 4 22features
optimization finished, #iter =  Total nSV = 3370
real 3m32.828s
On a Pentium M 1.7 GHz laptop. Calculating Q may have taken more than 3 minutes. A good case where many α_i remain at zero all the time.

26 Working Set Selection
Very important. Better selection means fewer iterations, but better selection also means higher cost per iteration.
Two issues: 1. size of B (larger B, fewer iterations; smaller B, more iterations); 2. selecting the elements.
Size of the Working Set
Keeping all nonzero α_i in the working set: if all SVs are included we get the optimum; few iterations (i.e., few sub-problems); the size varies; may still have memory problems.
Existing software: small and fixed size; memory problems are solved, though it is sometimes slower.
Sequential Minimal Optimization (SMO)
Consider B = 2. |B| ≥ 2 is necessary because of the linear constraint. This is the extreme case of decomposition methods.

27 The sub-problem is solved analytically; no need to use optimization software:
min_{α_i, α_j} (1/2) [α_i α_j] [Q_ii Q_ij; Q_ij Q_jj] [α_i; α_j] + (Q_BN α_N^k − e_B)ᵀ [α_i; α_j]
s.t. 0 ≤ α_i, α_j ≤ C, t_i α_i + t_j α_j = −t_Nᵀα_N^k, B = {i, j}
Optimization people may not think this is a big advantage; machine learning people do: they like simple code.
A minor advantage in optimization: no need to have inner and outer stopping conditions.
B = {i, j}: too slow convergence? With other tricks, B = 2 is fine in practice.
Selection by KKT Violation
min f(α) = (1/2) αᵀQα − eᵀα, subject to tᵀα = 0, 0 ≤ α_i ≤ C
KKT optimality condition: α is optimal if and only if
∇f(α) + bt = λ − µ, λ_i α_i = 0, µ_i (C − α_i) = 0, λ_i ≥ 0, µ_i ≥ 0, i = 1, ..., n
where ∇f(α) = Qα − e. Rewritten as
∇f(α)_i + b t_i ≥ 0 if α_i < C
∇f(α)_i + b t_i ≤ 0 if α_i > 0

28 Note t_i = ±1, so the KKT conditions can be further rewritten as
∇f(α)_i + b ≥ 0 if α_i < C, t_i = 1
∇f(α)_i − b ≥ 0 if α_i < C, t_i = −1
∇f(α)_i + b ≤ 0 if α_i > 0, t_i = 1
∇f(α)_i − b ≤ 0 if α_i > 0, t_i = −1
A condition on the range of b:
max { −t_l ∇f(α)_l : α_l < C, t_l = 1 or α_l > 0, t_l = −1 } ≤ b ≤ min { −t_l ∇f(α)_l : α_l < C, t_l = −1 or α_l > 0, t_l = 1 }
Define I_up(α) ≡ {l : α_l < C, t_l = 1 or α_l > 0, t_l = −1} and I_low(α) ≡ {l : α_l < C, t_l = −1 or α_l > 0, t_l = 1}.
α is optimal if and only if it is feasible and
max_{i ∈ I_up(α)} −t_i ∇f(α)_i ≤ min_{i ∈ I_low(α)} −t_i ∇f(α)_i.
Violating Pair
The KKT condition is equivalent to: −t_i ∇f(α)_i ≤ −t_j ∇f(α)_j for all i ∈ I_up(α), j ∈ I_low(α).
A violating pair is i ∈ I_up(α), j ∈ I_low(α) with −t_i ∇f(α)_i > −t_j ∇f(α)_j.
f(α^k) strictly decreases if and only if B contains at least one violating pair. However, simply choosing a violating pair is not enough for convergence.

29 Maximal Violating Pair
If B = 2, it is natural to choose the indices that most violate the KKT condition:
i ≡ arg max_{l ∈ I_up(α^k)} −t_l ∇f(α^k)_l
j ≡ arg min_{l ∈ I_low(α^k)} −t_l ∇f(α^k)_l
{i, j} is called the maximal violating pair; it is obtained in O(n) operations.
Calculating the Gradient
To find violating pairs, the gradient is maintained throughout all iterations. Memory problems occur as ∇f(α) = Qα − e involves Q. Solved by using the following tricks:
1. α^1 = 0 implies ∇f(α^1) = Q·0 − e = −e, so the initial gradient is easily obtained.
2. Update ∇f(α) using only Q_BB and Q_BN:
∇f(α^{k+1}) = ∇f(α^k) + Q(α^{k+1} − α^k) = ∇f(α^k) + Q_{:,B} (α^{k+1} − α^k)_B
Only B columns of Q are needed per iteration.
SVMlight: |B| = q; the feasible direction vector d_B is obtained by solving
min_{d_B} ∇f(α^k)_Bᵀ d_B
subject to t_Bᵀ d_B = 0,
d_t ≥ 0 if α_t^k = 0, t ∈ B,
d_t ≤ 0 if α_t^k = C, t ∈ B,
−1 ≤ d_t ≤ 1, t ∈ B.
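
The pieces above (maximal-violating-pair selection, the two-variable analytic sub-problem, and the gradient update using only two columns of Q) fit together as in the following NumPy sketch. It is illustrative only: the full kernel matrix is kept in memory and there is no caching, shrinking, or second-order working set selection as in real software such as LIBSVM.

import numpy as np

def smo_train(K, t, C, eps=1e-3, max_iter=100000):
    # Solve min 0.5*a'Qa - e'a, s.t. 0 <= a_i <= C, t'a = 0, with Q_ij = t_i t_j K_ij,
    # by repeatedly optimizing over the maximal violating pair {i, j}.
    n = len(t)
    alpha = np.zeros(n)
    Q = (t[:, None] * t[None, :]) * K
    grad = -np.ones(n)                           # grad f(alpha) = Q alpha - e; equals -e at alpha = 0
    for _ in range(max_iter):
        F = -t * grad                            # F_l = -t_l * grad f(alpha)_l
        up = ((alpha < C) & (t == 1)) | ((alpha > 0) & (t == -1))    # I_up
        low = ((alpha < C) & (t == -1)) | ((alpha > 0) & (t == 1))   # I_low
        if not up.any() or not low.any():
            break
        i = np.where(up)[0][np.argmax(F[up])]
        j = np.where(low)[0][np.argmin(F[low])]
        if F[i] - F[j] <= eps:                   # KKT conditions satisfied to tolerance eps
            break
        a = K[i, i] + K[j, j] - 2.0 * K[i, j]
        s = (F[i] - F[j]) / max(a, 1e-12)        # unconstrained optimal step along the pair
        # clip so that alpha_i + t_i*s and alpha_j - t_j*s both stay in [0, C]
        s = min(s,
                C - alpha[i] if t[i] == 1 else alpha[i],
                alpha[j] if t[j] == 1 else C - alpha[j])
        d_i, d_j = t[i] * s, -t[j] * s
        alpha[i] += d_i
        alpha[j] += d_j
        grad += Q[:, i] * d_i + Q[:, j] * d_j    # gradient update needs only columns i and j of Q
    return alpha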

30 A combinatorial problem: (l choose q) possibilities. But the optimum is the q/2 most violating pairs:
From I_up(α^k): the q/2 largest elements, −t_{i1}∇f(α^k)_{i1} ≥ −t_{i2}∇f(α^k)_{i2} ≥ ... ≥ −t_{i_{q/2}}∇f(α^k)_{i_{q/2}}
From I_low(α^k): the q/2 smallest elements, −t_{j1}∇f(α^k)_{j1} ≤ ... ≤ −t_{j_{q/2}}∇f(α^k)_{j_{q/2}}
An O(lq) procedure. Used in popular SVM software: SVMlight, LIBSVM (before version 2.8), and others.
Caching and Shrinking
Both speed up decomposition methods.
Caching: store recently used kernel (Hessian) columns in computer memory.
Example:
$ time ./libsvm-2.81/svm-train -m 0.01 a4a
 s
$ time ./libsvm-2.81/svm-train -m 40 a4a
7.817s
Shrinking: some bounded elements remain at a bound until the end; the problem is heuristically resized to a smaller one; after certain iterations, most bounded elements are identified and not changed anymore.
Stopping Condition
In optimization software such conditions are important. However, don't be surprised if you see no stopping conditions in the optimization code of ML software.

31 Sometimes time/iteration limits are more suitable.
From the KKT condition, a natural stopping condition is
max_{i ∈ I_up(α)} −t_i ∇f(α)_i ≤ min_{i ∈ I_low(α)} −t_i ∇f(α)_i + ε   (1)
Better Stopping Condition
In LIBSVM, ε = 10^-3. Experience: ok, but sometimes too strict. Many times we get good results even with ε = 10^-1.
Large C means large ∇f(α) components, so the condition becomes too strict and many iterations are needed. We need a relative condition. This is a very important issue not fully addressed yet.
Example of Slow Convergence
Using C = 1:
$ ./libsvm-2.81/svm-train -c 1 australian_scale
optimization finished, #iter = 508
obj = , rho =
Using C = 5000:
$ ./libsvm-2.81/svm-train -c 5000 australian_scale
optimization finished, #iter =
obj = , rho =
Optimization researchers may rush to solve difficult cases; it turns out that large C is less used than small C.
Finite Termination
Given ε > 0, finite termination can be shown for both SMO and SVMlight.

32 Effect of hyperparameters
If we use C = 20, γ = 400:
$ ./svm-train -c 20 -g 400 train.1.scale
$ ./svm-predict train.1.scale train.1.scale.model o
Accuracy = 100% (3089/3089) (classification)
100% training accuracy, but
$ ./svm-predict test.1.scale train.1.scale.model o
Accuracy = 82.7% (3308/4000) (classification)
Very bad test accuracy. Overfitting happens.
Overfitting and Underfitting
When training and predicting on a dataset, we should avoid underfitting (large training error) and avoid overfitting (too small training error). In theory you can easily achieve 100% training accuracy, but this is useless.
Parameter Selection
Sometimes you can get away with default choices, but it is usually a good idea to tune them correctly. Here the parameters are C and the kernel parameters.

33 Examples: γ of exp(−γ ||x_i − x_j||^2); a, b, d of (x_iᵀx_j / a + b)^d. How do we select them?
Performance Evaluation
Training errors are not important; only test errors count.
l training data, x_i ∈ R^n, y_i ∈ {+1, −1}, i = 1, ..., l, and a learning machine x → f(x, α), f(x, α) = 1 or −1. Different α: different machines.
The expected test error (generalization error):
R(α) = ∫ (1/2) |y − f(x, α)| dP(x, y)
y: class of x (i.e., 1 or −1). P(x, y) is unknown. The empirical risk (training error):
R_emp(α) = (1/(2l)) Σ_{i=1}^{l} |y_i − f(x_i, α)|
(1/2) |y_i − f(x_i, α)| is the loss. Choose 0 ≤ η ≤ 1; with probability at least 1 − η:
R(α) ≤ R_emp(α) + another term
A good pattern recognition method minimizes the combined effect of both terms.
Driving R_emp(α) → 0 makes the other term large.

34 In practice
Available data is split into training, validation, and (testing) sets. Training + validation gives the model.
k-fold cross validation: the data is randomly separated into k groups; each time k−1 groups are used for training and one for testing. Select the parameters with the highest CV accuracy. This is another optimization problem.
Trying the RBF Kernel First
The linear kernel is a special case of RBF: the leave-one-out cross-validation accuracy of linear is the same as that of RBF under certain parameters. This is related to optimization as well.
Polynomial kernel: numerical difficulties ((<1)^d → 0, (>1)^d → ∞); more parameters than RBF.
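
A simple way to carry out this parameter search is an exhaustive grid over (C, γ) scored by k-fold cross-validation; the sketch below uses scikit-learn's SVC and cross_val_score, and the exponential grids are illustrative defaults only.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def grid_search_rbf(X, y, C_grid=(2.0**-5, 1.0, 2.0**5, 2.0**10),
                    gamma_grid=(2.0**-10, 2.0**-5, 1.0, 2.0**5), k=5):
    # Return the (C, gamma) pair with the highest k-fold CV accuracy for an RBF-kernel SVM.
    best_C, best_gamma, best_acc = None, None, -np.inf
    for C in C_grid:
        for gamma in gamma_grid:
            acc = cross_val_score(SVC(C=C, gamma=gamma, kernel="rbf"), X, y, cv=k).mean()
            if acc > best_acc:
                best_C, best_gamma, best_acc = C, gamma, acc
    return best_C, best_gamma, best_acc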

35 Contour of Parameter Selection (figure)

36 Part IV: Quadratic Programming Methods
References:
1. L. Kaufman, Solving the quadratic programming problem arising in support vector machine classification, In B. Schölkopf, C. J. C. Burges and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, MIT Press.
2. S. V. N. Vishwanathan, A. Smola and M. N. Murty, Simple SVM, ICML. papers/vissmomur03.pdf

37 Active Set Method
Force each α_i into one of three groups:
O: α_i = 0
C: α_i = C
A: only the α_i in this (active) set are allowed to change
α = (α_A, α_C, α_O), α_C = C e_C, α_O = 0
The optimization problem on only the α_A variable set is:
min_{α_A} (1/2) α_AᵀQ_AA α_A + (−e_A + C Q_AC e_C)ᵀ α_A
subject to 0 ≤ α_l ≤ C, l ∈ A, t_Aᵀα_A = −C t_Cᵀe_C
The problem is as messy as the original problem, except for the fact that the working set is smaller in size. So, what do we do to simplify? Typically 0 < α_i < C for i ∈ A. Think as if these constraints will be satisfied and solve the following equality constrained quadratic problem:
min_{α_A} (1/2) α_AᵀQ_AA α_A + (−e_A + C Q_AC e_C)ᵀ α_A
subject to t_Aᵀα_A = −C t_Cᵀe_C
The solution of this problem can be obtained by solving a linear system Hγ = g, where γ includes α_A together with b, the Lagrange multiplier corresponding to the equality constraint. A factorization of H is maintained (and incrementally updated when H undergoes changes in the overall active set algorithm).
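
For a concrete picture of the linear system Hγ = g, here is a minimal NumPy sketch that solves the equality constrained sub-problem via its KKT system; p_A stands for the linear term (−e_A + C Q_AC e_C) and c for the right hand side −C t_Cᵀe_C from the slide, and the dense solve shown here stands in for the incrementally updated factorization used in practice.

import numpy as np

def solve_equality_qp(Q_AA, p_A, t_A, c):
    # Solve min 0.5*a'Q_AA a + p_A'a  s.t.  t_A'a = c  via the KKT system
    #   [Q_AA  t_A] [a]   [-p_A]
    #   [t_A'   0 ] [b] = [  c ]
    # where b is the Lagrange multiplier of the equality constraint.
    m = len(p_A)
    H = np.zeros((m + 1, m + 1))
    H[:m, :m] = Q_AA
    H[:m, m] = t_A
    H[m, :m] = t_A
    g = np.concatenate([-p_A, [c]])
    gamma = np.linalg.solve(H, g)
    return gamma[:m], gamma[m]        # (alpha_A, b)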

38 The basic iteration of the active set method consists of the following steps:
Solve the above-mentioned equality constrained problem.
If the solution α_A violates a constraint, move the first violating index i of A into C or O.
If the solution α_A satisfies the constraints, then check if any i in C or O violates the optimality (KKT) conditions; if so, bring it into A.
The algorithm can be initialized by choosing A to be a small set, say two points. With appropriate conditions on the incoming new variable, the algorithm can be shown to have finite convergence. (Recall that decomposition methods such as SMO have only asymptotic convergence.)
The method of bringing in a new variable from C or O into A has a big impact on the overall efficiency of the algorithm. The original active set algorithm chooses the point of C/O which maximally violates the optimality (KKT) condition. This usually ends up being expensive, especially in large scale problems where large kernel arrays cannot be stored. In the Simple SVM algorithm of Vishwanathan, Smola and Murty, the indices are randomly traversed and the first violating point is included.

39 Comparison with a decomposition method (figure)

40 Some Overall Comments
Pros: Finite convergence, and so independence of stopping tolerances; speed usually unaffected by C; very good when the size of the final A is not more than a few thousand. Very well suited when gradient based methods (Chapelle et al) are used for model selection.
Cons: Storage and factorization can be expensive/impossible when the size of A goes beyond a few thousand. Seeding is not as clean as in decomposition methods since the factorization needs to be entirely computed again.
Simpler SVM: Vishwanathan et al have modified their Simple SVM method in 2 ways:
Replace factorization techniques by CG methods for solving the linear systems that arise.
Instead of choosing the incoming points randomly, use a heuristically defined priority queue on the points so that those points which are more likely to violate the optimality conditions come first.

41 Part V: Path Tracking Methods
References:
1. S. Rosset and J. Zhu, Piecewise linear regularized solution paths. piecewise-revised.pdf
2. S. S. Keerthi, Generalized LARS as an effective feature selection tool for text classification with SVMs, ICML. _GeneralizedLARS_Keerthi.pdf

42 A general problem formulation
Consider the optimization problem
min_β f(β) = λ J(β) + L(β)
where J(β) = Σ_j |β_j| and L is a differentiable piecewise convex quadratic function. (Piecewise: this means that the β-space is divided into a finite number of zones, in each of which L is a convex quadratic function and, at the boundary of the zones, the pieces of L merge properly to maintain differentiability.)
Our aim is to track the minimizer of f as a function of λ.
Let g = ∇L. At any one λ let β(λ) be the minimizer, A = {j : β_j(λ) ≠ 0} and A^c be the complement set.
Optimality conditions:
g_j + λ sgn(β_j) = 0, j ∈ A   (1)
|g_j| ≤ λ, j ∈ A^c   (2)
Within one quadratic zone, (1) defines a set of linear equations in β_j, j ∈ A. Let γ denote the direction in β space thus defined.
At large λ (specifically, λ > max_j |g_j(0)|), β = 0 is the solution.

43 Rosset-Zhu path tracking algorithm
Initialize: β = 0, A = arg max_j |g_j|, compute γ.
While (max_j |g_j| > 0):
d_1 = smallest d ≥ 0 at which |g_j(β + dγ)| = λ − d for some j ∈ A^c (a new variable becomes active)
d_2 = smallest d ≥ 0 at which (β + dγ)_j = 0 for some j ∈ A (an active component hits 0)
Find d_3, the first d value at which a piecewise quadratic zone boundary is crossed.
Set d = min(d_1, d_2, d_3).
If d = d_1 then add the variable attaining equality at d to A.
If d = d_2 then remove the variable attaining 0 at d from A.
β ← β + dγ
Update the information and compute the new direction vector γ.
Implementation: the second order matrix and its factorization needed to obtain γ can be maintained efficiently.

44 Feature selection in Linear classifiers
L(β) = (1/2)||β||^2 + Σ_i l(y(x_i), t_i), where y(x) = βᵀx and l is a differentiable, piecewise quadratic loss function.
Examples: L2-SVM loss, Modified Huber loss. Even the logistic loss can be approximated quite well by piecewise quadratic functions.
When the minimum of f = λJ + L is tracked with respect to λ, we get β = 0 at large λ values and we retrieve the minimizer of L as λ → 0. Intermediate λ values give approximations where feature selection is achieved.

45 Selecting features in Text classification
The following figure shows the application of path tracking to a dataset from the Reuters corpus. The plots show F-measure (larger is better) as a function of the number of features chosen (which increases as λ → 0). The black curve corresponds to keeping ||β||^2/2, the blue curve corresponds to leaving out ||β||^2/2, and the red curve corresponds to feature selection using the information gain measure. SVM corresponds to the L2-SVM loss while RLS corresponds to regularized least squares.

46 Forming sparse nonlinear kernel classifiers
Consider the nonlinear kernel primal problem
min (1/2)||w||^2 + C Σ_i l(y(x_i), t_i)
where l is a differentiable, piecewise quadratic loss function. As before, l can be one of the L2-SVM loss, the Modified Huber loss, or a piecewise quadratic approximation of the logistic loss.
Use the primal substitution w = Σ_i β_i t_i φ(x_i) to get
min L(β) = (1/2) βᵀQβ + Σ_i l(y(x_i), t_i), where y(x) = Σ_i β_i t_i k(x, x_i)
When the minimum of f = λJ + L is tracked with respect to λ, we get β = 0 at large λ values and we retrieve the minimizer of L as λ → 0. Intermediate λ values give approximations where sparsity is achieved.

47 Performance on Two Datasets (figure)

48 Part VI: Finite Newton Method (FNM)
References:
1. O. L. Mangasarian, A finite Newton method for classification, Optimization Methods and Software, Vol. 17. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/
2. S. S. Keerthi and D. W. DeCoste, A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs, Journal of Machine Learning Research, Vol. 6. keerthi05a.pdf
3. J. Zhu and T. Hastie, Kernel logistic regression and the import vector machine, Journal of Computational and Graphical Statistics, Vol. 14. umich.edu/~jizhu/research/05klr.pdf
4. S. S. Keerthi, O. Chapelle and D. W. DeCoste, Building support vector machines with reduced classifier complexity, submitted to JMLR.

49 Introduction
FNM is much more efficient and better suited than dual decomposition methods in certain situations.
Different dimensions:
Linear, Nonlinear kernel machines
Classification, Ordinal Regression, Regression
Differentiable loss functions: Least squares, L2-SVM, Modified Huber, KLR

50 FNM: A General Format
min_β f(β) = (1/2) βᵀRβ + Σ_i l_i(β)
R is positive definite. l_i is the loss for the i-th example. Assume: l_i is convex, differentiable and piecewise quadratic. (Piecewise: same as mentioned in path tracking.)
FNM Iteration
β_k = starting vector at the k-th iteration
q_i = local quadratic of l_i at β_k
β̄ = arg min_β (1/2) βᵀRβ + Σ_i q_i(β)
Note that β̄ can be obtained by solving a linear system.
Define the direction d_k = β̄ − β_k.
New point by exact line search: β_{k+1} = β_k + δ_k d_k, δ_k = arg min_δ f(β_k + δ d_k)

51 Finite convergence of FNM
First, the global convergence theorem (discussed in Part II) ensures that β_k → β*, the minimizer of f. Since f is continuous, β* is also the minimizer (i.e., β̄) of every local quadratic approximation of f at β*. Thus, there is an open neighborhood around β* such that, from any point in it, FNM will reach β* in exactly one iteration. The convergence β_k → β* ensures that the above mentioned neighborhood will be reached in a finite number of steps. Thus FNM has finite convergence.
Comments
The number of iterations is usually very small (5-50).
The linear system in each iteration is of the form A_k β = b_k, with A_k = R + Σ_i γ_i s_i s_iᵀ.
Factorization of A_k can be done incrementally and is very efficient.
In many cases, the linear system is of the RLS (Regularized Least Squares) form and so special methods can be called in.
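
To make the iteration concrete, here is a small NumPy sketch of an FNM-style solver for the linear L2-SVM primal (the RLS form mentioned above). It is only a sketch: it uses a dense solve for the local quadratic and simple backtracking in place of the exact piecewise-quadratic line search of the actual method, so the finite-termination guarantee does not carry over.

import numpy as np

def l2svm_finite_newton(X, t, C, max_iter=50, tol=1e-6):
    # f(beta) = 0.5*||beta||^2 + (C/2) * sum over {i: t_i x_i'beta < 1} of (1 - t_i x_i'beta)^2
    n, d = X.shape
    beta = np.zeros(d)

    def f(b):
        viol = np.maximum(0.0, 1.0 - t * (X @ b))
        return 0.5 * b @ b + 0.5 * C * np.sum(viol ** 2)

    for _ in range(max_iter):
        I = (t * (X @ beta)) < 1.0                 # active set defining the local quadratic
        X_I, t_I = X[I], t[I]
        # local quadratic minimizer: (I + C X_I'X_I) beta_bar = C X_I' t_I   (an RLS system)
        A = np.eye(d) + C * X_I.T @ X_I
        beta_bar = np.linalg.solve(A, C * X_I.T @ t_I)
        dvec = beta_bar - beta
        if np.linalg.norm(dvec) <= tol * (1.0 + np.linalg.norm(beta)):
            break
        delta, f0 = 1.0, f(beta)                   # try the full step, halve until f decreases
        while f(beta + delta * dvec) > f0 and delta > 1e-8:
            delta *= 0.5
        beta = beta + delta * dvec
    return beta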

52 Linear Kernel Machines: Small input dimension
R = I, l_i(β) = C h(t_i βᵀx_i)
d = dimension of β is small
Factorization of A_k is very efficient: O(nd^2) complexity

53 Linear Kernel Machines: Large input dimension
Text classification: n ≈ 200,000; d ≈ 250,000
The data matrix X (containing the x_i's) is sparse: about 0.1% non-zeros.
Linear system: use conjugate-gradient methods on the quadratic. Theoretically CG will need d + 1 iterations for exact convergence, but an exact solution is completely unnecessary. With inexact termination, CG requires a very small number of iterations.
Example: # CG iterations in various FNM iterations.
Each CG iteration requires a couple of calls of the form Xy or Xᵀz. There are about 1000 such calls.
Compare: SMO does calculations equivalent to one Xβ calculation in each of its iterations involving the update of a pair of alphas (α_i, α_j). SMO uses tens of thousands of such iterations!
Unlike SMO, where the number of iterations is very sensitive to C, the number of FNM iterations is not at all sensitive to C.
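
The point that each CG iteration only needs products with X and Xᵀ can be seen in the following matrix-free sketch for the RLS system (I + C X_IᵀX_I)β = C X_Iᵀt_I: the d x d matrix is never formed, so X_I can be a scipy.sparse matrix. Names and tolerances are illustrative.

import numpy as np

def cg_rls_solve(X_I, t_I, C, tol=1e-4, max_iter=200):
    # Solve (I + C*X_I'X_I) beta = C*X_I't_I using only X_I @ v and X_I.T @ z products.
    d = X_I.shape[1]
    matvec = lambda v: v + C * (X_I.T @ (X_I @ v))
    b = C * (X_I.T @ t_I)
    beta = np.zeros(d)
    r = b - matvec(beta)
    p = r.copy()
    rs = r @ r
    if rs == 0.0:
        return beta
    for _ in range(max_iter):
        Ap = matvec(p)
        step = rs / (p @ Ap)
        beta += step * p
        r -= step * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.sqrt(b @ b):
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return beta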

54 Heuristics
H1: Terminate the first FNM iteration after 10 CG iterations. Most non-support vectors usually get well identified in this phase. These vectors will not get involved in the CG iterations of the later FNM iterations. This heuristic is particularly powerful when the number of support vectors is a small fraction of n.
H2: First run FNM with a high tolerance; then do another run with a low tolerance.
Example: % SV = 6.8
No H: secs
H1 only: secs
H2 only: secs
H1 & H2: secs
β-seeding can be used when going from one C value to another nearby C value. β-seeding is more effective than the α-seeding in SMO.

55 Comparison of FNM (L2-SVM), SVMlight and BSVM: computing time versus training set size for the Adult and Web datasets (figure)

56 Feature selection in Linear Kernel Classifiers
These ideas form a very good alternative to the L1 regularization path tracking ideas discussed earlier.
Start with no features; add features greedily.
Let β = the optimized vector with the current set of features, and β_j = a feature not yet chosen.
Evaluation criterion: f_j = min_{β_j} f(β, β_j), with β fixed, where f is the training cost function. Choose the β_j with the smallest f_j.
After choosing the best j, solve min f(β, β_j) using (β, 0) as the seed.
The factorizations needed for the linear system solution can be updated very efficiently.

57 Sparse Nonlinear Kernel Machines
The ideas are parallel to what we discussed earlier on the same topic. Use the primal substitution w = Σ_i β_i t_i φ(x_i) to get
min L(β) = (1/2) βᵀQβ + Σ_i l(y(x_i), t_i), where y(x) = Σ_i β_i t_i k(x, x_i)
Note that, except for the regularizer being a more general quadratic function (βᵀQβ/2), this problem is essentially in the linear classifier form.
New non-zero β_j variables can be selected greedily as in the feature selection process of the previous slide. At each step of the greedy process it is usually sufficient to restrict the evaluation to a small, randomly chosen number (say, 50) of β_j's. A similar choice doesn't exist for the L1 regularization method.
The result is an effective algorithm with a clearly defined small complexity: an O(d^2 n) algorithm for selecting d kernel basis functions. On many datasets, a small d gives nearly the same accuracy as the full kernel classifier using all basis functions.

58 Performance on some UCI Datasets (table)
Columns: Dataset; SpSVM TestErate, #Basis; SVM TestErate, #SV
Datasets: Banana, Breast, Diabetis, Flare, German, Heart, Ringnorm, Thyroid, Titanic, Twonorm, Waveform

59 An Example (figure)


More information

Pattern Recognition 2018 Support Vector Machines

Pattern Recognition 2018 Support Vector Machines Pattern Recognition 2018 Support Vector Machines Ad Feelders Universiteit Utrecht Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 1 / 48 Support Vector Machines Ad Feelders ( Universiteit Utrecht

More information

Classifier Complexity and Support Vector Classifiers

Classifier Complexity and Support Vector Classifiers Classifier Complexity and Support Vector Classifiers Feature 2 6 4 2 0 2 4 6 8 RBF kernel 10 10 8 6 4 2 0 2 4 6 Feature 1 David M.J. Tax Pattern Recognition Laboratory Delft University of Technology D.M.J.Tax@tudelft.nl

More information

CS269: Machine Learning Theory Lecture 16: SVMs and Kernels November 17, 2010

CS269: Machine Learning Theory Lecture 16: SVMs and Kernels November 17, 2010 CS269: Machine Learning Theory Lecture 6: SVMs and Kernels November 7, 200 Lecturer: Jennifer Wortman Vaughan Scribe: Jason Au, Ling Fang, Kwanho Lee Today, we will continue on the topic of support vector

More information

Neural Networks and Deep Learning

Neural Networks and Deep Learning Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

Large Scale Semi-supervised Linear SVMs. University of Chicago

Large Scale Semi-supervised Linear SVMs. University of Chicago Large Scale Semi-supervised Linear SVMs Vikas Sindhwani and Sathiya Keerthi University of Chicago SIGIR 2006 Semi-supervised Learning (SSL) Motivation Setting Categorize x-billion documents into commercial/non-commercial.

More information

Chapter 9. Support Vector Machine. Yongdai Kim Seoul National University

Chapter 9. Support Vector Machine. Yongdai Kim Seoul National University Chapter 9. Support Vector Machine Yongdai Kim Seoul National University 1. Introduction Support Vector Machine (SVM) is a classification method developed by Vapnik (1996). It is thought that SVM improved

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Ryan M. Rifkin Google, Inc. 2008 Plan Regularization derivation of SVMs Geometric derivation of SVMs Optimality, Duality and Large Scale SVMs The Regularization Setting (Again)

More information

Support Vector Machines and Kernel Methods

Support Vector Machines and Kernel Methods 2018 CS420 Machine Learning, Lecture 3 Hangout from Prof. Andrew Ng. http://cs229.stanford.edu/notes/cs229-notes3.pdf Support Vector Machines and Kernel Methods Weinan Zhang Shanghai Jiao Tong University

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2016 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

A Distributed Solver for Kernelized SVM

A Distributed Solver for Kernelized SVM and A Distributed Solver for Kernelized Stanford ICME haoming@stanford.edu bzhe@stanford.edu June 3, 2015 Overview and 1 and 2 3 4 5 Support Vector Machines and A widely used supervised learning model,

More information

A Dual Coordinate Descent Method for Large-scale Linear SVM

A Dual Coordinate Descent Method for Large-scale Linear SVM Cho-Jui Hsieh b92085@csie.ntu.edu.tw Kai-Wei Chang b92084@csie.ntu.edu.tw Chih-Jen Lin cjlin@csie.ntu.edu.tw Department of Computer Science, National Taiwan University, Taipei 106, Taiwan S. Sathiya Keerthi

More information

Topics we covered. Machine Learning. Statistics. Optimization. Systems! Basics of probability Tail bounds Density Estimation Exponential Families

Topics we covered. Machine Learning. Statistics. Optimization. Systems! Basics of probability Tail bounds Density Estimation Exponential Families Midterm Review Topics we covered Machine Learning Optimization Basics of optimization Convexity Unconstrained: GD, SGD Constrained: Lagrange, KKT Duality Linear Methods Perceptrons Support Vector Machines

More information

CIS 520: Machine Learning Oct 09, Kernel Methods

CIS 520: Machine Learning Oct 09, Kernel Methods CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed

More information

Lecture Notes on Support Vector Machine

Lecture Notes on Support Vector Machine Lecture Notes on Support Vector Machine Feng Li fli@sdu.edu.cn Shandong University, China 1 Hyperplane and Margin In a n-dimensional space, a hyper plane is defined by ω T x + b = 0 (1) where ω R n is

More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

Links between Perceptrons, MLPs and SVMs

Links between Perceptrons, MLPs and SVMs Links between Perceptrons, MLPs and SVMs Ronan Collobert Samy Bengio IDIAP, Rue du Simplon, 19 Martigny, Switzerland Abstract We propose to study links between three important classification algorithms:

More information

Nearest Neighbors Methods for Support Vector Machines

Nearest Neighbors Methods for Support Vector Machines Nearest Neighbors Methods for Support Vector Machines A. J. Quiroz, Dpto. de Matemáticas. Universidad de Los Andes joint work with María González-Lima, Universidad Simón Boĺıvar and Sergio A. Camelo, Universidad

More information

Machine Learning A Geometric Approach

Machine Learning A Geometric Approach Machine Learning A Geometric Approach CIML book Chap 7.7 Linear Classification: Support Vector Machines (SVM) Professor Liang Huang some slides from Alex Smola (CMU) Linear Separator Ham Spam From Perceptron

More information

Machine Learning Techniques

Machine Learning Techniques Machine Learning Techniques ( 機器學習技法 ) Lecture 5: Hsuan-Tien Lin ( 林軒田 ) htlin@csie.ntu.edu.tw Department of Computer Science & Information Engineering National Taiwan University ( 國立台灣大學資訊工程系 ) Hsuan-Tien

More information

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44 Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

More information

Learning Kernel Parameters by using Class Separability Measure

Learning Kernel Parameters by using Class Separability Measure Learning Kernel Parameters by using Class Separability Measure Lei Wang, Kap Luk Chan School of Electrical and Electronic Engineering Nanyang Technological University Singapore, 3979 E-mail: P 3733@ntu.edu.sg,eklchan@ntu.edu.sg

More information

Homework 5. Convex Optimization /36-725

Homework 5. Convex Optimization /36-725 Homework 5 Convex Optimization 10-725/36-725 Due Tuesday November 22 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)

More information

A Magiv CV Theory for Large-Margin Classifiers

A Magiv CV Theory for Large-Margin Classifiers A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector

More information

Support Vector Machines for Classification: A Statistical Portrait

Support Vector Machines for Classification: A Statistical Portrait Support Vector Machines for Classification: A Statistical Portrait Yoonkyung Lee Department of Statistics The Ohio State University May 27, 2011 The Spring Conference of Korean Statistical Society KAIST,

More information

Mathematical Programming for Multiple Kernel Learning

Mathematical Programming for Multiple Kernel Learning Mathematical Programming for Multiple Kernel Learning Alex Zien Fraunhofer FIRST.IDA, Berlin, Germany Friedrich Miescher Laboratory, Tübingen, Germany 07. July 2009 Mathematical Programming Stream, EURO

More information

How Good is a Kernel When Used as a Similarity Measure?

How Good is a Kernel When Used as a Similarity Measure? How Good is a Kernel When Used as a Similarity Measure? Nathan Srebro Toyota Technological Institute-Chicago IL, USA IBM Haifa Research Lab, ISRAEL nati@uchicago.edu Abstract. Recently, Balcan and Blum

More information