Linear programming III



Review

What we have covered in the previous two classes:
- LP problem setup: linear objective function, linear constraints; there exists an extreme-point optimal solution.
- Simplex method: go through extreme points to find the optimal solution.
- Primal-dual property of the LP problem.
- Interior point algorithm: based on the primal-dual property, travel through the interior of the feasible solution space.
- Quadratic programming: based on the KKT conditions.
- LP application: quantile regression, which minimizes the asymmetric absolute deviations.

LP/QP application in statistics II: LASSO

Consider the usual regression setting with data (x_i, y_i), where x_i = (x_{i1}, ..., x_{ip}) is a vector of predictors and y_i is the response for the i-th object. The ordinary linear regression setting is: find coefficients to minimize the residual sum of squares:
$$\hat{b} = \operatorname*{argmin}_b \sum_{i=1}^{n} (y_i - x_i b)^2$$
Here b = (b_1, b_2, ..., b_p)^T is a vector of coefficients. The solution happens to be the MLE assuming a normal model:
$$y_i = x_i b + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2)$$
This is not ideal when the number of predictors (p) is large, because
1. it requires p < n, or there must be some degrees of freedom left for the residual.
2. one often wants a small subset of predictors in the model, but OLS provides an estimated coefficient for every predictor.

The LASSO

LASSO stands for Least Absolute Shrinkage and Selection Operator; it aims at model selection when p is large (it works even when p > n). The LASSO procedure shrinks the coefficients toward 0, and eventually forces some to be exactly 0 (so predictors with 0 coefficient are selected out). The LASSO estimates are defined as:
$$\hat{b} = \operatorname*{argmin}_b \sum_{i=1}^{n} (y_i - x_i b)^2, \quad \text{s.t. } \|b\|_1 \le t$$
Here $\|b\|_1 = \sum_{j=1}^{p} |b_j|$ is the L1 norm, and t ≥ 0 is a tuning parameter controlling the strength of shrinkage. So LASSO tries to minimize the residual sum of squares, with a constraint on the sum of the absolute values of the coefficients.

NOTE: There are other types of regularized regressions. For example, regression with an L2 penalty, i.e., $\sum_j b_j^2 \le t$, is called ridge regression.

Model selection by LASSO

The feasible solution space for LASSO is a polytope (defined by the linear constraints), so the optimal solution is often at a corner point. The implication: at the optimum, many coefficients (non-basic variables) will be exactly 0, which gives variable selection. On the contrary, ridge regression usually doesn't set any coefficient exactly to 0, so it doesn't do model selection (see the small comparison below). The LASSO problem can be solved by a standard quadratic programming algorithm.
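
As a quick, minimal sketch of this contrast (it uses the glmnet package introduced a few slides below; the simulated data and the penalty level s = 0.5 are arbitrary choices, not from the slides):

## Minimal sketch: lasso vs. ridge coefficients on simulated data.
## In glmnet, alpha = 1 gives the lasso penalty, alpha = 0 gives the ridge penalty.
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)
y <- rnorm(100) + x[, 1:2] %*% c(-1, 2)   # only the first two predictors matter

lasso.fit <- glmnet(x, y, alpha = 1)
ridge.fit <- glmnet(x, y, alpha = 0)

## At a comparable penalty level, the lasso sets many coefficients exactly to 0,
## while ridge only shrinks them toward 0.
coef(lasso.fit, s = 0.5)
coef(ridge.fit, s = 0.5)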

LASSO model fitting

In LASSO, we need to solve the following optimization problem:
$$\min_b \sum_{i=1}^{n} \Bigl(y_i - \sum_j b_j x_{ij}\Bigr)^2 \quad \text{s.t. } \sum_j |b_j| \le t$$
The trick is to convert the problem into the standard QP setting, i.e., to remove the absolute value operator. The easiest way is to let $b_j = b_j^+ - b_j^-$, where $b_j^+, b_j^- \ge 0$. Then $|b_j| = b_j^+ + b_j^-$, and the problem can be written as:
$$\min \sum_{i=1}^{n} \Bigl(y_i - \sum_j b_j^+ x_{ij} + \sum_j b_j^- x_{ij}\Bigr)^2 \quad \text{s.t. } \sum_j (b_j^+ + b_j^-) \le t, \quad b_j^+, b_j^- \ge 0$$
This is a standard QP problem that can be solved by standard QP solvers; a sketch follows.
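
To make the conversion concrete, here is a minimal sketch that solves this QP with the quadprog package (an assumed solver choice; the slides do not prescribe one). It splits b into b+ and b- exactly as above; a tiny ridge is added to the quadratic term only because solve.QP requires a strictly positive definite matrix.

## Minimal sketch: LASSO as a standard QP, solved with quadprog::solve.QP.
## Variables are theta = (b+, b-), both nonnegative, with sum(theta) <= t.
library(quadprog)
set.seed(1)
n <- 50; p <- 5
x <- matrix(rnorm(n * p), n, p)
y <- rnorm(n) + x[, 1:2] %*% c(-1, 2)
t.budget <- 2                            # L1 budget (the tuning parameter t)

z <- cbind(x, -x)                        # design for (b+, b-): x %*% (b+ - b-) = z %*% theta
Dmat <- 2 * crossprod(z) + 1e-8 * diag(2 * p)  # small ridge so Dmat is positive definite
dvec <- drop(2 * crossprod(z, y))
## Constraints in solve.QP form A^T theta >= b0:
##   theta_j >= 0 (2p rows), and -sum(theta) >= -t (the L1 budget).
Amat <- cbind(diag(2 * p), -rep(1, 2 * p))
bvec <- c(rep(0, 2 * p), -t.budget)
sol <- solve.QP(Dmat, dvec, Amat, bvec)
b.hat <- sol$solution[1:p] - sol$solution[(p + 1):(2 * p)]
round(b.hat, 3)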

A little more on LASSO

The Lagrangian (penalized) form of the LASSO optimization problem is:
$$L(b, \lambda) = \sum_{i=1}^{n} \Bigl(y_i - \sum_j b_j x_{ij}\Bigr)^2 + \lambda \sum_{j=1}^{p} |b_j|$$
Minimizing this is equivalent to maximizing the posterior of a hierarchical model with a double exponential (DE) prior on the b's (remember the asymmetric DE used in quantile regression?):
$$b_j \sim DE(1/\lambda), \qquad Y \mid X, b \sim N(Xb, 1)$$
The DE density function is $f(x; \tau) = \frac{1}{2\tau}\exp\bigl(-\frac{|x|}{\tau}\bigr)$.

As a side note, ridge regression is equivalent to the hierarchical model with a Normal prior on b (verify it; a sketch of the verification follows).
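
A one-line verification of that side note (a sketch, with the prior variance written as τ² and the noise variance as σ²):

% MAP estimate under a Normal prior is ridge regression.
% Model: y | X, b ~ N(Xb, sigma^2 I), prior: b_j ~ N(0, tau^2) independently.
\begin{aligned}
\log p(b \mid y, X)
  &= \log p(y \mid X, b) + \log p(b) + \text{const} \\
  &= -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - x_i b)^2
     \;-\; \frac{1}{2\tau^2}\sum_{j=1}^{p} b_j^2 + \text{const}.
\end{aligned}

Maximizing this over b is the same as minimizing $\sum_i (y_i - x_i b)^2 + \lambda \sum_j b_j^2$ with $\lambda = \sigma^2/\tau^2$, i.e., ridge regression. Replacing the Normal prior with the double exponential prior above gives the L1 penalty and hence the LASSO.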

LASSO in R

The glmnet package has the function glmnet:

glmnet                  package:glmnet                  R Documentation

fit a GLM with lasso or elasticnet regularization

Description:
     Fit a generalized linear model via penalized maximum likelihood. The
     regularization path is computed for the lasso or elasticnet penalty at a
     grid of values for the regularization parameter lambda. Can deal with all
     shapes of data, including very large sparse data matrices. Fits linear,
     logistic and multinomial, poisson, and Cox regression models.

Usage:
     glmnet(x, y, family=c("gaussian","binomial","poisson","multinomial","cox","mgaussian"),
            weights, offset=NULL, alpha = 1, nlambda = 100,
            lambda.min.ratio = ifelse(nobs<nvars,0.01,0.0001), lambda=NULL,
            standardize = TRUE, intercept=TRUE, thresh = 1e-07,
            dfmax = nvars + 1, pmax = min(dfmax * 2+20, nvars), exclude,
            penalty.factor = rep(1, nvars), lower.limits=-Inf, upper.limits=Inf,
            maxit=100000, type.gaussian=ifelse(nvars<500,"covariance","naive"),
            type.logistic=c("Newton","modified.Newton"), standardize.response=FALSE,
            type.multinomial=c("ungrouped","grouped"))

LASSO in R example

> x=matrix(rnorm(100*10),100,10)
> b = c(-1, 2)
> y=rnorm(100) + x[,1:2]%*%b
> fit1=glmnet(x,y)
>
> coef(fit1, s=0.05)
11 x 1 sparse Matrix of class "dgCMatrix"
                       1
(Intercept)  0.003020916
V1          -0.967153276
V2           1.809566641
V3          -0.106775004
V4           0.041574896
V5           .
V6           .
V7           0.102566050
V8           .
V9           .
V10          .

> coef(fit1, s=0.1)
11 x 1 sparse Matrix of class "dgCMatrix"
                      1
(Intercept)  0.01304181
V1          -0.92725224
V2           1.76178647
V3          -0.05743472
V4           .
V5           .
V6           .
V7           0.05953563
V8           .
V9           .
V10          .

> coef(fit1, s=0.5)
11 x 1 sparse Matrix of class "dgCMatrix"
                      1
(Intercept)  0.08689072
V1          -0.52883089
V2           1.29823139
V3           .
V4           .
V5           .
V6           ...

> plt(fit1, "lambda") #### run crss validatin > cv=cv.glmnet(x,y) > plt(cv) 10 7 5 4 2 2 10 10 7 5 5 4 3 2 2 2 0 Cefficients 1.0 0.0 0.5 1.0 1.5 5 4 3 2 1 0 Lg Lambda Mean Squared Errr 1 2 3 4 5 5 4 3 2 1 0 lg(lambda)

Support Vector Machine (SVM)

Figures for these slides are obtained from Hastie et al., The Elements of Statistical Learning.

Problem setting: we are given training data pairs (x_1, y_1), ..., (x_N, y_N). The x_i's are p-vector predictors; the y_i ∈ {-1, 1} are outcomes. Our goal: predict y based on x (find a classifier). Such a classifier is defined as a function of x, G(x). G is estimated from the training data (x, y) pairs. Once G is obtained, it can be used for future predictions. There are many ways to construct G(x), and the Support Vector Machine (SVM) is one of them. We'll first consider the simple case: G(x) is based on a linear function of x. It is often called the linear SVM or support vector classifier.

Simple case: perfectly separable case

First define a linear hyperplane by {x : f(x) = x^T b + b_0 = 0}. It is required that b is a unit vector with ||b|| = 1 for identifiability. A classification rule can be defined as G(x) = sign[x^T b + b_0]. The problem is to estimate the b's.

Consider a simple case where the two groups are perfectly separated. We want to find a border that separates the two groups. There are infinitely many borders that can perfectly separate the two groups. Which one is optimal? Conceptually, the optimal border should separate the two classes with the largest margins. We define the optimal border to be the one satisfying: (1) the distances from the closest points to the border are the same in both groups; denote this distance by M; and (2) M is maximized. M is called the margin.

Problem setup

The problem of finding the best border can be framed as the following optimization problem:
$$\max_{b,\, b_0:\, \|b\|=1} M \quad \text{s.t. } y_i(x_i^T b + b_0) \ge M, \; i = 1, \dots, N$$
This is not a typical LP/QP problem, so we do some transformations to make it look more familiar. Divide both sides of the constraint by M, and define β = b/M, β_0 = b_0/M; the constraints become y_i(x_i^T β + β_0) ≥ 1. This means that we rescale the coefficients of the border hyperplane so that the margin lines have the form x^T β + β_0 = +1 and x^T β + β_0 = -1.

Now we have ||β|| = ||b||/M = 1/M. So the objective (maximizing M) is equivalent to minimizing ||β||. After this transformation, the optimization problem can be expressed in a simpler, more familiar form:
$$\min_{\beta, \beta_0} \|\beta\| \quad \text{s.t. } y_i(x_i^T \beta + \beta_0) \ge 1, \; i = 1, \dots, N$$
This is a typical quadratic programming problem.

Illustration of the optimal border (solid line) with margins (dashed lines).

[ESL Figure 12.1, support vector classifiers: the decision boundary is the solid line x^T β + β_0 = 0, and the broken lines bound the maximal margin of width 2M = 2/||β||. In the non-separable (overlap) panel, points labeled ξ*_j are on the wrong side of their margin by an amount ξ*_j = M ξ_j; points on the correct side have ξ*_j = 0.]

Non-separable case

When the two classes are not perfectly separable, we still want to find a border with two margins, but now there will be points on the wrong side. We introduce slack variables to account for those points. Define slack variables {ξ_1, ..., ξ_N}, where ξ_i ≥ 0:
- ξ_i = 0 when the point is on the correct side of the margin.
- ξ_i > 1 when the point passes the border to the wrong side.
- 0 < ξ_i < 1 when the point is inside the margin but still on the correct side.
[See the right panel of ESL Figure 12.1 for the non-separable (overlap) case.]

Now the constraints in the original optimization problem are modified to:
$$y_i(x_i^T \beta + \beta_0) \ge 1 - \xi_i, \; i = 1, \dots, N$$
ξ_i can be interpreted as the proportional amount by which the prediction is on the wrong side of its margin. Another constraint, Σ_i ξ_i ≤ C, is added to bound the total amount of margin violation (and hence the number of misclassifications). Together, the optimization problem for this case is written as:
$$\min_{\beta, \beta_0} \frac{1}{2}\|\beta\|^2 \quad \text{s.t. } y_i(x_i^T \beta + \beta_0) \ge 1 - \xi_i, \quad \sum_i \xi_i \le C, \quad \xi_i \ge 0$$
Again this is a quadratic programming problem. What are the unknowns? (β, β_0, and the ξ_i's.)

Computation

The primal Lagrangian is:
$$L_P = \frac{1}{2}\|\beta\|^2 + \gamma \sum_i \xi_i - \sum_i \alpha_i \bigl[y_i(x_i^T \beta + \beta_0) - (1 - \xi_i)\bigr] - \sum_i \mu_i \xi_i$$
Take derivatives with respect to β, β_0, ξ_i and set them to zero to get (the stationarity conditions):
$$\beta = \sum_i \alpha_i y_i x_i, \qquad 0 = \sum_i \alpha_i y_i, \qquad \alpha_i = \gamma - \mu_i, \; \forall i$$
Plug these back into the primal Lagrangian to get the following dual objective function (verify; see the sketch below):
$$L_D = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_{i'} \alpha_i \alpha_{i'} y_i y_{i'} x_i^T x_{i'}$$
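
To do the "verify" step, substitute the stationarity conditions back into L_P (a sketch of the algebra):

% Plug beta = sum_i alpha_i y_i x_i, sum_i alpha_i y_i = 0, alpha_i = gamma - mu_i into L_P.
\begin{aligned}
L_P &= \tfrac{1}{2}\|\beta\|^2 + \gamma\sum_i \xi_i
      - \sum_i \alpha_i\bigl[y_i(x_i^T\beta + \beta_0) - (1-\xi_i)\bigr]
      - \sum_i \mu_i \xi_i \\
    &= \tfrac{1}{2}\sum_i\sum_{i'} \alpha_i\alpha_{i'}y_iy_{i'}x_i^Tx_{i'}
      - \sum_i\sum_{i'} \alpha_i\alpha_{i'}y_iy_{i'}x_i^Tx_{i'}
      - \beta_0\sum_i \alpha_i y_i
      + \sum_i \alpha_i
      + \sum_i (\gamma - \alpha_i - \mu_i)\,\xi_i \\
    &= \sum_i \alpha_i
      - \tfrac{1}{2}\sum_i\sum_{i'} \alpha_i\alpha_{i'}y_iy_{i'}x_i^Tx_{i'} \;=\; L_D,
\end{aligned}

since $\sum_i \alpha_i y_i = 0$ and $\gamma - \alpha_i - \mu_i = 0$ kill the middle terms.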

L_D needs to be maximized subject to the constraints:
$$\sum_i \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le \gamma$$
The KKT conditions for the problem (in addition to the stationarity conditions) include the following complementary slackness and primal/dual feasibility conditions:
$$\alpha_i \bigl[y_i(x_i^T \beta + \beta_0) - (1 - \xi_i)\bigr] = 0$$
$$\mu_i \xi_i = 0$$
$$y_i(x_i^T \beta + \beta_0) - (1 - \xi_i) \ge 0$$
$$\alpha_i, \mu_i, \xi_i \ge 0$$
The QP problem can be solved using an interior point method based on these.

Solve for β̂_0

With the α̂_i and β̂ given, we still need β̂_0 to construct the decision boundary. One of the complementary slackness conditions is:
$$\alpha_i \bigl[y_i(x_i^T \beta + \beta_0) - (1 - \xi_i)\bigr] = 0$$
Any point with α̂_i > 0 and ξ̂_i = 0 (a point on the margin) can be used to solve for β̂_0. In practice we often average over those points to get a stable result for β̂_0.

The support vectors

At the optimal solution, β has the form:
$$\hat\beta = \sum_i \hat\alpha_i y_i x_i$$
This means β̂ is a linear combination of the y_i x_i, and only depends on those data points with α̂_i ≠ 0. These data points are called support vectors. According to the complementary slackness in the KKT conditions, at the optimal point we have:
$$\alpha_i \bigl[y_i(x_i^T \beta + \beta_0) - (1 - \xi_i)\bigr] = 0, \; \forall i,$$
which means α_i can be non-zero only when y_i(x_i^T β + β_0) - (1 - ξ_i) = 0. What does this result tell us?

For points with non-zero α_i:
- Points with ξ_i = 0 have y_i(x_i^T β + β_0) = 1, i.e., they lie on the margin lines.
- The other points, with y_i(x_i^T β + β_0) = 1 - ξ_i, are on the wrong side of their margin.
So only the points on the margin or on the wrong side of the margin are informative for the separating hyperplane. These points are called the support vectors, because they provide support for the decision boundary. This makes sense, because the points that are correctly separated and far away from the margin (the easy points) don't tell us anything about the classification rule (the hyperplane).

Support Vector Machine

We have discussed the support vector classifier, which uses a hyperplane to separate two groups. The Support Vector Machine enlarges the feature space to make the procedure more flexible. To be specific, we transform the input data x_i using some basis functions h_m(x), m = 1, ..., M. Now the input data become h(x_i) = (h_1(x_i), ..., h_M(x_i)). This basically transforms the data to another space, which could be nonlinear in the original space. We then find the SV classifier in the transformed space using the same procedure, i.e., find the optimal $\hat{f}(x) = h(x)^T \hat\beta + \hat\beta_0$. The decision is made by $\hat{G}(x) = \mathrm{sign}(\hat{f}(x))$.

Note: the classifier is linear in the transformed space, but nonlinear in the original one.

Choose basis functions?

Now the problem becomes the choice of basis functions, or whether we even need to choose basis functions at all. Recall that in the linear space, β has the form:
$$\beta = \sum_i \alpha_i y_i x_i$$
In the transformed space, it becomes:
$$\beta = \sum_i \alpha_i y_i h(x_i)$$
So the decision boundary is:
$$f(x) = h(x)^T \sum_i \alpha_i y_i h(x_i) + \beta_0 = \sum_i \alpha_i y_i \langle h(x), h(x_i)\rangle + \beta_0$$

Moreover, the dual objective function in the transformed space becomes:
$$L_D = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_{i'} \alpha_i \alpha_{i'} y_i y_{i'} \langle h(x_i), h(x_{i'})\rangle$$
What does this tell us? Both the objective function and the decision boundary in the transformed space involve only the inner products of the transformed data, not the transformation itself! So the basis functions themselves are not important, as long as we know ⟨h(x), h(x_i)⟩.

Kernel tricks

Define the kernel function K : R^p × R^p → R to represent the inner product in the transformed space:
$$K(x, x') = \langle h(x), h(x')\rangle$$
K needs to be symmetric and positive semi-definite. With the kernel trick, the decision boundary becomes:
$$f(x) = \sum_i \alpha_i y_i K(x, x_i) + \beta_0$$
Some popular choices of kernel functions are (coded up in the sketch after this list):
- Polynomial of degree d: K(x, x') = (a_0 + a_1 ⟨x, x'⟩)^d.
- Radial basis function (RBF): K(x, x') = exp{-||x - x'||^2 / c}.
- Sigmoid: K(x, x') = tanh(a_0 + a_1 ⟨x, x'⟩).
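
As a concrete illustration (a minimal sketch; the default parameter values c, a0, a1, d below are arbitrary placeholders, not from the slides), these kernels are just ordinary R functions of two vectors:

## Minimal sketches of the three kernels above, as functions of two vectors.
rbf.kernel  <- function(x, xp, c = 1) exp(-sum((x - xp)^2) / c)
poly.kernel <- function(x, xp, a0 = 1, a1 = 1, d = 2) (a0 + a1 * sum(x * xp))^d
sig.kernel  <- function(x, xp, a0 = 0, a1 = 1) tanh(a0 + a1 * sum(x * xp))

## Kernel (Gram) matrix K with K[i, j] = K(x_i, x_j), for a data matrix x (rows = observations).
kernel.matrix <- function(x, kern, ...) {
  n <- nrow(x)
  K <- matrix(0, n, n)
  for (i in 1:n) for (j in 1:n) K[i, j] <- kern(x[i, ], x[j, ], ...)
  K
}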

Computation of SVM

With the kernel defined, the Lagrangian dual function is:
$$L_D = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_{i'} \alpha_i \alpha_{i'} y_i y_{i'} K(x_i, x_{i'})$$
Maximize L_D, with the α_i's being the unknowns, subject to the same constraints:
$$\sum_i \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le \gamma$$
This is a standard QP problem that can be solved easily (a solver sketch follows).
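
For illustration, here is a minimal sketch of this dual QP using quadprog::solve.QP (an assumed solver choice; real SVM implementations typically use specialized algorithms such as SMO instead). A tiny ridge keeps the quadratic term strictly positive definite for solve.QP, and β̂_0 is recovered from the margin points as described earlier.

## Minimal sketch: solve the kernel SVM dual with quadprog, then recover beta0.
## Assumes y in {-1, +1}, a kernel matrix K (e.g. from kernel.matrix above),
## and a cost parameter gamma.
library(quadprog)
svm.dual <- function(K, y, gamma = 1, eps = 1e-8) {
  n <- length(y)
  Dmat <- (y %*% t(y)) * K + eps * diag(n)     # small ridge so solve.QP accepts it
  dvec <- rep(1, n)
  ## Constraints A^T alpha >= b0: first the equality sum(alpha * y) = 0 (meq = 1),
  ## then alpha_i >= 0 and -alpha_i >= -gamma (i.e. alpha_i <= gamma).
  Amat <- cbind(y, diag(n), -diag(n))
  bvec <- c(0, rep(0, n), rep(-gamma, n))
  alpha <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
  ## Points with alpha strictly inside (0, gamma) sit on the margin; average over them for beta0.
  on.margin <- which(alpha > 1e-5 & alpha < gamma - 1e-5)
  beta0 <- mean(y[on.margin] - K[on.margin, , drop = FALSE] %*% (alpha * y))
  list(alpha = alpha, beta0 = beta0)
}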

The role of γ

γ controls the smoothness of the boundary. Remember that γ was introduced in the primal problem to control the total misclassification: it is the dual variable for the original constraint Σ_i ξ_i ≤ C. We can always project the original data into a higher-dimensional space so that they can be better separated by a linear classifier (in the transformed space), but:
- Large γ: fewer errors in the transformed space, a wigglier boundary in the original space.
- Small γ: more errors in the transformed space, a smoother boundary in the original space.
γ is a tuning parameter, often obtained from cross-validation.

A little more about the decision rule

Recall that the decision boundary only depends on the support vectors, i.e., the points with α_i ≠ 0. So f(x) can be written as:
$$f(x) = \sum_{i \in S} \alpha_i y_i K(x, x_i) + \beta_0,$$
where S is the set of support vectors. The kernel K(x, x') can be seen as a similarity measure between x and x'. So to classify a point x, the decision is made essentially by a weighted sum of the similarities of x to all the support vectors, as the short sketch below shows.
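
In code the decision rule is just that weighted sum (a sketch continuing the dual-QP example above; kern is one of the kernel functions sketched earlier, with its parameters baked into its default arguments):

## Classify a new point x.new by a weighted sum of its similarities to the support vectors.
svm.predict <- function(x.new, x, y, alpha, beta0, kern) {
  sv <- which(alpha > 1e-5)          # support vectors: points with nonzero alpha
  f <- sum(sapply(sv, function(i) alpha[i] * y[i] * kern(x.new, x[i, ]))) + beta0
  sign(f)
}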

An example

SVM using a degree-4 polynomial kernel. The decision boundary is projected into the 2-D space.

[ESL Figure 12.3, left panel: SVM with a degree-4 polynomial kernel in feature space; training error 0.180, test error 0.245, Bayes error 0.210. The right panel shows an SVM with a radial kernel in feature space.]

SVM in R

Several R packages include SVM functions: e1071, kernlab, klaR, svmpath, etc. The table below summarizes the R SVM functions. For more details please refer to the "Support Vector Machines in R" paper at the class website. A usage sketch with e1071 follows.

ksvm() (kernlab) — Formulations: C-SVC, ν-SVC, C-BSVC, spoc-SVC, one-SVC, ε-SVR, ν-SVR, ε-BSVR. Kernels: Gaussian, polynomial, linear, sigmoid, Laplace, Bessel, Anova, Spline. Optimizer: SMO, TRON. Model selection: hyperparameter estimation for Gaussian kernels. Data: formula, matrix.

svm() (e1071) — Formulations: C-SVC, ν-SVC, one-SVC, ε-SVR, ν-SVR. Kernels: Gaussian, polynomial, linear, sigmoid. Optimizer: SMO. Model selection: grid-search function. Data: formula, matrix.

svmlight() (klaR) — Formulations: C-SVC, ε-SVR. Kernels: Gaussian, polynomial, linear, sigmoid. Optimizer: chunking. Model selection: NA. Data: formula, matrix.

svmpath() (svmpath) — Formulations: binary C-SVC. Kernels: Gaussian, polynomial. Optimizer: NA. Model selection: NA. Data: matrix.
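
As a usage example, here is a minimal sketch with e1071 (the simulated data, kernel choice, and tuning grid are arbitrary; note that e1071's cost argument plays the role of the γ/C parameter in these slides, while its gamma argument is the RBF kernel width, a different quantity):

## Fit an RBF-kernel SVM with e1071 and tune its parameters by cross-validation.
library(e1071)
set.seed(1)
x <- matrix(rnorm(200 * 2), 200, 2)
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, 1, -1))   # circular class boundary

fit <- svm(x, y, kernel = "radial", cost = 1)
table(predicted = predict(fit, x), truth = y)

## Cross-validated grid search over the cost and RBF-width parameters.
tuned <- tune.svm(x, y, cost = c(0.1, 1, 10), gamma = c(0.5, 1, 2))
summary(tuned)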

Summary of SVM

Strengths of SVM:
- Flexibility.
- Scales well to high-dimensional data.
- Can control the complexity/error trade-off explicitly.
- As long as a kernel can be defined, non-traditional (non-vector) data, such as strings and trees, can be used as input.
Weakness:
- How to choose a good kernel (a low-degree polynomial or a radial basis function can be a good start).