Lecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University


1 Lecture 18: Kernels Risk and Loss Support Vector Regression Aykut Erdem December 2016 Hacettepe University

2 Administrative
- We will have a make-up lecture next Saturday, December 24, 2016.
- Presentations highlighting the research areas carried out in our department: Friday 09:30-12:15 (my presentation at 10:15).
- Introducing DREAM (our very first undergraduate research experience program).

3 Last time Multi-class SVM
- Simultaneously learn 3 sets of weights: w+, w-, wo
- How do we guarantee the correct labels? Need new constraints!
- The "score" of the correct class must be better than the "score" of the wrong classes.
slide by Eric Xing

4 Last time Multi-class SVM
- As for the SVM, we introduce slack variables and maximize margin.
- To predict, we use the class with the highest score.
- Now can we learn it?
slide by Eric Xing

5 Last time Kernels
- Original data
- Data in feature space (implicit)
- Solve in feature space using kernels

6 Last time Example kernel functions
Examples of kernels $k(x, x')$:
- Linear: $\langle x, x' \rangle$
- Laplacian RBF: $\exp(-\lambda \|x - x'\|)$
- Gaussian RBF: $\exp(-\lambda \|x - x'\|^2)$
- Polynomial: $(\langle x, x' \rangle + c)^d$, $c \ge 0$, $d \in \mathbb{N}$
- B-Spline: $B_{2n+1}(x - x')$
- Cond. Expectation: $E_c[p(x|c)\, p(x'|c)]$
Simple trick for checking Mercer's condition: compute the Fourier transform of the kernel and check that it is nonnegative.
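The first four kernels in the list can be written down directly. The following is a minimal sketch in plain Python; the function names, the default width `lam` and the sample points are illustrative assumptions, not part of the slides.

```python
import math

def dot(x, z):
    """Inner product <x, z> of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))

def sq_dist(x, z):
    """Squared Euclidean distance ||x - z||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, z))

def linear_kernel(x, z):
    return dot(x, z)

def laplacian_rbf(x, z, lam=1.0):
    # lam is an assumed width parameter
    return math.exp(-lam * math.sqrt(sq_dist(x, z)))

def gaussian_rbf(x, z, lam=1.0):
    return math.exp(-lam * sq_dist(x, z))

def polynomial_kernel(x, z, c=1.0, d=2):
    # requires c >= 0 and a positive integer d
    return (dot(x, z) + c) ** d

def gram_matrix(kernel, points):
    """Kernel matrix K_ij = k(x_i, x_j) for a list of points."""
    return [[kernel(p, q) for q in points] for p in points]

# Hypothetical sample points, used only to build a small kernel matrix.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K = gram_matrix(gaussian_rbf, points)
```

The resulting matrix `K` is symmetric with ones on the diagonal, as expected for an RBF kernel.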

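7 Last time The Kernel Trick for SVMs
Linear soft margin problem:
$$\underset{w,b}{\text{minimize}} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to } y_i[\langle w, x_i \rangle + b] \ge 1 - \xi_i \text{ and } \xi_i \ge 0$$
Dual problem:
$$\underset{\alpha}{\text{maximize}} \; -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i \quad \text{subject to } \sum_i \alpha_i y_i = 0 \text{ and } \alpha_i \in [0, C]$$
Support vector expansion:
$$f(x) = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$$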

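8 Last time The Kernel Trick for SVMs
Linear soft margin problem:
$$\underset{w,b}{\text{minimize}} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to } y_i[\langle w, \phi(x_i) \rangle + b] \ge 1 - \xi_i \text{ and } \xi_i \ge 0$$
Dual problem:
$$\underset{\alpha}{\text{maximize}} \; -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) + \sum_i \alpha_i \quad \text{subject to } \sum_i \alpha_i y_i = 0 \text{ and } \alpha_i \in [0, C]$$
Support vector expansion:
$$f(x) = \sum_i \alpha_i y_i k(x_i, x) + b$$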
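To make the support vector expansion concrete, here is a small sketch that evaluates $f(x) = \sum_i \alpha_i y_i k(x_i, x) + b$ with a Gaussian kernel. The two support vectors, the values of `alpha` and `b`, and the helper names are hypothetical; they were picked by hand so that $\sum_i \alpha_i y_i = 0$ holds, and they are not the output of a solver.

```python
import math

def gaussian_rbf(x, z, lam=1.0):
    """Gaussian RBF kernel exp(-lam * ||x - z||^2); lam is an assumed width."""
    return math.exp(-lam * sum((a - b) ** 2 for a, b in zip(x, z)))

def decision_function(x, support_x, support_y, alpha, b, kernel):
    """Support vector expansion f(x) = sum_i alpha_i y_i k(x_i, x) + b."""
    return sum(a * y * kernel(xi, x)
               for a, y, xi in zip(alpha, support_y, support_x)) + b

# Hypothetical model: one support vector per class on the real line.
support_x = [[0.0], [2.0]]
support_y = [+1, -1]
alpha = [1.0, 1.0]   # hand-picked; satisfies sum_i alpha_i y_i = 0
b = 0.0

f_left = decision_function([0.0], support_x, support_y, alpha, b, gaussian_rbf)
f_mid = decision_function([1.0], support_x, support_y, alpha, b, gaussian_rbf)
f_right = decision_function([2.0], support_x, support_y, alpha, b, gaussian_rbf)
```

The sign of `f` gives the predicted class: positive near the first support vector, negative near the second, and zero at the midpoint between them.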

9 C=10

10 C=50

11 C=100

12 Last time Nonlinear Separation Pattern Recognition
- Increasing C allows for more nonlinearities
- Decreases number of errors
- SV boundary need not be contiguous
- Kernel width adjusts function class
Figure: 2D toy example of a binary classification problem solved using a soft margin SVC. In all cases, a Gaussian kernel (7.27) is used. From left to right, we decrease the kernel width. Note that for a large width, the decision boundary is almost linear, and the data set cannot be separated without error (see text). Solid lines represent decision boundaries; dotted lines depict the edge of the margin (where (7.34) becomes an equality with $\xi_i = 0$).

13 Today
- Risk and Loss
- Support Vector Regression

14 Risk and Loss

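15 Loss function point of view
Constrained quadratic program:
$$\underset{w,b}{\text{minimize}} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{subject to } y_i[\langle w, x_i \rangle + b] \ge 1 - \xi_i \text{ and } \xi_i \ge 0$$
Risk minimization setting:
$$\underset{w,b}{\text{minimize}} \; \frac{1}{2}\|w\|^2 + C \sum_i \max[0, 1 - y_i[\langle w, x_i \rangle + b]]$$
The second term is the empirical risk. This follows from finding the minimal slack variable for a given $(w, b)$ pair.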
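The step from the constrained program to the risk minimization form can be checked numerically: for a fixed $(w, b)$, the smallest feasible slack of each example is exactly its soft margin loss. The sketch below assumes a hypothetical four-point data set and hand-picked `w`, `b` and `C`.

```python
def score(w, b, x):
    """Linear score <w, x> + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def minimal_slack(w, b, x, y):
    """Smallest xi >= 0 satisfying y * (<w, x> + b) >= 1 - xi."""
    return max(0.0, 1.0 - y * score(w, b, x))

def soft_margin_objective(w, b, data, C):
    """0.5 * ||w||^2 + C * sum of minimal slacks (i.e. soft margin losses)."""
    reg = 0.5 * sum(wj * wj for wj in w)
    return reg + C * sum(minimal_slack(w, b, x, y) for x, y in data)

# Hypothetical data: (x, y) pairs with labels in {+1, -1}.
data = [([2.0, 0.0], +1), ([0.5, 0.0], +1), ([-1.0, 0.0], -1), ([0.2, 0.0], -1)]
w, b, C = [1.0, 0.0], 0.0, 10.0

slacks = [minimal_slack(w, b, x, y) for x, y in data]
objective = soft_margin_objective(w, b, data, C)
```

Here the first and third points lie on or outside the margin (zero slack), the second lies inside the margin (slack 0.5), and the fourth is misclassified (slack 1.2).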

16 Soft margin as proxy for binary
- Soft margin loss: $\max(0, 1 - y f(x))$
- Binary loss: $\{y f(x) < 0\}$
Figure labels: convex upper bound, binary loss function, margin.

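17 More loss functions
- Logistic: $\log\left[1 + e^{-f(x)}\right]$
- Huberized loss:
$$\begin{cases} 0 & \text{if } f(x) > 1 \\ \frac{1}{2}(1 - f(x))^2 & \text{if } f(x) \in [0, 1] \\ \frac{1}{2} - f(x) & \text{if } f(x) < 0 \end{cases}$$
- Soft margin: $\max(0, 1 - f(x))$
Figure labels: (asymptotically) linear, (asymptotically) 0.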
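The classification losses from the last two slides can be compared side by side. This sketch assumes that the argument of every loss is the margin $m = y f(x)$ (the slide above writes it as $f(x)$); the function names and the grid of margins are illustrative.

```python
import math

def binary_loss(m):
    """0-1 loss on the margin m = y * f(x): 1 if misclassified, else 0."""
    return 1.0 if m < 0 else 0.0

def soft_margin_loss(m):
    return max(0.0, 1.0 - m)

def logistic_loss(m):
    return math.log(1.0 + math.exp(-m))

def huberized_loss(m):
    """Piecewise loss: zero beyond the margin, quadratic on [0, 1], linear below 0."""
    if m > 1.0:
        return 0.0
    if m >= 0.0:
        return 0.5 * (1.0 - m) ** 2
    return 0.5 - m

# Illustrative grid of margins, from badly misclassified to safely correct.
margins = [-2.0, -0.5, 0.0, 0.5, 1.0, 3.0]
table = [(m, binary_loss(m), soft_margin_loss(m), huberized_loss(m), logistic_loss(m))
         for m in margins]
```

On this grid the soft margin loss never falls below the binary loss, the Huberized pieces join continuously at $m = 0$ and $m = 1$, and the logistic loss behaves like $-m$ for very negative margins and like 0 for very positive ones.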

18 Risk minimization view
Find function $f$ minimizing classification error:
$$R[f] := E_{x,y \sim p(x,y)}[\{y f(x) < 0\}]$$
Compute empirical average:
$$R_{\text{emp}}[f] := \frac{1}{m} \sum_{i=1}^{m} \{y_i f(x_i) < 0\}$$
- Minimization is nonconvex
- Overfitting as we minimize empirical error
Compute convex upper bound on the loss and add regularization for capacity control:
$$R_{\text{reg}}[f] := \frac{1}{m} \sum_{i=1}^{m} \max(0, 1 - y_i f(x_i)) + \lambda \Omega[f]$$
$\Omega[f]$ is the regularization term; the open question is how to control $\lambda$.
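A small numerical sketch of the two risks follows. It assumes a linear $f(x) = \langle w, x \rangle + b$ and the choice $\Omega[f] = \frac{1}{2}\|w\|^2$, which the slide does not fix; the data and parameter values are hypothetical. Under that assumption, dividing the regularized risk by $\lambda$ gives the soft margin objective with $C = 1/(\lambda m)$, which is one way to relate the two parameterizations.

```python
def f(w, b, x):
    """Linear function f(x) = <w, x> + b (an assumed form of f)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def empirical_risk(w, b, data):
    """Fraction of misclassified examples, i.e. those with y * f(x) < 0."""
    return sum(1.0 for x, y in data if y * f(w, b, x) < 0) / len(data)

def regularized_risk(w, b, data, lam):
    """Average soft margin loss plus lam * Omega[f], with Omega[f] = 0.5 * ||w||^2 (assumption)."""
    hinge = sum(max(0.0, 1.0 - y * f(w, b, x)) for x, y in data) / len(data)
    omega = 0.5 * sum(wj * wj for wj in w)
    return hinge + lam * omega

def soft_margin_objective(w, b, data, C):
    """0.5 * ||w||^2 + C * sum of soft margin losses."""
    hinge_sum = sum(max(0.0, 1.0 - y * f(w, b, x)) for x, y in data)
    return 0.5 * sum(wj * wj for wj in w) + C * hinge_sum

# Hypothetical data and parameters.
data = [([2.0, 0.0], +1), ([0.5, 0.0], +1), ([-1.0, 0.0], -1), ([0.2, 0.0], -1)]
w, b, lam = [1.0, 0.0], 0.0, 0.1

r_emp = empirical_risk(w, b, data)
r_reg = regularized_risk(w, b, data, lam)
C = 1.0 / (lam * len(data))   # equivalent C under the assumed Omega
```

With these values one of the four points is misclassified, so `r_emp` is 0.25, while `r_reg` also penalizes the correctly classified point that sits inside the margin.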

19 Support Vector Regression

20 Regression Estimation
Find function $f$ minimizing regression error:
$$R[f] := E_{x,y \sim p(x,y)}[l(y, f(x))]$$
Compute empirical average:
$$R_{\text{emp}}[f] := \frac{1}{m} \sum_{i=1}^{m} l(y_i, f(x_i))$$
- Overfitting as we minimize empirical error
Add regularization for capacity control:
$$R_{\text{reg}}[f] := \frac{1}{m} \sum_{i=1}^{m} l(y_i, f(x_i)) + \lambda \Omega[f]$$

21 Squared loss
$$l(y, f(x)) = \frac{1}{2}(y - f(x))^2$$

22 l1 loss
$$l(y, f(x)) = |y - f(x)|$$

23 ε-insensitive loss: allow some deviation without a penalty
$$l(y, f(x)) = \max(0, |y - f(x)| - \epsilon)$$
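The three regression losses differ mainly in how they treat small and large residuals. A minimal sketch, with an assumed tube width `eps=0.1` and an illustrative list of residuals:

```python
def squared_loss(y, fx):
    return 0.5 * (y - fx) ** 2

def l1_loss(y, fx):
    return abs(y - fx)

def eps_insensitive_loss(y, fx, eps=0.1):
    """Zero inside the tube |y - f(x)| <= eps, linear outside it."""
    return max(0.0, abs(y - fx) - eps)

# Illustrative residuals y - f(x), with f(x) fixed at 0.
residuals = [0.0, 0.05, 0.5, 3.0]
rows = [(r, squared_loss(r, 0.0), l1_loss(r, 0.0), eps_insensitive_loss(r, 0.0, eps=0.1))
        for r in residuals]
```

A residual of 0.05 costs nothing under the ε-insensitive loss, while a residual of 3.0 costs 4.5 under the squared loss but only 3.0 and 2.9 under the l1 and ε-insensitive losses. Setting `eps=0` recovers the l1 loss.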

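24 Penalized least mean squares
Optimization problem:
$$\underset{w}{\text{minimize}} \; \frac{1}{2m} \sum_{i=1}^{m} (y_i - \langle x_i, w \rangle)^2 + \frac{\lambda}{2} \|w\|^2$$
Solution:
$$\partial_w [\ldots] = \frac{1}{m} \sum_{i=1}^{m} \left[ x_i x_i^\top w - x_i y_i \right] + \lambda w = \left[ \frac{1}{m} X X^\top + \lambda \mathbf{1} \right] w - \frac{1}{m} X y = 0$$
hence
$$w = \left[ X X^\top + \lambda m \mathbf{1} \right]^{-1} X y$$
$X X^\top$ is the outer product matrix in $X$. Conjugate Gradient; Sherman-Morrison-Woodbury.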
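The closed-form solution can be sketched with NumPy as follows. The synthetic data, the dimensions and the value of `lam` are assumptions made for illustration; `X` stores the inputs $x_i$ as columns, matching the $X X^\top$ notation of the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, lam = 3, 20, 0.1                      # assumed sizes and regularization strength

# Synthetic data: columns of X are the inputs x_i.
X = rng.normal(size=(d, m))
w_true = np.array([1.0, -2.0, 0.5])         # hypothetical generating weights
y = X.T @ w_true + 0.01 * rng.normal(size=m)

def objective(w):
    """(1 / 2m) * sum_i (y_i - <x_i, w>)^2 + (lam / 2) * ||w||^2."""
    return 0.5 / m * np.sum((y - X.T @ w) ** 2) + 0.5 * lam * (w @ w)

# Closed form: w = (X X^T + lam * m * I)^{-1} X y, solved as a linear system.
w = np.linalg.solve(X @ X.T + lam * m * np.eye(d), X @ y)

# Gradient of the objective at w; it should vanish at the minimizer.
grad = (X @ X.T @ w - X @ y) / m + lam * w
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly; for large $d$ an iterative method such as conjugate gradient would be the natural replacement.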

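25 Penalized least mean squares... now with kernels
Optimization problem:
$$\underset{w}{\text{minimize}} \; \frac{1}{2m} \sum_{i=1}^{m} (y_i - \langle \phi(x_i), w \rangle)^2 + \frac{\lambda}{2} \|w\|^2$$
Representer Theorem (Kimeldorf & Wahba, 1971):
$$\|w\|^2 = \|w_\parallel\|^2 + \|w_\perp\|^2$$
Figure labels: $w_\parallel$ (empirical risk dependent), $w_\perp$.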

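26 Penalized least mean squares... now with kernels
Optimization problem:
$$\underset{w}{\text{minimize}} \; \frac{1}{2m} \sum_{i=1}^{m} (y_i - \langle \phi(x_i), w \rangle)^2 + \frac{\lambda}{2} \|w\|^2$$
Representer Theorem (Kimeldorf & Wahba, 1971):
- Optimal solution is in span of data: $w = \sum_i \alpha_i \phi(x_i)$
- Proof: risk term only depends on data via $\phi(x_i)$
- Regularization ensures that orthogonal part is 0
Optimization problem in terms of $w$, with $w = \sum_i \alpha_i \phi(x_i)$:
$$\underset{\alpha}{\text{minimize}} \; \frac{1}{2m} \sum_{i=1}^{m} \Big( y_i - \sum_j K_{ij} \alpha_j \Big)^2 + \frac{\lambda}{2} \sum_{i,j} \alpha_i \alpha_j K_{ij}$$
Solve for $\alpha$ as a linear system:
$$\alpha = (K + m \lambda \mathbf{1})^{-1} y$$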
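The kernelized solution $\alpha = (K + m\lambda \mathbf{1})^{-1} y$ can be sketched with NumPy. The Gaussian kernel, its width, the sine target and the sample size are all assumptions chosen for illustration; here the rows of `X` are the inputs.

```python
import numpy as np

def gaussian_gram(A, B, width=1.0):
    """Kernel matrix with entries exp(-width * ||a_i - b_j||^2); rows of A and B are points."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-width * sq)

rng = np.random.default_rng(0)
m, lam = 30, 1e-3                           # assumed sample size and regularization strength

# Synthetic one-dimensional regression problem.
X = rng.uniform(-3.0, 3.0, size=(m, 1))
y = np.sin(X[:, 0])

K = gaussian_gram(X, X)
# alpha = (K + m * lam * I)^{-1} y, solved as a linear system.
alpha = np.linalg.solve(K + lam * m * np.eye(m), y)

def predict(X_new):
    """f(x) = sum_i alpha_i k(x_i, x), evaluated for each row of X_new."""
    return gaussian_gram(X_new, X) @ alpha

train_fit = predict(X)
```

Only kernel evaluations are needed, so the feature map $\phi$ never has to be computed explicitly. The cost is an $m \times m$ linear system instead of a $d \times d$ one.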


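28 SVM Regression (ε-insensitive loss)
Figure: data points around the regression function with a tube between $-\epsilon$ and $+\epsilon$; points outside the tube incur slack $\xi$. A second panel plots the loss against $y - f(x)$.
Don't care about deviations within the tube.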

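29 SVM Regression (ε-insensitive loss)
Optimization Problem (as constrained QP):
$$\underset{w,b}{\text{minimize}} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} [\xi_i + \xi_i^*]$$
subject to
$$\langle w, x_i \rangle + b \le y_i + \epsilon + \xi_i \text{ and } \xi_i \ge 0$$
$$\langle w, x_i \rangle + b \ge y_i - \epsilon - \xi_i^* \text{ and } \xi_i^* \ge 0$$
Lagrange Function:
$$L = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} [\xi_i + \xi_i^*] - \sum_{i=1}^{m} [\eta_i \xi_i + \eta_i^* \xi_i^*] + \sum_{i=1}^{m} \alpha_i [\langle w, x_i \rangle + b - y_i - \epsilon - \xi_i] + \sum_{i=1}^{m} \alpha_i^* [y_i - \epsilon - \xi_i^* - \langle w, x_i \rangle - b]$$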
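As in the classification case, the slack variables can be eliminated for a fixed $f$: the minimal feasible $\xi_i$ and $\xi_i^*$ sum to the ε-insensitive loss, and at most one of them is nonzero. A minimal sketch with hypothetical target and prediction pairs and an assumed `eps`:

```python
def svr_slacks(fx, y, eps):
    """Minimal slacks for one example, given the prediction fx = <w, x> + b."""
    xi = max(0.0, fx - y - eps)        # from  f(x) <= y + eps + xi
    xi_star = max(0.0, y - fx - eps)   # from  f(x) >= y - eps - xi*
    return xi, xi_star

def eps_insensitive_loss(y, fx, eps):
    return max(0.0, abs(y - fx) - eps)

eps = 0.5
# Hypothetical (y, f(x)) pairs: inside the tube, above it, below it.
cases = [(1.0, 1.2), (1.0, 2.0), (1.0, -0.5)]

results = [(svr_slacks(fx, y, eps), eps_insensitive_loss(y, fx, eps)) for y, fx in cases]
```

For the three cases the slack pairs are (0, 0), (0.5, 0) and (0, 1.0), which match the losses 0, 0.5 and 1.0.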

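30 SVM Regression (ε-insensitive loss)
First order conditions:
$$\partial_w L = 0 = w + \sum_i [\alpha_i - \alpha_i^*] x_i$$
$$\partial_b L = 0 = \sum_i [\alpha_i - \alpha_i^*]$$
$$\partial_{\xi_i} L = 0 = C - \eta_i - \alpha_i$$
$$\partial_{\xi_i^*} L = 0 = C - \eta_i^* - \alpha_i^*$$
Dual:
$$\underset{\alpha, \alpha^*}{\text{minimize}} \; \frac{1}{2} (\alpha - \alpha^*)^\top K (\alpha - \alpha^*) + \epsilon \mathbf{1}^\top (\alpha + \alpha^*) + y^\top (\alpha - \alpha^*)$$
subject to $\mathbf{1}^\top (\alpha - \alpha^*) = 0$ and $\alpha_i, \alpha_i^* \in [0, C]$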
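The step from the first order conditions to the dual can be spelled out as follows. This is a sketch of the substitution, assuming $K_{ij} = \langle x_i, x_j \rangle$ (or $k(x_i, x_j)$ in the kernelized case) and the sign convention of the Lagrange function used in these slides; the last line gives the resulting expansion of $f$ under that convention.

```latex
% Assumes K_{ij} = <x_i, x_j> and the sign convention of the Lagrangian above.
\begin{align*}
w &= -\sum_i (\alpha_i - \alpha_i^*)\, x_i
  && \text{from } \partial_w L = 0 \\
\eta_i &= C - \alpha_i \ge 0, \quad \eta_i^* = C - \alpha_i^* \ge 0
  && \Rightarrow\ \alpha_i, \alpha_i^* \in [0, C] \\
L &= -\tfrac{1}{2} (\alpha - \alpha^*)^\top K (\alpha - \alpha^*)
     - \epsilon\, \mathbf{1}^\top (\alpha + \alpha^*)
     - y^\top (\alpha - \alpha^*)
  && \text{after substitution} \\
f(x) &= \langle w, x \rangle + b
      = \sum_i (\alpha_i^* - \alpha_i)\, \langle x_i, x \rangle + b
\end{align*}
```

The terms in $\xi_i$, $\xi_i^*$ and $b$ drop out because their coefficients are exactly the first order conditions. Maximizing $L$ over the multipliers is the same as minimizing $-L$, which is the dual stated on the slide.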

31 Properties
- Ignores typical instances with small error
- Only upper or lower bound active at any time
- QP in 2n variables, as cheap as SVM problem
- Robustness with respect to outliers
  - l1 loss yields same problem without epsilon
  - Huber's robust loss yields similar problem but with added quadratic penalty on coefficients

32 Regression example (figure: sinc x and its approximation)

33 Regression example (figure: sinc x and its approximation)

34 Regression example (figure: sinc x and its approximation)

35 Regression example (figure: support vectors)

36 Huber's robust loss
$$l(y, f(x)) = \begin{cases} \frac{1}{2}(y - f(x))^2 & \text{if } |y - f(x)| < 1 \\ |y - f(x)| - \frac{1}{2} & \text{otherwise} \end{cases}$$
Figure labels: linear, trimmed mean estimator, quadratic.
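Huber's loss is quadratic for small residuals and linear for large ones, which is what limits the influence of outliers. A minimal sketch with illustrative residual values, compared against the squared loss:

```python
def huber_loss(y, fx):
    """Quadratic for |y - f(x)| < 1, linear (shifted by 1/2) otherwise."""
    r = abs(y - fx)
    return 0.5 * r ** 2 if r < 1.0 else r - 0.5

def squared_loss(y, fx):
    return 0.5 * (y - fx) ** 2

small = huber_loss(0.0, 0.5)      # quadratic regime
boundary = huber_loss(0.0, 1.0)   # the two pieces meet here
outlier = huber_loss(0.0, 10.0)   # linear regime
outlier_sq = squared_loss(0.0, 10.0)
```

A residual of 10 costs 9.5 under Huber's loss but 50 under the squared loss, so a single outlier pulls the fit far less.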

37 Summary
Advantages:
- Kernels allow very flexible hypotheses
- Poly-time exact optimization methods rather than approximate methods
- Soft-margin extension permits mis-classified examples
- Variable-sized hypothesis space
- Excellent results (1.1% error rate on handwritten digits vs. LeNet's 0.9%)
Disadvantages:
- Must choose kernel parameters
- Very large problems computationally intractable
- Batch algorithm
slide by Sanja Fidler

38 Software
- SVMlight: one of the most widely used SVM packages. Fast optimization, can handle very large datasets, C++ code.
- LIBSVM
Both of these handle multi-class, weighted SVM for unbalanced data, etc.
There are several new approaches to solving the SVM objective that can be much faster:
- Stochastic subgradient method (discussed in a few lectures)
- Distributed computation (also to be discussed)
See machine learning open source software.
