1 Kernel methods & optimization

Size: px

Start display at page:

Download "1 Kernel methods & optimization"

Bernice Richardson
6 years ago
Views:

1 Machine Learning Class Notes Prof. David Sontag 1 Kernel methods & optimization One eample of a kernel that is frequently used in practice and which allows for highly non-linear discriminant functions is the Gaussian kernel, ( ) y 2 ep 2σ 2 For the Gaussian kernel, k(, ) = 1 for any vector, and k(, y) 0 if is very different from y. Thus, a kernel function can be interpreted as a similarity function. However, not just any similarity function is a valid kernel. In particular, recall that (by definition) k(, y) is a valid kernel if and only if φ : X R d s.t. k(, y) = φ( ) φ( y). One consequence of this is that kernel functions must be symmetric, since φ( ) φ( y) = φ( y) φ( ). Today s lecture will eplore these requirements of kernel functions in more depth, culmunating with Mercer s theorem. Together, these requirements provide a mathematical foundation for kernel methods, ensuring both that there is a sensible feature vector representation for every data point and that the support vector machine (SVM) objective has a unique global optimum and is easy to optimize. 1.1 Background from linear algebra A matri M R d R d is said to be positive semi-definite if z R d, z T Mz 0. For eample, suppose M = I. Then, z T Iz = z i z j I ij = z 2, which is always 0. Thus, the identify matri is positive semi-definite. Net we review several concepts from linear algebra, and then use these to give an alternative definition of positive semi-definite (PSD) matrices. Suppose we find a vector v and a value λ such that M v = λ v. We call v an eigenvector of the matri M, and λ an eigenvalue. A matri M can be shown to be PSD if and only if M has all non-negative eigenvalues. We will now show one of the directions ( ). To see this, first write M = V ΛV T, where Λ is a matri with the eigenvalues along the diagonal (zero off diagonal) and V i=1 1

2 is the matri of eigenvectors: 1 M = V Net, we split Λ in two, λ M = V 0 λ λ λd 0 λ λ d V T λ λ λd V T = UUT. Letting v = z T U, since vv T = v v 0 we have that (z T U)(U T z) = z T Mz 0, showing that M is positive semi-definite (we used the fact that the eigenvalues were non-negative when taking their square root). 1.2 Mercer s Theorem For a training set S = { i } and a function k( u, v), the kernel matri (also called the Gram matri) K S is the matri of dimension S S where (K S ) ij = k( i, j ). Theorem 1 (Mercer s theorem). k( u, v) is a valid kernel if and only if the corresponding kernel matri is PSD for all training sets S = { i }. Proof. ( ) Since k( u, v) is a valid kernel, it has a corresponding feature map φ such that k( u, v) = φ( u) φ( v). Thus, the kernel matri K s has entries (K S ) ij = φ( i ) φ( j ). Let V be the matri [ φ( 1 )... φ( n ) ], where we treat φ( i ) as a column vector. Then, we have K S = V T V. However, this shows that K S must be positive semi-definite, because for any vector z R S, (z T V T )(V z) 0. ( ) Let S be the set of all possible data points (we will assume that it is finite). Since the corresponding kernel matri K S is positive semi-definite, it has non-negative eigenvalues and can be factored as K S = UU T. Let φ( i ) = u i, where u i is the i th row of U. This gives the feature mapping for i such that k( i, j ) = u i u j. Mercer s theorem guarantees for us that the kernel matri is positive semidefinite. As we show in the net section, this will guarantee that the SVM dual objective is concave, which means that it is easy to optimize. 1 This proof assumes that M is symmetric, which kernel matrices are. 2

3 Not conve: Conve: Set specified by linear inequalities: X = { R 2 : A b} Figure 1: Illustration of a non-conve and two conve sets in R Conveity A set X R d is a conve set if for any, y X and 0 α 1, α + (1 α) y X Informally, if for any two points, y that are in the set every point on the line connecting and y is also included in the set, then the set is conve. See Figure 1 for eamples of non-conve and conve sets. A function f : X R is conve for a conve set X if, y X and 0 α 1, f(α + (1 α) y) αf( ) + (1 α)f( y) (1) Informally, a function is conve if the line between any two points on the curve always upper bounds the function. We call a function strictly conve if the inequality in Eq. 1 is a strict inequality. See See Figure 2 for eamples of nonconve and conve functions. A function f() is concave is f() is conve. Importantly, it can be shown that strictly conve functions always have a unique minima. For a function f() defined over the real line, one can show that f() is d conve if and only if 2 d f 0. Just as before, strict conveity occurs when 2 the inequality is strict. For eample, consider f() = 2. The first derivative of f() is given by d d always strictly greater than 0, we have proven that f() = 2 is strictly conve. As a second eample, consider f() = log(). The first derivative is d d f = 1, f = 2 and its second derivative by d2 d 2 f = 2. Since this is and its second derivative is given by d2 d 2 f = 1 2. Since this is negative for all > 0, we have proven that log() is a concave function over R +. Not conve: Conve: f() f() = 2 f() Figure 2: Illustration of a non-conve and two conve functions over X = R. 3

4 This matters because optimization for conve functions is easy. In particular, one can show that nearly any reasonable optimization method, such as gradient descent (where one starts at arbitrary point, moves a little bit in the direction opposite to the gradient, and then repeats), is guaranteed to reach a global optimum of the function. Note that whereas the minimization of conve functions is easy, likewise, the maimization of concave functions is easy. Finally, to generalize this second definition of conve functions to higher dimensions (i.e., X = R d ), we introduce the notion of the Hessian matri of a function f, 2 f( ) = 2 1. d 1 1 d which is the matri of dimension d d with entries ( 2 f) ij equal to the partial derivative of the function with respect to i and then with respect to j, denoted i j. Note that since the order of the partial derivatives does not matter, i.e. i j = 2 f j i, the Hessian matri is symmetric. We are finally ready for our second definition of conve functions in higher dimension. A function f : X R is conve for a conve set X R d if and only if its Hessian matri 2 f( ) is positive semi-definite for all X.. 2 d 1.4 The dual SVM objective is concave Recall the dual of the support vector machine (SVM) objective, f( α) = α i 1 2 i=1 The first partial derivative is given by α i α j y i y j k( i, j ) (2) f = 1 α i y i y s k( i, s ) α s k( s, s ) α s i s The second partial derivative is given by α t α s = y t y s k( t, s ) Denote the Hessian matri as 2 f. To show that the dual SVM objective is concave, we must show that α T 2 f α = n n α iα 2 f j α i α j 0 for all α, which can be equivalently written as (α i y i )(α j y j )k( i, j ) 0. (3) 4

5 Coordinate descent: f(, y) = 10 f(, y) =8 y Figure 3: Illustration of coordinate descent on the function f(, y). Shown here are the level sets of the function. The numbers indicate the first point, the second point, etc., until the optimal solution is found. However, since k( u, v) is a valid kernel, Mercer s theorem guarantees for us that K S, the kernel matri for the n data points, is positive semi-definite. As a result, we have that z T K S z = z i z j k( i, j ) 0 (4) for all vectors z R n. Given any α, let z be given by z i = α i y i for i = 1,..., n. Using this, Eq. 4 implies Eq. 3. There are many approaches for minimizing f( α). One of the simplest such methods is called the sequential minimal optimization (SMO) algorithm, and is based on the concept of block coordinate descent. Coordinate descent is illustrated in Fig. 3 for a function defined on R 2. An arbitrary starting point is chosen. Then, in each step, one coordinate (or, in general, a set of coordinates, called a block) is chosen and the function is minimized as much as possible with respect to that coordinate (keeping all other variables fied to their current values). The larger the blocks, the faster the convergence to the optimum solution. The blocks are typically chosen to be as large as possible such that minimizing the function with respect to these coordinates can be performed in closed form. For the dual SVM, because of the constraint i y iα i = 0, the smallest block size that can be chosen is 2. The algorithm proceeds by choosing in each iteration α i and α j, then minimizing the function as much as possible with respect to these two variables. 5

UNCONSTRAINED OPTIMIZATION PAUL SCHRIMPF OCTOBER 24, 2013

UNCONSTRAINED OPTIMIZATION PAUL SCHRIMPF OCTOBER 24, 2013 PAUL SCHRIMPF OCTOBER 24, 213 UNIVERSITY OF BRITISH COLUMBIA ECONOMICS 26 Today s lecture is about unconstrained optimization. If you re following along in the syllabus, you ll notice that we ve skipped