Lecture 3 January 28

Size: px

Start display at page:

Download "Lecture 3 January 28"

Ada Green
5 years ago
Views:

1 EECS 28B / STAT 24B: Advanced Topics in Statistical LearningSpring 2009 Lecture 3 January 28 Lecturer: Pradeep Ravikumar Scribe: Timothy J. Wheeler Note: These lecture notes are still rough, and have only have been mildly proofread. 3. Recap of last lecture As in the previous lecture, we assume that (X, Y ) P, where X takes values in X = R d, and Y takes values in Y = {, }. Let D n = {(X (i), Y (i) )} n i= be a set of n i.i.d. samples from P. Each θ R d defines a function f θ (x) = θ, x. We consider the decision rule Recall that the margin is defined as and the radius of the data set is defined as g(x) = sgn(f θ (X)). Y (i) θ, X (i) δ(θ, D n ) = min, i n θ 2 R(D n ) = max i n X(i) 2. In the previous lecture, we proved that if D n is linearly separable, then the Perceptron Algorithm converges in T = R 2 /δ 2 steps. 3.2 Motivation for maximizing the margin The 0- loss function is given by l(f θ (X), Y ) = { if sgn(f θ (X)) y, 0 otherwise. Let k be chosen uniformly at random from the set {0,,..., n }, and define the truncated data set D n,k = {(X (i), Y (i) )} k i=. Use the Perceptron Algorithm to compute a classifier f n,k for the set D n,k. Let f n denote the resulting classifier, which depends on the random variables k and D n. 3-

2 To compute the risk of the classifier f n, we take the expectation of the loss over the random variables k, X, Y, and D n E k E X,Y,Dn l(f n (X), Y ) = = n n E X,Y,Dn l(f n,k (X), Y ) n k=0 n k=0 ) E Dn l (f n,k (X (k+) ), Y (k+) (# of mistakes) n R2 nδ 2 Here, the first equality follows from the fact that k is uniformly distributed. Note that (X (k+), Y (k+) ) P, and the random variable f n is independent of the data {(X (i), Y (i) )} n i=k+, which was not used in training. Hence, we have the second equality. The last inequality follows from the result we proved last lecture, which said that the total number of mistakes made by the Perceptron Algorithm is bounded by R 2 /δ 2. Thus, we conclude that, in this case, maximizing the margin minimizes risk. 3.3 Max-margin as an optimization problem Next, we cast the problem of maximizing the margin as an optimization problem max δ 0, s.t. δ Y (i) θ, X (i) θ 2 δ, i =,...,n Since the vector θ is normalized, we can take θ 2 = /δ and rewrite our problem as min 2 θ 2 2 s.t. Y (i) θ, X (i), i =,...,n. (3.) Because the objective function is quadratic and the constraints are affine, this optimization problem is called a quadratic program (QP). We refer to (3.) as the primal problem. A point θ R d is feasible for the problem (3.) if it satisfies all the constraints. Note that a feasible point exists if and only if the data are linearly separable. In the following derivation, we assume that the data are linearly separable. 3-2

3 3.3. The dual formulation Define the Lagrangian L(θ, α) = n 2 θ α i ( Y (i) θ, X (i) ), i= where α 0 (component-wise). The α i are called the dual variables. If ˆθ is a feasible point for problem (3.), then ( Y (i) ˆθ, X (i) ) 0, for i =,...,n. Hence, sup L(ˆθ, α) = { 2 ˆθ 2 2, if ˆθ is feasible + otherwise. (3.2) Therefore, computing p = inf sup L(θ, α), is the same as solving the primal problem (3.). By equation (3.2), the following relations hold for all feasible ˆθ R d and all α 0: sup L(ˆθ, α) = 2 ˆθ 2 2 L(ˆθ, α) inf L(θ, α). (3.3) Since this holds for all feasible ˆθ and all α 0, we can take the infimum on the left and the supremum on the right to get The optimization p = inf sup L(θ, α) sup q = sup inf L(θ, α) = q (3.4) inf L(θ, α) (3.5) is known as the dual problem, and the inequality (3.4) is a result known as weak duality. Because the original problem (3.) satisfies Slater s condition, we actually have strong duality (i.e., p = q ). The Lagrangian L(θ, α) is a convex function, so the θ that minimizes L(θ, α) is given by L(θ, α) θ = θ + n α i ( Y (i) X (i) ) = 0, (3.6) which yields θ = α i Y (i) X (i). Substituting this value of θ into L(θ, α) gives i= q(α) = inf θ L(θ, α) = 2 αt (K Y )α + α, e, where K, Y R n n are the Gram matrices given by K ij = X (i), X (j) and Y ij = Y (i) Y (j), and e R n is the vector of ones. The symbol denotes the Hadamard (element-by-element) product. 3-3

4 3.3.2 Interpretation Suppose that (θ, α ) is an optimum for the Lagrangian. From equations (3.2) and (3.6), an optimum point has the following properties:. θ = n i= α i Y (i) X (i). 2. If α i > 0, then Y (i) θ, X (i) = 0. The second condition is known as complementary slackness. Together, these two conditions tell us that an optimum θ only depends on the data (X (i), Y (i) ) where the corresponding αi are nonzero. The points X (i) where αi > 0 are called support vectors, and they are contained in the hyperplanes {X R d θ, X = } and {X R d θ, X = }. This method of finding the maximum-margin classifier is called the support vector machine. 3.4 Extension What if the data D n are not linearly separable? Observe that for an optimum (θ, α ), θ, X = n αi Y (i) X (i), X. i= Hence, we are mainly concerned with the inner product on X. If we extend our notion of inner product, we can use the support vector machine ideas described above to classify data sets that are not linearly separable. Consider the data set shown in Figure 3.. Clearly, these data are not linearly separable. However, if we apply the map (X, X 2 ) (, X, X 2, X X 2, X 2, X2 2 ), and take our inner product in R 6, rather than R 2, we can use the techniques described above to determine θ R 6 that correctly classifies the data (see Figure 3.2). This intuition motivates the use of kernel methods, to which we turn in the next lectures. 3-4

5 3 2 Y = Y=+ X X Figure 3.. A set of data that is not linearly separable. 3 2 Classifier Y = Y=+ X X Figure 3.2. Classifier that separates the data. 3-5

Support Vector Machines

EE 17/7AT: Optimization Models in Engineering Section 11/1 - April 014 Support Vector Machines Lecturer: Arturo Fernandez Scribe: Arturo Fernandez 1 Support Vector Machines Revisited 1.1 Strictly) Separable