COS 511: Theoretical Machine Learning          Lecture #12
Lecturer: Rob Schapire                         March 14
Scribe: Jian Min Si

1 Bounding the Margin

We are continuing the proof of a bound on the generalization error of AdaBoost from the last lecture. We showed previously that, with probability at least $1 - \delta$ over the choice of $S$, for all $f \in \mathcal{F}$,

$$ E_D[f] \le \hat{E}_S[f] + 2\hat{R}_S(\mathcal{F}) + O\!\left(\sqrt{\frac{\ln(1/\delta)}{m}}\right) \tag{1.1} $$

where $\hat{E}_S[f]$ denotes the sample average. Additionally, we want to show that if AdaBoost attains a large margin, this translates into low generalization error. We will use (1.1) to prove a bound on the generalization error in terms of the margins. First of all, $\mathcal{H}$ will denote a weak hypothesis space with VC-dimension $d$, and recall the convex hull of $\mathcal{H}$, which we defined to be:

$$ \mathrm{co}(\mathcal{H}) = \left\{ f : f(x) = \sum_{t=1}^{T} a_t h_t(x) \;\middle|\; T \ge 1,\; a_t \ge 0,\; \sum_{t=1}^{T} a_t = 1,\; h_1, \ldots, h_T \in \mathcal{H} \right\} \tag{1.2} $$

Theorem 1. With a training set of size $m$, for all $f \in \mathrm{co}(\mathcal{H})$ and for all margins $\Theta > 0$,

$$ \Pr_D[yf(x) \le 0] \le \hat{\Pr}_S[yf(x) \le \Theta] + \tilde{O}\!\left(\sqrt{\frac{d/\Theta^2 + \ln(1/\delta)}{m}}\right) $$

with probability at least $1 - \delta$.

Proof. As shown in the last lecture, we state some known relationships:

$$ \hat{R}_S(\mathcal{H}) \le \tilde{O}\!\left(\sqrt{d/m}\right) \tag{1.3} $$
$$ \mathcal{M} = \{(x, y) \mapsto yf(x) : f \in \mathrm{co}(\mathcal{H})\} \tag{1.4} $$
$$ \hat{R}_S(\mathcal{M}) = \hat{R}_S(\mathrm{co}(\mathcal{H})) = \hat{R}_S(\mathcal{H}) \quad \text{(Rademacher complexity)} \tag{1.5} $$

We also need to note, from the last lecture, that:

$$ \Pr_D[yf(x) \le 0] = E_D[\mathbf{1}\{yf(x) \le 0\}] \tag{1.6} $$
$$ \hat{\Pr}_S[yf(x) \le \Theta] = \hat{E}_S[\mathbf{1}\{yf(x) \le \Theta\}] \tag{1.7} $$
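Before continuing the proof, here is a minimal numpy sketch (mine, not the scribe's) that makes the quantities in Theorem 1 concrete: it builds a point $f \in \mathrm{co}(\mathcal{H})$ from made-up decision stumps and evaluates the empirical margin distribution $\hat{\Pr}_S[yf(x) \le \Theta]$ of (1.7). All data, hypotheses, and weights are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical setup: m examples in R^n with labels in {-1, +1}, and
# T weak hypotheses h_t combined with convex weights a_t, as in (1.2).
rng = np.random.default_rng(0)
m, n, T = 200, 5, 10
X = rng.normal(size=(m, n))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=m))

# Weak hypotheses: sign stumps on random coordinates (illustrative only).
coords = rng.integers(0, n, size=T)
H = np.sign(X[:, coords])        # H[i, t] = h_t(x_i) in {-1, +1}

a = rng.random(T)
a /= a.sum()                     # convex weights: a_t >= 0, sum a_t = 1
f = H @ a                        # f(x_i), a point in co(H)

margins = y * f                  # the margin y_i f(x_i) of each example

def empirical_margin_rate(margins, theta):
    """Empirical probability Pr_S[y f(x) <= theta], as in (1.7)."""
    return np.mean(margins <= theta)

print(empirical_margin_rate(margins, 0.0))   # training error rate
print(empirical_margin_rate(margins, 0.25))  # fraction with margin <= 0.25
```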

From Figure 1.1, we can arrive at:

$$ \mathbf{1}\{u \le 0\} \le \phi(u) \le \mathbf{1}\{u \le \Theta\} \tag{1.8} $$

where $\phi(u)$ is the Lipschitz function shown in the figure.

Figure 1.1: Visualization of the indicator functions and the Lipschitz function $\phi(u)$, whose Lipschitz constant is $L_\phi = 1/\Theta$ (i.e. the slope of the line).

From (1.6),

$$ \Pr_D[yf(x) \le 0] = E_D[\mathbf{1}\{yf(x) \le 0\}] \le E_D[\phi(yf(x))] \tag{1.9} $$

From (1.7),

$$ \hat{\Pr}_S[yf(x) \le \Theta] = \hat{E}_S[\mathbf{1}\{yf(x) \le \Theta\}] \ge \hat{E}_S[\phi(yf(x))] \tag{1.10} $$

By applying (1.1), for all $f \in \mathrm{co}(\mathcal{H})$,

$$ E_D[\phi(yf(x))] \le \hat{E}_S[\phi(yf(x))] + 2\hat{R}_S(\phi \circ \mathcal{M}) + O\!\left(\sqrt{\frac{\ln(1/\delta)}{m}}\right) \tag{1.11} $$

Now, the only thing left is to compute $2\hat{R}_S(\phi \circ \mathcal{M})$. Recall from the previous lecture Talagrand's Lemma, which states that for any hypothesis set $\mathcal{F}$ of real-valued functions, the following inequality holds whenever $\phi : \mathbb{R} \to \mathbb{R}$ is Lipschitz (meaning that for all $u, v$, $|\phi(u) - \phi(v)| \le L_\phi |u - v|$ for some Lipschitz constant $L_\phi$):

$$ \hat{R}_S(\phi \circ \mathcal{F}) \le L_\phi \hat{R}_S(\mathcal{F}) \tag{1.12} $$
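As an aside, the surrogate $\phi$ from Figure 1.1 is concrete enough to write down exactly. The following numpy sketch (my addition, not from the lecture) implements the ramp and numerically spot-checks the sandwich inequality (1.8):

```python
import numpy as np

def phi(u, theta):
    """The piecewise-linear surrogate of Figure 1.1: 1 for u <= 0,
    0 for u >= theta, a line of slope -1/theta in between, so phi is
    Lipschitz with constant L_phi = 1/theta."""
    return np.clip(1.0 - u / theta, 0.0, 1.0)

# Spot-check the sandwich (1.8): 1{u <= 0} <= phi(u) <= 1{u <= theta}.
theta = 0.5
u = np.linspace(-1.0, 1.0, 2001)
lower = (u <= 0).astype(float)
upper = (u <= theta).astype(float)
assert np.all(lower <= phi(u, theta)) and np.all(phi(u, theta) <= upper)
```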

From (1.12),

$$ \hat{R}_S(\phi \circ \mathcal{M}) \le L_\phi \hat{R}_S(\mathcal{M}) = \frac{1}{\Theta} \hat{R}_S(\mathcal{H}) \le \frac{1}{\Theta} \tilde{O}\!\left(\sqrt{d/m}\right) \tag{1.13} $$

From (1.9),

$$ \begin{aligned} \Pr_D[yf(x) \le 0] &= E_D[\mathbf{1}\{yf(x) \le 0\}] \\ &\le E_D[\phi(yf(x))] \\ &\le \hat{E}_S[\phi(yf(x))] + \tilde{O}\!\left(\sqrt{\frac{d/\Theta^2 + \ln(1/\delta)}{m}}\right) && \text{(using 1.11 and 1.13)} \\ &\le \hat{\Pr}_S[yf(x) \le \Theta] + \tilde{O}\!\left(\sqrt{\frac{d/\Theta^2 + \ln(1/\delta)}{m}}\right) && \text{(using 1.10)} \end{aligned} $$

2 Support Vector Machines (SVM)

We start off with $m$ labeled examples $(x_1, y_1), \ldots, (x_m, y_m)$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, +1\}$, and our goal is to find a hypothesis that is consistent with all the examples. We assume for now that the data is linearly separable and that $\|x_i\|_2 \le 1$.

Figure 2.1: Labeled examples in a plane in which a separating hyperplane (i.e. a line) is shown passing through the origin.

Since the hypothesis must separate the positive and negative labels as shown in Figure 2.1, there are in fact many hyperplanes to choose from. Intuitively, the natural idea is to choose the hyperplane that maximally separates the positive and negative labels (i.e. to maximize the margin $\delta$), because we do not want a small jiggling of a point to change its classification.

Assuming that the hyperplane passes through the origin, and following Figure 2.1, $v$ is defined to be the normal vector to the hyperplane with $\|v\|_2 = 1$ (i.e. unit L2-norm). The signed distance of a point $x$ to the hyperplane is then $v \cdot x$:

$$ v \cdot x \;\begin{cases} > 0 & \text{if } x \text{ is above the hyperplane} \\ = 0 & \text{if } x \text{ is on the hyperplane} \\ < 0 & \text{if } x \text{ is below the hyperplane} \end{cases} $$

Hence, the problem formulation becomes: find $v$, $\delta$ solving

$$ \begin{aligned} \max \;& \delta \\ \text{s.t.} \;& \|v\|_2 = 1 \\ \forall i: \;& v \cdot x_i \ge \delta \;\; \text{if } y_i = +1 \\ & v \cdot x_i \le -\delta \;\; \text{if } y_i = -1 \end{aligned} \tag{2.1} $$

in which the two label cases of (2.1) are mathematically equivalent to

$$ y_i (v \cdot x_i) \ge \delta \tag{2.2} $$

Another way to view this problem is in terms of increasing our confidence via the margin. Since we have no control over the number of samples but only over the choice of hyperplane, we can control the complexity of the hyperplanes. Recall that the VC-dimension of linear threshold functions in $\mathbb{R}^n$ equals $n$ (not $n + 1$, because they pass through the origin). Furthermore, the VC-dimension of linear threshold functions with margin $\delta$ is at most $1/\delta^2$, which is independent of $n$ (to be proved as a corollary later). Since this VC-dimension depends only on $1/\delta^2$, the bigger the margin $\delta$, the less complex the class of hyperplanes.

Hence, going back to the problem formulation in (2.1), we divide both sides of (2.2) by $\delta$:

$$ y_i \left( \frac{v}{\delta} \cdot x_i \right) \ge \frac{\delta}{\delta} = 1 $$

If we let $w = v/\delta$, then

$$ \|w\| = \frac{\|v\|}{\delta} = \frac{1}{\delta} $$

so maximizing $\delta$ is the same as minimizing $\|w\|$, and we can rewrite the problem formulation as a convex program:

$$ \min \; \tfrac{1}{2} \|w\|_2^2 \tag{2.3} $$
$$ \text{s.t.} \;\; y_i (w \cdot x_i) \ge 1 \tag{2.4} $$

Since every convex program has an equivalent dual formulation, we can transform this into a dual maximization problem using Lagrange multipliers $\alpha_i$ (to be shown in later lectures!).
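Before passing to the dual, the primal program (2.3)-(2.4) is itself directly solvable by any off-the-shelf QP solver. Here is a minimal sketch assuming the cvxpy modeling library (not part of the notes), on made-up separable toy data:

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data (made up for illustration); labels in {-1, +1}.
X = np.array([[0.4, 0.6], [0.5, 0.9], [-0.3, -0.5], [-0.7, -0.2]])
y = np.array([1.0, 1.0, -1.0, -1.0])

n = X.shape[1]
w = cp.Variable(n)

# Primal hard-margin program (2.3)-(2.4): minimize (1/2)||w||^2
# subject to y_i (w . x_i) >= 1 for all i.
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                  [cp.multiply(y, X @ w) >= 1])
prob.solve()

w_opt = w.value
delta = 1.0 / np.linalg.norm(w_opt)  # recovered margin, since ||w|| = 1/delta
v = w_opt * delta                    # unit normal of the separating hyperplane
print("w =", w_opt, " margin delta =", delta)
```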

The resulting dual program is:

$$ \begin{aligned} \max \;& \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \\ \text{s.t.} \;& \forall i: \alpha_i \ge 0 \end{aligned} $$

It turns out that

$$ w = \sum_{i=1}^{m} \alpha_i y_i x_i $$

and a lot of the $\alpha_i$ will be zero. In fact, $\alpha_i$ can be non-zero only if the corresponding example $(x_i, y_i)$ is a support vector, meaning that $y_i (w \cdot x_i) = 1$. Since $w$ depends only on the support vectors, a small subset of the examples, we can also see, from problem 1 of homework 4, that $\mathrm{err}(h) \le \tilde{O}\!\left(\frac{k + \ln(1/\delta)}{m}\right)$, where $k$ is the number of support vectors and $\delta$ is the confidence parameter (the bound holds with probability at least $1 - \delta$); the error depends only on $k$, not on the number of dimensions.

2.1 Linearly Inseparable Data

So far, we have been talking about linearly separable data such as that shown in Figure 2.1, where there exists some hyperplane separating the positive and negative examples. What happens when the data is linearly inseparable? Consider the following case:

Figure 2.2: Linearly inseparable data on a plane with the hypothesis a line passing through the origin, where $\xi_1$ and $\xi_2$ represent the distances moved.

In this simple case, to make the data linearly separable, we can move the offending points through distances $\xi_1$ and $\xi_2$ to the correct side of the hyperplane. However, there is a penalty to be paid for the distances moved, and we need to incorporate it into the problem formulation (2.3)-(2.4).
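The penalized ("soft-margin") program appears below; it is what standard SVM solvers implement. As a quick preview, and as a sanity check on the support-vector expansion $w = \sum_i \alpha_i y_i x_i$ above, here is a hedged scikit-learn sketch on made-up, slightly overlapping data (my example, not from the lecture; note that sklearn's `SVC` also fits an intercept rather than forcing the hyperplane through the origin as in class):

```python
import numpy as np
from sklearn.svm import SVC

# Slightly overlapping toy data (made up for illustration).
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(+1.0, 1.2, size=(20, 2)),
               rng.normal(-1.0, 1.2, size=(20, 2))])
y = np.array([+1] * 20 + [-1] * 20)

# Soft-margin linear SVM; the penalty parameter C below is an arbitrary choice.
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ stores y_i * alpha_i for the support vectors only, so the
# expansion w = sum_i alpha_i y_i x_i needs just those examples.
w = clf.dual_coef_ @ clf.support_vectors_
print("number of support vectors:", len(clf.support_))
print("w from dual expansion:", w.ravel())
print("w reported by sklearn:", clf.coef_.ravel())
```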

$$ \begin{aligned} \min \;& \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i \\ \text{s.t.} \;& y_i (w \cdot x_i) \ge 1 - \xi_i \\ & \xi_i \ge 0 \end{aligned} $$

where $C$ is a parameter the user has to choose. The dual maximization form then becomes:

$$ \begin{aligned} \max \;& \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \\ \text{s.t.} \;& \forall i: 0 \le \alpha_i \le C \end{aligned} $$

However, what if the data is not even close to being linearly separable? Consider the following case:

Figure 2.3: Linearly inseparable data on a plane.

Support Vector Machines can still be used in this case. To do this, we map the data to a higher dimensional space (here, from 2-dimensional space to 6-dimensional space):

$$ (x_1, x_2) = x \mapsto \psi(x) = (1, x_1, x_2, x_1^2, x_2^2, x_1 x_2) \tag{2.5} $$

However, how does this make the data linearly separable? Since the hyperplane equation is given by $w \cdot x = 0$, the hyperplane equation in the 6-dimensional space becomes:

$$ a + b x_1 + c x_2 + d x_1^2 + e x_2^2 + f x_1 x_2 = 0 \tag{2.6} $$

which is exactly the equation of a conic section in 2-dimensional space. Graphically, the hyperplane in this example is shown in Figure 2.4.
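To see (2.5)-(2.6) in action, the following numpy sketch (mine; the points and radius are made up) checks that a circular boundary in $\mathbb{R}^2$, which no line can realize, is a plain hyperplane $w \cdot \psi(x) = 0$ in the lifted 6-dimensional space:

```python
import numpy as np

def psi(x):
    """The explicit degree-2 feature map from (2.5)."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

# The circle x1^2 + x2^2 = r^2 corresponds to the hyperplane
# w . psi(x) = 0 with w = (-r^2, 0, 0, 1, 1, 0), i.e. (2.6) with
# a = -r^2, b = c = f = 0, d = e = 1.
r = 0.5
w = np.array([-r**2, 0.0, 0.0, 1.0, 1.0, 0.0])

rng = np.random.default_rng(1)
for x in rng.uniform(-1, 1, size=(5, 2)):
    lifted_side = np.sign(w @ psi(x))    # linear decision in the lifted space
    circle_side = np.sign(x @ x - r**2)  # inside/outside the circle in R^2
    assert lifted_side == circle_side
```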

Figure 2.4: Linearly inseparable data on a plane with the hypothesis a circle in this case.

The above approach can be generalized to $n$ dimensions: $x \in \mathbb{R}^n \mapsto \psi(x)$. However, this raises both computational and statistical issues.

Computational problem: We start with examples in $n$ dimensions, and adding all terms up to degree $d$ effectively maps to $O(n^d)$ dimensions. So we might start in 100 dimensions and end up in a trillion dimensions or more; even reading such a vector just once would take a very long time.

Statistical problem: There may be overfitting due to the addition of so many parameters.

For the statistical problem, the Support Vector Machine is able to overcome it since the VC-dimension is at most $1/\delta^2$, and hence $\mathrm{err}(h)$ does not depend on the number of dimensions. For the computational problem, observe from the dual maximization formulation that we only ever need inner products of pairs of instances, because $w = \sum_{i=1}^{m} \alpha_i y_i x_i$ and

$$ H(x) = \mathrm{sign}(w \cdot x) = \mathrm{sign}\!\left( \sum_{i=1}^{m} \alpha_i y_i (x_i \cdot x) \right) $$

Hence, if we can compute the inner product efficiently, then we can overcome the computational problem as well. We first consider rescaling the coordinates of (2.5), which has no effect on the hyperplanes since the coefficients are arbitrary:

$$ (x_1, x_2) = x \mapsto \psi(x) = (1, \sqrt{2}\,x_1, \sqrt{2}\,x_2, x_1^2, x_2^2, \sqrt{2}\,x_1 x_2) \tag{2.7} $$

Similarly,

$$ (u_1, u_2) = u \mapsto \psi(u) = (1, \sqrt{2}\,u_1, \sqrt{2}\,u_2, u_1^2, u_2^2, \sqrt{2}\,u_1 u_2) \tag{2.8} $$

For this mapping from 2-dimensional to 6-dimensional space, we just have to compute the inner product of $\psi(x)$ and $\psi(u)$:

$$ \begin{aligned} \psi(x) \cdot \psi(u) &= 1 + 2 x_1 u_1 + 2 x_2 u_2 + x_1^2 u_1^2 + x_2^2 u_2^2 + 2 x_1 x_2 u_1 u_2 \\ &= (1 + x_1 u_1 + x_2 u_2)^2 \\ &= (1 + x \cdot u)^2 \end{aligned} $$

Hence, adding all terms up to degree $d$, the computation becomes simply $(1 + x \cdot u)^d$. This elegant result means that we can implicitly compute inner products in the high dimensional space by performing all operations in the original low dimensional space and then simply raising to the power $d$. This method is called the kernel trick. In general, a kernel is a function $K(x, u)$ that implicitly computes the inner product $\psi(x) \cdot \psi(u)$ for some mapping $\psi$. An example is the radial basis kernel, $K(x, u) = \exp\!\left(-\|x - u\|_2^2\right)$. To apply the kernel trick, we just have to replace $x_i \cdot x_j$ by $K(x_i, x_j)$ in the dual maximization formulation.

It is also important to note the following: recall that under the assumption $\|x_i\|_2 \le 1$, the VC-dimension is at most $1/\delta^2$. However, if $\|x_i\|_2 \le R$, the VC-dimension is at most $R^2/\delta^2$. This means there is a tradeoff between the radius $R$ and the margin $\delta$, since the mapping $\psi$ may increase $\delta$, but might also increase $R$.
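The identity $\psi(x) \cdot \psi(u) = (1 + x \cdot u)^2$ is easy to verify numerically. Here is a short numpy sketch (my addition; the sampled points are arbitrary) comparing the explicit 6-dimensional inner product against the kernel computed entirely in 2 dimensions:

```python
import numpy as np

def psi(x):
    """The rescaled degree-2 map from (2.7)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

def poly_kernel(x, u, d=2):
    """Polynomial kernel (1 + x . u)^d; for d = 2 it implicitly computes
    the 6-dimensional inner product psi(x) . psi(u)."""
    return (1.0 + x @ u) ** d

rng = np.random.default_rng(2)
x, u = rng.normal(size=2), rng.normal(size=2)

explicit = psi(x) @ psi(u)          # inner product in the lifted space
implicit = poly_kernel(x, u, d=2)   # computed entirely in 2 dimensions
assert np.isclose(explicit, implicit)
print(explicit, implicit)
```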
