Machine Learning 4771



1 Machine Learning 4771
Instructor: Tony Jebara

2 Topic 5
Generalization Guarantees
VC-Dimension
Nearest Neighbor Classification (infinite VC dimension)
Structural Risk Minimization
Support Vector Machines

3 Empirical Risk Minimization
Example: non-pdf linear classifiers $f(x;\theta) = \mathrm{sign}(\theta^T x + \theta_0) \in \{-1,1\}$
Recall ERM: $R_{emp}(\theta) = \frac{1}{N}\sum_{i=1}^{N} L(y_i, f(x_i;\theta)) \in [0,1]$
Have loss function:
quadratic: $L(y,x,\theta) = (y - f(x;\theta))^2$
linear: $L(y,x,\theta) = |y - f(x;\theta)|$
binary: $L(y,x,\theta) = \mathrm{step}(-y f(x;\theta))$
Empirical $R_{emp}(\theta)$ approximates the true risk (expected error):
$R(\theta) = E_P\{L(x,y,\theta)\} = \int_{X \times Y} P(x,y)\, L(x,y,\theta)\, dx\, dy \in [0,1]$
But, we don't know the true P(x,y)!
If infinite data, law of large numbers says: $\lim_{N \to \infty} \min_\theta R_{emp}(\theta) = \min_\theta R(\theta)$
But, in general, can't make guarantees for ERM solution: $\arg\min_\theta R_{emp}(\theta) \neq \arg\min_\theta R(\theta)$
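As a minimal sketch of the empirical risk on this slide, the snippet below evaluates $R_{emp}(\theta)$ for a linear classifier under the three losses. The function names and the toy dataset are hypothetical, chosen only for illustration, and `sign(0)` is taken as +1 by assumption.

```python
def linear_classifier(x, theta, theta0):
    """f(x; theta) = sign(theta^T x + theta0); sign(0) is treated as +1 (assumption)."""
    activation = sum(t * xi for t, xi in zip(theta, x)) + theta0
    return 1 if activation >= 0 else -1


def quadratic_loss(y, f):
    return (y - f) ** 2


def linear_loss(y, f):
    return abs(y - f)


def binary_loss(y, f):
    # step(-y f): 1 when the prediction disagrees with the label, else 0
    return 1 if y * f < 0 else 0


def empirical_risk(data, theta, theta0, loss):
    """Average loss over the training set: (1/N) * sum_i L(y_i, f(x_i; theta))."""
    total = sum(loss(y, linear_classifier(x, theta, theta0)) for x, y in data)
    return total / len(data)


# Hypothetical toy data: (input, label) pairs; the last point is misclassified.
data = [((2, 1), 1), ((1, 3), 1), ((-1, -1), -1), ((-2, 0.5), -1), ((0.5, -2), 1)]
for loss in (quadratic_loss, linear_loss, binary_loss):
    print(loss.__name__, empirical_risk(data, (1, 1), 0, loss))
```

With labels in {-1, 1} the three losses differ only by a constant factor per mistake, so they rank classifiers identically here; they behave differently once f is real-valued.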

4 Bounding the True Risk
ERM is inconsistent and not guaranteed: it may do better on training than on test, so the true risk $R(\hat\theta)$ can exceed the empirical risk $R_{emp}(\hat\theta)$.
Idea: add a prior or regularizer to $R_{emp}(\theta)$
Define capacity or confidence $C(\theta)$, which favors simpler $\theta$:
$J(\theta) = R_{emp}(\theta) + C(\theta)$
If $R(\theta) \le J(\theta)$, we have a bound: $J(\theta)$ is a guaranteed risk
After training, can guarantee future error rate is $\le \min_\theta J(\theta)$

5 Bound the True Risk with VC
But, how to find a guarantee? Difficult, but there is one.
Theorem (Vapnik): with probability $1-\eta$, where $\eta$ is a number in [0,1], the following bound holds:
$R(\theta) \le J(\theta) = R_{emp}(\theta) + \frac{2h\log\left(\frac{2eN}{h}\right) + 2\log\left(\frac{4}{\eta}\right)}{N}\left(1 + \sqrt{1 + \frac{N R_{emp}(\theta)}{h\log\left(\frac{2eN}{h}\right) + \log\left(\frac{4}{\eta}\right)}}\right)$
N = number of data points
h = Vapnik-Chervonenkis (VC) dimension (1970's) = capacity of the classifier class $f(\cdot;\theta)$
Note, above is independent of the true P(x,y)
A worst-case scenario bound, guaranteed for all P(x,y)
VC dimension is not just the # of parameters a classifier has
VC measures # of different datasets it can classify perfectly
Structural Risk Minimization: minimize risk bound $J(\theta)$
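To get a feel for the bound, the following sketch evaluates $J(\theta)$ numerically. It assumes the logarithms are natural logs and that the bound has the form $R_{emp} + \frac{2a}{N}\left(1 + \sqrt{1 + N R_{emp}/a}\right)$ with $a = h\log(2eN/h) + \log(4/\eta)$; the function name `guaranteed_risk` and the example numbers are hypothetical.

```python
import math


def guaranteed_risk(r_emp, h, n, eta=0.05):
    """Evaluate the VC risk bound J for empirical risk r_emp, VC dimension h,
    n training points and confidence 1 - eta (natural logs assumed)."""
    a = h * math.log(2 * math.e * n / h) + math.log(4 / eta)
    return r_emp + (2 * a / n) * (1 + math.sqrt(1 + n * r_emp / a))


# Hypothetical numbers: same training error, different capacity or data size.
print(guaranteed_risk(0.1, h=10, n=10_000))
print(guaranteed_risk(0.1, h=100, n=10_000))   # larger h loosens the bound
print(guaranteed_risk(0.1, h=10, n=1_000))     # fewer points loosen the bound
```

Nothing in the computation refers to P(x,y), which reflects the distribution-free, worst-case nature of the bound.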

6 VC Dimension & Shattering
How to compute h or VC for a family of functions $f(\cdot;\theta)$?
h = # of training points that can be shattered
Recall, classifier maps input to output: $f(x;\theta) \to y \in \{-1,1\}$
Shattering: I pick h points & place them at $x_1,\ldots,x_h$
You challenge me with $2^h$ possible labelings $y_1,\ldots,y_h \in \{\pm 1\}^h$
VC dimension is the maximum # of points I can place which $f(x;\theta)$ can correctly classify for arbitrary labeling $y_1,\ldots,y_h$
Example: for 2d linear classifier h=3
$f(x;\theta) = \mathrm{sign}(x_1\theta_1 + x_2\theta_2 + \theta_0)$
can't ever shatter 4 points! or 3 points on a straight line
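The shattering game can be played by brute force. The sketch below enumerates every labeling of a point set and tries to fit a 2d linear classifier with the perceptron rule. This is a heuristic check under a stated assumption: a perceptron run that has not converged within `max_epochs` is treated as "not separable", which is evidence rather than proof. All names are hypothetical.

```python
from itertools import product


def perceptron_separates(points, labels, max_epochs=1000):
    """Try to find (w1, w2, b) with y * (w1*x1 + w2*x2 + b) > 0 for all points."""
    w1 = w2 = b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            if y * (w1 * x1 + w2 * x2 + b) <= 0:
                # Standard perceptron update on a misclassified point
                w1 += y * x1
                w2 += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False  # no separator found within the epoch cap (heuristic)


def shatters(points, max_epochs=1000):
    """True if every one of the 2^h labelings was fit by the perceptron."""
    return all(perceptron_separates(points, labels, max_epochs)
               for labels in product((-1, 1), repeat=len(points)))


print(shatters([(0, 0), (1, 0), (0, 1)]))          # 3 points in general position
print(shatters([(0, 0), (1, 0), (2, 0)]))          # 3 points on a straight line
print(shatters([(0, 0), (1, 1), (1, 0), (0, 1)]))  # 4 points (includes XOR labeling)
```

Note that h=3 only requires that some placement of 3 points can be shattered; the collinear placement failing does not contradict it.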

7 VC Dimension & Shattering
More generally, for higher dimensional linear classifiers, a hyperplane in $\mathbb{R}^d$ shatters any set of linearly independent points.
Can choose d+1 linearly indep. points, so h=d+1
Note: VC is not necessarily proportional to # of parameters
Example: sinusoidal 1d classifier $f(x;\theta) = \mathrm{sign}(\sin(\theta x))$
number of parameters=1 but h=infinity!
since I can choose: $x_i = 10^{-i}$, $i = 1,\ldots,h$
no matter what labeling you challenge, $y_1,\ldots,y_h \in \{\pm 1\}^h$,
using $\theta = \pi\left(1 + \sum_{i=1}^{h} \frac{1}{2}(1 - y_i)\,10^{i}\right)$ shatters perfectly
But, as a side note, if I choose 4 equally spaced x's then cannot shatter
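The sinusoid construction can be checked numerically. The sketch below assumes the points are $x_i = 10^{-i}$ and the parameter is $\theta = \pi\left(1 + \sum_i \frac{1}{2}(1-y_i)10^i\right)$, and verifies every labeling for a small h. Function names are hypothetical, and floating-point precision limits how large h can be in practice.

```python
import math
from itertools import product


def sin_classifier(x, theta):
    """f(x; theta) = sign(sin(theta * x)) with a single parameter theta."""
    return 1 if math.sin(theta * x) > 0 else -1


def shattering_theta(labels):
    """theta = pi * (1 + sum_i (1 - y_i)/2 * 10^i), with i starting at 1."""
    return math.pi * (1 + sum((1 - y) / 2 * 10 ** i
                              for i, y in enumerate(labels, start=1)))


def sinusoid_shatters(h):
    """Check all 2^h labelings of the points x_i = 10^(-i), i = 1..h."""
    xs = [10.0 ** (-i) for i in range(1, h + 1)]
    for labels in product((-1, 1), repeat=h):
        theta = shattering_theta(labels)
        if any(sin_classifier(x, theta) != y for x, y in zip(xs, labels)):
            return False
    return True


print(sinusoid_shatters(4))
```

Each label contributes its own decimal digit to theta, which is why a single real parameter can memorize arbitrarily many labels.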

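8 VC Dimension & Shattering
Recall that VC dimension gives an upper bound
We want to minimize h since that minimizes $C(\theta)$ & $J(\theta)$
If can't compute h exactly but can compute an upper bound $h^+ \ge h$, can plug $h^+$ into the bound & still have a guarantee
Also, sometimes the bound is trivial
Need h/N of about 0.3 (or lower) before $C(\theta) < 1$ (recall $R(\theta)$ is in [0,1])
Note: low h implies good performance, but h = infinity does not by itself imply poor performance, since the bound is only an upper bound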

9 Nearest Neighbors & VC
Consider Nearest Neighbors classification algorithm:
Input a query example x
Find training example $x_i$ in $\{x_1,\ldots,x_N\}$ closest to x
Predict label for x as $y_i$ of neighbor
Often use Euclidean distance $\|x - x_i\|$ to measure closeness
Nearest neighbors shatters any set of points!
So VC=infinity, $C(\theta)$=infinity, guaranteed risk=infinity
But still works well in practice
So h = infinity does not imply poor performance, while low h does imply good performance
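A minimal sketch of the nearest-neighbor rule, together with a brute-force check that it reproduces every labeling of its own training set (which is what "shatters any set of points" means here). It assumes the training points are distinct and uses Euclidean distance; the function names are hypothetical.

```python
import math
from itertools import product


def nearest_neighbor_predict(query, train_x, train_y):
    """Return the label of the training point closest to query (Euclidean)."""
    best = min(range(len(train_x)), key=lambda i: math.dist(query, train_x[i]))
    return train_y[best]


def nn_shatters(points):
    """True if 1-NN trained on (points, labels) recovers every possible labeling."""
    for labels in product((-1, 1), repeat=len(points)):
        if any(nearest_neighbor_predict(x, points, labels) != y
               for x, y in zip(points, labels)):
            return False
    return True


points = [(0, 0), (1, 1), (1, 0), (0, 1), (0.5, 0.5)]
print(nn_shatters(points))
print(nearest_neighbor_predict((0.9, 0.8), [(0, 0), (1, 1)], [-1, 1]))
```

Each training point is its own nearest neighbor at distance zero, so the training error is zero for any labeling and any number of points.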

10 VC Dimension & Large Margins
Linear classifiers are too big a function class since h=d+1
Can reduce VC dimension if we restrict them
Constrain linear classifiers to data living inside a sphere
Gap-tolerant classifiers: a linear classifier whose activity is constrained to a sphere & outside a margin
Only count errors in shaded region; elsewhere have $L(x,y,\theta)=0$
If M is small relative to D, can still shatter 3 points
[Figure: gap-tolerant classifier inside a sphere. M = margin, D = diameter, d = dimensionality]

11 VC Dimension & Large Margins
But, as M grows relative to D, can only shatter 2 points!
[Figure: two panels, "Can't shatter 3" and "Can shatter 2"]
For hyperplanes, as M grows vs. D, shatter fewer points!
VC dimension h goes down if gap-tolerant classifier has larger margin; general formula is:
$h \le \min\left\{\left\lceil \frac{D^2}{M^2} \right\rceil, d\right\} + 1$
Before, just had h=d+1. Now we have a smaller h
If data is anywhere, D is infinite and back to h=d+1
Typically real data is bounded (by sphere), D is fixed
Maximizing M reduces h, improving guaranteed risk $J(\theta)$
Note: $R(\theta)$ doesn't count errors in margin or outside sphere
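The margin-based capacity formula is simple enough to tabulate. The sketch below assumes the bound reads $h \le \min\{\lceil D^2/M^2 \rceil, d\} + 1$ and treats an infinite diameter as the unrestricted case h = d+1; the function name is hypothetical.

```python
import math


def gap_tolerant_vc_bound(diameter, margin, dim):
    """Upper bound on VC dimension: min(ceil(D^2 / M^2), d) + 1."""
    if math.isinf(diameter):
        return dim + 1  # unbounded data: back to the plain linear classifier
    return min(math.ceil(diameter ** 2 / margin ** 2), dim) + 1


print(gap_tolerant_vc_bound(10.0, 1.0, dim=2))      # small margin: 3
print(gap_tolerant_vc_bound(1.0, 1.0, dim=2))       # margin as large as D: 2
print(gap_tolerant_vc_bound(math.inf, 1.0, dim=2))  # unbounded data: 3
```

For 2d data this reproduces the two cases on the slides: 3 shattered points when M is small relative to D, and only 2 once M is comparable to D.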

12 Structural Risk Minimization
Structural Risk Minimization: minimize risk bound $J(\theta)$ by reducing empirical error & reducing VC dimension h
$R(\theta) \le J(\theta) = R_{emp}(\theta) + \frac{2h\log\left(\frac{2eN}{h}\right) + 2\log\left(\frac{4}{\eta}\right)}{N}\left(1 + \sqrt{1 + \frac{N R_{emp}(\theta)}{h\log\left(\frac{2eN}{h}\right) + \log\left(\frac{4}{\eta}\right)}}\right)$
for each model i in list of hypotheses:
1) compute its $h = h_i$
2) $\theta^* = \arg\min_\theta R_{emp}(\theta)$
3) compute $J(\theta^*, h_i)$
choose model with lowest $J(\theta^*, h_i)$
Or, directly optimize over both: $(\theta^*, h) = \arg\min_{\theta,h} J(\theta,h)$
[Figure: space of different classifiers or hypotheses, with capacities $h_1$, $h_2$, $h_3$]
If possible, min empirical error while also minimizing VC
For gap-tolerant linear classifiers, minimize $R_{emp}(\theta)$ while maximizing margin; support vector machines do just that!
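The model-selection loop above can be sketched as follows. This is only an illustration under assumptions: each candidate is summarized by a hypothetical triple (name, $h_i$, training error of its ERM solution), the logs in the bound are natural logs, and the candidate numbers are made up.

```python
import math


def guaranteed_risk(r_emp, h, n, eta=0.05):
    """VC risk bound J (natural logs assumed)."""
    a = h * math.log(2 * math.e * n / h) + math.log(4 / eta)
    return r_emp + (2 * a / n) * (1 + math.sqrt(1 + n * r_emp / a))


def structural_risk_minimization(candidates, n, eta=0.05):
    """candidates: list of (name, h_i, r_emp_i), where r_emp_i is the empirical
    risk of the ERM solution theta* within model class i.
    Returns (best J, name of the class with the lowest guaranteed risk)."""
    scored = [(guaranteed_risk(r_emp, h, n, eta), name)
              for name, h, r_emp in candidates]
    return min(scored)


# Hypothetical model classes: richer classes fit the training data better.
candidates = [("small", 3, 0.20), ("medium", 20, 0.05), ("large", 500, 0.0)]
best_j, best_name = structural_risk_minimization(candidates, n=5000)
print(best_name, best_j)
```

In this made-up example the largest class reaches zero training error yet loses, because its capacity term dominates the bound.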

13 Support Vector Machines
Support vector machines are (in the simplest case) linear classifiers that do structural risk minimization (SRM)
Directly maximize margin to reduce guaranteed risk $J(\theta)$
Assume first the 2-class data is linearly separable:
have $\{(x_1,y_1),\ldots,(x_N,y_N)\}$ where $x_i \in \mathbb{R}^D$ and $y_i \in \{-1,1\}$
$f(x;\theta) = \mathrm{sign}(w^T x + b)$
Decision boundary or hyperplane given by $w^T x + b = 0$
Note: can scale w & b while keeping same boundary
Many solutions exist which have empirical error $R_{emp}(\theta)=0$
Want widest or thickest one (max margin), also it's unique!

14 Side Note: Constraints
How to minimize a function subject to equality constraints?
$\min_{x_1,x_2} f(x) = \min_{x_1,x_2} b_1 x_1 + b_2 x_2 + \frac{1}{2}H_{11}x_1^2 + H_{12}x_1 x_2 + \frac{1}{2}H_{22}x_2^2 = \min_x b^T x + \frac{1}{2}x^T H x$
$\nabla f(x) = b + Hx = 0 \;\Rightarrow\; x = -H^{-1}b$
Only walk on $x_1 = 2x_2$, or $x_1 - 2x_2 = 0$
Use Lagrange multipliers: for each constraint, subtract it times a lambda variable. Lambda blows up the minimization if we don't satisfy the constraint:
$\min_{x_1,x_2}\max_\lambda f(x) - \lambda(\text{equality condition} = 0)$
$= \min_{x_1,x_2}\max_\lambda b_1 x_1 + b_2 x_2 + \frac{1}{2}H_{11}x_1^2 + H_{12}x_1 x_2 + \frac{1}{2}H_{22}x_2^2 - \lambda(x_1 - 2x_2)$

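15 Side Note: Constraints
Cost minimization with equality constraints:
1) Subtract each constraint times an extra variable (a Lagrange multiplier $\lambda$, like an adversary variable)
2) Take partials with respect to x and set to zero
3) Plug solution into constraint to find lambda
$\min_x\max_\lambda f(x) - \lambda(\text{equality condition} = 0) = \min_x\max_\lambda b^T x + \frac{1}{2}x^T H x - \lambda(x_1 - 2x_2)$
$\frac{\partial}{\partial x}: \; b + Hx - \lambda\begin{bmatrix}1\\-2\end{bmatrix} = 0 \;\Rightarrow\; x = \lambda H^{-1}\begin{bmatrix}1\\-2\end{bmatrix} - H^{-1}b$
$x_1 - 2x_2 = \begin{bmatrix}1 & -2\end{bmatrix}x = 0 \;\Rightarrow\; \lambda\begin{bmatrix}1 & -2\end{bmatrix}H^{-1}\begin{bmatrix}1\\-2\end{bmatrix} - \begin{bmatrix}1 & -2\end{bmatrix}H^{-1}b = 0$
$\lambda = \frac{\begin{bmatrix}1 & -2\end{bmatrix}H^{-1}b}{\begin{bmatrix}1 & -2\end{bmatrix}H^{-1}\begin{bmatrix}1\\-2\end{bmatrix}}$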
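A small numeric instance of the three-step recipe, as a sketch. It assumes a symmetric, invertible 2x2 matrix H, the cost $b^T x + \frac{1}{2}x^T H x$, and the constraint $x_1 - 2x_2 = 0$ written as $a^T x = 0$ with $a = (1, -2)$; the helper names and the specific H and b are hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def solve_2x2(H, v):
    """Return H^{-1} v for a 2x2 matrix H given as ((h11, h12), (h21, h22))."""
    (h11, h12), (h21, h22) = H
    det = h11 * h22 - h12 * h21
    return [(h22 * v[0] - h12 * v[1]) / det,
            (-h21 * v[0] + h11 * v[1]) / det]


def constrained_quadratic_min(H, b, a=(1.0, -2.0)):
    """Minimize b^T x + 1/2 x^T H x subject to a^T x = 0 via one multiplier.
    Stationarity gives x = lam * H^{-1} a - H^{-1} b; the constraint fixes lam."""
    hinv_a = solve_2x2(H, a)
    hinv_b = solve_2x2(H, b)
    lam = dot(a, hinv_b) / dot(a, hinv_a)
    x = [lam * ha - hb for ha, hb in zip(hinv_a, hinv_b)]
    return x, lam


H = ((2.0, 0.0), (0.0, 2.0))
b = (-2.0, -2.0)
x, lam = constrained_quadratic_min(H, b)
print(x, lam)          # constrained minimizer and its multiplier
print(x[0] - 2 * x[1]) # constraint residual, should be ~0
```

Without the constraint the minimizer would be $-H^{-1}b = (1, 1)$, which violates $x_1 = 2x_2$; the multiplier shifts the solution onto the constraint line.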

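16 Support Vector Machines
Define:
H = decision hyperplane, $w^T x + b = 0$
$H_+$ = positive margin hyperplane
$H_-$ = negative margin hyperplane
q = distance from decision plane to origin
$q = \min_x \|x - 0\|$ subject to $w^T x + b = 0$
1) $\min_x \max_\lambda \frac{1}{2}\|x - 0\|^2 - \lambda(w^T x + b) = \min_x \max_\lambda \frac{1}{2}x^T x - \lambda(w^T x + b)$
grad$_x$ = 0: $x - \lambda w = 0 \;\Rightarrow\; x = \lambda w$
2) plug into constraint: $w^T x + b = 0 \;\Rightarrow\; w^T(\lambda w) + b = 0 \;\Rightarrow\; \lambda = -\frac{b}{w^T w}$
3) Sol'n: $\hat{x} = -\left(\frac{b}{w^T w}\right) w$
4) distance: $q = \|\hat{x} - 0\| = \frac{|b|}{w^T w}\sqrt{w^T w} = \frac{|b|}{\|w\|}$
5) Define without loss of generality, since can scale b & w:
H: $w^T x + b = 0$
$H_+$: $w^T x + b = +1$
$H_-$: $w^T x + b = -1$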
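The closed-form result of this derivation can be sanity-checked numerically. The sketch below follows the same steps ($\lambda = -b / w^T w$, $\hat{x} = \lambda w$, $q = \|\hat{x}\|$) for a hypothetical w and b; the function name is an assumption made for illustration.

```python
import math


def closest_point_to_origin(w, b):
    """Closest point on the hyperplane w^T x + b = 0 to the origin, and its distance."""
    ww = sum(wi * wi for wi in w)      # w^T w
    lam = -b / ww                      # multiplier from plugging x = lam * w into the constraint
    x_hat = [lam * wi for wi in w]
    q = math.sqrt(sum(v * v for v in x_hat))
    return x_hat, q


w, b = (3.0, 4.0), -10.0
x_hat, q = closest_point_to_origin(w, b)
print(x_hat, q)
print(abs(b) / math.hypot(*w))  # |b| / ||w|| for comparison
```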

17 Support Vector Machines
The constraints on the SVM for $R_{emp}(\theta)=0$ are thus:
$w^T x_i + b \ge +1$ for $y_i = +1$
$w^T x_i + b \le -1$ for $y_i = -1$
Or more simply: $y_i(w^T x_i + b) - 1 \ge 0$
The margin of the SVM is: $m = d_+ + d_-$
Distance to origin:
H: $w^T x + b = 0 \;\to\; q = \frac{|b|}{\|w\|}$
$H_+$: $w^T x + b = +1 \;\to\; q_+ = \frac{|b - 1|}{\|w\|}$
$H_-$: $w^T x + b = -1 \;\to\; q_- = \frac{|b + 1|}{\|w\|}$
Therefore: $d_+ = d_- = \frac{1}{\|w\|}$ and margin $m = \frac{2}{\|w\|}$
Want to max margin, or equivalently minimize $\|w\|$ or $\|w\|^2$
SVM Problem: $\min \frac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) - 1 \ge 0$
This is a quadratic program!
Can plug this into a Matlab function called qp(), done!
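To make the constraint and margin formulas concrete, the sketch below checks feasibility of $y_i(w^T x_i + b) - 1 \ge 0$ and computes the margin $2/\|w\|$ for a few hand-picked (w, b) pairs. The dataset and candidate parameters are hypothetical, and no optimization is performed.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def satisfies_svm_constraints(w, b, data, tol=1e-12):
    """True if y_i * (w^T x_i + b) - 1 >= 0 for every training pair."""
    return all(y * (dot(w, x) + b) - 1 >= -tol for x, y in data)


def margin(w):
    """Margin m = 2 / ||w|| between the hyperplanes H+ and H-."""
    return 2.0 / math.sqrt(dot(w, w))


data = [((2, 0), 1), ((3, 1), 1), ((0, 0), -1), ((-1, 2), -1)]
for w, b in [((1.0, 0.0), -1.0), ((2.0, 0.0), -2.0), ((0.5, 0.0), -0.5)]:
    print(w, b, satisfies_svm_constraints(w, b, data), margin(w))
```

The first two candidates describe the same decision boundary; both are feasible, but the one with the smaller norm has the larger margin. Shrinking w further breaks the constraints, which is why the problem is to minimize $\|w\|^2$ subject to them.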

18 Side Note: Optimization Tools
A hierarchy of Matlab optimization packages to use:
Linear Programming (LP): $\min_x b^T x$ s.t. $c_i^T x \ge \alpha_i \;\forall i$
< Quadratic Programming (QP): $\min_x \frac{1}{2}x^T H x + b^T x$ s.t. $c_i^T x \ge \alpha_i \;\forall i$
< Quadratically Constrained Quadratic Programming (QCQP)
< Semidefinite Programming (SDP)
< Convex Programming (CP)
< Polynomial Time Algorithms (P)

19 Side Note: Optimization Tools
Each data point adds a linear inequality $y_i(w^T x_i + b) - 1 \ge 0$ to the QP
Each point cuts a half plane of allowable SVMs and reduces the green region
The SVM is the closest point to the origin that is still in the green region (it minimizes $\frac{1}{2}w^T w$)
The perceptron algorithm just puts us randomly in the green region
QP runs in cubic polynomial time
There are D values in the w vector
Needs $O(D^3)$ run time
But, there is a DUAL SVM in $O(N^3)$!

20 SVM in Dual Form
We can also solve the problem via convex duality
Primal SVM problem $L_P$: $\min \frac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + b) - 1 \ge 0$
This is a quadratic program: quadratic cost function with multiple linear inequalities (these carve out a convex hull)
Subtract from cost each inequality times an $\alpha$ Lagrange multiplier:
$L_P = \min_{w,b}\max_{\alpha \ge 0} \frac{1}{2}\|w\|^2 - \sum_i \alpha_i\left(y_i(w^T x_i + b) - 1\right)$
Take derivatives with respect to w & b:
$\frac{\partial L_P}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i$
$\frac{\partial L_P}{\partial b} = -\sum_i \alpha_i y_i = 0$
Plug back in, dual:
$L_D = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$
Also have constraints: $\sum_i \alpha_i y_i = 0$ & $\alpha_i \ge 0$
Above $L_D$ must be maximized! By convex duality, this is also a qp()
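As a rough check of the dual, the sketch below evaluates $L_D$ on a hypothetical two-point dataset where the constraint $\sum_i \alpha_i y_i = 0$ forces $\alpha_1 = \alpha_2$, so the maximization reduces to a 1d grid search. This is not a general dual solver (a real one would call a QP routine); names and data are assumptions made for illustration.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def dual_objective(alpha, xs, ys):
    """L_D = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i^T x_j."""
    n = len(xs)
    quad = sum(alpha[i] * alpha[j] * ys[i] * ys[j] * dot(xs[i], xs[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad


def recover_w(alpha, xs, ys):
    """w = sum_i alpha_i y_i x_i, from setting dL_P/dw = 0."""
    return [sum(alpha[i] * ys[i] * xs[i][k] for i in range(len(xs)))
            for k in range(len(xs[0]))]


xs, ys = [(1.0, 0.0), (-1.0, 0.0)], [1, -1]
# With one point per class, sum_i alpha_i y_i = 0 means alpha_1 = alpha_2 = a >= 0.
grid = [k / 1000 for k in range(0, 2001)]
best_a = max(grid, key=lambda a: dual_objective((a, a), xs, ys))
w = recover_w((best_a, best_a), xs, ys)
b = ys[0] - dot(w, xs[0])  # support vector satisfies y_i (w^T x_i + b) = 1
print(best_a, w, b)
print(dual_objective((best_a, best_a), xs, ys), 0.5 * dot(w, w))  # dual vs primal value
```

At the optimum the dual value equals the primal cost $\frac{1}{2}\|w\|^2$, and the recovered hyperplane has margin $2/\|w\| = 2$, the distance between the two points.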
