CSCI B609: Foundations of Data Science

CSCI B609: Foundations of Data Science. Lecture 13/14: Gradient Descent, Boosting and Learning from Experts. Slides at http://grigory.us/data-science-class.html. Grigory Yaroslavtsev, http://grigory.us

Constrained Convex Optimization. Non-convex optimization is NP-hard in general. Example (Knapsack): minimize $c \cdot x$ subject to $w \cdot x \ge W$ and $x_i(1 - x_i) = 0$ for all $i$ (i.e. $x_i \in \{0,1\}$); the integrality constraints make the feasible set non-convex. Convex optimization can often be solved by the ellipsoid algorithm in poly(n) time, but it is too slow in practice.

Convex multivariate functions.
Convexity (first-order condition): for all $x, y \in \mathbb{R}^n$: $f(x) \ge f(y) + (x - y) \cdot \nabla f(y)$.
Equivalently, for all $x, y \in \mathbb{R}^n$ and $0 \le \lambda \le 1$: $f(\lambda x + (1-\lambda) y) \le \lambda f(x) + (1-\lambda) f(y)$.
If higher derivatives exist: $f(x) = f(y) + \nabla f(y) \cdot (x - y) + \frac{1}{2}(x - y)^T \nabla^2 f(z) (x - y) + \dots$, where $(\nabla^2 f)_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$ is the Hessian matrix.
$f$ is convex iff its Hessian is positive semidefinite, i.e. $y^T \nabla^2 f\, y \ge 0$ for all $y$.

Examples of convex functions.
The $\ell_p$-norm is convex for $1 \le p \le \infty$: $\|\lambda x + (1-\lambda) y\|_p \le \|\lambda x\|_p + \|(1-\lambda) y\|_p = \lambda \|x\|_p + (1-\lambda) \|y\|_p$.
$f(x) = \log(e^{x_1} + e^{x_2} + \dots + e^{x_n})$, which satisfies $\max(x_1, \dots, x_n) \le f(x) \le \max(x_1, \dots, x_n) + \log n$.
$f(x) = x^T A x$ where $A$ is a p.s.d. matrix; here $\nabla^2 f = 2A$.
Examples of constrained convex optimization:
(Linear equations with p.s.d. constraints): minimize $\frac{1}{2} x^T A x - b^T x$ (the solution satisfies $Ax = b$).
(Least squares regression): minimize $\|Ax - b\|_2^2 = x^T A^T A x - 2 (Ax)^T b + b^T b$.
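As a quick illustration (not part of the original slides), the following Python snippet numerically checks the log-sum-exp bounds and the positive semidefiniteness of the least-squares Hessian; the random data and tolerance are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Log-sum-exp: max(x) <= f(x) <= max(x) + log(n)
x = rng.normal(size=10)
f = np.log(np.sum(np.exp(x)))
assert np.max(x) <= f <= np.max(x) + np.log(len(x))

# Least squares objective ||Ax - b||^2 has Hessian 2 * A^T A, which is p.s.d.
A = rng.normal(size=(20, 5))
hessian = 2 * A.T @ A
eigenvalues = np.linalg.eigvalsh(hessian)
assert np.all(eigenvalues >= -1e-9)  # non-negative up to numerical error
```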

Constrained Convex Optimization.
General formulation for a convex function $f$ and a convex set $K$: minimize $f(x)$ subject to $x \in K$.
Example (SVMs): data $X_1, \dots, X_N \in \mathbb{R}^n$ labeled by $y_1, \dots, y_N \in \{-1, 1\}$ (spam / non-spam).
Find a linear model $W$: $W \cdot X_i \ge 1$ means $X_i$ is spam, $W \cdot X_i \le -1$ means $X_i$ is non-spam; i.e. for all $i$: $1 - y_i (W \cdot X_i) \le 0$.
More robust version: minimize $\sum_i \mathrm{Loss}\!\left(1 - y_i (W \cdot X_i)\right) + \lambda \|W\|_2^2$, e.g. the hinge loss $\mathrm{Loss}(t) = \max(0, t)$.
Another regularizer: $\lambda \|W\|_1$ (favors sparse solutions).
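A minimal Python sketch of the regularized hinge-loss objective described above; the function name hinge_objective and the toy data are illustrative choices, not from the course.

```python
import numpy as np

def hinge_objective(W, X, y, lam):
    """Regularized hinge loss: sum_i max(0, 1 - y_i <W, X_i>) + lam * ||W||_2^2."""
    margins = 1 - y * (X @ W)
    return np.sum(np.maximum(0.0, margins)) + lam * np.dot(W, W)

# Toy usage on random data.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))
print(hinge_objective(np.zeros(5), X, y, lam=0.1))  # all-zero model: loss equals N
```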

Gradient Descent for Constrained Convex Optimization.
(Projection): for $x \notin K$, project onto $y \in K$: $y = \mathrm{argmin}_{z \in K} \|z - x\|$. Easy to compute for the unit $\ell_2$ ball: $y = x / \|x\|_2$.
Let $\|\nabla f(x)\| \le G$ and $\max_{x, y \in K} \|x - y\| \le D$. Let $T = 4 D^2 G^2 / \epsilon^2$.
Gradient descent (given gradient + projection oracles): let $\eta = D / (G \sqrt{T})$ and repeat for $i = 0, \dots, T$:
$y^{(i+1)} = x^{(i)} - \eta \nabla f(x^{(i)})$
$x^{(i+1)} =$ projection of $y^{(i+1)}$ on $K$
Output $z = \frac{1}{T} \sum_i x^{(i)}$.
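A short Python sketch of projected gradient descent as stated above, specialized to the unit $\ell_2$ ball; the fixed step size and iteration count are illustrative choices rather than the $\eta = D/(G\sqrt{T})$ setting from the analysis.

```python
import numpy as np

def project_unit_ball(x):
    """Projection onto the unit l2 ball: scale down if ||x|| > 1."""
    norm = np.linalg.norm(x)
    return x / norm if norm > 1 else x

def projected_gd(grad, project, x0, eta, T):
    """Projected gradient descent; returns the average iterate z = (1/T) sum_i x^(i)."""
    x = x0
    iterates = []
    for _ in range(T):
        y = x - eta * grad(x)   # gradient step
        x = project(y)          # project back onto K
        iterates.append(x)
    return np.mean(iterates, axis=0)

# Toy example: least squares restricted to the unit ball.
rng = np.random.default_rng(2)
A = rng.normal(size=(50, 5))
b = rng.normal(size=50)
grad = lambda x: 2 * A.T @ (A @ x - b)
z = projected_gd(grad, project_unit_ball, np.zeros(5), eta=1e-3, T=5000)
print(np.linalg.norm(A @ z - b))  # residual of the constrained solution
```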

Gradient Descent for Constrained Convex Optimization (analysis).
Since $x^*$ lies in the convex set $K$, projecting onto $K$ can only decrease the distance to $x^*$, so
$\|x^{(i+1)} - x^*\|^2 \le \|y^{(i+1)} - x^*\|^2 = \|x^{(i)} - x^* - \eta \nabla f(x^{(i)})\|^2 = \|x^{(i)} - x^*\|^2 + \eta^2 \|\nabla f(x^{(i)})\|^2 - 2\eta\, \nabla f(x^{(i)}) \cdot (x^{(i)} - x^*)$.
Using the definition of $G$ and rearranging:
$\nabla f(x^{(i)}) \cdot (x^{(i)} - x^*) \le \frac{1}{2\eta}\left(\|x^{(i)} - x^*\|^2 - \|x^{(i+1)} - x^*\|^2\right) + \frac{\eta}{2} G^2$.
By convexity, $f(x^{(i)}) - f(x^*) \le \nabla f(x^{(i)}) \cdot (x^{(i)} - x^*)$.
Summing over $i = 1, \dots, T$ (the distance terms telescope):
$\sum_{i=1}^{T} \left(f(x^{(i)}) - f(x^*)\right) \le \frac{\|x^{(0)} - x^*\|^2 - \|x^{(T)} - x^*\|^2}{2\eta} + \frac{T\eta}{2} G^2$.

Gradient Descent for Constrained Convex Optimization (analysis, cont.)
$\sum_{i=1}^{T} \left(f(x^{(i)}) - f(x^*)\right) \le \frac{\|x^{(0)} - x^*\|^2}{2\eta} + \frac{T\eta}{2} G^2 \le \frac{D^2}{2\eta} + \frac{T\eta}{2} G^2$.
By convexity of $f$, $f(z) = f\!\left(\frac{1}{T}\sum_i x^{(i)}\right) \le \frac{1}{T}\sum_i f(x^{(i)})$, so $f(z) - f(x^*) \le \frac{1}{T}\left(\frac{D^2}{2\eta} + \frac{T\eta}{2} G^2\right)$.
Setting $\eta = D/(G\sqrt{T})$ makes the RHS equal to $DG/\sqrt{T}$, which is at most $\epsilon$ for the choice $T = 4D^2G^2/\epsilon^2$.
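Why $\eta = D/(G\sqrt{T})$: the slide does not spell this out, but this choice minimizes the right-hand side over $\eta$:
\[
\frac{d}{d\eta}\left( \frac{D^2}{2\eta} + \frac{T\eta}{2} G^2 \right)
= -\frac{D^2}{2\eta^2} + \frac{T G^2}{2} = 0
\;\Longrightarrow\;
\eta = \frac{D}{G\sqrt{T}},
\]
\[
\text{and at this } \eta:\quad
\frac{D^2}{2\eta} + \frac{T\eta}{2} G^2
= \frac{DG\sqrt{T}}{2} + \frac{DG\sqrt{T}}{2}
= DG\sqrt{T},
\qquad
\frac{1}{T}\, DG\sqrt{T} = \frac{DG}{\sqrt{T}} \le \epsilon
\;\text{ for } T \ge \frac{D^2 G^2}{\epsilon^2}.
\]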

Online Gradient Descent.
Gradient descent works in a more general case: a sequence of convex functions $f_1, f_2, \dots, f_T$ arrives one at a time, and at step $i$ we must output $x^{(i)} \in K$ before seeing $f_i$; the update then uses $\nabla f_i(x^{(i)})$.
Let $x^*$ be the minimizer of $\sum_i f_i(w)$. The goal is to minimize the regret $\sum_i \left(f_i(x^{(i)}) - f_i(x^*)\right)$.
The same analysis as before works in the online case.

Stochastic Gradient Descent.
(Expected gradient oracle): returns $g$ such that $\mathbb{E}[g] = \nabla f(x)$. Example: for SVM, pick one term of the loss function uniformly at random.
Let $g_i$ be the gradient returned at step $i$. Let $f_i(x) = g_i^T x$ be the function used in the $i$-th step of OGD. Let $z = \frac{1}{T} \sum_i x^{(i)}$ and let $x^*$ be the minimizer of $f$.
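A minimal Python sketch of such an oracle for the (averaged) hinge-loss SVM: each step uses the subgradient of one randomly chosen hinge term plus the regularizer gradient, which is an unbiased estimate of the full (sub)gradient. The function name and hyperparameters are illustrative, not from the course.

```python
import numpy as np

def sgd_svm(X, y, lam=0.1, T=10000, eta=0.01, seed=0):
    """SGD for f(w) = (1/n) sum_i max(0, 1 - y_i <w, X_i>) + lam * ||w||^2.
    Each step uses the (sub)gradient of one randomly chosen hinge term plus the
    regularizer gradient; its expectation is the full (sub)gradient of f."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for _ in range(T):
        i = rng.integers(n)                 # pick one term of the loss at random
        margin = y[i] * (X[i] @ w)
        g_hinge = -y[i] * X[i] if margin < 1 else np.zeros(d)
        w = w - eta * (g_hinge + 2 * lam * w)
        w_sum += w
    return w_sum / T                        # average iterate, as in the analysis

# Toy usage on linearly separable data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))
w = sgd_svm(X, y)
print(np.mean(np.sign(X @ w) == y))  # training accuracy
```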

Stochastic Gradient Descent (analysis).
Thm. $\mathbb{E}[f(z)] \le f(x^*) + \frac{DG}{\sqrt{T}}$, where $G$ is an upper bound on the norm of any gradient output by the oracle.
Proof:
$\mathbb{E}[f(z)] - f(x^*) \le \frac{1}{T} \sum_i \mathbb{E}\left[f(x^{(i)}) - f(x^*)\right]$ (convexity)
$\le \frac{1}{T} \sum_i \mathbb{E}\left[\nabla f(x^{(i)}) \cdot (x^{(i)} - x^*)\right]$ (convexity)
$= \frac{1}{T} \sum_i \mathbb{E}\left[g_i \cdot (x^{(i)} - x^*)\right]$ (gradient oracle, conditioning on $x^{(i)}$)
$= \frac{1}{T}\, \mathbb{E}\left[\sum_i \left(f_i(x^{(i)}) - f_i(x^*)\right)\right]$.
The expression inside the last expectation is the regret of OGD on the linear functions $f_i$, which is always at most $DG\sqrt{T}$; dividing by $T$ gives the claim (and it is at most $\epsilon$ for $T$ as chosen earlier).

VC-dimension of combinations of concepts.
For $k$ concepts $h_1, \dots, h_k$ and a Boolean function $f$: $\mathrm{comb}_f(h_1, \dots, h_k) = \{x \in X : f(h_1(x), \dots, h_k(x)) = 1\}$.
Example: $H$ = linear separators, $f$ = AND or $f$ = Majority.
For a concept class $H$ and a Boolean function $f$: $\mathrm{COMB}_{f,k}(H) = \{\mathrm{comb}_f(h_1, \dots, h_k) : h_i \in H\}$.
Lemma. If VC-dim$(H) = d$ then for any $f$: VC-dim$(\mathrm{COMB}_{f,k}(H)) = O(kd \log(kd))$.

VC-dimension of combinations of concepts (proof).
Lemma. If VC-dim$(H) = d$ then for any $f$: VC-dim$(\mathrm{COMB}_{f,k}(H)) = O(kd \log(kd))$.
Let $n =$ VC-dim$(\mathrm{COMB}_{f,k}(H))$, so there is a set $S$ of $n$ points shattered by $\mathrm{COMB}_{f,k}(H)$.
By Sauer's lemma there are at most $n^d$ ways of labeling $S$ by $H$.
Each labeling in $\mathrm{COMB}_{f,k}(H)$ is determined by $k$ labelings of $S$ by $H$, so there are at most $(n^d)^k = n^{kd}$ labelings of $S$ by $\mathrm{COMB}_{f,k}(H)$.
Since $S$ is shattered, $2^n \le n^{kd}$, i.e. $n \le kd \log_2 n$, which gives $n = O(kd \log(kd))$.
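The final step ($n \le kd \log_2 n \Rightarrow n = O(kd \log(kd))$) is compressed on the slide; one way to spell it out:
\[
n \le kd \log_2 n
\;\text{ and }\;
\log_2 n \le \sqrt{n} \ \text{ for } n \ge 16
\;\Longrightarrow\;
n \le kd\sqrt{n}
\;\Longrightarrow\;
\sqrt{n} \le kd
\;\Longrightarrow\;
n \le (kd)^2,
\]
\[
\text{so }\;
n \le kd \log_2 n \le kd \log_2\!\left((kd)^2\right) = 2\,kd \log_2(kd) = O(kd \log(kd)).
\]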

Back to the batch setting.
Classification problem. Instance space $X$: $\{0,1\}^d$ or $\mathbb{R}^d$ (feature vectors). Classification: come up with a mapping $X \to \{0,1\}$.
Formalization: assume there is a probability distribution $D$ over $X$; $c^*$ = target concept (the set $c^* \subseteq X$ of positive instances). Given labeled i.i.d. samples from $D$, produce $h \subseteq X$.
Goal: have $h$ agree with $c^*$ over the distribution $D$. Minimize $\mathrm{err}_D(h) = \Pr_D[h \,\Delta\, c^*]$; $\mathrm{err}_D(h)$ is the true or generalization error.

Boosting.
Strong learner: succeeds with probability $\ge 1 - \epsilon$. Weak learner: succeeds with probability $\ge \frac{1}{2} + \gamma$.
Boosting (informal): a weak learner that works under any distribution can be turned into a strong learner.
Idea: run the weak learner $A$ on the sample $S$ under reweightings that focus on misclassified examples.

Boosting (cont.)
$H$ = class of hypotheses produced by $A$. Applying the majority rule to $h_1, \dots, h_{t_0} \in H$ gives VC-dimension $O(t_0 \cdot \text{VC-dim}(H) \cdot \log(t_0 \cdot \text{VC-dim}(H)))$.
Algorithm: given $S = (x_1, \dots, x_n)$, set $w_i = 1$ for all $i$, $w = (w_1, \dots, w_n)$.
For $t = 1, \dots, t_0$ do: call the weak learner on $(S, w)$ to get hypothesis $h_t$; for each misclassified $x_i$ multiply $w_i$ by $\alpha = \left(\frac{1}{2} + \gamma\right) / \left(\frac{1}{2} - \gamma\right)$.
Output: $\mathrm{MAJ}(h_1, \dots, h_{t_0})$.
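A minimal Python sketch of this reweighting scheme, assuming decision stumps as the weak learner $A$ (the slides leave $A$ abstract); all function names and the toy data are illustrative.

```python
import numpy as np

def stump_weak_learner(X, y, w):
    """Hypothetical weak learner: best single-feature threshold under weights w.
    Returns a predict(Z) -> {0,1} function."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = (sign * (X[:, j] - thr) >= 0).astype(int)
                err = np.sum(w * (pred != y))
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    j, thr, sign = best
    return lambda Z: (sign * (Z[:, j] - thr) >= 0).astype(int)

def boost(X, y, gamma=0.1, t0=50):
    """Boosting by reweighting + majority vote, as on the slide:
    each misclassified point has its weight multiplied by alpha."""
    n = X.shape[0]
    w = np.ones(n)                                   # w_i = 1 for all i
    alpha = (0.5 + gamma) / (0.5 - gamma)
    hs = []
    for _ in range(t0):
        h = stump_weak_learner(X, y, w)
        hs.append(h)
        miss = h(X) != y
        w[miss] *= alpha                             # upweight mistakes
    return lambda Z: (np.mean([h(Z) for h in hs], axis=0) >= 0.5).astype(int)

# Toy usage.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = boost(X, y)
print(np.mean(clf(X) == y))  # training accuracy
```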

Boosting: analysis.
Def ($\gamma$-weak learner on a sample): for labeled examples $x_i$ weighted by $w_i$, the hypothesis is correct on examples carrying at least a $\left(\frac{1}{2} + \gamma\right)$ fraction of the total weight $\sum_{i=1}^n w_i$.
Thm. If $A$ is a $\gamma$-weak learner on $S$, then for $t_0 = O\!\left(\frac{1}{\gamma^2} \log n\right)$ boosting achieves 0 error on $S$.
Proof. Let $m$ = number of mistakes of the final classifier. Each point on which the final majority classifier errs was misclassified at least $t_0/2$ times, so its final weight is at least $\alpha^{t_0/2}$; hence the total weight is at least $m \alpha^{t_0/2}$.
Let $W(t)$ be the total weight at time $t$. Then $W(t+1) \le \left(\alpha \left(\tfrac{1}{2} - \gamma\right) + \left(\tfrac{1}{2} + \gamma\right)\right) W(t) = (1 + 2\gamma) W(t)$.

Boosting: analysis (cont.)
$W(0) = n$, so $W(t_0) \le n (1 + 2\gamma)^{t_0}$. Combining with $m \alpha^{t_0/2} \le W(t_0)$ and $\alpha = \left(\frac{1}{2} + \gamma\right)/\left(\frac{1}{2} - \gamma\right) = (1 + 2\gamma)/(1 - 2\gamma)$:
$m \le n (1 - 2\gamma)^{t_0/2} (1 + 2\gamma)^{t_0/2} = n (1 - 4\gamma^2)^{t_0/2}$.
Using $1 - x \le e^{-x}$: $m \le n e^{-2\gamma^2 t_0}$, which is less than 1 for $t_0 = O\!\left(\frac{1}{\gamma^2} \log n\right)$, so there are no mistakes on $S$.
Comments: the argument applies even if the weak learners are adversarial. VC-dimension bounds give the sample size needed for generalization, roughly $n = \tilde{O}\!\left(\frac{1}{\epsilon} \cdot \frac{\text{VC-dim}(H)}{\gamma^2}\right)$.
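Making the choice of $t_0$ explicit (a calculation the slide compresses):
\[
n e^{-2\gamma^2 t_0} < 1
\;\Longleftrightarrow\;
2\gamma^2 t_0 > \ln n
\;\Longleftrightarrow\;
t_0 > \frac{\ln n}{2\gamma^2},
\]
so taking $t_0 > \frac{\ln n}{2\gamma^2}$, which is $O\!\left(\frac{1}{\gamma^2} \log n\right)$, forces $m < 1$, i.e. $m = 0$ since $m$ is an integer.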