Machine Learning Theory (CS 6783)

Lecture 2: Learning Frameworks, Examples

1 Setting up learning problems

1. $\mathcal{X}$: instance space or input space. Examples:
   - Computer vision: a raw $M \times N$ image vectorized, $\mathcal{X} = \{0, \ldots, 255\}^{M \times N}$, or SIFT features (typically $\mathcal{X} \subseteq \mathbb{R}^d$)
   - Speech recognition: Mel cepstral coefficients, $\mathcal{X} \subseteq \mathbb{R}^{2 \times \mathrm{length}}$
   - Natural language processing: bag-of-words features ($\mathcal{X} \subseteq \mathbb{N}^{\mathrm{document\ size}}$), $n$-grams

2. $\mathcal{Y}$: outcome space or label space. Examples: binary classification $\mathcal{Y} = \{\pm 1\}$, multiclass classification $\mathcal{Y} = \{1, \ldots, K\}$, regression $\mathcal{Y} \subseteq \mathbb{R}$.

3. $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$: loss function (measures prediction error). Examples: classification $\ell(y', y) = \mathbf{1}\{y' \neq y\}$, support vector machines (hinge loss) $\ell(y', y) = \max\{0, 1 - y' y\}$, regression (squared loss) $\ell(y', y) = (y' - y)^2$.

4. $\mathcal{F} \subseteq \mathcal{Y}^{\mathcal{X}}$: model / hypothesis class (set of functions from input space to outcome space). Examples:
   - Linear classifiers: $\mathcal{F} = \{x \mapsto \mathrm{sign}(f^\top x) : f \in \mathbb{R}^d\}$
   - Linear SVM: $\mathcal{F} = \{x \mapsto f^\top x : f \in \mathbb{R}^d, \|f\|_2 \le R\}$
   - Neural networks (deep learning): $\mathcal{F} = \{x \mapsto \sigma(W_{\mathrm{out}}\, \sigma(W_K\, \sigma(\cdots \sigma(W_2\, \sigma(W_1 x)))))\}$ where $\sigma$ is some non-linear transformation (e.g. ReLU)

The learner observes a sample $S = (x_1, y_1), \ldots, (x_n, y_n)$.

Learning algorithm (forecasting strategy, estimation procedure):
$$\hat{y} : \mathcal{X} \times (\mathcal{X} \times \mathcal{Y})^n \to \mathcal{Y}$$
Given a new input instance $x$, the learning algorithm predicts $\hat{y}(x, S)$. When the context is clear (i.e. the sample $S$ is understood) we will fudge notation and simply write $\hat{y}(\cdot) = \hat{y}(\cdot, S)$; $\hat{y}$ is the predictor returned by the learning algorithm.
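To make these four ingredients concrete, here is a minimal Python sketch (not part of the notes): a zero-one and a hinge loss, a small finite class of linear classifiers, and a learner that maps a sample $S$ to a predictor. The random class and the empirical-risk-minimizing picker at the end are illustrative choices only.

# Minimal sketch (not from the notes): the ingredients above for binary
# classification with the zero-one loss and a small finite hypothesis class.
import numpy as np

def zero_one_loss(y_pred, y):          # l(y', y) = 1{y' != y}
    return float(y_pred != y)

def hinge_loss(y_pred, y):             # l(y', y) = max{0, 1 - y'y}
    return max(0.0, 1.0 - y_pred * y)

# A toy finite hypothesis class of linear classifiers x -> sign(f^T x).
rng = np.random.default_rng(0)
d, N = 5, 100
F = [rng.standard_normal(d) for _ in range(N)]        # the class F = {f_1, ..., f_N}
predict = lambda f, x: np.sign(f @ x) or 1.0          # sign, breaking ties to +1

# The learner sees a sample S = (x_1, y_1), ..., (x_n, y_n) and returns a predictor.
def erm(F, S):
    """Pick the hypothesis with smallest empirical zero-one loss on S."""
    return min(F, key=lambda f: sum(zero_one_loss(predict(f, x), y) for x, y in S))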

Example: linear SVM. The learning algorithm solves the optimization problem
$$w_{\mathrm{SVM}} = \underset{w}{\mathrm{argmin}}\ \frac{1}{n}\sum_{t=1}^{n} \max\{0, 1 - y_t w^\top x_t\} + \lambda\, w^\top w$$
and the predictor is $\hat{y}(x) = \hat{y}(x, S) = w_{\mathrm{SVM}}^\top x$.

1.1 PAC Framework

$\mathcal{Y} = \{\pm 1\}$, $\ell(y', y) = \mathbf{1}\{y' \neq y\}$. Input instances are generated as $x_1, \ldots, x_n \sim D_X$ where $D_X$ is some unknown distribution over the input space. The labels are generated as $y_t = f^*(x_t)$ where the target function $f^* \in \mathcal{F}$. The learning algorithm only gets the sample $S$ and does not know $f^*$ or $D_X$.

Goal: find $\hat{y}$ that minimizes $P_{x \sim D_X}(\hat{y}(x) \neq f^*(x))$.

1.2 Non-parametric Regression

$\mathcal{Y} \subseteq \mathbb{R}$, $\ell(y', y) = (y' - y)^2$. Input instances are generated as $x_1, \ldots, x_n \sim D_X$ where $D_X$ is some unknown distribution over the input space. The labels are generated as $y_t = f^*(x_t) + \varepsilon_t$ with $\varepsilon_t \sim N(0, \sigma)$, where the target function $f^* \in \mathcal{F}$. The learning algorithm only gets the sample $S$ and does not know $f^*$ or $D_X$.

Goal: find $\hat{y}$ that minimizes $\mathbb{E}_{x \sim D_X}(\hat{y}(x) - f^*(x))^2 =: \|\hat{y} - f^*\|_{L_2(D_X)}^2$.

1.3 Statistical Learning (Agnostic PAC)

Generic $\mathcal{X}$, $\mathcal{Y}$, $\ell$ and $\mathcal{F}$. Samples are generated as $(x_1, y_1), \ldots, (x_n, y_n) \sim D$ where $D$ is some unknown distribution over $\mathcal{X} \times \mathcal{Y}$.

Goal: find $\hat{y}$ that minimizes
$$\mathbb{E}_{(x,y) \sim D}\, \ell(\hat{y}(x), y) - \inf_{f \in \mathcal{F}} \mathbb{E}_{(x,y) \sim D}\, \ell(f(x), y).$$
For any mapping $g : \mathcal{X} \to \mathcal{Y}$ we shall use the notation $L_D(g) = \mathbb{E}_{(x,y) \sim D}\, \ell(g(x), y)$, and so our goal can be rewritten as minimizing
$$L_D(\hat{y}) - \inf_{f \in \mathcal{F}} L_D(f).$$

Remarks:
1. $\hat{y}$ is a random quantity as it depends on the sample.
2. Hence the formal statements we make will be in high probability over the sample or in expectation over the draw of samples.
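Returning to the linear SVM example at the top of this section, the following Python sketch minimizes the stated objective by plain subgradient descent and then estimates $L_D(\hat{y})$ for the zero-one loss on fresh samples. The notes only state the optimization problem, so the solver, step sizes, and synthetic data below are assumptions made purely for illustration.

# Sketch (the notes state only the objective, not a solver): subgradient descent on
#   (1/n) sum_t max{0, 1 - y_t w^T x_t} + lambda * w^T w .
import numpy as np

def train_linear_svm(X, y, lam=0.1, steps=2000, lr=0.05):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w)                                   # y_t w^T x_t for all t
        active = margins < 1                                    # points with positive hinge loss
        grad = -(y[active][:, None] * X[active]).sum(axis=0) / n + 2 * lam * w
        w -= lr * grad
    return w

# Illustrative synthetic data (an assumption, purely for the example).
rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X = rng.standard_normal((500, 5))
y = np.sign(X @ w_true)

w_svm = train_linear_svm(X, y)
yhat = lambda x: w_svm @ x                                      # predictor yhat(x) = w_svm^T x

# Estimating L_D(yhat) under the zero-one loss with fresh samples from the same D:
X_test = rng.standard_normal((5000, 5))
y_test = np.sign(X_test @ w_true)
print("estimated L_D(yhat):", np.mean(np.sign(X_test @ w_svm) != y_test))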

1.4 Online Learning

For $t = 1$ to $n$:
  (a) Input instance $x_t \in \mathcal{X}$ is produced
  (b) Learning algorithm outputs prediction $\hat{y}_t$
  (c) True outcome $y_t$ is revealed to the learner
End For

One can think of $\hat{y}_t = \hat{y}_t(x_t, ((x_1, y_1), \ldots, (x_{t-1}, y_{t-1})))$.

Goal: find a learning algorithm $\hat{y}$ that minimizes regret w.r.t. the hypothesis class $\mathcal{F} \subseteq \mathcal{Y}^{\mathcal{X}}$, given by
$$\mathrm{Reg}_n = \frac{1}{n}\sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}} \frac{1}{n}\sum_{t=1}^{n} \ell(f(x_t), y_t).$$

2 Example 1: Classification using a Finite Class, Realizable Setting

In this section we consider the classification setting where $\mathcal{Y} = \{\pm 1\}$ and $\ell(y', y) = \mathbf{1}\{y' \neq y\}$. We further make the realizability assumption, meaning $y_t = f^*(x_t)$ where $f^* \in \mathcal{F}$ is of course not known to the learner.

2.1 Online Framework

The online framework is just as described earlier, with the realizability assumption added in. That is, at every round the true label $y_t$ revealed to us is set as $y_t = f^*(x_t)$ for some fixed $f^* \in \mathcal{F}$ not known to the learning algorithm. However, the $x_t$'s can be presented to us arbitrarily.

First note that under the realizability assumption we have
$$\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{t=1}^{n} \ell(f(x_t), y_t) = \frac{1}{n}\sum_{t=1}^{n} \mathbf{1}\{f^*(x_t) \neq y_t\} = 0.$$
Hence the aim in such a framework is simply to minimize the number of mistakes $\sum_{t=1}^{n} \ell(\hat{y}_t, y_t)$ and prove mistake bounds.

Now say $\mathcal{F} = \{f_1, \ldots, f_N\}$, a finite set of hypotheses. What strategy can we provide for this problem? How well does it work?

If we simply pick some hypothesis that has not made a mistake so far, such an algorithm can make a large number of mistakes (e.g. as many as $N - 1$).

A simple strategy that works in this scenario is the following. At any point $t$, we have observed $x_1, \ldots, x_{t-1}$ and labels $y_1, \ldots, y_{t-1}$. Let
$$\mathcal{F}_t = \{f \in \mathcal{F} : \forall i < t,\ f(x_i) = y_i\}.$$
Now, given $x_t$, we pick $\hat{y}_t = \mathrm{sign}\big(\sum_{f \in \mathcal{F}_t} f(x_t)\big)$. That is, we go with the majority of the predictions of the hypotheses in $\mathcal{F}_t$. How well does this algorithm work?
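Here is a small sketch of this majority-vote strategy (not from the notes). It assumes the finite class is given as a list of prediction functions; the example at the end uses threshold classifiers on the line with a realizable target, and prints the number of mistakes next to the mistake bound proved in the claim below.

# Minimal sketch (not from the notes) of the majority-vote strategy over a
# finite class F = {f_1, ..., f_N} in the realizable online setting.
import numpy as np

def halving_online(F, xs, ys):
    """Run the majority-vote strategy on the sequence (x_t, y_t); return the mistake count."""
    version_space = list(F)                              # F_t: hypotheses consistent so far
    mistakes = 0
    for x_t, y_t in zip(xs, ys):
        votes = sum(f(x_t) for f in version_space)
        y_hat = 1.0 if votes >= 0 else -1.0              # majority vote (ties -> +1)
        if y_hat != y_t:
            mistakes += 1
        # keep only the hypotheses that got this round right
        version_space = [f for f in version_space if f(x_t) == y_t]
    return mistakes

# Example: N threshold classifiers on [0, 1], with f* in the class (realizable).
rng = np.random.default_rng(0)
N, n = 64, 500
thresholds = np.linspace(0, 1, N)
F = [lambda x, th=th: 1.0 if x >= th else -1.0 for th in thresholds]
f_star = F[17]
xs = rng.random(n)
ys = [f_star(x) for x in xs]
print(halving_online(F, xs, ys), "mistakes, versus log2(N) =", np.log2(N))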

Claim. For ay sequece x,..., x, the above algorithm makes at most log 2 N umber of mistakes. Proof. Notice that each time we make a mistake, ie. sig( f F t f(x t )) y t, the we kow that at least half the umber of fuctios i F t are wrog ad so each time we make a mistake, F t+ F t /2 ad hece, we ca make at most log 2 N umber of mistakes. That is the average error is log 2 N. 2.2 PAC Framework I the PAC framework, x,..., x are draw iid from some fixed distributio D X ad our goal is to miimize P x Dx (ŷ(x) f (x)) either i expectatio or high probability over sample {x,..., x }. Ulike the olie settig, i the PAC settig oe ca simply pick ay hypothesis that has ot made ay mistakes o traiig sample. That is, ŷ(, S) = argmi f F {f(x t ) y t }. (x t,y t) S How well does this algorithm work? How should we aalyze this? Let us show a boud of error with high probability over samples. To this ed we will use the so called Berstei cocetratio boud. Fact: Cosider biary r.v. Z,..., Z draw iid. Let µ = EZ be their expectatio. We have the followig boud o the average of these radom variables. (otice that sice Z s are biary their variace if give by µ µ 2 ) ( ) ( ) P µ Z t > θ exp θ2 2µ + θ 3 Now for ay f F, let Z f t = {f(x t ) f (x t ) where x t are draw from D X. Note that EZ f = P x DX (f(x) f (x)). Hece ote that for ay sigle f F, ( ) ( ) P S P x DX (f(x) f (x)) {f(x t ) f (x t )} > θ exp θ2 2µ + θ 3 Let use write the R.H.S. above as δ, ad hece, rewritig, we have that with probability at least δ over sample, P x DX (f(x) f (x)) {f(x t ) f (x t )} log(/δ) Px DX (f(x) f + (x)) log(/δ) 3 This upo further massagig (use iequality ab a/2 + b/2) leads to the boud P x DX (f(x) f (x)) 2 {f(x t ) f (x t )} 2 log(/δ) 4

Using a union bound, we have that for any $\delta > 0$, with probability at least $1 - \delta$ over the sample, simultaneously for all $f \in \mathcal{F}$,
$$P_{x \sim D_X}(f(x) \neq f^*(x)) \le \frac{2}{n}\sum_{t=1}^{n} \mathbf{1}\{f(x_t) \neq f^*(x_t)\} + \frac{2\log(|\mathcal{F}|/\delta)}{n}.$$
Since $\hat{y} \in \mathcal{F}$, from the above we conclude that, for any $\delta > 0$, with probability at least $1 - \delta$ over the sample,
$$P_{x \sim D_X}(\hat{y}(x) \neq f^*(x)) \le \frac{2}{n}\sum_{t=1}^{n} \mathbf{1}\{\hat{y}(x_t) \neq f^*(x_t)\} + \frac{2\log(|\mathcal{F}|/\delta)}{n}.$$
But note that by the realizability assumption and the definition of $\hat{y}$, we have $\frac{1}{n}\sum_{t=1}^{n} \mathbf{1}\{\hat{y}(x_t) \neq f^*(x_t)\} = \frac{1}{n}\sum_{t=1}^{n} \mathbf{1}\{\hat{y}(x_t) \neq y_t\} = 0$, and so, with probability at least $1 - \delta$ over the sample,
$$P_{x \sim D_X}(\hat{y}(x) \neq f^*(x)) \le \frac{2\log(|\mathcal{F}|/\delta)}{n}.$$

3 Example 2: Predicting Bits

3.1 Statistical Learning

We consider, as a warm-up example, the simplest statistical learning/prediction problem: that of learning coin flips! Let us consider the case where we don't receive any input instance (i.e. $\mathcal{X}$ is a single dummy point) and $\mathcal{Y} = \{\pm 1\}$. We receive $\pm 1$-valued samples $y_1, \ldots, y_n \in \{\pm 1\}$ drawn iid from a Bernoulli distribution with parameter $p$ (i.e. $Y$ is $+1$ with probability $p$ and $-1$ with probability $1 - p$). Our loss function is the zero-one loss $\ell(y', y) = \mathbf{1}\{y' \neq y\}$. Recall that our goal in statistical learning is to minimize $L_p(\hat{y}) - \inf_{f \in \mathcal{F}} L_p(f)$. (Effectively, our only choice of $\mathcal{F}$ for this problem is the set of constant mappings, $\mathcal{F} = \{\pm 1\}$.)

Claim 2. Let $\hat{y} = \mathrm{sign}\left(\frac{1}{n}\sum_{t=1}^{n} y_t\right)$ be the prediction rule we use. For the problem above, for any $\delta > 0$, with probability at least $1 - \delta$,
$$L_p(\hat{y}) - \inf_{f \in \mathcal{F}} L_p(f) \le 2\sqrt{\frac{\log(4/\delta)}{n}}.$$
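Before the proof, here is a small simulation sketch of Claim 2 (not from the notes): it draws many samples of $n$ bits, applies the majority-vote rule, and checks how often the excess risk exceeds the claimed bound. The values of $p$, $n$, and $\delta$ are arbitrary illustrative choices.

# Simulation sketch (not from the notes) of Claim 2: predict the majority label
# of n iid +/-1 bits and compare the excess risk to the high-probability bound.
import numpy as np

rng = np.random.default_rng(0)
p, n, delta, trials = 0.6, 200, 0.05, 5000

best_risk = min(1 - p, p)                               # inf_f L_p(f), attained by sign(2p-1)
bound = 2 * np.sqrt(np.log(4 / delta) / n)              # bound of Claim 2

excess = []
for _ in range(trials):
    y = np.where(rng.random(n) < p, 1.0, -1.0)          # y_t = +1 w.p. p, -1 w.p. 1-p
    y_hat = 1.0 if y.mean() >= 0 else -1.0              # yhat = sign((1/n) sum_t y_t)
    risk = (1 - p) if y_hat == 1.0 else p               # L_p(yhat) = P(y != yhat)
    excess.append(risk - best_risk)

print("fraction of trials exceeding the bound:", np.mean(np.array(excess) > bound),
      "(should be at most delta =", delta, ")")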

Proof. Note that
$$
\begin{aligned}
L_p(\hat{y}) - \min_{f \in \{\pm 1\}} L_p(f)
&= \mathbb{E}_{y \sim p}\, \mathbf{1}\{y \neq \hat{y}\} - \min_{f \in \{\pm 1\}} \mathbb{E}_{y \sim p}\, \mathbf{1}\{f \neq y\} \\
&= \mathbb{E}_{y \sim p}\, \mathbf{1}\{y \neq \hat{y}\} - \mathbb{E}_{y \sim p}\, \mathbf{1}\{\mathrm{sign}(2p - 1) \neq y\} \\
&= \mathbb{E}_{y \sim p}\, \mathbf{1}\{y \neq \hat{y}\} - \frac{1}{n}\sum_{t=1}^{n} \mathbf{1}\{\hat{y} \neq y_t\} + \frac{1}{n}\sum_{t=1}^{n} \mathbf{1}\{\hat{y} \neq y_t\} - \mathbb{E}_{y \sim p}\, \mathbf{1}\{\mathrm{sign}(2p - 1) \neq y\} \\
&\le \mathbb{E}_{y \sim p}\, \mathbf{1}\{y \neq \hat{y}\} - \frac{1}{n}\sum_{t=1}^{n} \mathbf{1}\{\hat{y} \neq y_t\} + \frac{1}{n}\sum_{t=1}^{n} \mathbf{1}\{\mathrm{sign}(2p - 1) \neq y_t\} - \mathbb{E}_{y \sim p}\, \mathbf{1}\{\mathrm{sign}(2p - 1) \neq y\} \\
&\le 2 \max_{f \in \{\pm 1\}} \left| \mathbb{E}_{y \sim p}\, \mathbf{1}\{y \neq f\} - \frac{1}{n}\sum_{t=1}^{n} \mathbf{1}\{f \neq y_t\} \right|,
\end{aligned}
$$
where the first inequality uses the fact that $\hat{y}$ minimizes the empirical error $\frac{1}{n}\sum_{t=1}^{n} \mathbf{1}\{f \neq y_t\}$ over constant predictions $f \in \{\pm 1\}$. Hence we conclude that
$$P_S\left(L_p(\hat{y}) - \min_{f \in \{\pm 1\}} L_p(f) > \theta\right) \le P\left(2 \max_{f \in \{\pm 1\}} \left| \mathbb{E}_{y \sim p}\, \mathbf{1}\{y \neq f\} - \frac{1}{n}\sum_{t=1}^{n} \mathbf{1}\{f \neq y_t\} \right| > \theta\right).$$
Now we can bound the right-hand side above using the Hoeffding/Bernstein bound plus a union bound over the two choices of $f$ as
$$P_S\left(L_p(\hat{y}) - \min_{f \in \{\pm 1\}} L_p(f) > \theta\right) \le 4 \exp\left(-\frac{n\theta^2}{2}\right).$$
Written another way, we can claim that for any $\delta > 0$, with probability at least $1 - \delta$,
$$L_p(\hat{y}) - \min_{f \in \{\pm 1\}} L_p(f) \le 2\sqrt{\frac{\log(4/\delta)}{n}}. \qquad \square$$

3.2 Can we even hope to handle this problem in the online setting?