Alpaydin Chapter 2, Mitchell Chapter 7. Alpaydin slides are in turquoise. Ethem Alpaydin, copyright: The MIT Press, 2010. alpaydin@boun.edu.tr, http://www.cmpe.boun.edu.tr/~ethem/i2ml2e. All other slides are based on Mitchell.

Learning a Class from Examples
Class C of a "family car":
- Prediction: Is car x a family car?
- Knowledge extraction: What do people expect from a family car?
Output: Positive (+) and negative (−) examples
Input representation: x₁: price, x₂: engine power

Training set: X = {x^t, r^t}, t = 1,...,N, where
  r^t = 1 if x^t is positive, r^t = 0 if x^t is negative.
Class C: (p₁ ≤ price ≤ p₂) AND (e₁ ≤ engine power ≤ e₂).
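To make the rectangle class concrete, here is a minimal Python sketch (the function name and the sample prices and engine powers are made up for illustration, not from the slides):

# A hypothesis for the family-car class: an axis-aligned rectangle in
# the (price, engine power) plane. h(x) = 1 iff x falls inside it.
def make_rectangle_hypothesis(p1, p2, e1, e2):
    def h(x):
        price, engine_power = x
        return 1 if (p1 <= price <= p2) and (e1 <= engine_power <= e2) else 0
    return h

# Toy training set X = {(x^t, r^t)}: x^t = (price, engine power), r^t in {0, 1}.
X = [((13500, 110), 1), ((16000, 150), 1), ((9000, 60), 0), ((30000, 300), 0)]

h = make_rectangle_hypothesis(p1=10_000, p2=20_000, e1=90, e2=200)
print([h(x) for x, r in X])   # -> [1, 1, 0, 0]: h agrees with every label here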
Hypothesis class H:
  h(x) = 1 if h says x is positive, h(x) = 0 if h says x is negative.
Error of h on X:
  E(h | X) = Σ_{t=1}^{N} 1(h(x^t) ≠ r^t)

S, G, and the Version Space
- Most specific hypothesis, S
- Most general hypothesis, G
- Any h ∈ H between S and G is consistent, and these hypotheses make up the version space (Mitchell, 1997).

Computational Learning Theory (from Mitchell Chapter 7)
Theoretical characterization of the difficulties and capabilities of learning algorithms.
Questions:
- Conditions for successful/unsuccessful learning
- Conditions of success for particular algorithms
Two frameworks:
- Probably Approximately Correct (PAC) framework: classes of hypotheses that can be learned; complexity of hypothesis space and bound on training set size.
- Mistake bound framework: number of training errors made before the correct hypothesis is determined.

Computational Learning Theory
What general laws constrain inductive learning? We seek theory to relate:
- Probability of successful learning
- Number of training examples
- Complexity of hypothesis space
- Accuracy to which target concept is approximated
- Manner in which training examples are presented
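The error count E(h | X) and the version-space membership test translate directly into code; a small sketch reusing the toy X and h from the snippet above:

# E(h | X) = sum_t 1(h(x^t) != r^t): the number of training mistakes h makes.
def empirical_error(h, X):
    return sum(1 for x, r in X if h(x) != r)

# h lies in the version space iff it is consistent with every training example.
def is_consistent(h, X):
    return empirical_error(h, X) == 0

print(empirical_error(h, X), is_consistent(h, X))   # -> 0 True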
Specific Questions
- Sample complexity: How many training examples are needed for a learner to converge?
- Computational complexity: How much computational effort is needed for a learner to converge?
- Mistake bound: How many training examples will the learner misclassify before converging?
Issues: When to say it was successful? How are inputs acquired?

How many training examples are sufficient to learn the target concept?
1. If the learner proposes instances, as queries to the teacher: learner proposes instance x, teacher provides c(x).
2. If the teacher (who knows c) provides training examples: teacher provides a sequence of examples of the form ⟨x, c(x)⟩.
3. If some random process (e.g., nature) proposes instances: instance x is generated randomly, teacher provides c(x).

Two Notions of Error
- Training error of hypothesis h with respect to target concept c: how often h(x) ≠ c(x) over training instances.
- True error of hypothesis h with respect to c: how often h(x) ≠ c(x) over future random instances.

True Error of a Hypothesis
(Figure: instance space X, with concept c and hypothesis h as overlapping regions; the true error is where c and h disagree.)
Definition: The true error (denoted error_D(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D:
  error_D(h) ≡ Pr_{x∈D} [c(x) ≠ h(x)]
Our concern: Can we bound the true error of h given the training error of h? First consider the case when the training error of h is zero (i.e., h ∈ VS_{H,D}).
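Since error_D(h) is a probability over the instance distribution D, it can be estimated by sampling whenever a generator for D and the target concept c are available; a Monte Carlo sketch (sample_from_D, c, and the default n are assumptions for illustration):

def estimate_true_error(h, c, sample_from_D, n=100_000):
    """Monte Carlo estimate of error_D(h) = Pr_{x~D}[c(x) != h(x)]."""
    mistakes = 0
    for _ in range(n):
        x = sample_from_D()    # instance drawn at random according to D
        if h(x) != c(x):       # h misclassifies x
            mistakes += 1
    return mistakes / n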
Exhausting the Version Space
(Figure: hypothesis space H with VS_{H,D} inside; each hypothesis is labeled with its training error r and true error, e.g. error = .3, r = .1; error = .1, r = .2; error = .2, r = 0; error = .1, r = 0; error = .3, r = .4; error = .2, r = .3.)
Definition: The version space VS_{H,D} is said to be ɛ-exhausted with respect to c and D if every hypothesis h in VS_{H,D} has error less than ɛ with respect to c and D:
  (∀h ∈ VS_{H,D}) error_D(h) < ɛ

How many examples will ɛ-exhaust the VS?
Theorem [Haussler, 1988]: If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent random examples of some target concept c, then for any 0 ≤ ɛ ≤ 1, the probability that the version space with respect to H and D is not ɛ-exhausted (with respect to c) is less than
  |H| e^(−ɛm)
This bounds the probability that any consistent learner will output a hypothesis h with error(h) ≥ ɛ. If we want this probability to be below δ, then
  |H| e^(−ɛm) ≤ δ   ⟹   m ≥ (1/ɛ)(ln|H| + ln(1/δ))

Proof of ɛ-exhausting Theorem
Theorem: Prob. of VS_{H,D} not being ɛ-exhausted is ≤ |H| e^(−ɛm).
Proof: Let h_i ∈ H (i = 1..k) be the hypotheses with true error greater than ɛ w.r.t. c (k ≤ |H|). We fail to ɛ-exhaust the VS iff at least one such h_i is consistent with all m training instances. The probability that a single hypothesis with error > ɛ is consistent with one random sample is at most (1 − ɛ), so the probability that it is consistent with m samples is at most (1 − ɛ)^m. The probability that at least one of the k hypotheses with error > ɛ is consistent with all m samples is then at most k(1 − ɛ)^m. Since k ≤ |H|, and (1 − ɛ) ≤ e^(−ɛ) for 0 ≤ ɛ ≤ 1:
  k(1 − ɛ)^m ≤ |H|(1 − ɛ)^m ≤ |H| e^(−ɛm)

PAC Learning
Consider a class C of possible target concepts defined over a set of instances X of length n, and a learner L using hypothesis space H.
Definition: C is PAC-learnable by L using H if for all c ∈ C, distributions D over X, ɛ such that 0 < ɛ < 1/2, and δ such that 0 < δ < 1/2, learner L will with probability at least (1 − δ) output a hypothesis h ∈ H such that error_D(h) ≤ ɛ, in time that is polynomial in 1/ɛ, 1/δ, n, and size(c).
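The bound m ≥ (1/ɛ)(ln|H| + ln(1/δ)) is easy to evaluate numerically; a sketch (the |H|, ɛ, and δ values are illustrative, not from the slides):

from math import ceil, log

def pac_sample_bound(H_size, epsilon, delta):
    """Smallest m with |H| * e^(-epsilon*m) <= delta,
    i.e. m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return ceil((log(H_size) + log(1.0 / delta)) / epsilon)

# e.g. a finite hypothesis space with |H| = 1000, epsilon = 0.1, delta = 0.05:
print(pac_sample_bound(1000, 0.1, 0.05))   # -> 100 examples suffice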
Agnostic Learning
So far, we assumed that c ∈ H. What if that is not the case? Agnostic learning setting: don't assume c ∈ H.
What do we want then? The hypothesis h that makes the fewest errors on the training data.
What is the sample complexity in this case?
  m ≥ (1/(2ɛ²))(ln|H| + ln(1/δ))
derived from Hoeffding bounds:
  Pr[error_D(h) > error_train(h) + ɛ] ≤ e^(−2mɛ²)

Shattering a Set of Instances
Definition: a dichotomy of a set S is a partition of S into two disjoint subsets.
Definition: a set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.

Three Instances Shattered
(Figure: instance space X with three instances; each closed contour indicates one dichotomy.)
What kind of hypothesis space H can shatter the instances?

The Vapnik-Chervonenkis Dimension
Definition: The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then VC(H) ≡ ∞.
Note that |H| can be infinite, while VC(H) is finite!
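Shattering can be checked by brute force for a finite stand-in for H: enumerate the labelings H induces on S and test whether all 2^|S| dichotomies appear. A sketch (the interval grid is an assumption for illustration; it anticipates the interval example on the next slide):

def shatters(hypotheses, S):
    """True iff every dichotomy of S is realized by some h in hypotheses."""
    realized = {tuple(h(x) for x in S) for h in hypotheses}
    return len(realized) == 2 ** len(S)

# Finite stand-in for H: indicators of intervals a < x < b over a small grid.
grid = range(7)
intervals = [(lambda a, b: (lambda x: int(a < x < b)))(a, b)
             for a in grid for b in grid if a < b]

print(shatters(intervals, [3.1, 5.7]))        # True: two points are shattered
print(shatters(intervals, [1.5, 3.1, 5.7]))   # False: no interval gives +,-,+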
VC Dim. of Linear Decision Surfaces
(Figure: point sets (a) and (b); (a) can be shattered by lines, (b) cannot.)
(a) can be shattered, but (b) cannot be. However, if at least one subset of size 3 can be shattered, that is enough. A set of size 4 cannot be shattered, for any arrangement of points (think of an XOR-like situation). So when H is the set of lines and S a set of points in the plane, VC(H) = 3.

VC Dimension: Another Example
S = {3.1, 5.7}, and the hypothesis space consists of intervals a < x < b. Dichotomies: both, none, only 3.1, or only 5.7. Are there intervals that cover all the above dichotomies? What about S = {x₀, x₁, x₂} for arbitrary x_i? (cf. collinear points).

Sample Complexity from VC Dimension
How many randomly drawn examples suffice to ɛ-exhaust VS_{H,D} with probability at least (1 − δ)?
  m ≥ (1/ɛ)(4 log₂(2/δ) + 8 VC(H) log₂(13/ɛ))
VC(H) is directly related to the sample complexity: a more expressive H needs more samples, and more samples are needed for an H with more tunable parameters.

Mistake Bounds
So far: how many examples are needed to learn? What about: how many mistakes before convergence? This is an interesting question because some learning systems may need to start operating while still learning.
Consider a setting similar to PAC learning:
- Instances drawn at random from X according to distribution D.
- The learner must classify each instance before receiving the correct classification from the teacher.
Can we bound the number of mistakes the learner makes before converging?
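The VC-based bound can be evaluated the same way as the finite-|H| bound earlier; a sketch using VC(H) = 3 for lines in the plane (the ɛ and δ values are illustrative):

from math import ceil, log2

def vc_sample_bound(vc_dim, epsilon, delta):
    """m >= (1/epsilon) * (4*log2(2/delta) + 8*VC(H)*log2(13/epsilon))."""
    return ceil((4 * log2(2.0 / delta)
                 + 8 * vc_dim * log2(13.0 / epsilon)) / epsilon)

print(vc_sample_bound(3, epsilon=0.1, delta=0.05))   # -> 1899 examples

Note how much larger this worst-case bound is than the finite-|H| figure above; it holds for infinite hypothesis spaces, which the earlier bound cannot cover.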
Optimal Mistake Bounds
Let M_A(C) be the maximum number of mistakes made by algorithm A to learn concepts in C (maximum over all possible c ∈ C, and all possible training sequences):
  M_A(C) ≡ max_{c∈C} M_A(c)
Definition: Let C be an arbitrary non-empty concept class. The optimal mistake bound for C, denoted Opt(C), is the minimum over all possible learning algorithms A of M_A(C):
  Opt(C) ≡ min_{A ∈ learning algorithms} M_A(C)

Mistake Bounds and VC Dimension
Littlestone (1987) showed:
  VC(C) ≤ Opt(C) ≤ M_Halving(C) ≤ log₂(|C|)

Noise and Model Complexity
Use the simpler one because it is:
- Simpler to use (lower computational complexity)
- Easier to train (lower space complexity)
- Easier to explain (more interpretable)
- Generalizes better (lower variance; Occam's razor)

Multiple Classes, C_i, i = 1,...,K
Training set: X = {x^t, r^t}, t = 1,...,N, where
  r_i^t = 1 if x^t ∈ C_i, and r_i^t = 0 if x^t ∈ C_j, j ≠ i.
Train K hypotheses h_i(x), i = 1,...,K, where
  h_i(x^t) = 1 if x^t ∈ C_i, and h_i(x^t) = 0 if x^t ∈ C_j, j ≠ i
(sketched in code below).
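The K-class construction above is the one-vs-rest reduction: K two-class problems, one per class. A minimal sketch (train_two_class is a placeholder for any two-class learner, an assumption for illustration):

# Relabel the data for problem i: class C_i is positive, everything else negative.
def one_vs_rest_labels(xs, ys, i):
    """r_i^t = 1 if x^t in C_i, 0 otherwise."""
    return [(x, 1 if y == i else 0) for x, y in zip(xs, ys)]

def train_one_vs_rest(xs, ys, K, train_two_class):
    """Train one hypothesis h_i per class on its relabeled training set."""
    return [train_two_class(one_vs_rest_labels(xs, ys, i)) for i in range(K)]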
Regression
Training set: X = {x^t, r^t}, t = 1,...,N, with r^t ∈ ℝ and r^t = f(x^t) + ɛ.
Empirical error:
  E(g | X) = (1/N) Σ_{t=1}^{N} [r^t − g(x^t)]²
Linear model: g(x) = w₁x + w₀, with
  E(w₁, w₀ | X) = (1/N) Σ_{t=1}^{N} [r^t − (w₁x^t + w₀)]²
Quadratic model: g(x) = w₂x² + w₁x + w₀.

Model Selection & Generalization
Learning is an ill-posed problem; the data alone are not sufficient to find a unique solution. Hence the need for inductive bias: assumptions about H.
- Generalization: how well a model performs on new data
- Overfitting: H more complex than C or f
- Underfitting: H less complex than C or f

Triple Trade-Off
There is a trade-off between three factors (Dietterich, 2003):
1. Complexity of H, c(H)
2. Training set size, N
3. Generalization error, E, on new data
As N ↑, E ↓. As c(H) ↑, first E ↓ and then E ↑.

Cross-Validation
To estimate generalization error, we need data unseen during training. We split the data into:
- Training set (50%)
- Validation set (25%)
- Test (publication) set (25%)
Use resampling when there is little data.
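The linear and quadratic fits that minimize E(g | X) can be sketched with numpy's least-squares polynomial fit (the synthetic data, true function, and noise level are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
r = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=x.shape)   # r^t = f(x^t) + noise

for degree in (1, 2):                  # g(x) = w1*x + w0, then w2*x^2 + w1*x + w0
    w = np.polyfit(x, r, degree)       # least-squares coefficients
    E = np.mean((r - np.polyval(w, x)) ** 2)   # empirical error E(g|X)
    print(degree, np.round(w, 3), round(float(E), 5))

With data generated by a linear f, the quadratic model lowers the training error only marginally; on new data it would tend to generalize no better, which is the overfitting risk the model-selection slide describes.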
Dimensions of a Supervised Learner
1. Model: g(x | θ)
2. Loss function: E(θ | X) = Σ_t L(r^t, g(x^t | θ))
3. Optimization procedure: θ* = arg min_θ E(θ | X)
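These three dimensions map directly onto code; a sketch for the squared-error linear case, with the optimizer written out as plain gradient descent to keep the three pieces visibly separate (the model, loss, and optimizer choices are illustrative):

import numpy as np

def g(x, theta):                  # 1. model g(x | theta): here, a line
    w1, w0 = theta
    return w1 * x + w0

def E(theta, x, r):               # 2. loss E(theta | X) = sum_t L(r^t, g(x^t | theta))
    return float(np.sum((r - g(x, theta)) ** 2))

def fit(x, r, lr=0.01, steps=5000):   # 3. optimization: theta* = argmin_theta E
    theta = np.zeros(2)
    for _ in range(steps):            # gradient descent on the squared error
        err = r - g(x, theta)
        grad = np.array([-2.0 * np.sum(err * x), -2.0 * np.sum(err)])
        theta -= lr * grad
    return theta

Swapping any one dimension (a quadratic g, an absolute-error L, a closed-form solver) changes the learner without touching the other two, which is the point of the decomposition.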