CS 2750 Machine Learning. Lecture 23. Concept Learning

Lecture 23: Concept learning
Milos Hauskrecht, milos@cs.pitt.edu

Concept Learning. Outline:
- Learning boolean functions
- Most general and most specific consistent hypothesis
- Mitchell's version space algorithm
- Probably approximately correct (PAC) learning
- Sample complexity for PAC
- Vapnik-Chervonenkis (VC) dimension
- Improved sample complexity bounds

Learning concepts

Assume objects (examples) described in terms of attributes:

  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
  Sunny  Warm     Normal    Strong  Warm   Same      yes
  Rainy  Cold     Normal    Strong  Warm   Change    no

Concept = a set of objects.
Concept learning: given a sample of labeled objects, we want to learn a boolean mapping from objects to T/F identifying an underlying concept, e.g. the EnjoySport concept.
Concept (hypothesis) space H: a restriction on the boolean description of concepts.

Learning concepts

Object (instance) space X, concept (hypothesis) space H ⊆ 2^X.
Assume n binary attributes (e.g. true/false, warm/cold):
- Instance space X: 2^n different objects.
- Concept space H: possible concepts = all possible subsets of objects = 2^(2^n) concepts!
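To make the counting concrete, here is a small Python sketch (not from the lecture) that evaluates these two sizes for a few values of n:

```python
# Sizes of the instance space and of the unrestricted concept space
# for n binary attributes (the counting argument from the slide above).

def space_sizes(n: int):
    """Return (|X|, |H|) for n binary attributes and an unrestricted H."""
    num_instances = 2 ** n              # each attribute is true or false
    num_concepts = 2 ** num_instances   # every subset of X is a possible concept
    return num_instances, num_concepts

for n in (2, 4, 6):
    x_size, h_size = space_sizes(n)
    print(f"n={n}: |X| = {x_size}, |H| = 2^{x_size} = {h_size}")
```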

Learning concepts

Problem: the concept space is too large.
Solution: a restricted hypothesis space H.
Example: conjunctive concepts, e.g. (Sky = Sunny) ∧ (Weather = Cold); there are 3^n possible conjunctive concepts. Why? (Each attribute can be required true, required false, or left out.)
Other restricted spaces:
- 3-CNF (or k-CNF): (a1 ∨ a3 ∨ a7) ∧ (...)
- 3-DNF (or k-DNF): (a1 ∧ a5 ∧ a9) ∨ (...)

Learning concepts

After seeing k examples, the hypothesis space (even if restricted) can still contain many consistent concept hypotheses.
Consistent hypothesis: a concept c that evaluates to T on all positive examples and to F on all negatives.
What to learn?
- General to specific learning: start from all true and refine, keeping the maximal (consistent) generalization.
- Specific to general learning: start from all false and refine, keeping the most restrictive specialization.
- Version space learning: keep all consistent hypotheses around; a combination of the above two cases.

Specific to general learning (for conjunctive concepts)

Assume two hypotheses:
  h1 = (Sunny, ?, ?, Strong, ?, ?)
  h2 = (Sunny, ?, ?, ?, ?, ?)
Then we say that h2 is more general than h1, and h1 is a special case (specialization) of h2. The symbol "?" stands for an arbitrary value.
Specific to general learning: start from the all-false hypothesis h0 = (∅, ∅, ∅, ∅, ∅, ∅); by scanning the samples, gradually refine the hypothesis (make it more general) whenever it does not satisfy the new sample seen (keep the most restrictive specialization consistent with the positives).

Specific to general learning. Example

Conjunctive concepts; the target is a conjunctive concept.
  h = (∅, ∅, ∅, ∅, ∅, ∅)   ... all false
  (Sunny, Warm, Normal, Strong, Warm, Same)   T  →  h = (Sunny, Warm, Normal, Strong, Warm, Same)
  (Rainy, Cold, Normal, Strong, Warm, Change) F  →  h = (Sunny, Warm, Normal, Strong, Warm, Same)
  (Sunny, Warm, High, Strong, Warm, Same)     T  →  h = (Sunny, Warm, ?, Strong, Warm, Same)
  (Sunny, Warm, High, Strong, Cool, Same)     T  →  h = (Sunny, Warm, ?, Strong, ?, Same)
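The update used in this trace can be written as a short program. The following is a minimal FIND-S style sketch (not from the lecture); the names `generalize`, `specific_to_general` and the `EMPTY` marker are illustrative.

```python
# Specific-to-general learning of a conjunctive concept (a FIND-S style sketch).
# EMPTY marks the all-false slot (accepts nothing); "?" accepts any value.

EMPTY = "0"

def generalize(h, x):
    """Minimally generalize hypothesis h so that it covers positive example x."""
    new_h = []
    for h_val, x_val in zip(h, x):
        if h_val == EMPTY:        # first positive example fixes the value
            new_h.append(x_val)
        elif h_val == x_val:      # constraint already satisfied
            new_h.append(h_val)
        else:                     # conflicting value -> relax to "don't care"
            new_h.append("?")
    return tuple(new_h)

def specific_to_general(samples):
    """samples: list of (attribute_tuple, label); negative examples are ignored."""
    h = tuple(EMPTY for _ in samples[0][0])   # start from the all-false hypothesis
    for x, positive in samples:
        if positive:
            h = generalize(h, x)
    return h

data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "Normal", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Same"), True),
]
print(specific_to_general(data))   # -> ('Sunny', 'Warm', '?', 'Strong', '?', 'Same')
```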

General to specific learning

The dual problem to specific to general learning. Start from the all-true hypothesis h0 = (?, ?, ?, ?, ?, ?) and refine the concept description so that all samples are consistent (keep the maximal possible generalization).
  h = (?, ?, ?, ?, ?, ?)
  (Sunny, Warm, Normal, Strong, Warm, Same)   T  →  h = (?, ?, ?, ?, ?, ?)
  (Sunny, Warm, High, Strong, Warm, Same)     T  →  h = (?, ?, ?, ?, ?, ?)
  (Rainy, Cold, Normal, Strong, Warm, Change) F  →  h ∈ { (Sunny, ?, ?, ?, ?, ?), (?, Warm, ?, ?, ?, ?), (?, ?, ?, ?, ?, Same) }

Mitchell's version space algorithm

Keeps the space of consistent hypotheses, bounded by two fringes:
- Most general rule: the upper bound (fringe), pushed down by (negative) examples.
- Most specific rule: the lower bound (fringe), pushed up by (positive) examples.
The version space lies between the two fringes.
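The refinement step after the negative example can likewise be sketched in code. The function below is a simplified illustration (not the full candidate-elimination G-boundary update): it specializes a single all-general hypothesis using the attribute values of a previously seen positive example.

```python
# Minimal specializations of the all-general hypothesis against a negative example
# (a simplified sketch of the general-to-specific refinement step above).

def minimal_specializations(h, negative, positive):
    """Replace one '?' at a time by the value from a positive example, keeping
    only those specializations that exclude the negative example."""
    result = []
    for i, h_val in enumerate(h):
        if h_val == "?" and positive[i] != negative[i]:
            s = list(h)
            s[i] = positive[i]      # this constraint rules out the negative example
            result.append(tuple(s))
    return result

h0 = ("?",) * 6
pos = ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same")
neg = ("Rainy", "Cold", "Normal", "Strong", "Warm", "Change")
print(minimal_specializations(h0, neg, pos))
# -> specializations on Sky, AirTemp and Forecast, as in the trace above
```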

Mitchell's version space algorithm

Keeps and refines the fringes of the version space. Converges to the target concept whenever the target is a member of the hypothesis space H.
Assumption: no noise in the data samples (the same example always has the same label).
The hope is that the fringes always stay small. Is this correct?

Exponential fringe set example

Conjunctive concepts, upper fringe (general to specific). Samples over n binary attributes:
  (true, true, true, true, ..., true)     T
  (false, false, true, true, ..., true)   F
  (true, true, false, false, ..., true)   F
  ...
  (true, true, true, ..., false, false)   F
Maximal generalizations: 2^(n/2) different hypotheses we need to remember:
  (true, ?, true, ?, ..., true, ?)
  (?, true, true, ?, ..., true, ?)
  (true, ?, ?, true, ..., true, ?)
  ...
  (?, true, ?, true, ..., ?, true)

Learning concepts

The version space algorithm may require a large number of samples to converge to the target concept:
- In the worst case we must see all concepts before converging to it.
- The samples may come from different distributions; it may take a very long time to see all examples.
- The fringe can grow exponential in the number of attributes.
Alternative solution: select a hypothesis that is consistent after some number of labeled samples has been seen by our algorithm.
Can we tell how far we are from the solution? Yes!!! The PAC framework develops criteria for measuring the accuracy of our choice in probabilistic terms.

Valiant's framework

- There is a probability distribution from which samples are drawn.
- There is an error permitted in assigning the labels to examples: the concept learned does not have to be perfect, but it should not be very far from the target concept.
- Notation: c_T ... target concept, c ... learned concept, x ... next sample from the distribution.
  Error(c_T, c) = P(x ∈ c ∧ x ∉ c_T) + P(x ∉ c ∧ x ∈ c_T)
- ε ... accuracy parameter. We would like to have a concept such that Error(c_T, c) ≤ ε.
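As a concrete illustration of this error definition, the following sketch estimates Error(c_T, c) by sampling; the uniform distribution and the two concepts below are assumptions for the example only, not part of the lecture.

```python
# Monte Carlo sketch of Error(c_T, c): the probability mass on which the target
# and the learned concept disagree, under an assumed distribution D.

import random

def estimate_error(c_target, c_learned, draw_x, m=100_000):
    """Estimate P(x belongs to exactly one of c_target, c_learned)."""
    return sum(c_target(x) != c_learned(x) for x in (draw_x() for _ in range(m))) / m

n = 6
draw_x = lambda: tuple(random.random() < 0.5 for _ in range(n))   # assumed uniform D

c_target = lambda x: x[0] and x[2]    # illustrative target concept: a1 AND a3
c_learned = lambda x: x[0]            # illustrative learned concept: a1

print(estimate_error(c_target, c_learned, draw_x))   # close to 0.25
```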

PAC learning

To get the error smaller than the accuracy parameter ε in all cases may be hard: some examples may be very rare, and to see them may require a large number of samples. Instead we choose:
  P(Error(c_T, c) ≤ ε) ≥ 1 − δ
where δ is a confidence factor.
Probably approximately correct (PAC) learning: with probability at least 1 − δ, a concept with an error of not more than ε is found.

Sample complexity of PAC learning

How many samples do we need to see to satisfy the PAC criterion?
Assume: we saw m independent samples drawn from the distribution, and h is a hypothesis that is consistent with all m examples but its error is larger than ε, Error(c_T, h) > ε. Then:
  P(a sample is consistent with such an h) ≤ (1 − ε)
  P(m samples are consistent with such an h) ≤ (1 − ε)^m
There are at most |H| hypotheses in the space, so
  P(any bad hypothesis survives m samples) ≤ |H| (1 − ε)^m

Sample complexity of PAC learning

  P(any bad hypothesis survives m samples) ≤ |H| (1 − ε)^m ≤ |H| e^(−εm)
In the PAC framework we want to bound this probability by the confidence factor δ:
  |H| e^(−εm) ≤ δ
Expressing for m:
  m ≥ (1/ε) (ln(1/δ) + ln|H|)
After m samples satisfying the above inequality, any consistent hypothesis satisfies the PAC criterion.

Efficient PAC learnability

The concept is efficiently PAC learnable if the time it takes to output the concept is polynomial in n, 1/ε, 1/δ.
Two aspects:
- Sample complexity: the number of examples needed to learn the concept satisfying the PAC criterion. A prerequisite to efficient PAC learnability.
- Time complexity: the time it takes to find the concept. Even if the sample complexity is OK, the learning procedure may not be efficient (e.g. an exponential fringe).

Efficient PAC learnability

The sample complexity depends on the hypothesis space we use:
- Conjunctive concepts: 3^n possible concepts,
    m ≥ (1/ε) (ln(1/δ) + ln 3^n) = (1/ε) (ln(1/δ) + n ln 3)   ... efficient
- All possible concepts (unbiased hypothesis space): 2^(2^n) concepts,
    m ≥ (1/ε) (ln(1/δ) + ln 2^(2^n)) = (1/ε) (ln(1/δ) + 2^n ln 2)   ... inefficient

Efficient PAC learnability

Polynomial sample complexity is necessary but not sufficient; the algorithm should also work in polynomial time. Some types of concepts (hypotheses) can be learned efficiently. Example: conjunctive concepts.
- Specific to general learning: keeps one hypothesis around, the most specific description of all positive examples. Can be done in polynomial time.
- General to specific learning: we need to keep the complete upper fringe, which can be exponential. Cannot be done in polynomial time.
Other concept (hypothesis) spaces with polynomial sample complexity:
- k-DNF: cannot be PAC learned in polynomial time.
- k-CNF: has a polynomial time solution.
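A short Python sketch evaluating the bound m ≥ (1/ε)(ln(1/δ) + ln|H|) for the two hypothesis spaces above; the parameter values are illustrative assumptions.

```python
# Evaluating the PAC sample-complexity bound m >= (1/eps)*(ln(1/delta) + ln|H|)
# for the conjunctive space (|H| = 3^n) and the unbiased space (|H| = 2^(2^n)).

import math

def pac_sample_bound(eps, delta, log_H):
    """Smallest m with |H|*exp(-eps*m) <= delta, where log_H = ln|H|."""
    return math.ceil((math.log(1.0 / delta) + log_H) / eps)

eps, delta, n = 0.1, 0.05, 10
print("conjunctive (3^n):  ", pac_sample_bound(eps, delta, n * math.log(3)))
print("unbiased (2^(2^n)): ", pac_sample_bound(eps, delta, (2 ** n) * math.log(2)))
```

For these values the conjunctive space needs on the order of a hundred samples, while the unbiased space already needs thousands, and the gap grows exponentially with n.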

Learning conjunctive concepts

Learning conjunctive concepts with specific to general learning: it is sufficient to keep one hypothesis around, the most specific description of all positive examples. Can be done in polynomial time. How?
Initial hypothesis (all false): a1 ∧ ¬a1 ∧ a2 ∧ ¬a2 ∧ ... ∧ ak ∧ ¬ak
When a positive instance is seen, we remove the inconsistent literals from the conjunction.
  Positive instance: a1, ¬a2, ..., ak
  Updated hypothesis: a1 ∧ ¬a2 ∧ ... ∧ ak   (the literals falsified by the instance are deleted)
We keep doing this for m steps.

Learning 3-CNF

Sample complexity is polynomial for both k-CNF and k-DNF, but:
- k-DNF cannot be learned efficiently.
- k-CNF can be learned efficiently. How?
Assume a 3-CNF: (a1 ∨ a3 ∨ a7) ∧ (a2 ∨ a4 ∨ a5) ∧ ...
There is only a polynomial number of clauses with at most 3 variables: O(n^3).
Algorithm (specific to general learning):
- Start with the conjunction of all possible clauses (always false).
- On a positive example, any clause that is not true is deleted.
- On negative examples, do nothing.
Interesting: any k-DNF can be converted into a k-CNF.
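A minimal sketch of this clause-elimination algorithm for 3-CNF (the encoding of literals and the helper names are illustrative, not from the lecture):

```python
# Clause-elimination sketch for learning a 3-CNF: start with the conjunction of
# all clauses with at most 3 literals, and on each positive example delete every
# clause that the example falsifies. Negative examples are ignored.

from itertools import combinations

def all_clauses(n, k=3):
    """All clauses (frozensets of literals) of size <= k over variables 1..n.
    Literal +i stands for x_i and literal -i for NOT x_i."""
    literals = [lit for v in range(1, n + 1) for lit in (v, -v)]
    return {frozenset(c) for size in range(1, k + 1) for c in combinations(literals, size)}

def clause_true(clause, x):
    """x is a tuple of booleans; x[i-1] is the value of variable i."""
    return any((lit > 0) == x[abs(lit) - 1] for lit in clause)

def learn_3cnf(samples, n):
    hypothesis = all_clauses(n)        # conjunction of all clauses: initially always false
    for x, positive in samples:
        if positive:
            hypothesis = {c for c in hypothesis if clause_true(c, x)}
    return hypothesis

# Illustrative positive examples of the target (x1 OR x2) AND x3:
samples = [((True, False, True), True),
           ((False, True, True), True),
           ((True, True, True), True)]
print(len(learn_3cnf(samples, n=3)), "clauses remain in the learned conjunction")
```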

Quantifying inductive bias

During learning only a small fraction of the possible samples is seen, so we need to generalize to unseen examples. The choice of the hypothesis space restricts our learning options: it biases our learning. Other biases: a preference towards simpler hypotheses, smaller degrees of freedom.
Questions:
- How do we measure the bias?
- To what extent do our biases affect our learning capabilities?
- Can we learn even if the hypothesis space is infinite? (The bound m ≥ (1/ε)(ln(1/δ) + ln|H|) no longer helps there.)

Vapnik-Chervonenkis dimension

Measures the bias of the concept space. It allows us to:
- obtain a better sample complexity bound;
- extend the analysis to attributes with infinite value spaces.
VC idea: do not measure the size of the space, but the number of distinct instances that can be completely discriminated using H.
Example: H is the space of (axis-parallel) rectangles; discrimination of the labelings of 3 points with rectangles.

Shattering of a set of instances

A set of instances S ⊆ X: H shatters S if for every dichotomy (combination of labels) of S there is a hypothesis h ∈ H consistent with the dichotomy.
Example: H is the space of rectangles; take a set of 3 instances (the most flexible choice of points). Dichotomy 1, dichotomy 2, ..., dichotomy 2^3: there are 2^3 = 8 different dichotomies, and a hypothesis (rectangle) for each of them.

Vapnik-Chervonenkis dimension

The VC dimension of a hypothesis space H is the size of the largest subset of instances that is shattered by H.
Example: rectangles (VC dimension at least 3).
- Try 4: can be shattered (for the most flexible choice of 4 points), so the VC dimension is at least 4.
- Try 5: no set of 5 points can be shattered, thus the VC dimension is 4.
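A small sketch that tests shattering by axis-aligned rectangles directly; it relies on the observation that a dichotomy is realizable iff the bounding box of its positive points excludes all negative points (the point sets below are assumptions for the example):

```python
# Shattering check for axis-aligned rectangles in the plane: a dichotomy is
# realizable iff the bounding box of its positive points contains no negative
# point, so it suffices to test that box for every labeling.

from itertools import product

def in_box(p, box):
    (x0, x1), (y0, y1) = box
    return x0 <= p[0] <= x1 and y0 <= p[1] <= y1

def shattered(points):
    for labels in product([False, True], repeat=len(points)):
        pos = [p for p, lab in zip(points, labels) if lab]
        neg = [p for p, lab in zip(points, labels) if not lab]
        if not pos:
            continue                     # the empty rectangle realizes the all-negative labeling
        box = ((min(x for x, _ in pos), max(x for x, _ in pos)),
               (min(y for _, y in pos), max(y for _, y in pos)))
        if any(in_box(p, box) for p in neg):
            return False                 # this dichotomy cannot be realized
    return True

print(shattered([(0, 1), (1, 0), (2, 2), (1, 3)]))           # True: these 4 points are shattered
print(shattered([(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]))   # False: 5 points on a line are not
```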

VC dimension and sample complexity

One can derive a sample complexity bound for PAC learning using the VC dimension instead of the hypothesis space size (we won't do it here):
  m ≥ (1/ε) (4 log2(2/δ) + 8 VCdim(H) log2(13/ε))

Adding noise

We have a target concept, but there is a chance of mislabeling the examples seen. Can we PAC-learn also in this case?
Blumer (1986): if h is a hypothesis that agrees with at least m ≥ (1/ε) ln(|H|/δ) samples drawn from the distribution, then P(error(h, c_T) > ε) ≤ δ.
Mitchell gives the sample complexity bound for the choice of the hypothesis with the best training error.
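For illustration, the VC-based bound can be evaluated numerically; the parameter values below are assumptions for the example, not from the lecture.

```python
# Numerical evaluation of the VC-based sample-complexity bound
# m >= (1/eps) * (4*log2(2/delta) + 8*VCdim(H)*log2(13/eps)).

import math

def vc_sample_bound(eps, delta, vc_dim):
    return math.ceil((4 * math.log2(2 / delta) + 8 * vc_dim * math.log2(13 / eps)) / eps)

# Axis-aligned rectangles in the plane have VC dimension 4 (see the slides above).
print(vc_sample_bound(eps=0.1, delta=0.05, vc_dim=4))   # on the order of a few thousand samples
```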

Summary

- Learning boolean functions
- Most general and most specific consistent hypothesis
- Mitchell's version space algorithm
- Probably approximately correct (PAC) learning
- Sample complexity for PAC
- Vapnik-Chervonenkis (VC) dimension
- Improved sample complexity bounds
- Adding noise