CS:4420 Artificial Intelligence


CS:4420 Artificial Intelligence, Spring 2017
Learning from Examples
Cesare Tinelli, The University of Iowa

Copyright 2004-17, Cesare Tinelli and Stuart Russell. These notes were originally developed by Stuart Russell and are used with permission. They are copyrighted material and may not be used in other course settings outside of the University of Iowa, in their current or modified form, without the express written consent of the copyright holders.

Readings: Chap. 18 of [Russell and Norvig, 2012]

Learning Agents

A distinct feature of intelligent agents in nature is their ability to learn from experience. Using its experience and its internal knowledge, a learning agent is able to produce new knowledge. That is, given its internal knowledge and a percept sequence, the agent is able to learn facts that
- are consistent with both the percepts and the previous knowledge,
- do not just follow from the percepts and the previous knowledge.

Example: Learning for Logical Agents

Learning in logical agents can be formalized as follows. Let Γ, E be sets of sentences, where
- Γ is the agent's knowledge base, the agent's current knowledge;
- E is a representation of a percept sequence, the evidential data.

A learning agent is an agent able to generate facts ϕ from Γ and E such that
- Γ ∪ E ∪ {ϕ} is satisfiable (consistency of ϕ);
- usually, Γ ∪ E ⊭ ϕ (novelty of ϕ).

Learning Agent: Conceptual Components

[Figure: block diagram of a learning agent and its environment. The agent contains a performance element (connected to sensors and effectors), a learning element, a critic, and a problem generator. The critic compares sensor input against a performance standard and sends feedback to the learning element; the learning element makes changes to the performance element's knowledge and sets learning goals for the problem generator.]

Learning Elements

Machine learning research has produced a large variety of learning elements. Major issues in the design of learning elements:
- Which components of the performance element are to be improved
- What representation is used for those components
- What kind of feedback is available: supervised learning, reinforcement learning, unsupervised learning
- What prior knowledge is available

Learning as Learning of Functions

Any component of a performance element can be described mathematically as a function:
- condition-action rules
- predicates in the knowledge base
- next-state operators
- goal-state recognizers
- search heuristic functions
- belief networks
- utility functions
- ...

All learning can be seen as learning the representation of a function.

Inductive Learning

A lot of learning is of an inductive nature: given some experimental data, the agent learns the general principles governing those data and is able to make correct predictions on future data, based on these general principles.

Examples:
- After a baby is told that certain objects in the house are chairs, the baby is able to learn the concept of chair and then recognize previously unseen chairs as such.
- Your grandfather watches a soccer match for the first time and, from the action and the commentators' report, is able to figure out the rules of the game.

Purely Inductive Learning

Given a collection {(x_1, f(x_1)), ..., (x_n, f(x_n))} of input/output pairs, or examples, for a function f, produce a hypothesis, i.e., (a compact representation of) a function h that approximates f.

[Figure: panels (a)-(d) show different curves fit to the same data points.]

In general, there are quite a lot of different hypotheses consistent with the examples.
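
As a small illustration (the data points and polynomial degrees below are made up for this transcript, not taken from the lecture), here is a Python sketch of several hypotheses that are all consistent, or nearly consistent, with the same examples:

import numpy as np

# Input/output examples (x_i, f(x_i)) for an unknown function f; made-up data.
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([0.1, 0.9, 4.2, 8.8, 16.3])   # roughly x^2 with a little noise

# Three candidate hypotheses h: polynomials of different degrees fit to the examples.
for degree in (1, 2, 4):
    h = np.poly1d(np.polyfit(xs, ys, degree))
    mse = float(np.mean((h(xs) - ys) ** 2))
    print(f"degree {degree}: mean squared error on the examples = {mse:.4f}")

# The degree-4 polynomial matches the five examples exactly, but that alone does
# not make it the best hypothesis for predicting f on unseen inputs.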

Bias in Learning

Any kind of preference for a hypothesis h over another is called a bias.

Bias is inescapable: just the choice of formalism to describe h already introduces a bias.

Bias is necessary: learning is nearly impossible without bias. (Which of the many hypotheses do you choose?)

Learning Decision Trees

The simplest form of learning from examples occurs in learning decision trees.

A decision tree is a Boolean operator that takes as input a set of predicates describing an object or a situation, and outputs a discrete value. It is represented by a tree in which
- every non-leaf node corresponds to a test on the value of one of the predicates,
- every leaf node specifies the value to be returned if that leaf is reached.

Decision trees returning a binary value (e.g., a Boolean) act as classifiers.

A Decision Tree

This tree can be used to decide whether to wait for a table at a restaurant.

[Figure: a decision tree whose root tests Patrons? (None → F, Some → T, Full → WaitEstimate?). The WaitEstimate? node branches on >60, 30-60, 10-30, and 0-10, with further tests on Alternate?, Hungry?, Reservation?, Fri/Sat?, Bar?, and Raining? before reaching the T/F leaves.]
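
One way to make the tree concrete is to encode it as nested data and walk it. The Python sketch below is illustrative only: it encodes just the upper portion of the restaurant tree (the Patrons? and WaitEstimate? tests), since the full figure is not reproduced in this transcript.

# Internal nodes are (attribute, {value: subtree}) pairs; leaves are booleans.
TREE = ("Patrons?", {
    "None": False,
    "Some": True,
    "Full": ("WaitEstimate?", {
        ">60": False,
        "0-10": True,
        # the 10-30 and 30-60 branches continue with further tests in the figure
    }),
})

def classify(tree, example):
    """Follow attribute tests down the tree until a True/False leaf is reached."""
    while not isinstance(tree, bool):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree

print(classify(TREE, {"Patrons?": "Some"}))                          # True: wait
print(classify(TREE, {"Patrons?": "Full", "WaitEstimate?": ">60"}))  # False: leave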

A Decision Tree as Predicates

A decision tree with Boolean output defines a logical predicate.

[Figure: a tree testing Patrons? (None → F, Some → T, Full → Hungry?), Hungry? (No → F, Yes → Type?), Type? (French → T, Italian → F, Thai → Fri/Sat?, Burger → T), and Fri/Sat? (No → F, Yes → T).]

WillWait  ⟺  (Patrons = Some)
          ∨ (Patrons = Full ∧ Hungry ∧ Type = French)
          ∨ (Patrons = Full ∧ Hungry ∧ Type = Burger)
          ∨ (Patrons = Full ∧ Hungry ∧ Type = Thai ∧ Fri/Sat)

Building Decision Trees

How can we build a decision tree for a specific predicate? We can look at a number of examples that satisfy, or do not satisfy, the predicate and try to extrapolate the tree from them.

Example | Alt | Bar | Fri | Hun | Pat  | Price | Rain | Res | Type    | Est   | WillWait (Goal)
X1      | Yes | No  | No  | Yes | Some | $$$   | No   | Yes | French  | 0-10  | Yes
X2      | Yes | No  | No  | Yes | Full | $     | No   | No  | Thai    | 30-60 | No
X3      | No  | Yes | No  | No  | Some | $     | No   | No  | Burger  | 0-10  | Yes
X4      | Yes | No  | Yes | Yes | Full | $     | No   | No  | Thai    | 10-30 | Yes
X5      | Yes | No  | Yes | No  | Full | $$$   | No   | Yes | French  | >60   | No
X6      | No  | Yes | No  | Yes | Some | $$    | Yes  | Yes | Italian | 0-10  | Yes
X7      | No  | Yes | No  | No  | None | $     | Yes  | No  | Burger  | 0-10  | No
X8      | No  | No  | No  | Yes | Some | $$    | Yes  | Yes | Thai    | 0-10  | Yes
X9      | No  | Yes | Yes | No  | Full | $     | Yes  | No  | Burger  | >60   | No
X10     | Yes | Yes | Yes | Yes | Full | $$$   | No   | Yes | Italian | 10-30 | No
X11     | No  | No  | No  | No  | None | $     | No   | No  | Thai    | 0-10  | No
X12     | Yes | Yes | Yes | Yes | Full | $     | No   | No  | Burger  | 30-60 | Yes

Some Terminology

The goal predicate is the predicate to be implemented by a decision tree. The training set is the set of examples used to build the tree. A member of the training set is a positive example if it satisfies the goal predicate; it is a negative example if it does not.

A Boolean decision tree implements a classifier: given a potential instance of a goal predicate, it is able to say, by looking at some attributes of the instance, whether the instance is a positive example of the predicate or not.

Good Decision Trees

It is trivial to construct a decision tree that agrees with a given training set. (How?) However, the trivial tree will simply memorize the given examples.

We want a tree that extrapolates a common pattern from the examples. We want the tree to correctly classify all possible examples, not just those in the training set.
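
As an aside, the parenthetical "How?" has a one-line answer: turn the training set into a lookup table, one path per example. A minimal sketch (illustrative code written for this transcript):

# The trivial "decision tree": memorize every training example as its own path,
# i.e., a lookup table keyed by the full tuple of attribute values. It agrees
# perfectly with the training set but does not generalize at all.
def memorizing_classifier(training_set):
    table = {tuple(sorted(attrs.items())): label for attrs, label in training_set}
    return lambda attrs: table.get(tuple(sorted(attrs.items())))   # None if unseen

examples = [({"Patrons?": "Some", "Hungry?": "Yes"}, True),
            ({"Patrons?": "None", "Hungry?": "No"}, False)]
h = memorizing_classifier(examples)
print(h({"Patrons?": "Some", "Hungry?": "Yes"}))   # True  (memorized)
print(h({"Patrons?": "Full", "Hungry?": "Yes"}))   # None  (no generalization)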

Looking for Decision Trees

In general, there are several decision trees that describe the same goal predicate. Which one should we prefer?

Ockham's razor: always prefer the simplest description, that is, the smallest tree.

Problem: searching through the space of possible trees and finding the smallest one is possible but takes exponential time.

Solution: apply some simple heuristics that lead to small (if not smallest) trees.

Main Idea: start building the tree by testing at its root an attribute that best splits the training set into homogeneous classes.

Choosing an attribute

A good attribute splits the examples into subsets that are ideally all positive or all negative.

[Figure: the candidate splits of the examples by Patrons? (None, Some, Full) and by Type? (French, Italian, Thai, Burger).]

Patrons? is a better choice: it gives more information about the classification.

Choosing an attribute

Preferring more informative attributes leads to smaller trees.

[Figure: two splits of the 12 training examples (positives 1, 3, 4, 6, 8, 12; negatives 2, 5, 7, 9, 10, 11): splitting by Type? yields four mixed branches, while splitting by Patrons? yields two homogeneous branches (None, Some) and one mixed branch (Full) that can be further split by Hungry?.]

Building the Tree: General Procedure

1. Choose for the root node test the attribute that best partitions the given training set E into homogeneous sets.
2. If the chosen attribute has n possible values, it will partition E into n sets E_1, ..., E_n. Add a branch i to the root node for each set E_i.
3. For each branch i:
   (a) If E_i is empty, choose the most common yes/no classification among E's examples and add a corresponding leaf to the branch.
   (b) If E_i contains only positive examples, add a yes leaf to the branch.
   (c) If E_i contains only negative examples, add a no leaf to the branch.
   (d) Otherwise, add a non-leaf node to the branch and apply the procedure recursively to that node, with the remaining attributes and with E_i as the training set.
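
The procedure translates almost line by line into Python. The sketch below is mine, not the lecture's: the helper names (build_tree, choose_attribute, majority) are assumptions, and the attribute chooser is left as a parameter so the entropy-based heuristic from the following slides can be plugged in.

from collections import Counter

def majority(examples):
    """Most common yes/no classification among the examples (step 3a)."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def build_tree(examples, attributes, values, choose_attribute):
    """examples: list of (attrs_dict, label) pairs; values: attribute -> possible values."""
    labels = {label for _, label in examples}
    if len(labels) == 1:                      # steps 3b/3c: all positive or all negative
        return labels.pop()
    if not attributes:                        # no tests left: fall back to the majority
        return majority(examples)
    a = choose_attribute(examples, attributes)                 # step 1
    branches = {}
    for v in values[a]:                                        # step 2: one branch per value
        subset = [(attrs, lab) for attrs, lab in examples if attrs[a] == v]
        if not subset:                                         # step 3a: empty branch
            branches[v] = majority(examples)
        else:                                                  # step 3d: recurse
            branches[v] = build_tree(subset, [b for b in attributes if b != a],
                                     values, choose_attribute)
    return (a, branches)

# Tiny demo with a placeholder chooser (always pick the first attribute);
# the remainder-based chooser described below would replace it.
examples = [({"Patrons?": "Some", "Hungry?": "Yes"}, "yes"),
            ({"Patrons?": "None", "Hungry?": "No"},  "no"),
            ({"Patrons?": "Full", "Hungry?": "Yes"}, "no")]
values = {"Patrons?": ["None", "Some", "Full"], "Hungry?": ["Yes", "No"]}
print(build_tree(examples, ["Patrons?", "Hungry?"], values, lambda ex, attrs: attrs[0]))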

Choosing the Best Attribute

What exactly do we mean by "best partitions the training set into homogeneous classes"? What if each attribute splits the training set into non-homogeneous classes? Which one is better?

Information Theory can be used to devise a measure of goodness for attributes.

Information Theory

- Studies the mathematical laws governing systems designed to communicate or manipulate information
- Defines quantitative measures of information and of the capacity of various systems to transmit, store, and process information
- In particular, it measures the information content, or entropy, of messages/events

Information is measured in bits. One bit represents the information we need to answer a yes/no question when we have no idea about the answer.

Information Content

If an event has n possible outcomes v_i, each with prior probability P(v_i), the information content H of the event's actual outcome is

H(P(v_1), ..., P(v_n)) = -\sum_{i=1}^{n} P(v_i) \log_2 P(v_i)

i.e., the average information content of each outcome, -\log_2 P(v_i), weighted by the outcome's probability.

Information Content/Entropy Examples

H(P(v_1), ..., P(v_n)) = -\sum_{i=1}^{n} P(v_i) \log_2 P(v_i)

1) Entropy of a fair coin toss:
H(P(h), P(t)) = H(1/2, 1/2) = -(1/2)\log_2(1/2) - (1/2)\log_2(1/2) = 1/2 + 1/2 = 1 bit

2) Entropy of a loaded coin toss where P(head) = 0.99:
H(P(h), P(t)) = H(99/100, 1/100) = -0.99\log_2 0.99 - 0.01\log_2 0.01 ≈ 0.08 bits

3) Entropy of a coin toss for a coin with heads on both sides:
H(P(h), P(t)) = H(1, 0) = -1\log_2 1 - 0\log_2 0 = 0 - 0 = 0 bits
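
These three values can be checked with a few lines of Python (using the convention that 0 · log2 0 = 0):

import math

def H(*probs):
    """Entropy, in bits, of an outcome distribution; 0*log2(0) is taken as 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(H(1/2, 1/2))      # fair coin: 1.0 bit
print(H(0.99, 0.01))    # loaded coin: ~0.08 bits
print(H(1.0, 0.0))      # two-headed coin: 0.0 bits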

Entropy of a Decision Tree

For decision trees, the event in question is whether the tree will return yes or no for a given input example e.

Assume the training set E is a representative sample of the domain; that is, the relative frequency of positive examples in E closely approximates the prior probability of a positive example.

If E contains p positive examples and n negative examples, the probability distribution of answers by a correct decision tree is:

P(yes) = p/(p+n)        P(no) = n/(p+n)

Entropy of the correct decision tree:

H(p/(p+n), n/(p+n)) = -(p/(p+n))\log_2(p/(p+n)) - (n/(p+n))\log_2(n/(p+n))

Information Content of an Attribute

Checking the value of a single attribute A in the tree provides only some of the information provided by the whole tree. But we can measure how much information is still needed after A has been checked.

Information Content of an Attribute

Let E_1, ..., E_m be the sets into which A partitions the current training set E. For i = 1, ..., m, let

p   = # of positive examples in E
n   = # of negative examples in E
p_i = # of positive examples in E_i
n_i = # of negative examples in E_i

Then, after we have checked A, we will on average need

Remainder(A) = \sum_{i=1}^{m} (p_i + n_i)/(p + n) * H(p_i/(p_i + n_i), n_i/(p_i + n_i))

extra bits of information to classify the input example.
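
In code, Remainder(A) is just a weighted average of branch entropies. A sketch follows; the data layout (a list of (value-of-A, positive?) pairs) is my own convention, not the lecture's:

import math
from collections import defaultdict

def entropy(p, n):
    """H(p/(p+n), n/(p+n)) in bits, with the 0*log2(0) = 0 convention."""
    return sum(-x / (p + n) * math.log2(x / (p + n)) for x in (p, n) if x > 0)

def remainder(split):
    """split: one (value_of_A, is_positive) pair per example in the training set E."""
    counts = defaultdict(lambda: [0, 0])           # value of A -> [p_i, n_i]
    for value, positive in split:
        counts[value][0 if positive else 1] += 1
    total = len(split)
    return sum((p + n) / total * entropy(p, n) for p, n in counts.values())

# An attribute splitting 4 positives and 4 negatives into one pure branch and
# one mixed branch needs about 0.41 extra bits on average after being checked.
print(remainder([("a", True), ("a", True), ("a", True), ("a", True),
                 ("b", False), ("b", False), ("b", False), ("b", True)]))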

Choosing an Attribute

Conclusion: the smaller the value of Remainder(A), the higher the information content of attribute A for the purpose of classifying the input example.

Heuristic: when building a non-leaf node of a decision tree, choose the attribute with the smallest remainder.

Building Decision Trees: An Example

Problem: from the information below about several production runs in a given factory, construct a decision tree to determine the factors that influence production output.

Run | Supervisor | Operator | Machine | Overtime | Output
1   | Patrick    | Joe      | a       | no       | high
2   | Patrick    | Samantha | b       | yes      | low
3   | Thomas     | Jim      | b       | yes      | low
4   | Patrick    | Jim      | b       | no       | high
5   | Sally      | Joe      | c       | no       | high
6   | Thomas     | Samantha | c       | no       | low
7   | Thomas     | Joe      | c       | no       | low
8   | Patrick    | Jim      | a       | yes      | low

Building Decision Trees: An Example

First identify the attribute with the lowest information remainder, using the whole table as the training set (the positive examples are those with high output). Since for each attribute A

Remainder(A) = \sum_{i=1}^{m} (p_i + n_i)/(p + n) * H(p_i/(p_i + n_i), n_i/(p_i + n_i))
             = \sum_{i=1}^{m} (p_i + n_i)/(p + n) * ( -(p_i/(p_i + n_i))\log_2(p_i/(p_i + n_i)) - (n_i/(p_i + n_i))\log_2(n_i/(p_i + n_i)) )

we need to compute all the relative frequencies involved.

Example (1)

Here is how each attribute splits the training set, together with the entropy of each branch:

Attribute  | Branch   | Runs                   | Entropy
Supervisor | Patrick  | 1(+), 4(+), 2, 8       | 1
Supervisor | Thomas   | 3, 6, 7                | 0
Supervisor | Sally    | 5(+)                   | 0
Operator   | Joe      | 1(+), 5(+), 7          | 0.92
Operator   | Jim      | 4(+), 3, 8             | 0.92
Operator   | Samantha | 2, 6                   | 0
Machine    | a        | 1(+), 8                | 1
Machine    | b        | 4(+), 2, 3             | 0.92
Machine    | c        | 5(+), 6, 7             | 0.92
Overtime   | no       | 1(+), 4(+), 5(+), 6, 7 | 0.97
Overtime   | yes      | 2, 3, 8                | 0

Remainder(Supervisor) = 4/8 · 1 + 1/8 · 0 + 3/8 · 0 = 0.50
Remainder(Operator)   = 3/8 · 0.92 + 3/8 · 0.92 + 2/8 · 0 = 0.69
Remainder(Machine)    = 2/8 · 1 + 3/8 · 0.92 + 3/8 · 0.92 = 0.94
Remainder(Overtime)   = 5/8 · 0.97 + 3/8 · 0 = 0.61

Choose Supervisor since it has the lowest remainder.
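
These four values can be recomputed mechanically. The script below was written for this transcript (the RUNS table just restates the production-run data from the earlier slide):

import math
from collections import defaultdict

# (Supervisor, Operator, Machine, Overtime, Output); positive examples have high output.
RUNS = [("Patrick", "Joe",      "a", "no",  "high"),
        ("Patrick", "Samantha", "b", "yes", "low"),
        ("Thomas",  "Jim",      "b", "yes", "low"),
        ("Patrick", "Jim",      "b", "no",  "high"),
        ("Sally",   "Joe",      "c", "no",  "high"),
        ("Thomas",  "Samantha", "c", "no",  "low"),
        ("Thomas",  "Joe",      "c", "no",  "low"),
        ("Patrick", "Jim",      "a", "yes", "low")]
COLUMN = {"Supervisor": 0, "Operator": 1, "Machine": 2, "Overtime": 3}

def entropy(p, n):
    return sum(-x / (p + n) * math.log2(x / (p + n)) for x in (p, n) if x > 0)

def remainder(attribute, runs):
    counts = defaultdict(lambda: [0, 0])                  # branch value -> [p_i, n_i]
    for run in runs:
        counts[run[COLUMN[attribute]]][0 if run[4] == "high" else 1] += 1
    return sum((p + n) / len(runs) * entropy(p, n) for p, n in counts.values())

for attribute in COLUMN:
    print(f"Remainder({attribute}) = {remainder(attribute, RUNS):.2f}")
# Supervisor = 0.50, Operator = 0.69, Machine = 0.94, Overtime = 0.61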

Example (2)

Thomas' runs are all negative and Sally's are all positive. We need to further classify just Patrick's runs.

Example (2)

Recompute the remainders of the remaining attributes, but this time based solely on Patrick's runs:

Attribute | Branch   | Runs       | Entropy
Operator  | Jim      | 4(+), 8    | 1
Operator  | Joe      | 1(+)       | 0
Operator  | Samantha | 2          | 0
Machine   | a        | 1(+), 8    | 1
Machine   | b        | 4(+), 2    | 1
Overtime  | no       | 1(+), 4(+) | 0
Overtime  | yes      | 2, 8       | 0

Remainder(Operator) = 2/4 · 1 + 1/4 · 0 + 1/4 · 0 = 0.5
Remainder(Machine)  = 2/4 · 1 + 2/4 · 1 = 1
Remainder(Overtime) = 2/4 · 0 + 2/4 · 0 = 0

Choose Overtime to further classify Patrick's runs.

Example (3)

The final decision tree:

[Figure: root node Supervisor? with three branches: Patrick leads to an Overtime? node (no → yes, yes → no), Sally leads to a yes leaf, and Thomas leads to a no leaf.]
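
As a quick sanity check (a few lines written for this transcript, not part of the slides), the final tree classifies all eight production runs in the table correctly:

def predict(supervisor, overtime):
    """The final tree: test Supervisor?, and for Patrick's runs also test Overtime?."""
    if supervisor == "Sally":
        return "high"
    if supervisor == "Thomas":
        return "low"
    return "high" if overtime == "no" else "low"        # Patrick

RUNS = [("Patrick", "no", "high"), ("Patrick", "yes", "low"),
        ("Thomas", "yes", "low"),  ("Patrick", "no",  "high"),
        ("Sally", "no", "high"),   ("Thomas", "no",  "low"),
        ("Thomas", "no", "low"),   ("Patrick", "yes", "low")]

print(all(predict(s, o) == output for s, o, output in RUNS))   # True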

Problems in Building Decision Trees

- Noise. Two training examples may have identical values for all the attributes but be classified differently.
- Overfitting. Irrelevant attributes may make spurious distinctions among training examples.
- Missing data. The value of some attributes of some training examples may be missing.
- Multi-valued attributes. The information gain of an attribute with many different values tends to be non-zero even when the attribute is irrelevant.
- Continuous-valued attributes. They must be discretized to be used. Of all the possible discretizations, some are better than others for classification purposes.

Performance measurement

How do we know that the learned hypothesis h approximates the intended function f?
- Use theorems of computational/statistical learning theory
- Try h on a new test set of examples, using the same distribution over the example space as the training set

Learning curve = % correct on the test set as a function of training set size.

[Figure: learning curve for the restaurant domain; % correct on the test set (y-axis, roughly 0.4 to 1.0) vs. training set size (x-axis, 0 to 100). 100 randomly generated restaurant examples; graph averaged over 20 trials; for i = 1, ..., 99, each trial selects i examples randomly.]
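
The bookkeeping behind a learning curve is simple once some learner is available. The sketch below is illustrative only: it uses made-up data and a placeholder learner that just predicts the majority class of its training set, where a real experiment would build a decision tree.

import random
from collections import Counter

def learn(training_set):
    """Placeholder learner: always predict the majority label of the training set."""
    majority = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda x: majority

def pct_correct(h, test_set):
    return sum(h(x) == y for x, y in test_set) / len(test_set)

random.seed(0)
examples = [(i, i % 3 != 0) for i in range(100)]      # made-up labeled examples
for size in (1, 10, 50, 99):                          # one point of the learning curve each
    random.shuffle(examples)
    train, test = examples[:size], examples[size:]
    print(f"training size {size}: {pct_correct(learn(train), test):.2f} correct on the test set")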

Choosing the best hypothesis

Consider a set S = {(x, y) | y = f(x)} of N input/output examples for a target function f.

Stationarity assumption: all examples E ∈ S have the same prior probability distribution P(E), and each of them is independent of the previously observed ones.

Error rate of a hypothesis h:

|{(x, y) ∈ S : h(x) ≠ y}| / N

Holdout cross-validation: partition S randomly into a training set and a test set.

k-fold cross-validation: partition S into k subsets S_1, ..., S_k of the same size. For each i = 1, ..., k, use S_i as the test set and S \ S_i as the training set. Use the average error rate.
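
A sketch of the error rate and of k-fold cross-validation (the majority-class learner and the data are stand-ins made up for this transcript):

from collections import Counter

def error_rate(h, examples):
    """Fraction of examples (x, y) with h(x) != y."""
    return sum(h(x) != y for x, y in examples) / len(examples)

def k_fold_cv(learn, examples, k):
    """Average error rate: fold i serves as the test set, the remaining folds train."""
    folds = [examples[i::k] for i in range(k)]        # k subsets of (nearly) equal size
    errors = []
    for i in range(k):
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        errors.append(error_rate(learn(train), folds[i]))
    return sum(errors) / k

def learn(training_set):                              # stand-in for a real learner
    majority = Counter(y for _, y in training_set).most_common(1)[0][0]
    return lambda x: majority

data = [(i, i % 4 == 0) for i in range(40)]           # made-up labeled examples
print(k_fold_cv(learn, data, k=5))                    # average error rate over 5 folds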