CSCI 5622 Machine Learning

DATE           READ        DUE
Mon, Aug 31    1, 2 & 3
Wed, Sept 2    3 & 5
Wed, Sept 9    TBA         Prelim Proposal

www.rodneynielsen.com/teaching/csci5622f09/

Instructor: Rodney Nielsen
Assistant Professor Adjunct, CU Dept. of Computer Science
Research Assistant Professor, DU Dept. of Electrical & Computer Engr.
Research Scientist, Boulder Language Technologies
Supervised Learning

Given a set of training examples X and corresponding outputs Y, we want the function f(x) = y.
Select a hypothesis space H and a learning algorithm.
Search H for a hypothesis h that approximates f, as in the sketch below.
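To make the search framing concrete, here is a minimal Python sketch: a toy hypothesis space H of threshold classifiers, searched for the h with the fewest training errors. The names and data are hypothetical illustrations, not part of the course material.

```python
# A minimal sketch of the supervised-learning setup, assuming a finite
# hypothesis space H that we can enumerate exhaustively (toy example).
from typing import Callable, List

def search(H: List[Callable[[float], int]],
           X: List[float], Y: List[int]) -> Callable[[float], int]:
    """Return the h in H with the fewest errors on the training set."""
    return min(H, key=lambda h: sum(h(x) != y for x, y in zip(X, Y)))

# Toy hypothesis space: threshold classifiers h_t(x) = 1 if x > t else 0.
H = [lambda x, t=t: int(x > t) for t in (0.0, 0.5, 1.0, 1.5)]
X, Y = [0.2, 0.7, 1.2, 1.8], [0, 0, 1, 1]
h = search(H, X, Y)       # picks the threshold that best fits (X, Y)
print([h(x) for x in X])  # [0, 0, 1, 1]
```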
Decision Trees
Decision Tree: Are you a skier?

Candidate questions for predicting the answer:
- Do you run?
- Do you bike?
- How many sports do you play?
- Do you hate the cold?
- Are you coordinated?
- Which is closer: Breckenridge, Keystone, or Vail?
- Where is Mary Jane?
- Where is the Flying Dutchman?
What we learned

- Evaluate all attributes at each level
- Need a principled means of evaluating decisions to ensure generalization
  - Prefer purity
  - Prefer smaller trees
  - Attributes with numerous values must be penalized
- The tree doesn't generalize very well if leaves have very few examples
- Test on data not in the training set
- Early stopping or post-pruning to improve generalization
- Skiers are more coordinated
DT: Hypothesis Space

Disjunction of conjunctions.

[Figure: decision tree splitting on Snow (Good / Poor / Fresh); under Good, WantToPassML (Yes → Don't Go, No → Go); Poor → Don't Go (wait); under Fresh, Smart (Yes → Don't Go (or come back early), No → Go)]

IF (Snow=Good AND WantToPassML=Yes)
   OR (Snow=Poor)
   OR (Snow=Fresh AND Smart=Yes)
THEN Don't Go
DT: Hypothesis Space

Unrestricted hypothesis space: decision trees can represent any finite discrete-valued function of the attributes.
DT: Algorithm

Learning: depth-first greedy search through the state space.
Classification: run the instance x through the tree according to its attribute values; the leaf reached predicts ŷ = c_k. A minimal sketch follows.
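Here is a minimal Python sketch of the classification step, reusing the ski-trip attributes from the hypothesis-space slide; the Node class and its field names are hypothetical choices for illustration.

```python
# A minimal sketch of DT classification: walk from the root to a leaf by
# following the branch that matches each attribute value.
class Node:
    def __init__(self, attribute=None, label=None):
        self.attribute = attribute  # attribute tested here (internal node)
        self.label = label          # class c_k predicted here (leaf node)
        self.children = {}          # attribute value -> child Node

def classify(node, x):
    """Return the predicted class y-hat = c_k for instance x (a dict)."""
    while node.label is None:                    # descend until a leaf
        node = node.children[x[node.attribute]]  # follow matching branch
    return node.label

# Tiny tree: split on Snow, then on Smart under the Fresh branch.
root = Node(attribute="Snow")
root.children["Poor"] = Node(label="Don't Go")
root.children["Good"] = Node(label="Go")
fresh = Node(attribute="Smart")
fresh.children["Yes"] = Node(label="Don't Go")
fresh.children["No"] = Node(label="Go")
root.children["Fresh"] = fresh

print(classify(root, {"Snow": "Fresh", "Smart": "Yes"}))  # Don't Go
```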
DT: ID3 Learning Algorithm

ID3(trainingData, attributes):
    if attributes = ∅ or all trainingData is in one class:
        return a leaf node predicting the majority class
    x* ← best attribute to split on
    Nd ← create a decision node splitting on x*
    attributesLeft ← attributes − {x*}
    for each possible value v_k of x*:
        Nd.addChild(ID3(subset of trainingData with x* = v_k, attributesLeft))
    return Nd
DT: Attribute Selection

Evaluate each attribute x_1, x_2, x_3, … using a heuristic choice (generally based on statistics or information theory):

x* = argmax_{x_i} utility(x_i)
Entropy

Entropy(X) = H(X) = −p(y=1) log₂ p(y=1) − p(y=0) log₂ p(y=0)

In general, for K classes:

Entropy(X) = −Σ_{k=1..K} p(y=c_k) log₂ p(y=c_k)

H([N, 0]) = −1.0 log₂(1.0) − 0.0 log₂(0.0) = 0.0
H([N/2, N/2]) = −0.5 log₂(0.5) − 0.5 log₂(0.5) = 1.0

[Figure: entropy curve, ranging from 0.0 at [N,0] and [0,N] up to 1.0 at [N/2,N/2]]
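The two worked values above can be checked with a short Python function (a sketch; by convention the 0 · log₂ 0 terms are treated as 0, which the filter below implements).

```python
# A runnable check of the entropy examples above.
from math import log2

def entropy(counts):
    """H(X) = -sum_k p(y=c_k) log2 p(y=c_k), from per-class counts."""
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)

N = 100
print(entropy([N, 0]))            # 0.0 -> a pure node
print(entropy([N // 2, N // 2]))  # 1.0 -> a maximally impure (50/50) node
```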
DT: Information Gain

The decrease in entropy as a result of partitioning the data:

InfoGain(X, x_i) = Entropy(X) − Σ_{v ∈ Values(x_i)} (|X_v| / |X|) Entropy(X_v),  where X_v = {x ∈ X : x_i = v}

Example: X = [6, 7], H(X) = −6/13 log₂(6/13) − 7/13 log₂(7/13) = 0.996

x_1: InfoGain = 0.996 − 3/13 (−1 log₂ 1 − 0 log₂ 0) − 10/13 (−0.4 log₂ 0.4 − 0.6 log₂ 0.6) = 0.249
x_2: InfoGain = 0.996 − 6/13 (−5/6 log₂ 5/6 − 1/6 log₂ 1/6) − 7/13 (−2/7 log₂ 2/7 − 5/7 log₂ 5/7) = 0.231
x_3: InfoGain = 0.996 − 2/13 (−1 log₂ 1 − 0 log₂ 0) − 4/13 (−3/4 log₂ 3/4 − 1/4 log₂ 1/4) − 7/13 (−2/7 log₂ 2/7 − 5/7 log₂ 5/7) = 0.281
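A runnable check of these numbers (a sketch; the per-branch class counts are reconstructed from the fractions above so that they sum to the parent's [6, 7]).

```python
# Reproducing the information-gain numbers from the slide.
from math import log2

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)

def info_gain(parent_counts, child_counts_list):
    """InfoGain = H(X) - sum_v |X_v|/|X| * H(X_v)."""
    n = sum(parent_counts)
    return entropy(parent_counts) - sum(
        sum(cc) / n * entropy(cc) for cc in child_counts_list)

# X = [6, 7]; the three candidate splits from the slide:
print(round(info_gain([6, 7], [[0, 3], [6, 4]]), 3))          # x_1: 0.249
print(round(info_gain([6, 7], [[1, 5], [5, 2]]), 3))          # x_2: 0.231
print(round(info_gain([6, 7], [[0, 2], [1, 3], [5, 2]]), 3))  # x_3: 0.281
```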
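Putting entropy and information gain together, here is a minimal runnable ID3 in Python following the pseudocode above. Instances are dicts mapping attribute names to values; the nested-dict tree representation (with "split", "children", and "default" keys) is a hypothetical choice for illustration.

```python
# A minimal runnable ID3 sketch. data is a list of (instance, label)
# pairs; instances are dicts mapping attribute name -> value.
from collections import Counter
from math import log2

def H(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def id3(data, attributes):
    labels = [y for _, y in data]
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes or len(set(labels)) == 1:
        return majority                       # leaf: predict majority class
    def gain(a):                              # InfoGain of splitting on a
        groups = {}
        for x, y in data:
            groups.setdefault(x[a], []).append(y)
        return H(labels) - sum(len(g) / len(labels) * H(g)
                               for g in groups.values())
    best = max(attributes, key=gain)          # x* <- highest-gain attribute
    node = {"split": best, "children": {}, "default": majority}
    rest = [a for a in attributes if a != best]
    for v in {x[best] for x, _ in data}:      # one child per value of x*
        subset = [(x, y) for x, y in data if x[best] == v]
        node["children"][v] = id3(subset, rest)
    return node

data = [({"Snow": "Good", "Smart": "Yes"}, "Go"),
        ({"Snow": "Poor", "Smart": "Yes"}, "Don't Go"),
        ({"Snow": "Good", "Smart": "No"}, "Go"),
        ({"Snow": "Poor", "Smart": "No"}, "Don't Go")]
print(id3(data, ["Snow", "Smart"]))  # splits on Snow; both children pure
```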
DT: Inductive Bias

- Small vs. large trees
- Occam's Razor: prefer the simplest hypothesis that fits the data
Admin

Textbook: Tom Mitchell, Machine Learning.
Project: preliminary proposal due Wed, Sept 9. Send your idea sooner if you can.
Topics: learning types.
DT: Overfitting the Data

Grow the tree until it perfectly classifies the training data? With noise or too few training instances, this hurts generalization.

Overfitting: h overfits if there is an alternative hypothesis h′ such that
Error(h, trngX) < Error(h′, trngX), but Error(h, distX) > Error(h′, distX)
DT: Avoiding Overfitting

Avoiding overfitting:
- Early stopping
- Post-pruning

Criteria:
- Another dataset (development or validation): Reduced Error Pruning, sketched after this list
- Statistical test
- Encoding size: Minimum Description Length
- Rule post-pruning:
  IF (Snow=Good AND WantToPassML=Yes) OR (Snow=Poor) OR (Snow=Fresh AND Smart=Yes) THEN Don't Go
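Here is a sketch of Reduced Error Pruning over the nested-dict trees produced by the ID3 sketch above: bottom-up, tentatively collapse each decision node to its majority-class leaf, and keep the collapse whenever accuracy on a held-out validation set does not drop. The function names are hypothetical.

```python
# Reduced Error Pruning sketch for the dict-based trees built by id3().
def predict(node, x):
    while isinstance(node, dict):  # unseen values fall back to the default
        node = node["children"].get(x[node["split"]], node["default"])
    return node

def accuracy(node, data):
    return sum(predict(node, x) == y for x, y in data) / len(data)

def prune(node, root, val_data):
    if not isinstance(node, dict):            # already a leaf
        return node
    for v, child in node["children"].items(): # prune bottom-up
        node["children"][v] = prune(child, root, val_data)
    before = accuracy(root, val_data)
    saved = dict(node["children"])            # tentatively collapse to leaf:
    node["children"] = {}                     # every lookup hits the default
    if accuracy(root, val_data) >= before:
        return node["default"]                # no accuracy loss: keep leaf
    node["children"] = saved                  # otherwise restore the subtree
    return node

# Usage: tree = prune(tree, tree, validation_data)
```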
DT: Continuous-Valued Attributes

- Discretization
- On-the-fly discretization: sort the examples by the attribute and consider candidate thresholds where the class changes, as sketched below
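A sketch of the standard on-the-fly approach: take candidate thresholds at midpoints between adjacent sorted values where the class label changes, then evaluate the InfoGain of the binary split each candidate defines. The values below mirror the Temperature example in Mitchell's textbook.

```python
# Candidate thresholds for a continuous attribute: midpoints between
# adjacent sorted values whose class labels differ.
def candidate_thresholds(values, labels):
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][1] != pairs[i + 1][1]]

print(candidate_thresholds([40, 48, 60, 72, 80, 90],
                           ["No", "No", "Yes", "Yes", "Yes", "No"]))
# [54.0, 85.0] -> evaluate the InfoGain of Temperature > t for each t
```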
DT: Inductive Bias

Hypothesis space: disjunctions of conjunctions. The inductive bias is a preference for certain hypotheses (smaller trees, with high-gain attributes near the root) rather than a restriction of the hypothesis space.
DT: Alternative Attribute Selection

Gain Ratio:

SplitInfo(X, a_i) = −Σ_{v ∈ Values(a_i)} (|X_v| / |X|) log₂(|X_v| / |X|)

GainRatio(X, a_i) = InfoGain(X, a_i) / SplitInfo(X, a_i)

SplitInfo is Entropy(X) with respect to the values V of a_i, versus Entropy(X) with respect to the classes C.

x_1: InfoGain = 0.249, GainRatio = 0.319
x_2: InfoGain = 0.231, GainRatio = 0.231
x_3: InfoGain = 0.281, GainRatio = 0.198
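A runnable check of the gain-ratio numbers (a sketch; the per-branch class counts are reconstructed as in the InfoGain example; at full precision x_2 comes out 0.232 versus the 0.231 rounding above).

```python
# Reproducing the gain-ratio numbers from the slide.
from math import log2

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)

def info_gain(parent, children):
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

def split_info(children):
    # SplitInfo = entropy of X w.r.t. the branch sizes, not the classes
    return entropy([sum(c) for c in children])

def gain_ratio(parent, children):
    return info_gain(parent, children) / split_info(children)

print(round(gain_ratio([6, 7], [[0, 3], [6, 4]]), 3))          # x_1: 0.319
print(round(gain_ratio([6, 7], [[1, 5], [5, 2]]), 3))          # x_2: 0.232
print(round(gain_ratio([6, 7], [[0, 2], [1, 3], [5, 2]]), 3))  # x_3: 0.198
```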
DT: Missing Attribute Values

Estimate the missing value:
- Most common value at that node
- Most common value in the same class at that node
- Probabilistic value assignment: split the instance into fractions based on the proportion of examples with each value

A minimal sketch of the first strategy follows.
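This sketch fills in missing values with the most common observed value of the attribute among the examples at the node; the function and field names are hypothetical.

```python
# Impute missing values (None) with the node's most common value.
from collections import Counter

def impute(data, attribute):
    """data: list of (instance dict, label) pairs at the current node."""
    observed = [x[attribute] for x, _ in data if x[attribute] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    for x, _ in data:
        if x[attribute] is None:
            x[attribute] = mode
    return data
```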
DT: Attributes with Different Costs

Example: an expensive test such as a CAT scan.
One simple approach: divide InfoGain by the cost of the test, as sketched below.
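A sketch of the idea; the tests, costs, and gain values below are made-up numbers, and the simple gain/cost ratio shown is one option among several published cost-sensitive criteria (others divide the squared gain by the cost, or use similar cost-discounted utilities).

```python
# Cost-sensitive attribute selection: trade information gain against
# the cost of performing the test. All numbers here are illustrative.
def cost_sensitive_utility(gain, cost):
    return gain / cost

costs = {"CatScan": 200.0, "BloodTest": 20.0}
gains = {"CatScan": 0.40, "BloodTest": 0.25}
best = max(costs, key=lambda a: cost_sensitive_utility(gains[a], costs[a]))
print(best)  # BloodTest: lower gain, but far cheaper per unit of information
```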
DT: Other Issues

- Cost of misclassification
- Regression
DT: Key Points

- Practical
- Generally a top-down greedy search
- Unrestricted hypothesis space
- Inductive bias: preference for smaller trees
- Use post-pruning to avoid overfitting
- Numerous ID3 extensions, and numerous other algorithms such as CART (Classification and Regression Trees)