Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018


Decision trees

Why Trees?
- Interpretable/intuitive; popular in medical applications because they mimic the way a doctor thinks
- Model discrete outcomes nicely
- Can be very powerful, and can be as complex as you need them
- C4.5 and CART appear among the top 10 entries on Kaggle: decision trees are very effective and popular

Sure, But Why Trees?
- Easy-to-understand knowledge representation
- Can handle mixed variables
- Recursive, divide-and-conquer learning method
- Efficient inference
- Example: Play outside => (tree shown on slide)

Divide-and-conquer Classification
Consider input tuples (x_i, y_i) for the i-th observation.
[Figure: the (x1, x2) feature space partitioned into regions A, B, C, D, E by axis-parallel splits at thresholds θ1, θ2, θ3, θ4 (tests such as x1 > θ1, x2 > θ3), together with the corresponding decision tree.]

Tree learning
- Finding the best tree is intractable: one would have to consider all 2^m combinations, where m is the number of features.
- Instead, the tree is often just grown greedily by splitting on attributes one by one; to determine which attribute to split on, look at node impurity (see the sketch below).
- Top-down recursive divide-and-conquer algorithm:
  Start with all examples at the root.
  Select the best attribute/feature.
  Recurse and repeat.
- Other issues: how to construct features, when to stop growing, pruning irrelevant parts of the tree.
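To make the greedy procedure concrete, here is a minimal Python sketch (not the lecture's code): it grows a tree on categorical features, using entropy-based information gain as the node-impurity score. The function names and the toy weather data are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    # Class-label entropy of a set of examples, in bits.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    # Entropy at the node minus the expected entropy after splitting on `feature`.
    n = len(labels)
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[feature] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def grow_tree(rows, labels, features):
    # Stop when the node is pure or no features remain; predict the majority class.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = grow_tree([rows[i] for i in idx],
                                      [labels[i] for i in idx],
                                      [f for f in features if f != best])
    return tree

# Toy usage (made-up data):
data = [{"Outlook": "sunny", "Windy": "N"}, {"Outlook": "rain", "Windy": "Y"},
        {"Outlook": "sunny", "Windy": "Y"}, {"Outlook": "rain", "Windy": "N"}]
labels = ["+", "-", "+", "-"]
print(grow_tree(data, labels, ["Outlook", "Windy"]))  # e.g. {'Outlook': {'sunny': '+', 'rain': '-'}}
```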

Fraud  Age  Degree  StartYr  Series7
  +     22    Y      2005      N
  -     25    N      2003      Y
  -     31    Y      1995      Y
  -     27    Y      1999      Y
  +     24    N      2006      N
  -     29    N      2003      N

Score each attribute split for these instances (Age, Degree, StartYr, Series7); choose the split on Series7.

Series7 = Y:
Fraud  Age  Degree  StartYr  Series7
  -     25    N      2003      Y
  -     31    Y      1995      Y
  -     27    Y      1999      Y

Series7 = N:
Fraud  Age  Degree  StartYr  Series7
  +     22    Y      2005      N
  +     24    N      2006      N
  -     29    N      2003      N

For the Series7 = N branch, score each remaining attribute split (Age, Degree, StartYr); choose the split on Age > 28.

Age > 28:
Fraud  Age  Degree  StartYr  Series7
  -     29    N      2003      N

Age <= 28:
Fraud  Age  Degree  StartYr  Series7
  +     22    Y      2005      N
  +     24    N      2006      N

Overview (with two features and a 1D target)
Features: X1, X2. Target: Y.
[FIGURE 9.2: Partitions and CART. Top right panel: a partition of a two-dimensional feature space by recursive binary splitting (at thresholds t1-t4), as used in CART, applied to some fake data. Top left panel: a general partition that cannot be obtained from recursive binary splitting. Bottom left panel: the tree corresponding to the partition in the top right panel, with regions R1-R5. Bottom right panel: a perspective plot of the prediction surface.]

Tree models
Most well-known systems:
- CART: Breiman, Friedman, Olshen and Stone
- ID3, C4.5: Quinlan
How do they differ?
- Split scoring function
- Stopping criterion
- Pruning mechanism
- Predictions in leaf nodes

Scoring functions: Local split value

Choosing an attribute/feature
Idea: a good feature splits the examples into subsets that distinguish among the class labels as much as possible, ideally into pure sets of "all positive" or "all negative" examples.
[Figure: the restaurant example, comparing a split on Patrons? (None / Some / Full) with a split on Type? (French / Italian / Thai / Burger).]
Bias-variance tradeoff: choosing the most discriminating attribute first may not give the best tree (bias), but it can make the tree small (low variance).

Association between attribute and class label
Data: 14 examples with attribute Income in {High, Med, Low} and class label Buy / No buy (9 buy, 5 do not).

Contingency table (attribute value vs. class label value):
Income   Buy   No buy
High      2      2
Med       4      2
Low       3      1
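As a quick illustration, such a contingency table can be built with pandas.crosstab; the raw rows below are reconstructed only to match the counts on the slide and are not the original dataset.

```python
import pandas as pd

# Reconstructed rows matching the slide's counts (2/2, 4/2, 3/1); illustrative only.
rows = ([("High", "buy")] * 2 + [("High", "no buy")] * 2 +
        [("Med", "buy")] * 4 + [("Med", "no buy")] * 2 +
        [("Low", "buy")] * 3 + [("Low", "no buy")] * 1)
df = pd.DataFrame(rows, columns=["Income", "Class"])

# Cross-tabulate attribute value against class label, with marginal totals.
print(pd.crosstab(df["Income"], df["Class"], margins=True))
```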

Mathematically Defining a Good Split
We start with information theory: how uncertain will the answer be if we split the tree this way?
Say we need to decide between k options. The uncertainty in the answer Y in {1, ..., k} when the probabilities are (p_1, ..., p_k) can be quantified via entropy:

H(p_1, ..., p_k) = - Σ_i p_i log2 p_i

Convenient notation: B(p) = H(p, 1 - p), the number of bits necessary to encode a binary outcome with probability p.
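A small numeric check of these definitions (the names H and B follow the slide's notation; the code itself is illustrative):

```python
import math

def H(*probs):
    """Entropy in bits of a distribution (p_1, ..., p_k)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def B(p):
    """B(p) = H(p, 1 - p): bits needed to encode a binary outcome."""
    return H(p, 1 - p)

print(B(0.5))     # 1.0 bit: a fair coin is maximally uncertain
print(B(9 / 14))  # ~0.940 bits: the root entropy used in the examples below
```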

Amount of Information in the Tree
Suppose we have p positive and n negative examples at the root; then B(p / (p + n)) bits are needed to classify a new example.
Information is always conserved: if encoding the information in the leaves is lossless, then the tree has a lossless encoding. The entropy of the leaves (amount of bits) plus the information carried in the tree structure (bits) equals the total information in the data.
Let branch i of a split have p_i positive and n_i negative examples; then B(p_i / (p_i + n_i)) bits are needed to classify a new example in that branch. The expected number of bits per example over all branches is

Σ_i ((p_i + n_i) / (p + n)) B(p_i / (p_i + n_i))

Choose the next attribute to split so as to minimize this remaining information, which maximizes the information carried in the tree (since information is conserved).

<latexit sha1_base64="xbivkjhhtbzalqgmioraqk8ih6y=">aaahunic3vvnb9naen22tcmhqathlluqsag1krpxkvspiesllkuifcm2qvv6kqyy9lq747bpyn+jxwmhdvanohjhhvuhadreqikvld/pzppmvb3thqkubj3v68li0o3lldrqzfqttdt37q5v3ptgvky5dlmssn8mmqepeuiiqakfuw0sdiuchqm3hf/wglqrknmp4xscma0s0recotmdre/7mcmhz9k+zqmpciqwcknxcnqautwnjgsdmgusrbfdrk8edimja3emctwutsyp1re8ljdzdb60k7bfqnvwtlh8w48uz2jiketmtk/tprhyplfwcxndzwykji/yahrrsuhnwmiwgt2ddj3thvnhtk+0exkke+t5kmwxmem4djffj+airzbe5utl2h8rwjgkgulcy0t9tfjutjcqrkidrzl2ghetxlmud5lmhj3qm1mkf6ns0ut1mqwxg4fwkg4szwkhoeu2keznlamauosa6be1q5acacxcckfopf8jacomrwmnkor28tp86m2g0v3kbsvwwlucsyr6bp23ftc1yaghubw+czwnamqz1p0isc+4iybdr9uobl3rafcibuc6itq76efqjaftdra9r3rnhza936btgo4drxypumfzgkyzfshlbigz2c4+jyio+fxot8+02e6k6ljn4jvezbgdqzxk6uhp/yvn+hmuvv3dnkurc0ldet2fe9jtfox0/fnctl1xknpnz/+ctpdnqdp2r0/p/6tpoumtz/qgfu508fukmqhsxufahofsxaknrfyzo1hs2on1noe/savdlygs2lxqyag4gjx0s5slddde++klma+6ndbllvfuydbes+q+wcupyepsjg3ynoyrfxjauostt+qz+ua+r3xz+vlbqc2voyslfec+mvm1tv8xnmos</latexit> <latexit sha1_base64="59igpv32zwfgr8k1ohldlaycw/g=">aaahmhic3vxnbhmxehabhhl+wjhycamcamqjtcsvukvkxjc4fefopeyq8nonirxvemv726bwvgfpa0d4kz4qv668alpzvwiatjeqwfkub2fm88x8tjxhkowxnne8conylcxa1avr9es3bt66vbxy54nrmebq5uoqvrsya1ik0lxcsthnnba4llatjl4v/p190eao5l0dpxdebjcivudmomlv+yefmzvktlp3ofuthfphhaf2cbqskztqifmw7y2ves1vsug8afdgjvrre29l8acfkz7fuauxzjhe20tt4ji2gkvi635migv8xabqi/zfahiwgwnc4asnndbqh9g+0vhllj1yt5ici40zxyfgfh2y077cejavl9n+88cjjm0sjlxm1m8ktyowateim+zwjhewrgwws/mqacytyjitpdjbkivnxp/jelmrakxiwhgwcjczbmff6kgyfvckudm9dmbiujctwbio6kz6lwbdxprooags3e3+vp94q6hertaogqou4pgl0spnv+njnqhdpcid8w1wnlojjqduf0l6bo8ynr1wo3d1rqnbiroadhnpdtczq5eetnlz9x7sjq3a9h6dtgk6cf4ij+jvqkgmzl6qcjougawxnxmuvhwunj4zbby7quxuixgnurmoikisnb897b/ktd+dsqolmhn1dc6ok6/nc0liwceo458scuo6v9elev7ntd3rlqk2f/2w/k+anpdu+uwpyogvg69s0mwqxtykb8iopyifna7yz55eswohf9pqf5ow4mtaeychfyozfq+wkn5pw1ace+3tq2eeddutfy3v7eo1rafvvfgi98h90irt8oxskddkm3qjjx/jj/kffk19rh3xvtw+l6gxfiroxtkzaj9+as+jwuw=</latexit> <latexit sha1_base64="59igpv32zwfgr8k1ohldlaycw/g=">aaahmhic3vxnbhmxehabhhl+wjhycamcamqjtcsvukvkxjc4fefopeyq8nonirxvemv726bwvgfpa0d4kz4qv668alpzvwiatjeqwfkub2fm88x8tjxhkowxnne8conylcxa1avr9es3bt66vbxy54nrmebq5uoqvrsya1ik0lxcsthnnba4llatjl4v/p190eao5l0dpxdebjcivudmomlv+yefmzvktlp3ofuthfphhaf2cbqskztqifmw7y2ves1vsug8afdgjvrre29l8acfkz7fuauxzjhe20tt4ji2gkvi635migv8xabqi/zfahiwgwnc4asnndbqh9g+0vhllj1yt5ici40zxyfgfh2y077cejavl9n+88cjjm0sjlxm1m8ktyowateim+zwjhewrgwws/mqacytyjitpdjbkivnxp/jelmrakxiwhgwcjczbmff6kgyfvckudm9dmbiujctwbio6kz6lwbdxprooags3e3+vp94q6hertaogqou4pgl0spnv+njnqhdpcid8w1wnlojjqduf0l6bo8ynr1wo3d1rqnbiroadhnpdtczq5eetnlz9x7sjq3a9h6dtgk6cf4ij+jvqkgmzl6qcjougawxnxmuvhwunj4zbby7quxuixgnurmoikisnb897b/ktd+dsqolmhn1dc6ok6/nc0liwceo458scuo6v9elev7ntd3rlqk2f/2w/k+anpdu+uwpyogvg69s0mwqxtykb8iopyifna7yz55eswohf9pqf5ow4mtaeychfyozfq+wkn5pw1ace+3tq2eeddutfy3v7eo1rafvvfgi98h90irt8oxskddkm3qjjx/jj/kffk19rh3xvtw+l6gxfiroxtkzaj9+as+jwuw=</latexit> <latexit 
sha1_base64="59igpv32zwfgr8k1ohldlaycw/g=">aaahmhic3vxnbhmxehabhhl+wjhycamcamqjtcsvukvkxjc4fefopeyq8nonirxvemv726bwvgfpa0d4kz4qv668alpzvwiatjeqwfkub2fm88x8tjxhkowxnne8conylcxa1avr9es3bt66vbxy54nrmebq5uoqvrsya1ik0lxcsthnnba4llatjl4v/p190eao5l0dpxdebjcivudmomlv+yefmzvktlp3ofuthfphhaf2cbqskztqifmw7y2ves1vsug8afdgjvrre29l8acfkz7fuauxzjhe20tt4ji2gkvi635migv8xabqi/zfahiwgwnc4asnndbqh9g+0vhllj1yt5ici40zxyfgfh2y077cejavl9n+88cjjm0sjlxm1m8ktyowateim+zwjhewrgwws/mqacytyjitpdjbkivnxp/jelmrakxiwhgwcjczbmff6kgyfvckudm9dmbiujctwbio6kz6lwbdxprooags3e3+vp94q6hertaogqou4pgl0spnv+njnqhdpcid8w1wnlojjqduf0l6bo8ynr1wo3d1rqnbiroadhnpdtczq5eetnlz9x7sjq3a9h6dtgk6cf4ij+jvqkgmzl6qcjougawxnxmuvhwunj4zbby7quxuixgnurmoikisnb897b/ktd+dsqolmhn1dc6ok6/nc0liwceo458scuo6v9elev7ntd3rlqk2f/2w/k+anpdu+uwpyogvg69s0mwqxtykb8iopyifna7yz55eswohf9pqf5ow4mtaeychfyozfq+wkn5pw1ace+3tq2eeddutfy3v7eo1rafvvfgi98h90irt8oxskddkm3qjjx/jj/kffk19rh3xvtw+l6gxfiroxtkzaj9+as+jwuw=</latexit> Information gain Information Gain (Gain) is the amount of information that the tree structure encodes H[X] is the entropy: expected number of bits to encode a randomly selected subset X A is the set of subsets of the data with a given split S is the entire data X Gain(S, A) =H[S] A A A S H[A] H[buys_computer] = -9/14 log 9/14-5/14 log 5/14 = 0.9400

<latexit sha1_base64="i7j2vmunum2dglfkylffyjf+qu4=">aaahe3ic3vxnbhmxehyldsx8txdksqwkffabbsiqifspiasslyirwim7qrzeswlfxq9sb9pu2seai7wij8svb+a5eafms6vqng1viigl1x6emc8z/sayo1rwy33/x9lylasrtwur1+s3bt66fwdt/e57ozlnomuuupogogyet6brurvwkgqgmhkwh41eff79i9cgq+sdnaqqsjpiej8zathucys1q0afe5kfrm36lx86vexqrsamqcbe4frkzybwljoqwcaomb22n9rquw05e5dxg8xastmidqaxh/hujfsccd3xtozca6a/9vpk45dyb2o9txjugjoreuywzzqzvsj4nq+x2f6z0pekzswkrezuz4rnlvdo4mvca7nigoayzbfcjw2ppsyiunnzirwtuslk9bkmko5akyvdx2jcqmy7newjk8puqmejtfxemsfnwbqknwzvme5xghux1aobbkhyn/1tppe3ioflyvacy6akpen8yavv+tj2knkd3lnayowpnfwe6kgrpgewj7djt9qhqzcadq/iaebuyk3hptvkseg1o1v+q29722v6v0gnbb0el5dn4wlbqwbmphdijxizbbxtkqorpuma++w1253uivduvia4decqvkkujp7tv+tmpmfz1sxmhbo659sv1/mfibhxenx8u0loxbcqesme/zltzzulqo1fp6x/k6anjhub1qpj8xyivaqawqwli3tm7vbwya1xlx++eywnhl9oq/9zwmshygnrveijulialqhkg4bim9e++ygsgm6n9bzlv328ubtbvrer5d55qjqktz6sxfka7jeuyusrd+qt+vz7wpts+1r7voyul1wce2ru1l7/avu/tg4=</latexit> Gain(S, A) =H[S] X A A A S H[A] Income Entropy(Income=high) = -2/4 log 2/4-2/4 log 2/4 = 1 High Med Low Entropy(Income=med) = -4/6 log 4/6-2/6 log 2/6 = 0.9183 A no no yes yes yes no yes yes yes no yes no yes yes Entropy(Income=low) = -3/4 log 3/4-1/4 log 1/4 = 0.8113 Gain(D,Income) = 0.9400 - (4/14 [1] + 6/14 [0.9183] + 4/14 [0.8113]) = 0.029

Gini gain
Similar to information gain, but uses the Gini index instead of entropy. It measures the decrease in the Gini index after the split:

GiniGain(S, 𝒜) = Gini(S) - Σ_{A in 𝒜} (|A| / |S|) Gini(A)
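A minimal sketch of the Gini index and Gini gain for a binary class, reusing the Income split counts from above (function names are illustrative):

```python
def gini(p):
    """Gini index of a node with fraction p of positive examples: 1 - p^2 - (1-p)^2 = 2p(1-p)."""
    return 2 * p * (1 - p)

def gini_gain(parent_counts, child_counts):
    """parent_counts = (positives, total); child_counts = list of (positives, total) per branch."""
    pos, n = parent_counts
    remainder = sum(m / n * gini(k / m) for k, m in child_counts)
    return gini(pos / n) - remainder

# Same Income split as in the information-gain example:
print(round(gini_gain((9, 14), [(2, 4), (4, 6), (3, 4)]), 4))
```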

Comparing information gain to Gini gain
[Figure: entropy, Gini index, and misclassification error plotted as functions of p, the fraction of target A in the branch that outputs B, for p in [0, 1]; a companion plot compares the resulting information gain and Gini gain.]
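The curves behind this figure can be tabulated directly; this short sketch evaluates entropy, Gini index, and misclassification error at a few values of p (illustrative code, not from the slides):

```python
import math

def entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini(p):
    return 2 * p * (1 - p)

def misclass(p):
    return min(p, 1 - p)

for p in [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    print(f"p={p:.2f}  entropy={entropy(p):.3f}  gini={gini(p):.3f}  err={misclass(p):.3f}")
```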

How does the score function affect feature selection?
[Figure: the same candidate splits scored with entropy and with the Gini index.]
The Gini score can produce a larger gain.

<latexit sha1_base64="etvy7ia0eyqzalmlvign6e+f2vw=">aaahqhic3vxnbhmxehyldsx8txdk4ljfslebbvb8ckwqxawjs5eirztdrl7vjlfir1e2t21q7zvwnhcef+arochejrpe7co0/bubbjzw++3mfdsz31h2lhkmjed9wvi8cnwpdm35ev3gzvu376ys3n2nzayodknkuu1franncxqnmxz2ugverbx2o/hlwr97aeozmbw1kxrcqyyjgzbkjdp1v54egpgrjdzu5fs+7ubaz6jvwaed749xmfce2qbsm7yfoc829v3cunfex1n3wt504bogxyf1vk2d/ursjycwnboqgmqj1r22l5rqemuy5zdxg0xdsuiydkexh7buj0sadu3rtmccn5w/xgop3jmyplwejfkitj6iyeuwhentvsj4nq+xmcgz0likzqwktew0ydg2ehec4zgpoizphcbumvcupipiddfo1rksxb+nlfzn9bkmgoxbsslcs0lcgc+7rwhj48puqm4irdte6hfjqbce09spm+2xg7exueohakhyo32tpfbwiu5+5yisokrscjled23weud2sbspyw5tof3lqdhsgopbkasn3cih47xaoa03gg0m8rbcn7eihz0zykmim/6mt4g3tndt+w38evgovha87lawe2tghddooxhpylp4nkkw4lom3jxxs+2nxnfpxcuiy3ahwirjxdgz/kvo7dmsq7qeeayu/5y68np+rkg3eof0/fnczlwxknpjz/+ctuftuqftx9+l/5omjys1avfdwdzpemgufdfsfqfpitmjzgqz2lb++umunhj0oc35t9mimwjdiuffksffasx5unpcqlsm2qcvhbog67eet7w3j9a3t6v7yhndrw9qe7xru7snxqed1euuvucf0cf0ufax9rx2rfa9df1cqdj30nyq/fwfzw5eoa==</latexit> Chi-Square score Widely used to test independence between two categorical attributes (e.g., feature and class label) Hypothesis H 0 : Attributes are independent Consider a contingency table with k entries (k = rows x columns) Considers counts in a contingency table and calculates the normalized squared deviation of observed (predicted) values from expected (actual) values given H 0 X 2 = kx i=1 (o i e i ) 2 e i If counts are large (large number of examples), sampling distribution can be approximated by a chi-square distribution

Contingency table with marginals:
Income   Buy   No buy   Total
High      2      2        4
Med       4      2        6
Low       3      1        4
Total     9      5       14

<latexit sha1_base64="etvy7ia0eyqzalmlvign6e+f2vw=">aaahqhic3vxnbhmxehyldsx8txdk4ljfslebbvb8ckwqxawjs5eirztdrl7vjlfir1e2t21q7zvwnhcef+arochejrpe7co0/bubbjzw++3mfdsz31h2lhkmjed9wvi8cnwpdm35ev3gzvu376ys3n2nzayodknkuu1franncxqnmxz2ugverbx2o/hlwr97aeozmbw1kxrcqyyjgzbkjdp1v54egpgrjdzu5fs+7ubaz6jvwaed749xmfce2qbsm7yfoc829v3cunfex1n3wt504bogxyf1vk2d/ursjycwnboqgmqj1r22l5rqemuy5zdxg0xdsuiydkexh7buj0sadu3rtmccn5w/xgop3jmyplwejfkitj6iyeuwhentvsj4nq+xmcgz0likzqwktew0ydg2ehec4zgpoizphcbumvcupipiddfo1rksxb+nlfzn9bkmgoxbsslcs0lcgc+7rwhj48puqm4irdte6hfjqbce09spm+2xg7exueohakhyo32tpfbwiu5+5yisokrscjled23weud2sbspyw5tof3lqdhsgopbkasn3cih47xaoa03gg0m8rbcn7eihz0zykmim/6mt4g3tndt+w38evgovha87lawe2tghddooxhpylp4nkkw4lom3jxxs+2nxnfpxcuiy3ahwirjxdgz/kvo7dmsq7qeeayu/5y68np+rkg3eof0/fnczlwxknpjz/+ctuftuqftx9+l/5omjys1avfdwdzpemgufdfsfqfpitmjzgqz2lb++umunhj0oc35t9mimwjdiuffksffasx5unpcqlsm2qcvhbog67eet7w3j9a3t6v7yhndrw9qe7xru7snxqed1euuvucf0cf0ufax9rx2rfa9df1cqdj30nyq/fwfzw5eoa==</latexit> Calculating expected values for a cell Class X 2 = kx i=1 (o i e i ) 2 e i Attribute + 0 a b 1 c d o (0,+) = a N e (0,+) = p(a =0,C =+) N = p(a = 0)p(C =+ A = 0) N = p(a = 0)p(C = +) N (assuming independence) apple apple a + b a + c = N N N

Example calculation

Observed:                    Expected:
Income   Buy   No buy        Income   Buy    No buy
High      2      2           High     2.57   1.43
Med       4      2           Med      3.86   2.14
Low       3      1           Low      2.57   1.43

X^2 = Σ_i (o_i - e_i)^2 / e_i
    = (2 - 2.57)^2 / 2.57 + (2 - 1.43)^2 / 1.43
    + (4 - 3.86)^2 / 3.86 + (2 - 2.14)^2 / 2.14
    + (3 - 2.57)^2 / 2.57 + (1 - 1.43)^2 / 1.43
    = 0.57
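A short sketch reproducing this calculation, with expected counts computed from the row and column totals (variable names are illustrative):

```python
observed = {"High": (2, 2), "Med": (4, 2), "Low": (3, 1)}   # (buy, no buy) per income level

N = sum(b + nb for b, nb in observed.values())              # 14 examples
col_totals = (sum(b for b, _ in observed.values()),         # 9 buys
              sum(nb for _, nb in observed.values()))       # 5 no-buys

chi2 = 0.0
for buy, no_buy in observed.values():
    row_total = buy + no_buy
    for o, col_total in zip((buy, no_buy), col_totals):
        e = row_total * col_total / N                       # expected count under independence
        chi2 += (o - e) ** 2 / e
print(round(chi2, 2))                                       # ~0.57
```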

Tree learning
Top-down recursive divide-and-conquer algorithm:
- Start with all examples at the root.
- Select the best attribute/feature.
- Partition the examples by the selected attribute.
- Recurse and repeat.
Other issues: how to construct features, when to stop growing, pruning irrelevant parts of the tree.

Controlling Variance
One major problem with trees is their high variance. Often a small change in the data can result in a very different series of splits, making interpretation somewhat precarious. The major reason for this instability is the hierarchical nature of the process: the effect of an error in the top split is propagated down to all of the splits below it.

Overfitting
Consider a distribution D of data representing a population and a sample D_S drawn from D, which is used as training data.
Given a model space M, a score function S, and a learning algorithm that returns a model m in M, the algorithm overfits the training data D_S if:

there exists m' in M such that S(m, D_S) > S(m', D_S) but S(m, D) < S(m', D)

In other words, there is another model (m') that is better on the entire distribution; if we had learned from the full distribution, we would have selected it instead.

Example learning problem
Task: devise a rule to classify items based on the attribute X.
Knowledge representation: if-then rules. Example rule: if x > 25 then + else -.
What is the model space? All possible thresholds.
What score function? Prediction error rate.
[Figure: + and - examples plotted along the X axis, with a candidate threshold.]
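A tiny sketch of this learning problem: the model space is the set of thresholds and the score function is the prediction error rate. The data points here are made up for illustration.

```python
def error_rate(threshold, xs, ys):
    """Rule: predict '+' if x > threshold else '-'; return fraction of mistakes."""
    preds = ["+" if x > threshold else "-" for x in xs]
    return sum(p != y for p, y in zip(preds, ys)) / len(ys)

xs = [10, 18, 22, 27, 30, 35]
ys = ["-", "-", "-", "+", "+", "+"]

# Search the model space: candidate thresholds at the observed values.
candidates = sorted(set(xs))
best = min(candidates, key=lambda t: error_rate(t, xs, ys))
print(best, error_rate(best, xs, ys))   # threshold 22 gives zero error on this toy data
```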

Approaches to avoid overfitting
- Regularization (priors).
- Hold out an evaluation set, used to adjust the structure of the learned model (e.g., pruning in decision trees).
- Statistical tests during learning, to only include structure with significant associations (e.g., pre-pruning in decision trees).
- Penalty term in the classifier scoring function, i.e., change the score function to prefer simpler models.

How to avoid overfitting in decision trees
Post-pruning: use a separate set of examples to evaluate the utility of pruning nodes from the tree (after the tree is fully grown).
Pre-pruning: apply a statistical test to decide whether to expand a node, or use an explicit measure of complexity to penalize large trees (e.g., Minimum Description Length).

Algorithm comparison
CART
- Evaluation criterion: Gini index
- Search algorithm: simple-to-complex, hill-climbing search
- Stopping criterion: when leaves are pure
- Pruning mechanism: cross-validation to select the Gini threshold
C4.5
- Evaluation criterion: information gain
- Search algorithm: simple-to-complex, hill-climbing search
- Stopping criterion: when leaves are pure
- Pruning mechanism: reduced-error pruning

CART: Finding a Good Gini Threshold
Background: k-fold cross validation.
- Randomly partition the training data into k folds.
- For i = 1 to k: learn a model on D minus the i-th fold; evaluate the model on the i-th fold.
- Average the results from all k trials.
[Figure: a dataset with columns Y, X1, X2 split into six train/test pairs (Train1/Test1, ..., Train6/Test6).]
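A minimal sketch of the fold construction (illustrative code; the fold count and data size are placeholders):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

n, k = 14, 6
for i, test_idx in enumerate(k_fold_indices(n, k), start=1):
    train_idx = [j for j in range(n) if j not in test_idx]
    # Learn on train_idx, evaluate on test_idx, then average over the k trials.
    print(f"fold {i}: train={len(train_idx)} examples, test={len(test_idx)} examples")
```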

Choosing a Gini threshold with cross validation
For i in 1..k:
  For t in a threshold set (e.g., [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]):
    Learn a decision tree on Train_i with Gini gain threshold t (i.e., stop growing when the maximum Gini gain is less than t).
    Evaluate the learned tree on Test_i (e.g., with accuracy).
  Set t_max,i to be the t with the best performance on Test_i.
Set t_max to the average of t_max,i over the k trials.
Relearn the tree on all the data using t_max as the Gini gain threshold.
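One way to sketch this procedure with scikit-learn is shown below. Note that sklearn's min_impurity_decrease parameter is a weighted impurity-decrease threshold, which is analogous to, but not identical to, the slide's Gini-gain threshold; X and y are assumed to be numpy arrays supplied by the caller.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

def choose_threshold(X, y, thresholds, k=6, seed=0):
    best_per_fold = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=seed).split(X):
        scores = []
        for t in thresholds:
            tree = DecisionTreeClassifier(criterion="gini", min_impurity_decrease=t,
                                          random_state=seed)
            tree.fit(X[train_idx], y[train_idx])
            scores.append(accuracy_score(y[test_idx], tree.predict(X[test_idx])))
        best_per_fold.append(thresholds[int(np.argmax(scores))])   # best t on this fold
    t_max = float(np.mean(best_per_fold))                          # average of per-fold best thresholds
    final = DecisionTreeClassifier(criterion="gini", min_impurity_decrease=t_max,
                                   random_state=seed)
    return final.fit(X, y)                                         # relearn on all the data
```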

C4.5: reduced-error pruning
Use a pruning set to estimate accuracy for sub-trees and for individual nodes.
Let T be the sub-tree rooted at node v, and define the gain of pruning at v as the reduction in error on the pruning set obtained by replacing T with a single leaf.
Repeat: prune at the node with the largest gain, until only negative-gain nodes remain.
Bottom-up restriction: T can only be pruned if it does not contain a sub-tree with lower error than T itself.
Source: www.ailab.si/blaz/predavanja/uisp/slides/uisp05-postpruning.ppt

Pre-pruning methods
Stop growing the tree at some point during top-down construction, when there is no longer sufficient data to make reliable decisions.
Approach: choose a threshold on the feature score and stop splitting if the best feature score is below the threshold.

<latexit sha1_base64="etvy7ia0eyqzalmlvign6e+f2vw=">aaahqhic3vxnbhmxehyldsx8txdk4ljfslebbvb8ckwqxawjs5eirztdrl7vjlfir1e2t21q7zvwnhcef+arochejrpe7co0/bubbjzw++3mfdsz31h2lhkmjed9wvi8cnwpdm35ev3gzvu376ys3n2nzayodknkuu1franncxqnmxz2ugverbx2o/hlwr97aeozmbw1kxrcqyyjgzbkjdp1v54egpgrjdzu5fs+7ubaz6jvwaed749xmfce2qbsm7yfoc829v3cunfex1n3wt504bogxyf1vk2d/ursjycwnboqgmqj1r22l5rqemuy5zdxg0xdsuiydkexh7buj0sadu3rtmccn5w/xgop3jmyplwejfkitj6iyeuwhentvsj4nq+xmcgz0likzqwktew0ydg2ehec4zgpoizphcbumvcupipiddfo1rksxb+nlfzn9bkmgoxbsslcs0lcgc+7rwhj48puqm4irdte6hfjqbce09spm+2xg7exueohakhyo32tpfbwiu5+5yisokrscjled23weud2sbspyw5tof3lqdhsgopbkasn3cih47xaoa03gg0m8rbcn7eihz0zykmim/6mt4g3tndt+w38evgovha87lawe2tghddooxhpylp4nkkw4lom3jxxs+2nxnfpxcuiy3ahwirjxdgz/kvo7dmsq7qeeayu/5y68np+rkg3eof0/fnczlwxknpjz/+ctuftuqftx9+l/5omjys1avfdwdzpemgufdfsfqfpitmjzgqz2lb++umunhj0oc35t9mimwjdiuffksffasx5unpcqlsm2qcvhbog67eet7w3j9a3t6v7yhndrw9qe7xru7snxqed1euuvucf0cf0ufax9rx2rfa9df1cqdj30nyq/fwfzw5eoa==</latexit> Determine chi-square threshold analytically Stop growing when chi-square feature score is not statistically significant Chi-square has known sampling distribution, can look up significance threshold Degrees of freedom= (#rows-1)(#cols-1) 2X2 table: 3.84 is 95% critical value X 2 = kx i=1 (o i e i ) 2 e i