# Decision Trees: Some Exercises



1. Decision Trees: Some exercises

2. Exemplifying how to compute information gains and how to work with decision stumps. CMU, 2013 fall, W. Cohen & E. Xing, Sample questions, pr. 4

3. Timmy wants to know how to do well on the ML exam. He collects old statistics and decides to use decision trees to build his model. He has 9 data points and two features: whether he stayed up late before the exam (S) and whether he attended all the classes (A). The statistics are as follows:

Set(all) = [5+,4-]
Set(S+) = [3+,2-], Set(S-) = [2+,2-]
Set(A+) = [5+,1-], Set(A-) = [0+,3-]

Suppose we are going to split first on the feature that gains the most information. Which feature should we choose? How much is the information gain? You may use approximate values for N log2 N where needed.

4. The two decision stumps:

S: [5+,4-] -> S+ [3+,2-] (H = H(2/5)), S- [2+,2-] (H = 1)
A: [5+,4-] -> A+ [5+,1-] (H = H(1/6)), A- [0+,3-] (H = 0)

H(all) = H[5+,4-] = H(5/9) = -(5/9) log2 (5/9) - (4/9) log2 (4/9) ≈ 0.991

H(all|S) = (5/9) H[3+,2-] + (4/9) H[2+,2-] = (5/9) H(2/5) + (4/9) · 1 ≈ (5/9)(0.971) + 0.444 ≈ 0.984
IG(all, S) = H(all) - H(all|S) ≈ 0.991 - 0.984 = 0.007

H(all|A) = (6/9) H[5+,1-] + (3/9) H[0+,3-] = (2/3) H(1/6) + 0 ≈ (2/3)(0.650) ≈ 0.433
IG(all, A) = H(all) - H(all|A) ≈ 0.991 - 0.433 = 0.558

IG(all, S) < IG(all, A), equivalently H(all|S) > H(all|A), so we split on A.
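The computation above can be checked mechanically. Below is a small Python sketch (the `entropy` and `info_gain` helpers are our own, not part of the original exercise) that recomputes the two information gains from the counts alone:

```python
from math import log2

def entropy(pos, neg):
    """Binary entropy of a [pos+, neg-] set, in bits."""
    total = pos + neg
    h = 0.0
    for k in (pos, neg):
        if k > 0:
            p = k / total
            h -= p * log2(p)
    return h

def info_gain(parent, splits):
    """parent and splits are (pos, neg) pairs; splits partition the parent set."""
    n = sum(p + q for p, q in splits)
    cond = sum((p + q) / n * entropy(p, q) for p, q in splits)
    return entropy(*parent) - cond

ig_S = info_gain((5, 4), [(3, 2), (2, 2)])   # stay-up-late split
ig_A = info_gain((5, 4), [(5, 1), (0, 3)])   # attendance split
print(round(ig_S, 3), round(ig_A, 3))
```

Running it confirms that attendance (A) is the far more informative first split.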

5. Exemplifying the application of the ID3 algorithm on a toy mushrooms dataset. CMU, 00(?) spring, Andrew Moore, midterm example questions, pr.

6. You are stranded on a deserted island. Mushrooms of various types grow widely all over the island, but no other food is anywhere to be found. Some of the mushrooms have been determined to be poisonous and others not (determined by your former companions' trial and error). You are the only one remaining on the island. You have the following data to consider:

Example NotHeavy Smelly Spotted Smooth Edible A B 0 0 C 0 0 D E 0 0 F 0 0 G H U 0? V 0? W 0 0?

You know whether or not mushrooms A through H are poisonous, but you do not know about U through W.

7. For questions a through d, consider only mushrooms A through H.
a. What is the entropy of Edible?
b. Which attribute should you choose as the root of a decision tree? Hint: You can figure this out by looking at the data without explicitly computing the information gain of all four attributes.
c. What is the information gain of the attribute you chose in the previous question?
d. Build an ID3 decision tree to classify mushrooms as poisonous or not.
e. Classify mushrooms U, V and W using the decision tree as poisonous or not poisonous.
f. If the mushrooms A through H that you know are not poisonous suddenly became scarce, should you consider trying U, V and W? Which one(s) and why? Or if none of them, then why not?

8. a. H_Edible = H[3+,5-] = -(3/8) log2 (3/8) - (5/8) log2 (5/8) = (3/8) log2 (8/3) + (5/8) log2 (8/5) ≈ 0.954

9. b. Each of NotHeavy, Smelly and Spotted splits the set [3+,5-] into subsets [1+,2-] and [2+,3-], while Smooth splits it into [2+,2-] (Smooth = 0) and [1+,3-] (Smooth = 1). Smooth is therefore the best candidate for the root (see d below).

10. c. H_0/Smooth = (4/8) H[2+,2-] + (4/8) H[1+,3-] = 1/2 + (1/2) H(1/4) = 1/2 + (1/2)(2 - (3/4) log2 3) ≈ 0.906
IG_0/Smooth = H_Edible - H_0/Smooth ≈ 0.954 - 0.906 = 0.048

11. d. H_0/NotHeavy = (3/8) H[1+,2-] + (5/8) H[2+,3-] = (3/8) H(1/3) + (5/8) H(2/5) = (5/8) log2 5 - 1/2 ≈ 0.951
IG_0/NotHeavy = H_Edible - H_0/NotHeavy ≈ 0.954 - 0.951 = 0.003
and IG_0/NotHeavy = IG_0/Smelly = IG_0/Spotted ≈ 0.003 < IG_0/Smooth ≈ 0.048
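As a quick numeric check, here is a short Python sketch (the helper names are ours) that compares the average conditional entropies on the [3+,5-] mushroom set:

```python
from math import log2

def entropy(pos, neg):
    """Binary entropy of a [pos+, neg-] set, in bits."""
    h = 0.0
    for k in (pos, neg):
        if k:
            p = k / (pos + neg)
            h -= p * log2(p)
    return h

def avg_cond_entropy(splits):
    """Weighted average entropy of a list of (pos, neg) subsets."""
    n = sum(p + q for p, q in splits)
    return sum((p + q) / n * entropy(p, q) for p, q in splits)

h_edible = entropy(3, 5)                        # ~0.954
h_smooth = avg_cond_entropy([(2, 2), (1, 3)])   # ~0.906
h_notheavy = avg_cond_entropy([(1, 2), (2, 3)]) # ~0.951 (same for Smelly, Spotted)
ig_smooth = h_edible - h_smooth
ig_notheavy = h_edible - h_notheavy
```

The comparison of the two conditional entropies alone already picks Smooth, exactly as the remark on the next slide points out.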

12. Important remark. Instead of actually computing these information gains, in order to determine the 'best' attribute it would have been enough to compare the average conditional entropies H_0/Smooth and H_0/NotHeavy:

IG_0/Smooth > IG_0/NotHeavy ⇔ H_0/Smooth < H_0/NotHeavy ⇔ 3/2 - (3/8) log2 3 < (5/8) log2 5 - 1/2 ⇔ 16 < 5 log2 5 + 3 log2 3 (true, since 5 log2 5 + 3 log2 3 ≈ 16.36)

Alternatively, using the formulas from the problem UAIC, 2017 fall, S. Ciobanu, L. Ciortuz, we can avoid logarithms altogether (not only here, but whenever the number of instances is not large): the last inequality is equivalent to 2^16 < 5^5 · 3^3, i.e., 65536 < 84375 (true).

13. Node 1: Smooth = 0, i.e., the subset [2+,2-]:

NotHeavy: [0+,1-] / [2+,1-]
Smelly: [2+,0-] / [0+,2-] (a perfect split)
Spotted: [1+,1-] / [1+,1-]

So Smelly is chosen for this node.

14. Node 2: Smooth = 1, i.e., the subset [1+,3-]:

NotHeavy: [1+,2-] / [0+,1-]
Smelly: [0+,3-] / [1+,0-] (a perfect split)
Spotted: [1+,1-] / [0+,2-]

So Smelly is chosen here too.

15. The resulting ID3 tree: the root tests Smooth on [3+,5-]; the Smooth = 0 branch [2+,2-] and the Smooth = 1 branch [1+,3-] each test Smelly, with leaves [2+,0-], [0+,2-] and, respectively, [0+,3-], [1+,0-]. Equivalently:

IF (Smooth = 0 AND Smelly = 0) OR (Smooth = 1 AND Smelly = 1) THEN Edible; ELSE NOT Edible;

Classification of test instances:
U: Smooth = 1, Smelly = 1 ⇒ Edible = 1
V: Smooth = 1, Smelly = 1 ⇒ Edible = 1
W: Smooth = 0, Smelly = 1 ⇒ Edible = 0

16. Exemplifying the greedy character of the ID3 algorithm. CMU, 2003 fall, T. Mitchell, A. Moore, midterm, pr. 9.a

17. Consider the binary input attributes A, B, C, the output attribute Y, and the following training examples:

A B C Y
0 0 1 0
0 1 1 1
1 0 1 1
1 1 0 0

a. Determine the decision tree computed by the ID3 algorithm. Is this decision tree consistent with the training data?

18. Answer. Node 0 (the root), on [2+,2-]:

A: [1+,1-] / [1+,1-]
B: [1+,1-] / [1+,1-]
C: [0+,1-] (C = 0) / [2+,1-] (C = 1)

One sees immediately that the first two decision stumps have IG = 0, while the third decision stump has IG > 0. Therefore, at node 0 (the root) we place the attribute C.

19. Node 1: we have to classify the instances with C = 1, so the choice is between the attributes A and B.

A: [1+,1-] (A = 0) / [1+,0-] (A = 1)
B: [1+,1-] (B = 0) / [1+,0-] (B = 1)

The two average conditional entropies are equal: H_1/A = H_1/B = (2/3) H[1+,1-] + (1/3) H[1+,0-]. So we may choose either of the two attributes. For definiteness, we choose A.

20. Node 2: the only attribute still available is B, so we place it here. The complete ID3 tree: C at the root; C = 0 -> leaf 0 ([0+,1-]); C = 1 -> A; A = 1 -> leaf 1 ([1+,0-]); A = 0 -> B; B = 0 -> leaf 0 ([0+,1-]); B = 1 -> leaf 1 ([1+,0-]).

By construction, the ID3 algorithm is consistent with the training data whenever the data is consistent (i.e., non-contradictory). In our case, one checks immediately that the training data is consistent.

21. b. Is there a decision tree of smaller depth (than that of the ID3 tree) consistent with the data above? If so, what (logical) concept does it represent?

Answer: From the data one sees that the output attribute Y is in fact the logical function A xor B. Representing this function as a decision tree, we obtain: A at the root; A = 0 -> B, with leaves 0 ([0+,1-]) for B = 0 and 1 ([1+,0-]) for B = 1; A = 1 -> B, with leaves 1 ([1+,0-]) for B = 0 and 0 ([0+,1-]) for B = 1.

This tree has one level fewer than the tree built by the ID3 algorithm. Therefore, the ID3 tree is not optimal with respect to the number of levels.

22. This is a consequence of the greedy character of the ID3 algorithm, which at each iteration chooses the 'best' attribute with respect to the information gain criterion. It is well known that greedy algorithms do not guarantee reaching the global optimum.
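The greedy step can be verified numerically. The four training examples below are reconstructed from the counts on the previous slides (an assumption, since the original table was damaged in transcription); the sketch checks that A and B have zero gain at the root while C does not, even though Y = A xor B:

```python
from math import log2

# reconstructed training set: (A, B, C, Y), with Y = A xor B
data = [(0, 0, 1, 0), (0, 1, 1, 1), (1, 0, 1, 1), (1, 1, 0, 0)]

def entropy(labels):
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / len(labels)
        h -= p * log2(p)
    return h

def info_gain(attr):
    """Information gain of splitting the root on attribute index attr (0=A, 1=B, 2=C)."""
    labels = [y for *xs, y in data]
    cond = 0.0
    for v in (0, 1):
        sub = [y for *xs, y in data if xs[attr] == v]
        if sub:
            cond += len(sub) / len(data) * entropy(sub)
    return entropy(labels) - cond

ig_a, ig_b, ig_c = info_gain(0), info_gain(1), info_gain(2)
```

So ID3 greedily puts the (irrelevant, in the xor sense) attribute C at the root, which is exactly why the learned tree ends up one level deeper than necessary.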

23. Exemplifying the application of the ID3 algorithm in the presence of both categorical and continuous attributes. CMU, 2012 fall, Eric Xing, Aarti Singh, HW, pr.

24. As of September 2012, 800 extrasolar planets have been identified in our galaxy. Supersecret surveying spaceships sent to all these planets have established whether they are habitable for humans or not, but sending a spaceship to each planet is expensive. In this problem, you will come up with decision trees to predict if a planet is habitable based only on features observable using telescopes.

a. In the nearby table you are given the data from all 800 planets surveyed so far. The features observed by telescope are Size ('Big' or 'Small') and Orbit ('Near' or 'Far'). Each row indicates the values of the features and habitability, and how many times that set of values was observed. So, for example, there were 120 Big planets Near their star that were habitable.

Size | Orbit | Habitable | Count
Big | Near | Yes | 120
Big | Far | Yes | 70
Small | Near | Yes | 139
Small | Far | Yes | 45
Big | Near | No | 130
Big | Far | No | 30
Small | Near | No | 111
Small | Far | No | 155

Derive and draw the decision tree learned by ID3 on this data (use the maximum information gain criterion for splits, don't do any pruning). Make sure to clearly mark at each node what attribute you are splitting on, and which value corresponds to which branch. By each leaf node of the tree, write in the number of habitable and inhabitable planets in the training data that belong to that node.

25. Answer: Level 1.

H(Habitable) = H(374/800) ≈ 0.997

Size: Big [190+,160-] (H(19/35) ≈ 0.995), Small [184+,266-] (H(92/225) ≈ 0.976)
H(Habitable|Size) = (350/800) H(19/35) + (450/800) H(92/225) ≈ 0.984
IG(Habitable; Size) ≈ 0.997 - 0.984 = 0.0128

Orbit: Near [259+,241-] (H(259/500) ≈ 0.999), Far [115+,185-] (H(23/60) ≈ 0.960)
H(Habitable|Orbit) = (500/800) H(259/500) + (300/800) H(23/60) ≈ 0.985
IG(Habitable; Orbit) ≈ 0.997 - 0.985 = 0.0124

Since IG(Habitable; Size) > IG(Habitable; Orbit), the root splits on Size.
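The two gains are close, so it is worth verifying them numerically. The sketch below (our own helper code) recomputes both from the eight counts of the table:

```python
from math import log2

def entropy(pos, neg):
    """Binary entropy of a [pos+, neg-] set, in bits."""
    h, n = 0.0, pos + neg
    for k in (pos, neg):
        if k:
            h -= k / n * log2(k / n)
    return h

def info_gain(parent, splits):
    """parent and splits are (pos, neg) pairs; splits partition the parent set."""
    n = sum(p + q for p, q in splits)
    cond = sum((p + q) / n * entropy(p, q) for p, q in splits)
    return entropy(*parent) - cond

# habitable/inhabitable counts aggregated from the 800-planet table
ig_size = info_gain((374, 426), [(190, 160), (184, 266)])   # Big / Small
ig_orbit = info_gain((374, 426), [(259, 241), (115, 185)])  # Near / Far
```

Both gains are tiny (about 0.013 bits), but Size wins, which is what puts it at the root.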

26. The final decision tree: Size at the root; each branch then splits on Orbit (the only remaining attribute):

Big: Near [120+,130-] -> not habitable; Far [70+,30-] -> habitable
Small: Near [139+,111-] -> habitable; Far [45+,155-] -> not habitable

27. b. For just 9 of the planets, a third feature, Temperature (in Kelvin degrees), has been measured, as shown in the nearby table. Redo all the steps from part a on this data using all three features. For the Temperature feature, in each iteration you must maximize over all possible binary thresholding splits (such as T ≤ 250 vs. T > 250, for example).

Size | Orbit | Temperature | Habitable
Big | Far | 205 | No
Big | Near | 205 | No
Big | Near | 260 | Yes
Big | Near | 380 | Yes
Small | Far | 205 | No
Small | Far | 260 | Yes
Small | Near | 260 | Yes
Small | Near | 380 | No
Small | Near | 380 | No

According to your decision tree, would a planet with the features (Big, Near, 280) be predicted to be habitable or not habitable?

Hint: You might need to use the following values of the entropy function for a Bernoulli variable of parameter p: H(1/3) = 0.9183, H(2/5) = 0.9710, H(4/9) = 0.9911.

28. Answer. Binary threshold splits for the continuous attribute Temperature: the candidate thresholds are 232.5 (halfway between 205 and 260) and 320 (halfway between 260 and 380).

29. Answer: Level 1. The root set is [4+,5-], with H(4/9) ≈ 0.991.

Size: Big [2+,2-] (H = 1), Small [2+,3-] (H(2/5))
H(Habitable|Size) = (4/9) · 1 + (5/9) H(2/5) ≈ 0.984
Orbit: Far [1+,2-] (H(1/3)), Near [3+,3-] (H = 1)
H(Habitable|Orbit) = (3/9) H(1/3) + (6/9) · 1 ≈ 0.973
T ≤ 232.5: Yes [0+,3-] (H = 0), No [4+,2-] (H(1/3))
H(Habitable|T ≤ 232.5) = (6/9) H(1/3) ≈ 0.612, so IG(Habitable; T ≤ 232.5) ≈ 0.991 - 0.612 = 0.379
T ≤ 320: Yes [3+,3-] (H = 1), No [1+,2-] (H(1/3))
H(Habitable|T ≤ 320) = (6/9) · 1 + (3/9) H(1/3) ≈ 0.973

The threshold split T ≤ 232.5 yields the greatest information gain, so it is placed at the root.

30. Answer: Level 2. The remaining set is [4+,2-] (the planets with T > 232.5), with H(1/3) ≈ 0.918.

Size: Big [2+,0-] (H = 0), Small [2+,2-] (H = 1)
Orbit: Far [1+,0-] (H = 0), Near [3+,2-] (H(2/5))
T ≤ 320: Yes [3+,0-] (H = 0), No [1+,2-] (H(1/3))

H(Habitable|T ≤ 320) = (3/6) H(1/3) ≈ 0.459 is the smallest average conditional entropy, so T ≤ 320 is chosen.

Note: For some of the comparisons above, both the specific conditional entropies and their coefficients (weights) in the average conditional entropies satisfy the indicated relationship (for example, H(2/5) > H(1/3) and 5/6 > 3/6). For others, only the specific conditional entropies do (for example, H(1/2) = 1 > H(2/5) but 4/6 < 5/6).

31. The final decision tree:

[4+,5-] T ≤ 232.5? Yes -> [0+,3-] not habitable; No -> [4+,2-] T ≤ 320? Yes -> [3+,0-] habitable; No -> [1+,2-] Size? Big -> [1+,0-] habitable; Small -> [0+,2-] not habitable.

c. According to your decision tree, would a planet with the features (Big, Near, 280) be predicted to be habitable or not habitable? Answer: habitable (280 > 232.5 and 280 ≤ 320).
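A compact way to double-check part b is to score every candidate first split and then run the planet (Big, Near, 280) through the tree from the slide. A sketch (the split tests and the tree are hard-coded by us from the slides):

```python
from math import log2

planets = [  # (size, orbit, temperature, habitable) for the 9 surveyed planets
    ("Big", "Far", 205, 0), ("Big", "Near", 205, 0), ("Big", "Near", 260, 1),
    ("Big", "Near", 380, 1), ("Small", "Far", 205, 0), ("Small", "Far", 260, 1),
    ("Small", "Near", 260, 1), ("Small", "Near", 380, 0), ("Small", "Near", 380, 0),
]

def entropy(labels):
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / len(labels)
        h -= p * log2(p)
    return h

def info_gain(test):
    yes = [h for *f, h in planets if test(f)]
    no = [h for *f, h in planets if not test(f)]
    cond = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(planets)
    return entropy([h for *f, h in planets]) - cond

candidates = {
    "Size=Big": lambda f: f[0] == "Big",
    "Orbit=Near": lambda f: f[1] == "Near",
    "T<=232.5": lambda f: f[2] <= 232.5,
    "T<=320": lambda f: f[2] <= 320,
}
root = max(candidates, key=lambda name: info_gain(candidates[name]))

def predict(size, orbit, temp):
    """The tree learned on the slide: T<=232.5, then T<=320, then Size."""
    if temp <= 232.5:
        return 0
    if temp <= 320:
        return 1
    return 1 if size == "Big" else 0
```

The temperature threshold dominates by a wide margin (IG ≈ 0.38 versus at most ≈ 0.02 for the categorical splits), and the query planet lands in the habitable leaf.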

32. Exemplifying the application of the ID3 algorithm on continuous attributes, and in the presence of noise. Decision surfaces; decision boundaries. The computation of the LOOCV error. CMU, 00 fall, Andrew Moore, midterm, pr. 3

33. Suppose we are learning a classifier with binary output values Y = 0 and Y = 1. There is one real-valued input X. The training data is given in the nearby table:

X: 1 2 3 4 6 7 8 8.5 9 10
Y: 0 0 0 0 1 1 1 0 1 1

Assume that we learn a decision tree on this data. Assume that when the decision tree splits on the real-valued attribute X, it puts the split threshold halfway between the attributes that surround the split. For example, using information gain as the splitting criterion, the decision tree would initially choose to split at X = 5, which is halfway between the X = 4 and X = 6 datapoints.

Let Algorithm DT2 be the method of learning a decision tree with only two leaf nodes (i.e., only one split). Let Algorithm ID3 be the method of learning a decision tree fully, with no pruning.

a. What will be the training set error for DT2 and, respectively, ID3 on our data?
b. What will be the leave-one-out cross-validation (LOOCV) error for DT2 and, respectively, ID3 on our data?

34. Training data: the negative examples are X ∈ {1, 2, 3, 4, 8.5} and the positive ones X ∈ {6, 7, 8, 9, 10}. Discretization / decision thresholds: 5, 8.25 and 8.75.

The ID3 tree: [5-,5+] X < 5? Yes -> leaf 0 [4-,0+]; No -> [1-,5+] X < 8.25? Yes -> leaf 1 [0-,3+]; No -> [1-,2+] X < 8.75? Yes -> leaf 0 [1-,0+]; No -> leaf 1 [0-,2+].

Decision 'surfaces': 0 on (-inf, 5), 1 on (5, 8.25), 0 on (8.25, 8.75), 1 on (8.75, +inf).

35. ID3: IG computations.

Level 0: [5-,5+]
X < 5: [4-,0+] / [1-,5+] (the best split)
X < 8.25: [4-,3+] / [1-,2+]
X < 8.75: [5-,3+] / [0-,2+]

Level 1: [1-,5+]
X < 8.25: [0-,3+] / [1-,2+], IG ≈ 0.19 (the best split)
X < 8.75: [1-,3+] / [0-,2+], IG ≈ 0.11

36. ID3 (the full tree), LOOCV: leaving out one training instance at a time and re-learning the tree, the test instance is misclassified for X = 8 (the interval (7.75, 8.75) becomes negative), for X = 8.5 (the learned tree reduces to the single split X < 5) and for X = 9 (the interval (8.25, 9.25) becomes negative); the instances X = 1, 2, 3, 4, 6, 7 and 10 are classified correctly. LOOCV error: 3/10.

37. DT2: [5-,5+] X < 5? Yes -> leaf 0 [4-,0+]; No -> leaf 1 [1-,5+].

Decision 'surfaces': - for X < 5, + for X > 5. Training error: 1/10 (only X = 8.5 is misclassified).

38. DT2, LOOCV: IG computations.

Case 1: leaving out X ∈ {1, 2, 3, 4}. On the remaining data [3-,5+] (with threshold 4.5 instead of 5 when X = 4 is left out), the split X < 5 (respectively X < 4.5) beats X < 8.25 and X < 8.75; the left-out instance is classified 0, which is correct.

Case 2: leaving out X ∈ {6, 7, 8}. On the remaining data [5-,4+] (with thresholds 5.5 or 7.75 appearing, depending on the left-out point), the X < 5-type split again wins; the left-out instance is classified 1, which is correct.

39. DT2, LOOCV: IG computations (continued).

Case 3: leaving out X = 8.5. The remaining data [4-,5+] is perfectly separated by X < 5; the left-out instance is classified 1, while its true label is 0: an error.

Case 4: leaving out X ∈ {9, 10}. On the remaining data [5-,4+] (with threshold 9.25 instead of 8.75 when X = 9 is left out), X < 5 wins again; the left-out instance is classified 1, which is correct.

LOOCV error for DT2: 1/10.
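The DT2 half of the exercise is easy to automate. The sketch below (our own helper code; thresholds are taken between all consecutive points, and ties in information gain are resolved by the first maximum) reproduces both the training error and the LOOCV error of the one-split learner:

```python
from math import log2

X = [1, 2, 3, 4, 6, 7, 8, 8.5, 9, 10]
Y = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]

def entropy(labels):
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / len(labels)
        h -= p * log2(p)
    return h

def best_stump(xs, ys):
    """Return (threshold, left_label, right_label) of the max-IG single split."""
    pairs = sorted(zip(xs, ys))
    thresholds = [(a + b) / 2 for (a, _), (b, _) in zip(pairs, pairs[1:])]
    best = None
    for s in thresholds:
        left = [y for x, y in pairs if x < s]
        right = [y for x, y in pairs if x >= s]
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        ig = entropy(ys) - cond
        if best is None or ig > best[0]:
            majority = lambda ls: max(set(ls), key=ls.count)
            best = (ig, s, majority(left), majority(right))
    return best[1:]

def stump_predict(stump, x):
    s, left, right = stump
    return left if x < s else right

# training error of DT2 (one split) on the full data
stump = best_stump(X, Y)
train_err = sum(stump_predict(stump, x) != y for x, y in zip(X, Y)) / len(X)

# leave-one-out cross-validation error of DT2
loocv_err = 0
for i in range(len(X)):
    xs, ys = X[:i] + X[i + 1:], Y[:i] + Y[i + 1:]
    loocv_err += stump_predict(best_stump(xs, ys), X[i]) != Y[i]
loocv_err /= len(X)
```

Only the noisy point X = 8.5 is ever misclassified by DT2, in training as well as under LOOCV.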

40. Applying ID3 on a dataset with two continuous attributes: decision zones. Liviu Ciortuz, 2017

41. Consider the training dataset in the nearby figure, with two continuous attributes X1 and X2. Apply the ID3 algorithm on this dataset and draw the resulting decision tree. Make a graphical representation of the decision areas and decision boundaries determined by ID3.

42. Solution. Level 1: the whole dataset is [4+,5-], with H(4/9) ≈ 0.991. The candidate splits and the partitions they induce:

X1 < 5/2: [1+,0-] / [3+,5-]
X1 < 9/2: [1+,4-] / [3+,1-]
X2 < 3/2: [1+,1-] / [3+,4-], average conditional entropy (2/9) + (7/9) H(3/7)
X2 < 5/2: [3+,2-] / [1+,3-], average conditional entropy (5/9) H(2/5) + (4/9) H(1/4)
X2 < 7/2: [4+,2-] / [0+,3-], average conditional entropy (6/9) H(1/3) ≈ 0.612

The split X2 < 7/2 attains the maximal information gain, IG ≈ 0.991 - 0.612 = 0.379 (the [0+,3-] branch is pure), so it is placed at the root.

43. Level 2: the node [4+,2-] (the instances with X2 < 7/2), with H(1/3) ≈ 0.918. The candidate splits:

X1 < 5/2: [2+,0-] / [2+,2-], IG = 0.918 - (4/6) · 1 ≈ 0.25
X1 < 4: [2+,2-] / [2+,0-], IG ≈ 0.25
X2 < 3/2: [1+,1-] / [3+,1-], average conditional entropy (2/6) + (4/6) H(1/4), IG ≈ 0.04
X2 < 5/2: [3+,2-] / [1+,0-], average conditional entropy (5/6) H(2/5), IG ≈ 0.11

Notes:
1. Split thresholds for continuous attributes must be recomputed at each new iteration, because they may change. (For instance, here above, 4 replaces 4.5 as a threshold for X1.)
2. At the current stage, i.e., for the current node in the ID3 tree, you may choose (as test) either X1 < 5/2 or X1 < 4.
3. Here above we have an example of a reverse relationship between the weighted and, respectively, un-weighted specific entropies: H[2+,2-] > H[3+,2-] but (4/6) H[2+,2-] < (5/6) H[3+,2-].

44. The final decision tree:

[4+,5-] X2 < 7/2? No -> [0+,3-] leaf -; Yes -> [4+,2-] X1 < 5/2? Yes -> [2+,0-] leaf +; No -> [2+,2-] X1 < 4? Yes -> [2+,0-] leaf +; No -> [0+,2-] leaf -.

The corresponding decision areas in the (X1, X2) plane are delimited by the lines X2 = 7/2, X1 = 5/2 and X1 = 4.

45. Exemplifying pre- and post-pruning of decision trees using a threshold for the information gain. CMU, 2006 spring, Carlos Guestrin, midterm, pr. 4 [adapted by Liviu Ciortuz]

46. Starting from the data in the following table (five training examples over the binary attributes V, W, X, with output Y), the ID3 algorithm builds the decision tree shown nearby: X at the root, with the X = 1 branch leading to a positive leaf and the X = 0 branch leading to a subtree that tests V and then, on each of V's branches, W.

a. One idea for pruning such a decision tree would be to start at the root, and prune splits for which the information gain (or some other criterion) is less than some small ε. This is called top-down pruning. What is the decision tree returned for ε = 0.0001? What is the training set error for this tree?

47. Answer. We will first augment the given decision tree with information regarding the data partitions (i.e., the numbers of positive and, respectively, negative instances) assigned to each test node during the application of the ID3 algorithm: the root receives [3+,2-], the X = 1 branch [1+,0-], and the X = 0 branch [2+,2-], which V splits into [1+,1-] and [1+,1-].

The information gain yielded by the attribute X in the root node is:
IG(Y; X) = H[3+,2-] - ((1/5) · 0 + (4/5) · 1) = 0.971 - 0.8 = 0.171 > ε.
Therefore, this node will not be eliminated from the tree.

The information gain for the attribute V (in the left-hand side child of the root node) is:
IG(Y; V) = H[2+,2-] - ((2/4) H[1+,1-] + (2/4) H[1+,1-]) = 1 - 1 = 0 < ε.
So the whole left subtree will be cut off and replaced by a decision node, as shown nearby. The training error produced by this tree is 2/5.

48. b. Another option would be to start at the leaves, and prune subtrees for which the information gain (or some other criterion) of a split is less than some small ε. In this method, no ancestors of children with high information gain will get pruned. This is called bottom-up pruning. What is the tree returned for ε = 0.0001? What is the training set error for this tree?

Answer: The information gain of V is IG(Y;V) = 0. A step later, the information gain of W (for either one of the descendant nodes of V) is IG(Y;W) = 1. So bottom-up pruning won't delete any nodes and the tree [given in the problem statement] remains unchanged. The training error is 0.

49. c. Discuss when you would want to choose bottom-up pruning over top-down pruning and vice versa.

Answer: Top-down pruning is computationally cheaper. When building the tree we can determine when to stop (no need for real pruning). But, as we saw, top-down pruning prunes too much. On the other hand, bottom-up pruning is more expensive since we have to first build a full tree, which can be exponentially large, and only then apply pruning. The second problem with bottom-up pruning is that superfluous attributes may fool it (see CMU, 2009 fall, Carlos Guestrin, HW, pr..4). The third problem with it is that in the lower levels of the tree the number of examples in the subtree gets smaller, so information gain might be an inappropriate criterion for pruning; one would usually use a statistical test instead.

50. Exemplifying χ²-Based Pruning of Decision Trees. CMU, 00 fall, Ziv Bar-Joseph, HW, pr.

51. In class, we learned a decision tree pruning algorithm that iteratively visited subtrees and used a validation dataset to decide whether to remove the subtree. However, sometimes it is desirable to prune the tree after training on all of the available data. One such approach is based on statistical hypothesis testing. After learning the tree, we visit each internal node and test whether the attribute split at that node is actually uncorrelated with the class labels. We hypothesize that the attribute is independent and then use Pearson's chi-square test to generate a test statistic that may provide evidence that we should reject this null hypothesis. If we fail to reject the hypothesis, we prune the subtree at that node.

52. a. At each internal node we can create a contingency table for the training examples that pass through that node on their paths to the leaves. The table will have the c class labels associated with the columns and the r values of the split attribute associated with the rows. Each entry O_i,j in the table is the number of times we observe a training sample with that attribute value and label, where i is the row index that corresponds to an attribute value and j is the column index that corresponds to a class label. In order to calculate the chi-square test statistic, we need a similar table of expected counts. The expected count is the number of observations we would expect if the class and attribute are independent. Derive a formula for each expected count E_i,j in the table. Hint: What is the probability that a training example that passes through the node has a particular label? Using this probability and the independence assumption, what can you say about how many examples with a specific attribute value are expected to also have the class label?

53. b. Given these two tables for the split, you can now calculate the chi-square test statistic

χ² = Σ_{i=1..r} Σ_{j=1..c} (O_i,j - E_i,j)² / E_i,j

with (r - 1)(c - 1) degrees of freedom. You can plug the test statistic and degrees of freedom into a software package (e.g., 1-chi2cdf(x,df) in MATLAB or CHIDIST(x,df) in Excel) or an online calculator to obtain a p-value. Typically, if p < 0.05 we reject the null hypothesis that the attribute and class are independent and say the split is statistically significant.

The decision tree given on the next slide was built from the data in the table nearby. For each of the 3 internal nodes in the decision tree, show the p-value for the split and state whether it is statistically significant. How many internal nodes will the tree have if we prune splits with p ≥ 0.05?

54. Input: 12 training examples over the binary attributes X1, X2, X3, X4, with the overall class counts [4-,8+]. The ID3 tree: X4 at the root; X4 = 1 -> [0-,6+], leaf 1; X4 = 0 -> [4-,2+], split by X1 into [3-,0+] (X1 = 0, leaf 0) and [1-,2+] (X1 = 1), which X2 splits into [1-,0+] (leaf 0) and [0-,2+] (leaf 1).

55. Idea: while traversing the ID3 tree [usually in a bottom-up manner], remove the nodes for which there is not enough ('significant') statistical evidence of a dependence between the values of the input attribute tested in that node and the values of the output attribute (the labels), supported by the set of instances assigned to that node.

56. Contingency tables (N = 12):

O_X4: X4 = 0: Class 0 = 4, Class 1 = 2; X4 = 1: Class 0 = 0, Class 1 = 6
P(X4 = 0) = 6/12 = 1/2, P(X4 = 1) = 1/2; P(Class = 0) = 4/12 = 1/3, P(Class = 1) = 2/3

O_X1 | X4 = 0 (N = 6): X1 = 0: Class 0 = 3, Class 1 = 0; X1 = 1: Class 0 = 1, Class 1 = 2
P(X1 = 0 | X4 = 0) = 3/6 = 1/2, P(X1 = 1 | X4 = 0) = 1/2; P(Class = 0 | X4 = 0) = 4/6 = 2/3, P(Class = 1 | X4 = 0) = 1/3

O_X2 | X4 = 0, X1 = 1 (N = 3): X2 = 0: Class 0 = 1, Class 1 = 0; X2 = 1: Class 0 = 0, Class 1 = 2
P(X2 = 0 | X4 = 0, X1 = 1) = 1/3, P(X2 = 1 | X4 = 0, X1 = 1) = 2/3; P(Class = 0 | X4 = 0, X1 = 1) = 1/3, P(Class = 1 | X4 = 0, X1 = 1) = 2/3

57. The reasoning that leads to the computation of the expected number of observations:

P(A = i) = (Σ_{k=1..c} O_i,k) / N and P(C = j) = (Σ_{k=1..r} O_k,j) / N
P(A = i, C = j) = P(A = i) · P(C = j)  (by the independence assumption)  = (Σ_k O_i,k)(Σ_k O_k,j) / N²
E[A = i, C = j] = N · P(A = i, C = j) = (row total i) · (column total j) / N

58. Expected numbers of observations:

E_X4: X4 = 0: [2, 4]; X4 = 1: [2, 4]
E_X1 | X4 = 0: X1 = 0: [2, 1]; X1 = 1: [2, 1]
E_X2 | X4 = 0, X1 = 1: X2 = 0: [1/3, 2/3]; X2 = 1: [2/3, 4/3]

For example, E_X4(X4 = 0, Class = 0): N = 12, P(X4 = 0) = 1/2 and P(Class = 0) = 1/3, so N · P(X4 = 0) · P(Class = 0) = 12 · (1/2) · (1/3) = 2.

59. χ² statistics: χ² = Σ_{i=1..r} Σ_{j=1..c} (O_i,j - E_i,j)² / E_i,j

χ²_X4 = (4 - 2)²/2 + (2 - 4)²/4 + (0 - 2)²/2 + (6 - 4)²/4 = 2 + 1 + 2 + 1 = 6
χ²_X1 | X4=0 = (3 - 2)²/2 + (0 - 1)²/1 + (1 - 2)²/2 + (2 - 1)²/1 = 3
χ²_X2 | X4=0, X1=1 = (1 - 1/3)²/(1/3) + (0 - 2/3)²/(2/3) + (0 - 2/3)²/(2/3) + (2 - 4/3)²/(4/3) = 3

p-values (1 degree of freedom each): 0.0143, 0.0833 and, respectively, 0.0833. The first of these p-values is smaller than 0.05, therefore the root node (X4) cannot be pruned; the other two splits are not statistically significant, so they are pruned.
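These statistics and p-values can be reproduced with a short pure-Python sketch (for 2x2 tables, i.e., one degree of freedom, the chi-square survival function reduces to erfc(sqrt(x/2)), so no statistics library is needed):

```python
from math import erfc, sqrt

def chi2_stat(table):
    """Pearson chi-square statistic for an r x c contingency table."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_tot[i] * col_tot[j] / n   # expected count under independence
            chi2 += (o - e) ** 2 / e
    return chi2

def p_value_1dof(chi2):
    """Survival function of the chi-square distribution with 1 degree of freedom."""
    return erfc(sqrt(chi2 / 2))

# contingency tables from the slides: rows = attribute values, cols = class 0 / class 1
chi_x4 = chi2_stat([[4, 2], [0, 6]])   # X4 at the root
chi_x1 = chi2_stat([[3, 0], [1, 2]])   # X1 under X4 = 0
chi_x2 = chi2_stat([[1, 0], [0, 2]])   # X2 under X4 = 0, X1 = 1
```

Only the root split survives the 0.05 significance threshold.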

60. [Figure: the p-value as a function of Pearson's cumulative test statistic χ², plotted for several numbers k of degrees of freedom.]

61. Output (pruned tree) for the 95% confidence level: only the root test X4 remains, with X4 = 0 -> leaf 0 and X4 = 1 -> leaf 1.

62. The AdaBoost algorithm: the convergence, in certain conditions, to 0 of the training error. CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr..-5; CMU, 2009 fall, Carlos Guestrin, HW, pr. 3.; CMU, 2009 fall, Eric Xing, HW3, pr. 4..

63. Consider m training examples S = {(x_1,y_1),...,(x_m,y_m)}, where x ∈ X and y ∈ {-1,1}. Suppose we have a weak learning algorithm A which produces a hypothesis h : X -> {-1,1} given any distribution D of examples. AdaBoost is an iterative algorithm which works as follows:

Begin with a uniform distribution D_1(i) = 1/m, i = 1,...,m.

At each round t = 1,...,T, run the weak learning algorithm A on the distribution D_t and produce the hypothesis h_t. Note: since A is a weak learning algorithm, the produced hypothesis h_t at round t is only slightly better than random guessing, say, by a margin γ_t: ε_t = err_{D_t}(h_t) = Pr_{x~D_t}[y ≠ h_t(x)] = 1/2 - γ_t. [If at a certain iteration t < T the weak classifier A cannot produce a hypothesis better than random guessing (i.e., γ_t = 0), or it produces a hypothesis for which ε_t = 0, then the AdaBoost algorithm should be stopped.]

Update D_{t+1}(i) = (1/Z_t) D_t(i) e^{-α_t y_i h_t(x_i)} for i = 1,...,m, where α_t = (1/2) ln((1 - ε_t)/ε_t) and Z_t is the normalizer.

In the end, deliver H = sign(Σ_{t=1..T} α_t h_t) as the learned hypothesis, which will act as a weighted majority vote.

64. We will prove that the training error err_S(H) of AdaBoost decreases at a very fast rate, and in certain cases it converges to 0.

Important remark: The above formulation of the AdaBoost algorithm states no restriction on the hypothesis h_t delivered by the weak classifier A at iteration t, except that ε_t < 1/2. However, in another formulation of the AdaBoost algorithm (in a more general setup; see for instance MIT, 2006 fall, Tommi Jaakkola, HW4, problem 3), it is requested / recommended that the hypothesis h_t be chosen by (approximately) minimizing the weighted training error over a whole class of hypotheses like, for instance, decision trees of depth 1 (decision stumps). In this problem we will not be concerned with such a request, but we will comply with it, for instance, in problem CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 6, when showing how AdaBoost works in practice.

65. a. Show that the [particular] choice of α_t = (1/2) ln((1 - ε_t)/ε_t) implies err_{D_{t+1}}(h_t) = 1/2.

b. Show that D_{T+1}(i) = (1/m) · e^{-y_i f(x_i)} / Π_{t=1..T} Z_t, where f(x) = Σ_{t=1..T} α_t h_t(x).

c. Show that err_S(H) ≤ Π_{t=1..T} Z_t, where err_S(H) = (1/m) Σ_{i=1..m} 1{H(x_i) ≠ y_i} is the training error produced by AdaBoost.

d. Obviously, we would like to minimize the test set error produced by AdaBoost, but it is hard to do so directly. We thus settle for greedily optimizing the upper bound on the training error found at part c. Observe that Z_1,...,Z_{t-1} are determined by the first t - 1 iterations, and we cannot change them at iteration t. A greedy step we can take to minimize the training set error bound on round t is to minimize Z_t. Prove that the value of α_t that minimizes Z_t (among all possible values for α_t) is indeed α_t = (1/2) ln((1 - ε_t)/ε_t) (see the previous slide).

e. Show that Π_{t=1..T} Z_t ≤ e^{-2 Σ_{t=1..T} γ_t²}.

f. From parts c and e, we know the training error decreases at an exponential rate with respect to T. Assume that there is a number γ > 0 such that γ ≤ γ_t for t = 1,...,T. (This is called empirical γ-weak learnability.) How many rounds are needed to achieve a training error ε > 0? Please express in big-O notation, T = O(·).

66. Solution.

a. If we define the correct set C = {i : y_i h_t(x_i) ≥ 0} and the mistake set M = {i : y_i h_t(x_i) < 0}, we can write:

Z_t = Σ_{i=1..m} D_t(i) e^{-α_t y_i h_t(x_i)} = Σ_{i∈C} D_t(i) e^{-α_t} + Σ_{i∈M} D_t(i) e^{α_t} = (1 - ε_t) e^{-α_t} + ε_t e^{α_t}    (1)

With α_t = (1/2) ln((1 - ε_t)/ε_t) we have e^{α_t} = sqrt((1 - ε_t)/ε_t) and e^{-α_t} = sqrt(ε_t/(1 - ε_t)), hence

Z_t = (1 - ε_t) sqrt(ε_t/(1 - ε_t)) + ε_t sqrt((1 - ε_t)/ε_t) = 2 sqrt(ε_t (1 - ε_t))    (2)

Therefore:
err_{D_{t+1}}(h_t) = Σ_i D_{t+1}(i) 1{y_i ≠ h_t(x_i)} = (1/Z_t) Σ_{i∈M} D_t(i) e^{α_t} = ε_t e^{α_t} / Z_t = ε_t sqrt((1 - ε_t)/ε_t) / (2 sqrt(ε_t (1 - ε_t))) = 1/2.

Note that (1 - ε_t)/ε_t > 0 because ε_t ∈ (0, 1/2).

67. b. We expand D_t(i) recursively:

D_{T+1}(i) = (1/Z_T) D_T(i) e^{-α_T y_i h_T(x_i)} = D_{T-1}(i) e^{-α_{T-1} y_i h_{T-1}(x_i)} e^{-α_T y_i h_T(x_i)} / (Z_{T-1} Z_T) = ... = D_1(i) e^{-Σ_{t=1..T} α_t y_i h_t(x_i)} / Π_{t=1..T} Z_t = (1/m) e^{-y_i f(x_i)} / Π_{t=1..T} Z_t.

c. We use the fact that the exponential loss upper bounds the 0-1 loss, i.e., 1{x < 0} ≤ e^{-x}:

err_S(H) = (1/m) Σ_{i=1..m} 1{y_i f(x_i) < 0} ≤ (1/m) Σ_{i=1..m} e^{-y_i f(x_i)} = (1/m) Σ_{i=1..m} m D_{T+1}(i) Π_{t=1..T} Z_t = (Σ_{i=1..m} D_{T+1}(i)) · Π_{t=1..T} Z_t = Π_{t=1..T} Z_t.

68. d. We start from the equation Z_t = ε_t e^{α_t} + (1 - ε_t) e^{-α_t}, which has been proven at part a (relation (1)). Note that ε_t (the error produced by h_t, the hypothesis produced by the weak classifier A at the current step) is a constant with respect to α_t. Then we proceed as usual, computing the partial derivative w.r.t. α_t:

∂Z_t/∂α_t = 0 ⇔ ε_t e^{α_t} - (1 - ε_t) e^{-α_t} = 0 ⇔ ε_t (e^{α_t})² = 1 - ε_t ⇔ e^{2α_t} = (1 - ε_t)/ε_t ⇔ α_t = (1/2) ln((1 - ε_t)/ε_t).

Note that (1 - ε_t)/ε_t > 1 (and therefore α_t > 0) because ε_t ∈ (0, 1/2). It can also be immediately checked that this α_t is indeed a minimizer of ε_t e^{α_t} + (1 - ε_t) e^{-α_t}, and therefore of Z_t too, since the second derivative, ε_t e^{α_t} + (1 - ε_t) e^{-α_t}, is positive.

69. e. Making use of relation (2) proven at part a, and of the fact that 1 + x ≤ e^x for all x ∈ R, we can write:

Π_{t=1..T} Z_t = Π_{t=1..T} 2 sqrt(ε_t (1 - ε_t)) = Π_{t=1..T} sqrt((1 - 2γ_t)(1 + 2γ_t)) = Π_{t=1..T} sqrt(1 - 4γ_t²) ≤ Π_{t=1..T} sqrt(e^{-4γ_t²}) = Π_{t=1..T} e^{-2γ_t²} = e^{-2 Σ_{t=1..T} γ_t²}.

f. From the results obtained at parts c and e, we get: err_S(H) ≤ e^{-2 Σ_t γ_t²} ≤ e^{-2Tγ²}. Therefore err_S(H) < ε if e^{-2Tγ²} < ε ⇔ 2Tγ² > ln(1/ε) ⇔ T > (1/(2γ²)) ln(1/ε). Hence we need T = O((1/γ²) ln(1/ε)).

Note: It follows that err_S(H) -> 0 as T -> infinity.
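The identities in parts a, d and e are easy to sanity-check numerically; the sketch below evaluates them for a few error rates (our own illustration, not part of the original proof):

```python
from math import exp, log, sqrt

def round_quantities(eps):
    """alpha_t, Z_t and the re-weighted error of h_t for a given eps_t."""
    alpha = 0.5 * log((1 - eps) / eps)
    Z = eps * exp(alpha) + (1 - eps) * exp(-alpha)  # relation (1)
    err_next = eps * exp(alpha) / Z                 # err_{D_{t+1}}(h_t)
    return alpha, Z, err_next

for eps in (0.05, 0.2, 0.35, 0.49):
    gamma = 0.5 - eps
    alpha, Z, err_next = round_quantities(eps)
    assert alpha > 0                                     # part d: alpha_t > 0
    assert abs(Z - 2 * sqrt(eps * (1 - eps))) < 1e-9     # relation (2)
    assert abs(err_next - 0.5) < 1e-9                    # part a: error becomes 1/2
    assert Z <= exp(-2 * gamma ** 2) + 1e-9              # part e, factor by factor
```

Each normalizer Z_t is strictly below 1 whenever γ_t > 0, which is exactly why the product (and hence the training error bound) shrinks exponentially.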

70. Exemplifying the application of the AdaBoost algorithm. CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr. 6

71. Consider the training dataset in the nearby figure (nine instances x_1,...,x_9 in the (X1, X2) plane, labeled + or -). Run T = 3 iterations of AdaBoost with decision stumps (axis-aligned separators) as the base learners. Illustrate the learned weak hypotheses h_t in this figure and fill in the table given below, with one row for each t = 1, 2, 3: ε_t, α_t, D_t(1),...,D_t(9), and err_S(H).

(For the pseudo-code of the AdaBoost algorithm, see CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr..-5. Please read the important remark that follows that pseudo-code!)

Note: The goal of this exercise is to help you understand how AdaBoost works in practice. It is advisable that, after understanding this exercise, you implement a program / function that calculates the weighted training error produced by a given decision stump w.r.t. a certain probabilistic distribution (D) defined on the training dataset. Later on you can extend this program to a full-fledged implementation of AdaBoost.

72. Solution. Unlike the graphical representation that we used until now for decision stumps (as trees of depth 1), here we will work with the following analytical representation: for a continuous attribute X taking values x ∈ R and for any threshold s ∈ R, we can define two decision stumps:

sign(x - s) = 1 if x ≥ s, -1 if x < s;  and  sign(s - x) = 1 if x ≤ s, -1 if x > s.

For convenience, in the sequel we will denote the first decision stump by X ≥ s and the second by X < s.

According to the important remark that follows the AdaBoost pseudo-code [see CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr..-5], at each iteration t the weak algorithm A selects a decision stump which, among all decision stumps, has the minimum weighted training error w.r.t. the current distribution D_t on the training data.

73. Notes

1. When applying the ID3 algorithm, for each continuous attribute X we used a threshold for each pair of examples (x_i, y_i), (x_{i+1}, y_{i+1}) with y_i · y_{i+1} < 0 such that x_i < x_{i+1} and there is no x_j ∈ Val(X) with x_i < x_j < x_{i+1}. We will proceed similarly when applying AdaBoost with decision stumps on continuous attributes.

[In the case of the ID3 algorithm, there is a theoretical result stating that there is no need to consider other thresholds for a continuous attribute X apart from those situated between pairs of successive values (x_i < x_{i+1}) having opposite labels (y_i ≠ y_{i+1}), because the information gain (IG) for the other thresholds (x_i < x_{i+1}, with y_i = y_{i+1}) is provably less than the maximal IG for X. LC: A similar result can be proven which allows us to simplify the application of the weak classifier (A) in the framework of the AdaBoost algorithm.]

2. Moreover, we will also consider a threshold from the outside of the interval of values taken by the attribute X in the training dataset. [The decision stumps corresponding to this outside threshold can be associated with the decision trees of depth 0 that we met in other problems.]

74. Iteration t = 1: At this stage (i.e., the first iteration of AdaBoost), the thresholds for the two continuous variables (X1 and X2) corresponding to the coordinates of the training instances x_1,...,x_9 are 1/2, 5/2 and 9/2 for X1, and 1/2, 3/2, 5/2 and 7/2 for X2. One can easily see that we can get rid of the outside threshold for X2, because the decision stumps corresponding to this threshold act in the same way as the decision stumps associated with the outside threshold for X1.

The decision stumps corresponding to this iteration, together with their associated weighted training errors, are shown on the next slide. When filling in those tables, we have used the equalities err_{D_t}(X1 ≥ s) = 1 - err_{D_t}(X1 < s) and, similarly, err_{D_t}(X2 ≥ s) = 1 - err_{D_t}(X2 < s), for any threshold s and every iteration t = 1, 2,.... These equalities are easy to prove.

75

[table: for each candidate threshold s, the weighted training errors err_{D_1}(X_1 < s), err_{D_1}(X_1 ≥ s) and, respectively, err_{D_1}(X_2 < s), err_{D_1}(X_2 ≥ s)]

It can be seen that the minimal weighted training error (ε_1 = 2/9) is obtained for the decision stumps X_1 < 5/2 and X_2 < 7/2. Therefore we can choose h_1 = sign(7/2 − X_2) as best hypothesis at iteration t = 1; the corresponding separator is the line X_2 = 7/2. The hypothesis h_1 wrongly classifies the instances x_4 and x_5. Then

γ_1 = 1/2 − 2/9 = 5/18   and   α_1 = (1/2) ln((1 − ε_1)/ε_1) = (1/2) ln((7/9) : (2/9)) = (1/2) ln(7/2) ≈ 0.626.

76

Now the algorithm must get a new distribution (D_2) by altering the old one (D_1), so that the next iteration concentrates more on the misclassified instances.

D_2(i) = D_1(i) · e^{−α_1 y_i h_1(x_i)} / Z_1 = { (1/9) · √(2/7) / Z_1 for i ∈ {1,2,3,6,7,8,9}; (1/9) · √(7/2) / Z_1 for i ∈ {4,5}.

Remember that Z_1 is a normalization factor for D_2. So

Z_1 = 7 · (1/9) · √(2/7) + 2 · (1/9) · √(7/2) = (2/9) √14 ≈ 0.831.

Therefore,

D_2(i) = 1/14 for i ∉ {4,5}   and   D_2(i) = 1/4 for i ∈ {4,5}.

[figure: the nine training instances x_1, ..., x_9 in the (X_1, X_2) plane, each labelled with its new weight D_2(i), together with the separator h_1: X_2 = 7/2]
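The iteration-1 arithmetic above can be checked numerically; this sketch (variable names are mine) recomputes α_1, Z_1 and the two updated weight values starting from ε_1 = 2/9 alone:

```python
import math

eps1 = 2/9                                    # weighted error of h1 (it misses x4, x5)
alpha1 = 0.5 * math.log((1 - eps1) / eps1)    # = (1/2) ln(7/2) ~ 0.626
Z1 = 2 * math.sqrt(eps1 * (1 - eps1))         # normalizer, = (2/9) sqrt(14) ~ 0.831

w_correct = (1/9) * math.exp(-alpha1) / Z1    # new weight of a correctly classified point
w_wrong = (1/9) * math.exp(alpha1) / Z1       # new weight of x4 and x5
```

The seven correctly classified points end up with weight 1/14 each and the two misclassified ones with 1/4 each, so (as always after an AdaBoost update) each side carries total mass 1/2.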

77 Note

If, instead of sign(7/2 − X_2), we had taken as hypothesis h_1 the decision stump sign(5/2 − X_1), the subsequent calculus would have been slightly different (although both decision stumps have the same minimal weighted training error, 2/9): x_8 and x_9 would have been allocated the weights 1/4, while x_4 and x_5 would have been allocated the weights 1/14. (Therefore, the output of AdaBoost may not be uniquely determined!)

78 Iteration t = 2:

[table: for each candidate threshold s, the weighted training errors err_{D_2}(X_1 < s), err_{D_2}(X_1 ≥ s) and, respectively, err_{D_2}(X_2 < s), err_{D_2}(X_2 ≥ s)]

Note: According to the theoretical result presented at part a of CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr..-5, computing the weighted error rate of the decision stump [corresponding to the test] X_2 < 7/2 is now superfluous, because this decision stump has been chosen as optimal hypothesis at the previous iteration. (Nevertheless, we had placed it into the table, for the sake of a thorough presentation.)

79

Now the best hypothesis is h_2 = sign(5/2 − X_1); the corresponding separator is the line X_1 = 5/2.

ε_2 = P_{D_2}({x_8, x_9}) = 2 · 1/14 = 1/7 ≈ 0.143
γ_2 = 1/2 − 1/7 = 5/14
α_2 = (1/2) ln((1 − ε_2)/ε_2) = (1/2) ln((6/7) : (1/7)) = (1/2) ln 6 ≈ 0.896

D_3(i) = D_2(i) · e^{−α_2 y_i h_2(x_i)} / Z_2, that is, D_3(i) = D_2(i) / (√6 · Z_2) if h_2(x_i) = y_i, and D_3(i) = D_2(i) · √6 / Z_2 otherwise. Hence

D_3(i) = { (1/14) · (1/√6) / Z_2 for i ∈ {1,2,3,6,7}; (1/4) · (1/√6) / Z_2 for i ∈ {4,5}; (1/14) · √6 / Z_2 for i ∈ {8,9}.

80

Z_2 = 5 · (1/14) · (1/√6) + 2 · (1/4) · (1/√6) + 2 · (1/14) · √6 = (2/7) √6 ≈ 0.700.

Therefore,

D_3(i) = { 1/24 for i ∈ {1,2,3,6,7}; 7/48 for i ∈ {4,5}; 1/4 for i ∈ {8,9}.

[figure: the nine training instances with their new weights D_3(i), together with the separators h_1: X_2 = 7/2 and h_2: X_1 = 5/2]

81 Iteration t = 3:

[table: for each candidate threshold s, the weighted training errors err_{D_3}(X_1 < s), err_{D_3}(X_1 ≥ s) and, respectively, err_{D_3}(X_2 < s), err_{D_3}(X_2 ≥ s)]

82

The new best hypothesis is h_3 = sign(X_1 − 9/2); the corresponding separator is the line X_1 = 9/2.

ε_3 = P_{D_3}({x_1, x_2, x_7}) = 3 · 1/24 = 1/8
γ_3 = 1/2 − 1/8 = 3/8
α_3 = (1/2) ln((1 − ε_3)/ε_3) = (1/2) ln((7/8) : (1/8)) = (1/2) ln 7 ≈ 0.973

[figure: the training instances together with the three separators h_1, h_2 and h_3]

83

Finally, after filling our results in the given table, we get:

t | ε_t | α_t           | D_t(1)  D_t(2)  D_t(3)  D_t(4)  D_t(5)  D_t(6)  D_t(7)  D_t(8)  D_t(9) | err_S(H)
1 | 2/9 | (1/2) ln(7/2) | 1/9     1/9     1/9     1/9     1/9     1/9     1/9     1/9     1/9    | 2/9
2 | 1/7 | (1/2) ln 6    | 1/14    1/14    1/14    1/4     1/4     1/14    1/14    1/14    1/14   | 2/9
3 | 1/8 | (1/2) ln 7    | 1/24    1/24    1/24    7/48    7/48    1/24    1/24    1/24    1/24   | 0

Note: The following table helps you understand how err_S(H) was computed; remember that H(x_i) := sign(Σ_{t=1}^T α_t h_t(x_i)). The entry in row t and column x_i is y_i h_t(x_i), i.e., +1 if h_t classifies x_i correctly and −1 otherwise.

t | α_t           | x_1  x_2  x_3  x_4  x_5  x_6  x_7  x_8  x_9
1 | (1/2) ln(7/2) | +1   +1   +1   −1   −1   +1   +1   +1   +1
2 | (1/2) ln 6    | +1   +1   +1   +1   +1   +1   +1   −1   −1
3 | (1/2) ln 7    | −1   −1   +1   +1   +1   +1   −1   +1   +1
  | y_i H(x_i)    | +1   +1   +1   +1   +1   +1   +1   +1   +1
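The whole three-round run summarized in the table can be replayed without the actual point coordinates: the slides tell us which instances each selected stump misclassifies, and the α_t and distribution updates then follow mechanically. A sketch under that assumption (the mistake sets below are copied from the worked example; everything else is generic AdaBoost bookkeeping):

```python
import math

m = 9
# Which training points each selected stump misclassifies (0-based indices):
mistakes = [
    {3, 4},      # h1 = sign(7/2 - X2): misses x4, x5
    {7, 8},      # h2 = sign(5/2 - X1): misses x8, x9
    {0, 1, 6},   # h3 = sign(X1 - 9/2): misses x1, x2, x7
]

D = [1/m] * m
alphas, history = [], []
for M in mistakes:
    history.append(D[:])                          # distribution used at this round
    eps = sum(D[i] for i in M)                    # weighted training error
    a = 0.5 * math.log((1 - eps) / eps)
    alphas.append(a)
    # reweight: misclassified points go up by e^{a}, the rest down by e^{-a}
    D = [D[i] * math.exp(a if i in M else -a) for i in range(m)]
    Z = sum(D)
    D = [d / Z for d in D]

# y_i * f(x_i): the alpha-weighted vote of the rounds that got x_i right
# minus the vote of the rounds that got it wrong.
margins = [sum(a * (-1 if i in M else 1) for a, M in zip(alphas, mistakes))
           for i in range(m)]
train_err = sum(g <= 0 for g in margins) / m
```

The recovered values match the table: ε = 2/9, 1/7, 1/8; α = (1/2)ln(7/2), (1/2)ln 6, (1/2)ln 7; D_3 = 1/24 / 7/48 / 1/4; and every combined margin is positive, so err_S(H) = 0.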

84

Remark: One can immediately see that the [test] instance (·, 4) will be classified by the hypothesis H learned by AdaBoost as negative (since −α_1 + α_2 − α_3 ≈ −0.703 < 0). After making other similar calculations, we can conclude that the decision zones and the decision boundaries produced by AdaBoost for the given training data will be as indicated in the nearby figure.

[figure: the decision zones and decision boundaries produced by AdaBoost, delimited by the separators h_1, h_2 and h_3]

Remark: The execution of AdaBoost could continue (if we had initially taken T > 3...), although we have obtained err_S(H) = 0 at iteration t = 3. By elaborating the details, we would see that for t = 4 we would obtain as optimal hypothesis X_2 < 7/2 (which has already been selected at iteration t = 1). This hypothesis now produces the weighted training error ε_4 = 1/6. Therefore α_4 = (1/2) ln 5, and this will be added to α_1 = (1/2) ln(7/2) in the new output H. In this way, the confidence in the hypothesis X_2 < 7/2 would be strengthened. So, we should keep in mind that AdaBoost can select a certain weak hypothesis several times (but never at consecutive iterations, cf. CMU, 2015 fall, E. Xing, Z. Bar-Joseph, HW4, pr..).

85

Seeing AdaBoost as an optimization algorithm, w.r.t. the [inverse] exponential loss function

CMU, 2008 fall, Eric Xing, HW3, pr. 4..
CMU, 2008 fall, Eric Xing, midterm, pr. 5.

86

At CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr..-5, part d, we have shown that in AdaBoost we try to [indirectly] minimize the training error err_S(H) by sequentially minimizing its upper bound Π_{t=1}^T Z_t, i.e., at each iteration t (1 ≤ t ≤ T) we choose α_t so as to minimize Z_t (viewed as a function of α_t).

Here you will see that another way to explain AdaBoost is by sequentially minimizing the negative exponential loss:

E := Σ_{i=1}^m exp(−y_i f_T(x_i)) = Σ_{i=1}^m exp(−y_i Σ_{t=1}^T α_t h_t(x_i)).   (3)

That is to say, at the t-th iteration (1 ≤ t ≤ T) we want to choose, besides the appropriate classifier h_t, the corresponding weight α_t so that the overall loss E (accumulated up to the t-th iteration) is minimized.

Prove that this [new] strategy will lead to the same update rule for α_t used in AdaBoost, i.e., α_t = (1/2) ln((1 − ε_t)/ε_t).

Hint: You can use the fact that D_t(i) ∝ exp(−y_i f_{t−1}(x_i)), and it [LC: the proportionality factor] can be viewed as constant when we try to optimize E with respect to α_t in the t-th iteration.

87 Solution

At the t-th iteration, we have

E = Σ_{i=1}^m exp(−y_i f_t(x_i)) = Σ_{i=1}^m exp(−y_i (Σ_{t'=1}^{t−1} α_{t'} h_{t'}(x_i)) − y_i α_t h_t(x_i))
  = Σ_{i=1}^m exp(−y_i f_{t−1}(x_i)) · exp(−y_i α_t h_t(x_i))
  ∝ Σ_{i=1}^m D_t(i) exp(−y_i α_t h_t(x_i)) =: Ẽ

(see CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr..-5, part b).

Further on, we can rewrite Ẽ as

Ẽ = Σ_{i=1}^m D_t(i) exp(−y_i α_t h_t(x_i)) = Σ_{i ∈ C} D_t(i) exp(−α_t) + Σ_{i ∈ M} D_t(i) exp(α_t) = (1 − ε_t) e^{−α_t} + ε_t e^{α_t},   (4)

where C is the set of examples which are correctly classified by h_t, and M is the set of examples which are misclassified by h_t.

88

The relation (4) is identical with the expression obtained at part a of CMU, 2015 fall, Ziv Bar-Joseph, Eric Xing, HW4, pr..-5 (see the solution). Therefore, Ẽ will reach its minimum for α_t = (1/2) ln((1 − ε_t)/ε_t).
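A quick numerical check of this conclusion (the concrete ε value is an illustrative choice of mine): the closed-form α_t = (1/2) ln((1 − ε_t)/ε_t) indeed minimizes (1 − ε_t) e^{−α_t} + ε_t e^{α_t} over a fine grid, and the minimum value equals 2 √(ε_t (1 − ε_t)), i.e. the normalizer Z_t:

```python
import math

def E_tilde(alpha, eps):
    # relation (4): the weighted exponential loss at round t as a function of alpha
    return (1 - eps) * math.exp(-alpha) + eps * math.exp(alpha)

eps = 0.2
a_star = 0.5 * math.log((1 - eps) / eps)      # closed-form minimizer

# a_star beats every other alpha on a grid around it (the loss is strictly convex):
grid = [a_star + k * 0.01 for k in range(-100, 101)]
best = min(grid, key=lambda a: E_tilde(a, eps))
```

This also recovers the familiar identity Z_t = 2 √(ε_t (1 − ε_t)) used in the worked example above (e.g. Z_1 = (2/9)√14 for ε_1 = 2/9).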

89

AdaBoost algorithm: the notion of [voting] margin; some properties

CMU, 2016 spring, W. Cohen, N. Balcan, HW4, pr. 3.3

90

Despite the fact that model complexity increases with each iteration, AdaBoost does not usually overfit. The reason behind this is that the model becomes more confident as we increase the number of iterations. The confidence can be expressed mathematically as the [voting] margin.

Recall that after the AdaBoost algorithm terminates with T iterations, the [output] classifier is

H_T(x) = sign(Σ_{t=1}^T α_t h_t(x)).

Similarly, we can define the intermediate weighted classifier after k iterations:

H_k(x) = sign(Σ_{t=1}^k α_t h_t(x)).

As its output is either −1 or +1, it does not tell the confidence of its judgement. Here, without changing the decision rule, let

H̄_k(x) = sign(Σ_{t=1}^k ᾱ_t h_t(x)), where ᾱ_t = α_t / Σ_{t'=1}^k α_{t'},

so that the weights on the weak classifiers are normalized.

91

Define the margin after the k-th iteration as [the sum of] the [normalized] weights of the h_t voting correctly minus [the sum of] the [normalized] weights of the h_t voting incorrectly:

Margin_k(x) = Σ_{t: h_t(x) = y} ᾱ_t − Σ_{t: h_t(x) ≠ y} ᾱ_t.

a. Let f_k(x) := Σ_{t=1}^k ᾱ_t h_t(x). Show that Margin_k(x_i) = y_i f_k(x_i) for all training instances x_i, with i = 1,...,m.

b. If Margin_k(x_i) > Margin_k(x_j), which of the samples x_i and x_j will receive a higher weight in iteration k + 1?

Hint: Use the relation D_{k+1}(i) = exp(−y_i f_k(x_i)) / (m Π_{t=1}^k Z_t), which was proven at CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr...

92 Solution

a. We will prove the equality starting from its right-hand side:

y_i f_k(x_i) = y_i Σ_{t=1}^k ᾱ_t h_t(x_i) = Σ_{t=1}^k ᾱ_t y_i h_t(x_i) = Σ_{t: h_t(x_i) = y_i} ᾱ_t − Σ_{t: h_t(x_i) ≠ y_i} ᾱ_t = Margin_k(x_i).

b. According to the relationship already proven at part a,

Margin_k(x_i) > Margin_k(x_j) ⇒ y_i f_k(x_i) > y_j f_k(x_j) ⇒ −y_i f_k(x_i) < −y_j f_k(x_j) ⇒ exp(−y_i f_k(x_i)) < exp(−y_j f_k(x_j)).

Based on the given Hint, it follows that D_{k+1}(i) < D_{k+1}(j).
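Both parts can be illustrated in a few lines (the normalized weights and correctness patterns below are invented for illustration): the point with the larger margin ends up with the smaller unnormalized AdaBoost weight exp(−y f_k(x)):

```python
import math

# Normalized votes of k = 3 weak learners, and whether each one
# classifies two hypothetical points x_i and x_j correctly.
abar = [0.5, 0.3, 0.2]            # sums to 1
correct_i = [True, True, False]   # h_t(x_i) == y_i ?
correct_j = [True, False, False]

def margin(abar, correct):
    # sum of normalized weights voting correctly minus those voting wrongly;
    # this equals y * f_k(x), since y * h_t(x) is +1 when correct and -1 otherwise
    return sum(a if c else -a for a, c in zip(abar, correct))

mi, mj = margin(abar, correct_i), margin(abar, correct_j)
# part b: larger margin -> smaller (unnormalized) weight exp(-y f_k(x))
wi, wj = math.exp(-mi), math.exp(-mj)
```

Here mi = 0.6 > mj = 0.0, so x_j, the lower-margin point, receives the higher weight at the next iteration, exactly as derived above.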

93 Important Remark

It can be shown mathematically that boosting tends to increase the margins of the training examples (see the relation (3) at CMU, 2008 fall, Eric Xing, HW3, pr. 4..), and that a large margin on the training examples reduces the generalization error. Thus we can explain why, although the number of parameters of the model created by AdaBoost increases at every iteration (and therefore its complexity rises), it usually does not overfit.

94

AdaBoost: a sufficient condition for γ-weak learnability, based on the voting margin

CMU, 2016 spring, W. Cohen, N. Balcan, HW4, pr. 3..4

95

At CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr..5 we encountered the notion of empirical γ-weak learnability. When this condition (γ ≤ γ_t for all t, where γ_t := 1/2 − ε_t, with ε_t being the weighted training error produced by the weak hypothesis h_t) is met, it ensures that AdaBoost will drive down the training error quickly. However, this condition does not hold all the time.

In this problem we will prove a sufficient condition for empirical weak learnability [to hold]. This condition refers to the notion of voting margin, which was presented in CMU, 2016 spring, W. Cohen, N. Balcan, HW4, pr. 3.3. Namely, we will prove that if there is a constant θ > 0 such that the [voting] margins of all training instances are lower-bounded by θ at each iteration of the AdaBoost algorithm, then the property of empirical γ-weak learnability is guaranteed, with γ = θ/2.

96 [Formalisation]

Suppose we are given a training set S = {(x_1,y_1),...,(x_m,y_m)} such that, for some weak hypotheses h_1,...,h_k from the hypothesis space H and some nonnegative coefficients α_1,...,α_k with Σ_{j=1}^k α_j = 1, there exists θ > 0 such that

y_i (Σ_{j=1}^k α_j h_j(x_i)) ≥ θ for all (x_i, y_i) ∈ S.

Note: according to CMU, 2016 spring, W. Cohen, N. Balcan, HW4, pr. 3.3,

y_i (Σ_{j=1}^k α_j h_j(x_i)) = Margin_k(x_i) = y_i f_k(x_i), where f_k(x_i) := Σ_{j=1}^k α_j h_j(x_i).

Key idea: We will show that if the condition above is satisfied (for a given k), then for any distribution D over S there exists a hypothesis h_l ∈ {h_1,...,h_k} with weighted training error at most 1/2 − θ/2 over the distribution D. It will follow that, when the condition above is satisfied for any k, the training set S is empirically γ-weak learnable, with γ = θ/2.

97

a. Show that if the condition stated above is met, then there exists a weak hypothesis h_l from {h_1,...,h_k} such that E_{i∼D}[y_i h_l(x_i)] ≥ θ.

Hint: Taking expectation under the same distribution does not change the inequality conditions.

b. Show that the inequality E_{i∼D}[y_i h_l(x_i)] ≥ θ is equivalent to Pr_{i∼D}[y_i ≠ h_l(x_i)] ≤ 1/2 − θ/2, meaning that the weighted training error of h_l is at most 1/2 − θ/2, and therefore γ_t ≥ θ/2.

98 Solution

a. Since y_i (Σ_{j=1}^k α_j h_j(x_i)) ≥ θ ⇔ y_i f_k(x_i) ≥ θ for i = 1,...,m, it follows (according to the Hint) that

E_{i∼D}[y_i f_k(x_i)] ≥ θ, where f_k(x_i) := Σ_{j=1}^k α_j h_j(x_i).   (5)

On the other side, by definition E_{i∼D}[y_i h_l(x_i)] ≥ θ ⇔ Σ_{i=1}^m y_i h_l(x_i) D(i) ≥ θ.

Suppose, on the contrary, that E_{i∼D}[y_i h_l(x_i)] < θ, that is, Σ_{i=1}^m y_i h_l(x_i) D(i) < θ, for every l = 1,...,k. Multiplying by α_l ≥ 0 gives Σ_{i=1}^m y_i h_l(x_i) D(i) α_l ≤ θ α_l for l = 1,...,k, with strict inequality for every l with α_l > 0 (and at least one such l exists, since Σ_{l=1}^k α_l = 1). By summing up these inequalities for l = 1,...,k we get

Σ_{l=1}^k Σ_{i=1}^m y_i h_l(x_i) D(i) α_l < θ Σ_{l=1}^k α_l ⇔ Σ_{i=1}^m y_i D(i) (Σ_{l=1}^k α_l h_l(x_i)) < θ ⇔ Σ_{i=1}^m y_i f_k(x_i) D(i) < θ,   (6)

because Σ_{j=1}^k α_j = 1 and f_k(x_i) := Σ_{l=1}^k α_l h_l(x_i). The inequality (6) can be written as E_{i∼D}[y_i f_k(x_i)] < θ. Obviously, it contradicts the relation (5). Therefore, the previous supposition is false. In conclusion, there exists l ∈ {1,...,k} such that E_{i∼D}[y_i h_l(x_i)] ≥ θ.

99 Solution (cont'd)

b. We already said that E_{i∼D}[y_i h_l(x_i)] ≥ θ ⇔ Σ_{i=1}^m y_i h_l(x_i) D(i) ≥ θ. Since y_i ∈ {−1,+1} and h_l(x_i) ∈ {−1,+1} for i = 1,...,m and l = 1,...,k, we have

Σ_{i=1}^m y_i h_l(x_i) D(i) ≥ θ ⇔ Σ_{i: y_i = h_l(x_i)} D(x_i) − Σ_{i: y_i ≠ h_l(x_i)} D(x_i) ≥ θ ⇔ (1 − ε_l) − ε_l ≥ θ ⇔ ε_l ≤ 1/2 − θ/2 ⇔ Pr_{i∼D}[y_i ≠ h_l(x_i)] ≤ 1/2 − θ/2.
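The pivotal identity behind part b, E_{i∼D}[y_i h_l(x_i)] = 1 − 2 ε_l for ±1-valued labels and predictions, is easy to confirm numerically (the toy labels, predictions and distribution are mine):

```python
# For y, h in {-1, +1}: y*h is +1 on correct predictions and -1 on mistakes,
# so E[y h] = (correct mass) - (error mass) = 1 - 2 * err.
ys = [1, -1, 1, 1]
hs = [1, -1, -1, 1]        # the hypothesis gets the third example wrong
D = [0.25] * 4             # a distribution over the four examples

corr = sum(d for y, h, d in zip(ys, hs, D) if y == h)
err = 1 - corr
E_yh = sum(y * h * d for y, h, d in zip(ys, hs, D))
```

Hence E[y h] ≥ θ is literally the same statement as err ≤ 1/2 − θ/2, which is the equivalence proved above.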

100

AdaBoost: Any set of consistently labelled instances from R is empirically γ-weak learnable by using decision stumps

Stanford, 2016 fall, Andrew Ng, John Duchi, HW, pr. 6.abc

101

At CMU, 2015 fall, Z. Bar-Joseph, E. Xing, HW4, pr..5 we encountered the notion of empirical γ-weak learnability. When this condition (γ ≤ γ_t for all t, where γ_t := 1/2 − ε_t, with ε_t being the weighted training error produced by the weak hypothesis h_t) is met, it ensures that AdaBoost will drive down the training error quickly.

In this problem we will assume that our input attribute vectors x ∈ R, that is, they are one-dimensional, and we will show that [LC] when these vectors are consistently labelled, decision stumps based on thresholding provide a weak-learning guarantee (γ).

102 Decision stumps: analytical definitions / formalization

Thresholding-based decision stumps can be seen as functions indexed by a threshold s and a sign +/−, such that

φ_{s,+}(x) = { 1 if x ≥ s; −1 if x < s }   and   φ_{s,−}(x) = { −1 if x ≥ s; 1 if x < s }.

Therefore, φ_{s,+}(x) = −φ_{s,−}(x).

Key idea for the proof: We will show that, given a consistently labelled training set S = {(x_1,y_1),...,(x_m,y_m)}, with x_i ∈ R and y_i ∈ {−1,+1} for i = 1,...,m, there is some γ > 0 such that for any distribution p defined on this training set there is a threshold s ∈ R for which

error_p(φ_{s,+}) ≤ 1/2 − γ   or   error_p(φ_{s,−}) ≤ 1/2 − γ,

where error_p(φ_{s,+}) and error_p(φ_{s,−}) denote the weighted training errors of φ_{s,+} and respectively φ_{s,−}, computed according to the distribution p.

103

Convention: In our problem we will assume that our training instances x_1,...,x_m ∈ R are distinct. Moreover, we will assume (without loss of generality, but this makes the proof notationally simpler) that x_1 > x_2 > ... > x_m.

a. Show that, given S, for each threshold s ∈ R there is some m_0(s) ∈ {0,1,...,m} such that

error_p(φ_{s,+}) := Σ_{i=1}^m p_i 1{y_i ≠ φ_{s,+}(x_i)} = 1/2 − (1/2) [Σ_{i=1}^{m_0(s)} y_i p_i − Σ_{i=m_0(s)+1}^m y_i p_i]

(the expression in square brackets is denoted f(m_0(s))) and

error_p(φ_{s,−}) := Σ_{i=1}^m p_i 1{y_i ≠ φ_{s,−}(x_i)} = 1/2 + (1/2) [Σ_{i=1}^{m_0(s)} y_i p_i − Σ_{i=m_0(s)+1}^m y_i p_i].

Note: Treat sums over empty sets of indices as zero. Therefore Σ_{i=1}^0 a_i = 0 for any a_i, and similarly Σ_{i=m+1}^m a_i = 0.

104

b. Prove that, given S, there is some γ > 0 (which may depend on the training set size m) such that, for any set of probabilities p on the training set (therefore p_i ≥ 0 and Σ_{i=1}^m p_i = 1), we can find m_0 ∈ {0,...,m} so that |f(m_0)| ≥ γ, where

f(m_0) := Σ_{i=1}^{m_0} y_i p_i − Σ_{i=m_0+1}^m y_i p_i.

Note: γ should not depend on p.

Hint: Consider the difference f(m_0) − f(m_0 − 1). What is your γ?

105

c. Based on your answers to parts a and b, what edge can thresholded decision stumps guarantee on any training set {x_i, y_i}_{i=1}^m, where the raw attributes x_i ∈ R are all distinct? Recall that the edge of a weak classifier φ : R → {−1, 1} is the constant γ ∈ (0, 1/2) such that

error_p(φ) := Σ_{i=1}^m p_i 1{φ(x_i) ≠ y_i} ≤ 1/2 − γ.

d. Can you give an upper bound on the number of thresholded decision stumps required to achieve zero error on a given training set?

106 Solution

a. We perform several algebraic steps. Let sign(t) = 1 if t ≥ 0, and sign(t) = −1 otherwise. Then 1{φ_{s,+}(x) ≠ y} = 1{sign(x − s) ≠ y} = 1{y · sign(x − s) ≤ 0}, where the symbol 1{·} denotes the well known indicator function. Thus we have

error_p(φ_{s,+}) := Σ_{i=1}^m p_i 1{y_i ≠ φ_{s,+}(x_i)} = Σ_{i=1}^m p_i 1{y_i · sign(x_i − s) ≤ 0} = Σ_{i: x_i ≥ s} p_i 1{y_i = −1} + Σ_{i: x_i < s} p_i 1{y_i = 1}.

Thus, if we let m_0(s) be the index in {0,...,m} such that x_i ≥ s for i ≤ m_0(s) and x_i < s for i > m_0(s), which we know must exist because x_1 > x_2 > ..., we have

error_p(φ_{s,+}) = Σ_{i=1}^{m_0(s)} p_i 1{y_i = −1} + Σ_{i=m_0(s)+1}^m p_i 1{y_i = 1}.

107

Now we make a key observation: we have 1{y = −1} = (1 − y)/2 and 1{y = 1} = (1 + y)/2, because y ∈ {−1, 1}. Consequently,

error_p(φ_{s,+}) = Σ_{i=1}^{m_0(s)} p_i (1 − y_i)/2 + Σ_{i=m_0(s)+1}^m p_i (1 + y_i)/2
 = (1/2) Σ_{i=1}^m p_i − (1/2) (Σ_{i=1}^{m_0(s)} p_i y_i − Σ_{i=m_0(s)+1}^m p_i y_i)
 = 1/2 − (1/2) (Σ_{i=1}^{m_0(s)} p_i y_i − Σ_{i=m_0(s)+1}^m p_i y_i).

The last equality follows because Σ_{i=1}^m p_i = 1. The case of φ_{s,−} is symmetric to this one, so we omit the argument.

108 Solution (cont'd)

b. For any m_0 ∈ {1,...,m} we have

f(m_0) − f(m_0 − 1) = Σ_{i=1}^{m_0} y_i p_i − Σ_{i=m_0+1}^m y_i p_i − Σ_{i=1}^{m_0−1} y_i p_i + Σ_{i=m_0}^m y_i p_i = 2 y_{m_0} p_{m_0}.

Therefore, |f(m_0) − f(m_0 − 1)| = 2 |y_{m_0}| p_{m_0} = 2 p_{m_0} for all m_0 ∈ {1,...,m}. Because Σ_{i=1}^m p_i = 1, there must be at least one index m_0 with p_{m_0} ≥ 1/m. Thus we have |f(m_0) − f(m_0 − 1)| ≥ 2/m, and so it must be the case that at least one of |f(m_0)| ≥ 1/m or |f(m_0 − 1)| ≥ 1/m holds. Depending on which one of those two inequalities is true, we would then return m_0 or m_0 − 1. (Note: If |f(m_0 − 1)| ≥ 1/m and m_0 = 1, then we have to consider an outside threshold, s > x_1.) Finally, we have γ = 1/m.
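The part-b argument can be checked on a toy distribution (the labels and probabilities below are mine): consecutive values of f differ by exactly 2 p_{m_0}, and some m_0 attains |f(m_0)| ≥ 1/m:

```python
def f(m0, ys, ps):
    # f(m0) = sum_{i<=m0} y_i p_i - sum_{i>m0} y_i p_i, with examples
    # already sorted by decreasing attribute value x_1 > ... > x_m
    return sum(y * p for y, p in zip(ys[:m0], ps[:m0])) - \
           sum(y * p for y, p in zip(ys[m0:], ps[m0:]))

ys = [1, -1, 1, -1]            # labels, in decreasing order of x
ps = [0.1, 0.4, 0.2, 0.3]      # an arbitrary distribution p
m = len(ys)

best = max(abs(f(m0, ys, ps)) for m0 in range(m + 1))
# consecutive values of f differ by exactly 2 * p_{m0}
gap_ok = all(abs(abs(f(k, ys, ps) - f(k - 1, ys, ps)) - 2 * ps[k - 1]) < 1e-12
             for k in range(1, m + 1))
```

By part a, the stump realizing the best m_0 then has weighted error at most 1/2 − (1/2)(1/m), i.e. the guaranteed edge of part c is 1/(2m).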


The Boosting Approach to Machine Learning Maria-Florina Balcan 10/31/2016 Boosting General method for improving the accuracy of any given learning algorithm. Works by creating a series of challenge datasets

More information

### Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Decision Trees CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Classification without Models Well, partially without a model } Today: Decision Trees 2015 Bruno Ribeiro 2 3 Why Trees? } interpretable/intuitive,

More information

### Lecture 7: DecisionTrees

Lecture 7: DecisionTrees What are decision trees? Brief interlude on information theory Decision tree construction Overfitting avoidance Regression trees COMP-652, Lecture 7 - September 28, 2009 1 Recall:

More information

### The Perceptron algorithm

The Perceptron algorithm Tirgul 3 November 2016 Agnostic PAC Learnability A hypothesis class H is agnostic PAC learnable if there exists a function m H : 0,1 2 N and a learning algorithm with the following

More information

### Classification Trees

Classification Trees 36-350, Data Mining 29 Octber 2008 Classification trees work just like regression trees, only they try to predict a discrete category (the class), rather than a numerical value. The

More information

### Decision Tree Learning

Topics Decision Tree Learning Sattiraju Prabhakar CS898O: DTL Wichita State University What are decision trees? How do we use them? New Learning Task ID3 Algorithm Weka Demo C4.5 Algorithm Weka Demo Implementation

More information

### Useful Equations for solving SVM questions

Support Vector Machines (send corrections to yks at csail.mit.edu) In SVMs we are trying to find a decision boundary that maximizes the "margin" or the "width of the road" separating the positives from

More information

### Introduction to Machine Learning (67577) Lecture 3

Introduction to Machine Learning (67577) Lecture 3 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem General Learning Model and Bias-Complexity tradeoff Shai Shalev-Shwartz

More information

### Open Problem: A (missing) boosting-type convergence result for ADABOOST.MH with factorized multi-class classifiers

JMLR: Workshop and Conference Proceedings vol 35:1 8, 014 Open Problem: A (missing) boosting-type convergence result for ADABOOST.MH with factorized multi-class classifiers Balázs Kégl LAL/LRI, University

More information

### TDT4173 Machine Learning

TDT4173 Machine Learning Lecture 3 Bagging & Boosting + SVMs Norwegian University of Science and Technology Helge Langseth IT-VEST 310 helgel@idi.ntnu.no 1 TDT4173 Machine Learning Outline 1 Ensemble-methods

More information

### Machine Learning

Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University January 13, 2011 Today: The Big Picture Overfitting Review: probability Readings: Decision trees, overfiting

More information

### Linear Classifiers and the Perceptron

Linear Classifiers and the Perceptron William Cohen February 4, 2008 1 Linear classifiers Let s assume that every instance is an n-dimensional vector of real numbers x R n, and there are only two possible

More information

### Decision Tree Learning

Decision Tree Learning Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Machine Learning, Chapter 3 2. Data Mining: Concepts, Models,

More information

### Information Gain, Decision Trees and Boosting ML recitation 9 Feb 2006 by Jure

Information Gain, Decision Trees and Boosting 10-701 ML recitation 9 Feb 2006 by Jure Entropy and Information Grain Entropy & Bits You are watching a set of independent random sample of X X has 4 possible

More information

### Information Theory, Statistics, and Decision Trees

Information Theory, Statistics, and Decision Trees Léon Bottou COS 424 4/6/2010 Summary 1. Basic information theory. 2. Decision trees. 3. Information theory and statistics. Léon Bottou 2/31 COS 424 4/6/2010

More information

### Learning theory. Ensemble methods. Boosting. Boosting: history

Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over

More information

### C4.5 - pruning decision trees

C4.5 - pruning decision trees Quiz 1 Quiz 1 Q: Is a tree with only pure leafs always the best classifier you can have? A: No. Quiz 1 Q: Is a tree with only pure leafs always the best classifier you can

More information

### Qualifier: CS 6375 Machine Learning Spring 2015

Qualifier: CS 6375 Machine Learning Spring 2015 The exam is closed book. You are allowed to use two double-sided cheat sheets and a calculator. If you run out of room for an answer, use an additional sheet

More information

### Ensembles. Léon Bottou COS 424 4/8/2010

Ensembles Léon Bottou COS 424 4/8/2010 Readings T. G. Dietterich (2000) Ensemble Methods in Machine Learning. R. E. Schapire (2003): The Boosting Approach to Machine Learning. Sections 1,2,3,4,6. Léon

More information

### the tree till a class assignment is reached

Decision Trees Decision Tree for Playing Tennis Prediction is done by sending the example down Prediction is done by sending the example down the tree till a class assignment is reached Definitions Internal

More information

### Dan Roth 461C, 3401 Walnut

CIS 519/419 Applied Machine Learning www.seas.upenn.edu/~cis519 Dan Roth danroth@seas.upenn.edu http://www.cis.upenn.edu/~danroth/ 461C, 3401 Walnut Slides were created by Dan Roth (for CIS519/419 at Penn

More information

### CSE 151 Machine Learning. Instructor: Kamalika Chaudhuri

CSE 151 Machine Learning Instructor: Kamalika Chaudhuri Ensemble Learning How to combine multiple classifiers into a single one Works well if the classifiers are complementary This class: two types of

More information

### Final. Introduction to Artificial Intelligence. CS 188 Spring You have approximately 2 hours and 50 minutes.

CS 188 Spring 2014 Introduction to Artificial Intelligence Final You have approximately 2 hours and 50 minutes. The exam is closed book, closed notes except your two-page crib sheet. Mark your answers

More information

### Introduction to Machine Learning Lecture 11. Mehryar Mohri Courant Institute and Google Research

Introduction to Machine Learning Lecture 11 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Boosting Mehryar Mohri - Introduction to Machine Learning page 2 Boosting Ideas Main idea:

More information

### Introduction to ML. Two examples of Learners: Naïve Bayesian Classifiers Decision Trees

Introduction to ML Two examples of Learners: Naïve Bayesian Classifiers Decision Trees Why Bayesian learning? Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most practical

More information

### Midterm 2 V1. Introduction to Artificial Intelligence. CS 188 Spring 2015

S 88 Spring 205 Introduction to rtificial Intelligence Midterm 2 V ˆ You have approximately 2 hours and 50 minutes. ˆ The exam is closed book, closed calculator, and closed notes except your one-page crib

More information

### i=1 = H t 1 (x) + α t h t (x)

AdaBoost AdaBoost, which stands for ``Adaptive Boosting", is an ensemble learning algorithm that uses the boosting paradigm []. We will discuss AdaBoost for binary classification. That is, we assume that

More information

### COMS 4721: Machine Learning for Data Science Lecture 13, 3/2/2017

COMS 4721: Machine Learning for Data Science Lecture 13, 3/2/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University BOOSTING Robert E. Schapire and Yoav

More information

### Evaluation requires to define performance measures to be optimized

Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation

More information

### Lecture 3: Decision Trees

Lecture 3: Decision Trees Cognitive Systems - Machine Learning Part I: Basic Approaches of Concept Learning ID3, Information Gain, Overfitting, Pruning last change November 26, 2014 Ute Schmid (CogSys,

More information

### Midterm Exam Solutions, Spring 2007

1-71 Midterm Exam Solutions, Spring 7 1. Personal info: Name: Andrew account: E-mail address:. There should be 16 numbered pages in this exam (including this cover sheet). 3. You can use any material you

More information

### Linear Discrimination Functions

Laurea Magistrale in Informatica Nicola Fanizzi Dipartimento di Informatica Università degli Studi di Bari November 4, 2009 Outline Linear models Gradient descent Perceptron Minimum square error approach

More information

### BBM406 - Introduc0on to ML. Spring Ensemble Methods. Aykut Erdem Dept. of Computer Engineering HaceDepe University

BBM406 - Introduc0on to ML Spring 2014 Ensemble Methods Aykut Erdem Dept. of Computer Engineering HaceDepe University 2 Slides adopted from David Sontag, Mehryar Mohri, Ziv- Bar Joseph, Arvind Rao, Greg

More information

### Boos\$ng Can we make dumb learners smart?

Boos\$ng Can we make dumb learners smart? Aarti Singh Machine Learning 10-601 Nov 29, 2011 Slides Courtesy: Carlos Guestrin, Freund & Schapire 1 Why boost weak learners? Goal: Automa'cally categorize type

More information

### Active Learning: Disagreement Coefficient

Advanced Course in Machine Learning Spring 2010 Active Learning: Disagreement Coefficient Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz In previous lectures we saw examples in which

More information

### Jeffrey D. Ullman Stanford University

Jeffrey D. Ullman Stanford University 3 We are given a set of training examples, consisting of input-output pairs (x,y), where: 1. x is an item of the type we want to evaluate. 2. y is the value of some

More information

### Linear & nonlinear classifiers

Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

### MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run

More information

### DECISION TREE LEARNING. [read Chapter 3] [recommended exercises 3.1, 3.4]

1 DECISION TREE LEARNING [read Chapter 3] [recommended exercises 3.1, 3.4] Decision tree representation ID3 learning algorithm Entropy, Information gain Overfitting Decision Tree 2 Representation: Tree-structured

More information