Introduction to Machine Learning 2008/B

Assignment 5

Michael Orlov
Department of Computer Science
orlovm@cs.bgu.ac.il

September 6, 2008

Abstract

Submission of Assignment 5 in Introduction to Machine Learning.

Question 1

You are given the following data set:

  0.04, 0.12, 0.23, 0.08, 0.44, 0.78, 0.79, 0.88, 0.91 (sorted: 0.04, 0.08, 0.12, 0.23, 0.44, 0.78, 0.79, 0.88, 0.91).

Question 1 (a)

Build a histogram with h = 0.5, and with h = 0.1.

The naive estimator is given by

  \hat{p}(x) = \frac{|\{ x^t : x - h < x^t \le x + h \}|}{2Nh}

for bin width h and samples \{x^t\}_{t=1}^{N}. It can also be written as

  \hat{p}(x) = \frac{1}{2Nh} \sum_{t=1}^{N} w\left( \frac{x - x^t}{h} \right),
  \qquad w(u) = 1 \text{ if } |u| < 1, \text{ and } 0 \text{ otherwise.}

This reformulation changes the x^t \le x + h condition to x^t < x + h; however, it changes the value of \hat{p}(x) at only a finite number of points, and it is still a probability density function. For implementation purposes, we note that w(u) = 1 - \lfloor \min(|u|, 1) \rfloor.

The resulting histograms are shown in Fig. 1 and Fig. 2. Perhaps the question asked for a histogram with fixed bin points. Since the naive estimator is more general than that, I am leaving the answer as-is.
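As an illustrative hand check (worked here, not part of the original text), consider the estimate at x = 0.2 with h = 0.5: the window (0.2 - 0.5, 0.2 + 0.5] = (-0.3, 0.7] contains the five samples 0.04, 0.08, 0.12, 0.23, 0.44, so

  \hat{p}(0.2) = \frac{5}{2 \cdot 9 \cdot 0.5} = \frac{5}{9} \approx 0.556.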
Figure 1: Histogram with h = 0.5. [plot: naive estimate \hat{p}(x) with the samples marked, for -0.5 <= x <= 1.5]

Figure 2: Histogram with h = 0.1. [plot: naive estimate \hat{p}(x) with the samples marked, for -0.5 <= x <= 1.5]
Figure 3: Kernel estimate with h = 0.2. [plot: Gaussian kernel estimate \hat{p}(x) with the samples marked, for -0.5 <= x <= 1.5]

Question 1 (b)

Using a Gaussian kernel estimator with h = 0.2, what is the probability density at 0.2?

In the Gaussian kernel estimator, we use

  K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}

instead of w(u) in the previous question, dropping the factor 2 in the denominator since K integrates to 1 while w integrates to 2:

  \hat{p}(x) = \frac{1}{Nh} \sum_{t=1}^{N} K\left( \frac{x - x^t}{h} \right).

Substituting K(u) for w(u), we have, for h = 0.2,

  \hat{p}(0.2) = \frac{1}{9 \cdot 0.2} \left[ K(0.8) + K(0.6) + K(0.4) + K(-0.15) + K(-1.2) + \cdots \right] \approx 0.88495.

Fig. 3 shows the complete kernel estimate.

Question 1 (c)

Using k-nearest neighbor with k = 2, what is the probability density at 0.2?

The k-nearest neighbor density estimate is given by

  \hat{p}(x) = \frac{k}{2N d_k(x)}.

For k = 2, we have

  \hat{p}(x) = \frac{2}{2N d_2(x)} = \frac{1}{9 d_2(x)},

where d_2(x) is the distance from x to the second-closest sample. For x = 0.2, the closest sample is 0.23, and the second-closest sample is 0.12; therefore

  \hat{p}(0.2) = \frac{1}{9 \cdot |0.2 - 0.12|} = \frac{1}{9 \cdot 0.08} \approx 1.38889.
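Since the distance bookkeeping is easy to get wrong by hand, here is a minimal Octave sketch of the same k-NN estimate (the function name knn_density is mine, not from the assignment):

  xt = [ 0.04 0.08 0.12 0.23 0.44 0.78 0.79 0.88 0.91 ];

  % k-NN density estimate: p(x) = k / (2 * N * d_k(x))
  function retval = knn_density(k, xt, x)
    d = sort(abs(x - xt));   % distances to all samples, ascending
    retval = k / (2 * length(xt) * d(k));
  endfunction

  printf("%.5f\n", knn_density(2, xt, 0.2));   % prints 1.38889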
  size    color  shape   result
  ------  -----  ------  ------
  medium  blue   brick   yes
  small   red    sphere  yes
  large   green  pillar  yes
  large   green  sphere  yes
  small   red    wedge   no
  large   red    wedge   no
  large   red    pillar  no

Table 1: Classification data.

Question 1 (d)

Using k-nearest neighbor with k = 2 and a Gaussian kernel, what is the probability density at 0.2?

With a Gaussian kernel, the probability density is given by

  \hat{p}(x) = \frac{1}{N d_k(x)} \sum_{t=1}^{N} K\left( \frac{x - x^t}{d_k(x)} \right).

For k = 2 and x = 0.2, we have

  \hat{p}(0.2) = \frac{1}{9 d_2(0.2)} \sum_{t=1}^{9} K\left( \frac{0.2 - x^t}{d_2(0.2)} \right)
               = \frac{1}{9 \cdot 0.08} \sum_{t=1}^{9} K\left( \frac{0.2 - x^t}{0.08} \right)
               \approx 1.1136.

Question 2

Use the data in Table 1 to build a decision tree. Stop only with pure leaf nodes.

Attributes on which to split tree nodes are picked according to the entropy-based impurity measure. If, at node m, N_{mj} of N_m samples take branch j, and N^i_{mj} of those belong to class C_i, the post-split impurity of node m is given by

  I'_m = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \sum_{i=1}^{K} \frac{N^i_{mj}}{N_{mj}} \log_2 \frac{N^i_{mj}}{N_{mj}}
       = -\frac{1}{N_m} \sum_{j=1}^{n} \sum_{i=1}^{K} N^i_{mj} \log_2 \frac{N^i_{mj}}{N_{mj}}
       = \frac{1}{N_m} \sum_{j=1}^{n} \left( N_{mj} \log_2 N_{mj} - \sum_{i=1}^{K} N^i_{mj} \log_2 N^i_{mj} \right).
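Before specializing to two classes, here is a small Octave sketch of this impurity computation (illustrative; the function name and the counts layout are mine): counts(j, i) holds N^i_{mj}, the number of class-i samples taking branch j.

  function im = post_split_impurity(counts)
    Nm = sum(counts(:));         % N_m: all samples at node m
    im = 0;
    for j = 1:rows(counts)
      Nmj = sum(counts(j, :));   % N_mj: samples taking branch j
      if Nmj == 0
        continue;                % an empty branch contributes nothing
      endif
      p = counts(j, :) / Nmj;    % class proportions in branch j
      p = p(p > 0);              % 0 log 0 = 0 by convention
      im = im - (Nmj / Nm) * sum(p .* log2(p));
    endfor
  endfunction

  % Splitting the root on shape: branches brick, sphere, pillar, wedge,
  % each row giving the [yes no] counts from Table 1.
  post_split_impurity([ 1 0; 2 0; 1 1; 0 2 ])   % prints ans = 0.2857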
Figure 4: The decision tree. [diagram: root decision node shape with branches brick -> yes, sphere -> yes, pillar -> decision node color, wedge -> no; node color has branches red -> no and green -> yes] The blue branch of decision node color is not shown, since it is unreachable.

For a two-class classification problem, the above can be written as

  I'_m = \frac{1}{N_m} \sum_{j=1}^{n} \left( N_{mj} \log_2 N_{mj} - N^1_{mj} \log_2 N^1_{mj} - N^2_{mj} \log_2 N^2_{mj} \right).

Minimizing the impurity after each split [1, p. 179], the root split is on the shape attribute, after which pillar is the only remaining decision node; it is then split on the color attribute. The resulting decision tree is shown in Fig. 4.

Question 3

Use the same data to learn rules. Learn two rules without pruning.

Here, the rule induction process is according to the Ripper algorithm [1, p. 187], but without pruning. Rules are learned one at a time, and each explains positive samples. The conditions are also added to a rule one at a time, maximizing the information gain

  Gain(R', R) = s \cdot \left( \log_2 \frac{N'_+}{N'} - \log_2 \frac{N_+}{N} \right),

where R' is the rule R after adding one condition, N and N' are the numbers of samples covered by the two rules, N_+ and N'_+ are the numbers of true positives in them, and s is the number of true positives in R that are still true positives in R', i.e., N'_+. After the rule is grown (covers no negative samples), all the samples that it covers are removed from the training set.
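A minimal Octave sketch of this gain computation (illustrative; the function name is mine, and Nq, Nqp stand in for N', N'_+):

  % Ripper-style information gain for refining rule R into R'.
  % N, Np: samples / true positives covered by R; Nq, Nqp: same for R'.
  function g = rule_gain(N, Np, Nq, Nqp)
    s = Nqp;   % true positives of R that remain true positives in R'
    g = s * (log2(Nqp / Nq) - log2(Np / N));
  endfunction

  rule_gain(7, 4, 2, 2)   % condition color = green: ans = 1.6147
  rule_gain(7, 4, 4, 1)   % condition color = red: ans = -1.1926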
As shown in the appendix, when growing the first rule, the first condition maximizing the information gain can be either color = green or shape = sphere. We arbitrarily pick green, and the rule is then complete, since no negative samples are covered. The two relevant samples can now be removed from Table 1. The first rule is thus

  IF color = green THEN yes.

The second rule is similarly

  IF size = medium THEN yes,

but it can also discriminate on several other attribute values.

A Listings

Question 1

xt = [ 0.04 0.08 0.12 0.23 0.44 0.78 0.79 0.88 0.91 ];

% w(u)/2: 1/2 inside the window |u| < 1, zero outside
function retval = naive(x)
  retval = (1 - floor(min(abs(x), 1))) / 2;
endfunction

% Gaussian kernel K(u)
function retval = kernel(x)
  retval = 1 / sqrt(2 * pi) * exp(- (x .* x) / 2);
endfunction

% naive estimate at x: 1/(2Nh) sum w((x - x^t)/h)
function retval = ne(h, xt, x)
  retval = 1 / (length(xt) * h) * sum(naive((x - xt) / h));
endfunction

% kernel estimate at x: 1/(Nh) sum K((x - x^t)/h)
function retval = ke(h, xt, x)
  retval = 1 / (length(xt) * h) * sum(kernel((x - xt) / h));
endfunction

x = [-0.5:0.01:1.5];
for i = 1:length(x)
  y05(i) = ne(0.5, xt, x(i));
  y01(i) = ne(0.1, xt, x(i));
  k02(i) = ke(0.2, xt, x(i));
endfor

resh05 = [x' y05'];
resh01 = [x' y01'];
save res-h05 resh05
save res-h01 resh01

printf("p_K(0.2): %.10f\n", ke(0.2, xt, 0.2));
resk02 = [x' k02'];
save res-k02 resk02

printf("p(0.2): %.10f\n", 1 / (9 * 0.08) * sum(kernel((0.2 - xt) / 0.08)));

Questions 2 and 3
Impurity computations for Question 2:

  Split node  Attribute  Value   N_mj^yes  N_mj^no  N_mj  N_m  Partial impurity  Impurity
  root        size       small       1        1      2     7       0.29            0.86
                         medium      1        0      1     7       0.00
                         large       2        2      4     7       0.57
              color      red         1        3      4     7       0.46            0.46
                         green       2        0      2     7       0.00
                         blue        1        0      1     7       0.00
              shape      brick       1        0      1     7       0.00            0.29
                         sphere      2        0      2     7       0.00
                         pillar      1        1      2     7       0.29
                         wedge       0        2      2     7       0.00
  pillar      size       small       0        0      0     2       0.00            1.00
                         medium      0        0      0     2       0.00
                         large       1        1      2     2       1.00
              color      red         0        1      1     2       0.00            0.00
                         green       1        0      1     2       0.00
                         blue        0        0      0     2       0.00
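As a hand-worked check of the first block (not in the original sheet): the small branch of size holds one yes and one no sample, so its entropy is 1 bit and its partial impurity is (2/7) * 1 ~ 0.29; adding the medium (0.00) and large (0.57) contributions gives the size impurity 0.86. The shape split attains the minimum, 0.29.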
Gain computations for Question 3 (first rule, then second rule after the covered samples are removed):

  Condition  Attribute  Value   N   N_+  N'  N'_+  s   Gain
  root       size       small   7    4   2    1    1  -0.19
                        medium  7    4   1    1    1   0.81
                        large   7    4   4    2    2  -0.39
             color      red     7    4   4    1    1  -1.19
                        green   7    4   2    2    2   1.61
                        blue    7    4   1    1    1   0.81
             shape      brick   7    4   1    1    1   0.81
                        sphere  7    4   2    2    2   1.61
                        pillar  7    4   2    1    1  -0.19
                        wedge   7    4   2    0    0   0.00
  root       size       small   5    2   2    1    1   0.32
                        medium  5    2   1    1    1   1.32
                        large   5    2   2    0    0   0.00
             color      red     5    2   4    1    1  -0.68
                        blue    5    2   1    1    1   1.32
             shape      brick   5    2   1    1    1   1.32
                        sphere  5    2   1    1    1   1.32
                        pillar  5    2   1    0    0   0.00
                        wedge   5    2   2    0    0   0.00
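A note on the rows with N'_+ = 0 (a clarification, not in the original sheet): log_2(N'_+/N') is undefined there, but since s = N'_+ = 0, the gain is taken to be 0.00.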
References

[1] Ethem Alpaydin. Introduction to Machine Learning. The MIT Press, October 2004. ISBN 0-262-01211-1.