Machine Learning, Spring 2011: Homework 1 Solution


February 1, 2011

Instructions: There are 3 questions on this assignment. The last question involves coding. Attach your code to the writeup. Please submit your homework as 3 separate sets of pages according to the TAs, with your name and userid on each set.

1 Information Gain, KL-divergence and Entropy [Xin Chen, 30 points]

1. When we construct a decision tree, the next attribute to split on is the one with maximum mutual information (a.k.a. information gain), which is defined in terms of entropy. In this problem, we will explore its connection to KL-divergence. The KL-divergence from a distribution p(x) to a distribution q(x) can be thought of as a distance measure from p to q:

    KL(p || q) = Σ_x p(x) log( p(x) / q(x) ).

If p(x) = q(x), then KL(p || q) = 0. Otherwise, KL(p || q) > 0.[1] We can define mutual information as the KL-divergence from the observed joint distribution of X and Y to the product of their marginals:

    I(X, Y) ≡ KL( p(x, y) || p(x)p(y) )

(a) Show that this definition of mutual information is equivalent to the one given in class. That is, show that I(X, Y) = H(X) − H(X | Y) and I(X, Y) = H(Y) − H(Y | X) follow from the definition in terms of KL-divergence. From this definition, we can easily see that mutual information is symmetric, i.e. I(X, Y) = I(Y, X). [10pt]

(b) According to this definition, under what conditions do we have I(X, Y) = 0? [5pt]

(a)

    KL( p(x, y) || p(x)p(y) ) = Σ_x Σ_y p(x, y) log( p(x, y) / (p(x)p(y)) )
                              = Σ_x Σ_y p(x, y) ( log p(x, y) − log p(x) − log p(y) )
                              = −Σ_y p(y) log p(y) + Σ_x p(x) Σ_y p(y | x) log p(y | x)
                              = H(Y) − H(Y | X)

Equivalence to H(X) − H(X | Y) can be shown in a similar way.

(b) When X and Y are statistically independent, i.e. p(x, y) = p(x)p(y), we have I(X, Y) = 0.

[1] For more details on KL-divergence, refer to Section 1.6 in Bishop.
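As a quick numerical sanity check of this equivalence (an illustration only, not part of the assignment or its codebase), the following C program computes I(X, Y) for a made-up 2x2 joint distribution both ways, once via the KL form and once as H(Y) − H(Y | X); the two printed values coincide.

    #include <math.h>
    #include <stdio.h>

    /* Entropy of a discrete distribution; 0*log(0) is treated as 0. */
    static double entropy(const double *p, int n) {
        double h = 0.0;
        for (int i = 0; i < n; i++)
            if (p[i] > 0.0) h -= p[i] * log(p[i]);
        return h;
    }

    int main(void) {
        /* A made-up 2x2 joint distribution p(x,y). */
        double pxy[2][2] = { {0.30, 0.20}, {0.10, 0.40} };
        double px[2] = {0.0, 0.0}, py[2] = {0.0, 0.0};
        for (int x = 0; x < 2; x++)
            for (int y = 0; y < 2; y++) {
                px[x] += pxy[x][y];
                py[y] += pxy[x][y];
            }

        /* I(X,Y) as KL( p(x,y) || p(x)p(y) ). */
        double kl = 0.0;
        for (int x = 0; x < 2; x++)
            for (int y = 0; y < 2; y++)
                if (pxy[x][y] > 0.0)
                    kl += pxy[x][y] * log(pxy[x][y] / (px[x] * py[y]));

        /* I(X,Y) as H(Y) - H(Y|X), with H(Y|X) = sum_x p(x) H(Y|X=x). */
        double hyx = 0.0;
        for (int x = 0; x < 2; x++) {
            double cond[2] = { pxy[x][0] / px[x], pxy[x][1] / px[x] };
            hyx += px[x] * entropy(cond, 2);
        }
        double mi = entropy(py, 2) - hyx;

        printf("KL form: %.6f, H(Y)-H(Y|X): %.6f\n", kl, mi); /* identical */
        return 0;
    }

The same 0*log(0) = 0 convention used in entropy() reappears when computing node entropies in Question 3.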

2. In class, we defined entropy for a discrete random variable X. Now consider the case where X is a continuous random variable with probability density function p(x). The entropy is defined as:

    H(X) = −∫ p(x) ln p(x) dx

Assume that X follows a Gaussian distribution with mean µ and variance σ², i.e.

    p(x) = (1 / (√(2π) σ)) exp( −(x − µ)² / (2σ²) )

(a) Derive its entropy H(X). [10pt]

(b) Observe the entropy you derived carefully and indicate one property which holds for the entropy of any discrete random variable, but does not hold here. [5pt]

(a)

    H(X) = −∫ p(x) ln p(x) dx
         = ∫ p(x) ( (1/2) ln(2πσ²) + (x − µ)² / (2σ²) ) dx
         = (1/2) ln(2πσ²) + (1 / (2σ²)) ∫ p(x)(x − µ)² dx
         = (1/2) ( ln(2πσ²) + 1 )

The last equality uses the definition of the variance:

    σ² = var(X) = E( (X − µ)² ) = ∫ p(x)(x − µ)² dx

(b) Note that unlike the entropy of a discrete variable, which is always non-negative, this differential entropy can be negative: when σ² < 1/(2πe), H(X) < 0.
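To make the negativity concrete (again an illustration, not part of the assignment), the following C program compares the closed form (1/2) ln(2πeσ²) against a direct Riemann sum of −p(x) ln p(x); for σ = 0.1 < 1/√(2πe) ≈ 0.242, both evaluate to about −0.88, a negative entropy.

    #include <math.h>   /* M_PI and M_E are POSIX; define them if missing */
    #include <stdio.h>

    /* Differential entropy of N(0, sigma^2) two ways: the closed form
       0.5*ln(2*pi*e*sigma^2) and a Riemann sum of -p(x)*ln p(x). */
    int main(void) {
        double sigmas[] = {1.0, 0.5, 0.1};  /* 0.1 < 1/sqrt(2*pi*e) */
        for (int i = 0; i < 3; i++) {
            double s = sigmas[i];
            double closed = 0.5 * log(2.0 * M_PI * M_E * s * s);
            double h = 0.0, dx = 1e-4;
            /* Integrate over +/- 10 standard deviations. */
            for (double x = -10.0 * s; x <= 10.0 * s; x += dx) {
                double p = exp(-x * x / (2.0 * s * s)) / (sqrt(2.0 * M_PI) * s);
                if (p > 0.0) h -= p * log(p) * dx;
            }
            printf("sigma=%.2f  closed=%.4f  numeric=%.4f\n", s, closed, h);
        }
        return 0;  /* for sigma=0.10 both values are about -0.88 */
    }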

2 Bayes Rule and Point Estimation [Xin Chen, 30 points]

1. Assume the probability of a certain disease is 0.01. The probability of testing positive given that a person is infected with the disease is 0.95, and the probability of testing positive given that the person is not infected with the disease is 0.05.

(a) Calculate the probability of testing positive. [5pt]

(b) Use Bayes Rule to calculate the probability of being infected with the disease given that the test is positive. [5pt]

(a) Given the information in the problem, we have P(D) = 0.01, P(T | D) = 0.95 and P(T | ¬D) = 0.05.

    P(T) = P(T, D) + P(T, ¬D) = P(T | D)P(D) + P(T | ¬D)P(¬D)
         = 0.95 × 0.01 + 0.05 × 0.99 = 0.059

(b)

    P(D | T) = P(T | D)P(D) / P(T) = 0.0095 / 0.059 ≈ 0.161

2. The Poisson distribution is a useful discrete distribution which can be used to model the number of occurrences of something per unit time. For example, in networking, the number of packets arriving in a given time window is often assumed to follow a Poisson distribution. If X is Poisson distributed, i.e. X ~ Poisson(λ), its probability mass function takes the following form:

    P(X | λ) = λ^X e^(−λ) / X!

It can be shown that E(X) = λ. Assume now we have n i.i.d. data points from Poisson(λ): D = {X_1, ..., X_n}. (For the purpose of this problem, you can only use the knowledge about the Poisson and Gamma distributions provided in this problem.)

(a) Show that the sample mean λ̂ = (1/n) Σ_{i=1}^n X_i is the maximum likelihood estimate (MLE) of λ and that it is unbiased (E(λ̂) = λ). [8pt]

(b) Now let's be Bayesian and put a prior distribution over λ. Assume that λ follows a Gamma distribution with parameters (α, β) and probability density function:

    p(λ | α, β) = (β^α / Γ(α)) λ^(α−1) e^(−βλ),

where Γ(α) = (α − 1)! (here we assume α is a positive integer). Compute the posterior distribution over λ. [6pt]

(c) Derive an analytic expression for the maximum a posteriori (MAP) estimate of λ under the Gamma(α, β) prior. [6pt]

(a) Write down the log-likelihood:

    ln P(D | λ) = ln Π_{i=1}^n ( e^(−λ) λ^(X_i) / X_i! ) = −nλ + Σ_{i=1}^n ( X_i ln λ − ln(X_i!) ).

The MLE λ̂ = arg max_λ P(D | λ) = arg max_λ ln P(D | λ), which can be obtained by setting the gradient of ln P(D | λ) with respect to λ to 0. More specifically:

    (d/dλ) ln P(D | λ) = −n + (1/λ) Σ_{i=1}^n X_i = 0  ⟹  λ̂ = (1/n) Σ_{i=1}^n X_i

Since X_1, ..., X_n are i.i.d. from Poisson(λ), for any X_i, E(X_i) = λ. λ̂ is unbiased because:

    E(λ̂) = E( (1/n) Σ_{i=1}^n X_i ) = (1/n) Σ_{i=1}^n E(X_i) = (1/n) · nλ = λ

(b)

    p(λ | D) ∝ P(D | λ) p(λ) = ( Π_{i=1}^n e^(−λ) λ^(X_i) / X_i! ) · (β^α / Γ(α)) λ^(α−1) e^(−βλ)
             ∝ λ^(Σ_i X_i + α − 1) e^(−nλ − βλ)

Therefore, the posterior distribution is p(λ | D) ~ Gamma( Σ_{i=1}^n X_i + α, n + β ).

(c) The MAP estimate is λ* = arg max_λ p(λ | D) = arg max_λ ln p(λ | D). Since p(λ | D) ∝ λ^(Σ_i X_i + α − 1) e^(−(n + β)λ),

    ln p(λ | D) = ( Σ_{i=1}^n X_i + α − 1 ) ln λ − (n + β)λ + C,

where C is a constant with respect to λ. Take the gradient of ln p(λ | D) with respect to λ and set it to 0:

    (d/dλ) ln p(λ | D) = ( Σ_{i=1}^n X_i + α − 1 ) / λ − (n + β) = 0  ⟹  λ* = ( Σ_{i=1}^n X_i + α − 1 ) / ( n + β )
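The closed-form MLE and MAP estimates are straightforward to compute. The following C sketch does so for a made-up sample and prior; the counts and the (α, β) values are hypothetical, chosen only for illustration.

    #include <stdio.h>

    /* MLE and MAP (Gamma(alpha, beta) prior) estimates of a Poisson
       rate, using the closed forms derived above. */
    int main(void) {
        int x[] = {3, 5, 4, 2, 6, 4, 3, 5};           /* hypothetical counts */
        int n = sizeof(x) / sizeof(x[0]);
        double alpha = 2.0, beta = 1.0;                /* hypothetical prior */

        int sum = 0;
        for (int i = 0; i < n; i++) sum += x[i];

        double mle = (double)sum / n;                  /* sample mean */
        double map = (sum + alpha - 1.0) / (n + beta); /* posterior mode */

        printf("MLE = %.4f, MAP = %.4f\n", mle, map);
        /* As n grows, the prior's influence fades and MAP -> MLE. */
        return 0;
    }

Note the (double) cast before the division; see Common mistake 2 under Question 3.1 below.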

3 Decision Tree [Yin Zhang & Carl Doersch, 50 points]

In this question, you will write our decision tree code and perform experiments with it. You will observe and discuss the overfitting and post-pruning of the decision trees. Our data is a binary classification data set with discrete attributes, and we only require the decision tree to be able to process this kind of data. All resources are provided in the file hw1 dt.zip, including the training, validation (i.e., pruning) and testing data sets, a partially implemented decision tree codebase in C (with a few core parts removed), and the necessary instructions to compile and run the codebase.

Note: use of this codebase is not required. If you are not comfortable with coding in C, feel free to choose any other language to implement your own decision tree, as long as the tree can perform the experiments we require on the particular data set we provided. While this codebase is not a part of the question, we included it so that, hopefully, many of you will be able to avoid the tedious implementation details and focus instead on the interesting parts of the decision tree algorithm.

Building the decision tree: we build the decision tree as we learned in class. Given the training set, we start from a single root node with all the training examples assigned to it. Then for each node, if the assigned examples are not pure (i.e., not all with the same label), we consider further splitting this node using a best attribute. Selecting the best attribute for a given node is the most important part of building decision trees, and it is achieved by maximizing the information gain of the split, or equivalently minimizing the weighted average entropy after the split (i.e., the conditional entropy given the attribute, shown as H_S(Y | A) on pages 11 and 12 of the slides). We will stop splitting a node if: 1) the node is pure; or 2) we cannot find any attribute that leads to a positive information gain.
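To illustrate this selection rule (a minimal sketch with a hypothetical data layout, not the actual interface of the provided codebase), the weighted average entropy H_S(Y | A) of a candidate split can be computed from per-branch label counts as follows; the best attribute is the one minimizing this quantity.

    #include <math.h>

    /* Binary entropy from positive/negative counts; pure nodes give 0. */
    static double Entropy2(int pos, int neg) {
        if (pos == 0 || neg == 0) return 0.0;
        double p = (double)pos / (pos + neg);
        return -p * log2(p) - (1.0 - p) * log2(1.0 - p);
    }

    /* Conditional entropy H_S(Y|A) of a split: the entropy of each
       branch, weighted by the fraction of examples in that branch.
       pos[v], neg[v] are the label counts for attribute value v. */
    double SplitEntropy(const int *pos, const int *neg,
                        int n_values, int total) {
        double h = 0.0;
        for (int v = 0; v < n_values; v++)
            h += ((double)(pos[v] + neg[v]) / total)
                 * Entropy2(pos[v], neg[v]);
        return h;  /* split on the attribute minimizing this value */
    }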
Checking a specific node for pruning: pruning a node means removing the subtree beneath it, keeping the node as a leaf. As a result, all training examples assigned to the subtree are assigned to this node. The examples assigned to the node may not all have the same label, and in this case the label attached to the node is the label of the majority class (examples of the minority classes in this node are misclassified, and usually the classification accuracy on the training set will decrease). For performance reasons, we use a criterion different from the lecture's: given a specific node and the validation set (i.e., the pruning set), we prune this node if the classification accuracy of the resulting new tree on the validation set improves by at least EPSILON. EPSILON is the threshold of minimal improvement for pruning. For now, we set EPSILON to 0.005 (i.e., 0.5%); this default value has already been set in our codebase.

Post-pruning a decision tree: top-down and bottom-up. In order to post-prune the entire decision tree, we basically need to perform a tree traversal and check all the nodes along the traversal. We consider depth-first traversal, which can easily be implemented as one function via recursive calls. By placing the recursive calls at different locations in the function, we can make two choices: 1) check the current node before invoking the recursive calls on its children; 2) invoke the recursive calls on its children before checking the current node. Note that if a node is checked and actually pruned, we will no longer travel to its children. We initially call the traversal function at the root node, and clearly the two choices we mentioned will lead to different orders of checking the tree nodes. We call the first one the top-down approach since it checks (and tries to prune) the parent node before recursively checking the children, and we call the second one the bottom-up approach since it checks the children before checking the parent.
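In code, the two approaches differ only in where the recursive call sits relative to the pruning check. Here is a minimal C sketch of both traversals; TryPrune is a hypothetical stand-in for the real check (prune the node and keep the change only if validation accuracy improves by at least EPSILON), not the actual prune-dt.c implementation.

    #include <stddef.h>

    typedef struct Node {
        struct Node **children;
        int n_children;
    } Node;

    /* Hypothetical stand-in: prunes the node and returns 1 only if
       validation accuracy improves by at least EPSILON. */
    static int TryPrune(Node *node) { (void)node; return 0; }

    static void PruneTopDown(Node *node) {
        if (node == NULL) return;
        if (TryPrune(node)) return;           /* pruned: children are gone */
        for (int i = 0; i < node->n_children; i++)
            PruneTopDown(node->children[i]);  /* recurse after checking */
    }

    static void PruneBottomUp(Node *node) {
        if (node == NULL) return;
        for (int i = 0; i < node->n_children; i++)
            PruneBottomUp(node->children[i]); /* recurse before checking */
        TryPrune(node);                       /* check the parent last */
    }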

Implementation and the C codebase. The C codebase provides a decision tree implementation (with a few parts removed by the TAs) with much more functionality than what we need in this question. So if you decide to use the C codebase, you only need to make changes to a few files (as detailed later) without really digging into every detail of this codebase. For a quick guide on how to compile and run the codebase, see quick start.txt in the hw1 dt.zip file.

Data files. We use a noisy mushroom data set for this problem. Using this data set, we will train decision trees to classify each mushroom as poisonous or not, using discrete features such as cap shape, cap color and gill size. There are three data files in hw1 dt.zip: noisy10 train.ssv, noisy10 valid.ssv, and noisy10 test.ssv. They are the training set, validation set (i.e., pruning set), and testing set, respectively. The format of each file is: the first three lines are data statistics (number of variables plus label, variable names, properties of each variable), and from the 4th line on is the data, where each line is an example and each column is either the label (the first column) or a variable. You don't need to worry about the data format if you use our codebase.

3.1 Complete the implementation [20 points]

To fully implement the decision tree using the C codebase, there are mainly two places in the codebase we need to change: 1) entropy.c: the file implementing and using the entropy function to calculate information gain and choose the best splitting attribute when building the decision tree (search for the comment YOU MUST MODIFY THIS FUNCTION in this file to find the place to add your code); 2) prune-dt.c: the file implementing the post-pruning of the tree (search for the comment YOU MUST MODIFY THIS FUNCTION in this file to find the place to add your code). Print the code you added in entropy.c and prune-dt.c and attach it to your homework writeup. Note: if you choose to implement your decision tree without using the codebase, just print and attach your code to the writeup.

See the function Entropy in entropy.c and the function PruneDecisionTree in prune-dt.c from the solution code.

Common mistake 1: not checking the boundary condition when calculating the entropy. If a node contains no positive examples or no negative examples, we should directly return 0.0 as the entropy instead of attempting to calculate it, i.e., we don't want to compute log2(0).

Common mistake 2: not converting to double before calculating the quotient of two numbers. C is not as smart as Matlab and R: we need to make sure at least one of the numerator and denominator is a floating point number before computing their division.
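The guard for Common mistake 1 appears in the SplitEntropy sketch earlier (the Entropy2 helper returns 0.0 for a pure node instead of evaluating log2(0)). Common mistake 2 is worth seeing once in isolation; this tiny program shows how the cast changes the result.

    #include <stdio.h>

    int main(void) {
        int n_pos = 3, n_neg = 5;
        /* Integer division truncates 3/8 to 0 before the assignment. */
        double wrong = n_pos / (n_pos + n_neg);
        /* Casting either operand forces floating-point division. */
        double right = (double)n_pos / (n_pos + n_neg);
        printf("wrong = %.3f, right = %.3f\n", wrong, right); /* 0.000 vs 0.375 */
        return 0;
    }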
3.2 Experiments with different post-pruning strategies [20 points]

We have discussed the top-down and the bottom-up approaches to traverse and prune the tree, which you should have already implemented in prune-dt.c. Run the codebase (or your own implementation) with both approaches, using the training set for building the tree, the validation set for post-pruning, and the testing set to finally test the classification accuracy (again, see quick start.txt for compiling and running the codebase). Report in your homework: 1) for the fully grown tree (without post-pruning): the tree size (i.e., the number of nodes) and the depth of the tree, the classification accuracy on the training set, and the classification accuracy on the testing set; 2) for the post-pruned tree with the top-down approach: the tree size, the classification accuracy on the training set, and the classification accuracy on the testing set; 3) for the post-pruned tree with the bottom-up approach: the tree size, the classification accuracy on the training set, and the classification accuracy on the testing set. Note: all this information can be found in the output when running the codebase. [10 points]

Discuss how the different pruning approaches affect the size of the tree, the training accuracy, and the testing accuracy. Also comment on the difference between the training accuracy and the testing accuracy for each different tree (i.e., the full tree and the two pruned trees). [10 points]

The fully grown tree has 919 nodes and its depth is 12, with a training accuracy of 99.7% and a testing accuracy of 79.6%. The top-down pruned tree has 116 nodes and its depth is 8, with a training accuracy of 89.6% and a testing accuracy of 89.0%. The bottom-up pruned tree has 681 nodes and its depth is 11, with a training accuracy of 92.2% and a testing accuracy of 88.1%.

The top-down pruning strategy tries to prune higher-level nodes (i.e., those close to the root) before attempting to prune lower-level nodes (i.e., those close to the leaves), so it is a more aggressive pruning strategy and tends to produce smaller post-pruned trees. Since the resulting tree is small, i.e., a less complex model, its training accuracy will generally be lower than that of the fully grown tree (which overfits the training samples), but its testing accuracy will usually be higher than that of the fully grown tree, as the less complex model generalizes better to unseen testing samples.

The bottom-up pruning strategy tries to prune children before attempting to prune parents, so it is not as aggressive as the top-down strategy and thus tends to prune fewer nodes and produce larger post-pruned trees (compared to the top-down pruned trees). As a result, the training accuracy of the resulting tree will generally be higher than that of the top-down pruned tree (as it is larger and more complex), but lower than that of the fully grown tree (since the pruned tree is still smaller and thus less complex than the fully grown tree). The testing accuracy of the bottom-up pruned tree is usually higher than that of the fully grown tree (as pruning helps to prevent overfitting). It is difficult to predict which pruned tree will have lower testing accuracy than the other, because both of the following cases could happen: (1) the top-down pruned tree is over-pruned and thus too simple to get good testing accuracy; (2) the bottom-up pruned tree is not sufficiently pruned and thus still overfits the training samples to a certain degree. In our results, the bottom-up pruned tree has slightly lower testing accuracy (88.1%) than the top-down pruned tree (89.0%), indicating that case (2) might be happening here.

The gap between the training accuracy and the testing accuracy is a good indicator of how much the model overfits the training samples. As we can see, the fully grown tree with 919 nodes has a large gap (99.7% training accuracy vs. 79.6% testing accuracy), indicating serious overfitting. The bottom-up pruned tree with 681 nodes has a small gap (92.2% training accuracy vs. 88.1% testing accuracy), indicating slight overfitting. The top-down pruned tree with 116 nodes has almost no gap (89.6% training accuracy vs. 89.0% testing accuracy), indicating almost no overfitting. Finally, we want to clarify that although in this question the smallest tree (i.e., the top-down pruned tree) achieves the best testing accuracy, it is not always the case that the simplest model is the best. An over-simplified model cannot perform well either.

3.3 Experiments with different threshold EPSILON [10 points]

When checking each node, we require a minimal improvement of validation accuracy, EPSILON, for pruning. So far we have used the default EPSILON = 0.005 (i.e., 0.5%). For both top-down and bottom-up pruning, change EPSILON and report the number of nodes in the pruned tree for EPSILON = 0.001, 0.005, 0.01, 0.03. Briefly explain your results (one or two sentences will suffice). NOTE: in the codebase, EPSILON is defined in auxi.h: search for #define EPSILON to find its location.

See Table 1 for detailed results.

Table 1: Number of nodes in the post-pruned tree as EPSILON changes, with one row each for Top-Down and Bottom-Up pruning and one column each for EPSILON = 0.001, 0.005, 0.01, 0.03 (at the default EPSILON = 0.005, the counts are the 116 and 681 nodes reported above).
Generally speaking, a larger EPSILON requires more improvement in validation accuracy to approve a pruning, so increasing EPSILON tends to prune fewer nodes and produce larger post-pruned trees.
