
CS4445 Data Mining and Knowledge Discovery in Databases. B Term 2014
Exam 1
November 24, 2014
Prof. Carolina Ruiz
Department of Computer Science
Worcester Polytechnic Institute

NAME: Prof. Ruiz

Problem I:   (/10 points) Data Preprocessing
Problem II:  (/15 points) Model Evaluation
Problem III: (/30 points) Decision Trees
Problem IV:  (/45 points) Bayesian Models
TOTAL SCORE: (/100 points)

Instructions:
Show your work and justify your answers
Use the space provided to write your answers
Ask in case of doubt

Problem I. Data Preprocessing [10 points]

1. [5 points] What is the difference between simple random sampling and stratified random sampling?

Solution: (Taken from the solutions to Exam 1 CS4445 B Term 2012) Simple random sampling draws data instances at random using a uniform distribution (that is, each data instance is equally likely to be chosen), while stratified random sampling draws data instances at random according to the distribution of the target attribute (so that the subsample preserves the distribution of the target attribute).

2. [5 points] Assume that A is a nominal attribute, other than the target attribute. Consider a missing value for this attribute A.

a. Briefly describe a possible unsupervised method to replace this missing value.

Solution: Replace the missing value with the mode of attribute A. [This is an unsupervised method because it doesn't use the target attribute at all.]

b. Briefly describe a possible supervised method to replace this missing value.

Solution: Replace the missing value with the mode of attribute A on the data instances that have the same classification (target value) as the instance that contains the missing value. [This is a supervised method because it uses the target attribute to modify A.]
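The two sampling schemes and the two imputation methods from Problem I can be sketched in a few lines of Python. This is only an illustration, not part of the exam solution: the function names, the dictionary row format, and the toy data are all assumptions made for the example, and missing values are assumed to be encoded as `None`.

```python
import random
from collections import Counter, defaultdict


def simple_random_sample(rows, k, rng):
    """Draw k rows uniformly at random (every row equally likely)."""
    return rng.sample(rows, k)


def stratified_sample(rows, target, fraction, rng):
    """Draw about `fraction` of the rows separately from each target class,
    so the subsample keeps the class distribution of the full data."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target]].append(row)
    sample = []
    for members in by_class.values():
        sample.extend(rng.sample(members, round(fraction * len(members))))
    return sample


def impute_mode(rows, attr, target=None):
    """Return copies of the rows with missing (None) values of `attr` filled.

    target=None : unsupervised, use the overall mode of `attr`.
    target given: supervised, use the mode among rows of the same class.
    """
    filled = []
    for row in rows:
        if row[attr] is None:
            pool = [r[attr] for r in rows
                    if r[attr] is not None
                    and (target is None or r[target] == row[target])]
            row = {**row, attr: Counter(pool).most_common(1)[0][0]}
        filled.append(row)
    return filled


# Hypothetical toy data: 8 "no" rows, 4 "yes" rows, one value of A missing.
toy = ([{"A": "x", "cls": "no"} for _ in range(8)]
       + [{"A": "y", "cls": "yes"} for _ in range(3)]
       + [{"A": None, "cls": "yes"}])

rng = random.Random(0)
half = stratified_sample(toy, "cls", 0.5, rng)
print(Counter(r["cls"] for r in half))        # class counts in the subsample
print(impute_mode(toy, "A")[-1]["A"])         # unsupervised fill
print(impute_mode(toy, "A", "cls")[-1]["A"])  # supervised fill
```

On this toy data the stratified half always contains 4 "no" rows and 2 "yes" rows, whereas a simple random sample of 6 rows carries no such guarantee. The unsupervised fill picks the overall mode of A, while the supervised fill picks the mode among the rows that share the class of the incomplete row. The sketch does not handle the case where no non-missing value is available to take a mode from.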

Problem II. Model Evaluation [15 points]

1. [10 points] Explain how n-fold cross validation works (to make it easier to explain, use n=10). How is the accuracy reported by this evaluation method computed?

Solution: (Taken from the solutions to Exam 1 CS4445 D term 2003. Against my own suggestion above, I will explain the procedure for a general n rather than using n=10)

Partition the input data into n folds (i.e., mutually disjoint and collectively exhaustive parts), approximately of the same size, at random using stratification. Let's denote those folds as F1, F2, ..., Fn. Now, perform the following process:

For i := 1 to n do
  - construct model Mi using as training data the union of all folds except for Fi. That is, the union of F1, ..., F(i-1), F(i+1), ..., Fn
  - test model Mi on fold Fi, and record the accuracy (or the error) obtained.
End For
Return the average of the accuracies (or of the errors) of all the models Mi.

2. [5 points] Briefly describe an advantage and a disadvantage of this evaluation method.

Solution: [Although performing n-fold cross validation has several advantages, we discuss just one of them here as that's all that is required by the problem statement.]

Advantage: This systematic procedure allows each and every instance in the dataset to be part of the training set in some experiments (n-1 to be precise) and of the test set in other experiments (1 to be precise).

Disadvantage: The process might take a long time, as n models are constructed and tested.
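The n-fold procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are made up for the example, stratification is done by dealing the shuffled instances of each class round-robin into the folds, and a majority-class predictor stands in for the model-construction step (any learner with the same `train`/`predict` shape could be plugged in). It also assumes there are at least n instances, so that no fold is empty.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n, rng):
    """Split instance indices into n disjoint folds, dealing each class
    round-robin so every fold gets roughly the same class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n)]
    slot = 0
    for members in by_class.values():
        rng.shuffle(members)
        for idx in members:
            folds[slot % n].append(idx)
            slot += 1
    return folds


def cross_validate(X, y, n, train, predict, rng):
    """Return the average test accuracy of the n models M1..Mn."""
    folds = stratified_folds(y, n, rng)
    accuracies = []
    for i in range(n):
        test_idx = folds[i]
        # Training data: the union of all folds except fold i.
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        model = train([X[j] for j in train_idx], [y[j] for j in train_idx])
        correct = sum(predict(model, X[j]) == y[j] for j in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / n


# Stand-in learner (an assumption for the example): always predict the
# majority class seen during training.
def train_majority(X, y):
    return Counter(y).most_common(1)[0][0]


def predict_majority(model, x):
    return model


labels = ["no"] * 14 + ["yes"] * 6   # hypothetical class labels
features = list(range(20))           # placeholder features, unused by the learner
print(cross_validate(features, labels, 10,
                     train_majority, predict_majority, random.Random(0)))
```

With 14 "no" and 6 "yes" labels and n=10, every fold holds two instances, and the reported value is the mean of the ten per-fold accuracies, which is the quantity the solution asks for. The loop also makes the stated disadvantage visible: `train` is called n times.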

Problem III. Decision Trees [30 points]

An alternative metric for selecting the best attribute to split a node in a decision tree is the Gini metric. Below are some facts about the Gini metric.

The formulas for the Entropy and for the Gini metrics are:

\[ \mathrm{Entropy}(t) = -\sum_{i=1}^{c} p(i \mid t)\,\log_2 p(i \mid t) \qquad \text{and} \qquad \mathrm{Gini}(t) = 1 - \sum_{i=1}^{c} [p(i \mid t)]^2 \]

where c is the number of classes (i.e., values of the target attribute) and p(i|t) is the relative frequency of class i at node t.

As with Entropy, the Gini value of an attribute is the weighted sum of the Gini values of each of the attribute values.

As with Entropy, the attribute with the lowest Gini value is selected to split the tree node.

Consider the following dataset of 10 data instances. Assume that Defaulted Borrower is the target attribute.

Home Owner (H)   Marital Status (M)   Annual Income (A)   Defaulted Borrower (D)
no               divorced             >85K                yes
yes              divorced             >85K                no
no               married              >85K                no
yes              married              >85K                no
no               married              ≤85K                no
no               married              ≤85K                no
yes              single               >85K                no
no               single               ≤85K                no
no               single               ≤85K                yes
no               single               >85K                yes

The Gini values of the predicting attributes for this dataset are:
Gini value of Home Owner is 0.3428
Gini value of Marital Status is 0.3
Gini value of Annual Income is 0.4166

1. [10 points] Using the formula for Gini, show that the Gini value of Annual Income is indeed 0.4166. Show your work (please use the notation [# of nos, # of yeses] to neatly summarize the counts).

Solution: The [no, yes] counts for ≤85K are [3,1] and the [no, yes] counts for >85K are [4,2].

Gini(A) = Gini([3,1],[4,2])
        = (4/10)*Gini([3,1]) + (6/10)*Gini([4,2])
        = (4/10)*[1 - [(3/4)^2 + (1/4)^2]] + (6/10)*[1 - [(4/6)^2 + (2/6)^2]]
        = (4/10)*[1 - [(9/16) + (1/16)]] + (6/10)*[1 - [(16/36) + (4/36)]]
        = (4/10)*[1 - (10/16)] + (6/10)*[1 - (20/36)]
        = (4/10)*(6/16) + (6/10)*(16/36)
        = (3/20) + (4/15)
        = 0.4166
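The weighted Gini computation above is easy to check mechanically. The sketch below is an illustration only: the function names are made up, and each attribute is represented by its list of [no, yes] count pairs. The Annual Income and Marital Status counts are the ones quoted in the solutions. The Home Owner counts [4,3] and [3,0] are an assumption, read off the numerators of the naïve Bayes conditional probability tables in Problem IV.

```python
def gini(counts):
    """Gini(t) = 1 - sum_i p(i|t)^2 for one node, given its class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_split(partitions):
    """Gini value of an attribute: the sum of the Gini values of its
    attribute values, each weighted by the fraction of instances it covers."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)


# [no, yes] counts per attribute value.
annual_income = [[3, 1], [4, 2]]            # <=85K, >85K
marital_status = [[1, 1], [4, 0], [2, 2]]   # divorced, married, single
home_owner = [[4, 3], [3, 0]]               # assumed counts, see lead-in

for name, parts in [("Home Owner", home_owner),
                    ("Marital Status", marital_status),
                    ("Annual Income", annual_income)]:
    print(name, gini_split(parts))
```

The three printed values agree with the 0.3428, 0.3 and 0.4166 quoted in the problem statement (which are truncated rather than rounded), and Marital Status comes out lowest.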

2. [20 points] Construct the full ID3 decision tree using Gini to rank the predicting attributes (Home Owner, Marital Status, Annual Income) with respect to the target/classification attribute (Defaulted Borrower). For the root node, you can assume that the Gini value of Home Owner is 0.3428, the Gini value of Marital Status is 0.3, and the Gini value of Annual Income is 0.4166 without calculating these values explicitly. For nodes other than the root, show all the steps of your Gini calculations. Make sure to show your work.

Solution: Since Marital Status has the lowest Gini value, it is chosen to split the root node.

For M=divorced (left-most child), the node has no/yes count [1,1]. By simple inspection, Home Owner perfectly splits this node, while Annual Income doesn't split it. Hence, we select Home Owner to split this node.

For M=married (middle child), the node is homogeneous [4,0], so it is converted into a leaf.

For M=single (right-most child), the node is heterogeneous [2,2] and neither Home Owner nor Annual Income splits it perfectly. So we calculate the Gini value of these two attributes for this node:

Gini(H) = Gini([1,2],[1,0])
        = (3/4)*Gini([1,2]) + (1/4)*Gini([1,0])
        = (3/4)*[1 - [(1/3)^2 + (2/3)^2]] + 0
        = (3/4)*[1 - (5/9)]
        = (3/4)*[4/9]
        = 1/3 = 0.33

Gini(A) = Gini([1,1],[1,1])
        = (2/4)*Gini([1,1]) + (2/4)*Gini([1,1])
        = [1 - [(1/2)^2 + (1/2)^2]]
        = [1 - (1/2)]
        = 1/2 = 0.5

Hence, Home Owner is chosen to split this node. The H=yes child node is homogeneous so we make it into a leaf. The H=no child node is heterogeneous, so we split it with the only remaining attribute available in that subtree, namely A. One of the children of A is still heterogeneous [1,1], but since there are no more attributes available to split it, we convert it into a leaf and break the tie by choosing the first class value listed in the dataset, following Weka's convention.

The resulting tree, with the [no, yes] counts at each node:

M [7,3]
  divorced [1,1]: split on H
    H=no   [0,1]
    H=yes  [1,0]
  married [4,0]
  single [2,2]: split on H
    H=no   [1,2]: split on A
      A≤85K  [1,1]
      A>85K  [0,1]
    H=yes  [1,0]
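The construction above can be replayed with a short recursive sketch. Several things here are assumptions made for illustration rather than part of the exam: the ten rows are a reconstruction chosen to be consistent with every count quoted in the solutions, `<=85K` stands for the lower income bin, the function names are hypothetical, and ties at a leaf are broken in favour of the first entry of `CLASS_ORDER` (set to "no" first purely for the example).

```python
from collections import Counter

# (H, M, A, D) rows; an assumed reconstruction consistent with the counts
# used in the solutions ([7,3] overall, [1,1]/[4,0]/[2,2] by marital status).
ROWS = [
    ("no",  "divorced", ">85K",  "yes"),
    ("yes", "divorced", ">85K",  "no"),
    ("no",  "married",  ">85K",  "no"),
    ("yes", "married",  ">85K",  "no"),
    ("no",  "married",  "<=85K", "no"),
    ("no",  "married",  "<=85K", "no"),
    ("yes", "single",   ">85K",  "no"),
    ("no",  "single",   "<=85K", "no"),
    ("no",  "single",   "<=85K", "yes"),
    ("no",  "single",   ">85K",  "yes"),
]
COLUMN = {"H": 0, "M": 1, "A": 2}
TARGET = 3
CLASS_ORDER = ["no", "yes"]  # assumed tie-breaking order


def gini(rows):
    """Gini value of one node, computed from its class frequencies."""
    counts = Counter(r[TARGET] for r in rows)
    return 1.0 - sum((c / len(rows)) ** 2 for c in counts.values())


def split(rows, attr):
    """Group rows by the value of `attr`; return (weighted Gini, groups)."""
    groups = {}
    for r in rows:
        groups.setdefault(r[COLUMN[attr]], []).append(r)
    score = sum(len(g) / len(rows) * gini(g) for g in groups.values())
    return score, groups


def build(rows, attrs):
    """Return a class label (leaf) or a pair (attribute, {value: subtree})."""
    counts = Counter(r[TARGET] for r in rows)
    if len(counts) == 1 or not attrs:
        # Majority class; max() keeps the first of CLASS_ORDER on a tie.
        return max(CLASS_ORDER, key=lambda c: counts[c])
    best = min(attrs, key=lambda a: split(rows, a)[0])
    _, groups = split(rows, best)
    remaining = [a for a in attrs if a != best]
    return (best, {value: build(g, remaining) for value, g in groups.items()})


tree = build(ROWS, ["H", "M", "A"])
print(tree)
```

Under these assumptions the recursion selects M at the root, H under both the divorced and single branches, and A under single with H=no, matching the steps in the solution. The label it assigns to the tied [1,1] leaf depends entirely on the assumed `CLASS_ORDER`.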

Problem IV. Bayesian Models [45 points]

Consider the following dataset, where Defaulted Borrower is the target attribute:

Home Owner (H)   Marital Status (M)   Annual Income (A)   Defaulted Borrower (D)
yes              divorced             >85K                no
no               married              >85K                no
yes              married              >85K                no
no               married              ≤85K                no
no               married              ≤85K                no
yes              single               >85K                no
no               single               ≤85K                no
no               divorced             >85K                yes
no               single               ≤85K                yes
no               single               >85K                yes

1. Naïve Bayes:

a. [5 points] Display the topology of the naïve Bayes graph for the training dataset. [10 points] Compute all of the Conditional Probability Tables (CPTs) in the graph. Show your work neatly.

Solution:

Topology: D is the parent of H, M and A (edges D → H, D → M, D → A).

CPT for D:
          D=no       D=yes
          (7+1)/12   (3+1)/12

CPT for H:
          H=no       H=yes
D=no      (4+1)/9    (3+1)/9
D=yes     (3+1)/5    (0+1)/5

CPT for M:
          divorced   married    single
D=no      (1+1)/10   (4+1)/10   (2+1)/10
D=yes     (1+1)/6    (0+1)/6    (2+1)/6

CPT for A:
          ≤85K       >85K
D=no      (3+1)/9    (4+1)/9
D=yes     (1+1)/5    (2+1)/5

b. [15 points] Determine the Defaulted Borrower value that this naïve Bayes model predicts for the test data instance: Home Owner = yes, Marital Status = single and Annual Income ≤85K (let's abbreviate this as: H=yes, M=single and A≤85K). Show your work in detail.

Solution: The prediction of the Naïve Bayes model for this data instance is:

argmax_v P(D=v | H=yes & M=single & A≤85K)
= argmax_v P(H=yes & M=single & A≤85K | D=v) P(D=v)
= argmax_v P(H=yes | D=v) P(M=single | D=v) P(A≤85K | D=v) P(D=v)    because of the naïve assumption

For D=no:  (4/9) (3/10) (4/9) (8/12) = 16/405 = 0.0395
For D=yes: (1/5) (3/6) (2/5) (4/12) = 1/75 = 0.013

Since D=no gets the highest probability, the naïve Bayes model predicts D=no.
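The naïve Bayes computation above can be reproduced exactly with Python's `fractions` module. The sketch below is illustrative only: the data layout and function names are assumptions, the raw counts are the numerators that appear in the CPT entries, and every estimate adds 1 to the count and the number of possible values to the denominator, which is what the (count+1)/total pattern in the solution corresponds to. The test instance is taken to have H=yes, matching the factors 4/9 and 1/5 used in the solution.

```python
from fractions import Fraction

CLASS_COUNTS = {"no": 7, "yes": 3}

# COUNTS[attribute][class][value] = number of training instances.
COUNTS = {
    "H": {"no": {"no": 4, "yes": 3},
          "yes": {"no": 3, "yes": 0}},
    "M": {"no": {"divorced": 1, "married": 4, "single": 2},
          "yes": {"divorced": 1, "married": 0, "single": 2}},
    "A": {"no": {"<=85K": 3, ">85K": 4},
          "yes": {"<=85K": 1, ">85K": 2}},
}


def prior(cls):
    """Smoothed P(D=cls): (count + 1) / (total + number of classes)."""
    total = sum(CLASS_COUNTS.values())
    return Fraction(CLASS_COUNTS[cls] + 1, total + len(CLASS_COUNTS))


def likelihood(attr, value, cls):
    """Smoothed P(attr=value | D=cls)."""
    table = COUNTS[attr][cls]
    return Fraction(table[value] + 1, CLASS_COUNTS[cls] + len(table))


def score(instance, cls):
    """Unnormalized posterior: P(D=cls) * product of P(attr=value | D=cls)."""
    result = prior(cls)
    for attr, value in instance.items():
        result *= likelihood(attr, value, cls)
    return result


test_instance = {"H": "yes", "M": "single", "A": "<=85K"}
scores = {cls: score(test_instance, cls) for cls in CLASS_COUNTS}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Working in exact fractions avoids rounding and makes it easy to compare against the hand computation: the two scores come out as 16/405 and 1/75, so the argmax is D=no.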

2. Consider the following Bayesian net for the above dataset:

[Figure: Bayesian net over the nodes D, H, M and A, with edges D → M, M → H and A → H]

We want to determine the Defaulted Borrower value that this Bayesian net predicts for the test data instance: H=yes, M=single and A≤85K. One can prove (but you don't need to do so) that the prediction of this Bayesian net will be the following:

Predicted value of D
= argmax_v P(D=v | H=yes & M=single & A≤85K)
= argmax_v P(H=yes & M=single & A≤85K | D=v) P(D=v)
= argmax_v P(H=yes | M=single & A≤85K) P(M=single | D=v) P(A≤85K) P(D=v)

a. [5 points] Assume that all the probability values above are different from 0. Simplify the last line of the derivation above as much as you can, eliminating probability expressions that don't need to be considered. Explain your answer.

Solution: Since P(H=yes | M=single & A≤85K) and P(A≤85K) don't involve D=v, they won't affect the result of the argmax. In other words, they are constant with respect to v. Hence, they can be eliminated from the last line of the derivation above without affecting the result:

= argmax_v P(M=single | D=v) P(D=v)

b. [10 points] Using your simplified formula, determine the Defaulted Borrower value that this Bayesian net will predict for this test data. Calculate explicitly only the entries of the Conditional Probability Tables (CPTs) that you need in order to answer this question. Show your work.

Solution:

argmax_v P(M=single | D=v) P(D=v)

For D=no:  (3/10) (8/12) = 1/5
For D=yes: (3/6) (4/12) = 1/6

[Note that the CPT tables for D and for M in this Bayesian net are identical to the ones calculated for the naïve Bayes model.]

Since D=no gets the highest probability, this Bayesian net model predicts D=no.
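For readers who want the simplification spelled out, the derivation can be written as follows. It assumes the network structure implied by the factorization quoted in the problem (D and A have no parents, M depends on D, and H depends on M and A); the lowercase symbols h, m, a for the observed values are introduced here only for brevity.

```latex
\begin{align*}
P(D{=}v,\, h,\, m,\, a)
  &= P(D{=}v)\; P(a)\; P(m \mid D{=}v)\; P(h \mid m, a) \\[4pt]
\arg\max_{v} P(D{=}v \mid h, m, a)
  &= \arg\max_{v} \frac{P(D{=}v,\, h,\, m,\, a)}{P(h, m, a)} \\
  &= \arg\max_{v} P(m \mid D{=}v)\; P(D{=}v)
     \qquad \text{(factors without } v \text{ are positive constants)} \\[4pt]
P(D{=}\text{no} \mid h, m, a)
  &= \frac{1/5}{1/5 + 1/6} = \frac{6}{11} \approx 0.545, \qquad
P(D{=}\text{yes} \mid h, m, a)
   = \frac{1/6}{1/5 + 1/6} = \frac{5}{11} \approx 0.455
\end{align*}
```

The last line normalizes the two scores 1/5 and 1/6 from part b, which shows that the net favours D=no only narrowly for this instance. It also makes clear that, once M is observed, the values of H and A play no role in this network's prediction of D.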