CSE-4412(M) Midterm
22 February 2007

Sur / Last Name:
Given / First Name:
Student ID:

Instructor: Parke Godfrey
Exam Duration: 75 minutes
Term: Winter 2007

Answer the following questions to the best of your knowledge. Be precise and be careful. The exam is closed-book and closed-notes. Write any assumptions you need to make along with your answers, whenever necessary.

There are five major questions, each worth 10 points, for a total of 50 points. Points for each sub-question are as indicated. If you need additional space for an answer, just indicate clearly where you are continuing.

Regrade Policy
Regrading should only be requested in writing. Write what you would like to be reconsidered. Note, however, that an exam accepted for regrading will be reviewed and regraded in its entirety (all questions).

Grading Box
1.    /10
2.    /10
3.    /10
4.    /10
5.    /10
Total /50
1. (10 points) Misc. Don't eat the Smarties™! Calculation

a. (3 points) John randomly takes a smartie from one of the bowls A, B, or C. (He randomly chose a bowl, then randomly chose a smartie.) We saw that the one he took, and then ate, was red. We also somehow know what the distribution of the smarties in the bowls was before John took one:

           A    B    C
  red     10   20   30    60
  blue    40   30   20    90
          50   50   50   150

We learn that, unfortunately, all the smarties in bowl A are poisoned! (The ones in B and C are fine.) What is the probability that John has been poisoned? State the conditional probability that represents this and calculate the probability as a fraction.

b. (3 points) We learn the following about students and the salaries (high is $200k or more; low is less than $200k) that they earn in their first jobs after graduating. Students who took the data mining course are represented by dm and those who did not by ¬dm.

          low   high
  dm       70     30    100
  ¬dm     830     70    900
          900    100   1000

What is the lift of dm & high? Calculate the number.
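For checking the arithmetic afterwards, here is a minimal Python sketch (illustrative only, not part of any expected answer) that applies Bayes' rule to part (a) and the lift formula to part (b); the counts come straight from the two tables above.

    from fractions import Fraction

    # Bowl contents before John took a smartie: bowl -> (red, blue).
    bowls = {"A": (10, 40), "B": (20, 30), "C": (30, 20)}

    # Bowls are chosen uniformly, so P(bowl) = 1/3; P(red | bowl) comes
    # from the counts. joint[b] = P(bowl = b and red).
    p_bowl = Fraction(1, 3)
    joint = {b: p_bowl * Fraction(r, r + bl) for b, (r, bl) in bowls.items()}
    p_red = sum(joint.values())

    # Bayes' rule: P(poisoned) = P(A | red) = P(A and red) / P(red).
    print(joint["A"] / p_red)

    # Part (b): lift(dm, high) = P(dm and high) / (P(dm) * P(high)).
    n = 1000
    print(Fraction(30, n) / (Fraction(100, n) * Fraction(100, n)))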
c. (4 points) Consider that the following were the frequent 3-itemsets found by the Apriori algorithm:

  I1 I2 I3      I1 I2 I3
  A  B  C       B  C  E
  A  C  D       B  C  F
  A  C  E       B  D  F
  A  C  F       C  D  E
  A  D  F       C  E  F
  A  E  F       D  E  F

Show the 4-itemset pre-candidates that would be generated by the join step. Which of these pre-candidates would then be eliminated by the apriori pruning step (leaving the actual candidates)?
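To make the join and prune steps concrete, here is a short Python sketch of the standard Apriori candidate generation (an illustration, not the answer format the exam expects); the itemsets are the twelve listed above, as sorted tuples.

    from itertools import combinations

    # The frequent 3-itemsets from the question, as sorted tuples.
    L3 = [tuple(s) for s in ["ABC", "ACD", "ACE", "ACF", "ADF", "AEF",
                             "BCE", "BCF", "BDF", "CDE", "CEF", "DEF"]]

    # Join step: merge two 3-itemsets that agree on their first two items,
    # ordering the pair so each pre-candidate is generated exactly once.
    pre_candidates = sorted(p + (q[-1],)
                            for p in L3 for q in L3
                            if p[:-1] == q[:-1] and p[-1] < q[-1])

    # Prune step: a 4-item pre-candidate survives only if every one of
    # its 3-subsets is frequent (the apriori property).
    frequent = set(L3)
    candidates = [c for c in pre_candidates
                  if all(sub in frequent for sub in combinations(c, 3))]

    print(pre_candidates)  # output of the join step
    print(candidates)      # what remains after pruning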
2. (10 points) Association Rule Mining. No, I'm not associating with you. Analysis

An association rule can be generated as follows: For a given frequent itemset L, generate all nonempty, proper subsets of L. For every nonempty, proper subset S of L, output the rule

  S ⇒ (L − S), if support(L) / support(S) ≥ θ_minconf

(in which θ_minconf is the minimum confidence threshold).

a. (5 points) For a frequent itemset L of size k, how many rules should be tested under this method?

b. (5 points) By apriori, we know that for any nonempty subset S′ of S, support(S′) ≥ support(S). Given frequent itemset L and subset S of L, prove that

  confidence(S′ ⇒ (L − S′)) ≤ confidence(S ⇒ (L − S)),

for S′ ⊆ S.
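As a sanity check on part (a), here is a small Python sketch (illustrative only; the helper name rules_from is mine) that enumerates the rules this method generates from a k-itemset and tallies them for small k.

    from itertools import chain, combinations

    # Enumerate the rules S => (L - S) over all nonempty, proper subsets S.
    def rules_from(L):
        L = tuple(L)
        subsets = chain.from_iterable(
            combinations(L, r) for r in range(1, len(L)))
        return [(S, tuple(x for x in L if x not in S)) for S in subsets]

    for k in range(2, 7):
        print(k, len(rules_from(range(k))))  # one rule per nonempty, proper subset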
3. (10 points) General. Rock, scissors, paper... Multiple Choice

Each question below is worth one point. Choose the one best answer for each.

a. Consider the following statements about strong association rules.

  I.   If A ⇒ B then B ⇒ A.
  II.  If A ⇒ B and B ⇒ C then A ⇒ C.
  III. If A ⇒ C then A,B ⇒ C.
  IV.  If A,B ⇒ C then A ⇒ C.

Which of the above statements are true for any A, B, and C?

  A. I & II    B. I, II, & III    C. I, II, & IV    D. II & III    E. none

Assume that the largest frequent itemset is of size k.

b. How many passes does the apriori algorithm need in the worst case?

  A. k − 1    B. k    C. k + 1    D. k^2    E. 2k    F. 2^k − 1

c. There are at least how many frequent itemsets?

  A. k − 1    B. k    C. k + 1    D. k^2    E. 2k    F. 2^k − 1

d. Consider pass i of the Apriori algorithm, in which the frequent i-itemsets are being found. Let n be the number of frequent items. We know that a transaction T can be eliminated during pass i if it is found that T does not contain enough candidate i-itemsets. (T then could never support any candidate j-itemsets for j > i.) So, we will throw away T if it does not contain at least x candidate i-itemsets. What is the largest value for x that we can safely use?

  A. 1    B. 2    C. i − 1    D. i    E. i + 1    F. n
e. Which of the following is true?

  A. If A ⇒ B is an association rule, A and B are positively correlated.
  B. If A ⇒ B is an association rule, A and B are at least not negatively correlated.
  C. If both A ⇒ B and B ⇒ A are association rules, A and B are positively correlated.
  D. If A and B are correlated, then A ⇒ B is a strong association rule.
  E. Association does not imply correlation.

f. Apriori pruning in the search for frequent itemsets works because

  A. support count is monotonic with respect to itemsets.
  B. support count is anti-monotonic with respect to itemsets.
  C. support count diverges as we add to the itemset.
  D. we search in transaction ID order.
  E. it is an excellent heuristic, but it does not work 100% of the time.

g. Naïve Bayesian classification

  A. is guaranteed never to misclassify any of its training data.
  B. needs no prior probabilities.
  C. is theoretically optimal with respect to minimizing classification error, modulo the conditional independence assumption.
  D. works well even with very little training data.
  E. cannot be made to work with continuous attributes.
Consider that we have four classes A, B, C, & D of 100 items each that constitute our sample set.

h. The expected information to classify a given sample is

  A. 0    B. 1/2    C. 1    D. 2    E. π

Keep considering the four classes A, B, C, & D of 100 items each. Consider also a boolean attribute b (let t denote true and f denote false) such that

  s_{A,t} =   0    s_{B,t} = 50    s_{C,t} = 50    s_{D,t} = 100
  s_{A,f} = 100    s_{B,f} = 50    s_{C,f} = 50    s_{D,f} =   0

i. The entropy of b with respect to the sample set is

  A. 0    B. 1    C. 1 1/2    D. 2    E. π

j. Gain(b) is

  A. 0    B. 1/2    C. 1    D. 1 1/2    E. 2    F. π
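As a sanity check on parts (h) through (j), here is a small Python sketch using the standard ID3 definitions of expected information, split entropy, and gain, with the class counts from the table above.

    from math import log2

    def info(counts):
        """Expected information (entropy, in bits) of a class distribution."""
        total = sum(counts)
        return -sum(c / total * log2(c / total) for c in counts if c > 0)

    # Class counts (A, B, C, D) for each value of the boolean attribute b.
    split = {"t": [0, 50, 50, 100],
             "f": [100, 50, 50, 0]}

    whole = [sum(col) for col in zip(*split.values())]  # 100 per class
    n = sum(whole)

    info_D = info(whole)                                        # part (h)
    info_b = sum(sum(c) / n * info(c) for c in split.values())  # part (i)
    print(info_D, info_b, info_D - info_b)                      # (j) = Gain(b)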
4. (10 points) Bayesian Classification. Hey, I'm not that naïve. Exercise

  outlook    temperature  humidity  windy?  play?
  rainy      cool         normal    Y       no
  rainy      cool         normal    N       yes
  rainy      mild         high      Y       no
  rainy      mild         high      N       yes
  rainy      mild         normal    N       yes
  overcast   cool         normal    Y       yes
  overcast   cool         high      Y       no
  overcast   mild         high      Y       yes
  overcast   hot          high      N       yes
  overcast   hot          normal    N       yes
  sunny      cool         normal    N       yes
  sunny      mild         high      N       no
  sunny      mild         normal    Y       yes
  sunny      hot          high      Y       no
  sunny      hot          high      N       no

a. (4 points) Calculate P(C_i | X = ⟨sunny, hot, high, N⟩). How would the naïve Bayes classifier classify the data instance X = ⟨sunny, hot, high, N⟩? Does this agree with the classification given for X = ⟨sunny, hot, high, N⟩ in the table? Justify your answer via calculations.
b. (3 points) Consider a new data instance X = ⟨overcast, cool, high, Y⟩. How would the naïve Bayes classifier classify X? Again, justify your answer via calculations.

c. (3 points) What is a Laplacian correction? Would it be needed for this dataset for any classification calculations?
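To see the whole computation in one place, here is a compact Python sketch of naïve Bayes over the table on the previous page, with an optional laplace parameter to illustrate the correction asked about in part (c). The function nb_score and its unnormalized scoring are my framing, not prescribed by the question; since both class scores share the denominator P(X), normalizing does not change the winning class.

    # Training data from the table: (outlook, temperature, humidity, windy, play?).
    data = [
        ("rainy",    "cool", "normal", "Y", "no"),
        ("rainy",    "cool", "normal", "N", "yes"),
        ("rainy",    "mild", "high",   "Y", "no"),
        ("rainy",    "mild", "high",   "N", "yes"),
        ("rainy",    "mild", "normal", "N", "yes"),
        ("overcast", "cool", "normal", "Y", "yes"),
        ("overcast", "cool", "high",   "Y", "no"),
        ("overcast", "mild", "high",   "Y", "yes"),
        ("overcast", "hot",  "high",   "N", "yes"),
        ("overcast", "hot",  "normal", "N", "yes"),
        ("sunny",    "cool", "normal", "N", "yes"),
        ("sunny",    "mild", "high",   "N", "no"),
        ("sunny",    "mild", "normal", "Y", "yes"),
        ("sunny",    "hot",  "high",   "Y", "no"),
        ("sunny",    "hot",  "high",   "N", "no"),
    ]

    def nb_score(x, cls, laplace=0):
        """Unnormalized naive Bayes score: P(cls) * prod_j P(x_j | cls)."""
        rows = [r for r in data if r[-1] == cls]
        score = len(rows) / len(data)  # prior P(cls)
        for j, value in enumerate(x):
            match = sum(1 for r in rows if r[j] == value)
            values = len({r[j] for r in data})  # distinct values of attribute j
            # laplace > 0 applies the Laplacian correction from part (c).
            score *= (match + laplace) / (len(rows) + laplace * values)
        return score

    for x in [("sunny", "hot", "high", "N"), ("overcast", "cool", "high", "Y")]:
        scores = {c: nb_score(x, c) for c in ("yes", "no")}
        print(x, scores, "->", max(scores, key=scores.get))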
5. (10 points) Decision Tree & Rule Induction. Break all the rules! Short Answer

a. (3 points) Is the ID3 decision tree induction algorithm guaranteed to find an optimal tree (that is, a tree that best classifies the training tuples over all possible trees)? Why or why not?

b. (2 points) When designing a sequential rule induction algorithm, one needs a rule quality measure to choose the next best candidate rule. What two things should this measure balance?
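For part (b), one standard measure from the rule-induction literature is FOIL gain; whether it is the measure this course has in mind is an assumption, and the sketch below is illustrative, not the expected short answer. It rewards a rule refinement that improves accuracy without collapsing coverage.

    from math import log2

    # FOIL gain: pos0/neg0 are the positive/negative tuples covered by the
    # rule before a refinement, pos1/neg1 after it. Accuracy enters through
    # the log-odds improvement; coverage enters through the pos1 multiplier.
    def foil_gain(pos0, neg0, pos1, neg1):
        if pos1 == 0:
            return 0.0
        return pos1 * (log2(pos1 / (pos1 + neg1)) - log2(pos0 / (pos0 + neg0)))

    print(foil_gain(100, 400, 90, 10))  # accurate and broad: large gain
    print(foil_gain(100, 400, 2, 0))    # accurate but tiny coverage: small gain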
c. (3 points) Why is a conflict resolution strategy often necessary for rule-based classifiers?

d. (2 points) List three common conflict resolution strategies for rule-based classifiers.
(Scratch space.)

Relax. Turn in your exam. Go home.