CS 498F: Machine Learning Fall 2010


Instructions: Make sure you write your name on every page you submit. This is a closed-book, closed-note exam. There are 8 questions + 1 extra credit question. Some reminders of material you may need are listed next.

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

Consider the following training data set, to be used in questions 1 and 2. It contains the following attributes:

Floor, which says on which floor the apartment is and could have values First, Middle, or Top
Neighborhood, which could be Great, Ok, or Bad
NearMetro, which is Yes or No
Rent, which is either High or Normal

The target attribute ShouldRent indicates whether or not the apartment is worth renting.

Floor    Neighborhood  NearMetro  Rent    ShouldRent
Middle   Bad           Yes        High    No
Middle   Bad           No         High    No
Top      Bad           Yes        High    Yes
First    Ok            Yes        High    Yes
First    Great         Yes        Normal  Yes
First    Great         No         Normal  No
Top      Great         No         Normal  Yes
Middle   Ok            Yes        High    No
Middle   Great         Yes        Normal  Yes
First    Ok            Yes        Normal  Yes
Middle   Ok            No         Normal  Yes
Top      Ok            No         High    Yes
Top      Bad           Yes        Normal  Yes
First    Ok            No         High    No

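Question 1: (15 Points) Given a new instance X = (Top, Great, Yes, High), what would be the probability P(Y = Yes | X) according to a Naive Bayes classifier? For full credit, show what probabilities Naive Bayes would multiply together to get to P(Y = Yes | X) and how these probabilities are estimated. You do not need to show how Naive Bayes estimates all of its probabilities; just the ones necessary for the computation of P(Y = Yes | X) are sufficient.

$$P(Y = \mathrm{Yes} \mid T, G, Y, H) \propto P(T \mid \mathrm{Yes})\,P(G \mid \mathrm{Yes})\,P(Y \mid \mathrm{Yes})\,P(H \mid \mathrm{Yes})\,P(\mathrm{Yes}) = \frac{4}{9} \cdot \frac{3}{9} \cdot \frac{6}{9} \cdot \frac{3}{9} \cdot \frac{9}{14} = \frac{4}{189} \approx 0.0212 = A$$

Each factor is estimated as a relative frequency in the training data. For example, P(T | Yes) = 4/9 because 4 of the 9 apartments with ShouldRent = Yes are on the Top floor.

Everyone who got that far got full credit. A 100% correct solution is one that notices that this is not an exact probability because we are not dividing by P(T, G, Y, H). To get a correct probability, we would need to compute

$$P(Y = \mathrm{No} \mid T, G, Y, H) \propto P(T \mid \mathrm{No})\,P(G \mid \mathrm{No})\,P(Y \mid \mathrm{No})\,P(H \mid \mathrm{No})\,P(\mathrm{No}) = B$$

and compute P(Y = Yes | T, G, Y, H) as A / (A + B). B turns out to be 0, because no apartment with ShouldRent = No is on the Top floor and so P(T | No) = 0/5. Therefore P(Y = Yes | T, G, Y, H) = 1.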
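The computation in Question 1 can be checked with a short Python sketch. It assumes plain relative-frequency estimates with no smoothing, which is what the solution above uses; the names `DATA` and `nb_score` are illustrative and not part of the exam.

```python
from fractions import Fraction

# Training set from the exam: (Floor, Neighborhood, NearMetro, Rent, ShouldRent).
DATA = [
    ("Middle", "Bad", "Yes", "High", "No"),
    ("Middle", "Bad", "No", "High", "No"),
    ("Top", "Bad", "Yes", "High", "Yes"),
    ("First", "Ok", "Yes", "High", "Yes"),
    ("First", "Great", "Yes", "Normal", "Yes"),
    ("First", "Great", "No", "Normal", "No"),
    ("Top", "Great", "No", "Normal", "Yes"),
    ("Middle", "Ok", "Yes", "High", "No"),
    ("Middle", "Great", "Yes", "Normal", "Yes"),
    ("First", "Ok", "Yes", "Normal", "Yes"),
    ("Middle", "Ok", "No", "Normal", "Yes"),
    ("Top", "Ok", "No", "High", "Yes"),
    ("Top", "Bad", "Yes", "Normal", "Yes"),
    ("First", "Ok", "No", "High", "No"),
]


def nb_score(x, label):
    """Unnormalized Naive Bayes score: P(label) * product of P(x_i | label)."""
    rows = [r for r in DATA if r[-1] == label]
    score = Fraction(len(rows), len(DATA))  # class prior
    for i, value in enumerate(x):
        matches = sum(1 for r in rows if r[i] == value)
        score *= Fraction(matches, len(rows))  # relative-frequency estimate
    return score


x = ("Top", "Great", "Yes", "High")
a = nb_score(x, "Yes")        # the quantity called A in the solution
b = nb_score(x, "No")         # the quantity called B in the solution
posterior_yes = a / (a + b)   # normalize so the two classes sum to 1
print(a, b, posterior_yes)
```

Using `Fraction` keeps the arithmetic exact, so the zero count behind B is visible directly rather than hidden in floating-point rounding.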

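Question 2: (15 Points) In the apartment renting example, which of the two attributes, Floor or Neighborhood, would ID3 decision tree learning prefer to split on first? For full credit, show your complete reasoning.

Let S be the given training set.

$$\mathrm{Entropy}(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940$$

Values(Floor) = {M, T, F}

$S_M = [2+, 3-]$: $\mathrm{Entropy}(S_M) = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} \approx 0.971$
$S_T = [4+, 0-]$: $\mathrm{Entropy}(S_T) = 0$
$S_F = [3+, 2-]$: $\mathrm{Entropy}(S_F) = -\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5} \approx 0.971$

$$\mathrm{Gain}(S, \mathrm{Floor}) = \mathrm{Entropy}(S) - \tfrac{5}{14}\mathrm{Entropy}(S_M) - \tfrac{4}{14}\mathrm{Entropy}(S_T) - \tfrac{5}{14}\mathrm{Entropy}(S_F) \approx 0.247$$

Values(Neighborhood) = {B, O, G}

$S_B = [2+, 2-]$: $\mathrm{Entropy}(S_B) = 1$
$S_O = [4+, 2-]$: $\mathrm{Entropy}(S_O) = -\tfrac{4}{6}\log_2\tfrac{4}{6} - \tfrac{2}{6}\log_2\tfrac{2}{6} \approx 0.918$
$S_G = [3+, 1-]$: $\mathrm{Entropy}(S_G) = -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} \approx 0.811$

$$\mathrm{Gain}(S, \mathrm{Neighborhood}) = \mathrm{Entropy}(S) - \tfrac{4}{14}\mathrm{Entropy}(S_B) - \tfrac{6}{14}\mathrm{Entropy}(S_O) - \tfrac{4}{14}\mathrm{Entropy}(S_G) \approx 0.029$$

Floor has higher gain, so ID3 will prefer it.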
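The gain comparison in Question 2 can be reproduced with the following Python sketch of the Entropy and Gain formulas from the instructions page. The helper names (`DATA`, `COLUMNS`, `entropy`, `gain`) are illustrative assumptions, not part of the exam.

```python
from collections import Counter
from math import log2

# Training set from the exam: (Floor, Neighborhood, NearMetro, Rent, ShouldRent).
DATA = [
    ("Middle", "Bad", "Yes", "High", "No"),
    ("Middle", "Bad", "No", "High", "No"),
    ("Top", "Bad", "Yes", "High", "Yes"),
    ("First", "Ok", "Yes", "High", "Yes"),
    ("First", "Great", "Yes", "Normal", "Yes"),
    ("First", "Great", "No", "Normal", "No"),
    ("Top", "Great", "No", "Normal", "Yes"),
    ("Middle", "Ok", "Yes", "High", "No"),
    ("Middle", "Great", "Yes", "Normal", "Yes"),
    ("First", "Ok", "Yes", "Normal", "Yes"),
    ("Middle", "Ok", "No", "Normal", "Yes"),
    ("Top", "Ok", "No", "High", "Yes"),
    ("Top", "Bad", "Yes", "Normal", "Yes"),
    ("First", "Ok", "No", "High", "No"),
]
COLUMNS = {"Floor": 0, "Neighborhood": 1, "NearMetro": 2, "Rent": 3}


def entropy(rows):
    """Entropy (in bits) of the ShouldRent label over the given rows."""
    counts = Counter(r[-1] for r in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())


def gain(rows, attribute):
    """Information gain from splitting the rows on the named attribute."""
    col = COLUMNS[attribute]
    remainder = 0.0
    for value in {r[col] for r in rows}:
        subset = [r for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder


for name in ("Floor", "Neighborhood"):
    print(name, round(gain(DATA, name), 3))
```

Only label values that actually occur in a subset are counted, so a pure subset such as the Top-floor apartments contributes zero entropy without evaluating log2(0).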

Question 3:

a) (5 points) In an instance space X of finite size m, what is the number of possible hypotheses that can be defined over X, assuming that the target attribute of each instance is Boolean (i.e. each instance can be labeled as either 0 or 1)?

2^m (m instances, 2 choices for each)

b) (8 points) What would be the number of possible hypotheses if, instead of being Boolean, the target attribute of each instance can take on the values Red, Green, or Blue?

3^m (m instances, 3 choices for each)

Question 4: Explain:

a) (5 points) What is overfitting?

Overfitting occurs when the hypothesis fits the training data too closely, such that accuracy on the training data continues to increase while accuracy on unseen test data decreases.

b) (5 points) What is a property of the training data that may cause overfitting?

Noise, small sample size

c) (5 points) What is a technique that can help avoid overfitting?

Cross-validation, pruning

Question 5: (12 Points) Design a perceptron unit that computes the function NAND(x1, x2), where x1 and x2 are Boolean variables (NAND is the negated AND function).

One option is 1.5 - x1 - x2. It returns a positive value whenever at least one of the inputs is 0, and a negative value when they are both 1.
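As a quick check of Question 5, the sketch below evaluates a threshold unit on all four Boolean inputs. The weights w0 = 1.5, w1 = w2 = -1 are an assumed choice that matches the behaviour described in the answer (positive whenever at least one input is 0, negative when both are 1); other weight settings work as well.

```python
from itertools import product


def perceptron(x1, x2, w0=1.5, w1=-1.0, w2=-1.0):
    """Threshold unit: outputs 1 if w0 + w1*x1 + w2*x2 > 0, else 0.

    The default weights are an assumed example, not the only valid answer.
    """
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else 0


# Print the truth table and compare against NAND.
for x1, x2 in product((0, 1), repeat=2):
    print(x1, x2, perceptron(x1, x2))
```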

Question 6: (10 points) Consider the scenario where the instance space X consists of all points on the x,y plane, i.e. each instance X is described by two real numbers x, y. Let H be the set of hypotheses described as circles (note that circles are different from ellipses). Can a set of three instances be shattered? If yes, show an example; if no, give an argument why.

Yes

Question 7: (10 points) What is a consistent learner?

One that outputs a hypothesis that is 100% correct on the training data whenever possible.

Question 8: (10 Points) In the weighted majority algorithm, what would be the final weight of a predictor that makes exactly k mistakes, assuming that we are using a discount rate β = 1/5? What is the effect of varying β (i.e. how does the behavior of the algorithm change as we increase or decrease β)?

$\beta^k = (1/5)^k$, assuming the predictor starts with weight 1. Larger values of β lead to smaller penalties for predictors when they make mistakes. In the extreme case, as β tends to 0, weighted majority morphs into a predictor-elimination algorithm that eliminates a predictor as soon as it makes a single mistake.
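The weight update in Question 8 can be illustrated with a minimal Python sketch of weighted majority for binary predictions. It assumes every predictor starts with weight 1 and that ties in the vote go to label 1; the function name and the toy data are hypothetical. Each mistake multiplies a predictor's weight by β, so a predictor with k mistakes ends at β^k.

```python
def weighted_majority(predictions, outcomes, beta=0.2):
    """Run weighted majority over a sequence of rounds.

    predictions[t][i] is predictor i's 0/1 guess in round t and
    outcomes[t] is the true label. Returns (final weights, number of
    mistakes made by the combined weighted vote).
    """
    weights = [1.0] * len(predictions[0])  # assumed initial weight of 1
    mistakes = 0
    for guesses, truth in zip(predictions, outcomes):
        vote_one = sum(w for w, g in zip(weights, guesses) if g == 1)
        vote_zero = sum(w for w, g in zip(weights, guesses) if g == 0)
        combined = 1 if vote_one >= vote_zero else 0  # ties go to 1 (assumption)
        mistakes += int(combined != truth)
        # Discount every predictor that was wrong in this round.
        weights = [w * beta if g != truth else w
                   for w, g in zip(weights, guesses)]
    return weights, mistakes


# Hypothetical run: predictor 0 is always right, predictor 1 errs in 3 of 4 rounds.
predictions = [(1, 0), (0, 1), (1, 1), (0, 1)]
outcomes = [1, 0, 1, 0]
weights, mistakes = weighted_majority(predictions, outcomes)
print(weights, mistakes)
```

Setting `beta` close to 1 barely penalizes mistakes, while `beta=0` zeroes a predictor's weight after its first mistake, which is the elimination behaviour described in the answer.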

Question 9: (10 points Extra Credit) Consider the scenario where the instance space X consists of all points on the x,y plane, i.e. each instance X is described by two real numbers x, y. Let H be the set of hypotheses described as squares (not necessarily axis-aligned) in the x, y plane. What is VC(H)?

3 points can be shattered with an example analogous to one in question 6. 4 points can also be shattered (thanks to Michael Stockman for pointing this out with the following very nice construction). The green square shows how we can represent the dichotomy where D and C are in and A and B are out. The red square shows things the other way around. The remaining dichotomies are straightforward and are not shown.
