CS47300 Fall 2017 Assignment 3 solutions

Alison May Wiggins

A. Non-Linear Classification (10 points)

Say we have the following document set, plotted as points (x, y), where x is rain and y is wind.

The following represent documents with many mentions of wind (Group A):
(0, 3), (0, 2)

The following represent documents with many mentions of rain (Group B):
(3, 0), (2, 0)

Lastly, these represent documents with many mentions of both wind and rain (Group C):
(3, 3), (2, 2), (1, 1), (4, 4)

Now assume we group these into the categories Normal Weather (assuming windy or rainy weather on its own is not considered bad weather) and Hurricane/Bad Weather. Groups A and B could be classified as Normal Weather, while Group C could be categorized as Bad Weather. These categories cannot be separated linearly, because no single line can be drawn with all Normal Weather documents on one side and all Bad Weather documents on the other. Instead, a 2-dimensional shape must be used. (Credit to Adam Johnston)

Rubrics:
(1) Give an example of documents with actual words [2 points]
(2) Show the representation in vector space [2 points] with labels [1 point]
(3) Explain why it is not linearly separable (with graph or words) [5 points]

B. Bayes Classifier (10 points)

(1) This is not practical for information retrieval systems because we would have to estimate far too many parameters. We would need estimates for every possible combination of the terms, which is not feasible. Instead, we assume that terms are independent to reduce the number of estimates needed, giving us Naïve Bayes. [8 points]

(2) This estimation is problematic if x is never labeled as r in the training set, i.e. if P(Y=r & X=x) = 0. In practical information retrieval systems, we frequently come across term vectors that do not appear in our training set, so this estimate would be 0 and our model would be unable to generalize well. [8 points]

Having both (1) and (2) will be 10 points.
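The non-separability claim in part A can be checked numerically. The sketch below is an illustration only, not part of the graded solution. It looks for a Bad Weather point that lies on the segment between two Normal Weather points. A linear decision function w.x + b takes, at any point of a segment, a value between its values at the two endpoints, so such a point can never be placed strictly on the opposite side from both endpoints. The helper name lies_on_segment is made up for this example.

```python
from itertools import combinations

# Points from part A as (rain, wind).
normal = [(0, 3), (0, 2), (3, 0), (2, 0)]   # Groups A and B: Normal Weather
bad = [(3, 3), (2, 2), (1, 1), (4, 4)]      # Group C: Bad Weather


def lies_on_segment(p, a, b):
    """Return True if point p lies on the closed segment from a to b."""
    # Zero cross product means p, a and b are collinear.
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    if cross != 0:
        return False
    # Collinear: p is on the segment if it is inside the bounding box of a, b.
    return (min(a[0], b[0]) <= p[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= p[1] <= max(a[1], b[1]))


# Every (bad point, normal point, normal point) triple that rules out a line.
witnesses = [
    (p, a, b)
    for p in bad
    for a, b in combinations(normal, 2)
    if lies_on_segment(p, a, b)
]

for p, a, b in witnesses:
    print(f"Bad point {p} lies between Normal points {a} and {b}")
```

With the points above, the only witness is (1, 1), which is the midpoint of (0, 2) from Group A and (2, 0) from Group B. One witness is enough to rule out a separating line.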

C.

1. You can't say anything about the classifier with just the micro-averaged F1 score. It might be good or it might be bad. We can make a classifier with a higher F1 score as follows: classify all documents as belonging to class C.

Then the confusion matrices for each class are as follows (first column: documents assigned to the class, second column: documents not assigned to the class):

Class A
Relevant     0   2
Irrelevant   0   6

Class B
Relevant     0   2
Irrelevant   0   6

Class C
Relevant     4   0
Irrelevant   4   0

Micro-averaged precision = 4 / (4 + 4) = 0.5
Micro-averaged recall = 4 / (4 + 4) = 0.5
Micro-averaged F1 = 0.5
Macro-averaged precision = (0 + 0 + 0.5) / 3 = 0.17
Macro-averaged recall = (0 + 0 + 1) / 3 = 0.33
Macro-averaged F1 = 0.22

In this case, macro-averaged F1 is a better measure than micro-averaged F1 because the class sizes are skewed.

2. When the F1 scores of all classes are equal.

F1(A) = 2 TP(A) / (2 TP(A) + FP(A) + FN(A))
F1(B) = 2 TP(B) / (2 TP(B) + FP(B) + FN(B))
F1(C) = 2 TP(C) / (2 TP(C) + FP(C) + FN(C))

Micro-averaged F1 = 2 (TP(A) + TP(B) + TP(C)) / (2 (TP(A) + TP(B) + TP(C)) + FP(A) + FP(B) + FP(C) + FN(A) + FN(B) + FN(C))

Macro-averaged F1 = (F1(A) + F1(B) + F1(C)) / 3

If every class has the same F1 score f, then 2 TP(k) = f (2 TP(k) + FP(k) + FN(k)) for each class k. Summing over the classes shows that the micro-averaged F1 is also f, which is the macro average.

Grading:

(5/5) Both are equal if the F1 scores of all classes are equal.

(0/5) For the argument "class sizes need to be equal", consider the following counter-example: 3 docs in each class, and all except 1 (correctly classified) doc in each class are classified as class C. Equal class sizes are neither necessary nor sufficient.

(4/5) For the argument "precision and recall of all classes need to be equal", consider the following counter-example. The condition is sufficient but not necessary.

Class A: 10 docs
Relevant     6   4
Irrelevant   9   21

Class B: 15 docs
Relevant     6   9
Irrelevant   4   21

Class C: 15 docs
Relevant     6   9
Irrelevant   4   21

In this example every class has F1 = 12/25 = 0.48, while precision and recall differ (0.4 and 0.6 for class A; 0.6 and 0.4 for classes B and C).

D. Naïve Bayes

1. All of the estimated probabilities are given below. You may use Laplace smoothing.

P(Bronchitis) = 3/6 = 0.5
P(Tuberculosis) = 3/6 = 0.5
P(Shadow_on_xray | Bronchitis) = count(shadow_on_xray | Bronchitis) / (# words in Cat_Bronchitis) = 2/7
P(Dyspnea | Bronchitis) = count(dyspnea | Bronchitis) / (# words in Cat_Bronchitis) = 2/7
P(Lung_inflammation | Bronchitis) = count(lung_inflammation | Bronchitis) / (# words in Cat_Bronchitis) = 3/7
P(Lung_inflammation | Tuberculosis) = count(lung_inflammation | Tuberculosis) / (# words in Cat_Tuberculosis) = 1/4
P(Shadow_on_xray | Tuberculosis) = count(shadow_on_xray | Tuberculosis) / (# words in Cat_Tuberculosis) = 2/4
P(Dyspnea | Tuberculosis) = count(dyspnea | Tuberculosis) / (# words in Cat_Tuberculosis) = 1/4

2. Category Classification

P(Bronchitis | Shadow_on_xray, Lung_inflammation) ~ P(Bronchitis) * P(Shadow_on_xray | Bronchitis) * P(Lung_inflammation | Bronchitis) ~ (0.5) * (2/7) * (3/7) ~ 0.0612

P(Tuberculosis | Shadow_on_xray, Lung_inflammation) ~ P(Tuberculosis) * P(Shadow_on_xray | Tuberculosis) * P(Lung_inflammation | Tuberculosis) ~ (0.5) * (2/4) * (1/4) ~ 0.0625

Therefore, the category will be Tuberculosis.

E. K-Means Clustering

This may vary depending on the assumption about term representation. The answer below is one sample answer.

Assume we use TF-IDF term weighting. Each document vector is represented as [cat mouse ate slept and]^T. Here log is the natural logarithm, which matches the value 0.47 given for G.

A: TF-IDF(cat) = (1/2)*log(8/4) = 0.35
   TF-IDF(ate) = (1/2)*log(8/4) = 0.35
   doc vector of A = [0.35 0 0.35 0 0]^T

B: TF-IDF(cat) = (1/2)*log(8/4) = 0.35
   TF-IDF(slept) = (1/2)*log(8/2) = 0.69
   doc vector of B = [0.35 0 0 0.69 0]^T

C: TF-IDF(mouse) = (1/2)*log(8/5) = 0.24
   TF-IDF(ate) = (1/2)*log(8/4) = 0.35
   doc vector of C = [0 0.24 0.35 0 0]^T

D: TF-IDF(mouse) = (1/2)*log(8/5) = 0.24
   TF-IDF(slept) = (1/2)*log(8/2) = 0.69
   doc vector of D = [0 0.24 0 0.69 0]^T

E: TF-IDF(cat) = (1/3)*log(8/4) = 0.23
   TF-IDF(ate) = (1/3)*log(8/4) = 0.23
   TF-IDF(mouse) = (1/3)*log(8/5) = 0.16
   doc vector of E = [0.23 0.16 0.23 0 0]^T

F: TF-IDF(cat) = (1/1)*log(8/4) = 0.69
   doc vector of F = [0.69 0 0 0 0]^T

G: TF-IDF(mouse) = (1/1)*log(8/5) = 0.47
   doc vector of G = [0 0.47 0 0 0]^T

H: TF-IDF(mouse) = (1/4)*log(8/5) = 0.12
   TF-IDF(ate) = (2/4)*log(8/4) = 0.35
   TF-IDF(and) = (1/4)*log(8/1) = 0.52
   doc vector of H = [0 0.12 0.35 0 0.52]^T

Then we run K-means clustering and get {A, B, F}, {D, G}, and {C, E, H}.
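The TF-IDF expressions in part E can be evaluated with a few lines of Python. The sketch below is an illustration under two stated assumptions, not the official solution. First, log is taken as the natural logarithm, because that reproduces the 0.47 given for G. Second, the clustering check uses squared Euclidean distance, which the solution does not specify. The term and document frequencies are copied from the expressions above. The names tfidf_vector, centroid and sq_dist are made up for this example.

```python
import math

VOCAB = ["cat", "mouse", "ate", "slept", "and"]
N_DOCS = 8

# Document frequency of each term, read off the log(8/df) factors.
DF = {"cat": 4, "mouse": 5, "ate": 4, "slept": 2, "and": 1}

# Term frequency of each term per document, read off the (count/length) factors.
TF = {
    "A": {"cat": 1 / 2, "ate": 1 / 2},
    "B": {"cat": 1 / 2, "slept": 1 / 2},
    "C": {"mouse": 1 / 2, "ate": 1 / 2},
    "D": {"mouse": 1 / 2, "slept": 1 / 2},
    "E": {"cat": 1 / 3, "ate": 1 / 3, "mouse": 1 / 3},
    "F": {"cat": 1.0},
    "G": {"mouse": 1.0},
    "H": {"mouse": 1 / 4, "ate": 2 / 4, "and": 1 / 4},
}


def tfidf_vector(tf):
    """TF-IDF weights in VOCAB order (assumption: natural logarithm)."""
    return [tf.get(term, 0.0) * math.log(N_DOCS / DF[term]) for term in VOCAB]


vectors = {doc: tfidf_vector(tf) for doc, tf in TF.items()}


def centroid(docs):
    """Component-wise mean of the vectors of the given documents."""
    return [sum(vectors[d][i] for d in docs) / len(docs)
            for i in range(len(VOCAB))]


def sq_dist(u, v):
    """Squared Euclidean distance (an assumption; the source names no metric)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Clustering reported in the solution.
clusters = [["A", "B", "F"], ["D", "G"], ["C", "E", "H"]]
centroids = [centroid(c) for c in clusters]

# One K-means assignment step: index of the nearest centroid for each document.
nearest = {
    doc: min(range(len(centroids)), key=lambda k: sq_dist(vec, centroids[k]))
    for doc, vec in vectors.items()
}

# The clustering is a fixed point if no document would switch cluster.
stable = all(nearest[d] == k for k, c in enumerate(clusters) for d in c)

for doc, vec in vectors.items():
    print(doc, [round(x, 2) for x in vec])
print("reported clustering is stable:", stable)
```

The final check only confirms that no document would change cluster in a further assignment step. It does not show which initial centroids lead to this result, since K-means can converge to different clusterings from different starting points.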
