Classification
Karsten Borgwardt, Department Biosysteme
Data Mining Course, Basel, Fall Semester

Transcription of the lecture slides
1 Classification. Karsten Borgwardt, Department Biosysteme, Data Mining Course, Basel, Fall Semester

2 What is Classification? Problem and Examples
Given an object, which class of objects does it belong to? Given an object x, predict its class label y.
Examples: Computer vision: Is this object a chair? Credit cards: Is this customer to be trusted? Marketing: Will this customer buy/like our product? Function prediction: Is this protein an enzyme? Gene finding: Does this sequence contain a splice site? Personalized medicine: Will this patient respond to drug treatment?

3 What is Classification? Setting and the Role of y
Classification is usually performed in a supervised setting: we are given a training dataset, that is, a dataset of pairs {(x_i, y_i)}_{i=1}^{n} of objects and their known class labels. The test set is a dataset of points {x_i}_{i=1}^{d} with unknown class labels. The task is to predict the class label y_i of each test point x_i via a function f.
If y ∈ {0, 1}, we are dealing with a binary classification problem; if y ∈ {1, ..., n} with 3 ≤ n ∈ ℕ, a multiclass classification problem; if y ∈ ℝ, a regression problem.

4 Evaluating Classifiers: The Contingency Table
In a binary classification problem, one can represent the accuracy of the predictions in a contingency table:

              y = 1    y = −1
f(x) = 1       TP       FP
f(x) = −1      FN       TN

Here, T refers to True, F to False, P to Positive (prediction) and N to Negative (prediction).

5 Evaluating Classifiers: Accuracy
The accuracy of a classifier is defined as (TP + TN) / (TP + TN + FP + FN).
Accuracy measures which percentage of the predictions is correct. It is the most common criterion for reporting the performance of a classifier. Still, it has a fundamental shortcoming: if the classes are unbalanced, the accuracy on the entire dataset may look high while being low on the smaller class.

6 Evaluating Classifiers: Precision and Recall
If the positive class is much smaller than the negative class, one should rather use precision and recall to evaluate the classifier. The precision of a classifier is defined as TP / (TP + FP). The recall of a classifier is defined as TP / (TP + FN).
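
These quantities are simple to compute once the predictions are available. A minimal sketch in Python (not part of the slides; labels are assumed to be in {−1, +1} and the variable names are illustrative):

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return TP, FP, FN, TN for labels in {-1, +1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == -1))
    fn = np.sum((y_pred == -1) & (y_true == 1))
    tn = np.sum((y_pred == -1) & (y_true == -1))
    return tp, fp, fn, tn

y_true = [1, 1, -1, -1, -1, 1]
y_pred = [1, -1, -1, 1, -1, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
print(accuracy, precision, recall)
```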

7 Evaluating Classifiers: Trade-off between Precision and Recall
There is a trade-off between precision and recall: by predicting all points to be positive (f(x) = 1 for all x), one can guarantee that the recall is 1; however, the precision will then be poor. By only predicting points to be members of the positive class when one is highly confident about the prediction, one will increase precision but lower recall. One workaround is to report the precision-recall break-even point, that is, the value at which precision and recall are identical.

8 Evaluating Classifiers: Dependence on the Classification Threshold
TP, TN, FP, FN depend on f(x) for x ∈ D. The most common definition of f(x) is
f(x) = 1 if s(x) ≥ θ, and f(x) = −1 if s(x) < θ,
where s : D → ℝ is a scoring function and θ ∈ ℝ is a threshold. As the predictions based on f vary with θ, so do TP, TN, FP, FN, and all evaluation criteria based on them. It is therefore important to report results as a function of θ whenever possible, not just for one fixed choice of θ.

9 Evaluating Classifiers: How to Report Results as a Function of θ
The efficient strategy to compute all solutions as a function of θ is to rank all points x by their score s(x). This ranking is a vector r of length t, whose ith element is r(i). We then perform the following steps:
For i = 1 to t − 1:
  define the positive predictions P to be the set {r(1), ..., r(i)};
  define the negative predictions N to be the set {r(i + 1), ..., r(t)};
  compute the evaluation criteria e(i) of interest for P and N.
Return the vector e.
The common strategy is to compute two evaluation criteria e_1 and e_2 and to then visualize the result in a 2-D plot.
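
A minimal sketch of this ranking strategy (illustrative code, not from the slides; it mirrors the sweep i = 1 to t − 1 and assumes labels in {−1, +1} with both classes present):

```python
import numpy as np

def evaluate_over_thresholds(scores, y_true):
    """Sweep the threshold over the ranked scores and return
    (recall, precision, false positive rate) for each cut-off."""
    scores, y_true = np.asarray(scores, dtype=float), np.asarray(y_true)
    order = np.argsort(-scores)          # rank points by decreasing score
    y_ranked = y_true[order]
    n_pos, n_neg = np.sum(y_true == 1), np.sum(y_true == -1)
    results = []
    for i in range(1, len(y_ranked)):    # top i points are predicted positive
        pred_pos = y_ranked[:i]
        tp = np.sum(pred_pos == 1)
        fp = np.sum(pred_pos == -1)
        recall = tp / n_pos
        precision = tp / i
        fpr = fp / n_neg
        results.append((recall, precision, fpr))
    return results
```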

10 Evaluating Classifiers: ROC Curves
One popular such 2-D plot is the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate versus the false positive rate. The true positive rate (or sensitivity) is identical to the recall, TP / (TP + FN), that is, the fraction of positive points that were correctly classified. The false positive rate (or 1 − specificity) is FP / (FP + TN), that is, the fraction of negative points that were misclassified.

11 Evaluating Classifiers: ROC Curves
Each ROC curve starts at (0,0): if no point is predicted to be positive, then there are no True Positives and no False Positives. Each ROC curve ends at (1,1): if all points are predicted to be positive, then there are no True Negatives and no False Negatives.

12 Evaluating Classifiers: ROC Curves
The ROC curve of a perfect classifier runs through the point (0,1): it correctly classifies all negative points (FP = 0) and correctly classifies all positive points (FN = 0). While the ROC curve does not depend on an arbitrarily chosen threshold θ, it seems difficult to summarize the performance of a classifier in terms of a ROC curve.

13 Evaluating Classifiers: ROC Curves
The solution to this problem is the Area under the Receiver Operating Characteristic curve (AUC), a number between 0 and 1. The AUC can be interpreted as follows: when we present one negative and one positive test point to the classifier, the AUC is the probability with which the classifier will assign a larger score to the positive point than to the negative point. The larger the AUC, the better the classifier. The AUC of a perfect classifier can be shown to be 1. The AUC of a random classifier (guessing the prediction) is 0.5. The AUC of a stupid classifier (misclassifying all points) is 0.
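
This probabilistic interpretation can be checked directly with a small sketch (illustrative, not from the slides): the AUC is estimated as the fraction of (positive, negative) pairs that the scoring function ranks correctly, counting ties as one half.

```python
import numpy as np

def auc_pairwise(scores, y_true):
    """AUC as the probability that a random positive point
    receives a higher score than a random negative point."""
    scores, y_true = np.asarray(scores, dtype=float), np.asarray(y_true)
    pos = scores[y_true == 1]
    neg = scores[y_true == -1]
    # compare every positive score with every negative score
    greater = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

print(auc_pairwise([0.9, 0.8, 0.4, 0.3], [1, -1, 1, -1]))  # 0.75
```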

14 Evaluating Classifiers: Summarizing Precision-Recall Values
A 2-D plot of (recall, precision) values for different values of θ starts at (0,1): full precision, no recall. The precision-recall break-even point is the point at which the precision-recall curve intersects the bisecting line. The area under the precision-recall curve (AUPRC) is another statistic to quantify the performance of a classifier. It is 1 for a perfect classifier, that is, one that reaches 100% precision and 100% recall at the same time.

15 Evaluating Classifiers: Example: The Good and the Bad
We are given 102 test points, 2 positive and 100 negative. Our prediction ranks ten negative points first, then the 2 positive points, then the remaining 90 negative points.
[Figure: the resulting precision-recall curve (recall vs. precision) and ROC curve (false positive rate vs. true positive rate).]

16 Evaluating Classifiers: What To Do If We Only Have One Dataset for Training and Testing?
If only one dataset is available for training and testing, it is essential not to train and test on the same instances, but rather to split the available data into training and test data. Splitting the dataset into k subsets, using each of them once for testing and the rest for training, is referred to as k-fold cross-validation. If k = n, cross-validation is referred to as leave-one-out cross-validation. Randomly sampling subsets of the data for training and testing and averaging over the results is called bootstrapping.
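
A minimal sketch of k-fold cross-validation (illustrative; `train_and_evaluate` is a placeholder for any routine that trains a classifier and returns its test performance, not something defined in the slides):

```python
import numpy as np

def k_fold_cross_validation(X, y, k, train_and_evaluate, seed=0):
    """Split the data into k folds; each fold is used once as the test set
    while the remaining k-1 folds form the training set."""
    n = len(y)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n)
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_evaluate(X[train_idx], y[train_idx],
                                         X[test_idx], y[test_idx]))
    return np.mean(scores)
```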

17 Evaluating Classifiers: Illustration of Cross-Validation
[Figure: illustration of 10-fold cross-validation; in each of the 10 steps, a different fold serves as the test set and the remaining folds form the training set.]

18 Evaluating Classifiers: How To Optimize the Parameters of a Classifier?
Most classifiers use parameters c that have to be set (more on these later). It is wrong to optimize these parameters by trying out different values and picking those that perform best on the test set: such parameters are overfit to this particular test dataset and may not generalize to other datasets. Instead, one needs an internal cross-validation on the training data to optimize c.
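
The internal (nested) cross-validation can be sketched as follows (illustrative only; `candidate_values` for c and the helper `train_and_evaluate` are assumptions, not part of the slides):

```python
import numpy as np

def nested_cross_validation(X, y, candidate_values, k_outer, k_inner,
                            train_and_evaluate):
    """Outer loop estimates generalisation performance; the inner loop
    chooses the parameter c on the training part of each outer fold."""
    n = len(y)
    outer_folds = np.array_split(np.random.permutation(n), k_outer)
    outer_scores = []
    for i in range(k_outer):
        test_idx = outer_folds[i]
        train_idx = np.concatenate(
            [outer_folds[j] for j in range(k_outer) if j != i])
        # inner cross-validation on the outer training data only
        best_c, best_score = None, -np.inf
        inner_folds = np.array_split(np.random.permutation(len(train_idx)),
                                     k_inner)
        for c in candidate_values:
            inner_scores = []
            for m in range(k_inner):
                val = train_idx[inner_folds[m]]
                tr = train_idx[np.concatenate(
                    [inner_folds[l] for l in range(k_inner) if l != m])]
                inner_scores.append(
                    train_and_evaluate(X[tr], y[tr], X[val], y[val], c))
            if np.mean(inner_scores) > best_score:
                best_c, best_score = c, np.mean(inner_scores)
        # retrain with the selected c and evaluate on the outer test fold
        outer_scores.append(
            train_and_evaluate(X[train_idx], y[train_idx],
                               X[test_idx], y[test_idx], best_c))
    return np.mean(outer_scores)
```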

19 Evaluating Classifiers: Illustration of Cross-Validation with Parameter Optimization
[Figure: nested cross-validation; each step of the outer loop contains an inner cross-validation loop, and the inner loop is repeated for different values of c.]

20 Evaluating Classifiers: Illustration of Cross-Validation with Parameter Optimization (continued)
[Figure: a later step of the outer loop, again with the inner loop repeated for different values of c.]

21 Choosing Classifiers: Criteria for a Good Classifier
Accuracy, runtime and scalability, interpretability, flexibility.

22 Nearest Neighbour Classification

23 Nearest Neighbour: The Actual Classification
Given x, we predict its label y by finding its nearest neighbour in the training set,
x_i = arg min_{x' ∈ D} ‖x − x'‖², and setting f(x) = y_i.
The predicted label of x is that of the point closest to it, that is, its nearest neighbour.

24 Nearest Neighbour: k-NN Classification
An object is classified to the class most common among its k nearest neighbours (k ∈ ℕ). If k = 1, the object is assigned to the class of its single nearest neighbour. k is a hyperparameter that is chosen by the user: the larger k, the lower the influence of single noisy points on the classification. In binary classification problems, one usually chooses an odd k to avoid ties. k-NN is an instance of instance-based learning and lazy learning.
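
A minimal k-NN sketch using the Euclidean distance (illustrative, not the course's reference implementation):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x as the majority class among its k nearest
    training points (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x, axis=1)   # O(n) distance computations
    nearest = np.argsort(distances)[:k]               # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([-1, -1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.05]), k=3))  # -> 1
```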

25 Nearest Neighbour: Runtime
Naively, one has to compute the distance to all n points in the training dataset for each test point: O(n) to find the nearest neighbour in 1-NN, and O(n + n log n) to find the k nearest neighbours in k-NN.

26 Nearest Neighbour: Challenges in Nearest Neighbour Classification
Speed of prediction, choice of the hyperparameter k, choice of the distance function and of the weights of the dimensions.

27 Nearest Neighbour: How To Speed Nearest Neighbour Up
Exploit the triangle inequality: d(x_1, x_2) + d(x_2, x_3) ≥ d(x_1, x_3). This holds for any metric d.

28 Nearest Neighbour: How To Speed Nearest Neighbour Up
Rewrite the triangle inequality as d(x_1, x_2) ≥ d(x_1, x_3) − d(x_2, x_3). That means if you know d(x_1, x_3) and d(x_2, x_3), you can provide a lower bound on d(x_1, x_2). If you already know a point that is closer to x_1 than d(x_1, x_3) − d(x_2, x_3), then you can avoid computing d(x_1, x_2).

29 Nearest Neighbour: How To Set the Hyperparameter k: Bootstrapping and Cross-Validation
Cross-validation or bootstrapping on the training data: split the training dataset into two subsets T_1 and T_2; for different choices of k, compute the accuracy on T_2, using T_1 as the training set; repeat the above two steps several times; pick the k with the highest average accuracy across all iterations.

30 Nearest Neighbour: How To Weight the Dimensions
The Mahalanobis distance d_M takes the d × d covariance structure Σ between features into account:
d_M(x, x') = sqrt( (x − x')ᵀ Σ⁻¹ (x − x') ),
where the covariance matrix is defined as Σ_ij = cov(X_i, X_j) = E[(X_i − µ_i)(X_j − µ_j)]. X_i is a random variable representing feature i and µ_i is its mean.
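
A sketch of the Mahalanobis distance, with Σ estimated from training data (illustrative; the random data is only there to make the snippet self-contained):

```python
import numpy as np

def mahalanobis_distance(x, x_prime, Sigma_inv):
    """d_M(x, x') = sqrt((x - x')^T Sigma^{-1} (x - x'))."""
    diff = x - x_prime
    return np.sqrt(diff @ Sigma_inv @ diff)

X_train = np.random.default_rng(0).normal(size=(100, 3))
Sigma = np.cov(X_train, rowvar=False)        # d x d covariance of the features
Sigma_inv = np.linalg.inv(Sigma)
print(mahalanobis_distance(X_train[0], X_train[1], Sigma_inv))
```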

31 Naive Bayes Classifier

32 Naive Bayes: Bayes' Rule (named after Thomas Bayes, 1701-1761)
P(Y = y | X = x) = P(X = x | Y = y) P(Y = y) / P(X = x).
Here, X is the random variable that describes the features of an object, and Y the random variable that describes its class label. x is an instantiation of X and y an instantiation of Y.

33 Naive Bayes: Naive Bayes Classification
Classify x into one of m classes y_1, ..., y_m:
arg max_{y_i} P(Y = y_i | X = x) = arg max_{y_i} P(X = x | Y = y_i) P(Y = y_i) / P(X = x),
where P(Y = y_i) is the prior probability, P(X = x) is the evidence, P(X = x | Y = y_i) is the likelihood, and P(Y = y_i | X = x) is the posterior probability.

34 Naive Bayes: Simplifications
P(X = x) is the same for all classes, so we ignore this term. That means P(Y = y_i | X = x) ∝ P(Y = y_i) P(X = x | Y = y_i). If x is multidimensional, that is, if x contains d features x = (x_1, ..., x_d), we further assume that the features are conditionally independent given the class label:
P(X = x | Y = y_i) = P(X = (x_1, ..., x_d) | Y = y_i) = ∏_{j=1}^{d} P(X_j = x_j | Y = y_i).

35 Naive Bayes: The Actual Classification
The actual classification is performed by computing
arg max_{y_i} P(Y = y_i | X = x) ∝ P(Y = y_i) ∏_{j=1}^{d} P(X_j = x_j | Y = y_i).

36 Naive Bayes: How To Train the Model
One has to estimate P(Y). Typically, one assumes that all classes have the same probability, or one infers the class probabilities from the class frequencies in the training dataset. One also has to estimate P(X | Y). Popular choices are:
for binary data: P(X = x | Y = y_i) = ∏_{j=1}^{d} Ber(X_j = x_j | θ_{j,i}), where θ_{j,i} is the probability that feature X_j takes value 1 in class y_i;
for continuous data: P(X = x | Y = y_i) = ∏_{j=1}^{d} N(x_j | µ_{j,i}, σ²_{j,i}), where µ_{j,i} is the mean of feature j in objects of class i, and σ²_{j,i} is the variance.
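
For continuous features, training and prediction can be sketched as follows (a minimal Gaussian Naive Bayes, illustrative only; the small variance offset is added for numerical stability and is not part of the slides):

```python
import numpy as np

def train_gaussian_nb(X, y):
    """Estimate class priors and per-class feature means/variances."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    means = {c: X[y == c].mean(axis=0) for c in classes}
    variances = {c: X[y == c].var(axis=0) + 1e-9 for c in classes}
    return classes, priors, means, variances

def predict_gaussian_nb(x, classes, priors, means, variances):
    """argmax_y log P(Y=y) + sum_j log N(x_j | mu_{j,y}, sigma^2_{j,y})."""
    def log_posterior(c):
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * variances[c])
                                + (x - means[c]) ** 2 / variances[c])
        return np.log(priors[c]) + log_lik
    return max(classes, key=log_posterior)
```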

37 Naive Bayes: Fundamental Probability Distributions
The Bernoulli distribution: Ber(x | θ) = θ if x = 1, and 1 − θ if x = 0.
The Gaussian (normal) distribution: N(x | µ, σ²) = 1/sqrt(2πσ²) · exp(−(x − µ)² / (2σ²)).
The multivariate Gaussian distribution: N(x | µ, Σ) = 1 / ((2π)^{d/2} |Σ|^{1/2}) · exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ)).

38 Naive Bayes: Example
P(Y = y_i | X = x) ∝ P(Y = y_i) ∏_{j=1}^{d} P(X_j = x_j | Y = y_i).
[Slide shows example DNA sequence data (A, C, G, T characters) to which Naive Bayes is applied.]

39 Naive Bayes: Advantages
Speed: the effort of prediction for one test point is O(md), as we have to compute the class posterior for all m classes.
Ability to deal with missing data: missing features x_j can simply be dropped when evaluating the class posteriors, by dropping P(x_j | y_i).
Ability to combine discrete and continuous features: use discrete or continuous probability distributions for each attribute.
Practical performance: despite the unrealistic independence assumption on the features, Naive Bayes often provides good results in practice.

40 Naive Bayes: Fundamental Probability Distributions (repetition of slide 37: the Bernoulli, Gaussian and multivariate Gaussian distributions).

41 Linear Discriminant Analysis
Based on: Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press 2014, Chapter 24.3.

42 Linear Discriminant Analysis: Underlying Assumptions of LDA
The goal is to predict a label y ∈ {0, 1} from a vector x of d features, that is, x = (x_1, ..., x_d). We assume that P(Y = 1) = P(Y = 0) = 1/2. We assume that the conditional probability of X given Y is a multivariate Gaussian distribution. The covariance matrix Σ is the same for both classes, Y = 0 and Y = 1, but they differ in their means, µ_0 and µ_1.

43 Linear Discriminant Analysis: Log-Likelihood Ratio
The density is then N(x | µ_y, Σ) = 1 / ((2π)^{d/2} |Σ|^{1/2}) · exp(−½ (x − µ_y)ᵀ Σ⁻¹ (x − µ_y)).
We would like to find arg max_{y ∈ {0,1}} P(Y = y) P(X = x | Y = y). We predict f(x) = 1 if the log-likelihood ratio exceeds zero:
log[ P(Y = 1) P(X = x | Y = 1) / (P(Y = 0) P(X = x | Y = 0)) ] > 0.

44 Linear Discriminant Analysis: Link to Linear Regression
The log-likelihood ratio is equivalent to
½ (x − µ_0)ᵀ Σ⁻¹ (x − µ_0) − ½ (x − µ_1)ᵀ Σ⁻¹ (x − µ_1).
This can be rewritten as ⟨w, x⟩ + b = Σ_{i=1}^{d} w_i x_i + b, where
w = (µ_1 − µ_0)ᵀ Σ⁻¹ and b = ½ (µ_0ᵀ Σ⁻¹ µ_0 − µ_1ᵀ Σ⁻¹ µ_1).

45 Linear Discriminant Analysis: Link to Linear Regression
This shows that a linear predictor of the form f(x) = sgn(⟨w, x⟩ + b) is the Bayes-optimal classifier under the aforementioned assumptions. To perform LDA, all we have to do is to estimate µ_0, µ_1 and Σ from the data and use the equations above to determine w and b. The classification is then performed via f(x) = sgn(⟨w, x⟩ + b). This mathematical elegance goes hand in hand with the difficulty of computing Σ⁻¹ in very high-dimensional spaces.
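
LDA therefore reduces to estimating a few moments of the data. A minimal sketch (illustrative, assuming labels in {0, 1} and a pooled covariance estimate shared by both classes):

```python
import numpy as np

def fit_lda(X, y):
    """Estimate mu_0, mu_1 and the shared covariance Sigma, then compute
    w = Sigma^{-1}(mu_1 - mu_0) and
    b = 0.5 * (mu_0^T Sigma^{-1} mu_0 - mu_1^T Sigma^{-1} mu_1)."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # pooled covariance estimate shared by both classes
    Sigma = (np.cov(X0, rowvar=False) * (len(X0) - 1)
             + np.cov(X1, rowvar=False) * (len(X1) - 1)) / (len(X) - 2)
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)
    b = 0.5 * (mu0 @ Sigma_inv @ mu0 - mu1 @ Sigma_inv @ mu1)
    return w, b

def predict_lda(X, w, b):
    return np.sign(X @ w + b)
```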

46 Logistic Regression

47 Logistic Regression: Concept (David Cox, 1958)
Logistic regression is a classification model for binary output variables, y ∈ {−1, 1}. We define an auxiliary target variable z, which is expressed as a linear function of the input variable x, that is, z = w_0 + Σ_{i=1}^{d} w_i x_i. The logistic function f now maps z to the interval [0, 1]:
f(z) = exp(z) / (exp(z) + 1) = 1 / (1 + exp(−z)).

48 Logistic Regression
[Figure: plot of the logistic function f(z) against z.]

49 Logistic Regression: Concept (David Cox, 1958)
The logistic function can now be written as
f_w(x) = f(⟨w, x⟩) = 1 / (1 + exp(−(w_0 + Σ_{i=1}^{d} w_i x_i))).
f_w(x) is the probability that x is in class 1.

50 Logistic Regression: Concept (David Cox, 1958)
One can now define the inverse of the logistic function, g = f⁻¹, the logit or log-odds function:
g(f_w(x)) = ln( f_w(x) / (1 − f_w(x)) ) = w_0 + Σ_{i=1}^{d} w_i x_i.
g ∘ f_w clarifies the link to linear regression.

51 Logistic Regression: Training the Model
While the link to linear regression is clear now, how to train the model remains unclear. We need an objective that is optimized in the training step to learn the weights w.

52 Logistic Regression
P(y = 1 | X = x) = 1 / (1 + exp(−⟨w, x⟩)).
[Figure: the loss of a training point as a function of ⟨w, x⟩.]

53 Logistic Regression
P(y = 1 | X = x) = 1 / (1 + exp(−⟨w, x⟩)).
[Figure: the loss as a function of ⟨w, x⟩, continued.]

54 Logistic Regression: Training the Model
The log probability of each point is therefore
log( 1 / (1 + exp(−y ⟨w, x⟩)) ) = log( (1 + exp(−y ⟨w, x⟩))⁻¹ ) = −log(1 + exp(−y ⟨w, x⟩)).
To train the logistic regression model, we minimize the total negative log probability over all points, the logistic loss function, which is convex in w:
arg min_{w ∈ ℝ^d} (1/n) Σ_{i=1}^{n} log(1 + exp(−y_i ⟨w, x_i⟩)).
Minimizing this objective corresponds to maximum likelihood estimation; several optimization approaches and software packages, e.g. liblinear, implement it.
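
A minimal sketch of minimizing the logistic loss with plain gradient descent (illustrative only; in practice one would use a dedicated solver such as liblinear). Labels are assumed to be in {−1, +1}, and a constant feature can be appended to X to model w_0.

```python
import numpy as np

def train_logistic_regression(X, y, lr=0.1, n_iter=1000):
    """Minimise (1/n) * sum_i log(1 + exp(-y_i <w, x_i>)) by gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        margins = y * (X @ w)
        # gradient of the logistic loss: -(1/n) sum_i y_i x_i / (1 + exp(margin_i))
        grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / n
        w -= lr * grad
    return w

def predict(X, w):
    return np.sign(X @ w)
```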

55 Logistic Regression: Discussion (Kevin Murphy, MIT Press 2012)
Logistic regression models are easy to fit: the algorithms are easy to implement and fast. There are methods to train logistic regression models that take time linear in the number of non-zero features in the data, which is the minimal possible time. Logistic regression models are easy to interpret, as their output represents the log-odds ratio between the positive and the negative class. Several important extensions (multiclass classification, nonlinear decision boundaries, other types of output data) are possible.

56 Decision Trees

57 Decision Tree: Key Idea
Recursively split the data space into regions that contain a single class only.
[Figure: example decision tree with tests on Age, Pregnancy, Treated with drug B, Treated with drug C and Heart problems, and leaves labelled Treatment / No Treatment.]

58 Decision Tree: Concept
A decision tree is a flowchart-like tree structure with a root (the uppermost node), internal nodes (which represent tests on an attribute), branches (which represent outcomes of a test) and leaf nodes (which hold a class label).
[Figure: the same example tree as above.]

59 Decision Tree: Classification
Given a test point x: perform the test on the attributes of x at the root; follow the branch that corresponds to the outcome of this test; repeat this procedure until you reach a leaf node; predict the label of x to be the label of that leaf node.
[Figure: the same example tree as above.]

60 Decision Tree: Popularity
Decision trees require no domain knowledge, are easy to interpret, and construction and prediction are fast. But how to construct a decision tree?
[Figure: the same example tree as above.]

61 Decision Tree: Construction of Decision Trees
Training procedure: 1. start with all training examples; 2. select the attribute and threshold that give the best split; 3. create child nodes based on the split; 4. repeat steps 2 and 3 on each child using its data, until a stopping criterion is reached (all examples are in the same class, the number of examples in a node is too small, or the tree gets too large).
Central problem: how to choose the best attribute? Via a cost function.

62 Decision Tree
function DecisionTree(D):
    if stopping criterion fulfilled:
        the predicted class for points in D is the majority class in D
    otherwise:
        (A, θ) = arg max_(A,θ) [ cost(D) − cost(D | A, θ) ]
        D_{A,<θ} = {x ∈ D : A < θ}
        D_{A,≥θ} = {x ∈ D : A ≥ θ}
        create two child nodes of D, containing D_{A,<θ} and D_{A,≥θ}
        DecisionTree(D_{A,<θ})
        DecisionTree(D_{A,≥θ})
end

63 Decision Tree: Information Gain (Quinlan, 1986)
ID3 (short for Iterative Dichotomizer 3) uses the information gain as its attribute selection measure. The information content is defined as
Info(D) = − Σ_{i=1}^{m} p(y = y_i | x ∈ D) log₂ p(y = y_i | x ∈ D),
where p(y = y_i | x ∈ D) is the probability that an arbitrary tuple in D belongs to class y_i and is estimated by |{(x_j, y_j) : x_j ∈ D, y_j = y_i}| / |D|. This is also known as the Shannon entropy of D.

64 Decision Tree: Information Gain
Assume that attribute A was used to split D into v partitions or subsets {D_1, D_2, ..., D_v}, where D_j contains those tuples in D that have outcome a_j of A. Ideally, the D_j would provide a perfect classification, but they seldom do. How much more information do we need to arrive at an exact classification? This is quantified by
Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) Info(D_j). (1)

65 Decision Tree: Information Gain
The information gain is the loss of entropy (increase in information) that is caused by splitting with respect to attribute A:
Gain(A) = Info(D) − Info_A(D). (2)
We pick the attribute A such that this gain is maximised.
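
A sketch of the entropy and information gain computations for a candidate split (illustrative code, not from the slides):

```python
import numpy as np

def info(y):
    """Shannon entropy Info(D) of the class labels in a node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, partitions):
    """Gain(A) = Info(D) - sum_j |D_j|/|D| * Info(D_j),
    where `partitions` are the label arrays of the child nodes."""
    n = len(y)
    info_a = sum(len(part) / n * info(part) for part in partitions)
    return info(y) - info_a

y = np.array([0, 0, 0, 1, 1, 1])
print(information_gain(y, [y[:3], y[3:]]))  # perfect split: gain = 1.0
```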

66 Decision Tree: Gain Ratio (Quinlan, 1993)
The information gain is biased towards attributes with a large number of values; for example, an ID attribute maximises the information gain! Hence C4.5 uses an extension of the information gain, the gain ratio. The gain ratio is based on the split information, defined as
SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j| / |D|) log₂(|D_j| / |D|), (3)
and is given by
GainRatio(A) = Gain(A) / SplitInfo_A(D). (4)

67 Decision Tree: Gain Ratio
The attribute with maximum gain ratio is selected as the splitting attribute. The ratio becomes unstable as the split information approaches zero. A constraint is added to ensure that the information gain of the selected test is at least as great as the average gain over all tests examined.

68 Decision Tree: Gini Index (Breiman, Friedman, Olshen and Stone, 1984)
The Gini index is the attribute selection measure in the CART system. It measures class impurity as
Gini(D) = 1 − Σ_{i=1}^{m} p(y = y_i)².
If we split via attribute A into partitions {D_1, D_2, ..., D_v}, the Gini index of this partitioning is defined as
Gini_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) Gini(D_j), (5)
and the reduction in impurity by a split on A is
ΔGini(A) = Gini(D) − Gini_A(D). (6)
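
The analogous computation for the Gini index (illustrative sketch):

```python
import numpy as np

def gini(y):
    """Gini(D) = 1 - sum_i p(y = y_i)^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_reduction(y, partitions):
    """Reduction in impurity: Gini(D) - sum_j |D_j|/|D| * Gini(D_j)."""
    n = len(y)
    gini_a = sum(len(part) / n * gini(part) for part in partitions)
    return gini(y) - gini_a

y = np.array([0, 0, 0, 1, 1, 1])
print(gini_reduction(y, [y[:3], y[3:]]))  # perfect split: reduction = 0.5
```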

69 Decision Tree: Advantages I
Simple to understand and to interpret; trees can be visualised. Requires little data preparation: other techniques often require data normalisation, dummy variables to be created and blank values to be removed. The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree. Able to handle both numerical and categorical data; other techniques are usually specialised in analysing datasets that have only one type of variable. Able to handle multi-output problems. Uses a white-box model: if a given situation is observable in a model, the explanation for the condition is easily given by boolean logic. By contrast, in a black-box model (e.g., an artificial neural network), results may be more difficult to interpret.

70 Decision Tree: Advantages II
Possible to validate a model using statistical tests, which makes it possible to account for the reliability of the model. Performs well even if its assumptions are somewhat violated by the true model from which the data were generated. Source:

71 Decision Tree: Disadvantages I
Decision-tree learners can create over-complex trees that do not generalise the data well; this is called overfitting. Mechanisms such as pruning (not currently supported), setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem. Decision trees can be unstable because small variations in the data might result in a completely different tree being generated; this problem is mitigated by using decision trees within an ensemble. The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristics such as the greedy algorithm, where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by

72 Decision Tree: Disadvantages II
training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement. There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems. Decision tree learners create biased trees if some classes dominate; it is therefore recommended to balance the dataset prior to fitting the decision tree. Source:

73 Decision Trees: The XOR Problem
[Figure: scatter plot illustrating the XOR problem.]

74 Decision Trees: Random Forests (Breiman, 2001)
To reduce overfitting and the large variability of decision trees, one may use an ensemble of several decision trees. Breiman (2001) introduced Random Forests, a collection of k decision trees. The idea is to subsample a subset of the instances and a subset of the features of the training dataset k times and to construct a decision tree on each of these k samples. The classification is then performed by taking a majority vote over the trees. Three key parameters: the number of instances per subset, the number of features per subset, and the number of trees k.

75 Support Vector Machines

76 Support Vector Machines: Hyperplane Classifiers
Vapnik et al. (1974) defined a family of classifiers for binary classification problems: the class of hyperplanes in some dot product space H,
⟨w, x⟩ + b = 0, (7)
where w ∈ H, b ∈ ℝ. These correspond to decision functions ("classifiers")
f(x) = sgn(⟨w, x⟩ + b). (8)
Vapnik et al. proposed a learning algorithm for determining this f from the training dataset.

77 Support Vector Machines
The optimal hyperplane maximises the margin of separation between any training point and the hyperplane:
max_{w ∈ H, b ∈ ℝ} min{ ‖x − x_i‖ : x ∈ H, ⟨w, x⟩ + b = 0, i ∈ {1, ..., n} }. (9)

78 Support Vector Machines: Hard-Margin SVM
min_{w ∈ H, b ∈ ℝ} ½ ‖w‖² (10)
subject to y_i (⟨w, x_i⟩ + b) ≥ 1 for all i ∈ {1, ..., n}.
Why minimise ½ ‖w‖²? The size of the margin is 2 / ‖w‖: the smaller ‖w‖, the larger the margin. Why do we have to obey the constraints y_i (⟨w, x_i⟩ + b) ≥ 1? They ensure that all training data points of the same class are on the same side of the hyperplane and outside the margin.

79 Support Vector Machines: The Lagrangian
We form the Lagrangian
L(w, b, α) = ½ ‖w‖² − Σ_{i=1}^{n} α_i ( y_i (⟨x_i, w⟩ + b) − 1 ). (11)
The Lagrangian is minimised with respect to the primal variables w and b, and maximised with respect to the dual variables α_i.

80 Support Vector Machines: Support Vectors
At optimality,
∂L(w, b, α)/∂b = 0 and ∂L(w, b, α)/∂w = 0, (12)
such that
Σ_{i=1}^{n} α_i y_i = 0 and w = Σ_{i=1}^{n} α_i y_i x_i. (13)
Hence the solution vector w, the crucial parameter of the SVM classifier, has an expansion in terms of the training points and their labels. The training points with α_i > 0 are the support vectors.

81 Support Vector Machines: The Dual Problem
Plugging (13) into the Lagrangian (11), we obtain the dual optimization problem that is solved in practice:
maximise_{α ∈ ℝⁿ} W(α) = Σ_{i=1}^{n} α_i − ½ Σ_{i,j=1}^{n} α_i α_j y_i y_j ⟨x_i, x_j⟩. (14)

82 Support Vector Machines: The Kernel Trick
The key insight is that (14) accesses the training data only in terms of inner products ⟨x_i, x_j⟩. We can plug in an inner product of our choice from any space H here! This is referred to as a kernel k:
k(x_i, x_j) = ⟨x_i, x_j⟩_H. (15)

83 Support Vector Machines: Definition of a Kernel
We assume that data points x ∈ ℝ^d are mapped to a space H via a mapping φ : ℝ^d → H. Then the kernel is defined as an inner product ⟨·, ·⟩ : H × H → ℝ between data points in this space H:
k(x, x') = ⟨φ(x), φ(x')⟩.

84 Support Vector Machines: Prediction
The decision function for Support Vector Machine classification then becomes
f(x) = sgn( Σ_{i=1}^{n} y_i α_i ⟨φ(x), φ(x_i)⟩ + b ) (16)
     = sgn( Σ_{i=1}^{n} y_i α_i k(x, x_i) + b ). (17)
The prediction effort is linear in the number of non-zero entries of α, that is, in the worst case O(n). In practice, the number of support vectors is often much smaller than n.
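
Given dual coefficients α (obtained from any quadratic programming solver), the prediction step can be sketched as follows (illustrative; `kernel` stands for any kernel function as defined above, and the argument names are assumptions):

```python
import numpy as np

def svm_decision_function(x, support_vectors, labels, alphas, b, kernel):
    """f(x) = sgn( sum_i y_i alpha_i k(x, x_i) + b ), summed over the
    support vectors, i.e. the training points with alpha_i > 0."""
    score = sum(alpha_i * y_i * kernel(x, x_i)
                for x_i, y_i, alpha_i in zip(support_vectors, labels, alphas))
    return np.sign(score + b)
```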

85 Support Vector Machines
[Figure: a separating hyperplane with margin errors, annotated with slack variables ξ_i and ξ_j.]

86 Support Vector Machines: Soft-Margin SVM, the C-SVM (Cortes and Vapnik, 1995)
Points are now allowed to lie inside the margin or in the wrong halfspace (margin errors):
min_{w ∈ H, b ∈ ℝ, ξ ∈ ℝⁿ} ½ ‖w‖² + C Σ_{i=1}^{n} ξ_i (18)
subject to y_i (⟨w, x_i⟩ + b) ≥ 1 − ξ_i, ξ_i ≥ 0, for all i ∈ {1, ..., n}. (19)
C ∈ ℝ is a penalty parameter that determines the trade-off between maximizing the margin and minimizing the margin errors. ξ is a slack variable, which measures the degree of misclassification of each margin error.

87 Support Vector Machines: Soft-Margin SVM, the ν-SVM (Schölkopf et al., 2000)
An alternative formulation of soft-margin SVMs uses the parameter ν ∈ (0, 1]:
min_{w ∈ H, b, ρ ∈ ℝ, ξ ∈ ℝⁿ} ½ ‖w‖² + (1/n) Σ_{i=1}^{n} ξ_i − νρ (20)
subject to y_i (⟨w, x_i⟩ + b) ≥ ρ − ξ_i, ξ_i ≥ 0, ρ ≥ 0, for all i ∈ {1, ..., n}. (21)
ν can be shown to be a lower bound on the fraction of support vectors and an upper bound on the fraction of margin errors among all training points.

88 Support Vector Machines: Class Imbalance, SVM with Different Error Costs (Veropoulos et al., 1999)
Margin errors in different classes are penalized differently (two parameters C₊ and C₋ instead of one):
min_{w ∈ H, b ∈ ℝ, ξ ∈ ℝⁿ} ½ ‖w‖² + C₊ Σ_{i : y_i = 1} ξ_i + C₋ Σ_{i : y_i = −1} ξ_i (22)
subject to y_i (⟨w, x_i⟩ + b) ≥ 1 − ξ_i, ξ_i ≥ 0, for all i ∈ {1, ..., n}. (23)
The idea is that margin errors in the smaller class should be more expensive than in the larger class.

89 Kernels

90 Kernels: Building Kernels in Practice
In practical applications, one constructs kernels by 1. proposing a kernel function and explicitly defining the corresponding mapping φ, or/and 2. combining known kernels in ways that obey the closure properties of kernels. Knowing these closure properties and a set of basic kernel functions is of utmost importance when working with kernels in practice.

91 Kernels: Closure Properties of Kernels
Assume we are given two kernels k_1 and k_2. Then k_1 + k_2 is a kernel as well, and so is their product k_1 · k_2. Assume we are given a kernel function k and a positive scalar λ ∈ ℝ₊. Then λk is a kernel as well. Assume that a kernel k is only defined on a set D. Then its zero-extension k⁰ is also a kernel:
k⁰(x, x') = k(x, x') if x ∈ D and x' ∈ D, and 0 otherwise.

92 Kernels: Some Prominent Kernels
The linear kernel: k(x, x') = Σ_{l=1}^{d} x_l x'_l = xᵀx'. (24)
The polynomial kernel: k(x, x') = (xᵀx' + c)^d, where c, d ∈ ℝ.
The Gaussian Radial Basis Function (RBF) kernel: k(x, x') = exp(−‖x − x'‖² / (2σ²)). (25)
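
The three kernels above in code (illustrative sketch; the parameter defaults c, degree and sigma are arbitrary choices, not values from the slides):

```python
import numpy as np

def linear_kernel(x, x_prime):
    return np.dot(x, x_prime)

def polynomial_kernel(x, x_prime, c=1.0, degree=3):
    return (np.dot(x, x_prime) + c) ** degree

def rbf_kernel(x, x_prime, sigma=1.0):
    return np.exp(-np.sum((x - x_prime) ** 2) / (2 * sigma ** 2))

x, x2 = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(x, x2), polynomial_kernel(x, x2), rbf_kernel(x, x2))
```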

93 Kernels: Some Useful Kernels
The constant all-ones kernel: k(x, x') = 1.
The delta (Dirac) kernel: k(x, x') = 1 if x = x', and 0 otherwise.

94 Kernels on Structured Data: R-Convolution Kernels (Haussler, 1999)
R-convolution kernels are a famous recipe for constructing kernels on structured data. The idea is to decompose objects X and X' via a relation R into sets of substructures S and S'. Their simplest and most widely used instance, k_R, compares all pairs of these substructures of X and X':
k_R(X, X') = Σ_{s ∈ S, s' ∈ S'} k_base(s, s').
For instance, the substructures could be the elements of a set, the nodes of a graph or the substrings of a string. k_base is an arbitrary vectorial kernel, very often simply the delta kernel.

95 Kernels on Strings: String Kernels
The spectrum kernel counts all pairs of matching substrings in two strings. That is, X and X' are strings and S and S' are the sets of all of their substrings:
k_R(X, X') = Σ_{s ∈ S, s' ∈ S'} k_base(s, s'),
where k_base is a delta kernel, that is, for every pair of matching substrings the kernel increases by one. Naively, one has to enumerate all substrings (quadratic effort, O(|X|² + |X'|²)). Via a special data structure called suffix trees, it can be computed in linear time O(|X| + |X'|) (Vishwanathan and Smola, 2003).

96 Kernels on Graphs: Random Walk Kernel
Count matching walks in G and G': the more matching walks there are, the higher the similarity between G and G'. Matching walks are sequences of nodes and edges with matching node labels.
Elegant computation (Gärtner et al., 2003): compute the direct product graph of G and G' and count all walks of length k in it. Every walk in the product graph corresponds to one walk in G and one walk in G':
k_×(G, G') = Σ_{i,j=1}^{|V_×|} [ Σ_{k=0}^{∞} λ^k A_×^k ]_{ij} = eᵀ (I − λ A_×)⁻¹ e,
where A_× is the adjacency matrix of the product graph and e is the all-ones vector.

97 Kernels on Graphs
[Figure: two labelled graphs and their direct product graph.]


More information

Part I. Linear Discriminant Analysis. Discriminant analysis. Discriminant analysis

Part I. Linear Discriminant Analysis. Discriminant analysis. Discriminant analysis Week 5 Based in part on slides from textbook, slides of Susan Holmes Part I Linear Discriminant Analysis October 29, 2012 1 / 1 2 / 1 Nearest centroid rule Suppose we break down our data matrix as by the

More information

Support Vector Machines

Support Vector Machines EE 17/7AT: Optimization Models in Engineering Section 11/1 - April 014 Support Vector Machines Lecturer: Arturo Fernandez Scribe: Arturo Fernandez 1 Support Vector Machines Revisited 1.1 Strictly) Separable

More information

Support Vector Machine & Its Applications

Support Vector Machine & Its Applications Support Vector Machine & Its Applications A portion (1/3) of the slides are taken from Prof. Andrew Moore s SVM tutorial at http://www.cs.cmu.edu/~awm/tutorials Mingyue Tan The University of British Columbia

More information

Machine Learning Practice Page 2 of 2 10/28/13

Machine Learning Practice Page 2 of 2 10/28/13 Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 Exam policy: This exam allows two one-page, two-sided cheat sheets (i.e. 4 sides); No other materials. Time: 2 hours. Be sure to write

More information

COMP 652: Machine Learning. Lecture 12. COMP Lecture 12 1 / 37

COMP 652: Machine Learning. Lecture 12. COMP Lecture 12 1 / 37 COMP 652: Machine Learning Lecture 12 COMP 652 Lecture 12 1 / 37 Today Perceptrons Definition Perceptron learning rule Convergence (Linear) support vector machines Margin & max margin classifier Formulation

More information

Linear Classification and SVM. Dr. Xin Zhang

Linear Classification and SVM. Dr. Xin Zhang Linear Classification and SVM Dr. Xin Zhang Email: eexinzhang@scut.edu.cn What is linear classification? Classification is intrinsically non-linear It puts non-identical things in the same class, so a

More information

A Tutorial on Support Vector Machine

A Tutorial on Support Vector Machine A Tutorial on School of Computing National University of Singapore Contents Theory on Using with Other s Contents Transforming Theory on Using with Other s What is a classifier? A function that maps instances

More information

Multivariate statistical methods and data mining in particle physics Lecture 4 (19 June, 2008)

Multivariate statistical methods and data mining in particle physics Lecture 4 (19 June, 2008) Multivariate statistical methods and data mining in particle physics Lecture 4 (19 June, 2008) RHUL Physics www.pp.rhul.ac.uk/~cowan Academic Training Lectures CERN 16 19 June, 2008 1 Outline Statement

More information

CS798: Selected topics in Machine Learning

CS798: Selected topics in Machine Learning CS798: Selected topics in Machine Learning Support Vector Machine Jakramate Bootkrajang Department of Computer Science Chiang Mai University Jakramate Bootkrajang CS798: Selected topics in Machine Learning

More information

Learning with multiple models. Boosting.

Learning with multiple models. Boosting. CS 2750 Machine Learning Lecture 21 Learning with multiple models. Boosting. Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Learning with multiple models: Approach 2 Approach 2: use multiple models

More information

Perceptron Revisited: Linear Separators. Support Vector Machines

Perceptron Revisited: Linear Separators. Support Vector Machines Support Vector Machines Perceptron Revisited: Linear Separators Binary classification can be viewed as the task of separating classes in feature space: w T x + b > 0 w T x + b = 0 w T x + b < 0 Department

More information

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation

More information

Support Vector Machines Explained

Support Vector Machines Explained December 23, 2008 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),

More information

L5 Support Vector Classification

L5 Support Vector Classification L5 Support Vector Classification Support Vector Machine Problem definition Geometrical picture Optimization problem Optimization Problem Hard margin Convexity Dual problem Soft margin problem Alexander

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Reading: Ben-Hur & Weston, A User s Guide to Support Vector Machines (linked from class web page) Notation Assume a binary classification problem. Instances are represented by vector

More information

Announcements - Homework

Announcements - Homework Announcements - Homework Homework 1 is graded, please collect at end of lecture Homework 2 due today Homework 3 out soon (watch email) Ques 1 midterm review HW1 score distribution 40 HW1 total score 35

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Machine Learning Concepts in Chemoinformatics

Machine Learning Concepts in Chemoinformatics Machine Learning Concepts in Chemoinformatics Martin Vogt B-IT Life Science Informatics Rheinische Friedrich-Wilhelms-Universität Bonn BigChem Winter School 2017 25. October Data Mining in Chemoinformatics

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Thomas G. Dietterich tgd@eecs.oregonstate.edu 1 Outline What is Machine Learning? Introduction to Supervised Learning: Linear Methods Overfitting, Regularization, and the

More information

Kaggle.

Kaggle. Administrivia Mini-project 2 due April 7, in class implement multi-class reductions, naive bayes, kernel perceptron, multi-class logistic regression and two layer neural networks training set: Project

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

Support Vector Machine for Classification and Regression

Support Vector Machine for Classification and Regression Support Vector Machine for Classification and Regression Ahlame Douzal AMA-LIG, Université Joseph Fourier Master 2R - MOSIG (2013) November 25, 2013 Loss function, Separating Hyperplanes, Canonical Hyperplan

More information

Machine Learning. Support Vector Machines. Manfred Huber

Machine Learning. Support Vector Machines. Manfred Huber Machine Learning Support Vector Machines Manfred Huber 2015 1 Support Vector Machines Both logistic regression and linear discriminant analysis learn a linear discriminant function to separate the data

More information

Statistical Methods for Data Mining

Statistical Methods for Data Mining Statistical Methods for Data Mining Kuangnan Fang Xiamen University Email: xmufkn@xmu.edu.cn Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find

More information

CMU-Q Lecture 24:

CMU-Q Lecture 24: CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input

More information

Induction of Decision Trees

Induction of Decision Trees Induction of Decision Trees Peter Waiganjo Wagacha This notes are for ICS320 Foundations of Learning and Adaptive Systems Institute of Computer Science University of Nairobi PO Box 30197, 00200 Nairobi.

More information

Holdout and Cross-Validation Methods Overfitting Avoidance

Holdout and Cross-Validation Methods Overfitting Avoidance Holdout and Cross-Validation Methods Overfitting Avoidance Decision Trees Reduce error pruning Cost-complexity pruning Neural Networks Early stopping Adjusting Regularizers via Cross-Validation Nearest

More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

(Kernels +) Support Vector Machines

(Kernels +) Support Vector Machines (Kernels +) Support Vector Machines Machine Learning Torsten Möller Reading Chapter 5 of Machine Learning An Algorithmic Perspective by Marsland Chapter 6+7 of Pattern Recognition and Machine Learning

More information

Artificial Intelligence Roman Barták

Artificial Intelligence Roman Barták Artificial Intelligence Roman Barták Department of Theoretical Computer Science and Mathematical Logic Introduction We will describe agents that can improve their behavior through diligent study of their

More information

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM 1 Support Vector Machines (SVM) in bioinformatics Day 1: Introduction to SVM Jean-Philippe Vert Bioinformatics Center, Kyoto University, Japan Jean-Philippe.Vert@mines.org Human Genome Center, University

More information

10/05/2016. Computational Methods for Data Analysis. Massimo Poesio SUPPORT VECTOR MACHINES. Support Vector Machines Linear classifiers

10/05/2016. Computational Methods for Data Analysis. Massimo Poesio SUPPORT VECTOR MACHINES. Support Vector Machines Linear classifiers Computational Methods for Data Analysis Massimo Poesio SUPPORT VECTOR MACHINES Support Vector Machines Linear classifiers 1 Linear Classifiers denotes +1 denotes -1 w x + b>0 f(x,w,b) = sign(w x + b) How

More information

Applied Machine Learning Annalisa Marsico

Applied Machine Learning Annalisa Marsico Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 22 April, SoSe 2015 Goals Feature Selection rather than Feature

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Machine Learning, Midterm Exam: Spring 2009 SOLUTION

Machine Learning, Midterm Exam: Spring 2009 SOLUTION 10-601 Machine Learning, Midterm Exam: Spring 2009 SOLUTION March 4, 2009 Please put your name at the top of the table below. If you need more room to work out your answer to a question, use the back of

More information

Neural Networks. Prof. Dr. Rudolf Kruse. Computational Intelligence Group Faculty for Computer Science

Neural Networks. Prof. Dr. Rudolf Kruse. Computational Intelligence Group Faculty for Computer Science Neural Networks Prof. Dr. Rudolf Kruse Computational Intelligence Group Faculty for Computer Science kruse@iws.cs.uni-magdeburg.de Rudolf Kruse Neural Networks 1 Supervised Learning / Support Vector Machines

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

Evaluation. Andrea Passerini Machine Learning. Evaluation

Evaluation. Andrea Passerini Machine Learning. Evaluation Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain

More information

Nearest Neighbors Methods for Support Vector Machines

Nearest Neighbors Methods for Support Vector Machines Nearest Neighbors Methods for Support Vector Machines A. J. Quiroz, Dpto. de Matemáticas. Universidad de Los Andes joint work with María González-Lima, Universidad Simón Boĺıvar and Sergio A. Camelo, Universidad

More information

CSC 411 Lecture 17: Support Vector Machine

CSC 411 Lecture 17: Support Vector Machine CSC 411 Lecture 17: Support Vector Machine Ethan Fetaya, James Lucas and Emad Andrews University of Toronto CSC411 Lec17 1 / 1 Today Max-margin classification SVM Hard SVM Duality Soft SVM CSC411 Lec17

More information