Classification
Karsten Borgwardt, Department Biosysteme
Data Mining Course, Basel, Fall Semester

Transcription of the lecture slides
1 Classification. Karsten Borgwardt, Department Biosysteme, Data Mining Course, Basel, Fall Semester

2 What is Classification? Problem and Examples
Given an object, which class of objects does it belong to? Given an object x, predict its class label y.
Examples: Computer vision: Is this object a chair? Credit cards: Is this customer to be trusted? Marketing: Will this customer buy/like our product? Function prediction: Is this protein an enzyme? Gene finding: Does this sequence contain a splice site? Personalized medicine: Will this patient respond to drug treatment?

3 What is Classification? Setting and the Role of y
Classification is usually performed in a supervised setting: we are given a training dataset, that is, a dataset of pairs {(x_i, y_i)}_{i=1}^{n} of objects and their known class labels. The test set is a dataset of points {x_i}_{i=1}^{d} with unknown class labels. The task is to predict the class label y_i of each test point x_i via a function f.
If y ∈ {0, 1}, we are dealing with a binary classification problem; if y ∈ {1, ..., n} with 3 ≤ n ∈ ℕ, a multiclass classification problem; if y ∈ ℝ, a regression problem.

4 Evaluating Classifiers: The Contingency Table
In a binary classification problem, one can represent the accuracy of the predictions in a contingency table:

              y = 1    y = −1
f(x) = 1       TP       FP
f(x) = −1      FN       TN

Here, T refers to True, F to False, P to Positive (prediction) and N to Negative (prediction).

5 Evaluating Classifiers: Accuracy
The accuracy of a classifier is defined as (TP + TN) / (TP + TN + FP + FN).
Accuracy measures which percentage of the predictions is correct. It is the most common criterion for reporting the performance of a classifier. Still, it has a fundamental shortcoming: if the classes are unbalanced, the accuracy on the entire dataset may look high while being low on the smaller class.

6 Evaluating Classifiers: Precision and Recall
If the positive class is much smaller than the negative class, one should rather use precision and recall to evaluate the classifier. The precision of a classifier is defined as TP / (TP + FP). The recall of a classifier is defined as TP / (TP + FN).
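
These quantities are simple to compute once the predictions are available. A minimal sketch in Python (not part of the slides; labels are assumed to be in {−1, +1} and the variable names are illustrative):

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return TP, FP, FN, TN for labels in {-1, +1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == -1))
    fn = np.sum((y_pred == -1) & (y_true == 1))
    tn = np.sum((y_pred == -1) & (y_true == -1))
    return tp, fp, fn, tn

y_true = [1, 1, -1, -1, -1, 1]
y_pred = [1, -1, -1, 1, -1, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
print(accuracy, precision, recall)
```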

7 Evaluating Classifiers: Trade-off between Precision and Recall
There is a trade-off between precision and recall: by predicting all points to be positive (f(x) = 1 for all x), one can guarantee that the recall is 1; however, the precision will then be poor. By only predicting points to be members of the positive class when one is highly confident about the prediction, one will increase precision but lower recall. One workaround is to report the precision-recall break-even point, that is, the value at which precision and recall are identical.

8 Evaluating Classifiers: Dependence on the Classification Threshold
TP, TN, FP, FN depend on f(x) for x ∈ D. The most common definition of f(x) is
f(x) = 1 if s(x) ≥ θ, and f(x) = −1 if s(x) < θ,
where s : D → ℝ is a scoring function and θ ∈ ℝ is a threshold. As the predictions based on f vary with θ, so do TP, TN, FP, FN, and all evaluation criteria based on them. It is therefore important to report results as a function of θ whenever possible, not just for one fixed choice of θ.

9 Evaluating Classifiers: How to Report Results as a Function of θ
The efficient strategy to compute all solutions as a function of θ is to rank all points x by their score s(x). This ranking is a vector r of length t, whose ith element is r(i). We then perform the following steps:
For i = 1 to t − 1:
  define the positive predictions P to be the set {r(1), ..., r(i)};
  define the negative predictions N to be the set {r(i + 1), ..., r(t)};
  compute the evaluation criteria e(i) of interest for P and N.
Return the vector e.
The common strategy is to compute two evaluation criteria e_1 and e_2 and to then visualize the result in a 2-D plot.
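
A minimal sketch of this ranking strategy (illustrative code, not from the slides; it mirrors the sweep i = 1 to t − 1 and assumes labels in {−1, +1} with both classes present):

```python
import numpy as np

def evaluate_over_thresholds(scores, y_true):
    """Sweep the threshold over the ranked scores and return
    (recall, precision, false positive rate) for each cut-off."""
    scores, y_true = np.asarray(scores, dtype=float), np.asarray(y_true)
    order = np.argsort(-scores)          # rank points by decreasing score
    y_ranked = y_true[order]
    n_pos, n_neg = np.sum(y_true == 1), np.sum(y_true == -1)
    results = []
    for i in range(1, len(y_ranked)):    # top i points are predicted positive
        pred_pos = y_ranked[:i]
        tp = np.sum(pred_pos == 1)
        fp = np.sum(pred_pos == -1)
        recall = tp / n_pos
        precision = tp / i
        fpr = fp / n_neg
        results.append((recall, precision, fpr))
    return results
```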

10 Evaluating Classifiers: ROC Curves
One popular such 2-D plot is the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate versus the false positive rate. The true positive rate (or sensitivity) is identical to the recall, TP / (TP + FN), that is, the fraction of positive points that were correctly classified. The false positive rate (or 1 − specificity) is FP / (FP + TN), that is, the fraction of negative points that were misclassified.

11 Evaluating Classifiers: ROC Curves
Each ROC curve starts at (0,0): if no point is predicted to be positive, then there are no True Positives and no False Positives. Each ROC curve ends at (1,1): if all points are predicted to be positive, then there are no True Negatives and no False Negatives.

12 Evaluating Classifiers: ROC Curves
The ROC curve of a perfect classifier runs through the point (0,1): it correctly classifies all negative points (FP = 0) and correctly classifies all positive points (FN = 0). While the ROC curve does not depend on an arbitrarily chosen threshold θ, it seems difficult to summarize the performance of a classifier in terms of a ROC curve.

13 Evaluating Classifiers: ROC Curves
The solution to this problem is the Area under the Receiver Operating Characteristic curve (AUC), a number between 0 and 1. The AUC can be interpreted as follows: when we present one negative and one positive test point to the classifier, the AUC is the probability with which the classifier will assign a larger score to the positive point than to the negative point. The larger the AUC, the better the classifier. The AUC of a perfect classifier can be shown to be 1. The AUC of a random classifier (guessing the prediction) is 0.5. The AUC of a stupid classifier (misclassifying all points) is 0.
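
This probabilistic interpretation can be checked directly with a small sketch (illustrative, not from the slides): the AUC is estimated as the fraction of (positive, negative) pairs that the scoring function ranks correctly, counting ties as one half.

```python
import numpy as np

def auc_pairwise(scores, y_true):
    """AUC as the probability that a random positive point
    receives a higher score than a random negative point."""
    scores, y_true = np.asarray(scores, dtype=float), np.asarray(y_true)
    pos = scores[y_true == 1]
    neg = scores[y_true == -1]
    # compare every positive score with every negative score
    greater = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

print(auc_pairwise([0.9, 0.8, 0.4, 0.3], [1, -1, 1, -1]))  # 0.75
```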

14 Evaluating Classifiers: Summarizing Precision-Recall Values
A 2-D plot of (recall, precision) values for different values of θ starts at (0,1): full precision, no recall. The precision-recall break-even point is the point at which the precision-recall curve intersects the bisecting line. The area under the precision-recall curve (AUPRC) is another statistic to quantify the performance of a classifier. It is 1 for a perfect classifier, that is, one that reaches 100% precision and 100% recall at the same time.

15 Evaluating Classifiers: Example: The Good and the Bad
We are given 102 test points, 2 positive and 100 negative. Our prediction ranks ten negative points first, then the 2 positive points, then the remaining 90 negative points.
[Figure: the resulting precision-recall curve (recall vs. precision) and ROC curve (false positive rate vs. true positive rate).]

16 Evaluating Classifiers: What To Do If We Only Have One Dataset for Training and Testing?
If only one dataset is available for training and testing, it is essential not to train and test on the same instances, but rather to split the available data into training and test data. Splitting the dataset into k subsets, using each of them once for testing and the rest for training, is referred to as k-fold cross-validation. If k = n, cross-validation is referred to as leave-one-out cross-validation. Randomly sampling subsets of the data for training and testing and averaging over the results is called bootstrapping.
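
A minimal sketch of k-fold cross-validation (illustrative; `train_and_evaluate` is a placeholder for any routine that trains a classifier and returns its test performance, not something defined in the slides):

```python
import numpy as np

def k_fold_cross_validation(X, y, k, train_and_evaluate, seed=0):
    """Split the data into k folds; each fold is used once as the test set
    while the remaining k-1 folds form the training set."""
    n = len(y)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n)
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_evaluate(X[train_idx], y[train_idx],
                                         X[test_idx], y[test_idx]))
    return np.mean(scores)
```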

17 Evaluating Classifiers: Illustration of Cross-Validation
[Figure: illustration of 10-fold cross-validation; in each of the 10 steps, a different fold serves as the test set and the remaining folds form the training set.]

18 Evaluating Classifiers: How To Optimize the Parameters of a Classifier?
Most classifiers use parameters c that have to be set (more on these later). It is wrong to optimize these parameters by trying out different values and picking those that perform best on the test set: such parameters are overfit to this particular test dataset and may not generalize to other datasets. Instead, one needs an internal cross-validation on the training data to optimize c.
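
The internal (nested) cross-validation can be sketched as follows (illustrative only; `candidate_values` for c and the helper `train_and_evaluate` are assumptions, not part of the slides):

```python
import numpy as np

def nested_cross_validation(X, y, candidate_values, k_outer, k_inner,
                            train_and_evaluate):
    """Outer loop estimates generalisation performance; the inner loop
    chooses the parameter c on the training part of each outer fold."""
    n = len(y)
    outer_folds = np.array_split(np.random.permutation(n), k_outer)
    outer_scores = []
    for i in range(k_outer):
        test_idx = outer_folds[i]
        train_idx = np.concatenate(
            [outer_folds[j] for j in range(k_outer) if j != i])
        # inner cross-validation on the outer training data only
        best_c, best_score = None, -np.inf
        inner_folds = np.array_split(np.random.permutation(len(train_idx)),
                                     k_inner)
        for c in candidate_values:
            inner_scores = []
            for m in range(k_inner):
                val = train_idx[inner_folds[m]]
                tr = train_idx[np.concatenate(
                    [inner_folds[l] for l in range(k_inner) if l != m])]
                inner_scores.append(
                    train_and_evaluate(X[tr], y[tr], X[val], y[val], c))
            if np.mean(inner_scores) > best_score:
                best_c, best_score = c, np.mean(inner_scores)
        # retrain with the selected c and evaluate on the outer test fold
        outer_scores.append(
            train_and_evaluate(X[train_idx], y[train_idx],
                               X[test_idx], y[test_idx], best_c))
    return np.mean(outer_scores)
```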

19 Evaluating Classifiers: Illustration of Cross-Validation with Parameter Optimization
[Figure: nested cross-validation; each step of the outer loop contains an inner cross-validation loop, and the inner loop is repeated for different values of c.]

20 Evaluating Classifiers: Illustration of Cross-Validation with Parameter Optimization (continued)
[Figure: a later step of the outer loop, again with the inner loop repeated for different values of c.]

21 Choosing Classifiers: Criteria for a Good Classifier
Accuracy, runtime and scalability, interpretability, flexibility.

22 Nearest Neighbour Classification

23 Nearest Neighbour: The Actual Classification
Given x, we predict its label y by finding its nearest neighbour in the training set,
x_i = arg min_{x' ∈ D} ‖x − x'‖², and setting f(x) = y_i.
The predicted label of x is that of the point closest to it, that is, its nearest neighbour.

24 Nearest Neighbour: k-NN Classification
An object is classified to the class most common among its k nearest neighbours (k ∈ ℕ). If k = 1, the object is assigned to the class of its single nearest neighbour. k is a hyperparameter that is chosen by the user: the larger k, the lower the influence of single noisy points on the classification. In binary classification problems, one usually chooses an odd k to avoid ties. k-NN is an instance of instance-based learning and lazy learning.
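
A minimal k-NN sketch using the Euclidean distance (illustrative, not the course's reference implementation):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x as the majority class among its k nearest
    training points (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x, axis=1)   # O(n) distance computations
    nearest = np.argsort(distances)[:k]               # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([-1, -1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.05]), k=3))  # -> 1
```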

25 Nearest Neighbour: Runtime
Naively, one has to compute the distance to all n points in the training dataset for each test point: O(n) to find the nearest neighbour in 1-NN, and O(n + n log n) to find the k nearest neighbours in k-NN.

26 Nearest Neighbour: Challenges in Nearest Neighbour Classification
Speed of prediction, choice of the hyperparameter k, choice of the distance function and of the weights of the dimensions.

27 Nearest Neighbour: How To Speed Nearest Neighbour Up
Exploit the triangle inequality: d(x_1, x_2) + d(x_2, x_3) ≥ d(x_1, x_3). This holds for any metric d.

28 Nearest Neighbour: How To Speed Nearest Neighbour Up
Rewrite the triangle inequality as d(x_1, x_2) ≥ d(x_1, x_3) − d(x_2, x_3). That means if you know d(x_1, x_3) and d(x_2, x_3), you can provide a lower bound on d(x_1, x_2). If you already know a point that is closer to x_1 than d(x_1, x_3) − d(x_2, x_3), then you can avoid computing d(x_1, x_2).

29 Nearest Neighbour: How To Set the Hyperparameter k: Bootstrapping and Cross-Validation
Cross-validation or bootstrapping on the training data: split the training dataset into two subsets T_1 and T_2; for different choices of k, compute the accuracy on T_2, using T_1 as the training set; repeat the above two steps several times; pick the k with the highest average accuracy across all iterations.

30 Nearest Neighbour: How To Weight the Dimensions
The Mahalanobis distance d_M takes the d × d covariance structure Σ between features into account:
d_M(x, x') = sqrt( (x − x')ᵀ Σ⁻¹ (x − x') ),
where the covariance matrix is defined as Σ_ij = cov(X_i, X_j) = E[(X_i − µ_i)(X_j − µ_j)]. X_i is a random variable representing feature i and µ_i is its mean.
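
A sketch of the Mahalanobis distance, with Σ estimated from training data (illustrative; the random data is only there to make the snippet self-contained):

```python
import numpy as np

def mahalanobis_distance(x, x_prime, Sigma_inv):
    """d_M(x, x') = sqrt((x - x')^T Sigma^{-1} (x - x'))."""
    diff = x - x_prime
    return np.sqrt(diff @ Sigma_inv @ diff)

X_train = np.random.default_rng(0).normal(size=(100, 3))
Sigma = np.cov(X_train, rowvar=False)        # d x d covariance of the features
Sigma_inv = np.linalg.inv(Sigma)
print(mahalanobis_distance(X_train[0], X_train[1], Sigma_inv))
```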

31 Naive Bayes Classifier

32 Naive Bayes: Bayes' Rule (named after Thomas Bayes, 1701-1761)
P(Y = y | X = x) = P(X = x | Y = y) P(Y = y) / P(X = x).
Here, X is the random variable that describes the features of an object, and Y the random variable that describes its class label. x is an instantiation of X and y an instantiation of Y.

33 Naive Bayes: Naive Bayes Classification
Classify x into one of m classes y_1, ..., y_m:
arg max_{y_i} P(Y = y_i | X = x) = arg max_{y_i} P(X = x | Y = y_i) P(Y = y_i) / P(X = x),
where P(Y = y_i) is the prior probability, P(X = x) is the evidence, P(X = x | Y = y_i) is the likelihood, and P(Y = y_i | X = x) is the posterior probability.

34 Naive Bayes: Simplifications
P(X = x) is the same for all classes, so we ignore this term. That means P(Y = y_i | X = x) ∝ P(Y = y_i) P(X = x | Y = y_i). If x is multidimensional, that is, if x contains d features x = (x_1, ..., x_d), we further assume that the features are conditionally independent given the class label:
P(X = x | Y = y_i) = P(X = (x_1, ..., x_d) | Y = y_i) = ∏_{j=1}^{d} P(X_j = x_j | Y = y_i).

35 Naive Bayes: The Actual Classification
The actual classification is performed by computing
arg max_{y_i} P(Y = y_i | X = x) ∝ P(Y = y_i) ∏_{j=1}^{d} P(X_j = x_j | Y = y_i).

36 Naive Bayes: How To Train the Model
One has to estimate P(Y). Typically, one assumes that all classes have the same probability, or one infers the class probabilities from the class frequencies in the training dataset. One also has to estimate P(X | Y). Popular choices are:
for binary data: P(X = x | Y = y_i) = ∏_{j=1}^{d} Ber(X_j = x_j | θ_{j,i}), where θ_{j,i} is the probability that feature X_j takes value 1 in class y_i;
for continuous data: P(X = x | Y = y_i) = ∏_{j=1}^{d} N(x_j | µ_{j,i}, σ²_{j,i}), where µ_{j,i} is the mean of feature j in objects of class i, and σ²_{j,i} is the variance.
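
For continuous features, training and prediction can be sketched as follows (a minimal Gaussian Naive Bayes, illustrative only; the small variance offset is added for numerical stability and is not part of the slides):

```python
import numpy as np

def train_gaussian_nb(X, y):
    """Estimate class priors and per-class feature means/variances."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    means = {c: X[y == c].mean(axis=0) for c in classes}
    variances = {c: X[y == c].var(axis=0) + 1e-9 for c in classes}
    return classes, priors, means, variances

def predict_gaussian_nb(x, classes, priors, means, variances):
    """argmax_y log P(Y=y) + sum_j log N(x_j | mu_{j,y}, sigma^2_{j,y})."""
    def log_posterior(c):
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * variances[c])
                                + (x - means[c]) ** 2 / variances[c])
        return np.log(priors[c]) + log_lik
    return max(classes, key=log_posterior)
```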

37 Naive Bayes: Fundamental Probability Distributions
The Bernoulli distribution: Ber(x | θ) = θ if x = 1, and 1 − θ if x = 0.
The Gaussian (normal) distribution: N(x | µ, σ²) = 1/sqrt(2πσ²) · exp(−(x − µ)² / (2σ²)).
The multivariate Gaussian distribution: N(x | µ, Σ) = 1 / ((2π)^{d/2} |Σ|^{1/2}) · exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ)).

38 Naive Bayes: Example
P(Y = y_i | X = x) ∝ P(Y = y_i) ∏_{j=1}^{d} P(X_j = x_j | Y = y_i).
[Slide shows example DNA sequence data (A, C, G, T characters) to which Naive Bayes is applied.]

39 Naive Bayes: Advantages
Speed: the effort of prediction for one test point is O(md), as we have to compute the class posterior for all m classes.
Ability to deal with missing data: missing features x_j can simply be dropped when evaluating the class posteriors, by dropping P(x_j | y_i).
Ability to combine discrete and continuous features: use discrete or continuous probability distributions for each attribute.
Practical performance: despite the unrealistic independence assumption on the features, Naive Bayes often provides good results in practice.

40 Naive Bayes: Fundamental Probability Distributions (repetition of slide 37: the Bernoulli, Gaussian and multivariate Gaussian distributions).

41 Linear Discriminant Analysis
Based on: Shai Shalev-Shwartz and Shai Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press 2014, Chapter 24.3.

42 Linear Discriminant Analysis: Underlying Assumptions of LDA
The goal is to predict a label y ∈ {0, 1} from a vector x of d features, that is, x = (x_1, ..., x_d). We assume that P(Y = 1) = P(Y = 0) = 1/2. We assume that the conditional probability of X given Y is a multivariate Gaussian distribution. The covariance matrix Σ is the same for both classes, Y = 0 and Y = 1, but they differ in their means, µ_0 and µ_1.

43 Linear Discriminant Analysis: Log-Likelihood Ratio
The density is then N(x | µ_y, Σ) = 1 / ((2π)^{d/2} |Σ|^{1/2}) · exp(−½ (x − µ_y)ᵀ Σ⁻¹ (x − µ_y)).
We would like to find arg max_{y ∈ {0,1}} P(Y = y) P(X = x | Y = y). We predict f(x) = 1 if the log-likelihood ratio exceeds zero:
log[ P(Y = 1) P(X = x | Y = 1) / (P(Y = 0) P(X = x | Y = 0)) ] > 0.

44 Linear Discriminant Analysis: Link to Linear Regression
The log-likelihood ratio is equivalent to
½ (x − µ_0)ᵀ Σ⁻¹ (x − µ_0) − ½ (x − µ_1)ᵀ Σ⁻¹ (x − µ_1).
This can be rewritten as ⟨w, x⟩ + b = Σ_{i=1}^{d} w_i x_i + b, where
w = (µ_1 − µ_0)ᵀ Σ⁻¹ and b = ½ (µ_0ᵀ Σ⁻¹ µ_0 − µ_1ᵀ Σ⁻¹ µ_1).

45 Linear Discriminant Analysis: Link to Linear Regression
This shows that a linear predictor of the form f(x) = sgn(⟨w, x⟩ + b) is the Bayes-optimal classifier under the aforementioned assumptions. To perform LDA, all we have to do is to estimate µ_0, µ_1 and Σ from the data and use the equations above to determine w and b. The classification is then performed via f(x) = sgn(⟨w, x⟩ + b). This mathematical elegance goes hand in hand with the difficulty of computing Σ⁻¹ in very high-dimensional spaces.
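
LDA therefore reduces to estimating a few moments of the data. A minimal sketch (illustrative, assuming labels in {0, 1} and a pooled covariance estimate shared by both classes):

```python
import numpy as np

def fit_lda(X, y):
    """Estimate mu_0, mu_1 and the shared covariance Sigma, then compute
    w = Sigma^{-1}(mu_1 - mu_0) and
    b = 0.5 * (mu_0^T Sigma^{-1} mu_0 - mu_1^T Sigma^{-1} mu_1)."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # pooled covariance estimate shared by both classes
    Sigma = (np.cov(X0, rowvar=False) * (len(X0) - 1)
             + np.cov(X1, rowvar=False) * (len(X1) - 1)) / (len(X) - 2)
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu0)
    b = 0.5 * (mu0 @ Sigma_inv @ mu0 - mu1 @ Sigma_inv @ mu1)
    return w, b

def predict_lda(X, w, b):
    return np.sign(X @ w + b)
```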

46 Logistic Regression

47 Logistic Regression: Concept (David Cox, 1958)
Logistic regression is a classification model for binary output variables, y ∈ {−1, 1}. We define an auxiliary target variable z, which is expressed as a linear function of the input variable x, that is, z = w_0 + Σ_{i=1}^{d} w_i x_i. The logistic function f now maps z to the interval [0, 1]:
f(z) = exp(z) / (exp(z) + 1) = 1 / (1 + exp(−z)).

48 Logistic Regression
[Figure: plot of the logistic function f(z) against z.]

49 Logistic Regression: Concept (David Cox, 1958)
The logistic function can now be written as
f_w(x) = f(⟨w, x⟩) = 1 / (1 + exp(−(w_0 + Σ_{i=1}^{d} w_i x_i))).
f_w(x) is the probability that x is in class 1.

50 Logistic Regression: Concept (David Cox, 1958)
One can now define the inverse of the logistic function, g = f⁻¹, the logit or log-odds function:
g(f_w(x)) = ln( f_w(x) / (1 − f_w(x)) ) = w_0 + Σ_{i=1}^{d} w_i x_i.
g ∘ f_w clarifies the link to linear regression.

51 Logistic Regression: Training the Model
While the link to linear regression is clear now, how to train the model remains unclear. We need an objective that is optimized in the training step to learn the weights w.

52 Logistic Regression
P(y = 1 | X = x) = 1 / (1 + exp(−⟨w, x⟩)).
[Figure: the loss of a training point as a function of ⟨w, x⟩.]

53 Logistic Regression
P(y = 1 | X = x) = 1 / (1 + exp(−⟨w, x⟩)).
[Figure: the loss as a function of ⟨w, x⟩, continued.]

54 Logistic Regression: Training the Model
The log probability of each point is therefore
log( 1 / (1 + exp(−y ⟨w, x⟩)) ) = log( (1 + exp(−y ⟨w, x⟩))⁻¹ ) = −log(1 + exp(−y ⟨w, x⟩)).
To train the logistic regression model, we minimize the total negative log probability over all points, the logistic loss function, which is convex in w:
arg min_{w ∈ ℝ^d} (1/n) Σ_{i=1}^{n} log(1 + exp(−y_i ⟨w, x_i⟩)).
Minimizing this objective corresponds to maximum likelihood estimation; several optimization approaches and software packages, e.g. liblinear, implement it.
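
A minimal sketch of minimizing the logistic loss with plain gradient descent (illustrative only; in practice one would use a dedicated solver such as liblinear). Labels are assumed to be in {−1, +1}, and a constant feature can be appended to X to model w_0.

```python
import numpy as np

def train_logistic_regression(X, y, lr=0.1, n_iter=1000):
    """Minimise (1/n) * sum_i log(1 + exp(-y_i <w, x_i>)) by gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        margins = y * (X @ w)
        # gradient of the logistic loss: -(1/n) sum_i y_i x_i / (1 + exp(margin_i))
        grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / n
        w -= lr * grad
    return w

def predict(X, w):
    return np.sign(X @ w)
```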

55 Logistic Regression: Discussion (Kevin Murphy, MIT Press 2012)
Logistic regression models are easy to fit: the algorithms are easy to implement and fast. There are methods to train logistic regression models that take time linear in the number of non-zero features in the data, which is the minimal possible time. Logistic regression models are easy to interpret, as their output represents the log-odds ratio between the positive and the negative class. Several important extensions (multiclass classification, nonlinear decision boundaries, other types of output data) are possible.

56 Decision Trees

57 Decision Tree: Key Idea
Recursively split the data space into regions that contain a single class only.
[Figure: example decision tree with tests on Age, Pregnancy, Treated with drug B, Treated with drug C and Heart problems, and leaves labelled Treatment / No Treatment.]

58 Decision Tree: Concept
A decision tree is a flowchart-like tree structure with a root (the uppermost node), internal nodes (which represent tests on an attribute), branches (which represent outcomes of a test) and leaf nodes (which hold a class label).
[Figure: the same example tree as above.]

59 Decision Tree: Classification
Given a test point x: perform the test on the attributes of x at the root; follow the branch that corresponds to the outcome of this test; repeat this procedure until you reach a leaf node; predict the label of x to be the label of that leaf node.
[Figure: the same example tree as above.]

60 Decision Tree: Popularity
Decision trees require no domain knowledge, are easy to interpret, and construction and prediction are fast. But how to construct a decision tree?
[Figure: the same example tree as above.]

61 Decision Tree: Construction of Decision Trees
Training procedure: 1. start with all training examples; 2. select the attribute and threshold that give the best split; 3. create child nodes based on the split; 4. repeat steps 2 and 3 on each child using its data, until a stopping criterion is reached (all examples are in the same class, the number of examples in a node is too small, or the tree gets too large).
Central problem: how to choose the best attribute? Via a cost function.

62 Decision Tree
function DecisionTree(D):
    if stopping criterion fulfilled:
        the predicted class for points in D is the majority class in D
    otherwise:
        (A, θ) = arg max_(A,θ) [ cost(D) − cost(D | A, θ) ]
        D_{A,<θ} = {x ∈ D : A < θ}
        D_{A,≥θ} = {x ∈ D : A ≥ θ}
        create two child nodes of D, containing D_{A,<θ} and D_{A,≥θ}
        DecisionTree(D_{A,<θ})
        DecisionTree(D_{A,≥θ})
end

63 Decision Tree: Information Gain (Quinlan, 1986)
ID3 (short for Iterative Dichotomizer 3) uses the information gain as its attribute selection measure. The information content is defined as
Info(D) = − Σ_{i=1}^{m} p(y = y_i | x ∈ D) log₂ p(y = y_i | x ∈ D),
where p(y = y_i | x ∈ D) is the probability that an arbitrary tuple in D belongs to class y_i and is estimated by |{(x_j, y_j) : x_j ∈ D, y_j = y_i}| / |D|. This is also known as the Shannon entropy of D.

64 Decision Tree: Information Gain
Assume that attribute A was used to split D into v partitions or subsets {D_1, D_2, ..., D_v}, where D_j contains those tuples in D that have outcome a_j of A. Ideally, the D_j would provide a perfect classification, but they seldom do. How much more information do we need to arrive at an exact classification? This is quantified by
Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) Info(D_j). (1)

65 Decision Tree: Information Gain
The information gain is the loss of entropy (increase in information) that is caused by splitting with respect to attribute A:
Gain(A) = Info(D) − Info_A(D). (2)
We pick the attribute A such that this gain is maximised.
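
A sketch of the entropy and information gain computations for a candidate split (illustrative code, not from the slides):

```python
import numpy as np

def info(y):
    """Shannon entropy Info(D) of the class labels in a node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, partitions):
    """Gain(A) = Info(D) - sum_j |D_j|/|D| * Info(D_j),
    where `partitions` are the label arrays of the child nodes."""
    n = len(y)
    info_a = sum(len(part) / n * info(part) for part in partitions)
    return info(y) - info_a

y = np.array([0, 0, 0, 1, 1, 1])
print(information_gain(y, [y[:3], y[3:]]))  # perfect split: gain = 1.0
```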

66 Decision Tree: Gain Ratio (Quinlan, 1993)
The information gain is biased towards attributes with a large number of values; for example, an ID attribute maximises the information gain! Hence C4.5 uses an extension of the information gain, the gain ratio. The gain ratio is based on the split information, defined as
SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j| / |D|) log₂(|D_j| / |D|), (3)
and is given by
GainRatio(A) = Gain(A) / SplitInfo_A(D). (4)

67 Decision Tree: Gain Ratio
The attribute with maximum gain ratio is selected as the splitting attribute. The ratio becomes unstable as the split information approaches zero. A constraint is added to ensure that the information gain of the selected test is at least as great as the average gain over all tests examined.

68 Decision Tree: Gini Index (Breiman, Friedman, Olshen and Stone, 1984)
The Gini index is the attribute selection measure in the CART system. It measures class impurity as
Gini(D) = 1 − Σ_{i=1}^{m} p(y = y_i)².
If we split via attribute A into partitions {D_1, D_2, ..., D_v}, the Gini index of this partitioning is defined as
Gini_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) Gini(D_j), (5)
and the reduction in impurity by a split on A is
ΔGini(A) = Gini(D) − Gini_A(D). (6)
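
The analogous computation for the Gini index (illustrative sketch):

```python
import numpy as np

def gini(y):
    """Gini(D) = 1 - sum_i p(y = y_i)^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_reduction(y, partitions):
    """Reduction in impurity: Gini(D) - sum_j |D_j|/|D| * Gini(D_j)."""
    n = len(y)
    gini_a = sum(len(part) / n * gini(part) for part in partitions)
    return gini(y) - gini_a

y = np.array([0, 0, 0, 1, 1, 1])
print(gini_reduction(y, [y[:3], y[3:]]))  # perfect split: reduction = 0.5
```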

69 Decision Tree: Advantages I
Simple to understand and to interpret; trees can be visualised. Requires little data preparation: other techniques often require data normalisation, dummy variables to be created and blank values to be removed. The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree. Able to handle both numerical and categorical data; other techniques are usually specialised in analysing datasets that have only one type of variable. Able to handle multi-output problems. Uses a white-box model: if a given situation is observable in a model, the explanation for the condition is easily given by boolean logic. By contrast, in a black-box model (e.g., an artificial neural network), results may be more difficult to interpret.

70 Decision Tree: Advantages II
Possible to validate a model using statistical tests, which makes it possible to account for the reliability of the model. Performs well even if its assumptions are somewhat violated by the true model from which the data were generated. Source:

71 Decision Tree: Disadvantages I
Decision-tree learners can create over-complex trees that do not generalise the data well; this is called overfitting. Mechanisms such as pruning (not currently supported), setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem. Decision trees can be unstable because small variations in the data might result in a completely different tree being generated; this problem is mitigated by using decision trees within an ensemble. The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristics such as the greedy algorithm, where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by

72 Decision Tree: Disadvantages II
training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement. There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems. Decision tree learners create biased trees if some classes dominate; it is therefore recommended to balance the dataset prior to fitting the decision tree. Source:

73 Decision Trees: The XOR Problem
[Figure: scatter plot illustrating the XOR problem.]

74 Decision Trees: Random Forests (Breiman, 2001)
To reduce overfitting and the large variability of decision trees, one may use an ensemble of several decision trees. Breiman (2001) introduced Random Forests, a collection of k decision trees. The idea is to subsample a subset of the instances and a subset of the features of the training dataset k times and to construct a decision tree on each of these k samples. The classification is then performed by taking a majority vote over the trees. Three key parameters: the number of instances per subset, the number of features per subset, and the number of trees k.

75 Support Vector Machines

76 Support Vector Machines: Hyperplane Classifiers
Vapnik et al. (1974) defined a family of classifiers for binary classification problems: the class of hyperplanes in some dot product space H,
⟨w, x⟩ + b = 0, (7)
where w ∈ H, b ∈ ℝ. These correspond to decision functions ("classifiers")
f(x) = sgn(⟨w, x⟩ + b). (8)
Vapnik et al. proposed a learning algorithm for determining this f from the training dataset.

77 Support Vector Machines
The optimal hyperplane maximises the margin of separation between any training point and the hyperplane:
max_{w ∈ H, b ∈ ℝ} min{ ‖x − x_i‖ : x ∈ H, ⟨w, x⟩ + b = 0, i ∈ {1, ..., n} }. (9)

78 Support Vector Machines: Hard-Margin SVM
min_{w ∈ H, b ∈ ℝ} ½ ‖w‖² (10)
subject to y_i (⟨w, x_i⟩ + b) ≥ 1 for all i ∈ {1, ..., n}.
Why minimise ½ ‖w‖²? The size of the margin is 2 / ‖w‖: the smaller ‖w‖, the larger the margin. Why do we have to obey the constraints y_i (⟨w, x_i⟩ + b) ≥ 1? They ensure that all training data points of the same class are on the same side of the hyperplane and outside the margin.

79 Support Vector Machines: The Lagrangian
We form the Lagrangian
L(w, b, α) = ½ ‖w‖² − Σ_{i=1}^{n} α_i ( y_i (⟨x_i, w⟩ + b) − 1 ). (11)
The Lagrangian is minimised with respect to the primal variables w and b, and maximised with respect to the dual variables α_i.

80 Support Vector Machines: Support Vectors
At optimality,
∂L(w, b, α)/∂b = 0 and ∂L(w, b, α)/∂w = 0, (12)
such that
Σ_{i=1}^{n} α_i y_i = 0 and w = Σ_{i=1}^{n} α_i y_i x_i. (13)
Hence the solution vector w, the crucial parameter of the SVM classifier, has an expansion in terms of the training points and their labels. The training points with α_i > 0 are the support vectors.

81 Support Vector Machines: The Dual Problem
Plugging (13) into the Lagrangian (11), we obtain the dual optimization problem that is solved in practice:
maximise_{α ∈ ℝⁿ} W(α) = Σ_{i=1}^{n} α_i − ½ Σ_{i,j=1}^{n} α_i α_j y_i y_j ⟨x_i, x_j⟩. (14)

82 Support Vector Machines: The Kernel Trick
The key insight is that (14) accesses the training data only in terms of inner products ⟨x_i, x_j⟩. We can plug in an inner product of our choice from any space H here! This is referred to as a kernel k:
k(x_i, x_j) = ⟨x_i, x_j⟩_H. (15)

83 Support Vector Machines: Definition of a Kernel
We assume that data points x ∈ ℝ^d are mapped to a space H via a mapping φ : ℝ^d → H. Then the kernel is defined as an inner product ⟨·, ·⟩ : H × H → ℝ between data points in this space H:
k(x, x') = ⟨φ(x), φ(x')⟩.

84 Support Vector Machines: Prediction
The decision function for Support Vector Machine classification then becomes
f(x) = sgn( Σ_{i=1}^{n} y_i α_i ⟨φ(x), φ(x_i)⟩ + b ) (16)
     = sgn( Σ_{i=1}^{n} y_i α_i k(x, x_i) + b ). (17)
The prediction effort is linear in the number of non-zero entries of α, that is, in the worst case O(n). In practice, the number of support vectors is often much smaller than n.
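
Given dual coefficients α (obtained from any quadratic programming solver), the prediction step can be sketched as follows (illustrative; `kernel` stands for any kernel function as defined above, and the argument names are assumptions):

```python
import numpy as np

def svm_decision_function(x, support_vectors, labels, alphas, b, kernel):
    """f(x) = sgn( sum_i y_i alpha_i k(x, x_i) + b ), summed over the
    support vectors, i.e. the training points with alpha_i > 0."""
    score = sum(alpha_i * y_i * kernel(x, x_i)
                for x_i, y_i, alpha_i in zip(support_vectors, labels, alphas))
    return np.sign(score + b)
```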

85 Support Vector Machines
[Figure: a separating hyperplane with margin errors, annotated with slack variables ξ_i and ξ_j.]

86 Support Vector Machines: Soft-Margin SVM, the C-SVM (Cortes and Vapnik, 1995)
Points are now allowed to lie inside the margin or in the wrong halfspace (margin errors):
min_{w ∈ H, b ∈ ℝ, ξ ∈ ℝⁿ} ½ ‖w‖² + C Σ_{i=1}^{n} ξ_i (18)
subject to y_i (⟨w, x_i⟩ + b) ≥ 1 − ξ_i, ξ_i ≥ 0, for all i ∈ {1, ..., n}. (19)
C ∈ ℝ is a penalty parameter that determines the trade-off between maximizing the margin and minimizing the margin errors. ξ is a slack variable, which measures the degree of misclassification of each margin error.

87 Support Vector Machines: Soft-Margin SVM, the ν-SVM (Schölkopf et al., 2000)
An alternative formulation of soft-margin SVMs uses the parameter ν ∈ (0, 1]:
min_{w ∈ H, b, ρ ∈ ℝ, ξ ∈ ℝⁿ} ½ ‖w‖² + (1/n) Σ_{i=1}^{n} ξ_i − νρ (20)
subject to y_i (⟨w, x_i⟩ + b) ≥ ρ − ξ_i, ξ_i ≥ 0, ρ ≥ 0, for all i ∈ {1, ..., n}. (21)
ν can be shown to be a lower bound on the fraction of support vectors and an upper bound on the fraction of margin errors among all training points.

88 Support Vector Machines: Class Imbalance, SVM with Different Error Costs (Veropoulos et al., 1999)
Margin errors in different classes are penalized differently (two parameters C₊ and C₋ instead of one):
min_{w ∈ H, b ∈ ℝ, ξ ∈ ℝⁿ} ½ ‖w‖² + C₊ Σ_{i : y_i = 1} ξ_i + C₋ Σ_{i : y_i = −1} ξ_i (22)
subject to y_i (⟨w, x_i⟩ + b) ≥ 1 − ξ_i, ξ_i ≥ 0, for all i ∈ {1, ..., n}. (23)
The idea is that margin errors in the smaller class should be more expensive than in the larger class.

89 Kernels

90 Kernels: Building Kernels in Practice
In practical applications, one constructs kernels by 1. proposing a kernel function and explicitly defining the corresponding mapping φ, or/and 2. combining known kernels in ways that obey the closure properties of kernels. Knowing these closure properties and a set of basic kernel functions is of utmost importance when working with kernels in practice.

91 Kernels: Closure Properties of Kernels
Assume we are given two kernels k_1 and k_2. Then k_1 + k_2 is a kernel as well, and so is their product k_1 · k_2. Assume we are given a kernel function k and a positive scalar λ ∈ ℝ₊. Then λk is a kernel as well. Assume that a kernel k is only defined on a set D. Then its zero-extension k⁰ is also a kernel:
k⁰(x, x') = k(x, x') if x ∈ D and x' ∈ D, and 0 otherwise.

92 Kernels: Some Prominent Kernels
The linear kernel: k(x, x') = Σ_{l=1}^{d} x_l x'_l = xᵀx'. (24)
The polynomial kernel: k(x, x') = (xᵀx' + c)^d, where c, d ∈ ℝ.
The Gaussian Radial Basis Function (RBF) kernel: k(x, x') = exp(−‖x − x'‖² / (2σ²)). (25)
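
The three kernels above in code (illustrative sketch; the parameter defaults c, degree and sigma are arbitrary choices, not values from the slides):

```python
import numpy as np

def linear_kernel(x, x_prime):
    return np.dot(x, x_prime)

def polynomial_kernel(x, x_prime, c=1.0, degree=3):
    return (np.dot(x, x_prime) + c) ** degree

def rbf_kernel(x, x_prime, sigma=1.0):
    return np.exp(-np.sum((x - x_prime) ** 2) / (2 * sigma ** 2))

x, x2 = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(x, x2), polynomial_kernel(x, x2), rbf_kernel(x, x2))
```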

93 Kernels: Some Useful Kernels
The constant all-ones kernel: k(x, x') = 1.
The delta (Dirac) kernel: k(x, x') = 1 if x = x', and 0 otherwise.

94 Kernels on Structured Data: R-Convolution Kernels (Haussler, 1999)
R-convolution kernels are a famous recipe for constructing kernels on structured data. The idea is to decompose objects X and X' via a relation R into sets of substructures S and S'. Their simplest and most widely used instance, k_R, compares all pairs of these substructures of X and X':
k_R(X, X') = Σ_{s ∈ S, s' ∈ S'} k_base(s, s').
For instance, the substructures could be the elements of a set, the nodes of a graph or the substrings of a string. k_base is an arbitrary vectorial kernel, very often simply the delta kernel.

95 Kernels on Strings: String Kernels
The spectrum kernel counts all pairs of matching substrings in two strings. That is, X and X' are strings and S and S' are the sets of all of their substrings:
k_R(X, X') = Σ_{s ∈ S, s' ∈ S'} k_base(s, s'),
where k_base is a delta kernel, that is, for every pair of matching substrings the kernel increases by one. Naively, one has to enumerate all substrings (quadratic effort, O(|X|² + |X'|²)). Via a special data structure called suffix trees, it can be computed in linear time O(|X| + |X'|) (Vishwanathan and Smola, 2003).

96 Kernels on Graphs: Random Walk Kernel
Count matching walks in G and G': the more matching walks there are, the higher the similarity between G and G'. Matching walks are sequences of nodes and edges with matching node labels.
Elegant computation (Gärtner et al., 2003): compute the direct product graph of G and G' and count all walks of length k in it. Every walk in the product graph corresponds to one walk in G and one walk in G':
k_×(G, G') = Σ_{i,j=1}^{|V_×|} [ Σ_{k=0}^{∞} λ^k A_×^k ]_{ij} = eᵀ (I − λ A_×)⁻¹ e,
where A_× is the adjacency matrix of the product graph and e is the all-ones vector.

97 Kernels on Graphs
[Figure: two labelled graphs and their direct product graph.]


More information

Part I. Linear Discriminant Analysis. Discriminant analysis. Discriminant analysis

Part I. Linear Discriminant Analysis. Discriminant analysis. Discriminant analysis Week 5 Based in part on slides from textbook, slides of Susan Holmes Part I Linear Discriminant Analysis October 29, 2012 1 / 1 2 / 1 Nearest centroid rule Suppose we break down our data matrix as by the

More information

Support Vector Machines

Support Vector Machines EE 17/7AT: Optimization Models in Engineering Section 11/1 - April 014 Support Vector Machines Lecturer: Arturo Fernandez Scribe: Arturo Fernandez 1 Support Vector Machines Revisited 1.1 Strictly) Separable

More information

Support Vector Machine & Its Applications

Support Vector Machine & Its Applications Support Vector Machine & Its Applications A portion (1/3) of the slides are taken from Prof. Andrew Moore s SVM tutorial at http://www.cs.cmu.edu/~awm/tutorials Mingyue Tan The University of British Columbia

More information

Machine Learning Practice Page 2 of 2 10/28/13

Machine Learning Practice Page 2 of 2 10/28/13 Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 Exam policy: This exam allows two one-page, two-sided cheat sheets (i.e. 4 sides); No other materials. Time: 2 hours. Be sure to write

More information

COMP 652: Machine Learning. Lecture 12. COMP Lecture 12 1 / 37

COMP 652: Machine Learning. Lecture 12. COMP Lecture 12 1 / 37 COMP 652: Machine Learning Lecture 12 COMP 652 Lecture 12 1 / 37 Today Perceptrons Definition Perceptron learning rule Convergence (Linear) support vector machines Margin & max margin classifier Formulation

More information

Linear Classification and SVM. Dr. Xin Zhang

Linear Classification and SVM. Dr. Xin Zhang Linear Classification and SVM Dr. Xin Zhang Email: eexinzhang@scut.edu.cn What is linear classification? Classification is intrinsically non-linear It puts non-identical things in the same class, so a

More information

A Tutorial on Support Vector Machine

A Tutorial on Support Vector Machine A Tutorial on School of Computing National University of Singapore Contents Theory on Using with Other s Contents Transforming Theory on Using with Other s What is a classifier? A function that maps instances

More information

Multivariate statistical methods and data mining in particle physics Lecture 4 (19 June, 2008)

Multivariate statistical methods and data mining in particle physics Lecture 4 (19 June, 2008) Multivariate statistical methods and data mining in particle physics Lecture 4 (19 June, 2008) RHUL Physics www.pp.rhul.ac.uk/~cowan Academic Training Lectures CERN 16 19 June, 2008 1 Outline Statement

More information

CS798: Selected topics in Machine Learning

CS798: Selected topics in Machine Learning CS798: Selected topics in Machine Learning Support Vector Machine Jakramate Bootkrajang Department of Computer Science Chiang Mai University Jakramate Bootkrajang CS798: Selected topics in Machine Learning

More information

Learning with multiple models. Boosting.

Learning with multiple models. Boosting. CS 2750 Machine Learning Lecture 21 Learning with multiple models. Boosting. Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Learning with multiple models: Approach 2 Approach 2: use multiple models

More information

Perceptron Revisited: Linear Separators. Support Vector Machines

Perceptron Revisited: Linear Separators. Support Vector Machines Support Vector Machines Perceptron Revisited: Linear Separators Binary classification can be viewed as the task of separating classes in feature space: w T x + b > 0 w T x + b = 0 w T x + b < 0 Department

More information

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation

More information

Support Vector Machines Explained

Support Vector Machines Explained December 23, 2008 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),

More information

L5 Support Vector Classification

L5 Support Vector Classification L5 Support Vector Classification Support Vector Machine Problem definition Geometrical picture Optimization problem Optimization Problem Hard margin Convexity Dual problem Soft margin problem Alexander

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Reading: Ben-Hur & Weston, A User s Guide to Support Vector Machines (linked from class web page) Notation Assume a binary classification problem. Instances are represented by vector

More information

Announcements - Homework

Announcements - Homework Announcements - Homework Homework 1 is graded, please collect at end of lecture Homework 2 due today Homework 3 out soon (watch email) Ques 1 midterm review HW1 score distribution 40 HW1 total score 35

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Machine Learning Concepts in Chemoinformatics

Machine Learning Concepts in Chemoinformatics Machine Learning Concepts in Chemoinformatics Martin Vogt B-IT Life Science Informatics Rheinische Friedrich-Wilhelms-Universität Bonn BigChem Winter School 2017 25. October Data Mining in Chemoinformatics

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Thomas G. Dietterich tgd@eecs.oregonstate.edu 1 Outline What is Machine Learning? Introduction to Supervised Learning: Linear Methods Overfitting, Regularization, and the

More information

Kaggle.

Kaggle. Administrivia Mini-project 2 due April 7, in class implement multi-class reductions, naive bayes, kernel perceptron, multi-class logistic regression and two layer neural networks training set: Project

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

Support Vector Machine for Classification and Regression

Support Vector Machine for Classification and Regression Support Vector Machine for Classification and Regression Ahlame Douzal AMA-LIG, Université Joseph Fourier Master 2R - MOSIG (2013) November 25, 2013 Loss function, Separating Hyperplanes, Canonical Hyperplan

More information

Machine Learning. Support Vector Machines. Manfred Huber

Machine Learning. Support Vector Machines. Manfred Huber Machine Learning Support Vector Machines Manfred Huber 2015 1 Support Vector Machines Both logistic regression and linear discriminant analysis learn a linear discriminant function to separate the data

More information

Statistical Methods for Data Mining

Statistical Methods for Data Mining Statistical Methods for Data Mining Kuangnan Fang Xiamen University Email: xmufkn@xmu.edu.cn Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find

More information

CMU-Q Lecture 24:

CMU-Q Lecture 24: CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input

More information

Induction of Decision Trees

Induction of Decision Trees Induction of Decision Trees Peter Waiganjo Wagacha This notes are for ICS320 Foundations of Learning and Adaptive Systems Institute of Computer Science University of Nairobi PO Box 30197, 00200 Nairobi.

More information

Holdout and Cross-Validation Methods Overfitting Avoidance

Holdout and Cross-Validation Methods Overfitting Avoidance Holdout and Cross-Validation Methods Overfitting Avoidance Decision Trees Reduce error pruning Cost-complexity pruning Neural Networks Early stopping Adjusting Regularizers via Cross-Validation Nearest

More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

(Kernels +) Support Vector Machines

(Kernels +) Support Vector Machines (Kernels +) Support Vector Machines Machine Learning Torsten Möller Reading Chapter 5 of Machine Learning An Algorithmic Perspective by Marsland Chapter 6+7 of Pattern Recognition and Machine Learning

More information

Artificial Intelligence Roman Barták

Artificial Intelligence Roman Barták Artificial Intelligence Roman Barták Department of Theoretical Computer Science and Mathematical Logic Introduction We will describe agents that can improve their behavior through diligent study of their

More information

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM 1 Support Vector Machines (SVM) in bioinformatics Day 1: Introduction to SVM Jean-Philippe Vert Bioinformatics Center, Kyoto University, Japan Jean-Philippe.Vert@mines.org Human Genome Center, University

More information

10/05/2016. Computational Methods for Data Analysis. Massimo Poesio SUPPORT VECTOR MACHINES. Support Vector Machines Linear classifiers

10/05/2016. Computational Methods for Data Analysis. Massimo Poesio SUPPORT VECTOR MACHINES. Support Vector Machines Linear classifiers Computational Methods for Data Analysis Massimo Poesio SUPPORT VECTOR MACHINES Support Vector Machines Linear classifiers 1 Linear Classifiers denotes +1 denotes -1 w x + b>0 f(x,w,b) = sign(w x + b) How

More information

Applied Machine Learning Annalisa Marsico

Applied Machine Learning Annalisa Marsico Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 22 April, SoSe 2015 Goals Feature Selection rather than Feature

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Machine Learning, Midterm Exam: Spring 2009 SOLUTION

Machine Learning, Midterm Exam: Spring 2009 SOLUTION 10-601 Machine Learning, Midterm Exam: Spring 2009 SOLUTION March 4, 2009 Please put your name at the top of the table below. If you need more room to work out your answer to a question, use the back of

More information

Neural Networks. Prof. Dr. Rudolf Kruse. Computational Intelligence Group Faculty for Computer Science

Neural Networks. Prof. Dr. Rudolf Kruse. Computational Intelligence Group Faculty for Computer Science Neural Networks Prof. Dr. Rudolf Kruse Computational Intelligence Group Faculty for Computer Science kruse@iws.cs.uni-magdeburg.de Rudolf Kruse Neural Networks 1 Supervised Learning / Support Vector Machines

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

Evaluation. Andrea Passerini Machine Learning. Evaluation

Evaluation. Andrea Passerini Machine Learning. Evaluation Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain

More information

Nearest Neighbors Methods for Support Vector Machines

Nearest Neighbors Methods for Support Vector Machines Nearest Neighbors Methods for Support Vector Machines A. J. Quiroz, Dpto. de Matemáticas. Universidad de Los Andes joint work with María González-Lima, Universidad Simón Boĺıvar and Sergio A. Camelo, Universidad

More information

CSC 411 Lecture 17: Support Vector Machine

CSC 411 Lecture 17: Support Vector Machine CSC 411 Lecture 17: Support Vector Machine Ethan Fetaya, James Lucas and Emad Andrews University of Toronto CSC411 Lec17 1 / 1 Today Max-margin classification SVM Hard SVM Duality Soft SVM CSC411 Lec17

More information