5. Classification and Prediction. 5.1 Introduction


5. Classification and Prediction

Contents of this Chapter
5.1 Introduction
5.2 Evaluation of Classifiers
5.3 Decision Trees
5.4 Bayesian Classification
5.5 Nearest-Neighbor Classification
5.6 Neural Networks
5.7 Support Vector Machines
5.8 Regression Analysis

5.1 Introduction

The Classification Problem
Let O be a set of objects of the form (o_1, ..., o_d) with attributes A_i, 1 <= i <= d, and class membership c_i, c_i in C = {c_1, ..., c_k}.
Wanted: the class membership for objects from D \ O, i.e. a classifier K: D -> C.
Difference to clustering:
  classification: the set of classes C is known a priori
  clustering: the classes are the output
Related problem: prediction
  predict the value of a numerical attribute
  method: e.g. regression analysis

5.1 Introduction

Training Data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classification Algorithm -> Classifier (Model):
  if rank = professor or years > 6 then tenured = yes
Unseen Data: (Jeff, Professor, 4) -> Tenured? yes

5.2 Evaluation of Classifiers

Approaches
Train-and-Test
  partition the set O into two subsets:
  - training data for training the classifier (model construction)
  - test data to evaluate the trained classifier
m-fold cross-validation (see the sketch below)
  - partition the set O into m subsets of equal size
  - train m different classifiers, each time using a different one of the m subsets as test data and the other subsets for training
  - combine the evaluation results of the m classifiers
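
The following is a minimal sketch of m-fold cross-validation as described above. The classifier interface (a factory returning an object with fit/predict methods) is a hypothetical assumption for illustration, not part of the original slides.

```python
# Minimal sketch of m-fold cross-validation.
import random

def cross_validate(objects, labels, make_classifier, m=10, seed=0):
    """Partition the data into m folds, train m classifiers and average their errors."""
    indices = list(range(len(objects)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::m] for i in range(m)]          # m subsets of (almost) equal size
    errors = []
    for i in range(m):
        test_idx = set(folds[i])                        # fold i serves as test data
        train_idx = [j for j in indices if j not in test_idx]
        clf = make_classifier()                         # hypothetical classifier factory
        clf.fit([objects[j] for j in train_idx], [labels[j] for j in train_idx])
        predictions = clf.predict([objects[j] for j in folds[i]])
        wrong = sum(p != labels[j] for p, j in zip(predictions, folds[i]))
        errors.append(wrong / len(folds[i]))            # classification error on fold i
    return sum(errors) / m                              # combined evaluation result
```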

5.2 Evaluation of Classifiers

Evaluation Criteria
Classification accuracy
Interpretability
  e.g. size of a decision tree
  insight gained by the user
Efficiency
  of model construction
  of model application
Scalability
  for large datasets
  for secondary-storage data
Robustness
  w.r.t. noise and unknown attribute values
Classification as an optimization problem: score of a classifier

5.2 Evaluation of Classifiers

Classification Accuracy
Let K be a classifier, TR a subset of O the training data, TE a subset of O the test data.
C(o): actual class of object o.
Classification accuracy of K on TE:
  G_TE(K) = |{o in TE : K(o) = C(o)}| / |TE|
Classification error:
  F_TE(K) = |{o in TE : K(o) != C(o)}| / |TE|
Accuracy aggregates over all classes c_i in C: not appropriate if a minority class is the most important one.

5.2 Evaluation of Classifiers

Classification Accuracy
Let c_1 in C be the target (positive) class and the union of all other classes the contrasting (negative) class.
Comparing the predicted and the actual class labels, we can distinguish four different cases (confusion matrix):

                        Actually positive     Actually negative
Predicted as positive   True Positive (TP)    False Positive (FP)
Predicted as negative   False Negative (FN)   True Negative (TN)

5.2 Evaluation of Classifiers

Classification Accuracy
We define the following two measures of K w.r.t. the given target class:
  Precision(K) = TP / (TP + FP)
  Recall(K) = TP / (TP + FN)
There is a trade-off between precision and recall. Therefore, we also define a measure combining precision and recall (see the sketch below):
  F-Measure(K) = 2 * Precision(K) * Recall(K) / (Precision(K) + Recall(K))
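
A minimal sketch computing accuracy, precision, recall and F-measure from predicted and actual labels, following the definitions above; the example labels at the bottom are hypothetical.

```python
# Confusion-matrix counts and the derived quality measures.
def evaluate(predicted, actual, positive_class):
    tp = sum(p == positive_class and a == positive_class for p, a in zip(predicted, actual))
    fp = sum(p == positive_class and a != positive_class for p, a in zip(predicted, actual))
    fn = sum(p != positive_class and a == positive_class for p, a in zip(predicted, actual))
    tn = sum(p != positive_class and a != positive_class for p, a in zip(predicted, actual))
    accuracy = (tp + tn) / len(actual)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall > 0 else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

# hypothetical example
print(evaluate(["+", "+", "-", "-", "+"], ["+", "-", "-", "+", "+"], positive_class="+"))
```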

5.2 Evaluation of Classifiers

True Classification Error
We measure the classification error on a (small) test dataset O, a subset of the instance space X.
Questions:
  How to estimate the true classification error on the whole instance space X?
  How does the deviation from the observed classification error depend on the size of the test set?
Random experiment to determine the classification error on a test set of size n:
  repeat n times:
    (1) draw a random object from X
    (2) compare the predicted vs. the actual class label for this object
  the classification error is the percentage of misclassified objects
The observed classification errors follow a Binomial distribution.

5.2 Evaluation of Classifiers

Binomial Distribution
n repeated tosses of a coin with unknown probability p of heads, where heads = misclassified object.
Record the number r of heads (misclassifications).
The Binomial distribution defines the probability of all possible values of r:
  P(r) = n! / (r! * (n - r)!) * p^r * (1 - p)^(n - r)
For a random variable Y governed by a Binomial distribution:
  E[Y] = n * p  (expected value)
  Var[Y] = n * p * (1 - p),  sigma_Y = sqrt(n * p * (1 - p))

5.2 Evaluation of Classifiers

Estimating the True Classification Error
We want to estimate the unknown true classification error p.
Estimator for p: E[Y] = n * p = r, hence p ~ r/n.
We also want confidence intervals for our estimate.
Standard deviation of the true classification error (Y/n):
  sigma_{Y/n} = sigma_Y / n = sqrt(n * p * (1 - p)) / n ~ sqrt( (r/n) * (1 - r/n) / n ),
using r/n as estimator for p.

5.2 Evaluation of Classifiers

Estimating the True Classification Error
For sufficiently large values of n, the Binomial distribution can be approximated by a Normal distribution with the same mean and standard deviation.
Let the random variable Y be Normally distributed with mean mu and standard deviation sigma, and let y be the observed value of Y.
Then the mean of Y falls into the interval y +/- z_N * sigma with a probability of N%.
In our context, the N% confidence interval for the true classification error is (see the sketch below):
  r/n +/- z_N * sqrt( (r/n) * (1 - r/n) / n )
The interval size decreases with increasing n; it increases with increasing N (and thus z_N).
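
A minimal sketch of the Normal-approximation confidence interval above; the table of z_N values covers a few common confidence levels, and the example counts (30 errors on 200 test objects) are hypothetical.

```python
# N% confidence interval for the true classification error: r/n +/- z_N * sqrt((r/n)(1-r/n)/n)
from math import sqrt

# common two-sided confidence levels and their standard-Normal quantiles z_N
Z_N = {0.80: 1.28, 0.90: 1.64, 0.95: 1.96, 0.99: 2.58}

def error_confidence_interval(r, n, level=0.95):
    """r misclassifications out of n test objects -> (lower, upper) bound on the true error."""
    p_hat = r / n
    half_width = Z_N[level] * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

print(error_confidence_interval(30, 200, level=0.95))
```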

5.3 Decision Trees

Introduction
ID  Age  Autotype  Risk
1   23   Family    high
2   17   Sports    high
3   43   Sports    high
4   68   Family    low
5   32   Truck     low

Decision tree:
  Autotype = Truck?
    yes: Risk = low
    no:  Age > 60?
           yes: Risk = low
           no:  Risk = high

A decision tree expresses a disjunction of conjunctions of attribute constraints in a hierarchical structure.

5.3 Decision Trees

Introduction
A decision tree is a tree with the following properties:
  An inner node represents an attribute.
  An edge represents a test on the attribute of the parent node.
  A leaf represents one of the classes of C.
Construction of a decision tree
  based on the training data
  top-down strategy
Application of a decision tree
  traversal of the decision tree from the root to one of the leaves
  unique path
  assignment of the object to the class of the resulting leaf

5.3 Decision Trees

Construction of Decision Trees
Base algorithm
  Initially, all training data records belong to the root.
  The next attribute is selected and split (split strategy).
  The training data records are partitioned according to the chosen split.
  The method is applied recursively to each partition.
  => a local (greedy) optimization method
Termination conditions
  No more split attributes.
  All (most) training data records of the node belong to the same class.

5.3 Decision Trees

Example: Is today a day to play tennis?
Day  Outlook   Temperature  Humidity  Wind    PlayTennis?
1    sunny     hot          high      weak    no
2    sunny     hot          high      strong  no
3    overcast  hot          high      weak    yes
4    rainy     mild         high      weak    yes
5    rainy     cool         normal    weak    yes
6    rainy     cool         normal    strong  no

5.3 Decision Trees

Example
  Outlook?
    sunny:    Humidity?
                high:   no
                normal: yes
    overcast: yes
    rainy:    Wind?
                strong: no
                weak:   yes

5.3 Decision Trees

Types of Splits
Categorical attributes
  conditions of the form attribute = a (one branch per value a_1, a_2, a_3, ...)
  or conditions of the form attribute in s (one branch per subset s_1, s_2, ...); many possible subsets
Numerical attributes
  conditions of the form attribute < a
  many possible split points a

5.3 Decision Trees

Quality Measures for Splits
Given
  a set T of training data
  a disjoint, exhaustive partitioning T_1, T_2, ..., T_m of T
  p_i: the relative frequency of class c_i in T
Wanted
  a measure of the impurity of a set S of training data w.r.t. the class labels
  a split of T into T_1, T_2, ..., T_m minimizing this impurity measure
  => information gain, gini index

5.3 Decision Trees

Information Gain
Entropy: the minimal number of bits to encode a message transmitting the class of a random training data record.
Entropy for a set T of training data:
  entropy(T) = - sum_{i=1..k} p_i * log2(p_i)
  entropy(T) = 0 if p_i = 1 for some i
  entropy(T) = 1 for k = 2 classes with p_i = 1/2
Let attribute A produce the partitioning T_1, T_2, ..., T_m of T.
The information gain of attribute A w.r.t. T is defined as (see the sketch below):
  InformationGain(T, A) = entropy(T) - sum_{i=1..m} (|T_i| / |T|) * entropy(T_i)
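
A minimal sketch of entropy and information gain for one categorical split attribute, mirroring the definitions above; the dictionary-based record representation is a hypothetical choice for illustration.

```python
# Entropy and information gain for a categorical split attribute.
from math import log2
from collections import Counter

def entropy(labels):
    """entropy(T) = - sum_i p_i * log2(p_i) over the class frequencies in T."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(records, labels, attribute):
    """Gain of splitting the labelled records on one categorical attribute."""
    partitions = {}
    for record, label in zip(records, labels):
        partitions.setdefault(record[attribute], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder
```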

5.3 Decision Trees

Gini Index
Gini index for a set T of training data records:
  gini(T) = 1 - sum_{j=1..k} p_j^2
  low gini index => low impurity, high gini index => high impurity
Let attribute A produce the partitioning T_1, T_2, ..., T_m of T.
The gini index of attribute A w.r.t. T is defined as (see the sketch below):
  gini_A(T) = sum_{i=1..m} (|T_i| / |T|) * gini(T_i)

5.3 Decision Trees

Example
Full dataset: 9 yes, 5 no; entropy = 0.940

Split on Humidity:
  high:   3 yes, 4 no; entropy = 0.985
  normal: 6 yes, 1 no; entropy = 0.592
Split on Wind:
  weak:   6 yes, 2 no; entropy = 0.811
  strong: 3 yes, 3 no; entropy = 1.000

InformationGain(T, Humidity) = 0.940 - (7/14) * 0.985 - (7/14) * 0.592 = 0.151
InformationGain(T, Wind)     = 0.940 - (8/14) * 0.811 - (6/14) * 1.000 = 0.048
=> Humidity yields the higher information gain.
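
A minimal sketch of the gini index of a label set and of a split, applied to the Humidity counts from the example above (3 yes / 4 no vs. 6 yes / 1 no).

```python
# Gini index of a set of class labels and of a partitioning.
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2"""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_of_split(partitions):
    """Weighted gini index of a partitioning T_1, ..., T_m (each a list of labels)."""
    n = sum(len(part) for part in partitions)
    return sum(len(part) / n * gini(part) for part in partitions)

high = ["yes"] * 3 + ["no"] * 4      # Humidity = high:   3 yes, 4 no
normal = ["yes"] * 6 + ["no"] * 1    # Humidity = normal: 6 yes, 1 no
print(gini_of_split([high, normal]))
```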

5.3 Decision Trees

Overfitting
Overfitting: there are two decision trees T and T' such that
  T has a lower error rate than T' on the training data, but
  T' has a lower true error rate than T.
Typical behavior: as the tree size grows, the classification accuracy on the training data keeps increasing, while the accuracy on the test data eventually decreases.

5.3 Decision Trees

Approaches for Avoiding Overfitting
Removal of erroneous training data
  in particular, inconsistent training data
Choice of an appropriate size of the training data set
  not too small, not too large
Choice of an appropriate minimum support
  minimum support: minimum number of training data records belonging to a leaf node
  minimum support >> 1

5.3 Decision Trees

Approaches for Avoiding Overfitting
Choice of an appropriate minimum confidence
  minimum confidence: minimum percentage of the majority class of a leaf node
  minimum confidence << 100%
  leaves can then also absorb noisy / erroneous training data records
Subsequent pruning of the decision tree
  remove overfitting branches
  see next section

5.3 Decision Trees

Reduced-Error Pruning [Mitchell 1997]
Train-and-Test paradigm
  Construction of decision tree T for training data set TR.
  Pruning of T using test data set TE:
    Determine the subtree of T whose removal leads to the maximum reduction of the classification error on TE.
    Remove this subtree.
    Stop if there is no more such subtree.
Only applicable if enough labeled data is available.

5.3 Decision Trees

Minimal Cost-Complexity Pruning [Breiman, Friedman, Olshen & Stone 1984]
Cross-validation paradigm
  applicable even if only a small number of labeled data is available
  pruning of the decision tree using the training data set
  => cannot use the classification error as quality measure
Novel quality measure for decision trees
  trade-off between (observed) classification error and tree size
  weighted sum of classification error and tree size
  small decision trees tend to generalize better to unseen data

5.3 Decision Trees

Notions
Size |T| of decision tree T: number of leaves.
Cost complexity of T w.r.t. training data set TR and complexity parameter alpha >= 0:
  CC_TR(T, alpha) = error_TR(T) + alpha * |T|
The smallest minimizing subtree T(alpha) of T w.r.t. alpha has the following properties:
  (1) There is no subtree of T with smaller cost complexity.
  (2) If T(alpha) and T' satisfy condition (1), then T(alpha) is a subtree of T'.
alpha = 0: T(alpha) = T
alpha -> infinity: T(alpha) = root of T
0 < alpha < infinity: T(alpha) is a proper subtree of T (containing more than just the root)

5.3 Decision Trees

Notions
T_e: subtree of T with root e; {e}: tree consisting only of node e.
T > T': subtree relationship.
For small values of alpha: CC_TR(T_e, alpha) < CC_TR({e}, alpha); for large values of alpha: CC_TR(T_e, alpha) > CC_TR({e}, alpha).
Critical value of alpha w.r.t. e:
  alpha_crit: CC_TR(T_e, alpha_crit) = CC_TR({e}, alpha_crit)
For alpha >= alpha_crit, the subtree below node e should be pruned.
Weakest link: the node with the minimal alpha_crit value.

5.3 Decision Trees

Method
Start with the complete decision tree T.
Iteratively remove the weakest link from the current tree.
If there are several weakest links, remove all of them in the same step.
=> a sequence of pruned trees T(alpha_1) > T(alpha_2) > ... > T(alpha_m) with alpha_1 < alpha_2 < ... < alpha_m
Selection of the best T(alpha_i):
  estimate the true classification error of all T(alpha_1), T(alpha_2), ..., T(alpha_m) by performing l-fold cross-validation on the training data set.

5.3 Decision Trees

Example
Sequence of pruned trees T_1, ..., T_9:

i   training error   estimated error
1   0.00             0.46
2   0.00             0.45
3   0.04             0.43
4   0.10             0.38
5   0.12             0.38
6   0.20             0.32
7   0.29             0.31
8   0.32             0.39
9   0.41             0.47

T_7 has the lowest estimated error and also the lowest true error.

5.4 Bayesian Classification

Motivation
Given an object o and two classes positive and negative, and three independent hypotheses h_1, h_2, h_3.
A-posteriori probabilities of the hypotheses for given o:
  P(h_1 | o) = 0.4, P(h_2 | o) = 0.3, P(h_3 | o) = 0.3
A-posteriori probabilities of the classes for a given hypothesis:
  P(negative | h_1) = 0, P(positive | h_1) = 1
  P(negative | h_2) = 1, P(positive | h_2) = 0
  P(negative | h_3) = 1, P(positive | h_3) = 0
=> o is positive with probability 0.4 and negative with probability 0.6.

5.4 Bayesian Classification

Optimal Bayes Classifier
Let H = {h_1, ..., h_l} be a set of independent hypotheses and let o be an object to be classified.
The optimal Bayes classifier assigns o to the class
  argmax_{c_j in C} sum_{h_i in H} P(c_j | h_i) * P(h_i | o)
In the example: o is classified as negative.
No other classifier with the same a-priori knowledge achieves a better average classification accuracy.

5.4 Bayesian Classification

Optimal Bayes Classifier
Assumption: exactly one hypothesis h_i is always satisfied.
Simplified decision rule:
  argmax_{c_j in C} P(c_j | o)
P(c_j | o) is typically unknown.
Reformulation using Bayes' theorem:
  argmax_{c_j in C} P(c_j | o) = argmax_{c_j in C} P(o | c_j) * P(c_j) / P(o) = argmax_{c_j in C} P(o | c_j) * P(c_j)
Final decision rule of the optimal Bayes classifier:
  argmax_{c_j in C} P(o | c_j) * P(c_j)
  (Maximum-Likelihood classifier)

5.4 Bayesian Classification

Naive Bayes Classifier
Estimate the P(c_j) using the observed frequencies of the individual classes.
How to estimate the P(o | c_j)?
Assumption: o = (o_1, ..., o_d), and the attribute values o_i are conditionally independent for a given class c_j.
Decision rule of the naive Bayes classifier (see the sketch below):
  argmax_{c_j in C} P(c_j) * prod_{i=1..d} P(o_i | c_j)
The P(o_i | c_j) are much easier to estimate from the training data than P(o | c_j).

5.4 Bayesian Classification

Bayesian Networks
Graph with nodes = random variables and edges = conditional dependencies.
Each random variable is (for given values of the predecessor variables) conditionally independent of all variables that are not its successors.
For each node (random variable): a table of conditional probabilities.
Training of a Bayesian network:
  with given network structure and all random variables known
  with given network structure and partially unknown random variables
  with a priori unknown network structure
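
A minimal sketch of the naive Bayes decision rule argmax_c P(c) * prod_i P(o_i | c) for categorical attribute vectors, using plain relative frequencies as probability estimates (no smoothing); the toy (Outlook, Wind) data at the bottom is hypothetical.

```python
# Naive Bayes with relative-frequency estimates for P(c_j) and P(o_i | c_j).
from collections import Counter, defaultdict

def train_naive_bayes(records, labels):
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: cnt / n for c, cnt in class_counts.items()}        # P(c_j)
    cond = defaultdict(Counter)                                     # counts for P(o_i | c_j)
    for record, label in zip(records, labels):
        for i, value in enumerate(record):
            cond[label][(i, value)] += 1
    return priors, cond, class_counts

def classify(o, priors, cond, class_counts):
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for i, value in enumerate(o):
            score *= cond[c][(i, value)] / class_counts[c]          # P(o_i | c_j)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# hypothetical usage on (Outlook, Wind) attribute vectors
records = [("sunny", "weak"), ("sunny", "strong"), ("overcast", "weak"), ("rainy", "weak")]
labels = ["no", "no", "yes", "yes"]
model = train_naive_bayes(records, labels)
print(classify(("sunny", "weak"), *model))
```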

5.4 Bayesian Classification

Example
Bayesian network with nodes FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay and Dyspnea: FamilyHistory and Smoker are the parents of LungCancer and Emphysema, which in turn influence PositiveXRay and Dyspnea.
The conditional probabilities for LungCancer are given for the four parent configurations (FH,S), (FH,~S), (~FH,S), (~FH,~S).
For given values of FamilyHistory and Smoker, the value of Emphysema does not provide any additional information about LungCancer.

5.4 Bayesian Classification

Interpretation of Raster Images
Automatic interpretation of d raster images of a given region:
  for each pixel, a d-dimensional vector of grey values (o_1, ..., o_d)
Assumption: different kinds of land use exhibit characteristic behaviors of reflection / emission.
Pixels of the earth's surface are mapped into the feature space spanned by the spectral bands (e.g. Band 1 and Band 2); the land-use classes (agricultural, water, urban) form clusters in this feature space.

5.4 Bayesian Classification

Interpretation of Raster Images
Application of the optimal Bayes classifier:
  Estimate the P(o | c) without assuming conditional independence.
  Assume a d-dimensional Normal distribution for the grey-value vectors of a given class.
The resulting class-membership probabilities define decision surfaces between the classes (e.g. water, urban, agricultural).

5.4 Bayesian Classification

Method
Estimate from the training data (see the sketch below):
  mu_i: the d-dimensional mean vector of all feature vectors of class c_i
  Sigma_i: the d x d covariance matrix of class c_i
Problems of the decision rule:
  the likelihood for the chosen class may be very small
  the likelihoods for several classes may be similar
  => unclassified regions, threshold
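
A minimal sketch of a Bayes classifier with one d-dimensional Normal distribution per class, as described above: estimate mu_i and Sigma_i from the training pixels of each class and pick the class maximizing P(o | c) * P(c). It assumes NumPy/SciPy are available; the rejection threshold for "unclassified" regions is a hypothetical parameter.

```python
# Per-class Gaussian Bayes classifier for grey-value vectors.
import numpy as np
from scipy.stats import multivariate_normal

def train_gaussian_bayes(X, y):
    """X: n x d grey-value vectors, y: class labels. Returns per-class (prior, mean, cov)."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = (len(Xc) / len(X), Xc.mean(axis=0), np.cov(Xc, rowvar=False))
    return model

def classify_pixel(o, model, threshold=1e-6):
    scores = {c: prior * multivariate_normal.pdf(o, mean=mu, cov=cov)
              for c, (prior, mu, cov) in model.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unclassified"   # likelihood too small
```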

5.4 Bayesian Classification

Discussion
+ Optimality property: standard for comparison with other classifiers
+ High classification accuracy in many applications
+ Incrementality: the classifier can easily be adapted to new training objects
+ Integration of domain knowledge
- Applicability: the required conditional probabilities may not be available
- Inefficiency for high numbers of features (in particular, Bayesian networks)

5.5 Nearest-Neighbor Classification

Motivation
The optimal Bayes classifier assuming a d-dimensional Normal distribution requires estimates for mu_i and Sigma_i.
The estimate for mu_i needs much less training data than the estimate for Sigma_i.
Goal: a classifier using only the mean vector per class
  => nearest-neighbor classifier

5.5 Nearest-Neighbor Classification

Example
Classes Wolf, Dog and Cat, each represented by its mean vector mu_Wolf, mu_Dog, mu_Cat.
A query object q closest to mu_Dog is classified as a dog.
Instance-based learning, related to case-based reasoning.

5.5 Nearest-Neighbor Classification

Overview
Base method
  Represent training objects o as feature (attribute) vectors o = (o_1, ..., o_d).
  Calculate the mean vector mu_i for each class c_i.
  Assign an unseen object to the class c_i with the nearest mean vector mu_i.
Generalizations (see the sketch below)
  Use more than one representative per class.
  Consider k > 1 neighbors.
  Weight the classes of the k nearest neighbors.
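
A minimal sketch of the base method (nearest mean vector per class) and of the k-nearest-neighbor generalization with Euclidean distance and majority voting; the distance function and NumPy array representation are assumptions for illustration.

```python
# Nearest-mean and k-nearest-neighbor classification.
import numpy as np
from collections import Counter

def nearest_mean_classify(q, X, y):
    """Assign q to the class whose mean vector is closest."""
    means = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
    return min(means, key=lambda c: np.linalg.norm(q - means[c]))

def knn_classify(q, X, y, k=5):
    """Decision set = the k nearest training objects; decision rule = majority class."""
    distances = np.linalg.norm(X - q, axis=1)
    decision_set = y[np.argsort(distances)[:k]]
    return Counter(decision_set).most_common(1)[0][0]
```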

5.5 Nearest-Neighbor Classification

Notions
Distance function: defines the similarity (dissimilarity) for pairs of objects.
k: the number of neighbors considered.
Decision set: the set of k nearest neighbors considered for classification.
Decision rule: how to determine the class of the unseen object from the classes of the decision set.

5.5 Nearest-Neighbor Classification

Example
Classes + and -.
With a uniform weight for the decision set:
  k = 1: classification as +
  k = 5: classification as -
With the inverse squared distance as weight for the decision set:
  k = 1 and k = 5: classification as +

5.5 Nearest-Neighbor Classification

Choice of Parameter k
k too small: very sensitive to outliers.
k too large: many objects from other clusters (classes) end up in the decision set.
Medium k: highest classification accuracy, often 1 << k < 10.

5.5 Nearest-Neighbor Classification

Decision Rule
Standard rule
  Choose the majority class within the decision set.
Weighted decision rule
  Weight the classes of the decision set
    by distance
    by class distribution (often skewed!)
  Example: class A: 95%, class B: 5%; decision set = {A, A, A, A, B, B, B}
  Standard rule: A; weighted rule: B

5.5 Nearest-Neighbor Classification

Index Support for k-Nearest-Neighbor Queries
Balanced index tree (such as an X-tree or M-tree), query point p.
PartitionList: the bounding boxes (BBs) of the subtrees that still need to be processed, sorted in ascending order of MinDist to p.
NN: the nearest neighbor of p in the data pages read so far.

5.5 Nearest-Neighbor Classification

Index Support for k-Nearest-Neighbor Queries
Remove all BBs from PartitionList that have a larger distance to p than the current best NN of p.
PartitionList is sorted in ascending order w.r.t. MinDist to p.
Always pick the first element of PartitionList as the next subtree to be explored.
  => does not read any unnecessary disk pages
  => query processing is limited to a few paths of the index structure
Average runtime O(log n) for not too many attributes; for very large numbers of attributes: O(n).

5.5 Nearest-Neighbor Classification

Discussion
+ Local method: does not have to find a global decision function (decision surface)
+ High classification accuracy in many applications
+ Incremental: the classifier can easily be adapted to new training objects
+ Can also be used for prediction
- Application of the classifier is expensive: requires a k-nearest-neighbor query
- Does not generate explicit knowledge about the classes

5.6 Neural Networks

Introduction [Bigus 1996], [Bishop 1995]
Paradigm for a machine / computation model whose function is inspired by the function of biological brains.
Neural net: a set of neurons connected via edges.
Neuron: models a biological neuron
  activation through input signals
  generation of an output signal that is transmitted to other neurons
Organization of a neural net
  input layer, hidden layer(s), output layer
  nodes of a layer are connected to all nodes of the preceding layer

5.6 Neural Networks

Introduction
Edges have weights w_ij.
Optimization goal: the output vector should be identical to the class labels of the training data.
Network structure: input vector x -> input layer -> hidden layer -> output layer -> output vector y.
Optimization method:
  (1) design the network topology
  (2) train the parameters w_ij

5.6 Neural Networks

Neurons
General neuron: activation value a = sum_{i=1..n} w_i * x_i
Threshold logic unit (TLU): output y = 1 if a >= theta, else y = 0
Perceptron with sigmoid output: y = 1 / (1 + e^{-a})

5.6 Neural Networks

Neurons
Classification using a perceptron
  The perceptron represents a (hyper-)plane sum_{i=1..n} w_i * x_i = theta.
  Left of the hyperplane: class 0; right of the hyperplane: class 1.
Training a perceptron (see the sketch below)
  Learn the correct weights to distinguish the two classes.
  Iterative adaptation of the weights w_ij:
    rotate the hyperplane (w, theta) by a small degree in the direction of v, if v is not yet on the correct side of the hyperplane.

5.6 Neural Networks

Combination of Several Neurons
Two classes that are not linearly separable: use two inner nodes and one output node.
Example, with hidden outputs h_1 and h_2:
  h_1 = 0 and h_2 = 0: y = 0 (class B)
  otherwise: y = 1 (class A)
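
A minimal sketch of iterative perceptron training for a threshold unit: whenever a training vector v with target class t lies on the wrong side of the hyperplane, the weights (and the threshold, treated as a bias) are moved a small step towards v. The learning rate, epoch count and the logical-AND toy data are hypothetical choices.

```python
# Perceptron training rule for a threshold logic unit.
import numpy as np

def train_perceptron(X, t, lr=0.1, epochs=100):
    """X: n x d inputs, t: targets in {0, 1}. Returns weights w and threshold theta."""
    w = np.zeros(X.shape[1])
    theta = 0.0
    for _ in range(epochs):
        for v, target in zip(X, t):
            y = 1 if w @ v >= theta else 0       # TLU output
            if y != target:                      # v is not yet on the correct side
                w += lr * (target - y) * v       # rotate the hyperplane towards v
                theta -= lr * (target - y)       # adapt the threshold accordingly
    return w, theta

# hypothetical linearly separable toy data (logical AND)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 0, 0, 1])
print(train_perceptron(X, t))
```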

5.6 Neural Networks

Training Complex Neural Nets
If the predicted and the actual class deviate: adapt the weights of several nodes.
Question: to which degree do the different nodes contribute to the error?
Adapting the weights
  Minimize the overall error using gradient-based local optimization.
  Overall error: sum of the (squared) deviations of the actual output y from the correct output t over all input vectors.
  Prerequisite: the output y is a continuous function of the activation a.

5.6 Neural Networks

Algorithm Backpropagation (see the sketch below)
for each pair (v, t):   // v = input, t = correct output
  forward pass: determine the actual output y for input v;
  backpropagation: determine the error (t - y) of the output nodes and adapt the weights of the output nodes in the direction minimizing the error;
  while the input layer has not been reached:
    propagate the error to the next layer and adapt the weights of this layer such that the error is minimized;
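
A compact sketch of the backpropagation loop for one hidden layer with sigmoid units and squared error, written with batch updates in NumPy. It is a generic gradient-descent sketch rather than the exact procedure from the slides; the layer size, learning rate, epoch count and XOR toy data are arbitrary assumptions, and training may need a different seed or more epochs to converge.

```python
# Backpropagation for a one-hidden-layer network with sigmoid units and squared error.
# Bias weights are handled by appending a constant-1 input to each layer.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def add_bias(Z):
    return np.hstack([Z, np.ones((Z.shape[0], 1))])

def train_backprop(X, t, n_hidden=4, lr=0.5, epochs=20000, seed=0):
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1] + 1, n_hidden))  # input -> hidden weights
    W2 = rng.normal(scale=0.5, size=(n_hidden + 1, 1))           # hidden -> output weights
    for _ in range(epochs):
        # forward pass: determine the actual output y for the inputs
        h = sigmoid(add_bias(X) @ W1)
        y = sigmoid(add_bias(h) @ W2)
        # backward pass: propagate the error (t - y) layer by layer and adapt the weights
        delta_out = (y - t) * y * (1 - y)                        # gradient at the output units
        delta_hid = (delta_out @ W2[:-1].T) * h * (1 - h)        # error propagated to hidden units
        W2 -= lr * add_bias(h).T @ delta_out
        W1 -= lr * add_bias(X).T @ delta_hid
    return W1, W2

# hypothetical toy data: XOR, which a single perceptron cannot represent
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
t = np.array([[0.], [1.], [1.], [0.]])
W1, W2 = train_backprop(X, t)
print(sigmoid(add_bias(sigmoid(add_bias(X) @ W1)) @ W2).round(2))
```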

5.6 Neural Networks

Design of the Network Topology
Choice of
  the number of input nodes
  the number of inner (hidden) layers and the corresponding numbers of nodes
  the number of output nodes
Strong impact on the classification accuracy:
  too few nodes: cannot distinguish the classes
  too many nodes: overfitting

5.6 Neural Networks

Design of the Network Topology
Static topology
  The topology is fixed a priori.
  One hidden layer is often sufficient.
Dynamic topology
  Dynamic addition of neurons (and hidden layers) as long as the classification accuracy is significantly improved.
Multiple topologies
  Training of several dynamic nets in parallel, e.g. three networks with 1, 2 and 3 hidden layers.

5.6 Neural Networks

Design of the Network Topology
Pruning
  Train a network with a static topology.
  A posteriori removal of the least important neurons, as long as the classification accuracy is improved.
Conclusion
  Static topology: lower classification accuracy, but relatively fast.
  Pruning: higher classification accuracy, but training is very expensive.

5.6 Neural Networks

Discussion
+ In general, high classification accuracy: arbitrarily complex decision surfaces
+ Robust to noisy training data
+ Efficient model application
- Inefficient model construction: very long training times
- The model is hard to interpret: it learns weights, but no class descriptions
- No integration of domain knowledge

5.7 Support Vector Machines

Introduction [Burges 1998]
Input: a training set of objects S = {(x_1, y_1), ..., (x_n, y_n)}, x_i in X, with their known classes y_i in {-1, +1}.
Output: a classifier f: X -> {-1, +1}.
Goal: find the best separating hyperplane (e.g., lowest classification error).
Two-class problem.

5.7 Support Vector Machines

Introduction
Hyperplane: w.x + b = 0
Half-space w.x + b > 0: class +1
Half-space w.x + b < 0: class -1
Classification is based on the sign of the decision function
  f_{w,b}(x) = w.x + b
where "." denotes the inner product of two vectors.

5.7 Support Vector Machines

Introduction
Choose the hyperplane with the largest margin gamma (maximum distance to the closest training object).

5.7 Support Vector Machines

Introduction
Let x_1 lie on the hyperplane w.x + b = 0 and x_2 on the parallel hyperplane w.x + b = +1, with x_2 - x_1 perpendicular to both.
Then w.(x_2 - x_1) = 1, i.e. |w| * |x_2 - x_1| * cos(0) = 1, and hence the margin is
  gamma = |x_2 - x_1| = 1 / |w|

5.7 Support Vector Machines

Method
Primal problem
  Minimize |w|^2
  under the constraints y_i * (w.x_i + b) - 1 >= 0 for i = 1, ..., n.
Dual problem
  Introduce a dual variable alpha_i for each training object i.
  Find the alpha_i maximizing
    L(alpha) = sum_{i=1..n} alpha_i - (1/2) * sum_{i,j=1..n} alpha_i * alpha_j * y_i * y_j * (x_i . x_j)
  under the constraints alpha_i >= 0 and sum_{i=1..n} alpha_i * y_i = 0.
  => a quadratic programming problem

5.7 Support Vector Machines

Method
Only training objects with alpha_i > 0 contribute to w; these training objects are the support vectors.
All other training objects have alpha_i = 0.
Typically, the number of support vectors is << n.

5.7 Support Vector Machines

Non-Linear Classifiers
Transformation Psi of the space X into a feature space in which the classes become (linearly) separable.

5.7 Support Vector Machines

Non-Linear Classifiers
Decision function
  f_{w,b}(x) = w.Psi(x) + b
Kernel of two objects x, x' in X:
  K(x, x') = Psi(x).Psi(x')
The explicit computation of Psi(x) is not necessary.
Example (see the sketch below):
  Psi(x) = (x_1^2, x_2^2, sqrt(2)*x_1*x_2, sqrt(2)*x_1, sqrt(2)*x_2, 1)
  K(x, x') = Psi(x).Psi(x') = (x.x' + 1)^2
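
A minimal sketch checking the kernel identity from the example above numerically: the inner product of the explicit mappings Psi(x).Psi(x') equals (x.x' + 1)^2, so Psi never has to be computed; the test vectors are hypothetical.

```python
# Polynomial kernel vs. explicit feature map for 2-dimensional inputs.
import numpy as np

def psi(x):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

def kernel(x, xp):
    return (np.dot(x, xp) + 1.0) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(psi(x), psi(xp)), kernel(x, xp))   # both evaluate to the same value
```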

5.7 Support Vector Machines

Kernels
A kernel is a similarity measure.
K(x, x') is a kernel iff, for any set of objects x_i in X, the matrix of pairwise kernel values
  [ K(x_1, x_1)  K(x_1, x_2)  ...
    K(x_2, x_1)  K(x_2, x_2)  ...
    ...                         ]
is symmetric and positive definite.

5.7 Support Vector Machines

SVM for Protein Classification [Leslie et al 2002]
Two sequences are similar when they share many common substrings (subsequences):
  K(x, x') = sum over common substrings s of lambda^{|s|}
  where lambda is a parameter and |s| denotes the length of string s.
Very high classification accuracy for protein sequences.
Variation of the kernel (when allowing gaps in the matching subsequences):
  K(x, x') = sum over common substrings s of lambda^{length(s,x) + length(s,x')}
  where length(s, x) is the length of the subsequence of x matching s.

5.7 Support Vector Machines

SVM for Prediction of Translation Initiation Sites [Zien et al 2000]
Translation initiation site (TIS): the starting position of a protein-coding region in DNA.
All TIS start with the triplet ATG.
Problem: given an ATG triplet, does it belong to a TIS?
Representation of DNA
  Window of 200 nucleotides around the candidate ATG.
  Encode each nucleotide with a 5-bit word (00001, 00010, ..., 10000) for A, C, G, T and unknown.
  => vectors of 1000 bits

5.7 Support Vector Machines

SVM for Prediction of Translation Initiation Sites
Kernels
  K(x, x') = (x.x')^d
  d = 1: number of common bits
  d = 2: number of common pairs of bits
  ...
Locally improved kernel: compare only a small window around the ATG.
Experimental results
  Long-range correlations do not improve performance.
  The locally improved kernel performs best and outperforms state-of-the-art methods.

5.7 Support Vector Machines

Discussion
+ Strong mathematical foundation
+ Finds the global optimum
+ Scales well to very high-dimensional datasets
+ Very high classification accuracy in many challenging applications
- Inefficient model construction: long training times (~ O(n^2))
- The model is hard to interpret: it learns only weights of features, and the weights tend to be almost uniformly distributed

5.8 Regression Analysis

Prediction
Commonality with classification
  First, construct a model.
  Second, use the model to predict the unknown value.
  The major method for prediction is regression:
    simple and multiple regression
    linear and non-linear regression
Difference from classification
  Classification predicts categorical class labels.
  Prediction models continuous-valued functions.

5.8 Regression Analysis

Linear Regression
Predict the values of the response variable y based on a linear combination of the given values of the predictor variable(s) x_j:
  y_hat = a_0 + sum_{j=1..d} a_j * x_j
Simple regression: one predictor variable => regression line.
Multiple regression: several predictor variables => regression plane.
Residuals: the differences between the observed and the predicted values.
Use the residuals to measure the model fit.

5.8 Regression Analysis

Linear Regression
  y(i) = y_hat(i) + e(i) = a_0 + sum_{j=1..d} a_j * x_j(i) + e(i),  1 <= i <= n
y: vector of the y values for the n training objects.
X: matrix of the values of the d predictor variables for the n training objects (and an additional column of 1s).
e: vector of the residuals for the n training objects.
Matrix notation: y = Xa + e (see the sketch below).
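
A minimal sketch of multiple linear regression y = Xa + e: build the design matrix with an extra column of 1s and solve the least-squares problem whose closed-form solution a = (X^T X)^{-1} X^T y is derived on the next slide. np.linalg.lstsq is used here for numerical stability (an SVD-based solver); the toy data at the bottom is hypothetical.

```python
# Multiple linear regression via least squares.
import numpy as np

def fit_linear_regression(predictors, y):
    """predictors: n x d matrix, y: n response values. Returns coefficients a_0, ..., a_d."""
    X = np.hstack([np.ones((predictors.shape[0], 1)), predictors])   # column of 1s for a_0
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a

def predict(predictors, a):
    X = np.hstack([np.ones((predictors.shape[0], 1)), predictors])
    return X @ a

# hypothetical data: y ~ 1 + 2*x1 - 3*x2 plus noise
rng = np.random.default_rng(0)
P = rng.normal(size=(50, 2))
y = 1 + 2 * P[:, 0] - 3 * P[:, 1] + 0.1 * rng.normal(size=50)
print(fit_linear_regression(P, y))
```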

5.8 Regression Analysis

Linear Regression
Optimization goal: minimize
  sum_{i=1..n} e(i)^2 = sum_{i=1..n} [ y(i) - sum_{j=0..d} a_j * x_j(i) ]^2
Solution: a = (X^T X)^{-1} X^T y
Computational issues
  X^T X must be invertible: problems if there are linear dependencies between the predictor variables.
  The solution may be unstable if the predictor variables are almost linearly dependent.
  Equation solving e.g. using LU decomposition or SVD.
  Runtime complexity O(d^2 * n + d^3).

5.8 Regression Analysis

Locally Weighted Regression
Limitations of linear regression
  only linear models
  one global model
Locally weighted regression
  Construct an explicit approximation to f over a local neighborhood of the query instance x_q.
  Weight the neighboring objects based on their distance to x_q: a distance-decreasing weight K.
  Related to nearest-neighbor classification.
  Minimize the squared local weighted error.

5.8 Regression Analysis

Locally Weighted Regression
Local weighted error w.r.t. the query instance x_q, an arbitrary approximating function f_hat, and a pairwise distance function d.
Three major alternatives (see the sketch below):
  E(x_q) = (1/2) * sum over the k nearest neighbors x of x_q of (f(x) - f_hat(x))^2
  E(x_q) = (1/2) * sum over all x in D of (f(x) - f_hat(x))^2 * K(d(x_q, x))
  E(x_q) = (1/2) * sum over the k nearest neighbors x of x_q of (f(x) - f_hat(x))^2 * K(d(x_q, x))

5.8 Regression Analysis

Discussion
+ Strong mathematical foundation
+ Simple to calculate and to understand for a moderate number of dimensions
+ High accuracy in many applications
- Many dependencies are non-linear (but the approach can be generalized)
- The model is global and cannot adapt well to locally different data distributions (but: locally weighted regression)
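
A minimal sketch of locally weighted (linear) regression: for a query point x_q, fit a separate linear model whose squared errors are weighted by a distance-decreasing Gaussian kernel K(d) = exp(-d^2 / (2*tau^2)), corresponding to the second error alternative above. The bandwidth tau and the sin-curve toy data are hypothetical choices.

```python
# Locally weighted linear regression for a single query point.
import numpy as np

def locally_weighted_predict(x_q, X, y, tau=1.0):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])                 # design matrix with intercept
    d2 = np.sum((X - x_q) ** 2, axis=1)                           # squared distances to x_q
    w = np.exp(-d2 / (2 * tau ** 2))                              # distance-decreasing weights
    W = np.diag(w)
    a = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)              # weighted least squares
    return np.concatenate([[1.0], x_q]) @ a

# hypothetical non-linear data: y = sin(x) plus noise
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=80)).reshape(-1, 1)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)
print(locally_weighted_predict(np.array([3.0]), X, y, tau=0.5))
```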


More information

Final Exam, Fall 2002

Final Exam, Fall 2002 15-781 Final Exam, Fall 22 1. Write your name and your andrew email address below. Name: Andrew ID: 2. There should be 17 pages in this exam (excluding this cover sheet). 3. If you need more room to work

More information

CS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning

CS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning CS 446 Machine Learning Fall 206 Nov 0, 206 Bayesian Learning Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes Overview Bayesian Learning Naive Bayes Logistic Regression Bayesian Learning So far, we

More information

Tutorial 6. By:Aashmeet Kalra

Tutorial 6. By:Aashmeet Kalra Tutorial 6 By:Aashmeet Kalra AGENDA Candidate Elimination Algorithm Example Demo of Candidate Elimination Algorithm Decision Trees Example Demo of Decision Trees Concept and Concept Learning A Concept

More information

Rule Generation using Decision Trees

Rule Generation using Decision Trees Rule Generation using Decision Trees Dr. Rajni Jain 1. Introduction A DT is a classification scheme which generates a tree and a set of rules, representing the model of different classes, from a given

More information

Artificial Intelligence. Topic

Artificial Intelligence. Topic Artificial Intelligence Topic What is decision tree? A tree where each branching node represents a choice between two or more alternatives, with every branching node being part of a path to a leaf node

More information

Decision Trees / NLP Introduction

Decision Trees / NLP Introduction Decision Trees / NLP Introduction Dr. Kevin Koidl School of Computer Science and Statistic Trinity College Dublin ADAPT Research Centre The ADAPT Centre is funded under the SFI Research Centres Programme

More information

Decision Tree Learning

Decision Tree Learning 0. Decision Tree Learning Based on Machine Learning, T. Mitchell, McGRAW Hill, 1997, ch. 3 Acknowledgement: The present slides are an adaptation of slides drawn by T. Mitchell PLAN 1. Concept learning:

More information

Symbolic methods in TC: Decision Trees

Symbolic methods in TC: Decision Trees Symbolic methods in TC: Decision Trees ML for NLP Lecturer: Kevin Koidl Assist. Lecturer Alfredo Maldonado https://www.cs.tcd.ie/kevin.koidl/cs0/ kevin.koidl@scss.tcd.ie, maldonaa@tcd.ie 01-017 A symbolic

More information

Machine Learning 3. week

Machine Learning 3. week Machine Learning 3. week Entropy Decision Trees ID3 C4.5 Classification and Regression Trees (CART) 1 What is Decision Tree As a short description, decision tree is a data classification procedure which

More information

Decision Tree Analysis for Classification Problems. Entscheidungsunterstützungssysteme SS 18

Decision Tree Analysis for Classification Problems. Entscheidungsunterstützungssysteme SS 18 Decision Tree Analysis for Classification Problems Entscheidungsunterstützungssysteme SS 18 Supervised segmentation An intuitive way of thinking about extracting patterns from data in a supervised manner

More information

Notes on Machine Learning for and

Notes on Machine Learning for and Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Learning = improving with experience Improve over task T (e.g, Classification, control tasks) with respect

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Nonlinear Classification

Nonlinear Classification Nonlinear Classification INFO-4604, Applied Machine Learning University of Colorado Boulder October 5-10, 2017 Prof. Michael Paul Linear Classification Most classifiers we ve seen use linear functions

More information

Administration. Chapter 3: Decision Tree Learning (part 2) Measuring Entropy. Entropy Function

Administration. Chapter 3: Decision Tree Learning (part 2) Measuring Entropy. Entropy Function Administration Chapter 3: Decision Tree Learning (part 2) Book on reserve in the math library. Questions? CS 536: Machine Learning Littman (Wu, TA) Measuring Entropy Entropy Function S is a sample of training

More information

Machine Learning Recitation 8 Oct 21, Oznur Tastan

Machine Learning Recitation 8 Oct 21, Oznur Tastan Machine Learning 10601 Recitation 8 Oct 21, 2009 Oznur Tastan Outline Tree representation Brief information theory Learning decision trees Bagging Random forests Decision trees Non linear classifier Easy

More information

Lecture 7: DecisionTrees

Lecture 7: DecisionTrees Lecture 7: DecisionTrees What are decision trees? Brief interlude on information theory Decision tree construction Overfitting avoidance Regression trees COMP-652, Lecture 7 - September 28, 2009 1 Recall:

More information

Typical Supervised Learning Problem Setting

Typical Supervised Learning Problem Setting Typical Supervised Learning Problem Setting Given a set (database) of observations Each observation (x1,, xn, y) Xi are input variables Y is a particular output Build a model to predict y = f(x1,, xn)

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative

More information

Artificial Intelligence Roman Barták

Artificial Intelligence Roman Barták Artificial Intelligence Roman Barták Department of Theoretical Computer Science and Mathematical Logic Introduction We will describe agents that can improve their behavior through diligent study of their

More information