Decision trees COMS 4771
1. Prediction functions (again)
Learning prediction functions

IID model for supervised learning: (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are iid random pairs (i.e., labeled examples).
- X takes values in 𝒳. E.g., 𝒳 = R^d.
- Y takes values in 𝒴. E.g., (regression problems) 𝒴 = R; (classification problems) 𝒴 = {1, ..., K} or 𝒴 = {0, 1} or 𝒴 = {−1, +1}.

1. We observe (X_1, Y_1), ..., (X_n, Y_n), and then choose a prediction function (i.e., predictor) f̂ : 𝒳 → 𝒴. This is called learning or training.
2. At prediction time, we observe X and form the prediction f̂(X).
3. The outcome is Y, and the
   - squared loss is (f̂(X) − Y)^2 (regression problems);
   - zero-one loss is 1{f̂(X) ≠ Y} (classification problems).

Note: the expected zero-one loss is E[1{f̂(X) ≠ Y}] = P(f̂(X) ≠ Y), which we also call the error rate.
Distributions over labeled examples

𝒳: space of possible side-information (feature space).
𝒴: space of possible outcomes (label space or output space).

The distribution P of a random pair (X, Y) taking values in 𝒳 × 𝒴 can be thought of in two parts:
1. The marginal distribution P_X of X: P_X is a probability distribution on 𝒳.
2. The conditional distribution P_{Y|X=x} of Y given X = x, for each x ∈ 𝒳: P_{Y|X=x} is a probability distribution on 𝒴.
Optimal classifier

For binary classification, what function f : 𝒳 → {0, 1} has the smallest risk (i.e., error rate) R(f) := P(f(X) ≠ Y)?

Conditional on X = x, the minimizer of the conditional risk ŷ ↦ P(ŷ ≠ Y | X = x) is

    ŷ := 1 if P(Y = 1 | X = x) > 1/2,  and  ŷ := 0 if P(Y = 1 | X = x) ≤ 1/2.

Therefore, the function f* : 𝒳 → {0, 1} given by

    f*(x) = 1 if P(Y = 1 | X = x) > 1/2,  and  f*(x) = 0 otherwise,  for all x ∈ 𝒳,

has the smallest risk. f* is called the Bayes (optimal) classifier.

More generally, for 𝒴 = {1, ..., K},

    f*(x) = arg max_{y ∈ 𝒴} P(Y = y | X = x),  for all x ∈ 𝒳.
Optimal classifiers from discrete probability tables

What is the Bayes optimal classifier under the following distribution?

[Table: joint probabilities P(X = x, Y = y) for X ∈ {1, 2} and Y ∈ {1, 2, 3}; the numeric entries did not survive transcription.] Error rate: 50%.

A second, finer-grained table:

[Table: joint probabilities for X ∈ {1, ..., 10} and Y ∈ {1, 2, 3}; entries likewise lost.] Error rate: 31.1%.

Variant: the same table with rows indexed by cells C_1, ..., C_10. Assume C_1, ..., C_10 partition 𝒳. Optimal classifier?
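As a concrete sketch of reading a Bayes classifier off such a table, the code below picks arg max_y P(X = x, Y = y) within each row. The numbers are hypothetical (the slide's own entries were lost in transcription), but they are chosen so the Bayes error comes out to the slide's 50%.

```python
# Bayes-optimal classifier from a joint probability table.
# joint[x][y] = P(X = x, Y = y); entries here are made-up stand-ins.
joint = {
    1: {1: 0.10, 2: 0.25, 3: 0.15},
    2: {1: 0.20, 2: 0.05, 3: 0.25},
}

def bayes_classifier(x):
    # argmax_y P(Y = y | X = x) = argmax_y P(X = x, Y = y)
    return max(joint[x], key=lambda y: joint[x][y])

# Bayes error rate: for each x, every label except the most probable is wrong.
error_rate = sum(sum(row.values()) - max(row.values())
                 for row in joint.values())

print(bayes_classifier(1), bayes_classifier(2), round(error_rate, 2))  # -> 2 3 0.5
```

Note that no classifier can do better: conditioned on X = x, any prediction other than the row maximizer is wrong with higher probability.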
2. Decision trees
Decision trees

A decision tree is a function f : 𝒳 → 𝒴, represented by a binary tree in which:
- each internal node is associated with a splitting rule g : 𝒳 → {0, 1};
- each leaf node is associated with a label ŷ ∈ 𝒴.

The splits partition 𝒳 into cells; each cell corresponds to a leaf, and f is constant within each cell.

When 𝒳 = R^d, one typically considers only splitting rules of the form h(x) = 1{x_i > t} for some i ∈ [d] and t ∈ R, i.e., axis-aligned splits. (Notation: [d] := {1, ..., d}.)

[Figure: an example tree with root split x_1 > 1.7; the left leaf predicts ŷ = 1, and the right child splits on x_2 > 2.8 with leaves ŷ = 2 and ŷ = 3.]
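Reading the example tree as code, a decision tree is just nested if-then-else over axis-aligned tests (branch labels as inferred from the slides):

```python
def tree_predict(x):
    """The slides' example tree as nested if-then-else (axis-aligned splits)."""
    if x[0] > 1.7:        # root splitting rule: 1{x_1 > 1.7}
        if x[1] > 2.8:    # second splitting rule: 1{x_2 > 2.8}
            return 3      # leaf: y-hat = 3
        return 2          # leaf: y-hat = 2
    return 1              # leaf: y-hat = 1

print(tree_predict((1.5, 5.0)), tree_predict((2.0, 2.0)), tree_predict((2.0, 3.0)))
# -> 1 2 3
```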
Decision tree example

Classifying irises by sepal and petal measurements: 𝒳 = R^2, 𝒴 = {1, 2, 3}, with
- x_1 = ratio of sepal length to width,
- x_2 = ratio of petal length to width.

[Figure: scatter plot of the training data, sepal length/width on the horizontal axis and petal length/width on the vertical axis, with the splits x_1 > 1.7 and x_2 > 2.8 drawn in as the tree is grown over several slides.]

The resulting tree: the root tests x_1 > 1.7; if false, predict ŷ = 1; if true, test x_2 > 2.8 and predict ŷ = 3 if true, ŷ = 2 if false.
Basic decision tree learning algorithm

[Figure: a one-split tree (root x_1 > 1.7, leaves ŷ = 1 and ŷ = 3) grown into the two-split tree with second split x_2 > 2.8 and leaves ŷ = 2 and ŷ = 3.]

Top-down greedy algorithm for decision trees
1: Initially, the tree is a single leaf node containing all (training) data.
2: repeat
3:   Pick the leaf ℓ and rule h that maximally reduce uncertainty.
4:   Split the data in ℓ using h, and grow the tree accordingly.
5: until some stopping condition is met.
6: Set the label of each leaf to (an estimate of) the optimal prediction for the leaf's cell.
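A minimal sketch of this greedy loop, using the Gini index introduced later in the lecture and a depth cap as the stopping condition. The dict-based tree representation and the four-point dataset are illustrative assumptions, not the lecture's own code.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow(S, depth=0, max_depth=3):
    """Top-down greedy growing (sketch). S is a list of (x, y) pairs with
    x a tuple of reals; stop at a depth cap or a pure leaf."""
    labels = [y for _, y in S]
    if depth == max_depth or gini(labels) == 0.0:
        return max(set(labels), key=labels.count)   # leaf: majority label
    # Greedy step: pick the axis-aligned rule 1{x_i > t} that maximally
    # reduces uncertainty, scanning observed values as candidate thresholds.
    best, best_red = None, 0.0
    for x, _ in S:
        for i in range(len(x)):
            t = x[i]
            L = [y for z, y in S if z[i] <= t]
            R = [y for z, y in S if z[i] > t]
            if not L or not R:
                continue
            red = gini(labels) - (len(L) / len(S)) * gini(L) \
                               - (len(R) / len(S)) * gini(R)
            if red > best_red:
                best, best_red = (i, t), red
    if best is None:        # no split reduces uncertainty: stop here
        return max(set(labels), key=labels.count)
    i, t = best
    return {'i': i, 't': t,
            'left':  grow([p for p in S if p[0][i] <= t], depth + 1, max_depth),
            'right': grow([p for p in S if p[0][i] > t], depth + 1, max_depth)}

S = [((1.0, 0.5), 1), ((1.2, 0.6), 1), ((2.0, 3.0), 3), ((2.1, 2.9), 3)]
print(grow(S))  # -> {'i': 0, 't': 1.2, 'left': 1, 'right': 3}
```

One split suffices here because the two classes are separable along the first coordinate; real data usually needs more depth.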
Notions of uncertainty

Suppose S is the set of labeled examples reaching a leaf ℓ.

Classification (𝒴 = {1, ..., K}). Gini index:

    u(S) := 1 − Σ_{y ∈ 𝒴} p_y^2,

where p_y is the fraction of examples in S with label y (for each y ∈ 𝒴).

Regression (𝒴 = R). Variance:

    u(S) := (1/|S|) Σ_{(x,y) ∈ S} (y − μ(S))^2,  where  μ(S) := (1/|S|) Σ_{(x,y) ∈ S} y.

Both are minimized when every example in S has the same label. (Other popular notions of uncertainty exist, such as entropy.)
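Both notions translate directly into code; a minimal sketch:

```python
from collections import Counter

def gini(labels):
    """u(S) = 1 - sum_y p_y^2, where p_y is the fraction with label y."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def variance(ys):
    """u(S) = (1/|S|) sum_(x,y) (y - mu(S))^2, the regression notion."""
    mu = sum(ys) / len(ys)
    return sum((y - mu) ** 2 for y in ys) / len(ys)

print(gini([1, 1, 1]))        # -> 0.0 (pure leaf)
print(gini([1, 2, 2, 3]))     # -> 0.625 (mixed leaf)
print(variance([4.0, 4.0]))   # -> 0.0 (all labels equal)
```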
Uncertainty reduction

Suppose the data S at a leaf ℓ is split by a rule h into S_L and S_R, and let w_L := |S_L|/|S| and w_R := |S_R|/|S|.

[Diagram: the data S at leaf ℓ, with uncertainty u(S); the w_L fraction with h(x) = 0 goes to S_L (uncertainty u(S_L)), and the w_R fraction with h(x) = 1 goes to S_R (uncertainty u(S_R)).]

The reduction in uncertainty from using rule h at leaf ℓ is

    u(S) − (w_L · u(S_L) + w_R · u(S_R)).
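A worked numeric instance of this formula, using the two-label Gini index; the counts below are made up for illustration.

```python
def gini2(p):
    # Gini index of a set in which a fraction p has label 1 (and 1 - p label 0)
    return 1.0 - p * p - (1.0 - p) ** 2

# Hypothetical leaf: 100 examples, 60 with label 1. A rule h sends 40
# examples (all label 1) left and 60 examples (20 with label 1) right.
u_S = gini2(60 / 100)
w_L, w_R = 40 / 100, 60 / 100
reduction = u_S - (w_L * gini2(40 / 40) + w_R * gini2(20 / 60))
print(round(u_S, 3), round(reduction, 3))  # -> 0.48 0.213
```

The pure left child contributes zero weighted uncertainty, which is what makes the reduction large.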
Uncertainty reduction: iris example

[Figure: the iris scatter plot with the one-split tree x_1 > 1.7 and leaves ŷ = 1, ŷ = 3.]

One leaf (with ŷ = 1) already has zero uncertainty (a pure leaf). The other leaf (with ŷ = 3) has Gini index u(S) = 1 − p_1^2 − p_2^2 − p_3^2 (the slide's numeric values did not survive transcription).

[Figure: reduction in uncertainty of splitting S with 1{x_1 > t}, and with 1{x_2 > t}, plotted as a function of the threshold t.]

Splitting S with 1{x_2 > t} at the best threshold yields S_L and S_R; their Gini indices and the resulting reduction in uncertainty are computed the same way (again, the numeric values were lost in transcription).
Limitations of uncertainty notions

Suppose 𝒳 = R^2 and 𝒴 = {red, blue}, and the data is as follows:

[Figure: scatter plot with axes x_1, x_2; the two classes are interleaved so that no axis-aligned split separates them.]

Every split of the form 1{x_i > t} provides no reduction in uncertainty.

Upshot: zero reduction in uncertainty may not be a good stopping condition.
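This failure is easy to verify numerically on four points in an XOR-style arrangement (a hypothetical dataset in the spirit of the slide's picture):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# XOR-style data: the label depends on the two coordinates jointly,
# so no single coordinate carries information on its own.
S = [((0, 0), 'red'), ((1, 1), 'red'), ((0, 1), 'blue'), ((1, 0), 'blue')]

def reduction(i, t):
    """Uncertainty reduction of the rule 1{x_i > t} on S."""
    L = [y for x, y in S if x[i] <= t]
    R = [y for x, y in S if x[i] > t]
    if not L or not R:
        return 0.0
    return gini([y for _, y in S]) \
        - (len(L) / 4) * gini(L) - (len(R) / 4) * gini(R)

# Every nontrivial axis-aligned split leaves both sides 50/50 red/blue:
print([reduction(i, 0) for i in (0, 1)])  # -> [0.0, 0.0]
```

A greedy learner that stops at zero reduction would give up here, even though a depth-two tree classifies this data perfectly.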
Basic decision tree learning algorithm (recap of the top-down greedy algorithm above).
Stopping condition

Many alternatives; two common choices are:
1. Stop when the tree reaches a pre-specified size. (Involves setting additional tuning parameters.)
2. Stop when every leaf is pure (i.e., every training example reaching a leaf has the same label). (Serious danger of overfitting: spurious structure due to sampling.)

[Figure: a two-class scatter plot in (x_1, x_2), partitioned into finer and finer cells over successive slides until every leaf is pure.]
Overfitting

[Figure: risk vs. number of nodes in the tree; the training risk curve decreases to zero, while the true risk curve decreases initially and then turns upward.]

Training risk goes to zero as the number of nodes in the tree increases. True risk decreases initially, but eventually increases due to overfitting.
What can be done about overfitting?

Common strategy:
- Split the training data S into two parts, S_1 and S_2.
- Use the first part S_1 to grow the tree until all leaves are pure.
- Use the second part S_2 to choose a good pruning of the tree.

Pruning algorithm: repeatedly replace a tree node by a leaf node whenever doing so improves the empirical risk on S_2, until no more such improvements are possible. This can be done in a bottom-up traversal of the tree.

The independence of S_1 and S_2 makes it unlikely that spurious structures in each will perfectly align.
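A sketch of this bottom-up pruning pass, under an assumed leaf-or-dict tree representation; the tree and held-out set below are hypothetical, chosen so the inner split looks spurious on the held-out data.

```python
def predict(node, x):
    # A node is either a leaf label or a dict {'i', 't', 'left', 'right'}.
    while isinstance(node, dict):
        node = node['right'] if x[node['i']] > node['t'] else node['left']
    return node

def prune(node, S2):
    """Bottom-up reduced-error pruning: replace a subtree by a majority-label
    leaf whenever that does not increase the empirical risk on S2."""
    if not isinstance(node, dict) or not S2:
        return node
    node['left'] = prune(node['left'],
                         [(x, y) for x, y in S2 if x[node['i']] <= node['t']])
    node['right'] = prune(node['right'],
                          [(x, y) for x, y in S2 if x[node['i']] > node['t']])
    labels = [y for _, y in S2]
    majority = max(set(labels), key=labels.count)
    leaf_errors = sum(y != majority for _, y in S2)
    tree_errors = sum(predict(node, x) != y for x, y in S2)
    return majority if leaf_errors <= tree_errors else node

# The inner split x_2 > 2.8 hurts on the held-out data, so it is pruned
# away; the root split survives.
tree = {'i': 0, 't': 1.7, 'left': 1,
        'right': {'i': 1, 't': 2.8, 'left': 2, 'right': 3}}
S2 = [((1.0, 0.0), 1), ((2.0, 2.0), 2), ((2.0, 3.0), 2)]
print(prune(tree, S2))  # -> {'i': 0, 't': 1.7, 'left': 1, 'right': 2}
```

Ties (`leaf_errors <= tree_errors`) prune in favor of the smaller tree, which matches the goal of removing structure the held-out data does not support.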
Example: spam filtering

Data: e-mail messages, 𝒴 = {spam, not spam} (39.4% are spam). E-mails are represented by 57 features:
- 48: percentage of words that are a specific word (e.g., "free", "business");
- 6: percentage of characters that are a specific character (e.g., "!");
- 3: other features (e.g., average length of ALL-CAPS words).

Results: the final decision tree has just 17 leaves; the test risk is 9.3%.

                  ŷ = not spam   ŷ = spam
    y = not spam      57.3%        4.0%
    y = spam           5.3%       33.4%
[Figure: the learned spam-filtering tree, with node annotations of the form 600/1536 and splits on features such as ch$, remove (< 0.06), ch! (< 0.191), george (< 0.005), hp (< 0.03), CAPMAX (< 10.5), receive (< 0.125), edu (< 0.045), our (< 1.2), CAPAVE, free (< 0.065), business (< 0.145), and 1999; some thresholds did not survive transcription.]
Final remarks
- Decision trees are very flexible classifiers.
- It is very easy to overfit the training data.
- It is NP-hard (i.e., computationally intractable, in general) to find the smallest decision tree consistent with data.
Key takeaways
1. Structure of decision tree classifiers.
2. Greedy learning algorithm based on notions of uncertainty; limitations of the greedy algorithm.
3. High-level idea of overfitting and ways to deal with it.
More information12. Structural Risk Minimization. ECE 830 & CS 761, Spring 2016
12. Structural Risk Minimization ECE 830 & CS 761, Spring 2016 1 / 23 General setup for statistical learning theory We observe training examples {x i, y i } n i=1 x i = features X y i = labels / responses
More informationDecision Trees. Data Science: Jordan Boyd-Graber University of Maryland MARCH 11, Data Science: Jordan Boyd-Graber UMD Decision Trees 1 / 1
Decision Trees Data Science: Jordan Boyd-Graber University of Maryland MARCH 11, 2018 Data Science: Jordan Boyd-Graber UMD Decision Trees 1 / 1 Roadmap Classification: machines labeling data for us Last
More informationDecision Tree Learning
Decision Tree Learning Goals for the lecture you should understand the following concepts the decision tree representation the standard top-down approach to learning a tree Occam s razor entropy and information
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationChapter 6. Ensemble Methods
Chapter 6. Ensemble Methods Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Introduction
More informationGenerative v. Discriminative classifiers Intuition
Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative
More informationDan Roth 461C, 3401 Walnut
CIS 519/419 Applied Machine Learning www.seas.upenn.edu/~cis519 Dan Roth danroth@seas.upenn.edu http://www.cis.upenn.edu/~danroth/ 461C, 3401 Walnut Slides were created by Dan Roth (for CIS519/419 at Penn
More informationEnsemble Methods and Random Forests
Ensemble Methods and Random Forests Vaishnavi S May 2017 1 Introduction We have seen various analysis for classification and regression in the course. One of the common methods to reduce the generalization
More informationNaïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability
Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish
More informationLecture 3: Introduction to Complexity Regularization
ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationUVA CS 4501: Machine Learning
UVA CS 4501: Machine Learning Lecture 21: Decision Tree / Random Forest / Ensemble Dr. Yanjun Qi University of Virginia Department of Computer Science Where are we? è Five major sections of this course
More informationLecture 3: Statistical Decision Theory (Part II)
Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical
More informationClassification and Regression Trees
Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Lecture 06 - Regression & Decision Trees Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom
More informationCSE 151 Machine Learning. Instructor: Kamalika Chaudhuri
CSE 151 Machine Learning Instructor: Kamalika Chaudhuri Ensemble Learning How to combine multiple classifiers into a single one Works well if the classifiers are complementary This class: two types of
More informationDeconstructing Data Science
econstructing ata Science avid Bamman, UC Berkeley Info 290 Lecture 6: ecision trees & random forests Feb 2, 2016 Linear regression eep learning ecision trees Ordinal regression Probabilistic graphical
More informationBINARY TREE-STRUCTURED PARTITION AND CLASSIFICATION SCHEMES
BINARY TREE-STRUCTURED PARTITION AND CLASSIFICATION SCHEMES DAVID MCDIARMID Abstract Binary tree-structured partition and classification schemes are a class of nonparametric tree-based approaches to classification
More informationDecision Trees. Lewis Fishgold. (Material in these slides adapted from Ray Mooney's slides on Decision Trees)
Decision Trees Lewis Fishgold (Material in these slides adapted from Ray Mooney's slides on Decision Trees) Classification using Decision Trees Nodes test features, there is one branch for each value of
More informationC4.5 - pruning decision trees
C4.5 - pruning decision trees Quiz 1 Quiz 1 Q: Is a tree with only pure leafs always the best classifier you can have? A: No. Quiz 1 Q: Is a tree with only pure leafs always the best classifier you can
More informationCMSC858P Supervised Learning Methods
CMSC858P Supervised Learning Methods Hector Corrada Bravo March, 2010 Introduction Today we discuss the classification setting in detail. Our setting is that we observe for each subject i a set of p predictors
More informationLecture 4 Discriminant Analysis, k-nearest Neighbors
Lecture 4 Discriminant Analysis, k-nearest Neighbors Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University. Email: fredrik.lindsten@it.uu.se fredrik.lindsten@it.uu.se
More informationLinear regression COMS 4771
Linear regression COMS 4771 1. Old Faithful and prediction functions Prediction problem: Old Faithful geyser (Yellowstone) Task: Predict time of next eruption. 1 / 40 Statistical model for time between
More informationVariable Selection and Sensitivity Analysis via Dynamic Trees with an application to Computer Code Performance Tuning
Variable Selection and Sensitivity Analysis via Dynamic Trees with an application to Computer Code Performance Tuning Robert B. Gramacy University of Chicago Booth School of Business faculty.chicagobooth.edu/robert.gramacy
More informationCOMS 4771 Lecture Boosting 1 / 16
COMS 4771 Lecture 12 1. Boosting 1 / 16 Boosting What is boosting? Boosting: Using a learning algorithm that provides rough rules-of-thumb to construct a very accurate predictor. 3 / 16 What is boosting?
More informationLeast Squares Classification
Least Squares Classification Stephen Boyd EE103 Stanford University November 4, 2017 Outline Classification Least squares classification Multi-class classifiers Classification 2 Classification data fitting
More informationLearning Decision Trees
Learning Decision Trees Machine Learning Spring 2018 1 This lecture: Learning Decision Trees 1. Representation: What are decision trees? 2. Algorithm: Learning decision trees The ID3 algorithm: A greedy
More informationLoss Functions, Decision Theory, and Linear Models
Loss Functions, Decision Theory, and Linear Models CMSC 678 UMBC January 31 st, 2018 Some slides adapted from Hamed Pirsiavash Logistics Recap Piazza (ask & answer questions): https://piazza.com/umbc/spring2018/cmsc678
More informationContents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)
Contents Lecture Lecture Linear Discriminant Analysis Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University Email: fredriklindsten@ituuse Summary of lecture
More informationChapter 6: Classification
Chapter 6: Classification 1) Introduction Classification problem, evaluation of classifiers, prediction 2) Bayesian Classifiers Bayes classifier, naive Bayes classifier, applications 3) Linear discriminant
More informationMidterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian
More informationBoosting. Ryan Tibshirani Data Mining: / April Optional reading: ISL 8.2, ESL , 10.7, 10.13
Boosting Ryan Tibshirani Data Mining: 36-462/36-662 April 25 2013 Optional reading: ISL 8.2, ESL 10.1 10.4, 10.7, 10.13 1 Reminder: classification trees Suppose that we are given training data (x i, y
More informationMachine Learning for NLP
Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline
More informationday month year documentname/initials 1
ECE471-571 Pattern Recognition Lecture 13 Decision Tree Hairong Qi, Gonzalez Family Professor Electrical Engineering and Computer Science University of Tennessee, Knoxville http://www.eecs.utk.edu/faculty/qi
More informationChapter ML:III. III. Decision Trees. Decision Trees Basics Impurity Functions Decision Tree Algorithms Decision Tree Pruning
Chapter ML:III III. Decision Trees Decision Trees Basics Impurity Functions Decision Tree Algorithms Decision Tree Pruning ML:III-34 Decision Trees STEIN/LETTMANN 2005-2017 Splitting Let t be a leaf node
More informationGeneralization, Overfitting, and Model Selection
Generalization, Overfitting, and Model Selection Sample Complexity Results for Supervised Classification Maria-Florina (Nina) Balcan 10/03/2016 Two Core Aspects of Machine Learning Algorithm Design. How
More information15-388/688 - Practical Data Science: Decision trees and interpretable models. J. Zico Kolter Carnegie Mellon University Spring 2018
15-388/688 - Practical Data Science: Decision trees and interpretable models J. Zico Kolter Carnegie Mellon University Spring 2018 1 Outline Decision trees Training (classification) decision trees Interpreting
More informationPerformance Evaluation
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Example:
More informationPredictive Modeling: Classification. KSE 521 Topic 6 Mun Yi
Predictive Modeling: Classification Topic 6 Mun Yi Agenda Models and Induction Entropy and Information Gain Tree-Based Classifier Probability Estimation 2 Introduction Key concept of BI: Predictive modeling
More informationLecture 3: Decision Trees
Lecture 3: Decision Trees Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning ID3, Information Gain, Overfitting, Pruning Lecture 3: Decision Trees p. Decision
More informationRandom Forests. Sören Mindermann September 21, Bachelorproject Begeleiding: dr. B Kleijn
Random Forests Sören Mindermann September 21, 2016 Bachelorproject Begeleiding: dr. B Kleijn Korteweg-de Vries Instituut voor Wiskunde Faculteit der Natuurwetenschappen, Wiskunde en Informatica Universiteit
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric
More informationCART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions. f(x) = c m 1(x R m )
CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions with R 1,..., R m R p disjoint. f(x) = M c m 1(x R m ) m=1 The CART algorithm is a heuristic, adaptive
More informationLearning Decision Trees
Learning Decision Trees Machine Learning Fall 2018 Some slides from Tom Mitchell, Dan Roth and others 1 Key issues in machine learning Modeling How to formulate your problem as a machine learning problem?
More informationCART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions
CART Classification and Regression Trees Trees can be viewed as basis expansions of simple functions f (x) = M c m 1(x R m ) m=1 with R 1,..., R m R p disjoint. The CART algorithm is a heuristic, adaptive
More informationDiscriminative v. generative
Discriminative v. generative Naive Bayes 2 Naive Bayes P (x ij,y i )= Y i P (y i ) Y j P (x ij y i ) P (y i =+)=p MLE: max P (x ij,y i ) a j,b j,p p = 1 N P [yi =+] P (x ij =1 y i = ) = a j P (x ij =1
More informationLecture 7 Decision Tree Classifier
Machine Learning Dr.Ammar Mohammed Lecture 7 Decision Tree Classifier Decision Tree A decision tree is a simple classifier in the form of a hierarchical tree structure, which performs supervised classification
More information