Introduction to ML. Two examples of Learners: Naïve Bayesian Classifiers and Decision Trees


Why Bayesian learning? Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems. Incremental (online): each training example can incrementally increase or decrease the probability that a hypothesis is correct, and prior knowledge can be combined with observed data. Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured.

Bayesian classification: the basic idea The classification problem may be formalized using a posteriori probabilities: P(v | X) = the probability that the sample tuple X = <a_1, ..., a_k> is of class v, e.g. P(class = No | outlook = sunny, windy = true, ...). Idea: assign to instance X the class label v such that P(v | X) is maximal.

Estimating a posteriori probabilities Bayes theorem: P(v | X) = P(X | v) P(v) / P(X). P(X) is constant for all classes, and P(v) is the relative frequency of class v among all samples, so the v that maximizes P(v | X) is the v that maximizes P(X | v) P(v). Problem: computing P(X | v) = P(a_1, ..., a_k | v) directly is infeasible (it requires on the order of 2·2^k probabilities for k binary attributes). Practical difficulty: the required initial knowledge of many probabilities and the significant computational cost.
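
In display form, this is simply the maximum a posteriori rule (a restatement of the formulas above, not an addition from the original slides):

\[
P(v \mid X) = \frac{P(X \mid v)\,P(v)}{P(X)},
\qquad
\hat{v} = \operatorname*{argmax}_{v} P(v \mid X)
        = \operatorname*{argmax}_{v} P(X \mid v)\,P(v).
\]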

Naïve Bayesian Classification Naïve assumption: attribute independence, i.e. P(a_1, ..., a_k | v) = P(a_1 | v) · ... · P(a_k | v). If the i-th attribute is categorical (discrete-valued), P(a_i | v) is estimated as the relative frequency of samples having value a_i for the i-th attribute within class v. If the i-th attribute is continuous, it can either 1) be discretized and treated as categorical, or 2) P(a_i | v) is estimated assuming a Gaussian (normal) distribution, which requires only a mean and a variance. Computationally easy in both cases.
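
In symbols (a restatement of the slide; the Gaussian density below is the standard form implied by option 2, with class-conditional mean \mu_{iv} and variance \sigma_{iv}^2 estimated from the training samples of class v):

\[
v_{NB} = \operatorname*{argmax}_{v} P(v)\prod_{i} P(a_i \mid v),
\qquad
P(a_i \mid v) = \frac{1}{\sqrt{2\pi\sigma_{iv}^{2}}}
\exp\!\left(-\frac{(a_i-\mu_{iv})^{2}}{2\sigma_{iv}^{2}}\right)
\ \text{(continuous case).}
\]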

The Play Tennis Dataset [Day, Outlook, Temp, Humidity, Wind, PlayTennis]
D1   Sunny     Hot   High    Weak    No
D2   Sunny     Hot   High    Strong  No
D3   Overcast  Hot   High    Weak    Yes
D4   Rain      Mild  High    Weak    Yes
D5   Rain      Cool  Normal  Weak    Yes
D6   Rain      Cool  Normal  Strong  No
D7   Overcast  Cool  Normal  Strong  Yes
D8   Sunny     Mild  High    Weak    No
D9   Sunny     Cool  Normal  Weak    Yes
D10  Rain      Mild  Normal  Weak    Yes
D11  Sunny     Mild  Normal  Strong  Yes
D12  Overcast  Mild  High    Strong  Yes
D13  Overcast  Hot   Normal  Weak    Yes
D14  Rain      Mild  High    Strong  No

Naïve Bayes: PlayTennis Example Predict the target value (yes or no) of the target concept PlayTennis for the new instance: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong.

Naïve Bayes: PlayTennis Example v_NB is the target value output by the naïve Bayes classifier:
v_NB = argmax_{v_j ∈ {yes, no}} P(v_j) ∏_i P(a_i | v_j)
     = argmax_{v_j ∈ {yes, no}} P(v_j) P(Outlook = sunny | v_j) P(Temperature = cool | v_j) P(Humidity = high | v_j) P(Wind = strong | v_j)
where the P(v_j) terms are the prior probabilities and the remaining factors are the conditional probabilities.

Estimating Probabilities Prior probabilities: the probabilities of the different target values are estimated from their frequencies over the 14 training examples: P(PlayTennis = yes) = 9/14 = 0.64, P(PlayTennis = no) = 5/14 = 0.36. Conditional probabilities: similarly, we can estimate the conditional probabilities; for example, those for Wind = strong are P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33 and P(Wind = strong | PlayTennis = no) = 3/5 = 0.60.

The Result Using these and similar probability estimates for the remaining attribute values, v_NB can be calculated as follows:
P(yes) · P(sunny | yes) · P(cool | yes) · P(high | yes) · P(strong | yes) = 0.0053
P(no) · P(sunny | no) · P(cool | no) · P(high | no) · P(strong | no) = 0.0206
Thus, the naïve Bayes classifier assigns the target value PlayTennis = no to this new instance.
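
These numbers are easy to reproduce programmatically. The following is an illustrative Python sketch (not part of the original slides) that recomputes the two scores from the PlayTennis table above:

# Attribute order: Outlook, Temp, Humidity, Wind; last field is the PlayTennis label.
data = [
    ("Sunny",    "Hot",  "High",   "Weak",   "No"),
    ("Sunny",    "Hot",  "High",   "Strong", "No"),
    ("Overcast", "Hot",  "High",   "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny",    "Mild", "High",   "Weak",   "No"),
    ("Sunny",    "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "Normal", "Weak",   "Yes"),
    ("Sunny",    "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High",   "Strong", "Yes"),
    ("Overcast", "Hot",  "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Strong", "No"),
]

new_instance = ("Sunny", "Cool", "High", "Strong")

def naive_bayes_score(instance, label):
    rows = [r for r in data if r[-1] == label]
    score = len(rows) / len(data)                      # prior P(v)
    for i, value in enumerate(instance):               # conditionals P(a_i | v)
        score *= sum(1 for r in rows if r[i] == value) / len(rows)
    return score

scores = {v: naive_bayes_score(new_instance, v) for v in ("Yes", "No")}
print(scores)                       # approximately {'Yes': 0.0053, 'No': 0.0206}
print(max(scores, key=scores.get))  # 'No'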

Normalizing class probabilities By normalizing the above quantities so that they sum to one, we can calculate the conditional probability that the target value is no, given the observed attribute values. For the current example, this probability is 0.0206 / (0.0206 + 0.0053) = 0.795.

An Approach to Represent Arbitrary Text Documents Given a text document, we define an attribute for each word position in the document and take the value of that attribute to be the English word found in that position. Thus, the paragraph beginning with the sentence "Given a text document, ..." would be described by 74 attribute values, corresponding to its 74 word positions: the value of the first attribute is the word "Given", the value of the second attribute is "a", etc.

Reducing the Number of Probability Terms Assume positional independence: the probability of encountering a specific word w_k is independent of the specific word position being encountered (a_23 versus a_95). This amounts to assuming that the attributes are independent and identically distributed: P(a_i = w_k | v_j) = P(a_m = w_k | v_j) for all i, j, k, m.

Electronic Newsgroups Considered in the Text Classification Experiment Problem: classifying news articles. 20 electronic (Usenet) newsgroups were considered, e.g. comp.graphics, alt.atheism, etc. Target classification: the name of the newsgroup in which the article appeared. The task is that of a newsgroup posting service that learns to assign documents to the appropriate newsgroup.

Experiment Dataset Data set: 1,000 articles were collected from each newsgroup, forming a data set of 20,000 documents. Vocabulary: the 100 most frequent words were removed (the, of, ...), as was any word occurring fewer than 3 times; the resulting vocabulary consisted of 38,500 words. Naïve Bayes was applied using 2/3 of these 20,000 documents as training examples, and performance was measured over the remaining 1/3.

Text Classification Experimental Results (Joachims 1996) The accuracy achieved by the program was 89%; random guessing would yield 5% accuracy. A variant of naïve Bayes is the NewsWeeder system. Training: the user rates some news articles as interesting. Based on this user profile, NewsWeeder then suggests subsequent articles of interest to the user, presenting the top 10% of its automatically rated articles each day. Result: 59% of the articles presented were interesting, as opposed to 16% in the overall pool.

Summary The naïve Bayes classifier is useful in many practical applications. It is called naïve because it incorporates the simplifying assumption that attribute values are conditionally independent given the classification of the instance. It is suitable for incremental (online) learning, and in some domains its performance is comparable to that of neural network and decision tree learning. It has been successfully applied to text categorization problems.

Decision Trees Decision trees classify instances by sorting them down the tree, from the root to a leaf node. Leaf nodes provide the classification of the instance. Each internal node specifies a test of some attribute (feature value) of the instance, and each branch descending from a node corresponds to one possible value of that attribute.

A Sample Tree: Playing Tennis
Outlook = Sunny: test Humidity (High → No, Normal → Yes)
Outlook = Overcast: Yes
Outlook = Rainy: test Wind (Strong → No, Weak → Yes)

Representation Power Typically, examples are presented as an array of features. Each node tests one single attribute, with one child node for each possible value of that attribute. The target function has discrete output values. The methodology is robust in the presence of noise in the training data.

Induction of Decision Trees The core algorithm uses a greedy search through the space of possible trees; the search is top-down, starting at the root node and ending at the leaf nodes. Start with the full dataset; apply a statistical test to evaluate how good each attribute is at partitioning the examples (dividing the examples into groups according to their class); for each outcome of the chosen test, create a child node; move the examples to the children according to the outcome of the test; and repeat the procedure for each child node. A sketch of this procedure is given below.
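
A minimal, illustrative Python sketch of this top-down procedure (not the slides' own code). The attribute-selection criterion is passed in as a function; information gain, defined on the following slides, is the usual choice:

from collections import Counter

def build_tree(examples, attributes, choose_attribute):
    """examples: list of (feature_dict, label); attributes: list of feature names;
    choose_attribute: function (examples, attributes) -> attribute to test."""
    labels = [label for _, label in examples]
    # Stop when all examples share one class or no attributes are left to test.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]        # leaf: majority class
    best = choose_attribute(examples, attributes)          # statistical test
    tree = {best: {}}
    for value in set(f[best] for f, _ in examples):        # one child per outcome
        subset = [(f, l) for f, l in examples if f[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, remaining, choose_attribute)
    return tree

# Example usage with a trivial selection rule (take the first attribute);
# replace it with an information-gain-based chooser for ID3-style behaviour.
toy = [({"Outlook": "Sunny", "Wind": "Weak"}, "No"),
       ({"Outlook": "Overcast", "Wind": "Strong"}, "Yes")]
print(build_tree(toy, ["Outlook", "Wind"], lambda ex, attrs: attrs[0]))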

Evaluating the Attributes The central choice in top-down induction of decision trees is selecting which attribute to test at each node; we need to select the attribute that is most useful for classification. Information gain is a statistical property that measures how well a given attribute separates the examples according to the target classification. Information gain is based on the concept of entropy.

Entropy Function Entropy characterizes the purity/impurity of an arbitrary collection of examples with respect to a target concept. The entropy function is defined as:
Entropy(S) = - Σ_i p_i log2 p_i
where S is the set of examples and p_i is the proportion of examples in S belonging to class i.

Entropy of a 2-Class Problem
Entropy(S) = - p_a log2 p_a - p_b log2 p_b
Possible values: E(S) = 0 if all the members belong to the same class; E(S) = 1 if the two classes are represented in equal numbers; 0 < E(S) < 1 if the classes are unequally represented.

Information Gain Information gain is defined as the expected reduction in entropy caused by partitioning the examples according to a given attribute:
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)
where S is the given set of examples (the training set), A is the attribute we are testing, and S_v is the subset of elements of S having value v for attribute A.

Example 1 Assume S has 9 positive and 5 negative examples (S: [9+, 5-], E = 0.940); partition according to the Humidity or the Wind attribute.
Humidity: High → [3+, 4-], E = 0.985; Normal → [6+, 1-], E = 0.592.
Gain(S, Humidity) = 0.940 - (7/14)·0.985 - (7/14)·0.592 = 0.151
Wind: Weak → [6+, 2-], E = 0.811; Strong → [3+, 3-], E = 1.0.
Gain(S, Wind) = 0.940 - (8/14)·0.811 - (6/14)·1.0 = 0.048
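
An illustrative check of these numbers in Python (a sketch, not part of the slides), computing entropy and information gain directly from the class counts above:

from math import log2

def entropy(counts):
    """Entropy of a collection given its per-class counts, e.g. [9, 5]."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gain(parent_counts, child_counts):
    """Information gain of a split: parent class counts vs. per-child class counts."""
    total = sum(parent_counts)
    remainder = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - remainder

print(f"{entropy([9, 5]):.3f}")                 # 0.940
print(f"{gain([9, 5], [[3, 4], [6, 1]]):.3f}")  # Humidity: 0.152 (0.151 on the slide, from rounded intermediates)
print(f"{gain([9, 5], [[6, 2], [3, 3]]):.3f}")  # Wind: 0.048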

Example 2 Assuming Outlook is chosen at the root, the partitioning continues until the leaf nodes are reached. Outlook splits [9+, 5-] into: Sunny → [2+, 3-] (to be expanded further), Overcast → [4+, 0-] (leaf: Yes), Rainy → [3+, 2-] (to be expanded further).

Search in TDIDT (top-down induction of decision trees) Hypothesis space H = the set of all decision trees. H is searched in a hill-climbing fashion, from simple to complex trees.

Inductive Bias in TDIDT What is the policy by which TDIDT generalizes from the observations (i.e., what is its inductive bias)? A preference for short trees, and for trees that place high-information-gain attributes near the root. Occam's razor (c. 1320): prefer the shortest (simplest) hypothesis that fits the data. A 5-node tree that fits the data is less likely to be a statistical coincidence than a 500-node tree.

Overfitting the Data Overfitting: continuing to improve a model, making it better and better on the training set by making it more complicated, increases the risk of modelling noise and coincidences in the data set and may actually harm the predictive power of the theory on unseen cases. A classic example is fitting a curve with too many parameters, sketched below.
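
The curve-fitting example can be illustrated with a short Python sketch (an assumed illustration, not from the slides): a high-degree polynomial fits noisy training points almost perfectly but tends to generalize worse than a low-degree one.

import numpy as np

def true_f(x):
    return np.sin(2 * np.pi * x)

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
x_test = np.linspace(0, 1, 100)
y_train = true_f(x_train) + rng.normal(0, 0.2, x_train.shape)   # noisy training samples

for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)                # fit a polynomial of this degree
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - true_f(x_test)) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")

# Typically the degree-9 polynomial drives the training error toward zero while its
# error on unseen points is larger than that of the degree-3 fit: overfitting.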

Effect of Overfitting Typical phenomenon when overfitting: training accuracy keeps increasing, while accuracy on an unseen test set starts decreasing. [Figure: accuracy vs. size of tree; the curve for accuracy on training data keeps rising while the curve for accuracy on unseen data turns downward; overfitting starts roughly where the two curves diverge.]

Avoiding Overfitting in DTs Option 1: stop adding nodes to the tree when overfitting starts occurring; this requires a stopping criterion. Option 2: do not worry about overfitting while growing the tree; after the tree has been built, start pruning it.

Stopping Criteria How do we know when overfitting starts? Use a validation set, a separate set held out from the training set: when accuracy on the validation set goes down, stop adding nodes to this branch (generalization accuracy is then evaluated on a test set). Alternatively, use a statistical significance test: e.g., is the change in class distribution produced by a split still significant? (χ²-test; see the sketch below).
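
As an illustration of the χ²-test idea (a hedged sketch, not from the slides; SciPy is assumed to be available), one can test whether a candidate split changes the class distribution significantly, here using the Wind split counts from Example 1:

from scipy.stats import chi2_contingency

# Rows = branches of the candidate split (Wind = Weak, Wind = Strong),
# columns = class counts (+, -) taken from Example 1.
table = [[6, 2],
         [3, 3]]
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
# A large p-value suggests the split does not change the class distribution
# significantly, so growing this branch further may just be fitting noise.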

Pluses of Decision Trees Easy to generate: the algorithm is simple, and trees are fast and easy to construct. Small trees are easy to read and can be converted to simple rules. Good performance on many tasks, and a wide variety of problems can be recast as classification problems.

Minuses of Decision Trees Not always sufficient to learn complex concepts. Large trees can be hard to understand. Some problems with continuous-valued attributes or classes may not be easily discretized.

References T. Mitchell, Machine Learning, McGraw Hill, 1997, Chapter 3.