Class #04: Mutual Information & Decision Trees
Machine Learning (CS 419/519): M. Allen, 12 Sept. 18

Review: Entropy and Information
- Entropy is the information gained on average when observing events that occur according to a probability distribution:
      H(P) = - Σ_i p_i log p_i
- It is non-negative (zero only for a deterministic distribution), and it is maximized by a uniform distribution
- Thus, for any possible distribution P = {p_1, p_2, ..., p_k}, we have:
      0 ≤ H(P) ≤ log k

Review: Joint Probability & Independence
- If we have two events e_1 and e_2, the probability that both events occur, called the joint probability, is written:
      P(e_1 ∧ e_2) = P(e_1, e_2)
- We say that two events are independent if and only if:
      P(e_1, e_2) = P(e_1) P(e_2)
- Independent events tell us nothing about each other

Review: Conditional Probability
- Given two events e_1 and e_2, the probability that e_1 occurs, given that e_2 also occurs, called the conditional probability of e_1 given e_2, is written: P(e_1 | e_2)
- In general, the conditional probability of an event can be quite different from the basic probability that it occurs
- Thus, for our weather/umbrella example, we might have:
      W = {R, ¬R},   P_W = {0.5, 0.5}
      U = {U, ¬U},   P_U = {0.2, 0.8}
      P(U | R) = 0.8     P(U | ¬R) = 0.1
      P(¬U | R) = 0.2    P(¬U | ¬R) = 0.9
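As a minimal numeric check (my addition, not part of the slides) of the entropy bound reviewed above, the sketch below evaluates H(P) for the weather and umbrella distributions from the example; the helper name entropy is an assumption for illustration, and logs are taken base 2 so values are in bits:

    import math

    def entropy(dist):
        # H(P) = -sum_i p_i log2 p_i, with the convention 0 * log 0 = 0
        return -sum(p * math.log2(p) for p in dist if p > 0)

    P_W = [0.5, 0.5]   # weather: rain / no rain
    P_U = [0.2, 0.8]   # umbrella: carry / don't carry

    print(entropy(P_W))          # 1.0 bit: uniform over k = 2 outcomes reaches log2(k)
    print(entropy(P_U))          # ~0.72 bits: non-uniform, so strictly below the maximum
    print(entropy([1.0, 0.0]))   # 0.0 bits: a deterministic outcome gives no information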

Properties of Conditional Probability
- Conditional probability can be defined using joint probability:
      P(e_1 | e_2) = P(e_1, e_2) / P(e_2)
      P(e_1, e_2) = P(e_1 | e_2) P(e_2)
- Thus, if the events are actually independent, we get:
      P(e_1 | e_2) = P(e_1, e_2) / P(e_2)
                   = P(e_1) P(e_2) / P(e_2)    (by definition of independence)
                   = P(e_1)

Calculating Joint Probabilities
- We have the simple and conditional probabilities of rain and my umbrella-carrying behavior:
      W = {R, ¬R},   P_W = {0.5, 0.5}
      U = {U, ¬U},   P_U = {0.2, 0.8}
      P(U | R) = 0.8     P(U | ¬R) = 0.1
      P(¬U | R) = 0.2    P(¬U | ¬R) = 0.9
- This allows us to calculate the various joint probabilities:
      P(U, R)   = P(U | R) P(R)    = 0.8 × 0.5 = 0.4
      P(U, ¬R)  = P(U | ¬R) P(¬R)  = 0.1 × 0.5 = 0.05
      P(¬U, R)  = P(¬U | R) P(R)   = 0.2 × 0.5 = 0.1
      P(¬U, ¬R) = P(¬U | ¬R) P(¬R) = 0.9 × 0.5 = 0.45
- The total set of probabilities sums to 1.0

Mutual Information
- Suppose we have two sets of possible events, each with its own probability distribution:
      E  = {e_1, e_2, ..., e_m},     P_E  = {p_1, p_2, ..., p_m}
      E′ = {e′_1, e′_2, ..., e′_n},  P_E′ = {p′_1, p′_2, ..., p′_n}
- We can define mutual information, the amount that one event tells us about the other:
      I(E; E′) = Σ_{e_i, e′_j} P(e_i, e′_j) log [ P(e_i, e′_j) / (P(e_i) P(e′_j)) ]
- Effectively, this measures how much knowing that E′ has happened reduces the entropy of E

Mutual Information (example)
- This allows us to quantify exactly how much knowing whether or not it is raining tells us about whether or not I will be carrying an umbrella:
      I(U; W) = P(U, R) log [P(U, R) / (P(U) P(R))] + P(U, ¬R) log [P(U, ¬R) / (P(U) P(¬R))]
              + P(¬U, R) log [P(¬U, R) / (P(¬U) P(R))] + P(¬U, ¬R) log [P(¬U, ¬R) / (P(¬U) P(¬R))]
              = 0.4 log (0.4 / (0.2 × 0.5)) + 0.05 log (0.05 / (0.2 × 0.5))
              + 0.1 log (0.1 / (0.8 × 0.5)) + 0.45 log (0.45 / (0.8 × 0.5))
              = 0.4 log 4 + 0.05 log 0.5 + 0.1 log 0.25 + 0.45 log 1.125
              = 0.8 − 0.05 − 0.2 + 0.0765 = 0.6265
- Note: the final value doesn't matter so much (e.g., it would change if we used a different base for our logarithms). It does allow us to compare different combinations of variables, however, to see which tells us the most about another.
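The arithmetic above can be checked mechanically. The sketch below (my addition, not from the slides) reproduces I(U; W) using the joint table and the marginal values exactly as the slides give them; the dictionary keys are just illustrative labels:

    import math

    # joint probabilities from the "Calculating Joint Probabilities" slide
    joint = {("U", "R"): 0.40, ("U", "~R"): 0.05,
             ("~U", "R"): 0.10, ("~U", "~R"): 0.45}

    # marginal distributions as given on the slides
    P_U = {"U": 0.2, "~U": 0.8}
    P_W = {"R": 0.5, "~R": 0.5}

    mi = sum(p * math.log2(p / (P_U[u] * P_W[w])) for (u, w), p in joint.items())
    print(round(mi, 4))   # 0.6265, matching the total computed above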

Properties of Mutual Information
      I(E; E′) = Σ_{e_i, e′_j} P(e_i, e′_j) log [ P(e_i, e′_j) / (P(e_i) P(e′_j)) ]
- As defined, mutual information is:
  1. Symmetric: I(E; E′) = I(E′; E), because P(e_i, e′_j) = P(e′_j, e_i)
  2. Non-negative: I(E; E′) ≥ 0 (because: it's complicated, but trust me)
  3. Zero when events are independent (i.e., when independent, one event tells us nothing about the other that we didn't already know):
      I(E; E′) = Σ_{e_i, e′_j} P(e_i, e′_j) log [ P(e_i, e′_j) / (P(e_i) P(e′_j)) ]
               = Σ_{e_i, e′_j} P(e_i) P(e′_j) log [ P(e_i) P(e′_j) / (P(e_i) P(e′_j)) ]
               = Σ_{e_i, e′_j} P(e_i) P(e′_j) log 1 = Σ_{e_i, e′_j} P(e_i) P(e′_j) × 0 = 0

Review: Inductive Learning
- In its simplest form, induction is the task of learning a function on some inputs from examples of its outputs
- For a target function, f, each training example is a pair (x, f(x))
- We assume that we do not yet know the actual form of the function f (if we did, we wouldn't need to learn it)
- Learning problem: find a hypothesis function, h, such that h(x) = f(x) most of the time, based on a training set of example input-output pairs

Decision Trees
- A decision tree leads us from a set of attributes (features of the input) to some output
- For example, we have a database of customer records for restaurants
- These customers have made a number of decisions about whether to wait for a table, based on a number of attributes:
  1. Alternate: is there an alternative restaurant nearby?
  2. Bar: is there a comfortable bar area to wait in?
  3. Fri/Sat: is today Friday or Saturday?
  4. Hungry: are we hungry?
  5. Patrons: number of people in the restaurant (None, Some, Full)
  6. Price: price range ($, $$, $$$)
  7. Raining: is it raining outside?
  8. Reservation: have we made a reservation?
  9. Type: kind of restaurant (French, Italian, Thai, Burger)
  10. WaitEstimate: estimated wait time in minutes (0-10, 10-30, 30-60, >60)
- The function we want to learn is whether or not a (future) customer will decide to wait, given some particular set of attributes

Decisions Based on Attributes
- Training set: cases where patrons have decided to wait or not, along with the associated attributes for each case
- We now want to learn a tree that agrees with the decisions already made, in hopes that it will allow us to predict future decisions
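As a concrete illustration (my sketch, with made-up attribute values rather than rows from the actual restaurant data), one entry of such a training set can be stored as a dictionary of attribute values paired with the observed wait/no-wait decision:

    # one illustrative training example: the attribute values are hypothetical
    example = {"Alternate": True, "Bar": False, "Fri/Sat": False, "Hungry": True,
               "Patrons": "Full", "Price": "$", "Raining": False,
               "Reservation": False, "Type": "Thai", "WaitEstimate": "30-60"}
    decision = False   # this hypothetical customer chose not to wait

    # a training set is then simply a list of (attributes, decision) pairs
    training_set = [(example, decision)]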

Decision Tree Functions
- For the examples given, here is a true tree (one that will lead from the inputs to the same outputs):
      [Tree figure: the root tests Patrons? (None / Some / Full); a WaitEstimate? node branches on >60, 30-60, 10-30, 0-10; lower nodes test Alternate?, Hungry?, Reservation?, Bar?, Fri/Sat?, and Raining?]

Decision Trees are Expressive
- Such trees can express any deterministic function we like
- For example, in boolean functions, each row of a truth table will correspond to a path in the tree
      [Figure: truth table and corresponding tree for the boolean function A && !B]
- For any such function, there is always a tree: just make each example a different path to a correct leaf output
- A problem: such trees most often do not generalize to new examples
- Another problem: we want compact trees to simplify inference

Why Not Search for Trees?
- One thing we might consider would be to search through possible trees to find ones that are most compact and consistent with our inputs
- Exhaustive search is too expensive, however, due to the large number of possible functions (trees) that exist
- For n binary-valued attributes and boolean decision outputs, there are 2^(2^n) possibilities
- For 5 such attributes, we have 4,294,967,296 trees!
- Even restricting our search to conjunctions over attributes, it is easy to get 3^n possible trees

Building Trees Top-Down
- Rather than search over all trees, we build our trees by:
  1. Choosing an attribute A from our set
  2. Dividing our examples according to the values of A
  3. Placing each subset of examples into a sub-tree below the node for attribute A
- This can be implemented in a number of ways, but is perhaps most easily understood recursively
- The main question becomes: how do we choose the attribute A that we use to split our examples?
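One standard answer, which ties back to the first half of the lecture, is to pick the attribute whose split gives the largest information gain, i.e. the greatest reduction in the entropy of the decision. These slides have not yet committed to that choice, so the sketch below is an assumption about where the material is heading, and the function names are mine:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(examples, attribute):
        """examples is a list of (attribute_dict, decision) pairs."""
        decisions = [y for _, y in examples]
        before = entropy(decisions)
        # group the decisions by the value the candidate attribute takes
        groups = {}
        for x, y in examples:
            groups.setdefault(x[attribute], []).append(y)
        after = sum(len(g) / len(examples) * entropy(g) for g in groups.values())
        return before - after   # entropy removed by splitting on this attribute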

Decision Tree Learning Algorithm

    function DECISION-TREE-LEARNING(examples, attributes, parent_examples) returns a tree
      if examples is empty then return PLURALITY-VALUE(parent_examples)
      else if all examples have the same classification then return the classification
      else if attributes is empty then return PLURALITY-VALUE(examples)
      else
        A ← argmax_{a ∈ attributes} IMPORTANCE(a, examples)
        tree ← a new decision tree with root test A
        for each value v_k of A do
          exs ← {e : e ∈ examples and e.A = v_k}
          subtree ← DECISION-TREE-LEARNING(exs, attributes − A, examples)
          add a branch to tree with label (A = v_k) and subtree subtree
        return tree

- PLURALITY-VALUE(): returns the output decision-value held by the majority of the examples
- IMPORTANCE(): rates attributes by their importance in making decisions for the given set of examples (the only actually complex part; a runnable sketch of the full procedure is given at the end of these notes)

This Week
- Information Theory & Decision Trees
- Readings:
  - Blog post on Information Theory (linked from the class schedule)
  - Section 18.3 from Russell & Norvig
- Office Hours: Wing 10
  - Monday/Wednesday/Friday, 12:00 PM - 1:00 PM
  - Tuesday/Thursday, 1:30 PM - 3:00 PM
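For completeness, here is a minimal Python sketch of the DECISION-TREE-LEARNING pseudocode above (my rendering, not code from the course). It assumes examples are (attribute_dict, decision) pairs and takes the IMPORTANCE measure as a parameter, for instance the information-gain sketch given earlier:

    from collections import Counter

    def plurality_value(examples):
        # PLURALITY-VALUE: the most common decision among the examples
        return Counter(y for _, y in examples).most_common(1)[0][0]

    def decision_tree_learning(examples, attributes, importance, parent_examples=()):
        if not examples:
            return plurality_value(parent_examples)
        classes = {y for _, y in examples}
        if len(classes) == 1:              # all examples share one classification
            return classes.pop()
        if not attributes:
            return plurality_value(examples)
        A = max(attributes, key=lambda a: importance(examples, a))
        tree = {"test": A, "branches": {}}
        # the pseudocode loops over every possible value of A; lacking a declared
        # domain here, we loop over the values actually observed in the examples
        for v in {x[A] for x, _ in examples}:
            exs = [(x, y) for x, y in examples if x[A] == v]
            subtree = decision_tree_learning(
                exs, [a for a in attributes if a != A], importance, examples)
            tree["branches"][v] = subtree
        return tree

Called as decision_tree_learning(training_set, list_of_attribute_names, information_gain), this returns either a single decision value or a nested dictionary of tests and branches, mirroring the tree structure shown on the earlier slides.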