Machine Learning

Machine Learning 10-701
Tom M. Mitchell, Machine Learning Department, Carnegie Mellon University
January 13, 2011

Today:
- The big picture
- Overfitting
- Review: probability

Readings:
- Decision trees, overfitting: Mitchell, Chapter 3
- Probability review: Bishop Ch. 1 through 1.2.3; Bishop Ch. 2 through 2.2; Andrew Moore's online tutorial

Function Approximation: Decision Tree Learning

Problem setting:
- Set of possible instances X; each instance x in X is a feature vector x = <x_1, x_2, ..., x_n>
- Unknown target function f : X -> Y, where Y is discrete valued
- Set of function hypotheses H = { h | h : X -> Y }; each hypothesis h is a decision tree

Input: training examples {<x^(i), y^(i)>} of the unknown target function f
Output: hypothesis h in H that best approximates the target function f
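To make the problem setting concrete, here is a minimal sketch (not from the slides; the toy feature vectors and labels are invented) that uses scikit-learn's DecisionTreeClassifier as the hypothesis space H:

```python
# Minimal sketch (not from the slides): decision-tree learning as function
# approximation, with scikit-learn's DecisionTreeClassifier standing in for H.
from sklearn.tree import DecisionTreeClassifier

# Training examples {<x(i), y(i)>}: each instance is a feature vector, Y is discrete.
X = [[1, 0, 0],
     [1, 1, 0],
     [0, 0, 1],
     [0, 1, 1]]
y = [0, 0, 1, 1]  # discrete-valued target f(x(i))

# The learner searches H = {decision trees over X} for an h approximating f.
h = DecisionTreeClassifier(criterion="entropy")  # entropy-based splits, ID3-style
h.fit(X, y)

print(h.predict([[1, 0, 1]]))  # apply the learned hypothesis to a new instance
```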

Function Approximation as Search for the Best Hypothesis
- ID3 performs a heuristic search through the space of decision trees

Function Approximation: The Big Picture

Which Tree Should We Output?
- ID3 performs a heuristic search through the space of decision trees
- It stops at the smallest acceptable tree. Why?
- Occam's razor: prefer the simplest hypothesis that fits the data

Why Prefer Short Hypotheses? (Occam's Razor)
- Arguments in favor:
- Arguments opposed:

Why Prefer Short Hypotheses? (Occam's Razor)
Argument in favor:
- There are fewer short hypotheses than long ones, so a short hypothesis that fits the data is less likely to be a statistical coincidence
- It is highly probable that a sufficiently complex hypothesis will fit the data by chance
Argument opposed:
- There are also fewer hypotheses containing a prime number of nodes and attributes beginning with "Z"; so what's so special about short hypotheses?

- Split the data into a training set and a validation set
- Create a tree that classifies the training set correctly
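A hedged sketch of this idea (my own example, not the slides' procedure): scikit-learn's cost-complexity pruning is used here as a stand-in for reduced-error pruning, and the pruning strength is chosen by accuracy on the held-out validation set.

```python
# Hedged sketch: grow a tree on the training split, then keep the pruning
# level that scores best on the validation split.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate pruning strengths (alphas) from the cost-complexity pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

best_alpha, best_score = 0.0, -np.inf
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_tr, y_tr)                 # tree that classifies the training set
    score = tree.score(X_val, y_val)     # accuracy on the held-out validation set
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"chosen alpha={best_alpha:.5f}, validation accuracy={best_score:.3f}")
```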

What you should know:
- Well-posed function approximation problems:
  - Instance space, X
  - Sample of labeled training data {<x^(i), y^(i)>}
  - Hypothesis space, H = { f : X -> Y }
- Learning is a search/optimization problem over H
  - Various objective functions: minimize training error (0-1 loss); among hypotheses that minimize training error, select the smallest (?)
  - But inductive learning without some bias is futile!
- Decision tree learning
  - Greedy top-down learning of decision trees (ID3, C4.5, ...)
  - Overfitting and tree/rule post-pruning
  - Extensions

Extra slides: extensions to decision tree learning

Questions to think about (1)
- ID3 and C4.5 are heuristic algorithms that search through the space of decision trees. Why not just do an exhaustive search?

Questions to think about (2)
- Consider a target function f : <x1, x2> -> y, where x1 and x2 are real-valued and y is boolean. What is the set of decision surfaces describable with decision trees that use each attribute at most once?

Questions to think about (3)
- Why use information gain to select attributes in decision trees? What other criteria seem reasonable, and what are the tradeoffs in making this choice?
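As a starting point for question (3), here is a small sketch (my own, with invented PlayTennis-style data) of the entropy and information-gain computations that ID3 uses to pick an attribute:

```python
# Entropy and information gain for one candidate attribute (illustrative data).
import math
from collections import Counter

def entropy(labels):
    """H(Y) = - sum_v p(v) * log2 p(v), over the observed label values v."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attr_values):
    """Gain(Y, A) = H(Y) - sum_a P(A = a) * H(Y | A = a)."""
    n = len(labels)
    groups = {}
    for a, y in zip(attr_values, labels):
        groups.setdefault(a, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy PlayTennis-style data: how much does splitting on Wind reduce entropy?
play = ["no", "no", "yes", "yes", "yes", "no", "yes", "yes"]
wind = ["weak", "strong", "weak", "weak", "weak", "strong", "strong", "weak"]
print(information_gain(play, wind))
```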

Machine Learning 10-701
Tom M. Mitchell, Machine Learning Department, Carnegie Mellon University
January 13, 2011

Today:
- Review: probability

Many of these slides are derived from William Cohen, Andrew Moore, Aarti Singh, Eric Xing. Thanks!

Readings:
- Probability review: Bishop Ch. 1 through 1.2.3; Bishop Ch. 2 through 2.2; Andrew Moore's online tutorial

The Problem of Induction
David Hume (1711-1776) pointed out:
1. Empirically, induction seems to work.
2. Statement (1) is an application of induction.
This stumped people for about 200 years.

Probability Overview
- Events: discrete random variables, continuous random variables, compound events
- Axioms of probability: what defines a reasonable theory of uncertainty
- Independent events
- Conditional probabilities
- Bayes rule and beliefs
- Joint probability distribution
- Expectations
- Independence, conditional independence

Random Variables
- Informally, A is a random variable if A denotes something about which we are uncertain, perhaps the outcome of a randomized experiment
- Examples:
  - A = True if a randomly drawn person from our class is female
  - A = the hometown of a randomly drawn person from our class
  - A = True if two randomly drawn persons from our class have the same birthday
- Define P(A) as the fraction of possible worlds in which A is true, or the fraction of times A holds in repeated runs of the random experiment
- The set of possible worlds is called the sample space, S
- A random variable A is a function defined over S, e.g. A : S -> {0, 1}
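The "fraction of times A holds in repeated runs" reading can be checked by simulation; a tiny sketch (my own example) for the shared-birthday event above:

```python
# Estimate P(A) as the fraction of repeated runs in which A holds, where
# A = "two randomly drawn people have the same birthday" (ignoring leap years).
import random

def A():
    return random.randint(1, 365) == random.randint(1, 365)

runs = 200_000
estimate = sum(A() for _ in range(runs)) / runs
print(estimate)  # should be close to 1/365, approximately 0.0027
```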

A little formalism
More formally, we have:
- a sample space S (e.g., the set of students in our class), aka the set of possible worlds
- random variables, which are functions defined over the sample space, e.g. Gender: S -> {m, f}, Height: S -> Reals
- events, which are subsets of S, e.g. the subset of S for which Gender = f, or the subset for which (Gender = m) AND (eyeColor = blue)
- we're often interested in probabilities of specific events, and of specific events conditioned on other specific events

Visualizing A
- Sample space of all possible worlds; its area is 1
- Worlds in which A is true: P(A) = area of the reddish oval
- Worlds in which A is false

The Axioms of Probability
- 0 <= P(A) <= 1
- P(True) = 1
- P(False) = 0
- P(A or B) = P(A) + P(B) - P(A and B)

[de Finetti 1931]: when gambling based on an uncertainty formalism A, you can be exploited by an opponent if and only if your uncertainty formalism A violates these axioms.

Interpreting the axioms
- The area of A can't get any smaller than 0; a zero area would mean no world could ever have A true.
- The area of A can't get any bigger than 1; an area of 1 would mean all worlds have A true.

Interpreting the axioms
- P(A or B) = P(A) + P(B) - P(A and B): the area of the union of A and B counts their overlap only once.

Theorems from the Axioms
- The axioms: 0 <= P(A) <= 1, P(True) = 1, P(False) = 0, P(A or B) = P(A) + P(B) - P(A and B)
- Theorem: P(not A) = P(~A) = 1 - P(A)

- Proof: P(A or ~A) = 1 and P(A and ~A) = 0; by the axiom for P(A or B), P(A or ~A) = P(A) + P(~A) - P(A and ~A), so 1 = P(A) + P(~A) - 0.

Elementary Probability in Pictures
- P(~A) + P(A) = 1: the regions A and ~A partition the sample space.

Another useful theorem
- Theorem: P(A) = P(A ^ B) + P(A ^ ~B)
- Proof: A = A and (B or ~B) = (A and B) or (A and ~B), so
  P(A) = P(A and B) + P(A and ~B) - P((A and B) and (A and ~B))
       = P(A and B) + P(A and ~B) - P(A and A and B and ~B)
       = P(A and B) + P(A and ~B), since the last term is P(False) = 0.

Elementary Probability in Pictures
- P(A) = P(A ^ B) + P(A ^ ~B): the regions A ^ B and A ^ ~B partition A.

Multivalued Discrete Random Variables
- Suppose A can take on more than 2 values. A is a random variable with arity k if it can take on exactly one value out of {v_1, v_2, ..., v_k}.
- Thus P(A = v_i and A = v_j) = 0 whenever i != j, and P(A = v_1 or A = v_2 or ... or A = v_k) = 1.

Elementary Probability in Pictures
- The events A=1, A=2, A=3, A=4, A=5 partition the sample space.

Definition of Conditional Probability
- P(A | B) = P(A ^ B) / P(B)
- Corollary (the Chain Rule): P(A ^ B) = P(A | B) P(B)

Conditional Probability in Pictures
- Picture: P(B | A=2), the fraction of the A=2 region in which B also holds.
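A quick worked instance of the definition and the chain rule, with illustrative numbers of my own choosing:

```latex
\[
P(A \mid B) = \frac{P(A \wedge B)}{P(B)} = \frac{0.12}{0.40} = 0.30,
\qquad
P(A \wedge B) = P(A \mid B)\,P(B) = 0.30 \times 0.40 = 0.12 .
\]
```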

Independent Events
- Definition: two events A and B are independent if Pr(A and B) = Pr(A) * Pr(B)
- Intuition: knowing A tells us nothing about the value of B (and vice versa)
- Picture: A independent of B

Elementary Probability in Pictures
- Let's write two expressions for P(A ^ B): P(A ^ B) = P(A | B) P(B) and P(A ^ B) = P(B | A) P(A)

Bayes rule
- Equating the two expressions gives P(A | B) = P(B | A) P(A) / P(B)
- We call P(A) the prior and P(A | B) the posterior
- Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418: "by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter... necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning"
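Expanding P(B) with the "useful theorem" above gives the form of Bayes rule that is usually applied in practice; here it is with illustrative numbers of my own (a rare event A and a moderately reliable indicator B):

```latex
\[
P(A \mid B)
= \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \lnot A)\,P(\lnot A)}
= \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.1 \times 0.99}
= \frac{0.009}{0.108} \approx 0.083 .
\]
```

Even with a fairly reliable indicator, the posterior stays small because the prior P(A) = 0.01 is small.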

Other Forms of Bayes Rule

You should know
- Events: discrete random variables, continuous random variables, compound events
- Axioms of probability: what defines a reasonable theory of uncertainty
- Independent events
- Conditional probabilities
- Bayes rule and beliefs

What does all this have to do with function approximation?

Maximum Likelihood Estimate for Θ
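As a concrete instance of this estimate (the standard coin-flip example, with my own notation α_H and α_T for the observed counts of heads and tails):

```latex
\[
\hat{\theta}_{\mathrm{MLE}}
= \arg\max_{\theta} P(D \mid \theta)
= \arg\max_{\theta}\; \theta^{\alpha_H} (1 - \theta)^{\alpha_T}
= \frac{\alpha_H}{\alpha_H + \alpha_T}.
\]
```

Setting the derivative of the log-likelihood, α_H ln θ + α_T ln(1 - θ), to zero gives the closed form on the right.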

Dirichlet distribution
- The number of heads in N flips of a two-sided coin follows a binomial distribution, and Beta is a good prior (the conjugate prior for the binomial)
- What if the coin is not two-sided, but k-sided? The outcome counts follow a multinomial distribution, and the Dirichlet distribution is the conjugate prior
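A small sketch (my own example, not from the slides) of the conjugacy claim: a Beta prior combined with coin-flip data gives a Beta posterior in closed form (scipy is assumed to be available).

```python
# Beta is conjugate to the binomial: a Beta(a, b) prior plus (heads, tails) data
# gives a Beta(a + heads, b + tails) posterior over theta = P(heads).
from scipy.stats import beta

a_prior, b_prior = 2.0, 2.0     # Beta(2, 2) prior: mild preference for fair coins
heads, tails = 7, 3             # observed flips

posterior = beta(a_prior + heads, b_prior + tails)   # Beta(9, 5)

print(posterior.mean())   # posterior mean of theta: 9 / 14, about 0.643
map_estimate = (a_prior + heads - 1) / (a_prior + b_prior + heads + tails - 2)
print(map_estimate)       # posterior mode (MAP): 8 / 12, about 0.667

# The k-sided analogue works the same way: a Dirichlet prior plus multinomial
# counts gives a Dirichlet posterior (see scipy.stats.dirichlet).
```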

Estimating Parameters
- Maximum Likelihood Estimate (MLE): choose the θ that maximizes the probability of the observed data
- Maximum a Posteriori (MAP) estimate: choose the θ that is most probable given the prior probability and the data

You should know
- Probability basics: random variables, events, sample space, conditional probabilities, independence of random variables
- Bayes rule
- Joint probability distributions: calculating probabilities from the joint distribution
- Point estimation: maximum likelihood estimates, maximum a posteriori estimates
- Distributions: binomial, Beta, Dirichlet, ...

Extra slides

The Joint Distribution
- Recipe for making a joint distribution of M variables (example below: Boolean variables A, B, C)

The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables, the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.

Example: Boolean variables A, B, C

  A  B  C  Prob
  0  0  0  0.30
  0  0  1  0.05
  0  1  0  0.10
  0  1  1  0.05
  1  0  0  0.05
  1  0  1  0.10
  1  1  0  0.25
  1  1  1  0.10

[The same distribution drawn as a Venn diagram over A, B, C; the region areas match the table above.]

Using the Joint
Once you have the joint distribution (JD) you can ask for the probability of any logical expression involving your attributes.
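A short sketch (not from the slides) of "using the joint" on the A, B, C table above; any logical query is just a sum over the rows where the expression holds:

```python
# The joint distribution from the table, as a map from (A, B, C) to probability.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event):
    """P(E) = sum of row probabilities over rows where the expression E holds."""
    return sum(p for row, p in joint.items() if event(*row))

print(prob(lambda a, b, c: a == 1))             # P(A)      = 0.50
print(prob(lambda a, b, c: a == 1 or b == 1))   # P(A or B) = 0.65
```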

Using the Joint
- P(Poor and Male) = 0.4654
- P(Poor) = 0.7604

Inference with the Joint
- P(Male | Poor) = P(Male and Poor) / P(Poor) = 0.4654 / 0.7604 = 0.612
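The slide's query has the same shape as a conditional computed from the toy joint above; a self-contained sketch (my own example):

```python
# Inference with the joint: P(E1 | E2) = P(E1 and E2) / P(E2),
# using the same toy (A, B, C) joint as in the previous sketch.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event):
    return sum(p for row, p in joint.items() if event(*row))

def cond_prob(event1, event2):
    return prob(lambda *row: event1(*row) and event2(*row)) / prob(event2)

# Analogue of the slide's P(Male | Poor) query: P(A | B) on the toy joint.
print(cond_prob(lambda a, b, c: a == 1, lambda a, b, c: b == 1))  # 0.35 / 0.50 = 0.70
```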

Expected values
- Given a discrete random variable X, the expected value of X, written E[X], is E[X] = sum_x x P(X = x)
- We can also talk about the expected value of functions of X: E[f(X)] = sum_x f(x) P(X = x)

Covariance
- Given two discrete random variables X and Y, we define the covariance of X and Y as Cov(X, Y) = E[(X - E[X])(Y - E[Y])]
- e.g., X = gender, Y = playsFootball; or X = gender, Y = leftHanded
- Remember: ...
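A short sketch with illustrative numbers of my own, computing E[X], E[f(X)], and Cov(X, Y) directly from a small joint table over two binary variables:

```python
# Expectation and covariance from a joint table over (X, Y); numbers are illustrative.
joint_xy = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.2, (1, 1): 0.3,
}

E_X  = sum(x * p for (x, y), p in joint_xy.items())            # E[X]   = 0.5
E_Y  = sum(y * p for (x, y), p in joint_xy.items())            # E[Y]   = 0.4
E_fX = sum((x ** 2) * p for (x, y), p in joint_xy.items())     # E[X^2] = 0.5

# Cov(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X] E[Y]
E_XY = sum(x * y * p for (x, y), p in joint_xy.items())        # E[XY]  = 0.3
cov = E_XY - E_X * E_Y                                         # 0.3 - 0.2 = 0.1

print(E_X, E_Y, E_fX, cov)
```

The positive covariance here means X and Y tend to take the value 1 together, which is the kind of dependence the gender/playsFootball example points at.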

You should know
- Probability basics: random variables, events, sample space, conditional probabilities, independence of random variables
- Bayes rule
- Joint probability distributions: calculating probabilities from the joint distribution
- Point estimation: maximum likelihood estimates, maximum a posteriori estimates
- Distributions: binomial, Beta, Dirichlet, ...