Machine Learning 10-601
Tom M. Mitchell, Machine Learning Department, Carnegie Mellon University
September 22, 2011

Today: MLE and MAP, Bayes Classifiers, Naïve Bayes

Readings: Mitchell: Naïve Bayes and Logistic Regression (available on class website)

Summary: Maximum Likelihood Estimate, Bernoulli distribution

What does all this have to do with function approximation?
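
The equations on these summary slides did not survive the transcription; as a hedged reconstruction, the standard Bernoulli result they summarize is the following. If the data D contains α_1 flips with X = 1 and α_0 flips with X = 0, then

  P(D | θ) = θ^{α_1} (1 − θ)^{α_0},   θ̂_MLE = argmax_θ P(D | θ) = α_1 / (α_1 + α_0).

The connection to function approximation is that the quantities a Bayes classifier needs, P(Y) and P(X | Y), are themselves parameters estimated from training data in exactly this way.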

Let's learn classifiers by learning P(Y | X)
Consider Y = Wealth, X = <Gender, HoursWorked>

Gender  HrsWorked  P(rich | G,HW)  P(poor | G,HW)
F       <40.5      .09             .91
F       >40.5      .21             .79
M       <40.5      .23             .77
M       >40.5      .38             .62

How many parameters must we estimate?
Suppose X = <X1, ..., Xn> where the Xi and Y are boolean RVs.
To estimate P(Y | X1, X2, ..., Xn)?
If we have 30 Xi's instead of 2?
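
A quick count under the slide's setup (my arithmetic, not from the slide): for each joint setting of the features we need one free parameter, since P(poor | G,HW) = 1 − P(rich | G,HW). With 2 boolean features that is 2^2 = 4 parameters (the four rows above). With n boolean features it is 2^n parameters; for n = 30 that is 2^30 ≈ 10^9, far more than any realistic training set can support.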

Bayes Rule

P(Y | X) = P(X | Y) P(Y) / P(X)

Which is shorthand for:
P(Y = y_i | X = x_j) = P(X = x_j | Y = y_i) P(Y = y_i) / P(X = x_j)

Equivalently:
P(Y = y_i | X = x_j) = P(X = x_j | Y = y_i) P(Y = y_i) / Σ_k P(X = x_j | Y = y_k) P(Y = y_k)

Can we reduce the number of parameters using Bayes Rule?
Suppose X = <X1, ..., Xn> where the Xi and Y are boolean RVs.
How many parameters to define P(X1, ..., Xn | Y)?
How many parameters to define P(Y)?

Naïve Bayes

Naïve Bayes assumes
P(X1, ..., Xn | Y) = Π_i P(Xi | Y)
i.e., that Xi and Xj are conditionally independent given Y, for all i ≠ j.

Conditional Independence
Definition: X is conditionally independent of Y given Z, if the probability of X is independent of the value of Y, given the value of Z. Which we often write
P(X | Y, Z) = P(X | Z)
E.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

Naïve Bayes uses the assumption that the Xi are conditionally independent, given Y.

Given this assumption, then:
P(X1, X2 | Y) = P(X1 | X2, Y) P(X2 | Y) = P(X1 | Y) P(X2 | Y)

In general:
P(X1, ..., Xn | Y) = Π_i P(Xi | Y)

How many parameters to describe P(X1, ..., Xn | Y)? P(Y)?
Without the conditional independence assumption?
With the conditional independence assumption?

Naïve Bayes in a Nutshell

Bayes rule:
P(Y = y_k | X1, ..., Xn) = P(Y = y_k) P(X1, ..., Xn | y_k) / Σ_j P(Y = y_j) P(X1, ..., Xn | y_j)

Assuming conditional independence among the Xi's:
P(Y = y_k | X1, ..., Xn) = P(Y = y_k) Π_i P(Xi | y_k) / Σ_j P(Y = y_j) Π_i P(Xi | y_j)

So, the classification rule for X_new = <X1, ..., Xn> is:
Y_new = argmax_{y_k} P(Y = y_k) Π_i P(Xi_new | Y = y_k)
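
To make the parameter comparison on this slide concrete (my count, assuming boolean Xi and boolean Y): without the conditional independence assumption, P(X1, ..., Xn | Y) needs 2(2^n − 1) parameters, plus 1 for P(Y); with the assumption, only 2n parameters for the P(Xi | Y), plus 1 for P(Y). For n = 30 that is roughly 2 × 10^9 versus 61.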

Naïve Bayes Algorithm (discrete Xi)

Train_Naïve_Bayes(examples):
  for each* value y_k, estimate P(Y = y_k)
  for each* value x_ij of each attribute Xi, estimate P(Xi = x_ij | Y = y_k)

Classify(X_new):
  Y_new = argmax_{y_k} P(Y = y_k) Π_i P(Xi_new | Y = y_k)

* probabilities must sum to 1, so we need estimate only n−1 of these...

Estimating Parameters: Y, Xi discrete-valued

Maximum likelihood estimates (MLEs):
P̂(Y = y_k) = #D{Y = y_k} / |D|
P̂(Xi = x_ij | Y = y_k) = #D{Xi = x_ij ∧ Y = y_k} / #D{Y = y_k}
where #D{c} denotes the number of items in dataset D for which condition c holds.
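
As a minimal, hedged sketch of the algorithm above in Python (function and variable names are my own, and the alpha smoothing parameter is an addition; alpha = 0 reproduces the plain MLE counts on the slide):

from collections import Counter

def train_naive_bayes(X, y, alpha=0.0):
    # X: list of equal-length tuples of discrete feature values
    # y: list of class labels, same length as X
    # alpha: pseudo-count per value (alpha=0 gives the plain MLE)
    n = len(X[0])
    label_counts = Counter(y)
    labels = sorted(label_counts)
    values = [sorted({x[i] for x in X}) for i in range(n)]    # observed values of each X_i
    prior = {yk: label_counts[yk] / len(y) for yk in labels}  # P(Y = y_k)
    cond = {}                                                 # P(X_i = v | Y = y_k)
    for i in range(n):
        for yk in labels:
            counts = Counter(x[i] for x, lab in zip(X, y) if lab == yk)
            denom = label_counts[yk] + alpha * len(values[i])
            cond[(i, yk)] = {v: (counts[v] + alpha) / denom for v in values[i]}
    return prior, cond, labels

def classify(x_new, prior, cond, labels):
    # Y_new = argmax_yk  P(Y = y_k) * prod_i P(X_i = x_i | Y = y_k)
    def score(yk):
        p = prior[yk]
        for i, v in enumerate(x_new):
            p *= cond[(i, yk)].get(v, 0.0)
        return p
    return max(labels, key=score)

For example, train_naive_bayes([(1, 0), (0, 1), (1, 1)], ["rich", "poor", "rich"]) returns the estimated prior and conditionals, which classify then combines exactly as in the slide's decision rule.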

Example: Live in Sq Hill? P(S | G, D, E)
S = 1 iff live in Squirrel Hill
G = 1 iff shop at SH Giant Eagle
D = 1 iff drive to CMU
E = 1 iff even # of letters in last name

What probability parameters must we estimate?

P(S=1):           P(S=0):
P(D=1 | S=1):     P(D=0 | S=1):
P(D=1 | S=0):     P(D=0 | S=0):
P(G=1 | S=1):     P(G=0 | S=1):
P(G=1 | S=0):     P(G=0 | S=0):
P(E=1 | S=1):     P(E=0 | S=1):
P(E=1 | S=0):     P(E=0 | S=0):
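
One observation worth making explicit (mine, not the slide's): under the Naïve Bayes factorization P(S | G, D, E) ∝ P(S) P(G | S) P(D | S) P(E | S), only 2n + 1 = 7 of the fourteen quantities listed are free parameters, since each entry in the right-hand column is one minus its left-hand neighbor.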

Naïve Bayes: Subtlety #1

Often the Xi are not really conditionally independent.
We use Naïve Bayes in many cases anyway, and it often works pretty well: it often gives the right classification, even when not the right probability (see [Domingos & Pazzani, 1996]).
What is the effect on the estimated P(Y | X)?
Special case: what if we add two copies: Xi = Xk?

Naïve Bayes: Subtlety #2

If unlucky, our MLE estimate for P(Xi | Y) might be zero (e.g., Xi = Birthday_Is_February_29).
Why worry about just one parameter out of many?
What can be done to avoid this?
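
A single zero matters because one zero factor drives the whole product Π_i P(Xi | Y = y_k) to zero, vetoing that class regardless of the other evidence. One standard remedy, sketched here as a hedged reconstruction of the MAP estimate developed below (smoothing with l "hallucinated" examples of each value):

  P̂(Xi = x_ij | Y = y_k) = (#D{Xi = x_ij ∧ Y = y_k} + l) / (#D{Y = y_k} + l·J)

where J is the number of distinct values Xi can take and l (e.g., l = 1) is the strength of the prior.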

Estimating Parameters

Maximum Likelihood Estimate (MLE): choose θ that maximizes the probability of the observed data.
Maximum a Posteriori (MAP) estimate: choose θ that is most probable given the prior probability and the data.

Summary: Maximum Likelihood Estimate, Bernoulli distribution
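
The defining formulas, which appear as images in the original slides, are standard; as a reconstruction:

  θ̂_MLE = argmax_θ P(D | θ)
  θ̂_MAP = argmax_θ P(θ | D) = argmax_θ P(D | θ) P(θ) / P(D) = argmax_θ P(D | θ) P(θ)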

Beta prior distribution P(θ) [C. Guestrin]
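
The content of these slides is graphical and did not survive the transcription; the standard Beta–Bernoulli facts they illustrate, stated as a hedged reconstruction, are:

  P(θ) = Beta(β_H, β_T) ∝ θ^{β_H − 1} (1 − θ)^{β_T − 1}

and after observing α_H heads and α_T tails the posterior is again a Beta distribution,

  P(θ | D) = Beta(β_H + α_H, β_T + α_T),

whose mode gives the MAP estimate θ̂_MAP = (α_H + β_H − 1) / (α_H + β_H + α_T + β_T − 2).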

Conjugate priors [A. Singh]
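
The definition these slides present (reconstructed here, since the slide body was an image): a prior P(θ) is conjugate to a likelihood P(D | θ) if the posterior P(θ | D) belongs to the same family of distributions as the prior. Beta is conjugate to the Bernoulli/binomial likelihood and Dirichlet to the multinomial, so Bayesian updating reduces to adding observed counts to the prior's pseudo-counts.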

Dirichlet distribution

The number of heads in N flips of a two-sided coin follows a binomial distribution, and Beta is a good prior (the conjugate prior for the binomial).
What if it's not two-sided, but k-sided? The outcome counts follow a multinomial distribution, and the Dirichlet distribution is the conjugate prior.

Estimating Parameters

Maximum Likelihood Estimate (MLE): choose θ that maximizes the probability of the observed data.
Maximum a Posteriori (MAP) estimate: choose θ that is most probable given the prior probability and the data.
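
As a hedged reconstruction of the formula the slide refers to: a Dirichlet prior over the k multinomial parameters θ_1, ..., θ_k (with Σ_i θ_i = 1) has density

  P(θ_1, ..., θ_k) ∝ Π_{i=1..k} θ_i^{β_i − 1},

and after observing α_i outcomes of side i the posterior is Dirichlet(β_1 + α_1, ..., β_k + α_k), directly generalizing the Beta/binomial case above.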

Probability & Stats: You should know

Probability basics: random variables, events, sample space, conditional probabilities, independence of random variables
Bayes rule
Joint probability distributions: calculating probabilities from the joint distribution
Estimating parameters from data: maximum likelihood estimates (MLE), maximum a posteriori estimates (MAP)
Distributions: binomial, Beta, Dirichlet, conjugate priors

Naïve Bayes Classifier: What you should know

Training and using classifiers based on Bayes rule
Conditional independence: what it is, why it's important
Naïve Bayes: what it is, why we use it so much, training using MLE and MAP estimates
Next: discrete variables and continuous (Gaussian)

Questions to consider:

What error will the classifier achieve if the Naïve Bayes assumption is satisfied and we have infinite training data?
Can you use Naïve Bayes for a combination of discrete and real-valued Xi?
How can we extend Naïve Bayes if just 2 of the n Xi are dependent?
What does the decision surface of a Naïve Bayes classifier look like?