Estimating Parameters

Machine Learning 10-601
Tom M. Mitchell, Machine Learning Department, Carnegie Mellon University
September 13, 2012

Today: Bayes classifiers, Naïve Bayes, Gaussian Naïve Bayes
Readings: Mitchell, "Naïve Bayes and Logistic Regression" (available on class website)

Estimating Parameters
Maximum Likelihood Estimate (MLE): choose the θ that maximizes the probability of the observed data D:
θ_MLE = argmax_θ P(D | θ)
Maximum a Posteriori (MAP) estimate: choose the θ that is most probable given the prior probability and the data:
θ_MAP = argmax_θ P(θ | D) = argmax_θ P(D | θ) P(θ)
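
As a concrete illustration (not from the slides): a minimal Python sketch that finds the MLE and MAP estimates of a coin's heads probability θ by brute-force search over a grid, assuming a Beta(2, 2) prior for the MAP case. The data and prior here are made up for illustration.

```python
import numpy as np

# Hypothetical data: 8 heads out of 10 flips.
heads, n = 8, 10

# Grid of candidate parameter values.
thetas = np.linspace(0.001, 0.999, 999)

# Likelihood of the observed data under each theta (Binomial, constant dropped).
likelihood = thetas**heads * (1 - thetas)**(n - heads)

# Assumed Beta(2, 2) prior density (up to a constant).
prior = thetas * (1 - thetas)

theta_mle = thetas[np.argmax(likelihood)]          # maximizes P(D | theta)
theta_map = thetas[np.argmax(likelihood * prior)]  # maximizes P(D | theta) P(theta)

print(f"MLE: {theta_mle:.3f}")   # ~0.8  = 8/10
print(f"MAP: {theta_map:.3f}")   # ~0.75 = (8+1)/(10+2), pulled toward the prior
```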

Conjugate priors [A. Singh]
(Tables of conjugate prior/likelihood pairs: e.g., the Beta distribution is the conjugate prior for a Bernoulli/Binomial likelihood, and the Dirichlet is the conjugate prior for a Multinomial likelihood, so the posterior has the same form as the prior.)
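
To make the conjugacy idea concrete (my own sketch, not from the slides): with a Beta(a, b) prior on a coin's heads probability and Bernoulli/Binomial data, the posterior is again a Beta, so the update just adds the observed counts to the hyperparameters.

```python
# Conjugate Beta-Bernoulli update: a sketch with made-up numbers.
a, b = 2.0, 2.0          # assumed Beta(a, b) prior hyperparameters
heads, tails = 8, 2      # hypothetical observed coin flips

# Posterior is Beta(a + heads, b + tails) -- same family as the prior.
a_post, b_post = a + heads, b + tails

posterior_mean = a_post / (a_post + b_post)
posterior_mode = (a_post - 1) / (a_post + b_post - 2)   # the MAP estimate

print(f"Posterior: Beta({a_post:.0f}, {b_post:.0f})")
print(f"Posterior mean: {posterior_mean:.3f}, MAP: {posterior_mode:.3f}")
```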

Conjugate priors [A. Singh]

Let's learn classifiers by learning P(Y | X)
Consider Y = Wealth, X = <Gender, HoursWorked>

Gender  HrsWorked  P(rich | G,HW)  P(poor | G,HW)
F       <40.5      .09             .91
F       >40.5      .21             .79
M       <40.5      .23             .77
M       >40.5      .38             .62
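
A quick sketch (my own, with fabricated example records) of what "learning P(Y | X) directly" means here: count how often each Y value co-occurs with each setting of X.

```python
from collections import Counter

# Hypothetical training records: (gender, works_over_40_5_hours, wealth).
data = [
    ("F", False, "poor"), ("F", False, "rich"), ("F", True, "poor"),
    ("M", False, "poor"), ("M", True, "rich"), ("M", True, "poor"),
]

# Count occurrences of each (X, Y) combination and of each X setting.
joint = Counter((g, h, w) for g, h, w in data)
x_totals = Counter((g, h) for g, h, _ in data)

def p_wealth_given_x(wealth, gender, hours_over):
    """Empirical estimate of P(Wealth = wealth | Gender, HoursWorked)."""
    return joint[(gender, hours_over, wealth)] / x_totals[(gender, hours_over)]

print(p_wealth_given_x("rich", "F", False))  # 1/2 with this toy data
```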

How many parameters must we estimate?
Suppose X = <X_1, ..., X_n> where the X_i and Y are boolean RVs.
To estimate P(Y | X_1, X_2, ..., X_n) directly, we need one parameter for every possible assignment of <X_1, ..., X_n>: 2^n parameters.
If we have 30 boolean X_i's: P(Y | X_1, X_2, ..., X_30) requires 2^30 (over a billion) parameters.

Bayes Rule
P(Y | X) = P(X | Y) P(Y) / P(X)
Which is shorthand for:
(for all i, j)  P(Y = y_i | X = x_j) = P(X = x_j | Y = y_i) P(Y = y_i) / P(X = x_j)
Equivalently:
P(Y = y_i | X = x_j) = P(X = x_j | Y = y_i) P(Y = y_i) / Σ_k P(X = x_j | Y = y_k) P(Y = y_k)
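
A tiny sketch (made-up numbers, my own) of the "equivalently" form: computing P(Y | X = x) from P(X | Y) and P(Y) by normalizing over the values of Y.

```python
# Assumed class prior and class-conditional likelihoods for one observed x.
prior = {"rich": 0.3, "poor": 0.7}            # P(Y = y)
likelihood = {"rich": 0.10, "poor": 0.02}     # P(X = x | Y = y) for the observed x

# Bayes rule with the denominator expanded as a sum over Y's values.
unnormalized = {y: likelihood[y] * prior[y] for y in prior}
evidence = sum(unnormalized.values())          # P(X = x)
posterior = {y: v / evidence for y, v in unnormalized.items()}

print(posterior)  # {'rich': ~0.682, 'poor': ~0.318}
```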

Can we reduce params using Bayes Rule?
Suppose X = <X_1, ..., X_n> where the X_i and Y are boolean RVs.
By Bayes rule, P(Y | X_1, ..., X_n) ∝ P(X_1, ..., X_n | Y) P(Y), so we need to estimate P(X_1, ..., X_n | Y) and P(Y).

Naïve Bayes
Naïve Bayes assumes
P(X_1, ..., X_n | Y) = Π_i P(X_i | Y)
i.e., that X_i and X_j are conditionally independent given Y, for all i ≠ j.

Conditional Independence
Definition: X is conditionally independent of Y given Z, if the probability distribution governing X is independent of the value of Y, given the value of Z:
(for all i, j, k)  P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)
Which we often write
P(X | Y, Z) = P(X | Z)
E.g., Naïve Bayes uses the assumption that the X_i are conditionally independent, given Y. Given this assumption, then:
P(X_1, X_2 | Y) = P(X_1 | X_2, Y) P(X_2 | Y) = P(X_1 | Y) P(X_2 | Y)
and in general:
P(X_1, ..., X_n | Y) = Π_i P(X_i | Y)
How many parameters to describe P(X_1, ..., X_n | Y)? P(Y)? (For boolean X_i and Y; see the parameter-count sketch below.)
Without the conditional independence assumption: 2(2^n − 1) parameters for P(X_1, ..., X_n | Y), plus 1 for P(Y).
With the conditional independence assumption: 2n parameters for P(X_1, ..., X_n | Y), plus 1 for P(Y).
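
A small sketch (mine, not from the slides) that just computes the two parameter counts for boolean X_1, ..., X_n and boolean Y.

```python
def naive_bayes_param_counts(n):
    """Parameters needed to describe P(X_1..X_n | Y) for boolean X_i and Y."""
    without_ci = 2 * (2**n - 1)   # a full joint over X per class, minus sum-to-1 constraints
    with_ci = 2 * n               # one Bernoulli parameter per feature per class
    return without_ci, with_ci

for n in (2, 10, 30):
    full, nb = naive_bayes_param_counts(n)
    print(f"n={n:2d}: without CI assumption {full:,} vs. with CI assumption {nb}")
```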

Naïve Bayes in a Nutshell
Bayes rule:
P(Y = y_k | X_1, ..., X_n) = P(Y = y_k) P(X_1, ..., X_n | Y = y_k) / Σ_j P(Y = y_j) P(X_1, ..., X_n | Y = y_j)
Assuming conditional independence among the X_i's:
P(Y = y_k | X_1, ..., X_n) = P(Y = y_k) Π_i P(X_i | Y = y_k) / Σ_j P(Y = y_j) Π_i P(X_i | Y = y_j)
So, to pick the most probable Y for X^new = <X_1, ..., X_n>:
Y^new ← argmax_{y_k} P(Y = y_k) Π_i P(X_i^new | Y = y_k)

Naïve Bayes Algorithm, discrete X_i
Train_Naïve_Bayes(examples):
  for each* value y_k, estimate π_k ≡ P(Y = y_k)
  for each* value x_ij of each attribute X_i, estimate θ_ijk ≡ P(X_i = x_ij | Y = y_k)
Classify(X^new):
  Y^new ← argmax_{y_k} P(Y = y_k) Π_i P(X_i^new | Y = y_k)
* probabilities must sum to 1, so we need to estimate only n−1 of these...
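
A minimal, self-contained sketch of the algorithm above for discrete attributes (my own code, using plain MLE counting with no smoothing yet; smoothing is discussed a few slides later). Variable names and the toy data are mine, not the lecture's.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (x, y) with x a tuple of discrete attribute values.
    Returns MLE estimates of P(Y=y_k) and P(X_i=x_ij | Y=y_k)."""
    class_counts = Counter(y for _, y in examples)
    cond_counts = defaultdict(Counter)            # (i, y) -> counts of X_i values
    for x, y in examples:
        for i, xi in enumerate(x):
            cond_counts[(i, y)][xi] += 1

    n = len(examples)
    prior = {y: c / n for y, c in class_counts.items()}
    cond = {key: {v: c / sum(cnt.values()) for v, c in cnt.items()}
            for key, cnt in cond_counts.items()}
    return prior, cond

def classify(x_new, prior, cond):
    """Return argmax_y P(Y=y) * prod_i P(X_i = x_new[i] | Y=y)."""
    def score(y):
        p = prior[y]
        for i, xi in enumerate(x_new):
            p *= cond[(i, y)].get(xi, 0.0)        # unseen value -> probability 0 under MLE
        return p
    return max(prior, key=score)

# Toy usage with hypothetical boolean attributes <G, D, E> and label S.
examples = [((1, 1, 0), 1), ((1, 0, 0), 1), ((0, 1, 1), 0), ((0, 0, 1), 0)]
prior, cond = train_naive_bayes(examples)
print(classify((1, 0, 0), prior, cond))           # prints 1 for this toy data
```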

Estimating Parameters: Y, X_i discrete-valued
Maximum likelihood estimates (MLEs):
π_k ≡ P(Y = y_k), estimated as #D{Y = y_k} / |D|
θ_ijk ≡ P(X_i = x_ij | Y = y_k), estimated as #D{X_i = x_ij ∧ Y = y_k} / #D{Y = y_k}
where #D{x} denotes the number of items in dataset D for which condition x holds (e.g., #D{Y = y_k} is the number of items with Y = y_k).

Example: Live in Sq Hill? P(S | G, D, E)
S = 1 iff live in Squirrel Hill
G = 1 iff shop at SH Giant Eagle
D = 1 iff drive to CMU
E = 1 iff even # of letters in last name
What probability parameters must we estimate?

Example: Live in Sq Hill? P(S | G, D, E)
S = 1 iff live in Squirrel Hill
G = 1 iff shop at SH Giant Eagle
D = 1 iff drive or carpool to CMU
E = 1 iff even # letters in last name

Parameters to estimate (each right-hand entry is determined by its left-hand partner, since the pair sums to 1):
P(S=1):          P(S=0):
P(D=1 | S=1):    P(D=0 | S=1):
P(D=1 | S=0):    P(D=0 | S=0):
P(G=1 | S=1):    P(G=0 | S=1):
P(G=1 | S=0):    P(G=0 | S=0):
P(E=1 | S=1):    P(E=0 | S=1):
P(E=1 | S=0):    P(E=0 | S=0):

Naïve Bayes: Subtlety #1
If unlucky, our MLE estimate for P(X_i | Y) might be zero (e.g., X_i = Birthday_Is_January_30_1990).
Why worry about just one parameter out of many? Because a single zero factor drives the whole product Π_i P(X_i | Y) to zero, no matter what the other attributes say.
What can be done to avoid this? Use MAP estimates with priors ("imaginary" examples), as on the next slide.

Estimating Parameters (recap)
Maximum Likelihood Estimate (MLE): choose the θ that maximizes the probability of the observed data.
Maximum a Posteriori (MAP) estimate: choose the θ that is most probable given the prior probability and the data.

Estimating Parameters: Y, X_i discrete-valued
Maximum likelihood estimates:
P(Y = y_k) estimated as #D{Y = y_k} / |D|
P(X_i = x_ij | Y = y_k) estimated as #D{X_i = x_ij ∧ Y = y_k} / #D{Y = y_k}
MAP estimates (Beta, Dirichlet priors): with Dirichlet hyperparameters β, these take the form
P(Y = y_k) estimated as (#D{Y = y_k} + (β_k − 1)) / (|D| + Σ_m (β_m − 1))
P(X_i = x_ij | Y = y_k) estimated as (#D{X_i = x_ij ∧ Y = y_k} + (β_ij − 1)) / (#D{Y = y_k} + Σ_m (β_im − 1))
Only difference: imaginary examples added to the observed counts.
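
A sketch (mine) of the "imaginary examples" idea for the discrete Naïve Bayes parameters: add a pseudo-count to every count before normalizing, so no estimate is exactly zero. The pseudo-count value used here (1.0, i.e., Laplace-style smoothing) is just one illustrative choice of prior.

```python
from collections import Counter

def smoothed_conditional(counts, all_values, pseudo=1.0):
    """MAP-style estimate of P(X_i = v | Y = y) from raw counts for one (i, y):
    (count(v) + pseudo) / (total + pseudo * num_values)."""
    total = sum(counts.values()) + pseudo * len(all_values)
    return {v: (counts.get(v, 0) + pseudo) / total for v in all_values}

# Hypothetical counts of attribute X_i's values among training examples with Y = y_k.
counts = Counter({"a": 9, "b": 1})        # value "c" never observed with this class
print(smoothed_conditional(counts, ["a", "b", "c"]))
# {'a': ~0.769, 'b': ~0.154, 'c': ~0.077} -- no zero estimates
```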

Naïve Bayes: Subtlety #2
Often the X_i are not really conditionally independent. We use Naïve Bayes in many cases anyway, and it often works pretty well: it often gives the right classification, even when not the right probability (see [Domingos & Pazzani, 1996]).
What is the effect on the estimated P(Y | X)?
Special case: what if we add two copies, X_i = X_k? The duplicated evidence is counted twice (its factor enters the product squared), so the estimated posterior becomes over-confident, pushed toward 0 or 1, even though the true posterior is unchanged.
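
A numeric illustration (made-up probabilities, my own sketch) of how duplicating a feature makes the Naïve Bayes posterior over-confident.

```python
def nb_posterior(prior_pos, likelihood_pos, likelihood_neg):
    """P(Y=1 | evidence) under Naive Bayes with the given per-feature factors."""
    pos = prior_pos
    neg = 1.0 - prior_pos
    for lp, ln in zip(likelihood_pos, likelihood_neg):
        pos *= lp
        neg *= ln
    return pos / (pos + neg)

# One informative feature observed: P(x|Y=1)=0.8, P(x|Y=0)=0.3, uniform prior.
print(nb_posterior(0.5, [0.8], [0.3]))             # ~0.727
# Same feature accidentally included twice: posterior jumps to ~0.877,
# although no new evidence has actually been observed.
print(nb_posterior(0.5, [0.8, 0.8], [0.3, 0.3]))   # ~0.877
```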

Learning to classify text documents
Classify which emails are spam?
Classify which emails promise an attachment?
Classify which web pages are student home pages?
How shall we represent text documents for Naïve Bayes?

Baseline: Bag of Words approach: represent each document by the count of each vocabulary word, e.g.:
aardvark 0
about 2
all 2
Africa 1
apple 0
anxious 0
...
gas 1
...
oil 1
...
Zaire 0
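
A short sketch (mine) of turning raw text into the bag-of-words counts shown above.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split into word tokens, and count occurrences."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

doc = "Oil prices in Africa rose again; analysts talk about gas and about oil."
counts = bag_of_words(doc)
print(counts["about"], counts["oil"], counts["aardvark"])  # 2 2 0
```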

Learning to classify documents: P(Y | X), the Bag of Words model
Y is discrete valued, e.g., Spam or not.
X = <X_1, X_2, ..., X_n> = document, where X_i is a random variable describing the word at position i in the document.
Possible values for X_i: any word w_k in English.
Document = bag of words: the vector of counts for all w_k's (like #heads, #tails, but we have more than 2 values).

Naïve Bayes Algorithm, discrete X_i
Train_Naïve_Bayes(examples):
  for each value y_k, estimate π_k ≡ P(Y = y_k)
  for each value x_j of each attribute X_i, estimate θ_jk ≡ P(X_i = x_j | Y = y_k), the prob that word x_j appears in position i, given Y = y_k*
Classify(X^new):
  Y^new ← argmax_{y_k} P(Y = y_k) Π_i P(X_i^new | Y = y_k)
* Additional assumption: word probabilities are position independent, so θ_jk is shared across all positions i.

MAP estimates for bag of words
MAP estimate for the multinomial: add the Dirichlet prior's imaginary counts to each word count before normalizing,
θ_jk = P(X_i = w_j | Y = y_k) estimated as (count of word w_j in documents of class y_k + (β_j − 1)) / (total word count in documents of class y_k + Σ_m (β_m − 1))
What β's should we choose?
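
A compact sketch (my own code, not the lecture's) of a bag-of-words Naïve Bayes text classifier in which a pseudo-count beta plays the role of the Dirichlet prior's imaginary examples; beta = 1.0 here is just one common choice.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def train_text_nb(docs, labels, beta=1.0):
    """Estimate log P(Y) and smoothed log P(word | Y) from labeled documents."""
    vocab = {w for d in docs for w in tokenize(d)}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)              # class -> word counts
    for doc, y in zip(docs, labels):
        word_counts[y].update(tokenize(doc))

    log_prior = {y: math.log(c / len(docs)) for y, c in class_docs.items()}
    log_cond = {}
    for y, counts in word_counts.items():
        total = sum(counts.values()) + beta * len(vocab)
        log_cond[y] = {w: math.log((counts[w] + beta) / total) for w in vocab}
    return log_prior, log_cond, vocab

def classify_text(doc, log_prior, log_cond, vocab):
    scores = {}
    for y in log_prior:
        s = log_prior[y]
        for w in tokenize(doc):
            if w in vocab:                          # ignore words never seen in training
                s += log_cond[y][w]
        scores[y] = s
    return max(scores, key=scores.get)

# Tiny hypothetical training set.
docs = ["cheap pills buy now", "meeting agenda attached",
        "buy cheap watches now", "lunch meeting today"]
labels = ["spam", "ham", "spam", "ham"]
model = train_text_nb(docs, labels)
print(classify_text("buy pills now", *model))       # expected: spam
```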

For code and data, see www.cs.cmu.edu/~tom/mlbook.html (click on "Software and Data").

What you should know:
Training and using classifiers based on Bayes rule
Conditional independence: what it is, and why it's important
Naïve Bayes: what it is, and why we use it so much
Training using MLE and MAP estimates
Discrete variables and continuous (Gaussian) variables

Questions:
What error will the classifier achieve if the Naïve Bayes assumption is satisfied and we have infinite training data?
Can you use Naïve Bayes for a combination of discrete and real-valued X_i?
How can we extend Naïve Bayes if just 2 of the n X_i are dependent?
What does the decision surface of a Naïve Bayes classifier look like?