MLE/MAP + Naïve Bayes


10-601 Introduction to Machine Learning
Machine Learning Department, School of Computer Science, Carnegie Mellon University
MLE/MAP + Naïve Bayes
Matt Gormley
Lecture 19, March 20, 2018 1

Midterm Exam Reminders
- Thursday evening, 6:30-9:00 (2.5 hours)
- Room and seat assignments will be announced on Piazza
- You may bring one 8.5 x 11 cheat sheet 2

Outline
- Generating Data: natural (stochastic) data; synthetic data; why synthetic data?; examples: Multinomial, Bernoulli, Gaussian
- Data Likelihood: independent and identically distributed (i.i.d.) data; example: dice rolls
- Learning from Data (Frequentist): principle of maximum likelihood estimation (MLE); optimization for MLE; examples: 1D and 2D optimization; example: MLE of a Multinomial; aside: method of Lagrange multipliers
- Learning from Data (Bayesian): maximum a posteriori (MAP) estimation; optimization for MAP; example: MAP of Bernoulli-Beta 3

Whiteboard: Generating Data
- Natural (stochastic) data
- Synthetic data
- Why synthetic data?
- Examples: Multinomial, Bernoulli, Gaussian 4

In-Class Exercise
1. With your neighbor, write a function which returns samples from a Categorical distribution. Assume access to the rand() function. The function signature should be categorical_sample(theta), where theta is the array of parameters. Make your implementation as efficient as possible! (One possible sketch appears below.)
2. What is the expected runtime of your function? 5
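One possible solution, sketched in Python below, draws a single uniform value and walks the cumulative sum of theta until it exceeds that value (inverse-CDF sampling). This is only an illustrative sketch, not the solution given in lecture; Python's random.random() stands in for the rand() call assumed by the exercise.

```python
import random

def categorical_sample(theta):
    """Return an index k drawn with probability theta[k].

    theta is assumed to be a sequence of non-negative parameters summing to 1.
    """
    u = random.random()       # stand-in for rand(): uniform draw on [0, 1)
    cumulative = 0.0
    for k, p in enumerate(theta):
        cumulative += p
        if u < cumulative:
            return k
    return len(theta) - 1     # guard against floating-point round-off

# Example usage: roll a loaded 3-sided die five times
print([categorical_sample([0.5, 0.3, 0.2]) for _ in range(5)])
```

Each draw from this sketch costs O(K) time in the worst case for K categories; with O(K) preprocessing (e.g., the alias method) the per-draw cost can be reduced to O(1).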

Whiteboard: Data Likelihood
- Independent and identically distributed (i.i.d.) data
- Example: dice rolls 6
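As a concrete instance of the dice-roll example (a sketch in notation assumed here, not a transcript of the whiteboard): if N i.i.d. rolls x^{(1)}, ..., x^{(N)} come from a 6-sided die with parameters \theta = (\theta_1, ..., \theta_6), the data likelihood factorizes as

L(\theta) = \prod_{i=1}^{N} p(x^{(i)} \mid \theta) = \prod_{k=1}^{6} \theta_k^{N_k}, \quad N_k = \#\{ i : x^{(i)} = k \},

so the log-likelihood is simply \ell(\theta) = \sum_{k=1}^{6} N_k \log \theta_k.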

Learning from Data (Frequentist), Whiteboard
- Principle of Maximum Likelihood Estimation (MLE)
- Optimization for MLE
- Examples: 1D and 2D optimization
- Example: MLE of a Multinomial
- Aside: method of Lagrange multipliers 7
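A condensed version of the multinomial MLE via Lagrange multipliers (the whiteboard works this out in full; this is a standard sketch using the dice-roll notation above): maximize \ell(\theta) = \sum_k N_k \log \theta_k subject to \sum_k \theta_k = 1. The Lagrangian is

\mathcal{L}(\theta, \lambda) = \sum_k N_k \log \theta_k + \lambda \left( 1 - \sum_k \theta_k \right).

Setting \partial \mathcal{L} / \partial \theta_k = N_k / \theta_k - \lambda = 0 gives \theta_k = N_k / \lambda, and the constraint forces \lambda = \sum_k N_k = N, so the MLE is the empirical fraction \hat{\theta}_k = N_k / N.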

Learning from Data (Bayesian), Whiteboard
- Maximum a posteriori (MAP) estimation
- Optimization for MAP
- Example: MAP of Bernoulli-Beta 8
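For the Bernoulli-Beta example, the MAP estimate also has a closed form (a standard result, stated here assuming a Beta(\alpha, \beta) prior): with N flips of which N_1 are heads,

p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta) \propto \theta^{N_1 + \alpha - 1} (1 - \theta)^{N - N_1 + \beta - 1},

and maximizing the log-posterior gives

\hat{\theta}_{MAP} = \frac{N_1 + \alpha - 1}{N + \alpha + \beta - 2},

which reduces to the MLE N_1 / N when \alpha = \beta = 1 (a uniform prior).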

Takeaways
- One view of what ML is trying to accomplish is function approximation
- The principle of maximum likelihood estimation provides an alternate view of learning
- Synthetic data can help debug ML algorithms
- Probability distributions can be used to model real data that occurs in the world (don't worry, we'll make our distributions more interesting soon!) 9

Naïve Bayes Outline
- Probabilistic (Generative) View of Classification: decision rule for a probability model
- Real-world Dataset: Economist vs. Onion articles; document → bag-of-words → binary feature vector
- Naive Bayes Model: generating synthetic "labeled documents"; definition of the model; the Naive Bayes assumption; counting the number of parameters with / without the NB assumption
- Naïve Bayes, Learning from Data: data likelihood; MLE for Naive Bayes; MAP for Naive Bayes
- Visualizing Gaussian Naive Bayes 10

Today's Goal: to define a generative model of emails of two different classes (e.g. spam vs. not spam) 11

Spam vs. News: The Economist, The Onion 12

Whiteboard: Real-world Dataset
- Economist vs. Onion articles
- Document → bag-of-words → binary feature vector 13

Whiteboard: Naive Bayes Model
- Generating synthetic "labeled documents"
- Definition of the model
- The Naive Bayes assumption
- Counting the number of parameters with / without the NB assumption 14

Model 1: Bernoulli Naïve Bayes
Flip a weighted coin. If HEADS, flip each red coin; if TAILS, flip each blue coin. Each red (or blue) coin corresponds to an x_m. Example of data generated this way:

y  x_1  x_2  x_3  ...  x_M
0   1    0    1   ...   1
1   0    1    0   ...   1
1   1    1    1   ...   1
0   0    0    1   ...   1
0   1    0    1   ...   0
1   1    0    1   ...   0

We can generate data in this fashion, though in practice we never would, since our data is given. Instead, this provides an explanation of how the data was generated (albeit a terrible one). 15
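The coin-flipping story above is just ancestral sampling from the Bernoulli Naïve Bayes model. Below is a minimal sketch in Python with made-up parameter values (phi for the label coin and one array of feature probabilities per class standing in for the red and blue coins); these numbers are assumptions for illustration, not values from the lecture.

```python
import random

# Hypothetical parameters for M = 4 binary features.
phi = 0.6                          # P(y = 1): the weighted coin
theta = {
    1: [0.9, 0.2, 0.8, 0.5],       # "red" coins:  P(x_m = 1 | y = 1)
    0: [0.1, 0.7, 0.3, 0.5],       # "blue" coins: P(x_m = 1 | y = 0)
}

def generate_example():
    """Ancestral sampling: draw y, then draw each x_m independently given y."""
    y = 1 if random.random() < phi else 0
    x = [1 if random.random() < p else 0 for p in theta[y]]
    return y, x

for _ in range(5):
    print(generate_example())
```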

Whiteboard: Naive Bayes Model (continued)
- Generating synthetic "labeled documents"
- Definition of the model
- The Naive Bayes assumption
- Counting the number of parameters with / without the NB assumption 16

What's wrong with the Naïve Bayes Assumption? The features might not be independent!
Example 1: If a document contains the word "Donald", it's extremely likely to contain the word "Trump". These are not independent!
Example 2: If the petal width is very high, the petal length is also likely to be very high. 17

Naïve Bayes: Learning from Data, Whiteboard
- Data likelihood
- MLE for Naive Bayes
- MAP for Naive Bayes 18
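For Bernoulli Naïve Bayes the whiteboard derivations again yield closed-form estimates; the expressions below are a standard sketch in notation assumed here (N training pairs (x^{(i)}, y^{(i)}), M binary features), with a Beta(\alpha, \beta) prior used for the MAP case.

MLE:
\hat{\phi} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[y^{(i)} = 1], \qquad
\hat{\theta}_{m,c} = \frac{\sum_{i=1}^{N} \mathbb{1}[x_m^{(i)} = 1,\, y^{(i)} = c]}{\sum_{i=1}^{N} \mathbb{1}[y^{(i)} = c]}.

MAP (Beta(\alpha, \beta) prior on each \theta_{m,c}):
\hat{\theta}_{m,c}^{MAP} = \frac{(\alpha - 1) + \sum_i \mathbb{1}[x_m^{(i)} = 1,\, y^{(i)} = c]}{(\alpha + \beta - 2) + \sum_i \mathbb{1}[y^{(i)} = c]},

which acts like adding pseudo-counts and avoids zero estimates for features never observed with a class (e.g., \alpha = \beta = 2 gives add-one smoothing).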

VISUALIZING NAÏVE BAYES Slides in this section from William Cohen (10-601B, Spring 2016) 19

Fisher Iris Dataset
Fisher (1936) used 150 measurements of flowers from 3 different species: Iris setosa (0), Iris virginica (1), Iris versicolor (2), collected by Anderson (1936).

Species  Sepal Length  Sepal Width  Petal Length  Petal Width
0        4.3           3.0          1.1           0.1
0        4.9           3.6          1.4           0.1
0        5.3           3.7          1.5           0.2
1        4.9           2.4          3.3           1.0
1        5.7           2.8          4.1           1.3
1        6.3           3.3          4.7           1.6
1        6.7           3.0          5.0           1.7

Full dataset: https://en.wikipedia.org/wiki/iris_flower_data_set 21
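The Iris figures on the following slides are from William Cohen's plots. For readers who want to reproduce something similar, here is a minimal sketch using scikit-learn's GaussianNB; scikit-learn is an assumption here, not the tool used to produce the original figures.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X = iris.data[:, 2:4]    # petal length and petal width only, for 2-D plots
y = iris.target          # three classes: setosa, versicolor, virginica

model = GaussianNB()
model.fit(X, y)

print(model.theta_)                 # per-class feature means (MLE)
print(model.var_)                   # per-class feature variances ("sigma_" in old releases)
print(model.predict_proba(X[:5]))   # posteriors p(y | x) for the first five flowers
```

Note that GaussianNB fits a separate variance per class and feature, i.e. the "sigma not shared" setting shown in the plots; the shared-sigma variant (the case with a linear boundary) would need a small amount of extra code.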

Slide from William Cohen

Slide from William Cohen

Plot the difference of the probabilities: the z-axis is the difference of the posterior probabilities, p(y=1|x) - p(y=0|x). Slide from William Cohen

Question: what does the boundary between positive and negative look like for Naïve Bayes? Slide from William Cohen (10-601B, Spring 2016)

Iris Data (2 classes) 26

Iris Data (sigma not shared) 27

Iris Data (sigma=1) 28

Iris Data (3 classes) 29

Iris Data (sigma not shared) 30

Iris Data (sigma=1) 31

Naïve Bayes has a linear decision boundary (if sigma is shared across classes) Slide from William Cohen (10-601B, Spring 2016)
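A quick way to see why sharing the variances gives a linear boundary (a standard argument sketched here, not a transcript of the slide): with Gaussian class-conditionals p(x_m | y) = \mathcal{N}(x_m; \mu_{m,y}, \sigma_m^2) that share \sigma_m^2 across classes, the quadratic terms cancel in the log posterior odds:

\log \frac{p(y=1 \mid \mathbf{x})}{p(y=0 \mid \mathbf{x})}
= \log \frac{p(y=1)}{p(y=0)} + \sum_{m=1}^{M} \left[ \log \mathcal{N}(x_m; \mu_{m,1}, \sigma_m^2) - \log \mathcal{N}(x_m; \mu_{m,0}, \sigma_m^2) \right]
= b + \sum_{m=1}^{M} w_m x_m,

with w_m = (\mu_{m,1} - \mu_{m,0}) / \sigma_m^2 and b collecting all terms that do not depend on \mathbf{x}. The x_m^2 terms cancel only because \sigma_m^2 is shared, which is exactly the condition stated on the slide.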

Figure from William Cohen (10-601B, Spring 2016)

Figure from William Cohen (10-601B, Spring 2016). Why don't we drop the generative model and try to learn this hyperplane directly?

Beyond the Scope of this Lecture
- Multinomial Naïve Bayes can be used for integer features
- Multi-class Naïve Bayes can be used if your classification problem has > 2 classes 35

Summary
1. Naïve Bayes provides a framework for generative modeling
2. Choose p(x_m | y) appropriate to the data (e.g. Bernoulli for binary features, Gaussian for continuous features)
3. Train by MLE or MAP
4. Classify by maximizing the posterior 36
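Step 4 written out (a restatement of the decision rule in the notation used above, not new material): the predicted label is the one with the largest posterior, which by Bayes' rule and the Naïve Bayes assumption is

\hat{y} = \operatorname*{argmax}_{y}\; p(y \mid \mathbf{x}) = \operatorname*{argmax}_{y}\; p(y) \prod_{m=1}^{M} p(x_m \mid y),

since the evidence p(\mathbf{x}) does not depend on y and can be dropped from the argmax.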