Introduction to Machine Learning CMU-10701

2. MLE, MAP, Bayes Classification
Barnabás Póczos & Aarti Singh
Spring 2014

Administration
http://www.cs.cmu.edu/~aarti/class/10701_spring14/index.html
- Blackboard manager & peer grading: Dani
- Webpage manager and Autolab: Pulkit
- Camera man: Pengtao
- Homework manager: Jit
- Piazza manager: Prashant
- Recitation: Wean 7500, 6-7pm on Wednesdays

Outline
Theory:
- Probabilities: dependence, independence, conditional independence
- Parameter estimation: Maximum Likelihood Estimation (MLE), Maximum a Posteriori (MAP) estimation
- Bayes rule
- Naïve Bayes classifier
Applications:
- Naïve Bayes classifier for spam filtering
- Mind reading = fMRI data processing

Independence

Independence
Independent random variables: X and Y don't contain information about each other.
- Observing Y doesn't help predict X.
- Observing X doesn't help predict Y.
Examples:
- Independent: winning at roulette this week and next week.
- Dependent: Russian roulette.

Dependent / Independent
[Figure: two scatter plots of (X, Y) — one where X and Y are independent, one where they are dependent.]

Conditionally Independent
Conditionally independent: knowing Z makes X and Y independent.
Examples:
- Dependent: shoe size and reading skills.
- Conditionally independent: shoe size and reading skills given age.
Storks deliver babies: a highly statistically significant correlation exists between stork populations and human birth rates across Europe.

Conditionally Independent
London taxi drivers: a survey pointed out a positive and significant correlation between the number of accidents and wearing coats. It was concluded that coats could hinder drivers' movements and be the cause of accidents, and a new law was prepared to prohibit drivers from wearing coats when driving. Finally, another study pointed out that people wear coats when it rains…

Correlation ≠ Causation
[xkcd.com comic on correlation and causation]

Conditional Independence
Formally, X is conditionally independent of Y given Z when the definition below holds (an equivalent product form is also given).
Note: this does NOT mean Thunder is independent of Rain. But given Lightning, knowing Rain doesn't give more information about Thunder.
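In standard notation, the definition referred to above is

    P(X \mid Y, Z) = P(X \mid Z)

which is equivalent to

    P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)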

Conditional vs. Marginal Independence
C calls A and B separately and tells them a number n ∈ {1, …, 10}. Due to noise on the phone, A and B each imperfectly (and independently) draw a conclusion about what the number was: A thinks the number was n_a and B thinks it was n_b.
Are n_a and n_b marginally independent?
- No: we expect, e.g., P(n_a = 1 | n_b = 1) > P(n_a = 1).
Are n_a and n_b conditionally independent given n?
- Yes: if we know the true number, the outcomes n_a and n_b are determined purely by the noise in each phone, so P(n_a = 1 | n_b = 1, n = 2) = P(n_a = 1 | n = 2).
A quick simulation of this example is sketched below.
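A minimal simulation of the phone-call example (an illustrative sketch, not part of the original slides; the ±1 noise model is an assumption made only to make the point concrete):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 1_000_000

    n = rng.integers(1, 11, size=N)  # the number C announces, uniform on {1,...,10}
    # Assumed noise model: each listener mishears by -1, 0, or +1 with equal probability
    n_a = np.clip(n + rng.integers(-1, 2, size=N), 1, 10)  # A's noisy conclusion
    n_b = np.clip(n + rng.integers(-1, 2, size=N), 1, 10)  # B's noisy conclusion (independent noise)

    # Marginally dependent: conditioning on n_b = 1 raises the probability that n_a = 1
    print(np.mean(n_a == 1), np.mean(n_a[n_b == 1] == 1))

    # Conditionally independent given n: once n is known, n_b adds nothing
    given = (n == 2)
    print(np.mean(n_a[given] == 1), np.mean(n_a[given & (n_b == 1)] == 1))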

Estimating Probabilities
Our first machine learning problem: parameter estimation (MLE, MAP).

Flipping a Coin
I have a coin; if I flip it, what's the probability that it will land heads up?
Let us flip it a few times to estimate the probability.
The estimated probability is 3/5: the frequency of heads.

Flipping a Coin
The estimated probability is 3/5: the frequency of heads.
Questions:
(1) Why the frequency of heads?
(2) How good is this estimate?
(3) Why is this a machine learning problem?
We are going to answer these questions.

Question (1): Why the frequency of heads?
The frequency of heads is exactly the maximum likelihood estimator for this problem.
MLE has nice properties: a clean interpretation, statistical guarantees, and simplicity.

Maximum Likelihood Estimation

MLE for the Bernoulli Distribution
Data D: the observed sequence of flips, with α_H heads and α_T tails (the notation used on the following slides); P(Heads) = θ, P(Tails) = 1 − θ.
Flips are i.i.d.:
- Independent events
- Identically distributed according to a Bernoulli distribution
MLE: choose the θ that maximizes the probability of the observed data.
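Under this model, the probability of the observed data as a function of θ takes the standard i.i.d. Bernoulli form:

    P(D \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta) = \theta^{\alpha_H} (1 - \theta)^{\alpha_T}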

Maximum Likelihood Estimation
MLE: choose the θ that maximizes the probability of the observed data.
- Independent draws
- Identically distributed

Maximum Likelihood Estimation
MLE: choose the θ that maximizes the probability of the observed data.
That's exactly the frequency of heads (see the derivation below).
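Filling in the standard derivation: maximize the log-likelihood and set its derivative to zero.

    \ln P(D \mid \theta) = \alpha_H \ln \theta + \alpha_T \ln (1 - \theta)

    \frac{d}{d\theta} \ln P(D \mid \theta) = \frac{\alpha_H}{\theta} - \frac{\alpha_T}{1 - \theta} = 0
    \quad \Longrightarrow \quad
    \hat{\theta}_{MLE} = \frac{\alpha_H}{\alpha_H + \alpha_T}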

Question (2): How good is this MLE estimate?

How many flips do I need?
I flipped the coin 5 times: 3 heads, 2 tails.
What if I had flipped 30 heads and 20 tails?
Which estimate should we trust more? The more the merrier?

Simple bound
Let θ* be the true parameter. For n = α_H + α_T flips and any ε > 0, Hoeffding's inequality gives the bound below.
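In its standard form for this setting:

    P\left( \left| \hat{\theta}_{MLE} - \theta^* \right| \geq \varepsilon \right) \leq 2 e^{-2 n \varepsilon^2}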

Probably Approximately Correct (PAC) Learning
I want to know the coin parameter θ within ε = 0.1 error, with probability at least 1 − δ = 0.95. How many flips do I need?
Sample complexity: see the bound worked out below.
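Requiring the right-hand side of Hoeffding's inequality to be at most δ and solving for n gives the standard sample-complexity bound; plugging in the numbers from the slide:

    2 e^{-2 n \varepsilon^2} \leq \delta
    \quad \Longrightarrow \quad
    n \geq \frac{\ln(2/\delta)}{2 \varepsilon^2} = \frac{\ln(2/0.05)}{2 (0.1)^2} \approx 184.4

so roughly 185 flips suffice.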

Question (3): Why is this a machine learning problem?
Following the usual definition of learning, our estimator improves its performance (the accuracy of the predicted probability) at some task (predicting the probability of heads) with experience (the more flips we see, the better we are).

What about continuous features?
Let us try Gaussians.
[Figure: Gaussian densities with mean µ = 0 and variance σ².]

MLE for Gaussian Mean and Variance
Choose the θ = (µ, σ²) that maximizes the probability of the observed data.
- Independent draws
- Identically distributed

MLE for Gaussian Mean and Variance
Note: the MLE for the variance of a Gaussian is biased — the expected result of the estimation is not the true parameter!
An unbiased variance estimator divides by n − 1 instead of n (see below).
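For reference, the standard estimators for data x_1, …, x_n are

    \hat{\mu}_{MLE} = \frac{1}{n} \sum_{i=1}^{n} x_i
    \qquad
    \hat{\sigma}^2_{MLE} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu}_{MLE})^2
    \qquad
    \hat{\sigma}^2_{unbiased} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \hat{\mu}_{MLE})^2

since E[\hat{\sigma}^2_{MLE}] = \frac{n-1}{n} \sigma^2.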

What about prior knowledge? (MAP Estimation)

What about prior knowledge?
We know the coin is close to 50-50. What can we do now?
The Bayesian way: rather than estimating a single θ, we obtain a distribution over possible values of θ.
[Figure: a broad distribution over θ centered at 50-50 before seeing data, and a more concentrated distribution after seeing data.]

Prior Distribution
What prior? What distribution do we want for a prior?
- Represents expert knowledge (the philosophical approach)
- Gives a simple posterior form (the engineer's approach)
Uninformative priors: e.g., the uniform distribution.
Conjugate priors: closed-form representation of the posterior; P(θ) and P(θ | D) have the same form.

In order to proceed we will need: Bayes Rule

Chain Rule & Bayes Rule
The chain rule and Bayes rule are given below. Bayes rule is important for reverse conditioning.
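In their standard forms:

    \text{Chain rule:} \quad P(X, Y) = P(X \mid Y)\, P(Y) = P(Y \mid X)\, P(X)

    \text{Bayes rule:} \quad P(X \mid Y) = \frac{P(Y \mid X)\, P(X)}{P(Y)}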

Bayesian Learning
Use Bayes rule to obtain the posterior over θ (written out below). Equivalently: posterior ∝ likelihood × prior.
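Concretely, applying Bayes rule to the parameter θ and data D:

    P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}
    \quad \text{or equivalently} \quad
    P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)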

MLE vs. MAP
Maximum likelihood estimation (MLE): choose the value that maximizes the probability of the observed data.
Maximum a posteriori (MAP) estimation: choose the value that is most probable given the observed data and the prior belief.
When is MAP the same as MLE?
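Written as optimization problems (standard notation; note that with a flat, uniform prior P(θ) the two estimates coincide, which answers the question on the slide):

    \hat{\theta}_{MLE} = \arg\max_{\theta} P(D \mid \theta)
    \qquad
    \hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} P(D \mid \theta)\, P(\theta)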

MAP Estimation for the Binomial Distribution
Coin flip problem: the likelihood is binomial.
If the prior is a Beta distribution, the posterior is also a Beta distribution (normalized by the Beta function).

MAP Estimation for the Binomial Distribution
Likelihood is binomial; prior is a Beta distribution; the posterior is again a Beta distribution.
P(θ) and P(θ | D) have the same form! [Conjugate prior]
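Spelled out, with α_H, α_T heads and tails in the data and prior hyperparameters denoted β_H, β_T (notation introduced here), this is the standard conjugacy computation:

    P(\theta) = \frac{\theta^{\beta_H - 1} (1-\theta)^{\beta_T - 1}}{B(\beta_H, \beta_T)}
    \qquad
    P(\theta \mid D) \propto \theta^{\alpha_H + \beta_H - 1} (1-\theta)^{\alpha_T + \beta_T - 1}
    \;\sim\; \text{Beta}(\alpha_H + \beta_H,\; \alpha_T + \beta_T)

and the MAP estimate (the posterior mode, for parameters greater than 1) is

    \hat{\theta}_{MAP} = \frac{\alpha_H + \beta_H - 1}{\alpha_H + \beta_H + \alpha_T + \beta_T - 2}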

Beta Distribution
More concentrated as the values of α, β increase.
[Figure: Beta density curves becoming more peaked as α, β grow.]

Beta Conjugate Prior
As n = α_H + α_T increases — as we get more samples — the effect of the prior is washed out. A small numerical sketch follows.
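A minimal numerical sketch of this washing-out effect (using the Beta(β_H, β_T) prior notation from above; the specific prior and data counts are illustrative assumptions, not values from the slides):

    # Illustrative Beta(beta_H, beta_T) prior encoding a belief that the coin is close to 50-50
    beta_H, beta_T = 5.0, 5.0

    def estimates(alpha_H, alpha_T):
        """Return (MLE, MAP, posterior mean) for alpha_H heads and alpha_T tails."""
        a, b = beta_H + alpha_H, beta_T + alpha_T   # Beta posterior parameters
        mle = alpha_H / (alpha_H + alpha_T)         # frequency of heads
        map_est = (a - 1) / (a + b - 2)             # posterior mode (valid for a, b > 1)
        post_mean = a / (a + b)
        return mle, map_est, post_mean

    print(estimates(3, 2))       # 5 flips:   MAP is pulled toward 0.5 by the prior
    print(estimates(300, 200))   # 500 flips: MAP ~ MLE ~ 0.6, the prior is washed out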

From Binomial to Multinomial
Example: the dice roll problem (6 outcomes instead of 2).
The likelihood is Multinomial(θ = {θ_1, θ_2, …, θ_k}).
If the prior is a Dirichlet distribution, then the posterior is a Dirichlet distribution: for the multinomial, the conjugate prior is the Dirichlet distribution.
http://en.wikipedia.org/wiki/dirichlet_distribution
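In the same pattern as the Beta-binomial case (standard result; counts α_1, …, α_k for the k outcomes and Dirichlet hyperparameters β_1, …, β_k, notation introduced here):

    P(\theta) = \text{Dirichlet}(\beta_1, \ldots, \beta_k)
    \quad \Longrightarrow \quad
    P(\theta \mid D) = \text{Dirichlet}(\beta_1 + \alpha_1, \ldots, \beta_k + \alpha_k)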

Conjugate prior for a Gaussian?
- Conjugate prior on the mean (with known variance): Gaussian.
- Conjugate prior on the covariance matrix (with known mean): inverse Wishart.

Bayesians vs. Frequentists
- "You are no good when the sample is small!" (the Bayesian's complaint about frequentist methods)
- "You give a different answer for different priors!" (the frequentist's complaint about Bayesian methods)