Learning with Probabilities

CS194-10 Fall 2011, Lecture 15

Outline

Bayesian learning
  - eliminates arbitrary loss functions and regularizers
  - facilitates incorporation of prior knowledge
  - quantifies hypothesis and prediction uncertainty
  - gives optimal predictions

Maximum a posteriori and maximum likelihood learning

Maximum likelihood parameter learning

Full Bayesian learning

View learning as Bayesian updating of a probability distribution over the hypothesis space:
  - H is the hypothesis variable, with values h_1, h_2, ... and prior P(H)
  - the ith observation x_i gives the outcome of the random variable X_i
  - training data X = x_1, ..., x_N

Given the data so far, each hypothesis has a posterior probability

  P(h_k | X) = α P(X | h_k) P(h_k)

where P(X | h_k) is called the likelihood.

Predictions use a likelihood-weighted average over the hypotheses:

  P(X_{N+1} | X) = Σ_k P(X_{N+1} | X, h_k) P(h_k | X) = Σ_k P(X_{N+1} | h_k) P(h_k | X)

No need to pick one best-guess hypothesis!
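As a minimal sketch (not part of the original slides), the update and prediction formulas can be written directly in Python for a finite hypothesis space; the function names bayesian_update and predictive are illustrative only.

    # Minimal sketch: Bayesian updating and prediction over a finite hypothesis space.

    def bayesian_update(prior, likelihoods):
        """Posterior P(h_k | X) = alpha * P(X | h_k) * P(h_k)."""
        unnorm = [p * l for p, l in zip(prior, likelihoods)]
        alpha = 1.0 / sum(unnorm)      # normalization constant
        return [alpha * u for u in unnorm]

    def predictive(posterior, pred_given_h):
        """P(X_{N+1} | X) = sum_k P(X_{N+1} | h_k) * P(h_k | X)."""
        return sum(q * p for q, p in zip(pred_given_h, posterior))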

Example

Suppose there are five kinds of bags of candies:
  - 10% are h_1: 100% cherry candies
  - 20% are h_2: 75% cherry candies + 25% lime candies
  - 40% are h_3: 50% cherry candies + 50% lime candies
  - 20% are h_4: 25% cherry candies + 75% lime candies
  - 10% are h_5: 100% lime candies

Then we observe candies drawn from some bag:
  - What kind of bag is it?
  - What flavour will the next candy be?

Posterior probability of hypotheses

  P(h_k | X) = α P(X | h_k) P(h_k)

  P(h_1 | 5 limes) = α P(5 limes | h_1) P(h_1) = α × 0.0^5 × 0.1 = 0
  P(h_2 | 5 limes) = α P(5 limes | h_2) P(h_2) = α × 0.25^5 × 0.2 = 0.000195α
  P(h_3 | 5 limes) = α P(5 limes | h_3) P(h_3) = α × 0.5^5 × 0.4 = 0.0125α
  P(h_4 | 5 limes) = α P(5 limes | h_4) P(h_4) = α × 0.75^5 × 0.2 = 0.0475α
  P(h_5 | 5 limes) = α P(5 limes | h_5) P(h_5) = α × 1.0^5 × 0.1 = 0.1α

  α = 1/(0 + 0.000195 + 0.0125 + 0.0475 + 0.1) = 6.2424

  P(h_1 | 5 limes) = 0
  P(h_2 | 5 limes) = 0.00122
  P(h_3 | 5 limes) = 0.07803
  P(h_4 | 5 limes) = 0.29650
  P(h_5 | 5 limes) = 0.62424
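These numbers can be reproduced with a few lines of Python (a check of the arithmetic, not part of the original lecture); small differences in the last digits come from the slide rounding intermediate values.

    # Posterior over the five bag hypotheses after observing 5 lime candies.
    prior = [0.1, 0.2, 0.4, 0.2, 0.1]        # P(h_1), ..., P(h_5)
    p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]     # P(lime | h_k)

    likelihood = [p ** 5 for p in p_lime]    # P(5 limes | h_k)
    unnorm = [l * pr for l, pr in zip(likelihood, prior)]
    alpha = 1.0 / sum(unnorm)
    posterior = [alpha * u for u in unnorm]
    print(posterior)   # ≈ [0.0, 0.0012, 0.0780, 0.2963, 0.6244]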

Posterior probability of hypotheses

[Plot: posterior probabilities P(h_1 | d), ..., P(h_5 | d) on the vertical axis (0 to 1) versus the number of samples in d (0 to 10).]

Prediction probability

  P(X_{N+1} | X) = Σ_k P(X_{N+1} | X, h_k) P(h_k | X) = Σ_k P(X_{N+1} | h_k) P(h_k | X)

  P(lime on 6 | 5 limes)
    = P(lime on 6 | h_1) P(h_1 | 5 limes) + P(lime on 6 | h_2) P(h_2 | 5 limes)
      + P(lime on 6 | h_3) P(h_3 | 5 limes) + P(lime on 6 | h_4) P(h_4 | 5 limes)
      + P(lime on 6 | h_5) P(h_5 | 5 limes)
    = 0 × 0 + 0.25 × 0.00122 + 0.5 × 0.07803 + 0.75 × 0.29650 + 1.0 × 0.62424
    ≈ 0.886
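Continuing the small Python check (again not from the lecture itself), the predictive probability comes out to about 0.886 when computed with unrounded posteriors:

    # Predictive probability that candy 6 is lime, given 5 limes observed.
    prior = [0.1, 0.2, 0.4, 0.2, 0.1]
    p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

    unnorm = [p ** 5 * pr for p, pr in zip(p_lime, prior)]
    posterior = [u / sum(unnorm) for u in unnorm]
    p_next_lime = sum(p * post for p, post in zip(p_lime, posterior))
    print(p_next_lime)   # ≈ 0.886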

Prediction probability

[Plot: P(next candy is lime | d) on the vertical axis (0.4 to 1) versus the number of samples in d (0 to 10).]

Learning from positive examples only

Example from Tenenbaum via Murphy, Ch. 3: given examples of some unknown class, a predefined subset of {1, ..., 100}, output a hypothesis as to what the class is.

E.g., {16, 8, 2, 64}

Viewed as a Boolean classification problem, the simplest consistent solution is "everything".

[This is the basis for Chomsky's "Poverty of the Stimulus" argument, purporting to prove that humans must have innate grammatical knowledge.]

Bayesian counterargument

Assuming the numbers are sampled uniformly from the class:

  P({16, 8, 2, 64} | powers of 2) = 7^{-4} ≈ 4.2 × 10^{-4}
  P({16, 8, 2, 64} | everything) = 100^{-4} = 10^{-8}

This difference far outweighs any reasonable simplicity-based prior.
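A quick numerical check of this "size principle" (added here, not on the slide), assuming each of the four observed numbers is drawn uniformly from the hypothesized set:

    # Likelihoods of {16, 8, 2, 64} under two hypotheses.
    n_powers_of_two = 7        # {1, 2, 4, 8, 16, 32, 64} within 1..100
    n_everything = 100

    lik_powers = (1 / n_powers_of_two) ** 4    # ≈ 4.2e-4
    lik_everything = (1 / n_everything) ** 4   # = 1e-8
    print(lik_powers / lik_everything)         # likelihood ratio ≈ 41,649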

Bayes vs. Humans

[Figure from the original slide not reproduced in the transcription.]

MAP approximation

Summing over the hypothesis space is often intractable (e.g., there are 18,446,744,073,709,551,616 Boolean functions of 6 attributes).

Maximum a posteriori (MAP) learning: choose h_MAP maximizing P(h_k | X),
  i.e., maximize P(X | h_k) P(h_k), or minimize -log P(X | h_k) - log P(h_k)
  ... or, in information-theoretic terms, minimize (bits to encode data given hypothesis) + (bits to encode hypothesis).
This is the basic idea of minimum description length (MDL) learning.

When inputs are fixed (as in science experiments) and a deterministic h predicts the outputs, P(X | h_k) is 1 if h_k is consistent with the data and 0 otherwise, so MAP = simplest consistent hypothesis.
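For the candy-bag example above, MAP learning reduces to an argmax; a minimal sketch, assuming the same priors and per-candy lime probabilities as before:

    import math

    # MAP hypothesis for the candy example after observing 5 limes.
    prior = [0.1, 0.2, 0.4, 0.2, 0.1]
    p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

    def log_posterior_unnorm(k):
        # log P(5 limes | h_k) + log P(h_k); -inf when the likelihood is zero
        if p_lime[k] == 0.0:
            return float("-inf")
        return 5 * math.log(p_lime[k]) + math.log(prior[k])

    h_map = max(range(5), key=log_posterior_unnorm)
    print(h_map + 1)   # 5, i.e. h_5 (the all-lime bag)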

ML approximation

For large data sets, the prior becomes irrelevant.

Maximum likelihood (ML) learning: choose h_ML maximizing P(X | h_k),
  i.e., simply get the best fit to the data; identical to MAP for a uniform prior (which is reasonable if all hypotheses are of the same complexity).

ML is the standard (non-Bayesian) statistical learning method.

A simple generative model: Bernoulli

A generative model is a probability model from which the probability of any observable data set can be derived.
[Usually contrasted with a discriminative or conditional model, which gives only the probability for the output given the observable inputs.]

E.g., the Bernoulli(θ) model:

  P(X_i = 1) = θ,  P(X_i = 0) = 1 - θ,  or  P(X_i = x_i) = θ^{x_i} (1 - θ)^{1 - x_i}

Suppose we get a bag of candy from a new manufacturer; fraction θ of cherry candies.
Any θ is possible: a continuum of hypotheses h_θ.
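A tiny illustrative sketch (not from the slides) of the Bernoulli(θ) model, with a sampling step to make the "generative" part concrete:

    import random

    def bernoulli_prob(x, theta):
        """P(X_i = x) = theta^x * (1 - theta)^(1 - x), for x in {0, 1}."""
        return theta ** x * (1 - theta) ** (1 - x)

    def sample_candies(theta, n):
        """Draw n candies; 1 = cherry (probability theta), 0 = lime."""
        return [1 if random.random() < theta else 0 for _ in range(n)]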

ML estimation of Bernoulli model

Suppose we unwrap N candies, c cherries and l = N - c limes.
These are i.i.d. (independent, identically distributed) observations, so

  P(X | h_θ) = Π_{i=1}^N P(x_i | h_θ) = θ^{Σ_i x_i} (1 - θ)^{N - Σ_i x_i} = θ^c (1 - θ)^l

Maximize this w.r.t. θ, which is easier for the log-likelihood:

  L(X | h_θ) = log P(X | h_θ) = Σ_{i=1}^N log P(x_i | h_θ) = c log θ + l log(1 - θ)

  dL(X | h_θ)/dθ = c/θ - l/(1 - θ) = 0   ⇒   θ = c/(c + l) = c/N

Seems sensible, but causes problems with 0 counts!
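A minimal numeric check of the closed-form estimate θ = c/N against a grid search of the log-likelihood (the synthetic data below is illustrative, not from the lecture):

    import math

    data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]      # 1 = cherry, 0 = lime
    c, N = sum(data), len(data)
    theta_ml = c / N                           # closed-form ML estimate = 0.7

    def log_lik(theta):
        return c * math.log(theta) + (N - c) * math.log(1 - theta)

    grid = [i / 1000 for i in range(1, 1000)]  # avoid theta = 0 or 1
    print(theta_ml, max(grid, key=log_lik))    # 0.7 0.7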

Naive Bayes models

A generative model for discrete (often Boolean) classification problems:
  - each example has a discrete class variable Y_i
  - each example has discrete or continuous attributes X_ij, j = 1, ..., D
  - attributes are conditionally independent given the class value

[Figure: Bayes net with class node Y_i as parent of attribute nodes X_{i,1}, ..., X_{i,D}; parameters P(Y_i = 1) = θ and P(X_ij = 1 | Y_i = y) = θ_{y,j} for y ∈ {0, 1}.]

  P(y_i, x_{i,1}, ..., x_{i,D}) = P(y_i) Π_{j=1}^D P(x_ij | y_i)
                                = θ^{y_i} (1 - θ)^{1 - y_i} Π_{j=1}^D θ_{y_i,j}^{x_ij} (1 - θ_{y_i,j})^{1 - x_ij}
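A small sketch of this joint probability for Boolean attributes (the parameter names theta and theta_yj are illustrative, not from the slides):

    def nb_joint(y, x, theta, theta_yj):
        """P(y, x_1..x_D) = P(y) * prod_j P(x_j | y) for a Boolean naive Bayes model.

        theta    : P(Y = 1)
        theta_yj : theta_yj[y][j] = P(X_j = 1 | Y = y)
        """
        p = theta if y == 1 else 1 - theta
        for j, xj in enumerate(x):
            q = theta_yj[y][j]
            p *= q if xj == 1 else 1 - q
        return p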

ML estimation of Naive Bayes models

Likelihood is

  P(X | h_θ) = Π_{i=1}^N θ^{y_i} (1 - θ)^{1 - y_i} Π_{j=1}^D θ_{y_i,j}^{x_ij} (1 - θ_{y_i,j})^{1 - x_ij}

Log likelihood is

  L = log P(X | h_θ) = Σ_{i=1}^N [ y_i log θ + (1 - y_i) log(1 - θ) + Σ_{j=1}^D ( x_ij log θ_{y_i,j} + (1 - x_ij) log(1 - θ_{y_i,j}) ) ]

This has the parameters in separate terms, so the derivatives are decoupled:

  ∂L/∂θ = Σ_{i=1}^N [ y_i/θ - (1 - y_i)/(1 - θ) ] = N_1/θ - (N - N_1)/(1 - θ)

  ∂L/∂θ_{yj} = Σ_{i: y_i = y} [ x_ij/θ_{yj} - (1 - x_ij)/(1 - θ_{yj}) ] = N_{yj}/θ_{yj} - (N_y - N_{yj})/(1 - θ_{yj})

where N_y = number of examples with class label y, and N_{yj} = number of examples with class label y and value 1 for X_ij.

ML estimation contd.

Setting the derivatives to zero:
  θ = N_1/N, as before
  θ_{yj} = N_{yj}/N_y
i.e., count the fraction of each class with the jth attribute set to 1. O(ND) time to train the model.

Example: 1000 cherry and lime candies, wrapped in red or green wrappers by the Surprise Candy Company:
  - 400 cherry, of which 300 have red wrappers and 100 green wrappers
  - 600 lime, of which 120 have red wrappers and 480 green wrappers

  θ = P(Flavor = cherry) = 400/1000 = 0.40
  θ_{11} = P(Wrapper = red | Flavor = cherry) = 300/400 = 0.75
  θ_{01} = P(Wrapper = red | Flavor = lime) = 120/600 = 0.20
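The same estimates, computed directly from the counts (a small check, with y = 1 for cherry and attribute j = 1 meaning a red wrapper):

    # ML estimates from the Surprise Candy counts (1000 candies).
    N = 1000
    N_cherry, N_cherry_red = 400, 300
    N_lime, N_lime_red = 600, 120

    theta = N_cherry / N                 # P(Flavor = cherry)        = 0.40
    theta_11 = N_cherry_red / N_cherry   # P(Wrapper = red | cherry) = 0.75
    theta_01 = N_lime_red / N_lime       # P(Wrapper = red | lime)   = 0.20
    print(theta, theta_11, theta_01)     # 0.4 0.75 0.2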

Classifying a new example

  P(Y = 1 | x_1, ..., x_D) = α P(x_1, ..., x_D | Y = 1) P(Y = 1)
                           = α θ Π_{j=1}^D θ_{1j}^{x_j} (1 - θ_{1j})^{1 - x_j}

  log P(Y = 1 | x_1, ..., x_D)
    = log α + log θ + Σ_{j=1}^D [ x_j log θ_{1j} + (1 - x_j) log(1 - θ_{1j}) ]
    = ( log α + log θ + Σ_{j=1}^D log(1 - θ_{1j}) ) + Σ_{j=1}^D x_j log(θ_{1j}/(1 - θ_{1j}))

The set of points where P(Y = 1 | x_1, ..., x_D) = P(Y = 0 | x_1, ..., x_D) = 0.5 is a linear separator! (But its location is sensitive to the class prior.)
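Using the candy parameters learned on the previous slide, here is a one-attribute instance of this computation; the numeric result is a worked illustration, not a value stated on the slide:

    # P(Flavor = cherry | Wrapper = red) under the learned parameters.
    theta, theta_11, theta_01 = 0.40, 0.75, 0.20

    p_red_and_cherry = theta * theta_11          # 0.30
    p_red_and_lime = (1 - theta) * theta_01      # 0.12
    p_cherry_given_red = p_red_and_cherry / (p_red_and_cherry + p_red_and_lime)
    print(p_cherry_given_red)                    # ≈ 0.714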

Summary

Full Bayesian learning gives the best possible predictions but is intractable.

MAP learning balances complexity with accuracy on the training data.

Maximum likelihood assumes a uniform prior; OK for large data sets:
  1. Choose a parameterized family of models to describe the data
     (requires substantial insight and sometimes new models).
  2. Write down the likelihood of the data as a function of the parameters
     (may require summing over hidden variables, i.e., inference).
  3. Write down the derivative of the log likelihood w.r.t. each parameter.
  4. Find the parameter values such that the derivatives are zero
     (may be hard/impossible; modern optimization techniques help).

Naive Bayes is a simple generative model with a very fast training method that finds a linear separator in input feature space and provides probabilistic predictions.