CS540 Machine learning L9 Bayesian statistics

Last time: naïve Bayes and the Beta-Bernoulli model.

Outline: Bayesian concept learning; the Beta-Bernoulli model (review); the Dirichlet-multinomial model; credible intervals.

Bayesian concept learning. Based on Josh Tenenbaum's PhD thesis (MIT BCS, 1999).

Concept learning (binary classification) from positive and negative examples.

Concept learning from positive-only examples. How far out should the rectangle (the concept boundary) go? There are no negative examples to act as an upper bound.

Human learning vs. machine learning/statistics. Most ML methods for learning "concepts" such as "dog" require a large number of positive and negative examples, but people can learn from a small number of positive-only examples ("look at the doggy!"). This is called "one-shot learning".

Everyday inductive leaps. How can we learn so much about the meanings of words, the properties of natural kinds, the future outcomes of a dynamic process, the hidden causal properties of an object, the causes of a person's actions (beliefs, goals), the causal laws governing a domain... from such limited data?

The challenge: how do we generalize successfully from very limited data, often just one or a few examples, and often only positive examples? Philosophy: induction has been called a problem, a riddle, a paradox, a scandal, or a myth. Machine learning and statistics: the focus is on generalization from many examples, both positive and negative.

The solution: Bayesian inference. Bayes' rule: P(H|D) = P(H) P(D|H) / P(D). There are various compelling (theoretical and experimental) arguments that one should represent one's beliefs using probability and update them using Bayes' rule.

Bayesian inference: key ingredients. A hypothesis space H, a prior p(h), a likelihood p(d|h), and an algorithm for computing the posterior p(h|d) = p(d|h) p(h) / Σ_{h'∈H} p(d|h') p(h').

The number game. Learning task: observe one or more examples (numbers), then judge whether other numbers are "yes" or "no".

The number game. Examples of "yes" numbers and plausible hypotheses: for 60: multiples of 10, even numbers, ...?; for 60, 80, 10, 30: multiples of 10, even numbers; for 60, 63, 56, 59: numbers near 60.

Human performance (figure): given the example 60, people generalize by diffuse similarity; given 60, 80, 10, 30, they infer the rule "multiples of 10"; given 60, 52, 57, 55, they generalize by focused similarity to numbers near 50-60.

The same figure, with some phenomena to explain: people can generalize from just positive examples, and generalization can appear either graded (uncertain) or all-or-none (confident).

Bayesian model. H: hypothesis space of possible concepts. X = {x_1, ..., x_n}: n examples of a concept C. Evaluate hypotheses given data using Bayes' rule: p(h|X) = p(X|h) p(h) / Σ_{h'∈H} p(X|h') p(h'). p(h) [prior]: domain knowledge, pre-existing biases. p(X|h) [likelihood]: statistical information in the examples. p(h|X) [posterior]: degree of belief that h is the true extension of C.

Hypothesis space. Mathematical properties (~50 hypotheses): odd, even, square, cube, prime, multiples of small integers, powers of small integers, same first (or last) digit. Magnitude intervals (~5000 hypotheses): all intervals of integers with endpoints between 1 and 100. A hypothesis can be identified with its extension, h = {x : h(x)=1, x = 1, 2, ..., 100}.

Likelihood p(X|h). Size principle: smaller hypotheses receive greater likelihood, and exponentially more so as n increases: p(X|h) = [1/size(h)]^n if x_1, ..., x_n ∈ h, and p(X|h) = 0 if any x_i ∉ h. This follows from the assumption of randomly sampled examples (strong sampling), and captures the intuition of a representative sample.

Example of the likelihood. X = {20, 40, 60}. h1 = multiples of 10 = {10, 20, ..., 100}; h2 = even numbers = {2, 4, ..., 100}; h3 = odd numbers = {1, 3, ..., 99}. Then p(X|h1) = 1/10 × 1/10 × 1/10, p(X|h2) = 1/50 × 1/50 × 1/50, and p(X|h3) = 0.
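
A minimal MATLAB sketch of this computation (MATLAB is also used in the credible-interval example at the end of these slides); the variable names and hypothesis definitions are illustrative:

    % Size-principle likelihood for the number game.
    X  = [20 40 60];                % observed positive examples
    h1 = 10:10:100;                 % multiples of 10
    h2 = 2:2:100;                   % even numbers
    h3 = 1:2:99;                    % odd numbers
    lik = @(X, h) all(ismember(X, h)) * (1/numel(h))^numel(X);
    [lik(X, h1), lik(X, h2), lik(X, h3)]
    % prints [1e-3, 8e-6, 0], i.e. (1/10)^3, (1/50)^3 and 0.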

Illustrating the size principle (figure): the even numbers 2, 4, ..., 100, with two hypotheses h1 and h2 marked.

Illustrating the size principle (figure): the observed data are slightly more of a coincidence under h1.

Illustrating the size principle (figure): the observed data are much more of a coincidence under h1.

Prior p(h). X = {60, 80, 10, 30}. Why prefer "multiples of 10" over "even numbers"? The size principle (likelihood). Why prefer "multiples of 10" over "multiples of 10 except 50 and 20"? The prior. We cannot learn efficiently with a uniform prior over all 2^100 logically possible hypotheses.

Need for a prior (inductive bias). Consider all 2^(2^2) = 16 possible binary functions on 2 binary inputs. If we observe (x1=0, x2=1, y=0), this removes h5, h6, h7, h8, h13, h14, h15, h16, but still leaves exponentially many hypotheses! We cannot learn efficiently without assumptions (the no-free-lunch theorem).
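
A small MATLAB sketch that makes this count concrete; the hypothesis indexing here is arbitrary and need not match the slide's numbering of h5, ..., h16:

    % Enumerate all 2^(2^2) = 16 boolean functions of two binary inputs and count
    % those consistent with the single observation (x1=0, x2=1, y=0).
    nConsistent = 0;
    for k = 0:15
        f = bitget(k, 1:4);   % truth table over inputs (0,0), (0,1), (1,0), (1,1)
        if f(2) == 0          % the function predicts y=0 for (x1=0, x2=1)
            nConsistent = nConsistent + 1;
        end
    end
    nConsistent               % prints 8: the observation rules out half of the 16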

Hierarchical prior (figure).

Computing the posterior. In this talk, we will not worry about computational issues (we will perform brute-force enumeration or derive analytical expressions): p(h|X) = p(X|h) p(h) / Σ_{h'∈H} p(X|h') p(h').

Prior, likelihoods, and posteriors (figure).

Generalizing to new objects. Given p(h|X), how do we compute the probability that C applies to some new stimulus y? This is the posterior predictive distribution p(y ∈ C | X).

Posterior predictive distribution. Compute the probability that C applies to some new object y by averaging the predictions of all hypotheses h, weighted by p(h|X) (Bayesian model averaging): p(y ∈ C | X) = Σ_{h∈H} p(y ∈ C | h) p(h|X), where p(y ∈ C | h) = 1 if y ∈ h and 0 if y ∉ h.
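
A brute-force MATLAB sketch of the posterior and the posterior predictive for a toy version of the number game; the hypothesis space below is a small illustrative subset (a few mathematical rules, no interval hypotheses), not the ~5050-hypothesis space of the slides:

    % Toy number game: brute-force posterior and posterior predictive.
    H = {10:10:100, 2:2:100, 1:2:99, 2.^(1:6), 4.^(1:3)};  % mult. of 10, even, odd, powers of 2, powers of 4
    prior = ones(1, numel(H)) / numel(H);                  % uniform prior, for simplicity
    X = [16 8 2 64];                                       % observed positive examples
    lik = zeros(1, numel(H));
    for i = 1:numel(H)
        lik(i) = all(ismember(X, H{i})) * (1/numel(H{i}))^numel(X);  % size principle
    end
    post = prior .* lik / sum(prior .* lik);               % p(h|X) by enumeration

    % Posterior predictive p(y in C | X) for y = 1..100 (Bayesian model averaging).
    pred = zeros(1, 100);
    for y = 1:100
        member  = cellfun(@(h) ismember(y, h), H);         % p(y in C | h) is 1 or 0
        pred(y) = sum(post .* member);
    end
    pred([16 32 17])   % high for 16 and 32, essentially 0 for 17, under this toy space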

Examples: 16 (figure of model predictions).

Examples: 16, 8, 2, 64 (figure of model predictions).

Examples: 16, 23, 19, 20 (figure of model predictions).

Human generalization vs. the Bayesian model (figure): side-by-side predictions for the example sets {60}, {60, 80, 10, 30}, {60, 52, 57, 55}, {16}, {16, 8, 2, 64}, and {16, 23, 19, 20}.

Rules and exemplars in the number game. The hypothesis space is a mixture of sparse (mathematical concepts) and dense (interval) hypotheses. If the data support a mathematical rule (e.g. X = {16, 8, 2, 64}), we rapidly learn a rule (the "aha!" moment); otherwise (e.g. X = {16, 23, 19, 20}) we learn by similarity, and need many examples to get a sharp boundary.

Summary of the Bayesian approach: 1. a constrained hypothesis space H; 2. a prior p(h); 3. a likelihood p(X|h); 4. hypothesis (model) averaging, p(y ∈ C | X) = Σ_{h∈H} p(y ∈ C | h) p(h|X).

MAP (maximum a posteriori) learning. Instead of Bayesian model averaging, we can find the mode of the posterior and use it as a plug-in. As N → ∞, the posterior peaks around the mode, so MAP and BMA converge. MAP cannot explain the transition from similarity-based (broad posterior) to rule-based (narrow posterior) generalization.

Maximum likelihood learning. ML = no prior, no averaging; plug in the MLE for prediction. X = {16} -> h = "powers of 4"; X = {16, 8, 2, 64} -> h = "powers of 2". So the predictive distribution gets broader as we get more data, in contrast to Bayes. ML is initially very conservative.

Large sample size behavior. As the amount of data goes to infinity, ML, MAP, and BMA all converge to the same solution, since the likelihood overwhelms the prior: p(X|h) scales with N, but p(h) is constant. If the truth is in the hypothesis class, all methods will find it; thus they are consistent estimators.

Beta-Bernoulli model. p(θ|D) ∝ p(D|θ) p(θ) = p(D|θ) Beta(θ|α1, α0) = [θ^N1 (1−θ)^N0] [θ^(α1−1) (1−θ)^(α0−1)] = θ^(N1+α1−1) (1−θ)^(N0+α0−1) ∝ Beta(θ|N1+α1, N0+α0).
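
A minimal MATLAB sketch of this conjugate update (betapdf requires the Statistics Toolbox; the counts and hyperparameters are illustrative):

    % Beta-Bernoulli update: posterior hyperparameters = prior pseudo-counts + data counts.
    a1 = 2; a0 = 2;             % Beta(theta | a1, a0) prior
    data = [1 0 1 1 1 0 1];     % observed coin flips
    N1 = sum(data == 1); N0 = sum(data == 0);
    post_a1 = a1 + N1;          % posterior is Beta(theta | a1+N1, a0+N0)
    post_a0 = a0 + N0;
    theta = 0:0.01:1;
    plot(theta, betapdf(theta, a1, a0), '--', theta, betapdf(theta, post_a1, post_a0));
    legend('prior', 'posterior');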

Sequential updating (figure).

Prior predictive density. p(x) = ∫ p(x|θ) p(θ) dθ = ∫_0^1 Bin(x|θ, m) Beta(θ|α1, α0) dθ := Bb(x|α1, α0, m) = (m choose x) B(x+α1, m−x+α0) / B(α1, α0), the beta-binomial distribution.
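
A small MATLAB sketch of the beta-binomial pmf, computed in log space with gammaln and betaln for numerical stability; the function name is my own:

    % Bb(x | a1, a0, m): the gammaln terms give log nchoosek(m,x), betaln(a,b) = log B(a,b).
    betabinompdf = @(x, a1, a0, m) exp( ...
        gammaln(m+1) - gammaln(x+1) - gammaln(m-x+1) ...
        + betaln(x + a1, m - x + a0) - betaln(a1, a0));
    m = 10; a1 = 1; a0 = 1;
    sum(betabinompdf(0:m, a1, a0, m))   % sanity check: the pmf sums to 1
    % With a flat Beta(1,1) prior, Bb(x|1,1,m) is uniform over x = 0..m.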

Posterior predictive density. p(x|D) = ∫ p(x|θ) p(θ|D) dθ = ∫_0^1 Bin(x|θ, m) Beta(θ|α1', α0') dθ = Bb(x|α1', α0', m) = (m choose x) B(x+α1', m−x+α0') / B(α1', α0'), where α1' = α1+N1 and α0' = α0+N0 are the posterior hyperparameters. Plug-in approximation: p(x|D) ≈ ∫ p(x|θ) δ_θ̂(θ) dθ = p(x|θ̂) = Bin(x|θ̂, m).
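
A short MATLAB comparison of the exact posterior predictive with the plug-in approximation, here using the posterior mean as θ̂ (binopdf requires the Statistics Toolbox; the counts are illustrative):

    % Posterior predictive Bb(x | a1', a0', m) vs. plug-in Bin(x | theta_hat, m).
    a1p = 1 + 3; a0p = 1 + 1;      % Beta(1,1) prior updated with N1=3, N0=1
    m = 5; x = 0:m;
    logC = gammaln(m+1) - gammaln(x+1) - gammaln(m-x+1);
    pp_bayes  = exp(logC + betaln(x + a1p, m - x + a0p) - betaln(a1p, a0p));
    theta_hat = a1p / (a1p + a0p); % posterior mean used as the point estimate
    pp_plugin = binopdf(x, m, theta_hat);
    [pp_bayes; pp_plugin]          % the Bayesian predictive has heavier tails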

Posterior predictive moments. For the beta-binomial Bb(x|α1, α0, m): E[x] = m α1/(α0+α1) and Var[x] = m α0 α1 (α0+α1+m) / [(α0+α1)^2 (α0+α1+1)]. If m=1, so x ∈ {0,1}, then E[x|D] = p(x=1|D) = ∫_0^1 p(x=1|θ) p(θ|D) dθ = ∫_0^1 θ Beta(θ|α1', α0') dθ = E[θ|D] = α1'/(α0'+α1'), where α1' = α1+N1 and α0' = α0+N0. With a uniform Beta(1,1) prior this gives Laplace's rule of succession: p(x=1|D) = (N1+1)/(N1+N0+2).
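
A quick numerical check of Laplace's rule of succession, with illustrative counts:

    % Posterior predictive p(x=1 | D) under a uniform Beta(1,1) prior.
    N1 = 3; N0 = 1;                        % e.g. 3 heads and 1 tail
    p_laplace = (N1 + 1) / (N1 + N0 + 2)   % rule of succession: 4/6 = 0.667
    p_mle     = N1 / (N1 + N0)             % MLE plug-in: 3/4 = 0.75, more extreme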

Summary of the Beta-Bernoulli model. Prior: p(θ) = Beta(θ|α1, α0) = [1/B(α1, α0)] θ^(α1−1) (1−θ)^(α0−1). Likelihood: p(D|θ) = θ^N1 (1−θ)^N0. Posterior: p(θ|D) = Beta(θ|α1+N1, α0+N0). Posterior predictive: p(x=1|D) = (α1+N1)/(α1+α0+N), where N = N1+N0.

Dirichlet-multinomial model. X_i ~ Mult(θ, 1), with p(x_i = k) = θ_k. Prior: p(θ) = Dir(θ|α1, ..., αK) ∝ Π_{k=1}^K θ_k^(αk−1). Likelihood: p(D|θ) = Π_{k=1}^K θ_k^(Nk). Posterior: p(θ|D) = Dir(θ|α1+N1, ..., αK+NK). Posterior predictive: p(x=k|D) = (αk+Nk) / Σ_{k'} (αk'+Nk').
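
A minimal MATLAB sketch of the Dirichlet-multinomial update and its posterior predictive; the counts and hyperparameters are illustrative:

    % Dirichlet-multinomial conjugate update.
    alpha = [1 1 1];                        % symmetric Dirichlet prior over K=3 categories
    N     = [5 0 2];                        % observed counts per category
    post_alpha = alpha + N;                 % posterior is Dir(theta | alpha + N)
    pred = post_alpha / sum(post_alpha)     % p(x=k|D) = (alpha_k+N_k) / sum_k'(alpha_k'+N_k')
    % prints [0.6 0.1 0.3]; category 2 keeps nonzero probability despite N_2 = 0.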

Dirichlet distribution (figure): densities on the simplex for parameters (20, 20, 20), (2, 2, 2), and (20, 2, 2).

Summarizing the posterior. If p(θ|D) is too complex to plot, we can compute various summary statistics, such as the posterior mean, mode, and median: θ̂_mean = E[θ|D]; θ̂_MAP = argmax_θ p(θ|D); θ̂_median = the value t such that p(θ > t | D) = 0.5.
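
For a Beta posterior these summaries have closed or easily computed forms; a small MATLAB sketch (betainv requires the Statistics Toolbox; the hyperparameters are illustrative):

    % Summaries of a Beta(a1, a0) posterior.
    a1 = 48; a0 = 54;                        % e.g. Beta(1,1) prior + 47 heads, 53 tails
    theta_mean   = a1 / (a1 + a0);           % posterior mean E[theta | D]
    theta_map    = (a1 - 1) / (a1 + a0 - 2); % posterior mode, valid when a1, a0 > 1
    theta_median = betainv(0.5, a1, a0);     % posterior median via the inverse CDF
    [theta_mean, theta_map, theta_median]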

Bayesian credible intervals. We can represent our uncertainty using a posterior credible interval (l, u): we require p(l ≤ θ ≤ u | D) = 1 − α, and take l = F⁻¹(α/2), u = F⁻¹(1 − α/2), where F is the posterior CDF.

Example We see 47 heads out of 100 trials. Using a Beta(1,1) prior, what is the 95% credible interval for probability of heads? S = 47; N = 100; a = S+1; b = (N-S)+1; alpha = 0.05; l = betainv(alpha/2, a, b); u = betainv(1-alpha/2, a, b); CI = [l,u] 0.3749 0.5673 51