Model Averaging (Bayesian Learning)
© D. Poole and A. Mackworth 2010, Artificial Intelligence, Lecture 7.6


Model Averaging (Bayesian Learning)

We want to predict the output Y of a new case that has input X = x, given the training examples e:

$P(Y \mid x \wedge e) = \sum_{m \in M} P(Y \wedge m \mid x \wedge e) = \sum_{m \in M} P(Y \mid m \wedge x \wedge e)\, P(m \mid x \wedge e) = \sum_{m \in M} P(Y \mid m \wedge x)\, P(m \mid e)$

M is a set of mutually exclusive and covering hypotheses.

What assumptions are made here?
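To make the averaging concrete, here is a minimal Python sketch (not from the slides) for a hypothetical set of three models of a Boolean outcome, with their posteriors P(m | e) assumed already computed:

```python
# A minimal sketch of model averaging over a discrete hypothesis set.
# Hypothetical setup: three candidate models m, each asserting a different
# probability P(y | m); the posteriors P(m | e) are assumed already computed.

models = {            # P(y | m) for each hypothetical model m
    "m1": 0.2,
    "m2": 0.5,
    "m3": 0.8,
}
posterior = {         # P(m | e), assumed given; must sum to 1
    "m1": 0.1,
    "m2": 0.3,
    "m3": 0.6,
}

# P(y | x, e) = sum over m of P(y | m, x) * P(m | e)
p_y = sum(models[m] * posterior[m] for m in models)
print(p_y)            # ≈ 0.65
```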

Learning Under Uncertainty

The posterior probability of a model given examples e:

$P(m \mid e) = \frac{P(e \mid m)\, P(m)}{P(e)}$

The likelihood, P(e | m), is the probability that model m would have produced examples e.

The prior, P(m), encodes the learning bias.

P(e) is a normalizing constant so that the probabilities of the models sum to 1.

Examples e = {e_1, ..., e_k} are independent and identically distributed (i.i.d.) given m if

$P(e \mid m) = \prod_{i=1}^{k} P(e_i \mid m)$
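A minimal sketch of this computation, assuming a handful of hypothetical models that each fix P(y) for a Boolean outcome, a uniform prior, and a short i.i.d. example sequence:

```python
# A minimal sketch of P(m | e) proportional to P(e | m) P(m) with i.i.d. examples.
# Hypothetical models: each m fixes P(y) for a Boolean outcome.

p_y_given_m = {"m1": 0.2, "m2": 0.5, "m3": 0.8}   # hypothetical models
prior = {"m1": 1/3, "m2": 1/3, "m3": 1/3}         # uniform learning bias

examples = [True, True, False, True, True]        # observed i.i.d. data e

def likelihood(m):
    """P(e | m) = product over the examples of P(e_i | m)."""
    p = p_y_given_m[m]
    out = 1.0
    for e_i in examples:
        out *= p if e_i else (1.0 - p)
    return out

unnorm = {m: likelihood(m) * prior[m] for m in prior}
z = sum(unnorm.values())                          # P(e), the normalizing constant
posterior = {m: v / z for m, v in unnorm.items()}
print(posterior)
```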

Plate Notation

[Figure: two equivalent Bayesian networks. One draws the model m with one child node per example, e_1, e_2, ..., e_k; the other uses plate notation, with a single node e_i inside a plate indexed by i that replicates it for each example.]

Bayesian Learning of Probabilities

Y has two outcomes, y and ¬y. We want the probability of y given training examples e.

We can treat the probability of y as a real-valued random variable on the interval [0, 1], called φ.

Bayes' rule gives:

$P(\phi = p \mid e) = \frac{P(e \mid \phi = p)\, P(\phi = p)}{P(e)}$

Suppose e is a sequence of n_1 instances of y and n_0 instances of ¬y:

$P(e \mid \phi = p) = p^{n_1} (1 - p)^{n_0}$

Uniform prior: $P(\phi = p) = 1$ for all $p \in [0, 1]$.
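As a sketch of how this posterior over φ can be computed numerically (grid approximation, illustrative counts):

```python
import numpy as np

# A minimal sketch, assuming a uniform prior on phi: the posterior density
# P(phi = p | e) is proportional to p**n1 * (1 - p)**n0, normalized over [0, 1].

n1, n0 = 2, 1                              # illustrative counts of y and not-y
p = np.linspace(0.0, 1.0, 1001)            # grid over phi
dp = p[1] - p[0]
unnorm = p**n1 * (1.0 - p)**n0             # P(e | phi=p) * P(phi=p), with P(phi=p) = 1
posterior = unnorm / (unnorm.sum() * dp)   # crude normalization on the grid

print(p[np.argmax(posterior)])             # MAP estimate, close to n1/(n0+n1) = 2/3
```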

Posterior Probabilities for Different Training Examples (Beta Distribution)

[Figure: plot of the posterior density P(φ | e) against φ on [0, 1] for the observation counts (n_0=0, n_1=0), (n_0=1, n_1=2), (n_0=2, n_1=4), and (n_0=4, n_1=8); the curves become more sharply peaked around φ = 2/3 as the counts grow.]
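A short sketch (assuming scipy and matplotlib are available) that regenerates curves of this shape. With a uniform prior, the posterior after n_1 true and n_0 false observations is a Beta density, which in scipy's (a, b) parameterization is beta(n_1 + 1, n_0 + 1):

```python
import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt

# Sketch: posterior densities over phi for the counts used in the figure above.
phi = np.linspace(0.0, 1.0, 500)
for n0, n1 in [(0, 0), (1, 2), (2, 4), (4, 8)]:
    plt.plot(phi, beta.pdf(phi, n1 + 1, n0 + 1), label=f"n0={n0}, n1={n1}")
plt.xlabel("phi")
plt.ylabel("P(phi | e)")
plt.legend()
plt.show()
```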

MAP Model

The maximum a posteriori probability (MAP) model is the model m that maximizes P(m | e). That is, it maximizes

$P(e \mid m)\, P(m)$

and thus minimizes

$(-\log P(e \mid m)) + (-\log P(m))$,

which is the number of bits needed to send the examples e given the model m, plus the number of bits needed to send the model m.
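A minimal sketch of the bit-counting view, reusing the hypothetical models and examples from the earlier posterior sketch:

```python
import math

# A sketch of MAP as minimum description length: minimizing the bits for
# (e given m) plus the bits for m is the same as maximizing P(e | m) P(m).

p_y_given_m = {"m1": 0.2, "m2": 0.5, "m3": 0.8}   # hypothetical models
prior = {"m1": 1/3, "m2": 1/3, "m3": 1/3}
examples = [True, True, False, True, True]

def bits(m):
    """(-log2 P(e | m)) + (-log2 P(m)): bits to send e given m, plus bits to send m."""
    p = p_y_given_m[m]
    data_bits = -sum(math.log2(p if e_i else 1.0 - p) for e_i in examples)
    model_bits = -math.log2(prior[m])
    return data_bits + model_bits

map_model = min(p_y_given_m, key=bits)
print(map_model, bits(map_model))
```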

Averaging Over Models

Idea: rather than choosing the most likely model, average over all models, weighted by their posterior probabilities given the examples.

If you have observed a sequence of n_1 instances of y and n_0 instances of ¬y, with a uniform prior:

the most likely value (MAP) is $\frac{n_1}{n_0 + n_1}$

the expected value is $\frac{n_1 + 1}{n_0 + n_1 + 2}$
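A quick numeric comparison of the two estimates for the counts used in the earlier figure (plus one larger case), assuming a uniform prior:

```python
# MAP value n1/(n0+n1) vs expected value (n1+1)/(n0+n1+2) for a few counts.
# With no data the MAP value is undefined, but the expected value is 1/2.

for n0, n1 in [(0, 0), (1, 2), (2, 4), (4, 8), (40, 80)]:
    map_est = n1 / (n0 + n1) if (n0 + n1) > 0 else float("nan")
    mean_est = (n1 + 1) / (n0 + n1 + 2)
    print(n0, n1, map_est, mean_est)   # the two estimates converge as the counts grow
```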

Beta Distribution

$\mathrm{Beta}_{\alpha_0, \alpha_1}(p) = \frac{1}{K}\, p^{\alpha_1 - 1} (1 - p)^{\alpha_0 - 1}$

where K is a normalizing constant and $\alpha_i > 0$.

The uniform distribution on [0, 1] is $\mathrm{Beta}_{1,1}$.

The expected value is $\alpha_1 / (\alpha_0 + \alpha_1)$.

If the prior probability of a Boolean variable is $\mathrm{Beta}_{\alpha_0, \alpha_1}$, the posterior distribution after observing n_1 true cases and n_0 false cases is $\mathrm{Beta}_{\alpha_0 + n_0,\, \alpha_1 + n_1}$.
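A minimal sketch of this conjugate update in the slide's notation, with illustrative pseudo-counts:

```python
# Conjugate Beta update: prior Beta_{a0,a1}, then n1 true and n0 false observations.

a0, a1 = 1, 1                           # uniform prior Beta_{1,1} (illustrative choice)
n0, n1 = 2, 4                           # observed false / true counts

post_a0, post_a1 = a0 + n0, a1 + n1     # posterior is Beta_{a0+n0, a1+n1}

print(a1 / (a0 + a1))                   # prior expected value of the true outcome: 0.5
print(post_a1 / (post_a0 + post_a1))    # posterior expected value: 5/8 = 0.625
```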

Dirichlet Distribution

Suppose Y has k values. The Dirichlet distribution has two sorts of parameters:

positive counts $\alpha_1, \ldots, \alpha_k$, where $\alpha_i$ is one more than the count of the ith outcome;

probability parameters $p_1, \ldots, p_k$, where $p_i$ is the probability of the ith outcome.

$\mathrm{Dirichlet}_{\alpha_1, \ldots, \alpha_k}(p_1, \ldots, p_k) = \frac{1}{K} \prod_{j=1}^{k} p_j^{\alpha_j - 1}$

where K is a normalizing constant.

The expected value of the ith outcome is $\frac{\alpha_i}{\sum_j \alpha_j}$.
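The same conjugate update, sketched for a k-valued variable with illustrative counts:

```python
import numpy as np

# A minimal sketch: starting from the uniform prior Dirichlet_{1,...,1},
# after observing counts n_i the posterior pseudo-counts are alpha_i = 1 + n_i,
# and the expected probability of the ith outcome is alpha_i / sum_j alpha_j.

counts = np.array([5, 2, 3])     # illustrative observed counts for a 3-valued Y
alpha = 1 + counts               # posterior Dirichlet parameters
print(alpha / alpha.sum())       # expected probabilities: [6/13, 3/13, 4/13]
```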

Hierarchical Bayesian Model

Where do the priors come from?

Example: S_XH is true when patient X is sick in hospital H. We want to learn the probability of Sick for each hospital. Where do the prior probabilities for the hospitals come from?

[Figure: two renderings of the model, (a) and (b): a plate version with shared hyperparameters α_1, α_2, a per-hospital parameter φ_H inside a plate over hospitals H, and the observation S_XH inside a plate over patients X; and an unrolled version in which α_1, α_2 feed φ_1, φ_2, ..., φ_k, which in turn feed S_11, S_12, S_21, S_22, ..., S_1k.]
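One way to see what the hierarchy buys: a sketch with hypothetical numbers in which the shared prior parameters are treated as fixed (a simplification; a fully Bayesian treatment would also put a distribution over α_1 and α_2 and infer them), so each hospital's estimate is shrunk toward the shared prior mean:

```python
import numpy as np

# Sketch with hypothetical numbers: shared Beta prior pseudo-counts for
# (not sick, sick), then per-hospital posterior means. Small hospitals are
# pulled toward the shared prior mean 0.1; the large one is dominated by its data.

alpha0, alpha1 = 18.0, 2.0          # shared prior pseudo-counts (assumed fixed here)
sick  = np.array([3, 0, 30])        # sick patients per hospital (illustrative)
total = np.array([10, 5, 300])      # total patients per hospital

posterior_mean = (alpha1 + sick) / (alpha0 + alpha1 + total)
print(posterior_mean)               # ~[0.167, 0.08, 0.1] vs raw rates [0.3, 0.0, 0.1]
```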