Introduction to Bayesian Learning


Introduction to Bayesian Learning
Davide Bacciu
Dipartimento di Informatica, Università di Pisa
bacciu@di.unipi.it
Apprendimento Automatico: Fondamenti - A.A. 2016/2017

Outline
Lecture 1 - Introduction: probabilistic reasoning and learning (Bayesian Networks)
Lecture 2 - Parameter Learning (I): learning fully observed Bayesian models
Lecture 3 - Parameter Learning (II): learning with hidden variables
If time allows, we will also cover some application examples of Bayesian learning and Bayesian Networks.

Course Information
Reference webpage: http://pages.di.unipi.it/bacciu/teaching/apprendimento-automatico-fondamenti/
Here you can find course information, lecture slides, articles and course materials.
Office Hours
Where? Dipartimento di Informatica - 2nd Floor - Room 367
When? Monday 17-19, or basically anytime if you send me an email beforehand.

Lecture Outline
1 Introduction
2 Probability Theory: Probabilities and Random Variables; Bayes Theorem and Independence
3 Hypothesis Selection: Bayesian Classifiers
4 Bayesian Networks: Representation; Learning and Inference

Bayesian Learning
Why Bayesian? Easy: because of the frequent use of Bayes theorem...
Bayesian Inference: a powerful approach to probabilistic reasoning
Bayesian Networks: an expressive model for describing probabilistic relationships
Why bother? The real world is uncertain:
Data (noisy measurements and partial knowledge)
Beliefs (concepts and their relationships)
Probability as a measure of our beliefs:
A conceptual framework for describing uncertainty in the world representation
Learning and reasoning become matters of probabilistic inference
Probabilistic weighting of the hypotheses

Part I - Probability and Learning

Random Variables
A Random Variable (RV) is a function describing the outcome of a random process by assigning unique values to all possible outcomes of the experiment.
Random process = coin toss; discrete RV X = 0 if heads, 1 if tails.
The sample space S of a random process is the set of all possible outcomes, e.g. S = {heads, tails}.
An event e is a subset e ⊆ S, i.e. a set of outcomes, that may or may not occur as a result of the experiment.
Random variables are the building blocks for representing our world.

Probability Functions
A probability function P(X = x) ∈ [0, 1] (P(x) in short) measures the probability of the RV X attaining the value x, i.e. the probability of event x occurring.
If the random process is described by a set of RVs X_1, ..., X_N, then the joint probability writes
P(X_1 = x_1, ..., X_N = x_N) = P(x_1, ..., x_N)
Definition (Sum Rule): the probabilities of all the events must sum to 1
\sum_x P(X = x) = 1

Product Rule and Conditional Probabilities
Definition (Product Rule, a.k.a. Chain Rule)
P(x_1, ..., x_N | y) = \prod_{i=1}^N P(x_i | x_1, ..., x_{i-1}, y)
P(x | y) is the conditional probability of x given y; it reflects the fact that the realization of an event y may affect the occurrence of x.
Marginalization: the sum and product rules together yield the complete probability equation
P(X_1 = x_1) = \sum_{x_2} P(X_1 = x_1, X_2 = x_2) = \sum_{x_2} P(X_1 = x_1 | X_2 = x_2) P(X_2 = x_2)
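To make the sum and product rules concrete, here is a minimal Python sketch on a made-up two-variable joint table; the numbers and helper names are illustrative, not from the lecture.

```python
# Toy joint distribution P(X1, X2) over two binary variables (illustrative numbers).
joint = {
    (0, 0): 0.3, (0, 1): 0.1,
    (1, 0): 0.2, (1, 1): 0.4,
}

# Sum rule / marginalization: P(X1 = x1) = sum over x2 of P(X1 = x1, X2 = x2)
def marginal_x1(x1):
    return sum(p for (a, b), p in joint.items() if a == x1)

# Conditional probability: P(X2 = x2 | X1 = x1) = P(x1, x2) / P(x1)
def conditional_x2_given_x1(x2, x1):
    return joint[(x1, x2)] / marginal_x1(x1)

# Product rule check: P(x1, x2) == P(x2 | x1) * P(x1)
assert abs(joint[(1, 1)] - conditional_x2_given_x1(1, 1) * marginal_x1(1)) < 1e-12

print(marginal_x1(0))                 # 0.4
print(conditional_x2_given_x1(1, 0))  # 0.25
```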

Bayes Rule
Given a hypothesis h_i ∈ H and observations d
P(h_i | d) = P(d | h_i) P(h_i) / P(d) = P(d | h_i) P(h_i) / \sum_j P(d | h_j) P(h_j)
P(h_i) is the prior probability of h_i
P(d | h_i) is the conditional probability of observing d given that hypothesis h_i is true (likelihood)
P(d) is the marginal probability of d
P(h_i | d) is the posterior probability that the hypothesis is true, given the data and the previous belief about the hypothesis
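Bayes rule over a finite hypothesis space is easy to spell out in code; this is a minimal sketch, with a helper function and example numbers of my own choosing rather than anything from the slides.

```python
# Minimal sketch: posterior over a finite hypothesis space via Bayes rule.
def posterior(priors, likelihoods):
    """priors[i] = P(h_i), likelihoods[i] = P(d | h_i); returns P(h_i | d)."""
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(unnormalized)          # P(d) = sum_j P(d | h_j) P(h_j)
    return [u / evidence for u in unnormalized]

print(posterior([0.5, 0.5], [0.9, 0.1]))  # [0.9, 0.1]
```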

So, Is This All About Bayes???
Frequentists vs Bayesians
[Figure: Frequentists vs Bayesians comparison. Credit goes to Mike West @ Duke University]

Independence and Conditional Independence
Two RVs X and Y are independent if knowledge about X does not change the uncertainty about Y, and vice versa:
I(X, Y) ⟺ P(X, Y) = P(X | Y) P(Y) = P(Y | X) P(X) = P(X) P(Y)
Two RVs X and Y are conditionally independent given Z if, once Z is known, knowledge about X does not change the uncertainty about Y (and vice versa):
I(X, Y | Z) ⟺ P(X, Y | Z) = P(X | Y, Z) P(Y | Z) = P(Y | X, Z) P(X | Z) = P(X | Z) P(Y | Z)
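A minimal sketch of what the independence condition means operationally: on a toy joint table (numbers of my own, built so that X and Y are independent), check P(X, Y) = P(X) P(Y) for every value pair.

```python
from itertools import product

# Toy joint built as an outer product of two marginals, so X and Y are independent.
joint = {(x, y): px * py
         for (x, px), (y, py) in product([(0, 0.4), (1, 0.6)], [(0, 0.7), (1, 0.3)])}

def marginal(joint, axis, value):
    return sum(p for xy, p in joint.items() if xy[axis] == value)

# I(X, Y) holds iff P(x, y) = P(x) P(y) for all (x, y).
independent = all(
    abs(joint[(x, y)] - marginal(joint, 0, x) * marginal(joint, 1, y)) < 1e-12
    for x in (0, 1) for y in (0, 1)
)
print(independent)  # True
```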

Representing Probabilities with Discrete Data
Joint Probability Distribution Table:

X_1    ...  X_i    ...  X_n     P(X_1, ..., X_n)
x_1    ...  x_i    ...  x_n     P(x_1, ..., x_n)
...
x_1^l  ...  x_i^l  ...  x_n^l   P(x_1^l, ..., x_n^l)

The table describes P(X_1, ..., X_n) for all the RV instantiations x_1, ..., x_n.
In general, any probability of interest can be obtained starting from the joint probability distribution P(X_1, ..., X_n).

Wrapping Up...
We know how to represent the world and the observations:
Random variables X_1, ..., X_N
Joint probability distribution P(X_1 = x_1, ..., X_N = x_N)
We have rules for manipulating probabilistic knowledge:
Sum and product rules
Marginalization
Bayes rule
Conditional independence
It is about time that we do some learning... in other words, let's discover the values of P(X_1 = x_1, ..., X_N = x_N).

Bayesian Learning
Statistical learning approaches calculate the probability of each hypothesis h_i given the data D, and select hypotheses / make predictions on that basis.
Bayesian learning makes predictions using all hypotheses, weighted by their probabilities:
P(X | D = d) = \sum_i P(X | d, h_i) P(h_i | d) = \sum_i P(X | h_i) P(h_i | d)
(new prediction = sum of the hypothesis predictions, each weighted by its posterior)

Single Hypothesis Approximation
Computational and analytical tractability issue: Bayesian learning requires a (possibly infinite) summation over the whole hypothesis space.
Maximum a-posteriori (MAP) predicts P(X | h_MAP) using the most likely hypothesis h_MAP given the training data
h_MAP = arg max_{h ∈ H} P(h | d) = arg max_{h ∈ H} P(d | h) P(h) / P(d) = arg max_{h ∈ H} P(d | h) P(h)
Assuming uniform priors P(h_i) = P(h_j) yields the Maximum Likelihood (ML) estimate P(X | h_ML)
h_ML = arg max_{h ∈ H} P(d | h)
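A minimal sketch of the two point estimates, assuming the hypothesis space is a small finite list with known priors and data likelihoods; all names and numbers below are illustrative.

```python
# Illustrative finite hypothesis space: priors P(h) and likelihoods P(d | h).
hypotheses = ["h1", "h2", "h3"]
prior      = {"h1": 0.7, "h2": 0.2, "h3": 0.1}
likelihood = {"h1": 0.05, "h2": 0.10, "h3": 0.20}   # P(d | h) for the observed d

# MAP maximizes P(d | h) P(h); ML maximizes P(d | h) alone (i.e. assumes a uniform prior).
h_map = max(hypotheses, key=lambda h: likelihood[h] * prior[h])
h_ml  = max(hypotheses, key=lambda h: likelihood[h])

print(h_map)  # 'h1': 0.05 * 0.7 = 0.035 beats 0.10 * 0.2 and 0.20 * 0.1
print(h_ml)   # 'h3': largest likelihood, regardless of the prior
```

With these numbers the two estimates disagree, which is exactly the effect of the prior discussed above.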

All Too Abstract? Let's Go to the Cinema!!!
How do I choose the next movie (prediction)? I might ask my friends for their favorite choice given their personal taste (hypothesis), and then select the movie.
Bayesian advice? Take a vote over all the friends' suggestions, weighted by how often they go to the cinema and by their taste.
MAP advice? Follow the friend who goes often to the cinema and whose taste I trust.
ML advice? Follow the friend who goes to the cinema most often.

The Candy Box Problem
A candy manufacturer produces 5 types of candy boxes (hypotheses) that are indistinguishable in the darkness of the cinema:
h_1: 100% cherry flavor
h_2: 75% cherry and 25% lime flavor
h_3: 50% cherry and 50% lime flavor
h_4: 25% cherry and 75% lime flavor
h_5: 100% lime flavor
Given a sequence of candies d = d_1, ..., d_N extracted and reinserted in a box (observations), what is the most likely flavor for the next candy (prediction)?

Candy Box Problem: Hypothesis Posterior
First, we need to compute the posterior for each hypothesis (Bayes rule)
P(h_i | d) = α P(d | h_i) P(h_i)
The manufacturer is kind enough to provide us with the production shares (prior) for the 5 boxes:
(P(h_1), P(h_2), P(h_3), P(h_4), P(h_5)) = (0.1, 0.2, 0.4, 0.2, 0.1)
The data likelihood can be computed under the assumption that the observations are independently and identically distributed (i.i.d.):
P(d | h_i) = \prod_{j=1}^N P(d_j | h_i)

Candy Box Problem: Hypothesis Posterior Computation
Suppose that the box is an h_5 and consider a sequence of 10 observed lime candies. Since all observations are lime,
P(h_i | d) = α P(h_i) P(d = l | h_i)^N
Posteriors after the first observations:

Hyp   d_0   d_1   d_2
h_1   0.1   0     0
h_2   0.2   0.1   0.03
h_3   0.4   0.4   0.30
h_4   0.2   0.3   0.35
h_5   0.1   0.2   0.31

[Figure: posterior probabilities P(h_1 | d), ..., P(h_5 | d) as a function of the number of samples in d, from 0 to 10]
The posteriors of the false hypotheses eventually vanish.
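The posterior update above is easy to reproduce; a minimal sketch with the priors and per-box lime probabilities given in the example (the helper name is mine), whose output matches the table above up to rounding.

```python
# Candy box example: posterior over the 5 boxes after observing only lime candies.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]          # P(h_1), ..., P(h_5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]        # P(candy = lime | h_i)

def posteriors_after_n_limes(n):
    """P(h_i | d) with d = n i.i.d. lime observations (alpha = 1 / normalizer)."""
    unnormalized = [prior * (pl ** n) for prior, pl in zip(priors, p_lime)]
    alpha = 1.0 / sum(unnormalized)
    return [alpha * u for u in unnormalized]

for n in range(3):
    print(n, [round(p, 2) for p in posteriors_after_n_limes(n)])
# 0 [0.1, 0.2, 0.4, 0.2, 0.1]
# 1 [0.0, 0.1, 0.4, 0.3, 0.2]
# 2 [0.0, 0.04, 0.31, 0.35, 0.31]
```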

Candy Box Problem: Comparing Predictions
Bayesian learning seeks
P(d_11 = l | d_1 = l, ..., d_10 = l) = \sum_{i=1}^5 P(d_11 = l | h_i) P(h_i | d)
[Figure: probability that the next candy is lime as a function of the number of samples in d, from 0 to 10]
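A minimal sketch comparing the Bayesian predictive probability that the next candy is lime with the MAP point prediction, using the same candy box numbers as above (helper names are mine).

```python
# Bayesian vs. MAP prediction that the next candy is lime (candy box example).
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

def posteriors(n):                        # P(h_i | n i.i.d. lime observations)
    u = [p * (pl ** n) for p, pl in zip(priors, p_lime)]
    return [x / sum(u) for x in u]

def bayes_predict_lime(n):                # sum_i P(lime | h_i) P(h_i | d)
    return sum(pl * p for pl, p in zip(p_lime, posteriors(n)))

def map_predict_lime(n):                  # P(lime | h_MAP)
    post = posteriors(n)
    return p_lime[max(range(5), key=lambda i: post[i])]

for n in (0, 1, 2, 10):
    print(n, round(bayes_predict_lime(n), 3), map_predict_lime(n))
# The Bayesian and MAP predictions converge as more lime candies are observed.
```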

Observations
Both ML and MAP are point estimates, since they make predictions based only on the most likely hypothesis.
MAP predictions are approximately Bayesian if P(X | d) ≈ P(X | h_MAP).
MAP and Bayesian predictions become closer as more data becomes available.
ML is a good approximation to MAP if the dataset is large and there are no a-priori preferences on the hypotheses:
ML is fully data-driven
For large data sets, the influence of the prior becomes irrelevant
ML has problems with small datasets

Regularization
P(h) introduces a preference across hypotheses and penalizes complexity: complex hypotheses have a lower prior probability.
The hypothesis prior embodies a trade-off between complexity and degree of fit.
The MAP hypothesis
h_MAP = arg max_h P(d | h) P(h) = arg min_h [ -log_2 P(d | h) - log_2 P(h) ]
where -log_2 P(h) is the number of bits required to specify h. MAP = choosing the hypothesis that provides maximum compression.
MAP is a regularization of the ML estimation.

Bayes Optimal Classifier
Suppose we want to classify a new observation as one of the classes c_j ∈ C. The most probable Bayesian classification is
c_opt = arg max_{c_j ∈ C} P(c_j | d) = arg max_{c_j ∈ C} \sum_i P(c_j | h_i) P(h_i | d)
Bayes Optimal Classifier: no other classification method (using the same hypothesis space and prior knowledge) can outperform a Bayes Optimal Classifier on average.
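A minimal sketch of the Bayes optimal rule, assuming we already have the posterior P(h_i | d) over a small hypothesis space and each hypothesis gives class probabilities P(c_j | h_i); all numbers are illustrative.

```python
# Bayes optimal classification: weight each hypothesis' class prediction
# by its posterior and pick the class with the largest total probability.
classes = ["pos", "neg"]
posterior_h = [0.4, 0.3, 0.3]                    # P(h_i | d), illustrative
p_class_given_h = [                              # P(c_j | h_i) for each hypothesis
    {"pos": 1.0, "neg": 0.0},
    {"pos": 0.0, "neg": 1.0},
    {"pos": 0.0, "neg": 1.0},
]

def bayes_optimal_class():
    score = {c: sum(p_h * p_c[c] for p_h, p_c in zip(posterior_h, p_class_given_h))
             for c in classes}
    return max(score, key=score.get), score

print(bayes_optimal_class())
# -> ('neg', ...): note that the single MAP hypothesis (posterior 0.4) would say 'pos'.
```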

Gibbs Algorithm
The Bayes Optimal Classifier provides the best results, but can be extremely expensive. Use an approximate method: the Gibbs Algorithm
1 Choose a hypothesis h_G from H at random, drawing it from the current posterior probability distribution P(h_i | d)
2 Use h_G to predict the classification of the next instance
Under certain conditions, the expected number of misclassifications is at most twice that of a Bayes Optimal classifier.
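A minimal sketch of the Gibbs step, again with an illustrative posterior and per-hypothesis predictions of my own.

```python
import random

# Gibbs classification: sample one hypothesis from the posterior and let it predict.
posterior_h = [0.4, 0.3, 0.3]                    # P(h_i | d), illustrative
h_predictions = ["pos", "neg", "neg"]            # each hypothesis' own classification

def gibbs_classify(rng=random):
    h = rng.choices(range(len(posterior_h)), weights=posterior_h, k=1)[0]
    return h_predictions[h]

random.seed(0)
print([gibbs_classify() for _ in range(5)])      # a mix of 'pos' and 'neg' draws
```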

Naive Bayes Classifier
One of the simplest, yet most popular, tools, based on a strong probabilistic assumption.
Consider the setting:
Target classification function f : X → C
Each instance x ∈ X is described by a set of attributes x = ⟨a_1, ..., a_l, ..., a_L⟩
Seek the MAP classification
c_NB = arg max_{c_j ∈ C} P(c_j | a_1, ..., a_L)

Naive Bayes Assumption
The MAP classification rewrites as
c_NB = arg max_{c_j ∈ C} P(c_j | a_1, ..., a_L) = arg max_{c_j ∈ C} P(a_1, ..., a_L | c_j) P(c_j)
Naive Bayes assumes conditional independence between the attributes a_l given the classification c_j
P(a_1, ..., a_L | c_j) = \prod_{l=1}^L P(a_l | c_j)
Naive Bayes Classification
c_NB = arg max_{c_j ∈ C} P(c_j) \prod_{l=1}^L P(a_l | c_j)
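A minimal sketch of the classification rule above, assuming the class priors P(c_j) and the per-attribute conditionals P(a_l | c_j) have already been estimated; the toy weather-style numbers are illustrative.

```python
import math

# Illustrative, pre-estimated parameters: P(c) and P(a_l = value | c) per attribute.
prior = {"yes": 0.6, "no": 0.4}
cond = {
    "yes": [{"sunny": 0.2, "rainy": 0.8}, {"hot": 0.3, "cool": 0.7}],
    "no":  [{"sunny": 0.7, "rainy": 0.3}, {"hot": 0.6, "cool": 0.4}],
}

def naive_bayes_classify(attributes):
    """c_NB = argmax_c P(c) * prod_l P(a_l | c), computed in log space for stability."""
    def log_score(c):
        return math.log(prior[c]) + sum(math.log(cond[c][l][a])
                                        for l, a in enumerate(attributes))
    return max(prior, key=log_score)

print(naive_bayes_classify(["sunny", "hot"]))  # 'no' under these illustrative numbers
```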

Observations on Naive Bayes
Use it when:
Dealing with moderate or large training sets
The attributes that describe the instances are (seemingly) conditionally independent
In reality, conditional independence is often violated...
...but Naive Bayes does not seem to know it: it often works surprisingly well!
Text classification
Medical diagnosis
Recommendation systems
Next class we'll see how to train a Naive Bayes classifier.

Part II - Bayesian Networks

Representing Conditional Independence
Bayes Optimal Classifier (BOC): no independence information between RVs; computationally expensive
Naive Bayes (NB): full independence given the class; extremely restrictive assumption
Bayesian Networks (BNs) describe conditional independence between subsets of RVs by a graphical model
I(X, Y | Z) ⟺ P(X, Y | Z) = P(X | Z) P(Y | Z)
They combine a-priori information concerning RV dependencies with observed data.
[Figure: example Bayesian Network over the variables Earthquake, Burglary, Radio Announcement, Alarm and Neighbour Call]

Bayesian Network - Directed Representation
Directed Acyclic Graph (DAG):
Nodes represent random variables (shaded = observed, empty = unobserved)
Edges describe the conditional independence relationships: every variable is conditionally independent of its non-descendants, given its parents
Conditional Probability Tables (CPTs), local to each node, describe the probability distribution given its parents
P(Y_1, ..., Y_N) = \prod_{i=1}^N P(Y_i | pa(Y_i))
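To make the factorization concrete, here is a minimal sketch that evaluates the joint probability of one full assignment from local CPTs. It uses the usual textbook Burglary/Earthquake/Alarm/Call structure (the Radio node is omitted for brevity) with made-up CPT numbers; none of these values come from the slides.

```python
# P(B, E, A, C) = P(B) P(E) P(A | B, E) P(C | A) for one full assignment.
# Structure (illustrative): Burglary -> Alarm <- Earthquake, Alarm -> NeighbourCall.
p_b = {True: 0.01, False: 0.99}
p_e = {True: 0.02, False: 0.98}
p_a_given_be = {(True, True): 0.95, (True, False): 0.94,
                (False, True): 0.29, (False, False): 0.001}   # P(A=True | B, E)
p_c_given_a = {True: 0.90, False: 0.05}                       # P(C=True | A)

def joint(b, e, a, c):
    pa = p_a_given_be[(b, e)] if a else 1 - p_a_given_be[(b, e)]
    pc = p_c_given_a[a] if c else 1 - p_c_given_a[a]
    return p_b[b] * p_e[e] * pa * pc

print(joint(b=False, e=False, a=True, c=True))  # product of the four local factors
```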

Naive Bayes as a Bayesian Network
[Figure: the class node c pointing to the attribute nodes a_1, a_2, ..., a_L, next to the equivalent plate notation with a_l in a plate of size L, replicated over D instances]
The Naive Bayes classifier can be represented as a Bayesian Network.
A more compact representation is Plate Notation, which also allows specifying more (Bayesian) details of the model.

Plate Notation
[Figure: Bayesian Naive Bayes in plate notation, with nodes π, c, a_l, β_j and plates of sizes L, C and D]
Boxes denote replication, the number of copies being given by the letter in the corner
Shaded nodes are observed variables
Empty nodes denote unobserved latent variables
Black seeds identify constant terms (e.g. the prior distribution π over the classes)

Inference in Bayesian Networks
Inference: how can one determine the distribution of the values of one or several network variables, given the observed values of others?
Bayesian Networks contain all the needed information (the joint probability)
Inference is an NP-hard problem in general
It solves easily for a single unobserved variable...
...otherwise:
Exact inference on small DAGs
Approximate inference (variational methods)
Simulate the network (Monte Carlo)
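On a tiny network, exact inference by enumeration is a few lines: sum the joint over the unobserved variables and normalize. This sketch reuses the same illustrative Burglary/Earthquake/Alarm/Call CPTs as above (again, made-up numbers) to compute P(Burglary | Call = True).

```python
from itertools import product

# Inference by enumeration on a tiny illustrative network: P(Burglary | Call = True).
p_b, p_e = 0.01, 0.02
p_a = {(True, True): 0.95, (True, False): 0.94, (False, True): 0.29, (False, False): 0.001}
p_c = {True: 0.90, False: 0.05}

def joint(b, e, a, c):
    pb = p_b if b else 1 - p_b
    pe = p_e if e else 1 - p_e
    pa = p_a[(b, e)] if a else 1 - p_a[(b, e)]
    pc = p_c[a] if c else 1 - p_c[a]
    return pb * pe * pa * pc

# Sum the joint over the unobserved variables (E, A), then normalize over B.
score = {b: sum(joint(b, e, a, True) for e, a in product((True, False), repeat=2))
         for b in (True, False)}
z = sum(score.values())
print({b: s / z for b, s in score.items()})   # posterior P(B | C = True)
```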

Learning with Bayesian Networks
Fixed structure (e.g. X → Y, learn P(Y | X)) = parameter learning:
Complete data: Naive Bayes; calculate frequencies (ML)
Incomplete data: latent variables; EM algorithm (ML); MCMC, VBEM (Bayesian)
Unknown structure, fixed variables (X, Y, learn P(X, Y)) = structure learning:
Complete data: discover dependencies from the data; structure search; independence tests
Incomplete data: difficult problem; Structural EM
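For the fixed-structure, complete-data cell of the table, ML parameter learning reduces to counting; a minimal sketch estimating a CPT P(Y | X) by relative frequencies from fully observed pairs (the tiny dataset is made up).

```python
from collections import Counter

# ML parameter learning with complete data: estimate P(Y | X) by relative frequencies.
data = [("x0", "y0"), ("x0", "y0"), ("x0", "y1"),
        ("x1", "y1"), ("x1", "y1"), ("x1", "y0")]

pair_counts = Counter(data)
x_counts = Counter(x for x, _ in data)
cpt = {(x, y): pair_counts[(x, y)] / x_counts[x] for (x, y) in pair_counts}

print(cpt)  # P(y0 | x0) = 2/3, P(y1 | x0) = 1/3, P(y1 | x1) = 2/3, P(y0 | x1) = 1/3
```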

Take Home Messages
Bayesian techniques formulate learning as probabilistic inference
ML learning selects the hypothesis that maximizes the data likelihood
MAP learning selects the most likely hypothesis given the data (a regularization of ML)
Conditional independence relations can be expressed by a Bayesian Network
Parameter Learning (next lecture)
Structure Learning