
Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric?

Zoubin Ghahramani
Department of Engineering, University of Cambridge, UK
zoubin@eng.cam.ac.uk
http://learning.eng.cam.ac.uk/zoubin/

BARK 2008

Some Canonical Machine Learning Problems

- Linear Classification
- Nonlinear Regression
- Clustering with Gaussian Mixtures (Density Estimation)

Example: Linear Classification

Data: $D = \{(x^{(n)}, y^{(n)})\}$ for $n = 1, \ldots, N$ data points, with $x^{(n)} \in \mathbb{R}^D$ and $y^{(n)} \in \{+1, -1\}$.

[Figure: a scatter of data points from the two classes, plotted as x's and o's.]

Parameters: $\theta \in \mathbb{R}^{D+1}$

$$P(y^{(n)} = +1 \mid \theta, x^{(n)}) = \begin{cases} 1 & \text{if } \sum_{d=1}^{D} \theta_d x_d^{(n)} + \theta_0 \geq 0 \\ 0 & \text{otherwise} \end{cases}$$

Goal: To infer $\theta$ from the data and to predict future labels $P(y \mid D, x)$.
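To make the prediction step concrete, here is a minimal Python sketch (not from the talk) of the threshold likelihood above, with the predictive $P(y \mid D, x)$ approximated by averaging over hypothetical posterior samples of $\theta$; the sample values below are placeholders standing in for the output of a real inference procedure.

```python
# Minimal sketch: the hard-threshold likelihood above, plus a Monte Carlo
# approximation of P(y = +1 | D, x) by averaging over posterior samples of theta.
import numpy as np

def p_y_given_theta_x(theta, theta0, x):
    """P(y = +1 | theta, x): 1 if the linear score is non-negative, else 0."""
    return 1.0 if np.dot(theta, x) + theta0 >= 0 else 0.0

def predictive(posterior_samples, x):
    """Approximate P(y = +1 | D, x) = E_{theta ~ p(theta | D)}[P(y = +1 | theta, x)]."""
    return np.mean([p_y_given_theta_x(th, th0, x) for th, th0 in posterior_samples])

# Hypothetical posterior samples (in practice these would come from MCMC, VB, etc.)
rng = np.random.default_rng(0)
samples = [(rng.normal(size=2), rng.normal()) for _ in range(1000)]
print(predictive(samples, np.array([0.5, -1.0])))
```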

Basic Rules of Probability

- $P(x)$: probability of $x$
- $P(x \mid \theta)$: conditional probability of $x$ given $\theta$
- $P(x, \theta)$: joint probability of $x$ and $\theta$

$$P(x, \theta) = P(x) P(\theta \mid x) = P(\theta) P(x \mid \theta)$$

Bayes Rule:

$$P(\theta \mid x) = \frac{P(x \mid \theta) P(\theta)}{P(x)}$$

Marginalization:

$$P(x) = \int P(x, \theta) \, d\theta$$
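A tiny numerical check of these rules, with $\theta$ restricted to two values and made-up probabilities, might look like this:

```python
# Toy check of the product rule, Bayes rule and marginalization,
# with theta taking two values (numbers chosen only for illustration).
p_theta = {"a": 0.3, "b": 0.7}          # P(theta)
p_x_given_theta = {"a": 0.9, "b": 0.2}  # P(x | theta)

# Marginalization: P(x) = sum_theta P(x | theta) P(theta)
p_x = sum(p_x_given_theta[t] * p_theta[t] for t in p_theta)

# Bayes rule: P(theta | x) = P(x | theta) P(theta) / P(x)
p_theta_given_x = {t: p_x_given_theta[t] * p_theta[t] / p_x for t in p_theta}

print(p_x)              # 0.9*0.3 + 0.2*0.7 = 0.41
print(p_theta_given_x)  # the posterior probabilities sum to 1
```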

Bayes Rule Applied to Machine Learning

$$P(\theta \mid D) = \frac{P(D \mid \theta) P(\theta)}{P(D)}$$

- $P(D \mid \theta)$: likelihood of $\theta$
- $P(\theta)$: prior probability of $\theta$
- $P(\theta \mid D)$: posterior of $\theta$ given $D$

Model Comparison:

$$P(m \mid D) = \frac{P(D \mid m) P(m)}{P(D)}, \qquad P(D \mid m) = \int P(D \mid \theta, m) P(\theta \mid m) \, d\theta$$

Prediction:

$$P(x \mid D, m) = \int P(x \mid \theta, D, m) P(\theta \mid D, m) \, d\theta$$
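As an illustration (not part of the talk), all of these quantities are available in closed form for a coin-flip model with a conjugate Beta prior; the counts and hyperparameters below are arbitrary.

```python
# Sketch: posterior, marginal likelihood and predictive for a Bernoulli model
# with a conjugate Beta(a, b) prior, all in closed form.
from math import lgamma, exp

def log_beta(a, b):
    """log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

a, b = 1.0, 1.0       # prior P(theta | m) = Beta(a, b)
heads, tails = 7, 3   # observed data D (a particular sequence of flips)

# Marginal likelihood P(D | m) = B(a + heads, b + tails) / B(a, b)
log_evidence = log_beta(a + heads, b + tails) - log_beta(a, b)

# Posterior P(theta | D, m) = Beta(a + heads, b + tails);
# predictive P(next flip = head | D, m) is its mean.
p_next_head = (a + heads) / (a + heads + b + tails)

print(exp(log_evidence), p_next_head)
```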

That's it!

Should all Machine Learning be Bayesian?

- Why be Bayesian?
- Where does the prior come from?
- How do we do these integrals?

Representing Beliefs (Artificial Intelligence)

Consider a robot. In order to behave intelligently, the robot should be able to represent beliefs about propositions in the world:

- "my charging station is at location (x, y, z)"
- "my rangefinder is malfunctioning"
- "that stormtrooper is hostile"

We want to represent the strength of these beliefs numerically in the brain of the robot, and we want to know what rules (calculus) we should use to manipulate those beliefs.

Representing Beliefs II

Let's use $b(x)$ to represent the strength of belief in (plausibility of) proposition $x$:

- $0 \leq b(x) \leq 1$
- $b(x) = 0$: $x$ is definitely not true
- $b(x) = 1$: $x$ is definitely true
- $b(x \mid y)$: strength of belief that $x$ is true given that we know $y$ is true

Cox Axioms (Desiderata):

- Strengths of belief (degrees of plausibility) are represented by real numbers.
- Qualitative correspondence with common sense.
- Consistency:
  - If a conclusion can be reasoned in more than one way, then every way should lead to the same answer.
  - The robot always takes into account all relevant evidence.
  - Equivalent states of knowledge are represented by equivalent plausibility assignments.

Consequence: belief functions (e.g. $b(x)$, $b(x \mid y)$, $b(x, y)$) must satisfy the rules of probability theory, including Bayes rule. (Cox 1946; Jaynes, 1996; van Horn, 2003)

The Dutch Book Theorem

Assume you are willing to accept bets with odds proportional to the strength of your beliefs. That is, b(x) = 0.9 implies that you will accept a bet in which you win $1 if x is true and lose $9 if x is false.

Then, unless your beliefs satisfy the rules of probability theory, including Bayes rule, there exists a set of simultaneous bets (called a "Dutch Book") which you are willing to accept, and for which you are guaranteed to lose money, no matter what the outcome.

The only way to guard against Dutch Books is to ensure that your beliefs are coherent: i.e. they satisfy the rules of probability.
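A small numerical illustration (with assumed belief values, not from the talk): if b(x) = b(not x) = 0.6, the beliefs violate probability theory (they sum to more than 1), and accepting the bets they imply loses money whichever way x turns out.

```python
# Dutch book demo for incoherent beliefs b(x) = b(not x) = 0.6.
def net_payoff(belief, event_true, stake=1.0):
    """Bet implied by belief b: win (1 - b)*stake if the event occurs, lose b*stake otherwise."""
    return (1.0 - belief) * stake if event_true else -belief * stake

b_x, b_not_x = 0.6, 0.6
for x_true in (True, False):
    total = net_payoff(b_x, x_true) + net_payoff(b_not_x, not x_true)
    print(f"x is {x_true}: total payoff = {total:+.2f}")  # -0.20 in both cases
```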

Asymptotic Certainty

Assume that the data set $D_n$, consisting of $n$ data points, was generated from some true $\theta^*$. Then, under some regularity conditions, as long as $p(\theta^*) > 0$,

$$\lim_{n \to \infty} p(\theta \mid D_n) = \delta(\theta - \theta^*)$$

In the unrealizable case, where the data was generated from some $p^*(x)$ which cannot be modelled by any $\theta$, the posterior will converge to

$$\lim_{n \to \infty} p(\theta \mid D_n) = \delta(\theta - \hat{\theta})$$

where $\hat{\theta}$ minimizes $\mathrm{KL}(p^*(x), p(x \mid \theta))$:

$$\hat{\theta} = \operatorname*{argmin}_{\theta} \int p^*(x) \log \frac{p^*(x)}{p(x \mid \theta)} \, dx = \operatorname*{argmax}_{\theta} \int p^*(x) \log p(x \mid \theta) \, dx$$

Warning: be careful with the regularity conditions; these are just sketches of the theoretical results.
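A rough sketch of this concentration effect for a coin with an assumed true bias $\theta^* = 0.7$ and a Beta(1, 1) prior (choices made purely for illustration): the posterior standard deviation shrinks as $n$ grows.

```python
# Posterior concentration for a coin: the Beta posterior narrows around theta*.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta_star = 0.7
for n in (10, 100, 1000, 10000):
    heads = rng.binomial(n, theta_star)
    post = stats.beta(1 + heads, 1 + n - heads)   # posterior p(theta | D_n)
    print(n, round(post.mean(), 3), round(post.std(), 4))
```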

Asymptotic Consensus

Consider two Bayesians with different priors, $p_1(\theta)$ and $p_2(\theta)$, who observe the same data $D$. Assume both Bayesians agree on the set of possible and impossible values of $\theta$:

$$\{\theta : p_1(\theta) > 0\} = \{\theta : p_2(\theta) > 0\}$$

Then, in the limit of $n \to \infty$, the posteriors $p_1(\theta \mid D_n)$ and $p_2(\theta \mid D_n)$ will converge (in the uniform distance between distributions, $\rho(P_1, P_2) = \sup_E |P_1(E) - P_2(E)|$).

coin toss demo: bayescoin
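The bayescoin demo itself is not reproduced here, but a sketch in the same spirit, using two deliberately different Beta priors and approximating the uniform distance on a grid of $\theta$ values, could look like this:

```python
# Two Bayesians with different Beta priors see the same coin tosses; their
# posterior CDFs get closer as n grows (sup distance approximated on a grid).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
theta_true, grid = 0.3, np.linspace(0, 1, 1001)
for n in (10, 100, 1000, 10000):
    heads = rng.binomial(n, theta_true)
    post1 = stats.beta(1 + heads, 1 + n - heads)     # from prior Beta(1, 1)
    post2 = stats.beta(20 + heads, 5 + n - heads)    # from prior Beta(20, 5)
    sup_dist = np.max(np.abs(post1.cdf(grid) - post2.cdf(grid)))
    print(n, round(sup_dist, 4))
```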

Bayesian Occam's Razor and Model Comparison

Compare model classes, e.g. $m$ and $m'$, using their posterior probabilities given $D$:

$$p(m \mid D) = \frac{p(D \mid m) \, p(m)}{p(D)}, \qquad p(D \mid m) = \int p(D \mid \theta, m) \, p(\theta \mid m) \, d\theta$$

Interpretation of the Marginal Likelihood ("evidence"): the probability that randomly selected parameters from the prior would generate $D$.

- Model classes that are too simple are unlikely to generate the data set.
- Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random.

[Figure: $P(D \mid m)$ plotted over all possible data sets of size $n$ for "too simple", "just right", and "too complex" model classes.]
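As a worked toy example (my own, with assumed models $m_0$ and $m_1$), the Occam effect shows up already for coin flips: compare a model that fixes $\theta = 0.5$ against a more flexible model with a Beta(1, 1) prior on $\theta$.

```python
# Marginal-likelihood comparison: m0 fixes theta = 0.5; m1 puts Beta(1, 1) on theta.
from math import lgamma, log, exp

def log_evidence_m0(heads, tails):
    """P(D | m0): every flip has probability 0.5."""
    return (heads + tails) * log(0.5)

def log_evidence_m1(heads, tails):
    """P(D | m1): theta ~ Beta(1, 1); the integral equals B(heads + 1, tails + 1)."""
    return lgamma(heads + 1) + lgamma(tails + 1) - lgamma(heads + tails + 2)

for heads, tails in [(5, 5), (9, 1)]:
    bayes_factor = exp(log_evidence_m1(heads, tails) - log_evidence_m0(heads, tails))
    print(heads, tails, round(bayes_factor, 3))
```

For 5 heads and 5 tails the simple model is favoured (Bayes factor below 1); for 9 heads and 1 tail the flexible model wins by a factor of roughly 9.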

Potential advantages of Bayesian Machine Learning over alternatives:

- tries to be coherent and honest about uncertainty
- easy to do model comparison and selection
- rational process for model building and adding domain knowledge
- easy to handle missing and hidden data

Disadvantages: to be discussed later :-)

Where does the prior come from?

- Objective Priors: noninformative priors that attempt to capture ignorance and have good frequentist properties.
- Subjective Priors: priors should capture our beliefs as well as possible. They are subjective but not arbitrary.
- Hierarchical Priors: multiple levels of priors (see the sampling sketch below):

$$p(\theta) = \int d\alpha \, p(\theta \mid \alpha) p(\alpha) = \int d\alpha \, p(\theta \mid \alpha) \int d\beta \, p(\alpha \mid \beta) p(\beta) \quad (\text{etc.})$$

- Empirical Priors: learn some of the parameters of the prior from the data ("Empirical Bayes").
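One way to read the hierarchical formula operationally is ancestral sampling: draw $\beta$, then $\alpha$ given $\beta$, then $\theta$ given $\alpha$. The particular Gamma/Beta choices below are hypothetical, chosen only to make the sketch runnable.

```python
# Sketch of sampling theta from a two-level hierarchical prior:
# beta ~ p(beta), alpha ~ p(alpha | beta), theta ~ p(theta | alpha).
import numpy as np

rng = np.random.default_rng(0)

def sample_theta():
    beta = rng.gamma(shape=2.0, scale=1.0)           # top-level prior p(beta)
    alpha = rng.gamma(shape=2.0, scale=1.0 / beta)   # p(alpha | beta)
    theta = rng.beta(alpha, alpha)                   # p(theta | alpha)
    return theta

print([round(sample_theta(), 3) for _ in range(5)])
```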

Subjective Priors

Priors should capture our beliefs as well as possible. Otherwise we are not coherent.

How do we know our beliefs?

- Think about the problem domain (no "black box" view of machine learning).
- Generate data from the prior. Does it match expectations? (See the sketch below.)

Even very vague beliefs can be useful.
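The "generate data from the prior" check is essentially a prior predictive check. A minimal sketch, with an assumed Beta(2, 2) prior on a coin's bias and illustrative numbers throughout:

```python
# Prior predictive check: draw parameters from the prior, simulate data,
# and ask whether the simulated data looks like data you would expect.
import numpy as np

rng = np.random.default_rng(0)

def simulate_dataset(n=20):
    theta = rng.beta(2, 2)             # draw a coin bias from the prior
    return rng.binomial(1, theta, n)   # simulate a data set under that draw

for _ in range(3):
    x = simulate_dataset()
    print(x.mean())   # do these proportions look plausible for your problem?
```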

Two views of machine learning

The Black Box View: The goal of machine learning is to produce general-purpose black-box algorithms for learning. I should be able to put my algorithm online, so lots of people can download it. If people want to apply it to problems A, B, C, D... then it should work regardless of the problem, and the user should not have to think too much.

The Case Study View: If I want to solve problem A it seems silly to use some general-purpose method that was never designed for A. I should really try to understand what problem A is, learn about the properties of the data, and use as much expert knowledge as I can. Only then should I think of designing a method to solve A.

Bayesian Black Boxes? Can we meaningfully create Bayesian black-boxes? If so, what should the prior be? This seems strange... clearly we can create black boxes, but how can we advocate people blindly using them? Do we require every practitioner to be a well-trained Bayesian statistician?

Parametric vs Non-parametric Models

Terminology (roughly):

Parametric models have a finite, fixed number of parameters $\theta$, regardless of the size of the data set. Given $\theta$, the predictions are independent of the data $D$: $P(x \mid \theta, D) = P(x \mid \theta)$. The parameters are a finite summary of the data. We can also call this model-based learning (e.g. a mixture of k Gaussians).

Non-parametric models allow the number of parameters to grow with the data set size; alternatively, we can think of the predictions as depending on the data, and possibly on a (usually small) number of parameters $\alpha$: $p(x \mid D, \alpha)$. We can also call this memory-based learning (e.g. kernel density estimation).
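A small sketch contrasting the two styles on one-dimensional density estimation (my example, not the talk's): a parametric Gaussian fit keeps only a two-number summary, while a kernel density estimate carries the whole data set into its predictions.

```python
# Parametric (finite summary) vs memory-based (all data retained) density estimation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(2, 0.5, 200)])

# Parametric: theta = (mean, std) summarizes D.
mu, sigma = data.mean(), data.std()
parametric = stats.norm(mu, sigma)

# Non-parametric: the KDE's predictions depend on every point in D plus a bandwidth.
kde = stats.gaussian_kde(data)

x = 2.0
print(parametric.pdf(x), kde(x)[0])   # only the KDE captures the bimodal shape
```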

Example: Clustering

Basic idea: each data point belongs to a cluster.

Many clustering methods exist:

- mixture models
- hierarchical clustering
- spectral clustering

Goal: to partition data into groups in an unsupervised manner.

Infinite mixture models (e.g. Dirichlet Process Mixtures)

Why?

- You might not believe a priori that your data comes from a finite number of mixture components (e.g. strangely shaped clusters; heavy tails; structure at many resolutions).
- Inflexible models (e.g. a mixture of 6 Gaussians) can yield unreasonable inferences and predictions.
- For many kinds of data, the number of clusters might grow over time: clusters of news stories or emails, classes of objects, etc.
- You might want your method to automatically infer the number of clusters in the data.
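One way to see where the growing number of clusters comes from is the Chinese restaurant process, the partition distribution underlying a Dirichlet process mixture. A minimal sketch (concentration $\alpha = 1$ chosen arbitrarily) showing that the number of occupied clusters keeps growing, slowly, with the number of data points:

```python
# Chinese restaurant process: each new point joins an existing cluster with
# probability proportional to its size, or starts a new one with prob. prop. to alpha.
import numpy as np

def crp_num_clusters(n, alpha, rng):
    counts = []                                  # points per cluster
    for _ in range(n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)      # existing cluster, or a new one
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
    return len(counts)

rng = np.random.default_rng(0)
for n in (10, 100, 1000, 10000):
    print(n, crp_num_clusters(n, alpha=1.0, rng=rng))
```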

Is non-parametrics the only way to go?

When do we really believe our parametric model? But then, when do we really believe our non-parametric model? Is a non-parametric model (e.g. a DPM) really better than a large parametric model (e.g. a mixture of 100 components)?

The Approximate Inference Conundrum

All interesting models are intractable, so we use approximate inference (MCMC, VB, EP, etc.). Since we often can't control the effect of using approximate inference, are coherence arguments meaningless? Is Subjective Bayesianism pointless?

Reconciling Bayesian and Frequentist Views

Frequentist theory tends to focus on the sampling properties of estimators, i.e. what would have happened had we observed other data sets from our model. It also looks at the minimax performance of methods, i.e. the worst-case performance if the environment is adversarial. Frequentist methods often optimize some penalized cost function.

Bayesian methods focus on the expected loss under the posterior. Bayesian methods generally do not make use of optimization, except at the point at which decisions are to be made.

There are some reasons why frequentist procedures are useful to Bayesians:

- Communication: if Bayesian A wants to convince Bayesians B, C, and D (or even non-Bayesians) of the validity of some inference, then he or she must determine that not only does this inference follow from prior $p_A$ but that it would also have followed from $p_B$, $p_C$ and $p_D$, etc. For this reason it's sometimes useful to find a prior which has good frequentist (sampling / worst-case) properties, even though acting on the prior would not be coherent with our beliefs.
- Robustness: priors with good frequentist properties can be more robust to mis-specifications of the prior. Two ways of dealing with robustness issues are to make sure that the prior is vague enough, and to make use of a loss function to penalize costly errors.
- Also, recently: PAC-Bayesian frequentist bounds on Bayesian procedures.

Cons and pros of Bayesian methods

Limitations and criticisms:

- They are subjective. It is hard to come up with a prior, and the assumptions are usually wrong.
- The closed-world assumption: we need to consider all possible hypotheses for the data before observing the data.
- They can be computationally demanding.
- The use of approximations weakens the coherence argument.

Advantages:

- Coherent.
- Conceptually straightforward.
- Modular.
- Often good performance.

How can we convert the pagan majority of ML researchers to Bayesianism?

Some suggestions:

- Killer apps
- Win more competitions
- Dispel myths; be more proactive, aggressive
- Release good, easy-to-use black-box code??