Lecture 1: Bayesian Framework Basics
Melih Kandemir
melih.kandemir@iwr.uni-heidelberg.de
April 21, 2014

What is this course about?
- Building Bayesian machine learning models
- Performing inference in these models
- Evaluating Bayesian solutions
- Directed graphical models

What is it NOT about?
- Basic machine learning: no SVMs, no neural networks
- Basic probability and statistics in detail
- Advanced probability and statistics
- Advanced Bayesian theory
- Undirected graphical models: no MRFs, no CRFs

Useful text

The term project
- Choose a data set from the provided set (or offer your own)
- Devise a model that solves the related problem (draw its plate diagram)
- Build the inference algorithm (i.e. choose the inference method, derive the necessary equations)
- Implement your model
- Evaluate your model's success
- Interpret your results in a report of approximately 4 pages

Definitions
- Sample space (Ω): the collection of all possible outcomes of a random experiment.
- Event (E): a question about the experiment with a yes/no answer; a subset of the sample space.
- Probability measure: a function that assigns a number P(A) to each event A.

Axioms of probability
- Axiom 1: The probability of an event is a non-negative real number: $P(E) \in \mathbb{R}$, $P(E) \geq 0$ for all $E \subseteq \Omega$.
- Axiom 2: The probability of the entire sample space is 1: $P(\Omega) = 1$.
- Axiom 3: $P(E_1 \cup E_2) = P(E_1) + P(E_2)$ whenever $E_1 \cap E_2 = \emptyset$.

Consequences
- Sum rule: $P(E_1 \cup E_2) = P(E_1) + P(E_2) - P(E_1 \cap E_2)$
- $P(\emptyset) = 0$
- All of set theory is applicable; most of Boolean algebra is applicable.

Conditional probability
Kolmogorov's definition: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$, a.k.a. the product rule.
De Finetti introduces this formulation as an axiom.
Consider the following example¹:
¹ http://blacklen.wordpress.com/2011/02/02/introduction-to-bayesianconditional-probability/
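As an illustration (not from the slides), a minimal sketch of the product rule on a small, hypothetical joint distribution over two binary variables:

```python
# A minimal sketch (not from the slides): conditional probability from a
# discrete joint distribution, illustrating P(A | B) = P(A and B) / P(B).
import numpy as np

# Hypothetical joint distribution over two binary variables A and B.
# Rows index A in {0, 1}, columns index B in {0, 1}; entries sum to 1.
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

p_b1 = joint[:, 1].sum()            # P(B = 1), marginalizing over A
p_a1_and_b1 = joint[1, 1]           # P(A = 1, B = 1)
p_a1_given_b1 = p_a1_and_b1 / p_b1  # Kolmogorov's definition

print(p_a1_given_b1)  # 0.4 / 0.7 ≈ 0.571
```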

Definitions (2)
- Probability density function (PDF): $\Pr[a \leq x \leq b] = \int_a^b p_x(x)\,dx$
- Cumulative distribution function (CDF): $F_x(x) = \int_{-\infty}^{x} p_x(x')\,dx'$
- PDF-CDF relationship: $\Pr[a \leq x \leq b] = F_x(b) - F_x(a)$
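A minimal numerical check (not from the slides), using a standard normal as an assumed example distribution: integrating the PDF over $[a, b]$ agrees with the difference of CDF values.

```python
# A minimal sketch (not from the slides): checking the PDF-CDF relationship
# numerically for a standard normal, Pr[a <= x <= b] = F(b) - F(a).
import numpy as np
from scipy import stats, integrate

a, b = -1.0, 2.0
dist = stats.norm(loc=0.0, scale=1.0)

# Left side: integrate the PDF over [a, b].
prob_from_pdf, _ = integrate.quad(dist.pdf, a, b)

# Right side: difference of the CDF at the endpoints.
prob_from_cdf = dist.cdf(b) - dist.cdf(a)

print(prob_from_pdf, prob_from_cdf)  # both ≈ 0.8186
```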

Definitions (3)
- Expected value: $\mathbb{E}_{p(x)}[x] = \int x\, p_x(x)\,dx$
- Variance: $\mathrm{Var}_{p(x)}[x] = \mathbb{E}_{p(x)}[(x - \mathbb{E}_{p(x)}[x])^2] = \mathbb{E}_{p(x)}[x^2] - (\mathbb{E}_{p(x)}[x])^2$
- Standard deviation: $\sigma(x) = \sqrt{\mathrm{Var}_{p(x)}[x]}$
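A minimal Monte Carlo sketch (not from the slides), with a gamma distribution as an assumed example: the centered and the moment forms of the variance agree.

```python
# A minimal sketch (not from the slides): the two variance formulas agree,
# estimated by Monte Carlo from samples of an assumed example distribution.
import numpy as np

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=3.0, size=1_000_000)  # assumed example distribution

mean = x.mean()
var_centered = np.mean((x - mean) ** 2)       # E[(x - E[x])^2]
var_moments = np.mean(x ** 2) - mean ** 2     # E[x^2] - (E[x])^2

print(mean, var_centered, var_moments)  # mean ≈ 6, both variances ≈ 18
```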

Definitions (4)
- Joint PDF: $\Pr[a \leq x \leq b,\ c \leq y \leq d] = \int_a^b \int_c^d p_{xy}(x, y)\,dy\,dx$
- Covariance: $\mathrm{cov}(x, y) = \mathbb{E}[(x - \mathbb{E}[x])(y - \mathbb{E}[y])]$
- Marginal probability (sum rule): $p(x) = \int p(x, y)\,dy$
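A minimal sketch (not from the slides): marginalization and covariance computed for a small, hypothetical discrete joint distribution.

```python
# A minimal sketch (not from the slides): marginalization and covariance
# for a small, hypothetical discrete joint distribution p(x, y).
import numpy as np

x_vals = np.array([0.0, 1.0, 2.0])
y_vals = np.array([0.0, 1.0])
# joint[i, j] = p(x = x_vals[i], y = y_vals[j]); entries sum to 1.
joint = np.array([[0.10, 0.15],
                  [0.25, 0.20],
                  [0.05, 0.25]])

p_x = joint.sum(axis=1)   # sum rule: marginal of x
p_y = joint.sum(axis=0)   # marginal of y

e_x = np.sum(x_vals * p_x)
e_y = np.sum(y_vals * p_y)
e_xy = np.sum(np.outer(x_vals, y_vals) * joint)
cov_xy = e_xy - e_x * e_y  # cov(x, y) = E[xy] - E[x]E[y]

print(p_x, p_y, cov_xy)
```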

Normal distribution
- PDF: $\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
- CDF: $\frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{x - \mu}{\sqrt{2\sigma^2}}\right)\right]$, where $\mathrm{erf}(x) = \frac{1}{\sqrt{\pi}} \int_{-x}^{x} e^{-t^2}\,dt$.
- Mean: $\mu$
- Variance: $\sigma^2$
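A minimal check (not from the slides): the erf-based expression for the normal CDF matches scipy's implementation, for assumed example values of $\mu$, $\sigma$, and $x$.

```python
# A minimal sketch (not from the slides): the normal CDF expressed through
# erf matches scipy's implementation.
from math import erf, sqrt
from scipy import stats

mu, sigma = 1.5, 2.0   # assumed example parameters
x = 2.3

cdf_via_erf = 0.5 * (1.0 + erf((x - mu) / sqrt(2.0 * sigma ** 2)))
cdf_scipy = stats.norm.cdf(x, loc=mu, scale=sigma)

print(cdf_via_erf, cdf_scipy)  # both ≈ 0.6554
```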

Normal distribution (2)
[Figures: PDF, CDF, and standard-deviation illustrations, from http://en.wikipedia.org/wiki/normal_distribution]

Multivariate normal distribution
- PDF: $\mathcal{N}(x \mid \mu, \Sigma) = (2\pi)^{-\frac{D}{2}}\, |\Sigma|^{-\frac{1}{2}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}$
- CDF: N/A (no closed form).
- Mean: $\mu$
- Covariance: $\Sigma$
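A minimal check (not from the slides): evaluating the density formula directly with numpy and comparing against scipy, for an assumed mean vector and covariance matrix.

```python
# A minimal sketch (not from the slides): evaluating the multivariate normal
# density directly from the formula and comparing with scipy.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])           # assumed mean
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])      # assumed covariance
x = np.array([0.5, 0.5])
D = len(mu)

diff = x - mu
quad_form = diff @ np.linalg.solve(Sigma, diff)
pdf_manual = (2 * np.pi) ** (-D / 2) * np.linalg.det(Sigma) ** (-0.5) * np.exp(-0.5 * quad_form)
pdf_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)

print(pdf_manual, pdf_scipy)  # identical up to floating-point error
```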

Multivariate normal distribution (2)
[Figure: multivariate normal density, from http://en.wikipedia.org/wiki/multivariate_normal_distribution]

Central limit theorem
Let $x_1, x_2, \ldots, x_N$ be $N$ independent random variables with means $\mu_1, \mu_2, \ldots, \mu_N$ and standard deviations $\sigma_1, \sigma_2, \ldots, \sigma_N$. Then the variate
$$X_{\mathrm{NORM}} = \frac{\sum_{i=1}^{N} x_i - \sum_{i=1}^{N} \mu_i}{\sqrt{\sum_{i=1}^{N} \sigma_i^2}}$$
has a limiting CDF which approaches a normal distribution.
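As an illustration (not from the slides), a short simulation sketch: standardized sums of i.i.d. exponential variables (an assumed example distribution, far from normal individually) approach a standard normal.

```python
# A minimal sketch (not from the slides): simulating the CLT with
# exponential random variables.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N, trials = 50, 100_000
lam = 2.0                      # rate of the assumed exponential distribution
mu_i, sigma_i = 1 / lam, 1 / lam

samples = rng.exponential(scale=1 / lam, size=(trials, N))
x_norm = (samples.sum(axis=1) - N * mu_i) / np.sqrt(N * sigma_i ** 2)

# The standardized sums should look standard normal: mean ~0, std ~1,
# and roughly 97.7% of the mass below 2.
print(x_norm.mean(), x_norm.std(), (x_norm < 2).mean(), stats.norm.cdf(2))
```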

Bayes' Theorem
$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$$
- $x \in \mathcal{X}$ is an observable in the sample space $\mathcal{X}$.
- $\theta$ is the vector of model parameters. It is an index to a frequentist, and a random variable to a Bayesian.
- $p(x \mid \theta)$: likelihood
- $p(\theta)$: prior
- $p(\theta \mid x)$: posterior
- $p(x)$: evidence
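A minimal numerical sketch (not from the slides): Bayes' theorem applied on a grid of candidate coin biases, with a Bernoulli likelihood, a uniform prior, and hypothetical data of 7 heads in 10 tosses.

```python
# A minimal sketch (not from the slides): Bayes' theorem on a grid of
# candidate coin biases theta, with a Bernoulli likelihood and uniform prior.
import numpy as np

theta = np.linspace(0.01, 0.99, 99)   # candidate parameter values
prior = np.full_like(theta, 1 / len(theta))

# Hypothetical data: 7 heads out of 10 tosses.
heads, tosses = 7, 10
likelihood = theta ** heads * (1 - theta) ** (tosses - heads)

evidence = np.sum(likelihood * prior)         # p(x), summing over theta
posterior = likelihood * prior / evidence     # p(theta | x)

print(theta[np.argmax(posterior)])            # posterior mode ≈ 0.7
```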

Independence and Conditional Independence
- Independence: $P(E_1 \cap E_2) = P(E_1)\,P(E_2)$
- Conditional independence: $P(E_1 \cap E_2 \mid E_3) = P(E_1 \mid E_3)\,P(E_2 \mid E_3)$

Independent and identically distributed (i.i.d.) random variables
Let $x_1, x_2, \ldots, x_N$ be $N$ random variables corresponding to $N$ observations of an experiment. They are defined to be independent and identically distributed (i.i.d.) if:
- all random variables $x_i$ have the same probability distribution, and
- the observation events are mutually independent.

Exchangeability
The random variables $(x_1, x_2, \ldots, x_N)$ are exchangeable if for any permutation $\pi$ the following equality holds:
$$p(x_1, x_2, \ldots, x_N) = p(x_{\pi_1}, x_{\pi_2}, \ldots, x_{\pi_N}).$$

Frequentist and Bayesian views
Is probability subjective or objective?
- For frequentists, it is an objective measure: $P(E) = \frac{\text{number of times event } E \text{ occurs}}{\text{number of trials}}$
- For Bayesians, it is a measure of how likely it is that event E occurs.

"The classical view, based on physical considerations of symmetry, in which one should be obliged to give the same probability to such symmetric cases. But which symmetry? And, in any case, why? The original sentence becomes meaningful if reversed: the symmetry is probabilistically significant, in someone's opinion, if it leads him to assign the probabilities to such events." (de Finetti, 1970/74, Preface, xi-xii)

Motivation 1: De Finetti's Theorem
A sequence of random variables $(x_1, x_2, \ldots, x_N)$ is infinitely exchangeable iff, for any $N$,
$$p(x_1, x_2, \ldots, x_N) = \int \prod_{i=1}^{N} p(x_i \mid \theta)\, P(d\theta).$$
Here $P(d\theta) = p(\theta)\,d\theta$ if $\theta$ has a density.
Implications:
- Exchangeability can be checked from the right-hand side.
- There must exist a parameter $\theta$!
- There must exist a likelihood $p(x \mid \theta)$!
- There must exist a distribution $P$ on $\theta$.
- These three components are prerequisites for the data to be conditionally independent!
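As an illustration (not from the slides), a minimal sketch assuming a Beta prior mixed over a Bernoulli likelihood: integrating $\theta$ out gives a joint probability that depends only on the number of ones in the sequence, so every permutation of the data has the same probability, exactly the exchangeability the representation guarantees.

```python
# A minimal sketch (not from the slides), assuming a Beta(a, b) prior over a
# Bernoulli likelihood. Integrating theta out gives
#   p(x_1, ..., x_N) = B(a + #ones, b + #zeros) / B(a, b),
# which depends only on the count of ones, hence is exchangeable.
from itertools import permutations
import numpy as np
from scipy.special import betaln

a, b = 2.0, 3.0   # assumed prior hyperparameters

def joint(xs):
    ones = sum(xs)
    zeros = len(xs) - ones
    return np.exp(betaln(a + ones, b + zeros) - betaln(a, b))

xs = (1, 0, 1, 1, 0)
# Every permutation of the observed sequence has the same joint probability.
print({round(joint(p), 12) for p in set(permutations(xs))})  # a single value
```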

Motivation 2: Statistical Decision Theory
Loss function: $l(\theta, \delta(x))$, where $\delta(x)$ is a decision based on data $x$. It determines the penalty for deciding $\delta(x)$ when $\theta$ is the true parameter.
E.g. squared loss: $l(\theta, \delta(x)) = (\theta - \delta(x))^2$. However, $\delta(x)$ does not have to be an estimate of $\theta$.

Frequentist Risk
$$R(\theta, \delta) = \mathbb{E}[l(\theta, \delta(x))]$$
for a fixed $\theta$, averaging over different $x \in \mathcal{X}$.
How to decide which decision rule is better:
- Admissibility: never dominated everywhere by another decision rule. Not practical; a decision rule rarely dominates another in real cases.
- Restricted classes of procedures: for instance, we can restrict ourselves to unbiased procedures (i.e. $\mathbb{E}_\theta[\hat{\theta}] = \theta$). Many good procedures are biased; moreover, some unbiased procedures are inadmissible.
- Minimax: choose the rule with the lowest worst-case (maximum) risk.
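As a quick illustration (not from the slides), a Monte Carlo sketch assuming squared loss and a normal model: the frequentist risk of the sample mean is compared with a hypothetical shrinkage estimator, showing that neither rule dominates for every $\theta$.

```python
# A minimal sketch (not from the slides): Monte Carlo estimate of the
# frequentist risk R(theta, delta) under squared loss for two estimators of
# a normal mean: the sample mean and a hypothetical shrinkage estimator.
import numpy as np

rng = np.random.default_rng(2)
n, sigma, trials = 10, 1.0, 200_000

def risk(theta, estimator):
    x = rng.normal(loc=theta, scale=sigma, size=(trials, n))
    return np.mean((estimator(x) - theta) ** 2)   # E_x[(delta(x) - theta)^2]

sample_mean = lambda x: x.mean(axis=1)
shrink = lambda x: 0.8 * x.mean(axis=1)           # shrinks toward zero

for theta in [0.0, 0.5, 2.0]:
    print(theta, risk(theta, sample_mean), risk(theta, shrink))
# Shrinkage wins near theta = 0 but loses for large theta.
```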

Motivation 3: Birnbaum's Principles
- Conditionality principle: if an experiment concerning inference about $\theta$ is chosen from a collection of possible experiments independently of $\theta$, then any experiment not chosen is irrelevant to the inference.
- Likelihood principle: the relevant information in any inference about $\theta$ after $x$ is observed is contained entirely in the likelihood function.
- Sufficiency principle: if two different observations $x$, $y$ are such that $T(x) = T(y)$ for a sufficient statistic $T$, then inference based on $x$ and $y$ should be the same.

Bayesian decision theory
Posterior risk:
$$\rho(\pi, \delta(x)) = \int l(\theta, \delta(x))\, p(\theta \mid x)\, d\theta,$$
where $p(\theta \mid x) \propto p(x \mid \theta)\,\pi(\theta)$.
The Bayes action $\delta^*(x)$ for any fixed $x$ is the decision $\delta(x)$ that minimizes the posterior risk.

Bayesian decision theory (2)
For example, let us calculate the posterior risk for $l(\theta, \delta(x)) = (\theta - \delta(x))^2$:
$$\rho = \int (\theta - \delta(x))^2\, p(\theta \mid x)\, d\theta = \delta(x)^2 - 2\delta(x) \int \theta\, p(\theta \mid x)\, d\theta + \int \theta^2\, p(\theta \mid x)\, d\theta,$$
and setting the derivative to zero,
$$\frac{\partial \rho}{\partial \delta(x)} = 2\delta(x) - 2\int \theta\, p(\theta \mid x)\, d\theta = 0 \quad\Rightarrow\quad \delta^*(x) = \int \theta\, p(\theta \mid x)\, d\theta,$$
the Bayes action turns out to be the posterior mean!
For $l(\theta, \delta(x)) = |\theta - \delta(x)|$, the optimal decision is the posterior median.
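A numerical check (not from the slides), assuming a Beta(3, 7) posterior: minimizing the discretized posterior risk over a grid of decisions recovers the posterior mean under squared loss and the posterior median under absolute loss.

```python
# A minimal sketch (not from the slides): for an assumed Beta posterior, the
# decision minimizing posterior expected squared loss is the posterior mean,
# and for absolute loss it is the posterior median.
import numpy as np
from scipy import stats

posterior = stats.beta(a=3, b=7)           # assumed posterior p(theta | x)
theta = np.linspace(1e-4, 1 - 1e-4, 10_000)
w = posterior.pdf(theta)
w /= w.sum()                                # discretized posterior weights

candidates = np.linspace(0.01, 0.99, 981)
sq_risk = [(w * (theta - d) ** 2).sum() for d in candidates]
abs_risk = [(w * np.abs(theta - d)).sum() for d in candidates]

print(candidates[np.argmin(sq_risk)], posterior.mean())     # ≈ 0.30
print(candidates[np.argmin(abs_risk)], posterior.median())  # ≈ 0.29
```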

Comparison
- Both approaches use loss functions.
- Frequentists integrate out $x$: the risk averages over data for a fixed $\theta$.
- Bayesians integrate out $\theta$: the posterior risk averages over parameters for a fixed $x$.

Posterior predictive distribution
Given a posterior $p(\theta \mid x)$ and a new observation $x^*$, the posterior predictive distribution is
$$p(x^* \mid x) = \int p(x^* \mid \theta)\, p(\theta \mid x)\, d\theta = \mathbb{E}_{p(\theta \mid x)}[p(x^* \mid \theta)].$$
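A minimal sketch (not from the slides), assuming a Beta-Bernoulli model with hypothetical data: the posterior predictive probability of the next observation being 1, computed by numerical integration, by Monte Carlo averaging, and in closed form.

```python
# A minimal sketch (not from the slides): the Beta-Bernoulli posterior
# predictive, computed by numerical integration and by Monte Carlo averaging.
import numpy as np
from scipy import stats, integrate

a, b = 2.0, 2.0            # assumed Beta prior
heads, tosses = 7, 10      # hypothetical observed data
posterior = stats.beta(a + heads, b + tosses - heads)

# p(x* = 1 | x) = integral of theta * p(theta | x) dtheta = E[theta | x]
pred_quad, _ = integrate.quad(lambda t: t * posterior.pdf(t), 0, 1)
pred_mc = posterior.rvs(size=200_000, random_state=0).mean()

print(pred_quad, pred_mc, (a + heads) / (a + b + tosses))  # all ≈ 0.643
```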

Supervised learning
Given a set of observations $x_1, x_2, \ldots, x_N$ and the corresponding outcomes (labels) $y_1, y_2, \ldots, y_N$, learn a function $y = f(x)$.
A naive solution is linear regression⁴: $y = w^T x$.
⁴ http://commons.wikimedia.org/wiki/file:linear-regression.svg
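A minimal sketch (not from the slides): ordinary least squares for $y = w^T x$ on synthetic data using numpy's solver; the weights and noise level are assumptions for illustration.

```python
# A minimal sketch (not from the slides): fitting y = w^T x by least squares
# on synthetic data, using numpy's solver.
import numpy as np

rng = np.random.default_rng(3)
N = 200
X = np.column_stack([np.ones(N), rng.uniform(-3, 3, size=N)])  # bias + feature
w_true = np.array([1.0, 2.5])                                  # assumed true weights
y = X @ w_true + rng.normal(scale=0.5, size=N)

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizes ||y - Xw||^2
print(w_hat)                                     # ≈ [1.0, 2.5]
```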

Types of supervised learning
- Classification: $y \in \{a, b, c, \ldots, k\}$
- Regression: $y \in \mathbb{R}$
- Semi-supervised learning: a (large) subset of the training set does not have labels.
- Active learning: the model asks for labels of the most important observations.
- Structured output learning: $y$ is a structure (e.g. a graph).

Unsupervised learning
Given a set of observations $x_1, x_2, \ldots, x_N$, learn a model that performs some task X. A common choice of X is to infer groups of data points, called clusters; this problem is called clustering⁵.
⁵ http://en.wikipedia.org/wiki/cluster_analysis
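As an illustration (not from the slides), a minimal sketch of a few iterations of k-means, one common clustering algorithm, implemented directly with numpy on synthetic two-dimensional data.

```python
# A minimal sketch (not from the slides): a few iterations of k-means
# clustering on synthetic 2-D data, implemented directly with numpy.
import numpy as np

rng = np.random.default_rng(4)
# Two hypothetical clusters of points.
data = np.vstack([rng.normal([0, 0], 0.5, size=(100, 2)),
                  rng.normal([4, 4], 0.5, size=(100, 2))])

k = 2
centers = data[rng.choice(len(data), size=k, replace=False)]  # random init
for _ in range(10):
    # Assign each point to the nearest center.
    labels = np.argmin(np.linalg.norm(data[:, None] - centers[None], axis=2), axis=1)
    # Move each center to the mean of its assigned points.
    centers = np.array([data[labels == j].mean(axis=0) for j in range(k)])

print(centers)  # ≈ [[0, 0], [4, 4]] (in some order)
```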

Discriminative versus generative models
- Joint model: $p(x, y)$
- Generative model: $p(y \mid x) = \frac{p(y)\,p(x \mid y)}{p(x)}$
- A discriminative model works directly with $p(y \mid x)$.

Parametric and nonparametric models
- Parametric model: the structure of the training data is captured in a predetermined, fixed set of parameters. These parameters are sufficient for prediction; there is no need to store the training data.
- Nonparametric model: the number of parameters grows with the size of the training data, and the training data also has to be stored for prediction.