Statistical Inference: Parametric Inference, Maximum Likelihood Inference, Exponential Families, Expectation Maximization (EM), Bayesian Inference, Statistical Decision Theory. IP, José Bioucas Dias, IST, 2007

Statistical Inference. Statistics aims at retrieving the causes (e.g., the parameters of a pdf) from the observations (effects). Statistical inference problems can thus be seen as Inverse Problems. As a result of this perspective, in the eighteenth century (at the time of Bayes and Laplace) Statistics was often called Inverse Probability.

Parametric Inference. Consider a parametric model, where the parameter belongs to the parameter space. The problem of inference reduces to estimating the parameter from the observations. Parameters of interest and nuisance parameters: sometimes we are only interested in some function that depends on part of the parameter (the parameter of interest), while the remaining components are nuisance parameters.

Parametric Inference (theoretical limits). The Cramer-Rao Lower Bound (CRLB): under appropriate regularity conditions, the covariance matrix of any unbiased estimator is bounded below by the inverse of the Fisher information matrix. An unbiased estimator that attains the CRLB exists iff the score function factorizes in terms of some function h of the data; the efficient estimator is then given by h.
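
For reference, the standard statement in this notation is cov(f_hat) >= I(f)^(-1) (in the matrix sense), with Fisher information I(f) = -E[d^2 ln p(g; f) / df df^T]. The short sketch below is illustrative, not taken from the slides (it assumes NumPy and picks arbitrary numbers): it checks numerically that the sample mean of IID Gaussian data with known variance attains the CRLB.

# Minimal sketch (illustrative): Monte Carlo check that the sample mean attains
# the CRLB for the mean f of N(f, sigma^2) data with sigma known.
import numpy as np

rng = np.random.default_rng(0)
f_true, sigma, N, trials = 2.0, 1.5, 50, 20000

# Fisher information of N IID samples: I(f) = N / sigma^2, so CRLB = sigma^2 / N.
crlb = sigma**2 / N

g = rng.normal(f_true, sigma, size=(trials, N))  # simulated observations
f_hat = g.mean(axis=1)                           # MLE = sample mean (unbiased)
print("empirical variance of f_hat:", f_hat.var(), " CRLB:", crlb)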

CRLB for the general Gaussian case. Example: parameter of a signal in white noise. Example: known signal in unknown white noise.

Maximum Likelihood Method. The likelihood function is the observation density viewed as a function of the parameter f. If the likelihood is strictly positive for all f, we can equivalently maximize the log-likelihood. Example (Bernoulli).
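
A small sketch of what the Bernoulli example amounts to (illustrative, assuming NumPy): for IID Bernoulli(f) observations the log-likelihood is sum(g)*log(f) + (n - sum(g))*log(1 - f), maximized at the sample mean.

# Minimal sketch: the Bernoulli MLE equals the sample proportion of ones.
import numpy as np

rng = np.random.default_rng(1)
f_true, n = 0.3, 200
g = rng.binomial(1, f_true, size=n)          # IID Bernoulli observations

def log_lik(f):
    return g.sum() * np.log(f) + (n - g.sum()) * np.log(1 - f)

grid = np.linspace(0.01, 0.99, 981)
f_mle_grid = grid[np.argmax(log_lik(grid))]  # brute-force maximizer
print("closed form:", g.mean(), " grid search:", f_mle_grid)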

Maximum Likelihood Example (Uniform).

Maximum Likelihood Example (Gaussian, IID). The ML estimates are the sample mean and the sample variance.
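
As a reminder (a standard fact, not recovered from the slide), the Gaussian MLEs are the sample mean and the sample variance with a 1/N factor, which is slightly biased. A quick numerical illustration (assuming NumPy; numbers are arbitrary):

# Minimal sketch: Gaussian MLEs (sample mean, 1/N sample variance).
import numpy as np

rng = np.random.default_rng(2)
g = rng.normal(4.0, 2.0, size=1000)

mu_mle = g.mean()
var_mle = ((g - mu_mle) ** 2).mean()   # divides by N (the MLE), not N-1
print("mean MLE:", mu_mle, " variance MLE:", var_mle, " unbiased variance:", g.var(ddof=1))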

Maximum Likelihood Example (Multivariate Gaussian, IID). The ML estimates are the sample mean and the sample covariance.

Maximum Likelihood (linear observation model). Example: linear observation in Gaussian noise, with A full rank.

Example: Linear observation in Gaussian noise (cont.). The MLE is equivalent to the least-squares estimate (LSE) under the corresponding norm. If the noise covariance is proportional to the identity, the MLE is given by the Moore-Penrose pseudo-inverse applied to the data, and the associated matrix is a projection matrix (seen via the SVD). If the noise is zero-mean but not Gaussian, the Best Linear Unbiased Estimator (BLUE) is still given by the same expression.
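
A compact numerical sketch of this point (assuming NumPy; the matrices and the noise level are illustrative): for g = A f + w with white Gaussian noise, the MLE coincides with the least-squares solution computed from the pseudo-inverse.

# Minimal sketch: MLE for the linear Gaussian model = least squares = pinv(A) @ g.
import numpy as np

rng = np.random.default_rng(3)
m, n = 30, 4
A = rng.normal(size=(m, n))                 # full-rank observation matrix
f_true = np.array([1.0, -2.0, 0.5, 3.0])
g = A @ f_true + 0.1 * rng.normal(size=m)   # white Gaussian noise

f_pinv = np.linalg.pinv(A) @ g
f_lstsq, *_ = np.linalg.lstsq(A, g, rcond=None)
print("pinv solution:  ", f_pinv)
print("lstsq solution: ", f_lstsq)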

Maximum likelihood: linear observation in Gaussian noise. MLE properties (the MLE is optimal for the linear model): it is the Minimum Variance Unbiased (MVU) estimator (its variance is the minimum among all unbiased estimators); it is efficient (it attains the Cramer-Rao Lower Bound (CRLB)); and its PDF is Gaussian.

Maximum likelihood (characterization). Appealing properties of the MLE, for a sequence of IID vectors: 1. The MLE is consistent (it converges to the true parameter). 2. The MLE is equivariant: the MLE of a function of the parameter is that function of the MLE. 3. Under appropriate regularity conditions, the MLE is asymptotically Normal and optimal or efficient, with asymptotic covariance given by the inverse Fisher information matrix.
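
A quick simulation of property 3 (illustrative sketch, assuming NumPy): for Bernoulli(f) data the Fisher information per sample is 1/(f(1-f)), so sqrt(N)(f_hat - f) should be approximately Normal with variance f(1-f).

# Minimal sketch: asymptotic normality of the Bernoulli MLE.
import numpy as np

rng = np.random.default_rng(4)
f_true, N, trials = 0.3, 500, 20000

g = rng.binomial(1, f_true, size=(trials, N))
f_hat = g.mean(axis=1)
z = np.sqrt(N) * (f_hat - f_true)

print("empirical variance of sqrt(N)(f_hat - f):", z.var())
print("1 / Fisher information per sample:       ", f_true * (1 - f_true))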

The Exponential Family. Definition: a set of densities is an exponential family of dimension k if there are functions such that the density factors into the exponential-family form; the statistic appearing in the exponent is a sufficient statistic for f. Theorem (Neyman-Fisher Factorization): a statistic is sufficient for f iff the density can be factored into a term that depends on the data only through that statistic (and on f) and a term that does not depend on f.

The exponential family: natural (or canonical) form. Given an exponential family, it is always possible to introduce a change of variables and a reparameterization so that the density takes the canonical form. Since it is a PDF, it must integrate to one, which defines the partition function.

The exponential family (the partition function). Computing moments from the derivatives of the partition function: after some calculus, the derivatives yield the moments of the sufficient statistic.
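
As an illustration of this identity (a sketch, not recovered from the slide): for a Bernoulli in canonical form with natural parameter eta, the log-partition function is A(eta) = log(1 + exp(eta)), and its derivative reproduces the mean of the sufficient statistic. A finite-difference check (assuming NumPy):

# Minimal sketch: d/d_eta of the log-partition function equals the mean of the
# sufficient statistic (Bernoulli in natural parameterization; the mean is the sigmoid).
import numpy as np

def log_partition(eta):
    return np.log1p(np.exp(eta))

eta, h = 0.7, 1e-6
numeric_mean = (log_partition(eta + h) - log_partition(eta - h)) / (2 * h)
analytic_mean = 1.0 / (1.0 + np.exp(-eta))   # sigmoid(eta) = E[g]
print("finite difference:", numeric_mean, " sigmoid:", analytic_mean)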

The exponential family (IID sequences). Let the observation density be a member of an exponential family. The density of an IID sequence then also belongs to an exponential family, with the sufficient statistics summed over the samples.

Examples of exponential families. Many of the most common probabilistic models belong to exponential families; e.g., Gaussian, Poisson, Bernoulli, binomial, exponential, gamma, and beta. Example: Canonical form

Examples of exponential families (Gaussian). Example: Canonical form

Computing maximum likelihood estimates. Very often the MLE cannot be found analytically. Commonly used numerical methods: 1. Newton-Raphson, 2. Scoring, 3. Expectation Maximization (EM). The Newton-Raphson method uses the observed second derivative of the log-likelihood; the scoring method replaces it with the Fisher information, which can be computed off-line.
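
A minimal sketch of Newton-Raphson for an MLE with no closed form (illustrative and assuming NumPy; the Cauchy location model is a swapped-in example, not necessarily the slide's): the score and its derivative are written out explicitly and the iteration starts at the sample median.

# Minimal sketch: Newton-Raphson iterations for the MLE of a Cauchy location parameter.
import numpy as np

rng = np.random.default_rng(5)
f_true = 1.0
g = f_true + rng.standard_cauchy(size=200)   # IID Cauchy observations

f = np.median(g)                             # robust starting point
for k in range(20):
    u = g - f
    score = np.sum(2 * u / (1 + u**2))               # d log-likelihood / d f
    hess = np.sum(2 * (u**2 - 1) / (1 + u**2)**2)    # second derivative
    f = f - score / hess
print("Newton-Raphson MLE:", f, " (true value:", f_true, ")")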

Computing maximum likelihood estimates (EM). Expectation Maximization (EM) [Dempster, Laird, and Rubin, 1977]. Suppose the likelihood of the observed data is hard to maximize, but we can find a vector z such that the complete-data likelihood is easy to maximize. Idea: iterate between two steps. E-step: fill in z. M-step: maximize. Terminology: observed data, missing data, complete data.

Expectation maximization: the EM algorithm. 1. Pick a starting vector and repeat steps 2 and 3. 2. E-step: calculate the expected complete-data log-likelihood. 3. M-step: maximize it with respect to the parameter. Alternatively, the M-step may only increase the objective (GEM).

Expectation maximization. The EM (GEM) algorithm always increases the likelihood; the argument combines a decomposition of the log-likelihood with the non-negativity of the Kullback-Leibler distance and the M-step maximization.

Expectation maximization (why does it work?)

EM: Mixtures of densities. Let a discrete random variable select the active mode, so that the density is a weighted sum of mode densities with non-negative weights summing to one.

EM: Mixtures of densities. Consider now a sequence of IID observations, and let the corresponding IID latent variables select the active mode for each sample.

EM: Mixtures of densities. An equivalent form of Q, written in terms of the sample mean of x.

EM: Mixtures of densities. E-step: M-step:

EM: Mixtures of densities. E-step: M-step:

EM: Mixtures of Gaussian densities (MOGs). E-step: compute the posterior mode probabilities (responsibilities). M-step: update the weights, the weighted sample means, and the weighted sample covariances.
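
The sketch below spells out the standard EM updates for a 1D mixture of Gaussians (assuming NumPy; the true parameters roughly match the example on the next slides, and the initialization and other details are illustrative choices).

# Minimal sketch: EM for a 1D mixture of Gaussians (standard updates, illustrative code).
import numpy as np

rng = np.random.default_rng(6)

# True mixture roughly matching the slides' example: (mean, variance, weight).
true = [(0.0, 1.0, 0.6316), (3.0, 3.0, 0.3158), (6.0, 10.0, 0.0526)]
N = 1900
modes = rng.choice(3, size=N, p=[t[2] for t in true])
g = np.array([rng.normal(true[k][0], np.sqrt(true[k][1])) for k in modes])

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Illustrative initialization.
w = np.ones(3) / 3
mu = np.array([-1.0, 2.0, 8.0])
var = np.ones(3)

for it in range(100):
    # E-step: responsibilities r[i, k] = P(mode k | g_i, current parameters).
    dens = np.stack([w[k] * gauss(g, mu[k], var[k]) for k in range(3)], axis=1)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: weighted sample statistics.
    Nk = r.sum(axis=0)
    w = Nk / N
    mu = (r * g[:, None]).sum(axis=0) / Nk
    var = (r * (g[:, None] - mu) ** 2).sum(axis=0) / Nk

print("weights:", w)
print("means:  ", mu)
print("vars:   ", var)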

EM: Mixtures of Gaussian densities, 1D example (p = 3 modes, N = 1900 samples). True (mean, variance, weight) triples: (0, 1, 0.6316), (3, 3, 0.3158), (6, 10, 0.0526). EM estimates: (-0.0288, 1.0287, 0.6258), (2.8952, 2.5649, 0.3107), (6.1687, 7.3980, 0.0635). [Plot: log-likelihood L(f_k) versus EM iteration.]

EM: Mixtures of Gaussian Densities (MOGs), 1D example with true (mean, variance, weight) triples (0, 1, 0.6316), (3, 3, 0.3158), (6, 10, 0.0526), p = 3, N = 1900. [Plots: data histogram with the estimated and true MOG densities (left) and the estimated and true modes (right).]

EM: Mixtures of Gaussian Densities, 2D example. MOG with determination of the number of modes [M. Figueiredo, 2002]. [Plot: 2D example with k = 3 modes.]

Bayesian Estimation

The Bayesian Philosophy ([Wasserman, 2004]). Bayesian Inference: B1 Probabilities describe degrees of belief, not limiting relative frequency. B2 We can make probability statements about parameters, even though they are fixed constants. B3 We make inferences about a parameter by producing a probability distribution for it. Frequentist or Classical Inference: F1 Probabilities refer to limiting relative frequencies and are objective properties of the real world. F2 Parameters are fixed, unknown constants. F3 The criteria for obtaining statistical procedures are based on long-run frequency properties.

The Bayesian Philosophy. [Diagram: the unknown passes through the observation model to produce the observation; Bayesian inference combines the observation model with prior knowledge, which describes degrees of belief (subjective), not limiting frequency; classical inference uses the observation model alone.]

The Bayesian method. 1. Choose a prior density, called the prior (or a priori) distribution, that expresses our beliefs about f before we see any data. 2. Choose the observation model that reflects our beliefs about g given f. 3. Calculate the posterior (or a posteriori) distribution using Bayes' law, where the normalizing term is the marginal on g (other names: evidence, unconditional, predictive). 4. Any inference should be based on the posterior.

The Bayesian method. Example: let the observations be IID Bernoulli(f) and the prior on f be a Beta distribution. [Plot: Beta densities on [0, 1] for equal shape parameters 0.5, 1, 2, and 10; for values greater than 1 the prior pulls f towards 1/2.]

Example (cont.): Bernoulli observations, Beta prior. The observation model is Bernoulli, the prior is Beta, and thus the posterior is again a Beta distribution with updated parameters.

Example (cont.): Bernoulli observations, Beta prior. Maximum a posteriori (MAP) estimate; total ignorance corresponds to the flat prior with both shape parameters equal to 1. Note that for large values of n the MAP estimate approaches the ML estimate. The von Mises theorem: if the prior is continuous and not zero at the location of the ML estimate, then the Bayesian estimate and the ML estimate are asymptotically equivalent.
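
A small numerical sketch of the Beta-Bernoulli update (assuming NumPy; the prior parameters are illustrative): with a Beta(a, b) prior and n Bernoulli observations containing s ones, the posterior is Beta(a + s, b + n - s), the posterior mean is (a + s)/(a + b + n), and the MAP estimate (for shape parameters greater than 1) is (a + s - 1)/(a + b + n - 2); both approach the MLE s/n for large n.

# Minimal sketch: Beta-Bernoulli conjugate update, MLE, posterior mean, and MAP.
import numpy as np

rng = np.random.default_rng(7)
f_true, n = 0.7, 500
a, b = 2.0, 2.0                      # illustrative Beta prior parameters
s = rng.binomial(1, f_true, size=n).sum()

a_post, b_post = a + s, b + n - s    # posterior is Beta(a_post, b_post)
post_mean = a_post / (a_post + b_post)
post_map = (a_post - 1) / (a_post + b_post - 2)
print("MLE:", s / n, " posterior mean:", post_mean, " MAP:", post_map)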

Conjugate priors. In the previous example, the prior and the posterior are both Beta distributed; we say that the prior is conjugate with respect to the model. Formally, let a parametrized family of priors and a family of observation models be given; the family of priors is conjugate for the observation model if the posterior remains in the prior family. Very often, the prior information about f is weak, which gives us freedom to select conjugate priors. Why conjugate priors? Because computing the posterior density then simply consists of updating the parameters of the prior.

Conjugate priors (Gaussian observation, Gaussian prior). With a Gaussian observation and a Gaussian prior, the posterior distribution is Gaussian: 1. its mean lies in the simplex defined by the observation g and the prior mean (a convex combination of the two); 2. its variance is the parallel combination of the observation and prior variances.

Conjugate priors (Gaussian IID observations, Gaussian prior). With Gaussian IID observations and a Gaussian prior, the posterior distribution is Gaussian: 1. its mean lies in the simplex defined by the sample mean and the prior mean; 2. its variance is the parallel combination of the corresponding variances.
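
The sketch below implements the standard posterior for this conjugate pair (assuming NumPy; all numbers are illustrative): with prior N(mu0, s0^2) on the mean f, noise variance s^2, and N observations, the posterior precision is 1/s0^2 + N/s^2 and the posterior mean is the precision-weighted combination of the prior mean and the sample mean.

# Minimal sketch: Gaussian IID observations with a Gaussian prior on the mean f.
import numpy as np

rng = np.random.default_rng(8)
f_true, s, N = 3.0, 2.0, 25          # true mean, known noise std, sample size
mu0, s0 = 0.0, 1.5                   # illustrative Gaussian prior on f
g = rng.normal(f_true, s, size=N)

post_prec = 1 / s0**2 + N / s**2
post_var = 1 / post_prec
post_mean = post_var * (mu0 / s0**2 + g.sum() / s**2)
print("sample mean:", g.mean(), " posterior mean:", post_mean, " posterior var:", post_var)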

Conjugate Priors (Gaussian IID observations, Gaussian prior). [Plot illustrating this example.]

Conjugate Priors (multivariate Gaussian: observation and prior). If (g, f) are jointly Gaussian distributed, then the posterior of f given g is Gaussian, with mean and covariance given by the conditional Gaussian formulas.

Conjugate Priors (multivariate Gaussian: observation and prior). Linear observation model (f and w independent); the posterior is Gaussian.

Conjugate Priors (multivariate Gaussian: observation and prior). Linear observation model (f and w independent). Using the matrix inversion lemma, the posterior mean is the solution of a regularized least-squares problem; the regularizer can, e.g., penalize oscillatory solutions.
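
A compact numerical sketch of this equivalence (assuming NumPy; the matrices, covariances, and the zero-mean prior are illustrative choices): for g = A f + w with f ~ N(0, C_f) and w ~ N(0, C_w), the posterior mean solves a regularized least-squares problem.

# Minimal sketch: posterior mean for the linear Gaussian model as regularized LS.
import numpy as np

rng = np.random.default_rng(9)
m, n = 40, 5
A = rng.normal(size=(m, n))
Cf = 4.0 * np.eye(n)                     # illustrative zero-mean Gaussian prior on f
Cw = 0.25 * np.eye(m)                    # illustrative noise covariance
f_true = rng.multivariate_normal(np.zeros(n), Cf)
g = A @ f_true + rng.multivariate_normal(np.zeros(m), Cw)

# Posterior mean: (A^T Cw^-1 A + Cf^-1)^-1 A^T Cw^-1 g
lhs = A.T @ np.linalg.inv(Cw) @ A + np.linalg.inv(Cf)
f_post = np.linalg.solve(lhs, A.T @ np.linalg.inv(Cw) @ g)
print("posterior mean / regularized LS solution:", f_post)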

Improper Priors. Assume that p(f) = k on a given domain. Even if the domain of f is unbounded, the marginal on g can still be finite, and thus the posterior is well defined. In a sense, improper priors account for a state of total ignorance. This raises no issues for the Bayesian framework, as long as the posterior is proper.

Bayes Estimators

Bayes estimators. Ingredients of Statistical Decision Theory: the posterior distribution, which conveys all knowledge about f given the observation g; a loss function, which measures the discrepancy between the estimate and f; the a posteriori expected loss; and the optimal Bayes estimator, which minimizes it.

Bayesian framework: nuisance parameters. When the parameter splits into a parameter of interest and a nuisance parameter, the posterior risk depends only on the marginal posterior of the parameter of interest. In a pure Bayesian framework, nuisance parameters are integrated out.

Bayes estimators: Maximum a posteriori probability (MAP). The zero-one (0/1) loss, in the limit of a vanishing ball volume, leads to the maximum a posteriori probability estimator; a discrete domain leads to the MAP estimator as well.

Bayes Estimators: Posterior Mean (PM). Quadratic loss, with Q symmetric and positive definite: only one term depends on the estimate, and minimizing it gives the posterior mean, valid for any such Q (although the posterior mean may be hard to compute). If Q is diagonal, the loss function is additive.

Bayes estimators: additive loss. If the loss is additive across components, the minimization is decoupled: each component of the estimate minimizes the corresponding marginal a posteriori expected loss.

Bayes Estimators: Additive Loss. Additive 0/1 loss: each component of the estimate is the maximizer of the corresponding posterior marginal. Additive quadratic loss: the additive quadratic loss is a quadratic loss with Q = I; therefore, the corresponding Bayes estimator is the posterior mean.

Example (Gaussian IID observations, Gaussian prior). With Gaussian IID observations and a Gaussian prior, the posterior distribution is Gaussian, as derived earlier.

Example (Gaussian observation, Laplacian prior). MAP estimate: the log-posterior is strictly concave.

Example (Gaussian observation, Laplacian prior) MAP estimate
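
For the scalar case the MAP estimate with a Gaussian likelihood and a Laplacian prior is the well-known soft-thresholding rule, sketched below (assuming NumPy; the noise variance and prior scale are illustrative).

# Minimal sketch: scalar MAP estimate for a Gaussian observation g = f + w,
# w ~ N(0, sigma^2), with a Laplacian prior p(f) proportional to exp(-lam * |f|):
# the maximizer of the log-posterior is sign(g) * max(|g| - lam * sigma^2, 0).
import numpy as np

sigma, lam = 1.0, 1.5

def map_laplacian(g):
    return np.sign(g) * np.maximum(np.abs(g) - lam * sigma**2, 0.0)

g = np.linspace(-5, 5, 11)
print(np.round(map_laplacian(g), 3))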

Example (Gaussian observation, Laplacian prior). PM estimate: there is no closed-form expression, so we resort to numerical procedures.
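
One such numerical procedure, sketched under the same scalar assumptions as above (illustrative, assuming NumPy): compute the posterior mean by summation over a fine grid of f values.

# Minimal sketch: posterior mean by numerical integration for the scalar
# Gaussian-likelihood / Laplacian-prior model (same illustrative sigma, lam as above).
import numpy as np

sigma, lam, g_obs = 1.0, 1.5, 2.0

f = np.linspace(-20, 20, 20001)
df = f[1] - f[0]
unnorm = np.exp(-0.5 * (g_obs - f) ** 2 / sigma**2 - lam * np.abs(f))
post = unnorm / (unnorm.sum() * df)          # normalized posterior on the grid
pm = (f * post).sum() * df                   # posterior mean by Riemann sum
print("posterior mean for g =", g_obs, ":", pm)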

Example (Gaussian observation, Laplacian prior). [Plots illustrating this example.]

Example (Gaussian observation, Laplacian prior). [Plot with both axes ranging from -5 to 5.]

Example (Multivariate Gaussian: observation and prior). Linear observation model (f and w independent). The posterior mean operator is called the Wiener filter. If all the eigenvalues of C approach infinity, then the Wiener filter tends to the Moore-Penrose pseudo (or generalized) inverse of A.
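
A closing numerical sketch of this limit (assuming NumPy; the matrices are illustrative, with zero means and white noise): as the prior covariance is scaled up, the Wiener filter matrix approaches the pseudo-inverse of A.

# Minimal sketch: the Wiener filter tends to pinv(A) as the prior covariance grows.
import numpy as np

rng = np.random.default_rng(10)
m, n = 20, 4
A = rng.normal(size=(m, n))
Cw = 0.1 * np.eye(m)                       # illustrative white-noise covariance

def wiener(Cf):
    # Posterior-mean operator for g = A f + w, f ~ N(0, Cf), w ~ N(0, Cw).
    return Cf @ A.T @ np.linalg.inv(A @ Cf @ A.T + Cw)

for scale in (1.0, 1e2, 1e6):
    W = wiener(scale * np.eye(n))
    print("scale", scale, " ||W - pinv(A)|| =", np.linalg.norm(W - np.linalg.pinv(A)))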