Lecture 13 and 14: Bayesian estimation theory


Spring 2012 - EE 194 Networked estimation and control (Prof. Khan)
March 26, 2012

I. BAYESIAN ESTIMATORS

Mother Nature conducts a random experiment that generates a parameter $\theta$ from a probability density function $p(\theta)$. This parameter $\theta$ then codes (or parameterizes) the conditional (or measurement) density $f(x \mid \theta)$. A random experiment generates a measurement $x$ from $f(x \mid \theta)$. The problem is to estimate $\theta$ from $x$. We denote the estimate by $\hat{\theta}(x)$. The Bayesian setup consists of the following notions.

Loss function: The quality of the estimate $\hat{\theta}(x)$ is measured by a real-valued loss function. Some examples are:

Quadratic loss function: $L(\theta, \hat{\theta}(x)) = [\theta - \hat{\theta}(x)]^T [\theta - \hat{\theta}(x)]$.

Binary (0-1) loss function: $L(\theta, \hat{\theta}(x)) = 0$ if $\hat{\theta}(x) = \theta$, and $1$ otherwise.

Risk: The risk is defined as the loss averaged over the density $f(x \mid \theta)$; it addresses the question of what the average loss (or risk) associated with the estimate $\hat{\theta}(x)$ is. Mathematically,
$$R(\theta, \hat{\theta}) = E_x\big[L(\theta, \hat{\theta}(x))\big] = \int L(\theta, \hat{\theta}(x))\, f(x \mid \theta)\, dx.$$
The notation $E_x$ indicates that the expectation is over the distribution of the random measurement $x$ (with $\theta$ fixed).

Bayes risk: The Bayes risk is the risk averaged over the prior distribution on $\theta$:
$$R(p, \hat{\theta}) = E_\theta\big[R(\theta, \hat{\theta})\big] = \int R(\theta, \hat{\theta})\, p(\theta)\, d\theta = \int\!\!\int L(\theta, \hat{\theta}(x))\, \underbrace{f(x \mid \theta)\, p(\theta)}_{f(x,\theta)}\, dx\, d\theta.$$

Bayes risk estimator: The Bayes risk estimator minimizes the Bayes risk:
$$\hat{\theta}_B = \arg\min_{\hat{\theta}} R(p, \hat{\theta}),$$
i.e., the $\hat{\theta}$ that minimizes the Bayes risk.
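To make these definitions concrete, here is a minimal numerical sketch (Python, not part of the original notes) that approximates the risk and the Bayes risk on a grid. The scalar model $\theta \sim N(0,1)$, $x \mid \theta \sim N(\theta, 1)$, the grid limits, and the estimator $\hat{\theta}(x) = x/2$ are all assumptions chosen for illustration.

    # Sketch: approximate R(theta, delta) and the Bayes risk R(p, delta) on a grid,
    # for an assumed scalar model theta ~ N(0,1), x | theta ~ N(theta,1),
    # quadratic loss, and the (assumed) estimator delta(x) = x/2.
    import numpy as np
    from scipy.stats import norm

    theta_grid = np.linspace(-6, 6, 601)   # grid over the parameter
    x_grid = np.linspace(-10, 10, 1001)    # grid over the measurement
    dth = theta_grid[1] - theta_grid[0]
    dx = x_grid[1] - x_grid[0]

    prior = norm.pdf(theta_grid, 0, 1)     # p(theta)
    delta = lambda x: x / 2.0              # an arbitrary estimator to evaluate

    # Risk R(theta, delta) = E_x[ L(theta, delta(x)) ] with theta fixed
    def risk(theta):
        fx = norm.pdf(x_grid, theta, 1)            # f(x | theta)
        loss = (theta - delta(x_grid)) ** 2        # quadratic loss
        return np.sum(loss * fx) * dx

    R_theta = np.array([risk(th) for th in theta_grid])

    # Bayes risk: the risk averaged over the prior
    bayes_risk = np.sum(R_theta * prior) * dth
    print(bayes_risk)   # ~ 0.5; for this model delta(x) = x/2 is in fact the posterior mean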

The Bayes risk estimator is a rule for mapping the observations $x$ into estimates $\hat{\theta}_B(x)$. It depends on the conditional distribution of the measurements and on the prior distribution of the parameter. When this prior is not known, the mini-max principle may be used.

Mini-max estimator: Suppose an experimentalist (E) chooses an estimator $\hat{\theta}$ and Mother Nature (M) is allowed to choose her prior after the experimentalist has made his/her choice. If Mother Nature does not like the experimentalist, she will try to maximize the average risk for any choice $\hat{\theta}$:
$$\max_{p} R(p, \hat{\theta}).$$
We can turn this into a game between M and E by allowing E to observe the resulting average risk and permitting him/her to choose a decision rule that minimizes this maximum average risk:
$$\min_{\hat{\theta}} \max_{p} R(p, \hat{\theta}).$$
The estimator that achieves this is called the mini-max estimator $\hat{\theta}_{mm}$:
$$\hat{\theta}_{mm} = \arg\min_{\hat{\theta}} \max_{p} R(p, \hat{\theta}).$$
There are other variants of this setup, and it leads to very fundamental questions in game theory.

II. COMPUTING BAYES RISK ESTIMATORS

Recall that the Bayes risk is given by
$$R(p, \hat{\theta}) = \int\!\!\int L(\theta, \hat{\theta}(x))\, f(x, \theta)\, dx\, d\theta,$$
where $f(x, \theta) = f(x \mid \theta)\, p(\theta)$. From Bayes' rule we have $f(x, \theta) = f(\theta \mid x)\, f(x)$, where $f(\theta \mid x)$ is the posterior density of $\theta$ given $x$ and $f(x)$ is the marginal density of $x$:
$$f(\theta \mid x) = \frac{f(x \mid \theta)}{f(x)}\, p(\theta), \qquad f(x) = \int f(x, \theta)\, d\theta = \int f(x \mid \theta)\, p(\theta)\, d\theta.$$
There is an important physical interpretation of the first formula. The prior density is mapped to the posterior density by the ratio of the conditional measurement density to the marginal density,
$$p(\theta) \;\xrightarrow{\;x\;}\; f(\theta \mid x),$$
i.e., the data $x$ is used to map the prior into the posterior.
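The prior-to-posterior map can be carried out numerically on a grid. The following sketch (an added illustration, not from the original notes) assumes, purely for the example, the model $\theta \sim N(0,1)$, $x \mid \theta \sim N(\theta, 0.5^2)$, and an observed value $x = 1.3$.

    # Sketch of the prior-to-posterior map on a grid, under an assumed model.
    import numpy as np
    from scipy.stats import norm

    theta = np.linspace(-5, 5, 2001)
    dth = theta[1] - theta[0]

    prior = norm.pdf(theta, 0, 1)              # p(theta)
    x_obs = 1.3                                # assumed observed measurement
    likelihood = norm.pdf(x_obs, theta, 0.5)   # f(x | theta), viewed as a function of theta

    marginal = np.sum(likelihood * prior) * dth    # f(x) = integral of f(x|theta) p(theta) dtheta
    posterior = likelihood * prior / marginal      # f(theta | x)

    print(np.sum(posterior) * dth)   # ~ 1: the posterior integrates to one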

The Bayes risk estimator is thus
$$\hat{\theta}_B(x) = \arg\min_{\hat{\theta}} \int\!\!\int L(\theta, \hat{\theta}(x))\, f(x, \theta)\, dx\, d\theta
 = \arg\min_{\hat{\theta}} \int\!\!\int L(\theta, \hat{\theta}(x))\, f(\theta \mid x)\, f(x)\, dx\, d\theta
 = \arg\min_{\hat{\theta}} \int \left( \int L(\theta, \hat{\theta}(x))\, f(\theta \mid x)\, d\theta \right) f(x)\, dx,$$
so, for each $x$,
$$\hat{\theta}_B(x) = \arg\min_{\hat{\theta}(x)} \underbrace{\int L(\theta, \hat{\theta}(x))\, f(\theta \mid x)\, d\theta}_{\text{conditional Bayes risk}},$$
since the marginal density $f(x)$ is non-negative. The result says that the Bayes risk estimator is the estimator that minimizes the conditional risk, where the conditional risk is the loss averaged over the conditional distribution of $\theta$ given $x$. To compute a particular estimator we need to consider some typical loss functions.

Quadratic loss function: When the loss function is quadratic,
$$L(\theta, \hat{\theta}(x)) = [\theta - \hat{\theta}(x)]^T [\theta - \hat{\theta}(x)],$$
we may write the conditional Bayes risk as
$$\int L(\theta, \hat{\theta}(x))\, f(\theta \mid x)\, d\theta = \int [\theta - \hat{\theta}(x)]^T [\theta - \hat{\theta}(x)]\, f(\theta \mid x)\, d\theta.$$
The gradient of this risk with respect to $\hat{\theta}(x)$ is
$$\frac{\partial}{\partial \hat{\theta}(x)} \int [\theta - \hat{\theta}(x)]^T [\theta - \hat{\theta}(x)]\, f(\theta \mid x)\, d\theta
 = \int \frac{\partial}{\partial \hat{\theta}(x)} \left( [\theta - \hat{\theta}(x)]^T [\theta - \hat{\theta}(x)] \right) f(\theta \mid x)\, d\theta
 = -2 \int [\theta - \hat{\theta}(x)]\, f(\theta \mid x)\, d\theta,$$
and the second derivative (the Hessian) is
$$\frac{\partial}{\partial \hat{\theta}(x)^T} \left( -2 \int [\theta - \hat{\theta}(x)]\, f(\theta \mid x)\, d\theta \right) = 2I > 0,$$
so the stationary point is a minimum. Setting the gradient to zero,
$$2 \int [\theta - \hat{\theta}(x)]\, f(\theta \mid x)\, d\theta = 0
 \;\Longrightarrow\; \int \theta\, f(\theta \mid x)\, d\theta = \hat{\theta}(x) \int f(\theta \mid x)\, d\theta,$$
and therefore
$$\hat{\theta}_B(x) = \int \theta\, f(\theta \mid x)\, d\theta = E(\theta \mid x).$$

We say that the Bayes risk estimator under the quadratic loss function is the conditional mean of $\theta$ given $x$. In a nutshell, Bayes estimation under quadratic loss comes down to computing the mean of the conditional density $f(\theta \mid x)$. Nonlinear filtering is a generic term for this calculation, because the result is generally a nonlinear function of the measurement $x$.

Uniform loss function: Assume that the loss function is
$$L(\theta, \hat{\theta}(x)) = \begin{cases} 0, & |\theta - \hat{\theta}(x)| \le \varepsilon, \\ 1, & |\theta - \hat{\theta}(x)| > \varepsilon, \end{cases}$$
where $\varepsilon > 0$. Under this loss function, the expected posterior loss becomes
$$E\big[L(\theta, \hat{\theta}(x)) \mid x\big] = 1 \cdot P(|\theta - \hat{\theta}(x)| > \varepsilon) + 0 \cdot P(|\theta - \hat{\theta}(x)| \le \varepsilon)
 = 1 - P(|\theta - \hat{\theta}(x)| \le \varepsilon)
 = 1 - \int_{\hat{\theta}(x) - \varepsilon}^{\hat{\theta}(x) + \varepsilon} f(\theta \mid x)\, d\theta.$$
The above is minimized when the subtracted term is maximized:
$$\hat{\theta}(x) = \arg\max_{\hat{\theta}(x)} \int_{\hat{\theta}(x) - \varepsilon}^{\hat{\theta}(x) + \varepsilon} f(\theta \mid x)\, d\theta.$$
In the limit as $\varepsilon \to 0$, the above becomes
$$\lim_{\varepsilon \to 0} \hat{\theta}(x) = \arg\max_{\theta} f(\theta \mid x),$$
which is the MAP estimator.
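The two estimators can be read off directly from a gridded posterior: the quadratic-loss estimate is the posterior mean, and the uniform-loss (as $\varepsilon \to 0$) estimate is the posterior mode. In the sketch below (an added illustration), the skewed posterior, taken to be a Gamma density, is an assumption made so that the mean and the MAP visibly differ.

    # Posterior mean (quadratic loss) vs. MAP (uniform loss, eps -> 0) from a gridded posterior.
    import numpy as np
    from scipy.stats import gamma

    theta = np.linspace(1e-3, 20, 4000)
    dth = theta[1] - theta[0]
    posterior = gamma.pdf(theta, a=3.0, scale=1.5)   # assumed stand-in for f(theta | x)
    posterior /= np.sum(posterior) * dth             # normalize on the grid

    post_mean = np.sum(theta * posterior) * dth      # Bayes estimate under quadratic loss
    theta_map = theta[np.argmax(posterior)]          # MAP estimate

    print(post_mean, theta_map)   # mean ~ 4.5, mode ~ 3.0 for this skewed example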

Lecture 14: Wednesday

Example 1: A radioactive source emits $n$ radioactive particles and an imperfect Geiger counter records $k \le n$ of them. Our problem is to estimate $n$ from the measurement $k$. We assume that $n$ is drawn from a Poisson distribution with known parameter $\lambda$:
$$P[n] = e^{-\lambda}\, \frac{\lambda^n}{n!}, \qquad n \ge 0.$$
The Poisson distribution characterizes the number of emissions of a process in a given interval of time (or space). It is likely to produce a large $n$ when the expected number of occurrences $\lambda$ is high, and a small $n$ when $\lambda$ is small. We can show that $E[n] = \lambda$ and $E[(n - E[n])^2] = \lambda$.

The number of recorded counts follows a binomial distribution:
$$P[k \mid n] = \binom{n}{k} p^k (1-p)^{n-k}, \qquad 0 \le k \le n,$$
with $E[k \mid n] = np$ and $\mathrm{var}[k \mid n] = np(1-p)$. The binomial distribution is the distribution of a sum of i.i.d. Bernoulli trials: suppose a random variable is $1$ with probability $p$ and $0$ with probability $1-p$; then the binomial distribution characterizes the total number of $1$'s we may observe over $n$ trials.

To proceed with the Bayesian analysis, we need to compute the posterior distribution of $n$ given $k$:
$$P[n \mid k] = \frac{P[n, k]}{P[k]},$$
which requires the joint distribution and the marginals. We have
$$P[n, k] = P[k \mid n]\, P[n] = \binom{n}{k} p^k (1-p)^{n-k}\, e^{-\lambda}\, \frac{\lambda^n}{n!}, \qquad 0 \le k \le n, \quad n \ge 0.$$
The marginal of $k$ is
$$P[k] = \sum_{n=k}^{\infty} \binom{n}{k} p^k (1-p)^{n-k}\, e^{-\lambda}\, \frac{\lambda^n}{n!}
 = \sum_{n=k}^{\infty} e^{-\lambda}\, \frac{(\lambda p)^k\, (\lambda(1-p))^{n-k}}{k!\,(n-k)!}
 = \frac{(\lambda p)^k e^{-\lambda}}{k!} \sum_{n=k}^{\infty} \frac{(\lambda(1-p))^{n-k}}{(n-k)!}
 = \frac{(\lambda p)^k}{k!}\, e^{-\lambda + \lambda - \lambda p}
 = \frac{(\lambda p)^k}{k!}\, e^{-\lambda p},$$

which is Poisson with rate $\lambda p$. Now the posterior is
$$P[n \mid k] = \frac{P[n, k]}{P[k]}
 = \frac{\dfrac{n!}{k!(n-k)!}\, p^k (1-p)^{n-k}\, e^{-\lambda}\, \dfrac{\lambda^n}{n!}}{e^{-\lambda p}\, \dfrac{(\lambda p)^k}{k!}}
 = \frac{1}{(n-k)!}\, (\lambda(1-p))^{n-k}\, e^{-\lambda(1-p)}, \qquad n \ge k,$$
which is similar to a Poisson distribution, except that $n$ starts from $k$ instead of $0$. This has been called the Poisson distribution with displacement $k$. The conditional mean is
$$E[n \mid k] = \sum_{n=k}^{\infty} n\, \frac{(\lambda(1-p))^{n-k}}{(n-k)!}\, e^{-\lambda(1-p)}
 = \sum_{n=k}^{\infty} (n-k+k)\, \frac{(\lambda(1-p))^{n-k}}{(n-k)!}\, e^{-\lambda(1-p)}$$
$$\qquad = \sum_{n=k}^{\infty} (n-k)\, \frac{(\lambda(1-p))^{n-k}}{(n-k)!}\, e^{-\lambda(1-p)}
 + k\, e^{-\lambda(1-p)} \sum_{n=k}^{\infty} \frac{(\lambda(1-p))^{n-k}}{(n-k)!}
 = \lambda(1-p) + k,$$
and the conditional variance is
$$E\big[(n - E[n \mid k])^2 \mid k\big] = \lambda(1-p) \qquad \text{(Exercise).}$$

When the loss function is quadratic, the optimal Bayes estimator is the conditional mean, and thus
$$\hat{n}_B = E[n \mid k] = \lambda(1-p) + k.$$
The Bayes estimate is $k$ when $p = 1$, independent of the expected number of occurrences $\lambda$; since our measurement model is Bernoulli, we can show that $P(k = n \mid n) = 1$ when $p = 1$. Similarly, when $p = 0$, i.e., we see no observations almost surely, the Bayes estimate is $\lambda$, which is the expected number of occurrences. For $0 < p < 1$, the Bayes estimate optimally combines the two extremes. We can also think of $\lambda(1-p)$ as the expected number of missed counts; in this sense, the Bayes estimate applies a correction to include the missed counts.

One can easily show that $E[\hat{n}_B] = \lambda = E[n]$, i.e., the estimate is unbiased. However, it is not conditionally unbiased, i.e.,
$$E[\hat{n}_B \mid n] = E[k \mid n] + \lambda(1-p) = np + \lambda(1-p) \ne n.$$
The mean squared error of the estimator is
$$E[(n - \hat{n}_B)^2] = E_k\!\left( E[(n - \hat{n}_B)^2 \mid k] \right) = E_k\!\left( E[(n - E(n \mid k))^2 \mid k] \right) = E_k\big(\lambda(1-p)\big) = \lambda(1-p).$$
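A quick Monte Carlo check of the closed-form answer is straightforward (this sketch is an added illustration; the particular values of $\lambda$, $p$, and the queried $k$ are assumptions): simulate the two-stage experiment and compare the empirical conditional mean of $n$ against $\lambda(1-p) + k$.

    # Monte Carlo sketch of Example 1: n ~ Poisson(lambda), k | n ~ Binomial(n, p).
    import numpy as np

    rng = np.random.default_rng(0)
    lam, p, k_query = 20.0, 0.6, 10       # assumed values for illustration
    N = 500_000

    n = rng.poisson(lam, size=N)          # Mother Nature draws n
    k = rng.binomial(n, p)                # the Geiger counter records k of the n particles

    empirical = n[k == k_query].mean()    # ~ E[n | k = k_query]
    closed_form = lam * (1 - p) + k_query # Bayes estimate under quadratic loss

    print(empirical, closed_form)         # both ~ 18 here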

III. MULTIVARIATE NORMAL

Let $x$ and $y$ be jointly distributed according to the normal distribution:
$$\begin{bmatrix} x \\ y \end{bmatrix} \sim N\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} R_{xx} & R_{xy} \\ R_{yx} & R_{yy} \end{bmatrix} \right).$$
Recall that the marginals are also normal, i.e.,
$$x \sim N(0, R_{xx}), \qquad y \sim N(0, R_{yy}),$$
where $R_{xx} = E[xx^T]$ and so on. It can be shown that
$$y \mid x \sim N\big(R_{yx} R_{xx}^{-1} x,\; R_{yy} - R_{yx} R_{xx}^{-1} R_{xy}\big), \qquad
x \mid y \sim N\big(R_{xy} R_{yy}^{-1} y,\; R_{xx} - R_{xy} R_{yy}^{-1} R_{yx}\big).$$
Hence the optimal Bayes estimate of $x$ under quadratic loss is the mean of the posterior, i.e.,
$$\hat{x}_B = R_{xy} R_{yy}^{-1} y.$$
We can think of this as Mother Nature generating $x$ from $p(x) = N(0, R_{xx})$ and Father Nature generating a measurement from $f(y \mid x)$, which is also normal. What function relating $y$ to $x$ will result in the above $f(y \mid x)$? Recalling that the sum of two normal random variables is also normal, note that
$$y = Hx + r, \qquad H = R_{yx} R_{xx}^{-1}, \qquad r \sim N(0, Q) \text{ statistically independent of } x,$$
will result in the above $f(y \mid x)$. In other words, we can generate the jointly normal $x$ and $y$ process described above from two statistically independent normal random vectors $x \sim N(0, R_{xx})$ and $r \sim N(0, Q)$ by relating $y$ and $x$ as above. When generating this signal plus measurement model, i.e., $x$ being a signal and $y = Hx + r$ being the measurement, we define only one new matrix, $R_{yx}$, and $R_{yy}$ is directly given by $R_{xx}$ and $Q$. Clearly $R_{xy} = R_{yx}^T$. Show that $R_{yy} = R_{yx} R_{xx}^{-1} R_{xy} + Q$. In short, one can generate a jointly normal random process from two independent normal processes and a linear map.
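The construction can be checked by simulation. The sketch below (an added illustration) assumes a particular $2$-dimensional $R_{xx}$, a $1$-dimensional measurement correlation $R_{yx}$, and noise covariance $Q$; it draws $x$ and $r$ independently, forms $y = Hx + r$ with $H = R_{yx} R_{xx}^{-1}$, and compares the sample covariance of $y$ with $R_{yy} = R_{yx} R_{xx}^{-1} R_{xy} + Q$.

    # Sketch of the jointly normal construction y = H x + r.
    import numpy as np

    rng = np.random.default_rng(1)
    R_xx = np.array([[2.0, 0.5], [0.5, 1.0]])   # assumed signal covariance
    R_yx = np.array([[1.0, 0.3]])               # assumed 1-D measurement correlation
    Q = np.array([[0.4]])                       # assumed noise covariance

    H = R_yx @ np.linalg.inv(R_xx)
    R_yy = R_yx @ np.linalg.inv(R_xx) @ R_yx.T + Q

    N = 200_000
    x = rng.multivariate_normal(np.zeros(2), R_xx, size=N)
    r = rng.multivariate_normal(np.zeros(1), Q, size=N)
    y = x @ H.T + r

    print(np.cov(y.T), R_yy[0, 0])   # sample covariance of y vs. R_yy

    # Bayes estimate of x under quadratic loss for one measurement y[0]:
    R_xy = R_yx.T
    x_B = R_xy @ np.linalg.inv(R_yy) @ y[0]
    print(x_B)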

IV. LINEAR STATISTICAL MODEL

Consider the following signal plus noise model:
$$y = Hx + n,$$
where $x \sim N(0, R_{xx})$ and $n \sim N(0, R_{nn})$ are statistically independent. The correlation between $x$ and $y$ is
$$R_{yx} = E[yx^T] = E[(Hx + n)x^T] = H R_{xx},$$
and the covariance of $y$ is
$$R_{yy} = E[yy^T] = H R_{xx} H^T + R_{nn}.$$
Thus $x$ and $y$ are jointly normal:
$$\begin{bmatrix} x \\ y \end{bmatrix} \sim N\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} R_{xx} & R_{xx} H^T \\ H R_{xx} & H R_{xx} H^T + R_{nn} \end{bmatrix} \right).$$
Clearly, the Bayes estimate under quadratic loss is the conditional mean of $x \mid y$:
$$\hat{x}_B = \underbrace{R_{xx} H^T (H R_{xx} H^T + R_{nn})^{-1}}_{G}\, y,$$
with conditional covariance
$$P = R_{xx} - \underbrace{R_{xx} H^T (H R_{xx} H^T + R_{nn})^{-1}}_{G}\, H R_{xx}.$$
From the matrix inversion lemma, note that
$$P = \big(R_{xx}^{-1} + H^T R_{nn}^{-1} H\big)^{-1}, \qquad \text{i.e.,} \qquad P^{-1} = R_{xx}^{-1} + H^T R_{nn}^{-1} H.$$
Then
$$G H R_{xx} = R_{xx} - P = P\big(P^{-1} R_{xx} - I\big) = P\big((R_{xx}^{-1} + H^T R_{nn}^{-1} H) R_{xx} - I\big) = P\big(I + H^T R_{nn}^{-1} H R_{xx} - I\big) = P H^T R_{nn}^{-1} H R_{xx},$$
and therefore
$$G = P H^T R_{nn}^{-1}.$$
Hence the estimator can be re-written as
$$\hat{x}_B = P H^T R_{nn}^{-1} y, \qquad P = \big(R_{xx}^{-1} + H^T R_{nn}^{-1} H\big)^{-1}.$$
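The equivalence of the two forms of the gain and covariance is easy to confirm numerically. In the sketch below (an added illustration), the dimensions and the randomly generated symmetric positive definite $R_{xx}$, $R_{nn}$, and matrix $H$ are assumptions chosen only to exercise the identities.

    # Check the two equivalent forms of the gain G and covariance P in y = Hx + n.
    import numpy as np

    rng = np.random.default_rng(2)
    nx, ny = 3, 4
    A = rng.standard_normal((nx, nx)); R_xx = A @ A.T + nx * np.eye(nx)   # SPD prior covariance
    B = rng.standard_normal((ny, ny)); R_nn = B @ B.T + ny * np.eye(ny)   # SPD noise covariance
    H = rng.standard_normal((ny, nx))

    # Form 1: G = R_xx H^T (H R_xx H^T + R_nn)^{-1}, P = R_xx - G H R_xx
    G1 = R_xx @ H.T @ np.linalg.inv(H @ R_xx @ H.T + R_nn)
    P1 = R_xx - G1 @ H @ R_xx

    # Form 2 (matrix inversion lemma): P = (R_xx^{-1} + H^T R_nn^{-1} H)^{-1}, G = P H^T R_nn^{-1}
    P2 = np.linalg.inv(np.linalg.inv(R_xx) + H.T @ np.linalg.inv(R_nn) @ H)
    G2 = P2 @ H.T @ np.linalg.inv(R_nn)

    print(np.allclose(P1, P2), np.allclose(G1, G2))   # both True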

V. SEQUENTIAL BAYES

The results of the previous section may be used to derive recursive estimates of the random vector $x$ when the measurement vector $y_t = [y_0, y_1, \ldots, y_t]^T$ increases in dimension with time. The basic idea is to write
$$y_t = H_t x + n_t, \qquad
\begin{bmatrix} y_{t-1} \\ y_t \end{bmatrix} = \begin{bmatrix} H_{t-1} \\ c_t^T \end{bmatrix} x + \begin{bmatrix} n_{t-1} \\ n_t \end{bmatrix},$$
i.e., the $k$th measurement can be written as
$$y_k = c_k^T x + n_k,$$
where $x \sim N(0, R_{xx})$ and $n_t \sim N(0, R_t)$ are statistically independent, with $R_t$ diagonal with elements $r_{tt}$ on the diagonal and $R_{00} = r_{00}$. This means that
$$R_t^{-1} = \begin{bmatrix} R_{t-1}^{-1} & 0 \\ 0^T & r_{tt}^{-1} \end{bmatrix}.$$
The joint distribution of $x$ and $y_t$ is
$$\begin{bmatrix} x \\ y_t \end{bmatrix} \sim N\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} R_{xx} & R_{xx} H_t^T \\ H_t R_{xx} & H_t R_{xx} H_t^T + R_t \end{bmatrix} \right).$$
The posterior is
$$x \mid y_t \sim N(\hat{x}_t, P_t), \qquad \hat{x}_t = P_t H_t^T R_t^{-1} y_t, \qquad P_t^{-1} = R_{xx}^{-1} + H_t^T R_t^{-1} H_t.$$
The dimensions of $H_t$, $R_t$, and $y_t$ increase with time, whereas the dimensions of $\hat{x}_t$ and $P_t$ are fixed. How can we make the estimate equations recursive?
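To see the issue concretely, the following sketch (an added illustration; the dimensions, $R_{xx}$, the rows $c_t$, and the variances $r_{tt}$ are all assumed) recomputes the batch formulas above at every time step, stacking the growing $H_t$, $y_t$, and $R_t$ each time. The estimate $\hat{x}_t$ and covariance $P_t$ stay fixed in size, but the batch computation grows with $t$; finding a recursive form is exactly the question posed above.

    # Batch evaluation of x_t = P_t H_t^T R_t^{-1} y_t with P_t^{-1} = R_xx^{-1} + H_t^T R_t^{-1} H_t
    # as the measurement record grows.
    import numpy as np

    rng = np.random.default_rng(3)
    nx, T = 2, 5
    R_xx = np.eye(nx)
    x = rng.multivariate_normal(np.zeros(nx), R_xx)    # the fixed random vector to estimate

    H_rows, y_list, r_list = [], [], []
    for t in range(T):
        c_t = rng.standard_normal(nx)                  # assumed measurement row c_t^T
        r_tt = 0.5                                     # assumed scalar noise variance
        y_t = c_t @ x + rng.normal(0, np.sqrt(r_tt))   # y_t = c_t^T x + n_t

        H_rows.append(c_t); y_list.append(y_t); r_list.append(r_tt)
        H_t = np.vstack(H_rows)                        # grows with t
        y_vec = np.array(y_list)                       # grows with t
        R_t_inv = np.diag(1.0 / np.array(r_list))      # R_t is diagonal

        P_t = np.linalg.inv(np.linalg.inv(R_xx) + H_t.T @ R_t_inv @ H_t)
        x_t = P_t @ H_t.T @ R_t_inv @ y_vec            # fixed dimension, recomputed in batch
        print(t, x_t)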