Estimation Theory Fredrik Rusek. Chapter 11


Chapter 10 Bayesian Estimation, Section 10.8: Bayesian estimators for deterministic parameters. If no MVU estimator exists, or it is very hard to find, we can apply an MMSE estimator to deterministic parameters. Recall the form of the Bayesian estimator for a DC level in WGN with Gaussian prior N(μ_A, σ_A²): Â = α x̄ + (1 − α) μ_A, where α = σ_A² / (σ_A² + σ²/N) < 1. Compute the MSE for a given value of A: the variance is smaller than that of the classical estimator (the sample mean), but the bias is large for A far from the prior mean. Hence the MSE of the Bayesian estimator is smaller for A close to the prior mean, but larger far away.
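A minimal numerical sketch of this comparison, assuming the Gaussian-prior DC-level model above; the prior mean mu_A, prior variance var_A, noise variance var_w and sample size N are illustrative choices, not values from the slides:

import numpy as np

# DC level in WGN: x[n] = A + w[n], w[n] ~ N(0, var_w); prior A ~ N(mu_A, var_A)
N, var_w = 10, 1.0
mu_A, var_A = 0.0, 0.25
alpha = var_A / (var_A + var_w / N)            # shrinkage factor, alpha < 1

A_grid = np.linspace(-3, 3, 13)
mse_classical = np.full_like(A_grid, var_w / N)                          # sample mean: unbiased, variance var_w/N
mse_bayes = alpha**2 * var_w / N + (1 - alpha)**2 * (A_grid - mu_A)**2   # variance + bias^2

for A, m_c, m_b in zip(A_grid, mse_classical, mse_bayes):
    print(f"A={A:+.2f}  MSE classical={m_c:.3f}  MSE Bayesian={m_b:.3f}")

The printout shows the Bayesian MSE dipping below var_w/N near the prior mean and exceeding it far away, which is the trade-off described on the slide.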

However, the Bayesian MSE (BMSE), which averages the MSE over the prior p(A), is smaller for the Bayesian estimator: the Bayesian estimator is designed to minimize exactly this average.
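A hedged Monte Carlo check of this claim, reusing the illustrative model parameters from the previous sketch (again, not values taken from the slides):

import numpy as np

rng = np.random.default_rng(0)
N, var_w, mu_A, var_A = 10, 1.0, 0.0, 0.25
alpha = var_A / (var_A + var_w / N)

trials = 200_000
A = rng.normal(mu_A, np.sqrt(var_A), trials)            # draw A from the prior
xbar = A + rng.normal(0, np.sqrt(var_w / N), trials)    # sample mean of N noisy observations
bmse_classical = np.mean((xbar - A) ** 2)                            # close to var_w / N
bmse_bayes = np.mean((alpha * xbar + (1 - alpha) * mu_A - A) ** 2)   # smaller than var_w / N

print("BMSE classical:", bmse_classical)
print("BMSE Bayesian :", bmse_bayes)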

Risk functions. Model: θ is drawn from the prior p(θ), the data x are generated according to p(x|θ), and the estimator maps x to θ̂. Error: ε = θ − θ̂. An estimator that minimizes the Bayes risk R = E[C(ε)], for some cost function C(ε), is termed a Bayes estimator. The MMSE estimator minimizes the Bayes risk for the quadratic cost C(ε) = ε².

Let us now optimize for different cost functions. For the quadratic cost C(ε) = ε² we already know the answer: the Bayes estimator is the posterior mean, θ̂ = E(θ|x). For the absolute-error cost C(ε) = |ε|, the Bayes risk equals R = ∫ [ ∫ |θ − θ̂| p(θ|x) dθ ] p(x) dx. Since p(x) ≥ 0, minimize the inner integral for each x to minimize the Bayes risk: g(θ̂) = ∫ |θ − θ̂| p(θ|x) dθ = ∫_{−∞}^{θ̂} (θ̂ − θ) p(θ|x) dθ + ∫_{θ̂}^{∞} (θ − θ̂) p(θ|x) dθ. We need dg/dθ̂, but θ̂ appears in the limits of the integrals, so this is not a standard differentiation.

Interlude: Leibnitz's rule (very useful). For an integral whose integrand and limits both depend on a parameter u, we have d/du ∫_{φ1(u)}^{φ2(u)} h(u,v) dv = (dφ2(u)/du) h(u, φ2(u)) − (dφ1(u)/du) h(u, φ1(u)) + ∫_{φ1(u)}^{φ2(u)} ∂h(u,v)/∂u dv. In our problem u = θ̂; for the first integral the upper limit is φ2(u) = θ̂ so dφ2/du = 1, while the lower limit does not depend on u, so its boundary term vanishes. The situation is reversed for the second integral.

Back to the absolute-error cost: applying Leibnitz's rule to g(θ̂), the boundary terms vanish and dg/dθ̂ = ∫_{−∞}^{θ̂} p(θ|x) dθ − ∫_{θ̂}^{∞} p(θ|x) dθ. Setting this to zero, the two tails must carry equal probability, so θ̂ is the median of the posterior p(θ|x). Next, take the hit-or-miss cost: C(ε) = 0 for |ε| ≤ δ and C(ε) = 1 otherwise. The Bayes risk equals R = ∫ [ 1 − ∫_{θ̂−δ}^{θ̂+δ} p(θ|x) dθ ] p(x) dx, which is minimized by placing the interval of length 2δ where the posterior has the most mass. Let δ→0: θ̂ = arg max p(θ|x), the maximum a posteriori (MAP) estimator.
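A small numerical sketch of these three results, using an assumed skewed posterior (a Gamma density chosen purely for illustration) and a grid search over candidate estimates:

import numpy as np
from scipy import stats

# Assumed posterior p(theta|x): Gamma(shape=3, scale=1), chosen only for illustration.
post = stats.gamma(a=3, scale=1.0)
theta = np.linspace(0.0, 20.0, 20001)
p = post.pdf(theta)
p /= np.trapz(p, theta)                       # normalize on the grid

cand = np.linspace(0.0, 10.0, 2001)           # candidate estimates theta_hat
delta = 0.05
quad = [np.trapz((theta - c) ** 2 * p, theta) for c in cand]              # quadratic-cost risk
absl = [np.trapz(np.abs(theta - c) * p, theta) for c in cand]             # absolute-cost risk
hit = [1.0 - (post.cdf(c + delta) - post.cdf(c - delta)) for c in cand]   # hit-or-miss risk

print("quadratic minimizer  :", cand[np.argmin(quad)], " posterior mean  :", post.mean())
print("absolute minimizer   :", cand[np.argmin(absl)], " posterior median:", post.median())
print("hit-or-miss minimizer:", cand[np.argmin(hit)], " posterior mode  :", 2.0)  # mode = (shape-1)*scale

Up to grid resolution, the three minimizers land on the posterior mean, median, and mode, matching the derivations above.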

Gaussian posterior: what is the relation between the mean, the median, and the maximum? For a Gaussian posterior they coincide, so the three cost functions (quadratic, absolute, hit-or-miss) all lead to the same estimator.

Extension to vector parameter. Suppose we have a vector of unknown parameters θ and consider estimation of θ1. It still holds that the estimator is built from the posterior of θ1 given x. The parameters θ2, ..., θN are nuisance parameters, but we can integrate them away: p(θ1|x) = ∫ ... ∫ p(θ|x) dθ2 ... dθN, and the MMSE estimator is θ̂1 = E(θ1|x) = ∫ θ1 p(θ1|x) dθ1. In vector form: θ̂ = E(θ|x). Observations: in the classical (non-Bayesian) approach we must estimate all unknown parameters jointly, except if the Fisher information matrix is diagonal. The vector MMSE estimator minimizes the MSE for each component of the unknown vector parameter θ, i.e., Bmse(θ̂i) is minimized for every i.
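A small numerical sketch of integrating away a nuisance parameter, using an assumed bivariate Gaussian posterior on a grid (the mean, covariance, and grid limits are illustrative choices, not taken from the slides):

import numpy as np
from scipy import stats

# Assumed joint posterior p(theta1, theta2 | x): bivariate Gaussian, for illustration only.
mean = np.array([1.0, -0.5])
cov = np.array([[1.0, 0.6],
                [0.6, 2.0]])
post = stats.multivariate_normal(mean, cov)

t1 = np.linspace(-5, 7, 401)
t2 = np.linspace(-7, 7, 401)
T1, T2 = np.meshgrid(t1, t2, indexing="ij")
joint = post.pdf(np.dstack([T1, T2]))

marg1 = np.trapz(joint, t2, axis=1)           # integrate out the nuisance parameter theta2
marg1 /= np.trapz(marg1, t1)
theta1_mmse = np.trapz(t1 * marg1, t1)        # posterior mean of theta1 from the marginal

print("element-wise MMSE of theta1:", theta1_mmse)   # matches component [0] of E(theta|x)
print("component [0] of E(theta|x):", mean[0])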

Performance of the MMSE estimator. Define the Bayesian MSE matrix M_θ̂ = E_{x,θ}[(θ − θ̂)(θ − θ̂)^T]. Using Bayes' rule, p(x,θ) = p(θ|x) p(x), and noting that the MMSE estimator θ̂ = E(θ|x) is a function of x, the inner expectation over θ given x is by definition the posterior covariance C_{θ|x}. Hence M_θ̂ = ∫ C_{θ|x} p(x) dx = E_x[C_{θ|x}], and Bmse(θ̂1) is element [1,1] of M_θ̂.

Additive property. Independent observations x1, x2; estimate θ. Assume that x1, x2, and θ are jointly Gaussian. By Theorem 10.2 (there is a typo in the book here: the expression should include the means as well), independence of the observations makes the covariance matrix of x = [x1; x2] block diagonal, so E(θ|x1, x2) = E(θ) + C_{θx1} C_{x1x1}⁻¹ (x1 − E(x1)) + C_{θx2} C_{x2x2}⁻¹ (x2 − E(x2)). The MMSE estimate can be updated sequentially!
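A sketch of such a sequential MMSE update for the scalar Gaussian case (Gaussian prior on a DC level, independent Gaussian observations; all numeric values are illustrative):

import numpy as np

rng = np.random.default_rng(2)
var_w, mu_A, var_A = 1.0, 0.0, 0.25
A_true = 0.4
x = A_true + rng.normal(0, np.sqrt(var_w), 20)     # independent observations

# Sequential update of the Gaussian posterior N(mu, var) after each new sample.
mu, var = mu_A, var_A
for xn in x:
    k = var / (var + var_w)                        # gain for the new observation
    mu = mu + k * (xn - mu)
    var = (1 - k) * var

# Batch MMSE estimate for comparison.
N = len(x)
alpha = var_A / (var_A + var_w / N)
mu_batch = alpha * np.mean(x) + (1 - alpha) * mu_A

print("sequential MMSE:", mu)
print("batch MMSE     :", mu_batch)                # identical up to rounding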

MAP estimator: θ̂ = arg max_θ p(θ|x) = arg max_θ p(x|θ) p(θ). Benefits compared with MMSE: the normalizing term p(x) is not needed (it is typically hard to find), and optimization is generally easier than finding the conditional expectation.
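A minimal sketch of this benefit: maximize the unnormalized log posterior log p(x|A) + log p(A) numerically, without ever computing p(x). The Gaussian-prior DC-level model and all numeric values are illustrative assumptions:

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
var_w, mu_A, var_A = 1.0, 0.0, 0.25
x = 0.4 + rng.normal(0, np.sqrt(var_w), 10)

def neg_log_posterior(A):
    # log p(x|A) + log p(A), dropping constants and the evidence p(x)
    loglik = -np.sum((x - A) ** 2) / (2 * var_w)
    logprior = -(A - mu_A) ** 2 / (2 * var_A)
    return -(loglik + logprior)

res = minimize_scalar(neg_log_posterior)
alpha = var_A / (var_A + var_w / len(x))
print("numerical MAP:", res.x)
print("closed form  :", alpha * np.mean(x) + (1 - alpha) * mu_A)   # Gaussian posterior: MAP = MMSE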

MAP vs ML estimator. Alexander Aljechin (1892-1946) became world chess champion in 1927 (by defeating Capablanca); he defended his title twice and regained it once. Magnus Carlsen became world champion in 2013 and defended the title once, in 2014. Now consider a hypothetical title game in 2015 and observe Y = y1, where y1 = win. Two hypotheses: H1: Aljechin defends the title; H2: Carlsen defends the title. Given the above statistics, f(y1|H1) > f(y1|H2). ML rule: Aljechin takes the title (although he died in 1946). MAP rule: f(H1) = 0, so Carlsen defends the title; the prior rules out the impossible hypothesis.
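A toy numerical rendering of the two rules; the likelihood values are assumed only so that f(y1|H1) > f(y1|H2) as stated, since the slides do not give the actual numbers:

import numpy as np

# Hypotheses: index 0 = H1 (Aljechin), 1 = H2 (Carlsen)
lik_win = np.array([0.75, 0.67])   # assumed f(y1|Hi), H1 given the larger likelihood
prior = np.array([0.0, 1.0])       # f(H1) = 0: Aljechin cannot play a 2015 title game

print("ML decision :", ["H1 (Aljechin)", "H2 (Carlsen)"][np.argmax(lik_win)])
print("MAP decision:", ["H1 (Aljechin)", "H2 (Carlsen)"][np.argmax(lik_win * prior)])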

Example: DC level in white noise, uniform prior U[-A0, A0]. The posterior is p(A|x) ∝ exp(−(N/(2σ²))(A − x̄)²) for |A| ≤ A0, and 0 otherwise. With the MMSE estimator we got stuck here: we cannot put the denominator (the normalizing integral) in closed form, and we cannot integrate the numerator. Let us try the MAP estimator instead. The denominator does not depend on A, so it is irrelevant; we only need to maximize the numerator. The maximum over |A| ≤ A0 is attained at Â = x̄ if |x̄| ≤ A0, and at the endpoint of the interval closest to x̄ otherwise. The MAP estimator can be found in closed form. Lesson learned (generally true): MAP is easier to find than MMSE.
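A short sketch of this estimator under the reconstruction above (the sample mean clipped to the prior support); A0, sigma and the data are illustrative choices:

import numpy as np

def map_dc_uniform_prior(x, A0):
    """MAP estimate of a DC level with prior U[-A0, A0]: the sample mean clipped to [-A0, A0]."""
    return np.clip(np.mean(x), -A0, A0)

rng = np.random.default_rng(4)
A0, sigma = 1.0, 1.0
x = 1.7 + rng.normal(0, sigma, 25)            # true level outside the prior support
print("sample mean :", np.mean(x))
print("MAP estimate:", map_dc_uniform_prior(x, A0))   # saturates at A0 = 1.0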

Element-wise MAP for a vector-valued parameter: θ̂i = arg max p(θi|x). The marginal posterior p(θi|x) requires integrating out the other parameters, so the no-integration-needed benefit is gone. This estimator minimizes the hit-or-miss risk for each i, as δ→0. Let us now define another risk function: a hit-or-miss cost on the whole error vector. It is easy to prove that, as δ→0, this Bayes risk is minimized by the vector MAP estimator θ̂ = arg max_θ p(θ|x). Element-wise MAP and vector-valued MAP are not the same: the vector-valued MAP solution and the element-wise MAP solution can differ.

Two properties of the vector MAP estimator. (1) For jointly Gaussian x and θ, the conditional mean E(θ|x) coincides with the peak of p(θ|x); hence the vector MAP and the MMSE estimators coincide. (2) Invariance does not hold for MAP (as opposed to MLE). Why does invariance hold for MLE? With α = g(θ) an invertible transformation, it holds that p(x|α) = p_θ(x|g⁻¹(α)), so maximizing over α is equivalent to maximizing over θ. However, MAP also involves the prior, and it does not hold that p_α(α) = p_θ(g⁻¹(α)), since the two densities are related through the Jacobian: p_α(α) = p_θ(g⁻¹(α)) |d g⁻¹(α)/dα|.

Example: exponential likelihood with an inverse-gamma prior; compute the MAP estimator.

Example: consider estimation of ? (the stated relation holds for MLE, but not for MAP).
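Since the specific transformation in this example is not given in the text above, here is a generic numerical sketch of the non-invariance point: an assumed Gamma posterior for θ and the illustrative transformation α = 1/θ (both choices are mine, not taken from the slides):

import numpy as np
from scipy import stats

# Assumed posterior for theta: Gamma(shape=3, scale=1); its mode (the MAP of theta) is 2.
post_theta = stats.gamma(a=3, scale=1.0)

theta = np.linspace(1e-3, 20, 200001)
map_theta = theta[np.argmax(post_theta.pdf(theta))]

# Transform alpha = g(theta) = 1/theta; the density picks up the Jacobian |d theta / d alpha| = 1/alpha^2.
alpha = np.linspace(1e-3, 5, 200001)
p_alpha = post_theta.pdf(1.0 / alpha) / alpha**2
map_alpha = alpha[np.argmax(p_alpha)]

print("MAP of theta          :", map_theta)        # ~ 2.0
print("1 / MAP of theta      :", 1.0 / map_theta)  # ~ 0.5, what invariance would give
print("MAP of alpha = 1/theta:", map_alpha)        # ~ 0.25, different: MAP is not invariant

The Jacobian shifts the peak of the transformed posterior, so the MAP of α is not g applied to the MAP of θ, whereas the MLE would transform directly.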