Lecture Note 2: Estimation and Information Theory

Univ. of Michigan - NAME 568/EECS 568/ROB 530, Winter 2018
Lecturer: Maani Ghaffari Jadidi
Date: April 6, 2018

2.1 Estimation

A static estimation problem involves estimation of the system parameters. State estimation is a dynamic estimation problem, that is, the state variables of the system evolve over time (Bar-Shalom et al. 2001).

Definition 1 (Parameter) A quantity that is assumed to be time-invariant, or whose time variation is slow compared to the state variables of a system.

The PDF of the measurements, z, conditioned on the parameter, x, is called the likelihood function of the parameter:

    l(x) \triangleq p(z \mid x)    (2.1)

The likelihood measures how likely a parameter value is given the obtained observations (evidence from the data).

2.1.1 Maximum Likelihood (ML) Estimator

The maximum likelihood method maximizes the likelihood function, leading to the ML estimator

    \hat{x}(z) = \operatorname{argmax}_{x} l(x) = \operatorname{argmax}_{x} p(z \mid x)    (2.2)

where \hat{x}(z) is a random variable, as it is a function of the random observations z. The ML estimate is the solution of the likelihood equation

    \frac{d l(x)}{dx} = \frac{d p(z \mid x)}{dx} = 0    (2.3)

Problem 1 (ML estimate of n Gaussian samples) Find the ML estimator of a sample of n i.i.d. normal random variables (univariate).

2.1.2 Maximum A Posteriori (MAP) Estimator

In the Bayesian framework, one can place a prior distribution over the parameter. Then, the MAP estimator maximizes the posterior PDF

    \hat{x}(z) = \operatorname{argmax}_{x} p(x \mid z) = \operatorname{argmax}_{x} p(z \mid x) p(x)    (2.4)

where the last equality holds because the normalization constant in the Bayes formula is independent of x.
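As a quick check on Problem 1, the following sketch (my own addition, not part of the original note; the use of NumPy/SciPy and all variable names are assumptions) numerically maximizes the Gaussian log-likelihood and compares the result with the closed-form ML estimates, the sample mean and the 1/n sample variance. Optimizing over log sigma keeps the standard deviation positive, and working with the log-likelihood anticipates Remark 2 below.

```python
# Sketch: ML estimation for n i.i.d. univariate Gaussian samples.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
z = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic measurements

def neg_log_likelihood(params, z):
    mu, log_sigma = params          # parameterize by log(sigma) so sigma > 0
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (z - mu)**2 / (2 * sigma**2))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(z,))
mu_ml, sigma_ml = res.x[0], np.exp(res.x[1])

print(mu_ml, z.mean())              # both close to the sample mean
print(sigma_ml**2, z.var(ddof=0))   # both close to the 1/n sample variance
```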

Remark 2 Since log is a monotonic function, it is often the case that we use the logarithm of the likelihood or posterior for maximization.

Problem 2 (MAP estimate of n Gaussian samples) Find the MAP estimator of a sample of n i.i.d. normal random variables (univariate). Assume the prior is also Gaussian, with mean \mu_0 and variance \sigma_0^2.

2.1.3 Least Squares (LS) Estimator

A common estimation procedure for nonrandom parameters is the LS method. Let h_k(x) be a nonlinear measurement model and z_k = h_k(x) + v_k the scalar measurements. The LS estimator of x is the solution of the following problem, known as the nonlinear LS problem:

    \hat{x} = \operatorname{argmin}_{x} \sum_{k=1}^{t} [z_k - h_k(x)]^2    (2.5)

If h_k(x) is linear in x, (2.5) becomes the linear LS problem. Furthermore, the LS problem makes no assumption on the distribution of the noise term v_k. If we assume v_k \sim \mathcal{N}(0, \sigma^2), then the LS estimator coincides with the ML estimator.

Problem 3 (Polynomial regression) Formulate the polynomial regression problem of order n using the LS estimator. Write a function that takes data and the order of the polynomial, and outputs the coefficients. Compare your results with MATLAB's built-in polyfit. What can you say about the goodness of fit? What happens when the order of the polynomial is exactly equal to the number of data points?

2.1.4 Minimum Mean Square Error (MMSE) Estimator

For random parameters, the MMSE estimator is

    \hat{x}(z) = \operatorname{argmin}_{\hat{x}} \mathbb{E}[(x - \hat{x})^2 \mid z]    (2.6)

where \hat{x} is an estimator of x. The solution of the MMSE problem is the conditional mean of x, that is

    \hat{x}(z) = \mathbb{E}[x \mid z] = \int x \, p(x \mid z) \, dx    (2.7)

Remark 3 If p(x \mid z) is Gaussian, then the MMSE estimator (the conditional mean) coincides with the MAP estimator, since the mode and the mean of a Gaussian distribution are the same.

Example 1 The MMSE estimate - the conditional mean - of a Gaussian random vector in terms of another Gaussian random vector (the measurement); a small numerical sketch follows.
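For jointly Gaussian x and z, the conditional mean in Example 1 has the standard closed form \mathbb{E}[x \mid z] = \mu_x + \Sigma_{xz} \Sigma_{zz}^{-1} (z - \mu_z), which is linear in the measurement (see, e.g., Bar-Shalom et al. 2001). The sketch below (my own illustration; the matrices H and R and all numbers are invented) evaluates it for a small synthetic setup.

```python
# Sketch: MMSE estimate (conditional mean) for jointly Gaussian x and z,
# with z = H x + v, v ~ N(0, R) independent of x.
import numpy as np

mu_x = np.array([1.0, -1.0])                 # prior mean of x
P = np.array([[2.0, 0.3], [0.3, 1.0]])       # prior covariance of x
H = np.array([[1.0, 0.0], [1.0, 1.0]])       # measurement matrix
R = 0.5 * np.eye(2)                          # measurement noise covariance

mu_z = H @ mu_x                              # mean of z
S_xz = P @ H.T                               # cross-covariance cov(x, z)
S_zz = H @ P @ H.T + R                       # covariance of z

z = np.array([1.2, 0.3])                     # an observed measurement
x_mmse = mu_x + S_xz @ np.linalg.solve(S_zz, z - mu_z)
print(x_mmse)                                # E[x | z]
```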

2.1.5 Central Limit Theorem

If the random variables X_i are independent, then under general conditions the distribution of

    Y_n = \frac{1}{n} \sum_{i=1}^{n} X_i    (2.8)

is approximately Gaussian, with mean \frac{1}{n} \sum_{i=1}^{n} \mu_i and variance \frac{1}{n^2} \sum_{i=1}^{n} \sigma_i^2, where \mu_i and \sigma_i^2 are the mean and variance of X_i. As n \to \infty, the approximation becomes more accurate (Anderson and Moore 1979). The Central Limit Theorem (CLT) plays a very important role in characterizing many real-world sources of uncertainty: it is used as the justification (or excuse) for the omnipresent Gaussian assumption. For example, the thermal noise in electronic devices, as the sum of many small contributions, is indeed close to Gaussian (Bar-Shalom et al. 2001). A short numerical illustration appears at the end of this section.

2.1.6 Filtering, Smoothing, and Prediction

This part is a summary of Anderson and Moore (1979, Chapter 2). In a broad sense, filtering may refer to the extraction of information about the internal state of a system from noisy measurements. Technically, however, filtering refers to a specific type of information processing (inference): in this narrower definition, filtering means recovery of the state at time t using measurements up to time t.

Example 2 An example of the application of filtering in everyday life is radio reception. Here the signal of interest is the voice signal. This signal is used to modulate a high-frequency carrier that is transmitted to a radio receiver. The received signal is inevitably corrupted by noise, and so, when demodulated, it is filtered to recover the original signal as well as possible.

Smoothing is another form of information processing that differs from filtering. In smoothing, information about the state need not be produced at time t, and measurements obtained later than time t can be used. The trade-off is that there must be a delay in producing the information about the state, compared with the filtering case, but more measurement data can be used than in the filtering case. Since smoothing can also use measurements taken after time t, one should expect the smoothing process to be more accurate, in some sense, than the filtering process.

Example 3 An example of smoothing is provided by the way the human brain tackles the problem of reading hastily written handwriting. Each word is tackled sequentially, and when a word is reached that is particularly difficult to interpret, several words after the difficult word, as well as those before it, may be used to deduce it.

Prediction is the forecasting side of information processing. Here we intend to obtain information about the state at some time after t, while using measurements only up to time t.

Example 4 When attempting to catch a ball, we have to predict the future trajectory of the ball in order to position a catching hand correctly. This task becomes more difficult the more the ball is subject to random disturbances such as wind gusts. Generally, any prediction task becomes more difficult as the environment becomes noisier.
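To close Section 2.1, here is the promised numerical illustration of the CLT (my own sketch, not from the note): averages of i.i.d. uniform samples have approximately the mean and variance given after (2.8), and a histogram of them would look close to a Gaussian bell.

```python
# Sketch: empirical check of the CLT for averages of i.i.d. uniform samples.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 100_000
X = rng.uniform(0.0, 1.0, size=(trials, n))  # each X_i: mean 1/2, variance 1/12
Y = X.mean(axis=1)                           # Y_n as in (2.8), one per trial

print(Y.mean())   # close to 1/2        = (1/n) * sum of mu_i
print(Y.var())    # close to (1/12)/n   = (1/n^2) * sum of sigma_i^2
```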

2.2 Solving Systems of Linear Equations

2.2.1 Cholesky Decomposition

I took this part directly from the appendix of the well-known book on Gaussian processes (Rasmussen and Williams 2006); it is a very brief and concise note on the Cholesky decomposition and on solving linear systems. The Cholesky decomposition of a symmetric, positive definite matrix A decomposes A into the product of a lower triangular matrix L and its transpose:

    L L^T = A    (2.9)

where L is called the Cholesky factor. The Cholesky decomposition is useful for solving linear systems whose coefficient matrix A is symmetric and positive definite. To solve Ax = b for x, first solve the triangular system Ly = b by forward substitution and then the triangular system L^T x = y by back substitution. Using the backslash operator, we write the solution as x = L^T \ (L \ b), where the notation A \ b denotes the vector x which solves Ax = b. Both the forward and backward substitution steps require n^2/2 operations when A is of size n x n. The computation of the Cholesky factor L is considered numerically extremely stable and takes time n^3/6, so it is the method of choice when it can be applied.

2.2.2 QR Decomposition

This part is taken from the lecture notes for EE263, Stephen Boyd, Stanford, 2008 (Boyd 2008). Consider an overdetermined set of linear equations, in which there are more equations than unknowns. We wish to approximately solve Ax = b, where A \in \mathbb{R}^{m \times n} is skinny, i.e., m > n. The QR decomposition or factorization finds A = QR, where Q \in \mathbb{R}^{m \times n} has orthonormal columns, i.e., Q^T Q = I_n, and R \in \mathbb{R}^{n \times n} is an upper triangular and invertible matrix. To approximately solve Ax = b, find A = QR and substitute it into the pseudo-inverse of A, i.e., (A^T A)^{-1} A^T, resulting in \hat{x} = R \ (Q^T b).
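Both solves translate directly into code. The sketch below (a minimal illustration; SciPy's triangular solvers stand in for MATLAB's backslash operator, and the test matrices are invented) solves a symmetric positive definite system via Cholesky factorization with forward/back substitution, and an overdetermined system via the economy-size QR factorization.

```python
# Sketch: solving A x = b via Cholesky (SPD case) and QR (skinny case).
import numpy as np
from scipy.linalg import cholesky, qr, solve_triangular

rng = np.random.default_rng(0)

# Cholesky: symmetric positive definite A.
M = rng.normal(size=(5, 5))
A = M @ M.T + 5 * np.eye(5)               # SPD by construction
b = rng.normal(size=5)
L = cholesky(A, lower=True)               # A = L L^T
y = solve_triangular(L, b, lower=True)    # forward substitution: L y = b
x = solve_triangular(L.T, y, lower=False) # back substitution: L^T x = y
print(np.allclose(A @ x, b))              # True

# QR: overdetermined (skinny) A, m > n.
A = rng.normal(size=(8, 3))
b = rng.normal(size=8)
Q, R = qr(A, mode='economic')             # Q: 8x3 orthonormal cols, R: 3x3
x_hat = solve_triangular(R, Q.T @ b, lower=False)   # "R \ (Q^T b)"
print(np.allclose(x_hat, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```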

2.3 Information Theory

Entropy is a measure of the uncertainty of a random variable (Cover and Thomas 1991). The entropy H(X) of a discrete random variable X is defined as

    H(X) = \mathbb{E}_{p(x)}\!\left[\log \frac{1}{p(x)}\right] = -\sum_{x \in \mathcal{X}} p(x) \log p(x)    (2.10)

The joint entropy H(X,Y) of discrete random variables X and Y with a joint distribution p(x,y) is defined as

    H(X,Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(x,y)    (2.11)

The chain rule implies that

    H(X,Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)    (2.12)

where H(Y \mid X) is the conditional entropy, defined as

    H(Y \mid X) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log p(y \mid x)    (2.13)

Theorem 4 (Chain rule for entropy) Let X_1, X_2, ..., X_n be drawn according to the joint probability distribution p(x_1, x_2, ..., x_n). Then

    H(X_1, X_2, ..., X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, ..., X_1)    (2.14)

Proof: Please refer to Cover and Thomas (1991) for the proof.

The relative entropy, or Kullback-Leibler divergence (KLD), is a measure of the distance between two distributions p(x) and q(x). It is defined as

    D(p \| q) = \mathbb{E}_{p(x)}\!\left[\log \frac{p(x)}{q(x)}\right]    (2.15)

Theorem 5 (Information inequality) Let X be a discrete random variable, and let p(x) and q(x) be two probability mass functions. Then

    D(p \| q) \geq 0    (2.16)

with equality if and only if p(x) = q(x).

Proof: Please refer to Cover and Thomas (1991) for the proof.

The mutual information I(X;Y) is the reduction in the uncertainty of one random variable due to knowledge of the other. The mutual information is non-negative and can be written as

    I(X;Y) = D(p(x,y) \,\|\, p(x)p(y)) = \mathbb{E}_{p(x,y)}\!\left[\log \frac{p(x,y)}{p(x)p(y)}\right]    (2.17)

    I(X;Y) = H(X) - H(X \mid Y)    (2.18)
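Identities (2.12), (2.17), and (2.18) are easy to verify numerically for a small discrete distribution. The snippet below (my own sketch; the 2x3 joint pmf is arbitrary) computes the entropies in bits and checks that the different expressions for I(X;Y) agree.

```python
# Sketch: discrete entropy, conditional entropy, and mutual information
# computed from a small joint pmf (log base 2, i.e., bits).
import numpy as np

p_xy = np.array([[0.10, 0.20, 0.05],
                 [0.25, 0.10, 0.30]])   # joint pmf p(x, y); entries sum to 1
p_x = p_xy.sum(axis=1)                  # marginal p(x)
p_y = p_xy.sum(axis=0)                  # marginal p(y)

H = lambda p: -np.sum(p * np.log2(p))   # entropy; assumes strictly positive pmf

H_xy = H(p_xy)                          # joint entropy (2.11)
H_x_given_y = H_xy - H(p_y)             # chain rule (2.12)

I_chain = H(p_x) - H_x_given_y          # mutual information via (2.18)
I_kld = np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y)))  # via (2.17)
print(np.isclose(I_chain, I_kld))       # True
```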

Corollary 6 (Nonnegativity of mutual information) For any two random variables X and Y,

    I(X;Y) \geq 0    (2.19)

with equality if and only if X and Y are independent.

Proof: I(X;Y) = D(p(x,y) \,\|\, p(x)p(y)) \geq 0, with equality if and only if p(x,y) = p(x)p(y).

Some immediate consequences of the provided definitions are as follows.

Lemma 7 For any discrete random variable X, we have H(X) \geq 0.

Proof: 0 \leq p(x) \leq 1 implies that \log \frac{1}{p(x)} \geq 0.

Theorem 8 (Conditioning reduces entropy) For any two random variables X and Y,

    H(X \mid Y) \leq H(X)    (2.20)

with equality if and only if X and Y are independent.

Proof: 0 \leq I(X;Y) = H(X) - H(X \mid Y).

We now define the equivalents of the functions mentioned above for probability density functions.

Definition 9 (Differential entropy) Let X be a continuous random variable whose support set is S, and let p(x) be the probability density function of X. The differential entropy h(X) of X is defined as

    h(X) = -\int_S p(x) \log p(x) \, dx    (2.21)

Remark 10 The differential entropy of a set of continuous random variables that have a joint distribution is also defined using (2.21).

Definition 11 (Conditional differential entropy) Let X and Y be continuous random variables that have a joint probability density function p(x,y). The conditional differential entropy h(X \mid Y) is defined as

    h(X \mid Y) = -\int p(x,y) \log p(x \mid y) \, dx \, dy    (2.22)

Definition 12 (Relative entropy (KLD)) The relative entropy (KLD) between two probability density functions p and q is defined as

    D(p \| q) = \int p \log \frac{p}{q}    (2.23)

Definition 13 (Mutual information) The mutual information I(X;Y) between two continuous random variables X and Y with joint probability density function p(x,y) is defined as

    I(X;Y) = \int p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \, dx \, dy    (2.24)
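As a concrete instance of Definition 9, for X \sim \mathcal{N}(\mu, \sigma^2) the integral (2.21) evaluates to h(X) = \frac{1}{2} \ln(2\pi e \sigma^2) nats (a standard result, not derived in the note). The sketch below checks it against a Monte Carlo estimate of E[-\ln p(X)].

```python
# Sketch: differential entropy of a Gaussian, closed form vs. Monte Carlo.
import numpy as np

sigma = 1.5
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)   # nats

rng = np.random.default_rng(0)
x = rng.normal(0.0, sigma, size=1_000_000)
log_p = -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)
h_mc = -log_p.mean()                                   # estimate of E[-ln p(X)]

print(h_closed, h_mc)   # agree to a few decimal places
```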

2.4 Resources

Here I list some good references for background knowledge; of course, there are many other notable books and resources available. For probability and statistics, see Papoulis and Pillai (2002). For optimal filtering, including Kalman filtering, linear systems, and estimation, see Anderson and Moore (1979). For a classic textbook on estimation with a focus on target tracking and navigation, which also covers an introduction to probability theory, see Bar-Shalom et al. (2001).

References

Yaakov Bar-Shalom, X. Rong Li, and Thiagalingam Kirubarajan. Estimation with applications to tracking and navigation: theory, algorithms and software. John Wiley & Sons, 2001.

Brian D. O. Anderson and John B. Moore. Optimal filtering. Prentice-Hall, Englewood Cliffs, NJ, 1979.

C. E. Rasmussen and C. K. I. Williams. Gaussian processes for machine learning. MIT Press, 2006.

Stephen Boyd. Lecture notes for EE263: Introduction to Linear Dynamical Systems. Stanford University, 2008.

Thomas M. Cover and Joy A. Thomas. Elements of information theory. John Wiley & Sons, 1991.

Athanasios Papoulis and S. Unnikrishna Pillai. Probability, random variables, and stochastic processes. McGraw-Hill Education, 2002.