
ATMO/OPTI 656b Spring 2009

Bayesian Retrievals

Note: This follows the discussion in Chapter 2 of Rodgers (2000).

As we have seen, the problem with the nadir-viewing emission measurements is that they do not contain sufficient information for stand-alone retrievals of vertical atmospheric structure without smoothing, which increases accuracy but limits our ability to determine the vertical structure of the atmosphere. Optimum utilization of their information is achieved by combining them with other a priori information about the atmospheric state. Examples of such a priori information are climatology and weather forecasts. Use of a priori information suggests a Bayesian approach or framework: we know something about the atmospheric state, x, before we make a set of measurements, y, and then refine our knowledge of x based on the measurements, y.

We need to define some probabilities.
P(x) is the a priori pdf of the state, x.
P(y) is the a priori pdf of the measurement, y.
P(x,y) is the joint pdf of x and y, meaning that P(x,y) dx dy is the probability that x lies in the interval (x, x+dx) and y lies in (y, y+dy).
P(y|x) is the conditional pdf of y given x, meaning that P(y|x) dy is the probability that y lies in (y, y+dy) when x has a given value.
P(x|y) is the conditional pdf of x given y, meaning that P(x|y) dx is the probability that x lies in (x, x+dx) when the measurement y has a given value.

[Figure: contour plot of the joint pdf P(x,y) in the (x, y) plane.]

Consider the joint probability, P(x,y), shown as the contours in the figure above. P(x) is given by the integral of P(x,y) over all values of y:

    P(x) = \int_{-\infty}^{\infty} P(x,y) \, dy    (1)
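As a rough numerical check of the marginalization in (1) (and of the analogous marginal over x given next), here is a minimal Python sketch; the particular correlated bivariate Gaussian used for P(x,y) is purely an illustrative assumption, not anything from the notes.

```python
# Minimal sketch: build a joint pdf P(x, y) on a grid and integrate over y
# to recover the marginal P(x), as in (1). The correlated-Gaussian joint
# used here is an illustrative assumption.
import numpy as np

x = np.linspace(-5.0, 5.0, 201)
y = np.linspace(-5.0, 5.0, 201)
dx, dy = x[1] - x[0], y[1] - y[0]
X, Y = np.meshgrid(x, y, indexing="ij")

# Example joint pdf: zero-mean bivariate Gaussian with correlation 0.6
rho = 0.6
norm = 1.0 / (2.0 * np.pi * np.sqrt(1.0 - rho**2))
P_xy = norm * np.exp(-(X**2 - 2.0 * rho * X * Y + Y**2) / (2.0 * (1.0 - rho**2)))

# Marginal P(x) = integral of P(x,y) over y, approximated by a Riemann sum
P_x = P_xy.sum(axis=1) * dy

print("integral of P(x) dx =", P_x.sum() * dx)   # should be close to 1
```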

Similarly,

    P(y) = \int_{-\infty}^{\infty} P(x,y) \, dx    (2)

The conditional probability, P(y|x=x_1), is proportional to P(x,y) as a function of y for a given value of x = x_1. This is P(x,y) along a particular vertical line of x = x_1. Now P(y|x_1) must be normalized such that

    \int_{-\infty}^{\infty} P(y|x_1) \, dy = 1    (3)

To normalize, we divide P(x=x_1, y) by the integral of P(x=x_1, y) over all y:

    P(y|x_1) = \frac{P(x_1, y)}{\int P(x_1, y) \, dy}    (4)

Now we substitute (1) into (4):

    P(y|x) \, P(x) = P(x,y)    (5)

We can do the same for P(x|y=y_1):

    P(x|y) \, P(y) = P(x,y)    (6)

We can combine (5) and (6) to get

    P(x|y) \, P(y) = P(y|x) \, P(x)    (7)

For the present context, where we are interested in the best estimate of the state, x, given measurements, y, we write (7) as

    P(x|y) = \frac{P(y|x) \, P(x)}{P(y)}    (8)

In this form, we see that on the left side we have the posterior probability density of x given a particular set of measurements, y. On the right side we have the pdf of the a priori knowledge of the state, x, and the dependence of the measurements, y, on the state, x. The term P(y) is usually just viewed as a normalization factor for P(x|y) such that

    \int_{-\infty}^{\infty} P(x|y) \, dx = 1    (9)

With these we can update the a priori probability of the state, x_a, based on the actual observations and form the refined posterior probability density function, P(x|y). Note that (8) tells us the probability of x but not x itself.
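The update in (8) and (9) can be made concrete with a small numerical sketch for a scalar state on a grid. The prior, the toy forward relation y ≈ 2x, and the noise level below are all illustrative assumptions, not values from the notes.

```python
# Minimal sketch of the Bayes update in (8)-(9) for a scalar state on a grid.
import numpy as np

x = np.linspace(-5.0, 5.0, 1001)
dx = x[1] - x[0]

# A priori pdf P(x): Gaussian centered on x_a = 1 with sigma_a = 1.5 (assumed)
x_a, sigma_a = 1.0, 1.5
P_x = np.exp(-(x - x_a)**2 / (2.0 * sigma_a**2))
P_x /= P_x.sum() * dx

# Likelihood P(y|x) evaluated at an observed y_obs, assuming y = 2x + noise
y_obs, sigma_eps = 3.0, 0.8
P_y_given_x = np.exp(-(y_obs - 2.0 * x)**2 / (2.0 * sigma_eps**2))

# Posterior P(x|y) = P(y|x) P(x) / P(y); P(y) is just the normalization (9)
P_post = P_y_given_x * P_x
P_post /= P_post.sum() * dx

print("prior mean     =", (x * P_x).sum() * dx)
print("posterior mean =", (x * P_post).sum() * dx)
```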

Now consider a linear problem with Gaussian pdfs. In the linear case, y = Kx, where K is a matrix that represents dy/dx. If this were all that there were to the situation, then knowing x and K we would know y, and the pdf P(y|x) would simply be a delta function, \delta(y - Kx). However, in a more realistic case,

    y = Kx + \epsilon

where \epsilon is the set of measurement errors. Therefore we don't know y exactly if we know x, because y is a bit blurry due to its random errors. We assume the y errors are Gaussian and generalize the Gaussian pdf for a scalar, y,

    P(y) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( -\frac{(y-\mu)^2}{2\sigma^2} \right)    (10)

to a vector of measurements, y, where \mu is the mean of the random variable y and \sigma is the standard deviation of y, defined as \langle (y-\mu)^2 \rangle^{1/2}.

The natural log of the pdf is quite useful when working with Gaussian pdfs, creating a linear relation between the conditional pdfs. From (8) we can write

    \ln[P(x|y)] = \ln\!\left[ \frac{P(y|x) \, P(x)}{P(y)} \right] = \ln[P(y|x)] + \ln[P(x)] - \ln[P(y)]    (ln 8)

So, with y and x now being measurement and state vectors respectively, we can write P(y|x) as

    -\ln P(y|x) = \tfrac{1}{2} (y - Kx)^T S_\epsilon^{-1} (y - Kx) + c_1    (11)

where c_1 is some constant and we have assumed the errors in y have zero mean (generally not strictly true), such that the mean of y is Kx. Note that Kx is the expected measurement if x is the state vector. The covariance of the measurement errors is S_\epsilon. The error covariance is defined as follows. Given a set of measurements, y_1, y_2, ..., y_n, each with an error, \epsilon_1, \epsilon_2, ..., \epsilon_n, the covariance of the errors is shown in (12) for a set of 4 measurements:

    S_\epsilon =
    \begin{pmatrix}
    \langle\epsilon_1\epsilon_1\rangle & \langle\epsilon_1\epsilon_2\rangle & \langle\epsilon_1\epsilon_3\rangle & \langle\epsilon_1\epsilon_4\rangle \\
    \langle\epsilon_2\epsilon_1\rangle & \langle\epsilon_2\epsilon_2\rangle & \langle\epsilon_2\epsilon_3\rangle & \langle\epsilon_2\epsilon_4\rangle \\
    \langle\epsilon_3\epsilon_1\rangle & \langle\epsilon_3\epsilon_2\rangle & \langle\epsilon_3\epsilon_3\rangle & \langle\epsilon_3\epsilon_4\rangle \\
    \langle\epsilon_4\epsilon_1\rangle & \langle\epsilon_4\epsilon_2\rangle & \langle\epsilon_4\epsilon_3\rangle & \langle\epsilon_4\epsilon_4\rangle
    \end{pmatrix}    (12)

where \langle\epsilon_i\epsilon_j\rangle represents the expected value of \epsilon_i\epsilon_j. For a Gaussian pdf, a mean and covariance are all that are required to define the pdf.

pdf of the a priori state

Now we look at the pdf of the a priori estimate of the state, x_a. To keep things simple, we also assume a Gaussian distribution, an assumption that is generally less realistic than it is for the measurement errors. The resulting pdf is given by

    -\ln P(x) = \tfrac{1}{2} (x - x_a)^T S_a^{-1} (x - x_a) + c_2    (13)

where S_a is the associated covariance matrix,

    S_a = \langle (x - x_a)(x - x_a)^T \rangle    (14)
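The pieces defined so far can be sketched numerically: a sample-based stand-in for the expectation in (12), and the two quadratic terms (11) and (13) evaluated up to their constants. The matrix K, the sizes, and all numerical values below are illustrative assumptions.

```python
# Sketch of (11)-(14): estimate S_eps from error samples (the sample average
# approximates the expectation in (12)), then evaluate the measurement and
# a priori quadratic terms up to the constants c_1 and c_2.
import numpy as np

rng = np.random.default_rng(0)

n_obs = 4
K = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0],
              [1.0, 1.0]])          # assumed stand-in for dy/dx

# S_eps as the expected value of eps eps^T, approximated from samples
eps_samples = rng.normal(scale=0.3, size=(10000, n_obs))
S_eps = eps_samples.T @ eps_samples / eps_samples.shape[0]

x_a = np.array([1.0, 2.0])           # assumed a priori state
S_a = np.diag([0.5**2, 0.5**2])      # assumed a priori covariance

x_true = np.array([1.2, 1.7])
y = K @ x_true + rng.normal(scale=0.3, size=n_obs)

def neg_log_likelihood(x):
    r = y - K @ x
    return 0.5 * r @ np.linalg.solve(S_eps, r)   # (11) up to c_1

def neg_log_prior(x):
    d = x - x_a
    return 0.5 * d @ np.linalg.solve(S_a, d)     # (13) up to c_2

print("measurement term at x_a :", neg_log_likelihood(x_a))
print("a priori term at x_true :", neg_log_prior(x_true))
```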

Now we plug (11) and (13) into (ln 8) to get the posterior pdf, P(x|y):

    -\ln P(x|y) = \tfrac{1}{2} (y - Kx)^T S_\epsilon^{-1} (y - Kx) + \tfrac{1}{2} (x - x_a)^T S_a^{-1} (x - x_a) + c_3    (15)

Now we recognize that (15) is quadratic in x and must therefore be writable as a Gaussian distribution. So we match it with a Gaussian solution,

    -\ln P(x|y) = \tfrac{1}{2} (x - \hat{x})^T \hat{S}^{-1} (x - \hat{x}) + c_4    (16)

where \hat{x} is the optimum solution and \hat{S} is the posterior covariance representing the Gaussian distribution of uncertainty in the optimum solution. Equating the terms that are quadratic in x in (15) and (16),

    x^T K^T S_\epsilon^{-1} K x + x^T S_a^{-1} x = x^T \hat{S}^{-1} x    (17)

so that the inverse of the posterior covariance, \hat{S}^{-1}, is

    \hat{S}^{-1} = K^T S_\epsilon^{-1} K + S_a^{-1}    (18)

Now we also equate the terms in (15) and (16) that are linear in x^T:

    -x^T K^T S_\epsilon^{-1} y - x^T S_a^{-1} x_a = -x^T \hat{S}^{-1} \hat{x}    (19)

Substituting (18) into (19) yields

    -x^T K^T S_\epsilon^{-1} y - x^T S_a^{-1} x_a = -x^T ( K^T S_\epsilon^{-1} K + S_a^{-1} ) \hat{x}    (20)

Now this must be true for any x^T, so

    K^T S_\epsilon^{-1} y + S_a^{-1} x_a = ( K^T S_\epsilon^{-1} K + S_a^{-1} ) \hat{x}    (21)

So

    \hat{x} = ( K^T S_\epsilon^{-1} K + S_a^{-1} )^{-1} ( K^T S_\epsilon^{-1} y + S_a^{-1} x_a )    (22)

(22) shows that the optimum state, \hat{x}, is a weighted average of the a priori guess and the measurements. We get the form I have shown previously by isolating and manipulating the x_a term:

    ( K^T S_\epsilon^{-1} K + S_a^{-1} )^{-1} S_a^{-1} x_a
        = ( K^T S_\epsilon^{-1} K + S_a^{-1} )^{-1} S_a^{-1} x_a + ( K^T S_\epsilon^{-1} K + S_a^{-1} )^{-1} ( K^T S_\epsilon^{-1} K - K^T S_\epsilon^{-1} K ) x_a
        = ( K^T S_\epsilon^{-1} K + S_a^{-1} )^{-1} ( K^T S_\epsilon^{-1} K + S_a^{-1} ) x_a - ( K^T S_\epsilon^{-1} K + S_a^{-1} )^{-1} ( K^T S_\epsilon^{-1} K ) x_a
        = x_a - ( K^T S_\epsilon^{-1} K + S_a^{-1} )^{-1} ( K^T S_\epsilon^{-1} K ) x_a    (23)

Plug (23) back into (22) to get

    \hat{x} = x_a + ( K^T S_\epsilon^{-1} K + S_a^{-1} )^{-1} K^T S_\epsilon^{-1} ( y - K x_a )    (24)

which is indeed the form I showed previously.
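A minimal numerical sketch of (18), (22), and (24) follows, confirming that the two forms of \hat{x} agree; the values of K, S_\epsilon, S_a, x_a, and y are assumed toy numbers, not anything from the notes.

```python
# Sketch of the optimal-estimation update: x_hat from the weighted-average
# form (22) and from the increment form (24), plus the posterior covariance
# from (18). All inputs are assumed toy values.
import numpy as np

K = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])
S_eps = 0.2**2 * np.eye(3)           # measurement-error covariance (assumed)
S_a = np.diag([1.0, 0.5**2])         # a priori covariance (assumed)
x_a = np.array([0.0, 1.0])
y = np.array([0.3, 0.6, 1.1])

Si = np.linalg.inv(S_eps)
S_hat_inv = K.T @ Si @ K + np.linalg.inv(S_a)    # eq (18)
S_hat = np.linalg.inv(S_hat_inv)

# Form (22): weighted average of measurements and a priori
x_hat_22 = S_hat @ (K.T @ Si @ y + np.linalg.inv(S_a) @ x_a)

# Form (24): a priori plus weighted difference (y - K x_a)
x_hat_24 = x_a + S_hat @ K.T @ Si @ (y - K @ x_a)

print("x_hat via (22):", x_hat_22)
print("x_hat via (24):", x_hat_24)    # should agree to rounding
print("posterior covariance S_hat:\n", S_hat)
```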

As I said in class and in the Atmo seminar, (24) shows that the optimum posterior solution, \hat{x}, for the atmospheric state, given an a priori estimate of the atmospheric state, x_a, with its covariance, S_a, and a set of measurements, y, with its error covariance, S_\epsilon, is the a priori guess plus a weighted version of the difference between the expected measurement, K x_a, and the actual measurement, y.

Note that while (8) is true in general, independent of the pdfs, (22) and (24) are true only as long as the uncertainty in the a priori state estimate and in the measurements can be accurately described in terms of Gaussian pdfs. Note also that the unique solution, \hat{x}, depends on the a priori estimate, the measurements, and their respective covariances. Change the covariances and the optimum state changes. Also note that it is assumed that there is no bias in the measurement errors or in the a priori estimate. This is not true in general.

Consider the analogous situation with two estimates, x_1 and x_2, of the same variable, x. We want the best estimate of x, \hat{x}, given the two estimates. To create this estimate, we need to know the uncertainty in each of the two estimates. If we know the uncertainty in each, \sigma_1 and \sigma_2, then we can weight the two estimates to create the best estimate. We will take the best estimate to be the estimate with the smallest uncertainty, \sigma_{\hat{x}}. The uncertainty or error in \hat{x} is the expected value of (\hat{x} - x_T)^2, where x_T is the true value of x. Write

    \hat{x} = A x_1 + (1 - A) x_2    (25)

where A is a weight between 0 and 1. The error in \hat{x} is

    \hat{x} - x_T = A x_1 + (1 - A) x_2 - [A + (1 - A)] x_T = A (x_1 - x_T) + (1 - A)(x_2 - x_T)    (26)

So the variance of the error in \hat{x} is

    \sigma_{\hat{x}}^2 = \langle [ A (x_1 - x_T) + (1 - A)(x_2 - x_T) ]^2 \rangle
                       = A^2 \sigma_1^2 + (1 - A)^2 \sigma_2^2 + 2 A (1 - A) \langle (x_1 - x_T)(x_2 - x_T) \rangle    (27)

where we have used the definitions of \sigma_1^2 and \sigma_2^2. The next question is the cross term. If the errors in x_1 and x_2 are uncorrelated, then the cross term is 0. So let's keep things simple and assume this to be the case, as when the two estimates come from two different measurement systems. However, if this is not the case, then the cross term must be known. This term is equivalent to the off-diagonal terms in the covariance in (12). With the cross term dropped,

    \sigma_{\hat{x}}^2 = A^2 \sigma_1^2 + (1 - A)^2 \sigma_2^2    (28)

We want the solution that minimizes \sigma_{\hat{x}}^2, so we take the derivative of \sigma_{\hat{x}}^2 with respect to A and set it to 0:

    \frac{d \sigma_{\hat{x}}^2}{dA} = 2 A \sigma_1^2 - 2 (1 - A) \sigma_2^2 = 0    (29)

and the solution for A is

    A = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2}    (30)

so the optimum solution for x is

    \hat{x} = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2} x_1 + \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2} x_2    (31)

Manipulate this a bit:

    \hat{x} = \left( 1 - \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2} \right) x_1 + \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2} x_2
            = x_1 + \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2} ( x_2 - x_1 )    (32)

Note the similarity of (31) with

    \hat{x} = ( K^T S_\epsilon^{-1} K + S_a^{-1} )^{-1} ( K^T S_\epsilon^{-1} y + S_a^{-1} x_a )    (22)

(and, likewise, of the manipulated form (32) with (24)). The variance of the error in \hat{x} is

    \sigma_{\hat{x}}^2 = \frac{\sigma_2^4 \sigma_1^2}{(\sigma_1^2 + \sigma_2^2)^2} + \frac{\sigma_1^4 \sigma_2^2}{(\sigma_1^2 + \sigma_2^2)^2}
                       = \frac{\sigma_1^2 \sigma_2^2 (\sigma_1^2 + \sigma_2^2)}{(\sigma_1^2 + \sigma_2^2)^2}
                       = \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}    (33)

Take the inverse of (33),

    \left( \sigma_{\hat{x}}^2 \right)^{-1} = \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}    (34)

and note the similarity between it and

    \hat{S}^{-1} = K^T S_\epsilon^{-1} K + S_a^{-1}    (18)

So the vector/matrix form is indeed a generalization of the scalar form (as it must be).
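As a numerical sanity check of (25) through (34), here is a sketch of the scalar inverse-variance combination, together with the same problem written in the vector/matrix form (18) and (22) by treating x_1 as the a priori value and x_2 as a single measurement with K = [[1]]. The numbers are illustrative assumptions.

```python
# Scalar two-estimate combination and its matrix-form equivalent.
import numpy as np

x1, sigma1 = 10.0, 2.0      # first estimate and its uncertainty (assumed)
x2, sigma2 = 12.0, 1.0      # second estimate and its uncertainty (assumed)

A = sigma2**2 / (sigma1**2 + sigma2**2)                      # weight, eq (30)
x_hat = A * x1 + (1.0 - A) * x2                              # eq (25)/(31)
var_hat = sigma1**2 * sigma2**2 / (sigma1**2 + sigma2**2)    # eq (33)
print("scalar:  x_hat =", x_hat, "  var =", var_hat)
print("1/var_hat =", 1.0 / var_hat,
      "  1/s1^2 + 1/s2^2 =", 1.0 / sigma1**2 + 1.0 / sigma2**2)   # eq (34)

# Vector/matrix form: x1 plays the role of x_a (S_a = sigma1^2) and x2 the
# role of a single measurement y with K = [[1]] and S_eps = sigma2^2.
K = np.array([[1.0]])
S_eps = np.array([[sigma2**2]])
S_a = np.array([[sigma1**2]])
S_hat = np.linalg.inv(K.T @ np.linalg.inv(S_eps) @ K + np.linalg.inv(S_a))   # eq (18)
x_hat_mat = S_hat @ (K.T @ np.linalg.inv(S_eps) @ np.array([x2])
                     + np.linalg.inv(S_a) @ np.array([x1]))                  # eq (22)
print("matrix form:  x_hat =", x_hat_mat[0], "  var =", S_hat[0, 0])
```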

Data Assimilation

So what is x_a and where does it come from? In numerical weather prediction (NWP) systems, x_a is the short-term weather forecast over the next forecast update cycle, typically a matter of hours. As such, x_a is a state estimate produced by a combination of the atmospheric model and past observations. This is a very powerful way to assimilate and utilize observations in the weather forecasting business. The resulting updated state estimate, \hat{x}, is called the analysis or the analyzed state. The analysis is used as the initial atmospheric state to start the next model forecast run; the weather model needs an initial state to propagate forward in time. (A schematic sketch of this cycle is given at the end of these notes.)

Using analyses to study climate

These analyses are used as the best estimate of the atmospheric state to study climate. In theory, these analyses are the best possible state estimate, optimally using the available atmospheric model and observational information. The problem, from a climate standpoint, is that the atmospheric models contain unknown errors, including biases, which are built right into the analyses via the x_a term. As such, it is problematic to use the analyses to evaluate climate model performance because they are based in part on the model. The degree of model influence on any particular atmospheric state variable depends on the degree to which observations constrain that variable: the less the variable is constrained by the observations, the more the behavior of that variable as represented in the analysis is the result of the forecast model. It is this incestuous problem for determining climate, climate evolution, and evaluating climate models that has driven us to make ATOMMS completely independent of atmospheric weather and climate models.
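Below is the schematic sketch of the forecast/analysis cycle referred to above, for a single scalar state. The toy model standing in for the NWP forecast model, the fixed error variances (a real system would propagate the forecast error covariance), and the synthetic "truth" are all assumptions for illustration only.

```python
# Schematic forecast/analysis cycle: the short-term forecast provides x_a,
# an observation updates it to the analysis via the scalar form of (24),
# and the analysis starts the next forecast.
import numpy as np

rng = np.random.default_rng(1)

def model(x):
    """Toy forecast model: stand-in for propagating the state forward in time."""
    return 0.9 * x + 1.0

x_true = 5.0                      # synthetic true state
x_analysis = 3.0                  # initial analysis
sigma_a2, sigma_o2 = 1.0, 0.25    # assumed forecast and observation error variances

for cycle in range(5):
    # Forecast step: propagate the previous analysis to get the a priori x_a
    x_true = model(x_true) + rng.normal(scale=0.2)   # truth evolves with model error
    x_a = model(x_analysis)

    # Observe the true state with error
    y = x_true + rng.normal(scale=np.sqrt(sigma_o2))

    # Analysis step: scalar version of (24) with K = 1
    gain = sigma_a2 / (sigma_a2 + sigma_o2)
    x_analysis = x_a + gain * (y - x_a)

    print(f"cycle {cycle}: truth={x_true:.2f}  forecast={x_a:.2f}  analysis={x_analysis:.2f}")
```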