B4 Estimation and Inference


B4 Estimation and Inference. 6 lectures, Hilary Term 2007, 2 tutorial sheets. A. Zisserman

Overview
Lectures 1 & 2: Introduction. Sensors, and basics of probability density functions for representing sensor error and uncertainty.
Lectures 3 & 4: Estimators. Maximum Likelihood (ML) and Maximum a Posteriori (MAP).
Lectures 5 & 6: Decisions and classification. Loss functions, discriminant functions, linear classifiers.

Textbooks 1
Estimation with Applications to Tracking and Navigation: Theory, Algorithms and Software. Yaakov Bar-Shalom, Xiao-Rong Li, Thiagalingam Kirubarajan, Wiley, 2001.
Covers probability and estimation, but contains much more material than required for this course. Use also for C4B Mobile Robots.

Textbooks 2
Pattern Classification. Richard O. Duda, Peter E. Hart, David G. Stork, Wiley, 2000.
Covers classification, but also contains much more material than required for this course.

Background reading and web resources
Information Theory, Inference, and Learning Algorithms. David J. C. MacKay, CUP, 2003. Covers all the course material, though at an advanced level. Available online.
Introduction to Random Signals and Applied Kalman Filtering. R. Brown and P. Hwang, Wiley, 1997. Good review of probability and random variables in the first chapter.
Further reading (www addresses) and the lecture notes are at http://www.robots.ox.ac.uk/~az/lectures

One more book for background reading
Pattern Recognition and Machine Learning. Christopher Bishop, Springer, 2006. Excellent on classification and regression; quite advanced.

Introduction: Sensors and estimation
Sensors are used to give information about the state of some system. Our aim in this course is to develop methods which can be used to combine multiple sensor measurements, possibly from multiple sensors, possibly over time, with prior information, to obtain accurate estimates of a system's state.

Noise and uncertainty in sensors
Real sensors give inexact measurements for a variety of reasons: discretization error (e.g. measurements on a grid), calibration error, quantization noise (e.g. CCD). Successive observations of a system or phenomenon do not produce exactly the same result. Statistical methods are used to describe and understand this variability, and to incorporate it into the decision-making process.
Examples: ultrasound, laser range scanner, CCD images, GPS.

Ultrasound. Objective: diagnose heart disease. [figures: axial view, where brightness codes depth; contrast-agent-enhanced image; Doppler velocity image] A prior shape model of the heart is used for inference.

Laser range scanner. Objective: build a map of a room. [figure: scanned (x, y) points] Good quality data, up to discretization on a grid.

CCD sensor. Objective: read a number plate from a low-light image sequence. Temporal averaging is used to suppress zero-mean additive noise, I(t) = I + n(t). [figures: a single frame, and time-averaged, histogram-equalized frames for increasing numbers of frames]
Averaging N noise samples with zero mean and variance $\sigma^2$ gives a result with zero mean and variance $\sigma^2/N$.
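A small numpy sketch (not from the notes) of this variance-reduction rule; sigma, N and the number of trials are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0        # assumed noise standard deviation (illustrative)
N = 100            # number of frames averaged
trials = 10000     # repeat the averaging many times to estimate the variance

# each row is one experiment: N zero-mean noise samples with variance sigma^2
noise = rng.normal(0.0, sigma, size=(trials, N))
averages = noise.mean(axis=1)

print("empirical variance of the average:", averages.var())
print("predicted sigma^2 / N            :", sigma**2 / N)
```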

GPS (Global Positioning System). Objective: estimate a car trajectory. [figures: GPS data collected from a car, and a close-up]

GPS: error sources. Fit (estimate) a curve and lines to reduce the random errors.

Two canonical estimation problems
1. Regression: estimate parameters, e.g. of a line fit to the trajectory of a car.
2. Classification: estimate a class, e.g. handwritten digit classification ("1" or "7"?).

The need for probability
We have seen that there is uncertainty in the measurements due to sensor noise. There may also be uncertainty in the model we fit. Finally, we often want more than just a single value for an estimate; we would also like to model the confidence or uncertainty of the estimate. Probability theory, the calculus of uncertainty, provides the mathematical machinery.

Revision of probability theory: probability density functions (pdfs); joint and conditional probability; independence; Normal (Gaussian) distributions of one and two variables.

1D probability: a brief review
Discrete sets. Suppose an experiment has a set of possible outcomes S, and an event E is a subset of S. Then

probability of E = relative frequency of the event = (number of outcomes in E) / (total number of outcomes in S).

If S = {a_1, a_2, ..., a_n} has probabilities {p_1, p_2, ..., p_n}, then $\sum_{i=1}^{n} p_i = 1$ with each $p_i \geq 0$.
E.g. throw a die: the probability of any particular number is 1/6, and the probability of an even number is 1/2.

Probability density function (pdf). For a continuous random variable X,
$$P(x < X < x + dx) = p(x)\,dx, \qquad \int p(x)\,dx = 1, \qquad p(x) \geq 0.$$
The probability over a range of x is given by the area under the curve.

PDF example 1: univariate Normal distribution
The most important example of a continuous density is the normal, or Gaussian, distribution:
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad x \sim N(\mu, \sigma^2).$$
[figure: Gaussian pdf with $\mu = 0$, $\sigma = 1$]
E.g. a model of the sensor response for measured x when the true value is x = 0.

PDF example 2: a uniform distribution
$$p(x) \sim U(a, b) = \frac{1}{b - a} \quad \text{for } a \leq x \leq b.$$
Example: a laser range finder.
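For illustration (not part of the notes), a short numpy sketch that evaluates this Gaussian density for mu = 0, sigma = 1 and checks numerically that it integrates to 1:

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Univariate Gaussian density p(x) = exp(-(x-mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)."""
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

x = np.linspace(-6.0, 6.0, 2001)
p = normal_pdf(x)                 # mu = 0, sigma = 1, as in the figure
dx = x[1] - x[0]
print("area under the curve:", p.sum() * dx)   # close to 1, as required for a pdf
```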

PDF example 3: a histogram
[figure: image intensity histogram, frequency against intensity] Normalize the histogram to obtain a pdf.

Joint and conditional probability
Consider two (discrete) variables A and B.
Joint probability distribution of A and B: $P(A, B)$ = probability of A and B both occurring, with $\sum_{i,j} P(A_i, B_j) = 1$.
Conditional probability distribution of A given B: $P(A \mid B) = P(A, B)/P(B)$.
Marginal (unconditional) distribution of A: $P(A) = \sum_j P(A, B_j)$.

Example
Consider two sensors measuring the x and y coordinates of an object in 2D. The joint distribution is given by the spreadsheet, where the array entry (i, j) is $P(X = i, Y = j)$:

          X = 0   X = 1   X = 2   row sum
Y = 0      0.32    0.03    0.01    0.36
Y = 1      0.06    0.24    0.02    0.32
Y = 2      0.02    0.03    0.27    0.32

To compute the joint distribution:
1. Count the number of times the measured location is at (X, Y), for X = 0, ..., 2 and Y = 0, ..., 2.
2. Normalize the count matrix such that $\sum_{i,j} P(i, j) = 1$.

Exercise: compute the marginals and the conditional P(Y | X = 1).

          X = 0   X = 1   X = 2   row sum
Y = 0      0.32    0.03    0.01    0.36
Y = 1      0.06    0.24    0.02    0.32
Y = 2      0.02    0.03    0.27    0.32
col sum    0.40    0.30    0.30    1.00

Marginals: e.g. P(Y = 0) is the first row sum, and $P(X = 1) = \sum_Y P(X = 1, Y)$.

Conditional P(Y | X = 1): normalize the column P(X = 1, Y) so that it is a probability,
$$P(Y \mid X = 1) = \frac{P(X = 1, Y)}{P(X = 1)}$$

          X = 1
Y = 0      0.03 / 0.30
Y = 1      0.24 / 0.30
Y = 2      0.03 / 0.30
col sum    1.0

In words: the probability of Y given that X = 1.
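A small numpy sketch (illustrative, not from the notes) that reproduces the marginals and the conditional P(Y | X = 1) from the table above:

```python
import numpy as np

# joint distribution P(X, Y); rows are Y = 0,1,2 and columns are X = 0,1,2
P = np.array([[0.32, 0.03, 0.01],
              [0.06, 0.24, 0.02],
              [0.02, 0.03, 0.27]])

P_X = P.sum(axis=0)        # marginal P(X): sum over Y (down each column)
P_Y = P.sum(axis=1)        # marginal P(Y): sum over X (along each row)
print("P(X) =", P_X)       # [0.4, 0.3, 0.3]
print("P(Y) =", P_Y)       # [0.36, 0.32, 0.32]

# conditional P(Y | X = 1) = P(X = 1, Y) / P(X = 1)
P_Y_given_X1 = P[:, 1] / P_X[1]
print("P(Y | X=1) =", P_Y_given_X1)   # [0.1, 0.8, 0.1]
```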

Bayes' rule (or Bayes' theorem)
From the definition of the conditional,
$$P(A, B) = P(A \mid B)\,P(B), \qquad P(A, B) = P(B \mid A)\,P(A).$$
Hence
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}.$$
Bayes' rule lets the dependence of the conditionals be reversed; this will be important later for Maximum a Posteriori (MAP) estimation.
Writing $P(B) = \sum_i P(A_i, B) = \sum_i P(B \mid A_i)\,P(A_i)$,
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{P(B \mid A)\,P(A)}{\sum_i P(B \mid A_i)\,P(A_i)},$$
with similar expressions for $P(B \mid A)$.

Independent variables
Two variables are independent if (and only if)
$$p(A, B) = p(A)\,p(B),$$
i.e. all joint values equal the product of the marginals. Compare with conditional probability, $p(A, B) = p(A \mid B)\,p(B)$; so $p(A \mid B) = p(A)$ and similarly $p(B \mid A) = p(B)$.
E.g. two throws of a die or coin are independent; two cards picked without replacement from the same pack are not independent.

          H_1     T_1     row sum
H_2        0.25    0.25    0.5
T_2        0.25    0.25    0.5
col sum    0.5     0.5     1.0

Each entry is the product of the marginals: 0.25 = 0.5 × 0.5.

Examples: are these joint distributions independent?

          X = 0   X = 1   X = 2   row sum
Y = 0      0.32    0.03    0.01    0.36
Y = 1      0.06    0.24    0.02    0.32
Y = 2      0.02    0.03    0.27    0.32
col sum    0.40    0.30    0.30    1.00

          X = 0   X = 1   X = 2   row sum
Y = 0      0.09    0.12    0.09    0.30
Y = 1      0.12    0.16    0.12    0.40
Y = 2      0.09    0.12    0.09    0.30
col sum    0.30    0.40    0.30    1.00

[figures: three scatter plots of (x, y) samples] (A numerical check is sketched in the code below.)

Joint, conditional and independence for pdfs
Similar results apply for pdfs of continuous variables x and y:
$$\int\!\!\int p(x, y)\,dx\,dy = 1,$$
and the probability over a range of x and y is given by the volume under the surface.
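An illustrative sketch (not part of the notes) that answers the question for the two tables above: a joint table describes independent variables exactly when it equals the outer product of its marginals.

```python
import numpy as np

def is_independent(P, tol=1e-9):
    """Return True if the joint table P(Y, X) factorises as P(Y) P(X)."""
    P_Y = P.sum(axis=1)                    # marginal over rows (Y)
    P_X = P.sum(axis=0)                    # marginal over columns (X)
    return np.allclose(P, np.outer(P_Y, P_X), atol=tol)

P1 = np.array([[0.32, 0.03, 0.01],
               [0.06, 0.24, 0.02],
               [0.02, 0.03, 0.27]])
P2 = np.array([[0.09, 0.12, 0.09],
               [0.12, 0.16, 0.12],
               [0.09, 0.12, 0.09]])

print(is_independent(P1))   # False: joint != product of marginals
print(is_independent(P2))   # True: every entry is P(Y=i) * P(X=j)
```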

Bivariate Normal distribution
$$N(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{2\pi\,|\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}$$
with
$$\mathbf{x} = \begin{pmatrix} x \\ y \end{pmatrix}, \qquad \boldsymbol{\mu} = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix} \text{ (mean)}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \text{ (covariance)}.$$

Example
$$\boldsymbol{\mu} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2 \end{pmatrix} = \begin{pmatrix} 4 & 0 \\ 0 & 1 \end{pmatrix},$$
$$p(x, y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\left\{ -\frac{1}{2}\left( \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} \right) \right\}.$$
[figure: iso-probability contours]
Note: the iso-probability contour curves, $\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} = d^2$, are ellipses, and $p(x, y) = p(x)\,p(y)$, so x is independent of y.

Conditional distribution
$$p(x \mid y) = \frac{p(x, y)}{p(y)} = \frac{\frac{1}{2\pi\sigma_x\sigma_y}\exp\left\{-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right)\right\}}{\frac{1}{\sqrt{2\pi}\,\sigma_y}\exp\left\{-\frac{y^2}{2\sigma_y^2}\right\}} = \frac{1}{\sqrt{2\pi}\,\sigma_x}\exp\left\{-\frac{x^2}{2\sigma_x^2}\right\},$$
i.e. x and y are independent.

Normal distribution of n variables
$$N(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}$$
where $\mathbf{x}$ is an n-component column vector, $\boldsymbol{\mu}$ is the n-component mean vector, and $\Sigma$ is an n x n covariance matrix.

Lecture 2: Describing and manipulating pdfs
Expectations of moments in 1D: mean, variance, skew, kurtosis.
Expectations of moments in 2D: covariance and correlation.
Transforming variables. Combining pdfs. Introduction to Maximum Likelihood estimation.

Describing distributions
Repeated measurements with the same sensor. [figures: 1D histogram and 2D scatter of repeated measurements]
Sensor model: one could store the original measurements (x_i, y_i), or store a histogram of measurements p_i, or compute (fit) a compact representation of the distribution.

Expectations and moments, 1D
The expected value of a scalar random variable, also called its mean, average, or first moment, is:
discrete case: $E[x] = \sum_{i=1}^{n} p_i x_i$; continuous case: $E[x] = \int x\,p(x)\,dx = \mu$.
Note that E is a linear operator, i.e. $E[ax + by] = aE[x] + bE[y]$.

Moments
The nth moment is $E[x^n] = \int x^n p(x)\,dx$.
The second central moment, or variance, is
$$\mathrm{var}(x) = E[(x - \mu)^2] = \int (x - \mu)^2 p(x)\,dx = E[x^2] - \mu^2 = \sigma_x^2.$$
The square root $\sigma$ of the variance is the standard deviation.

Example: Gaussian pdf
$$p(x) = N(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}$$
mean: $E[x] = \int x\,p(x)\,dx = \mu$; variance: $\mathrm{var}(x) = E[(x-\mu)^2] = \int (x-\mu)^2 p(x)\,dx = \sigma^2$.
A Normal distribution is defined by its first and second moments.

Fitting models by moment matching
Example: fit a Normal distribution to measured samples. Sketch algorithm:
1. Compute the mean $\mu$ of the samples.
2. Compute the variance $\sigma^2$ of the samples.
3. Represent the samples by a Normal distribution $N(\mu, \sigma^2)$.
[figure: histogram of samples with the fitted model overlaid]
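A minimal numpy sketch of this moment-matching recipe; the sample data and its parameters are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(3.0, 2.0, size=5000)   # stand-in for measured sensor data

# moment matching: the fitted N(mu, sigma^2) simply uses the sample moments
mu_hat = samples.mean()
var_hat = samples.var()      # ML estimate; use ddof=1 for the unbiased version

print("fitted model: N(mu=%.3f, sigma^2=%.3f)" % (mu_hat, var_hat))
```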

Mean and variance of a discrete random variable
Two probability distributions can differ even though they have identical means and variances. Mean and variance are summary values; more is needed to know the distribution (e.g. that it is a normal distribution).
What model should be fitted to this measured distribution? What does the fitted Normal distribution look like?

Higher moments: skewness
$$\mathrm{skew}(x) = E\left[\left(\frac{x-\mu}{\sigma}\right)^3\right] = \int \left(\frac{x-\mu}{\sigma}\right)^3 p(x)\,dx$$
[figures: a symmetric pdf, and a positively skewed pdf]
Symmetric (e.g. Gaussian): skew = 0.

Higher moments: kurtosis
$$\mathrm{kurt}(x) = E\left[\left(\frac{x-\mu}{\sigma}\right)^4\right] - 3 = \int \left(\frac{x-\mu}{\sigma}\right)^4 p(x)\,dx - 3$$
Positive: narrow peak with long tails. Negative: flat peak and little tail. Gaussian: kurt = 0.

Expectations and moments, 2D
Suppose we have two random variables with bivariate joint density p(x, y). Define the moment (expectation of their product)
$$E[xy] = \int\!\!\int xy\,p(x, y)\,dx\,dy,$$
which equals $E[x]\,E[y]$ if x and y are independent (NB: not "if and only if").

Covariance and correlation measure behaviour about the mean. The covariance is defined as
$$\mathrm{cov}(x, y) = \sigma_{xy} = \int\!\!\int (x - \mu_x)(y - \mu_y)\,p(x, y)\,dx\,dy = E[xy] - E[x]\,E[y] \quad \text{[proof: exercise]}$$
and is summarized as a 2 x 2 symmetric covariance matrix
$$\Sigma = \begin{pmatrix} \mathrm{var}(x) & \mathrm{cov}(x, y) \\ \mathrm{cov}(x, y) & \mathrm{var}(y) \end{pmatrix} = E\left[ (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^\top \right].$$
In n dimensions the covariance is an n x n symmetric matrix.

The correlation is defined as
$$\rho(x, y) = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)}\,\sqrt{\mathrm{var}(y)}}$$
and measures the normalized correlation of two random variables (c.f. the correlation of two signals). In the discrete sample case,
$$\rho(x, y) = \frac{\sum_i (x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_i (x_i - \mu_x)^2}\,\sqrt{\sum_i (y_i - \mu_y)^2}}, \qquad |\rho(x, y)| \leq 1.$$
E.g. if x = y then $\rho(x, y) = 1$; if x = -y then $\rho(x, y) = -1$; if x and y are independent then $\rho(x, y) = \mathrm{cov}(x, y) = 0$.

Fitting a bivariate Normal distribution
$$N(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{2\pi\,|\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}$$
A Normal distribution in 2D is defined by its first and second moments (mean and covariance matrix). In a similar manner to the 1D case, a 2D Gaussian is fitted by computing the mean and covariance matrix of the samples.

Example
$$\boldsymbol{\mu} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} 4 & 1.2 \\ 1.2 & 1 \end{pmatrix}$$
[figure: samples and iso-probability contours]
If x and y are not independent and have correlation $\rho$,
$$\Sigma = \begin{pmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{pmatrix}.$$
Let $S = \Sigma^{-1}$; then the iso-probability curves are $\mathbf{x}^\top S \mathbf{x} = d^2$, i.e. ellipses.
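An illustrative numpy sketch of the 2D fit (not from the notes): draw stand-in samples from the example distribution above, then recover the mean vector, covariance matrix and correlation from the samples.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])                         # covariance from the example
xy = rng.multivariate_normal(mu, Sigma, size=20000)    # columns: x, y

mu_hat = xy.mean(axis=0)                  # sample mean vector
Sigma_hat = np.cov(xy, rowvar=False)      # sample covariance matrix
rho_hat = Sigma_hat[0, 1] / np.sqrt(Sigma_hat[0, 0] * Sigma_hat[1, 1])

print("mean      :", mu_hat)
print("covariance:\n", Sigma_hat)
print("correlation rho:", rho_hat)        # about 1.2 / (2 * 1) = 0.6
```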

Transformation of random variables
Problem: suppose the pdf for a dart thrower is
$$p(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-(x^2 + y^2)/(2\sigma^2)};$$
express this pdf in polar coordinates. The coordinates are related as
$$x = x(r, \theta) = r\cos\theta, \qquad y = y(r, \theta) = r\sin\theta,$$
and, taking account of the area change,
$$p(x, y)\,dx\,dy = p(x(r, \theta), y(r, \theta))\,|J|\,dr\,d\theta$$
where J is the Jacobian, and |J| = r in this case, so
$$p(x(r, \theta), y(r, \theta))\,|J|\,dr\,d\theta = \frac{1}{2\pi\sigma^2}\, e^{-r^2/(2\sigma^2)}\, r\,dr\,d\theta.$$
Marginalize $p_{r\theta}(r, \theta)$ to get p(r):
$$p(r) = \int_0^{2\pi} p(r, \theta)\,d\theta = \int_0^{2\pi} \frac{r}{2\pi\sigma^2}\, e^{-r^2/(2\sigma^2)}\,d\theta = \frac{r}{\sigma^2}\, e^{-r^2/(2\sigma^2)}.$$
[figure: p(r) plotted against r]
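A Monte Carlo check of this change of variables (not in the original notes): sample (x, y) from the dart pdf, histogram r = sqrt(x^2 + y^2), and compare against the derived p(r). The sample size and bin choices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
x, y = rng.normal(0.0, sigma, size=(2, 200000))
r = np.hypot(x, y)

# empirical density of r versus the derived p(r) = (r / sigma^2) exp(-r^2 / (2 sigma^2))
hist, edges = np.histogram(r, bins=50, range=(0, 4), density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
p_r = centres / sigma**2 * np.exp(-centres**2 / (2 * sigma**2))

print("max abs difference:", np.abs(hist - p_r).max())   # small, up to sampling noise
```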

Example 1: Linear transformation of Normal distributions
If the pdf of x is a Normal distribution,
$$p(\mathbf{x}) = N(\boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\},$$
then under the linear transformation $\mathbf{y} = A\mathbf{x} + \mathbf{t}$ the pdf of y is also a Normal distribution, with
$$\boldsymbol{\mu}_y = A\boldsymbol{\mu}_x + \mathbf{t}, \qquad \Sigma_y = A\,\Sigma_x\,A^\top.$$
Consider the quadratic term. Under the transformation $\mathbf{y} = A\mathbf{x} + \mathbf{t}$, i.e. $\mathbf{x} = A^{-1}(\mathbf{y} - \mathbf{t})$, the quadratic develops as
$$(\mathbf{x} - \boldsymbol{\mu}_x)^\top \Sigma_x^{-1} (\mathbf{x} - \boldsymbol{\mu}_x)
= \left(A^{-1}(\mathbf{y} - \mathbf{t}) - \boldsymbol{\mu}_x\right)^\top \Sigma_x^{-1} \left(A^{-1}(\mathbf{y} - \mathbf{t}) - \boldsymbol{\mu}_x\right)$$
$$= \left(A^{-1}(\mathbf{y} - \mathbf{t} - A\boldsymbol{\mu}_x)\right)^\top \Sigma_x^{-1} \left(A^{-1}(\mathbf{y} - \mathbf{t} - A\boldsymbol{\mu}_x)\right)
= (\mathbf{y} - \mathbf{t} - A\boldsymbol{\mu}_x)^\top A^{-\top}\, \Sigma_x^{-1}\, A^{-1} (\mathbf{y} - \mathbf{t} - A\boldsymbol{\mu}_x)$$
$$= (\mathbf{y} - \boldsymbol{\mu}_y)^\top \Sigma_y^{-1} (\mathbf{y} - \boldsymbol{\mu}_y)$$
with $\boldsymbol{\mu}_y = A\boldsymbol{\mu}_x + \mathbf{t}$ and $\Sigma_y^{-1} = A^{-\top} \Sigma_x^{-1} A^{-1}$, i.e. $\Sigma_y = A\,\Sigma_x\,A^\top$.
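A quick empirical check of these transformation rules, with A, t, mu_x and Sigma_x chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_x = np.array([1.0, -2.0])            # illustrative values, not from the notes
Sigma_x = np.array([[2.0, 0.5],
                    [0.5, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 1.0]])
t = np.array([3.0, -1.0])

x = rng.multivariate_normal(mu_x, Sigma_x, size=50000)
y = x @ A.T + t                          # y = A x + t applied to every sample

print("sample mean  :", y.mean(axis=0))
print("A mu_x + t   :", A @ mu_x + t)
print("sample cov   :\n", np.cov(y, rowvar=False))
print("A Sigma_x A^T:\n", A @ Sigma_x @ A.T)
```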

Example 2: Sum of random variables
Problem: suppose z = x + y and $p_{xy}(x, y)$ is the joint distribution of x and y; find p(z).
Let t = x - y. Then
$$p_{tz}(t, z)\,dt\,dz = p_{xy}(x, y)\,|J|\,dt\,dz = \tfrac{1}{2}\,p_{xy}\!\left(\frac{z+t}{2}, \frac{z-t}{2}\right) dt\,dz$$
and
$$p(z) = \int p_{tz}(t, z)\,dt = \int \tfrac{1}{2}\,p_{xy}\!\left(\frac{z+t}{2}, \frac{z-t}{2}\right) dt = \int p_{xy}(z - u, u)\,du \quad \text{where } u = \tfrac{z-t}{2}$$
$$= \int p_x(z - u)\,p_y(u)\,du \quad \text{if x and y are independent,}$$
which is the convolution of $p_x(x)$ and $p_y(y)$.

Example: 1D Gaussians [proof: exercise]
If $p(x) = N(\mu_x, \sigma_x^2)$ and $p(y) = N(\mu_y, \sigma_y^2)$, then
$$p(x + y) = N(\mu_x, \sigma_x^2) * N(\mu_y, \sigma_y^2) = N(\mu_x + \mu_y,\; \sigma_x^2 + \sigma_y^2).$$
Add the means and covariances.
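A sampling sketch (with invented parameter values) confirming that the means and variances add for the sum of two independent Gaussians:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_x, var_x = 1.0, 4.0        # illustrative parameters
mu_y, var_y = -2.0, 1.0

x = rng.normal(mu_x, np.sqrt(var_x), size=100000)
y = rng.normal(mu_y, np.sqrt(var_y), size=100000)
z = x + y

print("mean of z    :", z.mean(), " expected:", mu_x + mu_y)
print("variance of z:", z.var(),  " expected:", var_x + var_y)
```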

Maximum Likelihood estimation, informal
Estimation is the process by which we infer the value of a quantity of interest, $\theta$, by processing data z that is in some way dependent on $\theta$.
Simple example: fitting a line y = ax + b to measured data. Estimate the line parameters $\theta = (a, b)$, given measurements $z_i = (x_i, y_i)$ and a model of the sensor noise.
Least squares solution:
$$\min_{a,b} \sum_{i}^{n} \left(y_i - (a x_i + b)\right)^2$$
Consider a generative model for the data: no noise on $x_i$; on y,
$$y_i = \tilde{y}_i + n_i, \qquad n_i \sim N(0, \sigma^2),$$
where $y_i$ is the measured value and $\tilde{y}_i$ the true value, so
$$p(y_i \mid \tilde{y}_i) \propto e^{-\frac{(y_i - \tilde{y}_i)^2}{2\sigma^2}}$$
is the probability of measuring $y_i$ given that the true value is $\tilde{y}_i$.
Model to be estimated: $\tilde{y}_i = a x_i + b$, so
$$p(y_i \mid a x_i + b) \propto e^{-\frac{(y_i - (a x_i + b))^2}{2\sigma^2}}.$$

For n points, assuming independence, the likelihood is
$$p(y_1, y_2, \ldots, y_n \mid x_1, x_2, \ldots, x_n; a, b), \qquad p(\mathbf{y} \mid a, b) \propto \prod_{i}^{n} e^{-\frac{(y_i - (a x_i + b))^2}{2\sigma^2}},$$
where $\mathbf{y}$ is the measured data and (a, b) are the parameters.
The Maximum Likelihood (ML) estimate is obtained as
$$\{\hat{a}, \hat{b}\} = \arg\max_{a,b}\, p(\mathbf{y} \mid a, b), \qquad p(\mathbf{y} \mid a, b) \propto \prod_{i}^{n} e^{-\frac{(y_i - (a x_i + b))^2}{2\sigma^2}} = e^{-\sum_{i}^{n} \frac{(y_i - (a x_i + b))^2}{2\sigma^2}}.$$
Take the negative log:
$$-\log p(\mathbf{y} \mid a, b) \propto \sum_{i}^{n} \frac{(y_i - (a x_i + b))^2}{2\sigma^2}.$$
The Maximum Likelihood (ML) estimate is therefore equivalent to
$$\{\hat{a}, \hat{b}\} = \arg\min_{a,b} \sum_{i}^{n} \frac{(y_i - (a x_i + b))^2}{2\sigma^2},$$
i.e. to least squares.
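A minimal sketch of the resulting ML / least-squares line fit in numpy, with invented ground-truth values for a, b and the noise level:

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, b_true, sigma = 2.0, 1.0, 0.5                # illustrative ground truth
x = np.linspace(0, 10, 50)
y = a_true * x + b_true + rng.normal(0, sigma, size=x.shape)   # y_i = a x_i + b + n_i

# design matrix [x_i, 1]; least squares == ML under iid Gaussian noise on y
X = np.column_stack([x, np.ones_like(x)])
(a_hat, b_hat), *_ = np.linalg.lstsq(X, y, rcond=None)

print("ML / least-squares estimate: a = %.3f, b = %.3f" % (a_hat, b_hat))
```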