B4 Estimation and Inference
6 Lectures, Hilary Term 2007, 2 Tutorial Sheets
A. Zisserman

Overview
Lectures 1 & 2: Introduction. Sensors, and basics of probability density functions for representing sensor error and uncertainty.
Lectures 3 & 4: Estimators. Maximum Likelihood (ML) and Maximum a Posteriori (MAP).
Lectures 5 & 6: Decisions and classification. Loss functions, discriminant functions, linear classifiers.

Textbooks 1
Estimation with Applications to Tracking and Navigation: Theory, Algorithms and Software. Yaakov Bar-Shalom, Xiao-Rong Li, Thiagalingam Kirubarajan, Wiley, 2001.
Covers probability and estimation, but contains much more material than is required for this course. Also used for C4B Mobile Robots.
Textbooks 2
Pattern Classification. Richard O. Duda, Peter E. Hart, David G. Stork, Wiley, 2000.
Covers classification, but also contains much more material than is required for this course.

Background reading and web resources
Information Theory, Inference, and Learning Algorithms. David J. C. MacKay, CUP, 2003.
Covers all the course material, though at an advanced level. Available online.
Introduction to Random Signals and Applied Kalman Filtering. R. Brown and P. Hwang, Wiley, 1997.
Good review of probability and random variables in the first chapter.
Further reading (www addresses) and the lecture notes are at http://www.robots.ox.ac.uk/~az/lectures
One more book for background reading
Pattern Recognition and Machine Learning. Christopher Bishop, Springer, 2006.
Excellent on classification and regression, but quite advanced.
Introduction: sensors and estimation
Sensors are used to give information about the state of some system. Our aim in this course is to develop methods which combine multiple sensor measurements (possibly from multiple sensors, possibly over time) with prior information, to obtain accurate estimates of a system's state.

Noise and uncertainty in sensors
Real sensors give inexact measurements for a variety of reasons:
discretization error (e.g. measurements on a grid)
calibration error
quantization noise (e.g. CCD)
Successive observations of a system or phenomenon do not produce exactly the same result. Statistical methods are used to describe and understand this variability, and to incorporate variability into the decision-making process.
Examples: ultrasound, laser range scanner, CCD images, GPS.
Ultrasound. Objective: diagnose heart disease.
[Figures: axial view (brightness codes depth); contrast agent enhanced; Doppler velocity image.] A prior shape model for the heart is used for inference.

Laser range scanner. Objective: build a map of a room.
[Figure: scanned (x, y) points.] Good quality data, up to discretization on a grid.
CCD sensor. Objective: read a number plate from a low-light image sequence.
A temporal average suppresses zero-mean additive noise:
  I(t) = I + n(t)
[Figure: time-averaged frames, histogram equalized, for an increasing number of frames.]
Averaging N noise samples with zero mean and variance \sigma^2 gives a result with zero mean and variance \sigma^2 / N.
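The variance reduction from temporal averaging can be checked numerically. This is a small sketch using NumPy; the intensity, noise level, and frame count are illustrative values, not data from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

true_image = 100.0   # constant "true" intensity I (hypothetical value)
sigma = 10.0         # std of the zero-mean additive noise n(t)
N = 64               # number of frames averaged

# Simulate many repetitions of "average N noisy frames" and estimate
# the variance of the time-averaged result.
frames = true_image + rng.normal(0.0, sigma, size=(10000, N))
averages = frames.mean(axis=1)

# The empirical variance should be close to sigma**2 / N.
print(np.var(averages), sigma**2 / N)
```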
GPS. Objective: estimate a car trajectory using the Global Positioning System.
[Figures: GPS data collected from a car, with a close-up view.]
GPS: error sources. Fit (estimate) a curve and lines to reduce random errors.

Two canonical estimation problems
1. Regression: estimate parameters, e.g. of a line fitted to the trajectory of a car.
2. Classification: estimate a class, e.g. handwritten digit classification (is this digit a 1 or a 7?).
The need for probability
We have seen that there is uncertainty in the measurements due to sensor noise. There may also be uncertainty in the model we fit. Finally, we often want more than just a single value for an estimate: we would also like to model the confidence or uncertainty of the estimate. Probability theory, the calculus of uncertainty, provides the mathematical machinery.

Revision of probability theory
probability density functions (pdfs)
joint and conditional probability
independence
Normal (Gaussian) distributions of one and two variables
1D probability: a brief review

Discrete sets. Suppose an experiment has a set of possible outcomes S, and an event E is a subset of S. Then
  probability of E = relative frequency of the event = (number of outcomes in E) / (total number of outcomes in S)
If S = {a_1, a_2, ..., a_n} has probabilities {p_1, p_2, ..., p_n}, then
  \sum_{i=1}^{n} p_i = 1
e.g. throw a die: the probability of any particular number is 1/6, and the probability of an even number is 1/2.

Probability density function (pdf). For a continuous variable,
  P(x < X < x + dx) = p(x) dx,   \int p(x) dx = 1
The probability over a range of x is given by the area under the curve p(x) between x and x + dx.
PDF Example 1: the univariate Normal distribution
The most important example of a continuous density/distribution is the normal or Gaussian distribution:
  p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right),   x \sim N(\mu, \sigma^2)
[Figure: the case \mu = 0, \sigma = 1.]
e.g. a model of the sensor response for measured x when the true value is x = 0.

PDF Example 2: a uniform distribution
  p(x) \sim U(a, b):   p(x) = 1/(b - a) for a \le x \le b, and 0 otherwise
example: laser range finder
PDF Example 3: a histogram
An image intensity histogram (frequency against intensity) can be normalized to obtain a pdf.

Joint and conditional probability
Consider two (discrete) variables A and B.
Joint probability distribution of A and B:
  P(A, B) = probability of A and B both occurring,   \sum_{i,j} P(A_i, B_j) = 1
Conditional probability distribution of A given B:
  P(A | B) = P(A, B) / P(B)
Marginal (unconditional) distribution of A:
  P(A) = \sum_j P(A, B_j)
Example
Consider two sensors measuring the x and y coordinates of an object in 2D. The joint distribution is given by the spreadsheet, where the array entry (i, j) is P(X = i, Y = j):

            X = 0   X = 1   X = 2   row sum
  Y = 0     0.32    0.03    0.01    0.36
  Y = 1     0.06    0.24    0.02    0.32
  Y = 2     0.02    0.03    0.27    0.32

To compute the joint distribution:
1. Count the number of times the measured location is at (X, Y) for X = 0..2, Y = 0..2.
2. Normalize the count matrix so that \sum_{i,j} P(i, j) = 1.

Exercise: compute the marginals and the conditional P(Y | X = 1).

            X = 0   X = 1   X = 2   row sum
  Y = 0     0.32    0.03    0.01    0.36   <- marginal P(Y = 0)
  Y = 1     0.06    0.24    0.02    0.32
  Y = 2     0.02    0.03    0.27    0.32
  col sum   0.40    0.30    0.30    1.00

Marginal:
  P(X = 1) = \sum_Y P(X = 1, Y) = 0.30
Conditional P(Y | X = 1): normalize P(X = 1, Y) so that it is a probability:
  P(Y | X = 1) = P(X = 1, Y) / P(X = 1)

            X = 1
  Y = 0     0.03 / 0.3 = 0.1
  Y = 1     0.24 / 0.3 = 0.8
  Y = 2     0.03 / 0.3 = 0.1
  col sum   1.0

In words: the probability of Y given that X = 1.
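The marginal and conditional computations above can be reproduced in a few lines of NumPy; this is an illustrative sketch, with the table values entered directly.

```python
import numpy as np

# Joint distribution P(X, Y) from the table: rows are Y = 0, 1, 2,
# columns are X = 0, 1, 2.
P = np.array([[0.32, 0.03, 0.01],
              [0.06, 0.24, 0.02],
              [0.02, 0.03, 0.27]])

P_X = P.sum(axis=0)              # marginal P(X): sum over Y
P_Y = P.sum(axis=1)              # marginal P(Y): sum over X
P_Y_given_X1 = P[:, 1] / P_X[1]  # conditional P(Y | X = 1)

print(P_X)           # marginals of X
print(P_Y_given_X1)  # conditional P(Y | X = 1)
```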
Bayes' Rule (or Bayes' Theorem)
From the definition of the conditional,
  P(A, B) = P(A | B) P(B)
  P(A, B) = P(B | A) P(A)
Hence
  P(A | B) = \frac{P(B | A) P(A)}{P(B)}
Bayes' rule lets the dependence of the conditionals be reversed. This will be important later for Maximum a Posteriori (MAP) estimation.
Writing
  P(B) = \sum_i P(A_i, B) = \sum_i P(B | A_i) P(A_i)
gives
  P(A | B) = \frac{P(B | A) P(A)}{P(B)} = \frac{P(B | A) P(A)}{\sum_i P(B | A_i) P(A_i)}
with similar expressions for P(B | A).

Independent variables
Two variables are independent if (and only if)
  P(A, B) = P(A) P(B)
i.e. all joint values equal the product of the marginals. Compare with conditional probability: P(A, B) = P(A | B) P(B), so P(A | B) = P(A), and similarly P(B | A) = P(B).
e.g. two throws of a die or coin are independent; two cards picked without replacement from the same pack are not independent.

            H1      T1      row sum
  H2        0.25    0.25    0.5
  T2        0.25    0.25    0.5
  col sum   0.5     0.5     1.0

Each entry 0.25 = 0.5 x 0.5, the product of the marginals.
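Bayes' rule can be verified numerically on the earlier sensor table: compute P(X | Y), reverse it with Bayes' rule, and confirm that the result equals P(Y | X) computed directly. A sketch, with the table values entered by hand:

```python
import numpy as np

# Joint table from the earlier example: rows Y = 0..2, columns X = 0..2.
P = np.array([[0.32, 0.03, 0.01],
              [0.06, 0.24, 0.02],
              [0.02, 0.03, 0.27]])

P_X = P.sum(axis=0)                  # P(X)
P_Y = P.sum(axis=1)                  # P(Y)
P_X_given_Y = P / P_Y[:, None]       # P(X | Y), row-normalized
P_Y_given_X = P / P_X[None, :]       # P(Y | X), column-normalized

# Bayes: P(Y | X) = P(X | Y) P(Y) / sum_Y P(X | Y) P(Y)
num = P_X_given_Y * P_Y[:, None]
bayes = num / num.sum(axis=0)
print(np.allclose(bayes, P_Y_given_X))
```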
Examples: are these joint distributions independent?

            X = 0   X = 1   X = 2   row sum
  Y = 0     0.32    0.03    0.01    0.36
  Y = 1     0.06    0.24    0.02    0.32
  Y = 2     0.02    0.03    0.27    0.32
  col sum   0.40    0.30    0.30    1.00

            X = 0   X = 1   X = 2   row sum
  Y = 0     0.09    0.12    0.09    0.30
  Y = 1     0.12    0.16    0.12    0.40
  Y = 2     0.09    0.12    0.09    0.30
  col sum   0.30    0.40    0.30    1.00

Joint, conditional and independence for pdfs
Similar results apply for pdfs of continuous variables x and y:
  \int \int p(x, y) \, dx \, dy = 1
The probability over a range of x and y is given by the volume under p(x, y).
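The independence test is mechanical: compare the joint table with the outer product of its marginals. A sketch applying it to the two tables above:

```python
import numpy as np

def is_independent(P, tol=1e-9):
    """True if the joint table equals the outer product of its marginals."""
    return np.allclose(P, np.outer(P.sum(axis=1), P.sum(axis=0)), atol=tol)

P1 = np.array([[0.32, 0.03, 0.01],
               [0.06, 0.24, 0.02],
               [0.02, 0.03, 0.27]])
P2 = np.array([[0.09, 0.12, 0.09],
               [0.12, 0.16, 0.12],
               [0.09, 0.12, 0.09]])

print(is_independent(P1), is_independent(P2))
```

The first table is not independent (e.g. 0.32 differs from 0.36 x 0.40); the second is, since every entry is the product of its row and column marginals.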
Bivariate normal distribution
  N(x | \mu, \Sigma) = \frac{1}{2\pi |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right\}
with mean \mu and covariance \Sigma, where
  x = (x, y)^\top,   \mu = (\mu_x, \mu_y)^\top,   \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}

Example
  \mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix},   \Sigma = \begin{bmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2 \end{bmatrix} = \begin{bmatrix} 4 & 0 \\ 0 & 1 \end{bmatrix}
  p(x, y) = \frac{1}{2\pi \sigma_x \sigma_y} \exp\left\{ -\frac{1}{2} \left( \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} \right) \right\}
Note the iso-probability contour curves are the ellipses
  \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} = d^2
and p(x, y) = p(x) p(y), so x is independent of y.
Conditional distribution
  p(x | y) = \frac{p(x, y)}{p(y)}
           = \frac{ \frac{1}{2\pi\sigma_x\sigma_y} \exp\left\{ -\frac{1}{2}\left( \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} \right) \right\} }{ \frac{1}{\sqrt{2\pi}\,\sigma_y} \exp\left\{ -\frac{y^2}{2\sigma_y^2} \right\} }
           = \frac{1}{\sqrt{2\pi}\,\sigma_x} \exp\left\{ -\frac{x^2}{2\sigma_x^2} \right\}
i.e. x and y are independent.

Normal distribution of n variables
  N(x | \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right\}
where x is an n-component column vector, \mu is the n-component mean vector, and \Sigma is an n x n covariance matrix.
Lecture 2: Describing and manipulating pdfs
Expectations and moments in 1D: mean, variance, skew, kurtosis
Expectations and moments in 2D: covariance and correlation
Transforming variables
Combining pdfs
Introduction to Maximum Likelihood estimation

Describing distributions
Repeated measurements with the same sensor produce a spread of values. [Figures: 1D histogram and 2D scatter of repeated measurements.]
Sensor model: one could store the original measurements (x_i, y_i), or store a histogram of the measurements p_i, or compute (fit) a compact representation of the distribution.
Expectations and moments in 1D
The expected value of a scalar random variable, also called its mean, average, or first moment, is:
  Discrete case:   E[x] = \sum_{i=1}^{n} p_i x_i
  Continuous case: E[x] = \int x \, p(x) \, dx = \mu
Note that E is a linear operator, i.e. E[ax + by] = a E[x] + b E[y].

Moments
The nth moment is
  E[x^n] = \int x^n \, p(x) \, dx
The second central moment, or variance, is
  var(x) = E[(x - \mu)^2] = \int (x - \mu)^2 \, p(x) \, dx = E[x^2] - \mu^2 = \sigma_x^2
The square root \sigma of the variance is the standard deviation.
Example: Gaussian pdf
  p(x) = N(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}
  mean:     E[x] = \int x \, p(x) \, dx = \mu
  variance: var(x) = E[(x - \mu)^2] = \int (x - \mu)^2 \, p(x) \, dx = \sigma^2
A Normal distribution is defined by its first and second moments.

Fitting models by moment matching
Example: fit a Normal distribution to measured samples.
Sketch algorithm:
1. Compute the mean \mu of the samples.
2. Compute the variance \sigma^2 of the samples.
3. Represent the samples by the fitted Normal distribution N(\mu, \sigma^2).
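The sketch algorithm above is a two-line computation in practice. A sketch with synthetic "measured samples" standing in for real sensor data (the true mean 3 and standard deviation 2 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical measured samples; in practice these come from the sensor.
samples = rng.normal(3.0, 2.0, size=50000)

# Moment matching: the fitted N(mu, sigma^2) uses the sample moments.
mu_hat = samples.mean()
var_hat = samples.var()

def fitted_pdf(x):
    """Density of the fitted Normal distribution N(mu_hat, var_hat)."""
    return np.exp(-(x - mu_hat)**2 / (2 * var_hat)) / np.sqrt(2 * np.pi * var_hat)

print(mu_hat, var_hat)  # close to the generating values 3.0 and 4.0
```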
Mean and variance of a discrete random variable
Two probability distributions can differ even though they have identical means and variances: mean and variance are only summary values, and more is needed to specify the distribution (e.g. that it is a normal distribution).
What model should be fitted to this measured distribution? What does the fitted Normal distribution look like?
Higher moments: skewness
  skew(x) = E\left[ \left( \frac{x - \mu}{\sigma} \right)^3 \right] = \int \left( \frac{x - \mu}{\sigma} \right)^3 p(x) \, dx
A symmetric distribution (e.g. Gaussian) has skew = 0; a distribution with a long right tail has positive skew.

Higher moments: kurtosis
  kurt(x) = E\left[ \left( \frac{x - \mu}{\sigma} \right)^4 \right] - 3 = \int \left( \frac{x - \mu}{\sigma} \right)^4 p(x) \, dx - 3
Positive: narrow peak with long tails. Negative: flat peak and little tail. Gaussian: kurt = 0.
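The sample versions of these moments can be estimated directly. A sketch comparing a Gaussian (skew and excess kurtosis near 0) with an exponential distribution, chosen here as an illustrative skewed, heavy-tailed case:

```python
import numpy as np

rng = np.random.default_rng(2)

def skew(x):
    """Sample skewness: third moment of the standardized samples."""
    z = (x - x.mean()) / x.std()
    return np.mean(z**3)

def kurt(x):
    """Sample excess kurtosis: fourth standardized moment minus 3."""
    z = (x - x.mean()) / x.std()
    return np.mean(z**4) - 3.0

gauss = rng.normal(size=200000)
expon = rng.exponential(size=200000)  # long right tail: skew 2, kurt 6

print(skew(gauss), kurt(gauss))  # both near 0
print(skew(expon), kurt(expon))  # clearly positive
```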
Expectations and moments in 2D
Suppose we have two random variables with bivariate joint density p(x, y). Define the moment (the expectation of their product)
  E[xy] = \int \int x y \, p(x, y) \, dx \, dy = E[x] \, E[y] if x and y are independent
(NB: this is not an if-and-only-if.)

Covariance and correlation measure behaviour about the mean. The covariance is defined as
  cov(x, y) = \sigma_{xy} = \int \int (x - \mu_x)(y - \mu_y) \, p(x, y) \, dx \, dy = E[xy] - E[x] \, E[y]   [proof: exercise]
Summarize as a 2 x 2 symmetric covariance matrix
  \Sigma = \begin{bmatrix} var(x) & cov(x, y) \\ cov(x, y) & var(y) \end{bmatrix} = E\left[ (x - \mu)(x - \mu)^\top \right]
In n dimensions the covariance is an n x n symmetric matrix.
The correlation is defined as
  \rho(x, y) = \frac{cov(x, y)}{\sqrt{var(x)} \sqrt{var(y)}}
It measures the normalized correlation of two random variables (c.f. the correlation of two signals). In the discrete sample case,
  \rho(x, y) = \frac{ \sum_i (x_i - \mu_x)(y_i - \mu_y) }{ \sqrt{\sum_i (x_i - \mu_x)^2} \, \sqrt{\sum_i (y_i - \mu_y)^2} }
  -1 \le \rho(x, y) \le 1
e.g. if x = y then \rho(x, y) = 1; if x = -y then \rho(x, y) = -1; if x and y are independent then cov(x, y) = 0 and \rho(x, y) = 0.
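Covariance and correlation are easy to estimate from samples. A sketch with an illustrative dependence y = x + noise, for which cov(x, y) = var(x) = 1 and \rho = 1/\sqrt{1.25} (about 0.894):

```python
import numpy as np

rng = np.random.default_rng(3)

n = 100000
x = rng.normal(size=n)
y = x + 0.5 * rng.normal(size=n)  # correlated with x by construction

# Sample covariance and correlation, following the discrete-case formulas.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
rho = cov_xy / (x.std() * y.std())

print(cov_xy, rho)
```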
Fitting a bivariate Normal distribution
  N(x | \mu, \Sigma) = \frac{1}{2\pi |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right\}
A Normal distribution in 2D is defined by its first and second moments (mean and covariance matrix). In a similar manner to the 1D case, a 2D Gaussian is fitted by computing the mean and covariance matrix of the samples.

Example
  \mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix},   \Sigma = \begin{bmatrix} 4 & 1.2 \\ 1.2 & 1 \end{bmatrix}
If x and y are not independent and have correlation \rho, then
  \Sigma = \begin{bmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{bmatrix}
Let S = \Sigma^{-1}; then the iso-probability curves are x^\top S x = d^2, i.e. ellipses.
Transformation of random variables
Problem: suppose the pdf for a dart thrower is
  p(x, y) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/(2\sigma^2)}
Express this pdf in polar coordinates. The coordinates are related as
  x = x(r, \theta) = r \cos\theta,   y = y(r, \theta) = r \sin\theta
and taking account of the area change,
  p(x, y) \, dx \, dy = p(x(r, \theta), y(r, \theta)) \, |J| \, dr \, d\theta
where J is the Jacobian, and |J| = r in this case:
  p_{r\theta}(r, \theta) \, dr \, d\theta = \frac{1}{2\pi\sigma^2} e^{-r^2/(2\sigma^2)} \, r \, dr \, d\theta
Marginalize over \theta to get p(r):
  p(r) = \int_0^{2\pi} p_{r\theta}(r, \theta) \, d\theta = \int_0^{2\pi} \frac{r}{2\pi\sigma^2} e^{-r^2/(2\sigma^2)} \, d\theta = \frac{r}{\sigma^2} e^{-r^2/(2\sigma^2)}
[Figure: p(r) against r, peaking near r = \sigma.]
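The derived p(r) can be checked by Monte Carlo: sample dart positions in Cartesian coordinates and compare the empirical mean of r with the mean of the derived density, which is \sigma \sqrt{\pi/2}. A sketch with an illustrative \sigma:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 1.5   # illustrative spread of the dart thrower

# Sample dart positions in Cartesian coordinates...
n = 200000
x = rng.normal(0.0, sigma, size=n)
y = rng.normal(0.0, sigma, size=n)
r = np.hypot(x, y)

# ...and compare with the derived p(r) = (r / sigma^2) exp(-r^2 / (2 sigma^2)),
# whose mean is sigma * sqrt(pi / 2).
print(r.mean(), sigma * np.sqrt(np.pi / 2))
```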
Example 1: linear transformation of Normal distributions
If the pdf of x is a Normal distribution
  p(x) = N(\mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right\}
then under the linear transformation
  y = A x + t
the pdf of y is also a Normal distribution, with
  \mu_y = A \mu_x + t,   \Sigma_y = A \Sigma_x A^\top

Consider the quadratic term. Under the transformation y = Ax + t, x = A^{-1}(y - t), and developing the quadratic:
  (x - \mu_x)^\top \Sigma_x^{-1} (x - \mu_x)
  = (A^{-1}(y - t) - \mu_x)^\top \Sigma_x^{-1} (A^{-1}(y - t) - \mu_x)
  = \left( A^{-1}(y - t - A\mu_x) \right)^\top \Sigma_x^{-1} \left( A^{-1}(y - t - A\mu_x) \right)
  = (y - t - A\mu_x)^\top A^{-\top} \Sigma_x^{-1} A^{-1} (y - t - A\mu_x)
  = (y - \mu_y)^\top \Sigma_y^{-1} (y - \mu_y)
with
  \mu_y = A\mu_x + t,   \Sigma_y^{-1} = A^{-\top} \Sigma_x^{-1} A^{-1},   \Sigma_y = A \Sigma_x A^\top
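The result \mu_y = A\mu_x + t, \Sigma_y = A\Sigma_x A^\top can be verified by transforming samples. A sketch with illustrative values for \mu_x, \Sigma_x, A, and t:

```python
import numpy as np

rng = np.random.default_rng(5)

mu_x = np.array([1.0, -2.0])
Sigma_x = np.array([[4.0, 1.2],
                    [1.2, 1.0]])
A = np.array([[2.0, 0.5],
              [0.0, 1.0]])
t = np.array([3.0, -1.0])

# Draw samples of x, apply y = A x + t to each sample.
xs = rng.multivariate_normal(mu_x, Sigma_x, size=200000)
ys = xs @ A.T + t

mu_y = A @ mu_x + t           # predicted mean of y
Sigma_y = A @ Sigma_x @ A.T   # predicted covariance of y

print(ys.mean(axis=0), mu_y)
print(np.cov(ys.T), Sigma_y)
```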
Example 2: sum of random variables
Problem: suppose z = x + y and p_{xy}(x, y) is the joint distribution of x and y; find p(z).
Let t = x - y. Then
  p_{tz}(t, z) \, dt \, dz = p_{xy}(x, y) \, |J| \, dt \, dz = \frac{1}{2} p_{xy}\left( \frac{z + t}{2}, \frac{z - t}{2} \right) dt \, dz
and
  p(z) = \int p_{tz}(t, z) \, dt
       = \int \frac{1}{2} p_{xy}\left( \frac{z + t}{2}, \frac{z - t}{2} \right) dt
       = \int p_{xy}(z - u, u) \, du   where u = (z - t)/2
       = \int p_x(z - u) \, p_y(u) \, du   if x and y are independent
which is the convolution of p_x(x) and p_y(y).

Example: 1D Gaussians [proof: exercise]
If p(x) = N(\mu_x, \sigma_x^2) and p(y) = N(\mu_y, \sigma_y^2), then
  p(x + y) = N(\mu_x, \sigma_x^2) * N(\mu_y, \sigma_y^2) = N(\mu_x + \mu_y, \sigma_x^2 + \sigma_y^2)
Add the means and the covariances.
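The Gaussian convolution result (means add, variances add) is quick to check by sampling. A sketch with illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(6)

mx, sx = 1.0, 2.0    # mean and std of x
my, sy = -0.5, 1.5   # mean and std of y

n = 300000
x = rng.normal(mx, sx, size=n)
y = rng.normal(my, sy, size=n)
z = x + y   # x, y independent, so p(z) = N(mx + my, sx**2 + sy**2)

print(z.mean(), mx + my)       # means add
print(z.var(), sx**2 + sy**2)  # variances add
```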
Maximum Likelihood estimation: informal
Estimation is the process by which we infer the value of a quantity of interest, \theta, by processing data z that is in some way dependent on \theta.

Simple example: fitting a line to measured data
  y = ax + b
Estimate the line parameters \theta = (a, b), given measurements z_i = (x_i, y_i) and a model of the sensor noise.
Least squares solution:
  \min_{a, b} \sum_i^n \left( y_i - (a x_i + b) \right)^2

Consider a generative model for the data: no noise on x_i; on y,
  y_i = \tilde{y}_i + n_i,   n_i \sim N(0, \sigma^2)
where y_i is the measured value and \tilde{y}_i the true value, so the probability of measuring y_i given that the true value is \tilde{y}_i is
  p(y_i | \tilde{y}_i) \propto e^{ -\frac{(y_i - \tilde{y}_i)^2}{2\sigma^2} }
With the model to be estimated, \tilde{y}_i = a x_i + b, this becomes
  p(y_i | a x_i + b) \propto e^{ -\frac{(y_i - (a x_i + b))^2}{2\sigma^2} }
For n points, assuming independence, the likelihood is
  p(y_1, y_2, \ldots, y_n | x_1, x_2, \ldots, x_n; a, b) = p(y | a, b) \propto \prod_i^n e^{ -\frac{(y_i - (a x_i + b))^2}{2\sigma^2} }
where y is the measured data and (a, b) are the parameters.
The Maximum Likelihood (ML) estimate is obtained as
  \{\hat{a}, \hat{b}\} = \arg\max_{a, b} \, p(y | a, b)
with
  p(y | a, b) \propto \prod_i^n e^{ -\frac{(y_i - (a x_i + b))^2}{2\sigma^2} } = e^{ -\sum_i^n \frac{(y_i - (a x_i + b))^2}{2\sigma^2} }
Take the negative log:
  -\log p(y | a, b) = \sum_i^n \frac{(y_i - (a x_i + b))^2}{2\sigma^2} + \text{const}
The ML estimate is therefore equivalent to
  \{\hat{a}, \hat{b}\} = \arg\min_{a, b} \sum_i^n \frac{(y_i - (a x_i + b))^2}{2\sigma^2}
i.e. to least squares.
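The equivalence can be exercised end to end: generate noisy line data under the Gaussian model and recover (a, b) by least squares, which is the ML solution here. A sketch; the true parameters and noise level are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical true line and noisy measurements y_i = a x_i + b + n_i.
a_true, b_true, sigma = 2.0, -1.0, 0.3
x = np.linspace(0.0, 10.0, 200)
y = a_true * x + b_true + rng.normal(0.0, sigma, size=x.size)

# ML estimate under Gaussian noise = least squares: solve for (a, b)
# minimizing sum_i (y_i - (a x_i + b))^2 via the design matrix [x, 1].
X = np.column_stack([x, np.ones_like(x)])
a_hat, b_hat = np.linalg.lstsq(X, y, rcond=None)[0]

print(a_hat, b_hat)  # close to the true (2.0, -1.0)
```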