ECE276A: Sensing & Estimation in Robotics Lecture 10: Gaussian Mixture and Particle Filtering

ECE276A: Sensing & Estimation in Robotics Lecture 10: Gaussian Mixture and Particle Filtering Lecturer: Nikolay Atanasov: natanasov@ucsd.edu Teaching Assistants: Siwei Guo: s9guo@eng.ucsd.edu Anwesan Pal: a2pal@eng.ucsd.edu 1

Vectors Notation x R d d-dimensional vector in Euclidean space. It may be deterministic or random. If it is random we describe it with a parametric distribution with parameters θ or w or equivalently with its probability density function px; w. Example: x N µ, Σ with pdf φx; µ, Σ, where N is the distribution, φ is the pdf, and {µ, Σ} are the deterministic vector and matrix parameters. A R n m An n m dimensional matrix. Its transpose is A T R m n and its inverse if it exists is A 1 R n m for n = m. x 2 := x T x vector Euclidean norm xt a vector that changes in continuous time t R >0 x t a vector that changes in discrete time t N ẋt := d dt xt time derivative of xt 2

Random Vectors Notation x R d d-dimensional random vector with CDF F and pdf p E[hx] expectation of a function h of the random vector x: E[hx] := hxpxdx µ := E[x] mean of x, also called first moment of x x := arg max px mode of x x E [ xx T ] second moment of x Σ := E [x ] µx µ T = E [xx ] covariance of x T µµ T 3

Parameter Estimation Notation D := {x i, y i } n i=1 px i, y i ; ω py i x i ; ω max ω n i=1 px i, y i ; ω max y px, y; ω σz := 1 1+exp z R softmaxz := training data of examples x i R d and labels y i { 1, 1} classification or y i R regression. The elements x i, y i are iid over i sampled from an unknown joint over x i and y i pdf p x i, y i generative model using pdf parameters ω discriminative model using parameters ω training: parameter estimation via MLE testing: label prediction for given new sample x and already trained parameters ω sigmoid function: useful for modeling binary classification ez R d softmax function: useful for modeling K- 1 T e z ary classification 4

Orientations Notation a R 3 â so3 R = expâ SO3 q = exp[0, 1 2a] S3 quaternion an axis-angle representation of orientation/rotation, also called rotation vector. The axis of rotation is ξ := a a 2 and the angle is θ := a 2 a skew-symmetric matrix associated with cross products âb = a b for b R 3. Skew-symmetric matrices define the Lie algebra i.e., local linear approximation of the Lie group of rotations. rotation matrix representation of the rotation vector a R 3. The matrix exponential function is expa := k=0 Ak k! and for rotations can be computed in closed form via the Rodrigues formula. representation of the rotation vector a R 3. The quaternion exponential function is not the same as the matrix exponential function! 5

Transformations Notation ω R 3 Angular velocity defines the rate of change of orientation for rotation matrices Ṙ = ˆωR or equivalently for quaternions q = [ 0, ω 2 ] q ζ := ω, v R 6 Linear velocity v R 3 and angular velocity ω R 3 ˆζ se3 a twist matrix v R 4 4 defining the Lie 0 1 algebra of the Lie group of rigid body transformations. A twist ˆζ defines the rate of change of pose of position ṗ = ˆωp + v and orientation Ṙ = ˆωR g = expˆζ SE3 a matrix g = R p representing rigid body 0 1 transformations, where R SO3 is the rotation/orientation and p R 3 is the translation/position 6

Transformations Notation W g B SE3 x t SE3 Body frame to world frame transformation defined by the body pose W g B or robot state x t g 1 g 2 := g 1 g 2 = R 1 p 1 R 2 p 2 0 1 0 1 g 2 g 1 := g 1 1 g 2 = RT 1 R1 T p 1 R 2 p 2 0 1 0 1 Transformation composition in SE3 equivalent to addition in Euclidean spaces Transformation inverse composition in SE3 equivalent to subtraction in Euclidean spaces 7

Filtering Notation x t R d system/robot state to be estimated e.g., position, orientation, etc.. Usually x t N µ t t, Σ t t with pdf φx t ; µ t t, Σ t t u t R du z t R dz w t N 0, W v t N 0, V known system/robot control input e.g., rotational velocity known system/robot measurement/observation e.g., pixel coordinates Gaussian motion noise Gaussian observation noise p t t x t pdf of the robot state x t given past measurements z 0:t and control inputs u 0:t 1 : p t t x t := px t z 0:t, u 0:t 1 p x t+1 predicted pdf of the robot state x t+1 given past measurements z 0:t and control inputs u 0:t : p x t+1 := px t+1 z 0:t, u 0:t 8

Bayes Filter Motion model: x t+1 = ax t, u t, w t p a x t, u t Observation model: z t = hx t, v t p h x t Filtering: keeps track of Bayes filter: p t t x t := px t z 0:t, u 0:t 1 p x t+1 := px t+1 z 0:t, u 0:t 1 η t+1 {}}{ 1 p +1 x t+1 = pz t+1 z 0:t, u 0:t p hz t+1 x t+1 Predict: p x t+1 { }}{ p a x t+1 x t, u t p t t x t dx t } {{ } Update Joint distribution: T px 0:T, z 0:T, u 0:T 1 = p 0 0 x 0 }{{} t=0 prior p h z t x t }{{} t=0 observation model T p a x t x t 1, u t 1 }{{} motion model 9

Gaussian Mixture Filter Prior: x t z 0:t, u 0:t 1 p t t x x := k αk t t φ x t ; µ k t t, Σk t t Motion model: x t+1 = Ax t + Bu t + w t, w t N 0, W Observation model: z t = Hx t + v t, v t N 0, V Prediction: p x = Update: = k p +1 x = p hz t+1 xp x pz t+1 z 0:t, u 0:t = k = k t t p a x s, u t p t t sds = k t t φ x; Aµ k t t + Bu t, AΣ k t t AT + W αk φz t+1; Hx, V φ x; µ k, Σk j αj φ z t+1 ; Hµ j, HΣj αk φ z t+1 ; Hµ k, HΣk HT + V j αj φ z t+1 ; Hµ j, HΣj HT + V p a x s, u t φ s; µ k t t, Σk t t ds φz t+1 ; Hx, V k αk φ x; µ k, Σk = φzt+1 ; Hs, V j αj φ s; µ j, Σj ds φ z t+1 ; Hµ k, HΣk HT + V HT + V φ z t+1 ; Hµ k, HΣk HT + V φ x; µ k Kalman Gain: K k := Σk HT HΣ k HT + V k +K z t+1 Hµ k k, I K HΣk 1 10

Gaussian Mixture Filter pdf: x t z 0:t, u 0:t 1 p t t x := k t t φ x t ; µ k t t, Σk t t mean: µ t t := E[x t z 0:t, u 0:t 1 ] = xp t t xdx = k t t µk t t ] cov: Σ t t := E [x t xt T z 0:t, u 0:t 1 = xx T p t t xdx µ t t µ T t t = k µ t t µ T t t t t Σ k t t + µk t t µk t t T µ t t µ T t t The GMF is just a bank of Kalman filters; sometimes called Gaussian Sum filter 11

Gaussian Mixture Filter If the motion or observation models are nonlinear, we can apply the EKF or UKF tricks to get a nonlinear GMF Additional operations are needed when strong nonlinearities are present in the motion or observation models: Refinement: introduces additional components to reduce the linearization error Pruning: approximates the overall distribution with a smaller number of components e.g., using KL divergence as a measure of accuracy More details: Nonlinear Gaussian Filtering: Theory, Algorithms, and Applications: Huber Bayesian Filtering and Smoothing: Särkkä 12

Histogram Filter Represent the pdf via a histogram over a discrete set of possible locations The accuracy is limited by the grid size A small grid becomes very computationally expensive in high dimensional state spaces because the number of cells is exponential in the number of dimensions Idea: represent the pdf via adaptive discretization, e.g., octrees 13

Histogram Filter Prediction step Assume bounded Gaussian noise in the motion model The prediction step can be realized by shifting the data in the grid according to the control input and convolving the grid with a separable Gaussian kernel: This reduces the prediction step cost from On 2 to On where n is the number of cells Update step To update and normalize the pdf upon sensory input, one has to iterate over all cells Is it possible to monitor which part of the state space is affected by the observations and only update that? 14

Particle Filter A Gaussian mixture filter with Σ k t t 0 so that φ x t ; µ k t t, Σk t t δ x t ; µ k t t := 1{x t = µ k t t } Prior: x t z 0:t, u 0:t 1 p t t x x := N t t αk t t δ x t ; µ k t t Motion model: x t+1 p a x t, u t Observation model: z t p h x t Prediction: p x = Update: N t t p a x s, u t t t δ s; µ k ds = t t N t t p h z t+1 x N δ x; µ k N p +1 x = ph z t+1 s N = j=1 α j δ s; µ j ds N t t p ax µ k t t, u t?? p h N j=1 α j p h z t+1 µ k δ x; µ k z t+1 µ j δ x; µ k 15

Particle Filter Resampling How do we approximate the prediction step? N t t p x = N t t p ax µ k t t, u t?? δ x; µ k How do we avoid particle depletion - a situation in which most of the particle weights are close to zero? Just like the GMF uses refinement and pruning, the particle filter uses a procedure called resampling to: 1. approximate the prediction step 2. avoid particle depletion during the update step Resampling is applied at time t if the effective number of particles: 1 N eff := is less than a threshold Nt t t t 2 16

Particle Filter Prediction How do we approximate the prediction step? N t t p x = N t t p ax µ k t t, u t?? δ x; µ k Since p x is a mixture pdf, we can approximate it with particles by drawing samples directly from it Let N be the number of particles in the approximation usually, N = N t t Bootstrap approximation: repeat N times and normalize the weights at the end: Draw j {1,..., Nt t } with probability α j t t j Draw µ p a µ j t t, u t Add the weighted sample µ j, p µ j to the new particle set 17

Particle Filter 18

Inverse Transform Sampling Target distribution: How do we sample from a distribution with pdf px and CDF F x = x psds? Inverse Transform Sampling: 1. Draw u U0, 1 2. Return inverse CDF value: µ = F 1 u 3. The CDF of F 1 u is: PF 1 u x = Pu F x = F x 19

Rejection Sampling Target distribution: How do we sample from a complicated pdf px? Proposal distribution: use another pdf qx that is easy to sample from e.g., Uniform, Gaussian and: λpx qx with λ 0, 1 Rejection Sampling: 1. Draw u U0, 1 and µ q 2. Return µ only if u λpµ qµ. If λ is small, many rejections are necessary Good qx and λ are hard to choose in practice 20

Sample Importance Resampling SIR How about rejection sampling without λ? Sample Importance Resampling for a target distribution p with proposal distribution q 1. Draw µ 1,..., µ N q 2. Compute importance weights = pµk qµ k and normalize: αk = αk j αj 3. Draw µ k independently with replacement from { µ 1,..., µ N} with probability and add to the final sample set with weight 1 N If q is a poor approximation of p, then the best samples from q are not necessarily good samples for resampling Markov Chain Monte Carlo methods e.g., Metropolis-Hastings and Gibbs sampling: The main drawback of rejection sampling and SIR is that choosing a good proposal distribution q is hard Idea: let the proposed samples µ depend on the last accepted sample µ, i.e., obtain correlated samples from a conditional proposal distribution µ k q µ k 1 Under certain conditions, the samples generated from q µ form an ergodic Markov chain with p as its stationary distribution 21

Stratified Resampling In the last step of SIR, the weighted sample set {µ k, } is resampled independently with replacement This might result in high variance resampling, i.e., sometimes some samples with large weights might not be selected or samples with very small weights may be selected multiple times Stratified resampling: guarantees that samples with large weights appear at least once and those with small weights at most once. Stratified resampling is optimal in terms of variance Thrun et al. 2005 Instead of selecting samples independently, use a sequential process: Add the weights along the circumference of a circle Divide the circle into N equal pieces and sample a uniform on each piece Samples with large weights are chosen at least once and those with small weights at most once 22

Stratified Resampling Stratified low variance resampling 1: Input: particle set { µ k, } N 2: Output: resampled particle set 3: j 1, c α 1 4: for k = 1,..., N do 5: u U 0, 1 N 6: β = u + k 1 N 7: while β > c do 8: j = j + 1, c = c + α j 9: add µ j, 1 N to the new set Systematic resampling: the same as stratified resampling except that the same uniform is used for each piece, i.e., u U 0, 1 N is sampled only once before the for loop above. Sample importance resampling SIR: draw µ k independently with replacement from { µ 1,..., µ N} with probability and add to the final sample set with weight 1 N 23

Particle Filter Summary Prior: x t z 0:t, u 0:t 1 p t t x x := N t t αk t t δ x t ; µ k t t Motion model: x t+1 p a x t, u t Observation model: z t p h x t Prediction: approximate the mixture by sampling: p x = N t t p a x s, u t t t δ s; µ k ds = t t N t t N t t p ax µ k t t, u t δ x; µ k Update: rescale the particles based on the observation likelihood: p h z t+1 ; x N δ x; µ k N p +1 x = ph z t+1 ; s N = j=1 α j δ s; µ j ds p h N j=1 α j p h z t+1 µ k z t+1 µ j δ x; µ k If N eff := 1 { µ k N t t t t +1, αk +1 N threshold, resample the particle set 2 } via stratified or sample importance resampling 24

Rao-Blackwellized Particle Filter The Rao-Blackwellized marginalized particle filter is applicable to conditionally linear-gaussian models: xt+1 n = ft n xt n + A n t xt n xt l + Gt n xt n wt n xt+1 l = ft l xt n + A l txt n xt l + Gt l xt n wt l z t = h t xt n + C t xt n xt l Nonlinear states: x + v t n t Linear states: xt l Idea: exploit linear-gaussian sub-structure to handle high dim. problems xt, l x0:t n z 0:t, u 0:t 1 p = p xt l z 0:t, u 0:t 1, x0:t n }{{} Kalman Filter N t t px0:t n z 0:t, u 0:t 1 }{{} Particle Filter = t t δ x0:t; n m k t t φ xt; l µ k t t, Σk t t The RBPF is a combination of the particle filter and the Kalman filter, in which each particle has a Kalman filter associated to it 25