Visual SLAM Tutorial: Bundle Adjustment

Visual SLAM Tutorial: Bundle Adjustment Frank Dellaert June 27, 2014 1 Minimizing Re-projection Error in Two Views In a two-view setting, we are interested in finding the most likely camera poses T1 w and T 2 w, both in the space SE3) of 3D poses, and 3D points Pj w R 3, with j 1..n, which we can do by minimizing the re-projection errors f Tc w,p w ; p) = htc w,p w ) p 2 Σ associated with each measurement p R 2. Above, the measurement function h : SE3) R 3 R 2 predicts the measurement p from the unknowns Tc w = R w c,tc w ) and P w, and the notation e 2 Σ e T Σ 1 e is shorthand for the squared Mahalanobis distance with covariance Σ. The objective function is then the log-likelihood of the unknowns given all such measurements p in the two views. Under the Gaussian noise assumptions, this is exactly 2 In a Linear World... E T1 w,t 2 w,{ }) = P j hti w,pj w ) p i j 2 Σ 1) i=1,2 j In a linear world, the measurement function h would be a linear function of the unknowns. For reasons that will become clear, let us introduce a 6DOF unknown ξ i associated with camera i, and a 3DOF unknwon δ j associated with point j. Then linearity implies = and the objective function ˆp i j = h i j ξi,δ j ) = Fi j ξ i + G i j δ j E ξ 1,ξ 2, { }) = δ j F i j ξ i + G i j δ j b i j 2 Σ 2) i=1,2 j where we also introduced b i j, the linear measurement. Then we can easily re-write the objective function 2) as E ξ 1,ξ 2, { δ j }) = Ax b 2 Σ 3) with x and b vectorized forms of our unknowns and measurements, respectively, and Σ a suitable defined block-diagonal) covariance matrix. 1

3 Sparse Matters The matrix A will be a block-sparse matrix [Hartley and Zisserman, 2004]. For example, with 3 points, we have F 11 G 11 b F 12 G 12 ξ 11 1 A = F 13 G 13 ξ 2 b 12 F 21 G 21, x = δ 1 F 22 G 22 δ 2, b = b 13 b 14 b δ 15 F 23 G 3 23 b 16 A linear least-squares criterion of the form 3) is easy to solve: we just take the derivative wrt x and set to zero, obtaining the condition A A ) x = A b, 4) also known as the normal equations. In the above, the symmetric matrix A A is the known as the Hessian of the system. Because A is sparse, it is imperative that, for the sake of computational efficiency, a sparse solver is used to solve for x, e.g., using GTSAM or cholmod. The latter is built into MATLAB and is automatically invoked simply by typing x = A\b. 4 Back to Reality In reality, the measurement function ht w c,p w ) is quite non-linear, and generates the predicted measurement ˆp by first transforming the point P w into camera coordinates P c, as specified by the camera T w c, then projecting the point so obtained into the image: ˆp = ht w c,p w ) = π gt w c,p w )) In detail, the point P c in camera coordinates is obtained by the coordinate transformation g, as in P c = gt w c,p w ) = R w c ) T P w t w c ) 5) Then the predicted image coordinates ˆp are obtained by the camera projection function π ˆp = ˆx,ŷ) = π X c,y c,z c ) = X c Z c, Y c Z c ) 6) Above we assumed that the measurements p are pre-calibrated to normalized image coordinates p = x, y), as we are interested in the setting where the calibration matrix K and radial distortion function f r) are known. So how can we get back to the easy, linear setting? Maybe we can do a Taylor expansion, as in ht w c + ξ,p w + δ) ht w c,p w ) + Fξ + Gδ. However, here we run into trouble, as + is not defined for elements of SE3). The next section introduces a more general view of Taylor expansion that will work, however. 2

5 A Twist of Lim)e The notion of an incremental update to an element on a manifold a M is provided by the notion of a tangent space at a a, denoted T a M, which is a vector space R n with n equal to the dimensionality of the manifold. Any vector x T a M in the tangent space can be mapped back to the manifold, with x = 0 mapping to a. Hence, vectors x correspond to updates around a, and this provides a way for our linearized problem to update the solution on the manifold [Absil et al., 2007]. In the special case that the manifold is also a group, i.e., a Lie group, the mapping is provided by the exponential map. In what follows, I use the notation from [Murray et al., 1994], directly specialized to SE3): a vector ξ R 6 is mapped to an element of the Lie Algebra se3), in this case a 4 4 matrix, which is then mapped to a Lie group element by matrix exponentiation: T w c T w c exp ˆξ where the hat operator ˆis the mapping from R 6 to se3): [ ] [ ξ = ω θ v ˆξ = [ω] v 0 0 For SE3) the 6-dimensional quantity ω,v) is also called a twist, whereas θ R + is the amount of twist applied, resulting in a finite screw motion. In general, the elements of the vector space isomorphic to the Lie algebra are called canonical coordinates. More details about the Lie algebra se3) and its corresponding exponential map can be found in Appendix A. ] θ 6 A Generalized Taylor Expansion Equipped with this notion, we can generalize the Taylor expansion to locally approximate a function f : G R m, with G an n-dimensional Lie group, as follows: f ae ˆξ ) f a) + f a)ξ where ˆξ g, the Lie algebra of G, and f a) is an m n Jacobian matrix, linearly mapping the n-dimensional exponential coordinates ξ to m-dimensional corrections on f a). Note that f a) is itself a function of a, i.e., the linear mapping is in general different for different values of a. In the image projection case, we need to similarly generalize the notion of partial derivatives, to approximate the two-argument measurement function h: ht w c e ˆξ,P w + δ) ht w c,p w ) + h [1] T w c,p w )ξ + h [2] T w c,p w )δ Above h [1].) and h [2].) are the 2 6 and 2 3 Jacobian matrices corresponding to the partial derivatives of h in its first and second arguments, respectively. Note that the incremental pose exp ˆξ ) is applied on the camera side, and hence the exponential coordinates ξ = ω,v) are expressed in the camera frame. 3

7 Nonlinear Optimization By defining F i j = h [1] Ti w,pj w), G i j = h [2] Ti w,pj w), and the linearization prediction error b i j = hti w,pj w) p i j, we recover exactly the linear objective function in 2), which we can minimize for the optimal exponential coordinates ξ i and δ j. Gauss-Newton nonlinear optimization simply iterates between linearization, solve, and update, until a predefined convergence criterion is stasfied: 1. Start with a good initial estimate θ 0 = T1 w,t 2 w,{ } P j 2. Linearize 1) around θ it to obtain A and b 3. Solve for x = ξ 1,ξ 2, { δ j } using the normal equations 4) A A ) x = A b where A A the Gauss-Newton approximation to the Hessian of the actual non-linear system. 4. Update the nonlinear estimate θ it+1 using the exponential map a) T1 w T 1 w exp ˆξ 1 b) T2 w T 2 w exp ˆξ 2 c) P j P j + δ j 5. If not converged, go to 2. 8 Too much Freedom! A problem that will pop up when using the Gauss-Newton for the two-view case is that the Hessian A A will be singular. The reason is that there is a so-called gauge freedom: the solution is not uniquely determined, as we can displace the cameras using an arbitrary, 7DOF similarity transform without affecting the objective function. There are several ways to get around this: Switch to a minimal representation of a scale-free relative transform between the two cameras. This works well for two cameras but is not so easy to apply in the multi-camera case. Use the so called inner constraints from photogrammetry, which constrain the centroid and scale of the solution. This generalizes to multiple cameras, but a downside is that the inner constraints destroy some of the sparsity of the system, which affects computational efficiency. Add prior terms on the first pose and the distance of the second pose from the first pose e.g., unity). This works well and scales to multiple cameras, but in a monocular situation there will still be scale drift. Fuse in information from other sensors, e.g., IMU and/or GPS. In the case of a calibrated stereo-rig the problem of scale disappears, but one still has to put a prior on or clamp down the first pose in order to eliminate the gauge freedom. A common but unadvisable way is to use the technique from the next section to make the problem disappear. 4

9 A Matter of Trust When far from the solution, the Hessian approximation A A might be valid for a very small region only, and taking quadratic steps as given by the normal equations will lead to divergence. A well known way to deal with this issue is to limit the steps based on the amount of trust we have in the current linearization, switching to a more cautious gradient descent approach when far from the local minimum. The most well-known of these so-called region trust methods is Levenberg-Marquardt LM), which solves for A A + λd ) x = A b, where D is either the diagonal of A or the identity matrix. For the latter case, with λ high, the solution is given by x = λ 1 A b, which is gradient descent with step size λ 1. When λ is zero or small, LM reverts to quadratic Gauss-Newton steps. Simple implementations of LM start with high values of λ and gradually decrease it, backing off if this leads to steps with non-decreasing error. More sophisticated variants put a lot more work into finding a good λ at every step. What LM is not designed for is masking out gauge freedoms, although in many implementations it is used that way: taking smaller steps when not needed will lead to slower convergence. 10 Finally, the Jacobians... To apply all of the above, one last task has to be undertaken, and that is deriving the Jacobians of the measurement function h. Because h itself is the composition of the coordinate transformation g followed by the projection π, the following chain rules apply: h [1] T w c,p w ) = π p c ) g [1] T w c,p w ) h [2] T w c,p w ) = π p c ) g [2] T w c,p w ) We now need to compute three Jacobian matrices: 1. The 2 3 Jacobian matrix π X c,y c,z c ) of the projection function π is easily computed as π p c ) = 1 [ 1 0 X c /Z c ] Z c 0 1 Y c /Z c = 1 [ ] 1 0 ˆx z c 7) 0 1 ŷ 2. The 3 3 Jacobian matrix g [2] T w c,p w ) of the coordinate transformation g with respect to a change δ in the point P w is also easy to compute, as is clearly linear in δ, and hence gt w c,p w + δ) = R w c ) T P w +t w c ) + R w c ) T δ g [2] T w c,p w ) = R w c ) T 8) 5

3. The change in pose is a bit more challenging. It can be shown, however, that g Tc w e ˆξ,P w) = P c + P c ω v This is linear in both ω and v, and hence the 3 6 Jacobian matrix g [1] Tc w,p w ) of g with respect to a change in pose easily follows as g [1] Tc w,p w ) = [ [P c ] ] I 3 9) In other words, rotating the camera frame is magnified by the arm of the point in camera coordinates, whereas moving the camera simply moves the point s coordinates in the opposite direction. In summary, using Eqns. 7, 8, and 9, the derivatives of image projection are h [1] Tc w,p w ) = 1 [ ] 1 0 ˆx [ [P Z c 0 1 ŷ c ] ] I 3 h [2] T w c,p w ) = 1 Z c [ 1 0 ˆx 0 1 ŷ 10) ] R w c ) T 11) Note the arm of P c for rays near the optical axis is now nearly canceled by the inverse depth factor 1/z c in π P c ), i.e., we expect flow nearly proportional to the incremental rotation ω. 11 More Than One Camera Many readers will have already figured out that the entire story above generalizes to many cameras almost at once. The general case is known as Structure from Motion SfM) or bundle adjustment. The total re-projection error now sums over all measurements p k E {Ti w }, { }) = P j hti w k,pj w k ) p k 2 Σ 12) k where i k and j k encode the data association, assumed known here. Techniques such as Levenberg- Marquardt and obtaining a good initial estimate will be of greater importance for the many-cameras case, as the objective function above is very non-linear, especially when points move closer to image planes and coross-over to the wrong side, in which case local minimizers such as Gauss Newton and LM will almost always converge to a local minimum. Another often used technique in batch SfM is the Schur complement trick, where in the linear step one first solves for the point updates δ j as a function of pose updates { ξ j }. However, the utility of that in an online, visual SLAM scenario is debatable - a discussion for a different part of the tutorial. 6

A 3D Rigid Transformations The Lie group SE3) is a subgroup of the general linear group GL4) of 4 4 invertible matrices of the form [ ] T = R t 0 1 where R SO3) is a rotation matrix and t R 3 is a translation vector. An alternative way of writing down elements of SE3) is as the ordered pair R, t), with composition defined as R 1, t 1 )R 2, t 2 ) = R 1 R 2, R 1 t 2 +t 1 ) Its Lie algebra se3) is the vector space of 4 4 twists ˆξ parameterized by the twist coordinates ξ R 6, with the mapping [Murray et al., 1994] [ ] [ ] ξ = ω θ v ˆξ = 6 [ω] v θ = ξ 0 0 i G i where the 4 4 matrices G are called the generators for se3). Note I follow a different convention from Murray and reserve the first three components for rotation, and the last three for translation. Hence, with this parameterization, the generators for SE3) are G 1 = G 4 = 0 0 1 0 0 1 0 0 0 0 0 1 G2 = G5 = 0 0 1 0 1 0 0 0 0 0 0 1 i=1 G3 = G6 = 0 1 0 0 1 0 0 0 0 0 0 1 Applying the exponential map to a twist ξ, scaled with θ R +, yields a screw motion in SE3): ) T t) = exp ˆξ θ A closed form solution for the exponential map is given in [Murray et al., 1994, page 42]. References exp [ ω v ] ) [ e [ω] θ I e t = [ω] θ )ω v) + ωω T vθ 0 1 [Absil et al., 2007] Absil, P.-A., Mahony, R., and Sepulchre, R. 2007). Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ, USA. [Hartley and Zisserman, 2004] Hartley, R. I. and Zisserman, A. 2004). Multiple View Geometry in Computer Vision. Cambridge University Press, second edition. [Murray et al., 1994] Murray, R., Li, Z., and Sastry, S. 1994). A Mathematical Introduction to Robotic Manipulation. CRC Press. ] 7