Structure from Motion CS4670/CS5670 - Kevin Matzen - April 15, 2016 Video credit: Agarwal et al. Building Rome in a Day, ICCV 2009
Roadmap What we've seen so far: single-view modeling (1 camera), stereo modeling (2 cameras), multi-view stereo (3+ cameras). How do we recover the camera parameters necessary for MVS?
Wednesday's Lecture Assume we are always given the camera calibration. (figure: two calibrated cameras with focal lengths f1, f2 and poses T1, T2)
Today's Lecture Assume we are never given the camera calibration.
Calibration makes 3D reasoning possible! (figure: two-view geometry with focal lengths f1, f2, depth z, and baseline b)
Today's outline: How can we calibrate our cameras? How can we calibrate a camera without photos of a calibration target? How can we automate this calibration at scale?
Projection Model
Projection Model Calibration gives us the mapping from some 3D world-space point to its 2D image-space projection
Camera Calibration
Camera Calibration (figure: calibration target with known 3D point coordinates, e.g. (10, 12, 0), measured relative to a chosen origin (0, 0, 0))
DLT Method: given correspondences between known 3D world points and their 2D projections, solve linearly for the 12 entries of the projection matrix P (Direct Linear Transform)
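The DLT equations on these slides did not survive extraction, so here is a minimal numpy sketch of the standard formulation: each 2D-3D correspondence contributes two linear equations in the 12 entries of P, and P is recovered (up to scale) as the right singular vector of the stacked system with the smallest singular value. The function name and the synthetic test camera are illustrative choices, not from the slides.

```python
import numpy as np

def dlt_calibrate(pts3d, pts2d):
    """Estimate the 3x4 projection matrix P from 2D-3D correspondences.

    Each correspondence gives two linear equations in the 12 entries
    of P; the solution (up to scale) is the right singular vector of A
    with the smallest singular value. Needs 6+ non-coplanar points.
    """
    A = []
    for (X, Y, Z), (u, v) in zip(pts3d, pts2d):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u*X, -u*Y, -u*Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v*X, -v*Y, -v*Z, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    return Vt[-1].reshape(3, 4)

# Synthetic check: project random points with a known P, then recover P.
rng = np.random.default_rng(0)
P_true = np.array([[800., 0., 320., 10.],
                   [0., 800., 240., 20.],
                   [0., 0., 1., 5.]])
pts3d = rng.uniform(-1.0, 1.0, (10, 3))
proj = (P_true @ np.c_[pts3d, np.ones(10)].T).T
pts2d = proj[:, :2] / proj[:, 2:]
P_est = dlt_calibrate(pts3d, pts2d)
P_est *= P_true[2, 3] / P_est[2, 3]   # fix the arbitrary projective scale
```

Since homogeneous projection is only defined up to scale, the last line normalizes the estimate before comparing it to the ground-truth matrix.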
Question: Is a single plane enough?
Question: Is a single plane enough? Assume the plane is at Z = 0 (rotate and translate coordinates to make it so)
Question: Is a single plane enough? With Z = 0, the columns of the DLT matrix that multiply p13, p23, p33 are all 0 → rank is at most 9. No, the calibration target cannot be planar with the DLT method. But we can combine many planes.
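The rank claim is easy to check numerically: build the DLT system from correspondences whose 3D points all lie on Z = 0 and inspect its rank. The camera matrix and point counts below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[700., 0., 320., 1.],
              [0., 700., 240., 2.],
              [0., 0., 1., 4.]])

# Many correspondences, but every 3D point lies on the plane Z = 0.
pts3d = np.c_[rng.uniform(-1, 1, (20, 2)), np.zeros(20)]
proj = (P @ np.c_[pts3d, np.ones(20)].T).T
pts2d = proj[:, :2] / proj[:, 2:]

rows = []
for (X, Y, Z), (u, v) in zip(pts3d, pts2d):
    rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u*X, -u*Y, -u*Z, -u])
    rows.append([0, 0, 0, 0, X, Y, Z, 1, -v*X, -v*Y, -v*Z, -v])
A = np.asarray(rows)

# The three columns multiplying p13, p23, p33 are identically zero,
# so rank(A) stays at most 9 no matter how many planar points we add.
rank = np.linalg.matrix_rank(A)
```

With 12 unknowns and rank at most 9, the null space is too large to pin down P, which is exactly why a single planar target fails for DLT.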
Non-Linear Method The DLT method does not automatically give a decomposition into extrinsics and intrinsics We may wish to impose additional constraints on the camera model (e.g. isotropic focal length, square pixels) Non-linearities such as radial distortion are not easily modeled with DLT
The full projection model:

$$\begin{bmatrix} u_i w_i \\ v_i w_i \\ w_i \end{bmatrix} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{bmatrix} \begin{bmatrix} x_i \\ y_i \\ z_i \\ 1 \end{bmatrix}$$

Reading right to left: a 3D world-space point is rotated and translated into camera space, then projected onto the image plane. Let's work through a simpler 2D version:

$$\begin{bmatrix} u_i w_i \\ w_i \end{bmatrix} = \begin{bmatrix} f & c \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \cos\theta & -\sin\theta & t_x \\ \sin\theta & \cos\theta & t_y \end{bmatrix} \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix}$$
A 2D point maps to a 1D projection:

$$\begin{bmatrix} u_i w_i \\ w_i \end{bmatrix} = \begin{bmatrix} f & c \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \cos\theta & -\sin\theta & t_x \\ \sin\theta & \cos\theta & t_y \end{bmatrix} \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix}$$

Multiplying out:

$$\begin{bmatrix} u_i w_i \\ w_i \end{bmatrix} = \begin{bmatrix} f & c \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \cos(\theta)x_i - \sin(\theta)y_i + t_x \\ \sin(\theta)x_i + \cos(\theta)y_i + t_y \end{bmatrix}$$

$$u_i w_i = f(\cos(\theta)x_i - \sin(\theta)y_i + t_x) + c(\sin(\theta)x_i + \cos(\theta)y_i + t_y), \qquad w_i = \sin(\theta)x_i + \cos(\theta)y_i + t_y$$

Dividing out $w_i$:

$$u_i = \frac{f(\cos(\theta)x_i - \sin(\theta)y_i + t_x) + c(\sin(\theta)x_i + \cos(\theta)y_i + t_y)}{\sin(\theta)x_i + \cos(\theta)y_i + t_y}$$
$$h(f,c,\theta,t_x,t_y,x_i,y_i) = \frac{f(\cos(\theta)x_i - \sin(\theta)y_i + t_x) + c(\sin(\theta)x_i + \cos(\theta)y_i + t_y)}{\sin(\theta)x_i + \cos(\theta)y_i + t_y}$$

$$L(f,c,\theta,t_x,t_y) = \sum_i (u_i - h(f,c,\theta,t_x,t_y,x_i,y_i))^2$$

$$\operatorname*{argmin}_{f,c,\theta,t_x,t_y} L(f,c,\theta,t_x,t_y)$$

Apply a non-linear optimization method. Exercise: derive $\partial L/\partial f$, $\partial L/\partial c$, $\partial L/\partial \theta$, $\partial L/\partial t_x$, $\partial L/\partial t_y$.
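As a concrete sketch of "apply a non-linear optimization method," here is the 2D toy model minimized by Gauss-Newton with a forward-difference Jacobian. The function names, the choice of Gauss-Newton, and the synthetic parameter values are assumptions for illustration; the slides do not prescribe a particular solver.

```python
import numpy as np

def h(params, pts):
    """Toy 1D projection of 2D points: rotate/translate, then pinhole."""
    f, c, th, tx, ty = params
    x, y = pts[:, 0], pts[:, 1]
    num = f * (np.cos(th)*x - np.sin(th)*y + tx) + c * (np.sin(th)*x + np.cos(th)*y + ty)
    den = np.sin(th)*x + np.cos(th)*y + ty
    return num / den

def calibrate(pts, u_obs, params0, iters=50, eps=1e-6):
    """Gauss-Newton on L = sum_i (u_i - h(params, p_i))^2 with a numeric Jacobian."""
    p = np.asarray(params0, float)
    for _ in range(iters):
        r = u_obs - h(p, pts)
        J = np.empty((len(r), len(p)))
        for k in range(len(p)):
            dp = p.copy()
            dp[k] += eps
            J[:, k] = (h(dp, pts) - h(p, pts)) / eps
        step, *_ = np.linalg.lstsq(J, r, rcond=None)  # solve J step ~= r
        p = p + step
    return p

# Synthetic data from known parameters, then recover from a perturbed start.
true = np.array([2.0, 0.1, 0.3, 0.5, 4.0])   # f, c, theta, tx, ty
rng = np.random.default_rng(1)
pts = rng.uniform(-1, 1, (30, 2))
u_obs = h(true, pts)
est = calibrate(pts, u_obs, true + 0.05)
```

Like most non-linear least-squares methods, this only converges from a reasonable initialization, which foreshadows why SfM needs a good starting guess.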
What if we don't have a target? The world is our calibration target! But we don't know the position of all points in the world.
Structure from Motion Key goals of SfM: Use approximate camera calibrations to match features and triangulate approximate 3D points Use approximate 3D points to improve the approximate camera calibrations A chicken-and-egg problem We can extend and use our non-linear optimization framework, but it requires a good initialization
SfM building blocks What do we need from our CV toolbox? Keypoint detection Descriptor matching F-matrix estimation Ray triangulation Camera projection Non-linear optimization Useful metadata Focal length guess (EXIF tags)
Given: images 1 and 2, and focal length guesses
1. Compute feature matches and the F-matrix
2. Use approximate K's to get the E-matrix: $E = K_2^T F K_1$
3. Decompose E into relative pose: $E = R[t]_\times$
4. Triangulate features
5. Apply non-linear optimization
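Steps 2 and 4 above can be sketched directly in numpy: the essential matrix is $E = K_2^T F K_1$, and linear (DLT-style) triangulation recovers a 3D point as the null vector of a small system built from both views. The helper names and the synthetic two-camera setup are illustrative, not from the slides.

```python
import numpy as np

def essential_from_fundamental(F, K1, K2):
    # Step 2: E = K2^T F K1, using approximate intrinsics (e.g. from EXIF).
    return K2.T @ F @ K1

def triangulate(P1, P2, x1, x2):
    """Step 4: linear triangulation of one correspondence.

    Each view contributes two rows (u * P[2] - P[0], v * P[2] - P[1]);
    the homogeneous 3D point X is the null vector of the stacked system.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Synthetic check: two cameras observing a known 3D point.
K = np.diag([500., 500., 1.])
K[0, 2], K[1, 2] = 320., 240.
P1 = K @ np.c_[np.eye(3), np.zeros(3)]            # camera 1 at the origin
P2 = K @ np.c_[np.eye(3), np.array([-1., 0., 0.])]  # camera 2 offset along x
X_true = np.array([0.3, -0.2, 4.0])
x1h = P1 @ np.append(X_true, 1.0)
x2h = P2 @ np.append(X_true, 1.0)
X_est = triangulate(P1, P2, x1h[:2] / x1h[2], x2h[:2] / x2h[2])
```

With noisy matches the triangulated points are only approximate, which is why step 5 refines everything jointly.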
$$h(f,c,\theta,t_x,t_y,x_i,y_i) = \frac{f(\cos(\theta)x_i - \sin(\theta)y_i + t_x) + c(\sin(\theta)x_i + \cos(\theta)y_i + t_y)}{\sin(\theta)x_i + \cos(\theta)y_i + t_y}$$

$$L(f,c,\theta,t_x,t_y) = \sum_i (u_i - h(f,c,\theta,t_x,t_y,x_i,y_i))^2$$

$$\operatorname*{argmin}_{f,c,\theta,t_x,t_y} L(f,c,\theta,t_x,t_y)$$
$$h(f,c,\theta,t_x,t_y,x_i,y_i) = \frac{f(\cos(\theta)x_i - \sin(\theta)y_i + t_x) + c(\sin(\theta)x_i + \cos(\theta)y_i + t_y)}{\sin(\theta)x_i + \cos(\theta)y_i + t_y}$$

$$L(f,c,\theta,t_x,t_y,(x_1,y_1),\ldots,(x_n,y_n)) = \sum_i (u_i - h(f,c,\theta,t_x,t_y,x_i,y_i))^2$$

$$\operatorname*{argmin}_{f,c,\theta,t_x,t_y,(x_1,y_1),\ldots,(x_n,y_n)} L(f,c,\theta,t_x,t_y,(x_1,y_1),\ldots,(x_n,y_n))$$

Doesn't make sense for 1 camera
$$h(f,c,\theta,t_x,t_y,x_i,y_i) = \frac{f(\cos(\theta)x_i - \sin(\theta)y_i + t_x) + c(\sin(\theta)x_i + \cos(\theta)y_i + t_y)}{\sin(\theta)x_i + \cos(\theta)y_i + t_y}$$

$$L(K_1,\ldots,K_m,(x_1,y_1),\ldots,(x_n,y_n)) = \sum_i \sum_j w_{i,j}\,(u_{i,j} - h(K_j,(x_i,y_i)))^2$$

$$\operatorname*{argmin}_{K_1,\ldots,K_m,(x_1,y_1),\ldots,(x_n,y_n)} L(K_1,\ldots,K_m,(x_1,y_1),\ldots,(x_n,y_n))$$

Called Bundle Adjustment
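The bundle-adjustment objective above can be written out directly: a weighted sum over all (point, camera) pairs, where $w_{i,j} = 0$ for points a camera does not see. This is a minimal sketch of the objective only (not a solver), using the toy 1D-camera model from the earlier slides; the function names and numbers are illustrative assumptions.

```python
import numpy as np

def h(cam, pt):
    """Toy 1D projection of a 2D point (same model as the earlier slides)."""
    f, c, th, tx, ty = cam
    x, y = pt
    den = np.sin(th)*x + np.cos(th)*y + ty
    return (f*(np.cos(th)*x - np.sin(th)*y + tx) + c*den) / den

def bundle_loss(cams, pts, obs):
    """L = sum_{i,j} w_ij (u_ij - h(K_j, p_i))^2.

    `obs` maps (point index i, camera index j) -> observed u_ij;
    missing entries act as w_ij = 0 (point not seen by that camera).
    """
    return sum((u - h(cams[j], pts[i]))**2 for (i, j), u in obs.items())

# Two cameras, three points, full visibility; both the camera parameters
# and the point coordinates are unknowns in the real optimization.
cams = np.array([[2.0, 0.0, 0.0, 0.0, 4.0],
                 [2.0, 0.0, 0.4, 1.0, 4.0]])
pts = np.array([[0.5, 1.0], [-0.3, 0.8], [0.1, 1.5]])
obs = {(i, j): h(cams[j], pts[i]) for i in range(3) for j in range(2)}
loss = bundle_loss(cams, pts, obs)   # zero at the true parameters
```

In practice this objective is minimized with sparse non-linear least-squares solvers, since the Jacobian couples each observation to only one camera and one point.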
Camera sets can be incrementally built up Essential matrix
Camera sets can be incrementally built up Perspective-n-Point (PnP) method
Dubrovnik - Incremental Bundle Adjustment
Dubrovnik
Sacré-Cœur
SfM Ambiguities

$$x = PX$$
$$x = (PQ)(Q^{-1}X)$$
$$x = K(TQ)(Q^{-1}X)$$

T is a rigid-body transformation. If we want TQ to be an RBT, then Q can be an RBT → if we rotate and translate all the points, everything works out as long as we also rotate and translate all the cameras.
SfM Ambiguities

$$x = PX$$
$$x = (PS^{-1})(SX)$$
$$x = K(TS^{-1})(SX)$$
$$x = K(S^{-1}TT')(SX)$$
$$x = (KS^{-1})(TT')(SX)$$
$$x = (S^{-1}K)(TT')(SX)$$
$$Sx = K(TT')(SX)$$

$$Sx = \begin{bmatrix} suw \\ svw \\ sw \end{bmatrix} = \begin{bmatrix} uw \\ vw \\ w \end{bmatrix} = x \quad \text{(in homogeneous coordinates)}$$

→ If we scale all the points, everything works out as long as we move the camera positions.
SfM Ambiguities $x = PX = (PQ)(Q^{-1}X)$ In this case Q is a general similarity transform. We often resolve the ambiguity by placing one camera at the origin facing some direction and a second camera at a fixed offset from the first.
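The gauge ambiguity above is easy to verify numerically: inserting any invertible $Q$ between camera and points leaves every projection unchanged. The camera, point, and rigid transform below are arbitrary illustrative values.

```python
import numpy as np

# Any camera P and homogeneous 3D point X.
K = np.diag([500., 500., 1.])
P = K @ np.c_[np.eye(3), np.array([0., 0., 2.])]
X = np.array([0.2, -0.1, 3.0, 1.0])

# A 4x4 rigid-body transform Q: rotation about z plus a translation.
th = 0.7
Q = np.eye(4)
Q[:3, :3] = [[np.cos(th), -np.sin(th), 0.],
             [np.sin(th),  np.cos(th), 0.],
             [0., 0., 1.]]
Q[:3, 3] = [1.0, -2.0, 0.5]

x1 = P @ X                                # original projection
x2 = (P @ Q) @ (np.linalg.inv(Q) @ X)     # transformed cameras and points
```

Since `x1` and `x2` agree, reprojection error alone cannot distinguish the two reconstructions, which is why SfM must fix the gauge by convention.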
Applications
Internet-scale 3D
Snavely et al. Finding Paths through the World's Photos. SIGGRAPH 2008.