Level-Set Person Segmentation and Tracking with Multi-Region Appearance Models and Top-Down Shape Information — Appendix

Esther Horbert, Konstantinos Rematas, Bastian Leibe
UMIC Research Centre, RWTH Aachen University
{horbert,leibe}@umic.rwth-aachen.de

This appendix contains a detailed derivation of our segmentation and tracking framework [27] and lists typical values for important parameters.

A. Derivation

Fig. 1 shows the results of our full model for the sequence WALKSTRAIGHT [26], illustrating the evolution of the three contours.

Figure 1. Results of our full model for the sequence WALKSTRAIGHT (25 frames, from [26]). This example nicely illustrates the evolution of the separating contours (dark red). The foreground contour Φ_f is initialized with a horizontal line at 50% of the object frame height, the background contour Φ_b with a line at 60%. However, this is only the initialization, and the lines can evolve to very different forms (as the person's contour does).

Figure 2. The generative model used in this approach, M ∈ {M_f1, M_f2, M_b1, M_b2}.

Table 1. Notation used in this paper:
  x              Pixel's coordinates inside reference frame
  y              Pixel's color
  p              Reference frame position
  h              Shape model
  W(x, p)        Warp with parameters p
  M_f1, M_f2     Foreground regions
  M_b1, M_b2     Background regions
  P(y|M_k)       Appearance models
  Φ              Level set embedding function
  Φ_c, Φ_f, Φ_b  Embeddings for person and fore-/background
  C_k            Contour represented by the zero level set
  H(z)           Smoothed Heaviside step function
  δ(z)           Smoothed Dirac delta function
The joint distribution for one pixel given by the model in Fig. 2 is:

  P(x_i, y_i, h_i, Φ, p, M) = P(x_i|Φ, p, M) P(y_i|M) P(h_i|M) P(M) P(Φ) P(p)   (1)

  P(y_i, h_i|M) = P(y_i|M) P(h_i|M)   (assumption: y, h independent)   (2)

Conditioning on y_i and h_i gives

  P(x_i, Φ, p, M|y_i, h_i) P(y_i) P(h_i) = P(x_i|Φ, p, M) P(y_i|M) P(h_i|M) P(M) P(Φ) P(p)   (3)

and applying Bayes' rule to P(y_i|M) and P(h_i|M):

  P(x_i, Φ, p, M|y_i, h_i) = P(x_i|Φ, p, M) P(M|y_i) P(M|h_i) P(Φ) P(p)   (4)

Marginalization over the models M yields the pixel-wise posterior probability of shape Φ and location p given a pixel (x_i, y_i, h_i):

  P(x_i, Φ, p|y_i, h_i) = Σ_{k∈{f1,f2,b1,b2}} P(x_i, Φ, p, M_k|y_i, h_i) = P(Φ, p|x_i, y_i, h_i) P(x_i)   (5)

  = Σ_{k∈{f1,f2,b1,b2}} P(x_i|Φ, p, M_k) [P(y_i|M_k) P(M_k) / Σ_l P(y_i|M_l) P(M_l)] P(M_k|h_i) P(Φ) P(p)   (6)

It follows

  P(Φ, p|x_i, y_i, h_i) = (1/P(x_i)) Σ_{k∈{f1,f2,b1,b2}} P(x_i|Φ, p, M_k) [P(y_i|M_k) P(M_k) / Σ_l P(y_i|M_l) P(M_l)] P(M_k|h_i) P(Φ) P(p)   (7)

We use a smoothed Heaviside step function H_ε to select the respective regions and a smoothed Dirac delta function δ_ε to select the contours:

  H_c = H_ε(Φ_c(x_i)),   H̄_c = 1 − H_ε(Φ_c(x_i))   (8)
  H_f = H_ε(Φ_f(x_i)),   H̄_f = 1 − H_ε(Φ_f(x_i))   (9)
  H_b = H_ε(Φ_b(x_i)),   H̄_b = 1 − H_ε(Φ_b(x_i))   (10)

The model priors are given by the relative region sizes:

  P(M_k) = η_k/η,   k ∈ {f1, f2, b1, b2}   (M_f = M_f1 ∪ M_f2,  M_b = M_b1 ∪ M_b2)   (11)

The number of pixels in the four respective regions can be obtained as follows:

  η = Σ_k η_k,   η_f1 = Σ_{i=1}^{N} H_c H_f,   η_f2 = Σ_{i=1}^{N} H_c H̄_f,   (12)

  η_b1 = Σ_{i=1}^{N} H̄_c H_b,   η_b2 = Σ_{i=1}^{N} H̄_c H̄_b   (13)
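The soft region selection of Eqs. (8)–(10) and the region sizes of Eqs. (12)–(13) are simple to implement. Below is a minimal sketch (our own illustration, not the authors' code; all function names are ours), using the smoothed Heaviside H_ε defined later in Eq. (33):

```python
# Sketch (ours, not the authors' code) of the soft region indicators of
# Eqs. (8)-(10) and the region sizes eta_k of Eqs. (12)-(13), using the
# smoothed Heaviside H_eps of Eq. (33).
import numpy as np

def heaviside_eps(z, eps=6.0):
    """Smoothed Heaviside step function H_eps (Eq. 33)."""
    out = np.where(z > eps, 1.0, 0.0)   # 1 above the band, 0 below
    band = np.abs(z) <= eps
    out[band] = (z[band] / (2 * eps)
                 + np.sin(np.pi * z[band] / eps) / (2 * np.pi) + 0.5)
    return out

def region_sizes(phi_c, phi_f, phi_b, eps=6.0):
    """Soft pixel counts eta_f1, eta_f2, eta_b1, eta_b2 (Eqs. 12-13)."""
    h_c, h_f, h_b = (heaviside_eps(p, eps) for p in (phi_c, phi_f, phi_b))
    eta_f1 = np.sum(h_c * h_f)              # inside person, inside Phi_f
    eta_f2 = np.sum(h_c * (1 - h_f))        # inside person, outside Phi_f
    eta_b1 = np.sum((1 - h_c) * h_b)        # outside person, inside Phi_b
    eta_b2 = np.sum((1 - h_c) * (1 - h_b))  # outside person, outside Phi_b
    return eta_f1, eta_f2, eta_b1, eta_b2
```

Since H_c + (1 − H_c) = 1 at every pixel, the four counts always sum to the number of pixels N, so the priors P(M_k) = η_k/η of Eq. (11) are properly normalized.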
and thus the probability of pixel x_i for each region:

  P(x_i|Φ, p, M_f1) = H_c H_f / η_f1,   P(x_i|Φ, p, M_f2) = H_c H̄_f / η_f2   (14)
  P(x_i|Φ, p, M_b1) = H̄_c H_b / η_b1,   P(x_i|Φ, p, M_b2) = H̄_c H̄_b / η_b2   (15)

Fusing the pixel-wise posteriors with a logarithmic opinion pool yields

  P(Φ, p|x, y, h)   (16)

  = Π_{i=1}^{N} [ Σ_{k∈{f1,f2,b1,b2}} P(x_i|Φ, p, M_k) (P(y_i|M_k) P(M_k) / Σ_l P(y_i|M_l) P(M_l)) P(M_k|h_i) ] P(Φ) P(p)   (17)

  = Π_{i=1}^{N} [ (H_c H_f/η_f1) (P(y_i|M_f1) P(M_f1) / Σ_l P(y_i|M_l) P(M_l)) P(M_f1|h_i)
      + (H_c H̄_f/η_f2) (P(y_i|M_f2) P(M_f2) / Σ_l P(y_i|M_l) P(M_l)) P(M_f2|h_i)   (18)
      + (H̄_c H_b/η_b1) (P(y_i|M_b1) P(M_b1) / Σ_l P(y_i|M_l) P(M_l)) P(M_b1|h_i)
      + (H̄_c H̄_b/η_b2) (P(y_i|M_b2) P(M_b2) / Σ_l P(y_i|M_l) P(M_l)) P(M_b2|h_i) ] P(Φ) P(p)   (19)

  = Π_{i=1}^{N} [ H_c H_f P_f1 + H_c H̄_f P_f2 + H̄_c H_b P_b1 + H̄_c H̄_b P_b2 ] P(Φ) P(p)   (20)

  = Π_{i=1}^{N} P(x_i|Φ, p, y_i, h_i) P(Φ) P(p)   (21)

where

  P_k = P(y_i|M_k) P(M_k) P(M_k|h_i) / (η_k Σ_l P(y_i|M_l) P(M_l)) = P(y_i|M_k) P(M_k|h_i) / Σ_l η_l P(y_i|M_l),   k ∈ {f1, f2, b1, b2}   (22)

  P(x_i|Φ, p, y_i, h_i) = H_c H_f P_f1 + H_c H̄_f P_f2 + H̄_c H_b P_b1 + H̄_c H̄_b P_b2.   (23)

P(y_i|M_k) is computed from the appearance models, i.e. color histograms. Eq. (22) contains P(M_k|h) for four regions, but the detector only provides probabilities for two regions: foreground and background. However, (23) selects a model for each region by use of H_c, H_f and H_b, respectively. We set

  P(M_f|h) = P(M_f1|h) + P(M_f2|h) = H_f P(M_f1|h) + H̄_f P(M_f2|h)   (24)
  P(M_b|h) = P(M_b1|h) + P(M_b2|h) = H_b P(M_b1|h) + H̄_b P(M_b2|h).   (25)

Thus it holds:

  either P(M_f|h) = P(M_f1|h) or P(M_f|h) = P(M_f2|h), and   (26)
  either P(M_b|h) = P(M_b1|h) or P(M_b|h) = P(M_b2|h).   (27)

This means in practice P(M_f1|h) and P(M_f2|h) are both set to P(M_f|h) (background accordingly), which is possible because for each pixel one of the subregions is selected. We now specify P(Φ) as the internal energy of the level set embedding function(s). It contains a geometric prior that rewards a signed distance function: the gradient magnitude of the level set function follows a normal distribution with mean 1.
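Once the histogram lookups P(y_i|M_k) and the detector terms P(M_k|h_i) are available, the fused weights P_k of Eq. (22) are a cheap per-pixel operation. A minimal sketch (hypothetical helper, ours):

```python
# Sketch (hypothetical helper, ours) of the fused per-pixel weights P_k of
# Eq. (22). p_y_given_m holds the histogram lookups P(y_i|M_k), p_m_given_h
# the detector terms P(M_k|h_i), eta the region sizes eta_k; all arrays are
# ordered (f1, f2, b1, b2).
import numpy as np

def pixel_posteriors(p_y_given_m, p_m_given_h, eta):
    """P_k = P(y_i|M_k) P(M_k|h_i) / sum_l eta_l P(y_i|M_l)   (Eq. 22)."""
    denom = np.sum(eta * p_y_given_m)
    return p_y_given_m * p_m_given_h / denom
```

Eq. (23) then blends the four resulting weights with the soft indicators H_c, H_f, H_b at each pixel.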
It thus makes the level set embedding function numerically stable without the need for periodic re-initializations [20]. The second term (as in [10]) describes the length of the contour and rewards a smoother contour. This is very useful for cluttered scenes, where pixels with foreground or background appearance can form very small regions, which can easily result in a very uneven contour.

  P(Φ) = Π_{i=1}^{N} [ (1/(σ√(2π))) exp(−(|∇Φ(x_i)| − 1)² / (2σ²)) exp(−λ |∇H_ε(Φ(x_i))|) ]   (28)
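A minimal numerical sketch of this prior (ours, assuming a unit-spacing pixel grid): the negative logarithm of Eq. (28) decomposes into a signed-distance penalty and a contour-length penalty, which is the form that enters the energy.

```python
# Sketch (ours, not the paper's code) of the two regularizers inside
# Eq. (28) on a unit-spacing grid: the signed-distance penalty
# (|grad Phi| - 1)^2 / (2 sigma^2) and the length term lambda |grad H_eps(Phi)|.
import numpy as np

def heaviside_eps(z, eps=6.0):
    """Smoothed Heaviside step function H_eps (Eq. 33)."""
    out = np.where(z > eps, 1.0, 0.0)
    band = np.abs(z) <= eps
    out[band] = (z[band] / (2 * eps)
                 + np.sin(np.pi * z[band] / eps) / (2 * np.pi) + 0.5)
    return out

def internal_energy(phi, sigma2=50.0, lam=2.0, eps=6.0):
    """-log P(Phi) of Eq. (28), up to the normalization constant."""
    gy, gx = np.gradient(phi)                        # grad Phi
    sdf_penalty = (np.hypot(gx, gy) - 1.0) ** 2 / (2.0 * sigma2)
    hy, hx = np.gradient(heaviside_eps(phi, eps))    # grad H_eps(Phi)
    length_penalty = lam * np.hypot(hx, hy)
    return float(np.sum(sdf_penalty + length_penalty))
```

For an exact signed distance function the first term vanishes and only the contour-length term remains, so the energy directly measures the (smoothed) length of the zero level set.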
where σ and λ are the weights of the priors. Maximizing the posterior is equivalent to minimizing its negative logarithm:

  E(Φ) = −log(P(Φ, p|x, y, h))
       = Σ_{i=1}^{N} [ −log(P(x_i|Φ, p, y_i, h_i)) + (|∇Φ(x_i)| − 1)² / (2σ²) + λ |∇H_ε(Φ(x_i))| ] + N log(σ√(2π)) − log(P(p))   (29)

A.1. Derivation of the Segmentation Framework

For segmentation we optimize (29) w.r.t. Φ, so the last two terms can be dropped; the rest is then differentiated by calculus of variations:

  ∂Φ_c/∂t = −∂E(Φ_c)/∂Φ_c = δ_ε(Φ_c) (H_f P_f1 + H̄_f P_f2 − H_b P_b1 − H̄_b P_b2) / P(x_i|Φ, p, y_i, h_i)
            + (1/σ²) [∇²Φ_c − div(∇Φ_c/|∇Φ_c|)] + λ δ_ε(Φ_c) div(∇Φ_c/|∇Φ_c|)   (30)

  ∂Φ_f/∂t = −∂E(Φ_f)/∂Φ_f = (H_c δ_ε(Φ_f) P_f1 − H_c δ_ε(Φ_f) P_f2) / P(x_i|Φ, p, y_i, h_i)
            + (1/σ²) [∇²Φ_f − div(∇Φ_f/|∇Φ_f|)] + λ δ_ε(Φ_f) div(∇Φ_f/|∇Φ_f|)   (31)

  (for x_i with H_c > 0)
          = δ_ε(Φ_f) (P_f1 − P_f2) / P(x_i|Φ, p, y_i, h_i)
            + (1/σ²) [∇²Φ_f − div(∇Φ_f/|∇Φ_f|)] + λ δ_ε(Φ_f) div(∇Φ_f/|∇Φ_f|)   (32)

and ∂Φ_b/∂t = −∂E(Φ_b)/∂Φ_b accordingly.

We evolve the two additional level set functions interleaved with the original level set function Φ_c. In this way, the four appearance models are optimized at the same time, which leads to more robust and accurate segmentation results. In our implementation we use σ² = 50, λ = 2, τ = 2, ε = 6 and

  H_ε(x) = { 0                                    if x < −ε
             x/(2ε) + (1/(2π)) sin(πx/ε) + 1/2    if |x| ≤ ε   (33)
             1                                    if x > ε }

  δ_ε(x) = { (1/(2ε)) (1 + cos(πx/ε))   if |x| ≤ ε
             0                          else }   (34)

A.2. Derivation of the Tracking Framework

In preparation for differentiation w.r.t. p, some terms in (29) can be dropped:

  E(Φ) = −Σ_{i=1}^{N} log(P(x_i|Φ, p, y_i, h_i)) − log(P(p)) + const.   (35)

Now the warp W(x_i, p) is introduced into (35), i.e. pixels x_i are warped with parameters p. P(p) is dropped for the moment; this is handled with drift correction, as in [3]:

  E(Φ) = −Σ_{i=1}^{N} log P(W(x_i, p)|Φ, p, y_i, h_i)   (36)

We maximize w.r.t. p:

  p = argmax_p Σ_{i=1}^{N} log P(W(x_i, p)|Φ, p, y_i, h_i)   (37)
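To make the warped objective of Eqs. (36)–(37) concrete, here is a small sketch (ours; numerical values illustrative only) of the scale-and-translation warp W(x; p) and its 2×3 Jacobian ∂W/∂p, which the Newton updates below rely on:

```python
# Sketch (ours) of the scale-and-translation warp W(x; p) of Eq. (42) and
# its 2x3 Jacobian dW/dp of Eq. (43), with parameters p = (s, t_x, t_y).
import numpy as np

def warp(x, p):
    """W(x; p) = ((1+s) x + t_x, (1+s) y + t_y)."""
    s, tx, ty = p
    return np.array([(1 + s) * x[0] + tx, (1 + s) * x[1] + ty])

def warp_jacobian(x):
    """dW/dp at pixel x = (x, y); columns correspond to s, t_x, t_y."""
    return np.array([[x[0], 1.0, 0.0],
                     [x[1], 0.0, 1.0]])
```

In the optimization, this Jacobian enters via J_c and J_f, which carry the factor δ_ε, so only pixels in the narrow band around the contours contribute to the update.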
We use a second order Newton optimization scheme as in [4]. With the short-hand notation P(...) = P(W(x_i, p)|Φ, p, y_i, h_i):

  Δp = [ Σ_{i=1}^{N} (∂P(...)/∂p)ᵀ (∂P(...)/∂p) / P(...)² ]⁻¹ Σ_{i=1}^{N} (∂P(...)/∂p)ᵀ / P(...)   (38)

where

  ∂P(...)/∂p = (J_c H_f + H_c J_f) P_f1 + (J_c H̄_f − H_c J_f) P_f2 − J_c H_b P_b1 − J_c H̄_b P_b2
             = J_c (H_f P_f1 + H̄_f P_f2 − H_b P_b1 − H̄_b P_b2) + J_f (H_c P_f1 − H_c P_f2)   (39)

with

  J_c = ∂H_c/∂p = (∂H_ε(Φ_c)/∂Φ_c) (∂Φ_c/∂x) (∂W/∂p) = δ_ε(Φ_c(x_i)) ∇Φ_c(x_i) (∂W/∂p)   (40)

  J_f = ∂H_f/∂p = (∂H_ε(Φ_f)/∂Φ_f) (∂Φ_f/∂x) (∂W/∂p) = δ_ε(Φ_f(x_i)) ∇Φ_f(x_i) (∂W/∂p)   (41)

We assume that the background does not move with the foreground; thus H_ε(Φ_b(x_i)) is constant w.r.t. the derivation. Eq. (39) illustrates that both the person's contour and the division line of the foreground contribute to the warp, whereas the division line of the background does not contribute: J_c and J_f contain the factor δ_ε(Φ_k), the Dirac delta function of the respective level set function, which is only greater than zero in a narrow band (with width ε) around the contour.

For parameters p = (s, t_x, t_y)ᵀ with scale and translation the warp is:

  W(x; p) = [1+s  0; 0  1+s] [x; y] + [t_x; t_y] = [(1+s) x + t_x;  (1+s) y + t_y]   (42)

The term ∂W/∂p is the Jacobian of the warp; with W(x; p) = (W_x(x; p), W_y(x; p))ᵀ:

  ∂W/∂p = [x  1  0;  y  0  1]   (43)

In our implementation the rigid registration typically converges after 10 to 40 iterations.

References

[3] C. Bibby and I. Reid. Robust Real-Time Visual Tracking using Pixel-Wise Posteriors. ECCV, 2008.
[4] C. Bibby and I. Reid. Real-time Tracking of Multiple Occluding Objects using Level Sets. CVPR, 2010.
[10] D. Cremers, M. Rousson, and R. Deriche. A Review of Statistical Approaches to Level Set Segmentation: Integrating Color, Texture, Motion and Shape. IJCV, 72(2):195-215, 2007.
[20] C. Li, C. Xu, C. Gui, and M. Fox. Level Set Evolution without Re-initialization: A New Variational Formulation. CVPR, 2005.
[26] H. Sidenbladh and M. J. Black. Learning the statistics of people in images and video. IJCV, 54(1-3):183-209, 2003.
[27] E. Horbert, K. Rematas, and B. Leibe. Level-Set Person Segmentation and Tracking with Multi-Region Appearance Models and Top-Down Shape Information. ICCV, 2011.