Tree-based Classifiers for Bilayer Video Segmentation

Pei Yin (1), Antonio Criminisi (2), John Winn (2), Irfan Essa (1)
(1) School of Interactive Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA
(2) Microsoft Research Ltd., Cambridge, CB3 0FB, United Kingdom

Abstract

This paper presents an algorithm for the automatic segmentation of monocular videos into foreground and background layers. Correct segmentations are produced even in the presence of large background motion with nearly stationary foreground. There are three key contributions. The first is the introduction of a novel motion representation, motons, inspired by research in object recognition. Second, we propose learning the segmentation likelihood from the spatial context of motion. The learning is efficiently performed by Random Forests. The third contribution is a general taxonomy of tree-based classifiers, which facilitates theoretical and experimental comparisons of several known classification algorithms, as well as spawning new ones. Diverse visual cues such as motion, motion context, colour, contrast and spatial priors are fused together by means of a Conditional Random Field (CRF) model. Segmentation is then achieved by binary min-cut. Our algorithm requires no initialization. Experiments on many video-chat type sequences demonstrate the effectiveness of our algorithm in a variety of scenes. The segmentation results are comparable to those obtained by stereo systems.

1. Introduction and related work

This paper addresses the problem of extracting a foreground layer from monocular video. Applications for the proposed technology include background substitution, compression, adaptive bit-rate video transmission and tracking. These applications require: (i) robust segmentation that survives strong distracting events such as people moving in the background, camera shake or illumination change; (ii) efficient separation to attain live streaming speed. This paper focuses on the common scenario of video chat. Recent research in this area has produced compelling, real-time algorithms, either based on stereo [8] or motion [5]. Other algorithms require initialization in the form of a clean image of the background [20]. Stereo-based segmentation [8] seems to achieve the most robust results.
In fact, background objects are correctly separated from foreground, independently from their motion/stasis characteristics. This paper aims at achieving a similar behaviour monocularly (cf. fig. 1).

Figure 1. Achieving robust layer extraction monocularly. The panels show original frames from the IU and JM sequences from [8], respectively; their temporal derivatives (dark indicates large values); and the segmentation obtained by the proposed algorithm. The foreground person is nearly stationary while the background one is moving. In this case, background subtraction techniques or conventional motion-based algorithms would tend to classify the background person erroneously as foreground. Furthermore, inaccurate classification may be produced in the textureless and motionless areas of the foreground. With the proposed algorithm, correct foreground/background separation has been achieved (the extracted foreground is shown in original colours).

The static background assumption of some monocular systems [20] makes segmentation prone to camera shake (e.g. for a webcam mounted on a laptop screen), changes in illumination and large objects moving in the background. The algorithm in [5] avoids the need for a clean background image. However, the segmentation still suffers in the presence of large background motion. Also, initialization is sometimes necessary in the form of global colour models. The work in [24] has started an important line of research in using geometric models (e.g. planar motion) for the segmentation of optical flow fields. However, in videochat data the foreground motion cannot be described well by such rigid models. Furthermore, we wish to avoid the complexities associated with optical flow computation.

The algorithm proposed in this paper exploits motion and its spatial context as a powerful cue for layer separation. The correct level of geometric rigidity is automatically learnt from training data. Our algorithm benefits from novel motion features, called motons. Motons (related to textons) were inspired by recent research in motion modeling [5], and object and material recognition [13, 15, 19, 22, 25]. Motons are then combined with shape-filters [19] to model long-range spatial correlations (shape). These new features prove useful at capturing visual context and filling-in missing, textureless or motionless regions. Fused motion-shape cues are discriminatively selected by supervised learning. Key to our technique is a classifier trained on depth-defined layer labels like those used in stereo [8], as opposed to motion-defined layer labels like in [5] (compare fig. 2b and fig. 2c). Thus, the classifier is forced to combine other available cues accordingly to induce depth in the absence of stereo, while maintaining generalization.

Figure 2. Ground-truth layer labelling in video frames. (a) A frame from a monocular training sequence. In this video both the closer and the farther persons are moving. (b) Depth-based layer labelling (white for foreground, black for background and gray for uncertain), as used in [8]. Here only the closest person is labelled as foreground. (c) Motion-based layer labelling, as used in [5]. Both moving objects are labelled as foreground. In this paper we only use depth-based labelling. This encourages our monocular system to learn to imitate stereo.

A general taxonomy of classifiers is described which enables us to interpret common algorithms such as AdaBoost, Decision Trees, Random Forests and Cascade Boosting as variants of a single tree-based classifier. In turn, this allows us to compare fairly the different algorithms and select the most efficient or accurate for the application at hand.

Pixel-wise segmentation is finally achieved by fusing motion-shape, colour and contrast with a local smoothness prior in a Conditional Random Field model [10, 11]. The final binary segmentation is achieved through min-cut [3]. The result is a segmentation algorithm which is efficient, robust to distracting events and requires no initialization.

2. Motons and shape filters

This section describes the motion-shape features used in our segmentation algorithm.
We build upon the motion-vs-stasis likelihood model of [5], and combine it with concepts borrowed from recent literature in object recognition [19]. This leads to a powerful set of features which simultaneously capture motion and its long-range spatial context.

Notation. Given an input sequence of images, a frame is represented as an array $z = (z_1, z_2, \dots, z_n, \dots, z_N)$ of pixels in YUV color space, indexed by the pixel position $n$. A frame at time $t$ is denoted $z^t$. Temporal derivatives are denoted $\dot{z} = (\dot{z}_1, \dot{z}_2, \dots, \dot{z}_n, \dots, \dot{z}_N)$ and, at each time $t$, are computed as $\dot{z}_n^t = |G(z_n^t) - G(z_n^{t-1})|$, with $G(\cdot)$ a Gaussian kernel at the scale of $\sigma_t$ pixels. Spatial gradients $g = (g_1, g_2, \dots, g_n, \dots, g_N)$, where $g_n = |\nabla z_n|$, are computed by convolving the images with DoG kernels of width $\sigma_s$. Here we use $\sigma_s = \sigma_t = 0.8$, approximating a Nyquist sampling filter. Spatio-temporal derivatives are computed on the Y channel only. Motion observables are $O_m = (g, \dot{z})$. The segmentation task is to infer the binary label $x_n \in \{Fg, Bg\}$ from observed data.

Figure 3. Motons. Training spatio-temporal derivatives clustered into 10 clusters (motons). Different colours for different clusters.

Motons. Here we follow a procedure similar to that for constructing textons [13]. First, motion features $O_m$ are computed for all training pixels. Those 2D vectors are then clustered into $M$ clusters via Expectation Maximization. The $M$ resulting cluster centres are called motons. An example with $M = 10$ motons is shown in fig. 3. This operation may be interpreted as building a vocabulary of motion-based visual words. Our visual words capture information about motion and edgeness of image pixels, rather than their texture content as in textons. Clustering (i) enables efficient indexing of the joint $(g, \dot{z})$ space while maintaining a useful correlation between $g$ and $\dot{z}$, and (ii) reduces sensitivity to noise. A dictionary size of just 10 motons has proven sufficient. Also, our moton representation is shown to yield fewer segmentation errors than using $O_m$ directly. In [5], it is observed that strong edges with low temporal derivatives usually correspond to background regions, while strong edges with high temporal derivatives are likely to be foreground.
Textureless regions tend to have their log likelihood ratio close to zero due to uncertainty. Those stasis/motion discrimination properties are retained by our quantized representation, which is not yet sufficient for separating moving background from moving foreground. Given a dictionary of motons, each pixel in a new image can be assigned to its closest moton by Maximum Likelihood (ML). Therefore, each pixel can now be replaced by an index into our small visual dictionary [25]. An example of the resulting moton map is shown colour coded in fig. 4b.
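To make the quantization step concrete, the following Python sketch assigns each pixel's motion observable (g, ż) to a moton index. It is a simplification under stated assumptions: it uses Euclidean nearest-centre assignment in place of the Maximum Likelihood assignment under the EM-fitted clusters, and the function name `assign_motons`, the four example centres and their semantic names are hypothetical, chosen only for illustration.

```python
def assign_motons(observations, motons):
    """Map each (g, z_dot) observation to the index of its nearest moton.

    Assumption: Euclidean nearest-centre assignment stands in for the
    Maximum Likelihood assignment under an EM-fitted mixture.
    """
    labels = []
    for g, z_dot in observations:
        # Squared distance from this observation to every moton centre.
        dists = [(g - mg) ** 2 + (z_dot - mz) ** 2 for mg, mz in motons]
        labels.append(dists.index(min(dists)))
    return labels


# Hypothetical 4-moton dictionary in (spatial gradient, temporal derivative):
# 0 = stationary weak texture, 1 = stationary edge,
# 2 = moving edge,             3 = moving weak texture.
EXAMPLE_MOTONS = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
```

For instance, `assign_motons([(0.9, 0.1)], EXAMPLE_MOTONS)` maps a strong, nearly static edge to index 1 of this illustrative dictionary. A full implementation would replace the distance with the per-cluster likelihood.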

Then, a moton map can be decomposed in its $M$ component bands, namely moton bands. Thus we have $M$ moton bands $I_k$, $k = 1, \dots, M$ for each video frame $z$. Each $I_k$ is a binary image, with $I_k(n)$ indicating whether the $n$-th pixel has been assigned the $k$-th moton or not. Fig. 4c-f shows some example moton bands.

Figure 4. Moton maps and moton bands. (a) Original frame from the IU sequence. (b) Corresponding moton map with M = 10 motons. Same colour corresponds to same moton. (c) A moton band showing all the pixels associated to a moving-edge moton. (d) Pixels associated to a moving, weak-texture moton. (e) Pixels associated to a stationary-edge moton. (f) Pixels associated to a stationary, weak-texture moton.

Shape filters. In video-chat type sequences the foreground object (usually a person) moves non-rigidly and yet in a structured fashion. This section shows how to capture the spatial context of motion adaptively. Similar to [19], a shape filter is defined as a moton-rectangle pair $(k, r)$, with $k$ indexing in the dictionary of motons and $r$ indexing a rectangular mask whose four corners are chosen within a detection window (bounding box) about the size of the video frame (fig. 5). A whole set of shape filters $S = \{(k_i, r_i)\}$, $i = 1, 2, \dots$, is then defined by randomly selecting moton/rectangle pairs (see details later).

For each pixel position $n$, the associated feature $\psi_n$ is computed as follows. Given the moton $k$ we center the detection window at $n$ and count the number of pixels in $I_k$ which fall in the offset rectangle mask $r$. This count is denoted $v_n(k, r)$ (see fig. 5). The feature value $\psi_n(i, j)$ is simply obtained by subtracting moton counts collected for the two shape filters $(k_i, r_i)$ and $(k_j, r_j)$, i.e. $\psi_n(i, j) = v_n(k_i, r_i) - v_n(k_j, r_j)$. The $i$ and $j$ indices are selected randomly. The feature $\psi_n$ can be computed efficiently by integral image processing [23] for every moton band $I_k$.

Figure 5. Shape filters applied to moton bands. (a,b) Two moton bands with rectangular masks $r_1$ and $r_2$ (centred in the same pixel, $n$) superimposed. Given $n$ and the shape filter $(k, r)$, $v_n(k, r)$ counts the number of pixels associated to $k$ within $r$; see text.

Our shape filters may be interpreted as a generalization of the features used in [23]. Next, our features are discriminatively selected and combined by a classifier for the estimation of our Fg/Bg unary potentials (see Section 5). The following section presents a taxonomy of tree-based classifiers and shows how common classifiers may be interpreted as instances of the same general algorithm. Such a taxonomy then helps us to select the classifier that performs best in our application.

3. The Tree Cube taxonomy

Common classification algorithms such as Decision Trees [16], Boosting [7] and Random Forests [1, 4] share the fact that they build strong classifiers from a combination of weak classifiers (learners), often just decision stumps. The main difference between these algorithms is the way the weak learners are combined. This section presents a useful framework for constructing strong classifiers by combining weak classifiers in different ways. The three most common ways of combining weak classifiers are: i) hierarchically (H), ii) by averaging (A) or iii) via boosting (B). In Fig. 6 the origin represents the weak learner (e.g. a decision stump), and the axes H, A, B represent those three basic combining moves.

Figure 6. The tree cube taxonomy of classifiers captures many classification algorithms in a single structure. See text for details.

The H move hierarchically combines weak classifiers into decision trees. During training a new weak classifier is iteratively created and attached to a leaf node where needed based on information gain. It can be shown that the H move reduces classification bias [2].

The B move, instead, linearly combines weak classifiers. After the insertion of each weak classifier, the training data is reweighted/resampled [16]. Classification of one instance involves evaluating all the tests in the strong classifier. The B move includes AdaBoost and Gentle Boost. Boosting reduces the empirical error bound by perturbing the training data [7].

The A move creates strong classifiers by averaging the results of many weak classifiers. Note that the weak classifiers added by the A move face the same problem, while those sequentially added by the H and B moves face different problems/distribution. Thus, the main computational advantage is that each weak classifier can be trained independently from each other and in parallel. The A move gives rise to Random Forests when the weak classifier is a random tree. The averaging move reduces classification variance [2].

Paths along the edges of the cube in fig. 6 correspond to different combinations of weak classifiers and thus different strong classifiers. Restricting each of the three basic moves to be used only once produces three order-1 algorithms (excluding the base learner itself), six order-2 and six order-3 algorithms, as listed in table 1. Many known algorithms are conveniently mapped into paths through the tree-cube. For example: Boosting (B), Decision Trees (H), Booster of Trees (HB) and Random Forests (HA). Also, note that the widely used Attention Cascade [23] can be interpreted as a one-sided tree of boosters.

Table 1. Tree Cube classifiers. Fifteen of the many possible classification algorithms encoded in the taxonomy of fig. 6.

Path   Classification algorithm
A      voting by committee [2]
B      booster of stumps
H      decision tree
HA     forest of trees (decision forest)
HB     booster of trees
AH     tree of forests (of stumps)
AB     booster of forests (of stumps)
BA     committee of boosters
BH     tree of boosters
HAB    booster of forests of trees
HBA    committee of boosters of trees
BAH    tree of committee of boosters (of stumps)
BHA    committee of trees of boosters
ABH    tree of boosters of forests (of stumps)
AHB    booster of trees of forests (of stumps)
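The path counts quoted for the tree cube (three order-1, six order-2 and six order-3 algorithms) follow from treating each path as an ordered selection of distinct moves from {H, A, B}. The short Python sketch below enumerates them; the helper name `tree_cube_paths` is an illustrative assumption, not something defined in the paper.

```python
from itertools import permutations

MOVES = "HAB"  # hierarchical, averaging, boosting


def tree_cube_paths(moves=MOVES):
    """List every path that uses each basic move at most once.

    A path of order k is an ordered arrangement of k distinct moves, so the
    paths of each order are the k-permutations of the move set.
    """
    paths = []
    for order in range(1, len(moves) + 1):
        paths.extend("".join(p) for p in permutations(moves, order))
    return paths


def count_by_order(paths):
    """Return a dict mapping path order (length) to the number of paths."""
    counts = {}
    for path in paths:
        counts[len(path)] = counts.get(len(path), 0) + 1
    return counts
```

Calling `count_by_order(tree_cube_paths())` gives three, six and six paths for orders 1, 2 and 3, fifteen in total, matching the number of rows in table 1.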
The tree-cube taxonomy also enables us to explore new algorithms (e.g. HAB) and compare them to other algorithms of the same order (e.g. BHA). Next, we explore which classifier performs best for the segmentation of video-chat sequences. Following the tree-cube strategy we compare two popular second-order models: Random Forests of trees (RF) and Booster of Trees (BT). As a sanity check we also compute the results of a conventional booster of stumps (GB).

4. Random Forests vs Booster of Trees

The base weak classifier used in this paper is the widely used decision stump. A decision stump applied to the $n$-th pixel takes the form $h(n) = a\,\delta(\psi_n(i, j) > \theta) + b$, where $\delta(\cdot)$ is a 0-1 indicator function and $\psi_n(i, j)$ is the shape filter response for the $i$-th and $j$-th shape filters (as described in section 2). Positive values of the real-valued output $h(n)$ indicate that pixel $n$ belongs to foreground and vice-versa. Now we look at different ways of combining stumps into strong classifiers. We begin with the H move.

Decision tree. When training a tree, at each iteration a new decision stump is fitted by finding the $\theta$, $a$ and $b$ values which yield either the least square error [19] or the maximum entropy gain, as described later. During testing, the output $F(n)$ of a tree classifier is the output of the leaf node. Next we analyze the details of the B combination move.

Gentle Boost. Out of the many versions of boosting, here we focus on the Gentle Boost algorithm [7] because of its robustness properties [14, 21]. For the $n$-th pixel, a strong classifier $F(n)$ is constructed as a linear combination of stumps $F(n) = \sum_{l=1}^{L} h_l(n)$. For completeness the full algorithm is summarized in table 2.

Table 2. Training Gentle Boost.

Initialize the weights of the $N$ training points, $w_n = 1/N$, $n = 1, 2, \dots, N$, and initialize the strong classifier $F(n) = 0$.
Repeat for $l = 1, 2, \dots, L$:
  1. Fit the regression function $h_l(\cdot)$ by minimizing $\sum_{n=1}^{N} w_n (h_l(n) - y_n)^2$, with $y_n \in \{-1, +1\}$ the ground-truth label of pixel $n$.
  2. Update the strong classifier: $F(n) \leftarrow F(n) + h_l(n)$.
  3. Update the training weights: $w_n \leftarrow w_n e^{-y_n h_l(n)}$, and re-normalize so that $\sum_{n=1}^{N} w_n = 1$.
The strong classifier is $F(n) = \sum_{l=1}^{L} h_l(n)$.
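The training loop of table 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: features are plain lists of numbers standing in for shape filter responses, every feature and every midpoint threshold is searched exhaustively (no random feature subsampling), and the function names `fit_stump`, `stump_output`, `gentle_boost` and `strong_classifier` are hypothetical.

```python
import math


def fit_stump(X, y, w):
    """Weighted least-squares fit of h(x) = a * [x[f] > theta] + b.

    X: list of feature vectors; y: labels in {-1, +1}; w: positive weights
    summing to 1. Returns the tuple (f, theta, a, b).
    """
    # Fallback: constant stump (theta = +inf), used if no split is better.
    mean_y = sum(wi * yi for wi, yi in zip(w, y))
    best_err = sum(wi * (mean_y - yi) ** 2 for wi, yi in zip(w, y))
    best = (0, math.inf, 0.0, mean_y)
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            theta = 0.5 * (lo + hi)  # midpoint between distinct values
            above = [row[f] > theta for row in X]
            w_hi = sum(wi for wi, up in zip(w, above) if up)
            w_lo = sum(wi for wi, up in zip(w, above) if not up)
            if w_hi == 0 or w_lo == 0:
                continue  # degenerate split, nothing to fit on one side
            # Least-squares optimum: weighted mean label on each side.
            m_hi = sum(wi * yi for wi, yi, up in zip(w, y, above) if up) / w_hi
            m_lo = sum(wi * yi for wi, yi, up in zip(w, y, above) if not up) / w_lo
            err = sum(wi * ((m_hi if up else m_lo) - yi) ** 2
                      for wi, yi, up in zip(w, y, above))
            if err < best_err:
                best_err, best = err, (f, theta, m_hi - m_lo, m_lo)
    return best


def stump_output(stump, x):
    """Evaluate h(x) = a * [x[f] > theta] + b."""
    f, theta, a, b = stump
    return a * (1.0 if x[f] > theta else 0.0) + b


def gentle_boost(X, y, rounds):
    """Train `rounds` stumps following the three steps of table 2."""
    n = len(X)
    w = [1.0 / n] * n
    stumps = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)                     # step 1
        stumps.append(stump)                           # step 2 (F is the sum)
        w = [wi * math.exp(-yi * stump_output(stump, xi))
             for wi, yi, xi in zip(w, y, X)]           # step 3
        total = sum(w)
        w = [wi / total for wi in w]                   # re-normalize
    return stumps


def strong_classifier(stumps, x):
    """F(x): the sum of all stump outputs; positive means foreground."""
    return sum(stump_output(s, x) for s in stumps)
```

With a toy set such as `X = [[0.1, 5.0], [0.2, 1.0], [0.8, 4.0], [0.9, 2.0]]` and `y = [-1, -1, 1, 1]`, the first feature separates the classes, so `strong_classifier` is positive for points whose first feature is large and negative otherwise.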
Here Gentle Boost is applied both to stumps (B in fig. 6) and decision trees (HB in fig. 6). We abbreviate the first algorithm as GB and the second as BT. We also combine the stumps into Random Forests (the HA path in fig. 6).

Random Forests. A forest is made of many trees, and its output $F(n)$ is the average of the output of all trees (the A move). A Random Forest (denoted RF) is an ensemble of decision trees trained with random features. In this case, each tree is trained by adding new stumps in the leaf nodes where maximum information gain can be achieved. However, unlike boosting, the training data is not reweighted for different trees. RF has been applied to recognition problems in vision, e.g. OCR [1] and keypoint recognition [12].

Randomization. GB, BT and RF are trained effectively by optimizing each stump only on a few (1000 in our implementation) randomly selected shape filter features. This reduces the statistical dependence between weak learners [1], and it provides increased efficiency without significantly affecting the accuracy [19, 6]. In all three algorithms the classification confidence is computed by the softmax transformation [7, 19]

$P(x_n = Fg \mid O_m) = \frac{\exp(F(n))}{1 + \exp(F(n))}$.

Next we describe how those motion-shape based classifiers are combined with colour, contrast and spatial smoothness to obtain binary segmentation.

5. Layer segmentation

Segmentation is cast as an energy minimization problem where the energy to be minimized is similar to the one in [8], with the only difference that the unary potential of stereo match $U^{M}$ is replaced by our motion-shape unary

$U^{MS}(O_m, x; \Theta) = -\sum_{n=1}^{N} \log\big(P(x_n \mid O_m)\big)$.   (1)

The CRF energy is as follows:

$E(O_m, z, x; \Theta) = \gamma_{MS}\, U^{MS}(O_m, x; \Theta) + \gamma_{C}\, U^{C}(z, x; \Theta) + V(z, x; \Theta)$.   (2)

Similar to [8], $U^{C}$ is the colour potential (a combination of global and pixel-wise contributions) and $V$ is the widely used contrast-sensitive spatial smoothness term. Model parameters are incorporated in $\Theta$, and the relative weights $\gamma_{MS}$ and $\gamma_{C}$ are optimized discriminatively from training data. The final segmentation is inferred by binary min-cut. No complex temporal model [5] is used here. Finally, background edge abating [20] could also be exploited here if a pixel-wise background model were learned on the fly.

6. Experimental Results

Our new motion-shape likelihood, eq. (1), is validated in section 6.1, while the segmentation accuracy achieved by the complete CRF model, eq. (2), is assessed in section 6.2. We have collected a database of 28 monocular sequences (see footnote 1) which we have then pixel-wise labelled every fifth or tenth frame into foreground, background and uncertain (in the difficult, mixed-pixel regions), according to their distance from the camera (fig. 2b). In our experiments, 46 labelled frames from 7 clips are chosen randomly for training and 2 clips for validation. All the 426 labelled frames of the remaining 19 clips are used for testing.

6.1. Comparing unary classifiers

GB and BT were trained by minimizing the empirical loss as required by boosting, while RF were trained by maximizing the information gain as required by C4.5 [16]. All three algorithms share the same set of motons.
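The information-gain criterion used to grow the forest's trees can be illustrated with a small Python sketch. It is a generic entropy-based split selection on a single scalar feature, offered as an assumption about what "maximizing the information gain" amounts to here; it does not reproduce C4.5 or the authors' code, and the names `entropy`, `information_gain` and `best_threshold` are hypothetical.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h


def information_gain(values, labels, theta):
    """Entropy reduction obtained by splitting on `value > theta`."""
    left = [lab for v, lab in zip(values, labels) if v <= theta]
    right = [lab for v, lab in zip(values, labels) if v > theta]
    n = len(labels)
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))


def best_threshold(values, labels):
    """Pick the midpoint threshold with the largest information gain.

    Requires at least two distinct feature values.
    """
    distinct = sorted(set(values))
    candidates = [0.5 * (lo + hi) for lo, hi in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda t: information_gain(values, labels, t))
```

For feature responses `[1, 2, 3, 4]` labelled `["Bg", "Bg", "Fg", "Fg"]`, the threshold 2.5 separates the two classes perfectly and recovers the full one bit of label entropy. In a tree, the same selection would be repeated at each leaf over a random subset of shape filter features.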
Their testing errors are then measured and compared with one another.

Footnote 1. http://research.microsoft.com/vision/cambridge/i2i/DSWeb.htm. For the six sequences that are captured in a stereo setting, only the left-camera videos are used here, and only for testing.

Figure 7. Comparing accuracy of classifiers. Testing unary accuracy with respect to the complexity of the ensembles in one trial. Five trials have been run and RF has consistently outperformed both the GB and BT algorithms.

Table 3. Comparison with state of the art. Comparison between the proposed classifiers and existing stereo and monocular unaries.

Setting     Unary                        Error
Stereo      Stereo [8]                   5.55%
Monocular   Motion [5]                   23.66%
Monocular   Learned motion-shape, RF     9.93%
Monocular   Learned motion-shape, GB     11.76%
Monocular   Learned motion-shape, BT     11.42%

The ensemble sizes for GB, BT and RF are set to 195 stumps, 19 trees and 47 trees respectively to maximize the accuracy on the validation set. The trees in BT and RF have 50 nodes, which is optimal for BT on the validation set. Next, we evaluate the unary classification accuracy with the 426 testing frames. Pixel classification into foreground and background is carried out by thresholding at $F(n) = 0$ (see footnote 2). The error rate is averaged over $426 \times 160 \times 120 = 8.18 \times 10^{6}$ pixels, and the processing rate (frames per second) is measured with our non-optimized Matlab code.

Accuracy of unary potentials. Fig. 7 compares the classification accuracy of GB, BT and RF when varying the number of base learners. Assuming balanced trees, evaluating one binary decision tree with 50 nodes roughly equals evaluating $\log_2 50 \approx 6$ stumps (depending on the balance of the tree). Thus we have scaled down the curve of GB along the x-axis by a factor of 6, so that the expected number of stumps evaluated is the same for GB, BT and RF. Random Forests consistently yield the lowest testing errors. From the GB curve in fig. 7 we can also see that there are not many pixels which can be correctly classified with few stumps; therefore, we would not expect a cascade to give a significant speedup in our application.

Table 3 compares the accuracy of our motion-shape potentials $U^{MS}$ with the stereo potentials of [8] (see footnote 3) and the motion potentials of the monocular system in [5]. Our motion-shape unary potentials lead to accuracy comparable to those of stereo and superior to the motion-based ones.

Footnote 2. The final segmentation results are significantly improved by integrating color, contrast, and spatial priors using the CRF model, shown next.

Footnote 3. Here, the accuracy of the stereo likelihood has been improved with respect to [8] by setting the log likelihood ratio to zero in low texture areas (uncertain for stereo). The pixel-wise stereo error rate would increase to 17.51% without such postprocessing. This further illustrates the importance of shape information in our bilayer segmentation application.

Accuracy vs efficiency. Table 4 compares the three classifiers both in terms of accuracy and speed. The first three columns report the lowest achieved classification error and the corresponding frame rate for each of the three algorithms at their optimal parameter setting according to validation. Having confirmed that RF produces the lowest errors, we then evaluate the speed of RF when it is forced to produce the same error level as GB and BT (4th column). In the last two columns, the ensemble of RF is forced to run at the same speed as GB or BT, and we observe its classification error. In all cases RF outperforms the other two classifiers.

6.2. Assessing segmentation accuracy

This section analyzes segmentation results obtained by the full CRF model with $U^{MS}$ estimated by RF.

Qualitative evaluation. Figure 8 compares the segmentation of the IU test sequence obtained by our algorithm with those in [5, 8]. In this sequence, two people walk behind the foreground person. Varying scene illumination constitutes a further source of difficulty. The motion based method in [5] classifies the background people as foreground (as it is designed to do). The proposed algorithm, instead, produces a segmentation similar to that of the stereo system in [8], where background motion is effectively ignored. Fig. 1, 9, 10, 11 provide more segmentation results.

Figure 8. Segmentation results on the IU test sequence. (a) Original, (b) stereo-based segmentation from [8], (c) monocular segmentation from [5], where background motion pops into foreground, (d) monocular segmentation from the proposed algorithm.

Figure 10. More segmentation results. Original frames from test sequence 56, where the picture on the TV set changes, with the corresponding segmentations, followed by more segmentation results on test sequences 43 and 50.

Figure 11. Background substitution on the IU test sequence from [8]. Original and corresponding segmented frames with background substitution. People moving behind the foreground person are correctly classified as background.

Quantitative evaluation. Fig. 12 shows segmentation errors with respect to ground truth for four of our test sequences. The median error is around or below 1%. Median errors for 10 test sequences are also reported in table 5.

Table 5. Segmentation errors for ten of the test sequences.

Test sequence    41     43     50     51     54     56     58     60     IU [8]   JM [8]
Seg. err. (%)    0.80   0.02   1.31   1.06   0.33   0.93   0.79   6.33   2.56     0.27

Automatic initialization. Fig. 13 illustrates how the system initializes itself. At the beginning the subject is stationary and the segmentation inaccurate. However, a small motion of the head is sufficient to achieve the correct segmentation (Fig. 13). This burn-in effect may also be observed for a different test sequence in the error plot in fig. 12. The plot for the JM sequence in fig. 12a demonstrates how our algorithm can recover automatically from possible mistakes. In fact, in frames 60-85 of the JM sequence the subject leans very close to the camera and so the image looks very different from the training frames, and segmentation errors occur. The segmentation recovers promptly following this error.

Inaccurate segmentations. Fig. 14 shows examples of inaccurate segmentation. Under harsh lighting conditions unary potentials may not be very strong and thus the Ising smoothness term may force the segmentation to cut through shoulders and hair regions. Similar effects may be noticed in stationary frames. Noise in temporal derivatives also affects the results. This situation can be detected by monitoring the magnitude of motion, and enabling a temporal model such as that in [5] may help reduce the problem.

Border matting [17] could be applied here to improve the hair regions. This paper is concerned with binary segmentation only.

Table 4. Comparing testing accuracy and efficiency for GB, BT and RF in one of five testing trials. See text.

Algorithm            GB (best)   BT (best)   RF (best)   RF (same err. as GB and BT)   RF (same speed as GB)   RF (same speed as BT)
Classif. error (%)   11.76       11.42       9.93        11.23                         9.93                    10.11
Speed (fps)          1.2         2.9         1.2         7.7                           1.2                     2.9

Figure 9. Segmentation results. (a) An original frame and four segmented frames for test sequence 41. (b) An original frame and four segmented frames for test sequence 54.

Figure 12. Accuracy of segmentation. (a) Percentage of misclassified pixels on the JM test sequence from [8]. Note how the system promptly recovers from possible mistakes. The median error (horizontal line) is well below 0.5%. (b,c,d) Percentage of misclassified pixels for the test sequences 41, 54, 51, respectively. After an initial burn-in period the segmentation converges to a good accuracy level (around or below 1% misclassified pixels).

Figure 13. Self-initialization. (a) Original frame from test sequence 58. Harsh lighting conditions make the segmentation problem challenging. The remaining panels show the segmentation for different frames in chronological order. After an initial burn-in period the algorithm converges to the correct Fg/Bg separation. A clean background image or other types of initialization were not necessary.

Figure 14. Some inaccurate segmentation. (a) A frame from test sequence 41. (b) A frame from test sequence 60. The remaining panels show the corresponding segmentations. Bad scene illumination produces inaccurate segmentation in the hair region of one example and in the shoulder region of the other. See also the errors in table 5.

6.3. Discussion on bias and variance

For a classification algorithm, bias describes its modelling power while variance describes its stability [9]. Note that bias and variance are different from the mean error and the variance of error. The bias and variance decomposition of the RF, GB and BT algorithms (in table 6) helps us to understand their behaviour, and the nature of our task. This decomposition is computed from five trials on the testing set, and the results are discussed below.

(1) BT and RF yield lower bias than GB.
The linear combination property of the B and A moves requires that the decision boundary of the classification task is additive in terms of the decision boundary of the weak learners, i.e. the capacity of the base learning algorithm matches the complexity of the problem [7]. By moving along the H direction, bigger trees are constructed which are capable of modeling higher order interaction between variables, e.g. a decision stump only contains one feature, while the typical decision path of a 50-node binary tree equals a conjunction term of 5-6 features. Therefore, (1) indicates that the segmentation task is complex, and its decision boundary is better approximated by deeper trees.

Table 6. Bias/variance analysis. Bias and variance of RF, GB and BT.

Method   Bias (%)   Variance (%)
RF       10.20      0.39
GB       10.31      1.48
BT       9.95       2.09

(2) BT has lower bias than RF. This result confirms that boosting achieves additional reduction in bias by aggressively perturbing the training process to focus on the difficult samples [4]. However, this sometimes results in overfitting.

(3) RF has the lowest variance. Boosting persistently increases the minimum margin (of a few incorrectly classified samples) at the potential cost of decreasing the average margin (over all training data) [26]. Therefore boosting does not generalize well in the presence of label noise (see footnote 4). In contrast, RF is quite robust to this kind of noise. In fact, the effect of a few mistakenly labelled samples is restricted to particular leaf nodes locally, without sacrificing the accuracy of other nodes or trees. Therefore, it is not surprising to see that overfitting makes the error of B higher than A. Similar phenomena have also been reported in [4].

7. Conclusion and Future Work

This paper has presented an algorithm for the efficient segmentation of foreground and background in monocular video sequences. Our algorithm is capable of inferring bilayer segmentation monocularly even in the presence of distracting background motion and without the need for manual initialization. We have: (i) introduced novel visual features which capture motion and motion context efficiently; (ii) provided a general understanding of tree-based classifiers, which in turn (iii) has helped us select an efficient and accurate classifier in the form of Random Forests. Experiments and the related analysis on existing test data and our own database confirm accurate and robust segmentation. Similar to a stereo-based system, our approach manages to separate foreground and background even when distracting background motion occurs.

Next, we would like to apply our classifier taxonomy to other domains and applications to assess the merits of different types of classification algorithms in various situations.
Further comparisons between tree-based classifiers and Kernel Machines [18] (e.g. SVM) are also necessary.

Footnote 4. Unavoidable in segmentation problems.

References

[1] Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Comput., 9(7):1545-1588, 1997.
[2] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[3] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE PAMI, pages 1222-1239, 2001.
[4] L. Breiman. Random forests. UC Berkeley TR567, 1999.
[5] A. Criminisi, G. Cross, A. Blake, and V. Kolmogorov. Bilayer segmentation of live video. In IEEE CVPR, pages 53-60, 2006.
[6] T. Deselaers, A. Criminisi, J. Winn, and A. Agarwal. Incorporating on-demand stereo for real time recognition. In CVPR, 2007.
[7] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 38:337-374, 2000.
[8] V. Kolmogorov, A. Criminisi, A. Blake, G. Cross, and C. Rother. Bilayer segmentation of binocular stereo video. In IEEE CVPR, pages 407-414, 2005.
[9] E. Kong and T. Dietterich. Error-correcting output coding corrects bias and variance. In Proc. of ICML, pages 313-321, 1995.
[10] S. Kumar and M. Hebert. Discriminative random fields: A discriminative framework for contextual interaction in classification. In Proc. of IEEE ICCV, pages 1150-1157, 2003.
[11] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In 18th Int. Conf. on Machine Learning, pages 282-289, 2001.
[12] V. Lepetit, P. Lagger, and P. Fua. Randomized trees for real-time keypoint recognition. In Conference on Computer Vision and Pattern Recognition, San Diego, CA, June 2005.
[13] T. Leung and J. Malik. Representing and recognizing the visual appearance of materials using three-dimensional textons. IJCV, 43(1):29-44, 2001.
[14] R. Lienhart, A. Kuranov, and V. Pisarevsky. Empirical analysis of detection cascades of boosted classifiers for rapid object detection. In DAGM, pages 297-304, 2003.
[15] F. Perronnin, G. Dance, C. Csurka, and M. Bressan. Adapted vocabularies for generic visual categorization. In IEEE ECCV, 2006.
[16] J. Quinlan. Bagging, Boosting, and C4.5. In Proc. of National Conf. on Artificial Intelligence, pages 725-730. AAAI Press, 1996.
[17] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: interactive foreground extraction using iterated graph cuts. ACM Trans. Graph., 23(3):309-314, 2004.
[18] B. Scholkopf. Statistical learning and kernel methods. MSR-TR 2000-23, 2000.
[19] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In ECCV, 2006.
[20] J. Sun, W. Zhang, X. Tang, and H. Shum. Background cut. In Proc. of ECCV, pages 628-641, 2006.
[21] A. Torralba, K. Murphy, and W. Freeman. Sharing features: efficient boosting procedures for multiclass object detection. In CVPR04, pages 762-769, 2004.
[22] M. Varma and A. Zisserman. A statistical approach to texture classification from single images. IJCV, 62(1-2):61-81, 2005.
[23] P. Viola and M. Jones. Robust real-time object detection. IJCV, 57(2):137-154, 2004.
[24] J. Wang and E. Adelson. Representing moving images with layers. IEEE Trans. Image Process, 3(5):625-638, 1994.
[25] J. Winn, A. Criminisi, and T. Minka. Object categorization by learned universal visual dictionary. In Proc. of ICCV, pages 1800-1807, 2005.
[26] P. Yin, I. Essa, and J. M. Rehg. Segmental boosting algorithm for time-series feature selection. Tech. Report GIT-GVU-06-24.