Video-Based Face Recognition Using Adaptive Hidden Markov Models

Xiaoming Liu and Tsuhan Chen
Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, 15213, U.S.A.
xiaoming@andrew.cmu.edu tsuhan@cmu.edu

Abstract

While traditional face recognition is typically based on still images, face recognition from video sequences has become popular recently. In this paper, we propose to use adaptive Hidden Markov Models (HMM) to perform video-based face recognition. During the training process, the statistics of training video sequences of each subject, and the temporal dynamics, are learned by an HMM. During the recognition process, the temporal characteristics of the test video sequence are analyzed over time by the HMM corresponding to each subject. The likelihood scores provided by the HMMs are compared, and the highest score provides the identity of the test video sequence. Furthermore, with unsupervised learning, each HMM is adapted with the test video sequence, which results in better modeling over time. Based on extensive experiments with various databases, we show that the proposed algorithm provides better performance than using majority voting of image-based recognition results.

1. Introduction

For decades human face recognition has been an active topic in the field of object recognition. A general statement of this problem can be formulated as follows: Given still or video images of a scene, identify one or more persons in the scene using a stored database of faces [1]. Many algorithms have been proposed to deal with image-to-image, or image-based, recognition, where both the training and test sets consist of still face images. Some examples are Principal Component Analysis (PCA) [2], Linear Discriminant Analysis (LDA) [3], and Elastic Graph Matching [4]. However, with existing approaches, the performance of face recognition is affected by different kinds of variations, for example, expression, illumination and pose. Thus, researchers have started to look at video-to-video, or video-based, recognition [5][6][7][8], where both the training and test sets are video sequences containing the face. Video-based recognition has several advantages over image-based recognition.
First, the temporal information of faces can be utilized to facilitate the recognition task. For example, person-specific dynamic characteristics can help recognition [5]. Secondly, more effective representations, such as a 3D face model [9] or super-resolution images [10], can be obtained from the video sequence and used to improve recognition results. Finally, video-based recognition allows learning or updating the subject model over time. Liu et al. proposed an updating-during-recognition scheme, where the current and past frames in a video sequence are used to update the subject models to improve recognition results for future frames [8].

The temporal and motion information is a very important cue for video-based recognition. In [5], Li suggested modeling the face video as a surface in a subspace, which turns the recognition problem into a surface matching problem. Edwards et al. [6] proposed an adaptive framework for learning the human identity by using the motion information along the video sequence, which improves both face tracking and recognition. Recently, Zhou et al. proposed a probabilistic approach to video-based recognition [7]. They modeled the identity and the face motion as a joint distribution, whose marginal distribution is estimated to provide the recognition result.

The Hidden Markov Model (HMM) [11] has been successfully applied to model temporal information in applications such as speech recognition, gesture recognition [12], and expression recognition [13]. Samaria and Young used pixel values in each block as the observation vectors and applied HMM spatially to image-based face recognition [14]. In [15], Nefian proposed to utilize DCT coefficients as observation vectors, and a spatially embedded HMM was used for recognition. Although other variations of HMM have been applied to face recognition spatially, few of them deal with video-based recognition.

In this paper, we apply adaptive HMM temporally to perform video-based face recognition. As shown in Figure 1, during the training process, the statistics of training sequences and their temporal dynamics are learned by an HMM. During the recognition process, the temporal characteristics of the test video sequence are analyzed over time by the HMM corresponding to each subject.
Our proposed algorithm can learn the dynamic information and improve the recognition performance compared to the conventional method that simply utilizes majority voting of image-based recognition results. Also, motivated by the research in speaker adaptation [16], we adapt the HMM using the test sequences during the recognition process. Thus an updated HMM can provide better modeling and result in better performance over time.

The paper is organized as follows. In the next section, we briefly introduce HMM. In Section 3 our algorithms are presented in detail, and we discuss how to adapt the HMM in order to enhance the modeling and recognition performance. In Section 4, we compare the recognition performance of our algorithm with a baseline algorithm applied to various databases. Finally, the paper is concluded in Section 5.

Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'03) 1063-6919/03 $17.00 © 2003 IEEE

Figure 1. Temporal HMM for modeling face sequences

2. Hidden Markov model

A Hidden Markov Model is a statistical model used to characterize the statistical properties of a signal [11]. An HMM consists of two stochastic processes: one is an unobservable Markov chain with a finite number of states, an initial state probability distribution and a state transition probability matrix; the other is a set of probability density functions associated with each state. There are two types of HMM: discrete HMM and continuous HMM. The continuous HMM is characterized by the following:

$N$, the number of states in the model. We denote the individual states as $S = \{S_1, S_2, \ldots, S_N\}$, and the state at time $t$ as $q_t$, $1 \le t \le T$, where $T$ is the length of the observation sequence.

$A$, the state transition probability matrix, i.e., $A = \{a_{ij}\}$, where
$$a_{ij} = P[q_{t+1} = S_j \mid q_t = S_i], \quad 1 \le i, j \le N,$$
with the constraint $\sum_{j=1}^{N} a_{ij} = 1$, $1 \le i \le N$.

$B$, the observation probability density functions (pdf), i.e., $B = \{b_i(O)\}$, where
$$b_i(O) = \sum_{k=1}^{M} c_{ik}\, N(O; \mu_{ik}, U_{ik}), \quad 1 \le i \le N, \qquad (1)$$
where $c_{ik}$ is the mixture coefficient for the $k$th mixture component in State $i$, $M$ is the number of components in a Gaussian mixture model, and $N(O; \mu_{ik}, U_{ik})$ is a Gaussian pdf with mean vector $\mu_{ik}$ and covariance matrix $U_{ik}$.

$\pi$, the initial state distribution, i.e., $\pi = \{\pi_i\}$, where $\pi_i = P[q_1 = S_i]$, $1 \le i \le N$.

Using a shorthand notation, an HMM is defined as the triplet $\lambda = (A, B, \pi)$.

3. Our proposed algorithms

In this section, we describe our proposed algorithms in detail. First, feature extraction, HMM training and HMM testing are presented. Then, we introduce an algorithm to adapt the HMM in order to enhance the modeling over time.

3.1 Temporal HMM

When applying HMM to face recognition, researchers have proposed to use different features, for example, pixel values [14], eigen-coefficients, and DCT coefficients [15], as the observation vectors. In our algorithm, each frame in the video sequence is considered as one observation. Since PCA gives the optimal representation of the images in terms of the mean square error, all face images are reduced to low-dimensional feature vectors by PCA.

Consider a face database with $L$ subjects, where each subject has a training video sequence containing $T$ images, $F_l = \{f_{l,1}, f_{l,2}, \ldots, f_{l,T}\}$, $1 \le l \le L$. Each image contains only the face portion. By performing eigen-analysis on these $L \times T$ samples, we obtain an eigenspace with a mean vector $m$ and a few eigenvectors $\{V_1, V_2, \ldots, V_d\}$. All the training images are projected into this eigenspace to generate the corresponding feature vectors $e_{l,t}$, $1 \le l \le L$, $1 \le t \le T$, which will be used as observation vectors in the HMM training. At this stage, we also compute the covariance matrix of all the feature vectors $e_{l,t}$, denoted $C_e$, which is a diagonal matrix with the eigenvalues as the diagonal elements. The matrix $C_e$ describes in general how all the face images are distributed along each dimension of the low-dimensional eigenspace, which provides useful information for the HMM training. Essentially, PCA is used in our algorithm for dimension reduction.

Each subject in the database is modeled by an $N$-state fully connected HMM. The feature vectors $e_{l,t}$
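form the observation vectors $O$ for training the HMM of Subject $l$.

The training for each HMM is as follows. First, the HMM $\lambda = (A, B, \pi)$ is initialized. Vector quantization is used to separate the observation vectors into $N$ classes, and the observation vectors associated with each class are used to generate the initial estimates for $B$, i.e., to estimate $c_{ik}$, $\mu_{ik}$ and $U_{ik}$ as in (1). Second, in order to maximize the likelihood $P(O \mid \lambda)$, the model parameters are re-estimated by using the Expectation Maximization (EM) algorithm [16]. It produces a sequence of estimates for $\lambda$, given a set of observation vectors $O$, so that each estimate $\lambda^{(n)} = (A, B, \pi)$ has a larger value of $P(O \mid \lambda)$ than the preceding estimate $\lambda^{(n-1)}$. The re-estimation is defined as follows:

$$\pi_i = \frac{P(O, q_1 = S_i \mid \lambda)}{P(O \mid \lambda)} \qquad (2)$$

$$a_{ij} = \frac{\sum_{t=1}^{T-1} P(O, q_t = S_i, q_{t+1} = S_j \mid \lambda)}{\sum_{t=1}^{T-1} P(O, q_t = S_i \mid \lambda)} \qquad (3)$$

$$c_{ik} = \frac{\sum_{t=1}^{T} P(q_t = S_i, m_{q_t t} = k \mid O, \lambda)}{\sum_{t=1}^{T} \sum_{k=1}^{M} P(q_t = S_i, m_{q_t t} = k \mid O, \lambda)} \qquad (4)$$

$$\mu_{ik} = \frac{\sum_{t=1}^{T} O_t\, P(q_t = S_i, m_{q_t t} = k \mid O, \lambda)}{\sum_{t=1}^{T} P(q_t = S_i, m_{q_t t} = k \mid O, \lambda)} \qquad (5)$$

$$U_{ik} = (1 - \alpha)\, C_e + \alpha\, \frac{\sum_{t=1}^{T} (O_t - \mu_{ik})(O_t - \mu_{ik})^{\top} P(q_t = S_i, m_{q_t t} = k \mid O, \lambda)}{\sum_{t=1}^{T} P(q_t = S_i, m_{q_t t} = k \mid O, \lambda)} \qquad (6)$$

where $m_{q_t t}$ indicates the mixture component for State $q_t$ at time $t$. Equation (6) is used to adapt the variance estimate from $C_e$, which is a general model for the variance of all subjects. The parameter $\alpha$ is a weighting factor and is chosen as 0.5 in our experiments. Normally during the training process, when the number of face images assigned to a state is less than the dimension of the feature vectors, $U_{ik}$ will be a singular matrix. This adaptation step prevents that from happening. The model parameters are estimated iteratively using (2)-(6) until the likelihood $P(O \mid \lambda)$ converges.

In the recognition process, given a video sequence containing face images, all frames are projected into the eigenspace and the resulting feature vectors form the observation vectors. Then the likelihood score of the observation vectors given each HMM is computed, where the transition probability and observation pdf are used. The face sequence is recognized as Subject $k$ if:
$$P(O \mid \lambda_k) = \max_{l} P(O \mid \lambda_l).$$

3.2 Adaptive HMM

In typical speech recognition systems, there is a dichotomy between speaker-independent and speaker-dependent systems. While speaker-independent systems are ready to be used without further training, their performance is usually two or three times worse than that of speaker-dependent systems.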
However, the latter requires large amounts of training data from the designated speaker. To address this issue, the concept of speaker adaptation [16][17] has been introduced, where a small amount of data from the specific speaker is used to modify the speaker-independent system and improve its performance. Similarly, in the vision community, Liu et al. [8] also proposed unsupervised model updating to enhance the object modeling and improve the recognition performance over time.

Motivated by these ideas, we propose to use adaptive HMM for video-based face recognition. That is, during the recognition process, after we recognize a test sequence as one subject, we can use this sequence to update the HMM of that subject, which will learn the new appearance in this sequence and provide an enhanced model of that subject. Obviously, two questions need to be answered. First, how do we decide whether we should use the current test sequence for updating? This is important for avoiding wrong updating, i.e., a sequence being used to update another subject's model instead of his/her own model. Second, how do we adapt the HMM?

For the first question, we would like to measure, based on a certain feature, how confident we are that the recognition result for the current sequence is correct. The higher the confidence, the more certain we are that the current sequence should be used to update the HMM. In our algorithm, we use the likelihood difference, i.e., the difference between the highest likelihood score and the second highest likelihood score, as the feature to make the decision. The reason is that for correct recognition the likelihood difference tends to be large, while for incorrect recognition it is usually small. So given a test sequence, we compare its likelihood difference with a pre-defined threshold, and update the HMM only if the likelihood difference is larger than the threshold. In practice, this pre-defined threshold can be determined by performing experiments on a cross-validation data set.

We use the standard MAP adaptation technique [16] to adapt the HMM. That is, given an existing HMM $\lambda_{old}$ and observation vectors $O$ from a test sequence, we estimate a new HMM, $\lambda_{new} = (A, B, \pi)$.
We use $\lambda_{old}$ as the initial
parameters of $\lambda_{new}$, and the EM algorithm is used to re-estimate $\lambda_{new}$, except that the mean estimation is as follows:

$$\mu_{ik} = (1 - \beta)\, \mu_{ik}^{old} + \beta\, \frac{\sum_{t=1}^{T} O_t\, P(q_t = S_i, m_{q_t t} = k \mid O, \lambda)}{\sum_{t=1}^{T} P(q_t = S_i, m_{q_t t} = k \mid O, \lambda)} \qquad (7)$$

where $\mu_{ik}^{old}$ is the mean vector from the existing HMM $\lambda_{old}$ and $\beta$ is a weighting factor that sets the balance between the previous estimate and the current data. In our experiments, we choose $\beta$ to be 0.3. Also, during the iterations of the EM algorithm, we do not update the covariance matrix of the observation pdf, $U_{ik}$, because, according to the speech research literature, the major discriminative information of an HMM is retained by the mean vectors rather than the covariance matrices. Also, in our experiments, updating the covariance matrix does not show significant improvement compared to not updating it.

4. Experiments

4.1 Setup

In order to test the proposed algorithm, we have collected a Task database with 2 subjects. During the data collection, each subject is required to read a paper that is hung beside the monitor and to type using the keyboard. Thus the subject essentially switches between reading the paper, looking at the monitor and looking at the keyboard. For each subject we collected 2 sequences, where one has 322 frames and is used for training; the other has around 400 frames and is used for testing. From the whole video frame, we manually crop the face region as a face image of 16 by 16 pixels. Sample face images for some subjects are shown in Figure 2. In addition, 5 months after we captured the original Task database, we captured a new test set with different lighting conditions and camera settings, but with only some of the subjects available.

Figure 2. Sample face images from our Task database

The second database is the Mobo database [18], originally collected for human identification from a distance. There are 24 subjects in this database. Each subject has four sequences captured in different walking situations: holding a ball, fast walking, slow walking, and walking on an incline. Each sequence has 300 frames. Three frames from one sequence are shown in Figure 3. Large head pose variation can be observed in this database. We crop the face portion from each frame and use it for the experiments. The image size is 48 by 48 pixels. Some of the manually cropped faces are shown in Figure 4. For each subject, the first 150 frames of all four sequences are used for training, and the remaining 150 frames of all sequences are used for testing.

Figure 3. Sample images from the Mobo database

Figure 4. Cropped faces from the Mobo database

In order to mimic the practical situation, we use the test scheme shown in Figure 5. That is, given a test set with $L$ subjects, we randomly choose a subject $l$, a starting frame $k$ and a length $z$. Then the frames $\{f_{l,k}, f_{l,k+1}, \ldots, f_{l,k+z-1}\}$ form a sequence for testing. This is similar to the practical situation where any subject can come to the recognition system at any time and for any duration. For both databases, we use this scheme to create a large number of sequences for testing.

4.2 Experimental results

Figure 5. Test scheme
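The test scheme of Figure 5, which generates the sequences used in the experiments below, can be sketched as follows. This is only a minimal illustration: the function names, the toy data, and the choice to draw the subject, the starting frame and the length uniformly at random are assumptions, since the text does not state which distributions were used.

```python
import random


def sample_test_sequence(test_set, rng):
    """Draw one test sequence: a random subject l, start frame k and length z."""
    subject = rng.choice(sorted(test_set))          # subject l
    frames = test_set[subject]
    start = rng.randrange(len(frames))              # starting frame k (0-based)
    # Length z, limited to the frames that remain after the start frame.
    length = rng.randint(1, len(frames) - start)
    return subject, start, frames[start:start + length]


def make_test_sequences(test_set, count, seed=0):
    """Create `count` random test sequences with a reproducible seed."""
    rng = random.Random(seed)
    return [sample_test_sequence(test_set, rng) for _ in range(count)]


# Toy test set: integers stand in for face images (hypothetical data).
toy_test_set = {"S1": list(range(10)), "S2": list(range(100, 110))}
sequences = make_test_sequences(toy_test_set, count=5)
```

Each returned triple holds the true subject label, the start index and a contiguous run of frames from that subject, so the label can be compared with the output of the recognition module.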
In order to compare video-based recognition with image-based recognition, we choose the individual PCA method (IPCA) as a baseline image-based algorithm. Building an individual eigenspace for each subject and performing recognition based on the residue has been a popular face recognition method [3]. Given a face sequence, after applying IPCA recognition to each frame, majority voting determines the identification of the whole sequence. We use this method as the baseline algorithm.

With the test scheme in Figure 5, we test the baseline algorithm on the Task database using 1000 sequences. By choosing different numbers of eigenvectors for the baseline algorithm, we can obtain different recognition performance. Typically, the more eigenvectors the baseline algorithm uses, the better its performance. However, after a certain number of eigenvectors are used, the performance does not improve any further. For the Task database, when 2 eigenvectors are used, the baseline algorithm has its best performance: a 9.9% recognition error rate.

We also apply our algorithm to this database with the same 1000 sequences. We use 45 eigenvectors during the dimension reduction, and a 2-state HMM is used to model each subject. Generally speaking, the more states used to train one HMM, the better the modeling, but the more parameters there are to estimate. For the Task database, we found that 2 states is a good compromise between modeling and estimation. Each state uses one Gaussian distribution to model the observation probability. Eventually we obtain a 7.0% recognition error rate, which is better than the best result we can get from the baseline algorithm. We also apply the adaptive HMM to the same test set and obtain a 4.0% recognition error rate.

We also do the same comparison with the newly captured test set. As shown in Table 1, although the overall recognition error rate is a lot higher than on the first test set because of the time elapse and the lighting and camera variations, we still see that our proposed methods work better than the baseline algorithm. However, the adaptive HMM does not improve much compared to the temporal HMM because, when the overall recognition error rate is high, wrong updating becomes more likely.

Similarly, we apply these three algorithms to the Mobo database as well.
The same test scheme is utilized and 500 randomly chosen sequences are used for testing. Different numbers of eigenvectors have been used for the baseline algorithm. It has the best performance, a 2.4% recognition error rate, when 7 eigenvectors are used. For our temporal HMM, we use 30 eigenvectors during the dimension reduction and train a 4-state HMM for each subject, where each state has one Gaussian distribution for modeling the observation probability. The recognition error rate is 1.6% for the temporal HMM. Similarly, we also apply the adaptive HMM to the Mobo database and obtain a 1.2% recognition error rate.

We summarize the performance comparison among the three algorithms in Table 1. As we can see, on both databases, our proposed algorithms perform much better than the baseline algorithm. In particular, the adaptive HMM algorithm almost halves the error rate of the baseline algorithm.

Table 1. Comparison among three algorithms (recognition error rate)

Database       Baseline   Temporal HMM   Adaptive HMM
Mobo           2.4%       1.6%           1.2%
Task           9.9%       7.0%           4.0%
Task-new set   49.1%      31.0%          29.8%

There are a few reasons why the proposed algorithms work better than the baseline algorithm. The first is that the HMM is able to learn both the dynamics and the temporal information. The second is that there is a mismatch between the training and test sets, i.e., some of the test sequences show a new appearance that is barely seen in the training set. The adaptive HMM enables the HMM to learn this new appearance in the test set and thus enhances the modeling. The third is the modeling ability of the observation pdfs corresponding to different states. For example, Figure 6 shows the training images of one subject in a subspace spanned by the first three eigenvectors, $\{V_1, V_2, V_3\}$. The plus signs show the feature vectors of all images from one training sequence. It illustrates that it is hard for IPCA to model this arbitrary distribution effectively, since IPCA essentially assumes a single Gaussian distribution, while the four dots, which are the means of the observation pdfs corresponding to the four states, can model the whole distribution better. We also plot the four means as images in Figure 7, where they seem to represent different head poses in the training sequence.
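The decision rule behind the adaptive HMM results in Table 1 (the likelihood-difference test of Section 3.2 followed by the mean update of Eq. (7)) can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: the scores are taken to be log-likelihoods, each state has a single Gaussian (so the mixture index $k$ is dropped), the state posteriors `gamma[t][i]` are assumed to have been computed elsewhere by a forward-backward pass, and only one update pass is shown rather than iterated EM. All function and variable names are hypothetical.

```python
def recognize(scores):
    """Return (index of the best-scoring subject, likelihood difference to the runner-up)."""
    ranked = sorted(range(len(scores)), key=lambda s: scores[s], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return best, scores[best] - scores[runner_up]


def adapt_means(old_means, gamma, obs, beta=0.3):
    """Eq. (7)-style update: blend old state means with posterior-weighted data means.

    old_means: N state means, each a list of d floats
    gamma:     T rows of N state posteriors (assumed given)
    obs:       T observation vectors, each a list of d floats
    """
    dim = len(old_means[0])
    new_means = []
    for i, old in enumerate(old_means):
        weight = sum(row[i] for row in gamma)
        if weight <= 0.0:
            # State never visited by this sequence: keep the old mean.
            new_means.append(list(old))
            continue
        data_mean = [
            sum(row[i] * o[d] for row, o in zip(gamma, obs)) / weight
            for d in range(dim)
        ]
        new_means.append(
            [(1.0 - beta) * m + beta * x for m, x in zip(old, data_mean)]
        )
    return new_means


def recognize_and_adapt(scores, means, gammas, obs, threshold, beta=0.3):
    """Recognize a sequence; update the winner's means only if the margin exceeds the threshold."""
    best, diff = recognize(scores)
    if diff > threshold:
        means[best] = adapt_means(means[best], gammas[best], obs, beta)
    return best, diff
```

For example, with scores `[-10.0, -3.0, -7.0]` the second subject wins with a likelihood difference of 4.0, so its means are updated only when the threshold is below 4.0. Leaving the covariances untouched mirrors the choice described in Section 3.2.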
Figure 6. The distribution of training faces in the eigenspace
Figure 7. The mean faces corresponding to four states

5. Conclusions

In this paper, we propose to use adaptive HMM to perform video-based face recognition. During the training process, the statistics of training video sequences of each subject, and their temporal dynamics, are learned by an HMM. During the recognition process, the temporal characteristics of the test video sequence are analyzed over time by the HMM corresponding to each subject. The likelihood scores provided by the HMMs are compared, and the highest score provides the identity of the test video sequence. Furthermore, with unsupervised learning, each HMM is adapted with the test video sequence, which results in better modeling over time. Based on extensive experiments with various databases, we show that the proposed algorithm provides better performance than using majority voting of image-based recognition results.

The paper shows that video-based face recognition is one promising way to enhance the performance of current image-based recognition. Along this direction, our future work is to combine the idea of spatial HMM with our temporal HMM to model both spatial and temporal information of the face sequences. Also, since the observation probabilities of the HMM are used to model facial appearance, we can utilize them for face tracking, which enables both face tracking and recognition in the same framework.

Acknowledgment

The authors would like to thank Dr. J. Shi, R. Gross, Carnegie Mellon University, and Prof. V. Krueger, Aalborg University, for providing the Mobo database for our experiments. The authors also acknowledge each individual appearing in our face database.

References

[1] R. Chellappa, C.L. Wilson, and S. Sirohey, "Human and machine recognition of faces: a survey," Proceedings of the IEEE, Vol.83, No.5, 1995, pp.705-741.
[2] M. Turk and A. Pentland, "Eigenfaces for Recognition," Journal of Cognitive Neuroscience, Vol.3, No.1, 1991, pp.71-86.
[3] P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman, "Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.19, No.7, 1997, pp.711-720.
[4] M. Lades, J.C. Vorbruggen, J. Buhmann, J. Lange, C. von der Malsburg, R.P. Wurtz, and W. Konen, "Distortion Invariant Object Recognition in the Dynamic Link Architecture," IEEE Transactions on Computers, Vol.42, No.3, 1992, pp.300-311.
[5] Y. Li, "Dynamic face models: construction and applications," PhD thesis, Queen Mary, University of London, 2001.
[6] G.J. Edwards, C.J. Taylor, and T.F. Cootes, "Improving Identification Performance by Integrating Evidence from Sequences," In Proc. of 1999 IEEE Conference on Computer Vision and Pattern Recognition, June 23-25, 1999, Fort Collins, Colorado, pp.486-491.
[7] S. Zhou, V. Krueger, and R. Chellappa, "Face Recognition from Video: A CONDENSATION Approach," In Proc. of Fifth IEEE International Conference on Automatic Face and Gesture Recognition, Washington D.C., May 20-21, 2002, pp.221-228.
[8] X. Liu, T. Chen, and S.M. Thornton, "Eigenspace Updating for Non-Stationary Process and Its Application to Face Recognition," to appear in Pattern Recognition, Special issue on Kernel and Subspace Methods for Computer Vision, September 2002.
[9] A. Roy Chowdhury, R. Chellappa, R. Krishnamurthy, and T. Vo, "3D Face Reconstruction from Video Using A Generic Model," In Proc. of Int. Conf. on Multimedia and Expo, Lausanne, Switzerland, August 26-29, 2002.
[10] S. Baker and T. Kanade, "Limits on Super-Resolution and How to Break Them," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.24, No.9, September 2002, pp.1167-1183.
[11] L. Rabiner, "A tutorial on Hidden Markov Models and selected applications in speech recognition," Proceedings of the IEEE, Vol.77, No.2, 1989, pp.257-286.
[12] A. Kale, A.N. Rajagopalan, N. Cuntoor, and V. Krueger, "Gait-based Recognition of Humans Using Continuous HMMs," In Proc. of the 5th IEEE International Conference on Automatic Face and Gesture Recognition, Washington D.C., May 20-21, 2002, pp.336-341.
[13] J.J. Lien, "Automatic Recognition of Facial Expressions Using Hidden Markov Models and Estimation of Expression Intensity," doctoral dissertation, tech. report CMU-RI-TR-98-31, Robotics Institute, Carnegie Mellon University, April 1998.
[14] F. Samaria and S. Young, "HMM-based architecture for face identification," Image and Vision Computing, Vol.12, No.8, Oct 1994.
[15] A. Nefian, "A hidden Markov model-based approach for face detection and recognition," PhD thesis, Georgia Institute of Technology, Atlanta, GA, 1999.
[16] J.-L. Gauvain and C.-H. Lee, "Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains," IEEE Transactions on Speech and Audio Processing, Vol.2, No.2, 1994, pp.291-298.
[17] C.J. Leggetter and P.C. Woodland, "Maximum likelihood linear regression for speaker adaptation of the parameters of continuous density hidden Markov models," Computer Speech and Language, Vol.9, 1995, pp.171-185.
[18] R. Gross and J. Shi, "The CMU Motion of Body (MoBo) Database," tech. report CMU-RI-TR-01-18, Robotics Institute, Carnegie Mellon University, June 2001.