Extending boosting for large scale spoken language understanding

Mach Learn (2007) 69

Extending boosting for large scale spoken language understanding

Gokhan Tur

Received: 19 August 2005 / Revised: 10 June 2007 / Accepted: 28 August 2007 / Published online: 25 September 2007
© Springer Science+Business Media, LLC 2007

Abstract We propose three methods for extending the Boosting family of classifiers motivated by the real-life problems we have encountered. First, we propose a semisupervised learning method for exploiting the unlabeled data in Boosting. We then present a novel classification model adaptation method. The goal of adaptation is optimizing an existing model for a new target application, which is similar to the previous one but may have different classes or class distributions. Finally, we present an efficient and effective cost-sensitive classification method that extends Boosting to allow for weighted classes. We evaluated these methods for call classification in the AT&T VoiceTone spoken language understanding system. Our results indicate that it is possible to obtain the same classification performance by using 30% less labeled data when the unlabeled data is utilized through semisupervised learning. Using model adaptation we can achieve the same classification accuracy using less than half of the labeled data from the new application. Finally, we present significant improvements in the important (i.e., higher weighted) classes without a significant loss in overall performance using the proposed cost-sensitive classification method.

Keywords Boosting · Semisupervised learning · Model adaptation · Cost-sensitive classification · Call classification · Spoken dialog systems

Editor: Dan Roth.
This work was done when the author was with AT&T Labs-Research, Florham Park, NJ.
G. Tur, SRI International Speech Technology and Research Lab, Menlo Park, CA 94025, USA. e-mail: gokhan@speech.sri.com

1 Introduction

Statistical classification algorithms have long been studied in the machine learning community. Typically, classification models are trained using large amounts of task data that are usually labeled by humans. By labeling, we mean assigning one or more of the predefined classes to each example. Building better classification systems in a shorter time frame is a

need for most real-world applications. For instance, consider a natural language call routing system where the aim is to route the incoming calls in a customer care call center. In such a system the aim is to identify the customer's intent (call-type), which is basically framed as a call classification problem. Consider the utterance "I would like to know my account balance." Assuming that the utterance is recognized correctly by an automatic speech recognizer, the corresponding call-type would be Tellme(Balance) and the action would be prompting the balance to the user or routing this call to the billing department. In cases where numerous such systems for different call centers from different domains or companies need to be built, preparing labeled data for each system is a very expensive, time consuming, and laborious process. Assuming that the bottleneck is not the collection of data but instead labeling it, we propose using semisupervised learning.

Another problem with most classification algorithms is that they make the assumption that the input channel is stationary, that is, the distribution of the incoming (training and test) examples is constant. Although keeping the training and test sets fixed may be a good idea in comparing different classification approaches, for applications like those described above, the stationary distribution assumption may not always be true. For example, if the corresponding company introduces a new service, one may expect callers to start inquiring about that service, which may be unseen or very infrequent in the training data. Continuous human labeling of new data and model retraining alleviates this problem. However, adaptation to time-varying statistics by incrementing the effect of a small set of newly labeled examples might be a more cost-effective and faster solution. Likewise, in some cases there may be a new application very similar to an existing one, such as product solutions for two companies from the same sector. Therefore, the old models can be adapted to the new application. In this paper, we propose using model adaptation techniques for this problem.
Furthermore, typically in the proposed classification algorithms all classes have the same importance or cost for multiclass classification. For most real-world applications, this is not the case; not all classes have the same weight. The accuracy of a particular class can be more critical than others, or high precision may be needed for some classes. Especially while dealing with more than a few classes, this is unavoidable. For the call routing example, misclassifying an utterance asking for an account balance as a request for cancellation is more costly than the other way around, as it may result in a lost customer. In this paper, we propose a cost-sensitive classification approach.

Our proposed methods depend on a particular classification algorithm: namely Boosting. Boosting is an iterative algorithm; on each iteration, a weak classifier is trained on a weighted training set, and at the end, the weak classifiers are combined into a single, combined classifier (Freund and Schapire 1997). The Boosting algorithm has been used successfully for many classification tasks, such as text categorization (Schapire and Singer 2000). We propose the following methods for extending the Boosting family of classifiers:

Semisupervised Learning: The goal is to exploit the unlabeled data in Boosting. We propose a method for augmenting the classification model trained using the human-labeled examples with the machine-labeled examples in a weighted manner. The machine-labeled examples are automatically constructed using the labels output by the model trained with human-labeled examples. The resulting model is actually a superset of weak learners of the initial model. The new weak learners are learned once the initial ones are applied to all the human- and machine-labeled data. We also compare this approach with a task-independent baseline approach where human-labeled data is simply concatenated to the machine-labeled data.

Model Adaptation: The goal is to adapt an existing model to a new target application, which is similar but may have different classes or class distributions. This may also include continuous adaptation of an existing model to time-varying statistics or exploiting out-of-domain data for training the target model. The idea is similar to the semisupervised learning technique. Boosting applies the existing model to the new data, and then new weak learners are added to this existing model using the new data in a weighted manner.

Cost-Sensitive Classification: The goal is to extend Boosting to allow weighted classes. Our main idea is to change the error function considering the weights of the classes, thus ensuring optimal accuracy at each iteration for the important classes. In other words, we have changed the criterion to choose the weak learner according to the associated costs of the classes.

In the following section we summarize the previous related work on all three of these areas. We then briefly describe Boosting in Sect. 3 since all extensions require modifications to the original algorithm. Sections 4, 5, and 6 present our proposed methods for semisupervised learning, model adaptation, and cost-sensitive classification, respectively. In Sect. 7 we present our results using a call classification task from the AT&T VoiceTone spoken dialog system.¹

2 Related work

The following describes the related work for each topic of our research.

2.1 Boosting for call classification and text categorization

Boosting, the machine learning algorithm we focus on in this paper, has been used for a number of language processing tasks, such as call classification and text categorization. Schapire and Singer have presented empirical results comparing Boosting with other state-of-the-art classification methods (Schapire and Singer 2000). For example, for text categorization Boosting outperformed other methods such as Rocchio, Naive Bayes, k-nearest-neighbor (KNN), RIPPER, and Sleeping Experts on the newswire data, especially when numerous examples are available for training. They have also provided the first experiments on using Boosting for a simple call routing task with only six classes. A later Boosting algorithm was used for the BBN call routing system (Zitouni et al. 2001).
In this 23-way classification task, an additional 10% improvement has been observed compared to the previously best classification method, named Beta classification. AT&T has successfully used Boosting in large-scale call classification systems, first with a Helpdesk application (Di Fabbrizio et al. 2002) and later for its enterprise customers as part of the AT&T VoiceTone spoken dialog system (Gupta et al. 2006). The algorithm was then extended to handle manually written call classification rules to augment the labeled data in special cases (Schapire et al. 2005). This paper has emerged from the apparent needs of extensions to the basic Boosting algorithm during the real-life application developments for VoiceTone.

¹ The VoiceTone system is provided by AT&T for customer care centers.

2.2 Semisupervised learning

Semisupervised learning algorithms that use both labeled and unlabeled data have been used for classification in order to reduce the need for labeled training data. Blum and Mitchell (1998) proposed a semisupervised learning approach called co-training. For co-training, the features in the problem domain should naturally divide into two sets. Then the examples classified with high confidence scores with one view can be used as the training data of other views. For example, for web page classification, one view can be the text in the web pages and another view can be the text in the hyperlinks pointing to those web pages. For the same task, Nigam et al. (2000) used an algorithm for learning from labeled and unlabeled documents based on the combination of the Expectation Maximization (EM) algorithm and a Naive Bayes classifier. Nigam and Ghani (2000) then combined co-training and EM algorithms, coming up with the Co-EM algorithm, which is the probabilistic version of co-training. Ghani (2002) later combined the Co-EM algorithm with error-correcting output coding (ECOC) to exploit the unlabeled data, in addition to the labeled data. For spoken language understanding, we have presented semisupervised learning approaches for exploiting unlabeled data (Tur and Hakkani-Tür 2003). Note that our focus is different from training call routing systems with automatic speech recognizer (ASR) output instead of manual transcriptions such as (Iyer et al. 2002; Alshawi 2003). In this paper we present how Boosting can be extended to augment an existing model using unlabeled data.

2.3 Model adaptation

Although statistical model adaptation has been well studied in some specific areas such as speech recognition for acoustic and language modeling (Riccardi and Gorin 2000; Bacchiani et al. 2004; Digalakis et al. 1995, among others), there is comparably less work done on machine learning and natural language processing. We believe this is the first study presenting model adaptation for language understanding.
One recent study is on the adaptation of natural language understanding using a common adaptation method, maximum a posteriori (MAP) adaptation (He and Young 2004), which adapts the hidden vector state model built for an airline travel information application (ATIS) to another (DARPA Communicator). Another study is about supervised and unsupervised adaptation of probabilistic context-free grammars to a new domain, again using MAP adaptation (Roark and Bacchiani 2003). For spoken language understanding, we have proposed a model adaptation approach using Boosting (Tur 2005). In the machine learning literature, model adaptation has been studied under the titles of meta-learning (Prodromidis and Stolfo 1998) and multitask learning (Caruana 1997). Meta-learning aims at learning useful information from large and inherently distributed sources (in our case, applications). Given multiple classification models trained using local data, the goal is to train a meta-level classifier combining all these baseline models using a meta-level training set. Multitask learning aims at training tasks (in our case, applications) in parallel while using a shared representation. What is learned for each task can help other tasks be learned better.

2.4 Cost-sensitive classification

In the machine learning literature, the vast majority of classifiers, including Boosting, do not handle weighted classes automatically, a lack that we address in this work. There are a number of studies explicitly attacking this problem, and they fall into three categories:

– Making existing classifiers cost-sensitive (Tur 2004; Drummond and Holte 2000; Fan et al. 1999). For example, Fan et al. (1999) have proposed the AdaCost algorithm, which extends the AdaBoost algorithm giving weights to individual training examples instead of classes.
– Using Bayes risk theory to assign each example to its lowest risk class (Domingos 1999; Margineantu 2002; Zadrozny and Elkan 2001).
– Changing the class distributions in the training set such that the cost-insensitive classifier learned will perform equally to a cost-sensitive classifier learned from the original set (Zadrozny et al. 2003).

While the first approach is classifier dependent, it may be more effective for some classification algorithms. The latter two approaches are classifier independent. The third approach, while practical, requires either loss of existing training data (for down-sampling) or replication of it (up-sampling) and hence may be suboptimal for some of the classification algorithms. The second approach requires the classifier to output a probability for each of the classes. In the cases using Bayes risk theory, a cost matrix C is usually used, where the entry (i, j) is the cost of predicting class i when the true class is j. Then the Bayes optimal prediction for a sample x is the class i that minimizes the expectation of the cost, or the conditional risk (Duda and Hart 1973):

    R(i|x) = Σ_j P(j|x) C(i, j)    (1)

where P(j|x) is the probability of sample x to be class j. Using this formula of the conditional risk demands good estimation of the class probabilities. For classifiers that output scores for classes, calibration can be used to estimate the probabilities, so well-developed cost-insensitive classifiers, such as SVM, Boosting, and so on, can be used.

3 Boosting

We propose methods for using Boosting for semisupervised learning, model adaptation, and cost-sensitive classification. Before presenting these methods, we need to present the Boosting algorithm in detail.
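As a concrete illustration of the conditional risk in (1), the minimum-risk prediction can be computed directly from a class-posterior vector and a cost matrix. The sketch below is ours, not from the paper, and the cost values in the usage example are hypothetical:

```python
def conditional_risk(posteriors, cost, i):
    # R(i|x) = sum_j P(j|x) * C(i, j): expected cost of predicting class i for sample x
    return sum(p_j * cost[i][j] for j, p_j in enumerate(posteriors))

def bayes_optimal_class(posteriors, cost):
    # Bayes optimal prediction: the class minimizing the conditional risk
    return min(range(len(posteriors)), key=lambda i: conditional_risk(posteriors, cost, i))
```

With asymmetric costs, the minimum-risk class can differ from the maximum-posterior class; e.g., if predicting class 0 when the truth is class 1 is ten times costlier than the reverse, a sample with posteriors (0.55, 0.45) is assigned to class 1.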
Boosting is an iterative algorithm; on each iteration, t, a weak classifier, h_t, is trained on a weighted training set, and at the end, the weak classifiers are combined into a single, combined classifier (Freund and Schapire 1997). For example, for text categorization, one can use word n-grams as features, and each weak classifier (e.g., a decision stump, which is a single node decision tree) can check the absence or presence of an n-gram. The algorithm generalized for multiclass and multilabel classification is as follows: Let X = {x_1, ..., x_m} denote the domain of possible training examples and Y be a finite set of classes of size |Y| = k. For Y_i ⊆ Y, let

    Y_i[l] = +1 if l ∈ Y_i, −1 otherwise

for each class l ∈ Y and each example. The algorithm begins by initializing a uniform weight distribution D_1(i, l) over training examples and labels. D_t(i, l) indicates the weight of training sample i for the class l. After each round this weight distribution is updated so that the example-class combinations that are easier to classify get lower weights and vice versa. The intended effect

is to force the weak learning algorithm to concentrate on the examples and labels that will be the most beneficial to the overall goal of finding a highly accurate classification rule. More formally, the algorithm is as follows:

– Given training data from the instance space S = {(x_1, Y_1), ..., (x_m, Y_m)} where x_i ∈ X and Y_i ⊆ Y.
– Initialize the distribution D_1(i, l) = 1/(mk).
– For each iteration t = 1, ..., T do: train a base learner h_t : X × Y → R using distribution D_t, then update

      D_{t+1}(i, l) = D_t(i, l) e^{−α_t Y_i[l] h_t(x_i, l)} / Z_t

  where Z_t is a normalization factor

      Z_t = Σ_{i=1}^m Σ_{l∈Y} D_t(i, l) e^{−α_t Y_i[l] h_t(x_i, l)}

  and α_t is the weight of the base learner.
– The output of the final classifier is then defined as

      f(x, l) = Σ_{t=1}^T α_t h_t(x, l).

The Boosting algorithm is independent of the weak classifiers employed. It is assumed that at each iteration a weak classifier h_t is trained using samples associated with classes. Each sample may have a different weight, hence the distribution D_t. The final score for each class, f(x, l), is then simply defined as the weighted summation of the individual weak learners' scores, h_t. α_t is the weight of each weak learner. Although there may be many ways to compute this weight, one of them can be using the error rate, ε_t, of each weak learner:

    α_t = (1/2) ln((1 − ε_t)/ε_t).

Schapire and Singer (1999) have proved a bound on the empirical Hamming loss (HL) of H in the Boosting algorithm. Hamming loss is defined as the fraction of examples, i, and labels, l, for which the sign of f(x_i, l), H(x_i, l) ∈ {−1, 1}, is different from Y_i[l].

Theorem 1

    HL(H) ≤ Π_{t=1}^T Z_t    (2)

where

    HL(H) = (1/(mk)) |{(i, l) : H(x_i, l) ≠ Y_i[l]}| = (1/(mk)) Σ_{i,l} δ(x_i, l)    (3)

where

    δ(x_i, l) = 1 if H(x_i, l) ≠ Y_i[l], 0 otherwise.
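The training loop above can be sketched in code. The following is a toy illustration of AdaBoost.MH with presence/absence decision stumps, not the actual implementation used in the paper; the per-block voting scheme, the error clamping, and the data layout are our own simplifying assumptions:

```python
import math

def train_adaboost_mh(X, Y, features, T):
    """Toy AdaBoost.MH. X: list of feature sets; Y: list of label vectors
    in {+1, -1}^k; features: candidate stump features; T: rounds."""
    m, k = len(X), len(Y[0])
    D = [[1.0 / (m * k)] * k for _ in range(m)]      # D_1(i, l) = 1/(mk)
    ensemble = []
    for _ in range(T):
        best = None
        for f in features:
            # two blocks per stump: feature absent (0) or present (1);
            # each block votes +1/-1 per class to minimize weighted error
            votes, err = [[0] * k for _ in range(2)], 0.0
            for b in range(2):
                for l in range(k):
                    wp = sum(D[i][l] for i in range(m)
                             if (f in X[i]) == bool(b) and Y[i][l] == 1)
                    wm = sum(D[i][l] for i in range(m)
                             if (f in X[i]) == bool(b) and Y[i][l] == -1)
                    votes[b][l] = 1 if wp >= wm else -1
                    err += min(wp, wm)
            if best is None or err < best[0]:
                best = (err, f, votes)
        eps, f, votes = best
        eps = min(max(eps, 1e-10), 1 - 1e-10)        # clamp to avoid log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)      # alpha_t = 1/2 ln((1-eps)/eps)
        ensemble.append((alpha, f, votes))
        Z = 0.0
        for i in range(m):
            b = 1 if f in X[i] else 0
            for l in range(k):
                # D_{t+1}(i, l) = D_t(i, l) exp(-alpha_t Y_i[l] h_t(x_i, l)) / Z_t
                D[i][l] *= math.exp(-alpha * Y[i][l] * votes[b][l])
                Z += D[i][l]
        for i in range(m):
            for l in range(k):
                D[i][l] /= Z
    return ensemble

def f_score(ensemble, x, l):
    # f(x, l) = sum_t alpha_t * h_t(x, l)
    return sum(a * v[1 if f in x else 0][l] for a, f, v in ensemble)
```

On a two-utterance, two-class toy set where "balance" signals class 0 and "cancel" signals class 1, a few rounds suffice to give f(x, l) the correct sign for every pair.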

Proof By unraveling the update rule, we have that

    D_{T+1}(i, l) = e^{−Y_i[l] f(x_i, l)} / (mk Π_t Z_t).    (4)

Moreover, if H(x_i, l) ≠ Y_i[l] then Y_i[l] f(x_i, l) ≤ 0, implying that e^{−Y_i[l] f(x_i, l)} ≥ 1. Thus,

    δ(x_i, l) ≤ e^{−Y_i[l] f(x_i, l)}.    (5)

Combining (3), (4), and (5), we get

    (1/(mk)) Σ_{i,l} δ(x_i, l) ≤ (1/(mk)) Σ_{i,l} e^{−Y_i[l] f(x_i, l)} = Σ_{i,l} (Π_{t=1}^T Z_t) D_{T+1}(i, l) = Π_{t=1}^T Z_t.    (6)

Theorem 1 is important since it contains the loss function Boosting tries to minimize and the criterion weak learners need to minimize. For semisupervised learning and model adaptation the goal is to change the loss function to minimize, and for cost-sensitive classification the goal is to change the weak learner selection criterion. As seen from (6), this algorithm can be seen as a procedure for finding a linear combination of base classifiers that attempts to minimize an exponential loss function (Schapire and Singer 1999), which in this case is:

    Σ_{i,l} e^{−Y_i[l] f(x_i, l)}.    (7)

An alternative would be to minimize a logistic loss function as suggested by Friedman et al. (2000), namely

    Σ_{i,l} ln(1 + e^{−Y_i[l] f(x_i, l)}).    (8)

The confidence of a class, l, for example x_i is then computed with a logistic function, ρ(), as

    P(Y_i[l] = +1 | x_i) = ρ(f(x_i, l)) = 1 / (1 + e^{−K f(x_i, l)})    (9)

where K is 1 for logistic loss, and 2 for exponential loss (Collins et al. 2002). A more detailed explanation and analysis of this algorithm can be found in (Schapire 2001).

4 Semisupervised learning

The aim of semisupervised learning is to exploit the unlabeled examples for a statistical classification task. We assume that there is some amount of training data available for training an initial classifier. The basic idea is to use this classifier to label the unlabeled data automatically, and improve the classifier performance using the machine-labeled examples, thus reducing the amount of human-labeling effort necessary to come up with better statistical systems.
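For quick reference, the losses (7)–(8) and the confidence mapping (9) can be stated directly in code; a minimal sketch of ours over precomputed margins Y_i[l]·f(x_i, l):

```python
import math

def confidence(f_xl, K=2):
    # eq (9): rho(f(x, l)) = 1 / (1 + exp(-K f(x, l)));
    # K = 1 for logistic loss, K = 2 for exponential loss (Collins et al. 2002)
    return 1.0 / (1.0 + math.exp(-K * f_xl))

def exp_loss(margins):
    # eq (7): sum over (i, l) of exp(-Y_i[l] f(x_i, l))
    return sum(math.exp(-m) for m in margins)

def logistic_loss(margins):
    # eq (8): sum over (i, l) of ln(1 + exp(-Y_i[l] f(x_i, l)))
    return sum(math.log(1.0 + math.exp(-m)) for m in margins)
```

The exponential loss dominates the logistic loss for every margin, which is one reason the logistic variant is less sensitive to confidently misclassified outliers.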

Fig. 1 Semisupervised learning framework

The simplest method for semisupervised learning is augmenting the human-labeled data with machine-labeled data. In this method, first an initial model is trained using the human-labeled data, which is then used to classify the unlabeled data. Then we add the unlabeled examples directly to the training data, by using the machine-labeled classes. In order to reduce the noise added because of classifier errors, we add only those examples that are classified with a confidence higher than some threshold. This threshold can be set using a separate held-out set. Then the whole data including both human- and machine-labeled examples is used for training the classifier again. We will evaluate and compare this method with our proposed method in Sect. 7. Figure 1 depicts the process proposed for semisupervised learning.

This method is similar to incorporating prior knowledge into Boosting (Schapire et al. 2005). In that work, a model fitting both the human-labeled training data and the task knowledge is trained. In our case, the aim is to train a model that fits both the human-labeled and machine-labeled data. In that sense, this is actually an application of that work for semisupervised learning. Fitting a model and a data set is implemented as follows: We first train an initial model using the human-labeled data. Then, the Boosting algorithm measures the fit to the machine-labeled data and the fit to the initial model. To measure the fit to the machine-labeled data, the algorithm uses the logistic loss as given by (8). The fit to the initial model is measured using the Kullback-Leibler (KL) divergence (or binary relative entropy). More formally, the algorithm now aims to minimize the following loss function, which is actually an extension of the logistic loss:

    Σ_{i,l} [ln(1 + e^{−Y_i[l] f(x_i, l)}) + η KL(P(Y_i[l] = 1 | x_i) ‖ ρ(f(x_i, l)))]    (10)

where

    KL(p ‖ q) = p ln(p/q) + (1 − p) ln((1 − p)/(1 − q))

is the KL divergence between two distributions p and q. In our case, these two distributions correspond to the class confidences from the initial model, P(Y_i[l] = 1 | x_i), and to the distribution from the constructed model, ρ(f(x_i, l)), as defined by (9).
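A sketch of the augmented objective (10), assuming the per-pair margins and the two models' confidences have been precomputed; the function names are illustrative, not from the paper:

```python
import math

def kl_bernoulli(p, q):
    # KL(p || q) = p ln(p/q) + (1-p) ln((1-p)/(1-q)) for two Bernoulli parameters
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def ssl_objective(margins, prior_conf, model_conf, eta):
    # eq (10): logistic loss on the machine-labeled data plus an eta-weighted
    # KL term tying the new model's confidences to the initial model's
    fit_to_data = sum(math.log(1.0 + math.exp(-m)) for m in margins)
    fit_to_prior = sum(kl_bernoulli(p, q) for p, q in zip(prior_conf, model_conf))
    return fit_to_data + eta * fit_to_prior
```

When the new model's confidences match the initial model's, the KL term vanishes and the objective reduces to the plain logistic loss (8).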
This term is basically the distance from the initial model built by human-labeled data and the new model built

with machine-labeled data.² η is used to control the relative importance of these two terms. This weight may be determined empirically on a held-out set. In addition to that, in order to reduce the noise added because of classifier errors, we can exploit only those examples that are classified with a confidence higher than some threshold. This threshold is also optimized using the held-out set. The modification of the AdaBoost algorithm to incorporate this new loss function is explained in detail in (Schapire et al. 2005). Equation (10) can be rewritten as

    C + Σ_{i,l} [ln(1 + e^{−Y_i[l] f(x_i, l)}) + η P(Y_i[l] = 1 | x_i) ln(1 + e^{−f(x_i, l)}) + η (1 − P(Y_i[l] = 1 | x_i)) ln(1 + e^{f(x_i, l)})].

This function can be minimized by adding two additional examples for each given example with weights η P(Y_i[l] = 1 | x_i) and η (1 − P(Y_i[l] = 1 | x_i)) for Y_i[l] = 1 and Y_i[l] = −1, respectively. The AdaBoost algorithm then considers these weights of individual examples instead of starting with a uniform distribution.

5 Model adaptation

The aim of model adaptation is to exploit the existing labeled data and models for improving the performance of new similar applications using a supervised adaptation method. The basic assumption is that there is an existing model trained with data similar to the target application. Then the idea is to adapt this classification model using the small amount of already-labeled data from the target application, thus reducing the amount of human labeling necessary to come up with more accurate statistical classification systems. The very same adaptation technique can be employed for improving the existing model for nonstationary data, where the data characteristics of an application change over time. There are at least two other ways of exploiting the existing labeled data from a similar application. We will evaluate and compare these methods to adaptation in Sect. 7.

– Simple Data Concatenation (simple): where the new classification model is trained using the data from the previous application concatenated to the data labeled for the target application.
– Tagged Data Concatenation (tagged): where the new classification model is trained using both data sets, but each set is tagged with the source application.
That is, in addition to the examples, we use the source of each example as an additional feature during classification. In our approach for adaptation, we begin with an existing classification model. Then using the labeled data from the target application we build an augmented model based on this existing model. Note that this model is not trained from scratch (hence the adaptation). The learning starts from an existing model and then augments that model. In other words, we add more iterations (hence more weak learners) to the existing model (which is nothing but a set of weak learners). This method is similar to exploiting unlabeled examples as

² Note that, although P and ρ do not give probabilities, but rather some confidence scores between 0 and 1, this function provides an estimate of distance.

presented in the previous section, where a model that fits both the manually-labeled training data and machine-labeled data is trained. In this case, the aim is to train a model that fits both a small amount of application-specific labeled data and the existing model from the similar application. More formally, the Boosting algorithm tries to minimize the following loss function:

    Σ_{i,l} [ln(1 + e^{−Y_i[l] f(x_i, l)}) + η KL(P(Y_i[l] = 1 | x_i) ‖ ρ(f(x_i, l)))]    (11)

where P(Y_i[l] = 1 | x_i) is the distribution from the initial existing model and ρ(f(x_i, l)) is the distribution from the newly constructed model as given by (9). This term is basically the distance from the existing model to the new model built with newly labeled in-domain data. Here, again, η is used to control the relative importance of these two terms and may be determined empirically on a held-out set.

6 Cost-sensitive classification

The aim of cost-sensitive classification is to extend Boosting to allow weighted classes. Our main idea is to change the criterion to choose the best weak learner at each round of Boosting. The bound given by (2) implies that in order to minimize the training error, a reasonable approach would be minimizing Z_t on each round of Boosting. This leads to a criterion for finding weak hypotheses, h_t(x, l), for a given iteration, t (Schapire and Singer 1999). Assume that the weak learners, h, make their predictions based on a partitioning of the domain X into disjoint blocks X_j. For example, if the weak learner is a decision stump, checking the absence or presence of an n-gram, there are two blocks for each class. Let c_{jl} = h(x, l) for x ∈ X_j. Define

    W_b^{jl} = Σ_{i: x_i ∈ X_j, Y_i[l] = b} D(i, l)

where b ∈ {−, +} depending on the value of Y_i[l] ∈ {−1, 1}. In other words, W_+^{jl} (W_−^{jl}) is the total weight of samples in partition j (not) labeled as l. Using this terminology:

    Z_t = Σ_j Σ_l Σ_{i: x_i ∈ X_j} D_t(i, l) e^{−Y_i[l] c_{jl}} = Σ_j Σ_l (W_+^{jl} e^{−c_{jl}} + W_−^{jl} e^{c_{jl}}).

It is then straightforward to see that the optimal c_{jl}, minimizing Z_t, is

    c_{jl} = (1/2) ln(W_+^{jl} / W_−^{jl}).

Putting this in place results in:

    Z_t = Σ_j Σ_l 2 √(W_+^{jl} W_−^{jl}).

Then it is enough to choose the weak learner that minimizes this value.
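The closed-form stump outputs and the resulting Z_t can be computed directly from the block weights. A minimal sketch, assuming the (W_+, W_−) pairs are given; the smoothing constant is our addition to guard against empty blocks:

```python
import math

def stump_outputs_and_z(blocks, smooth=1e-10):
    """blocks: list of (W_plus, W_minus) pairs, one per (partition j, class l).
    Returns the optimal confidences c_jl = 0.5 * ln(W+/W-) and the resulting Z_t."""
    c = [0.5 * math.log((wp + smooth) / (wm + smooth)) for wp, wm in blocks]
    z = sum(2.0 * math.sqrt(wp * wm) for wp, wm in blocks)
    return c, z
```

A candidate stump whose blocks separate the weight cleanly (one of W_+, W_− near zero in each block) drives Z_t toward zero, so comparing the returned z across candidate features implements the selection criterion above.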

Now assume that not all labels are equally important, that is, there is an associated cost or weight, w_l, for a given label, l. In such a case we need to define a weighted Hamming loss (WHL):

    WHL(H) = (1/(mk)) Σ_{i,l} w_l δ(x_i, l).    (12)

It is easy to see that when w_l > 1, minimizing Z_t may not be the best criterion for a weak learner to optimize the weighted Hamming loss, because the inequality

    WHL(H) ≤ Π_{t=1}^T Z_t

may not hold.

Theorem 2 With the weights of the classes being w_l, the following bound holds on the weighted Hamming loss of H:

    WHL(H) ≤ Π_{t=1}^T Y_t

if w_l ≥ 1, where

    Y_t = Σ_{l∈Y} ŵ_l Σ_{i=1}^m D_t(i, l) e^{−α_t Y_i[l] h_t(x_i, l)}

and ŵ_l is a function of w_l.

Proof By unraveling the update rule, we have that

    D_{T+1}(i, l) = e^{−Y_i[l] f(x_i, l)} / (mk Π_t Z_t).    (13)

Moreover, if H(x_i, l) ≠ Y_i[l] then Y_i[l] f(x_i, l) ≤ 0, implying that e^{−Y_i[l] f(x_i, l)} ≥ 1. Thus,

    δ(x_i, l) ≤ e^{−Y_i[l] f(x_i, l)}.    (14)

Combining (12), (13), and (14), we get

    WHL(H) = (1/(mk)) Σ_{i,l} w_l δ(x_i, l)
           ≤ (1/(mk)) Σ_{i,l} w_l e^{−Y_i[l] f(x_i, l)}
           = Σ_{i,l} w_l (Π_{t=1}^T Z_t) D_{T+1}(i, l)
           ≤ (Π_{t=1}^T Z_t) Σ_l w_l

since Σ_i D_{T+1}(i, l) ≤ 1.

Define the constant W = Σ_l w_l. Then

    WHL(H) ≤ Π_{t=1}^T (W^{1/T} Z_t) = Π_{t=1}^T Y_t

when ŵ_l = w_l W^{(1/T)−1}. Following similar steps, this leads us to a new criterion to select the weak hypotheses: Choose the weak learner that minimizes the following:

    Y_t = Σ_l ŵ_l Σ_j 2 √(W_+^{jl} W_−^{jl}).

Note that this is in contrast to the unweighted case, where the weak learner optimizes

    Z_t = Σ_j Σ_l 2 √(W_+^{jl} W_−^{jl}).

In other words, this corresponds to changing the selection criterion for the weak learner, h_t, in the AdaBoost algorithm. No change is required in the algorithm presented in Sect. 3.

7 Experiments and results

We evaluated the proposed methods using the utterances from the database of the AT&T VoiceTone spoken dialog system (Gupta et al. 2006). This is a call routing system where the aim is to route the input calls in a customer care call center. In this natural language spoken dialog system, callers are greeted by the open-ended prompt "How May I Help You?" Users then ask questions about their phone bills, calling plans, and so on. The system tries to identify the customer's intent (call-type), which is basically framed as a call classification problem similar to the literature (Chu-Carroll and Carpenter 1999; Gorin et al. 1997; Natarajan et al. 2002). If the system is unable to understand the caller with high enough confidence, then the conversation will proceed with either a clarification or a confirmation prompt. In our experiments all the utterances are transcribed. We performed our tests using the BoosTexter tool (Schapire and Singer 2000). For all experiments, we used word n-grams as features and decision stumps as weak classifiers.

7.1 Evaluation metrics

While evaluating classification performance, we used mainly two metrics, both using micro-averaging allowing multiple call-types. The first one is the top class error rate (TCER), which is the fraction of utterances in which the call-type with maximum probability was not one of the true call-types. Inspired by the information retrieval community, the second metric we used is the F-Measure, which is the harmonic mean of recall and precision.
Recall is defined as the proportion of all the true call-types that are correctly deduced by the classifier. It is obtained by dividing the number of true positives by the sum of true positives and false negatives. Precision is defined as the proportion of all the accepted call-types that are also

true. It is obtained by dividing true positives by the sum of true positives and false positives. True (false) positives are the number of call-types for an utterance for which the deduced call-type has got a confidence above a given threshold, hence accepted, and is (not) among the correct call-types. False negatives are the number of call-types for an utterance for which the deduced call-type has got a confidence less than a threshold, hence rejected, but is among the true call-types. More formally, let

    a = #{true label = +, predicted = +},
    b = #{true label = +, predicted = −},
    c = #{true label = −, predicted = +},
    d = #{true label = −, predicted = −}.

Then

    recall = a / (a + b),    precision = a / (a + c),

    F-Measure = 2 × recall × precision / (recall + precision).

One difference between these two evaluation metrics is that the top class error rate evaluates only the top-scoring call-type for an utterance, whereas the F-Measure evaluates all the call-types exceeding the given threshold. For lower thresholds, the precision is lower but recall is higher, and vice versa for higher thresholds. To optimize the F-Measure, we check its value for all thresholds between 0 and 1, and use the best one as the F-Measure of that system, since it is always possible to change the operational threshold of the system.

7.2 Semisupervised learning experiments

To evaluate the proposed semisupervised learning method, we selected a call classification application. The data characteristics are given in Table 1. In this first set of experiments, we kept the number of iterations fixed to 500. First, we selected the optimal threshold of top-scoring call-type confidences using TCER on the held-out set. Obviously, there is a trade-off in selecting the threshold. If it is set to a lower value, that means a larger amount of noisy data, and if it is set to a higher value, that means a lesser amount of useful or informative data. Figure 2 proves this behavior for the held-out set. We trained initial models using 2,000, 4,000, and 8,000 human-labeled utterances and then augmented these as described in the simple method, with the remaining data in the training set (using only machine-labeled call-types).
On the x axis, we have different thresholds to select from the unlabeled data that the classifier uses, and on the y axis we have the classification error rate (TCER) if that data is also exploited. A threshold of 0 means using all the machine-labeled data and 1 means using none. As seen, there is consistently a 1–1.5% difference in classification error

Table 1 Data characteristics used in the semisupervised learning experiments

    Training Data Size      57,829 utterances
    Test Data Size          3,513 utterances
    Held-out Data Size      3,500 utterances
    Number of Call-types    49

Fig. 2 Trade-off for choosing the threshold to select among the machine-labeled data on the held-out set

rates using various thresholds for each data size, and the lowest error rates are achieved with thresholds around 0.5.

Figure 3 depicts the performance using the proposed semisupervised learning method, by plotting the learning curves for various initial labeled data set sizes. In the figure, the x axis is the amount of human-labeled training utterances, and the y axis is the classification error rate of the corresponding model on the test set. The baseline is the top curve, with the highest error rate, where no machine-labeled data is used. To compare our proposed approach, we also employed the simple data augmentation method, where we concatenate the human-labeled and machine-labeled data. In these experiments, we selected 0.5 as the threshold for selecting machine-labeled data, and for each data size, we optimized the weight η in the proposed method using the held-out set. As in the case of the held-out set, we consistently obtained 1–1.5% classifier error rate reductions on the test set using both approaches when the labeled training data size is less than 15,000 utterances.³ The reduction in the need for human-labeled data to achieve the same classification performance is around 30%. For example, we get the same performance when we exploit machine-labeled data with 8,000 human-labeled utterances instead of 12,000 utterances. The proposed method performed consistently better (though not significantly) when there are fewer human-labeled examples. The simple semisupervised learning method outperformed the proposed method after 5,000 human-labeled utterances.

³ For this test set, 1.3% is statistically significant according to a Z-test for a 95% confidence interval.

Fig. 3 Results using semisupervised learning. The top-most learning curve is obtained using just human-labeled data as a baseline. Below that lie the learning curves using the first (concatenation of human- and machine-labeled data) and second (Boosting adaptation) methods

7.3 Model adaptation experiments

To evaluate the proposed model adaptation method, we selected two applications, T1 and T2, both from the telecommunications domain, where users have requests about their phone bills, calling plans, and so on. The first application is concierge-like and has all the call-types the second application covers. The second application is used only for a specific subset of call-types. The data properties are shown in Table 2. We computed the perplexity of the call-type probability distribution as

    Perplexity = 2^{−Σ_{c∈C} p(c) log_2 p(c)}

where p(c) is the prior probability of a call-type c ∈ C. As seen, the perplexity of the second application is significantly lower while the utterances are longer. As shown in Fig. 4 the class distributions for these two applications are significantly different. We have about 9 times more data for the first application. To avoid dealing with finding the optimal iteration numbers in Boosting, we iterated many times, got the error rate after each iteration, and used the best error rate in all the results below. In this experiment, the goal is adapting the classification model for T1 using T2 so that the resulting model for T2 would perform better. Table 3 presents the baseline results using training and test data combinations. The rows indicate the training sets, and columns indicate the test sets. The values are the classification error rates, which are the ratios of the utterances

Fig. 4  Class distributions for applications T1 (above) and T2 (below). The classes for the two applications are aligned

Table 2  Data characteristics used in the model adaptation experiments

                               T1                  T2
    Training Data Size         53,022 utterances   5,866 utterances
    Test Data Size             5,529 utterances    614 utterances
    Number of Call-Types
    Call-Type Perplexity
    Average Utterance Length   8.06 words          words

for which the classifier's top scoring class is not one of the correct call-types. The third row is simply the concatenation of both training sets (indicated by "simple"). The fourth row (indicated by "tagged") is obtained by training the classifier with an extra feature indicating the source of that utterance, either T1 or T2. The performance of the adaptation is shown in the last three rows (indicated by "adapt"). As seen, although the two applications are very similar, when the training set does not match the test set, the performance drops drastically. Adding T1 training data to T2 does not help, and actually hurts significantly.[4] This negative effect disappears when we denote the source of the training data, but no improvement has been observed on the performance of the classification model for T2. Adaptation experiments

[4] For this test set, 3% is statistically significant according to a Z-test at the 95% confidence level.

Table 3  Adaptation results for the experiments. "simple" indicates simple concatenation, "tagged" indicates using an extra feature denoting the source of training data, "adapt" indicates adaptation with different η values

    Training Set               Test Set T1    Test Set T2
    T1                                        26.87%
    T2                                        13.36%
    T1 + T2 simple             14.15%         16.78%
    T1 + T2 tagged             14.05%         13.36%
    T1 + T2 adapt (η = 0.1)    19.01%         12.54%
    T1 + T2 adapt (η = 0.5)    16.13%         14.01%
    T1 + T2 adapt (η = 0.9)    15.27%         15.96%

Fig. 5  Results using call-type classification model adaptation. The x-axis is the amount of labeled data from application T2. The top learning curve is obtained using just T2 data as a baseline. Below that lies the learning curve using the adaptation

using different η values indicate interesting results. We see that using a value of 0.1, it is actually possible to outperform the model performance trained using only T2 training data. Since we expect the proposed adaptation method to work better with less application-specific training data, we draw the learning curves as presented in Fig. 5 using 0.1 as the η value. The top curve is the baseline, obtained using random selection of only T2 training data. When we adapt the T1 model with only 1,106 utterances from T2 we see 2.5% absolute

improvement, which means a 56% reduction (from about 2,500 utterances to 1,106 utterances for an error rate of 16.77%) in the amount of data needed to achieve that performance. As the number of human-labeled utterances from application T2 increases, the difference between the random and adaptation curves gets smaller, but the adaptation curve reaches the performance obtained with using all the random data with 40% less data. Throughout the curves we see that adaptation consistently performs better than the baseline, though not significantly.

7.4 Cost-sensitive classification experiments

To evaluate the proposed cost-sensitive classification method, we again selected a call classification application. Table 4 summarizes the characteristics of our application, including the amount of training and test data, the total number of call-types, the average utterance length, and the call-type perplexity. Similar to the semisupervised learning experiments, we kept the number of Boosting iterations fixed at 500.

Table 4  Data characteristics used in the cost-sensitive classification experiments

    Training Data Size         9,094 utterances
    Test Data Size             5,171 utterances
    Number of Call-Types       84
    Call-Type Perplexity
    Average Utterance Length   words

As a first experiment, we chose one moderately frequent call-type, namely Tellme(Balance), occurring 189 times, and increased its weight while keeping all other weights at 1. Table 5 shows the change in the performance of that call-type and the overall performance. We tried various weights for Tellme(Balance), as shown in the first column of the table. As seen, the F-Measure of that call-type increased significantly, by 6.3% absolute, without a significant change in the overall (or individual class) performance(s) when the weight is set to a moderate value. Note that when we continue increasing the weight, performance begins to deteriorate because of the decrease in precision.

Table 5  Performance change when one call-type is weighted more than the others

    Weight    Overall                           Tellme(Balance)
              Recall  Precision  F-Measure     Recall  Precision  F-Measure
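The call-type perplexity reported in Tables 2 and 4 follows the entropy-based formula defined in Sect. 7.3 and can be transcribed directly; the example prior distributions below are invented for illustration.

```python
# Call-type perplexity as defined above: 2 raised to the entropy of the
# prior class distribution. The example priors are made up for illustration.
import math

def calltype_perplexity(priors):
    """Perplexity = 2 ** (-sum_c p(c) * log2 p(c)), over nonzero priors."""
    return 2 ** (-sum(p * math.log2(p) for p in priors if p > 0))

# A uniform distribution over 4 call-types has perplexity exactly 4,
# while a skewed distribution (like T2's in Table 2) is lower:
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
```

Perplexity is maximized by a uniform prior, which is why the more specialized application T2 shows a significantly lower value than the concierge-like T1.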
The vast majority of the weak learners selected for the model trained with a weight of 10,000 are related to that higher-weighted call-type, making the other call-types harder to win, and even resulting in worse performance for that call-type because of overtraining. To see whether the same behavior holds for a set of important call-types, not just one, we randomly selected 12 call-types, occurring 766 times, and gave them higher weights. Table 6 presents our results. The F-Measure for these call-types increased by 3.5% absolute without a significant loss in overall performance with a weight of 100, but note the steady decrease in overall performance with increasing weights, although the performance of the important classes continues to improve.[5]

[5] For this test set, 1.1% is statistically significant according to a Z-test at the 95% confidence level.
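One simple way class weights can enter Boosting, in the spirit of the experiments above, is by scaling the initial example distribution so that errors on important classes cost more from the first iteration on; the exact BoosTexter modification may differ in detail, and the 1-D decision stumps and toy data below are invented for illustration.

```python
# Minimal class-weighted AdaBoost with 1-D decision stumps. Class weights
# scale the initial example distribution; this is a sketch in the spirit of
# the cost-sensitive extension, not the actual BoosTexter implementation.
import math

def stumps(xs):
    """Candidate threshold stumps: sign(x - t) and its negation."""
    cands = []
    for t in sorted(set(xs)):
        cands.append(lambda x, t=t: 1 if x > t else -1)
        cands.append(lambda x, t=t: -1 if x > t else 1)
    return cands

def adaboost(xs, ys, class_weight, rounds=20):
    """ys in {-1, +1}; class_weight maps label -> importance weight."""
    d = [class_weight[y] for y in ys]      # weighted initial distribution
    d = [w / sum(d) for w in d]
    ensemble = []                          # (alpha, hypothesis) pairs
    for _ in range(rounds):
        best, best_err = None, None
        for h in stumps(xs):               # pick the lowest weighted error
            err = sum(w for x, y, w in zip(xs, ys, d) if h(x) != y)
            if best is None or err < best_err:
                best, best_err = h, err
        best_err = min(max(best_err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - best_err) / best_err)
        ensemble.append((alpha, best))
        d = [w * math.exp(-alpha * y * best(x))   # reweight examples
             for x, y, w in zip(xs, ys, d)]
        d = [w / sum(d) for w in d]
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1

# Toy usage: a separable 1-D problem, with and without an important class.
xs, ys = [0, 1, 2, 3], [1, 1, -1, -1]
clf = adaboost(xs, ys, class_weight={1: 1.0, -1: 1.0})
clf_weighted = adaboost(xs, ys, class_weight={1: 5.0, -1: 1.0})
```

On overlapping data the weighted initialization pushes the ensemble toward higher recall for the important class, which mirrors the recall gains (and the eventual precision collapse at extreme weights) reported in Tables 5 and 6.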

Table 6  Performance change when one set of call-types is weighted more than the others

    Weight    Overall                           Important Call-Types
              Recall  Precision  F-Measure     Recall  Precision  F-Measure

8 Conclusions

We have proposed three methods for extending the Boosting family of classifiers, namely, semisupervised learning, model adaptation, and cost-sensitive classification. We have shown the effectiveness of these methods in a real-life application, the AT&T VoiceTone call routing system. Note that although these methods are classifier (namely, Boosting) dependent, the ideas are more general and can be applied to other classifiers. For example, in a Naive Bayes classifier, adaptation can be implemented as linear model interpolation, or a Bayesian adaptation (like MAP) can be employed. It is also possible to apply the idea to other classification tasks that may need adaptation, such as topic classification or named entity extraction.

Our future work includes unsupervised adaptation of classification models. This will enable us to bootstrap new models without labeling any application-specific data. For cost-sensitive classification, we would like to extend this formulation so as to enable costs for pairs of classes instead of single classes.

Acknowledgements  We thank Robert E. Schapire for providing us with the BoosTexter classifier and his suggestions. We thank Dilek Hakkani-Tür, Murat Saraclar, Patrick Haffner, and Mazin Gilbert for many helpful discussions.


More information

CS286r Assign One. Answer Key

CS286r Assign One. Answer Key CS286r Assgn One Answer Key 1 Game theory 1.1 1.1.1 Let off-equlbrum strateges also be that people contnue to play n Nash equlbrum. Devatng from any Nash equlbrum s a weakly domnated strategy. That s,

More information

3. Stress-strain relationships of a composite layer

3. Stress-strain relationships of a composite layer OM PO I O U P U N I V I Y O F W N ompostes ourse 8-9 Unversty of wente ng. &ech... tress-stran reatonshps of a composte ayer - Laurent Warnet & emo Aerman.. tress-stran reatonshps of a composte ayer Introducton

More information

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton

More information

Lecture 3: Shannon s Theorem

Lecture 3: Shannon s Theorem CSE 533: Error-Correctng Codes (Autumn 006 Lecture 3: Shannon s Theorem October 9, 006 Lecturer: Venkatesan Guruswam Scrbe: Wdad Machmouch 1 Communcaton Model The communcaton model we are usng conssts

More information

Space of ML Problems. CSE 473: Artificial Intelligence. Parameter Estimation and Bayesian Networks. Learning Topics

Space of ML Problems. CSE 473: Artificial Intelligence. Parameter Estimation and Bayesian Networks. Learning Topics /7/7 CSE 73: Artfcal Intellgence Bayesan - Learnng Deter Fox Sldes adapted from Dan Weld, Jack Breese, Dan Klen, Daphne Koller, Stuart Russell, Andrew Moore & Luke Zettlemoyer What s Beng Learned? Space

More information

On the correction of the h-index for career length

On the correction of the h-index for career length 1 On the correcton of the h-ndex for career length by L. Egghe Unverstet Hasselt (UHasselt), Campus Depenbeek, Agoralaan, B-3590 Depenbeek, Belgum 1 and Unverstet Antwerpen (UA), IBW, Stadscampus, Venusstraat

More information

EM and Structure Learning

EM and Structure Learning EM and Structure Learnng Le Song Machne Learnng II: Advanced Topcs CSE 8803ML, Sprng 2012 Partally observed graphcal models Mxture Models N(μ 1, Σ 1 ) Z X N N(μ 2, Σ 2 ) 2 Gaussan mxture model Consder

More information

Uncertainty Specification and Propagation for Loss Estimation Using FOSM Methods

Uncertainty Specification and Propagation for Loss Estimation Using FOSM Methods Uncertanty Specfcaton and Propagaton for Loss Estmaton Usng FOSM Methods J.W. Baer and C.A. Corne Dept. of Cv and Envronmenta Engneerng, Stanford Unversty, Stanford, CA 94305-400 Keywords: Sesmc, oss estmaton,

More information

Correspondence. Performance Evaluation for MAP State Estimate Fusion I. INTRODUCTION

Correspondence. Performance Evaluation for MAP State Estimate Fusion I. INTRODUCTION Correspondence Performance Evauaton for MAP State Estmate Fuson Ths paper presents a quanttatve performance evauaton method for the maxmum a posteror (MAP) state estmate fuson agorthm. Under dea condtons

More information

EEE 241: Linear Systems

EEE 241: Linear Systems EEE : Lnear Systems Summary #: Backpropagaton BACKPROPAGATION The perceptron rule as well as the Wdrow Hoff learnng were desgned to tran sngle layer networks. They suffer from the same dsadvantage: they

More information

Application of support vector machine in health monitoring of plate structures

Application of support vector machine in health monitoring of plate structures Appcaton of support vector machne n heath montorng of pate structures *Satsh Satpa 1), Yogesh Khandare ), Sauvk Banerjee 3) and Anrban Guha 4) 1), ), 4) Department of Mechanca Engneerng, Indan Insttute

More information

Gaussian Mixture Models

Gaussian Mixture Models Lab Gaussan Mxture Models Lab Objectve: Understand the formulaton of Gaussan Mxture Models (GMMs) and how to estmate GMM parameters. You ve already seen GMMs as the observaton dstrbuton n certan contnuous

More information

Pulse Coded Modulation

Pulse Coded Modulation Pulse Coded Modulaton PCM (Pulse Coded Modulaton) s a voce codng technque defned by the ITU-T G.711 standard and t s used n dgtal telephony to encode the voce sgnal. The frst step n the analog to dgtal

More information

JAB Chain. Long-tail claims development. ASTIN - September 2005 B.Verdier A. Klinger

JAB Chain. Long-tail claims development. ASTIN - September 2005 B.Verdier A. Klinger JAB Chan Long-tal clams development ASTIN - September 2005 B.Verder A. Klnger Outlne Chan Ladder : comments A frst soluton: Munch Chan Ladder JAB Chan Chan Ladder: Comments Black lne: average pad to ncurred

More information

Semi-supervised Classification with Active Query Selection

Semi-supervised Classification with Active Query Selection Sem-supervsed Classfcaton wth Actve Query Selecton Jao Wang and Swe Luo School of Computer and Informaton Technology, Beng Jaotong Unversty, Beng 00044, Chna Wangjao088@63.com Abstract. Labeled samples

More information

Feature Selection: Part 1

Feature Selection: Part 1 CSE 546: Machne Learnng Lecture 5 Feature Selecton: Part 1 Instructor: Sham Kakade 1 Regresson n the hgh dmensonal settng How do we learn when the number of features d s greater than the sample sze n?

More information

Bayesian Learning. Smart Home Health Analytics Spring Nirmalya Roy Department of Information Systems University of Maryland Baltimore County

Bayesian Learning. Smart Home Health Analytics Spring Nirmalya Roy Department of Information Systems University of Maryland Baltimore County Smart Home Health Analytcs Sprng 2018 Bayesan Learnng Nrmalya Roy Department of Informaton Systems Unversty of Maryland Baltmore ounty www.umbc.edu Bayesan Learnng ombnes pror knowledge wth evdence to

More information

Global Sensitivity. Tuesday 20 th February, 2018

Global Sensitivity. Tuesday 20 th February, 2018 Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values

More information

Relevance Vector Machines Explained

Relevance Vector Machines Explained October 19, 2010 Relevance Vector Machnes Explaned Trstan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introducton Ths document has been wrtten n an attempt to make Tppng s [1] Relevance Vector Machnes

More information

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore Sesson Outlne Introducton to classfcaton problems and dscrete choce models. Introducton to Logstcs Regresson. Logstc functon and Logt functon. Maxmum Lkelhood Estmator (MLE) for estmaton of LR parameters.

More information

ON AUTOMATIC CONTINUITY OF DERIVATIONS FOR BANACH ALGEBRAS WITH INVOLUTION

ON AUTOMATIC CONTINUITY OF DERIVATIONS FOR BANACH ALGEBRAS WITH INVOLUTION European Journa of Mathematcs and Computer Scence Vo. No. 1, 2017 ON AUTOMATC CONTNUTY OF DERVATONS FOR BANACH ALGEBRAS WTH NVOLUTON Mohamed BELAM & Youssef T DL MATC Laboratory Hassan Unversty MORO CCO

More information

Predicting Model of Traffic Volume Based on Grey-Markov

Predicting Model of Traffic Volume Based on Grey-Markov Vo. No. Modern Apped Scence Predctng Mode of Traffc Voume Based on Grey-Marov Ynpeng Zhang Zhengzhou Muncpa Engneerng Desgn & Research Insttute Zhengzhou 5005 Chna Abstract Grey-marov forecastng mode of

More information

Expectation Maximization Mixture Models HMMs

Expectation Maximization Mixture Models HMMs -755 Machne Learnng for Sgnal Processng Mture Models HMMs Class 9. 2 Sep 200 Learnng Dstrbutons for Data Problem: Gven a collecton of eamples from some data, estmate ts dstrbuton Basc deas of Mamum Lelhood

More information

DISTRIBUTED PROCESSING OVER ADAPTIVE NETWORKS. Cassio G. Lopes and Ali H. Sayed

DISTRIBUTED PROCESSING OVER ADAPTIVE NETWORKS. Cassio G. Lopes and Ali H. Sayed DISTRIBUTED PROCESSIG OVER ADAPTIVE ETWORKS Casso G Lopes and A H Sayed Department of Eectrca Engneerng Unversty of Caforna Los Angees, CA, 995 Ema: {casso, sayed@eeucaedu ABSTRACT Dstrbuted adaptve agorthms

More information

Polite Water-filling for Weighted Sum-rate Maximization in MIMO B-MAC Networks under. Multiple Linear Constraints

Polite Water-filling for Weighted Sum-rate Maximization in MIMO B-MAC Networks under. Multiple Linear Constraints 2011 IEEE Internatona Symposum on Informaton Theory Proceedngs Pote Water-fng for Weghted Sum-rate Maxmzaton n MIMO B-MAC Networks under Mutpe near Constrants An u 1, Youjan u 2, Vncent K. N. au 3, Hage

More information

Correlation and Regression. Correlation 9.1. Correlation. Chapter 9

Correlation and Regression. Correlation 9.1. Correlation. Chapter 9 Chapter 9 Correlaton and Regresson 9. Correlaton Correlaton A correlaton s a relatonshp between two varables. The data can be represented b the ordered pars (, ) where s the ndependent (or eplanator) varable,

More information

Markov Chain Monte Carlo Lecture 6

Markov Chain Monte Carlo Lecture 6 where (x 1,..., x N ) X N, N s called the populaton sze, f(x) f (x) for at least one {1, 2,..., N}, and those dfferent from f(x) are called the tral dstrbutons n terms of mportance samplng. Dfferent ways

More information

Chapter 13: Multiple Regression

Chapter 13: Multiple Regression Chapter 13: Multple Regresson 13.1 Developng the multple-regresson Model The general model can be descrbed as: It smplfes for two ndependent varables: The sample ft parameter b 0, b 1, and b are used to

More information

Homework Assignment 3 Due in class, Thursday October 15

Homework Assignment 3 Due in class, Thursday October 15 Homework Assgnment 3 Due n class, Thursday October 15 SDS 383C Statstcal Modelng I 1 Rdge regresson and Lasso 1. Get the Prostrate cancer data from http://statweb.stanford.edu/~tbs/elemstatlearn/ datasets/prostate.data.

More information

Evaluation for sets of classes

Evaluation for sets of classes Evaluaton for Tet Categorzaton Classfcaton accuracy: usual n ML, the proporton of correct decsons, Not approprate f the populaton rate of the class s low Precson, Recall and F 1 Better measures 21 Evaluaton

More information

VQ widely used in coding speech, image, and video

VQ widely used in coding speech, image, and video at Scalar quantzers are specal cases of vector quantzers (VQ): they are constraned to look at one sample at a tme (memoryless) VQ does not have such constrant better RD perfomance expected Source codng

More information

INTRODUCTION TO MACHINE LEARNING 3RD EDITION

INTRODUCTION TO MACHINE LEARNING 3RD EDITION ETHEM ALPAYDIN The MIT Press, 2014 Lecture Sldes for INTRODUCTION TO MACHINE LEARNING 3RD EDITION alpaydn@boun.edu.tr http://www.cmpe.boun.edu.tr/~ethem/2ml3e CHAPTER 3: BAYESIAN DECISION THEORY Probablty

More information

Distributed Moving Horizon State Estimation of Nonlinear Systems. Jing Zhang

Distributed Moving Horizon State Estimation of Nonlinear Systems. Jing Zhang Dstrbuted Movng Horzon State Estmaton of Nonnear Systems by Jng Zhang A thess submtted n parta fufment of the requrements for the degree of Master of Scence n Chemca Engneerng Department of Chemca and

More information

Achieving Optimal Throughput Utility and Low Delay with CSMA-like Algorithms: A Virtual Multi-Channel Approach

Achieving Optimal Throughput Utility and Low Delay with CSMA-like Algorithms: A Virtual Multi-Channel Approach Achevng Optma Throughput Utty and Low Deay wth SMA-ke Agorthms: A Vrtua Mut-hanne Approach Po-Ka Huang, Student Member, IEEE, and Xaojun Ln, Senor Member, IEEE Abstract SMA agorthms have recenty receved

More information

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results. Neural Networks : Dervaton compled by Alvn Wan from Professor Jtendra Malk s lecture Ths type of computaton s called deep learnng and s the most popular method for many problems, such as computer vson

More information

A new P system with hybrid MDE- k -means algorithm for data. clustering. 1 Introduction

A new P system with hybrid MDE- k -means algorithm for data. clustering. 1 Introduction Wesun, Lasheng Xang, Xyu Lu A new P system wth hybrd MDE- agorthm for data custerng WEISUN, LAISHENG XIANG, XIYU LIU Schoo of Management Scence and Engneerng Shandong Norma Unversty Jnan, Shandong CHINA

More information

10-701/ Machine Learning, Fall 2005 Homework 3

10-701/ Machine Learning, Fall 2005 Homework 3 10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40

More information

NONLINEAR SYSTEM IDENTIFICATION BASE ON FW-LSSVM

NONLINEAR SYSTEM IDENTIFICATION BASE ON FW-LSSVM Journa of heoretca and Apped Informaton echnoogy th February 3. Vo. 48 No. 5-3 JAI & LLS. A rghts reserved. ISSN: 99-8645 www.jatt.org E-ISSN: 87-395 NONLINEAR SYSEM IDENIFICAION BASE ON FW-LSSVM, XIANFANG

More information

SDMML HT MSc Problem Sheet 4

SDMML HT MSc Problem Sheet 4 SDMML HT 06 - MSc Problem Sheet 4. The recever operatng characterstc ROC curve plots the senstvty aganst the specfcty of a bnary classfer as the threshold for dscrmnaton s vared. Let the data space be

More information