Approximation Lasso Methods for Language Modeling

Jianfeng Gao, Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA, jfgao@microsoft.com
Hisami Suzuki, Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA, hisamis@microsoft.com
Bin Yu, Department of Statistics, University of California, Berkeley, CA 94720, USA, binyu@stat.berkeley.edu

Abstract

Lasso is a regularization method for parameter estimation in linear models. It optimizes the model parameters with respect to a loss function subject to model complexities. This paper explores the use of lasso for statistical language modeling for text input. Owing to the very large number of parameters, directly optimizing the penalized lasso loss function is impossible. Therefore, we investigate two approximation methods, the boosted lasso (BLasso) and forward stagewise linear regression (FSLR). Both methods, when used with the exponential loss function, bear a strong resemblance to the boosting algorithm, which has been used as a discriminative training method for language modeling. Evaluations on the task of Japanese text input show that BLasso produces the best approximation to the lasso solution and leads to a significant improvement, in terms of character error rate, over boosting and traditional maximum likelihood estimation.

1 Introduction

Language modeling (LM) is fundamental to a wide range of applications. Recently, it has been shown that a linear model estimated using discriminative training methods, such as the boosting and perceptron algorithms, significantly outperforms a traditional word trigram model trained using maximum likelihood estimation (MLE) on several tasks such as speech recognition and Asian language text input (Bacchiani et al. 2004; Roark et al. 2004; Gao et al. 2005; Suzuki and Gao 2005).

The success of discriminative training methods is largely due to the fact that, unlike the traditional approach (e.g., MLE), which maximizes a function (e.g., the likelihood of the training data) that is only loosely associated with error rate, discriminative training methods aim to minimize the error rate on the training data directly, even if doing so reduces the likelihood. However, given a finite set of training samples, discriminative training methods could lead to an arbitrarily complex model for the purpose of achieving zero training error. It is well known that complex models exhibit high variance and perform poorly on unseen data. Therefore, some regularization method has to be used to control the complexity of the model.

Lasso is a regularization method for parameter estimation in linear models. It optimizes the model parameters with respect to a loss function subject to model complexities. The basic idea of lasso was originally proposed by Tibshirani (1996). Recently, there have been several implementations of and experiments with lasso on multi-class classification tasks, where only a small number of features need to be handled and the lasso solution can be computed directly via numerical methods. To our knowledge, this paper presents the first empirical study of lasso for a realistic, large-scale task: LM for Asian language text input. Because the task involves millions of features and training samples, directly optimizing the penalized lasso loss function is impossible. Therefore, two approximation methods, the boosted lasso (BLasso; Zhao and Yu 2004) and forward stagewise linear regression (FSLR; Hastie et al. 2001), are investigated. Both methods, when used with the exponential loss function, bear a strong resemblance to the boosting algorithm, which has been used as a discriminative training method for LM. Evaluations on the task of Japanese text input show that BLasso produces the best approximation to the lasso solution and leads to a significant improvement, in terms of character error rate, over the boosting algorithm and traditional MLE.

2 LM Task and Problem Definition

This paper studies LM for the application of Asian language (e.g. Chinese or Japanese) text input, a standard method of inputting Chinese or Japanese text by converting the input phonetic symbols into the appropriate word string. In this paper we call the task IME, which stands for input method editor, based on the name of the commonly used Windows-based application. Performance on IME is measured in terms of the character error rate (CER), which is the number of characters wrongly converted from the phonetic string divided by the number of characters in the correct transcript.

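As a concrete illustration of this metric (a minimal sketch in Python, not the evaluation script used in the experiments), CER can be computed from the character-level edit distance between a converted string and its reference transcript:

def edit_distance(hyp: str, ref: str) -> int:
    """Levenshtein (string edit) distance between two character strings."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (h != r)))   # substitution
        prev = cur
    return prev[-1]

def cer(hyp: str, ref: str) -> float:
    """Character error rate: wrongly converted characters divided by reference length."""
    return edit_distance(hyp, ref) / len(ref)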
Similar to speech recognition, IME is viewed as a Bayes decision problem. Let A be the input phonetic string. An IME system's task is to choose the most likely word string W* among the candidates that could be converted from A:

W* = arg max_{W ∈ GEN(A)} P(W|A) = arg max_{W ∈ GEN(A)} P(W) P(A|W)    (1)

where GEN(A) denotes the candidate set given A. Unlike speech recognition, however, there is no acoustic ambiguity, because the phonetic string is input by users. Moreover, we can assume a unique mapping from W to A in IME, as words have unique readings, i.e. P(A|W) = 1. So the decision of Equation (1) depends solely upon P(W), making IME an ideal evaluation test bed for LM.

In this study, the LM task for IME is formulated under the framework of linear models (e.g., Duda et al. 2001). We use the following notation, adapted from Collins and Koo (2005). Training data is a set of example input/output pairs. In LM for IME, training samples are represented as {A_i, W_i^R}, for i = 1...M, where each A_i is an input phonetic string and W_i^R is the reference transcript of A_i. We assume some way of generating a set of candidate word strings given A, denoted by GEN(A). In our experiments, GEN(A) consists of the top n word strings converted from A using a baseline IME system that uses only a word trigram model. We assume a set of D+1 features f_d(W), for d = 0...D. The features can be arbitrary functions that map W to real values. Using vector notation, we have f(W) ∈ R^{D+1}, where f(W) = [f_0(W), f_1(W), ..., f_D(W)]^T. f_0(W) is called the base feature and is defined in our case as the log probability that the word trigram model assigns to W. The other features (f_d(W), for d = 1...D) are defined as the counts of word n-grams (n = 1 and 2 in our experiments) in W. Finally, the parameters of the model form a vector of D+1 dimensions, one for each feature function: λ = [λ_0, λ_1, ..., λ_D]. The score of a word string W can be written as

Score(W, λ) = λ · f(W) = Σ_{d=0}^{D} λ_d f_d(W).    (2)

The decision rule of Equation (1) is rewritten as

W*(A, λ) = arg max_{W ∈ GEN(A)} Score(W, λ).    (3)

Equation (3) views IME as a ranking problem, where the model gives the ranking score, not probabilities. We therefore do not evaluate the model via perplexity.

Now, assume that we can measure the number of conversion errors in W by comparing it with a reference transcript W^R using an error function Er(W^R, W), which is the string edit distance function in our case. We call the sum of error counts over the training samples the sample risk. Our goal then is to search for the best parameter set λ that minimizes the sample risk, as in Equation (4):

λ_MSR = arg min_λ Σ_{i=1...M} Er(W_i^R, W*(A_i, λ)).    (4)

However, (4) cannot be optimized easily, since Er(.) is a piecewise constant (or step) function of λ and its gradient is undefined. Therefore, discriminative methods apply different approaches that optimize it approximately. The boosting algorithm described below is one such approach.

3 Boosting

This section gives a brief review of the boosting algorithm, following the description in some recent work (e.g., Schapire and Singer 1999; Collins and Koo 2005). The boosting algorithm uses an exponential loss function (ExpLoss) to approximate the sample risk in Equation (4). We define the margin of the pair (W^R, W) with respect to the model λ as

M(W^R, W) = Score(W^R, λ) − Score(W, λ)    (5)

Then, ExpLoss is defined as

ExpLoss(λ) = Σ_{i=1...M} Σ_{W ∈ GEN(A_i)} exp(−M(W_i^R, W))    (6)

Notice that ExpLoss is convex, so there is no problem with local minima when optimizing it. It is shown in Freund et al. (1998) and Collins and Koo (2005) that there exist gradient search procedures that converge to the right solution.

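To make Equations (2), (3), (5) and (6) concrete, the following minimal Python sketch scores candidates and computes ExpLoss. It is an illustration only: feature vectors are represented as small dictionaries rather than the sparse 860,000-dimensional vectors used in the experiments of Section 5.

import math

def score(feats, lam):
    # Eq. (2): Score(W, lambda) = sum_d lambda_d * f_d(W); feats maps feature index -> value.
    return sum(lam.get(d, 0.0) * v for d, v in feats.items())

def decide(gen_a, lam):
    # Eq. (3): pick the highest-scoring candidate word string in GEN(A).
    return max(gen_a, key=lambda feats: score(feats, lam))

def exp_loss(samples, lam):
    # Eq. (5)-(6): sum over training samples and their candidates of exp(-margin).
    total = 0.0
    for ref_feats, gen_a in samples:   # ref_feats = f(W_i^R); gen_a = feature vectors of GEN(A_i)
        ref_score = score(ref_feats, lam)
        for feats in gen_a:
            total += math.exp(-(ref_score - score(feats, lam)))
    return total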
1 Set λ_0 = arg min_{λ_0} ExpLoss(λ); and λ_d = 0 for d = 1...D
2 Select the feature f_{k*} that has the largest estimated impact on reducing the ExpLoss of Eq. (6)
3 Update λ_{k*} ← λ_{k*} + δ*, and return to Step 2
Figure 1: The boosting algorithm

Figure 1 summarizes the boosting algorithm we used. After initialization, Steps 2 and 3 are repeated N times; at each iteration, a feature is chosen and its weight is updated as follows. First, we define Upd(λ, k, δ) as an updated model, with the same parameter values as λ except that λ_k is incremented by δ:

Upd(λ, k, δ) = {λ_0, λ_1, ..., λ_k + δ, ..., λ_D}

Then, Steps 2 and 3 in Figure 1 can be rewritten as Equations (7) and (8), respectively.

(k*, δ*) = arg min_{k, δ} ExpLoss(Upd(λ, k, δ))    (7)

λ^t = Upd(λ^{t−1}, k*, δ*)    (8)

The boosting algorithm can be too greedy: each iteration usually reduces ExpLoss(.) on the training data, so for a large enough number of iterations this loss can be made arbitrarily small. However, fitting the training data too well eventually leads to overfitting, which degrades the performance on unseen test data (even though in boosting overfitting can happen very slowly).

Shrinkage is a simple approach to dealing with the overfitting problem. It scales the incremental step δ by a small constant ν, ν ∈ (0, 1). Thus, the update of Equation (8) with shrinkage is

λ^t = Upd(λ^{t−1}, k*, νδ*)    (9)

Empirically, it has been found that smaller values of ν lead to smaller numbers of test errors.

4 Lasso

Lasso is a regularization method for estimation in linear models (Tibshirani 1996). It regularizes, or shrinks, a fitted model through an L1 penalty or constraint. Let T(λ) denote the L1 penalty of the model, i.e., T(λ) = Σ_{d=0...D} |λ_d|. We then optimize the model λ so as to minimize a regularized loss function on the training data, called the lasso loss, defined as

LassoLoss(λ, α) = ExpLoss(λ) + α T(λ)    (10)

where T(λ) generally penalizes larger (or more complex) models, and the parameter α controls the amount of regularization applied to the estimate. Setting α = 0 reverts the LassoLoss to the unregularized ExpLoss; as α increases, the model coefficients all shrink, each ultimately becoming zero. In practice, α should be chosen adaptively to minimize an estimate of expected loss; e.g., α decreases as the number of iterations increases.

Computation of the solution to the lasso problem has been studied for special loss functions. For least squares regression, there is a fast algorithm, LARS, to find the whole lasso path for different α's (Osborne et al. 2000a; 2000b; Efron et al. 2004); for the 1-norm SVM, the problem can be transformed into a linear program with a fast algorithm similar to LARS (Zhu et al. 2003). However, the solution to the lasso problem for a general convex loss function and an adaptive α remains open. More importantly for our purposes, directly minimizing the lasso function of Equation (10) with respect to λ is not possible when a very large number of model parameters are employed, as in our task of LM for IME. Therefore, we investigate below two methods that closely approximate the effect of the lasso and are very similar to the boosting algorithm.

It is also worth noting the difference between the L1 and L2 penalties. The classical ridge regression setting uses an L2 penalty in Equation (10), i.e., T(λ) = Σ_{d=0...D} (λ_d)^2, which is much easier to minimize (for least squares loss, but not for ExpLoss). However, recent research (Donoho et al. 1995) shows that the L1 penalty is better suited to sparse situations, where only a small number of features among all candidate features have nonzero weights. We find that our task is indeed such a sparse situation: among 860,000 features, only around 5,000 features have nonzero weights in the resulting linear model. We therefore focus on the L1 penalty. We leave the empirical comparison of the L1 and L2 penalties on the LM task to future work.

4.1 Forward Stagewise Linear Regression (FSLR)

The first approximation method we used is FSLR, described in (Algorithm 10.4, Hastie et al. 2001), where Steps 2 and 3 in Figure 1 are performed according to Equations (7) and (11), respectively.

(k*, δ*) = arg min_{k, δ} ExpLoss(Upd(λ, k, δ))    (7)

λ^t = Upd(λ^{t−1}, k*, ε × sign(δ*))    (11)

Notice that FSLR is very similar to the boosting algorithm with shrinkage in that at each step, the feature f_{k*} that has the largest estimated impact on reducing ExpLoss is selected.
The only difference is that FSLR updates the weight of f_{k*} by a small fixed step size ε. By taking such small steps, FSLR imposes some implicit regularization and can closely approximate the effect of the lasso in a local sense (Hastie et al. 2001). Empirically, we find that the performance of the boosting algorithm with shrinkage closely resembles that of FSLR, with the learning rate parameter ν corresponding to ε.
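The two updates can be contrasted in a few lines of Python. This is a sketch only: best_feature_and_step is a hypothetical helper standing in for the arg min of Equation (7), i.e. the search over features and step sizes, which in practice is sped up with the techniques of Collins and Koo (2005).

def boosting_step(lam, samples, best_feature_and_step, nu=0.1):
    # One boosting iteration with shrinkage: select (k*, delta*) as in Eq. (7),
    # then apply the shrunken update of Eq. (9).
    k, delta = best_feature_and_step(lam, samples)
    lam[k] = lam.get(k, 0.0) + nu * delta
    return lam

def fslr_step(lam, samples, best_feature_and_step, eps=0.1):
    # One FSLR iteration: same selection as Eq. (7), but the fixed-size update of Eq. (11).
    k, delta = best_feature_and_step(lam, samples)
    lam[k] = lam.get(k, 0.0) + (eps if delta >= 0 else -eps)
    return lam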

4.2 Boosted Lasso (BLasso)

The second method we used is a modified version of the BLasso algorithm described in Zhao and Yu (2004). There are two major differences between BLasso and FSLR. At each iteration, BLasso can take either a forward step or a backward step. Similar to the boosting algorithm and FSLR, at each forward step a feature is selected and its weight is updated according to Equations (12) and (13).

(k*, δ*) = arg min_{k, δ = ±ε} ExpLoss(Upd(λ, k, δ))    (12)

λ^t = Upd(λ^{t−1}, k*, ε × sign(δ*))    (13)

However, there is an important difference between Equations (12) and (7). In the boosting algorithm with shrinkage and in FSLR, as shown in Equation (7), a feature is selected by its impact on reducing the loss with its optimal update δ*. In contrast, in BLasso, as shown in Equation (12), the optimization over δ is removed, and for each feature the loss is calculated with an update of either +ε or −ε, i.e., a grid search is used for feature selection. We will show later that this seemingly trivial difference brings a significant improvement.

The backward step is unique to BLasso. At each iteration, a feature is selected and its weight is updated backward if and only if doing so leads to a decrease of the lasso loss, as shown in Equations (14) and (15):

k* = arg min_{k: λ_k ≠ 0} ExpLoss(Upd(λ, k, −sign(λ_k) ε))    (14)

λ^t = Upd(λ^{t−1}, k*, −sign(λ_{k*}) ε)    (15)
    if LassoLoss(λ^{t−1}, α^{t−1}) − LassoLoss(λ^t, α^t) > θ

where θ is a tolerance parameter.

Figure 2 summarizes the BLasso algorithm we used. After initialization, Steps 4 and 5 are repeated N times; at each iteration, a feature is chosen and its weight is updated either backward or forward by a fixed amount ε. Notice that the value of α is chosen adaptively according to the reduction of ExpLoss during training. The algorithm starts with a large initial α, and then at each forward step the value of α decreases until the ExpLoss stops decreasing. This is intuitively desirable: it is expected that most highly effective features are selected in the early stages of training, so the reduction of ExpLoss at each step is more substantial in the early stages than in later stages. These early steps coincide with the boosting steps most of the time. In other words, the effect of the backward steps is more visible at later stages.

Our implementation of BLasso differs slightly from the original algorithm described in Zhao and Yu (2004). Firstly, because the value of the base feature f_0 is a log probability (assigned by a word trigram model) and has a different range from that of the other features in Equation (2), λ_0 is set to optimize ExpLoss in the initialization step (Step 1 in Figure 2) and remains fixed during training. As suggested by Collins and Koo (2005), this ensures that the contribution of the log-likelihood feature f_0 is well calibrated with respect to ExpLoss. Secondly, when updating a feature weight, if the size of the optimal update step (computed via Equation (7)) is smaller than ε, we use the optimal step to update the feature. Therefore, in our implementation BLasso does not always take a fixed step; it may take steps whose size is smaller than ε. In our initial experiments we found that both changes (also used in our implementations of boosting and FSLR) were crucial to the performance of the methods.

1 Initialize λ_0: set λ_0 = arg min_{λ_0} ExpLoss(λ), and λ_d = 0 for d = 1...D.
2 Take a forward step according to Eq. (12) and (13), and denote the updated model by λ^1
3 Initialize α = (ExpLoss(λ^0) − ExpLoss(λ^1))/ε
4 Take a backward step if and only if it leads to a decrease of LassoLoss according to Eq. (14) and (15), where θ = 0; otherwise
5 Take a forward step according to Eq. (12) and (13); update α = min(α, (ExpLoss(λ^{t−1}) − ExpLoss(λ^t))/ε); and return to Step 4.
Figure 2: The BLasso algorithm
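The loop of Figure 2 can be sketched compactly in Python as below. This is an illustration under simplifying assumptions, not the authors' implementation: it reuses the exp_loss function from the sketch in Section 3, enumerates every feature at every step (the real system relies on the sparse-update techniques of Collins and Koo 2005), and omits the special handling of the base feature λ_0 described above. Setting the tolerance theta to a very large value disables the backward steps, which corresponds to the F-Boosting baseline used later in Section 5.3.

def upd(lam, k, delta):
    # Upd(lambda, k, delta): copy of lambda with lambda_k incremented by delta.
    new = dict(lam)
    new[k] = new.get(k, 0.0) + delta
    return new

def lasso_loss(lam, samples, alpha):
    # Eq. (10): ExpLoss plus alpha times the L1 penalty T(lambda).
    return exp_loss(samples, lam) + alpha * sum(abs(v) for v in lam.values())

def forward_step(lam, samples, features, eps):
    # Eq. (12): grid search over features and the two candidate steps +eps / -eps.
    return min(((k, d) for k in features for d in (eps, -eps)),
               key=lambda kd: exp_loss(samples, upd(lam, kd[0], kd[1])))

def blasso(samples, features, eps=0.5, theta=0.0, n_iters=1000):
    lam = {}                                            # lambda_0 calibration omitted in this sketch
    loss0 = exp_loss(samples, lam)
    k, d = forward_step(lam, samples, features, eps)    # Steps 2-3 of Figure 2
    lam = upd(lam, k, d)
    alpha = (loss0 - exp_loss(samples, lam)) / eps
    for _ in range(n_iters):
        # Backward step (Eq. 14-15): move a nonzero weight one grid step toward zero
        # if that lowers LassoLoss by more than theta.
        nonzero = [k for k, v in lam.items() if v != 0.0]
        if nonzero:
            kb = min(nonzero,
                     key=lambda k: exp_loss(samples, upd(lam, k, -eps if lam[k] > 0 else eps)))
            cand = upd(lam, kb, -eps if lam[kb] > 0 else eps)
            if lasso_loss(lam, samples, alpha) - lasso_loss(cand, samples, alpha) > theta:
                lam = cand
                continue
        # Otherwise take a forward step (Eq. 12-13) and relax alpha (Step 5 of Figure 2).
        before = exp_loss(samples, lam)
        k, d = forward_step(lam, samples, features, eps)
        lam = upd(lam, k, d)
        alpha = min(alpha, (before - exp_loss(samples, lam)) / eps)
    return lam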
Zhao and Yu (2004) provide theoretical justifications for BLasso. It has been proved that (1) it is safe for BLasso to start with the initial α of Step 3, which is the largest α that would allow an ε step away from 0 (i.e., larger α's correspond to T(λ) = 0); (2) for each value of α, BLasso performs coordinate descent (i.e., reduces ExpLoss by updating the weight of one feature) until there is no descent step; and (3) whenever the value of α decreases, the lasso loss is guaranteed to be reduced. As a result, it can be proved that for a finite number of features and θ = 0, the BLasso algorithm shown in Figure 2 converges to the lasso solution as ε → 0.

5 Evaluation

5.1 Settings

We evaluated the training methods described above in the so-called cross-domain language model adaptation paradigm, in which we adapt a model trained on one domain (which we call the background domain) to a different domain (the adaptation domain), for which only a small amount of training data is available.

The data sets we used in our experiments came from five distinct sources of text. A 36-million-word Nikkei Newspaper corpus was used as the background domain, on which the word trigram model was trained. We used four adaptation domains: Yomiuri (newspaper corpus), TuneUp (balanced corpus containing newspapers and other sources of text), Encarta (encyclopedia) and Shincho (collection of novels). All corpora had been pre-word-segmented using a lexicon containing 167,107 entries. For each of the four domains, we created training data consisting of 72K sentences (0.9M~1.7M words) and test data of 5K sentences (65K~120K words) from each adaptation domain. The first 800 and 8,000 sentences of each adaptation training set were also used to show how different sizes of training data affect the performance of the various adaptation methods. Another 5K-sentence subset was used as held-out data for each domain.

We created the training samples for discriminative learning as follows. For each phonetic string A in the adaptation training data, we produced a lattice of candidate word strings W using the baseline system described in (Gao et al. 2002), which uses a word trigram model trained via MLE on the Nikkei Newspaper corpus. For efficiency, we kept only the best 20 hypotheses in the candidate conversion set GEN(A) of each training sample for discriminative training. The oracle best hypothesis, i.e. the one that gives the minimum number of errors, was used as the reference transcript of A.

We used unigrams and bigrams that occurred more than once in the training set as features in the linear model of Equation (2). The total number of candidate features we used was around 860,000.

5.2 Main Results

Table 1 summarizes the results of the various model training (adaptation) methods in terms of CER (%) and CER reduction (in parentheses) over the comparison models. In the first column, the number in parentheses next to the domain name indicates the number of training sentences used for adaptation.

Baseline, with results shown in Column 3, is the word trigram model. As expected, the CER correlates very well with the similarity between the background domain and the adaptation domain, where domain similarity is measured in terms of cross entropy (Yuan et al. 2005) as shown in Column 2.

MAP (maximum a posteriori), with results shown in Column 4, is a traditional LM adaptation method in which the parameters of the background model are adjusted so as to maximize the likelihood of the adaptation data. Our implementation takes the form of linear interpolation, as described in Bacchiani et al. (2004): P(w|h) = λP_b(w|h) + (1−λ)P_a(w|h), where P_b is the probability of the background model, P_a is the probability trained on adaptation data using MLE, and the history h corresponds to the two preceding words (i.e. P_b and P_a are trigram probabilities). λ is the interpolation weight optimized on held-out data.

Boosting, with results shown in Column 5, is the algorithm described in Figure 1. In our implementation, we use the shrinkage method suggested by Schapire and Singer (1999) and Collins and Koo (2005). At each iteration, we used the following update for the k-th feature:

δ_k = (1/2) log ((C_k^+ + εZ) / (C_k^− + εZ))    (16)

where C_k^+ is a value increasing exponentially with the sum of margins of (W^R, W) pairs over the set where f_k is seen in W^R but not in W, and C_k^− is the corresponding value for the set where f_k is seen in W but not in W^R. ε is a smoothing factor (whose value is optimized on held-out data) and Z is a normalization constant (whose value is the ExpLoss(.) of the training data according to the current model). We see that εZ in Equation (16) plays the same role as ν in Equation (9).
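Written out, the update of Equation (16) is just a smoothed log-odds step (a sketch only, assuming the per-feature statistics C_k^+, C_k^- and the normalizer Z have already been accumulated from the training lattices):

import math

def boosting_update(c_plus, c_minus, eps, z):
    # Eq. (16): smoothed weight increment for feature k; the term eps * z plays
    # the same role as the shrinkage constant nu in Eq. (9).
    return 0.5 * math.log((c_plus + eps * z) / (c_minus + eps * z))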
BLasso, with results shown in Column 6, is the algorithm described in Figure 2. We find that the performance of BLasso is not very sensitive to the selection of the step size ε across training sets of different domains and sizes. Although a small ε is preferred in theory, as discussed earlier, it would lead to very slow convergence. Therefore, in our experiments we always used a large step size (ε = 0.5) together with the so-called early stopping strategy, i.e., the number of iterations before stopping is optimized on held-out data.

In the task of LM for IME, there are millions of features and training samples, forming an extremely large and sparse matrix. We therefore applied the techniques described in Collins and Koo (2005) to speed up the training procedure. The resulting algorithms take around 15 and 30 minutes, respectively, for Boosting and BLasso to converge on a XEON MP 1.90GHz machine when training on an 8K-sentence training set.

The results in Table 1 give rise to several observations. First of all, both discriminative training methods (i.e., Boosting and BLasso) outperform MAP substantially. The improvement margins are larger when the background and adaptation domains are more similar. This phenomenon is attributed to the underlying difference between the two adaptation methods: MAP aims to improve the likelihood of a distribution, so if the adaptation domain is very similar to the background domain, the difference between the two underlying distributions is so small that MAP cannot adjust the model effectively. Discriminative methods, on the other hand, do not have this limitation, because they aim to reduce errors directly.

Secondly, BLasso outperforms Boosting significantly (p-value < 0.01) on all test sets. The improvement margins vary with the training sets of different domains and sizes. In general, the improvement of BLasso is more visible when the adaptation domain is less similar to the background domain and a larger training set is used.

Note that the CER results of FSLR are not included in Table 1 because it achieves very similar results to the boosting algorithm with shrinkage when the controlling parameters of both algorithms are optimized via cross-validation. We discuss their differences in the next section.

5.3 Discussion

This section investigates which components of BLasso bring about the improvement over Boosting. Comparing the algorithms in Figures 1 and 2, we notice three differences between BLasso and Boosting: (i) the use of backward steps in BLasso; (ii) BLasso uses a grid search (fixed step size) for feature selection in Equation (12), while Boosting uses a continuous search (optimal step size) in Equation (7); and (iii) BLasso uses a fixed step size for the feature update in Equation (13), while Boosting uses an optimal step size in Equation (8). We investigate these differences in turn.

To study the impact of backward steps, we compared BLasso with the boosting algorithm with a fixed-step search and a fixed-step update, henceforth referred to as F-Boosting. F-Boosting was implemented as in Figure 2, by setting a large value for θ in Equation (15), i.e., θ = 10^3, to prohibit backward steps. We find that although the training error curves of BLasso and F-Boosting are almost identical, their T(λ) curves grow apart with iterations, as shown in Figure 3. The results show that with backward steps, BLasso achieves a better approximation to the true lasso solution: it leads to a model with similar training errors but lower complexity (in terms of the L1 penalty).

In our experiments we find that the benefit of using backward steps is only visible in later iterations, when BLasso's backward steps kick in. A typical example is shown in Figure 4. The early steps fit the highly effective features, and in these steps BLasso and F-Boosting agree. Later steps require fine-tuning of features, and BLasso with backward steps provides a better mechanism than F-Boosting for revising the previously chosen features to accommodate this fine level of tuning. Consequently, we observe the superior performance of BLasso at later stages, as shown in our experiments.

As is well known for linear regression models, when there are many strongly correlated features, model parameters can be poorly estimated and exhibit high variance. Imposing a model size constraint, as in lasso, alleviates this phenomenon. We therefore speculate that a better approximation to lasso, such as BLasso with backward steps, would be superior at eliminating the negative effect of strongly correlated features in model estimation. To verify this speculation, we performed the following experiments.

For each training set, in addition to the word unigram and bigram features, we introduced a new type of feature called headword bigrams. As described in Gao et al. (2002), headwords are defined as the content words of the sentence. Headword bigrams therefore constitute a special type of skipping bigram, which can capture dependencies between two words that may not be adjacent.
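For illustration, headword bigram features can be extracted as below. This is a sketch only: the paper follows the headword detection of Gao et al. (2002), whereas here is_content_word is a hypothetical predicate (e.g. a simple part-of-speech or stop-word filter).

def headword_bigrams(words, is_content_word):
    # Headword bigrams are consecutive pairs over the subsequence of content words,
    # i.e. skipping bigrams that may span words that are not adjacent in the sentence.
    heads = [w for w in words if is_content_word(w)]
    return list(zip(heads, heads[1:]))

# Example with a toy content-word test:
# headword_bigrams(["the", "cat", "sat", "on", "the", "mat"],
#                  lambda w: w not in {"the", "on"})
# -> [("cat", "sat"), ("sat", "mat")]   # ("sat", "mat") skips over "on the"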
In reality, a large portion of headword bigrams are identical to word bigrams, as two headwords can occur next to each other in text. In the adaptation test data we used, we find that headword bigram features are for the most part either completely overlapping with the word bigram features (i.e., all instances of a headword bigram also count as a word bigram) or not overlapping at all (i.e., a headword bigram feature is never observed as a word bigram feature); less than 20% of headword bigram features displayed a variable degree of overlap with word bigram features. In our data, the rate of completely overlapping features is 25% to 47%, depending on the adaptation domain. From this, we can say that the headword bigram features show a moderate to high degree of correlation with the word bigram features.

We then used BLasso and F-Boosting to train linear language models including both word bigram and headword bigram features. We find that although the CER reduction from adding headword features is overall very small, the difference between the two versions of BLasso is more visible on all four test sets.

Comparing Figures 5-8 with Figure 4, it can be seen that BLasso with backward steps outperforms the variant without backward steps at much earlier stages of training and with a larger margin. For example, on the Encarta data sets, BLasso outperforms F-Boosting after around 18,000 iterations with headword features (Figure 7), as opposed to 25,000 iterations without headword features (Figure 4). These results seem to corroborate our speculation that BLasso is more robust in the presence of highly correlated features.

To investigate the impact of using a grid search (fixed step size) versus a continuous search (optimal step size) for feature selection, we compared F-Boosting with FSLR, since they differ only in their search methods for feature selection. As shown in Figures 5 to 8, although FSLR is robust in that its test errors do not increase after many iterations, F-Boosting can reach a much lower error rate on three out of four test sets. Therefore, in the task of LM for IME, where CER is the most important metric, the grid search for feature selection is more desirable.

To investigate the impact of using a fixed versus an optimal step size for the feature update, we compared FSLR with Boosting. Although both algorithms achieve very similar CER results, the performance of FSLR is much less sensitive to the selected fixed step size. For example, we can select any value from 0.2 to 0.8, and in most settings FSLR achieves very similar lowest CERs after 20,000 iterations and stays there for many iterations. In contrast, in Boosting, the optimal value of ε in Equation (16) varies with the sizes and domains of the training data and has to be tuned carefully. We thus conclude that in our task FSLR is more robust against different training settings and that a fixed step size for the feature update is preferable.

6 Conclusion

This paper investigates two approximation lasso methods for LM, applied to a realistic task with a very large number of features and a sparse feature space. Our results on Japanese text input are promising. BLasso outperforms the boosting algorithm significantly in terms of CER reduction in all experimental settings.

We have shown that this superior performance is a consequence of BLasso's backward steps and its fixed step size in both feature selection and feature weight update. Our experimental results in Section 5 show that the backward steps are vital for fine-tuning the model after the major features have been selected and for coping with strongly correlated features, while the fixed step size of BLasso is responsible for the improvement in CER and the robustness of the results. Experiments on other data sets and theoretical analysis are needed to further support the findings of this paper.

References

Bacchiani, M., Roark, B., and Saraclar, M. 2004. Language model adaptation with MAP estimation and the perceptron algorithm. In HLT-NAACL 2004. 21-24.
Collins, Michael and Terry Koo. 2005. Discriminative reranking for natural language parsing. Computational Linguistics 31(1): 25-69.
Duda, Richard O., Hart, Peter E. and Stork, David G. 2001. Pattern classification. John Wiley & Sons, Inc.
Donoho, D., I. Johnstone, G. Kerkyacharian, and D. Picard. 1995. Wavelet shrinkage: asymptopia? (with discussion). J. Royal Statist. Soc. 57: 201-337.
Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani. 2004. Least angle regression. Ann. Statist. 32: 407-499.
Freund, Y., R. Iyer, R. E. Schapire, and Y. Singer. 1998. An efficient boosting algorithm for combining preferences. In ICML 98.
Hastie, T., R. Tibshirani and J. Friedman. 2001. The elements of statistical learning. Springer-Verlag, New York.
Gao, Jianfeng, Hisami Suzuki and Yang Wen. 2002. Exploiting headword dependency and predictive clustering for language modeling. In EMNLP 2002.
Gao, J., Yu, H., Yuan, W., and Xu, P. 2005. Minimum sample risk methods for language modeling. In HLT/EMNLP 2005.
Osborne, M.R., Presnell, B. and Turlach, B.A. 2000a. A new approach to variable selection in least squares problems. Journal of Numerical Analysis, 20(3).
Osborne, M.R., Presnell, B. and Turlach, B.A. 2000b. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2): 319-337.
Roark, Brian, Murat Saraclar and Michael Collins. 2004. Corrective language modeling for large vocabulary ASR with the perceptron algorithm. In ICASSP 2004.
Schapire, Robert E. and Yoram Singer. 1999. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3): 297-336.
Suzuki, Hisami and Jianfeng Gao. 2005. A comparative study on language model adaptation using new evaluation metrics. In HLT/EMNLP 2005.
Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B, 58(1): 267-288.
Yuan, W., J. Gao and H. Suzuki. 2005. An empirical study on language model adaptation using a metric of domain similarity. In IJCNLP 05.
Zhao, P. and B. Yu. 2004. Boosted lasso. Tech Report, Statistics Department, U.C. Berkeley.
Zhu, J., S. Rosset, T. Hastie, and R. Tibshirani. 2003. 1-norm support vector machines. NIPS 16. MIT Press.

Table 1. CER (%) and CER reduction (%) (Y = Yomiuri; T = TuneUp; E = Encarta; S = Shincho)

Domain    Entropy vs. Nikkei   Baseline   MAP (over Baseline)   Boosting (over MAP)   BLasso (over MAP/Boosting)
Y (800)   7.69                 3.70       3.70 (+0.00)          3.13 (+15.41)         3.01 (+18.65/+3.83)
Y (8K)    7.69                 3.70       3.69 (+0.27)          2.88 (+21.95)         2.85 (+22.76/+1.04)
Y (72K)   7.69                 3.70       3.69 (+0.27)          2.78 (+24.66)         2.73 (+26.02/+1.80)
T (800)   7.95                 5.81       5.81 (+0.00)          5.69 (+2.07)          5.63 (+3.10/+1.05)
T (8K)    7.95                 5.81       5.70 (+1.89)          5.48 (+5.48)          5.33 (+6.49/+2.74)
T (72K)   7.95                 5.81       5.47 (+5.85)          5.33 (+2.56)          5.05 (+7.68/+5.25)
E (800)   9.30                 10.24      9.60 (+6.25)          9.82 (-2.29)          9.18 (+4.38/+6.52)
E (8K)    9.30                 10.24      8.64 (+15.63)         8.54 (+1.16)          8.04 (+6.94/+5.85)
E (72K)   9.30                 10.24      7.98 (+22.07)         7.53 (+5.64)          7.20 (+9.77/+4.38)
S (800)   9.40                 12.18      11.86 (+2.63)         11.91 (-0.42)         11.79 (+0.59/+1.01)
S (8K)    9.40                 12.18      11.15 (+8.46)         11.09 (+0.54)         10.73 (+3.77/+3.25)
S (72K)   9.40                 12.18      10.76 (+11.66)        10.25 (+4.74)         9.64 (+10.41/+5.95)

Figure 3. L1 curves: models are trained on the E(8K) dataset.
Figure 4. Test error curves: models are trained on the E(8K) dataset.
Figure 5. Test error curves: models are trained on the Y(8K) dataset, including headword bigram features.
Figure 6. Test error curves: models are trained on the T(8K) dataset, including headword bigram features.
Figure 7. Test error curves: models are trained on the E(8K) dataset, including headword bigram features.
Figure 8. Test error curves: models are trained on the S(8K) dataset, including headword bigram features.