Approximation Lasso Methods for Language Modeling

Jianfeng Gao, Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA, jfgao@microsoft.com
Hisami Suzuki, Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA, hisamis@microsoft.com
Bin Yu, Department of Statistics, University of California, Berkeley, CA 94720, USA, binyu@stat.berkeley.edu

Abstract

Lasso is a regularization method for parameter estimation in linear models. It optimizes the model parameters with respect to a loss function subject to model complexities. This paper explores the use of lasso for statistical language modeling for text input. Owing to the very large number of parameters, directly optimizing the penalized lasso loss function is impossible. Therefore, we investigate two approximation methods, the boosted lasso (BLasso) and forward stagewise linear regression (FSLR). Both methods, when used with the exponential loss function, bear a strong resemblance to the boosting algorithm, which has been used as a discriminative training method for language modeling. Evaluations on the task of Japanese text input show that BLasso produces the best approximation to the lasso solution and leads to a significant improvement, in terms of character error rate, over boosting and traditional maximum likelihood estimation.

1 Introduction

Language modeling (LM) is fundamental to a wide range of applications. Recently, it has been shown that a linear model estimated using discriminative training methods, such as the boosting and perceptron algorithms, significantly outperforms a traditional word trigram model trained using maximum likelihood estimation (MLE) on several tasks such as speech recognition and Asian language text input (Bacchiani et al. 2004; Roark et al. 2004; Gao et al. 2005; Suzuki and Gao 2005).

The success of discriminative training methods is largely due to the fact that, unlike the traditional approach (e.g., MLE), which maximizes a function (e.g., the likelihood of the training data) that is only loosely associated with error rate, discriminative training methods aim to minimize the error rate on the training data directly, even if doing so reduces the likelihood. However, given a finite set of training samples, discriminative training methods could lead to an arbitrarily complex model for the purpose of achieving zero training error. It is well known that complex models exhibit high variance and perform poorly on unseen data. Therefore, some regularization method has to be used to control the complexity of the model.

Lasso is a regularization method for parameter estimation in linear models. It optimizes the model parameters with respect to a loss function subject to model complexities. The basic idea of lasso was originally proposed by Tibshirani (1996). Recently, there have been several implementations of and experiments with lasso on multi-class classification tasks, where only a small number of features need to be handled and the lasso solution can be computed directly via numerical methods. To our knowledge, this paper presents the first empirical study of lasso for a realistic, large-scale task: LM for Asian language text input. Because the task involves millions of features and training samples, directly optimizing the penalized lasso loss function is impossible. Therefore, two approximation methods, the boosted lasso (BLasso; Zhao and Yu 2004) and forward stagewise linear regression (FSLR; Hastie et al. 2001), are investigated. Both methods, when used with the exponential loss function, bear a strong resemblance to the boosting algorithm, which has been used as a discriminative training method for LM. Evaluations on the task of Japanese text input show that BLasso produces the best approximation to the lasso solution and leads to a significant improvement, in terms of character error rate, over the boosting algorithm and traditional MLE.

2 LM Task and Problem Definition

This paper studies LM for the application of Asian language (e.g. Chinese or Japanese) text input, a standard method of inputting Chinese or Japanese text by converting the input phonetic symbols into the appropriate word string. In this paper we call the task IME, which stands for input method editor, based on the name of the commonly used Windows-based application. Performance on IME is measured in terms of the character error rate (CER), which is the number of characters wrongly converted from the phonetic string divided by the number of characters in the correct transcript.

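As a concrete illustration of this metric (a minimal sketch in Python, not the evaluation script used in the experiments), CER can be computed from the character-level edit distance between a converted string and its reference transcript:

def edit_distance(hyp: str, ref: str) -> int:
    """Levenshtein (string edit) distance between two character strings."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (h != r)))   # substitution
        prev = cur
    return prev[-1]

def cer(hyp: str, ref: str) -> float:
    """Character error rate: wrongly converted characters divided by reference length."""
    return edit_distance(hyp, ref) / len(ref)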
Similar to speech recognition, IME is viewed as a Bayes decision problem. Let A be the input phonetic string. An IME system's task is to choose the most likely word string W* among the candidates that could be converted from A:

W* = arg max_{W ∈ GEN(A)} P(W|A) = arg max_{W ∈ GEN(A)} P(W) P(A|W)    (1)

where GEN(A) denotes the candidate set given A. Unlike speech recognition, however, there is no acoustic ambiguity, because the phonetic string is input by users. Moreover, we can assume a unique mapping from W to A in IME, as words have unique readings, i.e. P(A|W) = 1. So the decision of Equation (1) depends solely upon P(W), making IME an ideal evaluation test bed for LM.

In this study, the LM task for IME is formulated under the framework of linear models (e.g., Duda et al. 2001). We use the following notation, adapted from Collins and Koo (2005). Training data is a set of example input/output pairs. In LM for IME, training samples are represented as {A_i, W_i^R}, for i = 1...M, where each A_i is an input phonetic string and W_i^R is the reference transcript of A_i. We assume some way of generating a set of candidate word strings given A, denoted by GEN(A). In our experiments, GEN(A) consists of the top n word strings converted from A using a baseline IME system that uses only a word trigram model. We assume a set of D+1 features f_d(W), for d = 0...D. The features can be arbitrary functions that map W to real values. Using vector notation, we have f(W) ∈ R^{D+1}, where f(W) = [f_0(W), f_1(W), ..., f_D(W)]^T. f_0(W) is called the base feature and is defined in our case as the log probability that the word trigram model assigns to W. The other features (f_d(W), for d = 1...D) are defined as the counts of word n-grams (n = 1 and 2 in our experiments) in W. Finally, the parameters of the model form a vector of D+1 dimensions, one for each feature function: λ = [λ_0, λ_1, ..., λ_D]. The score of a word string W can be written as

Score(W, λ) = λ · f(W) = Σ_{d=0}^{D} λ_d f_d(W).    (2)

The decision rule of Equation (1) is rewritten as

W*(A, λ) = arg max_{W ∈ GEN(A)} Score(W, λ).    (3)

Equation (3) views IME as a ranking problem, where the model gives the ranking score, not probabilities. We therefore do not evaluate the model via perplexity.

Now, assume that we can measure the number of conversion errors in W by comparing it with a reference transcript W^R using an error function Er(W^R, W), which is the string edit distance function in our case. We call the sum of error counts over the training samples the sample risk. Our goal then is to search for the best parameter set λ that minimizes the sample risk, as in Equation (4):

λ_MSR = arg min_λ Σ_{i=1...M} Er(W_i^R, W*(A_i, λ)).    (4)

However, (4) cannot be optimized easily, since Er(.) is a piecewise constant (or step) function of λ and its gradient is undefined. Therefore, discriminative methods apply different approaches that optimize it approximately. The boosting algorithm described below is one such approach.

3 Boosting

This section gives a brief review of the boosting algorithm, following the description in some recent work (e.g., Schapire and Singer 1999; Collins and Koo 2005). The boosting algorithm uses an exponential loss function (ExpLoss) to approximate the sample risk in Equation (4). We define the margin of the pair (W^R, W) with respect to the model λ as

M(W^R, W) = Score(W^R, λ) − Score(W, λ)    (5)

Then, ExpLoss is defined as

ExpLoss(λ) = Σ_{i=1...M} Σ_{W ∈ GEN(A_i)} exp(−M(W_i^R, W))    (6)

Notice that ExpLoss is convex, so there is no problem with local minima when optimizing it. It is shown in Freund et al. (1998) and Collins and Koo (2005) that there exist gradient search procedures that converge to the right solution.

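To make Equations (2), (3), (5) and (6) concrete, the following minimal Python sketch scores candidates and computes ExpLoss. It is an illustration only: feature vectors are represented as small dictionaries rather than the sparse 860,000-dimensional vectors used in the experiments of Section 5.

import math

def score(feats, lam):
    # Eq. (2): Score(W, lambda) = sum_d lambda_d * f_d(W); feats maps feature index -> value.
    return sum(lam.get(d, 0.0) * v for d, v in feats.items())

def decide(gen_a, lam):
    # Eq. (3): pick the highest-scoring candidate word string in GEN(A).
    return max(gen_a, key=lambda feats: score(feats, lam))

def exp_loss(samples, lam):
    # Eq. (5)-(6): sum over training samples and their candidates of exp(-margin).
    total = 0.0
    for ref_feats, gen_a in samples:   # ref_feats = f(W_i^R); gen_a = feature vectors of GEN(A_i)
        ref_score = score(ref_feats, lam)
        for feats in gen_a:
            total += math.exp(-(ref_score - score(feats, lam)))
    return total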
1 Set λ_0 = arg min_{λ_0} ExpLoss(λ); and λ_d = 0 for d = 1...D
2 Select the feature f_{k*} that has the largest estimated impact on reducing the ExpLoss of Eq. (6)
3 Update λ_{k*} ← λ_{k*} + δ*, and return to Step 2
Figure 1: The boosting algorithm

Figure 1 summarizes the boosting algorithm we used. After initialization, Steps 2 and 3 are repeated N times; at each iteration, a feature is chosen and its weight is updated as follows. First, we define Upd(λ, k, δ) as an updated model, with the same parameter values as λ except that λ_k is incremented by δ:

Upd(λ, k, δ) = {λ_0, λ_1, ..., λ_k + δ, ..., λ_D}

Then, Steps 2 and 3 in Figure 1 can be rewritten as Equations (7) and (8), respectively.

(k*, δ*) = arg min_{k, δ} ExpLoss(Upd(λ, k, δ))    (7)

λ^t = Upd(λ^{t−1}, k*, δ*)    (8)

The boosting algorithm can be too greedy: each iteration usually reduces ExpLoss(.) on the training data, so for a large enough number of iterations this loss can be made arbitrarily small. However, fitting the training data too well eventually leads to overfitting, which degrades the performance on unseen test data (even though in boosting overfitting can happen very slowly).

Shrinkage is a simple approach to dealing with the overfitting problem. It scales the incremental step δ by a small constant ν, ν ∈ (0, 1). Thus, the update of Equation (8) with shrinkage is

λ^t = Upd(λ^{t−1}, k*, νδ*)    (9)

Empirically, it has been found that smaller values of ν lead to smaller numbers of test errors.

4 Lasso

Lasso is a regularization method for estimation in linear models (Tibshirani 1996). It regularizes, or shrinks, a fitted model through an L1 penalty or constraint. Let T(λ) denote the L1 penalty of the model, i.e., T(λ) = Σ_{d=0...D} |λ_d|. We then optimize the model λ so as to minimize a regularized loss function on the training data, called the lasso loss, defined as

LassoLoss(λ, α) = ExpLoss(λ) + α T(λ)    (10)

where T(λ) generally penalizes larger (or more complex) models, and the parameter α controls the amount of regularization applied to the estimate. Setting α = 0 reverts the LassoLoss to the unregularized ExpLoss; as α increases, the model coefficients all shrink, each ultimately becoming zero. In practice, α should be chosen adaptively to minimize an estimate of expected loss; e.g., α decreases as the number of iterations increases.

Computation of the solution to the lasso problem has been studied for special loss functions. For least squares regression, there is a fast algorithm, LARS, to find the whole lasso path for different α's (Osborne et al. 2000a; 2000b; Efron et al. 2004); for the 1-norm SVM, the problem can be transformed into a linear program with a fast algorithm similar to LARS (Zhu et al. 2003). However, the solution to the lasso problem for a general convex loss function and an adaptive α remains open. More importantly for our purposes, directly minimizing the lasso function of Equation (10) with respect to λ is not possible when a very large number of model parameters are employed, as in our task of LM for IME. Therefore, we investigate below two methods that closely approximate the effect of the lasso and are very similar to the boosting algorithm.

It is also worth noting the difference between the L1 and L2 penalties. The classical ridge regression setting uses an L2 penalty in Equation (10), i.e., T(λ) = Σ_{d=0...D} (λ_d)^2, which is much easier to minimize (for least squares loss, but not for ExpLoss). However, recent research (Donoho et al. 1995) shows that the L1 penalty is better suited to sparse situations, where only a small number of features among all candidate features have nonzero weights. We find that our task is indeed such a sparse situation: among 860,000 features, only around 5,000 features have nonzero weights in the resulting linear model. We therefore focus on the L1 penalty. We leave the empirical comparison of the L1 and L2 penalties on the LM task to future work.

4.1 Forward Stagewise Linear Regression (FSLR)

The first approximation method we used is FSLR, described in (Algorithm 10.4, Hastie et al. 2001), where Steps 2 and 3 in Figure 1 are performed according to Equations (7) and (11), respectively.

(k*, δ*) = arg min_{k, δ} ExpLoss(Upd(λ, k, δ))    (7)

λ^t = Upd(λ^{t−1}, k*, ε × sign(δ*))    (11)

Notice that FSLR is very similar to the boosting algorithm with shrinkage in that at each step, the feature f_{k*} that has the largest estimated impact on reducing ExpLoss is selected.
The only difference is that FSLR updates the weight of f_{k*} by a small fixed step size ε. By taking such small steps, FSLR imposes some implicit regularization and can closely approximate the effect of the lasso in a local sense (Hastie et al. 2001). Empirically, we find that the performance of the boosting algorithm with shrinkage closely resembles that of FSLR, with the learning rate parameter ν corresponding to ε.
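The two updates can be contrasted in a few lines of Python. This is a sketch only: best_feature_and_step is a hypothetical helper standing in for the arg min of Equation (7), i.e. the search over features and step sizes, which in practice is sped up with the techniques of Collins and Koo (2005).

def boosting_step(lam, samples, best_feature_and_step, nu=0.1):
    # One boosting iteration with shrinkage: select (k*, delta*) as in Eq. (7),
    # then apply the shrunken update of Eq. (9).
    k, delta = best_feature_and_step(lam, samples)
    lam[k] = lam.get(k, 0.0) + nu * delta
    return lam

def fslr_step(lam, samples, best_feature_and_step, eps=0.1):
    # One FSLR iteration: same selection as Eq. (7), but the fixed-size update of Eq. (11).
    k, delta = best_feature_and_step(lam, samples)
    lam[k] = lam.get(k, 0.0) + (eps if delta >= 0 else -eps)
    return lam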

4.2 Boosted Lasso (BLasso)

The second method we used is a modified version of the BLasso algorithm described in Zhao and Yu (2004). There are two major differences between BLasso and FSLR. At each iteration, BLasso can take either a forward step or a backward step. Similar to the boosting algorithm and FSLR, at each forward step a feature is selected and its weight is updated according to Equations (12) and (13).

(k*, δ*) = arg min_{k, δ = ±ε} ExpLoss(Upd(λ, k, δ))    (12)

λ^t = Upd(λ^{t−1}, k*, ε × sign(δ*))    (13)

However, there is an important difference between Equations (12) and (7). In the boosting algorithm with shrinkage and in FSLR, as shown in Equation (7), a feature is selected by its impact on reducing the loss with its optimal update δ*. In contrast, in BLasso, as shown in Equation (12), the optimization over δ is removed, and for each feature the loss is calculated with an update of either +ε or −ε, i.e., a grid search is used for feature selection. We will show later that this seemingly trivial difference brings a significant improvement.

The backward step is unique to BLasso. At each iteration, a feature is selected and its weight is updated backward if and only if doing so leads to a decrease of the lasso loss, as shown in Equations (14) and (15):

k* = arg min_{k: λ_k ≠ 0} ExpLoss(Upd(λ, k, −sign(λ_k) ε))    (14)

λ^t = Upd(λ^{t−1}, k*, −sign(λ_{k*}) ε)    (15)
    if LassoLoss(λ^{t−1}, α^{t−1}) − LassoLoss(λ^t, α^t) > θ

where θ is a tolerance parameter.

Figure 2 summarizes the BLasso algorithm we used. After initialization, Steps 4 and 5 are repeated N times; at each iteration, a feature is chosen and its weight is updated either backward or forward by a fixed amount ε. Notice that the value of α is chosen adaptively according to the reduction of ExpLoss during training. The algorithm starts with a large initial α, and then at each forward step the value of α decreases until the ExpLoss stops decreasing. This is intuitively desirable: it is expected that most highly effective features are selected in the early stages of training, so the reduction of ExpLoss at each step is more substantial in the early stages than in later stages. These early steps coincide with the boosting steps most of the time. In other words, the effect of the backward steps is more visible at later stages.

Our implementation of BLasso differs slightly from the original algorithm described in Zhao and Yu (2004). Firstly, because the value of the base feature f_0 is a log probability (assigned by a word trigram model) and has a different range from that of the other features in Equation (2), λ_0 is set to optimize ExpLoss in the initialization step (Step 1 in Figure 2) and remains fixed during training. As suggested by Collins and Koo (2005), this ensures that the contribution of the log-likelihood feature f_0 is well calibrated with respect to ExpLoss. Secondly, when updating a feature weight, if the size of the optimal update step (computed via Equation (7)) is smaller than ε, we use the optimal step to update the feature. Therefore, in our implementation BLasso does not always take a fixed step; it may take steps whose size is smaller than ε. In our initial experiments we found that both changes (also used in our implementations of boosting and FSLR) were crucial to the performance of the methods.

1 Initialize λ_0: set λ_0 = arg min_{λ_0} ExpLoss(λ), and λ_d = 0 for d = 1...D.
2 Take a forward step according to Eq. (12) and (13), and denote the updated model by λ^1
3 Initialize α = (ExpLoss(λ^0) − ExpLoss(λ^1))/ε
4 Take a backward step if and only if it leads to a decrease of LassoLoss according to Eq. (14) and (15), where θ = 0; otherwise
5 Take a forward step according to Eq. (12) and (13); update α = min(α, (ExpLoss(λ^{t−1}) − ExpLoss(λ^t))/ε); and return to Step 4.
Figure 2: The BLasso algorithm
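The loop of Figure 2 can be sketched compactly in Python as below. This is an illustration under simplifying assumptions, not the authors' implementation: it reuses the exp_loss function from the sketch in Section 3, enumerates every feature at every step (the real system relies on the sparse-update techniques of Collins and Koo 2005), and omits the special handling of the base feature λ_0 described above. Setting the tolerance theta to a very large value disables the backward steps, which corresponds to the F-Boosting baseline used later in Section 5.3.

def upd(lam, k, delta):
    # Upd(lambda, k, delta): copy of lambda with lambda_k incremented by delta.
    new = dict(lam)
    new[k] = new.get(k, 0.0) + delta
    return new

def lasso_loss(lam, samples, alpha):
    # Eq. (10): ExpLoss plus alpha times the L1 penalty T(lambda).
    return exp_loss(samples, lam) + alpha * sum(abs(v) for v in lam.values())

def forward_step(lam, samples, features, eps):
    # Eq. (12): grid search over features and the two candidate steps +eps / -eps.
    return min(((k, d) for k in features for d in (eps, -eps)),
               key=lambda kd: exp_loss(samples, upd(lam, kd[0], kd[1])))

def blasso(samples, features, eps=0.5, theta=0.0, n_iters=1000):
    lam = {}                                            # lambda_0 calibration omitted in this sketch
    loss0 = exp_loss(samples, lam)
    k, d = forward_step(lam, samples, features, eps)    # Steps 2-3 of Figure 2
    lam = upd(lam, k, d)
    alpha = (loss0 - exp_loss(samples, lam)) / eps
    for _ in range(n_iters):
        # Backward step (Eq. 14-15): move a nonzero weight one grid step toward zero
        # if that lowers LassoLoss by more than theta.
        nonzero = [k for k, v in lam.items() if v != 0.0]
        if nonzero:
            kb = min(nonzero,
                     key=lambda k: exp_loss(samples, upd(lam, k, -eps if lam[k] > 0 else eps)))
            cand = upd(lam, kb, -eps if lam[kb] > 0 else eps)
            if lasso_loss(lam, samples, alpha) - lasso_loss(cand, samples, alpha) > theta:
                lam = cand
                continue
        # Otherwise take a forward step (Eq. 12-13) and relax alpha (Step 5 of Figure 2).
        before = exp_loss(samples, lam)
        k, d = forward_step(lam, samples, features, eps)
        lam = upd(lam, k, d)
        alpha = min(alpha, (before - exp_loss(samples, lam)) / eps)
    return lam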
Zhao and Yu (2004) provide theoretical justifications for BLasso. It has been proved that (1) it is safe for BLasso to start with the initial α of Step 3, which is the largest α that would allow an ε step away from 0 (i.e., larger α's correspond to T(λ) = 0); (2) for each value of α, BLasso performs coordinate descent (i.e., reduces ExpLoss by updating the weight of one feature) until there is no descent step; and (3) whenever the value of α decreases, the lasso loss is guaranteed to be reduced. As a result, it can be proved that for a finite number of features and θ = 0, the BLasso algorithm shown in Figure 2 converges to the lasso solution as ε → 0.

5 Evaluation

5.1 Settings

We evaluated the training methods described above in the so-called cross-domain language model adaptation paradigm, in which we adapt a model trained on one domain (which we call the background domain) to a different domain (the adaptation domain), for which only a small amount of training data is available.

The data sets we used in our experiments came from five distinct sources of text. A 36-million-word Nikkei Newspaper corpus was used as the background domain, on which the word trigram model was trained. We used four adaptation domains: Yomiuri (newspaper corpus), TuneUp (balanced corpus containing newspapers and other sources of text), Encarta (encyclopedia) and Shincho (collection of novels). All corpora had been pre-word-segmented using a lexicon containing 167,107 entries. For each of the four domains, we created training data consisting of 72K sentences (0.9M~1.7M words) and test data of 5K sentences (65K~120K words) from each adaptation domain. The first 800 and 8,000 sentences of each adaptation training set were also used to show how different sizes of training data affect the performance of the various adaptation methods. Another 5K-sentence subset was used as held-out data for each domain.

We created the training samples for discriminative learning as follows. For each phonetic string A in the adaptation training data, we produced a lattice of candidate word strings W using the baseline system described in (Gao et al. 2002), which uses a word trigram model trained via MLE on the Nikkei Newspaper corpus. For efficiency, we kept only the best 20 hypotheses in the candidate conversion set GEN(A) of each training sample for discriminative training. The oracle best hypothesis, i.e. the one that gives the minimum number of errors, was used as the reference transcript of A.

We used unigrams and bigrams that occurred more than once in the training set as features in the linear model of Equation (2). The total number of candidate features we used was around 860,000.

5.2 Main Results

Table 1 summarizes the results of the various model training (adaptation) methods in terms of CER (%) and CER reduction (in parentheses) over the comparison models. In the first column, the number in parentheses next to the domain name indicates the number of training sentences used for adaptation.

Baseline, with results shown in Column 3, is the word trigram model. As expected, the CER correlates very well with the similarity between the background domain and the adaptation domain, where domain similarity is measured in terms of cross entropy (Yuan et al. 2005) as shown in Column 2.

MAP (maximum a posteriori), with results shown in Column 4, is a traditional LM adaptation method in which the parameters of the background model are adjusted so as to maximize the likelihood of the adaptation data. Our implementation takes the form of linear interpolation, as described in Bacchiani et al. (2004): P(w|h) = λP_b(w|h) + (1−λ)P_a(w|h), where P_b is the probability of the background model, P_a is the probability trained on adaptation data using MLE, and the history h corresponds to the two preceding words (i.e. P_b and P_a are trigram probabilities). λ is the interpolation weight optimized on held-out data.

Boosting, with results shown in Column 5, is the algorithm described in Figure 1. In our implementation, we use the shrinkage method suggested by Schapire and Singer (1999) and Collins and Koo (2005). At each iteration, we used the following update for the k-th feature:

δ_k = (1/2) log ((C_k^+ + εZ) / (C_k^− + εZ))    (16)

where C_k^+ is a value increasing exponentially with the sum of margins of (W^R, W) pairs over the set where f_k is seen in W^R but not in W, and C_k^− is the corresponding value for the set where f_k is seen in W but not in W^R. ε is a smoothing factor (whose value is optimized on held-out data) and Z is a normalization constant (whose value is the ExpLoss(.) of the training data according to the current model). We see that εZ in Equation (16) plays the same role as ν in Equation (9).
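Written out, the update of Equation (16) is just a smoothed log-odds step (a sketch only, assuming the per-feature statistics C_k^+, C_k^- and the normalizer Z have already been accumulated from the training lattices):

import math

def boosting_update(c_plus, c_minus, eps, z):
    # Eq. (16): smoothed weight increment for feature k; the term eps * z plays
    # the same role as the shrinkage constant nu in Eq. (9).
    return 0.5 * math.log((c_plus + eps * z) / (c_minus + eps * z))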
BLasso, with results shown in Column 6, is the algorithm described in Figure 2. We find that the performance of BLasso is not very sensitive to the selection of the step size ε across training sets of different domains and sizes. Although a small ε is preferred in theory, as discussed earlier, it would lead to very slow convergence. Therefore, in our experiments we always used a large step size (ε = 0.5) together with the so-called early stopping strategy, i.e., the number of iterations before stopping is optimized on held-out data.

In the task of LM for IME, there are millions of features and training samples, forming an extremely large and sparse matrix. We therefore applied the techniques described in Collins and Koo (2005) to speed up the training procedure. The resulting algorithms take around 15 and 30 minutes, respectively, for Boosting and BLasso to converge on a XEON MP 1.90GHz machine when training on an 8K-sentence training set.

The results in Table 1 give rise to several observations. First of all, both discriminative training methods (i.e., Boosting and BLasso) outperform MAP substantially. The improvement margins are larger when the background and adaptation domains are more similar. This phenomenon is attributed to the underlying difference between the two adaptation methods: MAP aims to improve the likelihood of a distribution, so if the adaptation domain is very similar to the background domain, the difference between the two underlying distributions is so small that MAP cannot adjust the model effectively. Discriminative methods, on the other hand, do not have this limitation, because they aim to reduce errors directly.

Secondly, BLasso outperforms Boosting significantly (p-value < 0.01) on all test sets. The improvement margins vary with the training sets of different domains and sizes. In general, the improvement of BLasso is more visible when the adaptation domain is less similar to the background domain and a larger training set is used.

Note that the CER results of FSLR are not included in Table 1 because it achieves very similar results to the boosting algorithm with shrinkage when the controlling parameters of both algorithms are optimized via cross-validation. We discuss their differences in the next section.

5.3 Discussion

This section investigates which components of BLasso bring about the improvement over Boosting. Comparing the algorithms in Figures 1 and 2, we notice three differences between BLasso and Boosting: (i) the use of backward steps in BLasso; (ii) BLasso uses a grid search (fixed step size) for feature selection in Equation (12), while Boosting uses a continuous search (optimal step size) in Equation (7); and (iii) BLasso uses a fixed step size for the feature update in Equation (13), while Boosting uses an optimal step size in Equation (8). We investigate these differences in turn.

To study the impact of backward steps, we compared BLasso with the boosting algorithm with a fixed-step search and a fixed-step update, henceforth referred to as F-Boosting. F-Boosting was implemented as in Figure 2, by setting a large value for θ in Equation (15), i.e., θ = 10^3, to prohibit backward steps. We find that although the training error curves of BLasso and F-Boosting are almost identical, their T(λ) curves grow apart with iterations, as shown in Figure 3. The results show that with backward steps, BLasso achieves a better approximation to the true lasso solution: it leads to a model with similar training errors but lower complexity (in terms of the L1 penalty).

In our experiments we find that the benefit of using backward steps is only visible in later iterations, when BLasso's backward steps kick in. A typical example is shown in Figure 4. The early steps fit the highly effective features, and in these steps BLasso and F-Boosting agree. Later steps require fine-tuning of features, and BLasso with backward steps provides a better mechanism than F-Boosting for revising the previously chosen features to accommodate this fine level of tuning. Consequently, we observe the superior performance of BLasso at later stages, as shown in our experiments.

As is well known for linear regression models, when there are many strongly correlated features, model parameters can be poorly estimated and exhibit high variance. Imposing a model size constraint, as in lasso, alleviates this phenomenon. We therefore speculate that a better approximation to lasso, such as BLasso with backward steps, would be superior at eliminating the negative effect of strongly correlated features in model estimation. To verify this speculation, we performed the following experiments.

For each training set, in addition to the word unigram and bigram features, we introduced a new type of feature called headword bigrams. As described in Gao et al. (2002), headwords are defined as the content words of the sentence. Headword bigrams therefore constitute a special type of skipping bigram, which can capture dependencies between two words that may not be adjacent.
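For illustration, headword bigram features can be extracted as below. This is a sketch only: the paper follows the headword detection of Gao et al. (2002), whereas here is_content_word is a hypothetical predicate (e.g. a simple part-of-speech or stop-word filter).

def headword_bigrams(words, is_content_word):
    # Headword bigrams are consecutive pairs over the subsequence of content words,
    # i.e. skipping bigrams that may span words that are not adjacent in the sentence.
    heads = [w for w in words if is_content_word(w)]
    return list(zip(heads, heads[1:]))

# Example with a toy content-word test:
# headword_bigrams(["the", "cat", "sat", "on", "the", "mat"],
#                  lambda w: w not in {"the", "on"})
# -> [("cat", "sat"), ("sat", "mat")]   # ("sat", "mat") skips over "on the"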
In reality, a large portion of headword bigrams are identical to word bigrams, as two headwords can occur next to each other in text. In the adaptation test data we used, we find that headword bigram features are for the most part either completely overlapping with the word bigram features (i.e., all instances of a headword bigram also count as a word bigram) or not overlapping at all (i.e., a headword bigram feature is never observed as a word bigram feature); less than 20% of headword bigram features displayed a variable degree of overlap with word bigram features. In our data, the rate of completely overlapping features is 25% to 47%, depending on the adaptation domain. From this, we can say that the headword bigram features show a moderate to high degree of correlation with the word bigram features.

We then used BLasso and F-Boosting to train linear language models including both word bigram and headword bigram features. We find that although the CER reduction from adding headword features is overall very small, the difference between the two versions of BLasso is more visible on all four test sets.

Comparing Figures 5-8 with Figure 4, it can be seen that BLasso with backward steps outperforms the variant without backward steps at much earlier stages of training and with a larger margin. For example, on the Encarta data sets, BLasso outperforms F-Boosting after around 18,000 iterations with headword features (Figure 7), as opposed to 25,000 iterations without headword features (Figure 4). These results seem to corroborate our speculation that BLasso is more robust in the presence of highly correlated features.

To investigate the impact of using a grid search (fixed step size) versus a continuous search (optimal step size) for feature selection, we compared F-Boosting with FSLR, since they differ only in their search methods for feature selection. As shown in Figures 5 to 8, although FSLR is robust in that its test errors do not increase after many iterations, F-Boosting can reach a much lower error rate on three out of four test sets. Therefore, in the task of LM for IME, where CER is the most important metric, the grid search for feature selection is more desirable.

To investigate the impact of using a fixed versus an optimal step size for the feature update, we compared FSLR with Boosting. Although both algorithms achieve very similar CER results, the performance of FSLR is much less sensitive to the selected fixed step size. For example, we can select any value from 0.2 to 0.8, and in most settings FSLR achieves very similar lowest CERs after 20,000 iterations and stays there for many iterations. In contrast, in Boosting, the optimal value of ε in Equation (16) varies with the sizes and domains of the training data and has to be tuned carefully. We thus conclude that in our task FSLR is more robust against different training settings and that a fixed step size for the feature update is preferable.

6 Conclusion

This paper investigates two approximation lasso methods for LM, applied to a realistic task with a very large number of features and a sparse feature space. Our results on Japanese text input are promising. BLasso outperforms the boosting algorithm significantly in terms of CER reduction in all experimental settings.

We have shown that this superior performance is a consequence of BLasso's backward steps and its fixed step size in both feature selection and feature weight update. Our experimental results in Section 5 show that the backward steps are vital for fine-tuning the model after the major features have been selected and for coping with strongly correlated features, while the fixed step size of BLasso is responsible for the improvement in CER and the robustness of the results. Experiments on other data sets and theoretical analysis are needed to further support the findings of this paper.

References

Bacchiani, M., Roark, B., and Saraclar, M. 2004. Language model adaptation with MAP estimation and the perceptron algorithm. In HLT-NAACL 2004. 21-24.
Collins, Michael and Terry Koo. 2005. Discriminative reranking for natural language parsing. Computational Linguistics 31(1): 25-69.
Duda, Richard O., Hart, Peter E. and Stork, David G. 2001. Pattern classification. John Wiley & Sons, Inc.
Donoho, D., I. Johnstone, G. Kerkyacharian, and D. Picard. 1995. Wavelet shrinkage: asymptopia? (with discussion). J. Royal Statist. Soc. 57: 201-337.
Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani. 2004. Least angle regression. Ann. Statist. 32: 407-499.
Freund, Y., R. Iyer, R. E. Schapire, and Y. Singer. 1998. An efficient boosting algorithm for combining preferences. In ICML 98.
Hastie, T., R. Tibshirani and J. Friedman. 2001. The elements of statistical learning. Springer-Verlag, New York.
Gao, Jianfeng, Hisami Suzuki and Yang Wen. 2002. Exploiting headword dependency and predictive clustering for language modeling. In EMNLP 2002.
Gao, J., Yu, H., Yuan, W., and Xu, P. 2005. Minimum sample risk methods for language modeling. In HLT/EMNLP 2005.
Osborne, M.R., Presnell, B. and Turlach, B.A. 2000a. A new approach to variable selection in least squares problems. Journal of Numerical Analysis, 20(3).
Osborne, M.R., Presnell, B. and Turlach, B.A. 2000b. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2): 319-337.
Roark, Brian, Murat Saraclar and Michael Collins. 2004. Corrective language modeling for large vocabulary ASR with the perceptron algorithm. In ICASSP 2004.
Schapire, Robert E. and Yoram Singer. 1999. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3): 297-336.
Suzuki, Hisami and Jianfeng Gao. 2005. A comparative study on language model adaptation using new evaluation metrics. In HLT/EMNLP 2005.
Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B, 58(1): 267-288.
Yuan, W., J. Gao and H. Suzuki. 2005. An empirical study on language model adaptation using a metric of domain similarity. In IJCNLP 05.
Zhao, P. and B. Yu. 2004. Boosted lasso. Tech Report, Statistics Department, U.C. Berkeley.
Zhu, J., S. Rosset, T. Hastie, and R. Tibshirani. 2003. 1-norm support vector machines. NIPS 16. MIT Press.

Table 1. CER (%) and CER reduction (%) (Y = Yomiuri; T = TuneUp; E = Encarta; S = Shincho)

Domain    Entropy vs. Nikkei   Baseline   MAP (over Baseline)   Boosting (over MAP)   BLasso (over MAP/Boosting)
Y (800)   7.69                 3.70       3.70 (+0.00)          3.13 (+15.41)         3.01 (+18.65/+3.83)
Y (8K)    7.69                 3.70       3.69 (+0.27)          2.88 (+21.95)         2.85 (+22.76/+1.04)
Y (72K)   7.69                 3.70       3.69 (+0.27)          2.78 (+24.66)         2.73 (+26.02/+1.80)
T (800)   7.95                 5.81       5.81 (+0.00)          5.69 (+2.07)          5.63 (+3.10/+1.05)
T (8K)    7.95                 5.81       5.70 (+1.89)          5.48 (+5.48)          5.33 (+6.49/+2.74)
T (72K)   7.95                 5.81       5.47 (+5.85)          5.33 (+2.56)          5.05 (+7.68/+5.25)
E (800)   9.30                 10.24      9.60 (+6.25)          9.82 (-2.29)          9.18 (+4.38/+6.52)
E (8K)    9.30                 10.24      8.64 (+15.63)         8.54 (+1.16)          8.04 (+6.94/+5.85)
E (72K)   9.30                 10.24      7.98 (+22.07)         7.53 (+5.64)          7.20 (+9.77/+4.38)
S (800)   9.40                 12.18      11.86 (+2.63)         11.91 (-0.42)         11.79 (+0.59/+1.01)
S (8K)    9.40                 12.18      11.15 (+8.46)         11.09 (+0.54)         10.73 (+3.77/+3.25)
S (72K)   9.40                 12.18      10.76 (+11.66)        10.25 (+4.74)         9.64 (+10.41/+5.95)

Figure 3. L1 curves: models are trained on the E(8K) dataset.
Figure 4. Test error curves: models are trained on the E(8K) dataset.
Figure 5. Test error curves: models are trained on the Y(8K) dataset, including headword bigram features.
Figure 6. Test error curves: models are trained on the T(8K) dataset, including headword bigram features.
Figure 7. Test error curves: models are trained on the E(8K) dataset, including headword bigram features.
Figure 8. Test error curves: models are trained on the S(8K) dataset, including headword bigram features.