Conditional Random Fields in Speech, Audio and Language Processing


Eric Fosler-Lussier,* Senior Member, IEEE, Yanzhang He, Preethi Jyothi, and Rohit Prabhavalkar, Graduate Student Member, IEEE

(Invited Paper)

Abstract: Conditional random fields (CRFs) are probabilistic sequence models that have been applied in the last decade to a number of applications in audio, speech, and language processing. In this article, we provide a tutorial overview of CRF technologies, pointing to other resources for more in-depth discussion; in particular, we describe the common linear-chain model as well as a number of common extensions within the CRF family of models. An overview of the mathematical techniques used in training and evaluating these models is also provided, as well as a discussion of the relationships with other probabilistic models. Finally, we survey recent work in speech, audio, and language processing to show how the same CRF technology can be deployed in different scenarios.

Index Terms: Random fields, statistical learning, automatic speech recognition, natural language processing

EDICS Category: SPE-GASR, SPE-RECO, SLP-LANG

I. INTRODUCTION

In many speech, audio, and language processing problems, a system must extract a higher-level description of a sequential input signal; the exact type of input and extracted information depends on the problem definition. As one example, consider determining whether word sequences in running text correspond to a person or place (or neither): in isolation, we can identify that "Johnston," by virtue of its capitalization in English, is likely a proper noun, but it could be a person, city, or street. Similarly, the abbreviation "Dr." is ambiguous between "Doctor" and "Drive." In sequence, however, we can combine this evidence and predict that "Dr. Johnston" is likely a person, while "Johnston Dr." is likely a street name.

Probabilistic sequence modeling has long been used to tackle a variety of these speech, audio and language tasks: for example, the well-known Hidden Markov Model (HMM) has been the backbone of automatic speech recognition (ASR) research for decades. In this tutorial, we explore Conditional Random Fields (CRFs), a model that allows for the combination of local information to predict a global probability model over sequences. Originally proposed in 2001 [1], this model has been applied to a significant number of tasks across these domains.

Conditional Random Fields are really a class of models: one purpose of this article is to document the specific properties of CRFs across different implementations. In the original Lafferty et al. paper [1], a CRF is defined as a discriminative model that predicts the global probability of a sequence of random variables, assuming a first-order Markov dependency between the variables (also known as a linear-chain model); the probability structure depends on a log-linear combination of evidence features derived from the input. This is typically the first style of model that comes to mind when CRFs are mentioned. However, there are many modeling choices that are encapsulated in the dense description above.

*Corresponding author. The authors are with the Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA (e-mail: {fosler,hey,jyothi,prabhava}@cse.ohio-state.edu). Manuscript received April 6, 2012; revised September 3, 2012; accepted December 27.
Several researchers have sought to extend CRFs in one or more directions because of the particular problem they are working with: examples include developing richer graph structures that model collections of variables, introducing hidden structure into the model, changing the optimization criteria, or adjusting the probability structure to include binary switching variables rather than pure log-linear probabilities. In this article, we take the view that all of these variants fall under the general class of CRFs, and we provide a roadmap that can help readers understand the properties of CRFs, both in the canonical linear-chain case and in some of the extensions that have been studied in the literature.

CRFs were originally developed in the machine learning community and evaluated using Natural Language Processing (NLP) tasks; as we discuss in Section VI, the model has since found utility in the speech and audio processing areas. This article also explores the relationships of CRFs to other types of models utilized in these other fields (particularly Hidden Markov Models and Multi-Layer Perceptrons (MLPs)): knowing these relationships can allow researchers to cross-fertilize techniques, such as Deep-Learning CRFs [2], which hybridize deep learning techniques for MLPs with the sequence modeling of CRFs. We also point to various resources for further reading, including additional tutorial material [3] and work describing the relationships between HMMs and CRFs [4], [5].

In the next section, we describe the historical development of CRFs; we also discuss the particular properties that are attributed to linear-chain CRFs and how those properties have been extended in subsequent research. Section III gives a more in-depth view of the mathematics behind linear-chain CRFs, including training and decoding algorithms, defining feature functions for describing observations, and methods of avoiding overfitting. We then extend the discussion to graphical structures that are not linear-chain in Section IV, and examine the relationships to other classifier technologies in Section V. With this background in mind, we then turn to specific applications in audio, speech, and language processing in Section VI, followed by concluding remarks.

[Fig. 1. The ball-and-urn problem: three urns with associated transition probabilities and observation distributions over yellow (speckled), red (solid) or blue (striped) balls. A ball is drawn from an urn and the selector transitions to the next urn. Each urn is equally likely to be selected as the first urn to draw from.]

II. HISTORICAL PERSPECTIVE AND ANALYSIS

The initial impetus for the development of CRFs was the desire to construct direct, discriminative models of hidden sequences that depend on the observations. In this section, we first describe the progression from Hidden Markov Models (HMMs) to Maximum Entropy Markov Models (MEMMs) and then to Conditional Random Fields (CRFs) as laid out by Lafferty et al. [1]. To illustrate the models, we draw on an example from the classic HMM paper of Rabiner [6], and show how the parameterizations of the three models differ. After providing the historical overview of CRF development, we break down the model and describe its component properties, which allows us to understand more fully the assumptions in the model described in [1], as well as to see how variants of the model might fit into the CRF family.

A. Why were CRFs developed?

We begin our discussion of CRF development by recalling Rabiner's tutorial on Hidden Markov Models [6]. Rabiner motivates the power of the HMM statistical model using a ball-and-urn analogy: consider a set of n urns that hold balls of several colors; the distributional mix of ball colors differs from urn to urn. At every time step, a ball is selected from an urn and its color announced (and then replaced), and then with some probability the selector moves to a different urn; this process continues to generate a sequence of observations (Figure 1). Crucially, which urn the ball is selected from is hidden from the observer; only the sequence of ball colors is reported to the observer.

Throughout this paper we will be using a sequence S = {s_1, s_2, ..., s_T} to represent the hidden states of various problems (usually the quantity that one would like to predict) and O = {o_1, o_2, ..., o_T} to represent a sequence of observation variables. In Section VI, we introduce each problem by defining the values that S and O can take on. For example, Rabiner's ball-and-urn problem can be specified as:

BALL-AND-URN PROBLEM
S: Sequence of urns
O: Color of balls

In order to understand the assumptions in the ball-and-urn problem better, let us lay out the statistical assumptions made in the HMM. Briefly stated, HMMs compute a joint probability between a set of hidden labels (S) and observations (O):

P(S, O) = \prod_{t=1}^{T} P(s_t | s_{t-1}) P(o_t | s_t),    (1)

where the joint probability is composed of local probabilities over the state sequence transitions (P(s_t | s_{t-1})) and observation probabilities (P(o_t | s_t)). (For time t = 1, we assume an unconditional prior distribution over the starting state, P(s_1 | s_0) = P(s_1); equivalently, we can add a start symbol and assume P(s_0 = start) = 1. We carry this notational convenience throughout the rest of the article.) The corresponding graphical model for this factorization can be seen in Figure 2(a). The HMM can be considered a generative model in that the distributions for observations are determined by the states: the states are seen as generating the observations. (It is important to draw the distinction here between models and training criteria: generative models can be trained using discriminative criteria that improve end-task performance.) We assume that the distributions remain stationary over time, which is true for the ball-and-urn problem, since we replace the ball after selecting it.

To make this more concrete, let us pick some distributions over yellow, blue, and red balls for n = 3 urns, as depicted in Figure 1. Equation 1 can be used to compute the probability distribution over joint (S, O) sequences; for example, the probability of one particular urn sequence paired with the observations YYRBBRBYR evaluates to 2.7e-7.
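To make Equation 1 concrete, here is a minimal Python sketch that scores a state/observation pair under a small discrete HMM. The transition and emission tables are illustrative placeholders rather than the actual distributions depicted in Figure 1, so the resulting probability will not match the 2.7e-7 value quoted above.

```python
import numpy as np

# Minimal sketch of Equation 1 for a 3-urn, 3-color HMM.
# All probabilities below are illustrative placeholders, not the
# values from Figure 1 of the paper.
start = np.array([1/3, 1/3, 1/3])              # P(s_1): each urn equally likely
trans = np.array([[0.6, 0.2, 0.2],             # P(s_t | s_{t-1})
                  [0.2, 0.6, 0.2],
                  [0.2, 0.2, 0.6]])
emit = np.array([[0.7, 0.2, 0.1],              # P(o_t | s_t); columns = Y, B, R
                 [0.2, 0.7, 0.1],
                 [0.1, 0.1, 0.8]])
colors = {'Y': 0, 'B': 1, 'R': 2}

def hmm_joint_prob(states, obs):
    """P(S, O) = prod_t P(s_t | s_{t-1}) P(o_t | s_t), with P(s_1 | s_0) = P(s_1)."""
    p = start[states[0]] * emit[states[0], colors[obs[0]]]
    for t in range(1, len(states)):
        p *= trans[states[t - 1], states[t]] * emit[states[t], colors[obs[t]]]
    return p

print(hmm_joint_prob([0, 0, 1, 1, 2, 2, 1, 0, 2], "YYRBBRBYR"))
```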
For recognition tasks where we want to find a sequence S given a fixed O, finding the S that maximizes P(S, O) also maximizes the conditional probability P(S | O) = P(S, O)/P(O). The HMM decomposition of the joint probability requires finding a good generation model of the likelihood of observations given each individual label, P(o_t | s_t). McCallum et al. observed that this seems antithetical to the idea of predicting a label sequence: in essence, we want to directly compute the conditional label posterior P(S | O) by predicting the current label given the previous label and the current observation [7]:

P(S | O) = \prod_{t=1}^{T} P(s_t | s_{t-1}, o_t).    (2)

The corresponding graphical model, referred to as a Maximum Entropy Markov Model (MEMM), can be seen in Figure 2(b); note the inversion of the arrow directions, which indicates that the hidden state distributions now depend on the observations. Observations are now taken as fixed (and hence are shaded in the figure). Table I demonstrates some interesting properties of the observation-dependent transition probability distribution corresponding to the ball-and-urn problem.

[Fig. 2. Graphical models for three formalisms describing factorizations of relationships between S and O: (a) HMM graphical model describing P(S, O); (b) MEMM graphical model describing P(S | O); (c) linear-chain CRF graphical model describing P(S | O). Hidden Markov Models (HMMs) treat observations as random variables with associated probability distributions, and thus they are unshaded. Maximum Entropy Markov Models (MEMMs) and Conditional Random Fields (CRFs) treat observations (shaded) as fixed and thus do not model their distribution. The CRF factorization given here mimics the MEMM factorization in that there is an observational dependence on the transitions.]

TABLE I
MEMM ball-and-urn probability distributions for non-initial states, assuming IID distribution of observations.

s_{t-1}   o_t   P(s_t = {1, 2, 3} | s_{t-1}, o_t)
1         Y     {0.92, 0.02, 0.07}
1         B     {0.42, 0.37, 0.21}
1         R     {0.80, 0.10, 0.10}
2         Y     {0.37, 0.42, 0.21}
2         B     {0.02, 0.92, 0.07}
2         R     {0.10, 0.80, 0.10}
3         Y     {0.18, 0.03, 0.80}
3         B     {0.03, 0.18, 0.80}
3         R     {0.10, 0.10, 0.80}
(Distributions do not sum to one because of rounding.)

Observing yellow when the previous selection came from urn 1 is a good indicator that the current selection also comes from urn 1 (and similarly for blue and urn 2). Red is a poor discriminator of which urn a ball came from: when red is observed, the distribution reverts to the prior P(s_t | s_{t-1}).

Why might it be advantageous to model transitions dependent on observations? The HMM model assumes that observations are drawn from a state independently and identically distributed (IID), which might not hold for all situations. (Speech observations are not IID, for example: two adjacent frames in a steady-state vowel will be relatively close in observation space.) Consider a modification to the ball-and-urn rule: the selector continues to draw from the same urn until she draws a red ball, at which point one of the other two urns is selected at the next turn (with equal probability). Under these rules, P(11, RY) = 0, but the HMM will give a non-zero probability. (The HMM can model this situation by subdividing the state space into two states for each urn, a "stay with this urn" state and a "leave this urn" state, and keeping separate distributions for each.)

Since it can be difficult to predict a current label conditioned on both the previous label and the current observation, McCallum et al. use a maximum entropy distribution to model the component probabilities, in which an exponential model is used to relate the observations and previous label to the current label via a set of weighted descriptive feature functions [7] (the notation here differs somewhat from that in [7] in that the previous state is explicitly represented, following notation in subsequent work [1]):

P(s_t | s_{t-1}, o_t) = \frac{1}{Z(o_t, s_{t-1})} \exp\left( \sum_{i=1}^{\#func} \lambda_i f_i(o_t, s_{t-1}, s_t) \right),    (3)

where #func is the number of feature functions in the model, and Z(o_t, s_{t-1}) = \sum_{s_t} \exp\left( \sum_{i=1}^{\#func} \lambda_i f_i(o_t, s_{t-1}, s_t) \right) is the partition function, or normalization constant, that ensures that the probability distribution sums to one. The feature functions of the MEMM can be as simple as binary indicators of particular transitions between labels (ignoring the observations), or complicated functions of the observation; the feature functions must be defined by the system designer and should reflect knowledge of the problem at hand. The learned weights λ_i indicate the importance of each function in relating a label to the previous label and current observation. (Multinomial logistic regression has exactly the same mathematical form as maximum entropy classification; in both systems the weights have exactly the same role and are estimated similarly [8].) For the ball-and-urn problem, a single set of binary indicator feature functions can be used to relate Equation 3 to the probabilities in Table I. For example, to model P(s_t = 1 | s_{t-1} = 1, o_t = Y) we can define a feature function:

f_1(o_t, s_t, s_{t-1}) = \begin{cases} 1 & \text{if } o_t = Y, s_t = 1, s_{t-1} = 1 \\ 0 & \text{otherwise.} \end{cases}    (4)

Similarly, we can have indicator functions f_2 select for (o_t = Y, s_t = 2, s_{t-1} = 1), f_3 select for (o_t = Y, s_t = 3, s_{t-1} = 1), etc.
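A minimal sketch of this indicator-feature parameterization of Equation 3, assuming one binary feature per (previous state, observation, current state) triple as in Equation 4; the weights are placeholders (all zero here), and the local partition function normalizes over the current state. The text that follows shows how specific log-probability weights recover Table I.

```python
import math
from itertools import product

STATES = [1, 2, 3]          # urns
COLORS = ['Y', 'B', 'R']

# One binary indicator feature per (s_prev, o, s_cur) triple, as in Eq. 4;
# lam holds the corresponding weights (placeholder values for illustration).
lam = {(sp, o, s): 0.0 for sp, o, s in product(STATES, COLORS, STATES)}

def score(s_prev, o, s_cur):
    """Sum of weighted features firing for this configuration (here, exactly one)."""
    return lam[(s_prev, o, s_cur)]

def memm_local_prob(s_prev, o, s_cur):
    """Eq. 3: locally normalized over the current state s_t."""
    z = sum(math.exp(score(s_prev, o, s)) for s in STATES)
    return math.exp(score(s_prev, o, s_cur)) / z

# e.g., set lam[(1, 'Y', 1)] = math.log(0.92), etc., to reproduce Table I.
print(memm_local_prob(1, 'Y', 1))   # 1/3 with all-zero weights
```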
By setting λ_1 = log(0.92), λ_2 = log(0.02), λ_3 = log(0.07), and so forth, we can encode the probabilities in Table I using this weighted feature representation. (Note that the Z-term will always sum to one in this case, since we already have a normalized probability distribution.)

However, one potential downfall of MEMMs is what is known as the label bias problem [9]. Factoring P(S | O) into terms of P(s_t | s_{t-1}, o_t) means that if there are very few states s_t that can follow s_{t-1}, then the role of o_t in distinguishing them is very diminished. In the extreme case, when only one value of s_t (say a) can follow s_{t-1} = b, then P(s_t = a | s_{t-1} = b, o_t) = 1 for all o_t, and thus P(s_t ≠ a | s_{t-1} = b, o_t) = 0: the impact of o_t is completely ignored by the model. (Whether this is a problem in any particular domain is really an empirical question.)

In essence, the problem arises because every state is required to be locally normalized so that its outgoing transitions form a probability distribution (cf. the denominator in Equation 3).

Conditional Random Fields (CRFs) were first introduced by Lafferty et al. [1] to address the label-bias problem found in MEMMs. CRFs are still conditional models (i.e., they directly compute P(S | O)), but they handle the label bias problem by jointly modeling the S sequence and globally normalizing over the entire sequence structure. The CRF factorization for an HMM-like structure (often termed a linear-chain CRF) can be given by:

P(S | O) = \frac{1}{Z(O)} \prod_{t=1}^{T} \phi_{state}(s_t, o_t) \, \phi_{trans}(s_t, s_{t-1}, o_t),    (5)

where Z(O) is a normalization term which ensures that Equation 5 forms a valid probability distribution, and the φ(·) are non-negative potential functions over state/observation configurations. (Non-linear-chain CRFs are discussed in Section IV; we also restrict observations here to a single frame o_t for convenience of comparison to HMMs, but in general potential functions can be defined over arbitrary input O.) State potentials associate observations with a particular state; transition potentials associate pairs of states with observations. Similar to MEMMs, one can model these as exponential-family functions, thus ensuring that each potential is positive:

\phi_{state}(s_t, o_t) = \exp\left( \sum_{i=1}^{\#func} \lambda_i f_i(s_t, o_t) \right),    (6)

\phi_{trans}(s_t, s_{t-1}, o_t) = \exp\left( \sum_{i=1}^{\#func} \lambda_i f_i(s_t, s_{t-1}, o_t) \right).    (7)

Figure 2(c) shows the graphical model corresponding to this linear-chain CRF. Note that the model is undirected (indicating relationships between nodes defined by potential functions rather than conditional probability distributions); the cliques in the graph structure define the domain of the potential functions associated with the global probability distribution. (A clique is a set of nodes forming a fully connected subgraph of a graph; in Figure 2(c), S_1 and O_1 form one clique, while S_1, S_2 and O_2 form another.) As a generalization, one can simply define all functions over the maximal clique, φ(s_t, s_{t-1}, o_t), merging Equation 6 into Equation 7, and allow the associated functions to ignore any subset of the variables, providing for definitions of state or transition bias functions that ignore the observations.

The conditional probabilities of the MEMM can be used to define one parameter setting for the ball-and-urn problem: for any pair of states and observation found in Table I, we can define a potential function φ(s_t, s_{t-1}, o_t) = exp(λ_i f_i(s_t, s_{t-1}, o_t)), where λ_i = log P(s_t | s_{t-1}, o_t) and f_i(s_t, s_{t-1}, o_t) = 1 iff o_t is observed and we are hypothesizing a transition from s_{t-1} to s_t (for example, the weight corresponding to φ(1, 2, Y) would be λ_i = log(0.37)). This represents only one possible parameter setting corresponding to the same probability distribution: because potential functions in CRFs are only constrained to be non-negative (unlike MEMMs, which require all transitions from a state to sum to 1 as a probability distribution), there are multiple potential function values that can lead to the same global probability distribution [5]. (Consider the case where all potential functions φ are multiplied by a constant c; this just scales the unnormalized potential score of any sequence of length T by c^T, thus leaving the posterior P(S | O) unchanged.)
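To contrast this global normalization with the MEMM's local normalization, the following sketch scores entire label sequences using a single merged potential over the maximal clique (Equations 5-7) and computes Z(O) by brute-force enumeration, which is only feasible for toy sequences; the efficient forward-backward computation is described in Section III. The weights are again illustrative placeholders.

```python
import math
from itertools import product

STATES = [1, 2, 3]
START = 0                                  # virtual start state s_0

# Merged clique potential phi(s_t, s_{t-1}, o_t) = exp(sum_i lam_i f_i);
# one indicator feature per configuration, with placeholder weights.
lam = {}                                   # e.g., lam[(1, 1, 'Y')] = 0.5
def log_phi(s_cur, s_prev, o):
    return lam.get((s_cur, s_prev, o), 0.0)

def unnormalized_score(states, obs):
    """exp of the summed clique potentials along one labeling (numerator of Eq. 5)."""
    total, prev = 0.0, START
    for s, o in zip(states, obs):
        total += log_phi(s, prev, o)
        prev = s
    return math.exp(total)

def crf_posterior(states, obs):
    """Eq. 5 with Z(O) computed by brute-force enumeration (toy sequences only)."""
    z = sum(unnormalized_score(seq, obs)
            for seq in product(STATES, repeat=len(obs)))
    return unnormalized_score(states, obs) / z

print(crf_posterior((1, 1, 2), "YYB"))     # uniform (1/27) with all-zero weights
```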
In short, there are two basic changes going from MEMMs to CRFs. First, the observations O have an impact both on the hidden states individually (Equation 6) and on the transitions between the states (Equation 7). Second, note the change in the normalization constant Z(O): instead of local normalization for each state (as in the MEMM), we accumulate potentials over all possible S sequences. For linear-chain CRFs, there is an efficient forward-backward computation of Z(O) outlined by Lafferty et al., who also demonstrate improvement over MEMMs on a part-of-speech tagging task [1].

B. What are the properties of CRFs?

We presented above the historical progression that introduced CRFs into the natural language processing community, and described various concepts that were an integral part of the development of that line of research. We can tease apart four interrelated parts within a CRF system; different users of CRFs often stress one part or another as a desirable property over some baseline model (e.g., HMMs), and in fact some claims are really combinations of multiple properties. Take the claim that Lafferty et al. make about conditional models such as CRFs [1]: "A conditional model specifies the probabilities of possible label sequences given an observation sequence. Therefore, it does not expend modeling effort on the observations, which at test time are fixed anyway." In this quote, there is a conflation between two different (but interrelated) aspects of the statistical system: model structure and training criteria (which we define below). Lafferty et al. suggest that the generative model structure of HMMs (where observations are considered to be generated by hidden states) requires them to expend modeling effort on capturing all of the observed data. However, this is only true when HMMs are trained with a generative criterion (e.g., Maximum Likelihood), which maximizes the likelihood of the observations given the states. Discriminative training criteria allow the association between observations and hidden states to be focused on separating hidden states, rather than on modeling the data [10]. Heigold et al. argue quite elegantly that HMMs and CRFs with the same model structure will give the same theoretical results given equivalent training criteria [5]. This is elaborated further in Section V.

We can decompose a CRF into four components: model structure, parameterization, inference methods, and parameter estimation.

Model structure: The structure of a model is given by the random variables in the system and the connections (factors) that the modeler wants to draw between the variables. For an HMM, there is one factor for the relationship between observations O and hidden states S (f(o_t, s_t) = P(o_t | s_t)), and another factor for the relationship between hidden states (f(s_t, s_{t-1}) = P(s_t | s_{t-1})). Heigold et al. point out that when the CRF does not have observational dependence on transitions, the factorization of a linear-chain CRF and an HMM is identical (state observations are factors f(o_t, s_t), and transition functions are factors of the form f(s_{t-1}, s_t)), and they provide proofs that HMMs and CRFs of this form can be transformed into one another [5]. In Rabiner's ball-and-urn example, the discrete HMM probability distribution can be represented in a log-linear CRF by having a set of observation functions f(s_t = n, o_t = c) = 1 iff the state value is n and the observation is c at time t (and 0 otherwise), with the weight λ_{s=n,o=c} = log P(o = c | s = n). A similar weighted bias feature can be set up for the transition probabilities as well.

However, for linear-chain CRFs that have transition dependence on observations (similar to MEMMs), factorization equivalence with HMMs no longer holds. As pointed out above, such CRFs have feature functions that relate states to observations (f(s_t, o_t)) as well as separate functions that relate transitions to observations (e.g., f(s_t, s_{t-1}, o_t)); this factorization allows for a different decomposition of the joint probability of the hidden variables. In short, the HMM assumes that s_{t-1} is independent of o_t given s_t: there is no way for the standard HMM to represent a covariance of the three variables. (Again, the state space of the HMM can be augmented to incorporate this dependence.)

In general, whether using directed graphical models (e.g., HMMs and other Dynamic Bayesian Networks (DBNs)) or undirected graphical models (CRFs), the key elements of the model structure are the factorizations represented by the graph. For undirected models, the factorization is given in terms of the cliques of the graph. For directed models, the factorization is given by the dependence between a particular variable and its parents. It is possible to use richer graph structures beyond those seen in Figure 2(c); for example, Hidden Conditional Random Fields (HCRFs) introduce intermediate hidden variables into the graph that marginalize over different sub-state configurations. These richer models are explored further in Section IV. Another point of departure for CRFs is that Lafferty et al.'s definition generally allows for any dependence on the observations (f(s_t, O), f(s_{t-1}, s_t, O)), which would require explicit dependences and generative probability distributions in an HMM-like DBN representation. What makes this feasible is that the O are generally fixed observations, rather than treated as random variables; this distinction has implications for the probability distributions that are modeled.

Parameterization: While the factorizations of the probability distributions of the HMM and simple CRFs can be equivalent, the parametric forms of the models typically are not the same. In an HMM, the state transition factor P(s_t | s_{t-1}) is typically represented by a stochastic transition matrix, and the observation factor P(o_t | s_t) is often modeled with a conditional probability table for discrete observations, or a distribution such as a mixture of Gaussians for continuous observations. For a CRF, factors are typically combined using a log-linear model (cf. Equations 5-7 above).
In general, the joint probability of hidden variables in any Markov Random Field is given by defining positive potential functions over the configurations of cliques in the graph. By defining the potential function as the exponential of a sum of weighted features (a log-linear model), we can ensure that every potential is positive. It is important to note that we are indicating the typical forms of probability distributions for HMMs and CRFs. One could specify a different form for P(o_t | s_t) in an HMM, for example using discrete vector quantization probabilities. Similarly, one could use models outside of the exponential family to derive clique potentials in a CRF. Yet most CRF implementations take the form of the log-linear model because it is relatively easy to estimate parameters according to the Conditional Maximum Likelihood criterion.

Parameter estimation: The typical method for estimating parameters is to iteratively update them with one of a number of methods (e.g., iterative scaling or gradient ascent) to optimize the Conditional Maximum Likelihood (CML) of the correct S sequence conditioned on O, directly optimizing P(S | O). Again, the Lafferty et al. criticisms of HMMs are based on the traditional method of Maximum Likelihood training, which optimizes P(S, O) via the expectation-maximization (EM) algorithm [63]. However, other discriminative techniques for training HMMs have become popular in fields such as ASR; as we discuss later, the CML criterion for CRF training is equivalent to the Maximum Mutual Information Estimation (MMIE) criterion for training discriminative HMMs. (Different fields will use the terms CML or MMIE training; for consistency with previous CRF papers, we will refer to this criterion as CML in this article.) As a practical matter, CML training of linear-chain CRFs, which utilizes a forward-backward recursion similar to that of ML training for HMMs, can train discriminative models relatively efficiently in terms of space; many implementations of MMIE (and other criteria) for HMMs require the generation of competitor lattices, which can be cumbersome to store. We discuss the CML training criterion further in Section III.

Inference mechanism: The forward-backward recurrence mentioned above is a nice property of linear-chain CRFs: like an HMM, linear-chain CRFs have exact inference mechanisms, both in terms of the alpha-beta recurrences for parameter updating and in terms of a Viterbi algorithm for determining best paths (also discussed in Section III). However, like non-polytree DBNs, CRFs with richer structures may require inference that is exponential in the number of hidden variables. Fortunately, approximate inference mechanisms, such as Monte Carlo sampling or beam-width pruning techniques, can be successfully applied to richer CRF structures, such as Factorial CRFs [11]. Such algorithms are described further in Section IV.

By considering these four components as separate, but interrelated, contributions to CRF technology, one can consider how variants of the original linear-chain CRF model may fit into the CRF family.

In the next section, we take a closer look at probability distributions, parameter estimation, and inference in linear-chain CRFs, followed by a discussion of models that move beyond simple linear-chain structures.

III. THE MATHEMATICS BEHIND LINEAR-CHAIN CRFS

The mathematical formalism for training and decoding conditional random fields has been extensively discussed elsewhere [1], [4], [11], [12], [13]; in this section, we provide a summary of the mathematics of parameter estimation and inference for linear-chain CRFs for those who are familiar with traditional HMM techniques. We begin by describing the process for estimating the parameters of a linear-chain CRF in Section III-A1. We then describe the process for decoding the most likely state assignment given a trained CRF in Section III-A2. We defer a discussion of inference in non-linear-chain structured CRFs until Section IV. In Section III-B, we describe some of the issues involved in defining feature functions for CRF systems. We end this section with a discussion of various techniques that have been employed to avoid overfitting to the training data in Section III-C.

A. Training and decoding

In Section II-A, we noted that the probability distribution P(S | O) (over an entire sequence) is defined over the cliques in the undirected graph representing the CRF [14]. For a linear-chain CRF, Equations 5-7 show that the factors of the probability distribution are defined over single states and pairs of states. More generally, we can rewrite Equations 5-7 as follows:

P(S | O) = \frac{\prod_{t=1}^{T} \phi(s_t, s_{t-1}, O, t)}{Z(O)}
         = \frac{\prod_{t=1}^{T} \exp\left( \sum_i \lambda_i f_i(s_t, s_{t-1}, O, t) \right)}{Z(O)}
         = \frac{\exp\left( \sum_{t=1}^{T} \sum_i \lambda_i f_i(s_t, s_{t-1}, O, t) \right)}{Z(O)},    (8)

where i ranges over the set of possible descriptions of the input O, and S corresponds to the unknown state sequence. Some functions f_i can be degenerate functions that ignore some of their inputs: this enables CRFs to have bias terms for states and transitions (ignoring the input O), or to be state-only functions (ignoring s_{t-1}); generally, functions that are conditioned on the input O will only utilize a window of O around the current frame t (and sometimes only o_t).

1) Estimating parameters of linear-chain CRFs: The objective of the linear-chain CRF is, given a set of observations O and the corresponding states S, to maximize the conditional likelihood of the labels given the observations, P(S | O). (One should take care not to confuse conditional maximum likelihood, which maximizes P(S | O), with the more traditional maximum likelihood, which is often cast in terms of maximizing the joint probability P(S, O) or the data likelihood P(O | S).) Given Equation 8, the question is how to estimate the λ_i parameters, which weight the contribution of each function to the probability of the sequence S.

To estimate the parameters, we try to find the parameters that maximize P(S | O), where S is the correct label sequence; increasing P(S | O) will inherently decrease the probability P(S′ | O) of any incorrect sequence S′ (see the relationship to MMI estimation in Section V-A). To do this, optimizing the CML criterion takes partial derivatives of the log probability L = log P(S | O) with respect to the parameters (∂ log P(S | O)/∂λ_i), forming the likelihood gradient ∇L:

\nabla L = \nabla_\lambda \log P(S | O) = \left[ \frac{\partial \log P(S | O)}{\partial \lambda_1}, \ldots, \frac{\partial \log P(S | O)}{\partial \lambda_n} \right]^\top,    (9)

and then finds the zero of the gradient, which maximizes P(S | O), since it is concave [11]. (Given a training set of N independent and identically distributed examples, T = {(S^1, O^1), ..., (S^N, O^N)}, the conditional log-likelihood of the entire training set can be written as L = \sum_j \log P(S^j | O^j), and thus ∇L = \sum_j \nabla_\lambda \log P(S^j | O^j). Since the gradient of the log-likelihood for the entire training set is obtained by summing together the gradients for all training examples, the derivations in this section only describe the gradient computation for a single training example.)

Computing the gradient: Let the vector of functions f_i(s_t, s_{t-1}, O, t) at a particular time t be denoted f(S, O, t), and the vector of weights λ = [λ_i]. Since λ does not depend on t, following the conventions of [13] we can use F(S, O) = \sum_{t=1}^{T} f(S, O, t) to rewrite P(S | O) (Equation 8) and L:

P(S | O) = \frac{\exp(\lambda \cdot F(S, O))}{Z(O)}, \qquad L = \lambda \cdot F(S, O) - \log Z(O).    (10)
For any one particular λ_i, the partial derivative of the log likelihood is:

\frac{\partial L}{\partial \lambda_i} = F_i(S, O) - \frac{\partial \log Z(O)}{\partial \lambda_i}
 = F_i(S, O) - \sum_{S'} \frac{\exp(\lambda \cdot F(S', O))}{Z(O)} F_i(S', O)
 = F_i(S, O) - \sum_{S'} P(S' | O) F_i(S', O),    (11)

and thus the gradient is

\nabla_\lambda L = F(S, O) - \sum_{S'} P(S' | O) F(S', O).    (12)

What this equation says is that when the difference between the observed feature values and the expected value of the features according to the model's probability distribution is zero, we have reached the maximum conditional likelihood. The gradient over an entire training corpus can be computed by summing the gradients for each utterance. For those used to more traditional ASR terminology, Hifny and Renals [15] cast the log-likelihood derivative as a difference between stochastic counts of observations of the correct hypothesis and of all incorrect hypotheses: for state-level features, the state occupancy probabilities (typically known as gamma probabilities in HMM parlance [6]) determine the weighting of each observation; summing these weighted observations over the correct path and over all paths gives the two types of counts.

The remaining question is how to efficiently compute the expected value of the features, that is, the second term in Equation 12. For those acquainted with computing parameters of Gaussian acoustic models in HMM systems, the answer is exactly analogous: compute the local probability of being in a state s_t at a particular time t (for state-only features) or the local probability of a transition between s_{t-1} and s_t (for transition features), and multiply in the value of the observation feature to get the expected value. Sha and Pereira have a particularly nice formulation that sums the potentials for partial sequences in matrices via alpha-beta recurrences [13]: build a transition matrix M_t such that

M_t[s_{t-1}, s_t] = \exp(\lambda \cdot f(s_t, s_{t-1}, O, t)).    (13)

The value of each element M_t[s_{t-1}, s_t] of the transition matrix is the weighted contribution of all functions related to that transition. (Here we assume that state functions are folded into this matrix by adding a state vector to each column of the matrix, although it is more efficient to factor the state functions separately in the subsequent alpha-beta computations.) We can compute an alpha-beta recurrence, where the row vector alpha represents the accumulated potentials (not probabilities) from the beginning of the sequence, and the row vector beta represents the backward-accumulated potentials from the end of the sequence (again using notation from [13]):

\alpha_t = \begin{cases} \alpha_{t-1} M_t & 0 < t \le T \\ \mathbf{1} & t = 0 \end{cases}, \qquad
\beta_t^\top = \begin{cases} M_{t+1} \beta_{t+1}^\top & 1 \le t < T \\ \mathbf{1} & t = T \end{cases}    (14)

Given the alphas and betas, we can now efficiently compute both the expectation of the features and the normalization term Z(O):

\sum_{S'} P(S' | O) F(S', O) = \sum_t \frac{\alpha_{t-1} (f_t \circ M_t) \beta_t^\top}{Z(O)}, \qquad \text{where } Z(O) = \alpha_T \mathbf{1}^\top,    (15)

and the ∘ operator denotes component-wise multiplication.

Using the gradient for parameter estimation: Finding the zero of the gradient function directly is intractable, so many optimization techniques have been used for parameter estimation. Various papers have compared the different types of techniques in the NLP domain [3], [13], and the reader is referred to those works for a longer discussion. The original approach used by Lafferty et al. [1] is an iterative scaling technique (in particular, the Improved Iterative Scaling algorithm, or IIS) and is reminiscent of Maximum Entropy classifier training techniques [16]. This approach makes weight updates at every pass through the training data that are guaranteed to increase the posterior P(S | O); the size of the update is related to the ratio of the empirical feature count to the expected count under the current model. Statistics can be collected in parallel across the corpus (much like standard HMM techniques), allowing for parallelized training. However, iterative scaling techniques are generally slow. Hifny and Renals [15] propose a variant of the IIS algorithm called the Approximate Iterative Scaling (AIS) algorithm, in which each parameter λ_i is only allowed to join the model (i.e., deviate from zero) when the absolute value of the partial derivative for that λ_i is greater than some threshold α. When a sparse feature representation is used (Section III-B), this can lead to a significant acceleration in training.
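A minimal numpy sketch of Equations 13-15, assuming the weighted feature scores λ·f(s_t, s_{t-1}, O, t) have already been computed into an array of per-frame log-potential matrices; it uses straightforward (non-log-space) arithmetic, so it is only safe when the weighted sums stay small (see the discussion of overflow in Section III-B).

```python
import numpy as np

def forward_backward(log_M):
    """log_M: shape (T, K, K); log_M[t, i, j] = lambda . f(s_t=j, s_{t-1}=i, O, t).
    Returns Z(O), the alphas, the betas, and the edge marginals P(s_{t-1}=i, s_t=j | O)."""
    T, K, _ = log_M.shape
    M = np.exp(log_M)                       # Eq. 13: potential matrices
    alpha = np.ones((T + 1, K))             # alpha_0 = 1
    beta = np.ones((T + 1, K))              # beta_T = 1
    for t in range(1, T + 1):               # Eq. 14, forward
        alpha[t] = alpha[t - 1] @ M[t - 1]
    for t in range(T - 1, -1, -1):          # Eq. 14, backward
        beta[t] = M[t] @ beta[t + 1]
    Z = alpha[T].sum()                      # Z(O) = alpha_T . 1
    # Edge marginals used for the expected feature counts in Eq. 15:
    edge_marg = np.stack([np.outer(alpha[t], beta[t + 1]) * M[t] / Z for t in range(T)])
    return Z, alpha, beta, edge_marg

# Toy example: 4 frames, 3 states, random weighted-feature scores.
rng = np.random.default_rng(0)
Z, alpha, beta, edge_marg = forward_backward(rng.normal(size=(4, 3, 3)))
print(Z, edge_marg[0].sum())                # each edge-marginal table sums to 1
```

Multiplying each edge marginal by the corresponding feature values and summing over time yields the expected feature counts required by Equation 12.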
Several authors have pointed out that iterative scaling techniques are generally very slow to converge because they do not pay attention to additional information, such as the second derivative of the likelihood [3], [12], [13], [17]. The Limited-Memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm [18] estimates the second derivative of the likelihood (i.e., what will increase the likelihood the fastest) by keeping track of previous gradients. Conjugate Gradient methods have also been used to modify the update equation so that the update direction is a linear combination of the gradient and the previous search direction [13].

In contrast to the second-order techniques described above, other techniques used to speed convergence include stochastic gradient techniques [4] and Rprop [19]. In stochastic gradient-based techniques, the trainer randomly samples the corpus and updates the weights after collecting the gradient for some small number n of utterances (often n = 1). The net effect of having more frequent weight updates is that the weights converge much faster (on some tasks, more than an order of magnitude faster). The disadvantage of this technique is that for small n, one cannot effectively parallelize over training utterances; it may be more efficient to collect gradients from a training set using a large compute cluster and then use the batch update techniques listed above. A different strategy for fast-convergence batch training is embodied in the Rprop algorithm [19]: rather than directly using the gradient, Rprop uses the sign of each component of the gradient to determine the direction of weight adjustment. The amount of adjustment does not depend on the gradient itself, but rather on a step size that increases as long as the sign of the partial derivative does not change; the magnitude of the weight update decreases (and its sign changes) when the gradient changes sign, so that the system comes increasingly close to the zero of the gradient. Rprop is designed to be robust against large changes in the gradient (which can be caused by feature scaling issues; see Section III-B). While this algorithm was developed for training Multi-Layer Perceptrons, it has been found to be effective in training CRFs as well [20].

While this article has focused mostly on training with the CML criterion, it is possible to train CRFs with other optimization criteria. (Discriminative training techniques for HMMs also have a close connection to CRF training; see [21], [22] for a review.) As an example, the perceptron update [23] replaces the expected feature count in Equation 11 with the features corresponding to the current best hypothesis, effectively requiring a Viterbi decoding rather than the full forward decoding described above.

This has the effect of minimizing prediction error on the training set: once the best hypothesis is the same as the reference hypothesis, the system no longer updates the weights. Other optimization criteria, such as the minimum classification error (MCE) criterion [24] or a minimum tag error criterion (approximating the sentence labeling error and inspired by the popular minimum phone error (MPE) criterion [25] in speech recognition) [26], can also be used to train the model. An alternative approach to training CRFs is based on maximizing the margin between the reference labeling and all competing labelings [27]. We elaborate further on these large-margin techniques in Section III-C.

2) Decoding linear-chain CRFs: In decoding, we are interested in computing the most likely assignment to the variables S given O,

S^* = \arg\max_S P(S | O) = \arg\max_S \frac{\exp\left( \sum_{t=1}^{T} \sum_i \lambda_i f_i(s_t, s_{t-1}, O, t) \right)}{Z(O)}.    (16)

Since the normalization constant Z(O) = \sum_{S'} \exp\left( \sum_{t=1}^{T} \sum_i \lambda_i f_i(s_t, s_{t-1}, O, t) \right) remains constant relative to an input O for any hypothesis S, computing the most likely assignment S^* does not require the computation of Z(O):

S^* = \arg\max_S \exp\left( \sum_{t=1}^{T} \sum_i \lambda_i f_i(s_t, s_{t-1}, O, t) \right).    (17)

Although the maximization in Equation 17 involves an exponential number of state sequences S, the problem can be efficiently decomposed over the edges of the linear-chain CRF, thus allowing S^* to be computed efficiently with dynamic programming using the Viterbi algorithm. In particular, computing S^* can be accomplished by a recursion similar to the alpha-beta recurrence in Equation 14,

\delta_t[s] = \begin{cases} \max_{s'} M_t[s', s] \, \delta_{t-1}[s'] & 0 < t \le T \\ 1 & t = 0. \end{cases}    (18)

Notice that this recursion is identical to the alpha recursion presented in Equation 14, with summation replaced by maximization. Once the δ_t[s] have been computed for all t and s, the most likely assignment of state labels, S^*, can be read off starting from time T:

s^*_t = \begin{cases} \arg\max_s \delta_T[s] & t = T \\ \arg\max_s M_{t+1}[s, s^*_{t+1}] \, \delta_t[s] & 1 \le t < T. \end{cases}    (19)

B. Defining feature functions for log-linear representations

As noted above, log-linear models are a popular way to realize potential functions in CRFs; we make heavy reference to the functions f_i(S, O, t) that relate the observations to states at various time steps t. The definition of the functions varies from problem to problem (we describe how feature functions are defined for different problems in Section VI), but some general observations can be made about the differing classes of features.

The features described for the ball-and-urn problem in Section II-A were based on discrete observations; direct models of observations are represented via binary functions representing the presence (f_i = 1) or absence (f_i = 0) of the observation, such as "is the previous word X?" [28], [29]. More general binary functions can be developed based on the domain: for example, in developing a part-of-speech tagger for German, observing that a word is capitalized and not sentence-initial may be a strong indicator that the word has a nominal form; words in English ending in -ing are more likely to be participles. Binary functions also play a role when observations are not involved: binary functions that are defined only on the state-space variables are known as bias terms, and can be thought of as playing an equivalent role to the Markovian prior in an HMM.
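A minimal sketch of such hand-defined binary feature functions for a toy English part-of-speech setup; the particular predicates (capitalization, an -ing suffix, a determiner-noun transition bias) and the tag names are illustrative choices, not a feature set drawn from the cited systems.

```python
# Each feature function maps (s_prev, s_cur, obs_words, t) to 0 or 1.
def f_cap_propernoun(s_prev, s_cur, words, t):
    # capitalized, non-sentence-initial word tagged as a proper noun
    return int(t > 0 and words[t][:1].isupper() and s_cur == "NNP")

def f_ing_participle(s_prev, s_cur, words, t):
    # word ending in -ing tagged as a participle
    return int(words[t].lower().endswith("ing") and s_cur == "VBG")

def f_det_noun_bias(s_prev, s_cur, words, t):
    # transition bias term: ignores the observations entirely
    return int(s_prev == "DT" and s_cur == "NN")

FEATURES = [f_cap_propernoun, f_ing_participle, f_det_noun_bias]

def weighted_score(lambdas, s_prev, s_cur, words, t):
    """lambda . f(s_t, s_{t-1}, O, t) for one clique, as used inside Eqs. 8 and 13."""
    return sum(lam * f(s_prev, s_cur, words, t)
               for lam, f in zip(lambdas, FEATURES))

print(weighted_score([1.2, 0.8, 0.5], "DT", "NN", ["the", "running", "Dog"], 2))
```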
When the observations are continuous (for example, in speech recognition applications, where the observations may be a multidimensional representation of the spectrum such as MFCCs), the modeler needs to consider how to represent them in the log-linear model. One approach is to use the features directly: for example, when the original inputs and their squares are used, the feature representation provides the sufficient statistics for a Gaussian model [4], [5]. An example of this conversion is discussed in Section V. The quadratic transformation of MFCC features in this system can be thought of as an expansion of the original MFCC features to a higher-dimensional space: this is necessary because the phonetic classes are not log-linearly separable. It is possible to combine discrete and continuous observations in one system; e.g., Chen and Chueh [30] augmented the observation statistics above with n-gram statistics over the state symbols.

It is also possible to use external classifiers to convert the original features into a feature representation that may be more amenable to log-linear classification. One approach is to use the log likelihoods and their partial derivatives from Maximum Likelihood trained Gaussians to provide a scoring space relating continuous observations to the underlying states [31], [32]. Posterior classifiers have also been used to assign the probability of putatively important events describing the input, giving a softer version of the binary indicator function [33].

Log likelihoods of sequence models are usually represented in weighted finite state transducers (WFSTs) for efficient decoding. Learning of WFSTs can be achieved by encoding the arc weights as the features of a CRF, which is trained with the CML criterion or the perceptron update. Such a discriminative learning method effectively allows for joint training of different models composed with WFSTs, e.g., acoustic models, pronunciation models and language models for speech recognition [34], [35], [36].

Binary indicator functions on discrete inputs, especially when the observation space is large, lead to a relatively sparse representation (many feature functions are zero for a particular input). The continuous functions described above, however, tend to be lower-dimensional but dense: most features are non-zero. While theoretically this has no impact on the algorithm, from an engineering perspective it is important: with sparse features, it is more efficient to use sparse matrix multiplication when computing the recurrences in Equation 14, whereas dense matrix multiplication packages can speed computation on a multicore machine for dense features. (For those using off-the-shelf toolkits, it may be important to pay attention to the internal algorithm to ensure that it matches the desired feature space.)

However, continuous observations can also have sparse representations: for example, finding the k highest-scoring Gaussians [37], [15] or vector-quantizing the observation space [38] can lead to sparse spaces, which can motivate particular learning algorithms such as the AIS algorithm described above [15].

An interesting thread running through this line of research is that features can be derived from a first-pass statistical model, which in effect forms a set of basis functions for a higher-dimensional space to facilitate classification: divisions between phonetic classes that are not linearly separable in a low-dimensional space may be separable in the higher-dimensional space. In that way, it is reminiscent of kernel-based techniques for dimensionality expansion, such as that used in Support Vector Machine classification [39]. In fact, Lafferty et al. [40] explicitly use feature functions that are generated via kernel methods, yielding a Kernel CRF (KCRF). In a KCRF, a basis function is used to construct feature functions from the training data. Since this can result in a large number of feature functions, a greedy selection algorithm selects a representative sample from the possible feature functions that fit the training set. This technique showed improvements over kernel-based logistic regression models for protein sequencing [40].

Feature functions in CRFs need not be defined a priori, but may be learned automatically in a process that is analogous to feature induction in the hidden layer of a multilayer perceptron; features at the input layer are projected non-linearly into a higher-dimensional space [41], [42], [43]. These approaches are closely related to work by Kingsbury [44], which discusses training multilayer perceptrons with sequential classification criteria such as MMI or MPE for neural network acoustic modeling. Peng et al. [41] observe large gains over standard CRFs on a protein structure prediction task using this approach.

Extensions of the linear-chain CRF model based on recent work on deep architectures in machine learning have also been explored in the literature. Deep belief networks [45] incorporate multiple hidden layers, each of which successively transforms the input representation into an intermediate hidden-layer representation used as the input to the next layer. A powerful feature of these models is their ability to learn these hidden representations in a completely unsupervised fashion; once the hidden-layer weights have been learned, the entire network can be trained discriminatively in a supervised manner. In the deep-structured conditional random field [46], first- or zeroth-order linear-chain CRFs are stacked together so that the marginal probabilities obtained from the outputs of a lower layer serve as the inputs to the higher layer. The model can be further extended to the case where label sequences are known but the corresponding segmentations are unknown [2].

In concluding the discussion on features, it is helpful to consider some practical aspects of feature design in these systems. It is clear from the discussion above that CRFs can utilize a wide range of feature function definitions. However, the interaction between features and learning rates is crucial.
One important consideration is feature scaling: it is often beneficial to have most feature activations lie within a limited range (say, [0, 1] as for posteriors, or [-k, +k] for small k for mean- and variance-normalized MFCCs), since bias terms for states and transitions are typically set to 1; having features in the same range enables the system designer to use the same learning rate for all weight updates. The relationship between learning rates and feature scale is approximately inverse: consider a single-feature system with posterior exp(λf)/Z(o) and a scaled version exp(λ·2f)/Z(o). To express the same distribution, the 2f system will require λ to be half the size; correspondingly smaller steps in changing λ (i.e., a smaller learning rate) are required. Wiesler and Ney have shown, with theoretical analysis and empirical justification, that mean and variance normalization and decorrelation of features are important for the good convergence behavior of gradient-based optimization algorithms for log-linear models [47].

The weighted sum of the features across a sequence also becomes an important consideration. If the weighted sum exceeds 700 or so within a sequence, computing the forward-backward probabilities will overflow double-precision floating point using the straightforward implementation described in Section III-A1. (In our experience, standard bookkeeping techniques that renormalize the alpha-beta computation help, but only up to a certain point; the main issue becomes the computation of Z(O), where the renormalizations must be taken into account. This issue is discussed in the HMM context in [6].) This is typically not an issue for the natural language community, where sequences are short, but for the ASR community utterances can easily exceed thousands of frames. Converting all of the operations to log-space has the advantage of allowing many more features, at the cost of training speed: the matrix-multiply operations in Equation 14 become much more cumbersome.
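A minimal sketch of the log-space alternative mentioned above: the alpha recursion of Equation 14 rewritten with log-sum-exp, so that large accumulated scores never leave log-space and cannot overflow. The price is that the fast matrix multiply becomes an explicit reduction, as noted in the text.

```python
import numpy as np
from scipy.special import logsumexp

def log_forward(log_M):
    """Log-space version of the alpha recursion in Eq. 14.
    log_M[t, i, j] = lambda . f(s_t=j, s_{t-1}=i, O, t); returns log Z(O)."""
    T, K, _ = log_M.shape
    log_alpha = np.zeros(K)                       # log of alpha_0 = 1
    for t in range(T):
        # log_alpha_t[j] = logsumexp_i( log_alpha_{t-1}[i] + log_M[t, i, j] )
        log_alpha = logsumexp(log_alpha[:, None] + log_M[t], axis=0)
    return logsumexp(log_alpha)                   # log Z(O)

# Scores far too large for a naive exp(): no overflow in log-space.
rng = np.random.default_rng(1)
print(log_forward(rng.normal(loc=50.0, size=(100, 3, 3))))
```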

C. Avoiding overfitting

As is noted throughout the machine learning literature, one of the potential hazards of learning parameters from a particular training set is overfitting: the parameters will match the particular training set well but not generalize to a separate test set. There are four basic techniques that have been used (alone or in combination) in CRF training to minimize the effects of overfitting.

The most common approach is regularization: a penalty term is added to the log-likelihood in order to ensure that the weights of the CRF remain small. A common penalty is the l2 norm, which changes the log-likelihood of the system (from Equation 10) to:

L_{l_2\text{-penalized}} = \lambda \cdot F(S, O) - \log Z(O) - \sum_{i=1}^{n} \frac{\lambda_i^2}{2\sigma^2},    (20)

where σ² is a free parameter that determines the width of a spherical prior over the weights [1], [3]. When the partial derivative of the penalized likelihood is taken with respect to a particular λ_i (which determines the scale of the update), the derivative is decreased by λ_i/σ². Thus, for a fixed σ², larger values of λ_i will receive proportionately smaller increases in the gradient ascent. When σ² is large, this term becomes increasingly vanishing; for NLP tasks a value between 10 and 100 works relatively well [12], although the sensitivity of the result to this parameter seems to be quite low [3]. Hifny and Renals [15] find, for their sparse augmented representation, that the l1 norm is more appropriate; their penalized log-likelihood takes the form:

L_{l_1\text{-penalized}} = \lambda \cdot F(S, O) - \log Z(O) - \alpha \sum_{i=1}^{n} |\lambda_i|,    (21)

where α is a parameter used to control the influence of the l1 regularizer. This regularizer is called the Lasso penalty, and they claim that this penalty will often drive many of the λ_i weights to zero [48], a very desirable property with high-dimensional sparse representations.

Another approach to avoiding overfitting is one proposed for perceptron learning, most recently by Collins [23], which has been adapted for use in CRFs [4]. The voted perceptron works by continually maintaining two sets of weights: the current set of unnormalized weights, and a running average of all weights maintained by the system; in essence, the average of all updates made to the system during training. In our experience, this technique is very effective when using stochastic gradient training of CRFs, because it mitigates the effects of any one individual example's updates to the weights; in the iterative scaling variants of training, updates are withheld until the end of the training epoch, and thus all examples affect the weight update contemporaneously.

Cross-validation techniques on held-out data can also be used to judge the progress of the training; these techniques have long been used in training, e.g., neural networks [49]. The held-out set is decoded at set intervals (often once per epoch of the training data). While the log posterior of the labels continues to increase with further iterations of training, the prediction accuracy on a development test set often levels off, suggesting that any further training will overfit the data. One can combine these techniques as well (e.g., using weight averaging with cross-validation).

In the context of structured prediction and avoiding overfitting, it would be amiss not to mention large-margin approaches: changing the training criterion so that the margin between the correct label sequence and competing sequences is maximized, rather than maximizing the likelihood [27]. Related to this approach are structured support vector machines (structured SVMs) [50], [51]: these are similar to CRFs in that both approaches minimize a convex regularized loss function (the negative regularized conditional log-likelihood in Equation 20 is the loss function for CRFs). The optimization problem in structured SVMs introduces a max-margin regularization term and slack variables to prevent overfitting. We refer the reader to [52] for a detailed comparison of various structured discriminative models, particularly in speech recognition.

IV. GOING BEYOND LINEAR-CHAIN CRFS

The discussion of CRFs has so far been restricted to linear chains, so named because they correspond to the graphical model structure in Figure 2(c). Before discussing possible generalizations of linear-chain CRFs, we first explore the connection between graphical model structure and the corresponding probability distributions in greater detail. The discussion in the following section must necessarily be brief; for a more detailed description we refer the interested reader to Koller and Friedman [53].
A. Undirected Graphical Models and CRFs

An undirected graphical model such as the one in Figure 3 is intended to represent the conditional independence assumptions inherent in the probability distribution defined on a set of random variables. The vertices of the graph represent particular random variables, and there is no loss of generality in assuming that the vertices are labeled by the corresponding random variables, as we have done in Figure 2(c). We may thus refer to the neighbors of a particular random variable s in the graph, that is, the set of vertices (denoted N(s)) that are adjacent to it in the graph. The edges in the undirected graphical model encode the conditional independence assumptions amongst the variables in terms of the Markov property: a random variable s is independent of any other random variable in the graph given the values of its neighbors. A graph defined on the set of variables (S, O) is a conditional random field if the variables in S obey the Markov property with respect to the graph when conditioned on the observations O [1]:

P(s | S \setminus \{s\}, O) = P(s | N(s), O).    (22)

Now, the structure of a graph is completely specified by stating the set of cliques C in the graph, where a clique c ∈ C is a set of vertices that are mutually connected by edges. A surprising fact about CRFs (and random fields in general) is that the set of all cliques in the graph also completely characterizes the set of probability distributions on the graph variables that satisfy the Markov property with respect to the graph. If we denote by s_c the set of random variables in S that correspond to a particular clique c ∈ C, the Hammersley-Clifford theorem [14] states that a positive probability distribution P(S | O) satisfies the Markov property in Equation 22 if and only if it can be written in the form

P(S | O) = \frac{1}{Z(O)} \prod_{c \in C} \phi(s_c, O),    (23)

where the φ(s_c, O) are a set of positive functions defined on the random variables and Z(O) is a normalization term that ensures that Equation 23 forms a valid probability distribution,

Z(O) = \sum_{S} \prod_{c \in C} \phi(s_c, O).    (24)

In the particular case where the graphical structure corresponding to the CRF is a linear chain, the cliques in the graph correspond to edges (s_{t-1}, s_t) and individual nodes s_t. Consequently, the corresponding functions are represented as φ(s_{t-1}, s_t, O) and φ(s_t, O), as defined in Equations 6 and 7.

[Fig. 3. An example of a more general (non-linear-chain) CRF, with an interlinked chain of two state variables depending on a sequence of observations. As we discuss in Section VI-A, such a graphical model structure has been used to jointly model noun phrase segmentation (the upper chain of variables s_{i,1}) and part-of-speech tags (as part of the lower chain of variables s_{i,2}) in [11].]

[Fig. 4. 2-D (grid) structured Conditional Random Field. For clarity, links to observations have been omitted.]

B. General CRFs

The discussion in the previous section, and Equation 23 in particular, provide a clear roadmap for the extension of CRFs beyond a linear-chain structure. All that is required is that a suitable graphical structure be defined that captures the conditional independence assumptions that one would want to hold for the set of random variables S given the observations O, and that an appropriate form be chosen for the potential functions φ(s_c, O). In many applications suitably modeled by a general CRF, it is useful to think of the complete graphical structure as corresponding to a set of clique templates that are repeated over time to obtain a dynamic conditional random field (DCRF) [11], [3]. This is similar to the process by which a Bayesian network is unrolled over time to obtain a dynamic Bayesian network [54]. We use the notion of repeating over time loosely here, since there are alternative graphical structures, such as 2-D grids (Figure 4), that are suitable for image processing applications; the essential point we wish to stress is that it is useful to think of the overall structure as being decomposable into a number of templates that may be repeated to obtain the overall structure. (We have previously applied such a grid-structured graphical model to speech segregation [55], where each node s_{i,j} represents a time-frequency unit. We describe this approach in greater detail in Section VI-C.)

With this understanding, let us denote the set of variables associated with a clique template c ∈ C at a particular time t as s_{t,c}. If we choose to represent the set of functions φ(·) of Equation 23 to be in the exponential family, we may write

P(S | O) = \frac{1}{Z(O)} \prod_{c \in C} \prod_{t=1}^{T} \exp\left( \lambda_c^\top f(s_{t,c}, O) \right),    (25)

where λ_c is the set of parameters associated with the clique template c. In the case of a linear-chain CRF, clique templates could be defined corresponding to individual edges c_1 and nodes c_2, thus s_{t,c_1} = {s_{t-1}, s_t} and s_{t,c_2} = {s_t}. Note that the characterization of the general graphical structure in terms of templates makes explicit an additional feature of general graphical models that must be exploited in practical systems: the weights λ_c associated with clique templates are explicitly tied across repetitions.

C. Inference in General CRFs

The parameters {λ_c} in Equation 25 can be learned analogously to those of a linear-chain CRF as described in Section III-A1, which involves the computation of the marginal probabilities P(s_{t,c} | O) for variables in the graph. In the case where the underlying CRF graph structure corresponds to a tree, these marginal probabilities can be computed exactly by a generalization of the forward-backward algorithm for linear chains, known as the sum-product algorithm or belief propagation [56], [57]. (More technically, the sum-product algorithm can be used for computing marginals whenever the corresponding factor graph [56] is a tree.) For non-tree-structured CRFs, the marginal distributions can be computed exactly using the junction tree algorithm [53].
The basic idea behind this algorithm is to successively cluster the variables in the graph to obtain a new tree-structured graph, called the junction tree, whose vertices are subsets of the original graph nodes, and then to run inference algorithms on the junction tree. Unfortunately, the junction tree algorithm is exponential in the treewidth of the graph (one less than the size of the largest clique in the junction tree), and thus the cost is prohibitive for graphs of large treewidth (2-D grids, for example). In cases where exact inference is intractable, it is possible to use a number of approximate inference techniques. Conceptually, one of the simplest approaches is loopy belief propagation, an application of the basic belief propagation algorithm to graphs with cycles. Although the algorithm is not provably correct for such graphs, applying it in these conditions has been found to be empirically successful for some applications [58]. It should be noted, however, that on general graphs the algorithm may not converge, and even if it does converge, it may not converge to the correct marginals. An alternative approach is to use Markov Chain Monte Carlo (MCMC) techniques such as Gibbs sampling to obtain samples from the distribution, which can then be used to approximate the marginals of interest. Since Gibbs sampling requires one to sample from the distribution of a particular random variable conditioned on all other variables in the network, the Markov properties of the graph (Equation 22) can be readily exploited to simplify the sampling process. MCMC techniques can, however, be very slow in practice, which is an issue if the marginal distributions in the graph need to be computed repeatedly during training for each input example.

^22 More technically, the sum-product algorithm can be used for computing marginals whenever the corresponding factor graph [56] is a tree.
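The following sketch (ours; the potentials are hypothetical) shows a single Gibbs-sampling sweep over a small grid-structured CRF with binary labels. Because of the Markov property in Equation 22, resampling a node only requires its neighbors' current labels and the local observation.

```python
import math
import random

# One Gibbs sweep over a toy grid CRF with binary labels. The node and pairwise
# potentials below are invented placeholders, not those of any cited system.

H, W = 4, 4
LABELS = [0, 1]
state = [[random.choice(LABELS) for _ in range(W)] for _ in range(H)]

def neighbors(i, j):
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < H and 0 <= nj < W:
            yield ni, nj

def local_log_potential(label, i, j, obs):
    # Only cliques touching (i, j) matter: the observation pulls the label,
    # and the grid edges encourage agreement with the neighbors (Eq. 22).
    score = obs[i][j] * (1.0 if label == 1 else -1.0)
    score += sum(0.8 if state[ni][nj] == label else -0.8 for ni, nj in neighbors(i, j))
    return score

def gibbs_sweep(obs):
    for i in range(H):
        for j in range(W):
            logits = [local_log_potential(l, i, j, obs) for l in LABELS]
            m = max(logits)
            weights = [math.exp(x - m) for x in logits]
            state[i][j] = random.choices(LABELS, weights=weights)[0]

obs = [[random.gauss(0, 1) for _ in range(W)] for _ in range(H)]
for _ in range(100):   # samples collected after burn-in approximate the node marginals
    gibbs_sweep(obs)
```

Averaging the sampled labels over many post-burn-in sweeps yields estimates of the node marginals needed during training.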

Variational techniques [59] can also be used for approximate inference. A full discussion of variational methods is outside the scope of this article; we refer the interested reader to [60].

Although exact inference using the junction tree algorithm is intractable for a number of graphical structures, knowledge of deterministic, task-dependent constraints can greatly simplify the inference problem. For example, we have considered the use of CRFs to model the various streams of articulators (lips, tongue, glottis, velum, etc.) while explicitly accounting for the fact that the articulators can desynchronize from each other in conversational speech [61]. Although the factor graph for this model appears fairly complicated, the deterministic constraints that we impose on the problem allow inference to be carried out in an equivalent linear-chain graphical model with a manageable state space.

D. Introducing hidden variables in training: HCRFs

For many problems of interest, there is unknown substructure that links the state space with the observations: for example, in speech recognition one often uses a sub-phonetic state representation to statistically describe the beginning, middle, and end of a subword unit (phone); in many continuous domains a mixture of Gaussians is used for the observation probabilities, where the mixture component is unknown. During training, the system will often have access to some labels of interest, but will not have a labeling available for the substructures. Moreover, the substructure is often not desired at decode time, so it can be marginalized out of the probability estimation. Hidden Conditional Random Fields [4], [62] take the approach that an observed state is paired with one that is hidden during training; the CML training algorithm can be adapted to estimate hidden state values using the expectation-maximization algorithm [63], or to compute the n-best most likely hidden sequences during training. These approaches have the advantage of being able to represent a more complex state space than the log-linear probability model in a standard CRF; however, the optimization problem is no longer convex, so a global optimum on the training set is not guaranteed.

Fig. 5. Example Semi-Markov CRF or SCRF graphical model. Unlike a regular CRF, each SCRF state can refer to a variable-length sequence of observations, also known as a segment. For example, in speech recognition, each of the states could represent a word along with the corresponding segment of acoustic observations [20].

E. Segmental CRFs

Semi-Markov CRFs [64], or Segmental Conditional Random Fields (SCRFs) [20] (Figure 5), are extensions of linear-chain CRFs that allow each label state to span a segment of consecutive observed units, each of which is in turn assigned the same label. These models relax the Markov assumption from the frame level to the segment level, while the transitions within segments can be non-Markovian, which enables long-span dependencies. A segmental CRF models a segment-level label sequence S = \{s_1, s_2, \ldots, s_K\} along with the corresponding segment boundaries (encoded in terms of end times) E. Let E = \{e_1, e_2, \ldots, e_K\} denote the end times of the sequence of observation segments corresponding to S, and let o_{e_{j-1}+1}^{e_j} denote the observation segment from o_{e_{j-1}} (exclusive) to o_{e_j} (inclusive). For instance, with K = 3 labels over T = 7 observations, E = \{2, 5, 7\} would correspond to the segments o_1^2, o_3^5, and o_6^7.
The clique template c at a particular segment index j could be defined on s_{j-1}, s_j, e_{j-1}, and e_j, and the features f could be defined on a pair of consecutive segment labels and the corresponding observed segment. Following Equation 25, we can model the joint probability distribution of S and E conditioned on O as

P(S, E | O) = \frac{1}{Z(O)} \exp\Big( \sum_{j=1}^{|E|} \lambda \cdot f(s_{j-1}, s_j, o_{e_{j-1}+1}^{e_j}) \Big),    (26)

where

Z(O) = \sum_{S, E \;\mathrm{s.t.}\; |E| = |S|} \exp\Big( \sum_{j=1}^{|E|} \lambda \cdot f(s_{j-1}, s_j, o_{e_{j-1}+1}^{e_j}) \Big).    (27)

One has a choice whether to optimize the joint label and segmentation space P(S, E | O) (as in [64], [65]), or only the label sequence P(S | O) (as in [20]). In the former case, both the correct label sequences S and the associated correct segmentations E are known during training, making this a convex optimization problem. In the latter case, only label sequences are required for training, but the correct segmentations are treated as hidden structure, and one needs to sum over all possible segmentations E for S (as shown in Equation 28), leading to a non-convex optimization problem similar to HCRFs; it is an open question whether the marginalization over segmentations is necessary for most tasks.

P(S | O) = \frac{1}{Z(O)} \sum_{E \;\mathrm{s.t.}\; |E| = |S|} \exp\Big( \sum_{j=1}^{|E|} \lambda \cdot f(s_{j-1}, s_j, o_{e_{j-1}+1}^{e_j}) \Big)    (28)

The standard forward-backward algorithm for linear-chain CRFs can be extended in either case for efficient parameter estimation and inference, with an increase in complexity by a factor of the maximum duration of segments. Furthermore, we have shown that if each observation-dependent transition feature is defined on a fixed number of observation units instead of a variable-length observation segment, the time complexity of the forward-backward algorithm for SCRFs can be reduced to that of standard linear-chain CRFs [65] under mild assumptions. We refer interested readers to [64], [20], [65] for the details of these algorithms.
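As an illustration of the added cost of segmental models, the sketch below (our own, not any of the referenced implementations) performs segmental Viterbi decoding with a hypothetical segment score standing in for \lambda \cdot f(s_{j-1}, s_j, o_{e_{j-1}+1}^{e_j}); the extra inner loop over durations, bounded by MAX_DUR, is exactly the factor-of-maximum-duration increase mentioned above.

```python
import math

# Segmental Viterbi decoding sketch for an SCRF. `seg_score` is a placeholder
# for the learned segment-level feature score; labels and scores are toy values.

LABELS = ["A", "B"]
MAX_DUR = 3

def seg_score(prev_label, label, obs_segment):
    # Hypothetical score: per-frame evidence for the label plus a small transition bonus.
    evidence = sum(x if label == "A" else -x for x in obs_segment)
    return evidence + (0.5 if prev_label == label else 0.0)

def segmental_viterbi(obs):
    T = len(obs)
    best = {(0, None): 0.0}          # best[(end_time, label)] = best score so far
    back = {}
    for e in range(1, T + 1):
        for lab in LABELS:
            for d in range(1, min(MAX_DUR, e) + 1):   # extra loop over segment durations
                start = e - d
                for prev_lab in ([None] if start == 0 else LABELS):
                    prev_score = best.get((start, prev_lab))
                    if prev_score is None:
                        continue
                    cand = prev_score + seg_score(prev_lab, lab, obs[start:e])
                    if cand > best.get((e, lab), -math.inf):
                        best[(e, lab)] = cand
                        back[(e, lab)] = (start, prev_lab)
    # Trace back the best labeled segmentation ending at time T.
    lab = max(LABELS, key=lambda l: best.get((T, l), -math.inf))
    segments, e = [], T
    while e > 0:
        start, prev_lab = back[(e, lab)]
        segments.append((lab, start + 1, e))   # 1-indexed (label, start, end)
        e, lab = start, prev_lab
    return list(reversed(segments))

print(segmental_viterbi([0.3, 0.8, -0.2, -0.9, -0.4, 0.5, 0.6]))
```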

The SCRF is a desirable framework for tasks where the labels of interest are inherently segmental in nature, for example named entity recognition [64], [66], Chinese word segmentation [66], [67], word recognition [20], [68], and phone classification and recognition [38], [65].

F. Conditional Augmented Models

Conditional augmented (C-Aug) models [31] are discriminative segment-based models that incorporate the sufficient statistics of observation log-likelihoods from a base generative model. The models predict a single multi-class label for a sequence of observations that are assumed to share the same label. The posterior probability of the single label s for the entire sequence O is given by

P(s | O) = \frac{\exp(\alpha^T T(s, O; \lambda))}{Z(O; \lambda, \alpha)},    (29)

where \lambda are the parameters of the base model and T(s, O; \lambda) are the sufficient statistics of the observation log-likelihoods from that base generative model. C-Aug models can represent a wide range of temporal and spatial dependencies since the feature functions can depend on the entire observation sequence. C-Aug models can be extended to sentence models for recognition, with a sequence of segmental labels. These models are then similar to SCRFs, except that the features are observation log-likelihoods and their derivatives, obtained from a base generative model, with \lambda fixed or pre-trained. A first-pass lattice from a base model (e.g., an HMM system) is used to significantly constrain the possible segmentations and label sequences for C-Aug models. Additional details can be found in [69].

V. RELATIONSHIPS OF CRFS TO OTHER CLASSIFIER TECHNOLOGIES

In the previous sections, we detailed the particular mathematical techniques that allow for Conditional Maximum Likelihood training of CRFs over a range of feature definitions. In this section, we relate the structure and learning methods of CRFs to several current classifier technologies used in speech and language processing.

A. Relationship to HMMs

Many comparisons have been made between CRFs and HMMs [4], [3], but it is instructive to show the relationship between these models in a slightly more explicit manner. We highlight here work that is explained more fully in Heigold et al.'s paper [5] comparing Gaussian-based HMMs for continuous observations with log-linear models, and we walk through the explicit steps in understanding the relationships between Gaussian and log-linear models, using ASR as a domain example. In speech recognition, HMMs typically model the probability of a word sequence W given an observed acoustic sequence O via a state sequence S over time t = 1, \ldots, T:

\arg\max_W P(W | O) = \arg\max_W P(O | W) P(W)
  \approx \arg\max_W \max_S P(O | S) P(S | W) P(W)
  = \arg\max_W \max_S \Big[ \prod_{t=1}^{T} P(o_t | s_t) P(s_t | s_{t-1}) \Big] P(W).    (30)

Let us focus on the acoustic computation in the inner brackets, and first consider the case where P(o_t | s_t), the acoustic likelihood for a particular state, is represented by a single Gaussian with a diagonal covariance matrix. In this case, the likelihood of the data o_t (with dimensionality n) at state s_t = s, with multivariate mean \mu and standard deviation vector \sigma (where i = 1, \ldots, n ranges over the components of these vectors and both are implicitly indexed by s), is given by

P(o_t | s_t) = \frac{1}{(2\pi)^{n/2} \prod_i \sigma_i} \exp\Big( -\frac{1}{2} \sum_i \frac{(o_{t,i} - \mu_i)^2}{\sigma_i^2} \Big).    (31)

We can rewrite this equation so that it is an exponential function of a sum, over the dimensions i, of terms in o_{t,i} and o_{t,i}^2.
Notice that

\frac{1}{(2\pi)^{n/2} \prod_i \sigma_i} = \exp\Big( -\sum_i \log\big( (2\pi)^{1/2} \sigma_i \big) \Big)    (32)
  = \exp\Big( -\frac{1}{2} \sum_i \log(2\pi\sigma_i^2) \Big).    (33)

Since

-\frac{1}{2} \sum_i \frac{(o_{t,i} - \mu_i)^2}{\sigma_i^2} = \sum_i \Big( -\frac{o_{t,i}^2}{2\sigma_i^2} + \frac{o_{t,i}\mu_i}{\sigma_i^2} - \frac{\mu_i^2}{2\sigma_i^2} \Big),    (34)

we can rewrite Equation 31 by gathering terms in o_{t,i}:

P(o_t | s_t) = \exp\Big( \sum_i \Big[ c_i + \frac{o_{t,i}\mu_i}{\sigma_i^2} - \frac{o_{t,i}^2}{2\sigma_i^2} \Big] \Big), \quad \mathrm{where} \quad c_i = -\frac{1}{2}\Big( \log(2\pi\sigma_i^2) + \frac{\mu_i^2}{\sigma_i^2} \Big).    (35)

We can see that this Gaussian model can be expressed as a sum of weighted feature functions f and weights \lambda. Let

f_{0,s,i}(t) = \delta(s_t = s),
f_{1,s,i}(t) = \delta(s_t = s)\, o_{t,i},
f_{2,s,i}(t) = \delta(s_t = s)\, o_{t,i}^2,
\lambda_{0,s,i} = -\frac{1}{2}\Big( \log(2\pi\sigma_i^2) + \frac{\mu_i^2}{\sigma_i^2} \Big),
\lambda_{1,s,i} = \frac{\mu_i}{\sigma_i^2},
\lambda_{2,s,i} = -\frac{1}{2\sigma_i^2},

where the delta function selects for a particular state s; that is, when a postulated HMM state s_t = s, the delta function takes on a non-zero value, and is zero for s_t \neq s.^23

Replacing terms in Equation 35 with the definitions above, we obtain the following expression (for a particular time t):

P(o_t | s_t) = \exp\Big( \sum_{i=1}^{n} \sum_{j=0}^{2} \lambda_{j,s,i} f_{j,s,i}(t) \Big).    (36)

In an HMM, however, it is important to model not only the acoustic probabilities at each state, but also the transitions between states. We can easily incorporate state-to-state transition probabilities by introducing transition functions between pairs of state values (s, s'):

f^{trans}_{s,s'}(t) = \delta(s_t = s, s_{t+1} = s'),
\lambda^{trans}_{s,s'} = \log P(s_{t+1} = s' | s_t = s),

and thus, when considering a state sequence S = \{s_0, s_1, \ldots, s_T\}, one can compute the likelihood in terms of weighted functions of the state s_t and the transition (s_t, s_{t+1}):

P(O, S | W) = \exp\Big( \sum_t \Big[ \sum_{i,j} \lambda_{j,s_t,i} f_{j,s_t,i}(t) + \lambda^{trans}_{s_t,s_{t+1}} f^{trans}_{s_t,s_{t+1}}(t) \Big] \Big).    (37)

One can analogously introduce feature functions that incorporate the probability of the words, P(W). When expressing a Gaussian HMM as a CRF, Equation 37 is compatible with the numerator of the CRF equation; that is, a likelihood-based HMM can be represented as an unnormalized CRF. Remember that the partition function (denominator) of the CRF is the sum of the exponential functions over all paths:

P(W | O) \approx \frac{\max_S P(O, S | W) P(W)}{\sum_{W', S} P(O, S | W') P(W')}.    (38)

This is the same quantity that is maximized during Maximum Mutual Information Estimation (MMIE) training [70].^24 Thus, the structure of the equations when sufficient statistics are used as feature functions within a CRF is the same as that of MMIE. Both methods also update the parameters of the probability distribution with gradient ascent methods. In practice, MMIE leads to a constrained optimization problem, in that the updated HMM parameters must still form a probability distribution [70], whereas CML updates for log-linear CRFs lead to an unconstrained optimization because the potentials are globally normalized.

It should also be pointed out that it is possible to represent a subclass of linear-chain CRFs within an HMM, effectively giving a discriminatively trained HMM. Heigold et al. show that, for any set of linear-chain CRF parameters \lambda that have time-aligned state-observation relationships only (i.e., only the observation at time t is associated with the state at time t) and no transition-observation feature functions, one can derive a set of HMM parameters \theta that exactly represents the same probability distributions [72]. They demonstrate this by showing the equivalence of the two models on a semantic concept tagging task. The equivalence has been shown to be bi-directional, and has been extended to mappings between HMMs with mixture-of-Gaussian observations and mixtures of log-linear models [5].^25 This equivalence proof makes use of linear-chain CRFs with sub-phonetic HMM state sequences S as hidden variables, also known as hidden CRFs (HCRFs); we refer the reader to Section IV-D for more details on HCRFs.^26

^23 Note that while the functions vary as a combination of the input and the postulated state over time, the weights \lambda are not dependent on the time t, in keeping with the stationarity assumption of HMMs.
^24 A good review reference for MMIE can be found in [71].
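The algebra in Equations 31 through 36 can be checked numerically; the short sketch below (ours, with arbitrary means and variances) confirms that the diagonal-covariance Gaussian log-likelihood equals the weighted sum of the features {1, o_{t,i}, o_{t,i}^2} with the \lambda values derived above.

```python
import numpy as np

# Numerical check of Eqs. 31-36: a diagonal-covariance Gaussian log-density equals a
# log-linear score over the features {1, o_i, o_i^2}. Parameter values are arbitrary.

rng = np.random.default_rng(0)
n = 4
mu = rng.normal(size=n)
sigma2 = rng.uniform(0.5, 2.0, size=n)      # per-dimension variances sigma_i^2
o_t = rng.normal(size=n)                    # one observation vector

# Direct evaluation of the Gaussian log-likelihood (Eq. 31).
log_gauss = -0.5 * np.sum(np.log(2 * np.pi * sigma2) + (o_t - mu) ** 2 / sigma2)

# Log-linear form (Eqs. 35-36): lambda_0 + lambda_1 * o_i + lambda_2 * o_i^2 per dimension.
lam0 = -0.5 * (np.log(2 * np.pi * sigma2) + mu ** 2 / sigma2)
lam1 = mu / sigma2
lam2 = -1.0 / (2 * sigma2)
log_linear = np.sum(lam0 + lam1 * o_t + lam2 * o_t ** 2)

assert np.isclose(log_gauss, log_linear)
```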
B. Relationship to Perceptrons and MLPs

The basic equations for CRFs also bear some resemblance to those of Multi-Layer Perceptrons (MLPs), which have been used for local phone classification tasks in ASR for many years ([73] inter alia), both on their own and as part of larger hybrid artificial neural network-hidden Markov model (ANN-HMM) and tandem ANN-HMM systems [74]. MLPs attempt to discriminate between classes by projecting the input into a high-dimensional space (via one or more hidden layers) and then combining these projections to give an estimate of (phone) class membership. At each layer, a (possibly nonlinear) function is used to determine the outputs of the layer; Bridle [75] showed that when the softmax function is used (and trained appropriately), the outputs can be interpreted as posterior probabilities. In short, if W_c^T is the vector of weights between the hidden nodes and the node for class c in the output layer, and X is the vector of outputs from the hidden layer of an MLP (which is in turn a nonlinear function of the input or of a previous hidden layer), then the probability of class c given the input can be computed as

P(c | \mathrm{input}) = \frac{\exp(W_c^T X)}{\sum_{c'} \exp(W_{c'}^T X)}.    (39)

The form of this equation is remarkably similar to the CRF equation when only state feature functions are considered; that is, the weights W_c correspond to the CRF weights \lambda, and the components of X are the individual feature functions. For single-layer perceptrons (SLPs) with a softmax transfer function, the form of the equation is in fact identical to that of a Maximum Entropy classifier, and the CRF thus extends this idea by allowing statistical dependencies between adjacent states.^27 Maximum Entropy training is equivalent to finding the Conditional Maximum Likelihood of Equation 39 over all of the training data.

^25 It should be noted that the results here apply to one form of CRF: linear-chain models that do not have transition functions that depend on observations.
^26 We note that this idea of using HMM states as hidden variables was first introduced by Gunawardana et al., who also point out that the approach extends naturally to the hidden component weights of the mixtures of Gaussians typically used as observation densities in speech recognition [4].
^27 One can also formulate MLPs with dependencies between states (see, for example, the REMAP paradigm [76]), although the training regimen becomes much more complex and computationally expensive than the usual MLP training method.

Following Klein [8], one can write the log-likelihood of an (unregularized) MaxEnt model, whose feature functions are based on the data, as

L(W) = \log \prod_i P(c_i | X_i, W) = \log \prod_i \frac{\exp(W_{c_i}^T X_i)}{\sum_c \exp(W_c^T X_i)} = \sum_i \Big[ W_{c_i}^T X_i - \log \sum_c \exp(W_c^T X_i) \Big].    (40)

As derived earlier, this quantity is maximized through gradient search over W. Similarly, with softmax perceptrons, one tries to minimize an error criterion, namely the cross entropy between the target distribution and the probability distribution currently predicted given the weights, by searching along the gradient of the error function. Given a true distribution P_T and a predicted distribution P_P, the cross entropy H(P_T, P_P) = -\sum_c P_T(c) \log P_P(c) is a measure of the distortion between these distributions. In the specific case where the true (target) distribution is a one-hot encoding (the correct class c has probability 1, the incorrect classes 0), this reduces to the negative log-probability of the correct class:

H(P_T, P_P) = -\log P_P(c).    (41)

Given that the forms of P_P are the same for MaxEnt and softmax perceptron classifiers, the partial derivatives of Equation 41 with respect to the weights are identical to those of MaxEnt classifiers in the case of one-hot targets. In minimizing the cross entropy, therefore, the one-hot softmax perceptron is identical to the maximum entropy formulation.

The discussion above relates softmax perceptrons and MaxEnt classifiers; the additional power of an MLP lies in its hidden layers, which allow nonlinear boundaries between classes to be learned. Single-layer perceptrons, MaxEnt models, and CRFs (as well as Support Vector Machines), on the other hand, are linear classifiers of their input. CRFs and SVMs can model nonlinear functions by suitably transforming the input into a nonlinear space, through explicit design of feature functions in CRFs (such as second-order statistics) or via kernel functions in SVMs. However, if one considers the MLP's hidden layer X (often the output of a sigmoid function applied to weighted linear inputs) as a set of automatically learned feature functions in an exponential model, we can utilize these functions with other log-linear classifiers. Recent approaches based on learning deep architectures [45] can be thought of as intermediate between these positions: the hidden layer representations in a deep belief network are initially learned directly from the data in an unsupervised fashion as part of the pre-training step, and the weights of the network can then be discriminatively optimized for classification in a supervised fashion.

The relationships between neural networks and CRFs have led to a number of exciting hybrid approaches, as discussed previously in this article. Using sequence-based optimization criteria such as CML/MMIE has improved local posterior estimation in neural networks [44]. Similarly, MLP or Deep Neural Network techniques can be used to automatically learn the feature functions for CRFs [42], [46].
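The correspondence in Equation 39 is easy to see in code: the sketch below (our illustration, with random placeholder weights and layer sizes) computes class posteriors by applying a softmax to linear scores over sigmoid hidden-layer activations, i.e., a log-linear model whose feature functions are the learned activations X.

```python
import numpy as np

# A softmax output layer over hidden-layer activations X is a log-linear classifier
# in the sense of Eq. 39. All weights and dimensions here are arbitrary placeholders.

rng = np.random.default_rng(1)
n_in, n_hidden, n_classes = 13, 32, 5      # e.g., an acoustic frame -> hidden -> phone classes

W_hidden = rng.normal(scale=0.1, size=(n_hidden, n_in))
W_out = rng.normal(scale=0.1, size=(n_classes, n_hidden))

def mlp_posteriors(inputs):
    X = 1.0 / (1.0 + np.exp(-W_hidden @ inputs))   # sigmoid hidden layer: learned features
    scores = W_out @ X                              # W_c^T X for each class c
    scores -= scores.max()                          # shift for numerical stability
    expd = np.exp(scores)
    return expd / expd.sum()                        # softmax, Eq. 39

frame = rng.normal(size=n_in)
posteriors = mlp_posteriors(frame)
assert np.isclose(posteriors.sum(), 1.0)
```

In the hybrid systems cited above, activations like X (or local posteriors derived from them) are then supplied as state feature functions to the sequence-level CRF.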
VI. APPLICATIONS IN AUDIO, SPEECH, AND LANGUAGE PROCESSING

To this point, we have discussed CRF technology from the perspectives of its historical development, its mathematical structure, and its relationship to other common classifier techniques. In this section, we explore the use of CRFs in natural language, speech, and audio processing tasks; we give brief summaries of different applications of CRFs by defining each problem in terms of the state sequence S that is desired and the observations O. The aim of this section is not to be an exhaustive review of all possible applications, but rather to highlight the diversity of the range of problems that can be addressed using CRFs.

A. Natural Language Processing

Consider a series of operations that could be applied to a natural language sentence: one could begin with low-level tasks such as predicting part-of-speech (POS) tags or recognizing named entities, followed by operations at a syntactic level (parsing), and ending with high-level, semantically-oriented tasks such as question answering or information extraction. CRFs have been empirically successful in each of these sequence labeling problems; this section briefly reviews some of the main contributions. In order to maintain consistent notation throughout the article, we introduce each problem by defining the values that the hidden variables (S) and observation variables (O) can take.

PART-OF-SPEECH TAGGING
S: Part-of-speech tags
O: Words in a sentence

Fig. 6. Part-of-speech tags for a fragment of a sample sentence ("The Washington Post is expected to ..."). The tags correspond to the Penn Treebank tag set; for example, NNP corresponds to a singular proper noun, VBZ to a verb in the present tense, third person singular, etc.

CRFs were first applied to predict part-of-speech tags for an input word sequence [1] (see Fig. 6); in that work, CRFs were found to outperform both HMMs and MEMMs when evaluated on the Penn Treebank. The CRF error rate relative to the HMM drops considerably when the model is augmented with a small set of orthographic features: the system used linguistically motivated features, for example whether the word was capitalized and the presence of particular suffixes (-ed, -ion, etc.). In subsequent work, Sutton et al. [11] observe that jointly modeling noun phrase (NP) segmentation and POS tagging using DCRFs improves POS prediction accuracy.

POS tagging using CRFs has also been applied to languages with complex morphological structure, such as Chinese [77], Arabic, Czech [78], and Japanese [79], where CRFs have been found to outperform their generative counterparts.

NAMED ENTITY RECOGNITION
S: Named entities
O: Words in a sentence

For a high-level task like question answering or information extraction, it is beneficial to identify all of the people, locations, and organizations in a document (or sentence) of interest; this is commonly referred to as named entity recognition (NER). McCallum and Li [80] describe one of the first attempts at applying a CRF to this problem. They use feature induction and select only those feature conjunctions that significantly increase the log-likelihood; the atomic features are derived using Web-augmented lexicons. Finkel et al. [81] report further improvements by modeling non-local structure with Gibbs sampling, which allows imposing various sorts of long-range constraints.

PARSING
S: Grammatical productions (NP chunks, grammar rules)
O: Words in a sentence

Fig. 7. NP chunk tags for a fragment of a sample sentence ("The Washington Post is expected to ..."). Here the NP is "The Washington Post", specified in terms of the B, I, O labels representing words that begin an entity, are inside an entity, or are outside any entity.

Shallow parsing is typically used as a precursor to full parsing or further semantic analysis. Noun phrase (NP) chunking is an archetypal shallow parsing problem that finds NPs in a sentence. This can be modeled as a sequence tagging task that takes the words in a sentence, annotated with POS tags, as input and generates NP chunks in terms of Begin, Inside, and Outside labels, as shown in Fig. 7. Sha and Pereira [13] use a CRF model for this task with labels that are bigrams of the B, I, O tags, in order to impose a second-order Markov dependency between them. Their CRF model delivers superior F-score performance, beating all single-model NP chunking results on the CoNLL shared task. A more recent approach by Finkel et al. [82] presents a general, feature-rich CRF-based parser attaining a state-of-the-art F-score on the Penn Treebank parsing task (on sentences of length 15). Their model uses grammar rules defined by probabilistic context-free grammars, along with additional information such as the span of words encompassed by the rule and its position in the sentence, as labels for their CRF model. This discriminative parser was further used to build a model of nested named entities [83] and a joint model of parsing and NER [84].

INFORMATION EXTRACTION
S: Fields to be extracted (e.g., Title, Speaker, etc.)
O: Words in a document

We end this section on CRFs in NLP applications with a reference to the semantically oriented task of information extraction. Peng and McCallum [85] achieve state-of-the-art performance in extracting standard header fields such as title, author, and institution from research papers on a standard benchmark data set. The authors use a number of local features (such as "contains a dot"), layout features (such as "end of a line"), and external lexicon features ("match word in author lexicon") in their linear-chain CRF model. Sutton and McCallum [86] approach this extraction problem differently by using a skip-chain CRF, whose structure is a linear chain with additional edges between similar words.
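The NLP systems surveyed above rely heavily on sparse indicator features of the kind sketched below (a toy example of ours, not any cited system's actual feature set): each feature fires for a particular combination of an orthographic property of a word and the candidate label, plus a label-transition feature.

```python
# Toy orthographic and transition indicator features for a CRF tagger or chunker.
# Each key names a binary feature function of the observations, position, and labels.

def orthographic_features(words, t, label, prev_label):
    w = words[t]
    return {
        f"word={w.lower()}_label={label}": 1.0,
        f"capitalized={w[0].isupper()}_label={label}": 1.0,
        f"suffix3={w[-3:].lower()}_label={label}": 1.0,
        f"contains_digit={any(ch.isdigit() for ch in w)}_label={label}": 1.0,
        f"prev={prev_label}_cur={label}": 1.0,   # transition feature
    }

words = ["The", "Washington", "Post", "is", "expected", "to"]
print(orthographic_features(words, 1, "I-NP", "B-NP"))
```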
B. Speech Processing: Phone Recognition, Word Recognition, and Other Tasks

As we have discussed throughout the article, various ASR tasks lend themselves to being modeled as sequence labeling problems; it comes as no surprise that CRFs find wide application across all of these tasks. We discuss some of these applications in the following sections. The description in this section is intentionally brief, since the systems described here have been elaborated upon in previous sections.

PHONE OR WORD RECOGNITION
S: Phone or word labels
O: Acoustic features representing spectral properties

CRFs were first applied in ASR to the task of phone classification [4], using feature functions explicitly derived from a baseline HMM system; this was extended by Sung and Jurafsky [62] to phone recognition. Subsequent work has explored the use of alternative feature functions such as Gaussian responsibility scores [37] and phone and phonological feature posteriors [33]. Recently, Segmental CRFs have been employed for phone classification and recognition, making use of segmental features and a more complex state space [38], [65]. Phone-based CRF approaches have shown large improvements over baseline HMM systems on the standard TIMIT dataset, in particular when observation-dependent transition factors are employed. Extending these approaches to word recognition for medium-to-large vocabulary tasks is complicated by the large increase in the state space; Zweig and Nguyen [20] address this challenge via a Segmental CRF by restricting the set of hypotheses to a lattice obtained from a baseline HMM system. An alternative approach [87], [88] uses CRFs to produce phone lattices that are used in subsequent processing steps for word recognition.

LETTER-TO-SOUND CONVERSION
S: Phonemes corresponding to the pronunciation
O: Graphemes from the orthographic transcription

The letter-to-sound conversion task (also known as pronunciation prediction or grapheme-to-phoneme conversion) attempts to predict the pronunciation of a word, in terms of phones, given the spelling of the word (for example, "phoenix" becomes /F IY N IH K S/). Since the orthographic transcription and the phonemic pronunciation are generally of different lengths, systems for this task need to derive alignments between the two sequences. Wang and King [89] use a linear-chain CRF with a set of simple indicator functions defined on pairs of graphemes and phonemes, finding a large improvement over a generative baseline system.

SPOKEN LANGUAGE UNDERSTANDING: CONCEPT TAGGING
S: Semantic concept labels
O: (Possibly errorful) word transcripts of speech

Spoken language understanding attempts to extract the intended meaning of a speaker from speech recognition output. One formalization of this idea is the extraction of task-relevant attribute-value pairs (or concept labels) from text; e.g., "I want to fly to Austin on Tuesday" in a flight reservation domain should result in the concept pairs destination:Austin and travel.date:Tuesday. Hahn et al. present several systems for concept extraction in multiple languages, and find that CRFs perform quite well in terms of concept error rate across three languages, using a mixture of features based on word bigrams and on the words themselves (such as prefixes and capitalization) [90].

C. Audio Processing

We end this section on CRF applications by drawing attention to recent work in relatively new areas of audio processing: speech segregation using binary masks, and two tasks intended to support music information retrieval.

SPEECH SEGREGATION
S: Binary labeling of time-frequency units
O: Mixture of target and interference acoustic signals

An area of speech processing that lends itself well to the use of CRFs is computational auditory scene analysis (CASA). Within this paradigm, a body of research has been directed towards the estimation of the ideal binary mask (IBM) in order to perform speech segregation. The IBM is a 0/1 mask defined on the time-frequency (T-F) representation of a mixture of two signals (target and interference), which identifies T-F units where the target energy dominates that of the interference; it is computed from the pre-mixed signals [91]. In previous work, we have experimented with the use of a 2-D grid-structured CRF for speech segregation [55]. The vertices of the graph represent T-F units; each T-F unit is connected by an edge to its neighbors in time and frequency. We observed improved T-F unit classification performance over previous heuristically guided approaches.

AUDIO-TO-SCORE ALIGNMENT
S: Concurrency labels describing simultaneous notes
O: Acoustic features

Audio-to-score alignment involves aligning an audio recording with its corresponding symbolic score, potentially aiding applications in music information retrieval. Joder et al. [92] propose CRF-based models for aligning polyphonic music that includes multiple voices or melodies. The alignment problem is formalized as a sequence labeling task; the musical score is represented as a sequence of concurrencies, i.e., sets of notes that occur simultaneously. They experiment with increasingly detailed CRF structures that attempt to model concurrency durations and tempo. Their model incorporates several acoustic features characterizing different aspects of musical content, such as harmony and tempo, extracted from local neighborhoods. In experimental evaluations, they observe improved alignment performance on a large-scale database of polyphonic classical and popular music.
MUSIC TRANSCRIPTION
S: Notes
O: Pitch estimates from the audio signal

CRFs have also been employed in automatic music transcription [93] as a post-processing step, acting as a smoothing technique to reduce single-frame errors. The authors employ a linear-chain CRF for each pitch value that takes frame-level pitch estimates as input and labels these observations with notes. The addition of the CRF post-processing step results in improved performance over several previously reported results on a standard database used for the evaluation of polyphonic music transcription approaches.

VII. CONCLUSION

This article has explored some of the different properties of CRFs, particularly in relation to speech, audio, and language technologies. CRFs are in some sense a bridge between HMMs and Multi-Layer Perceptrons, utilizing the Markovian and observational structure of HMMs while at the same time using a training criterion not unlike that of softmax-based perceptrons. Understanding the relationships of the different components of a CRF (model structure, parameterization, parameter estimation, and inference mechanism) to current technology can not only help bridge the gap between existing technologies but also facilitate innovation in combining different components.

A short article can only scratch the surface of the topic of CRFs, but the literature cited here provides good further reading. In particular, the original Lafferty et al. article [1], the review chapter by Sutton and McCallum [3], and the NAACL tutorial by Klein [8] make a good trio of papers for understanding general perspectives on CRFs and their training. More detail on structured discriminative models, such as those described in Section IV, can be found in [52], which has a particular emphasis on speech recognition. We hope that this article has provided connections from that literature to the particular perspectives of researchers interested in this technology.

ACKNOWLEDGMENTS

The authors would like to thank Jeremy Morris, Ilana Heintz, and Chris Brew for discussions of this work. This work was supported by NSF CAREER grant IIS and NSF grant IIS; the opinions and conclusions expressed in this work are those of the authors and not of any funding agency.

REFERENCES

[1] J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proceedings of the International Conference on Machine Learning (ICML '01), Williamstown, MA, USA, Jun. 2001.
[2] D. Yu and L. Deng, "Deep-structured hidden conditional random fields for phonetic recognition," in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech '10), Makuhari, Chiba, Japan, Sep. 2010.
[3] C. Sutton and A. McCallum, "An introduction to conditional random fields for relational learning," in Introduction to Statistical Relational Learning, L. Getoor and B. Taskar, Eds. MIT Press, 2007.
[4] A. Gunawardana, M. Mahajan, A. Acero, and J. C. Platt, "Hidden conditional random fields for phone classification," in Proceedings of the European Conference on Speech Communication and Technology (Interspeech '05 - Eurospeech), Lisbon, Portugal, Sep. 2005.
[5] G. Heigold, H. Ney, P. Lehnen, T. Gass, and R. Schlüter, "Equivalence of generative and log-linear models," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 5, Jul.
[6] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, Feb.
[7] A. McCallum, D. Freitag, and F. Pereira, "Maximum entropy Markov models for information extraction and segmentation," in Proceedings of the International Conference on Machine Learning (ICML '00), Stanford, CA, USA, Jun. 2000.
[8] D. Klein, "Introduction to classification: Likelihoods, margins, features, and kernels," NAACL Tutorial, Rochester, NY, USA, Apr.
[9] L. Bottou, "Une approche théorique de l'apprentissage connexionniste: Applications à la reconnaissance de la parole," Ph.D. dissertation, Dept. Electr. Eng. Comput. Sci., Université de Paris XI, Paris, France.
[10] E. McDermott, "Discriminative training for speech recognition," Ph.D. dissertation, Dept. Comput. Sci. Eng., Waseda Univ., Tokyo, Japan.
[11] C. Sutton, A. McCallum, and K. Rohanimanesh, "Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data," Journal of Machine Learning Research, vol. 8, Mar.
[12] H. Wallach, "Efficient training of conditional random fields," Master's thesis, Schl. Cogn. Sci., Univ. Edinburgh, Edinburgh.
[13] F. Sha and F. Pereira, "Shallow parsing with conditional random fields," in Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL '03), vol. 1, Edmonton, Canada, May 2003.
[14] J. Hammersley and P. Clifford, "Markov fields on finite graphs and lattices," 1971, unpublished manuscript. [Online]. Available: grg/books/hammfest/hamm-cliff.pdf
[15] Y. Hifny and S. Renals, "Speech recognition using augmented conditional random fields," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 2, Feb.
[16] A. L. Berger, S. D. Della Pietra, and V. J. D. Della Pietra, "A maximum entropy approach to natural language processing," Computational Linguistics, vol. 22, no. 1, Mar.
[17] T. Minka, "Algorithms for maximum-likelihood logistic regression," Carnegie Mellon University Statistics Department, Tech. Rep. 758.
[18] R. H. Byrd, P. Lu, J. Nocedal, and C. Y. Zhu, "A limited memory algorithm for bound constrained optimization," SIAM Journal on Scientific Computing, vol. 16, no. 6, Sep.
[19] M. Riedmiller and H. Braun, "A direct adaptive method for faster backpropagation learning: The RPROP algorithm," in Proceedings of the IEEE International Conference on Neural Networks (ICNN '93), vol. 1, San Francisco, CA, USA, Mar. 1993.
[20] G. Zweig and P. Nguyen, "A segmental CRF approach to large vocabulary continuous speech recognition," in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU '09), Merano, Italy, Dec. 2009.
[21] X. He, L. Deng, and W. Chou, "Discriminative learning in sequential pattern recognition," IEEE Signal Processing Magazine, vol. 25, no. 5, Sep.
[22] C.-H. Lee, "Discriminative training - fundamentals and applications," Interspeech Tutorial Session, Makuhari, Chiba, Japan, Sep.
[23] M. Collins, "Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '02), Prague, Czech Republic, Jul. 2002.
[24] J. Suzuki, E. McDermott, and H. Isozaki, "Training conditional random fields with multivariate evaluation measures," in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL '06), Sydney, Australia, Jul. 2006.
[25] D. Povey and P. Woodland, "Minimum phone error and I-smoothing for improved discriminative training," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), vol. 1, May 2002.
[26] Y. Xiong, J. Zhu, H. Huang, and H. Xu, "Minimum tag error for discriminative training of conditional random fields," Information Sciences, vol. 179, no. 1-2, Jan.
[27] M. Kim, "Large margin cost-sensitive learning of conditional random fields," Pattern Recognition, vol. 43, no. 10.
[28] A. Ratnaparkhi, "A maximum entropy model for part-of-speech tagging," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '96), E. Brill and K. Church, Eds., Somerset, New Jersey, May 1996.
[29] R. Rosenfeld, "Two decades of statistical language modeling: Where do we go from here?" Proceedings of the IEEE, vol. 88, no. 8, Aug.
[30] J. Chen and C. Chueh, "Joint acoustic and language modeling for speech recognition," Speech Communication, vol. 52, no. 3, Mar.
[31] M. I. Layton and M. J. F. Gales, "Augmented statistical models for speech recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), vol. 1, Toulouse, France, May 2006.
[32] A. Ragni and M. J. F. Gales, "Derivative kernels for noise robust ASR," in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU '11), Hawaii, Dec. 2011.
[33] J. Morris and E. Fosler-Lussier, "Conditional random fields for integrating local discriminative classifiers," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 3, Mar.
[34] M. Lehr and I. Shafran, "Learning a discriminative weighted finite-state transducer for speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 5, Jul.
[35] B. Roark, M. Saraclar, M. Collins, and M. Johnson, "Discriminative language modeling with conditional random fields and the perceptron algorithm," in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL '04), Barcelona, Spain, Jul. 2004.
[36] Y. Kubo, S. Watanabe, and A. Nakamura, "Decoding network optimization using minimum transition error training," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '12), Kyoto, Japan, Mar. 2012.
[37] Y. H. Abdel-Haleem, "Conditional random fields for continuous speech recognition," Ph.D. dissertation, Dept. Comput. Sci., Univ. Sheffield, Sheffield, U.K.
[38] G. Zweig, "Classification and recognition with direct segment models," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '12), Kyoto, Japan, Mar. 2012.
[39] V. N. Vapnik, The Nature of Statistical Learning Theory. New York, NY, USA: Springer-Verlag New York, Inc.
[40] J. Lafferty, X. Zhu, and Y. Liu, "Kernel conditional random fields: Representation and clique selection," in Proceedings of the International Conference on Machine Learning (ICML '04), Banff, Alberta, Canada, Jul. 2004.

[41] J. Peng, L. Bo, and J. Xu, "Conditional neural fields," in Advances in Neural Information Processing Systems (NIPS '09), Vancouver, British Columbia, Canada, Dec. 2009.
[42] R. Prabhavalkar and E. Fosler-Lussier, "Backpropagation training for multilayer conditional random field based phone recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '10), Dallas, Texas, USA, Mar. 2010.
[43] Y. Fujii, K. Yamamoto, and S. Nakagawa, "Automatic speech recognition using hidden conditional neural fields," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '11), Prague, Czech Republic, May 2011.
[44] B. Kingsbury, "Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), Taipei, Taiwan, Apr. 2009.
[45] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, May.
[46] D. Yu, S. Wang, and L. Deng, "Sequential labeling using deep-structured conditional random fields," IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 6, Dec.
[47] S. Wiesler and H. Ney, "A convergence analysis of log-linear training," in Advances in Neural Information Processing Systems (NIPS '11), J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, Eds., Granada, Spain, 2011.
[48] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning. New York: Springer-Verlag.
[49] D. Johnson, D. Ellis, C. Oei, C. Wooters, P. Faerber, N. Morgan, and K. Asanovic, "ICSI QuickNet software package."
[50] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, "Large margin methods for structured and interdependent output variables," Journal of Machine Learning Research, vol. 6, Sep.
[51] B. Taskar, C. Guestrin, and D. Koller, "Max-margin Markov networks," in Advances in Neural Information Processing Systems (NIPS '03), Vancouver, British Columbia, Canada, Dec. 2003.
[52] M. Gales, S. Watanabe, and E. Fosler-Lussier, "Structured discriminative models for speech recognition," IEEE Signal Processing Magazine, vol. 29, no. 6, Nov.
[53] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques (Adaptive Computation and Machine Learning series), 1st ed. The MIT Press.
[54] K. P. Murphy, "Dynamic Bayesian networks: Representation, inference and learning," Ph.D. dissertation, Comput. Sci. Div., Dept. Electr. Eng. Comput. Sci., Univ. California, Berkeley, CA, USA.
[55] R. Prabhavalkar, Z. Jin, and E. Fosler-Lussier, "Monaural segregation of voiced speech using discriminative random fields," in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech '09), Brighton, United Kingdom, Sep. 2009.
[56] F. R. Kschischang, B. J. Frey, and H. A. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Transactions on Information Theory, vol. 47, no. 2, Feb.
[57] J. Pearl, "Fusion, propagation, and structuring in belief networks," Artificial Intelligence, vol. 29, no. 3, Sep.
[58] K. P. Murphy, Y. Weiss, and M. I. Jordan, "Loopy belief propagation for approximate inference: An empirical study," in Proceedings of the International Conference on Uncertainty in Artificial Intelligence (UAI '99), Stockholm, Sweden, Jul. 1999.
[59] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, "An introduction to variational methods for graphical models," Machine Learning, vol. 37, no. 2, Nov.
[60] M. J. Wainwright and M. I. Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, no. 1-2, Jan.
[61] R. Prabhavalkar, E. Fosler-Lussier, and K. Livescu, "A factored conditional random field model for articulatory feature forced transcription," in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU '11), Hawaii, Dec. 2011.
[62] Y.-H. Sung and D. Jurafsky, "Hidden conditional random fields for phone recognition," in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU '09), Merano, Italy, Dec. 2009.
[63] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, vol. 39, no. 1, pp. 1-38.
[64] S. Sarawagi and W. W. Cohen, "Semi-Markov conditional random fields for information extraction," in Advances in Neural Information Processing Systems (NIPS '04), Vancouver, British Columbia, Canada, Dec. 2004.
[65] Y. He and E. Fosler-Lussier, "Efficient segmental conditional random fields for phone recognition," in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech '12), Portland, OR, USA, Sep. 2012.
[66] P. Liang, "Semi-supervised learning for natural language," Master's thesis, Dept. Electr. Eng. Comput. Sci., Massachusetts Inst. Technol. (MIT), Cambridge, MA, USA.
[67] G. Andrew, "A hybrid Markov/semi-Markov conditional random field for sequence segmentation," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '06), Sydney, Australia, Jul. 2006.
[68] G. Zweig, P. Nguyen, D. Van Compernolle, K. Demuynck, L. Atlas, P. Clark, G. Sell, M. Wang, F. Sha, H. Hermansky, D. Karakos, A. Jansen, S. Thomas, G. S. V. S. Sivaram, S. Bowman, and J. Kao, "Speech recognition with segmental conditional random fields: A summary of the JHU CLSP 2010 summer workshop," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '11), Prague, Czech Republic, May 2011.
[69] M. Layton, "Augmented statistical models for classifying sequence data," Ph.D. dissertation, Dept. Eng., Cambridge Univ., Cambridge, U.K.
[70] B. H. Juang and S. Katagiri, "Discriminative learning for minimum error classification," IEEE Transactions on Signal Processing, vol. 40, no. 12, Dec.
[71] X. Huang, A. Acero, and H. W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm and System Development. Prentice Hall.
[72] G. Heigold, P. Lehnen, R. Schlüter, and H. Ney, "On the equivalence of Gaussian and log-linear HMMs," in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech '08), Brisbane, Australia, Sep. 2008.
[73] H. Bourlard and N. Morgan, Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic Publishers.
[74] H. Hermansky, D. P. W. Ellis, and S. Sharma, "Tandem connectionist feature extraction for conventional HMM systems," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '00), vol. 3, Jun.
[75] J. S. Bridle, "Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition," in Neurocomputing: Algorithms, Architectures and Applications, ser. NATO ASI Series, Fogelman-Soulié and Hérault, Eds. Springer, 1990.
[76] Y. Konig, H. Bourlard, and N. Morgan, "REMAP: Recursive estimation and maximization of a posteriori probabilities - application to transition-based connectionist speech recognition," in Advances in Neural Information Processing Systems (NIPS '95), Denver, CO, USA, Nov. 1995.
[77] Y. Shi and M. Wang, "A dual-layer CRFs based joint decoding method for cascaded segmentation and labeling tasks," in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI '07), Hyderabad, India, Jan. 2007.
[78] N. A. Smith, D. A. Smith, and R. W. Tromble, "Context-based morphological disambiguation with random fields," in Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT-EMNLP '05), Vancouver, British Columbia, Canada, Oct. 2005.
[79] T. Kudo, K. Yamamoto, and Y. Matsumoto, "Applying conditional random fields to Japanese morphological analysis," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '04), Barcelona, Catalunya, Spain, Jul. 2004.
[80] A. McCallum and W. Li, "Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons," in Proceedings of the Conference on Natural Language Learning (CoNLL '03) at HLT-NAACL 2003, Edmonton, Canada, May 2003.
[81] J. Finkel, T. Grenager, and C. Manning, "Incorporating non-local information into information extraction systems by Gibbs sampling," in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL '05), Ann Arbor, MI, USA, Jun. 2005.
[82] J. Finkel, A. Kleeman, and C. Manning, "Efficient, feature-based, conditional random field parsing," in Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT '08), Columbus, OH, USA, Jun. 2008.

[83] J. Finkel and C. Manning, "Nested named entity recognition," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '09), Suntec, Singapore, Aug. 2009.
[84] ——, "Joint parsing and named entity recognition," in Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL '09), Boulder, Colorado, May 2009.
[85] F. Peng and A. McCallum, "Accurate information extraction from research papers using conditional random fields," in Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL '04), Boston, Massachusetts, USA, May 2004.
[86] C. Sutton and A. McCallum, "Collective segmentation and labeling of distant entities in information extraction," University of Massachusetts, Amherst, Tech. Rep.
[87] J. Morris and E. Fosler-Lussier, "Crandem: Conditional random fields for word recognition," in Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech '09), Brighton, United Kingdom, Sep. 2009.
[88] J. J. Morris, "A study on the use of conditional random fields for automatic speech recognition," Ph.D. dissertation, Dept. Comput. Sci. Eng., The Ohio State Univ., Columbus, OH, USA.
[89] D. Wang and S. King, "Letter-to-sound pronunciation prediction using conditional random fields," IEEE Signal Processing Letters, vol. 18, no. 2, Feb.
[90] S. Hahn, M. Dinarelli, C. Raymond, F. Lefevre, P. Lehnen, R. de Mori, A. Moschitti, H. Ney, and G. Riccardi, "Comparing stochastic approaches to spoken language understanding in multiple languages," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 6, Aug.
[91] D. Wang, "On ideal binary mask as the computational goal of auditory scene analysis," in Speech Separation by Humans and Machines, P. Divenyi, Ed. Springer US, 2005.
[92] C. Joder, S. Essid, and G. Richard, "A conditional random field framework for robust and scalable audio-to-score matching," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 8, Nov.
[93] E. Benetos and S. Dixon, "Joint multi-pitch detection using harmonic envelope estimation for polyphonic music transcription," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, Oct.

Eric Fosler-Lussier received the B.A.S. in Computer and Cognitive Studies and the B.A. in Linguistics from the University of Pennsylvania. He received the Ph.D. degree from the University of California, Berkeley, in 1999; his Ph.D. research was conducted at the International Computer Science Institute. As an Associate Professor with the Department of Computer Science and Engineering (and Linguistics, by courtesy) at The Ohio State University, he directs the Speech and Language Technologies (SLaTe) Laboratory. He currently serves on the IEEE Speech and Language Technical Committee, and received the 2010 Signal Processing Society Best Paper Award.

Yanzhang He received the B.E. degree in Software Engineering from Beihang University, Beijing, China. He is currently pursuing the Ph.D. degree in the Department of Computer Science and Engineering, The Ohio State University, Columbus. His current research interests include conditional random fields and discriminative learning for automatic speech recognition and keyword spotting.

Preethi Jyothi is a Ph.D. student at the Speech and Language Technologies (SLaTe) Laboratory of the Department of Computer Science and Engineering at The Ohio State University.
She completed her B.Tech. degree at the National Institute of Technology, Calicut, India. She received a Best Student Paper Award at Interspeech. Her research interests include automatic speech recognition (ASR), discriminative models, and speech-production-oriented models for ASR.

Rohit Prabhavalkar received the B.E. degree in Computer Engineering from the University of Pune, India, in 2007, and the M.S. degree in Computer Science and Engineering from The Ohio State University. He is currently a Ph.D. candidate in the Department of Computer Science and Engineering at The Ohio State University, working in the Speech and Language Technologies (SLaTe) Laboratory. His research interests include acoustic and pronunciation modeling for automatic speech recognition, natural language processing, and machine learning.


More information

Outline. Bayesian Networks: Maximum Likelihood Estimation and Tree Structure Learning. Our Model and Data. Outline

Outline. Bayesian Networks: Maximum Likelihood Estimation and Tree Structure Learning. Our Model and Data. Outline Outlne Bayesan Networks: Maxmum Lkelhood Estmaton and Tree Structure Learnng Huzhen Yu janey.yu@cs.helsnk.f Dept. Computer Scence, Unv. of Helsnk Probablstc Models, Sprng, 200 Notces: I corrected a number

More information

Maxent Models & Deep Learning

Maxent Models & Deep Learning Maxent Models & Deep Learnng 1. Last bts of maxent (sequence) models 1.MEMMs vs. CRFs 2.Smoothng/regularzaton n maxent models 2. Deep Learnng 1. What s t? Why s t good? (Part 1) 2. From logstc regresson

More information

Large-Margin HMM Estimation for Speech Recognition

Large-Margin HMM Estimation for Speech Recognition Large-Margn HMM Estmaton for Speech Recognton Prof. Hu Jang Department of Computer Scence and Engneerng York Unversty, Toronto, Ont. M3J 1P3, CANADA Emal: hj@cs.yorku.ca Ths s a jont work wth Chao-Jun

More information

EM and Structure Learning

EM and Structure Learning EM and Structure Learnng Le Song Machne Learnng II: Advanced Topcs CSE 8803ML, Sprng 2012 Partally observed graphcal models Mxture Models N(μ 1, Σ 1 ) Z X N N(μ 2, Σ 2 ) 2 Gaussan mxture model Consder

More information

Hidden Markov Models

Hidden Markov Models Hdden Markov Models Namrata Vaswan, Iowa State Unversty Aprl 24, 204 Hdden Markov Model Defntons and Examples Defntons:. A hdden Markov model (HMM) refers to a set of hdden states X 0, X,..., X t,...,

More information

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute

More information

Boostrapaggregating (Bagging)

Boostrapaggregating (Bagging) Boostrapaggregatng (Baggng) An ensemble meta-algorthm desgned to mprove the stablty and accuracy of machne learnng algorthms Can be used n both regresson and classfcaton Reduces varance and helps to avod

More information

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data Condtonal Random Felds: Probablstc Models for Segmentng and Labelng Sequence Data Paper by John Lafferty, Andrew McCallum, and Fernando Perera ICML 2001 Presentaton by Joe Drsh May 9, 2002 Man Goals Present

More information

CSC 411 / CSC D11 / CSC C11

CSC 411 / CSC D11 / CSC C11 18 Boostng s a general strategy for learnng classfers by combnng smpler ones. The dea of boostng s to take a weak classfer that s, any classfer that wll do at least slghtly better than chance and use t

More information

Relevance Vector Machines Explained

Relevance Vector Machines Explained October 19, 2010 Relevance Vector Machnes Explaned Trstan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introducton Ths document has been wrtten n an attempt to make Tppng s [1] Relevance Vector Machnes

More information

Hidden Markov Models

Hidden Markov Models CM229S: Machne Learnng for Bonformatcs Lecture 12-05/05/2016 Hdden Markov Models Lecturer: Srram Sankararaman Scrbe: Akshay Dattatray Shnde Edted by: TBD 1 Introducton For a drected graph G we can wrte

More information

Chapter 13: Multiple Regression

Chapter 13: Multiple Regression Chapter 13: Multple Regresson 13.1 Developng the multple-regresson Model The general model can be descrbed as: It smplfes for two ndependent varables: The sample ft parameter b 0, b 1, and b are used to

More information

10-701/ Machine Learning, Fall 2005 Homework 3

10-701/ Machine Learning, Fall 2005 Homework 3 10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40

More information

Markov Chain Monte Carlo Lecture 6

Markov Chain Monte Carlo Lecture 6 where (x 1,..., x N ) X N, N s called the populaton sze, f(x) f (x) for at least one {1, 2,..., N}, and those dfferent from f(x) are called the tral dstrbutons n terms of mportance samplng. Dfferent ways

More information

Markov Chain Monte Carlo (MCMC), Gibbs Sampling, Metropolis Algorithms, and Simulated Annealing Bioinformatics Course Supplement

Markov Chain Monte Carlo (MCMC), Gibbs Sampling, Metropolis Algorithms, and Simulated Annealing Bioinformatics Course Supplement Markov Chan Monte Carlo MCMC, Gbbs Samplng, Metropols Algorthms, and Smulated Annealng 2001 Bonformatcs Course Supplement SNU Bontellgence Lab http://bsnuackr/ Outlne! Markov Chan Monte Carlo MCMC! Metropols-Hastngs

More information

Online Classification: Perceptron and Winnow

Online Classification: Perceptron and Winnow E0 370 Statstcal Learnng Theory Lecture 18 Nov 8, 011 Onlne Classfcaton: Perceptron and Wnnow Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton In ths lecture we wll start to study the onlne learnng

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4) I. Classcal Assumptons Econ7 Appled Econometrcs Topc 3: Classcal Model (Studenmund, Chapter 4) We have defned OLS and studed some algebrac propertes of OLS. In ths topc we wll study statstcal propertes

More information

Gaussian Mixture Models

Gaussian Mixture Models Lab Gaussan Mxture Models Lab Objectve: Understand the formulaton of Gaussan Mxture Models (GMMs) and how to estmate GMM parameters. You ve already seen GMMs as the observaton dstrbuton n certan contnuous

More information

Chapter 11: Simple Linear Regression and Correlation

Chapter 11: Simple Linear Regression and Correlation Chapter 11: Smple Lnear Regresson and Correlaton 11-1 Emprcal Models 11-2 Smple Lnear Regresson 11-3 Propertes of the Least Squares Estmators 11-4 Hypothess Test n Smple Lnear Regresson 11-4.1 Use of t-tests

More information

Supporting Information

Supporting Information Supportng Informaton The neural network f n Eq. 1 s gven by: f x l = ReLU W atom x l + b atom, 2 where ReLU s the element-wse rectfed lnear unt, 21.e., ReLUx = max0, x, W atom R d d s the weght matrx to

More information

Multilayer Perceptron (MLP)

Multilayer Perceptron (MLP) Multlayer Perceptron (MLP) Seungjn Cho Department of Computer Scence and Engneerng Pohang Unversty of Scence and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjn@postech.ac.kr 1 / 20 Outlne

More information

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009 College of Computer & Informaton Scence Fall 2009 Northeastern Unversty 20 October 2009 CS7880: Algorthmc Power Tools Scrbe: Jan Wen and Laura Poplawsk Lecture Outlne: Prmal-dual schema Network Desgn:

More information

Computation of Higher Order Moments from Two Multinomial Overdispersion Likelihood Models

Computation of Higher Order Moments from Two Multinomial Overdispersion Likelihood Models Computaton of Hgher Order Moments from Two Multnomal Overdsperson Lkelhood Models BY J. T. NEWCOMER, N. K. NEERCHAL Department of Mathematcs and Statstcs, Unversty of Maryland, Baltmore County, Baltmore,

More information

A Robust Method for Calculating the Correlation Coefficient

A Robust Method for Calculating the Correlation Coefficient A Robust Method for Calculatng the Correlaton Coeffcent E.B. Nven and C. V. Deutsch Relatonshps between prmary and secondary data are frequently quantfed usng the correlaton coeffcent; however, the tradtonal

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

VQ widely used in coding speech, image, and video

VQ widely used in coding speech, image, and video at Scalar quantzers are specal cases of vector quantzers (VQ): they are constraned to look at one sample at a tme (memoryless) VQ does not have such constrant better RD perfomance expected Source codng

More information

Singular Value Decomposition: Theory and Applications

Singular Value Decomposition: Theory and Applications Sngular Value Decomposton: Theory and Applcatons Danel Khashab Sprng 2015 Last Update: March 2, 2015 1 Introducton A = UDV where columns of U and V are orthonormal and matrx D s dagonal wth postve real

More information

Classification as a Regression Problem

Classification as a Regression Problem Target varable y C C, C,, ; Classfcaton as a Regresson Problem { }, 3 L C K To treat classfcaton as a regresson problem we should transform the target y nto numercal values; The choce of numercal class

More information

Introduction to Hidden Markov Models

Introduction to Hidden Markov Models Introducton to Hdden Markov Models Alperen Degrmenc Ths document contans dervatons and algorthms for mplementng Hdden Markov Models. The content presented here s a collecton of my notes and personal nsghts

More information

1 Convex Optimization

1 Convex Optimization Convex Optmzaton We wll consder convex optmzaton problems. Namely, mnmzaton problems where the objectve s convex (we assume no constrants for now). Such problems often arse n machne learnng. For example,

More information

Negative Binomial Regression

Negative Binomial Regression STATGRAPHICS Rev. 9/16/2013 Negatve Bnomal Regresson Summary... 1 Data Input... 3 Statstcal Model... 3 Analyss Summary... 4 Analyss Optons... 7 Plot of Ftted Model... 8 Observed Versus Predcted... 10 Predctons...

More information

Global Sensitivity. Tuesday 20 th February, 2018

Global Sensitivity. Tuesday 20 th February, 2018 Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values

More information

Comparison of Regression Lines

Comparison of Regression Lines STATGRAPHICS Rev. 9/13/2013 Comparson of Regresson Lnes Summary... 1 Data Input... 3 Analyss Summary... 4 Plot of Ftted Model... 6 Condtonal Sums of Squares... 6 Analyss Optons... 7 Forecasts... 8 Confdence

More information

Finite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin

Finite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin Fnte Mxture Models and Expectaton Maxmzaton Most sldes are from: Dr. Maro Fgueredo, Dr. Anl Jan and Dr. Rong Jn Recall: The Supervsed Learnng Problem Gven a set of n samples X {(x, y )},,,n Chapter 3 of

More information

See Book Chapter 11 2 nd Edition (Chapter 10 1 st Edition)

See Book Chapter 11 2 nd Edition (Chapter 10 1 st Edition) Count Data Models See Book Chapter 11 2 nd Edton (Chapter 10 1 st Edton) Count data consst of non-negatve nteger values Examples: number of drver route changes per week, the number of trp departure changes

More information

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA 4 Analyss of Varance (ANOVA) 5 ANOVA 51 Introducton ANOVA ANOVA s a way to estmate and test the means of multple populatons We wll start wth one-way ANOVA If the populatons ncluded n the study are selected

More information

4DVAR, according to the name, is a four-dimensional variational method.

4DVAR, according to the name, is a four-dimensional variational method. 4D-Varatonal Data Assmlaton (4D-Var) 4DVAR, accordng to the name, s a four-dmensonal varatonal method. 4D-Var s actually a drect generalzaton of 3D-Var to handle observatons that are dstrbuted n tme. The

More information

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS Avalable onlne at http://sck.org J. Math. Comput. Sc. 3 (3), No., 6-3 ISSN: 97-537 COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

More information

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve

More information

CSC321 Tutorial 9: Review of Boltzmann machines and simulated annealing

CSC321 Tutorial 9: Review of Boltzmann machines and simulated annealing CSC321 Tutoral 9: Revew of Boltzmann machnes and smulated annealng (Sldes based on Lecture 16-18 and selected readngs) Yue L Emal: yuel@cs.toronto.edu Wed 11-12 March 19 Fr 10-11 March 21 Outlne Boltzmann

More information

Supplementary Notes for Chapter 9 Mixture Thermodynamics

Supplementary Notes for Chapter 9 Mixture Thermodynamics Supplementary Notes for Chapter 9 Mxture Thermodynamcs Key ponts Nne major topcs of Chapter 9 are revewed below: 1. Notaton and operatonal equatons for mxtures 2. PVTN EOSs for mxtures 3. General effects

More information

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity LINEAR REGRESSION ANALYSIS MODULE IX Lecture - 30 Multcollnearty Dr. Shalabh Department of Mathematcs and Statstcs Indan Insttute of Technology Kanpur 2 Remedes for multcollnearty Varous technques have

More information

Multilayer Perceptrons and Backpropagation. Perceptrons. Recap: Perceptrons. Informatics 1 CG: Lecture 6. Mirella Lapata

Multilayer Perceptrons and Backpropagation. Perceptrons. Recap: Perceptrons. Informatics 1 CG: Lecture 6. Mirella Lapata Multlayer Perceptrons and Informatcs CG: Lecture 6 Mrella Lapata School of Informatcs Unversty of Ednburgh mlap@nf.ed.ac.uk Readng: Kevn Gurney s Introducton to Neural Networks, Chapters 5 6.5 January,

More information

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers Psychology 282 Lecture #24 Outlne Regresson Dagnostcs: Outlers In an earler lecture we studed the statstcal assumptons underlyng the regresson model, ncludng the followng ponts: Formal statement of assumptons.

More information

NUMERICAL DIFFERENTIATION

NUMERICAL DIFFERENTIATION NUMERICAL DIFFERENTIATION 1 Introducton Dfferentaton s a method to compute the rate at whch a dependent output y changes wth respect to the change n the ndependent nput x. Ths rate of change s called the

More information

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore Sesson Outlne Introducton to classfcaton problems and dscrete choce models. Introducton to Logstcs Regresson. Logstc functon and Logt functon. Maxmum Lkelhood Estmator (MLE) for estmaton of LR parameters.

More information

Lecture 12: Classification

Lecture 12: Classification Lecture : Classfcaton g Dscrmnant functons g The optmal Bayes classfer g Quadratc classfers g Eucldean and Mahalanobs metrcs g K Nearest Neghbor Classfers Intellgent Sensor Systems Rcardo Guterrez-Osuna

More information

Bayesian predictive Configural Frequency Analysis

Bayesian predictive Configural Frequency Analysis Psychologcal Test and Assessment Modelng, Volume 54, 2012 (3), 285-292 Bayesan predctve Confgural Frequency Analyss Eduardo Gutérrez-Peña 1 Abstract Confgural Frequency Analyss s a method for cell-wse

More information

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010 Parametrc fractonal mputaton for mssng data analyss Jae Kwang Km Survey Workng Group Semnar March 29, 2010 1 Outlne Introducton Proposed method Fractonal mputaton Approxmaton Varance estmaton Multple mputaton

More information

Problem Set 9 Solutions

Problem Set 9 Solutions Desgn and Analyss of Algorthms May 4, 2015 Massachusetts Insttute of Technology 6.046J/18.410J Profs. Erk Demane, Srn Devadas, and Nancy Lynch Problem Set 9 Solutons Problem Set 9 Solutons Ths problem

More information

Chapter 8 Indicator Variables

Chapter 8 Indicator Variables Chapter 8 Indcator Varables In general, e explanatory varables n any regresson analyss are assumed to be quanttatve n nature. For example, e varables lke temperature, dstance, age etc. are quanttatve n

More information

1. Inference on Regression Parameters a. Finding Mean, s.d and covariance amongst estimates. 2. Confidence Intervals and Working Hotelling Bands

1. Inference on Regression Parameters a. Finding Mean, s.d and covariance amongst estimates. 2. Confidence Intervals and Working Hotelling Bands Content. Inference on Regresson Parameters a. Fndng Mean, s.d and covarance amongst estmates.. Confdence Intervals and Workng Hotellng Bands 3. Cochran s Theorem 4. General Lnear Testng 5. Measures of

More information

Tracking with Kalman Filter

Tracking with Kalman Filter Trackng wth Kalman Flter Scott T. Acton Vrgna Image and Vdeo Analyss (VIVA), Charles L. Brown Department of Electrcal and Computer Engneerng Department of Bomedcal Engneerng Unversty of Vrgna, Charlottesvlle,

More information

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) Maxmum Lkelhood Estmaton (MLE) Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A Wnter 01 UCSD Statstcal Learnng Goal: Gven a relatonshp between a feature vector x and a vector y, and d data samples (x,y

More information

Resource Allocation with a Budget Constraint for Computing Independent Tasks in the Cloud

Resource Allocation with a Budget Constraint for Computing Independent Tasks in the Cloud Resource Allocaton wth a Budget Constrant for Computng Independent Tasks n the Cloud Wemng Sh and Bo Hong School of Electrcal and Computer Engneerng Georga Insttute of Technology, USA 2nd IEEE Internatonal

More information

MACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression

MACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression 11 MACHINE APPLIED MACHINE LEARNING LEARNING MACHINE LEARNING Gaussan Mture Regresson 22 MACHINE APPLIED MACHINE LEARNING LEARNING Bref summary of last week s lecture 33 MACHINE APPLIED MACHINE LEARNING

More information

We present the algorithm first, then derive it later. Assume access to a dataset {(x i, y i )} n i=1, where x i R d and y i { 1, 1}.

We present the algorithm first, then derive it later. Assume access to a dataset {(x i, y i )} n i=1, where x i R d and y i { 1, 1}. CS 189 Introducton to Machne Learnng Sprng 2018 Note 26 1 Boostng We have seen that n the case of random forests, combnng many mperfect models can produce a snglodel that works very well. Ths s the dea

More information

Statistics for Economics & Business

Statistics for Economics & Business Statstcs for Economcs & Busness Smple Lnear Regresson Learnng Objectves In ths chapter, you learn: How to use regresson analyss to predct the value of a dependent varable based on an ndependent varable

More information

Which Separator? Spring 1

Which Separator? Spring 1 Whch Separator? 6.034 - Sprng 1 Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng 3 Margn of a pont " # y (w $ + b) proportonal

More information

Module 9. Lecture 6. Duality in Assignment Problems

Module 9. Lecture 6. Duality in Assignment Problems Module 9 1 Lecture 6 Dualty n Assgnment Problems In ths lecture we attempt to answer few other mportant questons posed n earler lecture for (AP) and see how some of them can be explaned through the concept

More information

Limited Dependent Variables

Limited Dependent Variables Lmted Dependent Varables. What f the left-hand sde varable s not a contnuous thng spread from mnus nfnty to plus nfnty? That s, gven a model = f (, β, ε, where a. s bounded below at zero, such as wages

More information

Numerical Heat and Mass Transfer

Numerical Heat and Mass Transfer Master degree n Mechancal Engneerng Numercal Heat and Mass Transfer 06-Fnte-Dfference Method (One-dmensonal, steady state heat conducton) Fausto Arpno f.arpno@uncas.t Introducton Why we use models and

More information

Assortment Optimization under MNL

Assortment Optimization under MNL Assortment Optmzaton under MNL Haotan Song Aprl 30, 2017 1 Introducton The assortment optmzaton problem ams to fnd the revenue-maxmzng assortment of products to offer when the prces of products are fxed.

More information

Expectation Maximization Mixture Models HMMs

Expectation Maximization Mixture Models HMMs -755 Machne Learnng for Sgnal Processng Mture Models HMMs Class 9. 2 Sep 200 Learnng Dstrbutons for Data Problem: Gven a collecton of eamples from some data, estmate ts dstrbuton Basc deas of Mamum Lelhood

More information

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering /

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering / Theory and Applcatons of Pattern Recognton 003, Rob Polkar, Rowan Unversty, Glassboro, NJ Lecture 4 Bayes Classfcaton Rule Dept. of Electrcal and Computer Engneerng 0909.40.0 / 0909.504.04 Theory & Applcatons

More information

Lecture 7: Boltzmann distribution & Thermodynamics of mixing

Lecture 7: Boltzmann distribution & Thermodynamics of mixing Prof. Tbbtt Lecture 7 etworks & Gels Lecture 7: Boltzmann dstrbuton & Thermodynamcs of mxng 1 Suggested readng Prof. Mark W. Tbbtt ETH Zürch 13 März 018 Molecular Drvng Forces Dll and Bromberg: Chapters

More information

Chapter 5 Multilevel Models

Chapter 5 Multilevel Models Chapter 5 Multlevel Models 5.1 Cross-sectonal multlevel models 5.1.1 Two-level models 5.1.2 Multple level models 5.1.3 Multple level modelng n other felds 5.2 Longtudnal multlevel models 5.2.1 Two-level

More information

Vapnik-Chervonenkis theory

Vapnik-Chervonenkis theory Vapnk-Chervonenks theory Rs Kondor June 13, 2008 For the purposes of ths lecture, we restrct ourselves to the bnary supervsed batch learnng settng. We assume that we have an nput space X, and an unknown

More information

NP-Completeness : Proofs

NP-Completeness : Proofs NP-Completeness : Proofs Proof Methods A method to show a decson problem Π NP-complete s as follows. (1) Show Π NP. (2) Choose an NP-complete problem Π. (3) Show Π Π. A method to show an optmzaton problem

More information

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING 1 ADVANCED ACHINE LEARNING ADVANCED ACHINE LEARNING Non-lnear regresson technques 2 ADVANCED ACHINE LEARNING Regresson: Prncple N ap N-dm. nput x to a contnuous output y. Learn a functon of the type: N

More information

Feature-Rich Sequence Models. Statistical NLP Spring MEMM Taggers. Decoding. Derivative for Maximum Entropy. Maximum Entropy II

Feature-Rich Sequence Models. Statistical NLP Spring MEMM Taggers. Decoding. Derivative for Maximum Entropy. Maximum Entropy II Statstcal NLP Sprng 2010 Feature-Rch Sequence Models Problem: HMMs make t hard to work wth arbtrary features of a sentence Example: name entty recognton (NER) PER PER O O O O O O ORG O O O O O LOC LOC

More information

Linear Feature Engineering 11

Linear Feature Engineering 11 Lnear Feature Engneerng 11 2 Least-Squares 2.1 Smple least-squares Consder the followng dataset. We have a bunch of nputs x and correspondng outputs y. The partcular values n ths dataset are x y 0.23 0.19

More information

ELASTIC WAVE PROPAGATION IN A CONTINUOUS MEDIUM

ELASTIC WAVE PROPAGATION IN A CONTINUOUS MEDIUM ELASTIC WAVE PROPAGATION IN A CONTINUOUS MEDIUM An elastc wave s a deformaton of the body that travels throughout the body n all drectons. We can examne the deformaton over a perod of tme by fxng our look

More information

Discussion of Extensions of the Gauss-Markov Theorem to the Case of Stochastic Regression Coefficients Ed Stanek

Discussion of Extensions of the Gauss-Markov Theorem to the Case of Stochastic Regression Coefficients Ed Stanek Dscusson of Extensons of the Gauss-arkov Theorem to the Case of Stochastc Regresson Coeffcents Ed Stanek Introducton Pfeffermann (984 dscusses extensons to the Gauss-arkov Theorem n settngs where regresson

More information

Natural Language Processing and Information Retrieval

Natural Language Processing and Information Retrieval Natural Language Processng and Informaton Retreval Support Vector Machnes Alessandro Moschtt Department of nformaton and communcaton technology Unversty of Trento Emal: moschtt@ds.untn.t Summary Support

More information