IDIAP RESEARCH REPORT

A NEURAL NETWORK FOR CLASSIFICATION WITH INCOMPLETE DATA

Andrew C. Morris

IDIAP-RR 00-23

August 2000

REDUCED VERSION TO APPEAR IN Int. Conf. on Spoken Language Processing, ICSLP 2000, Beijing

Dalle Molle Institute for Perceptual Artificial Intelligence, P.O. Box 592, Martigny, Valais, Switzerland
email: secretariat@idiap.ch


A NEURAL NETWORK FOR CLASSIFICATION WITH INCOMPLETE DATA

Andrew C. Morris

IDIAP-RR 00-23, August 2000

Abstract. If the data vector for input to an automatic classifier is incomplete, the optimal estimate for each class probability must be calculated as the expected value of the classifier output. We identify a form of Radial Basis Function (RBF) classifier whose expected outputs can easily be evaluated in terms of the original function parameters. Two ways are described in which this classifier can be applied to robust automatic speech recognition, depending on whether or not the position of missing data is known.


Contents

1. Introduction
2. IDCN architecture
2.1 Position of missing data given
2.2 Position of missing data unknown
3. IDCN training
3.1 Parameter initialisation
3.2 Error gradient calculation
3.3 Gradient descent iteration
4. Recognition with missing data
5. Summary and conclusion
Acknowledgements
Appendix A: Using HTK for both Gaussians and output layer weights initialisation
Appendix B: Derivation of IDCN error gradient equations
Appendix C: Derivation of expected class posterior probabilities
References


1. Introduction

In any realistic automatic recognition task it is common that part of the input feature vector x to be classified is corrupted by some kind of noise process, and the recognition performance of a system which is not trained to expect this kind of noise will degrade dramatically as the noise level increases. In many cases this problem can be reduced by applying some kind of noise removal or data enhancement process. But there are also many situations in which some feature components are irretrievable. The approach taken in this case depends on the extent to which it is possible to identify which features have been corrupted.

If the position of missing features is given, then the estimate for the posterior probability for each class which is best, in the sense that it gives the maximum probability of correct classification, can be obtained as the expected value of the classifier output for that class, conditioned on any available constraints on the missing data [10]. The main problem with this approach is that for most classifiers the expected value of the class probability outputs cannot be obtained as a simple closed form expression from the classifier parameters.

If the position of missing data is not known, one successful approach [6, 11, 12] has been to train a separate classifier for each possible position of missing data and then to combine the posteriors for each class as a weighted sum over all classifiers. Even with equal weights this approach shows some robustness to missing data, because uncertain classifiers tend to contribute equal and therefore small probabilities to each class. The problem with this approach is that the number of different possible positions of missing data is generally far too large to allow training of a separate classifier for each position.

In this paper we present a particular form of Radial Basis Function (RBF) classifier in which the output layer uses Bayes' rule to directly transform pooled mixture likelihoods from the RBF layer into a posteriori class probabilities [2, 3, 8, 17]. Even though the output units are non-linear, the expected outputs of this classifier, for any given missing data components, are a simple function of the original classifier parameters. The use of closely related RBF networks for recognition with missing data is not new [1], but to the author's knowledge the particular form of incomplete data classification network (IDCN) described here has not been used before in either of the techniques presented in this report.

In Section 2 we present the IDCN architecture and describe how it can be applied in two different kinds of HMM/ANN hybrid system for automatic speech recognition (ASR), depending on whether or not the position of missing data is known. In Section 3 we describe various ways in which the IDCN can be trained for ASR. Section 4 shows how network outputs (class posterior probabilities) are calculated when some of the input features are missing. In Section 5 the work is summarised, problems arising are briefly discussed and new ways forward are suggested.

2. IDCN architecture

[Figure 1: RBF network used here for classification with incomplete data. Input units x_1 ... x_{n_x} feed RBF units y_1 ... y_{n_y}, which model p(x | r_j); output units z_1 ... z_{n_z} model P(s_k | x). The output layer uses Bayes' rule to directly transform pooled mixture likelihoods from the RBF layer into a posteriori class probabilities.]

The network has one input, one hidden and one output layer, as shown in Fig. 1. Each RBF unit y_j in the hidden layer uses a diagonal covariance Gaussian y_j(x) to model the probability density p(x | r_j) for input vector x having been generated by this Gaussian, while each output unit uses a function z_k(x) to model the posterior probability that x is from output class k. If r_j denotes that x was generated by Gaussian j, and s_k that x is from class k, then:

    y_j(x) = p(x | r_j) = N(x; \mu_j, v_j)                                            (1)

    z_k(x) = P(s_k | x) = p(x, s_k) / p(x) = net_k / p(x)                             (2)

    net_k = \sum_j P(x, r_j, s_k) = \sum_j P(r_j, s_k) p(x | r_j, s_k)
          = \sum_j P(r_j, s_k) p(x | r_j) = \sum_j w_{jk} y_j(x)                      (3)

    p(x) = \sum_k \sum_j p(x, r_j, s_k) = \sum_k net_k                                (4)

Although the above structure of the IDCN does not change, the way in which it is applied depends on whether the position of missing input data is known.

2.1 Position of missing data given

The IDCN can be used as a front end to a conventional HMM based ASR system, whereby the log likelihoods which are normally calculated from the Gaussian mixture models for each hidden state are replaced, during decoding, by log scaled likelihoods from the IDCN (obtained by dividing each posterior by its class prior P(s_k), then taking the logarithm). This comprises a form of HMM/ANN based ASR system [3] which is suitable for use with missing data when the position of missing data is given. The main potential advantage of this model over the purely HMM based missing-data-theory system described in [10] is that the ANN is discriminatively trained and provides a more powerful model for capturing spectral dynamics.
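To make the flow of Eqs. (1)-(4) concrete, here is a minimal sketch of the IDCN forward pass in Python, assuming diagonal covariance Gaussians. The array names and shapes (mu, var, W) are illustrative choices, not taken from the report.

```python
import numpy as np

def idcn_forward(x, mu, var, W):
    """x: (n_x,) input; mu, var: (n_y, n_x) Gaussian parameters;
    W[j, k] = P(r_j, s_k), entries >= 0 and summing to one overall.
    Returns (y, net, z): RBF likelihoods, joint terms, class posteriors."""
    # Eq. (1): y_j(x) = N(x; mu_j, v_j), a product of univariate Gaussians,
    # computed in the log domain to avoid underflow in high dimensions
    log_y = -0.5 * np.sum((x - mu) ** 2 / var + np.log(2 * np.pi * var), axis=1)
    y = np.exp(log_y)                 # (n_y,)
    net = W.T @ y                     # Eq. (3): net_k = sum_j w_jk y_j(x)
    p_x = net.sum()                   # Eq. (4): p(x) = sum_k net_k
    z = net / p_x                     # Eq. (2): z_k(x) = P(s_k | x)
    return y, net, z
```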

2.2 Position of missing data unknown

In principle a single IDCN can be used to replace the 2^d different ANN experts which are normally required [12] to cover all possible selections of missing features from a d-dimensional feature vector. Provided that the combined features input to each ANN expert are merely concatenated (i.e. no compression, orthogonalisation, or other transformation is applied), the expected posteriors for each position of missing features can be computed directly from the IDCN parameters, and then simply combined in a linearly weighted sum [11] or geometrically weighted product [5].

3. IDCN training

Classifier parameters to be trained are the mean and variance vectors in Eq. (1) for each Gaussian RBF unit, and the output layer weights w_{jk} in Eq. (3). In order for the performance of this classifier to compete with that of the MLP, it is essential that all parameters are trained together, and with a discriminative objective function. Unsupervised discriminative training is also possible, using minimum classification error techniques [9]. However, in this article we take the simpler approach of training by supervised gradient descent. During training the softmax function is used to constrain the weights w_{jk} = P(r_j, s_k) to lie in [0, 1] and sum to one:

    w_{jk} = e^{\alpha_{jk}} / \sum_{l,m} e^{\alpha_{lm}}                             (5)

This gives the full set of parameters to be trained as (\mu_{ij}, v_{ij}, \alpha_{jk}), for 1 ≤ i ≤ n_x, 1 ≤ j ≤ n_y, 1 ≤ k ≤ n_z.

3.1 Parameter initialisation

Any hill climbing procedure can encounter problems with local minima, so that system performance may be very sensitive to the initial parameter values used. In the context of the TIDigits connected digits ASR task, the following two methods were tested [13] for initialising the RBF layer parameters (means, variances, and priors P(r_j)):

- Randomly assign each data point to an RBF centre, followed by k-means clustering and likelihood maximisation by Expectation Maximisation (EM).
- Use HTK (version 1.5) [18] to train a set of 400 pooled Gaussians, using the Baum-Welch forward-backward training algorithm, with embedded realignment. As well as training the RBF layer parameters, HTK also trains mix weights P(r_j | s_k) for each of the hidden states, as specified by whatever HMM structure is to be used in recognition.

Whichever of the above methods was used, the trained HMM model was also used to provide a training data segmentation, from which we can estimate P(s_k). Once the Gaussian parameters were initialised, two methods were tested for initialising the weights w_{jk}, using the given segmentation:

- Use HMM trained mix weights P(r_j | s_k) only:

    w_{jk} = P(r_j, s_k) = P(r_j | s_k) P(s_k)                                        (6)

- Use HMM trained Gaussians only (see Appendix A for derivation of this rule):

    P(r_j) = \sum_k P(r_j | s_k) P(s_k),
    P(s_k | r_j) = \sum_{x \in s_k} y_j(x) / \sum_x y_j(x),
    w_{jk} = P(r_j) P(s_k | r_j)                                                      (7)
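As a rough illustration of the second initialisation rule, the sketch below computes w_{jk} from Eq. (7). The inputs Y (frame likelihoods y_j(x_t)), labels (the integer state segmentation) and P_r_given_s (the HTK-trained mix weights P(r_j | s_k)) are assumed, illustratively named quantities.

```python
import numpy as np

def init_weights_method2(Y, labels, P_r_given_s, n_z):
    """Y: (T, n_y) with Y[t, j] = y_j(x_t); labels: (T,) ints in 0..n_z-1;
    P_r_given_s: (n_y, n_z). Returns W with W[j, k] = P(r_j) P(s_k | r_j)."""
    P_s = np.bincount(labels, minlength=n_z) / len(labels)  # state priors P(s_k)
    P_r = P_r_given_s @ P_s                # P(r_j) = sum_k P(r_j|s_k) P(s_k)
    # P(s_k | r_j) = sum_{x in s_k} y_j(x) / sum_x y_j(x), per Eq. (7)
    num = np.zeros((Y.shape[1], n_z))
    for k in range(n_z):
        num[:, k] = Y[labels == k].sum(axis=0)
    P_s_given_r = num / num.sum(axis=1, keepdims=True)
    W = P_r[:, None] * P_s_given_r         # w_jk = P(r_j) P(s_k | r_j)
    return W / W.sum()                     # enforce sum_jk w_jk = 1
```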

Of these different RBF layer and output layer initialisation methods, the best results by far were obtained using RBFs trained with HTK, and output weights trained using Eq. (7). Before gradient descent training, the auxiliary parameters \alpha_{jk} were then initialised as:

    \alpha_{jk} = \log(w_{jk})                                                        (8)

3.2 Error gradient calculation

Whichever error function E is used, the derivatives of E with respect to each of the model parameters were obtained by the usual error back propagation (EBP) approach, first calculating the delta values for each output unit. See Appendix B for details of the EBP algorithm and derivation of Eqs. (9)-(12):

    \delta_k = \partial E / \partial net_k
             = ( \partial E / \partial z_k - \sum_l (\partial E / \partial z_l) z_l ) / p(x)    (9)

    \partial E / \partial \mu_{ij} = ( (x_i - \mu_{ij}) / v_{ij} ) y_j \sum_k w_{jk} \delta_k   (10)

    \partial E / \partial v_{ij} = ( (x_i - \mu_{ij})^2 / (2 v_{ij}^2) - 1 / (2 v_{ij}) ) y_j \sum_k w_{jk} \delta_k   (11)

    \partial E / \partial \alpha_{jk} = w_{jk} ( 1 - w_{jk} ) y_j \delta_k            (12)

If \tau_l is the target posterior for class l, then for three common error functions (to be minimised):

    E = \sum_{n,l} ( z_l(x_n) - \tau_l(x_n) )^2 : mean square error                   (13)

    E = - \sum_{n,l} \tau_l(x_n) \log z_l(x_n) : cross-entropy                        (14)

    E = - \sum_{n,l} z_l(x_n) \tau_l(x_n) : correlation                               (15)

we have \partial E / \partial z_l (dropping the n subscript) as:

    \partial E / \partial z_l = 2 ( z_l - \tau_l ) : mean square error                (16)

    \partial E / \partial z_l = - \tau_l / z_l : cross-entropy                        (17)

    \partial E / \partial z_l = - \tau_l : correlation                                (18)

Best results here used the cross-entropy objective.
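The following sketch evaluates Eqs. (9)-(12) for a single frame under the cross-entropy objective of Eq. (17), reusing idcn_forward from the sketch in Section 2; tau is an assumed target posterior vector (e.g. one-hot state labels). This is a hedged illustration under the naming conventions above, not the report's code.

```python
import numpy as np

def idcn_gradients(x, tau, mu, var, W):
    y, net, z = idcn_forward(x, mu, var, W)
    p_x = net.sum()
    dE_dz = -tau / z                                   # Eq. (17)
    # Eq. (9): delta_k = (dE/dz_k - sum_l dE/dz_l z_l) / p(x)
    delta = (dE_dz - np.dot(dE_dz, z)) / p_x
    wd = W @ delta                                     # sum_k w_jk delta_k, (n_y,)
    scale = (y * wd)[:, None]
    dE_dmu = ((x - mu) / var) * scale                  # Eq. (10)
    dE_dvar = ((x - mu) ** 2 / (2 * var ** 2) - 1 / (2 * var)) * scale   # Eq. (11)
    dE_dalpha = W * (1 - W) * np.outer(y, delta)       # Eq. (12)
    return dE_dmu, dE_dvar, dE_dalpha
```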

3.3 Gradient descent iteration

A constant momentum factor \theta is used, together with an adaptable learning rate \varphi_t [4]. With g_t = \partial E / \partial w_t, g_0 = 0, dw_0 = 0, \varphi_0 = 1, we have (where \hat{g} denotes a unit vector):

    \varphi_{t+1} = \varphi_t ( 1 - 0.5 \hat{dw}_t \cdot \hat{g}_t )                  (19)

    dw_{t+1} = \varphi_{t+1} ( \theta dw_t - \hat{g}_t )                              (20)

    w_{t+1} = w_t + dw_{t+1}                                                          (21)

Training continues until the correct state classification rate on the cross-validation set stops increasing. The gradient with respect to all IDCN parameters was evaluated, and all parameters updated, using all frames from a fixed number of utterances selected at random from the full training set. We found that very small samples led rapidly to one or more RBFs developing zero priors, from which they could not escape. As a compromise between processing speed and performance level at convergence, we settled on samples of 100 utterances.

It was found that further training of the RBF parameters by EM after gradient descent training had converged, followed by application of Eq. (7), inevitably resulted in a very rapid increase in data likelihood, accompanied by an equally dramatic fall in classification accuracy. As a result this technique was not used.

4. Recognition with missing data

As outlined in Section 2, the way in which the IDCN is incorporated into a recognition system depends on whether or not the position of missing data is given. If it is given, then expected posterior probabilities for each state need to be calculated just once, for the given position of missing data. Otherwise the expected posteriors need to be calculated for all possible positions of missing data, and averaged [3]. Whichever is the case, for any given position of missing data we may denote the present and missing components of the feature vector by (x_p, x_m). The estimate for P(s_k | x) which results in the highest probability of correct classification is then given by the expected value of the classifier output function, conditioned on x_p and any knowledge \kappa_m which may constrain the missing data values [10]. For the RBF classifier presented here this leads to the following estimates. If nothing is known about the missing data, then (see Appendix C for derivation of Eqs. (22)-(24)):

    \hat{z}_k(x) = E[ P(s_k | x) | x_p ] \propto \sum_j w_{jk} y_j(x_p)               (22)

If each missing feature has a limited range r_m of possible values (as is the case for filterbank features, which are bounded below by zero and above by their observed value):

    \hat{z}_k(x) = E[ P(s_k | x) | x_p, x_m \in r_m ]                                 (23)

                 \propto \sum_j w_{jk} y_j(x_p) \int_{r_m} y_j(x_m) dx_m              (24)

In Eqs. (22) and (24), y_j(x_p) is the marginal diagonal Gaussian over the indicated components of x. Posteriors \hat{z}_k(x) are obtained by scaling the above values to sum to one across all classes.
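For the unbounded case of Eq. (22), marginalising a diagonal Gaussian simply drops the missing dimensions, so the expected posteriors need only the present components. A minimal sketch under the naming of the earlier sketches, with mask an assumed boolean vector marking present features:

```python
import numpy as np

def expected_posteriors_unbounded(x, mask, mu, var, W):
    """Eq. (22): E[P(s_k|x) | x_p] up to a common scale factor, then
    rescaled to sum to one. mask[i] is True where x_i is present."""
    # marginal y_j(x_p): product of univariate Gaussians over present dims
    d2 = ((x - mu) ** 2 / var + np.log(2 * np.pi * var))[:, mask]
    y_p = np.exp(-0.5 * d2.sum(axis=1))
    theta = W.T @ y_p                  # sum_j w_jk y_j(x_p), per class k
    return theta / theta.sum()         # scale to sum to one across classes
```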

It should be noted that it is only due to the consistent probabilistic interpretation of each stage of processing by this network that it is so simple to obtain the marginal posteriors in this way, directly from the full system parameters.

5. Summary and conclusion

We have shown that an RBF network in which the output layer uses Bayes' rule to directly transform pooled mixture likelihoods from the RBF layer into a posteriori class probabilities is a suitable candidate network for classification with missing data. This is because it can be discriminatively trained, and the expected values of its posterior class probability outputs can readily be evaluated as a simple function of the original model parameters. We have further shown how this network can be incorporated into two different approaches to robust ASR.

For the case where the position of missing data is known, we can integrate the IDCN into an HMM system by replacing the usual state likelihoods by scaled state likelihoods output from the IDCN. In this case the posteriors based system should show some advantage over the likelihood based system due to discriminative training. However, ASR tests [13] have shown that severe problems arise with local minima during IDCN training by gradient descent, to the point that very little performance improvement is possible after parameter initialisation through normal non-discriminative EM based HMM training. In fact, performance of the IDCN/HMM system was almost identical to that of a Gaussian mixture likelihood based HMM system using the same missing feature theory and the same method for detecting missing data [16]. It is possible that the performance of the IDCN in this case could be improved by use of a more effective discriminative HMM training procedure, such as MCE [9] and/or boosting [15].

When the position of missing data is not known, the IDCN offers a new approach to multi-stream processing which should permit large numbers of feature streams to be combined with greatly reduced effort. This approach remains to be tested.

Acknowledgements

This work was supported by the EC/OFES (European Community / Swiss Federal Office for Education and Science) RESPITE project (REcognition of Speech by Partial Information TEchniques). Recognition tests for the methods presented in this report were carried out in collaboration with the speech group at Sheffield University, U.K.

Appendix A: Using HTK for both Gaussians and output layer weights initialisation

HTK can be used to estimate the Gaussian parameters \mu_j, v_j and mix weights P(r_j | s_k) for each pooled Gaussian r_j and hidden state s_k. The trained HMMs can then be used to produce a state level segmentation. From this segmentation we can directly estimate state priors P(s_k) from the relative frequency of occurrence of each state in the training data (footnote 1). The number of free parameters to be trained should first be reduced by combining P(r_j | s_k) and P(s_k) into P(r_j, s_k). We have tested two ways of doing this.

Method 1 uses Eq. (6), Section 3.1:

    w_{jk} = P(r_j, s_k) = P(r_j | s_k) P(s_k)

Method 2 starts as method 1, but then estimates w_{jk} = P(r_j, s_k) = P(r_j) P(s_k | r_j), by first estimating P(r_j) using Eq. (7), and then estimating P(s_k | r_j), also using Eq. (7):

    P(s_k | r_j) = \sum_{x \in s_k} y_j(x) / \sum_x y_j(x)

This is derived as follows, estimating each probability by a sum over the training frames x:

    P(s_k | r_j) = P(r_j, s_k) / P(r_j)                                               (25)

    = \sum_x p(x, r_j, s_k) / \sum_x p(x, r_j)                                        (26)

    = \sum_x P(s_k | x, r_j) p(x, r_j) / \sum_x p(x, r_j)                             (27)

    = \sum_{x \in s_k} y_j(x) / \sum_x y_j(x)                                         (28)

where the segmentation fixes P(s_k | x, r_j) to one when frame x is labelled s_k and zero otherwise, and p(x, r_j) = P(r_j) y_j(x), so that P(r_j) cancels in the ratio.

In the ASR tests made, it was found that method 2 gave far better recognition results. However, it is not clear why this should be so, and so this result may not generalise to other databases.

Footnote 1: If one or more states occur only a small number of times (so that the variance of the relative error in the relative frequency estimate is unacceptably high), then all state prior estimates should be weighted towards the uniform prior 1/n. This is a commonly used probability estimate correction, which is directly related to the so called m-estimate, where the weighting factor is proportional to the prior degree of belief that the probabilities are all equal.
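As a small illustration of the prior-smoothing correction described in footnote 1, the following shrinks the relative-frequency estimates towards the uniform prior 1/n_z with an m-estimate; the smoothing strength m is an assumed parameter, not a value from the report.

```python
import numpy as np

def smoothed_state_priors(labels, n_z, m=10.0):
    """m-estimate of P(s_k): (count_k + m * (1/n_z)) / (N + m).
    labels: (T,) integer state labels from the segmentation."""
    counts = np.bincount(labels, minlength=n_z)
    return (counts + m / n_z) / (counts.sum() + m)   # -> uniform as m grows
```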

Appendix B: Derivation of IDCN error gradient equations

Although in the present case the neural network under consideration has only one hidden layer, it is still helpful to make use of the error back propagation (EBP) theoretical framework, which is based around the idea that, for any network with connections between adjacent layers only, the contributions to the error gradient for parameters in any given layer can be obtained in terms of quantities \delta which are first evaluated for each node in the layer above:

    z_l = net_l / p(x), where net_k = \sum_m w_{mk} y_m(x), p(x) = \sum_m net_m       (29)

The delta rule makes use of the chain rule for partial differentials, as follows (\delta_{kl} here denotes the Kronecker delta):

    \delta_k = \partial E / \partial net_k
             = \sum_l (\partial E / \partial z_l)(\partial z_l / \partial net_k)
             = \sum_l (\partial E / \partial z_l)(\delta_{kl} - z_l) / p(x)
             = ( \partial E / \partial z_k - \sum_l (\partial E / \partial z_l) z_l ) / p(x)    (30)

We can obtain the error gradient with respect to the output layer parameters as follows:

    \partial E / \partial w_{jk} = (\partial E / \partial net_k)(\partial net_k / \partial w_{jk}) = \delta_k y_j(x)    (31)

    \partial E / \partial \alpha_{jk} = (\partial E / \partial w_{jk})(\partial w_{jk} / \partial \alpha_{jk})          (32)

    \partial w_{jk} / \partial \alpha_{jk} = \partial / \partial \alpha_{jk} [ e^{\alpha_{jk}} / \sum_{l,m} e^{\alpha_{lm}} ]
                                           = e^{\alpha_{jk}} / \sum_{l,m} e^{\alpha_{lm}} - ( e^{\alpha_{jk}} / \sum_{l,m} e^{\alpha_{lm}} )^2    (33)

                                           = w_{jk} ( 1 - w_{jk} )                    (34)

    \partial E / \partial \alpha_{jk} = \delta_k y_j w_{jk} ( 1 - w_{jk} )            (35)

The error gradient for parameters \mu_{ij} and v_{ij} in the hidden layer can be obtained in a similar way, as follows:

    \partial E / \partial \mu_{ij} = (\partial E / \partial y_j)(\partial y_j / \partial \mu_{ij}),
    where \partial E / \partial y_j = \sum_k (\partial E / \partial net_k)(\partial net_k / \partial y_j) = \sum_k \delta_k w_{jk}    (36)

    \partial y_j / \partial \mu_{ij} = y_j (x_i - \mu_{ij}) / v_{ij},
    so \partial E / \partial \mu_{ij} = ( (x_i - \mu_{ij}) / v_{ij} ) y_j \sum_k w_{jk} \delta_k    (37)

    \partial y_j / \partial v_{ij} = y_j ( (x_i - \mu_{ij})^2 / (2 v_{ij}^2) - 1 / (2 v_{ij}) ),
    so \partial E / \partial v_{ij} = ( (x_i - \mu_{ij})^2 / (2 v_{ij}^2) - 1 / (2 v_{ij}) ) y_j \sum_k w_{jk} \delta_k    (38)
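One way to sanity-check the derivation of Eqs. (29)-(38) is a finite-difference comparison against the analytic gradients; the sketch below does this for the means under the cross-entropy objective, reusing idcn_forward and idcn_gradients from the earlier sketches. Purely illustrative.

```python
import numpy as np

def cross_entropy(x, tau, mu, var, W):
    _, _, z = idcn_forward(x, mu, var, W)
    return -np.sum(tau * np.log(z))                    # Eq. (14), one frame

def check_mu_gradient(x, tau, mu, var, W, eps=1e-6):
    """Max absolute difference between analytic dE/dmu (Eq. (37)) and
    central finite differences of the cross-entropy error."""
    analytic = idcn_gradients(x, tau, mu, var, W)[0]
    numeric = np.zeros_like(mu)
    for idx in np.ndindex(*mu.shape):
        m_plus, m_minus = mu.copy(), mu.copy()
        m_plus[idx] += eps
        m_minus[idx] -= eps
        numeric[idx] = (cross_entropy(x, tau, m_plus, var, W)
                        - cross_entropy(x, tau, m_minus, var, W)) / (2 * eps)
    return np.max(np.abs(analytic - numeric))
```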

Appendix C: Derivation of expected class posterior probabilities

When a parametric classifier z_k(x) is trained to estimate posterior class probabilities P(s_k | x), and the position of missing components in the data vector x is given, so that it can be partitioned into present and missing parts (x_p, x_m), then the estimate for the posterior probability for each class which is best, in the sense that it gives the maximum probability of correct classification, is given by the expected value of the network output, conditioned on the data which is not missing and on any available knowledge \kappa_m which may be used to constrain the missing data values (footnote 1) [10]:

    \hat{z}_k(x) = E[ z_k(x) | x_p, \kappa_m ] = E[ P(s_k | x) | x_p, \kappa_m ]
                 = \int ( p(s_k, x) / p(x) ) p(x_m | x_p, \kappa_m) dx_m              (39)

Here we will consider just two missing-data conditions: one in which nothing at all is known about the missing data values, and another in which the missing data is known to lie within a given range r_m. In the second case we have:

    p(x_m | x_p, \kappa_m) = p(x_m | x_p) / \int_{r_m} p(x_m | x_p) dx_m
                             when x_m \in r_m, else 0                                 (40)

so, using p(x) = p(x_p) p(x_m | x_p):

    \hat{z}_k(x) = \int_{r_m} p(s_k, x) dx_m / ( p(x_p) \int_{r_m} p(x_m | x_p) dx_m )
                 = A \int_{r_m} p(s_k, x) dx_m                                        (41)

where A is independent of k. The integral can easily be evaluated as follows:

    \int_{r_m} p(s_k, x) dx_m = \sum_j w_{jk} \int_{r_m} y_j(x) dx_m
                              = \sum_j w_{jk} N_j(x_p) \int_{r_m} N_j(x_m | x_p) dx_m    (42)

Here we consider only the case of diagonal covariance, so that N_j(x_m | x_p) = N_j(x_m) [14], and

    \hat{z}_k(x) = A \sum_j w_{jk} N_j(x_p) \int_{r_m} N_j(x_m) dx_m                  (43)

The integral in Eq. (43) can easily be evaluated as the product of univariate Gaussian integrals, each of which can be evaluated using the C standard erf function. If the missing data is unbounded then the integral is just unity and can be ignored. As \sum_k P(s_k | x, \kappa_m) = 1, the constant A can be eliminated, to obtain \hat{z}_k(x) as follows:

    \theta_k(x) = \sum_j w_{jk} N_j(x_p) \int_{r_m} N_j(x_m) dx_m                     (44)

    \hat{z}_k(x) = \theta_k(x) / \sum_l \theta_l(x)                                   (45)

Footnote 1: Note that the analysis presented in [1] regarding posteriors estimation with missing data is incorrect.
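A hedged sketch of the bounded-missing-data computation of Eqs. (43)-(45), with each univariate Gaussian integral evaluated through the error function as described above. The inputs mask, lo and hi (present-feature mask and per-dimension bounds on the missing features) are assumptions for illustration.

```python
import numpy as np
from scipy.special import erf

def expected_posteriors_bounded(x, mask, lo, hi, mu, var, W):
    """Eqs. (43)-(45). mask[i] True if x_i is present; lo, hi: (n_x,)
    bounds on the missing features; mu, var: (n_y, n_x); W: (n_y, n_z)."""
    sd = np.sqrt(var)
    # marginal N_j(x_p): product of univariate Gaussians over present dims
    log_yp = -0.5 * np.sum(((x - mu) ** 2 / var
                            + np.log(2 * np.pi * var))[:, mask], axis=1)
    # integral of N_j(x_m) over [lo, hi], per dimension, via erf:
    # 0.5 * (erf((hi - mu)/(sd*sqrt(2))) - erf((lo - mu)/(sd*sqrt(2))))
    cdf = 0.5 * (erf((hi - mu) / (sd * np.sqrt(2)))
                 - erf((lo - mu) / (sd * np.sqrt(2))))
    log_int = np.sum(np.log(cdf[:, ~mask] + 1e-300), axis=1)
    theta = W.T @ np.exp(log_yp + log_int)   # Eq. (44)
    return theta / theta.sum()               # Eq. (45)
```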

References

[1] Ahmed, S. & Tresp, V. (1993) "Some solutions to the missing feature problem in vision", in Advances in Neural Information Processing Systems 5, Morgan Kaufmann, San Mateo.
[2] Bishop, C. (1995) Neural Networks for Pattern Recognition, Clarendon Press, Oxford.
[3] Bourlard, H. & Morgan, N. (1993) Connectionist Speech Recognition, Kluwer Academic Publishers, Boston.
[4] Chan, L.W. & Fallside, F. (1987) "An adaptive training algorithm for back-propagation networks", Computer Speech and Language 2.
[5] Hagen, A. & Morris, A.C. (in press) "Comparison of HMM experts with MLP experts in the full combination multi-band approach to robust ASR", Proc. ICSLP 2000.
[6] Hagen, A., Morris, A.C. & Bourlard, H. (1998) "Sub-band based speech recognition in noisy conditions: The full-combination approach", Research Report IDIAP-RR.
[7] Hermansky, H., Ellis, D. & Sharma, S. (2000) "Tandem connectionist feature stream extraction for conventional HMM systems", Proc. ICASSP 2000.
[8] Lippmann, R.P. & Carlson, B.A. (1997) "Using missing feature theory to actively select features for robust speech recognition with interruptions, filtering and noise", Proc. Eurospeech 97.
[9] McDermott, E. & Katagiri, S. (1994) "Prototype-based minimum classification error / generalised probabilistic descent training for various speech units", Computer Speech and Language 8.
[10] Morris, A.C., Cooke, M. & Green, P. (1998) "Some solutions to the missing feature problem in data classification, with application to noise robust ASR", Proc. ICASSP'98.
[11] Morris, A.C., Hagen, A. & Bourlard, H. (1999) "The full-combination subbands approach to noise robust HMM/ANN based ASR", Proc. Eurospeech 99.
[12] Morris, A.C., Hagen, A., Glotin, H. & Bourlard, H. (in press) "Multi-stream adaptive evidence combination for noise robust ASR", Speech Communication.
[13] Morris, A.C., Josifovski, L., Bourlard, H., Cooke, M. & Green, P. (in press) "A neural network for classification with incomplete data: application to robust ASR", Proc. ICSLP 2000.
[14] Morrison, D.F. (1990) Multivariate Statistical Methods, 3rd edition, McGraw-Hill.
[15] Schwenk, H. (1999) "Using boosting to improve a hybrid HMM/neural network speech recogniser", Proc. ICASSP 99.
[16] Vizinho, A., Green, P., Cooke, M. & Josifovski, L. (1999) "Missing data theory, spectral subtraction and signal-to-noise estimation for robust ASR: An integrated study", Proc. Eurospeech 99.
[17] White, H. (1989) "Learning in artificial neural networks: A statistical perspective", Neural Computation 1.
[18] Young, S.J. & Woodland, P.C. (1993) HTK Version 1.5: User, Reference and Programmer Manual, Cambridge University Engineering Dept., Speech Group.
