STUDY OF ATTRACTOR VARIATION IN THE RECONSTRUCTED PHASE SPACE OF SPEECH SIGNALS

STUDY OF ATTRACTOR VARIATION IN THE RECONSTRUCTED PHASE SPACE OF SPEECH SIGNALS Jiji Ye Departmet of Electrical ad Computer Egieerig Milwaukee, WI USA jiji.ye@mu.edu Michael T. Johso Departmet of Electrical ad Computer Egieerig Milwaukee, WI USA mike.johso@mu.edu Richard J. Povielli Departmet of Electrical ad Computer Egieerig Milwaukee, WI USA richard.povielli@mu.edu ABSTRACT This paper presets a study of the attractor variatio i the recostructed phase spaces of isolated phoemes. The approach is based o recet work i time-domai sigal classificatio usig dyamical sigal models, whereby a statistical distributio model is obtaied from the phase space ad used for maximum likelihood classificatio. Two sets of experimets are preseted i this paper. The first uses a variable time lag phase space to examie the effect of fudametal frequecy o attractor patters. The secod focuses o speaker variability through a ivestigatio of speaker-depedet phoeme classificatio across speaker sets of icreasig size. The results are discussed at the ed of the paper. 1. INTRODUCTION State of the art speech recogitio systems typically use cepstral coefficiet features, obtaied via a frame-based spectral aalysis of the speech sigal (Deller et al., 2). However, recet work i phase space recostructio techiques (Abarbael, 1996; Katz ad Schrieber, 2) for oliear modelig of time-series sigals has motivated ivestigatio ito the efficacy of usig dyamical systems models i the time-domai for speech recogitio (Ye et al., 22). I theory, recostructed phase spaces capture the full dyamics of the uderlyig system, icludig oliear iformatio ot preserved by traditioal spectral techiques, leadig to possibilities for improved recogitio accuracy. This paper is based o work supported by NSF uder Grat No. IIS-11358 The classical techique for phoeme classificatio is Hidde Markov Models (HMM) (Lee ad Ho, 1989; Youg, 1992), ofte based o Gaussia Mixture Model (GMM) observatio probabilities. The most commo features are Mel Frequecy Cepstral Coefficiets (MFCCs). I cotrast, the recostructed phase space is a plot of the time-lagged vectors of a sigal. Such phase spaces have bee show to be topologically equivalet to the origial system, if the embeddig dimesio is large eough (Sauer et al., 1991). Structural patters occur i this processig space, commoly referred to as trajectories or attractors, which ca be quatified through ivariat metrics such as correlatio dimesio or Lyapuov expoets or through direct models of the phase space distributio. Previous results o phoeme classificatio (Ye et al., 22) have show that a Bayes classifier over statistical models of the recostructed phase spaces are effective i classifyig phoemes. Phase space recostructios are ot specific to ay particular productio model of the uderlyig system, assumig oly that the dimesio of the system is fiite. We would like to be able to take advatage of our a priori kowledge about speech productio mechaisms to improve usefuless of phase space models for speech recogitio i particular. I pursuit of this, we have implemeted two sets of experimets to study attractor variatio, the first to look at fudametal frequecy effects i vowels ad the secod to look at speaker variability issues. Fudametal frequecy, as a parameter that varies sigificatly but does ot cotai iformatio about the geeratig phoeme, should clearly affect the phase space i a adverse way for classificatio. This hypothesis is examied through a compesatio techique usig variable lag recostructios. Speaker variability is

a as of yet ukow factor with regard to the amout of variace caused i uderlyig attractor characteristics, ad is a importat issue i the questio of how well this techique will work for speaker-idepedet tasks. Iitial experimets have show some sigificat discrimiability i such tasks, but performed at a measurably lower accuracy tha that for speaker-depedet tests. Each of these experimets use a oparametric distributio model based o bi couts with a maximum likelihood phoeme classifier, as described i more detail i the ext sectio. The TIMIT database is used for both tasks. 2. METHOD 2.1. Phase Space Recostructio Phase space recostructio techiques are fouded o uderlyig priciples of dyamical system theory (Sauer et al., 1991; Takes, 198) ad have bee applied to a variety of time series aalysis ad oliear sigals processig applicatios [refs xx] (Abarbael, 1996; Katz ad Schrieber, 2). Give a time series x = x, = 1... N (1) where is a time idex ad N is the umber of observatios, the vectors i a recostructed phase space are formed as x = [ x x x ], (2) ( d 1) where is the time lag ad d is the embeddig dimesio. Take as a whole, the sigal forms a trajectory matrix compiled from the time-lagged sigal vectors: X = x x x 1 1+ 1 + ( d 1) x x x 2 2+ 2 + ( d 1) x x x N ( d 1) N ( d 2) N Figure 1 shows a illustrative two-dimesioal recostructed phase space trajectory, while Figure 2 shows the same space with idividual poits oly, as modeled by the statistical approach used here. (3) Figure 1 Recostructed phase space of the vowel phoeme /aa/ illustratig trajectory Figure 2 Recostructed phase space of the vowel phoeme /aa/ illustratig desity The time lag used i the recostructed phase space is empirical but guided by some key measures such as mutual iformatio ad autocorrelatio (Abarbael, 1996; Katz ad Schrieber, 2). Usig such measures, a time lag of six is appropriate for isolated phoeme recogitio usig TIMIT. For the variable lag experimets baselie time lags of 6 ad 12 are used, as discussed i details later.

I practice, the phase-space recostructio is zeromeaed ad the amplitude variatio is radially ormalized via: x µ x X = (4) σ where σ r x N 1 x µ x N ( d 1) =+ r 1 ( d 1) is a origial poit i the phase space, µ is the sample mea of the colums of X. X 2.2. Noparametric Distributio Model of Recostructed Phase Space A statistical characterizatio of the recostructed phase space, related to the atural measure or atural distributio of the uderlyig attractor (Abarbael, 1996; Katz ad Schrieber, 2), is estimated by dividig the recostructed phase space ito 1 histogram bis as is illustrated i Figure 2. This is doe by dividig each dimesio ito te partitios such that each partitio cotais approximately 1% of the data poits. The itercepts of the bis are determied usig all the traiig data. 2.3. The Bayes Classifier The estimates of the atural distributio are used as models for a Bayes classifier. This classifier simply computes the coditioal probabilities of the differet classes give the phase space ad the selects the class with the highest coditioal probability: N ˆ ω = arg max{ pˆ ( )} pˆ i X = arg max i ( x ) (6) i= 1 C i= 1 C = 1 p x is the bi-based likelihood of a where ( ) ˆi poit i the phase space, C is the umber of phoeme classes, ad ˆω is the resultig maximum likelihood hypothesis. 3. EXPERIMENT DESIGN 3.1. Variable Lag Model for Vowels The basic idea of this experimet is to use variable time lags istead of a fixed time lag for embeddig vowel phoemes, as a fuctio of the uderlyig fudametal frequecy of the vowel. A estimate 2 (5) of the fudametal frequecy is used to determie the appropriate embeddig lag. The fudametal frequecy estimate algorithm for vowels used here is based o the computatio of autocorrelatio i the time domai as implemeted by the Etropic ESPS package (Etropic, 1993). The typical vowel fudametal frequecy rage for male speakers is 1~15Hz, with a average of about 125Hz, while the typical rage for female speakers is 175~256Hz, with a average of about 2Hz. For this experimet oly male speakers were used. I the recostructed phase space, a lower fudametal frequecy has a loger period, correspodig to a larger time lag. With a baselie time lag ad mea fudametal frequecy give as ad f respectively, we perform fudametal frequecy compesatio via the equatios f = f (7) ad f ' = (8) f where is the ew time lag ad f is the fudametal frequecy estimate of the phoeme example. This time lag is rouded ad used for phase space recostructio, for both estimatio of the phoeme distributios across the traiig set ad maximum likelihood classificatio of the test set examples. Two differet baselie time lags are used i the experimets. A time lag of 6 correspods to that chose through examiatio of the mutual iformatio ad autocorrelatio heuristics; however, roudig effects lead to quite a low resolutio o the lags i the experimet, which vary primarily betwee 5, 6, ad 7. To achieve a slightly higher resolutio, a secod set of experimets at a time lag of 12 is implemeted for compariso. Sice the fial time lags used for recostructio are give by a fudametal frequecy ratio, the value of the baselie frequecy is ot of great importace, but should be chose to be ear the mea fudametal frequecy. Baselies of 12Hz ad 13Hz were ivestigated. The fial time lag is give i accordace with equatio (8) above. The data set used here icludes 6 male speakers for traiig ad 3 differet male speakers for testig, all withi the same dialect regio.

3.2. Speaker Variability Usig the phase space recostructio techique for speaker-idepedet tasks clearly requires that the attractor patter across differet speakers is cosistet. Icosistecy of attractor structures across differet speakers would be expected to lead to smoothed ad imprecise phoeme models with resultig poor classificatio accuracy. The experimets preseted here are desiged to ivestigate the iter-speaker variatio of attractor patters. Although a umber of differet attractor distace metrics could be used for this purpose, the best such choice is ot readily apparet ad we have istead focused o classificatio accuracy as a fuctio of the umber of speakers i a closedset speaker depedet recogitio task. The higher the cosistecy of attractors across speakers, the less accuracy degradatio should be expected as the umber of speakers i the set is icreased. All speakers are male speakers selected from the same dialect regio withi the TIMIT corpus. The bi-based models ad maximum likelihood classificatio methods discussed previously are used i all cases. The oly variable is the umber of speakers for isolated phoeme classificatio tasks. To examie speaker variability effects across differet classes of phoemes, vowels, fricatives ad asals are tested separately. The overall data set is a group of 22 male speakers, from which subsets of 22, 17, 11, 6, 3, 2 or 1 speaker(s) have bee radomly selected. Classificatio experimets are performed o sets of 7 fricatives, 7 vowels, ad 5 asals. Exp 1: d = 2, = 6, =, f Exp 2: d = 2, = 6, f = 12 Hz, =, f f Exp 3: d = 2, = 6, f = 13 Hz, =, f Exp 4: d = 2, = 12, =, f Exp 5: d = 2, = 12, f = 12 Hz, = f, f Exp 6: d = 2, = 12, f = 13 Hz, = f, where d is the embeddig dimesio, is the default time lag, f is the default fudametal frequecy, f is the estimated fudametal frequecy, ad is the actual embeddig time lag. Table 1 shows the resultig rages for give the parameters, while Table 2 ad 3 show the classificatio results. f 6 12Hz 5~7 6 13Hz 5~8 12 12Hz 1~14 12 13Hz 1~16 Table 1 Rage of give ad f Exp 1 Exp 2 Exp 3 27.7% 33.78% 35.14% 4. RESULTS Table 2 Vowel phoeme classificatio results for 4.1. Variable Lag for Vowels lag of 6 The 7 vowel set { /aw/ /ay/ /ey/ /ix/ /iy/ /ow/ /oy/ } is used for these experimets. As described previously, data are selected from 6 male speakers for traiig ad 3 differet male speakers for testig, all withi the same dialect regio. There are a total of six experimets, three with a baselie lag of 6 ad three with a baselie lag of 12. I each case the tests are ru with a fixed lag as well as with variable lags based o both 12Hz ad 13Hz. The six experimets are summarized as follows: Exp 4 Exp 5 Exp 6 39.19% 35.14% 39.86% Table 3 Vowel phoeme classificatio results for lag of 12 As ca be see from the above results, the best classificatio accuracy is obtaied by usig a variable lag model with default fudametal frequecy of 13Hz.

4.2. Speaker Variability The evaluatio of speaker variability was carried out usig the leave-oe-out cross validatio. The overall classificatio accuracies for the three types of phoemes are show i Table 4, Table 5 ad Table 6. Spkr# 1 2 6 22 Acc(%) 58. 51.6 49.26 48.58 Table 4 Phoeme classificatio results of fricatives with differet umber of speakers Spkr# 1 2 6 11 17 22 Acc(%) 61.9 49.9 49.58 46.92 45.46 46.3 Table 5 Phoeme classificatio results of vowels with differet umber of speakers Spkr# 1 2 3 6 22 Acc(%) 51.79 4. 31.91 29.95 26.76 Table 6 Phoeme classificatio results of asals with differet umber of speakers Figure 3 is a visual iterpretatio of the results preseted above. The classificatio results are plotted agaist the umber of speakers for vowels, fricatives ad asals respectively. reported i the previous paper (Ye et al., 22), for a speaker-idepedet task over vowels, asals, ad fricatives. I all three phoeme types, the accuracy was relatively uchaged after 2 or 3 speakers. 5. WORK IN PROGRESS Because of the large differece betwee the baselie umbers i the variable lag experimets, we will verify ad ivestigate them by ruig other experimets. The experimets preseted are performed o a speaker-idepedet dataset, which could affect the results due to speaker variability. The additioal ivestigatio o the effect of fudametal frequecy o attractor patters will be icluded i the fial paper. 6. DISCUSSION AND CONCLUSIONS We have examied attractor variatio i the recostructed phase space of isolated phoemes. The variable lag experimets show a improvemet whe the phase space icorporates fudametal frequecy compesatio. The speaker variability experimets idicate that there is a sigificat amout of cosistecy across attractor patters betwee speakers, which shows that the recostructed phase space represetatio of speech sigal has the discrimiative power for the speaker-idepedet tasks. Future work would iclude discoverig the method to compesate speaker variability for the phase space recostructio techique o speech recogitio applicatios. REFERENCES Figure 3 The classificatio accuracy vs. umber of speakers As ca be see from Figure 3, the degree of attractor variatio across speakers is differet for these three types of phoemes. Nasals appear to have the largest variability while the fricatives have the least, which is cosistet with the results Abarbael, H.D.I., 1996. Aalysis of observed chaotic data. Spriger, New York, xiv, 272 pp. Deller, J.R., Hase, J.H.L. ad Proakis, J.G., 2. Discrete-Time Processig of Speech Sigals. IEEE Press, New York, 98 pp. Etropic, 1993. ESPS Programs Maual. Etropic Research laboratory. Katz, H. ad Schrieber, T., 2. Noliear Time Series Aalysis. Cambridge Noliear Sciece Series 7. Cambridge Uiversity Press, New York, NY, 34 pp. Lee, K.-F. ad Ho, H.-W., 1989. Speaker-idepedet phoe recogitio usig hidde Markov models.

IEEE Trasactios o Acoustics, Speech ad Sigal Processig, 37(11): 1641-1648. Sauer, T., Yorke, J.A. ad Casdagli, M., 1991. Embedology. Joural of Statistical Physics, 65(3/4): 579-616. Takes, F., 198. Detectig strage attractors i turbulece. Dyamical Systems ad Turbulece, 898: 366-381. Ye, J., Povielli, R.J. ad Johso, M.T., 22. Phoeme Classificatio Usig Naive Bayes Classifier I Recostructed Phase Space. 1th Digital Sigal Processig Workshop. Youg, S., 1992. The geeral use of tyig i phoemebased HMM speech recogitio. ICASSP, 1: 569-572.