DISTINCTIVE FEATURE DETECTION USING SUPPORT VECTOR MACHINES. Partha Niyogi, Chris Burges, and Padma Ramesh. Bell Labs, Lucent Technologies, USA.

ABSTRACT

An important aspect of distinctive feature based approaches to automatic speech recognition is the formulation of a framework for robust detection of these features. We discuss the application of the support vector machines (SVMs) that arise when the structural risk minimization principle is applied to such feature detection problems. In particular, we describe the problem of detecting stop consonants in continuous speech and discuss an SVM framework for detecting these sounds. In this paper we use both linear and nonlinear SVMs for stop detection and present experimental results showing that they perform better than a cepstral-feature based hidden Markov model (HMM) system on the same task.

1. INTRODUCTION

We are pursuing an approach to speech recognition that develops detectors for various distinctive features from their acoustic correlates in the speech signal. Crucial to the success of such an approach are the following: (1) determination of the acoustic signatures for different sound classes and the development of signal representations in which that acoustic signature best expresses itself; (2) the construction of statistical learning paradigms that operate on the above representations and separate the positive instances of the distinctive feature from the negative instances of the same feature. Traditionally, (1) is regarded as the front end and (2) as the back end of a speech recognition system. In most traditional speech recognition designs, the same representation is used for all sound classes (i.e., cepstra or filterbanks, computed with the same analysis window and stepping rate). The front end is thus a vector time series. The back end is typically an HMM of one form or another. In contrast, we consider using different representations for different sound classes. An important question that arises in this context is: given a particular representation, what sort of statistical learning machine should be used to optimally separate the positive examples of each sound from the negative?

In this paper, we investigate the application of support vector machines (SVMs) to the kinds of detection problems that naturally arise in our feature based approach to speech recognition. Here, we address the issue of detecting stop consonants in continuous speech. We first provide a description of the detection problem and then discuss support vector machines. An important issue that needs to be addressed is the ability of the machine to generalize from its training set to its test set. We use the Vapnik-Chervonenkis theory [7] that gives rise to SVMs to address this issue. We examine a variety of support vector machines with different architectures and capacity control mechanisms and show their effect on the successful utilization of SVMs for speech recognition. Finally, we present experimental results on using SVMs for the detection of stop consonants.

[Figure 1: Portion of the speech waveform s(n) (top panel), the associated three-dimensional feature vector x(n) (middle panel), and the desired output y(n) (bottom panel) marking the times of the closure-burst transition; the x axis is time in ms.]

2. STOP DETECTION

Stop consonants are produced by causing a complete closure of the vocal tract followed by a sudden release. Hence they are signalled in continuous speech by a period of extremely low energy (corresponding to the period of closure) followed by a sharp, broadband signal (corresponding to the release).
As a result, stop consonants are highly transient (dynamic) sounds with durations lasting anywhere from 5 to 100 ms. In American English, the class of stops consists of the sounds {p, t, k, b, d, g}. In order to build a detector for stop consonants in running speech, the speech signal s(t) is characterized by a vector time series with three dimensions: (i) log(total energy); (ii) log(energy above 3 kHz); (iii) a spectral flatness measure based on the Wiener entropy, defined as

$\int \log(S(f,t))\, df \;-\; \log\!\left( \int S(f,t)\, df \right)$

where S(f,t) is the spectral energy at frequency f and time t. All quantities are computed using 5 ms windows moved every 1 ms. Thus we have x(n) = [x_1(n), x_2(n), x_3(n)], where n represents time (discretized in units of milliseconds) and x_1 through x_3 are the three acoustic quantities measured every 1 ms. Energies at 1 ms intervals potentially allow us to track rapid transitions that would otherwise be smoothed out by a coarser temporal resolution. For more on stop consonants, their acoustic-phonetic features, and common confusions, see [4].

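To make this front end concrete, the following is a minimal sketch of the three measurements in Python/NumPy. It is not the authors' code: the Hamming window, the FFT-based spectral estimate, and the small floor constant eps are our own assumptions.

```python
import numpy as np

def stop_features(signal, fs, win_ms=5.0, hop_ms=1.0, eps=1e-10):
    """Per 1 ms frame: log total energy, log energy above 3 kHz, and the
    Wiener-entropy flatness (mean log spectrum minus log mean spectrum)."""
    win = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    freqs = np.fft.rfftfreq(win, d=1.0 / fs)
    hi_band = freqs >= 3000.0
    feats = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win] * np.hamming(win)
        spec = np.abs(np.fft.rfft(frame)) ** 2          # spectral energy S(f, t)
        flatness = np.mean(np.log(spec + eps)) - np.log(np.mean(spec) + eps)
        feats.append([np.log(spec.sum() + eps),
                      np.log(spec[hi_band].sum() + eps),
                      flatness])
    return np.array(feats)                              # shape: (n_frames, 3)
```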
We need to find an operator on the feature vector time series that will return a one-dimensional time series taking on large values around the times that stops occur and small values otherwise. The most natural points in time to mark the presence of stops are the transitions from closure to burst release. Shown in Fig. 1 is an example of a speech waveform s(n), the associated feature vector time series x(n), and a desired output y(n).

The technical goal is to find an operator g on the time series x(n) that produces an output y_g(n) = g(x)(n) with values in {0, 1}, such that ||y - y_g|| is small in some sense (norm). Specifically, we choose the optimal operator (from some class G of operators) according to the criterion

$g_{opt} = \arg\min_{g \in G} R(g) = \arg\min_{g \in G} E[(y - y_g)^2] \qquad (1)$

Since both y and y_g take values in {0, 1}, this is a two-class pattern classification problem at each point in time n. Here we consider operators given by y_h(n) = h(x(n - W), ..., x(n + W)), where h is a function from R^{3(2W+1)} to {0, 1}. In our experiments W = 5.

Finally, it is important to emphasize that the speech recognition problem can be decomposed into a collection of feature detection problems that have a structure very similar to that of the stop detection problem described above. For example, a vowel detector might be built with a representation x consisting of dimensions like the degree of periodicity in the signal, the ratio of high-frequency to low-frequency energy, and so on. The same is true of a nasal detector, a fricative detector, and so on. In all of these detection problems, the support vector machine framework might provide the basis for constructing optimal detectors that are learned from the data.
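As an illustration of how the frame-level classification problem is assembled, here is a sketch that stacks the 2W + 1 context frames per time step and attaches 0/1 labels. The feats array is assumed to come from a front end like the one sketched above; labeling only the exact transition frames as positive is our simplification of the paper's (unspecified) labeling scheme.

```python
import numpy as np

def context_windows(feats, transition_frames, W=5):
    """Build x(n - W), ..., x(n + W) context vectors in R^{3(2W+1)} and the
    0/1 targets y(n) marking closure-burst transition frames."""
    n_frames = feats.shape[0]
    y = np.zeros(n_frames, dtype=int)
    y[np.asarray(transition_frames, dtype=int)] = 1   # frames of known bursts
    X, labels = [], []
    for n in range(W, n_frames - W):
        X.append(feats[n - W:n + W + 1].ravel())      # flatten the window
        labels.append(y[n])
    return np.array(X), np.array(labels)
```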
3. SUPPORT VECTOR MACHINES

For a general introduction to using SVMs for the pattern recognition task, see [1]. Consider a two-class pattern recognition problem with labelled examples, i.e., (x_i, y_i) pairs drawn according to some unknown distribution P(x, y) on the space X x Y. The goal is to construct a function h (drawn from some class H of functions from X to Y; w.l.o.g. let Y = {-1, 1}, as this is convenient for our discussion) that is able to classify unknown patterns x into the appropriate class with minimum misclassification error. A suitable h is usually picked by minimizing the empirical risk, i.e.,

$\hat{h}_l = \arg\min_{f \in H} R_{emp}(f) = \arg\min_{f \in H} \frac{1}{l} \sum_{i=1}^{l} (y_i - f(x_i))^2$

This is the straightforward approach commonly pursued in statistical pattern recognition, and the hope is that the empirically chosen function $\hat{h}_l$ will generalize successfully to novel unlabelled examples.

3.1. Structural Risk Minimization

The straightforward approach of minimizing the empirical risk turns out not to guarantee a small actual risk on the test set, particularly if the number l of training examples is limited. As a result of the work of Vapnik and Chervonenkis [7] (see also [5, 6]), a new induction principle has emerged: the principle of structural risk minimization (SRM). It is based on the fact that the true goal of the learner should be to minimize the expected risk, i.e.,

$h_{opt} = \arg\min_{f \in H} R(f) = \arg\min_{f \in H} E[(y - f(x))^2]$

where the expectation is taken according to the true distribution P. Since P is unknown, one approximates the above functional by the following large-deviation bound, which holds with probability greater than $1 - \delta$:

$R(f) \leq R_{emp}(f) + \Phi\!\left(\frac{d}{l}, \frac{\log \delta}{l}\right) \qquad (2)$

where $\Phi$ is defined as

$\Phi\!\left(\frac{d}{l}, \frac{\log \delta}{l}\right) = \sqrt{\frac{d\left(\log\frac{2l}{d} + 1\right) - \log(\delta/4)}{l}}$

The parameter d is called the VC dimension of the set of functions H [7] and is a measure of its complexity. Some aspects of eq. 2 are worth highlighting. First, to guarantee minimization of the true risk R(f), one has to ensure that both terms, $R_{emp}(f)$ and $\Phi(d/l, \log\delta/l)$, are made sufficiently small; significantly, minimizing $R_{emp}$ alone is not enough. Second, for a fixed amount of training data l, the two terms represent a complexity regularization trade-off: as the class H becomes larger, the minimum empirical risk becomes smaller but the VC term becomes larger. Generalization is controlled by controlling each of these two terms. Conceptually this is done by imposing a structure $H_1 \subset H_2 \subset H_3 \subset \ldots \subset H$ of nested subsets of H. One then searches within this hierarchy for the class $H_j$ that minimizes the bound, i.e.,

$\hat{h} = \arg\min_{H_j} \left( R_{emp}(f) + \Phi(\cdot) \right) \qquad (3)$

3.2. Hyperplanes

Here we discuss the application of the SRM principle to the case where the class H is the class of linear hyperplanes in X, i.e.,

$H = \{ h : X \to \{-1, 1\} \mid h(x) = \theta(w \cdot x + b) \}$

where $\theta(z) = 1$ if $z \geq 0$ and $\theta(z) = -1$ otherwise. Thus each pair (w, b) corresponds to a unique hyperplane (after removing an arbitrary scale factor in w and b by requiring that $\min_i |w \cdot x_i + b| = 1$, where the $(x_i, y_i)$ are the training examples). Now we use the following:

Theorem 1. Let $B_{x_1, \ldots, x_r} = \{x \in X : \|x - a\| < R\}$ ($a \in X$) be the smallest ball containing the points $x_1, \ldots, x_r$, and let

$H_A = \{ f_{w,b} = \theta((w \cdot x) + b) \mid \|w\| \leq A \} \qquad (4)$

be a subclass of hyperplanes in canonical form with respect to $\{x_1, \ldots, x_r\}$. Then $H_A$ has a VC-dimension h satisfying

$h \leq R^2 A^2 \qquad (5)$

This theorem suggests a natural structure on the set of hyperplanes in canonical form. Clearly, $H = \cup_{A>0} H_A$. It is possible to show that utilizing such a structure transforms the problem posed in eq. 3 into $\hat{h} = \arg\min_{w,b} (R_{emp}(w, b) + w \cdot w)$, which trades off the fit to the data against model complexity.

Separable Data

For the case where the data is separable by hyperplanes, the structural risk minimization principle attempts to minimize the VC-dimension of $H_A$ while keeping the empirical risk $R_{emp}$ fixed at 0. This is equivalent to

$\min \; \frac{1}{2} w \cdot w \qquad (6)$

subject to

$y_i (w \cdot x_i + b) \geq 1 \qquad (7)$

The i-th constraint is satisfied only if the i-th data point is correctly classified by the hyperplane classifier. Introducing Lagrange multipliers $\lambda_i \geq 0$ for each of the constraints, the Lagrangian is formed as

$L(w, b, \{\lambda_i\}) = \frac{1}{2} w \cdot w - \sum_i \lambda_i \left( y_i (w \cdot x_i + b) - 1 \right)$

The optimal solution lies at the saddle point of the Lagrangian and satisfies the following conditions: (1) differentiating with respect to w and setting to 0 yields $w = \sum_i \lambda_i y_i x_i$, i.e., the optimal hyperplane is a linear combination of the training vectors; (2) differentiating with respect to b and setting to 0 yields $\sum_i \lambda_i y_i = 0$; (3) the first-order Kuhn-Tucker conditions yield $\lambda_i (y_i (w \cdot x_i + b) - 1) = 0$. From this we see that all points $x_i$ for which $\lambda_i$ is non-zero lie on the margin hyperplanes $w \cdot x + b = 1$ or $w \cdot x + b = -1$. Such points are called support vectors. All other points are exterior points and do not enter into the expansion of the optimal hyperplane w, since they have $\lambda_i = 0$. As a result of these properties, the learning machine is called the support vector machine. Substituting for w in the Lagrangian yields the quadratic programming problem

$\max_{\lambda} \; \sum_i \lambda_i - \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j (x_i \cdot x_j) \quad \text{subject to} \quad \sum_i \lambda_i y_i = 0$

Non-separable Case

In many practical applications, a perfectly separating hyperplane does not exist. To allow for the possibility of examples violating (7), [2] introduce slack variables

$\xi_i \geq 0, \quad i = 1, \ldots, l \qquad (8)$

to get

$y_i ((w \cdot x_i) + b) \geq 1 - \xi_i, \quad i = 1, \ldots, l \qquad (9)$

The structural risk minimization approach to minimizing the guaranteed risk bound (2) then consists of

$\min_{w, \xi} \; w \cdot w + U \sum_{i=1}^{l} (\xi_i)^\sigma \qquad (10)$

subject to the constraints (8) and (9). In eq. 10, the first term corresponds to the VC-dimension of the learning machine as before. The term $\sum_{i=1}^{l} (\xi_i)^\sigma$ (for small $\sigma$) is equivalent to the number of misclassifications on the training set and is therefore a measure of empirical risk. The constant U thus controls the trade-off between the empirical risk $\sum_{i=1}^{l} (\xi_i)^\sigma$ and the VC-dimension of the learning machine. In actual practice, one has to make choices both for U and $\sigma$. For computational reasons, $\sigma$ is chosen to be 1, because that translates the optimization problem of eq. 10 into a quadratic programming problem like that of the previous section (in fact, $\sigma = 2$ does also). Introducing Lagrange multipliers for each of the constraints in eqs. 8 (the $\mu_j$) and 9 (the $\lambda_i$), we form the Lagrangian $L(w, \{\xi_i\}, b, \{\lambda_i\}, \{\mu_j\})$ as before:

$L = \frac{1}{2} w \cdot w + U \sum_{i=1}^{l} \xi_i - \sum_{j=1}^{l} \mu_j \xi_j - \sum_{i=1}^{l} \lambda_i \left( y_i ((w \cdot x_i) + b) - 1 + \xi_i \right)$

First-order conditions show that the optimal hyperplane is again given by $w = \sum_i \lambda_i y_i x_i$. Additionally, taking the Kuhn-Tucker conditions into account, eq. 10 is transformed into

$\min_{\lambda} \; \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j (x_i \cdot x_j) - \sum_i \lambda_i \quad \text{subject to} \quad \sum_i \lambda_i y_i = 0, \quad 0 \leq \lambda_i \leq U$

Once the $\lambda_i$ are obtained by solving the above problem, the optimal hyperplane is easily found by substitution.
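The paper does not say how this QP was solved; as an illustrative stand-in, the sketch below hands the dual above to the off-the-shelf cvxopt QP solver. The kernel argument, the numerical tolerance, and the bias recovery are our own choices.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y, U=1.0, kernel=lambda a, b: a @ b):
    """X: (l, d) array; y: length-l array of +/-1 labels.
    Solves min_lam 1/2 sum_ij lam_i lam_j y_i y_j K(x_i, x_j) - sum_i lam_i
    subject to sum_i lam_i y_i = 0 and 0 <= lam_i <= U."""
    l = len(y)
    K = np.array([[kernel(X[i], X[j]) for j in range(l)] for i in range(l)])
    P = matrix(np.outer(y, y) * K)                  # Hessian of the dual
    q = matrix(-np.ones(l))
    G = matrix(np.vstack([-np.eye(l), np.eye(l)]))  # -lam <= 0 and lam <= U
    h = matrix(np.hstack([np.zeros(l), U * np.ones(l)]))
    A = matrix(y.reshape(1, -1).astype(float))      # sum_i lam_i y_i = 0
    b = matrix(0.0)
    solvers.options['show_progress'] = False
    lam = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    # Recover the bias from a margin support vector; this sketch assumes one
    # exists with 0 < lam_i < U (i.e., xi_i = 0 and the constraint active).
    i = int(np.argmax((lam > 1e-6) & (lam < U - 1e-6)))
    bias = y[i] - np.sum(lam * y * K[:, i])
    return lam, bias
```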
3.3. Non-linear Extensions: Kernels

In order to get non-linear decision boundaries, we transform the data by a fixed non-linear transformation and use linear techniques in the transformed space. Consider a transformation $\Psi : X \to Z$ mapping points $x \in X$ to corresponding points $z \in Z$. Constructing hyperplanes in Z according to the SRM principle ultimately reduces to solving optimization problems of the sort described earlier (eq. 10), with the $x_i$ replaced by $z_i$. Significantly, we note that the only form in which the $z_i$ appear in the optimization problem is inner products, i.e., $(z_i \cdot z_j)$. Therefore, it is enough to know the inner product between pairs of them. Consequently, we characterize the transformation by the inner product it imposes on the space Z. Specifically, we consider mappings $\Psi_K : X \to Z$ such that for any $z_1, z_2 \in Z$ we have $(z_1 \cdot z_2) = K(x_1, x_2)$, where K is the kernel of a positive Hilbert-Schmidt operator and, according to Mercer's theorem of functional analysis ([3]), corresponds to a dot product in some other space. Each different choice of the kernel K defines a different choice of the transformation $\Psi_K$. The dimensionality of Z depends upon the number of nonzero eigenvalues of the kernel K and is potentially infinite. If we set up the optimization problem appropriately, we see that the optimal hyperplane has the form $w = \sum_i \lambda_i y_i z_i$. The decision rule for this is

$h(x) = \theta\!\left( \sum_i \lambda_i y_i K(x, x_i) + b \right)$

We see that the form of this decision rule is like a feed-forward neural network, or a kernel-based non-parametric scheme, with two significant differences: (a) the parameters are estimated by the principle of structural risk minimization; (b) the number of hidden nodes or basis functions is chosen automatically by the procedure. Thus an important problem of model selection is resolved.
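Given multipliers and a bias from a solver like the sketch above, the kernel decision rule can be evaluated directly. The RBF kernel below anticipates Section 4.2; treating sigma = 5 as a default is our assumption based on the value reported there.

```python
import numpy as np

def rbf(a, b, sigma=5.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def decision_score(x, sv_x, sv_y, sv_lam, bias, kernel=rbf):
    """d(x) = sum_i lam_i y_i K(x, x_i) + b; the class is theta(d(x)), and
    varying the acceptance threshold on d(x) traces out an ROC curve."""
    return sum(lam * yi * kernel(x, xi)
               for lam, yi, xi in zip(sv_lam, sv_y, sv_x)) + bias
```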

4. EXPERIMENTAL RESULTS

As formulated in Section 2, stop detection is a two-class pattern classification problem that can be solved using SVMs. We discuss in this section the experiments conducted with these SVMs. Detection experiments were conducted on dialect region 4 of the TIMIT test set, which consists of 32 speakers, 16 male and 16 female, uttering 10 sentences each, making for a total of 320 sentences in all. Training was performed on 10 sentences each from 40 randomly chosen speakers from the TIMIT training set from different dialect regions, yielding 1330 positive examples and 17600 negative examples.

4.1. Linear Hyperplanes

In this section we consider results obtained when the class of linear hyperplanes is used as a decision boundary between stops and non-stops (non-separable formulation). On 18930 training data points, with U = 1, 1630 support vectors were generated (1.5% of training vectors). The errors on the training set consisted of 25 false positives and 37 false negatives. Shown in Fig. 2 is a distribution (normalized histograms) of d(x) = w · x + b for the positive and negative training examples. As d(x)/||w|| is the distance of each data point x to the separating hyperplane, d(x) is proportional to this distance. Positive examples that have d < 0 and negative examples that have d > 0 are misclassified. The region -1 ≤ d ≤ 1 corresponds to the margin, i.e., points that lie within the strip given by the hyperplanes w · x + b = 1 and w · x + b = -1.

[Figure 2: Histogram of d(x) on the training set for the stop and non-stop classes.]

As we discussed in Section 2, the problem is really one of accurate detection of stops. Consequently, by changing the threshold (currently set at d = 0) one can obtain a trade-off between type I and type II errors for the stop detection problem. Shown in Fig. 3 are the ROC curves generated by varying such a threshold of acceptance for a range of values of U; U controls the trade-off between empirical fit to the data and capacity of the learning machine. Performance on the test set (a measure of generalization) gets progressively better as U decreases over this range. The best performance was obtained for U = 0.5.

[Figure 3: Performance (false acceptance vs. false rejection) of linear support vector machines for different values of U.]

4.2. RBF Kernels

Another important choice in the adoption of the support vector framework for problems such as these involves the choice of kernel K. In the experiments below, we considered radial basis function kernels of the form K(x, y) = exp(-||x - y||² / (2σ²)). Experiments were conducted with a range of values of σ; performance improved marginally from σ = 1 to σ = 5. Shown in Fig. 4 is the ROC curve for an RBF network with σ = 5 (labeled SVM). Also shown are ROC curves for the best linear hyperplane (labeled LINEAR) and one obtained using a derivative operator on total and high-frequency energies (labeled DERIV). As a point of comparison, the performance of an HMM based system running with a free grammar on the test sentences is shown by the * in Fig. 4. The HMM based system consisted of 47 phones with 3-state left-to-right HMMs per phone. The output probabilities were mixtures of Gaussians (16 mixtures per state), and the front end was a 39-dimensional vector time series obtained from the first 12 cepstral coefficients and total energy together with their first and second differences (delta and delta-delta). A stop in a test sentence was considered to be correctly identified if the closure-burst transition was anywhere within the segment postulated as a stop by the HMM recognizer. If a particular segment postulated by the HMM recognizer as a stop did not contain a closure-burst transition, it was deemed a false accept.

[Figure 4: Performance of linear and non-linear support vector machines against derivative operators and HMMs.]
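To illustrate the threshold sweep behind Figs. 3 and 4, here is a frame-level sketch, assuming an array of decision scores d(x(n)) and 0/1 frame labels; the paper's actual scoring is event-level (matching closure-burst transitions against postulated segments), which this simplification does not capture.

```python
import numpy as np

def roc_points(scores, labels, n_thresholds=200):
    """Sweep the acceptance threshold on d(x) (the default decision uses 0)
    to trade false acceptance against false rejection."""
    pos, neg = labels == 1, labels != 1
    pts = []
    for t in np.linspace(scores.min(), scores.max(), n_thresholds):
        accept = scores >= t
        false_acceptance = np.mean(accept[neg])   # type I error rate
        false_rejection = np.mean(~accept[pos])   # type II error rate
        pts.append((false_acceptance, false_rejection))
    return pts
```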
5. CONCLUSIONS

Stop detection belongs to a class of feature detection problems that naturally arise in a feature based approach to speech recognition. It is particularly interesting because stops present a transient signal with a period of rapid change that is often poorly characterized by standard cepstral representations. We have discussed how the problem can be cast as trying to discriminate between positive and negative examples using functions drawn from a class H of the appropriate complexity. We have introduced the principle of structural risk minimization, which provides a framework for doing this. We have discussed the two major issues, the choice of an appropriate U and kernel K, that affect the successful deployment of this technology for detection problems. We have also shown a steady improvement from linear to non-linear SVMs and shown that they perform better than an HMM using cepstral features.

6. REFERENCES

[1] Burges, C. J. C., "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, Vol. 2, No. 2, pp. 1-47, 1998.
[2] Cortes, C. and Vapnik, V. N., "Support Vector Networks," Machine Learning, Vol. 20, pp. 273-297, 1995.
[3] Courant, R. and Hilbert, D., Methods of Mathematical Physics, J. Wiley, New York, 1953.
[4] Niyogi, P., Mitra, P., and Sondhi, M., "A Detection Framework for Locating Phonetic Events," Proceedings of ICSLP-98, Sydney, Australia, 1998.
[5] Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., and Anthony, M., "A Framework for Structural Risk Minimization," Proceedings, 9th Annual Conference on Computational Learning Theory, pp. 68-76, 1996.
[6] Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., and Anthony, M., "Structural Risk Minimization over Data-Dependent Hierarchies," NeuroCOLT Technical Report NC-TR-96-053, 1996.
[7] Vapnik, V. N., The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
