Speaker Adaptation Techniques for Continuous Speech Using Medium and Small Adaptation Data Sets
Constantinos Boulis
Outline of the Presentation
- Introduction to the speaker adaptation problem
- Maximum Likelihood Stochastic Transformations (MLST)
- Experimental results of MLST
- Basis Transformations
- Experimental results of Basis Transformations
What is Speaker Adaptation?
Current speech recognizers are trained on a large number of speakers. Performance is not always optimal for each speaker. Speaker adaptation techniques attempt to adapt an initial system to a system that better matches a specific speaker.
The Benefits of Speaker Adaptation
By using 40 sentences from each speaker, a 20% drop in WER is achieved (native English speakers). A 50% drop in WER is observed for non-native English speakers. Recognizers become 15%-20% faster due to the better match between the models and the speech input.
Speaker Adaptation Methods
Maximum Likelihood: $\hat{\theta} = \arg\max_{\theta} P(x \mid \theta)$
Maximum a Posteriori: $\hat{\theta} = \arg\max_{\theta} P(\theta \mid x) = \arg\max_{\theta} P(x \mid \theta)\, P(\theta)$
Estimation of the Mean of a Normal Density
The Maximum Likelihood estimate of the mean is given by: $\hat{m}_{ML} = \frac{1}{T} \sum_{t=1}^{T} x_t$
The Maximum a Posteriori estimate of the mean, with a known standard deviation $\sigma$ and a normal prior distribution for the mean with mean $\mu$ and standard deviation $\kappa$, is: $\hat{m}_{MAP} = \dfrac{T \kappa^2\, \hat{m}_{ML} + \sigma^2 \mu}{T \kappa^2 + \sigma^2}$
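The two estimators above can be sketched in a few lines of Python. This is a minimal illustration; the function and variable names are not from the original work.

```python
import numpy as np

def ml_mean(x):
    """Maximum Likelihood estimate of the mean: the sample mean."""
    return np.mean(x)

def map_mean(x, mu_prior, kappa2, sigma2):
    """MAP estimate of the mean, assuming known variance sigma2 and a
    normal prior N(mu_prior, kappa2) on the mean."""
    T = len(x)
    m_ml = np.mean(x)
    return (T * kappa2 * m_ml + sigma2 * mu_prior) / (T * kappa2 + sigma2)

x = np.array([1.0, 2.0, 3.0])
print(ml_mean(x))                                         # 2.0
print(map_mean(x, mu_prior=0.0, kappa2=1.0, sigma2=1.0))  # (3*2 + 0)/(3 + 1) = 1.5
```

Note how the MAP estimate shrinks toward the prior mean when data is scarce and approaches the ML estimate as $T$ grows, which is exactly why MAP adaptation needs a reasonable amount of data per parameter.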
Speaker Adaptation Methods
Transformation-based: $\hat{A}, \hat{b} = \arg\max_{A, b} P(x \mid A\theta + b)$
The same linear transform is tied over many HMM states. Any HMM states unseen in the adaptation data will still be transformed, in contrast with MAP.
Comparison of Speaker Adaptation Methods
Maximum Likelihood requires thousands of sentences to robustly train a large-vocabulary system. Maximum a Posteriori achieves improved performance but is inadequate for fewer than 40 sentences. Transformation-based methods achieve the best results for small data sets, but performance does not scale well as the number of adaptation sentences increases.
Maximum Likelihood Linear Regression (MLLR)
Only the observation probabilities of the HMM states are altered:
$P^{SA}(y_t \mid s) = \sum_{j=1}^{N_\omega} p(\omega_j \mid s)\, N(y_t;\, A_c m_{sj} + b_c,\, S_{sj})$
State $s$ is mapped to class $c$ with transform $W_c = [A_c \;\; b_c]$.
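Applying an MLLR transform $W_c = [A_c \;\; b_c]$ to a Gaussian mean reduces to one matrix product with the extended mean vector $[m^T\; 1]^T$. A minimal sketch (names are illustrative, not from the original system):

```python
import numpy as np

def apply_mllr(W, m):
    """Transform a Gaussian mean m with W = [A b]: returns A @ m + b,
    computed via the extended mean vector [m; 1]."""
    xi = np.append(m, 1.0)          # extended mean
    return W @ xi

d = 3
A = np.eye(d)                       # identity rotation for the example
b = np.full(d, 0.5)                 # constant bias
W = np.hstack([A, b[:, None]])      # d x (d+1) transform
m = np.array([1.0, 2.0, 3.0])
print(apply_mllr(W, m))             # [1.5 2.5 3.5]
```

The extended-vector trick is what lets the rotation and the bias be estimated jointly as one $d \times (d+1)$ matrix.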
Limitations of MLLR
All the Gaussians of a class are transformed identically. The transforms for different classes are estimated independently.
Maximum Likelihood Stochastic Transformations
Maximum Likelihood Stochastic Transformations (MLST) estimates multiple transformations per class and transform weight vectors:
$P^{SA}(y_t \mid s) = \sum_{j=1}^{N_\omega} \sum_{k=1}^{N_\lambda} p(\lambda_k \mid c)\, p(\omega_j \mid s)\, N(y_t;\, A_{ck} m_{sj} + b_{ck},\, S_{sj})$
The key point of MLST is the different tying between transformations and transform weights.
Maximum Likelihood Stochastic Transformations
Robust estimates of transform weights are obtained using far fewer samples than transformations require. Each HMM state may now estimate its own transform weight vector and use transformations shared by many states. MLST achieves increased adaptation resolution without sacrificing robustness.
Example
Suppose there are enough data for only two transforms $W_1, W_2$. Each state can still form its own combination:
$\lambda_{11} W_1 + \lambda_{12} W_2,\ \ldots,\ \lambda_{n1} W_1 + \lambda_{n2} W_2$
In effect we have $n$ transforms, while MLLR could estimate two.
Estimating Transform Weights
Transform weights are estimated using:
$p(\lambda_k \mid c) = \dfrac{\sum_{s \in c} \sum_j \sum_t \gamma_{sjk}(t)}{\sum_{s \in c} \sum_j \sum_k \sum_t \gamma_{sjk}(t)}$
where $\gamma_{sjk}(t) = p(s_t = s,\, \omega_t = j,\, \lambda_t = k \mid o) = p(s_t = s,\, \omega_t = j \mid o,\, \lambda_t = k)\, p(\lambda_t = k \mid o)$
Notice that another hidden variable is added.
Estimating Transformations
The transformations are estimated using:
$\sum_{s \in c} \sum_j \sum_t \gamma_{sjk}(t)\, S_{sj}^{-1} x_t\, \hat{\mu}_{sj}^{T} = \sum_{s \in c} \sum_j \sum_t \gamma_{sjk}(t)\, S_{sj}^{-1} W_{ck}\, \hat{\mu}_{sj} \hat{\mu}_{sj}^{T}$, with $W_{ck} = [A_{ck} \;\; b_{ck}]$
If each speech frame is of dimension $d$, then we need to solve $d+1$ linear systems of $d+1$ equations each (full rotation matrix).
Grouping the Transformed Gaussians
MLST increases the number of Gaussians by a factor of $N_\lambda$; we need a way to return to the original number of Gaussians. Three grouping methods are introduced: highest transform weight (HTW), linear combination of transforms (LCT) and merging transformed Gaussians (MTG).
In HTW we use:
$P^{SA}(y_t \mid s) = \sum_{j=1}^{N_\omega} p(\omega_j \mid s)\, N(y_t;\, A_{c\hat{k}} m_{sj} + b_{c\hat{k}},\, S_{sj})$, where $\hat{k} = \arg\max_k \{ p(\lambda_k \mid c) \}$
Grouping the Transformed Gaussians
In LCT we use:
$P^{SA}(y_t \mid s) = \sum_{j=1}^{N_\omega} p(\omega_j \mid s)\, N(y_t;\, \bar{A}_c m_{sj} + \bar{b}_c,\, S_{sj})$
with $\bar{A}_c = \sum_{k=1}^{N_\lambda} p(\lambda_k \mid c)\, A_{ck}$ and $\bar{b}_c = \sum_{k=1}^{N_\lambda} p(\lambda_k \mid c)\, b_{ck}$
Advantage: the smoothing across all transformed Gaussians reduces estimation errors.
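The LCT collapse is just a weighted average of the class's transforms. A sketch under the same notation (array names are illustrative):

```python
import numpy as np

def lct_combine(weights, As, bs):
    """LCT: collapse the N_lambda transforms of a class into one.
    weights: (K,) transform weights p(lambda_k|c)
    As: (K, d, d) rotation matrices, bs: (K, d) bias vectors."""
    A_bar = np.tensordot(weights, As, axes=1)   # sum_k p_k * A_ck -> (d, d)
    b_bar = weights @ bs                        # sum_k p_k * b_ck -> (d,)
    return A_bar, b_bar

# two transforms for a 2-dimensional feature space, equal weights
weights = np.array([0.5, 0.5])
As = np.stack([np.eye(2), 3.0 * np.eye(2)])
bs = np.array([[0.0, 0.0], [2.0, 2.0]])
A_bar, b_bar = lct_combine(weights, As, bs)
print(A_bar)   # 2 * identity
print(b_bar)   # [1. 1.]
```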
Grouping the Transformed Gaussians
In MTG we use:
$\bar{m}_{sj}(i) = \sum_{k=1}^{N_\lambda} p(\lambda_k \mid c)\, \mu_{sjk}(i)$
$\bar{\sigma}_{sj}^2(i) = \sigma_{sj}^2(i) + \sum_{k=1}^{N_\lambda} p(\lambda_k \mid c)\, \big(\mu_{sjk}(i) - \bar{m}_{sj}(i)\big)^2$
where $\mu_{sjk}(i) = (A_{ck} m_{sj} + b_{ck})(i)$
Disadvantage: covariances are always broader now.
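MTG is moment matching: the mixture of transformed Gaussians (all sharing the original diagonal covariance) is replaced by a single Gaussian with the mixture's mean and variance. A sketch with illustrative names:

```python
import numpy as np

def mtg_merge(weights, trans_means, var):
    """Merge K transformed Gaussians N(mu_k, var) with mixture weights
    p(lambda_k|c) into one Gaussian by moment matching.
    weights: (K,), trans_means: (K, d), var: (d,) shared diagonal cov."""
    m_bar = weights @ trans_means                      # merged mean
    spread = weights @ (trans_means - m_bar) ** 2      # between-component variance
    return m_bar, var + spread                         # variance only grows

w = np.array([0.5, 0.5])
mus = np.array([[0.0], [2.0]])                         # two transformed means
m_bar, v_bar = mtg_merge(w, mus, np.array([1.0]))
print(m_bar, v_bar)   # [1.] [2.]
```

The `var + spread` line makes the slide's disadvantage concrete: the merged variance equals the original variance plus the spread of the transformed means, so it can never shrink.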
Memory and Time Requirements
MLST needs to store $N_\omega N_\lambda\, 2(d+1)$ real numbers in excess of those used by MLLR. Also the forward-backward algorithm, which is run during adaptation, is slower. In practice the added memory requirements are no more than 60MB. Caching methods can be used to reduce this number. During adaptation, MLST is about 15% slower than MLLR.
Experimental Results
Experiments were carried out on the Wall Street Journal database: gender-dependent systems with 15,000 Gaussians each, 12,000 context-dependent phonetic models. When evaluated on native English speakers, performance was benchmarked at 12.0% WER using bigram language models. For non-native speakers performance was 27.4%.
Effect of Transform Weight Tying
40 adaptation sentences, 10 regression classes for transformations and 10 transformations per class.

Transform weight threshold     0     5    10    50
WER (%)                16.5  14.1  14.3  15.0  17.2

Introducing tying on transform weights clearly helps.
Linear Combination of Transformed Gaussians
40 adaptation sentences. WER (%) by transforms per class (rows) and acoustic classes (columns):

Transforms      # Acoustic classes
per class      2     5    10    20    30
 1          20.6  17.8  16.4  15.6  16.4
 2          18.7  15.1  14.7  15.4  15.3
 4          17.0  14.4  14.5  15.2  14.7
 6          16.4  14.3  14.4  15.0  14.8
 8          15.7  14.1  14.7  15.2  15.9
10          15.6  13.5  13.9  14.7  17.3
12          15.4  14.1  14.0  14.9  19.6
Highest Transformed Gaussian
40 adaptation sentences. WER (%) by transforms per class (rows) and acoustic classes (columns):

Transforms      # Acoustic classes
per class      2     5    10    20    30
 1          20.6  17.8  16.4  15.6  16.4
 2          18.7  15.2  15.1  16.1  16.3
 4          16.6  15.3  15.2  16.8  16.5
 6          16.5  14.9  16.9  16.3  17.0
 8          16.2  15.0  16.7  16.5  16.1
10          16.1  15.4  16.2  16.4  16.4
12          16.0  16.0  16.3  16.3  16.1
Merging Transformed Gaussians
40 adaptation sentences. WER (%) by transforms per class (rows) and acoustic classes (columns):

Transforms      # Acoustic classes
per class      5    10    20    30
 1          17.8  16.4  15.6  16.4
 2          14.9  15.0  15.8  15.9
 4          15.5  17.2  18.6  19.7
 6          18.0  18.8  19.9  20.7
 8          19.1  22.1  22.9  22.7
10          19.1  22.1  23.1  23.9
12          19.3  22.8  23.3  23.6
Comparative Results
MLLR and MLST performance (WER %) for various numbers of sentences:

# sentences   MLLR   MLST
10            19.6   17.5
20            17.3   16.2
40            15.6   13.5

Significant, consistent gains over MLLR for any number of sentences.
Basis Transformations
Motivation: exploit similarities between speakers. Use MLST with transformations generated from other speakers and adapt only the transform weights. Now there are very few adaptation parameters, so the method can be used with very few adaptation sentences.
Methodology for Swedish ATIS
Speaker-independent system trained on non-dialect speakers, with 5,700 Gaussians, 3,600 context-dependent phonetic models. It achieved an 8.9% WER for non-dialect speakers but 25.8% for dialect speakers (Scania region). We had 39 Scania speakers available; 31 of them were used to generate the basis transforms and the rest for evaluation.
Methodology for Swedish ATIS
For various numbers of regression classes we generated MLLR transforms for each dialect speaker. The transform weights were estimated with the same tying as the transforms. Backoff transform weights are used when there are not enough data.
Smoothing the Transform Weights
With very small amounts of adaptation data, smoothing techniques for the transform weights are expected to help. Estimate smoothed transform weights using:
$\hat{p}(\lambda_k \mid c) = \dfrac{\sum_s h(s) \sum_j \sum_t \gamma_{sjk}(t)}{\sum_s h(s) \sum_j \sum_k \sum_t \gamma_{sjk}(t)}$
where $h(s) = 1$ if $s \in c$, and $h(s) = 0.1$ if $s \notin c$.
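The smoothing simply down-weights occupancy counts from states outside the class before normalizing. A sketch in Python: the 0.1 backoff factor comes from the slide, but the array layout and names are illustrative.

```python
import numpy as np

def smoothed_weights(gamma, in_class, out_weight=0.1):
    """Smoothed transform weights for one class c.
    gamma: (S, K) occupancies already summed over j and t,
           gamma[s, k] = sum_{j,t} gamma_sjk(t).
    in_class: (S,) boolean mask, True if state s belongs to class c."""
    h = np.where(in_class, 1.0, out_weight)   # h(s): 1 inside c, 0.1 outside
    counts = h @ gamma                         # weighted counts per transform k
    return counts / counts.sum()               # normalize over k

gamma = np.array([[8.0, 2.0],     # a state inside class c
                  [0.0, 10.0]])   # a state outside c, counted at 0.1
p = smoothed_weights(gamma, np.array([True, False]))
print(p)   # weighted counts [8, 3] -> [8/11, 3/11]
```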
Experiments on the ATIS
MLLR performance (WER %):

Number       Number of sentences
of classes    1     3     5    10    15
 1          21.9  18.4  16.0  15.6  15.5
 3             -  18.9  16.1  14.0  13.6
 6             -     -  15.9  14.4  13.5
11             -     -  17.1  13.5  11.9
31             -     -     -  11.7  10.9
41             -     -     -     -  12.0
Experiments on the ATIS
Results of MLST with basis transforms using MTG (WER %):

Number       Number of sentences
of classes    1     3     5    10    15
 1          18.1  18.1  18.5  18.3  18.2
 3          16.3  16.3  16.1  16.1  16.1
 6          13.8  14.1  14.3  14.1  14.0
11          13.3  13.8  13.7  13.4  13.4
31          13.3  13.0  13.2  13.6  12.9
41          13.2  12.7  12.6  12.5  12.4
51          13.6  12.6  12.9  12.8  13.0
57          13.7  12.6  12.7  12.8  12.8
Experiments on the ATIS
How important are transform weights? Perhaps the transforms themselves are capturing an important part of the mismatch.

Number       Adapted   Equal     Random
of classes   weights   weights   transform
 1            18.2      18.8      25.1
 3            16.1      17.0      23.1
 6            14.0      14.6      23.4
31            12.9      14.9      28.6
41            12.4      14.6      27.6
Methodology for the WSJ
Evaluated the basis transformations method for native speakers on the WSJ task. The very high number of training speakers (245) necessitated a different methodology than ATIS. Gender-dependent systems using 15,000 Gaussians with 12,000 context-dependent phonetic models.
Methodology for the WSJ
Cluster the speakers into sets. Then either:
- estimate cluster-specific transformations, or
- determine the centroid speaker of each cluster, re-train a system without the centroid speakers, and estimate speaker-specific transformations.
Clustering the Training Speakers
Used MLLR to generate speaker-adapted models for each speaker. The distance between speakers $m$ and $n$ is given by:
$D(m, n) = \frac{1}{2}\big[\log p(x_m \mid \lambda_m) + \log p(x_n \mid \lambda_n) - \log p(x_m \mid \lambda_n) - \log p(x_n \mid \lambda_m)\big]$
Use the forward-backward algorithm to calculate the probabilities.
Clustering the Training Speakers
Hierarchical clustering techniques are used. The distance between clusters $c$ and $k$ is initially defined as:
$D(c, k) = \underset{m \in c,\, n \in k}{\operatorname{avg}}\, D(m, n)$
To create balanced clusters the criterion was modified to:
$D(c, k) = \underset{m \in c,\, n \in k}{\operatorname{avg}}\, D(m, n) + a \log(l_c + l_k)$
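The symmetric cross-likelihood distance and the balanced cluster criterion can be sketched as follows; the log-likelihood matrix `ll` and all names are illustrative stand-ins for quantities computed with the forward-backward algorithm.

```python
import numpy as np

def speaker_distance(ll, m, n):
    """Symmetric cross-likelihood distance between speakers m and n.
    ll[i, j] = log p(x_i | lambda_j): speaker i's data scored with the
    speaker-adapted model of speaker j."""
    return 0.5 * (ll[m, m] + ll[n, n] - ll[m, n] - ll[n, m])

def cluster_distance(ll, c, k, a=0.0):
    """Average pairwise speaker distance between clusters c and k, plus
    the size penalty a * log(l_c + l_k) that favors balanced clusters
    (a = 0 recovers the plain average)."""
    d = np.mean([speaker_distance(ll, m, n) for m in c for n in k])
    return d + a * np.log(len(c) + len(k))

# toy example: each model scores its own speaker best
ll = np.array([[ 0.0, -2.0],
               [-3.0,  0.0]])
print(speaker_distance(ll, 0, 1))      # 0.5 * (0 + 0 + 2 + 3) = 2.5
print(cluster_distance(ll, [0], [1]))  # 2.5 with a = 0
```

The distance is zero when each model explains the other speaker's data as well as its own, and grows with the mutual mismatch.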
Experimental Results
Baseline performance: 11.9% WER. MLLR performance for various numbers of sentences:

Number of sentences     1     2     5    10    20
WER (%)              11.2  10.7  10.6  10.2  10.2

With 2 adaptation sentences, 70% of the total gain is obtained.
Experimental Results
Cluster-specific transformations for 9 clusters, 2 transformation classes, 1 transformation per class. WER (%):

# weight     Number of sentences
classes       1     2     5    10    20
 1          11.7  11.6  11.4  11.3  11.3
10          11.7  11.6  11.4  11.4  11.3
30          11.7  11.7  11.4  11.3  11.2

Speaker-specific transformations (re-trained system: 12.0%). WER (%):

# weight     Number of sentences
classes       1     2     5    10    20
 1          12.0  12.3  11.4  11.5  11.4
10          12.3  12.0  11.6  11.5  11.4
30          11.8  11.9  11.6  11.6  11.5
Experimental Results
Speaker-specific transformations for 9 clusters, 2 transformation classes, 7 transformations per class. WER (%):

# weight     Number of sentences
classes       1     2     5    10    20
 1          11.3  11.6  11.3  11.5  11.2
10          11.4  11.4  11.2  11.4  11.2
30          11.4  11.4  11.0  11.2  11.1

Speaker-specific transformations for 23 clusters, 2 transformation classes, 3 transformations per class. WER (%):

# weight     Number of sentences
classes       1     2     5    10    20
 1          11.2  11.1  11.1  11.0  11.0
10          11.2  11.1  10.9  11.1  10.8
30          11.2  11.2  11.0  11.2  10.9
Experimental Results
Speaker-specific transformations for 23 clusters, 2 transformation classes, 3 full transformations per class. WER (%):

# weight     Number of sentences
classes       1     2     5    10    20
 1          11.2  11.5  11.1  11.3  10.9
10          11.2  11.4  11.0  11.3  11.2
30          11.1  11.0  11.2  11.3  11.4

Speaker-specific transformations for 23 clusters, 10 transformation classes, 3 transformations per class. WER (%):

# weight     Number of sentences
classes       1     2     5    10    20
 1          11.5  11.5  11.4  11.0  10.9
10          11.2  11.1  11.0  10.9  11.0
30          11.4  11.2  11.1  11.4  11.3
Experimental Results
The only factor that seems to influence the results is the total number of transformations per class. The results are insensitive to the number of transformation classes, the methodology used, the transform type, the transform weight classes and the number of clusters used. Diagnostic experiments performed as expected.
Comparing MLLR and BT
Best results (WER %) for the two methods:

# sentences   MLLR    BT
 1            11.2  11.1
 2            10.7  11.0
 5            10.6  10.9
10            10.2  10.9
20            10.2  10.8

Equal performance for few data; MLLR makes better use of increased amounts of data.
Combining MLLR and BT
We can cascade BT and MLLR (the opposite order is expensive):

# sentences   MLLR   BT+MLLR
 1            11.2     11.0
 2            10.7     10.6
 5            10.6     10.2
10            10.2     10.2
20            10.2      9.9

Marginal gains for any amount of adaptation data.
Conclusions
Two speaker adaptation algorithms were presented. MLST is a more general method than MLLR, and it was shown to outperform it by 13.5% (relative) for 40 sentences. Basis Transformations is a variant of MLST designed to work with very little adaptation data. When cascaded with MLLR, it offers marginal gains over MLLR alone.