Speaker Adaptation Techniques For Continuous Speech Using Medium and Small Adaptation Data Sets. Constantinos Boulis

Outline of the Presentation

- Introduction to the speaker adaptation problem
- Maximum Likelihood Stochastic Transformations (MLST)
- Experimental results of MLST
- Basis Transformations
- Experimental results of Basis Transformations

What is Speaker Adaptation?

Current speech recognizers are trained on a large number of speakers. Performance is not always optimal for each individual speaker. Speaker adaptation techniques attempt to adapt an initial system into a system that better matches a specific speaker.

The Benefits of Speaker Adaptation

By using 40 sentences from each speaker, a 20% drop in WER is achieved (native English speakers). A 50% drop in WER is observed for non-native English speakers. Recognizers become 15%-20% faster due to the better match between the models and the speech input.

Speaker Adaptation Methods

Maximum Likelihood: $\hat{\theta} = \arg\max_{\theta} P(x \mid \theta)$

Maximum a Posteriori: $\hat{\theta} = \arg\max_{\theta} P(\theta \mid x) = \arg\max_{\theta} P(x \mid \theta)\, P(\theta)$

Estimation of the Mean of a Normal Density

The Maximum Likelihood estimate of the mean is given by:

$\hat{m}_{ML} = \frac{1}{T} \sum_{t=1}^{T} x_t$

The Maximum a Posteriori estimate of the mean, with a known standard deviation $\sigma$ and a normal prior distribution for the mean with mean $\mu$ and standard deviation $\kappa$, is:

$\hat{m}_{MAP} = \frac{T \kappa^2\, \hat{m}_{ML} + \sigma^2 \mu}{T \kappa^2 + \sigma^2}$
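
A minimal numeric sketch of these two estimators, assuming a known $\sigma$ and a $N(\mu, \kappa^2)$ prior on the mean (illustrative code, not from the talk):

```python
import numpy as np

def ml_mean(x):
    """Maximum likelihood estimate of the mean: the sample mean."""
    return np.mean(x)

def map_mean(x, sigma, mu, kappa):
    """MAP estimate under a normal prior on the mean (conjugate case)."""
    T = len(x)
    m_ml = ml_mean(x)
    # The prior pulls the estimate toward mu; its influence vanishes as T grows.
    return (T * kappa**2 * m_ml + sigma**2 * mu) / (T * kappa**2 + sigma**2)

x = np.random.normal(loc=2.0, scale=1.0, size=10)
print(ml_mean(x), map_mean(x, sigma=1.0, mu=0.0, kappa=0.5))
```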

Speaker Adaptation Methods

Transformation-Based: $\hat{A}, \hat{b} = \arg\max_{A, b} P(x \mid A\theta + b)$

The same linear transform is tied over many HMM states. Any HMM states unseen in the adaptation data will still be transformed, in contrast with MAP.

Comparison of Speaker Adaptation Methods

- Maximum Likelihood: requires thousands of sentences to robustly train a large-vocabulary system.
- Maximum a Posteriori: achieves improved performance but is inadequate for fewer than 40 sentences.
- Transformation-based: achieves the best results for small data sets, but performance does not scale well as the number of adaptation sentences increases.

Maximum Likelihood Linear Regression (MLLR)

Only the observation probabilities of the HMM states are altered:

$P^{SA}(y_t \mid s) = \sum_{j=1}^{N_\omega} p(\omega_j \mid s)\, N(y_t;\, A_c m_{sj} + b_c,\, S_{sj})$

State $s$ is mapped to class $c$ with transform $W_c = [A_c\ b_c]$.
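
A minimal sketch of what an MLLR transform does to the means of one class (shapes and names are illustrative assumptions, not the talk's code):

```python
import numpy as np

def apply_mllr(means, A, b):
    """means: (N, d) Gaussian means tied to class c; covariances are left unchanged."""
    return means @ A.T + b

d = 3
means = np.random.randn(100, d)      # all means mapped to class c
A, b = 1.1 * np.eye(d), np.zeros(d)  # a toy transform W_c = [A_c b_c]
adapted = apply_mllr(means, A, b)
```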

Limitations of MLLR

- All the Gaussians of a class are transformed identically.
- The transforms for different classes are estimated independently.

Maximum Likelihood Stochastic Transformations

Maximum Likelihood Stochastic Transformations (MLST) estimates multiple transformations per class and transform weight vectors:

$P^{SA}(y_t \mid s) = \sum_{j=1}^{N_\omega} \sum_{k=1}^{N_\lambda} p(\lambda_k \mid c_s)\, p(\omega_j \mid s)\, N(y_t;\, A_{ck} m_{sj} + b_{ck},\, S_{sj})$

The key point of MLST is the different tying between transformations and transform weights.

Maximum Likelihood Stochastic Transformations

Robust estimates of transform weights are obtained using far fewer samples than transformations. Each HMM state may now estimate its own transform weight vector and use transformations shared by many states. MLST achieves increased adaptation resolution without sacrificing robustness.

Example

Suppose there are enough data for only two transforms $W_1, W_2$:

$\lambda_{11} W_1 + \lambda_{12} W_2$
$\vdots$
$\lambda_{n1} W_1 + \lambda_{n2} W_2$

In effect we have $n$ transforms, while MLLR could estimate only two.
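
A minimal sketch of this example (all names and shapes are illustrative): with two shared transforms and per-state weights, the $n$ weighted combinations act like $n$ distinct transforms.

```python
import numpy as np

d, n = 3, 5
W1 = np.random.randn(d, d + 1)                 # [A1 b1]
W2 = np.random.randn(d, d + 1)                 # [A2 b2]
lam = np.random.dirichlet([1.0, 1.0], size=n)  # per-state weights (lam_s1, lam_s2)

# Effective transform of state s: lam[s, 0] * W1 + lam[s, 1] * W2
effective = lam[:, 0, None, None] * W1 + lam[:, 1, None, None] * W2
print(effective.shape)                         # (n, d, d+1): n distinct transforms
```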

Estimating Transform Weights

Transform weights are estimated using:

$p(\lambda_k \mid c) = \frac{\sum_{s \in c} \sum_j \sum_t \gamma_{sjk}(t)}{\sum_{s \in c} \sum_j \sum_k \sum_t \gamma_{sjk}(t)}$

where

$\gamma_{sjk}(t) = p(s_t = s, \omega_t = j, \lambda = k \mid o) = p(s_t = s, \omega_t = j \mid o, \lambda = k)\, p(\lambda = k \mid o)$

Notice that another hidden variable is added.
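
A minimal sketch of the weight re-estimation ratio above, assuming the occupancies $\gamma_{sjk}(t)$ are available as a dense array (the shapes are illustrative assumptions):

```python
import numpy as np

def estimate_weights(gamma, states_in_class):
    """gamma: (S, J, K, T) occupancies gamma_sjk(t); states_in_class: the s with c(s) = c."""
    counts = gamma[states_in_class].sum(axis=(0, 1, 3))  # sum over s in c, j, t -> (K,)
    return counts / counts.sum()                         # normalize over k

# Toy usage: 4 states, 2 mixture components, 3 transforms, 50 frames.
gamma = np.random.rand(4, 2, 3, 50)
print(estimate_weights(gamma, states_in_class=[0, 2]))
```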

Estimating Transformations

The transformations are estimated using:

$\sum_{s \in c} \sum_j \sum_t \gamma_{sjk}(t)\, S_{sj}^{-1} x_t \hat{\mu}_{sj}' = \sum_{s \in c} \sum_j \sum_t \gamma_{sjk}(t)\, S_{sj}^{-1} W_{ck} \hat{\mu}_{sj} \hat{\mu}_{sj}'$

where $W_{ck} = [A_{ck}\ b_{ck}]$ and $\hat{\mu}_{sj}$ is the extended mean vector. If each speech frame is of dimension $d$, then we need to solve $d+1$ linear systems of $d+1$ equations each (full rotation matrix).
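
A minimal sketch of this estimation step under the common diagonal-covariance simplification, where each row of $W_{ck}$ gets its own $(d+1)$-dimensional linear system (names and shapes are illustrative assumptions, not the talk's exact procedure):

```python
import numpy as np

def estimate_transform(gammas, mus, variances, frames):
    """gammas: (N, T) occupancies of the N Gaussians tied to transform (c, k);
    mus: (N, d) means; variances: (N, d) diagonal covariances;
    frames: (T, d) adaptation observations."""
    N, d = mus.shape
    xi = np.hstack([mus, np.ones((N, 1))])   # extended means [mu; 1]
    G = np.zeros((d, d + 1, d + 1))
    k = np.zeros((d, d + 1))
    for n in range(N):
        occ = gammas[n].sum()                # total occupancy of Gaussian n
        x_bar = gammas[n] @ frames           # sum_t gamma_n(t) * x_t
        outer = occ * np.outer(xi[n], xi[n])
        for i in range(d):                   # one linear system per row of W
            G[i] += outer / variances[n, i]
            k[i] += x_bar[i] * xi[n] / variances[n, i]
    return np.stack([np.linalg.solve(G[i], k[i]) for i in range(d)])  # (d, d+1) = [A b]
```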

Grouping the Transformed Gaussians

MLST increases the number of Gaussians by a factor of $N_\lambda$; we need a way to return to the original number of Gaussians. Three grouping methods are introduced: highest transform weight (HTW), linear combination of transforms (LCT), and merging transformed Gaussians (MTG). In HTW we use:

$P^{SA}(y_t \mid s) = \sum_{j=1}^{N_\omega} p(\omega_j \mid s)\, N(y_t;\, A_{c\hat{k}} m_{sj} + b_{c\hat{k}},\, S_{sj}), \quad \hat{k} = \arg\max_k \{ p(\lambda_k \mid c) \}$

Grouping the Transformed Gaussians

In LCT we use:

$P^{SA}(y_t \mid s) = \sum_{j=1}^{N_\omega} p(\omega_j \mid s)\, N(y_t;\, \bar{A}_c m_{sj} + \bar{b}_c,\, S_{sj})$

where

$\bar{A}_c = \sum_{k=1}^{N_\lambda} p(\lambda_k \mid c)\, A_{ck}, \quad \bar{b}_c = \sum_{k=1}^{N_\lambda} p(\lambda_k \mid c)\, b_{ck}$

Advantage: the smoothing across all transformed Gaussians reduces estimation errors.

Grouping the Transformed Gaussians

In MTG we use:

$m_{sj}^{(i)} = \sum_{k=1}^{N_\lambda} p(\lambda_k \mid c)\, \mu_{sjk}^{(i)}$

$\left( \hat{\sigma}_{sj}^{(i)} \right)^2 = \left( \sigma_{sj}^{(i)} \right)^2 + \sum_{k=1}^{N_\lambda} p(\lambda_k \mid c) \left( \mu_{sjk}^{(i)} - m_{sj}^{(i)} \right)^2$

where $\mu_{sjk}^{(i)} = \left( A_{ck} m_{sj} + b_{ck} \right)^{(i)}$.

Disadvantage: covariances are always broader now.
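
A minimal sketch of the three grouping rules above (HTW, LCT, MTG) for a single Gaussian, with illustrative shapes; transforms are stored as $[A\ b]$ matrices:

```python
import numpy as np

def transform_mean(W, mu):
    return W[:, :-1] @ mu + W[:, -1]        # A @ mu + b

def htw(Ws, p, mu):
    """Highest transform weight: keep only the most probable transform."""
    return transform_mean(Ws[np.argmax(p)], mu)

def lct(Ws, p, mu):
    """Linear combination of transforms: average the transforms first."""
    W_bar = np.tensordot(p, Ws, axes=1)     # sum_k p_k * W_k
    return transform_mean(W_bar, mu)

def mtg(Ws, p, mu, var):
    """Merging transformed Gaussians: moment-match one Gaussian (broader variance)."""
    mus_k = np.stack([transform_mean(W, mu) for W in Ws])  # (K, d)
    m = p @ mus_k                           # merged mean
    v = var + p @ (mus_k - m) ** 2          # merged (diagonal) variance
    return m, v

Ws = np.random.randn(3, 2, 3)               # K=3 transforms, d=2
p = np.array([0.5, 0.3, 0.2])
mu, var = np.zeros(2), np.ones(2)
print(htw(Ws, p, mu), lct(Ws, p, mu), mtg(Ws, p, mu, var))
```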

Memory and Time Requirements

MLST needs to store $N_\omega N_\lambda (2d+1)$ real numbers in excess of those used by MLLR. Also, the forward-backward algorithm, which is run during adaptation, is slower. In practice the added memory requirements are no more than 60MB, and caching methods can be used to reduce this number. During adaptation, MLST is about 15% slower than MLLR.

Experimental Results

Experiments were carried out on the Wall Street Journal database. Gender-dependent systems with 15,000 Gaussians each and 12,000 context-dependent phonetic models. When evaluated on native English speakers, performance was benchmarked at 12.0% WER using bigram language models. For non-native speakers, performance was 27.4%.

Effect of Transform Weight Tying

40 adaptation sentences, 10 regression classes for transformations, and 10 transformations per class.

Transform weight threshold          0     5     10    50
WER (%)                      16.5  14.1  14.3  15.0  17.2

Introducing tying on transform weights clearly helps.

Linear Combination of Transformed Gaussians

40 adaptation sentences. WER (%) by number of transforms per class (rows) and number of acoustic classes (columns):

# Transforms per class     2     5     10    20    30
 1                        20.6  17.8  16.4  15.6  16.4
 2                        18.7  15.1  14.7  15.4  15.3
 4                        17.0  14.4  14.5  15.2  14.7
 6                        16.4  14.3  14.4  15.0  14.8
 8                        15.7  14.1  14.7  15.2  15.9
10                        15.6  13.5  13.9  14.7  17.3
12                        15.4  14.1  14.0  14.9  19.6

Highest Transformed Gaussian (HTW)

40 adaptation sentences. WER (%) by number of transforms per class (rows) and number of acoustic classes (columns):

# Transforms per class     2     5     10    20    30
 1                        20.6  17.8  16.4  15.6  16.4
 2                        18.7  15.2  15.1  16.1  16.3
 4                        16.6  15.3  15.2  16.8  16.5
 6                        16.5  14.9  16.9  16.3  17.0
 8                        16.2  15.0  16.7  16.5  16.1
10                        16.1  15.4  16.2  16.4  16.4
12                        16.0  16.0  16.3  16.3  16.1

Merging Transformed Gaussians (MTG)

40 adaptation sentences. WER (%) by number of transforms per class (rows) and number of acoustic classes (columns):

# Transforms per class     5     10    20    30
 1                        17.8  16.4  15.6  16.4
 2                        14.9  15.0  15.8  15.9
 4                        15.5  17.2  18.6  19.7
 6                        18.0  18.8  19.9  20.7
 8                        19.1  22.1  22.9  22.7
10                        19.1  22.1  23.1  23.9
12                        19.3  22.8  23.3  23.6

Comparative Results

MLLR and MLST performance (WER %) for various numbers of sentences:

# Sentences   MLLR   MLST
10            19.6   17.5
20            17.3   16.2
40            15.6   13.5

Significant, consistent gains over MLLR for any number of sentences.

Basis Transformations

Motivation: exploit similarities between speakers. Use MLST with transformations generated from other speakers and adapt only the transform weights. Now there are very few adaptation parameters, so the method can be used with very few adaptation sentences.

Methodology for Swedish ATIS

Speaker-independent system trained on non-dialect speakers, with 5,700 Gaussians and 3,600 context-dependent phonetic models. It achieved an 8.9% WER for non-dialect speakers but 25.8% for dialect speakers (Scania region). We had 39 Scania speakers available; 31 of them were used to generate the basis transforms and the rest for evaluation.

Methodology for Swedish ATIS

For various numbers of regression classes, we generated MLLR transforms for each dialect speaker. The transform weights were estimated with the same tying as the transforms. Backoff transform weights are used when there are not enough data.

Smoothing the Transform Weights

Under very small amounts of adaptation data, smoothing techniques for the transform weights are expected to help. Estimate smoothed transform weights using:

$p(\lambda_k \mid c) = \frac{\sum_s \sum_j \sum_t h(s)\, \gamma_{sjk}(t)}{\sum_s \sum_j \sum_k \sum_t h(s)\, \gamma_{sjk}(t)}$

where

$h(s) = \begin{cases} 1 & \text{if } s \in c \\ 0.1 & \text{if } s \notin c \end{cases}$
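
A minimal sketch of this smoothed estimate, reusing the occupancy-array convention from the earlier weight-estimation sketch (the shapes are illustrative assumptions):

```python
import numpy as np

def smoothed_weights(gamma, class_of_state, c, h_out=0.1):
    """gamma: (S, J, K, T) occupancies; class_of_state: (S,) class index of each state."""
    h = np.where(class_of_state == c, 1.0, h_out)   # h(s): 1 inside the class, 0.1 outside
    counts = (h[:, None, None, None] * gamma).sum(axis=(0, 1, 3))  # (K,)
    return counts / counts.sum()
```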

Experiments on the ATIS

MLLR performance (WER %) by number of regression classes (rows) and number of adaptation sentences (columns):

# Classes     1     3     5     10    15
 1           21.9  18.4  16.0  15.6  15.5
 3            -    18.9  16.1  14.0  13.6
 6            -     -    15.9  14.4  13.5
11            -     -    17.1  13.5  11.9
31            -     -     -    11.7  10.9
41            -     -     -     -    12.0

Experiments on the ATIS

Results of MLST with basis transforms using MTG (WER %), by number of classes (rows) and number of sentences (columns):

# Classes     1     3     5     10    15
 1           18.1  18.1  18.5  18.3  18.2
 3           16.3  16.3  16.1  16.1  16.1
 6           13.8  14.1  14.3  14.1  14.0
11           13.3  13.8  13.7  13.4  13.4
31           13.3  13.0  13.2  13.6  12.9
41           13.2  12.7  12.6  12.5  12.4
51           13.6  12.6  12.9  12.8  13.0
57           13.7  12.6  12.7  12.8  12.8

Experiments on the ATIS

How important are the transform weights? Perhaps the transforms themselves are capturing an important part of the mismatch. WER (%):

# Classes   Adapted weights   Equal weights   Random transform
 1              18.2              18.8            25.1
 3              16.1              17.0            23.1
 6              14.0              14.6            23.4
31              12.9              14.9            28.6
41              12.4              14.6            27.6

Methodology for the WSJ

Evaluated the basis transformations method for native speakers on the WSJ task. The very high number of training speakers (245) necessitated a different methodology than ATIS. Gender-dependent systems using 15,000 Gaussians with 12,000 context-dependent phonetic models.

Methodology for the WSJ

Cluster the speakers into sets. Then either:

- estimate cluster-specific transformations, or
- determine the centroid speaker of each cluster, re-train a system without the centroid speakers, and estimate speaker-specific transformations.

Clustering the Training Speakers

Used MLLR to generate speaker-adapted models for each speaker. The distance between speakers m and n is given by:

$D(m, n) = \frac{1}{2} \left[ \log p(x_m \mid \lambda_m) + \log p(x_n \mid \lambda_n) - \log p(x_m \mid \lambda_n) - \log p(x_n \mid \lambda_m) \right]$

Use the forward-backward algorithm to calculate the probabilities.
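
A minimal sketch of this symmetric cross-likelihood distance, assuming a hypothetical `loglik(x, model)` helper that returns $\log p(x \mid \text{model})$ (e.g., computed with forward-backward):

```python
def speaker_distance(loglik, x_m, x_n, model_m, model_n):
    """Symmetric distance: own-model likelihoods minus cross-model likelihoods."""
    return 0.5 * (loglik(x_m, model_m) + loglik(x_n, model_n)
                  - loglik(x_m, model_n) - loglik(x_n, model_m))
```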

Clustering the Training Speakers

Hierarchical clustering techniques are used. The distance between clusters c and k is initially defined as:

$D(c, k) = \underset{m \in c,\, n \in k}{\mathrm{avg}}\, D(m, n)$

To create balanced clusters, the criterion was modified to:

$D(c, k) = \underset{m \in c,\, n \in k}{\mathrm{avg}}\, D(m, n) + a \log(l_c + l_k)$
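
A minimal sketch of greedy agglomerative clustering with the size-penalized linkage above; the penalty weight `a` and the precomputed speaker distance matrix `D` are assumed inputs:

```python
import numpy as np

def cluster_speakers(D, num_clusters, a=1.0):
    """D: (M, M) pairwise speaker distances; merge until num_clusters remain."""
    clusters = [[m] for m in range(len(D))]
    while len(clusters) > num_clusters:
        best, pair = np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                avg = np.mean([D[m, n] for m in clusters[i] for n in clusters[j]])
                link = avg + a * np.log(len(clusters[i]) + len(clusters[j]))
                if link < best:
                    best, pair = link, (i, j)
        i, j = pair
        clusters[i] += clusters.pop(j)   # merge the closest pair under the penalty
    return clusters
```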

Experimental Results

Baseline performance: 11.9% WER. MLLR performance for various numbers of sentences:

# Sentences    1     2     5     10    20
WER (%)       11.2  10.7  10.6  10.2  10.2

With 2 adaptation sentences, 70% of the total gain is already obtained.

Experimental Results

Cluster-specific transformations for 9 clusters, 2 transformation classes, 1 transformation per class. WER (%) by number of weight classes (rows) and number of sentences (columns):

# Weight classes    1     2     5     10    20
 1                 11.7  11.6  11.4  11.3  11.3
10                 11.7  11.6  11.4  11.4  11.3
30                 11.7  11.7  11.4  11.3  11.2

Speaker-specific transformations (re-trained system: 12.0%):

# Weight classes    1     2     5     10    20
 1                 12.0  12.3  11.4  11.5  11.4
10                 12.3  12.0  11.6  11.5  11.4
30                 11.8  11.9  11.6  11.6  11.5

Experimental Results

Speaker-specific transformations for 9 clusters, 2 transformation classes, 7 transformations per class. WER (%) by number of weight classes (rows) and number of sentences (columns):

# Weight classes    1     2     5     10    20
 1                 11.3  11.6  11.3  11.5  11.2
10                 11.4  11.4  11.2  11.4  11.2
30                 11.4  11.4  11.0  11.2  11.1

Speaker-specific transformations for 23 clusters, 2 transformation classes, 3 transformations per class:

# Weight classes    1     2     5     10    20
 1                 11.2  11.1  11.1  11.0  11.0
10                 11.2  11.1  10.9  11.1  10.8
30                 11.2  11.2  11.0  11.2  10.9

Experimental Results

Speaker-specific transformations for 23 clusters, 2 transformation classes, 3 full transformations per class. WER (%) by number of weight classes (rows) and number of sentences (columns):

# Weight classes    1     2     5     10    20
 1                 11.2  11.5  11.1  11.3  10.9
10                 11.2  11.4  11.0  11.3  11.2
30                 11.1  11.0  11.2  11.3  11.4

Speaker-specific transformations for 23 clusters, 10 transformation classes, 3 transformations per class:

# Weight classes    1     2     5     10    20
 1                 11.5  11.5  11.4  11.0  10.9
10                 11.2  11.1  11.0  10.9  11.0
30                 11.4  11.2  11.1  11.4  11.3

Experimental Results

The only factor that seems to influence the results is the total number of transformations per class. Results are insensitive to the number of transformation classes, the methodology used, the transform type, the transform weight classes, and the number of clusters used. Diagnostic experiments performed as expected.

Comparing MLLR and BT

Best results for the two methods (WER %):

# Sentences   MLLR    BT
 1            11.2   11.1
 2            10.7   11.0
 5            10.6   10.9
10            10.2   10.9
20            10.2   10.8

Equal performance for few data; MLLR makes better use of increased amounts of data.

Combining MLLR and BT

We can cascade BT and then MLLR (the opposite order is expensive). WER (%):

# Sentences   MLLR   BT+MLLR
 1            11.2    11.0
 2            10.7    10.6
 5            10.6    10.2
10            10.2    10.2
20            10.2     9.9

Marginal gains for any amount of adaptation data.

Conclusions

Two speaker adaptation algorithms were presented. MLST is a more general method than MLLR, and it was shown to outperform MLLR by 13.5% (relative) for 40 sentences (15.6% vs. 13.5% WER). Basis Transformations is a variant of MLST designed to work with very little data. When cascaded with MLLR, it yields marginal gains over MLLR alone.