Geoffrey Zweig May 7, 2009


1 Geoffrey Zweig May 7, 2009

2 Taxonomy of LID Techniques. LID techniques fall into three families: acoustic scores (GMM); derived LMs (GMM tokenization, parallel phone recognition + LM); and vector space models (vectors of phone LM statistics). [Torres-Carrasquillo et al. 02], [Singer et al. 03], [Lamel and Gauvain 94], [Zissman 96], [Li et al. 07]

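3 The GMM Approach. Score the input against one acoustic model per language (English, French, ..., Tamil) and output the likeliest language:

$$\hat{l} = \arg\max_l P(l \mid x) = \arg\max_l P(l)\,P(x \mid l) = \arg\max_l P(l) \prod_i \sum_k c_{kl}\,\mathcal{N}(x_i;\ \mu_k^l, \Sigma_k^l)$$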
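A minimal sketch of the GMM decision rule on this slide, assuming diagonal covariances and working in the log domain. The two toy models, their parameters, and the names `identify` and `gmm_log_likelihood` are hypothetical, chosen only for illustration.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def gmm_log_likelihood(frames, gmm):
    """Sum over frames i of log sum_k c_k N(x_i; mu_k, var_k)."""
    total = 0.0
    for x in frames:
        total += logsumexp([math.log(c) + log_gauss_diag(x, m, v)
                            for c, m, v in gmm])
    return total

def identify(frames, models, priors):
    """Return the arg max over languages of log P(l) + log P(x | l)."""
    scores = {lang: math.log(priors[lang]) + gmm_log_likelihood(frames, gmm)
              for lang, gmm in models.items()}
    return max(scores, key=scores.get), scores

# Toy models: each GMM is a list of (weight, mean, variance) components.
models = {
    "english": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "french": [(0.5, [4.0, 4.0], [1.0, 1.0]), (0.5, [5.0, 5.0], [1.0, 1.0])],
}
priors = {"english": 0.5, "french": 0.5}
frames = [[0.2, -0.1], [0.9, 1.2], [0.4, 0.6]]
best, scores = identify(frames, models, priors)
```

The product over frames becomes a sum of logs, which avoids underflow on long utterances.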

4 A Digression: What Features? MFCCs, PLPs, stacked deltas. Image from Approaches to Language Identification, Torres-Carrasquillo et al.

5 Stacked Deltas: What Are They? Look at cepstral differences between frames separated by 2d frames. Move P frames into the future and do it again. Do this k times. Concatenate all the deltas together.
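The recipe above can be sketched as follows. This assumes the delta at each step is c(t + iP + d) - c(t + iP - d), which is one common way to take differences across 2d frames; the function name and the toy cepstra are hypothetical.

```python
def stacked_deltas(cepstra, t, d, P, k):
    """Return the k stacked delta blocks for frame t as one flat list."""
    last = t + (k - 1) * P + d
    if t - d < 0 or last >= len(cepstra):
        raise ValueError("not enough context around frame t")
    features = []
    for i in range(k):
        center = t + i * P  # move P frames into the future each step
        later, earlier = cepstra[center + d], cepstra[center - d]
        features.extend(a - b for a, b in zip(later, earlier))
    return features

# Toy 2-dimensional "cepstra" that grow linearly with the frame index.
cepstra = [[float(n), 2.0 * n] for n in range(30)]
sdc = stacked_deltas(cepstra, t=1, d=1, P=3, k=7)
```

With N cepstral coefficients per frame, the stacked vector has N * k entries.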

6 Stacked Deltas Seem to Make a Difference (!) From Burget et al, Discriminative Training Techniques for Acoustic Language ID

7 Polishing the Features: LDA. Linear Discriminant Analysis (LDA) is a dimensionality reduction method [Kumar & Andreou 98]: $y = \theta^T x$, with $y \in \mathbb{R}^p$, $x \in \mathbb{R}^n$, $p < n$. $\theta$ is estimated to maximize the log-likelihood of the data $\{x_i\}$ under the assumption that the class distributions are Gaussians with different means and the same variance. The rejected n-p dimensions are modeled with a single shared Gaussian. Ghinwa Choueiter, Internship Presentation

8 Visualizing LDA Images from

9 Polishing the Features: HLDA. Heteroscedastic LDA (HLDA) is a generalization of LDA [Kumar & Andreou 98]: $y = \theta^T x$, with $y \in \mathbb{R}^p$, $x \in \mathbb{R}^n$, $p < n$. $\theta$ is estimated to maximize the log-likelihood of the data $\{x_i\}$ under the assumption that the class distributions are Gaussians with different means and different variances. The rejected n-p dimensions are modeled with a single shared Gaussian. Ghinwa Choueiter, Internship Presentation

10 HLDA vs LDA From Saon et al., ICASSP 2000

11 HLDA: Results From: Choueiter & Zweig, ICASSP 2008

12 End of Features Digression Now, how can we improve the GMMs themselves?

13 MMI: Further Polishing the GMMs. A discriminative training approach. It avoids putting one class's Gaussians on top of another's. Instead it tries to maximize the information between features and class labels, modulo the model parameters. Full discussion in last lecture.

14 Maximum Mutual Information Training. Given two random variables X and Y: how much does knowing X tell us about Y?

$$MI(X;Y) = \sum_{xval} \sum_{yval} P(X = xval, Y = yval) \log \frac{P(X = xval, Y = yval)}{P(X = xval)\,P(Y = yval)}$$

15 Two Cases to Consider.

$$MI(X;Y) = \sum_x \sum_y P(X = x, Y = y) \log \frac{P(X = x, Y = y)}{P(X = x)\,P(Y = y)}$$

X and Y are independent:

$$MI(X;Y) = \sum_x \sum_y P(X = x)\,P(Y = y) \log \frac{P(X = x)\,P(Y = y)}{P(X = x)\,P(Y = y)} = 0$$

X determines Y:

$$MI(X;Y) = H(Y)$$
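Both cases can be checked numerically. The sketch below computes mutual information from a joint probability table, with rows indexing values of X and columns values of Y; the two 2x2 tables are toy examples.

```python
import math

def mutual_information(joint):
    """MI(X;Y) in nats from a joint table joint[x][y] = P(X=x, Y=y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # terms with zero probability contribute nothing
                mi += p * math.log(p / (px[i] * py[j]))
    return mi

def entropy(dist):
    return -sum(p * math.log(p) for p in dist if p > 0)

independent = [[0.25, 0.25], [0.25, 0.25]]   # P(x, y) = P(x) P(y)
deterministic = [[0.5, 0.0], [0.0, 0.5]]     # X determines Y

mi_independent = mutual_information(independent)
mi_deterministic = mutual_information(deterministic)
h_y = entropy([0.5, 0.5])
```

Here `mi_independent` is zero and `mi_deterministic` equals `h_y`, matching the two cases on the slide.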

16 Language and Acoustic Sequences.

$$MI(L;A) = \sum_l \sum_a P(l,a) \log \frac{P(l,a)}{P(l)\,P(a)}$$

D is training data. Approximate by:

$$MI(L;A) \approx \sum_D \log \frac{P(l,a)}{P(l)\,P(a)} = \sum_D \log \frac{P(l)\,P(a \mid l)}{P(l)\,P(a)} = \sum_D \log \frac{P(a \mid l)}{P(a)} = \sum_D \log \frac{P(a \mid l)}{\sum_{r \in languages} P(a \mid r)\,P(r)}$$

Key idea: P(*) is a parametric probability distribution. Estimate the parameters of P to maximize mutual information [Bahl et al. 1986].
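A small sketch of the approximated objective: for each training utterance, the log-likelihood under the correct language minus the log of the prior-weighted sum over all languages. The log-likelihood values and the name `mmi_objective` are hypothetical; in practice the scores would come from the language GMMs.

```python
import math

def logsumexp(values):
    values = list(values)
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def mmi_objective(utterances, log_priors):
    """utterances: list of (true_language, {language: log P(a | language)})."""
    total = 0.0
    for true_lang, log_likes in utterances:
        # log P(a) = log sum_r P(a | r) P(r)
        log_p_a = logsumexp(log_likes[r] + log_priors[r] for r in log_likes)
        total += log_likes[true_lang] - log_p_a
    return total

log_priors = {"en": math.log(0.5), "fr": math.log(0.5)}

# Models that cannot tell the languages apart versus models that can.
confusable = [("en", {"en": -5.0, "fr": -5.0}), ("fr", {"en": -7.0, "fr": -7.0})]
separated = [("en", {"en": -5.0, "fr": -14.0}), ("fr", {"en": -16.0, "fr": -7.0})]

flat = mmi_objective(confusable, log_priors)
sharp = mmi_objective(separated, log_priors)
```

The objective is zero when the models give no information about the language and grows as the correct model pulls ahead of the competitors, which is what MMI training pushes toward.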

17 MMI Training for LID. Compute the sufficient statistics for the GMM of a language using only data from that language ("numerator statistics"). Do the same, but allow all data to match all GMMs ("denominator statistics"). Update the means and variances, using differences between these quantities. (Figure: data samples and GMMs.)

18 MMI Updates. For HMMs, the mean and variance update equations are given in [Povey & Woodland 2000].

19 MMI Results From Choueiter & Zweig, ICASSP 2008

20 Phone Recognition + Language Model (PRLM). "p ih n s": probably English. "k r p s t": probably Czech. Simple HMMs: 5/14. Language Models: 4/30. After Zissman 1996
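A toy sketch of the language-model half of PRLM: train one phone bigram model per language on decoded phone strings, then pick the language whose model scores a new string highest. Add-one smoothing and the tiny training sets are assumptions made for illustration; a real system would use a phone recognizer's output and a better-smoothed LM.

```python
import math
from collections import Counter

def train_bigram(sequences, vocab_size):
    """Return a function giving the add-one smoothed log probability of a phone string."""
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        padded = ["<s>"] + seq
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1

    def log_prob(seq):
        padded = ["<s>"] + seq
        return sum(math.log((pair_counts[(p, c)] + 1) /
                            (context_counts[p] + vocab_size))
                   for p, c in zip(padded, padded[1:]))
    return log_prob

# Hypothetical decoded phone strings for each language.
training = {
    "english": ["p ih n s".split(), "s ih t".split(), "t ih n".split()],
    "czech": ["k r p s t".split(), "s t r k".split(), "p r s t".split()],
}
vocab = {ph for seqs in training.values() for seq in seqs for ph in seq}
models = {lang: train_bigram(seqs, len(vocab)) for lang, seqs in training.items()}

def classify(phones):
    return max(models, key=lambda lang: models[lang](phones))

guess_a = classify("p ih n".split())
guess_b = classify("k r p".split())
```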

21 But why use English phones? Parallel PRLM (PPRLM): the same method, applied multiple times. After Zissman 1996

22 What to Do the Classification With? Averaging was used initially. What if we also want to use acoustic scores or other information sources? Why not a maximum entropy model? A decision tree? An SVM? It is not clear what is best for combining both LM and AM features. (Something a project could explore.)

23 But why use Phones at all? Gaussian Tokenizer. A tokenizer is a GMM that generates sequences of indices (one index per time frame). Each index corresponds to the mixture component with the highest likelihood. The tokenizer is used to generate index sequences for each language. The index sequences are used to train index LMs for each language.
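A minimal sketch of Gaussian tokenization, assuming diagonal covariances and assuming that "highest likelihood" means the highest weighted component likelihood c_k N(x; mu_k, var_k). The three-component toy GMM is hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def tokenize(frames, gmm):
    """Map each frame to the index of its best-scoring mixture component."""
    indices = []
    for x in frames:
        scores = [math.log(c) + log_gauss_diag(x, m, v) for c, m, v in gmm]
        indices.append(max(range(len(scores)), key=scores.__getitem__))
    return indices

# Toy 1-dimensional tokenizer with components centered at 0, 5 and 10.
gmm = [(1 / 3, [0.0], [1.0]), (1 / 3, [5.0], [1.0]), (1 / 3, [10.0], [1.0])]
frames = [[0.1], [4.8], [9.7], [5.2]]
tokens = tokenize(frames, gmm)
```

The resulting index sequence plays the role that the phone string plays in PRLM, so the same n-gram LM machinery can be trained on it.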

24 Gaussian Tokenizer. (Figure labels: language GMM; lang1 utt; lang2 utt.) Adapted from Choueiter Internship presentation

25 Training with a Gaussian Tokenizer. Language-independent tokenizer: the AR, BP and VI utterances pass through the tokenizer G to produce the AR, BP and VI LM training data. Adapted from Choueiter Internship presentation

26 Testing with a Gaussian Tokenizer. Language-independent tokenizer: test utterances pass through G to give an index sequence, which is scored by the AR, BP and VI LMs; pick the max. Adapted from Choueiter Internship presentation

27 Parallel Gaussian Tokenization. Analogous to Parallel Phone Recognition + Language Modeling (PPRLM), but now using Gaussian tokenization rather than phonetic tokenization. Note the difference in scale between Gaussian tokenization (100/second) and phonetic tokenization (20/second). Torres-Carrasquillo et al., 2002 report no change from smoothing, compressing, etc.

28 A Digression: Information Retrieval. A classical problem is to retrieve documents that are relevant to a query. We know (kind of) how to do this. What if we pretend that the training data we have for a language is a document? And we pretend that some sample data we want to classify is a query? Can we look up the document (language) that the query is most related to? Let's look at a common IR method: vector-space lookup with TF-IDF weightings.

29 Term-document count matrices. Consider the number of occurrences of a term in a document. Each document is a count vector in $\mathbb{N}^{v}$: a column of the table. Columns (documents): Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth. Rows (terms): Antony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser. Adapted from Manning & Raghavan, CS276

30 IDF: How important is a Word? A word that occurs everywhere (like "the") doesn't tell you much.

$$IDF_t = \log \frac{|D|}{|\{d \in D : t \in d\}|}$$

Weight a term t by the log of the inverse of the fraction of documents that have it. IDF values: Antony 1.1, Brutus 0.69, Caesar 0.18, Calpurnia 1.8, Cleopatra 1.8, mercy 0.18, worser 0.4.

31 TF: What words are in a document? We want more frequent words in a document to count more, but we also want to normalize by total document size. So use term frequency. Table columns (documents): Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth. Rows (terms): Antony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser.

32 The TF-IDF Document Vectors. The value of a word in a document is now TF*IDF. Each document is a vector. Table columns (documents): Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth. Rows (terms): Antony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser.

33 A Query: "Cleopatra Calpurnia". Represent this as a document too, and find the document(s) in the collection that are closest. But what does "closest" mean? Table columns: TF, IDF, TF-IDF. Rows (terms): Antony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser.

34 Cosine Distance. Return the document with the smallest angle from the query, or equivalently the maximum cosine of the angle. Adapted from Manning & Raghavan, CS276

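35 cosine(query, document).

$$\cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|} = \frac{\sum_{i=1}^{|V|} q_i d_i}{\sqrt{\sum_{i=1}^{|V|} q_i^2}\ \sqrt{\sum_{i=1}^{|V|} d_i^2}}$$

The numerator is the dot product; the middle form is the dot product of unit vectors. $q_i$ is the tf-idf weight of term i in the query. $d_i$ is the tf-idf weight of term i in the document. cos(q,d) is the cosine similarity of q and d or, equivalently, the cosine of the angle between q and d. Adapted from Manning & Raghavan, CS276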
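The whole lookup can be sketched in a few lines: build TF-IDF vectors for the documents, vectorize the query the same way, and return the document with the largest cosine. The term counts below are made up for illustration (they are not the counts from the slides), and term frequency is taken as count divided by document length.

```python
import math

def build_vectorizer(docs):
    """docs maps a document name to a {term: count} dict."""
    vocab = sorted({term for counts in docs.values() for term in counts})
    n_docs = len(docs)
    idf = {}
    for term in vocab:
        doc_freq = sum(1 for counts in docs.values() if counts.get(term, 0) > 0)
        idf[term] = math.log(n_docs / doc_freq)

    def vectorize(counts):
        total = sum(counts.values())  # normalize by document size
        return [counts.get(term, 0) / total * idf[term] for term in vocab]

    return vectorize

def cosine(q, d):
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)

# Hypothetical term counts for three "documents".
docs = {
    "Antony and Cleopatra": {"antony": 5, "cleopatra": 4, "caesar": 2, "mercy": 1},
    "Julius Caesar": {"caesar": 6, "brutus": 5, "calpurnia": 1, "antony": 2},
    "Hamlet": {"mercy": 2, "caesar": 1, "worser": 1},
}
vectorize = build_vectorizer(docs)
query = vectorize({"cleopatra": 1, "calpurnia": 1})
scores = {name: cosine(query, vectorize(counts)) for name, counts in docs.items()}
best = max(scores, key=scores.get)
```

For language ID, each "document" would be the pooled training data of one language and the query would be the test utterance.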

36 A More Sophisticated Approach. Li, Ma & Lee: A Vector Space Modeling Approach to Spoken Language Identification. Do phone decoding (use a universal phone set). Extract phone n-gram features. Weight them with something similar to TF-IDF. Now each utterance is a vector, labeled with the language in training. Build a classifier. Tested SVMs and ANNs.

37 Wouldn't All Phones Be Everywhere? Maybe. But we can use phone n-grams as features. Those might be more informative! Raw phones: "k r ae l p eh ih t p k ao" and "p t ao k ih eh p l k r ae". Phone 2-grams: "kr rae ael lp peh ehih iht tp pk kao" and "pt tao aok kih iheh ehp pl lk kr rae".
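Extracting phone n-grams from a decoded phone string takes one line. The sketch below reproduces the 2-grams listed on this slide; the function name is hypothetical, and the symbols are joined without a separator only to match the slide's notation.

```python
def phone_ngrams(phones, n):
    """Return all length-n windows of the phone sequence as joined strings."""
    return ["".join(phones[i:i + n]) for i in range(len(phones) - n + 1)]

first = phone_ngrams("k r ae l p eh ih t p k ao".split(), 2)
second = phone_ngrams("p t ao k ih eh p l k r ae".split(), 2)
```

Counting these n-grams per utterance gives the term counts that the TF-IDF style weighting is applied to.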

38 Vector Space Model Results. Unigram, bigram and trigram phone features. Various numbers of universal phone units. From Li et al., A Vector Space Modeling Approach

39 Break Next Up: Speaker Identification

40 Flavors of Speaker Recognition From JHU 2002 SuperSID Final Presentation Reynolds et al.

41 Information Sources for Speaker ID From JHU 2002 SuperSID Final Presentation Reynolds et al.

42 Approaches: GMMs; GMMs with a Universal Background Model; SVMs and supervectors; MLLR matrices; high-level information.

43 The GMM Approach: Identification.

$$\hat{s} = \arg\max_s P(s \mid x) = \arg\max_s P(s)\,P(x \mid s) = \arg\max_s P(s) \prod_i \sum_k c_{ks}\,\mathcal{N}(x_i;\ \mu_k^s, \Sigma_k^s)$$

44 Verification and Universal Background Models. The emphasis in NIST evaluations is on verification. Rather than evaluating all models, just use the supposed speaker's model and a background model. The background model is also used to estimate speaker models via adaptation.

45 Creating UBMs From Reynolds et al., Speaker Verification using Adapted GMMs

46 MAP Adapting to a Speaker From Reynolds et al., Speaker Verification using Adapted GMMs

47 MAP Equations. MAP adaptation smoothly interpolates between generic and speaker-specific parameters. From Reynolds et al., Speaker Verification using Adapted GMMs
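The interpolation can be illustrated with a mean-only sketch in the style of Reynolds et al.: each component's new mean is alpha * (data mean) + (1 - alpha) * (UBM mean), with alpha = n / (n + r), where n is the soft count of frames assigned to that component. The relevance factor r = 16, the diagonal covariances, and the toy UBM are assumptions for illustration; the slide's own equations are not reproduced here.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def map_adapt_means(frames, ubm, relevance=16.0):
    """Return MAP-adapted means for a UBM given as (weight, mean, var) tuples."""
    n_comp, dim = len(ubm), len(frames[0])
    counts = [0.0] * n_comp
    sums = [[0.0] * dim for _ in range(n_comp)]
    for x in frames:
        logs = [math.log(c) + log_gauss_diag(x, m, v) for c, m, v in ubm]
        top = max(logs)
        weights = [math.exp(v - top) for v in logs]
        norm = sum(weights)
        for k in range(n_comp):
            gamma = weights[k] / norm  # posterior of component k for this frame
            counts[k] += gamma
            for j in range(dim):
                sums[k][j] += gamma * x[j]
    adapted = []
    for k, (_, mean, _) in enumerate(ubm):
        alpha = counts[k] / (counts[k] + relevance)
        data_mean = [s / counts[k] for s in sums[k]] if counts[k] > 0 else mean
        adapted.append([alpha * e + (1 - alpha) * m
                        for e, m in zip(data_mean, mean)])
    return adapted

# Toy 1-dimensional UBM and 16 speaker frames that all sit at 1.0.
ubm = [(0.5, [0.0], [1.0]), (0.5, [10.0], [1.0])]
frames = [[1.0]] * 16
adapted = map_adapt_means(frames, ubm)
```

The component that saw 16 frames moves halfway toward the data (alpha = 0.5), while the component that saw essentially no data stays at its UBM mean.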

48 Super Vectors and SVMs. Interesting recent work on using support vector machines. Builds on the classical GMM approach. See, e.g., Campbell et al., SVM Based Speaker Verification Using a GMM Supervector Kernel. Start from an adapted GMM representing a speaker. Concatenate the means: this is the feature vector for the SVM, labeled with the speaker.

49 Pictorial Idea of SVMs (1). Map the input into a high-dimensional space. With enough dimensions, the classes are likely to be separable with a hyperplane. From

50 Pictorial Idea of SVMs (2). In this high-dimensional space, the best hyperplane is the one with the maximum margin. From K.K. Chin, Support Vector Machines applied to Speech Pattern Classification

51 Pictorial Idea of SVMs (3) Misclassifications allowed but penalized. (Soft margin)

52 SVM Classification.

$$f(x) = \sum_{i=1}^{L} \alpha_i t_i K(x, x_i) + d, \qquad \sum_{i=1}^{L} \alpha_i t_i = 0 \ \text{and}\ \alpha_i > 0$$

$x_i$ is a support vector or labeled example. $t_i$ is its class: +1 or -1. K is a kernel function. Key idea: a dot product in a high-dimensional space can (sometimes) be implicitly represented as a normal function of two points in the original space. The $\alpha$'s are learned to maximize the soft margin.
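A small sketch of evaluating the decision function once the weights are known. The support vectors, weights and bias below are a hand-built toy solution with a linear kernel, not the output of any training run; they are chosen so that the weighted labels sum to zero and each support vector sits exactly on the margin.

```python
def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def svm_score(x, support_vectors, labels, alphas, bias, kernel=linear_kernel):
    """f(x) = sum_i alpha_i t_i K(x, x_i) + d."""
    return sum(a * t * kernel(x, sv)
               for a, t, sv in zip(alphas, labels, support_vectors)) + bias

support_vectors = [[1.0, 1.0], [-1.0, -1.0]]
labels = [+1, -1]
alphas = [0.25, 0.25]
bias = 0.0

score_pos = svm_score([2.0, 2.0], support_vectors, labels, alphas, bias)
score_neg = svm_score([-1.0, -1.0], support_vectors, labels, alphas, bias)
```

The sign of the score gives the class; swapping `linear_kernel` for another kernel function changes the implicit feature space without changing the rest of the code.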

53 The Kernel Gaussian weight From Campbell et al., SVM Based Speaker Verification

54 Using SVMs. Training: Create supervectors for background (non-target) speakers (label them "b"). Create supervectors for each example of a particular speaker, s (label them "s"). Train an SVM that differentiates s from b. Do this for each speaker. Test (verification): Someone claims to be s. Make a supervector from the sample. Run the s vs. b SVM and see what it says.

55 Using Speaker Adaptation. We have been using MFCCs. These are good for speech recognition, where we want to remove speaker variability. Why should they be good for speaker ID, where we want to enhance and recognize speaker variability? What captures speaker-specific information? Parameters of a speaker-adaptation scheme! E.g. MLLR. First tested by Stolcke et al., MLLR Transforms as Features in Speaker Recognition (2003)

56 Recall MLLR.

$$\mu' = A\,[1\ \ \mu^T]^T$$

The new mean is a linear transformation of the old one. An offset is added to the old mean as well. The transformation matrix is chosen to maximize the likelihood of the adaptation data under the transformed model. One transformation (e.g. 39x39) is shared by many Gaussians (e.g. 1000s). See, e.g., Leggetter & Woodland, Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Similar transforms are possible for the covariance matrix.
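A sketch of applying one shared MLLR mean transform to several Gaussian means. It assumes the offset is stored as the first column of the matrix, so the matrix has n rows and n + 1 columns and multiplies the extended mean [1, mu]; the matrix values and means are toy numbers. Estimating the matrix by maximum likelihood is not shown.

```python
def mllr_transform(mean, A):
    """Apply mu' = A [1, mu^T]^T; A has n rows and n + 1 columns."""
    extended = [1.0] + list(mean)
    return [sum(a * e for a, e in zip(row, extended)) for row in A]

# Toy 2-dimensional transform: first column is the offset.
A = [[0.5, 1.0, 0.0],
     [-1.0, 0.0, 2.0]]

# The same transform is shared by every Gaussian mean in the class.
means = [[0.0, 0.0], [1.0, 2.0], [3.0, -1.0]]
adapted = [mllr_transform(m, A) for m in means]
```

Because one small matrix moves many means at once, it can be estimated from little adaptation data.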

57 MLLR Picture. Old data is modeled by some Gaussians. For new data, the new means are a linear transform of the old ones.

58 MLLR for Speaker Recognition. Compute the MLLR matrix. In practice several matrices may be used for different phonetic classes. Concatenate all the numbers into a giant feature vector. Label it with the speaker. Train an SVM classifier. Stolcke et al. used the straight dot-product kernel.

59 MLLR for SID Results From Stolcke et al., MLLR Transforms as Features in Speaker Recognition

60 Moving up the Food Chain From JHU 2002 SuperSID Final Presentation Reynolds et al.

61 Pitch Pitch and pitch tracks should vary from person to person Image from

62 Prosody. Rate of speech: number of words per second; number of phones per second. Number of long pauses. Number of short pauses. Average voiced segment lengths. Average unvoiced segment lengths.

63 Phone N-grams. Analogous to Language ID, except the "languages" are speakers. Can also train a global background phone n-gram model. After Zissman 1996

64 Personal Word Usage From JHU 2002 SuperSID Final Presentation Reynolds et al.

65 Concluding Remarks. Numerous techniques for LID and SID. Sometimes ad hoc, e.g. PPRLM. Sometimes principled, e.g. MLLR in SID. Often requires fusing multiple methods, e.g. combining prosody and idiolect with GMM scores in SID. Is there room for a more unified theory?

66 Project Discussions. Who is set already? Please see me. Who is interested in: formant project; language ID; language modeling; text-to-speech; speaker ID; HMM / speech recognition; speech coding? => Please discuss.
