EECS E6870: Lecture 6: Pronunciation Modeling and Decision Trees


1 EECS E6870: Lecture 6: Pronunciation Modeling and Decision Trees. Stanley F. Chen, Michael A. Picheny and Bhuvana Ramabhadran. IBM T. J. Watson Research Center, Yorktown Heights, NY. October 3, 2009. EECS E6870: Speech Recognition

2 In the beginning was the whole-word model. For each word in the vocabulary, decide on a topology. Often the number of states in the model is chosen to be proportional to the number of phonemes in the word. Train the observation and transition parameters for a given word using examples of that word in the training data (recall Problem 3 associated with Hidden Markov Models). Good domain for this approach: digits.

3 Example topologies: Digits. Vocabulary consists of { zero, oh, one, two, three, four, five, six, seven, eight, nine }. Assume we assign two states per phoneme. Must allow for different durations: use self-loops and skip arcs. Models look like: [HMM topology diagrams for "zero" and "oh"]

4 [HMM topology diagrams for "one", "two", "three", "four", "five"]

5 [HMM topology diagrams for "six", "seven", "eight", "nine"]

6 How to represent any sequence of digits?

7 [Diagram: the digit-word models connected in a loop, representing any sequence of digits]

8 Trellis Representation. [Diagram: model states on the vertical axis, time on the horizontal axis]

9 Whole-word model limitations. The whole-word model suffers from two main problems: 1. It cannot model unseen words. In fact, we need several samples of each word to train the models properly; it cannot share data among models (the data sparseness problem). 2. The number of parameters in the system is proportional to the vocabulary size. Thus, whole-word models are best on small-vocabulary tasks.

10 Subword Units. To reduce the number of parameters, we can compose word models from sub-word units. These units can be shared among words. Examples include (unit: approximate number): Phones: 50; Diphones: 2,000; Triphones: 10,000; Syllables: 5,000. Each unit is small. The number of parameters is proportional to the number of units (not the number of words in the vocabulary as in whole-word models).

11 Phonetic Models. We represent each word as a sequence of phonemes. This representation is the baseform for the word, e.g. BANDS: B AE N D Z. Some words need more than one baseform, e.g. THE: DH UH and DH IY.

12 Baseform Dictionary. To determine the pronunciation of each word, we look it up in a dictionary. Each word may have several possible pronunciations. Every word in our training script and test vocabulary must be in the dictionary. The dictionary is generally written by hand, and is prone to errors and inconsistencies. Annotations on the excerpt below point out: the 2nd pronunciation looks wrong; an AA K .. variant is missing; shouldn't the choices for the 1st phoneme be the same for these words? Don't these words all start the same?

acapulco: AE K AX P AH L K OW
acapulco: AE K AX P UH K OW
accelerator: AX K S EH L AX R EY DX ER
accelerator: IX K S EH L AX R EY DX ER
acceleration: AX K S EH L AX R EY SH IX N
acceleration: AE K S EH L AX R EY SH IX N
accent: AE K S EH N T
accept: AX K S EH P T
acceptable: AX K S EH P T AX B AX L
access: AE K S EH S
accessory: AX K S EH S AX R IY
accessory: EH K S EH S AX R IY

13 Phonetic Models, cont'd. We can allow for phonological variation by representing baseforms as graphs. For example, the two baseforms of acapulco, AE K AX P AH L K OW and AA K AX P UH K OW, become a single graph with AE or AA as the first phone and AH L or UH in the middle.

14 Phonetic Models, cont'd. Now, construct a Markov model for each phone. Examples: [diagrams of phone Markov model topologies of varying length]

15 Embedding. Replace each phone by its Markov model to get a word model. N.B. the model for each phone will have different parameter values. [Diagram: the graph AE/AA K AX P AH-L/UH K OW with each phone arc replaced by its phone model]

16 Reducing Parameters by Tying. Consider the three-state model with transitions t1, t2, t3, t4, t5, t6. Note that t1 and t2 correspond to the beginning of the phone, t3 and t4 correspond to the middle of the phone, and t5 and t6 correspond to the end of the phone. If we force the output distributions for each member of those pairs to be the same, then the training data requirements are reduced.

17 Tying. A set of arcs in a Markov model are tied to one another if they are constrained to have identical output distributions. Similarly, states are tied if they have identical transition probabilities. Tying can be explicit or implicit.

18 Implicit Tying. Occurs when we build up models for larger units from models of smaller units. Example: when word models are made from phone models. First, consider an example without any tying. Let the vocabulary consist of the digits 0, 1, ..., 9. We can make a separate model for each word. To estimate parameters for each word model, we need several samples for each word. Samples of 0 affect only the parameters of the 0 model.

19 Implicit Tying, cont'd. Now consider phone-based models for this vocabulary:

0 Z IY R OW
1 W AA N
2 T UW
3 TH R IY
4 F AO R
5 F AY V
6 S IH K S
7 S EH V AX N
8 EY T
9 N AY N

Training samples of 0 will now also affect the models for 3 and 4, which share phones with 0. This is useful in large vocabulary systems, where the number of words is much greater than the number of phones.

20 Explicit Tying. Example: [diagram: a phone model with arcs labeled b, b, m, m, e, e]. There are 6 non-null arcs, but only 3 different output distributions because of tying. The number of model parameters is reduced. Tying saves storage because only one copy of each distribution is saved. Fewer parameters mean less training data is needed.

21 Variations in realizations of phonemes. The broad units, phonemes, have variants known as allophones. Example: in English, p and pʰ (unaspirated and aspirated p). Exercise: put your hand in front of your mouth and pronounce "spin" and then "pin". Note that the p in "pin" has a puff of air, while the p in "spin" does not. Articulators have inertia; thus the pronunciation of a phoneme is influenced by surrounding phonemes. This is known as co-articulation. Example: consider k and g in different contexts. In "key" and "geese" the whole body of the tongue has to be pulled up to make the vowel, and the closure of the k moves forward compared to "caw" and "gauze". Phonemes have canonical articulator target positions that may or may not be reached in a particular utterance.

22 [Figure: articulation of "keep"]

23 [Figure: articulation of "coop"]

24 Context-dependent models. We can model phones in context. Two approaches: triphones and leaves. Both methods use clustering: triphones use bottom-up clustering, leaves use top-down clustering. Typical improvements of speech recognizers when introducing context dependence: 30%-50% fewer errors.

25 Tri-phone models. Model each phoneme in the context of its left and right neighbor. E.g., K-IY+P is a model for IY when K is its left-context phoneme and P is its right-context phoneme. If we have 50 phonemes in a language, we could have as many as 50³ triphones to model. Not all of these occur, but we still have data sparsity issues. We try to solve these issues by agglomerative clustering.

26 Agglomerative / Bottom-up Clustering. Start with each item in a cluster by itself. Find the closest pair of items, merge them into a single cluster, and iterate. Different results are obtained based on the distance measure used. Single-link: dist(A,B) = min dist(a,b) for a ∈ A, b ∈ B. Complete-link: dist(A,B) = max dist(a,b) for a ∈ A, b ∈ B.
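As a minimal sketch of the procedure just described, the following merges clusters of 2-D Euclidean points under either linkage; the function names (dist, linkage, agglomerate) are illustrative, not from the lecture.

```python
# Bottom-up clustering sketch with pluggable single/complete linkage.
from itertools import combinations

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def linkage(A, B, mode="single"):
    d = [dist(a, b) for a in A for b in B]
    return min(d) if mode == "single" else max(d)  # complete-link uses max

def agglomerate(points, k, mode="single"):
    clusters = [[p] for p in points]          # start: each item is its own cluster
    while len(clusters) > k:
        # find the closest pair of clusters under the chosen linkage
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]], mode))
        clusters[i] += clusters.pop(j)        # merge and iterate
    return clusters

print(agglomerate([(0, 0), (0, 1), (5, 5), (6, 5)], k=2, mode="complete"))
```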

27 Bottom-up clustering / Single Link. Assume our data points look like: [scatter plot]. Single-link: clusters are close if any of their points are, i.e. dist(A,B) = min dist(a,b) for a ∈ A, b ∈ B. [Figure: single-link clustering into two groups.] Single-link clustering tends to yield long, stringy, meandering clusters.

28 Bottom-up clustering / Complete Link. Again, assume our data points look like: [scatter plot]. Complete-link: clusters are close only if ALL of their points are, i.e. dist(A,B) = max dist(a,b) for a ∈ A, b ∈ B. [Figure: complete-link clustering into two groups.] Complete-link clustering tends to yield roundish clusters.

29 Dendrogram. A natural way to display clusters is through a dendrogram: it shows the clusters on the x-axis and the distance between clusters on the y-axis. It provides some guidance as to a good choice for the number of clusters. [Figure: dendrogram over points x1 ... x8 with a distance axis]

30 Triphone Clustering. We can use e.g. complete-link clustering to cluster triphones. This helps with the data sparsity issue, but we still have an issue with unseen data: to model unseen events, we need to back off to lower-order models such as bi-phones and uni-phones.

31 Allophones. Pronunciation variations of phonemes; less rigidly defined than triphones. In our speech recognition system, we typically have about 50 allophones per phoneme. Older techniques ("tri-phones") tried to enumerate by hand the set of allophones for each phoneme. We currently use automatic methods to identify the allophones of a phoneme. This boils down to conditional modeling.

32 Conditional Modeling. In speech recognition, we use conditional models and conditional statistics. Acoustic model: conditioned on the phone context. Spelling-to-sound rules: describe the pronunciation of a letter in terms of a probability distribution that depends on neighboring letters and their pronunciations; the distribution is conditioned on the letter context and pronunciation context. Language model: the probability distribution for the next word is conditioned on the preceding words. One way to define the set of conditions is a class of statistical models known as decision trees. Classic text: L. Breiman et al., Classification and Regression Trees. Wadsworth & Brooks, Monterey, California, 1984.

33 Decision Trees. We would like to find equivalence classes among our training samples. DTs categorize data into multiple disjoint categories. The purpose of a decision tree is to map conditions (such as phonetic contexts) into equivalence classes; the goal in constructing a decision tree is to create good equivalence classes. DTs are examples of divisive / top-down clustering. Each internal node specifies an attribute that an object must be tested on; each leaf node represents a classification.

34 What does a decision tree look like? [Figure: an example decision tree]

35 Types of Attributes/Features. Numerical: the domain is ordered and can be represented on the real line (e.g., age, income). Nominal or categorical: the domain is a finite set without any natural ordering (e.g., occupation, marital status, race). Ordinal: the domain is ordered, but absolute differences between values are unknown (e.g., preference scale, severity of an injury).

36 The Classification Problem. If the dependent variable is categorical, the problem is a classification problem. Let C be the class label of a given data point X = X1, ..., Xk, and let d(·) be the predicted class label. Define the misclassification rate of d as Prob(d(X1, ..., Xk) ≠ C). Problem definition: given a dataset, find the classifier d such that the misclassification rate is minimized.

37 The Regression Problem. If the dependent variable is numerical, the problem is a regression problem. The tree d maps observation X to a prediction Ŷ of Y and is called a regression function. Define the mean squared error rate of d as E[(Y − d(X1, ..., Xk))²]. Problem definition: given a dataset, find the regression function d such that the mean squared error is minimized.

38 Goals & Requirements. Goals: to produce an accurate classifier/regression function, and to understand the structure of the problem. Requirements on the model: high accuracy; understandable by humans, interpretable; fast construction for very large training databases.

39 Decision Trees: Letter-to-Sound Example. Let's say we want to build a tree to decide how the letter p will sound in various words. Training examples:

p: loophole, peanuts, pay, apple
f: physics, telephone, graph, photo
φ (silent): apple, psycho, pterodactyl, pneumonia

The pronunciation of p depends on its context. Task: using the above training data, partition the contexts into equivalence classes so as to minimize the uncertainty of the pronunciation.

40 Decision Trees: Letter-to-Sound Example, cont'd. Denote the letters to the left of the p by L1, L2 and the letters to the right by R1, R2.

R1 = h?
Yes: p: loophole; f: physics, telephone, graph, photo
No: p: peanut, pay, apple; φ: apple, psycho, pterodactyl, pneumonia

At this point we have two equivalence classes: 1. R1 = h, and 2. R1 ≠ h. The pronunciation of class 1 is either p or f, with f much more likely than p. The pronunciation of class 2 is either p or φ.

41 Splitting further:

R1 = h?
Yes → L1 = o?
  Yes → class 1: p: loophole
  No → class 2: f: physics, telephone, graph, photo
No → R1 = consonant?
  Yes → class 3: p: apple; φ: apple, psycho, pterodactyl, pneumonia
  No → class 4: p: peanut, pay

Four equivalence classes. Uncertainty remains only in class 3.

42 Splitting class 3 with L1 = a? separates the two p's of apple:

R1 = h?
Yes → L1 = o?
  Yes → class 1: p: loophole
  No → class 2: f: physics, telephone, graph, photo
No → R1 = consonant?
  Yes → L1 = a?
    Yes → class 3: p: apple
    No → class 4: φ: apple, psycho, pterodactyl, pneumonia
  No → class 5: p: peanut, pay

Five equivalence classes, which is much less than the number of letter contexts. No uncertainty is left in the classes. A node without children (an equivalence class) is called a leaf node; otherwise it is called an internal node.

43 Consider the test case "Paris". For its p, R1 = a: R1 = h? No. R1 = consonant? No. This leads to class 5 (peanut, pay), so the tree predicts p. Correct.

44 Consider the test case "gopher". For its p, L1 = o and R1 = h: R1 = h? Yes. L1 = o? Yes. This leads to class 1 (loophole), so the tree predicts p. Wrong: the ph in gopher sounds like f. Although effective on the training data, this tree does not generalize well. It was constructed from too little data.

45 Decision Tree Construction. 1. Find the best question for partitioning the data at a given node into two equivalence classes. 2. Repeat step 1 recursively on each child node. 3. Stop when there is insufficient data to continue or when the best question is not sufficiently helpful.

46 Basic Issues to Solve. The selection of the splits. The decision of when to declare a node terminal or to continue splitting. The assignment of each node to a class.

47 Decision Tree Construction: Fundamental Operation. There is only one fundamental operation in tree construction: find the best question for partitioning a subset of the data into two smaller subsets, i.e., take an equivalence class and split it into two more-specific classes.

48 Decision Tree Greediness. Tree construction proceeds from the top down, from root to leaf. Each split is intended to be locally optimal. Constructing a tree in this greedy fashion usually leads to a good tree, but probably not a globally optimal one. Finding the globally optimal tree is an NP-complete problem: it is not practical.

49 Splitting. Each internal node has an associated splitting question. Example questions: Age <= 20? (numeric); Profession in {student, teacher}? (categorical); 5000*Age + 3*Salary − 10000 > 0? (function of raw features).

50 Dynamic Questions. The best question to ask about some discrete variable x consists of the best subset of the values taken by x, so we search over all subsets of values taken by x at a given node. (This is generating questions on the fly during tree construction.) For x ∈ {a,b,c}: Q1: x ∈ {a}? Q2: x ∈ {b}? Q3: x ∈ {c}? Q4: x ∈ {a,b}? Q5: x ∈ {a,c}? Q6: x ∈ {b,c}? Q7: x ∈ {a,b,c}? Use the best question found. Potential problems: 1. Requires a lot of CPU: for alphabet size |A| there are 2^|A| questions. 2. Allows a lot of freedom, making it easy to overtrain.
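A small sketch of how such questions could be enumerated; the helper name is an assumption for illustration. Subsets of size 0 and size |A| are omitted, since they cannot split a node:

```python
# Enumerate the "dynamic" subset questions "x in S?" for a discrete variable.
from itertools import combinations

def all_subset_questions(alphabet):
    # every non-empty proper subset S of the alphabet defines a question
    items = sorted(alphabet)
    return [set(s) for r in range(1, len(items))
            for s in combinations(items, r)]

print(len(all_subset_questions({"a", "b", "c"})))  # 6 splitting questions for |A| = 3
```

For |A| = 3 this reproduces Q1-Q6 above; Q7 (the full set) sends every sample the same way and so cannot split the data.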

51 Pre-determined Questions. The easiest way to construct a decision tree is to create in advance a list of possible questions for each variable. Finding the best question at any given node consists of subjecting all relevant variables to each of the questions, and picking the best combination of variable and question. In acoustic modeling, we typically ask about 10 variables: the 5 phones to the left of the current phone and the 5 phones to the right of the current phone. Since these variables all span the same alphabet (the phone alphabet), only one list of questions is needed. Each question on this list consists of a subset of the phone alphabet.

52 Sample Questions.

Phones: {P} {T} {K} {B} {D} {G} {P,T,K} {B,D,G} {P,T,K,B,D,G}
Letters: {A} {E} {I} {O} {U} {Y} {A,E,I,O,U} {A,E,I,O,U,Y}

53 Discrete Questions. A decision tree has a question associated with every non-terminal node. If x is a discrete variable which takes on values in some finite alphabet A, then a question about x has the form x ∈ S?, where S is a subset of A. Let L1 denote the preceding letter in building a spelling-to-sound tree, and let S = {A,E,I,O,U}. Then L1 ∈ S? denotes the question: is the preceding letter a vowel? Let R1 denote the following phone in building an acoustic context tree, and let S = {P,T,K}. Then R1 ∈ S? denotes the question: is the following phone an unvoiced stop?

54 Continuous Questions. If x is a continuous variable which takes on real values, a question about x has the form x < θ?, where θ is some real value. In order to find the threshold θ, we must try values which separate all training samples. [Figure: candidate thresholds θ1, θ2, θ3 along the x axis.] We do not currently use continuous questions for speech recognition.
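As a sketch, thresholds midway between consecutive distinct sample values are enough to separate all training samples; the function name is an assumption for illustration.

```python
def candidate_thresholds(xs):
    # midpoints between consecutive sorted values separate every pair of samples
    xs = sorted(set(xs))
    return [(a + b) / 2 for a, b in zip(xs, xs[1:])]

print(candidate_thresholds([1.0, 3.0, 2.0, 2.0]))  # [1.5, 2.5]
```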

55 Types of Questions. In principle, a question asked in a decision tree can have any number (greater than 1) of possible outcomes. Examples: binary (Yes, No); 3 outcomes (Yes, No, Don't_know); 26 outcomes (A, B, C, ..., Z). In practice, only binary questions are used to build decision trees.

56 Simple Binary Question. A simple binary question consists of a single Boolean condition and no Boolean operators. X ∈ S1? is a simple question. ((X ∈ S1) && (X ∈ S2))? is not a simple question. Topologically, a simple question looks like: [diagram: a node x ∈ S? with Y and N branches]

57 Complex Binary Question. A complex binary question has precisely 2 outcomes (yes, no) but more than 1 Boolean condition and at least 1 Boolean operator. ((X ∈ S1) && (X ∈ S2))? is a complex question. Topologically this question can be drawn as a chain of simple questions, first x ∈ S1?, then x ∈ S2?, with outcomes: 1: (X ∈ S1) ∧ (X ∈ S2); 2: the complement, (X ∉ S1) ∨ (X ∉ S2). All complex questions can be represented as binary trees with terminal nodes tied to produce 2 outcomes.

58 Configurations Currently Used. All decision trees currently used in speech recognition use a pre-determined set of simple, binary questions on discrete variables.

59 Tree Construction Overview. Let x1 ... xn denote the n discrete variables whose values may be asked about, and let Qij denote the j-th pre-determined question for xi. Starting at the root, try splitting each node into two sub-nodes:

1. For each variable xi, evaluate the questions Qi1, Qi2, ... and let Qi* denote the best one.
2. Find the best pair (xi, Qi*) and denote it (x*, Q*).
3. If Q* is not sufficiently helpful, make the current node a leaf.
4. Otherwise, split the current node into two new sub-nodes according to the answer of question Q* on variable x*.

Stop when all nodes are either too small to split further or have been marked as leaves.
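A compact sketch of this loop, assuming data points are dicts mapping variables to values, questions are value subsets shared by all variables, and evaluate() returns the gain of a split; all names and the min_gain threshold are illustrative placeholders.

```python
def grow(data, variables, questions, evaluate, min_gain=0.5):
    # steps 1-2: best (variable, question) pair at this node
    pairs = [(x, q) for x in variables for q in questions]
    x, q = max(pairs, key=lambda xq: evaluate(data, *xq))
    yes = [d for d in data if d[x] in q]
    no  = [d for d in data if d[x] not in q]
    # step 3: declare a leaf if the best question is not sufficiently helpful
    if evaluate(data, x, q) < min_gain or not yes or not no:
        return {"leaf": data}
    # step 4: otherwise split and recurse on the two sub-nodes
    return {"question": (x, q),
            "yes": grow(yes, variables, questions, evaluate, min_gain),
            "no":  grow(no,  variables, questions, evaluate, min_gain)}
```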

60 Question Evaluation. The best question at a node is the question which maximizes the likelihood of the training data at that node after applying the question. Parent node: model the data as a single Gaussian N(µp, Σp) and compute the likelihood L(data | µp, Σp). Left child node: model as a single Gaussian N(µl, Σl) and compute L(data_l | µl, Σl). Right child node: model as a single Gaussian N(µr, Σr) and compute L(data_r | µr, Σr). Goal: find Q such that L(data_l | µl, Σl) × L(data_r | µr, Σr) is maximized.

61 Question Evaluation, cont'd. Let x1, x2, ..., xN be a sample of feature x, in which outcome ai occurs ci times. Let Q be a question which partitions this sample into left and right sub-samples of size nl and nr, respectively, and let ci^l, ci^r denote the frequency of ai in the left and right sub-samples. The best question Q for feature x is the one which maximizes the conditional likelihood of the sample given Q, or, equivalently, maximizes the conditional log likelihood of the sample.

62 Log likelihood computation. The log likelihood of the data, given that we ask question Q, is

$$\log L(x_1 .. x_N \mid Q) = \sum_i c_i^l \log p_i^l + \sum_i c_i^r \log p_i^r$$

Using the maximum likelihood estimates $\hat{p}_i^l = c_i^l / n_l$ and $\hat{p}_i^r = c_i^r / n_r$ gives

$$\log L(x_1 .. x_N \mid Q) = \sum_i c_i^l \log c_i^l + \sum_i c_i^r \log c_i^r - n_l \log n_l - n_r \log n_r$$

The best question is the one which maximizes this simple expression. Since $c_i^l$, $c_i^r$, $n_l$, and $n_r$ are all non-negative integers, the expression can be computed very efficiently using a pre-computed table of $n \log n$ for non-negative integers $n$.
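A sketch of this computation, with the n log n table pre-computed once; the table size is an arbitrary assumption.

```python
import math

NLOGN = [0.0] + [n * math.log2(n) for n in range(1, 100000)]

def split_log_likelihood(left_counts, right_counts):
    # sum_i c_i log c_i - n log n, summed over the two sides of the split
    def side(counts):
        return sum(NLOGN[c] for c in counts) - NLOGN[sum(counts)]
    return side(left_counts) + side(right_counts)

# counts (c_p, c_f, c_phi) for question A of the letter-to-sound example
print(round(split_log_likelihood([1, 4, 0], [3, 0, 4]), 2))  # -10.51
```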

63 Entropy. Entropy is a measure of uncertainty, measured in bits. Let x be a discrete random variable taking values a1 .. aN in an alphabet A of size N with probabilities p1 .. pN respectively. The uncertainty about what value x will take can be measured by the entropy of the probability distribution p = (p1 .. pN):

$$H = -\sum_{i=1}^{N} p_i \log_2 p_i$$

H ≥ 0, and H = 0 if and only if $p_j = 1$ for some j and $p_i = 0$ for all i ≠ j. Entropy is maximized when $p_i = 1/N$ for all i; then H = log2 N. Thus H tells us something about the sharpness of the distribution. H can be interpreted as the theoretical minimum average number of bits required to encode/transmit the distribution.

64 What does entropy look like for a binary variable? [Figure: the binary entropy function, 0 at p = 0 and p = 1, with a maximum of 1 bit at p = 0.5]
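In code, the definition above is a one-liner; evaluating it on a binary distribution reproduces the curve in the figure (a minimal sketch).

```python
import math

def entropy(ps):
    # H = -sum p_i log2 p_i, in bits; 0 log 0 is taken as 0
    return -sum(p * math.log2(p) for p in ps if p > 0)

for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(p, round(entropy([p, 1 - p]), 3))   # maximum of 1 bit at p = 0.5
```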

65 Entropy and Likelihood. Let x be a discrete random variable taking values a1 .. aN in an alphabet A of size N with probabilities p1 .. pN respectively, and let x1 .. xn be a sample of x in which ai occurs ci times, i = 1..N. The sample likelihood is

$$L = \prod_{i=1}^{N} p_i^{c_i}$$

The maximum likelihood estimate of $p_i$ is $\hat{p}_i = c_i / n$. Thus, an estimate of the sample log likelihood is

$$\log \hat{L} = n \sum_{i=1}^{N} \hat{p}_i \log \hat{p}_i = -n \hat{H}$$

Since n is a constant, and log is a monotonic function, H is monotonically related to L. Therefore, maximizing likelihood ⇔ minimizing entropy.

66 p tree, revisited.

p: peanut, pay, loophole, apple → c_p = 4
f: physics, photo, graph, telephone → c_f = 4
φ: apple, psycho, pterodactyl, pneumonia → c_φ = 4
n = 12

The log likelihood of the data at the root node is

$$\log L(x_1 .. x_{12}) = \sum_i c_i \log_2 c_i - n \log_2 n = 3 \cdot 4\log_2 4 - 12\log_2 12 = -19.0$$

The average entropy at the root node is

$$H(x_1 .. x_{12}) = -\tfrac{1}{n} \log L(x_1 .. x_{12}) = 19.0 / 12 = 1.58 \text{ bits}$$

67 p tree revisited: Question A. R1 = h?

Yes: p: loophole; f: physics, telephone, graph, photo → n_l = 5, c_p^l = 1, c_f^l = 4, c_φ^l = 0
No: p: peanut, pay, apple; φ: apple, psycho, pterodactyl, pneumonia → n_r = 7, c_p^r = 3, c_f^r = 0, c_φ^r = 4

68 p tree revisited: Question A, cont'd. The log likelihood of the data after applying question A is

$$\log L(x_1 .. x_{12} \mid Q_A) = 1\log_2 1 + 4\log_2 4 - 5\log_2 5 + 3\log_2 3 + 4\log_2 4 - 7\log_2 7 = -10.5$$

The average entropy of the data after applying question A is

$$H(x_1 .. x_{12} \mid Q_A) = 10.5 / 12 = 0.87 \text{ bits}$$

The increase in log likelihood due to question A is 19.0 − 10.5 = 8.5. The decrease in entropy due to question A is 1.58 − 0.87 = 0.71 bits. Knowing the answer to question A provides 0.71 bits of information about the pronunciation of p. A further 0.87 bits of information is still required to remove all the uncertainty about the pronunciation of p.

69 p tree revisited: Question B. L1 = φ (the p is word-initial)?

Yes: p: peanut, pay; f: physics, photo; φ: psycho, pterodactyl, pneumonia → n_l = 7, c_p^l = 2, c_f^l = 2, c_φ^l = 3
No: p: loophole, apple; f: telephone, graph; φ: apple → n_r = 5, c_p^r = 2, c_f^r = 2, c_φ^r = 1

70 p tree revisited: Question B, cont'd. The log likelihood of the data after applying question B is

$$\log L(x_1 .. x_{12} \mid Q_B) = 2\log_2 2 + 2\log_2 2 + 3\log_2 3 - 7\log_2 7 + 2\log_2 2 + 2\log_2 2 + 1\log_2 1 - 5\log_2 5 = -18.5$$

The average entropy of the data after applying question B is 18.5 / 12 = 1.54 bits. The increase in log likelihood due to question B is 19.0 − 18.5 = 0.5. The decrease in entropy due to question B is 1.58 − 1.54 = 0.04 bits. Knowing the answer to question B provides 0.04 bits of information (very little) about the pronunciation of p.

71 p tree revisited: Question C. L1 = vowel?

Yes: p: loophole, apple; f: telephone, graph → n_l = 4, c_p^l = 2, c_f^l = 2, c_φ^l = 0
No: p: peanut, pay; f: physics, photo; φ: apple, psycho, pterodactyl, pneumonia → n_r = 8, c_p^r = 2, c_f^r = 2, c_φ^r = 4

72 p tree revisited: Question C, cont'd. The log likelihood of the data after applying question C is

$$\log L(x_1 .. x_{12} \mid Q_C) = 2\log_2 2 + 2\log_2 2 - 4\log_2 4 + 2\log_2 2 + 2\log_2 2 + 4\log_2 4 - 8\log_2 8 = -16.0$$

The average entropy of the data after applying question C is 16.0 / 12 = 1.33 bits. The increase in log likelihood due to question C is 19.0 − 16.0 = 3.0. The decrease in entropy due to question C is 1.58 − 1.33 = 0.25 bits. Knowing the answer to question C provides 0.25 bits of information about the pronunciation of p.

73 Comparison of Questions A, B, C. Log likelihood of data given question: A: −10.5, B: −18.5, C: −16.0. Average entropy (bits) of data given question: A: 0.87, B: 1.54, C: 1.33. Gain in information (bits) due to question: A: 0.71, B: 0.04, C: 0.25. These measures all say the same thing: question A is best, question C is 2nd best, and question B is worst.
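The whole comparison can be reproduced from the split counts alone, using the relation H = −(1/n) log L from slide 65; a sketch (the last digit may differ from the slides due to rounding).

```python
import math

def avg_entropy(*sides):
    # -(1/n) * sum over sides of sum_i c_i log2(c_i / n_side)
    n = sum(sum(s) for s in sides)
    ll = sum(c * math.log2(c / sum(s)) for s in sides for c in s if c)
    return -ll / n

root = avg_entropy([4, 4, 4])                       # 1.58 bits at the root
for name, left, right in [("A", [1, 4, 0], [3, 0, 4]),
                          ("B", [2, 2, 3], [2, 2, 1]),
                          ("C", [2, 2, 0], [2, 2, 4])]:
    h = avg_entropy(left, right)
    print(name, round(h, 2), "gain", round(root - h, 2))
```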

74 Best Question. In general, we seek questions which maximize the likelihood of the training data given some model. The expression to be maximized depends on the model: in the previous example the model is a multinomial distribution. Another example: context-dependent prototypes use a continuous distribution.

75 Best Question: Context-Dependent Prototypes. For context-dependent prototypes, we use a decision tree for each arc in the Markov model. At each leaf of the tree is a continuous distribution which serves as a context-dependent model of the acoustic vectors associated with the arc. The distributions and the tree are created from acoustic feature vectors aligned with the arc. We grow the tree so as to maximize the likelihood of the training data (as always), but now the training data are real-valued vectors. We estimate the likelihood of the acoustic vectors during tree construction using a diagonal Gaussian model. When tree construction is complete, we replace the diagonal Gaussian models at the leaves by more accurate models to serve as the final prototypes; typically we use mixtures of diagonal Gaussians.

76 Diagonal Gaussian Likelihood. Let Y = y1 y2 ... yn be a sample of independent p-dimensional acoustic vectors arising from a diagonal Gaussian distribution with mean µ and variances σ². Then

$$\log L(Y \mid DG(\mu, \sigma^2)) = -\frac{1}{2} \sum_{i=1}^{n} \Big\{ p \log(2\pi) + \sum_{j=1}^{p} \Big[ \log \sigma_j^2 + \frac{(y_{ij} - \mu_j)^2}{\sigma_j^2} \Big] \Big\}$$

The maximum likelihood estimates of µ and σ² are

$$\hat{\mu}_j = \frac{1}{n} \sum_{i=1}^{n} y_{ij}, \qquad \hat{\sigma}_j^2 = \frac{1}{n} \sum_{i=1}^{n} y_{ij}^2 - \hat{\mu}_j^2, \qquad j = 1, \dots, p$$

Hence, an estimate of log L(Y) is

$$\log L(Y \mid DG(\hat{\mu}, \hat{\sigma}^2)) = -\frac{1}{2} \sum_{i=1}^{n} \Big\{ p \log(2\pi) + \sum_{j=1}^{p} \Big[ \log \hat{\sigma}_j^2 + \frac{(y_{ij} - \hat{\mu}_j)^2}{\hat{\sigma}_j^2} \Big] \Big\}$$

77 Diagonal Gaussian Likelihood, cont'd. Now,

$$\sum_{i=1}^{n} \frac{(y_{ij} - \hat{\mu}_j)^2}{\hat{\sigma}_j^2} = \frac{\sum_i y_{ij}^2 - 2\hat{\mu}_j \sum_i y_{ij} + n\hat{\mu}_j^2}{\hat{\sigma}_j^2} = \frac{n \hat{\sigma}_j^2}{\hat{\sigma}_j^2} = n, \qquad j = 1, \dots, p$$

Hence

$$\log L(Y \mid DG(\hat{\mu}, \hat{\sigma}^2)) = -\frac{1}{2} \Big\{ n\,p \log(2\pi) + n \sum_{j=1}^{p} \log \hat{\sigma}_j^2 + n\,p \Big\}$$

78 Diagonal Gaussian Splits. Let Q be a question which partitions Y into left and right sub-samples Y_l and Y_r, of size n_l and n_r. The best question is the one which maximizes log L(Y_l) + log L(Y_r). Using a diagonal Gaussian model,

$$\log L(Y_l \mid DG(\hat{\mu}_l, \hat{\sigma}_l^2)) + \log L(Y_r \mid DG(\hat{\mu}_r, \hat{\sigma}_r^2)) = -\frac{1}{2} \Big\{ n\,p \log(2\pi) + n\,p \Big\} - \frac{1}{2} \Big\{ n_l \sum_{j=1}^{p} \log \hat{\sigma}_{lj}^2 + n_r \sum_{j=1}^{p} \log \hat{\sigma}_{rj}^2 \Big\}$$

where the first term (with n = n_l + n_r) is common to all splits.

79 Diagonal Gaussian Splits, cont'd. Thus, the best question Q minimizes

$$D_Q = n_l \sum_{j=1}^{p} \log \hat{\sigma}_{lj}^2 + n_r \sum_{j=1}^{p} \log \hat{\sigma}_{rj}^2$$

where

$$\hat{\sigma}_{lj}^2 = \frac{1}{n_l} \sum_{y \in Y_l} y_j^2 - \Big( \frac{1}{n_l} \sum_{y \in Y_l} y_j \Big)^2, \qquad j = 1, \dots, p$$

and similarly for $\hat{\sigma}_{rj}^2$. D_Q involves little more than summing vector elements and their squares.
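A sketch of the D_Q computation in plain Python, keeping only the running sums the slide mentions; the function and variable names are assumptions for illustration.

```python
import math

def dq(Y_left, Y_right):
    # D_Q = n_l * sum_j log var_lj + n_r * sum_j log var_rj
    def term(Y):
        n, p = len(Y), len(Y[0])
        s  = [sum(y[j] for y in Y) for j in range(p)]        # sums of elements
        s2 = [sum(y[j] ** 2 for y in Y) for j in range(p)]   # sums of squares
        return n * sum(math.log(s2[j] / n - (s[j] / n) ** 2) for j in range(p))
    return term(Y_left) + term(Y_right)   # the best question minimizes this

print(dq([(0.0, 1.0), (0.2, 1.1)], [(5.0, 3.0), (5.5, 2.8)]))
```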

80 How Big a Tree? CART suggests cross-validation: measure performance on a held-out data set and choose the tree size that maximizes the likelihood of the held-out data. In practice, simple heuristics seem to work well. A decision tree is fully grown when no terminal node can be split. Reasons for not splitting a node include: 1. insufficient data for accurate question evaluation; 2. the best question was not very helpful / did not improve the likelihood significantly; 3. cannot cope with any more nodes due to CPU/memory limitations.

81 Instability. Decision trees have the undesirable property that a small change in the data can result in a large difference in the final tree. Consider a 2-class problem, w1 and w2, where the observations for each class are 2-dimensional. [Figure: scatter plot of the two classes.]

82 [Figure: the tree grown on the original data, with splits such as x1 < 0.3, x2 < 0.6, and x1 < 0.35, and the corresponding axis-parallel partition of the plane.]

83 [Figure: after one data point is moved slightly, the grown tree changes substantially, with splits such as x1 < 0.09 and x2 < 0.6 and a very different partition.]

84 Bagging. Create multiple models by training the same learner on different samples of the training data: given a training set of size n, create m different training sets of size n by sampling with replacement, and combine the m models using a simple majority vote. Bagging can be applied to any learning method, including decision trees. It decreases the generalization error by reducing variance in the results for unstable learners (i.e., when the hypothesis can change dramatically if the training data is slightly altered).
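A minimal sketch of the procedure, where train() builds one classifier from a dataset and returns a callable model; both names are placeholders, not from the lecture.

```python
import random
from collections import Counter

def bag(train, data, m=10):
    # m bootstrap samples of size n, drawn with replacement
    models = [train([random.choice(data) for _ in data]) for _ in range(m)]
    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]      # simple majority vote
    return predict
```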


86 Boosting. Another method for producing multiple models by repeatedly altering the data given to a learner. Examples are given weights, and at each iteration a new hypothesis is learned and the examples are reweighted to focus on those that the latest hypothesis got wrong. During testing, each of the hypotheses gets a weighted vote proportional to its training set accuracy.
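An AdaBoost-style sketch of this loop, assuming labels in {-1, +1} and a train() that accepts example weights; this is one standard formulation, not necessarily the variant the lecture has in mind.

```python
import math

def boost(train, xs, ys, rounds=10):
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        h = train(xs, ys, w)
        err = sum(wi for wi, x, y in zip(w, xs, ys) if h(x) != y)
        if err <= 0 or err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)   # vote grows with accuracy
        ensemble.append((alpha, h))
        # reweight to focus on the examples the latest hypothesis got wrong
        w = [wi * math.exp(alpha if h(x) != y else -alpha)
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```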

87 Strengths & Weaknesses of Decision Trees. Strengths: easy to generate, simple algorithm; relatively fast to construct; classification is very fast; can achieve good performance on many tasks. Weaknesses: not always sufficient to learn complex concepts; can be hard to interpret, since real problems can produce large trees; some problems with continuously valued attributes may not be easily discretized; data fragmentation.

88 Putting it all together. Given a word sequence, we can construct the corresponding Markov model by: 1. rewriting the word string as a sequence of phonemes; 2. concatenating the phonetic models; 3. using the appropriate tree for each arc to determine which allo-arc (leaf) is to be used in that context.

89 Example. "The rain in Spain falls ...". Look these words up in the dictionary to get: DH AX R EY N IX N S P EY N F AA L Z. Rewrite the phones as states according to the phonetic model: DH1 DH2 DH3 AX1 AX2 AX3 R1 R2 R3 EY1 EY2 EY3 ... Using the phonetic context, descend the decision tree for each state to find the leaf sequence: DH1_5 DH2_7 DH3_4 AX1_53 AX2_37 AX3_... R1_4... R2_46 ... Use the Gaussian mixture model for the appropriate leaf as the observation probabilities for each state in the Hidden Markov Model.
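A sketch of that pipeline; the lexicon, per-state trees, and leaf GMM tables are hypothetical stand-ins for the real system's data structures.

```python
def build_observation_models(words, lexicon, trees, leaf_gmms):
    phones = [ph for w in words for ph in lexicon[w]]        # 1. baseforms
    states = [(ph, k) for ph in phones for k in (1, 2, 3)]   # 2. three states per phone
    leaves = [trees[st](phones) for st in states]            # 3. descend tree in context
    return [leaf_gmms[leaf] for leaf in leaves]              # GMMs per HMM state
```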

90 Course Feedback. Was this lecture mostly clear or unclear? What was the muddiest topic? Other feedback (pace, content, atmosphere, etc.).


More information

FE FORMULATIONS FOR PLASTICITY

FE FORMULATIONS FOR PLASTICITY G These slides are designed based on the book: Finite Elements in Plasticity Theory and Practice, D.R.J. Owen and E. Hinton, 1970, Pineridge Press Ltd., Swansea, UK. 1 Course Content: A INTRODUCTION AND

More information

Information collection on a graph

Information collection on a graph Information collection on a grah Ilya O. Ryzhov Warren Powell February 10, 2010 Abstract We derive a knowledge gradient olicy for an otimal learning roblem on a grah, in which we use sequential measurements

More information

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests 009 American Control Conference Hyatt Regency Riverfront, St. Louis, MO, USA June 0-, 009 FrB4. System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests James C. Sall Abstract

More information

Lecture 17: Trees and Merge Sort 10:00 AM, Oct 15, 2018

Lecture 17: Trees and Merge Sort 10:00 AM, Oct 15, 2018 CS17 Integrated Introduction to Computer Science Klein Contents Lecture 17: Trees and Merge Sort 10:00 AM, Oct 15, 2018 1 Tree definitions 1 2 Analysis of mergesort using a binary tree 1 3 Analysis of

More information

CSE 599d - Quantum Computing When Quantum Computers Fall Apart

CSE 599d - Quantum Computing When Quantum Computers Fall Apart CSE 599d - Quantum Comuting When Quantum Comuters Fall Aart Dave Bacon Deartment of Comuter Science & Engineering, University of Washington In this lecture we are going to begin discussing what haens to

More information

Supplementary Materials for Robust Estimation of the False Discovery Rate

Supplementary Materials for Robust Estimation of the False Discovery Rate Sulementary Materials for Robust Estimation of the False Discovery Rate Stan Pounds and Cheng Cheng This sulemental contains roofs regarding theoretical roerties of the roosed method (Section S1), rovides

More information

An Analysis of Reliable Classifiers through ROC Isometrics

An Analysis of Reliable Classifiers through ROC Isometrics An Analysis of Reliable Classifiers through ROC Isometrics Stijn Vanderlooy s.vanderlooy@cs.unimaas.nl Ida G. Srinkhuizen-Kuyer kuyer@cs.unimaas.nl Evgueni N. Smirnov smirnov@cs.unimaas.nl MICC-IKAT, Universiteit

More information

Multi-Operation Multi-Machine Scheduling

Multi-Operation Multi-Machine Scheduling Multi-Oeration Multi-Machine Scheduling Weizhen Mao he College of William and Mary, Williamsburg VA 3185, USA Abstract. In the multi-oeration scheduling that arises in industrial engineering, each job

More information

Outline. EECS150 - Digital Design Lecture 26 Error Correction Codes, Linear Feedback Shift Registers (LFSRs) Simple Error Detection Coding

Outline. EECS150 - Digital Design Lecture 26 Error Correction Codes, Linear Feedback Shift Registers (LFSRs) Simple Error Detection Coding Outline EECS150 - Digital Design Lecture 26 Error Correction Codes, Linear Feedback Shift Registers (LFSRs) Error detection using arity Hamming code for error detection/correction Linear Feedback Shift

More information

16.2. Infinite Series. Introduction. Prerequisites. Learning Outcomes

16.2. Infinite Series. Introduction. Prerequisites. Learning Outcomes Infinite Series 6.2 Introduction We extend the concet of a finite series, met in Section 6., to the situation in which the number of terms increase without bound. We define what is meant by an infinite

More information

A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split

A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split A Bound on the Error of Cross Validation Using the Aroximation and Estimation Rates, with Consequences for the Training-Test Slit Michael Kearns AT&T Bell Laboratories Murray Hill, NJ 7974 mkearns@research.att.com

More information

Yixi Shi. Jose Blanchet. IEOR Department Columbia University New York, NY 10027, USA. IEOR Department Columbia University New York, NY 10027, USA

Yixi Shi. Jose Blanchet. IEOR Department Columbia University New York, NY 10027, USA. IEOR Department Columbia University New York, NY 10027, USA Proceedings of the 2011 Winter Simulation Conference S. Jain, R. R. Creasey, J. Himmelsach, K. P. White, and M. Fu, eds. EFFICIENT RARE EVENT SIMULATION FOR HEAVY-TAILED SYSTEMS VIA CROSS ENTROPY Jose

More information

Distributed Rule-Based Inference in the Presence of Redundant Information

Distributed Rule-Based Inference in the Presence of Redundant Information istribution Statement : roved for ublic release; distribution is unlimited. istributed Rule-ased Inference in the Presence of Redundant Information June 8, 004 William J. Farrell III Lockheed Martin dvanced

More information

Bayesian inference & Markov chain Monte Carlo. Note 1: Many slides for this lecture were kindly provided by Paul Lewis and Mark Holder

Bayesian inference & Markov chain Monte Carlo. Note 1: Many slides for this lecture were kindly provided by Paul Lewis and Mark Holder Bayesian inference & Markov chain Monte Carlo Note 1: Many slides for this lecture were kindly rovided by Paul Lewis and Mark Holder Note 2: Paul Lewis has written nice software for demonstrating Markov

More information

Unit 1 - Computer Arithmetic

Unit 1 - Computer Arithmetic FIXD-POINT (FX) ARITHMTIC Unit 1 - Comuter Arithmetic INTGR NUMBRS n bit number: b n 1 b n 2 b 0 Decimal Value Range of values UNSIGND n 1 SIGND D = b i 2 i D = 2 n 1 b n 1 + b i 2 i n 2 i=0 i=0 [0, 2

More information

Hypothesis Test-Confidence Interval connection

Hypothesis Test-Confidence Interval connection Hyothesis Test-Confidence Interval connection Hyothesis tests for mean Tell whether observed data are consistent with μ = μ. More secifically An hyothesis test with significance level α will reject the

More information

p(-,i)+p(,i)+p(-,v)+p(i,v),v)+p(i,v)

p(-,i)+p(,i)+p(-,v)+p(i,v),v)+p(i,v) Multile Sequence Alignment Given: Set of sequences Score matrix Ga enalties Find: Alignment of sequences such that otimal score is achieved. Motivation Aligning rotein families Establish evolutionary relationshis

More information

Recent Developments in Multilayer Perceptron Neural Networks

Recent Developments in Multilayer Perceptron Neural Networks Recent Develoments in Multilayer Percetron eural etworks Walter H. Delashmit Lockheed Martin Missiles and Fire Control Dallas, Texas 75265 walter.delashmit@lmco.com walter.delashmit@verizon.net Michael

More information

Probability Estimates for Multi-class Classification by Pairwise Coupling

Probability Estimates for Multi-class Classification by Pairwise Coupling Probability Estimates for Multi-class Classification by Pairwise Couling Ting-Fan Wu Chih-Jen Lin Deartment of Comuter Science National Taiwan University Taiei 06, Taiwan Ruby C. Weng Deartment of Statistics

More information

CSC165H, Mathematical expression and reasoning for computer science week 12

CSC165H, Mathematical expression and reasoning for computer science week 12 CSC165H, Mathematical exression and reasoning for comuter science week 1 nd December 005 Gary Baumgartner and Danny Hea hea@cs.toronto.edu SF4306A 416-978-5899 htt//www.cs.toronto.edu/~hea/165/s005/index.shtml

More information

Chapter 7 Rational and Irrational Numbers

Chapter 7 Rational and Irrational Numbers Chater 7 Rational and Irrational Numbers In this chater we first review the real line model for numbers, as discussed in Chater 2 of seventh grade, by recalling how the integers and then the rational numbers

More information

MATHEMATICAL MODELLING OF THE WIRELESS COMMUNICATION NETWORK

MATHEMATICAL MODELLING OF THE WIRELESS COMMUNICATION NETWORK Comuter Modelling and ew Technologies, 5, Vol.9, o., 3-39 Transort and Telecommunication Institute, Lomonosov, LV-9, Riga, Latvia MATHEMATICAL MODELLIG OF THE WIRELESS COMMUICATIO ETWORK M. KOPEETSK Deartment

More information

PHYS 301 HOMEWORK #9-- SOLUTIONS

PHYS 301 HOMEWORK #9-- SOLUTIONS PHYS 0 HOMEWORK #9-- SOLUTIONS. We are asked to use Dirichlet' s theorem to determine the value of f (x) as defined below at x = 0, ± /, ± f(x) = 0, - < x

More information

RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES

RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES AARON ZWIEBACH Abstract. In this aer we will analyze research that has been recently done in the field of discrete

More information

One-way ANOVA Inference for one-way ANOVA

One-way ANOVA Inference for one-way ANOVA One-way ANOVA Inference for one-way ANOVA IPS Chater 12.1 2009 W.H. Freeman and Comany Objectives (IPS Chater 12.1) Inference for one-way ANOVA Comaring means The two-samle t statistic An overview of ANOVA

More information

Brownian Motion and Random Prime Factorization

Brownian Motion and Random Prime Factorization Brownian Motion and Random Prime Factorization Kendrick Tang June 4, 202 Contents Introduction 2 2 Brownian Motion 2 2. Develoing Brownian Motion.................... 2 2.. Measure Saces and Borel Sigma-Algebras.........

More information

An Improved Calibration Method for a Chopped Pyrgeometer

An Improved Calibration Method for a Chopped Pyrgeometer 96 JOURNAL OF ATMOSPHERIC AND OCEANIC TECHNOLOGY VOLUME 17 An Imroved Calibration Method for a Choed Pyrgeometer FRIEDRICH FERGG OtoLab, Ingenieurbüro, Munich, Germany PETER WENDLING Deutsches Forschungszentrum

More information

A New Perspective on Learning Linear Separators with Large L q L p Margins

A New Perspective on Learning Linear Separators with Large L q L p Margins A New Persective on Learning Linear Searators with Large L q L Margins Maria-Florina Balcan Georgia Institute of Technology Christoher Berlind Georgia Institute of Technology Abstract We give theoretical

More information

Numerical Linear Algebra

Numerical Linear Algebra Numerical Linear Algebra Numerous alications in statistics, articularly in the fitting of linear models. Notation and conventions: Elements of a matrix A are denoted by a ij, where i indexes the rows and

More information

Morten Frydenberg Section for Biostatistics Version :Friday, 05 September 2014

Morten Frydenberg Section for Biostatistics Version :Friday, 05 September 2014 Morten Frydenberg Section for Biostatistics Version :Friday, 05 Setember 204 All models are aroximations! The best model does not exist! Comlicated models needs a lot of data. lower your ambitions or get

More information

Finite-State Verification or Model Checking. Finite State Verification (FSV) or Model Checking

Finite-State Verification or Model Checking. Finite State Verification (FSV) or Model Checking Finite-State Verification or Model Checking Finite State Verification (FSV) or Model Checking Holds the romise of roviding a cost effective way of verifying imortant roerties about a system Not all faults

More information

Hidden Predictors: A Factor Analysis Primer

Hidden Predictors: A Factor Analysis Primer Hidden Predictors: A Factor Analysis Primer Ryan C Sanchez Western Washington University Factor Analysis is a owerful statistical method in the modern research sychologist s toolbag When used roerly, factor

More information

Maximum Entropy and the Stress Distribution in Soft Disk Packings Above Jamming

Maximum Entropy and the Stress Distribution in Soft Disk Packings Above Jamming Maximum Entroy and the Stress Distribution in Soft Disk Packings Above Jamming Yegang Wu and S. Teitel Deartment of Physics and Astronomy, University of ochester, ochester, New York 467, USA (Dated: August

More information

Section 0.10: Complex Numbers from Precalculus Prerequisites a.k.a. Chapter 0 by Carl Stitz, PhD, and Jeff Zeager, PhD, is available under a Creative

Section 0.10: Complex Numbers from Precalculus Prerequisites a.k.a. Chapter 0 by Carl Stitz, PhD, and Jeff Zeager, PhD, is available under a Creative Section 0.0: Comlex Numbers from Precalculus Prerequisites a.k.a. Chater 0 by Carl Stitz, PhD, and Jeff Zeager, PhD, is available under a Creative Commons Attribution-NonCommercial-ShareAlike.0 license.

More information