Joint Factor Analysis of Speaker and Session Variability: Theory and Algorithms


DRAFT VERSION JANUARY 3, 2006

Patrick Kenny

Abstract—We give a full account of the algorithms needed to carry out a joint factor analysis of speaker and session variability in a training set in which each speaker is recorded over many different channels, and we discuss the practical limitations that will be encountered if these algorithms are implemented on very large data sets.

I. INTRODUCTION

This article is intended as a companion to [1], where we presented a new type of likelihood ratio statistic for speaker verification which is designed principally to deal with the problem of inter-session variability, that is, the variability among recordings of a given speaker. This likelihood ratio statistic is based on a joint factor analysis of speaker and session variability in a training set in which each speaker is recorded over many different channels, such as one of the Switchboard II databases. Our purpose in the current article is to give detailed algorithms for carrying out such a factor analysis. Although we have only experimented with the application of this model in speaker recognition, we will also explain how it could serve as an integrated framework for progressive speaker-adaptation and on-line channel adaptation of HMM-based speech recognizers operating in situations where speaker identities are known.

II. OVERVIEW OF THE JOINT FACTOR ANALYSIS MODEL

The joint factor analysis model can be viewed as a Gaussian distribution on speaker- and channel-dependent (or, more accurately, session-dependent) HMM supervectors in which most, but not all, of the variance in the supervector population is assumed to be accounted for by a small number of hidden variables which we refer to as speaker and channel factors. The speaker factors and the channel factors play different roles in that, for a given speaker, the values of the speaker factors are assumed to be the same for all recordings of the speaker, but the channel factors are assumed to vary from one recording to another. For example, the Gaussian distribution on speaker-dependent supervectors used in eigenvoice MAP [2] is a special case of the factor analysis model in which there are no channel factors and all of the variance in the speaker-dependent HMM supervectors is assumed to be accounted for by the speaker factors. In this case, for a given speaker, the values of the speaker factors are the co-ordinates of the speaker supervector relative to a suitable basis of the eigenspace.

(The authors are with the Centre de recherche informatique de Montréal (CRIM). P. Kenny can be reached at x 4624, pkenny@crim.ca. This work was supported in part by the Natural Sciences and Engineering Research Council of Canada and by the Ministère du Développement Économique et Régional et de la Recherche du gouvernement du Québec.)

The general model combines the priors underlying classical MAP [3], eigenvoice MAP [2] and eigenchannel MAP [4], so we begin by reviewing these and showing how a single prior can be constructed which embraces all of them.

A. Speaker and channel factors

We assume a fixed HMM structure containing a total of C mixture components. Let F be the dimension of the acoustic feature vectors so that, associated with each mixture component, there is an F-dimensional mean vector and an F × F covariance matrix which in this article we will take to be diagonal. For each mixture component c = 1, ..., C, let m_c denote the corresponding speaker-independent mean vector (usually estimated by Baum-Welch training) and let m denote the CF × 1 supervector obtained by concatenating m_1, ..., m_C.

To begin with, let us ignore channel effects and assume that each speaker can be modeled by a single speaker-dependent supervector M common to all recordings of the speaker. For each mixture component c, let M_c be the corresponding subvector of M. Classical MAP is often presented as a rule of thumb which interpolates between the speaker-dependent and speaker-independent estimates of the HMM mean vectors, without explaining what this rule has to do with prior and posterior distributions. In order to achieve the type of generalization that we are aiming for, we will have to spell out the role of these distributions explicitly.

The assumption concerning the prior on speaker supervectors is that all of the HMM mean vectors M_c are statistically independent as c ranges over all mixture components and over all speakers and that, for each mixture component c, the marginal distribution of M_c is normal and independent of s. If we further assume that the prior distribution is normal, then the basic assumption is that there is a diagonal matrix d such that, for a randomly chosen speaker,

    M = m + dz    (1)

where z is a hidden vector distributed according to the standard normal distribution, N(z|0, I). Although we will only consider the case where d is diagonal, a generalization to the case where d is a block diagonal matrix with each block being of dimension F × F is possible. Given some adaptation data for a speaker, MAP adaptation consists in calculating

the posterior distribution of M; the MAP estimate of M is just the mode of this posterior. (We will explain later how this gives rise to the rule of thumb.) Provided that d is non-singular, classical MAP adaptation is guaranteed to be asymptotically equivalent to speaker-dependent training as the amount of adaptation data increases. However, in the absence of observations for a given speaker and mixture component, the classical MAP estimator falls back to the speaker-independent estimate of the HMM mean vector. Thus if the number of mixture components C is large, classical MAP tends to saturate slowly, in the sense that large amounts of enrollment data are needed to use it to full advantage.

Eigenvoice MAP assumes instead that there is a rectangular matrix v of dimension CF × R (R ≪ CF) such that, for a randomly chosen speaker,

    M = m + vy    (2)

where y is a hidden R × 1 vector having a standard normal distribution. Since the dimension of y is much smaller than that of z, eigenvoice MAP tends to saturate much more quickly than classical MAP. But this approach to speaker adaptation suffers from the drawback that, in estimating v from a given training set, it is necessary to assume that R is less than or equal to the number of training speakers [2], so that a very large number of training speakers may be needed to estimate v properly. Thus in practice there is no guarantee that eigenvoice MAP adaptation will exhibit correct asymptotic behavior as the quantity of enrollment data for a speaker increases.

The strengths and weaknesses of classical MAP and eigenvoice MAP complement each other: eigenvoice MAP is preferable if small amounts of data are available for speaker adaptation, and classical MAP if large amounts are available. An obvious strategy for combining the two is to assume a decomposition of the form

    M = m + vy + dz    (3)

where the hidden vector (y, z) has a standard normal distribution. Given some adaptation data for a speaker, speaker adaptation can be implemented by calculating the posterior distribution of M just as in classical MAP or eigenvoice MAP. In this situation it is no longer appropriate to speak of eigenvoices; rather, v is a factor loading matrix and the components of y are speaker factors. If d = 0, then all speaker supervectors are contained in the affine space defined by translating the range of vv* by m, so we will call this the speaker space. We will use this terminology in the general case (d ≠ 0) as well because, although it may be unrealistic to assume that all speaker supervectors are contained in a linear manifold of low dimension, the intuition underlying eigenvoice modeling seems to be quite sound and our experience has been that d is relatively small in practice.

In [4] we presented a quick and dirty solution to the problem of channel adaptation of speaker HMMs which we called eigenchannel MAP. Suppose that for a speaker s we have obtained a speaker-adapted HMM or, equivalently, a point estimate m(s) of M by some speaker adaptation technique such as eigenvoice MAP or classical MAP. Given a collection of recordings for the speaker, let M_h denote the supervector corresponding to recording h (h = 1, 2, ...). Our starting point in [4] was to assume that there is a matrix u of low rank such that, for each recording h, the prior distribution of M_h is given by

    M_h = m(s) + ux_h    (4)

where x_h is a hidden vector having a standard normal distribution. Given some adaptation data from the recording, the speaker supervector m(s) can be adapted to a new set of recording conditions by calculating the posterior distribution of M_h, just as in eigenvoice MAP; in fact the form of this model is so similar to (2) that no new mathematics is needed to develop it. Furthermore, as we explained in [4], it is less susceptible to the rank deficiency problem that afflicts eigenvoice MAP in practice. The trouble with this model is that it neglects to take channel effects into account in deriving the point estimate m(s), a problem which is not easily remedied. The assumption that M can be replaced by a point estimate is also unsatisfactory, since eigenvoice MAP and especially classical MAP produce posterior distributions on M rather than point estimates; it is only in the case where the amount of adaptation data is large that the posterior distributions become concentrated on points.

In order to develop an integrated approach to speaker- and channel-adaptation, we will work out the model obtained by substituting the right hand side of (3) for m(s) in (4). Thus we assume that, for a given speaker s and recording h,

    M(s) = m + vy(s) + dz(s)
    M_h(s) = M(s) + ux_h(s)    (5)

The range of uu* can be thought of as the channel space. Alternatively, since the channel factors may be capturing intra-speaker variability as much as channel variability, it could be termed the intra-speaker space; this would be in keeping with the terminology used in the dual eigenspace approach to face recognition [5]. So our factor analysis model will be specified by a quintuple of hyperparameters Λ of the form (m, u, v, d, Σ) where:

1) m is a supervector of dimension CF × 1;
2) u is a matrix of dimension CF × R_C where R_C is the channel rank;
3) v is a matrix of dimension CF × R_S where R_S is the speaker rank;
4) d is a CF × CF diagonal matrix;
5) Σ is a CF × CF diagonal covariance matrix whose diagonal blocks we denote by Σ_c for c = 1, ..., C.

To explain the role of the covariance matrices in (5), fix a mixture component c. For each speaker s and recording h, let M_hc(s) denote the subvector of M_h(s) corresponding to the given mixture component. We assume that, for all speakers s and recordings h, observations drawn from mixture component c are distributed with mean M_hc(s) and covariance matrix Σ_c. In the case where u = 0 and v = 0, so that the factor analysis model reduces to the prior for classical MAP,

the relevance factors of [6] are the diagonal entries of d⁻²Σ, as we will explain.

The term dz is included in the factor analysis model in order to ensure that it inherits the asymptotic behavior of classical MAP, but it is costly in terms of both mathematical and computational complexity. The reason for this is that, although the increase in the number of free parameters is relatively modest (since, unlike u and v, d is assumed to be diagonal), the increase in the number of hidden variables is enormous (assuming that there are at most a few hundred speaker and channel factors). On the other hand, if d = 0 the model is quite simple, since the basic assumption is that each speaker- and channel-dependent supervector is a sum of two supervectors, one of which is contained in the speaker space and the other in the channel space. (In fact this decomposition is normally unique, since the range of uu* and the range of vv*, being low dimensional subspaces of a very high dimensional space, will typically only intersect at the origin.) We will use the term Principal Component Analysis (PCA) to refer to the case d = 0 and reserve the term Factor Analysis to describe the general case.

B. The likelihood function

To describe the likelihood function for the factor analysis model, suppose that we are given a set of recordings for a speaker s indexed by h = 1, ..., H(s). For each recording h, assume that each frame has been aligned with a mixture component, and let 𝒳_h(s) denote the collection of labeled frames for the recording. Set 𝒳(s) = (𝒳_1(s), ..., 𝒳_H(s)) and let X(s) be the vector of hidden variables defined by

    X(s) = (x_1*(s), ..., x_H*(s), y*(s), z*(s))*.

If X(s) were given, we could write down M_h(s) and calculate the Gaussian likelihood of 𝒳_h(s) for each recording h, so the calculation of the likelihood of 𝒳(s) would be straightforward. Let us denote this conditional likelihood by P_Λ(𝒳(s)|X(s)). Since the values of the hidden variables are not given, calculating the likelihood of 𝒳(s) requires evaluating the integral

    ∫ P_Λ(𝒳(s)|X) N(X|0, I) dX    (6)

where N(X|0, I) is the standard Gaussian kernel

    N(x_1|0, I) ⋯ N(x_H|0, I) N(y|0, I) N(z|0, I).

We denote the value of this integral by P_Λ(𝒳(s)); a closed form expression for P_Λ(𝒳(s)) is given in Theorem 3 below.

C. Estimating the hyperparameters

The principal problem we have to deal with is how to estimate the hyperparameters which specify the factor analysis model, given that speaker- and channel-dependent HMM supervectors are unobservable. (It is not possible in practice to estimate a speaker- and channel-dependent HMM by maximum likelihood methods from a single recording.) We will show how, if we are given a training set in which each speaker is recorded in multiple sessions, we can estimate the hyperparameters Λ by EM algorithms which guarantee that the total likelihood of the training data increases from one iteration to the next. The total likelihood of the training data is ∏_s P_Λ(𝒳(s)), where s ranges over the training speakers. We refer to these as speaker-independent hyperparameter estimation algorithms, or simply as training procedures, since they consist in fitting the factor analysis model (5) to the entire collection of speakers in the training data rather than to an individual speaker.

One estimation algorithm, which we will refer to simply as maximum likelihood estimation, can be derived by extending Proposition 3 in [2] to handle the hyperparameters u and d in addition to v and Σ. Our experience with it has been that it tends to converge very slowly. Another algorithm can be derived by using the divergence minimization approach to hyperparameter estimation introduced in [7]. This seems to converge much more rapidly, but it has the property that it keeps the orientation of the speaker and channel spaces fixed, so that it can only be used if these are well initialized. (A similar situation arises when the divergence minimization approach is used to estimate inter-speaker correlations; see the remarks following Proposition 3 in [7].) We have experimented with the maximum likelihood approach on its own and with the maximum likelihood approach followed by divergence minimization. With one notable exception, we obtained essentially the same performance in both cases, even though the hyperparameter estimates are quite different.

We also need a speaker-dependent estimation algorithm for the hyperparameters, both for constructing likelihood ratio statistics for speaker verification [8], [1] and for progressive speaker adaptation. For this we assume that, for a given speaker s and recording h,

    M(s) = m(s) + v(s)y + d(s)z
    M_h(s) = M(s) + ux_h    (7)

That is, we make the hyperparameters m, v and d speaker-dependent, but we continue to treat the hyperparameters Σ and u as speaker-independent. (There is no reason to suppose that channel effects vary from one speaker to another.) Given some recordings of the speaker, we estimate the speaker-dependent hyperparameters m(s), v(s) and d(s) by first using the speaker-independent hyperparameters and the recordings to calculate the posterior distribution of M(s), and then adjusting the speaker-dependent hyperparameters to fit this posterior. More specifically, we find the distribution of the form m(s) + v(s)y + d(s)z which is closest to the posterior in the sense that the divergence is minimized. This is just the minimum divergence estimation algorithm applied to a single speaker in the case where u and Σ are held fixed.

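The generative decomposition described above (speaker factors y and residual z shared across a speaker's recordings, channel factors x_h drawn afresh per recording) can be sketched numerically. This is an illustrative sketch only: the dimensions are toy-sized and the hyperparameters (m, u, v, d) are drawn at random rather than estimated from any training set.

```python
import numpy as np

# Toy dimensions, assumed for illustration only.
C, F, R_S, R_C = 8, 4, 3, 2      # mixture components, feature dim, ranks
CF = C * F
rng = np.random.default_rng(0)

m = rng.standard_normal(CF)                 # speaker-independent supervector
v = 0.5 * rng.standard_normal((CF, R_S))    # speaker loading matrix
u = 0.5 * rng.standard_normal((CF, R_C))    # channel loading matrix
d = np.diag(0.1 * rng.random(CF))           # diagonal residual term

def sample_speaker_sessions(H):
    """Draw M = m + v y + d z once for a speaker (y and z are common to
    all recordings), then M_h = M + u x_h with a fresh x_h per recording."""
    y = rng.standard_normal(R_S)
    z = rng.standard_normal(CF)
    M = m + v @ y + d @ z
    return [M + u @ rng.standard_normal(R_C) for _ in range(H)]

sessions = sample_speaker_sessions(3)
```

One consequence worth noting: the difference between any two session supervectors of the same speaker lies exactly in the range of u, which is what motivates calling that range the channel space.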
Thus m(s) is an estimate of the speaker supervector with channel effects abstracted away, and d(s) and v(s) measure the uncertainty in this estimate.

D. Speaker- and channel-adaptation of HMMs

Given speaker-independent or speaker-dependent estimates of the hyperparameters m, u, v, d and Σ and a collection of recordings for a speaker, we will explain how to calculate the posterior distribution of the hidden variables y, z and x_h for each recording h. As in [2], this is by far the most important calculation that needs to be performed in order to implement the model; it turns out to be much more difficult and computationally expensive in the general case than in the PCA case. Knowing this posterior, it is easy to calculate the posterior distribution of the speaker- and channel-dependent supervector M_h for each recording h, and hence to implement MAP speaker- and channel-adaptation for each recording. Note however that, since the values of z and y are assumed to be common to all of the recordings, the posterior distributions of M_h for h = 1, 2, ... cannot be calculated independently of each other. Thus a naive approach to the problem of performing speaker- and channel-adaptation to a new set of recording conditions would require processing all of the recordings of the speaker if the calculation is to be carried out exactly. Clearly, in practical situations where multiple recordings of a speaker are available (as in dictation), it would be much more satisfactory to process these recordings sequentially, producing a speaker- and channel-adapted HMM for each recording as it becomes available, rather than in batch mode.

The most natural approach to avoiding the need for batch processing is to apply the speaker-dependent hyperparameter estimation algorithm sequentially. Whenever a new recording of the speaker becomes available, we can use the current estimates of the speaker-dependent hyperparameters m(s), v(s) and d(s) together with the given recording to perform MAP HMM adaptation. We can also use the given recording to update the speaker-dependent hyperparameters. This enables us to perform progressive speaker-adaptation and on-line channel-adaptation one recording at a time. Note that the speaker-independent hyperparameter estimates play a fundamental role if speaker-dependent priors are allowed to evolve in this way: they are needed to initialize the sequential update algorithm for the speaker-dependent hyperparameters, they provide the loading matrix for the channel factors and, since minimum divergence estimation preserves the orientation of the speaker space, they impose constraints on how the speaker-dependent priors can evolve.

III. LIKELIHOOD CALCULATIONS

Suppose we are given a speaker s and a collection of recordings h = 1, ..., H(s) with observable variables 𝒳(s) and hidden variables X(s). For a given set of hyperparameters Λ we denote the joint distribution of 𝒳(s) and X(s) by P_Λ(𝒳(s), X(s)). In this section we will study this joint distribution and show how to derive the posterior and marginal distributions needed to implement the factor analysis model. More specifically, we will calculate the marginal distribution of the observable variables, P_Λ(𝒳(s)), and the posterior distribution of the hidden variables, P_Λ(X(s)|𝒳(s)). We will also show how this posterior distribution can be used for speaker- and channel-adaptation of HMMs.

A. The joint distribution of the observable and hidden variables

Note first that if we fix a recording h, it is a straightforward matter to calculate the conditional likelihood of the observations 𝒳_h(s) if the supervector M_h(s) is given. So we begin by calculating the conditional distribution of the observable variables given the hidden variables, P_Λ(𝒳(s)|X(s)). Multiplying this by the standard Gaussian kernel N(X(s)|0, I) gives the joint distribution of the observable variables and the hidden variables.

First we define some statistics. For each recording h and mixture component c, let N_hc(s) be the total number of observations for the given mixture component and set

    F_hc(s, m_c) = Σ_t (X_t − m_c)
    S_hc(s, m_c) = diag( Σ_t (X_t − m_c)(X_t − m_c)* )

where the sums extend over all observations X_t aligned with the given mixture component, m_c is the cth block of m, and diag sets off-diagonal entries to 0. These statistics can be extracted from the training data in various ways; the simplest algorithm conceptually is to extract them by means of a Viterbi alignment with a speaker- and channel-independent HMM, but a forward-backward alignment could also be used. (Also, the speaker- and channel-adapted HMMs described in Section III-E below could be used in place of the speaker- and channel-independent HMM.)

Let N_h(s) be the CF × CF diagonal matrix whose diagonal blocks are N_hc(s)I for c = 1, ..., C, where I is the F × F identity matrix. Let F_h(s, m) be the CF × 1 vector obtained by concatenating F_hc(s, m_c) for c = 1, ..., C. Similarly, let S_h(s, m) be the CF × CF diagonal matrix whose diagonal blocks are S_hc(s, m_c) for c = 1, ..., C. Let N(s) be the HCF × HCF matrix whose diagonal blocks are N_h(s) for h = 1, ..., H, where H is the number of recordings of the speaker. Let F(s, m) be the HCF × 1 vector obtained by concatenating F_1(s, m), ..., F_H(s, m). Let Σ also denote, by abuse of notation, the HCF × HCF matrix whose diagonal blocks are all equal to Σ. Let V be the matrix of dimension HCF × (HR_C + R_S + CF) defined by

    V = [ u 0 ⋯ 0  v  d ]
        [ 0 u ⋯ 0  v  d ]
        [ ⋮        ⋮  ⋮ ]
        [ 0 0 ⋯ u  v  d ]    (8)

We are now in a position to write down a formula for the conditional distribution of the observable variables given the hidden variables.

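The per-recording statistics N_hc, F_hc(s, m_c) and S_hc(s, m_c) just defined can be collected as follows. This is a minimal sketch: it assumes a hard, Viterbi-style alignment (each frame labeled with one mixture component) rather than forward-backward occupation probabilities, and the function name is hypothetical.

```python
import numpy as np

def collect_stats(frames, align, m, C):
    """Zeroth-, first- and second-order statistics for one recording.
    frames: (T, F) array of feature vectors X_t; align: length-T integer
    array of mixture-component labels in 0..C-1 (hard alignment assumed);
    m: (C, F) array of speaker-independent means m_c.
    Returns N (counts N_hc), Fst (sum of X_t - m_c per component) and
    Sst (diagonal of sum (X_t - m_c)(X_t - m_c)* per component)."""
    N = np.zeros(C)
    Fst = np.zeros((C, frames.shape[1]))
    Sst = np.zeros((C, frames.shape[1]))
    for c in range(C):
        Xc = frames[align == c] - m[c]   # residuals X_t - m_c
        N[c] = len(Xc)
        Fst[c] = Xc.sum(axis=0)
        Sst[c] = (Xc ** 2).sum(axis=0)   # off-diagonals discarded (diag)
    return N, Fst, Sst
```

A forward-backward variant would replace the 0/1 membership implicit in `align == c` with per-frame occupation probabilities weighting each residual.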
Theorem 1: For each speaker s,

    log P_Λ(𝒳(s)|X(s)) = G_Σ(s, m) + H_Λ(s, X(s))

where

    G_Σ(s, m) = Σ_{h=1}^{H} Σ_{c=1}^{C} N_hc(s) log [ 1 / ( (2π)^{F/2} |Σ_c|^{1/2} ) ] − ½ Σ_{h=1}^{H} tr( Σ⁻¹ S_h(s, m) )

    H_Λ(s, X(s)) = X*(s) V* Σ⁻¹ F(s, m) − ½ X*(s) V* Σ⁻¹ N(s) V X(s).

Proof: Set O(s) = V X(s), so that for each recording h,

    M_h(s) = m + O_h(s).

For each h = 1, ..., H,

    log P_Λ(𝒳_h(s)|X(s)) = Σ_{c=1}^{C} N_hc(s) log [ 1 / ( (2π)^{F/2} |Σ_c|^{1/2} ) ] − ½ Σ_{c=1}^{C} Σ_t (X_t − M_hc(s))* Σ_c⁻¹ (X_t − M_hc(s))

where, for each mixture component c, t extends over all frames that are aligned with the mixture component. Ignoring the factor of ½ for the time being, the second term here can be expressed in terms of O_h(s) as follows:

    Σ_c Σ_t (X_t − m_c − O_hc(s))* Σ_c⁻¹ (X_t − m_c − O_hc(s))
      = Σ_c Σ_t (X_t − m_c)* Σ_c⁻¹ (X_t − m_c) − 2 Σ_c O_hc*(s) Σ_c⁻¹ F_hc(s, m_c) + Σ_c O_hc*(s) N_hc(s) Σ_c⁻¹ O_hc(s)
      = Σ_c tr( Σ_c⁻¹ S_hc(s, m_c) ) − 2 O_h*(s) Σ⁻¹ F_h(s, m) + O_h*(s) N_h(s) Σ⁻¹ O_h(s)

so that

    log P_Λ(𝒳_h(s)|X(s)) = Σ_c N_hc(s) log [ 1 / ( (2π)^{F/2} |Σ_c|^{1/2} ) ] − ½ tr( Σ⁻¹ S_h(s, m) ) + O_h*(s) Σ⁻¹ F_h(s, m) − ½ O_h*(s) N_h(s) Σ⁻¹ O_h(s).

Hence

    log P_Λ(𝒳(s)|X(s)) = Σ_{h=1}^{H} log P_Λ(𝒳_h(s)|X(s))
      = G_Σ(s, m) + O*(s) Σ⁻¹ F(s, m) − ½ O*(s) N(s) Σ⁻¹ O(s)
      = G_Σ(s, m) + H_Λ(s, X(s))

as required.

B. The posterior distribution of the hidden variables

The posterior distribution of the hidden variables given the observable variables, P_Λ(X(s)|𝒳(s)), can be calculated just as in Proposition 1 of [2]. Let

    L = I + V* Σ⁻¹ N(s) V.

Theorem 2: For each speaker s, the posterior distribution of X(s) given 𝒳(s) is Gaussian with mean L⁻¹ V* Σ⁻¹ F(s, m) and covariance matrix L⁻¹.

Proof: By Theorem 1,

    P_Λ(𝒳(s)|X) ∝ exp( X* V* Σ⁻¹ F(s, m) − ½ X* V* Σ⁻¹ N(s) V X ).

So

    P_Λ(𝒳(s)|X) N(X|0, I) ∝ exp( X* V* Σ⁻¹ F(s, m) − ½ X* L X ) ∝ exp( −½ (X − A)* L (X − A) )

where A = L⁻¹ V* Σ⁻¹ F(s, m), as required.

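Theorem 2 can be exercised numerically on toy data. The sketch below (toy sizes, randomly generated stand-ins for the Baum-Welch statistics; all names are illustrative) assembles V as in (8), forms L = I + V*Σ⁻¹N(s)V block-row by block-row so the HCF × HCF matrices are never materialized, and solves for the posterior mean and covariance.

```python
import numpy as np

# Toy sizes and synthetic statistics, for illustration only.
rng = np.random.default_rng(1)
C, F, R_S, R_C, H = 3, 2, 2, 1, 2
CF = C * F

u = rng.standard_normal((CF, R_C))          # channel loading matrix
v = rng.standard_normal((CF, R_S))          # speaker loading matrix
d = np.diag(0.2 * rng.random(CF))           # diagonal residual term
Sigma_inv = np.diag(1.0 / (0.5 + rng.random(CF)))   # Sigma is diagonal

# Per-recording occupation matrices N_h and centred first-order
# statistics F_h (random stand-ins for real Baum-Welch statistics).
N_h = [np.kron(np.diag(rng.integers(1, 20, C).astype(float)), np.eye(F))
       for _ in range(H)]
F_h = [rng.standard_normal(CF) for _ in range(H)]

# Assemble V as in (8): block row h is [0 ... u ... 0  v  d].
dim = H * R_C + R_S + CF
V = np.zeros((H * CF, dim))
for h in range(H):
    V[h * CF:(h + 1) * CF, h * R_C:(h + 1) * R_C] = u
    V[h * CF:(h + 1) * CF, H * R_C:H * R_C + R_S] = v
    V[h * CF:(h + 1) * CF, H * R_C + R_S:] = d

# L = I + V* Sigma^-1 N V and b = V* Sigma^-1 F(m), accumulated per
# block row (Sigma^-1 N_h is diagonal, so each term is symmetric).
L = np.eye(dim)
b = np.zeros(dim)
for h in range(H):
    Vh = V[h * CF:(h + 1) * CF]
    L += Vh.T @ Sigma_inv @ N_h[h] @ Vh
    b += Vh.T @ Sigma_inv @ F_h[h]

post_mean = np.linalg.solve(L, b)   # E[X]      (Theorem 2)
post_cov = np.linalg.inv(L)         # Cov(X, X) (Theorem 2)
```

For realistic values of H and R_C, one would exploit the sparsity of L rather than invert it densely, which is exactly what the next two subsections are about.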
We use the notation E[·] and Cov(·, ·) to indicate expectations and covariances calculated with the posterior distribution specified in Theorem 2. (Strictly speaking the notation should include a reference to the speaker s, but this will always be clear from the context.) All of the computations needed to implement the factor analysis model can be cast in terms of these posterior expectations and covariances if one bears in mind that correlations can be expressed in terms of covariances by relations such as

    E[zy*] = E[z] E[y*] + Cov(z, y)

etc. The next two sections, which give a detailed account of how to calculate these posterior expectations and covariances, can be skipped on a first reading.

Although the hidden variables in the factor analysis model are all assumed to be independent in the prior (5), this is not true in the posterior: the matrix L is sparse but L⁻¹ is not. It turns out however that the only entries in L⁻¹ whose values are actually needed to implement the model are those which correspond to non-zero entries in L. In the PCA case it is straightforward to calculate only the portion of L⁻¹ that is needed, but in the general case calculating more entries of L⁻¹ seems to be unavoidable. The calculations here involve inverting a large matrix, and this has important practical consequences if there is a large number of recordings for the speaker: the computation needed to evaluate the posterior is O(H) in the PCA case but O(H³) in the general case. We will return to this question in Section VII.

C. Evaluating the posterior in the general case

Let

    X_h = (x_h* y*)*  for h = 1, ..., H.

Dropping the reference to s for convenience, the posterior expectations and covariances we will need to evaluate are E[z], diag Cov(z, z) and, for h = 1, ..., H, E[X_h], Cov(X_h, X_h) and Cov(X_h, z). (Note that E[X] can be written down if E[X_h] is given for h = 1, ..., H.) We will also need to calculate the determinant of L in order to evaluate the formula for the complete likelihood function given in Theorem 3 below.

We will use the following identity, which holds for any symmetric positive definite matrix partitioned as

    [ α  β  ]
    [ β* γ ]

namely, with ζ = α − βγ⁻¹β*,

    [ α  β ]⁻¹   [ ζ⁻¹          −ζ⁻¹βγ⁻¹              ]
    [ β* γ ]   = [ −γ⁻¹β*ζ⁻¹    γ⁻¹ + γ⁻¹β*ζ⁻¹βγ⁻¹  ]    (9)

By Theorem 2, L = I + V*Σ⁻¹N(s)V, where V is given by (8). A straightforward calculation shows that L can be written as

    L = [ a_1                 b_1            c_1        ]
        [       ⋱             ⋮              ⋮          ]
        [            a_H      b_H            c_H        ]
        [ b_1* ⋯ b_H*    I + v*Σ⁻¹Nv    v*Σ⁻¹Nd    ]
        [ c_1* ⋯ c_H*    dΣ⁻¹Nv         I + Σ⁻¹Nd² ]

where N = N_1 + ⋯ + N_H and, for h = 1, ..., H,

    a_h = I + u*Σ⁻¹N_h u
    b_h = u*Σ⁻¹N_h v
    c_h = u*Σ⁻¹N_h d.

So taking

    α = [ a_1                 b_1          ]        β = [ c_1      ]
        [       ⋱             ⋮            ]            [ ⋮        ]
        [            a_H      b_H          ]            [ c_H      ]
        [ b_1* ⋯ b_H*    I + v*Σ⁻¹Nv  ]            [ v*Σ⁻¹Nd ]

    γ = I + Σ⁻¹Nd²

and ζ = α − βγ⁻¹β*, we have

    L⁻¹ = [ ζ⁻¹          −ζ⁻¹βγ⁻¹             ]
          [ −γ⁻¹β*ζ⁻¹    γ⁻¹ + γ⁻¹β*ζ⁻¹βγ⁻¹ ]

Since Cov(X, X) = L⁻¹, this enables us to write down the posterior covariances that we need. In order to calculate the posterior expectations, set F_h = F_h(s, m) for h = 1, ..., H and let F = F_1 + ⋯ + F_H. Recall that, by Theorem 2,

    E[X] = L⁻¹ V* Σ⁻¹ F(s, m)

and, by (8),

    V* Σ⁻¹ F(s, m) = ( u*Σ⁻¹F_1, ..., u*Σ⁻¹F_H, v*Σ⁻¹F, dΣ⁻¹F )*.

Hence

    E[X] = [ U                      ]
           [ −γ⁻¹β*U + γ⁻¹dΣ⁻¹F ]

where

    U = ζ⁻¹ ( u*Σ⁻¹F_1 − u*Σ⁻¹N_1 d²γ⁻¹Σ⁻¹F, ..., u*Σ⁻¹F_H − u*Σ⁻¹N_H d²γ⁻¹Σ⁻¹F, v*Σ⁻¹F − v*Σ⁻¹N d²γ⁻¹Σ⁻¹F )*.

This enables us to write down the posterior expectations that we need.

D. Evaluating the posterior in the PCA case

Continuing to suppress the reference to s, the only quantities that we need to evaluate in the PCA case are the determinant of L and, for h = 1, ..., H, E[X_h] and Cov(X_h, X_h). Inverting the matrix L reduces to inverting the sparse matrix K defined by

    K = [ a_1                 b_1          ]
        [       ⋱             ⋮            ]
        [            a_H      b_H          ]
        [ b_1* ⋯ b_H*    I + v*Σ⁻¹Nv  ]

where N = N_1 + ⋯ + N_H and, for h = 1, ..., H,

    a_h = I + u*Σ⁻¹N_h u
    b_h = u*Σ⁻¹N_h v.

Recall that if a positive definite matrix X is given, then a factorization of the form X = T*T, where T is an upper triangular matrix, is called a Cholesky decomposition of X; we will write T = X^{1/2} to indicate that these conditions are satisfied. One way of inverting K is to use the block Cholesky algorithm given in [9] to find an upper triangular block matrix T having the same sparsity structure as K such that K = T*T. The non-zero blocks of T are calculated as follows. For h = 1, ..., H:

    T_hh = K_hh^{1/2}
    T_{h,H+1} = (T_hh*)⁻¹ K_{h,H+1}

and

    T_{H+1,H+1} = ( K_{H+1,H+1} − Σ_{h=1}^{H} T_{h,H+1}* T_{h,H+1} )^{1/2}.

As in the previous section, set F_h = F_h(s, m) for h = 1, ..., H and set F = F_1 + ⋯ + F_H. By Theorem 2, the posterior expectation E[X] is given by solving the equation KU = B for U, where

    B = ( u*Σ⁻¹F_1, ..., u*Σ⁻¹F_H, v*Σ⁻¹F )*.

This can be solved by back substitution in the usual way. First solve the equation T*C = B for C, taking advantage of the sparsity of T:

    C_h = (T_hh*)⁻¹ B_h    (h = 1, ..., H)
    C_{H+1} = (T_{H+1,H+1}*)⁻¹ ( B_{H+1} − Σ_{h=1}^{H} T_{h,H+1}* C_h ).

Then solve the equation TU = C for U:

    U_{H+1} = T_{H+1,H+1}⁻¹ C_{H+1}
    U_h = T_hh⁻¹ ( C_h − T_{h,H+1} U_{H+1} )    (h = 1, ..., H).

As for the conditional covariances, they are given by

    Cov(X_h, X_h) = [ S_hh        S_{h,H+1}     ]
                    [ S_{H+1,h}   S_{H+1,H+1} ]

for h = 1, ..., H, where S is the inverse of K. Note that the only blocks of S that we need are those which correspond to the non-zero blocks of K. Although we had to calculate all of the blocks of ζ⁻¹ in the previous subsection, we do not have to calculate all of the blocks of K⁻¹. The blocks that we need can be calculated as follows:

    S_{H+1,H+1} = T_{H+1,H+1}⁻¹ (T_{H+1,H+1}⁻¹)*

and, for h = 1, ..., H,

    S_{h,H+1} = −T_hh⁻¹ T_{h,H+1} S_{H+1,H+1}
    S_hh = T_hh⁻¹ (T_hh⁻¹)* − T_hh⁻¹ T_{h,H+1} S_{H+1,h}.

Finally, the determinant of L is just the determinant of K, and this is given by

    |K| = ∏_{h=1}^{H+1} |T_hh|²

which is easily evaluated, since the determinant of a triangular matrix is just the product of its diagonal elements.

E. HMM adaptation

Suppose that we have a recording h of a speaker s and we wish to construct an HMM adapted to both the speaker and the recording. Set w = (u v) and X_h = (x_h* y*)*

so that (5) can be written in the form

    M_h = m + wX_h + dz.

Then the posterior distribution of M_h has mean

    M̂_h = m + wE[X_h] + dE[z]

and covariance matrix Cov(wX_h + dz, wX_h + dz). Let D_h be the diagonal matrix obtained by setting the off-diagonal entries of this covariance matrix to 0; that is,

    D_h = diag( w Cov(X_h, X_h) w* ) + 2 diag( w Cov(X_h, z) d ) + d diag( Cov(z, z) ) d.

For each mixture component c = 1, ..., C, let M̂_hc be the cth block of M̂_h and let D_hc be the cth block of D_h. Applying the Bayesian predictive classification principle as in [2], we can construct an HMM adapted to the given speaker and recording by assigning the mean vector M̂_hc and the diagonal covariance matrix Σ_c + D_hc to the mixture component c, for c = 1, ..., C. Whether this type of variance adaptation is useful in practice is a matter for experimentation; our experience in [2] suggests that simply copying the variances from a speaker- and channel-independent HMM may give better results.

As we mentioned in Section II-D, the posterior expectations and covariances needed to implement MAP HMM adaptation can be calculated either in batch mode or sequentially; that is, using the speaker-independent hyperparameters (5) and all of the recordings of the speaker, or using the speaker-dependent hyperparameters (7) and only the current recording h. In the classical MAP case (u = 0 and v = 0), the calculations in Section III-C above show that, if MAP adaptation is carried out in batch mode, then for each recording h,

    M̂_h = m + ( I + d²Σ⁻¹N(s) )⁻¹ d²Σ⁻¹ F(s, m)

where

    N(s) = Σ_{h=1}^{H} N_h(s),  F(s, m) = Σ_{h=1}^{H} F_h(s, m).

Since u = 0, the posterior distribution of M_h is the same for all recordings h (channel effects are not modeled in classical MAP). Denoting the common value of M̂_h for all recordings by M̂ and writing r = d⁻²Σ, we obtain

    M̂ = m + ( r + N(s) )⁻¹ F(s, m).

The diagonal entries of r are referred to as relevance factors in [6]. They can be interpreted by observing that if, for each i = 1, ..., CF, m_i denotes the ith entry of m (and similarly for r_i and M_i), then the MAP estimate of M_i is the same as the maximum likelihood estimate calculated by treating m_i as an extra observation of M_i occurring with frequency r_i. Relevance factors are usually estimated empirically, after being tied across all mixture components and acoustic feature dimensions. However, the hyperparameter estimation algorithms that we will develop enable us to bring the maximum likelihood principle to bear on the problem of estimating d and Σ, so they provide a principled way of estimating relevance factors without any tying.

F. The marginal distribution of the observable variables

Here we show how to evaluate the likelihood function P_Λ(𝒳(s)) defined by (6). This is the complete likelihood function for the factor analysis model, in the sense in which this term is used in the EM literature. Its primary role is to serve as a diagnostic for verifying the implementation of the EM algorithms for hyperparameter estimation that we will present. It also serves as the basis for constructing the likelihood ratio statistics used for speaker verification in [8]. The proof of Theorem 3 is formally identical to the proof of Proposition 2 in [2].

Theorem 3: For each speaker s,

    log P_Λ(𝒳(s)) = G_Σ(s, m) − ½ log |L| + ½ E[X*(s)] V* Σ⁻¹ F(s, m).

Proof: By (6),

    P_Λ(𝒳(s)) = ∫ P_Λ(𝒳(s)|X) N(X|0, I) dX

and, by Theorem 1, we can write this as

    log P_Λ(𝒳(s)) = G_Σ(s, m) + log ∫ exp( H_Λ(s, X) ) N(X|0, I) dX

where

    H_Λ(s, X) = X* V* Σ⁻¹ F(s, m) − ½ X* V* Σ⁻¹ N(s) V X

so that

    H_Λ(s, X) − ½ X* X = X* V* Σ⁻¹ F(s, m) − ½ X* L X.

To evaluate the integral we appeal to the formula for the Fourier-Laplace transform of the Gaussian kernel [10]. If N(X|0, L⁻¹) denotes the Gaussian kernel with mean 0 and covariance matrix L⁻¹, then

    ∫ exp( X* V* Σ⁻¹ F(s, m) ) N(X|0, L⁻¹) dX = exp( ½ F*(s, m) Σ⁻¹ V L⁻¹ V* Σ⁻¹ F(s, m) )

so that

    ∫ exp( H_Λ(s, X) ) N(X|0, I) dX = exp( −½ log |L| + ½ F*(s, m) Σ⁻¹ V L⁻¹ V* Σ⁻¹ F(s, m) ).

By Theorem 2, the latter expression can be written as

    exp( −½ log |L| + ½ E[X*] V* Σ⁻¹ F(s, m) )

9 DRAFT VERSION JANUARY 3, o a = diag NE [zz ] 4 log P Λ X = G Σ, m log L 2 a required + 2 E [X] V Σ F, m IV MAXIMUM LIKELIHOOD HYPERPARAMETER ESTIMATION We now turn to the problem of etimating the hyperparameter et Λ from a training et compriing everal peaker with multiple recording for each peaker uch a one of the Switchboard databae We continue to aume that, for each peaker and recording h, all frame have been aligned with mixture component Both of the etimation algorithm that we preent in thi and in the next ection are EM algorithm which when applied iteratively produce a equence of etimate of Λ having the property that the complete likelihood of the training data, namely log P Λ X range over the training peaker, increae from one iteration to the next The major difference between the two approache i that we will impoe contraint on the peaker and channel pace in the next ection but not in thi one Accordingly we will refer to the approach preented in thi ection imply a maximum likelihood etimation Since our concern i with peaker-independent etimation of Λ, it i generally reaonable to etimate m by peakerindependent Baum-Welch training in the uual way o will concentrate on the problem of how to etimate the remaining hyperparameter if m i given The baic idea in the maximum likelihood etimation algorithm i to extend the argument ued to prove Propoition 3 of [2] to handle u and d a well a v and Σ Thi entail accumulating the following tatitic over the training et on each iteration the poterior expectation are calculated uing the current etimate of the hyperparameter, range over the training peaker and, for each peaker, the recording are labeled h =,, H Set xh X h = y for each peaker and recording h The accumulator are N c = A c = B = C = H H H H N hc 0 N hc E [X h X h ] N h E [zx h ] 2 F h, me [X h ] 3 b = diag F, me [z ] 5 In 0 and, c range over all mixture component and the quantitie N and F, m in 4 and 5 are defined by N = F, m = H H N h F h, m for each training 
Theorem 4: Let m be given. Suppose we have a hyperparameter set of the form (m, u_0, v_0, d_0, Σ_0) and we use it to calculate the accumulators (10)-(15). Define a new hyperparameter set (m, u, v, d, Σ) as follows.

(i) Set w = (u v). For each mixture component c = 1, ..., C and for each f = 1, ..., F, set i = (c − 1)F + f and let w_i denote the ith row of w and d_i the ith entry of d. Then w_i and d_i are defined by the equation

  (w_i  d_i) (A_c  B_i^*; B_i  a_i) = (C_i  b_i)    (16)

where B_i is the ith row of B, a_i is the ith entry of a, C_i is the ith row of C and b_i is the ith entry of b.

(ii) Let M be the diagonal CF × CF matrix given by

  M = diag(C w^* + b d).

Set

  Σ = N^{-1} (sum_s S(s, m) − M)    (17)

where N is the CF × CF diagonal matrix whose diagonal blocks are N_1 I, ..., N_C I and

  S(s, m) = sum_{h=1}^{H_s} S_h(s, m)    (18)

for each training speaker s. Then if Λ_0 = (m, u_0, v_0, d_0, Σ_0) and Λ = (m, u, v, d, Σ),

  sum_s log P_Λ(X(s)) ≥ sum_s log P_{Λ_0}(X(s))

where s ranges over the training speakers.

Before turning to the proof of this theorem, note that (16) is a low dimensional system of equations (R_S + R_C + 1 equations in R_S + R_C + 1 unknowns, where R_S is the number of speaker factors and R_C the number of channel factors), so there is no difficulty in solving it in practice. In fact, a very efficient algorithm can be developed by first calculating the Cholesky decomposition of the (R_S + R_C) × (R_S + R_C) matrix A_c and using this to calculate the Cholesky decompositions of the matrices

  (A_c  B_i^*; B_i  a_i)

for f = 1, ..., F.
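For one supervector dimension i, equation (16) is just an (R_S + R_C + 1)-dimensional linear system. The sketch below uses a direct symmetric solve in place of the Cholesky-based scheme described above; the variable names and shapes are our own, not from the paper:

```python
import numpy as np

def update_row(A_c, B_i, a_i, C_i, b_i):
    """Solve (w_i d_i) [A_c B_i*; B_i a_i] = (C_i b_i) for one dimension i.

    A_c : (R, R) accumulator (11), symmetric
    B_i : (R,)  ith row of accumulator (12)
    a_i : scalar, ith entry of accumulator (14)
    C_i : (R,)  ith row of accumulator (13)
    b_i : scalar, ith entry of accumulator (15)
    """
    R = A_c.shape[0]
    K = np.empty((R + 1, R + 1))
    K[:R, :R] = A_c
    K[:R, R] = B_i          # B_i^* as the last column
    K[R, :R] = B_i          # B_i as the last row
    K[R, R] = a_i
    rhs = np.append(C_i, b_i)
    # (w_i d_i) K = rhs is equivalent to K x = rhs since K is symmetric
    sol = np.linalg.solve(K, rhs)
    return sol[:R], sol[R]  # w_i, d_i
```

A production implementation would factor A_c once per mixture component (e.g. with `scipy.linalg.cho_factor`) and update the factorization for each of the F rows sharing that component, as the text suggests.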

There is a further simplification in the PCA case, since only the equations w_i A_c = C_i need to be solved. In our experience, calculating the posterior expectations needed to accumulate the statistics (10)-(15) accounts for almost all of the computation needed to implement the algorithm. As we have seen, calculating posterior expectations in the PCA case is much less expensive than in the general case; furthermore, since z(s) plays no role in this case, the accumulators (12), (14) and (15) are not needed. Assuming that the accumulators are stored on disk rather than in memory, this results in a 50% reduction in memory requirements in the PCA case.

Proof: We begin by constructing an EM auxiliary function with the X(s)'s as hidden variables. By Jensen's inequality,

  ∫ log (P_Λ(X(s), X) / P_{Λ_0}(X(s), X)) P_{Λ_0}(X | X(s)) dX ≤ log ∫ (P_Λ(X(s), X) / P_{Λ_0}(X(s), X)) P_{Λ_0}(X | X(s)) dX.

Since the right hand side of this inequality simplifies to log P_Λ(X(s)) − log P_{Λ_0}(X(s)), the total log likelihood of the training data can be increased by choosing the new estimates of the hyperparameters so as to make the left hand side positive. Since, for each speaker s,

  P_Λ(X(s), X) = P_Λ(X(s) | X) N(X | 0, I)

and similarly for P_{Λ_0}(X(s), X), the left hand side can be written as A_Λ(X(s)) − A_{Λ_0}(X(s)) where

  A_Λ(X(s)) = ∫ log P_Λ(X(s) | X) P_{Λ_0}(X | X(s)) dX.

Thus we can ensure that sum_s log P_Λ(X(s)) ≥ sum_s log P_{Λ_0}(X(s)) by choosing Λ so that sum_s (A_Λ(X(s)) − A_{Λ_0}(X(s))) ≥ 0, and this can be achieved by maximizing sum_s A_Λ(X(s)) with respect to Λ. We refer to this as the auxiliary function.

We derive (16) and (17) by calculating the gradients of the auxiliary function with respect to w, d and Σ^{-1} and setting these to zero. Observe that, for each speaker s, A_Λ(X(s)) = E[log P_Λ(X(s) | X(s))]. So, by Theorem 1, the auxiliary function can be written as

  sum_s (G_Σ(s, m) + E[H_Λ(s, X(s))]).    (19)

The first term here is independent of w and d, and the second term can be expressed in terms of w, d and Σ as follows:

  sum_s E[H_Λ(s, X(s))]
    = sum_s E[X^*(s) V^* Σ^{-1} F(s, m) − (1/2) X^*(s) V^* N(s) Σ^{-1} V X(s)]
    = sum_s tr(Σ^{-1} (F(s, m) E[X^*(s)] V^* − (1/2) N(s) V E[X(s) X^*(s)] V^*))
    = sum_s tr(Σ^{-1} (F(s, m) E[O^*(s)] − (1/2) N(s) E[O(s) O^*(s)]))
    = sum_s sum_{h=1}^{H_s} tr(Σ^{-1} (F_h(s, m) E[O_h^*(s)] − (1/2) N_h(s) E[O_h(s) O_h^*(s)]))

where

  O_h(s) = w X_h(s) + d z(s)    (20)

for each speaker s and recording h. Setting the gradient of the auxiliary function with respect to w equal to zero gives

  sum_s sum_{h=1}^{H_s} N_h(s) (w E[X_h(s) X_h^*(s)] + d E[z(s) X_h^*(s)]) = sum_s sum_{h=1}^{H_s} F_h(s, m) E[X_h^*(s)]    (21)

and setting the gradient with respect to d equal to zero implies that the diagonal of

  sum_s sum_{h=1}^{H_s} N_h(s) (w E[X_h(s) z^*(s)] + d E[z(s) z^*(s)])    (22)

is equal to the diagonal of

  sum_s F(s, m) E[z^*(s)].    (23)

For each mixture component c = 1, ..., C and for each f = 1, ..., F, set i = (c − 1)F + f. Equating the ith row of the left-hand side of (21) with the ith row of the right-hand side gives

  w_i A_c + d_i B_i = C_i.    (24)

Since N_h(s) is diagonal, the diagonal of

  sum_s sum_{h=1}^{H_s} N_h(s) w E[X_h(s) z^*(s)]

is the same as the diagonal of

  sum_s sum_{h=1}^{H_s} w E[X_h(s) z^*(s)] N_h(s),

which is to say w B^*. The ith diagonal entry of this is w_i B_i^* (that is, the product of w_i and the transpose of the ith row of B), so equating the ith diagonal entry of (22) with the ith diagonal entry of (23) gives

  w_i B_i^* + d_i a_i = b_i.    (25)

Combining (24) and (25) gives (16):

  (w_i  d_i) (A_c  B_i^*; B_i  a_i) = (C_i  b_i).

In order to derive (17) it is more convenient to calculate the gradient of the auxiliary function with respect to Σ^{-1} than Σ. Note first that if we postmultiply the left-hand side of (21) by w^* and the left-hand side of (22) by d^* and sum the results, we obtain

  sum_s sum_{h=1}^{H_s} N_h(s) E[O_h(s) O_h^*(s)]

by the definition of O_h(s). On the other hand, if we postmultiply the right-hand side of (21) by w^* and (23) by d^* and sum the results, we obtain

  sum_s sum_{h=1}^{H_s} F_h(s, m) E[O_h^*(s)].

Since the diagonals of (22) and (23) are equal, it follows that

  diag(sum_s sum_{h=1}^{H_s} N_h(s) E[O_h(s) O_h^*(s)]) = diag(sum_s sum_{h=1}^{H_s} F_h(s, m) E[O_h^*(s)])    (26)

at a critical point of the auxiliary function. Using the expression (20) to calculate the derivative of the second term of (19) with respect to Σ^{-1} in the direction of e, where e is any diagonal matrix having the same dimensions as Σ, we obtain

  sum_s sum_{h=1}^{H_s} tr((F_h(s, m) E[O_h^*(s)] − (1/2) N_h(s) E[O_h(s) O_h^*(s)]) e).

By (26) this simplifies to

  (1/2) sum_s sum_{h=1}^{H_s} tr(diag(F_h(s, m) E[O_h^*(s)]) e) = (1/2) tr(M e).    (27)

It remains to calculate the gradient of the first term of (19) with respect to Σ^{-1}. Note that, for each c = 1, ..., C, the derivative of log |Σ_c| with respect to Σ_c^{-1} in the direction of e_c is −tr(Σ_c e_c). Hence, for each speaker s, the derivative of G_Σ(s, m) with respect to Σ^{-1} in the direction of e is

  −(1/2) sum_{h=1}^{H_s} sum_{c=1}^{C} tr((N_{hc}(s) Σ_c − S_{hc}(s, m)) e_c)
    = −(1/2) sum_{h=1}^{H_s} tr((N_h(s) Σ − S_h(s, m)) e).    (28)

Combining (27) and (28), we obtain the derivative of the auxiliary function with respect to Σ^{-1} in the direction of e, namely

  −(1/2) tr((N Σ − sum_s S(s, m) + M) e).

In order for this to be equal to zero for all e we must have

  Σ = N^{-1} (sum_s S(s, m) − M),

which is (17). The estimates for the covariance matrices can be shown to be positive definite with a little bit of extra work.

Baum-Welch training is an appropriate way of estimating the supervector m provided that a speaker- and channel-independent HMM (rather than the speaker- and channel-adapted HMMs of Section III-E above) is used to align the training data and the training set is the same as that used to estimate the other speaker-independent hyperparameters, namely u, v, d and Σ. However, even in this situation, it would be more natural to estimate m using the same criterion as the other hyperparameters. Theorem 4 could be extended to estimate all 5 hyperparameters simultaneously by means of a standard trick, namely setting V = (v  m) and Y(s) = (y(s); 1) for each speaker s, so that m + v y(s) = V Y(s), thereby eliminating m. Of course this requires re-deriving the estimation formulas in the statement of Theorem 4. A simpler if less efficient way of dealing with this problem is to derive an EM algorithm for estimating m on the assumption that the other hyperparameters are given. By alternating between the EM algorithms of Theorems 4 and 5 on successive iterations, the entire hyperparameter set (m, u, v, d, Σ) can be estimated in a consistent way.

Theorem 5: Let u, v, d and Σ be given. Suppose we have a hyperparameter set of the form (m_0, u, v, d, Σ) and we use it to calculate the statistics O(s) defined by

  O(s) = sum_{h=1}^{H_s} N_h(s) E[O_h(s)]

where s ranges over the training speakers and O_h(s) = w X_h(s) + d z(s) for each recording h. For each c = 1, ..., C, let

  m_c = N_c^{-1} sum_s (sum_{h=1}^{H_s} F_{hc}(s, 0) − O_c(s))    (29)

where O_c(s) is the cth block of O(s). Let m be the supervector obtained by concatenating m_1, ..., m_C. Then if Λ_0 = (m_0, u, v, d, Σ) and Λ = (m, u, v, d, Σ),

  sum_s log P_Λ(X(s)) ≥ sum_s log P_{Λ_0}(X(s))

where s ranges over the training speakers.

Proof: We use the same auxiliary function as in the proof of the previous theorem but regard it as a function of m rather than as a function of u, v, d and Σ. Setting the gradient of the auxiliary function with respect to m equal to zero gives

  sum_s sum_{h=1}^{H_s} F_h(s, m) = sum_s sum_{h=1}^{H_s} N_h(s) E[O_h(s)].

Solving this equation for m gives (29).

It can be shown that at a fixed point of the maximum likelihood estimation algorithm the following condition must hold:

  (1/S) sum_s E[y(s) y^*(s)] = I

where S is the number of speakers in the training set and the sum extends over all training speakers. This is to be expected since, for each speaker s, the prior distribution of y(s) is the standard normal distribution. Similar conditions apply to the other hidden variables. However, our experience with the maximum likelihood estimation algorithm has been that, no matter how often it is iterated on a training set, this condition is never satisfied. The diagonal entries of the left hand side of this equation are always much less than unity in practice and, to compensate for this, the eigenvalues of v v^* are larger than seems reasonable. (The eigenvalues of v v^* can be calculated by observing that they are the same as the eigenvalues of the low dimensional matrix v^* v.) On the other hand, when the HMM likelihoods of the training set are calculated with speaker- and channel-adapted HMMs derived from maximum likelihood estimates of the speaker-independent hyperparameters, the results do seem to be reasonable, and this suggests that the maximum likelihood estimation algorithm does a fair job of estimating the orientation of the speaker and channel spaces. Thus the eigenvectors of v v^* seem to be well estimated, if not the eigenvalues. We do not have any explanation for this anomaly, but we mention it because it suggests that other approaches to hyperparameter estimation might be worth studying.
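The supervector update (29) reduces to simple arithmetic once the sums over speakers and recordings have been accumulated. A minimal sketch for one mixture component, with array names of our own choosing:

```python
import numpy as np

def update_m_block(N_c, F0_c_sum, O_c_sum):
    """Theorem 5, eq. (29): m_c = (sum_s sum_h F_hc(s, 0) - sum_s O_c(s)) / N_c.

    N_c      : scalar, total occupation count of mixture component c (eq. 10)
    F0_c_sum : (F,) first-order statistics for component c, centred on 0,
               summed over all speakers and recordings
    O_c_sum  : (F,) sum over speakers of the cth block of
               O(s) = sum_h N_h(s) E[O_h(s)]
    """
    return (F0_c_sum - O_c_sum) / N_c
```

After the update, F0_c_sum − N_c m_c equals O_c_sum, which is exactly the zero-gradient condition sum_s sum_h F_h(s, m) = sum_s sum_h N_h(s) E[O_h(s)] used in the proof, restricted to component c.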
V. MINIMUM DIVERGENCE HYPERPARAMETER ESTIMATION

In this section we use the methods introduced in [7] to study the problem of estimating the hyperparameters in situations where the orientations of the speaker space and the channel space are known. An example of how this type of situation can arise was given at the end of the last section; we will give other examples later.

Throughout this section we fix a hyperparameter set (m_0, u_0, v_0, d_0, Σ_0), which we denote by Λ_0. Let C_0 denote the range of u_0 (the channel space) and let S_0 denote the range of v_0 translated by m_0 (the speaker space). Consider a model of the form

  M(s) = m_0 + v_0 y(s) + d_0 z(s)
  M_h(s) = M(s) + u_0 x_h(s)    (30)

where, instead of assuming that y(s), z(s) and x_h(s) have standard normal distributions, we assume that y(s) is normally distributed with mean μ_y and covariance matrix K_yy, z(s) is normally distributed with mean μ_z and diagonal covariance matrix K_zz, and x_h(s) is normally distributed with mean 0 and covariance matrix K_xx. Set λ = (μ_y, μ_z, K_xx, K_yy, K_zz, Σ). This model is easily seen to be equivalent to the model

  M(s) = m + v y(s) + d z(s)
  M_h(s) = M(s) + u x_h(s)

where x_h(s), y(s) and z(s) have standard normal distributions and

  m = m_0 + v_0 μ_y + d_0 μ_z
  u = u_0 K_xx^{1/2}
  v = v_0 K_yy^{1/2}
  d = d_0 K_zz^{1/2}.

Set Λ = (m, u, v, d, Σ). The channel space for the factor analysis model defined by Λ is C_0 and the speaker space is parallel to S_0. Conversely, any hyperparameter set that satisfies these conditions arises in this way from some sextuple λ. Note in particular that if λ_0 = (0, 0, I, I, I, Σ_0) then the hyperparameter set corresponding to λ_0 is just Λ_0. Thus the problem of estimating a hyperparameter set subject to the constraints on the speaker and channel spaces can be formulated in terms of estimating the sextuple λ.

We need to introduce some notation in order to do likelihood calculations with the model defined by (30). For each speaker s, let Q_λ(X) be the distribution on X defined by λ. That is, Q_λ(X) is the normal distribution with mean μ and covariance matrix K, where μ is the vector whose blocks are (0, ..., 0, μ_y, μ_z) and K is the block diagonal matrix whose diagonal blocks are K_xx, ..., K_xx, K_yy, K_zz. The distribution Q_λ(X) gives rise to a distribution Q_λ(M) for the random vector M defined by M = (M_1(s); ...; M_{H_s}(s)), where M_h(s) is given by (30) for h = 1, ..., H_s. This distribution is the image of the distribution Q_λ(X) under the linear transformation T_{Λ_0} defined by

  T_{Λ_0}(x_1; ...; x_{H_s}; y; z) = (M_1; ...; M_{H_s}).
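The correspondence between a sextuple λ and the equivalent standard-normal hyperparameter set can be written down directly. In this sketch (our own notation, not the paper's code: d_0 and K_zz are stored as vectors of diagonal entries, and the Cholesky factor stands in for the matrix square root, which is one valid choice):

```python
import numpy as np

def reparameterize(m0, u0, v0, d0, mu_y, mu_z, Kxx, Kyy, Kzz_diag):
    """Map lambda = (mu_y, mu_z, Kxx, Kyy, Kzz, Sigma) onto the equivalent
    hyperparameter set (m, u, v, d) with standard-normal hidden factors,
    preserving the channel space of u0 and keeping the speaker space
    parallel to that of v0 (Section V)."""
    m = m0 + v0 @ mu_y + d0 * mu_z          # m = m0 + v0 mu_y + d0 mu_z
    u = u0 @ np.linalg.cholesky(Kxx)        # u = u0 Kxx^{1/2}
    v = v0 @ np.linalg.cholesky(Kyy)        # v = v0 Kyy^{1/2}
    d = d0 * np.sqrt(Kzz_diag)              # d = d0 Kzz^{1/2} (diagonal)
    return m, u, v, d
```

Because any square root of K_yy will do, the resulting v is determined only up to a rotation of the speaker factors; the covariance v v^* = v_0 K_yy v_0^* of the speaker term is what the reparameterization preserves.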

In other words, Q_λ(M) is the normal distribution with mean T_{Λ_0} μ and covariance matrix T_{Λ_0} K T_{Λ_0}^*. The likelihood of X(s) can be calculated using Q_λ(M) as follows:

  Q_λ(X(s)) = ∫ P_Σ(X(s) | M) Q_λ(M) dM

where we have used the notation P_Σ(X(s) | M) to emphasize the fact that Σ is the only hyperparameter that plays a role in calculating the conditional distribution of X(s) given that M = M. Of course P_Λ(X(s)) = Q_λ(X(s)), where Λ is the hyperparameter set corresponding to λ, so we also have

  P_Λ(X(s)) = ∫ P_Σ(X(s) | M) Q_λ(M) dM.    (31)

We will derive the minimum divergence estimation procedure from the following theorem.

Theorem 6: Suppose λ satisfies the following conditions:

  sum_s D(P_{Λ_0}(X | X(s)) || Q_λ(X)) ≤ sum_s D(P_{Λ_0}(X | X(s)) || Q_{λ_0}(X)),

where D indicates the Kullback-Leibler divergence, and

  sum_s E[log P_Σ(X(s) | M)] ≥ sum_s E[log P_{Σ_0}(X(s) | M)],

where s ranges over the training speakers and E[·] indicates a posterior expectation calculated with the initial hyperparameter set Λ_0. Then

  sum_s log P_Λ(X(s)) ≥ sum_s log P_{Λ_0}(X(s))

where s ranges over the training speakers.

In order to prove this theorem we need two lemmas.

Lemma 1: For any speaker s,

  log (P_Λ(X(s)) / P_{Λ_0}(X(s))) ≥ E[log P_Σ(X(s) | M)] − E[log P_{Σ_0}(X(s) | M)] + D(Q_{λ_0}(M | X(s)) || Q_{λ_0}(M)) − D(Q_{λ_0}(M | X(s)) || Q_λ(M)).

Proof: This follows from Jensen's inequality,

  ∫ log U(M) Q_{λ_0}(M | X(s)) dM ≤ log ∫ U(M) Q_{λ_0}(M | X(s)) dM,

where

  U(M) = (P_Σ(X(s) | M) Q_λ(M)) / (P_{Σ_0}(X(s) | M) Q_{λ_0}(M)).

Note that

  Q_{λ_0}(M | X(s)) = P_{Σ_0}(X(s) | M) Q_{λ_0}(M) / P_{Λ_0}(X(s)),

so that

  U(M) Q_{λ_0}(M | X(s)) = P_Σ(X(s) | M) Q_λ(M) / P_{Λ_0}(X(s)).

By (31), the right hand side of the inequality therefore simplifies as follows:

  ∫ U(M) Q_{λ_0}(M | X(s)) dM = (1 / P_{Λ_0}(X(s))) ∫ P_Σ(X(s) | M) Q_λ(M) dM = P_Λ(X(s)) / P_{Λ_0}(X(s)),

while the left hand side expands into the four terms on the right hand side of the statement of the lemma, and the result follows immediately.

Note that Lemma 1 refers to divergences between distributions on M whereas the statement of Theorem 6 refers to divergences between distributions on X. These two types of divergence are related as follows.

Lemma 2: For each speaker s,

  D(P_{Λ_0}(X | X(s)) || Q_λ(X)) = D(Q_{λ_0}(M | X(s)) || Q_λ(M)) + ∫ D(Q_{λ_0}(X | M) || Q_λ(X | M)) Q_{λ_0}(M | X(s)) dM.

Proof: The chain rule for divergences states that, for any distributions π(X, M) and ρ(X, M),

  D(π(X, M) || ρ(X, M)) = D(π(M) || ρ(M)) + ∫ D(π(X | M) || ρ(X | M)) π(M) dM.

The result follows by applying the chain rule to the distributions defined by setting

  π(X, M) = P_{Λ_0}(X | X(s)) δ_{T_{Λ_0} X}(M),
  ρ(X, M) = Q_λ(X) δ_{T_{Λ_0} X}(M),

where δ denotes the Dirac delta function.

Since divergences are non-negative, it follows from Lemma 2 that

  D(P_{Λ_0}(X | X(s)) || Q_λ(X)) ≥ D(Q_{λ_0}(M | X(s)) || Q_λ(M)).    (32)

This is a data processing inequality in the sense in which this term is used in information theory, where the processing is the operation of transforming X into M by T_{Λ_0}.

In the case where d = 0 and the CF × R_S matrix v_0 is of rank R_S, the transformation T_{Λ_0} is injective. Thus, for each M, the distributions Q_{λ_0}(X | M) and Q_λ(X | M) are point distributions concentrated on the same point, so their divergence is 0 and the inequality (32) becomes an equality in this case: no information is lost. These divergences also vanish in the case λ = λ_0, so that

  D(P_{Λ_0}(X | X(s)) || Q_{λ_0}(X)) = D(Q_{λ_0}(M | X(s)) || Q_{λ_0}(M))    (33)

irrespective of whether or not d = 0.

We can now prove the assertion of Theorem 6, namely that sum_s log P_Λ(X(s)) ≥ sum_s log P_{Λ_0}(X(s)), where the sums extend over all speakers in the training set. Note that

  sum_s (E[log P_Σ(X(s) | M)] − D(Q_{λ_0}(M | X(s)) || Q_λ(M)))
    ≥ sum_s (E[log P_Σ(X(s) | M)] − D(P_{Λ_0}(X | X(s)) || Q_λ(X)))
    ≥ sum_s (E[log P_{Σ_0}(X(s) | M)] − D(P_{Λ_0}(X | X(s)) || Q_{λ_0}(X)))
    = sum_s (E[log P_{Σ_0}(X(s) | M)] − D(Q_{λ_0}(M | X(s)) || Q_{λ_0}(M))).

The first inequality here follows from (32), the second from the hypotheses of the theorem, and the concluding equality from (33). The result now follows from Lemma 1.

Theorem 6 guarantees that we can find a hyperparameter set Λ which respects the subspace constraints and fits the training data better than Λ_0 by choosing λ so as to minimize sum_s D(P_{Λ_0}(X | X(s)) || Q_λ(X)) and maximize sum_s E[log P_Σ(X(s) | M)]. This leads to the following set of re-estimation formulas.

Theorem 7: Define a new hyperparameter set Λ by setting Λ = (m, u, v, d, Σ) where

  m = m_0 + v_0 μ_y + d_0 μ_z    (34)
  u = u_0 K_xx^{1/2}    (35)
  v = v_0 K_yy^{1/2}    (36)
  d = d_0 K_zz^{1/2}    (37)
  Σ = N^{-1} sum_s sum_{h=1}^{H_s} (S_h(s, m_0) − 2 diag(F_h(s, m_0) E[O_h^*(s)]) + diag(E[O_h(s) O_h^*(s)] N_h(s))).    (38)

Here O_h(s) = M_h(s) − m_0 for each speaker s and recording h, and

  μ_y = (1/S) sum_s E[y(s)]
  μ_z = (1/S) sum_s E[z(s)]
  K_xx = (1/H) sum_s sum_{h=1}^{H_s} E[x_h(s) x_h^*(s)]
  K_yy = (1/S) sum_s E[y(s) y^*(s)] − μ_y μ_y^*
  K_zz = diag((1/S) sum_s E[z(s) z^*(s)] − μ_z μ_z^*)

where N is as in Theorem 4, H = sum_s H_s, S is the number of training speakers, the sums extend over all speakers in the training set, and the posterior expectations are calculated using the hyperparameter set Λ_0. Then λ satisfies the conditions of Theorem 6, so that

  sum_s log P_Λ(X(s)) ≥ sum_s log P_{Λ_0}(X(s))

where s ranges over the training speakers.
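The moment estimates of Theorem 7 are simple averages of posterior moments. A sketch, assuming the per-speaker posterior moments have already been collected into arrays (the shapes and names are our assumptions, not the paper's):

```python
import numpy as np

def divergence_updates(Ey, Eyy, Ez, Ezz_diag, Exx_list):
    """Minimum-divergence moment estimates of Theorem 7.

    Ey       : (S, Rs)      E[y(s)] per speaker
    Eyy      : (S, Rs, Rs)  E[y(s) y*(s)] per speaker
    Ez       : (S, CF)      E[z(s)] per speaker
    Ezz_diag : (S, CF)      diagonals of E[z(s) z*(s)] per speaker
    Exx_list : list over all H recordings of (Rc, Rc) E[x_h(s) x_h*(s)]
    """
    mu_y = Ey.mean(axis=0)                            # (1/S) sum_s E[y(s)]
    mu_z = Ez.mean(axis=0)                            # (1/S) sum_s E[z(s)]
    Kyy = Eyy.mean(axis=0) - np.outer(mu_y, mu_y)     # second moment minus mean
    Kzz_diag = Ezz_diag.mean(axis=0) - mu_z ** 2      # diagonal case
    Kxx = sum(Exx_list) / len(Exx_list)               # (1/H) sum over recordings
    return mu_y, mu_z, Kxx, Kyy, Kzz_diag
```

These estimates would then be mapped back onto (m, u, v, d) through (34)-(37), e.g. via Cholesky factors of K_xx and K_yy as matrix square roots.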
Proof: In order to minimize sum_s D(P_{Λ_0}(X | X(s)) || Q_λ(X)), we use the formula for the divergence of two Gaussian distributions [7]. For each speaker s, P_{Λ_0}(X | X(s)) is the Gaussian distribution with mean E[X(s)] and covariance matrix L_s^{-1}, by Theorem 2. So if λ = (μ_y, μ_z, K_xx, K_yy, K_zz, Σ),

  D(P_{Λ_0}(X | X(s)) || Q_λ(X)) = (1/2) log (|K| / |L_s^{-1}|) + (1/2) tr(K^{-1} (L_s^{-1} + (E[X(s)] − μ)(E[X(s)] − μ)^*)) − (1/2) (R_S + H_s R_C + CF)    (39)

where μ is the vector whose blocks are (0, ..., 0, μ_y, μ_z) (the argument s is needed to indicate that there are H_s repetitions of 0) and K is the block diagonal matrix whose diagonal blocks are K_xx, ..., K_xx, K_yy, K_zz. The formulas (34)-(37) are derived by differentiating (39) with respect to μ_y, μ_z, K_xx^{1/2}, K_yy^{1/2} and K_zz^{1/2} and setting the derivatives to zero. In order to maximize sum_s E[log P_Σ(X(s) | M)],

observe that, by the argument used in proving Theorem 1,

  log P_Σ(X_h(s) | M_h) = sum_{c=1}^{C} N_{hc}(s) log (1 / ((2π)^{F/2} |Σ_c|^{1/2})) − (1/2) tr(Σ^{-1} S_h(s, m_0)) + tr(Σ^{-1} F_h(s, m_0) O_h^*) − (1/2) tr(Σ^{-1} O_h O_h^* N_h(s))

for each speaker s and recording h. Thus

  sum_s E[log P_Σ(X(s) | M)] = sum_s sum_{h=1}^{H_s} (sum_{c=1}^{C} N_{hc}(s) log (1 / ((2π)^{F/2} |Σ_c|^{1/2})) − (1/2) tr(Σ^{-1} S_h(s, m_0)) + tr(Σ^{-1} F_h(s, m_0) E[O_h^*(s)]) − (1/2) tr(Σ^{-1} E[O_h(s) O_h^*(s)] N_h(s)))

and (38) follows by differentiating this with respect to Σ^{-1} and setting the derivative to zero.

VI. ADAPTING THE FACTOR ANALYSIS MODEL FROM ONE SPEAKER POPULATION TO ANOTHER

We have assumed so far that we have at our disposal a training set in which there are multiple recordings of each speaker. If the training set does not have this property, then the speaker-independent hyperparameter estimation algorithms given in the preceding sections cannot be expected to give reasonable results. To take an extreme example, consider the case where we have just one recording per speaker, as in the enrollment data for the target speakers in the current NIST speaker verification evaluations. It is impossible to distinguish between speaker and channel effects in this situation.

We have attempted to deal with this problem by using an ancillary training set in which speakers are recorded in multiple sessions in order to model channel effects. The idea is to first estimate a full set of hyperparameters m, u, v, d and Σ on the ancillary training set and then, holding u and Σ fixed, re-estimate m, v and d on the original training set. In other words, we keep the hyperparameters associated with the channel space fixed and re-estimate only the hyperparameters associated with the speaker space. This can be done using either the maximum likelihood or the divergence minimization approach, but our experience with NIST data sets [8] has been that only the latter approach is effective. In other words, it is necessary to keep the orientation of the speaker space fixed as well as that of the channel space (rather than change it to fit the target speaker population) in order to avoid overtraining on the very limited amount of enrollment data provided by NIST. The maximum likelihood approach
entails accumulating the following statistics over the original training set on each iteration, in addition to the statistics a and b defined by (14) and (15) (the posterior expectations are calculated using the current values of the hyperparameters and s ranges over the training speakers):

  S_c = sum_s sum_{h=1}^{H_s} N_{hc}(s) E[y(s) y^*(s)]    (40)
  T_c = sum_s sum_{h=1}^{H_s} N_{hc}(s) E[x_h(s) y^*(s)]    (41)
  U = sum_s sum_{h=1}^{H_s} N_h(s) E[z(s) y^*(s)]    (42)
  V = sum_s sum_{h=1}^{H_s} N_h(s) E[z(s) x_h^*(s)]    (43)
  W = sum_s sum_{h=1}^{H_s} F_h(s, m) E[y^*(s)]    (44)

In (40) and (41), c ranges from 1 to C. These statistics can easily be extracted from the statistics defined in (10)-(15).

Theorem 8: Let m_0, u_0 and Σ_0 be given. Suppose we have a hyperparameter set Λ_0 of the form (m_0, u_0, v_0, d_0, Σ_0) and we use it to calculate the accumulators defined in (40)-(44). Define v and d as follows. For each mixture component c = 1, ..., C and for each f = 1, ..., F, set i = (c − 1)F + f and let u_i denote the ith row of u_0, v_i the ith row of v and d_i the ith entry of d. Then v_i and d_i are defined by the equation

  (v_i  d_i) (S_c  U_i^*; U_i  a_i) = (W_i − u_i T_c  b_i − u_i V_i^*)

where U_i, V_i and W_i denote the ith rows of U, V and W, and a_i and b_i denote the ith entries of a and b. If Λ = (m_0, u_0, v, d, Σ_0), then

  sum_s log P_Λ(X(s)) ≥ sum_s log P_{Λ_0}(X(s))

where s ranges over the speakers in the original training set.

Proof: The idea is to use the auxiliary function that we used in proving Theorem 4, regarding it as a function of v and d alone.

The minimum divergence algorithm is much easier to formulate, since it is just a special case of Theorem 7.

Theorem 9: Suppose we are given a hyperparameter set Λ_0 of the form (m_0, u_0, v_0, d_0, Σ_0). Define a new hyperparameter set Λ of the form (m, u_0, v, d, Σ_0) by setting

  m = m_0 + v_0 μ_y + d_0 μ_z
  v = v_0 K_yy^{1/2}
  d = d_0 K_zz^{1/2}

where μ_y, μ_z, K_yy and K_zz are defined in the statement of Theorem 7. Then

  sum_s log P_Λ(X(s)) ≥ sum_s log P_{Λ_0}(X(s))

where s ranges over the speakers in the original training set.
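The row system of Theorem 8 is the analogue of (16) with the fixed channel term moved to the right-hand side. A sketch for one supervector dimension i (illustrative names and shapes; a direct symmetric solve stands in for a Cholesky-based implementation):

```python
import numpy as np

def adapt_row(S_c, U_i, a_i, u_i, T_c, V_i, W_i, b_i):
    """Solve (v_i d_i) [S_c U_i*; U_i a_i] = (W_i - u_i T_c, b_i - u_i V_i*)
    for one dimension i, re-estimating the speaker-space rows v_i and d_i
    while the channel loading u_i stays fixed (Theorem 8).

    S_c : (Rs, Rs) accumulator (40), symmetric
    U_i : (Rs,) ith row of accumulator (42);  a_i, b_i : scalars from (14), (15)
    u_i : (Rc,) ith row of u_0;  T_c : (Rc, Rs) accumulator (41)
    V_i : (Rc,) ith row of accumulator (43);  W_i : (Rs,) ith row of (44)
    """
    Rs = S_c.shape[0]
    K = np.empty((Rs + 1, Rs + 1))
    K[:Rs, :Rs] = S_c
    K[:Rs, Rs] = U_i
    K[Rs, :Rs] = U_i
    K[Rs, Rs] = a_i
    rhs = np.append(W_i - u_i @ T_c, b_i - u_i @ V_i)
    sol = np.linalg.solve(K, rhs)   # K is symmetric, so row and column systems agree
    return sol[:Rs], sol[Rs]        # v_i, d_i
```

Compared with `update_row` for (16), the only change is the subtraction of the channel cross terms u_i T_c and u_i V_i^* from the right-hand side.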


Design By Emulation (Indirect Method) Deign By Emulation (Indirect Method he baic trategy here i, that Given a continuou tranfer function, it i required to find the bet dicrete equivalent uch that the ignal produced by paing an input ignal

More information

Control Systems Analysis and Design by the Root-Locus Method

Control Systems Analysis and Design by the Root-Locus Method 6 Control Sytem Analyi and Deign by the Root-Locu Method 6 1 INTRODUCTION The baic characteritic of the tranient repone of a cloed-loop ytem i cloely related to the location of the cloed-loop pole. If

More information

New bounds for Morse clusters

New bounds for Morse clusters New bound for More cluter Tamá Vinkó Advanced Concept Team, European Space Agency, ESTEC Keplerlaan 1, 2201 AZ Noordwijk, The Netherland Tama.Vinko@ea.int and Arnold Neumaier Fakultät für Mathematik, Univerität

More information

[Saxena, 2(9): September, 2013] ISSN: Impact Factor: INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY

[Saxena, 2(9): September, 2013] ISSN: Impact Factor: INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY [Saena, (9): September, 0] ISSN: 77-9655 Impact Factor:.85 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY Contant Stre Accelerated Life Teting Uing Rayleigh Geometric Proce

More information

Nonlinear Single-Particle Dynamics in High Energy Accelerators

Nonlinear Single-Particle Dynamics in High Energy Accelerators Nonlinear Single-Particle Dynamic in High Energy Accelerator Part 6: Canonical Perturbation Theory Nonlinear Single-Particle Dynamic in High Energy Accelerator Thi coure conit of eight lecture: 1. Introduction

More information

Dimensional Analysis A Tool for Guiding Mathematical Calculations

Dimensional Analysis A Tool for Guiding Mathematical Calculations Dimenional Analyi A Tool for Guiding Mathematical Calculation Dougla A. Kerr Iue 1 February 6, 2010 ABSTRACT AND INTRODUCTION In converting quantitie from one unit to another, we may know the applicable

More information

Learning Multiplicative Interactions

Learning Multiplicative Interactions CSC2535 2011 Lecture 6a Learning Multiplicative Interaction Geoffrey Hinton Two different meaning of multiplicative If we take two denity model and multiply together their probability ditribution at each

More information

Root Locus Diagram. Root loci: The portion of root locus when k assume positive values: that is 0

Root Locus Diagram. Root loci: The portion of root locus when k assume positive values: that is 0 Objective Root Locu Diagram Upon completion of thi chapter you will be able to: Plot the Root Locu for a given Tranfer Function by varying gain of the ytem, Analye the tability of the ytem from the root

More information

EE Control Systems LECTURE 14

EE Control Systems LECTURE 14 Updated: Tueday, March 3, 999 EE 434 - Control Sytem LECTURE 4 Copyright FL Lewi 999 All right reerved ROOT LOCUS DESIGN TECHNIQUE Suppoe the cloed-loop tranfer function depend on a deign parameter k We

More information

Singular perturbation theory

Singular perturbation theory Singular perturbation theory Marc R. Rouel June 21, 2004 1 Introduction When we apply the teady-tate approximation (SSA) in chemical kinetic, we typically argue that ome of the intermediate are highly

More information

Avoiding Forbidden Submatrices by Row Deletions

Avoiding Forbidden Submatrices by Row Deletions Avoiding Forbidden Submatrice by Row Deletion Sebatian Wernicke, Jochen Alber, Jen Gramm, Jiong Guo, and Rolf Niedermeier Wilhelm-Schickard-Intitut für Informatik, niverität Tübingen, Sand 13, D-72076

More information

Codes Correcting Two Deletions

Codes Correcting Two Deletions 1 Code Correcting Two Deletion Ryan Gabry and Frederic Sala Spawar Sytem Center Univerity of California, Lo Angele ryan.gabry@navy.mil fredala@ucla.edu Abtract In thi work, we invetigate the problem of

More information

Solving Differential Equations by the Laplace Transform and by Numerical Methods

Solving Differential Equations by the Laplace Transform and by Numerical Methods 36CH_PHCalter_TechMath_95099 3//007 :8 PM Page Solving Differential Equation by the Laplace Tranform and by Numerical Method OBJECTIVES When you have completed thi chapter, you hould be able to: Find the

More information

Beta Burr XII OR Five Parameter Beta Lomax Distribution: Remarks and Characterizations

Beta Burr XII OR Five Parameter Beta Lomax Distribution: Remarks and Characterizations Marquette Univerity e-publication@marquette Mathematic, Statitic and Computer Science Faculty Reearch and Publication Mathematic, Statitic and Computer Science, Department of 6-1-2014 Beta Burr XII OR

More information

Vector-Space Methods and Kirchhoff Graphs for Reaction Networks

Vector-Space Methods and Kirchhoff Graphs for Reaction Networks Vector-Space Method and Kirchhoff Graph for Reaction Network Joeph D. Fehribach Fuel Cell Center WPI Mathematical Science and Chemical Engineering 00 Intitute Rd. Worceter, MA 0609-2247 Thi article preent

More information

CDMA Signature Sequences with Low Peak-to-Average-Power Ratio via Alternating Projection

CDMA Signature Sequences with Low Peak-to-Average-Power Ratio via Alternating Projection CDMA Signature Sequence with Low Peak-to-Average-Power Ratio via Alternating Projection Joel A Tropp Int for Comp Engr and Sci (ICES) The Univerity of Texa at Autin 1 Univerity Station C0200 Autin, TX

More information

Laplace Transformation

Laplace Transformation Univerity of Technology Electromechanical Department Energy Branch Advance Mathematic Laplace Tranformation nd Cla Lecture 6 Page of 7 Laplace Tranformation Definition Suppoe that f(t) i a piecewie continuou

More information

arxiv: v2 [math.nt] 30 Apr 2015

arxiv: v2 [math.nt] 30 Apr 2015 A THEOREM FOR DISTINCT ZEROS OF L-FUNCTIONS École Normale Supérieure arxiv:54.6556v [math.nt] 3 Apr 5 943 Cachan November 9, 7 Abtract In thi paper, we etablih a imple criterion for two L-function L and

More information

Chapter 4. The Laplace Transform Method

Chapter 4. The Laplace Transform Method Chapter 4. The Laplace Tranform Method The Laplace Tranform i a tranformation, meaning that it change a function into a new function. Actually, it i a linear tranformation, becaue it convert a linear combination

More information

Chapter 2 Sampling and Quantization. In order to investigate sampling and quantization, the difference between analog

Chapter 2 Sampling and Quantization. In order to investigate sampling and quantization, the difference between analog Chapter Sampling and Quantization.1 Analog and Digital Signal In order to invetigate ampling and quantization, the difference between analog and digital ignal mut be undertood. Analog ignal conit of continuou

More information

Eigenvalues and eigenvectors

Eigenvalues and eigenvectors Eigenvalue and eigenvector Defining and computing uggeted problem olution For each matri give below, find eigenvalue and eigenvector. Give a bai and the dimenion of the eigenpace for each eigenvalue. P:

More information

NULL HELIX AND k-type NULL SLANT HELICES IN E 4 1

NULL HELIX AND k-type NULL SLANT HELICES IN E 4 1 REVISTA DE LA UNIÓN MATEMÁTICA ARGENTINA Vol. 57, No. 1, 2016, Page 71 83 Publihed online: March 3, 2016 NULL HELIX AND k-type NULL SLANT HELICES IN E 4 1 JINHUA QIAN AND YOUNG HO KIM Abtract. We tudy

More information

Asymptotics of ABC. Paul Fearnhead 1, Correspondence: Abstract

Asymptotics of ABC. Paul Fearnhead 1, Correspondence: Abstract Aymptotic of ABC Paul Fearnhead 1, 1 Department of Mathematic and Statitic, Lancater Univerity Correpondence: p.fearnhead@lancater.ac.uk arxiv:1706.07712v1 [tat.me] 23 Jun 2017 Abtract Thi document i due

More information

Online Appendix for Managerial Attention and Worker Performance by Marina Halac and Andrea Prat

Online Appendix for Managerial Attention and Worker Performance by Marina Halac and Andrea Prat Online Appendix for Managerial Attention and Worker Performance by Marina Halac and Andrea Prat Thi Online Appendix contain the proof of our reult for the undicounted limit dicued in Section 2 of the paper,

More information

arxiv: v1 [math.ac] 30 Nov 2012

arxiv: v1 [math.ac] 30 Nov 2012 ON MODULAR INVARIANTS OF A VECTOR AND A COVECTOR YIN CHEN arxiv:73v [mathac 3 Nov Abtract Let S L (F q be the pecial linear group over a finite field F q, V be the -dimenional natural repreentation of

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions Stochatic Optimization with Inequality Contraint Uing Simultaneou Perturbation and Penalty Function I-Jeng Wang* and Jame C. Spall** The John Hopkin Univerity Applied Phyic Laboratory 11100 John Hopkin

More information

Suggestions - Problem Set (a) Show the discriminant condition (1) takes the form. ln ln, # # R R

Suggestions - Problem Set (a) Show the discriminant condition (1) takes the form. ln ln, # # R R Suggetion - Problem Set 3 4.2 (a) Show the dicriminant condition (1) take the form x D Ð.. Ñ. D.. D. ln ln, a deired. We then replace the quantitie. 3ß D3 by their etimate to get the proper form for thi

More information

Technical Appendix: Auxiliary Results and Proofs

Technical Appendix: Auxiliary Results and Proofs A Technical Appendix: Auxiliary Reult and Proof Lemma A. The following propertie hold for q (j) = F r [c + ( ( )) ] de- ned in Lemma. (i) q (j) >, 8 (; ]; (ii) R q (j)d = ( ) q (j) + R q (j)d ; (iii) R

More information

Convergence criteria and optimization techniques for beam moments

Convergence criteria and optimization techniques for beam moments Pure Appl. Opt. 7 (1998) 1221 1230. Printed in the UK PII: S0963-9659(98)90684-5 Convergence criteria and optimization technique for beam moment G Gbur and P S Carney Department of Phyic and Atronomy and

More information

Multicolor Sunflowers

Multicolor Sunflowers Multicolor Sunflower Dhruv Mubayi Lujia Wang October 19, 2017 Abtract A unflower i a collection of ditinct et uch that the interection of any two of them i the ame a the common interection C of all of

More information

MATEMATIK Datum: Tid: eftermiddag. A.Heintz Telefonvakt: Anders Martinsson Tel.:

MATEMATIK Datum: Tid: eftermiddag. A.Heintz Telefonvakt: Anders Martinsson Tel.: MATEMATIK Datum: 20-08-25 Tid: eftermiddag GU, Chalmer Hjälpmedel: inga A.Heintz Telefonvakt: Ander Martinon Tel.: 073-07926. Löningar till tenta i ODE och matematik modellering, MMG5, MVE6. Define what

More information

CONTROL SYSTEMS, ROBOTICS AND AUTOMATION Vol. VIII Decoupling Control - M. Fikar

CONTROL SYSTEMS, ROBOTICS AND AUTOMATION Vol. VIII Decoupling Control - M. Fikar DECOUPLING CONTROL M. Fikar Department of Proce Control, Faculty of Chemical and Food Technology, Slovak Univerity of Technology in Bratilava, Radlinkého 9, SK-812 37 Bratilava, Slovakia Keyword: Decoupling:

More information

Stochastic Neoclassical Growth Model

Stochastic Neoclassical Growth Model Stochatic Neoclaical Growth Model Michael Bar May 22, 28 Content Introduction 2 2 Stochatic NGM 2 3 Productivity Proce 4 3. Mean........................................ 5 3.2 Variance......................................

More information

THE THERMOELASTIC SQUARE

THE THERMOELASTIC SQUARE HE HERMOELASIC SQUARE A mnemonic for remembering thermodynamic identitie he tate of a material i the collection of variable uch a tre, train, temperature, entropy. A variable i a tate variable if it integral

More information

Gain and Phase Margins Based Delay Dependent Stability Analysis of Two- Area LFC System with Communication Delays

Gain and Phase Margins Based Delay Dependent Stability Analysis of Two- Area LFC System with Communication Delays Gain and Phae Margin Baed Delay Dependent Stability Analyi of Two- Area LFC Sytem with Communication Delay Şahin Sönmez and Saffet Ayaun Department of Electrical Engineering, Niğde Ömer Halidemir Univerity,

More information

Inference for Two Stage Cluster Sampling: Equal SSU per PSU. Projections of SSU Random Variables on Each SSU selection.

Inference for Two Stage Cluster Sampling: Equal SSU per PSU. Projections of SSU Random Variables on Each SSU selection. Inference for Two Stage Cluter Sampling: Equal SSU per PSU Projection of SSU andom Variable on Eac SSU election By Ed Stanek Introduction We review etimating equation for PSU mean in a two tage cluter

More information

Feedback Control Systems (FCS)

Feedback Control Systems (FCS) Feedback Control Sytem (FCS) Lecture19-20 Routh-Herwitz Stability Criterion Dr. Imtiaz Huain email: imtiaz.huain@faculty.muet.edu.pk URL :http://imtiazhuainkalwar.weebly.com/ Stability of Higher Order

More information

III.9. THE HYSTERESIS CYCLE OF FERROELECTRIC SUBSTANCES

III.9. THE HYSTERESIS CYCLE OF FERROELECTRIC SUBSTANCES III.9. THE HYSTERESIS CYCLE OF FERROELECTRIC SBSTANCES. Work purpoe The analyi of the behaviour of a ferroelectric ubtance placed in an eternal electric field; the dependence of the electrical polariation

More information

Jump condition at the boundary between a porous catalyst and a homogeneous fluid

Jump condition at the boundary between a porous catalyst and a homogeneous fluid From the SelectedWork of Francico J. Valde-Parada 2005 Jump condition at the boundary between a porou catalyt and a homogeneou fluid Francico J. Valde-Parada J. Alberto Ochoa-Tapia Available at: http://work.bepre.com/francico_j_valde_parada/12/

More information

UNIT 15 RELIABILITY EVALUATION OF k-out-of-n AND STANDBY SYSTEMS

UNIT 15 RELIABILITY EVALUATION OF k-out-of-n AND STANDBY SYSTEMS UNIT 1 RELIABILITY EVALUATION OF k-out-of-n AND STANDBY SYSTEMS Structure 1.1 Introduction Objective 1.2 Redundancy 1.3 Reliability of k-out-of-n Sytem 1.4 Reliability of Standby Sytem 1. Summary 1.6 Solution/Anwer

More information

Physics 741 Graduate Quantum Mechanics 1 Solutions to Final Exam, Fall 2014

Physics 741 Graduate Quantum Mechanics 1 Solutions to Final Exam, Fall 2014 Phyic 7 Graduate Quantum Mechanic Solution to inal Eam all 0 Each quetion i worth 5 point with point for each part marked eparately Some poibly ueful formula appear at the end of the tet In four dimenion

More information

in a circular cylindrical cavity K. Kakazu Department of Physics, University of the Ryukyus, Okinawa , Japan Y. S. Kim

in a circular cylindrical cavity K. Kakazu Department of Physics, University of the Ryukyus, Okinawa , Japan Y. S. Kim Quantization of electromagnetic eld in a circular cylindrical cavity K. Kakazu Department of Phyic, Univerity of the Ryukyu, Okinawa 903-0, Japan Y. S. Kim Department of Phyic, Univerity of Maryland, College

More information

Computers and Mathematics with Applications. Sharp algebraic periodicity conditions for linear higher order

Computers and Mathematics with Applications. Sharp algebraic periodicity conditions for linear higher order Computer and Mathematic with Application 64 (2012) 2262 2274 Content lit available at SciVere ScienceDirect Computer and Mathematic with Application journal homepage: wwweleviercom/locate/camwa Sharp algebraic

More information

3.1 The Revised Simplex Algorithm. 3 Computational considerations. Thus, we work with the following tableau. Basic observations = CARRY. ... m.

3.1 The Revised Simplex Algorithm. 3 Computational considerations. Thus, we work with the following tableau. Basic observations = CARRY. ... m. 3 Computational conideration In what follow, we analyze the complexity of the Simplex algorithm more in detail For thi purpoe, we focu on the update proce in each iteration of thi procedure Clearly, ince,

More information

Unified Design Method for Flexure and Debonding in FRP Retrofitted RC Beams

Unified Design Method for Flexure and Debonding in FRP Retrofitted RC Beams Unified Deign Method for Flexure and Debonding in FRP Retrofitted RC Beam G.X. Guan, Ph.D. 1 ; and C.J. Burgoyne 2 Abtract Flexural retrofitting of reinforced concrete (RC) beam uing fibre reinforced polymer

More information

Multi-dimensional Fuzzy Euler Approximation

Multi-dimensional Fuzzy Euler Approximation Mathematica Aeterna, Vol 7, 2017, no 2, 163-176 Multi-dimenional Fuzzy Euler Approximation Yangyang Hao College of Mathematic and Information Science Hebei Univerity, Baoding 071002, China hdhyywa@163com

More information

Approximating discrete probability distributions with Bayesian networks

Approximating discrete probability distributions with Bayesian networks Approximating dicrete probability ditribution with Bayeian network Jon Williamon Department of Philoophy King College, Str and, London, WC2R 2LS, UK Abtract I generalie the argument of [Chow & Liu 1968]

More information

1. Preliminaries. In [8] the following odd looking integral evaluation is obtained.

1. Preliminaries. In [8] the following odd looking integral evaluation is obtained. June, 5. Revied Augut 8th, 5. VA DER POL EXPASIOS OF L-SERIES David Borwein* and Jonathan Borwein Abtract. We provide concie erie repreentation for variou L-erie integral. Different technique are needed

More information

Jul 4, 2005 turbo_code_primer Revision 0.0. Turbo Code Primer

Jul 4, 2005 turbo_code_primer Revision 0.0. Turbo Code Primer Jul 4, 5 turbo_code_primer Reviion. Turbo Code Primer. Introduction Thi document give a quick tutorial on MAP baed turbo coder. Section develop the background theory. Section work through a imple numerical

More information

Compact finite-difference approximations for anisotropic image smoothing and painting

Compact finite-difference approximations for anisotropic image smoothing and painting CWP-593 Compact finite-difference approximation for aniotropic image moothing and painting Dave Hale Center for Wave Phenomena, Colorado School of Mine, Golden CO 80401, USA ABSTRACT Finite-difference

More information

Notes on Phase Space Fall 2007, Physics 233B, Hitoshi Murayama

Notes on Phase Space Fall 2007, Physics 233B, Hitoshi Murayama Note on Phae Space Fall 007, Phyic 33B, Hitohi Murayama Two-Body Phae Space The two-body phae i the bai of computing higher body phae pace. We compute it in the ret frame of the two-body ytem, P p + p

More information

The Hassenpflug Matrix Tensor Notation

The Hassenpflug Matrix Tensor Notation The Haenpflug Matrix Tenor Notation D.N.J. El Dept of Mech Mechatron Eng Univ of Stellenboch, South Africa e-mail: dnjel@un.ac.za 2009/09/01 Abtract Thi i a ample document to illutrate the typeetting of

More information

On the chromatic number of a random 5-regular graph

On the chromatic number of a random 5-regular graph On the chromatic number of a random 5-regular graph J. Díaz A.C. Kapori G.D. Kemke L.M. Kiroui X. Pérez N. Wormald Abtract It wa only recently hown by Shi and Wormald, uing the differential equation method

More information

5.5 Application of Frequency Response: Signal Filters

5.5 Application of Frequency Response: Signal Filters 44 Dynamic Sytem Second order lowpa filter having tranfer function H()=H ()H () u H () H () y Firt order lowpa filter Figure 5.5: Contruction of a econd order low-pa filter by combining two firt order

More information

Calculation of the temperature of boundary layer beside wall with time-dependent heat transfer coefficient

Calculation of the temperature of boundary layer beside wall with time-dependent heat transfer coefficient Ŕ periodica polytechnica Mechanical Engineering 54/1 21 15 2 doi: 1.3311/pp.me.21-1.3 web: http:// www.pp.bme.hu/ me c Periodica Polytechnica 21 RESERCH RTICLE Calculation of the temperature of boundary

More information

Problem 1. Construct a filtered probability space on which a Brownian motion W and an adapted process X are defined and such that

Problem 1. Construct a filtered probability space on which a Brownian motion W and an adapted process X are defined and such that Stochatic Calculu Example heet 4 - Lent 5 Michael Tehranchi Problem. Contruct a filtered probability pace on which a Brownian motion W and an adapted proce X are defined and uch that dx t = X t t dt +

More information

THE SPLITTING SUBSPACE CONJECTURE

THE SPLITTING SUBSPACE CONJECTURE THE SPLITTING SUBSPAE ONJETURE ERI HEN AND DENNIS TSENG Abtract We anwer a uetion by Niederreiter concerning the enumeration of a cla of ubpace of finite dimenional vector pace over finite field by proving

More information

4.6 Principal trajectories in terms of amplitude and phase function

4.6 Principal trajectories in terms of amplitude and phase function 4.6 Principal trajectorie in term of amplitude and phae function We denote with C() and S() the coinelike and inelike trajectorie relative to the tart point = : C( ) = S( ) = C( ) = S( ) = Both can be

More information

SMALL-SIGNAL STABILITY ASSESSMENT OF THE EUROPEAN POWER SYSTEM BASED ON ADVANCED NEURAL NETWORK METHOD

SMALL-SIGNAL STABILITY ASSESSMENT OF THE EUROPEAN POWER SYSTEM BASED ON ADVANCED NEURAL NETWORK METHOD SMALL-SIGNAL STABILITY ASSESSMENT OF THE EUROPEAN POWER SYSTEM BASED ON ADVANCED NEURAL NETWORK METHOD S.P. Teeuwen, I. Erlich U. Bachmann Univerity of Duiburg, Germany Department of Electrical Power Sytem

More information

Chapter 7. Root Locus Analysis

Chapter 7. Root Locus Analysis Chapter 7 Root Locu Analyi jw + KGH ( ) GH ( ) - K 0 z O 4 p 2 p 3 p Root Locu Analyi The root of the cloed-loop characteritic equation define the ytem characteritic repone. Their location in the complex

More information

Alternate Dispersion Measures in Replicated Factorial Experiments

Alternate Dispersion Measures in Replicated Factorial Experiments Alternate Diperion Meaure in Replicated Factorial Experiment Neal A. Mackertich The Raytheon Company, Sudbury MA 02421 Jame C. Benneyan Northeatern Univerity, Boton MA 02115 Peter D. Krau The Raytheon

More information

Fermi Distribution Function. n(e) T = 0 T > 0 E F

Fermi Distribution Function. n(e) T = 0 T > 0 E F LECTURE 3 Maxwell{Boltzmann, Fermi, and Boe Statitic Suppoe we have a ga of N identical point particle in a box ofvolume V. When we ay \ga", we mean that the particle are not interacting with one another.

More information