The Basic Idea of EM

Jianxin Wu
LAMDA Group
National Key Lab for Novel Software Technology
Nanjing University, China
June 7, 2017

Contents

1 Introduction
2 GMM: A working example
  2.1 Gaussian mixture model
  2.2 The hidden variable interpretation
  2.3 What if we can observe the hidden variable?
  2.4 Can we imitate an oracle?
3 An informal description of the EM algorithm
4 The Expectation-Maximization algorithm
  4.1 Jointly-non-concave incomplete log-likelihood
  4.2 (Possibly) Concave complete data log-likelihood
  4.3 The general EM derivation
  4.4 The E- & M-steps
  4.5 The EM algorithm
  4.6 Will EM converge?
5 EM for GMM

1 Introduction

Statistical learning models are very important in many areas inside computer science, including but not confined to machine learning, computer vision, pattern recognition, and data mining. They are also important in some deep learning models, such as the Restricted Boltzmann Machine (RBM).

Statistical learning models have parameters, and estimating such parameters from data is one of the key problems in the study of such models. Expectation-Maximization (EM) is arguably the most widely used parameter estimation technique. Hence, it is worthwhile to know some basics of EM. However, although EM is must-have knowledge for studying statistical learning models, it is not easy for beginners. This note introduces the basic idea behind EM. I want to emphasize that the main purpose of this note is to introduce the basic idea (or, emphasizing the intuition) behind EM, not to cover all details of EM or to present rigorous mathematical derivations.[1]

[1] The first version of this note was written in Chinese, and was started as note-taking in a course at the Georgia Institute of Technology while I was a graduate student there. That version was typeset in Microsoft Word. Unfortunately, that version contained a lot of errors and I did not have a chance to check it again. This version (written in 2016) was started while I was preparing materials for the Pattern Recognition course I will teach in the Spring Semester at Nanjing University. It is greatly expanded, and the errors that I found are corrected.

2 GMM: A working example

Let us start from a simple working example: the Gaussian Mixture Model (GMM).

2.1 Gaussian mixture model

In Figure 1, we show three curves corresponding to three different probability density functions (p.d.f.). The blue curve is the p.d.f. of a normal distribution N(10, 16), i.e., a Gaussian distribution with mean µ = 10 and standard deviation σ = 4 (and σ² = 16). We denote this p.d.f. as p_1(x) = N(x; 10, 16). The red curve is another normal distribution N(30, 49) with µ = 30 and σ = 7. Similarly, we denote it as p_2(x) = N(x; 30, 49). We are interested in the black curve, whose first half is similar to the blue one, while the second half is similar to the red one. This curve is also the p.d.f. of a distribution, denoted by p_3. Since the black curve is similar to parts of the blue and red curves, it is reasonable to conjecture that p_3 is related to both p_1 and p_2. Indeed, p_3 is a weighted combination of p_1 and p_2. In this example,

    p_3(x) = 0.2\, p_1(x) + 0.8\, p_2(x).    (1)

Because 0.2 + 0.8 = 1, it is easy to verify that p_3(x) ≥ 0 always holds and \int p_3(x)\,dx = 1. Hence, p_3 is a valid p.d.f. p_3 is a mixture of two Gaussians (p_1 and p_2), hence a Gaussian mixture model (GMM). The definition of a GMM is in fact more general: it can have more than two components, and the Gaussians can be multivariate.
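To make Equation 1 concrete, here is a small numerical check (my own illustration, not part of the original note), written in Python and assuming NumPy and SciPy are available. It evaluates p_3 on a grid and verifies that it is non-negative and integrates to (approximately) 1.

    import numpy as np
    from scipy.stats import norm

    # The two Gaussian components from Figure 1.
    p1 = norm(loc=10, scale=4)   # N(10, 16): mu = 10, sigma = 4
    p2 = norm(loc=30, scale=7)   # N(30, 49): mu = 30, sigma = 7

    def p3(x):
        # Equation (1): a weighted combination of p1 and p2.
        return 0.2 * p1.pdf(x) + 0.8 * p2.pdf(x)

    x = np.linspace(-40, 100, 100001)
    y = p3(x)
    print(y.min() >= 0.0)    # p3(x) >= 0 on the grid
    print(np.trapz(y, x))    # numerically close to 1, so p3 is a valid p.d.f.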

Figure 1: A simple GMM illustration.

A GMM is a distribution whose p.d.f. has the following form:

    p(x) = \sum_{i=1}^{N} \alpha_i N(x; \mu_i, \Sigma_i)    (2)
         = \sum_{i=1}^{N} \frac{\alpha_i}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right),    (3)

in which x is a d-dimensional random vector. In this GMM, there are N Gaussian components, with the i-th Gaussian having the mean vector µ_i ∈ R^d and the covariance matrix Σ_i ∈ R^{d×d}.[2] These Gaussian components are combined using a linear combination, where the weight for the i-th component is α_i (called the mixing coefficients). The mixing coefficients must satisfy the following conditions:

    \sum_{i=1}^{N} \alpha_i = 1,    (4)
    \alpha_i \ge 0, \quad \forall i.    (5)

It is easy to verify that under these conditions, p(x) is a valid multivariate probability density function.

[2] We will use boldface letters to denote vectors.

2.2 The hidden variable interpretation

We can have a different interpretation of the Gaussian mixture model, using the hidden variable concept, as illustrated in Figure 2.
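The density in Equations 2 and 3 is straightforward to evaluate. The following Python sketch (my own illustration, assuming SciPy's multivariate_normal) computes p(x) for a GMM with N components and checks the constraints of Equations 4 and 5 on the mixing coefficients.

    import numpy as np
    from scipy.stats import multivariate_normal

    def gmm_pdf(x, alphas, mus, Sigmas):
        """Evaluate Equation (2): p(x) = sum_i alpha_i N(x; mu_i, Sigma_i)."""
        alphas = np.asarray(alphas)
        # Equations (4) and (5): the mixing coefficients sum to 1 and are non-negative.
        assert np.all(alphas >= 0) and np.isclose(alphas.sum(), 1.0)
        return sum(a * multivariate_normal(mean=mu, cov=S).pdf(x)
                   for a, mu, S in zip(alphas, mus, Sigmas))

    # A 2-component GMM in d = 2 dimensions.
    alphas = [0.2, 0.8]
    mus = [np.array([10.0, 0.0]), np.array([30.0, 5.0])]
    Sigmas = [4.0 * np.eye(2), 7.0 * np.eye(2)]
    print(gmm_pdf(np.array([12.0, 1.0]), alphas, mus, Sigmas))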

Figure 2: GMM as a graphical model (the hidden node Z points to the observed node X).

In Figure 2, the random variable X follows a Gaussian mixture model (cf. Equation 3). Its parameter is

    \theta = \{\alpha_i, \mu_i, \Sigma_i\}_{i=1}^{N}.    (6)

If we want to sample an instance from this GMM, we could directly sample from the p.d.f. in Equation 3. However, there is another, two-step way to perform the sampling. Let us define a random variable Z. Z follows a multinomial (discrete) distribution, taking values from the set {1, 2, ..., N}. The probability that Z takes the value i is α_i, i.e., Pr(Z = i) = α_i, for 1 ≤ i ≤ N. Then, the two-step sampling procedure is:

Step 1 Sample from Z, and get a value i (1 ≤ i ≤ N);
Step 2 Sample x from the i-th Gaussian component N(µ_i, Σ_i).

It is easy to verify that a sample x obtained from this two-step sampling procedure follows the underlying GMM distribution in Equation 3. In learning GMM parameters, we are given a sample set {x_1, x_2, ..., x_M}, where the x_i are i.i.d. (identically and independently distributed) instances sampled from the p.d.f. in Equation 3. From this set of samples, we want to estimate or learn the GMM parameters θ = {α_i, µ_i, Σ_i}_{i=1}^{N}. Because we are given the samples x_i, the random variable X (cf. Figure 2) is called an observed (or observable) random variable. As shown in Figure 2, observed random variables are usually drawn as a filled circle.
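The two-step sampling procedure can be turned into code almost verbatim. The sketch below is my own illustration (assuming NumPy): Step 1 draws the component index from the discrete distribution Pr(Z = i) = α_i, and Step 2 draws x from the chosen Gaussian. In the learning problem just described, only X would be kept; the sampled component indices play the role of the hidden variable.

    import numpy as np

    def sample_gmm(M, alphas, mus, Sigmas, seed=0):
        """Draw M samples from a GMM via the two-step procedure."""
        rng = np.random.default_rng(seed)
        d = len(mus[0])
        X = np.empty((M, d))
        z = np.empty(M, dtype=int)
        for j in range(M):
            # Step 1: sample the hidden component index i with Pr(Z = i) = alpha_i.
            i = rng.choice(len(alphas), p=alphas)
            # Step 2: sample x from the i-th Gaussian component N(mu_i, Sigma_i).
            X[j] = rng.multivariate_normal(mus[i], Sigmas[i])
            z[j] = i
        return X, z

    X, z = sample_gmm(1000, alphas=[0.2, 0.8],
                      mus=[np.zeros(2), np.array([5.0, 5.0])],
                      Sigmas=[np.eye(2), 2.0 * np.eye(2)])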

The random variable Z, however, is not observable, and is called a hidden variable (or a latent variable). Hidden variables are drawn as an empty circle, as the Z node in Figure 2.

2.3 What if we can observe the hidden variable?

In real applications, we do not know the value (or instantiation) of Z, because it is hidden (not observable). This fact makes estimating GMM parameters rather difficult, and techniques such as EM (the focus of this note) have to be employed.

However, for the sample set X = {x_1, x_2, ..., x_M}, let us consider the scenario in which we can further suppose that some oracle has given us the values of Z: Z = {z_1, z_2, ..., z_M}. In other words, we know that x_i is sampled from the z_i-th Gaussian component. In this case, it is easy to estimate the parameters θ.

First, we can find all those samples that are generated from the i-th component, and use X_i to denote this subset of samples. In precise mathematical language,

    X_i = \{ x_j \mid z_j = i, \; 1 \le j \le M \}.    (7)

The mixing coefficient estimation is a simple counting. We can count the number of examples which are generated from the i-th Gaussian component as m_i = |X_i|, where |·| is the size (number of elements) of a set. Then, the maximum likelihood estimate for α_i is

    \hat{\alpha}_i = \frac{m_i}{\sum_{j=1}^{N} m_j} = \frac{m_i}{M}.    (8)

Second, it is also easy to estimate the µ_i and Σ_i parameters for any 1 ≤ i ≤ N. The maximum likelihood estimation solutions are the same as the single Gaussian equations:[3]

    \hat{\mu}_i = \frac{1}{m_i} \sum_{x \in X_i} x,    (9)
    \hat{\Sigma}_i = \frac{1}{m_i} \sum_{x \in X_i} (x - \hat{\mu}_i)(x - \hat{\mu}_i)^T.    (10)

In short, if we know the hidden variable's instantiations, the estimation is straightforward. Unfortunately, we are only given the observed sample set X. The hidden variable instantiations Z are unknown to us. This fact complicates the entire parameter estimation process.

[3] Please refer to my note on properties of normal distributions for the derivation of these equations.
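When the oracle labels are available, Equations 7 to 10 amount to grouping the samples by component and computing per-group frequencies, means, and covariances. A minimal Python sketch (my own illustration, with 0-based component indices):

    import numpy as np

    def mle_with_labels(X, z, N):
        """Equations (8)-(10): ML estimates when the component labels z are observed.

        X: (M, d) array of samples; z: (M,) array of component indices in {0, ..., N-1}.
        """
        M, d = X.shape
        alphas, mus, Sigmas = [], [], []
        for i in range(N):
            Xi = X[z == i]                        # Equation (7): samples from component i
            m_i = len(Xi)
            alphas.append(m_i / M)                # Equation (8)
            mu_i = Xi.mean(axis=0)                # Equation (9)
            diff = Xi - mu_i
            Sigmas.append(diff.T @ diff / m_i)    # Equation (10)
            mus.append(mu_i)
        return np.array(alphas), np.array(mus), np.array(Sigmas)

Applied to the X and z produced by the sampling sketch above, these estimates should be close to the true α_i, µ_i, and Σ_i.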

2.4 Can we imitate an oracle?

A natural question to ask ourselves is: if we do not have an oracle to teach us, can we imitate the oracle's teaching? In other words, can we guess the value of z_j for x_j? A natural choice is to use the posterior p(z_j | x_j, θ^(t)) as a replacement for z_j. This term is the posterior probability given the sample x_j and the current parameter value θ^(t).[4] The posterior probability is the best educated guess we can have given the information that is at hand.

[4] As we will see, EM is an iterative process, in which the variable t is the iteration index. We will update the parameter θ in every iteration, and use θ^(t) to denote its value in the t-th iteration.

In this guessing game, we have at least two issues in our way.

First, an oracle is supposed to know everything, and will be able to tell us that x_7 comes from the third Gaussian component, with 100% confidence. If an oracle exists, we can simply say z_7 = 3 in this example. However, our guess will never be deterministic; it can at best be a probability distribution over the random variable z_j. Hence, we will assume that for every observed sample x_i, there is a corresponding hidden vector z_i, whose values can be guessed but cannot be observed. We still use Z to denote the underlying random variable, and use Z = {z_1, ..., z_M} to denote the set of hidden vectors. In the GMM example, a vector z_j will have N dimensions, but one and only one of these dimensions will be 1, and all others will be 0.

Second, the guess we have about z_j is a distribution determined by the posterior p(z_j | x_j, θ^(t)). However, what we really want are values instead of a distribution. How are we going to use this guess? A common trick in statistical learning is to use its expectation. We will leave the details about how the expectation is used to later sections.

3 An informal description of the EM algorithm

Now we are ready to give an informal description of the EM algorithm.

- We first initialize the values of θ in any reasonable way;
- Then, we can estimate the best possible Z (the expectation of its posterior distribution) using X and the current θ estimate;
- With this Z estimate, we can find a better estimate of θ using X;
- A better θ (combined with X) will lead to a better guess of Z;
- This process (estimating θ and Z in alternating order) can proceed until the change in θ is small (i.e., the procedure converges).

In still more informal language, after proper initialization of the parameters, we can:

E-Step Find a better guess of the non-observable hidden variables, by using the data and the current parameter values;
M-Step Find a better parameter estimate, by using the current guess of the hidden variables and the data;
Repeat Repeat the above two steps until convergence.

In the EM algorithm, the first step is usually called the Expectation step, abbreviated as the E-step, while the second step is usually called the Maximization step, abbreviated as the M-step. The EM algorithm repeats E- and M-steps in alternating order. When the algorithm converges, we get the desired parameter estimates.

4 The Expectation-Maximization algorithm

Now we will show more details of the EM algorithm. Suppose we are dealing with two sets of random variables: the observed variables X and the hidden variables Z. The joint p.d.f. is p(X, Z; θ), where θ are the parameters. We are given a set of instances of X to learn the parameters, as X = {x_1, x_2, ..., x_M}. The task is to estimate θ from X. For every x_j, there is a corresponding z_j. And we want to clarify that θ now includes the parameters that are associated with Z. In the GMM example, the z_j are estimates for Z, {α_i, µ_i, Σ_i}_{i=1}^{N} are the parameters specifying X, and θ includes both sets of parameters.

4.1 Jointly-non-concave incomplete log-likelihood

If we use the maximum likelihood (ML) estimation technique, the ML estimate for θ is

    \hat{\theta} = \arg\max_{\theta} p(X \mid \theta).    (11)

Or equivalently, we can maximize the log-likelihood,

    \hat{\theta} = \arg\max_{\theta} \ln p(X \mid \theta),    (12)

because ln(·) is a monotonically increasing function. Then, parameter estimation becomes an optimization problem. We will use the notation L(θ) to denote the log-likelihood, that is,

    L(\theta) = \ln p(X \mid \theta).    (13)

Recent developments in optimization tell us that we can generally consider a minimization problem as easy if it is convex, while non-convex problems are usually difficult to solve. Equivalently, a concave maximization problem is generally considered easy, while non-concave maximization is usually difficult, because the negative of a convex function is a concave one, and vice versa.

Unfortunately, the log-likelihood is non-concave in most cases. Taking the Gaussian mixture model as an example, the likelihood p(X | θ) is

    p(X \mid \theta) = \prod_{j=1}^{M} \sum_{i=1}^{N} \frac{\alpha_i}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right).    (14)

The log-likelihood has the following form:

    \sum_{j=1}^{M} \ln \sum_{i=1}^{N} \frac{\alpha_i}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right).    (15)

This equation is non-concave with respect to the joint optimization variables {α_i, µ_i, Σ_i}_{i=1}^{N}. In other words, this is a difficult maximization problem.

We have two sets of random variables X and Z. The log-likelihood in Equation 15 is called the incomplete data log-likelihood because Z is not in that equation.

4.2 (Possibly) Concave complete data log-likelihood

The complete data log-likelihood is

    \ln p(X, Z \mid \theta).    (16)

Let us use GMM as an example once more. In GMM, the z_j vectors (which form Z) are N-dimensional vectors with N−1 entries equal to 0 and only one dimension with value 1. Hence, the complete data likelihood is

    p(X, Z \mid \theta) = \prod_{j=1}^{M} \prod_{i=1}^{N} \left[ \frac{\alpha_i}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right) \right]^{z_{ij}}.    (17)

This equation can be explained using the two-step sampling process. Let us assume x_j is generated by the i-th Gaussian component. Then we know that z_{ij} = 1, and z_{i'j} = 0 for any i' ≠ i. In other words, the term inside [·] will equal 1 for the N−1 components with z_{i'j} = 0, and the remaining entry will evaluate to α_i N(x_j; µ_i, Σ_i), which exactly matches the two-step sampling procedure.[5] Then, the complete data log-likelihood is

    \sum_{j=1}^{M} \sum_{i=1}^{N} z_{ij} \left( \frac{1}{2} \left( \ln |\Sigma_i^{-1}| - (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right) + \ln \alpha_i \right) + \text{const}.    (18)

[5] The first step has probability α_i, and the second step has density N(x_j; µ_i, Σ_i); by the product rule, the joint probability of the two steps is their product.

Let us consider the scenario where the hidden variables z_{ij} are known, but α_i, µ_i and Σ_i are unknown.

Here we suppose Σ_i is invertible for 1 ≤ i ≤ N. Instead of considering the parameters (µ_i, Σ_i), we consider (µ_i, Σ_i^{-1}).[6] It is well known that the log-determinant function ln|·| is concave. It is also easy to prove that the quadratic term (x_j − µ_i)^T Σ_i^{-1} (x_j − µ_i) is jointly convex with respect to the variables (µ_i, Σ_i^{-1}), which directly implies that its negative is concave.[7] Hence, this sub-problem can be efficiently solved.

[6] It is more natural to understand this choice as using the canonical parameterization of a normal distribution. Please refer to my note on properties of normal distributions.
[7] For knowledge about convexity, please refer to the book Convex Optimization by Stephen Boyd and Lieven Vandenberghe, Cambridge University Press. The PDF version of this book is available online.

From this optimization perspective, we can understand the EM algorithm from a different point of view. Although the original maximum likelihood parameter estimation problem is difficult to solve (it is jointly non-concave), the EM algorithm can usually (but not always) produce concave subproblems, which are efficiently solvable.

4.3 The general EM derivation

Now we talk about EM in the general sense. We have observable variables X and samples X. We also have hidden variables Z and unobservable samples Z. The overall system parameters are denoted by θ. The parameter learning problem tries to find optimal parameters \hat{θ} by maximizing the incomplete data log-likelihood,

    \hat{\theta} = \arg\max_{\theta} \ln p(X \mid \theta).    (19)

We assume Z is discrete, and hence

    p(X \mid \theta) = \sum_{Z} p(X, Z \mid \theta).    (20)

However, this assumption is mainly for notational simplicity. If Z is continuous, we can replace the summation with an integral.

Although we have mentioned previously that we can use the posterior of Z, i.e., p(Z | X, θ), as our guess, it is also interesting to observe what will happen to the complete data likelihood if we use an arbitrary distribution for Z (and hence understand why the posterior is special and why we should use it). Let q be any valid probability distribution for Z. We can measure how different q is from the posterior using the classic Kullback-Leibler (KL) divergence measure,

    KL(q \,\|\, p) = -\sum_{Z} q(Z) \ln \frac{p(Z \mid X, \theta)}{q(Z)}.    (21)

Probability theory tells us that

    p(X \mid \theta) = \frac{p(X, Z \mid \theta)}{p(Z \mid X, \theta)}    (22)
                     = \frac{p(X, Z \mid \theta)}{q(Z)} \cdot \frac{q(Z)}{p(Z \mid X, \theta)}.    (23)

Hence,

    \ln p(X \mid \theta) = \left( \sum_{Z} q(Z) \right) \ln p(X \mid \theta)    (24)
    = \sum_{Z} q(Z) \ln p(X \mid \theta)    (25)
    = \sum_{Z} q(Z) \ln \left( \frac{p(X, Z \mid \theta)}{q(Z)} \cdot \frac{q(Z)}{p(Z \mid X, \theta)} \right)    (26)
    = \sum_{Z} q(Z) \left( \ln \frac{p(X, Z \mid \theta)}{q(Z)} - \ln \frac{p(Z \mid X, \theta)}{q(Z)} \right)    (27)
    = \sum_{Z} q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)} + KL(q \,\|\, p)    (28)
    = L(q, \theta) + KL(q \,\|\, p).    (29)

We have decomposed the incomplete data log-likelihood into two terms. The first term is L(q, θ), defined as

    L(q, \theta) = \sum_{Z} q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)}.    (30)

The second term is the KL divergence between q and the posterior,

    KL(q \,\|\, p) = -\sum_{Z} q(Z) \ln \frac{p(Z \mid X, \theta)}{q(Z)},    (31)

which was copied from Equation 21. There are some nice properties of the KL divergence. For example,

    KL(q \,\|\, p) \ge 0    (32)

always holds, and the equality sign is true if and only if q = p.[8] One direct consequence of this property is that

    L(q, \theta) \le \ln p(X \mid \theta)    (33)

always holds, and

    L(q, \theta) = \ln p(X \mid \theta) \text{ if and only if } q(Z) = p(Z \mid X, \theta).    (34)

In other words, we have found a lower bound of ln p(X | θ). Hence, in order to maximize ln p(X | θ), we can perform two steps.

[8] For more properties of the KL divergence, please refer to the book Elements of Information Theory by Thomas M. Cover and Joy A. Thomas, John Wiley & Sons, Inc.
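The non-negativity in Equation 32 is easy to check numerically. A tiny Python check (my own illustration, not part of the note) over a few random discrete distributions:

    import numpy as np

    def kl(q, p):
        # KL(q || p) as in Equation (21): sum_z q(z) ln(q(z) / p(z)).
        return np.sum(q * np.log(q / p))

    rng = np.random.default_rng(0)
    for _ in range(5):
        q = rng.dirichlet(np.ones(4))
        p = rng.dirichlet(np.ones(4))
        print(kl(q, p) >= 0)              # Equation (32): the KL divergence is never negative
    print(np.isclose(kl(q, q), 0.0))      # and it equals 0 when q = p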

The first step is to make the lower bound L(q, θ) equal to ln p(X | θ). As aforementioned, we know the equality holds if and only if \hat{q}(Z) = p(Z | X, θ). Now we have

    \ln p(X \mid \theta) = L(\hat{q}, \theta),    (35)

and L only depends on θ now. This is the Expectation step (E-step) in the EM algorithm.

In the second step, we can maximize L(\hat{q}, θ) with respect to θ. Since ln p(X | θ) = L(\hat{q}, θ), an increase of L(\hat{q}, θ) also means an increase of the log-likelihood ln p(X | θ). And, because we are maximizing L(\hat{q}, θ) in this step, the log-likelihood will always increase if we are not already at a local maximum of the log-likelihood. This is the Maximization step (M-step) in the EM algorithm.

4.4 The E- & M-steps

In the E-step, we already know that we should set

    \hat{q}(Z) = p(Z \mid X, \theta),    (36)

which is straightforward (at least in its mathematical form). Then, how shall we maximize L(\hat{q}, θ)? We can substitute \hat{q} into the definition of L, and find the optimal θ that maximizes L after plugging in \hat{q}. However, note that \hat{q} involves θ too. Hence, we need some more notation.

Suppose we are in the t-th iteration. In the E-step, \hat{q} is computed using the current parameters, as

    \hat{q}(Z) = p(Z \mid X, \theta^{(t)}).    (37)

Then, L becomes

    L(\hat{q}, \theta) = \sum_{Z} \hat{q}(Z) \ln \frac{p(X, Z \mid \theta)}{\hat{q}(Z)}    (38)
    = \sum_{Z} \hat{q}(Z) \ln p(X, Z \mid \theta) - \sum_{Z} \hat{q}(Z) \ln \hat{q}(Z)    (39)
    = \sum_{Z} p(Z \mid X, \theta^{(t)}) \ln p(X, Z \mid \theta) + \text{const},    (40)

in which const = −\sum_{Z} \hat{q}(Z) \ln \hat{q}(Z) does not involve the variable θ, and hence can be ignored. The remaining term is in fact an expectation, which we denote as Q(θ, θ^{(t)}):

    Q(\theta, \theta^{(t)}) = \sum_{Z} p(Z \mid X, \theta^{(t)}) \ln p(X, Z \mid \theta)    (41)
    = \mathbb{E}_{Z \mid X, \theta^{(t)}} [ \ln p(X, Z \mid \theta) ].    (42)

That is, in the E-step, we compute the posterior of Z.

In the M-step, we compute the expectation of the complete data log-likelihood ln p(X, Z | θ) with respect to the posterior distribution p(Z | X, θ^{(t)}), and we maximize this expectation to get a better parameter estimate:

    \theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^{(t)}) = \arg\max_{\theta} \mathbb{E}_{Z \mid X, \theta^{(t)}} [ \ln p(X, Z \mid \theta) ].    (43)

Thus, three computations are involved in EM: 1) the posterior, 2) the expectation, and 3) the maximization. We treat 1) as the E-step, and 2)+3) as the M-step. Some researchers prefer to treat 1)+2) as the E-step, and 3) as the M-step. However, no matter how the computations are attributed to the different steps, the EM algorithm does not change.

4.5 The EM algorithm

Now we are ready to write down the EM algorithm.

Algorithm 1 The Expectation-Maximization Algorithm
1: t ← 0
2: Initialize the parameters to θ^{(0)}
3: The E(xpectation)-step: Find p(Z | X, θ^{(t)})
4: The M(aximization)-step.1: Find the expectation

    Q(\theta, \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}} [ \ln p(X, Z \mid \theta) ]    (44)

5: The M(aximization)-step.2: Find a new parameter estimate

    \theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^{(t)})    (45)

6: t ← t + 1
7: If the log-likelihood has not converged, go to the E-step again (Line 3)

4.6 Will EM converge?

The analysis of EM's convergence properties is a complex topic. However, it is easy to show that the EM algorithm never decreases the likelihood from one iteration to the next, and converges to a local maximum of the likelihood. Let us consider two time steps, t−1 and t. From Equation 35, we get

    L(\hat{q}^{(t)}, \theta^{(t)}) = \ln p(X \mid \theta^{(t)}),    (46)
    L(\hat{q}^{(t-1)}, \theta^{(t-1)}) = \ln p(X \mid \theta^{(t-1)}).    (47)

Note that we have added the time index to the superscript of \hat{q} to emphasize that \hat{q} also changes between iterations.

Now, because at the (t−1)-th iteration we have

    \theta^{(t)} = \arg\max_{\theta} L(\hat{q}^{(t-1)}, \theta),    (48)

it follows that

    L(\hat{q}^{(t-1)}, \theta^{(t)}) \ge L(\hat{q}^{(t-1)}, \theta^{(t-1)}).    (49)

Similarly, at the t-th iteration, based on Equation 33 and Equation 35, we have

    L(\hat{q}^{(t-1)}, \theta^{(t)}) \le \ln p(X \mid \theta^{(t)}) = L(\hat{q}^{(t)}, \theta^{(t)}).    (50)

Putting these equations together, we get

    \ln p(X \mid \theta^{(t)}) = L(\hat{q}^{(t)}, \theta^{(t)})        [Use (46)]    (51)
    \ge L(\hat{q}^{(t-1)}, \theta^{(t)})                               [Use (50)]    (52)
    \ge L(\hat{q}^{(t-1)}, \theta^{(t-1)})                             [Use (49)]    (53)
    = \ln p(X \mid \theta^{(t-1)}).                                    [Use (47)]    (54)

Hence, EM will converge to a local maximum of the likelihood. However, the analysis of its convergence rate is very complex and beyond the scope of this introductory note.

5 EM for GMM

Now we can apply the EM algorithm to GMM. The first thing is to compute the posterior. Using Bayes' theorem, we have

    p(z_{ij} \mid x_j, \theta^{(t)}) = \frac{p(x_j, z_{ij} \mid \theta^{(t)})}{p(x_j \mid \theta^{(t)})},    (55)

in which z_{ij} can be 0 or 1, and z_{ij} = 1 is true if and only if x_j is generated by the i-th Gaussian component.

Next, we will compute the Q function, which is the expectation of the complete data log-likelihood ln p(X, Z | θ) with respect to the posterior distribution we just found. The GMM complete data log-likelihood was already computed in Equation 18. For easier reference, we copy this equation here:

    \sum_{j=1}^{M} \sum_{i=1}^{N} z_{ij} \left( \frac{1}{2} \left( \ln |\Sigma_i^{-1}| - (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right) + \ln \alpha_i \right) + \text{const}.    (56)

The expectation of Equation 56 with respect to Z is

    \sum_{j=1}^{M} \sum_{i=1}^{N} \gamma_{ij} \left( \frac{1}{2} \left( \ln |\Sigma_i^{-1}| - (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right) + \ln \alpha_i \right),    (57)

where the constant term is ignored and γ_{ij} is the expectation of z_{ij} given x_j and θ^{(t)}. In other words, we need to compute the expectation of the conditional distribution defined by Equation 55. In Equation 55, the denominator does not depend on Z, and p(x_j | θ^{(t)}) equals \sum_{i=1}^{N} \alpha_i^{(t)} N(x_j; \mu_i^{(t)}, \Sigma_i^{(t)}). For the numerator, we can directly compute its expectation, as

    \mathbb{E}\left[ p(x_j, z_{ij} \mid \theta^{(t)}) \right] = \mathbb{E}\left[ p(z_{ij} \mid \theta^{(t)}) \, p(x_j \mid z_{ij}, \theta^{(t)}) \right].    (58)

Note that when z_{ij} = 0, we always have p(x_j, z_{ij} | θ^{(t)}) = 0. Thus,

    \mathbb{E}\left[ p(z_{ij} \mid \theta^{(t)}) \, p(x_j \mid z_{ij}, \theta^{(t)}) \right] = \Pr(z_{ij} = 1) \, p(x_j \mid \mu_i^{(t)}, \Sigma_i^{(t)})    (59)
    = \alpha_i^{(t)} N(x_j; \mu_i^{(t)}, \Sigma_i^{(t)}).    (60)

Hence, we have

    \gamma_{ij} = \mathbb{E}\left[ z_{ij} \mid x_j, \theta^{(t)} \right] = \frac{\alpha_i^{(t)} N(x_j; \mu_i^{(t)}, \Sigma_i^{(t)})}{p(x_j \mid \theta^{(t)})}    (61)
    = \frac{\alpha_i^{(t)} N(x_j; \mu_i^{(t)}, \Sigma_i^{(t)})}{\sum_{k=1}^{N} \alpha_k^{(t)} N(x_j; \mu_k^{(t)}, \Sigma_k^{(t)})}    (62)

for 1 ≤ i ≤ N, 1 ≤ j ≤ M. After the γ_{ij} are computed, Equation 57 is completely specified.

We start the optimization from α. Because there is a constraint \sum_{i=1}^{N} \alpha_i = 1, we use the Lagrange multiplier method, remove irrelevant terms, and get

    \sum_{j=1}^{M} \sum_{i=1}^{N} \gamma_{ij} \ln \alpha_i + \lambda \left( \sum_{i=1}^{N} \alpha_i - 1 \right).    (63)

Setting the derivative with respect to α_i to 0 gives us, for any 1 ≤ i ≤ N,

    \frac{\sum_{j=1}^{M} \gamma_{ij}}{\alpha_i} + \lambda = 0,    (64)

or α_i = −(\sum_{j=1}^{M} \gamma_{ij}) / λ. Because \sum_{i=1}^{N} \alpha_i = 1, we know that λ = −\sum_{i=1}^{N} \sum_{j=1}^{M} \gamma_{ij}. Hence, α_i = \sum_{j=1}^{M} \gamma_{ij} / \sum_{i=1}^{N} \sum_{j=1}^{M} \gamma_{ij}. For notational simplicity, we define

    m_i = \sum_{j=1}^{M} \gamma_{ij}.    (65)

From the definition of γ_{ij}, it is easy to prove that

    \sum_{i=1}^{N} m_i = \sum_{i=1}^{N} \sum_{j=1}^{M} \gamma_{ij} = \sum_{j=1}^{M} \sum_{i=1}^{N} \gamma_{ij} = \sum_{j=1}^{M} 1 = M.    (66)

Then, we get the updating rule for α_i:

    \alpha_i^{(t+1)} = \frac{m_i}{M}.    (67)

Furthermore, using similar steps as in deriving the single Gaussian equations,[9] it is easy to show that for any 1 ≤ i ≤ N,

    \mu_i^{(t+1)} = \frac{\sum_{j=1}^{M} \gamma_{ij} x_j}{m_i},    (68)
    \Sigma_i^{(t+1)} = \frac{\sum_{j=1}^{M} \gamma_{ij} (x_j - \mu_i^{(t+1)}) (x_j - \mu_i^{(t+1)})^T}{m_i}.    (69)

[9] Please refer to my note on properties of normal distributions.

Putting these results together, we have the complete set of updating rules for GMM. If at iteration t the parameters are estimated as α_i^{(t)}, µ_i^{(t)}, and Σ_i^{(t)} for 1 ≤ i ≤ N, the EM algorithm updates these parameters as (for 1 ≤ i ≤ N, 1 ≤ j ≤ M)

    \gamma_{ij} = \frac{\alpha_i^{(t)} N(x_j; \mu_i^{(t)}, \Sigma_i^{(t)})}{\sum_{k=1}^{N} \alpha_k^{(t)} N(x_j; \mu_k^{(t)}, \Sigma_k^{(t)})},    (70)
    m_i = \sum_{j=1}^{M} \gamma_{ij},    (71)
    \mu_i^{(t+1)} = \frac{\sum_{j=1}^{M} \gamma_{ij} x_j}{m_i},    (72)
    \Sigma_i^{(t+1)} = \frac{\sum_{j=1}^{M} \gamma_{ij} (x_j - \mu_i^{(t+1)}) (x_j - \mu_i^{(t+1)})^T}{m_i},    (73)

together with α_i^{(t+1)} = m_i / M from Equation 67.
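To make these updating rules concrete, here is a compact Python implementation of the full EM loop for a GMM. This is my own sketch (assuming NumPy and SciPy), not code from the note: the E-step computes the responsibilities γ_{ij} of Equation 70, the M-step applies Equations 67 and 71 to 73, and the incomplete data log-likelihood of Equation 15 is monitored for convergence.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, N, n_iter=100, tol=1e-6, seed=0):
        """EM for a Gaussian mixture with N components; X is an (M, d) data matrix."""
        rng = np.random.default_rng(seed)
        M, d = X.shape
        # Initialize the parameters in a reasonable way (Algorithm 1, line 2).
        alphas = np.full(N, 1.0 / N)
        mus = X[rng.choice(M, size=N, replace=False)]
        Sigmas = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(d) for _ in range(N)])
        prev_ll = -np.inf
        for _ in range(n_iter):
            # E-step: responsibilities gamma[j, i], Equation (70).
            dens = np.column_stack([multivariate_normal(mean=mus[i], cov=Sigmas[i]).pdf(X)
                                    for i in range(N)])   # dens[j, i] = N(x_j; mu_i, Sigma_i)
            weighted = dens * alphas                       # alpha_i * N(x_j; mu_i, Sigma_i)
            gamma = weighted / weighted.sum(axis=1, keepdims=True)
            # M-step: Equations (71), (67), (72), (73).
            m = gamma.sum(axis=0)                          # Equation (71)
            alphas = m / M                                 # Equation (67)
            mus = (gamma.T @ X) / m[:, None]               # Equation (72)
            for i in range(N):
                diff = X - mus[i]
                Sigmas[i] = (gamma[:, i, None] * diff).T @ diff / m[i]   # Equation (73)
            # Monitor the incomplete data log-likelihood (Equation 15).
            ll = np.log(weighted.sum(axis=1)).sum()
            if ll - prev_ll < tol:
                break
            prev_ll = ll
        return alphas, mus, Sigmas

Run on samples X drawn with the two-step sampling sketch of Section 2.2, the returned estimates are typically close to the generating parameters, up to a permutation of the components.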

Exercises

1. Derive the updating equations for Gaussian Mixture Models by yourself. You should not refer to Section 5 during your derivation. If you have just finished reading Section 5, wait at least 2 to 3 hours before working on this problem.

2. In this problem, we will use the Expectation-Maximization method to learn the parameters of a hidden Markov model (HMM). As will be shown in this problem, the Baum-Welch algorithm is indeed performing EM updates. To work out the solution of this problem, you will also need knowledge and facts learned in the HMM and information theory notes. We will use the notation of the HMM note. For your convenience, the notation is repeated as follows.

There are N discrete states, denoted by symbols S_1, S_2, ..., S_N. There are M discrete output symbols, denoted by V_1, V_2, ..., V_M. Assume one sequence with T time steps, whose hidden state is Q_t and whose observed output is O_t at time t (1 ≤ t ≤ T). We use q_t and o_t to denote the indexes of the state and output symbols at time t, respectively, i.e., Q_t = S_{q_t} and O_t = V_{o_t}. The notation 1:t denotes all the ordered time steps between 1 and t. For example, o_{1:T} is the sequence of all observed output symbols.

An HMM has parameters λ = (π, A, B), where π ∈ R^N specifies the initial state distribution, A ∈ R^{N×N} is the state transition matrix, and B ∈ R^{N×M} is the observation probability matrix. Note that A_{ij} = Pr(Q_t = S_j | Q_{t−1} = S_i) and b_j(k) = Pr(O_t = V_k | Q_t = S_j) are the elements of A and B, respectively.

In this problem, we use a variable r to denote the index of EM iterations. Hence, λ^{(1)} are the initial parameters. Various probabilities have been defined in the HMM note, denoted by α_t(i), β_t(i), γ_t(i), δ_t(i) and ξ_t(i, j). In this problem, we assume that at the r-th iteration, λ^{(r)} are known and these probabilities are computed using λ^{(r)}. The purpose of this problem is to use the EM algorithm to find λ^{(r+1)} using a training sequence o_{1:T} and λ^{(r)}, by treating Q and O as the hidden and observed random variables, respectively.

(a) Suppose the hidden variables can be observed as S_{q_1}, S_{q_2}, ..., S_{q_T}. Show that the complete data log-likelihood is

    \ln \pi_{q_1} + \sum_{t=1}^{T-1} \ln A_{q_t q_{t+1}} + \sum_{t=1}^{T} \ln b_{q_t}(o_t).    (74)

(b) The expectation of Equation 74 with respect to the hidden variables Q_t (conditioned on o_{1:T} and λ^{(r)}) forms an auxiliary function Q(λ, λ^{(r)}) (the E-step). Show that the expectation of the first term in Equation 74 equals \sum_{i=1}^{N} \gamma_1(i) \ln \pi_i, i.e.,

    \mathbb{E}_{Q_{1:T}} [ \ln \pi_{Q_1} ] = \sum_{i=1}^{N} \gamma_1(i) \ln \pi_i.    (75)

(c) Because the parameter π only hinges on Equation 75, the update rule for π can be found by maximizing this equation. Prove that we should set π_i^{(r+1)} = γ_1(i) in the M-step.

Note that γ_1(i) is computed using λ^{(r)} as the parameter values. (Hint: the right hand side of Equation 75 is related to the cross entropy.)

(d) The second part of the E-step calculates the expectation of the middle term in Equation 74. Show that

    \mathbb{E}_{Q_{1:T}} \left[ \sum_{t=1}^{T-1} \ln A_{q_t q_{t+1}} \right] = \sum_{i=1}^{N} \sum_{j=1}^{N} \left( \sum_{t=1}^{T-1} \xi_t(i, j) \right) \ln A_{ij}.    (76)

(e) For the M-step relevant to A, prove that we should set

    A_{ij}^{(r+1)} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}.    (77)

(f) The final part of the E-step calculates the expectation of the last term in Equation 74. Show that

    \mathbb{E}_{Q_{1:T}} \left[ \sum_{t=1}^{T} \ln b_{q_t}(o_t) \right] = \sum_{j=1}^{N} \sum_{k=1}^{M} \left( \sum_{t=1}^{T} [\![ o_t = k ]\!] \, \gamma_t(j) \right) \ln b_j(k).    (78)

(g) For the M-step relevant to B, prove that we should set

    b_j^{(r+1)}(k) = \frac{\sum_{t=1}^{T} [\![ o_t = k ]\!] \, \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)},    (79)

in which [\![ \cdot ]\!] is the indicator function.

(h) Are these results obtained using EM the same as those in the Baum-Welch algorithm?
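Once γ_t(i) and ξ_t(i, j) have been computed from a forward-backward pass under λ^{(r)}, the M-step updates asked for in parts (c), (e), and (g) are a few array operations. The following Python fragment is my own illustration (it assumes gamma of shape (T, N), xi of shape (T-1, N, N), and the observed symbol indices o of length T are already available); it applies π_i^{(r+1)} = γ_1(i), Equation 77, and Equation 79.

    import numpy as np

    def baum_welch_m_step(gamma, xi, o, M_symbols):
        """One Baum-Welch M-step from precomputed posteriors.

        gamma: (T, N), gamma[t, i] = Pr(Q_t = S_i | o_{1:T}, lambda^(r))
        xi:    (T-1, N, N), xi[t, i, j] = Pr(Q_t = S_i, Q_{t+1} = S_j | o_{1:T}, lambda^(r))
        o:     (T,) observed symbol indices in {0, ..., M_symbols - 1}
        """
        T, N = gamma.shape
        o = np.asarray(o)
        pi_new = gamma[0].copy()                                   # part (c): pi_i = gamma_1(i)
        A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # Equation (77)
        B_new = np.zeros((N, M_symbols))
        for k in range(M_symbols):
            B_new[:, k] = gamma[o == k].sum(axis=0)                # numerator of Equation (79)
        B_new /= gamma.sum(axis=0)[:, None]                        # Equation (79)
        return pi_new, A_new, B_new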

Hidden Markov Models & The Multivariate Gaussian (10/26/04)

Hidden Markov Models & The Multivariate Gaussian (10/26/04) CS281A/Stat241A: Statstcal Learnng Theory Hdden Markov Models & The Multvarate Gaussan (10/26/04) Lecturer: Mchael I. Jordan Scrbes: Jonathan W. Hu 1 Hdden Markov Models As a bref revew, hdden Markov models

More information

MATH 829: Introduction to Data Mining and Analysis The EM algorithm (part 2)

MATH 829: Introduction to Data Mining and Analysis The EM algorithm (part 2) 1/16 MATH 829: Introducton to Data Mnng and Analyss The EM algorthm (part 2) Domnque Gullot Departments of Mathematcal Scences Unversty of Delaware Aprl 20, 2016 Recall 2/16 We are gven ndependent observatons

More information

Hidden Markov Models

Hidden Markov Models Hdden Markov Models Namrata Vaswan, Iowa State Unversty Aprl 24, 204 Hdden Markov Model Defntons and Examples Defntons:. A hdden Markov model (HMM) refers to a set of hdden states X 0, X,..., X t,...,

More information

Lecture Notes on Linear Regression

Lecture Notes on Linear Regression Lecture Notes on Lnear Regresson Feng L fl@sdueducn Shandong Unversty, Chna Lnear Regresson Problem In regresson problem, we am at predct a contnuous target value gven an nput feature vector We assume

More information

2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification

2E Pattern Recognition Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification E395 - Pattern Recognton Solutons to Introducton to Pattern Recognton, Chapter : Bayesan pattern classfcaton Preface Ths document s a soluton manual for selected exercses from Introducton to Pattern Recognton

More information

Generalized Linear Methods

Generalized Linear Methods Generalzed Lnear Methods 1 Introducton In the Ensemble Methods the general dea s that usng a combnaton of several weak learner one could make a better learner. More formally, assume that we have a set

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Maxmum Lkelhood Estmaton INFO-2301: Quanttatve Reasonng 2 Mchael Paul and Jordan Boyd-Graber MARCH 7, 2017 INFO-2301: Quanttatve Reasonng 2 Paul and Boyd-Graber Maxmum Lkelhood Estmaton 1 of 9 Why MLE?

More information

Conjugacy and the Exponential Family

Conjugacy and the Exponential Family CS281B/Stat241B: Advanced Topcs n Learnng & Decson Makng Conjugacy and the Exponental Famly Lecturer: Mchael I. Jordan Scrbes: Bran Mlch 1 Conjugacy In the prevous lecture, we saw conjugate prors for the

More information

EM and Structure Learning

EM and Structure Learning EM and Structure Learnng Le Song Machne Learnng II: Advanced Topcs CSE 8803ML, Sprng 2012 Partally observed graphcal models Mxture Models N(μ 1, Σ 1 ) Z X N N(μ 2, Σ 2 ) 2 Gaussan mxture model Consder

More information

Limited Dependent Variables

Limited Dependent Variables Lmted Dependent Varables. What f the left-hand sde varable s not a contnuous thng spread from mnus nfnty to plus nfnty? That s, gven a model = f (, β, ε, where a. s bounded below at zero, such as wages

More information

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute

More information

xp(x µ) = 0 p(x = 0 µ) + 1 p(x = 1 µ) = µ

xp(x µ) = 0 p(x = 0 µ) + 1 p(x = 1 µ) = µ CSE 455/555 Sprng 2013 Homework 7: Parametrc Technques Jason J. Corso Computer Scence and Engneerng SUY at Buffalo jcorso@buffalo.edu Solutons by Yngbo Zhou Ths assgnment does not need to be submtted and

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm The Expectaton-Maxmaton Algorthm Charles Elan elan@cs.ucsd.edu November 16, 2007 Ths chapter explans the EM algorthm at multple levels of generalty. Secton 1 gves the standard hgh-level verson of the algorthm.

More information

Gaussian Mixture Models

Gaussian Mixture Models Lab Gaussan Mxture Models Lab Objectve: Understand the formulaton of Gaussan Mxture Models (GMMs) and how to estmate GMM parameters. You ve already seen GMMs as the observaton dstrbuton n certan contnuous

More information

Lecture 10 Support Vector Machines II

Lecture 10 Support Vector Machines II Lecture 10 Support Vector Machnes II 22 February 2016 Taylor B. Arnold Yale Statstcs STAT 365/665 1/28 Notes: Problem 3 s posted and due ths upcomng Frday There was an early bug n the fake-test data; fxed

More information

Outline. Bayesian Networks: Maximum Likelihood Estimation and Tree Structure Learning. Our Model and Data. Outline

Outline. Bayesian Networks: Maximum Likelihood Estimation and Tree Structure Learning. Our Model and Data. Outline Outlne Bayesan Networks: Maxmum Lkelhood Estmaton and Tree Structure Learnng Huzhen Yu janey.yu@cs.helsnk.f Dept. Computer Scence, Unv. of Helsnk Probablstc Models, Sprng, 200 Notces: I corrected a number

More information

Linear Approximation with Regularization and Moving Least Squares

Linear Approximation with Regularization and Moving Least Squares Lnear Approxmaton wth Regularzaton and Movng Least Squares Igor Grešovn May 007 Revson 4.6 (Revson : March 004). 5 4 3 0.5 3 3.5 4 Contents: Lnear Fttng...4. Weghted Least Squares n Functon Approxmaton...

More information

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U) Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of

More information

CS 2750 Machine Learning. Lecture 5. Density estimation. CS 2750 Machine Learning. Announcements

CS 2750 Machine Learning. Lecture 5. Density estimation. CS 2750 Machine Learning. Announcements CS 750 Machne Learnng Lecture 5 Densty estmaton Mlos Hauskrecht mlos@cs.ptt.edu 539 Sennott Square CS 750 Machne Learnng Announcements Homework Due on Wednesday before the class Reports: hand n before

More information

STATS 306B: Unsupervised Learning Spring Lecture 10 April 30

STATS 306B: Unsupervised Learning Spring Lecture 10 April 30 STATS 306B: Unsupervsed Learnng Sprng 2014 Lecture 10 Aprl 30 Lecturer: Lester Mackey Scrbe: Joey Arthur, Rakesh Achanta 10.1 Factor Analyss 10.1.1 Recap Recall the factor analyss (FA) model for lnear

More information

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010 Parametrc fractonal mputaton for mssng data analyss Jae Kwang Km Survey Workng Group Semnar March 29, 2010 1 Outlne Introducton Proposed method Fractonal mputaton Approxmaton Varance estmaton Multple mputaton

More information

8 : Learning in Fully Observed Markov Networks. 1 Why We Need to Learn Undirected Graphical Models. 2 Structural Learning for Completely Observed MRF

8 : Learning in Fully Observed Markov Networks. 1 Why We Need to Learn Undirected Graphical Models. 2 Structural Learning for Completely Observed MRF 10-708: Probablstc Graphcal Models 10-708, Sprng 2014 8 : Learnng n Fully Observed Markov Networks Lecturer: Erc P. Xng Scrbes: Meng Song, L Zhou 1 Why We Need to Learn Undrected Graphcal Models In the

More information

Finite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin

Finite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin Fnte Mxture Models and Expectaton Maxmzaton Most sldes are from: Dr. Maro Fgueredo, Dr. Anl Jan and Dr. Rong Jn Recall: The Supervsed Learnng Problem Gven a set of n samples X {(x, y )},,,n Chapter 3 of

More information

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4) I. Classcal Assumptons Econ7 Appled Econometrcs Topc 3: Classcal Model (Studenmund, Chapter 4) We have defned OLS and studed some algebrac propertes of OLS. In ths topc we wll study statstcal propertes

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

Chapter Newton s Method

Chapter Newton s Method Chapter 9. Newton s Method After readng ths chapter, you should be able to:. Understand how Newton s method s dfferent from the Golden Secton Search method. Understand how Newton s method works 3. Solve

More information

Expectation Maximization Mixture Models HMMs

Expectation Maximization Mixture Models HMMs -755 Machne Learnng for Sgnal Processng Mture Models HMMs Class 9. 2 Sep 200 Learnng Dstrbutons for Data Problem: Gven a collecton of eamples from some data, estmate ts dstrbuton Basc deas of Mamum Lelhood

More information

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix Lectures - Week 4 Matrx norms, Condtonng, Vector Spaces, Lnear Independence, Spannng sets and Bass, Null space and Range of a Matrx Matrx Norms Now we turn to assocatng a number to each matrx. We could

More information

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models ECO 452 -- OE 4: Probt and Logt Models ECO 452 -- OE 4 Maxmum Lkelhood Estmaton of Bnary Dependent Varables Models: Probt and Logt hs note demonstrates how to formulate bnary dependent varables models

More information

Assortment Optimization under MNL

Assortment Optimization under MNL Assortment Optmzaton under MNL Haotan Song Aprl 30, 2017 1 Introducton The assortment optmzaton problem ams to fnd the revenue-maxmzng assortment of products to offer when the prces of products are fxed.

More information

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton

More information

Homework Assignment 3 Due in class, Thursday October 15

Homework Assignment 3 Due in class, Thursday October 15 Homework Assgnment 3 Due n class, Thursday October 15 SDS 383C Statstcal Modelng I 1 Rdge regresson and Lasso 1. Get the Prostrate cancer data from http://statweb.stanford.edu/~tbs/elemstatlearn/ datasets/prostate.data.

More information

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z )

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z ) C4B Machne Learnng Answers II.(a) Show that for the logstc sgmod functon dσ(z) dz = σ(z) ( σ(z)) A. Zsserman, Hlary Term 20 Start from the defnton of σ(z) Note that Then σ(z) = σ = dσ(z) dz = + e z e z

More information

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results. Neural Networks : Dervaton compled by Alvn Wan from Professor Jtendra Malk s lecture Ths type of computaton s called deep learnng and s the most popular method for many problems, such as computer vson

More information

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering /

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering / Theory and Applcatons of Pattern Recognton 003, Rob Polkar, Rowan Unversty, Glassboro, NJ Lecture 4 Bayes Classfcaton Rule Dept. of Electrcal and Computer Engneerng 0909.40.0 / 0909.504.04 Theory & Applcatons

More information

Lecture 7: Boltzmann distribution & Thermodynamics of mixing

Lecture 7: Boltzmann distribution & Thermodynamics of mixing Prof. Tbbtt Lecture 7 etworks & Gels Lecture 7: Boltzmann dstrbuton & Thermodynamcs of mxng 1 Suggested readng Prof. Mark W. Tbbtt ETH Zürch 13 März 018 Molecular Drvng Forces Dll and Bromberg: Chapters

More information

10-701/ Machine Learning, Fall 2005 Homework 3

10-701/ Machine Learning, Fall 2005 Homework 3 10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40

More information

Kernel Methods and SVMs Extension

Kernel Methods and SVMs Extension Kernel Methods and SVMs Extenson The purpose of ths document s to revew materal covered n Machne Learnng 1 Supervsed Learnng regardng support vector machnes (SVMs). Ths document also provdes a general

More information

Feature Selection: Part 1

Feature Selection: Part 1 CSE 546: Machne Learnng Lecture 5 Feature Selecton: Part 1 Instructor: Sham Kakade 1 Regresson n the hgh dmensonal settng How do we learn when the number of features d s greater than the sample sze n?

More information

Stat260: Bayesian Modeling and Inference Lecture Date: February 22, Reference Priors

Stat260: Bayesian Modeling and Inference Lecture Date: February 22, Reference Priors Stat60: Bayesan Modelng and Inference Lecture Date: February, 00 Reference Prors Lecturer: Mchael I. Jordan Scrbe: Steven Troxler and Wayne Lee In ths lecture, we assume that θ R; n hgher-dmensons, reference

More information

Composite Hypotheses testing

Composite Hypotheses testing Composte ypotheses testng In many hypothess testng problems there are many possble dstrbutons that can occur under each of the hypotheses. The output of the source s a set of parameters (ponts n a parameter

More information

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family IOSR Journal of Mathematcs IOSR-JM) ISSN: 2278-5728. Volume 3, Issue 3 Sep-Oct. 202), PP 44-48 www.osrjournals.org Usng T.O.M to Estmate Parameter of dstrbutons that have not Sngle Exponental Famly Jubran

More information

3.1 ML and Empirical Distribution

3.1 ML and Empirical Distribution 67577 Intro. to Machne Learnng Fall semester, 2008/9 Lecture 3: Maxmum Lkelhood/ Maxmum Entropy Dualty Lecturer: Amnon Shashua Scrbe: Amnon Shashua 1 In the prevous lecture we defned the prncple of Maxmum

More information

ENG 8801/ Special Topics in Computer Engineering: Pattern Recognition. Memorial University of Newfoundland Pattern Recognition

ENG 8801/ Special Topics in Computer Engineering: Pattern Recognition. Memorial University of Newfoundland Pattern Recognition EG 880/988 - Specal opcs n Computer Engneerng: Pattern Recognton Memoral Unversty of ewfoundland Pattern Recognton Lecture 7 May 3, 006 http://wwwengrmunca/~charlesr Offce Hours: uesdays hursdays 8:30-9:30

More information

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012 MLE and Bayesan Estmaton Je Tang Department of Computer Scence & Technology Tsnghua Unversty 01 1 Lnear Regresson? As the frst step, we need to decde how we re gong to represent the functon f. One example:

More information

THE SUMMATION NOTATION Ʃ

THE SUMMATION NOTATION Ʃ Sngle Subscrpt otaton THE SUMMATIO OTATIO Ʃ Most of the calculatons we perform n statstcs are repettve operatons on lsts of numbers. For example, we compute the sum of a set of numbers, or the sum of the

More information

On an Extension of Stochastic Approximation EM Algorithm for Incomplete Data Problems. Vahid Tadayon 1

On an Extension of Stochastic Approximation EM Algorithm for Incomplete Data Problems. Vahid Tadayon 1 On an Extenson of Stochastc Approxmaton EM Algorthm for Incomplete Data Problems Vahd Tadayon Abstract: The Stochastc Approxmaton EM (SAEM algorthm, a varant stochastc approxmaton of EM, s a versatle tool

More information

1 Convex Optimization

1 Convex Optimization Convex Optmzaton We wll consder convex optmzaton problems. Namely, mnmzaton problems where the objectve s convex (we assume no constrants for now). Such problems often arse n machne learnng. For example,

More information

Problem Set 9 Solutions

Problem Set 9 Solutions Desgn and Analyss of Algorthms May 4, 2015 Massachusetts Insttute of Technology 6.046J/18.410J Profs. Erk Demane, Srn Devadas, and Nancy Lynch Problem Set 9 Solutons Problem Set 9 Solutons Ths problem

More information

More metrics on cartesian products

More metrics on cartesian products More metrcs on cartesan products If (X, d ) are metrc spaces for 1 n, then n Secton II4 of the lecture notes we defned three metrcs on X whose underlyng topologes are the product topology The purpose of

More information

Notes on Frequency Estimation in Data Streams

Notes on Frequency Estimation in Data Streams Notes on Frequency Estmaton n Data Streams In (one of) the data streamng model(s), the data s a sequence of arrvals a 1, a 2,..., a m of the form a j = (, v) where s the dentty of the tem and belongs to

More information

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009 College of Computer & Informaton Scence Fall 2009 Northeastern Unversty 20 October 2009 CS7880: Algorthmc Power Tools Scrbe: Jan Wen and Laura Poplawsk Lecture Outlne: Prmal-dual schema Network Desgn:

More information

The EM Algorithm (Dempster, Laird, Rubin 1977) The missing data or incomplete data setting: ODL(φ;Y ) = [Y;φ] = [Y X,φ][X φ] = X

The EM Algorithm (Dempster, Laird, Rubin 1977) The missing data or incomplete data setting: ODL(φ;Y ) = [Y;φ] = [Y X,φ][X φ] = X The EM Algorthm (Dempster, Lard, Rubn 1977 The mssng data or ncomplete data settng: An Observed Data Lkelhood (ODL that s a mxture or ntegral of Complete Data Lkelhoods (CDL. (1a ODL(;Y = [Y;] = [Y,][

More information

COS 521: Advanced Algorithms Game Theory and Linear Programming

COS 521: Advanced Algorithms Game Theory and Linear Programming COS 521: Advanced Algorthms Game Theory and Lnear Programmng Moses Charkar February 27, 2013 In these notes, we ntroduce some basc concepts n game theory and lnear programmng (LP). We show a connecton

More information

The Geometry of Logit and Probit

The Geometry of Logit and Probit The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.

More information

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg prnceton unv. F 17 cos 521: Advanced Algorthm Desgn Lecture 7: LP Dualty Lecturer: Matt Wenberg Scrbe: LP Dualty s an extremely useful tool for analyzng structural propertes of lnear programs. Whle there

More information

j) = 1 (note sigma notation) ii. Continuous random variable (e.g. Normal distribution) 1. density function: f ( x) 0 and f ( x) dx = 1

j) = 1 (note sigma notation) ii. Continuous random variable (e.g. Normal distribution) 1. density function: f ( x) 0 and f ( x) dx = 1 Random varables Measure of central tendences and varablty (means and varances) Jont densty functons and ndependence Measures of assocaton (covarance and correlaton) Interestng result Condtonal dstrbutons

More information

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons

More information

Classification as a Regression Problem

Classification as a Regression Problem Target varable y C C, C,, ; Classfcaton as a Regresson Problem { }, 3 L C K To treat classfcaton as a regresson problem we should transform the target y nto numercal values; The choce of numercal class

More information

Markov Chain Monte Carlo (MCMC), Gibbs Sampling, Metropolis Algorithms, and Simulated Annealing Bioinformatics Course Supplement

Markov Chain Monte Carlo (MCMC), Gibbs Sampling, Metropolis Algorithms, and Simulated Annealing Bioinformatics Course Supplement Markov Chan Monte Carlo MCMC, Gbbs Samplng, Metropols Algorthms, and Smulated Annealng 2001 Bonformatcs Course Supplement SNU Bontellgence Lab http://bsnuackr/ Outlne! Markov Chan Monte Carlo MCMC! Metropols-Hastngs

More information

Supplementary material: Margin based PU Learning. Matrix Concentration Inequalities

Supplementary material: Margin based PU Learning. Matrix Concentration Inequalities Supplementary materal: Margn based PU Learnng We gve the complete proofs of Theorem and n Secton We frst ntroduce the well-known concentraton nequalty, so the covarance estmator can be bounded Then we

More information

Goodness of fit and Wilks theorem

Goodness of fit and Wilks theorem DRAFT 0.0 Glen Cowan 3 June, 2013 Goodness of ft and Wlks theorem Suppose we model data y wth a lkelhood L(µ) that depends on a set of N parameters µ = (µ 1,...,µ N ). Defne the statstc t µ ln L(µ) L(ˆµ),

More information

The exam is closed book, closed notes except your one-page cheat sheet.

The exam is closed book, closed notes except your one-page cheat sheet. CS 89 Fall 206 Introducton to Machne Learnng Fnal Do not open the exam before you are nstructed to do so The exam s closed book, closed notes except your one-page cheat sheet Usage of electronc devces

More information

Primer on High-Order Moment Estimators

Primer on High-Order Moment Estimators Prmer on Hgh-Order Moment Estmators Ton M. Whted July 2007 The Errors-n-Varables Model We wll start wth the classcal EIV for one msmeasured regressor. The general case s n Erckson and Whted Econometrc

More information

Learning from Data 1 Naive Bayes

Learning from Data 1 Naive Bayes Learnng from Data 1 Nave Bayes Davd Barber dbarber@anc.ed.ac.uk course page : http://anc.ed.ac.uk/ dbarber/lfd1/lfd1.html c Davd Barber 2001, 2002 1 Learnng from Data 1 : c Davd Barber 2001,2002 2 1 Why

More information

Introduction to Vapor/Liquid Equilibrium, part 2. Raoult s Law:

Introduction to Vapor/Liquid Equilibrium, part 2. Raoult s Law: CE304, Sprng 2004 Lecture 4 Introducton to Vapor/Lqud Equlbrum, part 2 Raoult s Law: The smplest model that allows us do VLE calculatons s obtaned when we assume that the vapor phase s an deal gas, and

More information

Lecture 4. Instructor: Haipeng Luo

Lecture 4. Instructor: Haipeng Luo Lecture 4 Instructor: Hapeng Luo In the followng lectures, we focus on the expert problem and study more adaptve algorthms. Although Hedge s proven to be worst-case optmal, one may wonder how well t would

More information

Course 395: Machine Learning - Lectures

Course 395: Machine Learning - Lectures Course 395: Machne Learnng - Lectures Lecture 1-2: Concept Learnng (M. Pantc Lecture 3-4: Decson Trees & CC Intro (M. Pantc Lecture 5-6: Artfcal Neural Networks (S.Zaferou Lecture 7-8: Instance ased Learnng

More information

Erratum: A Generalized Path Integral Control Approach to Reinforcement Learning

Erratum: A Generalized Path Integral Control Approach to Reinforcement Learning Journal of Machne Learnng Research 00-9 Submtted /0; Publshed 7/ Erratum: A Generalzed Path Integral Control Approach to Renforcement Learnng Evangelos ATheodorou Jonas Buchl Stefan Schaal Department of

More information

Semi-Supervised Learning

Semi-Supervised Learning Sem-Supervsed Learnng Consder the problem of Prepostonal Phrase Attachment. Buy car wth money ; buy car wth wheel There are several ways to generate features. Gven the lmted representaton, we can assume

More information

PHYS 705: Classical Mechanics. Calculus of Variations II

PHYS 705: Classical Mechanics. Calculus of Variations II 1 PHYS 705: Classcal Mechancs Calculus of Varatons II 2 Calculus of Varatons: Generalzaton (no constrant yet) Suppose now that F depends on several dependent varables : We need to fnd such that has a statonary

More information

Computation of Higher Order Moments from Two Multinomial Overdispersion Likelihood Models

Computation of Higher Order Moments from Two Multinomial Overdispersion Likelihood Models Computaton of Hgher Order Moments from Two Multnomal Overdsperson Lkelhood Models BY J. T. NEWCOMER, N. K. NEERCHAL Department of Mathematcs and Statstcs, Unversty of Maryland, Baltmore County, Baltmore,

More information

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve

More information

The Order Relation and Trace Inequalities for. Hermitian Operators

The Order Relation and Trace Inequalities for. Hermitian Operators Internatonal Mathematcal Forum, Vol 3, 08, no, 507-57 HIKARI Ltd, wwwm-hkarcom https://doorg/0988/mf088055 The Order Relaton and Trace Inequaltes for Hermtan Operators Y Huang School of Informaton Scence

More information

Case A. P k = Ni ( 2L i k 1 ) + (# big cells) 10d 2 P k.

Case A. P k = Ni ( 2L i k 1 ) + (# big cells) 10d 2 P k. THE CELLULAR METHOD In ths lecture, we ntroduce the cellular method as an approach to ncdence geometry theorems lke the Szemeréd-Trotter theorem. The method was ntroduced n the paper Combnatoral complexty

More information

Hidden Markov Models

Hidden Markov Models CM229S: Machne Learnng for Bonformatcs Lecture 12-05/05/2016 Hdden Markov Models Lecturer: Srram Sankararaman Scrbe: Akshay Dattatray Shnde Edted by: TBD 1 Introducton For a drected graph G we can wrte

More information

STAT 3008 Applied Regression Analysis

STAT 3008 Applied Regression Analysis STAT 3008 Appled Regresson Analyss Tutoral : Smple Lnear Regresson LAI Chun He Department of Statstcs, The Chnese Unversty of Hong Kong 1 Model Assumpton To quantfy the relatonshp between two factors,

More information

EEE 241: Linear Systems

EEE 241: Linear Systems EEE : Lnear Systems Summary #: Backpropagaton BACKPROPAGATION The perceptron rule as well as the Wdrow Hoff learnng were desgned to tran sngle layer networks. They suffer from the same dsadvantage: they

More information

APPROXIMATE PRICES OF BASKET AND ASIAN OPTIONS DUPONT OLIVIER. Premia 14

APPROXIMATE PRICES OF BASKET AND ASIAN OPTIONS DUPONT OLIVIER. Premia 14 APPROXIMAE PRICES OF BASKE AND ASIAN OPIONS DUPON OLIVIER Prema 14 Contents Introducton 1 1. Framewor 1 1.1. Baset optons 1.. Asan optons. Computng the prce 3. Lower bound 3.1. Closed formula for the prce

More information

Machine learning: Density estimation

Machine learning: Density estimation CS 70 Foundatons of AI Lecture 3 Machne learnng: ensty estmaton Mlos Hauskrecht mlos@cs.ptt.edu 539 Sennott Square ata: ensty estmaton {.. n} x a vector of attrbute values Objectve: estmate the model of

More information

Note on EM-training of IBM-model 1

Note on EM-training of IBM-model 1 Note on EM-tranng of IBM-model INF58 Language Technologcal Applcatons, Fall The sldes on ths subject (nf58 6.pdf) ncludng the example seem nsuffcent to gve a good grasp of what s gong on. Hence here are

More information

Week 5: Neural Networks

Week 5: Neural Networks Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple

More information

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA 4 Analyss of Varance (ANOVA) 5 ANOVA 51 Introducton ANOVA ANOVA s a way to estmate and test the means of multple populatons We wll start wth one-way ANOVA If the populatons ncluded n the study are selected

More information

Section 8.3 Polar Form of Complex Numbers

Section 8.3 Polar Form of Complex Numbers 80 Chapter 8 Secton 8 Polar Form of Complex Numbers From prevous classes, you may have encountered magnary numbers the square roots of negatve numbers and, more generally, complex numbers whch are the

More information

Statistical analysis using matlab. HY 439 Presented by: George Fortetsanakis

Statistical analysis using matlab. HY 439 Presented by: George Fortetsanakis Statstcal analyss usng matlab HY 439 Presented by: George Fortetsanaks Roadmap Probablty dstrbutons Statstcal estmaton Fttng data to probablty dstrbutons Contnuous dstrbutons Contnuous random varable X

More information

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction ECONOMICS 5* -- NOTE (Summary) ECON 5* -- NOTE The Multple Classcal Lnear Regresson Model (CLRM): Specfcaton and Assumptons. Introducton CLRM stands for the Classcal Lnear Regresson Model. The CLRM s also

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

Singular Value Decomposition: Theory and Applications

Singular Value Decomposition: Theory and Applications Sngular Value Decomposton: Theory and Applcatons Danel Khashab Sprng 2015 Last Update: March 2, 2015 1 Introducton A = UDV where columns of U and V are orthonormal and matrx D s dagonal wth postve real

More information

SDMML HT MSc Problem Sheet 4

SDMML HT MSc Problem Sheet 4 SDMML HT 06 - MSc Problem Sheet 4. The recever operatng characterstc ROC curve plots the senstvty aganst the specfcty of a bnary classfer as the threshold for dscrmnaton s vared. Let the data space be

More information

Introduction to Hidden Markov Models

Introduction to Hidden Markov Models Introducton to Hdden Markov Models Alperen Degrmenc Ths document contans dervatons and algorthms for mplementng Hdden Markov Models. The content presented here s a collecton of my notes and personal nsghts

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.65/15.070J Fall 013 Lecture 1 10/1/013 Martngale Concentraton Inequaltes and Applcatons Content. 1. Exponental concentraton for martngales wth bounded ncrements.

More information

Difference Equations

Difference Equations Dfference Equatons c Jan Vrbk 1 Bascs Suppose a sequence of numbers, say a 0,a 1,a,a 3,... s defned by a certan general relatonshp between, say, three consecutve values of the sequence, e.g. a + +3a +1

More information

A Bayes Algorithm for the Multitask Pattern Recognition Problem Direct Approach

A Bayes Algorithm for the Multitask Pattern Recognition Problem Direct Approach A Bayes Algorthm for the Multtask Pattern Recognton Problem Drect Approach Edward Puchala Wroclaw Unversty of Technology, Char of Systems and Computer etworks, Wybrzeze Wyspanskego 7, 50-370 Wroclaw, Poland

More information

ECE559VV Project Report

ECE559VV Project Report ECE559VV Project Report (Supplementary Notes Loc Xuan Bu I. MAX SUM-RATE SCHEDULING: THE UPLINK CASE We have seen (n the presentaton that, for downlnk (broadcast channels, the strategy maxmzng the sum-rate

More information

Rockefeller College University at Albany

Rockefeller College University at Albany Rockefeller College Unverst at Alban PAD 705 Handout: Maxmum Lkelhood Estmaton Orgnal b Davd A. Wse John F. Kenned School of Government, Harvard Unverst Modfcatons b R. Karl Rethemeer Up to ths pont n

More information

Density matrix. c α (t)φ α (q)

Density matrix. c α (t)φ α (q) Densty matrx Note: ths s supplementary materal. I strongly recommend that you read t for your own nterest. I beleve t wll help wth understandng the quantum ensembles, but t s not necessary to know t n

More information

2.3 Nilpotent endomorphisms

2.3 Nilpotent endomorphisms s a block dagonal matrx, wth A Mat dm U (C) In fact, we can assume that B = B 1 B k, wth B an ordered bass of U, and that A = [f U ] B, where f U : U U s the restrcton of f to U 40 23 Nlpotent endomorphsms

More information

NP-Completeness : Proofs

NP-Completeness : Proofs NP-Completeness : Proofs Proof Methods A method to show a decson problem Π NP-complete s as follows. (1) Show Π NP. (2) Choose an NP-complete problem Π. (3) Show Π Π. A method to show an optmzaton problem

More information

8/25/17. Data Modeling. Data Modeling. Data Modeling. Patrice Koehl Department of Biological Sciences National University of Singapore

8/25/17. Data Modeling. Data Modeling. Data Modeling. Patrice Koehl Department of Biological Sciences National University of Singapore 8/5/17 Data Modelng Patrce Koehl Department of Bologcal Scences atonal Unversty of Sngapore http://www.cs.ucdavs.edu/~koehl/teachng/bl59 koehl@cs.ucdavs.edu Data Modelng Ø Data Modelng: least squares Ø

More information