8 : Learning Partially Observed GM: the EM algorithm
10-708: Probabilistic Graphical Models, Spring
Lecturer: Eric P. Xing
Scribes: Aurick Qiao, Hao Zhang, Bing Liu

1 Introduction

Two fundamental questions in graphical models are learning and inference. Learning is the process of estimating the parameters, and in some cases the topology of the network, from data. We have covered learning methods for completely observed graphical models, both directed and undirected, in previous chapters. The focus of this chapter is on learning of partially observed or unobserved graphical models. Two estimation principles that will be covered are maximal margin and maximum entropy. On the inference side, we previously covered the elimination algorithm and the message-passing algorithm as exact inference techniques. Other exact inference algorithms include the junction tree algorithm, which is less often used these days. Doing exact inference, however, can be expensive, especially in the domain of big data. Thus, much effort has been put into approximate inference techniques, including stochastic simulation, sampling methods, Markov chain Monte Carlo methods, variational algorithms, etc.

2 Mixture Models

Mixture models are models in which the probability distribution of a data point is defined by a weighted sum of conditional likelihoods. We say that a distribution $p$ is a mixture of $K$ component distributions $p_1, p_2, \ldots, p_K$ if

$$p(x) = \sum_k \lambda_k p_k(x)$$

with the $\lambda_k > 0$ being the mixture proportions and $\sum_k \lambda_k = 1$.

2.1 Unobserved Variables

A variable can be unobserved (latent) for a number of reasons. It might be an imaginary quantity meant to provide a simplified and abstract view of the data generation process, as in, for example, speech recognition models. It can also be a real-world object or phenomenon that is difficult or impossible to measure, such as the temperature of a star, the causes of a disease, or evolutionary ancestors. Or it might be a real-world object or phenomenon that was sometimes not measured, due to faulty system components, etc. Latent variables can be either discrete or continuous.
Discrete latent variables can be used to partition or cluster data into sub-groups. Continuous latent variables (factors), on the other hand, can be used for dimensionality reduction (factor analysis, etc.).
2.2 Gaussian Mixture Models

A Gaussian mixture model is a probabilistic model in which we assume all data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. Consider a mixture of $K$ Gaussians,

$$P(X \mid \mu, \Sigma) = \sum_k \pi_k \, \mathcal{N}(X \mid \mu_k, \Sigma_k)$$

where $\pi_k$ is the mixture proportion and $\mathcal{N}(X \mid \mu_k, \Sigma_k)$ is the mixture component. Let's see how this expression is derived. We introduce $Z$ as a latent class indicator vector:

$$P(z_n) = \mathrm{multinomial}(z_n : \pi) = \prod_k (\pi_k)^{z_n^k}$$

We have $X$ as a conditional Gaussian variable with a class-specific mean and covariance:

$$P(x_n \mid z_n^k = 1, \mu, \Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) \right\}$$

Therefore, the likelihood of a sample $x_n$ can be expressed as:

$$P(x_n \mid \mu, \Sigma) = \sum_{z_n} P(z_n, x_n) = \sum_k p(z_n^k = 1 \mid \pi) \, p(x_n \mid z_n^k = 1, \mu, \Sigma) = \sum_{z_n} \prod_k \left[ (\pi_k)^{z_n^k} \mathcal{N}(x_n : \mu_k, \Sigma_k)^{z_n^k} \right] = \sum_k \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$

Now we want to learn the mixture model. In fully observed i.i.d. settings, the log-likelihood can be decomposed into a sum of local terms (at least for directed models). This can be illustrated as:

$$\ell(\theta; D) = \log p(x, z \mid \theta) = \log p(z \mid \theta_z) + \log p(x \mid z, \theta_x)$$

With latent variables, however, all the parameters become coupled together via marginalization:

$$\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z) \, p(x \mid z, \theta_x)$$

Observe that the summation is inside the log, so the log-likelihood does not decompose nicely, and it is hard to find a closed-form solution for the maximum likelihood estimates. We then turn to learning the mixture densities using gradient ascent on the log-likelihood, with:

$$\ell(\theta) = \log p(x \mid \theta) = \log \sum_k \pi_k p_k(x \mid \theta_k)$$
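As a concrete instance of the marginal likelihood above, the following sketch (plain Python, univariate components; the function names are ours, not from the notes) evaluates $\log p(D) = \sum_n \log \sum_k \pi_k \mathcal{N}(x_n \mid \mu_k, \sigma_k^2)$:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian N(x | mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_log_likelihood(xs, pis, mus, sigmas):
    """log p(D) = sum_n log sum_k pi_k N(x_n | mu_k, sigma_k^2).
    Note the sum over k sits inside the log, as in the text."""
    return sum(math.log(sum(p * normal_pdf(x, m, s)
                            for p, m, s in zip(pis, mus, sigmas)))
               for x in xs)
```

Because the sum over components sits inside the logarithm, this expression does not split into per-component terms, which is exactly why no closed-form MLE exists.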
[Figure 1: Mixture Model Learning with Latent Variables]

Taking the derivative with respect to $\theta_k$ we have:

$$\frac{\partial \ell}{\partial \theta_k} = \frac{1}{p(x \mid \theta)} \frac{\partial}{\partial \theta_k} \sum_{k'} \pi_{k'} p_{k'}(x \mid \theta_{k'}) = \frac{\pi_k}{p(x \mid \theta)} \frac{\partial p_k(x \mid \theta_k)}{\partial \theta_k} = \frac{\pi_k p_k(x \mid \theta_k)}{p(x \mid \theta)} \frac{\partial \log p_k(x \mid \theta_k)}{\partial \theta_k} = r_k \frac{\partial \ell_k}{\partial \theta_k}$$

where $r_k = \pi_k p_k(x \mid \theta_k) / p(x \mid \theta)$ is the responsibility of component $k$. Thus, the gradient is the responsibility-weighted sum of the individual log-likelihood gradients. This can be passed to a conjugate gradient routine.

3 Parameter Constraints

As we can see above, mixture densities with latent variables are not easy to learn. When using a gradient approach because no closed-form solution can be derived, one has less control over the conditions on the parameters. To mitigate this, one can impose constraints on the parameters. Often we have constraints such as $\sum_k \pi_k = 1$, and $\Sigma_k$ being symmetric positive definite (hence $\Sigma_{k,ii} > 0$). We can either use constrained optimization, or we can re-parametrize in terms of unconstrained values. For normalized weights, one can use the softmax transform:

$$\pi_k = \frac{\exp(\gamma_k)}{\sum_j \exp(\gamma_j)}$$

For covariance matrices, one can use the Cholesky decomposition

$$\Sigma^{-1} = A^T A$$

where $A$ is upper triangular with positive diagonal:

$$A_{ii} = \exp(\lambda_i) > 0; \quad A_{ij} = \eta_{ij} \ (j > i); \quad A_{ij} = 0 \ (j < i)$$

where the parameters $\lambda_i$ and $\eta_{ij}$ are unconstrained. One can then use the chain rule to compute $\frac{\partial \ell}{\partial \pi}$ and $\frac{\partial \ell}{\partial A}$.
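Both re-parametrizations can be sketched in a few lines. This is an illustrative 2x2 instance with our own function names; the point is only that any unconstrained $(\gamma, \lambda, \eta)$ map to a valid simplex and a symmetric positive definite precision matrix:

```python
import math

def softmax(gammas):
    """Map unconstrained gammas to mixture weights: positive, summing to 1."""
    m = max(gammas)                      # subtract max for numerical stability
    exps = [math.exp(g - m) for g in gammas]
    z = sum(exps)
    return [e / z for e in exps]

def precision_from_cholesky(lams, etas):
    """Build Sigma^{-1} = A^T A for a 2x2 case, with A upper triangular,
    A_ii = exp(lam_i) > 0 and A_12 = eta (unconstrained)."""
    a11, a22 = math.exp(lams[0]), math.exp(lams[1])
    a12 = etas[0]
    # A = [[a11, a12], [0, a22]];  A^T A is symmetric positive definite
    return [[a11 * a11, a11 * a12],
            [a11 * a12, a12 * a12 + a22 * a22]]
```

A gradient routine can then optimize $(\gamma, \lambda, \eta)$ freely, with the constraints enforced automatically by the transforms.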
4 Identifiability

Identifiability is a property that a model has to satisfy in order to do precise inference. A mixture model induces a multi-modal likelihood; hence, gradient ascent can only find a local maximum. Mixture models are unidentifiable, since we can always switch the hidden labels without affecting the likelihood. Therefore, we should be careful in trying to interpret the meaning of latent variables.

5 Expectation-Maximization Algorithm

We will continue to use the Gaussian mixture model to demonstrate the expectation-maximization algorithm. If we assume that all variables are observed, we can learn the parameters simply by using MLE. The data log-likelihood function is:

$$\ell(\theta; D) = \sum_n \log p(z_n, x_n) = \sum_n \log p(z_n \mid \pi) \, p(x_n \mid z_n, \mu, \sigma)$$
$$= \sum_n \sum_k z_n^k \log \pi_k + \sum_n \sum_k z_n^k \log \mathcal{N}(x_n; \mu_k, \sigma)$$
$$= \sum_n \sum_k z_n^k \log \pi_k - \sum_n \sum_k z_n^k \frac{1}{2\sigma^2} (x_n - \mu_k)^2 + C$$

$C$ is some constant that we do not bother calculating, since maximizing the log-likelihood is independent of any additive constant. Our maximum likelihood estimates of the parameters are then:

$$\hat{\pi}_{k,\mathrm{MLE}} = \arg\max_\pi \ell(\theta; D), \quad \hat{\mu}_{k,\mathrm{MLE}} = \arg\max_\mu \ell(\theta; D), \quad \hat{\sigma}_{k,\mathrm{MLE}} = \arg\max_\sigma \ell(\theta; D)$$

We can easily put these functions into an optimization algorithm, or, in the case of $\hat{\mu}_k$, obtain a simple closed form:

$$\hat{\mu}_{k,\mathrm{MLE}} = \frac{\sum_n z_n^k x_n}{\sum_n z_n^k}$$

However, the above calculations break down in the case of unobserved $z_n$. For example, it is not possible to use the formula for $\hat{\mu}_{k,\mathrm{MLE}}$, since we do not know what values of $z_n^k$ to use.

5.1 K-means

We can obtain some insight into solving this problem by considering a related algorithm: K-means. The K-means algorithm finds clusters of data points as follows:

1. Initialize $K$ points ("centers") randomly.
2. Assign each data point to the closest center.
3. Move each center to the mean of all data points assigned to it.
4. Repeat from 2. until converged.
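The four steps above translate directly into code. A minimal 1-D sketch (our own naming; a production version would use vectors and a convergence test rather than a fixed iteration count):

```python
def kmeans(points, centers, iters=20):
    """Plain 1-D K-means: assign each point to its closest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iters):
        # Step 2: hard assignment to the closest center
        clusters = [[] for _ in centers]
        for x in points:
            k = min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)
            clusters[k].append(x)
        # Step 3: move each center to the mean of its cluster
        # (keep the old center if a cluster ends up empty)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers
```

The hard 0/1 assignment in step 2 is exactly what EM will later soften into responsibilities.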
We can view the K-means problem as a graphical model, where the hidden variable is the center each data point is assigned to. More formally, and writing the above in a form that is closer to our Gaussian mixture model, we get:

$$z_n^{(t)} = \arg\min_k \, (x_n - \mu_k^{(t)})^T \Sigma_k^{-1(t)} (x_n - \mu_k^{(t)})$$

$$\mu_k^{(t+1)} = \frac{\sum_n \delta(z_n^{(t)}, k) \, x_n}{\sum_n \delta(z_n^{(t)}, k)}$$

Here, $z_n$ are the centers assigned to each data point, and $\mu_k$ are the centers.

5.2 The EM Algorithm

Moving closer to the Gaussian mixture model, what if instead of using the means as centers, we use Gaussian distributions? This way, each center is associated with a covariance matrix as well as a mean. Now, instead of assigning each data point to the closest center, we can do a soft assignment to all centers. For example, a data point $p$ can be assigned $\frac{1}{2}$ to center $a$, $\frac{1}{4}$ to center $b$, and $\frac{1}{4}$ to center $c$. This essentially gives us the EM algorithm for Gaussian mixture models, which iteratively improves the expected complete log-likelihood:

$$\langle \ell_c(\theta; x, z) \rangle = \langle \log p(z \mid \pi) \rangle_{p(z \mid x)} + \langle \log p(x \mid z, \mu, \Sigma) \rangle_{p(z \mid x)}$$
$$= \sum_n \sum_k \langle z_n^k \rangle \log \pi_k - \frac{1}{2} \sum_n \sum_k \langle z_n^k \rangle \left( (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) + \log |\Sigma_k| + C \right)$$

The E step. During the expectation step, we compute the expected values of the sufficient statistics of the hidden variables, given the current estimates of the parameters. In the case of Gaussian mixture models, we compute the following:

$$\tau_n^{k(t)} = \langle z_n^k \rangle_{q^{(t)}} = p\left(z_n^k = 1 \mid x_n, \mu^{(t)}, \Sigma^{(t)}\right) = \frac{\pi_k^{(t)} \mathcal{N}\left(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)}\right)}{\sum_i \pi_i^{(t)} \mathcal{N}\left(x_n \mid \mu_i^{(t)}, \Sigma_i^{(t)}\right)}$$

The E step is essentially performing inference with an estimate of the parameters. In K-means, we calculate the hard assignments $z_n^{(t)}$, while in the EM algorithm, we calculate the expected value of a sufficient statistic $\tau_n^{k(t)}$. In the case of Gaussian mixture models, $\tau_n^{k(t)} = \langle z_n^k \rangle_{q^{(t)}}$.

The M step. During the M step, we estimate the parameters with MLE, using the sufficient statistics of the hidden variables we calculated in the E step. In the case of Gaussian mixture models, the sufficient statistic is the expected value of the hidden variables.
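The E-step responsibility formula can be transcribed directly. A univariate sketch (function names are ours):

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian N(x | mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def responsibilities(x, pis, mus, sigmas):
    """E step for one point:
    tau_k = pi_k N(x | mu_k, sigma_k^2) / sum_i pi_i N(x | mu_i, sigma_i^2)."""
    unnorm = [p * normal_pdf(x, m, s) for p, m, s in zip(pis, mus, sigmas)]
    z = sum(unnorm)
    return [u / z for u in unnorm]
```

Each point receives a soft assignment over all components that sums to one, in contrast with K-means' hard assignment.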
$$\pi_k = \arg\max_\pi \langle \ell_c(\theta) \rangle \;\Rightarrow\; \pi_k^{(t+1)} = \frac{\sum_n \langle z_n^k \rangle_{q^{(t)}}}{N} = \frac{\sum_n \tau_n^{k(t)}}{N}$$

$$\mu_k = \arg\max_\mu \langle \ell_c(\theta) \rangle \;\Rightarrow\; \mu_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} x_n}{\sum_n \tau_n^{k(t)}}$$
$$\Sigma_k = \arg\max_\Sigma \langle \ell_c(\theta) \rangle \;\Rightarrow\; \Sigma_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} \left(x_n - \mu_k^{(t+1)}\right) \left(x_n - \mu_k^{(t+1)}\right)^T}{\sum_n \tau_n^{k(t)}}$$

In K-means, $\mu_k$ is calculated as a weighted sum where each weight is 0 or 1, while in EM, it is a proper weighted sum.

5.3 Theory of the EM Algorithm

Supposing that the latent variables $z$ could be observed, we define the complete log-likelihood as

$$\ell_c(\theta; x, z) = \log p(x, z \mid \theta)$$

With observed $z$, the above can usually be optimized easily: $p(x, z \mid \theta)$ is the product of many factors, so the log-likelihood function decomposes into a sum of many terms, each of which can be optimized separately. However, $z$ is not observed, so the log-likelihood cannot be optimized in this way. The likelihood function we are really trying to optimize is the log of a marginal probability, called the incomplete log-likelihood:

$$\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)$$

Unfortunately, this objective function has a sum inside of a log, which does not decouple as nicely. Instead, we will optimize the following expected complete log-likelihood:

$$\langle \ell_c(\theta; x, z) \rangle_q = \sum_z q(z \mid x, \theta) \log p(x, z \mid \theta)$$

where $q$ is an arbitrary distribution. Notice now that the $\log p(x, z \mid \theta)$ inside the sum, like the complete log-likelihood function, factors nicely. So the expected complete log-likelihood factors, and is easier to optimize. However, it is not clear whether maximizing this function actually maximizes the incomplete log-likelihood. As a first step, we can easily show that the expected complete log-likelihood is a lower bound for the incomplete log-likelihood:

$$\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta) = \log \sum_z q(z \mid x) \frac{p(x, z \mid \theta)}{q(z \mid x)}$$

Using Jensen's inequality,

$$\geq \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} = \sum_z q(z \mid x) \log p(x, z \mid \theta) - \sum_z q(z \mid x) \log q(z \mid x) = \langle \ell_c(\theta; x, z) \rangle_q + H_q$$

Note that the above is the sum of the expected complete log-likelihood and the entropy of the distribution $q$.
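The lower bound can be checked numerically on a tiny discrete model. The joint table below is an arbitrary illustrative choice (not from the notes): $z$ takes two values, and for the single observed $x$ we tabulate $p(x, z \mid \theta)$. The bound holds for any $q$, with equality when $q(z) = p(z \mid x, \theta)$:

```python
import math

# Toy joint p(x, z | theta) for the observed x (illustrative numbers)
p_joint = {0: 0.3, 1: 0.2}
incomplete_ll = math.log(sum(p_joint.values()))   # l(theta; x) = log sum_z p(x, z)

def lower_bound(q):
    """<l_c>_q + H_q = sum_z q(z) log [p(x, z) / q(z)]  (terms with q(z)=0 vanish)."""
    return sum(q[z] * math.log(p_joint[z] / q[z]) for z in q if q[z] > 0)

# Posterior q(z) = p(x, z) / p(x) attains the bound with equality
posterior = {z: p / sum(p_joint.values()) for z, p in p_joint.items()}
```

Any other $q$ leaves a gap equal to the KL divergence between $q$ and the posterior.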
As Coordinate Ascent

The EM algorithm is also, in some sense, a coordinate ascent algorithm. To see this, define the following functional for fixed data $x$, called the free energy:

$$F(q, \theta) = \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \leq \ell(\theta; x)$$

The EM algorithm is then

E step: $q^{t+1} = \arg\max_q F(q, \theta^t)$
M step: $\theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)$

In performing the E step, we obtain the posterior distribution of the latent variables given the data and the parameters, i.e.

$$q^{t+1} = \arg\max_q F(q, \theta^t) = p(z \mid x, \theta^t)$$

To prove this, we just have to show that this value of $q^{t+1}$ attains the bound $\ell(\theta^t; x) \geq F(q, \theta^t)$:

$$F(p(z \mid x, \theta^t), \theta^t) = \sum_z p(z \mid x, \theta^t) \log \frac{p(x, z \mid \theta^t)}{p(z \mid x, \theta^t)} = \sum_z p(z \mid x, \theta^t) \log p(x \mid \theta^t) = \log p(x \mid \theta^t) = \ell(\theta^t; x)$$

Without loss of generality, assume that $p(x, z \mid \theta)$ is a generalized exponential family distribution, so that

$$p(x, z \mid \theta) = \frac{1}{Z(\theta)} h(x, z) \exp\left\{ \sum_i \theta_i f_i(x, z) \right\}$$

Then the expected complete log-likelihood under $q^{t+1} = p(z \mid x, \theta^t)$ is

$$\langle \ell_c(\theta; x, z) \rangle_{q^{t+1}} = \sum_z q(z \mid x, \theta^t) \log p(x, z \mid \theta) = \sum_i \theta_i \langle f_i(x, z) \rangle_{q(z \mid x, \theta^t)} - A(\theta)$$

In the special case of $p$ being a generalized linear model,

$$= \sum_i \theta_i \langle \eta_i(z) \rangle_{q(z \mid x, \theta^t)} \, \xi_i(x) - A(\theta)$$

Let's now take a look at the M step, and remember that the free energy is the sum of two terms, $F(q, \theta) = \langle \ell_c(\theta; x, z) \rangle_q + H_q$. The second term is constant when maximizing with respect to $\theta$, so we need only consider the first term:

$$\theta^{t+1} = \arg\max_\theta \sum_z q(z \mid x) \log p(x, z \mid \theta)$$

When $q$ is optimal, this is the same as MLE of the fully observed $p(x, z \mid \theta)$, but instead of the sufficient statistics for $z$, their expectations with respect to $p(z \mid x, \theta^t)$ are used.
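Putting the two coordinate-ascent steps together gives a complete, runnable EM loop for a Gaussian mixture. A minimal univariate sketch with two components; the names, iteration count, and the small variance floor (which guards against a component collapsing onto a single point) are our own choices:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_gmm(xs, pis, mus, sigmas, iters=50):
    """EM for a 1-D Gaussian mixture: the E step computes responsibilities tau,
    the M step re-estimates (pi, mu, sigma) from tau-weighted sums."""
    n, K = len(xs), len(pis)
    for _ in range(iters):
        # E step: tau[i][k] = p(z_k = 1 | x_i) under current parameters
        tau = []
        for x in xs:
            u = [p * normal_pdf(x, m, s) for p, m, s in zip(pis, mus, sigmas)]
            z = sum(u)
            tau.append([v / z for v in u])
        # M step: responsibility-weighted MLEs
        nk = [sum(t[k] for t in tau) for k in range(K)]
        pis = [nk[k] / n for k in range(K)]
        mus = [sum(t[k] * x for t, x in zip(tau, xs)) / nk[k] for k in range(K)]
        sigmas = [max(1e-6, math.sqrt(sum(t[k] * (x - mus[k]) ** 2
                                          for t, x in zip(tau, xs)) / nk[k]))
                  for k in range(K)]
    return pis, mus, sigmas
```

On two well-separated clusters, the means converge to the cluster centers and the mixing weights to the cluster fractions, each iteration improving the likelihood as the theory guarantees.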
6 Example: Hidden Markov Model

With the EM algorithm, it becomes very straightforward for us to solve some simple graphical models with hidden variables. Let's take the Hidden Markov Model (HMM) as an example. Generally, the learning paradigms for an HMM are classified into the following two categories:

Supervised learning: we estimate the model parameters when the label information is provided, i.e., the right answer is known. For example, we learn from a genomic region $x = x_1, x_2, \ldots, x_N$ where we have good annotations of the CpG islands.

Unsupervised learning: we estimate the model without label information. For instance, the porcupine genome: we don't know how frequent the CpG islands are there, nor do we know their composition, but we still want to learn from the data.

Our goal is to learn from the data using maximum likelihood estimation (MLE). Specifically, we update the model parameters $\theta$ by maximizing $p(x \mid \theta)$. We now introduce the Baum-Welch algorithm, which is a variant of the EM algorithm for learning HMMs. Recall that the complete log-likelihood of an HMM is written as follows:

$$\ell_c(\theta; x, y) = \log p(x, y) = \log \prod_n \left( p(y_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^{T} p(x_{n,t} \mid y_{n,t}) \right) \qquad (1)$$

Its expectation can be further written as:

$$\langle \ell_c(\theta; x, y) \rangle = \sum_n \sum_i \langle y_{n,1}^i \rangle_{p(y_{n,1} \mid x_n)} \log \pi_i + \sum_n \sum_{t=2}^{T} \sum_{i,j} \langle y_{n,t-1}^i y_{n,t}^j \rangle_{p(y_{n,t-1}, y_{n,t} \mid x_n)} \log a_{i,j} + \sum_n \sum_{t=1}^{T} \sum_{i,k} x_{n,t}^k \langle y_{n,t}^i \rangle_{p(y_{n,t} \mid x_n)} \log b_{i,k} \qquad (2)$$

where we denote $A = \{a_{i,j}\} = p(y_t^j = 1 \mid y_{t-1}^i = 1)$ as the transition matrix and $B = \{b_{i,k}\} = p(x_t^k = 1 \mid y_t^i = 1)$ as the emission matrix. Accordingly, we derive the EM algorithm for HMMs as follows.

The E step:

$$\gamma_{n,t}^i = \langle y_{n,t}^i \rangle = p(y_{n,t}^i = 1 \mid x_n), \qquad \xi_{n,t}^{i,j} = \langle y_{n,t-1}^i y_{n,t}^j \rangle = p(y_{n,t-1}^i = 1, y_{n,t}^j = 1 \mid x_n) \qquad (3)$$

The M step:

$$\pi_i^{ML} = \frac{\sum_n \gamma_{n,1}^i}{N}, \qquad a_{ij}^{ML} = \frac{\sum_n \sum_{t=2}^{T} \xi_{n,t}^{i,j}}{\sum_n \sum_{t=1}^{T-1} \gamma_{n,t}^i}, \qquad b_{ik}^{ML} = \frac{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i x_{n,t}^k}{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i} \qquad (4)$$
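The M-step formulas in (4) can be transcribed directly once the E-step quantities $\gamma$ and $\xi$ are available (computed by forward-backward, which we do not reimplement here). A sketch for a single sequence; the function name and input layout are our own, and we assume every state has nonzero expected occupancy so all denominators are positive:

```python
def hmm_m_step(gammas, xis, obs, n_states, n_symbols):
    """Baum-Welch M step for one sequence.
    gammas[t][i]  = p(y_t = i | x)               for t = 0..T-1
    xis[t][i][j]  = p(y_t = i, y_{t+1} = j | x)  for t = 0..T-2
    obs[t]        = observed symbol index at time t."""
    T = len(obs)
    pi = list(gammas[0])                                  # initial-state probabilities
    # a_ij: expected i->j transitions over expected visits to i (t < T-1)
    A = [[sum(xis[t][i][j] for t in range(T - 1)) /
          sum(gammas[t][i] for t in range(T - 1))
          for j in range(n_states)] for i in range(n_states)]
    # b_ik: expected emissions of symbol k from state i over expected visits to i
    B = [[sum(gammas[t][i] for t in range(T) if obs[t] == k) /
          sum(gammas[t][i] for t in range(T))
          for k in range(n_symbols)] for i in range(n_states)]
    return pi, A, B
```

Each row of the resulting $A$ and $B$ is a proper probability distribution, since the numerators partition the corresponding denominator.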
In the unsupervised learning setting, we are given $x = x^1, \ldots, x^N$ while the state paths $y = y^1, \ldots, y^N$ are not known. The EM algorithm proceeds accordingly:

1. Start with the best guess of the parameters $\theta$ for the model.
2. Estimate the expected transition and emission counts $A_{ij}$, $B_{ik}$ in the training data:

$$A_{ij} = \sum_{n,t} \langle y_{n,t-1}^i y_{n,t}^j \rangle, \qquad B_{ik} = \sum_{n,t} \langle y_{n,t}^i \rangle x_{n,t}^k \qquad (5)$$

3. Update $\theta$ according to $A_{ij}$, $B_{ik}$. The problem now becomes supervised learning.
4. Repeat steps 2 and 3 until convergence.

The above algorithm is called the Baum-Welch algorithm.

7 EM for general BNs

Now we demonstrate the EM algorithm for general Bayesian networks:

    while not converged:
        for each node i:
            ESS_i = 0
        for each data sample n:
            do inference with x_{n,H}        # fill in the hidden variables
            for each node i:
                ESS_i += <SS_i(x_{n,i}, x_{n,pi_i})>_{p(x_{n,H} | x_n)}
        for each node i:
            theta_i = MLE(ESS_i)

8 Summary: EM algorithm

In summary, the EM algorithm can be regarded as a way of maximizing the likelihood function for latent variable models. It finds the MLE of the parameters when the original (hard) problem can be broken up into two (easy) pieces:

1. Estimate some missing or unobserved data from the observed data and current parameters.
2. Using this "complete" data, find the maximum likelihood parameter estimates.

The EM algorithm alternates between filling in the latent variables using the best guess and updating the parameters based on this guess. Specifically, in the M-step we optimize a lower bound on the likelihood. In the E-step we close the gap, making the bound equal to the likelihood.
9 Conditional Mixture Model and EM Algorithm

Consider the following situation: we will model $p(Y \mid X)$ using different experts, each responsible for different regions of the input space. We have a latent variable $Z$, which chooses an expert using a softmax gating function:

$$p(z^k = 1 \mid x) = \mathrm{Softmax}(\xi_k^T x) \qquad (6)$$

Each expert can be a linear regression model (or a more complex model):

$$p(y \mid x, z^k = 1) = \mathcal{N}(y; \theta_k^T x, \sigma_k^2) \qquad (7)$$

The posterior expert responsibilities are:

$$p(z^k = 1 \mid x, y, \theta) = \frac{p(z^k = 1 \mid x) \, p_k(y \mid x, \theta_k, \sigma_k^2)}{\sum_j p(z^j = 1 \mid x) \, p_j(y \mid x, \theta_j, \sigma_j^2)} \qquad (8)$$

Now we derive the EM algorithm for the conditional mixture model. The objective function is as follows:

$$\langle \ell_c(\theta; x, y, z) \rangle = \langle \log p(z \mid x, \xi) \rangle_{p(z \mid x, y)} + \langle \log p(y \mid x, z, \theta, \sigma) \rangle_{p(z \mid x, y)}$$
$$= \sum_n \sum_k \langle z_n^k \rangle \log \mathrm{Softmax}_k(\xi^T x_n) - \frac{1}{2} \sum_n \sum_k \langle z_n^k \rangle \left( \frac{(y_n - \theta_k^T x_n)^2}{\sigma_k^2} + \log \sigma_k^2 + C \right) \qquad (9)$$

The EM algorithm:

E-step:

$$\tau_n^{k(t)} = p(z_n^k = 1 \mid x_n, y_n, \theta) = \frac{p(z_n^k = 1 \mid x_n) \, p_k(y_n \mid x_n, \theta_k, \sigma_k^2)}{\sum_j p(z_n^j = 1 \mid x_n) \, p_j(y_n \mid x_n, \theta_j, \sigma_j^2)} \qquad (10)$$

M-step: we use the normal equations for standard linear regression, $\theta = (X^T X)^{-1} X^T Y$, but with the data reweighted by $\tau$. We can also use IRLS to update $\{\xi_k, \theta_k, \sigma_k\}$ based on the data pairs $(x_n, y_n)$, with weights $\tau_n^{k(t)}$.

10 Hierarchical Mixture of Experts

The hierarchical mixture of experts is like a soft version of a depth-2 classification/regression tree. In the model, $P(Y \mid X, G_1, G_2)$ can be modeled as a GLIM, with parameters dependent on the values of $G_1$ and $G_2$, which specify a conditional path to a given leaf in the tree.

11 Mixture of Overlapping Experts

Different from the typical mixture of experts, by removing the link between $X$ and $Z$ in the graphical model of the mixture of experts, we can make the partitions independent of the input, thus allowing overlapping. Accordingly, this leads to a mixture of linear regressors, in which each subpopulation has a different conditional mean:

$$p(z^k = 1 \mid x, y, \theta) = \frac{p(z^k = 1) \, p_k(y \mid x, \theta_k, \sigma_k^2)}{\sum_j p(z^j = 1) \, p_j(y \mid x, \theta_j, \sigma_j^2)} \qquad (11)$$
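The reweighted normal equations of the M-step can be sketched in one dimension (a hypothetical no-intercept model $y \approx \theta x$; the function name is ours, and a real implementation would solve the full matrix system):

```python
def weighted_theta(xs, ys, ws):
    """Responsibility-weighted normal equations for a 1-D linear expert
    y ~ theta * x:  theta = (sum_n w_n x_n^2)^{-1} (sum_n w_n x_n y_n)."""
    sxx = sum(w * x * x for w, x in zip(ws, xs))
    sxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    return sxy / sxx
```

With uniform weights this is ordinary least squares; with the E-step responsibilities $\tau_n^{k(t)}$ as weights, each expert $k$ fits only the data it is responsible for.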
12 Partially Hidden Data

We can also learn the model parameters when there are missing (hidden) variables in only some of the cases. In this case, the cost function becomes:

$$\ell(\theta; D) = \sum_{n \in \text{Complete}} \log p(x_n, y_n \mid \theta) + \sum_{m \in \text{Missing}} \log \sum_y p(x_m, y \mid \theta) \qquad (12)$$

Now we can think of this in a new way: in the E-step we estimate the hidden variables on the incomplete cases only. Then, in the M-step we optimize the log-likelihood on the complete data plus the expected log-likelihood on the incomplete data from the E-step.

13 EM Variants

In this section, we briefly introduce two variants of the basic EM algorithm.

The first one is sparse EM. It does not exactly re-compute the posterior probability on each data point under all models, because it is almost zero for most of them. Instead, it keeps an "active list" which is updated every once in a while.

The second one is generalized (incomplete) EM. Sometimes we may find it very difficult to find the ML parameters in the M-step, even given the completed data. However, we can still make progress by doing an M-step that improves the likelihood a bit. This is in the line of the IRLS step in the mixture-of-experts model.

14 A Report Card for EM

EM is one of the most popular methods to obtain the MLE or MAP estimate for a model with latent variables. The advantages of EM are as follows:

- no learning rate (step-size) parameter
- automatically enforces parameter constraints
- very fast for low dimensions
- each iteration guaranteed to improve the likelihood

The disadvantages of EM are:

- can get stuck in local optima
- can be slower than conjugate gradient (especially near convergence)
- requires an expensive inference step
- is a maximum likelihood/MAP method

There are also modern developments beyond traditional EM, such as convex relaxation and direct non-convex optimization.
More informationREGRESSION (Physics 1210 Notes, Partial Modified Appendix A)
REGRESSION (Physics 0 Notes, Partial Modified Appedix A) HOW TO PERFORM A LINEAR REGRESSION Cosider the followig data poits ad their graph (Table I ad Figure ): X Y 0 3 5 3 7 4 9 5 Table : Example Data
More informationMachine Learning for Data Science (CS 4786)
Machie Learig for Data Sciece CS 4786) Lecture 9: Pricipal Compoet Aalysis The text i black outlies mai ideas to retai from the lecture. The text i blue give a deeper uderstadig of how we derive or get
More informationADVANCED SOFTWARE ENGINEERING
ADVANCED SOFTWARE ENGINEERING COMP 3705 Exercise Usage-based Testig ad Reliability Versio 1.0-040406 Departmet of Computer Ssciece Sada Narayaappa, Aeliese Adrews Versio 1.1-050405 Departmet of Commuicatio
More informationStatistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample.
Statistical Iferece (Chapter 10) Statistical iferece = lear about a populatio based o the iformatio provided by a sample. Populatio: The set of all values of a radom variable X of iterest. Characterized
More informationElement sampling: Part 2
Chapter 4 Elemet samplig: Part 2 4.1 Itroductio We ow cosider uequal probability samplig desigs which is very popular i practice. I the uequal probability samplig, we ca improve the efficiecy of the resultig
More informationDistributional Similarity Models (cont.)
Distributioal Similarity Models (cot.) Regia Barzilay EECS Departmet MIT October 19, 2004 Sematic Similarity Vector Space Model Similarity Measures cosie Euclidea distace... Clusterig k-meas hierarchical
More informationDistributional Similarity Models (cont.)
Sematic Similarity Vector Space Model Similarity Measures cosie Euclidea distace... Clusterig k-meas hierarchical Last Time EM Clusterig Soft versio of K-meas clusterig Iput: m dimesioal objects X = {
More informationChapter 2: Numerical Methods
Chapter : Numerical Methods. Some Numerical Methods for st Order ODEs I this sectio, a summar of essetial features of umerical methods related to solutios of ordiar differetial equatios is give. I geeral,
More informationStatistical Inference Based on Extremum Estimators
T. Rotheberg Fall, 2007 Statistical Iferece Based o Extremum Estimators Itroductio Suppose 0, the true value of a p-dimesioal parameter, is kow to lie i some subset S R p : Ofte we choose to estimate 0
More informationLecture 2 October 11
Itroductio to probabilistic graphical models 203/204 Lecture 2 October Lecturer: Guillaume Oboziski Scribes: Aymeric Reshef, Claire Verade Course webpage: http://www.di.es.fr/~fbach/courses/fall203/ 2.
More informationLecture 8: Solving the Heat, Laplace and Wave equations using finite difference methods
Itroductory lecture otes o Partial Differetial Equatios - c Athoy Peirce. Not to be copied, used, or revised without explicit writte permissio from the copyright ower. 1 Lecture 8: Solvig the Heat, Laplace
More informationChapter 2 The Monte Carlo Method
Chapter 2 The Mote Carlo Method The Mote Carlo Method stads for a broad class of computatioal algorithms that rely o radom sampligs. It is ofte used i physical ad mathematical problems ad is most useful
More informationRecursive Algorithms. Recurrences. Recursive Algorithms Analysis
Recursive Algorithms Recurreces Computer Sciece & Egieerig 35: Discrete Mathematics Christopher M Bourke cbourke@cseuledu A recursive algorithm is oe i which objects are defied i terms of other objects
More informationSupport vector machine revisited
6.867 Machie learig, lecture 8 (Jaakkola) 1 Lecture topics: Support vector machie ad kerels Kerel optimizatio, selectio Support vector machie revisited Our task here is to first tur the support vector
More informationKinetics of Complex Reactions
Kietics of Complex Reactios by Flick Colema Departmet of Chemistry Wellesley College Wellesley MA 28 wcolema@wellesley.edu Copyright Flick Colema 996. All rights reserved. You are welcome to use this documet
More informationThis exam contains 19 pages (including this cover page) and 10 questions. A Formulae sheet is provided with the exam.
Probability ad Statistics FS 07 Secod Sessio Exam 09.0.08 Time Limit: 80 Miutes Name: Studet ID: This exam cotais 9 pages (icludig this cover page) ad 0 questios. A Formulae sheet is provided with the
More informationCS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 5
CS434a/54a: Patter Recogitio Prof. Olga Veksler Lecture 5 Today Itroductio to parameter estimatio Two methods for parameter estimatio Maimum Likelihood Estimatio Bayesia Estimatio Itroducto Bayesia Decisio
More informationRoberto s Notes on Series Chapter 2: Convergence tests Section 7. Alternating series
Roberto s Notes o Series Chapter 2: Covergece tests Sectio 7 Alteratig series What you eed to kow already: All basic covergece tests for evetually positive series. What you ca lear here: A test for series
More informationLecture 11 and 12: Basic estimation theory
Lecture ad 2: Basic estimatio theory Sprig 202 - EE 94 Networked estimatio ad cotrol Prof. Kha March 2 202 I. MAXIMUM-LIKELIHOOD ESTIMATORS The maximum likelihood priciple is deceptively simple. Louis
More information1 Covariance Estimation
Eco 75 Lecture 5 Covariace Estimatio ad Optimal Weightig Matrices I this lecture, we cosider estimatio of the asymptotic covariace matrix B B of the extremum estimator b : Covariace Estimatio Lemma 4.
More informationDifferentiable Convex Functions
Differetiable Covex Fuctios The followig picture motivates Theorem 11. f ( x) f ( x) f '( x)( x x) ˆx x 1 Theorem 11 : Let f : R R be differetiable. The, f is covex o the covex set C R if, ad oly if for
More informationMachine Learning Brett Bernstein
Machie Learig Brett Berstei Week Lecture: Cocept Check Exercises Starred problems are optioal. Statistical Learig Theory. Suppose A = Y = R ad X is some other set. Furthermore, assume P X Y is a discrete
More informationREGRESSION WITH QUADRATIC LOSS
REGRESSION WITH QUADRATIC LOSS MAXIM RAGINSKY Regressio with quadratic loss is aother basic problem studied i statistical learig theory. We have a radom couple Z = X, Y ), where, as before, X is a R d
More information1.010 Uncertainty in Engineering Fall 2008
MIT OpeCourseWare http://ocw.mit.edu.00 Ucertaity i Egieerig Fall 2008 For iformatio about citig these materials or our Terms of Use, visit: http://ocw.mit.edu.terms. .00 - Brief Notes # 9 Poit ad Iterval
More informationLecture 7: October 18, 2017
Iformatio ad Codig Theory Autum 207 Lecturer: Madhur Tulsiai Lecture 7: October 8, 207 Biary hypothesis testig I this lecture, we apply the tools developed i the past few lectures to uderstad the problem
More information6.883: Online Methods in Machine Learning Alexander Rakhlin
6.883: Olie Methods i Machie Learig Alexader Rakhli LECTURES 5 AND 6. THE EXPERTS SETTING. EXPONENTIAL WEIGHTS All the algorithms preseted so far halluciate the future values as radom draws ad the perform
More informationLecture 9: September 19
36-700: Probability ad Mathematical Statistics I Fall 206 Lecturer: Siva Balakrisha Lecture 9: September 9 9. Review ad Outlie Last class we discussed: Statistical estimatio broadly Pot estimatio Bias-Variace
More information10/2/ , 5.9, Jacob Hays Amit Pillay James DeFelice
0//008 Liear Discrimiat Fuctios Jacob Hays Amit Pillay James DeFelice 5.8, 5.9, 5. Miimum Squared Error Previous methods oly worked o liear separable cases, by lookig at misclassified samples to correct
More informationDS 100: Principles and Techniques of Data Science Date: April 13, Discussion #10
DS 00: Priciples ad Techiques of Data Sciece Date: April 3, 208 Name: Hypothesis Testig Discussio #0. Defie these terms below as they relate to hypothesis testig. a) Data Geeratio Model: Solutio: A set
More informationECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015
ECE 8527: Itroductio to Machie Learig ad Patter Recogitio Midterm # 1 Vaishali Ami Fall, 2015 tue39624@temple.edu Problem No. 1: Cosider a two-class discrete distributio problem: ω 1 :{[0,0], [2,0], [2,2],
More informationVector Quantization: a Limiting Case of EM
. Itroductio & defiitios Assume that you are give a data set X = { x j }, j { 2,,, }, of d -dimesioal vectors. The vector quatizatio (VQ) problem requires that we fid a set of prototype vectors Z = { z
More informationMachine Learning Theory Tübingen University, WS 2016/2017 Lecture 12
Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract I this lecture we derive risk bouds for kerel methods. We will start by showig that Soft Margi kerel SVM correspods to miimizig
More informationNUMERICAL METHODS FOR SOLVING EQUATIONS
Mathematics Revisio Guides Numerical Methods for Solvig Equatios Page 1 of 11 M.K. HOME TUITION Mathematics Revisio Guides Level: GCSE Higher Tier NUMERICAL METHODS FOR SOLVING EQUATIONS Versio:. Date:
More informationCS322: Network Analysis. Problem Set 2 - Fall 2009
Due October 9 009 i class CS3: Network Aalysis Problem Set - Fall 009 If you have ay questios regardig the problems set, sed a email to the course assistats: simlac@staford.edu ad peleato@staford.edu.
More informationAlgebra of Least Squares
October 19, 2018 Algebra of Least Squares Geometry of Least Squares Recall that out data is like a table [Y X] where Y collects observatios o the depedet variable Y ad X collects observatios o the k-dimesioal
More informationCHAPTER I: Vector Spaces
CHAPTER I: Vector Spaces Sectio 1: Itroductio ad Examples This first chapter is largely a review of topics you probably saw i your liear algebra course. So why cover it? (1) Not everyoe remembers everythig
More informationLecture 7: Properties of Random Samples
Lecture 7: Properties of Radom Samples 1 Cotiued From Last Class Theorem 1.1. Let X 1, X,...X be a radom sample from a populatio with mea µ ad variace σ
More information17. Joint distributions of extreme order statistics Lehmann 5.1; Ferguson 15
17. Joit distributios of extreme order statistics Lehma 5.1; Ferguso 15 I Example 10., we derived the asymptotic distributio of the maximum from a radom sample from a uiform distributio. We did this usig
More informationLecture 22: Review for Exam 2. 1 Basic Model Assumptions (without Gaussian Noise)
Lecture 22: Review for Exam 2 Basic Model Assumptios (without Gaussia Noise) We model oe cotiuous respose variable Y, as a liear fuctio of p umerical predictors, plus oise: Y = β 0 + β X +... β p X p +
More informationMaximum Likelihood Estimation
Chapter 9 Maximum Likelihood Estimatio 9.1 The Likelihood Fuctio The maximum likelihood estimator is the most widely used estimatio method. This chapter discusses the most importat cocepts behid maximum
More informationHomework Set #3 - Solutions
EE 15 - Applicatios of Covex Optimizatio i Sigal Processig ad Commuicatios Dr. Adre Tkaceko JPL Third Term 11-1 Homework Set #3 - Solutios 1. a) Note that x is closer to x tha to x l i the Euclidea orm
More informationChapter 8: Estimating with Confidence
Chapter 8: Estimatig with Cofidece Sectio 8.2 The Practice of Statistics, 4 th editio For AP* STARNES, YATES, MOORE Chapter 8 Estimatig with Cofidece 8.1 Cofidece Itervals: The Basics 8.2 8.3 Estimatig
More informationx a x a Lecture 2 Series (See Chapter 1 in Boas)
Lecture Series (See Chapter i Boas) A basic ad very powerful (if pedestria, recall we are lazy AD smart) way to solve ay differetial (or itegral) equatio is via a series expasio of the correspodig solutio
More informationSimulation. Two Rule For Inverting A Distribution Function
Simulatio Two Rule For Ivertig A Distributio Fuctio Rule 1. If F(x) = u is costat o a iterval [x 1, x 2 ), the the uiform value u is mapped oto x 2 through the iversio process. Rule 2. If there is a jump
More information